Getting started¶
Jupyter Enterprise Gateway requires Python (Python 3.3 or greater, or Python 2.7) and is intended to be installed on an Apache Spark 2.x cluster.
The following Resource Managers are supported with the Jupyter Enterprise Gateway:
- YARN Resource Manager - Client Mode
- YARN Resource Manager - Cluster Mode
The following kernels have been tested with the Jupyter Enterprise Gateway:
- Python/Apache Spark 2.x with IPython kernel
- Scala 2.11/Apache Spark 2.x with Apache Toree kernel
- R/Apache Spark 2.x with IRkernel
To support Scala kernels, Apache Toree must be installed. To run IPython and R kernels in YARN containers, various packages must be installed on each of the YARN data nodes. The simplest way to provide all the data nodes with the required dependencies is to install Anaconda on all cluster nodes.
To take full advantage of security and user impersonation capabilities, a Kerberized cluster is recommended.
Installing Enterprise Gateway¶
For new users, we highly recommend installing Anaconda. Anaconda conveniently installs Python, the Jupyter Notebook, the IPython kernel and other commonly used packages for scientific computing and data science.
Use the following installation steps:
- Download Anaconda. We recommend downloading Anaconda’s latest Python 3 version (currently Python 3.5).
- Install the version of Anaconda which you downloaded, following the instructions on the download page.
- Install the latest version of Jupyter Enterprise Gateway from PyPI using pip (part of Anaconda) along with its dependencies.
# install from pypi
pip install jupyter_enterprise_gateway
Similarly, you can use pip uninstall jupyter_enterprise_gateway to uninstall Jupyter Enterprise Gateway.
At this point, the Jupyter Enterprise Gateway deployment provides local kernel support which is fully compatible with Jupyter Kernel Gateway.
Using a docker-stacks image¶
You can add Enterprise Gateway to any docker-stacks image by writing a Dockerfile patterned after the following example:
# start from the jupyter image with R, Python, and Scala (Apache Toree) kernels pre-installed
FROM jupyter/all-spark-notebook
# install Jupyter Enterprise Gateway
RUN pip install jupyter_enterprise_gateway
# run Jupyter Enterprise Gateway on container start
EXPOSE 8888
CMD ["jupyter", "enterprisegateway", "--ip=0.0.0.0", "--port=8888"]
You can then build the Docker image and run it as shown below:
docker build -t enterprise-gateway .
docker run -it --rm -p 8888:8888 enterprise-gateway
Enabling Distributed Kernel support¶
To use the full distributed capabilities of Jupyter Enterprise Gateway, a cluster deployment needs a few additional configuration options.
The distributed capabilities are currently based on an Apache Spark cluster that uses YARN as the Resource Manager. The following environment variables must therefore be set to integrate the Apache Spark and YARN components:
- SPARK_HOME: Must point to the Apache Spark installation path
SPARK_HOME=/usr/hdp/current/spark2-client #For HDP distribution
- EG_YARN_ENDPOINT: Must point to the YARN Resource Manager endpoint
EG_YARN_ENDPOINT=http://${YARN_RESOURCE_MANAGER_FQDN}:8088/ws/v1/cluster #Common to YARN deployment
This value can also be specified on the command-line when starting Enterprise Gateway
--EnterpriseGatewayApp.yarn_endpoint=http://${YARN_RESOURCE_MANAGER_FQDN}:8088/ws/v1/cluster
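As a minimal sketch, the two variables above could be exported from a shell profile or from the start script before Enterprise Gateway is launched. The host name `yarn-rm.example.com` is a placeholder chosen for illustration, and the `SPARK_HOME` path is the HDP example from above; adjust both for your cluster.

```shell
#!/bin/bash
# Placeholder host name (an assumption); replace with your Resource Manager's FQDN.
YARN_RESOURCE_MANAGER_FQDN="yarn-rm.example.com"

# HDP layout taken from the example above; other distributions use other paths.
export SPARK_HOME="/usr/hdp/current/spark2-client"
export EG_YARN_ENDPOINT="http://${YARN_RESOURCE_MANAGER_FQDN}:8088/ws/v1/cluster"

# Warn early if SPARK_HOME does not exist on this node.
if [ ! -d "$SPARK_HOME" ]; then
  echo "warning: SPARK_HOME ($SPARK_HOME) is not a directory on this host" >&2
fi

echo "EG_YARN_ENDPOINT=${EG_YARN_ENDPOINT}"
```

The directory check is only a convenience: it catches a mistyped path before a kernel launch fails later.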
Installing support for Scala (Apache Toree kernel)¶
We have tested the latest version of Apache Toree for Scala 2.11 support. To enable that support, follow these steps:
- Install Apache Toree
# pip-install the Apache Toree installer
pip install https://dist.apache.org/repos/dist/dev/incubator/toree/0.2.0-incubating-rc1/toree-pip/toree-0.2.0.tar.gz
# install a new Toree Scala kernel which will be updated with Enterprise Gateway's custom kernel scripts
jupyter toree install --spark_home="${SPARK_HOME}" --kernel_name="Spark 2.1" --interpreters="Scala"
- Update the Apache Toree Kernelspecs
We have provided some customized kernelspecs as part of the Jupyter Enterprise Gateway releases. These kernelspecs come pre-configured with Yarn client and/or cluster mode. Use the steps below as an example of how to update or customize your kernelspecs:
wget https://github.com/SparkTC/enterprise_gateway/releases/download/v0.6/enterprise_gateway_kernelspecs.tar.gz
SCALA_KERNEL_DIR="$(jupyter kernelspec list | grep -w "spark_2.1_scala" | awk '{print $2}')"
KERNELS_FOLDER="$(dirname "${SCALA_KERNEL_DIR}")"
tar -zxvf enterprise_gateway_kernelspecs.tar.gz --strip 1 --directory $KERNELS_FOLDER/spark_2.1_scala_yarn_cluster/ spark_2.1_scala_yarn_cluster/
cp $KERNELS_FOLDER/spark_2.1_scala/lib/*.jar $KERNELS_FOLDER/spark_2.1_scala_yarn_cluster/lib
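To see what the `grep -w` / `awk` / `dirname` pipeline above extracts, the following sketch runs it against two sample lines in the format printed by `jupyter kernelspec list`. The paths are hypothetical; your installation may place kernels elsewhere.

```shell
#!/bin/bash
# Hypothetical listing: kernel name in column 1, kernel directory in column 2.
SAMPLE_LISTING="$(printf '%s\n' \
  '  spark_2.1_scala                 /usr/local/share/jupyter/kernels/spark_2.1_scala' \
  '  spark_2.1_scala_yarn_cluster    /usr/local/share/jupyter/kernels/spark_2.1_scala_yarn_cluster')"

# grep -w matches whole words only. Underscore counts as a word character,
# so the *_yarn_cluster line is not selected.
SCALA_KERNEL_DIR="$(echo "$SAMPLE_LISTING" | grep -w "spark_2.1_scala" | awk '{print $2}')"

# The parent directory is where all kernelspec folders live.
KERNELS_FOLDER="$(dirname "${SCALA_KERNEL_DIR}")"

echo "SCALA_KERNEL_DIR=${SCALA_KERNEL_DIR}"
echo "KERNELS_FOLDER=${KERNELS_FOLDER}"
```

If `SCALA_KERNEL_DIR` comes back empty on a real system, the Toree kernel was probably installed under a different name, and the `tar` and `cp` steps would then target the wrong location.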
Installing support for Python (IPython kernel)¶
The IPython kernel comes pre-configured.
Installing support for R (IRkernel)¶
# Perform the following steps on Jupyter Enterprise Gateway hosting system as well as all YARN workers
yum install -y "https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm"
yum install -y git openssl-devel.x86_64 libcurl-devel.x86_64
# Create an R-script to run and install packages
cat <<'EOF' > install_packages.R
install.packages(c('repr', 'IRdisplay', 'evaluate', 'git2r', 'crayon', 'pbdZMQ',
'devtools', 'uuid', 'digest', 'RCurl', 'argparser'),
repos='http://cran.rstudio.com/')
devtools::install_github('IRkernel/IRkernel')
IRkernel::installspec(user = FALSE)
EOF
# run the package install script
$ANACONDA_HOME/bin/Rscript install_packages.R
# OPTIONAL: check the installed R packages
ls $ANACONDA_HOME/lib/R/library
Next, copy the R kernelspecs to all YARN workers. In the commands below,
[ ENTERPRISE_GATEWAY ] is the root directory of the JEG GitHub repository:
cp -r [ ENTERPRISE_GATEWAY ]/etc/kernelspecs/spark_2.1_R* /usr/local/share/jupyter/kernels/
cp -r [ ENTERPRISE_GATEWAY ]/etc/kernel-launchers/R/scripts /usr/local/share/jupyter/kernels/spark_2.1_R_yarn_client/
cp -r [ ENTERPRISE_GATEWAY ]/etc/kernel-launchers/R/scripts /usr/local/share/jupyter/kernels/spark_2.1_R_yarn_cluster/
Installing Required Packages on Yarn Worker Nodes¶
To support IPython and R kernels, run the following commands on all Yarn worker nodes.
Installing Required Packages for IPython Kernels on Yarn Worker Nodes¶
yum -y install "https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm"
yum install -y python2-pip.noarch
# upgrade pip
python -m pip install --upgrade --force pip
# install IPython kernel packages
pip install ipykernel 'ipython<6.0'
# OPTIONAL: check installed packages
pip list | grep -E "ipython|ipykernel"
Installing Required Packages for R Kernels on Yarn Worker Nodes¶
yum install -y "https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm"
yum install -y R git openssl-devel.x86_64 libcurl-devel.x86_64
# create an install script
cat <<'EOF' > install_packages.R
install.packages('git2r', repos='http://cran.rstudio.com/')
install.packages('devtools', repos='http://cran.rstudio.com/')
install.packages('RCurl', repos='http://cran.rstudio.com/')
library('devtools')
install_github('IRkernel/repr', repos='http://cran.rstudio.com/')
install_github('IRkernel/IRdisplay', repos='http://cran.rstudio.com/')
install_github('IRkernel/IRkernel', repos='http://cran.rstudio.com/')
EOF
# run the package install script in the background
R CMD BATCH install_packages.R &
# OPTIONAL: tail the progress of the installation
tail -F install_packages.Rout
# OPTIONAL: check the installed packages
ls /usr/lib64/R/library/
Starting Enterprise Gateway¶
Enterprise Gateway needs very few arguments to start. A minimal command is:
jupyter enterprisegateway --ip=0.0.0.0 --port_retries=0
where --ip=0.0.0.0 exposes Enterprise Gateway on the public network and --port_retries=0 ensures that a single instance will be started.
It is recommended that you start Enterprise Gateway with kernel culling so as to better control kernel resources. In addition, we recommend starting Enterprise Gateway as a background task. As a result, you might find it best to create a start script to maintain options, file redirection, etc.
The following script starts Enterprise Gateway with DEBUG tracing enabled (default is INFO) and idle kernel culling for any kernels idle for 12 hours, where idle check intervals occur every minute. The Enterprise Gateway log can then be monitored via tail -F enterprise_gateway.log and it can be stopped via kill $(cat enterprise_gateway.pid)
#!/bin/bash
START_CMD="jupyter enterprisegateway --ip=0.0.0.0 --port_retries=0 --log-level=DEBUG"
CULLING_PARAMS="--MappingKernelManager.cull_idle_timeout=43200 --MappingKernelManager.cull_interval=60 --MappingKernelManager.cull_connected=True"
LOG=~/enterprise_gateway.log
PIDFILE=~/enterprise_gateway.pid
$START_CMD $CULLING_PARAMS > $LOG 2>&1 &
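# note: after "&", $? only reflects that the background job was launched, not that the gateway started successfully
if [ "$?" -eq 0 ]; then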
echo $! > $PIDFILE
else
exit 1
fi
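The culling values in the script are expressed in seconds. As a small worked example, the 12-hour idle timeout and the one-minute check interval can be derived with shell arithmetic rather than hard-coded. The variable names `IDLE_HOURS` and `CHECK_MINUTES` are introduced here for illustration only.

```shell
#!/bin/bash
# Illustrative inputs (names are assumptions, not Enterprise Gateway options).
IDLE_HOURS=12
CHECK_MINUTES=1

# Convert to the seconds expected by the MappingKernelManager options.
CULL_IDLE_TIMEOUT=$(( IDLE_HOURS * 60 * 60 ))   # 12 h -> 43200 s
CULL_INTERVAL=$(( CHECK_MINUTES * 60 ))         # 1 min -> 60 s

CULLING_PARAMS="--MappingKernelManager.cull_idle_timeout=${CULL_IDLE_TIMEOUT} --MappingKernelManager.cull_interval=${CULL_INTERVAL} --MappingKernelManager.cull_connected=True"
echo "$CULLING_PARAMS"
```

Deriving the values this way keeps the intent (hours and minutes) visible when the timeout is tuned later.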
Adding modes of distribution¶
By default, without kernelspec modifications, all kernels run local to Enterprise Gateway. This is what is referred to as LocalProcessProxy mode. Enterprise Gateway provides two additional modes out of the box, which are reflected in modified kernelspec files. These modes are YarnClusterProcessProxy and DistributedProcessProxy. The system architecture page provides more details regarding process proxies.
YarnClusterProcessProxy¶
YarnClusterProcessProxy mode launches the kernel as a managed resource within Yarn as noted above.
This launch mode requires that the command-line option --EnterpriseGatewayApp.yarn_endpoint be provided or the environment variable EG_YARN_ENDPOINT be defined. If neither value exists, the default value of http://localhost:8088/ws/v1/cluster will be used.
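The fallback described above can be mirrored in a wrapper script, for example to log which endpoint will be in effect before starting the gateway. The function below is a hypothetical helper, not part of Enterprise Gateway, and it assumes that a command-line value takes precedence over the environment variable.

```shell
#!/bin/bash
# Hypothetical helper: resolve_yarn_endpoint [cli_value]
# Assumed precedence: command-line value, then EG_YARN_ENDPOINT, then the default.
resolve_yarn_endpoint() {
  local cli_value="$1"
  if [ -n "$cli_value" ]; then
    echo "$cli_value"
  elif [ -n "${EG_YARN_ENDPOINT:-}" ]; then
    echo "$EG_YARN_ENDPOINT"
  else
    echo "http://localhost:8088/ws/v1/cluster"
  fi
}

echo "YARN endpoint in effect: $(resolve_yarn_endpoint "")"
```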
DistributedProcessProxy¶
DistributedProcessProxy provides for a simple, round-robin remoting mechanism where each successive kernel is launched on a different host. It requires that each of the kernelspec files reside in the same path on each node and that password-less ssh has been established between nodes.
When launched, the kernel runs as a Yarn client - meaning that the kernel process itself is not managed by the Yarn resource manager. This mode allows for the distribution of kernel (spark driver) processes across the cluster.
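To make the round-robin behaviour concrete, the sketch below picks a host for each successive launch by cycling through a host list. It only illustrates the idea; it is not Enterprise Gateway's implementation, and `pick_host` is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: pick_host <launch_number> <host>...
# Prints the host that launch number N (0-based) would land on.
pick_host() {
  local n="$1"
  shift                      # drop the launch number; "$@" is now the host list
  shift $(( n % $# ))        # skip ahead, wrapping around the list length
  echo "$1"
}

for launch in 0 1 2 3 4; do
  echo "kernel ${launch} -> $(pick_host "$launch" host1 host2 host3)"
done
```

With three hosts, the fourth launch wraps around to the first host again. The helper expects at least one host in the list.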
To use this form of distribution, the command-line option --EnterpriseGatewayApp.remote_hosts= should be set. It should be noted that this command-line option is a list, so values are indicated via bracketed strings: ['host1','host2','host3']. These values can also be set via the environment variable EG_REMOTE_HOSTS, in which case a simple comma-separated value is sufficient. If neither value is provided and DistributedProcessProxy kernels are invoked, Enterprise Gateway defaults this option to localhost.
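Because the two forms differ only in syntax, a start script that keeps its host list as a comma-separated string can convert it to the bracketed form when building the command-line option. The function `to_remote_hosts_list` is a hypothetical helper written for this illustration.

```shell
#!/bin/bash
# Hypothetical helper: turn "host1,host2,host3" into "['host1','host2','host3']".
to_remote_hosts_list() {
  local csv="$1"
  local out=""
  local host
  local IFS=','              # split the unquoted $csv on commas only
  for host in $csv; do
    out="${out:+$out,}'${host}'"
  done
  printf '[%s]\n' "$out"
}

HOSTS_CSV="host1,host2,host3"
REMOTE_HOSTS="--EnterpriseGatewayApp.remote_hosts=$(to_remote_hosts_list "$HOSTS_CSV")"
echo "$REMOTE_HOSTS"
```

The same comma-separated string could instead be exported as EG_REMOTE_HOSTS, in which case no conversion is needed.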
A more complete start script that includes the distribution modes might look like the following:
#!/bin/bash
START_CMD="jupyter enterprisegateway --ip=0.0.0.0 --port_retries=0 --log-level=DEBUG"
CULLING_PARAMS="--MappingKernelManager.cull_idle_timeout=43200 --MappingKernelManager.cull_interval=60 --MappingKernelManager.cull_connected=True"
YARN_ENDPOINT=--EnterpriseGatewayApp.yarn_endpoint="http://yarn-resource-manager-host:8088/ws/v1/cluster"
REMOTE_HOSTS=--EnterpriseGatewayApp.remote_hosts="['host1','host2','host3']"
LOG=~/enterprise_gateway.log
PIDFILE=~/enterprise_gateway.pid
$START_CMD $CULLING_PARAMS $YARN_ENDPOINT $REMOTE_HOSTS > $LOG 2>&1 &
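# note: after "&", $? only reflects that the background job was launched, not that the gateway started successfully
if [ "$?" -eq 0 ]; then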
echo $! > $PIDFILE
else
exit 1
fi
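Since the start scripts record the process ID in a PID file, a companion stop helper can wrap the kill $(cat enterprise_gateway.pid) command with a few safety checks. This is a sketch under the assumption that the PID file was written by one of the scripts above; `stop_gateway` is a hypothetical function name.

```shell
#!/bin/bash
# Hypothetical helper: stop_gateway <pidfile>
# Returns 1 if the PID file is missing; otherwise signals the process and removes the file.
stop_gateway() {
  local pidfile="$1"
  if [ ! -f "$pidfile" ]; then
    echo "no PID file at ${pidfile}; is Enterprise Gateway running?" >&2
    return 1
  fi
  local pid
  pid="$(cat "$pidfile")"
  # kill -0 sends no signal; it only checks that the process exists.
  if kill -0 "$pid" 2>/dev/null; then
    kill "$pid"
  else
    echo "process ${pid} is not running; removing stale PID file" >&2
  fi
  rm -f "$pidfile"
}

# Example invocation (commented out so that sourcing this file has no side effects):
# stop_gateway ~/enterprise_gateway.pid
```

Removing the PID file after the kill avoids signalling an unrelated process if the ID is later reused.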
Connecting a Notebook Client to Enterprise Gateway¶
NB2KG is used to connect from a local desktop or laptop to the Enterprise Gateway instance on the Yarn cluster. The most convenient way to use a pre-configured installation of NB2KG is the Docker image biginsights/jupyter-nb-nb2kg:dev. Replace <ENTERPRISE_GATEWAY_HOST_IP> in the command below:
docker run -t --rm \
-e KG_URL='http://<ENTERPRISE_GATEWAY_HOST_IP>:8888' \
-p 8888:8888 \
-e KG_HTTP_USER=guest \
-e KG_HTTP_PASS=guest-password \
-e VALIDATE_KG_CERT='no' \
-e LOG_LEVEL=INFO \
-e KG_REQUEST_TIMEOUT=40 \
-v ${HOME}/notebooks/:/tmp/notebooks \
-w /tmp/notebooks \
biginsights/jupyter-nb-nb2kg:dev
Note that the KG_HTTP_USER and KG_HTTP_PASS variables are necessary when Enterprise Gateway is behind an Apache Knox gateway.