DEV Community

akoshel

How to run Spark on Kubernetes in JupyterHub

This is a basic tutorial on how to run Spark in client mode from a JupyterHub notebook.
All required files are available at https://github.com/akoshel/spark-k8s-jupyterhub

Final architecture

Motivation

I found a lot of tutorials on this topic, and almost all of them rely on custom Spark and JupyterHub deployments. So I decided to minimize custom configuration and use stock open-source components as far as possible.

Install minikube & Helm

First, we need to set up local Kubernetes infrastructure.
Minikube installation instructions: https://minikube.sigs.k8s.io/docs/start/
Helm installation instructions: https://helm.sh/docs/intro/install/

Make locally built Docker images available inside minikube:



eval $(minikube docker-env)



Install Spark

Let's install Spark locally.
We will then build a Spark image and run the SparkPi example with spark-submit.



sudo apt-get -y install openjdk-8-jdk-headless
wget https://downloads.apache.org/spark/spark-3.2.2/spark-3.2.2-bin-hadoop3.2.tgz
tar xvf spark-3.2.2-bin-hadoop3.2.tgz
sudo mv spark-3.2.2-bin-hadoop3.2 /opt/spark



Build the Spark image

Spark ships with a Kubernetes Dockerfile. Let's build the Spark image:



cat /opt/spark/kubernetes/dockerfiles/spark/Dockerfile
cd /opt/spark
docker build -t spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .



The Spark base image does not include Python, so we need to build a PySpark image (/opt/spark/kubernetes/dockerfiles/spark/bindings/python/Dockerfile).
The base image also lacks s3a and Postgres support, which is why the corresponding Maven JARs have to be added.
See the modified image here: https://github.com/akoshel/spark-k8s-jupyterhub/blob/main/pyspark.Dockerfile

Build the PySpark image



cd /opt/spark
docker build -t pyspark:latest -f kubernetes/dockerfiles/spark/bindings/python/Dockerfile .



Run SparkPi

Before running the example, the namespace, service account, role, and role binding must be deployed:



kubectl apply -f spark_namespace.yaml
kubectl apply -f spark_sa.yaml
kubectl apply -f spark_sa_role.yaml


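The actual manifests live in the linked repository; as a rough sketch (resource names taken from the spark-submit flags below, while the role name and verb list are assumptions), they look like this:

```yaml
# Sketch of spark_namespace.yaml, spark_sa.yaml, and spark_sa_role.yaml.
# The namespace and service-account names match the spark-submit flags below;
# the role name and permissions are illustrative, not copied from the repo.
apiVersion: v1
kind: Namespace
metadata:
  name: spark
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-role
  namespace: spark
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role-binding
  namespace: spark
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: spark
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
```

Spark needs these permissions because the driver itself creates and deletes executor pods in the spark namespace.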

Now we are ready to run the SparkPi example using spark-submit.
(Use kubectl cluster-info to find your master address.)



/opt/spark/bin/spark-submit \
  --master k8s://https://192.168.49.2:8443 \
  --deploy-mode cluster \
  --driver-memory 1g \
  --conf spark.kubernetes.memoryOverheadFactor=0.5 \
  --name sparkpi-test1 \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.container.image=spark:latest \
  --conf spark.kubernetes.driver.pod.name=spark-test1-pi \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --verbose \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.2.2.jar 1000



Check the logs:



kubectl logs -n spark spark-test1-pi | grep "Pi is roughly"
Pi is roughly 3.1416600314166003


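SparkPi estimates π by Monte Carlo sampling: it throws random points at the unit square and counts how many land inside the quarter circle. A plain-Python sketch of the same idea (this is not Spark's actual code; the function name estimate_pi is ours):

```python
import random

def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Monte Carlo estimate of pi: sample points in the unit square
    and count the fraction that fall inside the quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1.0:
            inside += 1
    return 4.0 * inside / num_samples

print(estimate_pi(1_000_000))
```

In the real job, Spark distributes the sampling across executor pods and sums the counts, which is why the argument 1000 (the number of partitions) controls the parallelism.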

Great! Spark is running on Kubernetes.

Install JupyterHub

Before installing JupyterHub, the service account, role, and role binding must be deployed in the jupyterhub namespace:



kubectl apply -f jupyterhub_sa.yaml
kubectl apply -f jupyterhub_sa_role.yaml



Spark executors run in the spark namespace, while the driver (the notebook) runs in the jupyterhub namespace. For the executors to reach the driver across namespaces, we have to deploy a driver service (driver_service.yaml):



kubectl apply -f driver_service.yaml


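The actual driver_service.yaml is in the linked repository; a sketch of what it has to provide (the ports match the SparkConf used later, while the singleuser-server selector is an assumption based on the labels the Zero to JupyterHub chart puts on notebook pods):

```yaml
# Sketch of driver_service.yaml: a headless service that makes the notebook
# pod reachable at driver-service.jupyterhub.svc.cluster.local on the
# driver ports configured in the SparkConf below (2222 and 7777).
apiVersion: v1
kind: Service
metadata:
  name: driver-service
  namespace: jupyterhub
spec:
  clusterIP: None                    # headless: DNS resolves straight to the pod IP
  selector:
    component: singleuser-server     # z2jh's label for notebook pods (assumed)
  ports:
    - name: driver
      port: 2222
      targetPort: 2222
    - name: blockmanager
      port: 7777
      targetPort: 7777
```

Without this service, executors in the spark namespace would have no stable DNS name to call back to the driver.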

To access the Spark UI, an ingress should be deployed:



kubectl apply -f driver_ingress.yaml



Java is not installed in the default JupyterHub singleuser image, so we build a modified singleuser image:



docker build -f singleuser.Dockerfile -t singleuser:v1 .



See jhub_values.yaml; the modifications are the new image, the service account, and resource settings.
Now we are ready to deploy JupyterHub:



helm upgrade --cleanup-on-fail \
--install jupyterhub jupyterhub/jupyterhub \
--namespace jupyterhub \
--create-namespace \
--version=2.0.0 \
--values jhub_values.yaml



The easiest way to access JupyterHub is port-forwarding from the proxy pod. Alternatively, you can configure an ingress in jhub_values.yaml.



kubectl port-forward proxy-dd5964d5b-6lkwp  -n jupyterhub  8000:8000 # Set your pod name



PySpark from JupyterHub

Open JupyterHub in your browser at http://localhost:8000/
Open a JupyterHub terminal and install the PySpark version that matches the Spark version in the image:



pip install pyspark==3.2.2



Create a notebook

Create a SparkContext:



from pyspark import SparkConf, SparkContext

conf = (SparkConf().setMaster("k8s://https://192.168.49.2:8443") # Your master address
        .set("spark.kubernetes.container.image", "pyspark:latest") # Spark image name
        .set("spark.driver.port", "2222") # Needs to match svc
        .set("spark.driver.blockManager.port", "7777") # Needs to match svc
        .set("spark.driver.host", "driver-service.jupyterhub.svc.cluster.local") # Needs to match svc
        .set("spark.driver.bindAddress", "0.0.0.0")
        .set("spark.kubernetes.namespace", "spark")
        .set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
        .set("spark.kubernetes.authenticate.serviceAccountName", "spark")
        .set("spark.executor.instances", "2")
        .set("spark.kubernetes.container.image.pullPolicy", "IfNotPresent")
        .set("spark.app.name", "tutorial_app"))
sc = SparkContext(conf=conf)




Run a Spark application



# Calculate the approximate sum of values in the dataset
t = sc.parallelize(range(10))
r = t.sumApprox(3)
print('Approximate sum: %s' % r)

Approximate sum: 45.0


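A plain-Python sketch (not Spark code) of what this job computes: parallelize() splits range(10) into partitions, each executor sums its partition locally, and the partial sums are combined. sumApprox() may return before every partition finishes; here everything completes, so the result equals the exact sum. The helper name split_into_partitions is ours:

```python
def split_into_partitions(data, num_partitions):
    """Round-robin split, mimicking how an RDD spreads data across executors."""
    return [data[i::num_partitions] for i in range(num_partitions)]

parts = split_into_partitions(list(range(10)), 2)  # two executors, as configured
partial_sums = [sum(p) for p in parts]             # each executor's local sum
print(sum(partial_sums))                           # prints 45, matching above
```

The timeout argument to sumApprox (3 seconds here) is what makes the result "approximate": Spark returns whatever partial sums have arrived by then.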

Check the executor pods:



kubectl get pods -n spark
NAME                                   READY   STATUS    RESTARTS   AGE
tutorial-app-d63d4c83e68ed465-exec-1   1/1     Running   0          16s
tutorial-app-d63d4c83e68ed465-exec-2   1/1     Running   0          15s



Congratulations! PySpark in client mode is running from JupyterHub.

Further steps:

  1. Tune your Spark configuration
  2. Configure JupyterHub https://z2jh.jupyter.org/en/stable/jupyterhub/customization.html
  3. Install the Spark operator https://googlecloudplatform.github.io/spark-on-k8s-operator/docs/quick-start-guide.html

Resources:

  1. https://spark.apache.org/docs/latest/running-on-kubernetes.html
  2. https://z2jh.jupyter.org/
  3. https://scalingpythonml.com/2020/12/21/running-a-spark-jupyter-notebooks-in-client-mode-inside-of-a-kubernetes-cluster-on-arm.html
  4. https://oak-tree.tech/blog/spark-kubernetes-jupyter

P.S. See my second post about Spark on k8s, "Optimize Spark on Kubernetes":
https://dev.to/akoshel/optimize-spark-on-kubernetes-32la

Top comments (4)

embeddedfreedom:
I do not see the jupyterhub namespace get created anywhere. I also see spark_namespace.yaml missing in git. Can't find driver_ingress.yaml.

akoshel:
You can find the jupyterhub ns here:
github.com/akoshel/spark-k8s-jupyt...

embeddedfreedom:
The Helm command always times out and a failure message is issued.

akoshel:
Try reinstalling Helm and double-check your kubectl profile.