Running Apache Spark on EKS Fargate

Apache Spark is one of the most popular big data frameworks, letting you process data at virtually any scale.

Spark jobs can run on a Kubernetes cluster; native support for the Kubernetes scheduler has been generally available (GA) since release 3.1.1.

Spark ships with a spark-submit script that provides a single interface for submitting applications to a cluster, with no need to customize the script for different cluster managers.

spark-submit on a Kubernetes cluster works as follows:

  1. Spark creates a Spark driver running within a Kubernetes pod.
  2. The driver creates executors, which also run within Kubernetes pods, connects to them, and executes the application code.
  3. When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in a “completed” state in the Kubernetes API until it is eventually garbage collected or manually deleted.

(diagram: spark-submit flow on an EKS cluster)

To submit a Spark job to a Kubernetes cluster using spark-submit:

./bin/spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=5 \
    --conf spark.kubernetes.container.image=<spark-image> \
    local:///path/to/examples.jar

While spark-submit supports several Kubernetes features such as Secrets, persistent volumes, and RBAC via configuration parameters, it still lacks many features, so it is not well suited for production use.
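For instance, the SparkApplication CRD described below lets you declare pod-level settings such as node selectors and tolerations that are awkward or impossible to express through spark-submit configuration alone. A hedged sketch (field names follow the operator's v1beta2 API; the label and toleration values are illustrative):

```yaml
# Illustrative fragment of a SparkApplication spec (v1beta2).
# The operator's mutating webhook applies these pod-level
# customizations to the driver pod.
spec:
  driver:
    nodeSelector:
      lifecycle: fargate        # hypothetical node label
    tolerations:
      - key: dedicated          # hypothetical taint
        operator: Equal
        value: spark
        effect: NoSchedule
```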

Spark on K8s Operator

Spark on K8s Operator is a project from Google that allows submitting Spark applications to a Kubernetes cluster using a CustomResourceDefinition (CRD) called SparkApplication.
It uses a mutating admission webhook to modify the pod spec and add features not natively supported by spark-submit.

The Kubernetes Operator for Apache Spark consists of:

  1. A SparkApplication controller that watches for creation, update, and deletion events of SparkApplication objects and acts on them,
  2. A submission runner that runs spark-submit for submissions received from the controller,
  3. A Spark pod monitor that watches the Spark pods and sends pod status updates to the controller,
  4. A Mutating Admission Webhook that customizes the Spark driver and executor pods based on annotations added to the pods by the controller,
  5. A command-line tool named sparkctl for working with the operator.

The following diagram shows how different components interact and work together.

(diagram: Spark on K8s Operator components)

Set Up Spark on K8s Operator on EKS Fargate

  • Set up an EKS cluster using eksctl, with a Fargate profile covering the default, kube-system, and spark namespaces.
    eksctl create cluster -f - <<EOF
    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    metadata:
      name: spark-cluster
      region: us-east-1
      version: "1.21"
    availabilityZones: 
      - us-east-1a
      - us-east-1b
      - us-east-1c
    fargateProfiles:
      - name: fp-all
        selectors:
          - namespace: kube-system
          - namespace: default
          - namespace: spark
    EOF
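Once the cluster is up, you can confirm that the Fargate profile exists and that pods in the selected namespaces are scheduled on Fargate. These commands assume eksctl and kubectl are configured against the new cluster:

```shell
# List the Fargate profiles attached to the cluster
eksctl get fargateprofile --cluster spark-cluster --region us-east-1

# Fargate-backed pods appear as nodes named fargate-ip-*
kubectl get nodes
```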
  • Install the Spark on K8s Operator in the spark namespace using Helm 3.
    helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
    helm upgrade \
        --install \
        spark-operator \
        spark-operator/spark-operator \
        --namespace spark \
        --create-namespace \
        --set webhook.enable=true \
        --set sparkJobNamespace=spark \
        --set serviceAccounts.spark.name=spark \
        --set logLevel=10 \
        --version 1.1.6 \
        --wait
  • Verify the operator installation:
    kubectl get pods -n spark

Submit SparkPi on EKS Cluster

  • Submit the SparkPi application to the EKS cluster
    kubectl apply -f - <<EOF
    apiVersion: "sparkoperator.k8s.io/v1beta2"
    kind: SparkApplication
    metadata:
      name: spark-pi
      namespace: spark
    spec:
      type: Scala
      mode: cluster
      image: "gcr.io/spark-operator/spark:v3.1.1"
      imagePullPolicy: Always
      mainClass: org.apache.spark.examples.SparkPi
      mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar"
      sparkVersion: "3.1.1"
      restartPolicy:
        type: Never
      driver:
        cores: 1
        coreLimit: "1200m"
        memory: "512m"
        labels:
          version: 3.1.1
        serviceAccount: spark
      executor:
        cores: 1
        instances: 1
        memory: "512m"
        labels:
          version: 3.1.1
    EOF

Note: hostPath volume mounts are not supported in Fargate.
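If a job needs scratch space, an emptyDir volume is a Fargate-friendly alternative to hostPath. A hedged sketch (Spark on Kubernetes treats volumes whose names start with spark-local-dir- as local directories; the mount path is illustrative):

```yaml
# Illustrative SparkApplication fragment: emptyDir-backed scratch
# space instead of hostPath, usable on Fargate.
spec:
  volumes:
    - name: spark-local-dir-1
      emptyDir: {}
  executor:
    volumeMounts:
      - name: spark-local-dir-1
        mountPath: /tmp/spark-local   # hypothetical path
```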

  • Check the status of SparkApplication
    kubectl -n spark describe sparkapplications.sparkoperator.k8s.io spark-pi
  • Access Spark UI by port-forwarding to the spark-pi-ui-svc
   kubectl -n spark port-forward svc/spark-pi-ui-svc 4040:4040

(screenshot: Spark UI for the spark-pi application)

Managing SparkApplication with sparkctl

sparkctl is a CLI tool for creating, listing, checking the status of, fetching logs of, and deleting SparkApplications running on a Kubernetes cluster.

  • Build sparkctl from source:
   git clone git@github.com:GoogleCloudPlatform/spark-on-k8s-operator.git
   cd spark-on-k8s-operator/sparkctl
   go build -o /usr/local/bin/sparkctl
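The SparkApplication manifest from the earlier section can also be submitted with sparkctl instead of kubectl. Assuming the manifest is saved as spark-pi.yaml (a hypothetical filename), using the same -n namespace flag as the examples below:

```shell
# Submit the SparkApplication manifest via sparkctl
sparkctl create spark-pi.yaml -n spark
```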
  • List SparkApplication objects in spark namespace:
   sparkctl list -n spark
   +----------+-----------+----------------+-----------------+
   |   NAME   |   STATE   | SUBMISSION AGE | TERMINATION AGE |
   +----------+-----------+----------------+-----------------+
   | spark-pi | COMPLETED | 1h             | 1h              |
   +----------+-----------+----------------+-----------------+
  • Check the status of the SparkApplication spark-pi:
   sparkctl status spark-pi -n spark

    application state:
    +-----------+----------------+----------------+-----------------+--------------------+--------------------+-------------------+
    |   STATE   | SUBMISSION AGE | COMPLETION AGE |   DRIVER POD    |     DRIVER UI      | SUBMISSIONATTEMPTS | EXECUTIONATTEMPTS |
    +-----------+----------------+----------------+-----------------+--------------------+--------------------+-------------------+
    | COMPLETED | 1h             | 1h             | spark-pi-driver | 10.100.97.206:4040 |                  1 |                 1 |
    +-----------+----------------+----------------+-----------------+--------------------+--------------------+-------------------+
    executor state:
    +----------------------------------+-----------+
    |           EXECUTOR POD           |   STATE   |
    +----------------------------------+-----------+
    | spark-pi-418ac87b48d177c9-exec-1 | COMPLETED |
    +----------------------------------+-----------+
  • Check SparkApplication spark-pi logs:
   sparkctl log spark-pi -n spark 
  • Port-forward to Spark UI:
   sparkctl forward spark-pi -n spark

You can then access the Spark UI at http://localhost:4040.
