Diving back into Python with a focus on expanding my knowledge of data processing, I became interested in creating a Proof of Concept (POC) for a data playground. With the help of @caiocampoos we started this repo here.
Even with almost zero knowledge in the area, we began discussing what we wanted to do. The main idea is to build knowledge through learning, and one of the best ways to learn is by creating something. In this project we're going to develop a basic data analysis platform, documenting the code throughout the process.
Talk is cheap, let's code!
Let's start with PySpark. PySpark is essentially Apache Spark tailored to integrate smoothly with Python. While Apache Spark supports several languages, Python's dominance in the data science realm makes PySpark the natural choice for our business-oriented project, so we'll use it to keep the whole workflow Python-centric.
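Just to give a taste of what we'll eventually run on this cluster, a PySpark job is a plain Python script that creates a SparkSession and works with DataFrames. The snippet below is only an illustrative sketch, not part of the project code:
from pyspark.sql import SparkSession

# Illustrative only: create a session, build a tiny DataFrame and print it.
spark = SparkSession.builder.appName("hello-pyspark").getOrCreate()

df = spark.createDataFrame([("spark", 1), ("python", 2)], ["word", "count"])
df.show()

spark.stop()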
PySpark also has a convenient Docker container image; you can browse the Apache Spark website to check other versions.
We start our project by creating a Dockerfile with this code:
FROM python:3.10-bullseye as spark-base
ARG SPARK_VERSION=3.5.1
The first line sets the base image to Python 3.10 on the Debian "Bullseye" distribution and names this build stage spark-base.
The second line defines the Spark version we're going to use; you may want to change it in the future.
Then we install the OS-level tools and the JDK that Spark needs:
RUN apt-get update && \
apt-get install -y --no-install-recommends \
sudo \
curl \
vim \
unzip \
rsync \
openjdk-11-jdk \
build-essential \
software-properties-common \
ssh && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
With the next lines we set up the directories for our Spark and Hadoop installations.
ENV SPARK_HOME=${SPARK_HOME:-"/opt/spark"}
ENV HADOOP_HOME=${HADOOP_HOME:-"/opt/hadoop"}
RUN mkdir -p ${HADOOP_HOME} && mkdir -p ${SPARK_HOME}
WORKDIR ${SPARK_HOME}
This command downloads the Spark distribution (pre-built for Hadoop 3) and unpacks it into /opt/spark.
RUN curl https://dlcdn.apache.org/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz -o spark-${SPARK_VERSION}-bin-hadoop3.tgz \
&& tar xvzf spark-${SPARK_VERSION}-bin-hadoop3.tgz --directory /opt/spark --strip-components 1 \
&& rm -rf spark-${SPARK_VERSION}-bin-hadoop3.tgz
The next lines start the pyspark stage and install the Python dependencies from a requirements folder that we'll cover further below.
FROM spark-base as pyspark
COPY requirements/requirements.txt .
RUN pip3 install -r requirements.txt
Now we set the Spark-related environment variables.
ENV PATH="/opt/spark/sbin:/opt/spark/bin:${PATH}"
ENV SPARK_MASTER="spark://spark-master:7077"
ENV SPARK_MASTER_HOST spark-master
ENV SPARK_MASTER_PORT 7077
ENV PYSPARK_PYTHON python3
The next lines of code will copy the configuration files, set permissions, and configure the environment variables accordingly.
COPY conf/spark-defaults.conf "$SPARK_HOME/conf"
RUN chmod u+x /opt/spark/sbin/* && \
chmod u+x /opt/spark/bin/*
ENV PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
And finally, copy the entrypoint script and set it.
COPY entrypoint.sh .
ENTRYPOINT ["./entrypoint.sh"]
Here is the full Dockerfile:
FROM python:3.10-bullseye as spark-base
ARG SPARK_VERSION=3.5.1
# Install tools required by the OS
RUN apt-get update && \
apt-get install -y --no-install-recommends \
sudo \
curl \
vim \
unzip \
rsync \
openjdk-11-jdk \
build-essential \
software-properties-common \
ssh && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Setup the directories for our Spark and Hadoop installations
ENV SPARK_HOME=${SPARK_HOME:-"/opt/spark"}
ENV HADOOP_HOME=${HADOOP_HOME:-"/opt/hadoop"}
RUN mkdir -p ${HADOOP_HOME} && mkdir -p ${SPARK_HOME}
WORKDIR ${SPARK_HOME}
# Download and install Spark
RUN curl https://dlcdn.apache.org/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz -o spark-${SPARK_VERSION}-bin-hadoop3.tgz \
&& tar xvzf spark-${SPARK_VERSION}-bin-hadoop3.tgz --directory /opt/spark --strip-components 1 \
&& rm -rf spark-${SPARK_VERSION}-bin-hadoop3.tgz
FROM spark-base as pyspark
# Install python deps
COPY requirements/requirements.txt .
RUN pip3 install -r requirements.txt
# Setup Spark related environment variables
ENV PATH="/opt/spark/sbin:/opt/spark/bin:${PATH}"
ENV SPARK_MASTER="spark://spark-master:7077"
ENV SPARK_MASTER_HOST spark-master
ENV SPARK_MASTER_PORT 7077
ENV PYSPARK_PYTHON python3
# Copy the default configurations into $SPARK_HOME/conf
COPY conf/spark-defaults.conf "$SPARK_HOME/conf"
RUN chmod u+x /opt/spark/sbin/* && \
chmod u+x /opt/spark/bin/*
ENV PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
# Copy appropriate entrypoint script
COPY entrypoint.sh .
ENTRYPOINT ["./entrypoint.sh"]
Now we need to create an entrypoint.sh script. This Bash script starts a different Apache Spark component depending on the first argument it receives, which it stores in the SPARK_WORKLOAD variable. We pass the --memory 1g flag to limit the memory available to each worker; you can change that if you want.
#!/bin/bash
SPARK_WORKLOAD=$1
echo "SPARK_WORKLOAD: $SPARK_WORKLOAD"
if [ "$SPARK_WORKLOAD" == "master" ];
then
start-master.sh -p 7077
elif [ "$SPARK_WORKLOAD" == "worker" ];
then
start-worker.sh spark://spark-master:7077 --memory 1g
elif [ "$SPARK_WORKLOAD" == "history" ]
then
start-history-server.sh
fi
And a requirements folder with a requirements.txt file inside:
ipython
pandas
pyarrow
numpy
pyspark
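If you want to double-check that the image picked up these dependencies, a quick throwaway check from inside any of the containers could look like this (just a sketch, run with python3):
# Verify that the packages from requirements.txt are importable.
import IPython, numpy, pandas, pyarrow, pyspark

print("pyspark", pyspark.__version__)
print("pandas", pandas.__version__)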
Our Docker Compose configuration is divided into three key components: the Spark Master, responsible for orchestrating the Spark cluster; the Spark History Server, which provides historical data on completed Spark applications; and the Spark Worker, representing a worker node within the cluster.
This is all the code we use in the docker-compose file.
version: '3.8'
services:
  spark-master:
    container_name: da-spark-master
    build: .
    image: da-spark-image
    entrypoint: ['./entrypoint.sh', 'master']
    healthcheck:
      test: [ "CMD", "curl", "-f", "http://localhost:8080" ]
      interval: 5s
      timeout: 3s
      retries: 3
    volumes:
      - ./book_data:/opt/spark/data
      - ./spark_apps:/opt/spark/apps
      - spark-logs:/opt/spark/spark-events
    env_file:
      - .env.spark
    ports:
      - '9090:8080'
      - '7077:7077'

  spark-history-server:
    container_name: da-spark-history
    image: da-spark-image
    entrypoint: ['./entrypoint.sh', 'history']
    depends_on:
      - spark-master
    env_file:
      - .env.spark
    volumes:
      - spark-logs:/opt/spark/spark-events
    ports:
      - '18080:18080'

  spark-worker:
    # container_name: da-spark-worker
    image: da-spark-image
    entrypoint: ['./entrypoint.sh', 'worker']
    depends_on:
      - spark-master
    env_file:
      - .env.spark
    volumes:
      - ./book_data:/opt/spark/data
      - ./spark_apps:/opt/spark/apps
      - spark-logs:/opt/spark/spark-events

volumes:
  spark-logs:
We also need to create a folder named 'conf' with a file named 'spark-defaults.conf' inside to store the Spark configuration; this is the file the Dockerfile copies into $SPARK_HOME/conf.
spark.master spark://localhost:7077
spark.eventLog.enabled true
spark.eventLog.dir /opt/spark/spark-events
spark.history.fs.logDirectory /opt/spark/spark-events
And an ssh_config file with the following content:
Host *
UserKnownHostsFile /dev/null
StrictHostKeyChecking no
And a .env.spark file with the following line, which keeps the Spark start scripts running in the foreground so the containers don't exit right after starting:
SPARK_NO_DAEMONIZE=true
And finally, our Makefile:
build:
	docker-compose build
build-nc:
	docker-compose build --no-cache
build-progress:
	docker-compose build --no-cache --progress=plain
down:
	docker-compose down --volumes --remove-orphans
run:
	make down && docker-compose up
run-scaled:
	make down && docker-compose up --scale spark-worker=3
run-d:
	make down && docker-compose up -d
stop:
	docker-compose stop
submit:
	docker exec da-spark-master spark-submit --master spark://spark-master:7077 --deploy-mode client ./apps/$(app)
submit-da-book:
	make submit app=data_analysis_book/$(app)
rm-results:
	rm -r book_data/results/*
Now we can just build and run as many workers as we need! For example, make run-scaled brings the stack up with three workers.
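To test the cluster end to end, you could drop a small PySpark script into spark_apps/ (which is mounted at /opt/spark/apps inside the master container) and run it through the submit target. The file name hello_spark.py and its contents below are only a hypothetical example, not part of the repo:
# spark_apps/hello_spark.py -- hypothetical example app, used only to exercise the cluster
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hello-spark-cluster").getOrCreate()

# Build a small DataFrame in-process so the job needs no external input files.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# A trivial aggregation, just to make the executors do some work.
df.agg(F.avg("age").alias("avg_age")).show()

spark.stop()
With the stack up (make run or make run-scaled), this would be submitted with make submit app=hello_spark.py.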