Diving back into Python with a focus on expanding my knowledge of data processing, I became interested in creating a Proof of Concept (POC) for a data playground. With the help of @caiocampoos we started this repo here.
Even with almost zero knowledge in the area, we began discussing what we wanted to do. The main idea is to build knowledge through learning, and one of the best ways to learn is by creating something. In this project we're going to develop a basic data analysis platform, documenting the code throughout the process.
Talk is cheap, let's code!
Let's start with PySpark. PySpark is essentially Apache Spark tailored to integrate smoothly with Python. While Apache Spark supports several languages, Python's dominance in the data science realm makes PySpark the natural choice for our business-oriented project, so we'll use it to keep the whole workflow Python-centric.
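Just to give a taste of what we'll eventually run on this cluster, a PySpark job is a plain Python script that creates a SparkSession and works with DataFrames. The snippet below is only an illustrative sketch, not part of the project code:
from pyspark.sql import SparkSession

# Illustrative only: create a session, build a tiny DataFrame and print it.
spark = SparkSession.builder.appName("hello-pyspark").getOrCreate()

df = spark.createDataFrame([("spark", 1), ("python", 2)], ["word", "count"])
df.show()

spark.stop()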
PySpark also has a convenient Docker container image; you can browse the Apache Spark website to check other versions.
We start our project by creating a Dockerfile with this code:
FROM python:3.10-bullseye as spark-base
ARG SPARK_VERSION=3.5.1
The first line sets the base image to Python 3.10 on the Debian "Bullseye" distribution and names this build stage spark-base.
The second line defines the Spark version we're going to use; you may want to change it in the future.
Then we install the OS-level tools and the JDK that Spark needs:
RUN apt-get update && \
apt-get install -y --no-install-recommends \
sudo \
curl \
vim \
unzip \
rsync \
openjdk-11-jdk \
build-essential \
software-properties-common \
ssh && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
With the next lines we set up the directories for our Spark and Hadoop installations.
ENV SPARK_HOME=${SPARK_HOME:-"/opt/spark"}
ENV HADOOP_HOME=${HADOOP_HOME:-"/opt/hadoop"}
RUN mkdir -p ${HADOOP_HOME} && mkdir -p ${SPARK_HOME}
WORKDIR ${SPARK_HOME}
This command downloads the Spark distribution (pre-built for Hadoop 3) and unpacks it into /opt/spark.
RUN curl https://dlcdn.apache.org/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz -o spark-${SPARK_VERSION}-bin-hadoop3.tgz \
&& tar xvzf spark-${SPARK_VERSION}-bin-hadoop3.tgz --directory /opt/spark --strip-components 1 \
&& rm -rf spark-${SPARK_VERSION}-bin-hadoop3.tgz
The next lines start the pyspark stage and install the Python dependencies from a requirements folder that we'll cover further below.
FROM spark-base as pyspark
COPY requirements/requirements.txt .
RUN pip3 install -r requirements.txt
Now we set the Spark-related environment variables.
ENV PATH="/opt/spark/sbin:/opt/spark/bin:${PATH}"
ENV SPARK_MASTER="spark://spark-master:7077"
ENV SPARK_MASTER_HOST spark-master
ENV SPARK_MASTER_PORT 7077
ENV PYSPARK_PYTHON python3
The next lines of code will copy the configuration files, set permissions, and configure the environment variables accordingly.
COPY conf/spark-defaults.conf "$SPARK_HOME/conf"
RUN chmod u+x /opt/spark/sbin/* && \
chmod u+x /opt/spark/bin/*
ENV PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
And finally, copy the entrypoint script and set it.
COPY entrypoint.sh .
ENTRYPOINT ["./entrypoint.sh"]
Here is the full Dockerfile:
FROM python:3.10-bullseye as spark-base
ARG SPARK_VERSION=3.5.1
# Install tools required by the OS
RUN apt-get update && \
apt-get install -y --no-install-recommends \
sudo \
curl \
vim \
unzip \
rsync \
openjdk-11-jdk \
build-essential \
software-properties-common \
ssh && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Setup the directories for our Spark and Hadoop installations
ENV SPARK_HOME=${SPARK_HOME:-"/opt/spark"}
ENV HADOOP_HOME=${HADOOP_HOME:-"/opt/hadoop"}
RUN mkdir -p ${HADOOP_HOME} && mkdir -p ${SPARK_HOME}
WORKDIR ${SPARK_HOME}
# Download and install Spark
RUN curl https://dlcdn.apache.org/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz -o spark-${SPARK_VERSION}-bin-hadoop3.tgz \
&& tar xvzf spark-${SPARK_VERSION}-bin-hadoop3.tgz --directory /opt/spark --strip-components 1 \
&& rm -rf spark-${SPARK_VERSION}-bin-hadoop3.tgz
FROM spark-base as pyspark
# Install python deps
COPY requirements/requirements.txt .
RUN pip3 install -r requirements.txt
# Setup Spark related environment variables
ENV PATH="/opt/spark/sbin:/opt/spark/bin:${PATH}"
ENV SPARK_MASTER="spark://spark-master:7077"
ENV SPARK_MASTER_HOST spark-master
ENV SPARK_MASTER_PORT 7077
ENV PYSPARK_PYTHON python3
# Copy the default configurations into $SPARK_HOME/conf
COPY conf/spark-defaults.conf "$SPARK_HOME/conf"
RUN chmod u+x /opt/spark/sbin/* && \
chmod u+x /opt/spark/bin/*
ENV PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
# Copy appropriate entrypoint script
COPY entrypoint.sh .
ENTRYPOINT ["./entrypoint.sh"]
Now we need to create an entrypoint.sh script. This Bash script starts a different Apache Spark component depending on the first argument it receives, which it stores in the SPARK_WORKLOAD variable. We pass the --memory 1g flag to limit the memory available to each worker; you can change that if you want.
#!/bin/bash
SPARK_WORKLOAD=$1
echo "SPARK_WORKLOAD: $SPARK_WORKLOAD"
if [ "$SPARK_WORKLOAD" == "master" ];
then
start-master.sh -p 7077
elif [ "$SPARK_WORKLOAD" == "worker" ];
then
start-worker.sh spark://spark-master:7077 --memory 1g
elif [ "$SPARK_WORKLOAD" == "history" ]
then
start-history-server.sh
fi
And a requirements folder with a requirements.txt file inside:
ipython
pandas
pyarrow
numpy
pyspark
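If you want to double-check that the image picked up these dependencies, a quick throwaway check from inside any of the containers could look like this (just a sketch, run with python3):
# Verify that the packages from requirements.txt are importable.
import IPython, numpy, pandas, pyarrow, pyspark

print("pyspark", pyspark.__version__)
print("pandas", pandas.__version__)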
Our Docker Compose configuration is divided into three key components: the Spark Master, responsible for orchestrating the Spark cluster; the Spark History Server, which provides historical data on completed Spark applications; and the Spark Worker, representing a worker node within the cluster.
This is all the code we use in the docker-compose file.
version: '3.8'
services:
  spark-master:
    container_name: da-spark-master
    build: .
    image: da-spark-image
    entrypoint: ['./entrypoint.sh', 'master']
    healthcheck:
      test: [ "CMD", "curl", "-f", "http://localhost:8080" ]
      interval: 5s
      timeout: 3s
      retries: 3
    volumes:
      - ./book_data:/opt/spark/data
      - ./spark_apps:/opt/spark/apps
      - spark-logs:/opt/spark/spark-events
    env_file:
      - .env.spark
    ports:
      - '9090:8080'
      - '7077:7077'

  spark-history-server:
    container_name: da-spark-history
    image: da-spark-image
    entrypoint: ['./entrypoint.sh', 'history']
    depends_on:
      - spark-master
    env_file:
      - .env.spark
    volumes:
      - spark-logs:/opt/spark/spark-events
    ports:
      - '18080:18080'

  spark-worker:
    # container_name: da-spark-worker
    image: da-spark-image
    entrypoint: ['./entrypoint.sh', 'worker']
    depends_on:
      - spark-master
    env_file:
      - .env.spark
    volumes:
      - ./book_data:/opt/spark/data
      - ./spark_apps:/opt/spark/apps
      - spark-logs:/opt/spark/spark-events

volumes:
  spark-logs:
We also need to create a folder named 'conf' with a file named 'spark-defaults.conf' inside to store the Spark configuration; this is the file the Dockerfile copies into $SPARK_HOME/conf.
spark.master spark://localhost:7077
spark.eventLog.enabled true
spark.eventLog.dir /opt/spark/spark-events
spark.history.fs.logDirectory /opt/spark/spark-events
And an ssh_config file with the following content:
Host *
UserKnownHostsFile /dev/null
StrictHostKeyChecking no
And a .env.spark file with the following line, which keeps the Spark start scripts running in the foreground so the containers don't exit right after starting:
SPARK_NO_DAEMONIZE=true
And finally, our Makefile:
build:
	docker-compose build
build-nc:
	docker-compose build --no-cache
build-progress:
	docker-compose build --no-cache --progress=plain
down:
	docker-compose down --volumes --remove-orphans
run:
	make down && docker-compose up
run-scaled:
	make down && docker-compose up --scale spark-worker=3
run-d:
	make down && docker-compose up -d
stop:
	docker-compose stop
submit:
	docker exec da-spark-master spark-submit --master spark://spark-master:7077 --deploy-mode client ./apps/$(app)
submit-da-book:
	make submit app=data_analysis_book/$(app)
rm-results:
	rm -r book_data/results/*
Now we can just build and run as many workers as we need! For example, make run-scaled brings the stack up with three workers.
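To test the cluster end to end, you could drop a small PySpark script into spark_apps/ (which is mounted at /opt/spark/apps inside the master container) and run it through the submit target. The file name hello_spark.py and its contents below are only a hypothetical example, not part of the repo:
# spark_apps/hello_spark.py -- hypothetical example app, used only to exercise the cluster
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hello-spark-cluster").getOrCreate()

# Build a small DataFrame in-process so the job needs no external input files.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# A trivial aggregation, just to make the executors do some work.
df.agg(F.avg("age").alias("avg_age")).show()

spark.stop()
With the stack up (make run or make run-scaled), this would be submitted with make submit app=hello_spark.py.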