Michal Slovík

Posted on Sep 26, 2023

Python Interpreter in Docker and Pyspark Tests in Docker

#docker #python #pyspark #interpreter

Overview

There are two main ideas behind this article: security and mobility. When you create your environment only on your server, laptop, Raspberry Pi, etc., it may be great, but without any backup, regular updates, or automation, it can easily become a SnowflakeServer anti-pattern. Python offers virtual environments to avoid this issue.

Another point is the security of the environment. Virtual environments are great, but they can easily become insecure. You should use the latest stable version of Python and be careful when installing and using external libraries. A malicious module can share its name with a module from a popular open-source library and find its way onto the system path. If the malicious module is found before the actual module, it will be imported and could be used to exploit applications that have it in their dependency tree. To reduce this risk, use either an absolute import or an explicit relative import; an explicit relative import always resolves within your own package, so you get the module you intended.
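One way to see which file an import would actually resolve to is to ask the import system before importing. The following is a minimal sketch rather than a complete defence: the helper names trusted_roots and is_trusted are made up for illustration, and treating only the standard library and site-packages directories as trusted is an assumption.

```python
import importlib.util
import sysconfig
from pathlib import Path


def trusted_roots():
    """Directories we assume legitimate modules live in (an illustrative choice)."""
    paths = sysconfig.get_paths()
    keys = ("stdlib", "platstdlib", "purelib", "platlib")
    return {Path(paths[k]).resolve() for k in keys if k in paths}


def is_trusted(module_name):
    """Return (trusted, origin) for the file an import of module_name would load."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return False, None  # module cannot be found at all
    # Built-in and frozen modules are compiled into the interpreter itself.
    if spec.origin in ("built-in", "frozen"):
        return True, spec.origin
    if spec.origin is None:
        return False, None  # e.g. a namespace package without a single file
    origin = Path(spec.origin).resolve()
    trusted = any(root in origin.parents for root in trusted_roots())
    return trusted, str(origin)


if __name__ == "__main__":
    for name in ("json", "sys"):
        print(name, is_trusted(name))
```

A module that is picked up from the current directory or from an unexpected entry on sys.path is reported as untrusted, which is a hint to look at it more closely before running the application.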

Fortunately, Docker can help with many of these concerns.

Using Docker as a Python Interpreter

There are a few different approaches to using Docker as a Python interpreter, depending on your needs. If you have an existing Python application that you want to run in a Docker container, you can “dockerize” your application by creating a Docker image.

Dockerize your Python application

Docker images are a great way to ensure that your application runs consistently across different machines and servers. By containerizing your application, you can separate it from the underlying hardware and operating system as much as possible, making it easier to move and deploy.

To dockerize your Python application, you'll need to create a Dockerfile that specifies the base image, any dependencies you need to install, and the command to run your application. For more info see Dockerize your python application.

Benefits of using Docker as a Python interpreter

Using Docker as a Python interpreter has a number of benefits. For one, it makes it easier to ensure that your application runs consistently across different environments. Additionally, it can simplify the process of managing dependencies, as all the dependencies for your application can be included in the Docker image. Finally, by using Docker, you can avoid some of the pitfalls of maintaining a "snowflake server" - a server that is difficult to reproduce and maintain over time.

Using Docker as a remote Python interpreter

What if you want to develop and build your application from scratch and use a separate Python interpreter? There are existing solutions such as PyCharm from JetBrains, but this feature requires the Professional edition of PyCharm.

Visual Studio Code can handle this job with the Dev Containers plugin. Basically, you can attach to a Docker container that contains Python.

The first step is to create a simple Dockerfile:

FROM python:latest

WORKDIR /app

COPY . ./

Build the image with docker build -t pyground . and start a container with docker run -it --rm --name pyground pyground:latest. After these commands your Python interpreter is alive, and you can attach to it with your code editor (VS Code in our example). Alternatively, from any terminal you can connect to the container and run whatever you want.

In VSCode via plugin pick up new interpreters

Attach your running docker containers

After that you should see a new Visual Studio Code window open, with the attached running container shown in the left corner.

Via terminal I get current python version:

root@704b87b076d8:~# python --version
Python 3.11.2

If you have files in the same folder as your Dockerfile, you can run and use them in the container:

root@704b87b076d8:~# ls /app/
Dockerfile  solver.py
root@704b87b076d8:~# python /app/solver.py 
a: 1
b: 10
c: 1
(-0.10102051443364424, -9.898979485566356)

Code for this solver.py is available here JetBrains Example.
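The linked example is not reproduced here, but judging from the output above it solves a quadratic equation. A minimal sketch of such a solver, written for illustration and not the JetBrains code itself (the function name solve_quadratic and the return conventions are assumptions), might look like this:

```python
import math


def solve_quadratic(a, b, c):
    """Return the real roots of a*x**2 + b*x + c = 0 as a tuple."""
    if a == 0:
        raise ValueError("a must be non-zero for a quadratic equation")
    d = b ** 2 - 4 * a * c  # discriminant
    if d < 0:
        return ()  # no real roots
    if d == 0:
        return (-b / (2 * a),)  # one repeated root
    root = math.sqrt(d)
    return ((-b + root) / (2 * a), (-b - root) / (2 * a))


if __name__ == "__main__":
    # Same coefficients as in the terminal session: a=1, b=10, c=1
    print(solve_quadratic(1, 10, 1))
```

With a=1, b=10, c=1 this yields the same pair of roots as the session above, up to floating-point rounding.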

After you are done with your work, you can stop the Docker container, and VS Code should automatically detach from it. Keep in mind that the --rm option removes the container once it stops. For a more persistent solution, take a look at Docker's bind mounts.

We can improve the Dockerfile so that it comes with preinstalled libraries. The best practice for Python libraries is to list them in a requirements.txt file:

numpy==1.24.2
pandas==1.5.3

And improved Dockerfile:

FROM python:latest

RUN pip install --upgrade pip

COPY requirements.txt .

RUN pip install -r requirements.txt

Build the Docker image from a specific Dockerfile by passing its filename with -f: docker build -t pyground2 -f .\Dockerfile-with-requirements.dockerfile .

And then we can test it:

Python 3.11.2 (main, Mar  1 2023, 14:46:02) [GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>>
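Importing the libraries shows that they are present, but not that the pinned versions were installed. As a rough sketch, the pins could be compared with what is installed by using importlib.metadata from the standard library. The helper names parse_pins and find_mismatches are hypothetical, and only simple name==version lines are handled.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Collect 'name==version' pins; other requirement forms are ignored."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Map package name -> (expected, installed); installed is None if missing."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


# The same pins as in the requirements.txt shown earlier.
REQUIREMENTS = """\
numpy==1.24.2
pandas==1.5.3
"""

if __name__ == "__main__":
    print(find_mismatches(parse_pins(REQUIREMENTS)) or "all pins match")
```

Run inside the container, an empty result means every pinned package is installed at exactly the requested version.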

We can put some pre-configuration in the Dockerfile or add sample files. This way we can prepare the same environment for colleagues or students. Everyone will have the same version of Python along with the same versions of all libraries and dependencies.

One last note: don't overfill your Dockerfile, and always check whether you really need these things in your image. For example, my base pyground image with the latest version of Python (3.11.2) was 925 MB, and with the pandas and numpy libraries it grew to over 1 GB!

> docker images
REPOSITORY        TAG       IMAGE ID       CREATED          SIZE
pyground          latest    c69a7214f5e5   33 minutes ago   925MB
pyground2         latest    bfe8fc2400e4   7 minutes ago    1.12GB

Docker containers as test instances for PySpark

We used the previous example for testing purposes. We have several goals that we need to achieve. First, we need to ensure that we are able to test our code locally (on any operating system and any hardware). We also need to check our code during the deployment pipeline. Docker is perfect for all these tasks. Moreover, our code is written for PySpark.

First, we need to configure a Dockerfile containing PySpark and Java.

ARG IMAGE_VARIANT=slim-buster
ARG OPENJDK_VERSION=8
ARG PYTHON_VERSION=3.9.8

FROM python:${PYTHON_VERSION}-${IMAGE_VARIANT} AS py3
FROM openjdk:${OPENJDK_VERSION}-${IMAGE_VARIANT}

A few of these lines need explaining. We use the slim-buster image variant. A slim image generally contains only the minimal packages needed to run Python. buster is the codename of the Debian 10 release (10.4 at the time of writing), on which this Python image is based.

Next, we use a second base image, openjdk, with the same variant and Java version 8 (OPENJDK_VERSION=8).

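# Assumed step: copy Python from the py3 stage so that pip exists in the openjdk-based image
COPY --from=py3 / /

ARG PYSPARK_VERSION=3.2.0
RUN pip --no-cache-dir install pyspark==${PYSPARK_VERSION}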

With the RUN command we install PySpark itself (version 3.2.0), and then we install the system dependencies and commonly used libraries.

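WORKDIR /app

# System packages needed to build native extensions
RUN apt-get update && apt-get install -y build-essential libxml2
COPY . /app
# Test tooling and commonly used libraries
RUN pip3 install cython numpy pytest pandas coverage pyspark_test dummy_spark IPython pytest-cov
# Project-specific dependencies
RUN pip3 install -r requirements.txt

# Run the unit tests under coverage, then print the report and export it as XML
ENTRYPOINT python3 -m coverage run --source=. -m pytest -v test/ && coverage report && coverage xml && cat coverage.xml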

The last command runs the tests themselves. Here we call the coverage and pytest tools. This command runs all the unit tests in the test/ folder, prints a coverage report, and writes it to coverage.xml.
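The coverage.xml file written by that last command can also be read programmatically, for example to fail a build when coverage drops too low. The sketch below assumes the Cobertura-style layout that coverage xml produces, where the root coverage element carries a line-rate attribute. The function names, the 80% default threshold, and the trimmed sample report are made up for illustration.

```python
import xml.etree.ElementTree as ET


def line_coverage_percent(xml_text):
    """Read the overall line coverage from a Cobertura-style coverage.xml."""
    root = ET.fromstring(xml_text)
    return float(root.attrib["line-rate"]) * 100


def meets_threshold(xml_text, minimum_percent=80.0):
    """True if overall line coverage is at least minimum_percent (assumed default)."""
    return line_coverage_percent(xml_text) >= minimum_percent


# A trimmed, made-up report; a real coverage.xml has many more attributes and elements.
SAMPLE = '<coverage line-rate="0.875" branch-rate="0"><packages/></coverage>'

if __name__ == "__main__":
    print(line_coverage_percent(SAMPLE))
    print(meets_threshold(SAMPLE))
```

In a real run you would read the text of coverage.xml from disk and pass it to meets_threshold instead of the sample string.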

Lift and shift

As mentioned earlier, we can reuse a Dockerfile in multiple environments. Here are some examples where we can use the image built from our previous Dockerfile (sparktest).

In an Azure DevOps pipeline, we can use the following approach to run this image and get a coverage report:

  - task: Docker@2
    displayName: 'Build an image'
    inputs:
      repository: 'sparktest'
      command: 'build'
      Dockerfile: '**/Dockerfile'
      tags: 'latest'

  - script: |
      docker run -e PYTHONPATH=./src -v :/app --name sparktest sparktest 
      CONTAINER_ID=`docker ps -aqf "name=sparktest"`
      docker cp $CONTAINER_ID:/app/coverage.xml test-coverage.xml
    displayName: "Image unittest by Docker"
    workingDirectory: ${{ parameters.appPath }}

In docker-compose.yaml we can specify the Dockerfile with context:

version: "3.9"
services:
  test:
    environment:
      - PYTHONPATH=./src
    image: "sparktest"
    build:
      context: .
      dockerfile: ./Dockerfile
    volumes:
      - ./our_current_project:/app
    command: python3 -m coverage run --source=. -m pytest -v test/ && coverage report && coverage xml && cat coverage.xml

And of course you can build and run that image from your local machine.
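For a local run, the same three steps as in the pipeline (build, run, copy the report) could be scripted. This is only a sketch under assumptions: the function names are hypothetical, the image and container are both called sparktest as in the pipeline example, and the volume mount is left out, so the tests run against the files copied into the image at build time.

```python
import subprocess


def build_command(tag="sparktest", context="."):
    """Argument list for `docker build`."""
    return ["docker", "build", "-t", tag, context]


def run_command(tag="sparktest", name="sparktest", pythonpath="./src"):
    """Argument list for `docker run`, mirroring the pipeline step."""
    return ["docker", "run", "-e", f"PYTHONPATH={pythonpath}", "--name", name, tag]


def copy_report_command(name="sparktest", target="test-coverage.xml"):
    """Argument list for copying coverage.xml out of the stopped container."""
    return ["docker", "cp", f"{name}:/app/coverage.xml", target]


def main():
    # check=True stops at the first failing step, for example when a test fails.
    for command in (build_command(), run_command(), copy_report_command()):
        subprocess.run(command, check=True)
```

Calling main() needs a running Docker daemon and a Dockerfile in the current directory. A container from an earlier run with the same name has to be removed first, because --rm is not used here so that the report can still be copied out.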

In summary, using Docker as a Python interpreter can be a powerful tool for managing your Python applications and development environment. By containerizing your application or development environment, you can ensure that it runs consistently across different machines and servers, simplify dependency management, and avoid the pitfalls of maintaining a snowflake server.

Sources

Originally published at https://mishco.gitlab.io on March 8, 2023.
