DEV Community

Philip Mutua

The best Docker base image for your Python application

When you’re building a Docker image for your Python application, you’re building on top of an existing image—and there are many possible choices. There are OS images like Ubuntu and CentOS, and there are the many different variants of the python base image.

Which one should you use? Which one is better? There are many choices, and it may not be obvious which is the best for your situation.

So to help you make a choice that fits your needs, in this article I’ll go through some of the relevant criteria, and suggest some reasonable defaults that will work for most people.

What do you want from a base image?

There are a number of common criteria for choosing a base image, though your particular situation might emphasize, add, or remove some of these:

  • Stability: You want a build today to give you the same basic set of libraries, directory structure, and infrastructure as a build tomorrow; otherwise your application will randomly break.
  • Security updates: You want the base image to be well-maintained, so that you get security updates for the base operating system in a timely manner.
  • Up-to-date dependencies: Unless you’re building a very simple application, you will likely depend on operating system-installed libraries and applications (e.g. a compiler). You’d like them not to be too old.
  • Extensive dependencies: Some applications require less popular dependencies, and a base image with access to a large number of libraries makes this easier.
  • Up-to-date Python: While this can be worked around by installing Python yourself, having an up-to-date Python available saves you some effort.
  • Small images: All things being equal, it’s better to have a smaller Docker image than a bigger Docker image.

The need for stability suggests not using operating systems with limited support lifetime, like Fedora or non-LTS Ubuntu releases.

Why you shouldn’t use Alpine Linux

A common suggestion for people who want small images is to use Alpine Linux, but for Python applications that can lead to longer build times, larger images, and obscure bugs.

You can see the linked article for details, but I recommend against using Alpine.

Option #1: Ubuntu LTS, CentOS, Debian
There are three major operating systems that roughly meet the above criteria (dates and release versions are accurate at time of writing; the passage of time may require slightly different choices).

  • Ubuntu 18.04 (the ubuntu:18.04 image) was released in April 2018, and since it’s a Long Term Support release it will get security updates until 2023.
  • Ubuntu 20.04 (the ubuntu:20.04 image) will be released in late April 2020, and since it’s a Long Term Support release it will get security updates until 2025.
  • CentOS 8 (centos:8) was released in 2019, and will have full updates until 2024 and maintenance updates until 2029.
  • Debian 10 (“Buster”) was released on July 6th 2019, and will be supported until 2024.

Of these, only Ubuntu 20.04 includes the latest version of Python (3.8, until 3.9 is out), so with the others you’ll have to install Python yourself.

Option #2: The Python Docker image
Another alternative is Docker’s own “official” python image, which is available with any of several Python versions pre-installed (3.5, 3.6, 3.7, 3.8 beta, etc.), and has multiple variants:

  • Alpine Linux, which as I explained above I don’t recommend using.
  • Debian Buster, with many common packages installed. The image itself is large, but the theory is that these packages are installed via common image layers that other official Docker images will use, so overall disk usage will be low.
  • Debian Buster slim variant. This lacks the common packages’ layers, and so the image itself is much smaller, but if you use many other Docker images based off Buster the overall disk usage will be somewhat higher.

The size benefit for Alpine isn’t even particularly compelling: the download size of python:3.8-slim-buster is 60MB, and python:3.8-alpine is 35MB, and their uncompressed on-disk size is 193MB and 109MB respectively.

So what should you use?
As of April 2020, Debian Buster is a good operating system base:

  • It’s more up-to-date than ubuntu:18.04. (ubuntu:20.04 will take the lead in terms of packages being up-to-date, and it’s a Long Term Support release, so it’s a good choice too once it’s released in April 2020. It will limit you to Python 3.8 only, however, without doing a bit more work. Also, as with any new major software release, it’s probably worth waiting a month or three after its initial release for the early bugs to be fixed.)
  • It’s stable, and won’t have significant library changes.
  • There’s less chance of weird production bugs than with Alpine.
  • The official Python Docker images based on Debian Buster also give you the full range of Python releases.

The official Docker Python image in its slim variant (e.g. python:3.8-slim-buster) is a good base image for most use cases. It’s 60MB when downloaded and 180MB when uncompressed to disk, it gives you the latest Python releases, and it has all the benefits of Debian Buster.

Discussion (7)

David (cairocafe)

This line doesn't read right "A common suggestion for people who want small images is to use Alpine Linux, but that can lead to longer build times, smaller images, and obscure bugs." Don't you mean "larger images" instead?

Philip Mutua (pmutua), Author

Hey David, thanks for the feedback. Could you please elaborate on why the line doesn’t read right?

When you’re choosing a base image for your Docker image, Alpine Linux is often recommended. Using Alpine, you’re told, will make your images smaller and speed up your builds. And if you’re using Go that’s reasonable advice.

But if you’re using Python, Alpine Linux will quite often:

  • Make your builds much slower.
  • Make your images bigger.
  • Waste your time.
  • On occasion, introduce obscure runtime bugs.

Let’s see why Alpine is recommended, and why you probably shouldn’t use it for your Python application.

Why people recommend Alpine

Let’s say we need to install gcc as part of our image build, and we want to see how Alpine Linux compares to Ubuntu 18.04 in terms of build time and image size.

First, I’ll pull both images, and check their size:

    $ docker pull --quiet ubuntu:18.04
    docker.io/library/ubuntu:18.04
    $ docker pull --quiet alpine
    docker.io/library/alpine:latest
    $ docker image ls ubuntu:18.04
    REPOSITORY          TAG        IMAGE ID         SIZE
    ubuntu              18.04      ccc6e87d482b     64.2MB
    $ docker image ls alpine
    REPOSITORY          TAG        IMAGE ID         SIZE
    alpine              latest     e7d92cdc71fe     5.59MB

As you can see, the base image for Alpine is much smaller.

Next, we’ll try installing gcc in both of them. First, with Ubuntu:

    FROM ubuntu:18.04
    RUN apt-get update && \
        apt-get install --no-install-recommends -y gcc && \
        apt-get clean && rm -rf /var/lib/apt/lists/*

We can then build and time that:

    $ time docker build -t ubuntu-gcc -f Dockerfile.ubuntu --quiet .
    sha256:b6a3ee33acb83148cd273b0098f4c7eed01a82f47eeb8f5bec775c26d4fe4aae

    real    0m29.251s
    user    0m0.032s
    sys     0m0.026s
    $ docker image ls ubuntu-gcc
    REPOSITORY   TAG      IMAGE ID      CREATED         SIZE
    ubuntu-gcc   latest   b6a3ee33acb8  9 seconds ago   150MB

Now let’s make the equivalent Alpine Dockerfile:

    FROM alpine
    RUN apk add --update gcc

And again, build the image and check its size:

    $ time docker build -t alpine-gcc -f Dockerfile.alpine --quiet .
    sha256:efd626923c1478ccde67db28911ef90799710e5b8125cf4ebb2b2ca200ae1ac3

    real    0m15.461s
    user    0m0.026s
    sys     0m0.024s
    $ docker image ls alpine-gcc
    REPOSITORY   TAG      IMAGE ID       CREATED         SIZE
    alpine-gcc   latest   efd626923c14   7 seconds ago   105MB

As promised, Alpine images build faster and are smaller: 15 seconds instead of 30 seconds, and the image is 105MB instead of 150MB. That’s pretty good!

But when we switch to packaging a Python application, things start going wrong.

Let’s build a Python image

We want to package a Python application that uses pandas and matplotlib. So one option is to use the Debian-based official Python image (which I pulled in advance), with the following Dockerfile:

    FROM python:3.8-slim
    RUN pip install --no-cache-dir matplotlib pandas

And when we build it:

    $ time docker build -f Dockerfile.slim -t python-matpan .
    Sending build context to Docker daemon  3.072kB
    Step 1/2 : FROM python:3.8-slim
     ---> 036ea1506a85
    Step 2/2 : RUN pip install --no-cache-dir matplotlib pandas
     ---> Running in 13739b2a0917
    Collecting matplotlib
      Downloading matplotlib-3.1.2-cp38-cp38-manylinux1_x86_64.whl (13.1 MB)
    Collecting pandas
      Downloading pandas-0.25.3-cp38-cp38-manylinux1_x86_64.whl (10.4 MB)
    ...
    Successfully built b98b5dc06690
    Successfully tagged python-matpan:latest

    real    0m30.297s
    user    0m0.043s
    sys     0m0.020s

The resulting image is 363MB.

Can we do better with Alpine? Let’s try:

    FROM python:3.8-alpine
    RUN pip install --no-cache-dir matplotlib pandas

And now we build it:

    $ docker build -t python-matpan-alpine -f Dockerfile.alpine .
    Sending build context to Docker daemon  3.072kB
    Step 1/2 : FROM python:3.8-alpine
     ---> a0ee0c90a0db
    Step 2/2 : RUN pip install --no-cache-dir matplotlib pandas
     ---> Running in 6740adad3729
    Collecting matplotlib
      Downloading matplotlib-3.1.2.tar.gz (40.9 MB)
        ERROR: Command errored out with exit status 1:
         command: /usr/local/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-a3olrixa/matplotlib/setup.py'"'"'; __file__='"'"'/tmp/pip-install-a3olrixa/matplotlib/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-a3olrixa/matplotlib/pip-egg-info

    ...
    ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
    The command '/bin/sh -c pip install matplotlib pandas' returned a non-zero code: 1

What’s going on?

Standard PyPI wheels don’t work on Alpine

If you look at the Debian-based build above, you’ll see it’s downloading matplotlib-3.1.2-cp38-cp38-manylinux1_x86_64.whl. This is a pre-compiled binary wheel. Alpine, in contrast, downloads the source code (matplotlib-3.1.2.tar.gz), because standard Linux wheels don’t work on Alpine Linux.
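The difference between the two downloads is visible in the filenames alone. As a minimal sketch (the helper name `parse_wheel_filename` is hypothetical, and it assumes the standard wheel naming scheme of distribution, version, optional build tag, Python tag, ABI tag, and platform tag), the names from the two build logs can be told apart like this:

```python
from typing import NamedTuple, Optional


class WheelTags(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> Optional[WheelTags]:
    """Split a wheel filename into its tags; return None for non-wheels."""
    if not filename.endswith(".whl"):
        return None  # e.g. a source archive such as matplotlib-3.1.2.tar.gz
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 6:
        # An optional build tag sits between version and Python tag; drop it.
        del parts[2]
    if len(parts) != 5:
        return None
    return WheelTags(*parts)


for name in (
    "matplotlib-3.1.2-cp38-cp38-manylinux1_x86_64.whl",
    "matplotlib-3.1.2.tar.gz",
):
    tags = parse_wheel_filename(name)
    kind = f"wheel for {tags.platform_tag}" if tags else "not a wheel (source build needed)"
    print(f"{name}: {kind}")
```

The platform tag (manylinux1_x86_64 here) is the part that marks the wheel as a pre-built Linux binary.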

Why? Most Linux distributions use the GNU version (glibc) of the standard C library that is required by pretty much every C program, including Python. But Alpine Linux uses musl instead. Those binary wheels are compiled against glibc, so they can’t be relied on to work with musl, and Linux wheel support is therefore disabled on Alpine.
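If you are unsure which C library a given image uses, one rough check is to ask the interpreter itself. This is only a sketch: `describe_libc` is a hypothetical helper, and it assumes that the standard library's `platform.libc_ver()` reports glibc on Debian-style images and does not report it on a musl-based image such as Alpine.

```python
import platform


def describe_libc() -> str:
    """Report which C library the running interpreter appears to use."""
    # platform.libc_ver() inspects the interpreter binary. It returns
    # ('glibc', '<version>') when it finds glibc, and typically something
    # else (often empty strings) when it does not, as on musl-based Alpine.
    lib, version = platform.libc_ver()
    if lib == "glibc":
        return f"glibc {version} detected: glibc-built wheels should be usable"
    return "glibc not detected: pip may fall back to building from source"


print(describe_libc())
```

Running this inside each candidate base image (for example with `docker run`) gives a quick hint about whether the pre-compiled wheels described above will apply.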

Most Python packages these days include binary wheels on PyPI, significantly speeding install time. But if you’re using Alpine Linux you need to compile all the C code in every Python package that you use.

Which also means you need to figure out every single system library dependency yourself. In this case, to figure out the dependencies I did some research, and ended up with the following updated Dockerfile:

    FROM python:3.8-alpine
    RUN apk --update add gcc build-base freetype-dev libpng-dev openblas-dev
    RUN pip install --no-cache-dir matplotlib pandas

And then we build it, and it takes…

… 25 minutes, 57 seconds! And the resulting image is 851MB.

Here’s a comparison between the two base images:

    Base image          Time to build   Image size   Research required
    python:3.8-slim     30 seconds      363MB        No
    python:3.8-alpine   1557 seconds    851MB        Yes
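To put those numbers in proportion, here is a short calculation using only the figures reported above (the variable names are just for illustration):

```python
from datetime import timedelta

# Figures taken from the two builds described above.
slim = {"build_seconds": 30, "size_mb": 363}
alpine_build = timedelta(minutes=25, seconds=57)
alpine = {"build_seconds": int(alpine_build.total_seconds()), "size_mb": 851}

time_ratio = alpine["build_seconds"] / slim["build_seconds"]
size_ratio = alpine["size_mb"] / slim["size_mb"]

print(alpine["build_seconds"])                   # 1557
print(f"build time: {time_ratio:.1f}x slower")   # build time: 51.9x slower
print(f"image size: {size_ratio:.2f}x larger")   # image size: 2.34x larger
```

In other words, the Alpine build took roughly 50 times as long and produced an image more than twice the size.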

Alpine builds are vastly slower, the image is bigger, and I had to do a bunch of research.

Can’t you work around these issues?

Build time

For faster build times, Alpine Edge, which will eventually become the next stable release, does have matplotlib and pandas. And installing system packages is quite fast. As of January 2020, however, the current stable release does not include these popular packages.

Even when they are available, however, system packages almost always lag what’s on PyPI, and it’s unlikely that Alpine will ever package everything that’s on PyPI. In practice most Python teams I know don’t use system packages for Python dependencies, they rely on PyPI or Conda Forge.

Image size

Some readers pointed out that you can remove the originally installed packages, or add an option not to cache package downloads, or use a multi-stage build. One reader attempt resulted in a 470MB image.

So yes, you can get an image that’s in the ballpark of the slim-based image, but the whole motivation for Alpine Linux is smaller images and faster builds. With enough work you may be able to get a smaller image, but you’re still suffering from a 1500-second build time when you could get a 30-second build time using the python:3.8-slim image.

But wait, there’s more!

Alpine Linux can cause unexpected runtime bugs

While in theory the musl C library used by Alpine is mostly compatible with the glibc used by other Linux distributions, in practice the differences can cause problems. And when problems do occur, they are going to be strange and unexpected.

Some examples:

  1. Alpine has a smaller default stack size for threads, which can lead to Python crashes.
  2. One Alpine user discovered that their Python application was much slower because of the way musl allocates memory vs. glibc.
  3. Another user discovered issues with time formatting and parsing.

Most or perhaps all of these problems have already been fixed, but no doubt there are more problems to discover. Random breakage of this sort is just one more thing to worry about.
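To make the first example concrete: Python lets you request an explicit stack size for new threads through `threading.stack_size()`. The sketch below is not a fix taken from the reports above; it only illustrates the mechanism, and the 16 MiB value and the helper names are arbitrary assumptions chosen for the demonstration.

```python
import sys
import threading

# Assumption: 16 MiB is an arbitrary, generous value chosen for illustration.
STACK_BYTES = 16 * 1024 * 1024


def depth(n: int) -> int:
    """Recurse n levels deep and return the depth reached."""
    return 0 if n == 0 else 1 + depth(n - 1)


def run_with_big_stack(levels: int) -> int:
    """Run depth(levels) in a thread created with an explicit stack size."""
    result = []
    old_limit = sys.getrecursionlimit()
    # Setting the size affects threads created afterwards; the call
    # returns the previously configured size so it can be restored.
    old_stack = threading.stack_size(STACK_BYTES)
    sys.setrecursionlimit(max(old_limit, levels + 100))
    try:
        worker = threading.Thread(target=lambda: result.append(depth(levels)))
        worker.start()
        worker.join()
    finally:
        sys.setrecursionlimit(old_limit)
        threading.stack_size(old_stack)
    return result[0]


print(run_with_big_stack(5000))  # 5000
```

Code that recurses deeply inside threads depends on the platform's default thread stack size unless it sets one explicitly like this, which is why the same program can behave differently on musl and glibc.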

Don’t use Alpine Linux for Python images

Unless you want massively slower build times, larger images, more work, and the potential for obscure bugs, you’ll want to avoid Alpine Linux as a base image. For some recommendations on what you should use, see an article on choosing a good base image.

David (cairocafe)

ahh. yes, in the same sentence you say for "smaller" images alpine is recommended but then you say later on that one of the problems with alpine is it gives smaller images. when i looked at the article you referenced they say it sometimes gives "larger" images.

Philip Mutua (pmutua), Author

@cairocafe just curious, which python docker images do you normally use and have you ever had builds taking too long and how did you resolve this?

Lucian BLETAN (lucian)

Your answer is biased. You forget to mention that you can build an image as base and use the bins after for your needed docker image.

Philip Mutua (pmutua), Author

@lucian Apologies if the answer seemed biased. I would like to hear any suggestions that support your view.

Lucian BLETAN (lucian)

Well, debian doesn't install code from source.
In your example I can see WHL

    Collecting matplotlib
      Downloading matplotlib-3.1.2-cp38-cp38-manylinux1_x86_64.whl (13.1 MB)
    Collecting pandas
      Downloading pandas-0.25.3-cp38-cp38-manylinux1_x86_64.whl (10.4 MB)

WHL format was developed as a quicker and more reliable method of installing Python software than re-building from source code every time. WHL files only have to be moved to the correct location on the target system to be installed.

If I look at your Alpine Docker build, I clearly see a .tar.gz, which means pip installs everything from source.
I don't have much time to test and give you clear feedback, but you can test it yourself by building from source on Debian and seeing how much time it takes.

Wheel packages pandas and numpy for example are not supported in images based on Alpine platform.

You can also wget a .whl in your Alpine image and install it with the command:

    pip install pandas-**.whl