In my last post, we looked at how to use containers for machine learning from scratch, and covered the complexities of configuring a Python environment suitable for training a model with the powerful (and understandably popular) combination of Jupyter, Scikit-Learn and XGBoost. We then used containers to make that environment easily reproducible and portable, and looked at how to build and run it at scale on Docker Swarm and Kubernetes.
That article was intended to introduce containers to data scientists, and to show those already familiar with containers how machine learning can fit into their world. If this sounds useful to you, you should definitely check it out first and then come back right here 👇
In the opening section, I joked that the title of the article (...from scratch to Kubernetes...) was not a reference to the `FROM scratch` instruction you might find in Dockerfiles that forgo a base image such as the `centos:7` image we used to build our Jupyter environment.
Well, in this follow-on article, we're going to explore why you would actually build a machine learning container using `scratch`, and a method for doing so that can avoid re-engineering an entire data science workflow from Python into another language.
What is `scratch`? Don't I need an operating system?
In Docker, the `scratch` image is actually a reserved keyword that literally means "nothing". Normally, you would specify in your Dockerfile a base image to build upon. This might be an official base image (such as `centos:7`) representing an operating system that includes a package manager and a bunch of tools that will be helpful for building your application into a container. It might also be another container you've built previously, to which you want to add new layers of functionality such as extra packages or scripts for specific tasks.

When you build a container on the `scratch` base, it starts with a total size of 0kB, and only grows as you `ADD` or `COPY` files into it and manipulate them throughout the build process.
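To make this concrete, compiled languages are a natural fit for `scratch`, because a statically linked binary carries everything it needs to run. Here's a minimal, hypothetical sketch (not part of this article's workflow) of the kind of program you could drop into an otherwise empty image:

```go
package main

import "fmt"

func main() {
	// Built with CGO_ENABLED=0, this binary is statically linked, so it has
	// no dependency on an operating system userland and can run directly
	// inside an otherwise empty scratch container.
	fmt.Println("hello from scratch")
}
```

Compile it with something like `CGO_ENABLED=0 go build -o hello .`, copy the binary into a `FROM scratch` image, and you have a complete container image of only a few megabytes. We'll use exactly this pattern for our prediction microservice later in the article.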
Why is this good?
Creating containers that are as small as possible is a challenging practice which has many benefits:
- Smaller images build quicker, transmit faster over the network (no more long waits for `docker push` and `docker pull`), take up less space on disk and require less memory
- Smaller images have a reduced attack surface (would-be attackers have fewer options for exploiting or compromising your application)
- Smaller images have fewer components to upgrade, patch and secure (which means less work is required to maintain them over time!)
Of course there are tradeoffs.
Creating containers to be as small as possible often means sacrificing tooling that helps with debugging, so you'll need to consider your debugging approach before you reach production. It also limits reusability, which means you might end up with many more containers, each with highly specialised functionality.
It turns out that there are many ways to reduce the size of a container before resorting to `scratch`. We won't go into these in any more detail in this article, but the techniques include:
- switching to a different base image like `alpine`, a Linux distribution commonly used with containers due to its small size (run `docker pull centos:7`, `docker pull alpine`, and then `docker images` to find that `alpine` is a conservative 5.58MB compared to the 202MB of `centos:7`)
- minimising packages and other dependencies to only install what you need for running your application (in the Python world, this means checking every line of your `requirements.txt` file)
- clearing caches and other build artefacts that are not required after install
We could also decide to implement our own machine learning algorithm entirely in a language that can execute with minimal dependencies, but that would make it much harder to build, maintain and collaborate on.
What about existing data science workflows?
Our aim is to create a workflow that allows us to keep using our favourite Python tools to train our model, so let's build a Docker image to do just that.
Create a suitable directory and add the following to a new file called `Dockerfile`:
FROM centos:7 AS jupyter
RUN yum install -y epel-release && \
yum install -y python36-devel python36-pip libgomp
RUN pip3 install jupyterlab scikit-learn xgboost
RUN adduser jupyter
USER jupyter
WORKDIR /home/jupyter
EXPOSE 8888
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888"]
You can build the container with the following command:

docker build -t devto-jupyter --target jupyter .

The `--target` option allows us to build up to a specific `FROM` stage in a multi-stage Dockerfile (more on this in a bit).

Run the container and bring up your Jupyter instance by browsing to the `localhost` address output in the console:
docker run -it --rm -p "8888:8888" -v "$(pwd):/home/jupyter" devto-jupyter
Create a new Jupyter notebook called `iris_classifier.ipynb` and add to it the following three cells:
# Cell 1: load the Iris flower dataset
from sklearn import datasets
X, y = datasets.load_iris(return_X_y=True)

# Cell 2: train an XGBoost multi-class classifier
import xgboost as xgb
train = xgb.DMatrix(X, label=y)
params = {
    'objective': 'multi:softmax',
    'num_class': 3,
}
model = xgb.train(params, train, num_boost_round=5)

# Cell 3: save the trained model to a file
model.save_model('iris.model')
In order, these three cells load the dataset on which we'll base our example (the Iris flower dataset), train an XGBoost classifier, and finally dump the trained model to a file called `iris.model`.

After running each cell, the directory where you executed `docker run ...` above should now contain your notebook file and the trained model file.
Introducing multi-stage builds
As we were building our Dockerfile above, we specifically targeted the first `FROM` section, called `jupyter`, by using the `--target` option in our `docker build` command.

It turns out that we can have multiple `FROM` sections in a single Dockerfile, and combine them by copying build artefacts from earlier steps in the process to later steps.

It's quite common when using containers to build microservices in other languages, such as Go, to follow a multi-stage build where the final step copies only the compiled binaries, plus any dependencies required for execution, into an otherwise empty `scratch` container.
Since the build tools for this type of workflow are quite mature in Go, we're going to find a way to apply this pattern to our Python data science process. The complication is that Python is an interpreted language, which makes it difficult to create small application distributions: they would need to bundle the Python interpreter and the full contents of any package dependencies.
The next step in our Dockerfile simply looks for the notebook we created above and executes it in place to output the trained model. Go ahead and add this to the bottom of `Dockerfile`:
FROM jupyter AS trainer
COPY --chown=jupyter:jupyter ./iris_classifier.ipynb .
RUN jupyter nbconvert --to notebook --inplace --execute iris_classifier.ipynb
Predictions with an XGBoost model in Go
It turns out there is an existing pure Go implementation of the XGBoost prediction function in a package called Leaves, and the documentation includes some helpful examples of how to get started.
For this article, we're just looking to load up our trained model from the previous step and run a single prediction. We'll take the features as command line arguments so we can run the container with a simple `docker run` command.

Create a file in the same directory as your `Dockerfile` and call it `iris_classifier_predict.go`, with the following contents:
package main

import (
	"fmt"
	"os"
	"strconv"

	"github.com/dmitryikh/leaves"
)

// Based on: https://godoc.org/github.com/dmitryikh/leaves
func main() {
	// load the trained XGBoost model
	model, err := leaves.XGEnsembleFromFile("/go/bin/iris.model", false)
	if err != nil {
		panic(err)
	}

	// preallocate a slice to store the model prediction
	prediction := make([]float64, model.NOutputGroups())

	// parse the command line arguments as float features
	var inputs []float64
	for _, arg := range os.Args[1:] {
		if n, err := strconv.ParseFloat(arg, 64); err == nil {
			inputs = append(inputs, n)
		}
	}

	// make the prediction and print it
	model.Predict(inputs, 0, prediction)
	fmt.Printf("%v\n", prediction)
}
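One thing the program above glosses over is input validation: arguments that fail to parse are silently skipped, and the Iris model expects exactly four features. If you wanted to harden the service, a small hypothetical sketch of stricter argument handling (not part of the article's code) might look like this:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

func main() {
	// The Iris dataset has exactly four features per sample:
	// sepal length, sepal width, petal length and petal width.
	const numFeatures = 4

	args := os.Args[1:]
	if len(args) != numFeatures {
		fmt.Fprintf(os.Stderr, "expected %d features, got %d\n", numFeatures, len(args))
		os.Exit(1)
	}

	inputs := make([]float64, 0, numFeatures)
	for _, arg := range args {
		n, err := strconv.ParseFloat(arg, 64)
		if err != nil {
			fmt.Fprintf(os.Stderr, "could not parse %q as a number: %v\n", arg, err)
			os.Exit(1)
		}
		inputs = append(inputs, n)
	}

	// inputs would then be passed to model.Predict as in the program above.
	fmt.Println(inputs)
}
```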
Now we need to create a third step in our multi-stage build to compile our microservice so it's ready for prediction. Add this to the bottom of `Dockerfile`:
FROM golang:alpine AS builder
RUN apk update && apk add --no-cache git upx
WORKDIR $GOPATH/src/xgbscratch/iris/
COPY ./iris_classifier_predict.go .
RUN go get -d -v
RUN GOOS=linux GOARCH=amd64 go build -ldflags="-w -s" -o /go/bin/iris
# https://blog.filippo.io/shrink-your-go-binaries-with-this-one-weird-trick/
RUN upx --brute /go/bin/iris
These steps start with a ready-made Go build environment, install Git (to grab Leaves from GitHub) and upx (the Ultimate Packer for eXecutables), copy in our microservice source from above, build it with a series of options that basically translate to "everything needed to run standalone", and then compress the resulting binary.
(For the purposes of this article, upx compression helps us achieve a roughly 60% reduction in our final image footprint. In a future post we'll look at performance benchmarks of these various techniques and the tradeoffs with size, especially around the compression step.)
Building our tiny final container and generating predictions
The last step of our Dockerfile needs to take the trained model file `iris.model` from the second stage and the compiled Go binary from the third stage, and run the binary. You can add this to the bottom of `Dockerfile`:
FROM scratch
COPY --from=builder /go/bin/iris /go/bin/iris
COPY --from=trainer /home/jupyter/iris.model /go/bin/
ENTRYPOINT ["/go/bin/iris"]
Build the final container with the following command:
docker build -t devto-iris .
Run `docker images` and you'll find the final image is a tiny ~486kB!

Compared to our original training image based on `centos:7`, which weighed in at a hefty 1.24GB, that's a size reduction of 99.96% (1.24GB ≈ 1,240,000kB, so the final image is more than 2,500 times smaller).
How about actually making some predictions?
Since our Go binary accepts feature inputs as command line arguments, we can generate individual predictions using `docker run` with the following command:

docker run -it --rm devto-iris 1 2 3 4

The `1 2 3 4` can be replaced with the feature inputs to our model, from which predictions are generated. With this example, the output should be similar to `[-0.43101535737514496 0.39559850541076447 0.933891354361549]`, which are the relative positive probabilities of each label.
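The outputs here are raw scores rather than normalised probabilities. If you'd prefer class probabilities that sum to one, one option (a sketch, not part of the article's code) is to apply a softmax over the outputs yourself, using the scores from the example output above:

```go
package main

import (
	"fmt"
	"math"
)

// softmax turns a slice of raw scores into probabilities that sum to 1.
func softmax(scores []float64) []float64 {
	// subtract the maximum score for numerical stability
	max := scores[0]
	for _, s := range scores {
		if s > max {
			max = s
		}
	}

	probs := make([]float64, len(scores))
	var sum float64
	for i, s := range scores {
		probs[i] = math.Exp(s - max)
		sum += probs[i]
	}
	for i := range probs {
		probs[i] /= sum
	}
	return probs
}

func main() {
	// The example scores from the prediction output above.
	scores := []float64{-0.43101535737514496, 0.39559850541076447, 0.933891354361549}
	fmt.Println(softmax(scores)) // roughly [0.14 0.32 0.54]
}
```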
What does this mean?
In addition to the tiny container benefits we discussed around data volume, application security and maintenance, tiny containers bring two great benefits to the world of machine learning:

- being able to easily deploy a model into heavily resource-constrained places, such as embedded devices with small amounts of storage. Who knows, you could soon be running XGBoost predictions on your light switch, your sunglasses or your toaster! I'm looking forward to checking out k3OS, a low-resource operating system built around Kubernetes, to do exactly that.
- with a much smaller footprint, a model can achieve much greater predictive throughput ("predictions per second", or pps) and benefit high-permutation, prediction-hungry applications of machine learning such as recommendation engines, simulations, scenario testing and pairwise comparisons.