
Hyunho Richard Lee for Meadowrun

Originally published at betterprogramming.pub

Kubernetes Was Never Designed for Batch Jobs

In this post we’ll make the case that Kubernetes is philosophically biased towards microservices over batch jobs. This results in an impedance mismatch that makes it harder than it “should” be to use Kubernetes for batch jobs.

Batch world vs microservice world

One thing that took me a while to figure out was why so many wildly popular technologies felt so weird to me. It clicked when I realized that I was living in “batch world”, and these technologies were coming out of “microservice world”.

I’ve mostly worked at hedge funds, and the ones I’ve worked at are part of batch world. This means that code is usually run when triggered by an external event, like a vendor sending the closing prices for US equities for that day or a clearinghouse sending the list of trades that it recorded for the day. A job scheduler like Airflow triggers a job to run depending on which event happened. That job usually triggers other jobs, and the ultimate output of these jobs is something like an API call to an external service that executes trades or a PnL (profit and loss) report that gets emailed to humans.

In contrast, in microservice world (and take this with a grain of salt, as I’ve only ever touristed there), instead of jobs that run to completion, everything is a long-running service. So instead of a job that ingests closing US equity prices and then kicks off a job that calculates the PnL report, there might be a PnL report service that polls the prices service until the prices it needs are available.

But what’s the difference, really?

The difference between batch jobs and services seems easy to define — batch jobs are triggered by some “external” event, do some computation, and then exit on their own. Services, on the other hand, run forever in a loop that accepts requests, does a small bit of computation for each one, and then responds.

But we could describe batch jobs in a way that makes them sound like services by saying that each invocation of a batch job is a “request”. In that view, the overall pattern of running batch jobs looks like a service, with the job scheduler playing the role of the load balancer and each batch job invocation playing the role of handling a request. This pattern of handling each request in its own process is similar to the fairly common “forking server” pattern.

And vice versa, consider the thought experiment where we’ve broken up our code into thousands of different microservices but we don’t have enough hardware to run all of these microservices at the same time. So we configure a sophisticated, responsive autoscaler that only starts each microservice for the duration of a single request/reply when a request comes in. At that point, our autoscaler is starting to look more like a job scheduler, and our microservices kind of look like batch jobs!

Let’s clarify our original intuition — our description of services “running forever” is imprecise. It’s true that most services don’t exit “successfully” on their own, but they might exit due to a catastrophic exception, in response to an autoscaling decision, or as part of a deployment process. So what are we actually getting at when we say that services “run forever”? The most important aspect is that we think of services as “stateless”, which can also be stated as “services are easily restartable”. Because each request/reply cycle is usually on the order of milliseconds to seconds and there’s usually minimal state in the server process outside of the request handler, restarting a service shouldn’t lose us more than a few seconds of computation.

In contrast, batch jobs are generally “not restartable”, which is to say that if we have a batch job that takes several hours to run and we restart it after one hour, we will need to redo that first hour of work, which is usually unacceptable. If we somehow wrote a batch job that created checkpoints of its state every 5 seconds, thereby making it more or less “stateless”, then that batch job would be just as “easily restartable” as a service.

So the most defensible distinction between service and batch jobs is that services do a small (milliseconds to seconds) amount of work at a time which makes them easy to restart, while batch jobs do a large (minutes to hours) amount of work at a time which makes them harder to restart. Stated this way, the difference between services and batch jobs is fuzzier than it first appears.

But just because we can come up with hard-to-categorize examples that challenge this binary, it doesn’t mean the difference is meaningless. In practice, services generally run as long-lived processes that respond to requests, do a small bit of work for each request and then respond. Meanwhile batch jobs are usually triggered ad-hoc by a data scientist or from a job scheduler and do a large amount of work at a time which makes them harder to restart.

Another way to see the difference between services and batch jobs is how they scale. Scaling a service usually means dealing with a large number of concurrent requests by having an autoscaler run replicas of the service with a load balancer sending different requests to different replicas. Scaling a batch job usually means dealing with a large amount of data by running the same code over different chunks of data in parallel. We’ll go into these aspects below in more depth as we walk through how well (or poorly) Kubernetes supports these scenarios.

How Kubernetes sees the world

“Technology is neither good nor bad; nor is it neutral.” – Melvin Kranzberg

We’ll skip the full history and functionality of Kubernetes as there’s plenty of existing coverage (this post from Coinbase gives a good overview). Our focus here is understanding Kubernetes’ philosophy — how does it see the world and what kinds of patterns does it create for users.


It’s all about philosophy. The Thinker in The Gates of Hell at the Musée Rodin via Wikimedia Commons

Kubernetes is for services

We’ll start with the “what Kubernetes can do” section on the “Overview” page in the Kubernetes documentation. Most of the key features listed here are focused on services rather than batch jobs:

  • “Service discovery and load balancing” resolves domain names to one or more replicas of a service (see the sketch after this list). This isn’t relevant for batch jobs, which usually don’t have a concept of request/response, so there’s no need to resolve domain names to containers or round-robin requests between different instances of a service.
  • “Automated rollouts and rollbacks” makes deployment of services easier by turning off a few instances, restarting them with the new deployment, and then repeating until all of the instances have been updated. This idea doesn’t apply to batch jobs because batch jobs are “hard to restart” and naturally exit on their own, so the right thing to do is to wait until the batch job finishes and then start subsequent jobs with the new deployment rather than losing work to a restart. And we certainly wouldn’t want a rolling deployment to result in a distributed job where different tasks run on different versions of our code!
  • “Self-healing”: Restarting jobs that fail is useful, but batch jobs don’t have a concept of a “health check”, and not advertising services to clients until they’re ready isn’t relevant to batch jobs.
  • “Automatic bin packing” is only partially relevant for batch jobs — we definitely want to be smart about the initial allocation of where we run a job, but again, batch jobs can’t be restarted willy-nilly, so they can’t be “moved” to a different node.
  • “Secret and configuration management” and “Storage orchestration” are equally relevant for services and batch jobs.
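To make the service-centric flavor of these features concrete, here’s a minimal sketch of the kind of object the first bullet point describes: a Service that gives a stable DNS name to whichever pods match its selector and spreads requests across them. The names and ports here are hypothetical.

```yaml
# Minimal Service sketch (hypothetical names/ports): "prices-service" becomes
# a stable DNS name inside the cluster, and requests to it are load-balanced
# across whichever pods carry the matching label.
apiVersion: v1
kind: Service
metadata:
  name: prices-service
spec:
  selector:
    app: prices-service
  ports:
    - port: 80          # port clients connect to
      targetPort: 8080  # port the service's pods listen on
```

A batch job has no equivalent need: nothing resolves a name to it and nothing round-robins requests to it.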

One theme throughout these features is that Kubernetes assumes that the code it’s running is relatively easy to restart. In other words, it assumes it’s running services.

Kubernetes doesn’t believe in orchestration

That same Overview page declares:

Kubernetes is not a mere orchestration system. In fact, it eliminates the need for orchestration. The technical definition of orchestration is execution of a defined workflow: first do A, then B, then C. In contrast, Kubernetes comprises a set of independent, composable control processes that continuously drive the current state towards the provided desired state. It shouldn’t matter how you get from A to C. Centralized control is also not required. This results in a system that is easier to use and more powerful, robust, resilient, and extensible.

This paragraph presumably refers to the idea that in Kubernetes you define your configuration declaratively (e.g. make sure there are 3 instances of this service running at all times) rather than imperatively (e.g. check how many instances are running; if there are more than 3, kill instances until there are 3 remaining; if there are fewer than 3, start instances until we have 3).
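As a concrete sketch of the declarative style (with hypothetical names), a Deployment just states the desired end state, and the controllers do whatever starting and killing of pods is needed to get there and stay there:

```yaml
# Declarative sketch: "three replicas of this service should exist".
# Kubernetes' controllers continuously reconcile the live state towards this,
# rather than us imperatively counting and adjusting pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pnl-report-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: pnl-report-service
  template:
    metadata:
      labels:
        app: pnl-report-service
    spec:
      containers:
        - name: pnl-report-service
          image: example/pnl-report-service:1.0   # hypothetical image
```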

Nevertheless, job schedulers like Airflow are orchestration frameworks in the exact way this paragraph describes. And of course we can just run Airflow on top of Kubernetes to work around this bit of philosophy, but Kubernetes intentionally makes it hard to implement this kind of orchestration natively.

Kubernetes has a “job” concept for running batch jobs, but the aversion to the idea of “first do A, then B” means that the job API will probably never be able to express this core concept. The only hope for expressing job dependencies in Kubernetes would be a declarative model — “in order to run B, A must have run successfully”. But this feature doesn’t exist either, and while a fully declarative model does have its advantages, it’s not currently the dominant paradigm for expressing job dependencies.
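For reference, here’s a minimal (hypothetical) Job manifest. Everything in it is about how to run this one piece of work; there is no field along the lines of “run this after the ingest-prices job succeeds”, which is the core thing a job scheduler gives us:

```yaml
# A basic Job sketch (hypothetical names). Note the absence of any way to
# express "only run this after another job has succeeded".
apiVersion: batch/v1
kind: Job
metadata:
  name: pnl-report
spec:
  backoffLimit: 0              # don't automatically re-run a failed job
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: pnl-report
          image: example/pnl-report:1.0       # hypothetical image
          command: ["python", "-m", "pnl_report"]
```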

Moreover, jobs are clearly secondary to services in Kubernetes. Jobs don’t appear in that Overview page at all, and are mostly ignored throughout the documentation aside from the sections that are explicitly about jobs. One conspicuous example is this page, which states “Pods do not disappear until someone (a person or a controller) destroys them, or there is an unavoidable hardware or system software error.” This inexplicably ignores the case where pods for jobs complete naturally.

More missing features

Not only are Kubernetes jobs themselves not fully featured, they exist in a larger system that is designed with a philosophy that makes jobs less useful than they could be. For the rest of this article, we’ll examine some of these details of Kubernetes and why they make life harder for batch jobs.

Pod preemption

Both jobs and services are implemented as “pods” which are neutral in theory but in practice biased towards services. One example is that pods can be preempted to make room for other pods, which assumes that pods are easily restarted. The documentation acknowledges this isn’t appropriate for batch jobs but the recommendation it makes is a bit backwards — it suggests that batch jobs should set preemptionPolicy: Never. This means those pods will never preempt other pods, which only works if all of the pods on the cluster do the same thing. Ideally there would be a way to guarantee that the pod itself would never be preempted even in a cluster that runs both batch jobs and services. There are workarounds like reserving higher priorities for batch jobs or using pod disruption budgets, but these aren’t mentioned on that page. This is exactly what we mean by an impedance mismatch — we can ultimately accomplish what we need to, but it takes more effort than it “should”.
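To illustrate, this is roughly what the documentation’s suggestion looks like (the name and priority value are hypothetical). It controls whether these pods evict others, not whether they themselves get evicted:

```yaml
# Non-preempting PriorityClass sketch: jobs using it won't evict other pods,
# but nothing here protects the job's own pods from being preempted by
# higher-priority service pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-non-preempting   # hypothetical name
value: 1000000
preemptionPolicy: Never
globalDefault: false
description: "For batch jobs that should not preempt other pods."
```

A job would opt in by setting priorityClassName: batch-non-preempting in its pod template.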

Composability

Kubernetes only works with containers, and containers themselves are also biased towards services in how they compose. If we have two containers that need to talk to each other, exposing services in one or both of the containers is the only game in town.

For example, let’s say we want to run some Python code and that code needs to call ImageMagick to do some image processing. We want to run the Python code in a container based on the Python Docker image and we want to run ImageMagick in a separate container based on this ImageMagick Docker image. Let’s think about our options for calling the ImageMagick container from the Python container, i.e. for composing these two containers.

  • We could use the Python image as a base image and copy parts of the ImageMagick Dockerfile into a new Dockerfile that builds a custom combined image. This is probably the most practical solution, but it has all the usual drawbacks of copy/paste — any improvements to the ImageMagick Dockerfile won’t make it into ours without manual updates.
  • We could invoke the ImageMagick container as if it were a command line application. Kubernetes supports sharing files between containers in the same pod (see the sketch after this list), so at least we can send our inputs/outputs back and forth, but there isn’t a great way to invoke a command and get notified when it’s done. Anything is possible, of course (e.g. starting a new job and polling the pods API to see when it completes), but Kubernetes’ philosophical aversion to orchestration is not helping here.
  • We could modify the ImageMagick image to expose a service. This seems silly, but is effectively what happens most of the time — instead of building command-line tools like ImageMagick, people end up building services to sidestep this problem.
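Here’s a rough sketch of that second option, with hypothetical names and images: the Python container and an ImageMagick container share files through an emptyDir volume in one pod. Kubernetes gives us the shared filesystem, but nothing here lets the Python container invoke ImageMagick on demand and find out when it has finished:

```yaml
# Two containers in one pod sharing a scratch volume (hypothetical example).
apiVersion: v1
kind: Pod
metadata:
  name: image-processing-job
spec:
  restartPolicy: Never
  volumes:
    - name: workdir
      emptyDir: {}
  containers:
    - name: main
      image: python:3.10
      command: ["python", "/app/process_images.py"]   # hypothetical script
      volumeMounts:
        - name: workdir
          mountPath: /work
    - name: imagemagick
      image: example/imagemagick:latest               # hypothetical image
      # Runs once against a fixed input; there's no built-in way for "main"
      # to trigger it again or be notified when it's done.
      command: ["convert", "/work/in.png", "-resize", "50%", "/work/out.png"]
      volumeMounts:
        - name: workdir
          mountPath: /work
```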

In fact in Docker, “Compose” means combining services:

Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a YAML file to configure your application’s services. Then, with a single command, you create and start all the services from your configuration.

Even if we acquiesce to the idea of composing via services, Kubernetes doesn’t give us great options for running a service in the context where a batch job is the consumer:

  • The simplest option is to run one or more copies of the ImageMagick service constantly, but that’s a waste of resources when we’re not running anything that needs the service, and it will be overwhelmed when we launch a distributed job running thousands of concurrent tasks.
  • So we might reach for the HorizontalPodAutoscaler to spin up instances when we need them and turn them off when we don’t (see the sketch after this list). This works, but to get the responsiveness of calling something on the command line, we’ll need to make the autoscaler’s “sync-period” much shorter than the default 15 seconds.
  • Another option is “sidecar containers” where we run the Python image and the ImageMagick service image in the same pod. This mostly works, but there’s no way to automatically kill the service’s sidecar container when the “main” batch job container is done. The proposed feature to allow this was ultimately rejected because it “is not an incremental step in the right direction”.
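For reference, the autoscaler option above would look something like this (names and thresholds are hypothetical). Note that a standard HorizontalPodAutoscaler won’t scale a Deployment to zero, so at least one replica of the ImageMagick service sits idle between batch runs:

```yaml
# HorizontalPodAutoscaler sketch: scale the ImageMagick service Deployment
# based on CPU usage (hypothetical target and limits).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: imagemagick-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: imagemagick-service
  minReplicas: 1      # can't go to zero with a standard HPA
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```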

Exploring these options for how batch jobs can call other containers shows that we can make it work, but Kubernetes makes it harder than it “should” be.

Ad-hoc jobs

One aspect of batch jobs is that we often run them ad-hoc for research, development, or troubleshooting. For example, we edit some code or tweak some data, and then rerun our linear regression to see if we get better results. For these ad-hoc jobs there would ideally be some way to take e.g. a CSV file that we’re working with locally, and “upload” it to a volume that we could read from a pod. This scenario isn’t supported by Kubernetes, though, so we have to figure out other ways to get data into our pod.

One option would be to set up an NFS (Network File System) that’s accessible from outside of the cluster and expose it to our pods in the cluster. The other option is, as usual, a service of some sort. We could use a queue like RabbitMQ that will temporarily store our data and make it available to our pod.

But with a service, we now have another problem of accessing this service from outside of the cluster. One way to solve this problem is to make our dev machine part of the Kubernetes cluster. Then we can configure a simple Service for RabbitMQ which will be accessible from inside the cluster. If that’s not an option, though, we’ll need to explore Accessing Applications in a Cluster. The options boil down to using port forwarding, which is a debugging tool not recommended for “real” use cases, or setting up an Ingress Controller which is a formalized way of exposing services running in a Kubernetes cluster that is overkill for a single client accessing a single instance of a service from an internal network. None of these options are ideal.
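As a sketch of the in-cluster half of this (with hypothetical names), the Service below is all that’s needed for pods to reach RabbitMQ at a stable name. It’s the “from outside the cluster” half that has no good answer: a ClusterIP service isn’t reachable from a dev machine without port forwarding (e.g. kubectl port-forward service/rabbitmq 5672:5672) or an Ingress:

```yaml
# Plain ClusterIP Service for RabbitMQ (hypothetical example). Reachable from
# inside the cluster as "rabbitmq:5672", but not from outside it.
apiVersion: v1
kind: Service
metadata:
  name: rabbitmq
spec:
  type: ClusterIP
  selector:
    app: rabbitmq
  ports:
    - name: amqp
      port: 5672
      targetPort: 5672
```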

Kubernetes doesn’t make this easy because when we’re running services, we either don’t want anything outside of the cluster interacting with our service, or we want a relatively heavy-duty load balancer in front of our service to make sure our service doesn’t go down under load from the outside world. The missing feature here is some way for the client that is launching a job/pod to upload a file that the pod can access. This scenario is relatively common when running batch jobs and uncommon when running services, so it doesn’t exist in Kubernetes and we have to use these workarounds.

Distributed jobs that kind of work

Another aspect of batch jobs is that we’ll often want to run distributed computations where we split our data into chunks and run a function on each chunk. One popular option is to run Spark, which is built for exactly this use case, on top of Kubernetes. And there are other options for additional software to make running distributed computations on Kubernetes easier.

The Kubernetes documentation, however, doesn’t cede to third-party frameworks, and instead gives several options for running distributed computations directly on Kubernetes. But none of these options are compelling, especially in contrast to how well designed Kubernetes is for service workloads.

The simplest option is to just create a single Job object per task. As the documentation points out, this won’t work well with a large number of tasks. One user’s experience is that it’s hard to go beyond a few thousand jobs total. It seems like the best way around that is to use Indexed Jobs, which is a relatively new feature for running multiple copies of the same job where the different copies of the job have different values for the JOB_COMPLETION_INDEX environment variable. This gives us the most fundamental layer for running distributed jobs. As long as each task just needs an index number and doesn’t need to “send back” any outputs, this works. E.g. if all of the tasks are working on a single file and the tasks “know” that they need to process n rows that come after skipping the first JOB_COMPLETION_INDEX * n rows, and then write their output to a database, this works great.
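A minimal Indexed Job for this pattern might look like the sketch below (names and counts are hypothetical): 100 tasks, at most 10 running at a time, with each pod receiving its task number via JOB_COMPLETION_INDEX and having to work out its slice of the input from that single integer:

```yaml
# Indexed Job sketch: each of the 100 pods gets JOB_COMPLETION_INDEX set to a
# distinct value in 0..99 and must derive its work from that number alone.
apiVersion: batch/v1
kind: Job
metadata:
  name: process-prices
spec:
  completions: 100
  parallelism: 10
  completionMode: Indexed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: example/price-worker:1.0       # hypothetical image
          command: ["python", "-m", "worker"]   # reads JOB_COMPLETION_INDEX
```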

But in some cases, we’ll want to run tasks that e.g. need a filename to know where their input data is, and it could be convenient to send back results directly to the process that launched the distributed job for post-processing. In other words we might need to send more data back and forth from the tasks beyond a single number. For that, the documentation offers two variations of using a message queue service that you start in your Kubernetes cluster. The main difficulty with this approach is that we have the same problem as before of accessing services inside of the Kubernetes cluster from outside of it so that we can add messages to the message queue service. The documentation suggests creating a temporary interactive Pod but that only really makes sense for testing. We have the same options as before — make sure everything including our dev machines run inside the cluster, use port forwarding, or create an Ingress.
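The work-queue variant from the documentation looks roughly like this sketch (hypothetical names): completions is left unset, so each worker pulls messages from the queue until it’s empty, and the Job is done once the workers have drained it and exited successfully. Getting messages into that queue from outside the cluster is exactly the awkward part described above:

```yaml
# "Work queue" Job sketch: no completions field, so workers coordinate through
# an external queue (here the in-cluster RabbitMQ service) rather than an index.
apiVersion: batch/v1
kind: Job
metadata:
  name: process-files
spec:
  parallelism: 10
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: example/queue-worker:1.0     # hypothetical image
          env:
            - name: BROKER_URL
              value: amqp://rabbitmq:5672     # only resolvable inside the cluster
```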

Distributed groupby

An additional wrinkle with distributed computations is distributed “groupbys”. Distributed groupbys are necessary when our dataset is “chunked” or “grouped” by one column (e.g. date) but we want to group by a different column (e.g. zip code) before applying a computation. This requires re-grouping our original chunks, which is implemented as a “shuffle” (also known as a map-reduce). Worker-to-worker communication is central to shuffles. Each per-date worker gets a chunk of data for a particular date, and then sends rows for each zip code to a set of per-zip code downstream workers. The per-zip code workers receive the rows for “their” zip code from each upstream per-date worker.

To implement this worker-to-worker communication, Kubernetes could implement some version of shared volumes that would allow us to expose the results from one pod to a downstream pod that needs those outputs. Again, this functionality would be really useful in “batch world”, but the use case doesn’t exist in “service world”, so this functionality doesn’t exist. Instead we need to write our own service for worker-to-worker communication.

This is what, for example, Spark does, but it runs into an impedance mismatch as well. Spark’s shuffle implementation makes extensive use of local disk in order to deal with data that doesn’t fit in memory. But Kubernetes makes it hard to get the full performance of your local disks because it only allows for disk caching at the container level, meaning that the kernel’s disk caching isn’t used and Spark’s shuffle performance suffers on Kubernetes. A more native way to share files across pods would enable a faster implementation.

Caching data

Another aspect of distributed computations that we’ll talk about is caching data. Most distributed computation frameworks have the concept of a distributed dataset that’s cached on the workers and can be reused later. For example, Spark has RDDs (resilient distributed datasets). We can cache an RDD in Spark, which means that each Spark worker will store one or more chunks of the RDD. For any subsequent computations on that RDD, the worker that has a particular chunk will run the computation for that chunk. This general idea of “sending code to the data” is a crucial aspect of implementing an efficient distributed computation system.

Kubernetes in general is a bit unfriendly to the idea of storing data on the node, recommending that “it is a best practice to avoid the use of HostPaths when possible”. And even though there are extensive capabilities for assigning pods to nodes via affinity, anti-affinity, taints, tolerations, and topology spread constraints, none of these work with the concept of “what cached data is available on a particular node”.
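The closest available tool is illustrated below (labels and names are hypothetical): nodeAffinity can pin a pod to nodes with a certain label, which expresses “run on this kind of node” but not “run where chunk 17 of this dataset is already cached”:

```yaml
# nodeAffinity sketch: constrain a worker pod to labelled nodes. There is no
# analogous built-in way to say "schedule onto the node that already holds my
# cached chunk of data".
apiVersion: v1
kind: Pod
metadata:
  name: regression-worker
spec:
  restartPolicy: Never
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: workload-type       # hypothetical node label
                operator: In
                values: ["batch"]
  containers:
    - name: worker
      image: example/regression:1.0      # hypothetical image
      command: ["python", "-m", "regression"]
```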

Overall, Kubernetes has some very half-hearted support for natively running distributed computations. Anything beyond the most simple scenario requires either implementing workarounds or layering on another platform like Spark.

The future of batch jobs on Kubernetes

The point of this article is not to suggest that Kubernetes is poorly thought out or poorly implemented. Arguably, if you have a monolithic service, Kubernetes is overkill, and there’s at least one person predicting it will be gone in 5 years, but it seems like a great choice for e.g. this person running 120 microservices. And it makes deploying your web app with a database, DNS records, and SSL certificates easy. What we’re saying here is that Kubernetes, like all technologies, takes a point of view, and that point of view isn’t particularly friendly to batch jobs.

Instead of the currently half-hearted support for batch jobs, one option we’d love to see is Kubernetes making its stance more explicit and declaring that Kubernetes is designed primarily for services. This could open up space for other platforms more specifically designed for the batch job use case. Alternatively, as Kubernetes is already a “platform for building platforms” in some ways, we could see something like Spark-on-Kubernetes become better supported. It seems unlikely that Kubernetes would adopt a significantly different overall philosophy where batch jobs are a first-class use case.

Star us on Github or follow us on Twitter! We’re working on Meadowrun, an open source library that solves some of these problems and makes it easier to run your Python code on the cloud, with or without Kubernetes.

Top comments (2)

femolacaster

Great initiative with Meadowrun. How can I contribute via documentation?

Kurt

Thanks for considering a contribution!

Are you looking for areas of the documentation that we think need improvement? In that case I think it would make sense for you to try a few things out and see where it is lacking - a fresh pair of eyes helps. From our side we think we are in a reasonable spot but of course we made the thing so definitely biased.

If you're looking for technical guidance on how to make documentation changes, we don't have any contributor documentation yet, sorry! We are using mkdocs for the documentation, so at least it's worth becoming familiar with that.

You're very welcome to stop by in our gitter chat, we can help you get started: gitter.im/meadowdata/meadowrun