mkdev.me for mkdev

Posted on Oct 7, 2022

Kubernetes Is Not an Orchestrator: The Jump to Universality for Infrastructure Abstractions

#kubernetes #capacity #management

Kubernetes Capacity Management in a Nutshell

Let’s recap everything we’ve learned so far about Kubernetes resource management in Part I and Part II.

When you do a quick proof of concept of Kubernetes, you can just run any pod without thinking about how much CPU or RAM it needs.

Once you get serious about using Kubernetes in production, you have to figure out how much CPU and RAM each of your applications needs, and set resources for each pod accordingly. Setting those requires deep knowledge of how your application functions, whether it needs more memory or CPU, and whether it would function well with a few bigger pods or many smaller pods.
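As a minimal sketch of what "setting resources for each pod" means in practice, the snippet below builds a Pod manifest with explicit requests and limits and prints it as JSON. The application name, image and the concrete numbers are hypothetical; the right values depend entirely on your application.

```python
import json


def pod_manifest(name, image, cpu_request, memory_request, memory_limit):
    """Build a minimal Pod manifest with explicit resource settings."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "resources": {
                        # Requests are what the scheduler reserves on a node.
                        "requests": {"cpu": cpu_request, "memory": memory_request},
                        # Limits cap what the container may consume at runtime.
                        "limits": {"memory": memory_limit},
                    },
                }
            ]
        },
    }


# Hypothetical application and numbers, for illustration only.
manifest = pod_manifest(
    "billing-api", "registry.example.com/billing-api:1.0", "500m", "512Mi", "1Gi"
)
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with kubectl, which accepts JSON as well as YAML.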

As soon as your pods have resources configured, you should also optimise the nodes where these pods run. Your nodes should be able to provide the resources that your pods need, and your whole cluster needs to be able to scale out dynamically to fit more pods.

Once you have more variety in the applications that you run, you also need to think about whether you want more clusters, or if you want to mix different kinds of nodes within the same cluster. This involves more automation, more maintenance, and more effort to properly configure both nodes and pods to match each other as well as possible.

If you are working in an environment where multiple independent teams run many different apps on Kubernetes, you also need to automate initial onboarding of new apps and projects, as well as provide a way to request more resources for tenants of the cluster, by adjusting the quotas accordingly.
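One way to picture this onboarding automation is a small function that produces a ResourceQuota for every new tenant namespace. This is only a sketch: the namespace name and the numbers are made up, and a real onboarding flow would likely also create the namespace, RBAC rules and default limits.

```python
def tenant_quota(namespace, cpu, memory, max_pods):
    """Build a ResourceQuota manifest that caps a tenant namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
        "spec": {
            "hard": {
                # Total CPU and memory that pods in the namespace may request.
                "requests.cpu": cpu,
                "requests.memory": memory,
                # Maximum number of pods in the namespace.
                "pods": str(max_pods),
            }
        },
    }


# Hypothetical tenant and sizes.
quota = tenant_quota("team-payments", cpu="8", memory="16Gi", max_pods=40)
```

Granting a tenant more resources then becomes a change to the arguments of this call, which can be reviewed and applied like any other change.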

The more workloads you run and the more teams use your Kubernetes clusters, the more important proper cluster capacity automation becomes.

Quite naturally, even with all the automation around quotas, node autoscaling and pod autoscaling in place, every now and then you will still end up investigating why some pods can’t be scheduled, or why some cluster nodes repeatedly become unresponsive due to lack of memory or CPU. We only looked at CPU and RAM, but the same applies to persistent volumes, GPUs etc.
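When a pod can’t be scheduled, the first check is usually plain arithmetic: do the requests of the new pod fit into what is left on a node? The sketch below is a heavily simplified version of that check. It only understands a few quantity suffixes and ignores everything else the real scheduler considers (taints, affinity, other resource types); the node and pod sizes are hypothetical.

```python
MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def cpu_millicores(quantity):
    """Convert a CPU quantity such as '500m' or '2' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def memory_bytes(quantity):
    """Convert a memory quantity such as '512Mi' or '4Gi' to bytes."""
    for suffix, factor in MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def fits(pod_request, node_allocatable, running_requests):
    """Return True if the pod's requests fit into the node's free capacity."""
    free_cpu = cpu_millicores(node_allocatable["cpu"]) - sum(
        cpu_millicores(r["cpu"]) for r in running_requests
    )
    free_memory = memory_bytes(node_allocatable["memory"]) - sum(
        memory_bytes(r["memory"]) for r in running_requests
    )
    return (
        cpu_millicores(pod_request["cpu"]) <= free_cpu
        and memory_bytes(pod_request["memory"]) <= free_memory
    )


# Hypothetical node with two pods already running on it.
node = {"cpu": "2", "memory": "4Gi"}
running = [{"cpu": "500m", "memory": "1Gi"}, {"cpu": "1", "memory": "2Gi"}]

small_pod_fits = fits({"cpu": "250m", "memory": "512Mi"}, node, running)
large_pod_fits = fits({"cpu": "750m", "memory": "512Mi"}, node, running)
```

In this example only 500 millicores are left on the node, so the pod requesting 750m has nowhere to go even though memory is available.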

And, of course, we didn’t even touch the topic of cost management for those clusters, which starts from a simple “how many nodes can we afford at once” and evolves into “how do we charge cluster tenants for the resources they consume”.

Does Kubernetes make capacity management any better?

Kubernetes has one very powerful component at its core: the Scheduler. Having a dedicated component that takes over the responsibility of finding the best server to run your containers is very handy. Having a component that is as flexible and powerful as the Kubernetes Scheduler is even better, because it lets you set up truly complex scenarios around how you map your pods to your cluster nodes.

The hard question to ask ourselves is the following: is this the right new layer of infrastructure abstraction? Does Kubernetes bring us to the next level?

At the beginning of the article, we explored how the abstractions around infrastructure evolved and changed the way we work with the infrastructure. We moved from physical servers, which are very rigid and static, to virtual machines, which, given someone takes care of the physical servers for us, are very flexible. The only thing left for you to do was to figure out how much CPU and RAM your applications need, and right-size the virtual machines for each application.

We also discussed how containers bring another useful abstraction around how we package our applications, detaching the packaging from the runtime for this package - and even bringing the idea of such runtime in the first place. How does Kubernetes bring things forward from there?

As we just saw, it doesn’t do that much, and some things it does are rather a step backwards. Kubernetes does allow us to use container images as a packaging format, and then define how many resources each instance of this container image requires. This is not conceptually different from baking cloud machine images, like AWS AMIs, but it is definitely a nicer, more standards-oriented and more performant approach (if we look at performance from the standpoint of build times and boot times for each application instance).

In return, we have to manage another layer: the one where those containers will run. This other layer involves automating virtual machines, with auto scaling groups, different instance sizes and so on. Most cloud providers have some way of simplifying some of those activities, like providing native Kubernetes-focused wrappers around cloud-specific virtual machine primitives - also known as “Managed Kubernetes”, as in, for example, AWS EKS, GCP GKE and Azure AKS.

At the end of the day, you have to scale two different compute layers both horizontally and vertically, ideally automatically, but sometimes manually, by analysing various metrics in two dimensions - containers and servers that run those containers.
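To make the two dimensions concrete, here is a small sketch of the arithmetic on both layers. The replica calculation follows the basic formula documented for the Horizontal Pod Autoscaler (the real controller also applies tolerances and stabilisation windows); the node calculation is our own simplification that only looks at CPU, and all the numbers are hypothetical.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric):
    """Basic HPA-style formula: scale replicas by the metric ratio."""
    return math.ceil(current_replicas * current_metric / target_metric)


def nodes_needed(replicas, pod_cpu_millicores, node_allocatable_millicores):
    """Simplified node count: how many nodes are needed to fit the replicas."""
    pods_per_node = node_allocatable_millicores // pod_cpu_millicores
    return math.ceil(replicas / pods_per_node)


# Container layer: 4 replicas at 90% CPU utilisation, target is 60%.
replicas = desired_replicas(4, current_metric=90, target_metric=60)

# Server layer: each pod requests 500m, each node offers 1900m allocatable.
nodes = nodes_needed(replicas, pod_cpu_millicores=500, node_allocatable_millicores=1900)
```

A change in a pod-level metric ripples into a node-level decision, which is why both layers have to be scaled together.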

Almost serverless Kubernetes

Cloud providers like AWS, GCP and Azure have the power to abstract away the most problematic part of Kubernetes capacity management: nodes. And most of them did, but in slightly different ways. Let’s look at how AWS and GCP did it. We are going to spend a bit more time talking about the AWS approach, as AWS, unlike GCP, does bring new abstractions.

GCP GKE Autopilot

GKE Autopilot promises the following:

GKE provisions and manages the cluster's underlying infrastructure, including nodes and node pools, giving you an optimized cluster with a hands-off experience.

GKE Autopilot does not remove nodes from your Kubernetes experience, but instead completely takes over node management. You pay for the pods’ compute resources, and the job of Autopilot is to spawn the nodes based on what your pods need. You cannot touch those nodes at all, and you just rely on GKE software to maintain that layer of infrastructure.

If there is no node in the GKE cluster to fit your pod, Autopilot creates this node for you. It takes around a minute to bring up the new node and schedule that pod. If you want even faster times, you can create empty, low-priority pods that just force GKE to spawn a new node, and that the scheduler will evict to place your real pods. In this case, of course, you pay for those idle pods.
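A rough sketch of such placeholder pods is shown below: a PriorityClass with a negative value, and a Deployment of pause containers that uses it. The names, replica count and resource sizes are hypothetical, and the exact setup recommended for Autopilot may differ, so treat this as an illustration of the idea rather than a reference configuration.

```python
# PriorityClass with a negative value, so any regular pod can preempt these pods.
placeholder_priority = {
    "apiVersion": "scheduling.k8s.io/v1",
    "kind": "PriorityClass",
    "metadata": {"name": "placeholder"},  # hypothetical name
    "value": -10,
    "globalDefault": False,
    "description": "Low priority class for capacity placeholder pods.",
}

# Deployment of idle pods that only reserve capacity.
placeholder_deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "capacity-placeholder"},
    "spec": {
        "replicas": 2,
        "selector": {"matchLabels": {"app": "capacity-placeholder"}},
        "template": {
            "metadata": {"labels": {"app": "capacity-placeholder"}},
            "spec": {
                "priorityClassName": placeholder_priority["metadata"]["name"],
                "terminationGracePeriodSeconds": 0,
                "containers": [
                    {
                        "name": "pause",
                        # The pause container does nothing; it only holds the reservation.
                        "image": "registry.k8s.io/pause:3.9",
                        "resources": {"requests": {"cpu": "500m", "memory": "1Gi"}},
                    }
                ],
            },
        },
    },
}
```

The size of the requests decides how much spare capacity is kept warm, and therefore how much you pay for it.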

Conceptually, GKE Autopilot is an extremely comprehensive Cluster Autoscaler. It is similar to the open source (and AWS-sponsored) Karpenter, but implemented by GCP and for GCP. It works in a similar way to the Cluster Autoscaler, minus the configuration of nodes by the user. It does not remove the nodes, but it does attempt to completely remove the need to manage them. Google Cloud, being heavily invested in the Kubernetes ecosystem, is likely to do this management better than anyone else. AWS took a slightly different approach.

As a side note, GKE also provides “multi-dimensional pod scaling”, a combination of HPA and VPA (see “Configuring multidimensional Pod autoscaling” in the GKE documentation).

AWS EKS Fargate

The AWS approach over the last years has been to go serverless. The biggest serverless investment from AWS is AWS Lambda, followed by ECS Fargate. AWS pioneered serverless offerings in the cloud, and kept pushing them forward, to the point where Lambda can run at an insane scale.

It’s not surprising then that the approach AWS took for making Kubernetes “better” is to make it truly serverless, with its EKS Fargate offering. The way EKS Fargate works is by allocating a micro VM for each pod, meaning each pod you run is also a node - you can even see every Fargate pod as a node if you run kubectl get nodes.

This is a rather unique approach, with its own quirks. For example, it takes up to a minute to create a new Fargate pod, simply because Fargate needs to boot a new micro VM (with Firecracker, as far as we know), start all the required processes and start your application. It also has some limitations - you cannot use DaemonSets, for example. Whether you need DaemonSets if you don't really have nodes is another question - in the end, the most common use case for DaemonSets is to automate some node-level activities, like log collection or security scanning. You also can’t use EBS volumes, as of now, and configuring metrics and logs collection can be a bit tricky if you don’t use CloudWatch.

When each of your pods is wrapped in its own node, many issues we talked about disappear. You are not afraid of a malicious process breaking out of the application container, because it will break out into a micro VM that only runs this single pod. Permissions of a pod are permissions of a node, and permissions of a node are permissions of a pod. You also do not think about the resources of the node, because resources of a pod are resources of a node. If you ask Fargate for a pod with 2 CPU and 4GB of RAM, it will just transparently wrap this pod in a node with this amount of resources (plus extra spare space on top for micro VM operations).
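To illustrate the "pod size is node size" idea, the sketch below picks the smallest micro VM size that fits a pod plus some overhead. The numbers are assumptions for illustration: a 256 MB overhead and a table of vCPU and memory combinations modelled on what AWS documents for Fargate. Check the current Fargate documentation before relying on them.

```python
# Assumed Fargate sizes: vCPU -> selectable memory values in GB (illustrative).
FARGATE_SIZES = {
    0.25: [0.5, 1, 2],
    0.5: [1, 2, 3, 4],
    1: list(range(2, 9)),
    2: list(range(4, 17)),
    4: list(range(8, 31)),
}

# Assumed memory overhead in GB reserved for node-level components.
OVERHEAD_GB = 0.25


def fargate_size(cpu, memory_gb):
    """Return the smallest (vCPU, memory GB) combination that fits the pod."""
    needed_memory = memory_gb + OVERHEAD_GB
    for vcpu in sorted(FARGATE_SIZES):
        if vcpu < cpu:
            continue
        for memory in FARGATE_SIZES[vcpu]:
            if memory >= needed_memory:
                return vcpu, memory
    raise ValueError("pod does not fit any known Fargate size")


# The pod from the text: 2 CPU and 4GB of RAM.
size = fargate_size(2, 4)
```

Under these assumptions the pod from the example lands on a 2 vCPU micro VM with 5 GB of memory: the extra gigabyte is the "spare space on top".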

AWS EKS Fargate brings a truly new abstraction to Kubernetes, as it fully removes the node from the picture. It does not automate nodes, does not manage nodes. Nodes disappear. It’s even a bit unfortunate that the Kubernetes API on EKS still reports every Fargate pod as a node, as it would be even nicer to never see them at all, but this is probably a limitation of having to integrate AWS-specific services into an open source technology.

AWS EKS Fargate has its own issues. But all those issues are supposed to go away over time - one thing we can be sure about is that any cloud provider will keep innovating and improving its services. Fargate will get faster boot times and better and better integrations, the question is when. What you get already today is a conceptual leap, a true attempt to make Kubernetes nodes disappear - and thus accelerate embracing the true nature of Kubernetes, the toolkit for new powerful infrastructure abstractions.

The jump to universality

We discussed the path from physical servers to Kubernetes, the system that is considered to be cloud native, but also the one that brings back many things that cloud aimed to solve. Let's now briefly look at how we can move forward from here to the next iteration.

First important statement: the main value of Kubernetes is not in its scheduler and resource management, but in its API. If we close our eyes to all the resource management challenges, we can see the beauty and simplicity in this API: an instance of your application is a Pod, you can run and horizontally scale pods via Deployments and expose them with Services. Pods can talk to Services over simple DNS, and you have Service Discovery out of the box, mostly solved. You can feed configuration and credentials to those Pods via ConfigMaps and Secrets, and even provide some persistence via PersistentVolumes.
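The simplicity is easier to see in a small example. The sketch below defines a Deployment and a Service for a hypothetical application, and computes the DNS name other pods would use to reach it. The application name, namespace and image are made up, and the cluster domain is assumed to be the common default, cluster.local.

```python
def service_dns_name(service, namespace, cluster_domain="cluster.local"):
    """DNS name under which a Service is reachable inside the cluster."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


labels = {"app": "orders"}  # hypothetical application

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "orders", "namespace": "shop"},
    "spec": {
        "replicas": 3,  # horizontal scaling is a single field
        "selector": {"matchLabels": labels},
        "template": {
            "metadata": {"labels": labels},
            "spec": {
                "containers": [
                    {"name": "orders", "image": "registry.example.com/orders:1.0"}
                ]
            },
        },
    },
}

service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "orders", "namespace": "shop"},
    # The Service finds its pods through the same labels the Deployment sets.
    "spec": {"selector": labels, "ports": [{"port": 80, "targetPort": 8080}]},
}

dns_name = service_dns_name("orders", "shop")
```

Nothing in these two objects mentions nodes, instance types or IP addresses, which is exactly the appeal of the API.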

Second important statement: the true power of the Kubernetes API is that it’s extensible. Not only does it give you a nice set of the abstractions just mentioned, it also lets you build even more powerful abstractions on top. In addition to Deployment you can create objects like PostgreSQL, Mattermost, Jenkins, Redis - all of those implemented as custom API resources, processed by various Kubernetes Operators.
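At its core, an Operator is a loop that compares what a custom resource asks for with what actually exists, and acts on the difference. The sketch below shows only that comparison step for a hypothetical Redis custom resource; the API group, the fields and the action names are invented for illustration, and a real Operator would watch the Kubernetes API and create actual objects.

```python
# A hypothetical custom resource, as a user would submit it to the API.
redis_resource = {
    "apiVersion": "example.com/v1",  # invented API group
    "kind": "Redis",
    "metadata": {"name": "cache"},
    "spec": {"replicas": 3},
}


def reconcile(resource, observed_replicas):
    """Return the action needed to move observed state to the desired state."""
    desired = resource["spec"]["replicas"]
    if observed_replicas < desired:
        return ("create_replicas", desired - observed_replicas)
    if observed_replicas > desired:
        return ("delete_replicas", observed_replicas - desired)
    return ("noop", 0)


# One replica exists, three are requested.
action = reconcile(redis_resource, observed_replicas=1)
```

The user only declares "I want a Redis with 3 replicas"; everything below that line is the Operator's job.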

The Kubernetes API and the way you can extend it are so powerful that major cloud providers found it appropriate to manage their cloud services via Kubernetes Custom Resources - for example, via AWS ACK or Google Cloud Config Connector. In those cases, custom resources do not end up as lower-level Kubernetes objects; instead, they become native resources in the cloud provider.

If you look at Kubernetes from this perspective, then you will see it’s not a scheduler that automates some of the traditional capacity management challenges. Kubernetes is something entirely different: it’s a new standard to define new abstractions around cloud infrastructure.

The first wave of powerful infrastructure abstractions was defined and owned by cloud providers. Kubernetes unlocks the next wave, where everyone can define such abstractions. The Kubernetes API can become what David Deutsch, pioneer of quantum computation, calls “the jump to universality” for cloud infrastructure:

The jump to universality - The tendency of gradually improving systems to undergo a sudden large increase in functionality, becoming universal in some domain. - Beginning of Infinity, David Deutsch

Scheduling pods and nodes, setting CPU and RAM requests and limits, managing cluster nodes, taints, affinity rules and everything else is rather an unfortunate necessity to make those APIs play nicely on top of any infrastructure. After all, Kubernetes is an open source technology, not bound to a particular cloud provider, and thus Kubernetes alone cannot abstract away the capacity management layer. But cloud providers can.

Final notes

This series of articles had 3 goals:

Explain challenges involved in adopting Kubernetes in terms of resource management;
Provide guidance and best practices on how to resolve those challenges given existing tools;
Define what Kubernetes really is and where it should move in terms of resource management.

The hope is that those goals were achieved. At mkdev, we all love and use Kubernetes daily, but we are always conscious about what Kubernetes is and what it is not. Its future is bright, and removing some of the traditional infrastructure management layers will only accelerate this future.

Eventually, we will all be able to build ever more powerful abstractions in our infrastructure. The world of modern infrastructure is defined by software, and there are no limits to what we can achieve with software.

This article was written by Kirill Shirinkin for mkdev.me.