Kaslin Fields for Google Cloud

Posted on May 13, 2024 • Edited on May 21, 2024 • Originally published at kaslin.rocks

AI & Kubernetes


Like so many in the tech industry, I've seen Artificial Intelligence (AI) come to the forefront of my day-to-day work. I've been starting to learn about how "AI" fits into the world of Kubernetes, and vice versa. This post starts a series where I explore what I'm learning about AI and Kubernetes.

Types of AI Workloads on Kubernetes

To describe the AI workloads engineers are running on Kubernetes, we need some terminology. In this post I’m going to describe two major types of workloads: training and inference. Each term describes a different aspect of the work Platform Engineers do to bring AI workloads to life. I’ll also highlight two roles in the path from concept to production for AI workloads: Platform Engineers bridge the gap between the Data Scientists who design models and the end users who interact with trained implementations of those models.

Data Scientists design models while Platform Engineers have an important role to play in making them run on hardware.

There's a lot of work that happens before we get to the stage of running an AI model in production. Data Scientists choose the model type, implement the model (the structure of the "brain" of the program), choose the objectives for the model, and likely gather training data. Platform Engineers manage the large amounts of compute resources needed to train the model and to run it for end users. The first step between designing a model and getting it to users is training.

Note: AI training workloads are generally a type of stateful workload, which you can learn more about in my post on stateful workloads.

Training Workloads

"Training" a model is the process for creating or improving the model for its intended use. It's essentially the learning phase of the model's lifecycle. During training, the model is fed massive amounts of data. Through this process, the AI "learns" patterns and relationships within the training data through algorithmic adjustment of the model's parameters. This is the main workload folks are usually talking about when discussing the massive computational and energy requirements of AI.

During training, the AI model is fed massive amounts of data, which it "learns" from, algorithmically adjusting its own parameters.
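To make "algorithmic adjustment of the model's parameters" a bit more concrete, here's a minimal sketch of a training loop. It assumes a deliberately tiny model (a straight line with two parameters, `w` and `b`) and plain gradient descent; the function name, learning rate, and data are made up for illustration, and real training jobs use frameworks and far larger models.

```python
def train_linear_model(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y ~ w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0  # the model's parameters, before any learning
    n = len(xs)
    for _ in range(epochs):
        # How much the error changes as each parameter changes.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Nudge each parameter in the direction that reduces the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy training data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = train_linear_model(xs, ys)
```

After enough passes over the data, `w` and `b` settle very close to 2 and 1, the pattern hidden in the training data. Scale this idea up to billions of parameters and enormous datasets and you can see where the compute requirements come from.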

It’s becoming a common strategy for teams to utilize pre-trained models instead of training their own from scratch. However, a generalized AI is often not well-equipped to handle specialized use cases. For scenarios that require a customized AI, a team is likely to do a similar “training” step to customize the model without fully training it. We call this “fine-tuning.” I’ll dive deeper into fine-tuning strategies another time, but this overview of model tuning for Google Cloud’s Gemini model is a good resource to start with.

Why Kubernetes for Training

Kubernetes makes a lot of sense as a platform for AI training workloads. As a distributed system, Kubernetes is designed to manage a huge amount of distributed infrastructure and the networking challenges that come with it. Training workloads have significant hardware requirements, which Kubernetes can support with GPUs, TPUs, and other specialized hardware. The scale of a model can vary greatly, from fairly simple to very complex and resource-intensive. Scaling is one of Kubernetes' core competencies, meaning it can manage the variability of training workloads' needs as well.
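As a rough illustration of what "supporting GPUs" looks like from the workload's side, here's a sketch that builds a Kubernetes Job manifest requesting GPUs. It assumes the cluster exposes NVIDIA GPUs under the `nvidia.com/gpu` resource name (which requires a GPU device plugin on the nodes); the function name, Job name, and image are hypothetical.

```python
import json


def build_training_job(name, image, gpu_count):
    """Return a Kubernetes Job manifest (as a dict) that requests GPUs."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not automatically retry a failed run
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "trainer",
                            "image": image,
                            # GPUs are an extended resource, requested under limits.
                            "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
                        }
                    ],
                }
            },
        },
    }


# Hypothetical Job name and image, for illustration only.
job = build_training_job("demo-training", "example.com/demo/trainer:latest", gpu_count=2)
manifest_json = json.dumps(job, indent=2)  # kubectl accepts JSON as well as YAML
```

The scheduler will only place this Pod on a node that has two unallocated GPUs, which is the kind of matchmaking between workloads and hardware that Kubernetes handles for you.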

Kubernetes is also very extensible, meaning it can integrate with additional useful tools, for example, for observing and monitoring massive training workloads. A whole ecosystem has emerged, full of useful tools for AI/Batch/HPC workloads on Kubernetes. Kueue is one such tool: a Kubernetes-native open source project for managing the queueing of batch workloads on Kubernetes. To learn more about batch workloads on Kubernetes with Kueue, you might check out this tutorial. You can also learn more about running batch workloads on Kubernetes with GPUs in this guide about running them on GKE in Google Cloud.
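To give a feel for how Kueue plugs into ordinary Kubernetes objects, here's a small sketch that prepares a Job manifest for queueing. It relies on the `kueue.x-k8s.io/queue-name` label that Kueue uses to associate a Job with a LocalQueue, and creates the Job suspended so that Kueue decides when it starts. The helper name and the queue name `team-a-queue` are hypothetical, and the sketch assumes Kueue and a matching LocalQueue are already set up in the cluster.

```python
import copy

QUEUE_LABEL = "kueue.x-k8s.io/queue-name"


def submit_via_kueue(job, local_queue):
    """Return a copy of a Job manifest labeled and suspended for Kueue."""
    queued = copy.deepcopy(job)  # leave the caller's manifest untouched
    labels = queued.setdefault("metadata", {}).setdefault("labels", {})
    labels[QUEUE_LABEL] = local_queue
    # A suspended Job creates no Pods until it is unsuspended on admission.
    queued.setdefault("spec", {})["suspend"] = True
    return queued


# Minimal, hypothetical Job manifest to queue.
job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "demo-training"},
    "spec": {
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [
                    {"name": "trainer", "image": "example.com/demo/trainer:latest"}
                ],
            }
        }
    },
}
queued_job = submit_via_kueue(job, "team-a-queue")
```

The Job itself stays a plain Kubernetes Job; the queueing behavior comes from one label and one field, which is a nice example of the extensibility described above.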

Inference Workloads

You could say that training makes the AI into as much of an "expert" as it's going to be. Running a pre-trained model is its own type of workload. These "inference" workloads are generally much less resource-intensive than "training" workloads, but the resource needs of inference workloads can vary significantly. IBM defines "AI inferencing" as "the process of running live data through a trained AI model to make a prediction or solve a task."

An "Inference Workload" describes running a trained model. This model should be able to do its expected tasks relatively well.

Inference workloads can range from fairly simple, lightweight implementations to much more complex and resource-intensive ones. The term "inference workload" can describe a standalone, actively running implementation of a pre-trained AI model. Or, it can describe an AI model that functions essentially as a backend service within a larger application, often in a microservice-style architecture. This term seems to be used as a catch-all for any case where a trained AI model is being run. I’ve heard it used interchangeably with the terms “serving workload” and “prediction workload.”
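Here's a minimal sketch of what inference looks like in code, assuming a toy linear model whose parameters were already learned elsewhere. The parameter values, function names, and request format are all hypothetical; a real service would load a model file and sit behind an HTTP or gRPC server.

```python
import json

# Parameters produced by an earlier training run (hypothetical values).
TRAINED_PARAMS = {"w": 2.0, "b": 1.0}


def predict(x, params=TRAINED_PARAMS):
    """Inference: apply the trained parameters to new input. Nothing is updated."""
    return params["w"] * x + params["b"]


def handle_request(body):
    """Backend-style handler: JSON request in, JSON prediction out."""
    payload = json.loads(body)
    return json.dumps({"prediction": predict(payload["x"])})


response = handle_request('{"x": 3}')
```

Unlike training, nothing here adjusts the model. The parameters are fixed, and each request simply runs live data through them, which is why the same handler can be used on its own or called by a separate frontend.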

Why Kubernetes for Inference

Inference workloads can have diverse resource needs. Some might be lightweight and run on CPUs, while others might require powerful GPUs for maximum performance. Kubernetes excels at managing heterogeneous hardware, allowing you to assign the right resources to each inference workload for optimal efficiency.
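As a sketch of assigning the right resources to each workload, the helper below builds a Deployment manifest for either a CPU-only or a GPU-backed inference server. The function name, image names, and the specific CPU and memory amounts are hypothetical, and the GPU branch assumes the cluster exposes GPUs as `nvidia.com/gpu`.

```python
def build_inference_deployment(name, image, gpu_count=0, replicas=1):
    """Return a Deployment manifest (as a dict) sized for CPU or GPU inference."""
    if gpu_count > 0:
        # GPU-backed serving: request whole GPUs as an extended resource.
        resources = {"limits": {"nvidia.com/gpu": gpu_count}}
    else:
        # Lightweight serving: modest CPU and memory requests (illustrative values).
        resources = {"requests": {"cpu": "500m", "memory": "512Mi"}}
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {"name": "model-server", "image": image, "resources": resources}
                    ]
                },
            },
        },
    }


# Two hypothetical inference workloads with very different hardware needs.
small = build_inference_deployment("small-model", "example.com/demo/small:latest", replicas=3)
large = build_inference_deployment("large-model", "example.com/demo/large:latest", gpu_count=1)
```

Both workloads can live in the same cluster; the scheduler places each one on nodes that can satisfy its resource declaration.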

Kubernetes provides flexibility in the way an inference workload is used. Users may interact with it directly as a standalone application, or they may go through a separate frontend when the model is one microservice in a larger application. Whether the workload is standalone or one part of a whole, we call the workload that runs an AI model "inference."

On Terminology

In writing this blog post, I learned that the terminology of AI workloads is still actively being determined. “Inference” is currently used interchangeably with “serving,” “prediction,” and maybe more. Words have meaning, but when meanings are still being settled, it’s especially important to be as clear as possible about what you mean when you use a term.

One area which I think is not well-served by existing AI terminology is the difference between running a model, and running a full AI application. I have also seen the “inference” family of terms used to describe not just the running model, but the full application it is a part of, when the speaker is focusing on the AI aspects of that application.

It’s worth noting that Kubernetes is good not just for running the AI part, but also for running the application that the AI model is part of, as well as related services. Serving frameworks like Ray are useful tools for managing not just AI models, but the applications around them. I’ll likely dive deeper into Ray in a future blog post. If you’d like to learn more about Ray, you might check out this blog post about Ray on GKE.

AI models often fill a role within a larger application that serves users.

Ultimately, be careful when you talk about AI workloads! Try to explain what you mean as clearly as you can so we can all learn and understand together!

Customizing AI: It's All About Context

I'm enjoying learning about the ways "AI" fits into the world of "Kubernetes," and there's a lot more to learn! In this post, we explored AI training and inference (or serving) workloads and why to run them on Kubernetes. These workload types are great for understanding what it means to run AI models on Kubernetes. But the real value of AI is in its ability to understand and convey information in-context. To make a generic AI model useful in many use cases, it needs to be made aware of the context it's operating in, and what role it's fulfilling. "Fine-tuning" refers to the techniques for customizing a generic model, often by partially retraining it. There are also other techniques, like retrieval-augmented generation (RAG) and prompt engineering, that can be used to customize a generic model’s responses without altering the model itself. I’ll dive deeper into these techniques in a future blog post.