Michael Levan

Ollama LLM On Kubernetes Locally (Run It On Your Laptop)

Now that we’ve gotten past the “buzz lifecycle” of AI and LLMs, it’s time to start thinking about how to run these workloads in our environments. Setting the buzz aside, there’s a solid use case for training Models and running other LLM/AI workloads on Kubernetes. One of the biggest reasons is the decoupling of memory, CPU, and GPUs.

In this blog post, you’ll learn how to get started with running an LLM on a local Kubernetes cluster.

💡

This will work on a production cluster and/or a cloud-based cluster (AKS, EKS, GKE, etc.) as well.

Prerequisites

To follow along from a hands-on perspective with this blog post, you should have the following:

  1. A code editor like VS Code.
  2. Minikube installed. You can find the installation instructions in the Minikube documentation.

Why Ollama?

The decision to use Ollama usually comes down to two things:

  1. Privacy
  2. Running LLM workloads locally

The biggest piece of the puzzle is that Ollama allows you to use whatever Models you want, train your own, and fine-tune or augment those Models with your own data via Retrieval Augmented Generation (RAG). One of the largest advantages for many engineers and organizations is that because Ollama is installed “locally” (that could be your local machine, but it could also be a Kubernetes cluster or a standard VM), you control what data gets fed to it. There are also pre-existing Models that you can start from, but you don’t have to use them.

Meta made Llama and OpenAI made GPT-4. As you’ve probably seen, there are other chat-bot/LLM-based tools out there as well.

Google Gemini, Microsoft Bing AI, ChatGPT, Grok, and the other chat-based AIs/LLMs are all essentially SaaS-based. They’re hosted for you, and you can call upon them/use them (for free or for a price). Ollama is local and you fully control it (aside from the pre-built Llama Models that you can pull down to get started).

Setting Up Kubernetes Locally

Ollama itself is a runtime for LLMs rather than a Model, and although the Models it runs aren’t enormous (though I don’t believe they fall under the Small Language Model (SLM) category), they still require a solid amount of resources, as all AI/LLM tools do. Because of that, you’ll need a Minikube environment with 3 nodes, as the extra CPU/memory is necessary.

To run a local Kubernetes cluster using Minikube with 3 nodes, run the following command:

minikube start --nodes 3
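
If your machine has the headroom, you can also give each node more CPU and memory with Minikube’s --cpus and --memory flags. The values below are just an example; adjust them for your hardware.

minikube start --nodes 3 --cpus 4 --memory 8192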

You’ll be able to see the three nodes with kubectl get nodes.
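
The output should look roughly like this (the node names are Minikube’s defaults; ages and versions will vary on your machine):

NAME           STATUS   ROLES           AGE   VERSION
minikube       Ready    control-plane   2m    v1.30.0
minikube-m02   Ready    <none>          1m    v1.30.0
minikube-m03   Ready    <none>          1m    v1.30.0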

Ollama Kubernetes Manifest

Now that the cluster is created, you can deploy the Kubernetes Pod that runs Ollama.

Luckily, there’s already a container image for Ollama that exists, so you don’t have to worry about building out a Dockerfile yourself.

💡

As with all pre-built container images, you want to ensure that it’s secure. You can use a container image scanner like docker scout for confirmation.
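
For example, if you have the Docker Scout CLI installed, a quick vulnerability scan of the image looks roughly like this (the output depends on your Scout version and the current image tag):

docker scout cves ollama/ollama:latest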

Use the following Kubernetes Manifest, which deploys Ollama using:

  • A Deployment object/resource.
  • The ollama Namespace.
  • One Pod (the default replica count).
  • The latest container image version of Ollama.
  • Port 11434, the default Ollama API port.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  selector:
    matchLabels:
      name: ollama
  template:
    metadata:
      labels:
        name: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - name: http
          containerPort: 11434
          protocol: TCP
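
The Manifest targets the ollama Namespace, so create it first if it doesn’t already exist:

kubectl create namespace ollama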

Save the Manifest in a location of your choosing with the name ollama.yaml and run the following command:

kubectl apply -f ollama.yaml
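
You can then check that the Pod comes up before moving on. It may take a minute for the container image to pull.

kubectl -n ollama get pods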

Confirm Ollama Works

The Kubernetes Deployment is now running with one replica, so you can start testing to see if Ollama works.

First, exec (which is similar to an SSH session) into the Pod. Make sure to swap out <pod-name> with the name of your Pod from the kubectl get pods output.

kubectl -n ollama exec -it <pod-name> -- /bin/bash

You should now be able to run ollama commands inside the container. You can confirm with the --version flag.

ollama --version

Once you’ve confirmed Ollama works, pull the latest Llama Model.

ollama pull llama3.2
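
If you want to double-check that the download worked, ollama list shows every Model available locally.

ollama list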

Run the Model.

ollama run llama3.2

Last but not least, you can confirm it’s working by asking it a question. Here’s an example (I’m on my way to re:Invent, so I figured this was appropriate).

[Screenshot: asking the llama3.2 Model a question about re:Invent and its response]
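
If you’d rather test from outside the Pod, Ollama also serves a REST API on port 11434. Here’s a minimal sketch using kubectl port-forward and curl against the /api/generate endpoint; it assumes the Deployment name from the Manifest above and that llama3.2 has already been pulled. Run the port-forward in one terminal and the curl in another.

kubectl -n ollama port-forward deployment/ollama 11434:11434

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "What is Kubernetes in one sentence?",
  "stream": false
}'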

Congrats! You’ve successfully deployed an LLM to Kubernetes.
