Build a GPU Enabled Llamafile Container

It's been fun playing with LLMs on a CPU. However, the novelty wears off as I watch a completion slowly appear word by word. Enter the GPU: I have an older Ubuntu gaming laptop with a GPU I purchased for machine learning (I haven't played a game since Doom in the early 90s). Enabling LLM software to run on GPUs can be tricky because it is system- and hardware-dependent. This article shows how I run llamafile on an NVIDIA RTX 2060. The examples use llamafile, NVIDIA CUDA, Ubuntu 22.04, and Docker.

Check the GPU and NVIDIA CUDA software

First, check whether CUDA is installed. NVIDIA provides a utility, nvidia-smi, that shows the status of the GPU and the installed CUDA version.

% nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2060        Off | 00000000:01:00.0  On |                  N/A |
| N/A   44C    P8               8W /  90W |   1322MiB /  6144MiB |      6%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+


If you see similar output, then CUDA is installed. Note the CUDA version. The CUDA version and compute capability determine which base image to use when building your image. NVIDIA publishes charts of the compute capabilities of its devices; choose the entry for your device. For example, I have a CUDA-enabled GeForce card, the GeForce RTX 2060.
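On recent drivers, nvidia-smi can also report the compute capability directly. The query below is a sketch; the compute_cap field may not be available on older driver versions. For reference, the RTX 2060 has compute capability 7.5.

# This query may not be supported on older drivers.
% nvidia-smi --query-gpu=name,compute_cap --format=csv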

Configuring Docker

The NVIDIA Container Toolkit enables Docker to use the GPU. NVIDIA provides detailed instructions for installing the Container Toolkit. If you're impatient, the abbreviated steps are below.

% distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
            sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
            sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

% sudo apt-get update

% sudo apt-get install -y nvidia-docker2

On Linux, update /etc/docker/daemon.json to configure the Docker Engine daemon and register the NVIDIA container runtime. If daemon.json does not exist, create it with the following content.

{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
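Alternatively, recent releases of the NVIDIA Container Toolkit include the nvidia-ctk helper, which writes this runtime entry for you. Whether it is available depends on the toolkit version you installed, so treat this as an optional shortcut.

# Optional: only if your toolkit version ships nvidia-ctk.
% sudo nvidia-ctk runtime configure --runtime=docker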

Restart Docker to apply the changes.

% sudo systemctl restart docker
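Before starting a test container, you can confirm that Docker registered the NVIDIA runtime. The exact list of runtimes varies by installation, but nvidia should appear in it.

# The runtime list varies; look for "nvidia" in the output.
% docker info | grep -i runtime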

Test that the runtime is working.

% docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi

If everything is working correctly, you should see output similar to the following.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2060        Off | 00000000:01:00.0  On |                  N/A |
| N/A   44C    P8              10W /  90W |   1530MiB /  6144MiB |      6%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Building the Llamafile Image

Let's break down the Dockerfile. I'll go through each part and explain the choices I made. The Dockerfile is a multi-stage build, and we'll start with building the llamafile binaries.

FROM debian:trixie as builder

WORKDIR /download

RUN mkdir out && \
    apt-get update && \
    apt-get install -y curl git gcc make && \
    git clone https://github.com/Mozilla-Ocho/llamafile.git  && \
    curl -L -o ./unzip https://cosmo.zip/pub/cosmos/bin/unzip && \
    chmod 755 unzip && mv unzip /usr/local/bin && \
    cd llamafile && make -j8 && \ 
    make install PREFIX=/download/out

This part of the Dockerfile builds the llamafile binaries. Like many open-source projects, llamafile is under active development, with new features and bug fixes landing weekly. For this reason, I build llamafile from source instead of using prebuilt binaries. In addition, llamafile can be packaged as a single executable that includes a model; to keep things simple, I omitted that step, but a sketch of it is shown below.
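For reference, the single-file approach uses the zipalign tool, which the llamafile build installs alongside the llamafile binary, to append a model and a default argument list to the executable. The sketch below follows the llamafile README; the file names are placeholders, so adjust them for your model.

# Placeholder file names; zipalign should be on your PATH after "make install".
% cp llamafile codellama.llamafile
% printf -- '-m\ncodellama-7b-instruct.Q4_K_M.gguf\n' > .args
% zipalign -j0 codellama.llamafile codellama-7b-instruct.Q4_K_M.gguf .args

The resulting codellama.llamafile starts with those arguments baked in, so there is no separate GGUF file to distribute.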

The next part of the multi-stage build is the image that runs llamafile with the host system's GPU. NVIDIA provides base images for CUDA that we can use for GPU-enabled applications like llamafile. Earlier, we took note of the CUDA version and the driver version. NVIDIA states the following:

"In order to run a CUDA application, the system should have a CUDA enabled GPU and an NVIDIA display driver that is compatible with the CUDA Toolkit that was used to build the application itself. If the application relies on dynamic linking for libraries, then the system should have the right version of such libraries as well."

The base NVIDIA CUDA image must be compatible with the CUDA version on the host machine. If you have a later-model video card, you can simply choose the latest image for your host platform. You have a choice of several Linux distributions, and a choice between a development image that includes the toolchain for building applications and a runtime image for deploying a prebuilt application.
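For example, if your card works with the current images, you can pull a matching base, runtime, or devel tag directly from Docker Hub and reference it in the FROM line. The tags below are illustrative; check the nvidia/cuda repository on Docker Hub for the exact tags published for your CUDA version and distribution.

# Example tags only; confirm the exact tag names on Docker Hub.
% docker pull nvidia/cuda:12.2.0-runtime-ubuntu22.04
% docker pull nvidia/cuda:12.2.0-devel-ubuntu22.04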

NOTE: The GeForce RTX 2060 in my laptop was incompatible with the latest CUDA image. I could have fallen back to an earlier prebuilt image, but the NVIDIA CUDA repository also maintains Dockerfiles for different CUDA versions, so I built an image specific to my CUDA version and Linux distribution. If you want to build your own CUDA image, download the matching Dockerfile and build it:

% docker build -t cuda-12.2-base-ubuntu-22.04 .
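The Dockerfiles themselves are maintained in NVIDIA's container-images repository on GitLab. The path below reflects my understanding of the repository layout and is an assumption; browse the repository for the directory matching your CUDA release and distribution.

# The dist/... path is an assumption; adjust it to your CUDA version and distro.
% git clone https://gitlab.com/nvidia/container-images/cuda.git
% cd cuda/dist/12.2.0/ubuntu2204/base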

This section installs and configures the CUDA toolkit, which llamafile needs in order to use the GPU inside the container. Replace the base image with the one that matches your CUDA version and Linux distribution. In addition, a user is created so that the container does not run as root.

FROM cuda-12.2-base-ubuntu-22.04 as out

RUN apt-get update && \
    apt-get install -y linux-headers-$(uname -r) && \
    apt-key del 7fa2af80 && \
    apt-get update && \
    apt-get install -y clang && \
    apt-get install -y cuda-toolkit && \
    addgroup --gid 1000 user && \
    adduser --uid 1000 --gid 1000 --disabled-password --gecos "" user

USER user

The following section copies the llamafile binaries and man pages from the builder image, along with the LLM. In this example, the model, codellama-7b-instruct.Q4_K_M.gguf, was downloaded from Hugging Face. You can use any llamafile (or llama.cpp) compatible model in GGUF format. Note that llamafile is started as a server that exposes an OpenAI-compatible API endpoint, and GPU offloading is enabled with -ngl 9999.

WORKDIR /usr/local

COPY --from=builder /download/out/bin ./bin
COPY --from=builder /download/out/share ./share/man
COPY codellama-7b-instruct.Q4_K_M.gguf /model/codellama-7b-instruct.Q4_K_M.gguf

# Don't write log file
ENV LLAMA_DISABLE_LOGS=1

# Expose 8080 port.
EXPOSE 8080

# Set entrypoint.
ENTRYPOINT ["/bin/sh", "/usr/local/bin/llamafile"]

# Set default command.
CMD ["--server", "--nobrowser", "-ngl", "9999","--host", "0.0.0.0", "-m", "/model/codellama-7b-instruct.Q4_K_M.gguf"]


Build and tag the image.

% docker build -t llamafile-codellama-gpu .

This is the complete Dockerfile.

FROM debian:trixie as builder

WORKDIR /download

RUN mkdir out && \
    apt-get update && \
    apt-get install -y curl git gcc make && \
    git clone https://github.com/Mozilla-Ocho/llamafile.git  && \
    curl -L -o ./unzip https://cosmo.zip/pub/cosmos/bin/unzip && \
    chmod 755 unzip && mv unzip /usr/local/bin && \
    cd llamafile && make -j8 && \ 
    make install PREFIX=/download/out

FROM cuda-12.2-base-ubuntu-22.04 as out

RUN apt-get update && \
    apt-get install -y linux-headers-$(uname -r) && \
    apt-key del 7fa2af80 && \
    apt-get update && \
    apt-get install -y clang && \
    apt-get install -y cuda-toolkit && \
    addgroup --gid 1000 user && \
    adduser --uid 1000 --gid 1000 --disabled-password --gecos "" user

USER user

WORKDIR /usr/local

COPY --from=builder /download/out/bin ./bin
COPY --from=builder /download/out/share ./share/man
COPY codellama-7b-instruct.Q4_K_M.gguf /model/codellama-7b-instruct.Q4_K_M.gguf

# Don't write log file.
ENV LLAMA_DISABLE_LOGS=1

# Expose 8080 port.
EXPOSE 8080

# Set entrypoint.
ENTRYPOINT ["/bin/sh", "/usr/local/bin/llamafile"]

# Set default command.
CMD ["--server", "--nobrowser", "-ngl", "9999", "--host", "0.0.0.0", "-m", "/model/codellama-7b-instruct.Q4_K_M.gguf"]

Enjoy the Performance Boost

To test GPU-enabled llamafile with codellama-7b-instruct.Q4_K_M.gguf, I ran a prompt using continue.dev, an open-source LLM coding extension for VSCode. Continue.dev can use any LLM that exposes an OpenAI API. I ran llamafile in server mode on the laptop with the GPU and VSCode on an M2 Mac mini. I used the following.

Prompt:

"Depth-first search for a URL and print the pages as a tree."

Completion:

import requests
from bs4 import BeautifulSoup
def depth_first_search(url):
  """Depth-first search for an URL and print the pages as a tree."""
    # Initialize a set to keep track of visited URLs
    visited = set()
    # Create a queue to store the URLs to be visited
    queue = [url]
    while queue:
        # Get the next URL from the queue
        url = queue.pop(0)
        # If the URL has not been visited, visit it and add its children to the queue
        if url not in visited:
            response = requests.get(url)
            soup = BeautifulSoup(response.text, 'html.parser')
            for link in soup.find_all('a'):
                queue.append(link.get('href'))
            print(f"{url} -> {', '.join(queue)}")
            visited.add(url)

For a baseline, I ran the llamafile executable on the CPU.

% llamafile --server --host 0.0.0.0 -m codellama-7b-instruct.Q4_K_M.gguf  

With the CPU, prompt evaluation ran at 24 tokens per second, and the total time for the completion was over a minute.

print_timings: prompt eval time =    4963.79 ms /   121 tokens (   41.02 ms per token,    24.38 tokens per second)
print_timings:        eval time =   64191.60 ms /   472 runs   (  136.00 ms per token,     7.35 tokens per second)
print_timings:       total time =   69155.39 ms

Next, I ran the llamafile executable with the GPU enabled.

 % llamafile --server --host 0.0.0.0 -ngl 9999 -m codellama-7b-instruct.Q4_K_M.gguf

With the GPU, prompt evaluation was 17 times faster than the CPU, processing 426 tokens/second. The completion returned in 11 seconds, a significant improvement in response time.

print_timings: prompt eval time =     283.92 ms /   121 tokens (    2.35 ms per token,   426.18 tokens per second)
print_timings:        eval time =   11134.10 ms /   470 runs   (   23.69 ms per token,    42.21 tokens per second)
print_timings:       total time =   11418.02 ms


I ran the llamafile container with the GPU enabled to see if containerization affected performance.

% docker run -it --gpus all --runtime nvidia -p 8111:8080 llamafile-codellama-gpu
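To confirm the containerized server is reachable before timing anything, you can hit the OpenAI-compatible chat completions endpoint on the mapped port. This request is a minimal sketch; the model field is just a label, since the server only serves the bundled model.

# The prompt and model name are placeholders; the port matches the -p 8111:8080 mapping above.
% curl http://localhost:8111/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "codellama-7b-instruct", "messages": [{"role": "user", "content": "Say hello."}]}'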

Surprisingly, containerization did not hurt performance: prompt evaluation was slightly faster than the bare GPU-enabled executable, and the total completion time was nearly the same.

print_timings: prompt eval time =     257.56 ms /   121 tokens (    2.13 ms per token,   469.80 tokens per second)
print_timings:        eval time =   11498.98 ms /   470 runs   (   24.47 ms per token,    40.87 tokens per second)
print_timings:       total time =   11756.53 ms

Summary

Running llamafile with the GPU enabled changes it from a toy application for experimentation and learning into a practical component of your software toolchain. In addition, containerizing an LLM lets anyone run it without downloading and installing binaries by hand: users can pull an image from a registry and launch it with minimal setup. Containerization also opens up other avenues for deploying LLMs in orchestration frameworks.
