DEV Community

Sophia Parafina


A Quick Guide to Containerizing Llamafile

Llamafile is a Mozilla project that runs open-source large language models (LLMs) such as Llama-2-7B and Mistral 7B, or any other model in the GGUF format.

Containerize Llamafile

The Dockerfile below builds llamafile from source, packages it in a container image, and runs it in server mode. It is a multi-stage build: the first stage uses debian:trixie, which provides GCC 13, to compile llamafile, and the final image uses debian:stable as its base. Copy, paste, and save the following in a file named Dockerfile.



# Use debian trixie for gcc13
FROM debian:trixie AS builder

# Set work directory
WORKDIR /download

# Configure build container and build llamafile
RUN mkdir out && \
    apt-get update && \
    apt-get install -y curl git gcc make && \
    git clone https://github.com/Mozilla-Ocho/llamafile.git  && \
    curl -L -o ./unzip https://cosmo.zip/pub/cosmos/bin/unzip && \
    chmod 755 unzip && mv unzip /usr/local/bin && \
    cd llamafile && make -j8 LLAMA_DISABLE_LOGS=1 && \
    make install PREFIX=/download/out

# Create container
FROM debian:stable AS out

# Create a non-root user
RUN addgroup --gid 1000 user && \
    adduser --uid 1000 --gid 1000 --disabled-password --gecos "" user

# Switch to user
USER user

# Set working directory
WORKDIR /usr/local

# Copy llamafile and man pages
COPY --from=builder /download/out/bin ./bin
COPY --from=builder /download/out/share ./share/man

# Expose 8080 port.
EXPOSE 8080

# Set entrypoint.
ENTRYPOINT ["/bin/sh", "/usr/local/bin/llamafile"]

# Set default command.
CMD ["--server", "--host", "0.0.0.0", "-m", "/model"]




To build the image, run:



$ docker build -t llamafile:0.6 .



The image is tagged 0.6 to match the llamafile version that was current at the time of writing.

Running the Llamafile Container

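To run the container, first download a model in GGUF format, such as Mistral-7b-v0.1. The example below assumes the model file is saved in a local directory named model. The file is bind-mounted into the container at /model, which is the path the default command passes to -m.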



$ docker run -d -v ./model/mistral-7b-v0.1.Q5_K_M.gguf:/model -p 8080:8080 llamafile:0.6



Once the server is running, open http://localhost:8080 in a browser to reach the llama.cpp web interface.

Llamafile web UI
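
The server may need some time to load the model before it answers on port 8080. As a minimal sketch, assuming the port mapping from the docker run command above, a script can poll the address until it responds. The helper names server_is_up and wait_for_server are hypothetical and not part of llamafile.

```python
import time
import urllib.error
import urllib.request


def server_is_up(url):
    """Return True if an HTTP GET to url succeeds with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_for_server(url, attempts=30, delay=1.0, probe=server_is_up):
    """Call probe(url) up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay)
    return False
```

Calling wait_for_server("http://localhost:8080") returns True as soon as the web interface answers, or False after roughly 30 seconds with the defaults shown.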

Llamafile also exposes an OpenAI API-compatible chat completions endpoint. This example sends a request with curl and pretty-prints the JSON response with Python.



curl -s http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "system",
      "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
    },
    {
      "role": "user",
      "content": "Compose a poem that explains the concept of recursion in programming."
    }
  ]
}' | python3 -c '
import json
import sys
json.dump(json.load(sys.stdin), sys.stdout, indent=2)
print()
'
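
The same request can be sent from Python without curl. This is a minimal sketch using only the standard library. It assumes the server is reachable at http://localhost:8080 and that the response follows the OpenAI chat completion layout (choices, then message, then content). The function names are hypothetical.

```python
import json
import urllib.request

# Assumed address: the port published by the docker run command in this guide.
BASE_URL = "http://localhost:8080"


def build_chat_request(user_prompt, system_prompt="You are a helpful assistant.",
                       base_url=BASE_URL):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": "gpt-3.5-turbo",  # same value the curl example sends
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of an OpenAI-style chat completion dict."""
    return response_body["choices"][0]["message"]["content"]


def chat(user_prompt, **kwargs):
    """Send one prompt to the running container and return the reply text."""
    with urllib.request.urlopen(build_chat_request(user_prompt, **kwargs)) as resp:
        return extract_reply(json.load(resp))
```

With the container running, chat("Compose a poem that explains the concept of recursion in programming.") returns the reply text as a string.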



Llamafile has many parameters to tune the model. You can list them with man llamafile or llamafile --help. Parameters can be set in the Dockerfile CMD directive.
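
Parameters can also be changed without rebuilding the image, because arguments placed after the image name in docker run replace the CMD defaults entirely. The sketch below assembles such a command line. The helper name docker_run_command is hypothetical, and -c 2048 (context size) is only an example value, so check llamafile --help for the flags your version supports.

```python
import shlex

# Defaults copied from the CMD line of the Dockerfile in this guide.
DEFAULT_CMD = ["--server", "--host", "0.0.0.0", "-m", "/model"]


def docker_run_command(model_path, extra_args=(), image="llamafile:0.6", port=8080):
    """Return a docker run command line with extra llamafile arguments.

    Arguments after the image name replace CMD entirely, so the
    defaults are repeated before the extras.
    """
    argv = [
        "docker", "run", "-d",
        "-v", f"{model_path}:/model",
        "-p", f"{port}:8080",
        image,
        *DEFAULT_CMD,
        *extra_args,
    ]
    return shlex.join(argv)


# Example: keep the default server settings and add a context size.
print(docker_run_command("./model/mistral-7b-v0.1.Q5_K_M.gguf", ["-c", "2048"]))
```

The printed line can be pasted into a shell. Dropping the defaults and passing only the extra flags would start llamafile without --server or the model path.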
