Llamafile is a Mozilla project that runs open source Large Language Models, such as Llama-2-7B or Mistral 7B, as well as any other model in the GGUF format.
Containerize Llamafile
The following Dockerfile builds llamafile from source, packages it in a container image, and runs it in server mode. The build stage uses Debian trixie as its base image, and the final image uses debian:stable. Save the following in a file named Dockerfile.
# Use debian trixie for gcc13
FROM debian:trixie as builder
# Set work directory
WORKDIR /download
# Configure build container and build llamafile
RUN mkdir out && \
    apt-get update && \
    apt-get install -y curl git gcc make && \
    git clone https://github.com/Mozilla-Ocho/llamafile.git && \
    curl -L -o ./unzip https://cosmo.zip/pub/cosmos/bin/unzip && \
    chmod 755 unzip && mv unzip /usr/local/bin && \
    cd llamafile && make -j8 LLAMA_DISABLE_LOGS=1 && \
    make install PREFIX=/download/out
# Create container
FROM debian:stable as out
# Create a non-root user
RUN addgroup --gid 1000 user && \
    adduser --uid 1000 --gid 1000 --disabled-password --gecos "" user
# Switch to user
USER user
# Set working directory
WORKDIR /usr/local
# Copy llamafile and man pages
COPY --from=builder /download/out/bin ./bin
COPY --from=builder /download/out/share ./share
# Expose 8080 port.
EXPOSE 8080
# Set entrypoint.
ENTRYPOINT ["/bin/sh", "/usr/local/bin/llamafile"]
# Set default command.
CMD ["--server", "--host", "0.0.0.0", "-m", "/model"]
To build the image, run:
$ docker build -t llamafile:0.6 .
The image is tagged 0.6 to match the current llamafile version.
Running the Llamafile Container
To run the container, first download a model such as Mistral-7b-v0.1 in GGUF format. The example below assumes the model file is saved in the local model directory and mounts it into the container at /model.
$ docker run -d -v ./model/mistral-7b-v0.1.Q5_K_M.gguf:/model -p 8080:8080 llamafile:0.6
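Docker typically creates an empty directory on the host when the source path of a -v bind mount does not exist, so a mistyped model path tends to show up only as a load error inside the container. As a minimal sketch, the check below confirms that the file exists and starts with the four-byte GGUF magic before you start the container. The helper name looks_like_gguf is made up for this example, and the path is the one assumed in the command above.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # GGUF files begin with these four ASCII bytes


def looks_like_gguf(path):
    """Return True if path is a regular file that starts with the GGUF magic."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        return f.read(4) == GGUF_MAGIC


if __name__ == "__main__":
    # Path assumed from the docker run example; adjust to your download location.
    model = "./model/mistral-7b-v0.1.Q5_K_M.gguf"
    print(model, "looks like GGUF:", looks_like_gguf(model))
```

This only inspects the header, so it catches a wrong path or a truncated or HTML error-page download, not a corrupted file body.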
Once the server is running, open http://localhost:8080 in a browser to use the llama.cpp web interface.
Llamafile has an OpenAI API-compatible endpoint, and you can send requests to the server. This example uses curl and pretty prints the JSON response.
curl -s http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "system",
      "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
    },
    {
      "role": "user",
      "content": "Compose a poem that explains the concept of recursion in programming."
    }
  ]
}' | python3 -c '
import json
import sys
json.dump(json.load(sys.stdin), sys.stdout, indent=2)
print()
'
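The same request can be sent from Python using only the standard library. This is a rough sketch, not an official client: the function names build_chat_request, make_request, and chat are invented for this example, and the default base URL assumes the port mapping from the docker run command.

```python
import json
import urllib.request


def build_chat_request(system_prompt, user_prompt, model="gpt-3.5-turbo"):
    """Build the JSON body for a /v1/chat/completions request."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }


def make_request(payload, base_url="http://localhost:8080"):
    """Wrap the payload in a POST request object without sending it."""
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(payload, base_url="http://localhost:8080", timeout=300):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(make_request(payload, base_url), timeout=timeout) as resp:
        return json.load(resp)
```

Calling chat(build_chat_request(system_text, user_text)) returns the parsed response. Assuming the server follows the OpenAI response shape, the generated text is at reply["choices"][0]["message"]["content"]. The generous timeout is a guess to allow for slow CPU inference.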
Llamafile has many parameters for tuning the model. You can list them with man llamafile or llamafile --help. Parameters can be set in the Dockerfile CMD directive.
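Because the image defines an ENTRYPOINT, any arguments placed after the image name in docker run replace the default CMD, so you can try different parameters without rebuilding the image. The sketch below assembles such a command. The helper name docker_run_args is hypothetical, and the extra -c 4096 flag (context size in llama.cpp) is only an illustration, so confirm it against llamafile --help for your version.

```python
import shlex


def docker_run_args(model_path, image="llamafile:0.6", port=8080, extra=()):
    """Return the argv for a docker run that overrides the image's default CMD."""
    return [
        "docker", "run", "-d",
        "-v", f"{model_path}:/model",
        "-p", f"{port}:8080",
        image,
        # Everything after the image name replaces CMD, so the defaults
        # from the Dockerfile are repeated before any extra parameters.
        "--server", "--host", "0.0.0.0", "-m", "/model",
        *extra,
    ]


if __name__ == "__main__":
    argv = docker_run_args(
        "./model/mistral-7b-v0.1.Q5_K_M.gguf",
        extra=["-c", "4096"],  # illustrative flag; verify with llamafile --help
    )
    print(shlex.join(argv))
```

Printing the command instead of executing it keeps the example free of side effects; paste the output into a shell once it looks right.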