Getting llamafile running with a GPU can be a little tricky. For reference, I have a four-year-old Kubuntu laptop with an Nvidia RTX 2060.
Check to see if the correct Nvidia driver is installed.
nvidia-smi
If the driver is installed, you will see output similar to the following.
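As an optional extra check, nvidia-smi can also print just the fields of interest in CSV form. The query fields below are standard nvidia-smi query options.
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv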
Building llamafile
To build llamafile on Ubuntu, make sure you have git, GCC, make, and curl installed.
sudo apt-get install -y curl git gcc make
Next, clone the llamafile repository. Then download unzip, make it executable, and move it to /usr/local/bin so it's on the PATH.
git clone https://github.com/Mozilla-Ocho/llamafile.git
curl -L -o ./unzip https://cosmo.zip/pub/cosmos/bin/unzip
chmod 755 unzip
sudo mv unzip /usr/local/bin
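As a quick sanity check, confirm the binary now resolves from the PATH.
which unzip    # should print /usr/local/bin/unzip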
Change into the cloned llamafile directory, then build and install it.
cd llamafile
make -j8
sudo make install PREFIX=/usr/local
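To confirm the build and install worked, check that llamafile resolves from /usr/local/bin and prints its usage. The --help flag is just a smoke test here; the exact output varies by version.
which llamafile
llamafile --help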
Running llamafile with the GPU enabled
To enable the GPU, llamafile (via llama.cpp) needs clang, the CUDA toolkit, and nvidia-gds. You'll also need the kernel headers and development packages for your running kernel if they are not already installed. Additionally, you will need to remove the outdated CUDA repository signing key before updating.
sudo apt install linux-headers-$(uname -r)
sudo apt-key del 7fa2af80
sudo apt update
sudo apt install -y clang
sudo apt install -y cuda-toolkit
sudo apt install -y nvidia-gds
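A quick way to verify the toolkit is present is to check for the nvcc compiler. On some systems the CUDA packages install under /usr/local/cuda and its bin directory is not on the PATH by default, so the export below may be needed first (path assumption based on the stock CUDA layout).
nvcc --version
# if nvcc is not found, try adding the default CUDA location to the PATH:
export PATH=/usr/local/cuda/bin:$PATH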
To run llamafile with the GPU, add the option -ngl 999 (the number of model layers to offload to the GPU). This example runs llamafile as a server with the GPU enabled.
llamafile --server -ngl 999 --host 0.0.0.0 -m codellama-7b-instruct.Q4_K_M.gguf
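Once the server is up (it listens on port 8080 by default), you can exercise the GPU-backed model over HTTP. The request below is a minimal sketch against the server's /completion endpoint; adjust the host, port, prompt, and n_predict to taste.
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a C function that reverses a string.", "n_predict": 128}'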
The RTX 2060 with 6 GB of VRAM can run models with 7B parameters. Note that llamafile will core dump if you try to load a model larger than what the GPU memory can hold.
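If a model almost fits, one option is to lower the -ngl value so only part of the model is offloaded to the GPU while the remaining layers run on the CPU. The layer count below is just an illustrative value; tune it to your card's memory.
llamafile --server -ngl 20 --host 0.0.0.0 -m codellama-7b-instruct.Q4_K_M.gguf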
Even with an older GPU, throughput in tokens per second is almost two orders of magnitude higher than running on the CPU alone.