Getting llamafile running with a GPU can be a little tricky. For reference, I have a four-year-old Kubuntu laptop with an Nvidia RTX 2060.
Check to see if the correct Nvidia driver is installed.
nvidia-smi
If the driver is installed, you will see output similar to the following.
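As an optional extra check, nvidia-smi can also print just the fields of interest in CSV form. The query fields below are standard nvidia-smi query options.
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv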
Building llamafile
To build llamafile on Ubuntu, make sure you have git, GCC, make, and curl installed.
sudo apt-get install -y curl git gcc make
Next, clone the llamafile repository. Then download unzip, make it executable, and move it to /usr/local/bin so it's on the PATH.
git clone https://github.com/Mozilla-Ocho/llamafile.git
curl -L -o ./unzip https://cosmo.zip/pub/cosmos/bin/unzip
chmod 755 unzip
sudo mv unzip /usr/local/bin
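As a quick sanity check, confirm the binary now resolves from the PATH.
which unzip    # should print /usr/local/bin/unzip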
Change into the cloned llamafile directory, then build and install it.
cd llamafile
make -j8
sudo make install PREFIX=/usr/local
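To confirm the build and install worked, check that llamafile resolves from /usr/local/bin and prints its usage. The --help flag is just a smoke test here; the exact output varies by version.
which llamafile
llamafile --help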
Running llamafile with the GPU enabled
To enable the GPU, llamafile (via llama.cpp) needs clang, the CUDA toolkit, and nvidia-gds. You'll also need the kernel headers and development packages for your running kernel if they are not already installed. Additionally, you will need to remove the outdated CUDA repository signing key before updating.
sudo apt install linux-headers-$(uname -r)
sudo apt-key del 7fa2af80
sudo apt update
sudo apt install -y clang
sudo apt install -y cuda-toolkit
sudo apt install -y nvidia-gds
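A quick way to verify the toolkit is present is to check for the nvcc compiler. On some systems the CUDA packages install under /usr/local/cuda and its bin directory is not on the PATH by default, so the export below may be needed first (path assumption based on the stock CUDA layout).
nvcc --version
# if nvcc is not found, try adding the default CUDA location to the PATH:
export PATH=/usr/local/cuda/bin:$PATH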
To run llamafile with the GPU, add the option -ngl 999 (the number of model layers to offload to the GPU). This example runs llamafile as a server with the GPU enabled.
llamafile --server -ngl 999 --host 0.0.0.0 -m codellama-7b-instruct.Q4_K_M.gguf
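Once the server is up (it listens on port 8080 by default), you can exercise the GPU-backed model over HTTP. The request below is a minimal sketch against the server's /completion endpoint; adjust the host, port, prompt, and n_predict to taste.
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a C function that reverses a string.", "n_predict": 128}'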
The RTX 2060 with 6 GB of VRAM can run models with 7B parameters. Note that llamafile will core dump if you try to load a model larger than what the GPU memory can hold.
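If a model almost fits, one option is to lower the -ngl value so only part of the model is offloaded to the GPU while the remaining layers run on the CPU. The layer count below is just an illustrative value; tune it to your card's memory.
llamafile --server -ngl 20 --host 0.0.0.0 -m codellama-7b-instruct.Q4_K_M.gguf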
Even with an older GPU, throughput in tokens per second is almost two orders of magnitude higher than running on the CPU alone.