
Ricardo

Posted on • Originally published at rmauro.dev

Running LLM llama.cpp Natively on Raspberry Pi

For developers and hackers who enjoy squeezing maximum potential out of compact machines, getting a large language model running natively on a Raspberry Pi with llama.cpp is a rewarding challenge. This guide walks you through compiling llama.cpp from source, downloading a model, and running inference - all on the Pi itself.

Prerequisites

Hardware

  • Raspberry Pi 4, 5, or newer
  • 64-bit Raspberry Pi OS
  • 4GB RAM minimum (8GB+ recommended)
  • Heatsink or fan recommended for cooling

Software

  • Git
  • CMake (v3.16+)
  • GCC or Clang
  • Python 3 (optional, for Python bindings)
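
Before going further, it's worth confirming the OS is actually 64-bit, since that's what the build below expects:

# should print "aarch64"; "armv7l" means you're on a 32-bit OS
uname -m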

Step-by-Step Guide

1. Install Required Tools

sudo apt update && sudo apt upgrade -y

# 👇 install dependencies and tools to build
sudo apt install -y git build-essential cmake python3-pip libcurl4-openssl-dev
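With the packages installed, a quick version check confirms the toolchain meets the CMake 3.16+ requirement (exact versions will vary with your OS image):

cmake --version
gcc --version
git --version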

2. Clone and Build llama.cpp

git clone https://github.com/ggerganov/llama.cpp.git

cd llama.cpp

cmake -B build
cmake --build build --config Release -j$(nproc)

This step takes some time - we're compiling the llama.cpp binaries from source, which can run for several minutes on a Pi.
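
Once the build finishes, the binaries land in build/bin. A quick way to confirm the build succeeded (llama-cli should print its usage text):

ls build/bin/
./build/bin/llama-cli --help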

3. Download a Quantized Model

mkdir -p models && cd models

wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_0.gguf

cd ..

We'll use TheBloke's TinyLlama-1.1B-Chat-v1.0-GGUF ( https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF ) for testing - it's small enough to run comfortably even on a 4GB Pi.
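
A quick size check helps catch a truncated download - the Q4_0 file should weigh in at roughly 600 MB:

ls -lh models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf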

4. Run Inference

./build/bin/llama-cli \
  -m ./models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf \
  -p "Hello, Raspberry Pi!"
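The defaults work, but llama-cli exposes a few flags worth knowing. The values below are starting points to tune for your board, not requirements:

# -n caps generated tokens, -t sets CPU threads, -c sets the context size
./build/bin/llama-cli \
  -m ./models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf \
  -p "Explain what a Raspberry Pi is in one sentence." \
  -n 128 -t 4 -c 2048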

Optional: Python Bindings

Note: The Python bindings are maintained in a separate repository, llama-cpp-python.

# 👇 the repo vendors llama.cpp as a git submodule, so clone recursively
git clone --recursive https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python
python3 -m pip install .  # pulls Python dependencies automatically

# or install the published package directly:
# python3 -m pip install llama-cpp-python

Use in Python:

from llama_cpp import Llama

llm = Llama(model_path="./models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf")

# the call returns an OpenAI-style completion dict, not a plain string
output = llm("Hello from Python!", max_tokens=64)
print(output["choices"][0]["text"])
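If you'd rather see tokens appear as they're generated, the bindings also support streaming. A minimal sketch, assuming the same model path as above:

from llama_cpp import Llama

llm = Llama(model_path="./models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf")

# stream=True yields completion chunks as they are produced
for chunk in llm("Write a haiku about a Raspberry Pi.", max_tokens=64, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()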

Conclusion

Running llama.cpp natively on a Raspberry Pi is a geeky thrill. It teaches you about compiler optimizations, quantized models, and pushing hardware to the edge—literally. Bonus points if you run it headless over SSH.
