Deepak Patil

How to Efficiently Run Meta LLaMA on a MacBook Air M1 with Limited RAM

Running advanced AI models like Meta's LLaMA on a MacBook might seem ambitious, especially on an M1 with only 8 GB of RAM. But with the right steps, you can start building AI apps locally on your Mac with ease. Thanks to Apple's processor architecture and efficient libraries like llama.cpp, you can unlock the power of large language models right from your lightweight laptop.

Let's get your MacBook Air M1 set up to run these models efficiently.

Step 1:

Download the model weights from Meta's official site by filling in your details and accepting the usage terms.

Install the necessary packages as described in the README file.

Then run the command below to start the model:

torchrun \
  --nproc_per_node=$NGPUS \
  llama_models/scripts/example_chat_completion.py $CHECKPOINT_DIR \
  --model_parallel_size $NGPUS
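
For reference, the placeholders in that command would be set along these lines (the values are only illustrative, not from the original post):

# illustrative values; point CHECKPOINT_DIR at wherever you downloaded the weights
NGPUS=1
CHECKPOINT_DIR=~/llama-models/Llama3.1-8B-Instruct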

Definitely, this is not going to work on 8 GB of RAM ☹️. To get around it, we will combine two things:

  • First, we will use llama.cpp, which provides a lightweight C++ implementation for running models on a wide range of hardware.

  • Second, we will use quantization to shrink the model so that it fits in memory (see the rough size math after this list).
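
As a rough back-of-the-envelope (the numbers are approximate, not measured):

8B parameters x 2 bytes (FP16)      ≈ 16 GB -> far too big for 8 GB of RAM
8B parameters x ~0.6 bytes (Q4_K_M) ≈ 5 GB  -> fits, with room left for the KV cache and macOS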


Step 1 (This one will work):

Install llama.cpp using brew.

brew install llama.cpp
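
To confirm the install, check that the CLI is on your PATH (this should just print version and build information):

llama-cli --version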

Step 2:

Now let's quantize the Llama model. For this there is a very handy HuggingFace Space called GGUF-My-Repo. Follow the link below to open the Space.

GGUF-MY-REPO

On this Space, log in with your HuggingFace credentials, then select the model repository that you want to quantize. For Llama, you need to have been granted access to the repo. Select the checkbox 'Create a private repo under your username'. If you are a beginner, leave the other options at their defaults and proceed.

Once the process finishes, you will have a quantized model in your private repository.
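
If you would rather quantize locally instead of using the Space, a rough sketch using llama.cpp's own tools looks like this (paths and file names below are examples, not from the original post):

# clone llama.cpp for its conversion script
git clone https://github.com/ggerganov/llama.cpp
pip install -r llama.cpp/requirements.txt
# convert the downloaded HuggingFace weights to a 16-bit GGUF file
python llama.cpp/convert_hf_to_gguf.py ./Meta-Llama-3.1-8B-Instruct --outfile llama-3.1-8b-f16.gguf
# quantize it down to 4-bit (llama-quantize ships with llama.cpp)
llama-quantize llama-3.1-8b-f16.gguf meta-llama-3.1-8b-q4_k_m.gguf Q4_K_M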

Step 3:

Clone the HuggingFace repo you just created onto your Mac, then run the model using the command below:

llama-cli -m GGUF_MODEL_FILE_NAME -n 1024 -ngl 1 -c 512 --prompt PROMPT -cnv

For example:

llama-cli -m meta-llama-3.1-8b-q4_k_m.gguf -n 1024 -ngl 1 -c 512 --prompt "Hello" -cnv
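
Here -m is the GGUF model file, -n caps the number of tokens to generate, -c sets the context size, -ngl offloads layers to the GPU (Metal), and -cnv starts an interactive conversation. On 8 GB of RAM you may want to keep the context modest while offloading all layers to Metal; something along these lines (the settings are a suggestion, tune them for your machine):

llama-cli -m meta-llama-3.1-8b-q4_k_m.gguf -n 256 -c 2048 -ngl 99 --prompt "Hello" -cnv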

You can also run the model directly from a HuggingFace repository name:

llama-cli --hf-repo Deepcodr/llama_sample_chat-Q4_K_M-GGUF --hf-file llama_sample_chat-q4_k_m.gguf -p "The meaning to life and the universe is"




Tip: If you are a beginner, avoid base models if you don't want gibberish or random responses. Use instruction-tuned (chat) models instead. You can find some already quantized models here
