If you are interested in running language models on AWS, please take a look at my other post: Running language models on AWS
What is LLaMa?
You've probably heard of or even used ChatGPT. ChatGPT and similar neural networks are called LLMs (large language models). LLaMa is one such model, developed by Meta AI (formerly Facebook) for research purposes, with a license that isn't suitable for commercial use. You can read more about it here: https://ai.facebook.com/blog/large-language-model-llama-meta-ai/
What is this post about?
Shortly after the release of LLaMa, enthusiasts started exploring this suite of models, training and creating new ones. As a result, we now have Alpaca, Koala, Vicuna, StableVicuna, and many more. In this blog post, I'll describe my experience of trying to run a LLaMa model with 7 billion parameters on my 2018 MacBook 12".
Getting weights, converting and quantizing
All these models normally require a GPU to handle the heavy computations. I don't have any PCs with that kind of GPU, so I started exploring other ways. Right now there is one widely agreed-upon approach to running open-source models at home: llama.cpp. It's free software that converts models to the ggml format (just another format for neural nets, one that runs on CPU and RAM instead of a GPU) and then quantizes them to make them runnable (is that even a word?) even on lower-spec machines. I really struggled throughout the whole process, but as they say, no pain no gain! This guide isn't a one-size-fits-all solution, but rather the approach that worked for me.
Okay, let's start:
Download the weights. Actually, the weights (also called parameters, or, for simplicity, the model itself) are supposed to be requested from Meta, but it seems they no longer provide them to anyone. No weights, no fun! Luckily, there are plenty of torrents floating around that let you download the models pretty quickly. I won't share any links here (and please, refrain from sharing anything in the comment section), but if you google "How do I obtain Llama weights", the first result will probably lead you to a GitHub discussion where you can find the magnet link. Just grab it and put it into your favorite BitTorrent client. Select the model you'd like to download (there are 7B, 13B, 33B and 65B; the larger, the better). I used the smallest one.
Converting the weights and quantizing the model. As I said, we will use llama.cpp: https://github.com/ggerganov/llama.cpp. Just scroll down to the usage section and follow the instructions. Essentially, you clone the repo, make (compile) the binary, put your downloaded model into the models folder, convert it to ggml FP16 format, and then quantize it to 4-bit (there's a rough command sketch below). You will only need the "Build" and "Prepare Data and Run" sections. After this, you should have a .bin file in your models folder, which is the model itself! Then you can just run it and chat with it in your terminal:
./main -m ./models/7B/ggml-model-q4_0.bin -n 128
For the steps above, you need Python installed (it usually comes pre-installed on macOS and most Linux distros). I recommend using miniconda to manage different Python environments.
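To make the above more concrete, here's roughly what the whole sequence of commands looks like. Script names and arguments change between llama.cpp versions, so treat this as a sketch and follow the repo's README rather than copy-pasting:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# copy the downloaded 7B weights (and tokenizer.model) into ./models
python3 -m pip install -r requirements.txt
# convert the weights to ggml FP16 format (the conversion script's name depends on the version you check out)
python3 convert.py models/7B/
# quantize the FP16 model down to 4 bits
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0
# run it
./main -m ./models/7B/ggml-model-q4_0.bin -n 128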
Actually, I thought this blog post would turn out a lot longer. In reality, though, each of my struggles was solved with one or two commands; it was finding those commands that took forever. So I decided to cover some additional points instead:
Are there any bindings for llama.cpp?
Yes, there are bindings for several languages; the supported ones are listed in the llama.cpp repository. A binding is just a convenience library that lets you call the model from your language of choice. For example, in Python you could run something like this:
response = llm(prompt="What is the meaning of life?")
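That one-liner assumes the model is already loaded. A slightly more complete sketch with the llama-cpp-python binding looks something like this (the model path is just an example; point it at wherever your quantized .bin file lives):

from llama_cpp import Llama

# load the quantized ggml model from disk
llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin")

# run a completion and print the generated text
output = llm(prompt="Q: What is the meaning of life? A:", max_tokens=64, stop=["Q:"])
print(output["choices"][0]["text"])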
Is there a UI available for interaction?
Yes, the llama.cpp repo points to a couple of UI solutions. Neither of them worked for me. Maybe I'm just too dumb; anyway, I skipped them and built my own very basic UI with Streamlit and the llama-cpp-python binding. If you want me to share the process and how it works, feel free to let me know in the comments!
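Just to give a rough idea of what gluing Streamlit and llama-cpp-python together can look like, here's a minimal sketch (not my actual app, just the general shape; the model path is again an example):

import streamlit as st
from llama_cpp import Llama

@st.cache_resource  # load the model once instead of on every rerun
def load_model():
    return Llama(model_path="./models/7B/ggml-model-q4_0.bin")

llm = load_model()

st.title("Local LLaMa")
prompt = st.text_area("Your prompt")
if st.button("Generate") and prompt:
    output = llm(prompt, max_tokens=128)
    st.write(output["choices"][0]["text"])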
What is quantization?
Let's ask ChatGPT! "LLMs, such as GPT-3, have a vast number of parameters (175 billion in the case of GPT-3), making them quite resource-intensive. For instance, GPT-3 in FP16 format consumes at least 350GB of memory to store and run. Quantization addresses this problem by representing the weights and activations of the model with low-bit integers. This can significantly reduce GPU memory requirements and accelerate compute-intensive operations like matrix multiplications. For example, using INT8 (8-bit integer) quantization can halve the GPU memory usage and nearly double the throughput of matrix multiplications compared to FP16". In other words, you get a much smaller model at the cost of some output quality, although in practice the difference is generally not that noticeable.
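A quick back-of-the-envelope calculation for the 7B model shows why this matters (weights only, ignoring activations and runtime overhead, so these are ballpark numbers):

params = 7e9                    # 7 billion weights
fp16_gb = params * 2 / 1024**3  # 2 bytes per weight in FP16 -> ~13 GB
q4_gb = params * 0.5 / 1024**3  # ~0.5 bytes per weight at 4-bit -> ~3.3 GB
print(f"FP16: ~{fp16_gb:.1f} GB, 4-bit: ~{q4_gb:.1f} GB")

That kind of reduction is what makes running the 7B model on a laptop realistic at all.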
What are the tech requirements to run this?
See here: https://github.com/ggerganov/llama.cpp#memorydisk-requirements
Why do I need this if I have ChatGPT?
Just for fun! For real tasks I use GPT-4 and Claude. Still, these models are a lot of fun to play with! So far I have only tried LLaMa 7B and Koala 7B, and I'm planning to try Alpaca, GPT4ALL, Falcon and other models. Anyway, these models were made for research, so let's research!
I would be happy to answer any questions you have (if I know the answer, of course)