Local AI

Sophia Parafina

Local AI is the ability to run artificial intelligence models on commodity hardware. Many models can run on a CPU alone, although a GPU is an advantage. This means that anyone with a relatively recent desktop or gaming rig can build AI applications, create a domain-specific model through RAG or fine-tuning, or simply experiment and learn about AI, all without standing up cloud infrastructure or paying a subscription fee to a service.

We'll look at eight local AI applications and projects you can run. Note that these are not turnkey solutions; some require minimal programming experience, while others require compiling the application from source.

What Makes Local AI Possible

As their name indicates, Large Language Models are large. LLMs have parameters, also called weights and biases, that transform the input to the model and shape the resulting output. You'll often see LLMs named with their parameter count, such as 7B or 13B in the case of the Llama 2 models. To put this in perspective, a billion parameters stored in 32-bit floating point format take approximately 4 GB. OpenAI's GPT-4 reportedly has 1.76 trillion parameters, or 1,760 billion parameters at 4 GB per billion, resulting in a roughly 7,040 GB model! LLMs run in memory, and it takes dedicated hardware to run a model that size. I recommend this article to understand the tradeoffs between parameter count and training data size.
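
To make that arithmetic concrete, here is a quick back-of-the-envelope calculation in Python. It only estimates the weights themselves and ignores activations, KV cache, and runtime overhead; the parameter counts are the ones mentioned above.

```python
# Rough memory footprint of model weights: parameters x bytes per parameter.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

def weight_memory_gb(params_billions: float, dtype: str) -> float:
    """Approximate size of the weights alone, in gigabytes."""
    return params_billions * BYTES_PER_PARAM[dtype]

for params in (7, 13, 70, 1760):  # 7B, 13B, 70B, and GPT-4's reported 1.76T
    print(f"{params}B parameters @ FP32: {weight_memory_gb(params, 'FP32'):.0f} GB")
```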

We can run LLMs on consumer hardware because of quantization, a process that converts high-precision data types, e.g., float32, to a lower-precision data type such as a 4-bit integer (INT4). Quantization reduces the precision of a model because it approximates the values of the parameters. However, larger models with over 70 billion parameters can be quantized to INT4 with minimal impact on performance. A 70-billion parameter model in FP16 (2 GB per billion parameters) is approximately 140 GB. Converting the model to INT4 could result in a 35 GB model, which is well within the capabilities of modern desktop computing. I've found the following articles helpful for learning more about quantization.
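
To get a feel for what quantization does to the numbers themselves, here is a minimal sketch of symmetric 4-bit quantization of a block of float32 weights using NumPy. It is a toy illustration of the idea, not the scheme any particular runtime uses.

```python
import numpy as np

# Toy symmetric INT4 quantization: map float32 weights to integers in [-7, 7]
# using a single scale per block, then dequantize and measure the error.
rng = np.random.default_rng(0)
weights = rng.normal(0, 0.02, size=4096).astype(np.float32)

scale = np.max(np.abs(weights)) / 7                       # one scale for the whole block
quantized = np.clip(np.round(weights / scale), -7, 7).astype(np.int8)
dequantized = quantized.astype(np.float32) * scale

print("original bytes: ", weights.nbytes)                 # 4 bytes per weight
print("quantized bytes:", quantized.nbytes // 2)          # two INT4 values pack into one byte
print("mean abs error: ", np.mean(np.abs(weights - dequantized)))
```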

There are tradeoffs as the size of an LLM decreases. However, changes in performance vary depending on the model and the quantization method used. Some models will be better suited for specific tasks, and experimentation is necessary to determine if a model meets the application requirements.

Local AI Software and Projects

This is a TL;DR of local AI software and projects.

Ollama

Ollama runs many foundation models, including LLMs and multi-modal models. The application runs on Linux and macOS; Windows support is planned for a future release. There is an official container image on Docker Hub, and detailed documentation is available on GitHub.

To run an Ollama development environment using Docker, check out Localstack AI.
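
Once Ollama is running, it exposes a REST API on port 11434. Here is a minimal sketch that calls it from Python with the requests library; it assumes you have already pulled a model (llama2 in this example) with `ollama pull`.

```python
import requests

# Ask a locally running Ollama server for a completion.
# Assumes `ollama pull llama2` has been run and the server is on the default port.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",
        "prompt": "Explain quantization in one sentence.",
        "stream": False,  # return a single JSON object instead of a stream
    },
    timeout=120,
)
print(response.json()["response"])
```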

Hugging Face

Hugging Face offers over 120,000 models, 20,000 datasets, and 50,000 demo applications. You can run many of these models locally with the transformers library, which is installed as a Python package; this method requires compiling some packages locally. The installation can include PyTorch and TensorFlow, supporting a wide range of models.

If you're installing on an Apple M1/M2/M3 machine, check out this [post](https://dev.to/spara_50/installing-hugging-face-on-an-m2-macos-machine-3kla) if you run into an error.
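
For a first experiment with transformers, the pipeline API is the shortest path. A minimal sketch follows; the model name is just an example, and any text-generation model you have the memory for will do.

```python
from transformers import pipeline

# Downloads the model from the Hugging Face Hub on first run and caches it locally.
generator = pipeline("text-generation", model="distilgpt2")

result = generator("Local AI means", max_new_tokens=40)
print(result[0]["generated_text"])
```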

LocalAI

LocalAI is an open-source application that can run LLMs, Stable Diffusion, and other multi-modal models. It provides a REST API compatible with the OpenAI API specification, so you can develop locally without connecting to OpenAI and use the same code in production. You can build LocalAI from source or use the binaries available on GitHub.
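
Because LocalAI mirrors the OpenAI API, the official openai Python client can talk to it by pointing the base URL at your local server. A sketch, assuming LocalAI is listening on its default port 8080 and exposes a model named gpt-3.5-turbo (the model name depends on how you configured your instance):

```python
from openai import OpenAI

# Point the OpenAI client at the local server instead of api.openai.com.
# The api_key is required by the client but ignored by LocalAI.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",  # whatever model name your LocalAI instance exposes
    messages=[{"role": "user", "content": "What is local AI?"}],
)
print(completion.choices[0].message.content)
```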

Llamafile

Llamafile lets you build and distribute LLM applications as a single executable file. It runs on Windows, macOS, Linux, OpenBSD, FreeBSD, and NetBSD. There are example llamafiles for both command-line and server use. Llamafile combines llama.cpp and Cosmopolitan Libc to produce an executable that builds once and runs anywhere.

LLM

LLM is a CLI and Python library that works with both remote services and local models. It is installed as a Python package and managed with pip. LLM uses a plugin architecture to run local models and call remote APIs. It has extensive documentation, including a section on creating embeddings with the CLI.
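
Here is a minimal sketch of LLM's Python API. It assumes a local-model plugin such as llm-gpt4all is installed, and the model identifier is only an example; check the LLM documentation and `llm models` for what is actually available on your machine.

```python
import llm

# Assumes: pip install llm llm-gpt4all
# The model identifier below is an example; `llm models` lists what you actually have.
model = llm.get_model("orca-mini-3b-gguf2-q4_0")

response = model.prompt("Summarize what a local AI stack is.")
print(response.text())
```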

GGML

GGML is a tensor library, with example CLIs, for running LLMs. Using ggml requires compiling the source code and familiarity with building software on your operating system of choice. A quantization tool is included in the ggml repository, which allows you to quantize your own models. GGML is used by llama.cpp, whisper.cpp, LM Studio, and llamafile.

LM Studio

LM Studio is an application for running compatible LLMs (Llama, Mistral, Mixtral, Phi-2, and vision-enabled models) from Hugging Face. Binaries are available for macOS, Windows, and Linux.

MLX-LLM

Apple computers with M1/M2/M3 silicon do not use Nvidia GPUs. Apple's machine learning research team released MLX (ml-explore), which speeds up computation on Apple silicon and allows LLMs to run without quantization. MLX-LLM is in its early stages and currently supports Llama 2, Mistral, and OpenHermes.

Summary

Local AI is a rapidly growing field. Expect more applications and code to come.
