Originally published at code.pieces.app

How to Run an LLM Locally with Pieces

The demand for local, secure, and efficient machine learning solutions is more prominent than ever, especially for software developers working with sensitive code. At Pieces for Developers, we understand the importance of leveraging Local Large Language Models (LLLMs) not just for the enhanced privacy and security they offer, but also for their offline AI capabilities. Our commitment to a local-first philosophy has led us to support CPU and GPU versions of popular LLMs like Mistral, Phi-2, and Llama 2, with more on the way.

Our focus on privacy and security goes beyond the ability to leverage LLLMs within Pieces Copilot. To ensure all of your data remains on your local machine rather than being transmitted over the internet, we’ve meticulously fine-tuned several small language models so that anytime you store code snippets in Pieces, your code can be automatically enriched with titles, descriptions, tags, related links, and other useful context and metadata, without needing to connect to the cloud.

While small language models can run on almost any machine, running LLMs locally that have 7 billion parameters or more can bring challenges, especially considering the hardware requirements. We still consider our on-device AI “experimental”, as both this implementation and the availability of these models are so new that we can’t always guarantee success for our users. This guide aims to demystify how to run an LLM locally within Pieces, outlining the minimum and recommended machine specs, the best GPUs for running LLMs locally, and how to troubleshoot common issues when using Pieces Copilot in your development workflow.

Understanding Local LLM Hardware Requirements

Navigating the hardware requirements to run an LLM locally is crucial for developers seeking privacy, security, and the convenience of offline access. This section discusses the essential hardware components and configurations needed for efficiently running large language models locally, and how to check your machine’s specs to determine what it can handle.

Minimum and Recommended Specifications

Local LLMs, while offering unparalleled privacy and offline access, require significant machine resources. For an optimal experience, we recommend using machines from 2021 or newer with a dedicated GPU that has more than 6-7GB of available GPU memory (VRAM). Older machines or those without a dedicated GPU may need to opt for CPU versions, which, while efficient, may not offer the same level of performance.

If you receive the error message “I’m sorry, something went wrong with processing…” or the Pieces application crashes while you are using a local model, you may be trying to run a model that requires more resources than your machine has available, or you’ve selected a GPU model without having a dedicated GPU; in either case, you will need to switch models. Unfortunately, if you have an older machine or one with very limited resources, you may have to use a cloud model in order to use the copilot.

GPU vs CPU

Central Processing Units (CPUs) and Graphics Processing Units (GPUs) serve distinct functions within computing systems, often collaborating for optimal performance. CPUs, the general-purpose processors, are key for executing a broad range of tasks, excelling in sequential processing with their limited number of cores. They're crucial for running operating systems, applications, and managing system-level tasks.

Conversely, GPUs specialize in accelerating graphics and data-heavy tasks. Originating in video game graphics, GPUs now handle tasks requiring parallel processing like video editing, scientific simulations, and machine learning, thanks to their thousands of cores. While most computers have both CPUs for general tasks and GPUs for graphics-intensive work, some may only include integrated graphics within the CPU for basic tasks, opting for compact designs over dedicated GPUs.

The Best GPUs for Local LLMs

When it comes to running local LLMs, the GPU plays a pivotal role. Dedicated GPUs with high VRAM are preferable, as they can significantly speed up computations required by these models. NVIDIA's GeForce RTX series and AMD's Radeon RX series are excellent choices, offering a balance between performance and power efficiency.

When it comes to Apple products, the new M-series machines do not use dedicated GPUs, but the integrated GPUs they have are more than sufficient to run local LLMs.

Checking Your Machine Specs

In order to select the best on-device LLM for you, you should first check your machine specifications.

Windows

  1. Right-click on the taskbar and select "Task Manager".
  2. In the Task Manager window, go to the "Performance" tab.
  3. Here you can see your CPU and GPU details. Click on "GPU" to see GPU information.
  4. To see detailed GPU information including VRAM, click on "GPU 0" or your GPU's name.
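
If you prefer a command-line check, similar details are available through PowerShell’s CIM cmdlets. This is a minimal sketch; note that AdapterRAM is reported in bytes and can under-report VRAM on GPUs with more than 4GB, so treat Task Manager as the authoritative figure.

    Get-CimInstance Win32_VideoController | Select-Object Name, AdapterRAM    # GPU name and reported VRAM (bytes)
    Get-CimInstance Win32_ComputerSystem | Select-Object TotalPhysicalMemory  # installed system RAM (bytes)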

Mac

  1. Click on the Apple logo in the top-left corner of the screen.
  2. Select "About This Mac".
  3. In the window that opens, you can see if your machine runs an Intel chip or an Apple Silicon M-series chip.
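
If you’d rather check from the terminal, a minimal sketch using built-in macOS utilities looks like this:

    sysctl -n machdep.cpu.brand_string    # CPU / Apple Silicon chip name
    sysctl -n hw.memsize                  # installed RAM in bytes
    system_profiler SPDisplaysDataType    # GPU or integrated graphics details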

Linux

  1. Open a terminal window.
  2. For CPU information, you can use the lscpu command.
  3. For GPU information, you can use commands like lspci | grep -i vga to list GPU devices.
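
The steps above identify your CPU and GPU, but not your memory. As a rough sketch of the remaining checks (nvidia-smi is only present if NVIDIA’s drivers are installed):

    free -h        # total and available system RAM
    nvidia-smi     # VRAM capacity and usage, NVIDIA GPUs only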

If these commands aren’t available on your distribution, you may need to consult online forums for other options for your Linux machine.

Recommendations

If you’re interested in updating your machine to more efficiently run LLMs locally, you would generally want a GPU with a large amount of VRAM (Video Random Access Memory) to handle the memory-intensive operations involved in processing these models. GPUs with higher CUDA core counts and memory bandwidth can also contribute to faster computation.

In order to ensure your system can handle hefty local LLM hardware requirements, we recommend you double check the available RAM and VRAM based on these specifications:

  • Llama 2 7B, a model trained by Meta AI and optimized for completing general tasks. Requires a minimum of 5.6GB of RAM for the CPU model and 5.6GB of VRAM for the GPU-accelerated model.
  • Mistral 7B, a dense Transformer that is fast to deploy and fine-tuned on code datasets. Small, yet powerful for a variety of use cases. Requires a minimum of 6GB of RAM for the CPU model and 6GB of VRAM for the GPU-accelerated model.
  • Phi-2 2.7B, a small language model that demonstrates outstanding reasoning and language understanding capabilities. Requires a minimum of 3.1GB of RAM for the CPU model and 3.1GB of VRAM for the GPU-accelerated model.

Note that as we continue supporting larger LLLMs like 13B models from Llama 2, these local LLM hardware requirements will change.
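
If you have an NVIDIA GPU, one quick way to compare your free VRAM against the figures above is nvidia-smi’s query mode. This is a sketch that assumes NVIDIA’s drivers are installed; AMD and Apple users will need their platform’s own tools.

    nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv
    # e.g. a GPU reporting 8192 MiB total / 6900 MiB free comfortably clears the
    # ~6GB of VRAM needed for the Mistral 7B GPU model listed above.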

Performance and Troubleshooting

Deciding Which Local Model To Use

In a previous article, we discussed the best LLMs for coding and the tradeoffs between cloud and local LLMs. If you’ve decided to stick with a local model within Pieces for increased security and offline capabilities, you’ll first want to choose which one to use before downloading.

You will see that all of our local models have a CPU and GPU option. Now that you know a little more about your machine and GPU vs CPU, you can use the chart below to decide whether to use a GPU or a CPU version of a model.

A flowchart showing users whether CPU or GPU models are better for their machines.

Once you’ve decided on a GPU or CPU version, the choice of which model to use is largely a matter of preference, as models tend to excel at different types of knowledge (general Q&A, science, math, coding, etc.).

We at Pieces like to use a variety of models to see how answers differ based on each model’s knowledge base and training, and for different purposes. For example, the Pieces team member writing this article likes Phi-2 for its lightweight speed, but Mistral for the quality of its answers. We’ve included some links to model evaluations below, but new information is being released almost daily at this time.

Troubleshooting Common Issues

Encountering crashes or performance issues can often be attributed to exceeding your machine's resource capabilities. Checking your system specifications and comparing them against the requirements of your chosen LLM model is a good first step (we outlined how earlier in this article).

Another common issue on Linux and Windows is a corrupted or outdated Vulkan API, which we use to communicate with your GPU. Vulkan should be bundled with your AMD or NVIDIA drivers, and you can check its health by executing the command vulkaninfo in your terminal and scanning the resulting logs for errors or warnings. Any errors or warnings could indicate that your GPU drivers need to be updated or that there is an issue with the API itself. Please contact the Pieces support team if you believe your Vulkan installation is broken.
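
As a rough sketch of the vulkaninfo check described above (the --summary flag is only available in newer Vulkan SDK builds, so fall back to the full output if it isn’t recognized):

    vulkaninfo --summary                     # detected GPUs plus driver and API versions
    vulkaninfo | grep -iE "error|warning"    # surface any errors or warnings in the full output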

Future-Proofing Your Setup

As technology advances, so do the requirements for running sophisticated models like LLMs. Upgrading to a system with a high-performance GPU and ample RAM can ensure your setup remains capable of handling newer, more demanding large language models locally. Additionally, staying informed about emerging hardware trends can help you make knowledgeable decisions about future upgrades.

While we aimed to make this a complete guide to running LLMs locally, things are evolving quickly, and you may need to reference current research to understand how to run LLMs locally given your hardware and memory requirements.

Join the Discussion

We hope this guide has shed some light on how to run an LLM locally efficiently and effectively within Pieces. We encourage our users to join the discussion on GitHub, share their experiences, and provide feedback. Your input is invaluable as we continue to refine our support for running LLMs locally, ensuring a seamless and productive experience for all our users.

If you’re just getting started with Large Language Models (LLMs), check out our whitepaper in the corresponding link.
