Running Local LLMs, CPU vs. GPU - a Quick Speed Test

Maxim Saplin on March 11, 2024

Updated on March 14, more configs tested. Today, tools like LM Studio make it easy to find, download, and run large language models on consumer-gra...
 
Maciej Wakuła

This depends a lot on the settings. I tried the same model and the same example query, "tell me about Mars". I have a Ryzen 3900 PRO CPU (12 cores, 24 threads; I got it for less than half the price of a 3900X) and an AMD RX 6700 (non-XT), which I also got cheap. RAM is pretty cheap as well, so 128GB is within reach for most. Using koboldcpp-rocm:

14 layers on GPU, 14 CPU threads: 6 T/s
28 layers on GPU, 14 CPU threads: 15 T/s
30 layers on GPU, 24 CPU threads: 4.43 T/s
35 layers on GPU, 24 CPU threads: 34.61 T/s, using 7.3GB on the GPU in total

I'm writing to show that the results depend very much on the settings.
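To make the spread concrete, here is a small Python sketch that tabulates the four runs reported in this comment and picks the fastest. The numbers are copied from the comment; the function names are made up for illustration.

```python
# (GPU layers, CPU threads, tokens per second) as reported in this comment.
runs = [
    (14, 14, 6.0),
    (28, 14, 15.0),
    (30, 24, 4.43),
    (35, 24, 34.61),
]


def best_run(results):
    """Return the (layers, threads, tok/s) tuple with the highest throughput."""
    return max(results, key=lambda r: r[2])


def spread(results):
    """Ratio between the fastest and the slowest configuration."""
    speeds = [r[2] for r in results]
    return max(speeds) / min(speeds)


layers, threads, tps = best_run(runs)
print(f"best: {layers} GPU layers, {threads} threads -> {tps} tok/s")
print(f"fastest/slowest: {spread(runs):.1f}x")
```

On these four data points the fastest setting is roughly 7.8 times quicker than the slowest, on the same hardware and the same model.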

Maxim Saplin

JIC, I tested the pure cases: 100% CPU and 100% offloading to GPU.

Orlando Arroyo

How did you get to use 100% of the CPU? Which config or settings did you have?

Maciej Wakuła • Edited

You can offload all layers to the GPU (CUDA, ROCm) or use a CPU implementation (e.g. HIPS). Just run LM Studio for your first steps. Run koboldcpp or koboldcpp-rocm as a second step. Then try Python and transformers. From there you should know enough about the basics to choose your direction. And remember that offloading everything to the GPU still consumes CPU.

[Image]

This is a peak when using full ROCm (GPU) offloading. See the CPU usage on the left: the initial CPU load comes from starting the tools, and the LLM ran during the peak at the end. There is GPU usage, but the CPU is used as well.

[Image]

And this is Windows; ROCm is still very limited on other operating systems :/
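For the "Python" step, here is a minimal sketch of the two pure cases (everything on the GPU, or nothing on the GPU). It assumes the llama-cpp-python package, which is not mentioned in this thread; it is used here only because its `Llama` constructor exposes `n_gpu_layers` and `n_threads` directly. GPU offloading also assumes the package was built with CUDA or ROCm support, and `model.gguf` is a placeholder path.

```python
def llama_kwargs(model_path, mode, n_threads=None):
    """Build constructor arguments for llama_cpp.Llama.

    mode: "gpu" offloads every layer, "cpu" offloads none.
    n_threads: CPU threads to use; None leaves the library default.
    """
    if mode == "gpu":
        n_gpu_layers = -1  # -1 asks llama.cpp to offload all layers
    elif mode == "cpu":
        n_gpu_layers = 0   # keep every layer on the CPU
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    kwargs = {"model_path": model_path, "n_gpu_layers": n_gpu_layers}
    if n_threads is not None:
        kwargs["n_threads"] = n_threads
    return kwargs


def load(model_path, mode, n_threads=None):
    # Imported lazily so the helper above works without the package installed.
    from llama_cpp import Llama
    return Llama(**llama_kwargs(model_path, mode, n_threads))


# "model.gguf" is a placeholder path, not a file from this thread.
print(llama_kwargs("model.gguf", "gpu"))
print(llama_kwargs("model.gguf", "cpu", n_threads=24))
```

Partial offloading, as in the koboldcpp runs earlier in the thread, would correspond to passing a positive layer count instead of -1 or 0.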

Bharath B

Intel i7 14700K - 9.82 tok/s with no GPU offloading (peaked at 35% CPU usage in LM Studio; I'm guessing an issue with multithreading)
Zotac Trinity non-OC 4080 Super - 71.61 tok/s with max GPU offloading

All numbers were measured on a non-overclocked, factory-default setup.
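For anyone who wants to reproduce numbers like these outside LM Studio, here is a minimal Python sketch of how a tokens-per-second figure can be computed. The `generate` callable is hypothetical: it stands for whatever function runs the model and returns the generated tokens. The fake generator below only exists so the sketch runs on its own.

```python
import time


def tokens_per_second(generate, prompt):
    """Time one call to `generate` and return generated tokens per second.

    `generate` is a hypothetical callable that takes a prompt and returns
    a list of generated tokens.
    """
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed


def fake_generate(prompt):
    # Stand-in for a real model: pretends to produce 10 tokens in 50 ms.
    time.sleep(0.05)
    return ["tok"] * 10


rate = tokens_per_second(fake_generate, "tell me about Mars")
print(f"{rate:.1f} tok/s")
```

This measures the whole call, so prompt processing time is included. Tools may report generation speed separately, which is one reason numbers from different tools are not always directly comparable.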

Maxim Saplin

Thanks for sharing the numbers!

Orlando Arroyo

Indeed there’s something odd with the multithreading of the CPUs

Orlando Arroyo • Edited

Just for fun, here are some additional results:

iPad Pro M1 256GB, using LLM Farm to load the model: 12.05 tok/s
Asus ROG Ally Z1 Extreme (CPU): 5.25 tok/s using the 25W preset, 5.05 tok/s using the 15W preset

Update:
Asked a friend with an M3 Pro (12-core CPU, 18GB). Running on CPU: 17.93 tok/s, GPU: 21.1 tok/s

Maxim Saplin

The CPU result for the ROG Ally is close to the one from the 7840U; after all, they are almost identical CPUs.

clegger

The ROG Ally has a Ryzen Z1 Extreme which appears to be nearly identical to the 7840U, but from what I can discern, the NPU is disabled. So if / when LM Studio gets around to implementing support for that AI accelerator the 7840U should be faster at inferencing workloads.

Maxim Saplin

AMD GPUs seem to be the underdog in the ML world compared to Nvidia... I doubt that AMD's NPU will see better compatibility with the ML stack than its GPUs.

Orlando Arroyo

Adding some info here:

Running on a Razer Blade 2021 with a Ryzen 5900HX, a GeForce 3070 Ti, and 16GB RAM, I got 41.75 tok/s. I used the same test as you, asking about Mars on the same model.

Hope that adds information to this very interesting topic.

Maxim Saplin

Thanks for the contribution! I assume you used 100% GPU offloading, right? Just checking :)

Orlando Arroyo

Indeed, 100% GPU offloading.

I also tested a Ryzen 7950X with 0% offloading, but there’s something odd. I set 32 threads, but CPU use does not go beyond 60% and it only gets 7 tok/s. Any thoughts on a possible cause?

Just for fun, I’ll check with an Asus ROG Ally later (Z1 Extreme version).

Maxim Saplin

It seems the threads param is ignored. I saw the same behaviour when testing CPU inference.

Nicolay • Edited

On my RTX 3050 the speed was 28.6 tok/s.
Based on the comments above, I made a table.

GPU            VRAM    Speed (tok/s)
RTX 3050       8GB     28.6
RTX 3070 Ti    8GB     41.75
RTX 4060       8GB     37.9
RTX 4070       12GB    58.2
RTX 4080       8GB     78.1

Maxim Saplin

Are all those video cards desktop ones?

Orlando Arroyo

Just a quick update: an RTX 4070 Super gets 58.2 tok/s.

Oliver Stutz

78.51 tok/s with an AMD 7900 XTX on the ROCm-supported version of LM Studio with Llama 3
33 GPU layers (all while sharing the card with the screen rendering)

Andy Harris

Here's some additional config data for the list.

Laptop - 7940HS + 32GB RAM + RTX 4070 (8GB)
GPU only - RTX 4070 mobile (8GB) = 30.69 T/s
CPU only - 7940HS + 32GB RAM = 8.28 T/s
Note: I'm not sure why the 4070 is posting lower than the 4060 mobile.

Desktop - R5 3600X + 80GB RAM + RX 6800 XT (16GB)
GPU only - Radeon RX 6800 XT (16GB) = 52.92 T/s
CPU only - R5 3600X + 80GB RAM = 4.07 T/s
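As a quick way to compare the two machines, here is a short Python sketch that turns the figures in this comment into GPU-over-CPU speedup ratios. The numbers are taken from the comment; the function name is made up for illustration.

```python
# (label, GPU-only T/s, CPU-only T/s) as reported in this comment.
machines = [
    ("Laptop: RTX 4070 mobile vs 7940HS", 30.69, 8.28),
    ("Desktop: RX 6800 XT vs R5 3600X", 52.92, 4.07),
]


def gpu_speedup(gpu_tps, cpu_tps):
    """How many times faster the GPU-only run is than the CPU-only run."""
    return gpu_tps / cpu_tps


for label, gpu_tps, cpu_tps in machines:
    print(f"{label}: {gpu_speedup(gpu_tps, cpu_tps):.1f}x")
```

That works out to roughly 3.7x on the laptop and 13.0x on the desktop, so the gain from offloading depends heavily on which CPU and GPU are paired.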

Maxim Saplin

There are different power levels for 4xxx mobile GPUs, 40-140W. The 4070 might come in a thinner laptop with a TGP of around 40W. My 4060 Mobile has a 105W TGP.

Andy Harris

Good point! I'll check later and post an update.

clegger

In these tests is the 7840U utilizing the integrated NPU to accelerate the workload?

Maxim Saplin

The result for "780M iGPU" is indeed the result from the GPU integrated into the 7840U APU.

clegger • Edited

@maximsaplin GPU != NPU
They are distinct accelerators.

Maxim Saplin

The NPU is not mentioned anywhere.