
Joel Kang for Dala


A practical guide to deploying Large Language Models Cheap, Good *and* Fast

Not every company can or wants to rely on 3rd party Large Language Models (LLMs) for their product features; some would rather deploy open source LLMs to support their use cases.

With all the current hype, one would think that would be easy to do, but as I learned while writing this post, the toolsets available for this are surprisingly nascent. Join me as I go through the steps of how I got a usable language model deployed on Kubernetes (specifically on AWS’s EKS). Most of the steps should apply to any k8s cluster, but I do have a section below specifically for EKS and CUDA drivers.

Along the way, I run through how I wrangled my environment to run the tools I tried, and which github issues and PRs reveal hidden commands and setup requirements you might have missed. Hopefully it saves you a few rabbit holes.

tl;dr, here are the things I tried:

  • TitanML’s Takeoff Server
  • MLC-LLM
    • A CUDA Segue
  • Llama.cpp
  • HuggingFace’s Text Generation Inference

And some that I haven’t (yet!)

  • vLLM
  • OpenLLM
  • FastChat


My company, Dala, serves as your company’s system of intelligence, sitting on top of your various systems of record (aka where your work gets done), to give you that bird’s eye view of how knowledge is flowing within your team. This means that we have access to internal company data — data that you might not want us to send to other 3rd parties. That’s why it’s imperative for us to be able to provide our services with software fully within our control.

In order to help you gain context and awareness of the knowledge that your co-workers are creating, we[1] use LLMs to analyse these pieces of internal knowledge so that our search and summarisation features can give you a quick high level overview before you decide whether or not you want to dig into the sources. The main use case we’ll go through here is to analyse some search results:

A video preview of the Dala search analysis

Choosing a model

Thanks to the wonderful folks at DataSpartan for their compute (and emotional support ❤️), we were able to test a variety of commercially available LLMs, starting with the various sizes of Llama2. In a series of non-scientific experiments, using OpenAI’s GPT-4 as a benchmark, we combined various models and prompts in service of answering the question: What is the

  1. biggest model we can use
  2. that will fit into the most commercially viable hardware we can get our hands on,
  3. that will be fast enough, and
  4. that will return us well-formed and meaningful JSON.

We landed on Vicuna-13B-v1.5, which is a Llama2 based chat model fine-tuned on instructions. We found that this size of model is the perfect balance between the accuracy of output that we wanted, and the speed and compute required to run this model. It also consistently gave us JSON in the format we asked—sometimes even without us prompting!

Llama2, while being pretty decent with its analysis, was always too eagerly “Happy to help” 👀, and sometimes would generate a short JSON summary and then further analyse the JSON in prose. Weird.
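
The “well-formed JSON” requirement can be checked mechanically. Below is a minimal Python sketch, assuming the model’s reply is available as a plain string. The function name parse_strict_json is made up for illustration and is not part of any of the tools discussed in this post. It rejects replies where prose comes before or after the JSON, like the Llama2 behaviour described above.

```python
import json
from typing import Optional


def parse_strict_json(model_output: str) -> Optional[dict]:
    """Return the parsed object only if the whole reply is one JSON object.

    Hypothetical helper: any prose before or after the JSON makes
    json.loads raise, so such replies are rejected and None is returned.
    """
    try:
        parsed = json.loads(model_output.strip())
    except json.JSONDecodeError:
        return None
    # We asked for a JSON object, so reject bare strings, numbers or lists.
    return parsed if isinstance(parsed, dict) else None


clean = '{"summary": "Three documents discuss the Q3 roadmap."}'
chatty = 'Happy to help! ' + clean
trailing = clean + '\nThe JSON above shows that...'

for reply in (clean, chatty, trailing):
    print(parse_strict_json(reply) is not None)
```

A stricter version could also validate the keys against the schema requested in the prompt.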


In tech (and in life) we’re always asked to choose 2 out of the holy trinity of good, cheap, and fast. But don’t underestimate the unyielding stubbornness of a startup founder as I go on this journey of trying to choose all three. Let’s begin by enumerating our three requirements:

  • Good: At minimum, the outputs from our chosen model must be meaningful: the prompt should be attended to completely, and the model should not output gibberish. Here, we’re not testing the model itself, since we’ve already chosen a model above; this tests the goodness of the deployment.

  • Cheap: We’re at a very early stage with our product and have few users, so we want to keep our costs low. [2]

  • Fast: The output needs to be fast enough — no one is going to use a search system if it takes seconds for usable results to appear.

This translates to a few things in our deployment:

  • Good: Given that we’ve taken the time to choose the right prompt and model for our use-case, we should make sure that we maintain that level of quality in our deployment. Ideally we don’t change the chosen model, or have to reword our prompt to take up fewer tokens.

  • Fast: Streaming responses is a hard requirement. As long as we’re streaming tokens faster than the user can read, the user knows that something is happening, and doesn’t have to wait for the full output (which can take seconds or even minutes). Ideally this is as fast as the average reading speed, if not faster. Taking into consideration that we might need additional time to update the UI and for other such data transfer, let’s set this to about 10 tokens/s.

  • Fast: Ideally, we’re going to require at least a concurrency of 2. We won’t have many users to begin with, so the likelihood of more than two users making a request at exactly the same time is low, but non-zero. Users should not have to queue up for any ongoing request to finish completely before theirs is kicked off.

  • Cheap: Keeping the above requirements in mind, to achieve a similar speed solely on CPU requires a really large node with many vCPUs, at which point it actually becomes cheaper to just get a small GPU node. So we’re going to use the smallest cloud GPU available to us, an NVIDIA Tesla T4. What’s most important is the VRAM size, since we’re going to need to squeeze the entire model into the GPU. As we’ll see later, we can play with the number of nodes, or the GPU type itself, but there are cost and performance implications for both.
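
To check a deployment against the 10 tokens/s target, something like the following Python sketch could be used. It assumes the server’s streaming response has already been wrapped in an iterator that yields one token (or chunk) at a time. That wrapper, and the names tokens_per_second and TARGET_TOKENS_PER_SECOND, are hypothetical. Note that the measured rate includes the wait for the first token.

```python
import time
from typing import Callable, Iterable

TARGET_TOKENS_PER_SECOND = 10  # the target set in the requirements above


def tokens_per_second(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Consume a token stream and return the average tokens per second.

    The clock starts before the first token arrives, so prompt
    processing time counts against the rate.
    """
    start = clock()
    count = 0
    for _ in stream:
        count += 1
    elapsed = clock() - start
    return count / elapsed if elapsed > 0 else float("inf")


def fake_stream(n_tokens: int, delay_s: float):
    """Stand-in for a real streaming client (an assumption, for the demo)."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"tok{i}"


rate = tokens_per_second(fake_stream(20, 0.01))
print(f"{rate:.1f} tokens/s, meets target: {rate >= TARGET_TOKENS_PER_SECOND}")
```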


Let’s get started by recognising that while we want to use a 13B parameter model, each parameter is a 32-bit (=4 bytes) floating-point number. That’s going to require 13 Billion ⨉ 4 Bytes = 52 GB of VRAM. Because that violates our Cheap, and potentially our Fast, requirement, we’re going to have to quantise the models.

There are already some good resources on what quantisation is, so the high level takeaway here is that we can compress our models by storing each parameter as a 4-bit (=0.5 bytes) integer, at the cost of only a minimal drop in performance. 13 Billion ⨉ 0.5 Bytes = 6.5 GB, which fits nicely into an NVIDIA T4’s 16 GB of VRAM.
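
The same arithmetic can be written as a small Python sketch. It only counts the weights themselves (no activations, cache or CUDA overhead), and the function name weight_memory_gb is made up for illustration.

```python
T4_VRAM_GB = 16  # VRAM of an NVIDIA T4, as stated above


def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 13e9  # a 13B parameter model

for bits in (32, 16, 8, 4):
    size = weight_memory_gb(N_PARAMS, bits)
    verdict = "fits" if size < T4_VRAM_GB else "does not fit"
    print(f"{bits:>2}-bit: {size:5.1f} GB, {verdict} in {T4_VRAM_GB} GB")
```

Fitting the weights is necessary but not sufficient: whatever is left over has to hold everything the server allocates at generation time.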


While I may be stubborn, I’m not completely delusional, so I acknowledge that there’s still a tradeoff to be made. Here we’re accepting that there’ll be a small drop in the goodness of the model — determining exactly what this means is an open area of research, so we take into consideration two metrics:

  1. How closely would a human (i.e. me) accept that the model’s output adheres to my prompt’s instruction?
  2. The calculated perplexity of a given model (lower is better).

Note that neither of these, not even both together, will be a true representation of how good a model is, but they’re good approximations, and good for comparing between different quantisation schemes.
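
For reference, perplexity is the exponential of the average negative log-likelihood per token. Here is a minimal Python sketch, assuming you already have the natural-log probability that the model assigned to each token of a held-out text (how you obtain those depends on your inference tool and is not shown):

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities (lower is better)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_negative_log_likelihood = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_negative_log_likelihood)


# A model that gives every token a probability of 1/4 is, on average,
# as uncertain as a uniform choice between 4 options.
print(perplexity([math.log(0.25)] * 8))
```

Comparing this number for the same text across quantisation schemes gives a rough sense of how much each scheme degrades the model.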

Giving up some goodness for cheapness and fastness is, in my view, an okay tradeoff, since a drop in output quality is much more difficult to quantify than a drop in speed or an increase in costs.

If your generated search result analysis is so-so, users can still ignore it. As long as from time to time it has interesting insights that can save them time or give them new ideas, they’ll be quite resilient against quality variance.

If your search analysis is slow, however, it might be faster for them to just read the source document and always ignore your analysis (or worse they might prefer to manually dig through the systems of record instead of searching).

Cost is an interesting lever to manage for LLMs, since the cost levers you pull are never on a simple unit scale—it’s not the case that I can simply increase VRAM by 1GB, or choose a 1GB larger model, or serve N more queries with the next best instance type or size. Additionally, because the cold-start time for loading a model into GPU memory can be quite long, spot or serverless instances (where the unit economics are a little bit closer to such a mental model) are also often not viable.

Larger companies are able to amortise these costs over their many users, but until we get there, the unit economics matter a lot, so we’ll want to try to minimise this as much as we can (without going insane trying to save every single dollar).

Taking off with TitanML

I started this journey trying out the Takeoff Server by my friends at TitanML 👋. With a single command, you can pass in a local or HuggingFace model, and it’ll quantise and serve the model for you. They even just released support for batched (concurrent) inference (though not yet for streaming). Their community version only quantises to int8, which means that a 13B model takes about 13 GB of VRAM. Given the length of some of our prompts, we hit the infamous CUDA_OUT_OF_MEMORY error on our T4. That means that, unless we can get access to their pro version, we’ll have to stick with int4 quantisation for now.
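
To get a rough feel for why a 13 GB int8 model is a tight fit on a 16 GB T4, here is a back-of-the-envelope Python sketch of the attention key/value cache, which grows with prompt length. The layer count (40) and hidden size (5120) are those of the Llama2 13B architecture. Storing the cache in 16-bit floats is an assumption and may not match what Takeoff does internally, and activations and CUDA overhead are ignored entirely.

```python
def kv_cache_bytes_per_token(n_layers: int, hidden_size: int, bytes_per_value: int) -> int:
    """One key vector and one value vector per layer for each token."""
    return 2 * n_layers * hidden_size * bytes_per_value


LLAMA2_13B_LAYERS = 40
LLAMA2_13B_HIDDEN_SIZE = 5120
CACHE_BYTES_PER_VALUE = 2  # ASSUMPTION: cache kept in 16-bit floats

INT8_WEIGHTS_GB = 13.0  # from the text: 13B parameters at 1 byte each
T4_VRAM_GB = 16.0

per_token = kv_cache_bytes_per_token(
    LLAMA2_13B_LAYERS, LLAMA2_13B_HIDDEN_SIZE, CACHE_BYTES_PER_VALUE
)

for prompt_tokens in (500, 1000, 2000):
    cache_gb = per_token * prompt_tokens / 1e9
    headroom = T4_VRAM_GB - INT8_WEIGHTS_GB - cache_gb
    print(f"{prompt_tokens} tokens: cache {cache_gb:.2f} GB, headroom {headroom:.2f} GB")
```

Under these assumptions each concurrent long prompt eats a noticeable share of the few GB left after loading the weights, before any other runtime allocations are counted.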

During this time, I also tried to run Takeoff in CPU mode, running it on a c5.4xlarge node with 16 vCPU and 32 GiB RAM costing $0.808 per hour in London. It was noticeably slower, taking many seconds to generate just a single token. Compared to a g4dn.xlarge node with the T4 GPU at $0.615 per hour in London, it became clear to me that GPU acceleration would be more cost effective just based on the barrier to entry alone.
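
Using the on-demand prices quoted above, the monthly difference for one always-on node can be sketched in Python as follows. The 30-day month and the single-node setup are simplifying assumptions, and prices change, so treat the output as illustrative only.

```python
HOURS_PER_MONTH = 24 * 30  # ASSUMPTION: a 30-day month, node always on


def monthly_cost_usd(hourly_usd: float, nodes: int = 1) -> float:
    """On-demand cost of keeping `nodes` instances running for a month."""
    return hourly_usd * HOURS_PER_MONTH * nodes


# Hourly London prices as quoted in this post.
instances = {
    "c5.4xlarge (CPU only)": 0.808,
    "g4dn.xlarge (T4 GPU)": 0.615,
}

for name, hourly in instances.items():
    print(f"{name}: ${monthly_cost_usd(hourly):,.2f} per month")
```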

That said, if you only need a smaller model (<=7B parameters), or have a larger GPU (an A10 would be the next step up with 24 GB VRAM), you should absolutely try out Takeoff.

It has wonderful reference documentation and guides, no major gotchas, and because they chose Docker as their main abstraction, there’s almost zero setup time. The tradeoff is that there’s not much you can change with how the model is quantised, which is true in our case, so let’s try something else.

Compiling Models with MLC

My favourite AI podcast Latent Space recently released this episode discussing the Machine Learning Compilation group and its various projects aiming at making LLMs available on all kinds of consumer hardware (including phones and browsers, cool!)

The MLC-LLM project allows you to run LLMs on a variety of hardware and allegedly outperforms existing inference tooling in speed. I’m a huge fan of compilation in general, so I thought I’d give MLC a test run.

MLC has a lot of documentation, which generally is a good thing, but there are a lot of different and sometimes related components that are named somewhat generically, so it can be confusing at first to wrap your mind around.


The easiest way to get everything you need is to follow the installation instructions and pip install the correct wheels for your platform. Despite the instruction to install gcc for conda, I would advise against doing that unless you actually have a libstdc++ issue. I blindly followed that instruction at first, and somewhere down the line there were a lot of conflicts, because the gcc version in my conda env differed from the gcc version of the libraries outside of conda that MLC would try to link to.

At this point, I realised that the default Amazon Linux 2 machine images with EKS and GPU support came with only CUDA 11.4 installed, so I was in a bit of a pickle. That meant that none of the wheels MLC provided would support my setup. (It turns out there was an additional issue, which I didn’t know about at the time, preventing pip from finding the right binaries on Amazon Linux 2 based AMIs.) Trying to build the main compiler dependency, TVM, from source proved to be impossible, so I wound up having to find a way to upgrade CUDA instead (see the CUDA Segue below).

If your various platforms are well supported, then the two main wheels you’re asked to install are mlc-ai and mlc-chat. The former imbues you with the core dependencies to compile and run models, while the latter gives you the necessary python code to actually run the compiled model either via the python or REST APIs. If you further want to run the compiled model via the CLI, you’ll also need mlc-chat-cli.


To compile a model, you’re then going to clone the mlc-llm git repo and run

pip install .

inside the folder. Then follow the rest of the instructions in the docs.

It doesn’t really say this in the docs, but this recent PR allows you to compile the model library for multiple CUDA architectures, not just the one in your current container/machine, by setting --target=cuda-multiarch. If you have no intention of distributing the compiled model library and you’re unlikely to change GPU types, you can probably stick with --target=cuda.

I copied my setup steps into a Dockerfile in case I ever needed to replay the steps I took. You should be able to

docker pull

but it’s important to recognise that the mlc_llm repo inside this image was whatever it was at the time of the build, and if you’re coming from the future you should make sure to run

git pull --recurse-submodules && pip install -v .

if there are upstream fixes to bugs you may encounter.

All the additional deps in there are only needed if you want to quantise your model with GPTQ and you’re on CUDA >=12.0. They will transitively install a preview (nightly) pytorch build, and build auto-gptq from source.

Because you need to be able to load the entire model into RAM during the compilation step, I actually spun up a larger g4dn.2xlarge node with its 32GiB of RAM just for the compilation. Since I’ve taken all the effort to compile this model, you can find several quantisation weights in different branches at this HuggingFace repo. In case the model libraries have not yet been merged in, they’re available at this PR.

The mlc-llm project is under very active development (as really are all of these projects), and so things may change quite a bit. As I was writing this post, this PR refactoring how they quantised GPTQ models was merged, so make sure to check the commit log if you’re coming from the future.


It took me a while to set up a suitable environment and get mlc-llm to actually compile Vicuna-13B-v1.5 since it’s not one of the precompiled models that they offer by default. But once I got it working, its performance, in my opinion, definitely stands up to its claims.

Based on their own /stats endpoint, it took about 1.6s to prefill the prompt, and had an output at around 20 tokens/s — well within our fast boundaries.

One thing I found miraculous with the MLC compiled models is that once the model is loaded into memory (which, granted, takes a few seconds), VRAM usage pretty much stays put (it grew by about 400 MB for my 1K token prompt, but then no more for a different 1K prompt). GPU utilisation goes up while generation is happening, but VRAM more or less stays the same. In all the other tools I’ve tried, VRAM goes up by several GB during generation, often leading to out of memory errors.

One last note: batched inference is not currently supported by MLC, though they claim this is under active development. It’s not entirely clear what batched inference means in this case, since it can mean multiple things depending on which level you’re operating at. I did try making 3 concurrent requests to the endpoint, and while they all streamed tokens back and none of the processes ran into an OOM error, the output of all 3 streams seemed to actually belong to a single prompt.
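
A concurrency check of this kind can be sketched in Python as below. It sends several prompts at once, each asking for a distinct tag word, and reports any reply that does not contain its own tag. The generate callable stands in for whatever client calls your endpoint; it and the two fake backends are hypothetical and only there to make the sketch runnable.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def find_crossed_streams(
    generate: Callable[[str], str],
    tagged_prompts: Dict[str, str],
) -> List[str]:
    """Run all prompts concurrently; return tags missing from their own reply."""
    tags = list(tagged_prompts)
    with ThreadPoolExecutor(max_workers=len(tags)) as pool:
        outputs = list(pool.map(lambda tag: generate(tagged_prompts[tag]), tags))
    return [tag for tag, output in zip(tags, outputs) if tag not in output]


prompts = {
    "ALPHA": "Repeat the word ALPHA",
    "BRAVO": "Repeat the word BRAVO",
    "CHARLIE": "Repeat the word CHARLIE",
}


def fake_healthy_backend(prompt: str) -> str:
    """Stand-in server that answers each prompt independently."""
    return "echo: " + prompt


def fake_crossed_backend(prompt: str) -> str:
    """Stand-in server that answers every request with the first prompt."""
    return "echo: Repeat the word ALPHA"


print(find_crossed_streams(fake_healthy_backend, prompts))
print(find_crossed_streams(fake_crossed_backend, prompts))
```

With a real model the tag check is only a heuristic, since the model may paraphrase, so it helps to pick prompts whose answers are easy to tell apart.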

A CUDA Segue

Since MLC didn’t have out of the box support for CUDA 11.4, and many of these other tools build for 11.8 by default, I figured I should find a way to upgrade CUDA on my EC2 nodes. The Amazon Machine Image (i.e. the OS) that eksctl picks for GPU support comes with CUDA 11.4, so I first tried to upgrade CUDA manually by following the instructions here.

The installer told me that I needed to first stop all the processes using the driver before I could upgrade the driver, which I never figured out how to do (all the commands I found on StackOverflow didn’t do anything). Further reading led me to believe that all CUDA toolkits from 11.0 up are backwards compatible as long as you have a minimum driver version, so I tried just upgrading the CUDA toolkit libraries without upgrading the actual driver.

While this worked, it didn’t help much with building TVM from source, nor did it allow me to install the MLC wheels. I’d later learn that this was because the linux flavour that Amazon Linux 2 self-identifies as is also not supported by MLC.

In the end, I followed this discussion on the aws-eks-ami repo to install an Ubuntu based AMI and then install the drivers and toolkit using the NVIDIA GPU Operator. That wound up being significantly easier, and had the added benefit of setting the driver and toolkit version at the containerd level rather than the host OS level. Because the GPU Operator can handle upgrades automatically, if I ever need to downgrade CUDA back to 11.8, it is as simple as changing a single line in a k8s resource. Changing both the linux flavour and the CUDA driver/toolkit version in this way let me use MLC directly from the wheels rather than compiling TVM from source.

A CPU+ Alternative with Llama.cpp

Since MLC benchmarked themselves against Llama.cpp, I’d be remiss not to try it. Given that its main claim to fame is allowing you to run LLMs on CPUs, it tingled my cheap(ness) sensitivities, so I wanted to see how it performed. Llama.cpp also comes with some interesting features, including support for constraining the output to a given format (in our case, JSON), and the ability to run partially on CPU while offloading some of the model layers to the GPU.

I chose to use llama-cpp-python, which is a thin wrapper on top of Llama.cpp that comes with a FastAPI server. A couple of considerations came to mind: firstly, it comes with a Docker image built for CUDA 12.1.1, so it was pretty plug-n-play. Secondly, it runs a FastAPI server on uvicorn, which I have some experience with. I knew that if I wanted to try multiple uvicorn workers, that was something I could easily set up using the WEB_CONCURRENCY environment variable (no surprise, that didn’t work, since it tried to load 2 copies of the model onto the GPU, leading to a CUDA OOM). Compiling the C++ server seems simple enough, but I went with the more well-known solution to start with.

With my prompt of about 1000 tokens (I have to put the search results into the prompt for them to be analysed), even with the model fully offloaded to the GPU, it was still too slow — with a prompt eval time of about 8s. Generation was pretty good, at 17.85 tokens per second, but the long wait before anything started streaming was a deal breaker.
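
To see how much the prompt eval time matters, here is a small Python sketch that combines the figures reported in this post (about 1.6s prefill and 20 tokens/s for MLC, about 8s prompt eval and 17.85 tokens/s for Llama.cpp) into the time needed to stream a full response. The 200-token response length is an assumption picked for illustration, and the function name is made up.

```python
def seconds_to_stream(prefill_s: float, tokens_per_s: float, n_tokens: int) -> float:
    """Wait before the first token plus the time to generate n_tokens."""
    return prefill_s + n_tokens / tokens_per_s


RESPONSE_TOKENS = 200  # ASSUMPTION: length of a typical analysis

# (time before first token in seconds, generation speed in tokens/s)
measurements = {
    "MLC": (1.6, 20.0),
    "Llama.cpp": (8.0, 17.85),
}

for tool, (prefill_s, rate) in measurements.items():
    total = seconds_to_stream(prefill_s, rate, RESPONSE_TOKENS)
    print(f"{tool}: first token after {prefill_s:.1f}s, done after {total:.1f}s")
```

The generation speeds are close, so nearly all of the gap comes from the wait before the first token, which is also the part the user notices most.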

There’s also a whole bunch of options that you can tweak that might affect performance, but I was looking for something more out of the box, since I don’t have the depth of knowledge to know which combination of flags and values would support my use case.

In case you decide to go with Llama.cpp, there’s a treasure chest of tips on how to improve performance. What’s nice is that you also get a set of stats for how your generation performed.

Like many of these other solutions, batched inference is not yet supported, though it’s being worked on. While I was working on this post, Llama.cpp released a new file format for optimised models. The new GGUF model format superseded the old GGML format, which you might have seen around HuggingFace. With the additional churn around model formats, you might end up having to convert some of the models you want to use if they’re not yet in the new format.

HuggingFace’s Text Generation Inference (TGI)

Given HuggingFace’s centrality to the AI world, it behooves me to try out their Text Generation Inference server, especially since it seems to support concurrent requests which is exactly what we want.

The very sparse README asks you to dig into the source code to read the arguments, which is fair enough, though not ideal. The actual docs are hosted on HuggingFace, but are themselves pretty sparse—you still have to read the source code to learn about the options. It does come with a Docker container that you can simply deploy to k8s, so it makes spinning up a TGI pod fairly easy. Unfortunately this is where the simplicity would end.

It turns out that Exllama, which TGI uses under the hood for Llama models, doesn’t support T4 GPUs, so TGI will automatically disable Exllama for any cards with compute capability less than 8 (including the T4). I actually didn’t even know that TGI uses Exllama until I saw a bunch of Exllama logs during runtime.

In order to prevent runtime OOM errors, TGI tries to prefill your GPU with a certain amount of zeroed data on startup — as much as you configure based on a few batching-related flags: max_concurrent_requests, max_batch_prefill_tokens, max_batch_total_tokens, waiting_served_ratio and max_total_tokens. How batching works exactly, and how these are combined is not yet well documented, and there are many github issues reporting out of memory errors on startup.

What’s worse is that the OOM prefill error is actually a red herring: there’s a second stack trace above it that highlights the actual source of the error (which often is not a true out-of-memory error at all).

That said, once I changed the quantisation to use bitsandbytes (I still can’t get GPTQ working), I got a good token generation speed of almost 12 tokens per second. It took about 3.5 minutes for TGI to load the model onto the GPU, which may be a dealbreaker if you have high-availability requirements, but that’s not one of the criteria I’m currently considering based on our usage. Ultimately, the fact that TGI handles batching by default makes it a clear winner.

One nice thing about TGI, if you have the hardware, is that it’s able to shard your model across multiple GPUs, and it handles that relatively transparently for you. It does mean that your model needs to have its weights stored in safetensors format rather than pickle format. If it doesn’t, there’s a HuggingFace Space that will help you convert models automatically. Once it’s done converting, it will open a pull request on the model’s repo, which will then give you a revision hash that you can pass to TGI, even if the code owners haven’t merged your PR yet.


Right now, I have TGI deployed since it supports concurrent requests, which has given me some peace of mind that we can support multiple users at once.

I’ll keep an eye out for how MLC supports batching/concurrent requests, since that still performs the best in terms of speed. I came across some additional tools that I didn’t try, so let me know if you do use them and if there’re any gotchas!

  • vLLM - No support for quantised models, though that’s on the roadmap.
  • OpenLLM - Need to build a custom docker image for every model. It sells itself as a “One thing to rule them all” type framework, and I decided I don’t have the time to learn it end-to-end, including their own custom Bento format.
  • FastChat - By the same people behind the Vicuna model, comes with a whole gradio UI. The server architecture seems like it could be quite powerful, but also complex, so I didn’t want to try it out in this first pass.

[1] Shoutout to my intern Felix Cohen for powering through the various prompt and model combinations with sometimes very arbitrary metrics for “good enough”

[2] There seem to be a lot of LLM tools these days either for people wanting to run LLMs locally on their PCs, or big companies with a lot of GPU compute, but not that many for the use cases in the middle. We’re stuck in a chicken and egg situation where you’d like to be able to spin up an LLM as a proof-of-concept, but until you reach the amortisation query threshold, you don’t really want your GPUs just sitting around doing nothing.
