Fine-tuned large language models (LLMs) are becoming increasingly popular in AI applications. These powerful models are widely used to automate tasks, improve customer service, and generate domain-specific content.
However, serving these fine-tuned LLMs at scale comes with challenges. The models are computationally expensive, and they are much larger than traditional microservices. Both factors make it hard to achieve high-throughput serving and low cold-start scaling.
This post shares our experience with LLM serving using vLLM and with service scaling on Modelz.
Use vLLM for high-throughput LLM serving
vLLM is a high-throughput and memory-efficient LLM serving engine. It offers an OpenAI-compatible API, which makes it easy to integrate with existing LLM applications.
The first hurdle is setting up a GPU environment in which to build and install vLLM. With the help of envd, this can be done in a single file:
# syntax=v1

def build():
    base(dev=True)
    install.cuda(version="11.8.0")
    install.conda()
    install.python()
    install.apt_packages(name=["build-essential"])
    # install torch here to reuse the cache
    install.python_packages(name=["torch"])
    # install from source
    install.python_packages(name=["git+https://github.com/vllm-project/vllm.git"])
By running envd up, you get into a development environment with everything you need. If you prefer a Dockerfile, we also have a template.
vLLM already supports many LLMs such as LLaMA, Falcon, MPT, etc. However, to serve your own LLM, you may need to provide a model-specific prompt template. To address this, we created a tool called llmspec, which provides prompt templates with an OpenAI-compatible interface. You can build your prompt generator on top of this library.
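To illustrate what a model-specific template involves, here is a minimal standalone sketch (not the llmspec API itself; the function name and structure are assumptions for illustration) that renders OpenAI-style chat messages into the ChatML format used by MPT-chat models:

# Hypothetical prompt template for ChatML-style models (e.g. MPT-chat).
# It converts OpenAI-style chat messages into a single prompt string.
def render_chatml(messages):
    prompt = ""
    for message in messages:
        role = message["role"]  # "system", "user", or "assistant"
        content = message["content"]
        prompt += f"<|im_start|>{role}\n{content}<|im_end|>\n"
    # Leave the assistant turn open so the model continues from here.
    prompt += "<|im_start|>assistant\n"
    return prompt


print(render_chatml([{"role": "user", "content": "Who are you?"}]))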
To run vLLM serving in a Kubernetes cluster, there are a few necessary configurations (an example launch command follows the list):
- Always set --worker-use-ray to run the model inference in a separate Python process, so that health probes do not fail.
- Provide enough shared memory (at least 30% of RAM).
- Reduce --gpu-memory-utilization to avoid GPU OOM for long sequences.
- Increase --max-num-batched-tokens if you want to generate long sequences.
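For reference, the points above roughly translate into a launch command like the following for vLLM's OpenAI-compatible server (the model name, flag values, and port are illustrative and should be tuned to your model and hardware):

python -m vllm.entrypoints.openai.api_server \
    --model mosaicml/mpt-30b-chat \
    --worker-use-ray \
    --gpu-memory-utilization 0.85 \
    --max-num-batched-tokens 4096 \
    --port 8000

For the shared memory requirement, a common Kubernetes pattern is to mount an emptyDir volume with medium: Memory at /dev/shm and size it to at least 30% of the pod's memory.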
If you want to simulate multiple concurrent requests, you can use the following script:
from random import randint
import concurrent.futures
import openai
openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"
def query(max_tokens=20):
    # Send a streaming chat completion request and print tokens as they arrive.
    chat = openai.ChatCompletion.create(
        model="mosaicml/mpt-30b-chat",
        messages=[{
            "role": "user",
            "content": "Who are you?",
        }],
        stream=True,
        max_tokens=max_tokens,
    )
    for result in chat:
        delta = result.choices[0].delta
        print(delta.get('content', ''), end='', flush=True)
    print()


def batch_test():
    # Fire 20 requests concurrently with random output lengths.
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [
            executor.submit(query, max_tokens=randint(20, 200)) for _ in range(20)
        ]
        for future in concurrent.futures.as_completed(futures):
            future.result()


if __name__ == "__main__":
    batch_test()
Scaling with Modelz
Modelz is a fully managed platform that provides a simple API for deploying machine learning models. With it, your service scales automatically based on real-time API invocations, and the Docker image is optimized to minimize container cold-start time.
If you want to deploy models to your private cluster or a single GPU server, try openmodelz. It takes care of the underlying technical details and provides a simple, easy-to-use CLI to deploy and manage your machine learning services.
If you have any questions about deploying models into production, feel free to reach out by joining our Discord or emailing modelz-support@tensorchord.ai.
Advertisement Time
- mosec - A general high-performance and easy-to-use machine learning serving framework.
- pgvecto.rs - A powerful Postgres extension for vector similarity search.