
Mistral-7b on GCP


Whether you're an AI enthusiast, a developer, or simply a tech junkie, you might have heard of Mistral-7b, a state-of-the-art language model. But what goes on behind the scenes? How do you set it up? How fast can it solve your apple math problems, and how often will it get them right? In this deep dive, we'll walk through every step of setting up Mistral-7b on Google Compute Engine (GCE), analyzing log metrics, measuring response time and quality, and assessing server resource utilization. Strap in as we navigate the YAML configurations, command lines, and real-world testing results to get the full picture of Mistral-7b's capabilities and quirks.

Setup on GCP

Without further ado, here are the steps I took to set up Mistral-7b (GitHub source) on GCE:

  1. Create a minimal Linux VM.
  2. Install Anaconda.
  3. Install SkyPilot with the GCP extra, skypilot[gcp] (see the install sketch after the YAML below).
  4. Follow the official Mistral-7b docs for deploying with SkyPilot. Here's the YAML file I used for my cluster on GCP, saved as mistral-7b.yaml:
envs:
  _MODEL_NAME: mistralai/Mistral-7B-v0.1
  _SKYPILOT_NUM_GPUS_PER_NODE: 1
  _PRIVATE_IP: #your instance IP
resources:
  cloud: gcp
  # the GPU model; use `sky show-gpus` to check what's available
  accelerators: L4:1

run: |
  docker run --gpus all -p $_PRIVATE_IP:8000:8000 ghcr.io/mistralai/mistral-src/vllm:latest \
                   --host 0.0.0.0 \
                   --model $_MODEL_NAME \
                   --tensor-parallel-size $_SKYPILOT_NUM_GPUS_PER_NODE
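For steps 2 and 3, here's a minimal install sketch. I'm assuming the lighter Miniconda installer is acceptable in place of full Anaconda, and the SkyPilot install follows its standard pip-based route; adjust paths to taste:

# Step 2: install Miniconda (the generic "latest" installer URL; swap in the
# full Anaconda installer if you prefer)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p "$HOME/miniconda3"
source "$HOME/miniconda3/bin/activate"

# Step 3: install SkyPilot with the GCP extra and verify cloud access
pip install "skypilot[gcp]"
sky check          # confirms your GCP credentials are set up
sky show-gpus      # lists available accelerators (this is how L4:1 was picked)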
  5. Run the command 'sky launch -c $cluster_name mistral-7b.yaml'. This automatically provisions the resources defined in the YAML onto your GCP environment.
  6. Set up your network so you can query the AI server (see the firewall sketch after the curl example below).
  7. Query the model with curl:
curl http://$IP:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
      "model": "mistralai/Mistral-7B-v0.1",
      "prompt": "My favourite condiment is",
      "max_tokens": 25
  }'
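For step 6, a sketch of what the network setup can look like: a GCP firewall rule that opens the vLLM port. The rule name and source range below are placeholders for your own setup, and you'll likely want to restrict access to your own client IP rather than the whole internet:

# open TCP 8000 to a single client IP (name and CIDR are placeholders)
gcloud compute firewall-rules create allow-vllm-8000 \
  --allow=tcp:8000 \
  --source-ranges=203.0.113.4/32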

The estimated cost of this setup on GCE is roughly 500 USD/month with 1x NVIDIA L4 on a g2-standard-4 instance (4 vCPUs, 2 cores, 16 GB memory).
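Since the L4 instance bills whenever it's running, it's worth knowing SkyPilot's standard cost-control commands. A small sketch (the cluster name "mycluster" is a placeholder):

sky status                      # see what's running and for how long
sky autostop mycluster -i 30    # auto-stop the cluster after 30 idle minutes
sky stop mycluster              # stop now, keep the disk
sky down mycluster              # tear everything down when you're done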

After successfully setting up Mistral-7b on GCE, the real detective work began: analyzing the logs. Diving into the labyrinthine world of server metrics and data, I focused my attention on both the response time and the quality of the generated answers. The logs offered a treasure trove of information, highlighting key parameters such as query length and token limits that significantly influenced how long it took for Mistral-7b to respond. However, speed wasn't the only metric under the microscope. I also found the model's quality of response to be erratic at times, with occasional detours into complete gibberish. One memorable instance involved Mistral-7b generating the curious term "plogram", showcasing the unpredictability that comes with advanced AI.

Log Metrics Reviewed:

  • Received request cmpl-XXXXXX: This signifies a new incoming request to Mistral AI, including the prompt and sampling parameters.
  • Avg prompt throughput, Avg generation throughput: These indicate the average speed of processing the input prompts and generating the outputs.
  • Running, Swapped, Pending: These show the request status.
  • GPU KV cache usage, CPU KV cache usage: These indicate the usage level of GPU and CPU cache.
  • Real (wall-clock) time taken to query the server from curl (sketch below)
  • Serving logs on GCP
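To gather these, the serving logs can be streamed through SkyPilot and each request timed from the client side. A minimal sketch, assuming the same cluster name placeholder and server IP as above:

# stream the vLLM container's stdout (the throughput and cache metrics show up here)
sky logs mycluster

# measure wall-clock response time from the client
time curl http://$IP:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Mistral-7B-v0.1", "prompt": "write a haiku explaining k8s", "max_tokens": 25}'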

Response Time and Quality

I repeated the same prompts a few times with different max token limits:

"prompt": "I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. I also gave 3 bananas to my brother. How many apples did I remain with? Lets think step by step"
"prompt": "your favorite condiment is"
"prompt": "write a haiku explaining k8s"
"prompt": "Give two concise bullet-point arguments against the Münchhausen trilemma"

Upon reviewing the logs, it's clear that query length and token limit significantly impact Mistral-7b's response time. Occasionally, the model produces errors in the form of random, unrelated sentences. Across 3 queries it produced 3 different responses to the apple math problem and was wrong 2 out of 3 times. The response time typically varies between 3.5 and 4 seconds for a 25-token limit but can extend to 19 seconds for a 500-token limit. While the initial prompt and token limit influence the duration, the average generation speed stays fairly constant.
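One way to reproduce the timing comparison (a sketch, assuming the same endpoint as above) is to sweep max_tokens and let curl report the total time per request:

for tokens in 25 100 500; do
  curl -s -o /dev/null -w "max_tokens=$tokens -> %{time_total}s\n" \
    http://$IP:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"mistralai/Mistral-7B-v0.1\", \"prompt\": \"write a haiku explaining k8s\", \"max_tokens\": $tokens}"
done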

Showcase

Here's a showcase of Mistral-7b getting it wrong and taking a long time (500 response tokens and 16.548 seconds real time):

Model: mistralai/Mistral-7B-v0.1
Prompt: I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. I also gave 3 bananas to my brother. How many apples did I remain with? Let's think step by step.
Max Tokens: 500

Response:
Object: text_completion
Created Time: 1696551350
Model Used: mistralai/Mistral-7B-v0.1
Choices:
Index: 0
Text:

  • 1st Step: I went to the market and bought 10 apples, and gave 2 apples to the neighbor and 2 to the repairman. Now I remain with 10 apples minus 4 apples = 6 apples.
  • 2nd Step: I went and bought 5 more apples and ate 1 apple. So I remain with 6+5-1 = 8 apples.
  • 3rd Step: We have to add the bananas to the picture. I gave my brother 3 bananas. So 8 apples and 3 bananas. So I remain with 8 apples + 3 bananas = 10 pieces of fruit.

Logprobs: null

Finish Reason: stop

Usage:
Prompt Tokens: 72
Total Tokens: 318
Completion Tokens: 246

Server Resource Utilization

GPU KV Cache Usage: This went from 0.4% to 1.2% in the first query and remained below 2.1% in the other queries.
CPU KV Cache Usage: Remained at 0.0%.

The low resource utilization suggests that the hardware isn't the bottleneck for the lengthy response times; in fact, there's plenty of headroom to do more.
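For a second opinion on utilization beyond the vLLM cache numbers, you can check the GPU directly on the cluster. SkyPilot adds each cluster to your SSH config, so a sketch (cluster name is a placeholder) looks like:

ssh mycluster nvidia-smi                  # one-off snapshot of GPU memory and utilization
ssh -t mycluster watch -n 1 nvidia-smi    # live view while queries are running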

Concluding Insights

Lengthier Queries = Longer Time: If you want a quick response, try to streamline your prompts.
Response Quality: Quite erratic. Sometimes Mistral-7b will say random things just to fill up the token count; the reverse is true too: if the token limit isn't high enough, it will generate a response and then cut off mid-sentence. It also seems to be good at haiku but bad at math.
Resource Underutilization: There was plenty of room to scale or run more queries in parallel.
Network Issues: Didn't notice anything special here, but it may be another vector affecting response time.
