By default, MT-Bench uses OpenAI as a service provider with a gpt-4
model ID, which is a vanilla GPT-4 model with 8k context introduced back in Spring 2023. However, it is possible to override the model ID via the --judge-model
argument.
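For reference, here is a hypothetical way to launch a judgment run with an overridden judge (a sketch only; the `--model-list` and `--parallel` flags and the answer-model ID are assumptions based on FastChat's llm_judge documentation):

```python
# Hypothetical launch of the MT-Bench judgment step with a cheaper judge model.
# The --model-list and --parallel flag names are assumed from FastChat's llm_judge docs;
# the answer-model ID is a placeholder.
import subprocess

subprocess.run(
    [
        "python", "gen_judgment.py",
        "--model-list", "phi-3-medium-8k-q8",  # placeholder: the model ID used for answer generation
        "--judge-model", "gpt-4o",             # override the default "gpt-4" judge
        "--parallel", "2",                     # number of concurrent API calls
    ],
    check=True,  # raise if the judgment script exits with an error
)
```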
As of June 2024, GPT-4 series models have the following pricing (per million tokens):
Model | Prompt | Completion
---|---|---
GPT-4o | $5.00 | $15.00
GPT-4-Turbo (0125-preview) | $10.00 | $30.00
GPT-4 8K (0613) | $30.00 | $60.00
By running MT-Bench with GPT-4 Turbo or GPT-4o (Omni) as the judge, one can potentially cut the API cost of a single evaluation by up to 6 times. But how will the score change? Let's find out :)
Costs
I used Phi-3 Medium with 8K context, quantized to 8 bits (running the inference server via LM Studio). I executed answer generation 4 times. Then, for each answer set, I ran one judgment generation with each of the three judge models.
OpenAI API consumption cost per one eval*:
Judge model | Cost per eval
---|---
GPT-4o | $0.93
GPT-4-Turbo (0125-preview) | $1.85
GPT-4 8K (0613) | $5.10
*I only collected the total token consumption for gpt-4-0613 (621,805 tokens across the 4 runs). For the calculation, I assumed each judge model had similar consumption: roughly 580k prompt and 60k completion tokens over the 4 runs, i.e. about 145k prompt and 15k completion tokens per eval.
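As a sanity check, here is a rough sketch of the per-eval cost arithmetic under those assumptions; the small differences from the table come from the table being based on the exact token counts rather than the rounded 580k/60k split:

```python
# Rough reproduction of the cost-per-eval table, assuming ~145k prompt and ~15k completion
# tokens per eval (one quarter of the ~580k/60k measured across 4 runs) and the
# June 2024 prices per 1M tokens quoted above.
PRICES_PER_1M = {  # judge model -> (prompt price, completion price) in USD
    "gpt-4o": (5.00, 15.00),
    "gpt-4-0125-preview": (10.00, 30.00),
    "gpt-4-0613": (30.00, 60.00),
}

PROMPT_TOKENS = 580_000 / 4      # per single eval
COMPLETION_TOKENS = 60_000 / 4   # per single eval

for model, (prompt_price, completion_price) in PRICES_PER_1M.items():
    cost = (PROMPT_TOKENS * prompt_price + COMPLETION_TOKENS * completion_price) / 1_000_000
    print(f"{model}: ${cost:.2f} per eval")
# gpt-4o: $0.95, gpt-4-0125-preview: $1.90, gpt-4-0613: $5.25
# (slightly above the table, which used the exact token counts)
```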
Reviewing the Scores
The findings below cannot be generalized, as they are based on a small sample of results for just one target model (Phi-3). Still...
For each of the LLM judges, I have calculated the mean (over the 4 runs) and the standard deviation as a percentage of that mean (a short sketch of the calculation follows the tables below). As you can see:
- GPT-4o (Omni) tends to inflate the score by roughly 12% relative to the GPT-4 baseline
- All judges are quite consistent, with just a 1-3% relative deviation in scores
- The vanilla GPT-4 shows the most consistency across turns
Mean | 1st Turn | 2nd Turn | Avg
---|---|---|---
GPT-4o | 9.13125 | 8.2814875 | 8.70720325
GPT-4-Turbo (0125-preview) | 8.290625 | 7.5270175 | 7.90932575
GPT-4 8K (0613) | 8.41875 | 7.04375 | 7.73125
StDev / Mean | 1st Turn | 2nd Turn | Avg
---|---|---|---
GPT-4o | 0.00230424 | 0.0262376 | 0.01302793
GPT-4-Turbo (0125-preview) | 0.00620126 | 0.02336659 | 0.01396082
GPT-4 8K (0613) | 0.01178508 | 0.01858418 | 0.01152749
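For reproducibility, here is a minimal sketch of how the Mean and StDev tables can be recomputed from the per-run scores listed in the Raw Scores section below (using GPT-4o as the example):

```python
# How the Mean and StDev/Mean tables can be recomputed from the raw per-run scores
# listed in the Raw Scores section below (GPT-4o shown as the example).
from statistics import mean, stdev

gpt4o_runs = {  # per-run MT-Bench scores for the GPT-4o judge
    "1st Turn": [9.14375, 9.14375, 9.1, 9.1375],
    "2nd Turn": [8.5625, 8.3375, 8.15, 8.07595],
}

for turn, scores in gpt4o_runs.items():
    m = mean(scores)
    relative_sd = stdev(scores) / m  # sample standard deviation relative to the mean
    print(f"{turn}: mean={m:.5f}, stdev/mean={relative_sd:.4%}")
# 1st Turn: mean=9.13125, stdev/mean=0.2304%
# 2nd Turn: mean=8.28149, stdev/mean=2.6238%
```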
GPT-4 Turbo is the closest to the GPT-4 8K baseline, and the 2nd turn sees the most deviation:

% of GPT-4 8K | 1st Turn
---|---
GPT-4o | 108.5%
GPT-4-Turbo (0125-preview) | 98.5%
GPT-4 8K (0613) | 100.0%
Both Omni and Turbo see a noticeably smaller drop in 2nd-turn scores than the baseline:

Judge | 2nd turn drop
---|---
GPT-4o | 9.31%
GPT-4-Turbo (0125-preview) | 9.21%
GPT-4 8K (0613) | 16.33%
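Both tables follow directly from the mean scores; a short sketch of the arithmetic:

```python
# How the "% of GPT-4 8K" and "2nd turn drop" figures follow from the mean scores above
# (GPT-4 8K / 0613 is treated as the baseline judge).
MEANS = {  # judge -> (1st-turn mean, 2nd-turn mean)
    "GPT-4o": (9.13125, 8.2814875),
    "GPT-4-Turbo (0125-preview)": (8.290625, 7.5270175),
    "GPT-4 8K (0613)": (8.41875, 7.04375),
}

baseline_t1, _ = MEANS["GPT-4 8K (0613)"]
for judge, (t1, t2) in MEANS.items():
    pct_of_baseline = t1 / baseline_t1  # 1st-turn score relative to the baseline judge
    second_turn_drop = (t1 - t2) / t1   # how much the score falls on the 2nd turn
    print(f"{judge}: {pct_of_baseline:.1%} of baseline, {second_turn_drop:.2%} 2nd-turn drop")
# GPT-4o: 108.5% of baseline, 9.31% 2nd-turn drop
# GPT-4-Turbo (0125-preview): 98.5% of baseline, 9.21% 2nd-turn drop
# GPT-4 8K (0613): 100.0% of baseline, 16.33% 2nd-turn drop
```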
Raw Scores
Model | 1st Turn | 2nd Turn | Avg
---|---|---|---
GPT-4o #1 | 9.14375 | 8.5625 | 8.853125
GPT-4o #2 | 9.14375 | 8.3375 | 8.740625
GPT-4o #3 | 9.1 | 8.15 | 8.625
GPT-4o #4 | 9.1375 | 8.07595 | 8.610063
GPT-4-Turbo (0125-preview) #1 | 8.35 | 7.7 | 8.025
GPT-4-Turbo (0125-preview) #2 | 8.2875 | 7.64557 | 7.968553
GPT-4-Turbo (0125-preview) #3 | 8.3 | 7.4375 | 7.86875
GPT-4-Turbo (0125-preview) #4 | 8.225 | 7.325 | 7.775
GPT-4 8K (0613) #1 | 8.4875 | 7.2125 | 7.85
GPT-4 8K (0613) #2 | 8.5125 | 6.975 | 7.74375
GPT-4 8K (0613) #3 | 8.3 | 7.075 | 7.6875
GPT-4 8K (0613) #4 | 8.375 | 6.9125 | 7.64375
About
MT-Bench is a quick (and dirty?) way to evaluate a chatbot model (a fine-tuned, instruction-following LLM). When a new open-source model is published on Hugging Face, it is not uncommon to see its MT-Bench score presented as a testament of quality. For roughly $5 worth of OpenAI API calls, it gives a good ballpark of how your model does, making it a handy tool for iterating on fine-tuning an assistant model.
MT-Bench is a Python program that asks the target model 80 predefined questions (running inference via HF Transformers or an OpenAI-compatible API endpoint). The questions cover Humanities, STEM, Extraction, Roleplay, Writing, Reasoning, and Coding. There are 2 turns: it asks a question and gets an answer (1st turn), then adds a follow-up question and collects a 2nd answer (2nd turn). It then iterates through all questions and asks the GPT-4 model (the legacy 8K model from Spring 2023) to score both answers on a scale from 1 to 10 (hence the lowest a model can get is 1, not 0 :). The result is 3 aggregate scores: 1st turn, 2nd turn, and average.
```
########## First turn ##########
                              score
model                   turn
stablelm-2-brief-1_6b_2 1     3.240506

########## Second turn ##########
                              score
model                   turn
stablelm-2-brief-1_6b_3 2     2.443038

########## Average ##########
                              score
model
stablelm-2-brief-1_6b_3       2.822785
```
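To make the flow above concrete, here is a minimal, simplified sketch of the two-turn ask-then-judge loop against an OpenAI-compatible endpoint. This is not FastChat's actual code: the local base URL, model names, and the judge prompt are placeholders (the real MT-Bench judge template is more elaborate).

```python
# Minimal, simplified illustration of the MT-Bench two-turn flow (not FastChat's actual code).
# Assumes a local OpenAI-compatible server (e.g. LM Studio) for the target model and the
# OpenAI API (OPENAI_API_KEY in the environment) for the judge; model names are placeholders.
from openai import OpenAI

target = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # target model server
judge = OpenAI()  # judge model via the OpenAI API

def chat(client: OpenAI, model: str, messages: list[dict]) -> str:
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content

def evaluate_pair(question: str, follow_up: str, target_model: str, judge_model: str = "gpt-4"):
    # 1st turn: ask the question and collect the answer
    messages = [{"role": "user", "content": question}]
    answer_1 = chat(target, target_model, messages)
    # 2nd turn: append the follow-up and collect the second answer
    messages += [{"role": "assistant", "content": answer_1},
                 {"role": "user", "content": follow_up}]
    answer_2 = chat(target, target_model, messages)
    # Judge each turn on a 1-10 scale (simplified prompt, not the real MT-Bench judge template)
    verdicts = []
    for q, a in [(question, answer_1), (follow_up, answer_2)]:
        judge_prompt = (
            "Rate the assistant's answer on a scale of 1 to 10 and reply as 'Rating: [[N]]'.\n"
            f"Question: {q}\nAnswer: {a}"
        )
        verdicts.append(chat(judge, judge_model, [{"role": "user", "content": judge_prompt}]))
    return answer_1, answer_2, verdicts
```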
As explained in this paper, which introduced MT-Bench and investigated the utility of LLMs as evaluators, the score shows high agreement with human preferences, i.e. the higher the MT-Bench score, the higher the model ranks on the LMSYS Chatbot Arena.
Another popular option for LLM evaluation is AlpacaEval, which uses the newer and cheaper GPT-4 Turbo model as its baseline. The authors of AlpacaEval provide correlation coefficients of different evals against the LMSYS Arena, showing a strong association between LLM judges' scores and human preferences in the Arena.