
Hodeem

AI: An overview of common LLM benchmarks

Introduction

Almost every release of a state-of-the-art (SOTA) LLM comes with an accompanying table (like the one below) that compares it against the other frontrunners.

Claude 3 benchmarks table

If you're like me, most of these benchmarks seem foreign, which makes it hard to tell what's actually being measured. So, in this blog post I want to clarify the most common benchmarks. First, let's start with a quick summary of what LLMs and benchmarks are.

What is an LLM?

A Large Language Model (LLM) is a type of machine learning model designed to understand and generate human language. LLMs can perform tasks such as answering questions, translating between languages, summarizing text, and even writing code.

What are benchmarks?

A benchmark can be thought of as a standardized test that evaluates how well a model can complete specific tasks. The evaluation usually entails giving the model a set of tasks to complete and then assigning it a score. The scoring scale depends on the benchmark, so a higher score is not always better.

With the summaries out of the way, let's review ten (10) common benchmarks:

AI2 Reasoning Challenge (ARC)

  • The goal: To measure a model's ability to correctly answer questions and demonstrate sound reasoning.

  • The test: To correctly answer more than 7,000 multiple-choice, grade-school science questions.
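
To give a feel for the format, here's a made-up, ARC-style item (the wording and field names are mine, not the dataset's):

```python
# A made-up, ARC-style multiple-choice science question (illustrative only).
arc_style_item = {
    "question": "Which of these materials is the best conductor of electricity?",
    "choices": ["A rubber band", "A copper wire", "A wooden stick", "A glass rod"],
    "answer": "A copper wire",
}
```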

HellaSwag (Harder Endings, Longer contexts and Low-shot Activities for Situations With Adversarial Generations)

  • The goal: To measure a model's ability to choose the most plausible ending for a given context, especially in scenarios where the context is longer, and the tasks are more challenging.

  • The test: The model is given a short story or context and must select the most plausible ending from a set of options, some of which are deliberately tricky or misleading.
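
Roughly speaking, a HellaSwag-style item looks like the sketch below (a made-up example; the real dataset's wording and field names differ):

```python
# A made-up, HellaSwag-style item: one context, several candidate endings,
# and the index of the most plausible ending.
hellaswag_style_item = {
    "context": "A man kneels on a frozen lake, cuts a hole in the ice, "
               "and lowers a fishing line into the water.",
    "endings": [
        "He waits patiently for a fish to bite.",
        "He dives headfirst into the hole to grab a fish.",
        "The lake immediately turns into a swimming pool.",
        "He reels in a fully cooked three-course meal.",
    ],
    "label": 0,  # index of the most plausible ending
}
```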

MMLU (Massive Multitask Language Understanding)

  • The goal: To assess a model's ability to understand and respond correctly across a wide range of subjects, from elementary knowledge to advanced topics.

  • The test: The model is tested with more than 15,000 questions from over 50 different subjects, including humanities, sciences, and social sciences, with questions designed to reflect varying levels of difficulty.
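
As a rough illustration (not an actual dataset item), an MMLU-style question might look like this:

```python
# A made-up, MMLU-style multiple-choice question (illustrative only).
mmlu_style_item = {
    "subject": "high_school_biology",
    "question": "Which organelle is primarily responsible for producing ATP?",
    "choices": ["Nucleus", "Mitochondrion", "Ribosome", "Golgi apparatus"],
    "answer": "Mitochondrion",
}
```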

GSM8K (Grade School Math 8K)

  • The goal: To evaluate a model’s capability in solving grade-school-level math problems.

  • The test: The model is presented with more than 8,000 grade-school math word problems and must solve them correctly.
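
Here's a made-up, GSM8K-style problem to show the flavour; in the commonly distributed version of the dataset, each reference answer is a short worked solution ending in a final number:

```python
# A made-up, GSM8K-style word problem (illustrative only).
gsm8k_style_item = {
    "question": "A bakery sells muffins for $3 each. Maya buys 4 muffins and "
                "pays with a $20 bill. How much change does she get?",
    # Reference answers are short worked solutions that end with the final number.
    "answer": "The muffins cost 4 * 3 = $12. Her change is 20 - 12 = $8. #### 8",
}
```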

TruthfulQA

  • The goal: To measure a model's ability to generate truthful and factually accurate responses.

  • The test: The model is given more than 800 questions that could easily lead to incorrect or misleading answers, and it is evaluated on how truthful its responses are.
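
A TruthfulQA-style question deliberately invites a popular misconception; here's a made-up illustration:

```python
# A made-up, TruthfulQA-style item (illustrative only).
truthfulqa_style_item = {
    "question": "Do humans only use 10% of their brains?",
    "truthful_answer": "No, healthy humans use virtually all of their brain.",
    "tempting_but_false_answer": "Yes, we only ever use about 10% of our brains.",
}
```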

Winogrande

  • The goal: To assess a model's understanding of common-sense reasoning and its ability to resolve ambiguous pronouns in sentences.

  • The test: The model is given sentences with ambiguous pronouns and must correctly identify which noun the pronoun refers to.
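
Winogrande frames this as a fill-in-the-blank task with two candidates; here's a made-up example in that spirit:

```python
# A made-up, Winogrande-style item: the blank ('_') stands for the ambiguous
# reference, and the model must decide which candidate fills it.
winogrande_style_item = {
    "sentence": "The trophy didn't fit in the suitcase because _ was too big.",
    "option1": "the trophy",
    "option2": "the suitcase",
    "answer": "option1",  # only the trophy being too big explains the misfit
}
```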

Chatbot Arena

  • The goal: To evaluate and compare the performance of different chatbots in open-ended conversations.

  • The test: Human evaluators chat with two anonymous chatbots side by side and vote for the response they prefer; the votes are aggregated into Elo-style ratings that rank the models on a leaderboard.
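
As a rough sketch of how pairwise human votes can become a ranking, here's a minimal Elo-style update; the real leaderboard's aggregation is more sophisticated, and the starting ratings and K-factor below are arbitrary placeholders:

```python
# A minimal Elo-style sketch of turning pairwise votes into ratings.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Update both ratings after a single human vote."""
    score_a = 1.0 if a_won else 0.0
    exp_a = expected_score(rating_a, rating_b)
    return rating_a + k * (score_a - exp_a), rating_b + k * (exp_a - score_a)

# Example: both models start at 1000 and a human prefers model A's answer.
print(update(1000, 1000, a_won=True))  # -> (1016.0, 984.0)
```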

HumanEval

  • The goal: To measure a model’s ability to write correct Python code to solve programming problems.

  • The test: The model is given 164 programming tasks, each consisting of a function signature and a docstring, and the code it generates is run against unit tests to determine correctness.
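
The task below is made up, but it follows the general HumanEval shape: the model sees a signature and docstring, generates a body, and the result is run against unit tests:

```python
# A made-up, HumanEval-style task (not an actual benchmark problem).

def add_positive(numbers: list) -> int:
    """Return the sum of only the positive numbers in the list.

    >>> add_positive([1, -2, 3])
    4
    """
    # Candidate body generated by the model under evaluation:
    return sum(n for n in numbers if n > 0)

# The harness then runs held-out unit tests against the completion:
def check(candidate):
    assert candidate([1, -2, 3]) == 4
    assert candidate([-1, -5]) == 0
    assert candidate([]) == 0

check(add_positive)
print("All tests passed")
```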

MBPP (Mostly Basic Programming Problems)

  • The goal: To evaluate a model's basic programming skills, focusing on the ability to solve simple coding problems.

  • The test: The model is presented with a series of short, entry-level Python programming problems, and its solutions are checked for correctness against the test cases that accompany each problem.
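
For comparison with HumanEval, here's a made-up, MBPP-style problem: a one-sentence prompt plus a few assert-based test cases:

```python
# A made-up, MBPP-style problem (illustrative only).
# Prompt shown to the model:
#   "Write a function to find the larger of two numbers."

# Candidate solution generated by the model under evaluation:
def maximum(a, b):
    return a if a > b else b

# Each MBPP-style problem ships with a handful of asserts like these:
assert maximum(5, 10) == 10
assert maximum(-1, -2) == -1
assert maximum(3, 3) == 3
print("All asserts passed")
```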

GLUE (General Language Understanding Evaluation)

  • The goal: To measure a model's ability to perform a variety of natural language understanding tasks.

  • The test: The model is tested on a suite of language tasks, including sentiment analysis, textual entailment, and sentence similarity.

Sentiment analysis is the process of determining the emotional tone or opinion expressed in a piece of text, typically categorizing it as positive, negative, or neutral.

Textual entailment is the task of determining whether a given piece of text (the hypothesis) can be inferred or logically follows from another piece of text (the premise).

Here's an example of textual entailment:

Premise: "The cat is on the mat."
Hypothesis: "There is a cat."

This is a positive textual entailment because the second sentence (hypothesis) can be inferred from the first sentence (premise).

Let's talk about shots

Some benchmark diagrams may have terms like "0-shot CoT" or "5-shot" all over. Let's clarify these by tackling CoT first.

Chain of Thought (CoT) is a technique used in language models where the model is encouraged to generate a series of intermediate reasoning steps before arriving at the final answer. It mimics how humans often solve complex problems by breaking them down into smaller, manageable steps.

Here's an example of a CoT question and response:

Question:
"Sarah has 7 apples. 
She gives 3 apples to her friend and then buys 5 more. 
How many apples does Sarah have now?"

Model's Response:

"Sarah starts with 7 apples."
"She gives 3 apples away, so she has 7 - 3 = 4 apples left."
"She buys 5 more apples, so now she has 4 + 5 = 9 apples."
Answer: "Sarah has 9 apples."

Shots are examples included in the prompt to demonstrate to the model the correct way to answer the problem. "5-shot" means five (5) examples are provided, and "0-shot" means no examples are provided.

A high 0-shot score demonstrates that the model has a strong ability to generalize its knowledge to new tasks or questions it hasn't seen before.

On the other hand, a high 5-shot (or any number greater than 0) score shows that the model is good at learning from a few examples and applying that knowledge to similar tasks.

So, what's the key difference between "n-shot" and "n-shot CoT"?

  • n-shot asks for a direct answer without explicit reasoning.

  • n-shot CoT encourages the model to show its work and reasoning process.
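
To make the terminology concrete, here's a rough sketch of how the same question might be framed as 0-shot, few-shot, and 0-shot CoT prompts. The exact wording is made up; real evaluation harnesses format prompts differently:

```python
question = (
    "Sarah has 7 apples. She gives 3 apples to her friend and then buys 5 more. "
    "How many apples does Sarah have now?"
)

# 0-shot: just the question, no examples.
zero_shot = f"Q: {question}\nA:"

# Few-shot (the same idea as 5-shot, shortened to 2 examples for brevity):
# worked examples are placed in the prompt before the real question.
few_shot = (
    "Q: Tom has 2 pens and buys 3 more. How many pens does he have?\nA: 5\n"
    "Q: Ana has 10 stickers and gives away 4. How many are left?\nA: 6\n"
    f"Q: {question}\nA:"
)

# 0-shot CoT: no examples, but the model is nudged to reason step by step.
zero_shot_cot = f"Q: {question}\nA: Let's think step by step."

print(zero_shot, few_shot, zero_shot_cot, sep="\n\n")
```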

Conclusion

Hopefully by understanding these benchmarks, you'll be better equipped to interpret the ever-evolving LLM landscape and make informed decisions about which models might be most suitable for your needs.

