Pragyan Tripathi

After Gemini, here's Meta's attempt to fool us 👎

Last week, Meta announced their new LLM, Llama 3.1, claiming it rivals closed-source models like GPT-4 and Claude 3.5.

They used popular benchmarks like MMLU (Massive Multitask Language Understanding) to back up their claims.

But here's the problem:

These benchmarks are fundamentally flawed.

Here's why:

  1. They're too easy

MMLU was created in 2020. Back then, most models scored around 25%.

Now? The best models score 88-90%.

It's like grading high school students on middle school tests.

  2. They contain errors

A study found that 57% of MMLU's virology questions had errors.

26% of logical fallacy questions were wrong too.

Some had no correct answer. Others had multiple right answers.

  3. Models might be cheating

LLMs are trained on internet data. This often includes benchmark questions and answers.

Models could be "contaminated" - they've seen the test in advance.

Some companies might even deliberately train on benchmark data to boost scores.
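
To see how a contamination check even works, here's a toy Python sketch: it flags a benchmark question if any of its word n-grams show up verbatim in the training data. (The question and documents below are made up, and real decontamination pipelines scan entire training corpora.)

```python
# Toy n-gram contamination check: flag a benchmark question whose word
# n-grams also appear verbatim in training documents.
# Illustrative only; real decontamination pipelines run over full corpora.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of lowercase word n-grams in a piece of text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question: str, training_docs: list, n: int = 8) -> bool:
    """True if any n-gram from the question appears in any training doc."""
    question_grams = ngrams(question, n)
    return any(question_grams & ngrams(doc, n) for doc in training_docs)

# Made-up data for illustration:
question = "Which of the following is a characteristic feature of retroviruses in host cells?"
training_docs = [
    "A scraped forum post quoting the exam: which of the following is a "
    "characteristic feature of retroviruses in host cells?",
]

print(is_contaminated(question, training_docs))  # True: the model has seen this item
```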

  4. Small changes have big impacts

Asking a model to state the answer in its own words versus having it pick a letter or number can produce noticeably different scores.

This affects reproducibility and comparability.
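
To make that concrete, here's a toy example of the same question framed two ways; which format a leaderboard scores can shift a model's measured accuracy:

```python
# The same MMLU-style question asked two ways. The question text is made up
# for illustration.

question = "What is the time complexity of binary search on a sorted array?"
choices = ["O(n)", "O(log n)", "O(n log n)", "O(1)"]

# Format A: multiple choice, the model replies with a single letter.
letter_prompt = (
    question + "\n"
    + "\n".join(f"{letter}. {choice}" for letter, choice in zip("ABCD", choices))
    + "\nAnswer with a single letter:"
)

# Format B: open-ended, the model states the answer directly.
direct_prompt = question + "\nAnswer:"

print(letter_prompt)
print("---")
print(direct_prompt)
```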

  5. They don't reflect real-world performance

Benchmark scores often fail to match how models perform on actual tasks.

Companies use inflated scores to hype products and boost valuations.

So what's the solution?

  1. Create harder benchmarks.

↳ MMLU-Pro, GPQA, and MuSR are examples of tougher tests.

  2. Use automated testing systems

↳ HELM (Holistic Evaluation of Language Models) and EleutherAI's LM Evaluation Harness generate more trustworthy leaderboards.
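
For a taste of what that looks like, here's a rough sketch using the harness's Python API. The model name is just an example, and argument names can change between harness versions, so treat this as a sketch and check the project's docs.

```python
# Rough sketch of scoring a model with EleutherAI's lm-evaluation-harness
# (pip install lm-eval). Model name and arguments are illustrative and may
# differ between harness versions; check the project's documentation.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",
    tasks=["mmlu"],  # harder suites such as mmlu_pro or gpqa are also available
    batch_size=8,
)

# Per-task scores that can feed a shared, reproducible leaderboard.
print(results["results"])
```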

  3. Develop new benchmarks for emerging skills

↳ GAIA tests real-world problem-solving. NoCha (Novel Challenge) assesses long-context understanding.

  4. Use AI to create benchmarks

↳ Projects like AutoBencher use LLMs to develop new tests.

  5. Focus on safety benchmarks

↳ Anthropic is funding the creation of benchmarks to assess AI safety risks.

The big picture:

As AI commercializes, we need reliable, specific benchmarks.

Startups specializing in AI evaluation are emerging.

The era of AI labs grading their own homework is ending.

And that's a good thing for everyone.
