Jeffrey Ip

‼️ Top 5 Open-Source LLM Evaluation Frameworks in 2024 🎉🔥

TL;DR

"I feel like there are more LLM evaluation solutions out there than there are problems around LLM evaluation" - said Dylan, a Head of AI at a Fortune 500 company.

And I couldn't agree more - it seems like every week there is a new open-source repo trying to do the same thing as the 30+ frameworks that already exist. At the end of the day, what Dylan really wants is a framework, package, library, whatever you want to call it, that simply quantifies the performance of the LLM (application) he's looking to productionize.

So, as someone who was once in Dylan's shoes, I've compiled a list of the top 5 LLM evaluation frameworks that exist in 2024 😌

Let's begin!


1. DeepEval - The Evaluation Framework for LLMs

DeepEval is your favorite evaluation framework's favorite evaluation framework. It takes top spot for a variety of reasons:

  • Offers 14+ LLM evaluation metrics (both for RAG and fine-tuning use cases), updated with the latest research in the LLM evaluation field. These metrics include:
    • G-Eval
    • Summarization
    • Hallucination
    • Faithfulness
    • Contextual Relevancy
    • Answer Relevancy
    • Contextual Recall
    • Contextual Precision
    • RAGAS
    • Bias
    • Toxicity

Most metrics are self-explaining, meaning DeepEval will literally tell you why the metric score cannot be higher (see the standalone sketch after this list).

  • Offers modular components that are extremely simple to plug in and use. You can easily mix and match different metrics, or even use DeepEval to build your own evaluation pipeline if needed.
  • Treats evaluations as unit tests. With an integration for Pytest, DeepEval is a complete testing suite most developers are familiar with.
  • Allows you to generate synthetic datasets using your knowledge base as context, or load datasets from CSVs, JSONs, or Hugging Face.
  • Offers a hosted platform with a generous free tier to run real-time evaluations in production.
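
Because the metrics are modular, you can also run one completely standalone. Here's a minimal sketch based on DeepEval's documented measure()/score/reason interface; exact attribute and argument names can vary between versions:

from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="How many evaluation metrics does DeepEval offer?",
    actual_output="14+ evaluation metrics",
    context=["DeepEval offers 14+ evaluation metrics"],
)

# run the metric on its own, outside of any test suite
metric = HallucinationMetric(minimum_score=0.7)
metric.measure(test_case)

print(metric.score)   # the 0-1 score
print(metric.reason)  # the self-explaining reason behind the score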

With Pytest Integration:

from deepeval import assert_test
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="How many evaluation metrics does DeepEval offer?",
    actual_output="14+ evaluation metrics",
    context=["DeepEval offers 14+ evaluation metrics"]
)
metric = HallucinationMetric(minimum_score=0.7)

def test_hallucination():
    assert_test(test_case, [metric])

Then in the CLI:

deepeval test run test_file.py

Or, without Pytest (perfect for notebook environments):

from deepeval import evaluate
...

evaluate([test_case], [metric])

🌟 Star DeepEval on GitHub


2. MLflow LLM Evaluate - LLM Model Evaluation

MLflow LLM Evaluate is a modular, straightforward package that lets you run evaluations in your own evaluation pipelines. It offers RAG evaluation and QA evaluation.

MLflow stands out for its intuitive developer experience. For example, this is how you run evaluations with MLflow:

results = mlflow.evaluate(
    model,
    eval_data,
    targets="ground_truth",
    model_type="question-answering",
)
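
For reference, eval_data is just tabular data - typically a pandas DataFrame with an inputs column plus the column named in targets. A minimal sketch, where the "models:/my-qa-model/1" URI and the example rows are purely illustrative:

import mlflow
import pandas as pd

# illustrative evaluation data: questions plus reference answers
eval_data = pd.DataFrame({
    "inputs": [
        "What is MLflow?",
        "What does mlflow.evaluate return?",
    ],
    "ground_truth": [
        "MLflow is an open-source platform for managing the ML lifecycle.",
        "An EvaluationResult holding aggregate metrics and per-row tables.",
    ],
})

with mlflow.start_run():
    results = mlflow.evaluate(
        "models:/my-qa-model/1",   # hypothetical model URI; any pyfunc model also works
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )

print(results.metrics)                       # aggregate scores
print(results.tables["eval_results_table"])  # per-example results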

🌟 Star MLflow on GitHub

3. RAGAs - Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines

Third on the list, RAGAs was built for RAG pipelines. It offers 5 core metrics:

  • Faithfulness
  • Contextual Relevancy
  • Answer Relevancy
  • Contextual Recall
  • Contextual Precision

These metrics make up the final RAGAs score. DeepEval and RAGAs have very similar implementations, but RAGAs metrics are not self-explaining, making it much harder to debug unsatisfactory results.

RAGAs earns third place because it also incorporates the latest research into its RAG metrics and is simple to use, but its limited feature set and inflexibility as a framework keep it from ranking higher.

from ragas import evaluate
from datasets import Dataset
import os

os.environ["OPENAI_API_KEY"] = "your-openai-key"

# prepare your Hugging Face dataset in the format
# Dataset({
#     features: ['question', 'contexts', 'answer', 'ground_truths'],
#     num_rows: 25
# })
# for example, a single-row dataset:
dataset = Dataset.from_dict({
    "question": ["How many core metrics does RAGAs offer?"],
    "contexts": [["RAGAs offers 5 core metrics."]],
    "answer": ["RAGAs offers 5 core metrics."],
    "ground_truths": [["RAGAs offers 5 core metrics."]],
})

results = evaluate(dataset)
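
If you only care about a subset of those five metrics, you can pass them explicitly. A quick sketch, assuming the metric objects exported by ragas.metrics (names may differ slightly across RAGAs versions):

from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# score only the metrics you care about
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)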

🌟 Star RAGAs on GitHub


4. Deepchecks

Deepchecks stands out as it is geared more towards evaluating the LLM itself, rather than LLM systems/applications.

It is not higher on the list due to its complicated developer experience (seriously, try setting it up yourself and let me know how it goes). Its open-source offering is unique, though, because it focuses heavily on dashboards and a visualization UI, which makes it easy to explore evaluation results.


🌟 Star Deepchecks on GitHub


5. Arize AI Phoenix

Last on the list, Arize AI's Phoenix evaluates LLM applications through extensive observability into LLM traces. However, it is quite limited, offering only three evaluation criteria (a rough usage sketch follows the list):

  1. QA Correctness
  2. Hallucination
  3. Toxicity
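
For reference, here is roughly what running two of those evaluators over a DataFrame of traced queries looks like, based on Phoenix's evals docs - treat the class names, the model argument, and the input/output/reference column names as assumptions that may shift between Phoenix versions:

import pandas as pd
from phoenix.evals import HallucinationEvaluator, OpenAIModel, QAEvaluator, run_evals

# assumed: a DataFrame of your app's queries, answers, and retrieved context
queries_df = pd.DataFrame({
    "input": ["How many evaluation metrics does DeepEval offer?"],
    "output": ["14+ evaluation metrics"],
    "reference": ["DeepEval offers 14+ evaluation metrics"],
})

eval_model = OpenAIModel(model="gpt-4")

hallucination_df, qa_df = run_evals(
    dataframe=queries_df,
    evaluators=[HallucinationEvaluator(eval_model), QAEvaluator(eval_model)],
    provide_explanation=True,   # adds an LLM-written explanation per row
)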


🌟 Star Phoenix on GitHub


So there you have it, the list of top LLM evaluation frameworks GitHub has to offer in 2024. Think there's something I've missed? Comment below to let me know!

Thank you for reading, and till next time 😊

Top comments (7)

Bap
Nice list!

Ranjan Dailata
(comment hidden by the post author)

Jeffrey Ip
Top 5 only!

Sally O
Somehow you missed TruLens: github.com/truera/trulens

DeepLearning AI has a whole free course on how to use it to test RAG apps and a workshop on agents, too.
deeplearning.ai/short-courses/buil...
youtube.com/watch?v=0pnEUAwoDP0

Jeffrey Ip
Did not miss it, top 5 only!

Matija Sosic
It's good to be aware of these. Are all of them comparable, or do they define the benchmarks differently? In other words, is there a "golden standard" that all benchmarking tools follow?

Jeffrey Ip
Glad you liked it!