Jeffrey Ip

‼️ Top 5 Open-Source LLM Evaluation Frameworks in 2024 🎉🔥

TL;DR

"I feel like there are more LLM evaluation solutions out there than there are problems around LLM evaluation" - said Dylan, a Head of AI at a Fortune 500 company.

And I couldn't agree more - it seems like every week there is a new open-source repo trying to do the same thing as the 30+ frameworks that already exist. At the end of the day, what Dylan really wants is a framework, package, library, whatever you want to call it, that simply quantifies the performance of the LLM (application) he's looking to productionize.

So, as someone who was once in Dylan's shoes, I've compiled a list of the top 5 LLM evaluation frameworks that exist in 2024 😌

Let's begin!


1. DeepEval - The Evaluation Framework for LLMs

DeepEval is your favorite evaluation framework's favorite evaluation framework. It takes top spot for a variety of reasons:

  • Offers 14+ LLM evaluation metrics (both for RAG and fine-tuning use cases), updated with the latest research in the LLM evaluation field. These metrics include:
    • G-Eval
    • Summarization
    • Hallucination
    • Faithfulness
    • Contextual Relevancy
    • Answer Relevancy
    • Contextual Recall
    • Contextual Precision
    • RAGAS
    • Bias
    • Toxicity

Most metrics are self-explaining, meaning DeepEval will literally tell you why the metric score cannot be higher (see the standalone sketch after this list).

  • Offers modular components that are extremely simple to plug in and use. You can easily mix and match different metrics, or even use DeepEval to build your own evaluation pipeline if needed.
  • Treats evaluations as unit tests. With an integration for Pytest, DeepEval is a complete testing suite most developers are familiar with.
  • Allows you to generate synthetic datasets using your knowledge base as context, or load datasets from CSVs, JSONs, or Hugging Face.
  • Offers a hosted platform with a generous free tier to run real-time evaluations in production.
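
Because the metrics are modular, you can also run one completely standalone. Here's a minimal sketch based on DeepEval's documented measure()/score/reason interface; exact attribute and argument names can vary between versions:

from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="How many evaluation metrics does DeepEval offer?",
    actual_output="14+ evaluation metrics",
    context=["DeepEval offers 14+ evaluation metrics"],
)

# run the metric on its own, outside of any test suite
metric = HallucinationMetric(minimum_score=0.7)
metric.measure(test_case)

print(metric.score)   # the 0-1 score
print(metric.reason)  # the self-explaining reason behind the score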

With Pytest Integration:

from deepeval import assert_test
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="How many evaluation metrics does DeepEval offer?",
    actual_output="14+ evaluation metrics",
    context=["DeepEval offers 14+ evaluation metrics"]
)
metric = HallucinationMetric(minimum_score=0.7)

def test_hallucination():
    assert_test(test_case, [metric])

Then in the CLI:

deepeval test run test_file.py

Or, without Pytest (perfect for notebook environments):

from deepeval import evaluate
...

evaluate([test_case], [metric])

🌟 Star DeepEval on GitHub


2. MLflow LLM Evaluate - LLM Model Evaluation

MLflow LLM Evaluate is a modular, straightforward package that lets you run evaluations in your own evaluation pipelines. It offers RAG evaluation and QA evaluation.

MLflow stands out for its intuitive developer experience. For example, this is how you run evaluations with MLflow:

results = mlflow.evaluate(
    model,
    eval_data,
    targets="ground_truth",
    model_type="question-answering",
)
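
For reference, eval_data is just tabular data - typically a pandas DataFrame with an inputs column plus the column named in targets. A minimal sketch, where the "models:/my-qa-model/1" URI and the example rows are purely illustrative:

import mlflow
import pandas as pd

# illustrative evaluation data: questions plus reference answers
eval_data = pd.DataFrame({
    "inputs": [
        "What is MLflow?",
        "What does mlflow.evaluate return?",
    ],
    "ground_truth": [
        "MLflow is an open-source platform for managing the ML lifecycle.",
        "An EvaluationResult holding aggregate metrics and per-row tables.",
    ],
})

with mlflow.start_run():
    results = mlflow.evaluate(
        "models:/my-qa-model/1",   # hypothetical model URI; any pyfunc model also works
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )

print(results.metrics)                       # aggregate scores
print(results.tables["eval_results_table"])  # per-example results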

🌟 Star MLflow on GitHub

3. RAGAs - Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines

Third on the list, RAGAs was built for RAG pipelines. It offers 5 core metrics:

  • Faithfulness
  • Contextual Relevancy
  • Answer Relevancy
  • Contextual Recall
  • Contextual Precision

These metrics make up the final RAGAs score. DeepEval and RAGAs have very similar implementations, but RAGAs metrics are not self-explaining, making it much harder to debug unsatisfactory results.

RAGAs earns third place because it also incorporates the latest research into its RAG metrics and is simple to use, but its limited feature set and inflexibility as a framework keep it from ranking higher.

from ragas import evaluate
from datasets import Dataset
import os

os.environ["OPENAI_API_KEY"] = "your-openai-key"

# prepare your Hugging Face dataset in the format
# Dataset({
#     features: ['question', 'contexts', 'answer', 'ground_truths'],
#     num_rows: 25
# })
# for example, a single-row dataset:
dataset = Dataset.from_dict({
    "question": ["How many core metrics does RAGAs offer?"],
    "contexts": [["RAGAs offers 5 core metrics."]],
    "answer": ["RAGAs offers 5 core metrics."],
    "ground_truths": [["RAGAs offers 5 core metrics."]],
})

results = evaluate(dataset)
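
If you only care about a subset of those five metrics, you can pass them explicitly. A quick sketch, assuming the metric objects exported by ragas.metrics (names may differ slightly across RAGAs versions):

from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# score only the metrics you care about
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)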

🌟 Star RAGAs on GitHub


4. Deepchecks

Deepchecks stands out as it is geared more towards evaluating the LLM itself, rather than LLM systems/applications.

It is not higher on the list due to its complicated developer experience (seriously, try setting it up yourself and let me know how it goes). Its open-source offering is unique, though, because it focuses heavily on dashboards and a visualization UI, which makes it easy to explore evaluation results.


🌟 Star Deepchecks on GitHub


5. Arize AI Phoenix

Last on the list, Arize AI's Phoenix evaluates LLM applications through extensive observability into LLM traces. However, it is quite limited, offering only three evaluation criteria (a rough usage sketch follows the list):

  1. QA Correctness
  2. Hallucination
  3. Toxicity
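
For reference, here is roughly what running two of those evaluators over a DataFrame of traced queries looks like, based on Phoenix's evals docs - treat the class names, the model argument, and the input/output/reference column names as assumptions that may shift between Phoenix versions:

import pandas as pd
from phoenix.evals import HallucinationEvaluator, OpenAIModel, QAEvaluator, run_evals

# assumed: a DataFrame of your app's queries, answers, and retrieved context
queries_df = pd.DataFrame({
    "input": ["How many evaluation metrics does DeepEval offer?"],
    "output": ["14+ evaluation metrics"],
    "reference": ["DeepEval offers 14+ evaluation metrics"],
})

eval_model = OpenAIModel(model="gpt-4")

hallucination_df, qa_df = run_evals(
    dataframe=queries_df,
    evaluators=[HallucinationEvaluator(eval_model), QAEvaluator(eval_model)],
    provide_explanation=True,   # adds an LLM-written explanation per row
)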


🌟 Star Phoenix on GitHub


So there you have it, the list of top LLM evaluation frameworks GitHub has to offer in 2024. Think there's something I've missed? Comment below to let me know!

Thank you for reading, and till next time 😊

Top comments (7)

Bap
Nice list!

Ranjan Dailata
(comment hidden by the post author)

Jeffrey Ip
Top 5 only!

Sally O
Somehow you missed TruLens: github.com/truera/trulens

DeepLearning AI has a whole free course on how to use it to test RAG apps and a workshop on agents, too.
deeplearning.ai/short-courses/buil...
youtube.com/watch?v=0pnEUAwoDP0

Jeffrey Ip
Did not miss it, top 5 only!

Matija Sosic
It's good to be aware of these. Are all of them comparable, or do they define the benchmarks differently? In other words, is there a "golden standard" that all benchmarking tools follow?

Jeffrey Ip
Glad you liked it!