DEV Community

Joschka Braun


LLM Evaluation Metrics for Labeled Data

The following is an overview of general-purpose evaluation metrics based on foundation models and fine-tuned LLMs, as well as RAG-specific evaluation metrics. All of these metrics rely on ground-truth annotations or reference answers to assess the correctness of the model response. They were collected from research literature and discussions with other LLM app builders. Python implementations or links to the models are provided where available.

General Purpose Evaluation Metrics using Foundational Models

A classical yet predictive way to assess how much the model response agrees with the reference answer is to measure the overlap between the two. This is suggested in Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering. Concretely, the authors suggest measuring the proportion of tokens in the reference answer that also appear in the model response, i.e., the recall. They find that this metric lags only slightly behind using GPT-3.5-turbo to compare the output with the reference answer (see Table 2 of the paper).

Code: here
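As a rough illustration of the token-recall idea, the sketch below counts how many reference tokens also occur in the response. The normalization (lowercasing, stripping punctuation, whitespace tokenization) and the function names are assumptions made for this example and may differ from the paper's implementation.

```python
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop ASCII punctuation, and split on whitespace (assumed normalization)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return text.split()


def token_recall(response: str, reference: str) -> float:
    """Fraction of reference tokens that also appear in the response."""
    ref_tokens = normalize(reference)
    if not ref_tokens:
        return 0.0
    resp_counts = Counter(normalize(response))
    ref_counts = Counter(ref_tokens)
    # Each reference token is matched at most as often as it occurs in the response.
    overlap = sum(min(count, resp_counts[tok]) for tok, count in ref_counts.items())
    return overlap / len(ref_tokens)


print(token_recall("The capital is Paris", "Paris, France"))  # 0.5
```

Because only recall is measured, a long response that happens to contain the reference tokens still scores highly, which is worth keeping in mind when responses are verbose.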

The authors compared further methods by their correlation with human judgment and found that the most predictive metric for the correctness of the model response is to have another LLM grade it, in this case GPT-4. In particular, they instruct the LLM to compare the generated response with the ground truth answer and output "no" if the response is missing any information from the ground truth answer.

Code: here
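A minimal sketch of this LLM-as-grader pattern is shown below. The prompt wording is an assumption written for this example, not the paper's prompt, and `llm` is a hypothetical callable that takes a prompt string and returns the model's text, so any client can be plugged in.

```python
from typing import Callable

# Hypothetical prompt; the paper's actual wording may differ.
GRADER_TEMPLATE = (
    "You are grading an answer against a ground truth answer.\n"
    "Question: {question}\n"
    "Ground truth answer: {reference}\n"
    "Generated answer: {response}\n"
    "Reply 'yes' if the generated answer contains all information from the "
    "ground truth answer, otherwise reply 'no'."
)


def build_grader_prompt(question: str, reference: str, response: str) -> str:
    """Fill the grading template with the three inputs."""
    return GRADER_TEMPLATE.format(
        question=question, reference=reference, response=response
    )


def llm_correctness(
    question: str, reference: str, response: str, llm: Callable[[str], str]
) -> bool:
    """Return True if the grader's reply starts with 'yes' (case-insensitive)."""
    verdict = llm(build_grader_prompt(question, reference, response))
    return verdict.strip().lower().startswith("yes")


# Usage with a stub in place of a real model call.
print(llm_correctness("Capital of France?", "Paris", "It is Paris.", lambda p: "Yes."))
```

Keeping the model call behind a plain callable makes the grader easy to unit test with a stub, as in the last line.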

The authors of LLM-EVAL: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models take this method further by prompting an LLM to generate a JSON object whose fields are scores that assess the model response on different dimensions using a reference answer. While this method was developed for chatbots, it exemplifies using JSON generation as a way to assess the correctness of the model response on various criteria. They compared scoring scales of 0-5 and 0-100 and found that the 0-5 scale only slightly outperforms the 0-100 scale.
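To make the JSON-scoring idea concrete, here is a small sketch that parses and validates such an output. The dimension names and the helper `parse_scores` are illustrative assumptions; only the 0-5 range comes from the text above.

```python
import json


def parse_scores(
    raw: str, dimensions: list[str], low: float = 0, high: float = 5
) -> dict[str, float]:
    """Parse an LLM's JSON output and check every expected dimension is in range."""
    data = json.loads(raw)
    scores: dict[str, float] = {}
    for dim in dimensions:
        if dim not in data:
            raise ValueError(f"missing dimension: {dim}")
        value = float(data[dim])
        if not low <= value <= high:
            raise ValueError(f"{dim}={value} is outside [{low}, {high}]")
        scores[dim] = value
    return scores


# Example with illustrative dimension names.
raw_output = '{"relevance": 4, "content": 5}'
print(parse_scores(raw_output, ["relevance", "content"]))
```

Validating the range up front means malformed generations fail loudly instead of silently skewing aggregate scores.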

Fine-tuned LLMs as General Purpose Evaluation Metrics

An emerging body of work proposes fine-tuning LLMs to yield evaluations assessing the correctness of a model response given a reference answer.


The authors of Prometheus: Inducing fine-grained evaluation capability in language models fine-tune LLaMa-2-Chat (7B & 13B) to output feedback and a score from 1 to 5 given a response, the instruction that yielded the response, a reference answer to compare against, and a score rubric. The model is highly aligned with GPT-4 evaluation and is comparable to it in terms of performance (as measured by human annotators) while being drastically cheaper. They train the model on GPT-4-generated data, which contains fine-grained scoring rubrics (1k rubrics in total) and reference answers for a given instruction. The methods were benchmarked on MT Bench, Vicuna Bench, Feedback Bench & Flask Eval.

Model: here


The authors of CRITIQUELLM: Scaling LLM-as-Critic for Effective and Explainable Evaluation of Large Language Model Generation fine-tune ChatGLM-2 (6B, 12B & 66B) in two versions to output a score (1-10) and a critique. One fine-tuned version receives the user query and the model response as input, so it can be used as a reference-free evaluation metric. The other version receives the user query, the model response, and the reference answer, so it can be used as a reference-based evaluation metric.

While their method performs worse than GPT-4, it is interesting because it converts a reference-based evaluation metric into a reference-free one. They achieve this by training the reference-based model on GPT-4 outputs, and the reference-free model on GPT-4 outputs produced when GPT-4 is prompted to revise its previous evaluation so that it no longer uses the reference answer.


The authors of INSTRUCTSCORE: Explainable Text Generation Evaluation with Fine-grained Feedback extend the idea of fine-tuning an LLM to generate feedback & scores given a user query, the model response, and the reference answer. Instead of only giving feedback & scores, they fine-tuned the model to generate a report that contains a list of error types, locations, severity labels, and explanations. Their Llama-7B-based model is close in performance to supervised methods and outperforms GPT-4 based methods.

Model: here

RAG Specific Evaluation Metrics

In its simplest form, a RAG application consists of a retrieval and a generation step. The retrieval step fetches the context given a query. The generation step answers the initial query after being supplied with the fetched context. The following is a collection of evaluation metrics to evaluate the retrieval and generation steps in a RAG application.

Percent Target Supported by Context

This metric calculates the percentage of sentences in the target/ground truth that are supported by the retrieved context. It does so by instructing an LLM to analyze each sentence in the reference answer and output "yes" if the sentence is supported by the retrieved context and "no" otherwise. This is useful for understanding how well the retrieval step is working, and it provides an upper bound on the performance of the entire RAG system, since the generation step can only be as good as the retrieved context.

Code: here
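The aggregation behind this metric can be sketched as follows. The regex sentence splitter is a naive assumption, and `judge` is a hypothetical callable standing in for the LLM call: it receives a sentence and the context and returns "yes" or "no".

```python
import re
from typing import Callable


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def percent_target_supported(
    target: str, context: str, judge: Callable[[str, str], str]
) -> float:
    """Percentage of target sentences the judge marks as supported by the context."""
    sentences = split_sentences(target)
    if not sentences:
        return 0.0
    supported = sum(
        1 for s in sentences if judge(s, context).strip().lower().startswith("yes")
    )
    return 100.0 * supported / len(sentences)


# Usage with a stub judge (exact substring match) in place of an LLM.
stub_judge = lambda sentence, context: "yes" if sentence in context else "no"
print(percent_target_supported(
    "Paris is in France. Berlin is in Spain.", "Paris is in France.", stub_judge
))  # 50.0
```

A proper sentence tokenizer would handle abbreviations and decimals better than this regex; it is kept simple here to stay dependency-free.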

ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems

The authors of ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems improve upon the RAGAS paper by using 150 labeled data points to fine-tune LLM judges that evaluate context relevancy, answer faithfulness & answer relevance. Note that context relevancy measures how relevant the retrieved context is to the query, answer faithfulness measures how much the generated answer is based on the retrieved context, and answer relevance measures how well the generated answer matches the query.

Concretely, given a corpus of documents & few-shot examples of in-domain passages mapped to in-domain queries & answers, they generate synthetic triplets of query, passage, and answer. Then, they use these triplets to train LLM judges for context relevancy, answer faithfulness & answer relevance with a binary classification loss, and use the labeled data as a validation dataset. In the last step, they use the labeled data to learn a rectifier function to construct confidence intervals for the model's predictions (they leverage prediction-powered inference).
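The rectifier step can be illustrated with the standard prediction-powered estimate of a mean: the judge's average prediction on unlabeled data is corrected by its average error on the labeled set. This is a generic sketch under a normal approximation, not the ARES implementation, and the function name `ppi_mean_ci` is an assumption.

```python
import math
from statistics import NormalDist, fmean, variance


def ppi_mean_ci(
    unlabeled_preds: list[float],
    labeled_preds: list[float],
    labels: list[float],
    alpha: float = 0.05,
) -> tuple[float, float, float]:
    """Return (estimate, lower, upper) for the true mean label."""
    # Rectifier: average judge error (prediction minus human label) on labeled data.
    errors = [p - y for p, y in zip(labeled_preds, labels)]
    estimate = fmean(unlabeled_preds) - fmean(errors)
    # Variance combines judge spread on unlabeled data and error spread on labeled data.
    var = variance(unlabeled_preds) / len(unlabeled_preds) + variance(errors) / len(errors)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * math.sqrt(var)
    return estimate, estimate - half_width, estimate + half_width


# Judge predictions (1 = passes) on unlabeled data, plus a small labeled set.
est, lo, hi = ppi_mean_ci([1, 1, 0, 1, 0, 1, 1, 0], [1, 0, 1, 1], [1, 0, 0, 1])
print(round(est, 3), round(lo, 3), round(hi, 3))
```

The interval narrows as the judge's errors on the labeled set become more consistent, which is why even a small labeled set can be useful.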

When benchmarking their method to rank different RAG systems, they find that their method outperforms RAGAS and a GPT-3.5-turbo-16k baseline as measured by correlation of true ranking with ranking based on the scores of the respective method.

Code: here
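The ranking comparison described above can be sketched with a rank correlation. The example assumes Kendall's tau (the simple tau-a variant without tie correction) as the correlation measure; the system scores are made-up illustration values.

```python
from itertools import combinations


def kendall_tau(true_scores: list[float], predicted_scores: list[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(true_scores)
    if n < 2 or n != len(predicted_scores):
        raise ValueError("need two equally long lists with at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # A positive product means both score lists order systems i and j the same way.
        sign = (true_scores[i] - true_scores[j]) * (
            predicted_scores[i] - predicted_scores[j]
        )
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Three hypothetical RAG systems: true quality vs. scores from an evaluation method.
print(kendall_tau([0.9, 0.7, 0.5], [0.8, 0.4, 0.6]))
```

A value of 1 means the evaluation method ranks the systems exactly as the ground truth does, and -1 means the ranking is fully reversed.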

Getting Started

You can get started with these evaluation metrics on Parea.
