A complete list of all the LLM evaluation metrics you need to care about!

Recently, I have been talking to a lot of LLM developers to understand the issues they face while building production-grade LLM applications. One theme kept coming up in those interviews: most of them are not sure what to evaluate besides the extent of hallucinations.

To make that easy for you, here's a compiled list of the most important evaluation metrics you need to consider before launching your LLM application to production. I have also added notebooks for you to try them out:

Response Quality:

Metric | Usage
Response Completeness | Evaluate whether the response completely resolves the given user query.
Response Relevance | Evaluate whether the generated response is relevant to the given question.
Response Conciseness | Evaluate how concise the generated response is, i.e. how much additional irrelevant information it contains.
Response Matching | Compare the LLM-generated text with the gold (ideal) response using the defined score metric.
Response Consistency | Evaluate how consistent the response is with the question asked as well as with the context provided.
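
To make Response Matching concrete, here is a minimal sketch of one possible score metric: token-overlap F1 between the generated response and the gold response. The choice of F1 and the helper names `tokenize` and `token_f1` are my assumptions for illustration; the "defined score metric" could just as well be exact match, ROUGE, or an embedding-based score.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase and keep alphanumeric word tokens only (a simplifying assumption).
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(generated: str, gold: str) -> float:
    """Token-overlap F1 between a generated response and a gold response."""
    gen_tokens = tokenize(generated)
    gold_tokens = tokenize(gold)
    if not gen_tokens or not gold_tokens:
        # Two empty texts count as a match; one empty text counts as a miss.
        return float(gen_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(gen_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(token_f1("The cat sat", "The cat"))
```

A score near 1 means the response uses almost the same words as the gold answer. Word overlap does not capture paraphrases, so it is best treated as a cheap first check.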

Quality of Retrieved Context and Response Groundedness:

Metric | Usage
Factual Accuracy | Evaluate whether the facts present in the response can be verified by the retrieved context.
Response Completeness wrt Context | Grade how completely the response answers the question, given the information present in the context.
Context Relevance | Evaluate whether the retrieved context contains sufficient information to answer the given question.
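
Checks like Factual Accuracy are often run by asking a second LLM to act as a grader. The sketch below shows that pattern under stated assumptions: the prompt wording, the yes/unclear/no scoring scale, and the `judge` callable (any function that sends a prompt to an LLM and returns its text reply) are all hypothetical, and the facts are assumed to have been extracted from the response already.

```python
from typing import Callable

# Hypothetical grading prompt; the wording is an assumption, not a standard.
GRADING_PROMPT = """You are grading a response for factual accuracy.

Context:
{context}

Fact to verify:
{fact}

Can the fact be verified from the context alone?
Answer with one word: yes, no, or unclear."""

# Assumed mapping from the judge's verdict to a numeric score.
VERDICT_SCORES = {"yes": 1.0, "unclear": 0.5, "no": 0.0}


def factual_accuracy(
    facts: list[str], context: str, judge: Callable[[str], str]
) -> float:
    """Average the per-fact verdicts returned by `judge`."""
    if not facts:
        raise ValueError("no facts to grade")
    scores = []
    for fact in facts:
        reply = judge(GRADING_PROMPT.format(context=context, fact=fact))
        verdict = reply.strip().lower().rstrip(".")
        # Replies outside the expected vocabulary are scored as unverified.
        scores.append(VERDICT_SCORES.get(verdict, 0.0))
    return sum(scores) / len(scores)
```

Keeping `judge` as a plain callable means you can plug in whichever LLM client you use, or a stub function when testing the scoring logic.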

Prompt Security:

Metric | Usage
Prompt Injection | Identify prompt leakage attacks, where a user tries to make the LLM reveal its system prompt.

Language Quality of Response:

Metric | Usage
Tone Critique | Assess whether the tone of machine-generated responses matches the desired persona.
Language Critique | Evaluate LLM-generated responses on multiple aspects: fluency, politeness, grammar, and coherence.

Conversation Quality:

Metric | Usage
Conversation Satisfaction | Measure the user’s satisfaction with the AI assistant conversation, based on completeness and user acceptance.

Some other Custom Evaluations:

Metric | Usage
Guideline Adherence | Grade how well the LLM adheres to a given custom guideline.
Custom Prompt Evaluation | Evaluate responses using a grading prompt that you define yourself.
Cosine Similarity | Calculate the cosine similarity between the embeddings of two texts.
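
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. Here is a minimal, dependency-free sketch. The function name `cosine_similarity` and the toy vectors are my own, and real embeddings would come from whichever embedding model you use.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy 3-dimensional "embeddings" standing in for real model output.
    print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

The score ranges from -1 (opposite direction) to 1 (same direction), so two texts with similar meaning should score closer to 1.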

