Shannon Lal

Unit Testing LLMs with DeepEval

For the last year I have been working with different LLMs (OpenAI, Claude, PaLM, Gemini, etc.), and I have been impressed with their performance. With the rapid advancements in AI and the increasing complexity of LLMs, it has become crucial to have a reliable testing framework that can help us maintain the quality of our prompts and ensure the best possible outcomes for our users. Recently, I discovered DeepEval (https://github.com/confident-ai/deepeval), an LLM testing framework that has revolutionized the way we approach prompt quality assurance.

DeepEval: A Comprehensive LLM Testing Framework
DeepEval is an open-source framework designed specifically for testing the quality of LLMs. It provides a simple and intuitive way to "unit test" LLM outputs, similar to how developers use Pytest for traditional software testing. With DeepEval, you can easily create test cases, define metrics, and evaluate the performance of your LLM applications.

One of the key benefits of DeepEval is its extensive collection of plug-and-use metrics, with over 14 LLM-evaluated metrics backed by research. These metrics cover a wide range of use cases, allowing you to assess various aspects of your LLM's performance, such as answer relevancy, faithfulness, and hallucination. Additionally, DeepEval offers the flexibility to customize metrics to suit your specific needs, ensuring that you can thoroughly evaluate your LLM applications.
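
If you want to see how a single metric behaves before wiring it into a test suite, each metric can also be run on its own. The snippet below is a minimal sketch based on DeepEval's metric API (metric.measure plus the score and reason attributes); the input and output strings are made up for illustration, and an OpenAI API key is assumed to be configured for the evaluation model.

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Evaluate a single output against one metric, outside of any test runner
metric = AnswerRelevancyMetric(threshold=0.7, model="gpt-4", include_reason=True)
test_case = LLMTestCase(
    input="What does the coverage score measure?",
    actual_output="The coverage score measures how much of the original document's key information a summary captures.",
)
metric.measure(test_case)
print(metric.score, metric.reason)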

Code Example: Evaluating LLM Output with DeepEval
Last weekend I spent some time putting together an example of a DeepEval unit test that works with OpenAI GPT-3.5 and Claude 3 Haiku (running on AWS Bedrock). In this example I wanted to test each model's ability to summarize a small piece of text and then evaluate the response using the following metrics:

  • Answer Relevancy
  • Summary Metric
  • Latency Metric

The first section loads the dependencies and sets up the mock data to test with:

import asyncio
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, SummarizationMetric, LatencyMetric
# LanguageModelType is assumed to live in the same helper module as send
from .deepeval_test_utils import send, LanguageModelType

mock_data = {
    "input": "The 'coverage score' is calculated as the percentage of assessment questions for which both the summary and the original document provide a 'yes' answer. This method ensures that the summary not only includes key information from the original text but also accurately represents it. A higher coverage score indicates a more comprehensive and faithful summary, signifying that the summary effectively encapsulates the crucial points and details from the original content.",
    "expected_output": "The coverage score quantifies how well a summary captures and accurately represents key information from the original text, with a higher score indicating greater comprehensiveness."
}


The next block of Python code builds the prompt for the given model and returns the corresponding configuration for that LLM:

def get_summary_prompt(input: str, model_type: LanguageModelType):
    if model_type == LanguageModelType.GPT_35:
        return _get_openai_summary_prompt(input)
    else:
        return _get_claude_summary_prompt(input)


def _get_openai_summary_prompt(input: str):
    # Generation settings for the OpenAI chat completion call
    config = {}
    config["top_p"] = 1
    config["max_tokens"] = 1000
    config["temperature"] = 1
    config["model"] = "gpt-3.5-turbo-16k"

    print("input", input)
    prompt = [
        {
            "role": "system",
            "content": "You are a knowledge base AI. Please summarize the following text:",
        },
        {
            "role": "user",
            "content": input
        }
    ]
    return (prompt, config)


def _get_claude_summary_prompt(input: str):
    # Claude prompt uses the Human/Assistant text format
    prompt = f"""Human: Please summarize the following text: {input}
             Assistant: """
    # Generation settings for the Bedrock call
    config = {}
    config["temperature"] = 1
    config["top_p"] = 0.999
    config["top_k"] = 350
    config["model_id"] = "anthropic.claude-3-haiku-20240307-v1:0"
    return (prompt, config)


The next block of code is the main function, which retrieves the prompt based on the model type, sends it to the specific LLM (either OpenAI or Claude), and returns the response.

async def get_summary(model_type: LanguageModelType):
    # Call LLM Interactor with the mock data
    prompt, config = get_summary_prompt(mock_data["input"], model_type)
    print("prompt", prompt)
    result = await send(llm_type=model_type, messages=prompt, config=config)
    print("results", result)
    return prompt, result
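
The send function (and the LanguageModelType enum) live in a small deepeval_test_utils helper that isn't shown in the post. For context, here is a rough sketch of what such a helper might look like, assuming the OpenAI Python SDK for GPT-3.5 and boto3's bedrock-runtime client for Claude 3 Haiku; since Claude 3 on Bedrock expects the Messages API, the raw Human/Assistant prompt string is wrapped into a single user message.

import json
from enum import Enum

import boto3
from openai import AsyncOpenAI


class LanguageModelType(Enum):
    GPT_35 = "gpt-3.5"
    CLAUDE = "claude"


async def send(llm_type: LanguageModelType, messages, config: dict) -> str:
    """Dispatch a prompt to OpenAI or to Claude on Bedrock and return the text response."""
    if llm_type == LanguageModelType.GPT_35:
        client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
        response = await client.chat.completions.create(
            model=config["model"],
            messages=messages,
            temperature=config["temperature"],
            top_p=config["top_p"],
            max_tokens=config["max_tokens"],
        )
        return response.choices[0].message.content

    # Claude 3 models on Bedrock use the Messages API, so the raw
    # "Human: ... Assistant:" prompt string is wrapped into a single user message
    bedrock = boto3.client("bedrock-runtime")
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "temperature": config["temperature"],
        "top_p": config["top_p"],
        "top_k": config["top_k"],
        "messages": [{"role": "user", "content": messages}],
    }
    response = bedrock.invoke_model(modelId=config["model_id"], body=json.dumps(body))
    payload = json.loads(response["body"].read())
    return payload["content"][0]["text"]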

The following is the test case for OpenAI. It calls the get_summary function, sets up the appropriate metrics, and validates the response in the test case:

def test_openai_summary():
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    prompt, result = loop.run_until_complete(
        get_summary(LanguageModelType.GPT_35)
    )
    answer_relevancy_metric = AnswerRelevancyMetric(
        threshold=0.5, model="gpt-3.5-turbo", include_reason=True
    )
    summary_metric = SummarizationMetric(
        threshold=0.5,
        model="gpt-4",
        assessment_questions=[
            "Is the coverage score based on a percentage of 'yes' answers?",
            "Does the score ensure the summary's accuracy with the source?",
            "Does a higher score mean a more comprehensive summary?"
        ]
    )

    latency_metric = LatencyMetric(max_latency=7.0)

    test_case = LLMTestCase(
        input=mock_data["input"],
        actual_output=str(result),
        expected_output=mock_data["expected_output"],
        latency=6.0
    )
    assert_test(test_case, metrics=[answer_relevancy_metric, summary_metric, latency_metric])



The next block of code tests the summarization against the Claude model:

def test_claude_summary():
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    prompt, result = loop.run_until_complete(
        get_summary(LanguageModelType.CLAUDE)
    )
    answer_relevancy_metric = AnswerRelevancyMetric(
        threshold=0.5, model="gpt-4", include_reason=True
    )

    summary_metric = SummarizationMetric(
        threshold=0.5,
        model="gpt-4",
        assessment_questions=[
            "Is the coverage score based on a percentage of 'yes' answers?",
            "Does the score ensure the summary's accuracy with the source?",
            "Does a higher score mean a more comprehensive summary?"
        ]
    )

    latency_metric = LatencyMetric(max_latency=7.0)

    test_case = LLMTestCase(
        input=str(prompt),
        actual_output=str(result),
        expected_output=mock_data["expected_output"],
        latency=6.0
    )
    assert_test(test_case, metrics=[answer_relevancy_metric, summary_metric, latency_metric])
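
Note that both tests pass a hard-coded latency=6.0 into LLMTestCase. In a real suite you would likely measure the elapsed time around the LLM call and pass that in instead; here is a small sketch reusing the helpers above:

import time

def get_summary_with_latency(model_type: LanguageModelType):
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    # Time the round trip to the LLM so the LatencyMetric checks real numbers
    start = time.perf_counter()
    prompt, result = loop.run_until_complete(get_summary(model_type))
    elapsed = time.perf_counter() - start
    return prompt, result, elapsed

# ...and in the test case:
# test_case = LLMTestCase(..., latency=elapsed)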


To run the tests you can use the following command:

deepeval test run test_sample.py 

After the test cases have run, DeepEval prints a summary with a breakdown of its findings:

┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃                     ┃                  ┃                      ┃        ┃ Overall Success     ┃
┃ Test case           ┃ Metric           ┃ Score                ┃ Status ┃ Rate                ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ test_openai_summary │                  │                      │        │ 100.0%              │
│                     │ Answer Relevancy │ 1.0 (threshold=0.5,  │ PASSED │                     │


In this example, we create a test case using the LLMTestCase class, specifying the input prompt and the actual output generated by the LLM application. We then define three metrics to evaluate the relevancy, summarization quality, and latency of the LLM's output. Finally, we use the assert_test function to run the evaluation and ensure that the output meets the specified criteria.
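
If you prefer to run the same checks outside of Pytest (for example from a notebook or a script), DeepEval also exposes an evaluate helper that takes the same test cases and metrics. A quick sketch, assuming the objects defined in the tests above are in scope:

from deepeval import evaluate

# Run the metrics against the test case and print a report without Pytest
evaluate(test_cases=[test_case], metrics=[answer_relevancy_metric, summary_metric, latency_metric])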

DeepEval has been a game-changer for our team in ensuring the quality of our LLM applications. By providing a comprehensive testing framework with a wide range of metrics and synthetic dataset generation capabilities, DeepEval has streamlined our testing process and given us the confidence to deploy new versions of LLMs with ease.

For application developers and engineers who are new to AI and LLMs, DeepEval offers a simple and accessible way to evaluate the performance of your LLM applications. By leveraging the power of DeepEval, you can focus on developing high-quality API integrations while ensuring that your LLM prompts remain effective and reliable.

I highly recommend exploring DeepEval and incorporating it into your LLM testing workflow. With its extensive features and user-friendly interface, DeepEval is an invaluable tool for anyone working with LLMs, regardless of their level of expertise in AI and data science.
