Testing Large Language Models (LLMs) within real-world applications presents a unique set of challenges. In this post, I’ll share my experience navigating the complexities of testing an LLM-based API, the hurdles encountered, and how tools like Pytest and Trulens became instrumental in achieving reliable and meaningful test results.
Introduction
As I was tasked with testing LLM responses, I faced one of the biggest fears of any QA automation engineer: creating infrastructure and tests for an inconsistent and unpredictable system, which could lead to unreliable tests. How could I create reliable tests that would ensure our LLM output provides users with the accurate information they need to know?
The project involved developing an application where the LLM assists users by answering medical insurance policy-related questions. Users input specific medical services, and the LLM provides detailed information based on predefined policy data. Ensuring the accuracy and consistency of these responses was critical, as they directly influenced users’ understanding of their health coverage and potential out-of-pocket costs.
The consequences of providing incorrect information are clear—nobody wants to believe they are covered for a medical service, only to discover that the information is totally wrong when they are already at the doctor's office.
So, why is it so challenging to test LLMs?
One of the most significant challenges in testing LLMs is their tendency to generate varied responses, even when presented with the same prompt. This variability can lead to inconsistent test results, making it difficult to assert whether the model's output aligns with the expected ground truth.
Another critical challenge is the presence of hidden hallucinations or subtle errors in the LLM's responses. These are inaccuracies that may seem minor but can significantly impact the user's understanding of their policy. For example, a small typo or a slightly incorrect amount can mislead users, potentially causing confusion or financial misunderstandings.
First Strategy: Testing with Trulens
When we searched the internet for an optimal solution to our challenges, our initial strategy was to find a tool that would allow us to create some ground truth, test the LLM response, and create a user-friendly dashboard. That's when we came across Trulens.
Trulens is a tool designed to monitor and evaluate LLM outputs. It allows for a more nuanced understanding of the model's decision-making process by providing similarity scores and other metrics that reflect how closely the LLM's responses align with the expected outcomes. It also offers features such as answer relevance, groundedness, and context relevance, which could open the door to many other kinds of tests.
Implementing Ground Truth Feedback
The first approach was to utilize Trulens' ground truth feedback feature. We provided the expected output, which was the final reports about the medical services that had been approved by our team, and Trulens evaluated the LLM's responses against this baseline, assigning scores that indicate the degree of alignment.
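To make this concrete, here is a minimal sketch of how such a ground truth feedback can be wired up with the trulens_eval package. The import paths, golden-set keys, and class names follow the library's ground-truth quickstart and vary between TruLens versions, and `generate_policy_report` is a placeholder for our own application call, so treat this as illustrative rather than exact:

```python
# Illustrative sketch only: import paths and golden-set keys follow the
# trulens_eval ground-truth quickstart and may differ between versions.
from trulens_eval import Feedback, TruBasicApp
from trulens_eval.feedback import GroundTruthAgreement


def generate_policy_report(service: str) -> str:
    """Placeholder for the application under test (our primary LLM)."""
    raise NotImplementedError


# The "golden set": approved final reports keyed by the user's query.
golden_set = [
    {
        "query": "MRI",
        "response": (
            "In-Network: Covered with additional cost-sharing. "
            "Referral is required. $300 copayment per MRI test; "
            "deductible does not apply. Out-of-Network: Not covered."
        ),
    },
]

# Feedback function scoring how closely each response agrees with the golden one.
f_groundtruth = Feedback(
    GroundTruthAgreement(golden_set).agreement_measure,
    name="Ground Truth Agreement",
).on_input_output()

# Record every call to the app and attach the ground truth score to it.
recorder = TruBasicApp(generate_policy_report, feedbacks=[f_groundtruth])

with recorder as recording:
    recorder.app("MRI")  # the score then shows up in the Trulens dashboard
```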
Example Evaluation:
Ground Truth – Final Report for MRI:
- **In-Network**:
- **Coverage**: Covered with additional cost-sharing.
- **Requirements**: Referral is required.
- **Liabilities**: $300 copayment per MRI test; deductible does not apply.
- **Out-of-Network**:
- **Coverage**: Not covered.
- **Additional Information**: Ensure the MRI is medically necessary as determined by your doctor to qualify for coverage.
Always check with your insurance provider for any updates or changes to coverage terms.
First test LLM Response:
In-Network:
Coverage: Covered with additional cost-sharing.
Requirements: Referral is required; MRI must be ordered by your doctor.
Liabilities: $300 copayment per MRI test; contributes to out-of-pocket limit.
Out-of-Network:
Coverage: Not covered.
Additional Information:
Always check with your insurance or healthcare provider to confirm if prior authorization is needed.
The out-of-pocket limit is $9,450 for an individual and $18,900 for a family.
For personalized details, contact Blue Cross and Blue Shield of Illinois or your healthcare provider.
First test Trulens Score: 1.0 (Perfect Match)
Second test LLM Response:
In-Network:
Coverage: Covered with additional cost-sharing.
Requirements: Referral is required.
Liabilities: $30 copayment per MRI test; deductible does not apply.
Out-of-Network:
Coverage: Not covered.
Additional Information:
Ensure the MRI is medically necessary as determined by your doctor to qualify for coverage.
Always check with your insurance provider for any updates or changes to coverage terms.
Second test Trulens Score: 0.6 (Partial Match)
Challenges with Trulens Evaluation:
At first, we thought Trulens exceeded our expectations: a variety of test scenarios, easy development, a user-friendly dashboard, and plenty of other features. What could go wrong?
While Trulens provided all of those, it presented its own set of challenges:
Ambiguous Scoring: Even when the LLM's response was completely incorrect, such as stating $30 instead of $300, Trulens might still assign a relatively high similarity score (e.g., 0.6). This numerical similarity doesn't necessarily reflect the contextual correctness, leading to false positives in test evaluations.
Interpretation Complexity: The nuanced scores require a deeper understanding to interpret correctly. If the tone or wording changed but details such as the need for pre-authorization or the cost of the visit stayed the same, the test should be green for us; from a text-similarity perspective, however, the response looks completely different and the test fails.
Non-Binary Outcomes: Unlike binary pass/fail criteria, similarity scores introduced a spectrum of outcomes, making it harder to automate clear-cut assertions in tests, especially at the scale of a large automation suite.
To overcome these challenges, we decided to narrow our ground truth to specific parts of the final report (while still comparing it against the final report), such as the copayment amount, but we found that we were still facing the same issues described above.
For example:
Final report: same as the example above
Ground Truth: “copayment amount is $30” (while the final report says $300)
Trulens Score: 0.3
These challenges highlighted the limitations of relying solely on similarity-based evaluation tools like Trulens for assertive testing of LLM responses.
Adopting a New Strategy: Direct Assertion Against Ground Truth
Given the limitations encountered with Trulens, a shift in strategy was necessary to achieve unequivocal test results. The need for binary results made us realize that we needed to use assert methods. But against what could we assert our ground truth? Using the full report might be too complicated for large-scale testing.
And then it hit us: we could use another LLM to read the report and search for a specific area in the text.
The decision was made to directly query another LLM about the final report and assert its answers against the expected ground truth.
The new strategy began with defining specific questions about parts of the final report, along with their corresponding expected answers. The final report output, together with the questions about it, was sent to another LLM, and its answers were asserted against our expected answers.
Instead of relying on Trulens to evaluate the responses, the new approach involves using another instance of the LLM (OpenAI's GPT-4o) to answer questions about the final report. By doing so, we leverage the LLM’s own understanding to verify the accuracy of the responses generated by our primary LLM.
For example, “In the text above, what is the cost of the payment?” should return a deterministic response: $300, nothing more and nothing less. To assert this, we moved to a more traditional tool, pytest, which simply compares the values, since we no longer need a similarity score.
What does the process look like?
Generate the Final Report: The primary LLM under test generates the final report based on the given medical service.
Structured Testing: Structured questions are posed to an external LLM (GPT-4o), which answers them based on the final report.
Assertion: The responses are directly compared against the expected answers, ensuring accuracy without the ambiguity of similarity scores. A code sketch of this flow follows below.
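Here is a minimal sketch of that flow using pytest and the OpenAI Python client (v1.x style). `generate_final_report` is a stand-in for our own application call, and the prompt wording is illustrative rather than the exact one we shipped:

```python
# Sketch of the direct-assertion flow; generate_final_report stands in for
# the LLM application under test, and the client usage assumes openai>=1.x.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_final_report(service: str) -> str:
    """Placeholder for the primary LLM that produces the final report."""
    raise NotImplementedError


def ask_about_report(report: str, question: str) -> str:
    """Pose a structured question about the report to the external LLM."""
    prompt = f"Based on the following policy details:\n{report}\n\nQuestion: {question}"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the verifying LLM as deterministic as possible
    )
    return response.choices[0].message.content.strip().lower()


def test_mri_copayment():
    # 1. Generate the final report for the given medical service.
    report = generate_final_report("MRI")

    # 2. Ask the external LLM a closed, multiple-choice question about it.
    answer = ask_about_report(
        report,
        "What is the copayment amount for MRI? "
        "Choose one of the following: a. $100, b. $200, c. $300",
    )

    # 3. Binary assertion against the expected ground truth.
    assert answer == "c"
```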
However, as easy as it seems, the challenges of using LLMs that I mentioned earlier surfaced again. During this transition, I encountered a couple of unexpected hurdles:
Unexpected Response Formats: Sometimes, the external OpenAI LLM would respond with full sentences or phrases instead of just the expected single letters ("a", "b", etc.).
For example:
Question: "Is a referral required for MRI? Choose one of the following: a. yes, b. no"
LLM answers: “a.yes”, “option a”.
This caused the assertions in the tests to fail even when the underlying answer was correct (a tolerant-parsing sketch follows the next hurdle).
Misinterpretation of Report Structure: When the final report came back with structural differences, the external LLM would sometimes overlook or misinterpret the relevant information, concluding that some details weren't present in the report.
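For the first hurdle, besides the prompt refinement described next, a small tolerant-parsing helper can keep assertions from failing on replies like "a.yes" or "option a". This is a sketch of our own making, not part of any library:

```python
import re


def extract_choice(raw_answer: str) -> str:
    """Pull the single answer letter out of replies like 'a.yes' or 'Option a'."""
    match = re.search(r"\b([a-e])\b", raw_answer.strip().lower())
    if match is None:
        raise AssertionError(f"No answer letter found in: {raw_answer!r}")
    return match.group(1)


assert extract_choice("a.yes") == "a"
assert extract_choice("option a") == "a"
```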
To address these challenges, I refined both the prompts sent to the external LLM and the questions themselves to ensure clarity and adherence to the expected response format.
Refined Prompt Structure:
Based on the following policy details for MRI:
{{ Here we place the answer we would like to validate }}
Question: What is the copayment/Liabilities amount for MRI? Choose one of the following:
a. $100
b. $200
c. $300
d. $400
e. $500
Please respond with only the lowercase letter corresponding to the correct answer (e.g., 'a').
To ensure that the external LLM focuses on the correct section of the report and responds in the desired format, I made the questions more explicit. These refinements ensured that the external LLM understood exactly where to look in the report and how to format its response, thereby eliminating the issues of unexpected response formats and misinterpretation of the report's structure.
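In practice, we can keep that wording consistent across every test by baking it into a single reusable template; the names below are illustrative placeholders rather than a fixed API:

```python
# Illustrative template that bakes the formatting instruction into every question.
REFINED_PROMPT = (
    "Based on the following policy details for {service}:\n"
    "{report}\n\n"
    "Question: {question}\n"
    "Please respond with only the lowercase letter corresponding to the "
    "correct answer (e.g., 'a')."
)


def build_prompt(service: str, report: str, question: str) -> str:
    """Render the refined prompt for a given service, report, and question."""
    return REFINED_PROMPT.format(service=service, report=report, question=question)
```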
Benefits of the New Approach
Clarity and Precision:
By structuring the prompt to require only the corresponding letter as a response, the likelihood of extraneous text is minimized, leading to more straightforward assertions.
Direct Comparison:
Eliminating the intermediary evaluation step allows for direct comparison between the LLM's response and the expected ground truth, ensuring unequivocal test results.
Reduced Ambiguity:
This method avoids the ambiguity introduced by similarity scores, focusing solely on whether the response matches the expected answer.
Simplified Testing Framework:
Streamlining the testing process by removing the reliance on Trulens simplifies the overall framework, making it easier to maintain and understand.
Early Error Detection:
By directly asserting responses, any changes or errors in the final report that affect crucial information are quickly identified, preventing misleading users.
This new strategy is akin to having a second opinion from a trusted friend before finalizing important decisions—ensuring accuracy and reliability without getting lost in the fuzzy middle ground.
Scaling the Test:
As the new strategy proved very useful, the potential to automate the test development process was promising.
Since the test procedure was built to suit all possible scenarios, we created a JSON file with the test data: the medical service, questions about the final report, and the expected answers to them.
Sample JSON Structure:
{
"services": [
{
"query": "MRI",
"tests": [
{
"test_number": 1,
"additional_question": "What is the copayment amount for MRI? Choose one of the following: a. $100, b. $200, c. $300",
"expected_response": "c"
},
{
"test_number": 2,
"additional_question": "Is a referral required for MRI? Choose one of the following: a. yes, b. no",
"expected_response": "a"
}
]
}
// Additional services and tests...
]
}
This structured approach ensures that every crucial aspect of the policy is covered and that the expected responses are clearly defined. It also lets us expand the test suite as needed: to add new medical services or new questions about a report, we just add the desired entry to the JSON.
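With the JSON in place, pytest's parametrization can turn every entry into its own test case. A sketch, assuming the file lives at tests/test_data.json (with the illustrative // comment removed, since JSON has no comments) and reusing the `generate_final_report` and `ask_about_report` helpers from the earlier sketch:

```python
import json
from pathlib import Path

import pytest

# Assumed location of the JSON test data shown above.
TEST_DATA = json.loads(Path("tests/test_data.json").read_text())

# Flatten every (service, question, expected answer) triple into one pytest case.
CASES = [
    (service["query"], test["additional_question"], test["expected_response"])
    for service in TEST_DATA["services"]
    for test in service["tests"]
]


@pytest.mark.parametrize("service, question, expected", CASES)
def test_final_report(service, question, expected):
    report = generate_final_report(service)      # primary LLM under test
    answer = ask_about_report(report, question)  # external LLM (GPT-4o)
    assert answer == expected
```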
Moreover, to make the flow almost fully automated, we provided ChatGPT with our verified final reports and example questions about some of them, and it generated a full JSON file like the one above, covering all the scenarios to be tested. In that way, we can expand our tests with minimal work.
What's next?
As you can understand, this strategy can open many doors for testing other LLM applications in the future. All we need is to define our ground truth—the aspects we expect the LLM to provide 100% reliable information about—and simply ask another LLM questions about the output.
And remember, when your LLM tries to pull a fast one on you with a "slightly off" answer, a well-crafted prompt and a sharp eye can keep your tests running smoothly. After all, in the wild world of LLM testing, it's better to catch the $300 copayment errors before they turn into $30,000 misunderstandings!