Emmanuel Aiyenigba

Posted on Jan 16

An ML Engineer’s Review of the Deepchecks LLM Evaluation Solution

#llm #machinelearning #ai #datascience

Introduction

Validating LLM-based applications from testing to production helps ensure that models perform optimally and that potential issues are flagged early. LLM evaluation allows developers and product managers to monitor, safeguard, and analyze the performance and capabilities of AI systems.

As an ML engineer, I know that LLM-based applications are becoming increasingly complex. That makes continuous LLM evaluation with efficient tools the best way to understand the capabilities of my models and ensure optimal performance for my application.

Among the diverse LLM evaluation tools available, I find that the Deepchecks LLM Evaluation System stands out for the way it addresses both model accuracy and safety. It thoroughly analyzes output for bias, toxicity, and leaks of personally identifiable information (PII). Sometimes, LLMs output sensitive personal information in response to seemingly innocuous inputs, so as a machine learning engineer I need to protect my pipeline against PII leaks. Deepchecks’ engine helps me do that by continuously monitoring my models for deviations and anomalies and by safeguarding them against generating harmful content.

I couldn’t say no to Deepchecks when they asked for my expert review of their LLM Evaluation Solution a couple of months after they launched it. The tool felt timely as it tackles a crucial challenge in the LLM space. Plus, who am I to turn down flattery?

When I tried out the Deepchecks LLM Evaluation solution with real data, I noticed that it helped reduce risk in my LLM pipeline by flagging potential issues and safeguarding my model from hallucination. I am going to walk you through how the tool works and give my honest expert review about it based on my experience.

First off, let me show you some benefits of LLM evaluation.

Benefits of LLM Evaluation

LLM evaluation helps you understand the strengths and weaknesses of your models and monitor them to ensure that their performance is optimal at all times. It offers a wide range of benefits for LLM-based applications.

The following are some benefits of LLM evaluation:

Improve model accuracy: Evaluation allows you to assess how well the LLM understands prompts and how you can help it improve. The evaluation engine analyzes the model and provides calculated values for accuracy and coherence. Developers can leverage these metrics to improve the model.
Optimize performance: Evaluation identifies and highlights the weak areas of your LLM. Identifying where your model underperforms will help you know where to optimize performance.
Flag potential pitfalls: LLM evaluation systems like Deepchecks allow you to monitor your model in real time to detect anomalies before they affect users. They also check for biases and harmful stereotypes throughout the lifecycle of your application.
Prevent model hallucination: Large language models may generate factually misleading and potentially harmful output. This is known as LLM hallucination. Continuous evaluation can help safeguard your LLM from hallucination.
Enhance efficiency: Evaluating your LLM can help boost efficiency by identifying the areas where it is inefficient. Evaluation can help you reduce operational costs for your LLM-based apps by identifying situations where your model performs well with less data.
Prevent harmful output: Evaluation can identify and block harmful content generated by the model.

Understanding Deepchecks LLM Evaluation

Before I proceed to evaluate real-world data with Deepchecks LLM Evaluation Solution, let me first help you understand what Deepchecks LLM Evaluation is all about.

Deepchecks LLM Evaluation offers solutions that optimize LLM pipelines, help you understand how your LLM performs, discover pitfalls, and prevent model hallucinations. It allows LLM-based applications to monitor, safeguard, and validate their models.

The system analyzes uploaded data, measures various aspects of LLM interactions (such as completeness, coherence, relevance, toxicity, and fluency), calculates the average value of each property, provides an overall score for your data, and highlights weak segments. Deepchecks LLM Evaluation also helps you understand what your users care about by grouping related user inputs into topics. This way, you not only know what your users are interested in, but you also identify the topics where your LLM’s responses are suboptimal.

Deepchecks provides an estimated annotation for each interaction. An annotation tells you how good or bad the response to an input is. Although you can annotate your data manually, it is more efficient to let Deepchecks Evaluation auto-annotate it. You can then review and edit the estimated annotations and customize the annotation configuration file.

The following are the key highlights of Deepchecks LLM Evaluation:

Dual focus: The Evaluation system has a dual focus, allowing you to analyze for both safety and accuracy. It checks for toxicity, bias, and personal information leaks in model output. The system also evaluates the quality of the responses by calculating the average values of properties like Fluency, Coherence, Completeness, Grounded in Context, and Avoided Answer.
Flexible testing and version comparison: Deepchecks LLM Evaluation allows you to combine automated tests with manual verification for a better result. Compare multiple versions of your data to help identify weak areas and improve the overall performance of your LLM.
Continuous monitoring for deviation and anomalies: The system ensures the optimal performance of your LLM pipeline by continuously monitoring it for deviation and anomalies.
Continuous evaluation throughout the lifecycle: Deepchecks lets you evaluate your data at various phases - testing, staging, and production - so that you can catch issues early.

Evaluating Sample Data with Deepchecks LLM Evaluation System

In this section, I will dive into how the Deepchecks LLM evaluation solution works. I want you to follow along. Together, we will evaluate real-world data and explore all the features in the tool to understand how it validates LLM-based pipelines.

Account and setup

Create an account.
Create a new application. Add the application name, version, and type.

Set Data Source to Golden Set.

Golden Set data is essentially a benchmark dataset used to evaluate and compare the performance of different versions of your model. It will help me gauge how well my model understands user inputs. Golden Set data should represent the data distribution your application will encounter in production, so that you have a dependable yardstick for comparing experiments.

Production data is the data encountered in production. Uploading production data will allow Deepchecks’ engine to automatically evaluate the user input and output of your model in production to help you understand the efficiency and accuracy of your pipeline.

Uploading data

I am using my ChatGPT data as my Golden Set.

Follow the steps below to download your ChatGPT data:

On the ChatGPT page, click on your name and select Settings.

Navigate to Data controls and click Export to export your data.

After downloading and unzipping your ChatGPT data, convert conversations.json, one of the exported files, to a CSV file. Format the CSV so that it contains at least the columns Deepchecks requires.

Below is the file structure the Deepchecks engine expects:

user_interaction_id: An optional column. It identifies interactions across different versions and when updating annotations, so each value should be unique.
input: A mandatory column. It holds the user input to the model.
output: A mandatory column. It holds the output the model generated for the user.
information_retrieval: An optional column. It holds the information used as context for the current request.
full_prompt: An optional column. It holds the full prompt of the current interaction.
annotation: An optional column. It rates the model’s output as good, bad, or unknown.

Upload your data after formatting it to the above structure. The Deepchecks engine will automatically evaluate it. You can also upload your data using the Deepchecks Python SDK.
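To make the conversion step concrete, here is a minimal Python sketch that turns a ChatGPT export into rows with the columns listed above. It assumes each conversation in conversations.json has a mapping of nodes whose message carries author.role, content.parts, and create_time; check your own export, because that layout is an assumption and may change. The function names are hypothetical, the sketch ignores regenerated (branched) replies, and it does not use the Deepchecks SDK.

```python
import csv
import io
import json

COLUMNS = ["user_interaction_id", "input", "output"]


def _text(message):
    """Join the string parts of one message; non-text parts are skipped."""
    parts = (message.get("content") or {}).get("parts") or []
    return "\n".join(p for p in parts if isinstance(p, str)).strip()


def conversations_to_rows(conversations):
    """Pair each user message with the next assistant message."""
    rows = []
    for conversation in conversations:
        # Assumed layout: "mapping" maps node ids to {"message": {...}}.
        messages = [
            node["message"]
            for node in conversation.get("mapping", {}).values()
            if node.get("message")
        ]
        messages.sort(key=lambda m: m.get("create_time") or 0)
        pending_input = None
        for message in messages:
            role = (message.get("author") or {}).get("role")
            text = _text(message)
            if not text:
                continue
            if role == "user":
                pending_input = text
            elif role == "assistant" and pending_input is not None:
                rows.append({
                    "user_interaction_id": f"interaction-{len(rows) + 1}",
                    "input": pending_input,
                    "output": text,
                })
                pending_input = None
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def convert_file(json_path, csv_path):
    """Read an exported conversations.json and write the CSV next to it."""
    with open(json_path, encoding="utf-8") as source:
        rows = conversations_to_rows(json.load(source))
    with open(csv_path, "w", encoding="utf-8", newline="") as target:
        target.write(rows_to_csv(rows))
```

Calling convert_file("conversations.json", "golden_set.csv") would produce a file with the two mandatory columns plus a unique user_interaction_id; the optional columns can be added later if you have that information.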

This is the structure of my data after formatting it.

I am uploading my data through the UI.

Annotation

Let me break down the evaluation result. I will begin with annotation.

You will notice that I did not provide annotations in my sample. This is because I want the Deepchecks system to do it for me. Deepchecks recommends having a baseline annotated by experts. There are two ways to do this:

Provide initial annotations manually done by experts in your sample. Manual annotation is not scalable - it requires a lot of time and effort.
Get estimated annotations from the system and do a human review of it. This is a great way to annotate samples.

In my case, I did the latter. Let’s take a look at the estimated annotations provided by Deepchecks.

The image above shows the annotation scores for my 104 interactions: 77% were annotated good, 11% were annotated bad, and 13% were not annotated.
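The three percentages add up to 101 rather than 100, which is what you would expect when each share is rounded on its own. The short Python sketch below shows how such shares can be computed from a list of annotations. The counts in the example are hypothetical, chosen only so that they total 104; the real breakdown behind the dashboard is not given in this post.

```python
from collections import Counter

LABELS = ("good", "bad", "unknown")


def annotation_shares(annotations):
    """Return the fraction of interactions per label.

    Anything that is not "good" or "bad" (for example None) counts as "unknown".
    """
    counts = Counter(a if a in ("good", "bad") else "unknown" for a in annotations)
    total = len(annotations)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}


# Hypothetical counts that total 104 interactions.
example = ["good"] * 80 + ["bad"] * 11 + [None] * 13
shares = annotation_shares(example)
```

The unrounded shares always sum to 1; it is only the rounded display values that can drift by a point.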

Next, let’s go over to the Data page to see how each interaction is annotated and do a human review.

I can review each interaction and edit the annotation. Click on the annotation sign to change the annotation.

Note: A solid-colored thumbs up or thumbs down denotes a manual annotation, while an outlined one denotes an estimated annotation (an annotation suggested by the system).

I can also customize the estimated annotation rules to improve score quality. To customize the estimated annotation rules, go to the Annotations Config page, download the current YAML config file, edit it, and upload a new version.

This is what the current configuration looks like:

You can customize the current configuration in the following ways:

Add or remove properties
Change the thresholds. Thresholds are used to flag anomalies.
Modify the sample evaluation prompt for unclassified data.
Change the similarity algorithm and its location.
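To show how threshold rules of this kind can turn property scores into an estimated annotation, here is a minimal Python sketch. The rule format, the threshold values, and the function name are all hypothetical; this is not the Deepchecks YAML schema or its actual annotation logic, only an illustration of the idea.

```python
# Hypothetical rules: a property, the direction that counts as bad, and a threshold.
RULES = [
    {"property": "Toxicity", "bad_when": "above", "threshold": 0.5},
    {"property": "Relevance", "bad_when": "below", "threshold": 0.3},
    {"property": "Avoided Answer", "bad_when": "above", "threshold": 0.5},
]


def estimate_annotation(scores, rules=RULES):
    """Return "bad" if any rule is violated, "good" if every rule passes,
    and "unknown" if no rule is violated but some property score is missing."""
    missing = False
    for rule in rules:
        value = scores.get(rule["property"])
        if value is None:
            missing = True
            continue
        if rule["bad_when"] == "above" and value > rule["threshold"]:
            return "bad"
        if rule["bad_when"] == "below" and value < rule["threshold"]:
            return "bad"
    return "unknown" if missing else "good"
```

In this sketch, raising or lowering a threshold changes how many interactions are flagged, and removing a rule means that property no longer affects the estimate.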

Properties

Deepchecks LLM Evaluation has over 34 built-in properties. Properties measure aspects of your interactions like completeness, coherence, relevance, and toxicity. These properties determine the estimated annotation.

Deepchecks LLM Evaluation has three types of properties: Built-in, Custom, and LLM.

The built-in properties are calculated by Deepchecks’ system using specially trained NLP models and algorithms. Built-in properties include Invalid Links, Reading Ease, Toxicity, Fluency, Formality, Sentiment, Avoided Answer, Grounded in Context, Relevance and Retrieval Relevance.

Custom properties are values that the user provides alongside the other fields in the data sample. I can add new custom properties on the Custom Properties page.

LLM properties evaluate the quality of interactions using LLMs.

Click Add property filter on the Data page to see the list of all properties and to filter properties.

The Overview page shows the values of calculated properties. Below are the calculated properties for my sample.

Breakdown of some of the calculated properties:

Relevance: This property shows the relevance of the output to the input. 0 is the lowest value (not relevant) and 1 is the highest (very relevant). The Relevance value of my sample is 0.87.
Grounded in Context: This property shows how well the output relates to the conversation context. 0 is the lowest value (not grounded in context) and 1 is the highest (grounded in context). The Grounded in Context value of my sample is 0.35.
Avoided Answer: The Avoided Answer value is the calculated probability (0 to 1) of the LLM intentionally or unintentionally avoiding the user’s questions.
Toxicity: The Toxicity value for my sample is 0. It means that my model did not generate harmful or offensive outputs.
Fluency: The Fluency value for my sample is 0.91. It means that the evaluated text is well written. 1 is the highest fluency value.
Sentiment: Sentiment ranges from -1 (negative) to 1 (positive). It measures the emotion expressed in a text.
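Each value above is an average over all interactions in the sample. As a rough illustration of that aggregation, here is a small Python sketch that averages per-interaction property scores. The input format and the function name are assumptions made for this example, and missing scores are simply skipped.

```python
from statistics import mean


def average_properties(interactions):
    """Average each property over the interactions that have a score for it.

    `interactions` is a list of dicts mapping property name to score (or None).
    """
    collected = {}
    for properties in interactions:
        for name, value in properties.items():
            if value is not None:
                collected.setdefault(name, []).append(value)
    return {name: mean(values) for name, values in collected.items()}


# Hypothetical per-interaction scores, not taken from my sample.
scores = [
    {"Relevance": 0.8, "Fluency": 1.0},
    {"Relevance": 0.9, "Fluency": None},
]
averages = average_properties(scores)
```

A single low average, like my Grounded in Context score, is a hint to look at the individual interactions that pull it down.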

Topics

Topics help you understand what users are asking about and identify the topics on which the model underperforms.

The topics in my sample:

I can filter through interactions associated with a certain topic. Below are some of the interactions associated with Subscription management and Server-Driven UI topics.

Segments

Deepchecks helps identify weak segments within data. With Segments, I can easily see areas where my LLM underperforms.
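One simple way to think about a weak segment is a group of interactions with an unusually high share of bad annotations. The Python sketch below ranks groups that way. It is only an illustration under that assumption: the field names, the function name, and the ranking rule are hypothetical and do not describe how Deepchecks computes its segments.

```python
from collections import defaultdict


def weakest_segments(interactions, key, top=3, min_size=1):
    """Rank segments by their share of "bad" annotations, worst first.

    Returns a list of (segment, bad_share, size) tuples.
    """
    groups = defaultdict(list)
    for item in interactions:
        groups[item[key]].append(item["annotation"])
    ranked = []
    for segment, labels in groups.items():
        if len(labels) < min_size:
            continue  # too few interactions to judge this segment
        ranked.append((segment, labels.count("bad") / len(labels), len(labels)))
    # Highest bad share first; larger segments win ties.
    ranked.sort(key=lambda row: (-row[1], -row[2]))
    return ranked[:top]


# Hypothetical data with a placeholder "topic" field.
data = [
    {"topic": "A", "annotation": "bad"},
    {"topic": "A", "annotation": "bad"},
    {"topic": "A", "annotation": "good"},
    {"topic": "B", "annotation": "good"},
    {"topic": "B", "annotation": "good"},
]
ranking = weakest_segments(data, key="topic")
```

The min_size parameter is there because a segment with one or two interactions can look very weak by chance.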

Below are the weak segments in my sample.

Versions

I can add new versions of my data for comparison. Comparing versions shows where a change helped or hurt, which in turn helps improve the overall performance of the model.
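A version comparison boils down to looking at how each property average moved between two runs. Here is a minimal Python sketch of that idea; the function name and the example numbers are hypothetical, and only properties present in both versions are compared.

```python
def compare_versions(baseline, candidate):
    """Return candidate minus baseline for every property both versions share."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {name: candidate[name] - baseline[name] for name in shared}


# Hypothetical property averages for two versions of an application.
v1 = {"Relevance": 0.80, "Fluency": 0.90}
v2 = {"Relevance": 0.85, "Fluency": 0.88, "Toxicity": 0.0}
deltas = compare_versions(v1, v2)
```

A positive delta is an improvement for properties where higher is better, such as Relevance, and a regression for properties where lower is better, such as Toxicity.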

Samples Generation

I can increase my Golden Set coverage by generating automatic user inputs. To do this, click Generate Data in the dashboard, and add the guidelines and application description. Including a relevant link or file will help boost context.

Conclusion

Continuous evaluation of your LLM pipeline will help your LLM-based application perform optimally. By continuously validating your models throughout their lifecycle, you can catch issues like deviation and anomalies early, prevent harmful output, and comply with AI regulations.

The Deepchecks Evaluation system is a quality tool to monitor your LLM pipeline and identify areas where your model underperforms so that you can optimize and deliver quality to your users. You can monitor the performance of your LLM from the dashboard and gain insight into how well your model understands and responds to user inputs. The system scores your data to help you understand the overall performance of your pipeline and how you can improve it.

The benefits of continuous LLM evaluation are enormous. Product owners can deliver quality to their users and ensure their models conform to industry ethics by continuously monitoring and validating their pipeline from testing to production using Deepchecks LLM Evaluation.

I strongly recommend Deepchecks LLM Evaluation Solution for ML engineers, managers, and product owners looking to optimize their LLM pipeline.
