DEV Community

Ayaka Hara

How to Evaluate a PDF Chatbot Response with Prompt Flow

Using Azure AI Studio's Prompt Flow, we easily implemented a chatbot that can answer questions about PDF documents. However, we still need to verify that the chatbot accurately extracts and returns answers from the PDFs. In this post, as the next step, we'll prepare test data and run an evaluation of the chatbot. This will help us determine how accurate its responses are and whether the chatbot is serving its intended purpose.

This post is part of a series that serves as a step-by-step guide to developing a chatbot with RAG.

1. Prepare test data

Prepare test data based on the PDF data used to create the chatbot. Here, as an example, we will prepare a CSV file containing questions about the azure-search-openai-demo sample data together with their expected correct answers.

Ideally, you would prepare between 50 and 100 test samples, or up to 200 if possible. For this trial, however, we will start with 5 samples.

Note: choose an appropriate encoding when saving the file; otherwise Japanese and other non-ASCII characters may be garbled.
File format: e.g. CSV UTF-8

Image description

What is PerksPlus?,[],It's the ultimate benefits program designed to support the health and wellness of employees.,
What is not covered under PerksPlus?,[],"Non-fitness related expenses, Medical treatments and procedures, Travel expenses (unless related to a fitness program), and Food and supplements.",
What is Contoso Electronics?,[],"It's a leader in the aerospace industry, providing advanced electronic components for both commercial and military aircraft.",
What is Northwind Health Plus?,[],"It's a comprehensive plan that provides comprehensive coverage for medical, vision, and dental services.",
How much does it cost for one employee to enroll in Northwind Health Plus?,[],$55.00,

2. Add test data on Azure AI Studio

Before starting the evaluation, register the test data in Azure AI Studio.

Go to the "Data" tab and click "New data".

Image description

Next, select "Upload files/folders" and upload the test data from your local machine.

Image description

Image description

Then enter a name for the data and create it.

Image description

Once your test data has been uploaded correctly, you can see the data details, as in the following screenshot.
Image description

3. Evaluate a flow

Now, we're ready to start evaluating a flow.

Go to the "Evaluation" tab and click "New evaluation".

Image description

Enter an evaluation name and, for this example, select "Question and answering pairs" as the evaluation scenario.
Image description

Next, select the flow you want to evaluate.

Image description

Select the metrics, for instance Groundedness, Relevance, Coherence, and GPT similarity.
For more details on the metrics, see: built-in evaluation metrics

Image description

Configure the test data to evaluate. Since we have already registered the test data, select "Use existing dataset".

Image description

Change the dataset mapping. The answer should be the output of the flow, so configure answer and ground_truth as follows.
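To make the idea of the mapping concrete, here is a minimal sketch in plain Python of how such a mapping pairs each evaluation input with either a column of the test data or an output of the flow. It illustrates the concept only and is not how Azure AI Studio implements it. The `${data.…}` and `${run.outputs.…}` reference style follows Prompt Flow's column-mapping notation; the function name, the column names, and the chatbot answer are hypothetical.

```python
import re

# Matches references such as ${data.ground_truth} or ${run.outputs.answer}.
REFERENCE = re.compile(r"^\$\{(data|run\.outputs)\.(\w+)\}$")


def resolve_mapping(mapping, data_row, run_outputs):
    """Return the evaluation inputs for one test row.

    mapping     -- evaluation input name -> reference string
    data_row    -- one row of the uploaded test data (column name -> value)
    run_outputs -- what the flow returned for that row (output name -> value)
    """
    resolved = {}
    for name, reference in mapping.items():
        match = REFERENCE.match(reference)
        if match is None:
            raise ValueError(f"unsupported reference: {reference!r}")
        source = data_row if match.group(1) == "data" else run_outputs
        resolved[name] = source[match.group(2)]
    return resolved


# Hypothetical mapping: the answer comes from the flow, the rest from the data.
mapping = {
    "question": "${data.question}",
    "answer": "${run.outputs.answer}",
    "ground_truth": "${data.ground_truth}",
}
data_row = {
    "question": "What is PerksPlus?",
    "ground_truth": "It's the ultimate benefits program designed to support "
                    "the health and wellness of employees.",
}
run_outputs = {"answer": "PerksPlus is a benefits program."}  # made-up flow output

inputs = resolve_mapping(mapping, data_row, run_outputs)
```

The point of the sketch is that `answer` is never read from the uploaded file: it is produced by the flow at evaluation time and then compared against `ground_truth` from the test data.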
Image description

Lastly, review the evaluation configuration and submit it.

Image description

4. Check the evaluation result

Once the evaluation has finished, it is marked as completed, as shown below.
Image description

The results include metric scores and detailed metric results.

  • Metrics scores
    • Coherence: This measure evaluates the coherence and naturalness of the generated text. It measures how well the language model can produce output that flows smoothly, reads naturally, and resembles human-like language.
    • Similarity: This measure quantifies the similarity between a ground truth sentence (or document) and the prediction sentence generated by an AI model. It is calculated by first computing sentence-level embeddings using the embeddings API for both the ground truth and the model's prediction. These embeddings are high-dimensional vector representations of the sentences, capturing their semantic meaning and context.
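The description above says that embeddings are computed for both sentences but does not say how they are compared or how the result is turned into the score shown in the chart. A common way to compare two embedding vectors is cosine similarity, so the sketch below assumes that; the three-dimensional vectors are toy values standing in for real embeddings, which would come from an embeddings API and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins for embeddings (assumed values, not real model output).
ground_truth_vec = [1.0, 0.0, 1.0]
matching_prediction_vec = [2.0, 0.0, 2.0]   # same direction, different length
unrelated_prediction_vec = [0.0, 1.0, 0.0]  # orthogonal direction

close = cosine_similarity(ground_truth_vec, matching_prediction_vec)
far = cosine_similarity(ground_truth_vec, unrelated_prediction_vec)
```

Under this assumption, vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, regardless of their lengths.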

The Similarity bar chart shows that one of the five test samples received the lowest score, 1.

Image description

A closer look at the detailed metrics results shows that the expected answer to the question "How much does it cost for one employee to enroll in Northwind Health Plus?" was "$55.00". However, the chatbot did not return that answer.
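With only five samples it is easy to spot the weak row by eye, but with the 50 to 200 samples recommended earlier a small script helps. The sketch below assumes the detailed results have been exported to a list of rows; the column name `gpt_similarity`, the function name, and every score other than the 1 reported above are invented for illustration.

```python
def lowest_scoring(results, metric="gpt_similarity"):
    """Return every row that shares the minimum value of the given metric."""
    if not results:
        return []
    lowest = min(row[metric] for row in results)
    return [row for row in results if row[metric] == lowest]


# Hypothetical export of the detailed metrics results. Only the score of 1
# for the pricing question comes from the evaluation described above; the
# other scores are placeholders.
results = [
    {"question": "What is PerksPlus?", "gpt_similarity": 4},
    {"question": "What is Contoso Electronics?", "gpt_similarity": 5},
    {"question": "How much does it cost for one employee to enroll in "
                 "Northwind Health Plus?", "gpt_similarity": 1},
]

worst = lowest_scoring(results)
for row in worst:
    print(f"{row['gpt_similarity']}: {row['question']}")
```

Rows found this way are the ones worth checking against the source PDF.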

Image description

Checking the actual PDF data shows that this price appears in a table, and the chatbot could not extract it from there to produce a response. (Ref: Benefit_Options.pdf)

Image description


In this blog post, we've shared how to prepare test data and evaluate the chatbot, a necessary step in confirming the accuracy of its answers. Our next focus is deploying the chatbot, developed using Azure AI Studio's Prompt Flow, as a REST endpoint. The next post in the series, "How to Deploy a PDF Chatbot as a REST Endpoint and Test with Postman", covers the deployment and how to test it using Postman.
