Ayaka Hara

Posted on Jan 25

How to Evaluate a PDF Chatbot Response with Prompt Flow

#azure #openai #rag

Using Azure AI Studio's Prompt Flow, we've managed to easily implement a chatbot that can answer questions about PDF documents. However, it's crucial to verify whether this chatbot is accurately extracting and providing answers from the PDFs. In this blog, as the next step, we'll delve into preparing test data and conducting an evaluation of the chatbot. This process will help us determine the accuracy of its responses, ensuring that the chatbot is effectively serving its intended purpose.

This post is a part of a series that serves as a step-by-step guide to developing a chatbot with RAG:

Step 1 : How to Easily Build a PDF Chatbot with RAG (Retrieval-Augmented Generation) Using Azure AI Studio's Prompt Flow
Step 2 : How to Evaluate a PDF Chatbot Response with Prompt Flow ← YOU ARE HERE!
Step 3 : How to Deploy a PDF Chatbot as a REST Endpoint and Test with Postman

Prerequisites
1. Prepare test data
2. Add test data on Azure AI Studio
3. Evaluate a flow
4. Check the evaluation result
Conclusion

Prerequisites

You have already created a flow (see How to Easily Build a PDF Chatbot with RAG (Retrieval-Augmented Generation) Using Azure Prompt Flow)

1. Prepare test data

Prepare test data based on the PDF data used to create the chatbot. Here, as an example, we will prepare a CSV file containing questions about the azure-search-openai-demo sample data together with the expected correct answers.

Ideally, you would prepare between 50 and 100 test samples, and up to 200 if possible. For this trial, however, we will start with 5 samples.

Note: choose an appropriate encoding when saving the file; otherwise Japanese and other non-ASCII characters may be garbled.
file format : e.g. CSV UTF-8



question,chat_history,answer,context
What is PerksPlus?,[],It's the ultimate benefits program designed to support the health and wellness of employees.,
What are not covered under the PerksPlus?,[],"Non-fitness related expenses, Medical treatments and procedures, Travel expenses (unless related to a fitness program), and Food and supplements.",
What is Contoso Electronics?,[],"It's a leader in the aerospace industry, providing advanced electronic components for both commercial and military aircraft.",
What is Northwind Health Plus?,[],"It's a comprehensive plan that provides comprehensive coverage for medical, vision, and dental services.",
How much does it cost for one employee to enroll in Northwind Health Plus?,[],$55.00,
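If you would rather generate the file with a script than save it from a spreadsheet, the following is a minimal sketch using only Python's standard library. It assumes that "CSV UTF-8" means UTF-8 with a byte order mark (Python's `utf-8-sig` codec), which is what spreadsheet tools usually write under that name. The function name `write_test_data` and the file name `test_data.csv` are made up for this example.

```python
import csv
import json

# Column order matches the header used in this post.
FIELDNAMES = ["question", "chat_history", "answer", "context"]

# (question, expected answer) pairs taken from the sample rows above.
SAMPLES = [
    ("What is PerksPlus?",
     "It's the ultimate benefits program designed to support the health and wellness of employees."),
    ("What are not covered under the PerksPlus?",
     "Non-fitness related expenses, Medical treatments and procedures, "
     "Travel expenses (unless related to a fitness program), and Food and supplements."),
    ("What is Contoso Electronics?",
     "It's a leader in the aerospace industry, providing advanced electronic "
     "components for both commercial and military aircraft."),
    ("What is Northwind Health Plus?",
     "It's a comprehensive plan that provides comprehensive coverage for "
     "medical, vision, and dental services."),
    ("How much does it cost for one employee to enroll in Northwind Health Plus?",
     "$55.00"),
]


def write_test_data(path, samples):
    """Write (question, expected answer) pairs as a UTF-8 CSV with a BOM."""
    # "utf-8-sig" prepends a byte order mark; newline="" lets the csv module
    # control line endings and quoting of answers that contain commas.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        for question, answer in samples:
            writer.writerow({
                "question": question,
                "chat_history": json.dumps([]),  # empty history, written as []
                "answer": answer,
                "context": "",
            })


if __name__ == "__main__":
    write_test_data("test_data.csv", SAMPLES)  # hypothetical output file name
```

Answers that contain commas are quoted automatically by the `csv` module, so there is no need to add the quotes by hand.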

2. Add test data on Azure AI Studio

Before starting the evaluation, test data should be registered on Azure AI Studio.
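Before uploading, it can be worth sanity-checking the file locally. The sketch below is not part of Azure AI Studio; `validate_test_data` is a hypothetical helper that only checks the header used in this post, that every row has a question and an expected answer, and (as an assumption based on the `[]` values above) that `chat_history` parses as a JSON list.

```python
import csv
import io
import json

EXPECTED_COLUMNS = ["question", "chat_history", "answer", "context"]


def validate_test_data(csv_text):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        problems.append(f"unexpected header: {reader.fieldnames}")
        return problems
    for row in reader:
        line_no = reader.line_num  # line of the row that was just read
        if not (row["question"] or "").strip():
            problems.append(f"line {line_no}: empty question")
        if not (row["answer"] or "").strip():
            problems.append(f"line {line_no}: empty expected answer")
        try:
            history = json.loads(row["chat_history"] or "")
        except json.JSONDecodeError:
            history = None
        if not isinstance(history, list):
            problems.append(f"line {line_no}: chat_history is not a JSON list")
    return problems
```

To use it, read the file with the same encoding it was saved in, for example `open("test_data.csv", encoding="utf-8-sig").read()`, and pass the text to `validate_test_data`. Reading with `utf-8-sig` strips the byte order mark, so the first column name is compared correctly.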

Move to the "Data" tab and click "New data".

Next, select "Upload files/folders" and upload the test data from your local machine.

Then, enter a data name and create the data.

Once the test data has been uploaded correctly, you can see its details.

3. Evaluate a flow

Now, we're ready to start evaluating a flow.

Move to the "Evaluation" tab and click "New evaluation".

Enter an evaluation name and select "Question and answering pairs" as the evaluation scenario.

Next, select the flow you want to evaluate.

Select the metrics, for instance, Groundedness, Relevance, Coherence, and GPT similarity.
For more details on the metrics, see: built-in evaluation metrics

Select the test data to evaluate. Since we have already registered test data, select "Use existing dataset".

Change the dataset mapping. The answer should be what comes out of the flow, so map answer to the flow's output and ground_truth to the expected answer in the test data.
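To make the idea of the mapping concrete, here is a small sketch of how such a mapping pairs each evaluation input with either a dataset column or a flow output. The `${data.<column>}` and `${run.outputs.<name>}` reference style follows the convention Prompt Flow uses for column mappings; the flow output name `answer` and the helper `resolve_mapping` are assumptions made for this example, and the code only imitates the lookup rather than reproducing what the service does.

```python
import re

# Hypothetical mapping: "answer" comes from the flow, "ground_truth" from the CSV.
COLUMN_MAPPING = {
    "question": "${data.question}",
    "answer": "${run.outputs.answer}",   # flow output (output name is assumed)
    "ground_truth": "${data.answer}",    # expected answer column in the test data
}

# Accepts only the two reference forms used above.
_REFERENCE = re.compile(r"^\$\{(data|run\.outputs)\.(\w+)\}$")


def resolve_mapping(mapping, data_row, flow_outputs):
    """Look up each mapped reference in the dataset row or the flow outputs."""
    resolved = {}
    for target, reference in mapping.items():
        match = _REFERENCE.match(reference)
        if match is None:
            raise ValueError(f"unsupported reference: {reference}")
        source, field = match.groups()
        values = data_row if source == "data" else flow_outputs
        resolved[target] = values[field]
    return resolved
```

The point to notice is that the CSV column named answer ends up as ground_truth, while the evaluator's answer input is filled from the flow. Swapping the two would compare the expected answers with themselves.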

Lastly, review the evaluation configuration and submit it.

4. Check the evaluation result

Once the evaluation has finished, its status is shown as completed.

The result contains metrics scores and detailed metrics results.

Metrics scores
- Coherence : The measure evaluates the coherence and naturalness of the generated text. It measures how well the language model can produce output that flows smoothly, reads naturally, and resembles human-like language.
- Similarity : Similarity is a measure that quantifies the similarity between a ground truth sentence (or document) and the prediction sentence generated by an AI model. It is calculated by first computing sentence-level embeddings using the embeddings API for both the ground truth and the model's prediction. These embeddings represent high-dimensional vector representations of the sentences, capturing their semantic meaning and context.
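The description above stops at the embeddings themselves. One common way to compare two embedding vectors is cosine similarity, sketched below with plain Python lists. This is an assumption about the comparison step for illustration only; it does not call the embeddings API, and how the service turns such a comparison into the score shown in the results is not covered here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real embeddings (hypothetical values).
ground_truth_vec = [0.2, 0.7, 0.1]
prediction_vec = [0.25, 0.6, 0.15]
print(cosine_similarity(ground_truth_vec, prediction_vec))
```

Vectors pointing in the same direction give a value of 1, and unrelated (orthogonal) vectors give 0, which is why the measure is a natural fit for comparing the meaning of two sentences.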

The Similarity bar graph shows that one of the five test samples received the lowest score, 1.

A closer look at the detailed metrics results shows that the expected answer to the question "How much does it cost for one employee to enroll in Northwind Health Plus?" is "$55.00", but the chatbot did not return that answer.

In the actual PDF, this price appears in a table, and the chatbot could not extract it from there for its response. (Ref : Benefit_Options.pdf)
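With only 5 samples the weak case is easy to spot by eye, but with 50 to 200 samples a small script helps. The sketch below assumes the detailed results are available as a list of rows (for example, exported and loaded with `csv.DictReader`); the column name `gpt_similarity` and the helper `lowest_scoring_rows` are hypothetical, and the rows are placeholders rather than the results from this post.

```python
def lowest_scoring_rows(rows, metric):
    """Return every row that shares the minimum value of `metric`."""
    # Skip rows where the metric is missing or empty.
    scored = [row for row in rows if row.get(metric) not in (None, "")]
    if not scored:
        return []
    lowest = min(float(row[metric]) for row in scored)
    return [row for row in scored if float(row[metric]) == lowest]


# Illustrative placeholder rows only; these are not the results from this post.
rows = [
    {"question": "sample question A", "gpt_similarity": 4},
    {"question": "sample question B", "gpt_similarity": 1},
    {"question": "sample question C", "gpt_similarity": 5},
]
for row in lowest_scoring_rows(rows, "gpt_similarity"):
    print(row["question"], row["gpt_similarity"])
```

Reviewing the lowest-scoring rows first is a quick way to find cases like the table lookup above, where the source document contains the answer but the chatbot does not surface it.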

Conclusion

In this blog post, we've shared how to prepare test data and conduct an evaluation of the chatbot, a necessary step to confirm the accuracy of its answers. Having established this, our next focus will be on deploying the chatbot, developed using Azure AI Studio's Prompt Flow, as a REST endpoint. Next in our series, "How to Deploy a PDF Chatbot as a REST Endpoint and Test with Postman", we will then explore how to test this deployment using Postman.
