Shreyansh Jain
Evaluation of OpenAI Assistants

Recently, I had an interesting call with a user who wanted to evaluate the performance of their OpenAI Assistants.

The Use Case:

  • Similar to a RAG pipeline, the user built an Assistant to answer medical queries about diseases and medicines.
  • They provided a prompt to instruct the Assistant and a set of files containing supporting information from which it was required to generate responses (a minimal setup sketch follows this list).
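For context, the setup looks roughly like the following. This is a minimal sketch using the OpenAI Python SDK's beta Assistants API, not the exact code from the notebook: the model, file name, and instructions are illustrative, and the exact retrieval tool and file-attachment parameters depend on which API version you are on.

```python
# Minimal sketch (illustrative, not the notebook's exact code): create an
# Assistant that answers from a prompt plus supporting files.
# Assumes the OpenAI Python SDK >= 1.x; newer API versions use a "file_search"
# tool with vector stores instead of "retrieval" + file_ids.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a supporting document (hypothetical file name).
doc = client.files.create(
    file=open("malaria_treatment_guidelines.pdf", "rb"),
    purpose="assistants",
)

# Create the Assistant with instructions and access to the uploaded file.
assistant = client.beta.assistants.create(
    name="Medical QA Assistant",
    instructions=(
        "Answer medical queries about diseases and medicines using only the "
        "attached supporting documents. Be concise, complete, and polite."
    ),
    model="gpt-4-turbo",
    tools=[{"type": "retrieval"}],
    file_ids=[doc.id],
)
```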

Challenges Faced:

  • They had to mock conversations with the chatbot while acting as different personas (e.g., a patient with malaria), which was time-consuming for more than 100 personas.
  • After mocking the conversations, they had to manually rate individual responses on parameters such as whether the response was grounded in the supporting documents, concise, complete, and polite.

Solution Developed:

  • Simulating conversations: the tool mocks conversations with the Assistant based on user personas (e.g., "A patient asking about the treatment of malaria").
  • Evaluating the OpenAI Assistant: the tool rates each conversation on parameters like user satisfaction, factual groundedness, relevance, etc., using UpTrain's pre-configured metrics (20+ metrics covering use cases such as response quality, tonality, grammar, and more). A sketch of both steps follows this list.
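To make the two steps concrete, here is a rough sketch of one simulated turn and its evaluation. It is not the exact code from the notebook: the persona text, placeholder assistant ID, context string, and the particular UpTrain checks are illustrative assumptions.

```python
# Sketch of (1) simulating a persona turn against an Assistant and
# (2) scoring the exchange with UpTrain's pre-configured checks.
import time
from openai import OpenAI
from uptrain import EvalLLM, Evals

client = OpenAI()
ASSISTANT_ID = "asst_..."  # the Assistant created earlier (placeholder id)

# --- Step 1: simulate one persona turn -------------------------------------
persona = "A patient asking about the treatment of malaria"

# Use a plain chat completion to turn the persona into a user message.
user_turn = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{
        "role": "system",
        "content": f"You are role-playing this persona: {persona}. "
                   "Write the next question you would ask a medical chatbot.",
    }],
).choices[0].message.content

# Send the simulated question to the Assistant via a thread + run.
thread = client.beta.threads.create()
client.beta.threads.messages.create(thread_id=thread.id, role="user", content=user_turn)
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=ASSISTANT_ID)
while run.status not in ("completed", "failed", "expired"):
    time.sleep(2)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)
assistant_reply = (
    client.beta.threads.messages.list(thread_id=thread.id)
    .data[0].content[0].text.value
)

# --- Step 2: score the exchange with UpTrain --------------------------------
eval_llm = EvalLLM(openai_api_key="sk-...")  # UpTrain uses an LLM as the grader
results = eval_llm.evaluate(
    data=[{
        "question": user_turn,
        "context": "Relevant excerpts from the supporting documents go here.",
        "response": assistant_reply,
    }],
    checks=[
        Evals.FACTUAL_ACCURACY,       # grounded in the supporting documents?
        Evals.RESPONSE_RELEVANCE,     # does it answer the persona's question?
        Evals.RESPONSE_CONCISENESS,   # is it concise?
        Evals.RESPONSE_COMPLETENESS,  # is it complete?
    ],
)
print(results)
```

In practice the simulation loops over many personas and multiple turns per conversation, and the per-metric scores can then be aggregated instead of rating every response by hand.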

I am currently seeking feedback on the tool. I would love it if you could check it out at: https://github.com/uptrain-ai/uptrain/blob/main/examples/assistants/assistant_evaluator.ipynb
