Jeffrey Ip for Confident AI

Posted on • Updated on • Originally published at confident-ai.com

How to build a PDF QA chatbot using OpenAI and ChromaDB 🤗

TL;DR

In this article, you'll learn how to build a RAG based chatbot to chat with any PDF of your choice so you can achieve your lifelong dream of talking to PDFs 😏 In the end, I'll also show how you can test what you've built ✅

I know, I wrote something similar in my last article on building a customer support chatbot 😅 but this week we're going to dive deep into how to use the raw OpenAI API to chat with PDF data (including text trapped in visuals like tables) stored in ChromaDB, as well as how to use Streamlit to build the chatbot UI.

A small request 🙏🏻

I'm trying to get DeepEval to 5k stars by the end of 2023, can you please help me out by starring my repo? It helps me create more weekly high quality content ❤️ thank you very very much!

https://github.com/confident-ai/deepeval

Introducing RAG, Vector Databases, and OCR

Before we dive into the code, let's demystify what we're going to implement 🕵️ To begin, OCR (Optical Character Recognition) is a computer vision technology that recognizes the characters in a document and converts them into text. This is particularly helpful for tables and charts in documents 😬 We'll be using the OCR provided by Azure Cognitive Services in this tutorial.

Once text chunks are extracted using OCR, they are converted into high-dimensional vectors (aka vectorized) using an embedding model such as Word2Vec, FastText, or BERT (in this tutorial we'll use OpenAI's text-embedding-ada-002). These vectors, which capture the semantic meaning of the text, are then indexed in a vector database. We'll be using ChromaDB as our in-memory vector database 🥳

Now, let's see what happens when a user asks their PDF something. First, the user query is vectorized using the same embedding model used to vectorize the extracted PDF text chunks. Then, the top K most semantically similar text chunks are fetched by searching through the vector database, which, remember, contains the text chunks from our PDF. The retrieved text chunks are then provided as context for ChatGPT to generate an answer based on the information in the PDF. This process is called retrieval-augmented generation (RAG).
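To make the retrieval step concrete, here is a minimal, dependency-free sketch. It is a toy under stated assumptions: the `embed` function is a simple bag-of-words counter standing in for a real embedding model, and `top_k` and the sample chunks are hypothetical names and data, not part of the app we build below.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query, most similar first."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


chunks = [
    "The invoice total is 420 dollars",
    "Shipping takes five business days",
    "The invoice is due on the first of March",
]
print(top_k("when is the invoice due", chunks, k=2))
```

A vector database like ChromaDB does the same job, but with learned embeddings and an index that scales beyond a handful of chunks.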

Feeling educated? 😊 Let's begin.

Project Setup

First, I'm going to guide you through how to set up your project folders and any dependencies you need to install.

Create a project folder and a Python virtual environment by running the following commands:

mkdir chat-with-pdf
cd chat-with-pdf
python3 -m venv venv
source venv/bin/activate

Your terminal prompt should now start with something like this:

(venv)

Installing dependencies

Run the following command to install the OpenAI API client, ChromaDB, the Azure Form Recognizer SDK, Streamlit, and tabulate:

pip install openai chromadb azure-ai-formrecognizer streamlit tabulate

Let's briefly go over what each of those packages does:

  • streamlit - sets up the chat UI, which includes a PDF uploader (thank god 😌)
  • azure-ai-formrecognizer - extracts textual content from PDFs using OCR
  • tabulate - formats the extracted table data as Markdown tables
  • chromadb - is an in-memory vector database that stores the extracted PDF content
  • openai - we all know what this does (receives relevant data from chromadb and returns a response based on your chatbot input)

Next, create a new main.py file, which will be the entry point to your application:

touch main.py

Getting your API keys

Lastly, get your OpenAI and Azure API keys ready (click the hyperlinks to get them if you don't already have them).

Note: It's pretty troublesome to sign up for an account on Azure Cognitive Services. You'll need a card (although they won't charge you automatically) and a phone number 😔 but do give it a try if you're trying to build something serious!
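The snippets in this tutorial paste keys directly into the code for simplicity. If you'd rather keep them out of your source files, one option is to read them from environment variables. The sketch below is only a suggestion: the helper `load_api_keys` and the variable names `AZURE_API_KEY` and `AZURE_COGNITIVE_ENDPOINT` are assumptions made for illustration.

```python
import os


def load_api_keys(env=os.environ):
    """Read credentials from environment variables instead of hardcoding them.

    The variable names below are illustrative choices, not required by any SDK.
    """
    required = ["OPENAI_API_KEY", "AZURE_API_KEY", "AZURE_COGNITIVE_ENDPOINT"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in required}


# Example usage in main.py, after exporting the variables in your shell:
#     keys = load_api_keys()
#     credential = AzureKeyCredential(keys["AZURE_API_KEY"])
```

Failing fast with a clear message is usually friendlier than an authentication error halfway through an upload.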

Building the Chatbot UI with Streamlit

Streamlit is an easy way to build frontend applications using Python.

Let's import streamlit and set up everything else we'll need:

import streamlit as st
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential
from tabulate import tabulate
from chromadb.utils import embedding_functions
import chromadb
import openai

# You'll need this client later to store PDF data
client = chromadb.Client()
client.heartbeat()

Give our chat UI a title and create a file uploader:

...
st.write("# Chat with PDF")

uploaded_file = st.file_uploader("Choose a PDF file", type="pdf")
...

Check whether uploaded_file is set. Streamlit reruns the script whenever you upload a file, at which point uploaded_file is no longer None:

...
if uploaded_file is not None:
    # Create a temporary file to write the bytes to
    with open("temp_pdf_file.pdf", "wb") as temp_file:
        temp_file.write(uploaded_file.read())
...
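The snippet above always writes to the same temp_pdf_file.pdf. If you want a uniquely named file instead, for example when several sessions might upload at once, a variation using Python's tempfile module could look like this. The helper name `save_upload_to_temp_pdf` is hypothetical and not part of the tutorial's code.

```python
import os
import tempfile


def save_upload_to_temp_pdf(data: bytes) -> str:
    """Write uploaded bytes to a uniquely named temporary .pdf file and return its path."""
    # delete=False keeps the file on disk after the with-block so it can be reopened later
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as temp_file:
        temp_file.write(data)
        return temp_file.name


# In the Streamlit app this might be called as:
#     pdf_path = save_upload_to_temp_pdf(uploaded_file.read())
# and cleaned up with os.remove(pdf_path) once the file has been processed.
```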

View your streamlit app by running main.py (we'll implement the chat input UI later):

streamlit run main.py

That's the easy part done 🥳! Next comes the not so easy part...

Extracting text from PDFs

Carrying on from the previous code snippet (we're still inside the if uploaded_file is not None block, hence the indentation), we're going to send the temporary PDF file to Azure Cognitive Services for OCR:

    ...
    # you can set this up in the azure cognitive services portal
    AZURE_COGNITIVE_ENDPOINT = "your-custom-azure-api-endpoint"
    AZURE_API_KEY = "your-azure-api-key"
    credential = AzureKeyCredential(AZURE_API_KEY)
    AZURE_DOCUMENT_ANALYSIS_CLIENT = DocumentAnalysisClient(AZURE_COGNITIVE_ENDPOINT, credential)

    # Open the temporary file in binary read mode and pass it to Azure
    with open("temp_pdf_file.pdf", "rb") as f:
        poller = AZURE_DOCUMENT_ANALYSIS_CLIENT.begin_analyze_document("prebuilt-document", document=f)
        doc_info = poller.result().to_dict()
    ...

Here, doc_info is a dictionary containing information on the extracted text chunks. It's a pretty complicated dictionary, so I would recommend printing it out and seeing for yourself what it looks like.
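Printing the whole dictionary can be overwhelming, so a small helper that only lists keys and container sizes may be easier to read. This is a rough sketch: `describe` is a hypothetical helper, and `sample_doc_info` is a made-up miniature that only mimics the fields this tutorial relies on (pages, lines, content, page_number, tables), not the full Azure response.

```python
def describe(value, depth=0, max_depth=3):
    """Return indented lines listing the keys and container sizes of a nested structure."""
    lines = []
    if isinstance(value, dict):
        for key, child in value.items():
            label = type(child).__name__
            if isinstance(child, (list, dict)):
                label += f" ({len(child)} items)"
            lines.append("  " * depth + f"{key}: {label}")
            if depth + 1 < max_depth:
                lines.extend(describe(child, depth + 1, max_depth))
    elif isinstance(value, list) and value:
        # Lists are usually homogeneous, so only the first element is described
        lines.extend(describe(value[0], depth, max_depth))
    return lines


# Made-up miniature of the structure, for illustration only
sample_doc_info = {
    "pages": [{"page_number": 1, "lines": [{"content": "Hello"}, {"content": "world"}]}],
    "tables": [],
}
print("\n".join(describe(sample_doc_info)))
```

In the app you would call describe(doc_info) instead of passing the sample.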

Paste in the following to finish processing the data received from Azure:

    ...
    res = []
    CONTENT = "content"
    PAGE_NUMBER = "page_number"
    TYPE = "type"
    RAW_CONTENT = "raw_content"
    TABLE_CONTENT = "table_content"

    # One entry per page: join every OCR'd line into a single text chunk
    for p in doc_info["pages"]:
        entry = {}
        page_content = " ".join([line["content"] for line in p["lines"]])
        entry[CONTENT] = str(page_content)
        entry[PAGE_NUMBER] = str(p["page_number"])
        entry[TYPE] = RAW_CONTENT
        res.append(entry)

    # One entry per table: rebuild it as a Markdown table
    for table in doc_info["tables"]:
        entry = {}
        entry[PAGE_NUMBER] = str(table["bounding_regions"][0]["page_number"])
        cells = table["cells"]

        # Only header cells that span a single column are used as column names
        col_headers = []
        for cell in cells:
            if cell["kind"] == "columnHeader" and cell["column_span"] == 1:
                col_headers.append(cell["content"])

        # Group content cells by row, repeating a cell once per column it spans
        data_rows = [[] for _ in range(table["row_count"])]
        for cell in cells:
            if cell["kind"] == "content":
                for _ in range(cell["column_span"]):
                    data_rows[cell["row_index"]].append(cell["content"])
        # Drop rows without any content cells (e.g. the header row)
        data_rows = [row for row in data_rows if len(row) > 0]

        markdown_table = tabulate(data_rows, headers=col_headers, tablefmt="pipe")
        entry[CONTENT] = markdown_table
        entry[TYPE] = TABLE_CONTENT
        res.append(entry)
    ...

Here, we accessed various properties of the dictionary returned by Azure to get texts on the page, and data stored in tables. The logic is pretty complex because of all the nested structures 😨 but from personal experience, Azure OCR works well even for complex PDF structures, so I highly recommend giving it a try :)
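To see the table reconstruction in isolation, here is a dependency-free sketch that places cells on a grid and renders a Markdown pipe table by hand. It makes a few assumptions beyond the tutorial's code: each cell also carries a column_index, the table's column count is known, and the first row is the header. The function name `cells_to_markdown` and the sample cells are made up for illustration.

```python
def cells_to_markdown(cells, row_count, column_count):
    """Place each cell at (row_index, column_index) and render a Markdown pipe table.

    Assumes the first row of the grid is the header row.
    """
    grid = [["" for _ in range(column_count)] for _ in range(row_count)]
    for cell in cells:
        grid[cell["row_index"]][cell["column_index"]] = cell["content"]
    if not grid:
        return ""
    header, *body = grid
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Made-up cells, deliberately out of order to show that position comes from the indices
sample_cells = [
    {"row_index": 0, "column_index": 0, "content": "Item"},
    {"row_index": 0, "column_index": 1, "content": "Price"},
    {"row_index": 1, "column_index": 1, "content": "5"},
    {"row_index": 1, "column_index": 0, "content": "Tea"},
]
print(cells_to_markdown(sample_cells, row_count=2, column_count=2))
```

Positioning by index rather than by arrival order keeps columns aligned even if a cell in the middle of a row is missing.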

Storing PDF content in ChromaDB

Still with me? 😅 Great, we're almost there so hang in there!

Paste in the code below to store extracted text chunks from res in ChromaDB.

    ...
    # Drop the collection from any previous upload and reset the chat history
    try:
        client.delete_collection(name="my_collection")
        st.session_state.messages = []
    except Exception:
        print("Hopefully you'll never see this error.")

    openai_ef = embedding_functions.OpenAIEmbeddingFunction(api_key="your-openai-api-key", model_name="text-embedding-ada-002")
    collection = client.create_collection(name="my_collection", embedding_function=openai_ef)

    # Store each chunk along with its page number and content type as metadata
    doc_id = 1
    for entry in res:
        content = entry.get(CONTENT, '')
        page_number = entry.get(PAGE_NUMBER, '')
        type_of_content = entry.get(TYPE, '')

        content_metadata = {
            PAGE_NUMBER: page_number,
            TYPE: type_of_content
        }

        collection.add(
            documents=[content],
            metadatas=[content_metadata],
            ids=[str(doc_id)]
        )
        doc_id += 1
    ...

The try block at the top deletes any existing collection and clears the chat history, so that we can keep uploading PDFs without having to refresh the page.

You might have noticed that we add data to a collection and not to the database directly. A collection in ChromaDB is a vector space. When a user enters a query, the search is performed inside this collection instead of across the entire database. In Chroma, a collection is identified by a unique name, and with a simple line of code, you can add all extracted text chunks to it via collection.add(...).
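Since collection.add(...) accepts lists, the chunks could also be added in a single call rather than one at a time. Here is a minimal sketch of preparing those lists; `build_batch` is a hypothetical helper, and it assumes res has the shape built earlier (dictionaries with content, page_number, and type keys).

```python
def build_batch(res):
    """Turn the list of extracted chunks into keyword arguments for one collection.add call."""
    documents, metadatas, ids = [], [], []
    for index, entry in enumerate(res, start=1):
        documents.append(entry.get("content", ""))
        metadatas.append({
            "page_number": entry.get("page_number", ""),
            "type": entry.get("type", ""),
        })
        ids.append(str(index))
    return {"documents": documents, "metadatas": metadatas, "ids": ids}


# Usage inside the app, where collection is the ChromaDB collection created above:
#     collection.add(**build_batch(res))
```

Whether this is noticeably faster depends on how the embedding function batches requests, so treat it as an option rather than a required change.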

Generating a response using OpenAI

I get asked a lot about how to build a RAG chatbot without relying on frameworks like LangChain and LlamaIndex. Well, here's how you do it: you construct a list of prompts dynamically based on the results retrieved from your vector database.

Paste in the following code to wrap things up:

...
if "messages" not in st.session_state:
    st.session_state.messages = []

# Display chat messages from history on app rerun
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

if prompt := st.chat_input("What do you want to say to your PDF?"):
    # Display your message
    with st.chat_message("user"):
        st.markdown(prompt)
    # Add your message to chat history
    st.session_state.messages.append({"role": "user", "content": prompt})

    # Query ChromaDB with your prompt, taking the top 5 most relevant results. These results are ordered by similarity.
    q = collection.query(
        query_texts=[prompt],
        n_results=5,
    )
    results = q["documents"][0]

    prompts = []
    for r in results:
        # Build one prompt per retrieved text chunk. A separate variable is used so that
        # the user's question in `prompt` is not overwritten between iterations.
        chunk_prompt = "Please extract the following: " + prompt + "  solely based on the text below. Use an unbiased and journalistic tone. If you're unsure of the answer, say you cannot find the answer. \n\n" + r

        prompts.append(chunk_prompt)
    prompts.reverse()

    # openai.ChatCompletion is the pre-1.0 openai SDK interface. It reads your key from
    # openai.api_key or the OPENAI_API_KEY environment variable.
    openai_res = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "assistant", "content": p} for p in prompts],
        temperature=0,
    )

    response = openai_res["choices"][0]["message"]["content"]
    with st.chat_message("assistant"):
        st.markdown(response)

    # Append the response to chat history
    st.session_state.messages.append({"role": "assistant", "content": response})

Notice how we reversed prompts after building it from the text chunks retrieved from ChromaDB. This is because the results returned from ChromaDB are sorted in descending order of relevance, meaning the most relevant text chunk is always first in the results list. However, ChatGPT tends to give more weight to the last prompt in a list of prompts, which is why we reverse it.
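The prompt construction and reversal can also be pulled out into a small function that is easy to check without calling any API. This is a sketch under assumptions: `build_messages` is a hypothetical helper, and the role defaults to "assistant" only to mirror the snippet above (you may prefer to experiment with "user").

```python
INSTRUCTION = (
    " solely based on the text below. Use an unbiased and journalistic tone. "
    "If you're unsure of the answer, say you cannot find the answer. \n\n"
)


def build_messages(question, chunks, role="assistant"):
    """Build one chat message per retrieved chunk, ordered so the most relevant comes last.

    `chunks` is expected most-relevant-first, as returned by the vector search.
    """
    messages = [
        {"role": role, "content": "Please extract the following: " + question + INSTRUCTION + chunk}
        for chunk in chunks
    ]
    messages.reverse()
    return messages


messages = build_messages("What is the total?", ["most relevant chunk", "less relevant chunk"])
print(messages[-1]["content"])
```

Keeping this logic in a pure function also makes it straightforward to unit test the ordering.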

Run the streamlit app and try things out for yourself 😙:

streamlit run main.py

🎉 Congratulations, you made it to the end!

Taking it a step further

As you know, LLM applications are a black box, so for production use cases you'll want to safeguard the performance of your PDF chatbot to keep your users happy. To learn how to build a simple evaluation framework that could get you set up in less than 30 minutes, click here.

Conclusion

In this article, you've learnt:

  • what a vector database is and how to use ChromaDB
  • how to use the raw OpenAI API to build a RAG based chatbot without relying on 3rd party frameworks
  • what OCR is and how to use Azure's OCR services
  • how to quickly set up a beautiful chatbot UI using streamlit, which includes a file uploader.

This tutorial walked you through an example of how you can build a "chat with PDF" application using just Azure OCR, OpenAI, and ChromaDB. With what you've learnt, you can build powerful applications that help increase the productivity of workforces (at least that's the most prominent use case I've come across).

The source code for this tutorial is available here:
https://github.com/confident-ai/blog-examples/tree/main/chat-with-pdf

Thank you for reading!
