Sasmitha Manathunga

Posted on Feb 8, 2023 • Updated on Jun 10, 2023

Build a Document QA App in 3 Simple Steps with Langchain and Streamlit

#gpt3 #ai #machinelearning #openai

In this tutorial, we'll be building an AI-powered document QA web app using Python.

With just a few lines of code, you'll have a working document QA app that you can use to extract information from any PDF.

You can find the source code here.

So let's get started!

1. Set up your environment

I highly recommend that you use a package/environment management tool so that the external dependencies you're using won't affect any of your existing projects.

We'll be using the built-in venv module to create virtual environments.

First, open your terminal and create a virtual environment.

python -m venv venv

and activate it. The command below is for Windows; on macOS or Linux, run source venv/bin/activate instead:

venv\Scripts\activate

Now, let's install the required dependencies:

pip install streamlit pypdf openai faiss-cpu langchain==0.0.77

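Finally, we'll need to set an environment variable for the OpenAI API key. The command below is for the Windows command prompt; on macOS or Linux, use export OPENAI_API_KEY=<YOUR_API_KEY> instead: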

set OPENAI_API_KEY=<YOUR_API_KEY>

You can get an API key here.
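
If the variable is missing, the problem usually only shows up later as an API error. As a minimal sketch, a small check at startup can fail early with a clearer message. The helper name require_api_key is hypothetical and not part of the original app:

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key found in `env`, or raise a clear error if it is missing."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; set it before starting the app")
    return key
```

Calling require_api_key() near the top of your script would stop the app before any PDF is parsed if the key is absent.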

Now that we're all set, let's start coding our app!

2. Create a QA chain with langchain

Create a file named utils.py, where we'll write the functions for parsing PDFs, creating a vector store, and answering questions.

First, let's import the required dependencies:

from pypdf import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.question_answering import load_qa_chain
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores.faiss import FAISS
import streamlit as st

Then, we'll add a function to parse PDFs:

def parse_pdf(file):
    pdf = PdfReader(file)
    output = []
    for page in pdf.pages:
        text = page.extract_text()
        output.append(text)

    return "\n\n".join(output)

We can't fit the whole document inside the prompt since GPT-3 has a limited context window. So we'll have to:

Split the document into smaller chunks
Embed those chunks in a special database called a vector store, which allows us to fetch only the relevant passages for a question by doing a semantic search

def embed_text(text):
    """Split the text and embed it in a FAISS vector store"""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=800, chunk_overlap=0, separators=["\n\n", ".", "?", "!", " ", ""]
    )
    texts = text_splitter.split_text(text)

    embeddings = OpenAIEmbeddings()
    index = FAISS.from_texts(texts, embeddings)

    return index

The RecursiveCharacterTextSplitter recursively tries to split the document by the given separators. Note that the order of the separators is important: it first tries to split the document by \n\n, then by ., and so on, only falling back to the next separator when a piece is still too long.
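
To make the idea concrete, here is a simplified sketch of recursive splitting in plain Python. It is not LangChain's actual implementation: the function name recursive_split is hypothetical, there is no chunk overlap, and the separator is dropped at chunk boundaries.

```python
def recursive_split(text, chunk_size, separators):
    """Split `text` into chunks of at most `chunk_size` characters.

    Separators are tried in order; a finer separator is only used for
    pieces that are still too long after splitting on a coarser one.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


chunks = recursive_split("aaaa\n\nbbbb. cccc dddd", 10, ["\n\n", ".", " ", ""])
```

In this toy input, the paragraph break is tried first, and only the second paragraph, which is still longer than 10 characters, gets split again on the period.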

Finally, let's write a function to search the index and pass the relevant passages to GPT for question answering:

def get_answer(index, query):
    """Returns answer to a query using langchain QA chain"""

    docs = index.similarity_search(query)

    chain = load_qa_chain(OpenAI(temperature=0))
    answer = chain.run(input_documents=docs, question=query)

    return answer
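
Conceptually, get_answer() does two things: it ranks the chunks by similarity to the question, and it places the best ones in a prompt (by default, load_qa_chain uses the "stuff" chain type, which puts all retrieved documents into a single prompt). The sketch below imitates both steps with the standard library only. Everything in it is a toy assumption: the bag-of-words embed() stands in for OpenAI embeddings, the linear scan in top_k() stands in for FAISS, and the prompt wording is made up.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts (real embeddings are dense float vectors)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(chunks, query, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)[:k]


def build_prompt(chunks, query):
    """Place the retrieved passages and the question in a single prompt."""
    context = "\n\n".join(chunks)
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}\nAnswer:"


chunks = [
    "The warranty lasts two years.",
    "Shipping takes five days.",
    "Returns are accepted within thirty days.",
]
question = "How long is the warranty?"
best = top_k(chunks, question, k=1)
prompt = build_prompt(best, question)
```

Only the passages returned by the search end up in the prompt, which is what keeps the request within the model's context window.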

Now, let's create a simple UI for our app.

3. Build the web app with Streamlit

Streamlit makes it easy to create web apps using Python in minutes.

First, create a file named app.py and import Streamlit and the functions we made earlier.

import streamlit as st
from utils import parse_pdf, embed_text, get_answer

Now, the whole UI can be created with just a couple of lines:

st.header("Doc QA")
uploaded_file = st.file_uploader("Upload a pdf", type=["pdf"])

if uploaded_file is not None:
    index = embed_text(parse_pdf(uploaded_file))
    query = st.text_area("Ask a question about the document")
    button = st.button("Submit")
    if button:
        st.write(get_answer(index, query))

That's it🎉 Now, open up your terminal and run:

streamlit run app.py

and see your document QA app in action.

Optimizing the app

Let's see how we can optimize the app and make our life easier.

Caching the results

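You'll notice the app will run embed_text() and parse_pdf() each time we ask a question, because Streamlit reruns the whole script on every interaction. To fix this, we can cache the results. An easy way to do this is the @st.cache decorator. (Newer Streamlit releases deprecate st.cache in favor of st.cache_data and st.cache_resource.)

In utils.py, add the decorator just above each function definition: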

@st.cache
def parse_pdf(file):
    ...  # existing function body stays unchanged

@st.cache
def embed_text(text):
    ...  # existing function body stays unchanged

Now, save the file and rerun the app. You'll see that after the first question, subsequent ones are answered faster.
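
If you're curious why this helps, the idea behind such a cache can be sketched in a few lines: hash the input, and only call the function when that hash hasn't been seen before. The decorator below is a hypothetical stand-in written for illustration, not Streamlit's implementation, and expensive_parse is a made-up placeholder for a slow function like parse_pdf.

```python
import functools
import hashlib

_cache = {}


def cached_by_content(func):
    """Memoize a function of one bytes argument, keyed by a hash of its content."""
    @functools.wraps(func)
    def wrapper(data: bytes):
        key = (func.__name__, hashlib.sha256(data).hexdigest())
        if key not in _cache:
            _cache[key] = func(data)  # only computed on a cache miss
        return _cache[key]
    return wrapper


calls = []  # records each real invocation, for demonstration only


@cached_by_content
def expensive_parse(data: bytes) -> str:
    calls.append(len(data))
    return data.decode("utf-8").upper()


first = expensive_parse(b"abc")
second = expensive_parse(b"abc")  # served from the cache, no second call
```

The same document content maps to the same key, so the expensive work is done once even though the script itself runs again on every interaction.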

Managing secrets

In step 1, we set the OpenAI API key using the command line, which can be cumbersome to type in every time we run the app using a new terminal. So let's load the API key from a file:

Create a directory called .streamlit at the root of your app.
Inside it, create a file named secrets.toml and add the following:

OPENAI_API_KEY = "<YOUR_API_KEY>"

Put the API key inside double quotes.

Streamlit also exposes root-level secrets as environment variables, so the app picks up the key automatically and you no longer need to type it in every time you open a new terminal.

Important: If you're using git, make sure to add secrets.toml to your .gitignore file before committing.
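
If you'd rather read the key explicitly in your code, one option is a small lookup that prefers the secrets file and falls back to the environment. This is a sketch under assumptions: resolve_api_key is a hypothetical helper, and in the app you would pass Streamlit's st.secrets object as the first argument (accessing it may raise an error if no secrets file exists).

```python
import os


def resolve_api_key(secrets, env=os.environ, name="OPENAI_API_KEY"):
    """Prefer the value from a secrets mapping, then fall back to the environment."""
    if name in secrets:
        return secrets[name]
    return env.get(name)  # None if the key is not configured anywhere
```

For example, resolve_api_key(st.secrets) would return the value from secrets.toml when present, and otherwise whatever was set on the command line in step 1.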

Wrap-up and next steps

Congrats🎉 you made an AI-powered document QA app in just 3 easy steps. If you want to deploy this app, Streamlit Community Cloud lets you share and deploy your apps for free in just a few minutes.

I encourage you to further develop this app, for example, by adding sources to the answers and adding support for more file types. You can learn more about langchain in their well-written documentation which includes excellent examples for every use case.

If you have any questions, feel free to leave a comment below.

Happy coding💻
