In this tutorial, we'll be building an AI-powered document QA web app using Python.
With just a few lines of code, you'll have a working document QA app that you can use to extract information from any PDF.
You can find the source code here.
So let's get started!
1. Set up your environment
I highly recommend that you use a package/environment management tool so that the external dependencies you're using won't affect any of your existing projects.
We'll be using the built-in venv module to create virtual environments.
First, open your terminal and create a virtual environment.
python -m venv venv
Then activate it. The command below is for Windows; on macOS or Linux, run source venv/bin/activate instead:
venv\Scripts\activate
Now, let's install the required dependencies:
pip install streamlit pypdf openai faiss-cpu langchain==0.0.77
Finally, we'll need to set an environment variable for the OpenAI API key (on macOS or Linux, use export instead of set):
set OPENAI_API_KEY=<YOUR_API_KEY>
You can get an API key here.
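If you'd like to confirm the variable is visible to Python before going further, a small check like the one below can help. This is only a sketch: require_api_key is a hypothetical helper name, not part of the app or of any library.

```python
import os


def require_api_key(env=None):
    """Return the OpenAI API key from the environment, or fail with a clear message.

    `env` defaults to os.environ; passing a plain dict makes the check easy to try out.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Set it in this terminal before starting the app."
        )
    return key
```

Calling require_api_key() in a Python shell opened from the same terminal should return your key; a RuntimeError means the variable was set in a different terminal session or not at all.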
Now that we're all set, let's start coding our app!
2. Create a QA chain with langchain
Create a file named utils.py, where we'll write the functions for parsing PDFs, creating a vector store, and answering questions.
First, let's import the required dependencies:
from langchain.chains.question_answering import load_qa_chain
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores.faiss import FAISS
from pypdf import PdfReader
import streamlit as st
Then, we'll add a function to parse PDFs:
def parse_pdf(file):
    """Extract the text of every page and join the pages with blank lines"""
    pdf = PdfReader(file)
    output = []
    for page in pdf.pages:
        text = page.extract_text()
        output.append(text)
    return "\n\n".join(output)
We can't fit the whole document inside the prompt since GPT-3 has a limited context window. So we'll have to:
- Split the document into smaller chunks
- Embed those chunks in a special database called a vector store, which allows us to fetch only the relevant passages for a question by doing a semantic search
def embed_text(text):
    """Split the text and embed it in a FAISS vector store"""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=800, chunk_overlap=0, separators=["\n\n", ".", "?", "!", " ", ""]
    )
    texts = text_splitter.split_text(text)
    embeddings = OpenAIEmbeddings()
    index = FAISS.from_texts(texts, embeddings)
    return index
The RecursiveCharacterTextSplitter recursively tries to split the document by the given separators. Note that the order of the separators is important: it'll first try to split the document by "\n\n", then by ".", and so on.
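To make the idea of trying separators in order more concrete, here is a simplified sketch of recursive splitting written in plain Python. It is not langchain's actual implementation: recursive_split is a hypothetical function, it ignores chunk overlap, and it drops the separator at chunk boundaries.

```python
def recursive_split(text, separators, chunk_size):
    """Split text into chunks of at most chunk_size characters.

    Separators are tried in order; a piece that is still too long is split
    again with the remaining separators. An empty separator (or running out
    of separators) falls back to a hard cut every chunk_size characters.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, rest, chunk_size)

    chunks, current = [], ""
    for piece in text.split(sep):
        # Try to grow the current chunk by one more piece.
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        # The piece does not fit: close the current chunk first.
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too long: fall through to the next separator.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph is longer. It has two sentences."
for chunk in recursive_split(sample, ["\n\n", ".", " ", ""], chunk_size=30):
    print(repr(chunk))
```

In this sample, the first paragraph fits in one chunk, while the second is too long and gets split again at the period. That is why paragraph breaks come first in the separator list: the splitter only reaches for finer separators when a coarser one leaves a piece that is too big.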
Finally, let's write a function to search the index and pass the relevant passages to GPT for question answering:
def get_answer(index, query):
    """Returns answer to a query using langchain QA chain"""
    docs = index.similarity_search(query)
    chain = load_qa_chain(OpenAI(temperature=0))
    answer = chain.run(input_documents=docs, question=query)
    return answer
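Conceptually, get_answer does two things: it ranks the chunks by similarity to the question, then places the best ones into a prompt for the model. The sketch below imitates both steps with the standard library only. Everything in it is a stand-in: word counts replace real embeddings, a sorted list replaces FAISS, and the prompt wording is made up rather than taken from langchain.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': word counts. Real embeddings capture meaning, not just shared words."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(chunks, query, k=2):
    """Return the k chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(embed(chunk), query_vector), reverse=True)
    return ranked[:k]


def build_prompt(chunks, question):
    """Put the retrieved chunks and the question into a single prompt (hypothetical wording)."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


chunks = [
    "The invoice is due on March 3.",
    "Shipping takes five business days.",
    "Returns are accepted within 30 days.",
]
question = "When is the invoice due?"
print(build_prompt(top_k(chunks, question, k=1), question))
```

Only the retrieved passages reach the model, which is how a long PDF can be queried despite the limited context window.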
Now, let's create a simple UI for our app.
3. Build the web app with Streamlit
Streamlit makes it easy to create web apps using Python in minutes.
First, create a file named app.py and import Streamlit and the functions we made earlier.
import streamlit as st
from utils import parse_pdf, embed_text, get_answer
Now, the whole UI can be created with just a couple of lines:
st.header("Doc QA")
uploaded_file = st.file_uploader("Upload a pdf", type=["pdf"])
if uploaded_file is not None:
    index = embed_text(parse_pdf(uploaded_file))
    query = st.text_area("Ask a question about the document")
    button = st.button("Submit")
    if button:
        st.write(get_answer(index, query))
That's it🎉 Now, open up your terminal and run:
streamlit run app.py
and see your document QA app in action.
Optimizing the app
Let's see how we can optimize the app and make our life easier.
Caching the results
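You'll notice that the app runs parse_pdf() and embed_text() again each time we ask a question, because Streamlit reruns the script on every interaction. To fix this, we'll cache the results. An easy way to do this is with the @st.cache decorator. Note that newer Streamlit releases deprecate st.cache in favor of st.cache_data and st.cache_resource, so check which decorator your installed version supports.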
In utils.py, add the decorator just before each function declaration:
@st.cache
def parse_pdf(file):
    ...

@st.cache
def embed_text(text):
    ...
Now, save the file and rerun the app. You'll see that after the first question, subsequent ones are answered faster.
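If you're curious why caching helps, the effect can be reproduced without Streamlit. The sketch below uses functools.lru_cache from the standard library as a stand-in for Streamlit's cache; slow_parse and the calls counter are hypothetical names used only for illustration.

```python
from functools import lru_cache

# Counts how many times the function body actually runs.
calls = {"count": 0}


@lru_cache(maxsize=None)
def slow_parse(document_id):
    """Pretend to do expensive work; the body runs only on a cache miss."""
    calls["count"] += 1
    return f"parsed text of {document_id}"


slow_parse("report.pdf")  # cache miss: the body runs
slow_parse("report.pdf")  # cache hit: the stored result is returned
print(calls["count"])     # prints 1
```

The second call with the same argument never reaches the function body, which is the same reason the cached parse_pdf() and embed_text() stop redoing their work on every question.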
Managing secrets
In step 1, we set the OpenAI API key using the command line, which can be cumbersome to type in every time we run the app using a new terminal. So let's load the API key from a file:
- Create a directory called .streamlit at the root of your app.
- Inside it, create a file named secrets.toml and add the following:
OPENAI_API_KEY = "<YOUR_API_KEY>"
Put the API key inside double quotes.
Now, you no longer need to type in the API key every time you spin up a new terminal.
Important: If you're using git, make sure to add secrets.toml to your .gitignore file before committing.
Wrap-up and next steps
Congrats🎉 you made an AI-powered document QA app in just 3 easy steps. If you want to deploy this app, Streamlit Community Cloud lets you share and deploy your apps for free in just a few minutes.
I encourage you to develop this app further, for example, by adding sources to the answers or supporting more file types. You can learn more about langchain in its documentation, which includes examples for many common use cases.
If you have any questions, feel free to leave a comment below.
Happy coding💻