James

Building an Advanced Streamlit Chatbot with OpenAI Integration: A Comprehensive Guide - Part 3

Introduction

In today's information-rich world, PDF documents are ubiquitous, serving as the primary format for academic papers, reports, and manuals. However, extracting and interacting with the knowledge contained within these documents can be time-consuming and daunting, especially when dealing with multiple files. Imagine if you could chat with your PDFs as you would with a knowledgeable friend or a personal assistant, asking questions and getting precise answers without combing through pages of text. This vision becomes a reality with the solution presented throughout this blog: a Streamlit-based application designed to transform your PDFs into an interactive chatbot.

This innovative tool leverages the power of natural language processing (NLP) and machine learning to understand and retrieve information from your documents, offering an intuitive and efficient way to interact with text data. Whether you're a researcher, student, or professional, the app opens up new possibilities for knowledge extraction and data accessibility, making meaningful engagement with your PDFs easier than ever.

For those just joining us on this journey, I encourage you to explore the foundations laid in the previous entries of this series:

  • Part 1: A step-by-step guide to building an interactive Streamlit chatbot, which you can find here.
  • Part 2: A comprehensive guide to enhancing your Streamlit chatbot with OpenAI integration, available here.

Overview of the Streamlit App

The application harnesses Streamlit, an open-source framework, to create a user-friendly interface, bridging the gap between users and their PDF documents. Streamlit's simplicity and efficiency in building data applications have made it the perfect choice for the project, allowing us to focus on the core functionality of conversational AI without getting bogged down by complex web development processes.

The app's purpose extends beyond simple text extraction; it creates a dynamic dialogue between the user and their documents. By uploading PDFs to the app, users can directly ask questions related to the content of these files, receiving answers generated by the conversational AI model. This approach not only saves time but also enhances the accessibility and usability of information contained within PDFs, making it a valuable tool for anyone looking to extract insights from their documents efficiently.

Key Features

The Streamlit app we've developed is not just a technical showcase but a bridge between the static world of documents and the dynamic realm of conversational AI. It comes with some standout features, including:

  1. PDF Processing:
    At the heart of the app is the PyPDF2 library, a robust tool for reading PDF files and extracting text. This functionality is critical for accessing the information locked within your documents. The app is designed to accommodate multiple file formats, ensuring that your data, regardless of its container, is accessible and interactive.

  2. Chunking Text:
    To ensure efficient processing and retrieval, the app employs a smart chunking strategy. It divides text from your documents into manageable chunks, carefully balancing chunk size and overlap. This method optimizes performance and maintains the contextuality of the information, making the interactions as meaningful as possible.

  3. Vector Store Creation:
    The app leverages HuggingFaceInstructEmbeddings to generate rich, contextual embeddings from the chunked text. These embeddings are then stored in a FAISS vector store, a highly efficient similarity search and retrieval structure. This allows for quick and relevant responses to user queries, bridging the gap between natural language questions and the stored textual data.

  4. Conversational Interface:
    Using LangChain, the app crafts a conversational model that enables users to engage with their uploaded documents as if chatting with an expert. This feature is enhanced by a conversation buffer memory, which preserves the dialogue context, allowing for more nuanced and informed responses throughout the interaction.

  5. Interactive UI:
    Streamlit's user-friendly interface is the canvas on which this app paints its functionality. Users can effortlessly upload PDFs and input their questions, initiating a dialogue with the digital contents of their documents.

These features collectively transform static documents into a dynamic, interactive knowledge base, empowering users to uncover insights from their PDFs through simple conversations. The app is a testament to the potential of combining traditional data with modern AI and UI technologies to create a more accessible and enjoyable experience.

Diving Deeper into the Conversational PDF Chatbot's Code

The Streamlit app revolutionizes how we interact with PDF documents, turning them into engaging conversational partners. Let's explore the code in more detail to understand the intricate dance between its components that brings this innovation to life.

Setting the Stage with Environment and Imports

Initially, the app lays its foundation by importing essential libraries, setting up the environment, and loading keys. This preparation is crucial for integrating various functionalities, from PDF manipulation and natural language processing to the web interface and conversational AI. The dotenv library ensures sensitive information like the OpenAI API key is securely managed, highlighting the app's emphasis on security and privacy.

import os
import streamlit as st
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from htmlTemplates import css, bot_template, user_template

load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")

Example .env file

OPENAI_API_KEY=your_api_key_here

The load_dotenv() function loads environment variables from a .env file, which helps store sensitive information like API keys. The OpenAI API key is then retrieved using os.getenv("OPENAI_API_KEY").

Extracting Wisdom from PDFs

The get_pdf_text function uses PyPDF2 to read and extract text from a PDF file, returning the combined text of all pages.

def get_pdf_text(pdf_file):
    pdf_reader = PdfReader(pdf_file)
    return "".join(page.extract_text() for page in pdf_reader.pages)

This function takes a PDF file as input and creates a PdfReader object. It then extracts the text from each page of the PDF using page.extract_text() and joins all the extracted text into a single string using "".join(). This combined text represents the entire content of the PDF document.

Crafting Bite-Sized Knowledge Pieces

The get_text_chunks function divides the extracted text into chunks to make processing more manageable and efficient. This ensures the conversational model can handle the text without being overwhelmed by too much data at once.

def get_text_chunks(text, chunk_size=1000, chunk_overlap=200):
    text_chunks = []
    position = 0
    while position < len(text):
        start_index = max(0, position - chunk_overlap)
        end_index = position + chunk_size
        chunk = text[start_index:end_index]
        text_chunks.append(chunk)
        position = end_index - chunk_overlap
    return text_chunks

The get_text_chunks function takes the extracted text and optional parameters for chunk_size and chunk_overlap. It initializes an empty list called text_chunks to store the divided chunks of text.

The function then iterates over the text using a while loop, starting from position = 0. Each iteration calculates the start_index and end_index of the current chunk based on the position, chunk_size, and chunk_overlap. The chunk_overlap parameter allows for some overlap between consecutive chunks to maintain context.

The function extracts the text chunk using text[start_index:end_index] and appends it to the text_chunks list. It then updates the position by moving it forward by chunk_size - chunk_overlap to prepare for the next iteration. Finally, the function returns the list of text_chunks containing the divided chunks of text.
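
To make the chunking arithmetic concrete, here is a small, hypothetical usage sketch with a 25-character string, chunk_size=10, and chunk_overlap=3. Note that because start_index backs up by chunk_overlap while end_index extends chunk_size past the current position, interior chunks come out chunk_size + chunk_overlap characters wide:

sample = "abcdefghijklmnopqrstuvwxy"  # 25 characters, for illustration only
for chunk in get_text_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(repr(chunk))

# Output:
# 'abcdefghij'     -> text[0:10]
# 'efghijklmnopq'  -> text[4:17], overlapping the tail of the previous chunk
# 'lmnopqrstuvwx'  -> text[11:24]
# 'stuvwxy'        -> text[18:25], the final partial chunk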

Creating a Repository of Insights

With the text divided into chunks, get_vectorstore creates a vector store using embeddings from HuggingFaceInstructEmbeddings and FAISS for text retrieval. This vector store allows for efficient searching and matching of text based on user queries.

def get_vectorstore(text_chunks):
    embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")
    vector_store = FAISS.from_texts(texts=text_chunks, embedding=embeddings)
    return vector_store

The get_vectorstore function takes the list of text_chunks as input. It starts by creating an instance of HuggingFaceInstructEmbeddings using the "hkunlp/instructor-xl" model. This embedding model converts the text chunks into high-dimensional vectors that capture the semantic meaning of the text.

Next, the function creates a FAISS vector store using the from_texts method. It passes the text_chunks and the embeddings object to create the vector store. FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors. It allows for fast retrieval of similar vectors based on a query vector.

Finally, the function returns the created vector_store, which contains the embedded text chunks and enables efficient search and retrieval operations.
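
Before wiring the store into a conversation chain, you can sanity-check retrieval directly. A minimal sketch (the query string is purely illustrative): similarity_search embeds the query with the same instructor model and returns the chunks whose vectors sit closest to it.

vector_store = get_vectorstore(text_chunks)

# Return the three chunks most semantically similar to the query
docs = vector_store.similarity_search("What is the main conclusion?", k=3)
for doc in docs:
    print(doc.page_content[:100])  # preview the first 100 characters of each match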

Weaving Conversations with AI

The core of the chatbot's conversational abilities is set up in get_conversation_chain. This function configures the language model (in this case, ChatOpenAI) and links it with the vector store for information retrieval, creating a conversational retrieval chain.

def get_conversation_chain(vectorstore):
    llm = ChatOpenAI()
    memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)
    return ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(),
        memory=memory,
    )

The get_conversation_chain function takes the vectorstore as input. It starts by creating an instance of ChatOpenAI, a language model provided by OpenAI. This model will generate responses based on the information retrieved from the vector store.

Next, the function creates an instance of ConversationBufferMemory, which stores and manages the conversation history. The memory_key parameter specifies the key under which the conversation history will be stored, and return_messages=True indicates that the memory should return messages instead of raw text.

Finally, the function creates a ConversationalRetrievalChain using the from_llm method. It takes the llm (language model), retriever (vector store retriever), and memory as parameters. The vectorstore.as_retriever() method obtains the retriever object from the vector store, which retrieves relevant information based on the user's query.

The ConversationalRetrievalChain combines the language model, retriever, and memory to enable conversational interactions. It uses the retriever to find relevant information from the vector store based on the user's input and conversation history, and then it generates a response using the language model.
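
You can also exercise the chain outside of Streamlit. A minimal sketch, assuming a vector store already built from your chunks: each call returns a dictionary containing the generated answer alongside the accumulated chat_history, which is exactly what handle_userinput (below) unpacks.

chain = get_conversation_chain(vector_store)

result = chain({'question': 'What topics does the document cover?'})
print(result['answer'])        # the model's generated response
print(result['chat_history'])  # alternating human/AI messages so far

# Follow-up questions automatically reuse the buffered history
result = chain({'question': 'Can you expand on the first one?'})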

Facilitating Dialogue between User and Machine

When a user inputs a question, handle_userinput passes this question to the conversational chain, retrieves the response, and displays it using the Streamlit interface. This function also alternates between user and bot templates for a conversational look.

def handle_userinput(user_question):
    if st.session_state.conversation is not None:
        response = st.session_state.conversation({'question': user_question})
        st.session_state.chat_history = response['chat_history']
        for i, message in enumerate(st.session_state.chat_history):
            if i % 2 == 0:
                st.write(user_template.replace("{{MSG}}", message.content), unsafe_allow_html=True)
            else:
                st.write(bot_template.replace("{{MSG}}", message.content), unsafe_allow_html=True)
    else:
        st.write("Please upload PDFs and click process")

The handle_userinput function takes the user_question as input. It first checks whether the conversation object exists in the Streamlit session state (st.session_state.conversation). If the conversation object is not None, the conversation chain has already been initialized and is ready to handle user input.

If the conversation object exists, the function passes the user_question to the conversation chain using st.session_state.conversation({'question': user_question}). This invokes the conversational retrieval chain, which retrieves relevant information from the vector store based on the user's question and generates a response using the language model.

The response from the conversation chain is stored in the response variable. The chat_history from the response is then stored in the Streamlit session state using st.session_state.chat_history = response['chat_history']. This ensures that the conversation history is preserved across different user interactions.

Next, the function iterates over the chat_history using enumerate(). For each message in the chat history, it checks whether the index (i) is even or odd. If the index is even, it's a user message, so the function writes the message content using the user_template and st.write(). If the index is odd, it's a bot message, so the function writes it using the bot_template. The unsafe_allow_html=True parameter allows HTML rendering of the message templates.

If the conversation object is None (no documents have been processed yet), the function instead writes a message prompting the user to upload PDFs and click the "Process" button.

The Grand Stage: Main Function and UI

Finally, the main function sets up the Streamlit page and user interface, allowing users to upload PDFs, ask questions, and interact with the chatbot. This function is where the application starts and manages the overall flow of the chatbot interaction.

def main():
    load_dotenv()
    st.set_page_config(page_title="Chat with multiple PDFs", page_icon=":books:")
    st.write(css, unsafe_allow_html=True)
    if "conversation" not in st.session_state:
        st.session_state.conversation = None
    if user_question := st.text_input("Ask a question about your documents:"):
        handle_userinput(user_question)
    with st.sidebar:
        st.subheader("Your PDFs")
        pdf_docs = st.file_uploader("Upload PDFs and click process", type="pdf", accept_multiple_files=True)

        if st.button("Process"):
            with st.spinner("Processing PDFs"):
                process_files(pdf_docs, st)

if __name__ == '__main__':
    main()

After configuring the page title and icon via st.set_page_config, the function writes the CSS styles defined in the css variable using st.write(css, unsafe_allow_html=True). This applies the custom styles to the Streamlit application.

It checks if the conversation key exists in the Streamlit session state using if "conversation" not in st.session_state. If the key doesn't exist, it initializes st.session_state.conversation to None. This ensures that the conversation object is initialized correctly.

The function then creates a text input field using st.text_input("Ask a question about your documents:"). It prompts the user to enter a question related to the uploaded documents. If the user enters a question (i.e., user_question is truthy), the function calls the handle_userinput function, passing the user_question as an argument. This handles the user input and generates the appropriate response.

This detailed explanation of the code sections should provide a comprehensive understanding of how the conversational PDF chatbot works behind the scenes. Each part plays a crucial role in extracting text from PDFs, creating vector stores, setting up the conversational chain, handling user input, and managing the user interface using Streamlit.

Conclusion

As we conclude our exploration of the conversational AI chat app, it's clear that we're on the verge of a game-changing transformation in how we interact with written content. By combining AI and natural language processing technologies with a user-friendly Streamlit interface, this app paves the way for more intuitive, efficient, and engaging interactions with documents.

This application has the potential to revolutionize information retrieval by converting static PDFs into dynamic conversations, opening up a new level of accessibility and interaction with documents. The benefits of this technology are immense, from simplifying academic research to improving corporate knowledge sharing. As this technology evolves, the potential use cases for this tool will continue to expand.

As developers, researchers, and professionals in various fields begin to adopt and adapt this tool, we can look forward to a future where the barriers between us and the information we seek are further diminished. The conversational AI chat app not only provides a solution to the problem of navigating information-dense documents but also paves the way towards a future where knowledge is not just accessible but also conversational, personalized, and immediately at the fingertips.

We encourage you to explore the application, push its boundaries, and imagine new ways it can be used in your work, studies, or daily life. The journey towards more interactive and intelligent document handling has just begun, and the possibilities are as limitless as the collective imagination.

Coming In Part 4: Advancing Conversational Capabilities

In the next blog post, we will take a significant leap forward by delving into the latest advancements made to the conversational PDF chatbot. The journey so far has demonstrated how to transform static documents into interactive dialogues. However, the road ahead promises even more dynamic features and functionalities that will redefine your interaction with digital content.

What to Expect

  1. OpenRouter Integration: We'll explore integrating OpenRouter, a cutting-edge service that facilitates seamless interaction with multiple large language models (LLMs). This integration aims to enhance the chatbot's conversational quality and response accuracy, ensuring you receive the most relevant information from your documents.
  2. Customizable Conversational Models: Tailoring the conversational experience to your needs is crucial. We will introduce the ability to select from various LLMs, allowing you to choose the model that best fits the complexity and nature of your documents.
  3. Enhanced Document Management: Managing the documents you interact with should be as intuitive as the conversations you have with them. The next installment will cover improvements in document management, making it easier to upload, select, and delete documents directly within the app.
  4. Advanced Configuration Options: Explore the myriad configuration options for fine-tuning the conversational engine. From adjusting the model's temperature to setting the maximum token limit, these options will give you unprecedented control over the chatbot's behavior and responses.
  5. Interactive UI Enhancements: The user interface (UI) bridges you and the chatbot. We'll highlight the latest UI enhancements to streamline interactions and ensure a smoother, more engaging user experience.
  6. Debugging and Analytics: We'll explain the app's debugging features and analytics for those interested in what happens behind the scenes. Understanding how the app processes your queries and documents can offer valuable insights.

TLDR

Logic

Sparknotes breakdown of the code in logical sections:

  1. Imports and Environment Setup:

    • The necessary libraries and modules are imported, including Streamlit, LangChain components, and PyPDF2.
    • Environment variables are loaded using load_dotenv(), and the OpenAI API key is retrieved.
  2. Helper Functions:

    • get_pdf_text(pdf_file): Extracts text from a PDF file using PyPDF2.
    • get_text_chunks(text, chunk_size=1000, chunk_overlap=200): Divides the extracted text into smaller chunks for efficient processing.
    • get_vectorstore(text_chunks): This function creates a vector store using text chunks and embeddings (OpenAI or Hugging Face).
    • get_conversation_chain(vectorstore): This function sets up the conversational retrieval chain using the vector store and a language model (ChatOpenAI or Hugging Face).
    • handle_userinput(user_question): This function handles user input by retrieving relevant information from the vector store and displaying the conversation history.
  3. Main Function:

    • main(): Orchestrates the app's functionality, including setting up the Streamlit app, handling user input, and processing files.
    • It initializes the conversation and chat history session states.
    • It sets up the sidebar for uploading PDF files and processing them when clicking the "Process" button.
  4. File Processing:

    • process_files(file_list, st): This function processes the uploaded files by extracting text, chunking it, creating a vector store, and setting up the conversation chain.
    • It supports processing PDF, TXT, and CSV files.
  5. Additional Functions:

    • get_file_text(file_path_list): Extracts text from multiple files based on their file extensions.
  6. Main Execution:

    • The if __name__ == '__main__': block ensures that the main() function is executed when the script is run directly.

Code:


import os
import streamlit as st
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from langchain.embeddings import OpenAIEmbeddings, HuggingFaceInstructEmbeddings
from langchain.vectorstores import FAISS, Qdrant, Chroma
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from htmlTemplates import css, bot_template, user_template
from langchain.llms import HuggingFaceHub, HuggingFacePipeline

# load environment variables
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")



def get_pdf_text(pdf_file):
    pdf_reader = PdfReader(pdf_file)
    return "".join(page.extract_text() for page in pdf_reader.pages)


# get text chunks method
def get_text_chunks(text, chunk_size=1000, chunk_overlap=200):
    text_chunks = []
    position = 0
    # Iterate over the text until the entire text has been processed
    while position < len(text):
        start_index = max(0, position - chunk_overlap)
        end_index = position + chunk_size
        chunk = text[start_index:end_index]
        text_chunks.append(chunk)
        position = end_index - chunk_overlap
    return text_chunks


# get vector store method
def get_vectorstore(text_chunks):

    # embeddings = OpenAIEmbeddings(openai_api_key = openai_api_key)
    embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")
    vector_store = FAISS.from_texts(texts=text_chunks, embedding=embeddings)

    print(type(vector_store))

    return vector_store


# get conversation chain method
def get_conversation_chain(vectorstore):
    model_params = {"temperature": 0.23, "max_length": 4096}
    llm = ChatOpenAI()

    # Alternatively, you can swap in a Hugging Face-hosted language model
    # llm = HuggingFaceHub(repo_id="microsoft/phi-2", model_kwargs=model_params)

    memory = ConversationBufferMemory(
        memory_key='chat_history', return_messages=True)

    return ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(),  # Text vector retriever for context matching
        memory=memory,  # Memory buffer to store conversation history
    )


# get handler user input method
def handle_userinput(user_question):
    if st.session_state.conversation is not None:

        response = st.session_state.conversation({'question': user_question})
        st.session_state.chat_history = response['chat_history']

        for i, message in enumerate(st.session_state.chat_history):
            if i % 2 == 0:
                st.write(user_template.replace(
                    "{{MSG}}", message.content), unsafe_allow_html=True)
            else:
                st.write(bot_template.replace(
                    "{{MSG}}", message.content), unsafe_allow_html=True)
    else:
        st.write("Please upload PDFs and click process")


def main():
    load_dotenv()
    st.set_page_config(page_title="Chat with multiple PDFs",
                       page_icon=":books:")
    st.write(css, unsafe_allow_html=True)

    if "conversation" not in st.session_state:
        st.session_state.conversation = None
    if "chat_history" not in st.session_state:
        st.session_state.chat_history = None

    st.header("Chat with multiple PDFs :books:")
    if user_question := st.text_input("Ask a question about your documents:"):
        handle_userinput(user_question)

    st.subheader("Model Parameters")

    # init sidebar
    with st.sidebar:
        st.subheader("Your PDFs")
        pdf_docs = st.file_uploader("Upload PDFs and click process", type="pdf", accept_multiple_files=True)

        if st.button("Process"):
            with st.spinner("Processing PDFs"):
                process_files(pdf_docs, st)


def process_files(file_list, st):
    raw_text = ""
    for file in file_list:
        file_extension = os.path.splitext(file.name)[1]
        if file_extension == ".pdf":
            raw_text += get_pdf_text(file)
        elif file_extension in (".txt", ".csv"):
            # Streamlit uploads arrive as in-memory UploadedFile objects,
            # so read their bytes directly rather than calling open()
            raw_text += file.read().decode("utf-8")
        else:
            raise Exception("File type not supported")

    print(f'Extracted {len(raw_text)} characters of text')
    text_chunks = get_text_chunks(raw_text)
    print(f'Number of text chunks: {len(text_chunks)}')
    print("Creating vector store")
    vector_store = get_vectorstore(text_chunks)
    print("Vector store created")
    print("Creating conversation chain")
    st.session_state.conversation = get_conversation_chain(vector_store)
    print("Conversation chain created")


def get_file_text(file_path_list):
    raw_text = ""
    for file_path in file_path_list:
        file_extension = os.path.splitext(file_path)[1]
        if file_extension == ".pdf":
            raw_text += get_pdf_text(file_path)
        elif file_extension in (".txt", ".csv"):
            # Plain-text formats can be read straight from disk
            with open(file_path, 'r') as text_file:
                raw_text += text_file.read()
        else:
            raise Exception("File type not supported")

    return raw_text


if __name__ == '__main__':
    main()

Part 1: Building an Advanced Streamlit Chatbot with OpenAI Integration: A Comprehensive Guide - Part 1

Part 2: Building an Advanced Streamlit Chatbot with OpenAI Integration: A Comprehensive Guide - Part 2

P.S. Leave comments with things you would like me to cover next.
