Apify for Apify

Posted on Nov 21, 2023 • Originally published at blog.apify.com on Aug 27, 2023

How to create a custom AI chatbot with Python

#ai #llms #python

In this tutorial, we're going to build a custom AI chatbot. Our chatbot is going to work on top of data that will be fed to a large language model (LLM). In other words, we'll be developing a retrieval-augmented chatbot. The main tools we'll use are Streamlit and LangChain.

Streamlit is a tool for the quick creation of web apps. We'll use it to implement the chat interface.
LangChain is a framework that simplifies the building of LLM apps. It mostly acts as the glue between vector databases, LLMs, and your custom code.

We'll split this tutorial into three steps:

First, we'll get some data that can be used as context for the LLM.
Second, we'll use Streamlit to create the chat interface.
Lastly, we'll connect everything together using LangChain.

The code is available at https://github.com/apify/chat-with-a-website.


Obtaining the data and saving it in a vector database

First, we want to collect some data. We'll later use this as the context provided to the LLM when chatting. Our example code will use Apify's Website Content Crawler to scrape the selected website and store the results in a local vector database.

Let's start by creating a .env file that will contain the website we want to chat with and the API credentials for Apify and OpenAI:

OPENAI_API_KEY=your_openai_api_key
APIFY_API_TOKEN=your_apify_api_token
WEBSITE_URL="https://docs.apify.com/platform"
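
Since both scripts in this tutorial depend on these three variables, it can help to fail early when one of them is missing. The following is a minimal sketch, not part of the original code: the helper name missing_env_vars is made up for illustration, and it assumes the variables have already been loaded into the environment (for example by load_dotenv()).

```python
import os

# Variable names taken from the .env file above
REQUIRED_VARS = ('OPENAI_API_KEY', 'APIFY_API_TOKEN', 'WEBSITE_URL')


def missing_env_vars(env, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Hypothetical helper usage: report anything that still needs to be filled in
missing = missing_env_vars(os.environ)
if missing:
    print('Missing environment variables: ' + ', '.join(missing))
else:
    print('All required environment variables are set.')
```

Passing the environment in as an argument keeps the check easy to try out with a plain dictionary.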

Next, let's install all the required packages:

pip install apify-client chromadb langchain openai python-dotenv streamlit tiktoken

Our environment's all set, so let's write some Python code!

Let's create a new file called scrape.py. First, we want to import the necessary packages and load our .env file:

import os

from apify_client import ApifyClient
from dotenv import load_dotenv
from langchain.document_loaders import ApifyDatasetLoader
from langchain.document_loaders.base import Document
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# Load environment variables from a .env file
load_dotenv()

Next, we'll write the main block of the script:

if __name__ == '__main__':
    apify_client = ApifyClient(os.environ.get('APIFY_API_TOKEN'))
    website_url = os.environ.get('WEBSITE_URL')
    print(f'Extracting data from "{website_url}". Please wait...')
    actor_run_info = apify_client.actor('apify/website-content-crawler').call(
        run_input={'startUrls': [{'url': website_url}]}
    )
    print('Saving data into the vector database. Please wait...')
    loader = ApifyDatasetLoader(
        dataset_id=actor_run_info['defaultDatasetId'],
        dataset_mapping_function=lambda item: Document(
            page_content=item['text'] or '', metadata={'source': item['url']}
        ),
    )
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100)
    docs = text_splitter.split_documents(documents)

    embedding = OpenAIEmbeddings()

    vectordb = Chroma.from_documents(
        documents=docs,
        embedding=embedding,
        persist_directory='db',
    )
    vectordb.persist()
    print('All done!')

We'll run the Website Content Crawler Actor on Apify to scrape the target website, then use the ApifyDatasetLoader that is integrated into LangChain to load the scraped documents.
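
To make the dataset_mapping_function a bit more concrete, here's a small sketch of what it does to each dataset item. It's only an illustration: SimpleDocument is a made-up stand-in for LangChain's Document class, and the sample items are invented (real crawler items typically carry more fields than url and text).

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for LangChain's Document, for illustration only."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def map_item(item):
    # Same logic as the lambda in scrape.py: fall back to '' when text is None or empty
    return SimpleDocument(
        page_content=item['text'] or '', metadata={'source': item['url']}
    )


# Invented sample items; the second one has no extracted text
sample_items = [
    {'url': 'https://example.com/a', 'text': 'Page A content'},
    {'url': 'https://example.com/b', 'text': None},
]

mapped = [map_item(item) for item in sample_items]
print([doc.page_content for doc in mapped])
```

The `or ''` fallback matters because a page without extracted text would otherwise produce a document whose content is None.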

Then, we use the RecursiveCharacterTextSplitter to chunk the documents, and finally, we use OpenAI's embeddings to convert our documents into vectors that get stored in the db directory.
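
If you're wondering what chunk_size and chunk_overlap mean in practice, the sketch below shows the basic idea with a plain sliding window. Keep in mind that this is a simplification and not how RecursiveCharacterTextSplitter is implemented: the real splitter tries to break on separators such as paragraph and line breaks first. The function name chunk_text is made up for this example.

```python
def chunk_text(text, chunk_size=1500, chunk_overlap=100):
    """Split text into fixed-size windows; neighbouring chunks share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError('chunk_overlap must be smaller than chunk_size')
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Tiny values so the overlap is easy to see
print(chunk_text('abcdefghij', chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role chunk_overlap plays: it keeps some shared context across chunk boundaries.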

Creating the chat interface

We'll use Streamlit to create the interface, basing it on the examples provided at https://github.com/langchain-ai/streamlit-agent.

Let's create a new file called chat.py and start with the imports and some settings:

import os

import streamlit as st
from dotenv import load_dotenv
from langchain.callbacks.base import BaseCallbackHandler
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.memory import ConversationBufferMemory
from langchain.memory.chat_message_histories import StreamlitChatMessageHistory
from langchain.vectorstores import Chroma

load_dotenv()

website_url = os.environ.get('WEBSITE_URL', 'a website')

st.set_page_config(page_title=f'Chat with {website_url}')
st.title('Chat with a website')

Next, we'll implement some helpers. The get_retriever function will create a retriever based on the data we extracted in the previous step using scrape.py. The StreamHandler class will be used for streaming the responses from the OpenAI model to our application, token by token.

@st.cache_resource(ttl='1h')
def get_retriever():
    embeddings = OpenAIEmbeddings()
    vectordb = Chroma(persist_directory='db', embedding_function=embeddings)

    retriever = vectordb.as_retriever(search_type='mmr')

    return retriever

class StreamHandler(BaseCallbackHandler):
    def __init__(self, container: st.delta_generator.DeltaGenerator, initial_text: str = ''):
        self.container = container
        self.text = initial_text

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        self.text += token
        self.container.markdown(self.text)
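
The retriever above uses search_type='mmr', which stands for maximal marginal relevance: instead of returning only the most similar chunks, it also penalizes chunks that are too similar to ones already picked. Here's a rough pure-Python sketch of that idea. It's not the code LangChain or Chroma runs, and the function names, the toy 2-D vectors, and the lambda_mult default are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Pick k document indices, trading off relevance to the query against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy "embeddings": documents 0 and 1 are near-duplicates, document 2 is different
toy_docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
toy_query = [1.0, 0.2]

print(mmr(toy_query, toy_docs, k=2))                   # diversity-aware pick
print(mmr(toy_query, toy_docs, k=2, lambda_mult=1.0))  # relevance only
```

With lambda_mult=1.0 the redundancy term disappears and the two near-duplicate documents are returned together; with the balanced setting, the second pick is the less similar document.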

Finally, let's add the main code. We use the ConversationalRetrievalChain utility provided by LangChain along with OpenAI's gpt-3.5-turbo. The rest of the code sets up the Streamlit chat interface.

retriever = get_retriever()

msgs = StreamlitChatMessageHistory()
memory = ConversationBufferMemory(memory_key='chat_history', chat_memory=msgs, return_messages=True)

llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0, streaming=True)
qa_chain = ConversationalRetrievalChain.from_llm(
    llm, retriever=retriever, memory=memory, verbose=False
)

if st.sidebar.button('Clear message history') or len(msgs.messages) == 0:
    msgs.clear()
    msgs.add_ai_message(f'Ask me anything about {website_url}!')

avatars = {'human': 'user', 'ai': 'assistant'}
for msg in msgs.messages:
    st.chat_message(avatars[msg.type]).write(msg.content)

if user_query := st.chat_input(placeholder='Ask me anything!'):
    st.chat_message('user').write(user_query)

    with st.chat_message('assistant'):
        stream_handler = StreamHandler(st.empty())
        response = qa_chain.run(user_query, callbacks=[stream_handler])
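
To see what the stream handler does while the chain is running, here's a small sketch you can run without Streamlit or an API key. FakeContainer is a made-up stand-in for the placeholder returned by st.empty(), SketchStreamHandler copies the logic of StreamHandler without the LangChain base class, and the tokens are invented.

```python
class FakeContainer:
    """Hypothetical stand-in for the Streamlit placeholder returned by st.empty()."""

    def __init__(self):
        self.renders = []

    def markdown(self, text):
        # Record each redraw instead of rendering it on a page
        self.renders.append(text)


class SketchStreamHandler:
    """Same accumulate-and-redraw logic as StreamHandler, minus the LangChain base class."""

    def __init__(self, container, initial_text=''):
        self.container = container
        self.text = initial_text

    def on_llm_new_token(self, token, **kwargs):
        self.text += token
        self.container.markdown(self.text)


container = FakeContainer()
handler = SketchStreamHandler(container)

# Invented tokens, standing in for what the model would stream back
for token in ['Apify ', 'is ', 'a ', 'platform.']:
    handler.on_llm_new_token(token)

print(container.renders)
```

Every new token triggers a redraw with the full text so far, which is what makes the answer appear to be typed out in the chat window.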

Connecting everything together

If you've followed along with this tutorial, then by now, you should have three files: .env, scrape.py, and chat.py. Let's take what we've created and use it to chat with a website!

First, run python scrape.py to extract the relevant data from the target website. Note that this step may take a while since the website might be pretty big. You can check the progress at https://console.apify.com/actors/runs.

After the data extraction is done, you can start chatting with the website by running streamlit run chat.py!
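
If the chatbot answers as though it knows nothing about the website, a likely cause is that chat.py is pointing at an empty or missing vector database directory. The check below is a minimal sketch of how you could verify that before starting the app. The helper name vector_db_ready is made up, and the temporary directory and placeholder file are used only to demonstrate it.

```python
import tempfile
from pathlib import Path


def vector_db_ready(persist_directory='db'):
    """Return True if the persist directory exists and contains at least one entry."""
    path = Path(persist_directory)
    return path.is_dir() and any(path.iterdir())


# Demonstration on a throwaway directory rather than the real db folder
with tempfile.TemporaryDirectory() as tmp:
    empty_result = vector_db_ready(tmp)  # directory exists but is empty
    (Path(tmp) / 'placeholder.bin').write_text('placeholder')
    filled_result = vector_db_ready(tmp)  # directory now has content

print(empty_result, filled_result)
```

This only confirms that something was written to the directory; it doesn't validate the contents of the database.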
