merlos

ChatGPT bot with your own data

In this article I will explain how you can build a simple custom bot based on the ChatGPT API that answers questions using a custom knowledge base.

As we know, ChatGPT and other Large Language Models (LLMs) have been trained with the materials publicly available on the Internet. In particular, ChatGPT's knowledge base only covers content up to 2021.

In addition, it is also well known that the internet is just the tip of the iceberg: there is a lot of content, the deep web, that is not publicly available because it resides behind intranets, requires a login, or is so specific that ChatGPT may not reach that level of knowledge.

Tip of the iceberg is small compared with what is under the water

So, the question is:

Can we create a ChatGPT-like bot that provides answers based on our own small piece of the deep web, or on a very specific, custom-made knowledge base?

Yes, and it is fairly simple.

Here is what you'll achieve after following this tutorial. In this case I tested it with the contents of my personal webpage.

ChatGPT demo

Cool, isn't it?

So, what's the process you need to follow?

  1. Gather the base of knowledge you want to use for your bot. For example, a set of webpages, PDFs, markdown or Word files, etc. Create the corpus with the content that has the answers to the questions.

  2. Convert the corpus into plain text files. We need this step because the model that generates the answer needs plain text as input.

  3. Create a set of embeddings. Embeddings are a way to translate information into a vector format so that we can perform a semantic search. In our case, the embeddings are used to find the documents that most probably contain the information needed to generate the answer to our question. To create the embeddings, the text files will be broken into chunks and we will call an OpenAI API to create an embedding for each chunk, as sketched right after this list.
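As a rough sketch of what step 3 involves, this is how one chunk of text can be turned into an embedding with the OpenAI Python package (v1 or later). The model name text-embedding-ada-002 is an assumption here; the repo may use a different one.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

chunk = "Some paragraph from your knowledge base."  # an example chunk
response = client.embeddings.create(
    model="text-embedding-ada-002",  # assumed model, check the repo's code
    input=chunk,
)
embedding = response.data[0].embedding  # a plain list of floats
print(len(embedding))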

What do I need?

You only need a couple of prerequisites:

1) A box with Python (tested with 3.9 and 3.11)
2) An OpenAI account and an API key. You get some USD of free credits for testing.

The free credits should be enough even for a relatively large knowledge base.

The code that we will use in this tutorial relies on the gpt-3.5-turbo model, which is the version prior to GPT-4. Its API is compatible with the GPT-4 API, so if you get access to it, you just need to change the model name.

In addition, some basic Python knowledge may help you customize the code.

Let's bot it!

First, clone the repo and install the dependencies:

git clone https://github.com/merlos/ask-me/
cd ask-me
pip install -r requirements.txt

If you haven't done it yet, create the OpenAI account and get an API key. Then, set this environment variable within the command prompt:

# OSX and GNU/Linux
export OPENAI_API_KEY=<your key>

# Windows
set OPENAI_API_KEY=<your key>

Collect your knowledge base

The goal of this step is to gather the data that will be used as the corpus and convert each file into a plain text file.

Although the repo comes with a few basic tools, depending on your particular use case you may need to do some additional work to gather and convert the content.

In any case, I would start with a small dataset to test the answers and play around.

Also note that these large language models are optimized for ingesting text, so formats such as data tables won't provide good results.

The repo comes with a few tools that may be useful.

The first one is scrapper.py. It takes a webpage address as an argument and goes through all the pages that are within the same domain. For example:

python scrapper.py https://www.merlos.org

As output, it will create the folder ./text/www.merlos.org/ with the scraped pages in text format. The scraper is not the best in the world, so it may have small glitches. The source code is pretty well explained.
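Just to illustrate the idea behind scrapper.py, here is a minimal same-domain crawler using requests and BeautifulSoup. It is not the repo's actual implementation, only a rough sketch:

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup  # pip install requests beautifulsoup4

def crawl(start_url, max_pages=50):
    """Visit pages within the start URL's domain and return {url: plain text}."""
    domain = urlparse(start_url).netloc
    queue, seen, texts = deque([start_url]), {start_url}, {}
    while queue and len(texts) < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        texts[url] = soup.get_text(separator=" ", strip=True)
        # Follow only links that stay within the same domain
        for link in soup.find_all("a", href=True):
            absolute = urljoin(url, link["href"])
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return texts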

For more corporate environments, another tool provided is DownloadSharePointDocuments.ps1, a PowerShell script that downloads the files from a SharePoint library.

# In a PowerShell prompt
.\DownloadSharePointDocuments.ps1 `
    -SiteUrl "https://<tenant name>.sharepoint.com/sites/yoursite/" `
    -LibraryName "Library" `
    -DownloadFolder "./LibraryDocuments"

Note that DownloadFolder must exist. The script downloads the required PowerShell modules.

To run it on GNU/Linux or a Mac you need to install PowerShell (e.g. brew install --cask powershell). On Windows you need to run it as Administrator if you don't have the required modules.

The last helper provided is pdftotext.py. This script takes a folder with a set of PDFs and converts them into txt files. You can download or collect PDFs, leave them in a folder, and this script will convert them into plain text.

python pdftotext.py ./path/with/pdfs --output ./text/filesfolder/

For other types of documents, such as Word documents or slide deck presentations, you may need to find other tools to convert them into plain txt files. The only requirement is to place all the converted plain text documents in a folder for the next step.
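For instance, for Word documents a small script based on the python-docx package (not part of the repo; just an illustrative sketch) can do the conversion:

from pathlib import Path

from docx import Document  # pip install python-docx

def docx_folder_to_text(source_dir, output_dir):
    """Convert every .docx in source_dir into a .txt file in output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in Path(source_dir).glob("*.docx"):
        doc = Document(str(path))
        text = "\n".join(paragraph.text for paragraph in doc.paragraphs)
        (out / (path.stem + ".txt")).write_text(text, encoding="utf-8")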

You can leave in the comments any other tool that you have used for gathering the data.

Process the data

Now you have a folder full of plain text files that will be your corpus. The next step is to break down these files into smaller chunks of text and compute something called embeddings.

We break down the files into small chunks because the OpenAI API has a limit on the amount of text it can process in a single call.

The creation of embeddings is a technique widely used in natural language processing (NLP) and will help us do a semantic search.
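To give an idea of what the chunking looks like, here is a sketch using the tiktoken tokenizer; the chunk size and the exact logic in process.py may differ:

import tiktoken  # pip install tiktoken

def split_into_chunks(text, max_tokens=500):
    """Split text into pieces of at most max_tokens tokens."""
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    return [
        encoding.decode(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]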

To get the embeddings, run the following script, indicating the source folder with the text files from the previous step:

python process.py ./text/www.merlos.org

This script will output the following files (see the snippet below for a quick way to inspect them):

  1. processed/corpus.csv, which is just a list of the txt files in one single csv file, and
  2. processed/embeddings.csv, which includes the embeddings.
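If you want to take a quick look at what was generated, a couple of lines of pandas are enough (just an inspection snippet, not part of the repo):

import pandas as pd

df = pd.read_csv("processed/embeddings.csv")
print(df.columns.tolist())  # see the actual column names
print(df.head())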

This step is relatively slow, so you may prefer to start with a small subset of docs for testing.

During the implementation I noticed that the reliability of the OpenAI API was poor (maybe because I was in the free tier?). It returned errors on many occasions. If there is an error, you can relaunch the script and it will recover the session.

If you have a large corpus, you can use this loop, which will relaunch the script automatically:

until python process.py ./text/folder-with-txt-files; do echo "Restarting...";done

There is some additional documentation about this in the README file of the repo.

Test it

Ok, now it's time for the coolest part. We have the corpus with the embeddings, which is everything we need.

You can test it on the command line:

python ask.py "this is my question?"

Or you can do the same through the browser by launching the web app:

python web.py

And open the browser at http://localhost:5000

What happens behind the scenes when you ask a question?

The first thing that happens is an attempt to find the chunks of text that most probably contain the answer to the question. To search for that text, we use the embeddings we calculated in the processing step.

An embedding is just a way to represent a word - or, strictly speaking, a token - in a vector format. Therefore, instead of having the word king or queen, you will have vectors, for instance, (1,1,1) and (1,1.4,1), but with a lot more dimensions. The values in the different dimensions depend on how "close" the words are. In the example, king and queen will be close because they are semantically related.

When we pose the question, an algorithm runs on the server that searches the local dataset for the chunks of text whose embeddings are closest to the question's embedding and that, therefore, are most likely to contain the answer.
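A common way to measure that closeness is cosine similarity between the question's embedding and each chunk's embedding. The repo may use a different distance helper; this is just a sketch of the idea:

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_relevant_chunks(question_embedding, chunk_embeddings, chunks, top_n=3):
    """Return the top_n chunks whose embeddings are closest to the question."""
    scores = [cosine_similarity(question_embedding, e) for e in chunk_embeddings]
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_n]]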

Once we have the chunks of text that most likely contain the answer, we send those chunks (the context) together with the question and literally ask ChatGPT to provide an answer based on the text provided, similarly to a prompt that you would type on the ChatGPT website. This is what is sent to the API:

Based on the following context:
   {context}
answer the following question:
   {question}

Where {context} is replaced with the chunks of text found and {question} is the question provided by the user.
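Putting it together, the final call looks roughly like this (using the OpenAI Python package v1 or later; the repo may use an older version of the library with a slightly different call):

from openai import OpenAI

client = OpenAI()

def answer(question, context):
    # Build the prompt with the retrieved context and the user's question
    prompt = (
        f"Based on the following context:\n{context}\n"
        f"answer the following question:\n{question}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the answer grounded in the context
    )
    return response.choices[0].message.content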

Hack the code

The source code of the repo is pretty straightforward and, even if you don't have much Python knowledge, you should be able to follow it.

Alternatively, you can paste the code into ChatGPT and ask it to do the changes on your behalf :).

Last words

Definitely, large language models and generative AI are going to change the way we learn, work, and find information.

However, right now only big tech companies like OpenAI (aka CloseAI), Google with Bard, or Facebook with LLaMA seem to have the capacity to release these kinds of models. My hope is that either the computing power needed for these models will soon be accessible to anyone, or that some organization will release a truly open model, similarly to what happened in the image field with Stable Diffusion. That would foster innovation and make these technologies available even in countries and organizations with fewer resources.
