DEV Community

Michael Wahl


Training ChatGPT with local data to create your own chat bot!

Using OpenAI’s ChatGPT, we can train a language model on our own local/custom data, scoped toward our own needs or use cases.

I am using macOS, but you can also use Windows or Linux.

Install Python
Make sure you have Python installed, version 3.0 or later. Head over to the following link and download the Python installer: . You can also open a terminal and run python3 --version to verify that a suitable version of Python is installed.
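
The same check can also be done from inside Python, which is handy at the top of a script. This is a minimal sketch: the helper name `python_is_recent_enough` is illustrative, and the default minimum of 3.0 is simply the requirement stated above.

```python
import sys


def python_is_recent_enough(minimum=(3, 0)):
    """Return True if the running interpreter is at least `minimum` (major, minor)."""
    return sys.version_info[:2] >= tuple(minimum)


status = "OK" if python_is_recent_enough() else "too old"
print("Python", sys.version.split()[0], status)
```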

Upgrade PIP
python3 -m pip install -U pip

Installing Libraries

pip3 install openai
pip3 install gpt_index==0.4.24
pip3 install langchain==0.0.118
pip3 install PyPDF2
pip3 install gradio
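
To confirm what actually got installed, the versions can be read back from Python. This is a minimal sketch, assuming Python 3.8 or later (for `importlib.metadata`). The helper name `installed_versions` is illustrative, and langchain is included in the list because the script in this article imports from it.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


packages = ["openai", "gpt_index", "langchain", "PyPDF2", "gradio"]
for name, ver in installed_versions(packages).items():
    print(f"{name}: {ver or 'not installed'}")
```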

Get OpenAI key
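
The script in this article expects the key in the OPENAI_API_KEY environment variable. As an alternative to pasting the key into the source file, here is a minimal sketch that reads it from the environment and fails early if it is missing. The helper name `require_api_key` is illustrative and not part of any library.

```python
import os


def require_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in `env_var`, or raise if it is unset or blank."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the script")
    return key
```

With this approach you would set the variable in the terminal before starting the script and call `require_api_key()` once at startup, so a missing key is reported immediately rather than on the first query.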

Prep Data
Create a new directory named ‘docs’ in the folder where you will save and run the script (the script loads it by the relative path "docs") and put PDF, TXT, or CSV files inside it. You can add multiple files, but remember that the more data you add, the more tokens will be used. Free accounts are given $18 worth of tokens to use.
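
Because token usage grows with the amount of data, it can help to get a rough size estimate before indexing. The sketch below is an approximation only: it assumes about 4 characters per token, which is a common rule of thumb for English text and not a real tokenizer, and it only reads plain-text files (PDFs are skipped because their text has to be extracted first). The function name `estimate_tokens` is illustrative.

```python
from pathlib import Path

# Assumption: rough rule of thumb for English text, not an exact tokenizer
CHARS_PER_TOKEN = 4


def estimate_tokens(directory, suffixes=(".txt", ".csv")):
    """Roughly estimate the token count of each plain-text file in `directory`."""
    estimates = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix.lower() in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            estimates[path.name] = len(text) // CHARS_PER_TOKEN
    return estimates


# Only report if a docs directory exists next to where this is run
if Path("docs").is_dir():
    for name, tokens in estimate_tokens("docs").items():
        print(f"{name}: about {tokens} tokens")
```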

Script

from gpt_index import SimpleDirectoryReader, GPTSimpleVectorIndex, LLMPredictor, PromptHelper
from langchain import OpenAI
import gradio as gr
import os

# Replace the placeholder with your own OpenAI API key
os.environ["OPENAI_API_KEY"] = 'ApiGoesHere'

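def construct_index(directory_path):
    # Prompt-sizing settings, all measured in tokens
    max_input_size = 4096     # maximum prompt size the model accepts
    num_outputs = 512         # tokens reserved for the generated answer
    max_chunk_overlap = 20    # maximum overlap between neighbouring chunks
    chunk_size_limit = 600    # maximum size of each document chunk

    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)

    llm_predictor = LLMPredictor(llm=OpenAI(temperature=0.7, model_name="text-davinci-003", max_tokens=num_outputs))

    # Load every supported file found in the directory
    documents = SimpleDirectoryReader(directory_path).load_data()

    # Build the vector index and save it so chatbot() can load it later
    index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)

    index.save_to_disk('index.json')

    return index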

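def chatbot(input_text):
    # Load the saved index and answer the question from it
    index = GPTSimpleVectorIndex.load_from_disk('index.json')
    response = index.query(input_text, response_mode="compact")
    return response.response

# gr.Textbox replaces gr.inputs.Textbox, which newer Gradio releases no longer provide
iface = gr.Interface(fn=chatbot,
                     inputs=gr.Textbox(lines=7, label="Enter your text"),
                     outputs="text",
                     title="My AI Chatbot")

# Build the index from the docs directory, then start the web UI
index = construct_index("docs")
iface.launch(share=True)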

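
The chunk_size_limit and max_chunk_overlap settings in the script control how documents are split into pieces before indexing. As a rough illustration of the idea only (this is not gpt_index's actual splitter, and whitespace-separated words stand in for tokens), overlapping chunks can be produced like this. The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split `tokens` into chunks of at most `chunk_size` items.

    Each chunk after the first repeats the last `overlap` items of the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a chunk has reached the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_with_overlap(words, chunk_size=4, overlap=1):
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears in full in at least one chunk, at the cost of indexing a few tokens twice.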

Save the script as app.py.

Open a terminal and run the following command:

python3 app.py

This will start training, which may take some time depending on how much data you have fed it. Once done, it prints a link where you can test the responses in a simple UI. The local URL looks like this: http://127.0.0.1:7860

You can open this in any browser and start testing your custom-trained chatbot. The port number above might be different for you.

To train on more or different data, stop the script with CTRL + C, change the files in the docs directory, and then run the Python file again.
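
Note that the script as written rebuilds the index on every start, even when nothing in docs has changed. One possible refinement, sketched here under the assumption that file modification times are a good enough signal, is to rebuild only when the saved index is missing or older than the documents. The helper name `index_is_stale` is illustrative.

```python
from pathlib import Path


def index_is_stale(docs_dir, index_path="index.json"):
    """Return True if the index is missing or older than any file under `docs_dir`."""
    index_file = Path(index_path)
    if not index_file.exists():
        return True
    index_mtime = index_file.stat().st_mtime
    return any(
        path.is_file() and path.stat().st_mtime > index_mtime
        for path in Path(docs_dir).rglob("*")
    )
```

In app.py this could guard the call to construct_index("docs"), so the index is only rebuilt when `index_is_stale("docs")` returns True. Deleting a document is not detected by this check, so in that case remove index.json by hand before restarting.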

If this article was helpful, maybe consider a clap or follow me back.

Top comments (5)

Сергей Авдейчик

While working with your code that utilizes the langchain library, I've encountered an issue that I can't seem to resolve on my own. When I attempt to run my program, I'm getting the following error:

ImportError: cannot import name 'BaseLanguageModel' from 'langchain.schema'

I'm unable to import BaseLanguageModel from langchain.schema. I've already checked and made sure I have the latest version of the langchain library installed, but the issue still persists.

Could you assist me in figuring this out? What might be the issue here and how could I fix it?

I do understand that this might be a question for the authors of the langchain library, but you might have encountered a similar issue or know how to resolve it.

Thank you in advance for your time and assistance.

Best Regards,
Sergey

Michael Wahl

Check the following:

pip install langchain==0.0.118
pip install gpt_index==0.4.24

Michael Wahl

Were you able to resolve your issue?

Сергей Авдейчик

I'm really glad I found this article - it's simple, clear, and functional! Your instructions allowed me to create my own chatbot without any problems. However, I have a question about the data we use to train the model. Are there any specific requirements for the format and nature of this data? I would appreciate any additional information on this topic.

Michael Wahl

I only uploaded and trained using PDF, CSV, and text files.