Using OpenAI's API, we can train a chatbot on our own local/custom data, scoped to our own needs or use cases.
I am using macOS, but you can also use Windows or Linux.
Install Python
You need to make sure you have Python installed, version 3.0 or later. Head over to the official Python downloads page (python.org) and download the installer. You can also open a terminal and run python3 --version
to verify you have the correct version of Python installed.
Upgrade PIP
python3 -m pip install -U pip
Installing Libraries
pip3 install openai
pip3 install gpt_index==0.4.24
pip3 install PyPDF2
pip3 install gradio
Prep Data
Create a new directory named ‘docs’ anywhere you like and put PDF, TXT, or CSV files inside it. You can add multiple files if you like, but remember that the more data you add, the more tokens will be used. Free accounts are given $18 worth of tokens to use.
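Since token usage scales with the amount of text you index, it can be handy to get a ballpark figure before running the script. The sketch below is an assumption-laden heuristic (the `estimate_tokens` helper is mine, not part of any library): it uses the common rule of thumb that one token is roughly 0.75 English words, and it only counts plain-text files, so PDFs are skipped.

```python
import os

def estimate_tokens(directory_path):
    """Rough token estimate for plain-text files in a directory.

    Rule of thumb: one token is about 0.75 English words, so
    tokens ~= words / 0.75. PDFs are skipped here; this is only
    a ballpark figure, not what the API will actually bill.
    """
    total_words = 0
    for name in os.listdir(directory_path):
        if name.lower().endswith((".txt", ".csv")):
            path = os.path.join(directory_path, name)
            with open(path, "r", errors="ignore") as f:
                total_words += len(f.read().split())
    return int(total_words / 0.75)
```

Running estimate_tokens("docs") before indexing gives you a rough idea of how much of your free credit a single pass might consume.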
Script
from gpt_index import SimpleDirectoryReader, GPTSimpleVectorIndex, LLMPredictor, PromptHelper
from langchain import OpenAI
import gradio as gr
import os

os.environ["OPENAI_API_KEY"] = 'ApiGoesHere'  # paste your OpenAI API key here

def construct_index(directory_path):
    # Prompt/response size settings for the model
    max_input_size = 4096
    num_outputs = 512
    max_chunk_overlap = 20
    chunk_size_limit = 600

    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)
    llm_predictor = LLMPredictor(llm=OpenAI(temperature=0.7, model_name="text-davinci-003", max_tokens=num_outputs))

    # Read every file in the directory and build a vector index over it
    documents = SimpleDirectoryReader(directory_path).load_data()
    index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
    index.save_to_disk('index.json')
    return index

def chatbot(input_text):
    # Load the saved index and answer the question from its contents
    index = GPTSimpleVectorIndex.load_from_disk('index.json')
    response = index.query(input_text, response_mode="compact")
    return response.response

iface = gr.Interface(fn=chatbot,
                     inputs=gr.inputs.Textbox(lines=7, label="Enter your text"),
                     outputs="text",
                     title="My AI Chatbot")

index = construct_index("docs")
iface.launch(share=True)
Save the script as app.py.
Open a terminal and run the following command:
python3 app.py
This will start training (building the index over your documents). It might take some time depending on how much data you have fed it. Once done, it will output a link where you can test the responses using a simple UI. It outputs a local URL such as http://127.0.0.1:7860
You can open this in any browser and start testing your custom-trained chatbot. The port number may be different for you.
To train on more or different data, stop the app with CTRL + C, change the files in the docs directory, and then run the Python file again.
If this article was helpful, maybe consider a clap or follow me back.
Top comments (5)
While working with your code that utilizes the langchain library, I've encountered an issue that I can't seem to resolve on my own. When I attempt to run my program, I'm getting the following error:
ImportError: cannot import name 'BaseLanguageModel' from 'langchain.schema'
I'm unable to import BaseLanguageModel from langchain.schema. I've already checked and made sure I have the latest version of the langchain library installed, but the issue still persists.
Could you assist me in figuring this out? What might be the issue here and how could I fix it?
I do understand that this might be a question for the authors of the langchain library, but you might have encountered a similar issue or know how to resolve it.
Thank you in advance for your time and assistance.
Best Regards,
Sergey
Check the following:
pip install langchain==0.0.118
pip install gpt_index==0.4.24
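Those pins matter because later langchain releases moved BaseLanguageModel out of langchain.schema. As a quick sanity check that the versions you actually have installed match the pins, something like this works (check_pins is a hypothetical helper of mine; it only needs the standard library on Python 3.8+):

```python
from importlib.metadata import PackageNotFoundError, version

def check_pins(pins):
    """Return {package: installed_version_or_None} for every mismatch.

    A value of None means the package is not installed at all;
    an empty dict means everything matches the expected pins.
    """
    mismatches = {}
    for pkg, expected in pins.items():
        try:
            installed = version(pkg)
        except PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[pkg] = installed
    return mismatches
```

For this tutorial you would call check_pins({"langchain": "0.0.118", "gpt_index": "0.4.24"}) and expect an empty dict back.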
Were you able to resolve your issue?
I'm really glad I found this article - it's simple, clear, and functional! Your instructions allowed me to create my own chatbot without any problems. However, I have a question about the data we use to train the model. Are there any specific requirements for the format and nature of this data? I would appreciate any additional information on this topic.
I only uploaded and trained using PDF, CSV, and text files.