Dhanush Reddy

Posted on Feb 12, 2023 • Edited on Apr 30, 2023

Fine-Tune GPT-3 on custom datasets with just 10 lines of code using GPT-Index

#ai #gpt3 #openai #beginners

The Generative Pre-trained Transformer 3 (GPT-3) model by OpenAI is a state-of-the-art language model trained on a massive amount of text data. GPT-3 can generate human-like text and perform tasks such as question answering, summarization, and even creative fiction writing. Wouldn't it be cool if you could feed GPT-3 your own data source and ask it questions?

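In this blog post, we'll see exactly that: fine-tuning GPT-3 on custom datasets using GPT-Index, all with just 10 lines of code! (Strictly speaking, the model's weights are not retrained; GPT-Index builds an index over your documents and passes the relevant pieces to GPT-3 as context when you ask a question.) GPT-Index does the heavy lifting by providing a high-level API for connecting external knowledge bases with LLMs.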

Prerequisites

You need to have Python installed on your system.
An OpenAI API key. If you do not have a key, create a new account on openai.com/api and get $18 of free credits.

Code

I am not going into the details of how all this works, as that would make this blog post longer and go against the title. You can refer to gpt-index.readthedocs.io/en/latest if you want to learn more.
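For a rough mental model only, here is a toy sketch of the idea behind a vector index: score each chunk of text against the question, pick the closest one, and put it in front of the question as context. This is not GPT-Index's code. The helper names (embed, cosine, top_chunks, build_prompt) are made up for illustration, and the bag-of-words "embedding" is an assumption standing in for the real OpenAI embeddings.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(chunks, question, k=1):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)
    return ranked[:k]


def build_prompt(chunks, question, k=1):
    """Place the retrieved chunks before the question, as context for the LLM."""
    context = "\n".join(top_chunks(chunks, question, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Hypothetical example data, not taken from any real document.
chunks = [
    "The warranty covers parts and labour for two years.",
    "Shipping takes three to five business days.",
    "Returns are accepted within thirty days of delivery.",
]
print(build_prompt(chunks, "How long does shipping take?"))
```

The prompt that comes out of build_prompt is the kind of text that would then be sent to the language model; the model itself is never retrained.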

Create a folder and open it in your favorite code editor. Create a virtual environment for this project if needed.
For this tutorial, we need gpt-index and LangChain installed. Please install the exact versions I mention here to avoid any breaking changes.

pip install gpt-index==0.4.1 langchain==0.0.83

If your data sources are PDFs, also install PyPDF2:

pip install PyPDF2==3.0.1
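Since mismatched versions can surface as import errors, it may help to confirm what actually got installed in your environment. The snippet below is a small sketch using only the standard library (Python 3.8+); check_pins and PINNED are names I made up for this example, and PyPDF2 will show up as a mismatch if you skipped that optional install.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins taken from the pip commands in this post.
PINNED = {"gpt-index": "0.4.1", "langchain": "0.0.83", "PyPDF2": "3.0.1"}


def check_pins(pins, get_version=version):
    """Return {package: (wanted, installed_or_None)} for every mismatch."""
    problems = {}
    for name, wanted in pins.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            problems[name] = (wanted, installed)
    return problems


for name, (wanted, installed) in check_pins(PINNED).items():
    print(f"{name}: expected {wanted}, found {installed or 'not installed'}")
```

If nothing is printed, every pinned package is present at the expected version.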

Now create a new file main.py and add the following code:

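import os
# the OpenAI client reads the key from this environment variable
os.environ["OPENAI_API_KEY"] = 'YOUR_OPENAI_API_KEY'

from gpt_index import GPTSimpleVectorIndex, SimpleDirectoryReader
# read the files in the data/ folder into documents
documents = SimpleDirectoryReader('data').load_data()
# embed the documents (via the OpenAI API) and build a vector index
index = GPTSimpleVectorIndex(documents)

# save to disk
index.save_to_disk('index.json')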

For this code to run, your data sources (PDFs, text files, etc.) must be inside a directory named data in the same folder. Run the code after adding your data.
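Building the index calls the OpenAI API, so a quick check that the data folder exists and is not empty can save a wasted run. Here is a minimal sketch; list_data_files is a hypothetical helper, not part of GPT-Index, and it only looks at the top level of the folder.

```python
from pathlib import Path


def list_data_files(folder="data"):
    """Return the sorted files directly inside `folder`, or raise if there are none."""
    root = Path(folder)
    if not root.is_dir():
        raise FileNotFoundError(f"Expected a directory named {folder!r} next to main.py")
    files = sorted(p for p in root.iterdir() if p.is_file())
    if not files:
        raise ValueError(f"{folder!r} is empty; add at least one document first")
    return files


try:
    for path in list_data_files("data"):
        print(path)
except (FileNotFoundError, ValueError) as err:
    print(f"Not ready to index yet: {err}")
```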

Your project directory should look something like this:

project/
├─ data/
│  ├─ data1.pdf
├─ query.py
├─ main.py

Now create another file named query.py and add the following code:

import os
os.environ["OPENAI_API_KEY"] = 'YOUR_OPENAI_API_KEY'

from gpt_index import GPTSimpleVectorIndex

# load from disk
index = GPTSimpleVectorIndex.load_from_disk('index.json')

print(index.query("Any Query You have in your datasets"))

If you run this code, you will get a response from OpenAI to the query you sent.
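Both scripts hardcode the API key, which is easy to commit or share by accident. One alternative, sketched below, is to export OPENAI_API_KEY in your shell and only check for it in the script. require_api_key and PLACEHOLDER are names invented for this example.

```python
import os

# The placeholder string used in the scripts in this post.
PLACEHOLDER = "YOUR_OPENAI_API_KEY"


def require_api_key(env=os.environ):
    """Return the OpenAI key from the environment, or raise with a helpful message."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError("Set OPENAI_API_KEY in your shell before running the script")
    return key
```

In main.py and query.py you could then call require_api_key() in place of the os.environ assignment, so the script stops early with a clear message instead of failing on the first API call.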

I tried this with a paper on arXiv as the data source and ran a query against it.

Conclusion

With GPT-Index, it has become much easier to work with GPT-3 and fine-tune it with just a few lines of code. I hope this small post has shown you how to get started with GPT-3 on custom datasets using GPT-Index.

Of course, you can set up a simple frontend to give it a chatbot look, like ChatGPT.
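Before building a real frontend, a terminal loop is often enough to get the chatbot feel. The sketch below is one possible shape: chat is a hypothetical helper that takes any function mapping a question to an answer, so it can be tried without spending API credits.

```python
def chat(ask, read=input, write=print):
    """Read questions until the user types 'exit' or 'quit', answering each with ask()."""
    while True:
        try:
            question = read("You: ").strip()
        except EOFError:
            break  # input stream closed (e.g. Ctrl-D)
        if question.lower() in {"exit", "quit"}:
            break
        if not question:
            continue  # ignore empty lines
        write(f"Bot: {ask(question)}")


# Assumed usage, with the index loaded as in query.py:
#   chat(index.query)
```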

If you still have any questions regarding this post or want to discuss something with me, feel free to connect on LinkedIn or Twitter.

If you run an organization and want me to write for you, please connect with me on my Socials 🙃

Latest comments (14)

redsquares • Mar 9 '23

Hi! Great project! Thanks in advance!

Is there a way to direct it to mainly use the indexed documents?

Like "summarize the content of the indexed PDFs": is there a way to restrict the analysis to the data folder?

Thanks

shyamgt • Feb 22 '23

I am getting the "ModuleNotFoundError: No module named 'langchain.utilities'" when running main.py. installed langchain, openai, llama_index

Dhanush Reddy • Feb 23 '23

Hey @shyamgt, we are using GPT-Index and not langchain.
Did you properly copy my code?

Maybe you forgot to install it (install it via pip install gpt-index).

Sujata • Mar 20 '23

I am facing the same issue. This is the error log (on Windows 10, in a Python virtual environment):
Traceback (most recent call last):
  File "app.py", line 1, in <module>
    from gpt_index import SimpleDirectoryReader, GPTListIndex, GPTSimpleVectorIndex, LLMPredictor, PromptHelper
  File "F:\Projects\Other\ToDo\Self\SA\AppIoTProjects\GPT2\custombot\myenv\lib\site-packages\gpt_index\__init__.py", line 15, in <module>
    from gpt_index.indices.common.struct_store.base import SQLContextBuilder
  File "F:\Projects\Other\ToDo\Self\SA\AppIoTProjects\GPT2\custombot\myenv\lib\site-packages\gpt_index\indices\__init__.py", line 4, in <module>
    from gpt_index.indices.keyword_table.base import GPTKeywordTableIndex
  File "F:\Projects\Other\ToDo\Self\SA\AppIoTProjects\GPT2\custombot\myenv\lib\site-packages\gpt_index\indices\keyword_table\__init__.py", line 4, in <module>
    from gpt_index.indices.keyword_table.base import GPTKeywordTableIndex
  File "F:\Projects\Other\ToDo\Self\SA\AppIoTProjects\GPT2\custombot\myenv\lib\site-packages\gpt_index\indices\keyword_table\base.py", line 15, in <module>
    from gpt_index.indices.base import DOCUMENTS_INPUT, BaseGPTIndex
  File "F:\Projects\Other\ToDo\Self\SA\AppIoTProjects\GPT2\custombot\myenv\lib\site-packages\gpt_index\indices\base.py", line 19, in <module>
    from gpt_index.docstore import DOC_TYPE, DocumentStore
  File "F:\Projects\Other\ToDo\Self\SA\AppIoTProjects\GPT2\custombot\myenv\lib\site-packages\gpt_index\docstore.py", line 9, in <module>
    from gpt_index.readers.schema.base import Document
  File "F:\Projects\Other\ToDo\Self\SA\AppIoTProjects\GPT2\custombot\myenv\lib\site-packages\gpt_index\readers\__init__.py", line 34, in <module>
    from gpt_index.readers.web import (
  File "F:\Projects\Other\ToDo\Self\SA\AppIoTProjects\GPT2\custombot\myenv\lib\site-packages\gpt_index\readers\web.py", line 5, in <module>
    from langchain.utilities import RequestsWrapper
ModuleNotFoundError: No module named 'langchain.utilities'

machinesrental • Feb 16 '23

API not available

Pranav Tej • Feb 13 '23

INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 192193 tokens

I get exactly the above error when query.py is run.

Konstantinos N. Nikas • Feb 13 '23

Hi Mr. Dhanush Reddy, thank you for your article and guide.
Can we similarly point it to a government open-access database, for which we have a username and password, to create official texts of the same typology as those in the government database using GPT-3?

Dhanush Reddy • Feb 14 '23 • Edited

@nikaskn, I would suggest you look at Llama Hub.

Alternatively, here is the main website: llamahub.ai, where you can find other data loaders for GPT Index.

Pranav Tej • Feb 13 '23

Yes, I ran it and it worked, but I ended up getting OpenAI errors, even though I have $18 of credit unutilized. I tried adding the same PDF as in your example to the data folder. It's 8 pages at most, so I'm not sure why I was getting this error.