Dhanush Reddy

Fine-Tune GPT-3 on custom datasets with just 10 lines of code using GPT-Index

The Generative Pre-trained Transformer 3 (GPT-3) model by OpenAI is a state-of-the-art language model that has been trained on a massive amount of text data. GPT-3 can generate human-like text and perform tasks such as question answering, summarization, and even creative fiction writing. Wouldn't it be cool if you could feed GPT-3 your own data source and ask it questions?

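In this blog post, we are going to do exactly that: "fine-tune" GPT-3 on custom datasets using GPT-Index, all with just 10 lines of code! GPT-Index does the heavy lifting by providing a high-level API for connecting external knowledge bases with LLMs. Strictly speaking, the model's weights are not updated here; GPT-Index indexes your documents and passes the relevant pieces to GPT-3 as context for each query.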

Prerequisites

  • Python installed on your system.
  • An OpenAI API key. If you do not have a key, create a new account on openai.com/api to get $18 of free credits.

Code

I won't go into the details of how all this works, as that would make this blog post longer and go against the title. You can refer to gpt-index.readthedocs.io/en/latest if you want to learn more.

  • Create a folder and open it in your favorite code editor. Create a virtual environment for this project if needed.

  • For this tutorial, we need gpt-index and LangChain installed. Please install the exact versions I mention here to avoid any breaking changes.

pip install gpt-index==0.4.1 langchain==0.0.83

If your data sources are PDFs, also install PyPDF2:

pip install PyPDF2==3.0.1

Now create a new file main.py and add the following code:

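import os

# Replace the placeholder with your own OpenAI API key
os.environ["OPENAI_API_KEY"] = 'YOUR_OPENAI_API_KEY'

from gpt_index import GPTSimpleVectorIndex, SimpleDirectoryReader

# Read every file in the data/ folder into a list of documents
documents = SimpleDirectoryReader('data').load_data()

# Build a vector index over the documents
index = GPTSimpleVectorIndex(documents)

# Save the index to disk so that query.py can load it later
index.save_to_disk('index.json')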

For this code to run, your data sources (PDFs, text files, etc.) must be inside a directory named data in the same folder as main.py. Run the code after adding your data.
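
Before spending API credits on indexing, it can help to confirm that the data folder exists and actually contains files. The following is a minimal sketch using only the standard library; the helper name `list_data_files` is hypothetical and is not part of GPT-Index.

```python
from pathlib import Path


def list_data_files(data_dir="data"):
    """Return the sorted names of the regular files directly inside data_dir.

    Hypothetical helper for illustration; not part of GPT-Index.
    """
    folder = Path(data_dir)
    if not folder.is_dir():
        raise FileNotFoundError(f"Expected a folder named {data_dir!r}")
    # Subfolders are skipped; only plain files are listed
    files = sorted(p.name for p in folder.iterdir() if p.is_file())
    if not files:
        raise ValueError(f"{data_dir!r} is empty; add PDFs or text files first")
    return files
```

Calling `list_data_files()` at the top of main.py would fail early with a readable message if the folder is missing or empty.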

Your project directory should look something like this:

project/
├─ data/
│  ├─ data1.pdf
├─ query.py
├─ main.py

  • Now create another file named query.py and add the following code:

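import os

# Replace the placeholder with your own OpenAI API key
os.environ["OPENAI_API_KEY"] = 'YOUR_OPENAI_API_KEY'

from gpt_index import GPTSimpleVectorIndex

# Load the index that main.py saved to disk
index = GPTSimpleVectorIndex.load_from_disk('index.json')

# Ask a question about your documents and print the answer
print(index.query("Any Query You have in your datasets"))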

If you run this code, you will get a response from OpenAI to the query you sent.
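
For a rough mental model of what a vector index does at query time, here is a toy sketch. It is not GPT-Index's actual implementation: the two-dimensional vectors are made up (real embeddings come from an embedding model and have many more dimensions), and the function names and the prompt layout are hypothetical. The idea is to find the stored chunk whose embedding is closest to the query's embedding, then place that chunk in the prompt as context.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, k=1):
    """Return the ids of the k chunks most similar to the query vector."""
    ranked = sorted(
        chunk_vecs,
        key=lambda chunk_id: cosine_similarity(query_vec, chunk_vecs[chunk_id]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, chunks):
    """Combine the retrieved text and the question into one prompt string."""
    context = "\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Toy data: made-up embeddings for two chunks of a document
chunk_vecs = {"intro": [1.0, 0.0], "methods": [0.0, 1.0]}
chunk_text = {"intro": "This paper studies X.", "methods": "We measure Y."}

best = top_k_chunks([0.9, 0.1], chunk_vecs, k=1)
prompt = build_prompt("What does the paper study?", [chunk_text[c] for c in best])
```

Here `best` is `["intro"]`, because the query vector points almost in the same direction as that chunk's vector. Only the retrieved text is sent to the model along with the question.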

I tried using this paper on arXiv as a data source and asked this query:

An example query to GPT-3 with the corresponding response

Conclusion

With GPT-Index, it has become much easier to work with GPT-3 and connect it to your own data with just a few lines of code. I hope this short post has shown you how to get started with GPT-3 on custom datasets using GPT-Index.

Of course, you can set up a simple frontend to give it a chatbot look, like ChatGPT.
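
As a starting point before building a real frontend, a terminal loop can stand in for the chat window. This is only a sketch: `chat_loop` is a hypothetical helper, and it takes the answering function as a parameter so that it does not depend on any particular library.

```python
def chat_loop(answer, read=input, write=print):
    """Read questions one at a time, pass each to `answer`, and show the reply.

    `answer` is any callable that takes a question string and returns text.
    An empty line or the word "exit" ends the loop.
    """
    while True:
        question = read("You: ").strip()
        if not question or question.lower() == "exit":
            break
        write(f"Bot: {answer(question)}")
```

Assuming the `index` object from query.py is loaded, something like `chat_loop(lambda q: str(index.query(q)))` should turn the one-off query into a simple back-and-forth session.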

If you still have any questions about this post or want to discuss something with me, feel free to connect on LinkedIn or Twitter.

If you run an organization and want me to write for you, please connect with me on my socials 🙃

Latest comments (14)

redsquares

Hi! Great project! Thanks in advance!

Is there a way to direct it to use mainly the indexed documents?

For example, "summarize the content of the indexed PDFs": is there a way to restrict the analysis to the data folder?

Thanks

shyamgt

I am getting "ModuleNotFoundError: No module named 'langchain.utilities'" when running main.py. I have installed langchain, openai, and llama_index.

Dhanush Reddy

Hey @shyamgt, we are using GPT-Index and not langchain.
Did you copy my code properly?

Maybe you forgot to install it (install it via pip install gpt-index).

Sujata

I am facing the same issue. This is the error log (on Windows 10, in a Python virtual environment):
Traceback (most recent call last):
  File "app.py", line 1, in <module>
    from gpt_index import SimpleDirectoryReader, GPTListIndex, GPTSimpleVectorIndex, LLMPredictor, PromptHelper
  File "F:\Projects\Other\ToDo\Self\SA\AppIoTProjects\GPT2\custombot\myenv\lib\site-packages\gpt_index\__init__.py", line 15, in <module>
    from gpt_index.indices.common.struct_store.base import SQLContextBuilder
  File "F:\Projects\Other\ToDo\Self\SA\AppIoTProjects\GPT2\custombot\myenv\lib\site-packages\gpt_index\indices\__init__.py", line 4, in <module>
    from gpt_index.indices.keyword_table.base import GPTKeywordTableIndex
  File "F:\Projects\Other\ToDo\Self\SA\AppIoTProjects\GPT2\custombot\myenv\lib\site-packages\gpt_index\indices\keyword_table\__init__.py", line 4, in <module>
    from gpt_index.indices.keyword_table.base import GPTKeywordTableIndex
  File "F:\Projects\Other\ToDo\Self\SA\AppIoTProjects\GPT2\custombot\myenv\lib\site-packages\gpt_index\indices\keyword_table\base.py", line 15, in <module>
    from gpt_index.indices.base import DOCUMENTS_INPUT, BaseGPTIndex
  File "F:\Projects\Other\ToDo\Self\SA\AppIoTProjects\GPT2\custombot\myenv\lib\site-packages\gpt_index\indices\base.py", line 19, in <module>
    from gpt_index.docstore import DOC_TYPE, DocumentStore
  File "F:\Projects\Other\ToDo\Self\SA\AppIoTProjects\GPT2\custombot\myenv\lib\site-packages\gpt_index\docstore.py", line 9, in <module>
    from gpt_index.readers.schema.base import Document
  File "F:\Projects\Other\ToDo\Self\SA\AppIoTProjects\GPT2\custombot\myenv\lib\site-packages\gpt_index\readers\__init__.py", line 34, in <module>
    from gpt_index.readers.web import (
  File "F:\Projects\Other\ToDo\Self\SA\AppIoTProjects\GPT2\custombot\myenv\lib\site-packages\gpt_index\readers\web.py", line 5, in <module>
    from langchain.utilities import RequestsWrapper
ModuleNotFoundError: No module named 'langchain.utilities'

machinesrental

API not available

Pranav Tej

INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 192193 tokens

I get exactly the above error when query.py is run.

Konstantinos N. Nikas

Hi Mr. Dhanush Reddy, thank you for your article and guide.
Can we similarly point it at a government open-access database, for which we have a username and password, and use GPT-3 to create official texts of the same typology as those in the database?

Dhanush Reddy • Edited

@nikaskn, I would suggest you look at Llama Hub.

Alternatively, here is the main website: llamahub.ai, where you can find other data loaders for GPT Index.

Pranav Tej

Yes, I ran it and it worked, but I ended up getting OpenAI errors, even though I have $18 of unused credit. I tried adding the same PDF as in your example to the data folder. It is at most 8 pages, so I am not sure why I was getting this error.

Dhanush Reddy

Can you try it once on your local system?
I feel Replit servers may be blacklisted by OpenAI, as the same code works on my own.

Pranav Tej

What about Colab?

Pranav Tej

Hi,

I am trying the above code. Could you explain the following to me?

# load from disk
index = GPTSimpleVectorIndex.load_from_disk('index.json')

I have my file as a PDF in the data folder as you said, but what is index.json here?

Am I missing something?

Dhanush Reddy

Did you run main.py before running query.py?
Maybe you forgot that.