(Updated on 20 February 2022)
In this post, I will fine-tune GPT-2, specifically rinna's, which is one of the Japanese GPT-2 models. I am Japanese and most of my chat histories are in Japanese, so I will fine-tune a "Japanese" GPT-2.
GPT-2 stands for Generative Pre-trained Transformer 2 and, as the name suggests, it generates sentences. We can build a chatbot by fine-tuning a pre-trained model with a small amount of training data.
I will not go through GPT-2 in detail. To understand what GPT-2 is and what a language model is, I highly recommend the article How to Build an AI Text Generator: Text Generation with a GPT-2 Model on dev.to.
git repository: chatbot_with_gpt2
I would like to thank the authors of the following two articles.
Thanks to the first author, I could build my chatbot model. The sources in my git repository are built almost entirely from his code; I just summarized them. Thanks to the second author, I could get an overview of GPT-2.
rinna is a conversational pre-trained model provided by rinna Co., Ltd. As of 19 February 2022, five pre-trained models are available on Hugging Face [rinna Co., Ltd.]. rinna is fairly well known in Japan because the company published rinna AI on LINE, one of the most popular social networking apps in Japan. The character is a junior high school girl, and anyone can chat with her on LINE.
I am not sure when the models were published on Hugging Face, but in any case they are available now. I will fine-tune rinna/japanese-gpt2-small, which has a small number of parameters. By the way, I wanted to use rinna/japanese-gpt-1b, which has around one billion parameters, but I couldn't because of the memory capacity on Google Colab.
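To get a feeling for why a model with around one billion parameters is hard to fine-tune on Colab, here is a rough back-of-the-envelope estimate. It is only a sketch under assumptions that are not stated in this post: 32-bit weights, 32-bit gradients, and an Adam-style optimizer that keeps two 32-bit states per parameter. Activations are ignored, so the real requirement would be higher. The function name estimate_training_gb is hypothetical.

```python
# Rough fine-tuning memory estimate. Assumptions (not from the article):
# fp32 weights, fp32 gradients, and an Adam-style optimizer with two
# fp32 states per parameter. Activation memory is ignored.
BYTES_PER_PARAM = 4 + 4 + 8  # weights + gradients + optimizer states


def estimate_training_gb(num_params: int) -> float:
    """Return an approximate lower bound in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM / 10**9


if __name__ == "__main__":
    # "Around one billion" parameters, as stated for rinna/japanese-gpt-1b.
    print(estimate_training_gb(1_000_000_000))
```

Under these assumptions, one billion parameters already need about 16 GB before any activations are stored, which would help explain why the 1b model did not fit.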
I assume you have a Google account and a GitHub account and can use Google Colab.
Furthermore, I will use a chat history from LINE. If you have no LINE account, that is okay: all you have to do is prepare a chat history and convert it into the same form, although I know this is the hardest and most tedious part. If you have an account, the following steps should work. Note that if your LINE language setting is Japanese, you should switch it to English until you have exported the chat history, because the following steps assume that the setting language (not the message language) is English.
At the end of this process, your Google Drive is structured as follows.
MyDrive ---- chatbot_with_gpt2.ipynb
          |
          |- config
          |     |- general_config.yaml
          |
          |- data
                |- chat_history.txt
- 1: Clone the chatbot_with_gpt2 repository on your local machine.
You can do this by running the following command in Git Bash.
git clone https://github.com/ksk0629/chatbot_with_gpt2
- 2: Upload chatbot_with_gpt2/chatbot_with_gpt2.ipynb to your Google Drive.
- 3: Make a directory named config on your Google Drive and create general_config.yaml in the config folder.
general_config.yaml is as follows.
github:
  username: your_github_username
  email: your_email
  token: your_access_token
ngrok:
  token: anything
The ngrok block is not actually used, but it must be present to avoid an error in a later step.
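Since a missing key only shows up as an error later in the notebook, it can help to check the file right after writing it. The following is a minimal sketch, not code from the repository: REQUIRED_KEYS and missing_keys are hypothetical names, and the dictionary is assumed to come from a YAML parser such as yaml.safe_load.

```python
# Sketch: check that a parsed general_config.yaml has the expected keys.
# The dict is assumed to come from a YAML parser (e.g. yaml.safe_load).
# REQUIRED_KEYS and missing_keys are hypothetical, not part of the repository.
REQUIRED_KEYS = {
    "github": ("username", "email", "token"),
    "ngrok": ("token",),
}


def missing_keys(config: dict) -> list:
    """Return dotted names of required keys that are absent or empty."""
    missing = []
    for block, keys in REQUIRED_KEYS.items():
        section = config.get(block) or {}
        for key in keys:
            if not section.get(key):
                missing.append(f"{block}.{key}")
    return missing


if __name__ == "__main__":
    example = {
        "github": {
            "username": "your_github_username",
            "email": "your_email",
            "token": "your_access_token",
        },
        "ngrok": {"token": "anything"},
    }
    print(missing_keys(example))
```

An empty list means every key from the example file above is present.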
- 4: Get a chat history from LINE.
You can export the history by following the official help page [Help centre - Chat history].
- 5: Make a directory named data on your Google Drive and move the chat history into it (named chat_history.txt, as in the layout above).
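Once the five steps are done, the layout can be checked with a few lines of Python. This is only a sketch: missing_files is a hypothetical helper, and the mount point /content/gdrive/MyDrive is taken from the input_path that appears in the preprocessing config of this post.

```python
from pathlib import Path

# Relative paths taken from the Google Drive layout shown in this post.
EXPECTED_FILES = (
    "chatbot_with_gpt2.ipynb",
    "config/general_config.yaml",
    "data/chat_history.txt",
)


def missing_files(drive_root: str) -> list:
    """Return the expected files that do not exist under drive_root."""
    root = Path(drive_root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Assumed mount point of Google Drive on Colab.
    print(missing_files("/content/gdrive/MyDrive"))
```

If the list is empty, the notebook, the config file, and the chat history are all where the later steps expect them.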
- 1: Open chatbot_with_gpt2.ipynb on Google Colaboratory.
- 2: Run the cells in the Preparation block.
Running these cells prepares the environment for getting the training data and building the model.
- 3: Change pre_processor_config.yaml.
The initial yaml file is as follows.
line:
  initial:
    input_username: "input_username"
    output_username: "output_username"
    target_year_list: "[2016,2017,2018,2019,2020,2021,2022]"
  path:
    input_path: "/content/gdrive/MyDrive/data/chat_history.txt"
    output_path: "chat_history_cleaned.pk"
You have to change at least the initial block. The meaning of each line is as follows.
- input_username: a username of messages that you want to input into the model
- output_username: a username of messages that you want the model to output
- target_year_list: years that you want to use to train the model
- input_path: path to the raw chat history
- output_path: path to the cleaned data that is obtained by the following process
Note that if you do not change output_path, the cleaned data will no longer be available after you close the notebook, because the default is a relative path on the Colab runtime rather than on Google Drive. Of course, it remains available while the notebook is running.
- 4: Run the cell in the Preprocessing data block.
This cell cleans the data.
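To give an idea of what this kind of cleaning could look like, here is a minimal sketch. It is not the code from the repository. It assumes that the English LINE export uses date lines such as "Sat, 02/19/2022" and tab-separated message lines of the form time, username, message; the exact format may differ on your device. The function extract_pairs and its pairing rule (an input_username message directly followed by an output_username message) are hypothetical.

```python
import re

# Assumed date-line format of the English export, e.g. "Sat, 02/19/2022".
DATE_LINE = re.compile(r"^\w{3}, \d{1,2}/\d{1,2}/(\d{4})$")


def extract_pairs(text, input_username, output_username, target_years):
    """Pair each input_username message with the output_username
    message that directly follows it, within the target years."""
    pairs = []
    year = None
    previous = None  # (username, message) of the last kept message line
    for line in text.splitlines():
        date_match = DATE_LINE.match(line.strip())
        if date_match:
            year = int(date_match.group(1))
            previous = None  # do not pair messages across days
            continue
        parts = line.split("\t", 2)
        # Skip headers, blank lines, continuation lines, and other years.
        if len(parts) != 3 or year not in target_years:
            continue
        _time, username, message = parts
        if previous and previous[0] == input_username and username == output_username:
            pairs.append((previous[1], message))
        previous = (username, message)
    return pairs


if __name__ == "__main__":
    sample = "Sat, 02/19/2022\n12:00\tA\tおっす\n12:01\tB\tおかえり\n"
    print(extract_pairs(sample, "A", "B", [2022]))
```

Here input_username, output_username, and target_years play the roles of the entries with the same names in the yaml file above. Lines that do not split into three tab-separated fields, such as the export header or the continuation of a multi-line message, are simply dropped in this sketch.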
- 5: Change the configuration file for training.
The initial yaml file is as follows.
general:
  basemodel: "rinna/japanese-gpt2-xsmall"
dataset:
  input_path: "chat_history_cleaned.pk"
  output_path: "gpt2_train_data.txt"
train:
  epochs: 10
  save_steps: 10000
  save_total_limit: 3
  per_device_train_batch_size: 1
  per_device_eval_batch_size: 1
  output_dir: "model/default"
  use_fast_tokenizer: False
You have to change input_path in the dataset block to the path of the cleaned data, which is specified in pre_processor_config.yaml. You can change basemodel to rinna/japanese-gpt2-small, but the others (medium and 1b) will not work because of a lack of GPU memory, as I mentioned in the What is rinna section.
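Most entries of the train block correspond to parameters of TrainingArguments in Hugging Face Transformers. The sketch below shows one possible mapping. Whether the repository passes the values in exactly this way is an assumption, and TRAINING_KEY_MAP and to_training_kwargs are hypothetical names.

```python
# Sketch: map the "train" block of the yaml file onto keyword names used by
# transformers.TrainingArguments. How the repository really consumes these
# values is an assumption; this mapping is only illustrative.
TRAINING_KEY_MAP = {
    "epochs": "num_train_epochs",
    "save_steps": "save_steps",
    "save_total_limit": "save_total_limit",
    "per_device_train_batch_size": "per_device_train_batch_size",
    "per_device_eval_batch_size": "per_device_eval_batch_size",
    "output_dir": "output_dir",
}


def to_training_kwargs(train_config: dict) -> dict:
    """Keep only training-related keys and rename them where needed.

    use_fast_tokenizer is a tokenizer option, so it is left out here.
    """
    return {
        TRAINING_KEY_MAP[key]: value
        for key, value in train_config.items()
        if key in TRAINING_KEY_MAP
    }


if __name__ == "__main__":
    train_block = {
        "epochs": 10,
        "save_steps": 10000,
        "save_total_limit": 3,
        "per_device_train_batch_size": 1,
        "per_device_eval_batch_size": 1,
        "output_dir": "model/default",
        "use_fast_tokenizer": False,
    }
    print(to_training_kwargs(train_block))
```

With a batch size of 1 per device, memory use stays as low as possible, which fits the GPU limits mentioned above.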
- 6: Run the cells in the Training data preparation and Building model blocks.
That is all! After running these cells, all you have to do is wait for a while. You will then find your model files in the directory specified by output_dir in the train block.
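For reference, the training data preparation step turns the cleaned message pairs into a plain text file (gpt2_train_data.txt in the config above). The exact line format used by the repository is not described in this post, so the following is only a sketch under an assumed format: each pair becomes one line with "<s>", "[SEP]", and "</s>" as markers. The function to_training_lines is hypothetical.

```python
# Sketch: turn (input, output) message pairs into training lines.
# The marker strings are an assumed format, not taken from the repository.
def to_training_lines(pairs, bos="<s>", sep="[SEP]", eos="</s>"):
    """Return one training line per (input, output) pair."""
    lines = []
    for input_message, output_message in pairs:
        # Keep each example on a single line of the training file.
        input_message = input_message.replace("\n", " ")
        output_message = output_message.replace("\n", " ")
        lines.append(f"{bos}{input_message}{sep}{output_message}{eos}")
    return lines


if __name__ == "__main__":
    for line in to_training_lines([("おっす", "おかえり")]):
        print(line)
```

Keeping the input and the reply on one line with a separator lets the model learn to continue an input with a reply, which is the behaviour a chatbot needs.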
Again, all you have to do is run the single cell in the Talking with the model block. The code then starts running and you can talk with the model, like the following.
I fine-tuned GPT-2 on my LINE chat history. It worked, but as you can see in the Let's talk to the model section, the following problems remain.
- An unnecessary line, "Setting 'pad_token_id' to 'eos_token_id':2 for open-end generation.", appears in each conversation.
- Some tokens, like <br/ゥ>, break the coherence of the sentences.
- The model did not reply well.
The first response looks quite good, because "おっす" means "Hey" and the response means roughly "You are home. You've got to be exhausted." The others, however, look wrong. To improve the model, I could clean the training data more thoroughly, and I need to understand GPT-2 and the source code better.
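As a possible first step for the second problem, the generated text could be post-processed before it is shown. The sketch below removes short tag-like tokens such as <br/ゥ> with a regular expression. It is my own illustration, not part of the repository, and clean_reply and TAG_PATTERN are hypothetical names. For the first problem, the message is logged by the generate method of Transformers when no pad token is given, so passing pad_token_id=tokenizer.eos_token_id to generate is a commonly used way to avoid it.

```python
import re

# Matches short tag-like tokens such as "<br/ゥ>" or "</s>".
# The length limit of 20 characters is an arbitrary choice for this sketch.
TAG_PATTERN = re.compile(r"<[^<>]{1,20}>")


def clean_reply(generated_text: str) -> str:
    """Remove tag-like tokens and surrounding whitespace from a reply."""
    return TAG_PATTERN.sub("", generated_text).strip()


if __name__ == "__main__":
    print(clean_reply("おかえり<br/ゥ>おつかれ</s>"))
```

This only hides the tokens in the output. Removing the same patterns from the training data would probably be the more thorough fix.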
If you have any suggestions, comments, or questions about this article, please comment below. I'd appreciate it.