DEV Community

Astra Bertelli

AI enthusiasm #6 - Finetune any LLM you want💡

Your personal, free-of-charge ChatGPT??🤯

Haven't you always wanted a personal ChatGPT that is based on your own data and finetuned on exactly that data?

In today's post, I'll show you how to finetune a Large Language Model without the need for GPUs or large amounts of RAM. You will also be able to do it on Google Colab or Kaggle notebooks.

For this tutorial, we'll use this Kaggle notebook and this dataset about Saccharomyces cerevisiae industrial applications: we will be finetuning TinyLlama/TinyLlama-1.1B-Chat-v1.0 from Hugging Face Hub.

Set up and data extraction

First of all, we need to install all the packages the script requires. We'll do that by installing from the requirements.txt file that comes with our dataset:

! pip install -r /kaggle/input/fungal-octopus-dataset/requirements.txt

We will now extract the data and create a Hugging Face-style dataset, a form that is more directly usable when we want to finetune a model.

from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict
import json

def parse_jsonl(filepath):
  # Read a JSON Lines file: one JSON object per non-empty line
  jsonobjs = []
  # The context manager closes the file even if parsing fails
  with open(filepath, "r", encoding="utf-8") as jsonl:
    for line in jsonl:
      if line.strip():
        jsonobjs.append(json.loads(line))
  return jsonobjs

data = parse_jsonl("/kaggle/input/fungal-octopus-dataset/saccer_info.jsonl")

train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)


train_data = Dataset.from_list(train_set)
test_data = Dataset.from_list(test_set)

dataset = DatasetDict(
    {
        "train": train_data,
        "test": test_data
    }
)

As you can see, we define a function that extracts JSON objects from a JSONL file (parse_jsonl), split the resulting list into train and test data, and then create a Hugging Face DatasetDict, which is the final form of our Kaggle dataset.
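
To make the JSONL format concrete, here is a minimal standard-library sketch of the same parse-and-split idea. The sample lines are hypothetical stand-ins for the real file, and split_records is an illustrative helper rather than the scikit-learn function used above (its rounding of the test size is a simplification). The only thing taken from the tutorial is that every record carries a "text" field.

```python
import json
import random

# Hypothetical sample lines; the real data lives in saccer_info.jsonl.
sample_jsonl = """\
{"text": "Saccharomyces cerevisiae is used in baking."}
{"text": "It ferments sugars into ethanol."}

{"text": "It is a model organism in genetics."}
{"text": "Industrial strains are selected for stress tolerance."}
{"text": "Yeast extract is a common food additive."}
"""

def parse_jsonl_lines(lines):
    """Parse an iterable of JSONL lines, skipping blank ones."""
    return [json.loads(line) for line in lines if line.strip()]

def split_records(records, test_size=0.2, seed=42):
    """Shuffle a copy of the records and hold out a test fraction."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

records = parse_jsonl_lines(sample_jsonl.splitlines())
train_set, test_set = split_records(records)
print(len(train_set), len(test_set))  # 4 1
```

Each element of train_set and test_set is a plain dict, which is the shape that Dataset.from_list expects.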

Import the model and tokenize the input data

We now need to load the model from Hugging Face Hub to the Kaggle notebook:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_checkpoint = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
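
If you wonder whether the model fits in your notebook's memory, a rough back-of-the-envelope estimate can help. The sketch below assumes 32-bit floats, takes the nominal 1.1B parameter count from the model name (not a measured value), and assumes an Adam-style optimizer that keeps gradients plus two state tensors per weight. It ignores activations and framework overhead, so treat the numbers as a lower bound.

```python
def estimate_memory_gib(n_params, bytes_per_param=4, training=False):
    """Rough memory estimate in GiB for weights (and training state)."""
    # Inference: weights only.
    # Training (assumption): weights + gradients + two optimizer
    # state tensors, i.e. about 4x the size of the weights.
    multiplier = 4 if training else 1
    return n_params * bytes_per_param * multiplier / 1024**3

n_params = 1.1e9  # nominal size from the model name (assumption)

print(f"weights only: {estimate_memory_gib(n_params):.1f} GiB")
print(f"full finetuning (rough): {estimate_memory_gib(n_params, training=True):.1f} GiB")
```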

Next, we have to tokenize our text data so that the model can use it for training:

def tokenize_function(examples):
    return tokenizer(examples["text"])


tokenized_datasets = dataset.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=["text"]
)

block_size = 128


def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; we could pad instead if the model supported it.
    # Customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
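
To see what group_texts actually does, here is a toy run of the same logic on hand-written lists. The token IDs are hypothetical stand-ins for real tokenizer output, and block_size is shrunk to 4 so the result is easy to read.

```python
block_size = 4  # tiny value for illustration; the tutorial uses 128

def group_texts(examples):
    # Concatenate the per-example lists of every column into one long list.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    # Cut each column into consecutive chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# Three hypothetical tokenized texts, 10 tokens in total.
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}

grouped = group_texts(toy_batch)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(grouped["labels"] == grouped["input_ids"])  # True
```

The last two tokens (9 and 10) are dropped because they do not fill a block. The labels are a plain copy of the inputs because causal language models in Transformers shift the labels internally when computing the loss.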

Define training settings

Last but not least, we need to define the training settings for the model. For this tutorial, we will use only a few well-established parameters:

from transformers import Trainer, TrainingArguments

usr = "USER_ID"

model_name = f"tiny-saccharomyces-llama-{usr}"
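training_args = TrainingArguments(
    f"{model_name}",
    # Evaluate at the end of every epoch
    # (newer transformers releases call this argument eval_strategy)
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    # Requires being logged in to the Hugging Face Hub with write access
    push_to_hub=True,
)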

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["test"],
)
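
Since we leave the batch size and the number of epochs untouched, the Trainer falls back on its defaults (a per-device batch size of 8 and 3 epochs). As a rough sketch, assuming a single device and no gradient accumulation, you can estimate how many optimizer steps a run will take. The figure of 1000 training blocks is a made-up number for illustration, not the size of our dataset.

```python
import math

def total_optimizer_steps(n_train_blocks, batch_size=8, num_epochs=3):
    """Estimate optimizer steps for a single-device run without accumulation."""
    # The last, possibly smaller, batch of each epoch still counts as a step.
    steps_per_epoch = math.ceil(n_train_blocks / batch_size)
    return steps_per_epoch * num_epochs

# Hypothetical dataset of 1000 blocks of 128 tokens each.
print(total_optimizer_steps(1000))  # 375
```

In the real notebook, len(lm_datasets["train"]) gives the actual number of blocks to plug in.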

Now that we have the trainer object, we can proceed with the actual finetuning, just by running:

trainer.train()

And, if we want the model to be available on the Hub, let's just push it to our account:

trainer.push_to_hub()

Now you'll just have to wait until the training is finished and the model is pushed (it will take approx. 1 h), and then everything is done :)
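
Once training is over, a common sanity check for a language model is its perplexity, which is the exponential of the evaluation loss. The sketch below works on a plain metrics dictionary; the loss value is hypothetical and is not a result from this tutorial.

```python
import math

def perplexity_from_metrics(metrics):
    """Turn an evaluation loss (mean cross-entropy) into perplexity."""
    return math.exp(metrics["eval_loss"])

# Hypothetical metrics; not a measured result.
example_metrics = {"eval_loss": 2.0}
print(f"Perplexity: {perplexity_from_metrics(example_metrics):.2f}")  # Perplexity: 7.39
```

In the notebook, trainer.evaluate() returns a dictionary that includes an "eval_loss" entry, so you can pass its output straight to this helper. Lower perplexity means the model is less surprised by the held-out text.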

Let me know what you will be using this fine-tuning pipeline for in the comments below😊!

References

Most of this tutorial is based on the Hugging Face course on Transformers and on Niels Rogge's Transformers tutorials: make sure to check out their work and give them a star on GitHub, if you please ❤️

Cover image by Duncan Rawlinson
