DEV Community

David Mezzetti for NeuML

Posted on • Originally published at neuml.hashnode.dev

 

Train a language model from scratch

This article is part of a tutorial series on txtai, an AI-powered semantic search platform.

txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.

txtai has a robust training pipeline that can fine-tune large language models (LLMs) for downstream tasks such as labeling text. txtai also has the ability to train language models from scratch.

The vast majority of the time, fine-tuning an LLM yields the best results. But when making significant changes to the structure of a model, training from scratch is often required.

Examples of significant changes are:

  • Changing the vocabulary size
  • Changing the number of hidden dimensions
  • Changing the number of attention heads or layers

This article will show how to build a new tokenizer and train a small language model (known as a micromodel) from scratch.

Install dependencies

Install txtai and all dependencies.

# Install txtai
pip install txtai datasets sentence-transformers onnxruntime onnx

Load dataset

This example will use the ag_news dataset, which is a collection of news article headlines and short descriptions.

from datasets import load_dataset

dataset = load_dataset("ag_news", split="train")

Train the tokenizer

The first step is to train the tokenizer. We could use an existing tokenizer, but in this case we want a smaller vocabulary.

from transformers import AutoTokenizer

def stream(batch=10000):
    for x in range(0, len(dataset), batch):
        yield dataset[x: x + batch]["text"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer = tokenizer.train_new_from_iterator(stream(), vocab_size=500, length=len(dataset))
tokenizer.model_max_length = 512

tokenizer.save_pretrained("bert")

Let's test the tokenizer.

print(tokenizer.tokenize("Red Sox defeat Yankees 5-3"))
['re', '##d', 'so', '##x', 'de', '##f', '##e', '##at', 'y', '##ank', '##e', '##es', '5', '-', '3']

With a limited vocabulary size of 500, most words require multiple tokens. This limited vocabulary lowers the number of token representations the model needs to learn.
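
To make the splitting concrete, here is a minimal sketch of the greedy longest-match-first lookup that WordPiece-style tokenizers use when tokenizing a word. The `wordpiece` function and the vocabulary fragment are hypothetical and only for illustration: the pieces are copied from the output above and are not the full trained vocabulary. A real tokenizer also lowercases and splits on whitespace and punctuation first.

```python
def wordpiece(word, vocab, unknown="[UNK]"):
    """Split a single lowercased word using greedy longest-match-first lookup."""
    tokens, start = [], 0

    while start < len(word):
        end, match = len(word), None

        # Try the longest remaining substring first, shrinking until a piece is found
        while start < end:
            piece = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the ## prefix
                piece = "##" + piece

            if piece in vocab:
                match = piece
                break

            end -= 1

        # No piece matches at this position, the whole word is unknown
        if match is None:
            return [unknown]

        tokens.append(match)
        start = end

    return tokens

# Hypothetical vocabulary fragment, taken from the pieces shown in the output above
vocab = {"re", "so", "de", "y", "##d", "##x", "##f", "##e", "##at", "##ank", "##es"}

for word in ["red", "sox", "defeat", "yankees"]:
    print(word, wordpiece(word, vocab))
```

With this fragment, the four words split the same way as in the tokenizer output above. A larger vocabulary would contain longer pieces, so the same loop would return fewer tokens per word.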

Train the language model

Now it's time to train the model. We'll train a micromodel, which is an extremely small language model with a limited vocabulary. Micromodels, when paired with a limited vocabulary, have the potential to work in limited compute environments like edge devices and microcontrollers.

from transformers import AutoTokenizer, BertConfig, BertForMaskedLM

from txtai.pipeline import HFTrainer

config = BertConfig(
    vocab_size = 500,
    hidden_size = 50,
    num_hidden_layers = 2,
    num_attention_heads = 2,
    intermediate_size = 100,
)

model = BertForMaskedLM(config)
model.save_pretrained("bert")
tokenizer = AutoTokenizer.from_pretrained("bert")

train = HFTrainer()

# Train model
train((model, tokenizer), dataset, task="language-modeling", output_dir="bert",
      fp16=True, per_device_train_batch_size=128, num_train_epochs=10,
      dataloader_num_workers=2)
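
Since the model is a BertForMaskedLM, training presumably follows BERT's masked language modeling objective: a fraction of tokens is hidden and the model learns to predict them. The sketch below illustrates that masking step in plain Python. It assumes the rates from the BERT paper (15% of positions selected; of those, 80% masked, 10% replaced with a random token, 10% left unchanged). The `mask_tokens` function, the token ids and the mask id are hypothetical, and this is not the code the training pipeline runs. Real implementations also skip special tokens and typically mark ignored label positions with -100 rather than None.

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, probability=0.15, seed=None):
    """Apply BERT-style masking to a list of token ids.

    Returns (inputs, labels). labels[i] holds the original id for positions
    the model must predict and None everywhere else.
    """
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [None] * len(token_ids)

    for i, token in enumerate(token_ids):
        # Select roughly `probability` of the positions for prediction
        if rng.random() >= probability:
            continue

        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            # 80% of selected positions: replace with the mask token
            inputs[i] = mask_id
        elif roll < 0.9:
            # 10%: replace with a random token from the vocabulary
            inputs[i] = rng.randrange(vocab_size)
        # Remaining 10%: keep the original token

    return inputs, labels

# Hypothetical ids; 4 stands in for the [MASK] token id
ids = [17, 42, 8, 99, 23, 61, 75, 300, 12, 250]
inputs, labels = mask_tokens(ids, vocab_size=500, mask_id=4, seed=0)

print(inputs)
print(labels)
```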

Sentence embeddings

Next let's take the language model and fine-tune it to build sentence embeddings.

wget https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/examples/training/nli/training_nli_v2.py
python training_nli_v2.py bert
mv output/* bert-nli

Embeddings search

Now we'll build a txtai embeddings index using the fine-tuned model. We'll index the ag_news dataset.

from txtai.embeddings import Embeddings

# Get list of all text
texts = dataset["text"]

embeddings = Embeddings({"path": "bert-nli", "content": True})
embeddings.index((x, text, None) for x, text in enumerate(texts))

Let's run a search and see how much the model has learned.

embeddings.search("Boston Red Sox Cardinals World Series")
[{'id': '76733',
  'text': 'Red Sox sweep Cardinals to win World Series The Boston Red Sox ended their 86-year championship drought with a 3-0 win over the St. Louis Cardinals in Game Four of the World Series.',
  'score': 0.8008379936218262},
 {'id': '71169',
  'text': 'Red Sox lead 2-0 over Cardinals of World Series The host Boston Red Sox scored a 6-2 victory over the St. Louis Cardinals, helped by Curt Schilling #39;s pitching through pain and seeping blood, in World Series Game 2 on Sunday night.',
  'score': 0.7896029353141785},
 {'id': '70100',
  'text': 'Sports: Red Sox 9 Cardinals 7 after 7 innings BOSTON Boston has scored twice in the seventh inning to take an 9-to-7 lead over the St. Louis Cardinals in the World Series opener at Fenway Park.',
  'score': 0.7735188603401184}]

Not too bad. It's far from perfect but we can tell that it has some knowledge! This model was trained for 5 minutes; there is certainly room for improvement by training longer and/or with a larger dataset.
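
For intuition on the score values in the results, here is a minimal sketch of ranking by cosine similarity. It assumes the scores are cosine similarities between the query vector and each document vector, which is my understanding of the default behavior rather than something shown in this article. The `cosine` and `search` functions and the 3-dimensional vectors are hypothetical stand-ins for the sentence embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def search(query, vectors, limit=3):
    """Return the top `limit` (id, score) pairs, best match first."""
    scores = [(uid, cosine(query, vector)) for uid, vector in vectors.items()]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:limit]

# Hypothetical document vectors keyed by id
vectors = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.6, 0.8, 0.0],
    "c": [0.0, 0.0, 1.0],
}

print(search([1.0, 0.0, 0.0], vectors))
```

A score near 1 means the vectors point in almost the same direction, so results in the 0.77 to 0.80 range indicate a reasonably close match.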

The standard bert-base-uncased model has 110M parameters and is around 440MB. Let's see how many parameters this model has.

# Show number of parameters
parameters = sum(p.numel() for p in embeddings.model.model.parameters())
print(f"Number of parameters:\t\t{parameters:,}")
print(f"% of bert-base-uncased\t\t{(parameters / 110000000) * 100:.2f}%")
Number of parameters:       94,450
% of bert-base-uncased      0.09%
ls -lh bert-nli/pytorch_model.bin
-rw-r--r-- 1 root root 386K Jan 11 20:52 bert-nli/pytorch_model.bin

This model is 386KB and has only about 0.09% of the parameters of bert-base-uncased. With proper vocabulary selection, a small language model has potential.
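
The parameter count can also be worked out by hand from the configuration. The sketch below is a hypothetical helper, not part of txtai or Transformers. It assumes the standard BERT encoder layout with a pooler and no task head, plus the BertConfig defaults of 512 positions and 2 token types. The number of attention heads does not appear because it does not change the size of the weight matrices.

```python
def bert_parameters(vocab_size, hidden_size, num_hidden_layers, intermediate_size,
                    max_position_embeddings=512, type_vocab_size=2):
    """Count parameters in a BERT encoder with a pooler and no task head."""

    def linear(inputs, outputs):
        # Weight matrix plus bias vector
        return inputs * outputs + outputs

    # Layer normalization has a scale and a bias per hidden dimension
    layer_norm = 2 * hidden_size

    # Word, position and token type embeddings followed by layer normalization
    embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * hidden_size + layer_norm

    # Query, key, value and output projections
    attention = 4 * linear(hidden_size, hidden_size) + layer_norm

    # Two dense layers: hidden -> intermediate -> hidden
    feed_forward = linear(hidden_size, intermediate_size) + linear(intermediate_size, hidden_size) + layer_norm

    pooler = linear(hidden_size, hidden_size)

    return embeddings + num_hidden_layers * (attention + feed_forward) + pooler

micro = bert_parameters(500, 50, 2, 100)
base = bert_parameters(30522, 768, 12, 3072)
large_vocab = bert_parameters(30522, 50, 2, 100)

print(f"Micromodel:                   {micro:,}")        # 94,450
print(f"bert-base-uncased:            {base:,}")         # 109,482,240
print(f"Micromodel with 30,522 vocab: {large_vocab:,}")  # 1,595,550
```

The last line shows why the vocabulary matters so much at this scale: keeping the same tiny architecture but using the bert-base-uncased vocabulary size would make the model roughly 17 times larger, almost all of it in the word embeddings.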

Quantization

If 386KB isn't small enough, we can quantize the model to get it down even further.

from txtai.pipeline import HFOnnx

onnx = HFOnnx()
onnx("bert-nli", task="pooling", output="bert-nli.onnx", quantize=True)
embeddings = Embeddings({"path": "bert-nli.onnx", "tokenizer": "bert-nli", "content": True})
embeddings.index((x, text, None) for x, text in enumerate(texts))
embeddings.search("Boston Red Sox Cardinals World Series")
[{'id': '76733',
  'text': 'Red Sox sweep Cardinals to win World Series The Boston Red Sox ended their 86-year championship drought with a 3-0 win over the St. Louis Cardinals in Game Four of the World Series.',
  'score': 0.8008379936218262},
 {'id': '71169',
  'text': 'Red Sox lead 2-0 over Cardinals of World Series The host Boston Red Sox scored a 6-2 victory over the St. Louis Cardinals, helped by Curt Schilling #39;s pitching through pain and seeping blood, in World Series Game 2 on Sunday night.',
  'score': 0.7896029353141785},
 {'id': '70100',
  'text': 'Sports: Red Sox 9 Cardinals 7 after 7 innings BOSTON Boston has scored twice in the seventh inning to take an 9-to-7 lead over the St. Louis Cardinals in the World Series opener at Fenway Park.',
  'score': 0.7735188603401184}]
ls -lh bert-nli.onnx
-rw-r--r-- 1 root root 187K Jan 11 20:53 bert-nli.onnx

We're down to 187KB with a quantized model!
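
As a rough illustration of where the savings come from, here is a minimal sketch of symmetric 8-bit quantization in plain Python. It is an assumption-laden toy and not the exact scheme the ONNX export applies: the `quantize` and `dequantize` functions and the weight values are hypothetical. The idea is that each weight is stored as a small integer plus one shared scale, so a quantized weight needs one byte instead of the four used by a 32-bit float.

```python
def quantize(weights):
    """Map floats to integers in [-127, 127] with a single shared scale."""
    # Fall back to a scale of 1.0 when every weight is zero
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(values, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in values]

# Hypothetical weights
weights = [0.42, -1.27, 0.05, 0.9, -0.33]

values, scale = quantize(weights)
restored = dequantize(values, scale)

print(values)
print(max(abs(w - r) for w, r in zip(weights, restored)))
```

The rounding error per weight is at most half of the scale, which is consistent with the quantized index returning the same results as the full model in the search above.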

Train on BERT dataset

The BERT paper has all the information regarding training parameters and datasets used. Hugging Face Datasets hosts the bookcorpus and wikipedia datasets.

Training on a dataset of this size is out of scope for this article, but the example code below shows how to build the BERT dataset.

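from datasets import concatenate_datasets, load_dataset

bookcorpus = load_dataset("bookcorpus", split="train")
wiki = load_dataset("wikipedia", "20220301.en", split="train")

# Keep only the text column so both datasets share the same columns
wiki = wiki.remove_columns([col for col in wiki.column_names if col != "text"])
dataset = concatenate_datasets([bookcorpus, wiki])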

Then the same steps to train the tokenizer and model can be run. The dataset is 25GB compressed, so it will take some space and time to process!

Wrapping up

This article covered how to build micromodels from scratch with txtai. Micromodels can be fully rebuilt in hours using the most up-to-date knowledge available. If properly constructed, prepared and trained, micromodels have the potential to be a viable choice for limited resource environments. They can also help when realtime response is more important than having the highest accuracy scores.

It's our hope that further research and exploration into micromodels leads to productive and useful models.
