
Zaynul Abedin Miah

LLM pre-training and scaling laws

This blog post is based on the first module of the Coursera course "Generative AI with Large Language Models" by the AWS team. When choosing a model for a generative AI application, there are two options: selecting an existing model or training one from scratch. Open-source models and model hubs such as those from Hugging Face and PyTorch are helpful tools for AI developers. Pre-training is what teaches large language models to recognise the patterns and structures of human language. Autoencoding (encoder-only) models such as BERT and RoBERTa look at text in both directions and suit tasks like sentiment analysis and named entity recognition. Autoregressive (decoder-only) models such as GPT and BLOOM generate text by conditioning on the tokens that came before. Sequence-to-sequence (encoder-decoder) models such as T5 and BART use both an encoder and a decoder to perform tasks like translation and summarization.
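As a concrete illustration of these three architecture families, here is a minimal sketch using Hugging Face pipelines; the specific checkpoints are illustrative choices, and it assumes transformers and a PyTorch backend are installed:

```python
from transformers import pipeline

# Autoencoding (encoder-only, BERT-style): classification tasks such as sentiment.
sentiment = pipeline("sentiment-analysis",
                     model="distilbert-base-uncased-finetuned-sst-2-english")
print(sentiment("The model hub made this project much easier."))

# Autoregressive (decoder-only, GPT-style): generates text left to right.
generator = pipeline("text-generation", model="gpt2")
print(generator("Large language models are", max_new_tokens=20))

# Sequence-to-sequence (encoder-decoder, BART/T5-style): summarisation, translation.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
print(summarizer(
    "Pre-training teaches a model the statistical structure of language by "
    "predicting tokens over a huge corpus. Fine-tuning then adapts that "
    "general capability to a specific downstream task with far less data."
))
```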

The size of a model affects how well it can perform, but larger models are difficult and costly to train. The model hubs provided by frameworks such as Hugging Face and PyTorch offer resources and model cards that document each model's intended use cases, training methods, and limitations. The different transformer variants (encoder-only, decoder-only, and sequence-to-sequence) are suited to different tasks depending on the objective they were trained on.
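If you want to read a model card programmatically rather than on the website, the huggingface_hub library provides a small helper. A minimal sketch (the checkpoint name is just an illustrative choice):

```python
from huggingface_hub import ModelCard

# Download and parse the model card for a hub checkpoint.
card = ModelCard.load("bert-base-uncased")
print(card.data)        # structured metadata: license, tags, datasets, ...
print(card.text[:500])  # start of the free-form description (use cases, limitations)
```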

Efficient multi-GPU compute strategies

Large language models (LLMs) need a lot of GPU RAM just to store their parameters, and the requirement grows several-fold during training because of additional components such as optimizer states, gradients, and activations. Quantization saves memory by lowering the precision of the model weights, using data types such as FP16, BFLOAT16, or INT8. Because LLMs have billions or hundreds of billions of parameters, training them on a single GPU is impossible, and distributed computing techniques are needed. Full fine-tuning, which comes after pre-training, keeps all of the training state in memory as well and may likewise require distributed compute.
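To see why quantization matters, here is a back-of-the-envelope sketch of the memory needed just to store the weights of a 1-billion-parameter model at different precisions (an illustrative calculation, not a measurement):

```python
# Rough memory needed to hold model weights alone; training typically needs
# several times more for optimizer states, gradients, and activations.
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    return n_params * bytes_per_param / 1e9

n_params = 1e9  # a 1-billion-parameter model
for dtype, nbytes in [("FP32", 4), ("FP16 / BFLOAT16", 2), ("INT8", 1)]:
    print(f"{dtype:>16}: ~{weight_memory_gb(n_params, nbytes):.0f} GB of weights")
```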

Using more than one GPU is essential for training large models and can also speed up training of smaller ones. Distributed Data-Parallel (DDP) replicates the full model on every GPU and works well when the model fits in a single GPU's memory, while Fully Sharded Data-Parallel (FSDP) shards the model states across GPUs for larger models; both involve balancing memory usage against communication volume between GPUs (a minimal DDP sketch follows below). Because multi-GPU training at scale is expensive and technically complex, researchers are also studying how to get better performance out of smaller models.
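Here is a minimal sketch of a DDP training loop in PyTorch, using a toy linear layer as a stand-in for a real model and assuming it is launched with torchrun so that one process runs per GPU:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")      # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # stand-in for an LLM
    model = DDP(model, device_ids=[local_rank])            # replicate + sync grads
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):                          # toy training loop on random data
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()                          # gradients are all-reduced here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launch with, for example, torchrun --nproc_per_node=4 train_ddp.py. FSDP follows the same pattern but wraps the model with torch.distributed.fsdp.FullyShardedDataParallel so that parameters, gradients, and optimizer states are sharded instead of replicated.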

During pre-training, the main aim is to make the model perform as well as possible on its learning objective, which means minimising the loss when predicting tokens. There are two levers for improving performance: increasing the size of the training dataset and increasing the number of model parameters. However, the compute budget is the limiting factor. Compute budgets are often expressed in petaFLOP/s-days, where one petaFLOP/s-day is the number of floating-point operations performed by hardware running at one petaFLOP per second for a full day. The Chinchilla paper suggests that many 100-billion-parameter large language models are overparameterized and undertrained, so they would benefit from more training data; its rule of thumb is that the compute-optimal number of training tokens is roughly 20 times the number of parameters.
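As a rough illustration of this accounting, here is a sketch that converts a model and dataset size into petaFLOP/s-days, assuming the commonly used approximation that training costs about 6 × parameters × tokens floating-point operations:

```python
# 1 petaFLOP/s-day = operations done by a 1 petaFLOP/s machine running for a day.
PFLOP_S_DAY = 1e15 * 86_400  # ~8.64e19 floating-point operations

def training_compute_pflop_s_days(n_params: float, n_tokens: float) -> float:
    flops = 6 * n_params * n_tokens  # approximate forward + backward cost
    return flops / PFLOP_S_DAY

# Chinchilla rule of thumb: compute-optimal token count ~ 20x the parameter count.
n_params = 70e9           # e.g. a 70B-parameter model
n_tokens = 20 * n_params  # ~1.4 trillion tokens
print(f"~{training_compute_pflop_s_days(n_params, n_tokens):,.0f} petaFLOP/s-days")
```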

Domain Adaptation

Domain adaptation is necessary for language models aimed at specialised domains such as law and medicine, which use specific vocabulary and language structures. When the target domain relies on terms and constructions that rarely appear in day-to-day language, pre-training a model from scratch on domain data can achieve better performance than adapting a general-purpose model.


BloombergGPT

BloombergGPT, a finance-focused large language model created by Bloomberg, demonstrates how pre-training a model on domain data can improve its domain specificity. The model was trained on a mix of financial and general-purpose data so that it can understand finance and generate finance-related text. The recommended training dataset size is typically about 20 times the number of model parameters, but the team's dataset fell short of that because financial-domain data is limited: for their 50-billion-parameter model they aimed for roughly 1.4 trillion training tokens, yet were only able to assemble about 700 billion, below the compute-optimal amount. Pre-training can improve domain specificity, but such constraints may force trade-offs between compute-optimal targets and practical training configurations.
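A quick sanity check on the figures quoted above (all numbers rounded and taken from this paragraph rather than measured independently):

```python
n_params = 50e9                   # BloombergGPT parameter count
target_tokens = 1.4e12            # token budget the team aimed for
available_tokens = 0.7e12         # tokens actually assembled (finance + general data)
heuristic_tokens = 20 * n_params  # the ~20x-parameters rule of thumb -> 1e12

print(f"share of the target budget available: {available_tokens / target_tokens:.0%}")
print(f"tokens suggested by the 20x heuristic: {heuristic_tokens:.1e}")
```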

Additional resources
Here are some papers and resources you can read:
BLOOM is an open-source language model with 176 billion parameters, comparable in scale to GPT-3, trained in an open and transparent manner. The authors discuss the dataset and training process in detail in the paper (https://arxiv.org/abs/2211.05100). You can also view a summary of the model: https://bigscience.notion.site/BLOOM-BigScience-176B-Model-ad073ca07cdf479398d5f95d88e218c4

DeepLearning.AI's Natural Language Processing specialisation covers the fundamentals of vector space models and how they are used in language modelling: https://www.coursera.org/learn/classification-vector-spaces-in-nlp/home/module/3

OpenAI researchers study scaling laws for large language models: https://arxiv.org/abs/2001.08361

Which language model architecture and pretraining objective are most effective for zero-shot generalisation? This paper examines different modelling choices in large pre-trained language models and identifies which combination works best for zero-shot generalisation: https://arxiv.org/pdf/2204.05832.pdf

Hugging Face Tasks (https://huggingface.co/tasks) and the Model Hub (https://huggingface.co/models) provide resources and pre-trained checkpoints for a wide range of machine learning tasks.

Meta AI's LLaMA paper proposes efficient LLMs; its 13B-parameter model outperforms the 175B-parameter GPT-3 on most benchmarks: https://arxiv.org/pdf/2302.13971.pdf

This paper explores few-shot learning in LLMs (the GPT-3 paper): https://arxiv.org/pdf/2005.14165.pdf

DeepMind's "Chinchilla Paper" evaluates optimal model size and token count for LLM training: https://arxiv.org/pdf/2203.15556.pdf

BloombergGPT is a finance-specific LLM trained to follow the Chinchilla scaling laws as closely as the available data allowed, providing a powerful real-world example: https://arxiv.org/pdf/2303.17564.pdf
