Siddhant Khare

Exploring parallelism in Large Language Models (LLMs)

Introduction

Large Language Models (LLMs) have revolutionized natural language processing by enabling machines to understand and generate human-like text. However, their vast size and computational demands present significant challenges for both training and inference. To explore these challenges, I built a proof of concept (PoC) that experiments with various parallelism strategies to optimize LLM performance.

In this blog post, I'll delve deep into the technical aspects of my PoC, incorporating insights from the codebase and experimental results. I'll also include "Explain Like I'm 5" (ELI5) moments to make complex concepts more accessible.


The challenge with LLMs

Computational demands

LLMs like Claude 3.5 and GPT-4 are estimated to contain hundreds of billions of parameters. Training and deploying these models require enormous computational resources, often involving clusters of GPUs or specialized hardware.

  • Memory constraints: The sheer size of these models can exceed the memory capacity of a single device.
  • Compute time: Training can take weeks or months, even on high-performance hardware.

Need for efficient parallelism

To make LLMs practical for widespread use, it's essential to distribute the computational workload effectively across multiple devices, optimizing both memory usage and compute efficiency.


Understanding parallelism in LLMs

Parallelism involves dividing a task into smaller sub-tasks that can be processed simultaneously, thereby speeding up computation.

ELI5 explanation

Imagine assembling a giant Lego castle. If you build it alone, it takes a very long time. But if you have friends, and each person builds a different section, the castle comes together much faster. That's parallelism—sharing the work to finish quicker.

Types of parallelism

  1. Data Parallelism (DP): Replicating the entire model on multiple devices and splitting the data across them.

    • Pros: Simple to implement.
    • Cons: Limited by the largest model that fits on a single device.
  2. Model Parallelism (MP): Splitting the model's parameters across multiple devices.

    • Pros: Allows training of larger models.
    • Cons: Communication overhead between devices can slow down training.
  3. Pipeline Parallelism (PP): Dividing the model into sequential stages and feeding micro-batches through the pipeline.

    • Pros: Balances memory and compute loads.
    • Cons: Pipeline bubbles (idle times) can reduce efficiency.
  4. Tensor Parallelism (TP): Splitting individual tensors (weights) across devices; see the sketch after this list.

    • Pros: Enables fine-grained parallelism.
    • Cons: Increased complexity in implementation.
  5. Expert Parallelism (EP): Distributing expert layers (like in Mixture of Experts models) across devices.

    • Pros: Scales specific parts of the model.
    • Cons: May require specialized architectures.
  6. Chunk Parallelism (CP): Dividing sequences into smaller chunks for parallel processing.

    • Pros: Optimizes memory usage for long sequences.
    • Cons: May introduce dependencies that need careful handling.
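
To make one of these strategies concrete, here is a minimal NumPy sketch of tensor parallelism (illustrative only, not code from the PoC): a weight matrix is split column-wise across two simulated "devices", and concatenating the partial results reproduces the single-device computation.

```python
import numpy as np

# Toy dimensions: 4 tokens, hidden size 8, output size 6.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
W = rng.standard_normal((8, 6))

# Tensor parallelism (column split): each "device" owns half of W's columns.
W_dev0, W_dev1 = np.split(W, 2, axis=1)

# Each device computes a partial output using only its shard of the weights.
y_dev0 = x @ W_dev0   # shape (4, 3)
y_dev1 = x @ W_dev1   # shape (4, 3)

# Gathering the shards reproduces the full, unsplit matmul.
y_full = np.concatenate([y_dev0, y_dev1], axis=1)
assert np.allclose(y_full, x @ W)
print("Column-parallel matmul matches the single-device result.")
```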

The LLM Parallelism Explorer PoC

Objective

The PoC aims to:

  • Experiment with different parallelism strategies, including combinations of DP, MP, PP, TP, EP, and CP.
  • Analyze their impact on training and inference efficiency.
  • Provide insights into optimal configurations for deploying large-scale LLMs.

Architecture overview

The project is designed to facilitate flexible experimentation with various parallelism techniques, leveraging configuration management and detailed logging.

Key features

  • Configuration management with Hydra: I used Facebook's Hydra to manage complex configurations easily.
  • Model configurations: Included predefined configurations for models like Grok1.yaml, llama3.1-405b.yaml, llama3.1-70b.yaml, and llama3.1-8b.yaml.
  • Detailed metrics output: Generates CSV files containing comprehensive metrics from experiments.

Technical implementation

Model configurations

The repository includes YAML configuration files for different model sizes:

  • Grok1.yaml: A smaller model for initial testing.
  • llama3.1-8b.yaml: An 8-billion parameter model.
  • llama3.1-70b.yaml: A 70-billion parameter model.
  • llama3.1-405b.yaml: A 405-billion parameter model.

These configurations specify model architecture details, parallelism strategies, and hyperparameters.
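
As a rough sketch of how such a YAML file might be consumed with OmegaConf (the keys and values below are illustrative placeholders, not the actual contents of the PoC configs):

```python
from omegaconf import OmegaConf

# Illustrative stand-in for a file like llama3.1-8b.yaml; the real keys
# and values in the PoC may differ.
cfg = OmegaConf.create("""
model:
  name: llama3.1-8b
  num_layers: 32
  hidden_size: 4096
parallelism:
  tp: 1
  pp: 1
  dp: 8
""")

print(cfg.model.name, cfg.parallelism.dp)  # -> llama3.1-8b 8
```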

Configuration management with Hydra

Hydra allows for:

  • Dynamic configuration composition: Combine multiple configuration files and override parameters easily.
  • Command-line overrides: Modify configurations on the fly without changing the code.
  • Reproducibility: Keep track of the exact configurations used in each experiment.
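
A minimal, hypothetical Hydra entry point looks roughly like this (the config paths, names, and keys are assumptions for illustration, not the PoC's exact layout):

```python
import hydra
from omegaconf import DictConfig, OmegaConf

# Assumes a conf/ directory containing a config.yaml that composes a model
# config; this mirrors standard Hydra usage, not necessarily the PoC's layout.
@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    # Hydra records the fully resolved config for every run, which is what
    # makes experiments reproducible.
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    # Example command-line override (hypothetical keys):
    #   python explore.py model=llama3.1-70b parallelism.tp=4 parallelism.pp=2
    main()
```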

Parallelism strategies

The PoC explores various combinations of parallelism strategies:

  • Data Parallelism (DP)
  • Tensor Parallelism (TP)
  • Pipeline Parallelism (PP)
  • Expert Parallelism (EP)
  • Chunk Parallelism (CP)

Each strategy can be adjusted via configuration parameters, enabling fine-tuned experimentation.
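
For example, a sweep over parallelism degrees could be scripted with Hydra's compose API, roughly as follows (the config name and key names are again hypothetical):

```python
from hydra import compose, initialize

# Hypothetical sweep over tensor/pipeline parallelism degrees for one model;
# "parallelism.tp" and "parallelism.pp" are placeholder keys, not the PoC's schema.
with initialize(version_base=None, config_path="conf"):
    for tp in (1, 2, 4, 8):
        for pp in (1, 2, 4):
            cfg = compose(
                config_name="config",
                overrides=[f"parallelism.tp={tp}", f"parallelism.pp={pp}"],
            )
            # A run_experiment(cfg) call would launch or simulate one
            # configuration and append its metrics to the CSV output.
            print(cfg.parallelism.tp, cfg.parallelism.pp)
```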

Metrics and logging

The PoC outputs detailed metrics to CSV files, including:

  • Memory usage: Total, model, optimizer states, and activation memory.
  • Parallelism parameters: Number of GPUs, TP, PP, EP, CP, DP configurations.
  • Performance metrics: Pipeline bubble rates, token imbalance hypotheses.

Experimental results

Example CSV output

An example snippet of the CSV output:

name,ngpus,TP,PP,EP,CP,DP,sub_seq_length,micro_batch_size,num_of_micro_batches,data_parallel_sharding_strategy,total_memory_gb,model_and_optimizer_states_memory_gb,activations_memory_gb
LLaMA-3.1 405B,8,1,1,1,1,8,8192,1,512.0,NO_OP,7372.49,6803.58,568.91
LLaMA-3.1 405B,8,1,1,1,1,8,8192,2,256.0,NO_OP,7941.40,6803.58,1137.82
LLaMA-3.1 405B,8,1,1,1,1,8,8192,4,128.0,NO_OP,9079.22,6803.58,2275.64
LLaMA-3.1 405B,8,1,1,1,1,8,8192,1,512.0,FULLY_SHARD,1324.86,755.95,568.91

Explanation of key columns

  • ngpus: Number of GPUs used.
  • TP/PP/EP/CP/DP: Tensor, Pipeline, Expert, Chunk, and Data Parallelism degrees.
  • sub_seq_length: Length of subsequences used in chunking.
  • micro_batch_size: Size of micro-batches in pipeline parallelism.
  • num_of_micro_batches: Number of micro-batches processed.
  • data_parallel_sharding_strategy: Strategy used for sharding data (e.g., NO_OP, OPTIMIZER_STATES, FULLY_SHARD).
  • total_memory_gb: Total memory usage in GB.
  • model_and_optimizer_states_memory_gb: Memory used by model parameters and optimizer states.
  • activations_memory_gb: Memory used for storing activations.
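
With the columns defined, a metrics file like this can be sliced with pandas, for example to compare sharding strategies at a fixed micro-batch size (a sketch assuming the file is saved as results.csv; not code from the PoC):

```python
import pandas as pd

# Load the metrics produced by the PoC (the file name is an assumption).
df = pd.read_csv("results.csv")

# Compare sharding strategies at the smallest micro-batch size.
subset = df[df["micro_batch_size"] == 1]
print(
    subset.groupby("data_parallel_sharding_strategy")[
        ["total_memory_gb", "model_and_optimizer_states_memory_gb"]
    ].min()
)
```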

Analysis of results

Impact of micro-batch size

  • Increasing micro_batch_size increases activations_memory_gb because more activations must be held in memory per forward pass; in these runs the growth is roughly linear.
  • There's a trade-off between micro-batch size and memory consumption.
  • For example, with a micro_batch_size of 1, activation memory is 568.91 GB; increasing it to 4 raises activation memory to 2275.64 GB.

Data parallel sharding strategies

  • NO_OP: No sharding; data is replicated across devices.
    • High memory usage due to duplication of model and optimizer states.
  • FULLY_SHARD: Shards both model parameters and optimizer states, offering significant memory savings.
    • Reduces model_and_optimizer_states_memory_gb from 6803.58 GB (NO_OP) to 755.95 GB (FULLY_SHARD).
    • Total memory usage drops accordingly.

Memory savings with sharding

  • Using FULLY_SHARD dramatically reduces total_memory_gb.
  • For micro_batch_size 1:
    • NO_OP: 7372.49 GB total memory.
    • FULLY_SHARD: 1324.86 GB total memory.
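
These numbers can be sanity-checked with back-of-envelope arithmetic (a simplification, not the PoC's actual memory model): activation memory grows linearly with micro-batch size, and total memory is simply the sum of the model/optimizer and activation columns.

```python
# Activation memory scales linearly with micro-batch size in these runs:
act_mb1 = 568.91
for mb in (1, 2, 4):
    print(f"micro_batch_size={mb}: ~{act_mb1 * mb:.2f} GB of activations")
# -> 568.91, 1137.82, 2275.64 GB, matching the CSV rows above.

# Total memory is model/optimizer-state memory plus activation memory:
print(f"{6803.58 + 568.91:.2f}")  # NO_OP row:       7372.49 GB
print(f"{755.95 + 568.91:.2f}")   # FULLY_SHARD row: 1324.86 GB
```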

ELI5 moment

Think of sharding strategies like sharing a heavy backpack. If everyone carries their own heavy backpack (NO_OP), it's tough. If we distribute the items among friends (FULLY_SHARD), each person carries less weight, making it easier for everyone.


Practical implications

For developers

  • Flexible experimentation: Easily adjust configurations to find optimal parallelism strategies for specific models and hardware setups.
  • Reproducibility: Hydra ensures that experiments can be replicated with the same configurations.
  • Detailed insights: CSV outputs provide granular metrics for performance tuning.

For organizations

  • Cost efficiency: Optimizing memory and compute resources can significantly reduce hardware costs.
  • Scalability: Enables training of larger models without a proportional increase in resources.
  • Informed decision-making: Data-driven insights support strategic planning for deploying LLMs.

Future work

No commitments here; this is just side-project PoC stuff 🙃

  • Automated optimization: Implement algorithms to automatically select optimal parallelism configurations based on model size and hardware.
  • Support for heterogeneous clusters: Extend compatibility to include different types of accelerators (e.g., TPUs).
  • Enhanced visualization: Develop tools to visualize the impact of different configurations on performance metrics.

Final ELI5 takeaway

Training a massive language model is like moving a mountain of sand. If everyone carries a small bucket (FULLY_SHARD), and we work together smartly, the mountain moves faster, and no one gets too tired. The LLM Parallelism Explorer helps figure out the best way to share the load so that moving the mountain becomes manageable.

Note: The concepts and findings discussed are based on the experiments and implementations in the LLM Parallelism Explorer PoC project. For detailed code and further technical specifics, refer to the GitHub repository.


Stay connected and get more insights

If you found this guide helpful and are dealing with similar challenges, don't hesitate to reach out to me on X. For more tech insights and updates, consider following me on GitHub. Let's innovate together!
