
🌇 China Community Day: Yu Wong (AWS Solutions Architect) on Generative AI

More photos from AWS Community Day China (Shenzhen)




Long-Context problems:

  • Concurrency (throughput) degrades as context length increases.
  • Prefill latency grows super-linearly with context length, since attention cost is roughly quadratic in sequence length (see the back-of-envelope sketch below).
  • Decoding latency and context-switching cost grow roughly linearly as context length grows.
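A rough back-of-envelope sketch of the prefill cost growth (my own illustration, not a figure from the talk); the model dimensions are hypothetical, chosen only to make the numbers concrete:

```python
# Back-of-envelope: attention FLOPs in prefill grow roughly quadratically with context length.
# d_model and n_layers are hypothetical values for a mid-size model.
d_model, n_layers = 4096, 32

def attention_flops(seq_len: int) -> float:
    # QK^T and (attention weights) @ V each cost about 2 * seq_len^2 * d_model FLOPs per layer.
    return n_layers * 4 * (seq_len ** 2) * d_model

for n in (2_000, 8_000, 32_000):
    print(f"context={n:>6}  attention FLOPs ~ {attention_flops(n):.2e}")
# Every 4x increase in context length means roughly 16x more attention compute during prefill.
```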

Long-Context optimization (Hardware):

  • A100 Memory Hierarchy - Leveraging the advanced memory architecture of the A100 GPU to improve performance for long-context models.

Long-Context optimization (Machine Learning Engineering):

  • FlashAttention
    An IO-aware, tiled attention kernel that avoids materializing the full attention matrix, cutting the memory traffic and compute cost of attention for long sequences (see the sketch after this list).

  • vLLM
    An open-source LLM serving engine whose PagedAttention KV-cache management reduces memory fragmentation and raises throughput for long-context inference.
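A minimal sketch of getting a FlashAttention-style fused kernel through PyTorch's built-in `scaled_dot_product_attention` dispatcher (a generic illustration, not the speaker's setup; all shapes are made up):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch=2, heads=16, seq_len=8192, head_dim=64.
q = torch.randn(2, 16, 8192, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Restrict SDPA to the flash kernel so the (seq_len x seq_len) attention matrix
# is never materialized in GPU memory, which is what makes long contexts feasible.
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)  # torch.Size([2, 16, 8192, 64])
```

On the serving side, vLLM exposes a similarly small surface: with the `vllm` package installed, `LLM(model=...).generate(prompts, SamplingParams(...))` is the usual entry point.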

Long-Context optimization (Model Architecture):

  • MoE (Mixture of Experts)
    A modular architecture that routes each token to a small number of specialized expert sub-networks, so only a fraction of the parameters are active per token and compute scales better for long inputs.

  • Speculative Decoding
    A cheap draft model proposes several future tokens, and the large target model verifies them in a single parallel pass, reducing per-token decoding latency (a toy sketch follows this list).
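A toy, self-contained sketch (my own illustration) of the accept/reject loop behind greedy speculative decoding; the "models" here are trivial stand-in functions:

```python
from typing import Callable, List

def speculative_decode(target: Callable[[List[int]], int],
                       draft: Callable[[List[int]], int],
                       prompt: List[int],
                       n_new: int,
                       k: int = 4) -> List[int]:
    """Greedy speculative decoding sketch.

    target(prefix) / draft(prefix) each return the next greedy token id.
    Real systems verify all k draft tokens in ONE batched target forward pass;
    this loop only shows the accept/fallback logic.
    """
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1. The draft model proposes k tokens cheaply.
        proposed, ctx = [], list(tokens)
        for _ in range(k):
            t = draft(ctx)
            proposed.append(t)
            ctx.append(t)
        # 2. The target model verifies the proposals (conceptually in parallel).
        for t in proposed:
            expected = target(tokens)      # target's own greedy choice
            if expected == t:
                tokens.append(t)           # draft token accepted
            else:
                tokens.append(expected)    # first mismatch: take the target's token
                break
    return tokens[:len(prompt) + n_new]

# Toy usage: the draft guesses +1, the target agrees except at every 5th position.
draft_fn  = lambda ctx: ctx[-1] + 1
target_fn = lambda ctx: ctx[-1] + (2 if len(ctx) % 5 == 0 else 1)
print(speculative_decode(target_fn, draft_fn, prompt=[0], n_new=10))
```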


Background of Prefill & Decode:

  • The cost of LLM cluster inference is governed by Throughput, Hardware Utilization and Hardware Price: cost per token ≈ Hardware Price / (Throughput × Hardware Utilization), so cutting cost means raising throughput and utilization for a given hardware spend (a worked example follows).
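Making that relationship concrete with illustrative, made-up numbers:

```python
# Illustrative only: hypothetical numbers, not figures from the talk.
hardware_price_per_hour = 32.0        # USD/hour for an 8-GPU inference node
throughput_tokens_per_s = 12_000      # tokens/second at full load
utilization = 0.45                    # fraction of time the GPUs do useful work

effective_tokens_per_hour = throughput_tokens_per_s * utilization * 3600
cost_per_million_tokens = hardware_price_per_hour / effective_tokens_per_hour * 1e6
print(f"${cost_per_million_tokens:.2f} per 1M tokens")
# Raising throughput or utilization (or lowering hardware price) lowers cost linearly.
```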

Impact of Prefill duration on throughput:

  • Prefill saturates the GPU's compute, so two Prefill tasks on the same device cannot overlap (no Prefill-Prefill parallelism).
  • Decode needs comparatively little compute (it is memory-bandwidth bound), so Decode tasks can be overlapped with Prefill tasks.

Separate Prefill & Decode to cut cost by 80%:

  • Introduce a Decode-only server.
  • Achieve Prefill-Decode separation by transmitting the intermediate inference data (the KV cache) over the network.
  • The original architecture can then focus on optimizing Prefill tasks.
  • Prefill no longer needs to retain the KV cache (the data is sent to the Decode server as soon as it is generated).
  • Prefill inference therefore no longer requires large amounts of GPU memory (a toy sketch follows this list).
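A minimal, single-process sketch of the idea (my own illustration, with toy tensors instead of a real model, and an in-process handoff instead of a real network transfer): the prefill worker builds the KV cache for the prompt and ships it to a decode-only worker, which then generates tokens without ever re-running prefill.

```python
import torch

# Toy dimensions; a real deployment streams the KV cache over the network to a
# separate decode-only server fleet instead of passing it in-process.
n_layers, n_heads, head_dim = 4, 8, 64

def prefill_worker(prompt_ids: torch.Tensor):
    """Compute-bound stage: build the KV cache for the whole prompt at once."""
    seq = prompt_ids.shape[0]
    kv_cache = [
        (torch.randn(n_heads, seq, head_dim), torch.randn(n_heads, seq, head_dim))
        for _ in range(n_layers)
    ]
    # The prefill server can discard the cache as soon as it has been shipped out.
    return kv_cache

def decode_worker(kv_cache, n_new: int):
    """Memory-bound stage: append one token's K/V per step and emit a token."""
    out = []
    for _ in range(n_new):
        kv_cache = [
            (torch.cat([k, torch.randn(n_heads, 1, head_dim)], dim=1),
             torch.cat([v, torch.randn(n_heads, 1, head_dim)], dim=1))
            for k, v in kv_cache
        ]
        out.append(torch.randint(0, 32_000, (1,)).item())  # placeholder sampling
    return out

prompt = torch.randint(0, 32_000, (512,))
cache = prefill_worker(prompt)          # would run on the prefill fleet
print(decode_worker(cache, n_new=8))    # would run on the decode-only fleet
```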


Retrieval Augmented Generation (RAG):

  • A technique that enhances language models by retrieving external knowledge at query time, so responses are better informed and more relevant (a minimal retrieval sketch follows this list).
  • RAG (includes: ETL, intention, retrieval)
  • Model lifecycle management (includes: model, dataset, entity)
  • Performance acceleration (includes: acceleration framework, quantization)
  • Infrastructure operation (includes: custom chips, managed services)
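A minimal retrieve-and-augment sketch (illustrative only: it uses deterministic random vectors in place of a real embedding model and vector store):

```python
import numpy as np

# Toy corpus; in practice these chunks come from the ETL pipeline described below.
docs = [
    "Prefill latency grows quickly with context length.",
    "Decode is memory-bandwidth bound and benefits from batching.",
    "vLLM manages the KV cache with paged memory blocks.",
]

def embed(text: str) -> np.ndarray:
    # Stand-in for a real embedding model endpoint.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

doc_vecs = np.stack([embed(d) for d in docs])

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    scores = doc_vecs @ q                      # cosine similarity (unit vectors)
    return [docs[i] for i in np.argsort(-scores)[:k]]

query = "Why is prefill slow for long prompts?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
print(prompt)  # this augmented prompt is what gets sent to the LLM
```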

RAG Workflow:

  • Data Preprocessing (ETL)
  • Knowledge extraction
  • Knowledge enhancement
  • Knowledge vectorization
  • Knowledge injection (a compressed sketch of these ETL steps follows the list)
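A compressed sketch of the split / vectorize / inject steps above, with stand-in components rather than any specific AWS service (function names and sizes are my own, for illustration):

```python
import numpy as np

raw_document = "Extracted text from the source documents. " * 40  # extraction step output

def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Content split: fixed-size character windows with overlap."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def vectorize(chunks: list[str]) -> np.ndarray:
    """Knowledge vectorization: stand-in for a real embedding model."""
    return np.stack([np.random.default_rng(len(c)).normal(size=384) for c in chunks])

vector_store: list[tuple[str, np.ndarray]] = []

def inject(chunks: list[str], vectors: np.ndarray) -> None:
    """Knowledge injection: write (chunk, vector) pairs into the vector store
    (OpenSearch, pgvector, etc. in a real pipeline)."""
    vector_store.extend(zip(chunks, vectors))

chunks = split_into_chunks(raw_document)
inject(chunks, vectorize(chunks))
print(f"{len(vector_store)} chunks injected")
```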

LLM Orchestration:

  • Intention identification
  • Query rewriting for knowledge retrieval (multi-turn conversation rewrite; a sketch follows this list)
  • Retrieval
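The multi-turn conversation rewrite step is typically just a prompt that condenses the dialogue into one standalone search query; a small sketch with my own prompt wording (not the speaker's):

```python
def build_rewrite_prompt(history: list[tuple[str, str]], latest_question: str) -> str:
    """Condense a multi-turn chat into a single standalone retrieval query.
    The rewritten query is what gets embedded and searched, not the raw last turn."""
    dialogue = "\n".join(f"{role}: {text}" for role, text in history)
    return (
        "Rewrite the final user question as a single standalone search query, "
        "resolving pronouns and references using the conversation.\n\n"
        f"Conversation:\n{dialogue}\n\n"
        f"Final question: {latest_question}\nStandalone query:"
    )

history = [("user", "What is prefill-decode separation?"),
           ("assistant", "It splits LLM inference across two server pools.")]
print(build_rewrite_prompt(history, "How much cost does it save?"))
# The LLM's rewritten query (e.g. "cost savings of prefill-decode separation") goes to retrieval.
```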

Knowledge Enhancement (illustrative prompt sketches follow this list):

  • QA document synthesis
  • Content summarization
  • Content splitting
  • Keyword extraction
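These enhancement steps are usually LLM prompts run over each chunk at ingestion time; a small illustrative sketch (prompt wording and function names are my own assumptions):

```python
def qa_synthesis_prompt(chunk: str) -> str:
    """QA document synthesis: generate Q/A pairs so user questions can be
    matched against questions rather than raw prose."""
    return f"Generate 3 question-answer pairs that this passage answers:\n\n{chunk}"

def summary_prompt(chunk: str) -> str:
    """Content summarization: a short abstract stored alongside the chunk."""
    return f"Summarize the following passage in two sentences:\n\n{chunk}"

def keyword_prompt(chunk: str) -> str:
    """Keyword extraction: keywords support hybrid (keyword + vector) retrieval."""
    return f"List the 5 most important keywords in this passage, comma-separated:\n\n{chunk}"

chunk = "Prefill-decode separation moves KV-cache-heavy decoding onto dedicated servers."
for build in (qa_synthesis_prompt, summary_prompt, keyword_prompt):
    print(build(chunk), end="\n\n")  # each prompt would be sent to the LLM offline
```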

Editors


Danny Chan, AWS Community Builder (Hong Kong), specializing in FSI and Serverless


Kenny Chan, AWS Community Builder (Hong Kong), specializing in FSI and Machine Learning

Top comments (2)

Kenn C

amazing!

Danny Chan

happy learning