
🌇 China Community Day: Yu Wong (AWS Solutions Architect) on Generative AI

More photos from AWS Community Day China (Shenzhen)




Long-Context problems:

  • Concurrency (throughput) degrades as context length increases.
  • Prefill latency grows super-linearly with context length, since attention cost is roughly quadratic in sequence length (see the back-of-envelope sketch below).
  • Decoding latency and context-switching cost grow roughly linearly as context length grows.
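A rough back-of-envelope sketch of the prefill cost growth (my own illustration, not a figure from the talk); the model dimensions are hypothetical, chosen only to make the numbers concrete:

```python
# Back-of-envelope: attention FLOPs in prefill grow roughly quadratically with context length.
# d_model and n_layers are hypothetical values for a mid-size model.
d_model, n_layers = 4096, 32

def attention_flops(seq_len: int) -> float:
    # QK^T and (attention weights) @ V each cost about 2 * seq_len^2 * d_model FLOPs per layer.
    return n_layers * 4 * (seq_len ** 2) * d_model

for n in (2_000, 8_000, 32_000):
    print(f"context={n:>6}  attention FLOPs ~ {attention_flops(n):.2e}")
# Every 4x increase in context length means roughly 16x more attention compute during prefill.
```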

Long-Context optimization (Hardware):

  • A100 Memory Hierarchy - Leveraging the advanced memory architecture of the A100 GPU to improve performance for long-context models.

Long-Context optimization (Machine Learning Engineering):

  • FlashAttention
    An IO-aware, tiled attention kernel that avoids materializing the full attention matrix, cutting the memory traffic and compute cost of attention for long sequences (see the sketch after this list).

  • vLLM
    An open-source LLM serving engine whose PagedAttention KV-cache management reduces memory fragmentation and raises throughput for long-context inference.
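A minimal sketch of getting a FlashAttention-style fused kernel through PyTorch's built-in `scaled_dot_product_attention` dispatcher (a generic illustration, not the speaker's setup; all shapes are made up):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch=2, heads=16, seq_len=8192, head_dim=64.
q = torch.randn(2, 16, 8192, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Restrict SDPA to the flash kernel so the (seq_len x seq_len) attention matrix
# is never materialized in GPU memory, which is what makes long contexts feasible.
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)  # torch.Size([2, 16, 8192, 64])
```

On the serving side, vLLM exposes a similarly small surface: with the `vllm` package installed, `LLM(model=...).generate(prompts, SamplingParams(...))` is the usual entry point.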

Long-Context optimization (Model Architecture):

  • MoE (Mixture of Experts)
    A modular architecture that routes each token to a small number of specialized expert sub-networks, so only a fraction of the parameters are active per token and compute scales better for long inputs.

  • Speculative Decoding
    A cheap draft model proposes several future tokens, and the large target model verifies them in a single parallel pass, reducing per-token decoding latency (a toy sketch follows this list).
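A toy, self-contained sketch (my own illustration) of the accept/reject loop behind greedy speculative decoding; the "models" here are trivial stand-in functions:

```python
from typing import Callable, List

def speculative_decode(target: Callable[[List[int]], int],
                       draft: Callable[[List[int]], int],
                       prompt: List[int],
                       n_new: int,
                       k: int = 4) -> List[int]:
    """Greedy speculative decoding sketch.

    target(prefix) / draft(prefix) each return the next greedy token id.
    Real systems verify all k draft tokens in ONE batched target forward pass;
    this loop only shows the accept/fallback logic.
    """
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1. The draft model proposes k tokens cheaply.
        proposed, ctx = [], list(tokens)
        for _ in range(k):
            t = draft(ctx)
            proposed.append(t)
            ctx.append(t)
        # 2. The target model verifies the proposals (conceptually in parallel).
        for t in proposed:
            expected = target(tokens)      # target's own greedy choice
            if expected == t:
                tokens.append(t)           # draft token accepted
            else:
                tokens.append(expected)    # first mismatch: take the target's token
                break
    return tokens[:len(prompt) + n_new]

# Toy usage: the draft guesses +1, the target agrees except at every 5th position.
draft_fn  = lambda ctx: ctx[-1] + 1
target_fn = lambda ctx: ctx[-1] + (2 if len(ctx) % 5 == 0 else 1)
print(speculative_decode(target_fn, draft_fn, prompt=[0], n_new=10))
```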


Background of Prefill & Decode:

  • The cost of LLM cluster inference is governed by Throughput, Hardware Utilization and Hardware Price: cost per token ≈ Hardware Price / (Throughput × Hardware Utilization), so cutting cost means raising throughput and utilization for a given hardware spend (a worked example follows).
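Making that relationship concrete with illustrative, made-up numbers:

```python
# Illustrative only: hypothetical numbers, not figures from the talk.
hardware_price_per_hour = 32.0        # USD/hour for an 8-GPU inference node
throughput_tokens_per_s = 12_000      # tokens/second at full load
utilization = 0.45                    # fraction of time the GPUs do useful work

effective_tokens_per_hour = throughput_tokens_per_s * utilization * 3600
cost_per_million_tokens = hardware_price_per_hour / effective_tokens_per_hour * 1e6
print(f"${cost_per_million_tokens:.2f} per 1M tokens")
# Raising throughput or utilization (or lowering hardware price) lowers cost linearly.
```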

Impact of Prefill duration on throughput:

  • Prefill saturates the GPU's compute, so two Prefill tasks on the same device cannot overlap (no Prefill-Prefill parallelism).
  • Decode needs comparatively little compute (it is memory-bandwidth bound), so Decode tasks can be overlapped with Prefill tasks.

Separate Prefill & Decode to cut cost by 80%:

  • Introduce a Decode-only server.
  • Achieve Prefill-Decode separation by transmitting the intermediate inference data (the KV cache) over the network.
  • The original architecture can then focus on optimizing Prefill tasks.
  • Prefill no longer needs to retain the KV cache (the data is sent to the Decode server as soon as it is generated).
  • Prefill inference therefore no longer requires large amounts of GPU memory (a toy sketch follows this list).
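A minimal, single-process sketch of the idea (my own illustration, with toy tensors instead of a real model, and an in-process handoff instead of a real network transfer): the prefill worker builds the KV cache for the prompt and ships it to a decode-only worker, which then generates tokens without ever re-running prefill.

```python
import torch

# Toy dimensions; a real deployment streams the KV cache over the network to a
# separate decode-only server fleet instead of passing it in-process.
n_layers, n_heads, head_dim = 4, 8, 64

def prefill_worker(prompt_ids: torch.Tensor):
    """Compute-bound stage: build the KV cache for the whole prompt at once."""
    seq = prompt_ids.shape[0]
    kv_cache = [
        (torch.randn(n_heads, seq, head_dim), torch.randn(n_heads, seq, head_dim))
        for _ in range(n_layers)
    ]
    # The prefill server can discard the cache as soon as it has been shipped out.
    return kv_cache

def decode_worker(kv_cache, n_new: int):
    """Memory-bound stage: append one token's K/V per step and emit a token."""
    out = []
    for _ in range(n_new):
        kv_cache = [
            (torch.cat([k, torch.randn(n_heads, 1, head_dim)], dim=1),
             torch.cat([v, torch.randn(n_heads, 1, head_dim)], dim=1))
            for k, v in kv_cache
        ]
        out.append(torch.randint(0, 32_000, (1,)).item())  # placeholder sampling
    return out

prompt = torch.randint(0, 32_000, (512,))
cache = prefill_worker(prompt)          # would run on the prefill fleet
print(decode_worker(cache, n_new=8))    # would run on the decode-only fleet
```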


Retrieval Augmented Generation (RAG):

  • A technique that enhances language models by retrieving external knowledge at query time, so responses are better informed and more relevant (a minimal retrieval sketch follows this list).
  • RAG (includes: ETL, intention, retrieval)
  • Model lifecycle management (includes: model, dataset, entity)
  • Performance acceleration (includes: acceleration framework, quantization)
  • Infrastructure operation (includes: custom chips, managed services)
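A minimal retrieve-and-augment sketch (illustrative only: it uses deterministic random vectors in place of a real embedding model and vector store):

```python
import numpy as np

# Toy corpus; in practice these chunks come from the ETL pipeline described below.
docs = [
    "Prefill latency grows quickly with context length.",
    "Decode is memory-bandwidth bound and benefits from batching.",
    "vLLM manages the KV cache with paged memory blocks.",
]

def embed(text: str) -> np.ndarray:
    # Stand-in for a real embedding model endpoint.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

doc_vecs = np.stack([embed(d) for d in docs])

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    scores = doc_vecs @ q                      # cosine similarity (unit vectors)
    return [docs[i] for i in np.argsort(-scores)[:k]]

query = "Why is prefill slow for long prompts?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
print(prompt)  # this augmented prompt is what gets sent to the LLM
```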

RAG Workflow:

  • Data Preprocessing (ETL)
  • Knowledge extraction
  • Knowledge enhancement
  • Knowledge vectorization
  • Knowledge injection (a compressed sketch of these ETL steps follows the list)
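A compressed sketch of the split / vectorize / inject steps above, with stand-in components rather than any specific AWS service (function names and sizes are my own, for illustration):

```python
import numpy as np

raw_document = "Extracted text from the source documents. " * 40  # extraction step output

def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Content split: fixed-size character windows with overlap."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def vectorize(chunks: list[str]) -> np.ndarray:
    """Knowledge vectorization: stand-in for a real embedding model."""
    return np.stack([np.random.default_rng(len(c)).normal(size=384) for c in chunks])

vector_store: list[tuple[str, np.ndarray]] = []

def inject(chunks: list[str], vectors: np.ndarray) -> None:
    """Knowledge injection: write (chunk, vector) pairs into the vector store
    (OpenSearch, pgvector, etc. in a real pipeline)."""
    vector_store.extend(zip(chunks, vectors))

chunks = split_into_chunks(raw_document)
inject(chunks, vectorize(chunks))
print(f"{len(vector_store)} chunks injected")
```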

LLM Orchestration:

  • Intention identification
  • Query rewriting for knowledge retrieval (multi-turn conversation rewrite; a sketch follows this list)
  • Retrieval
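The multi-turn conversation rewrite step is typically just a prompt that condenses the dialogue into one standalone search query; a small sketch with my own prompt wording (not the speaker's):

```python
def build_rewrite_prompt(history: list[tuple[str, str]], latest_question: str) -> str:
    """Condense a multi-turn chat into a single standalone retrieval query.
    The rewritten query is what gets embedded and searched, not the raw last turn."""
    dialogue = "\n".join(f"{role}: {text}" for role, text in history)
    return (
        "Rewrite the final user question as a single standalone search query, "
        "resolving pronouns and references using the conversation.\n\n"
        f"Conversation:\n{dialogue}\n\n"
        f"Final question: {latest_question}\nStandalone query:"
    )

history = [("user", "What is prefill-decode separation?"),
           ("assistant", "It splits LLM inference across two server pools.")]
print(build_rewrite_prompt(history, "How much cost does it save?"))
# The LLM's rewritten query (e.g. "cost savings of prefill-decode separation") goes to retrieval.
```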

Knowledge Enhancement (illustrative prompt sketches follow this list):

  • QA document synthesis
  • Content summarization
  • Content splitting
  • Keyword extraction
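These enhancement steps are usually LLM prompts run over each chunk at ingestion time; a small illustrative sketch (prompt wording and function names are my own assumptions):

```python
def qa_synthesis_prompt(chunk: str) -> str:
    """QA document synthesis: generate Q/A pairs so user questions can be
    matched against questions rather than raw prose."""
    return f"Generate 3 question-answer pairs that this passage answers:\n\n{chunk}"

def summary_prompt(chunk: str) -> str:
    """Content summarization: a short abstract stored alongside the chunk."""
    return f"Summarize the following passage in two sentences:\n\n{chunk}"

def keyword_prompt(chunk: str) -> str:
    """Keyword extraction: keywords support hybrid (keyword + vector) retrieval."""
    return f"List the 5 most important keywords in this passage, comma-separated:\n\n{chunk}"

chunk = "Prefill-decode separation moves KV-cache-heavy decoding onto dedicated servers."
for build in (qa_synthesis_prompt, summary_prompt, keyword_prompt):
    print(build(chunk), end="\n\n")  # each prompt would be sent to the LLM offline
```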

Editors


Danny Chan, AWS Community Builder (Hong Kong), specializing in FSI and Serverless


Kenny Chan, AWS Community Builder (Hong Kong), specializing in FSI and Machine Learning

Top comments (2)

Kenn C

amazing!

Danny Chan

happy learning