JohnDotOwl

StreamingLLM - 4 Million Tokens, 22x Faster

Recent advances in natural language processing have led to the development of large language models (LLMs) like GPT-3 that can generate remarkably human-like text. However, a major limitation of LLMs is that their performance degrades rapidly as the context grows. Attempting to feed an LLM an unbounded stream of tokens makes it slower and slower until it eventually runs out of memory.

A new technique called StreamingLLM aims to solve this problem by allowing effectively unlimited context to be streamed to an LLM without sacrificing speed or exhausting memory.

How LLMs Currently Work

LLMs like GPT-3 are based on the transformer architecture. In self-attention, each new token is compared against every token that came before it, so the compute cost grows quadratically with sequence length, and the key/value (KV) cache that stores those earlier tokens grows linearly. Feeding ever more context therefore drives up both per-token latency and memory usage.
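
To make the scaling concrete, here is a rough back-of-the-envelope sketch in Python. The layer, head, and dimension counts are illustrative placeholders (roughly 7B-scale), not numbers from the StreamingLLM paper.

```python
# Illustrative only: per-step attention work and KV-cache memory both grow with
# the number of cached tokens, so decoding T tokens costs O(T^2) attention in total.

def decode_step_cost(cached_tokens, n_layers=32, n_heads=32, head_dim=128, bytes_per_value=2):
    # KV cache: a K and a V entry per layer, head, and cached token.
    kv_cache_bytes = 2 * n_layers * cached_tokens * n_heads * head_dim * bytes_per_value
    # Attention for ONE new token: QK^T plus AV against every cached token.
    attn_flops = 4 * n_layers * n_heads * cached_tokens * head_dim
    return kv_cache_bytes, attn_flops

for t in (1_000, 10_000, 100_000, 1_000_000):
    mem, flops = decode_step_cost(t)
    print(f"{t:>9} cached tokens: KV cache ~ {mem / 1e9:.1f} GB, "
          f"~ {flops / 1e9:.1f} GFLOPs of attention per new token")
```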

Current approaches to handling long sequences include window attention, where only the key/value entries for the most recent tokens are kept, or recomputing attention over a sliding window of recent tokens. Both methods discard older tokens, so the model loses the broader context.
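
As a toy sketch of plain window attention (the function name and tensor shapes are my own, for illustration), the cache is simply truncated to the most recent tokens, which is exactly why older context, including the very first tokens, disappears:

```python
import torch

def window_evict(keys, values, window=1024):
    # Plain window attention: keep only the most recent `window` key/value entries.
    # Everything older, including the very first tokens, is thrown away.
    return keys[-window:], values[-window:]

# Example: a 5,000-token cache for one layer shrinks to the last 1,024 entries.
k = torch.randn(5000, 32, 128)  # (seq_len, heads, head_dim)
v = torch.randn(5000, 32, 128)
k, v = window_evict(k, v)
print(k.shape)  # torch.Size([1024, 32, 128])
```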


Key Insights Behind StreamingLLM

The creators of StreamingLLM made an interesting observation about how attention is distributed in LLMs: a surprisingly large share of the attention mass lands on the first few tokens, regardless of how semantically important they are, with far less attention paid to later tokens. They call these initial tokens "attention sinks".

So even with a long context sequence, the first few tokens and the most recent tokens absorb the majority of the attention, while the tokens in the middle receive comparatively little.
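
This observation is easy to reproduce on a small model. The sketch below (assuming the Hugging Face Transformers library, with GPT-2 as a stand-in for the larger models studied in the paper) averages attention weights over layers and heads and sums how much attention each position receives:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog. " * 30,
                   return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, seq, seq) tensor per layer.
attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]  # average over layers and heads -> (seq, seq)
received = attn.sum(dim=0)                              # total attention mass each position receives

print("first 4 positions:", received[:4])
print("average of the rest:", received[4:].mean())
```

The first position typically receives far more attention mass than any later one, which is the attention-sink effect the paper describes.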

How StreamingLLM Works

StreamingLLM takes advantage of this attention phenomenon. Instead of keeping every token in the cache, it retains only:

The first few tokens, which act as attention sinks
A rolling cache of the most recent tokens

As new tokens arrive, the oldest tokens between the sinks and the recent window are evicted, with little impact on output quality.
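
Here is a minimal sketch of that cache policy (my own simplified version, not the authors' code; see the linked repository for the real implementation): keep the first few attention-sink entries plus a rolling window of the most recent entries, and evict everything in between.

```python
import torch

def streaming_evict(keys, values, n_sink=4, window=1020):
    """Keep the first `n_sink` attention-sink entries plus the most recent
    `window` entries of a per-layer KV cache; drop the middle.
    keys/values: (seq_len, heads, head_dim)."""
    seq_len = keys.shape[0]
    if seq_len <= n_sink + window:
        return keys, values
    keep = torch.cat([torch.arange(n_sink),
                      torch.arange(seq_len - window, seq_len)])
    return keys[keep], values[keep]

# Example: a 5,000-token cache is reduced to 4 sink entries plus the last 1,020 tokens.
k = torch.randn(5000, 32, 128)
v = torch.randn(5000, 32, 128)
k, v = streaming_evict(k, v)
print(k.shape)  # torch.Size([1024, 32, 128])
```

One detail the sketch omits: the paper assigns positional information based on positions within the cache rather than positions in the original text, which is part of what keeps the model stable after eviction.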

This approach allows virtually unlimited context to be provided to the LLM while maintaining efficiency and avoiding memory issues.

Does This Solve LLMs' Context Limitations?

For certain use cases like long-form content generation, the StreamingLLM approach works very well. The context limit is essentially removed.

However, for cases like summarizing academic papers, detailed context will still be lost. So there are still limitations to how much context LLMs can effectively utilize.

Nonetheless, StreamingLLM is an exciting first step towards enabling LLMs to work with far longer streams of text than previously possible. The potential to enhance LLMs' handling of long conversations and documents is immense.

The Future Possibilities

While StreamingLLM has its limitations, it opens up new avenues for feeding more data to LLMs. With further research, more advanced techniques could be developed to allow LLMs to thoroughly comprehend vastly more information.

The future of LLMs is bright. With innovations like Streaming LLM, models stand to become capable of far more complex and nuanced language understanding and generation.

Github - https://github.com/mit-han-lab/streaming-llm
Research Paper - https://arxiv.org/abs/2309.17453
