DEV Community

Namee for LLMWare

Long Context Windows in LLMs are Deceptive (Lost in the Middle problem)🧐

It seems like OpenAI and Anthropic have been in a battle of context windows for the better part of a year.

In May of 2023, Anthropic breathlessly announced: "We've expanded Claude's context window from 9K to 100K tokens, corresponding to around 75,000 words!"

(Note: 75,000 words is about 300 pages.)

Not to be outdone, OpenAI released its 128K context window in November 2023.

Only to be outdone in turn by Anthropic's 200K context window, which arrived with Claude 2.1 in November 2023 and carried over to the Claude 3 models in March 2024.

Is this back-and-forth tennis match over context window sizes really necessary?

What are context windows?

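A context window is the span of text, measured in tokens, that an LLM can take into account at once when it generates a response. A token is roughly three-quarters of an English word, going by Anthropic's figure of 100K tokens for about 75,000 words.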
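To make the token arithmetic concrete, here is a minimal sketch that estimates whether a piece of text fits in a given window. The 0.75 words-per-token ratio is only the rule of thumb implied by the 100K tokens / 75,000 words figure; real tokenizers vary by model and by text. The function names and the `reserved_for_answer` parameter are made up for illustration.

```python
# Rule of thumb from the quoted figures: 100K tokens ~ 75,000 words.
WORDS_PER_TOKEN = 0.75  # assumption; actual tokenizers differ per model


def estimate_tokens(text: str) -> int:
    """Estimate the token count from the whitespace-separated word count."""
    words = len(text.split())
    return round(words / WORDS_PER_TOKEN)


def fits_in_window(text: str, window_tokens: int, reserved_for_answer: int = 500) -> bool:
    """Return True if the estimated prompt size leaves room for the reply."""
    return estimate_tokens(text) + reserved_for_answer <= window_tokens


if __name__ == "__main__":
    doc = "word " * 75_000  # stand-in for a 75,000-word document
    print(estimate_tokens(doc))
    print(fits_in_window(doc, 128_000))
    print(fits_in_window(doc, 100_000))
```

Note that fitting in the window only tells you the text can be submitted. It says nothing about whether the model will actually use every part of it.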

People tend to assume that the larger the context window, the more text they can paste in and have the model search, for example.

That assumption is misleading: many users conclude that you don't need retrieval-augmented generation (RAG) if the context window is big enough.

Lost in the Middle Problem

Studies and experiments, however, have shown that LLMs with long context windows struggle to find a specific fact or passage, especially when it sits in the middle of the input.

The most vivid illustration of this problem for me showed up in this YouTube video.

Here, the experimenter uses a context length of only 2k tokens (remember that GPT-4 Turbo has a 128K-token limit) and asks each model to find a simple sentence in the middle that reads:

"Astrofield creates a normal understanding of non-celestial phenomena."

And guess what? About 2/3 of these models fail this test! They literally can't find this sentence in only 2k tokens!
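If you want to try this kind of test yourself, it can be sketched roughly as follows. This is not the script from the video. The filler sentence, the question wording, the pass check, and the `ask_model` callable are all assumptions made for illustration, so swap in your own model call. With the 0.75 words-per-token rule of thumb, 200 filler sentences of 8 words each come to roughly 2k tokens.

```python
from typing import Callable

NEEDLE = "Astrofield creates a normal understanding of non-celestial phenomena."
# Made-up filler text; the video's actual filler is not reproduced here.
FILLER = "The observatory logged another routine night of measurements."


def build_haystack(needle: str, filler: str, n_sentences: int, depth: float = 0.5) -> str:
    """Return n_sentences filler sentences with the needle inserted at a
    relative depth (0.0 = start, 0.5 = middle, 1.0 = end)."""
    sentences = [filler] * n_sentences
    sentences.insert(int(n_sentences * depth), needle)
    return " ".join(sentences)


def build_prompt(haystack: str) -> str:
    """Wrap the haystack in a simple question-answering prompt."""
    return (
        "Read the text below and answer the question.\n\n"
        f"{haystack}\n\n"
        "Question: What does Astrofield create?"
    )


def passed(response: str) -> bool:
    """Crude pass check: the reply must contain the needle's key phrase."""
    return "non-celestial phenomena" in response.lower()


def run_test(ask_model: Callable[[str], str], n_sentences: int = 200) -> bool:
    """Build the prompt, query the model, and score the reply."""
    prompt = build_prompt(build_haystack(NEEDLE, FILLER, n_sentences))
    return passed(ask_model(prompt))


if __name__ == "__main__":
    # Stand-in "model" that simply searches the prompt; replace with a real LLM call.
    def fake_model(prompt: str) -> str:
        return NEEDLE if NEEDLE in prompt else "I could not find that."

    print(run_test(fake_model))
```

Varying `depth` lets you check whether a model does better when the sentence sits at the start or end of the text than when it sits in the middle.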

The Winners and Losers

🏆 Here is the list of the models that passed the 2k context window test: ChatGPT Turbo, Open Hermes 2.5 - Mistral 7B, Mistral 7B Instruct (passed once at 10:43 in the video and failed once at 3:47), and Yi 34B Chat.

👎 Here is the list of the models that failed the test: Mixtral 8x7B Instruct, Mistral Medium, Claude 2.0, GPT-4 Turbo, Gemini, Mistral 7B Instruct (the failed run noted above), Zephyr 7B Beta, PPLX 70B, Starling 7B - alpha, Llama 2 - 70B chat, and Vicuna 33B.

Same Experiment with RAG

Now, a small disclaimer about me: I am the founder of an open source project, LLMWare, where we publish models on Hugging Face and build a platform for LLM-based workflows.

I was inspired to recreate this experiment, so we wrote a document of about 11,000 tokens (much more than the 2k) about astrophysics, inserted the queried sentence, "Astrofield creates a normal understanding of non-celestial phenomena", somewhere in the middle, and ran RAG on it using our LLMWare platform.

We then tried this against 3 models: LLMWare BLING Tiny Llama 1.1B, LLMWare DRAGON Yi-6b, and Zephyr 7B Beta (which had failed the test in the YouTube video).
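For readers new to RAG, the core idea behind this setup can be sketched in a few lines: split the document into chunks, retrieve only the chunks most relevant to the question, and hand the model that small context instead of the whole document. To be clear, this is not LLMWare's API or our actual pipeline. The keyword-overlap scoring below is a deliberately simple stand-in for a real retriever such as embedding search, and all function names and the filler text are invented for illustration.

```python
import re
from typing import List, Set


def chunk_words(text: str, chunk_size: int = 100, overlap: int = 20) -> List[str]:
    """Split text into overlapping chunks of up to chunk_size words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]


def terms(text: str) -> Set[str]:
    """Lowercased alphanumeric terms, used for naive keyword matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_chunks(query: str, chunks: List[str], k: int = 1) -> List[str]:
    """Rank chunks by the number of query terms they share and keep the best k."""
    q = terms(query)
    ranked = sorted(chunks, key=lambda c: len(q & terms(c)), reverse=True)
    return ranked[:k]


def build_rag_prompt(query: str, document: str, k: int = 1) -> str:
    """Build a prompt containing only the retrieved chunks, not the full document."""
    context = "\n\n".join(top_chunks(query, chunk_words(document), k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."


if __name__ == "__main__":
    # Made-up stand-in for a long astrophysics document with the sentence in the middle.
    filler = "Stars fuse hydrogen into helium in their cores. "
    needle = "Astrofield creates a normal understanding of non-celestial phenomena. "
    document = filler * 500 + needle + filler * 500
    prompt = build_rag_prompt("What does Astrofield create?", document)
    print(len(document.split()), "words in document")
    print(len(prompt.split()), "words in prompt")
```

The point of the design is that the model only ever sees a short prompt with the relevant passage in it, so it no longer has to dig a single sentence out of thousands of tokens.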

Here are the results. As you can see, with RAG and a model fine-tuned for it, even a 1.1B-parameter model can find the answer.

LLMWare Bling Tiny Llama 1.1B:

Finds this content with no problem. 💯

LLMWare Dragon Yi 6B:

Also finds this content with no problem. 💯

Zephyr 7B Beta:

Not fine-tuned for RAG, so it is a little more chatty, but it still finds the content with RAG where it had failed before.

The Lesson: Large Context Windows are Ineffective at Searches

As our experiment shows, with RAG even a 1.1B-parameter model can find a fact that GPT-4 Turbo missed in the video's test. For fact-based searches, a small model coupled with the right RAG workflow does much better than relying on a large context window alone (or, in this case, a not-that-large one at just 2k tokens).

I hope this experiment underscores the importance of a good LLM-based workflow using RAG. If you want to learn about RAG, here is an article I wrote recently to help you get started.

So the next time someone tries to impress you with just a long context window, look critically at the surrounding workflow to make sure you are getting the answer you want.

Explore LLMWare on GitHub ⭐️

Please join our LLMWare community on Discord to learn more about RAG and LLMs!

Top comments (2)

Matija Sosic

I agree that larger context is not a silver bullet. In your experience, in which use cases does it really improve LLM performance?

noberst

Great question!

I think the longer context window is actually more relevant for other kinds of workflows - for example, if you wanted to summarize or rewrite a very long piece of text requiring a context window that large.