In today’s rapidly advancing AI world, one of the limiting factors of modern Large Language Models (LLMs) is the context size. But it would also be interesting to know how well the LLMs can use the context they have – their context recall, or the reliability with which the LLM can access information in its context.
To set the stage, the context is the data fed to the LLM for it to produce output, basically representing the LLM’s “working memory.” While there are techniques to work around the current size limitation – most notably Retrieval-Augmented Generation (RAG) – ultimately, all the relevant information about the task at hand must fit into the context.
Context sizes are improving, with the recent update of the GPT-4 model (gpt-4-1106-preview) bumping the context size to 128 thousand tokens and Claude 2 upgrading its context to 200 thousand tokens.
I’m working on GPT Pilot, an AI dev tool that uses LLMs heavily, so I was interested in context recall – an issue that becomes more apparent at larger context sizes. In other words, how well can the LLM find the information it needs within its context? Less than ideal, as it turns out.
I was interested in exactly how well this context recall works for different LLMs, specifically for GPT-3.5, GPT-4 and Claude. I constructed a context of the desired size with a piece of data buried inside it, asked the LLM to find it, and measured how often it succeeds.
This research follows the “haystack test” Greg Kamradt published when the updated GPT-4 came out (twitter, code). That test provided useful insight into (the lack of) context recall performance, but it was performed on a very small sample (limiting its statistical significance) and was initially limited to GPT-4 (he has since published an updated version that also covers Claude 2.1). Moreover, the test data consists of essays that were likely already used in pretraining the LLMs, and the results were evaluated by GPT-4 itself, potentially introducing confounding variables into the mix.
To dive deeper, I wanted to measure pure context recall on random data never before seen by an LLM and measure it directly (as the probability of success). I also wanted to run the test in more iterations to achieve more statistically significant results. The results were surprising!
In the test, I constructed an artificial data set – a randomly generated CSV file with two columns, “key” and “value,” and as many rows as would fit into the context (minus some padding for the prompt, query, and response, so that the total number of tokens stays under the limit).
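A minimal sketch of how such a dataset can be generated. The 12-character row format, the fixed padding, and the ~4-characters-per-token estimate are my own assumptions for illustration – a real run would count tokens with the model’s actual tokenizer (e.g. tiktoken for GPT models):

```python
import random
import string

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token. Replace with a real
    # tokenizer (e.g. tiktoken) to hit the budget precisely.
    return len(text) // 4

def build_dataset(token_budget: int, padding: int = 500) -> list[tuple[str, str]]:
    """Generate random key/value rows until the CSV fills the token budget."""
    rng = random.Random(42)
    alphabet = string.ascii_lowercase + string.digits
    rows = []
    csv_text = "key,value\n"
    while True:
        key = "".join(rng.choices(alphabet, k=12))
        value = "".join(rng.choices(alphabet, k=12))
        line = f"{key},{value}\n"
        # Stop before the CSV (plus padding for prompt/query/response)
        # would exceed the target context size.
        if estimate_tokens(csv_text + line) > token_budget - padding:
            break
        csv_text += line
        rows.append((key, value))
    return rows

rows = build_dataset(8_000)
```

Because the keys and values are uniformly random strings, the data is essentially incompressible – there is no pattern the model could exploit instead of genuinely recalling the row.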
This was done for context sizes of 8, 16, 32, 64, 96, 128, and 192 thousand tokens. The set was split into 5 equal parts (quintiles), each 20% of the total CSV length:
- Quintile 0: Near the start of the context
- Quintile 1: In the first half of the context
- Quintile 2: Around the middle of the context
- Quintile 3: In the second half of the context
- Quintile 4: Near the end of the context
I randomly chose a key from the target quintile and asked the LLM to find the corresponding value (searching the entire set). I repeated this 30 times and then calculated the resulting score for that context size and quintile as the percentage of correct responses (i.e., the correct value was found).
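The measurement loop can be sketched as follows. Here `ask_llm` is a hypothetical stand-in for the actual API call (which puts the CSV in the system message and the query in a user message, as described later):

```python
import random

def quintile_slice(rows, quintile):
    """Return the rows belonging to one of the five equal parts (0-4)."""
    n = len(rows)
    return rows[n * quintile // 5 : n * (quintile + 1) // 5]

def measure_recall(rows, quintile, ask_llm, iterations=30):
    """Score = fraction of runs where the LLM returns the correct value.

    `ask_llm(csv_text, key)` is a placeholder for the real API call.
    """
    csv_text = "key,value\n" + "".join(f"{k},{v}\n" for k, v in rows)
    rng = random.Random(0)
    correct = 0
    for _ in range(iterations):
        key, expected = rng.choice(quintile_slice(rows, quintile))
        if ask_llm(csv_text, key).strip() == expected:
            correct += 1
    return correct / iterations

# Sanity check with a perfect "oracle" instead of a real LLM:
rows = [(f"k{i}", f"v{i}") for i in range(100)]
lookup = dict(rows)
print(measure_recall(rows, 2, lambda csv_text, key: lookup[key]))  # 1.0
```

Running this for every (context size, quintile) pair yields a grid of scores like the ones charted below.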
As of writing this article, GPT (and especially GPT-4) is the undisputed champion of LLMs in terms of reasoning power. Let’s see how well it performs in terms of recall.
(Note for readers in a hurry: I put a handy chart comparing all the results at the bottom of this article).
GPT-3.5 performed poorly on the tests. While I didn’t test the 4k context it was originally built with, it didn’t do very well at 8k, and using all 16k produced outright atrocious results.
GPT-4 was flawless on 8k context and performed really well with 16k context. It was somewhat worse with 32k and 64k (roughly on par with GPT-3.5 on 8k), and rather poor on 96k and 128k contexts.
Results for Claude were surprising. While it’s understandable that Claude takes at least a somewhat different approach to solving the context problem, the graphs tell a rather different story than those of the GPT series.
Claude 2 performed flawlessly on 8k, really well on 16k, 32k, 64k, and 96k contexts (on par with GPT-4 16k), and not too shabby on 192k! It was much slower than GPTs on large contexts, though. Unfortunately, I didn’t time the requests, but on large contexts Claude seems several times slower than GPT-4 – as if it was doing RAG or something else behind the scenes.
As expected, Claude Instant did somewhat worse than both Claude 2 and GPT-4, but it was markedly better than GPT-3.5.
How should we interpret these results? For example, what exactly does 73% recall performance mean for us when using these models in the real world?
It’s important to remember these tests measure the absolute ability of an LLM to (ideally) perfectly remember every little detail from a big data set. While it’s useful to be able to evaluate that performance, in many use cases it’s not as big of an issue, for a few reasons:
- Real-world data is usually duplicated in one way or another (in other words, compressible), meaning it’s probably easier for an LLM to remember real-world data than purely random strings with very high entropy.
- In the real world, if we want to look up the data as-is, we’d use a database, not an LLM. The context is a guide to the LLM on what to do and how to do it, not a trivia quiz.
In other words, these results show the hard performance limits on the context recall and are a useful guide when thinking about context sizes we want to employ in our use cases. But the real-world situation is both messier and more forgiving.
Anecdotally, in some of the use cases I looked at, the models gave okay results at sizes where they scored 75% or more on this test.
I also didn’t shoot for hard, statistically sound measurements (3-sigma confidence) because that would be measuring at higher precision (and at a much higher cost) than what’s really useful.
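To give a sense of what 30 iterations buys in terms of precision, here is a small sketch computing the 95% Wilson score confidence interval for a binomial proportion (the 22/30 figure is an illustrative example, not a measurement from the tests):

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

lo, hi = wilson_interval(22, 30)  # 22/30 ~ 73% correct -> roughly (0.56, 0.86)
```

In other words, with 30 trials a score of ~73% is really only pinned down to within about ±15 percentage points, which is good enough to compare models at a glance but far from 3-sigma territory.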
As noted, I intentionally used synthetic random data to measure the recall. Using real-world data would probably give somewhat different results and would be an interesting follow-up study.
I also kept the conversation chain short: all the data was in one (system) message, and the user query was in the second message. It would be interesting to see if having the context split across multiple smaller messages impacts the performance in any way.
Finally, I also haven’t tested any open source LLMs. Most are limited to 4k context, so it wouldn’t be a fair comparison. However, it would be interesting to see a comparison of the leading open source LLMs regarding context recall performance.
With the above caveats out of the way, who won the context contest?
Based on the context limit alone, Claude 2 is the winner, followed by GPT-4. When using small context sizes (relative to what the models suggest), both models perform really well.
Unless you’re dealing with small data that can comfortably fit inside a 4k context size, my recommendation is that you avoid GPT-3.5 and Claude Instant 1.2.
That’s it for this post - I hope you find this insightful. If you have a different experience with any of these LLMs, let me know what you found out.
Also, it would mean A LOT if you check out GPT Pilot. We’re working on a dev tool that tries to offload 90+% of coding tasks from the developer to the LLM. It’s completely open source, so if you star the GitHub repo, it would mean a lot to us. Thank you 🙏