Dawid Dahl

The Death of RAG: What a 10M Token Breakthrough Means for Developers

"In our research, we’ve also successfully tested up to 10 million tokens." - Google Researchers

The other day, Google announced Gemini 1.5 Pro with a massive increase in accurate long-context understanding. While I could not immediately put my finger on exactly what the broader implications might be, I had a hunch that if this is actually true, it's going to change the game.

It was not until I spoke to a colleague at work that I finally realized the true impact of Google's announcement.

He said:

"Latency is going to be a thing though... 10M tokens is quite a few MBs."

It struck me that I’d actually be thrilled to have the option to wait longer, if it meant a higher quality AI conversation.

For example, let’s say it took 5 minutes, or 1 hour, or hell, even one whole day, to have my entire codebase put into the chat’s context window. If after that time the AI had near-perfect access to that context throughout the rest of the conversation, like Google claims, I’d happily, patiently and gratefully wait that amount of time.

What is RAG (Retrieval-Augmented Generation)

Both my colleague and I had worked on the ARC AI Portal at our workplace, an internal platform where we give everybody free access to GPT-4 and use something called RAG (retrieval-augmented generation) for various purposes. The purpose of RAG is to give an AI access to information it does not natively possess, the kind of knowledge that is missing when you start a fresh ChatGPT session.

For example, one use case for RAG was when another colleague of ours, the author Rebecka Carlsson, asked us to let people chat directly with her latest book The Speaker's Journey using our company's AI portal.

So the AI portal team developed the full RAG pipeline: took the book → chunked it into small pieces (not literally) → used OpenAI's embedding model to vectorize the chunks → inserted them into a vector database → and finally gave the AI access to the database within the chat via something called semantic search.
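To make those steps concrete, here is a minimal sketch of the same flow: chunk, embed, store, search. It is not our actual portal code. The `embed` function is a toy word-hashing stand-in for a real embedding model, a plain array plays the role of the vector database, and every name in it (`chunkText`, `buildIndex`, `semanticSearch`) is made up for illustration.

```typescript
// Toy stand-in for an embedding model (an assumption for illustration):
// each word is hashed into one of 64 buckets and counted. A real pipeline
// would call an embedding API here instead.
const DIMENSIONS = 64;

function embed(text: string): number[] {
  const vector: number[] = new Array<number>(DIMENSIONS).fill(0);
  const words = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  for (const word of words) {
    let hash = 0;
    for (let i = 0; i < word.length; i++) {
      hash = (hash * 31 + word.charCodeAt(i)) % DIMENSIONS;
    }
    vector[hash] = (vector[hash] ?? 0) + 1;
  }
  return vector;
}

// Step 1: split the text into chunks of a fixed number of words.
function chunkText(text: string, wordsPerChunk: number): string[] {
  const words = text.split(/\s+/).filter((word) => word.length > 0);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += wordsPerChunk) {
    chunks.push(words.slice(i, i + wordsPerChunk).join(" "));
  }
  return chunks;
}

interface IndexedChunk {
  text: string;
  vector: number[];
}

// Steps 2 and 3: vectorize every chunk and keep it in an in-memory "database".
function buildIndex(text: string, wordsPerChunk: number): IndexedChunk[] {
  return chunkText(text, wordsPerChunk).map((chunk) => ({
    text: chunk,
    vector: embed(chunk),
  }));
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Step 4: semantic search, returning the chunks closest to the query.
function semanticSearch(index: IndexedChunk[], query: string, topK: number): string[] {
  const queryVector = embed(query);
  return index
    .map((entry) => ({ text: entry.text, score: cosineSimilarity(queryVector, entry.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((entry) => entry.text);
}
```

The chunks returned by `semanticSearch` are what would get pasted into the prompt, so the model only ever sees a handful of fragments rather than the whole book.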

Mostly, it worked great. If people asked about some specific detail from her book, the RAG solution was able to retrieve the information more often than not.

But here is the deal: RAG is a hack. We are essentially brute-forcing the information onto the AI, which means it often doesn't work as well as one would hope. It also can't do summaries well, and it requires developer time to set up, which makes it slow and costly.

So my point to my colleague was this:

As it is now, people obviously wait weeks, even months, and pay loads of money to people like us to implement RAG, a solution which is riddled with problems even when done by an expert.

Surely then, it is fine for Google (and eventually OpenAI, when they release an equivalent solution) to add a little bit of latency for this new feature.

The Big Problem In AI-Driven Development Today

In my current AI-driven development (AIDD) workflow, I always find myself copy-pasting the relevant parts of my codebase into the AI chat window at the beginning of the conversation. This is because, like most things in life, the specific functions I collaborate on with the AI never exist in isolation. They are always embedded in some larger network of system dependencies.

Important to point out: At work we use either our internal AI portal or ChatGPT Teams, where OpenAI never train their models on our proprietary data or conversations.

So even as I painstakingly take my time to try and copy-paste all the relevant context, a production codebase is such a huge ecosystem of tens or even hundreds of thousands of lines of code that I could never realistically give it all. And even if I do give a lot, as the conversation goes on, the AI will eventually begin to forget.

GitHub Copilot tried to solve this with a native RAG solution built straight into the code editor. While it works sometimes, it's so sketchy I can never rely on it, meaning their RAG implementation is almost useless.

This cumbersome dance of feeding the AI piece by piece of our codebase, and it constantly forgetting and needing to start over, is a fragmented, inefficient process that disrupts the flow of the AI collaboration and often leads to results that are hit or miss.

That is, until now.

Claims on X about Google Gemini 1.5 Pro

The Exciting Post-RAG Era

Roughly, 1 million tokens amounts to around 50,000 lines of code, so 10 million tokens would equate to about 500,000 lines of code. That means that if Google's claims are correct, almost all our codebases would fit into an AI's view all at once.
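The arithmetic behind that estimate is easy to play with. The sketch below takes the ratio implied by the figures above (1 million tokens for 50,000 lines, so about 20 tokens per line) and treats it as a rough average. Real code will vary by language and style, and the function names are just illustrative.

```typescript
// Ratio implied by the estimate above: 1,000,000 tokens for 50,000 lines.
// Treat this as a rough assumption, not a measured constant.
const TOKENS_PER_LINE = 1_000_000 / 50_000; // 20 tokens per line

// How many lines of code fit into a context window of the given size?
function linesThatFit(contextTokens: number, tokensPerLine: number = TOKENS_PER_LINE): number {
  return Math.floor(contextTokens / tokensPerLine);
}

// Does a codebase of the given size fit into the context window in one go?
function fitsInContext(
  linesOfCode: number,
  contextTokens: number,
  tokensPerLine: number = TOKENS_PER_LINE
): boolean {
  return linesOfCode * tokensPerLine <= contextTokens;
}

console.log(linesThatFit(10_000_000)); // 500000
console.log(fitsInContext(300_000, 1_000_000)); // false
console.log(fitsInContext(300_000, 10_000_000)); // true
```

So under this assumption, a 300,000-line codebase is far out of reach for a 1 million token window but fits comfortably into 10 million.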

This would be nothing short of revolutionary.

It's akin to moving from examining individual cells under a microscope to viewing an entire organism at once. Where once we pieced together snippets of code to get a partial understanding, a 10 million token context allows us to perceive the full "organism" of our codebase in all its glorious complexity and interconnectivity.

This shift then would offer a complete and holistic view, enhancing our ability to collaborate with the AI to add new features, refactor, test and optimize our software systems efficiently.

So Is RAG Dead?

Even after the conversation with my colleague, the thoughts of the deeper implications kept on coming. When we get up to 10M context length with better retrieval than RAG, what is even the point of RAG? Does it have any value at all?

By RAG I mean specifically: creating embeddings, feeding them into a vector database and then doing semantic search over those embeddings before feeding the results back to the AI.

Just take one unique selling point of RAG today: metadata. That is, the ability to attach extra pieces of information (such as sources, line-of-code numbers, file types, compliance labels, etc.) to the data that the AI interacts with. With such metadata, we can enable the retrieval step to access detailed information for greater specificity and context-awareness in the AI's responses.

But really, why go through the vector database hassle, when you could just have a quick higher-order function that transforms your entire codebase into a JSON data structure with whatever metadata you'd like?

Something such as:

type FormatterFunction<T = unknown, U = unknown, V = unknown> = (
  inputData: T,
  config: U
) => V;

type ProcessData<T = unknown, U = unknown, V = unknown> = (
  formatter: FormatterFunction<T, U, V>,
  inputData: T,
  config: U
) => V;

Could result in some data structure like this:

  [
    { "code": "print(the)", "loc": 73672 },
    { "code": "print(post)", "loc": 73673 },
    { "code": "print(RAG)", "loc": 73674 },
    { "code": "print(era)", "loc": 73675 }
  ]


And then you give this as JSON to the AI, instead of giving it the regular codebase. Sure, it’s more characters, which would increase the overall token count. But when you’re dealing with the hypothetical insanity of millions of tokens, this is starting to feel like a possibility.
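For what it's worth, here is one way the idea could be fleshed out. It is only a sketch: `toLineRecords`, `LineRecord`, `LineConfig` and `startLine` are names I am making up for illustration, and a real formatter would probably attach richer metadata such as file paths or compliance labels.

```typescript
// Same shape as the FormatterFunction type shown earlier.
type FormatterFunction<T = unknown, U = unknown, V = unknown> = (
  inputData: T,
  config: U
) => V;

// Hypothetical record and config shapes, invented for this example.
interface LineRecord {
  code: string;
  loc: number;
}

interface LineConfig {
  startLine: number;
}

// A concrete formatter: one record per source line, tagged with its line number.
const toLineRecords: FormatterFunction<string, LineConfig, LineRecord[]> = (
  source,
  config
) =>
  source.split("\n").map((code, index) => ({
    code,
    loc: config.startLine + index,
  }));

// The higher-order function: it applies whichever formatter it is handed.
function processData<T, U, V>(
  formatter: FormatterFunction<T, U, V>,
  inputData: T,
  config: U
): V {
  return formatter(inputData, config);
}

const records = processData(toLineRecords, "print(the)\nprint(post)", {
  startLine: 73672,
});

// This JSON string is what would be handed to the AI instead of the raw source.
const json = JSON.stringify(records, null, 2);
console.log(json);
```

Because the formatter is passed in as an argument, swapping in different metadata means writing a different small function, with no database involved.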

Playing the Devil's RAG-vocate

Before we declare RAG dead, let's invite a Devil's advocate and think about some of the other reasons why we might want to keep RAG around.

😈 Fake?

"Yeah I saw the original Gemini video, which turned out to be fake. So why would I believe this?"

I was also very skeptical, until I saw this video from someone not working for Google.

Also, there were these demos from a tester not affiliated with Google on X as well.

I was extremely surprised by these promising results.

😈 Staying Updated:

"RAG keeps AI clued into the latest info, something a static context can't always do."

Well, what would prevent us from just giving the freshest data at the beginning of every AI conversation? Or even updating it periodically during the same conversation?

😈 Reducing Hallucinations:

"Since RAG runs on our own server, we have the power to tell the AI to simply say 'I don't know' if relevant context was not able to be retrieved."

This is true, and the simple fact that we as developers have a programmatic step of total control between the retrieval and the response stages just intuitively feels good. So this is a good point.

But then again, there is nothing stopping us from implementing some solution where we first do the retrieval query, and then perform some arbitrary action before feeding the result back to the model. You wouldn't need to do the whole manual chunking/manual embedding/vector database/manual semantic-search for that.
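As a rough sketch of what I mean: ask the long-context model to do the retrieval itself, run a programmatic check on what comes back, and only then ask for the answer. The `askModel` function is a hypothetical placeholder for whatever client call you use, and the `NO_RELEVANT_CONTEXT` sentinel is a convention I am inventing here, not part of any real API.

```typescript
// Hypothetical signature for "send a prompt to a long-context model".
type AskModel = (prompt: string) => Promise<string>;

// Invented sentinel the model is asked to return when nothing relevant exists.
const NO_MATCH = "NO_RELEVANT_CONTEXT";
const FALLBACK_ANSWER = "I don't know.";

// The programmatic checkpoint: returns the cleaned passages, or null when
// the retrieval step came back empty-handed.
function gateRetrieval(retrieved: string): string | null {
  const passages = retrieved.trim();
  return passages === "" || passages.includes(NO_MATCH) ? null : passages;
}

async function answerWithGate(
  askModel: AskModel,
  fullContext: string,
  question: string
): Promise<string> {
  // Step 1: the retrieval query, run against the full context.
  const retrieved = await askModel(
    `${fullContext}\n\nQuote every passage relevant to the question below, ` +
      `or reply ${NO_MATCH} if there is none.\nQuestion: ${question}`
  );

  // Step 2: our own code decides what happens next.
  const passages = gateRetrieval(retrieved);
  if (passages === null) {
    return FALLBACK_ANSWER;
  }

  // Step 3: answer using only what passed the gate.
  return askModel(
    `Answer using only these passages:\n${passages}\n\nQuestion: ${question}`
  );
}
```

The gate here is deliberately trivial. In practice it could log, filter, redact or call other services, which is the same control point RAG gives us on the server.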

😈 Handling the Tough Questions:

"For those tricky queries that need more than just a quick look-up, RAG can dig deeper."

If we have the full and complete data, and if the AI can have instant access to all of it, as Google appeared to demonstrate in their demos, why would we need to dig deeper with RAG at all?

😈 Efficiency:

"When it comes to managing big data without bogging down the system, RAG can be pretty handy."

If this large context window is offered as a service, then that means the system is actually designed to be bogged down with data.

😈 Keeping Content Fresh:

"RAG helps AI stay on its toes, pulling in new data on the fly."

Google declares: "Gemini 1.5 Pro can seamlessly analyze, classify and summarize large amounts of content within a given prompt." This means it can pull in data from the entire context window on the fly.

😈 Computational and Memory Constraints:

"Processing 100 million tokens in a single pass would require significant computational resources and memory, which might not be practical or efficient for all applications. Not to mention costly."

This is a good point. As more compute is needed, costs will be higher compared to RAG.

There is also the global environmental impact to consider: running data centers is one of the major energy drains today. Efficient use of computational resources with RAG could potentially contribute to more sustainable AI practices.
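To get a feel for the size of that gap, here is a back-of-the-envelope sketch. It assumes cost scales linearly with the number of input tokens sent per query, and it ignores caching, embedding costs and output tokens. The chunk count and chunk size are hypothetical numbers, not measurements.

```typescript
// Hypothetical description of what a RAG query sends to the model.
interface RagQuery {
  topK: number; // number of retrieved chunks added to the prompt
  tokensPerChunk: number; // average size of each chunk, in tokens
}

// Tokens of retrieved context that a RAG query puts into the prompt.
function ragPromptTokens(query: RagQuery): number {
  return query.topK * query.tokensPerChunk;
}

// How many times more input tokens the full-context approach sends per query.
function inputTokenRatio(contextTokens: number, query: RagQuery): number {
  return contextTokens / ragPromptTokens(query);
}

// Example with made-up numbers: 5 chunks of 500 tokens versus a 10M token context.
console.log(inputTokenRatio(10_000_000, { topK: 5, tokensPerChunk: 500 })); // 4000
```

Under these assumptions, every full-context query pushes thousands of times more tokens through the model than the RAG equivalent, which is where both the cost and the energy argument come from.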

😈 Extending with API requests:

"Sometimes, the AI would need to augment its data with external API requests to get the full picture. When we do RAG, it happens on a server, so we can call out to external services before returning the relevant context back to the model."

AI already has access to web browsing, and there is nothing in principle that prevents an AI from using it while constructing its responses. If you would like more control over external services and network requests, you should use AI Function Calling instead.

😈 Speed:

"I saw Google's demos. It took a long time to get a response; vector databases are much much faster."

This is also true. But personally, I'd rather wait a long time for an accurate response, than wait a short time for a response I can't trust.

Also, honestly, who would be surprised if the latency starts to decrease within a couple of months as new models are released?

In Summary

Google's new breakthrough announcement could flip the script for developers by allowing AI to digest our entire codebases at once, thanks to its potential 10 million token capacity. This leap forward should make us rethink the need for RAG, as direct, comprehensive code understanding by AI becomes a reality.

The prospect of waiting a bit longer for in-depth AI collaboration seems a small price to pay for the massive gains in accuracy and sheer brain power. As we edge into this new era, it's not just about coding faster; it's about coding smarter, with AI as a true partner in our creative and problem-solving endeavors.

Dawid Dahl is a full-stack developer at UMAINARC. In his free time, he enjoys metaphysical ontology and epistemology, analog synthesizers, consciousness, techno, Huayan and Madhyamika Prasangika philosophy, and being with friends and family.
