Emanuel Ferreira

Understanding Retrieval-Augmented Generation (RAG) Behind llama-index for Beginners

Large Language Models (LLMs) like GPT and Llama are trained only on public data, which makes it hard for enterprises to use their own private data to build Artificial Intelligence (AI) products.

A cheap and fast way around this limitation is context augmentation, where you supply the LLM with your private data at query time.

In this article, we will look at what happens behind the scenes of LlamaIndex, a data framework for LLM applications.

Context Augmentation

Context augmentation means inserting your own data into the context window (the range of tokens the model considers when generating a response to a prompt) so the LLM can use it when synthesizing an answer to your query.

Let's suppose you want to ask the LLM a question about how to be rich, answered from your own private data.

Using LlamaIndex, under the hood, it would look like this:

context_str = 'a text about how to be rich'  # your private data
query_str = 'how to be rich creating business?'  # your question
DEFAULT_TEXT_QA_PROMPT_TMPL = (
    "Context information is below. \n"
    "---------------------\n"
    "{context_str}"
    "\n---------------------\n"
    "Given the context information and no prior knowledge, "
    "answer the question: {query_str}\n"
)
# output: do x, y

Together with your question (query_str), LlamaIndex sends your extra context to the LLM via the context_str variable, producing a response that is grounded in the data you provided.
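
To make that concrete, here is a minimal sketch of how the template above gets filled in and sent to the model. This is illustrative rather than the exact LlamaIndex internals, and llm is a placeholder for whatever LLM client you use.

# Fill the QA template with your context and question (illustrative sketch,
# not the exact LlamaIndex internals).
prompt = DEFAULT_TEXT_QA_PROMPT_TMPL.format(
    context_str=context_str,
    query_str=query_str,
)

# Send the full prompt to the model; `llm` is a placeholder for your LLM client.
# response = llm.complete(prompt)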

Dealing with large contexts

Diagram: original context → break into multiple chunks → call the LLM with the first chunk → while chunks remain, refine the answer with the next chunk → return the final answer.

Sometimes the context doesn't fit in a single LLM call due to token limits. To solve this, LlamaIndex refines the answer across multiple LLM calls: it breaks the original context into chunks and, on each iteration, feeds the next chunk into the refine template along with the answer so far.

context_str = 'second part about how to be rich'  # the next context chunk
query_str = 'how to be rich creating business?'
existing_answer = 'do x, y'  # answer from the previous iteration
# these prompt classes come from LangChain, which older LlamaIndex versions rely on
from langchain.prompts.chat import AIMessagePromptTemplate, HumanMessagePromptTemplate

CHAT_REFINE_PROMPT_TMPL_MSGS = [
    HumanMessagePromptTemplate.from_template("{query_str}"),
    AIMessagePromptTemplate.from_template("{existing_answer}"),
    HumanMessagePromptTemplate.from_template(
        "We have the opportunity to refine the above answer "
        "(only if needed) with some more context below.\n"
        "------------\n"
        "{context_msg}\n"
        "------------\n"
        "Given the new context, refine the original answer to better "
        "answer the question. "
        "If the context isn't useful, output the original answer again.",
    ),
]
# output: do x, y, z

After each iteration, the answer is refined with the next chunk of context, improving until no chunks remain. This way, the LLM gets to read all of the context before the final answer is returned.
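
Putting the two templates together, the whole process can be sketched as a simple loop. This is an illustration under a few assumptions, not the actual LlamaIndex implementation: chunks is the original context already split to fit the token limit, ask_llm is a hypothetical helper that sends a prompt to the model and returns its text, and the chat-style refine messages are flattened into a single prompt string for simplicity.

def refine_answer(query_str, chunks, ask_llm):
    # First chunk: answer the question with the QA template defined earlier.
    answer = ask_llm(
        DEFAULT_TEXT_QA_PROMPT_TMPL.format(context_str=chunks[0], query_str=query_str)
    )

    # Remaining chunks: refine the existing answer with each new piece of context.
    for context_msg in chunks[1:]:
        answer = ask_llm(
            f"{query_str}\n{answer}\n"
            "We have the opportunity to refine the above answer "
            "(only if needed) with some more context below.\n"
            "------------\n"
            f"{context_msg}\n"
            "------------\n"
            "Given the new context, refine the original answer to better "
            "answer the question. "
            "If the context isn't useful, output the original answer again."
        )

    return answer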

Conclusion

That's the core of context augmentation for beginners. Behind the scenes, there are many techniques to make it more performant and precise, such as top_k similarity retrieval, vector stores, and prompt optimization.
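
For reference, here is what a minimal end-to-end setup looks like with LlamaIndex itself. The import path below matches older (pre-0.10) releases, so check your installed version; the "data" folder is just a hypothetical directory holding your private documents.

from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()         # load your private data
index = VectorStoreIndex.from_documents(documents)            # chunk and embed it into a vector store
query_engine = index.as_query_engine(similarity_top_k=2)      # retrieve the top_k most similar chunks
response = query_engine.query("how to be rich creating business?")
print(response)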

What's next

A great place to start is a repository I built, where you can crawl your favorite authors' essays and use them to augment an LLM:

https://github.com/EmanuelCampos/monorepo-llama-index

Wanna discuss AI, Technology, or startups? DM me on Twitter or LinkedIn

Thanks to the reviewers:

Logan Markewich - Software Engineer at LlamaIndex

References:

Context Window - Chelsy Ma
LlamaIndex
