Retrieval Augmented Generation (RAG) is a Generative AI (GenAI) architecture technique that augments a Large Language Model (LLM) with fresh, trusted data retrieved from authoritative internal knowledge bases and enterprise systems, to generate more informative and reliable responses.
RAG is an emerging approach that addresses two well-known LLM limitations: training data that goes stale over time, and answers that are invented rather than grounded in trusted sources.
RAG transforms real-time, multi-source business data into intelligent, context-aware, and compliant prompts to reduce AI hallucinations and elevate the effectiveness and trust of generative AI apps.
How RAG works:
- Retrieval: When a prompt is given, the model first retrieves relevant information from a knowledge base. This could be a database, a set of documents, or even the internet.
- Generation: The retrieved information is then used to generate a response using a language model.
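To make the two steps concrete, here is a minimal sketch of the retrieve-then-generate loop. Everything in it is an assumption for illustration: the `KNOWLEDGE_BASE` contents are made up, `retrieve` ranks documents by simple word overlap as a stand-in for real semantic search, and `generate` is a placeholder where an actual LLM call would go.

```python
import re

# Hypothetical in-memory knowledge base; a real system would use documents or a database.
KNOWLEDGE_BASE = [
    "The support desk is open Monday to Friday, 9am to 5pm.",
    "Refunds are processed within 14 days of a return request.",
    "Premium accounts include priority email support.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Retrieval step: rank documents by word overlap with the query.

    Word overlap is only a stand-in for semantic search.
    """
    query_tokens = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def generate(query: str, context: list[str]) -> str:
    """Generation step: placeholder for an LLM call; it only echoes the context."""
    return f"Q: {query}\nBased on: {' '.join(context)}"


query = "How long do refunds take?"
context = retrieve(query, KNOWLEDGE_BASE)
print(generate(query, context))
```

The point of the sketch is the shape of the flow: the query goes to the retriever first, and only the retrieved text plus the query reach the generator.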
RAG Architecture
Retrieval matters because training equips an LLM only with world knowledge, some common understanding, and general information from internet documents. A classic application for LLMs is chatbots or more advanced conversational AI assistants, which need to respond to queries about domain-specific information without hallucinating.
RAG System Architecture
LLMs aren’t exposed to all the information they need. Knowledge also changes over time. So, how can an LLM answer questions about evolving information it hasn’t yet observed? And how can we mitigate the chance of AI hallucinations in those scenarios?
Retrieval augmented generation (RAG) is a generative AI method that enhances LLM performance by combining world knowledge with custom or private knowledge. These knowledge sets are formally referred to as parametric and nonparametric memories, respectively[1]. This combining of knowledge sets in RAG is helpful for several reasons:
- Providing LLMs with up-to-date information: LLM training data is sometimes incomplete. It may also become outdated over time. RAG allows adding new and/or updated knowledge without retraining the LLM from scratch.
- Preventing AI hallucinations: The more accurate and relevant in-context information LLMs have, the less likely they’ll invent facts or respond out of context.
- Maintaining a dynamic knowledge base: Custom documents can be updated, added, removed, or modified anytime, keeping RAG systems up-to-date without retraining.
With this high-level understanding of the purpose of RAG, let's dig deeper into how RAG systems actually work.
Key Components of a RAG system and their function
A RAG system usually consists of two layers:
- a semantic search layer composed of an embedding model and vector store, and
- a generation layer (also called the query layer) composed of an LLM and its associated prompt
Figure 1 shows how RAG systems layer new information into the world knowledge LLMs already know.
Semantic Search Layer
The semantic search layer comprises two key components: an embedding model and a vector store or database. Together, these components enable the semantic search layer to:
- Build a knowledge base by gathering custom or proprietary documents such as PDFs, text files, Word documents, voice transcriptions, and more.
- Read and segment these documents into smaller pieces, commonly called "chunks."
- Transform the chunks into embedding vectors and store the vectors in a vector database alongside the original chunk text.
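The segmentation step in the list above can be sketched as follows. This is one simple strategy among many, assuming word-based chunks with a small overlap so that a sentence cut at a boundary still appears whole in one chunk. The function name `chunk_text` and the default sizes are illustrative choices, not values from any particular library.

```python
def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


# A synthetic 100-word document, used only to show the windowing behavior.
document = " ".join(f"word{i}" for i in range(100))
chunks = chunk_text(document, chunk_size=40, overlap=10)
print(len(chunks), "chunks")
```

Production systems often chunk by tokens, sentences, or document structure instead of raw word counts; the right choice depends on the documents and the embedding model.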
It’s worth examining how the embedding model and vector store make semantic search possible. Search is enriched by understanding a query's intent and contextual meaning (i.e., semantics) rather than just looking for literal keyword matches.
Today’s enterprises store vast amounts of information, such as customer service guides, FAQs, HR documents, manuals, and research reports, across a wide variety of databases. However, keyword-based retrieval is challenging at scale and may reduce the quality of the generated responses. Traditional keyword search solutions in RAG produce limited results for knowledge-intensive tasks. Developers must also deal with word embedding, data subsetting, and other complexities as they prepare their data. In contrast, semantic search automates the process by retrieving semantically relevant passages, ordered by relevance, to maximize the quality of the RAG response.
A Dive Into Embeddings
Before we deep-dive into this transformative technology, let's first understand what we mean by 'Embeddings'. Embeddings are mathematical representations of complex data types like words, sentences, and objects in a lower-dimensional vector space. Think of embeddings as the 'numeric mask' of the data: a form that is more palatable for machine learning algorithms and that also retains the semantic relationships among the data.
Here's a simple example using word embeddings:
Python
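from gensim.models import Word2Vec
# Sample sentences, lowercased and split into tokens:
# Word2Vec expects each sentence as a list of words, not a raw string
sentences = [
    "the quick brown fox jumps over the lazy dog".split(),
    "the dog is sleeping on the couch".split(),
    "the cat is chasing a mouse".split(),
]
# Create a Word2Vec model (min_count=1 keeps words that appear only once)
model = Word2Vec(sentences, min_count=1)
# Get the embedding for a word via the model's KeyedVectors (model.wv)
embedding = model.wv["dog"]
print(embedding)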
In this example:
- We create a list of sentences.
- We use Word2Vec to create a word embedding model.
- We get the embedding for the word "dog".
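The output will be a numerical vector representing the embedding for the word "dog". When the model is trained on a large enough corpus, this vector captures the semantic meaning of the word, such as its similarity to other words in the vocabulary. With only three sample sentences, the values here are purely illustrative.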
Other popular embedding techniques:
- GloVe: Global Vectors for Word Representation
- FastText: Subword embeddings that can handle out-of-vocabulary words
- BERT: Bidirectional Encoder Representations from Transformers, a more recent technique that captures contextual information
Embeddings are widely used in various natural language processing tasks, such as text classification, sentiment analysis, and machine translation.
Embedding model
Embedding models are in charge of encoding text. They project text into a numerical representation that captures the original text’s semantic meaning, as depicted in Figure 2. For instance, the sentence “Hi, how are you?” could be represented as a numerical (embedding) vector [0.12, 0.2, 2.85, 1.33, 0.01, ..., -0.42] with N dimensions.
This illustrates a key takeaway about embeddings:
Embedding vectors that represent texts with similar meanings tend to cluster together within the N-dimensional embedding space.
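A common way to measure how close two embedding vectors are is cosine similarity. The sketch below uses hand-written 3-dimensional vectors purely for illustration; they are assumptions, not the output of any real embedding model, which would typically produce hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real embeddings of three sentences.
greeting_1 = [0.9, 0.1, 0.0]  # "Hi, how are you?"
greeting_2 = [0.8, 0.2, 0.1]  # "Hello, how is it going?"
invoice = [0.0, 0.2, 0.9]     # "Please pay the attached invoice."

print("greeting vs greeting:", cosine_similarity(greeting_1, greeting_2))
print("greeting vs invoice: ", cosine_similarity(greeting_1, invoice))
```

With these toy values, the two greetings score much closer to each other than either does to the invoice sentence, which is the clustering behavior described above.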
Some examples of embedding models are
- OpenAI's text-embedding-ada-002
- Jina AI's jina-embeddings-v2
- SentenceTransformers multi-QA
Vector store
Vector stores are specialized databases for handling high-dimensional data representations. They have specific indexing structures optimized for the efficient retrieval of vectors.
Some examples of open-source vector stores are Facebook’s FAISS, Chroma DB, and even PostgreSQL with the pgvector extension. Vector stores can be in-memory, on disk, or even fully managed, like Pinecone and Weaviate.
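To show what a vector store does at its core, here is a toy in-memory version. It is a sketch under clear assumptions: the class name `InMemoryVectorStore` and its methods are hypothetical, the vectors are hand-written, and the search is a brute-force scan. The stores named above rely on specialized index structures instead, so they can search large collections efficiently.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Toy vector store: keeps (vector, text) pairs and searches them by brute force."""

    def __init__(self) -> None:
        self._items: dict[str, tuple[list[float], str]] = {}

    def add(self, chunk_id: str, vector: list[float], text: str) -> None:
        """Insert or overwrite a chunk, storing the vector next to its original text."""
        self._items[chunk_id] = (vector, text)

    def remove(self, chunk_id: str) -> None:
        """Delete a chunk if it exists."""
        self._items.pop(chunk_id, None)

    def search(self, query_vector: list[float], top_k: int = 2) -> list[tuple[str, float]]:
        """Return the top_k (text, score) pairs, highest cosine similarity first."""
        scored = [
            (text, cosine(query_vector, vector))
            for vector, text in self._items.values()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


store = InMemoryVectorStore()
# Hand-written 2-dimensional vectors; a real system would get these from an embedding model.
store.add("refunds", [0.9, 0.1], "Refunds are processed within 14 days.")
store.add("hours", [0.1, 0.9], "The support desk is open 9am to 5pm.")

results = store.search([1.0, 0.0], top_k=1)
print(results[0][0])
```

Because chunks can be added or removed at any time, the knowledge base stays current without touching the LLM itself.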
The Future is Vector: Embedding and GenAI
Looking forward, the symbiotic relationship between embeddings and GenAI is likely to deepen. As GenAI models continue to grow in complexity, the need for efficient data representation, storage, and retrieval will only escalate.
But it's not just about efficiency. Embeddings hold the key to 'interpretability' in AI, one of the holy grails in the field. While most AI models are seen as 'black boxes', embeddings can give us a sneak peek into what's happening inside. By visualizing these embeddings, we can understand how the model perceives different data points and their relationships. This can be a game-changer in AI applications that demand transparency and explainability.
In parallel, we can also anticipate significant advancements in vector databases. They will likely become more intuitive, intelligent, and integrated with AI development pipelines. We can expect functions like automatic generation of optimal embeddings for a given task and query optimization based on the nature of vector data. I will cover vector databases in depth in another blog post.
Generation Layer
The generation layer consists of an LLM and its associated prompt. The generation layer takes a user query (text) as input and does the following:
- Executes a semantic search to find the most relevant information for the query.
- Inserts the most relevant chunks of text into an LLM prompt along with the user's query and invokes the LLM to generate a response for the user.
Let's take a deeper look at how the LLM and the prompt work within a RAG system.
LLM
Large language models are built on the Transformer architecture, which uses an attention mechanism to help the model decide which parts of a sentence or text deserve more or less attention. LLMs are trained on massive amounts of data drawn from public sources, mainly available on the internet.
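For readers curious about what "attention" means numerically, here is a heavily simplified sketch of scaled dot-product attention for a single query vector. The input vectors are made up, and real Transformers add learned projections, multiple heads, and many stacked layers, none of which are shown here.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(
    query: list[float], keys: list[list[float]], values: list[list[float]]
) -> tuple[list[float], list[float]]:
    """Scaled dot-product attention for one query.

    Returns the attention weights and the weighted sum of the value vectors.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * value[i] for w, value in zip(weights, values)) for i in range(dim)]
    return weights, output


# Made-up 2-dimensional vectors for three tokens.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

weights, output = attention(query, keys, values)
print("weights:", weights)
print("output: ", output)
```

Tokens whose keys line up with the query receive larger weights, so their values contribute more to the output. That is the sense in which the model "pays more attention" to them.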
LLMs become brainier in RAG systems and are able to generate improved answers based on the context retrieved through semantic search. Now, the LLM can change its answers to better align with each query’s intent and meaning.
Some examples of managed LLMs are OpenAI’s ChatGPT, Google’s Bard, and Perplexity AI’s Perplexity. Some LLMs are available for self-managed scenarios, such as Meta's Llama 2, TII’s Falcon, Mistral AI’s Mistral, and Databricks’s Dolly.
Prompt
A prompt is a text input given to an LLM that effectively programs it by tailoring, augmenting, or sharpening its functionalities. With RAG systems, the prompt contains the user’s query alongside relevant contextual information retrieved from the semantic search layer that the model can use to answer the query.
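As an illustration, a RAG prompt can be assembled with a simple template. The wording of the instructions and the function name `build_rag_prompt` are assumptions for this sketch; in practice the template is usually tuned for the specific model and use case.

```python
def build_rag_prompt(query: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user's query into a single prompt string."""
    # Number each chunk so the model (and a human reviewer) can refer to sources.
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


# Hypothetical chunks, as if returned by the semantic search layer.
retrieved_chunks = [
    "Refunds are processed within 14 days of a return request.",
    "Return requests must include the original receipt.",
]
prompt = build_rag_prompt("How long do refunds take?", retrieved_chunks)
print(prompt)
```

The resulting string is what gets sent to the LLM. Asking the model to admit when the context is insufficient is one common way to discourage invented answers.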
Grounding LLMs with RAG prompt engineering
To harness the potential of Large Language Models (LLMs), we need to provide clear instructions – in the form of LLM prompts. An LLM prompt is text that instructs the LLM on what kind of response to generate. It acts as a starting point, providing context and guiding the LLM towards the desired outcome. Here are some examples of different types of LLM prompts:
Task-oriented prompts
"Write a poem about a cat chasing a butterfly."
(Instructs the LLM to generate creative text in the form of a poem with a specific theme.)
"Translate 'Hello, how are you?' from English to Spanish: "
(Instructs the LLM to perform a specific task – translation.)
Content-specific prompts
"Write a news article about climate change based on the latest sources."
(Provides specific content focus and emphasizes factual accuracy.)
"Continue this story: When my spaceship landed on a strange planet, I..."
(Offers context for the LLM to continue a creative narrative.)
Question-answering prompts
"What is the capital of Bolivia?"
(Instructs the LLM to access and process information to answer a specific question.)
"What are the pros and cons of genetic engineering based on the attached publication?"
(Provides context from a source and asks the LLM to analyze and answer a complex question.)
Code generation prompts
"Write a Python function to calculate the factorial of a number."
(Instructs the LLM to generate code that performs a specific programming task.)
"Complete the attached JavaScript code snippet to display an alert message on the screen."
(Provides partial code and instructs the LLM to complete the functionality.)
Essential Things to Know When Considering RAG
In addition to the practical and theoretical applications of RAG, AI practitioners should also be aware of the ongoing monitoring and optimization commitments that come with it.
RAG evaluation
RAG systems should be evaluated as they change to ensure that behavior and quality are improving and not degrading over time. RAG systems should also be red-teamed to evaluate their behavior when faced with jailbreak prompts or other malicious or poor-quality inputs.
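One simple way to track retrieval quality as a system changes is a hit rate over a small labeled test set. The sketch below assumes you have, for each test query, the ID of the chunk that should be retrieved. The query IDs, chunk IDs, and the helper `hit_rate_at_k` are all hypothetical.

```python
def hit_rate_at_k(
    retrieved: dict[str, list[str]], expected: dict[str, str], k: int
) -> float:
    """Fraction of queries whose expected chunk ID appears in the top-k retrieved IDs."""
    hits = sum(
        1
        for query_id, chunk_id in expected.items()
        if chunk_id in retrieved.get(query_id, [])[:k]
    )
    return hits / len(expected)


# Hypothetical search output: ranked chunk IDs returned for each test query.
retrieved = {
    "q1": ["c3", "c1", "c7"],
    "q2": ["c2", "c9", "c4"],
    "q3": ["c8", "c5", "c6"],
    "q4": ["c4", "c2", "c1"],
}
# Hypothetical labels: the chunk that actually answers each query.
expected = {"q1": "c1", "q2": "c2", "q3": "c6", "q4": "c9"}

print("hit rate @1:", hit_rate_at_k(retrieved, expected, k=1))
print("hit rate @3:", hit_rate_at_k(retrieved, expected, k=3))
```

Running the same test set before and after a change (a new embedding model, a different chunk size) gives a quick signal on whether retrieval improved or degraded. It does not replace evaluating the final generated answers.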
Quality and quantity of RAG knowledge
A RAG system is as good as the content available in the knowledge database. Furthermore, even if the knowledge database has the correct information, if the semantic search does not retrieve it or rank it highly enough in the search results, the LLM will not see the information and will likely respond unsatisfactorily.
Moreover, if the retrieved content has low information density or is entirely irrelevant, the LLM’s response will also be unsatisfactory. In this case, using a model with a larger context window is tempting so that more semantic search results can be provided to the LLM. But this comes with tradeoffs, namely increased cost and the risk of diluting the relevant information with irrelevant information, which can “confuse” the model.
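One possible way to limit dilution is to filter search results before they reach the prompt. The following sketch drops chunks below a similarity threshold and caps how many are kept. The threshold, the cap, and the scores are made-up values for illustration, and suitable settings would need to be found through evaluation.

```python
def select_context(
    scored_chunks: list[tuple[str, float]],
    min_score: float = 0.5,
    max_chunks: int = 3,
) -> list[str]:
    """Keep only chunks scoring at least min_score, best first, up to max_chunks."""
    relevant = [(text, score) for text, score in scored_chunks if score >= min_score]
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in relevant[:max_chunks]]


# Hypothetical (chunk, similarity score) pairs from a semantic search.
search_results = [
    ("Refunds take 14 days.", 0.82),
    ("Our office has a cafeteria.", 0.12),
    ("Returns need a receipt.", 0.64),
    ("Refund requests are filed online.", 0.58),
    ("Shipping is free over $50.", 0.51),
]

selected = select_context(search_results)
for chunk in selected:
    print(chunk)
```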
RAG cost
Since embedding models usually have an insignificant cost nowadays, RAG’s main costs arise from vector database hosting and LLM inference. The biggest driver of LLM inference cost in RAG systems is likely the number of semantic search results inserted into the prompts. A larger LLM prompt with more semantic search results could potentially yield a higher-quality response, but it will also result in more token usage and possibly higher response latency.
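A back-of-the-envelope calculation shows how the number of inserted chunks drives prompt size and cost. All numbers below are assumptions chosen for easy arithmetic, including the per-token price, which is hypothetical and not any provider's actual rate. The sketch also ignores output tokens.

```python
def estimate_prompt_cost(
    num_chunks: int,
    tokens_per_chunk: int,
    query_tokens: int,
    instruction_tokens: int,
    price_per_1k_input_tokens: float,
) -> tuple[int, float]:
    """Return (total input tokens, estimated input cost) for one RAG request."""
    total_tokens = instruction_tokens + query_tokens + num_chunks * tokens_per_chunk
    cost = total_tokens / 1000 * price_per_1k_input_tokens
    return total_tokens, cost


# Hypothetical sizes and price, for illustration only.
for num_chunks in (3, 10):
    tokens, cost = estimate_prompt_cost(
        num_chunks=num_chunks,
        tokens_per_chunk=200,
        query_tokens=20,
        instruction_tokens=80,
        price_per_1k_input_tokens=0.01,
    )
    print(f"{num_chunks} chunks -> {tokens} input tokens, about ${cost:.4f} per request")
```

Under these assumptions, going from 3 to 10 chunks triples the input tokens per request (700 versus 2,100), and the difference compounds quickly across many requests.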
However, a larger prompt with more information does not necessarily guarantee a better response. The optimal number of results to insert into the prompt will be different for every system and is impacted by factors such as chunk size, chunk information density, the extent of information duplication in the database, the scope of user queries, and much more. An evaluation-driven development approach is likely the best way to determine the best process for your system.
Is RAG Right for Your Generative AI Applications?
Retrieval augmented generation systems mark a significant advancement in AI, enhancing LLM performance by reducing hallucinations and ensuring knowledge base information is current, accurate, and relevant. Balancing information retrieval against cost and latency while maintaining a high-quality knowledge database is essential for effective use. Future advancements, including techniques like hypothetical document embeddings (HyDE), promise to further improve RAG systems.
Despite its costs, RAG can markedly improve user interaction, creating stickier, more delightful generative AI experiences for customers and employees alike.
References:
[1] Lewis, Patrick, et al. "Retrieval-augmented generation for knowledge-intensive NLP tasks." Advances in Neural Information Processing Systems 33 (2020): 9459-9474.
[2] Neelakantan, Arvind, et al. "Text and code embeddings by contrastive pre-training." arXiv preprint arXiv:2201.10005 (2022).
[3] White, Jules, et al. "A prompt pattern catalog to enhance prompt engineering with ChatGPT." arXiv preprint arXiv:2302.11382 (2023).
[4] Introducing text and code embeddings [WWW Document], OpenAI. URL https://openai.com/blog/introducing-text-and-code-embeddings