Mikkel

Retrieval-Augmented Generation

TL;DR

Retrieval-Augmented Generation (RAG) builds on large language models (LLMs) like ChatGPT and Gemini by giving them access to relevant internal and external documents. These documents can serve as references to answer questions, assist in writing reports, or support decision-making. By grounding responses in references, RAG increases trust in the generated answers, enables exploration of large knowledge bases, and reduces the model's tendency to "hallucinate" answers. Hosting on platforms like Azure can keep the setup private and GDPR-compliant.

Introduction

For over 15 years, Google has been synonymous with information retrieval. The internet became the world's library, with Google's advanced algorithm serving as our librarian. However, this paradigm has recently shifted to an all-knowing sage called ChatGPT. This sage has too much time on their hands and has read everything in the library for you. They can recall and explain it to you in seconds—even simplifying complex topics as though you were five years old.

Now imagine this sage takes a crash course in your company. Perhaps you’ve amassed hundreds of documents, including internal protocols, annual reports, and methodologies. Within hours, this sage could become an expert in your organization. This crash course is called Retrieval-Augmented Generation (RAG). With RAG, you can have an AI assistant that answers questions, generates reports, and supports decision-making based on your company’s specific knowledge base.

In this blog post, we’ll briefly explore what RAG is, the technologies behind it, and—most importantly—how it can benefit you. To understand RAG and specialized chatbots, it’s helpful to know a bit about NLP, embeddings, and LLMs.

Building Blocks

Natural Language Processing (NLP)

Natural Language Processing (NLP) is the branch of AI that deals with understanding and generating language. It includes applications like text-to-speech, speech-to-text, and image generation from text. Some of the biggest AI breakthroughs in recent years have come from this field. OpenAI has been a key player, introducing models like ChatGPT for language and DALL-E for image generation.

Embeddings

To process human language, computers need to convert words and sentences into numbers—a process known as embedding. Text is broken into smaller pieces called tokens, and each token is assigned a vector (a series of numbers) in a lookup table. For example:

  • King: [0.8, 0.6, 0.1]
  • Man: [0.5, 0.3, 0.1]
  • Woman: [0.6, 0.4, 0.1]
  • Queen: [0.9, 0.7, 0.1]

This numerical representation enables fascinating operations. By converting words into numbers, we can perform mathematical calculations on their meanings. For example:

King - Man + Woman = Queen

  • King - Man = [0.8, 0.6, 0.1] - [0.5, 0.3, 0.1] = [0.3, 0.3, 0.0]
  • Result + Woman = [0.3, 0.3, 0.0] + [0.6, 0.4, 0.1] = [0.9, 0.7, 0.1]

Thus, the result is Queen, represented by the vector [0.9, 0.7, 0.1]. This is a simplification, but the principle is the same.
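Here is that arithmetic as a runnable sketch, using NumPy and the toy three-dimensional vectors from above:

```python
import numpy as np

# Toy 3-dimensional word vectors from the example above
vectors = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "man":   np.array([0.5, 0.3, 0.1]),
    "woman": np.array([0.6, 0.4, 0.1]),
    "queen": np.array([0.9, 0.7, 0.1]),
}

# king - man + woman should land on (or near) queen
result = vectors["king"] - vectors["man"] + vectors["woman"]
print(result)                                 # [0.9 0.7 0.1]
print(np.allclose(result, vectors["queen"]))  # True
```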

OpenAI's embedding model text-embedding-ada-002, commonly used alongside ChatGPT, produces vectors with 1,536 dimensions, enabling it to capture semantic relationships between words and sentences far beyond this simple example.

By calculating the “cosine similarity” between two vectors, we can measure how similar the meanings of two words are. For example, if we compare the words "cat" and "dog," which have closely related vectors, we will obtain a high similarity score, indicating their semantic closeness. This enables AI models to understand the context and relationships between words in a highly nuanced way, forming the foundation for advanced NLP applications such as machine translation, sentence meaning comparison, and much more.
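Cosine similarity is computed from the dot product of the two vectors, normalized by their lengths. A small sketch, with made-up vectors for illustration:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical vectors: "cat" and "dog" point in similar directions,
# while "car" points somewhere else entirely.
cat = np.array([0.7, 0.5, 0.1])
dog = np.array([0.6, 0.5, 0.2])
car = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(cat, dog))  # high, roughly 0.99
print(cosine_similarity(cat, car))  # lower, roughly 0.32
```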

LLMs

The next major component of RAG is large language models, or LLMs. Popular examples include OpenAI's ChatGPT, Google's Gemini, and Meta's Llama. LLMs are built on transformer architectures, a revolutionary design that allows models to recognize patterns and derive meaning from vast amounts of data. For instance, GPT-3 was trained on roughly 570 GB of text data, equivalent to reading a 316-million-page PDF. And newer models have only grown larger.

Transformers were introduced in Google’s groundbreaking paper, “Attention Is All You Need.” Before this, language models primarily relied on recurrent neural networks, which process text sequentially, step by step. This made training time-consuming and difficult to parallelize. Transformers addressed this by introducing attention mechanisms: a way to assign meaning to words based on their surrounding context.

This attention mechanism subtly modifies the vectors representing words. For example, in the earlier example, the word king was represented by the vector [0.8, 0.6, 0.1]. In the sentence “The king loves his queen,” the vector for king might transform into [0.4, 0.5, 0.6]. These transformations occur through a series of calculations that evaluate the relationships between word vectors.
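To make this concrete, here is a heavily simplified sketch of the scaled dot-product attention at the core of the transformer. The word vectors are toy values in the spirit of the example above; real models add learned projection matrices, multiple attention heads, and many stacked layers:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # how relevant each word is to each other word
    weights = softmax(scores)      # each row sums to 1
    return weights @ V             # each output is a context-weighted mix of vectors

# Toy 3-dimensional vectors for "the king loves his queen";
# here Q = K = V = the raw word vectors (real models learn separate projections).
words = np.array([
    [0.1, 0.1, 0.1],  # the
    [0.8, 0.6, 0.1],  # king
    [0.3, 0.9, 0.4],  # loves
    [0.2, 0.1, 0.3],  # his
    [0.9, 0.7, 0.1],  # queen
])
contextualized = attention(words, words, words)
print(contextualized[1])  # "king", now nudged toward the words around it
```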

When you ask ChatGPT a question, it responds based on all the knowledge it has accumulated during training. This foundational design enables the model to generalize and perform well across a wide range of topics. It can quickly answer questions about science, history, technology, and more, understanding the context of your inquiries to provide relevant and accurate responses. However, challenges arise when dealing with specialized tasks or ensuring trustworthiness.

A helpful analogy is to think of an LLM as a judge. The judge has extensive knowledge of laws and regulations. When faced with a new, specific case, they rule with more authority and expertise if an assistant retrieves case files with similar rulings and procedures from the library. In this analogy, the assistant represents the RAG functionality.

Before RAG, addressing such specialized tasks involved either fine-tuning the model—adding more data and retraining—or attempting to engineer specific prompts to guide the model’s behavior. Both methods were time-consuming and less efficient. This is where RAG comes in.

RAG

Retrieval-Augmented Generation (RAG) is an extension of LLMs. Here’s how it works: you upload your own data—this could be PDFs, PowerPoint presentations, or Word documents. Similar to how LLMs process text, these documents are divided into smaller chunks and embedded into numerical vectors. However, RAG diverges from traditional models at this point. Instead of retraining the model on this new data, the embedded vectors are stored in a vector database.

When the LLM is asked a question or tasked with generating text, the system first embeds the query, retrieves the most relevant chunks from the vector database, and passes them to the LLM as context for its response. This approach allows the model to draw on additional data and provide answers with references.
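A minimal sketch of that retrieve-then-generate loop. The embed() function here is a deliberately crude stand-in (word counts over a tiny vocabulary); a real system would call an embedding model and a vector database instead:

```python
import numpy as np

# Toy embedding: count occurrences of a few vocabulary words.
# A real system would call an embedding model here.
VOCAB = ["vacation", "days", "policy", "salary", "review", "office"]

def embed(text: str) -> np.ndarray:
    words = text.lower().split()
    return np.array([words.count(w) for w in VOCAB], dtype=float)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

# 1. "Index" the document chunks: embed each one and store the vectors.
chunks = [
    "Employees receive 25 vacation days per year under the current policy",
    "Salary review meetings are held at the office every quarter",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# 2. At question time: embed the query and retrieve the most similar chunk ...
question = "How many vacation days do I get"
q_vec = embed(question)
best_chunk, _ = max(index, key=lambda pair: cosine_similarity(q_vec, pair[1]))

# 3. ... then hand it to the LLM as context inside the prompt.
prompt = f"Answer using only this context:\n{best_chunk}\n\nQuestion: {question}"
print(prompt)
```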

The advantages of this method are significant:

  • Speed: It is much faster than retraining a model with new data.
  • Trust: By providing references for its responses, the model builds greater trust with users.
  • Customization: If you have proprietary data, such as internal processes or company-specific information, the LLM can incorporate this knowledge into its responses.

From a GDPR perspective, it’s crucial to ensure that the servers hosting your data and vectors are located within the EU. This can be a challenge, as many vector databases are hosted by U.S.-based companies. However, solutions like Azure and Supabase—which we have experience with—offer database hosting in GDPR-compliant locations such as Germany and Sweden.

Experiences

At Convai, we have spent the past few months working extensively with chatbots and RAG. Our projects have ranged widely, requiring various types of solutions. For simpler tasks, we’ve used a drag-and-drop framework called Flowise. This tool allows predefined building blocks to be arranged on a board, enabling quick and efficient solutions.

For more advanced projects that demand flexibility, we’ve turned to a Python library called LangChain, the foundation upon which Flowise is built. LangChain offers a highly customizable approach, allowing us to assemble a variety of components to process data in the way that best fits the use case. Additionally, it enables us to incorporate tailored prompts for the models when needed, adding an extra layer of precision and functionality.
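As an illustration, here is a rough sketch of the classic LangChain retrieval pattern. LangChain's API changes frequently between versions, and the file name and question below are hypothetical, so treat this as a sketch rather than copy-paste code:

```python
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

# Load a document and split it into overlapping chunks
docs = PyPDFLoader("annual_report.pdf").load()  # hypothetical file
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# Embed the chunks and store the vectors in a local FAISS index
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# Wire retrieval and generation together into one chain
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(),
    retriever=vectorstore.as_retriever(),
)
print(qa.run("Summarize the key figures from the annual report."))
```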

Potential

As you can see, there are numerous opportunities to integrate RAG into your business. By granting an LLM access to internal documents, a chatbot can serve as a domain expert. Additionally, providing access to relevant external documents can further enhance its capabilities.

A traditional chatbot can be significantly improved by giving it access to the company’s FAQ. Beyond that, an LLM can be used to generate content for the company’s website. With tools like LangChain, the model can analyze existing content on the site, using it as a stylistic guide. This ensures that any new content the LLM generates matches the tone and format of the existing material.

If you want to use an LLM for planning purposes, you can give the model access to historical plans. Based on this data, the LLM can create new plans, such as event schedules or meeting agendas, tailored to your needs.
