Babloo Kumar

Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) systems are a class of models that combine elements of both retrieval and generation in natural language processing (NLP). These systems integrate retrieval-based methods with traditional generation-based models like GPT (Generative Pre-trained Transformer).

RAG represents a significant advancement in NLP because it bridges the gap between retrieval-based approaches (focused on factual accuracy and relevance) and generation-based models (focused on language fluency and coherence). RAG systems are increasingly used in a variety of applications, including dialogue systems, question answering, and content generation tasks.

End-to-End RAG System
[Fig1. Diagram of the RAG System (Reference image is taken from the internet)]

Here’s a breakdown of how RAG systems typically work, taking the simple example of a chatbot that queries a private knowledge base or collection of documents, which could be in different formats:

To keep the code simple and easy to learn from, I will be using LangChain (as the orchestrator, kept very minimal), Azure OpenAI (for the embedding and GPT-4 models), and FAISS (as a local vector DB).
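
If you want to run the snippets end to end, the sketch below shows the setup they assume. The package list is my assumption based on the current LangChain package split, and the environment variable names are the ones used later in the embedding code.

    # Assumed prerequisites (my assumption; package names may vary with your LangChain version):
    #   pip install langchain langchain-openai langchain-community langchain-text-splitters faiss-cpu pypdf
    import os

    # The embedding snippets below read these Azure OpenAI settings from the environment
    required = [
        "AZURE_EMBEDDING_OPENAI_DEPLOYMENT",
        "AZURE_EMBEDDING_OPENAI_KEY",
        "AZURE_EMBEDDING_OPENAI_API_BASE",
        "AZURE_EMBEDDING_OPENAI_API_TYPE",
        "AZURE_EMBEDDING_OPENAI_API_VERSION",
    ]
    missing = [name for name in required if name not in os.environ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {missing}")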

The first step is to store the documents in the vector DB using an embedding model, a process popularly known as Ingestion. To do this, the steps below have to be followed:

# load the document --> split the document into chunks --> create embeddings --> store in vector database

1 - Split the entire knowledge base or collection of documents into chunks; each chunk represents a single piece of context to be queried.

    # Load the source PDF and split it into overlapping chunks
    from langchain_community.document_loaders import PyPDFLoader
    from langchain_text_splitters import CharacterTextSplitter

    pdf_path = "Vectors-in-memory/2210.03629v3.pdf"
    loader = PyPDFLoader(file_path=pdf_path)
    documents = loader.load()
    text_splitter = CharacterTextSplitter(
        chunk_size=1000, chunk_overlap=30, separator="\n"
    )
    docs = text_splitter.split_documents(documents=documents)

2 - Use an embedding model to transform each chunk into a vector embedding.

    # Initialize the Azure OpenAI embedding model; credentials are read from environment variables
    import os
    from langchain_openai import AzureOpenAIEmbeddings

    embeddings = AzureOpenAIEmbeddings(
        azure_deployment=os.environ["AZURE_EMBEDDING_OPENAI_DEPLOYMENT"],
        openai_api_key=os.environ["AZURE_EMBEDDING_OPENAI_KEY"],
        azure_endpoint=os.environ["AZURE_EMBEDDING_OPENAI_API_BASE"],
        openai_api_type=os.environ["AZURE_EMBEDDING_OPENAI_API_TYPE"],
        openai_api_version=os.environ["AZURE_EMBEDDING_OPENAI_API_VERSION"],
    )


3 - Store all the vector embeddings into the vector database.

    # Build a FAISS index from the chunks and persist it locally
    from langchain_community.vectorstores import FAISS
    vectorstore = FAISS.from_documents(docs, embeddings)
    vectorstore.save_local("faiss_index_react")

Next, work on the query (the chatbot question), a phase popularly known as Retrieval. The retrieval aspect involves fetching relevant information or contexts from a large corpus of text based on a given query or input. This retrieval is typically performed using information retrieval techniques, where the documents or passages most relevant to the input are selected.
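
Before wiring up the full chain in the next step, it can help to see retrieval in isolation. The sketch below is only an illustration: it reuses the embeddings object from the ingestion steps, reloads the saved FAISS index, and pulls back the chunks most similar to a sample query (the k value and the query string are assumptions for demonstration).

    # Retrieval-only sketch: reload the persisted index and fetch the nearest chunks
    from langchain_community.vectorstores import FAISS

    index = FAISS.load_local(
        "faiss_index_react", embeddings, allow_dangerous_deserialization=True
    )
    retriever = index.as_retriever(search_kwargs={"k": 4})  # k=4 is an arbitrary choice
    top_chunks = retriever.invoke("What is machine learning?")
    for doc in top_chunks:
        print(doc.metadata.get("page"), doc.page_content[:120])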

4 - Embed the query (the question asked of the chatbot) using the same embedding model and wire everything into a chain. See the code below:

    # Imports for the chat model, prompt template, and chain composition
    from langchain_core.prompts import PromptTemplate
    from langchain_core.runnables import RunnablePassthrough
    from langchain_openai import AzureChatOpenAI

    # Chat model backed by a GPT-4 deployment on Azure OpenAI
    chat = AzureChatOpenAI(
        openai_api_key="xxxxxxd8a815xxxxxxxxxxxxx",
        azure_endpoint="https://testmodel.openai.azure.com/",
        openai_api_type="azure",
        azure_deployment="GPT4",
        openai_api_version="2024-05-01-preview",
        temperature=0,
        max_tokens=None,
        timeout=None,
        max_retries=2,
    )

    # Let the below be the query from the chatbot
    query = "What is machine learning?"

    # Reload the FAISS index built during ingestion, using the same embedding model
    new_vectorstore = FAISS.load_local(
        "faiss_index_react", embeddings, allow_dangerous_deserialization=True
    )

    template = """Use the following pieces of context to answer the question at the end.
    If you don't know the answer, just say that you don't know, don't try to make up the answer.
    Use three sentences maximum and keep the answer as concise as possible.
    Always say "Thank you for Asking!!!" at the end of the answer.

    {context}

    Question: {question}

    Helpful Answer:"""

    custom_rag_prompt = PromptTemplate.from_template(template)

    # The retriever supplies the context; the raw query is passed through as the question
    rag_chain = (
        {"context": new_vectorstore.as_retriever(), "question": RunnablePassthrough()}
        | custom_rag_prompt
        | chat
    )

    res = rag_chain.invoke(query)
    print(res.content)


5 - Use the resulting query vector to run the search against the index in the vector DB.
The vector DB performs an Approximate Nearest Neighbor (ANN) search for the query embedding and returns the context, i.e. the stored vectors that are most similar to it in the latent space.
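
To make step 5 concrete, the sketch below spells out what the retriever does under the hood: it embeds the query with the same embedding model and asks FAISS for the nearest chunks together with their distance scores. This is an illustrative detour, not part of the chain above; it reuses the embeddings, new_vectorstore and query objects from step 4, and k=3 is an arbitrary choice.

    # Explicit nearest-neighbour lookup: embed the query, then search the index by vector
    query_vector = embeddings.embed_query(query)
    hits = new_vectorstore.similarity_search_with_score_by_vector(query_vector, k=3)
    for doc, score in hits:
        # FAISS returns L2 distances by default, so lower scores mean closer vectors
        print(round(score, 3), doc.page_content[:100])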

6 - Now map the retrieved results into the prompt and pass it to the LLM (in this case AzureChatOpenAI, with a GPT-4 model deployed). Once the relevant information is retrieved, it is combined with the input query or prompt to generate a coherent and contextually relevant response. This generation step often involves models like GPT (or its variants), which are capable of generating fluent and contextually appropriate text.
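
This mapping is essentially what the rag_chain above does in a single pass. For clarity, the sketch below performs the same steps by hand: retrieve the chunks, flatten them into the {context} slot of the prompt, and invoke the chat model once. The way the chunks are joined into one string is my assumption about how to stuff the prompt, not code taken from the chain itself.

    # Hand-rolled equivalent of rag_chain: retrieve, stuff the prompt, generate
    retrieved_docs = new_vectorstore.as_retriever().invoke(query)
    context_text = "\n\n".join(doc.page_content for doc in retrieved_docs)
    prompt_value = custom_rag_prompt.format(context=context_text, question=query)
    answer = chat.invoke(prompt_value)
    print(answer.content)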

The key innovation of RAG systems lies in the effective integration of these two components—retrieval and generation. By leveraging retrieval to provide contextually rich inputs to the generation model, RAG systems aim to produce responses that are not only fluent but also more grounded in relevant knowledge or information from the retrieval process.
