Yemi Adejumobi
Building a Traceable RAG System with Qdrant and Langtrace: A Step-by-Step Guide

Vector databases are the backbone of AI applications, providing the crucial infrastructure for efficient similarity search and retrieval of high-dimensional data. Among these, Qdrant stands out as one of the most versatile projects. Written in Rust, Qdrant is a vector search database designed for turning embeddings or neural network encoders into full-fledged applications for matching, searching, recommending, and more.

In this blog post, we'll explore how to leverage Qdrant in a Retrieval-Augmented Generation (RAG) system and demonstrate how to trace its operations using Langtrace. This combination allows us to build and optimize AI applications that can understand and generate human-like text based on vast amounts of information.

Complete Code Repository

Before we dive into the details, I'm excited to share that the complete code for this RAG system implementation is available in our GitHub repository:

RAG System with Qdrant and Langtrace

This repository contains all the code examples discussed in this blog post, along with additional scripts, documentation, and setup instructions. Feel free to clone, fork, or star the repository if you find it useful!

What is a RAG System?

Retrieval-Augmented Generation (RAG) is an AI framework that enhances large language models (LLMs) with external knowledge. The process typically involves three steps:

  1. Retrieval: Given a query, relevant information is retrieved from a knowledge base (in our case, stored in Qdrant).
  2. Augmentation: The retrieved information is combined with the original query.
  3. Generation: An LLM uses the augmented input to generate a response.

This approach allows for more accurate and up-to-date responses, as the system can reference specific information rather than relying solely on its pre-trained knowledge.
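
To make these steps concrete, here is a minimal, self-contained sketch of the first two steps in plain Python. It uses simple word overlap instead of a real vector search, and the function names are purely illustrative:

from typing import List

def retrieve(query: str, knowledge_base: List[str], k: int = 2) -> List[str]:
    # Toy retrieval: rank documents by word overlap with the query;
    # a real RAG system would use vector similarity search (e.g., Qdrant) instead.
    query_words = set(query.lower().split())
    return sorted(
        knowledge_base,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )[:k]

def augment(query: str, retrieved: List[str]) -> str:
    # Augmentation: combine the retrieved context with the original question.
    context = "\n".join(retrieved)
    return f"Answer the question using this context:\n{context}\n\nQuestion: {query}"

# The augmented prompt is then passed to an LLM for generation (step 3).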

Implementing a RAG System with Qdrant

Let's walk through the process of implementing a RAG system using Qdrant as our vector database. We'll use OpenAI's GPT model for generation and Langtrace for tracing our system's operations.

Setting Up the Environment

First, we need to set up our environment with the necessary libraries:

import os
import time
import openai
from qdrant_client import QdrantClient, models
from langtrace_python_sdk import langtrace, with_langtrace_root_span
from typing import List, Dict, Any

# Initialize environment and clients
os.environ["OPENAI_API_KEY"] = "your_openai_api_key_here"
langtrace.init(api_key='your_langtrace_api_key_here')
qdrant_client = QdrantClient(":memory:") 
openai_client = openai.Client(api_key=os.getenv("OPENAI_API_KEY"))
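
Note that QdrantClient(":memory:") spins up an ephemeral, in-process instance that disappears when the script exits, which is handy for experimentation. If you have a Qdrant server running (for example via Docker), you can point the client at it instead. A minimal sketch, assuming Qdrant is listening on the default local port:

# Alternative: connect to a running Qdrant server instead of ":memory:"
qdrant_client = QdrantClient(url="http://localhost:6333")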

Initializing the Knowledge Base

Next, we'll create a function to initialize our knowledge base in Qdrant:

@with_langtrace_root_span("initialize_knowledge_base")
def initialize_knowledge_base(documents: List[str]) -> None:
    start_time = time.time()

    # Check if collection exists, if not create it
    collections = qdrant_client.get_collections().collections
    if not any(collection.name == "knowledge-base" for collection in collections):
        qdrant_client.create_collection(
            collection_name="knowledge-base"
        )
        print("Created 'knowledge-base' collection")

    qdrant_client.add(
        collection_name="knowledge-base",
        documents=documents
    )
    end_time = time.time()
    print(f"Knowledge base initialized with {len(documents)} documents in {end_time - start_time:.2f} seconds")
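
A couple of notes on this function: the add() convenience method relies on Qdrant's FastEmbed integration to embed the documents locally, so it needs the fastembed extra of the client (installable with pip install "qdrant-client[fastembed]"). After initialization you can also sanity-check that the documents actually landed in the collection; a minimal check, assuming the client set up above:

# Sanity check: confirm the documents were ingested into the collection
count = qdrant_client.count(collection_name="knowledge-base", exact=True)
print(f"'knowledge-base' now contains {count.count} points")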

Querying the Vector Database

We'll create a function to query our Qdrant vector database:

@with_langtrace_root_span("query_vector_db")
def query_vector_db(question: str, n_points: int = 3) -> List[Dict[str, Any]]:
    start_time = time.time()
    results = qdrant_client.query(
        collection_name="knowledge-base",
        query_text=question,
        limit=n_points,
    )
    end_time = time.time()
    print(f"Vector DB query completed in {end_time - start_time:.2f} seconds")
    return results
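
Each result returned by query() carries the matched document along with a similarity score, which makes it easy to eyeball retrieval quality before involving the LLM at all. For example, assuming the result objects expose document and score fields:

# Inspect what the retriever returns for a sample question
for result in query_vector_db("What is Qdrant used for?"):
    print(f"score={result.score:.3f}  document={result.document[:60]}...")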

Generating LLM Responses

We'll use OpenAI's GPT model to generate responses:

@with_langtrace_root_span("generate_llm_response")
def generate_llm_response(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    start_time = time.time()
    completion = openai_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": prompt},
        ],
        timeout=10.0,
    )
    end_time = time.time()
    response = completion.choices[0].message.content
    print(f"LLM response generated in {end_time - start_time:.2f} seconds")
    return response
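
This function can be exercised on its own before wiring it into the full pipeline, for example (assuming a valid OPENAI_API_KEY is set):

# Quick standalone check of the LLM call
print(generate_llm_response("In one sentence, what is a vector database?"))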

The RAG Process

Finally, we'll tie it all together in our RAG function:

@with_langtrace_root_span("rag")
def rag(question: str, n_points: int = 3) -> str:
    print(f"Processing RAG for question: {question}")

    context_start = time.time()
    context = "\n".join([r.document for r in query_vector_db(question, n_points)])
    context_end = time.time()
    print(f"Context retrieved in {context_end - context_start:.2f} seconds")

    metaprompt = f"""
    You are a software architect.
    Answer the following question using the provided context.
    If you can't find the answer, do not pretend you know it, but answer "I don't know".

    Question: {question.strip()}

    Context:
    {context.strip()}

    Answer:
    """

    answer = generate_llm_response(metaprompt)
    print(f"RAG completed, answer length: {len(answer)} characters")
    return answer

Tracing with Langtrace

As you may have noticed, we've decorated our functions with @with_langtrace_root_span. This allows us to trace the execution of our RAG system using Langtrace, an open-source LLM observability tool. You can read more about group traces in the Langtrace documentation.

What is Langtrace?

Langtrace is a powerful, open-source tool designed specifically for LLM observability. It provides developers with the ability to trace, monitor, and analyze the performance and behavior of LLM-based systems. By using Langtrace, we can gain valuable insights into our RAG system's operation, helping us to optimize performance, identify bottlenecks, and ensure the reliability of our AI applications.

Key features of Langtrace include:

  • Easy integration with existing LLM applications
  • Detailed tracing of LLM operations
  • Performance metrics and analytics
  • Open-source nature, allowing for community contributions and customizations

In our RAG system, each decorated function will create a span in our trace, providing a comprehensive view of the system's execution flow. This level of observability is crucial when working with complex AI systems like RAG, where multiple components interact to produce the final output.

Using Langtrace in Our RAG System

Here's how we're using Langtrace in our implementation:

  1. We initialize Langtrace at the beginning of our script:
from langtrace_python_sdk import langtrace, with_langtrace_root_span
langtrace.init(api_key='your_langtrace_api_key_here')
  2. We decorate each of our main functions with the root span decorator:
@with_langtrace_root_span("function_name")
def function_name():
    # function implementation

This setup allows us to create a hierarchical trace of our RAG system's execution, from initializing the knowledge base to generating the final response.

Testing the RAG System

Let's test our RAG system with a few sample questions:

def demonstrate_different_queries():
    questions = [
        "What is Qdrant used for?",
        "How does Docker help developers?",
        "What is the purpose of MySQL?",
        "Can you explain what FastAPI is?",
    ]
    for question in questions:
        try:
            answer = rag(question)
            print(f"Question: {question}")
            print(f"Answer: {answer}\n")
        except Exception as e:
            print(f"Error processing question '{question}': {str(e)}\n")

# Initialize knowledge base and run queries
documents = [
    "Qdrant is a vector database & vector similarity search engine. It deploys as an API service providing search for the nearest high-dimensional vectors. With Qdrant, embeddings or neural network encoders can be turned into full-fledged applications for matching, searching, recommending, and much more!",
    "Docker helps developers build, share, and run applications anywhere — without tedious environment configuration or management.",
    "PyTorch is a machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing.",
    "MySQL is an open-source relational database management system (RDBMS). A relational database organizes data into one or more data tables in which data may be related to each other; these relations help structure the data. SQL is a language that programmers use to create, modify and extract data from the relational database, as well as control user access to the database.",
    "NGINX is a free, open-source, high-performance HTTP server and reverse proxy, as well as an IMAP/POP3 proxy server. NGINX is known for its high performance, stability, rich feature set, simple configuration, and low resource consumption.",
    "FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hints.",
    "SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared e.g. with cosine-similarity to find sentences with a similar meaning. This can be useful for semantic textual similar, semantic search, or paraphrase mining.",
    "The cron command-line utility is a job scheduler on Unix-like operating systems. Users who set up and maintain software environments use cron to schedule jobs (commands or shell scripts), also known as cron jobs, to run periodically at fixed times, dates, or intervals.",
]
initialize_knowledge_base(documents)
demonstrate_different_queries()


Analyzing the Traces

After running our RAG system, we can analyze the traces in the Langtrace dashboard. Here's what to look for:

  1. Check the Langtrace dashboard for a visual representation of the traces.
  2. Look for the 'rag' root span and its child spans to understand the flow of operations.
  3. Examine the timing information printed for each operation to identify potential bottlenecks.
  4. Review any error messages printed to understand and address issues.

Conclusion

In this blog post, we've explored how to leverage Qdrant, a powerful vector database, in building a Retrieval-Augmented Generation (RAG) system. We've implemented a complete RAG pipeline, from initializing the knowledge base to generating responses, and added tracing with Langtrace to gain insights into our system's performance. By leveraging open-source tools like Qdrant for vector search and Langtrace for LLM observability, we're not only building powerful AI applications but also contributing to and benefiting from the broader AI development community. These tools empower developers to create, optimize, and understand complex AI systems, paving the way for more reliable AI applications in the future.

Remember, you can find the complete implementation of this RAG system in our GitHub repository. We encourage you to explore the code, experiment with it, and adapt it to your specific use cases. If you have any questions or improvements, feel free to open an issue or submit a pull request. Happy coding!
