Ryan Michael

Posted on Feb 22 • Originally published at dewykb.github.io

RAG for Real - Gotchas to consider before building your app

#ai #architecture #learning #development

As developers, we sometimes find ourselves at the transition from experimentation to production wishing we'd thought of something earlier. The journey from a prototype to a fully operational application is full of challenges, especially for RAG and Gen AI applications.

This post presents some of the surprising challenges specific to building RAG applications, and offers concrete solutions to help you build a successful application. The goal is to help you avoid some of the common gotchas and save you some headaches down the road.

Challenge 1: Managing changing documents and metadata

What needs to be done: When a RAG application ingests a document, it often extracts and indexes multiple chunks to facilitate efficient search. Documents are frequently associated with metadata used for access control, hybrid search, and data organization. In a production system, the set of documents is often dynamic, changing as users interact with the application.

Why it's hard: Each chunk of a document is indexed as a separate vector, so any update or deletion requires tracking these fragments and their metadata to ensure consistency. Hybrid search models, which leverage both traditional and vector search mechanisms, require a careful approach to indexing and querying. Additionally, enforcing ACLs at the chunk level complicates permissions management.

Design Around Documents

The key to a successful RAG application lies in recognizing that vector indexes are just one component of a complete solution. Instead of treating documents as just data to be chunked and indexed, think of each document as the central entity around which your application architecture is built.

This perspective allows the development of systems that support complex interactions with documents, such as updates, deletions, and access control, in a more integrated manner. By focusing on documents, you ensure that every part of your application, from search to security, is aligned with the goal of delivering relevant, accessible, and secure content.

Capture document-level metadata

Document metadata can support many use cases, such as hybrid search and access control. Keep in mind, though, that a single document will be associated with an arbitrary number of extracted text chunks, images, summarizations, etc.

There are two basic ways to manage this type of 1:N relationship: normalized (where you store the metadata in a separate table) and denormalized (where the metadata is duplicated and stored alongside each chunk).

Whenever possible, the normalized approach is preferable. For example, you can store document metadata alongside documents or in a separate table and use SQL joins to associate metadata with rows in your vector index. This approach simplifies document and metadata changes, which could otherwise affect a huge number of related DB records (chunks, summaries, etc.).
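As a rough illustration of the normalized layout, the sketch below stores document-level metadata once and joins it to chunks at query time. It uses SQLite only so that it runs anywhere; the table names, column names, and the `chunks_visible_to` helper are hypothetical, and a real system would combine the join with a vector similarity search.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        owner TEXT NOT NULL            -- document-level metadata, stored once
    );
    CREATE TABLE chunks (
        id INTEGER PRIMARY KEY,
        document_id INTEGER NOT NULL REFERENCES documents(id),
        text TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO documents (id, title, owner) VALUES (1, 'Handbook', 'alice')")
conn.executemany(
    "INSERT INTO chunks (document_id, text) VALUES (?, ?)",
    [(1, "Chapter one..."), (1, "Chapter two...")],
)


def chunks_visible_to(conn, user):
    """Return chunk texts whose parent document is owned by `user`."""
    rows = conn.execute(
        """
        SELECT c.text
        FROM chunks AS c
        JOIN documents AS d ON d.id = c.document_id
        WHERE d.owner = ?
        ORDER BY c.id
        """,
        (user,),
    )
    return [text for (text,) in rows]


# Changing ownership touches one document row, not every chunk.
conn.execute("UPDATE documents SET owner = 'bob' WHERE id = 1")
```

With the denormalized layout, the same ownership change would have to rewrite every chunk row that belongs to the document.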

Implement versioned document storage

To handle document updates and deletions efficiently, it's a good idea to implement document versioning. Each version of a document is tracked, so updates and deletions happen in a controlled manner and changes can be rolled back if necessary. This setup is particularly beneficial for applications whose documents change frequently: it minimizes the impact of updates on search and ensures that users always have access to the most current and accurate information. Versioning also supports ACLs by providing a clear record of who has access to what information and when, which strengthens the security and integrity of the application.
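One possible shape for versioned storage is sketched below, assuming a `current_version` pointer on each document. The schema and the `publish_version` and `current_chunks` helpers are hypothetical, and SQLite stands in for whatever database you use. Chunks for a new version are written first, and the pointer is switched in the same transaction, so readers never see a half-ingested version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY,
        current_version INTEGER        -- NULL until a version is published
    );
    CREATE TABLE chunks (
        document_id INTEGER NOT NULL,
        version INTEGER NOT NULL,
        text TEXT NOT NULL
    );
""")


def publish_version(conn, document_id, version, chunk_texts):
    """Write all chunks for a version, then switch the pointer atomically."""
    with conn:  # commits on success, rolls back if anything raises
        conn.executemany(
            "INSERT INTO chunks (document_id, version, text) VALUES (?, ?, ?)",
            [(document_id, version, text) for text in chunk_texts],
        )
        conn.execute(
            "UPDATE documents SET current_version = ? WHERE id = ?",
            (version, document_id),
        )


def current_chunks(conn, document_id):
    """Return only the chunks that belong to the published version."""
    rows = conn.execute(
        """
        SELECT c.text
        FROM chunks AS c
        JOIN documents AS d
          ON d.id = c.document_id AND d.current_version = c.version
        WHERE d.id = ?
        ORDER BY c.rowid
        """,
        (document_id,),
    )
    return [text for (text,) in rows]


conn.execute("INSERT INTO documents (id) VALUES (1)")
publish_version(conn, 1, 1, ["old text"])
publish_version(conn, 1, 2, ["new text", "more new text"])
```

Rolling back is then a single update that points `current_version` at an earlier version; the older chunks are still in place.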

Challenge 2: Ensuring Persistence and Durability

What needs to be done: Moving from a prototype to a production-ready application often means meeting more stringent requirements for data persistence and durability. Production applications need to work for multiple users concurrently, and users will be continually interacting with the application: adding and removing documents, making queries, etc. In a Jupyter notebook, local or in-memory storage may suffice for temporary experimentation, but production environments demand reliable storage solutions built for concurrent users.

Why it's hard: External databases and persistent queues introduce complexity, including network latency, data synchronization issues, and the need for robust error handling mechanisms. Ensuring the durability of vector indexes, which are crucial for the fast retrieval of information in RAG applications, requires careful selection of storage solutions that can support high read/write throughput. Finally, ingesting documents to support RAG often involves time-consuming data extraction and summarization: these processes shouldn’t happen as part of an API request, and need to be tolerant of machine failures, restarts, and code deployments.

Use a durable ingestion queue

A durable ingestion queue is foundational for managing the continuous addition and removal of documents by multiple users. Services like Amazon SQS and Inngest or durable databases like Postgres can be used to build an ingestion queue that ensures all document processing tasks are reliably captured and processed, even in the face of system failures or restarts.

This queue acts as a buffer, absorbing spikes in activity and allowing document processing tasks to be carried out asynchronously, without impacting the responsiveness of the application to user queries. By decoupling document ingestion from immediate processing, you provide a scalable way to handle user interactions and background tasks, ensuring that the system remains responsive and reliable.
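To make the idea concrete, here is a minimal sketch of a database-backed ingestion queue. It uses SQLite as a stand-in (an in-memory database is not durable; a real deployment would use a file-backed or server database such as Postgres), and the `ingest_tasks` table and helper names are hypothetical. A task is claimed with a compare-and-set update so that two workers cannot both take it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in only; use durable storage in practice
conn.execute("""
    CREATE TABLE ingest_tasks (
        id INTEGER PRIMARY KEY,
        document_id INTEGER NOT NULL,
        status TEXT NOT NULL DEFAULT 'pending'   -- pending, running, done
    )
""")


def enqueue(conn, document_id):
    """Record a task; the API request can return as soon as this commits."""
    with conn:
        cur = conn.execute(
            "INSERT INTO ingest_tasks (document_id) VALUES (?)", (document_id,)
        )
    return cur.lastrowid


def claim_next(conn):
    """Claim the oldest pending task, or return None if the queue is empty."""
    while True:
        row = conn.execute(
            "SELECT id, document_id FROM ingest_tasks "
            "WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        with conn:
            cur = conn.execute(
                "UPDATE ingest_tasks SET status = 'running' "
                "WHERE id = ? AND status = 'pending'",
                (row[0],),
            )
        if cur.rowcount == 1:  # nobody else claimed it first
            return row


def complete(conn, task_id):
    with conn:
        conn.execute("UPDATE ingest_tasks SET status = 'done' WHERE id = ?", (task_id,))


def requeue_running(conn):
    """After a crash or deploy, put interrupted tasks back in the queue."""
    with conn:
        cur = conn.execute(
            "UPDATE ingest_tasks SET status = 'pending' WHERE status = 'running'"
        )
    return cur.rowcount
```

Because interrupted tasks are re-queued and run again, the processing itself needs to tolerate being repeated. In Postgres, the claim step is commonly written with `SELECT ... FOR UPDATE SKIP LOCKED` instead of the retry loop shown here.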

Use idempotent ingestion logic

To manage the complexities of concurrent document workloads and ensure atomic updates, idempotent ingestion logic is essential. This means designing your ingestion processes so that performing the same operation multiple times leaves the system in the same state as performing it once, and repeated or overlapping runs don't affect each other.

For example, if an LLM is used during document chunking, its outputs may change from one run to the next. If your ingestion pipeline includes steps after chunking and a job is restarted during the chunking operation, downstream steps could end up working from two different sets of chunks, leaving inconsistent results.

Implementing idempotent operations allows for retries without side effects, which is crucial in a distributed system where network issues or component failures can interrupt tasks. This approach also facilitates concurrent processing, ensuring that multiple workers can safely process tasks without stepping on each other's toes, thereby increasing the efficiency and reliability of document ingestion and updates. Idempotent processes work hand-in-hand with document versioning, allowing new versions to be released atomically after all processing has completed successfully.
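One way to make a non-deterministic step safe to retry is to persist its output the first time it succeeds and reuse that output on later attempts. The sketch below assumes a hypothetical `step_results` table keyed by job and step name, and a hypothetical `run_step_once` helper; `flaky_chunker` is a stand-in for an LLM-based chunker whose output differs between runs.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE step_results (
        job_id TEXT NOT NULL,
        step TEXT NOT NULL,
        result TEXT NOT NULL,              -- JSON-encoded output of the step
        PRIMARY KEY (job_id, step)
    )
""")


def _stored_result(conn, job_id, step):
    return conn.execute(
        "SELECT result FROM step_results WHERE job_id = ? AND step = ?",
        (job_id, step),
    ).fetchone()


def run_step_once(conn, job_id, step, fn):
    """Run `fn` for (job_id, step) unless a result is already stored."""
    row = _stored_result(conn, job_id, step)
    if row is None:
        result = fn()
        with conn:
            # OR IGNORE: if another worker stored a result first, keep theirs.
            conn.execute(
                "INSERT OR IGNORE INTO step_results (job_id, step, result) "
                "VALUES (?, ?, ?)",
                (job_id, step, json.dumps(result)),
            )
        row = _stored_result(conn, job_id, step)
    return json.loads(row[0])


calls = []


def flaky_chunker():
    """Stand-in for an LLM-based chunker: output differs on every call."""
    calls.append(1)
    return [f"chunk from run {len(calls)}"]


first = run_step_once(conn, "job-1", "chunk", flaky_chunker)
retry = run_step_once(conn, "job-1", "chunk", flaky_chunker)
```

On the retry, the chunker is not called again, so every downstream step sees the same chunks regardless of how many times the job was restarted.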

Use a remote vector index

Choosing a remote vector index like pgvector or AstraDB is critical for supporting the durable, high-throughput read/write operations necessary for RAG applications. These technologies are designed to handle the demands of vector indexing, offering the scalability and performance needed for fast retrieval of information.

A remote vector index also enables separation of concerns, allowing the computational workload associated with vector operations to be offloaded from the primary application database. This separation ensures that the indexing and retrieval processes do not interfere with each other, maintaining high availability and performance even under heavy load. Moreover, these solutions come with built-in durability and fault tolerance features, ensuring that your vector data remains intact and accessible even in the event of system failures.

Challenge 3: Monitoring, Auditing, and Debugging

What needs to be done: As RAG applications are deployed in production, understanding their operation and performance becomes crucial. You want to know if your users are having a good experience and how much your LLM bill is going to be before it gets out of hand.

Monitoring, auditing, and debugging take on heightened importance, not just for traditional metrics like response times and error rates, but also for tracking the non-determinism of language model outputs and the effectiveness of prompt engineering.

Why it's hard: The inherent non-determinism of language models and the ambiguity of natural language processing make it challenging to predict and diagnose issues. Traditional monitoring tools may not be sufficient to capture the nuanced behavior of RAG applications, requiring developers to think creatively about how to track performance and user interactions.

Persist DB and LLM queries

To effectively monitor and audit RAG applications, it is crucial to persist database (DB) and Large Language Model (LLM) queries. This involves capturing and storing logs of all interactions with the database and the language model, using, for example, Postgres access logs and proprietary tools like LangSmith. The general principle is to ensure that every query, its response, and associated metadata are recorded.

This logging allows you to analyze the performance of both DB queries and LLM interactions over time, identify patterns or anomalies, and troubleshoot issues with data retrieval or language model responses. Additionally, by persisting these queries, organizations can track and forecast usage costs, crucial for managing expenses related to LLM usage.
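If you are not using a dedicated tool, the same principle can be approximated with a thin wrapper around each LLM call. This is a minimal sketch under stated assumptions: the `llm_calls` table, the `logged_completion` helper, and `fake_complete` (a stand-in for a real client call) are all hypothetical, and a production version would likely also record token counts from the provider's response.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE llm_calls (
        id INTEGER PRIMARY KEY,
        model TEXT NOT NULL,
        prompt TEXT NOT NULL,
        response TEXT NOT NULL,
        latency_s REAL NOT NULL,
        created_at REAL NOT NULL           -- Unix timestamp
    )
""")


def logged_completion(conn, model, prompt, complete):
    """Call `complete(prompt)` and persist the prompt, response, and latency."""
    start = time.monotonic()
    response = complete(prompt)
    latency = time.monotonic() - start
    with conn:
        cur = conn.execute(
            "INSERT INTO llm_calls (model, prompt, response, latency_s, created_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (model, prompt, response, latency, time.time()),
        )
    return cur.lastrowid, response


def fake_complete(prompt):
    """Stand-in for a real LLM client call."""
    return "echo: " + prompt


call_id, answer = logged_completion(conn, "example-model", "What is RAG?", fake_complete)
```

The returned `call_id` gives every response a stable identifier that later analysis can refer back to.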

Design for user feedback

Incorporating mechanisms for user feedback directly into the RAG application is essential for understanding user satisfaction and the effectiveness of the system's outputs. Simple features like thumbs-up/down buttons, star ratings, and options to regenerate responses provide users with a direct way to communicate their experience.

This feedback not only serves as a critical metric for monitoring user satisfaction but also offers valuable data that can be used to fine-tune the application. In the absence of deterministic outputs, user feedback can be a useful way to determine if your application is doing what you expect it to.
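A thumbs-up/down signal can be stored with very little machinery. The sketch below assumes each generated answer has some identifier (`response_id` here is hypothetical, as are the table and helper names) and computes a simple approval rate from the stored ratings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE feedback (
        response_id TEXT NOT NULL,         -- hypothetical ID of the generated answer
        rating INTEGER NOT NULL CHECK (rating IN (-1, 1)),  -- thumbs down / up
        created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
""")


def record_feedback(conn, response_id, thumbs_up):
    with conn:
        conn.execute(
            "INSERT INTO feedback (response_id, rating) VALUES (?, ?)",
            (response_id, 1 if thumbs_up else -1),
        )


def approval_rate(conn):
    """Fraction of ratings that are thumbs-up, or None if there are no ratings."""
    up, total = conn.execute(
        "SELECT COALESCE(SUM(rating = 1), 0), COUNT(*) FROM feedback"
    ).fetchone()
    return up / total if total else None
```

Tracking this rate over time, or per prompt version, gives a coarse signal of whether a change helped or hurt.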

Implement LLM performance testing

To ensure the reliability and accuracy of RAG applications over time, implementing a robust regression testing framework is key. Tools like TruLens or RAGAS can be employed to systematically test the application against a suite of predefined scenarios and inputs. This testing helps identify any deviations from expected outcomes, potentially highlighting issues introduced by updates to the language model, changes in the data, or alterations in the application code.

Regression testing is particularly important in the context of RAG applications due to the non-deterministic nature of language models; it ensures that updates or changes do not adversely affect the application's performance or the quality of its outputs. Furthermore, consistent regression testing facilitates a continuous improvement cycle, where feedback and testing results inform ongoing development and optimization efforts.
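The sketch below shows the basic shape of such a regression suite without relying on any particular tool. It does not use the TruLens or RAGAS APIs; `Case`, `run_regression`, and `fake_answer` are hypothetical names, and the keyword check is a deliberately crude stand-in for the richer quality metrics those tools provide.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Case:
    question: str
    must_include: List[str]    # phrases an acceptable answer should contain


def run_regression(
    answer_fn: Callable[[str], str], cases: List[Case]
) -> List[Tuple[str, List[str]]]:
    """Return (question, missing phrases) for every case that fails."""
    failures = []
    for case in cases:
        answer = answer_fn(case.question).lower()
        missing = [p for p in case.must_include if p.lower() not in answer]
        if missing:
            failures.append((case.question, missing))
    return failures


def fake_answer(question: str) -> str:
    """Stand-in for the real RAG pipeline."""
    return "Refunds are available within 30 days of purchase."


cases = [
    Case("What is the refund window?", ["30 days"]),
    Case("Who do I contact for refunds?", ["support@"]),
]
failures = run_regression(fake_answer, cases)
```

Because substring checks tolerate variation in wording, they suit non-deterministic outputs better than exact-match assertions; running the suite on every model, prompt, or data change surfaces regressions before users do.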

Conclusion

Transitioning RAG applications from prototyping to production can be challenging, but with careful planning and the right tools, these hurdles can be overcome. By addressing document management complexities, ensuring data persistence, and implementing robust monitoring and debugging practices, you can deploy RAG applications that are not only powerful but also reliable and maintainable. Remember, the key to success in production is not just leveraging the latest technology but also understanding the nuances of your application and anticipating potential challenges before they arise.

We developed Dewy to incorporate many of these lessons and ensure that you can focus on creating impactful applications without getting bogged down by the underlying complexities.

Dewy is designed to handle document management, data persistence, and operational transparency, allowing you to leverage the power of RAG applications seamlessly.

Check out the docs to learn more!
