Architecting Enterprise AI Series: Architecting an Enterprise RAG (Retrieval Augmented Generation) System

Artificial intelligence (AI) is becoming increasingly important in the enterprise as businesses seek ways to enhance decision-making, automate processes, and gain insights from data. AI systems, which generate outputs that can influence real or virtual environments, can help organizations achieve these objectives.

Enterprise AI is the application of AI technologies to optimize operations, enhance decision-making, and create new products and services. It covers various kinds of AI systems, such as classifiers, generative models, and recommenders.

Enterprise AI aims to leverage these AI capabilities to improve efficiency, reduce costs, gain a competitive advantage, and achieve strategic objectives.

Challenges of Architecting AI Systems

Architecting enterprise AI systems presents unique challenges that call for a deliberate design approach:

  • The AI system must be guarded against queries that fall outside its intended focus.
  • Reasoning errors should not be passed through to the user.
  • The performance of the AI system must be monitored to confirm it stays within company policy and guidelines.

These are among the early challenges in designing a RAG system around an LLM. The source cited at the end of this article offers a good proposal to consider when architecting an Enterprise RAG system.

Building an Enterprise RAG System

An Enterprise RAG system is built from the following key components.

User authentication

The first component in our system! Before the user can even start interacting with the chatbot, we need to authenticate them. Authentication supports security and personalization, both of which are a must for enterprise systems.

Input guardrail

It’s essential to block user inputs that are harmful or contain private information. Recent studies have shown that it’s easy to jailbreak LLMs, and this is where input guardrails come in.
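As a rough illustration of an input guardrail, here is a minimal sketch that flags private information and one obvious jailbreak phrase. The patterns, the phrase list, and the `check_input` name are assumptions made for this example; a production guardrail would need far broader coverage, and likely a dedicated model or service.

```python
import re

# Hypothetical patterns for illustration only; real systems need broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
# Hypothetical jailbreak phrases; attackers will use many more variants.
BLOCKED_PHRASES = ("ignore previous instructions", "ignore all previous instructions")


def check_input(query: str) -> list[str]:
    """Return the reasons a query should be blocked (empty list if it passes)."""
    reasons = []
    for name, pattern in PII_PATTERNS.items():
        if pattern.search(query):
            reasons.append(f"pii:{name}")
    lowered = query.lower()
    if any(phrase in lowered for phrase in BLOCKED_PHRASES):
        reasons.append("prompt_injection")
    return reasons
```

A query is forwarded to the next component only when `check_input` returns an empty list.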

Query rewriter

Once the query passes the input guardrail, we send it to the query rewriter. Sometimes, user queries can be vague or require context to understand the user’s intention better. Query rewriting is a technique that helps with this. It involves transforming user queries to enhance clarity, precision, and relevance.
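One common way to rewrite a query is to ask an LLM to make it self-contained using the chat history. The sketch below assumes a hypothetical `complete` callable that takes a prompt string and returns the model's text; the prompt template is also an assumption, not a recommended wording.

```python
from typing import Callable

# Hypothetical prompt template for illustration.
REWRITE_TEMPLATE = (
    "Rewrite the final user question so it is self-contained.\n"
    "Conversation so far:\n{history}\n"
    "Final question: {query}\n"
    "Rewritten question:"
)


def rewrite_query(query: str, history: list[str], complete: Callable[[str], str]) -> str:
    """Ask the LLM (via `complete`) for a clearer version of the query."""
    prompt = REWRITE_TEMPLATE.format(history="\n".join(history), query=query)
    rewritten = complete(prompt).strip()
    # Fall back to the original query if the model returns nothing usable.
    return rewritten or query
```

Keeping the original query as a fallback means a failed rewrite never leaves the retriever without input.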

Encoder

Once we have the original and rewritten queries, we encode them into vectors (lists of numbers) for retrieval. Choosing an encoder is probably the most important decision in building your RAG system, because retrieval quality depends directly on how well the vectors capture the meaning of the text.
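To make "encoding text into a vector" concrete, here is a toy stand-in that hashes words into a fixed-size, unit-length vector. It is an assumption-laden illustration of the input and output shape only: `toy_encode` is a hypothetical name, and a real system would use a trained embedding model that captures meaning, which this does not.

```python
import hashlib
import math


def toy_encode(text: str, dim: int = 64) -> list[float]:
    """Map text to a unit-length vector by hashing each word into a bucket."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # hashlib gives a stable hash across runs, unlike the built-in hash().
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    # Normalize so that a dot product between two vectors is a cosine similarity.
    return [v / norm for v in vec] if norm else vec
```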

Document ingestion

The document ingestion system manages the processing and persistence of data. During indexing, each document is split into smaller chunks, and each chunk is converted into an embedding using an embedding model. The original chunk and its embedding are then indexed in a database. Two components of the document ingestion system, the chunker and the indexer, are described below.

Chunker

How you chunk (split) long-form text largely determines the quality of your embeddings and the performance of your search system. If chunks are too small, certain questions cannot be answered; if chunks are too long, the answers include noise. You can use summarisation techniques to reduce noise, text size, encoding cost and storage cost.
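A simple baseline is fixed-size chunking with overlap, sketched below. Splitting on whitespace and the default sizes are assumptions for illustration; in practice you would tune them, or split on sentences or document structure instead.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of up to `chunk_size` words, sharing `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the price of some duplicated storage.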

Indexer

The indexer is responsible for creating an index of the documents, a data structure that enables efficient search and retrieval. Efficient indexing is crucial for quick and accurate document retrieval. It involves mapping the chunks or tokens to their corresponding locations in the document collection. Beyond creating the index, the indexer also handles adding, updating, and deleting documents.
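The mapping and the add, update, and delete operations can be sketched with a small in-memory inverted index. The `InvertedIndex` class is a hypothetical illustration that maps lowercase words to chunk ids; a real indexer would sit on top of a search engine or vector database.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy index mapping each token to the ids of the chunks that contain it."""

    def __init__(self) -> None:
        self._postings: dict[str, set[str]] = defaultdict(set)
        self._tokens_by_chunk: dict[str, set[str]] = {}

    def add(self, chunk_id: str, text: str) -> None:
        """Add a chunk; re-adding an existing id acts as an update."""
        self.delete(chunk_id)
        tokens = set(text.lower().split())
        self._tokens_by_chunk[chunk_id] = tokens
        for token in tokens:
            self._postings[token].add(chunk_id)

    def delete(self, chunk_id: str) -> None:
        """Remove a chunk and drop any posting lists left empty."""
        for token in self._tokens_by_chunk.pop(chunk_id, set()):
            self._postings[token].discard(chunk_id)
            if not self._postings[token]:
                del self._postings[token]

    def search(self, token: str) -> set[str]:
        """Return the ids of all chunks containing the token."""
        return set(self._postings.get(token.lower(), set()))
```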

Data storage

Since we are dealing with a variety of data, we need dedicated storage for each type. It’s critical to understand the considerations and specific use cases for every storage type. There are three categories: embeddings, documents, and chat history.

Vector database

The vector database powering the semantic search is a crucial retrieval component of RAG. Selecting this component carefully is vital to avoid potential issues.
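At its core, semantic search ranks stored embeddings by similarity to the query embedding. The brute-force sketch below shows the idea with cosine similarity over an in-memory dictionary; the `top_k` name and the dictionary store are assumptions for illustration, and real vector databases use approximate nearest-neighbour indexes to do this at scale.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], store: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Return the k (chunk_id, score) pairs most similar to the query vector."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:k]]
```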

Generator

The generator is the LLM that produces the final answer. Choosing it requires careful consideration of trade-offs, mainly between self-hosted inference deployment and private API services.

Output guardrail

The output guardrail functions similarly to its input counterpart but is specifically tailored to detect issues in the generated output. It focuses on identifying hallucinations, competitor mentions, and potential brand damage as part of RAG evaluation. The goal is to prevent generating inaccurate or ethically questionable information that may not align with the brand’s values. By actively monitoring and analyzing the output, this guardrail ensures that the generated content remains factually accurate, ethically sound, and consistent with the brand’s guidelines.
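Of the checks listed above, competitor mentions are the easiest to sketch. The competitor names, function names, and fallback message below are hypothetical; detecting hallucinations or brand damage needs much more than keyword matching.

```python
import re

# Hypothetical competitor names for illustration.
COMPETITORS = ("Acme Corp", "Globex")


def find_competitor_mentions(answer: str, competitors: tuple[str, ...] = COMPETITORS) -> list[str]:
    """Return the competitor names that appear in the answer (case-insensitive)."""
    found = []
    for name in competitors:
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, answer, flags=re.IGNORECASE):
            found.append(name)
    return found


def guard_output(answer: str, fallback: str = "Sorry, I can't help with that.") -> str:
    """Serve the answer only if it passes the check; otherwise serve a fallback."""
    return fallback if find_competitor_mentions(answer) else answer
```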

User feedback

Once an output is generated and served, it is helpful to collect both positive and negative feedback from users. User feedback can be very helpful for improving the flywheel of the RAG system, which is a continuous journey rather than a one-time endeavor. This entails not only the routine execution of automated tasks like reindexing and experiment reruns but also a systematic approach to integrating user insights for substantial system enhancements.

Observability

Building a RAG system does not end with putting the system into production. Even with robust guardrails and high-quality data for fine-tuning, models require constant monitoring once in production. Generative AI apps, in addition to standard metrics like latency and cost, need specific LLM observability to detect and correct issues such as hallucinations, out-of-domain queries, and chain failures.
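For the standard metrics, a thin wrapper around the generation call is often a reasonable starting point. This sketch records latency and text sizes per request; `RequestLog`, its field names, and the use of character counts as a rough cost proxy are all assumptions. Detecting hallucinations or chain failures needs dedicated LLM observability on top of this.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RequestLog:
    """Collects one record per generation call."""

    records: list[dict] = field(default_factory=list)

    def track(self, generate: Callable[[str], str], prompt: str) -> str:
        """Call `generate(prompt)`, record basic metrics, and return the response."""
        start = time.perf_counter()
        response = generate(prompt)
        latency = time.perf_counter() - start
        self.records.append({
            "prompt_chars": len(prompt),       # rough proxy for input cost
            "response_chars": len(response),   # rough proxy for output cost
            "latency_s": latency,
        })
        return response
```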

Caching

For companies operating at scale, cost can become a hindrance. Caching is a great way to save money in such cases. It involves storing prompts and their corresponding responses in a database so they can be retrieved for subsequent use. This lets LLM applications serve repeated requests faster and at lower cost.
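The simplest form is an exact-match cache keyed on a normalized prompt, sketched below. `PromptCache` and the `generate` callable are hypothetical, and the in-memory dictionary stands in for the database; real deployments often add expiry and semantic (embedding-based) matching.

```python
import hashlib
from typing import Callable


class PromptCache:
    """Exact-match cache from normalized prompt to response."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        # Lowercase and collapse whitespace so trivial variations share a key.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_generate(self, prompt: str, generate: Callable[[str], str]) -> str:
        """Return the cached response, calling `generate` only on a miss."""
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = generate(prompt)
        self._store[key] = response
        return response
```

Tracking hits and misses makes it easy to check whether the cache is actually saving calls to the LLM.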

Source: https://www.galileo.ai/blog/mastering-rag-how-to-architect-an-enterprise-rag-system