What Is RAG? Definition, How It Works & Real Limits

Q: How does a RAG system actually work?

Documents are split into small pieces called chunks, then vectorized with an embeddings model to capture their meaning. When a user asks a question, that question is turned into a vector and the semantically closest chunks are retrieved from the database. Those chunks are then sent to the language model along with the question so it can generate a response.

Q: What are the limits of RAG?

RAG has three main limits: limited iterative reasoning because it cannot reason in multiple steps to refine its search; strong dependency on data organization and structure; and output quality that depends on the sources — if the documents are incomplete or biased, so will the response. It does not guarantee completeness or truthfulness.

Q: Does RAG work with any LLM?

Yes. RAG is model-agnostic and works with GPT, Claude, Mistral, Llama, and any other LLM, whether cloud-hosted or on-premise. Final quality depends primarily on the embeddings model used for indexing and the LLM used for response generation.

Introduction to RAG (Retrieval-Augmented Generation)¶

Everyone has heard of RAG (Retrieval-Augmented Generation) at this point. But what is RAG exactly? Many people have already implemented it — sometimes with no-code tools or Python libraries like LangChain or LlamaIndex. It's straightforward to set up, but I also see a lot of people disappointed with the results. The thing is, you really need to understand what it's for and how it works before you can tell whether it's the right fit for your use case.

I hadn't originally planned to write another RAG explainer — there are already plenty of resources out there. But talking with people who want to use it in enterprise contexts, I keep noticing the same pattern: everyone rushes past the fundamentals. What does RAG actually do? How does it really work in practice?

So let me walk through the points I usually end up clarifying when someone asks me about it.

RAG is deceptively easy to implement¶

Setting up a RAG is easy. Actually, it's almost too easy — you follow a tutorial, wire up two libraries, and it runs. But the results aren't always what people expect (spoiler: disappointment is common).

That simplicity is a trap. On a very basic question, RAG can look like it's working perfectly. But as soon as you push it slightly outside the obvious cases, the responses start falling apart.

That's where teams often spend a lot of time tinkering, optimizing the wrong components, without really understanding where the actual problem is. And that's normal — like any AI project, RAG is more complex than it appears. You really need to understand how each component works to avoid going in circles. I've written about how to improve a RAG here and about error analysis to understand what's breaking here.

But let's get back to the RAG fundamentals in this article.

What is RAG and why use it in enterprise?¶

Before talking about RAG, you need to understand what RAG is and why we need it. Since ChatGPT arrived, we've all seen how powerful language models (LLMs) can be. But there's a fundamental limitation: they don't know our data. They have no access to a company's internal information.

A simple example: I have documentation on a Word document. If I want to ask an AI questions about that document, I can paste the text into ChatGPT — the AI reads it and responds, because I've put the text directly into what we call the model's context (or the prompt).

The catch is that the context window is limited to a certain number of words (or more precisely, tokens). You can load one page, but not ten thousand at once. In enterprise settings, documents are often voluminous, and you want an AI that can answer questions across all of them.

The solution that emerged: at each question, select only a few relevant excerpts from the documentation and insert them into the model's context. This is, incidentally, the same principle that engines like Perplexity use to reference and cite web content.

This is exactly what RAG does: for each question, it picks the relevant excerpts and adds them to the context so the AI can respond — even across very large document volumes. If you want to read more on this, AWS explains it well here too: (aws.amazon.com)

In practice, setting up a RAG system can seem straightforward at first, but optimizing it to actually work well in production takes time and expertise. That's why I built heeya, a RAG chatbot that can be deployed on any website, with all necessary optimizations already baked in.

How does a RAG system work?¶

Simply put, the first step in RAG is storing the relevant documents in a vector database. But since a language model can't read everything at once (the context limit again), each document is split into small pieces called chunks. Each chunk is then transformed into a vector that captures its meaning — its semantics. This is what vectorization does: it lets you quickly compare the user's question against all the document fragments to find the closest matches.

Vectorizing text means transforming it into a vector that captures its meaning (its semantics). This makes it easy to compare the similarity between a user's question and various document fragments. For instance, if "a white cat" is represented as [1, 1], "a black cat" as [1, 0], and "a black dog" as [7, 1], you can see that "a black cat" is closer to "a white cat" than to "a black dog". This principle lets you quickly identify the most relevant passages to inject into the model's context.

The vectorization step is critical — it's what allows the right documents to be retrieved when a question is asked. For this, we use what are called embedding models. Their job is to transform text into a vector that captures its meaning. These models are themselves AIs, trained specifically for this task. For a deeper look at this concept (what it's actually for, how it works, which model to choose between OpenAI, Mistral, and BGE-M3), I wrote a dedicated article on embeddings, the fundamental building block of modern AI.

As you can see, there's real upstream work involved. RAG performance depends heavily on two things: how you split the documents (the size and method, called chunking), and the quality of the embeddings model you use to turn those pieces into vectors. Those two points alone already give you significant leverage for improving results. Before any of that, however, the documents themselves must be correctly extracted from their source files — PDF parsing is the silent first step that determines everything downstream; I compare the main tools in my article on PDF parsing for RAG.

Once the database is ready, the rest of the pipeline kicks in: when a user asks a question, that question is turned into a vector, and the database is searched for the chunks closest to it — those that semantically resemble the question most. This search compares distances between the question vector and the stored chunk vectors. A set number of chunks are retrieved (defined in advance), and sent to the language model along with the question so it can generate a response.

The final prompt sent to the AI looks something like this:

Here are the relevant chunks for the question: 
Chunk 1: <CHUNK 1 TEXT>
Chunk 2: <CHUNK 2 TEXT>
...

Here is the user's question: In which document can I find information about the engine schedule with reference X2D2E?

At this second stage, several parameters can be tuned to improve the RAG: - Using the question alone for retrieval may not be enough. - The number of retrieved chunks may be too low or too high. - The quality of the LLM generating the response can be improved.

Starting from this baseline, you can begin experimenting and evaluating the RAG — then improve it by analyzing errors and adjusting parameters. If you're at the evaluation stage, I'd recommend reading my article on error analysis to understand what's breaking here.

What is RAG actually useful for? What are its limits?¶

This is, I think, the most important question. RAG cannot answer every question you throw at it, because of certain inherent limitations. In its most basic form — the one I describe here — it can only answer direct questions that target a limited piece of content.

Let me explain. If you ask it a very broad question, the number of retrieved chunks may simply not be enough to answer it properly.

Where RAG works well: - Access to specialized information: internal documents or domain knowledge not available in base models - Providing precise information: RAG is particularly effective for answering targeted questions by going directly to the requested information in the provided documents

Where RAG shows its limits: - Limited iterative reasoning: RAG doesn't know how to reason in multiple steps to refine its search. It doesn't verify whether the retrieved documents are truly the most relevant or whether the information is complete. For a complex question, it will simply surface the semantically closest passages without "understanding" the global context the way a human would. - Dependence on data organization: if documents are poorly structured or poorly indexed, the retrieval will be ineffective. Good organization, metadata, and clear structuring are essential. - Source quality and bias: RAG only relays what it finds. If the documents are incomplete, outdated, or biased, the response will be too. For more on this: elastic.co

Key takeaways: - RAG does not guarantee completeness or truthfulness of responses. There is always a risk of hallucination or error. - To make a RAG reliable, you need to: organize data well, monitor source quality, calibrate parameters (chunking, embeddings, etc.), and enable source citation so information can be verified when needed. - Evaluation and improvement of a RAG happen primarily in real-world conditions, because problems surface in actual use. For a structured approach with quantitative metrics (Hit Rate, MRR, faithfulness, RAGAS) and a complete audit methodology, I dedicated an article to how to evaluate a RAG in production.

Conclusion: Is RAG useful for your AI projects?¶

RAG is genuinely useful — provided you take the time to set it up properly and improve it over time. It's not something you implement in a rush and let run on its own: you need to evaluate it, tune it, and fix what's wrong for it to be truly effective.

The limitations I've outlined aren't easy to eliminate, but there are ways to mitigate them. Agentic RAG is often discussed as a way to address some of these shortcomings. If you're looking for a system that delivers 100% correct answers, RAG (and AI in general) isn't for you. But if you're willing to aim for 90–95% correct responses and invest some time in proper implementation, then RAG can genuinely become your best ally.

There's also a broader question that keeps coming up in 2026: with LLMs now offering context windows in the millions of tokens, do you even need RAG? I address this debate directly in RAG vs long context LLM: is RAG really dead?. And if you've decided RAG is the right direction but aren't sure whether to go with a classic pipeline or a full long-context approach for your specific use case, the long context vs RAG decision framework lays out the real cost and performance trade-offs.

If you want to go deeper on RAG, even the French government has published a guide on implementing RAG: Guide to Retrieval-Augmented Generation (RAG).

FAQ: Frequently Asked Questions About RAG¶

What is RAG in a nutshell? RAG (Retrieval-Augmented Generation) is a technique that lets an LLM answer questions based on your own documents. At each query, the most relevant excerpts are automatically retrieved and injected into the model's context.

What is the difference between RAG and fine-tuning? Fine-tuning modifies the model itself by training it on your data. RAG injects information at query time without modifying the model. RAG is preferable for data that changes frequently or specific domain knowledge. Fine-tuning is better suited for permanently adapting the model's style or behavior. I wrote a complete article to help you choose: training, fine-tuning or RAG: which to pick for your AI?, covering real costs (from a few hundred to several million euros) and a method to avoid the wrong call.

Does RAG work with any LLM? Yes. RAG is model-agnostic. It works with GPT, Claude, Mistral, Llama, and any other LLM, whether cloud-hosted or on-premise. Quality depends on the embeddings model used for indexing and the LLM used for generation.

Does RAG eliminate hallucinations? No. RAG reduces hallucinations by providing the model with relevant context, but does not eliminate them entirely. The LLM can still misinterpret context or generate inaccurate information. The solution: always cite sources and let users verify.

How do you improve a RAG that isn't working well? Start by measuring. Establish a baseline with Hit Rate and faithfulness metrics before touching anything. Then work through the retrieval layer first: check chunking quality, consider hybrid BM25 + vector retrieval for domain-specific corpora, and add a reranker. I cover all 8 optimization levers with measured gains in the RAG optimization guide. For a completely different architecture that addresses the deepest limitations of classic RAG, agentic RAG is worth exploring once the basics are solid.

How many documents can you index in a RAG system? There's no theoretical limit. Industrial RAG systems handle millions of documents. In practice, the constraint comes from indexing costs (embeddings) and response latency. Good metadata filtering allows you to query even very large document bases efficiently.