
RAG

The 7 Wrong Reflexes of RAG Teams (and How to Fix Them)

Introduction

When a RAG project stalls, it's almost never because of a missing technology. It's because of a chain of counter-productive reflexes that teams adopt without realizing it. You tweak the prompt when the problem is in the retrieval. You call it "working" after four manual tests. You stack advanced techniques before you've understood where things are actually breaking.

After roughly twenty RAG projects in consulting and audit engagements, I keep running into the same 7 reflexes. These aren't technical mistakes. They're cognitive biases. But they sabotage performance just as reliably as bad chunking. Here's the list, with the replacement reflex for each one.


PDF Parsing for RAG: Extract Data That Actually Works

The problem nobody wants to face

8 out of 10 RAG systems that fail in production have a parsing problem upstream. Not a model issue, not a prompt issue, not a retriever issue. Just a PDF that was read badly from the start.

That's the pattern I see on almost every project I work on. A company spends weeks choosing its language model, configuring its vector database, tuning its prompts — and the system still misses the mark. Because the source document was misread right at the beginning.

Parsing (structured data extraction from a document) is the most underestimated step in any RAG pipeline. If the extraction from your source files is approximate, the sophistication of everything else doesn't matter: you're building on sand. A badly extracted table, confused columns, an ignored technical diagram, and your LLM generates confidently wrong answers.

In this article, I'll show you why document structuring is so hard, how the 4 major tools on the market actually compare, and what I learned across two very different projects: factory documentation at Continental, and an e-commerce site with thousands of product pages.


Evaluate RAG in Production: Metrics, RAGAS & Audit

80% of the RAGs I audit have no evaluation system

That's a number I wish I could back with an academic citation. But it comes straight from the field: of the production RAG systems I've audited over the past two years, roughly 8 out of 10 have no structured evaluation system in place.

The pattern is always the same. The project gets shipped. The team "checked it manually" on 10 or 15 questions during QA. User feedback seems fine. And then nobody measures anything again.

The hidden cost of this gap is enormous. You don't know if the RAG is drifting after a document update. You don't know if a change in your embedding model broke something. You don't know whether the improvements you're making are actually gains, or just compensating for a regression somewhere else. You're optimizing blind.

This is the single biggest thing that separates a RAG proof-of-concept from a mature production system. A POC "works". A production system gets measured, monitored, and improved in a controlled way. This article covers the RAG metrics that actually matter, evaluation frameworks (RAGAS, DeepEval, TruLens), how to build a solid evaluation dataset, and how to set up continuous evaluation in production.


How to Optimize RAG: 8 Techniques with Measured Gains

You're probably optimizing in the wrong place

When a RAG isn't working well, here's what 90% of teams do: they change the prompt.

They rephrase the instructions, try different models, adjust the temperature. And sometimes it helps a little. But most of the time, that's not where the problem is.

Jason Liu, one of the most followed RAG experts, has a framing I find spot-on: "Before touching anything, reach 97% recall in retrieval."

97% recall means that in 97 out of 100 cases, the chunk containing the right answer is among the results you pass to the LLM. If you're not there, the best prompt in the world won't change a thing. The LLM cannot invent information that isn't in its context.
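To make that definition concrete, here is a minimal sketch of how recall could be computed. It assumes each test question has exactly one "gold" chunk id and that your retriever returns a ranked list of chunk ids; the function name `recall_at_k` and the toy ids are mine, not from any library.

```python
from typing import Sequence


def recall_at_k(
    retrieved: Sequence[Sequence[str]],
    expected: Sequence[str],
    k: int,
) -> float:
    """Fraction of queries whose gold chunk id appears in the top-k results."""
    if len(retrieved) != len(expected):
        raise ValueError("one expected chunk id per query is required")
    if not expected:
        return 0.0
    hits = sum(1 for ids, gold in zip(retrieved, expected) if gold in ids[:k])
    return hits / len(expected)


# Toy data (hypothetical): three queries, each with the ranked ids a retriever returned.
retrieved = [["c1", "c7", "c3"], ["c4", "c2", "c9"], ["c5", "c6", "c8"]]
expected = ["c3", "c9", "c0"]
score = recall_at_k(retrieved, expected, k=3)  # two of the three gold chunks are found
```

With a real retriever, the only change would be how `retrieved` is filled in. The 97% target then simply means `score >= 0.97` on your evaluation set.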

The real RAG optimization order is: measure first, then retrieval, then generation. Not the other way around. If you're not yet familiar with the basics of how RAG works, start there before optimizing any component.


Optimal RAG Chunking: 8 Strategies & Real Benchmarks

The chunking you're probably using is the worst one tested

Let me start with a result that surprised me when I first saw it.

Chroma Research published a benchmark comparing common chunking strategies. Among the configurations tested were the default OpenAI Assistants parameters: 800 tokens with 400 tokens of overlap. Their verdict is unambiguous: it's the configuration with the lowest precision across all tests, at 1.4%. Their exact comment: "particularly poor recall-efficiency tradeoffs".

These are the parameters tens of thousands of projects are using right now, often because it's what the LangChain or LlamaIndex quick start suggests.

Meanwhile, a much simpler configuration with chunks 4x smaller (200 tokens, zero overlap) performs 3.7x better on precision.

Chunking is the decision most teams spend the least time on. And yet it's probably the one with the highest impact on your RAG quality.
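To see what those two configurations do to the same document, here is a minimal sketch of fixed-size chunking with overlap. It assumes the text is already split into tokens (a plain list stands in for a real tokenizer), and `chunk_tokens` is a hypothetical helper, not the splitter used in the benchmark.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# A stand-in 2,000-token document (assumption: one list item = one token).
tokens = [f"t{i}" for i in range(2000)]
wide = chunk_tokens(tokens, size=800, overlap=400)
narrow = chunk_tokens(tokens, size=200, overlap=0)

# Total tokens stored in the index for each configuration.
wide_stored = sum(len(c) for c in wide)
narrow_stored = sum(len(c) for c in narrow)
```

On this toy document, the 800/400 setting indexes every token more than once and returns large chunks where most of the content is irrelevant to any single question, which is one plausible reading of why its precision is so low. This sketch does not reproduce the benchmark itself.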


Hybrid RAG: BM25 + Vector Search With +10% Recall

Your vector RAG is missing questions you don't even know about

It's a comment I hear often on RAG projects: "It works well in general, but sometimes it finds nothing on questions that seem straightforward."

Concrete example: "What is the ISO-27001 procedure for remote access?" → 0 relevant results.

Vector search encodes meaning. But when a query contains an exact identifier — a standard name, a product code, a domain acronym — semantic encoding fails completely.

This is what's called vocabulary mismatch. And it's the problem hybrid search solves.
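As an illustration of how the two result lists can be merged, here is a minimal sketch using reciprocal rank fusion (RRF). RRF is one common fusion method, chosen here as an assumption; the document ids are made up, and `k=60` is the constant usually quoted for RRF rather than a tuned value.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists: each document scores 1 / (k + rank) per list it appears in."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for "What is the ISO-27001 procedure for remote access?"
bm25_ranking = ["iso_27001_remote", "vpn_policy"]  # keyword search matches the identifier
vector_ranking = ["remote_work_faq", "vpn_policy", "hr_onboarding"]  # semantic search misses it
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

The document that only the keyword search found still makes it into the fused list, ahead of weaker semantic matches, which is the behavior you want for exact identifiers.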


Reranker for RAG: Cohere, BGE, Jina, Voyage Compared

Hybrid retrieval finds the right chunks. The reranker puts them in the right order.

You have implemented hybrid BM25 + vector retrieval. Your recall@10 is decent. And yet the LLM produces mediocre answers: the relevant information is there in the top-10 chunks, but it sits at rank 8 or 9. The LLM ignores it or dilutes it in the noise from the chunks above.

That is the problem a reranker solves. Not recall. Precision. Not "find it," but "put what matters first."

In this article I compare the four most widely used rerankers in production (Cohere, BGE, Jina, Voyage) alongside the notable newcomers from 2025-2026, with public benchmark figures, real pricing, and a direct recommendation by project profile.

Securing a RAG: prompt injection, data leaks, RBAC

Securing a RAG is simpler than a classic security audit, and harder than you think

A RAG in production chains three components: a retriever that searches your documents, a context injected into a prompt, and an LLM that generates a response. Each of those three links is a distinct attack vector. Ignore any one of them, and your system is vulnerable, even if the other two are perfectly secured.

The good news: half of the guardrails cost nothing. The bad news: the other half requires genuine architectural rework if you did not think about it from the start.
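One example of a guardrail on the retriever link is filtering chunks by the user's roles before anything reaches the prompt. The sketch below is an assumption about how that could look: the `Chunk` structure, the `allowed_roles` field, and the role names are all hypothetical, and a real system would typically push this filter into the vector database query.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_roles: FrozenSet[str]  # roles permitted to read the source document


def filter_by_role(chunks: Iterable[Chunk], user_roles: Iterable[str]) -> List[Chunk]:
    """Keep only chunks the user may read; run this before building the LLM context."""
    roles = set(user_roles)
    return [chunk for chunk in chunks if chunk.allowed_roles & roles]


# Hypothetical retrieved chunks for a single query.
retrieved = [
    Chunk("c1", "Public travel policy", frozenset({"employee", "hr"})),
    Chunk("c2", "Salary grid 2024", frozenset({"hr"})),
]
visible = filter_by_role(retrieved, user_roles=["employee"])
```

The point of filtering at this stage is that a chunk the LLM never sees cannot leak, whatever the prompt says.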

LLM-as-a-judge: when to use it, with the real cost in €

What an LLM-as-a-judge is, in one quotable sentence

An LLM-as-a-judge is a second language model that evaluates the output of a first model against an explicit set of criteria: relevance, faithfulness to sources, completeness, tone. It produces a score and a justification. That's it.

The mechanism is useful. But it is expensive, slow, and biased if applied without discernment. The question is not "should I use an LLM judge" but "at which point in my pipeline, at what frequency, with which model."

The rule I apply on my engagements: deterministic tests first, the LLM judge as a last resort, never inside the fast development loop.
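That rule can be sketched as a two-stage check. Everything named here is an assumption for illustration: `evaluate_answer`, the `required_terms` check, the 0.5 threshold, and the `fake_judge` stub standing in for a real (paid, slow) model call.

```python
from typing import Callable, List, Optional, Sequence


def evaluate_answer(
    answer: str,
    required_terms: Sequence[str],
    judge: Optional[Callable[[str], float]] = None,
    threshold: float = 0.5,
) -> dict:
    """Run cheap deterministic checks first; call the LLM judge only if they pass."""
    # Stage 1: deterministic checks, cheap enough to run on every change.
    if not answer.strip():
        return {"passed": False, "stage": "deterministic", "reason": "empty answer"}
    lowered = answer.lower()
    missing = [term for term in required_terms if term.lower() not in lowered]
    if missing:
        return {"passed": False, "stage": "deterministic", "reason": f"missing terms: {missing}"}
    # Stage 2: the judge is the last resort, reached only when stage 1 passes.
    if judge is None:
        return {"passed": True, "stage": "deterministic", "reason": "no judge configured"}
    score = judge(answer)
    return {"passed": score >= threshold, "stage": "judge", "reason": f"judge score {score:.2f}"}


calls: List[str] = []


def fake_judge(answer: str) -> float:
    """Stub standing in for a real LLM call; records each invocation."""
    calls.append(answer)
    return 0.9


failed = evaluate_answer("Use the VPN.", ["ISO-27001"], judge=fake_judge)
passed = evaluate_answer("ISO-27001 requires MFA for remote access.", ["ISO-27001"], judge=fake_judge)
```

In this run the judge is invoked once, not twice: the first answer is rejected by the deterministic stage, so no judge cost is incurred for it.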

Build a RAG evaluation dataset in 30 minutes

An imperfect dataset beats having no measurement at all

No weeks of annotation needed, no domain expert on call from day one. In 30 minutes, you can generate a usable starting dataset directly from your chunks, measure recall@k, and kick off a first improvement cycle.

That dataset will be imperfect. That's normal and acceptable. The goal isn't perfection: it's to have a reproducible measurement rather than nothing. A recall@5 of 0.71 measured on 50 synthetic questions already tells you infinitely more than "it seems to work in the demo."

The method described here runs in four steps: generate questions from your chunks, compute recall@k, iterate on retrieval (hill climbing), and feed "not relevant" feedback back as hard negatives for the reranker. For generation metrics (faithfulness, answer relevancy, context recall) and the choice between RAGAS, DeepEval, and TruLens, see Evaluate RAG in production: metrics & RAGAS.
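The last of those four steps could look something like the sketch below. The feedback format (question, chunk id, label) and the triplet output are assumptions of mine, since the exact shape depends on how you collect feedback and which reranker you fine-tune.

```python
from typing import Dict, Iterable, List, Tuple

Feedback = Tuple[str, str, str]  # (question, chunk_id, label)
Triplet = Tuple[str, str, str]   # (question, positive_chunk_id, negative_chunk_id)


def build_hard_negatives(feedback: Iterable[Feedback], gold: Dict[str, str]) -> List[Triplet]:
    """Turn "not relevant" feedback into (question, positive, negative) training triplets."""
    triplets: List[Triplet] = []
    for question, chunk_id, label in feedback:
        positive = gold.get(question)
        # Skip questions without a known gold chunk, and never use the gold chunk as a negative.
        if label == "not relevant" and positive is not None and chunk_id != positive:
            triplets.append((question, positive, chunk_id))
    return triplets


# Hypothetical data: gold chunks come from the synthetic dataset, feedback from users.
gold = {"How do I reset my badge?": "c12"}
feedback = [
    ("How do I reset my badge?", "c40", "not relevant"),
    ("How do I reset my badge?", "c12", "relevant"),
    ("Unknown question", "c7", "not relevant"),
]
triplets = build_hard_negatives(feedback, gold)
```

Chunks that were retrieved but judged irrelevant are "hard" negatives precisely because the retriever thought they matched, which makes them more informative for a reranker than random chunks.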