

Optimize Your RAG: The 8 Techniques That Make a Real Difference

You're probably optimizing in the wrong place

When a RAG pipeline isn't working well, here's what 90% of teams do: they change the prompt.

They rephrase the instructions, try different models, adjust the temperature. And sometimes it helps a little. But most of the time, that's not where the problem is.

Jason Liu, one of the most followed RAG experts, has a framing I find spot-on: "Before touching anything, reach 97% recall in retrieval."

97% recall means that in 97 out of 100 cases, the chunk containing the right answer is among the results you pass to the LLM. If you're not there, the best prompt in the world won't change a thing. The LLM cannot invent information that isn't in its context.

The real RAG optimization order is: measure first, then retrieval, then generation. Not the other way around.
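
To make "measure first" concrete, here is a minimal sketch of a recall@k harness. Everything in it is a placeholder invented for illustration: the toy corpus, the eval set, and the word-overlap `retrieve` function stand in for your real vector store and your own labeled question/chunk pairs.

```python
import re

# Toy corpus: chunk_id -> chunk text. Illustrative placeholder only;
# swap in your own index and your own labeled question/chunk pairs.
CORPUS = {
    "doc1_chunk4":  "SSO is included in the Business and Enterprise plans.",
    "doc3_chunk12": "The refund window is 30 days from the date of purchase.",
    "doc7_chunk2":  "Support is available by email, 24 hours a day.",
}

# Each eval case: a question plus the ID(s) of the chunk(s) that answer it.
EVAL_SET = [
    {"question": "What is the refund window?", "relevant_ids": {"doc3_chunk12"}},
    {"question": "Which plans include SSO?",   "relevant_ids": {"doc1_chunk4"}},
]

def tokenize(text: str) -> set[str]:
    """Lowercase word set, punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question: str, k: int = 10) -> list[str]:
    """Stand-in retriever: ranks chunks by word overlap with the question.
    Replace with your vector store / BM25 / hybrid search."""
    q = tokenize(question)
    ranked = sorted(CORPUS, key=lambda cid: len(q & tokenize(CORPUS[cid])),
                    reverse=True)
    return ranked[:k]

def recall_at_k(eval_set, k: int = 10) -> float:
    """Fraction of questions with at least one relevant chunk in the top k."""
    hits = sum(
        1 for case in eval_set
        if set(retrieve(case["question"], k=k)) & case["relevant_ids"]
    )
    return hits / len(eval_set)

print(f"recall@2 = {recall_at_k(EVAL_SET, k=2):.0%}")
```

The only signal that matters here is hit-or-miss within the top k: if the right chunk never shows up in the retrieved set, nothing you do downstream in the prompt can recover it.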


RAG Chunking Strategies: The Definitive Guide to Optimal Chunking

The chunking you're probably using is the worst one tested

Let me start with a result that surprised me when I first saw it.

Chroma Research published a benchmark comparing all the common chunking strategies. They tested the default OpenAI Assistants parameters: 800 tokens per chunk, 400 tokens of overlap. Their verdict is unambiguous: it is the configuration with the lowest precision across all their tests, at 1.4%. Their exact comment: "particularly poor recall-efficiency tradeoffs".

These are the parameters tens of thousands of projects are using right now, often because it's what the LangChain or LlamaIndex quick start suggests.

Meanwhile, configurations with chunks a quarter the size (200 tokens, zero overlap) deliver 3.7x better precision.
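
For reference, here is a minimal sketch of a fixed-size token chunker built on tiktoken; the `document` string is a placeholder. The 200/0 call mirrors the simple configuration above, while the 800/400 call reproduces the criticized default.

```python
import tiktoken

def chunk_by_tokens(text: str, chunk_size: int = 200, overlap: int = 0) -> list[str]:
    """Split text into fixed-size token windows with optional overlap."""
    assert 0 <= overlap < chunk_size, "overlap must be smaller than chunk_size"
    enc = tiktoken.get_encoding("cl100k_base")  # tokenizer for GPT-4-era models
    tokens = enc.encode(text)
    step = chunk_size - overlap  # with overlap, consecutive windows share tokens
    return [enc.decode(tokens[i : i + chunk_size])
            for i in range(0, len(tokens), step)]

document = "Replace this with your own source text. " * 200  # placeholder

# The simple configuration from the benchmark vs. the criticized default.
simple_chunks  = chunk_by_tokens(document, chunk_size=200, overlap=0)
default_chunks = chunk_by_tokens(document, chunk_size=800, overlap=400)
print(f"200/0 -> {len(simple_chunks)} chunks; 800/400 -> {len(default_chunks)} chunks")
```

If you are on LangChain, the equivalent setting is typically expressed as RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=200, chunk_overlap=0).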

Chunking is the decision most teams spend the least time on. And yet it's probably the one with the highest impact on your RAG quality.