Aller au contenu

2026

Reranker for RAG: Cohere, BGE, Jina, Voyage Compared

Hybrid retrieval finds the right chunks. The reranker puts them in the right order.

You have implemented hybrid BM25 + vector retrieval. Your recall@10 is decent. And yet the LLM produces mediocre answers: the relevant information is there in the top-10 chunks, but it sits at rank 8 or 9. The LLM ignores it or dilutes it in the noise from the chunks above.

That is the problem a reranker solves. Not recall. Precision. Not "find it," but "put what matters first."

In this article I compare the four most widely used rerankers in production (Cohere, BGE, Jina, Voyage) alongside the notable newcomers from 2025-2026, with public benchmark figures, real pricing, and a direct recommendation by project profile.

Securing a RAG: prompt injection, data leaks, RBAC

Securing a RAG is simpler than a classic security audit, and harder than you think

A RAG in production chains three components: a retriever that searches your documents, a context injected into a prompt, and an LLM that generates a response. Each of those three links is a distinct attack vector. Ignore any one of them, and your system is vulnerable, even if the other two are perfectly secured.

The good news: half of the guardrails cost nothing. The bad news: the other half requires genuine architectural rework if you did not think about it from the start.

Multi-Agent Systems: What Actually Works

Multi-agent systems are usually the first architecture people reach for. Specialized agents, an orchestrator that distributes tasks, clean hand-offs between roles. On paper, it looks elegant.

In production, it is a different story.

According to the MAST study published by UC Berkeley in March 2025, based on 1,600 execution traces, multi-agent systems fail between 41% and 86.7% of the time depending on the framework. And when they fail, the problem rarely comes from the model itself: it comes from the architecture.

Here is what the data actually says, and how to decide whether you need multiple agents or one well-equipped single agent.

CrewAI vs LangGraph vs Pydantic AI : Honest 2026 Pick

Every three months, a new AI agent framework drops and makes the front page of Reddit and Hacker News. CrewAI. LangGraph. AutoGen. Pydantic AI. Smolagents. And now Mastra, Agno, Letta, OpenAI Agents SDK, Inferable... The list grows every quarter.

The question everyone asks: which one should I pick?

The trap is believing there's a "best framework." The truth is that these tools don't target the same audience. And some of them are genuinely not built for serious data scientists who want to understand, optimize, and control what they build.

In this article, I'll walk through the five main frameworks — their real strengths, their concrete weaknesses, and who each one is honestly suited for. Plus a few outsiders worth knowing. And a direct recommendation on what I actually use on client engagements.

LLM-as-a-judge: when to use it, with the real cost in €

What an LLM-as-a-judge is, in one quotable sentence

An LLM-as-a-judge is a second language model that evaluates the output of a first model against an explicit set of criteria: relevance, faithfulness to sources, completeness, tone. It produces a score and a justification. That's it.

The mechanism is useful. But it is expensive, slow, and biased if applied without discernment. The question is not "should I use an LLM judge" but "at which point in my pipeline, at what frequency, with which model."

The rule I apply on my engagements: deterministic tests first, the LLM judge as a last resort, never inside the fast development loop.

Testing an LLM with unit tests: regex, length, entities

Before paying for an LLM judge, test like a developer

Before reaching for an LLM-as-judge at $0.60 per million tokens, 80% of regressions in an LLM system are detectable with free, instantaneous assertions: incorrect output format, response too short, expected entity missing, invalid JSON, forbidden word present. These checks do not require AI to evaluate AI. They take 10 lines of Python and plug into any CI/CD pipeline with pytest.

This is the approach I apply systematically before setting up a semantic evaluator on client projects. This article covers the assertions that catch the most bugs, how to organize them into a pytest suite, and when you actually need to move up to the next level.

Build a RAG evaluation dataset in 30 minutes

An imperfect dataset beats having no measurement at all

No weeks of annotation needed, no domain expert on call from day one. In 30 minutes, you can generate a usable starting dataset directly from your chunks, measure recall@k, and kick off a first improvement cycle.

That dataset will be imperfect. That's normal and acceptable. The goal isn't perfection: it's to have a reproducible measurement rather than nothing. A recall@5 of 0.71 measured on 50 synthetic questions already tells you infinitely more than "it seems to work in the demo."

The method described here runs in four steps: generate questions from your chunks, compute recall@k, iterate on retrieval (hill climbing), and feed "not relevant" feedback back as hard negatives for the reranker. For generation metrics (faithfulness, answer relevancy, context recall) and the choice between RAGAS, DeepEval, and TruLens, see Evaluate RAG in production: metrics & RAGAS.

Embeddings in RAG : What They Are & Why They Matter

No embeddings, no ChatGPT answering questions about your documents. No semantic search that finds an article even when you type synonyms. No AI agent that remembers what you told it last week.

Embeddings are the foundational building block of all modern AI. And yet, in the vast majority of projects I work on, they're the least well understood component. Teams use them — often without really knowing why — and then wonder why results are disappointing.

In this article, I'll explain what embeddings actually are, how they work at a high level, why they matter so much, how to choose the right model in 2026, and the concrete pitfalls to avoid. Whether you're a manager or a developer, you should come away with a solid understanding of the topic.

The 7 Wrong Reflexes of RAG Teams (and How to Fix Them)

Introduction

When a RAG project stalls, it's almost never because of a missing technology. It's because of a chain of counter-productive reflexes that teams adopt without realizing it. You tweak the prompt when the problem is in the retrieval. You call it "working" after four manual tests. You stack advanced techniques before you've understood where things are actually breaking.

After roughly twenty RAG projects in consulting and audit engagements, I keep running into the same 7 reflexes. These aren't technical mistakes. They're cognitive biases. But they sabotage performance just as reliably as bad chunking. Here's the list, with the replacement reflex for each one.

Prompt Caching: Cut Your LLM Bill by Up to 90% in 2026

If you're paying full price for LLM calls in 2026, you're leaving 50 to 90% savings on the table

Prompt caching has become the first cost optimization to implement in any LLM project running in production, and oddly enough, nobody talks about it enough.

What I keep seeing on the projects I work on: teams spend hours comparing models, negotiating volume discounts with providers, looking for open-source alternatives. And all along, their code is paying full price for the same 10,000-token system prompt on every single call, without ever having heard of cache_control.

In this article, you'll see how prompt caching works at the technical level (the KV cache), how the three major providers implement it differently (Anthropic, OpenAI, Gemini), the patterns that genuinely cut the bill, and a concrete ROI calculation on a real use case.