Aller au contenu

Hybrid RAG : BM25 + Vector Search With +10% Recall

Your vector RAG is missing questions you don't even know about

It's a comment I hear often on RAG projects: "It works well in general, but sometimes it finds nothing on questions that seem straightforward."

Concrete example: "What is the ISO-27001 procedure for remote access?" → 0 relevant results.

Vector search encodes meaning. But when a query contains an exact identifier — a standard name, a product code, a domain acronym — semantic encoding fails completely.

This is what's called vocabulary mismatch. And it's the problem hybrid search solves.


3 types of queries where vector search alone fails

An embedding model encodes meaning, semantic associations, conceptual proximity. That's powerful. If you want to understand concretely how text becomes a vector and why that representation is so powerful, I have an article that covers the fundamentals: embeddings in RAG.

But it fails systematically on 3 categories of queries.

1. Domain jargon and standards

"DTU 31.2", "CSRD", "IFRS 9", "ISO-27001 annex A.9". These identifiers have no meaning in the semantic sense — they are labels. A general-purpose embedding model doesn't know that "ISO-27001 annex A.9" is about access control, and will encode it as a character sequence with no strong relationship to the corresponding document's content.

2. Proper nouns, product references, error codes

"ERROR_CODE_403b", "Product Ref X-2247-FR", "endpoint /v2/users/batch". Embeddings encode these, but poorly: the model doesn't understand that exact matching is what matters here, not semantic proximity.

3. Mixed queries

"Configure Redis in cluster mode on AWS with SASL authentication." Half the query is conceptual, the other half is technical and textual. Pure vector search handles this asymmetry badly.

And conversely, here are the 3 cases where BM25 alone fails:

  • Conceptual questions ("what is a multi-agent architecture?")
  • Synonyms and paraphrases ("way to store" ≠ "method to save" for BM25)
  • Natural multilingual language

The conclusion is clear: the failure modes of BM25 and vector search are not correlated. What one misses, the other often finds. And that's exactly why combining them works.


How BM25 works (without the math)

BM25 (Best Matching 25) is a text search algorithm. The core idea: a rare word in a short document is highly relevant.

Two parameters do all the work:

  • k1 (1.2 by default): "frequency saturation". After the 5th occurrence of a word in a document, the marginal contribution becomes nearly zero. If "Redis" appears 20 times in a doc, that's not 20x more relevant than if it appears twice.
  • b (0.75 by default): length normalization. A 50-word doc containing "Redis" is more relevant than a 5,000-word doc that also contains it.

These defaults have been validated across dozens of benchmarks. In the vast majority of projects, you don't need to touch them.

The modern view: BM25 as a sparse vector

This concept is important for understanding hybrid architectures. BM25 can be represented as a sparse vector: a vector whose dimension equals the vocabulary size, with non-zero values only for words present in the document.

This "sparse" view is what unifies BM25 and dense vector search in a single architecture: both produce vectors, but of very different natures.

BM25 / Sparse Dense vector
Representation Sparse vector (full vocab) Dense vector (768–1536 dims)
Captures Exact matches Meaning, context, synonyms
Strength Jargon, codes, proper nouns Conceptual questions
Weakness Synonyms, paraphrases Exact identifiers
Cost Very fast (CPU) Slower (GPU recommended)

And if BM25 isn't enough (I explain when at the end), there are smarter sparse alternatives: SPLADE and BGE-M3.


Reciprocal Rank Fusion: the fusion algorithm that actually works

When you have two result lists (one from BM25, one from vector search), how do you combine them?

The naive first idea: add the scores. That doesn't work. BM25 scores and cosine similarity scores don't share the same range or distribution. Adding a BM25 score of 15.3 to a cosine similarity of 0.82 has no mathematical meaning.

The solution: Reciprocal Rank Fusion (RRF)

The idea is elegant: forget the scores, keep only the ranks.

The formula: RRF_score(doc) = 1 / (k + rank_BM25) + 1 / (k + rank_vector)

With k = 60 (the default in Elasticsearch and Azure AI Search).

A step-by-step example:

Document BM25 rank Vector rank RRF score
Doc A 1st 2nd 1/(60+1) + 1/(60+2) = 0.0325
Doc B 3rd 1st 1/(60+3) + 1/(60+1) = 0.0323
Doc C 2nd absent 1/(60+2) + 0 = 0.0161

Doc A comes out on top because it ranks well in both systems. Doc C is absent from vector search: it loses to docs present in both lists, even though it was 2nd in BM25.

That's the robustness of RRF: a document that's strong in only one list can still surface, but a moderately good document in both will almost always beat it.

graph LR
    Q["💬 Query"] --> B["🔤 BM25\n(sparse)"]
    Q --> V["🧠 Embedding\n(dense)"]
    B --> BL["📋 Ranked list\nBM25"]
    V --> VL["📋 Ranked list\nvector"]
    BL --> RRF["⚖️ Reciprocal\nRank Fusion"]
    VL --> RRF
    RRF --> R["📊 Top-K\nmerged"]
    R --> LLM["💡 LLM → Answer"]

    style Q fill:#fef3c7,stroke:#d97706,color:#000
    style B fill:#fce7f3,stroke:#db2777,color:#000
    style V fill:#e0e7ff,stroke:#4338ca,color:#000
    style BL fill:#fce7f3,stroke:#db2777,color:#000
    style VL fill:#e0e7ff,stroke:#4338ca,color:#000
    style RRF fill:#f3f4f6,stroke:#6b7280,color:#000
    style R fill:#dbeafe,stroke:#2563eb,color:#000
    style LLM fill:#bbf7d0,stroke:#16a34a,color:#000

k = 60 vs k = 2: what's the difference?

  • High k (60) → high-rank positions carry less weight. The final list balances both signals more evenly.
  • Low k (2) → top-ranked positions dominate heavily. Qdrant uses k=2 by default, which strongly favors the best results from a single list.

Alternative: RelativeScoreFusion (Weaviate)

Weaviate proposes a different approach: normalize each method's scores between 0 and 1, then add them. Their benchmark on FIQA (a financial Q&A dataset) shows +6% recall vs RRF. It's worth testing both on your data before committing.


The benchmarks that justify implementing it

I'm not asking you to take my word for it. Here are the numbers.

Microsoft Azure AI Search (BEIR + client data)

Method NDCG@3 Gain vs vector only
BM25 only 40.6 −7%
Vector only 43.8
Hybrid RRF 48.4 +10%
Hybrid + reranker 60.1 +37%

Elasticsearch (BEIR benchmark)

Hybrid: +1.4% vs vector only, +18% vs BM25 only.

LlamaIndex (Hit Rate on Q&A benchmark)

The hybrid + reranker combination achieves a Hit Rate of 0.938. That's the top score in their benchmark, ahead of vector only (0.891) and BM25 only (0.869).

The takeaway: hybrid consistently improves on pure vector search, especially on data with domain jargon. Add a reranker on top and you multiply the effect further.


Implementation: 3 stacks in practice

Stack 1 — LangChain + BM25 + FAISS (open-source, minimal)

This is the entry point. Perfect for a POC or a fully local stack with no external database.

from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Your pre-split documents
documents = [...]

# BM25 retriever
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 5

# FAISS vector retriever
embeddings = OpenAIEmbeddings()
faiss_store = FAISS.from_documents(documents, embeddings)
vector_retriever = faiss_store.as_retriever(search_kwargs={"k": 5})

# Fusion with RRF (c=60 by default in LangChain)
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5]  # 50/50 BM25 / vector
)

results = ensemble_retriever.invoke("your question")

That's 15 lines of code. RRF is handled natively by EnsembleRetriever.

Main limitation: BM25 is recomputed in memory on every restart. For production, you need to persist the BM25 index separately (with pickle or an external solution).


Stack 2 — LlamaIndex + QueryFusionRetriever

LlamaIndex has a native implementation with async support, useful if you want to parallelize both searches to reduce latency.

from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import VectorIndexRetriever, QueryFusionRetriever

# Your existing indexes
vector_retriever = VectorIndexRetriever(index=vector_index, similarity_top_k=5)
bm25_retriever = BM25Retriever.from_defaults(
    docstore=vector_index.docstore,
    similarity_top_k=5
)

# Fusion
hybrid_retriever = QueryFusionRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    similarity_top_k=5,
    num_queries=1,      # no multi-query here, just fusion
    use_async=True,     # BM25 and vector in parallel
    mode="reciprocal_rerank",
)

results = await hybrid_retriever.aretrieve("your question")

The use_async=True matters: both retrievers run in parallel, cutting latency roughly in half in practice.


Stack 3 — Weaviate (production, all-in-one)

Weaviate handles BM25, vector search, and fusion natively. It's the cleanest stack for large-scale production: everything lives in the database, no fusion logic in your application code.

import weaviate
from weaviate.classes.query import MetadataQuery, HybridFusion

client = weaviate.connect_to_local()
collection = client.collections.get("Documents")

# Hybrid query with RelativeScoreFusion
results = collection.query.hybrid(
    query="your question",
    alpha=0.5,           # 0 = pure BM25, 1 = pure vector, 0.5 = balanced
    fusion_type=HybridFusion.RELATIVE_SCORE,
    query_properties=["content", "title^2"],  # title boost
    limit=5,
    return_metadata=MetadataQuery(score=True, explain_score=True)
)

for obj in results.objects:
    print(obj.properties["content"])
    print(f"Hybrid score: {obj.metadata.score}")

The alpha parameter is your main lever in production: - Increase alpha for conceptual questions (more vector weight) - Decrease alpha when queries contain heavy jargon or identifiers (more BM25 weight)

When to use it: high volumes, metadata filtering combined with hybrid search, multi-tenancy, or when you want to avoid managing BM25 separately in your codebase.


Situation Recommendation
Domain jargon, acronyms, standards Hybrid (BM25 essential)
Pure conceptual questions Vector only
Proper nouns, product codes, references Hybrid
Queries with typos Vector (more robust to misspellings)
Code patterns, regex, logs BM25 only
Multilingual corpus Hybrid + BGE-M3
General-purpose production without profiling Hybrid by default

My personal rule: in general-purpose production, I always start with hybrid. The added cost is marginal (BM25 is nearly free compared to embedding) and the gains on edge cases always justify the effort.


Going further: SPLADE and BGE-M3

BM25 is an excellent baseline, but it has one obvious limitation: it doesn't understand synonyms. "Send" and "transmit" are two different tokens for BM25 — no proximity score between them.

Two modern alternatives are worth knowing.

SPLADE (Sparse Lexical and Expansion)

SPLADE is a model trained to produce "intelligent" sparse vectors. Unlike BM25, it performs term expansion: if a document talks about "car", SPLADE can add weight to "automobile", "vehicle", "auto". It keeps the sparse structure (fast, filterable) while capturing some of the semantics.

When to consider it: your corpus has many domain-specific synonyms, or BM25 misses terminological variants too frequently.

BGE-M3 (BAAI/bge-m3)

This is the most versatile model released in 2024. A single model simultaneously produces: - Dense vectors (classic embeddings) - Sparse vectors (SPLADE-style) - ColBERT scores (for reranking)

100 languages. 8192-token context. You use it as a dense retriever, sparse retriever, and reranker — with a single model to manage.

When to consider it: multilingual corpus, or when you want to simplify your stack by deploying and maintaining just one model.

BM25 SPLADE BGE-M3 sparse
Term expansion No Yes Yes
Training required No Yes (pre-trained model) Yes (pre-trained model)
Speed Very fast Fast Moderate
Multilingual Depends on tokenizer Limited Yes (100 languages)
Ideal use case Baseline, exact jargon Domain synonyms Multilingual production

FAQ

Does hybrid search slow down queries?

The BM25 part is near-instantaneous (CPU computation on an inverted index). The vector search is the bottleneck, same as in a classic RAG. In practice, a well-implemented hybrid adds 5 to 15ms of latency vs vector only — negligible in the vast majority of RAG architectures.

Which vector database should I choose for hybrid search?

Weaviate and Qdrant have the best native support today. Weaviate offers alpha, RelativeScoreFusion, and BM25F (BM25 with per-field boosting). Qdrant has its own RRF with k=2. Elasticsearch remains the reference if you already run an ELK stack. FAISS and Chroma require managing BM25 separately on the application side (which is what LangChain does with EnsembleRetriever).

Do I need to fine-tune BM25 for my domain?

Rarely necessary. The default k1 and b parameters have been validated on very diverse corpora. What matters more: tokenizer quality (handling accents, hyphens, and the casing in your data) and preprocessing (stop words relevant to your domain). For highly specialized jargon, BM25 doesn't even need stemming — exact tokens are precisely what you're looking for.

Does BM25 handle languages other than English well?

Yes, with an appropriate tokenizer. LangChain uses rank_bm25 with the Snowball stemmer, which covers most languages well. For technical jargon or acronyms, stemming is useless or even counterproductive — disable it or exclude those terms from normalization.


Further reading


If my articles interest you and you have questions, or just want to talk through your own challenges, feel free to reach out at anas@tensoria.fr — I enjoy these conversations.

You can also book a call or subscribe to my newsletter.


About me

I'm Anas Rabhi, freelance AI Engineer & Data Scientist. I help companies design and ship AI solutions (RAG, agents, NLP).

Discover my services at tensoria.fr or try our AI agents solution at heeya.fr.