Hybrid Search: Combining Keyword and Vector Search for Better Retrieval

Introduction

Search is deceptively hard. When a user types a query, they might be searching by meaning ("how do I fix a login problem?") or by exact identifier ("SKU-12345"). These two cases need fundamentally different approaches, and most search systems only do one of them well.

Vector search (also called semantic search) is excellent at understanding intent and meaning. It can match "password reset" with "account recovery" because the two phrases mean similar things, even though they share no words. But vector search is poor at exact matching: it may fail to find the one document that contains "SKU-12345" if the embedding model blurs that specific token into a generic representation.

Keyword search (most famously, BM25) is the opposite: it excels at exact matching but has no understanding of synonyms, context, or meaning.

Hybrid search runs both in parallel and combines their results. This approach is widely used in production RAG (retrieval-augmented generation) systems. This article explains how both methods work, why combining them is better, and how to implement hybrid search using Reciprocal Rank Fusion (RRF).

The Limits of Pure Vector Search

Vector search works by converting text into numerical vectors called embeddings. Similar texts produce vectors that are close together in space, so "find the most similar documents" becomes a fast nearest-neighbor lookup.

This approach excels at capturing semantic similarity. A query like "How do I reset my password?" can match documents containing "account recovery" or "login issues" even without exact word overlap.
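As a rough illustration of "close together in space", the sketch below ranks documents by cosine similarity. The three-dimensional vectors are invented for this example and do not come from a real embedding model; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", made up for illustration only.
toy_embeddings = {
    "account recovery guide": [0.9, 0.1, 0.0],
    "login issues": [0.8, 0.3, 0.1],
    "TLS 1.3 handshake": [0.0, 0.2, 0.9],
}
# Stands in for the embedding of "How do I reset my password?"
query_vector = [1.0, 0.2, 0.0]

def nearest(query_vec, embeddings):
    """Return document names sorted from most to least similar."""
    return sorted(
        embeddings,
        key=lambda name: cosine_similarity(query_vec, embeddings[name]),
        reverse=True,
    )

ranked = nearest(query_vector, toy_embeddings)
print(ranked)
```

A production system would replace the exhaustive sort with an approximate nearest-neighbor index, but the ranking criterion is the same.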

However, vector search struggles in several real-world scenarios:

Exact keyword matches: Searching for "SKU-12345" or "RFC-8446" may not return the correct document if the embedding model does not treat those specific tokens as important.
Rare or technical terms: Domain-specific jargon, abbreviations, or recently coined terms may not be well-represented in the embedding space.
Named entities: Product names, person names, or location-specific queries can be missed if embeddings generalize too aggressively.
Short queries: Single-word or very short queries often lack enough semantic context for embeddings to work effectively.

These failures are not hypothetical. In production systems, they lead to user frustration and degraded retrieval quality.

The Strengths of Keyword Search (BM25)

BM25 (Best Match 25) is a probabilistic ranking function used by search engines for decades. It does not use neural networks or embeddings — it is a statistical method based on counting words.

BM25 scores documents based on three factors:

Term frequency (TF): How often the query terms appear in the document. More occurrences = higher score.
Inverse document frequency (IDF): How rare the query terms are across the entire corpus. Rare terms are more informative.
Document length normalization: Prevents long documents from automatically outscoring short ones just because they contain more words overall.

The BM25 formula is:

\text{BM25}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgdl}})}

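In plain English: this formula adds up a relevance score for each query word. A document scores higher when the word appears more often, with diminishing returns for each additional occurrence, when the word is rare across the whole collection, and when the document is not unusually long.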

Where:

f(q_i, D) is the term frequency of query term q_i in document D.
|D| is the document length (in words).
avgdl is the average document length across the corpus.
k_1 and b are tuning parameters. Default values of k_1 = 1.5 and b = 0.75 work well in most cases.
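To make the formula concrete, here is a minimal pure-Python sketch of BM25 scoring. The formula above does not spell out the IDF term, so this sketch assumes the common smoothed form ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of documents and n is the number containing the term. Libraries differ in the exact IDF variant they use, and the function name `bm25_scores` is chosen for this example.

```python
import math
from collections import Counter

def bm25_scores(query_terms, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query terms with BM25."""
    n_docs = len(tokenized_docs)
    avgdl = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        length_norm = 1 - b + b * len(doc) / avgdl
        score = 0.0
        for term in query_terms:
            f = term_freq[term]
            if f == 0:
                continue  # a term that is absent contributes nothing
            # Assumed IDF variant (smoothed, always non-negative)
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    "how to reset your password",
    "account recovery guide",
    "product sku-12345 specifications",
]
tokenized = [doc.split() for doc in docs]
scores = bm25_scores("sku-12345".split(), tokenized)
print(scores)  # only the third document contains the query term
```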

Where BM25 falls short

BM25 has no notion of meaning. To BM25, "password reset" and "account recovery" are completely unrelated because they share no words. It also cannot handle typos or synonyms. This is precisely where vector search excels, which is why the two methods complement each other well.

Why Hybrid Search Works Better

Figure: Singular Value Decomposition diagram showing the U, Sigma, and V matrix factorization. Dense vector search is rooted in SVD-like matrix factorization: documents are projected into a compact latent embedding space where proximity captures semantic similarity. BM25 operates in a different space, the raw term-frequency inverted-index space. Hybrid search combines both, capturing what neither can retrieve alone. Source: Georg-Johann / Wikimedia Commons (CC BY-SA 3.0)

Hybrid search runs both vector search and BM25 in parallel, then merges the two ranked lists into a single final ranking. This ensures:

Exact keyword matches are captured when they matter.
Semantic similarity is captured for conceptual queries.
When one method fails, the other can cover for it.

Benchmarks that span diverse query types, such as BEIR (Thakur et al., 2021), show that neither BM25 nor dense retrieval wins on every dataset, which is why fusing the two tends to be more robust than either method alone. This is the main reason production RAG systems use it.

Reciprocal Rank Fusion (RRF)

The most common and effective way to merge BM25 and vector search results is Reciprocal Rank Fusion (RRF). It is simple, effective, and does not require training a separate model.

How RRF works

Each retrieval method returns a ranked list of documents. RRF assigns a score to each document based on its position in each list:

\text{RRF}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}_r(d)}

In plain English: a document gets a point for every ranked list it appears in. The higher it ranks in a list, the bigger the point. Documents that rank highly in both BM25 and vector search get the largest combined score.

Where:

R is the set of ranking methods (e.g., BM25 and vector search).
rank_r(d) is the position of document d in ranking method r (starting at 1 for the top result).
k is a smoothing constant, typically 60, that dampens the advantage of the very top ranks so that a single first-place result does not dominate the fused score.

Why RRF is robust

RRF only uses ranks, not the raw scores from each method. This is crucial because BM25 scores and cosine similarity scores are on completely different scales — you cannot simply add them. By working with ranks, RRF is immune to this scale mismatch, and it naturally balances both methods without any manual weight tuning.
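The scale mismatch is easy to see with a toy example. The scores below are invented for illustration; the point is only that BM25 scores are unbounded while cosine similarities stay within [-1, 1].

```python
# Hypothetical raw scores for three documents (made-up numbers).
bm25_raw = {"doc_a": 12.4, "doc_b": 3.1, "doc_c": 0.0}
cosine_raw = {"doc_a": 0.21, "doc_b": 0.35, "doc_c": 0.83}

# Naive approach: add raw scores. BM25's larger scale decides the order.
naive_sum = {doc: bm25_raw[doc] + cosine_raw[doc] for doc in bm25_raw}
naive_order = sorted(naive_sum, key=naive_sum.get, reverse=True)

def to_ranks(scores):
    """Map each document to its 1-based rank (1 = highest score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {doc: position for position, doc in enumerate(ordered, start=1)}

k = 60
bm25_rank = to_ranks(bm25_raw)
cosine_rank = to_ranks(cosine_raw)
rrf_scores = {
    doc: 1 / (k + bm25_rank[doc]) + 1 / (k + cosine_rank[doc])
    for doc in bm25_raw
}

print(naive_order)  # same order as BM25 alone
print(rrf_scores)   # doc_a and doc_c receive the same fused score
```

With raw addition, the vector ranking has no influence on the order. With RRF, doc_a (first for BM25, last for vector search) and doc_c (the reverse) are treated symmetrically, because only positions matter.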

Implementing Hybrid Search

Here is a practical implementation of hybrid search using BM25 and vector embeddings. First, install the required libraries: pip install rank_bm25 sentence-transformers numpy.

Step 1: Index documents with both methods

You need two indexes built over the same documents:

A BM25 index (e.g., using the rank_bm25 library, Elasticsearch, or OpenSearch).
A vector index (e.g., Pinecone, Weaviate, Qdrant, or FAISS for local use).

Step 2: Query both indexes

import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

# Sample documents
documents = [
    "How to reset your password",
    "Account recovery guide",
    "Product SKU-12345 specifications",
    "Understanding RFC-8446 TLS 1.3"
]

# BM25 indexing
tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

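# Vector indexing (normalized so that a dot product equals cosine similarity)
model = SentenceTransformer('all-MiniLM-L6-v2')
doc_embeddings = model.encode(documents, normalize_embeddings=True)

# Query
query = "password reset"

# BM25 search: score every document, then sort indices from best to worst
tokenized_query = query.lower().split()
bm25_scores = bm25.get_scores(tokenized_query)
bm25_ranks = np.argsort(bm25_scores)[::-1]

# Vector search: cosine similarity via dot product of unit-length vectors
query_embedding = model.encode([query], normalize_embeddings=True)[0]
cosine_scores = np.dot(doc_embeddings, query_embedding)
vector_ranks = np.argsort(cosine_scores)[::-1]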

print("BM25 ranks:", bm25_ranks)
print("Vector ranks:", vector_ranks)

Step 3: Apply Reciprocal Rank Fusion

The function below takes the two ranked lists and produces a single merged ranking. Note that enumerate yields 0-based positions, so the code adds 1 to each position to match the 1-based ranks in the RRF formula.

def reciprocal_rank_fusion(bm25_ranks, vector_ranks, k=60):
    """Merge two ranked lists of document ids into one list of (doc_id, score)."""
    rrf_scores = {}

    # Add each list's contribution; enumerate is 0-based, so rank + 1 is the 1-based rank
    for ranking in (bm25_ranks, vector_ranks):
        for rank, doc_id in enumerate(ranking):
            doc_id = int(doc_id)  # plain int keys instead of NumPy integers
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0.0) + 1 / (k + rank + 1)

    # Sort by RRF score, best first
    return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)

# Combine results
final_ranking = reciprocal_rank_fusion(bm25_ranks, vector_ranks)
print("Hybrid search results:", final_ranking)

Alternative Fusion Methods

RRF is the most popular approach, but there are alternatives worth knowing:

Weighted score fusion

Combine normalized scores from BM25 and vector search using a weighted average:

\text{score}(d) = \alpha \cdot \text{score}_{\text{BM25}}(d) + (1 - \alpha) \cdot \text{score}_{\text{vector}}(d)

In plain English: α controls how much you trust BM25 versus vector search. Setting α = 0.3 means 30% BM25 and 70% vector search. Both scores must be normalized to the same scale first — otherwise the method is meaningless.

This requires score normalization and manual tuning of the weight α, which is why RRF is usually preferred.
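A minimal sketch of weighted score fusion follows, assuming min-max normalization (one of several possible choices) and the α convention from the formula above, where α weights BM25. The function names and the sample scores are invented for this example.

```python
def min_max_normalize(scores):
    """Rescale a dict of scores to the range [0, 1]."""
    lowest, highest = min(scores.values()), max(scores.values())
    if highest == lowest:
        return {doc: 0.0 for doc in scores}  # no spread, nothing to distinguish
    return {doc: (s - lowest) / (highest - lowest) for doc, s in scores.items()}

def weighted_fusion(bm25_scores, vector_scores, alpha=0.3):
    """Blend normalized scores: alpha weights BM25, (1 - alpha) weights vectors."""
    bm25_norm = min_max_normalize(bm25_scores)
    vector_norm = min_max_normalize(vector_scores)
    all_docs = set(bm25_norm) | set(vector_norm)
    fused = {
        doc: alpha * bm25_norm.get(doc, 0.0) + (1 - alpha) * vector_norm.get(doc, 0.0)
        for doc in all_docs
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical raw scores on very different scales.
bm25_example = {"doc_a": 12.4, "doc_b": 3.1, "doc_c": 0.0}
vector_example = {"doc_a": 0.21, "doc_b": 0.35, "doc_c": 0.83}

fused_ranking = weighted_fusion(bm25_example, vector_example, alpha=0.3)
print(fused_ranking)
```

Min-max normalization is sensitive to outliers: one unusually high BM25 score compresses all the others toward zero, which is part of why this method needs more care than RRF.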

Learned fusion

Train a reranker model that takes both BM25 and vector scores as features and learns optimal weights from labeled data. This is more complex but can achieve higher accuracy if you have enough labeled query-document pairs.

Query-adaptive fusion

Dynamically adjust the fusion weights based on the query. Short, precise queries rely more on BM25; long, conceptual queries favor vector search.
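One simple way to approximate this is a rule-based heuristic that picks the BM25 weight α from surface features of the query. The sketch below is only an assumption about how such a rule might look: the regular expression, the thresholds, and the returned weights are all illustrative and would need tuning against real queries.

```python
import re

# Matches identifier-like tokens such as "SKU-12345" or "RFC8446" (illustrative pattern).
IDENTIFIER_PATTERN = re.compile(r"\b[A-Za-z]+-?\d+\b")

def choose_alpha(query):
    """Return a BM25 weight in [0, 1]; higher means trust BM25 more."""
    if IDENTIFIER_PATTERN.search(query):
        return 0.8  # exact identifiers: lean heavily on keyword matching
    if len(query.split()) <= 2:
        return 0.6  # very short queries carry little semantic context
    return 0.3      # longer natural-language queries: lean on vector search

for example_query in ["SKU-12345", "password reset", "how do I fix a login problem"]:
    print(example_query, "->", choose_alpha(example_query))
```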

When to Use Hybrid Search

Hybrid search is ideal when:

Your corpus contains both conceptual content and exact identifiers (SKUs, error codes, names).
Users submit diverse query types — questions, keywords, and technical terms.
You want retrieval that degrades gracefully across edge cases.
Your domain includes rare or specialized terminology not well-covered by embedding models.

Real-world examples where hybrid search shines:

E-commerce product search (combining semantic search with exact SKU matching).
Technical documentation (semantic intent + exact API names or error codes).
Legal or regulatory text (concept search + precise citation matching).
Customer support (understanding intent + finding exact error codes).

Production Considerations

Latency

Hybrid search queries two indexes instead of one. To minimize latency, run both retrievals in parallel. Most vector databases and search engines support parallel execution efficiently.
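As a sketch of the parallel pattern, the example below issues both retrievals from a thread pool so that total latency is close to the slower of the two calls rather than their sum. The two search functions are hypothetical stand-ins that sleep to simulate network latency; in practice they would call your keyword and vector backends.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def bm25_search(query):
    """Hypothetical stand-in for a keyword index call."""
    time.sleep(0.05)  # simulated network latency
    return ["doc_3", "doc_1"]

def vector_search(query):
    """Hypothetical stand-in for a vector index call."""
    time.sleep(0.05)  # simulated network latency
    return ["doc_1", "doc_2"]

def retrieve_in_parallel(query):
    """Run both retrievals concurrently and return both ranked lists."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        bm25_future = pool.submit(bm25_search, query)
        vector_future = pool.submit(vector_search, query)
        return bm25_future.result(), vector_future.result()

bm25_hits, vector_hits = retrieve_in_parallel("password reset")
print(bm25_hits, vector_hits)
```

Threads suit this case because the work is I/O-bound; an async client would achieve the same effect with `asyncio.gather`.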

Index synchronization

Ensure both indexes are updated together when documents are added, modified, or deleted. Inconsistent indexes lead to missing or duplicate results.
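One way to reduce the risk of drift is to route every write through a single wrapper that touches both stores. The class below is a toy in-memory sketch of that idea, not a real database client: `HybridIndex`, `toy_embed`, and the dictionary-backed stores are all invented for illustration.

```python
def toy_embed(text):
    """Hypothetical embedding function; a real system would call a model."""
    return [float(len(text))]

class HybridIndex:
    """Keeps a keyword store and a vector store in step with each other."""

    def __init__(self, embed):
        self._embed = embed
        self.keyword_docs = {}  # doc_id -> list of tokens
        self.vectors = {}       # doc_id -> embedding

    def upsert(self, doc_id, text):
        # Compute the embedding first: if it fails, neither store is modified.
        vector = self._embed(text)
        self.keyword_docs[doc_id] = text.lower().split()
        self.vectors[doc_id] = vector

    def delete(self, doc_id):
        self.keyword_docs.pop(doc_id, None)
        self.vectors.pop(doc_id, None)

    def is_consistent(self):
        """True when both stores hold exactly the same document ids."""
        return self.keyword_docs.keys() == self.vectors.keys()

index = HybridIndex(embed=toy_embed)
index.upsert("a", "Account recovery guide")
index.upsert("b", "How to reset your password")
index.delete("a")
print(index.is_consistent(), sorted(index.vectors))
```

With two separate remote systems, a wrapper alone cannot guarantee atomicity, so a periodic consistency check like `is_consistent` is a useful safety net.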

Cost

Running two indexes increases storage and compute costs. However, the improvement in retrieval quality typically justifies the expense — poor retrieval degrades everything downstream, including the quality of LLM answers.

Tuning the RRF constant

The RRF constant k = 60 works well in most cases and is the value used in the original RRF paper (Cormack et al., 2009). Adjust it only if offline evaluation metrics give you strong evidence that another value works better.
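To get a feel for what k changes, the short sketch below compares how much more a rank-1 result contributes than a rank-2 result for several values of k. The helper names are chosen for this example.

```python
def rrf_contribution(rank, k=60):
    """Score that a single ranked list contributes for a 1-based rank."""
    return 1 / (k + rank)

def top_vs_second_ratio(k):
    """How many times larger the rank-1 contribution is than rank-2."""
    return rrf_contribution(1, k) / rrf_contribution(2, k)

for k in (1, 10, 60, 100):
    print(f"k={k}: rank 1 is worth {top_vs_second_ratio(k):.4f}x rank 2")
```

A small k makes the fusion top-heavy, so a single first-place result can decide the outcome. A larger k flattens the curve, so agreement between the two lists matters more than any one position.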

Hybrid Search in Vector Databases

Many modern vector databases now support hybrid search natively, which simplifies implementation:

Weaviate: Supports hybrid search with configurable BM25 and vector weights.
Qdrant: Offers hybrid queries that fuse sparse and dense results, for example with RRF.
Elasticsearch: Combines keyword search with dense vector retrieval natively.
Pinecone: Supports sparse-dense hybrid search.

Example: Hybrid Search with Weaviate

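The following example uses the Weaviate v4 Python client. The alpha parameter controls the blend: alpha=0 means pure BM25, alpha=1 means pure vector search, and values in between blend both. Note that this is the reverse of the α convention in the weighted score fusion formula earlier in this article, where α weights BM25. Install with: pip install "weaviate-client>=4.0" (the quotes stop the shell from treating >= as an output redirect).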

import weaviate
from weaviate.classes.query import HybridFusion, MetadataQuery

# v4 connection (replaces weaviate.Client())
client = weaviate.connect_to_local()  # or connect_to_wcs(), connect_to_custom()

try:
    collection = client.collections.get("Articles")

    results = collection.query.hybrid(
        query="what is retrieval augmented generation",
        alpha=0.75,           # 0=pure BM25, 1=pure vector
        fusion_type=HybridFusion.RELATIVE_SCORE,
        limit=5,
        return_metadata=MetadataQuery(score=True)
    )

    for obj in results.objects:
        print(obj.properties, obj.metadata.score)
finally:
    client.close()  # always release the connection, even if the query fails

Evaluation Metrics for Hybrid Search

To confirm that hybrid search actually improves over each method alone, measure these metrics on a held-out evaluation set with known relevant documents:

Precision@K: Of the top K results returned, what fraction are actually relevant?
Recall@K: Of all relevant documents in the corpus, what fraction appear in the top K results?
Mean Reciprocal Rank (MRR): On average, how high does the first relevant result appear? Higher is better.
NDCG (Normalized Discounted Cumulative Gain): Accounts for both ranking position and relevance grade — the most comprehensive single metric.
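These metrics are short enough to implement directly. The sketch below shows one straightforward version of each for a single query; MRR is then the mean of the reciprocal rank over all evaluation queries. The document ids and relevance grades are invented for the example.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant result, or 0 if none is found."""
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1 / position
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked, relevant) pairs, one per query."""
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)

def ndcg_at_k(ranked, gains, k):
    """NDCG with graded relevance; gains maps doc id -> relevance grade."""
    dcg = sum(
        gains.get(doc, 0) / math.log2(position + 1)
        for position, doc in enumerate(ranked[:k], start=1)
    )
    ideal_gains = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(position + 1) for position, g in enumerate(ideal_gains, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical result list and judgments for one query.
ranked_results = ["d2", "d5", "d1", "d4"]
relevant_docs = {"d1", "d2"}
graded_gains = {"d1": 2, "d2": 1}

print(precision_at_k(ranked_results, relevant_docs, 3))
print(recall_at_k(ranked_results, relevant_docs, 3))
print(reciprocal_rank(ranked_results, relevant_docs))
print(ndcg_at_k(ranked_results, graded_gains, 3))
```

Running the same functions over BM25-only, vector-only, and fused rankings for every query in the evaluation set gives the baseline comparison directly.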

Always compare hybrid search to pure BM25 and pure vector search as baselines before deploying.

Common Pitfalls

Over-relying on one method

If one method dominates the fusion — say, BM25 always pushes its top result to first place — you lose the benefit of hybrid search. Use evaluation metrics to confirm both methods are contributing.

Ignoring query types

Analyze your query distribution. If 90% of your users type exact product codes, lean heavier on BM25. If 90% ask natural-language questions, lean heavier on vector search.

Poor chunking strategy

Both BM25 and vector search are only as good as the document chunks they index. If chunks are too large, important keywords are diluted. If too small, context is lost. Invest time in your chunking strategy before optimizing fusion.
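As a starting point for experiments, the sketch below splits text into fixed-size word windows with overlap, so that a sentence cut at a boundary still appears intact in the neighboring chunk. This is only one simple strategy, and the default sizes are illustrative assumptions, not recommendations.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into overlapping chunks of at most chunk_size words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

sample_text = " ".join(f"w{i}" for i in range(10))
sample_chunks = chunk_words(sample_text, chunk_size=4, overlap=1)
print(sample_chunks)
```

The same chunks should feed both the BM25 index and the vector index so that document ids line up at fusion time.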

Comparison: Pure vs Hybrid Search

Aspect	Pure Vector Search	Pure BM25	Hybrid Search
Semantic Understanding	Excellent	Poor	Excellent
Exact Keyword Matching	Weak	Excellent	Excellent
Rare Terms / Identifiers	Weak	Excellent	Excellent
Short Queries	Moderate	Good	Good
Implementation Complexity	Low	Low	Medium
Cost	Medium	Low	Medium–High
Overall Robustness	Good	Good	Excellent

Future Directions

Hybrid search continues to evolve:

Learned sparse representations: Neural models (like SPLADE) generate sparse, keyword-like vectors that bridge BM25 and dense retrieval.
Contextual re-ranking: Using a cross-encoder or LLM-based reranker to re-rank hybrid results based on query context for even higher precision.
Multi-vector search: Combining multiple embedding models with BM25 for even more robust retrieval coverage.

Conclusion

Hybrid search is not a luxury — it is a practical necessity for production RAG systems. Pure vector search excels at semantic understanding but misses exact matches. BM25 is fast and precise for keywords but lacks semantic depth.

By combining both methods with Reciprocal Rank Fusion, you get retrieval that is robust, accurate, and handles the full range of real-world queries. If your RAG system currently uses only vector search, adding BM25 and RRF fusion is one of the highest-impact improvements you can make.

Key Takeaways

Pure vector search struggles with exact keyword matches and rare terms; BM25 lacks semantic understanding — hybrid search covers both failure modes.
Reciprocal Rank Fusion (RRF) is a simple, effective default for merging BM25 and vector rankings: it uses only ranks, not raw scores, which makes it immune to scale mismatches between the two methods.
Many vector databases (Weaviate, Qdrant, Elasticsearch, Pinecone) now support hybrid search natively, simplifying implementation significantly.
Always evaluate retrieval quality using metrics like Precision@K, Recall@K, and MRR to confirm that hybrid search improves over each method individually before deploying.

References

Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4), 333–389.
Cormack, G. V., Clarke, C. L. A., & Buettcher, S. (2009). Reciprocal rank fusion outperforms condorcet and individual rank learning methods. SIGIR 2009.
Weaviate — Hybrid Search Explained
Qdrant — Hybrid Search Documentation
Thakur, N., et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS 2021 Datasets Track.
