Designing Hybrid Search at Enterprise Scale
The core idea
Hybrid search is not just "BM25 + vectors". It is a system design problem: choosing candidates, fusing signals, and keeping latency predictable.
- Use lexical retrieval for precision and navigational intent.
- Use semantic retrieval for recall and intent matching.
- Add reranking only where it changes outcomes.
A practical blueprint
A common pattern that stays stable across stacks (Elasticsearch/OpenSearch/MongoDB/Redis):
- Stage 1: lexical candidates (BM25, fields, boosts).
- Stage 2: semantic candidates (ANN vector retrieval).
- Stage 3: fuse scores (RRF / weighted sum) and apply business rules.
- Stage 4: optional rerank (cross-encoder / LLM reranker) on a small top-K.
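The fusion step in Stage 3 can be sketched with Reciprocal Rank Fusion (RRF), which combines ranked lists using only rank positions, so lexical and vector scores never need to share a scale. This is a minimal sketch, not any engine's built-in implementation: the function name `rrf_fuse`, the document IDs, and the two input lists are illustrative assumptions, and k=60 is a commonly used constant rather than a tuned value.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60, top_n=10):
    """Fuse ranked lists of doc IDs with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (60 is a common default; treat it as tunable).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for the docs it contains.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), then doc ID for deterministic ties.
    fused = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return fused[:top_n]

# Hypothetical Stage 1 and Stage 2 outputs for one query.
lexical = ["d1", "d2", "d3"]
semantic = ["d3", "d1", "d4"]

fused = rrf_fuse([lexical, semantic])
fused_ids = [doc_id for doc_id, _ in fused]
print(fused_ids)
```

Documents that appear in both lists (d1, d3) rise above those found by only one retriever. Business rules and the optional rerank would then operate on this fused list.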
What to measure (so it doesn't become opinion)
Treat relevance as an engineering loop. Build a small evaluation set and re-score it every time you change analyzers, embeddings, or fusion weights.
- Offline: NDCG@K / MRR / Recall@K.
- Online: CTR, reformulation rate, zero-result rate, latency p95/p99.
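The offline metrics above are small enough to compute by hand on an evaluation set. The following is a minimal sketch under stated assumptions: the function names are illustrative, relevance judgments are given as a set of relevant IDs (or a dict of graded gains for NDCG), and NDCG uses the linear-gain form of DCG rather than the exponential one.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant docs that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant doc, or 0.0 if none is returned."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """MRR over (ranked, relevant) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def ndcg_at_k(ranked, gains, k):
    """NDCG@k with linear gains; gains maps doc ID -> graded relevance."""
    dcg = sum(gains.get(doc_id, 0) / math.log2(i + 1)
              for i, doc_id in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical single-query example: only "b" is relevant, returned at rank 2.
ranked = ["a", "b", "c"]
print(recall_at_k(ranked, {"b"}, 1), reciprocal_rank(ranked, {"b"}))
print(ndcg_at_k(ranked, {"b": 1}, 3))
```

Averaging these per-query values over the whole evaluation set, and storing the result next to each configuration change, gives a baseline to compare against.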
Common failure modes
Most hybrid rollouts fail because cost and latency go uncontrolled, or because embeddings drift unnoticed.
- Reranking too deep (K too large).
- No query-class routing (everything uses the same pipeline).
- No guardrails for "semantic hallucination" in UI snippets.
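The first two failure modes can be addressed together with a small routing function that picks a pipeline per query class and caps rerank depth. This is a sketch only: the query classes, the `ID_LIKE` pattern, the token-count thresholds, and the cap of 50 are all hypothetical and would need to be derived from your own query logs.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Plan:
    use_lexical: bool
    use_semantic: bool
    rerank_k: int  # 0 disables reranking

# Hypothetical hard cap so no route can rerank arbitrarily deep.
MAX_RERANK_K = 50

# Hypothetical pattern for identifier-like queries (ticket IDs, error codes).
ID_LIKE = re.compile(r"^[A-Za-z]*[-_]?\d[\w-]*$")

def route(query: str) -> Plan:
    """Choose retrieval stages and rerank depth from simple query features."""
    tokens = query.split()
    if len(tokens) == 1 and ID_LIKE.match(tokens[0]):
        # Navigational lookup: lexical only, no rerank.
        return Plan(use_lexical=True, use_semantic=False, rerank_k=0)
    if len(tokens) <= 2:
        # Short keyword query: hybrid retrieval, skip the reranker.
        return Plan(use_lexical=True, use_semantic=True, rerank_k=0)
    # Longer natural-language query: hybrid plus a shallow, capped rerank.
    return Plan(use_lexical=True, use_semantic=True,
                rerank_k=min(20, MAX_RERANK_K))

for q in ["JIRA-1234", "reset password", "how do I reset my password"]:
    print(q, "->", route(q))
```

Keeping the cap in one place makes rerank cost a reviewed constant instead of a per-feature decision.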
If you want to discuss architecture tradeoffs for your use case, reach out at srivastavark@gmail.com.