RAG Development Part 3 — Retrieval, Hybrid Search, and Reranking
Search quality largely defines RAG quality. This post explains dense retrieval, BM25, hybrid search, query rewriting, metadata filtering, and reranking from a practical engineering perspective.
Retrieval Quality Is RAG Quality
The model can only work with the evidence it receives.
So if retrieval fails, answer quality usually fails as well.
That means your system should be able to answer:
- Which documents were retrieved?
- Why were they retrieved?
- Where does retrieval break down?
Why Dense Retrieval Alone Is Not Enough
Vector retrieval is strong at semantic similarity, but weaker in cases like:
- Error codes
- API paths
- Class or function names
- Product-specific abbreviations
Examples:
- ERR_AUTH_401
- /v1/tokens/refresh
- Spring Cloud Gateway
These often benefit from sparse retrieval such as BM25.
Why Hybrid Search Helps
A common pattern is:
- Sparse retrieval with BM25
- Dense retrieval with vector similarity
- Merge results
- Rerank them
User Query
-> Query normalization
-> BM25 top-k
-> Vector top-k
-> Merge
-> Rerank
-> Final context
This makes it easier to capture both exact terms and semantic meaning.
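The merge step is often implemented with Reciprocal Rank Fusion (RRF), which combines the two ranked lists without requiring their scores to be comparable. A minimal sketch (the function name and the k=60 default are illustrative, not from a specific library):

```python
def rrf_merge(sparse_ids, dense_ids, k=60):
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per doc."""
    scores = {}
    for ranking in (sparse_ids, dense_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Docs that appear high in both lists accumulate the largest scores.
    return sorted(scores, key=scores.get, reverse=True)
```

The constant k dampens the influence of any single top rank; 60 is a commonly used default.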
Query Rewrite Can Make a Big Difference
User queries are often too vague.
Examples:
- “login failed”
- “permission issue”
- “deploy error”
A rewrite step can turn them into stronger retrieval queries.
Original: "permission issue"
Rewritten: "How to diagnose AWS IAM or Kubernetes RBAC permission denials"
The risk is over-expansion: the rewrite should strengthen retrieval without distorting the user's intent.
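A common approach is an LLM-based rewrite with an instruction that explicitly guards against intent drift. A sketch, where `call_llm` is a placeholder for whatever completion API you use:

```python
REWRITE_PROMPT = """Rewrite the search query below so it retrieves better \
technical documentation. Add likely technology names and error terms, \
but do not change the user's intent or add unrelated topics.

Query: {query}
Rewritten query:"""

def rewrite_query(query: str, call_llm=None) -> str:
    """Return a retrieval-oriented rewrite; fall back to the original query."""
    if call_llm is None:
        # No LLM wired in: keep the user's query unchanged.
        return query
    return call_llm(REWRITE_PROMPT.format(query=query)).strip()
```

Keeping the fallback path means a rewrite outage degrades quality instead of breaking retrieval entirely.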
Metadata Filtering Is Mandatory
Similarity search across every document is not enough.
Useful filters include:
- language
- product
- category
- visibility
- freshness
- user permission
Example:
search(
    query="token refresh behavior",
    filters={
        "language": "ko",
        "product": "console",
        "visibility": "public"
    }
)
Without filtering, internal docs or stale versions can easily leak into the result set.
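If your vector store does not support server-side filters, the same checks can be applied to candidate metadata after retrieval. A minimal sketch (the field names follow the example above; the `meta` key on hits is an assumption):

```python
def matches_filters(doc_meta: dict, filters: dict) -> bool:
    """True only if every required filter field matches the doc's metadata."""
    return all(doc_meta.get(key) == value for key, value in filters.items())

def apply_filters(hits, filters):
    # Drop any candidate whose metadata fails a required filter.
    return [hit for hit in hits if matches_filters(hit["meta"], filters)]
```

Note that post-filtering shrinks the candidate set, so over-fetch at the retrieval stage when you filter this way.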
How Large Should Top-k Be?
Usually smaller than you might expect.
A practical starting point:
- Retrieval stage: top 10 to 30
- Final context after reranking: top 3 to 8
Too few misses evidence. Too many adds noise.
Why Reranking Is Powerful
Retrieval finds candidates. Reranking orders them by direct usefulness.
Without reranking, the right chunk may be present but too low in rank to ever reach the model.
This is especially useful when:
- Documents are long
- There are many near-duplicate chunks
- Domain phrasing is repetitive
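Reranking is typically done with a cross-encoder that scores each (query, chunk) pair directly. In this sketch the scoring model is injected as `score_fn`, a stand-in for e.g. a cross-encoder's predict call:

```python
def rerank(query, chunks, score_fn):
    """Order candidate chunks by direct query-chunk relevance.

    score_fn(query, text) -> float; in practice this would be a
    cross-encoder model, passed in here as a plain callable.
    """
    scored = [(score_fn(query, chunk["text"]), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored]

# Toy scorer for illustration: count shared words between query and chunk.
def word_overlap(query, text):
    return len(set(query.split()) & set(text.split()))
```

Injecting the scorer keeps the pipeline testable without loading a model.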
Example Retrieval Pipeline
def retrieve_context(query: str):
    # Rewrite the query for better recall, then run both retrievers.
    rewritten = rewrite_query(query)
    sparse_hits = bm25_search(rewritten, top_k=10)
    dense_hits = vector_search(rewritten, top_k=10)
    # Merge the candidate lists, then rerank against the original query.
    merged = merge_hits(sparse_hits, dense_hits)
    reranked = rerank(query, merged)
    return reranked[:5]
Final ordering should still be optimized against the original user question.
Duplicate Suppression Matters Too
A common problem is multiple adjacent chunks from the same source dominating the top results.
That reduces diversity and wastes tokens.
Useful strategies:
- limit chunks per doc_id
- merge adjacent chunks
- use MMR-style diversification
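The per-document cap is straightforward to implement while preserving rank order. A sketch (chunk dicts carrying a `doc_id` field are an assumption about your schema):

```python
from collections import Counter

def limit_per_doc(ranked_chunks, max_per_doc=2):
    """Keep ranked order, but cap how many chunks each doc_id contributes."""
    taken = Counter()
    kept = []
    for chunk in ranked_chunks:
        if taken[chunk["doc_id"]] < max_per_doc:
            taken[chunk["doc_id"]] += 1
            kept.append(chunk)
    return kept
```

Because the input is already ranked, each document keeps only its best chunks and lower-ranked documents get a chance to surface.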
Types of Retrieval Failure
Recall Failure
The correct document is not retrieved at all.
Possible causes:
- bad chunking
- weak query rewrite
- embedding mismatch
- over-restrictive filters
Ranking Failure
The correct document is retrieved but ranked too low.
Possible causes:
- no reranking
- weak merge strategy
- duplicate-heavy rankings
Grounding Failure
The right document is retrieved, but the final answer still drifts.
Possible causes:
- weak prompts
- noisy context selection
- too many chunks passed to generation
A Good Rollout Order
Instead of adding everything at once, build in this order:
- vector retrieval
- metadata filters
- BM25
- query rewrite
- reranking
- duplicate suppression
That makes it much easier to see which change actually improved quality.
What to Log
Useful retrieval logs include:
- original query
- rewritten query
- top-k retrieval results
- reranked order
- selected chunk IDs
- missing expected-doc cases
Without this, debugging retrieval quality is slow and guess-heavy.
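Emitting these as one structured record per request keeps them queryable later. A sketch (field names are illustrative, not a fixed schema):

```python
import json

def log_retrieval(original_query, rewritten_query, retrieved, reranked,
                  selected_ids, expected_ids=None):
    """Build and emit one structured log record per retrieval request."""
    record = {
        "original_query": original_query,
        "rewritten_query": rewritten_query,
        "retrieved_ids": [hit["id"] for hit in retrieved],
        "reranked_ids": [hit["id"] for hit in reranked],
        "selected_ids": selected_ids,
    }
    if expected_ids is not None:
        # Flag expected docs that never made it into the retrieval set.
        record["missing_expected"] = sorted(
            set(expected_ids) - set(record["retrieved_ids"]))
    print(json.dumps(record, ensure_ascii=False))
    return record
```

The `missing_expected` field turns recall failures into a greppable signal instead of something you discover by rereading transcripts.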
Closing Thoughts
Strong RAG retrieval is not one mechanism. It is a pipeline.
You usually improve quality by:
- clarifying the query
- combining sparse and dense search
- filtering aggressively
- reranking carefully
- reducing duplicate dominance
That is the foundation for high-quality answer generation.