TestForge Blog

RAG Development Part 3 — Retrieval, Hybrid Search, and Reranking

Search quality largely defines RAG quality. This post explains dense retrieval, BM25, hybrid search, query rewriting, metadata filtering, and reranking from a practical engineering perspective.

TestForge Team

Retrieval Quality Is RAG Quality

The model can only work with the evidence it receives, so when retrieval fails, answer quality usually fails with it.

That means your system should be able to answer:

  • Which documents were retrieved?
  • Why were they retrieved?
  • Where does retrieval break down?

Why Dense Retrieval Alone Is Not Enough

Vector retrieval is strong at semantic similarity, but weaker on exact-match tokens such as:

  • Error codes
  • API paths
  • Class or function names
  • Product-specific abbreviations

Examples:

  • ERR_AUTH_401
  • /v1/tokens/refresh
  • Spring Cloud Gateway

These often benefit from sparse retrieval such as BM25.
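To see why sparse retrieval handles literal tokens so well, here is a minimal, self-contained BM25 scorer (an illustration, not a production implementation). A document containing the exact string ERR_AUTH_401 scores above one that merely discusses authentication in general:

```python
import math

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Minimal BM25: docs is a list of token lists, returns one score per doc."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency of each query term across the corpus.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        s = 0.0
        for t in query_terms:
            tf = d.count(t)
            if tf == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "request failed with ERR_AUTH_401 after token expiry".split(),
    "general discussion of authentication errors".split(),
]
# The doc with the literal error code scores > 0; the other scores 0.
scores = bm25_scores(["ERR_AUTH_401"], docs)
```

An embedding model may place both documents close to the query, but only BM25 rewards the exact token match.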

Why Hybrid Search Helps

A common pattern is:

  • Sparse retrieval with BM25
  • Dense retrieval with vector similarity
  • Merge results
  • Rerank them

Put together, the flow looks like:

User Query
 -> Query normalization
 -> BM25 top-k
 -> Vector top-k
 -> Merge
 -> Rerank
 -> Final context

This makes it easier to capture both exact terms and semantic meaning.
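One simple, widely used way to do the merge step is Reciprocal Rank Fusion (RRF). A sketch, assuming each retriever returns an ordered list of document IDs:

```python
def rrf_merge(sparse_ids, dense_ids, k=60):
    """Reciprocal Rank Fusion: each list contributes 1/(k + rank) per doc,
    so documents ranked highly by either retriever rise to the top."""
    scores = {}
    for ranking in (sparse_ids, dense_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# "a" tops the sparse list, "c" tops the dense list; both lead the fused ranking.
merged = rrf_merge(["a", "b", "c"], ["c", "d", "a"])
```

RRF needs no score normalization between BM25 and cosine similarity, which is why it is a common default before a dedicated reranker is added.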

Query Rewrite Can Make a Big Difference

User queries are often too vague.

Examples:

  • “login failed”
  • “permission issue”
  • “deploy error”

A rewrite step can turn them into stronger retrieval queries.

Original: "permission issue"
Rewritten: "How to diagnose AWS IAM or Kubernetes RBAC permission denials"

The risk is over-expansion: a rewrite should strengthen retrieval without distorting the user's intent.
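In production the rewrite step is usually an LLM prompt, but a rule-based expansion table illustrates the shape. The EXPANSIONS entries below are hypothetical examples, not a recommended vocabulary:

```python
# Hypothetical expansion table; a real system would typically
# send the query through an LLM rewrite prompt instead.
EXPANSIONS = {
    "login failed": "login failure authentication error invalid credentials",
    "permission issue": "permission denied AWS IAM Kubernetes RBAC access control",
    "deploy error": "deployment failure rollout pipeline error",
}

def rewrite_query(query: str) -> str:
    """Expand known-vague queries; pass everything else through unchanged."""
    return EXPANSIONS.get(query.strip().lower(), query)
```

The pass-through default matters: when the rewriter is unsure, leaving the query alone is safer than distorting it.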

Metadata Filtering Is Mandatory

Similarity search across every document is not enough.

Useful filters include:

  • language
  • product
  • category
  • visibility
  • freshness
  • user permission

Example:

search(
    query="token refresh behavior",
    filters={
        "language": "ko",
        "product": "console",
        "visibility": "public"
    }
)

Without filtering, internal docs or stale versions can easily leak into the result set.
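Most vector stores apply these filters natively at query time, which is the efficient path. As a minimal illustration of the semantics, here is a post-retrieval filter over candidate hits (the hit structure and apply_filters helper are hypothetical):

```python
def apply_filters(hits, filters):
    """Keep only hits whose metadata matches every filter key exactly."""
    return [
        h for h in hits
        if all(h["meta"].get(k) == v for k, v in filters.items())
    ]

hits = [
    {"id": 1, "meta": {"language": "ko", "visibility": "public"}},
    {"id": 2, "meta": {"language": "ko", "visibility": "internal"}},
]
# The internal doc is dropped before it can reach the context window.
public_only = apply_filters(hits, {"visibility": "public"})
```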

How Large Should Top-k Be?

Usually smaller than you might expect.

A practical starting point:

  • Retrieval stage: top 10 to 30
  • Final context after reranking: top 3 to 8

Too few misses evidence. Too many adds noise.

Why Reranking Is Powerful

Retrieval finds candidates. Reranking orders them by direct usefulness.

Without reranking, the right chunk may be present but too low in rank to ever reach the model.

This is especially useful when:

  • Documents are long
  • There are many near-duplicate chunks
  • Domain phrasing is repetitive
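Production rerankers are usually cross-encoder models that score each (query, chunk) pair jointly. As a self-contained stand-in for such a model, this toy reranker orders hits by token overlap with the query; the function name and scoring are illustrative only:

```python
def rerank_by_overlap(query: str, hits: list) -> list:
    """Toy reranker: order hits by token overlap with the query.
    A real system would call a cross-encoder model here instead."""
    q_tokens = set(query.lower().split())

    def score(hit):
        return len(q_tokens & set(hit["text"].lower().split()))

    return sorted(hits, key=score, reverse=True)

hits = [
    {"id": "d1", "text": "general authentication notes"},
    {"id": "d2", "text": "token refresh behavior after expiry"},
]
# The chunk that directly addresses the question moves to the top.
ordered = rerank_by_overlap("token refresh behavior", hits)
```

Note that reranking scores the original question against each chunk, which is exactly the signal a first-stage retriever approximates but cannot compute directly.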

Example Retrieval Pipeline

def retrieve_context(query: str):
    # 1. Rewrite the raw query into a stronger retrieval query.
    rewritten = rewrite_query(query)

    # 2. Run sparse and dense retrieval over the rewritten query.
    sparse_hits = bm25_search(rewritten, top_k=10)
    dense_hits = vector_search(rewritten, top_k=10)

    # 3. Merge the candidate sets, then rerank against the
    #    original question, not the rewritten one.
    merged = merge_hits(sparse_hits, dense_hits)
    reranked = rerank(query, merged)

    # 4. Keep only the strongest few chunks for generation.
    return reranked[:5]

Final ordering should still be optimized against the original user question.

Duplicate Suppression Matters Too

A common problem is multiple adjacent chunks from the same source dominating the top results.

That reduces diversity and wastes tokens.

Useful strategies:

  • limit chunks per doc_id
  • merge adjacent chunks
  • use MMR-style diversification
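The per-doc cap is the simplest of the three strategies. A sketch, assuming hits arrive already reranked and carry a doc_id field:

```python
def cap_per_doc(hits, max_per_doc=2):
    """Limit how many chunks from the same doc_id survive reranking,
    preserving the incoming (reranked) order."""
    counts = {}
    kept = []
    for h in hits:
        n = counts.get(h["doc_id"], 0)
        if n < max_per_doc:  # drop further chunks from an over-represented doc
            kept.append(h)
            counts[h["doc_id"]] = n + 1
    return kept

# Three chunks from doc "a" arrive first; only two survive the cap,
# making room for doc "b" in the final context.
hits = [{"doc_id": "a"}, {"doc_id": "a"}, {"doc_id": "a"}, {"doc_id": "b"}]
kept = cap_per_doc(hits)
```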

Types of Retrieval Failure

Recall Failure

The correct document is not retrieved at all.

Possible causes:

  • bad chunking
  • weak query rewrite
  • embedding mismatch
  • over-restrictive filters

Ranking Failure

The correct document is retrieved but ranked too low.

Possible causes:

  • no reranking
  • weak merge strategy
  • duplicate-heavy rankings

Grounding Failure

The right document is retrieved, but the final answer still drifts.

Possible causes:

  • weak prompts
  • noisy context selection
  • too many chunks passed to generation

A Good Rollout Order

Instead of adding everything at once, build in this order:

  1. vector retrieval
  2. metadata filters
  3. BM25
  4. query rewrite
  5. reranking
  6. duplicate suppression

That makes it much easier to see which change actually improved quality.

What to Log

Useful retrieval logs include:

  • original query
  • rewritten query
  • top-k retrieval results
  • reranked order
  • selected chunk IDs
  • missing expected-doc cases

Without this, debugging retrieval quality is slow and guess-heavy.
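A minimal shape for such a log is one JSON line per retrieval call, so failures can be grepped and replayed later. The field names here are illustrative, not a fixed schema:

```python
import json
import time

def log_retrieval(original, rewritten, retrieved_ids, reranked_ids, selected_ids):
    """Build and emit one JSON log record per retrieval call."""
    record = {
        "ts": time.time(),
        "original_query": original,
        "rewritten_query": rewritten,
        "retrieved_ids": retrieved_ids,
        "reranked_ids": reranked_ids,
        "selected_ids": selected_ids,
    }
    print(json.dumps(record))  # one line per call: easy to grep, diff, replay
    return record
```

Comparing retrieved_ids to reranked_ids across many queries is often enough to tell recall failures apart from ranking failures.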

Closing Thoughts

Strong RAG retrieval is not one mechanism. It is a pipeline.

You usually improve quality by:

  • clarifying the query
  • combining sparse and dense search
  • filtering aggressively
  • reranking carefully
  • reducing duplicate dominance

That is the foundation for high-quality answer generation.