TestForge Blog

RAG Development Part 3 — Retrieval, Hybrid Search, and Reranking

Search quality largely defines RAG quality. This post explains dense retrieval, BM25, hybrid search, query rewriting, metadata filtering, and reranking from a practical engineering perspective.

TestForge Team

Retrieval Quality Is RAG Quality

The model can only work with the evidence it receives, so when retrieval fails, answer quality usually fails with it.

That means your system should be able to answer:

  • Which documents were retrieved?
  • Why were they retrieved?
  • Where does retrieval break down?

Why Dense Retrieval Alone Is Not Enough

Vector retrieval is strong at semantic similarity, but weaker on exact-match tokens such as:

  • Error codes
  • API paths
  • Class or function names
  • Product-specific abbreviations

Examples:

  • ERR_AUTH_401
  • /v1/tokens/refresh
  • Spring Cloud Gateway

These often benefit from sparse retrieval such as BM25.
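To see why sparse retrieval handles literal tokens so well, here is a minimal, self-contained BM25 scorer (an illustration, not a production implementation). A document containing the exact string ERR_AUTH_401 scores above one that merely discusses authentication in general:

```python
import math

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Minimal BM25: docs is a list of token lists, returns one score per doc."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency of each query term across the corpus.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        s = 0.0
        for t in query_terms:
            tf = d.count(t)
            if tf == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "request failed with ERR_AUTH_401 after token expiry".split(),
    "general discussion of authentication errors".split(),
]
# The doc with the literal error code scores > 0; the other scores 0.
scores = bm25_scores(["ERR_AUTH_401"], docs)
```

An embedding model may place both documents close to the query, but only BM25 rewards the exact token match.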

Why Hybrid Search Helps

A common pattern is:

  • Sparse retrieval with BM25
  • Dense retrieval with vector similarity
  • Merge results
  • Rerank them

Put together, the flow looks like:

User Query
 -> Query normalization
 -> BM25 top-k
 -> Vector top-k
 -> Merge
 -> Rerank
 -> Final context

This makes it easier to capture both exact terms and semantic meaning.
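One simple, widely used way to do the merge step is Reciprocal Rank Fusion (RRF). A sketch, assuming each retriever returns an ordered list of document IDs:

```python
def rrf_merge(sparse_ids, dense_ids, k=60):
    """Reciprocal Rank Fusion: each list contributes 1/(k + rank) per doc,
    so documents ranked highly by either retriever rise to the top."""
    scores = {}
    for ranking in (sparse_ids, dense_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# "a" tops the sparse list, "c" tops the dense list; both lead the fused ranking.
merged = rrf_merge(["a", "b", "c"], ["c", "d", "a"])
```

RRF needs no score normalization between BM25 and cosine similarity, which is why it is a common default before a dedicated reranker is added.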

Query Rewrite Can Make a Big Difference

User queries are often too vague.

Examples:

  • “login failed”
  • “permission issue”
  • “deploy error”

A rewrite step can turn them into stronger retrieval queries.

Original: "permission issue"
Rewritten: "How to diagnose AWS IAM or Kubernetes RBAC permission denials"

The risk is over-expansion: a rewrite should strengthen retrieval without distorting the user's intent.
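In production the rewrite step is usually an LLM prompt, but a rule-based expansion table illustrates the shape. The EXPANSIONS entries below are hypothetical examples, not a recommended vocabulary:

```python
# Hypothetical expansion table; a real system would typically
# send the query through an LLM rewrite prompt instead.
EXPANSIONS = {
    "login failed": "login failure authentication error invalid credentials",
    "permission issue": "permission denied AWS IAM Kubernetes RBAC access control",
    "deploy error": "deployment failure rollout pipeline error",
}

def rewrite_query(query: str) -> str:
    """Expand known-vague queries; pass everything else through unchanged."""
    return EXPANSIONS.get(query.strip().lower(), query)
```

The pass-through default matters: when the rewriter is unsure, leaving the query alone is safer than distorting it.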

Metadata Filtering Is Mandatory

Similarity search across every document is not enough.

Useful filters include:

  • language
  • product
  • category
  • visibility
  • freshness
  • user permission

Example:

search(
    query="token refresh behavior",
    filters={
        "language": "ko",
        "product": "console",
        "visibility": "public"
    }
)

Without filtering, internal docs or stale versions can easily leak into the result set.
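Most vector stores apply these filters natively at query time, which is the efficient path. As a minimal illustration of the semantics, here is a post-retrieval filter over candidate hits (the hit structure and apply_filters helper are hypothetical):

```python
def apply_filters(hits, filters):
    """Keep only hits whose metadata matches every filter key exactly."""
    return [
        h for h in hits
        if all(h["meta"].get(k) == v for k, v in filters.items())
    ]

hits = [
    {"id": 1, "meta": {"language": "ko", "visibility": "public"}},
    {"id": 2, "meta": {"language": "ko", "visibility": "internal"}},
]
# The internal doc is dropped before it can reach the context window.
public_only = apply_filters(hits, {"visibility": "public"})
```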

How Large Should Top-k Be?

Usually smaller than you might expect.

A practical starting point:

  • Retrieval stage: top 10 to 30
  • Final context after reranking: top 3 to 8

Too few misses evidence. Too many adds noise.

Why Reranking Is Powerful

Retrieval finds candidates. Reranking orders them by direct usefulness.

Without reranking, the right chunk may be present but too low in rank to ever reach the model.

This is especially useful when:

  • Documents are long
  • There are many near-duplicate chunks
  • Domain phrasing is repetitive
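Production rerankers are usually cross-encoder models that score each (query, chunk) pair jointly. As a self-contained stand-in for such a model, this toy reranker orders hits by token overlap with the query; the function name and scoring are illustrative only:

```python
def rerank_by_overlap(query: str, hits: list) -> list:
    """Toy reranker: order hits by token overlap with the query.
    A real system would call a cross-encoder model here instead."""
    q_tokens = set(query.lower().split())

    def score(hit):
        return len(q_tokens & set(hit["text"].lower().split()))

    return sorted(hits, key=score, reverse=True)

hits = [
    {"id": "d1", "text": "general authentication notes"},
    {"id": "d2", "text": "token refresh behavior after expiry"},
]
# The chunk that directly addresses the question moves to the top.
ordered = rerank_by_overlap("token refresh behavior", hits)
```

Note that reranking scores the original question against each chunk, which is exactly the signal a first-stage retriever approximates but cannot compute directly.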

Example Retrieval Pipeline

def retrieve_context(query: str):
    # 1. Rewrite the raw query into a stronger retrieval query.
    rewritten = rewrite_query(query)

    # 2. Run sparse and dense retrieval over the rewritten query.
    sparse_hits = bm25_search(rewritten, top_k=10)
    dense_hits = vector_search(rewritten, top_k=10)

    # 3. Merge the candidate sets, then rerank against the
    #    original question, not the rewritten one.
    merged = merge_hits(sparse_hits, dense_hits)
    reranked = rerank(query, merged)

    # 4. Keep only the strongest few chunks for generation.
    return reranked[:5]

Final ordering should still be optimized against the original user question.

Duplicate Suppression Matters Too

A common problem is multiple adjacent chunks from the same source dominating the top results.

That reduces diversity and wastes tokens.

Useful strategies:

  • limit chunks per doc_id
  • merge adjacent chunks
  • use MMR-style diversification
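The per-doc cap is the simplest of the three strategies. A sketch, assuming hits arrive already reranked and carry a doc_id field:

```python
def cap_per_doc(hits, max_per_doc=2):
    """Limit how many chunks from the same doc_id survive reranking,
    preserving the incoming (reranked) order."""
    counts = {}
    kept = []
    for h in hits:
        n = counts.get(h["doc_id"], 0)
        if n < max_per_doc:  # drop further chunks from an over-represented doc
            kept.append(h)
            counts[h["doc_id"]] = n + 1
    return kept

# Three chunks from doc "a" arrive first; only two survive the cap,
# making room for doc "b" in the final context.
hits = [{"doc_id": "a"}, {"doc_id": "a"}, {"doc_id": "a"}, {"doc_id": "b"}]
kept = cap_per_doc(hits)
```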

Types of Retrieval Failure

Recall Failure

The correct document is not retrieved at all.

Possible causes:

  • bad chunking
  • weak query rewrite
  • embedding mismatch
  • over-restrictive filters

Ranking Failure

The correct document is retrieved but ranked too low.

Possible causes:

  • no reranking
  • weak merge strategy
  • duplicate-heavy rankings

Grounding Failure

The right document is retrieved, but the final answer still drifts.

Possible causes:

  • weak prompts
  • noisy context selection
  • too many chunks passed to generation

A Good Rollout Order

Instead of adding everything at once, build in this order:

  1. vector retrieval
  2. metadata filters
  3. BM25
  4. query rewrite
  5. reranking
  6. duplicate suppression

That makes it much easier to see which change actually improved quality.

What to Log

Useful retrieval logs include:

  • original query
  • rewritten query
  • top-k retrieval results
  • reranked order
  • selected chunk IDs
  • missing expected-doc cases

Without this, debugging retrieval quality is slow and guess-heavy.
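A minimal shape for such a log is one JSON line per retrieval call, so failures can be grepped and replayed later. The field names here are illustrative, not a fixed schema:

```python
import json
import time

def log_retrieval(original, rewritten, retrieved_ids, reranked_ids, selected_ids):
    """Build and emit one JSON log record per retrieval call."""
    record = {
        "ts": time.time(),
        "original_query": original,
        "rewritten_query": rewritten,
        "retrieved_ids": retrieved_ids,
        "reranked_ids": reranked_ids,
        "selected_ids": selected_ids,
    }
    print(json.dumps(record))  # one line per call: easy to grep, diff, replay
    return record
```

Comparing retrieved_ids to reranked_ids across many queries is often enough to tell recall failures apart from ranking failures.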

Closing Thoughts

Strong RAG retrieval is not one mechanism. It is a pipeline.

You usually improve quality by:

  • clarifying the query
  • combining sparse and dense search
  • filtering aggressively
  • reranking carefully
  • reducing duplicate dominance

That is the foundation for high-quality answer generation.