TestForge Blog

RAG Architecture Design Guide — From Retrieval Quality to Answer Generation

A practical guide to designing RAG systems. Covers document ingestion, chunking, embeddings, vector search, reranking, prompt composition, and evaluation from a real product engineering perspective.

TestForge Team

Why RAG Exists

A service built on an LLM alone struggles to reflect the latest documents, internal knowledge, and domain-specific policies accurately.

RAG solves that by retrieving relevant external knowledge and injecting it into generation.

User Query
 -> Query Rewrite
 -> Retrieve Documents
 -> Rerank
 -> Prompt Compose
 -> LLM Generate
 -> Final Answer
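The flow above can be sketched end to end with toy stand-ins. Every function and document name here is an illustrative placeholder, not a real library API:

```python
# Minimal sketch of the RAG request flow; each step is a placeholder.

def rewrite_query(query: str) -> str:
    # A real system would expand the query with an LLM or rules.
    return query.strip().lower()

def retrieve(query: str, corpus: dict[str, str], k: int = 3) -> list[str]:
    # Toy retriever: rank documents by word overlap with the query.
    words = set(query.split())
    return sorted(
        corpus,
        key=lambda doc_id: len(words & set(corpus[doc_id].lower().split())),
        reverse=True,
    )[:k]

def build_prompt(query: str, doc_ids: list[str], corpus: dict[str, str]) -> str:
    context = "\n".join(f"- {corpus[d]}" for d in doc_ids)
    return f"Context:\n{context}\n\nQuestion: {query}"

corpus = {
    "auth.md": "Token refresh and expiration handling",
    "billing.md": "Invoice and payment schedule",
}
query = rewrite_query("  Token refresh ")
prompt = build_prompt(query, retrieve(query, corpus, k=1), corpus)
```

In production each step is a separate component (query rewriter, retriever, reranker, prompt composer), but the data handed between them stays this simple: a query, a ranked list of documents, and a composed prompt.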

The key question is not only whether the model is smart. It is whether the system retrieves the right information at the right time.

Core Stages of a RAG System

1. Document Collection

Typical sources include:

  • Internal wikis
  • PDF manuals
  • Product docs
  • Help center FAQs
  • Runbooks
  • Code and config documentation

If source quality is weak, retrieval quality will also be weak.

2. Document Preprocessing

Important preprocessing tasks:

  • Extracting text from HTML or PDF
  • Preserving tables and code blocks
  • Keeping heading structure
  • Removing duplicates
  • Adding metadata

Example metadata:

{
  "source": "docs/api/authentication.md",
  "category": "api",
  "product": "console",
  "updated_at": "2026-04-10"
}

This metadata matters for filtering and source attribution.
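Filtering on that metadata before search keeps irrelevant chunks out of the candidate set. A minimal sketch, using an in-memory list as a stand-in for a real vector store, with field names mirroring the example above:

```python
# Sketch: metadata filtering over an in-memory chunk store.
# The "text" values are placeholders.

chunks = [
    {"text": "placeholder", "source": "docs/api/authentication.md",
     "category": "api", "product": "console", "updated_at": "2026-04-10"},
    {"text": "placeholder", "source": "docs/billing/invoices.md",
     "category": "billing", "product": "console", "updated_at": "2025-11-02"},
]

def filter_chunks(chunks: list[dict], **criteria) -> list[dict]:
    """Keep only chunks whose metadata matches every criterion."""
    return [c for c in chunks if all(c.get(k) == v for k, v in criteria.items())]

api_only = filter_chunks(chunks, category="api", product="console")
```

Most vector databases support this kind of pre-filter natively; the point is that the metadata has to exist at ingestion time for the filter to work at query time.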

3. Chunking

Chunking is not just cutting text into equal pieces.

Good chunking usually means:

  • Keeping headings with body content
  • Avoiding splitting code blocks
  • Preserving lists and tables when possible
  • Testing within a 300 to 800 token range
  • Using overlap to reduce context loss

Too small and meaning is lost. Too large and unrelated noise comes along.
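A sliding-window chunker with overlap captures the basic mechanics. This sketch uses whitespace-split words as a stand-in for real tokenizer tokens, and assumes size is larger than overlap:

```python
# Token-window chunking with overlap; whitespace "tokens" stand in
# for a real tokenizer's output.

def chunk_tokens(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    assert size > overlap, "window must be larger than the overlap"
    tokens = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        chunks.append(" ".join(window))
        if start + size >= len(tokens):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(1000))
parts = chunk_tokens(doc, size=400, overlap=50)
```

A structure-aware chunker would additionally refuse to cut inside a code block or table, and would prepend the nearest heading to each chunk, but the window-plus-overlap loop is the same.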

Embeddings and Indexing

Embedding models convert text into vectors for similarity search.

In practice, evaluation should focus on:

  • Korean and English quality where needed
  • Cost
  • Indexing speed
  • Query latency
  • Operational simplicity

A reproducible ingestion and re-indexing workflow usually matters more than chasing the absolute highest benchmark.
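One concrete piece of that reproducibility is incremental re-indexing: re-embed only chunks whose content actually changed. A sketch with a content-hash check, where embed is a placeholder for a real model call:

```python
# Incremental re-indexing: skip chunks whose content hash is unchanged.

import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def embed(text: str) -> list[float]:
    # Placeholder; a real system calls an embedding model here.
    return [float(len(text))]

def reindex(chunks: dict[str, str], index: dict) -> int:
    """Update the index in place; return how many chunks were re-embedded."""
    updated = 0
    for chunk_id, text in chunks.items():
        h = content_hash(text)
        entry = index.get(chunk_id)
        if entry is None or entry["hash"] != h:
            index[chunk_id] = {"hash": h, "vector": embed(text)}
            updated += 1
    return updated

index: dict = {}
first_run = reindex({"a": "hello", "b": "world"}, index)
second_run = reindex({"a": "hello", "b": "world!"}, index)
```

Embedding calls are usually the most expensive part of ingestion, so skipping unchanged chunks is what makes frequent re-indexing affordable.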

The Biggest Retrieval Quality Levers

Query Rewrite

User questions are often underspecified.

Example:

Original: "permission error"
Expanded: "How to diagnose AWS IAM permission errors that cause API requests to fail"

Hybrid Search

Vector search alone is often not enough.

A practical combination is:

  • BM25 keyword search
  • Vector similarity search
  • Result merging

Precise tokens like error codes or API names are often handled better by keyword search.

Reranking

Instead of passing top-k retrieval results directly to the model, reranking often improves answer quality significantly.

This matters especially when documents are long or many chunks are semantically close.
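The shape of a rerank step is: retrieve a wide top-k, score each candidate against the query with a stronger (and slower) scorer, keep a narrow top-n. The word-overlap scorer below is a toy stand-in for a cross-encoder model:

```python
# Rerank sketch: wide retrieval, strong scoring, narrow selection.

def score(query: str, chunk: str) -> float:
    # Toy relevance score; a real reranker runs a cross-encoder here.
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def rerank(query: str, chunks: list[str], top_n: int = 2) -> list[str]:
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_n]

candidates = [
    "billing invoices and payment",
    "access token expiration and refresh",
    "token refresh endpoint details",
]
best = rerank("token refresh", candidates)
```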

Prompt Structure Matters

RAG is not just “attach retrieved text and ask the model.” A structured prompt makes the model's role, evidence scope, and uncertainty behavior explicit:

System:
You are a technical support assistant that must answer only from the provided documents.
If evidence is insufficient, say so explicitly.

Context:
[Document 1 summary]
[Document 2 summary]

User Question:
How does token refresh work when the access token expires?

Useful prompt rules include:

  • Clear role definition
  • Evidence scope limitation
  • Explicit uncertainty behavior
  • Consistent citation format
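Those rules can be enforced in a small compose function rather than rewritten by hand each time. The template mirrors the System/Context/User layout shown above; the [n] citation tag is one possible format, and the document summaries are placeholders:

```python
# Prompt composition applying the rules above: fixed role, scoped
# evidence, explicit uncertainty behavior, numbered citations.

def compose_prompt(question: str, docs: list[str]) -> str:
    context = "\n".join(f"[{i}] {d}" for i, d in enumerate(docs, start=1))
    return (
        "System:\n"
        "You are a technical support assistant that must answer only from "
        "the provided documents.\n"
        "If evidence is insufficient, say so explicitly.\n"
        "Cite sources as [n].\n\n"
        f"Context:\n{context}\n\n"
        f"User Question:\n{question}"
    )

prompt = compose_prompt(
    "How does token refresh work when the access token expires?",
    [
        "Summary of the access token lifecycle doc.",
        "Summary of the token refresh endpoint doc.",
    ],
)
```

Keeping the template in code also gives you one place to version it, which matters once prompt changes start affecting evaluation results.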

Safety Matters More Than Fluency

In many RAG products, a safe answer is more valuable than a polished but misleading one.

Things to design for:

  • Fallback behavior when retrieval is weak
  • Warnings on outdated docs
  • Source links
  • Sensitive document filtering
  • Permission-aware retrieval

If the system searches internal docs, what a user can retrieve should depend on access rights.
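Concretely, that means filtering retrieved chunks against the requesting user's permissions before anything reaches the prompt. The group-set ACL model below is an illustrative assumption; real systems usually delegate this check to the source system's permission API:

```python
# Permission-aware retrieval: drop chunks the user may not read.

def allowed(chunk: dict, user_groups: set[str]) -> bool:
    return bool(chunk["read_groups"] & user_groups)

def permission_filter(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    return [c for c in chunks if allowed(c, user_groups)]

chunks = [
    {"source": "public/faq.md", "read_groups": {"everyone"}},
    {"source": "internal/runbook.md", "read_groups": {"sre"}},
]
visible = permission_filter(chunks, {"everyone"})
```

Filtering after retrieval is the simple version; filtering inside the vector query (via metadata) avoids leaking even the existence of restricted documents through empty result slots.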

How to Evaluate RAG

Do not evaluate by intuition alone.

Useful metrics include:

  • Retrieval precision
  • Retrieval recall
  • Answer groundedness
  • Hallucination rate
  • Citation accuracy
  • End-to-end latency

Example evaluation sample:

{
  "question": "When is the token refresh API called?",
  "expected_docs": ["auth/token-refresh.md"],
  "must_include": ["refresh token", "expiration"],
  "must_not_include": ["password reset"]
}
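A sample of that shape can be scored mechanically: check that the expected documents were retrieved and that the answer respects the include/exclude constraints. The field names follow the example above; the pass/fail rules are an illustrative choice:

```python
# Scoring one evaluation sample against a retrieval result and answer.

def evaluate_sample(sample: dict, retrieved_docs: list[str], answer: str) -> dict:
    answer_lc = answer.lower()
    return {
        "retrieval_hit": all(d in retrieved_docs for d in sample["expected_docs"]),
        "includes_ok": all(t.lower() in answer_lc for t in sample["must_include"]),
        "excludes_ok": all(t.lower() not in answer_lc
                           for t in sample["must_not_include"]),
    }

sample = {
    "question": "When is the token refresh API called?",
    "expected_docs": ["auth/token-refresh.md"],
    "must_include": ["refresh token", "expiration"],
    "must_not_include": ["password reset"],
}
result = evaluate_sample(
    sample,
    retrieved_docs=["auth/token-refresh.md"],
    answer="The refresh token is used after expiration of the access token.",
)
```

String matching like this only catches gross failures; groundedness and citation accuracy usually need an LLM-as-judge or human review on top.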

Common Operational Failures

Stale Indexes

If docs change but their embeddings are not refreshed, answers go stale the moment the source is updated.

Inconsistent Chunking

If each document type is chunked differently, retrieval quality becomes highly uneven.

Too Many Retrieved Chunks

Increasing top-k without discipline often makes answers worse, not better.

Missing Citations in the UI

If users cannot verify answers, they will not trust the system.

A Good Starting Architecture

A strong first version usually includes:

  • Ingestion pipeline
  • Preprocessing and chunking
  • Embedding generation
  • Vector storage
  • Search + prompt composition in the API layer
  • Response plus citations

Then you can add:

  • Query rewrite
  • Reranking
  • Hybrid search
  • Offline evaluation
  • Feedback loop

Closing Thoughts

RAG quality depends at least as much on data pipelines and retrieval strategy as it does on the language model itself.

Strong teams usually do not start by overcomplicating everything. They:

  • Clean the documents
  • Define chunking rules
  • Measure retrieval quality
  • Build citation-based answers
  • Keep iterating with evaluation

That discipline is what makes a RAG system actually useful.