RAG Architecture Design Guide — From Retrieval Quality to Answer Generation
A practical guide to designing RAG systems. Covers document ingestion, chunking, embeddings, vector search, reranking, prompt composition, and evaluation from a real product engineering perspective.
Why RAG Exists
A service built on an LLM alone struggles to accurately reflect the latest documents, internal knowledge, and domain-specific policies.
RAG solves that by retrieving relevant external knowledge and injecting it into generation.
User Query
-> Query Rewrite
-> Retrieve Documents
-> Rerank
-> Prompt Compose
-> LLM Generate
-> Final Answer
The key question is not only whether the model is smart. It is whether the system retrieves the right information at the right time.
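As a rough sketch, the whole flow can be read as one thin orchestration function. Every name below (rewrite_query, retrieve, rerank, compose_prompt, generate) is a placeholder for whichever component you actually plug in:

from typing import Callable, List

def answer(
    question: str,
    rewrite_query: Callable[[str], str],
    retrieve: Callable[[str, int], List[str]],
    rerank: Callable[[str, List[str], int], List[str]],
    compose_prompt: Callable[[str, List[str]], str],
    generate: Callable[[str], str],
) -> str:
    # Each callable is a stand-in for a real component.
    rewritten = rewrite_query(question)        # expand an underspecified query
    candidates = retrieve(rewritten, 20)       # cast a wide net with hybrid search
    best = rerank(rewritten, candidates, 5)    # keep only the strongest chunks
    prompt = compose_prompt(question, best)    # role, context, citation rules
    return generate(prompt)                    # single LLM call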
Core Stages of a RAG System
1. Document Collection
Typical sources include:
- Internal wikis
- PDF manuals
- Product docs
- Help center FAQs
- Runbooks
- Code and config documentation
If source quality is weak, retrieval quality will also be weak.
2. Document Preprocessing
Important preprocessing tasks:
- Extracting text from HTML or PDF
- Preserving tables and code blocks
- Keeping heading structure
- Removing duplicates
- Adding metadata
Example metadata:
{
  "source": "docs/api/authentication.md",
  "category": "api",
  "product": "console",
  "updated_at": "2026-04-10"
}
This metadata matters for filtering and source attribution.
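As an illustration, each ingested chunk can carry this metadata alongside its cleaned text. The dataclass below is an assumed internal shape, not a required schema; the field names mirror the example above:

from dataclasses import dataclass, field

@dataclass
class DocChunk:
    text: str                  # cleaned body text
    source: str                # e.g. "docs/api/authentication.md"
    category: str              # used for metadata filtering
    product: str
    updated_at: str            # used for outdated-document warnings
    headings: list[str] = field(default_factory=list)  # heading path kept from preprocessing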
3. Chunking
Chunking is not just cutting text into equal pieces.
Good chunking usually means:
- Keeping headings with body content
- Avoiding splitting code blocks
- Preserving lists and tables when possible
- Testing chunk sizes in the 300 to 800 token range
- Using overlap to reduce context loss
Chunks that are too small lose meaning; chunks that are too large drag in unrelated noise.
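One common pattern, shown below as a minimal sketch, is fixed-size chunks with overlap inside heading-scoped sections. The whitespace split is only a stand-in for the embedding model's real tokenizer:

def chunk_section(heading: str, body: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    # Naive token proxy: whitespace split. Use the real tokenizer in practice.
    words = body.split()
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(words), step):
        window = words[start:start + max_tokens]
        # Keep the heading attached so each chunk stays self-describing.
        chunks.append(f"{heading}\n{' '.join(window)}")
    return chunks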
Embeddings and Indexing
Embedding models convert text into vectors for similarity search.
In practice, evaluation should focus on:
- Embedding quality in Korean and English where needed
- Cost
- Indexing speed
- Query latency
- Operational simplicity
A reproducible ingestion and re-indexing workflow usually matters more than chasing the absolute highest benchmark.
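To keep the mechanics concrete, here is a minimal in-memory index using cosine similarity. It is deliberately brute force; a production system would use a real vector store, and how the vectors are produced is left to whichever embedding model you pick:

import numpy as np

class InMemoryIndex:
    def __init__(self):
        self.vectors: list[np.ndarray] = []
        self.payloads: list[dict] = []

    def add(self, vector: np.ndarray, payload: dict) -> None:
        # Normalize once so a dot product equals cosine similarity.
        self.vectors.append(vector / np.linalg.norm(vector))
        self.payloads.append(payload)

    def search(self, query_vector: np.ndarray, top_k: int = 5) -> list[dict]:
        q = query_vector / np.linalg.norm(query_vector)
        scores = np.array([v @ q for v in self.vectors])
        best = np.argsort(-scores)[:top_k]
        return [self.payloads[i] | {"score": float(scores[i])} for i in best]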
The Biggest Retrieval Quality Levers
Query Rewrite
User questions are often underspecified.
Example:
Original: "permission error"
Expanded: "How to diagnose AWS IAM permission errors that cause API requests to fail"
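The rewrite itself is usually just one extra LLM call with a narrow instruction. The sketch below only builds that instruction; the completion client it would be sent to is out of scope here:

def build_rewrite_prompt(user_query: str) -> str:
    # Ask for a single expanded search query and nothing else.
    return (
        "Rewrite the user's question as one specific, self-contained search query.\n"
        "Keep exact error codes, product names, and API names unchanged.\n"
        f"User question: {user_query}\n"
        "Search query:"
    )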
Hybrid Search
Vector search alone is often not enough.
A practical combination is:
- BM25 keyword search
- Vector similarity search
- Result merging
Precise tokens like error codes or API names are often handled better by keyword search.
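One widely used way to merge the two ranked lists is reciprocal rank fusion, sketched here over plain document ids. Treat it as one reasonable default, not the only correct merge strategy:

from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each ranking is a list of doc ids, best first (e.g. BM25 results and vector results).
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

The k constant damps the advantage of top-ranked items in any single list; 60 is a commonly used default.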
Reranking
Instead of passing top-k retrieval results directly to the model, reranking often improves answer quality significantly.
This matters especially when documents are long or many chunks are semantically close.
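As one concrete option, a small cross-encoder can rescore retrieved chunks against the query. This sketch assumes the sentence-transformers package and a public MS MARCO cross-encoder checkpoint; any equivalent reranker fits the same slot:

from sentence_transformers import CrossEncoder

# Loaded once; reloading the model on every query would be far too slow.
_reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[str]:
    # Cross-encoders read query and chunk together, so they are slower
    # but more precise than the bi-encoder used for first-stage retrieval.
    scores = _reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]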
Prompt Structure Matters
RAG is not just “attach retrieved text and ask the model.” A deliberate prompt structure looks more like this:
System:
You are a technical support assistant that must answer only from the provided documents.
If evidence is insufficient, say so explicitly.
Context:
[Document 1 summary]
[Document 2 summary]
User Question:
How does token refresh work when the access token expires?
Useful prompt rules include:
- Clear role definition
- Evidence scope limitation
- Explicit uncertainty behavior
- Consistent citation format
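A prompt composer that enforces those rules might look like the sketch below. The template wording and the chunk fields (text, source) are assumptions, not a fixed format:

def compose_prompt(question: str, chunks: list[dict]) -> str:
    # Number each chunk so the model can cite sources as [1], [2], ...
    context = "\n\n".join(
        f"[{i}] (source: {chunk['source']})\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "You are a technical support assistant. Answer only from the documents below.\n"
        "Cite sources as [n]. If the documents do not contain the answer, say so explicitly.\n\n"
        f"Documents:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )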
Safety Matters More Than Fluency
In many RAG products, a safe answer is more valuable than a polished but misleading one.
Things to design for:
- Fallback behavior when retrieval is weak
- Warnings on outdated docs
- Source links
- Sensitive document filtering
- Permission-aware retrieval
If the system searches internal docs, what a user can retrieve should depend on access rights.
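In practice that usually means filtering by access metadata before chunks ever reach the prompt, not after generation. The allowed_groups field and the caller's group set in this sketch are assumed conventions:

def filter_by_access(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    # Drop any chunk the caller is not allowed to see before it enters the prompt.
    return [
        chunk for chunk in chunks
        if user_groups & set(chunk.get("allowed_groups", []))
    ]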
How to Evaluate RAG
Do not evaluate by intuition alone.
Useful metrics include:
- Retrieval precision
- Retrieval recall
- Answer groundedness
- Hallucination rate
- Citation accuracy
- End-to-end latency
Example evaluation sample:
{
  "question": "When is the token refresh API called?",
  "expected_docs": ["auth/token-refresh.md"],
  "must_include": ["refresh token", "expiration"],
  "must_not_include": ["password reset"]
}
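A minimal offline check against samples in that shape can score retrieval and answer content without a human in the loop. The retrieved_docs and answer arguments are assumed outputs of your own pipeline:

def evaluate_sample(sample: dict, retrieved_docs: list[str], answer: str) -> dict:
    answer_lower = answer.lower()
    return {
        "retrieval_hit": any(doc in retrieved_docs for doc in sample["expected_docs"]),
        "includes_required": all(t.lower() in answer_lower for t in sample["must_include"]),
        "excludes_forbidden": not any(t.lower() in answer_lower for t in sample["must_not_include"]),
    }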
Common Operational Failures
Stale Indexes
If docs change but embeddings are not refreshed, answers become stale immediately.
Inconsistent Chunking
If each document type is chunked differently, retrieval quality becomes highly uneven.
Too Many Retrieved Chunks
Increasing top-k without discipline often makes answers worse, not better.
Missing Citations in the UI
If users cannot verify answers, they will not trust the system.
A Good Starting Architecture
A strong first version usually includes:
- Ingestion pipeline
- Preprocessing and chunking
- Embedding generation
- Vector storage
- Search + prompt composition in the API layer
- Response plus citations
Then you can add:
- Query rewrite
- Reranking
- Hybrid search
- Offline evaluation
- Feedback loop
Closing Thoughts
RAG quality depends at least as much on data pipelines and retrieval strategy as it does on the language model itself.
Strong teams usually do not start by overcomplicating everything. They:
- Clean the documents
- Define chunking rules
- Measure retrieval quality
- Build citation-based answers
- Keep iterating with evaluation
That discipline is what makes a RAG system actually useful.