RAG Architecture Design Guide — From Retrieval Quality to Answer Generation
A practical guide to designing RAG systems. Covers document ingestion, chunking, embeddings, vector search, reranking, prompt composition, and evaluation from a real product engineering perspective.
Why RAG Exists
A service built on an LLM alone struggles to accurately reflect the latest documents, internal knowledge, and domain-specific policies.
RAG solves that by retrieving relevant external knowledge and injecting it into generation.
User Query
-> Query Rewrite
-> Retrieve Documents
-> Rerank
-> Prompt Compose
-> LLM Generate
-> Final Answer
The key question is not only whether the model is smart. It is whether the system retrieves the right information at the right time.
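As a rough sketch, the whole flow can be read as one thin orchestration function. Every name below (rewrite_query, retrieve, rerank, compose_prompt, generate) is a placeholder for whichever component you actually plug in:

from typing import Callable, List

def answer(
    question: str,
    rewrite_query: Callable[[str], str],
    retrieve: Callable[[str, int], List[str]],
    rerank: Callable[[str, List[str], int], List[str]],
    compose_prompt: Callable[[str, List[str]], str],
    generate: Callable[[str], str],
) -> str:
    # Each callable is a stand-in for a real component.
    rewritten = rewrite_query(question)        # expand an underspecified query
    candidates = retrieve(rewritten, 20)       # cast a wide net with hybrid search
    best = rerank(rewritten, candidates, 5)    # keep only the strongest chunks
    prompt = compose_prompt(question, best)    # role, context, citation rules
    return generate(prompt)                    # single LLM call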
Core Stages of a RAG System
1. Document Collection
Typical sources include:
- Internal wikis
- PDF manuals
- Product docs
- Help center FAQs
- Runbooks
- Code and config documentation
If source quality is weak, retrieval quality will also be weak.
2. Document Preprocessing
Important preprocessing tasks:
- Extracting text from HTML or PDF
- Preserving tables and code blocks
- Keeping heading structure
- Removing duplicates
- Adding metadata
Example metadata:
{
  "source": "docs/api/authentication.md",
  "category": "api",
  "product": "console",
  "updated_at": "2026-04-10"
}
This metadata matters for filtering and source attribution.
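As an illustration, each ingested chunk can carry this metadata alongside its cleaned text. The dataclass below is an assumed internal shape, not a required schema; the field names mirror the example above:

from dataclasses import dataclass, field

@dataclass
class DocChunk:
    text: str                  # cleaned body text
    source: str                # e.g. "docs/api/authentication.md"
    category: str              # used for metadata filtering
    product: str
    updated_at: str            # used for outdated-document warnings
    headings: list[str] = field(default_factory=list)  # heading path kept from preprocessing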
3. Chunking
Chunking is not just cutting text into equal pieces.
Good chunking usually means:
- Keeping headings with body content
- Avoiding splitting code blocks
- Preserving lists and tables when possible
- Testing chunk sizes in the 300 to 800 token range
- Using overlap to reduce context loss
Chunks that are too small lose meaning; chunks that are too large drag in unrelated noise.
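One common pattern, shown below as a minimal sketch, is fixed-size chunks with overlap inside heading-scoped sections. The whitespace split is only a stand-in for the embedding model's real tokenizer:

def chunk_section(heading: str, body: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    # Naive token proxy: whitespace split. Use the real tokenizer in practice.
    words = body.split()
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(words), step):
        window = words[start:start + max_tokens]
        # Keep the heading attached so each chunk stays self-describing.
        chunks.append(f"{heading}\n{' '.join(window)}")
    return chunks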
Embeddings and Indexing
Embedding models convert text into vectors for similarity search.
In practice, evaluation should focus on:
- Embedding quality in Korean and English where needed
- Cost
- Indexing speed
- Query latency
- Operational simplicity
A reproducible ingestion and re-indexing workflow usually matters more than chasing the absolute highest benchmark.
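To keep the mechanics concrete, here is a minimal in-memory index using cosine similarity. It is deliberately brute force; a production system would use a real vector store, and how the vectors are produced is left to whichever embedding model you pick:

import numpy as np

class InMemoryIndex:
    def __init__(self):
        self.vectors: list[np.ndarray] = []
        self.payloads: list[dict] = []

    def add(self, vector: np.ndarray, payload: dict) -> None:
        # Normalize once so a dot product equals cosine similarity.
        self.vectors.append(vector / np.linalg.norm(vector))
        self.payloads.append(payload)

    def search(self, query_vector: np.ndarray, top_k: int = 5) -> list[dict]:
        q = query_vector / np.linalg.norm(query_vector)
        scores = np.array([v @ q for v in self.vectors])
        best = np.argsort(-scores)[:top_k]
        return [self.payloads[i] | {"score": float(scores[i])} for i in best]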
The Biggest Retrieval Quality Levers
Query Rewrite
User questions are often underspecified.
Example:
Original: "permission error"
Expanded: "How to diagnose AWS IAM permission errors that cause API requests to fail"
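The rewrite itself is usually just one extra LLM call with a narrow instruction. The sketch below only builds that instruction; the completion client it would be sent to is out of scope here:

def build_rewrite_prompt(user_query: str) -> str:
    # Ask for a single expanded search query and nothing else.
    return (
        "Rewrite the user's question as one specific, self-contained search query.\n"
        "Keep exact error codes, product names, and API names unchanged.\n"
        f"User question: {user_query}\n"
        "Search query:"
    )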
Hybrid Search
Vector search alone is often not enough.
A practical combination is:
- BM25 keyword search
- Vector similarity search
- Result merging
Precise tokens like error codes or API names are often handled better by keyword search.
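One widely used way to merge the two ranked lists is reciprocal rank fusion, sketched here over plain document ids. Treat it as one reasonable default, not the only correct merge strategy:

from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each ranking is a list of doc ids, best first (e.g. BM25 results and vector results).
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

The k constant damps the advantage of top-ranked items in any single list; 60 is a commonly used default.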
Reranking
Instead of passing top-k retrieval results directly to the model, reranking often improves answer quality significantly.
This matters especially when documents are long or many chunks are semantically close.
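As one concrete option, a small cross-encoder can rescore retrieved chunks against the query. This sketch assumes the sentence-transformers package and a public MS MARCO cross-encoder checkpoint; any equivalent reranker fits the same slot:

from sentence_transformers import CrossEncoder

# Loaded once; reloading the model on every query would be far too slow.
_reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[str]:
    # Cross-encoders read query and chunk together, so they are slower
    # but more precise than the bi-encoder used for first-stage retrieval.
    scores = _reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]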
Prompt Structure Matters
RAG is not just “attach retrieved text and ask the model.” A deliberate prompt structure looks more like this:
System:
You are a technical support assistant that must answer only from the provided documents.
If evidence is insufficient, say so explicitly.
Context:
[Document 1 summary]
[Document 2 summary]
User Question:
How does token refresh work when the access token expires?
Useful prompt rules include:
- Clear role definition
- Evidence scope limitation
- Explicit uncertainty behavior
- Consistent citation format
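A prompt composer that enforces those rules might look like the sketch below. The template wording and the chunk fields (text, source) are assumptions, not a fixed format:

def compose_prompt(question: str, chunks: list[dict]) -> str:
    # Number each chunk so the model can cite sources as [1], [2], ...
    context = "\n\n".join(
        f"[{i}] (source: {chunk['source']})\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "You are a technical support assistant. Answer only from the documents below.\n"
        "Cite sources as [n]. If the documents do not contain the answer, say so explicitly.\n\n"
        f"Documents:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )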
Safety Matters More Than Fluency
In many RAG products, a safe answer is more valuable than a polished but misleading one.
Things to design for:
- Fallback behavior when retrieval is weak
- Warnings on outdated docs
- Source links
- Sensitive document filtering
- Permission-aware retrieval
If the system searches internal docs, what a user can retrieve should depend on access rights.
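In practice that usually means filtering by access metadata before chunks ever reach the prompt, not after generation. The allowed_groups field and the caller's group set in this sketch are assumed conventions:

def filter_by_access(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    # Drop any chunk the caller is not allowed to see before it enters the prompt.
    return [
        chunk for chunk in chunks
        if user_groups & set(chunk.get("allowed_groups", []))
    ]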
How to Evaluate RAG
Do not evaluate by intuition alone.
Useful metrics include:
- Retrieval precision
- Retrieval recall
- Answer groundedness
- Hallucination rate
- Citation accuracy
- End-to-end latency
Example evaluation sample:
{
  "question": "When is the token refresh API called?",
  "expected_docs": ["auth/token-refresh.md"],
  "must_include": ["refresh token", "expiration"],
  "must_not_include": ["password reset"]
}
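A minimal offline check against samples in that shape can score retrieval and answer content without a human in the loop. The retrieved_docs and answer arguments are assumed outputs of your own pipeline:

def evaluate_sample(sample: dict, retrieved_docs: list[str], answer: str) -> dict:
    answer_lower = answer.lower()
    return {
        "retrieval_hit": any(doc in retrieved_docs for doc in sample["expected_docs"]),
        "includes_required": all(t.lower() in answer_lower for t in sample["must_include"]),
        "excludes_forbidden": not any(t.lower() in answer_lower for t in sample["must_not_include"]),
    }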
Common Operational Failures
Stale Indexes
If docs change but embeddings are not refreshed, answers become stale immediately.
Inconsistent Chunking
If each document type is chunked differently, retrieval quality becomes highly uneven.
Too Many Retrieved Chunks
Increasing top-k without discipline often makes answers worse, not better.
Missing Citations in the UI
If users cannot verify answers, they will not trust the system.
A Good Starting Architecture
A strong first version usually includes:
- Ingestion pipeline
- Preprocessing and chunking
- Embedding generation
- Vector storage
- Search + prompt composition in the API layer
- Response plus citations
Then you can add:
- Query rewrite
- Reranking
- Hybrid search
- Offline evaluation
- Feedback loop
Closing Thoughts
RAG quality depends at least as much on data pipelines and retrieval strategy as it does on the language model itself.
Strong teams usually do not start by overcomplicating everything. They:
- Clean the documents
- Define chunking rules
- Measure retrieval quality
- Build citation-based answers
- Keep iterating with evaluation
That discipline is what makes a RAG system actually useful.