RAG Development Part 5 — Evaluation, Observability, and Production Operations
To move RAG into production, you need quality evaluation, logging, latency tracking, and feedback loops. This post covers retrieval metrics, groundedness, citation accuracy, observability, and operational checklists.
RAG in Production Is Not Just a Demo With More Users
At first, a few strong example answers can make the system look successful.
In production, new problems appear:
- Some question types work far better than others
- Quality shifts after document updates
- Latency grows over time
- Citations drift or become inaccurate
- User trust drops when answers are hard to verify
That is why evaluation and operations are first-class parts of RAG.
What Should Be Measured?
It helps to separate evaluation into three layers.
1. Retrieval Quality
- Recall@k
- Precision@k
- MRR
- NDCG
2. Answer Quality
- groundedness
- citation accuracy
- hallucination rate
- completeness
3. System Quality
- end-to-end latency
- retrieval latency
- token usage
- error rate
- fallback rate
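The retrieval-layer metrics above are easy to compute per query once you have the ranked result list and a labeled relevant set. A minimal sketch (function names are my own, not from any particular library):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant docs that appear in the top-k results."""
    top = set(retrieved[:k])
    return len(top & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    rel = set(relevant)
    return sum(1 for d in retrieved[:k] if d in rel) / k

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant result; 0 if none retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0
```

Averaging these over an evaluation set gives the aggregate numbers you would track between index or chunking changes.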
Build a Real Evaluation Set
Useful evaluation examples look like this:
{
  "question": "When is the token refresh API called?",
  "expected_doc_ids": ["auth-guide", "auth-errors"],
  "must_include": ["refresh token", "expiration"],
  "must_not_include": ["password reset"],
  "answer_type": "procedural"
}
A good test set should include:
- definition questions
- procedure questions
- comparison questions
- troubleshooting questions
- ambiguous questions
- freshness-sensitive questions
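A test set in the schema shown above can be scored with a small runner. This is a sketch under the assumption that you already have the retrieved doc ids and the final answer for each case; the field names match the example:

```python
def evaluate_case(case, retrieved_ids, answer):
    """Score one evaluation case against retrieval output and the final answer."""
    retrieved = set(retrieved_ids)
    answer_lower = answer.lower()
    return {
        # Did retrieval surface every expected document?
        "retrieval_hit": all(d in retrieved for d in case["expected_doc_ids"]),
        # Does the answer contain all required phrases?
        "includes_required": all(p.lower() in answer_lower
                                 for p in case["must_include"]),
        # Is the answer free of forbidden phrases?
        "excludes_forbidden": not any(p.lower() in answer_lower
                                      for p in case["must_not_include"]),
    }
```

String containment is a crude check, but it is cheap, deterministic, and catches a surprising number of regressions after prompt or index changes.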
Evaluate Retrieval Separately
Bad final answers do not always mean the model is the problem.
You need to distinguish:
- The correct doc was never retrieved
- The correct doc was retrieved but ranked too low
- The correct doc was retrieved but the model used it poorly
This separation makes debugging much faster.
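The three failure modes above can be separated mechanically. A hypothetical classifier, assuming `retrieved_ids` is the full candidate list before truncation to the final top-k and `answer_grounded` comes from a separate groundedness check:

```python
def classify_failure(expected_ids, retrieved_ids, k, answer_grounded):
    """Attribute a bad answer to retrieval, ranking, or generation."""
    top_k = retrieved_ids[:k]
    if not any(d in retrieved_ids for d in expected_ids):
        return "not_retrieved"        # fix: index, chunking, query rewriting
    if not any(d in top_k for d in expected_ids):
        return "ranked_too_low"       # fix: reranker, k, filters
    if not answer_grounded:
        return "model_used_it_poorly" # fix: prompt, context formatting
    return "ok"
```

Tallying these labels over an evaluation run tells you which stage to work on first, instead of tuning the prompt when the real problem is recall.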
How to Think About Groundedness
Groundedness means the answer is actually supported by the retrieved evidence.
Simple checks include:
- Compare answer claims against cited chunks
- Detect unsupported factual assertions
- Look for entities that never appear in evidence
If the system states “tokens expire in 24 hours” but none of the retrieved docs say that, groundedness failed.
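The third check above, entities that never appear in the evidence, can be approximated with a simple pattern match. This sketch flags capitalized terms and numbers in the answer that no retrieved chunk contains; it is a crude proxy, not a full claim verifier:

```python
import re

def unsupported_entities(answer, evidence_chunks):
    """Return answer tokens (capitalized words, numbers) absent from all evidence."""
    evidence = " ".join(evidence_chunks).lower()
    # Candidate "entities": capitalized words and bare numbers.
    candidates = re.findall(r"\b(?:[A-Z][\w-]+|\d+(?:\.\d+)?)\b", answer)
    return sorted({c for c in candidates if c.lower() not in evidence})
```

In the 24-hour example, the unsupported "24" would be flagged immediately. Real systems usually layer an LLM-based claim check on top, but a cheap filter like this catches the blatant cases at near-zero cost.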
What to Log in Production
Useful RAG logs include:
- original query
- rewritten query
- retrieval filters
- retrieval top-k results
- reranked order
- final selected chunks
- prompt size
- model name
- latency
- citations
- fallback events
Without this, root-cause analysis becomes guesswork.
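One structured record per request is enough to cover the fields above. A sketch that emits a JSON line (field names are illustrative; in production this would go to your logging pipeline rather than stdout):

```python
import json
import time

def log_rag_request(query, rewritten, filters, top_k_ids, reranked_ids,
                    selected_ids, prompt_tokens, model, latency_ms,
                    citations, fallback):
    """Build and emit one structured log record for a RAG request."""
    record = {
        "ts": time.time(),
        "query": query,
        "rewritten_query": rewritten,
        "filters": filters,
        "top_k": top_k_ids,
        "reranked": reranked_ids,
        "selected_chunks": selected_ids,
        "prompt_tokens": prompt_tokens,
        "model": model,
        "latency_ms": latency_ms,
        "citations": citations,
        "fallback": fallback,
    }
    print(json.dumps(record))  # stand-in for a real log sink
    return record
```

Because every field is JSON-serializable, the same records can later feed the evaluation set: a logged failure becomes a new test case.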
Latency Usually Comes From Multiple Places
End-to-end latency often includes:
- query rewriting
- retrieval
- reranking
- generation
- post-processing
Example:
Query rewrite: 120ms
Retrieval: 80ms
Reranking: 140ms
Generation: 900ms
Post-process: 40ms
Total: 1280ms
You need this breakdown to optimize intelligently.
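A per-stage breakdown like the one above is easy to collect with a small timing helper. A minimal sketch using a context manager:

```python
import time
from contextlib import contextmanager

class StageTimer:
    """Collect per-stage latency so end-to-end time can be broken down."""
    def __init__(self):
        self.stages = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.stages[name] = (time.perf_counter() - start) * 1000  # ms

    def total_ms(self):
        return sum(self.stages.values())
```

Usage is one `with timer.stage("retrieval"):` block per pipeline step; the resulting dict slots directly into the structured log record.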
Freshness and Re-indexing Are Operational Metrics Too
If documents change but the index lags behind, the system becomes stale.
Things to watch:
- document change detection delay
- chunk generation delay
- embedding job failure rate
- time to searchable availability
For operational documentation, freshness can be as important as answer fluency.
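"Time to searchable availability" can be monitored by comparing each document's last-modified time with the version currently in the index. A hypothetical check, assuming both sides expose epoch-second timestamps:

```python
def stale_documents(doc_versions, index_versions, max_lag_s, now):
    """Return ids of docs whose latest change is not indexed within max_lag_s.

    doc_versions / index_versions map doc id -> last-modified epoch seconds.
    """
    stale = []
    for doc_id, changed in doc_versions.items():
        indexed = index_versions.get(doc_id)
        # Stale if never indexed, or indexed before the latest change,
        # and the lag has exceeded the allowed window.
        if (indexed is None or indexed < changed) and now - changed > max_lag_s:
            stale.append(doc_id)
    return sorted(stale)
```

Alerting on a non-empty result (or on its size trending upward) turns freshness from an anecdote into an operational metric.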
User Feedback Loops Still Matter
Quantitative metrics are not enough by themselves.
Useful feedback signals:
- helpful / not helpful
- citation relevance
- answer verbosity
- whether the answer solved the real task
In practice, feedback is most useful early on as raw material for categorizing failures, before it is used as a headline quality score.
Common Production Issues
- Quality drops after document updates
- One category of questions underperforms
- Citations point to weak evidence
- Latency grows because top-k and prompt size expand over time
These are normal operating problems, not rare exceptions.
A Practical Dashboard
An early production dashboard can be fairly simple:
- request count
- average latency
- fallback rate
- top-k retrieval success rate
- citation coverage
- user feedback score
- failure rate by category
The point is to detect drift quickly, not to build a giant dashboard first.
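Most of these dashboard numbers are simple aggregates over the structured request logs. A sketch, assuming each logged request carries the fields shown (these names are illustrative):

```python
from collections import Counter

def dashboard_snapshot(requests):
    """Aggregate simple dashboard metrics from per-request log records.

    Each record: latency_ms, fallback (bool), retrieval_hit (bool),
    has_citations (bool), failure_category (str or None).
    """
    n = len(requests)
    return {
        "request_count": n,
        "avg_latency_ms": sum(r["latency_ms"] for r in requests) / n,
        "fallback_rate": sum(r["fallback"] for r in requests) / n,
        "retrieval_success_rate": sum(r["retrieval_hit"] for r in requests) / n,
        "citation_coverage": sum(r["has_citations"] for r in requests) / n,
        "failures_by_category": Counter(
            r["failure_category"] for r in requests if r["failure_category"]
        ),
    }
```

Recomputing this over a rolling window and comparing snapshots is usually enough to catch drift long before users report it.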
Production Checklist
- Is there an evaluation set?
- Are retrieval and generation measured separately?
- Is there a re-indexing workflow for document changes?
- Do you track latency by stage?
- Do you validate citation accuracy?
- Is there a fallback strategy?
- Is there a feedback loop?
Closing Thoughts
Moving RAG into production is less about building one smart response and more about building a repeatable quality system.
Strong teams:
- re-index when docs change
- review retrieval failures
- compare variants using evaluation sets
- track both citation accuracy and latency
That is what turns a RAG demo into a reliable product capability.