RAG Development Part 5 — Evaluation, Observability, and Production Operations
To move RAG into production, you need quality evaluation, logging, latency tracking, and feedback loops. This post covers retrieval metrics, groundedness, citation accuracy, observability, and operational checklists.
RAG in Production Is Not Just a Demo With More Users
At first, a few strong example answers can make the system look successful.
In production, new problems appear:
- Some question types work far better than others
- Quality shifts after document updates
- Latency grows over time
- Citations drift or become inaccurate
- User trust drops when answers are hard to verify
That is why evaluation and operations are first-class parts of RAG.
What Should Be Measured?
It helps to separate evaluation into three layers.
1. Retrieval Quality
- Recall@k
- Precision@k
- MRR
- NDCG
2. Answer Quality
- groundedness
- citation accuracy
- hallucination rate
- completeness
3. System Quality
- end-to-end latency
- retrieval latency
- token usage
- error rate
- fallback rate
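The retrieval-layer metrics above are easy to compute per query once you have the ranked result list and a labeled relevant set. A minimal sketch (function names are my own, not from any particular library):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant docs that appear in the top-k results."""
    top = set(retrieved[:k])
    return len(top & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    rel = set(relevant)
    return sum(1 for d in retrieved[:k] if d in rel) / k

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant result; 0 if none retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0
```

Averaging these over an evaluation set gives the aggregate numbers you would track between index or chunking changes.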
Build a Real Evaluation Set
Useful evaluation examples look like this:
{
  "question": "When is the token refresh API called?",
  "expected_doc_ids": ["auth-guide", "auth-errors"],
  "must_include": ["refresh token", "expiration"],
  "must_not_include": ["password reset"],
  "answer_type": "procedural"
}
A good test set should include:
- definition questions
- procedure questions
- comparison questions
- troubleshooting questions
- ambiguous questions
- freshness-sensitive questions
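A test set in the schema shown above can be scored with a small runner. This is a sketch under the assumption that you already have the retrieved doc ids and the final answer for each case; the field names match the example:

```python
def evaluate_case(case, retrieved_ids, answer):
    """Score one evaluation case against retrieval output and the final answer."""
    retrieved = set(retrieved_ids)
    answer_lower = answer.lower()
    return {
        # Did retrieval surface every expected document?
        "retrieval_hit": all(d in retrieved for d in case["expected_doc_ids"]),
        # Does the answer contain all required phrases?
        "includes_required": all(p.lower() in answer_lower
                                 for p in case["must_include"]),
        # Is the answer free of forbidden phrases?
        "excludes_forbidden": not any(p.lower() in answer_lower
                                      for p in case["must_not_include"]),
    }
```

String containment is a crude check, but it is cheap, deterministic, and catches a surprising number of regressions after prompt or index changes.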
Evaluate Retrieval Separately
Bad final answers do not always mean the model is the problem.
You need to distinguish:
- The correct doc was never retrieved
- The correct doc was retrieved but ranked too low
- The correct doc was retrieved but the model used it poorly
This separation makes debugging much faster.
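The three failure modes above can be separated mechanically. A hypothetical classifier, assuming `retrieved_ids` is the full candidate list before truncation to the final top-k and `answer_grounded` comes from a separate groundedness check:

```python
def classify_failure(expected_ids, retrieved_ids, k, answer_grounded):
    """Attribute a bad answer to retrieval, ranking, or generation."""
    top_k = retrieved_ids[:k]
    if not any(d in retrieved_ids for d in expected_ids):
        return "not_retrieved"        # fix: index, chunking, query rewriting
    if not any(d in top_k for d in expected_ids):
        return "ranked_too_low"       # fix: reranker, k, filters
    if not answer_grounded:
        return "model_used_it_poorly" # fix: prompt, context formatting
    return "ok"
```

Tallying these labels over an evaluation run tells you which stage to work on first, instead of tuning the prompt when the real problem is recall.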
How to Think About Groundedness
Groundedness means the answer is actually supported by the retrieved evidence.
Simple checks include:
- Compare answer claims against cited chunks
- Detect unsupported factual assertions
- Look for entities that never appear in evidence
If the system states “tokens expire in 24 hours” but none of the retrieved docs say that, groundedness failed.
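The third check above, entities that never appear in the evidence, can be approximated with a simple pattern match. This sketch flags capitalized terms and numbers in the answer that no retrieved chunk contains; it is a crude proxy, not a full claim verifier:

```python
import re

def unsupported_entities(answer, evidence_chunks):
    """Return answer tokens (capitalized words, numbers) absent from all evidence."""
    evidence = " ".join(evidence_chunks).lower()
    # Candidate "entities": capitalized words and bare numbers.
    candidates = re.findall(r"\b(?:[A-Z][\w-]+|\d+(?:\.\d+)?)\b", answer)
    return sorted({c for c in candidates if c.lower() not in evidence})
```

In the 24-hour example, the unsupported "24" would be flagged immediately. Real systems usually layer an LLM-based claim check on top, but a cheap filter like this catches the blatant cases at near-zero cost.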
What to Log in Production
Useful RAG logs include:
- original query
- rewritten query
- retrieval filters
- retrieval top-k results
- reranked order
- final selected chunks
- prompt size
- model name
- latency
- citations
- fallback events
Without this, root-cause analysis becomes guesswork.
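One structured record per request is enough to cover the fields above. A sketch that emits a JSON line (field names are illustrative; in production this would go to your logging pipeline rather than stdout):

```python
import json
import time

def log_rag_request(query, rewritten, filters, top_k_ids, reranked_ids,
                    selected_ids, prompt_tokens, model, latency_ms,
                    citations, fallback):
    """Build and emit one structured log record for a RAG request."""
    record = {
        "ts": time.time(),
        "query": query,
        "rewritten_query": rewritten,
        "filters": filters,
        "top_k": top_k_ids,
        "reranked": reranked_ids,
        "selected_chunks": selected_ids,
        "prompt_tokens": prompt_tokens,
        "model": model,
        "latency_ms": latency_ms,
        "citations": citations,
        "fallback": fallback,
    }
    print(json.dumps(record))  # stand-in for a real log sink
    return record
```

Because every field is JSON-serializable, the same records can later feed the evaluation set: a logged failure becomes a new test case.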
Latency Usually Comes From Multiple Places
End-to-end latency often includes:
- query rewriting
- retrieval
- reranking
- generation
- post-processing
Example:
Query rewrite: 120ms
Retrieval: 80ms
Reranking: 140ms
Generation: 900ms
Post-process: 40ms
Total: 1280ms
You need this breakdown to optimize intelligently.
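A per-stage breakdown like the one above is easy to collect with a small timing helper. A minimal sketch using a context manager:

```python
import time
from contextlib import contextmanager

class StageTimer:
    """Collect per-stage latency so end-to-end time can be broken down."""
    def __init__(self):
        self.stages = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.stages[name] = (time.perf_counter() - start) * 1000  # ms

    def total_ms(self):
        return sum(self.stages.values())
```

Usage is one `with timer.stage("retrieval"):` block per pipeline step; the resulting dict slots directly into the structured log record.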
Freshness and Re-indexing Are Operational Metrics Too
If documents change but the index lags behind, the system becomes stale.
Things to watch:
- document change detection delay
- chunk generation delay
- embedding job failure rate
- time to searchable availability
For operational documentation, freshness can be as important as answer fluency.
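"Time to searchable availability" can be monitored by comparing each document's last-modified time with the version currently in the index. A hypothetical check, assuming both sides expose epoch-second timestamps:

```python
def stale_documents(doc_versions, index_versions, max_lag_s, now):
    """Return ids of docs whose latest change is not indexed within max_lag_s.

    doc_versions / index_versions map doc id -> last-modified epoch seconds.
    """
    stale = []
    for doc_id, changed in doc_versions.items():
        indexed = index_versions.get(doc_id)
        # Stale if never indexed, or indexed before the latest change,
        # and the lag has exceeded the allowed window.
        if (indexed is None or indexed < changed) and now - changed > max_lag_s:
            stale.append(doc_id)
    return sorted(stale)
```

Alerting on a non-empty result (or on its size trending upward) turns freshness from an anecdote into an operational metric.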
User Feedback Loops Still Matter
Quantitative metrics are not enough by themselves.
Useful feedback signals:
- helpful / not helpful
- citation relevance
- answer verbosity
- whether the answer solved the real task
In practice, feedback is most useful early on as raw material for categorizing failures, before it is used as a headline quality score.
Common Production Issues
- Quality drops after document updates
- One category of questions underperforms
- Citations point to weak evidence
- Latency grows because top-k and prompt size expand over time
These are normal operating problems, not rare exceptions.
A Practical Dashboard
An early production dashboard can be fairly simple:
- request count
- average latency
- fallback rate
- top-k retrieval success rate
- citation coverage
- user feedback score
- failure rate by category
The point is to detect drift quickly, not to build a giant dashboard first.
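Most of these dashboard numbers are simple aggregates over the structured request logs. A sketch, assuming each logged request carries the fields shown (these names are illustrative):

```python
from collections import Counter

def dashboard_snapshot(requests):
    """Aggregate simple dashboard metrics from per-request log records.

    Each record: latency_ms, fallback (bool), retrieval_hit (bool),
    has_citations (bool), failure_category (str or None).
    """
    n = len(requests)
    return {
        "request_count": n,
        "avg_latency_ms": sum(r["latency_ms"] for r in requests) / n,
        "fallback_rate": sum(r["fallback"] for r in requests) / n,
        "retrieval_success_rate": sum(r["retrieval_hit"] for r in requests) / n,
        "citation_coverage": sum(r["has_citations"] for r in requests) / n,
        "failures_by_category": Counter(
            r["failure_category"] for r in requests if r["failure_category"]
        ),
    }
```

Recomputing this over a rolling window and comparing snapshots is usually enough to catch drift long before users report it.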
Production Checklist
- Is there an evaluation set?
- Are retrieval and generation measured separately?
- Is there a re-indexing workflow for document changes?
- Do you track latency by stage?
- Do you validate citation accuracy?
- Is there a fallback strategy?
- Is there a feedback loop?
Closing Thoughts
Moving RAG into production is less about building one smart response and more about building a repeatable quality system.
Strong teams:
- re-index when docs change
- review retrieval failures
- compare variants using evaluation sets
- track both citation accuracy and latency
That is what turns a RAG demo into a reliable product capability.