Operating LLM Services in Production — Stability Guide for AI Applications
How to reliably operate LLM-based services in production. Covers cost management, latency optimization, incident response, and monitoring — all from real-world experience.
TestForge Team
Unique Challenges of LLM Services
Unlike regular APIs:
- Latency: Seconds to tens of seconds (users drop off without streaming)
- Cost: Traffic × token count = unpredictable bill spikes
- Non-determinism: Same input can produce different output
- Rate limits: Provider-imposed per-minute request/token limits
1. Streaming for Better UX
The most effective way to reduce perceived wait time.
```python
# FastAPI SSE streaming
from anthropic import AsyncAnthropic
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

# Use the async client so streaming doesn't block the event loop
client = AsyncAnthropic()
app = FastAPI()

@app.post("/chat/stream")
async def chat_stream(message: str):
    async def generate():
        async with client.messages.stream(
            model="claude-haiku-4-5-20251001",
            max_tokens=1024,
            messages=[{"role": "user", "content": message}],
        ) as stream:
            async for text in stream.text_stream:
                yield f"data: {text}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")
```
```js
// Frontend: read the SSE stream chunk by chunk
const response = await fetch('/chat/stream', { method: 'POST', body: ... });
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // Strip the SSE framing ("data: " prefix, blank-line separators)
  for (const line of decoder.decode(value, { stream: true }).split('\n')) {
    if (!line.startsWith('data: ')) continue;
    const text = line.slice(6);
    if (text === '[DONE]') break;
    appendToUI(text);
  }
}
```
2. Prompt Caching to Cut Costs
Avoid paying to resend the same large system prompt repeatedly.
```python
# Anthropic Prompt Caching
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,  # Thousands of tokens of fixed prompt
            "cache_control": {"type": "ephemeral"},  # 5-minute cache
        }
    ],
    messages=[{"role": "user", "content": user_message}],
)
# Cache hit = 90% reduction in input cost
```
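To see when caching pays off, here is a rough break-even sketch using the Sonnet input price from this guide ($3 per 1M tokens). The 0.1x read rate matches the 90% reduction above; the 1.25x write surcharge for 5-minute ephemeral caches is an assumption to verify against current Anthropic pricing.

```python
# Input-cost sketch for prompt caching. Assumes cache writes cost 1.25x the
# base input rate (assumption -- check current pricing) and cache reads cost
# 0.1x (the "90% reduction" above).
INPUT_PER_MTOK = 3.0  # USD per 1M input tokens (Sonnet, from the table below)

def input_cost_usd(prompt_tokens: int, requests: int, cached: bool) -> float:
    base = prompt_tokens * INPUT_PER_MTOK / 1_000_000
    if not cached:
        return requests * base
    # First request writes the cache; the remaining requests read it
    return base * 1.25 + (requests - 1) * base * 0.10

# 10k-token system prompt, 100 requests within the cache window:
print(input_cost_usd(10_000, 100, cached=False))
print(input_cost_usd(10_000, 100, cached=True))
```

For a large, frequently reused system prompt the cached path wins after the second request, since a single 1.25x write is quickly amortized by 0.1x reads.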
3. Model Routing Strategy
Not every request needs the most expensive model.
```python
def select_model(task_type: str, complexity: int) -> str:
    """complexity: 1 (simple) to 5 (complex)"""
    if task_type == "classification" or complexity <= 2:
        return "claude-haiku-4-5-20251001"  # Fast and cheap
    elif complexity <= 4:
        return "claude-sonnet-4-6"  # Balanced
    else:
        return "claude-opus-4-6"  # Complex reasoning

# Cost comparison (input / output per 1M tokens)
# Haiku:  $0.80 / $4
# Sonnet: $3    / $15
# Opus:   $15   / $75
```
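The payoff of routing is easiest to see as a blended cost over a traffic mix. This self-contained sketch reuses the routing rule and the input prices above; the traffic triples in the test are made-up illustration numbers.

```python
# Blended input-cost sketch for model routing, using the per-1M-token input
# prices listed above.
INPUT_PRICE = {  # USD per 1M input tokens
    "claude-haiku-4-5-20251001": 0.80,
    "claude-sonnet-4-6": 3.00,
    "claude-opus-4-6": 15.00,
}

def select_model(task_type: str, complexity: int) -> str:
    if task_type == "classification" or complexity <= 2:
        return "claude-haiku-4-5-20251001"
    elif complexity <= 4:
        return "claude-sonnet-4-6"
    return "claude-opus-4-6"

def blended_input_cost(requests: list[tuple[str, int, int]]) -> float:
    """requests: (task_type, complexity, input_tokens) triples."""
    total = 0.0
    for task_type, complexity, tokens in requests:
        model = select_model(task_type, complexity)
        total += tokens * INPUT_PRICE[model] / 1_000_000
    return total
```

Routing a classification request to Haiku instead of Opus cuts its input cost by roughly 19x, so even a modest share of simple traffic moves the blended number noticeably.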
4. Handling Rate Limits
```python
import asyncio

from anthropic import AsyncAnthropic, APIConnectionError, RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

client = AsyncAnthropic()

@retry(
    retry=retry_if_exception_type((RateLimitError, APIConnectionError)),
    wait=wait_exponential(multiplier=1, min=2, max=60),
    stop=stop_after_attempt(5),
)
async def call_llm_with_retry(messages: list) -> str:
    response = await client.messages.create(...)
    return response.content[0].text

# Limit concurrent requests with a semaphore
semaphore = asyncio.Semaphore(10)  # Max 10 concurrent requests

async def safe_llm_call(messages):
    async with semaphore:
        return await call_llm_with_retry(messages)
```
5. Cost Monitoring
```python
# Track token usage, cost, and latency per request
import time

async def tracked_llm_call(user_id: str, messages: list):
    start = time.time()
    response = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=messages,
    )
    # Calculate cost from the usage block the API returns
    input_cost = response.usage.input_tokens * 3 / 1_000_000
    output_cost = response.usage.output_tokens * 15 / 1_000_000
    # Record metrics
    metrics.record({
        "user_id": user_id,
        "model": "claude-sonnet-4-6",
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "cost_usd": input_cost + output_cost,
        "latency_ms": (time.time() - start) * 1000,
    })
    return response
```
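The recorded `cost_usd` values feed naturally into the daily budget alert from the checklist below. A minimal sketch, assuming the 80%-of-budget alert threshold used in the metrics table:

```python
# Daily budget check: aggregate per-request costs (like the cost_usd metric
# recorded above) and flag spend against the daily budget.
def budget_status(costs_usd: list[float], daily_budget_usd: float) -> str:
    spent = sum(costs_usd)
    if spent >= daily_budget_usd:
        return "over_budget"
    if spent >= 0.8 * daily_budget_usd:
        return "alert"  # crossed 80% of budget -- page someone
    return "ok"
```

In practice this runs on a schedule against the metrics store; the three return values map onto dashboard states or alert severities.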
6. Fallback Strategy
```python
MODELS_BY_PRIORITY = [
    "claude-sonnet-4-6",          # Primary
    "claude-haiku-4-5-20251001",  # Fallback
]

async def call_with_fallback(messages: list) -> str:
    for model in MODELS_BY_PRIORITY:
        try:
            response = await client.messages.create(
                model=model, messages=messages, max_tokens=1024
            )
            return response.content[0].text
        except RateLimitError:
            continue  # Try the next model
    raise RuntimeError("All models rate limited")
```
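The fallback control flow is worth exercising in tests before an incident forces it. This self-contained sketch substitutes a stubbed client and exception class so the loop can run offline; all the stub names are hypothetical.

```python
# Offline sketch of the fallback loop with a stubbed client.
class RateLimitError(Exception):  # stand-in for anthropic.RateLimitError
    pass

MODELS_BY_PRIORITY = ["claude-sonnet-4-6", "claude-haiku-4-5-20251001"]

def fake_create(model: str) -> str:
    if model == "claude-sonnet-4-6":
        raise RateLimitError()  # Simulate the primary being throttled
    return f"answered by {model}"

def call_with_fallback() -> str:
    for model in MODELS_BY_PRIORITY:
        try:
            return fake_create(model)
        except RateLimitError:
            continue  # Try the next model
    raise RuntimeError("All models rate limited")

print(call_with_fallback())  # answered by claude-haiku-4-5-20251001
```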
7. Monitoring Dashboard — Key Metrics
| Metric | Healthy | Alert Threshold |
|---|---|---|
| Average latency | < 3s | > 10s |
| P99 latency | < 15s | > 30s |
| Error rate | < 1% | > 5% |
| Rate limit rate | < 0.1% | > 1% |
| Daily cost | Within budget | > 80% of budget |
| Avg tokens/request | Within baseline | > 2x baseline |
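The P99 threshold in the table assumes you compute percentiles from raw latency samples rather than averages. A minimal nearest-rank sketch:

```python
# Nearest-rank percentile over raw samples, e.g. percentile(latencies_ms, 99)
# compared against the 30s alert threshold above.
def percentile(samples: list[float], pct: float) -> float:
    ordered = sorted(samples)
    # Nearest-rank: ceil(pct/100 * n), clamped to at least rank 1
    rank = max(1, -(-len(ordered) * pct // 100))
    return ordered[int(rank) - 1]
```

Most metrics backends (Prometheus, Datadog, CloudWatch) compute this for you; the sketch only shows what the number means.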
Operations Checklist
- Streaming responses (UX)
- Prompt caching applied (cost)
- Retry + exponential backoff (stability)
- Concurrent request limit (rate limit defense)
- Per-model cost tracking
- Daily cost alert configured
- Fallback model defined
- Timeout set (30s)
- Prompt injection defense
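The 30-second timeout item from the checklist can be sketched with `asyncio.wait_for`; the stub coroutine and timing here are illustrative, not a real API call.

```python
# Cap any single LLM call's wall-clock time with asyncio.wait_for.
import asyncio

async def slow_call():
    await asyncio.sleep(0.05)  # Stand-in for a slow LLM request
    return "done"

async def call_with_timeout(timeout_s: float = 30.0) -> str:
    try:
        return await asyncio.wait_for(slow_call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Cancelled cleanly; surface a fallback answer or an error to the user
        return "timed out"

print(asyncio.run(call_with_timeout(0.01)))  # timed out
```

`wait_for` cancels the inner task on expiry, so the request doesn't keep burning tokens in the background after the user has given up.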