Operating LLM Services in Production — Stability Guide for AI Applications
How to reliably operate LLM-based services in production. Covers cost management, latency optimization, incident response, and monitoring — all from real-world experience.
TestForge Team
Unique Challenges of LLM Services
Unlike regular APIs:
- Latency: Seconds to tens of seconds (users drop off without streaming)
- Cost: Traffic × token count = unpredictable bill spikes
- Non-determinism: Same input can produce different output
- Rate limits: Provider-imposed per-minute request/token limits
1. Streaming for Better UX
The most effective way to reduce perceived wait time.
```python
# FastAPI SSE streaming
from anthropic import AsyncAnthropic
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

# Use the async client so streaming doesn't block the event loop
client = AsyncAnthropic()
app = FastAPI()

@app.post("/chat/stream")
async def chat_stream(message: str):
    async def generate():
        async with client.messages.stream(
            model="claude-haiku-4-5-20251001",
            max_tokens=1024,
            messages=[{"role": "user", "content": message}],
        ) as stream:
            async for text in stream.text_stream:
                yield f"data: {text}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")
```
```js
// Frontend: read the SSE stream chunk by chunk
const response = await fetch('/chat/stream', { method: 'POST', body: ... });
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // Strip the SSE framing ("data: " prefix, blank-line separators)
  for (const line of decoder.decode(value, { stream: true }).split('\n')) {
    if (!line.startsWith('data: ')) continue;
    const text = line.slice(6);
    if (text === '[DONE]') break;
    appendToUI(text);
  }
}
```
2. Prompt Caching to Cut Costs
Avoid paying to resend the same large system prompt repeatedly.
```python
# Anthropic Prompt Caching
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,  # Thousands of tokens of fixed prompt
            "cache_control": {"type": "ephemeral"},  # 5-minute cache
        }
    ],
    messages=[{"role": "user", "content": user_message}],
)
# Cache hit = 90% reduction in input cost
```
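To see when caching pays off, here is a rough break-even sketch using the Sonnet input price from this guide ($3 per 1M tokens). The 0.1x read rate matches the 90% reduction above; the 1.25x write surcharge for 5-minute ephemeral caches is an assumption to verify against current Anthropic pricing.

```python
# Input-cost sketch for prompt caching. Assumes cache writes cost 1.25x the
# base input rate (assumption -- check current pricing) and cache reads cost
# 0.1x (the "90% reduction" above).
INPUT_PER_MTOK = 3.0  # USD per 1M input tokens (Sonnet, from the table below)

def input_cost_usd(prompt_tokens: int, requests: int, cached: bool) -> float:
    base = prompt_tokens * INPUT_PER_MTOK / 1_000_000
    if not cached:
        return requests * base
    # First request writes the cache; the remaining requests read it
    return base * 1.25 + (requests - 1) * base * 0.10

# 10k-token system prompt, 100 requests within the cache window:
print(input_cost_usd(10_000, 100, cached=False))
print(input_cost_usd(10_000, 100, cached=True))
```

For a large, frequently reused system prompt the cached path wins after the second request, since a single 1.25x write is quickly amortized by 0.1x reads.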
3. Model Routing Strategy
Not every request needs the most expensive model.
```python
def select_model(task_type: str, complexity: int) -> str:
    """complexity: 1 (simple) to 5 (complex)"""
    if task_type == "classification" or complexity <= 2:
        return "claude-haiku-4-5-20251001"  # Fast and cheap
    elif complexity <= 4:
        return "claude-sonnet-4-6"  # Balanced
    else:
        return "claude-opus-4-6"  # Complex reasoning

# Cost comparison (input / output per 1M tokens)
# Haiku:  $0.80 / $4
# Sonnet: $3    / $15
# Opus:   $15   / $75
```
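The payoff of routing is easiest to see as a blended cost over a traffic mix. This self-contained sketch reuses the routing rule and the input prices above; the traffic triples in the test are made-up illustration numbers.

```python
# Blended input-cost sketch for model routing, using the per-1M-token input
# prices listed above.
INPUT_PRICE = {  # USD per 1M input tokens
    "claude-haiku-4-5-20251001": 0.80,
    "claude-sonnet-4-6": 3.00,
    "claude-opus-4-6": 15.00,
}

def select_model(task_type: str, complexity: int) -> str:
    if task_type == "classification" or complexity <= 2:
        return "claude-haiku-4-5-20251001"
    elif complexity <= 4:
        return "claude-sonnet-4-6"
    return "claude-opus-4-6"

def blended_input_cost(requests: list[tuple[str, int, int]]) -> float:
    """requests: (task_type, complexity, input_tokens) triples."""
    total = 0.0
    for task_type, complexity, tokens in requests:
        model = select_model(task_type, complexity)
        total += tokens * INPUT_PRICE[model] / 1_000_000
    return total
```

Routing a classification request to Haiku instead of Opus cuts its input cost by roughly 19x, so even a modest share of simple traffic moves the blended number noticeably.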
4. Handling Rate Limits
```python
import asyncio

from anthropic import AsyncAnthropic, APIConnectionError, RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

client = AsyncAnthropic()

@retry(
    retry=retry_if_exception_type((RateLimitError, APIConnectionError)),
    wait=wait_exponential(multiplier=1, min=2, max=60),
    stop=stop_after_attempt(5),
)
async def call_llm_with_retry(messages: list) -> str:
    response = await client.messages.create(...)
    return response.content[0].text

# Limit concurrent requests with a semaphore
semaphore = asyncio.Semaphore(10)  # Max 10 concurrent requests

async def safe_llm_call(messages):
    async with semaphore:
        return await call_llm_with_retry(messages)
```
5. Cost Monitoring
```python
# Track token usage, cost, and latency per request
import time

async def tracked_llm_call(user_id: str, messages: list):
    start = time.time()
    response = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=messages,
    )
    # Calculate cost from the usage block the API returns
    input_cost = response.usage.input_tokens * 3 / 1_000_000
    output_cost = response.usage.output_tokens * 15 / 1_000_000
    # Record metrics
    metrics.record({
        "user_id": user_id,
        "model": "claude-sonnet-4-6",
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "cost_usd": input_cost + output_cost,
        "latency_ms": (time.time() - start) * 1000,
    })
    return response
```
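The recorded `cost_usd` values feed naturally into the daily budget alert from the checklist below. A minimal sketch, assuming the 80%-of-budget alert threshold used in the metrics table:

```python
# Daily budget check: aggregate per-request costs (like the cost_usd metric
# recorded above) and flag spend against the daily budget.
def budget_status(costs_usd: list[float], daily_budget_usd: float) -> str:
    spent = sum(costs_usd)
    if spent >= daily_budget_usd:
        return "over_budget"
    if spent >= 0.8 * daily_budget_usd:
        return "alert"  # crossed 80% of budget -- page someone
    return "ok"
```

In practice this runs on a schedule against the metrics store; the three return values map onto dashboard states or alert severities.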
6. Fallback Strategy
```python
MODELS_BY_PRIORITY = [
    "claude-sonnet-4-6",          # Primary
    "claude-haiku-4-5-20251001",  # Fallback
]

async def call_with_fallback(messages: list) -> str:
    for model in MODELS_BY_PRIORITY:
        try:
            response = await client.messages.create(
                model=model, messages=messages, max_tokens=1024
            )
            return response.content[0].text
        except RateLimitError:
            continue  # Try the next model
    raise RuntimeError("All models rate limited")
```
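The fallback control flow is worth exercising in tests before an incident forces it. This self-contained sketch substitutes a stubbed client and exception class so the loop can run offline; all the stub names are hypothetical.

```python
# Offline sketch of the fallback loop with a stubbed client.
class RateLimitError(Exception):  # stand-in for anthropic.RateLimitError
    pass

MODELS_BY_PRIORITY = ["claude-sonnet-4-6", "claude-haiku-4-5-20251001"]

def fake_create(model: str) -> str:
    if model == "claude-sonnet-4-6":
        raise RateLimitError()  # Simulate the primary being throttled
    return f"answered by {model}"

def call_with_fallback() -> str:
    for model in MODELS_BY_PRIORITY:
        try:
            return fake_create(model)
        except RateLimitError:
            continue  # Try the next model
    raise RuntimeError("All models rate limited")

print(call_with_fallback())  # answered by claude-haiku-4-5-20251001
```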
7. Monitoring Dashboard — Key Metrics
| Metric | Healthy | Alert Threshold |
|---|---|---|
| Average latency | < 3s | > 10s |
| P99 latency | < 15s | > 30s |
| Error rate | < 1% | > 5% |
| Rate limit rate | < 0.1% | > 1% |
| Daily cost | Within budget | > 80% of budget |
| Avg tokens/request | Within baseline | > 2x baseline |
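The P99 threshold in the table assumes you compute percentiles from raw latency samples rather than averages. A minimal nearest-rank sketch:

```python
# Nearest-rank percentile over raw samples, e.g. percentile(latencies_ms, 99)
# compared against the 30s alert threshold above.
def percentile(samples: list[float], pct: float) -> float:
    ordered = sorted(samples)
    # Nearest-rank: ceil(pct/100 * n), clamped to at least rank 1
    rank = max(1, -(-len(ordered) * pct // 100))
    return ordered[int(rank) - 1]
```

Most metrics backends (Prometheus, Datadog, CloudWatch) compute this for you; the sketch only shows what the number means.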
Operations Checklist
- Streaming responses (UX)
- Prompt caching applied (cost)
- Retry + exponential backoff (stability)
- Concurrent request limit (rate limit defense)
- Per-model cost tracking
- Daily cost alert configured
- Fallback model defined
- Timeout set (30s)
- Prompt injection defense
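The 30-second timeout item from the checklist can be sketched with `asyncio.wait_for`; the stub coroutine and timing here are illustrative, not a real API call.

```python
# Cap any single LLM call's wall-clock time with asyncio.wait_for.
import asyncio

async def slow_call():
    await asyncio.sleep(0.05)  # Stand-in for a slow LLM request
    return "done"

async def call_with_timeout(timeout_s: float = 30.0) -> str:
    try:
        return await asyncio.wait_for(slow_call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Cancelled cleanly; surface a fallback answer or an error to the user
        return "timed out"

print(asyncio.run(call_with_timeout(0.01)))  # timed out
```

`wait_for` cancels the inner task on expiry, so the request doesn't keep burning tokens in the background after the user has given up.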