TestForge Blog

Operating LLM Services in Production — Stability Guide for AI Applications

How to reliably operate LLM-based services in production. Covers cost management, latency optimization, incident response, and monitoring — all from real-world experience.

TestForge Team

Unique Challenges of LLM Services

LLM-backed services differ from conventional APIs in several ways:

  • Latency: Seconds to tens of seconds (users drop off without streaming)
  • Cost: Traffic × token count = unpredictable bill spikes
  • Non-determinism: Same input can produce different output
  • Rate limits: Provider-imposed per-minute request/token limits

1. Streaming for Better UX

The most effective way to reduce perceived wait time.

# FastAPI SSE streaming
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # Async client so streaming doesn't block the event loop
app = FastAPI()

@app.post("/chat/stream")
async def chat_stream(message: str):
    async def generate():
        async with client.messages.stream(
            model="claude-haiku-4-5-20251001",
            max_tokens=1024,
            messages=[{"role": "user", "content": message}]
        ) as stream:
            async for text in stream.text_stream:
                yield f"data: {text}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")
// Frontend: read the SSE stream and parse "data:" frames
const response = await fetch('/chat/stream', { method: 'POST', body: ... });
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  for (const line of decoder.decode(value, { stream: true }).split('\n')) {
    if (!line.startsWith('data: ')) continue;
    const text = line.slice(6);
    if (text !== '[DONE]') appendToUI(text);
  }
}

2. Prompt Caching to Cut Costs

Avoid paying to resend the same large system prompt repeatedly.

# Anthropic Prompt Caching
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,  # Thousands of tokens of fixed prompt
            "cache_control": {"type": "ephemeral"}  # 5-minute cache
        }
    ],
    messages=[{"role": "user", "content": user_message}]
)
# Cache hit = 90% reduction in input cost
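Whether the cache actually hit is visible in the response's usage block (`cache_read_input_tokens` / `cache_creation_input_tokens`). A sketch of the resulting cost arithmetic, using Anthropic's published multipliers for ephemeral caching (cache writes bill at 1.25x the base input price, cache reads at 0.1x) and assuming Sonnet's $3 per 1M-token input price:

```python
def effective_input_cost(usage: dict, base_price_per_mtok: float = 3.0) -> float:
    """Input cost in USD for one request, accounting for cache writes/reads."""
    uncached = usage.get("input_tokens", 0)
    cache_write = usage.get("cache_creation_input_tokens", 0)  # billed at 1.25x
    cache_read = usage.get("cache_read_input_tokens", 0)       # billed at 0.1x
    return (uncached + cache_write * 1.25 + cache_read * 0.10) \
        * base_price_per_mtok / 1_000_000

# Warm cache: 4,000 prompt tokens read from cache plus 50 new tokens,
# versus the same request with no cache at all.
warm = {"input_tokens": 50, "cache_read_input_tokens": 4000}
cold = {"input_tokens": 4050}
print(effective_input_cost(warm), effective_input_cost(cold))
```

On a prompt dominated by the cached prefix, the warm-cache cost lands near the advertised ~90% input saving.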

3. Model Routing Strategy

Not every request needs the most expensive model.

def select_model(task_type: str, complexity: int) -> str:
    """
    complexity: 1 (simple) ~ 5 (complex)
    """
    if task_type == "classification" or complexity <= 2:
        return "claude-haiku-4-5-20251001"   # Fast and cheap
    elif complexity <= 4:
        return "claude-sonnet-4-6"            # Balanced
    else:
        return "claude-opus-4-6"              # Complex reasoning

# Cost comparison (Input/Output per 1M tokens)
# Haiku:  $0.80  / $4
# Sonnet: $3     / $15
# Opus:   $15    / $75
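To see why routing matters, a back-of-the-envelope blended cost using the prices above. The traffic mix here is a hypothetical assumption for illustration (most requests simple, a minority needing the larger models):

```python
PRICES = {  # (input, output) USD per 1M tokens, from the table above
    "claude-haiku-4-5-20251001": (0.80, 4.0),
    "claude-sonnet-4-6": (3.0, 15.0),
    "claude-opus-4-6": (15.0, 75.0),
}

def cost_per_request(model: str, in_tok: int, out_tok: int) -> float:
    inp, outp = PRICES[model]
    return (in_tok * inp + out_tok * outp) / 1_000_000

def blended_cost(mix: dict, in_tok: int = 1000, out_tok: int = 500) -> float:
    """Average cost per request for a traffic mix {model: share-of-traffic}."""
    return sum(share * cost_per_request(m, in_tok, out_tok)
               for m, share in mix.items())

# Hypothetical mix: 70% Haiku, 25% Sonnet, 5% Opus vs. everything on Sonnet
routed = {"claude-haiku-4-5-20251001": 0.70, "claude-sonnet-4-6": 0.25,
          "claude-opus-4-6": 0.05}
all_sonnet = {"claude-sonnet-4-6": 1.0}
print(f"routed {blended_cost(routed):.5f} vs all-Sonnet {blended_cost(all_sonnet):.5f}")
```

Under this mix, routing costs roughly a third less per request than sending everything to Sonnet, even with 5% of traffic on Opus.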

4. Handling Rate Limits

import asyncio
from anthropic import APIConnectionError, AsyncAnthropic, RateLimitError
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

client = AsyncAnthropic()

@retry(
    retry=retry_if_exception_type((RateLimitError, APIConnectionError)),
    wait=wait_exponential(multiplier=1, min=2, max=60),
    stop=stop_after_attempt(5),  # Give up after 5 attempts
)
async def call_llm_with_retry(messages: list) -> str:
    response = await client.messages.create(...)
    return response.content[0].text

# Limit concurrent requests with a semaphore
semaphore = asyncio.Semaphore(10)  # Max 10 concurrent requests

async def safe_llm_call(messages):
    async with semaphore:
        return await call_llm_with_retry(messages)
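The semaphore caps concurrency but not tokens per minute, which is what providers actually meter. A client-side token budget can be layered on top; a minimal sketch, where the per-minute limit and the `budgeted_llm_call` wrapper are illustrative, not provider values:

```python
import asyncio
import time

class TokenRateLimiter:
    """Naive sliding-window budget for tokens per minute."""

    def __init__(self, tokens_per_minute: int):
        self.budget = tokens_per_minute
        self.window = []  # (timestamp, tokens) pairs within the last 60s

    def _used(self, now: float) -> int:
        # Drop entries older than 60s, then sum what remains.
        self.window = [(t, n) for t, n in self.window if now - t < 60]
        return sum(n for _, n in self.window)

    async def acquire(self, tokens: int) -> None:
        # Block until the requested tokens fit in the current window.
        while True:
            now = time.monotonic()
            if self._used(now) + tokens <= self.budget:
                self.window.append((now, tokens))
                return
            await asyncio.sleep(1)

limiter = TokenRateLimiter(tokens_per_minute=80_000)  # Example limit

async def budgeted_llm_call(messages, estimated_tokens: int):
    await limiter.acquire(estimated_tokens)
    return await safe_llm_call(messages)  # safe_llm_call from the snippet above
```

Estimating tokens before the call (e.g., from prompt length) keeps the client under the limit instead of bouncing off 429s and burning retries.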

5. Cost Monitoring

# Track token usage per request
import time

async def tracked_llm_call(user_id: str, messages: list):
    start = time.time()
    response = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=messages,
    )
    
    # Calculate cost
    input_cost = response.usage.input_tokens * 3 / 1_000_000
    output_cost = response.usage.output_tokens * 15 / 1_000_000
    
    # Record metrics (assumes a `metrics` client from your observability stack)
    metrics.record({
        "user_id": user_id,
        "model": "claude-sonnet-4-6",
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "cost_usd": input_cost + output_cost,
        "latency_ms": (time.time() - start) * 1000,
    })
    
    return response
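The per-request costs recorded above can feed a daily budget alarm. A sketch with an in-memory accumulator; in production the aggregation would live in your metrics store, and the `alert(...)` hook is hypothetical:

```python
from datetime import date, datetime, timezone

class DailyBudget:
    """Accumulates spend per UTC day; fires once when a threshold is crossed."""

    def __init__(self, daily_budget_usd: float, warn_ratio: float = 0.8):
        self.budget = daily_budget_usd
        self.warn_ratio = warn_ratio  # Alert at 80% of budget by default
        self.day = None
        self.spent = 0.0

    def add(self, cost_usd, today=None) -> bool:
        """Record one request's cost; True exactly when the threshold is crossed."""
        today = today or datetime.now(timezone.utc).date()
        if today != self.day:  # New day: reset the accumulator
            self.day, self.spent = today, 0.0
        before = self.spent
        self.spent += cost_usd
        threshold = self.budget * self.warn_ratio
        return before < threshold <= self.spent

budget = DailyBudget(daily_budget_usd=50.0)  # Example budget; set your own
# In tracked_llm_call: if budget.add(input_cost + output_cost): alert(...)
```

Returning True only on the crossing avoids paging once per request after the threshold is passed.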

6. Fallback Strategy

MODELS_BY_PRIORITY = [
    "claude-sonnet-4-6",         # Primary
    "claude-haiku-4-5-20251001", # Fallback
]

async def call_with_fallback(messages: list) -> str:
    for model in MODELS_BY_PRIORITY:
        try:
            response = await client.messages.create(
                model=model, messages=messages, max_tokens=1024
            )
            return response.content[0].text
        except RateLimitError:
            continue  # Try next model
    raise RuntimeError("All models rate limited")
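The 30-second timeout from the checklist can wrap the fallback call directly with `asyncio.wait_for`; a minimal sketch:

```python
import asyncio

async def call_with_timeout(coro, timeout_s: float = 30.0):
    """Cancel an LLM call that exceeds the deadline instead of letting it hang."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        raise RuntimeError(f"LLM call exceeded {timeout_s}s")

# Usage: result = await call_with_timeout(call_with_fallback(messages))
```

`wait_for` cancels the underlying task on timeout, so a hung provider connection does not pin a worker indefinitely.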

7. Monitoring Dashboard — Key Metrics

Metric             | Healthy         | Alert Threshold
-------------------|-----------------|-----------------
Average latency    | < 3s            | > 10s
P99 latency        | < 15s           | > 30s
Error rate         | < 1%            | > 5%
Rate limit rate    | < 0.1%          | > 1%
Daily cost         | Within budget   | > 80% of budget
Avg tokens/request | Within baseline | > 2x baseline
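As a sketch, the alert thresholds above can be encoded as a simple rule check that a dashboard or cron job evaluates against the latest snapshot (metric names here are illustrative):

```python
# Alert thresholds from the table above. Each rule: metric name -> limit.
ALERT_RULES = {
    "avg_latency_s": 10.0,
    "p99_latency_s": 30.0,
    "error_rate": 0.05,
    "rate_limit_rate": 0.01,
}

def breached(metrics: dict) -> list:
    """Names of metrics exceeding their alert threshold (missing = healthy)."""
    return [name for name, limit in ALERT_RULES.items()
            if metrics.get(name, 0.0) > limit]

snapshot = {"avg_latency_s": 2.4, "p99_latency_s": 41.0, "error_rate": 0.002}
print(breached(snapshot))  # ['p99_latency_s']
```

The budget and baseline rows need stateful comparisons (yesterday's spend, a rolling token baseline), so they are left out of this static rule set.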

Operations Checklist

  • Streaming responses (UX)
  • Prompt caching applied (cost)
  • Retry + exponential backoff (stability)
  • Concurrent request limit (rate limit defense)
  • Per-model cost tracking
  • Daily cost alert configured
  • Fallback model defined
  • Timeout set (30s)
  • Prompt injection defense