Office Hours: What common issues are you hitting with LLM gateways and API management in production?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
2026-05-29T12:00:00.000Z
Tags: office-hours, q-and-a, practical-ai

Daily One question from the trenches, one opinionated answer.

What common issues are you hitting with LLM gateways and API management in production?

The gap between “LLM API works in staging” and “LLM API works at scale” is where most teams discover they built the wrong thing. Gateway and API management issues aren’t flashy, but they’re what actually breaks production systems once you stop experimenting.

Rate Limiting Cascades and Retry Storms

The most dangerous pattern I see: teams treat LLM APIs like synchronous database calls, so when Claude Opus 4.7 or GPT-5.5 hits rate limits, the entire request queue backs up. One slow endpoint becomes a denial-of-service against your own infrastructure.

The naive fix is exponential backoff, but that’s insufficient. What actually works is decoupling request submission from response polling. Queue requests to a broker, let the LLM API respond at its own pace, and have workers consume responses asynchronously.

# Bad: Synchronous blocking
response = client.messages.create(
    model="claude-opus-4.7",
    max_tokens=1024,  # required by the Anthropic Messages API
    messages=messages
)  # Blocks if rate-limited

# Better: Hand the call to a broker
task_id = queue.enqueue(
    "call_claude",
    model="claude-opus-4.7",
    messages=messages,
    timeout=300  # worker-side limit for the LLM call itself
)
# enqueue() returns immediately; a worker makes the call when capacity exists
# Wait a bounded time for the result, or return task_id and let the caller poll
response = wait_for_task(task_id, timeout=30)

Even with this pattern, you need per-provider rate limit budgets and request coalescing. If ten different services all need the same inference, make the call once and share the result rather than hammering the API ten times.
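Request coalescing can be sketched in a few lines of asyncio. This is a minimal illustration, not a production implementation: `Coalescer` and `request_key` are hypothetical names, the key is simply a hash of the model and messages, and the upstream call is passed in as a coroutine factory so no particular provider SDK is assumed.

```python
import asyncio
import hashlib
import json
from typing import Awaitable, Callable, Dict, List


def request_key(model: str, messages: List[dict]) -> str:
    """Stable key for 'the same inference': hash of model plus messages."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class Coalescer:
    """Share one in-flight upstream call among all callers with the same key."""

    def __init__(self) -> None:
        self._in_flight: Dict[str, "asyncio.Future[str]"] = {}

    async def run(self, key: str, call: Callable[[], Awaitable[str]]) -> str:
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key: start the real upstream call.
            task = asyncio.ensure_future(call())
            self._in_flight[key] = task
            # Forget the entry once the call finishes, whether it succeeded or failed.
            task.add_done_callback(lambda _done: self._in_flight.pop(key, None))
        # shield() keeps one cancelled waiter from cancelling the shared call.
        return await asyncio.shield(task)
```

Ten concurrent `run()` calls with the same key produce one upstream request. Only identical requests are merged; anything that differs in model or messages gets its own call, and completed results are not cached.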

Model Drift and Versioning Confusion

You deploy against an alias that resolves to Claude Sonnet 4.6. Then Anthropic ships a newer model, the alias moves, and suddenly your gateway is routing requests to a different model without your knowledge. The response format doesn’t change, but the behavior does. The evals you ran at deploy time still look green, but your actual output quality shifts because the underlying model changed.

The fix is explicit model pinning. Don’t use latest-available aliases in production. Specify exact model versions and have a planned upgrade path, not an automatic one.

# Bad
model = "claude-opus-latest"  # Changes on you

# Good
model = "claude-opus-4.7"  # Explicit, testable

# Your upgrade cadence
APPROVED_MODELS = {
    "prod": "claude-opus-4.7",
    "canary": "claude-opus-5.0-preview",
    "staging": "*"  # Test new models freely
}

Pair this with lightweight evals running continuously. When you upgrade models, run your eval suite against both old and new versions. If the new model regresses on your task, you catch it before it hits production. This is non-negotiable for anyone doing real work.
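A rough sketch of that old-versus-new comparison follows. Everything here is an assumption for illustration: the eval cases are (prompt, expected substring) pairs, `generate` stands in for whatever function calls a pinned model, and the 2% regression tolerance is arbitrary.

```python
from statistics import mean
from typing import Callable, List, Tuple

# Hypothetical case format: (prompt, substring the answer must contain).
EvalCase = Tuple[str, str]


def score_model(generate: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Fraction of cases whose output contains the expected substring."""
    return mean(
        1.0 if expected.lower() in generate(prompt).lower() else 0.0
        for prompt, expected in cases
    )


def approve_upgrade(old_score: float, new_score: float,
                    max_regression: float = 0.02) -> bool:
    """Allow the upgrade only if the new model is not meaningfully worse."""
    return new_score >= old_score - max_regression
```

Substring matching is the crudest possible grader; the point is the gate, not the metric. Swap in whatever scoring your task needs and keep the rule that the candidate model must clear the pinned model’s score before `APPROVED_MODELS["prod"]` changes.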

Cost Isolation and Runaway Spending

Teams implement cost limits per request or per hour, but they rarely implement cost limits per customer or per feature. One high-volume customer making inefficient requests can silently drain your entire monthly budget on a single gateway instance.

Multi-tenancy in LLM gateways needs compartmentalized cost tracking. You need to know not just total spend, but spend per tenant, per endpoint, per model, and per time window. Not every open-source setup (LiteLLM, or vLLM behind a control plane, for example) gives you all of these dimensions out of the box, so verify what yours actually tracks before you rely on it.

# Track granularly
cost_tracker.record(
    tenant_id="customer_acme",
    endpoint="/summarize",
    model="gpt-5.5",
    input_tokens=1200,
    output_tokens=400,
    cost_cents=15
)

# Enforce hard limits
if tenant_spend_today > tenant.monthly_budget / 30:
    return 429, "Daily limit exceeded"

Without this, your “cost limit” becomes cargo cult protection. You’ll discover you’re overspending when the bill arrives, not when the runaway request happens.
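For illustration, here is one way the tracker behind a call like `cost_tracker.record(...)` might be structured. It is an in-memory sketch under stated assumptions: the `Usage` and `CostTracker` names are hypothetical, costs are integer cents, and a real gateway would persist these counters in a shared store rather than a dict.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import DefaultDict, Tuple


@dataclass(frozen=True)
class Usage:
    tenant_id: str
    endpoint: str
    model: str
    cost_cents: int
    day: date


class CostTracker:
    """Accumulates spend keyed by (tenant, endpoint, model, day)."""

    def __init__(self) -> None:
        self._by_key: DefaultDict[Tuple[str, str, str, date], int] = defaultdict(int)

    def record(self, usage: Usage) -> None:
        key = (usage.tenant_id, usage.endpoint, usage.model, usage.day)
        self._by_key[key] += usage.cost_cents

    def tenant_spend(self, tenant_id: str, day: date) -> int:
        """Total cents a tenant spent on one day, across endpoints and models."""
        return sum(
            cents
            for (tenant, _endpoint, _model, d), cents in self._by_key.items()
            if tenant == tenant_id and d == day
        )

    def over_daily_budget(self, tenant_id: str, day: date,
                          monthly_budget_cents: int) -> bool:
        # Same rule as above (daily spend > monthly budget / 30), in integer math.
        return self.tenant_spend(tenant_id, day) * 30 > monthly_budget_cents
```

Because the key keeps every dimension, the same counters answer per-endpoint and per-model questions without a second bookkeeping path.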

Hallucinated Errors and Silent Failures

LLM APIs occasionally return responses that look successful but are actually malformed. A response parses as JSON but the content field is empty. The API call succeeded (HTTP 200), but the model returned nothing. Your retry logic doesn’t catch it because there’s no error status. The request is simply lost.

You need application-level validation that goes beyond HTTP status codes. Validate the response shape, validate that required fields exist, and validate that the content is non-empty. If validation fails, treat it as a retryable error.

response = client.messages.create(...)

# Don't trust HTTP 200. content is a list of blocks, so check the text itself
text = "".join(block.text for block in response.content if block.type == "text")
if not text.strip():
    raise RetryableError("Empty response from API")

if response.stop_reason == "max_tokens":
    logger.warning("Truncated response: %s", response.usage)
    # Decide: retry with a larger max_tokens? Accept truncation?

This is especially critical with frontier models doing complex reasoning. GPT-5.4 and Claude Opus 4.7 are generally reliable, but edge cases do exist. Your gateway should catch them.
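To make “treat validation failure as a retryable error” concrete, here is a small sketch. The names (`RetryableError`, `validate_text`, `call_with_validation`) are hypothetical, the upstream call is passed in as a plain callable that returns the extracted text, and the backoff parameters are placeholders to tune.

```python
import random
import time
from typing import Callable, Optional


class RetryableError(Exception):
    """A response that looked successful but failed validation."""


def validate_text(text: Optional[str]) -> str:
    """Reject missing or whitespace-only content even if the HTTP call succeeded."""
    if text is None or not text.strip():
        raise RetryableError("Empty response from API")
    return text


def call_with_validation(call: Callable[[], Optional[str]],
                         max_attempts: int = 3,
                         base_delay: float = 0.5,
                         sleep: Callable[[float], None] = time.sleep) -> str:
    """Run call(), validate the result, and retry validation failures with backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return validate_text(call())
        except RetryableError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the failure to the caller.
            # Exponential backoff with full jitter to avoid synchronized retries.
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

The `sleep` parameter exists so tests can inject a stub. Cap `max_attempts` low: retries on validation failures still consume tokens and rate-limit budget.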

Token Limit Thrashing

Every LLM API has a context window limit. When you hit it, the request fails. Teams then reduce the context size and retry, which degrades quality. You’re now in a loop where larger contexts fail and smaller contexts return worse results.

Pre-compute your token budgets. Know how many tokens your prompt template uses, how many tokens the input will be, and what’s left for the response. If the total exceeds the model’s context window, queue it for a larger-context model or reject the request upfront instead of failing in production.

prompt_tokens = estimate_tokens(system_prompt + user_input)
max_completion_tokens = 2000  # Expected output size

total_needed = prompt_tokens + max_completion_tokens
context_limit = model_limits[model_name]

if total_needed > context_limit:
    if total_needed <= model_limits["claude-opus-4.7"]:
        model_name = "claude-opus-4.7"  # Upgrade silently
    else:
        raise ValueError("Request too large even for the largest model")

The Cost of Multi-Tenancy Without Isolation

Running a shared gateway that routes requests from different customers to the same LLM API creates a hidden coupling. If one customer’s prompt engineering is bad and exhausts quota, all customers suffer. If one customer’s request is slow, it blocks others in the queue.

Implement per-customer rate limits and quotas at the gateway level, not just at the LLM API level. Give each customer a token bucket or request queue. If a customer exceeds their quota, they get throttled, not everyone.

This also prevents information leakage. A customer shouldn’t be able to infer what other customers are doing based on latency patterns. Isolation is both a fairness and security requirement.
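The per-customer token bucket mentioned above can be sketched as follows. This is a single-process illustration with hypothetical class names; a gateway running on several instances would need the bucket state in a shared store, and the rate and capacity values are whatever your quotas dictate.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Classic token bucket: refills at rate_per_sec, holds at most capacity."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantLimiter:
    """One bucket per tenant, so a noisy tenant only throttles itself."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_sec
        self._capacity = capacity
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, tenant_id: str, cost: float = 1.0) -> bool:
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, self._clock)
            self._buckets[tenant_id] = bucket
        return bucket.allow(cost)
```

Passing an estimated token count as `cost` turns the same structure into a tokens-per-minute quota instead of a requests-per-second one.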

Monitoring Gaps

Most teams monitor API latency and error rates, but they don’t monitor the things that actually matter: token efficiency, cost per transaction, hallucination rates, and whether responses actually solve the user’s problem.

Latency and error rates are infrastructure metrics. You need application metrics. For every request, log the input tokens, output tokens, cost, and whether the response was actually useful. This requires application-level instrumentation, not just gateway access logs.
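As a minimal sketch of that per-request instrumentation, the record below is emitted as one JSON line per request. The field names and the `llm_gateway.metrics` logger name are assumptions; `useful` stays unset until user feedback or an automated check fills it in.

```python
import json
import logging
from dataclasses import asdict, dataclass
from typing import Optional

logger = logging.getLogger("llm_gateway.metrics")


@dataclass
class RequestMetrics:
    tenant_id: str
    endpoint: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_cents: float
    latency_ms: float
    useful: Optional[bool] = None  # None until feedback or a check sets it


def emit(metrics: RequestMetrics) -> str:
    """Serialize one request's metrics as a JSON line and log it."""
    line = json.dumps(asdict(metrics), sort_keys=True)
    logger.info(line)
    return line
```

One structured line per request is enough to compute cost per transaction and token efficiency later, without committing to a metrics backend up front.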

Bottom line: Start with explicit model pinning, granular cost tracking per tenant, and application-level response validation. These three things prevent most of the disasters I see in production LLM systems. Everything else is optimization.

Question via Hacker News