Office Hours — How are you using LLMs in production and what unexpected issues have you hit?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
How are you using LLMs in production and what unexpected issues have you hit?
We’re running Claude Opus 4.7 and GPT-5.4 in parallel for three systems: document classification for compliance workflows, multi-step customer support ticket routing, and code review assistance for junior engineers. The parallelism was supposed to be temporary—benchmark them, pick a winner, move on. We’re still running both because they fail in completely different ways, and neither failure mode is what the models’ benchmark scores suggested.
The latency trap nobody talks about
We optimized for cost and picked Claude Opus 4.7 for the first 90 days because token prices looked better in spreadsheets. But “better pricing” doesn’t account for the operational reality of latency degradation under load. When we hit about 40 concurrent requests during business hours, Claude’s API started consistently returning overloaded errors while GPT-5.4 stayed responsive. We didn’t have good observability around this initially, so support tickets would spike seemingly at random.
The fix wasn’t free. We added request queuing and retry logic with exponential backoff, plus we switched to a hybrid strategy: Claude for async batch jobs that can tolerate 10-15 second latency, GPT-5.4 for anything customer-facing with a hard SLA. This doubled our API spend because we’re calling both for some workloads just to meet SLAs, but the alternative was dropped requests degrading the user experience.
# What we had to add:
import asyncio

# call_claude, call_gpt, get_cached_response, and OverloadedError are our own
# thin wrappers around the two vendor SDKs, not names from either SDK.
async def call_llm_with_fallback(prompt):
    try:
        # Primary: Claude, with the longer timeout our async work tolerates
        return await call_claude(prompt, timeout=8)
    except (OverloadedError, asyncio.TimeoutError):
        # Claude overloaded or slow: fall back to GPT within the customer-facing SLA
        pass
    try:
        return await call_gpt(prompt, timeout=4)
    except (OverloadedError, asyncio.TimeoutError):
        # Both slow, return cached answer or degrade gracefully
        return get_cached_response(prompt)
This pattern works, but it’s not in any tutorial. You discover it when your dashboards show p95 latency spiking at 3pm every day.
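The retry-and-backoff half of the fix looks roughly like the sketch below. It reuses the same assumed wrappers (call_claude, OverloadedError, call_llm_with_fallback) as above, and the retry count and delay constants are illustrative, not tuned values.

import asyncio
import random

async def call_with_backoff(prompt, max_retries=4):
    # Retry on overload with exponential backoff plus jitter, so queued
    # requests don't all retry in lockstep and re-overload the provider.
    for attempt in range(max_retries):
        try:
            return await call_claude(prompt, timeout=8)
        except OverloadedError:
            await asyncio.sleep((2 ** attempt) + random.uniform(0, 1))
    # Out of retries: hand off to the fallback path shown above
    return await call_llm_with_fallback(prompt)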
Structural hallucinations in compliance workflows are domain-specific
Our compliance classification task is objectively simple: read a document, classify it as one of seven regulatory categories (GDPR, SOX, HIPAA, etc.), and return the category plus a confidence score. Both models score well on benchmark evals. In production, they perform completely differently.
Claude Opus 4.7 exhibits a pattern where it confidently assigns documents to the wrong category, but it is consistently wrong in the same way for a given domain. If you feed it vendor contracts, it will frequently classify them as SOX-compliance documents. It’s not random hallucination. It’s systematic confusion tied to document structure. We caught this because our human reviewers flagged a 6-week stretch where 40% of vendor contracts were misclassified the same way.
GPT-5.4 has the opposite failure mode: it’s more conservative and will sometimes refuse to classify documents at all, returning an “insufficient information” response even when the document clearly belongs in a category. This is safer for compliance (false negatives are usually cheaper than false positives), but it kills throughput because those documents still need manual review.
We solved this by building a simple router: if the document has a contract-like structure, route it to GPT-5.4; for regulatory filings and policy docs, use Claude. This is a hack, but it reduced misclassifications by 85% without upgrading either model.
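The router itself is small. A minimal sketch of the idea, where is_contract_like and the classify_with_* calls are stand-in names for our structural check and the two model wrappers:

def route_classification(document: str) -> dict:
    # Route by document structure, not by benchmark score: contract-like
    # documents go to GPT-5.4 (conservative, fewer confident misclassifications),
    # regulatory filings and policy docs go to Claude.
    if is_contract_like(document):
        return classify_with_gpt(document)
    return classify_with_claude(document)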
Token window blindness
We have long documents (some >100K tokens when tokenized) that need analysis. Claude Opus 4.7 has a larger context window, so we assumed it would handle these better. It doesn’t. What we discovered is that both models perform visibly worse on documents in the 80-100K token range compared to documents under 40K. There’s no cliff edge, but there’s a steady degradation.
For the code review system, reviewers complained that the LLM comments on the first 10% of a large PR were sharp and correct, but comments on sections 70-90% into the PR were shallow or sometimes irrelevant. We suspected attention degradation in the context window, and we were right.
The workaround: chunk documents into 30K token sections, run analysis on each chunk in parallel, then synthesize results. This added latency and complexity, but accuracy on large documents jumped back up to baseline.
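A stripped-down version of that pipeline, where chunk_by_tokens, analyze_chunk, and synthesize are our own helper names:

import asyncio

CHUNK_TOKENS = 30_000  # the chunk size where per-chunk accuracy stayed at baseline for us

async def analyze_large_document(document: str) -> dict:
    # Split into ~30K-token chunks, analyze every chunk in parallel,
    # then run one final synthesis pass over the per-chunk results.
    chunks = chunk_by_tokens(document, max_tokens=CHUNK_TOKENS)
    chunk_results = await asyncio.gather(*(analyze_chunk(c) for c in chunks))
    return await synthesize(chunk_results)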
Inconsistent behavior with identical prompts and locked temperature
This one still bugs us. We have a customer support routing system that takes a ticket and decides: escalate to human, handle with FAQ bot, or route to specific department. We lock temperature to 0 and use the exact same prompt. Same model version. Same API. Yet we get different routing decisions on identical tickets when we test them hours apart, or sometimes on the same batch.
After weeks of debugging, we found the culprit: model version rollouts. OpenAI and Anthropic quietly roll out minor versions of models (GPT-5.4.1, GPT-5.4.2, etc.) and don’t always announce them clearly. Your API calls automatically hit the latest version. This means your “locked” behavior is actually shifting underneath you.
We now explicitly pin model versions: gpt-5.4-2025-03-15 instead of gpt-5.4-latest. This trades ease of upgrades for predictability. When a new version drops, we test it in staging for a week before rolling it to prod. This adds friction, but it eliminates the “why is behavior changing” mystery.
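The pins live in one place in config rather than scattered across call sites. A minimal sketch; the GPT snapshot string is the one mentioned above, while the Claude snapshot string is a hypothetical placeholder format:

# Pinned model snapshots, never "-latest" aliases.
MODEL_PINS = {
    "ticket_routing": "gpt-5.4-2025-03-15",  # pinned snapshot, staged for a week first
    "batch_classification": "claude-opus-4.7-2025-03-10",  # hypothetical snapshot id
}

def model_for(task: str) -> str:
    # Fail loudly if a task has no pin, so nothing silently drifts to "latest".
    return MODEL_PINS[task]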
When structured output isn’t
Both models support structured output formats (JSON schema enforcement), and we use them heavily. The assumption is that if you enforce a schema, the output is valid JSON and conforms to your spec. Both things are true maybe 97% of the time. The other 3% is a nightmare.
Sometimes the model returns valid JSON that technically conforms to your schema but has semantically invalid values. Example: a status field that should be one of ["open", "closed", "pending"] comes back as "open", but mixed casing in other fields makes it clear the model was confused about the expected format. Or a required field is present but null, which is technically schema-valid but breaks your downstream logic.
We added a validation layer that doesn’t just check schema conformance but also checks semantic reasonableness. If a confidence score is outside [0, 1], if a category doesn’t exist in our taxonomy, or if multiple mutually exclusive fields are both populated, we flag it and regenerate. This isn’t foolproof, but it catches ~80% of the junk outputs before they hit databases.
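A trimmed-down version of that validation layer, using the compliance payload as the example; the field names and the category subset here are illustrative:

VALID_CATEGORIES = {"GDPR", "SOX", "HIPAA"}  # subset of our seven-category taxonomy

def is_semantically_valid(payload: dict) -> bool:
    # Schema conformance isn't enough: enforce value-level invariants too.
    confidence = payload.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        return False
    if payload.get("category") not in VALID_CATEGORIES:
        return False
    # Mutually exclusive flags must not both be set
    if payload.get("needs_human_review") and payload.get("auto_approved"):
        return False
    return True

# Callers regenerate (or escalate to manual review) when this returns False,
# instead of writing the payload straight to the database.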
Bottom line:
Run your LLMs in production with explicit version pinning, add fallback routing between models for customer-facing SLAs, and assume that neither latency nor accuracy will match your lab benchmarks. Build monitoring around the failure modes specific to your domain, not generic hallucination metrics or BLEU scores. The models work, but they require operational scaffolding that nobody tells you about until your system is already live.
Question via Hacker News