Office Hours — Has anyone deployed LLMs to production and what were the biggest operational challenges?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
Has anyone deployed LLMs to production and what were the biggest operational challenges?
Yeah, plenty of teams are live with LLMs now. The operational challenges aren’t really about whether models work—they do. It’s everything around them.
Cost tracking and optimization
Cost is real. Frontier models like GPT-5.4 and Claude Opus 4.7 add up fast at scale. Token counting matters because, as we saw this week, Anthropic’s tokenizer shift on Opus 4.7 quietly inflated costs by up to 47% even though per-token pricing stayed the same. You need actual cost tracking before you ship.
Here’s what that looks like in practice. If you’re processing 100K documents per day through Claude Opus 4.7 at roughly 2,000 tokens per document (input + output), that’s 200M tokens daily. At $3 per million input tokens, you’re looking at $600/day or $18K/month on inference alone. Swap to Claude Sonnet 4.6 and you’re at $180/day. The gap matters.
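That math is worth scripting before you commit to a model. A minimal sketch, using the volumes and assumed per-token rates from this example (not published price sheets):

```python
# Back-of-the-envelope inference cost estimate. The rates below are the
# illustrative numbers from this example, not published pricing, and output
# tokens typically cost more than input, so treat the result as a floor.

DOCS_PER_DAY = 100_000
TOKENS_PER_DOC = 2_000  # input + output, rough average

PRICE_PER_MTOK = {  # assumed USD per million tokens
    "claude-opus-4.7": 3.00,
    "claude-sonnet-4.6": 0.90,
}

def daily_cost(model: str) -> float:
    tokens_per_day = DOCS_PER_DAY * TOKENS_PER_DOC  # 200M tokens/day
    return tokens_per_day / 1_000_000 * PRICE_PER_MTOK[model]

for model in PRICE_PER_MTOK:
    print(f"{model}: ${daily_cost(model):,.0f}/day, ${daily_cost(model) * 30:,.0f}/month")
# claude-opus-4.7: $600/day, $18,000/month
# claude-sonnet-4.6: $180/day, $5,400/month
```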
Inference caching helps. Anthropic’s prompt caching on Opus 4.7 costs 90% less for cached tokens, so if you’re reusing system prompts or retrieval context across requests, you can drop costs by 30-40% with minimal latency penalty. But you need to architect for it: cache hits only happen on a stable prompt prefix, which is hard to retrofit onto prompts assembled ad hoc.
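Here is the same math with a cache discount applied. The 90% discount is the figure above; the cacheable fraction and hit rate are assumptions you’d replace with measurements from your own traffic:

```python
# Effect of prompt caching on daily spend, assuming cached tokens are billed
# at 10% of the normal rate (the 90% discount above). cacheable_fraction is
# the share of each request that is a stable prefix (system prompt, reused
# retrieval context); cache_hit_rate is how often that prefix is actually warm.

def cached_daily_cost(base_daily_cost: float,
                      cacheable_fraction: float,
                      cache_hit_rate: float,
                      cached_token_discount: float = 0.90) -> float:
    savings = base_daily_cost * cacheable_fraction * cache_hit_rate * cached_token_discount
    return base_daily_cost - savings

print(cached_daily_cost(600.0, cacheable_fraction=0.5, cache_hit_rate=0.8))
# 384.0 -> about a 36% reduction, consistent with the 30-40% figure above
```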
Some teams run open-source models locally with Ollama or vLLM to cut API dependency and cost. Llama 4 Maverick with a million-token context fits on a single high-end multi-GPU server. But then you’re managing GPU infrastructure, quantization, serving latency, and keeping up with model updates. Pick your pain: API costs or infrastructure ops.
Latency, reliability, and fallback patterns
APIs work, but network calls fail. Timeouts happen. Retries compound. You need circuit breakers, exponential backoff, and fallback strategies. If GPT-5.4 times out, can you degrade to GPT-4.1 Nano? If Claude Opus 4.7 is rate-limited, do you queue or use a different model entirely?
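A minimal sketch of that retry-and-fallback shape, assuming a generic call_model() wrapper around whatever client you use; the model names, timeout, and backoff numbers are placeholders:

```python
import random
import time
from collections.abc import Sequence

class ModelCallError(Exception):
    pass

def call_model(model: str, prompt: str, timeout: float) -> str:
    """Placeholder for your actual API client call."""
    raise NotImplementedError

def call_with_fallback(prompt: str,
                       models: Sequence[str] = ("gpt-5.4", "gpt-4.1-nano"),
                       max_retries: int = 3) -> str:
    # Try each model tier in order; within a tier, retry with exponential
    # backoff plus jitter so synchronized retries don't hammer the API.
    for model in models:
        for attempt in range(max_retries):
            try:
                return call_model(model, prompt, timeout=10.0)
            except ModelCallError:
                time.sleep(min(2 ** attempt + random.random(), 30))
        # Tier exhausted: fall through to the next (cheaper or faster) model.
    raise ModelCallError("all model tiers failed")
```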
Teams running production agents report that a single slow API call cascades. If an agent makes 10 API calls in sequence and each has a 2% timeout rate, you’re looking at roughly 18% total failure rate for the full workflow. That’s why buffer time and graceful degradation matter.
One pattern that works: implement a model tier strategy where you route based on request complexity or latency budget. A customer-facing summary task might start with Claude Sonnet 4.6, fall back to Gemini 3.1 Flash if latency exceeds 3 seconds, and queue for async processing if both are slow. Build observability to track which fallback paths you’re actually hitting in production. If you’re failing over to cheaper models more than 5% of the time, your primary model choice is wrong.
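A sketch of that routing-and-observability shape. The tier list, the 3-second budget, and the queue hand-off are the assumptions from this example; call_model() and enqueue_for_async_processing() are placeholders for your own client and job queue:

```python
import time
from collections import Counter

# Count which path each request actually took and export it to your metrics
# system; if fallback paths exceed ~5% of traffic, revisit the primary tier.
fallback_counter: Counter[str] = Counter()

TIERS = ["claude-sonnet-4.6", "gemini-3.1-flash"]  # primary, then faster/cheaper
LATENCY_BUDGET_S = 3.0

def call_model(model: str, prompt: str, timeout: float) -> str:
    """Placeholder for your actual API client call."""
    raise NotImplementedError

def enqueue_for_async_processing(prompt: str) -> None:
    """Placeholder for your job queue."""

def summarize(text: str) -> str | None:
    for model in TIERS:
        start = time.monotonic()
        try:
            result = call_model(model, text, timeout=LATENCY_BUDGET_S)
        except Exception:
            fallback_counter[f"{model}:error"] += 1
            continue
        if time.monotonic() - start <= LATENCY_BUDGET_S:
            fallback_counter[f"{model}:ok"] += 1
            return result
        fallback_counter[f"{model}:slow"] += 1
    # All tiers slow or failing: hand off to an async queue instead of blocking.
    fallback_counter["queued"] += 1
    enqueue_for_async_processing(text)
    return None
```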
Hallucination verification in production
The hallucination problem doesn’t disappear. RAG systems retrieve documents correctly but models still confidently output wrong answers. If you can’t verify whether the output is correct automatically (via a test, a schema, a database query), you need human review in the loop.
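When the output is structured, the check can be mechanical. A minimal sketch using pydantic, with a made-up invoice schema and a placeholder review queue; the point is the gate, not the specific fields:

```python
from pydantic import BaseModel, ValidationError, field_validator

class InvoiceExtraction(BaseModel):
    # Hypothetical schema for a structured-extraction task.
    invoice_id: str
    total_cents: int
    currency: str

    @field_validator("currency")
    @classmethod
    def currency_is_iso_like(cls, v: str) -> str:
        if len(v) != 3 or not v.isalpha():
            raise ValueError("expected a 3-letter currency code")
        return v.upper()

def send_to_human_review(raw_output: str) -> None:
    """Placeholder for your review queue."""

def verify_or_escalate(raw_model_output: str) -> InvoiceExtraction | None:
    # Automatic gate: schema-valid output flows downstream, anything else
    # is escalated to a human instead of being trusted blindly.
    try:
        return InvoiceExtraction.model_validate_json(raw_model_output)
    except ValidationError:
        send_to_human_review(raw_model_output)
        return None
```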
Coding agents like Claude Code and Devin work because they have fast objective signals: tests pass or fail, linters run, CI gates succeed or block. A refactoring task with 100 unit tests has clear success criteria. Open-ended tasks like “summarize this customer feedback and recommend next steps” don’t. Someone has to check the output.
The hard cases are semantic: did the agent pick the right tool? Was the retrieval actually relevant? A vector similarity score of 0.8 doesn’t mean the retrieved chunk answers the query. Building eval frameworks for these is where most teams struggle because off-the-shelf observability tools can’t measure task-specific correctness.
State, memory, and agentic workflows
Stateless single-turn interactions are fine. But production agents need context that persists, and that’s messy. Session management, memory architectures, and long-running workflows require infrastructure most teams don’t have.
Patterns are emerging: git worktrees for parallel agent tasks so multiple agents can work on different branches without collision, draft-approve-execute flows so humans can review agent decisions before applying changes, and explicit state machines so the agent can’t drift into nonsense states. These exist because the naive agent loop of “call the LLM, execute what it says” doesn’t cut it in production.
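As a sketch of the last of those, here’s what an explicit state machine for a draft-approve-execute loop can look like; the states and transition rules are illustrative, not a library API:

```python
from enum import Enum, auto

class AgentState(Enum):
    DRAFTING = auto()            # agent proposes a change
    AWAITING_APPROVAL = auto()   # human reviews the draft
    EXECUTING = auto()           # approved change is applied
    DONE = auto()
    REJECTED = auto()

# Only these transitions are legal; anything else is a bug, not a retry.
ALLOWED = {
    AgentState.DRAFTING: {AgentState.AWAITING_APPROVAL},
    AgentState.AWAITING_APPROVAL: {AgentState.EXECUTING, AgentState.REJECTED},
    AgentState.EXECUTING: {AgentState.DONE},
}

class AgentWorkflow:
    def __init__(self) -> None:
        self.state = AgentState.DRAFTING

    def transition(self, new_state: AgentState) -> None:
        if new_state not in ALLOWED.get(self.state, set()):
            raise RuntimeError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
```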
Autonomous coding agents like GitHub Copilot (now multi-model across GPT-5.4, Claude Sonnet 4.6, and Gemini 3.1 Pro) can handle genuine multi-step tasks: cloning repos, running tests, fixing failures, opening PRs. But they work because code execution is deterministic and verifiable. Expand to unstructured environments or ambiguous success criteria and the same agents drift. The difference isn’t the model. It’s whether you have a fast feedback loop.
Monitoring and evaluation
Traditional metrics don’t tell you if an LLM is actually working for your users. You end up building custom eval frameworks because production observability needs to measure “did the agent pick the right tool?”, “was the retrieval actually relevant?”, or “did the summary miss critical context?” This is one of the least mature parts of the stack.
Start with baseline metrics: latency, error rate, cost per request, and cache hit ratio. Then layer in task-specific evals. For retrieval, measure whether the top-5 documents actually contain the answer. For classification, run periodic human spot checks on low-confidence predictions. For generation, set a threshold for automatic review (e.g., if a summary is more than 20% longer than expected, flag it). These don’t scale automatically, but they catch real drift before users notice.
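Two of those checks fit in a few lines each. The retrieval check below is deliberately crude string containment (a real eval set needs fuzzier matching), and the 20% length threshold is the assumption from above:

```python
def retrieval_hit(expected_answer: str, retrieved_docs: list[str], k: int = 5) -> bool:
    # Crude baseline: does any of the top-k retrieved chunks literally
    # contain the expected answer string from your eval set?
    needle = expected_answer.lower()
    return any(needle in doc.lower() for doc in retrieved_docs[:k])

def summary_needs_review(summary: str, expected_length: int, tolerance: float = 0.20) -> bool:
    # Flag summaries that run more than 20% past the expected length
    # for human review instead of shipping them automatically.
    return len(summary) > expected_length * (1 + tolerance)
```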
Bottom line: Deploy LLMs where success is verifiable: code generation with tests, structured extraction with validation schemas, multi-choice classification. For tasks where correctness is subjective or hard to measure, keep humans in the loop. Budget for cost tracking, inference optimization, and orchestration infrastructure, not just API calls. Assume network failures and plan fallbacks. Build monitoring that actually tells you if the model is working for your specific use case. Accept that you’ll need custom eval frameworks. The teams shipping LLMs reliably aren’t using generic tooling.
Question via Hacker News