Office Hours — Who is actually getting measurable value from AI agents in production?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
The honest answer: coding agents and deterministic workflow automation. Claude Code with Opus 4.7, Cursor Agent, GitHub Copilot (now multi-model across GPT-5.4, Claude Sonnet 4.6, and Gemini 3.1 Pro), and Devin are genuinely shipping multi-step tasks end-to-end—cloning repos, running tests, fixing failures, opening PRs—with minimal human intervention on individual steps. That’s real and in production. Hyatt deploying ChatGPT Enterprise across global operations signals mainstream adoption for structured tasks like customer service triage and policy lookup. Google’s new Ads Advisor controls show agents working in high-stakes commercial systems with clear compliance requirements and verifiable outcomes.
Where agents fail
What’s not working yet: anything requiring subjective judgment or operating in unstructured environments. RAG systems still fail silently as memory grows, hallucinating with confidence. Agents in open-ended creative or analytical tasks drift. Multi-source agentic RAG remains unreliable because there’s no clean feedback signal when sources conflict. The pattern is clear: agents work when success is objectively verifiable (test passes, linter passes, policy matched). They struggle when the goal is fuzzy or the feedback loop is slow.
Consider a concrete difference. A coding agent with Opus 4.7 can autonomously refactor a module because the test suite provides immediate, deterministic feedback. It runs tests, sees failures, fixes the code, iterates. Each step has a binary signal. But asking an agent to “improve code quality” without defined metrics fails because there’s no objective function to optimize against. The agent makes changes that seem reasonable locally, but has no signal to tell whether the result is actually better.
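That iteration loop is worth making concrete. A minimal sketch: `run_tests` and `apply_patch` are hypothetical stand-ins for the real test runner (e.g. pytest invoked via subprocess) and the agent’s edit step.

```python
def refactor_with_test_feedback(run_tests, apply_patch, max_iterations=5):
    """Iterate until the suite is green or the budget is spent.

    run_tests   -> (passed: bool, output: str)   # hypothetical stand-in
    apply_patch(output)                          # agent edits in response
    Returns the number of patches applied, or None if the budget ran out.
    """
    for attempt in range(max_iterations):
        passed, output = run_tests()   # binary, deterministic signal
        if passed:
            return attempt
        apply_patch(output)            # agent reacts to concrete failures
    return None                        # hard stop instead of open-ended drift
```

The iteration budget is the important part: when the signal stops improving, the loop ends and a human looks, rather than the agent drifting indefinitely.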
A real example: a team deployed an agent to optimize database queries in their Rails application. Without explicit success criteria, the agent made changes that reduced latency by 15% on average but introduced occasional N+1 problems under concurrent load. The metric was too coarse. They redeployed with explicit guardrails: the agent could only apply transformations that passed a specific load-test suite (1000 concurrent requests, p99 latency under 200ms). The second deployment worked. The agent optimized within known boundaries.
The configuration that made it work was minimal but specific:
task:
  scope: query_optimization_only      # no schema changes, no index creation
  exit_condition: load_test_green     # 1000 concurrent, p99 < 200ms
  rollback_trigger:
    metric: error_rate
    threshold: 0.01
    window_seconds: 60
Three fields. Transformation scope, exit condition, rollback trigger. The agent didn’t need more latitude. It needed less. Every production deployment that’s worked well has some version of this: a constrained action space, an objective exit condition, and a hard stop that doesn’t ask the agent to recover.
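The rollback trigger in that config can be implemented as a sliding-window counter that trips once and hands control back to a human. A minimal sketch using the field names from the config; the `now` parameter exists only to make the window testable.

```python
import time
from collections import deque

class RollbackTrigger:
    """Hard stop: trip when error_rate over a rolling window exceeds
    the threshold. Once tripped, the agent does not try to recover."""

    def __init__(self, threshold=0.01, window_seconds=60):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, was_error)

    def record(self, was_error, now=None):
        now = time.monotonic() if now is None else now
        self.events.append((now, was_error))
        # Drop observations that have aged out of the window.
        while self.events and self.events[0][0] < now - self.window_seconds:
            self.events.popleft()

    def should_rollback(self):
        if not self.events:
            return False
        errors = sum(1 for _, was_error in self.events if was_error)
        return errors / len(self.events) > self.threshold
```

Note what this class does not do: it has no method for resuming, retrying, or letting the agent argue its way past the gate.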
The cost and reliability tradeoff
Local deployment often wins over cloud APIs for mission-critical workflows. A locally deployed Llama 4 or Devstral 2 with guaranteed response times and no rate limits can be more valuable than calling GPT-5.5 when your SLA requires sub-500ms latency and 99.9% availability. This trades model quality for operational control, and it’s often the right tradeoff in production.
Consider a payment processing workflow. Calling GPT-5.5 with 96% accuracy on transaction classification sounds better than a local Devstral 2 at 92%, until you hit rate limits during peak load, or the API adds latency on a cold start, or a hallucination misclassifies a fraudulent charge. The local model is slightly less capable, but deterministic. You know its failure boundaries. You can fine-tune it on your specific transaction types. When failures cost real money, predictability beats raw capability.
The economics matter too. A local deployment of Devstral 2 running on modest hardware processes inference at a fraction of the per-call cost of frontier API pricing. At high transaction volume, the math favors local infrastructure by a wide margin. Below roughly 10k daily transactions, the break-even shifts and API calls win on simplicity. Calculate your volume, multiply by pricing, factor in the cost of one serious rate-limit incident during peak traffic, then decide.
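The break-even arithmetic is simple enough to check directly. A sketch under assumed numbers, all of which are illustrative: a flat monthly cost for the local box, average tokens per call, and per-million-token API pricing. Substitute your own figures.

```python
def break_even_daily_volume(local_usd_per_month, tokens_per_call, api_usd_per_mtok):
    """Daily call volume at which flat local infra cost equals API spend.
    Above this volume local wins; below it, the API wins on simplicity."""
    usd_per_call = tokens_per_call / 1_000_000 * api_usd_per_mtok
    return local_usd_per_month / (30 * usd_per_call)

# Assumed: $1200/month for a GPU box, 800 tokens per classification,
# $5 per million tokens at the API.
volume = break_even_daily_volume(1200.0, 800, 5.0)  # 10000.0 calls/day
```

With those (assumed) inputs the break-even lands right around the 10k-daily-transactions mark; the formula is the point, not the specific numbers.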
One edge case worth flagging: hybrid routing. Some teams run a local model for the high-volume, low-complexity classifications and fall back to a frontier API only for low-confidence cases. If 85% of your transactions are unambiguous, you pay API costs on 15% of volume and keep latency predictable across the board. The routing logic itself needs to be simple and auditable: a confidence threshold, not another agent call.
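A minimal sketch of that routing logic, assuming the local model returns a label with a confidence score. Both classifier callables are hypothetical stand-ins for your local model and your frontier API client.

```python
def route(transaction, local_classify, frontier_classify, confidence_floor=0.90):
    """Confidence-threshold routing: one comparison, fully auditable.
    Returns (label, source) so every decision can be logged and replayed."""
    label, confidence = local_classify(transaction)
    if confidence >= confidence_floor:
        return label, "local"
    # The low-confidence minority escalates to the frontier API.
    return frontier_classify(transaction), "frontier"
```

Because the decision is one comparison against a fixed threshold, you can replay any routing decision from logs, which is exactly what “auditable” means here.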
Production control flow
The inflection point is happening around coding and logistics. Enterprises are shipping agents for code generation, deployment, and document processing. But the proliferation of bespoke sandboxing solutions and internal state machine frameworks tells you teams are still wrestling with architectural rigor. Production agents need deterministic control flow, not just capability. You can’t treat an agent like a black box in mission-critical systems. You need explicit step definitions, rollback mechanisms, and human checkpoints at risky transitions.
Concrete example: a deployment pipeline agent should have explicit rollback gates, not just error recovery. If the agent patches a service and 10% of requests fail, you want a hard checkpoint that alerts a human, not an agent that tries to fix it by making more changes. Define the state machine upfront. Let the agent execute steps, but own the transitions.
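Here is a sketch of what “define the state machine upfront” can look like. The state names, the 10% canary threshold, and the transitions are illustrative; the point is that the transition table is plain code the team owns, not something the agent decides at runtime.

```python
def next_state(state, error_rate=0.0, human_approved=False):
    """Deterministic transition table for a deployment pipeline.
    The agent executes steps; only this function moves between them."""
    if state == "patched":
        return "canary"                  # ship to a small slice first
    if state == "canary":
        if error_rate > 0.10:
            return "awaiting_human"      # hard gate: page someone, stop
        return "promoted"
    if state == "awaiting_human":
        # A human decides; the agent never "fixes" its way past this gate.
        return "canary" if human_approved else "rolled_back"
    return state                         # promoted / rolled_back are terminal
```

The agent can be arbitrarily capable inside a step; it gets no vote on which step comes next.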
The killer insight from recent deployments: determinism matters more than capability once you’re past the prototype phase. A predictable 92%-accuracy agent with clear failure modes beats a 96%-accuracy agent that occasionally hallucinates in ways you can’t anticipate. Production systems need to know their failure boundaries.
Bottom line: Deploy agents where success is verifiable and feedback is fast (coding, workflow automation, structured task routing). Keep humans in the loop everywhere else. Build explicit control flow, not just capability. And don’t confuse capable models with reliable systems.
Question via Hacker News