Office Hours — Who is actually getting real business value from AI agents right now, and how?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
Real value is happening in narrow, well-defined workflows where the stakes are moderate and the input space is constrained. Customer support teams using Claude Opus 4.7 or GPT-5.4 to handle tier-1 tickets with human handoff are seeing measurable ROI. Insurance companies are running agents on structured claim data, extracting information and routing claims to underwriters. Recruiting firms use agents to parse resumes, flag candidates, and schedule initial screens. These work because the agent isn’t making the final call; it’s handling the tedious middle mile.
What’s not working yet: fully autonomous agents in high-stakes domains where success is subjective or ambiguous. A sales agent that independently qualifies leads and commits to pricing? Still fails in production. Agents that make financial decisions without guardrails? Companies have learned that lesson. The failure mode is consistent: the agent drifts when there’s no fast, objective signal to correct course.
Where agents actually win
The pattern is consistent across working deployments. You get real value when you build an agent that does 60-70% of the work predictably well, then hands the remaining 30-40% of ambiguous cases off to humans. The agent needs clear input boundaries (structured data or tightly scoped documents), explicit decision trees for routing, and a human feedback loop.
Coding agents are the exception that proves the rule. Claude Code with Opus 4.7, Cursor Agent, GitHub Copilot (now multi-model across GPT-5.4, Claude Sonnet 4.6, and Gemini 3.1 Pro), and Devin can handle genuine multi-step tasks without human intervention: cloning repos, running tests, fixing failures, opening PRs. This works because success is verifiable (tests pass or fail, linter accepts or rejects, CI is green or red). The objective signal is immediate. In contrast, agents deployed on unstructured text, customer conversations, or subjective judgment calls still need human validation.
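To make the verifiability point concrete, here is a minimal sketch in Python of the loop that makes coding agents trustworthy: a proposed change is only kept if an objective check, the test suite, passes. The helper is hypothetical; substitute whatever your agent harness actually uses to apply and revert changes.

import subprocess

def accept_if_tests_pass(patch_path: str, repo_dir: str) -> bool:
    # Apply the agent's proposed change to a clean checkout.
    subprocess.run(["git", "apply", patch_path], cwd=repo_dir, check=True)
    # Objective signal: the test suite either passes or it doesn't.
    result = subprocess.run(["pytest", "-q"], cwd=repo_dir)
    if result.returncode == 0:
        return True                    # tests are green: keep the change
    # Tests failed: revert and send the failure back to the agent.
    subprocess.run(["git", "apply", "-R", patch_path], cwd=repo_dir, check=True)
    return False

The same structure applies to any domain where you can get a binary, machine-checkable verdict on the agent's output.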
Cost and model selection
Llama 4 and Mistral Large 3 can handle moderate-complexity agent workflows at lower cost if you’re willing to invest in guardrail complexity. A tier-1 support agent running on Llama 4 Scout costs roughly 60-70% less per token than Claude Opus 4.7, but requires more explicit routing logic and tighter input validation. The trade-off is real: frontier models like GPT-5.5 or Claude Opus 4.7 need fewer safety constraints because they’re less prone to hallucination in edge cases, but they cost more. For high-volume, well-defined tasks, the math favors open source. For tasks with higher error cost or more ambiguous inputs, the frontier models pay for themselves.
A concrete example: an insurance claim router handling 10,000 claims per day. Running on Llama 4 Scout at $0.04 per 1M tokens costs roughly $400/day in inference. Claude Opus 4.7 at similar throughput costs closer to $1,100/day, a gap of roughly $21,000 per month. The Llama-based system requires you to define claim categories, policy limits, and contradiction detection rules upfront. The Claude system needs less scaffolding because it catches edge cases on its own. If the Llama system’s 2-3% higher error rate costs your company more than that $21,000/month gap in false rejections, the Claude model becomes the cheaper option even at nearly 3x the token cost.
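The break-even math is simple enough to write down. A rough sketch, using the illustrative daily costs above and a hypothetical monthly error cost:

# Back-of-the-envelope break-even check for the claim-router example above.
# All figures are the illustrative numbers from the text, not benchmarks.
llama_per_day = 400          # USD/day inference, Llama 4 Scout
claude_per_day = 1100        # USD/day inference, Claude Opus 4.7
days = 30

inference_gap = (claude_per_day - llama_per_day) * days    # ~21,000 USD/month

# Hypothetical monthly cost of the Llama system's extra false rejections.
llama_extra_error_cost = 25_000

print(f"Monthly inference gap: ${inference_gap:,}")
print("Frontier model is cheaper overall:", llama_extra_error_cost > inference_gap)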
Consider the actual implementation. A claim router on Llama 4 Scout needs explicit configuration:
routing_rules:
  auto_approve:
    - condition: >-
        claim_amount < policy_limit AND
        complete_documentation AND
        no_contradictions
      confidence_threshold: 0.92
  escalate:
    - condition: missing_fields OR suspicious_patterns
      queue: human_review
Claude Opus 4.7 can handle more of this through natural instruction without hardcoded logic. The tradeoff: Llama needs more engineering upfront, Claude needs budget upfront.
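What “natural instruction” looks like in practice is roughly this: the rules move from config into the prompt, and the model returns a structured verdict. The sketch below is illustrative only; call_model stands in for whichever SDK you use, and the response schema is an assumption, not a vendor API.

import json

ROUTER_PROMPT = """You are an insurance claim router.
Auto-approve only if the claim amount is under the policy limit, the
documentation is complete, and nothing contradicts the policy record.
Otherwise escalate. Respond as JSON:
{"decision": "auto_approve" or "escalate", "reason": "<one sentence>"}"""

def route_claim(claim: dict, call_model) -> dict:
    # call_model is a placeholder: any function that sends a system + user
    # message to your frontier model and returns the text of the reply.
    reply = call_model(system=ROUTER_PROMPT, user=json.dumps(claim))
    verdict = json.loads(reply)
    if verdict["decision"] != "auto_approve":
        verdict["queue"] = "human_review"   # same escalation path as the YAML rules
    return verdict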
The handoff pattern that works
The successful deployments all follow the same structure. An agent processes the request, applies a confidence threshold (explicit or learned), and either commits to an action or routes to a human queue. Insurance claim routing agents, for example, auto-approve straightforward claims under policy limits and flag anything with missing data, contradictions, or unusual patterns. The human underwriter never sees the easy cases. The agent never makes the hard call.
This requires explicit routing logic in your agentic system, not just hope. Define what “confident” means for your domain (threshold on model confidence, logic checks, or learned signals from production). Build the handoff as a first-class component, not an afterthought.
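As a sketch, the first-class handoff is one function that either commits or routes, never both. The threshold, field names, and queue names here are illustrative, not a prescribed schema:

from dataclasses import dataclass

@dataclass
class Decision:
    action: str        # "commit" or "handoff"
    target: str        # downstream pipeline or human queue
    confidence: float

CONFIDENCE_THRESHOLD = 0.92    # illustrative; tune from production feedback

def decide(agent_output: dict) -> Decision:
    confident = (
        agent_output["confidence"] >= CONFIDENCE_THRESHOLD
        and not agent_output.get("missing_fields")
        and not agent_output.get("contradictions")
    )
    if confident:
        return Decision("commit", "auto_approve_pipeline", agent_output["confidence"])
    return Decision("handoff", "human_review_queue", agent_output["confidence"])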
Where the pattern breaks
The handoff model works poorly when confidence becomes hard to define. A content moderation agent flags text as violating policy, but context matters: is dark humor allowed? Is criticism of a public figure harassment? These are judgment calls. An agent can surface a decision with supporting context, but trying to automate the threshold causes either too many false positives (user frustration, valid content removed) or false negatives (policy violations slip through). The agent is more useful here as a triage tool that ranks by likelihood and severity, leaving the binary decision to humans.
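The triage version is deliberately simple, which is the point: the agent orders the queue, humans make the call. A sketch, assuming the moderation model returns likelihood and severity scores (field names are assumptions):

def triage_queue(flagged_items: list[dict]) -> list[dict]:
    # likelihood: model's estimate that the content violates policy (0-1)
    # severity: estimated harm if it is a real violation (0-1)
    # Humans work the list from the top; nothing is auto-removed.
    return sorted(
        flagged_items,
        key=lambda item: item["likelihood"] * item["severity"],
        reverse=True,
    )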
Similarly, agents in exploratory domains like architecture review or strategy consulting fail autonomously. The task isn’t well-defined enough. “Should we use Postgres or DynamoDB?” depends on access patterns, consistency requirements, team skills, and operational budget. An agent can enumerate trade-offs, but the decision belongs to an engineer who understands the context. You can use the agent to generate options and summaries, not to decide.
The long chain problem
One underexplored failure mode: agents performing well in isolation but drifting across long chains of steps. A refactoring agent might execute steps 1-8 correctly, then at step 9 make a decision that contradicts the original intent. This happens because each step compounds small deviations. In coding, this breaks immediately (tests fail, CI rejects). In customer communication or document processing, the error can propagate silently. The agents that succeed in multi-step domains are the ones where intermediate steps produce verifiable artifacts (code that compiles, documents that pass validation, structured data that passes schema checks). When intermediate steps are subjective or invisible, the agent needs checkpoints and human review built in.
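A sketch of what “checkpoints built in” means: every step must produce an artifact that passes an objective check before the chain continues, and anything unverifiable stops for human review. The function names are placeholders for your own pipeline, not a specific framework.

def run_chain(steps, initial_state, run_step, validators, request_review):
    state = initial_state
    for i, step in enumerate(steps):
        artifact, state = run_step(step, state)
        # Objective per-step check: schema validation, compilation, tests, etc.
        check = validators.get(step)
        if check is None or not check(artifact):
            # No verifiable artifact, or the check failed: stop before drift
            # can propagate silently and hand the chain to a human.
            request_review(step_index=i, artifact=artifact, state=state)
            return state
    return state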
Bottom line: Deploy agents for high-volume, moderate-stakes tasks with clear handoff points to humans and verifiable success signals. Use autonomous agents confidently in domains with fast objective feedback (code, structured data, routing decisions). Keep humans in the loop for judgment calls, subjective evaluation, and high-stakes decisions. Model selection should track error cost and error rate tolerance, not brand preference. Plan guardrails into the system architecture from the start, not as a patch. The gap between “agent handles grunt work” and “fully autonomous system” remains wide, but the grunt work is where real money gets saved.
Question via Hacker News