Office Hours — Why is AI agent reliability barely improving despite 18 months of model upgrades?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
Model capability and agent reliability are different problems, and we’ve been conflating them. GPT-5.4 is genuinely smarter than GPT-5.3 Codex, but “smarter” doesn’t mean “more reliable at following a 10-step workflow without hallucinating step 7.”
The real blockers are architectural, not model-based. Agents fail because tool calls error out, context windows fill up mid-execution, reward models get gamed (reward hacking), and state is poorly managed across LLM calls. A better base model helps marginally, but it doesn’t solve the fact that your agent has no clean way to backtrack when it hits a dead end, or that it can’t reliably update its own memory mid-task.
Where Model Upgrades Actually Help (and Where They Don’t)
Claude Opus 4.7 and GPT-5.4 are objectively better at reasoning through complex sequences. But production agents aren’t failing because the model can’t think hard enough. They’re failing because the system has no retry logic when a tool call times out, no way to gracefully handle a partially-filled context window 6 steps into an 8-step plan, or no mechanism to detect when it’s entered a loop.
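Loop detection in particular needs very little code. The sketch below is one possible approach, not a standard API: it counts identical tool calls and flags the agent once the same call has repeated too often. The class name, the max_repeats threshold, and the run_tests tool name are all hypothetical.

```python
import json
from collections import Counter


class LoopDetector:
    """Flags an agent that keeps issuing the same tool call with the same arguments."""

    def __init__(self, max_repeats=3):
        self.max_repeats = max_repeats
        self._seen = Counter()

    def record(self, tool_name, arguments):
        """Record one tool call; return True once it has repeated too often."""
        # Serialize with sorted keys so argument order does not change the key.
        key = (tool_name, json.dumps(arguments, sort_keys=True, default=str))
        self._seen[key] += 1
        return self._seen[key] >= self.max_repeats


# Hypothetical usage: the agent issues the same call three times in a row.
detector = LoopDetector(max_repeats=3)
calls = [("run_tests", {"path": "tests/"})] * 3
flags = [detector.record(name, args) for name, args in calls]
```

When record returns True, the control loop can stop, escalate to a human, or force a different plan instead of burning more calls.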
Consider a typical coding agent scenario: clone a repo, install dependencies, run tests, fix failures, open a PR. The frontier models can handle this end-to-end. But when the test suite takes 45 seconds and the agent’s timeout is set to 30 seconds, a smarter model doesn’t help. Neither does upgrading from Claude Opus 4.6 to Claude Opus 4.7. You need structured timeouts and explicit fallback paths.
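As a rough illustration of a structured timeout with an explicit fallback path, the sketch below runs a command under a time limit and, if it times out, falls back to a cheaper command. The function names and the result dictionary shape are assumptions made for this example, and the two demo commands are stand-ins for a real full test suite and a faster subset.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Run a command; return a structured result instead of hanging or raising."""
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return {"status": "timeout", "output": ""}
    status = "passed" if completed.returncode == 0 else "failed"
    return {"status": status, "output": completed.stdout}


def run_tests_with_fallback(full_command, subset_command, timeout_seconds):
    """Try the full suite first; on timeout, fall back to a cheaper subset."""
    result = run_with_timeout(full_command, timeout_seconds)
    fallback_used = result["status"] == "timeout"
    if fallback_used:
        result = run_with_timeout(subset_command, timeout_seconds)
    result["fallback_used"] = fallback_used
    return result


# Stand-in commands: a "suite" that sleeps past the limit and a fast "subset".
slow = [sys.executable, "-c", "import time; time.sleep(10)"]
fast = [sys.executable, "-c", "print('ok')"]
result = run_tests_with_fallback(slow, fast, timeout_seconds=2)
```

The point is that the timeout becomes a value the agent can branch on, rather than an exception or a hang the model has to guess about.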
The same principle applies to multi-step RAG workflows. A more capable model won’t fix an agent that retrieves from 3 incompatible knowledge bases, merges conflicting signals, and has no way to signal uncertainty back to the caller.
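One way to give the caller an uncertainty signal is to return a structured result rather than a bare string. The sketch below assumes each knowledge base has already produced a candidate answer (or None); the RetrievalAnswer type, the merge_sources function, and the source names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RetrievalAnswer:
    answer: Optional[str]
    confident: bool
    conflicts: List[str] = field(default_factory=list)


def merge_sources(answers_by_source: Dict[str, Optional[str]]) -> RetrievalAnswer:
    """Merge per-source answers; report disagreement instead of picking a winner."""
    distinct = {a for a in answers_by_source.values() if a is not None}
    if len(distinct) == 1:
        return RetrievalAnswer(answer=distinct.pop(), confident=True)
    if not distinct:
        # No source had an answer: say so rather than invent one.
        return RetrievalAnswer(answer=None, confident=False)
    conflicts = [
        f"{source}: {answer}"
        for source, answer in sorted(answers_by_source.items())
        if answer is not None
    ]
    return RetrievalAnswer(answer=None, confident=False, conflicts=conflicts)


agreed = merge_sources({"wiki": "v2", "docs": "v2", "tickets": None})
disputed = merge_sources({"wiki": "v1", "docs": "v2"})
```

Exact string comparison is a deliberate simplification here; a real system would need some notion of semantic agreement, but the shape of the return value is what lets the caller react to conflict.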
What Actually Scales Agent Reliability
We’ve also been measuring the wrong thing. A 3-point improvement on SWE-Bench doesn’t translate to agents that actually complete your workflows reliably. Most agent benchmarks still have high variance, and production agents are hitting walls around consistency, not raw reasoning power.
The real constraint is engineering discipline: structured outputs (JSON schemas, not prose), explicit state machines (define every valid transition), fallback strategies (what happens when tool X fails?), and honest error handling (fail fast instead of hallucinating recovery). An agent with a clear state diagram, timeouts on every external call, and rollback logic will outperform a smarter agent with brittle control flow.
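A minimal sketch of an explicit state machine, assuming a simple plan/call/verify loop. The state names and the transition table are illustrative, not taken from any particular framework; what matters is that every valid transition is written down and anything else fails fast.

```python
from enum import Enum


class State(Enum):
    PLANNING = "planning"
    CALLING_TOOL = "calling_tool"
    VERIFYING = "verifying"
    DONE = "done"
    FAILED = "failed"


# Every valid transition is listed here; anything else is rejected.
TRANSITIONS = {
    State.PLANNING: {State.CALLING_TOOL, State.FAILED},
    State.CALLING_TOOL: {State.VERIFYING, State.FAILED},
    State.VERIFYING: {State.PLANNING, State.DONE, State.FAILED},
    State.DONE: set(),
    State.FAILED: set(),
}


class InvalidTransition(Exception):
    """Raised when the agent attempts a transition that is not in the table."""


class AgentStateMachine:
    def __init__(self):
        self.state = State.PLANNING

    def transition(self, new_state):
        # Fail fast instead of letting the agent improvise a recovery.
        if new_state not in TRANSITIONS[self.state]:
            raise InvalidTransition(f"{self.state.name} -> {new_state.name}")
        self.state = new_state


machine = AgentStateMachine()
machine.transition(State.CALLING_TOOL)
machine.transition(State.VERIFYING)
machine.transition(State.DONE)
```

Because DONE and FAILED have no outgoing transitions, an agent that tries to keep working after finishing raises InvalidTransition instead of drifting.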
Concrete example: if your agent needs to update a database record and then send a notification, don’t let it proceed speculatively. After the database write succeeds, checkpoint the state. If the notification fails, you can retry from that checkpoint without re-running the expensive parts. This costs you a few hundred milliseconds and some storage, but it eliminates entire categories of failure modes that no model upgrade will fix.
Here’s what that looks like in practice:
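class AgentCheckpoint:
    def __init__(self):
        # In-memory store for illustration only; use durable storage (a file
        # or database) if checkpoints must survive a process restart.
        self._results = {}

    def load_checkpoint(self, step_name):
        return step_name in self._results

    def get_cached_result(self, step_name):
        return self._results[step_name]

    def save_checkpoint(self, step_name, result):
        self._results[step_name] = result

    def execute_with_checkpoint(self, step_name, fn, *args):
        # Skip the step if a previous attempt already completed it.
        if self.load_checkpoint(step_name):
            return self.get_cached_result(step_name)
        result = fn(*args)  # If fn raises, nothing is saved and a retry reruns it.
        self.save_checkpoint(step_name, result)
        return result

# In your agent loop (update_user_record, notify_user, user_id, and new_data
# come from your own application):
agent = AgentCheckpoint()
db_result = agent.execute_with_checkpoint(
    "database_update",
    update_user_record,
    user_id, new_data
)
notification_result = agent.execute_with_checkpoint(
    "send_notification",
    notify_user,
    user_id, db_result
)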
This pattern adds negligible latency but prevents your agent from retrying expensive operations when only the final step failed. Without checkpointing, a smarter model just means a smarter hallucination about recovery.
The Production Reality
Autonomous coding agents like Cursor Agent and GitHub Copilot (now multi-model with GPT-5.4 and Claude Sonnet 4.6 options) do work for genuine multi-step tasks. They clone repos, run tests, fix failures, and open PRs with minimal human intervention. But they work because code has a fast objective signal: tests pass or they don’t. Linters and CI pipelines give instant feedback.
When success is ambiguous, agents drift. That same agent architecture applied to content moderation, design review, or subjective technical decisions will fail consistently, regardless of model version. The difference is measurable: coding agents running in production report success rates around 65-78% on real repos, depending on complexity. Switch the same agent to unstructured domains without clear pass/fail criteria, and success drops to 20-35%.
The second constraint is underrated: longer doesn’t mean better. Claude Opus 4.6 holds the record for longest autonomous operation window (14.5 hours), but most agent failures happen around steps 5-8, not after 50 steps. The issue isn’t context length. It’s that every LLM call adds entropy. Each decision point is a chance to drift. Better models reduce drift per call but don’t eliminate it.
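A back-of-the-envelope way to see how drift compounds: if each step succeeds independently with the same probability, the whole workflow succeeds with that probability raised to the number of steps. Both the independence assumption and the 95% per-step figure below are illustrative, not measurements from any real agent.

```python
def workflow_success_rate(per_step_success, steps):
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_success ** steps


# Illustrative per-step reliability of 95% across different workflow lengths.
for steps in (1, 5, 8, 50):
    rate = workflow_success_rate(0.95, steps)
    print(f"{steps:>2} steps: {rate:.1%}")
```

Under these assumptions an 8-step workflow finishes cleanly only about two times in three, which is why reducing the number of unverified decision points tends to pay off more than a small gain in per-call accuracy.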
Bottom line: Stop waiting for the next model release to fix agent reliability. Audit your agent’s actual failure modes in production, then architect around them. Implement checkpoints at state transitions, set explicit timeouts on every external call, define clear fallback paths, and only deploy agents where success is objectively verifiable. Claude Opus 4.7 and GPT-5.4 are excellent tools, but they’re not the constraint right now.
Question via Hacker News