Office Hours — What are the actual best practices for learning to build effective multi-step AI agents beyond toy examples?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
The gap between a chatbot that answers one question and an agent that reliably chains 10+ steps is where most engineering intuition breaks down. You can’t learn this from benchmarks or documentation. You have to build something that fails, then systematically fix the failure modes.
Stop treating agents as a model problem
The biggest mistake is assuming a better model solves agent brittleness. It doesn’t. Claude Opus 4.8 hallucinates step sequences just as confidently as smaller models do. GPT-5.5 can still get stuck in loops. The model is maybe 30% of the problem. The other 70% is architecture: how you structure state, how you verify outcomes, how you handle divergence from the plan.
The signal from production is clear. Qwen3.7-Max holds the record for longest autonomous operation (35 hours on chip code optimization in May 2026), but that wasn’t because the model is magically better at reasoning. It’s because the task was highly constrained, had a fast objective signal (tests pass or fail), and the system could verify success at every step. Compare that to agentic RAG across heterogeneous data sources, which still fails regularly even with frontier models, because there’s no clear success criterion until the human reads the output.
Build toward verifiable, not open-ended tasks
Start by building agents on problems where you can measure success mechanically. Autonomous code modification works because tests exist. You push code, run the suite, get a binary signal. The agent can reason about that feedback and iterate.
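A minimal sketch of that loop, assuming two callables you supply yourself. `propose_patch` and `apply_and_test` are hypothetical names, not a real library API: the first asks the model for a change given the last failure output, the second applies it and runs your suite.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures, not a real library API:
#   propose_patch(feedback) -> patch text, given the previous failure output
#   apply_and_test(patch)   -> (passed, test output)
ProposePatch = Callable[[Optional[str]], str]
ApplyAndTest = Callable[[str], Tuple[bool, str]]


def iterate_until_green(
    propose_patch: ProposePatch,
    apply_and_test: ApplyAndTest,
    max_iterations: int = 5,
) -> Optional[str]:
    """Return the first patch that passes the tests, or None if the budget runs out."""
    feedback: Optional[str] = None
    for _ in range(max_iterations):
        patch = propose_patch(feedback)
        passed, output = apply_and_test(patch)
        if passed:
            return patch
        # The test output is the only signal fed back to the model
        feedback = output
    return None
```

In a real setup `apply_and_test` would probably shell out to your test runner. The point is the shape: a binary signal, a bounded number of attempts, and failure output as the only feedback.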
Don’t start with “write a market analysis” or “debug this customer issue.” Those require human judgment. The agent will confidently produce garbage because it can’t tell the difference between good and bad output. You’re debugging against a human reading the output, which is the slowest feedback loop in the world.
Real example: if you’re building an agentic BI tool to query your data warehouse, give the agent not just the schema but also a test harness: known queries with expected results that it must pass before you let it run on live data. The agent learns fast when it gets immediate, unambiguous feedback.
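A minimal sketch of that gate, using an in-memory SQLite database as a stand-in for the warehouse. The `generate_sql` callable is hypothetical (it represents the agent’s text-to-SQL step), and the table and case data are invented for illustration.

```python
import sqlite3
from typing import Callable, List, Sequence, Tuple

# Each case pairs a natural-language question with the rows a correct query returns
GoldenCase = Tuple[str, List[tuple]]


def passes_golden_queries(
    conn: sqlite3.Connection,
    generate_sql: Callable[[str], str],
    cases: Sequence[GoldenCase],
) -> bool:
    """Return True only if every generated query reproduces its expected rows."""
    for question, expected_rows in cases:
        sql = generate_sql(question)  # hypothetical: the agent's text-to-SQL step
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            return False  # invalid SQL counts as a failure, not a crash
        # Sort both sides so row order does not matter
        if sorted(rows) != sorted(expected_rows):
            return False
    return True


# Fixture database standing in for the warehouse
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 100), (2, "EU", 50), (3, "US", 70)],
)

cases = [("total revenue by region", [("EU", 150), ("US", 70)])]
```

Only let the agent near live data once `passes_golden_queries` returns True for your full case list.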
Design for observable failure, not hidden drift
Long agent chains fail silently. The agent completes without error but arrives at the wrong conclusion. You won’t notice until a human checks. By then, the context window is gone, the execution is stale, and debugging is a nightmare.
Instead, inject checkpoints. After each major step, have the agent explicitly state what it did and why. Have it predict what the next step should accomplish, then verify that outcome. This isn’t free; you’re burning tokens. But you’re buying observability.
Here’s a concrete pattern:
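```python
import logging

logger = logging.getLogger(__name__)


def agent_step_with_verification(state, action, model, execute_tool):
    """Execute one agent action and verify the result.

    `model` and `execute_tool` are placeholders for your own LLM wrapper
    and tool dispatcher; they are passed in so the step can run with stubs.
    """
    # Predict the expected outcome before executing, so the prediction
    # cannot be influenced by the actual result
    prediction = model.predict_outcome(state, action)

    # Execute the action
    result = execute_tool(action)

    # Verify: does the result match the prediction?
    verification = model.check_match(result, prediction)

    if not verification.passes:
        # Log the divergence explicitly and surface it to the caller
        logger.warning("Divergence on %r: %s", action, verification.reason)
        return {
            "result": result,
            "expected": prediction,
            "divergence": verification.reason,
            "should_retry": True,
        }

    return {"result": result, "divergence": None, "should_retry": False}
```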
This costs more per step. But you catch failures immediately instead of at the end of a 15-step chain. That’s a tradeoff that always wins.
Context is your constraint, not your feature
Agents fail because they forget context, repeat steps, or lose track of what they’ve already tried. This isn’t a reasoning problem; it’s an architecture problem.
The standard approach is to keep a structured execution log: every action taken, every outcome, every decision point. Not in the prompt (where it gets lost in noise), but in a separate data structure that gets explicitly queried.
The real lesson from production agents: the agent shouldn’t be managing its own memory. You manage it. You decide what’s relevant, what’s old enough to drop, what needs to be kept for auditing. Give the agent a read-only interface to that memory plus a structured “action log” it appends to. It’s more constrained but way more reliable.
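Here is one way that memory layer could look. This is a sketch under my own assumptions: the class and field names (`ExecutionLog`, `LogEntry`, `already_tried`) are invented for illustration, and a production version would persist entries somewhere durable.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass(frozen=True)
class LogEntry:
    step: int
    tool: str
    args: Tuple[Tuple[str, Any], ...]  # sorted (key, value) pairs, so key order is irrelevant
    outcome: str  # e.g. "ok" or "error: timeout"


@dataclass
class ExecutionLog:
    """Append-only action log owned by the orchestrator, not by the model."""

    _entries: List[LogEntry] = field(default_factory=list)

    def append(self, tool: str, args: dict, outcome: str) -> LogEntry:
        entry = LogEntry(len(self._entries) + 1, tool, tuple(sorted(args.items())), outcome)
        self._entries.append(entry)
        return entry

    def already_tried(self, tool: str, args: dict) -> bool:
        """Answer "did I already try this exact call?" without rereading the prompt."""
        key = tuple(sorted(args.items()))
        return any(e.tool == tool and e.args == key for e in self._entries)

    def view(self, last_n: int = 10) -> Tuple[LogEntry, ...]:
        """Read-only snapshot: the agent sees recent entries but cannot edit history."""
        return tuple(self._entries[-last_n:])
```

The orchestrator calls `append` after every tool call. The agent only ever gets the result of `view` or `already_tried`, so it can read history but never rewrite it.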
If you’re building with Claude Opus 4.8 or GPT-5.5, both handle long contexts well (Claude’s 200k, GPT-5.5’s 128k). Don’t use that as an excuse to dump everything into the prompt. Use it to be explicit about structure: put the execution log in a separate section, make it easy for the agent to query (“what tools did I already try?”), and force it to be precise about its next step before executing.
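To make “explicit about structure” concrete, here is a rough sketch of a prompt builder that keeps the task, the log, and the required next step in separate sections. The section names and the `(step, tool, outcome)` row shape are my assumptions, not a convention from any particular framework.

```python
from typing import Iterable, Tuple

# (step number, tool name, outcome) triples; a hypothetical log shape for illustration
LogRow = Tuple[int, str, str]


def build_prompt(task: str, log_rows: Iterable[LogRow]) -> str:
    """Keep the task, the execution log, and the required output in separate sections."""
    log_lines = [f"{step}. {tool} -> {outcome}" for step, tool, outcome in log_rows]
    log_text = "\n".join(log_lines) if log_lines else "(no actions yet)"
    return (
        "## TASK\n"
        f"{task}\n\n"
        "## EXECUTION LOG (read-only, oldest first)\n"
        f"{log_text}\n\n"
        "## NEXT STEP\n"
        "State exactly one tool call and the outcome you expect before it runs.\n"
        "Do not repeat a call that already appears in the log."
    )
```

You decide which rows go in. The agent just reads them and commits to one precise next action.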
Test agents against their own failure modes
Write tests that specifically target the ways agents break in production:
- Infinite retry loops: does the agent exit after N failed attempts at the same tool?
- Context amnesia: does the agent remember it already tried something 8 steps ago?
- Hallucinated outputs: if a tool returns an error, does the agent treat it as success anyway?
- Tool confusion: does the agent call the right tool with the right arguments, or does it get creative with parameters?
Use something like EVA-Bench Data (released in early June 2026, with 121 tools and 213 scenarios) to stress test before production. But build your own tests that specifically target your domain. A generic benchmark won’t catch the ways your particular agent gets stuck.
Run these tests before every deployment. Not as nice-to-have documentation; as actual gates. An agent that passed your loop-exit test last month should not ship after a model update until it passes again.
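As an illustration of the first failure mode, here is a sketch of a retry guard plus the gate test that exercises it. `RetryGuard`, `run_with_guard`, and the stub tool are hypothetical names I made up for the example; adapt the idea to whatever loop your agent actually runs.

```python
from collections import Counter


class RetryLimitExceeded(RuntimeError):
    """Raised when the same tool keeps failing past the allowed budget."""


class RetryGuard:
    """Counts failures per tool since that tool's last success."""

    def __init__(self, max_failures_per_tool: int = 3) -> None:
        self.max_failures = max_failures_per_tool
        self.failures: Counter = Counter()

    def record(self, tool: str, succeeded: bool) -> None:
        if succeeded:
            self.failures[tool] = 0  # a success resets that tool's count
            return
        self.failures[tool] += 1
        if self.failures[tool] >= self.max_failures:
            raise RetryLimitExceeded(f"{tool} failed {self.failures[tool]} times")


def run_with_guard(call_tool, tool: str, guard: RetryGuard, max_steps: int = 50) -> int:
    """Call a tool until it succeeds; return the number of attempts made."""
    for attempt in range(1, max_steps + 1):
        ok = call_tool(tool)
        guard.record(tool, ok)  # raises once the failure budget is spent
        if ok:
            return attempt
    raise RetryLimitExceeded(f"{tool} did not succeed within {max_steps} steps")


def test_agent_exits_retry_loop() -> None:
    """Deployment gate: a tool that always fails must not loop forever."""
    calls = []

    def always_fails(tool: str) -> bool:
        calls.append(tool)
        return False

    try:
        run_with_guard(always_fails, "search", RetryGuard(max_failures_per_tool=3))
    except RetryLimitExceeded:
        pass
    else:
        raise AssertionError("loop did not exit")
    assert len(calls) == 3
```

Write one test like this per failure mode in the list above, and wire them into the same pipeline that blocks a deploy when unit tests fail.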
Autonomous doesn’t mean unsupervised
The best agents in production (Claude Code, GitHub Copilot’s multi-model approach with GPT-5.4, Devin) aren’t fully autonomous. They’re constrained: limited to specific domains (code), bounded in scope (modify this file, not your entire codebase), and monitored (humans see every change before it ships).
Real autonomous workflows work when the blast radius is small and reversible. Agents that can modify code work because git exists; you can revert. Agents that can query databases work if they’re read-only or hitting test databases. Agents that “solve any problem” don’t work, period.
When learning, build toward domain-specific constraints, not generality. An agent that’s really good at “test my Python code and iterate” is worth more than an agent that’s mediocre at “do whatever the user asks.”
The cost issue is real and nobody’s solving it yet
Uber capped Claude Code usage in June 2026 because the bill was unsustainable. That’s not unusual; it’s the default. Multi-step agents burn tokens fast. Each step is a full round-trip to the model. A 20-step task with GPT-5.4 or Claude Opus 4.8 is expensive.
The patterns that work: use cheaper models for verification steps, use smaller models for routine tasks (classification, parsing, simple retrieval), reserve expensive models for reasoning and planning. Some teams are experimenting with local models (Llama 4, Mistral Large 3, open-weight alternatives) for specific steps where latency doesn’t matter.
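A sketch of what that routing can look like in code. The tier names are deliberate placeholders, not real model identifiers, and the escalate-on-failure rule is my own assumption about a sensible default rather than something the teams above have published.

```python
from enum import Enum


class StepKind(Enum):
    PLAN = "plan"
    REASON = "reason"
    VERIFY = "verify"
    CLASSIFY = "classify"
    PARSE = "parse"


# Placeholder tier names, not real model identifiers; map them to whatever you deploy
MODEL_TIERS = {
    StepKind.PLAN: "frontier-model",
    StepKind.REASON: "frontier-model",
    StepKind.VERIFY: "cheap-model",
    StepKind.CLASSIFY: "small-model",
    StepKind.PARSE: "small-model",
}


def pick_model(kind: StepKind, previous_attempt_failed: bool = False) -> str:
    """Route routine steps to cheaper tiers; escalate when the cheap tier already failed."""
    if previous_attempt_failed:
        return "frontier-model"
    return MODEL_TIERS[kind]
```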
There’s no magic here. You’re optimizing for cost-per-successful-task, not cost-per-token. An agent that takes 3x the tokens but succeeds on the first try beats an agent that’s cheaper per attempt but fails and retries.
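The arithmetic is worth writing down. This sketch assumes failed attempts are retried from scratch and that attempts are independent, so the expected number of attempts is 1 / success rate. The costs and success rates are made-up numbers for illustration, not measurements.

```python
def expected_cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend to get one success when failed attempts are retried from scratch.

    With independent attempts, the expected number of attempts is 1 / success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


# Illustrative numbers only, not measurements
thorough = expected_cost_per_success(cost_per_attempt=3.0, success_rate=0.9)
cheap = expected_cost_per_success(cost_per_attempt=1.0, success_rate=0.25)
```

With these numbers the thorough agent costs about 3.33 units per success and the cheap one costs 4.0. The result flips once the cheap agent succeeds more than 30% of the time, so measure your own success rates before choosing.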
Bottom line:
Build agents on problems with verifiable success criteria (test pass/fail, not human judgment), structure memory as an explicit data layer the agent queries rather than buries in context, and inject verification checkpoints after every major step. The model matters less than the architecture; even frontier models fail at long chains without proper state management.
Question via Hacker News