Office Hours — What's the hardest part of building AI agents that actually work?

State management. Everyone focuses on prompt engineering or picking between Claude Opus 4.7 and GPT-5.5, but the real problem is keeping track of what the agent actually knows at any point in time.

You start with a task. The agent needs to make decisions, call tools, read responses, maybe branch into subtasks. Each step changes what should happen next. But LLMs are stateless. They don’t remember context across tool calls unless you explicitly thread it back. You end up building this janky state machine where you’re manually tracking “did we already fetch this data?”, “which step failed?”, “what should the next prompt even be?”.

Consider a concrete example: an agent that fixes a failing test suite. The flow looks clean on paper.

1. Clone the repo
2. Run tests, capture failures
3. For each failure, read the test file and implementation
4. Generate a fix
5. Test it
6. Commit if passing

In practice, you’re juggling: which repo state does the model see when it generates a fix (old or new)? Did a previous fix break something else, and does the agent know that yet? If a test passes locally but fails in CI, does the agent retry with different context? You need explicit checkpoints after each tool call. You need to version your context. You need rollback semantics. That’s the state machine work.

A minimal checkpoint structure looks something like this:

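from dataclasses import dataclass
from typing import Literal

@dataclass
class AgentCheckpoint:
    step: int                   # position of this checkpoint in the run
    tool_call: str              # which tool was invoked
    tool_output: str            # raw output returned by the tool
    repo_commit_sha: str        # snapshot the actual repo state
    context_tokens_used: int
    outcome: Literal["success", "failure", "pending"]
    retry_count: int
    known_failures: list[str]   # accumulated across all steps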

The repo_commit_sha field matters more than it looks. Without it, your agent generates fixes against a repo state that no longer exists once the previous patch has been applied. You get coherent-looking diffs that fail to apply cleanly. Pinning the commit hash before each tool call eliminates an entire class of “the agent is confused” bugs that are actually “the agent has wrong context” bugs.
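To make the rollback idea concrete, here is a minimal sketch of a checkpoint log that can answer two questions: "which commit do I roll back to?" and "has the repo moved since my last snapshot?". The names CheckpointLog, rollback_target, and is_stale are hypothetical, and the checkpoint is trimmed to the fields this example needs. How you actually read the current commit (for example via git) is left out.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

Outcome = Literal["success", "failure", "pending"]


@dataclass
class Checkpoint:
    step: int
    repo_commit_sha: str  # repo state pinned before the tool call
    outcome: Outcome


@dataclass
class CheckpointLog:
    """Hypothetical append-only log of checkpoints for one agent run."""
    entries: list[Checkpoint] = field(default_factory=list)

    def record(self, repo_commit_sha: str, outcome: Outcome) -> Checkpoint:
        # Steps are numbered from 1 in the order they are recorded.
        cp = Checkpoint(len(self.entries) + 1, repo_commit_sha, outcome)
        self.entries.append(cp)
        return cp

    def rollback_target(self) -> Optional[str]:
        """SHA of the most recent successful step, or None if there is none."""
        for cp in reversed(self.entries):
            if cp.outcome == "success":
                return cp.repo_commit_sha
        return None

    def is_stale(self, current_sha: str) -> bool:
        """True if the repo no longer matches the last pinned snapshot."""
        return bool(self.entries) and self.entries[-1].repo_commit_sha != current_sha


log = CheckpointLog()
log.record("aaa111", "success")
log.record("bbb222", "failure")
print(log.rollback_target())   # SHA to reset to after the failed step
print(log.is_stale("ccc333"))  # repo moved: rebuild context before prompting
```

If is_stale returns True, the safer move is to reload the files the model will see before generating the next fix, rather than reusing what is already in the prompt.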

Context Window and History Management

Here’s where it gets expensive. If you’re using Claude Opus 4.7 or GPT-5.5 with tool use, the model can see the full history of previous tool calls in a single context window. That’s a feature. But the window has a limit. Long-running agents hit that limit fast.

Consider a test-fixing agent running against a real codebase. First run: 8KB of test output, 12KB of implementation files, maybe 2-3 tool calls. Second fix attempt: you’re now showing the model the previous attempt’s failures, the new test output, the modified code. Third attempt: context balloons. By attempt five, you’ve burned through 100K tokens of context just showing the agent its own work history.

Your options are grim. You can prune history (lossy; the agent forgets why it abandoned an approach). You can summarize previous attempts (adds latency, requires another model call). You can split into parallel branches and track them separately (multiplies your token spend). Most production systems do something like: keep full context for the last 3 tool calls, summarize everything before that into a “previous attempts” section, and accept that you lose granular details.
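The "keep the last 3, summarize the rest" policy is simple enough to sketch. This is an illustration under assumptions, not how any particular framework does it: compact_history and outcome_only are hypothetical names, each step is treated as one string, and the summarizer is passed in as a plain function so you can swap in a model call.

```python
from typing import Callable


def compact_history(
    steps: list[str],
    summarize: Callable[[list[str]], str],
    keep_last: int = 3,
) -> list[str]:
    """Keep the newest `keep_last` steps verbatim; fold older ones into one summary."""
    cut = len(steps) - keep_last
    if keep_last < 0 or cut <= 0:
        return list(steps)  # nothing old enough to summarize
    older, recent = steps[:cut], steps[cut:]
    summary = "Previous attempts (summarized): " + summarize(older)
    return [summary] + recent


def outcome_only(steps: list[str]) -> str:
    """Placeholder summarizer: keeps only the first line of each step."""
    return "; ".join(s.split("\n", 1)[0] for s in steps)


history = [
    "step 1: patch A failed\n<long diff>",
    "step 2: patch B failed\n<long test log>",
    "step 3: read utils.py",
    "step 4: patch C applied",
    "step 5: tests rerun",
]
for line in compact_history(history, outcome_only):
    print(line)
```

The lossy part is entirely inside summarize. The first-line placeholder here drops the reasons an approach was abandoned, which is exactly the detail the paragraph above warns about losing.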

A concrete tradeoff: GitHub Copilot agents with GPT-5.4 get roughly 80K tokens of fresh context per step. A typical multi-file bug fix consumes 15K tokens of context just loading the relevant source. After 4-5 tool calls with full history, you’re forced to summarize or drop old steps. The cost difference is stark. Keeping full history: roughly $0.80 per fix attempt. Summarizing aggressively: roughly $0.35 per attempt, but the agent makes suboptimal decisions 15-20% more often because it’s working from summaries instead of ground truth.
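The "4-5 tool calls" figure falls out of simple budget arithmetic. The sketch below uses the 80K window and 15K source load quoted above; the 14K tokens added per tool call is an assumed number, picked only to show how the calculation works, and calls_before_compaction is a hypothetical helper.

```python
def calls_before_compaction(
    window_tokens: int, base_tokens: int, tokens_per_call: int
) -> int:
    """How many full tool-call records fit after the base context is loaded."""
    if tokens_per_call <= 0:
        raise ValueError("tokens_per_call must be positive")
    return max(0, (window_tokens - base_tokens) // tokens_per_call)


# 80K window, 15K of source, and an ASSUMED 14K of history per tool call.
print(calls_before_compaction(80_000, 15_000, 14_000))  # 65_000 // 14_000 = 4
```

Measure your own per-call growth before trusting a number like this; test output size varies a lot between codebases.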

One underrated option: use a smaller, cheaper model for the summarization step itself. Claude Haiku 4.5 or GPT-4.1 Nano can compress a 20K-token tool call history into a 2K-token summary for a fraction of the cost of calling your primary model. The summary quality is usually good enough, because you’re preserving outcomes and key facts, not reproducing reasoning.

Recovery and Failure Modes

Then there’s the recovery problem. A tool call fails midway through a longer sequence. Do you retry that step? Backtrack? Prune that branch and continue? The agent doesn’t know, so you have to encode that logic outside the model. That’s where agents fail in production. Not because the model is bad. Because your state machine is brittle.

The tradeoff: strict rollback (one failure kills the whole task) is safe but throws away the work and context already accumulated. Loose retry (keep going, hope the agent notices) burns tokens and accumulates errors. Most production systems end up somewhere in between: retry 3 times on transient failures, escalate to a human on persistent ones, and keep a separate known_failures log that you inject into every subsequent prompt so the agent doesn’t loop on the same mistake.
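Here is one way that middle-ground policy could look in code. It is a sketch under assumptions: TransientError, EscalateToHuman, run_step, and failure_preamble are hypothetical names, "3 times" is read as three attempts in total, and deciding which errors count as transient is up to your tool wrappers.

```python
from typing import Callable


class TransientError(Exception):
    """Failure worth retrying, e.g. a timeout (hypothetical marker type)."""


class EscalateToHuman(Exception):
    """Raised when the step should be handed to a person."""


def run_step(
    name: str,
    action: Callable[[], str],
    known_failures: list[str],
    max_attempts: int = 3,
) -> str:
    """Retry transient failures up to max_attempts; escalate everything else."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except TransientError as exc:
            known_failures.append(f"{name} (attempt {attempt}): {exc}")
        except Exception as exc:
            # Anything not marked transient is treated as persistent.
            known_failures.append(f"{name}: {exc}")
            raise EscalateToHuman(f"{name} failed persistently") from exc
    raise EscalateToHuman(f"{name} still failing after {max_attempts} attempts")


def failure_preamble(known_failures: list[str]) -> str:
    """Text to prepend to the next prompt so the agent sees past failures."""
    if not known_failures:
        return ""
    lines = "\n".join(f"- {item}" for item in known_failures)
    return "Known failures so far, do not repeat these:\n" + lines
```

The point of passing known_failures in from outside is that the log outlives any single step, so it can be injected into every later prompt regardless of how much history has been pruned.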

A specific edge case: flaky tests. Your agent fixes a test, it passes locally, the agent commits. Later, it flakes in CI. Now you have a false positive in your history. The agent sees “test passed” but the actual repo is broken. Explicit verification steps are non-negotiable. Run the same test multiple times before recording success. Wait for CI to finish before considering a fix confirmed. This adds time and cost, but shipping a broken fix because the agent trusted a single local pass is worse.
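The "run it multiple times" check is small enough to sketch. This assumes the test can be run as a shell command whose exit code is 0 on success; passes_consistently and command_passes are hypothetical helpers, and the default of 5 runs is an arbitrary choice, not a recommendation.

```python
import subprocess
from typing import Callable, Sequence


def passes_consistently(run_once: Callable[[], bool], runs: int = 5) -> bool:
    """True only if every one of `runs` executions passes."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # all() stops at the first failing run, so a flake ends the check early.
    return all(run_once() for _ in range(runs))


def command_passes(cmd: Sequence[str]) -> bool:
    """Run a command and treat exit code 0 as a pass."""
    return subprocess.run(cmd, capture_output=True).returncode == 0
```

Usage would look something like passes_consistently(lambda: command_passes(["pytest", "path/to/test_file.py"])), where the test path is a placeholder. This only covers the local half of the problem; waiting for CI still needs its own check.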

Knowing when to stop is its own problem. An agent can keep looping, keep calling tools, keep spending tokens. Token limits help but they’re crude. Real agents need explicit termination criteria defined upfront. “Fix this failing test suite” has a clear stopping point only if you define “all tests pass, CI is green” as the criterion before the run starts, and verify it independently. Open-ended tasks (“research this market”) need a different approach entirely: a fixed step budget, not a success condition.
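Both kinds of stopping rule can share one small structure. The sketch below is an assumption about how you might wire it, with hypothetical names StopPolicy and run_agent: is_done is an independent check you supply (for example "all tests pass and CI is green"), and leaving it as None gives you the pure step budget for open-ended tasks.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StopPolicy:
    max_steps: int                               # hard budget, always enforced
    is_done: Optional[Callable[[], bool]] = None  # independent success check

    def should_stop(self, steps_taken: int) -> Optional[str]:
        """Return a stop reason, or None to keep going."""
        if self.is_done is not None and self.is_done():
            return "success"
        if steps_taken >= self.max_steps:
            return "budget_exhausted"
        return None


def run_agent(step: Callable[[], None], policy: StopPolicy) -> tuple[str, int]:
    """Run `step` until the policy says stop; return (reason, steps taken)."""
    steps = 0
    while True:
        reason = policy.should_stop(steps)
        if reason is not None:
            return reason, steps
        step()
        steps += 1
```

Note that is_done is never asked of the agent itself. It is evaluated by the harness, which is what "verify it independently" means in practice.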

Bottom line: Before you pick a model, design your state machine. How are you tracking context between tool calls? How do you snapshot repo state? How do you recover from failures without losing relevant history? How do you verify success independently of what the agent believes? That’s where the actual work lives. The model choice matters at the margins. The architecture matters from the start.

Question via Hacker News