Office Hours — What's the hardest part of building AI agents that actually work?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
State management. Everyone focuses on prompt engineering or picking between Claude Opus 4.6 and GPT-5.4, but the real problem is keeping track of what the agent actually knows at any point in time.
You start with a task. The agent needs to make decisions, call tools, read responses, maybe branch into subtasks. Each step changes what should happen next. But LLMs are stateless. They don’t remember context across tool calls unless you explicitly thread it back. You end up building this janky state machine where you’re manually tracking “did we already fetch this data?”, “which step failed?”, “what should the next prompt even be?”.
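That threading can be made explicit instead of janky. Here's a minimal sketch of the bookkeeping, assuming a simple message-list context; every name here (`AgentState`, `record_tool_result`, and so on) is hypothetical, not any particular framework's API:

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """Tracks what the agent actually knows between stateless LLM calls."""
    messages: list = field(default_factory=list)   # full history, re-sent on every call
    fetched: dict = field(default_factory=dict)    # cache answering "did we already fetch this?"
    step: int = 0                                  # which step of the plan we're on

def record_tool_result(state: AgentState, tool: str, args: tuple, result: str) -> None:
    """Thread a tool result back into the context the model sees next."""
    state.fetched[(tool, args)] = result
    state.messages.append({"role": "tool", "name": tool, "content": result})
    state.step += 1

def already_fetched(state: AgentState, tool: str, args: tuple) -> bool:
    return (tool, args) in state.fetched
```

The point isn't the data structure; it's that every question you'd otherwise answer by squinting at logs ("did we fetch this?", "which step are we on?") has one authoritative place to live.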
Then there’s the recovery problem. A tool call fails midway through a longer sequence. Do you retry that step? Backtrack? Prune that branch and continue? The agent doesn’t know, so you have to encode that logic outside the model. That’s where agents fail in production. Not because the model is bad. Because your state machine is brittle.
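One way to encode that logic outside the model is a per-step recovery policy: retry transient failures with backoff, prune the branch on fatal ones, and in either case tell the model what happened so the next prompt reflects reality. A sketch, with hypothetical error classes standing in for whatever your tool layer actually raises:

```python
import time

class TransientError(Exception): ...   # e.g. timeout, rate limit — worth retrying
class FatalError(Exception): ...       # e.g. bad arguments, 404 — retrying won't help

def run_step(messages: list, tool_call, max_retries: int = 2):
    """Recovery policy lives outside the model: retry, backtrack, or prune."""
    for attempt in range(max_retries + 1):
        try:
            return tool_call()
        except TransientError:
            time.sleep(0.1 * 2 ** attempt)  # brief backoff, then retry this step
        except FatalError as e:
            messages.append(
                {"role": "system", "content": f"Branch failed ({e}); pruned."})
            return None  # prune this branch and let the plan continue
    messages.append(
        {"role": "system", "content": "Retries exhausted; step skipped."})
    return None
```

The crucial design choice is the last line of each path: failure is never silent. It either resolves or becomes context the model can reason about.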
The other hard part is knowing when to stop. An agent can keep looping, keep calling tools, keep spinning without getting any closer to an answer. You need explicit termination logic. Token limits help, but they’re crude. Real agents need to recognize when they already have enough information to answer the original question.
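Termination checks layer from crude to semantic. A sketch of that layering, where the `confidence` field is a hypothetical stand-in for however you score "do we have enough to answer?" (often a cheap model self-check):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LoopState:
    step: int = 0         # iterations so far
    tokens_used: int = 0  # running token spend

@dataclass
class Draft:
    text: str
    confidence: float  # stand-in for a model self-check of answer sufficiency

def should_stop(state: LoopState, draft: Optional[Draft],
                max_steps: int = 20, max_tokens: int = 50_000) -> bool:
    if state.step >= max_steps:           # crude: hard iteration cap
        return True
    if state.tokens_used >= max_tokens:   # crude: token budget
        return True
    # semantic: does the agent believe it can answer the original question?
    return draft is not None and draft.confidence >= 0.9
```

The crude caps are your safety net; the semantic check is what keeps a working agent from burning the whole budget anyway.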
Bottom line: Before you pick a model, design your state machine. How are you tracking context between tool calls? How do you recover from failures? That’s where the work actually is.
Question via Hacker News