Office Hours — How are you orchestrating multi-agent AI workflows in production?

The honest answer: most teams aren’t. They’re running single agents with deterministic fallbacks, and the ones attempting real multi-agent orchestration are discovering that the hard problems aren’t what they expected.

What Actually Works

When agents work in production, it’s because success is verifiable. Coding agents that run tests and open PRs succeed because a passing test is an objective signal. Claude Code with Opus 4.8 can autonomously fix failures across multiple files because the feedback loop is tight: compiler error, code change, rerun, done. Qwen3.7-Max demonstrated 35 hours of continuous operation on chip optimization in May, which is remarkable, but the task had a clear objective: minimize power consumption while maintaining performance targets. The model could evaluate its own progress.

The moment you move to ambiguous success criteria, agents drift. A customer support agent that’s supposed to “resolve customer frustration” has no ground truth. An agent tasked with “improve code quality” doesn’t know when to stop. These require human judgment in the loop, which defeats the autonomous part.

The Orchestration Problem Is Actually an Integration Problem

If you’re running multiple agents, your actual bottleneck isn’t the agents themselves. It’s describing which tools they can access and how to call them. Most teams I’ve spoken with spend roughly 60% of their time building glue code between agents and APIs.

Model Context Protocol (MCP) is starting to fix this. Instead of each agent having its own bespoke integrations with Slack, GitHub, Jira, and your internal APIs, you define tools once as MCP servers. Agents discover them, understand their semantics from documentation, and invoke them. This sounds boring until you’ve maintained five different tool definitions across three different agent frameworks.

A real example: a company running both Claude Opus 4.8 for code review and GPT-5.5 for customer triage was maintaining parallel tool definitions. One agent’s “get PR info” call had different parameters than the other’s. Migrating both to MCP servers meant writing the GitHub integration once, versioning it, and both agents immediately understood it the same way.
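To make the "define once, share everywhere" idea concrete, here is a minimal sketch of a shared tool registry. It is an in-process stand-in, not the MCP SDK: ToolSpec, ToolRegistry, and the get_pr_info handler are hypothetical names, and the handler returns placeholder data instead of calling GitHub.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolSpec:
    """One versioned tool definition shared by every agent."""
    name: str
    version: str
    description: str
    parameters: dict[str, str]  # parameter name -> human-readable type
    handler: Callable[..., Any]


class ToolRegistry:
    """In-process stand-in for a set of MCP servers (hypothetical, not the MCP SDK)."""

    def __init__(self) -> None:
        self._tools: dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        if spec.name in self._tools:
            raise ValueError(f"tool {spec.name!r} is already registered")
        self._tools[spec.name] = spec

    def describe(self) -> list[dict[str, Any]]:
        """What an agent sees at discovery time: names, docs, parameters."""
        return [
            {"name": t.name, "version": t.version,
             "description": t.description, "parameters": t.parameters}
            for t in self._tools.values()
        ]

    def call(self, name: str, **kwargs: Any) -> Any:
        spec = self._tools[name]
        # Reject parameters the shared definition does not declare.
        unknown = set(kwargs) - set(spec.parameters)
        if unknown:
            raise TypeError(f"unknown parameters for {name}: {sorted(unknown)}")
        return spec.handler(**kwargs)


def get_pr_info(repo: str, number: int) -> dict[str, Any]:
    # Placeholder: a real handler would call the GitHub API here.
    return {"repo": repo, "number": number, "state": "open"}


registry = ToolRegistry()
registry.register(ToolSpec(
    name="get_pr_info",
    version="1.0.0",
    description="Fetch metadata for one pull request.",
    parameters={"repo": "string, 'owner/name'", "number": "integer"},
    handler=get_pr_info,
))
```

Both the code review agent and the triage agent would read the same describe() output, so a call with a mismatched parameter name fails loudly instead of silently diverging between agents.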

Multi-Agent Patterns That Scale

The working patterns I’m seeing split into three types:

Sequential pipelines are the most reliable. Agent A generates SQL, Agent B validates it against a schema, Agent C executes it and formats results. Each step has a verifiable output. Running Claude Opus 4.8 at each stage handles complexity well. You can inject human review between steps if a result looks suspicious. This works in production because you can observe and debug at each junction.
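A minimal sketch of that three-step pipeline, using SQLite from the standard library. The generate_sql function is a hard-coded stand-in for Agent A (a real step would call a model), the orders table is demo data, and all function names are assumptions for illustration.

```python
import sqlite3


def generate_sql(question: str) -> str:
    # Stand-in for Agent A; a real step would call a model here.
    return ("SELECT region, SUM(amount) AS total "
            "FROM orders GROUP BY region ORDER BY region")


def validate_sql(conn: sqlite3.Connection, sql: str) -> str:
    """Agent B's check, done deterministically: read-only and valid for the schema."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    # EXPLAIN compiles the statement without running it, so unknown
    # tables or columns raise sqlite3.OperationalError here.
    conn.execute(f"EXPLAIN {sql}")
    return sql


def execute_and_format(conn: sqlite3.Connection, sql: str) -> list[str]:
    """Agent C's job: run the validated query and format the rows."""
    rows = conn.execute(sql).fetchall()
    return [f"{region}: {total:.2f}" for region, total in rows]


def run_pipeline(conn: sqlite3.Connection, question: str) -> list[str]:
    sql = generate_sql(question)       # step 1: generate
    sql = validate_sql(conn, sql)      # step 2: validate (human review could go here)
    return execute_and_format(conn, sql)  # step 3: execute and format


def make_demo_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [("EU", 10.0), ("EU", 5.5), ("US", 20.0)])
    return conn
```

Because each function returns a plain value, you can log or inspect the output at every junction, which is the property that makes this pattern debuggable.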

Hierarchical dispatch is emerging as a practical alternative to flat multi-agent systems. You have a router agent (often a smaller, faster model like Gemini 3.1 Flash-Lite) that receives a user request and routes it to specialist agents. A request like “analyze our Q2 revenue and suggest cost optimizations” routes to a data query agent and then a strategic analysis agent. The router itself is stateless and fast. Specialist agents are single-purpose, so you can tune them independently. This scales better than trying to make one mega-agent do everything.
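Here is a rough sketch of the dispatch shape. The route function uses keyword matching purely as a stand-in for the small router model, and the specialist functions return placeholder values; every name in it is hypothetical.

```python
from typing import Callable

Specialist = Callable[[str, dict], dict]


def route(request: str) -> list[str]:
    """Stand-in for the router model: returns an ordered plan of specialists."""
    text = request.lower()
    plan = []
    if any(word in text for word in ("revenue", "analyze", "q2")):
        plan.append("data_query")
    if any(word in text for word in ("suggest", "optimiz", "strategy")):
        plan.append("strategy")
    return plan or ["fallback"]


def data_query(request: str, context: dict) -> dict:
    # Placeholder: a real specialist would query the warehouse.
    return {**context, "data_loaded": True}


def strategy(request: str, context: dict) -> dict:
    # Uses the previous specialist's output when it is available.
    if context.get("data_loaded"):
        advice = "review top cost centers"
    else:
        advice = "need data first"
    return {**context, "recommendation": advice}


def fallback(request: str, context: dict) -> dict:
    return {**context, "needs_human": True}


SPECIALISTS: dict[str, Specialist] = {
    "data_query": data_query,
    "strategy": strategy,
    "fallback": fallback,
}


def dispatch(request: str) -> dict:
    """Run each specialist in the routed order, threading context through."""
    context: dict = {}
    for name in route(request):
        context = SPECIALISTS[name](request, context)
    return context
```

The router holds no state of its own; everything that carries over between specialists lives in the context dict, so each specialist can be tested and tuned in isolation.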

Event-driven orchestration works for long-running workflows. An agent completes a task (e.g., “scrape competitor pricing”), publishes an event, and another agent picks it up (e.g., “compare to our pricing and flag discrepancies”). You need something like Temporal or a message queue (Kafka, Redis Streams) to track state and retry. The advantage is decoupling: agents don’t need to know about each other. The disadvantage is operational complexity: you now run and monitor a broker or workflow engine that a sequential pipeline doesn’t need.
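The publish-and-consume shape can be sketched with an in-memory asyncio.Queue. This is only a stand-in for a durable broker: it has no persistence or retries, the event type name is an assumption, and the prices are made-up demo values.

```python
import asyncio


async def scraper(events: asyncio.Queue) -> None:
    # Stand-in for the scraping agent; the prices are made-up demo data.
    await events.put({"type": "pricing.scraped",
                      "prices": {"widget": 9.0, "gadget": 25.0}})
    await events.put(None)  # sentinel: no more events


async def comparer(events: asyncio.Queue, our_prices: dict, flagged: list) -> None:
    """Consumes events without knowing which agent produced them."""
    while True:
        event = await events.get()
        if event is None:
            break
        if event["type"] != "pricing.scraped":
            continue
        for item, theirs in event["prices"].items():
            ours = our_prices.get(item)
            # Flag items where a competitor undercuts our price.
            if ours is not None and ours > theirs:
                flagged.append(item)


async def run_demo() -> list:
    events: asyncio.Queue = asyncio.Queue()
    flagged: list = []
    our_prices = {"widget": 10.0, "gadget": 20.0}
    await asyncio.gather(scraper(events), comparer(events, our_prices, flagged))
    return flagged
```

Swapping the queue for Kafka or Redis Streams changes the transport but not the contract: the only thing the two agents share is the event schema.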

What Breaks

Tool proliferation is a real failure mode. I’ve watched teams give agents access to 50+ APIs, expecting the LLM to figure out which one matters. It doesn’t. Models consistently pick plausible-sounding tools that don’t work. The fix is counterintuitive: give agents fewer, more carefully designed tools. If you’re choosing between three customer lookup methods, collapse them into one with clear parameters. If an agent needs both the internal API and Stripe API, create a single “customer financial data” tool that wraps both.
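As a sketch of the consolidation idea, the single tool below wraps two backends behind one function with one required parameter. The lookup functions are placeholders that return canned data rather than calling a real internal API or Stripe, and the field names are assumptions.

```python
from dataclasses import dataclass


def lookup_internal(customer_id: str) -> dict:
    # Placeholder for the internal API call.
    return {"customer_id": customer_id, "plan": "pro"}


def lookup_billing(customer_id: str) -> dict:
    # Placeholder for a billing-provider call (for example, Stripe).
    return {"customer_id": customer_id, "balance_cents": 0}


@dataclass
class CustomerFinancialData:
    customer_id: str
    plan: str
    balance_cents: int


def customer_financial_data(customer_id: str) -> CustomerFinancialData:
    """The single tool exposed to the agent; it hides both backends."""
    if not customer_id:
        raise ValueError("customer_id is required")
    profile = lookup_internal(customer_id)
    billing = lookup_billing(customer_id)
    return CustomerFinancialData(
        customer_id=customer_id,
        plan=profile["plan"],
        balance_cents=billing["balance_cents"],
    )
```

The agent now has one decision to make (call the tool or not) instead of choosing between backends and stitching the results together itself.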

Partial failures cascade. If agent A succeeds but agent B fails halfway, you’re in an inconsistent state. An agent wrote a draft, another reviewed it, a third tried to publish but the API went down. Now you have a draft marked as reviewed but not published. Handle this with idempotent operations and explicit state machines, not hopeful retry logic.
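One way to sketch the draft, review, publish example as an explicit state machine. The class and function names are hypothetical, and push stands for whatever external publishing call you use; the sketch assumes that call is itself safe to repeat for the same document id.

```python
from enum import Enum
from typing import Callable


class DocState(Enum):
    DRAFTED = "drafted"
    REVIEWED = "reviewed"
    PUBLISHED = "published"


# Every legal transition is listed; anything else is rejected.
ALLOWED = {
    DocState.DRAFTED: {DocState.REVIEWED},
    DocState.REVIEWED: {DocState.PUBLISHED},
    DocState.PUBLISHED: set(),
}


class Document:
    def __init__(self, doc_id: str) -> None:
        self.doc_id = doc_id
        self.state = DocState.DRAFTED

    def advance(self, target: DocState) -> bool:
        """Move to target. Returns False if already there, so retries are safe."""
        if self.state == target:
            return False
        if target not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.value} -> {target.value}")
        self.state = target
        return True


def publish(doc: Document, push: Callable[[str], None]) -> None:
    """Call the external API first; record PUBLISHED only after it succeeds."""
    if doc.state == DocState.PUBLISHED:
        return  # retrying after success is a no-op
    if doc.state != DocState.REVIEWED:
        raise ValueError("document must be reviewed before publishing")
    push(doc.doc_id)  # may raise; the state then stays REVIEWED
    doc.advance(DocState.PUBLISHED)
```

If the publishing API goes down, the document stays in REVIEWED rather than in an undefined in-between state, and calling publish again later is safe.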

Long context isn’t enough to compensate for unclear task boundaries. Agents with 100K-token context windows still fail at aggregation tasks if the task itself is ambiguous. A recent Towards Data Science analysis found that simply throwing bigger context windows at RAG doesn’t improve accuracy for queries requiring aggregation: the agent doesn’t know which results to trust. You need deterministic SQL for math, not LLM reasoning over data.

The Architecture That Scales

Here’s a template that’s working for teams running real production workflows:

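# Template only: Router, Agent, and the model constants are placeholders
# for whatever classes and identifiers your agent framework provides.
class WorkflowOrchestrator:
    def __init__(self, tools_registry, state_store):
        self.tools = tools_registry  # MCP servers, versioned
        self.state = state_store     # Temporal or Postgres
        self.router = Router(model=Gemini3_1Flash_Lite)  # Fast dispatcher
        self.specialists = {
            "data": Agent(model=Claude_Opus48),
            "code": Agent(model=GPT_5_3_Codex),
            "review": Agent(model=Claude_Opus48),
        }

    async def execute(self, request):
        # Ask the router which specialist should handle this request.
        task = self.router.dispatch(request)
        result = await self.specialists[task.agent].run(
            request=request,
            # Expose only the tools this task needs, not the whole registry.
            tools=self.tools.filter(task.required_capabilities),
            # Resume from the last saved state, if any.
            checkpoint=self.state.load(request.id),
        )
        # Persist the result so a later retry can pick up from here.
        self.state.save(request.id, result)
        return result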

The key moves: the router is cheap and stateless, specialists are single-purpose, tools are registered centrally, and state is persistent so you can resume if an agent times out or an API fails.

Cost and Latency Reality

Multi-agent workflows cost more. Each agent call is an API request, so a three-step workflow costs roughly 3x a single call, and more if each step re-sends the accumulated context. But when the alternative is manual work or a brittle single-agent system that fails silently, it’s worth it. Use cheaper models (GPT-4.1 Nano, Gemini 3.1 Flash-Lite) for routing and simple steps. Reserve Claude Opus 4.8 or GPT-5.5 for the expensive reasoning steps.

Latency is the other constraint. If agents run sequentially, a five-step workflow with 2 seconds of latency per step takes 10 seconds minimum. Run them in parallel where possible (if agent B doesn’t depend on agent A’s output, spawn both). Temporal workflows can run independent activities concurrently, which handles this well.
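The latency difference is easy to see in a small asyncio sketch. The step function just sleeps as a stand-in for a model or API call, and all names here are illustrative.

```python
import asyncio
import time


async def step(name: str, seconds: float) -> str:
    await asyncio.sleep(seconds)  # stand-in for a model or API call
    return name


async def run_sequential(delay: float) -> list:
    # Each step waits for the previous one: total time is about 2 * delay.
    return [await step("a", delay), await step("b", delay)]


async def run_parallel(delay: float) -> list:
    # Independent steps run concurrently: total time is about 1 * delay.
    return list(await asyncio.gather(step("a", delay), step("b", delay)))


def timed(coro_fn, delay: float):
    """Run one of the coroutines above and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(delay))
    return result, time.perf_counter() - start
```

Both versions return the same results in the same order; only the wall-clock time changes, which is why parallelizing independent steps is usually a safe optimization.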

What’s Actually Missing

There’s no standard for agent-to-agent communication semantics. Does Agent A return structured data or free text? What happens if it fails? You’re writing custom serialization and error handling for every workflow. Anthropic’s recent self-harness work (where Claude generates its own orchestration logic) is interesting partly because it suggests models could write that glue code dynamically, but it’s still early.
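Until a standard exists, one pragmatic option is a small result envelope that every agent returns. The sketch below is one possible shape, not an established convention: the AgentResult name and its fields are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class AgentResult:
    """Structured envelope passed between agents instead of free text."""
    agent: str
    status: str  # "ok" or "error"
    payload: dict = field(default_factory=dict)
    error: Optional[str] = None

    def __post_init__(self) -> None:
        if self.status not in ("ok", "error"):
            raise ValueError(f"unknown status {self.status!r}")
        # A failure must say what went wrong so the next step can decide.
        if self.status == "error" and not self.error:
            raise ValueError("error results must include an error message")

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "AgentResult":
        return cls(**json.loads(raw))
```

A downstream agent can then branch on status and read payload, which answers both questions above (structured or free text, and what happens on failure) in one place.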

Bottom line: Start with sequential pipelines where each step has verifiable output. Add a router only when you have genuinely different task types. Use MCP to stop rewriting tool definitions. Don’t attempt flat multi-agent coordination until you’ve proven your single-agent pipeline works reliably.

Question via Hacker News