AI Agent Orchestration Patterns · 2026-04-21 · Deep Dives · tags: deep-dive, reference, architecture

AI Agent Orchestration Patterns

The post you bookmark. One topic, covered end to end.

Complete guide to building reliable AI agent orchestration: single-agent loops, multi-agent delegation, supervisor hierarchies, state management, error recovery, and production patterns with code.


Every major AI provider now ships some form of agent framework. Most production agent deployments still fail in predictable ways — not because the LLM is wrong, but because the orchestration around it handles state, errors, and control flow poorly. The gap between a demo agent and a production agent is almost entirely an engineering problem.

This reference covers the three dominant orchestration patterns (single-agent loops, multi-agent delegation, and supervisor architectures), how state flows through each, where they break, and when to pick one over another.


The Agent Loop Primitive

Every agent pattern reduces to the same primitive: a loop that calls an LLM, checks if the LLM wants to use a tool, executes that tool, feeds the result back, and repeats until the LLM produces a final response or a termination condition fires.

[Diagram] User Input → LLM Call (model + context) → Tool Call? — yes: Tool Execution (API, DB, code), result fed back into the next LLM call; no: Final Response.

The core agent loop: call the LLM, check for tool use, execute, repeat.

This loop is deceptively simple. The complexity lives in five places: (1) how context accumulates across iterations, (2) what happens when a tool call fails, (3) how many iterations to allow before force-stopping, (4) how to manage token budget as the conversation grows, and (5) how to persist state across process boundaries.

A minimal implementation in Python:

import anthropic

client = anthropic.Anthropic()

def agent_loop(messages: list, tools: list, max_turns: int = 10) -> str:
    for _ in range(max_turns):
        response = client.messages.create(
            model="claude-sonnet-4-5-20250514",
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )

        if response.stop_reason == "end_turn":
            return response.content[0].text

        # Collect tool uses from response
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result,
                })

        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

    return "Max turns reached without final response."

Every framework — LangGraph, CrewAI, OpenAI Agents SDK, Autogen — wraps this loop with different abstractions. The differences that matter are in state management, error handling, and how they compose multiple loops together.

Pattern 1: Single-Agent with Tool Use

The simplest production pattern: one LLM, one system prompt, multiple tools. The LLM decides which tools to call and in what order. No routing logic, no delegation — just a capable model with access to functions.

[Diagram] User → Single Agent (Claude Sonnet 4.6) → tool calls → Tool Registry (search, DB, API, calc) → results accumulate in the Context Window (conversation + results) → final answer.

Single-agent pattern: one model handles all reasoning and tool selection.

When This Works

Single-agent works well when:

  • The task requires fewer than ~8 tool calls in sequence
  • All tools share a common domain (e.g., all operate on the same database)
  • The total context fits comfortably in the model’s window (under 50K tokens of accumulated tool results)
  • Latency budget allows sequential tool execution (each call adds 200-800ms for the LLM round-trip plus tool execution time)

When This Breaks

The failure modes are consistent across implementations:

Context window saturation. Each tool call adds its input and output to the message history. A database query returning 2,000 rows of JSON can consume 15K tokens in one step. After 5-6 such calls, the agent is spending most of its context on tool results and losing track of the original goal.

Tool selection confusion. With more than 10-15 tools, models start making incorrect tool selections — calling a search tool when they should call a database tool, or hallucinating tool parameters. Claude Opus 4.6 handles ~25 tools reasonably; Claude Haiku 4.5 degrades noticeably above 10.

Unbounded loops. Without a hard iteration cap, agents can enter cycles — calling the same tool repeatedly with slightly different parameters, or alternating between two tools without converging. Always set max_turns.

Tool Result Summarization

The most effective mitigation for context saturation is summarizing tool results before appending them to the message history:

def summarize_if_large(tool_name: str, result: str, threshold: int = 3000) -> str:
    if len(result) < threshold:
        return result

    # Use a fast, cheap model for summarization
    summary = client.messages.create(
        model="claude-haiku-4-5-20250514",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Summarize this {tool_name} result, preserving key data:\n{result}"
        }],
    )
    return f"[Summarized] {summary.content[0].text}"

This adds one cheap LLM call per large tool result but can reduce context consumption by 80-90% for data-heavy tools.

Pattern 2: Multi-Agent Delegation

Multiple specialized agents, each with their own system prompt and tool set, with one agent able to hand off tasks to another. The key distinction from the supervisor pattern: there is no central coordinator. Agents delegate laterally.

[Diagram] User → Research Agent (search, scrape) → delegate → Analysis Agent (DB, calc, charts) → delegate → Writing Agent (formatting, citations) → formatted result.

Multi-agent delegation: agents hand off to peers without a central coordinator.

Implementation

The standard implementation gives each agent a transfer_to_agent tool. When agent A calls transfer_to_agent("analysis_agent", context), the orchestrator pauses agent A’s loop and starts agent B’s loop with the provided context.

from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    model: str
    system_prompt: str
    tools: list
    max_turns: int = 10

@dataclass
class Orchestrator:
    agents: dict[str, Agent] = field(default_factory=dict)
    conversation_history: list = field(default_factory=list)

    def run(self, agent_name: str, user_message: str) -> str:
        agent = self.agents[agent_name]
        messages = [{"role": "user", "content": user_message}]

        for _ in range(agent.max_turns):
            response = call_llm(agent.model, agent.system_prompt,
                              agent.tools + [self.transfer_tool()],
                              messages)

            if response.stop_reason == "end_turn":
                return extract_text(response)

            for tool_call in extract_tool_calls(response):
                if tool_call.name == "transfer_to_agent":
                    # Delegate to another agent with context
                    target = tool_call.input["agent_name"]
                    context = tool_call.input["context"]
                    return self.run(target, context)  # recursive

                result = execute_tool(tool_call.name, tool_call.input)
                messages = append_tool_result(messages, response, tool_call, result)

        return "Max turns reached."

    def transfer_tool(self):
        return {
            "name": "transfer_to_agent",
            "description": "Hand off the current task to a specialized agent.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "agent_name": {"type": "string", "enum": list(self.agents.keys())},
                    "context": {"type": "string", "description": "What to tell the target agent"},
                },
                "required": ["agent_name", "context"],
            },
        }

The Context Handoff Problem

The critical design decision in multi-agent delegation is what context transfers between agents. Three approaches:

Full history transfer. Pass the entire conversation history to the receiving agent. Preserves all information but consumes tokens fast. With three agents in sequence, the third agent’s context includes all tool results from agents one and two.

Summary transfer. The delegating agent writes a summary of its findings and passes only that. Loses detail but stays within token budgets. Best when agents operate on different data domains.

Structured handoff. Define a typed schema for inter-agent communication. Agent A produces a JSON object with specific fields; agent B’s system prompt expects that structure.

# Structured handoff schema
handoff_schema = {
    "findings": "list of key data points discovered",
    "original_query": "the user's original request",
    "remaining_tasks": "what still needs to be done",
    "constraints": "any constraints or preferences identified",
}

Structured handoffs are the most reliable in practice. They prevent context bloat and make the inter-agent contract explicit and testable.
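The "testable" part can be made concrete with a pre-dispatch check. A minimal sketch, assuming the field names from the handoff_schema above (validate_handoff is an illustrative helper, not part of any framework):

```python
# Fields every handoff must carry, per the handoff_schema above
REQUIRED_FIELDS = {"findings", "original_query", "remaining_tasks", "constraints"}

def validate_handoff(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the handoff is valid."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS - payload.keys()]
    # Guard against over-aggressive summarization: the original query must survive
    if not payload.get("original_query"):
        problems.append("original_query is empty")
    return problems

handoff = {
    "findings": ["Q3 revenue grew 12%"],
    "original_query": "Summarize Q3 performance",
    "remaining_tasks": "chart the trend",
    "constraints": "use internal data only",
}
assert validate_handoff(handoff) == []
assert "missing field: constraints" in validate_handoff({"original_query": "x"})
```

Running this check before calling the target agent turns a silent context-loss bug into an explicit error at the handoff boundary.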

Failure Modes

Delegation loops. Agent A delegates to agent B, which delegates back to agent A. Solve with a delegation depth counter or a “no-backsies” rule (an agent cannot delegate to the agent that delegated to it).

Context loss. The delegating agent summarizes too aggressively, losing a critical detail. The receiving agent then makes decisions based on incomplete information. Mitigation: include the original user query in every handoff, not just the summary.

Responsibility diffusion. With no central coordinator, no agent takes ownership of the final output quality. If agent C produces a bad result, it’s unclear whether agent A should have provided better context, agent B should have caught an error, or agent C’s tools were insufficient.

Pattern 3: Supervisor Architecture

A dedicated supervisor agent coordinates multiple worker agents. The supervisor receives the user’s request, decomposes it into subtasks, dispatches those subtasks to specialized workers, collects results, and synthesizes a final response.

[Diagram] User → Supervisor Agent (GPT-5.4 / Opus 4.6) → subtasks → Worker Pool (research, code, data) → partial results → Result Aggregation (merge, dedupe, format) → synthesized response.

Supervisor pattern: central coordinator dispatches to specialized workers and synthesizes results.

Why Use a Supervisor

The supervisor pattern solves the three main problems with peer delegation:

  1. Task decomposition is centralized. One agent with a high-level view breaks the problem down, rather than each agent deciding ad hoc what to delegate.
  2. Result synthesis is explicit. The supervisor reviews all partial results and produces a coherent final output.
  3. Error handling has a single owner. If a worker fails, the supervisor decides whether to retry, use a different worker, or degrade gracefully.

Implementation with Parallel Dispatch

The supervisor’s main advantage over sequential delegation is parallel execution. If two subtasks are independent, dispatch them simultaneously:

import asyncio
from dataclasses import dataclass

@dataclass
class SubTask:
    id: str
    agent_name: str
    instruction: str
    depends_on: list[str] | None = None  # IDs of tasks this depends on

async def supervisor_loop(
    user_query: str,
    supervisor_model: str,
    workers: dict[str, Agent],
) -> str:
    # Step 1: Decompose
    plan = await decompose_task(supervisor_model, user_query, list(workers.keys()))

    # Step 2: Execute tasks respecting dependencies
    results = {}
    for batch in topological_batches(plan.subtasks):
        # Run independent tasks in parallel
        batch_results = await asyncio.gather(*[
            run_worker(workers[task.agent_name], task.instruction, results)
            for task in batch
        ])
        for task, result in zip(batch, batch_results):
            results[task.id] = result

    # Step 3: Synthesize
    return await synthesize(supervisor_model, user_query, results)

def topological_batches(subtasks: list[SubTask]) -> list[list[SubTask]]:
    """Group subtasks into batches that can run in parallel."""
    # Tasks with no dependencies go in batch 0
    # Tasks depending only on batch 0 go in batch 1, etc.
    resolved = set()
    batches = []
    remaining = list(subtasks)

    while remaining:
        batch = [t for t in remaining
                 if not t.depends_on or all(d in resolved for d in t.depends_on)]
        if not batch:
            raise ValueError("Circular dependency in task plan")
        batches.append(batch)
        resolved.update(t.id for t in batch)
        remaining = [t for t in remaining if t not in batch]

    return batches

The Decomposition Prompt

The supervisor’s decomposition quality determines the entire system’s performance. A good decomposition prompt:

SUPERVISOR_SYSTEM = """You are a task coordinator. Given a user request and a list
of available specialist agents, produce a plan.

Available agents:
{agent_descriptions}

Output a JSON plan:
{{
  "subtasks": [
    {{
      "id": "t1",
      "agent": "research_agent",
      "instruction": "specific instruction for this agent",
      "depends_on": []
    }},
    {{
      "id": "t2",
      "agent": "analysis_agent",
      "instruction": "analyze the data from t1",
      "depends_on": ["t1"]
    }}
  ]
}}

Rules:
- Each subtask must be self-contained: include all context the agent needs.
- Maximize parallelism: only add depends_on when truly necessary.
- Use at most 5 subtasks. Prefer fewer.
- If the task is simple enough for one agent, use one subtask.
"""

The “if the task is simple enough for one agent, use one subtask” rule is important. Over-decomposition — splitting a simple request into four subtasks — adds latency and increases the chance of synthesis errors.
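The plan the supervisor emits is still untrusted model output; it should be parsed and checked against the worker roster before anything is dispatched. A hedged sketch (parse_plan is an illustrative helper, returning plain dicts in the shape of the prompt's JSON):

```python
import json

def parse_plan(raw: str, known_agents: set[str], max_subtasks: int = 5) -> list[dict]:
    """Validate the supervisor's JSON plan before dispatching any workers."""
    plan = json.loads(raw)
    subtasks = plan["subtasks"]
    if len(subtasks) > max_subtasks:
        raise ValueError(f"Plan has {len(subtasks)} subtasks; limit is {max_subtasks}")
    ids = {t["id"] for t in subtasks}
    for t in subtasks:
        if t["agent"] not in known_agents:
            raise ValueError(f"Unknown agent: {t['agent']}")
        # Every dependency must point at a task in this plan
        for dep in t.get("depends_on", []):
            if dep not in ids:
                raise ValueError(f"Subtask {t['id']} depends on unknown task {dep}")
    return subtasks

raw = '{"subtasks": [{"id": "t1", "agent": "research_agent", "instruction": "find data", "depends_on": []}]}'
assert parse_plan(raw, {"research_agent", "analysis_agent"})[0]["id"] == "t1"
```

A ValueError here is a reasoning failure from the supervisor; the usual recovery is one re-prompt with the error message, then escalation.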

Supervisor Model Selection

The supervisor doesn’t need to be the most capable model in the system. It needs to be good at structured planning and synthesis, but it doesn’t need domain expertise — that’s what the workers provide.

A common and cost-effective pattern:

| Role | Model | Rationale |
| --- | --- | --- |
| Supervisor | Claude Sonnet 4.6 | Good structured output, fast, reasonable cost |
| Research worker | Claude Sonnet 4.6 | Needs tool use + reasoning |
| Code worker | Claude Opus 4.6 | Complex code generation benefits from top-tier |
| Data worker | GPT-4.1 Nano | High throughput for structured data extraction |
| Synthesis pass | Claude Sonnet 4.6 | Same model as supervisor for consistency |

Using a cheaper model for the supervisor than for specialized workers is counterintuitive but often correct. The supervisor makes routing decisions; the workers do the hard cognitive work.

State Management

State is the hardest part of agent orchestration. A single-agent loop can keep state in the message array. Multi-agent and supervisor patterns need something more structured.

State Categories

[Diagram] Four state categories — Conversation State (messages, tool results), Task State (plan, progress, subtask results), World State (DB records, file system, external API state), Meta State (token count, cost, latency, iteration count) — connected by: informs planning, drives side effects, grounds responses, enforces budgets.

Four categories of state in agent systems, and how they interact.

Conversation state is the message history. In OpenAI and Anthropic’s APIs, this is the messages array. It grows monotonically within a single agent loop.

Task state tracks the plan, which subtasks are complete, partial results, and what remains. This is the supervisor’s working memory.

World state is external: database rows modified, files created, API calls made. These are side effects that can’t be rolled back by simply rewinding the conversation.

Meta state tracks operational concerns: total tokens consumed, wall-clock time elapsed, cost accrued, number of LLM calls made. Critical for enforcing budgets and SLAs.

Persistence Strategies

For agents that complete in under 30 seconds (most single-agent tool-use patterns), in-memory state is fine. For longer workflows:

Checkpointing to a database. After each agent turn, serialize the full state (messages, task plan, partial results) to a row in PostgreSQL or a document in Redis. If the process crashes, resume from the last checkpoint.

import json
import redis

r = redis.Redis()

def checkpoint(workflow_id: str, state: dict):
    r.set(f"agent:state:{workflow_id}", json.dumps(state))
    r.expire(f"agent:state:{workflow_id}", 86400)  # 24h TTL

def restore(workflow_id: str) -> dict | None:
    data = r.get(f"agent:state:{workflow_id}")
    return json.loads(data) if data else None

Durable execution frameworks. Temporal, Inngest, and Trigger.dev provide workflow engines that automatically checkpoint state and resume after failures. Temporal’s model — deterministic workflow code that calls activities — maps naturally to the supervisor pattern: the workflow is the supervisor, activities are worker agents.

# Pseudo-code for a Temporal workflow acting as supervisor
@workflow.defn
class ResearchWorkflow:
    @workflow.run
    async def run(self, query: str) -> str:
        plan = await workflow.execute_activity(
            decompose_task, query, start_to_close_timeout=timedelta(seconds=30)
        )

        results = {}
        for batch in topological_batches(plan.subtasks):
            batch_results = await asyncio.gather(*[
                workflow.execute_activity(
                    run_worker_agent,
                    task,
                    start_to_close_timeout=timedelta(minutes=5),
                )
                for task in batch
            ])
            for task, result in zip(batch, batch_results):
                results[task.id] = result

        return await workflow.execute_activity(
            synthesize_results, query, results,
            start_to_close_timeout=timedelta(seconds=30)
        )

The Temporal approach gives automatic retries, timeouts per activity, and full audit logs. The tradeoff is infrastructure complexity — running a Temporal cluster is nontrivial, though Temporal Cloud handles this.

Context Window as State Budget

A practical way to think about state management: the context window is a fixed token budget, and every piece of state competes for space within it.

| State type | Typical token cost | Compression strategy |
| --- | --- | --- |
| System prompt | 500-2,000 | Fixed; keep tight |
| User messages | 100-500 per turn | Rarely compressible |
| Tool call + result | 200-5,000 per call | Summarize large results |
| Accumulated history | Grows ~1K per turn | Sliding window or summarize |
| Task plan (supervisor) | 300-800 | Structured JSON stays small |

For a model with a 200K token context window, this seems luxurious. In practice, agent performance degrades well before the window is full. Models attend less effectively to information in the middle of long contexts. Keeping total context under 50K tokens per agent, even with 200K available, produces more reliable tool selection and reasoning.
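A minimal sliding-window sketch of that 50K budget: keep the head of the conversation (where the original goal lives) and drop the oldest tool exchanges first. The character-based token estimate is a rough stand-in; production code would use the provider's token counter.

```python
def estimate_tokens(message: dict) -> int:
    # Rough heuristic: ~4 characters per token
    return len(str(message.get("content", ""))) // 4

def trim_to_budget(messages: list[dict], budget: int = 50_000,
                   keep_head: int = 2) -> list[dict]:
    """Drop the oldest messages (after the first keep_head) until under budget."""
    head, tail = messages[:keep_head], messages[keep_head:]
    while tail and sum(estimate_tokens(m) for m in head + tail) > budget:
        tail.pop(0)  # oldest tool exchange goes first
    return head + tail

msgs = [{"role": "user", "content": "goal"}] + [
    {"role": "user", "content": "x" * 40_000} for _ in range(10)
]
trimmed = trim_to_budget(msgs, budget=50_000, keep_head=1)
assert sum(estimate_tokens(m) for m in trimmed) <= 50_000
assert trimmed[0]["content"] == "goal"  # the original goal always survives
```

Summarizing dropped messages into one synthetic message, instead of discarding them, is the natural next step and composes with the tool-result summarization shown earlier.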

Error Recovery and Retry Semantics

Agent failures fall into three categories, each requiring different recovery strategies.

[Diagram] Transient Failures (rate limits, timeouts, network errors) → retry automatically with exponential backoff; Tool Failures (invalid params, API errors, empty results) → feed the error back to the LLM as a tool result; Reasoning Failures (wrong tool, loops, hallucinated params) → intervene programmatically: cap iterations, rewrite the prompt, or escalate.

Three failure categories and their recovery strategies.

Transient Failures

Rate limits, network timeouts, 500 errors from tool APIs. Handle with standard retry logic — exponential backoff with jitter. Don’t feed these back to the LLM; the model can’t do anything useful with “the API returned a 429.”

import asyncio
import random

async def call_with_retry(fn, *args, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return await fn(*args)
        except (RateLimitError, TimeoutError, ConnectionError):
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter; asyncio.sleep keeps the event loop unblocked
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            await asyncio.sleep(delay)

Tool Failures

The tool executed but returned an error — invalid parameters, a database query that returned zero rows, an API that returned a 400 with a descriptive error message. These should be fed back to the LLM as the tool result. Models are generally good at interpreting error messages and adjusting their approach:

def execute_tool_safe(name: str, params: dict) -> str:
    try:
        result = execute_tool(name, params)
        return json.dumps({"status": "success", "data": result})
    except ToolError as e:
        return json.dumps({"status": "error", "error": str(e)})

The model sees {"status": "error", "error": "No rows matched filter: date > 2026-04-30"} and can adjust the date parameter. This works reliably for 1-2 recovery attempts. If the same tool fails 3+ times in a row, something is structurally wrong and programmatic intervention is better than letting the model keep trying.
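A small circuit breaker makes the "3+ failures in a row" rule concrete, assuming the agent loop records each tool outcome (FailureTracker is an illustrative name, not a framework class):

```python
from collections import defaultdict

class FailureTracker:
    """Stop feeding errors back to the model after repeated failures of one tool."""

    def __init__(self, max_consecutive: int = 3):
        self.max_consecutive = max_consecutive
        self.streaks: dict[str, int] = defaultdict(int)

    def record(self, tool_name: str, succeeded: bool) -> bool:
        """Record an outcome; return True if the agent should be interrupted."""
        if succeeded:
            self.streaks[tool_name] = 0  # any success resets the streak
            return False
        self.streaks[tool_name] += 1
        return self.streaks[tool_name] >= self.max_consecutive

tracker = FailureTracker()
assert tracker.record("db_query", succeeded=False) is False
assert tracker.record("db_query", succeeded=False) is False
assert tracker.record("db_query", succeeded=False) is True  # third straight failure
```

On a True return, the loop should stop retrying and either escalate or inject the fallback prompt described below.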

Reasoning Failures

The hardest category. The model calls the wrong tool, enters a loop, or hallucinates tool parameters that are syntactically valid but semantically wrong (e.g., querying a user table with a product ID).

Mitigations:

  1. Iteration cap. Always set max_turns. 10-15 is reasonable for most tasks. Beyond that, the agent is probably stuck.
  2. Loop detection. Track the sequence of tool calls. If the same tool is called with identical or near-identical parameters twice in a row, inject a system message: “You’ve called this tool with similar parameters already. Try a different approach or provide a final answer.”
  3. Budget enforcement. Track cumulative tokens. If the agent has consumed 80% of its token budget without producing a final answer, force termination.
  4. Fallback prompting. After N failed iterations, inject a prompt that says: “Summarize what you’ve found so far and provide the best answer you can with available information.”

A loop detector that compares recent tool-call signatures:

import json

def detect_loop(tool_calls: list[dict], window: int = 3) -> bool:
    if len(tool_calls) < window:
        return False
    recent = tool_calls[-window:]
    # Check if same tool called with same params
    signatures = [(c["name"], json.dumps(c["input"], sort_keys=True)) for c in recent]
    return len(set(signatures)) == 1

Control Flow: Deterministic vs LLM-Driven Routing

A spectrum exists between fully deterministic pipelines (each step is hardcoded) and fully LLM-driven routing (the model decides everything). Most production systems land somewhere in the middle.

[Diagram] Deterministic (hardcoded DAG) — add flexibility → Hybrid (LLM picks from constrained options) — add autonomy → Fully Autonomous (LLM decides everything).

The control flow spectrum. Most production systems are hybrid.

Deterministic Pipelines

# Step 1 always runs, Step 2 always runs, etc.
async def pipeline(query: str) -> str:
    search_results = await research_agent.run(f"Search for: {query}")
    analysis = await analysis_agent.run(f"Analyze: {search_results}")
    report = await writing_agent.run(f"Write report: {analysis}")
    return report

Predictable latency, predictable cost, easy to test. But inflexible — if the research step returns nothing useful, the analysis step runs anyway.

Hybrid Routing

The supervisor decides which workers to invoke, but from a constrained set of options. The routing is LLM-driven but bounded:

ROUTING_PROMPT = """Based on the user's query, select which agents to invoke.
You MUST respond with valid JSON.

Available agents:
- research: for finding information from external sources
- database: for querying internal data
- calculator: for numerical analysis

Respond with:
{{"agents": ["research", "database"], "parallel": true}}
"""

The LLM picks from a fixed menu. It can’t invent new agents or execute arbitrary logic. This gives flexibility (skip the database step if the query is purely about external information) while maintaining predictability.
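Because the menu is fixed, the router's output can be validated mechanically and defaulted on failure; a sketch (parse_routing and the fallback choice are illustrative):

```python
import json

VALID_AGENTS = {"research", "database", "calculator"}

def parse_routing(raw: str) -> dict:
    """Parse the router's JSON; fall back to a safe default if it is malformed."""
    try:
        decision = json.loads(raw)
        agents = [a for a in decision["agents"] if a in VALID_AGENTS]
        if not agents:
            raise ValueError("no valid agents selected")
        return {"agents": agents, "parallel": bool(decision.get("parallel", False))}
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Safe default when the model's output is unusable
        return {"agents": ["research"], "parallel": False}

assert parse_routing('{"agents": ["database"], "parallel": true}') == \
    {"agents": ["database"], "parallel": True}
assert parse_routing("not json")["agents"] == ["research"]
```

This is the bound that makes hybrid routing production-safe: a bad routing decision degrades to a default path instead of crashing or executing something unexpected.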

Fully Autonomous

The LLM decides what to do at every step, including whether to spawn sub-agents, what tools to use, and when to stop. This is the pattern most demos show. It’s also the least reliable in production.

The failure rate for fully autonomous multi-step tasks scales roughly with the number of decision points. If each routing decision has 90% accuracy, a 5-step pipeline has ~59% end-to-end success rate (0.9^5). At 95% per step, a 5-step pipeline reaches ~77%. These numbers improve with better models, but the multiplicative nature of sequential decisions means even small per-step error rates compound.

Recommendation: Use deterministic pipelines for workflows with known structure. Use hybrid routing when the workflow varies based on input but the set of possible paths is finite. Reserve fully autonomous routing for exploratory tasks where the workflow genuinely can’t be predicted in advance (e.g., open-ended research, complex debugging).

Human-in-the-Loop Patterns

Three points where human intervention commonly slots into agent workflows:

[Diagram] Approval Gate (dangerous action pauses until reviewed before side effects), Mid-Flow Correction (partial result returned, human supplies corrected context, agent continues), Escalation (agent stuck or uncertain, human takes over with agent input).

Three human-in-the-loop insertion points in agent workflows.

Approval Gates

Before executing side effects (sending an email, modifying a database, calling a paid API), pause the agent and present the proposed action to a human for approval.

TOOLS_REQUIRING_APPROVAL = {"send_email", "update_database", "charge_payment"}

async def execute_with_approval(tool_name: str, params: dict, workflow_id: str):
    if tool_name in TOOLS_REQUIRING_APPROVAL:
        # Persist the pending action
        await store_pending_action(workflow_id, tool_name, params)
        # Notify human (webhook, Slack, email)
        await notify_approver(workflow_id, tool_name, params)
        # Suspend workflow — Temporal, Inngest, or custom polling
        approved = await wait_for_approval(workflow_id, timeout=timedelta(hours=24))
        if not approved:
            return {"status": "rejected", "message": "Action rejected by reviewer"}

    return await execute_tool(tool_name, params)

This requires durable state. The agent workflow must be able to pause for hours or days and resume exactly where it stopped. Temporal handles this natively. Without a durable execution framework, implement it as a state machine persisted to a database, with a separate process that checks for approvals and resumes workflows.
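That state machine is small; a sketch using an in-memory dict where a real system would use a database table (request_approval, approve, and pending_actions are illustrative names):

```python
from enum import Enum

class ActionState(str, Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"

# Stand-in for a database table keyed by workflow_id
store: dict[str, dict] = {}

def request_approval(workflow_id: str, tool_name: str, params: dict):
    """Persist the proposed action; the agent workflow suspends here."""
    store[workflow_id] = {"tool": tool_name, "params": params,
                          "state": ActionState.PENDING}

def approve(workflow_id: str, approved: bool):
    """Called from the reviewer UI; a poller then resumes approved workflows."""
    store[workflow_id]["state"] = (
        ActionState.APPROVED if approved else ActionState.REJECTED
    )

def pending_actions() -> list[str]:
    return [wid for wid, a in store.items() if a["state"] == ActionState.PENDING]

request_approval("wf-1", "send_email", {"to": "a@example.com"})
assert pending_actions() == ["wf-1"]
approve("wf-1", approved=True)
assert store["wf-1"]["state"] == ActionState.APPROVED
```

The separate resume process polls for APPROVED rows, reloads the checkpointed agent state (as in the Redis example above), and continues the loop from the suspended tool call.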

Escalation

When the agent exceeds its iteration budget or confidence drops below a threshold, escalate to a human. The key is providing enough context for the human to take over effectively:

def escalate(workflow_id: str, agent_state: dict):
    summary = {
        "original_query": agent_state["query"],
        "steps_taken": len(agent_state["tool_calls"]),
        "partial_results": agent_state.get("partial_results"),
        "failure_reason": agent_state.get("failure_reason", "Max iterations reached"),
        "full_history_link": f"/admin/workflows/{workflow_id}",
    }
    send_to_human_queue(summary)

Cost and Latency Profiles

The choice of orchestration pattern has direct cost and latency implications. These are approximate ranges based on typical implementations.

| Pattern | LLM Calls per Task | Typical Latency | Token Usage | Best For |
| --- | --- | --- | --- | --- |
| Single agent, 3 tool calls | 4 (3 tool rounds + final) | 3-8s | 5K-15K | Simple lookups, CRUD |
| Single agent, 8 tool calls | 9 | 10-25s | 20K-50K | Multi-step research |
| Multi-agent, 3 agents sequential | 8-15 | 15-45s | 30K-80K | Specialized pipelines |
| Supervisor + 3 parallel workers | 6-12 | 8-20s | 25K-60K | Complex, decomposable tasks |
| Supervisor + 5 workers, mixed | 12-25 | 20-60s | 50K-150K | Large research/analysis |

Parallel dispatch in the supervisor pattern is the primary latency advantage over sequential delegation. If three workers each take 5 seconds, sequential delegation takes 15+ seconds for the worker phase alone; parallel dispatch takes ~5 seconds.
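The difference is easy to demonstrate with simulated workers, where a sleep stands in for real agent latency:

```python
import asyncio
import time

async def worker(name: str, seconds: float) -> str:
    # Stand-in for a worker agent; sleep models LLM + tool latency
    await asyncio.sleep(seconds)
    return f"{name} done"

async def sequential(tasks):
    return [await worker(n, s) for n, s in tasks]

async def parallel(tasks):
    return await asyncio.gather(*(worker(n, s) for n, s in tasks))

tasks = [("research", 0.05), ("code", 0.05), ("data", 0.05)]

start = time.perf_counter()
asyncio.run(sequential(tasks))
seq_time = time.perf_counter() - start

start = time.perf_counter()
asyncio.run(parallel(tasks))
par_time = time.perf_counter() - start

assert par_time < seq_time  # parallel wall time ≈ the slowest worker, not the sum
```

The speedup only materializes when subtasks are genuinely independent, which is why the decomposition prompt pushes the supervisor to minimize depends_on edges.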

Cost Optimization Techniques

Model routing per agent. Use expensive models only where they add value. A research agent that primarily calls search APIs doesn’t need Claude Opus 4.6; Claude Haiku 4.5 or GPT-4.1 Nano handles tool use adequately for straightforward retrieval.

Early termination. If the first tool call returns a complete answer, skip the remaining planned subtasks. The supervisor should re-evaluate after each batch of worker results.

Prompt caching. If agents share system prompts across invocations (they usually do), prompt caching from Anthropic and OpenAI can reduce input token costs by 75-90% on cached portions. This matters most for agents with long system prompts (1,000+ tokens).
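With Anthropic's API, caching is enabled by marking the stable prefix with cache_control. A sketch building the request kwargs (build_cached_request is an illustrative helper; the model ID follows the earlier snippets in this post):

```python
def build_cached_request(system_prompt: str, messages: list) -> dict:
    """Build messages.create kwargs with the stable system prompt marked cacheable."""
    return {
        "model": "claude-sonnet-4-5-20250514",
        "max_tokens": 1024,
        "system": [{
            "type": "text",
            "text": system_prompt,
            # Everything up to and including this block is cached between calls
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": messages,
    }

req = build_cached_request("You are a research agent...",  # imagine 2,000+ tokens
                           [{"role": "user", "content": "go"}])
assert req["system"][0]["cache_control"] == {"type": "ephemeral"}
# client.messages.create(**req) then reuses the cached prefix on subsequent calls
```

Only the per-call suffix (the user messages and tool results) is billed at full input rates once the prefix is cached, which is why long-system-prompt agents benefit most.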

Token budgeting. Give each agent a token budget and track usage:

@dataclass
class TokenBudget:
    max_input_tokens: int = 50_000
    max_output_tokens: int = 10_000
    used_input: int = 0
    used_output: int = 0

    @property
    def remaining_input(self) -> int:
        return self.max_input_tokens - self.used_input

    @property
    def exhausted(self) -> bool:
        return self.used_input >= self.max_input_tokens

    def record(self, input_tokens: int, output_tokens: int):
        self.used_input += input_tokens
        self.used_output += output_tokens

Framework Comparison

The major agent orchestration frameworks as of April 2026:

| Framework | Pattern Support | State Management | Language | Strengths | Weaknesses |
| --- | --- | --- | --- | --- | --- |
| LangGraph | All three | Built-in checkpointing | Python, JS | Flexible graph model, persistence | Complexity for simple use cases |
| OpenAI Agents SDK | Single, delegation | In-memory (extensible) | Python | Clean API, built-in handoffs | OpenAI-centric |
| CrewAI | Multi-agent, supervisor | Built-in | Python | Easy multi-agent setup | Less control over low-level flow |
| Autogen (Microsoft) | All three | Conversation-based | Python | Strong multi-agent patterns | Steep learning curve |
| Mastra | All three | Built-in persistence | TypeScript | Good DX, workflow engine | Newer, smaller ecosystem |
| Custom (no framework) | Any | Roll your own | Any | Full control, no abstractions | More code to maintain |

LangGraph

LangGraph models agent workflows as state machines (graphs). Nodes are functions that transform state; edges define transitions. Conditional edges enable LLM-driven routing.

from langgraph.graph import StateGraph, MessagesState

def call_model(state: MessagesState):
    response = model.invoke(state["messages"])
    return {"messages": [response]}

def call_tools(state: MessagesState):
    # Execute tool calls from the last message
    ...

graph = StateGraph(MessagesState)
graph.add_node("agent", call_model)
graph.add_node("tools", call_tools)
graph.add_edge("__start__", "agent")
graph.add_conditional_edges("agent", should_continue, {
    "tools": "tools",
    "end": "__end__",
})
graph.add_edge("tools", "agent")

app = graph.compile(checkpointer=MemorySaver())

LangGraph’s main advantage is built-in persistence and the ability to interrupt/resume at any node — enabling human-in-the-loop without external infrastructure.

OpenAI Agents SDK

OpenAI’s framework uses a handoff primitive for multi-agent delegation:

from openai import agents

research_agent = agents.Agent(
    name="Research",
    model="gpt-5.4",
    instructions="You research topics using search tools.",
    tools=[search_tool],
)

writer_agent = agents.Agent(
    name="Writer",
    model="gpt-5.4",
    instructions="You write reports based on research.",
    handoffs=[],  # terminal agent
)

research_agent.handoffs = [writer_agent]

result = agents.Runner.run(research_agent, "Write a report on AI agent patterns")

Simple and clean for the delegation pattern. Less suitable for supervisor architectures, which require more manual construction.

When to Use No Framework

Frameworks add value when:

  • Multiple agents need persistent state across process boundaries
  • Human-in-the-loop gates are required
  • The workflow graph is complex (5+ agents, conditional branching)

Frameworks add overhead when:

  • A single agent with tool use is sufficient
  • The workflow is a simple sequential pipeline
  • Tight control over prompts and API calls is needed (frameworks often wrap the API in ways that hide prompt details)

For a single-agent loop with 3-5 tools, the raw API code from the beginning of this post is probably the right choice. Adding LangGraph there means taking on a dependency, a conceptual framework, and extra debugging complexity for a pattern that’s ~40 lines of code.

When to Use Which Pattern

[Diagram] Simple task (< 5 tool calls, single domain) → use a single agent. Moderate task (5-10 steps, multiple domains) → use multi-agent delegation. Complex task (10+ steps, decomposable, independent subtasks) → use a supervisor.

Pattern selection based on task complexity.

Decision Framework

Criterion             | Single Agent | Multi-Agent Delegation | Supervisor
----------------------|--------------|------------------------|----------
Task steps            | 1-5          | 3-10                   | 5-20+
Domain breadth        | Single domain | 2-3 domains           | 3+ domains
Parallelism needed    | No           | Rarely                 | Often
Error accountability  | Simple       | Unclear                | Clear
Latency sensitivity   | Best         | Worst (sequential)     | Good (parallel)
Implementation effort | Low          | Medium                 | High
Testing complexity    | Low          | Medium                 | High

Start with a single agent. The most common mistake is reaching for multi-agent patterns prematurely. A single Claude Sonnet 4.6 or GPT-5.4 call with 8-10 well-designed tools handles the majority of real-world agent tasks. If you’re building a customer support agent that needs to look up orders, check shipping status, and issue refunds, that’s one agent with three tools — not three specialized agents.
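As a sketch, those three capabilities as tool schemas in Anthropic’s tool-use format, all attached to one agent. Tool names and input fields here are illustrative, not a real API:

```python
# Three tools for a single support agent. Names and schemas are hypothetical.
support_tools = [
    {
        "name": "lookup_order",
        "description": "Fetch an order by its ID, including items and totals.",
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
    {
        "name": "check_shipping",
        "description": "Return the current shipping status for an order.",
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
    {
        "name": "issue_refund",
        "description": "Refund an order, fully or partially.",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "amount": {"type": "number"},
            },
            "required": ["order_id", "amount"],
        },
    },
]
```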

Move to delegation when tool count exceeds ~15 or domains diverge. If the agent needs database tools, web search tools, code execution tools, and email tools, each with different authentication and error handling, splitting into specialized agents makes the system more maintainable.
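One way to sketch that split — the domains, tool names, and routing function below are all hypothetical:

```python
# Hypothetical split of a large tool set into domain specialists, each of
# which would carry its own authentication and error handling.
SPECIALISTS = {
    "database": ["run_query", "explain_plan", "migrate_schema"],
    "web": ["search", "fetch_page", "summarize_page"],
    "code": ["run_python", "run_tests", "lint"],
    "email": ["draft_email", "send_email", "list_threads"],
}

def route_to_specialist(tool_name: str) -> str:
    # The coordinating agent hands off based on which domain owns the tool.
    for domain, tools in SPECIALISTS.items():
        if tool_name in tools:
            return domain
    raise KeyError(f"no specialist owns {tool_name!r}")
```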

Use a supervisor when tasks decompose into independent subtasks. The supervisor pattern only pays for its overhead when parallel execution is possible. If every step depends on the previous step’s output, a supervisor adds an extra LLM call (for decomposition) without improving latency.
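A minimal sketch of why parallel dispatch pays: with stand-in workers that each sleep ~0.2s in place of a real LLM call, three independent subtasks finish in roughly one worker’s latency rather than the sum of all three:

```python
import concurrent.futures
import time

def run_worker(subtask: str) -> str:
    # Stand-in for a worker-agent call; the sleep mimics LLM latency.
    time.sleep(0.2)
    return f"result for {subtask}"

subtasks = ["research pricing", "research competitors", "research churn data"]

start = time.monotonic()
# Supervisor dispatches independent subtasks concurrently.
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_worker, subtasks))
elapsed = time.monotonic() - start  # ~0.2s, not ~0.6s
```

If the subtasks instead formed a chain, the same executor would serialize them anyway, and the supervisor’s decomposition call would be pure overhead.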

Anti-Patterns

The “one agent per tool” anti-pattern. Creating a dedicated agent for each tool (a “search agent” that just calls search, a “calculator agent” that just runs calculations). The overhead of inter-agent communication far exceeds the benefit. Give the tools to one agent.

The “committee” anti-pattern. Multiple agents debate a decision, voting or iterating on each other’s outputs. Sounds appealing in theory. In practice, this is expensive (3-5x the token cost), slow, and the final output is usually no better than a single strong model’s output. Anthropic’s own research suggests that using a better model once outperforms using a weaker model in a multi-agent debate.

The “deep nesting” anti-pattern. A supervisor delegates to sub-supervisors, which delegate to workers, which sometimes delegate to other workers. Three levels deep is already hard to debug. Keep hierarchies flat: one supervisor, N workers.

Summary

The single-agent loop with tool use is the right default for most applications. It’s simple, predictable, and fast. Reach for multi-agent patterns only when task complexity, domain breadth, or parallelism requirements genuinely demand it.

State management is the hard part. Token budget tracking, context summarization, and checkpointing matter more than the choice between delegation and supervisor patterns.

Error recovery requires different strategies for different failure types: automatic retry for transient failures, LLM-informed recovery for tool failures, and programmatic intervention for reasoning failures. Loop detection and iteration caps are non-negotiable in production.
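A minimal loop-detection sketch along those lines — the window size and the (tool, args) call representation are illustrative: flag a run when the same call repeats several times consecutively, then force-stop or inject a corrective message:

```python
from collections import deque

def is_looping(recent_calls: deque, window: int = 3) -> bool:
    """True when the last `window` (tool, args) pairs are identical."""
    if len(recent_calls) < window:
        return False
    tail = list(recent_calls)[-window:]
    return len(set(tail)) == 1

# Record each tool call as a hashable (tool_name, serialized_args) pair.
recent = deque(maxlen=10)
for call in [("search", "llm agents")] * 3:
    recent.append(call)
    if is_looping(recent):
        break  # force-stop, or append a corrective message to the context
```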

Hybrid control flow — deterministic pipeline structure with LLM-driven routing at specific decision points — outperforms both fully hardcoded and fully autonomous approaches for most production workloads.
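A sketch of that hybrid shape, with a stand-in classifier where the LLM call would go — the routes and handlers are illustrative:

```python
from typing import Callable

def classify(ticket: str) -> str:
    # Stand-in for the single LLM-driven decision point: route by intent.
    return "refund" if "refund" in ticket.lower() else "general"

def handle_refund(ticket: str) -> str:
    return "refund workflow: " + ticket

def handle_general(ticket: str) -> str:
    return "general workflow: " + ticket

ROUTES: dict[str, Callable[[str], str]] = {
    "refund": handle_refund,
    "general": handle_general,
}

def pipeline(ticket: str) -> str:
    # Deterministic steps before and after; the LLM only picks the branch.
    normalized = ticket.strip()
    route = classify(normalized)  # the one nondeterministic hop
    return ROUTES[route](normalized)
```

Everything except `classify` is ordinary, testable control flow, which is what makes this shape easier to debug than a fully autonomous loop.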

Start with a single agent and the raw API. Add a framework when you need persistent state, human-in-the-loop, or complex multi-agent graphs. Not before.

Further Reading

  • LangGraph Documentation — Official docs for LangGraph, covering state machines, persistence, and human-in-the-loop patterns
  • OpenAI Agents SDK — OpenAI’s Python framework for agent orchestration with handoffs and tool use
  • Anthropic Agent Documentation — Anthropic’s guide to building agents with Claude, including tool use patterns and best practices
  • Microsoft Autogen — Microsoft’s multi-agent conversation framework with support for various orchestration patterns
  • CrewAI — Framework for orchestrating role-playing AI agents with built-in delegation and task management
  • Temporal — Durable execution platform for long-running workflows, applicable to agent orchestration with checkpointing and retry semantics
  • Mastra — TypeScript-first AI agent framework with built-in workflows and persistence
  • Building effective agents (Anthropic) — Anthropic’s opinionated guide to agent architecture, arguing for simplicity over complexity
  • Voyage AI Agent Patterns — Research on retrieval-augmented agent architectures and embedding-based tool selection
  • LangGraph “Plan-and-Execute” Example — Reference implementation of the supervisor pattern with parallel worker dispatch