AI Agent Orchestration Patterns
Complete guide to building reliable AI agent orchestration: single-agent loops, multi-agent delegation, supervisor hierarchies, state management, error recovery, and production patterns with code.
Every major AI provider now ships some form of agent framework. Most production agent deployments still fail in predictable ways — not because the LLM is wrong, but because the orchestration around it handles state, errors, and control flow poorly. The gap between a demo agent and a production agent is almost entirely an engineering problem.
This reference covers the three dominant orchestration patterns (single-agent loops, multi-agent delegation, and supervisor architectures), how state flows through each, where they break, and when to pick one over another.
Table of Contents
- The Agent Loop Primitive
- Pattern 1: Single-Agent with Tool Use
- Pattern 2: Multi-Agent Delegation
- Pattern 3: Supervisor Architecture
- State Management
- Error Recovery and Retry Semantics
- Control Flow: Deterministic vs LLM-Driven Routing
- Human-in-the-Loop Patterns
- Cost and Latency Profiles
- Framework Comparison
- When to Use Which Pattern
- Summary
- Further Reading
The Agent Loop Primitive
Every agent pattern reduces to the same primitive: a loop that calls an LLM, checks if the LLM wants to use a tool, executes that tool, feeds the result back, and repeats until the LLM produces a final response or a termination condition fires.
The core agent loop: call the LLM, check for tool use, execute, repeat.
This loop is deceptively simple. The complexity lives in five places: (1) how context accumulates across iterations, (2) what happens when a tool call fails, (3) how many iterations to allow before force-stopping, (4) how to manage token budget as the conversation grows, and (5) how to persist state across process boundaries.
A minimal implementation in Python:
```python
import anthropic

client = anthropic.Anthropic()

def agent_loop(messages: list, tools: list, max_turns: int = 10) -> str:
    for _ in range(max_turns):
        response = client.messages.create(
            model="claude-sonnet-4-5-20250514",
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason == "end_turn":
            return response.content[0].text
        # Collect tool uses from the response and execute each one
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result,
                })
        # Feed the assistant turn and the tool results back into the loop
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})
    return "Max turns reached without final response."
```
Every framework — LangGraph, CrewAI, OpenAI Agents SDK, AutoGen — wraps this loop with different abstractions. The differences that matter are in state management, error handling, and how they compose multiple loops together.
Pattern 1: Single-Agent with Tool Use
The simplest production pattern: one LLM, one system prompt, multiple tools. The LLM decides which tools to call and in what order. No routing logic, no delegation — just a capable model with access to functions.
Single-agent pattern: one model handles all reasoning and tool selection.
When This Works
Single-agent works well when:
- The task requires fewer than ~8 tool calls in sequence
- All tools share a common domain (e.g., all operate on the same database)
- The total context fits comfortably in the model’s window (under 50K tokens of accumulated tool results)
- Latency budget allows sequential tool execution (each call adds 200-800ms for the LLM round-trip plus tool execution time)
When This Breaks
The failure modes are consistent across implementations:
Context window saturation. Each tool call adds its input and output to the message history. A database query returning 2,000 rows of JSON can consume 15K tokens in one step. After 5-6 such calls, the agent is spending most of its context on tool results and losing track of the original goal.
Tool selection confusion. With more than 10-15 tools, models start making incorrect tool selections — calling a search tool when they should call a database tool, or hallucinating tool parameters. Claude Opus 4.6 handles ~25 tools reasonably; Claude Haiku 4.5 degrades noticeably above 10.
Unbounded loops. Without a hard iteration cap, agents can enter cycles — calling the same tool repeatedly with slightly different parameters, or alternating between two tools without converging. Always set max_turns.
Tool Result Summarization
The most effective mitigation for context saturation is summarizing tool results before appending them to the message history:
```python
def summarize_if_large(tool_name: str, result: str, threshold: int = 3000) -> str:
    if len(result) < threshold:
        return result
    # Use a fast, cheap model for summarization
    summary = client.messages.create(
        model="claude-haiku-4-5-20250514",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Summarize this {tool_name} result, preserving key data:\n{result}"
        }],
    )
    return f"[Summarized] {summary.content[0].text}"
```
This adds one cheap LLM call per large tool result but can reduce context consumption by 80-90% for data-heavy tools.
Pattern 2: Multi-Agent Delegation
Multiple specialized agents, each with their own system prompt and tool set, with one agent able to hand off tasks to another. The key distinction from the supervisor pattern: there is no central coordinator. Agents delegate laterally.
Multi-agent delegation: agents hand off to peers without a central coordinator.
Implementation
The standard implementation gives each agent a transfer_to_agent tool. When agent A calls transfer_to_agent("analysis_agent", context), the orchestrator pauses agent A’s loop and starts agent B’s loop with the provided context.
```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    model: str
    system_prompt: str
    tools: list
    max_turns: int = 10

@dataclass
class Orchestrator:
    agents: dict[str, Agent] = field(default_factory=dict)
    conversation_history: list = field(default_factory=list)

    def run(self, agent_name: str, user_message: str) -> str:
        # call_llm, extract_text, extract_tool_calls, execute_tool, and
        # append_tool_result are helpers elided for brevity
        agent = self.agents[agent_name]
        messages = [{"role": "user", "content": user_message}]
        for _ in range(agent.max_turns):
            response = call_llm(agent.model, agent.system_prompt,
                                agent.tools + [self.transfer_tool()],
                                messages)
            if response.stop_reason == "end_turn":
                return extract_text(response)
            for tool_call in extract_tool_calls(response):
                if tool_call.name == "transfer_to_agent":
                    # Delegate to another agent with context
                    target = tool_call.input["agent_name"]
                    context = tool_call.input["context"]
                    return self.run(target, context)  # recursive
                result = execute_tool(tool_call.name, tool_call.input)
                messages = append_tool_result(messages, response, tool_call, result)
        return "Max turns reached."

    def transfer_tool(self):
        return {
            "name": "transfer_to_agent",
            "description": "Hand off the current task to a specialized agent.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "agent_name": {"type": "string", "enum": list(self.agents.keys())},
                    "context": {"type": "string", "description": "What to tell the target agent"},
                },
                "required": ["agent_name", "context"],
            },
        }
```
The Context Handoff Problem
The critical design decision in multi-agent delegation is what context transfers between agents. Three approaches:
Full history transfer. Pass the entire conversation history to the receiving agent. Preserves all information but consumes tokens fast. With three agents in sequence, the third agent’s context includes all tool results from agents one and two.
Summary transfer. The delegating agent writes a summary of its findings and passes only that. Loses detail but stays within token budgets. Best when agents operate on different data domains.
Structured handoff. Define a typed schema for inter-agent communication. Agent A produces a JSON object with specific fields; agent B’s system prompt expects that structure.
```python
# Structured handoff schema: each value describes what the field should contain
handoff_schema = {
    "findings": "list of key data points discovered",
    "original_query": "the user's original request",
    "remaining_tasks": "what still needs to be done",
    "constraints": "any constraints or preferences identified",
}
```
Structured handoffs are the most reliable in practice. They prevent context bloat and make the inter-agent contract explicit and testable.
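Making the contract testable is straightforward with a dataclass whose fields mirror the schema above; the from_json validator below is a sketch, not a prescribed API:

```python
import json
from dataclasses import dataclass

@dataclass
class Handoff:
    findings: list[str]
    original_query: str
    remaining_tasks: str
    constraints: str

    @classmethod
    def from_json(cls, raw: str) -> "Handoff":
        data = json.loads(raw)
        missing = set(cls.__dataclass_fields__) - set(data)
        if missing:
            raise ValueError(f"Handoff missing fields: {missing}")
        # Ignore extra keys the model may have added
        return cls(**{k: data[k] for k in cls.__dataclass_fields__})
```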
Failure Modes
Delegation loops. Agent A delegates to agent B, which delegates back to agent A. Solve with a delegation depth counter or a “no-backsies” rule (an agent cannot delegate to the agent that delegated to it).
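A sketch of both guards; this variant threads the chain of visited agents through each run call and blocks any revisit, which is slightly stricter than blocking only the immediate delegator:

```python
MAX_DELEGATION_DEPTH = 3

def check_delegation(chain: list[str], target: str) -> str | None:
    """chain is the path of agents so far; returns an error to feed back, or None."""
    if len(chain) >= MAX_DELEGATION_DEPTH:
        return "Delegation depth limit reached; finish with what you have."
    if target in chain:
        # "No-backsies": never delegate to an agent already in the chain
        return f"{target} already handled this task; choose a different approach."
    return None
```

When the check fails, the orchestrator feeds the returned string back to the model as the transfer_to_agent tool result instead of performing the handoff.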
Context loss. The delegating agent summarizes too aggressively, losing a critical detail. The receiving agent then makes decisions based on incomplete information. Mitigation: include the original user query in every handoff, not just the summary.
Responsibility diffusion. With no central coordinator, no agent takes ownership of the final output quality. If agent C produces a bad result, it’s unclear whether agent A should have provided better context, agent B should have caught an error, or agent C’s tools were insufficient.
Pattern 3: Supervisor Architecture
A dedicated supervisor agent coordinates multiple worker agents. The supervisor receives the user’s request, decomposes it into subtasks, dispatches those subtasks to specialized workers, collects results, and synthesizes a final response.
Supervisor pattern: central coordinator dispatches to specialized workers and synthesizes results.
Why Use a Supervisor
The supervisor pattern solves the three main problems with peer delegation:
- Task decomposition is centralized. One agent with a high-level view breaks the problem down, rather than each agent deciding ad hoc what to delegate.
- Result synthesis is explicit. The supervisor reviews all partial results and produces a coherent final output.
- Error handling has a single owner. If a worker fails, the supervisor decides whether to retry, use a different worker, or degrade gracefully.
Implementation with Parallel Dispatch
The supervisor’s main advantage over sequential delegation is parallel execution. If two subtasks are independent, dispatch them simultaneously:
```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class SubTask:
    id: str                              # referenced by depends_on and results
    agent_name: str
    instruction: str
    depends_on: list[str] = field(default_factory=list)  # IDs of tasks this depends on

async def supervisor_loop(
    user_query: str,
    supervisor_model: str,
    workers: dict[str, Agent],
) -> str:
    # decompose_task, run_worker, and synthesize are LLM-calling helpers
    # elided for brevity

    # Step 1: Decompose
    plan = await decompose_task(supervisor_model, user_query, list(workers.keys()))

    # Step 2: Execute tasks respecting dependencies
    results = {}
    for batch in topological_batches(plan.subtasks):
        # Run independent tasks in parallel
        batch_results = await asyncio.gather(*[
            run_worker(workers[task.agent_name], task.instruction, results)
            for task in batch
        ])
        for task, result in zip(batch, batch_results):
            results[task.id] = result

    # Step 3: Synthesize
    return await synthesize(supervisor_model, user_query, results)

def topological_batches(subtasks: list[SubTask]) -> list[list[SubTask]]:
    """Group subtasks into batches that can run in parallel.

    Tasks with no dependencies go in batch 0; tasks depending only on
    batch 0 go in batch 1, and so on.
    """
    resolved = set()
    batches = []
    remaining = list(subtasks)
    while remaining:
        batch = [t for t in remaining
                 if not t.depends_on or all(d in resolved for d in t.depends_on)]
        if not batch:
            raise ValueError("Circular dependency in task plan")
        batches.append(batch)
        resolved.update(t.id for t in batch)
        remaining = [t for t in remaining if t not in batch]
    return batches
```
The Decomposition Prompt
The supervisor’s decomposition quality determines the entire system’s performance. A good decomposition prompt:
```python
SUPERVISOR_SYSTEM = """You are a task coordinator. Given a user request and a list
of available specialist agents, produce a plan.

Available agents:
{agent_descriptions}

Output a JSON plan:
{{
  "subtasks": [
    {{
      "id": "t1",
      "agent": "research_agent",
      "instruction": "specific instruction for this agent",
      "depends_on": []
    }},
    {{
      "id": "t2",
      "agent": "analysis_agent",
      "instruction": "analyze the data from t1",
      "depends_on": ["t1"]
    }}
  ]
}}

Rules:
- Each subtask must be self-contained: include all context the agent needs.
- Maximize parallelism: only add depends_on when truly necessary.
- Use at most 5 subtasks. Prefer fewer.
- If the task is simple enough for one agent, use one subtask.
"""
```
The “if the task is simple enough for one agent, use one subtask” rule is important. Over-decomposition — splitting a simple request into four subtasks — adds latency and increases the chance of synthesis errors.
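For concreteness, here is one sketch of the decompose_task helper the supervisor loop assumed earlier: format the system prompt, ask for JSON, and parse the result into SubTask objects. The Plan wrapper and the async client are assumptions, and the parser trusts the model to return bare JSON.

```python
import json
import anthropic
from dataclasses import dataclass

async_client = anthropic.AsyncAnthropic()

@dataclass
class Plan:
    subtasks: list[SubTask]

async def decompose_task(model: str, query: str, agent_names: list[str]) -> Plan:
    descriptions = "\n".join(f"- {name}" for name in agent_names)
    response = await async_client.messages.create(
        model=model,
        max_tokens=1024,
        system=SUPERVISOR_SYSTEM.format(agent_descriptions=descriptions),
        messages=[{"role": "user", "content": query}],
    )
    raw = json.loads(response.content[0].text)  # production code should validate
    return Plan(subtasks=[
        SubTask(id=t["id"], agent_name=t["agent"], instruction=t["instruction"],
                depends_on=t.get("depends_on", []))
        for t in raw["subtasks"]
    ])
```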
Supervisor Model Selection
The supervisor doesn’t need to be the most capable model in the system. It needs to be good at structured planning and synthesis, but it doesn’t need domain expertise — that’s what the workers provide.
A common and cost-effective pattern:
| Role | Model | Rationale |
|---|---|---|
| Supervisor | Claude Sonnet 4.6 | Good structured output, fast, reasonable cost |
| Research worker | Claude Sonnet 4.6 | Needs tool use + reasoning |
| Code worker | Claude Opus 4.6 | Complex code generation benefits from top-tier |
| Data worker | GPT-4.1 Nano | High throughput for structured data extraction |
| Synthesis pass | Claude Sonnet 4.6 | Same model as supervisor for consistency |
Using a cheaper model for the supervisor than for specialized workers is counterintuitive but often correct. The supervisor makes routing decisions; the workers do the hard cognitive work.
State Management
State is the hardest part of agent orchestration. A single-agent loop can keep state in the message array. Multi-agent and supervisor patterns need something more structured.
State Categories
Four categories of state in agent systems, and how they interact.
Conversation state is the message history. In OpenAI and Anthropic’s APIs, this is the messages array. It grows monotonically within a single agent loop.
Task state tracks the plan, which subtasks are complete, partial results, and what remains. This is the supervisor’s working memory.
World state is external: database rows modified, files created, API calls made. These are side effects that can’t be rolled back by simply rewinding the conversation.
Meta state tracks operational concerns: total tokens consumed, wall-clock time elapsed, cost accrued, number of LLM calls made. Critical for enforcing budgets and SLAs.
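A sketch of carrying all four categories in one container; the field choices are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowState:
    # Conversation state: the growing message history
    messages: list = field(default_factory=list)
    # Task state: the plan and partial results keyed by subtask ID
    plan: list = field(default_factory=list)
    results: dict = field(default_factory=dict)
    # World state: side effects already committed (audit trail, not rollback)
    side_effects: list[str] = field(default_factory=list)
    # Meta state: operational budgets and counters
    tokens_used: int = 0
    llm_calls: int = 0
    cost_usd: float = 0.0
```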
Persistence Strategies
For agents that complete in under 30 seconds (most single-agent tool-use patterns), in-memory state is fine. For longer workflows:
Checkpointing to a database. After each agent turn, serialize the full state (messages, task plan, partial results) to a row in PostgreSQL or a document in Redis. If the process crashes, resume from the last checkpoint.
```python
import json
import redis

r = redis.Redis()

def checkpoint(workflow_id: str, state: dict):
    r.set(f"agent:state:{workflow_id}", json.dumps(state))
    r.expire(f"agent:state:{workflow_id}", 86400)  # 24h TTL

def restore(workflow_id: str) -> dict | None:
    data = r.get(f"agent:state:{workflow_id}")
    return json.loads(data) if data else None
```
Durable execution frameworks. Temporal, Inngest, and Trigger.dev provide workflow engines that automatically checkpoint state and resume after failures. Temporal’s model — deterministic workflow code that calls activities — maps naturally to the supervisor pattern: the workflow is the supervisor, activities are worker agents.
```python
# Pseudo-code for a Temporal workflow acting as supervisor
import asyncio
from datetime import timedelta
from temporalio import workflow

@workflow.defn
class ResearchWorkflow:
    @workflow.run
    async def run(self, query: str) -> str:
        plan = await workflow.execute_activity(
            decompose_task, query, start_to_close_timeout=timedelta(seconds=30)
        )
        results = {}
        for batch in topological_batches(plan.subtasks):
            batch_results = await asyncio.gather(*[
                workflow.execute_activity(
                    run_worker_agent,
                    task,
                    start_to_close_timeout=timedelta(minutes=5),
                )
                for task in batch
            ])
            for task, result in zip(batch, batch_results):
                results[task.id] = result
        return await workflow.execute_activity(
            synthesize_results, args=[query, results],
            start_to_close_timeout=timedelta(seconds=30)
        )
```
The Temporal approach gives automatic retries, timeouts per activity, and full audit logs. The tradeoff is infrastructure complexity — running a Temporal cluster is nontrivial, though Temporal Cloud handles this.
Context Window as State Budget
A practical way to think about state management: the context window is a fixed token budget, and every piece of state competes for space within it.
| State type | Typical token cost | Compression strategy |
|---|---|---|
| System prompt | 500-2,000 | Fixed; keep tight |
| User messages | 100-500 per turn | Rarely compressible |
| Tool call + result | 200-5,000 per call | Summarize large results |
| Accumulated history | Grows ~1K per turn | Sliding window or summarize |
| Task plan (supervisor) | 300-800 | Structured JSON stays small |
For a model with a 200K token context window, this seems luxurious. In practice, agent performance degrades well before the window is full. Models attend less effectively to information in the middle of long contexts. Keeping total context under 50K tokens per agent, even with 200K available, produces more reliable tool selection and reasoning.
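A sketch of the sliding-window strategy from the table above: keep the opening messages and the most recent turns, replacing the middle with a marker. The thresholds are illustrative, and a production version should trim on turn boundaries so tool_use/tool_result pairs stay together.

```python
def trim_history(messages: list, max_messages: int = 40, keep_head: int = 2) -> list:
    """Drop the middle of a long history, keeping the head and recent tail."""
    if len(messages) <= max_messages:
        return messages
    keep_tail = max_messages - keep_head - 1  # reserve one slot for the marker
    dropped = len(messages) - keep_head - keep_tail
    marker = {"role": "user",
              "content": f"[{dropped} earlier messages elided to save context]"}
    return messages[:keep_head] + [marker] + messages[-keep_tail:]
```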
Error Recovery and Retry Semantics
Agent failures fall into three categories, each requiring different recovery strategies.
Three failure categories and their recovery strategies.
Transient Failures
Rate limits, network timeouts, 500 errors from tool APIs. Handle with standard retry logic — exponential backoff with jitter. Don’t feed these back to the LLM; the model can’t do anything useful with “the API returned a 429.”
```python
import asyncio
import random
from anthropic import RateLimitError

async def call_with_retry(fn, *args, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return await fn(*args)
        except (RateLimitError, TimeoutError, ConnectionError):
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter; async sleep keeps the event
            # loop free for other work
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            await asyncio.sleep(delay)
```
Tool Failures
The tool executed but returned an error — invalid parameters, a database query that returned zero rows, an API that returned a 400 with a descriptive error message. These should be fed back to the LLM as the tool result. Models are generally good at interpreting error messages and adjusting their approach:
```python
import json

def execute_tool_safe(name: str, params: dict) -> str:
    try:
        result = execute_tool(name, params)
        return json.dumps({"status": "success", "data": result})
    except ToolError as e:  # ToolError: whatever base exception your tool layer raises
        return json.dumps({"status": "error", "error": str(e)})
```
The model sees {"status": "error", "error": "No rows matched filter: date > 2026-04-30"} and can adjust the date parameter. This works reliably for 1-2 recovery attempts. If the same tool fails 3+ times in a row, something is structurally wrong and programmatic intervention is better than letting the model keep trying.
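One way to enforce that cutoff is a per-tool consecutive-failure counter checked before execution; a sketch building on execute_tool_safe above, with an illustrative threshold:

```python
import json
from collections import defaultdict

failure_counts: dict[str, int] = defaultdict(int)

def execute_with_cutoff(name: str, params: dict, max_failures: int = 3) -> str:
    if failure_counts[name] >= max_failures:
        # Stop letting the model retry; surface a terminal error instead
        return json.dumps({"status": "error",
                           "error": f"{name} disabled after repeated failures"})
    result = execute_tool_safe(name, params)
    if json.loads(result)["status"] == "error":
        failure_counts[name] += 1
    else:
        failure_counts[name] = 0  # reset on success
    return result
```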
Reasoning Failures
The hardest category. The model calls the wrong tool, enters a loop, or hallucinates tool parameters that are syntactically valid but semantically wrong (e.g., querying a user table with a product ID).
Mitigations:
- Iteration cap. Always set max_turns. 10-15 is reasonable for most tasks. Beyond that, the agent is probably stuck.
- Loop detection. Track the sequence of tool calls. If the same tool is called with identical or near-identical parameters twice in a row, inject a system message: “You’ve called this tool with similar parameters already. Try a different approach or provide a final answer.”
- Budget enforcement. Track cumulative tokens. If the agent has consumed 80% of its token budget without producing a final answer, force termination.
- Fallback prompting. After N failed iterations, inject a prompt that says: “Summarize what you’ve found so far and provide the best answer you can with available information.”
```python
def detect_loop(tool_calls: list[dict], window: int = 3) -> bool:
    if len(tool_calls) < window:
        return False
    recent = tool_calls[-window:]
    # Check if the same tool was called with the same params across the window
    signatures = [(c["name"], json.dumps(c["input"], sort_keys=True)) for c in recent]
    return len(set(signatures)) == 1
```
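Wiring the mitigations together might look like the helper below, a sketch that returns a corrective message to inject before the next LLM call, covering both loop detection and the fallback prompt:

```python
def pre_turn_injection(turn: int, max_turns: int, tool_calls: list[dict]) -> str | None:
    """Return a corrective user message to inject before the next call, or None."""
    if turn >= max_turns - 1:
        # Fallback prompting: force a best-effort answer on the final turn
        return ("Summarize what you've found so far and provide the best "
                "answer you can with available information.")
    if detect_loop(tool_calls):
        return ("You've called this tool with similar parameters already. "
                "Try a different approach or provide a final answer.")
    return None
```

Call it at the top of each loop iteration and, when it returns a string, append it as a user message before calling the model.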
Control Flow: Deterministic vs LLM-Driven Routing
A spectrum exists between fully deterministic pipelines (each step is hardcoded) and fully LLM-driven routing (the model decides everything). Most production systems land somewhere in the middle.
The control flow spectrum. Most production systems are hybrid.
Deterministic Pipelines
```python
# Step 1 always runs, Step 2 always runs, etc.
async def pipeline(query: str) -> str:
    search_results = await research_agent.run(f"Search for: {query}")
    analysis = await analysis_agent.run(f"Analyze: {search_results}")
    report = await writing_agent.run(f"Write report: {analysis}")
    return report
```
Predictable latency, predictable cost, easy to test. But inflexible — if the research step returns nothing useful, the analysis step runs anyway.
Hybrid Routing
The supervisor decides which workers to invoke, but from a constrained set of options. The routing is LLM-driven but bounded:
```python
ROUTING_PROMPT = """Based on the user's query, select which agents to invoke.
You MUST respond with valid JSON.

Available agents:
- research: for finding information from external sources
- database: for querying internal data
- calculator: for numerical analysis

Respond with:
{{"agents": ["research", "database"], "parallel": true}}
"""
```
The LLM picks from a fixed menu. It can’t invent new agents or execute arbitrary logic. This gives flexibility (skip the database step if the query is purely about external information) while maintaining predictability.
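A sketch of the dispatch side, assuming the agents are keyed by name, expose an async run method, and the model returns bare JSON; the async client and model choice are assumptions:

```python
import asyncio
import json
import anthropic

async_client = anthropic.AsyncAnthropic()

async def route_and_run(query: str, agents: dict) -> dict:
    response = await async_client.messages.create(
        model="claude-haiku-4-5-20250514",
        max_tokens=200,
        system=ROUTING_PROMPT.format(),  # resolves the {{ }} escapes
        messages=[{"role": "user", "content": query}],
    )
    decision = json.loads(response.content[0].text)
    selected = {n: agents[n] for n in decision["agents"] if n in agents}
    if decision.get("parallel"):
        results = await asyncio.gather(*[a.run(query) for a in selected.values()])
    else:
        results = [await a.run(query) for a in selected.values()]
    return dict(zip(selected.keys(), results))
```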
Fully Autonomous
The LLM decides what to do at every step, including whether to spawn sub-agents, what tools to use, and when to stop. This is the pattern most demos show. It’s also the least reliable in production.
The failure rate for fully autonomous multi-step tasks scales roughly with the number of decision points. If each routing decision has 90% accuracy, a 5-step pipeline has ~59% end-to-end success rate (0.9^5). At 95% per step, a 5-step pipeline reaches ~77%. These numbers improve with better models, but the multiplicative nature of sequential decisions means even small per-step error rates compound.
Recommendation: Use deterministic pipelines for workflows with known structure. Use hybrid routing when the workflow varies based on input but the set of possible paths is finite. Reserve fully autonomous routing for exploratory tasks where the workflow genuinely can’t be predicted in advance (e.g., open-ended research, complex debugging).
Human-in-the-Loop Patterns
Three points where human intervention commonly slots into agent workflows:
Three human-in-the-loop insertion points in agent workflows.
Approval Gates
Before executing side effects (sending an email, modifying a database, calling a paid API), pause the agent and present the proposed action to a human for approval.
```python
from datetime import timedelta

TOOLS_REQUIRING_APPROVAL = {"send_email", "update_database", "charge_payment"}

async def execute_with_approval(tool_name: str, params: dict, workflow_id: str):
    # store_pending_action, notify_approver, and wait_for_approval are
    # helpers elided for brevity
    if tool_name in TOOLS_REQUIRING_APPROVAL:
        # Persist the pending action
        await store_pending_action(workflow_id, tool_name, params)
        # Notify human (webhook, Slack, email)
        await notify_approver(workflow_id, tool_name, params)
        # Suspend workflow — Temporal, Inngest, or custom polling
        approved = await wait_for_approval(workflow_id, timeout=timedelta(hours=24))
        if not approved:
            return {"status": "rejected", "message": "Action rejected by reviewer"}
    return await execute_tool(tool_name, params)
```
This requires durable state. The agent workflow must be able to pause for hours or days and resume exactly where it stopped. Temporal handles this natively. Without a durable execution framework, implement it as a state machine persisted to a database, with a separate process that checks for approvals and resumes workflows.
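Without a workflow engine, the wait_for_approval helper above can be a simple poll against the same store used for checkpoints. A sketch reusing the Redis client from earlier; the key name is illustrative, and because this polls in-process it only survives as long as the process does:

```python
import asyncio
from datetime import datetime, timedelta

async def wait_for_approval(workflow_id: str, timeout: timedelta,
                            poll_interval: float = 5.0) -> bool:
    deadline = datetime.now() + timeout
    while datetime.now() < deadline:
        status = r.get(f"agent:approval:{workflow_id}")  # set by the review UI
        if status == b"approved":
            return True
        if status == b"rejected":
            return False
        await asyncio.sleep(poll_interval)
    return False  # treat timeout as rejection
```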
Escalation
When the agent exceeds its iteration budget or confidence drops below a threshold, escalate to a human. The key is providing enough context for the human to take over effectively:
```python
def escalate(workflow_id: str, agent_state: dict):
    summary = {
        "original_query": agent_state["query"],
        "steps_taken": len(agent_state["tool_calls"]),
        "partial_results": agent_state.get("partial_results"),
        "failure_reason": agent_state.get("failure_reason", "Max iterations reached"),
        "full_history_link": f"/admin/workflows/{workflow_id}",
    }
    send_to_human_queue(summary)
```
Cost and Latency Profiles
The choice of orchestration pattern has direct cost and latency implications. These are approximate ranges based on typical implementations.
| Pattern | LLM Calls per Task | Typical Latency | Token Usage | Best For |
|---|---|---|---|---|
| Single agent, 3 tool calls | 4 (3 tool rounds + final) | 3-8s | 5K-15K | Simple lookups, CRUD |
| Single agent, 8 tool calls | 9 | 10-25s | 20K-50K | Multi-step research |
| Multi-agent, 3 agents sequential | 8-15 | 15-45s | 30K-80K | Specialized pipelines |
| Supervisor + 3 parallel workers | 6-12 | 8-20s | 25K-60K | Complex, decomposable tasks |
| Supervisor + 5 workers, mixed | 12-25 | 20-60s | 50K-150K | Large research/analysis |
Parallel dispatch in the supervisor pattern is the primary latency advantage over sequential delegation. If three workers each take 5 seconds, sequential delegation takes 15+ seconds for the worker phase alone; parallel dispatch takes ~5 seconds.
Cost Optimization Techniques
Model routing per agent. Use expensive models only where they add value. A research agent that primarily calls search APIs doesn’t need Claude Opus 4.6; Claude Haiku 4.5 or GPT-4.1 Nano handles tool use adequately for straightforward retrieval.
Early termination. If the first tool call returns a complete answer, skip the remaining planned subtasks. The supervisor should re-evaluate after each batch of worker results.
Prompt caching. If agents share system prompts across invocations (they usually do), prompt caching from Anthropic and OpenAI can reduce input token costs by 75-90% on cached portions. This matters most for agents with long system prompts (1,000+ tokens).
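With the Anthropic API, this means attaching a cache_control marker to the prompt blocks that repeat across calls. A minimal sketch, reusing client, tools, and messages from the single-agent example; the prompt constant is a placeholder:

```python
response = client.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=4096,
    system=[{
        "type": "text",
        "text": AGENT_SYSTEM_PROMPT,  # long, identical across invocations
        "cache_control": {"type": "ephemeral"},  # cache everything up to here
    }],
    tools=tools,
    messages=messages,
)
```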
Token budgeting. Give each agent a token budget and track usage:
```python
from dataclasses import dataclass

@dataclass
class TokenBudget:
    max_input_tokens: int = 50_000
    max_output_tokens: int = 10_000
    used_input: int = 0
    used_output: int = 0

    @property
    def remaining_input(self) -> int:
        return self.max_input_tokens - self.used_input

    @property
    def exhausted(self) -> bool:
        return self.used_input >= self.max_input_tokens

    def record(self, input_tokens: int, output_tokens: int):
        self.used_input += input_tokens
        self.used_output += output_tokens
```
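Recording usage is one line per call; input_tokens and output_tokens on response.usage are real fields in the Anthropic response:

```python
budget = TokenBudget()

response = client.messages.create(model="claude-sonnet-4-5-20250514",
                                  max_tokens=1024, messages=messages)
budget.record(response.usage.input_tokens, response.usage.output_tokens)
if budget.exhausted:
    raise RuntimeError("Token budget exhausted; forcing termination")
```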
Framework Comparison
The major agent orchestration frameworks as of April 2026:
| Framework | Pattern Support | State Management | Language | Strengths | Weaknesses |
|---|---|---|---|---|---|
| LangGraph | All three | Built-in checkpointing | Python, JS | Flexible graph model, persistence | Complexity for simple use cases |
| OpenAI Agents SDK | Single, delegation | In-memory (extensible) | Python | Clean API, built-in handoffs | OpenAI-centric |
| CrewAI | Multi-agent, supervisor | Built-in | Python | Easy multi-agent setup | Less control over low-level flow |
| AutoGen (Microsoft) | All three | Conversation-based | Python | Strong multi-agent patterns | Steep learning curve |
| Mastra | All three | Built-in persistence | TypeScript | Good DX, workflow engine | Newer, smaller ecosystem |
| Custom (no framework) | Any | Roll your own | Any | Full control, no abstractions | More code to maintain |
LangGraph
LangGraph models agent workflows as state machines (graphs). Nodes are functions that transform state; edges define transitions. Conditional edges enable LLM-driven routing.
```python
from langgraph.graph import StateGraph, MessagesState
from langgraph.checkpoint.memory import MemorySaver

def call_model(state: MessagesState):
    response = model.invoke(state["messages"])
    return {"messages": [response]}

def call_tools(state: MessagesState):
    # Execute tool calls from the last message
    ...

def should_continue(state: MessagesState) -> str:
    # Route to the tools node while the model keeps requesting tool calls
    last = state["messages"][-1]
    return "tools" if getattr(last, "tool_calls", None) else "end"

graph = StateGraph(MessagesState)
graph.add_node("agent", call_model)
graph.add_node("tools", call_tools)
graph.add_edge("__start__", "agent")
graph.add_conditional_edges("agent", should_continue, {
    "tools": "tools",
    "end": "__end__",
})
graph.add_edge("tools", "agent")

app = graph.compile(checkpointer=MemorySaver())
```
LangGraph’s main advantage is built-in persistence and the ability to interrupt/resume at any node — enabling human-in-the-loop without external infrastructure.
OpenAI Agents SDK
OpenAI’s framework uses a handoff primitive for multi-agent delegation:
```python
from agents import Agent, Runner  # the openai-agents package

research_agent = Agent(
    name="Research",
    model="gpt-5.4",
    instructions="You research topics using search tools.",
    tools=[search_tool],
)

writer_agent = Agent(
    name="Writer",
    model="gpt-5.4",
    instructions="You write reports based on research.",
    handoffs=[],  # terminal agent
)

research_agent.handoffs = [writer_agent]

result = Runner.run_sync(research_agent, "Write a report on AI agent patterns")
```
Simple and clean for the delegation pattern. Less suitable for supervisor architectures, which require more manual construction.
When to Use No Framework
Frameworks add value when:
- Multiple agents need persistent state across process boundaries
- Human-in-the-loop gates are required
- The workflow graph is complex (5+ agents, conditional branching)
Frameworks add overhead when:
- A single agent with tool use is sufficient
- The workflow is a simple sequential pipeline
- Tight control over prompts and API calls is needed (frameworks often wrap the API in ways that hide prompt details)
For a single-agent loop with 3-5 tools, the raw API code from the beginning of this post is probably the right choice. Adding LangGraph to that is adding a dependency, a conceptual framework, and debugging complexity for a pattern that’s ~40 lines of code.
When to Use Which Pattern
Pattern selection based on task complexity.
Decision Framework
| Criterion | Single Agent | Multi-Agent Delegation | Supervisor |
|---|---|---|---|
| Task steps | 1-5 | 3-10 | 5-20+ |
| Domain breadth | Single domain | 2-3 domains | 3+ domains |
| Parallelism needed | No | Rarely | Often |
| Error accountability | Simple | Unclear | Clear |
| Latency sensitivity | Best | Worst (sequential) | Good (parallel) |
| Implementation effort | Low | Medium | High |
| Testing complexity | Low | Medium | High |
Start with a single agent. The most common mistake is reaching for multi-agent patterns prematurely. A single Claude Sonnet 4.6 or GPT-5.4 call with 8-10 well-designed tools handles the majority of real-world agent tasks. If you’re building a customer support agent that needs to look up orders, check shipping status, and issue refunds, that’s one agent with three tools — not three specialized agents.
Move to delegation when tool count exceeds ~15 or domains diverge. If the agent needs database tools, web search tools, code execution tools, and email tools, each with different authentication and error handling, splitting into specialized agents makes the system more maintainable.
Use a supervisor when tasks decompose into independent subtasks. The supervisor pattern only pays for its overhead when parallel execution is possible. If every step depends on the previous step’s output, a supervisor adds an extra LLM call (for decomposition) without improving latency.
Anti-Patterns
The “one agent per tool” anti-pattern. Creating a dedicated agent for each tool (a “search agent” that just calls search, a “calculator agent” that just runs calculations). The overhead of inter-agent communication far exceeds the benefit. Give the tools to one agent.
The “committee” anti-pattern. Multiple agents debate a decision, voting or iterating on each other’s outputs. Sounds appealing in theory. In practice, this is expensive (3-5x the token cost), slow, and the final output is usually no better than a single strong model’s output. Anthropic’s own research suggests that using a better model once outperforms using a weaker model in a multi-agent debate.
The “deep nesting” anti-pattern. A supervisor delegates to sub-supervisors, which delegate to workers, which sometimes delegate to other workers. Three levels deep is already hard to debug. Keep hierarchies flat: one supervisor, N workers.
Summary
The single-agent loop with tool use is the right default for most applications. It’s simple, predictable, and fast. Reach for multi-agent patterns only when task complexity, domain breadth, or parallelism requirements genuinely demand it.
State management is the hard part. Token budget tracking, context summarization, and checkpointing matter more than the choice between delegation and supervisor patterns.
Error recovery requires different strategies for different failure types: automatic retry for transient failures, LLM-informed recovery for tool failures, and programmatic intervention for reasoning failures. Loop detection and iteration caps are non-negotiable in production.
Hybrid control flow — deterministic pipeline structure with LLM-driven routing at specific decision points — outperforms both fully hardcoded and fully autonomous approaches for most production workloads.
Start with a single agent and the raw API. Add a framework when you need persistent state, human-in-the-loop, or complex multi-agent graphs. Not before.
Further Reading
- LangGraph Documentation — Official docs for LangGraph, covering state machines, persistence, and human-in-the-loop patterns
- OpenAI Agents SDK — OpenAI’s Python framework for agent orchestration with handoffs and tool use
- Anthropic Agent Documentation — Anthropic’s guide to building agents with Claude, including tool use patterns and best practices
- Microsoft AutoGen — Microsoft’s multi-agent conversation framework with support for various orchestration patterns
- CrewAI — Framework for orchestrating role-playing AI agents with built-in delegation and task management
- Temporal — Durable execution platform for long-running workflows, applicable to agent orchestration with checkpointing and retry semantics
- Mastra — TypeScript-first AI agent framework with built-in workflows and persistence
- Building effective agents (Anthropic) — Anthropic’s opinionated guide to agent architecture, arguing for simplicity over complexity
- Voyage AI Agent Patterns — Research on retrieval-augmented agent architectures and embedding-based tool selection
- LangGraph “Plan-and-Execute” Example — Reference implementation of the supervisor pattern with parallel worker dispatch