Office Hours — How are you using multi-agent AI systems effectively in your daily workflow?

Multi-agent systems are where most teams get excited and then hit a wall. The gap between “agents work great in demos” and “agents work reliably in production” is real, and model capability alone does not close it.

The honest starting point

I’m using agents for specific, bounded tasks where success is verifiable. Code review automation, document parsing with fallback validation, data pipeline orchestration. Not for ambiguous problems where “good enough” is subjective. That distinction matters.

The systems that work share one pattern: they’re solving problems where the success signal is clear and fast. A test passes or fails. A linter catches the issue or it doesn’t. A query returns results or it doesn’t. When you remove that feedback loop, agents drift.
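A minimal sketch of what a clear, fast success signal can look like in code. The names `parses_cleanly` and `gate` are hypothetical, and the syntax check stands in for whatever test suite or linter you actually run; the point is only that each check returns a plain yes or no.

```python
import ast
from typing import Callable


def parses_cleanly(source: str) -> bool:
    """Return True if `source` is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def gate(candidate: str, checks: list[tuple[str, Callable[[str], bool]]]) -> dict:
    """Run every named check; accept the candidate only if none failed."""
    failed = [name for name, check in checks if not check(candidate)]
    return {"accepted": not failed, "failed_checks": failed}


# Hypothetical usage: one good and one broken agent output.
checks = [("parses", parses_cleanly)]
good = gate("x = 1\n", checks)
bad = gate("def f(:\n", checks)
```

An agent's output goes through the gate before anything downstream sees it, so a bad result is caught within one step instead of drifting through the rest of the workflow.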

Concrete setup: Code review + fix workflow

Here’s what actually runs in production:

# Multi-agent orchestration for code review
from anthropic import Anthropic

client = Anthropic()

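def review_and_fix_code(diff_text: str, repo_context: str) -> dict:
    # Model names are written as in this post; check the exact ID strings
    # against the current model list before running.
    agents = {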
        "reviewer": {
            "role": "Analyzes code changes for issues",
            "model": "claude-opus-4.7"
        },
        "fixer": {
            "role": "Proposes concrete fixes based on review",
            "model": "claude-opus-4.7"
        },
        "validator": {
            "role": "Checks if fixes are safe and complete",
            "model": "claude-sonnet-4.6"
        }
    }
    
    # Agent 1: Reviewer identifies issues
    review_response = client.messages.create(
        model=agents["reviewer"]["model"],
        max_tokens=2000,
        system=f"""You are a code reviewer. Your job is to identify issues in code changes.
Be specific. List issues as a numbered checklist.
Context: {repo_context}""",
        messages=[{"role": "user", "content": f"Review this diff:\n{diff_text}"}]
    )
    issues = review_response.content[0].text
    
    # Agent 2: Fixer proposes changes
    fix_response = client.messages.create(
        model=agents["fixer"]["model"],
        max_tokens=3000,
        system=f"""You are a code fixer. Based on issues identified, propose concrete fixes.
Output only the fixed code blocks.
Context: {repo_context}""",
        messages=[
            {"role": "user", "content": f"Original diff:\n{diff_text}\n\nIssues found:\n{issues}\n\nPropose fixes."}
        ]
    )
    fixes = fix_response.content[0].text
    
    # Agent 3: Validator checks safety
    validation_response = client.messages.create(
        model=agents["validator"]["model"],
        max_tokens=1000,
        system="""You are a safety validator. Check if proposed fixes:
1. Actually address the issues
2. Don't introduce new problems
3. Follow the original intent
Respond with PASS or FAIL and reasoning.""",
        messages=[
            {"role": "user", "content": f"Issues:\n{issues}\n\nProposed fixes:\n{fixes}"}
        ]
    )
    validation = validation_response.content[0].text
    
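    # Treat the fix as mergeable only when the verdict itself is PASS.
    # A substring test would also match "FAIL: ... would PASS if ...".
    return {
        "issues": issues,
        "proposed_fixes": fixes,
        "validation": validation,
        "safe_to_merge": validation.strip().upper().startswith("PASS")
    }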

That’s it. Three focused agents, each with a single responsibility. No loops. No “agent decides to call agent 4.” No emergent behavior that surprises you at 2am.

Why this matters: the failure pattern

Most multi-agent setups fail because teams try to build “autonomous coordination.” Agent A calls Agent B, which decides it needs Agent C. The system becomes a Markov chain where every step compounds error.
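To put rough numbers on the compounding, here is a small worked example. It assumes each step succeeds independently with the same probability, and the 95% figure is purely illustrative, not a measurement from any real system.

```python
def chain_success_rate(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independence."""
    return per_step ** steps


for n in (1, 3, 5, 10):
    print(n, round(chain_success_rate(0.95, n), 3))
# 1 0.95
# 3 0.857
# 5 0.774
# 10 0.599
```

Under those assumptions, a ten-step chain of individually decent agents finishes cleanly only about 60% of the time, which is why every extra hop an agent decides to add on its own is expensive.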

The fix: explicit hand-offs. The orchestrator (your code) decides what happens next, not the agents. This is the difference between “running agents” and “using agents as tools in a structured workflow.”
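One way to sketch explicit hand-offs: the step order lives in a plain list owned by the orchestrator, and each step only reads and returns data. `run_pipeline` and the stub steps are hypothetical names for illustration; in a real workflow each stub would wrap a model call.

```python
from typing import Callable

Step = Callable[[dict], dict]


def run_pipeline(steps: list[tuple[str, Step]], state: dict) -> dict:
    """Run steps in a fixed order, merging each step's output into the state."""
    trace = []
    for name, step in steps:
        state = {**state, **step(state)}
        trace.append(name)
        if state.get("halt"):  # a step can stop the run, but never reorder it
            break
    return {**state, "trace": trace}


# Stub steps standing in for model calls.
def review(state: dict) -> dict:
    return {"issues": f"issues in {state['diff']}"}


def fix(state: dict) -> dict:
    return {"fixes": f"fixes for {state['issues']}"}


def validate(state: dict) -> dict:
    return {"verdict": "PASS"}


result = run_pipeline(
    [("review", review), ("fix", fix), ("validate", validate)],
    {"diff": "example.diff"},
)
```

Because the sequence is data in your code, the `trace` tells you exactly which steps ran, and no step can invent a fourth one.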

The cost reality

Three agents per task. Opus 4.7 for reasoning, Sonnet 4.6 for validation. Each call is a few hundred to a couple thousand tokens. A typical code review costs about 0.02 cents. At scale, that’s not free, but it’s cheaper than a senior engineer doing the review async.

If you run this wrong (agents hallucinating the next step, calling external APIs unnecessarily), costs blow up fast. Probably 20% of the tokens in my prompts above are unnecessary. Optimization matters when you’re running thousands of these.
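If you want to track this per workflow, a rough estimator is enough. The sketch below is an assumption-heavy illustration: `Call`, `call_cost`, and `workflow_cost` are hypothetical names, and the prices are placeholders, not real rates for any model. Token counts can be read from the `usage.input_tokens` and `usage.output_tokens` fields on each Messages API response.

```python
from dataclasses import dataclass


@dataclass
class Call:
    input_tokens: int
    output_tokens: int
    input_price_per_mtok: float   # USD per million input tokens (placeholder)
    output_price_per_mtok: float  # USD per million output tokens (placeholder)


def call_cost(call: Call) -> float:
    """Cost of one model call in USD."""
    total = (
        call.input_tokens * call.input_price_per_mtok
        + call.output_tokens * call.output_price_per_mtok
    )
    return total / 1_000_000


def workflow_cost(calls: list[Call]) -> float:
    """Total cost in USD of all calls in one workflow run."""
    return sum(call_cost(c) for c in calls)


# Placeholder numbers only; substitute your real token counts and current prices.
review_run = [
    Call(1500, 600, 1.0, 2.0),  # reviewer
    Call(2200, 900, 1.0, 2.0),  # fixer
    Call(1600, 200, 1.0, 2.0),  # validator
]
per_review_usd = workflow_cost(review_run)
```

Logging this number per run makes prompt bloat visible as a cost trend rather than a guess.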

Where agents fail reliably

Anything requiring subjective judgment across multiple inputs. “Is this refactor safe?” depends on business context, team standards, and unknown downstream dependencies. Agents can’t know what they don’t know.

Anything with loose success criteria. “Improve this document” or “make this API faster” will generate something, but you can’t tell if it’s actually better without manual review, which defeats the purpose.

Anything in unstructured environments. If your retrieval system sometimes returns garbage, or your external APIs are flaky, agents will confidently build on that garbage. Human-in-the-loop becomes mandatory.
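A minimal sketch of that human-in-the-loop routing, assuming retrieval results arrive as dicts with hypothetical `text` and `score` fields and that a score threshold is a meaningful quality signal for your retriever (it may not be).

```python
def route_retrieval(results: list[dict], min_score: float = 0.5,
                    min_results: int = 1) -> dict:
    """Send only usable documents to the agent; otherwise escalate to a person."""
    usable = [
        r for r in results
        if r.get("text", "").strip() and r.get("score", 0.0) >= min_score
    ]
    if len(usable) < min_results:
        # Nothing trustworthy to build on, so do not call the agent at all.
        return {"route": "human_review", "documents": []}
    return {"route": "agent", "documents": usable}


# Hypothetical inputs: one usable hit, then only empty or low-scoring hits.
ok = route_retrieval([{"text": "billing policy v3", "score": 0.82}])
garbage = route_retrieval([{"text": "   ", "score": 0.9},
                           {"text": "unrelated", "score": 0.1}])
```

The check runs in your code before the agent is invoked, so bad input stops at the boundary instead of becoming a confident answer.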

The single most important constraint

Keep agent workflows shallow. One or two layers of delegation max. If you’re designing a system where Agent A can spawn multiple instances of Agent B across multiple data sources, you’ve already lost observability.
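The depth limit can be enforced in code rather than left as a convention. This is a sketch under assumptions: `delegate` and `DelegationTooDeep` are hypothetical names, and handlers are assumed to accept the current depth so they can pass it along.

```python
from typing import Callable


class DelegationTooDeep(RuntimeError):
    """Raised when a workflow tries to delegate past the allowed depth."""


def delegate(task: str, handler: Callable[[str, int], str],
             depth: int = 0, max_depth: int = 2) -> str:
    """Hand `task` to `handler` one layer down, refusing to exceed `max_depth`."""
    if depth >= max_depth:
        raise DelegationTooDeep(f"refusing to delegate {task!r} at depth {depth}")
    return handler(task, depth + 1)


def leaf_handler(task: str, depth: int) -> str:
    # Does the work itself; never delegates further.
    return f"done: {task} (depth {depth})"


def greedy_handler(task: str, depth: int) -> str:
    # Always tries to delegate again, which the guard eventually stops.
    return delegate(task, greedy_handler, depth=depth)


summary = delegate("summarize report", leaf_handler)
```

A handler that keeps spawning sub-tasks fails loudly at the third layer instead of quietly fanning out beyond what you can observe.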

The teams that are actually shipping multi-agent systems successfully are treating them as deterministic orchestration with AI-powered steps, not as autonomous swarms.

Bottom line: Multi-agent systems work in production when you use them as structured tools with clear hand-offs, not as autonomous coordinating entities. Design for shallow workflows, measure success with fast feedback loops (tests, validators, linters), and use cheaper models for validation steps.

Question via Hacker News