Paper of the Week. Posted 2026-06-18. Tags: research, papers, arxiv, practical-ai

Paper of the Week — Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery

Routing accuracy for 110+ agents drops sharply past 50 tools — new study maps the degradation curve and identifies three recovery strategies that work today.

Weekly: one research paper, broken down for people who build things.

Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery

Kellen Gillespie, Robyn Perry. Published 2026-06-16. arXiv:2606.17519

One sentence summary

Routing accuracy in production agent systems degrades predictably as tool catalogs grow, but the paper identifies specific failure modes and recovery interventions that practitioners can apply without retraining.

Why this paper

If you’re building an enterprise assistant with more than a handful of specialized agents, routing is quietly your biggest reliability problem. This study uses a real, deployed 110-agent/584-tool catalog — not a toy benchmark — which makes the findings immediately actionable.

What they did

The authors evaluated three frontier models routing user requests across a 584-tool enterprise productivity catalog, measuring how accuracy changes as catalog size grows. They diagnosed why routing breaks — ambiguous tool descriptions, semantic overlap between agents, query phrasing mismatches — and tested targeted recovery strategies without touching the underlying models.

Key findings

  • Routing accuracy degrades non-linearly as catalog size grows: the steepest drops come at specific tool-count thresholds rather than as a gradual decline
  • All three evaluated frontier models show the same degradation curve shape, suggesting this is a catalog-structure problem, not a model-capability problem
  • Semantic overlap between similarly named tools is the leading cause of misroutes, ahead of query ambiguity and missing tool descriptions
  • Tool description rewriting alone recovers a meaningful portion of lost accuracy — in some cases closing the gap by more than half without any model changes
  • Ensemble routing (querying multiple models and taking a majority vote) provides further gains, but cost rises in proportion to the number of models queried
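
To make the ensemble idea concrete, here is a minimal Python sketch of majority-vote routing that escalates to the extra models only when the first router is unsure. It is an illustration under stated assumptions, not the paper's implementation: the `Router` signature, the function names, the tie-breaking rule, and the 0.8 confidence threshold are all hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

# Assumption: a router is any callable mapping a user query to
# (tool_name, confidence in [0, 1]). This is not the paper's interface.
Router = Callable[[str], Tuple[str, float]]


def majority_vote(votes: Sequence[str]) -> str:
    """Return the most common tool name; ties go to the earliest vote."""
    counts = Counter(votes)
    best = max(counts.values())
    return next(v for v in votes if counts[v] == best)


def route(query: str, primary: Router, others: Sequence[Router],
          threshold: float = 0.8) -> Tuple[str, int]:
    """Route with one model and fall back to an ensemble when unsure.

    Returns the chosen tool and the number of router calls made,
    so the cost of each decision stays visible.
    """
    tool, confidence = primary(query)
    if confidence >= threshold:
        # Confident single-model answer: one call, no ensemble cost.
        return tool, 1
    # Low confidence: collect one vote per model, primary included.
    votes = [tool] + [router(query)[0] for router in others]
    return majority_vote(votes), len(votes)
```

Returning the call count alongside the decision makes the cost trade-off measurable: you can log how often requests escalate and tune the threshold against your own misroute data.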

Why it matters for practitioners

Most teams building agent catalogs discover routing degradation only after deployment, when user complaints surface. This paper gives you a diagnostic framework to identify which failure mode you’re hitting before you waste cycles on the wrong fix — for instance, rewriting descriptions when you actually need to split an over-broad agent. If you’re targeting a catalog beyond 50 tools, this is a must-read before you scale.

What you can use today

  • Audit your tool descriptions for semantic overlap using embedding-based clustering: tool pairs whose description embeddings have very high cosine similarity are your highest-risk misroute pairs
  • Apply the paper’s description rewriting heuristics (specificity, distinct trigger phrases, negative examples of what the tool does not handle) to your most-confused tool pairs first
  • If you’re on a multi-model stack, use the ensemble routing pattern only for requests that a single-model router flags as low-confidence, to control cost while recovering accuracy on the hard cases
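
The overlap audit in the first item can be sketched in a few lines of Python. This version uses a plain pairwise cosine-similarity threshold as a simpler stand-in for full clustering, and it assumes you already have one embedding vector per tool description from whatever embedding model you use. The function names, the 0.9 threshold, and the toy three-dimensional vectors are made up for illustration.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def overlap_pairs(embeddings: Dict[str, Sequence[float]],
                  threshold: float = 0.9) -> List[Tuple[str, str, float]]:
    """List tool pairs at or above the threshold, most similar first."""
    pairs = [
        (a, b, cosine(embeddings[a], embeddings[b]))
        for a, b in combinations(sorted(embeddings), 2)
    ]
    risky = [p for p in pairs if p[2] >= threshold]
    return sorted(risky, key=lambda p: p[2], reverse=True)


# Toy vectors standing in for real description embeddings (hypothetical data).
toy = {
    "create_meeting": [0.90, 0.10, 0.00],
    "schedule_event": [0.88, 0.15, 0.02],
    "send_email": [0.05, 0.10, 0.95],
}
flagged = overlap_pairs(toy)
for first, second, score in flagged:
    print(f"{first} <-> {second}: {score:.3f}")
```

On the toy data only the `create_meeting` and `schedule_event` pair is flagged. In a real catalog, the flagged pairs are the ones to rewrite first. The all-pairs comparison is quadratic in the number of tools, which is still cheap at a few hundred tools.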