Paper of the Week — Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery

Kellen Gillespie, Robyn Perry. Published 2026-06-16. arXiv:2606.17519

One sentence summary

Routing accuracy in production agent systems degrades predictably as tool catalogs grow, but the paper identifies specific failure modes and recovery interventions that practitioners can apply without retraining.

Why this paper

If you’re building an enterprise assistant with more than a handful of specialized agents, routing is quietly your biggest reliability problem. This study uses a real, deployed 110-agent/584-tool catalog — not a toy benchmark — which makes the findings immediately actionable.

What they did

The authors evaluated three frontier models routing user requests across a 584-tool enterprise productivity catalog, measuring how accuracy changes as catalog size grows. They diagnosed why routing breaks — ambiguous tool descriptions, semantic overlap between agents, query phrasing mismatches — and tested targeted recovery strategies without touching the underlying models.

Key findings

Routing accuracy degrades non-linearly as catalog size grows: the steepest drops occur at certain tool-count thresholds rather than accumulating gradually
All three evaluated frontier models show the same degradation curve shape, suggesting this is a catalog-structure problem, not a model-capability problem
Semantic overlap between similarly named tools is the leading cause of misroutes, ahead of query ambiguity or missing tool descriptions
Tool description rewriting alone recovers a meaningful portion of lost accuracy, in some cases closing more than half of the gap without any model changes
Ensemble routing (querying multiple models and taking a majority vote) provides further gains, but cost rises in proportion to the number of models queried
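
To make the ensemble finding concrete, here is a minimal sketch of majority-vote routing. It is an illustration, not the paper's implementation: the `Router` type, the stub routers, and the tool names are hypothetical, and returning `None` on a tie is an assumption about how a caller might want to handle disagreement.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Hypothetical interface: a router maps a user request to a tool name.
Router = Callable[[str], str]


def majority_vote(request: str, routers: Sequence[Router]) -> Optional[str]:
    """Query every router once and return the most common tool.

    Returns None when the top two tools are tied, so the caller can decide
    how to break the tie (an assumption, not something the paper specifies).
    Cost is one model call per router, which is why it grows with ensemble size.
    """
    if not routers:
        raise ValueError("at least one router is required")
    votes = Counter(router(request) for router in routers)
    ranked = votes.most_common(2)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Stub routers standing in for three different models (illustrative only).
routers = [
    lambda q: "calendar.create_event",
    lambda q: "calendar.create_event",
    lambda q: "tasks.create_task",
]

print(majority_vote("Book a sync with Dana on Friday", routers))
```

With the stubs above, two of the three routers agree, so the vote resolves to `calendar.create_event`. In a real system each stub would be a call to a different model.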

Why it matters for practitioners

Most teams building agent catalogs discover routing degradation only after deployment, when user complaints surface. This paper gives you a diagnostic framework to identify which failure mode you’re hitting before you waste cycles on the wrong fix — for instance, rewriting descriptions when you actually need to split an over-broad agent. If you’re targeting a catalog beyond 50 tools, this is a must-read before you scale.

What you can use today

Audit your tool descriptions for semantic overlap using embedding-based clustering: tool pairs whose description embeddings have very high cosine similarity are your highest-risk misroute pairs
Apply the paper’s description rewriting heuristics (specificity, distinct trigger phrases, negative examples of what the tool does not handle) to your most-confused tool pairs first
If you’re on a multi-model stack, use the ensemble routing pattern only for requests that a single-model router flags as low-confidence, to control cost while recovering accuracy on the hard cases
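
The first and third items above can be sketched in a few lines. This is a rough illustration under stated assumptions, not the paper's method: it uses a plain pairwise cosine-similarity threshold as a simpler stand-in for clustering, the toy three-dimensional vectors stand in for real description embeddings, and the `0.9` similarity threshold, the `0.7` confidence cutoff, and all function and tool names are hypothetical values you would tune on your own catalog.

```python
import math
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def overlap_pairs(
    embeddings: Dict[str, Sequence[float]],
    threshold: float = 0.9,  # assumed cutoff; tune on your own catalog
) -> List[Tuple[str, str, float]]:
    """Return tool pairs at or above the threshold, most similar first."""
    scored = [
        (a, b, cosine(embeddings[a], embeddings[b]))
        for a, b in combinations(sorted(embeddings), 2)
    ]
    return sorted((p for p in scored if p[2] >= threshold), key=lambda p: -p[2])


def route_with_escalation(
    request: str,
    single_router: Callable[[str], Tuple[str, float]],  # returns (tool, confidence)
    ensemble_router: Callable[[str], str],
    min_confidence: float = 0.7,  # assumed cutoff for "low-confidence"
) -> str:
    """Use the cheap single-model route unless its confidence is too low."""
    tool, confidence = single_router(request)
    if confidence >= min_confidence:
        return tool
    return ensemble_router(request)


# Toy embeddings standing in for real description embeddings.
toy = {
    "calendar.create_event": [0.9, 0.1, 0.0],
    "calendar.schedule_meeting": [0.85, 0.15, 0.05],
    "expenses.submit_report": [0.0, 0.2, 0.95],
}
for a, b, score in overlap_pairs(toy):
    print(f"{a} <-> {b}: {score:.3f}")
```

On the toy data only the two calendar tools are flagged, which is the kind of pair you would rewrite first. The escalation helper keeps ensemble cost confined to requests where the single router reports low confidence. How that confidence is obtained (for example, from token log-probabilities or a self-reported score) is left open here.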