Paper of the Week — Learning to Rewrite Tool Descriptions for Reliable LLM-Agent Tool Use
Rewriting tool descriptions at deployment time—not training time—recovers most of the function-calling accuracy lost to poorly written API docs, which account for 20-40% of errors.
Learning to Rewrite Tool Descriptions for Reliable LLM-Agent Tool Use
Ruocheng Guo, Kaiwen Dong, Xiang Gao, Kamalika Das. Published 2026-02-23. arXiv:2602.20426
One-sentence summary
Rather than fine-tuning the agent itself, automatically rewriting tool descriptions at deployment time reliably closes most of the performance gap caused by poorly written API documentation.
Why this paper
Most teams building agents hit a ceiling where the model is capable but tool calls keep misfiring—and the instinct is to reach for more RLHF or better prompts on the agent side. This paper reframes that as a documentation quality problem, which is both cheaper to fix and more durable across model upgrades.
What they did
Tool descriptions shipped with real-world APIs are written for human developers, not LLMs—they omit parameter constraints, use ambiguous names, and bury critical edge cases in prose. The authors built a pipeline that takes existing tool schemas and uses an LLM to iteratively rewrite the descriptions, then scores each candidate rewrite against a small set of held-out invocation traces to keep only improvements. No agent retraining, no human annotation at scale.
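In code, the core loop might look something like the sketch below. This is a minimal reading of the pipeline, not the authors' implementation: `call_llm` and `score_on_traces` are hypothetical helpers standing in for an LLM call and a replay of a small set of held-out invocation traces that returns function-calling accuracy.

```python
# Minimal sketch of an iterative description-rewriting loop (assumed, not the paper's code).
# call_llm(prompt) -> rewritten description string
# score_on_traces(tool_schema, traces) -> function-calling accuracy on held-out traces

REWRITE_PROMPT = (
    "Rewrite this tool description so it is unambiguous to a code-generating "
    "model: state parameter types, constraints, defaults, and edge cases "
    "explicitly.\n\nOriginal description:\n{description}"
)

def improve_description(tool_schema, held_out_traces, call_llm, score_on_traces,
                        max_rounds=5):
    best_desc = tool_schema["description"]
    best_score = score_on_traces(tool_schema, held_out_traces)

    for _ in range(max_rounds):
        candidate = call_llm(REWRITE_PROMPT.format(description=best_desc))
        candidate_schema = {**tool_schema, "description": candidate}
        candidate_score = score_on_traces(candidate_schema, held_out_traces)

        # Keep the rewrite only if it measurably improves held-out accuracy.
        if candidate_score > best_score:
            best_desc, best_score = candidate, candidate_score

    return best_desc, best_score
```

The key property is that a candidate rewrite is accepted only when it beats the current description on traces it was not written against, which is what keeps the loop from drifting toward plausible-sounding but unhelpful descriptions.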
Key findings
- Poorly written tool descriptions account for 20-40% of function-calling errors across the benchmarks tested, independent of model size
- Their rewriting pipeline recovered the majority of that gap on ToolBench and APIBench without any changes to the underlying agent
- Iterative refinement (rewrite → eval → keep or discard) consistently outperformed single-pass rewriting by 8-15 percentage points
- Gains held across the frontier models tested, suggesting the fix is model-agnostic rather than tuned to one backend
- Descriptions rewritten for one model transferred reasonably well to others, reducing the cost of maintaining a multi-model tool registry
Why it matters for practitioners
If you’re running any tool-use pipeline—MCP servers, OpenAPI-backed agents, internal function libraries—your documentation is probably the weakest link, and you likely don’t have the labeled data to fine-tune around it. This gives you a principled, automatable way to audit and harden your tool layer before spending compute on model-side fixes.
What you can use today
- Audit your existing tool schemas by running a small batch of representative queries and logging where the model picks the wrong tool or malforms arguments—this surfaces which descriptions need rewriting first (a minimal audit sketch follows this list)
- Use any capable frontier model with a prompt that instructs it to rewrite a description to be “unambiguous to a code-generating model, not a human developer,” then score rewrites against your logged traces
- Treat rewritten descriptions as versioned artifacts in your tool registry—when you swap underlying models, re-run the scoring step rather than assuming the rewrite transfers perfectly
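For the audit step, a small harness like the one below is usually enough to rank which descriptions to rewrite first. It assumes a hypothetical `call_agent` that returns the chosen tool name and arguments for a query, and a `validate_args` helper that checks arguments against a tool's parameter schema; neither comes from the paper.

```python
# Hypothetical audit harness: replay representative queries against your tool
# registry and tally, per tool, how often the model picks the wrong tool or
# emits arguments that fail schema validation.

from collections import Counter

def audit_tool_descriptions(queries, tools, call_agent, validate_args):
    """queries: list of (user_query, expected_tool_name) pairs.
    tools: dict mapping tool name -> schema with a 'parameters' field."""
    wrong_tool = Counter()
    bad_args = Counter()

    for query, expected_tool in queries:
        tool_name, arguments = call_agent(query, tools)  # one function call per query
        if tool_name != expected_tool:
            wrong_tool[expected_tool] += 1
        elif not validate_args(tools[tool_name]["parameters"], arguments):
            bad_args[tool_name] += 1

    # Tools with the most failures are the first candidates for rewriting.
    return {
        name: {"wrong_tool": wrong_tool[name], "malformed_args": bad_args[name]}
        for name in tools
    }
```

Tools that accumulate the most wrong-tool or malformed-argument counts are the ones whose descriptions are doing the least work, and the natural place to point the rewriting loop.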