Paper of the Week — Learning to Rewrite Tool Descriptions for Reliable LLM-Agent Tool Use
Rewriting tool descriptions at deployment time—not training time—recovers most of the function-calling accuracy lost to poorly written API docs, which account for 20-40% of errors.
Learning to Rewrite Tool Descriptions for Reliable LLM-Agent Tool Use
Ruocheng Guo, Kaiwen Dong, Xiang Gao, Kamalika Das. Published 2026-02-23. arXiv:2602.20426
One-sentence summary
Rather than fine-tuning the agent itself, automatically rewriting tool descriptions at deployment time reliably closes most of the performance gap caused by poorly written API documentation.
Why this paper
Most teams building agents hit a ceiling where the model is capable but tool calls keep misfiring—and the instinct is to reach for more RLHF or better prompts on the agent side. This paper reframes that as a documentation quality problem, which is both cheaper to fix and more durable across model upgrades.
What they did
Tool descriptions shipped with real-world APIs are written for human developers, not LLMs—they omit parameter constraints, use ambiguous names, and bury critical edge cases in prose. The authors built a pipeline that takes existing tool schemas and uses an LLM to iteratively rewrite the descriptions, then scores each candidate rewrite against a small set of held-out invocation traces to keep only improvements. No agent retraining, no human annotation at scale.
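In code, the core loop might look something like the sketch below. This is a minimal reading of the pipeline, not the authors' implementation: `call_llm` and `score_on_traces` are hypothetical helpers standing in for an LLM call and a replay of a small set of held-out invocation traces that returns function-calling accuracy.

```python
# Minimal sketch of an iterative description-rewriting loop (assumed, not the paper's code).
# call_llm(prompt) -> rewritten description string
# score_on_traces(tool_schema, traces) -> function-calling accuracy on held-out traces

REWRITE_PROMPT = (
    "Rewrite this tool description so it is unambiguous to a code-generating "
    "model: state parameter types, constraints, defaults, and edge cases "
    "explicitly.\n\nOriginal description:\n{description}"
)

def improve_description(tool_schema, held_out_traces, call_llm, score_on_traces,
                        max_rounds=5):
    best_desc = tool_schema["description"]
    best_score = score_on_traces(tool_schema, held_out_traces)

    for _ in range(max_rounds):
        candidate = call_llm(REWRITE_PROMPT.format(description=best_desc))
        candidate_schema = {**tool_schema, "description": candidate}
        candidate_score = score_on_traces(candidate_schema, held_out_traces)

        # Keep the rewrite only if it measurably improves held-out accuracy.
        if candidate_score > best_score:
            best_desc, best_score = candidate, candidate_score

    return best_desc, best_score
```

The key property is that a candidate rewrite is accepted only when it beats the current description on traces it was not written against, which is what keeps the loop from drifting toward plausible-sounding but unhelpful descriptions.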
Key findings
- Poorly written tool descriptions account for 20-40% of function-calling errors across the benchmarks tested, independent of model size
- Their rewriting pipeline recovered the majority of that gap on ToolBench and APIBench without any changes to the underlying agent
- Iterative refinement (rewrite → eval → keep or discard) consistently outperformed single-pass rewriting by 8-15 percentage points
- Gains held across the frontier models tested, suggesting the fix is model-agnostic rather than tuned to one backend
- Descriptions rewritten for one model transferred reasonably well to others, reducing the cost of maintaining a multi-model tool registry
Why it matters for practitioners
If you’re running any tool-use pipeline—MCP servers, OpenAPI-backed agents, internal function libraries—your documentation is probably the weakest link, and you likely don’t have the labeled data to fine-tune around it. This gives you a principled, automatable way to audit and harden your tool layer before spending compute on model-side fixes.
What you can use today
- Audit your existing tool schemas by running a small batch of representative queries and logging where the model picks the wrong tool or malforms arguments—this surfaces which descriptions need rewriting first (a minimal audit sketch follows this list)
- Use any capable frontier model with a prompt that instructs it to rewrite a description to be “unambiguous to a code-generating model, not a human developer,” then score rewrites against your logged traces
- Treat rewritten descriptions as versioned artifacts in your tool registry—when you swap underlying models, re-run the scoring step rather than assuming the rewrite transfers perfectly
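For the audit step, a small harness like the one below is usually enough to rank which descriptions to rewrite first. It assumes a hypothetical `call_agent` that returns the chosen tool name and arguments for a query, and a `validate_args` helper that checks arguments against a tool's parameter schema; neither comes from the paper.

```python
# Hypothetical audit harness: replay representative queries against your tool
# registry and tally, per tool, how often the model picks the wrong tool or
# emits arguments that fail schema validation.

from collections import Counter

def audit_tool_descriptions(queries, tools, call_agent, validate_args):
    """queries: list of (user_query, expected_tool_name) pairs.
    tools: dict mapping tool name -> schema with a 'parameters' field."""
    wrong_tool = Counter()
    bad_args = Counter()

    for query, expected_tool in queries:
        tool_name, arguments = call_agent(query, tools)  # one function call per query
        if tool_name != expected_tool:
            wrong_tool[expected_tool] += 1
        elif not validate_args(tools[tool_name]["parameters"], arguments):
            bad_args[tool_name] += 1

    # Tools with the most failures are the first candidates for rewriting.
    return {
        name: {"wrong_tool": wrong_tool[name], "malformed_args": bad_args[name]}
        for name in tools
    }
```

Tools that accumulate the most wrong-tool or malformed-argument counts are the ones whose descriptions are doing the least work, and the natural place to point the rewriting loop.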