Office Hours · 2026-05-06 · Tags: office-hours, q-and-a, practical-ai

Office Hours — Are you using finetuning for LLM agents in production, and if so, what trade-offs did you encounter versus using base models with prompt engineering?

A daily developer question about AI/LLMs, answered with a direct, opinionated take.

Daily: One question from the trenches, one opinionated answer.

Are you using finetuning for LLM agents in production, and if so, what trade-offs did you encounter versus using base models with prompt engineering?

We’ve done both, and I’ll be direct: finetune for agents only if you have a very specific problem that base models with good prompting actually can’t solve. Otherwise you’re paying in latency, cost per token, and operational drag for marginal gains.

Where finetune actually won for us

We run a code review agent that flags security issues in pull requests. It needs to understand our codebase patterns, our threat model, and our internal security standards. We finetuned Claude Sonnet 4.6 on about 8,000 past review comments plus annotations from our security team. The result was a ~15% improvement in catching issues we actually cared about (fewer false positives on our specific patterns, better context selectivity).

The finetune made sense because:

  • We had high-quality training data (human reviews, not synthetic).
  • The task had measurable signal (did it catch real vulns? Did it reduce noise?).
  • The cost was front-loaded (one-time training), not per-inference.

Here’s the actual cost comparison we did:

Finetune path:
  - Training data prep: ~40 hours eng time + $2k labeling
  - Finetune API call: ~$800
  - Inference: $0.003 per 1k input tokens (vs $0.003 base Claude Sonnet)
  - Latency: +200-400ms (finetune models are slower to load)
  - Monthly cost for 50k reviews: ~$150

Prompt engineering path:
  - Time: ~20 hours eng time, iterative
  - Inference: $0.003 per 1k input tokens
  - Latency: baseline (no finetune penalty)
  - Monthly cost for 50k reviews: ~$150

Actual difference: cost was flat, but the finetune shortened the iteration loop and gave us better calibration on our specific false-positive profile.
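The comparison above is simple enough to check with back-of-envelope arithmetic. A minimal sketch, assuming ~1k input tokens per review and a ~$100/hr loaded engineering rate (both assumptions for illustration, not figures from the post):

```python
# Back-of-envelope cost comparison between the finetune and prompt paths.
# Per-1k-token price, review volume, and upfront line items are the ones
# quoted above; tokens-per-review and the hourly rate are assumptions.

PRICE_PER_1K_INPUT = 0.003   # $/1k input tokens, same for both paths here
TOKENS_PER_REVIEW = 1_000    # assumed average input size per review
REVIEWS_PER_MONTH = 50_000

def monthly_inference_cost(reviews: int, tokens_per_review: int,
                           price_per_1k: float) -> float:
    """Monthly inference spend: total input tokens times the per-1k rate."""
    return reviews * tokens_per_review / 1_000 * price_per_1k

HOURLY_RATE = 100                                    # assumed loaded eng rate
FINETUNE_UPFRONT = 40 * HOURLY_RATE + 2_000 + 800    # prep + labeling + training
PROMPT_UPFRONT = 20 * HOURLY_RATE                    # iterative prompt work

monthly = monthly_inference_cost(REVIEWS_PER_MONTH, TOKENS_PER_REVIEW,
                                 PRICE_PER_1K_INPUT)
print(f"monthly inference (either path): ${monthly:,.0f}")
print(f"upfront: finetune ${FINETUNE_UPFRONT:,} vs prompt ${PROMPT_UPFRONT:,}")
```

With these numbers both paths land at ~$150/month of inference; the only cost difference is the upfront delta, which is why the decision came down to calibration quality rather than spend.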

Where finetune absolutely didn’t work

We also tried finetuning an agent on our internal API documentation (Stripe-style, 200+ endpoints). Goal: the agent would call the right API with the right parameters without needing full prompt context every time.

We burned a week and ~$5k before realizing prompt engineering was strictly better:

  • The finetune picked up spurious correlations (certain API calls always followed others) that broke on new workflows.
  • Our API docs change every sprint. Retraining on each update was operationally insane.
  • A well-structured prompt with examples and a schema reference got 90% of the way there with zero retraining.
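For flavor, here is a minimal sketch of the "schema reference plus examples" prompt structure that replaced the API-docs finetune. The endpoint names, the example, and the `build_prompt` helper are all illustrative, not a real API:

```python
# Sketch: assemble a structured prompt from a machine-readable schema plus
# few-shot examples. Updating the docs means regenerating API_SCHEMA --
# no retraining cycle. All endpoint names here are hypothetical.
import json

API_SCHEMA = {
    "create_charge": {"params": ["amount", "currency", "customer_id"]},
    "refund_charge": {"params": ["charge_id", "amount"]},
}

FEW_SHOT = [
    {"request": "Refund $10 on charge ch_123",
     "call": {"endpoint": "refund_charge",
              "args": {"charge_id": "ch_123", "amount": 1000}}},
]

def build_prompt(user_request: str) -> str:
    """Structured prompt: instructions, schema reference, examples, request."""
    parts = [
        "You are an API-calling agent. Choose one endpoint and emit JSON args.",
        "Available endpoints:\n" + json.dumps(API_SCHEMA, indent=2),
        "Examples:\n" + "\n".join(json.dumps(ex) for ex in FEW_SHOT),
        f"Request: {user_request}\n"
        'Respond with JSON: {"endpoint": ..., "args": ...}',
    ]
    return "\n\n".join(parts)
```

The schema lives in data, not in weights, so a docs change on sprint boundaries is a one-line regeneration instead of a retraining cycle.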

The real tradeoffs

When prompt engineering wins: If your agent’s task involves reasoning over variable inputs (code review, analysis, generation), a good prompt beats a finetune nearly every time. The base model is already general enough; a finetune buys specialization you don’t need.

When finetune might win: If your agent does the same narrow thing repeatedly (classify documents in domain X, extract fields from form type Y, score candidates by your rubric), and you have 5,000+ examples of correct outputs, a finetune might cut inference latency by 30-40% (the examples are baked into the weights, so prompts shrink) and improve accuracy by 10-15%. But you have to own the retraining cost.

Hidden costs of finetune that nobody talks about:

  • You’re locked into that model version. Claude Sonnet 4.6 finetuned models won’t automatically benefit from Claude Opus 4.7 improvements.
  • Debugging is harder. When your agent fails, is it the finetune or the agent logic? You can’t swap models to narrow it down.
  • Team velocity drops. New team members can’t iterate on the prompt; they have to wait for a retraining cycle.

What we actually use now

Most of our agents run on Claude Opus 4.7 with no finetune, structured prompts, and few-shot examples. The only exception is the code review agent (which we update every quarter as our security posture changes).

For agents, I’d argue finetune is a local optimization when you should be optimizing the harness. Better prompt structure, smarter retrieval, tighter action loops, and better state management will beat a finetuned model 80% of the time. And when they don’t, you’ve usually got other problems (insufficient context, bad data quality, unclear task definition).
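As a concrete example of harness work, here is a toy retrieval step that narrows the doc context before prompting, instead of baking the docs into weights. The keyword scorer is a naive stand-in for a real retriever (embeddings, BM25, whatever you run), and the doc snippets are hypothetical:

```python
# Sketch of "optimize the harness, not the model": retrieve only the
# relevant doc snippets for this request, then prompt a base model with
# them. retrieve() uses naive keyword overlap purely for illustration.

DOCS = {
    "create_charge": "Create a charge. Params: amount, currency, customer_id.",
    "refund_charge": "Refund a charge. Params: charge_id, amount.",
    "list_customers": "List customers. Params: limit, starting_after.",
}

def retrieve(query: str, docs: dict, k: int = 2) -> list:
    """Rank doc snippets by keyword overlap with the query, return top k."""
    words = set(query.lower().split())
    scored = sorted(docs.items(),
                    key=lambda kv: -len(words & set(kv[1].lower().split())))
    return [text for _, text in scored[:k]]

def build_context(query: str) -> str:
    """Prompt context: only the retrieved snippets plus the task."""
    snippets = retrieve(query, DOCS)
    return "Relevant docs:\n" + "\n".join(snippets) + f"\n\nTask: {query}"
```

Tightening this loop (better retrieval, better state) improves every model you put behind it, which is exactly the leverage a finetune gives up.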

Bottom line: Skip finetune for agents unless you have domain-specific repetitive tasks with 5,000+ labeled examples and a stable task definition. For most agent work, invest in prompt engineering, few-shot examples, and better retrieval instead; you’ll iterate faster and pay less.

Question via Hacker News