Office Hours — How do you handle prompt fatigue and prevent LLMs from producing lazy or low-effort outputs in production systems?

Prompt fatigue is real. After a system has run the same task five thousand times, degradation shows up as shorter answers, recycled explanations, or surface-level reasoning. It’s not a hallucination problem; it’s an incentive problem. The model isn’t broken, and a hosted model’s weights don’t change just because you called it repeatedly. The system around it cuts corners because corner-cutting is what your metrics reward.

The Core Problem Isn’t the Model, It’s the Incentive

When you measure LLM performance on latency or cost-per-token, you’re explicitly rewarding brevity. Systems drift toward what you measure. A pipeline that could let the model spend ten seconds reasoning through a problem gets tuned to return a response in two, through trimmed prompts, tighter token caps, cheaper model tiers, or fine-tuning on feedback that never penalizes shallow work, only slowness. Over weeks or months of production changes, this compounds.

This is why setting temperature to 0 doesn’t fix it. Greedy decoding reduces randomness but doesn’t add rigor. A model at temperature 0 will consistently produce its most likely response, and if the prompt and token budget favor the fast path, that is still the lazy path.

Structural Fixes Over Prompt Tweaks

The phrase “prompt fatigue” itself is misleading because it suggests the problem lives in your prompts. It doesn’t. It lives in your evaluation criteria and how you’ve structured the task.

If you’re seeing lazy outputs, start by examining what you’re actually measuring. Most production systems optimize for:

Time to first token (latency)
Total tokens generated (cost)
API call count (throughput)

None of these measure reasoning quality. You need a fourth metric: actual correctness on tasks where laziness causes failures. This means building evals that catch the difference between a shallow answer and a deep one.

Here’s a concrete pattern. Suppose you’re using an LLM to classify support tickets. You measure classification accuracy, response time, and cost per classification. Your system starts out at 94% accuracy. After six months of production traffic, it drifts to 88%. The model hasn’t necessarily degraded. Successive latency and cost optimizations have pushed the system to classify faster by being less careful.

The fix: add a hold-out eval set of genuinely ambiguous tickets where lazy reasoning fails. Run this eval daily. When accuracy drops below a threshold on your hard cases (separate from your bulk accuracy metric), trigger an alert. Now the system is held to correctness on the cases that matter, not just bulk throughput.
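A minimal sketch of that hard-case check, assuming your classifier can be wrapped as a function from ticket text to label. The names `check_hard_cases` and `HARD_CASE_THRESHOLD` and the 0.80 threshold are illustrative assumptions, not values from a real deployment.

```python
from typing import Callable, Sequence

# Hypothetical threshold: derive the real one from your own baseline runs.
HARD_CASE_THRESHOLD = 0.80


def accuracy(classify: Callable[[str], str],
             examples: Sequence[tuple[str, str]]) -> float:
    """Fraction of (ticket_text, expected_label) pairs classified correctly."""
    if not examples:
        raise ValueError("eval set is empty")
    correct = sum(1 for text, expected in examples if classify(text) == expected)
    return correct / len(examples)


def check_hard_cases(classify: Callable[[str], str],
                     hard_examples: Sequence[tuple[str, str]],
                     threshold: float = HARD_CASE_THRESHOLD) -> tuple[float, bool]:
    """Return (accuracy, alert_needed) for the hard-case hold-out set only."""
    acc = accuracy(classify, hard_examples)
    return acc, acc < threshold
```

Keeping this check separate from the bulk accuracy job means a drop on ambiguous tickets can page someone even while the aggregate number still looks healthy.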

Use Structured Outputs as a Friction Point

Lazy outputs often manifest as underspecified or vague responses. Structured outputs force the model to commit to specificity. If your LLM is supposed to extract three fields from a document but keeps returning one field with a generic summary, use schema enforcement.

Here’s a practical example. You’re using Claude Opus 4.8 to extract pricing information from contracts. After three months, you notice it’s returning price: "call for details" far more often than it should. That’s laziness, not inability.

Add a schema that forbids generic values:

{
  "type": "object",
  "properties": {
    "unit_price": {
      "type": ["number", "null"],
      "description": "Numeric price per unit. If not found in document, return null with explanation in price_note field"
    },
    "price_note": {
      "type": "string",
      "description": "Explanation if unit_price is null: document states no price, pricing incomplete, non-standard terms, etc."
    }
  },
  "required": ["unit_price", "price_note"]
}

Now the model can’t hand-wave. It has to either extract a number or explicitly justify why. The friction of having to produce a reason forces more careful work.
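A schema constrains the shape of the output, but it can’t tell a real explanation from a hand-wave in the note field. One way to close that gap is a small post-check on the parsed response. This is a sketch using only the standard library; `validate_extraction` and the `GENERIC_NOTES` deny-list are hypothetical names, and the phrases in the list are assumptions you would replace with what shows up in your own logs.

```python
import json

# Hypothetical deny-list of hand-wave phrases; extend it from your own logs.
GENERIC_NOTES = {"", "n/a", "unknown", "call for details", "see document"}


def validate_extraction(raw_json: str) -> list[str]:
    """Return a list of problems; an empty list means the output is acceptable."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    problems = [f"missing field: {field}"
                for field in ("unit_price", "price_note")
                if field not in data]
    if problems:
        return problems

    price, note = data["unit_price"], data["price_note"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_number = isinstance(price, (int, float)) and not isinstance(price, bool)
    if price is not None and not is_number:
        problems.append("unit_price must be a number or null")
    if not isinstance(note, str):
        problems.append("price_note must be a string")
    elif price is None and note.strip().lower() in GENERIC_NOTES:
        problems.append("null unit_price needs a specific price_note")
    return problems
```

Responses that fail the check can be retried or routed to review, and the failure rate itself becomes a laziness metric worth tracking.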

Chain-of-Thought as a Cost, Not a Free Upgrade

Every generation of models includes better chain-of-thought performance. Teams immediately see this as free reasoning quality and assume they don’t need to prompt for it explicitly anymore. Then production traffic grows, cost pressure increases, and prompts get shortened. Reasoning disappears.

Make chain-of-thought explicit in your production prompts and measure its cost separately from final answer cost. If your prompt says “think step by step,” log the reasoning tokens as a separate metric. This prevents cost optimization from silently deleting the reasoning phase.
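A rough sketch of logging reasoning separately, assuming the prompt asks the model to write its reasoning first and then a line starting with `FINAL ANSWER:`. That marker, the `TokenSplit` class, and the whitespace-based token count are all assumptions for illustration; in practice you would use your provider’s reported usage or tokenizer.

```python
from dataclasses import dataclass

ANSWER_MARKER = "FINAL ANSWER:"  # hypothetical delimiter requested in the prompt


@dataclass
class TokenSplit:
    reasoning_tokens: int
    answer_tokens: int

    @property
    def reasoning_share(self) -> float:
        """Fraction of the completion spent on reasoning."""
        total = self.reasoning_tokens + self.answer_tokens
        return self.reasoning_tokens / total if total else 0.0


def rough_token_count(text: str) -> int:
    """Whitespace word count as a crude proxy for real token counts."""
    return len(text.split())


def split_usage(completion: str) -> TokenSplit:
    """Split a completion into reasoning and answer parts at the marker."""
    reasoning, found, answer = completion.partition(ANSWER_MARKER)
    if not found:
        # No marker: the reasoning phase was skipped or the format was ignored.
        return TokenSplit(0, rough_token_count(completion))
    return TokenSplit(rough_token_count(reasoning), rough_token_count(answer))
```

If `reasoning_share` trends toward zero after a round of prompt shortening, that is the reasoning phase being deleted, and the dashboard shows it instead of hiding it inside a total-cost number.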

If you’re on GPT-5.5 or Claude Opus 4.8, you have access to extended thinking modes. These are not free, and they should not be used everywhere. Use them on the subset of tasks where your evals show that shallow reasoning fails. For everything else, structured prompts with explicit reasoning steps cost less and often work fine.
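One way to pick that subset is to let eval results drive the routing. The sketch below assumes you have per-category accuracy for both modes; the category names, the scores, and the 0.05 minimum gain are made-up illustrations, and the actual call into a provider’s thinking mode is deliberately left out.

```python
# Hypothetical per-category eval results: accuracy with and without extended thinking.
EVAL_RESULTS = {
    "refund_dispute": {"standard": 0.71, "extended": 0.90},
    "password_reset": {"standard": 0.98, "extended": 0.98},
    "contract_pricing": {"standard": 0.80, "extended": 0.87},
}

MIN_GAIN = 0.05  # assumed minimum accuracy gain that justifies the extra cost


def categories_needing_extended(results: dict, min_gain: float = MIN_GAIN) -> set:
    """Categories where extended thinking beats the standard prompt by min_gain or more."""
    return {
        category
        for category, scores in results.items()
        if scores["extended"] - scores["standard"] >= min_gain
    }


def choose_mode(category: str, extended_categories: set) -> str:
    """Route one request; unknown categories fall back to the cheaper standard mode."""
    return "extended" if category in extended_categories else "standard"
```

Recomputing the set whenever the evals run keeps the routing tied to evidence rather than to a one-time judgment about which tasks are hard.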

Detect Degradation Before It Happens

Set up regression detection on your core evals. Most teams only look at aggregate metrics. You need per-difficulty and per-category breakdown. A classifier that maintains 95% accuracy on easy cases but drops to 60% on hard cases has degraded significantly, but bulk accuracy might still look okay.

Here’s a monitoring pattern:

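# Track accuracy by difficulty tier
# Illustrative baselines; in practice, load these from your stored eval history.
BASELINES = {"easy": 0.97, "medium": 0.88, "hard": 0.71}

def get_baseline_for_tier(tier):
    return BASELINES[tier]

def alert(message):
    # Placeholder: route to your paging or chat system instead of stdout.
    print(f"ALERT: {message}")

eval_results_daily = {
    "easy": 0.97,      # Should stay >0.94
    "medium": 0.88,    # Should stay >0.85
    "hard": 0.71,      # Should stay >0.68 (catching degradation early)
}

for tier, accuracy in eval_results_daily.items():
    baseline = get_baseline_for_tier(tier)
    if accuracy < (baseline * 0.97):  # 3% relative drop triggers alert
        alert(f"Degradation detected in {tier}: {accuracy} vs baseline {baseline}")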

Run this daily or weekly depending on your traffic volume. Hard cases are your canary. They degrade first when the model is taking shortcuts.
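To see why the aggregate hides the problem, here is a small sketch that computes per-tier and overall accuracy from raw eval records. The 900/100 split between easy and hard tickets is an assumed mix chosen for illustration; the 95% and 60% figures match the example above.

```python
from collections import defaultdict


def accuracy_by_tier(records):
    """records: iterable of (tier, is_correct) pairs. Returns {tier: accuracy}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for tier, is_correct in records:
        totals[tier] += 1
        correct[tier] += int(is_correct)
    return {tier: correct[tier] / totals[tier] for tier in totals}


def overall_accuracy(records):
    """Single aggregate accuracy across all tiers."""
    records = list(records)
    return sum(int(ok) for _, ok in records) / len(records)


# Illustrative mix: 900 easy tickets at 95% and 100 hard tickets at 60%.
records = (
    [("easy", True)] * 855 + [("easy", False)] * 45
    + [("hard", True)] * 60 + [("hard", False)] * 40
)
```

With this mix, the overall figure comes out to 91.5% (915 correct out of 1,000), which looks fine on a dashboard even though four in ten hard cases are wrong.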

Rotate Your Eval Set

Using the same evals forever teaches you, and any fine-tuning loop you run, to optimize specifically for your eval set. This is a different kind of laziness: prompts and tuned models pick up the pattern of your tests rather than the underlying skill.

Every two weeks, refresh 20% of your eval set with new examples from the same distribution. This keeps prompt iteration and fine-tuning from converging on test-specific shortcuts. You’ll see temporary metric dips when fresh examples land, but this is a sign your evals are doing their job.
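A sketch of the 20% refresh, assuming eval examples are hashable items (IDs or strings) and that you maintain a pool of labeled candidates from the same distribution. The function name `rotate_eval_set` and its parameters are illustrative.

```python
import random


def rotate_eval_set(current, candidate_pool, fraction=0.20, seed=None):
    """Replace `fraction` of `current` with unseen examples from `candidate_pool`."""
    rng = random.Random(seed)
    n_swap = round(len(current) * fraction)
    seen = set(current)
    fresh = [ex for ex in candidate_pool if ex not in seen]
    if len(fresh) < n_swap:
        raise ValueError("not enough unseen examples in the candidate pool")
    # Pick which positions to retire, keep the rest, then append new examples.
    retire = set(rng.sample(range(len(current)), n_swap))
    kept = [ex for i, ex in enumerate(current) if i not in retire]
    return kept + rng.sample(fresh, n_swap)
```

Passing a fixed `seed` makes a given rotation reproducible, which helps when you need to explain a metric dip after the fact. Retired examples are worth archiving so they can be reused later as a long-term regression set.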

When to Replace vs. Retune

If degradation persists after structural fixes, consider upgrading the model. Claude Opus 4.8 maintains reasoning quality under higher token throughput than Claude Sonnet 4.6. GPT-5.5 shows better consistency on repeated classifications than GPT-5.4. These aren’t massive differences, but they’re real.

Before upgrading, benchmark the specific task on both models with your eval set. A $0.03/1K token difference matters at scale, but not if it gains you 6% accuracy on hard cases.
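The tradeoff is easier to argue about as arithmetic. This sketch turns a price difference and a hard-case accuracy gain into a cost per additional correct hard case. The function name and every input in the usage note are assumptions for illustration, and the 6% gain is treated here as six percentage points.

```python
def upgrade_tradeoff(price_diff_per_1k, tokens_per_request, requests_per_month,
                     hard_case_share, hard_accuracy_gain):
    """Return (extra monthly cost, extra correct hard cases, cost per extra correct case)."""
    extra_cost = price_diff_per_1k * (tokens_per_request / 1000) * requests_per_month
    extra_correct = requests_per_month * hard_case_share * hard_accuracy_gain
    cost_per_fix = extra_cost / extra_correct if extra_correct else float("inf")
    return extra_cost, extra_correct, cost_per_fix
```

For example, `upgrade_tradeoff(0.03, 2000, 100_000, 0.10, 0.06)` models a $0.03/1K difference at 2,000 tokens per request and 100,000 requests a month, with 10% of traffic being hard cases. That works out to roughly $6,000 more per month for about 600 additional correct hard cases, or around $10 each. Whether that is cheap depends on what a wrong answer on a hard case costs you.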

Bottom line: Prompt fatigue is a system design problem, not a model problem. Measure reasoning quality separately from speed and cost, use structured outputs to force specificity, and monitor per-difficulty degradation to catch shortcuts before they compound. When evals stay green, your model is working. When they drift, your incentives have drifted, and no prompt change will fix it.

Question via Hacker News