Office Hours — Can you personalize an LLM by fine-tuning it on conversation history, or is that the wrong approach?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
Fine-tuning on conversation history is almost always the wrong move. You’re spending compute and money to make a model memorize one person’s chat patterns when you could just pass relevant context directly in the system prompt or conversation window.
Here’s the real issue: fine-tuning is slow (hours to days), expensive, and you’d need to retrain every time the user’s preferences shift. Meanwhile, context windows on GPT-5.5 and Claude Opus 4.7 are massive now. Just stuff the last 20 conversations or a summary of interaction patterns into the prompt. The model will pick up the user’s tone, preferences, and history instantly without any training.
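Concretely, "stuffing history into the prompt" is just assembling the messages you already have before each request. A minimal sketch (the message shape mirrors common chat APIs, but the function and its parameters are illustrative assumptions, not a specific vendor SDK):

```python
# Build a personalized request by prepending stored history to the prompt.
# No fine-tuning involved: the model sees the user's patterns at inference time.

def build_messages(system_prompt, history, new_message, max_history=20):
    """history is a list of {"role": ..., "content": ...} dicts from prior chats."""
    recent = history[-max_history:]  # keep only the last N turns
    return (
        [{"role": "system", "content": system_prompt}]
        + recent
        + [{"role": "user", "content": new_message}]
    )

history = [
    {"role": "user", "content": "Keep answers short."},
    {"role": "assistant", "content": "Will do."},
]
messages = build_messages("You are a helpful assistant.", history, "Explain mutexes.")
print(len(messages))  # system + 2 history turns + new question = 4
```

Swap `max_history` for a summarization step once histories outgrow the context window; the shape of the request stays the same.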
Why In-Context Beats Fine-Tuning
The math is straightforward. A single fine-tuning job on Claude Opus 4.7 runs roughly $3-8 per million training tokens, and you’re looking at hours of turnaround for a model that’s only optimized for one user. In that same time, you load 50,000 tokens of conversation history into the prompt of GPT-5.5 or Claude Sonnet 4.6 (which still handles context efficiently) and get personalization on the very next request. The user sees immediate adaptation; you see zero training overhead.
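A back-of-envelope comparison using the figures above. Every number here is an illustrative assumption (corpus size, per-token inference price, request count), not vendor pricing:

```python
# Rough cost comparison: one fine-tuning run vs. in-context personalization.
# All numbers are illustrative assumptions, not actual vendor pricing.

FINE_TUNE_PER_M = 5.00        # midpoint of the quoted $3-8 per million tokens
TRAINING_TOKENS = 10_000_000  # assumed size of one user's training corpus

INFERENCE_PER_M = 3.00        # assumed input-token price for in-context use
CONTEXT_TOKENS = 50_000       # history loaded per request, as in the text
REQUESTS = 100                # requests served before preferences shift

fine_tune_cost = FINE_TUNE_PER_M * TRAINING_TOKENS / 1_000_000
in_context_cost = INFERENCE_PER_M * CONTEXT_TOKENS * REQUESTS / 1_000_000

print(f"fine-tune once:   ${fine_tune_cost:.2f}")   # $50.00
print(f"in-context x100:  ${in_context_cost:.2f}")  # $15.00
```

The catch the arithmetic makes visible: in-context cost accrues per request, while fine-tuning recurs on every preference shift. At very high request volumes, summarizing history or prompt caching keeps the in-context side cheap.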
Even better: if the user’s preferences change, you update the context. No retraining. No version management. You’re not baking brittle assumptions into model weights.
The only time fine-tuning makes sense is if you’re trying to teach the model a consistent behavioral pattern that applies across thousands of users or a massive domain-specific knowledge base. Even then, you’re usually better off with retrieval plus good prompting first.
A Practical Example
Instead of fine-tuning, structure personalization like this:
System prompt:
User preferences: {
  "tone": "direct, no fluff",
  "technical_level": "senior engineer",
  "preferred_language": "Go",
  "known_constraints": ["latency-critical", "runs on ARM"]
}
Recent context (last 5 interactions):
[conversation_history_here]
Load this into every request. The model infers preferences from both the structured config and conversation patterns without ever touching weights. This scales to millions of users. Fine-tuning does not.
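Assembling that block programmatically is a few lines. A sketch assuming preferences live in a plain dict and history in a list of role/content turns (the function name and schema are made up for illustration):

```python
import json

def render_system_prompt(preferences, history, max_turns=5):
    """Render a structured personalization prompt: prefs as JSON, then recent turns."""
    recent = history[-max_turns:]
    transcript = "\n".join(f"{turn['role']}: {turn['content']}" for turn in recent)
    return (
        "User preferences: "
        + json.dumps(preferences, indent=2)
        + f"\n\nRecent context (last {len(recent)} interactions):\n"
        + transcript
    )

prefs = {
    "tone": "direct, no fluff",
    "technical_level": "senior engineer",
    "preferred_language": "Go",
    "known_constraints": ["latency-critical", "runs on ARM"],
}
history = [
    {"role": "user", "content": "Prefer table output."},
    {"role": "assistant", "content": "Noted."},
]
prompt = render_system_prompt(prefs, history)
print(prompt.splitlines()[0])  # User preferences: {
```

Because the prompt is rebuilt on every request, a changed preference takes effect immediately, which is exactly the property fine-tuning can't give you.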
Edge Cases Worth Considering
There are rare exceptions. If your user base is homogeneous and you’re optimizing for a very specific, measurable output format (like structured clinical notes from a particular hospital system), fine-tuning on thousands of examples might shave latency or improve consistency enough to justify the cost. But you’re still better off testing whether a better system prompt or a retrieval layer gets you 80% of the way there first.
Also: fine-tuned models are harder to version, audit, and debug. You lose interpretability. If the model starts behaving oddly, you can’t easily trace whether it’s the fine-tuning or the context. With pure prompting, the signal is cleaner.
For real personalization, think about what actually matters: user preferences, prior context, feedback loops. Those live better in a vector DB or a simple JSON config file than baked into model weights.
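With preferences in a config file, updating them is a file edit, not a training run. A minimal sketch (the file path and default schema are assumptions):

```python
import json
from pathlib import Path

PREFS_PATH = Path("user_prefs.json")  # hypothetical per-user config location

def load_prefs():
    """Read stored preferences, falling back to defaults for a new user."""
    if PREFS_PATH.exists():
        return json.loads(PREFS_PATH.read_text())
    return {"tone": "neutral"}

def update_pref(key, value):
    """Change one preference and persist it; takes effect on the next request."""
    prefs = load_prefs()
    prefs[key] = value
    PREFS_PATH.write_text(json.dumps(prefs, indent=2))
    return prefs

prefs = update_pref("tone", "direct, no fluff")
print(prefs["tone"])  # direct, no fluff
```

The same read/update pattern works against a database row or a vector store entry; the point is that user state stays in data you can inspect and diff, not in opaque weights.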
Bottom line: Use your context window to pass conversation history directly into prompts instead of fine-tuning. Save fine-tuning for broad behavioral changes across your entire user base or domain-specific knowledge at scale, not individual personalization.
Question via Hacker News