Office Hours — Can you personalize an LLM by fine-tuning it on conversation history, or is that the wrong approach?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
Fine-tuning on conversation history is almost always the wrong move. You’re spending compute and money to make a model memorize one person’s chat patterns when you could just pass relevant context directly in the system prompt or conversation window.
Here’s the real issue: fine-tuning is slow (hours to days), expensive, and you’d need to retrain every time the user’s preferences shift. Meanwhile, context windows on GPT-5.5 and Claude Opus 4.7 are massive now. Just stuff the last 20 conversations or a summary of interaction patterns into the prompt. The model will pick up the user’s tone, preferences, and history instantly without any training.
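Concretely, "stuffing history into the prompt" is just assembling the messages you already have before each request. A minimal sketch (the message shape mirrors common chat APIs, but the function and its parameters are illustrative assumptions, not a specific vendor SDK):

```python
# Build a personalized request by prepending stored history to the prompt.
# No fine-tuning involved: the model sees the user's patterns at inference time.

def build_messages(system_prompt, history, new_message, max_history=20):
    """history is a list of {"role": ..., "content": ...} dicts from prior chats."""
    recent = history[-max_history:]  # keep only the last N turns
    return (
        [{"role": "system", "content": system_prompt}]
        + recent
        + [{"role": "user", "content": new_message}]
    )

history = [
    {"role": "user", "content": "Keep answers short."},
    {"role": "assistant", "content": "Will do."},
]
messages = build_messages("You are a helpful assistant.", history, "Explain mutexes.")
print(len(messages))  # system + 2 history turns + new question = 4
```

Swap `max_history` for a summarization step once histories outgrow the context window; the shape of the request stays the same.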
Why In-Context Beats Fine-Tuning
The math is straightforward. A single fine-tuning job on Claude Opus 4.7 runs roughly $3-8 per million training tokens, and you’re looking at hours of turnaround for a model that’s only optimized for one user. In that same time, you load 50,000 tokens of conversation history into the prompt of GPT-5.5 or Claude Sonnet 4.6 (which still handles context efficiently) and get personalization on the very next request. The user sees immediate adaptation; you see zero training overhead.
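A back-of-envelope comparison using the figures above. Every number here is an illustrative assumption (corpus size, per-token inference price, request count), not vendor pricing:

```python
# Rough cost comparison: one fine-tuning run vs. in-context personalization.
# All numbers are illustrative assumptions, not actual vendor pricing.

FINE_TUNE_PER_M = 5.00        # midpoint of the quoted $3-8 per million tokens
TRAINING_TOKENS = 10_000_000  # assumed size of one user's training corpus

INFERENCE_PER_M = 3.00        # assumed input-token price for in-context use
CONTEXT_TOKENS = 50_000       # history loaded per request, as in the text
REQUESTS = 100                # requests served before preferences shift

fine_tune_cost = FINE_TUNE_PER_M * TRAINING_TOKENS / 1_000_000
in_context_cost = INFERENCE_PER_M * CONTEXT_TOKENS * REQUESTS / 1_000_000

print(f"fine-tune once:   ${fine_tune_cost:.2f}")   # $50.00
print(f"in-context x100:  ${in_context_cost:.2f}")  # $15.00
```

The catch the arithmetic makes visible: in-context cost accrues per request, while fine-tuning recurs on every preference shift. At very high request volumes, summarizing history or prompt caching keeps the in-context side cheap.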
Even better: if the user’s preferences change, you update the context. No retraining. No version management. You’re not baking brittle assumptions into model weights.
The only time fine-tuning makes sense is if you’re trying to teach the model a consistent behavioral pattern that applies across thousands of users or a massive domain-specific knowledge base. Even then, you’re usually better off with retrieval plus good prompting first.
A Practical Example
Instead of fine-tuning, structure personalization like this:
System prompt:
User preferences: {
  "tone": "direct, no fluff",
  "technical_level": "senior engineer",
  "preferred_language": "Go",
  "known_constraints": ["latency-critical", "runs on ARM"]
}
Recent context (last 5 interactions):
[conversation_history_here]
Load this into every request. The model infers preferences from both the structured config and conversation patterns without ever touching weights. This scales to millions of users. Fine-tuning does not.
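Assembling that block programmatically is a few lines. A sketch assuming preferences live in a plain dict and history in a list of role/content turns (the function name and schema are made up for illustration):

```python
import json

def render_system_prompt(preferences, history, max_turns=5):
    """Render a structured personalization prompt: prefs as JSON, then recent turns."""
    recent = history[-max_turns:]
    transcript = "\n".join(f"{turn['role']}: {turn['content']}" for turn in recent)
    return (
        "User preferences: "
        + json.dumps(preferences, indent=2)
        + f"\n\nRecent context (last {len(recent)} interactions):\n"
        + transcript
    )

prefs = {
    "tone": "direct, no fluff",
    "technical_level": "senior engineer",
    "preferred_language": "Go",
    "known_constraints": ["latency-critical", "runs on ARM"],
}
history = [
    {"role": "user", "content": "Prefer table output."},
    {"role": "assistant", "content": "Noted."},
]
prompt = render_system_prompt(prefs, history)
print(prompt.splitlines()[0])  # User preferences: {
```

Because the prompt is rebuilt on every request, a changed preference takes effect immediately, which is exactly the property fine-tuning can't give you.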
Edge Cases Worth Considering
There are rare exceptions. If your user base is homogeneous and you’re optimizing for a very specific, measurable output format (like structured clinical notes from a particular hospital system), fine-tuning on thousands of examples might shave latency or improve consistency enough to justify the cost. But you’re still better off testing whether a better system prompt or a retrieval layer gets you 80% of the way there first.
Also: fine-tuned models are harder to version, audit, and debug. You lose interpretability. If the model starts behaving oddly, you can’t easily trace whether it’s the fine-tuning or the context. With pure prompting, the signal is cleaner.
For real personalization, think about what actually matters: user preferences, prior context, feedback loops. Those live better in a vector DB or a simple JSON config file than baked into model weights.
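With preferences in a config file, updating them is a file edit, not a training run. A minimal sketch (the file path and default schema are assumptions):

```python
import json
from pathlib import Path

PREFS_PATH = Path("user_prefs.json")  # hypothetical per-user config location

def load_prefs():
    """Read stored preferences, falling back to defaults for a new user."""
    if PREFS_PATH.exists():
        return json.loads(PREFS_PATH.read_text())
    return {"tone": "neutral"}

def update_pref(key, value):
    """Change one preference and persist it; takes effect on the next request."""
    prefs = load_prefs()
    prefs[key] = value
    PREFS_PATH.write_text(json.dumps(prefs, indent=2))
    return prefs

prefs = update_pref("tone", "direct, no fluff")
print(prefs["tone"])  # direct, no fluff
```

The same read/update pattern works against a database row or a vector store entry; the point is that user state stays in data you can inspect and diff, not in opaque weights.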
Bottom line: Use your context window to pass conversation history directly into prompts instead of fine-tuning. Save fine-tuning for broad behavioral changes across your entire user base or domain-specific knowledge at scale, not individual personalization.
Question via Hacker News