Office Hours — Is fine-tuning the right approach when you want a model to reliably remember your codebase or company-specific data?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
Is fine-tuning the right approach when you want a model to reliably remember your codebase or company-specific data?
No, and I’d gently push back on the framing. Fine-tuning buys you consistent style and tone adjustments. It does not reliably embed factual knowledge into a model’s weights. If you fine-tune on your codebase, you’re not creating a model that “knows” your code. You’re creating a model that learned statistical patterns from seeing it during fine-tuning, but it’ll still hallucinate function signatures and invent APIs when it encounters novel queries.
Why fine-tuning fails for knowledge retention
The fundamental problem: a model can’t reliably store new facts in its weights after training. Fine-tuning on your codebase teaches the model to recognize patterns it saw in that data, but it doesn’t create a searchable index. Ask your fine-tuned model about a function that exists in your codebase but was barely represented in the fine-tuning data, and it will confabulate a plausible-looking signature. You’ve added training cost and operational overhead without solving the core problem.
What actually works: RAG with good retrieval. Dump your codebase into a vector store (Pinecone, Weaviate, Milvus), chunk it sensibly by file or logical boundary, and pass the relevant snippets into the model’s context at inference time. Claude Opus 4.7 and GPT-5.5 both have large context windows. Use them.
A concrete example: instead of fine-tuning on 50,000 lines of code, embed your codebase as markdown files split by module. When a developer queries “how does the auth middleware handle token refresh?”, your retrieval layer finds the relevant files, passes them to Claude Opus 4.7 with the query, and gets a factually grounded answer. If your codebase changes tomorrow, you re-embed the new files. No retraining, no waiting for the model to converge.
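The indexing side is not much code either. Here’s a minimal sketch, assuming a Pinecone-style index object and an embed() helper (both stand-ins, configured elsewhere, matching the retrieval call shown below):

import os

# Hypothetical sketch: walk the repo, chunk one file per vector, and upsert.
# `index` and `embed()` are assumed to be set up elsewhere; they are the same
# objects used in the retrieval call below.
def index_codebase(repo_root, repo_name):
    for dirpath, _, filenames in os.walk(repo_root):
        for name in filenames:
            if not name.endswith((".py", ".ts", ".md")):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                content = f.read()
            # One chunk per file keeps logical boundaries intact; split big
            # files by module or class if they exceed the embedding limit.
            index.upsert(vectors=[{
                "id": os.path.relpath(path, repo_root),
                "values": embed(content),
                "metadata": {"repo": repo_name, "content": content},
            }])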
Here’s what that retrieval call looks like in practice:
# Assumes `index` (the vector store), `embed()`, and `anthropic_client` are
# already configured; `query` is the developer's question.
results = index.query(
    vector=embed(query),
    top_k=8,
    filter={"repo": "backend-api"},
    include_metadata=True,
)

# Stitch the retrieved chunks into one context block.
context = "\n\n".join([r["metadata"]["content"] for r in results["matches"]])

response = anthropic_client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": f"Codebase context:\n{context}\n\nQuestion: {query}",
    }],
)
That’s it. Eight chunks, metadata filtering by repo, direct context injection. The model answers from actual source, not from weights frozen around whatever the code looked like six weeks ago.
The only time fine-tuning makes sense here is if you’re training on hundreds of thousands of examples to teach the model a new behavior or reasoning pattern specific to your domain. That’s expensive, slow, and rarely justified for “remember my codebase.”
Cost and latency tradeoffs
Worried about cost or latency with full RAG? Use a two-stage pipeline: Gemini 3.1 Flash-Lite for the retrieval decision, then invoke Claude Opus 4.7 only when needed. That’s cheaper than fine-tuning infrastructure and stays current as your codebase evolves.
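In code, the routing is a few lines. A rough sketch, where call_flash_lite(), call_opus(), and retrieve_context() are hypothetical wrappers around the respective APIs and the retrieval call above, and the yes/no routing prompt is just one way to frame the decision:

def answer_query(query):
    # Stage 1: the cheap model decides whether codebase retrieval is needed.
    # call_flash_lite() / call_opus() are hypothetical wrappers, not real SDK calls.
    decision = call_flash_lite(
        "Does answering this question require looking at the codebase? "
        f"Reply YES or NO.\n\nQuestion: {query}"
    )
    if decision.strip().upper().startswith("NO"):
        # General questions never touch the expensive model.
        return call_flash_lite(query)

    # Stage 2: retrieve context and invoke the heavyweight model only when needed.
    context = retrieve_context(query)  # the retrieval call sketched earlier
    return call_opus(f"Codebase context:\n{context}\n\nQuestion: {query}")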
Concrete breakdown: assume 10,000 queries per month on a 500K LOC codebase. Querying Claude Opus 4.7 on every request (4K context tokens average) costs roughly $100-150/month. Two-stage filtering with Flash-Lite plus selective Opus 4.7 invocation typically runs 40-50% cheaper while keeping quality high. Fine-tuning, by contrast, requires upfront compute costs (roughly $800-2000 for a single training pass on that volume), verification cycles, and full retraining when your code changes significantly. After six weeks of code churn, your fine-tuned model is stale. Your RAG system stays fresh because it points to live files.
If you’re working with truly enormous codebases, chunk aggressively: one file per chunk, keyword indexing in addition to vector search, and a reranker step before context passing. The math still favors RAG over fine-tuning. BM25 hybrid search (keyword plus semantic) catches cases where vector similarity alone misses the right files. A function named validateTokenExpiry will surface cleanly on keyword match even when the semantic embedding drifts toward generic auth concepts.
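A hedged sketch of that hybrid step, using the rank_bm25 package for the keyword side and blending it with vector scores; the 50/50 weighting, the candidate pool size, and the vector_search() helper are all assumptions to tune against your own retrieval evals:

from rank_bm25 import BM25Okapi

# chunks: list of {"id": ..., "content": ...} dicts built the same way as the
# vector index, so IDs line up across both retrievers (hypothetical structure).
bm25 = BM25Okapi([c["content"].split() for c in chunks])

def hybrid_search(query, top_k=8):
    # Keyword side: exact-token matches catch identifiers like validateTokenExpiry.
    raw = bm25.get_scores(query.split())
    top = max(raw) or 1.0
    keyword_scores = [s / top for s in raw]  # scale unbounded BM25 into roughly [0, 1]

    # Semantic side: similarity scores from the vector index (hypothetical helper).
    vector_scores = {m["id"]: m["score"] for m in vector_search(query, top_k=50)}

    # Blend the two signals; a reranker over this merged pool is where the
    # optional rerank step slots in before context passing.
    blended = [
        (0.5 * keyword_scores[i] + 0.5 * vector_scores.get(c["id"], 0.0), c)
        for i, c in enumerate(chunks)
    ]
    blended.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in blended[:top_k]]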
When fine-tuning might make sense
Edge case: if your company has a proprietary domain-specific language or framework with novel syntax patterns that frontier models have never seen, fine-tuning might help the model recognize it faster. But even then, you’d combine fine-tuning with RAG. Fine-tuning alone won’t solve the hallucination problem. A better approach is teaching the model to generate code that calls into your framework correctly by embedding framework reference documentation in your retrieval system, then optionally fine-tuning on 500-1000 examples of correct usage patterns.
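If you do add that fine-tuning layer, the training set is just reviewed prompt/completion pairs. A sketch of what assembling it might look like; the chat-style record format is the common convention, and yourframework is a stand-in for the proprietary API:

import json

# Hypothetical sketch: each record pairs a request with a verified, correct
# usage of the proprietary framework. Aim for 500-1000 reviewed examples.
examples = [
    {
        "messages": [
            {"role": "user", "content": "Register a nightly cleanup job."},
            {"role": "assistant", "content": (
                "@yourframework.schedule(cron='0 2 * * *')\n"
                "def nightly_cleanup(ctx):\n"
                "    ctx.storage.purge_expired()\n"
            )},
        ]
    },
    # ... more reviewed examples of correct usage patterns
]

with open("usage_patterns.jsonl", "w") as f:
    for record in examples:
        f.write(json.dumps(record) + "\n")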
Another limited case: if you’re building a coding agent that needs to repeatedly operate on your codebase, fine-tuning the model to understand your project structure and conventions could reduce planning overhead. Autonomous agents like Claude Code with Opus 4.7 or GitHub Copilot (now multi-model across GPT-5.4, Claude Sonnet 4.6, and Gemini 3.1 Pro) handle multi-step tasks end-to-end, including cloning repos, running tests, fixing failures, and opening PRs. They work better when they can ground themselves against current code. Fine-tuning helps them recognize your patterns, but it doesn’t help them verify their own work. The bottleneck is almost always the agent’s ability to test and fix against a fast objective signal. If you’re watching an autonomous agent run for four hours without hitting a test case, that’s a signal problem, not a model knowledge problem.
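That objective signal is usually nothing fancier than the test suite in a loop. A sketch, where agent is a stand-in for whatever coding agent you’re driving:

import subprocess

# Hypothetical sketch of the fast objective signal: run the tests, hand the
# failures back to the agent, stop as soon as the suite goes green.
def agent_fix_loop(agent, max_rounds=5):
    for _ in range(max_rounds):
        result = subprocess.run(["pytest", "-x", "-q"], capture_output=True, text=True)
        if result.returncode == 0:
            return True  # green suite: the objective signal says we're done
        # The agent proposes the next patch against the concrete failure output.
        agent.apply_patch(agent.propose_fix(result.stdout + result.stderr))
    return False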
Bottom line: Use RAG for codebase knowledge. Fine-tuning won’t reliably retain factual data anyway, and it’s slower to update when your code changes. A two-stage pipeline with a lightweight model for filtering and a heavyweight model for generation gives you better cost-quality tradeoffs than fine-tuning infrastructure. BM25 hybrid search beats pure vector retrieval for code. Save fine-tuning for behavioral customization when you have enough examples and a clear, measurable improvement to justify the cost.
Question via Hacker News