Office Hours — How are people safely reusing cached LLM answers in production RAG systems?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
How are people safely reusing cached LLM answers in production RAG systems?
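Most teams doing this lean on prompt caching at the API level rather than building custom caching layers. OpenAI’s and Anthropic’s prompt caching both let you reuse the provider’s processing of a repeated prompt prefix (your instructions plus the RAG documents), so repeat requests pay a reduced rate for the cached prefix and full price only for the new query tokens. That’s the safest approach because the answer is still generated fresh on every request, and the cache only matches an identical prefix: if the documents in the prompt change, the request misses the cache instead of returning stale output.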
The trickier part is knowing when a cached answer is stale. If you’re caching answers for FAQ-style queries where the underlying documents rarely change, you’re fine. But if your RAG corpus updates weekly, you need explicit invalidation logic tied to your document update pipeline. Some teams use a simple version hash on their document set; others tag cache entries with timestamps and require an explicit refresh after a cutoff.
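A minimal sketch of the version-hash idea, assuming your documents fit in a dict of ID to text. The names corpus_version and answer_cache_key are hypothetical, as is the choice to lowercase and collapse whitespace in the query; adapt both to your own pipeline.

```python
import hashlib

def corpus_version(docs: dict[str, str]) -> str:
    """Return a short, deterministic hash of the whole document set."""
    h = hashlib.sha256()
    # Sort by document ID so the hash does not depend on insertion order.
    for doc_id in sorted(docs):
        for part in (doc_id, docs[doc_id]):
            data = part.encode("utf-8")
            # Length-prefix each field so field boundaries stay unambiguous.
            h.update(len(data).to_bytes(8, "big"))
            h.update(data)
    return h.hexdigest()[:16]

def answer_cache_key(version: str, query: str) -> str:
    """Combine the corpus version with a lightly normalized query."""
    normalized = " ".join(query.lower().split())
    query_hash = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
    return f"{version}:{query_hash}"

docs_v1 = {"refunds.md": "Refunds within 30 days.", "auth.md": "Use an API key."}
docs_v2 = {**docs_v1, "refunds.md": "Refunds within 14 days."}

# Any edit to any document yields a new version, so old cache keys stop matching.
print(corpus_version(docs_v1) != corpus_version(docs_v2))
```

Because the version is part of the key, a document update makes old answers unreachable without an explicit delete. They still take up space until you evict them.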
Concrete Example: Context Caching vs Answer Caching
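Say you’re building a support bot that answers questions about your API documentation. With Anthropic’s prompt caching, you’d mark the entire documentation set (about 50KB in this example) as cacheable at the API level with cache_control. The model value below is a placeholder for whichever Claude model you use. OpenAI’s prompt caching needs no marker: it applies automatically to sufficiently long repeated prefixes.
POST /v1/messages
{
  "model": "YOUR_CLAUDE_MODEL_ID",
  "max_tokens": 1024,
  "system": [{
    "type": "text",
    "text": "You are a support agent..."
  }, {
    "type": "text",
    "text": "[FULL API DOCS - 50KB]",
    "cache_control": {"type": "ephemeral"}
  }],
  "messages": [{"role": "user", "content": "How do I authenticate?"}]
}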
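The first request pays the full cost of docs plus query (Anthropic also adds a surcharge for writing the cache). A later request with the identical docs prefix, sent before the cache entry expires, pays full price only for the new query tokens; the cached prefix is billed at a steep discount, roughly 10% of the normal input rate on Anthropic. With Claude Opus 4.7 or GPT-5.4, a cache hit can come as early as the second request, provided the prefix is byte-for-byte identical and meets the provider’s minimum cacheable length.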
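To make the arithmetic concrete, here is a rough sketch of input-side cost per request. Everything numeric is an assumption for illustration: the 12,500-token prefix (a loose estimate for 50KB of text), the 20-token query, the $3 per million input tokens, and the 1.25x write and 0.10x read multipliers. Check your provider’s current pricing before relying on any of it.

```python
# Assumed multipliers relative to the normal input price; verify against current pricing.
NO_CACHE = 1.00      # prefix processed normally, no caching involved
CACHE_WRITE = 1.25   # first request: prefix is written to the cache
CACHE_READ = 0.10    # later requests: prefix is read from the cache

def input_cost(prefix_tokens: int, query_tokens: int,
               price_per_mtok: float, prefix_multiplier: float) -> float:
    """Input-side cost in dollars; query tokens are always billed at the full rate."""
    billable = prefix_tokens * prefix_multiplier + query_tokens
    return billable * price_per_mtok / 1_000_000

# Hypothetical workload: docs prefix, short question, price per million input tokens.
PREFIX, QUERY, PRICE = 12_500, 20, 3.00

for label, mult in [("no cache", NO_CACHE),
                    ("cache write", CACHE_WRITE),
                    ("cache read", CACHE_READ)]:
    print(f"{label:12s} ${input_cost(PREFIX, QUERY, PRICE, mult):.5f}")
```

Under these assumptions a cache read costs about a tenth of the uncached input cost, and a single hit more than repays the write surcharge. Output tokens are billed separately and are not affected by prompt caching.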
The mistake teams make: caching the final answer (“Your refund policy is X”) instead of the context. A cached answer survives a documentation update and keeps returning stale information. A cached context prefix can’t: once the docs in the prompt change, the prefix no longer matches, the request misses the cache, and the model answers from the current docs.
One gotcha people hit: they cache at the application layer without accounting for content drift. A cached answer to “what’s our refund policy” from three months ago might be outdated even if the LLM itself hasn’t changed, because nothing in a plain query-to-answer cache knows the source documents moved. If you do cache answers, key them on the document version as well as the query.
When Cache Invalidation Actually Matters
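For truly static content (legal disclaimers, product specs that never change), application-layer answer caching is reasonable if you tag it with a document version. But for anything with velocity, expect cache keys to vary more than you’d like. Provider prompt caches match on an exact prefix and are scoped to a model, so per-user preferences placed early in the prompt, reordered documents, or a model switch all turn hits into misses; keep stable content first and variable content last. Application-layer answer caches have the mirror problem: small differences in query phrasing produce different keys.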
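One way to keep the prefix stable is to build every request so that shared content comes first and anything per-user comes last. The sketch below reuses the request shape from the example earlier; build_request and the way preferences are folded into the user message are illustrative assumptions, not a prescribed layout.

```python
import json

SYSTEM_PROMPT = "You are a support agent..."

def build_request(docs_text: str, user_prefs: dict, question: str) -> dict:
    """Put shared, byte-identical content first and per-user content last."""
    return {
        "system": [
            {"type": "text", "text": SYSTEM_PROMPT},
            # The cacheable prefix ends here; nothing user-specific appears before it.
            {"type": "text", "text": docs_text,
             "cache_control": {"type": "ephemeral"}},
        ],
        "messages": [{
            "role": "user",
            # sort_keys keeps the serialized preferences deterministic across calls.
            "content": (f"Preferences: {json.dumps(user_prefs, sort_keys=True)}\n\n"
                        f"Question: {question}"),
        }],
    }

docs = "[FULL API DOCS]"
alice = build_request(docs, {"lang": "en", "tone": "brief"}, "How do I authenticate?")
bob = build_request(docs, {"lang": "de"}, "What are the rate limits?")

# Different users and questions, same cacheable prefix.
print(alice["system"] == bob["system"])
```

If the preferences were interpolated into the system text ahead of the docs instead, each user would produce a different prefix and none of them would share a cache entry.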
If you’re using Gemini 3.1 Pro or Claude Opus 4.7, both have solid native caching, but verify your actual hit rates in production. Monitor cache utilization per endpoint. Teams often see 30-40% hit rates on FAQ endpoints but only 5-10% on open-ended support queries.
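A small sketch of per-endpoint monitoring, assuming each API response carries a usage object with input_tokens, cache_read_input_tokens, and cache_creation_input_tokens, as Anthropic’s Messages API reports them. Other providers use different field names, so treat the keys as placeholders. The CacheStats class is hypothetical, and it measures a token-level hit rate (share of input tokens served from cache), which is not the same number as a request-level hit rate.

```python
from collections import defaultdict

class CacheStats:
    """Accumulates token-level prompt cache hit rates per endpoint."""

    def __init__(self) -> None:
        self._read = defaultdict(int)    # tokens served from cache
        self._total = defaultdict(int)   # all input tokens seen

    def record(self, endpoint: str, usage: dict) -> None:
        # Field names are assumed; map them to your provider's usage object.
        read = usage.get("cache_read_input_tokens", 0)
        written = usage.get("cache_creation_input_tokens", 0)
        uncached = usage.get("input_tokens", 0)
        self._read[endpoint] += read
        self._total[endpoint] += read + written + uncached

    def hit_rate(self, endpoint: str) -> float:
        total = self._total.get(endpoint, 0)
        return self._read.get(endpoint, 0) / total if total else 0.0

stats = CacheStats()
# First FAQ request writes the cache, the second reads it.
stats.record("/faq", {"input_tokens": 100, "cache_creation_input_tokens": 900})
stats.record("/faq", {"input_tokens": 100, "cache_read_input_tokens": 900})
print(f"/faq token hit rate: {stats.hit_rate('/faq'):.0%}")
```

Feeding these numbers into whatever metrics system you already run is usually enough; the point is to see the rate per endpoint rather than one blended average.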
For an application-layer cache, timestamp-based invalidation is simple but blunt. If your docs publish every Monday, set the cache TTL to 6 days and force a refresh each Monday, accepting that an off-schedule fix to the docs will be served stale until the next refresh. For more granular control, tie invalidation directly to your document pipeline: when you publish new docs, bump the version hash and rotate the cache key. This requires coordinating between your RAG ingestion and your LLM client, but the operational safety is worth it.
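Both strategies can live in one small cache. The sketch below is an in-memory illustration under stated assumptions: VersionedAnswerCache, publish, and the injectable clock are hypothetical names, and a production setup would more likely sit on Redis or similar with the version baked into the key.

```python
import time

class VersionedAnswerCache:
    """Answer cache with a TTL backstop and version-based invalidation."""

    def __init__(self, ttl_seconds: float, clock=time.time) -> None:
        self._ttl = ttl_seconds
        self._clock = clock              # injectable so tests can control time
        self._entries = {}               # (version, query) -> (stored_at, answer)
        self.version = "unpublished"

    def publish(self, new_version: str) -> None:
        """Called by the ingestion pipeline: rotate the key and drop old entries."""
        self.version = new_version
        self._entries.clear()

    def put(self, query: str, answer: str) -> None:
        self._entries[(self.version, query)] = (self._clock(), answer)

    def get(self, query: str):
        key = (self.version, query)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, answer = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]       # expired: treat as a miss
            return None
        return answer

SIX_DAYS = 6 * 24 * 3600
cache = VersionedAnswerCache(ttl_seconds=SIX_DAYS)
cache.publish("docs-2024-w01")
cache.put("what's our refund policy", "Refunds within 30 days.")
cache.publish("docs-2024-w02")           # new docs published
print(cache.get("what's our refund policy"))  # miss after the version bump
```

The TTL here is only a safety net for the case where the pipeline forgets to call publish; the version bump is what actually keeps answers in step with the docs.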
Bottom line: Cache your RAG context via native API prompt caching, not final answers. Tie cache invalidation to your document update pipeline, not just the clock. Verify hit rates in production. For teams with rapidly changing knowledge bases, the safest approach is to skip answer caching, run retrieval on every request, and pass fresh context to every inference, even if it costs slightly more.
Question via Hacker News