Office Hours — How are people safely reusing cached LLM answers in production RAG systems?

A daily developer question about AI/LLMs, answered with a direct, opinionated take.

One question from the trenches, one opinionated answer.

How are people safely reusing cached LLM answers in production RAG systems?

Most teams doing this are leaning on prompt caching at the API level rather than building custom caching layers. OpenAI's prompt caching and Anthropic's prompt caching both let you cache the static prefix of your prompt (system instructions plus the retrieved RAG documents), so repeat requests only pay full price for the new tokens in each query. That's the safest approach because the provider manages cache lifetime for you: cached prefixes expire after short periods of inactivity, so stale context ages out on its own instead of lingering indefinitely.
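As a sketch of what that request shape looks like with Anthropic's `cache_control` parameter (the block structure is Anthropic's real mechanism for marking a cacheable prefix; the helper function, model name, and documents here are illustrative), the static retrieval context goes in a cacheable system block and only the user's question varies per request:

```python
# Build an Anthropic-style messages payload where the RAG context sits in a
# cacheable system block. Only the user question changes between requests,
# so the expensive document prefix can be served from the provider's cache.
# Helper name, model string, and documents are illustrative.

def build_request(context_docs: list[str], question: str) -> dict:
    context = "\n\n".join(context_docs)  # static across requests -> cacheable
    return {
        "model": "claude-sonnet-4-5",    # placeholder model name
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": f"Answer using only these documents:\n{context}",
                # Marks everything up to this block as a cacheable prefix.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": question}],
    }

req = build_request(
    ["Refund policy: 30-day window.", "Shipping: 5-7 business days."],
    "What's the refund window?",
)
```

The key design point: everything that is identical across requests goes before everything that varies, because prefix caches only match from the start of the prompt.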

The trickier part is knowing when your cache is stale. If you’re caching answers for FAQ-style queries where the underlying documents rarely change, you’re fine. But if your RAG corpus updates weekly, you need explicit invalidation logic tied to your document update pipeline. Some teams use a simple version hash on their document set, others tag cache entries with timestamps and require explicit refresh after a cutoff.
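A minimal sketch of the version-hash approach (function names are illustrative): hash the document set, bake the hash into every cache key, and any corpus update automatically orphans every old entry without explicit deletion logic.

```python
import hashlib

def corpus_version(docs: dict[str, str]) -> str:
    """Stable hash over doc IDs and contents; changes whenever any doc changes."""
    h = hashlib.sha256()
    for doc_id in sorted(docs):          # sorted for a deterministic digest
        h.update(doc_id.encode())
        h.update(docs[doc_id].encode())
    return h.hexdigest()[:16]

def cache_key(query: str, docs: dict[str, str]) -> str:
    # A corpus update changes the version prefix, so stale entries
    # simply stop matching; no explicit invalidation pass needed.
    q = hashlib.sha256(query.encode()).hexdigest()[:16]
    return f"{corpus_version(docs)}:{q}"

docs = {"refunds": "30-day window", "shipping": "5-7 business days"}
k1 = cache_key("refund policy?", docs)
docs["refunds"] = "14-day window"        # corpus update
k2 = cache_key("refund policy?", docs)
assert k1 != k2                          # old entry can never be hit again
```

The trade-off versus timestamp tagging is that a version hash invalidates everything at once on any update; if your corpus changes frequently but most documents are stable, per-document versioning keeps more of the cache warm.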

One gotcha people hit: they cache at the application layer without accounting for context drift. A cached answer to “what’s our refund policy” from three months ago might be outdated even if the LLM itself hasn’t changed. The safest pattern is caching the retrieval step (keeping the context fresh) rather than caching the final answer.
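One way to sketch that pattern (the in-memory cache and naive keyword retriever are stand-ins for a real vector store): memoize the retrieval result, keyed by normalized query plus corpus version, and always run generation fresh against that context.

```python
# Cache the retrieval step, never the generated answer. The answer is
# regenerated on every call, so it always reflects the current context.
retrieval_cache: dict[tuple[str, str], list[str]] = {}

def retrieve(query: str, docs: dict[str, str], version: str) -> list[str]:
    key = (version, query.strip().lower())   # version-aware cache key
    if key not in retrieval_cache:
        # Stand-in for real vector search: naive keyword match.
        words = key[1].split()
        retrieval_cache[key] = [
            text for text in docs.values()
            if any(w in text.lower() for w in words)
        ]
    return retrieval_cache[key]

def answer(query, docs, version, llm):
    context = retrieve(query, docs, version)  # cached, invalidated by version
    return llm(context, query)                # always a fresh generation
```

If the corpus updates, bumping `version` misses the cache and re-retrieves; the cached final answer that could silently go stale never exists in the first place.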

If you’re using Gemini or Claude Opus 4.6, both have solid caching support, but verify your actual hit rates: prompt caches match on exact prefixes, so small variations in how you assemble the prompt break hits more often than you’d expect.
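A common culprit is nondeterministic prompt assembly: retrieval returning the same documents in a different order produces a byte-different prefix, which is a cache miss. A sketch of the fix (helper name is illustrative) is to canonicalize before building the prompt:

```python
def build_context(chunks: list[str]) -> str:
    # Sort and de-duplicate so the same retrieved set always yields the
    # exact same bytes; prefix caches only hit on identical prefixes.
    return "\n\n".join(sorted(set(chunks)))

a = build_context(["doc B", "doc A", "doc A"])
b = build_context(["doc A", "doc B"])
assert a == b  # stable prefix -> cache hit regardless of retrieval order
```

The same logic applies to anything else interpolated into the prefix: timestamps, request IDs, or session data in the system prompt will defeat the cache on every call.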

Bottom line: cache your RAG context via native API prompt caching and cache retrieval results at the application layer; don’t cache final answers. Tie invalidation to your document update schedule, not just elapsed time.

Question via Hacker News