2026-04-14 | Tags: office-hours, q-and-a, practical-ai

Office Hours — Are we pretending RAG is ready, when it's barely out of demo phase?

A daily developer question about AI/LLMs, answered with a direct, opinionated take.

Daily: One question from the trenches, one opinionated answer.

Are we pretending RAG is ready, when it’s barely out of demo phase?

No, but the gap between “demo” and “production” is real and most teams are collapsing it badly.

RAG works. The Daily Signal from April 11 shows that advanced retrieval with cross-encoders and reranking dramatically improves relevance without rebuilding your pipeline. That’s not demo talk. The same piece covers graph-RAG systems with deterministic multi-tiered retrieval that solve genuinely hard problems in complex knowledge domains. These are production patterns.

What’s not ready: the retrieval stage itself. Most RAG systems fail silently because they retrieve the wrong documents, then hand garbage to the LLM. Vector search alone is table stakes, not a complete solution. You need reranking. You probably need hybrid search that mixes keyword and semantic retrieval. And you need to instrument what you’re actually retrieving and measure it.

The Retrieval Problem Is Concrete

Here’s what this looks like in practice. A naive RAG pipeline embeds documents with an embedding model, stores them in a vector database, and on query does a single pass of cosine similarity. Sounds straightforward. In reality, you’ll get:

  1. High recall, low precision. The top 10 results include 7 documents that mention your terms but don’t actually answer the query.
  2. Silent failures. The LLM will confidently hallucinate around bad context because it has no way to flag “none of these documents are relevant.”
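
For reference, the entire retrieval stage in that naive pipeline is roughly the following. This is a minimal sketch: the embed() helper and the precomputed document embeddings are assumptions, and numpy stands in for whatever vector store you actually use.

# Naive single-pass retrieval (sketch; embed() and doc_embeddings are assumed)
import numpy as np

def naive_retrieve(query, docs, doc_embeddings, embed, k=10):
    # embed() is an assumed helper that returns one vector per input string
    q = embed(query)
    # cosine similarity between the query and every document embedding
    sims = doc_embeddings @ q / (np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(q))
    top = np.argsort(-sims)[:k]
    return [docs[i] for i in top]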

The fix: add a reranking stage. Query your vector database for the top-k candidates (usually 50-100), then run those through a cross-encoder model like Cohere’s rerank API or a local model like ms-marco-MiniLM-L-12-v2. A cross-encoder scores each query-document pair jointly instead of comparing independently computed embeddings, so it can judge whether a document actually answers the query. Cost per query is higher (maybe 2-3x for the reranking call), but your top-5 retrieved documents are now actually relevant. Combine this with keyword search (BM25) on the same corpus and merge the ranked lists. This hybrid approach catches both semantic matches and exact-term matches that vector search alone would miss.
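
Here is a minimal sketch of that reranking step using the sentence-transformers CrossEncoder class and the Hugging Face checkpoint for the model mentioned above; the candidates list is assumed to be the top-50 documents from your vector search.

# Cross-encoder reranking (sketch; candidates come from your retrieval stage)
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank(query, candidates, top_n=5):
    # score each (query, document) pair jointly, then sort by relevance score
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]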

Here’s a concrete setup: embed documents once with a model like Nomic Embed (fast, cheap). Store embeddings in Postgres with pgvector or in Qdrant. On query, run parallel BM25 (using Postgres full-text search or Elasticsearch) and vector search, retrieve top-50 from each, deduplicate, then pass the merged list through a cross-encoder. The cross-encoder reranks to top-10. Total latency: 150-300ms depending on corpus size. Cost per query: roughly $0.0001-0.0005 if you use Cohere’s API, negligible if you run a local cross-encoder on a GPU.

# Pseudo-pipeline
1. BM25 query -> top-50 by keyword
2. Vector search -> top-50 by embedding
3. Merge and deduplicate (union, keep first occurrence)
4. Cross-encoder rerank -> top-10 by relevance score
5. Return to LLM context window
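
A sketch of steps 1-4, assuming hypothetical bm25_search and vector_search helpers that each return a ranked list of documents, and reusing the rerank() function sketched earlier:

# Hybrid retrieval (sketch; bm25_search and vector_search are assumed helpers)
def hybrid_retrieve(query, bm25_search, vector_search, top_n=10):
    keyword_hits = bm25_search(query, limit=50)     # step 1: BM25 keyword retrieval
    semantic_hits = vector_search(query, limit=50)  # step 2: embedding retrieval

    # step 3: merge and deduplicate, keeping the first occurrence of each document
    merged, seen = [], set()
    for doc in keyword_hits + semantic_hits:
        if doc not in seen:
            seen.add(doc)
            merged.append(doc)

    # step 4: cross-encoder rerank (see the rerank() sketch above)
    return rerank(query, merged, top_n=top_n)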

This costs more compute than naive vector search, but the difference in output quality is not marginal. Teams running this pattern report 20-40% improvement in downstream LLM accuracy on retrieval-dependent tasks.

The Multi-Source Problem

The other gap: agentic RAG across multiple heterogeneous sources. That’s still messy. Single-source RAG over a cleaned document set? Ship it. Multi-source with conflicting or outdated information? You’re debugging forever. When you’re pulling from a customer database, a knowledge base, and three different APIs, you need explicit conflict resolution and freshness checks. That’s not a RAG problem anymore, it’s an integration problem, and most teams don’t have patterns for it yet.

If you do need multi-source, attach metadata to every chunk: source name, last-updated timestamp, and a confidence level (manually annotated or inferred from update frequency). At retrieval time, filter by freshness and source credibility before reranking. That keeps you from serving stale or conflicting data without at least flagging it.
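
A minimal sketch of that filter, assuming each chunk carries the metadata fields above; the source weights, field names, and thresholds are illustrative assumptions, not recommendations:

# Freshness and credibility filter (sketch; field names and thresholds are assumptions)
from datetime import datetime, timedelta, timezone

SOURCE_WEIGHT = {"knowledge_base": 1.0, "customer_db": 0.9, "partner_api": 0.6}
MAX_AGE = timedelta(days=180)

def filter_chunks(chunks):
    # chunks: dicts with "source", "last_updated" (aware datetime), "confidence" (0-1)
    now = datetime.now(timezone.utc)
    kept = []
    for chunk in chunks:
        if now - chunk["last_updated"] > MAX_AGE:
            continue  # stale: fails the freshness check
        if SOURCE_WEIGHT.get(chunk["source"], 0.0) * chunk["confidence"] < 0.5:
            continue  # low-credibility source or low-confidence annotation
        kept.append(chunk)
    return kept  # only these chunks go on to the reranker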

Measurement Matters More Than Model Choice

Production RAG failures aren’t usually about model quality. They’re about not measuring retrieval at all. Build a small evaluation set (100-200 queries with ground-truth relevant documents). After each retrieval, log:

  • Whether the relevant document appeared in top-k (recall@k).
  • The rank position of the first relevant document.
  • Whether reranking improved the position.
  • End-to-end correctness of the LLM response given the retrieved context.
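
A minimal sketch for the first two of those metrics, assuming each evaluation case maps a query to the IDs of its ground-truth relevant documents and retrieve() returns an ordered list of document IDs:

# Retrieval evaluation (sketch; the eval_set shape and retrieve() are assumptions)
def evaluate_retrieval(eval_set, retrieve, k=10):
    # eval_set items look like {"query": str, "relevant_ids": set of doc IDs}
    hits, first_ranks = 0, []
    for case in eval_set:
        retrieved = retrieve(case["query"], k)
        relevant = case["relevant_ids"]
        rank = next((i + 1 for i, doc_id in enumerate(retrieved) if doc_id in relevant), None)
        if rank is not None:
            hits += 1                 # a relevant document appeared in the top-k
            first_ranks.append(rank)  # 1-based rank of the first relevant document
    return {
        "recall_at_k": hits / len(eval_set),
        "mean_first_relevant_rank": sum(first_ranks) / len(first_ranks) if first_ranks else None,
    }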

Use these metrics to tune your retrieval pipeline before you ever touch the LLM. A frontier model with bad retrieval context will hallucinate less confidently, but it’ll still be wrong. A mediocre model with tight retrieval will be correct more often. If your retrieval quality is low, upgrading from Claude Sonnet 4.6 to Claude Opus 4.7 won’t save you. Fixing retrieval will.

The honest take: RAG is ready for specific, bounded problems where your retrieval corpus is clean and homogeneous. It’s not ready as a generic “throw any data at it” solution. And yes, a lot of teams are shipping RAG demos into production and calling it done.

Bottom line: RAG works at production scale if you treat retrieval as a first-class problem, add reranking, instrument what you’re returning to the LLM, and measure retrieval quality metrics continuously. The model is not your bottleneck. Retrieval precision is. Don’t ship without proving your retrieval pipeline works on a representative test set.

Question via Hacker News