
Office Hours — How are you extracting the best performance out of your RAG pipeline?

A daily developer question about AI/LLMs, answered with a direct, opinionated take.

Daily: one question from the trenches, one opinionated answer.

How are you extracting the best performance out of your RAG pipeline?

The honest answer is that most teams aren’t, and it’s not because they’re lazy. RAG has three failure modes that stack on top of each other, and fixing one doesn’t fix the others.

First, retrieval itself is fragile. You can nail document ranking and still feed the model irrelevant context because semantic search doesn’t understand task intent. A query about “when was this policy last updated” might retrieve the most semantically similar passage, which turns out to state the policy’s original effective date rather than the amendment. The model confidently cites the wrong date.

Second, even with perfect retrieval, models hallucinate freely when reasoning over retrieved documents. This is the gap most people miss. Your RAG system can fetch the right paragraph, and the LLM will still invent details or misread it. Testing shows this happens more often than you’d expect.

Third, you need to actually verify outputs against your sources. This means structured extraction, post-retrieval filtering, and honest acknowledgment when the model can’t answer from what you gave it. Most teams skip this because it’s operational friction.

What actually works in production

Hybrid search combining keyword and semantic matching catches cases semantic-only search misses. Reranking retrieved documents with a smaller model like Claude Sonnet 4.6 or Mistral Large 3 before sending to your main LLM dramatically improves relevance without the latency hit of reranking everything through GPT-5.5. Few-shot examples showing the model how to cite sources correctly reduce invented citations by roughly 40-60% depending on your domain. Hard constraints that force the model to admit when documents don’t contain the answer instead of hallucinating close the remaining gap.
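To make the hybrid-search half concrete, here is a minimal sketch that merges a keyword ranking and a semantic ranking with reciprocal rank fusion. The fusion constant and the function shape are illustrative assumptions, not tied to any particular search library:

```python
from collections import defaultdict

def reciprocal_rank_fusion(keyword_ranking, semantic_ranking, k=60, top_n=20):
    """Merge two ranked lists of doc IDs into one hybrid ranking.

    keyword_ranking / semantic_ranking: doc IDs ordered best-first,
    e.g. from BM25 and from embedding cosine similarity respectively.
    k=60 is the conventional RRF smoothing constant (an assumption here).
    """
    scores = defaultdict(float)
    for ranking in (keyword_ranking, semantic_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_n]

# A doc that ranks decently in both lists can beat one that only keyword
# search liked, which is how hybrid search catches semantic-only misses.
hybrid_top = reciprocal_rank_fusion(
    keyword_ranking=["doc_7", "doc_2", "doc_9"],
    semantic_ranking=["doc_2", "doc_4", "doc_7"],
)
```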

A concrete setup: retrieve top 20 documents via hybrid search, rerank to top 5 with Claude Sonnet 4.6, pass those with source IDs to GPT-5.4, then validate that every claim in the output appears verbatim or near-verbatim in the source spans. For a policy Q&A system handling 1000 queries daily, this costs roughly 15-18% more than naive retrieval-then-generation, but reduces wrong-answer rates from 12-18% down to 2-4%. The validation layer catches hallucinations before users see them.
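As a sketch of how those stages compose, the pipeline looks roughly like this. hybrid_search, rerank_with_sonnet, and generate_with_sources are placeholders for your own retrieval, reranking, and generation calls, not real APIs; the validate_output check is spelled out further down:

```python
def answer_query(query: str) -> dict:
    """Three-stage RAG pipeline: retrieve, rerank, generate, then validate."""
    # Stage 1: hybrid retrieval, top 20 candidates (keyword + semantic).
    candidates = hybrid_search(query, top_k=20)

    # Stage 2: rerank down to the top 5 with a smaller model before the main LLM.
    top_docs = rerank_with_sonnet(query, candidates, keep=5)

    # Stage 3: constrained generation with explicit source IDs in the prompt.
    sources = {f"doc_{i + 1}": doc for i, doc in enumerate(top_docs)}
    answer = generate_with_sources(query, sources)

    # Post-hoc validation: every claim must trace back to a provided source.
    report = validate_output(answer, set(sources))
    if not report["ok"]:
        return {"answer": None, "flagged": True, "report": report}
    return {"answer": answer, "flagged": False, "report": report}
```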

Here’s the cost math. Claude Sonnet 4.6 reranking 20 documents per query runs about 800 tokens per request (titles, snippets, query context). At roughly $3 per million input tokens, that’s $0.0024 per query, or $2.40 daily for 1000 queries. Validation with GPT-4.1 Nano (the budget tier for low-stakes post-hoc checking) adds another $1.50-2.00 per day. Total pipeline overhead: $4-5 daily. A single production error in legal or financial domains costs orders of magnitude more in support, remediation, and liability.
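The same arithmetic, spelled out so you can swap in your own token counts and prices (the figures below are this post's assumptions, not published rate cards):

```python
# Reranking cost per query: tokens * price per token.
rerank_tokens_per_query = 800
rerank_price_per_million = 3.00   # assumed $/1M input tokens

rerank_cost_per_query = rerank_tokens_per_query * rerank_price_per_million / 1_000_000
# -> $0.0024 per query

queries_per_day = 1000
rerank_cost_per_day = rerank_cost_per_query * queries_per_day            # -> $2.40
validation_cost_per_day = 1.75                       # midpoint of the $1.50-2.00 estimate
total_overhead_per_day = rerank_cost_per_day + validation_cost_per_day   # roughly $4 a day
```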

What the prompt actually looks like

Most write-ups skip the generation prompt, which is where constraint enforcement lives or dies. A minimal version that works:

You are a policy assistant. Answer only using the provided source documents.
For each claim, cite the source ID in brackets, e.g. [doc_3].
If the documents do not contain enough information to answer, respond with:
"The provided documents do not contain a clear answer to this question."
Do not infer, extrapolate, or use prior knowledge.

Sources:
[doc_1] <title: "Leave Policy v4.2"> <text: ...>
[doc_2] <title: "Leave Policy Amendment 2024-11"> <text: ...>
...

Question: {user_query}

The critical parts: explicit prohibition on prior knowledge, required citation format per claim, and a scripted refusal string. That last one matters because it gives your validation layer something deterministic to match against. If the output doesn’t contain a citation for every factual statement and doesn’t match the refusal pattern when appropriate, flag it for review. This is not complicated logic. It’s a regex pass and a citation coverage check.
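Here is a minimal version of that check, assuming citations follow the [doc_N] format from the prompt above and the refusal string is matched exactly; this is the validate_output call from the pipeline sketch earlier:

```python
import re

REFUSAL = "The provided documents do not contain a clear answer to this question."
CITATION_RE = re.compile(r"\[doc_\d+\]")

def validate_output(answer: str, source_ids: set[str]) -> dict:
    """Flag answers that cite unknown sources or make uncited claims."""
    if answer.strip() == REFUSAL:
        return {"ok": True, "refused": True, "issues": []}

    issues = []
    cited = set(CITATION_RE.findall(answer))

    # Every citation must point at a document that was actually provided.
    unknown = {c for c in cited if c.strip("[]") not in source_ids}
    if unknown:
        issues.append(f"cites sources not in context: {sorted(unknown)}")

    # Crude coverage check: every sentence should carry at least one citation.
    # The sentence splitting is naive on purpose; tune it to your output style.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited = [s for s in sentences if not CITATION_RE.search(s)]
    if uncited:
        issues.append(f"{len(uncited)} sentence(s) without a citation")

    return {"ok": not issues, "refused": False, "issues": issues}
```

Anything it flags goes to human review; don't try to auto-correct it.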

The validation layer is where RAG projects die

You need to measure retrieval quality separately from generation quality. That means tracking: retrieval precision (is the right document in the top 5?), reranking efficacy (does the reranker move the right doc up?), and hallucination rate (does the output only cite retrieved sources?). If you’re not instrumented for this, you’re flying blind. Most teams only look at end-to-end accuracy and wonder why swapping retrievers doesn’t help.
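A sketch of that instrumentation, assuming you log, per query, the annotated gold document ID plus the rankings before and after reranking and the sources the final answer cites:

```python
from dataclasses import dataclass

@dataclass
class QueryLog:
    gold_doc: str          # annotated correct document ID for this query
    retrieved: list[str]   # hybrid-search ranking, best first
    reranked: list[str]    # post-rerank ranking; the docs actually sent to the LLM
    cited: set[str]        # source IDs the final answer cites

def pipeline_metrics(logs: list[QueryLog]) -> dict:
    n = len(logs)
    # Retrieval precision: is the right document in the top 5 before reranking?
    retrieval_hits = sum(log.gold_doc in log.retrieved[:5] for log in logs)
    # Reranking efficacy: the reranker pulls the right doc into the top 5
    # when retrieval alone had it lower.
    rerank_saves = sum(
        log.gold_doc not in log.retrieved[:5] and log.gold_doc in log.reranked[:5]
        for log in logs
    )
    # Hallucination proxy: the answer cites something that was never retrieved.
    hallucinated = sum(bool(log.cited - set(log.reranked)) for log in logs)
    return {
        "retrieval_precision_at_5": retrieval_hits / n,
        "rerank_saves": rerank_saves / n,
        "hallucination_rate": hallucinated / n,
    }
```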

One edge case: reranking introduces latency. A Claude Sonnet 4.6 rerank of 20 docs adds 800-1200ms per query. For interactive systems, consider parallel reranking batches or caching rerank scores across similar queries. For batch systems or where latency isn’t critical, the accuracy gain is worth it.
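If you go the caching route, the simplest version keys rerank scores on a normalized query-document pair; near-duplicate lookup via query embeddings is an extension of the same idea. A sketch:

```python
import hashlib
from typing import Callable

_rerank_cache: dict[str, float] = {}

def cached_rerank_score(query: str, doc_id: str,
                        score_fn: Callable[[str, str], float]) -> float:
    """Memoize per-(query, doc) rerank scores so repeated queries
    (after trivial normalization) skip the expensive reranker call."""
    # Exact-match caching only pays off if queries actually repeat.
    key_src = f"{query.strip().lower()}::{doc_id}"
    key = hashlib.sha256(key_src.encode()).hexdigest()
    if key not in _rerank_cache:
        _rerank_cache[key] = score_fn(query, doc_id)  # the actual model call
    return _rerank_cache[key]
```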

Another edge case: your validation layer needs a ground truth signal. For Q&A, that’s manual annotation of 50-100 representative queries. For classification tasks, it’s easier. If your task is open-ended (“summarize this document set”), validation becomes opinion-dependent and the whole pipeline degrades.

The setup tax varies. If your documents are already chunked and embedded, integration takes 2-3 weeks. If you’re starting from PDFs or unstructured HTML, add 4-6 weeks for preprocessing and chunk optimization. The Proxy-Pointer RAG technique is worth testing if you’re hitting reliability walls, though implementation complexity scales with your data structure heterogeneity.

But don’t add reranking if retrieval is already solving your problem. Instrument first. Run 100 queries through your current pipeline, manually check if the right document lands in the top 5, and see what percentage of hallucinations come from retrieved context versus model invention. If top-5 precision is already above 95% and the hallucination rate is below 3%, reranking is wasted effort.

Bottom line: Stop optimizing retrieval alone and build a three-stage pipeline with reranking, constraint-based generation, and post-hoc validation. Most performance gains come from preventing hallucinations after retrieval and measuring what’s actually failing, not from better retrieval itself. Get the prompt constraints and citation format right before touching the retrieval stack. Measure before you add complexity.

Question via Hacker News