Building Reliable RAG Systems
Practical guide to building reliable RAG systems: chunking strategies, retrieval methods, reranking, hybrid search, failure modes, and production architecture patterns.
Retrieval-Augmented Generation is the most common architecture for grounding LLM outputs in private data, and also the most common source of subtle production failures. The gap between a RAG demo and a RAG system that reliably answers questions across thousands of documents is filled with chunking edge cases, retrieval misses, ranking failures, and hallucinations that pass every vibe check until they don’t.
This post covers the full stack: how documents become retrievable chunks, how retrieval actually works at query time, how reranking and hybrid search recover from embedding blind spots, and — critically — the failure modes that break production deployments and how to defend against them.
Table of Contents
- The RAG Pipeline End to End
- Document Ingestion and Preprocessing
- Chunking Strategies
- Embedding and Indexing
- Retrieval Methods
- Hybrid Search
- Reranking
- Context Assembly and Prompt Construction
- Production Failure Modes
- Evaluation
- Architecture Patterns
- Putting It Together: A Reference Implementation
- Summary
- Further Reading
The RAG Pipeline End to End
Before dissecting each component, the full pipeline at a glance:
The canonical RAG pipeline: ingestion writes to indices, query time reads from them through retrieval, reranking, and generation.
Two separate paths exist: the ingestion path (offline or near-real-time, transforms documents into indexed chunks) and the query path (online, sub-second latency target, retrieves and generates). Most reliability problems live at the boundary between these two paths — the chunk that was stored doesn’t match the way the question will be asked.
Document Ingestion and Preprocessing
Raw documents are PDFs, HTML pages, Markdown files, Slack threads, database rows, and support tickets. Each format has extraction pitfalls.
PDF extraction remains the hardest. Tables, multi-column layouts, headers/footers, and embedded images all break naive text extraction. Tools like pymupdf4llm (which converts PDF to Markdown preserving structure), unstructured.io, and docling produce meaningfully different output for the same PDF. For table-heavy documents, a vision model pass (Claude Sonnet 4.6 or Gemini 3.1 Pro with vision) to extract structured data often outperforms any text-based parser.
HTML needs boilerplate removal. Libraries like trafilatura or readability-lxml strip navigation, ads, and chrome. Without this, chunks end up filled with “Cookie Policy | Terms of Service | Contact Us” noise that pollutes retrieval.
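A minimal extraction pass with trafilatura might look like the sketch below. The input path is a stand-in; `extract` returns the main content as text, or `None` when it can't find a content block.

```python
# Boilerplate removal sketch using trafilatura's extract() function.
# The file path is hypothetical; handle the None case explicitly.
import trafilatura

with open("docs_page.html", encoding="utf-8") as f:
    html = f.read()

text = trafilatura.extract(html)  # navigation, ads, and chrome stripped
if text is None:
    text = ""  # unparseable page: skip it or fall back to raw text
```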
Metadata extraction happens here too. Capture: source URL, document title, section headers, creation/modification dates, author, and document type. This metadata enables filtered retrieval later — one of the most effective and underused reliability levers.
# Minimal document preprocessing pipeline
from dataclasses import dataclass
from datetime import datetime
@dataclass
class ProcessedChunk:
text: str
doc_id: str
chunk_index: int
source_url: str
title: str
section_header: str
created_at: datetime
token_count: int
def to_metadata(self) -> dict:
return {
"doc_id": self.doc_id,
"chunk_index": self.chunk_index,
"source_url": self.source_url,
"title": self.title,
"section_header": self.section_header,
"created_at": self.created_at.isoformat(),
"token_count": self.token_count,
}
Chunking Strategies
Chunking determines what the retrieval system can find. Bad chunking is the #1 cause of RAG failures, ahead of embedding quality and model capability.
Fixed-Size Chunking
Split text every N tokens with M tokens of overlap. Simple, predictable, language-agnostic.
| Parameter | Typical Range | Notes |
|---|---|---|
| Chunk size | 256–1024 tokens | 512 is a common default |
| Overlap | 50–200 tokens | 10–20% of chunk size |
| Splitting unit | Tokens or characters | Token-based preferred for consistency |
The overlap prevents information from being split across a chunk boundary. A 512-token chunk with 100-token overlap means consecutive chunks share ~100 tokens, so a sentence that falls on a boundary still appears in full in at least one chunk.
Tradeoff: smaller chunks (256 tokens) give more precise retrieval but lose surrounding context. Larger chunks (1024 tokens) preserve context but dilute the embedding — a 1024-token chunk about three different topics will be a mediocre match for any one of them.
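A token-based splitter is a few lines — a sketch using tiktoken for counting (the `cl100k_base` encoding is an assumption; use whatever tokenizer matches your embedding model):

```python
# Fixed-size chunking over tokens with overlap, per the table above.
import tiktoken

def fixed_size_chunks(text: str, size: int = 512, overlap: int = 100) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = size - overlap  # consecutive chunks share `overlap` tokens
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + size]))
        if start + size >= len(tokens):  # last window reached the end
            break
    return chunks
```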
Recursive / Structure-Aware Chunking
Split on document structure first (headings, paragraphs, list items), then fall back to fixed-size splitting for sections that exceed the target size. LangChain’s RecursiveCharacterTextSplitter popularized this approach, splitting on \n\n, then \n, then sentence boundaries, then characters.
# Recursive splitting priority for Markdown documents
SEPARATORS = [
"\n## ", # H2 headers first
"\n### ", # H3 headers
"\n\n", # Paragraph breaks
"\n", # Line breaks
". ", # Sentence boundaries
" ", # Words (last resort)
]
This preserves semantic coherence better than fixed-size: a section about “Authentication” stays together rather than being split mid-paragraph with a chunk that starts talking about “Rate Limiting.”
Semantic Chunking
Embed each sentence, then find breakpoints where the cosine similarity between consecutive sentences drops below a threshold. The idea: split where the topic shifts rather than at arbitrary positions.
In practice, semantic chunking produces inconsistent chunk sizes (some 50 tokens, some 2000), is slow during ingestion (requires embedding every sentence), and the improvement over recursive chunking is marginal for most document types. It works best for long, unstructured documents like meeting transcripts or research papers without clear headings.
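A sketch of the breakpoint detection, assuming one embedding per sentence has already been computed (the 0.75 threshold is an assumption to tune on your data):

```python
# Semantic breakpoint sketch: start a new chunk wherever similarity
# between consecutive sentences drops below a threshold.
import numpy as np

def semantic_breakpoints(embeddings: np.ndarray, threshold: float = 0.75) -> list[int]:
    # Normalize rows so dot products become cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    breakpoints = []
    for i in range(len(normed) - 1):
        if float(normed[i] @ normed[i + 1]) < threshold:
            breakpoints.append(i + 1)  # new chunk starts at sentence i+1
    return breakpoints
```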
Parent-Child / Hierarchical Chunking
Store small chunks for retrieval precision but return the parent (larger) chunk for generation context. A 512-token section gets split into 128-token child chunks. Retrieval matches on the children; the prompt includes the full parent.
Parent-child chunking: retrieve on small chunks for precision, return the larger parent for generation context.
This pattern is one of the highest-leverage improvements for RAG quality. It decouples retrieval granularity from generation context size.
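A minimal sketch of the two halves, with `split`, `embedder`, and `vector_store` as hypothetical stand-ins for your own components:

```python
# Parent-child indexing sketch: children are embedded for retrieval,
# parents are returned for generation. All interfaces are hypothetical.
def index_parent_child(parents: list[str], split, embedder, vector_store) -> dict:
    parent_of = {}  # child_id -> parent text
    for p_idx, parent in enumerate(parents):  # e.g. 512-token sections
        for c_idx, child in enumerate(split(parent, size=128)):
            child_id = f"{p_idx}:{c_idx}"
            vector_store.add(child_id, embedder.embed(child))
            parent_of[child_id] = parent
    return parent_of

def retrieve_parents(query: str, embedder, vector_store, parent_of, k: int = 5):
    hits = vector_store.search(embedder.embed(query), top_k=k)
    seen, parents = set(), []
    for hit in hits:  # several children can share one parent: dedupe
        parent = parent_of[hit.id]
        if parent not in seen:
            seen.add(parent)
            parents.append(parent)
    return parents
```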
Proposition-Based Chunking
Decompose each chunk into self-contained propositions (“The Eiffel Tower is 330 meters tall. It was completed in 1889.”) and index those. Originally proposed in the “Dense X Retrieval” paper. Each proposition embeds cleanly because it’s a single, complete fact.
The downside is cost: generating propositions requires an LLM pass over every chunk during ingestion. For large corpora (100k+ documents), this adds meaningful latency and token spend to the pipeline. Best reserved for high-value, relatively static knowledge bases.
Chunking Decision Matrix
| Document Type | Recommended Strategy | Chunk Size |
|---|---|---|
| Technical docs with headers | Recursive / structure-aware | 400–600 tokens |
| Legal contracts | Structure-aware + parent-child | 300–500 tokens (children) |
| Chat/support transcripts | Semantic or fixed-size | 200–400 tokens |
| API references | One chunk per endpoint/method | Variable |
| Tabular data | Row-level or section-level | One row = one chunk |
| Research papers | Section-based with abstract as standalone chunk | 400–800 tokens |
Embedding and Indexing
Once chunks exist, they need vector representations. The choice of embedding model determines the ceiling for retrieval quality — no amount of reranking fixes embeddings that put the right chunk far from the query in vector space.
The Embeddings in Practice deep dive covers model selection in detail. The short version for RAG specifically:
| Model | Dimensions | MTEB Retrieval Avg | Sweet Spot |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 (or truncated) | ~59 | General purpose, easy integration |
| Cohere embed-v4 | 1024 | ~60 | Multilingual, native int8 support |
| Voyage voyage-3-large | 1024 | ~61 | Code and technical content |
| BGE-en-ICL | 4096 | ~60 | Self-hosted, strong zero-shot |
| Jina jina-embeddings-v3 | 1024 | ~59 | Long-context (8192 tokens) |
Practical note: MTEB scores measure broad retrieval quality across many datasets. Performance on your data depends on domain match. A quick evaluation on 50–100 hand-labeled query-document pairs from your actual corpus is worth more than any benchmark table.
Indexing
The vector index determines retrieval speed and recall. For most production RAG systems, the choice is between:
- HNSW (Hierarchical Navigable Small World): High recall (>95%), moderate memory usage, used by Qdrant, Weaviate, and pgvector (with the `hnsw` index type). Best default choice.
- IVF (Inverted File Index): Lower memory, faster at scale, but lower recall unless `nprobe` is tuned. Used as a fallback for very large collections.
- Flat/brute-force: Perfect recall, O(n) search. Fine for <100k chunks.
The How Vector Databases Actually Work deep dive covers these in depth. For RAG, the key parameter is recall@k: what fraction of the true top-k results does the approximate index actually return? HNSW with default parameters (ef_construction=128, M=16) typically achieves 95%+ recall@10. That remaining 5% miss rate compounds across queries.
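In Qdrant, for instance, those HNSW parameters are set at collection creation. A sketch with qdrant-client (collection name, endpoint, and vector size are assumptions; `m` and `ef_construct` mirror the values quoted above):

```python
# HNSW index configuration sketch with qdrant-client.
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # hypothetical endpoint
client.create_collection(
    collection_name="docs",  # hypothetical collection
    vectors_config=models.VectorParams(size=1024, distance=models.Distance.COSINE),
    hnsw_config=models.HnswConfigDiff(m=16, ef_construct=128),
)
```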
Retrieval Methods
The query path starts here. A user question arrives, gets transformed into a retrieval query (sometimes multiple), and the system fetches candidate chunks.
Single-Vector Semantic Search
The baseline: embed the query with the same model used for chunks, find the nearest neighbors by cosine similarity (or dot product, or L2 distance — cosine similarity is standard for normalized embeddings).
Basic semantic retrieval: single embedding, single ANN search.
Failure mode: semantic search misses lexical matches. The query “error code 4012” may not embed close to a chunk containing “Error 4012: Connection timeout exceeded” if the embedding model doesn’t handle alphanumeric codes well. This is the primary motivation for hybrid search.
Multi-Query Retrieval
Generate multiple reformulations of the original query, run each through retrieval, and union the results. An LLM generates 3–5 query variants that capture different phrasings or aspects of the original question.
MULTI_QUERY_PROMPT = """Generate 3 different search queries that would help
answer this question. Each query should approach the topic from a different
angle or use different terminology.
Question: {question}
Queries (one per line):"""
This adds one LLM call (~200-500ms) but meaningfully improves recall for ambiguous or multi-faceted questions. The query “How do I handle auth?” might generate: “authentication implementation guide,” “OAuth token management,” “API key security best practices.”
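A sketch of the surrounding plumbing, with `llm.complete` and `retrieve` as hypothetical stand-ins for your LLM client and retrieval function:

```python
# Multi-query retrieval sketch: run every variant, union results by chunk id.
def multi_query_retrieve(question: str, llm, retrieve, top_k: int = 10) -> list:
    raw = llm.complete(MULTI_QUERY_PROMPT.format(question=question))
    queries = [question] + [q.strip() for q in raw.splitlines() if q.strip()]
    seen, merged = set(), []
    for q in queries:
        for chunk in retrieve(q, top_k=top_k):
            if chunk.chunk_id not in seen:  # dedupe across variants
                seen.add(chunk.chunk_id)
                merged.append(chunk)
    return merged
```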
HyDE (Hypothetical Document Embeddings)
Instead of embedding the query directly, generate a hypothetical answer, then embed that answer and use it as the search vector. The intuition: the hypothetical answer is closer in embedding space to the real answer chunk than the question is.
In practice, HyDE helps for short, vague queries (“pricing?”) but can hurt for specific factual queries (“What’s the rate limit for the /users endpoint?”) because the hypothetical answer may contain fabricated specifics that pull the embedding in the wrong direction. Use it selectively, not as a default.
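Mechanically, HyDE is a one-step change to the retrieval call: generate first, then embed the generation. A sketch with hypothetical `llm`, `embedder`, and `vector_store` interfaces:

```python
# HyDE sketch: embed a hypothetical answer rather than the query itself.
hypothetical = llm.complete(
    f"Write a short passage that answers this question: {question}"
)
results = vector_store.search(embedder.embed(hypothetical), top_k=10)
```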
Filtered Retrieval
Combine vector search with metadata filters. Instead of searching the entire index, restrict to chunks matching source_type = "api_docs" or created_after = "2025-01-01".
This is underappreciated. For a support chatbot over product documentation, knowing the user’s product tier or the feature area they’re asking about can eliminate 90% of irrelevant results before vector search even runs. Most vector databases support pre-filtering (filter before ANN search) and post-filtering (ANN search then filter). Pre-filtering is preferred — post-filtering can return fewer than k results if many ANN matches fail the filter.
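With qdrant-client, for example, a pre-filtered search looks roughly like this (collection name and endpoint are assumptions; verify the filter syntax against your client version):

```python
# Pre-filtered vector search sketch: the filter applies during ANN search.
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # hypothetical endpoint
hits = client.search(
    collection_name="docs",  # hypothetical collection
    query_vector=query_embedding,
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="source_type",
                match=models.MatchValue(value="api_docs"),
            )
        ]
    ),
    limit=10,
)
```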
Hybrid Search
Combine semantic (vector) search with lexical (keyword) search. This is the single most effective retrieval improvement for production RAG systems.
Why Hybrid
Semantic search captures meaning but misses exact terms. Keyword search (BM25) captures exact terms but misses semantic equivalence. The intersection of their failure modes is much smaller than either alone.
Examples of queries where each method wins:
| Query | Semantic Wins | BM25 Wins |
|---|---|---|
| “How to handle user login” | ✓ (matches “authentication flow”) | |
| “ERR_CONNECTION_REFUSED” | | ✓ (exact error string) |
| “cancel subscription” | ✓ (matches “terminate membership”) | |
| “config.yaml maxRetries” | | ✓ (exact parameter name) |
| “slow API responses” | ✓ (matches “latency optimization”) | |
Implementation
Most vector databases now offer hybrid search natively. Qdrant, Weaviate, and Elasticsearch/OpenSearch support it. For pgvector, combine with tsvector full-text search in the same query.
Hybrid search runs two parallel retrieval paths and fuses the results.
Score Fusion
The two search methods return scores on different scales (cosine similarity is bounded in [−1, 1], typically [0, 1] for text embeddings; BM25 scores are unbounded). Two standard fusion approaches:
Reciprocal Rank Fusion (RRF): Score each result by 1 / (k + rank) where k is a constant (typically 60), then sum across methods. Simple, parameter-light, hard to beat.
def reciprocal_rank_fusion(
    result_lists: list[list[str]],
    k: int = 60
) -> dict[str, float]:
    # Each inner list is a ranking of doc ids from one retrieval method.
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            # enumerate is 0-based, so the first result scores 1 / (k + 1).
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank + 1)
    # Highest fused score first.
    return dict(sorted(scores.items(), key=lambda x: -x[1]))
Weighted linear combination: Normalize both score distributions to [0, 1], then combine with weights: α * semantic_score + (1 - α) * bm25_score. Typical α values range from 0.5 to 0.7 (slightly favoring semantic). Requires tuning α on evaluation data.
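A minimal sketch of the weighted variant, assuming per-method score dictionaries keyed by chunk id (min-max normalization and α = 0.6 are assumptions to tune on your evaluation set):

```python
# Weighted linear fusion sketch.
def weighted_fusion(
    semantic: dict[str, float],
    bm25: dict[str, float],
    alpha: float = 0.6,
) -> dict[str, float]:
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # guard against identical scores
        return {k: (v - lo) / span for k, v in scores.items()}

    s, b = normalize(semantic), normalize(bm25)
    return {
        doc_id: alpha * s.get(doc_id, 0.0) + (1 - alpha) * b.get(doc_id, 0.0)
        for doc_id in set(s) | set(b)
    }
```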
RRF is the better default — it’s rank-based, so it’s invariant to score distribution differences, and it doesn’t need tuning.
Reranking
Retrieval returns candidates. Reranking reorders them with a more expensive, more accurate model. The retrieval system prioritizes recall (get the right chunk into the candidate set); the reranker prioritizes precision (push the right chunk to the top).
Cross-Encoder Rerankers
A cross-encoder takes (query, document) as a pair and outputs a relevance score. Unlike bi-encoders (which embed query and document independently), cross-encoders can attend across both inputs simultaneously, making them substantially more accurate.
| Reranker | Notes |
|---|---|
| Cohere rerank-v3.5 | Strong general-purpose, API-based, 4096 token input |
| Voyage rerank-2 | Good for code and technical content |
| bge-reranker-v2-m3 | Open-source, self-hostable, competitive quality |
| BAAI/bge-reranker-v2.5-gemma2-lightweight | Lighter open model, good latency/quality tradeoff |
| Jina jina-reranker-v2 | 8192 token context window |
Where Reranking Fits
Reranking narrows 50 retrieval candidates to the 5 most relevant before generation.
Retrieve wide, rerank narrow. A typical pattern: retrieve 30–50 candidates (high recall), rerank all of them, pass the top 3–5 to the LLM. The reranking step adds 100–300ms for API-based rerankers or 50–150ms for self-hosted models on a GPU.
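For a self-hosted setup, the open bge-reranker from the table above can be driven through sentence-transformers' CrossEncoder. A sketch, where `query` and `candidate_texts` are assumed to come from the retrieval step:

```python
# Cross-encoder reranking sketch: the model scores each (query, document)
# pair jointly, attending across both inputs.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")
pairs = [(query, text) for text in candidate_texts]  # 30-50 candidates
scores = reranker.predict(pairs)  # one relevance score per pair
top5 = sorted(zip(candidate_texts, scores), key=lambda p: -p[1])[:5]
```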
The Impact
Reranking typically improves answer quality by 5–15% on retrieval benchmarks (measured by MRR@10 or NDCG@10). The improvement is largest when the initial retrieval has good recall but noisy precision — exactly the situation with hybrid search returning 50 candidates.
ColBERT-Style Late Interaction
ColBERT represents a middle ground between bi-encoders and cross-encoders. It encodes query and document independently (like a bi-encoder) but uses token-level interaction at scoring time rather than a single vector comparison. This enables precomputing document representations while still capturing fine-grained relevance.
RAGatouille provides a practical ColBERT implementation. It works well as a reranker or even as the primary retrieval method for smaller corpora (<1M chunks), but the index size is larger than standard embeddings (one vector per token rather than one per chunk).
Context Assembly and Prompt Construction
The reranker has selected the top-n chunks. Assembling them into a prompt is where many systems make preventable mistakes.
Chunk Ordering
LLMs exhibit a “lost in the middle” effect — information at the beginning and end of the context window gets more attention than information in the middle. Place the most relevant chunk first (by reranker score), then the rest in decreasing relevance. Some implementations put the most relevant chunk both first and last, though this wastes tokens.
Citation and Grounding
Include chunk metadata in the prompt so the model can cite sources:
[Source 1: API Authentication Guide, updated 2025-11-15]
To authenticate API requests, include a Bearer token in the Authorization
header. Tokens expire after 24 hours and must be refreshed using the
/auth/refresh endpoint...
[Source 2: Error Codes Reference, updated 2026-01-03]
Error 401: Invalid or expired authentication token. Ensure the token
has not expired and was issued for the correct environment (staging vs
production)...
Then instruct the model: “Answer based only on the provided sources. Cite sources by number. If the sources don’t contain the answer, say so.”
Token Budget
With current context windows (Claude Sonnet 4.6: 200k tokens, GPT-5.4: 128k tokens, Gemini 3.1 Pro: 1M tokens), the limiting factor isn’t fitting chunks — it’s cost and latency. Every input token costs money. A prompt with 20 chunks of 500 tokens each = 10,000 context tokens per query. At even modest per-token rates, this adds up at scale.
A practical budget: 3–5 chunks × 400–600 tokens = 1,500–3,000 context tokens. Enough to answer most questions; cheap enough for production traffic.
Production Failure Modes
This section covers the ways RAG systems fail that don’t show up in demos.
Failure Mode 1: Chunk Boundary Splits
The answer spans two chunks. Neither chunk alone contains sufficient information. The retriever might find one or both, but the model sees partial information and either hallucinates the rest or gives an incomplete answer.
Mitigation: Overlap during chunking (100–200 tokens). Parent-child retrieval. Retrieve adjacent chunks when confidence is low.
Failure Mode 2: Embedding Blindness
The query and the relevant chunk are semantically related but embed differently. Common for: acronyms (“RBAC” vs “role-based access control”), numerical identifiers (“order #48291”), code identifiers (getUserById), and domain jargon.
Mitigation: Hybrid search (BM25 catches exact terms). Document expansion during ingestion — add a metadata field with expanded acronyms and alternative phrasings.
Failure Mode 3: Stale or Contradictory Sources
The knowledge base contains old documentation alongside current documentation. The retriever returns both. The model picks the wrong one, or worse, merges them into an answer that’s partially correct.
Mitigation: Metadata filtering by date. Deduplication during ingestion (detect near-duplicate chunks and keep only the most recent). Include dates in the prompt context so the model can prefer recent sources.
Failure Mode 4: No Relevant Chunk Exists
The question asks about something not in the knowledge base. The retriever still returns the “nearest” chunks (vector search always returns results). The model sees topically adjacent but factually irrelevant context and fabricates an answer that sounds plausible.
Mitigation: Set a similarity threshold. If the best reranker score is below a threshold (e.g., 0.3 for Cohere rerank-v3.5), return “I don’t have information about that” instead of generating. This requires calibrating the threshold on held-out queries.
A relevance threshold gates generation, preventing hallucinated answers when no relevant context exists.
Failure Mode 5: Multi-Hop Questions
“What’s the difference between the Pro and Enterprise plan rate limits?” requires information from two separate documents (the Pro plan page and the Enterprise plan page). Single-query retrieval may only surface one.
Mitigation: Query decomposition. Detect multi-part questions, split into sub-queries, retrieve for each, and merge results. An LLM call to decompose adds latency but handles this case well.
DECOMPOSITION_PROMPT = """Does this question require information from
multiple distinct topics? If so, split it into 2-3 focused sub-queries.
If it's a single focused question, return it as-is.
Question: {question}
Sub-queries (one per line):"""
Failure Mode 6: Table and Structured Data
Tables in documents often chunk poorly. A row might be separated from its column headers. The model sees “100 | 500 | 1000” without knowing what those numbers represent.
Mitigation: Convert tables to Markdown during preprocessing (preserving headers). Alternatively, store each row as a standalone chunk with column headers included: “Plan: Pro, Rate Limit: 500 req/min, Price: $49/mo.”
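A sketch of the row-serialization approach:

```python
# Row-serialization sketch: each row becomes a self-contained chunk that
# repeats the column headers, so values keep their meaning in isolation.
def rows_to_chunks(headers: list[str], rows: list[list[str]]) -> list[str]:
    return [
        ", ".join(f"{h}: {v}" for h, v in zip(headers, row))
        for row in rows
    ]

chunks = rows_to_chunks(
    ["Plan", "Rate Limit", "Price"],
    [["Pro", "500 req/min", "$49/mo"]],
)
# -> ["Plan: Pro, Rate Limit: 500 req/min, Price: $49/mo"]
```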
Failure Mode 7: Context Window Poisoning
An adversary (or a poorly sanitized document) contains instructions like “Ignore all previous instructions and…” within a chunk that gets retrieved and injected into the prompt. This is prompt injection via RAG context.
Mitigation: Input sanitization during ingestion. Instruction hierarchy (system prompt > user message > retrieved context). Output filtering. The Prompt Injection Prevention in Production deep dive covers this comprehensively.
Evaluation
RAG evaluation has two layers: retrieval quality (did the right chunks get retrieved?) and generation quality (did the model produce a correct answer from those chunks?).
Retrieval Metrics
| Metric | What It Measures | Typical Target |
|---|---|---|
| Recall@k | Fraction of relevant docs in top-k | >0.9 at k=10 |
| MRR (Mean Reciprocal Rank) | Average 1/rank of first relevant result | >0.7 |
| NDCG@k | Rank-aware relevance scoring | >0.7 at k=10 |
| Precision@k | Fraction of top-k that are relevant | Context-dependent |
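The first two metrics are a few lines each. A sketch over ranked ids, where `relevant` holds the ground-truth chunk ids from the evaluation set described below:

```python
# Recall@k and per-query reciprocal rank; average the latter over all
# queries in the eval set to get MRR.
def recall_at_k(relevant: set[str], retrieved: list[str], k: int = 10) -> float:
    return len(relevant & set(retrieved[:k])) / len(relevant)

def reciprocal_rank(relevant: set[str], retrieved: list[str]) -> float:
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```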
Generation Metrics
Manual evaluation on a labeled set of (question, expected answer, relevant chunks) triples is the gold standard. Automated metrics include:
- Faithfulness: Does the answer contain only information from the provided context? LLM-as-judge evaluation (have a model check whether each claim in the answer is supported by the context) is the practical approach.
- Answer relevance: Does the answer actually address the question?
- Completeness: Does the answer cover all aspects of the question that the context supports?
Building an Evaluation Set
Start with 50–100 examples. For each:
- A real user question (or realistic synthetic question)
- The chunk(s) that should be retrieved (ground truth retrieval)
- A reference answer
This takes 2–4 hours of manual work and is the highest-ROI time investment in a RAG project. Without it, every change to chunking, retrieval, or prompts is guesswork.
# Minimal eval structure
eval_set = [
{
"question": "What's the rate limit for the /chat/completions endpoint?",
"relevant_chunk_ids": ["api-ref-chunk-42", "api-ref-chunk-43"],
"reference_answer": "The /chat/completions endpoint allows 500 requests per minute on the Pro plan and 3,500 on the Enterprise plan.",
"metadata": {"category": "api_limits", "difficulty": "easy"}
},
# ... 99 more
]
Evaluation Frameworks
RAGAS provides automated metrics (faithfulness, answer relevance, context relevance) using LLM-as-judge. DeepEval offers similar metrics plus a testing framework. Braintrust and Langfuse provide evaluation dashboards with human annotation support. All are useful for continuous monitoring; none replace the initial manual evaluation set.
Architecture Patterns
Pattern 1: Simple RAG
Single retrieval step, no reranking, chunks directly into the prompt. Works for: internal tools, small knowledge bases (<1000 documents), low query volume.
Simple RAG: suitable for prototypes and small knowledge bases.
Pattern 2: Production RAG
Hybrid search, reranking, metadata filtering, relevance threshold. The recommended starting architecture for any production deployment.
Production RAG: multi-query, hybrid search, reranking, and relevance gating.
Pattern 3: Agentic RAG
An LLM agent decides how to retrieve — which index to query, whether to decompose the question, whether to do a follow-up retrieval based on initial results. The agent has retrieval as a tool rather than retrieval being a fixed pipeline step.
This pattern handles complex questions (multi-hop, comparative, analytical) better than a fixed pipeline but adds latency (2–4 LLM calls per query) and non-determinism. Use it when query complexity justifies the cost.
Agentic RAG: the LLM decides retrieval strategy, can do multiple rounds, and self-checks before answering.
Pattern 4: RAG + Long Context
Current frontier models accept 128k–1M tokens of context. An alternative to traditional RAG: stuff many chunks (or entire documents) into the context and let the model find the relevant parts.
This works surprisingly well for small-to-medium knowledge bases (under 100k tokens total). The “many-shot” approach — put 50+ chunks in context — outperforms top-5 retrieval in some benchmarks because the model can cross-reference across chunks.
The math limits this approach: 100k input tokens per query at frontier model pricing gets expensive quickly at scale. Prompt caching (supported by Anthropic and Google) helps — if the document set is stable, the cached prefix amortizes over many queries.
Practical heuristic: If the total knowledge base fits in 100k tokens and query volume is <1000/day, long context without retrieval is simpler and often better. Beyond that, traditional RAG with retrieval is more cost-effective.
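A sketch of the cached long-context call using Anthropic's prompt caching (field shapes follow the published messages API; the model string is a placeholder, and `all_documents_text` and `question` are assumed to exist):

```python
# Long-context + prompt caching sketch: the stable document block is
# marked with cache_control so its prefix is reused across queries.
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-latest",  # placeholder model id
    max_tokens=1024,
    system=[
        {"type": "text",
         "text": "Answer only from the provided documents."},
        {"type": "text",
         "text": all_documents_text,  # entire knowledge base, <100k tokens
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": question}],
)
print(response.content[0].text)
```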
Pattern 5: Graph RAG
Construct a knowledge graph from documents (entities and relationships), then traverse the graph during retrieval to find connected information. Microsoft’s GraphRAG implementation uses LLMs to extract entities and relationships during ingestion, builds community summaries, and uses graph traversal + vector search at query time.
Graph RAG excels at: questions about relationships (“Which teams depend on the auth service?”), global summarization (“What are the main themes across all customer feedback?”), and multi-hop reasoning. It’s expensive to build (many LLM calls during ingestion) and complex to maintain. Best for high-value, relatively stable corpora where relationship-aware retrieval justifies the overhead.
Putting It Together: A Reference Implementation
A condensed reference implementation showing the core production pattern:
from dataclasses import dataclass
@dataclass
class RetrievedChunk:
chunk_id: str
text: str
score: float
metadata: dict
class RAGPipeline:
def __init__(self, vector_store, bm25_index, embedder, reranker, llm):
self.vector_store = vector_store
self.bm25_index = bm25_index
self.embedder = embedder
self.reranker = reranker
self.llm = llm
self.relevance_threshold = 0.3
    def query(self, question: str, filters: dict | None = None) -> str:
# 1. Hybrid retrieval
query_embedding = self.embedder.embed(question)
semantic_results = self.vector_store.search(
query_embedding, top_k=30, filters=filters
)
bm25_results = self.bm25_index.search(question, top_k=30)
# 2. RRF fusion
fused = self._rrf_fusion(
[semantic_results, bm25_results], k=60
)
candidates = fused[:30] # top 30 after fusion
        # 3. Rerank (assumes Cohere-style results exposing .index and .score)
        reranked = self.reranker.rerank(
            query=question,
            documents=[c.text for c in candidates],
            top_n=5
        )
        # 4. Relevance gate
        if reranked[0].score < self.relevance_threshold:
            return "I don't have enough information to answer that question."
        # Map reranker results back to the chunk objects so metadata survives
        top_chunks = [candidates[r.index] for r in reranked]
        # 5. Build prompt and generate
        context = self._build_context(top_chunks)
return self.llm.generate(
system="Answer based only on the provided sources. "
"Cite sources by number. If the sources don't "
"contain the answer, say so.",
user=f"Sources:\n{context}\n\nQuestion: {question}"
)
def _rrf_fusion(self, result_lists, k=60):
scores = {}
chunk_map = {}
for results in result_lists:
for rank, chunk in enumerate(results):
scores[chunk.chunk_id] = scores.get(chunk.chunk_id, 0) \
+ 1 / (k + rank + 1)
chunk_map[chunk.chunk_id] = chunk
ranked_ids = sorted(scores, key=scores.get, reverse=True)
return [chunk_map[cid] for cid in ranked_ids]
def _build_context(self, chunks):
parts = []
for i, chunk in enumerate(chunks):
source_info = chunk.metadata.get("title", "Unknown")
date = chunk.metadata.get("created_at", "Unknown date")
parts.append(
f"[Source {i+1}: {source_info}, {date}]\n{chunk.text}"
)
return "\n\n".join(parts)
Summary
The reliable RAG stack in 2026:
- Chunking: Structure-aware recursive splitting at 400–600 tokens with parent-child retrieval for context. Preserve metadata aggressively.
- Hybrid search: Always combine semantic and keyword retrieval. RRF fusion is the robust default.
- Reranking: A cross-encoder reranker on 30–50 candidates, selecting the top 3–5. This is the highest-ROI component after hybrid search.
- Relevance gating: Threshold on reranker scores to prevent hallucinated answers when no relevant context exists.
- Evaluation: Build a 50–100 example labeled set. Measure retrieval recall and generation faithfulness. Without evaluation data, every pipeline change is guesswork.
- Failure modes: Chunk boundary splits, embedding blindness for codes/acronyms, stale sources, no-context hallucinations, multi-hop misses, and table breakage are the recurring problems. Each has specific mitigations.
The pattern that handles most production use cases: multi-query expansion → hybrid search (vector + BM25) → RRF fusion → cross-encoder reranking → relevance threshold → LLM generation with citations. Start here, measure, then add complexity (agentic retrieval, graph RAG, query decomposition) only where evaluation shows specific gaps.
Further Reading
- RAGAS Documentation — Framework for automated RAG evaluation metrics (faithfulness, relevance, context recall)
- LangChain RAG Tutorial — Step-by-step RAG implementation with LangChain, covering chunking through generation
- Qdrant Hybrid Search Documentation — Native hybrid search implementation with sparse and dense vectors
- Cohere Rerank API — Production reranking API documentation with model details and integration examples
- RAGatouille — ColBERT-based retrieval library for easy late-interaction search
- Microsoft GraphRAG — Knowledge graph construction and community summarization for RAG over large document collections
- Unstructured.io — Document parsing library supporting PDFs, HTML, images, and other formats for RAG ingestion pipelines
- Dense X Retrieval: What Retrieval Granularity Should We Use? — Research paper on proposition-based chunking and retrieval granularity tradeoffs
- Lost in the Middle: How Language Models Use Long Contexts — Stanford research on positional bias in long-context retrieval, informing chunk ordering strategies