Office Hours — What patterns work best for aggregating results when retrieval-augmented generation approaches fail in agentic systems?

RAG failures in agent contexts are qualitatively different from RAG failures in static QA systems. When an agent hits a retrieval dead end, it doesn’t just produce a worse answer—it can misinterpret silence as permission to hallucinate, backtrack into loops, or waste token budget on futile retries. You need aggregation patterns that degrade gracefully and signal failure explicitly rather than masking it.

The Core Problem: RAG Failure Modes in Agents

Standard RAG assumes a retrieval-then-generation pipeline. Agents assume iterative refinement. When retrieval fails silently in an agent loop, the agent doesn’t know it’s working from nothing and can confidently generate plausible falsehoods. A static RAG system returns a bad answer once. An agent returns bad answers repeatedly, refining them based on hallucinated feedback.

The issue compounds because agentic systems often chain retrievals across heterogeneous sources—internal docs, APIs, databases, external search. Any one source failing doesn’t halt the agent; it just reduces signal and increases noise in the aggregate.

Pattern 1: Explicit Retrieval Confidence Scoring

Before aggregating results, require the retrieval layer to emit a confidence score for each source. When confidence falls below a threshold, mark that source as unreliable and either exclude it or flag it for human review.

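import time
from typing import List

# Assumed to exist elsewhere: compute_confidence, embed_similarity, CONFIDENCE_THRESHOLD,
# and source objects exposing .name, .retrieve(query), and .historical_accuracy().
def retrieve_with_confidence(query: str, sources: List["Source"]) -> List[dict]:
    results = []
    for source in sources:
        docs = source.retrieve(query)
        if not docs:
            # Nothing came back: record the miss instead of indexing into an empty list
            results.append({
                'docs': [],
                'source': source.name,
                'confidence': 0.0,
                'status': 'NO_RESULTS'
            })
            continue
        # Confidence inputs: BM25 score, semantic similarity, staleness in seconds, source reliability
        confidence = compute_confidence(
            bm25_score=docs[0].score,
            semantic_similarity=embed_similarity(query, docs),
            source_staleness=time.time() - docs[0].metadata['updated'],
            source_reliability=source.historical_accuracy()
        )
        if confidence > CONFIDENCE_THRESHOLD:
            results.append({
                'docs': docs,
                'source': source.name,
                'confidence': confidence
            })
        else:
            # Keep the entry but drop the docs so the failure stays visible to the agent
            results.append({
                'docs': [],
                'source': source.name,
                'confidence': confidence,
                'status': 'BELOW_THRESHOLD'
            })
    return results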
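
The listing above leaves compute_confidence undefined. One way it might be filled in is a weighted blend of normalized signals. This is only a sketch: the weights, the bm25_scale saturation constant, and the 30-day half-life are all assumptions to tune, not values from the original answer.

```python
import math

def compute_confidence(bm25_score: float,
                       semantic_similarity: float,
                       source_staleness: float,
                       source_reliability: float,
                       bm25_scale: float = 10.0,               # assumed saturation constant
                       half_life_s: float = 30 * 24 * 3600) -> float:  # assumed 30-day half-life
    """Blend four signals into a score in [0, 1]. All weights are illustrative."""
    # BM25 scores are unbounded, so squash them into [0, 1)
    lexical = bm25_score / (bm25_score + bm25_scale) if bm25_score > 0 else 0.0
    # Clamp similarity and reliability in case callers pass values outside [0, 1]
    semantic = min(max(semantic_similarity, 0.0), 1.0)
    reliability = min(max(source_reliability, 0.0), 1.0)
    # Freshness halves every half_life_s seconds of staleness
    freshness = 0.5 ** (max(source_staleness, 0.0) / half_life_s)
    return 0.3 * lexical + 0.4 * semantic + 0.15 * freshness + 0.15 * reliability

fresh = compute_confidence(5.0, 0.8, 0.0, 0.9)
stale = compute_confidence(5.0, 0.8, 90 * 24 * 3600, 0.9)
print(f"fresh={fresh:.3f} stale={stale:.3f}")
```

Because every component is bounded, the output stays in [0, 1], which makes a single CONFIDENCE_THRESHOLD comparable across sources.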

When the agent sees that a source failed confidence checks, it can decide to retry with different keywords, escalate to a human, or skip that source entirely. The key is making failure observable.
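
The retry, escalate, or skip decision can be made mechanical. The sketch below assumes the result dicts carry an optional 'status' key as in the listing above. The function name next_action, the attempt counter, and the source_required flag are hypothetical additions for illustration.

```python
def next_action(result: dict,
                attempts_so_far: int,
                max_attempts: int = 2,
                source_required: bool = False) -> str:
    """Map one per-source retrieval result to an agent action.

    Returns one of: 'use', 'retry', 'escalate', 'skip'.
    """
    # No failure status and at least one doc: the source is usable
    if result.get('status') is None and result.get('docs'):
        return 'use'
    # The source failed; retry with a rewritten query while attempts remain
    if attempts_so_far < max_attempts:
        return 'retry'
    # Out of attempts: escalate only if the task cannot proceed without this source
    return 'escalate' if source_required else 'skip'

failed = {'docs': [], 'source': 'wiki', 'confidence': 0.2, 'status': 'BELOW_THRESHOLD'}
print(next_action(failed, attempts_so_far=0))
print(next_action(failed, attempts_so_far=2, source_required=True))
```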

Pattern 2: Consensus-Based Aggregation

If you’re pulling from multiple sources (docs, APIs, databases), aggregate by requiring agreement across sources before the agent acts on a result. If source A says “the API rate limit is 100 req/s” but sources B and C say “1000 req/s”, the agent shouldn’t confidently pick one.

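from typing import List

# Assumed to exist elsewhere: cluster_results_semantically and CONSENSUS_THRESHOLD.
# query is not used in this version.
def aggregate_with_consensus(results: List[dict], query: str) -> dict:
    if not results:
        # Nothing to vote on; also avoids dividing by zero below
        return {'answer': None, 'confidence': 0, 'status': 'NO_RESULTS', 'clusters': []}

    # Group results by semantic equivalence (embedding clustering)
    clusters = cluster_results_semantically(results)
    if not clusters:
        return {'answer': None, 'confidence': 0, 'status': 'NO_RESULTS', 'clusters': []}

    # Find the consensus cluster (most sources agree)
    consensus = max(clusters, key=lambda c: len(c['sources']))
    agreement = len(consensus['sources']) / len(results)

    # Return only if consensus is strong enough
    if agreement >= CONSENSUS_THRESHOLD:
        return {
            'answer': consensus['answer'],
            'confidence': agreement,
            'sources': consensus['sources'],
            'dissenting': [c for c in clusters if c is not consensus]
        }
    return {
        'answer': None,
        'confidence': 0,
        'status': 'NO_CONSENSUS',
        'clusters': clusters  # Let agent see the split opinion
    }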

This forces the agent to see when sources contradict each other, rather than blending conflicting information into a plausible-sounding answer.
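
To make the rate-limit example concrete, here is a small runnable sketch. It swaps embedding clustering for plain string normalization, which is a much cruder stand-in. It also assumes each result carries an 'answer' field and that the threshold is 0.6; neither comes from the original listing.

```python
from collections import defaultdict

CONSENSUS_THRESHOLD = 0.6  # assumed value for illustration

def cluster_by_normalized_answer(results):
    """Stand-in for embedding clustering: group answers that match after normalization."""
    groups = defaultdict(list)
    for r in results:
        key = " ".join(r['answer'].lower().split())
        groups[key].append(r['source'])
    return [{'answer': answer, 'sources': sources} for answer, sources in groups.items()]

def consensus(results):
    clusters = cluster_by_normalized_answer(results)
    if not clusters:
        return {'answer': None, 'confidence': 0.0, 'status': 'NO_RESULTS'}
    top = max(clusters, key=lambda c: len(c['sources']))
    agreement = len(top['sources']) / len(results)
    if agreement >= CONSENSUS_THRESHOLD:
        return {'answer': top['answer'], 'confidence': agreement, 'sources': top['sources']}
    return {'answer': None, 'confidence': 0.0, 'status': 'NO_CONSENSUS', 'clusters': clusters}

results = [
    {'source': 'A', 'answer': '100 req/s'},
    {'source': 'B', 'answer': '1000 req/s'},
    {'source': 'C', 'answer': '1000 Req/s'},
]
outcome = consensus(results)      # B and C agree: 2 of 3 sources
split = consensus(results[:2])    # A vs. B: 1 of 2 sources each
print(outcome)
print(split['status'])
```

With three sources, two agreeing clears the assumed 0.6 threshold. With only A and B, each side holds half the votes, so the function returns NO_CONSENSUS and hands the agent both clusters.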

Pattern 3: Staged Fallback with Explicit Degradation

RAG doesn’t have to be all-or-nothing. Design retrieval as a cascade where each stage has an explicit success criterion and the next stage only triggers if the previous one fell short of it.

# Assumed to exist elsewhere: retrieve_exact_match, retrieve_semantic, retrieve_from_apis.
# Each is assumed to return a list; semantic hits expose a .score attribute.
def staged_retrieval(query: str) -> dict:
    stage_results = {}

    # Stage 1: Exact match from internal KB
    exact = retrieve_exact_match(query)
    stage_results['exact'] = {
        'docs': exact,
        'count': len(exact),
        'success': len(exact) > 0
    }

    if stage_results['exact']['success']:
        return {'source': 'exact', 'result': exact, 'degraded': False}

    # Stage 2: Semantic search on internal KB
    semantic = retrieve_semantic(query, top_k=10)
    stage_results['semantic'] = {
        'docs': semantic,
        'count': len(semantic),
        'min_similarity': min(d.score for d in semantic) if semantic else 0
    }

    # Accept only if there are enough hits and even the weakest one is a close match
    if stage_results['semantic']['count'] > 5 and stage_results['semantic']['min_similarity'] > 0.75:
        return {'source': 'semantic', 'result': semantic, 'degraded': False}

    # Stage 3: API calls (slower, more expensive)
    api_results = retrieve_from_apis(query, timeout=5)
    stage_results['api'] = {
        'docs': api_results,
        'count': len(api_results),
        'source': 'external_api'
    }

    if api_results:
        return {'source': 'api', 'result': api_results, 'degraded': False}

    # Stage 4: Degraded mode: return what we have with explicit warning.
    # exact and api_results are empty here, so this is the low-confidence semantic hits, if any.
    all_results = exact + semantic + api_results
    return {
        'source': 'degraded_aggregation',
        'result': all_results,
        'degraded': True,
        'stages_attempted': stage_results,
        'warning': 'Retrieval confidence below threshold. Results may be incomplete or stale.'
    }

The agent gets metadata about which stage succeeded and can adjust its confidence and tone accordingly. If the result came from the API stage, the agent might say “I found this, but it comes from an external source I can’t verify” instead of claiming certainty. If the degraded flag is set, it should pass the warning along rather than present the leftovers as a confident answer.
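
One way the agent side might consume that metadata is a small function that turns the stage label into a hedging prefix. The function name and the exact wording of each prefix are hypothetical. The 'source', 'degraded', and 'warning' keys follow the staged listing above.

```python
def confidence_preamble(retrieval: dict) -> str:
    """Return a hedging prefix for the agent's answer based on retrieval metadata."""
    if retrieval.get('degraded'):
        warning = retrieval.get('warning', 'Retrieval was degraded.')
        return f"I could not confirm this from a reliable source. {warning} "
    source = retrieval.get('source')
    if source == 'api':
        return "I found this, but it comes from an external source I can't verify. "
    if source == 'semantic':
        return "Based on related internal documents: "
    if source == 'exact':
        return ""  # exact KB match: no hedge needed
    # Unknown stage label: hedge rather than assume the result is trustworthy
    return "I am not sure where this information came from. "

degraded = {
    'source': 'degraded_aggregation',
    'result': [],
    'degraded': True,
    'warning': 'Retrieval confidence below threshold. Results may be incomplete or stale.',
}
print(confidence_preamble(degraded) + "The limit appears to be 1000 req/s.")
```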

Pattern 4: Ensemble Scoring with Weighted Voting

When you have multiple models or retrieval strategies, score each independently and aggregate via weighted voting rather than averaging.

# Assumed to exist elsewhere: bm25_retriever, semantic_retriever, hybrid_retriever,
# each exposing .retrieve(query) and returning docs with an .id attribute.
def ensemble_retrieval(query: str) -> dict:
    # (name, retriever, weight); plain tuples cannot take keyword arguments
    retrievers = [
        ('bm25', bm25_retriever, 0.3),
        ('semantic', semantic_retriever, 0.4),
        ('hybrid', hybrid_retriever, 0.3),
    ]

    all_results = {}
    for name, retriever, weight in retrievers:
        docs = retriever.retrieve(query)
        for doc in docs:
            doc_id = doc.id
            if doc_id not in all_results:
                all_results[doc_id] = {
                    'doc': doc,
                    'votes': 0,
                    'voters': []
                }
            # Each retriever that returns the doc adds its weight to the doc's total
            all_results[doc_id]['votes'] += weight
            all_results[doc_id]['voters'].append(name)

    # Rank by weighted vote count
    ranked = sorted(all_results.items(), key=lambda x: x[1]['votes'], reverse=True)

    return {
        'results': [item[1]['doc'] for item in ranked],
        'consensus_metadata': [
            {
                'doc_id': item[0],
                'votes': item[1]['votes'],
                'retrieved_by': item[1]['voters']
            }
            for item in ranked
        ]
    }

This shows the agent which retrievers agreed on a result, making weak signals visible.
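
A tiny worked example shows what the vote totals look like. The retriever outputs here are hard-coded stand-ins rather than real BM25 or embedding searches, and the Doc class and weighted_votes helper are hypothetical names for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Doc:
    id: str

def weighted_votes(runs, weights):
    """runs maps retriever name -> list of Docs; returns (doc_id, votes, voters) ranked by votes."""
    votes = defaultdict(float)
    voters = defaultdict(list)
    for name, docs in runs.items():
        for doc in docs:
            votes[doc.id] += weights[name]
            voters[doc.id].append(name)
    ranked_ids = sorted(votes, key=lambda doc_id: votes[doc_id], reverse=True)
    return [(doc_id, votes[doc_id], voters[doc_id]) for doc_id in ranked_ids]

# Hard-coded retriever outputs standing in for real searches
runs = {
    'bm25': [Doc('a'), Doc('b')],
    'semantic': [Doc('a'), Doc('c')],
    'hybrid': [Doc('a'), Doc('b')],
}
weights = {'bm25': 0.3, 'semantic': 0.4, 'hybrid': 0.3}

ranking = weighted_votes(runs, weights)
# Docs found by only one retriever are the weak signals worth flagging
single_voter = [doc_id for doc_id, _, who in ranking if len(who) == 1]
for doc_id, total, who in ranking:
    print(doc_id, round(total, 2), who)
```

Document a is returned by all three retrievers and collects the full weight. Document c is returned only by the semantic retriever, so it ranks last and lands in single_voter, which is the kind of weak signal the agent should treat with caution.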

Pattern 5: Explicit Uncertainty Budgeting

Agents can burn token budget fast. Cap the number of retrieval attempts and force degradation when you hit the limit.

class BudgetedAgent:
    def __init__(self, max_retrieval_
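
The listing above is cut off in the source, so what follows is not the author's class. It is a separate minimal sketch of the budgeting idea. RetrievalBudget, max_attempts, and the status strings are hypothetical, and it caps retrieval attempts only, not tokens.

```python
from typing import Callable, Iterable, List

class RetrievalBudget:
    """Caps retrieval attempts and reports degradation explicitly once the cap is hit."""

    def __init__(self, max_attempts: int = 3):
        self.max_attempts = max_attempts
        self.attempts = 0

    def run(self, retrieve: Callable[[str], List], queries: Iterable[str]) -> dict:
        """Try queries in order until one returns docs or the budget is spent."""
        for query in queries:
            if self.attempts >= self.max_attempts:
                break
            self.attempts += 1
            docs = retrieve(query)
            if docs:
                return {'docs': docs, 'degraded': False, 'attempts': self.attempts}
        # Distinguish "ran out of budget" from "ran out of query variants"
        exhausted = self.attempts >= self.max_attempts
        return {
            'docs': [],
            'degraded': True,
            'attempts': self.attempts,
            'status': 'BUDGET_EXHAUSTED' if exhausted else 'NO_RESULTS',
        }

calls = []

def always_empty(query: str) -> List:
    calls.append(query)
    return []

budget = RetrievalBudget(max_attempts=3)
outcome = budget.run(always_empty, ['q1', 'q2', 'q3', 'q4', 'q5'])
print(outcome)
```

Only three of the five query variants are tried. The agent then receives a degraded result with an explicit status instead of looping on further retries.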

*Question via [Hacker News](https://news.ycombinator.com/item?id=47922550)*