Office Hours — Should you use agentic search or RAG for retrieval, and what's the tradeoff in production?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
The honest answer: RAG is still your default unless you have a specific reason to go agentic. But the reasons are shifting fast, and most teams are misunderstanding what “agentic retrieval” actually buys you.
What Each One Actually Does
RAG (retrieval-augmented generation) is straightforward. You embed a query, find the top K similar documents, pass them to an LLM, done. The flow is: query → retrieve → generate. The pipeline is fixed and predictable: every query takes the same path through it.
Agentic search means the model decides what to search for, when to search, which results to trust, and whether to search again. The model reasons about the retrieval problem instead of you solving it upfront. This sounds great until you run it in production.
The Real Tradeoffs
RAG wins on:
Latency. A RAG call is one embedding, one vector DB lookup, one LLM call. Agentic search requires the model to decide to search, which means reasoning time before you even hit your vector DB. With Claude Opus 4.7 or GPT-5.5, that reasoning can add 500ms to 3 seconds per decision point depending on query complexity.
Cost predictability. RAG’s cost is linear: you know exactly how many tokens you’re spending per query (embedding + context + response). Agentic search with tool calling introduces variable token usage. If the model decides to search three times, you’re paying for three rounds of reasoning and retrieval. At high production volume this compounds fast; see the cost sketch after this list.
Observability. When RAG fails, you can see exactly which documents were retrieved and why the answer was wrong. With agentic search, you have to trace the model’s reasoning: why it decided to search, what query it issued, and whether it trusted the results. Debugging a failing agent is slower than debugging RAG.
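To make the cost asymmetry concrete, here’s a back-of-envelope model. Every token count and price below is an illustrative assumption, not a measurement; the point is that RAG’s spend is a constant per query while agentic spend scales with however many searches the model decides to run.

# Back-of-envelope cost model; all numbers here are assumptions.
def rag_cost(context_tokens: int, response_tokens: int, price_per_1k: float) -> float:
    # One embedding, one retrieval, one LLM call: same spend for every query.
    return (context_tokens + response_tokens) / 1000 * price_per_1k

def agentic_cost(searches: int, reasoning_tokens: int, context_tokens: int,
                 response_tokens: int, price_per_1k: float) -> float:
    # Each search the model chooses to run adds reasoning and context tokens.
    total = searches * (reasoning_tokens + context_tokens) + response_tokens
    return total / 1000 * price_per_1k

print(rag_cost(1500, 300, 0.01))               # fixed: 0.018 per query
print(agentic_cost(3, 400, 1500, 300, 0.01))   # variable: 0.060 for a 3-search query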
Agentic search wins on:
Complex information needs. If your user question requires multiple searches across different contexts, agentic search can chain them. Example: “What’s the revenue impact of the Q3 price change, and how does it compare to last year’s Q3?” A pure RAG system will retrieve everything about price changes, but might miss the comparison component. An agent reasons, “I need current Q3 revenue data and historical Q3 revenue” and fetches both intelligently.
Handling uncertainty. RAG always returns something. Agentic search can recognize “I don’t have enough information to answer this” and search again before giving up. This is valuable in high-stakes domains like legal or healthcare, where a confident-sounding wrong answer is worse than admitting uncertainty.
Dynamic query refinement. If the first search returns garbage, an agent can try a different search strategy. RAG will just use the bad results. This matters for messy corpora where a naive embedding match fails.
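A minimal sketch of that refinement loop, reusing the same assumed embed/vector_db/llm clients as the examples below, plus a hypothetical score_relevance() helper and threshold:

# Query-refinement loop. All client names and the 0.7 threshold are assumptions.
def retrieve_with_refinement(question: str, max_attempts: int = 3):
    query = question
    docs = []
    for _ in range(max_attempts):
        docs = vector_db.search(embed(query), top_k=5)
        if score_relevance(question, docs) >= 0.7:
            return docs
        # Results scored poorly: ask the model to reformulate the search.
        query = llm.generate(
            prompt=f"Rewrite this search query to surface better documents: {query}"
        )
    return docs  # best effort after max_attempts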
Production Reality Check
The Daily Signal from May 9 covered exactly this problem: “RAG Systems Can’t Tell Time—And That’s Costing Real Users.” A production AI tutor was giving outdated answers because RAG pulled the most semantically similar document, not the most recent one. The fix wasn’t to go agentic; it was to add a temporal ranking layer to RAG.
This matters because most of what people think requires agentic search can actually be solved with better RAG engineering: hybrid search (semantic + keyword), reranking, time-aware retrieval, explicit metadata filtering.
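The temporal ranking layer, for instance, can be as simple as blending similarity with a recency decay. A minimal sketch; the half-life and weight are assumptions to tune against your corpus:

import math
from datetime import datetime, timezone

# Blend semantic similarity with document recency. The 90-day half-life and
# 0.3 weight are illustrative defaults, not recommendations.
# updated_at must be a timezone-aware UTC datetime.
def time_aware_score(similarity: float, updated_at: datetime,
                     half_life_days: float = 90.0, recency_weight: float = 0.3) -> float:
    age_days = (datetime.now(timezone.utc) - updated_at).days
    recency = math.exp(-math.log(2) * age_days / half_life_days)  # 1.0 = brand new
    return (1 - recency_weight) * similarity + recency_weight * recency

Sort your top-K candidates by this score instead of raw similarity and the “most semantically similar but two years stale” failure mode largely disappears.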
A Concrete Example
Say you’re building a customer support bot over a knowledge base. Your corpus has PDFs updated monthly.
RAG approach:
def answer_support_query(question: str) -> str:
    # embed(), vector_db, and llm stand in for your embedding client,
    # vector store, and LLM client.
    embedding = embed(question)
    docs = vector_db.search(embedding, top_k=5)
    return llm.generate(prompt=f"Context: {docs}\n\nQuestion: {question}")
Cost: ~50 tokens for embedding + context + response. Latency: ~800ms.
Agentic approach:
def answer_support_query(question: str) -> str:
    # agent is an assumed tool-calling loop; the three search_* functions
    # are tools the model can choose between on each turn.
    response = agent.run(
        question,
        tools=[search_kb, search_faqs, search_recent_updates],
        model=claude_opus_4_7,
    )
    return response
The agent might think: “This is a billing question, I should search recent billing updates first, then general FAQs.” Cost: ~2,000 tokens (reasoning + tool calls + context + response). Latency: ~3-5 seconds.
The agentic version is smarter about which tool to use. But the RAG version with explicit metadata filtering (“only docs modified in last 30 days”) + a reranker solves 85% of the same problem at 1/40th the cost.
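Here’s what that filtered-plus-reranked RAG looks like. The filter syntax is hypothetical (every vector DB spells it differently), and rerank() is sketched in the final section:

from datetime import datetime, timedelta, timezone

# Metadata-filtered RAG with reranking. The shape of the filter argument is
# an assumption; check your vector DB's actual API.
def answer_support_query_filtered(question: str) -> str:
    cutoff = datetime.now(timezone.utc) - timedelta(days=30)
    candidates = vector_db.search(
        embed(question),
        top_k=20,  # over-retrieve, then let the reranker pick
        filter={"modified_at": {"gte": cutoff.isoformat()}},
    )
    docs = rerank(question, candidates)[:5]
    return llm.generate(prompt=f"Context: {docs}\n\nQuestion: {question}")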
When to Actually Go Agentic
You have heterogeneous data sources and the query doesn’t fit neatly into one retrieval pattern. Not “I have a lot of documents” (that’s just big RAG), but “I have databases, APIs, document stores, and the right answer requires reasoning about which to hit.”
You need an auditable decision trail. If you’re in legal or regulatory work, agentic search with explicit reasoning chains gives you a record of why the model chose to search for what. Pure RAG obscures that decision.
You’re dealing with adversarial queries. If users are trying to trick your system, an agent that can recognize a trick and search differently is harder to fool than RAG that always uses the same embedding strategy.
Otherwise, invest in RAG infrastructure: better chunking, metadata filtering, reranking (use Claude or GPT-5.4 as a cheap reranker), hybrid search. These are cheaper, faster, and easier to debug than agent loops.
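The LLM-as-reranker piece can be a single prompt per candidate. A sketch; the 0-10 scale and prompt wording are assumptions, and a dedicated reranker model is cheaper at volume:

# LLM-as-reranker. llm.generate is the same assumed client as above; the
# scoring scale and prompt are illustrative.
def rerank(question: str, docs: list) -> list:
    scored = []
    for doc in docs:
        reply = llm.generate(
            prompt=f"Rate 0-10 how well this document answers the question.\n"
                   f"Question: {question}\nDocument: {doc}\nAnswer with a number only."
        )
        try:
            scored.append((float(reply.strip()), doc))
        except ValueError:
            scored.append((0.0, doc))  # unparseable score: rank last
    return [doc for _, doc in sorted(scored, key=lambda s: s[0], reverse=True)]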
Bottom line: Use RAG as your default. Go agentic only when your retrieval problem genuinely requires reasoning about which sources to hit, not just better ranking of fixed sources. Most teams haven’t maxed out their RAG setup yet.
Question via Hacker News