Office Hours — Is RAG an antipattern for AI agents, or is it still the right default approach?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
RAG is fine for retrieval tasks. It’s an antipattern when you treat it as a substitute for actual reasoning or decision-making.
The problem is that agents often need to act on information, not just read it back. If your agent pulls a document from your knowledge base and then immediately returns it to the user, that’s search, not agency. But if your agent needs to synthesize facts, choose between conflicting data sources, or decide what action to take based on retrieved context, RAG becomes a bottleneck because retrieval quality directly limits what the agent can decide.
RAG as a context bottleneck
The real issue surfaces when you trace through a planning agent’s execution. Consider an autonomous coding agent (Claude Opus 4.7 or GPT-5.4 with computer use) that needs to refactor a codebase while respecting architectural constraints stored in scattered documentation. Pre-retrieving all “relevant” docs wastes context tokens on noise. The agent can’t reason about what matters until it’s already spent budget reading low-confidence matches. Worse, if your retriever misses a critical constraint document, the agent proceeds confidently on incomplete information.
Better approach: let the agent decide when to look something up. Instead of dumping 20 retrieved documents into the system prompt, give the agent access to a search tool it can call mid-reasoning. This scales better because the agent’s token budget stays allocated to actual planning, not passive document review.
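A minimal sketch of that pattern, with a stub in-memory corpus and a hypothetical `search_docs` tool standing in for your real retriever and model: the agent issues a search only at the step that needs it, rather than receiving pre-retrieved documents upfront.

```python
# Tool-driven retrieval sketch: the agent calls search_docs mid-reasoning
# instead of having documents pre-loaded into its prompt. The corpus,
# tool, and planning step are all illustrative stand-ins.

DOCS = {
    "service-boundaries": "Services must not share database schemas.",
    "style-guide": "Use snake_case for module names.",
}

def search_docs(query: str) -> list[str]:
    """Return only the documents whose name matches the query, on demand."""
    return [text for name, text in DOCS.items() if query in name]

def agent_step(plan_state: dict) -> dict:
    # The agent decides this step needs a constraint check, so it searches
    # now, spending context only on what the current decision requires.
    if plan_state["action"] == "split-database":
        plan_state["constraints"] = search_docs("service-boundaries")
    return plan_state

state = agent_step({"action": "split-database"})
print(state["constraints"])  # only the boundary doc, not the whole corpus
```

The point is the shape of the loop: retrieval happens inside a planning step, triggered by the agent's own decision, so unrelated documents never enter context.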
Tool-calling over pre-retrieval
Concrete pattern: use tool definitions for fact-grounding. An agent handling customer support tickets should have tools like fetch_account_history() and check_policy_docs() rather than RAG vectors of all policies baked into context. The agent calls these tools as needed, sees the output, and decides the next step. This is especially valuable for agents handling real-time data (inventory, customer records, API responses) where pre-retrieval is stale by the time execution happens.
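Here is one way those tool definitions might look, sketched in the JSON-schema style most tool-calling APIs accept; the schemas and the stub backends in `dispatch` are illustrative, not a real provider's format.

```python
# Illustrative tool definitions for a support agent. The agent calls these
# as needed instead of having all policy text embedded into its context.

TOOLS = [
    {
        "name": "fetch_account_history",
        "description": "Return recent orders and tickets for a customer.",
        "input_schema": {
            "type": "object",
            "properties": {"customer_id": {"type": "string"}},
            "required": ["customer_id"],
        },
    },
    {
        "name": "check_policy_docs",
        "description": "Look up a support policy by topic.",
        "input_schema": {
            "type": "object",
            "properties": {"topic": {"type": "string"}},
            "required": ["topic"],
        },
    },
]

def dispatch(tool_name: str, args: dict) -> str:
    """Route a model-issued tool call to a backend; stubbed for illustration."""
    if tool_name == "fetch_account_history":
        return f"orders for {args['customer_id']}: []"
    if tool_name == "check_policy_docs":
        return f"policy for {args['topic']}: refunds within 30 days"
    raise ValueError(f"unknown tool: {tool_name}")
```

Because `fetch_account_history` hits a live backend at call time, the data the agent reasons over is current at the moment of execution, which is exactly what pre-retrieval can't guarantee.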
The tradeoff is latency per tool call versus context efficiency. A tool call adds roughly 500-1000ms overhead depending on your backend. Pre-retrieval is faster but ties your agent’s reasoning to whatever your retriever happened to fetch. For agents running loops over minutes or hours (long-running tasks, multi-step workflows), the latency difference is negligible relative to total execution time. For single-turn response agents (chatbots, instant lookups), pre-retrieval RAG still makes sense.
Concrete example: coding agent with architectural constraints
Suppose you’re building an agent that refactors a monolith into microservices. Your constraints live in three places: a service boundary doc (400 tokens), API contract specs (2000 tokens), and past decisions in a decision log (1500 tokens). That’s 3900 tokens of useful context.
Naive RAG approach: embed all three docs, retrieve top-k matches for the refactoring task, and load them into the system prompt upfront. Result: the agent sees all 3900 tokens immediately, burns that context budget before writing a single line of code, and has already had which constraints matter decided for it by embedding similarity. If the retriever ranks the decision log below the cutoff and the agent needs it halfway through planning, it’s already committed to a plan.
Tool-calling approach: give the agent three callable tools. It starts planning with a blank slate. When it proposes moving a database layer into a separate service, it calls fetch_related_decisions("database separation"). The decision log returns the three relevant past decisions (400 tokens). The agent synthesizes those, then calls check_service_boundaries("database") to confirm the proposal doesn’t violate contracts. Now it has used 1200 tokens of context for actual reasoning instead of 3900 for pre-loaded noise.
The second approach costs 1-2 extra API calls but preserves 2700 tokens for the agent’s actual work. On a 200K context window, this seems trivial. But on a smaller budget (GPT-5.4 Thinking, which caps reasoning at specific depths) or a very long-running agent (Opus 4.7 holding state across 12+ hour workflows), that budget discipline pays off.
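The token math above reduces to simple arithmetic. This sketch uses the example's own numbers (the 400/2000/1500 split and the 1200-token on-demand total, with the per-call breakdown assumed):

```python
# Back-of-envelope token accounting for the two approaches. Document
# sizes come from the example; the per-call split is an assumption.

DOC_TOKENS = {"boundaries": 400, "contracts": 2000, "decisions": 1500}

def preload_cost() -> int:
    # Naive RAG: everything lands in context upfront.
    return sum(DOC_TOKENS.values())

def on_demand_cost(calls: list[tuple[str, int]]) -> int:
    # Tool calling: each call returns only the slice the agent asked for.
    return sum(tokens for _, tokens in calls)

naive = preload_cost()                                           # 3900
targeted = on_demand_cost([("decisions", 400), ("contracts", 800)])  # 1200
print(naive - targeted)  # 2700 tokens preserved for planning
```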
When RAG remains the right default
RAG still works well for grounding specific, well-scoped facts: reference documentation that rarely changes, regulatory docs, customer account data pulled fresh at query time. The key is scope. Narrow your RAG to high-confidence retrieval targets (specific doc types, exact field lookups) rather than general question-answering over your entire corpus. Use embeddings to find the right document type, then parse and extract specific fields rather than asking the model to summarize retrieved content.
Frontier models like Claude Opus 4.7 and GPT-5.5 have large context windows, so you can afford to retrieve more without truncating. But larger context doesn’t fix poor retrieval quality. A 200K token window filled with mediocre matches is still worse than a 10K window with only the facts that matter.
The edge case: agents over heterogeneous data
One hard case RAG struggles with is agents needing to reason across multiple incompatible data sources simultaneously. Example: an agent planning a customer escalation that needs to cross-reference support tickets (unstructured text), SLA definitions (structured), account status (real-time API), and past resolutions (time-series). Each source has different retrieval semantics. Pre-retrieving from all of them at once pollutes context. Calling separate tools for each source works better but requires the agent to orchestrate the calls in the right order.
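One way to structure that orchestration is one simple tool per source, called in the order the plan needs them, with each result feeding the next decision. All backends below are stubs; the function names and fields are hypothetical.

```python
# Sequential orchestration sketch over heterogeneous sources: one tool
# per source, simple round-trips, no upfront prediction of what's needed.

def get_ticket(ticket_id: str) -> dict:         # unstructured ticket store
    return {"id": ticket_id, "severity": "high"}

def get_sla(severity: str) -> dict:             # structured SLA definitions
    return {"severity": severity, "response_hours": 4}

def get_account_status(customer: str) -> dict:  # real-time API
    return {"customer": customer, "tier": "enterprise"}

def plan_escalation(ticket_id: str, customer: str) -> dict:
    # Each call's output shapes the next call, which is why pre-retrieving
    # from all sources at once can't replicate this ordering.
    ticket = get_ticket(ticket_id)
    sla = get_sla(ticket["severity"])
    account = get_account_status(customer)
    return {
        "escalate": account["tier"] == "enterprise",
        "deadline_hours": sla["response_hours"],
    }

print(plan_escalation("T-100", "acme"))
```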
Current autonomous agents handle this by keeping tool calls simple and letting the model make multiple round-trips. It’s slower but more reliable than trying to predict which data the agent will need upfront. This is where agents like Claude Code with Opus 4.7 shine, since they can tolerate higher latency in exchange for clarity.
Bottom line: RAG is still the right default for fact-grounding, but stop using it as the entire backbone of your agent. Separate retrieval from reasoning. Give agents tool access to fetch data on-demand rather than pre-populating context. This keeps reasoning clearer, reduces hallucination risk, and lets the agent decide what actually matters.
Question via Hacker News