Office Hours — How are you preventing runaway LLM workflows and token costs in production?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
Token budgets feel like a solved problem until they’re not. You set a rate limit, maybe a monthly cap, and assume it’s fine. Then an agent retries a flaky API 200 times in a loop, or a RAG system retrieves the same 50-page document ten times per user session, or a feedback loop starts querying the model to evaluate its own outputs, and suddenly your $500/month bill is $5000. The runaway happens silently because LLM costs scale with invisible token counts, not with visible request volume.
The Invisible Cost Multiplier
The hardest part is that token waste doesn’t register as a bug. A request succeeds. The model returns output. Nobody knows it burned 50,000 tokens on a 200-token response because of bloated context or redundant retrieval. A practitioner in the Daily Signal caught their CLAUDE.md file consuming 8,000 tokens per request: a hidden tax on every interaction that they didn’t know existed.
Token consumption scales with:
- Context length. Every extra KB of system prompt, retrieved documents, or conversation history adds linearly to the cost of every request. A 4KB system prompt resent on every call adds up fast across thousands of users (see the quick arithmetic after this list).
- Retry logic. Retries on timeouts or rate limits can spiral without an attempt cap or circuit breaker; exponential backoff slows the loop down, but it doesn’t bound it.
- Nested LLM calls. An agent calling the model to evaluate its own output, then calling again to refine, doubles costs silently.
- Redundant retrieval. RAG systems that fetch the same documents multiple times per workflow because they lack deduplication.
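To put numbers on the context-length point, here’s the back-of-envelope arithmetic, assuming roughly 4 characters per token and a placeholder rate of $3 per million input tokens (substitute your provider’s real pricing and your real traffic):

# Back-of-envelope cost of a 4KB system prompt resent on every request.
PROMPT_BYTES = 4 * 1024
CHARS_PER_TOKEN = 4                      # rough heuristic for English text
PRICE_PER_MILLION_INPUT_TOKENS = 3.00    # placeholder; check your provider's rate
REQUESTS_PER_DAY = 100_000               # placeholder traffic volume

tokens_per_request = PROMPT_BYTES / CHARS_PER_TOKEN  # ~1,024 tokens
daily_cost = tokens_per_request * REQUESTS_PER_DAY / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS
print(f"{tokens_per_request:.0f} tokens/request -> ${daily_cost:,.2f}/day, ${daily_cost * 30:,.2f}/month")
# ~1,024 tokens/request -> ~$307/day, ~$9,216/month for the prompt alone, at these assumed rates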
Concrete Guardrails That Work
The standard approach is multi-layered, not single-lever.
Per-request token budgets with hard stops. Every major provider exposes a hard cap on output tokens (max_tokens on Anthropic, max_completion_tokens on OpenAI), and some APIs let you bound prompt tokens as well. Use them. A coding agent that wanders off and generates a 50KB file shouldn’t become your problem. Set a ceiling that makes sense for your use case, then let generation stop at the ceiling rather than silently charging you.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-1",
    max_tokens=2000,  # Hard stop, not a target
    system=system_prompt,
    messages=messages,
)
This is boring but effective. When the cap is hit, generation stops and the response comes back with stop_reason set to "max_tokens", so the agent can’t spin indefinitely and you get a signal you can log and act on.
Semantic caching to skip redundant calls entirely. If your agent retrieves the same document twice in one workflow, or a user asks the same question twice in a week, don’t call the model again. Semantic caching—matching similar prompts at the embedding level, not token level—can cut inference costs by an order of magnitude.
Provider-side prompt caching (OpenAI’s automatic prefix caching, Anthropic’s cache_control blocks) helps with static context like a long document, but it only matches exact prefixes. Neither provider does semantic caching for you; layer it yourself with a vector DB: before calling the model, embed the prompt, check for a cached response above a similarity threshold (say 0.95), and skip the API call if you find one. The latency win is even better than the cost win.
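A minimal sketch of that pattern with an in-memory cache and OpenAI embeddings (the model names and the 0.95 threshold are illustrative, and a production setup would back this with a real vector DB):

import numpy as np
from openai import OpenAI

client = OpenAI()
SIMILARITY_THRESHOLD = 0.95
cache: list[tuple[np.ndarray, str]] = []  # (prompt embedding, cached response)

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cached_completion(prompt: str) -> str:
    query_vec = embed(prompt)
    # Look for a semantically similar prompt before paying for inference.
    for vec, cached_response in cache:
        similarity = float(np.dot(query_vec, vec) /
                           (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
        if similarity >= SIMILARITY_THRESHOLD:
            return cached_response  # cache hit: no model call, no tokens
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    answer = result.choices[0].message.content
    cache.append((query_vec, answer))
    return answer

The embedding lookup costs a tiny fraction of a completion, so a cache miss barely hurts. Tune the threshold to your domain: set it too low and you start serving mismatched answers.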
Strict retrieval budgets for RAG. Don’t retrieve “all relevant chunks.” Retrieve the top 3 or top 5 by relevance score, then stop. It’s tempting to throw more context at the problem, but it almost never improves output quality past a certain point, and it always increases cost. Set a hard limit on the number of documents or total tokens retrieved per query.
retrieved_chunks = vector_store.search(query, limit=5, score_threshold=0.8)
context = "\n".join([chunk.text for chunk in retrieved_chunks])
# Don't keep adding more chunks hoping for better answers
A production RAG system at Daily Signal (via the May 9 piece on temporal awareness) was pulling “most similar” documents without temporal filtering. They weren’t retrieving more documents, just the wrong ones. Same token cost, worse answers. The fix was adding time-awareness to the retrieval ranking, not throwing more tokens at it.
Monitoring at the workflow level, not just the request level. Track cost per completed task, not per API call. An agent that makes 10 calls to solve a problem is fine if the task costs $0.02 overall. An agent making 2 calls that each cost $0.05 is burning money silently. Instrument your workflows to emit cost + outcome pairs so you can see which paths are efficient and which are drains.
import time
import logging

logger = logging.getLogger("llm_cost")

task_start = time.time()
cost_start = get_api_usage()  # your own accounting helper: cumulative spend in dollars so far

result = agent.run(task)      # run the full agent workflow, however many model calls it makes

cost_end = get_api_usage()
cost_per_task = cost_end - cost_start
duration = time.time() - task_start
logger.info(f"task_type={task.type} cost=${cost_per_task:.4f} duration={duration:.1f}s")
This gives you real data on which tasks are cheap and which are expensive, so you can prioritize fixing the expensive ones.
Rate limiting per user, per workflow. Don’t just set a global monthly budget. Set per-request limits and per-user daily limits. If one user or one feature suddenly starts consuming 10x normal traffic, you catch it before it wipes your budget. Use token-bucket rate limiting at the application level, not just at the API key level.
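Here’s a minimal per-user token bucket at the application layer; it’s in-memory and single-process for illustration, the capacity and refill numbers are assumptions to tune, and a real deployment would back this with Redis or your API gateway:

import time
from collections import defaultdict

class TokenBucket:
    """Per-user budget on LLM tokens, enforced before any model call."""
    def __init__(self, capacity=200_000, refill_per_sec=20.0):
        self.capacity = capacity              # burst budget in tokens (assumed example)
        self.refill_per_sec = refill_per_sec  # sustained refill rate (assumed example)
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def try_spend(self, amount: int) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= amount:
            self.tokens -= amount
            return True
        return False  # over budget: reject or queue instead of calling the model

buckets = defaultdict(TokenBucket)  # one bucket per user_id

def guarded_call(user_id: str, estimated_tokens: int, call_model):
    if not buckets[user_id].try_spend(estimated_tokens):
        raise RuntimeError(f"User {user_id} exceeded their LLM token budget")
    return call_model()

Estimate tokens from prompt length before the call, then reconcile with the provider’s reported usage afterward so the bucket tracks real spend rather than guesses.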
The One Thing Everyone Misses
Runaway costs almost always come from loops that shouldn’t exist. An agent retrying indefinitely because its retry logic has no exit condition. A feedback loop where the model evaluates its own output, which triggers another evaluation, which triggers another. A user-facing endpoint that multiplexes across three different models to “pick the best answer,” tripling tokens per request.
The fix isn’t more monitoring. It’s asking why the loop exists. Does the agent really need to retry failed API calls 20 times, or should it fail after 3 and let a human decide? Does the model really need to evaluate itself, or can you use a cheaper heuristic? Can you serve one model and accept its answer, or do you genuinely need consensus?
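For the retry case specifically, the boring fix is a hard attempt cap plus bounded backoff, then escalation. A minimal sketch, assuming a cap of 3 attempts and an escalation hook you’d wire into your own alerting (both are placeholders to tune):

import time

MAX_ATTEMPTS = 3  # every extra attempt is another full prompt's worth of tokens

def call_with_cap(call_model, escalate_to_human):
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return call_model()
        except Exception as exc:
            if attempt == MAX_ATTEMPTS:
                # Stop burning tokens: hand the failure to a person instead of looping.
                escalate_to_human(exc)
                raise
            time.sleep(2 ** attempt)  # bounded backoff: 2s, then 4s, then give up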
Most runaway costs come from defensive engineering that makes sense in isolation but creates geometric cost explosion at scale.
Bottom line: Set hard per-request token limits, add semantic caching to skip redundant calls, limit retrieval volume in RAG, and monitor cost per task (not per request) so you see expensive patterns early. The most expensive workflows are the ones designed to be safe—catch those first.
Question via Hacker News