Office Hours — How do teams prevent duplicate LLM API calls and token waste?

This is a harder problem than it looks. You can’t just cache everything and call it solved: you need to understand where duplicates come from, decide what’s actually worth caching, and build observability so you catch waste before it gets billed.

Where the duplicates actually happen

Duplicate calls fall into a few patterns. First, there’s the obvious one: the same user or request hits your system twice because of a retry, a race condition, or a stale UI state triggering two parallel calls. Second is the “different wrapper, same question” problem: one user asks “summarize this document,” another asks “what’s the main point of this document,” and both hit the LLM independently even though they’re essentially the same request. Exact-match fingerprinting won’t catch this one, because the two prompts hash differently; it needs prompt normalization or some form of semantic matching. Third is architectural: you have multiple services or jobs independently calling the same model on the same input because nobody coordinated. Fourth is the insidious one: you’re caching aggressively but your cache key is too strict (it includes timestamps, user IDs, or request metadata that varies), so you miss hits that should have landed.

Request deduplication patterns

The simplest approach is deterministic request fingerprinting. Hash the actual content that matters—the system prompt, the user query, the relevant context—into a cache key. Ignore request metadata like timestamps, user IDs, or session tokens unless they genuinely affect the LLM’s output.

import hashlib
import json

def make_cache_key(system_prompt, user_query, context_docs=None):
    """
    Fingerprint only the semantically meaningful parts of the request.
    Ignores timestamps, request IDs, user context that doesn't affect the LLM output.
    """
    key_data = {
        "system_prompt": system_prompt,
        "query": user_query,
        "context": sorted(context_docs) if context_docs else None
    }
    key_str = json.dumps(key_data, sort_keys=True)
    return hashlib.sha256(key_str.encode()).hexdigest()

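# Before calling the LLM, check the cache.
# Assumes redis_cache is a redis-py client and call_llm is your own API wrapper.
def get_or_call(system_prompt, user_query, context_docs=None):
    cache_key = make_cache_key(system_prompt, user_query, context_docs)
    cached = redis_cache.get(cache_key)
    if cached is not None:
        return cached  # Hit: no API call
    return call_llm(system_prompt, user_query, context_docs)  # Miss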

The catch: you need to decide what context matters. If two requests have the same question but different retrieved documents from your RAG pipeline, should they hit the same cache entry? Usually no—the LLM will give different answers. But if they have the same question and the same documents (just retrieved in different order or from different search branches), probably yes. Be explicit about what goes into the key.
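To make the “same documents, different order” case concrete, here is a minimal sketch. It assumes each retrieved document is available as a plain string, and the helper name `context_fingerprint` is hypothetical rather than taken from any library. Each document is hashed on its own and the digests are de-duplicated and sorted, so retrieval order and repeated hits don’t change the key, while any change in document content does.

```python
import hashlib

def context_fingerprint(docs):
    """Order-insensitive fingerprint of a set of retrieved documents.

    Hypothetical helper: assumes each document is a plain string.
    """
    # Hash each document separately, then de-duplicate and sort the digests
    # so retrieval order and repeated hits do not affect the result.
    digests = sorted({hashlib.sha256(d.encode("utf-8")).hexdigest() for d in docs})
    return hashlib.sha256("".join(digests).encode("ascii")).hexdigest()

# Same documents, different order and one repeat: should share a key.
same_a = context_fingerprint(["doc one", "doc two"])
same_b = context_fingerprint(["doc two", "doc one", "doc two"])

# Different retrieved content: should get its own key.
different = context_fingerprint(["doc one", "doc three"])
```

If document order changes what the model says in your setup (for example, when the prompt lists documents by relevance), drop the sort and treat order as part of the key.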

Request coalescing for concurrent calls

If the same request comes in while you’re already waiting for a response, don’t spawn a second API call. Coalesce the callers and share the result.

import asyncio

pending_requests = {}  # cache_key -> Future for the in-flight call

async def call_llm_with_coalesce(cache_key, llm_call):
    """
    If this request is already in flight, await its result.
    If it's new, issue the call and let others coalesce onto it.
    """
    if cache_key in pending_requests:
        # Another request for this key is already in flight. Wait for it.
        # shield() keeps a cancelled waiter from cancelling the shared future.
        return await asyncio.shield(pending_requests[cache_key])

    # This is the first request. Create a future and let others attach to it.
    future = asyncio.get_running_loop().create_future()
    pending_requests[cache_key] = future

    try:
        result = await llm_call()  # Actually call the LLM
        future.set_result(result)
        return result
    except asyncio.CancelledError:
        future.cancel()  # Wake waiters instead of leaving them hanging
        raise
    except Exception as e:
        future.set_exception(e)
        future.exception()  # Mark as retrieved so asyncio doesn't warn if nobody waited
        raise
    finally:
        del pending_requests[cache_key]

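This is critical during request spikes. If your UI sends 10 identical queries in parallel (a user mashed refresh, or a bulk job started), you’ll issue 10 API calls by default. Coalescing cuts that to 1. Keep in mind that an in-memory map like this only coalesces callers inside a single process; duplicates that land on different workers or hosts still have to be caught by the shared cache or some cross-process coordination.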
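To see the effect in isolation, here is a small self-contained demo. It is a sketch, not a production implementation: it uses an `asyncio.Task` per key instead of a bare future, and `fake_llm_call` with its 10 ms sleep is a stand-in for a real API call. All names are hypothetical.

```python
import asyncio

async def demo_coalescing(n_callers=10):
    """Fire n identical requests at once and count upstream calls."""
    in_flight = {}       # cache_key -> asyncio.Task for the call in progress
    upstream_calls = 0

    async def fake_llm_call():
        # Stand-in for a real API call (assumption: ~10 ms latency).
        nonlocal upstream_calls
        upstream_calls += 1
        await asyncio.sleep(0.01)
        return "summary text"

    async def coalesced(cache_key):
        task = in_flight.get(cache_key)
        if task is None:
            # First caller: start the call and publish the task.
            task = asyncio.create_task(fake_llm_call())
            in_flight[cache_key] = task
            # Drop the entry once the call finishes, success or failure.
            task.add_done_callback(lambda _t: in_flight.pop(cache_key, None))
        # shield() so one cancelled caller does not cancel the shared call.
        return await asyncio.shield(task)

    results = await asyncio.gather(*(coalesced("same-key") for _ in range(n_callers)))
    return results, upstream_calls

results, upstream_calls = asyncio.run(demo_coalescing())
print(len(results), upstream_calls)
```

Every caller gets the same response, but only one upstream call is made for the whole burst.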

Cache invalidation and staleness

The hard part isn’t storing results, it’s knowing when they’re stale. For summarization or classification on static documents, caching indefinitely is probably fine. For time-sensitive queries (news analysis, real-time data), you need TTLs. For personalized outputs, don’t cache at all—or cache only the intermediate retrieval, not the final response.

Set explicit TTLs based on your use case:

CACHE_TTL = {
    "document_summary": 86400 * 7,      # 7 days—documents don't change often
    "code_review": 3600,                # 1 hour—code evolves quickly
    "financial_analysis": 300,          # 5 minutes—data freshness matters
    "user_personalized": 0,             # Never cache across users
}

cache_key = make_cache_key(...)
ttl = CACHE_TTL.get(task_type, 3600)  # Default 1 hour
if ttl > 0:  # Redis rejects a zero expiry, and 0 here means "don't cache"
    redis_cache.setex(cache_key, ttl, result)
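A TTL of 0 in the table above is meant as “never cache,” so the write path has to treat it as a skip rather than as an expiry time. The sketch below shows that policy with a tiny in-memory cache so it runs without Redis. `TTLCache` and its injectable clock are hypothetical names for illustration, not part of any Redis client.

```python
import time

class TTLCache:
    """Tiny in-memory stand-in for a TTL cache (hypothetical, not a Redis client)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so expiry can be tested without sleeping
        self._store = {}         # key -> (expires_at, value)

    def set(self, key, value, ttl_seconds):
        if ttl_seconds <= 0:
            return False         # TTL of 0 means "do not cache"
        self._store[key] = (self._clock() + ttl_seconds, value)
        return True

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # lazily evict expired entries
            return None
        return value

# Usage with a fake clock so the example does not need to sleep.
now = [0.0]
cache = TTLCache(clock=lambda: now[0])
cache.set("fin:abc", "analysis", ttl_seconds=300)
fresh = cache.get("fin:abc")      # still inside the 5-minute window
now[0] = 301.0
stale = cache.get("fin:abc")      # expired, treated as a miss
skipped = cache.set("user:xyz", "personal", ttl_seconds=0)
```

Injecting the clock is a design choice for testability; with Redis the server handles expiry for you, and only the “skip when TTL is 0” check needs to live in your code.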

Observability: catching waste you don’t see

Set up counters for cache hits, misses, and coalesced calls. Without them, duplicate spend only shows up on the invoice.

metrics = {
    "llm_calls_made": 0,
    "cache_hits": 0,
    "cache_misses": 0,
    "duplicate_requests": 0,  # Calls coalesced
    "tokens_saved": 0,
}

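# Track it
if found_in_cache:
    metrics["cache_hits"] += 1
    metrics["tokens_saved"] += estimated_tokens
elif coalesced_to_existing_request:
    metrics["duplicate_requests"] += 1
    metrics["tokens_saved"] += estimated_tokens  # Coalesced calls are saved too
else:
    metrics["cache_misses"] += 1
    metrics["llm_calls_made"] += 1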

# Emit to your observability stack (Datadog, Prometheus, etc.)
log_metrics(metrics)

Log the cache key, the request timestamp, and whether it was a hit or miss. Over time, you’ll see patterns: “Every morning at 9am we process the same batch file twice.” That’s actionable.
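One way to get both the counters and the per-request log line from a single call is sketched below. The outcome labels (`"hit"`, `"coalesced"`, `"miss"`), the field names, and the `record_request` helper are assumptions made for this example; adapt them to whatever your logging and metrics stack expects.

```python
import json
import time
from collections import Counter

counters = Counter()

def record_request(cache_key, outcome, tokens=0, now=None):
    """Count one request and return a JSON log line for it.

    outcome is one of "hit", "coalesced", "miss" (labels assumed for this sketch).
    """
    if outcome not in ("hit", "coalesced", "miss"):
        raise ValueError(f"unknown outcome: {outcome}")
    counters[outcome] += 1
    if outcome != "miss":
        counters["tokens_saved"] += tokens   # tokens we did not pay for
    return json.dumps({
        "ts": time.time() if now is None else now,
        "cache_key": cache_key,
        "outcome": outcome,
        "tokens": tokens,
    }, sort_keys=True)

def hit_rate():
    """Share of requests served without a new API call."""
    total = counters["hit"] + counters["coalesced"] + counters["miss"]
    return (counters["hit"] + counters["coalesced"]) / total if total else 0.0

line = record_request("abc123", "miss", tokens=1000, now=0)
record_request("abc123", "hit", tokens=1000, now=1)
record_request("abc123", "coalesced", tokens=1000, now=2)
record_request("def456", "miss", tokens=800, now=3)
```

Grouping the emitted lines by `cache_key` and time of day is what surfaces patterns like the twice-processed morning batch.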

Cost calculation example

Say you process a monthly batch of 10,000 documents through an LLM at ~1000 tokens per document.

Without caching or deduplication:

10,000 API calls × 1000 tokens = 10M tokens
Cost at an assumed $50/M tokens (check your provider’s actual pricing): 10M × $50/M = $500/month

With fingerprint caching (80% hit rate):

2,000 new calls × 1000 tokens = 2M tokens
Cost: ~$100/month
Savings: $400/month

With coalescing during concurrent spikes:

Removes another 10-20% of the remaining calls, depending on your concurrency pattern
Additional savings: $10-20/month (10-20% of the remaining $100)

For large teams running agents or batch jobs, this compounds.
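If you want to plug in your own numbers, the arithmetic above fits in a few lines. This is a back-of-the-envelope sketch: `monthly_cost` is a hypothetical helper, it prices every token at one flat rate, and the $50 per million tokens is the illustrative figure from the example, not a quoted price.

```python
def monthly_cost(docs, tokens_per_doc, usd_per_million_tokens, hit_rate=0.0):
    """Estimated monthly spend when a fraction `hit_rate` of calls is served from cache."""
    billed_tokens = docs * tokens_per_doc * (1.0 - hit_rate)
    return billed_tokens / 1_000_000 * usd_per_million_tokens

baseline = monthly_cost(10_000, 1_000, 50.0)                # no caching
cached = monthly_cost(10_000, 1_000, 50.0, hit_rate=0.8)    # 80% hit rate
savings = baseline - cached

print(f"baseline ${baseline:.0f}, cached ${cached:.0f}, savings ${savings:.0f}")
```

With the figures from the example this comes out to roughly $500, $100, and $400 per month. The hit rate is the input that moves the result most, so measure it rather than assuming it.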

The trap: caching confidence vs. correctness

One subtle failure mode: you cache aggressively to save tokens, but the LLM’s output depends on context you didn’t put in the key. Example: you cache “here are the top 3 risks in this document,” and the next request about the same document, from a different user or in a different context, gets that same answer even though it should have been different. Be conservative about what you cache until you’ve validated it’s actually safe.
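One conservative option is to make whatever the answer depends on an explicit part of the key, so that context-dependent requests can’t collide. The sketch below extends the fingerprinting idea with a `scope` argument; both `make_scoped_key` and the shape of `scope` are hypothetical and shown only to illustrate the approach.

```python
import hashlib
import json

def make_scoped_key(system_prompt, user_query, context_docs=None, scope=None):
    """Cache key that optionally includes whatever the answer depends on.

    `scope` is a hypothetical dict such as {"role": "legal"} or {"user_id": "u1"};
    leave it as None when the output is the same for everyone.
    """
    key_data = {
        "system_prompt": system_prompt,
        "query": user_query,
        "context": sorted(context_docs) if context_docs else None,
        "scope": scope,
    }
    key_str = json.dumps(key_data, sort_keys=True)
    return hashlib.sha256(key_str.encode("utf-8")).hexdigest()

# Same question and document, but the answer is tailored per role.
shared = make_scoped_key("You are a risk analyst.", "Top 3 risks?", ["doc text"])
legal = make_scoped_key("You are a risk analyst.", "Top 3 risks?", ["doc text"],
                        scope={"role": "legal"})
finance = make_scoped_key("You are a risk analyst.", "Top 3 risks?", ["doc text"],
                          scope={"role": "finance"})
```

The trade-off is a lower hit rate: every field added to `scope` splits the cache further, so add only what you have confirmed changes the output.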

Bottom line: Start with request fingerprinting and coalescing to catch the low-hanging fruit (identical concurrent requests, repeated batch jobs). Add explicit cache keys, containing only what affects the output, and TTLs based on your actual use case. Then instrument everything so you can see what you’re actually saving. As a rough rule of thumb, teams that haven’t measured yet often find 20-40% of their token budget going to duplicates.

Question via Hacker News