Office Hours: How do you prevent retry cascades in LLM systems?

A daily developer question about AI/LLMs, answered with a direct, opinionated take.

2026-05-15T12:00:00.000Z

Tags: office-hours, q-and-a, practical-ai

Daily One question from the trenches, one opinionated answer.

How do you prevent retry cascades in LLM systems?

Retry cascades are the sneaky way your LLM system turns a single timeout into an infrastructure fire. One request fails, gets retried, hits the same bottleneck, queues up behind itself, and suddenly your token budget is burning in a loop while users see 10-second latencies. The problem is especially sharp with agentic systems, where a failed step in a multi-turn workflow can trigger exponential retries across dependent calls.

The Core Problem

Your retry logic doesn’t know the difference between “this will work if I try again in 100ms” and “this service is actually melting and will be melting for the next 5 minutes.” Standard exponential backoff treats both as “keep pounding.” With LLM APIs, you’re also competing for shared quota with other tenants, so your retries are literally fighting other customers’ retries for capacity.

Agentic systems amplify this. An agent makes call A, which fails. The agent framework catches it and retries. Meanwhile, call B downstream was waiting for A’s result, so it also times out and retries. If the client library underneath retries on its own too, as many SDKs do by default, you now have three retry loops stacked on one failure, each backing off independently and creating a thundering herd on recovery.
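A rough worst-case count shows why stacked retry loops hurt. The sketch below assumes every layer retries independently and every attempt fails; `worst_case_calls` is a hypothetical helper for the arithmetic, not part of any framework.

```python
def worst_case_calls(retries_per_layer):
    """Worst-case upstream calls when each layer retries independently.

    Each layer makes (1 + retries) attempts, and every attempt re-runs
    all the layers beneath it, so the attempt counts multiply.
    """
    total = 1
    for retries in retries_per_layer:
        total *= 1 + retries
    return total


# Three stacked retry loops, each allowing 3 retries, with every attempt failing.
print(worst_case_calls([3, 3, 3]))  # 64
# A single loop with 2 retries.
print(worst_case_calls([2]))        # 3
```

One failed call turning into 64 upstream requests is the cascade in miniature, and that is before any other user's traffic joins in.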

Pattern 1: Circuit Breaker on Token Quota, Not Just Endpoints

Standard circuit breakers flip when an endpoint is down. That’s useful, but it doesn’t catch quota exhaustion. You need to measure token consumption velocity and break the circuit before you hit the hard limit.

Track rolling window usage (e.g., tokens consumed per minute over the last 5 minutes). When you approach 80% of your per-minute quota, stop accepting new requests for 30 seconds instead of letting them queue and retry. This sounds aggressive, but it’s less painful than burning through your daily limit on failed retries.

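from collections import deque
import time

class TokenBudgetCircuitBreaker:
    """Trips when token usage in a rolling window nears the per-minute quota."""

    def __init__(self, max_tokens_per_minute, window_seconds=60):
        self.max_tpm = max_tokens_per_minute
        self.window = window_seconds
        self.token_history = deque()  # (timestamp, token_count) pairs
        self.broken = False
        self.broken_until = 0

    def record_tokens(self, count):
        now = time.time()
        self.token_history.append((now, count))

        # Expire entries that have fallen out of the rolling window
        cutoff = now - self.window
        while self.token_history and self.token_history[0][0] < cutoff:
            self.token_history.popleft()

        # Normalize the window total to tokens per minute. Dividing by the
        # full window (not the time since the oldest entry) avoids a
        # divide-by-zero on the first call and keeps the units consistent
        # with max_tpm.
        total_recent = sum(tokens for _, tokens in self.token_history)
        current_tpm = total_recent * 60 / self.window

        # Trip at 80% of quota and stay open for 30 seconds
        if current_tpm > self.max_tpm * 0.8:
            self.broken = True
            self.broken_until = now + 30

    def can_call(self):
        if time.time() < self.broken_until:
            return False
        self.broken = False
        return True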

This prevents retries from accelerating token burn. Once the circuit trips, new requests fail fast instead of queuing.
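Here is one way the breaker might sit in front of a call. This is a minimal sketch: `guarded_call`, `BudgetExceeded`, the `FixedBreaker` stand-in, and the `total_tokens` response field are all assumptions for illustration. The only real requirement is an object exposing `can_call()` and `record_tokens()`.

```python
class BudgetExceeded(Exception):
    """Raised when the breaker is open and the call is rejected up front."""


def guarded_call(breaker, call, count_tokens):
    """Run `call` only if the breaker allows it, then record token usage.

    `breaker` is any object with can_call() and record_tokens(n).
    `count_tokens` extracts the token count from the response (assumed shape).
    """
    if not breaker.can_call():
        raise BudgetExceeded("token budget circuit is open; failing fast")
    response = call()
    breaker.record_tokens(count_tokens(response))
    return response


class FixedBreaker:
    """Stand-in breaker for the demo: opens once a fixed token total is used."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def can_call(self):
        return self.used < self.limit

    def record_tokens(self, count):
        self.used += count


def fake_llm():
    # Hypothetical response shape; real clients report usage differently.
    return {"text": "ok", "total_tokens": 60}


breaker = FixedBreaker(limit=100)
results = []
for _ in range(3):
    try:
        response = guarded_call(breaker, fake_llm, lambda r: r["total_tokens"])
        results.append(response["text"])
    except BudgetExceeded:
        results.append("rejected")
print(results)  # ['ok', 'ok', 'rejected']
```

The third request never reaches the API. It fails in microseconds instead of joining a queue, which is the whole point.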

Pattern 2: Distinguish Retryable From Non-Retryable Failures

Not every error deserves a retry. An invalid API key, a malformed request, or most other 4xx errors will fail the same way on retry. A 429 (rate limit or quota), a 5xx (server error), or a timeout might work next time.

Implement a simple classifier:

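def is_retryable(error):
    """Only retry on server errors, timeouts, and rate limits."""
    # The built-in TimeoutError and ConnectionError cover socket-level
    # failures; swap in your HTTP client's own exception types as needed.
    if isinstance(error, (TimeoutError, ConnectionError)):
        return True
    # Errors without a status_code attribute are treated as non-retryable
    status = getattr(error, "status_code", None)
    if status is not None:
        return status >= 500 or status == 429
    return False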

Then skip retries for every 4xx error other than 429. This saves token budget and latency.
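The split is easier to see as a table of decisions. The helper below is an illustrative sketch only; `retry_decision` is a made-up name, and the labels are just the categories described above.

```python
def retry_decision(status_code):
    """Map an HTTP status code to a retry decision (illustrative only)."""
    if status_code == 429:
        return "retry after backing off"
    if 500 <= status_code < 600:
        return "retry"
    if 400 <= status_code < 500:
        return "fail fast"
    return "no retry needed"


for code in (400, 401, 404, 429, 500, 503):
    print(code, "->", retry_decision(code))
```

Note that 429 is checked first: it is a 4xx code, but it is the one 4xx that usually clears up on its own.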

Pattern 3: Use Jitter + Bounded Backoff, Not Unbounded Exponential

Unbounded exponential backoff (2^n) grows fast. With a 1-second base, the delay reaches 32 seconds at n = 5, and the waits before it already add up to 31 seconds. For agentic systems calling multiple endpoints, this compounds.

Instead, cap the exponential growth and add jitter:

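import random

def backoff_delay(attempt, max_delay=10):
    """Exponential backoff capped at max_delay, with jitter."""
    # 2**attempt seconds, never more than max_delay
    base = min(2 ** attempt, max_delay)
    # Up to 10% extra so clients do not retry in lockstep; the returned
    # delay can therefore exceed max_delay by at most 10%.
    jitter = random.uniform(0, base * 0.1)
    return base + jitter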

This prevents thundering herds (jitter spreads retry times) while capping how long you wait. After hitting the max, additional retries use the same delay instead of growing.
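To make the cap concrete, here is the deterministic part of the schedule with and without it. This is a quick sketch that ignores jitter; `capped_base` is a hypothetical helper mirroring the `min(2 ** attempt, max_delay)` step.

```python
def capped_base(attempt, max_delay=10):
    """Deterministic part of the delay: 2**attempt seconds, capped."""
    return min(2 ** attempt, max_delay)


attempts = range(7)
uncapped = [2 ** a for a in attempts]
capped = [capped_base(a) for a in attempts]

print(uncapped)       # [1, 2, 4, 8, 16, 32, 64]
print(capped)         # [1, 2, 4, 8, 10, 10, 10]
print(sum(uncapped))  # 127 seconds of waiting in total
print(sum(capped))    # 45 seconds of waiting in total
```

Seven attempts cost over two minutes of waiting uncapped and 45 seconds capped. In practice you would stop well before seven, which is where the retry budget comes in.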

Pattern 4: Retry Budget Per Request, Not Global

Give each request a fixed “retry budget” (e.g., max 2 retries), and once it’s spent, fail fast. This prevents a single slow request from retrying indefinitely while blocking others.

In agentic systems, propagate this budget downstream. If an agent’s initial call gets 1 retry before failing, its dependent calls should get 0 retries. This forces the agent to handle partial failures gracefully instead of cascading retries through the whole workflow.

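import asyncio

async def call_with_budget(client, prompt, retries_left=2):
    """Call the model with at most `retries_left` retries, then fail fast."""
    # Relies on is_retryable() and backoff_delay() from Patterns 2 and 3;
    # client.complete() stands in for your LLM client's call.
    for attempt in range(retries_left + 1):
        try:
            return await client.complete(prompt)
        except Exception as e:
            # Re-raise at once on non-retryable errors or a spent budget
            if not is_retryable(e) or attempt >= retries_left:
                raise
            await asyncio.sleep(backoff_delay(attempt))
    # Only reached when retries_left is negative, so the loop never ran
    raise ValueError("retries_left must be >= 0")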
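Propagating the budget downstream can be as simple as passing one shared object through the workflow. The sketch below is one possible shape, not a framework API: `RetryBudget`, `call_shared`, and the two fake steps are hypothetical, and backoff sleeps are left out to keep it short.

```python
import asyncio


class RetryBudget:
    """Retry allowance shared by every call in one workflow (hypothetical)."""

    def __init__(self, retries):
        self.remaining = retries

    def try_spend(self):
        """Consume one retry if any are left; return whether it was allowed."""
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True


async def call_shared(fn, budget):
    """Call `fn`, retrying only while the shared budget has retries left."""
    while True:
        try:
            return await fn()
        except ConnectionError:
            if not budget.try_spend():
                raise


async def demo():
    calls = {"a": 0, "b": 0}

    async def step_a():
        calls["a"] += 1
        if calls["a"] < 2:  # fails once, then succeeds
            raise ConnectionError("transient")
        return "a-ok"

    async def step_b():
        calls["b"] += 1
        raise ConnectionError("still down")  # always fails

    budget = RetryBudget(retries=1)
    first = await call_shared(step_a, budget)  # spends the only retry
    try:
        await call_shared(step_b, budget)      # nothing left: a single attempt
        second = "b-ok"
    except ConnectionError:
        second = "b-failed-fast"
    return first, second, calls


print(asyncio.run(demo()))  # ('a-ok', 'b-failed-fast', {'a': 2, 'b': 1})
```

Step A used the workflow's one retry, so step B gets exactly one attempt and surfaces its failure to the agent instead of starting a retry loop of its own.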

Pattern 5: Observe Retry Rates as a Health Signal

If your retry rate jumps from 2% to 15%, that’s a signal to stop accepting new requests before the cascade spreads. Wire retry metrics to your circuit breaker.

Track separately: retries due to timeouts vs. rate limits vs. server errors. If timeouts spike, the upstream service is slow; back off immediately. If rate limits spike, you’re hitting quota; fail fast.

def should_reject_new_requests(retry_metrics):
    """Reject new work if retry rate is unhealthy.

    Both fields are fractions of recent requests (0.0 to 1.0), not counts.
    """
    if retry_metrics.rate_limit_retries > 0.1:  # > 10%
        return True  # Quota pressure; stop accepting
    if retry_metrics.timeout_retries > 0.2:  # > 20%
        return True  # Upstream is struggling
    return False
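Those fractions have to come from somewhere. One simple option, sketched here under assumptions, is a fixed-size window of recent request outcomes. `RetryTracker`, `RetryMetrics`, and the outcome labels are hypothetical names; only the two field names follow the function above.

```python
from collections import Counter, deque
from dataclasses import dataclass


@dataclass
class RetryMetrics:
    rate_limit_retries: float  # fraction of recent requests retried after a 429
    timeout_retries: float     # fraction of recent requests retried after a timeout


class RetryTracker:
    """Keeps the outcome of the last `size` requests (hypothetical helper)."""

    def __init__(self, size=100):
        self.outcomes = deque(maxlen=size)  # oldest entries drop off automatically

    def record(self, outcome):
        """`outcome` is one of "ok", "rate_limit", "timeout", "server_error"."""
        self.outcomes.append(outcome)

    def snapshot(self):
        total = len(self.outcomes)
        if total == 0:
            return RetryMetrics(0.0, 0.0)
        counts = Counter(self.outcomes)
        return RetryMetrics(
            rate_limit_retries=counts["rate_limit"] / total,
            timeout_retries=counts["timeout"] / total,
        )


tracker = RetryTracker(size=20)
for outcome in ["ok"] * 17 + ["rate_limit"] * 3:
    tracker.record(outcome)
metrics = tracker.snapshot()
print(metrics.rate_limit_retries)  # 0.15
print(metrics.timeout_retries)     # 0.0
```

With 3 of the last 20 requests rate limited, the 10% threshold is crossed and new work would be rejected. A count-based window is crude; a time-based one is a reasonable next step if your traffic is bursty.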

Pattern 6: Timeouts Shorter Than Your SLA

Set your request timeout shorter than your end-to-end SLA. If your SLA is 10 seconds, set timeout to 5 seconds so you have 5 seconds for retries before the user sees a failure.

This prevents cascades by failing fast instead of letting requests hang. When requests time out quickly, your circuit breaker trips sooner, and you stop queuing work.
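It is worth checking how many attempts actually fit inside the SLA once backoff is included. The helper below is a back-of-the-envelope sketch, assuming the worst case where every attempt runs to its full timeout and ignoring jitter; `attempts_within_sla` is a hypothetical name.

```python
def attempts_within_sla(sla_seconds, timeout_seconds, backoff_delays):
    """Count attempts that fit in the SLA if every attempt hits its timeout.

    `backoff_delays[i]` is the wait before retry i + 1 (jitter ignored).
    """
    elapsed = 0.0
    attempts = 0
    waits = [0.0] + list(backoff_delays)  # no wait before the first attempt
    for wait in waits:
        if elapsed + wait + timeout_seconds > sla_seconds:
            break
        elapsed += wait + timeout_seconds
        attempts += 1
    return attempts


print(attempts_within_sla(10, 5, [0, 0]))  # 2: the call plus one immediate retry
print(attempts_within_sla(10, 5, [1, 2]))  # 1: a 1 s backoff pushes the retry to 11 s
print(attempts_within_sla(10, 3, [1, 2]))  # 2: 3 + 1 + 3 = 7 s; a third would end at 12 s
```

The middle case is the trap: a 5-second timeout on a 10-second SLA only leaves room for a retry if you skip the backoff. If you want both backoff and a retry, the timeout has to come down further.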

Real-World Impact

A fintech team deploying an agentic workflow for trade settlement hit a 429 rate limit cascade. Their original retry logic: unlimited exponential backoff per request, with agents retrying dependent calls. Result: a single timeout cascaded into 500+ retries across 30 seconds, burning a day’s quota in minutes.

After adding token budget tracking (Pattern 1), jitter + bounded backoff (Pattern 3), and retry budgets per request (Pattern 4), the same scenario tripped the circuit breaker, failed gracefully, and burned 3% of quota instead of 100%.

Bottom line: Treat retry cascades as a quota exhaustion problem, not a resilience problem. Use circuit breakers on token consumption velocity, classify errors before retrying, and cap retry budgets per request to prevent exponential amplification in agentic workflows.

Question via Hacker News