Cost Optimization for LLM Applications | 2026-05-26 | Deep Dives | Tags: deep-dive, reference, architecture

Cost Optimization for LLM Applications

The post you bookmark. One topic, covered end to end.

Every major LLM provider handles cost optimization differently. This guide covers prompt caching, batching, model routing, token budgeting, and the math behind choosing the right model.

Running LLM-powered features in production gets expensive fast. A single chatbot handling 10,000 conversations per day can cost anywhere from $15 to $15,000 depending on model choice, prompt design, and architectural decisions. The difference between a well-optimized LLM application and a naively built one is often 10–50x in token spend — without any measurable quality loss.

Most cost overruns stem from three mistakes: using a frontier model for tasks a smaller model handles equally well, sending redundant tokens on every request, and failing to cache or batch when the workload allows it. This guide covers the specific techniques, the math behind them, and the decision frameworks for knowing which optimizations matter for a given workload.

The Token Cost Model

LLM pricing has two axes: input tokens and output tokens. Output tokens are typically 2–5x more expensive than input tokens because each output token requires a full forward pass through the model, while input tokens can be processed in parallel during the prefill phase.

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|
| GPT-5.5 | ~$10–15 | ~$30–60 | 200K |
| GPT-4.1 Nano | ~$0.10 | ~$0.40 | 128K |
| Claude Opus 4.7 | ~$15 | ~$75 | 200K |
| Claude Haiku 4.5 | ~$0.80 | ~$4.00 | 200K |
| Gemini 3.1 Flash-Lite | ~$0.075 | ~$0.30 | 1M |
| Command A+ (self-hosted) | Hardware cost only | Hardware cost only | 128K |

Pricing is approximate and changes frequently. Check provider pricing pages for current figures.

The gap between the cheapest and most expensive hosted options is more than two orders of magnitude: at the rates above, a task that costs $1.00 with Claude Opus 4.7 costs roughly half a cent with Gemini 3.1 Flash-Lite. The core question is always: does the cheaper model produce acceptable output for this specific task?

Figure: Cost accrues at two points: input processing and output generation. Output tokens dominate cost for long-form generation.

A useful mental model: think in “cost per task” rather than “cost per token.” If a summarization task uses 2,000 input tokens and generates 200 output tokens, the cost per summary on Claude Opus 4.7 is roughly $0.03 input + $0.015 output = $0.045 per summary. At 100,000 summaries/day, that’s $4,500/day. On Claude Haiku 4.5, the same task costs roughly $0.0024 per summary, or $240/day. Whether the Opus quality justifies the roughly 19x premium depends entirely on downstream impact.
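The per-task arithmetic is easy to wrap in a small helper. The sketch below is only an illustration: it assumes the approximate rates from the table above, `PRICES_PER_1M` and `cost_per_task` are hypothetical names, and the 3,000-input / 500-output workload is an invented example rather than a measurement.

```python
# Approximate USD rates per 1M tokens, copied from the pricing table above.
# They are illustrative and will drift; check provider pricing pages.
PRICES_PER_1M = {
    "claude-opus-4.7": {"input": 15.00, "output": 75.00},
    "claude-haiku-4.5": {"input": 0.80, "output": 4.00},
    "gpt-4.1-nano": {"input": 0.10, "output": 0.40},
}


def cost_per_task(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single task on the given model."""
    rates = PRICES_PER_1M[model]
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 3,000 input tokens and 500 output tokens per task.
    for model in PRICES_PER_1M:
        per_task = cost_per_task(model, 3_000, 500)
        print(f"{model}: ${per_task:.6f} per task, ${per_task * 50_000:,.2f} per 50,000 tasks")
```

Keeping the rates in one table makes it cheap to re-run the comparison whenever a provider changes pricing.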

Model Selection and Routing

The single highest-leverage cost optimization is routing requests to the cheapest model that meets quality requirements. This sounds obvious but requires systematic evaluation rather than gut feel.

The Routing Decision Framework

For any task, classify it along two axes:

  1. Difficulty: Does the task require multi-step reasoning, nuanced judgment, or expert-level knowledge? Or is it pattern matching, extraction, or reformatting?
  2. Risk tolerance: What’s the cost of a wrong answer? A customer-facing medical summary has different stakes than an internal log classifier.

Figure: Route by task characteristics, not by a blanket model choice. Most applications have a mix of easy and hard tasks.

Implementing a Model Router

A practical router evaluates incoming requests and dispatches to the appropriate model. The simplest version uses task type as the routing key:

MODEL_ROUTES = {
    "classification": "gpt-4.1-nano",
    "extraction": "gemini-3.1-flash-lite",
    "summarization": "claude-haiku-4.5",
    "analysis": "claude-sonnet-4.6",
    "complex_reasoning": "claude-opus-4.7",
}

def route_request(task_type: str, complexity_score: float = 0.0) -> str:
    """Route to cheapest adequate model. Escalate if complexity is high."""
    base_model = MODEL_ROUTES.get(task_type, "claude-haiku-4.5")
    if complexity_score > 0.8:
        return escalate(base_model)
    return base_model

def escalate(model: str) -> str:
    escalation = {
        "gpt-4.1-nano": "claude-haiku-4.5",
        "gemini-3.1-flash-lite": "claude-haiku-4.5",
        "claude-haiku-4.5": "claude-sonnet-4.6",
        "claude-sonnet-4.6": "claude-opus-4.7",
    }
    return escalation.get(model, model)

More sophisticated routers use a lightweight classifier (often itself a small LLM or a fine-tuned BERT model) to estimate task difficulty from the prompt. The classifier cost is negligible — a few hundred tokens through a nano-tier model — and can save 5–20x on requests that would otherwise hit a frontier model unnecessarily.

Cascade Pattern

The cascade pattern sends a request to the cheapest model first, evaluates the response, and escalates only if quality is insufficient:

Figure: Cascade routing: try cheap first, escalate on failure. Works well when 70%+ of requests can be handled by the cheap model.
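A cascade can be sketched in a few lines if the models and the quality check are passed in as plain callables. Everything here is an assumption made for illustration: `cascade`, the stand-in models, and the non-empty check are hypothetical, and a real acceptance check would more likely be a schema validation, a confidence score, or an LLM judge.

```python
from typing import Callable


def cascade(
    prompt: str,
    cheap_model: Callable[[str], str],
    expensive_model: Callable[[str], str],
    is_acceptable: Callable[[str], bool],
) -> tuple[str, bool]:
    """Try the cheap model first; escalate only if its answer fails the check.

    Returns (answer, escalated).
    """
    answer = cheap_model(prompt)
    if is_acceptable(answer):
        return answer, False
    # The cheap call is already paid for; escalated requests cost both calls.
    return expensive_model(prompt), True


if __name__ == "__main__":
    # Stand-in models for demonstration; real code would call provider SDKs.
    cheap = lambda p: "" if "hard" in p else "cheap answer"
    expensive = lambda p: "expensive answer"
    non_empty = lambda a: bool(a.strip())

    print(cascade("easy question", cheap, expensive, non_empty))
    print(cascade("hard question", cheap, expensive, non_empty))
```

Returning the `escalated` flag alongside the answer makes it straightforward to log the escalation rate.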

The key metric is the escalation rate. If 80% of requests resolve at the cheap tier, the blended cost is:

blended_cost = 0.80 × cheap_cost + 0.20 × (cheap_cost + expensive_cost)

For a workload where cheap = $0.001/request and expensive = $0.05/request:

blended = 0.80 × $0.001 + 0.20 × ($0.001 + $0.05)
        = $0.0008 + $0.0102
        = $0.011/request

Compared to $0.05/request if everything hits the expensive model, that’s a 4.5x cost reduction. The cascade adds latency for escalated requests (they pay for two model calls), so it works best for async workloads or when the cheap model is fast enough that the double-call latency is acceptable.
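The blended-cost formula simplifies to `cheap_cost + escalation_rate × expensive_cost`, which also gives the escalation rate at which the cascade stops paying off. The helper names below are hypothetical; this is a sketch of the arithmetic above, not a pricing tool.

```python
def blended_cost(cheap_cost: float, expensive_cost: float, escalation_rate: float) -> float:
    """Expected cost per request when escalated requests pay for both calls."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    resolved_cheap = (1.0 - escalation_rate) * cheap_cost
    escalated = escalation_rate * (cheap_cost + expensive_cost)
    return resolved_cheap + escalated


def break_even_escalation_rate(cheap_cost: float, expensive_cost: float) -> float:
    """Escalation rate at which the cascade costs the same as always using the expensive model."""
    return 1.0 - cheap_cost / expensive_cost


if __name__ == "__main__":
    cost = blended_cost(cheap_cost=0.001, expensive_cost=0.05, escalation_rate=0.20)
    print(f"blended cost per request: ${cost:.4f}")
    print(f"reduction vs. expensive-only: {0.05 / cost:.1f}x")
    print(f"break-even escalation rate: {break_even_escalation_rate(0.001, 0.05):.0%}")
```

When the cheap model costs a small fraction of the expensive one, the break-even escalation rate sits close to 100%, so on cost alone the cascade nearly always wins and the practical constraint is the added latency.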

Evaluating Model Quality Per Task

Running a systematic eval is non-negotiable before committing to a routing strategy. The process:

  1. Collect 200–500 representative inputs for each task type
  2. Run all candidate models on the same inputs
  3. Score outputs using a rubric (automated with an LLM judge, or human-labeled for high-stakes tasks)
  4. Compute the quality gap between cheap and expensive models

If the cheapest model scores within 5% of the most expensive on a given task, the routing decision is straightforward. Gaps of 5–15% require judgment about whether the cost savings justify the quality loss. Gaps above 15% usually mean the cheap model is inadequate for that task.
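The three bands can be turned into a mechanical first pass over eval results. This sketch assumes the gap is measured as a relative shortfall in mean rubric score, which is one reasonable reading of "within 5%"; the function names are hypothetical.

```python
from statistics import mean


def quality_gap(cheap_scores: list[float], expensive_scores: list[float]) -> float:
    """Relative shortfall of the cheap model's mean score versus the expensive model's."""
    expensive_mean = mean(expensive_scores)
    return (expensive_mean - mean(cheap_scores)) / expensive_mean


def routing_verdict(gap: float) -> str:
    """Map a relative quality gap onto the three bands described above."""
    if gap <= 0.05:
        return "route to cheap model"
    if gap <= 0.15:
        return "judgment call"
    return "cheap model inadequate"


if __name__ == "__main__":
    # Hypothetical rubric scores on a 0-1 scale for the same eval inputs.
    cheap = [0.82, 0.90, 0.88, 0.79]
    expensive = [0.91, 0.93, 0.95, 0.90]
    gap = quality_gap(cheap, expensive)
    print(f"gap: {gap:.1%} -> {routing_verdict(gap)}")
```

A mean hides the tail, so for high-stakes tasks it is worth also inspecting the worst-scoring outputs before trusting the verdict.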

Prompt Caching

Prompt caching is the mechanism by which providers avoid re-processing the same input prefix across multiple requests. When a cached prefix is detected, the provider skips the prefill computation for those tokens and charges a reduced rate, often 50–90% less than standard input pricing.

How It Works

Most frontier model APIs support some form of prefix caching. The mechanism varies:

  • OpenAI: Automatic caching for prompts sharing a common prefix. Cached input tokens are billed at 50% of the standard input rate. The cache persists for 5–10 minutes of inactivity.
  • Anthropic: Explicit cache control via cache_control breakpoints in the message structure. Cache write costs a premium over standard input, but cache reads are roughly 10% of standard input cost. Cache TTL is 5 minutes, extended on each hit.
  • Google: Explicit context caching through their Caching API. Create a named cached content object, then reference it in subsequent requests.

Figure: Prompt caching stores the computed KV cache for shared prefixes. Subsequent requests skip expensive prefill computation.

When Caching Pays Off

The math depends on three variables: cache write cost, cache read cost, and how many requests share the same prefix.

For Anthropic-style explicit caching with a 4,000-token system prompt sent 100 times per cache window:

Without caching:  100 requests × 4,000 tokens × standard_input_rate
With caching:     1 write × 4,000 tokens × write_rate
                + 99 reads × 4,000 tokens × read_rate (≈0.1× standard)

The break-even point is typically 2–4 requests per cache window. Any prompt that’s sent more than a handful of times within the TTL benefits from caching.
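To make the break-even concrete, the comparison can be computed directly. The multipliers below are assumptions for illustration (a 1.25× write premium and a 0.10× read rate); substitute the current figures from the provider's pricing page. All function names are hypothetical.

```python
def uncached_cost(requests: int, prefix_tokens: int, input_rate: float) -> float:
    """Every request pays the standard input rate for the full prefix."""
    return requests * prefix_tokens * input_rate


def cached_cost(
    requests: int,
    prefix_tokens: int,
    input_rate: float,
    write_multiplier: float = 1.25,  # assumed cache-write premium
    read_multiplier: float = 0.10,   # assumed cache-read rate
) -> float:
    """The first request writes the cache; the remaining requests read from it."""
    if requests == 0:
        return 0.0
    write = prefix_tokens * input_rate * write_multiplier
    reads = (requests - 1) * prefix_tokens * input_rate * read_multiplier
    return write + reads


def break_even_requests(write_multiplier: float = 1.25, read_multiplier: float = 0.10) -> int:
    """Smallest request count per cache window at which caching is strictly cheaper."""
    if read_multiplier >= 1.0:
        raise ValueError("caching never pays off if reads cost as much as standard input")
    n = 1
    while write_multiplier + (n - 1) * read_multiplier >= n:
        n += 1
    return n


if __name__ == "__main__":
    rate = 15.00 / 1_000_000  # assumed $15 per 1M input tokens
    print(f"without caching: ${uncached_cost(100, 4_000, rate):.3f}")
    print(f"with caching:    ${cached_cost(100, 4_000, rate):.3f}")
    print(f"break-even at {break_even_requests()} requests per window")
```

Under these assumed multipliers caching is already cheaper on the second request; a steeper write premium or a weaker read discount pushes the break-even higher.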

Maximizing Cache Hit Rate

Structure prompts so the stable portions come first:

# GOOD: Cacheable prefix, variable suffix (Anthropic Messages API shape:
# the system prompt is a top-level parameter and cache_control sits on content blocks)
system = [
    {
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,  # 3,000 tokens, stable
        "cache_control": {"type": "ephemeral"},
    }
]
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": REFERENCE_DOCUMENTS,  # 10,000 tokens, stable per session
                "cache_control": {"type": "ephemeral"},
            },
            {"type": "text", "text": user_query},  # Variable per request
        ],
    }
]

# BAD: Variable content early breaks cache
system = [
    {
        "type": "text",
        "text": f"Current time: {datetime.now()}\n{SYSTEM_PROMPT}",
        # Timestamp changes every request, cache never hits
        "cache_control": {"type": "ephemeral"},
    }
]

Common cache-breaking patterns to avoid:

  • Timestamps or request IDs in the system prompt
  • Randomized few-shot example ordering
  • User-specific personalization tokens embedded in the prefix
  • Dynamic tool definitions that change between requests

Multi-Turn Conversation Caching

In multi-turn conversations, each message builds on the previous prefix. With proper caching, turn N only pays full input price for the new user message — all prior turns are cached. This makes long conversations dramatically cheaper than re-sending the full history on each turn.

A 20-turn conversation with 500 tokens per turn:

Without caching: Σ(n=1 to 20) of n × 500 = 105,000 input tokens billed
With caching:    20 × 500 = 10,000 new tokens billed at full rate
                 + 95,000 cached tokens at ~10% rate
Effective cost:  ~19,500 token-equivalents vs 105,000

That’s roughly a 5x reduction for a 20-turn conversation. The savings compound with conversation length.
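The same arithmetic generalizes to any conversation length. This sketch follows the simplified model above (fixed tokens per turn, cache reads at about 10% of the standard rate, no cache-write premium); the function names are hypothetical.

```python
def conversation_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens billed when the full history is re-sent on every turn."""
    return sum(n * tokens_per_turn for n in range(1, turns + 1))


def cached_token_equivalents(
    turns: int, tokens_per_turn: int, read_multiplier: float = 0.10
) -> float:
    """Input cost in full-rate token equivalents when prior turns are cache reads.

    Ignores the cache-write premium, matching the simplified arithmetic above.
    """
    total = conversation_input_tokens(turns, tokens_per_turn)
    new_tokens = turns * tokens_per_turn
    cached_tokens = total - new_tokens
    return new_tokens + cached_tokens * read_multiplier


if __name__ == "__main__":
    for turns in (5, 20, 50):
        full = conversation_input_tokens(turns, 500)
        cached = cached_token_equivalents(turns, 500)
        print(f"{turns} turns: {full:,} vs {cached:,.0f} token-equivalents ({full / cached:.1f}x)")
```

Because the uncached total grows quadratically with the number of turns while the cached total grows far more slowly, the ratio keeps improving as conversations get longer.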

Batching Strategies

Batching reduces cost in two ways: providers offer discounted rates for batch API calls (typically 50% off), and batching amortizes fixed overhead across multiple requests.

Provider Batch APIs

OpenAI’s Batch API accepts JSONL files of requests, processes them within a 24-hour window, and returns results at 50% of standard pricing. Anthropic offers a similar Message Batches API. The tradeoff is latency — batch requests aren’t real-time.

# OpenAI Batch API example
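import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
# items_to_process and SYSTEM_PROMPT are assumed to be defined elsewhere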

# Prepare batch file
requests = []
for i, item in enumerate(items_to_process):
    requests.append({
        "custom_id": f"item-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4.1-nano",
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": item["text"]}
            ],
            "max_tokens": 200
        }
    })

# Write JSONL
with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Upload and submit (the with-block closes the file handle after upload)
with open("batch_input.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")
batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

When to Batch

Batching is appropriate for:

  • Nightly processing jobs (document classification, embedding generation, content moderation)
  • Eval pipelines (running test suites against model outputs)
  • Bulk data enrichment (adding metadata to database records)
  • Report generation (end-of-day summaries, analytics)

Batching is inappropriate for:

  • User-facing chat (latency matters)
  • Real-time agents (need immediate responses)
  • Streaming responses (batch APIs return complete responses)

Figure: Separate real-time and batch paths. Anything that doesn’t need an immediate response should route through the batch API.

Self-Managed Batching

For self-hosted models (via vLLM, llama.cpp, or similar), batching is about GPU utilization. Collecting requests and sending them as a batch increases throughput by allowing the inference engine to process multiple sequences in parallel, amortizing the cost of reading model weights from GPU memory across the batch.

vLLM handles this automatically with continuous batching — requests are added to the running batch as GPU capacity becomes available. The optimization on the application side is ensuring a steady stream of requests rather than bursty traffic that leaves the GPU idle between peaks.

Token Budgeting

Token budgeting is the practice of setting explicit limits on token consumption per request, per user, per feature, and per billing period. Without budgets, a single runaway prompt or a traffic spike can blow through monthly spend in hours.

Per-Request Budgets

Always set max_tokens (or equivalent) on every API call. Some APIs require it; where it is optional, the default is typically the model’s maximum output length, which means a malformed prompt could generate a response tens of thousands of tokens long.

# Set explicit output limits per task type
TOKEN_BUDGETS = {
    "classification": {"max_tokens": 50},
    "extraction": {"max_tokens": 500},
    "summarization": {"max_tokens": 1000},
    "analysis": {"max_tokens": 2000},
    "chat_response": {"max_tokens": 4000},
}

def call_llm(task_type: str, messages: list, model: str) -> dict:
    budget = TOKEN_BUDGETS.get(task_type, {"max_tokens": 1000})
    return client.chat.completions.create(
        model=model,
        messages=messages,
        **budget
    )

Per-User and Per-Feature Budgets

Track cumulative token usage per user and per feature. Implement circuit breakers that degrade gracefully:

from datetime import datetime, timedelta
from collections import defaultdict

class TokenBudgetManager:
    def __init__(self):
        self.usage = defaultdict(lambda: {"tokens": 0, "reset_at": datetime.now() + timedelta(hours=1)})
    
    # Per-user hourly budget
    USER_HOURLY_LIMIT = 500_000  # tokens
    
    def check_budget(self, user_id: str, estimated_tokens: int) -> bool:
        entry = self.usage[user_id]
        if datetime.now() > entry["reset_at"]:
            entry["tokens"] = 0
            entry["reset_at"] = datetime.now() + timedelta(hours=1)
        return (entry["tokens"] + estimated_tokens) <= self.USER_HOURLY_LIMIT
    
    def record_usage(self, user_id: str, actual_tokens: int):
        self.usage[user_id]["tokens"] += actual_tokens

Estimating Token Counts Before Sending

Pre-flight token estimation avoids surprises. Use tiktoken (for OpenAI models) or the provider’s token counting endpoint to measure input tokens before sending:

import tiktoken

def estimate_cost(messages: list, model: str = "gpt-4.1-nano", max_output: int = 500) -> float:
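    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Older tiktoken releases may not recognize newer model names
        enc = tiktoken.get_encoding("o200k_base")
    # Counts message text only; per-message formatting overhead is excluded
    input_tokens = sum(len(enc.encode(m["content"])) for m in messages)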
    
    # Rates per token (approximate)
    INPUT_RATE = 0.10 / 1_000_000   # $0.10 per 1M input tokens
    OUTPUT_RATE = 0.40 / 1_000_000  # $0.40 per 1M output tokens
    
    estimated_cost = (input_tokens * INPUT_RATE) + (max_output * OUTPUT_RATE)
    return estimated_cost

Monthly Budget Guardrails

Set hard spending limits at the provider level (OpenAI and Anthropic both support this in their dashboards) and at the application level:

Figure: Budget checks before every LLM call. Degradation is better than a surprise invoice.
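An application-level guardrail can be as simple as a function consulted before each call. The tiers and thresholds below are illustrative assumptions, not provider defaults, and `budget_action` is a hypothetical name; the point is that spend maps to a graded response rather than a single on/off switch.

```python
def budget_action(spent_usd: float, monthly_budget_usd: float) -> str:
    """Decide how to handle the next LLM call given month-to-date spend.

    Thresholds are illustrative assumptions; tune them to the product.
    """
    if monthly_budget_usd <= 0:
        raise ValueError("monthly_budget_usd must be positive")
    used = spent_usd / monthly_budget_usd
    if used >= 1.00:
        return "block"    # hard stop: serve a fallback message instead
    if used >= 0.90:
        return "degrade"  # route to the cheapest model and tighten max_tokens
    if used >= 0.80:
        return "alert"    # proceed normally but notify the team
    return "allow"


if __name__ == "__main__":
    for spent in (100, 850, 950, 1_200):
        print(f"${spent} of $1,000 -> {budget_action(spent, 1_000)}")
```

In practice the month-to-date figure would come from the same metrics store that records per-call cost, so the guardrail and the dashboards agree.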

Prompt Engineering for Cost

Token count is directly proportional to cost. Reducing prompt length without sacrificing output quality is pure margin improvement.

System Prompt Compression

System prompts accumulate cruft over time. A prompt that started at 500 tokens often grows to 3,000 tokens through incremental additions. Periodically audit and compress:

BEFORE (847 tokens):
"You are a helpful assistant that specializes in analyzing customer 
support tickets. Your job is to read the ticket carefully and determine 
the category it belongs to. The categories are: billing, technical, 
account, shipping, and other. You should respond with just the category 
name. Please be accurate and consistent in your categorizations. 
If you're not sure, pick the closest category. Do not explain your 
reasoning unless asked..."
[continues for 600 more tokens with examples and edge cases]

AFTER (203 tokens):
"Classify support tickets into exactly one category: billing, technical, 
account, shipping, other. Output only the category name.

Examples:
'I was charged twice' → billing
'App crashes on login' → technical  
'Change my email' → account
'Package not delivered' → shipping"

The compressed version costs 75% less per request in input tokens while conveying the same behavioral constraints. Over millions of requests, this adds up.

Few-Shot Example Selection

Few-shot examples improve quality but cost tokens. The optimal number is task-dependent, but generally:

  • 0 examples: Works for simple, well-understood tasks with strong instruction-following models
  • 1–3 examples: Sufficient for most classification and extraction tasks
  • 5–10 examples: Needed for complex formatting or domain-specific patterns

Each example might cost 100–300 tokens. Five examples at 200 tokens each add 1,000 tokens per request. At $10/1M input tokens, that’s $0.01 per request: trivial for low-volume use cases, but $10,000/day at 1M requests/day.

Strategy: select examples dynamically based on input similarity rather than including a fixed set. This improves quality and can reduce example count:

def select_examples(query: str, example_bank: list, k: int = 3) -> list:
    """Select most relevant examples using embedding similarity."""
    query_embedding = embed(query)
    scored = [
        (example, cosine_similarity(query_embedding, example["embedding"]))
        for example in example_bank
    ]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [ex for ex, _ in scored[:k]]

Output Format Constraints

Requesting structured output (JSON, specific formats) can reduce output tokens compared to free-form prose:

# Verbose output (~40 tokens):
"Based on my analysis, this ticket appears to be related to a billing 
issue. The customer mentions being charged twice for their subscription, 
which falls under the billing category. Confidence: high."

# Constrained output (~12 tokens):
{"category": "billing", "confidence": 0.95}

That’s roughly a 3x reduction in output tokens, which at the higher output token rates translates directly to cost savings.

Semantic Caching

Semantic caching stores LLM responses and returns cached results for semantically similar (not just identical) future queries. This avoids paying for the same answer twice, even when the exact phrasing differs.

How It Works

  1. Embed the incoming query
  2. Search the cache for embeddings within a similarity threshold
  3. If a match is found, return the cached response (cost: one embedding call)
  4. If no match, call the LLM, cache the response, return it

Figure: Semantic cache flow. The embedding call is hundreds of times cheaper than a frontier model call, making even modest hit rates profitable.
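The four steps can be sketched with an in-memory list and a linear scan. This is a toy under stated assumptions: `SemanticCache` and `get_or_call` are hypothetical names, the embedding function is injected by the caller, and a production system would use a vector index instead of scanning every entry.

```python
import math
from typing import Callable


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """In-memory semantic cache with a linear scan; fine for a demo, not for production."""

    def __init__(self, embed: Callable[[str], list[float]], threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def get_or_call(self, query: str, call_llm: Callable[[str], str]) -> str:
        # Step 1: embed the incoming query (once, reused for lookup and storage).
        q = self.embed(query)
        # Step 2: find the most similar cached entry, if any.
        best = max(self.entries, key=lambda e: cosine_similarity(q, e[0]), default=None)
        # Step 3: return the cached response when it clears the threshold.
        if best is not None and cosine_similarity(q, best[0]) >= self.threshold:
            return best[1]
        # Step 4: otherwise call the LLM, cache the response, and return it.
        response = call_llm(query)
        self.entries.append((q, response))
        return response
```

A typical call looks like `cache.get_or_call(user_query, call_llm)`, where `call_llm` wraps whichever provider SDK the application uses.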

Threshold Tuning

The similarity threshold controls the precision/recall tradeoff:

| Threshold | Behavior | Risk |
|---|---|---|
| 0.99+ | Near-exact match only | Low hit rate, very safe |
| 0.95–0.98 | Paraphrases match | Good hit rate, occasionally wrong cache hit |
| 0.90–0.95 | Loosely similar queries match | High hit rate, noticeable false positives |
| <0.90 | Aggressive caching | Dangerous: returns wrong answers |

For factual queries (“What is the capital of France?”), a 0.95 threshold works well because rephrased versions should return the same answer. For creative or context-dependent queries, caching is less useful — the “right” answer varies.

Implementation Considerations

  • Cache invalidation: Time-based TTL (e.g., 24 hours) works for most cases. For data that changes frequently, tag cache entries with version identifiers and invalidate when the underlying data changes.
  • Scope: Cache per-user (personalized responses) or globally (shared factual responses). Global caches have higher hit rates but risk leaking personalized context.
  • Storage: A vector database (Qdrant, pgvector) for the embeddings, paired with a key-value store (Redis, DynamoDB) for the response payloads.

Cost Math

If a frontier model call costs $0.02 per request and the embedding + vector search costs $0.00005 per request, a 40% cache hit rate saves:

1,000 requests without cache: 1,000 × $0.02 = $20.00
1,000 requests with 40% hit rate:
  - 400 cache hits:  400 × $0.00005 = $0.02
  - 600 cache misses: 600 × ($0.02 + $0.00005) = $12.03
  Total: $12.05
  Savings: ~40%

The savings scale linearly with the cache hit rate. Applications with repetitive query patterns (FAQ bots, documentation search, customer support) often achieve 50–70% hit rates.
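The linear relationship is easy to check for other hit rates. The sketch below uses the per-request costs quoted above and a 60% hit rate picked from the middle of the 50–70% range; the function names are hypothetical.

```python
def cost_with_semantic_cache(
    requests: int, hit_rate: float, llm_cost: float, lookup_cost: float
) -> float:
    """Total cost when every request pays for a lookup and only misses pay for the LLM."""
    hits = requests * hit_rate
    misses = requests - hits
    return hits * lookup_cost + misses * (llm_cost + lookup_cost)


def break_even_hit_rate(llm_cost: float, lookup_cost: float) -> float:
    """Hit rate at which the lookup overhead is exactly paid back."""
    return lookup_cost / llm_cost


if __name__ == "__main__":
    baseline = 1_000 * 0.02
    cached = cost_with_semantic_cache(1_000, 0.60, llm_cost=0.02, lookup_cost=0.00005)
    print(f"baseline: ${baseline:.2f}, with cache: ${cached:.2f}")
    print(f"savings: {1 - cached / baseline:.1%}")
    print(f"break-even hit rate: {break_even_hit_rate(0.02, 0.00005):.2%}")
```

Because the lookup is so much cheaper than the model call, the break-even hit rate is a fraction of a percent; the real risk of semantic caching is wrong answers from false matches, not the lookup cost.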

Architectural Patterns

Pattern 1: Tiered Generation

Split a complex task into a cheap planning step and an expensive execution step:

Figure: Tiered generation: a cheap model plans, then a gate decides which model executes. The planning step itself often reveals whether the expensive model is needed.

A concrete example: a coding assistant receives a request. A Haiku-class model analyzes the request and determines it’s a simple string formatting question. The answer is generated by the same cheap model. A complex architectural question gets routed to Opus. The planning step costs ~$0.0005 and saves $0.05 on every request that doesn’t need the expensive model.

Pattern 2: Summarize-Then-Process

For tasks that consume large documents, summarize first with a cheap model (or a model with very cheap input rates like Gemini Flash-Lite), then process the summary with a more capable model:

async def analyze_document(document: str) -> dict:
    # Step 1: Cheap summarization (handles 100K+ token documents)
    summary = await call_llm(
        model="gemini-3.1-flash-lite",
        messages=[{"role": "user", "content": f"Summarize this document in 500 words:\n{document}"}],
        max_tokens=700
    )
    
    # Step 2: Expensive analysis on the summary (~700 tokens vs 100K)
    analysis = await call_llm(
        model="claude-opus-4.7",
        messages=[{"role": "user", "content": f"Analyze this summary for legal risks:\n{summary}"}],
        max_tokens=2000
    )
    return analysis

Processing 100,000 tokens through Opus costs roughly $1.50 in input alone. Processing a 700-token summary through Opus costs roughly $0.01. Even adding the Flash-Lite summarization step (~$0.008 for 100K input tokens), the input-side total is ~$0.018 vs $1.50, roughly an 80x reduction. The Opus output cost is the same either way (up to $0.15 if the full 2,000-token budget is used at the rates above), so the end-to-end saving in that case is closer to 10x. The tradeoff is information loss in summarization, which is acceptable for many analytical tasks but not for tasks requiring fine-grained detail.

Pattern 3: Pre-Computed Responses

For predictable, high-volume queries, pre-compute responses offline and serve them from a database:

Figure: Pre-computed responses eliminate runtime LLM calls for predictable queries. The batch job runs at batch API rates (50% discount).

An e-commerce site with 10,000 products can pre-generate product descriptions, FAQ answers, and comparison blurbs nightly using the batch API. Runtime requests that match a known product serve the pre-computed response instantly, at zero marginal LLM cost.

Pattern 4: Progressive Detail

Start with a short, cheap response. Expand only if the user asks for more:

async def progressive_response(query: str, depth: str = "brief") -> str:
    budgets = {
        "brief": {"model": "claude-haiku-4.5", "max_tokens": 200},
        "detailed": {"model": "claude-sonnet-4.6", "max_tokens": 1000},
        "comprehensive": {"model": "claude-opus-4.7", "max_tokens": 4000},
    }
    config = budgets[depth]
    instruction = {
        "brief": "Answer in 2-3 sentences.",
        "detailed": "Provide a detailed answer with examples.",
        "comprehensive": "Provide a comprehensive analysis.",
    }[depth]
    
    return await call_llm(
        model=config["model"],
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": query}
        ],
        max_tokens=config["max_tokens"]
    )

Usage data typically shows that 60–80% of users are satisfied with the brief response. Only the minority who click “show more” trigger the expensive call.

Monitoring and Alerting

Cost optimization without monitoring is guesswork. Track these metrics:

Essential Metrics

| Metric | Why | Alert Threshold |
|---|---|---|
| Cost per request (by model, task) | Detect model routing issues | >2x baseline |
| Daily/weekly spend | Budget tracking | >80% of budget |
| Input tokens per request | Prompt bloat detection | >2x baseline |
| Output tokens per request | Generation budget adherence | >expected max_tokens |
| Cache hit rate | Cache effectiveness | <20% (if caching enabled) |
| Escalation rate (cascade) | Routing effectiveness | >50% |
| Cost per user/session | Per-customer unit economics | >revenue threshold |

Implementation

Most LLM gateway tools (Helicone, Langfuse, Braintrust, LiteLLM) provide cost tracking out of the box. A minimal custom implementation:

import time
from dataclasses import dataclass

@dataclass
class LLMCallMetrics:
    model: str
    task_type: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    latency_ms: float
    cache_hit: bool
    user_id: str

def track_llm_call(func):
    """Decorator to track LLM call metrics."""
    async def wrapper(*args, **kwargs):
        start = time.monotonic()
        response = await func(*args, **kwargs)
        latency = (time.monotonic() - start) * 1000
        
        metrics = LLMCallMetrics(
            model=kwargs.get("model", "unknown"),
            task_type=kwargs.get("task_type", "unknown"),
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
            cost_usd=compute_cost(response),
            latency_ms=latency,
            cache_hit=getattr(response, "cache_hit", False),
            user_id=kwargs.get("user_id", "anonymous"),
        )
        emit_metrics(metrics)  # Send to your metrics system
        return response
    return wrapper

Cost Anomaly Detection

Set up alerts for:

  • Spike detection: Daily spend exceeds 2x the 7-day moving average
  • Per-request anomalies: Any single request costing more than $1 (probably a bug sending the entire context window)
  • Model drift: The proportion of requests hitting expensive models increases without a corresponding product change
  • Token inflation: Average input tokens per request trending upward (system prompt growth, context accumulation)

Figure: Cost monitoring pipeline. Aggregate first, then detect anomalies. Real-time dashboards for trends, alerts for spikes.
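The spike rule from the list above (daily spend above 2x the 7-day moving average) fits in a few lines. This is a minimal sketch: `is_spend_spike` is a hypothetical helper, and it assumes the input is a chronological list of daily totals with the current day last.

```python
from statistics import mean


def is_spend_spike(daily_spend: list[float], window: int = 7, factor: float = 2.0) -> bool:
    """True if the latest day exceeds `factor` times the mean of the preceding `window` days."""
    if len(daily_spend) < window + 1:
        return False  # not enough history to judge
    *history, today = daily_spend[-(window + 1):]
    return today > factor * mean(history)


if __name__ == "__main__":
    steady_week = [100.0] * 7
    print(is_spend_spike(steady_week + [150.0]))  # within 2x of the average
    print(is_spend_spike(steady_week + [250.0]))  # above 2x of the average
```

Comparing against the preceding days rather than a window that includes today keeps a large spike from inflating its own baseline.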

Cost Modeling Worksheet

Before building, model the costs. This worksheet framework works for any LLM application:

Step 1: Enumerate All LLM Calls

List every place the application calls an LLM:

1. Chat response generation — user-facing, 1:1 with user messages
2. RAG retrieval query reformulation — 1 per user message
3. Document summarization — 1 per uploaded document
4. Response quality scoring — 1 per chat response (internal)
5. Content moderation — 1 per user message

Step 2: Estimate Volume

Daily active users: 5,000
Messages per user per day: 8
Documents uploaded per day: 200

Call volumes:
1. Chat responses: 40,000/day
2. Query reformulation: 40,000/day
3. Document summarization: 200/day
4. Quality scoring: 40,000/day
5. Content moderation: 40,000/day

Step 3: Estimate Tokens Per Call

1. Chat: 2,000 input (system + history + RAG context) + 500 output
2. Reformulation: 500 input + 100 output
3. Summarization: 20,000 input + 1,000 output
4. Scoring: 1,000 input + 50 output
5. Moderation: 500 input + 20 output

Step 4: Assign Models and Compute Cost

1. Chat: Claude Sonnet 4.6 — need quality, not max capability
2. Reformulation: Claude Haiku 4.5 — simple task
3. Summarization: Gemini 3.1 Flash-Lite — cheap input tokens
4. Scoring: GPT-4.1 Nano — classification task
5. Moderation: GPT-4.1 Nano — classification task

Calculate daily cost using each provider’s rates, then multiply by 30 for monthly. Add 20% buffer for traffic variance. This exercise often reveals that one or two call types dominate total spend — those are where optimization effort should focus.
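Steps 1 through 4 translate directly into a small script. The volumes, token counts, and model assignments come from the worksheet above; the rates come from the pricing table, except the Claude Sonnet 4.6 rate, which is not listed there and is an assumed placeholder. `CALLS`, `RATES`, `daily_cost`, and `monthly_estimate` are hypothetical names.

```python
# call type -> (model, calls per day, input tokens, output tokens), from Steps 2-4
CALLS = {
    "chat": ("claude-sonnet-4.6", 40_000, 2_000, 500),
    "reformulation": ("claude-haiku-4.5", 40_000, 500, 100),
    "summarization": ("gemini-3.1-flash-lite", 200, 20_000, 1_000),
    "scoring": ("gpt-4.1-nano", 40_000, 1_000, 50),
    "moderation": ("gpt-4.1-nano", 40_000, 500, 20),
}

# model -> (input, output) USD per 1M tokens; approximate and subject to change
RATES = {
    "claude-sonnet-4.6": (3.00, 15.00),  # ASSUMED placeholder, not in the pricing table
    "claude-haiku-4.5": (0.80, 4.00),
    "gemini-3.1-flash-lite": (0.075, 0.30),
    "gpt-4.1-nano": (0.10, 0.40),
}


def daily_cost(model: str, calls: int, input_tokens: int, output_tokens: int) -> float:
    """USD per day for one call type."""
    in_rate, out_rate = RATES[model]
    return calls * (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


def monthly_estimate(buffer: float = 0.20) -> float:
    """Sum all call types, scale to 30 days, and add a traffic-variance buffer."""
    per_day = sum(daily_cost(*spec) for spec in CALLS.values())
    return per_day * 30 * (1 + buffer)


if __name__ == "__main__":
    for name, spec in CALLS.items():
        print(f"{name:>14}: ${daily_cost(*spec):,.2f}/day")
    print(f"monthly estimate with 20% buffer: ${monthly_estimate():,.2f}")
```

Under these assumed rates the chat call accounts for the large majority of spend, which is the kind of concentration the worksheet is meant to surface.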

Step 5: Apply Optimizations

Model the impact of each optimization:

| Optimization | Affected Calls | Expected Savings |
|---|---|---|
| Prompt caching (system prompt) | Chat, Reformulation | 30–50% input cost reduction |
| Semantic caching | Chat, Reformulation | 20–40% of calls avoided |
| Batch API for scoring | Quality scoring | 50% rate discount |
| Compressed system prompt | All | 10–20% input token reduction |
| Progressive detail | Chat | 30% output token reduction |

Stack the optimizations multiplicatively to estimate final cost. A common result: 3–5x total cost reduction from the unoptimized baseline.
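Multiplicative stacking means multiplying the fraction of cost that survives each optimization. The sketch below is a deliberate simplification: it treats each saving as an independent fraction of total cost, whereas the table's entries apply to different call types and to input versus output tokens, and the example percentages are invented. `stacked_remaining_cost` is a hypothetical name.

```python
from math import prod


def stacked_remaining_cost(savings: list[float]) -> float:
    """Fraction of baseline cost left after applying independent fractional savings."""
    for s in savings:
        if not 0.0 <= s < 1.0:
            raise ValueError("each saving must be in [0, 1)")
    return prod(1.0 - s for s in savings)


if __name__ == "__main__":
    # Hypothetical blended savings as fractions of total spend.
    remaining = stacked_remaining_cost([0.40, 0.30, 0.15])
    print(f"remaining cost: {remaining:.1%} of baseline ({1 / remaining:.1f}x reduction)")
```

Stacked savings are always smaller than the sum of the individual percentages, which is why modeling them together gives a more honest estimate than adding them up.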

Example: Final Cost Comparison

Unoptimized (all Opus, no caching, no routing):     $4,200/month
With model routing only:                             $1,100/month
With routing + prompt caching:                       $680/month
With routing + caching + semantic cache:             $420/month
With all optimizations:                              $350/month

Total reduction: ~12x

These numbers are illustrative but representative. The exact ratio depends on workload characteristics, but 5–15x reductions are common when moving from a naive implementation to a fully optimized one.

Summary

The highest-impact optimizations, ordered roughly by payoff relative to implementation effort:

  1. Model routing: Use the cheapest model that meets quality requirements for each task type. This alone typically saves 3–5x. Requires systematic evals.
  2. Prompt caching: Structure prompts with stable prefixes. Near-zero implementation effort, 30–50% input cost reduction for conversational or repeated-context workloads.
  3. Token budgeting: Set max_tokens on every call, track usage per user and feature, alert on anomalies. Prevents runaway costs.
  4. Prompt compression: Audit and compress system prompts. Eliminate redundant few-shot examples. 10–30% token reduction.
  5. Batch APIs: Route non-real-time work through batch endpoints for 50% rate discounts.
  6. Semantic caching: Cache LLM responses keyed by embedding similarity. 20–40% call elimination for repetitive workloads.
  7. Architectural patterns: Tiered generation, summarize-then-process, progressive detail. Higher implementation effort but multiplicative with other optimizations.

Cost optimization is not a one-time exercise. Token costs change as providers adjust pricing, new models emerge that shift the quality-cost frontier, and application usage patterns evolve. Monthly cost reviews — examining per-call-type spend, cache hit rates, and escalation rates — catch drift before it becomes a budget problem.

Further Reading

  • OpenAI Prompt Caching Guide — Official documentation on automatic prefix caching behavior and pricing
  • Anthropic Prompt Caching Documentation — Explicit cache control breakpoints and TTL behavior
  • tiktoken — OpenAI’s fast BPE tokenizer for pre-flight token counting
  • LiteLLM — Unified API proxy supporting 100+ LLM providers with built-in cost tracking and model fallback
  • Helicone — Open-source LLM observability platform with per-request cost attribution and dashboards
  • Langfuse — Open-source LLM engineering platform with tracing, cost tracking, and eval pipelines
  • GPTCache — Semantic caching library for LLM responses with multiple embedding and storage backends
  • OpenAI Batch API Documentation — 50% discount batch processing with 24-hour completion window
  • Anthropic Message Batches — Anthropic’s batch processing API for high-volume workloads
  • Martian Model Router — Open-source intelligent model routing based on prompt characteristics