LLM Token Costs and Efficiency: A Practitioner's Guide (April 2026)
Beyond the pricing page. How to actually think about LLM costs: per-token pricing across 15+ providers, hidden multipliers, caching mechanics, batch discounts, model routing architectures, and what 'cost per useful output' means in production.
Per-token pricing is the number on the pricing page. It is not your actual cost. The gap between the two is where most teams either save or waste thousands of dollars a month. This post covers both: the raw numbers across every major provider, and the strategies that determine what you actually pay.
Companion post: API Rate Limits Compared (April 2026) covers RPM, TPM, and throughput limits across the same providers.
Table of Contents
- Per-token pricing across providers
- The hidden cost multipliers
- Prompt caching: how it works, what it saves
- Batch APIs: the easiest 50% discount
- Model routing: the architecture that pays for itself
- Token efficiency: not all tokens are equal
- Structured outputs: fewer tokens, same information
- Self-hosting vs. API: when the math flips
- Cost modeling a real workload
All prices verified against official provider documentation as of April 25, 2026. Prices change frequently. Confirm on provider pricing pages before budgeting.
Per-token pricing across providers
Prices in USD per million tokens. Sorted cheapest to most expensive by output cost within each category.
Proprietary flagship models
| Model | Input/M | Output/M | Cached Input/M | Context | Notes |
|---|---|---|---|---|---|
| Mistral Nemo 12B | $0.02 | $0.06 | — | 128K | |
| Gemini 2.5 Flash-Lite | $0.025 | $0.10 | — | 1M | No thinking mode |
| DeepSeek V4-Flash | $0.14 | $0.28 | $0.028 | 1M | 75% launch discount until May 5 |
| Gemini 2.5 Flash | $0.15 | $0.60 | $0.0375 | 1M | Thinking tokens: $0.60/M in, $2.50/M out |
| DeepSeek V4-Pro | $0.435 | $0.87 | $0.036 | 1M | 75% launch discount until May 5 |
| GPT-5.4 nano | $0.20 | $1.25 | $0.10 | 128K | |
| Gemini 3.1 Flash-Lite | $0.25 | $1.50 | — | 1M | Preview |
| GPT-5.4 mini | $0.75 | $4.50 | $0.375 | 128K | |
| Claude Haiku 4.5 | $1.00 | $5.00 | $0.10 | 200K | Cache write: $1.25/M |
| Mistral Large 3 | $2.00 | $6.00 | — | 128K | |
| GPT-5.4 | $2.50 | $10.00 | $1.25 | 1M | |
| Gemini 2.5 Pro | $1.25 | $10.00 | $0.3125 | 1M | Thinking tokens: $1.25/M in, $10/M out |
| Gemini 3.1 Pro | $2–4 | $12–18 | — | 1M | $2/$12 under 200K; $4/$18 above |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $0.30 | 1M | Cache write: $3.75/M |
| Claude Opus 4.7 | $5.00 | $25.00 | $0.50 | 1M | Cache write: $6.25/M |
| Claude Opus 4.6 | $5.00 | $25.00 | $0.50 | 1M | Cache write: $6.25/M |
| GPT-5.5 | $5.00 | $30.00 | $2.50 | 1M | ~40% more token-efficient per task |
| GPT-5.5 Pro | Contact sales | — | — | — | Enterprise only |
Reasoning/thinking models
| Model | Input/M | Output/M | Context | Notes |
|---|---|---|---|---|
| DeepSeek R1 | $0.55 | $2.19 | 128K | Cache hits: $0.14/M |
| o4-mini | $1.10 | $4.40 | 200K | Thinking tokens: $1.10/M |
| o3 | $2.00 | $8.00 | 200K | Thinking tokens: $2.00/M |
Thinking/reasoning tokens deserve attention. When a model uses extended thinking, you pay for the thinking tokens too. A complex math problem that triggers 10,000 thinking tokens before generating a 500-token answer bills 10,500 output tokens, not 500. Anthropic and OpenAI bill thinking tokens at the same rate as standard output tokens; Google prices thinking separately on some models (per the table above, Gemini 2.5 Flash bills thinking output at $2.50/M versus $0.60/M standard).
Open-weight models (hosted pricing)
Self-hosting these models carries no per-token cost; the cost is infrastructure (covered in the self-hosting section below). The figures here are representative hosted prices from providers like Groq, Together AI, Fireworks, and Alibaba Cloud.
| Model | Hosted Input/M | Hosted Output/M | Context | License |
|---|---|---|---|---|
| Gemma 3 27B | ~$0.20 | ~$0.20 | 128K | Gemma |
| Llama 4 Scout (109B/17B active) | ~$0.10 | ~$0.40 | 10M | Meta |
| Qwen3-32B | ~$0.15 | ~$0.60 | 128K | Apache 2.0 |
| Llama 4 Maverick | ~$0.15 | ~$0.60 | 1M | Meta |
| Mistral 7B | ~$0.25 | ~$0.75 | 128K | Apache 2.0 |
| GPT-OSS 120B | ~$0.90 | ~$0.90 | 128K | Apache 2.0 |
| Devstral 2 (123B) | ~$1.00 | ~$1.00 | 256K | Apache 2.0 |
| DeepSeek R1 (hosted) | $0.55 | $2.19 | 128K | MIT |
Embeddings and other modalities
| Model | Price | Unit | Notes |
|---|---|---|---|
| OpenAI text-embedding-3-small | $0.02/M tokens | tokens | |
| OpenAI text-embedding-3-large | $0.13/M tokens | tokens | |
| Gemini Embedding 2 | Free (AI Studio) | tokens | Multimodal (text, image, video, audio) |
| Cohere Embed v3 | $0.10/M tokens | tokens | Multilingual |
| Whisper (OpenAI) | $0.006/min | audio minutes | |
| Whisper (Groq, free tier) | $0.00 | audio minutes | 8 hrs/day limit |
| OpenAI TTS | $15.00/M chars | characters | |
| OpenAI TTS HD | $30.00/M chars | characters | |
| GPT Image 2 | $0.02–$0.17 | per image | Varies by quality/size |
The hidden cost multipliers
The pricing table above is the starting point, not the answer. Several factors multiply your actual spend beyond the per-token rate.
1. Output tokens cost 2–6x more than input tokens
Every major provider charges more for output than input. The ratio varies:
| Model | Typical Output:Input Price Ratio |
|---|---|
| DeepSeek V4-Flash | 2:1 |
| OpenAI GPT-5.4 | 4:1 |
| Claude Sonnet 4.6 | 5:1 |
| GPT-5.5 | 6:1 |
| Claude Opus 4.7 | 5:1 |
This means prompt engineering that reduces output length has an outsized cost impact. A system prompt that gets the model to return 200 tokens instead of 500 tokens saves more than a system prompt that reduces input by the same 300 tokens.
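To make the asymmetry concrete, here is a minimal sketch of the per-request math at Claude Sonnet 4.6 list prices from the table above ($3/M input, $15/M output); the token counts are illustrative:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float = 3.00, output_per_m: float = 15.00) -> float:
    """Cost in USD for one request at per-million-token rates."""
    return input_tokens / 1e6 * input_per_m + output_tokens / 1e6 * output_per_m

base = request_cost(2_500, 500)
trim_output = base - request_cost(2_500, 200)   # 300 fewer output tokens
trim_input = base - request_cost(2_200, 500)    # 300 fewer input tokens

print(f"saved by trimming 300 output tokens: ${trim_output:.4f}")  # $0.0045
print(f"saved by trimming 300 input tokens:  ${trim_input:.4f}")   # $0.0009
```

Same 300 tokens removed, but the output trim saves 5x more per request at a 5:1 price ratio.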
2. Context length pricing tiers
Google charges different rates based on how much of the context window you use. Gemini 3.1 Pro costs $2/$12 per M under 200K tokens, but $4/$18 above 200K: input doubles and output rises 50% for long-context requests. Other providers currently charge a flat rate regardless of context usage, but this could change.
3. Thinking token overhead
Extended thinking is not free. A request to Claude Opus 4.7 that triggers 20,000 thinking tokens at $25/M output adds $0.50 to that single request before the actual response tokens. For complex reasoning tasks, thinking tokens can exceed response tokens by 10–50x.
Anthropic’s budget_tokens parameter lets you cap thinking tokens per request. OpenAI’s reasoning_effort parameter (low/medium/high) on o3/o4-mini controls thinking depth. Use these to bound costs on reasoning-heavy workloads.
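Here is a sketch of what those caps look like as request payloads. No network call is made; the model IDs are placeholders for this post's hypothetical models, and the parameter shapes follow Anthropic's Messages API (`thinking.budget_tokens`) and OpenAI's Chat Completions API (`reasoning_effort`) as of this writing. Verify against current SDK docs before relying on them:

```python
anthropic_request = {
    "model": "claude-opus-4-7",          # hypothetical model ID
    "max_tokens": 2_048,
    "thinking": {"type": "enabled", "budget_tokens": 8_000},  # hard thinking cap
    "messages": [{"role": "user", "content": "Prove the claim step by step."}],
}

openai_request = {
    "model": "o4-mini",
    "reasoning_effort": "low",           # low | medium | high
    "messages": [{"role": "user", "content": "Prove the claim step by step."}],
}

# Worst-case thinking spend for the Anthropic request at $25/M output:
max_thinking_cost = anthropic_request["thinking"]["budget_tokens"] / 1e6 * 25.00
print(f"thinking cost ceiling: ${max_thinking_cost:.2f}")  # $0.20 per request
```

The budget turns an unbounded cost (thinking tokens can be 10–50x the response) into a known per-request ceiling.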
4. Cache write costs
Anthropic charges a 25% premium on tokens written to cache vs. standard input. Claude Opus 4.7: $6.25/M for cache writes vs. $5.00/M for standard input. The premium is recovered on the first cache read: a read costs $0.50/M instead of $5.00/M, a $4.50/M saving that dwarfs the $1.25/M write premium. Any cached prefix used at least twice (one write plus one read) is net positive. Single-use prompts should skip caching.
5. Image and multimodal token costs
Images are converted to tokens before processing. A single high-resolution image can consume 1,000–5,000+ tokens depending on the model and resolution settings. For vision-heavy workloads (document OCR, screenshot analysis, diagram interpretation), image token costs can dominate your bill. Resize images to the minimum resolution needed before sending them.
Prompt caching: how it works, what it saves
Prompt caching stores the processed representation of input tokens on the provider side so subsequent requests with the same prefix skip re-processing. The savings are substantial.
Provider comparison
| Provider | Cache Read Discount | Cache Write Premium | Min Cacheable | TTL | Rate Limit Benefit |
|---|---|---|---|---|---|
| Anthropic | 90% off input | 25% over input | 1,024 tokens | 5 min (auto-extends) | Cached reads don’t count toward ITPM |
| OpenAI | 50% off input | None (automatic) | Varies | Automatic | No rate limit benefit |
| Google | 75% off input | None (usage-based) | 4,096 tokens | Configurable | No rate limit benefit |
| DeepSeek | 80–90% off input | None | Automatic | Varies | No rate limit benefit |
Anthropic’s caching has the most generous rate limit interaction: cached read tokens are excluded from ITPM limits entirely. With 80% cache hits, a 2M ITPM limit handles 10M total input tokens per minute. This is a throughput multiplier, not just a cost reduction.
When caching pays off
The break-even math for Anthropic: a cache write costs 25% more than standard input, and a cache read costs 90% less. A single cache read more than recovers the write premium (the premium is 0.25x the input rate; each read saves 0.90x), so everything from the first reuse onward is savings. For a system prompt reused across thousands of requests, the savings approach 90% on that portion of every request.
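The break-even can be sketched directly from the two multipliers, using Sonnet 4.6's $3/M input rate from the table:

```python
input_rate = 3.00                 # $/M standard input
write_rate = input_rate * 1.25    # $3.75/M cache write (25% premium)
read_rate = input_rate * 0.10     # $0.30/M cache read (90% discount)

def cost_per_m(uses: int, cached: bool) -> float:
    """$/M for a prefix processed `uses` times: 1 write + (uses-1) reads if cached."""
    if not cached:
        return uses * input_rate
    return write_rate + (uses - 1) * read_rate

for uses in (1, 2, 5, 100):
    print(uses, cost_per_m(uses, cached=True), cost_per_m(uses, cached=False))
```

At one use caching loses ($3.75 vs $3.00); at two uses it already wins ($4.05 vs $6.00); at a hundred uses the cached path costs ~11% of the uncached one.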
Practical caching patterns
RAG with stable context: Cache the system prompt + retrieved documents as a prefix. Each follow-up query appends to the cached prefix. Works well when the same document set is queried multiple times within the TTL window.
Multi-turn conversations: Cache the conversation history prefix. Each new turn extends the cache. Anthropic’s 5-minute auto-extending TTL means active conversations stay cached indefinitely.
Batch evaluation: Cache the evaluation rubric + few-shot examples. Run hundreds of evaluations against the cached prefix. The rubric tokens (often 2,000–5,000) are paid once at write price, then read at 90% discount for every subsequent evaluation.
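The batch-evaluation pattern looks like this as a request payload. This is a payload-only sketch (no API call); the rubric text and model ID are placeholders, and the `cache_control: {"type": "ephemeral"}` marker follows Anthropic's prompt-caching API:

```python
RUBRIC = "Score the answer 1-5 for accuracy, grounding, and tone. (placeholder rubric)"

def eval_request(candidate_answer: str) -> dict:
    """One evaluation request; the system block is the cacheable prefix."""
    return {
        "model": "claude-haiku-4-5",     # hypothetical model ID
        "max_tokens": 256,
        "system": [
            {
                "type": "text",
                "text": RUBRIC,
                "cache_control": {"type": "ephemeral"},  # cacheable prefix ends here
            }
        ],
        "messages": [{"role": "user", "content": candidate_answer}],
    }

reqs = [eval_request(ans) for ans in ("answer A", "answer B")]
# Identical system prefix on every request: written once, read thereafter.
assert reqs[0]["system"] == reqs[1]["system"]
```

Only the user turn varies between requests, so the rubric tokens hit the cache on every evaluation after the first.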
Batch APIs: the easiest 50% discount
Both OpenAI and Anthropic offer batch processing with lower pricing and higher throughput limits.
| Provider | Batch Discount | Turnaround | Queue Limits |
|---|---|---|---|
| OpenAI | 50% off standard pricing | Up to 24 hours | 3x–100x higher than real-time TPM |
| Anthropic | 50% off standard pricing | Up to 24 hours | Higher than real-time |
| Google | 50% off standard pricing | Up to 24 hours | Available for Gemini 2.5+ models |
OpenAI’s batch queue limits are notably generous: GPT-5.5 gets 1.5M batch tokens at Tier 1 vs. 500K real-time TPM. At Tier 5, the batch queue reaches 15 billion tokens.
Workloads that fit batch processing: content generation pipelines, bulk classification, evaluation suites, data extraction from document archives, nightly report generation, synthetic data creation. Any workload where the output is consumed minutes or hours after generation rather than in real time.
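For OpenAI's Batch API, the unit of work is a JSONL file where each line is a request envelope. A minimal sketch of building one (the model ID, document IDs, and texts are placeholders; the envelope fields follow the Batch API format):

```python
import json

docs = {"doc-1": "Q4 revenue grew 12% (placeholder)", "doc-2": "Churn fell to 3% (placeholder)"}

lines = []
for doc_id, text in docs.items():
    lines.append(json.dumps({
        "custom_id": doc_id,                 # ties each result back to its input
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-5.4-mini",         # hypothetical model ID
            "messages": [{"role": "user", "content": f"Summarize: {text}"}],
            "max_tokens": 200,
        },
    }))

batch_input = "\n".join(lines)               # write this to a .jsonl file and upload
print(batch_input.count("\n") + 1)           # number of requests queued
```

Upload the file, create the batch referencing it, and collect results within the turnaround window at half price.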
Model routing: the architecture that pays for itself
The most impactful cost optimization is not using one model for everything. A routing layer that sends requests to different models based on complexity can cut spend by 40% or more on its own, and 70–80% once combined with caching.
Cost comparison: routed vs. single-model
Assume 100,000 requests/day, average 500 output tokens per request.
Single model (Claude Sonnet 4.6 for everything):
- 100K requests × 500 tokens = 50M output tokens/day
- 50M × $15/M = $750/day
Three-tier routing (70/25/5 split):
- 70K requests × 500 tokens × $5/M (Haiku 4.5) = $175
- 25K requests × 500 tokens × $15/M (Sonnet 4.6) = $187.50
- 5K requests × 500 tokens × $25/M (Opus 4.7) = $62.50
- Total: $425/day (43% savings)
The classifier itself costs little. Running GPT-5.4 nano ($0.20/M input) to classify intent on a ~100-token input works out to 10M input tokens across 100K daily classifications: about $2/day, plus a few cents of output. A rounding error next to the $425 main spend.
The savings compound with caching. If the routing layer also implements prompt caching on the medium and complex tiers, effective savings can reach 70–80%.
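The three-tier math above reduces to a few lines. A sketch using the output rates from the pricing table (output cost only, matching the worked example):

```python
TIERS = {                    # model: (share of traffic, $/M output)
    "haiku-4.5": (0.70, 5.00),
    "sonnet-4.6": (0.25, 15.00),
    "opus-4.7": (0.05, 25.00),
}

def daily_output_cost(requests: int, avg_output_tokens: int) -> float:
    """Blended daily output cost across the routing tiers."""
    total = 0.0
    for share, rate in TIERS.values():
        tokens = requests * share * avg_output_tokens
        total += tokens / 1e6 * rate
    return total

routed = daily_output_cost(100_000, 500)     # $425/day
single = 100_000 * 500 / 1e6 * 15.00         # $750/day, Sonnet for everything
print(routed, single, f"{1 - routed / single:.0%} savings")
```

Adjusting the shares is how you tune the cost/quality trade-off: shifting 10 points of traffic from Sonnet to Haiku drops the blended rate by $1/M output.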
Token efficiency: not all tokens are equal
The same task produces different token counts on different models. This is the most underappreciated cost variable.
GPT-5.5’s efficiency claim
OpenAI reports GPT-5.5 is ~40% more token-efficient than GPT-5.4 on coding tasks. Per-token price doubled ($5/$30 vs. $2.50/$15), but if the model uses 40% fewer tokens to produce equivalent output, the effective per-task cost is ~20% higher, not 100% higher.
This claim is plausible for code generation (where a more capable model produces tighter code in fewer iterations) but unlikely to hold for simple summarization or classification tasks where output length is fixed. Evaluate token efficiency on your specific workload before assuming the headline number applies.
Tokenizer differences matter
Different providers use different tokenizers. The same English sentence might be 15 tokens on OpenAI’s tiktoken (cl100k_base), 18 tokens on Anthropic’s tokenizer, and 12 tokens on a model using a larger vocabulary. For pricing comparisons, convert to a common unit (characters or words) rather than comparing raw per-token prices across providers with different tokenizers.
A rough heuristic: 1 token ≈ 4 characters ≈ 0.75 words in English. But this varies by language (CJK characters often consume 1–2 tokens each) and by content type (code is tokenized differently than prose).
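The heuristic is enough for budget estimates (use the provider's actual tokenizer for billing-accurate counts). A minimal sketch:

```python
def rough_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Estimate token count from the ~4 chars/token English heuristic."""
    return max(1, round(len(text) / chars_per_token))

sentence = "Prompt caching stores the processed prefix on the provider side."
print(rough_tokens(sentence))   # ~16 tokens for this 64-character sentence
```

For CJK text or code, swap in a smaller `chars_per_token` value, and treat all results as order-of-magnitude estimates.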
Prompt engineering for token efficiency
Small changes to prompts can produce large token savings at scale:
- “Respond in JSON” vs. natural language: structured responses are 30–60% shorter.
- “Be concise” in the system prompt: models tend to over-explain. An explicit brevity instruction reduces output tokens by 20–40% on average.
- Few-shot examples: Adding 2–3 examples of desired output length calibrates the model’s verbosity. This costs a few hundred extra input tokens but saves thousands of output tokens across the batch.
- Max tokens parameter: Set a hard cap. Prevents runaway generation on edge cases where the model would otherwise produce 10x the expected output.
Structured outputs: fewer tokens, same information
JSON mode and structured outputs (now GA on OpenAI, Anthropic, and Google) constrain the model to return data in a predefined schema. This is both a reliability and a cost optimization.
A sentiment classification task:
Unstructured response (typical): “The sentiment of this review is positive. The customer expresses satisfaction with the product quality and delivery speed, though they note minor concerns about packaging.” (~35 tokens)
Structured response: {"sentiment": "positive", "confidence": 0.92} (~12 tokens)
That is a 65% reduction in output tokens per request. At $15/M output (Sonnet 4.6), 1M daily classifications each save 23 tokens: 23M tokens × $15/M = $345/day.
For any pipeline where the downstream consumer is code (not a human reading prose), structured outputs should be the default.
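A sketch of the sentiment task as a strict-schema request, shaped like OpenAI's structured-outputs request body (`response_format` with `json_schema`); the model ID is a placeholder and field names should be verified against current API docs. The schema itself is standard JSON Schema:

```python
sentiment_schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["sentiment", "confidence"],
    "additionalProperties": False,     # required for strict mode
}

request_body = {
    "model": "gpt-5.4-mini",           # hypothetical model ID
    "messages": [{"role": "user", "content": "Classify: 'Great product, slow box.'"}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "sentiment", "strict": True, "schema": sentiment_schema},
    },
}

# Savings math from the example above: 23 fewer output tokens per request.
daily_savings = (35 - 12) * 1_000_000 / 1e6 * 15.00
print(f"${daily_savings:.0f}/day at 1M requests")
```

The `enum` constraint does double duty: it caps output length and eliminates downstream parsing of free-text labels.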
Self-hosting vs. API: when the math flips
Open-weight models (Llama 4, Qwen3, Mistral, DeepSeek, GPT-OSS) are free to run. The cost is infrastructure.
Break-even estimation
A single H100 (80GB) rents for roughly $2–3/hour on cloud providers, or about $2,000/month. It can serve a 70B model (quantized to 4-bit) at approximately 30–50 tokens/second per stream; with continuous batching across concurrent requests, aggregate throughput on the order of 1,000 tokens/second is a reasonable planning figure, though it varies widely with serving stack and request mix.
- Monthly H100 cost: ~$2,000
- Monthly output at ~1,000 tok/s sustained aggregate: ~2.5B tokens
Equivalent API cost for 2.5B output tokens:
- Claude Haiku 4.5: $12,500
- GPT-5.4 nano: $3,125
- DeepSeek V4-Flash: $700
Self-hosting a 70B open model at $2,000/month breaks even against DeepSeek V4-Flash at roughly 7B output tokens/month. Against Haiku 4.5, it breaks even at under 500M tokens/month.
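The break-even volumes follow from one division: monthly GPU cost over the API's output rate. A minimal sketch:

```python
def breakeven_tokens_m(monthly_gpu_cost: float, api_rate_per_m: float) -> float:
    """Millions of output tokens/month at which self-hosting matches the API."""
    return monthly_gpu_cost / api_rate_per_m

print(breakeven_tokens_m(2_000, 0.28))   # vs DeepSeek V4-Flash: ~7,143M (~7.1B)
print(breakeven_tokens_m(2_000, 5.00))   # vs Claude Haiku 4.5: 400M
```

The spread is the whole story: against a cheap API the GPU must run near saturation to pay off; against a mid-tier API it breaks even at a fraction of its capacity.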
When APIs still win
- Low volume: Below roughly 0.5–5B output tokens/month (depending on which API model you would otherwise use), API costs are lower than maintaining infrastructure.
- Burst traffic: APIs absorb spikes. Self-hosted GPUs sit idle during troughs and can’t handle peaks without over-provisioning.
- Frontier quality: No open model matches Opus 4.7 or GPT-5.5 on complex reasoning. If you need the best, APIs are the only option.
- Compliance and support: Enterprise API tiers include SLAs, BAAs (HIPAA), data residency controls, and zero-retention options that would require additional investment to replicate on self-hosted infrastructure.
When self-hosting wins
- High sustained volume: Above 10B+ tokens/month, the economics favor owned or reserved GPU capacity.
- Data sovereignty: Air-gapped or on-premise requirements.
- Fine-tuning: Self-hosted models can be fine-tuned on proprietary data. API-based fine-tuning is available from some providers but with less control.
- Latency control: No network round-trip. Co-located inference can achieve sub-10ms TTFT.
Cost modeling a real workload
Pricing tables are abstractions. Here is a concrete example of modeling costs for a production RAG pipeline.
Scenario: customer support agent
- 50,000 queries/day
- Each query: 200 tokens input (user question), 2,000 tokens retrieved context, 300 tokens system prompt, 400 tokens output
- System prompt + retrieved context form a stable 2,300-token prefix (the same FAQ docs serve most queries), so within each cache TTL window nearly every request is a cache hit
Option A: Claude Sonnet 4.6, no optimization
| Component | Tokens/request | Daily tokens | Cost/M | Daily cost |
|---|---|---|---|---|
| Input (system + context + query) | 2,500 | 125M | $3.00 | $375.00 |
| Output | 400 | 20M | $15.00 | $300.00 |
| Total | | | | $675.00/day |
Monthly: ~$20,250.
Option B: Claude Sonnet 4.6, with prompt caching
| Component | Tokens/request | Daily tokens | Cost/M | Daily cost |
|---|---|---|---|---|
| Cache write (system + context, first request/5min) | 2,300 | ~0.7M | $3.75 | $2.63 |
| Cache read (system + context, subsequent) | 2,300 | 114.3M | $0.30 | $34.29 |
| Uncached input (user query) | 200 | 10M | $3.00 | $30.00 |
| Output | 400 | 20M | $15.00 | $300.00 |
| Total | | | | $366.92/day |
Monthly: ~$11,008. 46% savings from caching alone.
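Option B's numbers can be reproduced in a few lines. A sketch assuming one cache write per 5-minute TTL window (288/day) and cache reads for every other request, at Sonnet 4.6 rates:

```python
QUERIES_PER_DAY = 50_000
CACHED_PREFIX = 2_300        # system prompt + retrieved context, tokens
QUERY_TOKENS = 200           # uncached user question
OUTPUT_TOKENS = 400

writes = 288                 # one cache write per 5-minute window over 24h
reads = QUERIES_PER_DAY - writes

daily_cost = (
    writes * CACHED_PREFIX / 1e6 * 3.75              # cache write $/M
    + reads * CACHED_PREFIX / 1e6 * 0.30             # cache read $/M
    + QUERIES_PER_DAY * QUERY_TOKENS / 1e6 * 3.00    # uncached input $/M
    + QUERIES_PER_DAY * OUTPUT_TOKENS / 1e6 * 15.00  # output $/M
)
print(f"${daily_cost:.2f}/day")   # ~$367, vs $675 with no caching
```

Note where the money goes: output is now ~80% of the bill, which is why the next step (routing cheap queries to a cheaper model's output rate) has so much headroom.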
Option C: Routed, with caching
Route 70% of queries (simple FAQ-type) to Haiku 4.5, 30% (complex) to Sonnet 4.6. Both with caching.
| Component | Daily cost |
|---|---|
| Haiku tier (35K queries, cached) | ~$52.50 |
| Sonnet tier (15K queries, cached) | ~$117.00 |
| Classifier (50K queries, GPT-5.4 nano) | ~$1.00 |
| Total | ~$170.00/day |
Monthly: ~$5,100. 75% savings vs. Option A.
Option D: Add batch processing for non-real-time queries
If 20% of queries (overnight analytics, report generation) can tolerate 24-hour turnaround:
| Component | Daily cost |
|---|---|
| Real-time routed (40K queries) | ~$136.00 |
| Batch (10K queries, 50% discount) | ~$17.00 |
| Total | ~$153.00/day |
Monthly: ~$4,590. 77% savings vs. Option A.
The progression from $20,250/month to $4,590/month requires no self-hosting and, provided the router sends only genuinely simple queries to the cheap tier, no degradation in output quality. It is entirely a function of caching, routing, and batching. These are architecture decisions, not model decisions.
Quick reference: cost per 1M output tokens
For fast comparison. Sorted cheapest to most expensive. Input costs excluded (they vary by caching and prompt design). This is what you pay for the tokens your users actually see.
| Model | Output $/M | Type |
|---|---|---|
| Mistral Nemo 12B | $0.06 | Proprietary |
| Gemini 2.5 Flash-Lite | $0.10 | Proprietary |
| DeepSeek V4-Flash | $0.28 | Proprietary |
| Llama 4 Scout (hosted) | ~$0.40 | Open-weight |
| GPT-5.4 nano | $1.25 | Proprietary |
| Gemini 3.1 Flash-Lite | $1.50 | Proprietary |
| DeepSeek R1 | $2.19 | Open-weight |
| o4-mini | $4.40 | Proprietary |
| GPT-5.4 mini | $4.50 | Proprietary |
| Claude Haiku 4.5 | $5.00 | Proprietary |
| Mistral Large 3 | $6.00 | Proprietary |
| o3 | $8.00 | Proprietary |
| GPT-5.4 | $10.00 | Proprietary |
| Gemini 2.5 Pro | $10.00 | Proprietary |
| Gemini 3.1 Pro | $12–18 | Proprietary |
| Claude Sonnet 4.6 | $15.00 | Proprietary |
| Claude Opus 4.7 | $25.00 | Proprietary |
| GPT-5.5 | $30.00 | Proprietary |
Prices verified April 25, 2026. The LLM pricing landscape shifts every 2–4 weeks. Use pricepertoken.com for live tracking across 300+ models. Always confirm against official provider documentation before production budgeting.