
LLM Token Costs and Efficiency: A Practitioner's Guide (April 2026)

Beyond the pricing page. How to actually think about LLM costs: per-token pricing across 15+ providers, hidden multipliers, caching mechanics, batch discounts, model routing architectures, and what 'cost per useful output' means in production.

The post you bookmark. One topic, covered end to end.

Per-token pricing is the number on the pricing page. It is not your actual cost. The gap between the two is where most teams either save or waste thousands of dollars a month. This post covers both: the raw numbers across every major provider, and the strategies that determine what you actually pay.

Companion post: API Rate Limits Compared (April 2026) covers RPM, TPM, and throughput limits across the same providers.


All prices verified against official provider documentation as of April 25, 2026. Prices change frequently. Confirm on provider pricing pages before budgeting.


Per-token pricing across providers

Prices in USD per million tokens. Sorted cheapest to most expensive by output cost within each category.

Proprietary flagship models

| Model | Input $/M | Output $/M | Cached Input $/M | Context | Notes |
|---|---|---|---|---|---|
| Mistral Nemo 12B | $0.02 | $0.06 | — | 128K | |
| Gemini 2.5 Flash-Lite | $0.025 | $0.10 | — | 1M | No thinking mode |
| DeepSeek V4-Flash | $0.14 | $0.28 | $0.028 | 1M | 75% launch discount until May 5 |
| Gemini 2.5 Flash | $0.15 | $0.60 | $0.0375 | 1M | Thinking tokens: $0.60/M in, $2.50/M out |
| DeepSeek V4-Pro | $0.435 | $0.87 | $0.036 | 1M | 75% launch discount until May 5 |
| GPT-5.4 nano | $0.20 | $1.25 | $0.10 | 128K | |
| Gemini 3.1 Flash-Lite | $0.25 | $1.50 | — | 1M | Preview |
| GPT-5.4 mini | $0.75 | $4.50 | $0.375 | 128K | |
| Claude Haiku 4.5 | $1.00 | $5.00 | $0.10 | 200K | Cache write: $1.25/M |
| Mistral Large 3 | $2.00 | $6.00 | — | 128K | |
| GPT-5.4 | $2.50 | $10.00 | $1.25 | 1M | |
| Gemini 2.5 Pro | $1.25 | $10.00 | $0.3125 | 1M | Thinking tokens: $1.25/M in, $10/M out |
| Gemini 3.1 Pro | $2–4 | $12–18 | — | 1M | $2/$12 under 200K; $4/$18 above |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $0.30 | 1M | Cache write: $3.75/M |
| Claude Opus 4.7 | $5.00 | $25.00 | $0.50 | 1M | Cache write: $6.25/M |
| Claude Opus 4.6 | $5.00 | $25.00 | $0.50 | 1M | Cache write: $6.25/M |
| GPT-5.5 | $5.00 | $30.00 | $2.50 | 1M | ~40% more token-efficient per task |
| GPT-5.5 Pro | Contact sales | Contact sales | — | — | Enterprise only |

Reasoning/thinking models

| Model | Input $/M | Output $/M | Context | Notes |
|---|---|---|---|---|
| DeepSeek R1 | $0.55 | $2.19 | 128K | Cache hits: $0.14/M |
| o4-mini | $1.10 | $4.40 | 200K | Thinking tokens billed at the output rate |
| o3 | $2.00 | $8.00 | 200K | Thinking tokens billed at the output rate |

Thinking/reasoning tokens deserve attention. When a model uses extended thinking, you pay for the thinking tokens too. A complex math problem that triggers 10,000 thinking tokens before generating a 500-token answer costs you for 10,500 tokens of output, not 500. Anthropic, OpenAI, and Google all bill thinking tokens at the same rate as standard output tokens for their respective models.

Open-weight models (hosted pricing)

Self-hosting carries no per-token fee; the cost is infrastructure (see the break-even section below). These are representative hosted prices from providers like Groq, Together AI, Fireworks, and Alibaba Cloud.

| Model | Hosted Input $/M | Hosted Output $/M | Context | License |
|---|---|---|---|---|
| Gemma 3 27B | ~$0.20 | ~$0.20 | 128K | Google |
| Llama 4 Scout (109B/17B active) | ~$0.10 | ~$0.40 | 10M | Meta |
| Qwen3-32B | ~$0.15 | ~$0.60 | 128K | Apache 2.0 |
| Llama 4 Maverick | ~$0.15 | ~$0.60 | 1M | Meta |
| Mistral 7B | ~$0.25 | ~$0.75 | 128K | Apache 2.0 |
| GPT-OSS 120B | ~$0.90 | ~$0.90 | 128K | Apache 2.0 |
| Devstral 2 (123B) | ~$1.00 | ~$1.00 | 256K | Apache 2.0 |
| DeepSeek R1 (hosted) | $0.55 | $2.19 | 128K | MIT |

Embeddings and other modalities

| Model | Price | Unit | Notes |
|---|---|---|---|
| OpenAI text-embedding-3-small | $0.02/M tokens | tokens | |
| OpenAI text-embedding-3-large | $0.13/M tokens | tokens | |
| Gemini Embedding 2 | Free (AI Studio) | tokens | Multimodal (text, image, video, audio) |
| Cohere Embed v3 | $0.10/M tokens | tokens | Multilingual |
| Whisper (OpenAI) | $0.006/min | audio minutes | |
| Whisper (Groq, free tier) | $0.00 | audio minutes | 8 hrs/day limit |
| OpenAI TTS | $15.00/M chars | characters | |
| OpenAI TTS HD | $30.00/M chars | characters | |
| GPT Image 2 | $0.02–$0.17 | per image | Varies by quality/size |

The hidden cost multipliers

The pricing table above is the starting point, not the answer. Several factors multiply your actual spend beyond the per-token rate.

1. Output tokens cost 2–6x more than input tokens

Every major provider charges more for output than input. The ratio varies:

| Model | Typical Output:Input Ratio |
|---|---|
| DeepSeek V4-Flash | 2:1 |
| OpenAI GPT-5.4 | 4:1 |
| Claude Sonnet 4.6 | 5:1 |
| Claude Opus 4.7 | 5:1 |
| GPT-5.5 | 6:1 |

This means prompt engineering that reduces output length has an outsized cost impact. A system prompt that gets the model to return 200 tokens instead of 500 tokens saves more than a system prompt that reduces input by the same 300 tokens.

2. Context length pricing tiers

Google charges different rates based on how much of the context window you use. Gemini 3.1 Pro costs $2/$12 per M under 200K tokens, but $4/$18 above 200K. That is a 100% increase on input and a 50% increase on output for long-context workloads. Other providers currently charge a flat rate regardless of context usage, but this could change.

3. Thinking token overhead

Extended thinking is not free. A request to Claude Opus 4.7 that triggers 20,000 thinking tokens at $25/M output adds $0.50 to that single request before the actual response tokens. For complex reasoning tasks, thinking tokens can exceed response tokens by 10–50x.

Anthropic’s budget_tokens parameter lets you cap thinking tokens per request. OpenAI’s reasoning_effort parameter (low/medium/high) on o3/o4-mini controls thinking depth. Use these to bound costs on reasoning-heavy workloads.
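Both knobs are a couple of lines of code. A minimal sketch against the current Anthropic and OpenAI Python SDKs — the model ID strings are placeholders for the models discussed in this post, and the parameter names reflect today's documented APIs:

```python
import anthropic
from openai import OpenAI

# Anthropic: cap extended thinking with budget_tokens.
# Note: max_tokens must exceed the thinking budget, since thinking counts toward it.
claude = anthropic.Anthropic()
resp = claude.messages.create(
    model="claude-opus-4-7",          # placeholder model ID
    max_tokens=12_000,
    thinking={"type": "enabled", "budget_tokens": 8_000},
    messages=[{"role": "user", "content": "Prove the claim and show your reasoning."}],
)

# OpenAI: dial thinking depth down with reasoning_effort on o-series models.
openai_client = OpenAI()
resp = openai_client.chat.completions.create(
    model="o4-mini",
    reasoning_effort="low",           # low / medium / high
    messages=[{"role": "user", "content": "Summarize this design doc in 5 bullets."}],
)
```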

4. Cache write costs

Anthropic charges a 25% premium on tokens written to cache vs. standard input. Claude Opus 4.7: $6.25/M for cache writes vs. $5.00/M for standard input. The write premium ($1.25/M) is recovered on the first cache read, which costs $0.50/M instead of $5.00/M, so any cached prefix used at least twice within the TTL is net positive. Single-use prompts should skip caching.

5. Image and multimodal token costs

Images are converted to tokens before processing. A single high-resolution image can consume 1,000–5,000+ tokens depending on the model and resolution settings. For vision-heavy workloads (document OCR, screenshot analysis, diagram interpretation), image token costs can dominate your bill. Resize images to the minimum resolution needed before sending them.
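A minimal pre-processing sketch with Pillow, assuming a 1,024-pixel longest side is enough for your task (the right ceiling depends on your provider's image-tiling rules):

```python
import base64
from io import BytesIO
from PIL import Image

def downscale_for_vision(path: str, max_side: int = 1024, quality: int = 85) -> str:
    """Shrink an image so its longest side is <= max_side, return base64-encoded JPEG."""
    img = Image.open(path)
    img.thumbnail((max_side, max_side))            # preserves aspect ratio, never upscales
    buf = BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    return base64.b64encode(buf.getvalue()).decode("ascii")
```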


Prompt caching: how it works, what it saves

Prompt caching stores the processed representation of input tokens on the provider side so subsequent requests with the same prefix skip re-processing. The savings are substantial.

Provider comparison

| Provider | Cache Read Discount | Cache Write Premium | Min Cacheable | TTL | Rate Limit Benefit |
|---|---|---|---|---|---|
| Anthropic | 90% off input | 25% over input | 1,024 tokens | 5 min (auto-extends) | Cached reads don't count toward ITPM |
| OpenAI | 50% off input | None (automatic) | Varies | Automatic | No rate limit benefit |
| Google | 75% off input | None (usage-based) | 4,096 tokens | Configurable | No rate limit benefit |
| DeepSeek | 80–90% off input | None | Automatic | Varies | No rate limit benefit |

Anthropic’s caching has the most generous rate limit interaction: cached read tokens are excluded from ITPM (input tokens per minute) limits entirely. With 80% of input tokens served from cache, a 2M ITPM limit effectively handles 10M total input tokens per minute. This is a throughput multiplier, not just a cost reduction.

When caching pays off

[Decision flow: is the prefix larger than 1,024 tokens, and will it be reused within the cache TTL? Yes to both → cache it. Otherwise → skip.]

The break-even math for Anthropic: cache writes cost 25% more than standard input, and cache reads cost 90% less. A single cache read saves far more than the write premium, so the cache pays for itself on its first reuse; everything beyond that is pure savings. For a system prompt reused across thousands of requests, the savings approach 90% on that portion of every request.
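The same arithmetic as a reusable check — the rates are the Claude Sonnet 4.6 figures from the pricing table above, and the function assumes one cache write followed by reads within the TTL:

```python
def cache_savings(prefix_tokens: int, uses: int, input_per_m: float,
                  read_per_m: float, write_premium: float = 0.25) -> float:
    """Dollar savings from caching a prefix used `uses` times vs. not caching it."""
    m = prefix_tokens / 1_000_000
    uncached = uses * m * input_per_m
    cached = m * input_per_m * (1 + write_premium) + (uses - 1) * m * read_per_m
    return uncached - cached

# Claude Sonnet 4.6: $3.00/M input, $0.30/M cached read
print(cache_savings(5_000, uses=2, input_per_m=3.00, read_per_m=0.30))      # ~$0.01 — already positive
print(cache_savings(5_000, uses=1_000, input_per_m=3.00, read_per_m=0.30))  # ~$13.48
```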

Practical caching patterns

RAG with stable context: Cache the system prompt + retrieved documents as a prefix. Each follow-up query appends to the cached prefix. Works well when the same document set is queried multiple times within the TTL window.

Multi-turn conversations: Cache the conversation history prefix. Each new turn extends the cache. Anthropic’s 5-minute auto-extending TTL means active conversations stay cached indefinitely.

Batch evaluation: Cache the evaluation rubric + few-shot examples. Run hundreds of evaluations against the cached prefix. The rubric tokens (often 2,000–5,000) are paid once at write price, then read at 90% discount for every subsequent evaluation.
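For the batch-evaluation pattern, the wiring with Anthropic's SDK looks roughly like the sketch below — the model ID and rubric file are placeholders; `cache_control` on the system block is the documented caching mechanism:

```python
import anthropic

client = anthropic.Anthropic()
RUBRIC = open("rubric_and_examples.md").read()   # stable 2,000–5,000 token prefix

def evaluate(sample: str) -> str:
    resp = client.messages.create(
        model="claude-sonnet-4-6",               # placeholder model ID
        max_tokens=512,
        system=[{
            "type": "text",
            "text": RUBRIC,
            # First call in each TTL window pays the write premium;
            # every later call reads this prefix at the cached rate.
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": sample}],
    )
    return resp.content[0].text
```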


Batch APIs: the easiest 50% discount

Both OpenAI and Anthropic offer batch processing with lower pricing and higher throughput limits.

| Provider | Batch Discount | Turnaround | Queue Limits |
|---|---|---|---|
| OpenAI | 50% off standard pricing | Up to 24 hours | 3x–100x higher than real-time TPM |
| Anthropic | 50% off standard pricing | Up to 24 hours | Higher than real-time |
| Google | 50% off standard pricing | Up to 24 hours | Available for Gemini 2.5+ models |

OpenAI’s batch queue limits are notably generous: GPT-5.5 gets 1.5M batch tokens at Tier 1 vs. 500K real-time TPM. At Tier 5, the batch queue reaches 15 billion tokens.

Workloads that fit batch processing: content generation pipelines, bulk classification, evaluation suites, data extraction from document archives, nightly report generation, synthetic data creation. Any workload where the output is consumed minutes or hours after generation rather than in real time.
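The mechanics on OpenAI's side: write one request per line to a JSONL file, upload it, and create a batch job. A sketch — `load_documents()` and the model ID are placeholders:

```python
import json
from openai import OpenAI

client = OpenAI()

# 1. One fully formed Chat Completions request per JSONL line.
with open("batch_input.jsonl", "w") as f:
    for i, doc in enumerate(load_documents()):            # placeholder loader
        f.write(json.dumps({
            "custom_id": f"doc-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-5.4-mini",                   # placeholder model ID
                "messages": [{"role": "user", "content": f"Classify: {doc}"}],
                "max_tokens": 20,
            },
        }) + "\n")

# 2. Upload and enqueue; results arrive within the 24h completion window.
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
# Poll client.batches.retrieve(job.id) until status == "completed",
# then download the output file for per-request results.
```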


Model routing: the architecture that pays for itself

The most impactful cost optimization is not using one model for everything. A routing layer that sends requests to different models based on complexity can cut spend 60–85%.

[Routing diagram: request → classifier → 70% simple to Haiku, 25% medium to Sonnet, 5% complex to Opus.]

Cost comparison: routed vs. single-model

Assume 100,000 requests/day, average 500 output tokens per request.

Single model (Claude Sonnet 4.6 for everything):

  • 100K requests × 500 tokens = 50M output tokens/day
  • 50M × $15/M = $750/day

Three-tier routing (70/25/5 split):

  • 70K requests × 500 tokens × $5/M (Haiku 4.5) = $175
  • 25K requests × 500 tokens × $15/M (Sonnet 4.6) = $187.50
  • 5K requests × 500 tokens × $25/M (Opus 4.7) = $62.50
  • Total: $425/day (43% savings)

The classifier itself costs almost nothing. Running Haiku 4.5 with a cached classification prompt ($0.10/M cached input) or GPT-5.4 nano ($0.20/M input) on a ~100-token query works out to 10M input tokens for 100K daily classifications: roughly $1–2/day.

The savings compound with caching. If the routing layer also implements prompt caching on the medium and complex tiers, effective savings can reach 70–80%.
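In code, a minimal version of the routing layer looks like the sketch below. The tier labels, prompts, and model IDs are illustrative; a production router would add caching, confidence thresholds, and fallbacks:

```python
import anthropic

client = anthropic.Anthropic()

MODELS = {
    "simple": "claude-haiku-4-5",     # placeholder model IDs
    "medium": "claude-sonnet-4-6",
    "complex": "claude-opus-4-7",
}

def classify(query: str) -> str:
    """Cheap complexity classifier on the smallest model."""
    resp = client.messages.create(
        model=MODELS["simple"],
        max_tokens=5,
        system="Label the user request with exactly one word: simple, medium, or complex.",
        messages=[{"role": "user", "content": query}],
    )
    label = resp.content[0].text.strip().lower()
    return label if label in MODELS else "medium"   # fail toward the middle tier

def answer(query: str) -> str:
    resp = client.messages.create(
        model=MODELS[classify(query)],
        max_tokens=1024,
        messages=[{"role": "user", "content": query}],
    )
    return resp.content[0].text
```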


Token efficiency: not all tokens are equal

The same task produces different token counts on different models. This is the most underappreciated cost variable.

GPT-5.5’s efficiency claim

OpenAI reports GPT-5.5 is ~40% more token-efficient than GPT-5.4 on coding tasks. Per-token prices are 2–3x higher ($5/$30 vs. $2.50/$10), but if the model uses 40% fewer tokens to produce equivalent output, the effective per-task cost on output-heavy work rises by roughly 80% (3 × 0.6), not the 200% the sticker price implies.

This claim is plausible for code generation (where a more capable model produces tighter code in fewer iterations) but unlikely to hold for simple summarization or classification tasks where output length is fixed. Evaluate token efficiency on your specific workload before assuming the headline number applies.

Tokenizer differences matter

Different providers use different tokenizers. The same English sentence might be 15 tokens on OpenAI’s tiktoken (cl100k_base), 18 tokens on Anthropic’s tokenizer, and 12 tokens on a model using a larger vocabulary. For pricing comparisons, convert to a common unit (characters or words) rather than comparing raw per-token prices across providers with different tokenizers.

A rough heuristic: 1 token ≈ 4 characters ≈ 0.75 words in English. But this varies by language (CJK characters often consume 1–2 tokens each) and by content type (code is tokenized differently than prose).
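To see how far the 4-characters-per-token heuristic drifts on your own data, count with a real tokenizer. A sketch using OpenAI's tiktoken (Anthropic and Google expose token counting through API endpoints rather than a local library):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # one OpenAI tokenizer; other vendors differ

def token_stats(text: str) -> dict:
    n = len(enc.encode(text))
    return {
        "tokens": n,
        "chars_per_token": round(len(text) / n, 2),
        "words_per_token": round(len(text.split()) / n, 2),
    }

print(token_stats("The same English sentence tokenizes differently across providers."))
```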

Prompt engineering for token efficiency

Small changes to prompts can produce large token savings at scale:

  • “Respond in JSON” vs. natural language: structured responses are 30–60% shorter.
  • “Be concise” in the system prompt: models tend to over-explain. An explicit brevity instruction reduces output tokens by 20–40% on average.
  • Few-shot examples: Adding 2–3 examples of desired output length calibrates the model’s verbosity. This costs a few hundred extra input tokens but saves thousands of output tokens across the batch.
  • Max tokens parameter: Set a hard cap. Prevents runaway generation on edge cases where the model would otherwise produce 10x the expected output.

Structured outputs: fewer tokens, same information

JSON mode and structured outputs (now GA on OpenAI, Anthropic, and Google) constrain the model to return data in a predefined schema. This is both a reliability and a cost optimization.

A sentiment classification task:

Unstructured response (typical): “The sentiment of this review is positive. The customer expresses satisfaction with the product quality and delivery speed, though they note minor concerns about packaging.” (~35 tokens)

Structured response: {"sentiment": "positive", "confidence": 0.92} (~12 tokens)

That is a 65% reduction in output tokens per request. At $15/M output (Sonnet 4.6), across 1M daily classifications, the savings are 23 tokens × 1M requests = 23M output tokens, or roughly $345/day.

For any pipeline where the downstream consumer is code (not a human reading prose), structured outputs should be the default.
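A sketch of the structured path using the OpenAI SDK's Pydantic integration — the model ID is a placeholder, and equivalent schema-constrained output exists on Anthropic and Google:

```python
from openai import OpenAI
from pydantic import BaseModel

class Sentiment(BaseModel):
    sentiment: str        # "positive" | "neutral" | "negative"
    confidence: float

client = OpenAI()

def classify(review: str) -> Sentiment:
    resp = client.beta.chat.completions.parse(
        model="gpt-5.4-mini",                    # placeholder model ID
        messages=[
            {"role": "system", "content": "Classify the sentiment of the review."},
            {"role": "user", "content": review},
        ],
        response_format=Sentiment,               # schema-constrained output
        max_tokens=30,                           # hard cap: nothing beyond the JSON
    )
    return resp.choices[0].message.parsed
```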


Self-hosting vs. API: when the math flips

Open-weight models (Llama 4, Qwen3, Mistral, DeepSeek, GPT-OSS) are free to run. The cost is infrastructure.

Break-even estimation

A single H100 (80GB) rents for roughly $2–3/hour on cloud providers. It can serve a 70B model (quantized to 4-bit) at approximately 30–50 tokens/second per request stream; continuous batching across concurrent requests raises aggregate throughput well above that.

Monthly H100 cost: ~$2,000. Monthly output at 40 tok/s sustained on a single stream: ~100M tokens.

Equivalent API cost for 100M output tokens:

  • Claude Haiku 4.5: ~$500
  • GPT-5.4 nano: ~$125
  • DeepSeek V4-Flash: ~$28

So a $2,000/month GPU only pays for itself at volumes far beyond a single stream. The DeepSeek V4-Flash bill reaches $2,000 at roughly 7B output tokens/month, which means sustaining about 2,700 tokens/second of aggregate throughput (heavy batching, or several cards). Against Haiku 4.5 pricing, the bill reaches $2,000 at around 400M tokens/month, a much easier bar to clear.
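The break-even arithmetic as a script, using the $2,000/month GPU figure and the output prices from the tables above:

```python
def breakeven_tokens_per_month(gpu_cost_per_month: float, api_output_per_m: float) -> float:
    """Output tokens/month at which the API bill equals the GPU bill."""
    return gpu_cost_per_month / api_output_per_m * 1_000_000

def required_tok_per_s(tokens_per_month: float) -> float:
    """Sustained generation rate needed to actually produce that volume."""
    return tokens_per_month / (30 * 86_400)

for name, out_price in [("DeepSeek V4-Flash", 0.28), ("GPT-5.4 nano", 1.25), ("Claude Haiku 4.5", 5.00)]:
    tokens = breakeven_tokens_per_month(2_000, out_price)
    print(f"{name:18s} break-even ≈ {tokens / 1e9:4.1f}B tok/month "
          f"(~{required_tok_per_s(tokens):,.0f} tok/s sustained)")
```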

When APIs still win

  • Low volume: Under ~5B tokens/month, API costs are lower than maintaining infrastructure.
  • Burst traffic: APIs absorb spikes. Self-hosted GPUs sit idle during troughs and can’t handle peaks without over-provisioning.
  • Frontier quality: No open model matches Opus 4.7 or GPT-5.5 on complex reasoning. If you need the best, APIs are the only option.
  • Compliance and support: Enterprise API tiers include SLAs, BAAs (HIPAA), data residency controls, and zero-retention options that would require additional investment to replicate on self-hosted infrastructure.

When self-hosting wins

  • High sustained volume: Above 10B+ tokens/month, the economics favor owned or reserved GPU capacity.
  • Data sovereignty: Air-gapped or on-premise requirements.
  • Fine-tuning: Self-hosted models can be fine-tuned on proprietary data. API-based fine-tuning is available from some providers but with less control.
  • Latency control: No network round-trip. Co-located inference can achieve sub-10ms TTFT.

Cost modeling a real workload

Pricing tables are abstractions. Here is a concrete example of modeling costs for a production RAG pipeline.

Scenario: customer support agent

  • 50,000 queries/day
  • Each query: 200 tokens input (user question), 2,000 tokens retrieved context, 300 tokens system prompt, 400 tokens output
  • System prompt + retrieved context are the same FAQ documents across queries, so nearly the entire 2,300-token prefix is cacheable

Option A: Claude Sonnet 4.6, no optimization

| Component | Tokens/request | Daily tokens | Cost/M | Daily cost |
|---|---|---|---|---|
| Input (system + context + query) | 2,500 | 125M | $3.00 | $375.00 |
| Output | 400 | 20M | $15.00 | $300.00 |
| **Total** | | | | **$675.00/day** |

Monthly: ~$20,250.

Option B: Claude Sonnet 4.6, with prompt caching

| Component | Tokens/request | Daily tokens | Cost/M | Daily cost |
|---|---|---|---|---|
| Cache write (system + context, first request per 5-min window) | 2,300 | ~0.7M | $3.75 | $2.63 |
| Cache read (system + context, subsequent) | 2,300 | 114.3M | $0.30 | $34.29 |
| Uncached input (user query) | 200 | 10M | $3.00 | $30.00 |
| Output | 400 | 20M | $15.00 | $300.00 |
| **Total** | | | | **$366.92/day** |

Monthly: ~$11,008. 46% savings from caching alone.

Option C: Routed, with caching

Route 70% of queries (simple FAQ-type) to Haiku 4.5, 30% (complex) to Sonnet 4.6. Both with caching.

| Component | Daily cost |
|---|---|
| Haiku tier (35K queries, cached) | ~$52.50 |
| Sonnet tier (15K queries, cached) | ~$117.00 |
| Classifier (50K queries, Haiku) | ~$0.50 |
| **Total** | **~$170.00/day** |

Monthly: ~$5,100. 75% savings vs. Option A.

Option D: Add batch processing for non-real-time queries

If 20% of queries (overnight analytics, report generation) can tolerate 24-hour turnaround:

| Component | Daily cost |
|---|---|
| Real-time routed (40K queries) | ~$136.00 |
| Batch (10K queries, 50% discount) | ~$17.00 |
| **Total** | **~$153.00/day** |

Monthly: ~$4,590. 77% savings vs. Option A.

The progression from $20,250/month to $4,590/month involves no loss of output quality and no self-hosting. It is entirely a function of caching, routing, and batching. These are infrastructure decisions, not model decisions.
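The arithmetic behind Options A and B is simple enough to keep as a script and rerun when prices change. A sketch — the 288 cache writes/day assumes one write per 5-minute TTL window, matching the Option B table:

```python
# Per-day costs in USD for the support-agent scenario above.
Q = 50_000                                   # queries/day
SYS_CTX, QUERY, OUT = 2_300, 200, 400        # tokens per request

# Claude Sonnet 4.6 rates ($/M tokens) from the pricing table
IN_RATE, OUT_RATE = 3.00, 15.00
CACHE_WRITE, CACHE_READ = 3.75, 0.30

def m(tokens: float) -> float:
    return tokens / 1e6

# Option A: no caching
option_a = m(Q * (SYS_CTX + QUERY)) * IN_RATE + m(Q * OUT) * OUT_RATE

# Option B: stable prefix cached; roughly one write per 5-minute window
writes_per_day = 288                          # 24h / 5 min
option_b = (m(writes_per_day * SYS_CTX) * CACHE_WRITE
            + m((Q - writes_per_day) * SYS_CTX) * CACHE_READ
            + m(Q * QUERY) * IN_RATE
            + m(Q * OUT) * OUT_RATE)

print(f"Option A: ${option_a:,.2f}/day")      # ≈ $675
print(f"Option B: ${option_b:,.2f}/day")      # ≈ $367
```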


Quick reference: cost per 1M output tokens

For fast comparison. Sorted cheapest to most expensive. Input costs excluded (they vary by caching and prompt design). This is what you pay for the tokens your users actually see.

| Model | Output $/M | Type |
|---|---|---|
| Mistral Nemo 12B | $0.06 | Proprietary |
| Gemini 2.5 Flash-Lite | $0.10 | Proprietary |
| DeepSeek V4-Flash | $0.28 | Proprietary |
| Llama 4 Scout (hosted) | ~$0.40 | Open-weight |
| GPT-5.4 nano | $1.25 | Proprietary |
| Gemini 3.1 Flash-Lite | $1.50 | Proprietary |
| DeepSeek R1 | $2.19 | Open-weight |
| o4-mini | $4.40 | Proprietary |
| GPT-5.4 mini | $4.50 | Proprietary |
| Claude Haiku 4.5 | $5.00 | Proprietary |
| Mistral Large 3 | $6.00 | Proprietary |
| o3 | $8.00 | Proprietary |
| GPT-5.4 | $10.00 | Proprietary |
| Gemini 2.5 Pro | $10.00 | Proprietary |
| Gemini 3.1 Pro | $12–18 | Proprietary |
| Claude Sonnet 4.6 | $15.00 | Proprietary |
| Claude Opus 4.7 | $25.00 | Proprietary |
| GPT-5.5 | $30.00 | Proprietary |

Prices verified April 25, 2026. The LLM pricing landscape shifts every 2–4 weeks. Use pricepertoken.com for live tracking across 300+ models. Always confirm against official provider documentation before production budgeting.