LLM Token Costs and Efficiency: A Practitioner's Guide (April 2026)
Beyond the pricing page. How to actually think about LLM costs: per-token pricing across 15+ providers, hidden multipliers, caching mechanics, batch discounts, model routing architectures, and what 'cost per useful output' means in production.
Per-token pricing is the number on the pricing page. It is not your actual cost. The gap between the two is where most teams either save or waste thousands of dollars a month. This post covers both: the raw numbers across every major provider, and the strategies that determine what you actually pay.
Companion post: API Rate Limits Compared (April 2026) covers RPM, TPM, and throughput limits across the same providers.
Table of Contents
- Per-token pricing across providers
- The hidden cost multipliers
- Prompt caching: how it works, what it saves
- Batch APIs: the easiest 50% discount
- Model routing: the architecture that pays for itself
- Token efficiency: not all tokens are equal
- Structured outputs: fewer tokens, same information
- Self-hosting vs. API: when the math flips
- Cost modeling a real workload
All prices verified against official provider documentation as of April 25, 2026. Prices change frequently. Confirm on provider pricing pages before budgeting.
Per-token pricing across providers
Prices in USD per million tokens. Sorted cheapest to most expensive by output cost within each category.
Proprietary flagship models
| Model | Input/M | Output/M | Cached Input/M | Context | Notes |
|---|---|---|---|---|---|
| Mistral Nemo 12B | $0.02 | $0.06 | — | 128K | |
| Gemini 2.5 Flash-Lite | $0.025 | $0.10 | — | 1M | No thinking mode |
| DeepSeek V4-Flash | $0.14 | $0.28 | $0.028 | 1M | 75% launch discount until May 5 |
| Gemini 2.5 Flash | $0.15 | $0.60 | $0.0375 | 1M | Thinking tokens: $0.60/M in, $2.50/M out |
| DeepSeek V4-Pro | $0.435 | $0.87 | $0.036 | 1M | 75% launch discount until May 5 |
| GPT-5.4 nano | $0.20 | $1.25 | $0.10 | 128K | |
| Gemini 3.1 Flash-Lite | $0.25 | $1.50 | — | 1M | Preview |
| GPT-5.4 mini | $0.75 | $4.50 | $0.375 | 128K | |
| Claude Haiku 4.5 | $1.00 | $5.00 | $0.10 | 200K | Cache write: $1.25/M |
| Mistral Large 3 | $2.00 | $6.00 | — | 128K | |
| GPT-5.4 | $2.50 | $10.00 | $1.25 | 1M | |
| Gemini 2.5 Pro | $1.25 | $10.00 | $0.3125 | 1M | Thinking tokens: $1.25/M in, $10/M out |
| Gemini 3.1 Pro | $2–4 | $12–18 | — | 1M | $2/$12 under 200K; $4/$18 above |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $0.30 | 1M | Cache write: $3.75/M |
| Claude Opus 4.7 | $5.00 | $25.00 | $0.50 | 1M | Cache write: $6.25/M |
| Claude Opus 4.6 | $5.00 | $25.00 | $0.50 | 1M | Cache write: $6.25/M |
| GPT-5.5 | $5.00 | $30.00 | $2.50 | 1M | ~40% more token-efficient per task |
| GPT-5.5 Pro | Contact sales | — | — | — | Enterprise only |
Reasoning/thinking models
| Model | Input/M | Output/M | Context | Notes |
|---|---|---|---|---|
| DeepSeek R1 | $0.55 | $2.19 | 128K | Cache hits: $0.14/M |
| o4-mini | $1.10 | $4.40 | 200K | Thinking tokens: $1.10/M |
| o3 | $2.00 | $8.00 | 200K | Thinking tokens: $2.00/M |
Thinking/reasoning tokens deserve attention. When a model uses extended thinking, you pay for the thinking tokens too. A complex math problem that triggers 10,000 thinking tokens before generating a 500-token answer bills 10,500 output tokens, not 500. Anthropic and OpenAI bill thinking tokens at the same rate as standard output tokens; Google prices thinking separately on some models (per the table above, Gemini 2.5 Flash bills thinking output at $2.50/M versus $0.60/M standard).
Open-weight models (hosted pricing)
Self-hosting these models carries no per-token cost; the cost is infrastructure (covered in the self-hosting section below). The figures here are representative hosted prices from providers like Groq, Together AI, Fireworks, and Alibaba Cloud.
| Model | Hosted Input/M | Hosted Output/M | Context | License |
|---|---|---|---|---|
| Gemma 3 27B | ~$0.20 | ~$0.20 | 128K | Gemma |
| Llama 4 Scout (109B/17B active) | ~$0.10 | ~$0.40 | 10M | Meta |
| Qwen3-32B | ~$0.15 | ~$0.60 | 128K | Apache 2.0 |
| Llama 4 Maverick | ~$0.15 | ~$0.60 | 1M | Meta |
| Mistral 7B | ~$0.25 | ~$0.75 | 128K | Apache 2.0 |
| GPT-OSS 120B | ~$0.90 | ~$0.90 | 128K | Apache 2.0 |
| Devstral 2 (123B) | ~$1.00 | ~$1.00 | 256K | Apache 2.0 |
| DeepSeek R1 (hosted) | $0.55 | $2.19 | 128K | MIT |
Embeddings and other modalities
| Model | Price | Unit | Notes |
|---|---|---|---|
| OpenAI text-embedding-3-small | $0.02/M tokens | tokens | |
| OpenAI text-embedding-3-large | $0.13/M tokens | tokens | |
| Gemini Embedding 2 | Free (AI Studio) | tokens | Multimodal (text, image, video, audio) |
| Cohere Embed v3 | $0.10/M tokens | tokens | Multilingual |
| Whisper (OpenAI) | $0.006/min | audio minutes | |
| Whisper (Groq, free tier) | $0.00 | audio minutes | 8 hrs/day limit |
| OpenAI TTS | $15.00/M chars | characters | |
| OpenAI TTS HD | $30.00/M chars | characters | |
| GPT Image 2 | $0.02–$0.17 | per image | Varies by quality/size |
The hidden cost multipliers
The pricing table above is the starting point, not the answer. Several factors multiply your actual spend beyond the per-token rate.
1. Output tokens cost 2–6x more than input tokens
Every major provider charges more for output than input. The ratio varies:
| Model | Typical Output:Input Price Ratio |
|---|---|
| DeepSeek V4-Flash | 2:1 |
| OpenAI GPT-5.4 | 4:1 |
| Claude Sonnet 4.6 | 5:1 |
| GPT-5.5 | 6:1 |
| Claude Opus 4.7 | 5:1 |
This means prompt engineering that reduces output length has an outsized cost impact. A system prompt that gets the model to return 200 tokens instead of 500 tokens saves more than a system prompt that reduces input by the same 300 tokens.
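To make the asymmetry concrete, here is a minimal sketch of the per-request math at Claude Sonnet 4.6 list prices from the table above ($3/M input, $15/M output); the token counts are illustrative:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float = 3.00, output_per_m: float = 15.00) -> float:
    """Cost in USD for one request at per-million-token rates."""
    return input_tokens / 1e6 * input_per_m + output_tokens / 1e6 * output_per_m

base = request_cost(2_500, 500)
trim_output = base - request_cost(2_500, 200)   # 300 fewer output tokens
trim_input = base - request_cost(2_200, 500)    # 300 fewer input tokens

print(f"saved by trimming 300 output tokens: ${trim_output:.4f}")  # $0.0045
print(f"saved by trimming 300 input tokens:  ${trim_input:.4f}")   # $0.0009
```

Same 300 tokens removed, but the output trim saves 5x more per request at a 5:1 price ratio.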
2. Context length pricing tiers
Google charges different rates based on how much of the context window you use. Gemini 3.1 Pro costs $2/$12 per M under 200K tokens, but $4/$18 above 200K: input doubles and output rises 50% for long-context requests. Other providers currently charge a flat rate regardless of context usage, but this could change.
3. Thinking token overhead
Extended thinking is not free. A request to Claude Opus 4.7 that triggers 20,000 thinking tokens at $25/M output adds $0.50 to that single request before the actual response tokens. For complex reasoning tasks, thinking tokens can exceed response tokens by 10–50x.
Anthropic’s budget_tokens parameter lets you cap thinking tokens per request. OpenAI’s reasoning_effort parameter (low/medium/high) on o3/o4-mini controls thinking depth. Use these to bound costs on reasoning-heavy workloads.
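Here is a sketch of what those caps look like as request payloads. No network call is made; the model IDs are placeholders for this post's hypothetical models, and the parameter shapes follow Anthropic's Messages API (`thinking.budget_tokens`) and OpenAI's Chat Completions API (`reasoning_effort`) as of this writing. Verify against current SDK docs before relying on them:

```python
anthropic_request = {
    "model": "claude-opus-4-7",          # hypothetical model ID
    "max_tokens": 2_048,
    "thinking": {"type": "enabled", "budget_tokens": 8_000},  # hard thinking cap
    "messages": [{"role": "user", "content": "Prove the claim step by step."}],
}

openai_request = {
    "model": "o4-mini",
    "reasoning_effort": "low",           # low | medium | high
    "messages": [{"role": "user", "content": "Prove the claim step by step."}],
}

# Worst-case thinking spend for the Anthropic request at $25/M output:
max_thinking_cost = anthropic_request["thinking"]["budget_tokens"] / 1e6 * 25.00
print(f"thinking cost ceiling: ${max_thinking_cost:.2f}")  # $0.20 per request
```

The budget turns an unbounded cost (thinking tokens can be 10–50x the response) into a known per-request ceiling.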
4. Cache write costs
Anthropic charges a 25% premium on tokens written to cache vs. standard input. Claude Opus 4.7: $6.25/M for cache writes vs. $5.00/M for standard input. The premium is recovered on the first cache read: a read costs $0.50/M instead of $5.00/M, a $4.50/M saving that dwarfs the $1.25/M write premium. Any cached prefix used at least twice (one write plus one read) is net positive. Single-use prompts should skip caching.
5. Image and multimodal token costs
Images are converted to tokens before processing. A single high-resolution image can consume 1,000–5,000+ tokens depending on the model and resolution settings. For vision-heavy workloads (document OCR, screenshot analysis, diagram interpretation), image token costs can dominate your bill. Resize images to the minimum resolution needed before sending them.
Prompt caching: how it works, what it saves
Prompt caching stores the processed representation of input tokens on the provider side so subsequent requests with the same prefix skip re-processing. The savings are substantial.
Provider comparison
| Provider | Cache Read Discount | Cache Write Premium | Min Cacheable | TTL | Rate Limit Benefit |
|---|---|---|---|---|---|
| Anthropic | 90% off input | 25% over input | 1,024 tokens | 5 min (auto-extends) | Cached reads don’t count toward ITPM |
| OpenAI | 50% off input | None (automatic) | Varies | Automatic | No rate limit benefit |
| Google | 75% off input | None (usage-based) | 4,096 tokens | Configurable | No rate limit benefit |
| DeepSeek | 80–90% off input | None | Automatic | Varies | No rate limit benefit |
Anthropic’s caching has the most generous rate limit interaction: cached read tokens are excluded from ITPM limits entirely. With 80% cache hits, a 2M ITPM limit handles 10M total input tokens per minute. This is a throughput multiplier, not just a cost reduction.
When caching pays off
The break-even math for Anthropic: a cache write costs 25% more than standard input, and a cache read costs 90% less. A single cache read more than recovers the write premium (the premium is 0.25x the input rate; each read saves 0.90x), so everything from the first reuse onward is savings. For a system prompt reused across thousands of requests, the savings approach 90% on that portion of every request.
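The break-even can be sketched directly from the two multipliers, using Sonnet 4.6's $3/M input rate from the table:

```python
input_rate = 3.00                 # $/M standard input
write_rate = input_rate * 1.25    # $3.75/M cache write (25% premium)
read_rate = input_rate * 0.10     # $0.30/M cache read (90% discount)

def cost_per_m(uses: int, cached: bool) -> float:
    """$/M for a prefix processed `uses` times: 1 write + (uses-1) reads if cached."""
    if not cached:
        return uses * input_rate
    return write_rate + (uses - 1) * read_rate

for uses in (1, 2, 5, 100):
    print(uses, cost_per_m(uses, cached=True), cost_per_m(uses, cached=False))
```

At one use caching loses ($3.75 vs $3.00); at two uses it already wins ($4.05 vs $6.00); at a hundred uses the cached path costs ~11% of the uncached one.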
Practical caching patterns
RAG with stable context: Cache the system prompt + retrieved documents as a prefix. Each follow-up query appends to the cached prefix. Works well when the same document set is queried multiple times within the TTL window.
Multi-turn conversations: Cache the conversation history prefix. Each new turn extends the cache. Anthropic’s 5-minute auto-extending TTL means active conversations stay cached indefinitely.
Batch evaluation: Cache the evaluation rubric + few-shot examples. Run hundreds of evaluations against the cached prefix. The rubric tokens (often 2,000–5,000) are paid once at write price, then read at 90% discount for every subsequent evaluation.
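The batch-evaluation pattern looks like this as a request payload. This is a payload-only sketch (no API call); the rubric text and model ID are placeholders, and the `cache_control: {"type": "ephemeral"}` marker follows Anthropic's prompt-caching API:

```python
RUBRIC = "Score the answer 1-5 for accuracy, grounding, and tone. (placeholder rubric)"

def eval_request(candidate_answer: str) -> dict:
    """One evaluation request; the system block is the cacheable prefix."""
    return {
        "model": "claude-haiku-4-5",     # hypothetical model ID
        "max_tokens": 256,
        "system": [
            {
                "type": "text",
                "text": RUBRIC,
                "cache_control": {"type": "ephemeral"},  # cacheable prefix ends here
            }
        ],
        "messages": [{"role": "user", "content": candidate_answer}],
    }

reqs = [eval_request(ans) for ans in ("answer A", "answer B")]
# Identical system prefix on every request: written once, read thereafter.
assert reqs[0]["system"] == reqs[1]["system"]
```

Only the user turn varies between requests, so the rubric tokens hit the cache on every evaluation after the first.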
Batch APIs: the easiest 50% discount
Both OpenAI and Anthropic offer batch processing with lower pricing and higher throughput limits.
| Provider | Batch Discount | Turnaround | Queue Limits |
|---|---|---|---|
| OpenAI | 50% off standard pricing | Up to 24 hours | 3x–100x higher than real-time TPM |
| Anthropic | 50% off standard pricing | Up to 24 hours | Higher than real-time |
| Google | 50% off standard pricing | Up to 24 hours | Available for Gemini 2.5+ models |
OpenAI’s batch queue limits are notably generous: GPT-5.5 gets 1.5M batch tokens at Tier 1 vs. 500K real-time TPM. At Tier 5, the batch queue reaches 15 billion tokens.
Workloads that fit batch processing: content generation pipelines, bulk classification, evaluation suites, data extraction from document archives, nightly report generation, synthetic data creation. Any workload where the output is consumed minutes or hours after generation rather than in real time.
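For OpenAI's Batch API, the unit of work is a JSONL file where each line is a request envelope. A minimal sketch of building one (the model ID, document IDs, and texts are placeholders; the envelope fields follow the Batch API format):

```python
import json

docs = {"doc-1": "Q4 revenue grew 12% (placeholder)", "doc-2": "Churn fell to 3% (placeholder)"}

lines = []
for doc_id, text in docs.items():
    lines.append(json.dumps({
        "custom_id": doc_id,                 # ties each result back to its input
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-5.4-mini",         # hypothetical model ID
            "messages": [{"role": "user", "content": f"Summarize: {text}"}],
            "max_tokens": 200,
        },
    }))

batch_input = "\n".join(lines)               # write this to a .jsonl file and upload
print(batch_input.count("\n") + 1)           # number of requests queued
```

Upload the file, create the batch referencing it, and collect results within the turnaround window at half price.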
Model routing: the architecture that pays for itself
The most impactful cost optimization is not using one model for everything. A routing layer that sends requests to different models based on complexity can cut spend by 40% or more on its own, and 70–80% once combined with caching.
Cost comparison: routed vs. single-model
Assume 100,000 requests/day, average 500 output tokens per request.
Single model (Claude Sonnet 4.6 for everything):
- 100K requests × 500 tokens = 50M output tokens/day
- 50M × $15/M = $750/day
Three-tier routing (70/25/5 split):
- 70K requests × 500 tokens × $5/M (Haiku 4.5) = $175
- 25K requests × 500 tokens × $15/M (Sonnet 4.6) = $187.50
- 5K requests × 500 tokens × $25/M (Opus 4.7) = $62.50
- Total: $425/day (43% savings)
The classifier itself costs little. Running GPT-5.4 nano ($0.20/M input) to classify intent on a ~100-token input works out to 10M input tokens across 100K daily classifications: about $2/day, plus a few cents of output. A rounding error next to the $425 main spend.
The savings compound with caching. If the routing layer also implements prompt caching on the medium and complex tiers, effective savings can reach 70–80%.
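The three-tier math above reduces to a few lines. A sketch using the output rates from the pricing table (output cost only, matching the worked example):

```python
TIERS = {                    # model: (share of traffic, $/M output)
    "haiku-4.5": (0.70, 5.00),
    "sonnet-4.6": (0.25, 15.00),
    "opus-4.7": (0.05, 25.00),
}

def daily_output_cost(requests: int, avg_output_tokens: int) -> float:
    """Blended daily output cost across the routing tiers."""
    total = 0.0
    for share, rate in TIERS.values():
        tokens = requests * share * avg_output_tokens
        total += tokens / 1e6 * rate
    return total

routed = daily_output_cost(100_000, 500)     # $425/day
single = 100_000 * 500 / 1e6 * 15.00         # $750/day, Sonnet for everything
print(routed, single, f"{1 - routed / single:.0%} savings")
```

Adjusting the shares is how you tune the cost/quality trade-off: shifting 10 points of traffic from Sonnet to Haiku drops the blended rate by $1/M output.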
Token efficiency: not all tokens are equal
The same task produces different token counts on different models. This is the most underappreciated cost variable.
GPT-5.5’s efficiency claim
OpenAI reports GPT-5.5 is ~40% more token-efficient than GPT-5.4 on coding tasks. Per-token price doubled ($5/$30 vs. $2.50/$15), but if the model uses 40% fewer tokens to produce equivalent output, the effective per-task cost is ~20% higher, not 100% higher.
This claim is plausible for code generation (where a more capable model produces tighter code in fewer iterations) but unlikely to hold for simple summarization or classification tasks where output length is fixed. Evaluate token efficiency on your specific workload before assuming the headline number applies.
Tokenizer differences matter
Different providers use different tokenizers. The same English sentence might be 15 tokens on OpenAI’s tiktoken (cl100k_base), 18 tokens on Anthropic’s tokenizer, and 12 tokens on a model using a larger vocabulary. For pricing comparisons, convert to a common unit (characters or words) rather than comparing raw per-token prices across providers with different tokenizers.
A rough heuristic: 1 token ≈ 4 characters ≈ 0.75 words in English. But this varies by language (CJK characters often consume 1–2 tokens each) and by content type (code is tokenized differently than prose).
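The heuristic is enough for budget estimates (use the provider's actual tokenizer for billing-accurate counts). A minimal sketch:

```python
def rough_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Estimate token count from the ~4 chars/token English heuristic."""
    return max(1, round(len(text) / chars_per_token))

sentence = "Prompt caching stores the processed prefix on the provider side."
print(rough_tokens(sentence))   # ~16 tokens for this 64-character sentence
```

For CJK text or code, swap in a smaller `chars_per_token` value, and treat all results as order-of-magnitude estimates.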
Prompt engineering for token efficiency
Small changes to prompts can produce large token savings at scale:
- “Respond in JSON” vs. natural language: structured responses are 30–60% shorter.
- “Be concise” in the system prompt: models tend to over-explain. An explicit brevity instruction reduces output tokens by 20–40% on average.
- Few-shot examples: Adding 2–3 examples of desired output length calibrates the model’s verbosity. This costs a few hundred extra input tokens but saves thousands of output tokens across the batch.
- Max tokens parameter: Set a hard cap. Prevents runaway generation on edge cases where the model would otherwise produce 10x the expected output.
Structured outputs: fewer tokens, same information
JSON mode and structured outputs (now GA on OpenAI, Anthropic, and Google) constrain the model to return data in a predefined schema. This is both a reliability and a cost optimization.
A sentiment classification task:
Unstructured response (typical): “The sentiment of this review is positive. The customer expresses satisfaction with the product quality and delivery speed, though they note minor concerns about packaging.” (~35 tokens)
Structured response: {"sentiment": "positive", "confidence": 0.92} (~12 tokens)
That is a 65% reduction in output tokens per request. At $15/M output (Sonnet 4.6), 1M daily classifications each save 23 tokens: 23M tokens × $15/M = $345/day.
For any pipeline where the downstream consumer is code (not a human reading prose), structured outputs should be the default.
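A sketch of the sentiment task as a strict-schema request, shaped like OpenAI's structured-outputs request body (`response_format` with `json_schema`); the model ID is a placeholder and field names should be verified against current API docs. The schema itself is standard JSON Schema:

```python
sentiment_schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["sentiment", "confidence"],
    "additionalProperties": False,     # required for strict mode
}

request_body = {
    "model": "gpt-5.4-mini",           # hypothetical model ID
    "messages": [{"role": "user", "content": "Classify: 'Great product, slow box.'"}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "sentiment", "strict": True, "schema": sentiment_schema},
    },
}

# Savings math from the example above: 23 fewer output tokens per request.
daily_savings = (35 - 12) * 1_000_000 / 1e6 * 15.00
print(f"${daily_savings:.0f}/day at 1M requests")
```

The `enum` constraint does double duty: it caps output length and eliminates downstream parsing of free-text labels.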
Self-hosting vs. API: when the math flips
Open-weight models (Llama 4, Qwen3, Mistral, DeepSeek, GPT-OSS) are free to run. The cost is infrastructure.
Break-even estimation
A single H100 (80GB) rents for roughly $2–3/hour on cloud providers, or about $2,000/month. It can serve a 70B model (quantized to 4-bit) at approximately 30–50 tokens/second per stream; with continuous batching across concurrent requests, aggregate throughput on the order of 1,000 tokens/second is a reasonable planning figure, though it varies widely with serving stack and request mix.
- Monthly H100 cost: ~$2,000
- Monthly output at ~1,000 tok/s sustained aggregate: ~2.5B tokens
Equivalent API cost for 2.5B output tokens:
- Claude Haiku 4.5: $12,500
- GPT-5.4 nano: $3,125
- DeepSeek V4-Flash: $700
Self-hosting a 70B open model at $2,000/month breaks even against DeepSeek V4-Flash at roughly 7B output tokens/month. Against Haiku 4.5, it breaks even at under 500M tokens/month.
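The break-even volumes follow from one division: monthly GPU cost over the API's output rate. A minimal sketch:

```python
def breakeven_tokens_m(monthly_gpu_cost: float, api_rate_per_m: float) -> float:
    """Millions of output tokens/month at which self-hosting matches the API."""
    return monthly_gpu_cost / api_rate_per_m

print(breakeven_tokens_m(2_000, 0.28))   # vs DeepSeek V4-Flash: ~7,143M (~7.1B)
print(breakeven_tokens_m(2_000, 5.00))   # vs Claude Haiku 4.5: 400M
```

The spread is the whole story: against a cheap API the GPU must run near saturation to pay off; against a mid-tier API it breaks even at a fraction of its capacity.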
When APIs still win
- Low volume: Below roughly 0.5–5B output tokens/month (depending on which API model you would otherwise use), API costs are lower than maintaining infrastructure.
- Burst traffic: APIs absorb spikes. Self-hosted GPUs sit idle during troughs and can’t handle peaks without over-provisioning.
- Frontier quality: No open model matches Opus 4.7 or GPT-5.5 on complex reasoning. If you need the best, APIs are the only option.
- Compliance and support: Enterprise API tiers include SLAs, BAAs (HIPAA), data residency controls, and zero-retention options that would require additional investment to replicate on self-hosted infrastructure.
When self-hosting wins
- High sustained volume: Above 10B+ tokens/month, the economics favor owned or reserved GPU capacity.
- Data sovereignty: Air-gapped or on-premise requirements.
- Fine-tuning: Self-hosted models can be fine-tuned on proprietary data. API-based fine-tuning is available from some providers but with less control.
- Latency control: No network round-trip. Co-located inference can achieve sub-10ms TTFT.
Cost modeling a real workload
Pricing tables are abstractions. Here is a concrete example of modeling costs for a production RAG pipeline.
Scenario: customer support agent
- 50,000 queries/day
- Each query: 200 tokens input (user question), 2,000 tokens retrieved context, 300 tokens system prompt, 400 tokens output
- System prompt + retrieved context form a stable 2,300-token prefix (the same FAQ docs serve most queries), so within each cache TTL window nearly every request is a cache hit
Option A: Claude Sonnet 4.6, no optimization
| Component | Tokens/request | Daily tokens | Cost/M | Daily cost |
|---|---|---|---|---|
| Input (system + context + query) | 2,500 | 125M | $3.00 | $375.00 |
| Output | 400 | 20M | $15.00 | $300.00 |
| Total | | | | $675.00/day |
Monthly: ~$20,250.
Option B: Claude Sonnet 4.6, with prompt caching
| Component | Tokens/request | Daily tokens | Cost/M | Daily cost |
|---|---|---|---|---|
| Cache write (system + context, first request/5min) | 2,300 | ~0.7M | $3.75 | $2.63 |
| Cache read (system + context, subsequent) | 2,300 | 114.3M | $0.30 | $34.29 |
| Uncached input (user query) | 200 | 10M | $3.00 | $30.00 |
| Output | 400 | 20M | $15.00 | $300.00 |
| Total | | | | $366.92/day |
Monthly: ~$11,008. 46% savings from caching alone.
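Option B's numbers can be reproduced in a few lines. A sketch assuming one cache write per 5-minute TTL window (288/day) and cache reads for every other request, at Sonnet 4.6 rates:

```python
QUERIES_PER_DAY = 50_000
CACHED_PREFIX = 2_300        # system prompt + retrieved context, tokens
QUERY_TOKENS = 200           # uncached user question
OUTPUT_TOKENS = 400

writes = 288                 # one cache write per 5-minute window over 24h
reads = QUERIES_PER_DAY - writes

daily_cost = (
    writes * CACHED_PREFIX / 1e6 * 3.75              # cache write $/M
    + reads * CACHED_PREFIX / 1e6 * 0.30             # cache read $/M
    + QUERIES_PER_DAY * QUERY_TOKENS / 1e6 * 3.00    # uncached input $/M
    + QUERIES_PER_DAY * OUTPUT_TOKENS / 1e6 * 15.00  # output $/M
)
print(f"${daily_cost:.2f}/day")   # ~$367, vs $675 with no caching
```

Note where the money goes: output is now ~80% of the bill, which is why the next step (routing cheap queries to a cheaper model's output rate) has so much headroom.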
Option C: Routed, with caching
Route 70% of queries (simple FAQ-type) to Haiku 4.5, 30% (complex) to Sonnet 4.6. Both with caching.
| Component | Daily cost |
|---|---|
| Haiku tier (35K queries, cached) | ~$52.50 |
| Sonnet tier (15K queries, cached) | ~$117.00 |
| Classifier (50K queries, GPT-5.4 nano) | ~$1.00 |
| Total | ~$170.00/day |
Monthly: ~$5,100. 75% savings vs. Option A.
Option D: Add batch processing for non-real-time queries
If 20% of queries (overnight analytics, report generation) can tolerate 24-hour turnaround:
| Component | Daily cost |
|---|---|
| Real-time routed (40K queries) | ~$136.00 |
| Batch (10K queries, 50% discount) | ~$17.00 |
| Total | ~$153.00/day |
Monthly: ~$4,590. 77% savings vs. Option A.
The progression from $20,250/month to $4,590/month requires no self-hosting and, provided the router sends only genuinely simple queries to the cheap tier, no degradation in output quality. It is entirely a function of caching, routing, and batching. These are architecture decisions, not model decisions.
Quick reference: cost per 1M output tokens
For fast comparison. Sorted cheapest to most expensive. Input costs excluded (they vary by caching and prompt design). This is what you pay for the tokens your users actually see.
| Model | Output $/M | Type |
|---|---|---|
| Mistral Nemo 12B | $0.06 | Proprietary |
| Gemini 2.5 Flash-Lite | $0.10 | Proprietary |
| DeepSeek V4-Flash | $0.28 | Proprietary |
| Llama 4 Scout (hosted) | ~$0.40 | Open-weight |
| GPT-5.4 nano | $1.25 | Proprietary |
| Gemini 3.1 Flash-Lite | $1.50 | Proprietary |
| DeepSeek R1 | $2.19 | Open-weight |
| o4-mini | $4.40 | Proprietary |
| GPT-5.4 mini | $4.50 | Proprietary |
| Claude Haiku 4.5 | $5.00 | Proprietary |
| Mistral Large 3 | $6.00 | Proprietary |
| o3 | $8.00 | Proprietary |
| GPT-5.4 | $10.00 | Proprietary |
| Gemini 2.5 Pro | $10.00 | Proprietary |
| Gemini 3.1 Pro | $12–18 | Proprietary |
| Claude Sonnet 4.6 | $15.00 | Proprietary |
| Claude Opus 4.7 | $25.00 | Proprietary |
| GPT-5.5 | $30.00 | Proprietary |
Prices verified April 25, 2026. The LLM pricing landscape shifts every 2–4 weeks. Use pricepertoken.com for live tracking across 300+ models. Always confirm against official provider documentation before production budgeting.