LLM Token Costs and Efficiency: A Practitioner's Guide (May 2026)

Per-token pricing is the number on the pricing page. It is not your actual cost. The gap between the two is where most teams either save or waste thousands of dollars a month. This post covers both: the raw numbers across every major provider, and the strategies that determine what you actually pay.

Companion post: API Rate Limits Compared (May 2026) covers RPM, TPM, and throughput limits across the same providers.

Per-token pricing across providers
The hidden cost multipliers
Prompt caching: how it works, what it saves
Batch APIs: the easiest 50% discount
Model routing: the architecture that pays for itself
Token efficiency: not all tokens are equal
Structured outputs: fewer tokens, same information
Self-hosting vs. API: when the math flips
Cost modeling a real workload
Further reading

All prices verified against official provider documentation as of May 23, 2026. Prices change frequently. Confirm on provider pricing pages before budgeting.

Per-token pricing across providers

Prices in USD per million tokens. Sorted cheapest to most expensive by output cost within each category.

Proprietary flagship models

Model	Input/M	Output/M	Cached Input/M	Context	Notes
Mistral Nemo 12B	$0.02	$0.06	—	128K
Gemini 2.5 Flash-Lite	$0.025	$0.10	—	1M	No thinking mode
DeepSeek V4-Flash	$0.14	$0.28	$0.028	1M
Gemini 2.5 Flash	$0.15	$0.60	$0.0375	1M	Thinking tokens: $0.60/M in, $2.50/M out
GPT-4.1 Nano	$0.10	$0.40	$0.025	1M	Budget/high-throughput tier
GPT-5.5 Instant	$0.20	$1.00	—	128K	Free-tier ChatGPT default since May 5
Gemini 3.1 Flash-Lite	$0.25	$1.50	—	1M
Claude Haiku 4.5	$1.00	$5.00	$0.10	200K	Cache write: $1.25/M
DeepSeek V4-Pro	$0.435	$0.87	$0.036	1M
Mistral Large 3	$2.00	$6.00	—	128K
GPT-5.4	$2.50	$10.00	$1.25	1M
Gemini 2.5 Pro	$1.25	$10.00	$0.3125	1M	Thinking tokens: $1.25/M in, $10/M out
Claude Sonnet 4.6	$3.00	$15.00	$0.30	1M	Cache write: $3.75/M
Gemini 3.1 Pro	$2–4	$12–18	—	1M	$2/$12 under 200K; $4/$18 above
GPT-5.5	$5.00	$30.00	$2.50	1M	~40% more token-efficient per task than GPT-5.4
Claude Opus 4.6	$5.00	$25.00	$0.50	1M	Cache write: $6.25/M
Claude Opus 4.7	$5.00	$25.00	$0.50	1M	Cache write: $6.25/M; higher-res vision
Qwen3.7-Max	~$4.00	~$16.00	—	1M	Proprietary; 35-hr autonomous operation demonstrated
GPT-5.5 Pro	Contact sales	—	—	—	Enterprise only

Two notes on this table. GPT-5.5 Instant replaced GPT-5.3 Instant as the free-tier ChatGPT default on May 5, 2026; API pricing above is estimated from OpenAI’s current tier structure — confirm before budgeting. Qwen3.7-Max launched May 20, 2026, and hosted API pricing had not been formally published at time of writing; the numbers above are provisional estimates based on Alibaba Cloud’s prior pricing patterns.

Reasoning/thinking models

Model	Input/M	Output/M	Context	Notes
o4-mini	$1.10	$4.40	200K	Thinking tokens: $1.10/M
o3	$2.00	$8.00	200K	Thinking tokens: $2.00/M
DeepSeek R1	$0.55	$2.19	128K	Cache hits: $0.14/M
GPT-5.4 Thinking	$3.00	$12.00	1M	Thinking tokens billed at output rate
GPT-5.5 Thinking	$6.00	$36.00	1M	Thinking tokens billed at output rate

Thinking/reasoning tokens deserve attention. When a model uses extended thinking, you pay for the thinking tokens too. A complex math problem that triggers 10,000 thinking tokens before generating a 500-token answer costs you for 10,500 tokens of output, not 500. Anthropic, OpenAI, and Google all bill thinking tokens at the same rate as standard output tokens for their respective models.

GPT-5.4 Thinking and GPT-5.5 Thinking pricing is estimated based on the pattern established with GPT-5.4 standard; confirm against OpenAI’s pricing page before using these in budget models.

Open-weight models (hosted pricing)

Self-hosted = free. These are representative hosted prices from providers like Groq, Together AI, Fireworks, and Alibaba Cloud.

Model	Hosted Input/M	Hosted Output/M	Context	License
Llama 4 Scout (109B/17B active)	~$0.10	~$0.40	10M	Meta
Qwen3-32B	~$0.15	~$0.60	128K	Apache 2.0
Llama 4 Maverick	~$0.15	~$0.60	1M	Meta
Gemma 3 27B	~$0.20	~$0.20	128K	Google
Mistral 7B	~$0.25	~$0.75	128K	Apache 2.0
DeepSeek R1 (hosted)	$0.55	$2.19	128K	MIT
GPT-OSS 120B	~$0.90	~$0.90	128K	Apache 2.0
Devstral 2 (72B)	~$1.00	~$1.00	256K	Apache 2.0
Command A+ (218B MoE, 25B active)	~$0.50	~$1.50	256K	Apache 2.0

Command A+ launched May 20, 2026, replacing Command R+ as Cohere’s strongest model. Its 25B active parameter count despite 218B total parameters makes it notably cheaper to host than its benchmark performance might suggest — two H100s are sufficient for inference. Hosted pricing above is provisional; confirm with Cohere’s API pricing page.

Embeddings and other modalities

Model	Price	Unit	Notes
OpenAI text-embedding-3-small	$0.02/M tokens	tokens
OpenAI text-embedding-3-large	$0.13/M tokens	tokens
Gemini Embedding 2	Free (AI Studio)	tokens	Multimodal (text, image, video, audio)
Cohere Embed v3	$0.10/M tokens	tokens	Multilingual
Whisper (OpenAI)	$0.006/min	audio minutes
Whisper (Groq, free tier)	$0.00	audio minutes	8 hrs/day limit
Gemini 3.1 Flash TTS	Contact Google	characters	New; dedicated TTS model
OpenAI TTS	$15.00/M chars	characters
OpenAI TTS HD	$30.00/M chars	characters
GPT Image 2	$0.02–$0.17	per image	Varies by quality/size

The hidden cost multipliers

The pricing table above is the starting point, not the answer. Several factors multiply your actual spend beyond the per-token rate.

1. Output tokens cost 2–6x more than input tokens

Every major provider charges more for output than input. The ratio varies:

Provider/Model	Typical Output:Input Ratio
DeepSeek V4-Flash	2:1
GPT-5.4	4:1
Claude Sonnet 4.6	5:1
GPT-5.5	6:1
Claude Opus 4.7	5:1

Prompt engineering that reduces output length has an outsized cost impact. A system prompt that gets the model to return 200 tokens instead of 500 tokens saves more than a system prompt that reduces input by the same 300 tokens.

2. Context length pricing tiers

Google charges different rates based on how much of the context window you use. Gemini 3.1 Pro costs $2/$12 per M under 200K tokens, but $4/$18 above 200K — a 50% price increase for long-context workloads. Other providers charge a flat rate regardless of context usage, but this could change as 1M+ context windows become standard.

3. Thinking token overhead

Extended thinking is not free. A request to Claude Opus 4.7 that triggers 20,000 thinking tokens at $25/M output adds $0.50 to that single request before the actual response tokens. For complex reasoning tasks, thinking tokens can exceed response tokens by 10–50x.

Anthropic’s budget_tokens parameter lets you cap thinking tokens per request. OpenAI’s reasoning_effort parameter (low/medium/high) on o3/o4-mini controls thinking depth. Use these to bound costs on reasoning-heavy workloads.

4. Cache write costs

Anthropic charges a 25% premium on tokens written to cache vs. standard input. Claude Opus 4.7: $6.25/M for cache writes vs. $5.00/M for standard input. The write premium pays for itself after the second cache read ($0.50/M), so any cached prefix used 3+ times is net positive. Single-use prompts should skip caching.

5. Image and multimodal token costs

Images are converted to tokens before processing. A single high-resolution image can consume 1,000–5,000+ tokens depending on the model and resolution settings. For vision-heavy workloads — document OCR, screenshot analysis, diagram interpretation — image token costs can dominate the bill. Claude Opus 4.7’s higher-resolution vision capability increases this exposure relative to Opus 4.6, since the model processes finer image detail at greater token cost. Resize images to the minimum resolution needed before sending them.

6. Autonomous operation and agentic loops

Qwen3.7-Max’s 35-hour autonomous operation demonstration and Anthropic’s continued expansion of agentic Claude use cases introduce a cost pattern that per-request pricing doesn’t capture well: a single high-level task can spawn dozens or hundreds of model calls, each accumulating input context from prior steps. An agent that reads its own output history accumulates input tokens quadratically over a long session. Budget token limits at the task level, not just the request level, for any agentic workload.

Prompt caching: how it works, what it saves

Prompt caching stores the processed representation of input tokens on the provider side so subsequent requests with the same prefix skip re-processing. The savings are substantial.

Provider comparison

Provider	Cache Read Discount	Cache Write Premium	Min Cacheable	TTL	Rate Limit Benefit
Anthropic	90% off input	25% over input	1,024 tokens	5 min (auto-extends)	Cached reads don’t count toward ITPM
OpenAI	50% off input	None (automatic)	Varies	Automatic	No rate limit benefit
Google	75% off input	None (usage-based)	4,096 tokens	Configurable	No rate limit benefit
DeepSeek	80–90% off input	None	Automatic	Varies	No rate limit benefit

Anthropic’s caching has the most generous rate limit interaction: cached read tokens are excluded from ITPM limits entirely. With 80% cache hits, a 2M ITPM limit handles 10M total input tokens per minute. This is a throughput multiplier, not just a cost reduction.

When caching pays off

The break-even math for Anthropic: cache write costs 25% more than standard input. Cache reads cost 90% less. After 2 cache reads, the write premium is recovered. Anything beyond that is pure savings. For a system prompt reused across thousands of requests, the savings approach 90% on that portion of every request.

Practical caching patterns

RAG with stable context: Cache the system prompt and retrieved documents as a prefix. Each follow-up query appends to the cached prefix. Works well when the same document set is queried multiple times within the TTL window.

Multi-turn conversations: Cache the conversation history prefix. Each new turn extends the cache. Anthropic’s 5-minute auto-extending TTL means active conversations stay cached indefinitely.

Batch evaluation: Cache the evaluation rubric and few-shot examples. Run hundreds of evaluations against the cached prefix. The rubric tokens (often 2,000–5,000) are paid once at write price, then read at 90% discount for every subsequent evaluation.

Agentic task context: For long-running autonomous tasks, cache the task description, tool definitions, and accumulated working memory up to the stable prefix boundary. This is particularly relevant for Qwen3.7-Max and Claude Opus 4.7 workloads where session context can grow into the hundreds of thousands of tokens.

Batch APIs: the easiest 50% discount

OpenAI, Anthropic, and Google all offer batch processing with lower pricing and higher throughput limits.

Provider	Batch Discount	Turnaround	Queue Limits
OpenAI	50% off standard pricing	Up to 24 hours	3x–100x higher than real-time TPM
Anthropic	50% off standard pricing	Up to 24 hours	Higher than real-time
Google	50% off standard pricing	Up to 24 hours	Available for Gemini 2.5+ models

OpenAI’s batch queue limits are notably generous: GPT-5.5 gets 1.5M batch tokens at Tier 1 vs. 500K real-time TPM. At Tier 5, the batch queue reaches 15 billion tokens.

Workloads that fit batch processing: content generation pipelines, bulk classification, evaluation suites, data extraction from document archives, nightly report generation, synthetic data creation. Any workload where the output is consumed minutes or hours after generation rather than in real time.

Model routing: the architecture that pays for itself

The most impactful cost optimization is not using one model for everything. A routing layer that sends requests to different models based on complexity can cut spend 60–85%.

Cost comparison: routed vs. single-model

Assume 100,000 requests/day, average 500 output tokens per request.

Single model (Claude Sonnet 4.6 for everything):

100K requests × 500 tokens = 50M output tokens/day
50M × $15/M = $750/day

Three-tier routing (70/25/5 split):

70K requests × 500 tokens × $5/M (Haiku 4.5) = $175.00
25K requests × 500 tokens × $15/M (Sonnet 4.6) = $187.50
5K requests × 500 tokens × $25/M (Opus 4.7) = $62.50
Total: $425/day (43% savings)

The classifier itself costs almost nothing. Running Haiku 4.5 or GPT-4.1 Nano to classify intent on a 100-token input is ~$0.10/M input. For 100K daily classifications: $0.01/day.

The savings compound with caching. If the routing layer also implements prompt caching on the medium and complex tiers, effective savings can reach 70–80%.

Open-weight routing for cost floors

For teams with GPU infrastructure, routing the simple tier to a self-hosted model (Llama 4 Scout, Qwen3-32B, Command A+) eliminates the per-token cost entirely on the majority of requests. Command A+‘s Apache 2.0 license and 2×H100 inference footprint make it a practical target for the medium tier as well, at near-zero marginal cost above infrastructure.

Token efficiency: not all tokens are equal

The same task produces different token counts on different models. This is the most underappreciated cost variable.

GPT-5.5’s efficiency claim

OpenAI reports GPT-5.5 is ~40% more token-efficient than GPT-5.4 on coding tasks. Per-token price doubled ($5/$30 vs. $2.50/$15 for standard), but if the model uses 40% fewer tokens to produce equivalent output, the effective per-task cost is ~20% higher, not 100% higher.

This claim is plausible for code generation (where a more capable model produces tighter code in fewer iterations) but unlikely to hold for simple summarization or classification tasks where output length is fixed. Evaluate token efficiency on your specific workload before assuming the headline number applies.

Tokenizer differences matter

Different providers use different tokenizers. The same English sentence might be 15 tokens on OpenAI’s tiktoken (cl100k_base), 18 tokens on Anthropic’s tokenizer, and 12 tokens on a model using a larger vocabulary. For pricing comparisons, convert to a common unit (characters or words) rather than comparing raw per-token prices across providers with different tokenizers.

A rough heuristic: 1 token ≈ 4 characters ≈ 0.75 words in English. This varies by language (CJK characters often consume 1–2 tokens each) and by content type (code is tokenized differently than prose).

Prompt engineering for token efficiency

Small changes to prompts can produce large token savings at scale:

“Respond in JSON” vs. natural language: structured responses are 30–60% shorter.
“Be concise” in the system prompt: models tend to over-explain. An explicit brevity instruction reduces output tokens by 20–40% on average.
Few-shot examples: Adding 2–3 examples of desired output length calibrates the model’s verbosity. This costs a few hundred extra input tokens but saves thousands of output tokens across the batch.
Max tokens parameter: Set a hard cap. Prevents runaway generation on edge cases where the model would otherwise produce 10x the expected output.

Structured outputs: fewer tokens, same information

JSON mode and structured outputs (GA on OpenAI, Anthropic, and Google) constrain the model to return data in a predefined schema. This is both a reliability and a cost optimization.

A sentiment classification task:

Unstructured response (typical): “The sentiment of this review is positive. The customer expresses satisfaction with the product quality and delivery speed, though they note minor concerns about packaging.” (~35 tokens)

Structured response: {"sentiment": "positive", "confidence": 0.92} (~12 tokens)

That is a 65% reduction in output tokens per request. At $15/M output (Sonnet 4.6), across 1M daily classifications, the savings are: (35−12) × 1M / 1M × $15 = $345/day.

For any pipeline where the downstream consumer is code rather than a human reading prose, structured outputs should be the default.

Self-hosting vs. API: when the math flips

Open-weight models (Llama 4, Qwen3, Mistral, DeepSeek, Command A+, GPT-OSS) are free to run. The cost is infrastructure.

Break-even estimation

A single H100 (80GB) rents for roughly $2–3/hour on cloud providers. It can serve a 70B model (quantized to 4-bit) at approximately 30–50 tokens/second.

Monthly H100 cost: ~$2,000 Monthly tokens at 40 tok/s sustained: ~100B tokens

Equivalent API cost for 100B output tokens:

Claude Haiku 4.5: $500,000
GPT-4.1 Nano: $40,000
DeepSeek V4-Flash: $28,000

Self-hosting a 70B open model at $2,000/month breaks even against DeepSeek V4-Flash at roughly 7B output tokens/month. Against Haiku 4.5, it breaks even at under 500M tokens/month.

Command A+ changes the calculus somewhat. At 218B total parameters but only 25B active, it runs on 2×H100 (~$4,000/month) while delivering quality closer to mid-tier proprietary models than typical 70B open-weights. For teams that can keep those GPUs near-saturated, the economics are compelling.

When APIs still win

Low volume: Under ~5B tokens/month, API costs are lower than maintaining infrastructure.
Burst traffic: APIs absorb spikes. Self-hosted GPUs sit idle during troughs and can’t handle peaks without over-provisioning.
Frontier quality: No open model matches Opus 4.7 or GPT-5.5 on complex reasoning. If the task requires the best available performance, APIs are the only option.
Compliance and support: Enterprise API tiers include SLAs, BAAs (HIPAA), data residency controls, and zero-retention options that would require additional investment to replicate on self-hosted infrastructure.

When self-hosting wins

High sustained volume: Above 10B+ tokens/month, owned or reserved GPU capacity typically beats API pricing.
Data sovereignty: Air-gapped or on-premise requirements.
Fine-tuning: Self-hosted models can be fine-tuned on proprietary data. API-based fine-tuning is available from some providers but with less control.
Latency control: No network round-trip. Co-located inference can achieve sub-10ms TTFT.

Cost modeling a real workload

Pricing tables are abstractions. Here is a concrete example of modeling costs for a production RAG pipeline.

Scenario: customer support agent

50,000 queries/day
Each query: 200 tokens input (user question), 2,000 tokens retrieved context, 300 tokens system prompt, 400 tokens output
System prompt + retrieval context is 80% cacheable (same FAQ docs across queries)

Option A: Claude Sonnet 4.6, no optimization

Component	Tokens/request	Daily tokens	Cost/M	Daily cost
Input (system + context + query)	2,500	125M	$3.00	$375.00
Output	400	20M	$15.00	$300.00
Total				$675.00/day

Monthly: ~$20,250.

Option B: Claude Sonnet 4.6, with prompt caching

Component	Tokens/request	Daily tokens	Cost/M	Daily cost
Cache write (system + context, first request/5min)	2,300	~0.7M	$3.75	$2.63
Cache read (system + context, subsequent)	2,300	114.3M	$0.30	$34.29
Uncached input (user query)	200	10M	$3.00	$30.00
Output	400	20M	$15.00	$300.00
Total				$366.92/day

Monthly: ~$11,008. 46% savings from caching alone.

Option C: Routed, with caching

Route 70% of queries (simple FAQ-type) to Haiku 4.5, 30% (complex) to Sonnet 4.6. Both with caching.

Component	Daily cost
Haiku tier (35K queries, cached)	~$52.50
Sonnet tier (15K queries, cached)	~$117.00
Classifier (50K queries, Haiku)	~$0.50
Total	~$170.00/day

Monthly: ~$5,100. 75% savings vs. Option A.

Option D: Add batch processing for non-real-time queries

If 20% of queries (overnight analytics, report generation) can tolerate 24-hour turnaround:

Component	Daily cost
Real-time routed (40K queries)	~$136.00
Batch (10K queries, 50% discount)	~$17.00
Total	~$153.00/day

Monthly: ~$4,590. 77% savings vs. Option A.

The progression from $20,250/month to $4,590/month uses no different models, no degradation in output quality, and no self-hosting. It is entirely a function of caching, routing, and batching — infrastructure decisions, not model decisions.

Quick reference: cost per 1M output tokens

Sorted cheapest to most expensive. Input costs excluded. This is what you pay for the tokens your users actually see.

Model	Output $/M	Type
Mistral Nemo 12B	$0.06	Proprietary
Gemini 2.5 Flash-Lite	$0.10	Proprietary
DeepSeek V4-Flash	$0.28	Proprietary
GPT-4.1 Nano	$0.40	Proprietary
Llama 4 Scout (hosted)	~$0.40	Open-weight
DeepSeek R1	$2.19	Open-weight
GPT-5.5 Instant	~$1.00	Proprietary
GPT-5.4	$10.00	Proprietary
Gemini 2.5 Pro	$10.00	Proprietary
Claude Haiku 4.5	$5.00	Proprietary
Mistral Large 3	$6.00	Proprietary
o3	$8.00	Proprietary
Gemini 3.1 Flash-Lite	$1.50	Proprietary
Gemini 3.1 Pro	$12–18	Proprietary
Qwen3.7-Max	~$16.00	Proprietary
Claude Sonnet 4.6	$15.00	Proprietary
Claude Opus 4.6	$25.00	Proprietary
Claude Opus 4.7	$25.00	Proprietary
GPT-5.5	$30.00	Proprietary