LLM Token Costs and Efficiency: A Practitioner's Guide (May 2026) LLM token costs across 15+ providers: per-token pricing, caching mechanics, batch discounts, model routing, and cost optimization for May 2026. 2026-05-23T12:00:00.000Z Deep Dives Deep Dives apipricingcost-optimizationtokensreference

LLM Token Costs and Efficiency: A Practitioner's Guide (May 2026)

LLM token costs across 15+ providers: per-token pricing, caching mechanics, batch discounts, model routing, and cost optimization for May 2026.

The post you bookmark. One topic, covered end to end.

Per-token pricing is the number on the pricing page. It is not your actual cost. The gap between the two is where most teams either save or waste thousands of dollars a month. This post covers both: the raw numbers across every major provider, and the strategies that determine what you actually pay.

Companion post: API Rate Limits Compared (May 2026) covers RPM, TPM, and throughput limits across the same providers.

Table of Contents

All prices verified against official provider documentation as of May 23, 2026. Prices change frequently. Confirm on provider pricing pages before budgeting.


Per-token pricing across providers

Prices in USD per million tokens. Sorted cheapest to most expensive by output cost within each category.

Proprietary flagship models

ModelInput/MOutput/MCached Input/MContextNotes
Mistral Nemo 12B$0.02$0.06128K
Gemini 2.5 Flash-Lite$0.025$0.101MNo thinking mode
DeepSeek V4-Flash$0.14$0.28$0.0281M
Gemini 2.5 Flash$0.15$0.60$0.03751MThinking tokens: $0.60/M in, $2.50/M out
GPT-4.1 Nano$0.10$0.40$0.0251MBudget/high-throughput tier
GPT-5.5 Instant$0.20$1.00128KFree-tier ChatGPT default since May 5
Gemini 3.1 Flash-Lite$0.25$1.501M
Claude Haiku 4.5$1.00$5.00$0.10200KCache write: $1.25/M
DeepSeek V4-Pro$0.435$0.87$0.0361M
Mistral Large 3$2.00$6.00128K
GPT-5.4$2.50$10.00$1.251M
Gemini 2.5 Pro$1.25$10.00$0.31251MThinking tokens: $1.25/M in, $10/M out
Claude Sonnet 4.6$3.00$15.00$0.301MCache write: $3.75/M
Gemini 3.1 Pro$2–4$12–181M$2/$12 under 200K; $4/$18 above
GPT-5.5$5.00$30.00$2.501M~40% more token-efficient per task than GPT-5.4
Claude Opus 4.6$5.00$25.00$0.501MCache write: $6.25/M
Claude Opus 4.7$5.00$25.00$0.501MCache write: $6.25/M; higher-res vision
Qwen3.7-Max~$4.00~$16.001MProprietary; 35-hr autonomous operation demonstrated
GPT-5.5 ProContact salesEnterprise only

Two notes on this table. GPT-5.5 Instant replaced GPT-5.3 Instant as the free-tier ChatGPT default on May 5, 2026; API pricing above is estimated from OpenAI’s current tier structure — confirm before budgeting. Qwen3.7-Max launched May 20, 2026, and hosted API pricing had not been formally published at time of writing; the numbers above are provisional estimates based on Alibaba Cloud’s prior pricing patterns.

Reasoning/thinking models

ModelInput/MOutput/MContextNotes
o4-mini$1.10$4.40200KThinking tokens: $1.10/M
o3$2.00$8.00200KThinking tokens: $2.00/M
DeepSeek R1$0.55$2.19128KCache hits: $0.14/M
GPT-5.4 Thinking$3.00$12.001MThinking tokens billed at output rate
GPT-5.5 Thinking$6.00$36.001MThinking tokens billed at output rate

Thinking/reasoning tokens deserve attention. When a model uses extended thinking, you pay for the thinking tokens too. A complex math problem that triggers 10,000 thinking tokens before generating a 500-token answer costs you for 10,500 tokens of output, not 500. Anthropic, OpenAI, and Google all bill thinking tokens at the same rate as standard output tokens for their respective models.

GPT-5.4 Thinking and GPT-5.5 Thinking pricing is estimated based on the pattern established with GPT-5.4 standard; confirm against OpenAI’s pricing page before using these in budget models.

Open-weight models (hosted pricing)

Self-hosted = free. These are representative hosted prices from providers like Groq, Together AI, Fireworks, and Alibaba Cloud.

ModelHosted Input/MHosted Output/MContextLicense
Llama 4 Scout (109B/17B active)~$0.10~$0.4010MMeta
Qwen3-32B~$0.15~$0.60128KApache 2.0
Llama 4 Maverick~$0.15~$0.601MMeta
Gemma 3 27B~$0.20~$0.20128KGoogle
Mistral 7B~$0.25~$0.75128KApache 2.0
DeepSeek R1 (hosted)$0.55$2.19128KMIT
GPT-OSS 120B~$0.90~$0.90128KApache 2.0
Devstral 2 (72B)~$1.00~$1.00256KApache 2.0
Command A+ (218B MoE, 25B active)~$0.50~$1.50256KApache 2.0

Command A+ launched May 20, 2026, replacing Command R+ as Cohere’s strongest model. Its 25B active parameter count despite 218B total parameters makes it notably cheaper to host than its benchmark performance might suggest — two H100s are sufficient for inference. Hosted pricing above is provisional; confirm with Cohere’s API pricing page.

Embeddings and other modalities

ModelPriceUnitNotes
OpenAI text-embedding-3-small$0.02/M tokenstokens
OpenAI text-embedding-3-large$0.13/M tokenstokens
Gemini Embedding 2Free (AI Studio)tokensMultimodal (text, image, video, audio)
Cohere Embed v3$0.10/M tokenstokensMultilingual
Whisper (OpenAI)$0.006/minaudio minutes
Whisper (Groq, free tier)$0.00audio minutes8 hrs/day limit
Gemini 3.1 Flash TTSContact GooglecharactersNew; dedicated TTS model
OpenAI TTS$15.00/M charscharacters
OpenAI TTS HD$30.00/M charscharacters
GPT Image 2$0.02–$0.17per imageVaries by quality/size

The hidden cost multipliers

The pricing table above is the starting point, not the answer. Several factors multiply your actual spend beyond the per-token rate.

1. Output tokens cost 2–6x more than input tokens

Every major provider charges more for output than input. The ratio varies:

Provider/ModelTypical Output:Input Ratio
DeepSeek V4-Flash2:1
GPT-5.44:1
Claude Sonnet 4.65:1
GPT-5.56:1
Claude Opus 4.75:1

Prompt engineering that reduces output length has an outsized cost impact. A system prompt that gets the model to return 200 tokens instead of 500 tokens saves more than a system prompt that reduces input by the same 300 tokens.

2. Context length pricing tiers

Google charges different rates based on how much of the context window you use. Gemini 3.1 Pro costs $2/$12 per M under 200K tokens, but $4/$18 above 200K — a 50% price increase for long-context workloads. Other providers charge a flat rate regardless of context usage, but this could change as 1M+ context windows become standard.

3. Thinking token overhead

Extended thinking is not free. A request to Claude Opus 4.7 that triggers 20,000 thinking tokens at $25/M output adds $0.50 to that single request before the actual response tokens. For complex reasoning tasks, thinking tokens can exceed response tokens by 10–50x.

Anthropic’s budget_tokens parameter lets you cap thinking tokens per request. OpenAI’s reasoning_effort parameter (low/medium/high) on o3/o4-mini controls thinking depth. Use these to bound costs on reasoning-heavy workloads.

4. Cache write costs

Anthropic charges a 25% premium on tokens written to cache vs. standard input. Claude Opus 4.7: $6.25/M for cache writes vs. $5.00/M for standard input. The write premium pays for itself after the second cache read ($0.50/M), so any cached prefix used 3+ times is net positive. Single-use prompts should skip caching.

5. Image and multimodal token costs

Images are converted to tokens before processing. A single high-resolution image can consume 1,000–5,000+ tokens depending on the model and resolution settings. For vision-heavy workloads — document OCR, screenshot analysis, diagram interpretation — image token costs can dominate the bill. Claude Opus 4.7’s higher-resolution vision capability increases this exposure relative to Opus 4.6, since the model processes finer image detail at greater token cost. Resize images to the minimum resolution needed before sending them.

6. Autonomous operation and agentic loops

Qwen3.7-Max’s 35-hour autonomous operation demonstration and Anthropic’s continued expansion of agentic Claude use cases introduce a cost pattern that per-request pricing doesn’t capture well: a single high-level task can spawn dozens or hundreds of model calls, each accumulating input context from prior steps. An agent that reads its own output history accumulates input tokens quadratically over a long session. Budget token limits at the task level, not just the request level, for any agentic workload.


Prompt caching: how it works, what it saves

Prompt caching stores the processed representation of input tokens on the provider side so subsequent requests with the same prefix skip re-processing. The savings are substantial.

Provider comparison

ProviderCache Read DiscountCache Write PremiumMin CacheableTTLRate Limit Benefit
Anthropic90% off input25% over input1,024 tokens5 min (auto-extends)Cached reads don’t count toward ITPM
OpenAI50% off inputNone (automatic)VariesAutomaticNo rate limit benefit
Google75% off inputNone (usage-based)4,096 tokensConfigurableNo rate limit benefit
DeepSeek80–90% off inputNoneAutomaticVariesNo rate limit benefit

Anthropic’s caching has the most generous rate limit interaction: cached read tokens are excluded from ITPM limits entirely. With 80% cache hits, a 2M ITPM limit handles 10M total input tokens per minute. This is a throughput multiplier, not just a cost reduction.

When caching pays off

Diagram

The break-even math for Anthropic: cache write costs 25% more than standard input. Cache reads cost 90% less. After 2 cache reads, the write premium is recovered. Anything beyond that is pure savings. For a system prompt reused across thousands of requests, the savings approach 90% on that portion of every request.

Practical caching patterns

RAG with stable context: Cache the system prompt and retrieved documents as a prefix. Each follow-up query appends to the cached prefix. Works well when the same document set is queried multiple times within the TTL window.

Multi-turn conversations: Cache the conversation history prefix. Each new turn extends the cache. Anthropic’s 5-minute auto-extending TTL means active conversations stay cached indefinitely.

Batch evaluation: Cache the evaluation rubric and few-shot examples. Run hundreds of evaluations against the cached prefix. The rubric tokens (often 2,000–5,000) are paid once at write price, then read at 90% discount for every subsequent evaluation.

Agentic task context: For long-running autonomous tasks, cache the task description, tool definitions, and accumulated working memory up to the stable prefix boundary. This is particularly relevant for Qwen3.7-Max and Claude Opus 4.7 workloads where session context can grow into the hundreds of thousands of tokens.


Batch APIs: the easiest 50% discount

OpenAI, Anthropic, and Google all offer batch processing with lower pricing and higher throughput limits.

ProviderBatch DiscountTurnaroundQueue Limits
OpenAI50% off standard pricingUp to 24 hours3x–100x higher than real-time TPM
Anthropic50% off standard pricingUp to 24 hoursHigher than real-time
Google50% off standard pricingUp to 24 hoursAvailable for Gemini 2.5+ models

OpenAI’s batch queue limits are notably generous: GPT-5.5 gets 1.5M batch tokens at Tier 1 vs. 500K real-time TPM. At Tier 5, the batch queue reaches 15 billion tokens.

Workloads that fit batch processing: content generation pipelines, bulk classification, evaluation suites, data extraction from document archives, nightly report generation, synthetic data creation. Any workload where the output is consumed minutes or hours after generation rather than in real time.


Model routing: the architecture that pays for itself

The most impactful cost optimization is not using one model for everything. A routing layer that sends requests to different models based on complexity can cut spend 60–85%.

Diagram

Cost comparison: routed vs. single-model

Assume 100,000 requests/day, average 500 output tokens per request.

Single model (Claude Sonnet 4.6 for everything):

  • 100K requests × 500 tokens = 50M output tokens/day
  • 50M × $15/M = $750/day

Three-tier routing (70/25/5 split):

  • 70K requests × 500 tokens × $5/M (Haiku 4.5) = $175.00
  • 25K requests × 500 tokens × $15/M (Sonnet 4.6) = $187.50
  • 5K requests × 500 tokens × $25/M (Opus 4.7) = $62.50
  • Total: $425/day (43% savings)

The classifier itself costs almost nothing. Running Haiku 4.5 or GPT-4.1 Nano to classify intent on a 100-token input is ~$0.10/M input. For 100K daily classifications: $0.01/day.

The savings compound with caching. If the routing layer also implements prompt caching on the medium and complex tiers, effective savings can reach 70–80%.

Open-weight routing for cost floors

For teams with GPU infrastructure, routing the simple tier to a self-hosted model (Llama 4 Scout, Qwen3-32B, Command A+) eliminates the per-token cost entirely on the majority of requests. Command A+‘s Apache 2.0 license and 2×H100 inference footprint make it a practical target for the medium tier as well, at near-zero marginal cost above infrastructure.


Token efficiency: not all tokens are equal

The same task produces different token counts on different models. This is the most underappreciated cost variable.

GPT-5.5’s efficiency claim

OpenAI reports GPT-5.5 is ~40% more token-efficient than GPT-5.4 on coding tasks. Per-token price doubled ($5/$30 vs. $2.50/$15 for standard), but if the model uses 40% fewer tokens to produce equivalent output, the effective per-task cost is ~20% higher, not 100% higher.

This claim is plausible for code generation (where a more capable model produces tighter code in fewer iterations) but unlikely to hold for simple summarization or classification tasks where output length is fixed. Evaluate token efficiency on your specific workload before assuming the headline number applies.

Tokenizer differences matter

Different providers use different tokenizers. The same English sentence might be 15 tokens on OpenAI’s tiktoken (cl100k_base), 18 tokens on Anthropic’s tokenizer, and 12 tokens on a model using a larger vocabulary. For pricing comparisons, convert to a common unit (characters or words) rather than comparing raw per-token prices across providers with different tokenizers.

A rough heuristic: 1 token ≈ 4 characters ≈ 0.75 words in English. This varies by language (CJK characters often consume 1–2 tokens each) and by content type (code is tokenized differently than prose).

Prompt engineering for token efficiency

Small changes to prompts can produce large token savings at scale:

  • “Respond in JSON” vs. natural language: structured responses are 30–60% shorter.
  • “Be concise” in the system prompt: models tend to over-explain. An explicit brevity instruction reduces output tokens by 20–40% on average.
  • Few-shot examples: Adding 2–3 examples of desired output length calibrates the model’s verbosity. This costs a few hundred extra input tokens but saves thousands of output tokens across the batch.
  • Max tokens parameter: Set a hard cap. Prevents runaway generation on edge cases where the model would otherwise produce 10x the expected output.

Structured outputs: fewer tokens, same information

JSON mode and structured outputs (GA on OpenAI, Anthropic, and Google) constrain the model to return data in a predefined schema. This is both a reliability and a cost optimization.

A sentiment classification task:

Unstructured response (typical): “The sentiment of this review is positive. The customer expresses satisfaction with the product quality and delivery speed, though they note minor concerns about packaging.” (~35 tokens)

Structured response: {"sentiment": "positive", "confidence": 0.92} (~12 tokens)

That is a 65% reduction in output tokens per request. At $15/M output (Sonnet 4.6), across 1M daily classifications, the savings are: (35−12) × 1M / 1M × $15 = $345/day.

For any pipeline where the downstream consumer is code rather than a human reading prose, structured outputs should be the default.


Self-hosting vs. API: when the math flips

Open-weight models (Llama 4, Qwen3, Mistral, DeepSeek, Command A+, GPT-OSS) are free to run. The cost is infrastructure.

Break-even estimation

A single H100 (80GB) rents for roughly $2–3/hour on cloud providers. It can serve a 70B model (quantized to 4-bit) at approximately 30–50 tokens/second.

Monthly H100 cost: ~$2,000 Monthly tokens at 40 tok/s sustained: ~100B tokens

Equivalent API cost for 100B output tokens:

  • Claude Haiku 4.5: $500,000
  • GPT-4.1 Nano: $40,000
  • DeepSeek V4-Flash: $28,000

Self-hosting a 70B open model at $2,000/month breaks even against DeepSeek V4-Flash at roughly 7B output tokens/month. Against Haiku 4.5, it breaks even at under 500M tokens/month.

Command A+ changes the calculus somewhat. At 218B total parameters but only 25B active, it runs on 2×H100 (~$4,000/month) while delivering quality closer to mid-tier proprietary models than typical 70B open-weights. For teams that can keep those GPUs near-saturated, the economics are compelling.

When APIs still win

  • Low volume: Under ~5B tokens/month, API costs are lower than maintaining infrastructure.
  • Burst traffic: APIs absorb spikes. Self-hosted GPUs sit idle during troughs and can’t handle peaks without over-provisioning.
  • Frontier quality: No open model matches Opus 4.7 or GPT-5.5 on complex reasoning. If the task requires the best available performance, APIs are the only option.
  • Compliance and support: Enterprise API tiers include SLAs, BAAs (HIPAA), data residency controls, and zero-retention options that would require additional investment to replicate on self-hosted infrastructure.

When self-hosting wins

  • High sustained volume: Above 10B+ tokens/month, owned or reserved GPU capacity typically beats API pricing.
  • Data sovereignty: Air-gapped or on-premise requirements.
  • Fine-tuning: Self-hosted models can be fine-tuned on proprietary data. API-based fine-tuning is available from some providers but with less control.
  • Latency control: No network round-trip. Co-located inference can achieve sub-10ms TTFT.

Cost modeling a real workload

Pricing tables are abstractions. Here is a concrete example of modeling costs for a production RAG pipeline.

Scenario: customer support agent

  • 50,000 queries/day
  • Each query: 200 tokens input (user question), 2,000 tokens retrieved context, 300 tokens system prompt, 400 tokens output
  • System prompt + retrieval context is 80% cacheable (same FAQ docs across queries)

Option A: Claude Sonnet 4.6, no optimization

ComponentTokens/requestDaily tokensCost/MDaily cost
Input (system + context + query)2,500125M$3.00$375.00
Output40020M$15.00$300.00
Total$675.00/day

Monthly: ~$20,250.

Option B: Claude Sonnet 4.6, with prompt caching

ComponentTokens/requestDaily tokensCost/MDaily cost
Cache write (system + context, first request/5min)2,300~0.7M$3.75$2.63
Cache read (system + context, subsequent)2,300114.3M$0.30$34.29
Uncached input (user query)20010M$3.00$30.00
Output40020M$15.00$300.00
Total$366.92/day

Monthly: ~$11,008. 46% savings from caching alone.

Option C: Routed, with caching

Route 70% of queries (simple FAQ-type) to Haiku 4.5, 30% (complex) to Sonnet 4.6. Both with caching.

ComponentDaily cost
Haiku tier (35K queries, cached)~$52.50
Sonnet tier (15K queries, cached)~$117.00
Classifier (50K queries, Haiku)~$0.50
Total~$170.00/day

Monthly: ~$5,100. 75% savings vs. Option A.

Option D: Add batch processing for non-real-time queries

If 20% of queries (overnight analytics, report generation) can tolerate 24-hour turnaround:

ComponentDaily cost
Real-time routed (40K queries)~$136.00
Batch (10K queries, 50% discount)~$17.00
Total~$153.00/day

Monthly: ~$4,590. 77% savings vs. Option A.

The progression from $20,250/month to $4,590/month uses no different models, no degradation in output quality, and no self-hosting. It is entirely a function of caching, routing, and batching — infrastructure decisions, not model decisions.


Quick reference: cost per 1M output tokens

Sorted cheapest to most expensive. Input costs excluded. This is what you pay for the tokens your users actually see.

ModelOutput $/MType
Mistral Nemo 12B$0.06Proprietary
Gemini 2.5 Flash-Lite$0.10Proprietary
DeepSeek V4-Flash$0.28Proprietary
GPT-4.1 Nano$0.40Proprietary
Llama 4 Scout (hosted)~$0.40Open-weight
DeepSeek R1$2.19Open-weight
GPT-5.5 Instant~$1.00Proprietary
GPT-5.4$10.00Proprietary
Gemini 2.5 Pro$10.00Proprietary
Claude Haiku 4.5$5.00Proprietary
Mistral Large 3$6.00Proprietary
o3$8.00Proprietary
Gemini 3.1 Flash-Lite$1.50Proprietary
Gemini 3.1 Pro$12–18Proprietary
Qwen3.7-Max~$16.00Proprietary
Claude Sonnet 4.6$15.00Proprietary
Claude Opus 4.6$25.00Proprietary
Claude Opus 4.7$25.00Proprietary
GPT-5.5$30.00Proprietary

Further reading

OpenAI

Anthropic

Google

DeepSeek

Mistral

Cohere

Alibaba / Qwen

Third-party trackers


Prices verified May 23, 2026. The LLM pricing landscape shifts every 2–4 weeks. Use pricepertoken.com for live tracking. Always confirm against official provider documentation before production budgeting.