API Rate Limits Compared: Every Major LLM Provider (June 2026)

Seventh edition. The sixth (June 6, 2026) covered 17 providers. This update reflects the one-week window since June 6. Claude Opus 4.8 is now the Anthropic flagship, replacing Opus 4.7 as the entry in scaled-tier tables. Google released Gemini 3.5 Flash and Gemini Omni at Google I/O 2026 — neither has published rate limit figures yet. The DeepSeek V3.2 alias deprecation on July 24 is now six weeks out. Qwen3.7-Max and Command A+ rate limit tables remain unpublished.

What changed since June 6
Free Tier — RPM & RPD
Free Tier — TPM & TPD
Entry Paid Tier — RPM & RPD
Entry Paid Tier — TPM
Scaled Tier — RPM & TPM
Cerebras & SambaNova
Audio Models — ASH & ASD
Cloud Aggregators — Azure AI & AWS Bedrock
More Providers — Perplexity, Alibaba, Moonshot
Provider Notes
Tips for managing rate limits
Further Reading

Last updated: June 13, 2026.

What changed since June 6

Claude Opus 4.8 is now Anthropic’s flagship. Released in May 2026, Opus 4.8 brings incremental but confirmed improvements in reasoning and code over Opus 4.7. It slots into the same rate limit pool structure as prior Opus 4.x models — Opus 4.8, 4.7, and 4.6 share one combined bucket. Scaled-tier tables in this edition reference Opus 4.8 as the current flagship; the pool note remains relevant.

Gemini 3.5 Flash and Gemini Omni released at Google I/O 2026. No rate limit figures published for either model yet. Both are omitted from tables accordingly. Gemini 3.1 Pro, 2.5 Pro, and 2.5 Flash remain in tables with their last confirmed figures.

No new model releases across OpenAI, xAI, Groq, DeepSeek, or Mistral in the June 6–13 window. Rate limit structures are unchanged at those providers.

Gemini 2.0 shutdown: Completed June 1. References to those endpoints have been removed from all tables.

Still pending formal documentation:

Alibaba: Qwen3.7-Max (May 20) rate limit tables still not published. Qwen3 Max limits continue as a provisional floor.
Cohere: Command A+ (May 20) rate limit figures still not published. Command R+ figures remain as a reference baseline.
Google: Gemini 3.5 Flash and Gemini Omni rate limit tables not yet published.

Deprecation calendar:

DeepSeek V3.2 aliases deprecate July 24, 2026 — six weeks out. V4-Flash and V4-Pro are the current endpoints.
OpenAI Assistants API deprecates August 2026; replaced by the Responses API.

What are rate limits?

Rate limits cap how much you can use an API within a given time window. Providers enforce them to manage capacity and ensure fair access. Exceeding a limit returns a 429 Too Many Requests error.

Most providers use the token bucket algorithm: capacity refills continuously up to your maximum, rather than resetting at fixed intervals. A 60 RPM limit means roughly 1 request per second with steady refill, not 60 requests then a hard stop.
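The continuous-refill behavior can be sketched in a few lines. This is a minimal illustration, not any provider's actual implementation: the class name, the `now` parameter (added so the example is deterministic), and the burst capacity are all assumptions. Providers generally do not publish their burst capacity.

```python
import time
from typing import Optional


class TokenBucket:
    """Continuous-refill bucket holding up to `capacity` tokens, refilled at `rate` per second."""

    def __init__(self, capacity: float, rate: float, now: Optional[float] = None):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity  # assumption: the bucket starts full
        self.updated = time.monotonic() if now is None else now

    def try_acquire(self, cost: float = 1.0, now: Optional[float] = None) -> bool:
        """Refill for the elapsed time, then spend `cost` tokens if enough are available."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# 60 RPM modeled as a refill rate of 1 request per second.
# The burst capacity of 5 is a hypothetical value chosen for illustration.
bucket = TokenBucket(capacity=5, rate=60 / 60, now=0.0)
```

Once the bucket is drained, a caller gets one more request roughly every second instead of waiting for a fixed window to reset.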

The metrics

RPM (Requests per minute): API calls per minute, regardless of size. A 10-token request and a 100K-token request both count as one.
RPD (Requests per day): Daily cap on total API calls. Some providers use this instead of (or alongside) RPM, especially on free tiers.
TPM (Tokens per minute): Total tokens (input + output) processed per minute. Usually the binding constraint in production.
TPD (Tokens per day): Daily token cap. More common on free tiers.
ITPM/OTPM (Input/Output tokens per minute): Anthropic separates input and output token limits, giving finer control. Cached input tokens don’t count toward ITPM on current Claude 4.x models.
ASH/ASD (Audio seconds per hour/day): For speech models like Whisper.

RPM limits cap how many requests you can start per minute; TPM limits cap how many tokens those requests can move. For batch workloads, RPD and TPD matter more. For real-time apps, RPM and TPM hit first.
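Which limit binds first depends on average request size. The sketch below works that out for a single RPM/TPM pair. The function name and the request sizes in the usage note are assumptions for illustration.

```python
from typing import Tuple


def binding_limit(rpm: int, tpm: int, avg_tokens_per_request: int) -> Tuple[str, float]:
    """Return which limit is exhausted first and the sustainable requests per minute."""
    requests_allowed_by_tpm = tpm / avg_tokens_per_request
    if requests_allowed_by_tpm < rpm:
        return "TPM", requests_allowed_by_tpm
    return "RPM", float(rpm)
```

With the 500 RPM / 500K TPM pair listed for GPT-5.5 at Tier 1 later in this post, a hypothetical 4,000-token average request is TPM-bound at 125 requests per minute, while a 500-token average request is RPM-bound at 500.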

Free Tier — Requests per Minute (RPM) & Requests per Day (RPD)

Provider / Model	RPM	RPD
Google — 2.5 Flash	10	250–1,500
Google — 2.5 Flash-Lite	15	1,000
Groq — Llama 4 Scout	30	1,000
Groq — Qwen3 32B	60	1,000
Cerebras — Llama 3.3 70B	30	—
SambaNova — Llama 3.1 405B	10–30	—
Fireworks AI — OS models	10	—
Cohere — Command R+*	20	~33†

*Cohere’s free tier figures reference Command R+. Command A+ free tier limits not yet published. †Cohere trial keys get 1,000 calls/month across all endpoints, roughly 33/day.

Note: Gemini 2.5 Pro free tier (5 RPM, 50–100 RPD) remains available but is not listed here — the conservative limits make it unsuitable as a development baseline. Check AI Studio for your project’s actual caps. OpenAI, Anthropic, xAI, DeepSeek, and Mistral have no free API tier for current flagship models. OpenAI’s free tier does not support any GPT-5.x model via the API (GPT-5.5 Instant is a ChatGPT product, not an API tier). Anthropic’s lowest entry is $5 (Tier 1). xAI and Mistral gate numerical limits behind their respective consoles.

Migration note: Gemini 2.0 Flash and Flash-Lite are shut down as of June 1, 2026. Those endpoints no longer accept requests.

Free Tier — Tokens per Minute (TPM) & Tokens per Day (TPD)

Provider / Model	TPM	TPD
Google — 2.5 Flash	250,000	—
Google — 2.5 Flash-Lite	250,000	—
Groq — Llama 4 Scout	30,000	500,000
Groq — Qwen3 32B	6,000	500,000
Cerebras — Llama 3.3 70B	60,000	1,000,000
SambaNova — Llama 3.1 405B	—	—

Standout: Google leads with 250K TPM on the free tier, but the RPM/RPD caps (10–15 RPM) mean it’s best for fewer, larger requests. Cerebras offers 60K TPM with 1M TPD and ~2,100 tokens/second inference speed on Llama 3.3 70B, making it the fastest free option by throughput. SambaNova publishes RPM but not TPM/TPD specifics.

Entry Paid Tier — Requests per Minute (RPM) & Requests per Day (RPD)

The tier most developers start with. OpenAI Tier 1 ($5), Anthropic Tier 1 ($5), Google Tier 1 (pay-as-you-go), and others at their lowest paid level.

Provider / Model	RPM	RPD
OpenAI — GPT-5.5	500	—
OpenAI — GPT-4.1 Nano	500	—
Anthropic — Sonnet 4.6	50	—
Anthropic — Haiku 4.5	50	—
Google — 2.5 Pro	150	1,000
Google — 2.5 Flash	300	1,500
xAI — Grok 4.20	Console†	—
DeepSeek — V4 Flash	Dynamic‡	—
Fireworks AI — OS models	≤6,000	—
Cohere — Command A+*	500	—

*Command A+ entry paid tier limits not yet published. 500 RPM is the Command R+ figure, carried forward as a reference. †xAI publishes tier thresholds ($0/$50/$250/$1K/$5K) but numerical RPM/TPM are only visible in the xAI Console after login. ‡DeepSeek uses fully dynamic concurrency limits based on server load. No fixed RPM/TPM published.

Gemini 3.1 Pro note: Gemini 3.1 Pro is in paid preview. Published limits not confirmed — check AI Studio for your project’s actual limits. Gemini 3.5 Flash and Gemini Omni, released at Google I/O 2026, have no published rate limit figures yet. All three are omitted from this table.

Mistral note: Mistral moved to console-only limits. Actual numbers require login at admin.mistral.ai/plateforme/limits. Mistral is excluded from RPM columns where no public figure exists.

Standout: Fireworks can spike to 6,000 RPM but it’s a dynamic ceiling, not guaranteed (soft limit starts at ~1 RPS and doubles hourly). Anthropic’s 50 RPM at Tier 1 is the lowest here, but jumps 20x to 1,000 RPM at Tier 2 ($40).

Entry Paid Tier — Tokens per Minute (TPM)

Provider / Model	TPM
OpenAI — GPT-5.5	500K
OpenAI — GPT-4.1 Nano	200K
Anthropic — Sonnet 4.6	30K in / 8K out
Anthropic — Haiku 4.5	50K in / 10K out
Google — 2.5 Pro	1M
Google — 2.5 Flash	2M
DeepSeek — V4 Flash	Dynamic

Standout: Google 2.5 Flash leads the confirmed figures at 2M TPM. GPT-5.5 gets 500K TPM at Tier 1. Anthropic looks low on paper (30K ITPM for Sonnet), but cached input tokens don’t count toward the limit. With an 80% cache hit rate, only 20% of input tokens count against ITPM, so effective input throughput is about 5x the nominal limit (roughly 150K ITPM).

Scaled Tier — RPM & TPM at Higher Spend

For teams past entry-level. OpenAI Tier 3 ($100+), Anthropic Tier 3 ($200+), Google Tier 2 ($250+).

Provider (Tier) / Model	RPM	TPM
OpenAI (Tier 3) — GPT-5.5	5,000	2M
OpenAI (Tier 3) — GPT-4.1 Nano	5,000	4M
Anthropic (Tier 3) — Opus 4.8	2,000	800K in / 160K out
Anthropic (Tier 3) — Haiku 4.5	2,000	1M in / 200K out
Google (Tier 2) — 2.5 Pro	1,000	2M
Google (Tier 2) — 2.5 Flash	2,000	4M

At the highest standard tiers (OpenAI Tier 5 at $1,000+, Anthropic Tier 4 at $400+):

Provider (Tier) / Model	RPM	TPM
OpenAI (Tier 5) — GPT-5.5	15,000	40M
OpenAI (Tier 5) — GPT-4.1 Nano	30,000	180M
Anthropic (Tier 4) — Opus 4.8	4,000	2M in / 400K out
Anthropic (Tier 4) — Haiku 4.5	4,000	4M in / 800K out

Standout: OpenAI’s Tier 5 numbers remain the highest of any provider with published figures. GPT-4.1 Nano at 180M TPM and 30K RPM is built for high-volume classification and routing. GPT-5.5 gets 15K RPM and 40M TPM. Anthropic’s Tier 4 caps at 4K RPM, but cached tokens don’t count toward ITPM — effective throughput can be 5x+ higher with good cache hit rates.

Opus pool note: Opus 4.8, 4.7, and 4.6 share one rate limit bucket. Sending traffic to multiple Opus model versions doesn’t multiply your effective limits; all draws come from the same pool.
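For client-side budgeting, the practical consequence is to count usage per pool rather than per model. The sketch below assumes a simple fixed one-minute window and uses hypothetical model identifiers. Only the pool grouping itself comes from the note above.

```python
from typing import Dict

# Hypothetical identifiers; all three map to the one shared Opus pool.
MODEL_POOL: Dict[str, str] = {
    "opus-4.8": "opus-4.x",
    "opus-4.7": "opus-4.x",
    "opus-4.6": "opus-4.x",
}


class PoolBudget:
    """Counts requests in the current minute per shared pool, not per model."""

    def __init__(self, pool_rpm: Dict[str, int]):
        self.pool_rpm = pool_rpm
        self.used = {pool: 0 for pool in pool_rpm}

    def try_reserve(self, model: str) -> bool:
        """Reserve one request against the model's pool; False when the pool is spent."""
        pool = MODEL_POOL[model]
        if self.used[pool] >= self.pool_rpm[pool]:
            return False
        self.used[pool] += 1
        return True

    def reset_minute(self) -> None:
        """Call once per minute (fixed-window simplification of the real refill)."""
        self.used = {pool: 0 for pool in self.pool_rpm}
```

Spreading three requests across the three Opus versions uses up a 3-RPM pool just as three requests to a single version would.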

Cerebras & SambaNova

Both specialize in custom silicon for inference speed.

Cerebras

Hardware: Wafer-Scale Engine (WSE-3). Fastest published inference: ~2,100 tokens/second on Llama 3.3 70B.

Tier	RPM	TPM	TPD
Free	30	60,000	1,000,000
Paid	Higher (contact sales)	Higher	Higher

Hosts Llama 3.3 70B, Llama 3.1 8B, and other open models. The speed advantage is real: tasks that take 30 seconds on GPU-based providers finish in under 5 seconds on the WSE-3. Paid tier limits are not publicly documented.

SambaNova

Hardware: Custom RDU (Reconfigurable Dataflow Unit). Best time-to-first-token (TTFT): ~0.2 seconds.

Tier	RPM	Notes
Free	10–30 (varies by model)	Hosts up to Llama 3.1 405B for free

SambaNova offers Llama 3.1 405B on the free tier — most other free-tier providers cap out at 70B-class models. TPM and TPD limits are not publicly documented.

Audio Models — ASH & ASD

Groq and Fireworks both publish audio-specific limits.

Provider / Model	RPM	RPD	ASH	ASD	Audio min/min
Groq (Free) — Whisper Large v3	20	2,000	7,200	28,800	—
Groq (Free) — Whisper Large v3 Turbo	20	2,000	7,200	28,800	—
Fireworks AI — Whisper v3-large	—	—	—	—	200
Fireworks AI — Whisper v3-turbo	—	—	—	—	400

Groq: 7,200 ASH = 2 hours of audio per hour of wall time. 28,800 ASD = 8 hours per day. Adequate for a podcast transcription pipeline on the free tier.

Fireworks: 200 min/min for Whisper v3-large, 400 min/min for v3-turbo. Concurrent streaming capped at 10 connections.
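The Groq conversion above is plain arithmetic, and it also shows how quickly the daily cap is reached at full speed. A small sketch, with a hypothetical function name:

```python
from typing import Dict


def audio_budget(ash: int, asd: int) -> Dict[str, float]:
    """Convert audio-second caps into hours, plus the time to exhaust the daily cap."""
    return {
        "audio_hours_per_wall_hour": ash / 3600,
        "audio_hours_per_day": asd / 3600,
        # Wall-clock hours of transcribing at the full hourly rate before ASD runs out.
        "wall_hours_to_daily_cap": asd / ash,
    }
```

For the Groq free-tier figures (7,200 ASH, 28,800 ASD), the last value is 4: a pipeline running at the full hourly rate would use up the daily allowance in about four hours of wall time.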

Cloud Aggregators — Azure AI & AWS Bedrock

Both use configurable per-deployment limits, not fixed tiers. The shared ratio is 6 RPM per 1,000 TPM.
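A quick way to reason about that ratio is to derive the RPM implied by a TPM quota and the request size at which the two limits meet. The function names are illustrative, and actual quotas should be confirmed in each provider's console.

```python
def rpm_for_tpm_quota(tpm_quota: int, rpm_per_1k_tpm: int = 6) -> int:
    """RPM allowance implied by a TPM quota at the stated ratio."""
    return (tpm_quota // 1000) * rpm_per_1k_tpm


def breakeven_tokens_per_request(rpm_per_1k_tpm: int = 6) -> float:
    """Average request size below which RPM, not TPM, is exhausted first."""
    return 1000 / rpm_per_1k_tpm
```

A hypothetical 100,000 TPM deployment would carry 600 RPM. At this ratio, workloads averaging fewer than about 167 tokens per request run out of RPM before TPM.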

Azure AI (Microsoft)

Deployment Type	How Limits Work
Pay-as-you-go (Standard)	TPM quota per model per region, auto-scales with usage
Provisioned (PTU)	Reserved throughput units, no per-request limits
Global/Data Zone	Higher default quotas, multi-region routing

Multi-model: admins can select GPT-5.5, GPT-5.4, Claude Opus 4.8/Sonnet 4.6, or Gemini 3.1 Pro (verify availability per region — not all models are deployed globally). Azure AI Quotas & Limits

AWS Bedrock

Hosts Claude (Opus 4.8, Sonnet 4.6, Haiku 4.5), Llama 4, Mistral, and Nova models with per-model, per-region quotas. Default quotas vary by model and can be increased via AWS Service Quotas console. Provisioned Throughput deployments remove rate limits entirely. AWS Bedrock Quotas

More Providers — Perplexity, Alibaba (Qwen), Moonshot (Kimi)

Provider / Model	RPM (T0)	RPM (T1)	RPM (T3+)	TPM	TPD (T0)
Perplexity — Sonar Pro	50	150	1,000	—	—
Perplexity — Sonar	50	150	1,000	—	—
Perplexity — Deep Research	5	10	40	—	—
Alibaba (Qwen) — Qwen3.7-Max*	—	600*	600*	1M*	—
Alibaba (Qwen) — Qwen3.5 Plus	—	15,000	30,000	5M	—
Alibaba (Qwen) — Qwen3.5 Flash	—	15,000	30,000	10M	—
Moonshot — Kimi K2.5	3	200	5,000	—	1.5M

*Qwen3.7-Max (May 20) limits not yet formally published. Figures shown are Qwen3 Max limits carried forward as a provisional floor.

Perplexity tiers: T0 = new account, T1 = $50+, T3 = $500+. Alibaba limits shown for international (Singapore) deployment; Beijing deployment is higher (Qwen3 Max at 30K RPM — Qwen3.7-Max Beijing limits not yet confirmed). Moonshot T1 = $10+ cumulative recharge; T0 has 1.5M TPD cap, T1+ is unlimited.

Standout: Alibaba’s Qwen3.5 Flash at 30K RPM / 10M TPM (Beijing) is the highest confirmed throughput among the providers in this section, and its RPM matches OpenAI’s Tier 5 GPT-4.1 Nano (30K RPM, 180M TPM). Moonshot’s T5 tier (10K RPM, 1,000 concurrent connections, $3,000+ recharge) is competitive at the high end.

Provider Notes

OpenAI: GPT-5.5 and GPT-5.4 share identical rate limit profiles across all tiers. GPT-4.1 Nano at Tier 5 (180M TPM, 30K RPM) is the throughput ceiling for any published provider. GPT-5.5 Instant is a ChatGPT product tier — not accessible via the API under that name; API access goes through the GPT-5.5 model identifier. GPT-5.4-Cyber is available only to vetted security teams; rate limits not disclosed. GPT-5.2 is being phased out. GPT-5.3 Codex remains available for coding workloads. Batch API offers 50% discount with higher queue limits. Assistants API deprecates August 2026; replaced by Responses API. Docs
Anthropic: Opus 4.8 is the current flagship (May 2026). Opus 4.8, 4.7, and 4.6 share a combined rate limit pool. You cannot independently max out multiple Opus 4.x versions simultaneously. Cached input tokens don’t count toward ITPM on current 4.x models. No free tier; lowest entry is $5. Fast mode on Opus 4.6 draws from a separate dedicated pool. Claude 3 Haiku retired April 2026. Docs
Google: Static rate limit tables removed from public docs in Q1 2026. Actual limits only visible in AI Studio dashboard (aistudio.google.com/rate-limit). Gemini 3.1 Pro and 3.1 Flash-Lite are in preview with conservative limits. Gemini 3.5 Flash and Gemini Omni were released at Google I/O 2026 — no published rate limit figures for either yet. Gemini 2.0 Flash/Flash-Lite shut down June 1, 2026 — those endpoints no longer accept requests. Gemini 2.5 Pro/Flash remain available as the previous generation. Docs
Groq: Free tier publishes exact numbers (30–60 RPM and 6K–30K TPM depending on model). Daily token budgets (TPD) apply alongside TPM. Cached tokens don’t count toward TPM/TPD. Hosts Llama 4 Scout and Qwen3-32B on free tier. Runs on custom LPU hardware. Docs
xAI: 5-tier structure ($0/$50/$250/$1K/$5K thresholds) but numerical RPM/TPM only visible in xAI Console. Docs
Mistral: European data residency (GDPR). 5-tier structure (Free through Tier 4 at $500+). Limits are enforced as RPS (not RPM), TPM, and tokens per month. Actual numbers require login at admin.mistral.ai. Docs
DeepSeek: Fully dynamic concurrency limits based on server load. No published RPM/TPM. V4-Flash and V4-Pro are the current models. V3.2 aliases deprecate July 24, 2026 — update any hardcoded model references now. V4-Flash pricing: $0.14/M input (cache miss), $0.028/M (cache hit), $0.28/M output. Docs
Cerebras: Fastest inference (~2,100 tok/s on Llama 3.3 70B). Free tier: 30 RPM, 60K TPM, 1M TPD. Paid tier limits not publicly documented. Custom WSE-3 silicon. Docs
SambaNova: Best TTFT (~0.2s). Free tier: 10–30 RPM depending on model size. Hosts up to Llama 3.1 405B for free. TPM and TPD limits not publicly documented. Custom RDU hardware. Docs
Together AI: Dynamic rate limits since January 2026. No fixed tiers or published numbers. Limits grow with sustained usage and are returned in API response headers. Docs
Fireworks AI: Dynamic ceiling up to 6,000 RPM (soft limit starts ~1 RPS, doubles hourly). On-demand GPU deployments remove limits. Spending tier caps: $50–$50K/month by tier. Docs
Cohere: Command A+ (May 20, Apache 2.0, 218B MoE / 25B active) is now the flagship. Runs on 2x H100; self-hosting is viable and removes API rate limits entirely for teams with the hardware. API rate limit figures for Command A+ not yet published — 500 RPM (production chat) shown in tables is the Command R+ figure. Trial keys: 1,000 calls/month. Docs
Perplexity: 6 tiers (T0–T5) based on cumulative spend. Leaky bucket algorithm. Deep Research model has very low limits (5–100 RPM). Agent API: 50–2,000 RPM by tier. Docs
Alibaba (Qwen): Qwen3.7-Max (May 20) is the current flagship; published rate limits not yet confirmed. Region-specific limits apply: Beijing deployment is more generous than Singapore/Global. Qwen3.5 Flash at 30K RPM / 10M TPM (Beijing) matches the 30K RPM of OpenAI’s Tier 5 GPT-4.1 Nano, though at a much lower TPM (10M vs. 180M). Docs
Moonshot (Kimi): 6 tiers (T0–T5). T0 ($1 recharge): 3 RPM, 1 concurrent, 1.5M TPD. T5 ($3,000): 10K RPM, 1,000 concurrent, 5M TPM, unlimited TPD. Automatic 75% caching discount applied without opt-in. Docs
AWS Bedrock: Hosts Claude (Opus 4.8, Sonnet 4.6, Haiku 4.5), Llama 4, Mistral, Nova with per-model per-region quotas. 6 RPM per 1K TPM ratio. Provisioned Throughput removes limits. Docs
Azure AI: Hosts GPT-5.5, GPT-5.4, Claude Opus 4.8/Sonnet 4.6, and Gemini 3.1 Pro across regions. PTU deployments remove per-request limits. Verify model availability per region before planning deployments. Docs
NVIDIA NIM: Hosted API is for prototyping (40 RPM). Self-hosted NIM containers have no limits. Docs

Tips for managing rate limits

For a deep dive on token pricing, cost optimization strategies, caching mechanics, model routing architectures, and production cost modeling, see LLM Token Costs and Efficiency. This section covers rate limit management specifically.

Handling 429s

Use exponential backoff when you hit 429s. The Anthropic SDK, OpenAI SDK, and most third-party clients handle this automatically. Don’t build your own retry loop unless you need custom jitter or circuit-breaking behavior.
Check response headers before you hit the wall. Together AI and Fireworks return your current rate limit state in every API response; reading these headers in production enables proactive throttling rather than reactive retries. DeepSeek adjusts concurrency limits dynamically based on server load, so treat any ceiling you observe there as temporary.
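For the case where custom jitter is needed, the two points above might look like the sketch below. `RateLimited` is a hypothetical exception that your own client wrapper would raise on a 429. The header name is passed in as a parameter because it differs by provider, so check each provider's documentation for the real name.

```python
import random
import time
from typing import Callable, Mapping, Optional


class RateLimited(Exception):
    """Hypothetical error raised by the caller's client wrapper on HTTP 429."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds suggested by the server, if any


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_retries(fn: Callable[[], object], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep):
    """Call `fn`, retrying on RateLimited; re-raise after the final attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited as err:
            if attempt == max_attempts - 1:
                raise
            # Prefer the server's hint when one was provided.
            delay = err.retry_after if err.retry_after is not None else backoff_delay(attempt)
            sleep(delay)


def should_pause(headers: Mapping[str, str], remaining_key: str, floor: int = 1) -> bool:
    """Proactive check: True when the remaining-quota header is at or below `floor`."""
    value = headers.get(remaining_key)
    return value is not None and value.isdigit() and int(value) <= floor
```

Calling `should_pause` on each response lets a worker slow down before the first 429 arrives, and the retry loop then only handles the cases that slip through.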

Caching for throughput (not just cost)

Prompt caching has a rate limit benefit separate from its cost benefit. The cost savings are covered in the token costs post. The rate limit implications:

Anthropic: Cached input tokens don’t count toward ITPM limits at all on Claude 4.x models. With an 80% cache hit rate, a 2M ITPM limit effectively handles 10M total input tokens per minute. This is a throughput multiplier, not just a cost reduction.
Groq: Cached tokens don’t count toward TPM/TPD limits on the free tier, effectively expanding your daily token budget.
Moonshot (Kimi): Automatic 75% caching discount applied without opt-in. No cache management required.
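The Anthropic multiplier follows from dividing the limit by the uncached fraction of input tokens. A minimal sketch, assuming the hit rate is measured as a share of input tokens (the function name is illustrative):

```python
def effective_itpm(itpm_limit: int, cache_hit_rate: float) -> float:
    """Total input tokens per minute servable when cached tokens are exempt from ITPM."""
    if not 0.0 <= cache_hit_rate < 1.0:
        raise ValueError("cache_hit_rate must be in [0, 1)")
    return itpm_limit / (1.0 - cache_hit_rate)
```

At an 80% hit rate this gives about 10M for a 2M ITPM limit and about 150K for a 30K limit, matching the figures quoted in this post. The multiplier is sensitive to the hit rate: at 50% it is only 2x.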

Batch APIs: higher queue limits

OpenAI, Anthropic, and Google all offer batch endpoints with queue limits that far exceed real-time TPM. OpenAI’s GPT-5.5 gets 1.5M batch queue tokens at Tier 1 vs. 500K real-time TPM. For workloads that don’t need sub-second responses, batch endpoints let you move more tokens through tighter rate limits. See batch API cost savings for pricing details.

Model routing for rate limit distribution

Routing requests to different models by complexity distributes rate limit pressure across multiple buckets instead of concentrating it on your most constrained tier. A three-tier architecture (budget for triage, mid-tier for generation, flagship for reasoning) also cuts costs 60–85%. Full routing architecture with cost modeling is in the token costs post.
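A three-tier router can be sketched as a classifier plus a lookup table. Everything below is illustrative: the model identifiers are placeholders, and the keyword and length heuristic stands in for whatever complexity classifier a real system would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    tier: str
    model: str  # placeholder identifier, not a real model name


ROUTES = {
    "triage": Route("budget", "budget-model"),
    "generation": Route("mid", "mid-model"),
    "reasoning": Route("flagship", "flagship-model"),
}


def classify(prompt: str) -> str:
    """Toy heuristic (an assumption): keywords and length approximate task complexity."""
    text = prompt.lower()
    if any(keyword in text for keyword in ("prove", "debug", "plan", "why")):
        return "reasoning"
    if len(prompt) > 200:
        return "generation"
    return "triage"


def route(prompt: str) -> Route:
    """Pick the tier whose rate limit bucket should absorb this request."""
    return ROUTES[classify(prompt)]
```

Because each tier draws on a different model's limits, short triage requests no longer compete with reasoning requests for the flagship bucket.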

Custom silicon for fewer concurrent connections

Cerebras (~2,100 tok/s) and SambaNova (~0.2s TTFT) are faster than GPU-based providers by a large margin. A task that requires 10 parallel GPU-based API calls to meet a latency SLA might need 2–3 calls on Cerebras. Fewer concurrent calls means less rate limit pressure per unit of output.

Self-hosting as an escape hatch

Cohere’s Command A+ (Apache 2.0, 218B MoE / 25B active) runs on 2x H100. For teams already operating H100 capacity, self-hosting removes API rate limits entirely. The economics only work at scale — hardware cost needs to be weighed against API spend — but it’s now a realistic option for a broader tier of teams than it was six months ago. The same applies to Llama 4, Devstral 2, Mistral Large 3, and other open-weight models.

Watch for shared pools and hidden constraints

Anthropic model pools: Opus 4.8, 4.7, and 4.6 share one rate limit bucket. Sonnet 4.6 and 4.5 share another. Sending traffic to multiple model versions within the same pool doesn’t multiply your effective limits; it draws from the same bucket.
Google’s invisible limits: With static rate limit tables removed from public docs, you must check AI Studio (aistudio.google.com/rate-limit) to see your actual per-project limits. Third-party figures — including those in this post for 2.5 Pro/Flash — are the last confirmed published numbers and may be stale. Gemini 3.5 Flash and Omni have no published figures at all yet.
DeepSeek’s dynamic ceiling: No published RPM/TPM means no guaranteed minimum. Under heavy load, effective concurrency can drop without warning. Build fallback routing to a second provider if DeepSeek availability is critical to your pipeline. The July 24 V3.2 alias deprecation is an additional forcing function to audit your model endpoint references now.
New model lag: Qwen3.7-Max and Command A+ both launched May 20, 2026. Gemini 3.5 Flash and Gemini Omni launched at Google I/O 2026. None of these models have formal rate limit tables yet. Figures in this post for Qwen3.7-Max and Command A+ are provisional floors based on predecessor models.
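The fallback routing suggested for DeepSeek can be as simple as an ordered list of providers. In this sketch `ProviderUnavailable` is a hypothetical exception that your client wrappers would raise on a 429 or capacity rejection, and each provider is represented by a plain callable.

```python
from typing import Callable, List, Sequence, Tuple


class ProviderUnavailable(Exception):
    """Hypothetical error for a 429 or capacity rejection from one provider."""


def complete_with_fallback(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, response) from the first success."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderUnavailable as err:
            errors.append(f"{name}: {err}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the response makes it easy to log how often the fallback path is taken, which is a useful signal that the primary's dynamic ceiling has dropped.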