API Rate Limits Compared: Every Major LLM Provider (April 2026)
Side-by-side rate limit comparison across 17 LLM API providers — OpenAI, Anthropic, Google, Groq, xAI, DeepSeek, Mistral, Cerebras, SambaNova, Perplexity, Alibaba, Moonshot, and more — as of April 2026.
Second edition. The first (March 2026) covered 15 providers. This update adds Cerebras and SambaNova, covers new flagship models (GPT-5.5, Claude Opus 4.7, DeepSeek V4, Gemini 3.1 Pro), and reflects structural changes at several providers since March.
Table of Contents
- What changed since March
- Free Tier — RPM & RPD
- Free Tier — TPM & TPD
- Entry Paid Tier — RPM & RPD
- Entry Paid Tier — TPM
- Scaled Tier — RPM & TPM
- New Providers — Cerebras & SambaNova
- Audio Models — ASH & ASD
- Cloud Aggregators — Azure AI & AWS Bedrock
- More Providers — Perplexity, Alibaba, Moonshot
- Provider Notes
- Tips for managing rate limits
Last updated: April 25, 2026.
What changed since March
New models in rate limit tables:
- OpenAI: GPT-5.5 (Apr 24) and GPT-5.4 mini/nano (Mar 17) joined the lineup. GPT-5.5 shares rate limits with GPT-5.4. Free tier still excluded from all GPT-5.x models.
- Anthropic: Claude Opus 4.7 (Apr 16) shares the Opus 4.x rate limit pool with Opus 4.6. No separate bucket.
- DeepSeek: V4-Flash and V4-Pro (Apr 24) replaced V3.2 as the current models. Rate limits remain fully dynamic (no published caps).
- Google: Gemini 3.1 Pro and 3.1 Flash-Lite are in preview with conservative limits. Gemini 2.0 Flash and Flash-Lite deprecated; shutting down June 1, 2026.
Structural changes:
- Google removed static rate limit tables from public docs. Limits are now only visible in AI Studio (aistudio.google.com/rate-limit). The numbers below for 2.5-series models are the last confirmed published figures.
- Together AI switched to fully dynamic rate limits in January 2026. No fixed tiers; limits grow with sustained usage.
- Groq now hosts GPT-OSS (20B/120B) and Qwen3-32B on its free tier.
New providers added: Cerebras (published free tier, fastest inference) and SambaNova (published free tier, best TTFT).
What are rate limits?
Rate limits cap how much you can use an API within a given time window. Providers enforce them to manage capacity and ensure fair access. Exceeding a limit returns a 429 Too Many Requests error.
Most providers use the token bucket algorithm: capacity refills continuously up to your maximum, rather than resetting at fixed intervals. A 60 RPM limit means roughly 1 request per second with steady refill, not 60 requests then a hard stop.
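The continuous-refill behavior can be sketched as a minimal token bucket. This is an illustrative Python model of the algorithm, not any provider's actual implementation; the class name and refill granularity are assumptions.

```python
import time

class TokenBucket:
    """Continuous-refill bucket: tokens refill at rate_per_sec, capped at capacity."""

    def __init__(self, capacity: float, rate_per_sec: float):
        self.capacity = capacity      # e.g. 60 for a 60 RPM limit
        self.rate = rate_per_sec      # e.g. 1.0 (60 requests / 60 seconds)
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill continuously for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                  # caller should wait or back off

# A 60 RPM limit behaves like capacity=60 refilling at 1 token/second:
bucket = TokenBucket(capacity=60, rate_per_sec=1.0)
```

Because refill is continuous, a client that paces itself near 1 request/second never sees a 429, while a burst of 60 requests drains the bucket and forces a wait.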
The metrics
- RPM (Requests per minute): API calls per minute, regardless of size. A 10-token request and a 100K-token request both count as one.
- RPD (Requests per day): Daily cap on total API calls. Some providers use this instead of (or alongside) RPM, especially on free tiers.
- TPM (Tokens per minute): Total tokens (input + output) processed per minute. Usually the binding constraint in production.
- TPD (Tokens per day): Daily token cap. More common on free tiers.
- ITPM/OTPM (Input/Output tokens per minute): Anthropic separates input and output token limits, giving finer control. Cached input tokens don’t count toward ITPM on current Claude 4.x models.
- ASH/ASD (Audio seconds per hour/day): For speech models like Whisper.
RPM limits set your concurrency ceiling. TPM limits set your throughput. For batch workloads, RPD and TPD matter more. For real-time apps, RPM and TPM hit first.
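Which limit binds first falls out of simple arithmetic on your average request size. A sketch (plug in whatever your tier publishes):

```python
def binding_limit(rpm: int, tpm: int, avg_tokens_per_request: int) -> str:
    """Return which per-minute limit caps throughput first."""
    max_requests_by_tokens = tpm / avg_tokens_per_request
    return "TPM" if max_requests_by_tokens < rpm else "RPM"

# Small requests exhaust RPM first; large requests exhaust TPM first.
binding_limit(rpm=500, tpm=500_000, avg_tokens_per_request=100)    # "RPM"
binding_limit(rpm=500, tpm=500_000, avg_tokens_per_request=5_000)  # "TPM"
```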
Free Tier — Requests per Minute (RPM) & Requests per Day (RPD)
| Metric | Google 2.5 Pro | Google 2.5 Flash | Google 2.5 Flash-Lite | Groq Llama 4 Scout | Groq Qwen3 32B | Cerebras Llama 3.3 70B | SambaNova Llama 3.1 405B | Fireworks OS models | Cohere Command R+ |
|---|---|---|---|---|---|---|---|---|---|
| RPM | 5 | 10 | 15 | 30 | 60 | 30 | 10–30 | 10 | 20 |
| RPD | 50–100 | 250–1,500 | 1,000 | 1,000 | 1,000 | — | — | — | ~33* |
*Cohere trial keys get 1,000 calls/month across all endpoints, roughly 33/day.
Note: OpenAI, Anthropic, xAI, DeepSeek, and Mistral have no free API tier for current flagship models. OpenAI’s free tier does not support any GPT-5.x model. Anthropic’s lowest entry is $5 (Tier 1). xAI and Mistral gate numerical limits behind their respective consoles.
Change from March: Google’s free tier RPD dropped 50–80% in late 2025. Gemini 2.5 Pro went from ~100–200 RPD to 50–100. Groq added GPT-OSS (20B/120B) and Qwen3-32B to its free tier.
Free Tier — Tokens per Minute (TPM) & Tokens per Day (TPD)
| Metric | Google 2.5 Pro | Google 2.5 Flash | Google 2.5 Flash-Lite | Groq Llama 4 Scout | Groq Qwen3 32B | Cerebras Llama 3.3 70B | SambaNova Llama 3.1 405B |
|---|---|---|---|---|---|---|---|
| TPM | 250,000 | 250,000 | 250,000 | 30,000 | 6,000 | 60,000 | — |
| TPD | — | — | — | 500,000 | 500,000 | 1,000,000 | — |
Standout: Google still leads with 250K TPM on the free tier, but the RPM/RPD caps (5–15 RPM) mean it’s best for fewer, larger requests. Cerebras offers 60K TPM with 1M TPD and ~2,100 tokens/second inference speed on Llama 3.3 70B, making it the fastest free option by throughput. SambaNova publishes RPM but not TPM/TPD specifics.
Entry Paid Tier — Requests per Minute (RPM) & Requests per Day (RPD)
The tier most developers start with. OpenAI Tier 1 ($5), Anthropic Tier 1 ($5), Google Tier 1 (pay-as-you-go), and others at their lowest paid level.
| Metric | OpenAI GPT-5.5 | OpenAI GPT-5.4 nano | Anthropic Sonnet 4.x | Anthropic Haiku 4.5 | Google 2.5 Pro | Google 2.5 Flash | Google 2.5 Flash-Lite | xAI Grok 4.20 | DeepSeek V4 Flash | Fireworks OS models | Cohere Command R+ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RPM | 500 | 500 | 50 | 50 | 150 | 300 | 300 | Console† | Dynamic‡ | ≤6,000 | 500 |
| RPD | — | — | — | — | 1,000 | 1,500 | 1,500 | — | — | — | — |
†xAI publishes tier thresholds ($0/$50/$250/$1K/$5K) but numerical RPM/TPM are only visible in the xAI Console after login. ‡DeepSeek uses fully dynamic concurrency limits based on server load. No fixed RPM/TPM published.
Change from March: Mistral also moved to console-only limits. Actual numbers require login at admin.mistral.ai/plateforme/limits. Both xAI and Mistral are excluded from RPM columns where no public figure exists.
Standout: Fireworks can spike to 6,000 RPM but it’s a dynamic ceiling, not guaranteed (soft limit starts at ~1 RPS and doubles hourly). Anthropic’s 50 RPM at Tier 1 is the lowest here, but jumps 20x to 1,000 RPM at Tier 2 ($40).
Entry Paid Tier — Tokens per Minute (TPM)
| Metric | OpenAI GPT-5.5 | OpenAI GPT-5.4 nano | Anthropic Sonnet 4.x | Anthropic Haiku 4.5 | Google 2.5 Pro | Google 2.5 Flash | Google 2.5 Flash-Lite | DeepSeek V4 Flash |
|---|---|---|---|---|---|---|---|---|
| TPM | 500K | 200K | 30K in / 8K out | 50K in / 10K out | 1M | 2M | 2M | Dynamic |
Standout: Google 2.5 Flash and Flash-Lite lead at 2M TPM. GPT-5.5 gets 500K TPM at Tier 1 (same as GPT-5.4). Anthropic looks low on paper (30K ITPM for Sonnet), but cached input tokens don’t count toward the limit. With 80% cache hit rate, effective throughput is 5x higher (150K+ effective ITPM).
Scaled Tier — RPM & TPM at Higher Spend
For teams past entry-level. OpenAI Tier 3 ($100+), Anthropic Tier 3 ($200+), Google Tier 2 ($250+).
| Metric | OpenAI GPT-5.5 (T3) | OpenAI GPT-5.4 nano (T3) | Anthropic Opus 4.x (T3) | Anthropic Haiku 4.5 (T3) | Google 2.5 Pro (T2) | Google 2.5 Flash (T2) |
|---|---|---|---|---|---|---|
| RPM | 5,000 | 5,000 | 2,000 | 2,000 | 1,000 | 2,000 |
| TPM | 2M | 4M | 800K in / 160K out | 1M in / 200K out | 2M | 4M |
At the highest standard tiers (OpenAI Tier 5 at $1,000+, Anthropic Tier 4 at $400+):
| Metric | OpenAI GPT-5.5 (T5) | OpenAI GPT-5.4 mini (T5) | Anthropic Opus 4.x (T4) | Anthropic Haiku 4.5 (T4) |
|---|---|---|---|---|
| RPM | 15,000 | 30,000 | 4,000 | 4,000 |
| TPM | 40M | 180M | 2M in / 400K out | 4M in / 800K out |
Standout: OpenAI’s Tier 5 numbers are staggering. GPT-5.4 mini at 180M TPM and 30K RPM is built for high-volume classification and routing. GPT-5.5 gets 15K RPM and 40M TPM. Anthropic’s Tier 4 caps at 4K RPM but cached tokens are free, so effective throughput can be 5x+ higher with good cache hit rates.
Change from March: GPT-5.5 now matches GPT-5.4 at every tier (identical rate limit profiles). The old GPT-4.1 nano Tier 5 figure of 150M TPM has been overtaken by GPT-5.4 mini at 180M TPM.
New Providers — Cerebras & SambaNova
Both specialize in custom silicon for inference speed. Neither existed in the March edition.
Cerebras
Hardware: Wafer-Scale Engine (WSE-3). Fastest published inference: ~2,100 tokens/second on Llama 3.3 70B.
| Tier | RPM | TPM | TPD |
|---|---|---|---|
| Free | 30 | 60,000 | 1,000,000 |
| Paid | Higher (contact sales) | Higher | Higher |
Hosts Llama 3.3 70B, Llama 3.1 8B, and other open models. The speed advantage is real: tasks that take 30 seconds on GPU-based providers finish in under 5 seconds. Paid tier limits are not publicly documented.
SambaNova
Hardware: Custom RDU (Reconfigurable Dataflow Unit). Best time-to-first-token (TTFT): ~0.2 seconds.
| Tier | RPM | Notes |
|---|---|---|
| Free | 10–30 (varies by model) | Hosts up to Llama 3.1 405B for free |
SambaNova is notable for offering Llama 3.1 405B on the free tier. Most other free-tier providers cap out at 70B-class models. TPM and TPD limits are not publicly documented.
Audio Models — ASH & ASD
Groq and Fireworks both publish audio-specific limits.
| Metric | Groq Whisper Large v3 (Free) | Groq Whisper Large v3 Turbo (Free) | Fireworks Whisper v3-large | Fireworks Whisper v3-turbo |
|---|---|---|---|---|
| RPM | 20 | 20 | — | — |
| RPD | 2,000 | 2,000 | — | — |
| ASH | 7,200 | 7,200 | — | — |
| ASD | 28,800 | 28,800 | — | — |
| Audio min/min | — | — | 200 | 400 |
Groq: 7,200 ASH = 2 hours of audio per hour of wall time. 28,800 ASD = 8 hours per day. Plenty for a podcast transcription pipeline on the free tier.
Fireworks: 200 min/min for Whisper v3-large, 400 min/min for v3-turbo. Concurrent streaming capped at 10 connections.
Cloud Aggregators — Azure AI & AWS Bedrock
Both use configurable per-deployment limits, not fixed tiers. The shared ratio is 6 RPM per 1,000 TPM.
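Under a fixed 6-RPM-per-1,000-TPM ratio, the RPM that accompanies a TPM quota is simple arithmetic. A sketch of the stated ratio (not an Azure or Bedrock API call):

```python
def rpm_from_tpm_quota(tpm_quota: int) -> int:
    """Apply the shared aggregator ratio: 6 RPM granted per 1,000 TPM of quota."""
    return (tpm_quota // 1_000) * 6

rpm_from_tpm_quota(100_000)  # a 100K TPM deployment implies 600 RPM
```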
Azure AI (Microsoft)
| Deployment Type | How Limits Work |
|---|---|
| Pay-as-you-go (Standard) | TPM quota per model per region, auto-scales with usage |
| Provisioned (PTU) | Reserved throughput units, no per-request limits |
| Global/Data Zone | Higher default quotas, multi-region routing |
Multi-model: admins can select GPT-5.5, GPT-5.4, Claude Opus/Sonnet 4.6, or Gemini 3.1 Pro. Azure AI Quotas & Limits
AWS Bedrock
Hosts Claude (Opus 4.7, Sonnet 4.6, Haiku 4.5), Llama 4, Mistral, and Nova models with per-model, per-region quotas. Default quotas vary by model and can be increased via AWS Service Quotas console. Provisioned Throughput deployments remove rate limits entirely. AWS Bedrock Quotas
More Providers — Perplexity, Alibaba (Qwen), Moonshot (Kimi)
| Metric | Perplexity Sonar Pro | Perplexity Sonar | Perplexity Deep Research | Alibaba Qwen3 Max | Alibaba Qwen3.5 Plus | Alibaba Qwen3.5 Flash | Moonshot Kimi K2.5 |
|---|---|---|---|---|---|---|---|
| RPM (T0) | 50 | 50 | 5 | — | — | — | 3 |
| RPM (T1) | 150 | 150 | 10 | 600 | 15,000 | 15,000 | 200 |
| RPM (T3+) | 1,000 | 1,000 | 40 | 600 | 30,000 | 30,000 | 5,000 |
| TPM | — | — | — | 1M | 5M | 10M | — |
| TPD (T0) | — | — | — | — | — | — | 1.5M |
Perplexity tiers: T0 = new account, T1 = $50+, T3 = $500+. Alibaba limits shown for international (Singapore) deployment; Beijing deployment is higher (Qwen3 Max jumps to 30K RPM). Moonshot T1 = $10+ cumulative recharge; T0 has 1.5M TPD cap, T1+ is unlimited.
Change from March: Perplexity added Sonar Deep Research (very low limits: 5–100 RPM depending on tier) and an Agent API (50–2,000 RPM across tiers). Moonshot expanded from 2 tiers (T0–T1) to 6 tiers (T0–T5), with T5 reaching 10K RPM and 5M TPM at $3,000+ cumulative recharge.
Standout: Alibaba’s Qwen3.5 Flash at 15K RPM and 10M TPM remains the highest throughput in this entire post. Moonshot’s new tier structure is competitive at the high end (T5: 10K RPM, 1,000 concurrent connections).
Provider Notes
- OpenAI: GPT-5.5 and GPT-5.4 share identical rate limit profiles across all tiers. GPT-5.4 mini at Tier 5 (180M TPM, 30K RPM) is the throughput king. Free tier excluded from all GPT-5.x models. Batch API offers 50% discount with higher queue limits. Assistants API deprecated August 2026; replaced by Responses API. Docs
- Anthropic: Opus 4.7 shares a combined rate limit pool with all Opus 4.x variants. You cannot independently max out Opus 4.7 and 4.6 simultaneously. Cached input tokens don’t count toward ITPM on current 4.x/4.5 models. No free tier; lowest entry is $5. Fast mode on Opus 4.6 draws from a separate dedicated pool. Docs
- Google: Removed static rate limit tables from public docs in Q1 2026. Actual limits now only visible in AI Studio dashboard. Gemini 3.1 Pro and 3.1 Flash-Lite are in preview with conservative limits. Gemini 2.0 Flash/Flash-Lite shut down June 1, 2026. Docs
- Groq: Free tier publishes exact numbers (30 RPM, 6K–30K TPM depending on model). Daily token budgets (TPD) alongside TPM. Cached tokens don’t count. Now hosts GPT-OSS and Qwen3-32B. Runs on custom LPU hardware. Docs
- xAI: 5-tier structure ($0/$50/$250/$1K/$5K thresholds) but numerical RPM/TPM only visible in xAI Console. Docs
- Mistral: European data residency (GDPR). 5-tier structure (Free through Tier 4 at $500+). Limits enforced per RPS (not RPM), TPM, and tokens/month. Actual numbers require login at admin.mistral.ai. New model: Mistral Small 4 (v26.03). Docs
- DeepSeek: Fully dynamic concurrency limits based on server load. No published RPM/TPM. V4-Flash and V4-Pro are the current models (V3.2 aliases deprecated July 24, 2026). V4-Flash pricing: $0.14/M input (cache miss), $0.028/M (cache hit), $0.28/M output. Docs
- Cerebras: Fastest inference (~2,100 tok/s). Free tier: 30 RPM, 60K TPM, 1M TPD. Paid limits not publicly documented. Custom WSE-3 silicon.
- SambaNova: Best TTFT (~0.2s). Free tier: 10–30 RPM depending on model size. Hosts up to Llama 3.1 405B for free. Custom RDU hardware.
- Together AI: Dynamic rate limits since January 2026. No fixed tiers or published numbers. Limits grow with sustained usage and are returned in API response headers. Docs
- Fireworks AI: Dynamic ceiling up to 6,000 RPM (soft limit starts ~1 RPS, doubles hourly). On-demand GPU deployments remove limits. Spending tier caps: $50–$50K/month by tier. Docs
- Cohere: Chat endpoint: 20 RPM (trial), 500 RPM (production). Rerank: 1,000 RPM production. Embed: 2,000 inputs/min. Trial keys capped at 1,000 calls/month. No TPM limits published. Docs
- Perplexity: 6 tiers (T0–T5) based on cumulative spend. Leaky bucket algorithm. Deep Research model has very low limits (5–100 RPM). New Agent API: 50–2,000 RPM by tier. Docs
- Alibaba (Qwen): Region-specific limits. Beijing deployment is more generous than Singapore/Global. Qwen3.5 Flash at 30K RPM / 10M TPM (Beijing) is the highest throughput of any provider listed. Docs
- Moonshot (Kimi): Expanded to 6 tiers (T0–T5). T0 ($1 recharge): 3 RPM, 1 concurrent, 1.5M TPD. T5 ($3,000): 10K RPM, 1,000 concurrent, 5M TPM, unlimited TPD. Automatic 75% caching discount. Docs
- AWS Bedrock: Hosts Claude, Llama, Mistral, Nova with per-model per-region quotas. 6 RPM per 1K TPM ratio. Provisioned Throughput removes limits. Docs
- NVIDIA NIM: Hosted API is for prototyping (40 RPM). Self-hosted NIM containers have no limits. Docs
Tips for managing rate limits
For a deep dive on token pricing, cost optimization strategies, caching mechanics, model routing architectures, and production cost modeling, see LLM Token Costs and Efficiency. This section focuses on rate limit management specifically.
Handling 429s
- Use exponential backoff when you hit 429s. The Anthropic SDK, OpenAI SDK, and most third-party clients handle this automatically. Don’t build your own retry loop unless you need custom jitter or circuit-breaking behavior.
- Check response headers before you hit the wall. Together AI and Fireworks return your current rate limit state in every API response. DeepSeek adjusts concurrency limits dynamically based on server load. Read these headers in production to implement proactive throttling rather than reactive retries.
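If you do need a custom retry loop (for jitter control or circuit breaking), a minimal full-jitter backoff looks like the sketch below. `RateLimitError` is a placeholder for whatever 429 exception your client raises, not a specific SDK's class:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for your client's 429 exception."""

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff: random delay in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def with_retries(call_api, max_attempts: int = 6, base: float = 1.0):
    """Retry a callable on rate-limit errors, sleeping with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return call_api()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the 429 to the caller
            time.sleep(backoff_delay(attempt, base=base))
```

Full jitter (random delay up to the exponential ceiling) spreads retries from many clients across time, which matters more than the exact base and cap values.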
Caching for throughput (not just cost)
Prompt caching has a rate limit benefit that is separate from its cost benefit. The cost savings are covered in the token costs post. The rate limit implications:
- Anthropic: Cached input tokens don’t count toward ITPM limits at all on Claude 4.x/4.5 models. With 80% cache hit rate, a 2M ITPM limit effectively handles 10M total input tokens per minute. This is a throughput multiplier, not just a cost reduction.
- Groq: Cached tokens don’t count toward TPM/TPD limits on the free tier, effectively expanding your daily token budget.
- Moonshot (Kimi): Automatic 75% caching discount applied without opt-in. No cache management required.
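The throughput math in the Anthropic bullet generalizes to any provider that exempts cached tokens from its input limit. A sketch (the 80% hit rate is illustrative):

```python
def effective_itpm(itpm_limit: int, cache_hit_rate: float) -> float:
    """Total input tokens/min you can push when cached tokens don't count toward ITPM."""
    # Only the uncached fraction of traffic counts against the limit.
    return itpm_limit / (1 - cache_hit_rate)

effective_itpm(2_000_000, 0.80)  # 2M ITPM at 80% hits ≈ 10M total input tokens/min
```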
Batch APIs: higher queue limits
OpenAI, Anthropic, and Google all offer batch endpoints with queue limits that far exceed real-time TPM. OpenAI’s GPT-5.5 gets 1.5M batch queue tokens at Tier 1 vs. 500K real-time TPM. For workloads that don’t need sub-second responses, batch endpoints let you move more tokens through tighter rate limits. See batch API cost savings for pricing details.
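Batch submission is mostly file preparation. Below is a sketch of building OpenAI-style batch JSONL lines (one request object per line); the general shape follows OpenAI's documented batch format, but the model name is from this post's tables and details should be verified against current docs:

```python
import json

def to_batch_jsonl(prompts: list[str], model: str = "gpt-5.5") -> str:
    """Serialize prompts into OpenAI-style batch JSONL: one request per line."""
    lines = []
    for i, prompt in enumerate(prompts):
        lines.append(json.dumps({
            "custom_id": f"req-{i}",  # your key for matching results back to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": model,
                     "messages": [{"role": "user", "content": prompt}]},
        }))
    return "\n".join(lines)
```

You then upload the file (purpose `batch`) and create the batch job; the queue limit applies to the file's total tokens rather than your real-time TPM.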
Model routing for rate limit distribution
Routing requests to different models by complexity distributes rate limit pressure across multiple buckets instead of concentrating it on your most constrained tier. A three-tier architecture (budget for triage, mid-tier for generation, flagship for reasoning) also cuts costs 60–85%. Full routing architecture with cost modeling is in the token costs post.
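A complexity router can be as simple as a few heuristics mapping each request to a tier. This is a toy sketch; the model names are placeholders for whichever budget/mid/flagship models you actually deploy, and real routers usually use a classifier rather than string checks:

```python
def pick_model(prompt: str) -> str:
    """Toy complexity router: each returned model draws from its own rate limit bucket."""
    if len(prompt) < 200:                 # short lookups -> budget tier
        return "budget-model"
    if "step by step" in prompt.lower():  # explicit reasoning request -> flagship
        return "flagship-model"
    return "mid-tier-model"               # default generation tier
```

Because each model (and often each provider) has an independent RPM/TPM pool, even this crude split keeps a burst of cheap requests from starving your flagship quota.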
Custom silicon for fewer concurrent connections
Cerebras (~2,100 tok/s) and SambaNova (~0.2s TTFT) are faster than GPU-based providers by a large margin. A task that requires 10 parallel GPU-based API calls to meet a latency SLA might need 2–3 calls on Cerebras. Fewer calls means less rate limit pressure.
Watch for shared pools and hidden constraints
- Anthropic model pools: Opus 4.7 and 4.6 share one rate limit bucket. Sonnet 4.6 and 4.5 share another. Sending traffic to both model versions doesn’t double your effective limits; it draws from the same pool. Plan your model selection accordingly.
- Google’s invisible limits: With static rate limit tables removed from public docs, you must check AI Studio (aistudio.google.com/rate-limit) to see your actual per-project limits. Don’t assume the numbers from third-party blog posts are current.
- DeepSeek’s dynamic ceiling: No published RPM/TPM means no guaranteed minimum. Under heavy load, your effective concurrency can drop without warning. Build fallback routing to a second provider if DeepSeek availability is critical to your pipeline.
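The fallback-routing advice above reduces to a small wrapper. A minimal sketch; `primary` and `fallback` stand for your two provider clients, and in practice you would catch your SDK's specific 429/503 exceptions rather than `Exception`:

```python
def call_with_fallback(primary, fallback):
    """Try the primary provider; on a rate-limit or availability error, reroute."""
    try:
        return primary()
    except Exception:  # narrow to your client's rate-limit/availability errors
        return fallback()
```

Pair this with the backoff loop above: retry the primary a few times first, and only reroute once retries are exhausted.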
Further Reading
- LLM Token Costs and Efficiency — per-token pricing, caching mechanics, model routing, batch discounts, production cost modeling
- OpenAI rate limits — tiers, usage tracking, batch API
- Anthropic rate limits — build tier system, prompt caching behavior
- Google Gemini rate limits — free vs paid, AI Studio dashboard
- Groq rate limits — real-time dashboard, daily token caps
- xAI Grok API — rate limits and pricing
- DeepSeek API — dynamic limits, V4 pricing
- Mistral rate limits — tier structure, RPS enforcement
- Cerebras API — WSE-3 inference, free tier
- SambaNova API — RDU inference, free tier
- Perplexity rate limits — search-augmented API tiers
- Alibaba Qwen rate limits — region-specific limits
- Moonshot Kimi rate limits — tier structure, caching
- AWS Bedrock quotas — per-model per-region
- Azure AI quotas — PTU capacity, global deployment