API Rate Limits Compared: Every Major LLM Provider in One Place
A side-by-side comparison of rate limits across 15 LLM API providers — OpenAI, Anthropic, Google, Groq, xAI, DeepSeek, Mistral, Perplexity, Alibaba, Moonshot, and more — as of March 2026.
If you’re building on LLM APIs, rate limits will bite you eventually. Every provider enforces them, they all measure them slightly differently, and the documentation is scattered across a dozen different dashboards. This post puts it all in one place.
Table of Contents
- What are rate limits?
- Free Tier — RPM & RPD
- Free Tier — TPM & TPD
- Entry Paid Tier — RPM & RPD
- Entry Paid Tier — TPM
- Scaled Tier — RPM & TPM
- Audio Models — ASH & ASD
- Cloud Aggregators — Azure AI & AWS Bedrock
- More Providers — Perplexity, Alibaba, Moonshot
- Provider Notes
- Tips for managing rate limits
Last updated: March 22, 2026.
What are rate limits?
Rate limits cap how much you can use an API within a given time window. Providers enforce them to prevent abuse, manage capacity, and ensure fair access across customers. When you exceed a limit, you get a 429 Too Many Requests error and have to wait before retrying.
Most providers use the token bucket algorithm: your capacity refills continuously up to your maximum, rather than resetting at fixed intervals. So a 60 RPM limit isn’t “60 requests then wait a minute.” It’s closer to 1 request per second, refilling steadily.
The token bucket refills steadily — capacity isn’t a hard reset at the top of each minute.
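A minimal sketch of that refill logic, assuming a generic client-side bucket (the class and names here are illustrative, not any provider's implementation):

```python
import time

class TokenBucket:
    """Client-side token bucket: capacity refills continuously, not per-minute."""

    def __init__(self, rate_per_sec: float, capacity: float):
        self.rate = rate_per_sec   # refill rate, e.g. 1.0 for a 60 RPM limit
        self.capacity = capacity   # burst ceiling
        self.tokens = capacity     # start full
        self.last = time.monotonic()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Spend `cost` units if available, refilling based on elapsed time first."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# A 60 RPM limit behaves like ~1 request/second with a 60-request burst ceiling:
bucket = TokenBucket(rate_per_sec=1.0, capacity=60)
```

Mirroring the limit client-side like this lets you throttle before the API ever returns a 429.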
The metrics
- RPM (Requests per minute): How many API calls you can make per minute, regardless of size. A 10-token request and a 100K-token request both count as one request.
- RPD (Requests per day): A daily cap on total API calls. Some providers use this instead of (or alongside) RPM, especially on free tiers.
- TPM (Tokens per minute): The total number of tokens (input + output) you can process per minute. This is usually the limit that matters most in production, since it directly determines throughput.
- TPD (Tokens per day): A daily token cap. More common on free tiers and providers like Groq.
- ASH (Audio seconds per hour): For speech-to-text models like Whisper. Limits how many seconds of audio you can transcribe per hour.
- ASD (Audio seconds per day): Daily cap on audio transcription.
Some providers split tokens further. Anthropic separates ITPM (input tokens per minute) and OTPM (output tokens per minute), which gives you more granular control. OpenAI uses a combined TPM for most models.
Why they matter for production
RPM limits determine your concurrency ceiling: how many parallel requests you can sustain. TPM limits determine your throughput: how much data you can move through the API. For batch workloads, RPD and TPD matter more. For real-time apps, RPM and TPM are the constraints you’ll hit first.
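Back-of-the-envelope versions of those two ceilings (illustrative helpers; the 500 RPM / 200K TPM figures are OpenAI's Tier 1 GPT-5 limits from the tables below, and the 3-second average latency is an assumption):

```python
def max_concurrency(rpm: int, avg_latency_s: float) -> float:
    # Little's law: in-flight requests ≈ arrival rate × time per request
    return rpm * avg_latency_s / 60.0

def avg_tokens_per_request(tpm: int, rpm: int) -> float:
    # above this average request size, TPM binds before RPM does
    return tpm / rpm

max_concurrency(500, 3.0)             # 25.0 parallel requests sustainable
avg_tokens_per_request(200_000, 500)  # 400.0 tokens/request before TPM binds
```

Note how low that second number is: with large prompts, TPM is almost always the binding constraint, not RPM.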
Free Tier — Requests per Minute (RPM) & Requests per Day (RPD)
| Metric | OpenAI GPT-5 | Google 3.1 Pro ⍺ | Google 2.5 Pro | Google 2.5 Flash | Groq Llama 4 Scout | Groq Llama 3.1 8B | xAI Grok 3 | Mistral Large 3 | NVIDIA NIM | Together AI OS models | Fireworks AI OS models | Cohere Command R+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RPM | 3 | 10 | 5 | 10 | 30 | 30 | 60 | ~60 | 40 | 60 | 10 | 20 |
| RPD | 200 | 100 | 100 | 250 | 1,000 | 14,400 | — | — | — | — | — | ~33* |
*Cohere trial keys get 1,000 calls/month across all endpoints, roughly 33/day. ⍺ = preview model; limits may change.
Note: Anthropic and DeepSeek have no free API tier. Anthropic’s lowest entry is $5 (Tier 1).
Free Tier — Tokens per Minute (TPM) & Tokens per Day (TPD)
| Metric | OpenAI GPT-5 | Google 3.1 Pro ⍺ | Google 2.5 Pro | Google 2.5 Flash | Groq Llama 4 Scout | Groq Llama 3.1 8B | xAI Grok 3 | Mistral Large 3 |
|---|---|---|---|---|---|---|---|---|
| TPM | 40,000 | 250,000 | 250,000 | 250,000 | 30,000 | 6,000 | 100,000 | 500,000 |
| TPD | — | — | — | — | 500,000 | 500,000 | — | — |
Standout: Google’s free tier gives you 250K TPM across all Gemini models, which beats most providers’ paid Tier 1 for throughput. The catch is the tight RPM/RPD caps (5-10 RPM, 100-250 RPD), so it’s best suited to fewer, larger requests. Groq is the only provider in this table with TPD caps, so sustained load can exhaust the daily token budget long before the per-minute limits matter.
Entry Paid Tier — Requests per Minute (RPM) & Requests per Day (RPD)
The tier most developers start with. OpenAI Tier 1 ($5), Anthropic Tier 1 ($5), Google Tier 1 (pay-as-you-go), and others at their lowest paid level.
| Metric | OpenAI GPT-5 | OpenAI GPT-4.1 Nano | Anthropic Sonnet 4.x | Anthropic Haiku 4.5 | Google 3.1 Pro ⍺ | Google 2.5 Pro | Google 2.5 Flash | Mistral Large 3 | xAI Grok 3 | DeepSeek V3.2 | Together AI OS models | Fireworks AI OS models | Cohere Command R+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RPM | 500 | 1,000 | 50 | 50 | 150 | 150 | 300 | 300 | 1,200 | 60 | 60+ | ≤6,000 | 500 |
| RPD | — | — | — | — | 1,000 | 1,000 | 1,500 | — | — | — | — | — | — |
Standout: xAI leads at 1,200 RPM guaranteed. Fireworks can spike to 6,000 RPM but it’s a dynamic ceiling, not guaranteed. Anthropic’s 50 RPM at Tier 1 is the lowest here, but jumps 20x to 1,000 RPM at Tier 2 ($40). Google is the only paid provider still enforcing RPD.
Entry Paid Tier — Tokens per Minute (TPM)
| Metric | OpenAI GPT-5 | OpenAI GPT-4.1 Nano | Anthropic Sonnet 4.x | Anthropic Haiku 4.5 | Google 3.1 Pro ⍺ | Google 2.5 Pro | Google 2.5 Flash | Mistral Large 3 | xAI Grok 3 | DeepSeek V3.2 |
|---|---|---|---|---|---|---|---|---|---|---|
| TPM | 200K | 4M | 30K in / 8K out | 50K in / 10K out | 1M | 1M | 2M | 2M | 600K | 1M |
Standout: Mistral and Google 2.5 Flash lead at 2M TPM. GPT-4.1 Nano is OpenAI’s throughput beast at 4M TPM. Anthropic looks low on paper (30K ITPM for Sonnet), but cached input tokens don’t count toward the limit. With 80% cache hit rate, effective throughput is 5x higher.
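The cache arithmetic works out like this (a hypothetical helper; the 30K figure is Anthropic's Tier 1 Sonnet ITPM limit from the table above):

```python
def effective_itpm(itpm_limit: int, cache_hit_rate: float) -> float:
    # Cached input tokens don't count toward Anthropic's ITPM limit,
    # so only the uncached fraction of traffic consumes quota.
    return itpm_limit / (1.0 - cache_hit_rate)

effective_itpm(30_000, 0.80)  # ~150,000 effective input tokens/min, a 5x multiplier
```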
Scaled Tier — RPM & TPM at Higher Spend
For teams that have scaled past entry-level. OpenAI Tier 3 ($100+), Anthropic Tier 3 ($200+), Google Tier 2 ($250+).
| Metric | OpenAI GPT-5 (Tier 3) | OpenAI GPT-4.1 Nano (Tier 3) | Anthropic Sonnet 4.x (Tier 3) | Anthropic Haiku 4.5 (Tier 3) | Google 2.5 Pro (Tier 2) | Google 2.5 Flash (Tier 2) |
|---|---|---|---|---|---|---|
| RPM | 5,000 | 5,000 | 2,000 | 2,000 | 1,000 | 2,000 |
| TPM | 4M | 20M | 800K in / 160K out | 1M in / 200K out | 2M | 4M |
And at the highest standard tiers — OpenAI Tier 5 ($1,000+), Anthropic Tier 4 ($400+):
| Metric | OpenAI GPT-5 (Tier 5) | OpenAI GPT-4.1 Nano (Tier 5) | Anthropic Sonnet 4.x (Tier 4) | Anthropic Haiku 4.5 (Tier 4) |
|---|---|---|---|---|
| RPM | 10,000 | 30,000 | 4,000 | 4,000 |
| TPM | 30M | 150M | 2M in / 400K out | 4M in / 800K out |
Standout: OpenAI’s Tier 5 limits are staggering. GPT-4.1 Nano at 150M TPM and 30K RPM is designed for high-volume classification and routing workloads. Anthropic’s Tier 4 caps at 4K RPM and 4M ITPM for Haiku, but cached input tokens don’t count toward the limit, so effective throughput can be dramatically higher.
Audio Models — ASH & ASD
Only Groq currently publishes audio-specific rate limits for Whisper models.
| Metric | Whisper Large v3 | Whisper Large v3 Turbo |
|---|---|---|
| RPM | 20 | 20 |
| RPD | 2,000 | 2,000 |
| ASH | 7,200 | 7,200 |
| ASD | 28,800 | 28,800 |
7,200 ASH = 2 hours of audio per hour of wall time. 28,800 ASD = 8 hours of audio per day. For a podcast transcription pipeline or meeting summarizer, that’s plenty on the free tier.
Cloud Aggregators — Azure AI & AWS Bedrock
Azure and Bedrock don’t fit neatly into the tables above because limits are configurable per deployment rather than fixed by spend tier. Both use a 6 RPM per 1,000 TPM ratio.
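Under that ratio, the request limit follows mechanically from the token quota you configure (illustrative helper):

```python
def rpm_for_tpm_quota(tpm_quota: int) -> float:
    # Azure/Bedrock derive the request limit from the token quota:
    # 6 RPM for every 1,000 TPM provisioned.
    return tpm_quota / 1_000 * 6

rpm_for_tpm_quota(100_000)  # 600.0 RPM for a 100K TPM deployment
```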
Azure AI (Microsoft)
| Deployment Type | How Limits Work |
|---|---|
| Pay-as-you-go (Standard) | TPM quota per model per region, auto-scales with usage |
| Provisioned (PTU) | Reserved throughput units, no per-request limits |
| Global/Data Zone | Higher default quotas, multi-region routing |
If you’re already on Azure, this is the path of least resistance for OpenAI models with enterprise compliance (SOC 2, HIPAA BAA). Azure AI Quotas & Limits
AWS Bedrock
Hosts Claude, Llama, Mistral, and other models with per-model, per-region quotas. Default quotas vary by model and can be increased via AWS Service Quotas console. Provisioned Throughput deployments remove rate limits entirely. AWS Bedrock Quotas
More Providers — Perplexity, Alibaba (Qwen), Moonshot (Kimi)
| Metric | Perplexity Sonar Pro | Perplexity Sonar | Alibaba Qwen3 Max | Alibaba Qwen3.5 Plus | Alibaba Qwen3.5 Flash | Moonshot Kimi K2.5 |
|---|---|---|---|---|---|---|
| RPM (T0) | 50 | 50 | — | — | — | — |
| RPM (T1) | 150 | 150 | 600 | 15,000 | 15,000 | 200 |
| RPM (T3+) | 1,000 | 1,000 | 600 | 30,000 | 30,000 | — |
| TPM | — | — | 1M | 5M | 10M | — |
| TPD (T0) | — | — | — | — | — | 1.5M |
Perplexity tiers: T0 = new account, T1 = $50+, T3 = $500+. Alibaba limits shown for international (Singapore) deployment. Moonshot T1 = $10+ cumulative recharge; T0 has 1.5M TPD cap, T1+ is unlimited.
Standout: Alibaba’s Qwen3.5 Flash at 15K RPM and 10M TPM is the highest throughput of any non-OpenAI provider in this post; only GPT-4.1 Nano at OpenAI’s upper tiers exceeds it. Perplexity is unique as a search-augmented LLM, so its value is in grounded answers, not raw throughput. Moonshot’s Kimi K2.5 has competitive pricing with automatic 75% caching discounts.
Provider Notes
- Anthropic: Cached input tokens don’t count toward ITPM limits for most models. With effective prompt caching, a 2M ITPM limit can handle 10M+ total input tokens/minute. No free tier; lowest entry is $5. Docs
- OpenAI: RPD limits only on the Free tier. Paid tiers drop daily caps entirely. Batch API offers 50% discount with higher limits. Docs
- Google: Gemini 3.1 Pro is the current flagship; Gemini 3.1 Flash Lite is the budget tier. Limits enforced per GCP project, not per API key. Most generous free-tier TPM. Docs
- Groq: Enforces TPD (daily token budgets) alongside TPM, even on paid usage. Cached tokens don’t count. Runs on custom LPU hardware. Docs
- Mistral: European data residency (GDPR). Simple 2-tier structure. Docs
- xAI: Highest guaranteed RPM at entry level (1,200). Docs
- DeepSeek: No tiers, flat limits. Competitive pricing, low RPM. Docs
- NVIDIA NIM: Hosted API is for prototyping (40 RPM). Self-hosted NIM containers have no limits. Docs
- Together AI: Dynamic rate limits since Jan 2026. Limits grow with sustained usage. Docs
- Fireworks AI: 6,000 RPM spike arrest ceiling, not guaranteed. On-demand GPU deployments remove limits entirely. Docs
- Cohere: Endpoint-specific limits. Best for Embed (2K RPM) and Rerank (1K RPM) in RAG pipelines. Docs
- Perplexity: Search-augmented LLM API. 6 tiers (T0-T5) based on cumulative spend. Uses leaky bucket algorithm. Sonar Deep Research model has much lower limits (5-100 RPM). Docs
- Alibaba (Qwen): Massive model catalog with region-specific limits (Singapore, US, EU, Beijing). Qwen3.5 Flash at 15K RPM / 10M TPM is the highest throughput listed here outside OpenAI’s upper tiers. International deployment via Singapore. Docs
- Moonshot (Kimi): Tiered by cumulative recharge ($1 minimum). T0 has 1.5M TPD daily cap; T1+ ($10) unlocks unlimited daily tokens and 200 RPM. Automatic 75% caching discount. Docs
- AWS Bedrock: Hosts Claude, Llama, Mistral with per-model per-region quotas. Same 6 RPM per 1K TPM ratio as Azure. Provisioned Throughput removes limits. Docs
Tips for managing rate limits
- Use exponential backoff when you hit 429s. Most SDKs handle this automatically.
- Batch where possible. OpenAI and Anthropic both offer batch APIs with higher limits and lower prices (50% discount on OpenAI).
- Cache aggressively. Anthropic’s prompt caching effectively multiplies your ITPM limit. Groq’s cached tokens also don’t count.
- Route by model. Don’t send simple classification tasks to your most expensive model. Use a cheap model (Haiku, Flash-Lite, GPT-4.1 Nano) for simple work and reserve the flagship for complex tasks.
- Monitor before you scale. Every provider has a dashboard showing your current usage against limits. Check it before architecting around limits you haven’t actually hit yet.
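The first tip can be sketched in a few lines. This assumes a generic SDK exception; `RateLimitError` here is a stand-in for your client's 429 error (e.g. `openai.RateLimitError`):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for your SDK's 429 exception."""

def call_with_backoff(fn, max_retries: int = 5, base: float = 1.0, cap: float = 60.0):
    """Retry fn() on rate-limit errors with capped exponential backoff + full jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the 429
            # full jitter: sleep a random amount up to the capped exponential
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The jitter matters: it spreads retries out so a fleet of workers that all hit the limit at once doesn't hammer the API again in lockstep.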
Further Reading
- OpenAI rate limits documentation — tiers, usage tracking, and batch API details
- Anthropic rate limits documentation — build tier system and prompt caching behavior
- Google AI rate limits — free vs paid tier limits for Gemini
- Groq rate limits — real-time limits dashboard and daily token caps
- xAI Grok API documentation — rate limits and pricing for Grok models
- DeepSeek API docs — pricing and concurrent request limits
- Mistral rate limits — workspace-level limits and tier upgrades
- AWS Bedrock quotas — per-model per-region limits and provisioned throughput
- Azure AI model quotas — PTU-based capacity and global deployment limits
- Alibaba Qwen rate limits — region-specific limits for Qwen models