API Rate Limits Compared: Every Major LLM Provider in One Place
A side-by-side comparison of rate limits across 15 LLM API providers — OpenAI, Anthropic, Google, Groq, xAI, DeepSeek, Mistral, Perplexity, Alibaba, Moonshot, and more — as of March 2026.
If you’re building on LLM APIs, rate limits will bite you eventually. Every provider enforces them, they all measure them slightly differently, and the documentation is scattered across a dozen different dashboards. This post puts it all in one place.
Table of Contents
- What are rate limits?
- Free Tier — RPM & RPD
- Free Tier — TPM & TPD
- Entry Paid Tier — RPM & RPD
- Entry Paid Tier — TPM
- Scaled Tier — RPM & TPM
- Audio Models — ASH & ASD
- Cloud Aggregators — Azure AI & AWS Bedrock
- More Providers — Perplexity, Alibaba, Moonshot
- Provider Notes
- Tips for managing rate limits
Last updated: March 22, 2026.
What are rate limits?
Rate limits cap how much you can use an API within a given time window. Providers enforce them to prevent abuse, manage capacity, and ensure fair access across customers. When you exceed a limit, you get a 429 Too Many Requests error and have to wait before retrying.
Most providers use the token bucket algorithm: your capacity refills continuously up to your maximum, rather than resetting at fixed intervals. So a 60 RPM limit isn’t “60 requests then wait a minute.” It’s closer to 1 request per second, refilling steadily.
The token bucket refills steadily — capacity isn’t a hard reset at the top of each minute.
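A minimal sketch of that refill logic, assuming a generic client-side bucket (the class and names here are illustrative, not any provider's implementation):

```python
import time

class TokenBucket:
    """Client-side token bucket: capacity refills continuously, not per-minute."""

    def __init__(self, rate_per_sec: float, capacity: float):
        self.rate = rate_per_sec   # refill rate, e.g. 1.0 for a 60 RPM limit
        self.capacity = capacity   # burst ceiling
        self.tokens = capacity     # start full
        self.last = time.monotonic()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Spend `cost` units if available, refilling based on elapsed time first."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# A 60 RPM limit behaves like ~1 request/second with a 60-request burst ceiling:
bucket = TokenBucket(rate_per_sec=1.0, capacity=60)
```

Mirroring the limit client-side like this lets you throttle before the API ever returns a 429.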
The metrics
- RPM (Requests per minute): How many API calls you can make per minute, regardless of size. A 10-token request and a 100K-token request both count as one request.
- RPD (Requests per day): A daily cap on total API calls. Some providers use this instead of (or alongside) RPM, especially on free tiers.
- TPM (Tokens per minute): The total number of tokens (input + output) you can process per minute. This is usually the limit that matters most in production, since it directly determines throughput.
- TPD (Tokens per day): A daily token cap. More common on free tiers and providers like Groq.
- ASH (Audio seconds per hour): For speech-to-text models like Whisper. Limits how many seconds of audio you can transcribe per hour.
- ASD (Audio seconds per day): Daily cap on audio transcription.
Some providers split tokens further. Anthropic separates ITPM (input tokens per minute) and OTPM (output tokens per minute), which gives you more granular control. OpenAI uses a combined TPM for most models.
Why they matter for production
RPM limits determine your concurrency ceiling: how many parallel requests you can sustain. TPM limits determine your throughput: how much data you can move through the API. For batch workloads, RPD and TPD matter more. For real-time apps, RPM and TPM are the constraints you’ll hit first.
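Back-of-the-envelope versions of those two ceilings (illustrative helpers; the 500 RPM / 200K TPM figures are OpenAI's Tier 1 GPT-5 limits from the tables below, and the 3-second average latency is an assumption):

```python
def max_concurrency(rpm: int, avg_latency_s: float) -> float:
    # Little's law: in-flight requests ≈ arrival rate × time per request
    return rpm * avg_latency_s / 60.0

def avg_tokens_per_request(tpm: int, rpm: int) -> float:
    # above this average request size, TPM binds before RPM does
    return tpm / rpm

max_concurrency(500, 3.0)             # 25.0 parallel requests sustainable
avg_tokens_per_request(200_000, 500)  # 400.0 tokens/request before TPM binds
```

Note how low that second number is: with large prompts, TPM is almost always the binding constraint, not RPM.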
Free Tier — Requests per Minute (RPM) & Requests per Day (RPD)
| Metric | OpenAI GPT-5 | Google 3.1 Pro ⍺ | Google 2.5 Pro | Google 2.5 Flash | Groq Llama 4 Scout | Groq Llama 3.1 8B | xAI Grok 3 | Mistral Large 3 | NVIDIA NIM | Together AI OS models | Fireworks AI OS models | Cohere Command R+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RPM | 3 | 10 | 5 | 10 | 30 | 30 | 60 | ~60 | 40 | 60 | 10 | 20 |
| RPD | 200 | 100 | 100 | 250 | 1,000 | 14,400 | — | — | — | — | — | ~33* |
*Cohere trial keys get 1,000 calls/month across all endpoints, roughly 33/day. ⍺ = preview model; limits may change.
Note: Anthropic and DeepSeek have no free API tier. Anthropic’s lowest entry is $5 (Tier 1).
Free Tier — Tokens per Minute (TPM) & Tokens per Day (TPD)
| Metric | OpenAI GPT-5 | Google 3.1 Pro ⍺ | Google 2.5 Pro | Google 2.5 Flash | Groq Llama 4 Scout | Groq Llama 3.1 8B | xAI Grok 3 | Mistral Large 3 |
|---|---|---|---|---|---|---|---|---|
| TPM | 40,000 | 250,000 | 250,000 | 250,000 | 30,000 | 6,000 | 100,000 | 500,000 |
| TPD | — | — | — | — | 500,000 | 500,000 | — | — |
Standout: Google’s free tier gives you 250K TPM across all Gemini models, which beats most providers’ paid Tier 1 for throughput. The catch is the tight RPM/RPD caps (5-10 RPM, 100-250 RPD), so it’s best suited to fewer, larger requests. Groq is the only provider in this table with TPD caps, so sustained load can exhaust the daily token budget long before the per-minute limits matter.
Entry Paid Tier — Requests per Minute (RPM) & Requests per Day (RPD)
The tier most developers start with. OpenAI Tier 1 ($5), Anthropic Tier 1 ($5), Google Tier 1 (pay-as-you-go), and others at their lowest paid level.
| Metric | OpenAI GPT-5 | OpenAI GPT-4.1 Nano | Anthropic Sonnet 4.x | Anthropic Haiku 4.5 | Google 3.1 Pro ⍺ | Google 2.5 Pro | Google 2.5 Flash | Mistral Large 3 | xAI Grok 3 | DeepSeek V3.2 | Together AI OS models | Fireworks AI OS models | Cohere Command R+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RPM | 500 | 1,000 | 50 | 50 | 150 | 150 | 300 | 300 | 1,200 | 60 | 60+ | ≤6,000 | 500 |
| RPD | — | — | — | — | 1,000 | 1,000 | 1,500 | — | — | — | — | — | — |
Standout: xAI leads at 1,200 RPM guaranteed. Fireworks can spike to 6,000 RPM but it’s a dynamic ceiling, not guaranteed. Anthropic’s 50 RPM at Tier 1 is the lowest here, but jumps 20x to 1,000 RPM at Tier 2 ($40). Google is the only paid provider still enforcing RPD.
Entry Paid Tier — Tokens per Minute (TPM)
| Metric | OpenAI GPT-5 | OpenAI GPT-4.1 Nano | Anthropic Sonnet 4.x | Anthropic Haiku 4.5 | Google 3.1 Pro ⍺ | Google 2.5 Pro | Google 2.5 Flash | Mistral Large 3 | xAI Grok 3 | DeepSeek V3.2 |
|---|---|---|---|---|---|---|---|---|---|---|
| TPM | 200K | 4M | 30K in / 8K out | 50K in / 10K out | 1M | 1M | 2M | 2M | 600K | 1M |
Standout: Mistral and Google 2.5 Flash lead at 2M TPM. GPT-4.1 Nano is OpenAI’s throughput beast at 4M TPM. Anthropic looks low on paper (30K ITPM for Sonnet), but cached input tokens don’t count toward the limit. With 80% cache hit rate, effective throughput is 5x higher.
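The cache arithmetic works out like this (a hypothetical helper; the 30K figure is Anthropic's Tier 1 Sonnet ITPM limit from the table above):

```python
def effective_itpm(itpm_limit: int, cache_hit_rate: float) -> float:
    # Cached input tokens don't count toward Anthropic's ITPM limit,
    # so only the uncached fraction of traffic consumes quota.
    return itpm_limit / (1.0 - cache_hit_rate)

effective_itpm(30_000, 0.80)  # ~150,000 effective input tokens/min, a 5x multiplier
```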
Scaled Tier — RPM & TPM at Higher Spend
For teams that have scaled past entry-level. OpenAI Tier 3 ($100+), Anthropic Tier 3 ($200+), Google Tier 2 ($250+).
| Metric | OpenAI GPT-5 (Tier 3) | OpenAI GPT-4.1 Nano (Tier 3) | Anthropic Sonnet 4.x (Tier 3) | Anthropic Haiku 4.5 (Tier 3) | Google 2.5 Pro (Tier 2) | Google 2.5 Flash (Tier 2) |
|---|---|---|---|---|---|---|
| RPM | 5,000 | 5,000 | 2,000 | 2,000 | 1,000 | 2,000 |
| TPM | 4M | 20M | 800K in / 160K out | 1M in / 200K out | 2M | 4M |
And at the highest standard tiers — OpenAI Tier 5 ($1,000+), Anthropic Tier 4 ($400+):
| Metric | OpenAI GPT-5 (Tier 5) | OpenAI GPT-4.1 Nano (Tier 5) | Anthropic Sonnet 4.x (Tier 4) | Anthropic Haiku 4.5 (Tier 4) |
|---|---|---|---|---|
| RPM | 10,000 | 30,000 | 4,000 | 4,000 |
| TPM | 30M | 150M | 2M in / 400K out | 4M in / 800K out |
Standout: OpenAI’s Tier 5 limits are staggering. GPT-4.1 Nano at 150M TPM and 30K RPM is designed for high-volume classification and routing workloads. Anthropic’s Tier 4 caps at 4K RPM and 4M ITPM for Haiku, but cached input tokens don’t count toward the limit, so effective throughput can be dramatically higher.
Audio Models — ASH & ASD
Only Groq currently publishes audio-specific rate limits for Whisper models.
| Metric | Whisper Large v3 | Whisper Large v3 Turbo |
|---|---|---|
| RPM | 20 | 20 |
| RPD | 2,000 | 2,000 |
| ASH | 7,200 | 7,200 |
| ASD | 28,800 | 28,800 |
7,200 ASH = 2 hours of audio per hour of wall time. 28,800 ASD = 8 hours of audio per day. For a podcast transcription pipeline or meeting summarizer, that’s plenty on the free tier.
Cloud Aggregators — Azure AI & AWS Bedrock
Azure and Bedrock don’t fit neatly into the tables above because limits are configurable per deployment rather than fixed by spend tier. Both use a 6 RPM per 1,000 TPM ratio.
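Under that ratio, the request limit follows mechanically from the token quota you configure (illustrative helper):

```python
def rpm_for_tpm_quota(tpm_quota: int) -> float:
    # Azure/Bedrock derive the request limit from the token quota:
    # 6 RPM for every 1,000 TPM provisioned.
    return tpm_quota / 1_000 * 6

rpm_for_tpm_quota(100_000)  # 600.0 RPM for a 100K TPM deployment
```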
Azure AI (Microsoft)
| Deployment Type | How Limits Work |
|---|---|
| Pay-as-you-go (Standard) | TPM quota per model per region, auto-scales with usage |
| Provisioned (PTU) | Reserved throughput units, no per-request limits |
| Global/Data Zone | Higher default quotas, multi-region routing |
If you’re already on Azure, this is the path of least resistance for OpenAI models with enterprise compliance (SOC 2, HIPAA BAA). Azure AI Quotas & Limits
AWS Bedrock
Hosts Claude, Llama, Mistral, and other models with per-model, per-region quotas. Default quotas vary by model and can be increased via AWS Service Quotas console. Provisioned Throughput deployments remove rate limits entirely. AWS Bedrock Quotas
More Providers — Perplexity, Alibaba (Qwen), Moonshot (Kimi)
| Metric | Perplexity Sonar Pro | Perplexity Sonar | Alibaba Qwen3 Max | Alibaba Qwen3.5 Plus | Alibaba Qwen3.5 Flash | Moonshot Kimi K2.5 |
|---|---|---|---|---|---|---|
| RPM (T0) | 50 | 50 | — | — | — | — |
| RPM (T1) | 150 | 150 | 600 | 15,000 | 15,000 | 200 |
| RPM (T3+) | 1,000 | 1,000 | 600 | 30,000 | 30,000 | — |
| TPM | — | — | 1M | 5M | 10M | — |
| TPD (T0) | — | — | — | — | — | 1.5M |
Perplexity tiers: T0 = new account, T1 = $50+, T3 = $500+. Alibaba limits shown for international (Singapore) deployment. Moonshot T1 = $10+ cumulative recharge; T0 has 1.5M TPD cap, T1+ is unlimited.
Standout: Alibaba’s Qwen3.5 Flash at 15K RPM and 10M TPM is the highest throughput of any non-OpenAI provider in this post; only GPT-4.1 Nano at OpenAI’s upper tiers exceeds it. Perplexity is unique as a search-augmented LLM, so its value is in grounded answers, not raw throughput. Moonshot’s Kimi K2.5 has competitive pricing with automatic 75% caching discounts.
Provider Notes
- Anthropic: Cached input tokens don’t count toward ITPM limits for most models. With effective prompt caching, a 2M ITPM limit can handle 10M+ total input tokens/minute. No free tier; lowest entry is $5. Docs
- OpenAI: RPD limits only on the Free tier. Paid tiers drop daily caps entirely. Batch API offers 50% discount with higher limits. Docs
- Google: Gemini 3.1 Pro is the current flagship; Gemini 3.1 Flash Lite is the budget tier. Limits enforced per GCP project, not per API key. Most generous free-tier TPM. Docs
- Groq: Enforces TPD (daily token budgets) alongside TPM, even on paid usage. Cached tokens don’t count. Runs on custom LPU hardware. Docs
- Mistral: European data residency (GDPR). Simple 2-tier structure. Docs
- xAI: Highest guaranteed RPM at entry level (1,200). Docs
- DeepSeek: No tiers, flat limits. Competitive pricing, low RPM. Docs
- NVIDIA NIM: Hosted API is for prototyping (40 RPM). Self-hosted NIM containers have no limits. Docs
- Together AI: Dynamic rate limits since Jan 2026. Limits grow with sustained usage. Docs
- Fireworks AI: 6,000 RPM spike arrest ceiling, not guaranteed. On-demand GPU deployments remove limits entirely. Docs
- Cohere: Endpoint-specific limits. Best for Embed (2K RPM) and Rerank (1K RPM) in RAG pipelines. Docs
- Perplexity: Search-augmented LLM API. 6 tiers (T0-T5) based on cumulative spend. Uses leaky bucket algorithm. Sonar Deep Research model has much lower limits (5-100 RPM). Docs
- Alibaba (Qwen): Massive model catalog with region-specific limits (Singapore, US, EU, Beijing). Qwen3.5 Flash at 15K RPM / 10M TPM is the highest throughput listed here outside OpenAI’s upper tiers. International deployment via Singapore. Docs
- Moonshot (Kimi): Tiered by cumulative recharge ($1 minimum). T0 has 1.5M TPD daily cap; T1+ ($10) unlocks unlimited daily tokens and 200 RPM. Automatic 75% caching discount. Docs
- AWS Bedrock: Hosts Claude, Llama, Mistral with per-model per-region quotas. Same 6 RPM per 1K TPM ratio as Azure. Provisioned Throughput removes limits. Docs
Tips for managing rate limits
- Use exponential backoff when you hit 429s. Most SDKs handle this automatically.
- Batch where possible. OpenAI and Anthropic both offer batch APIs with higher limits and lower prices (50% discount on OpenAI).
- Cache aggressively. Anthropic’s prompt caching effectively multiplies your ITPM limit. Groq’s cached tokens also don’t count.
- Route by model. Don’t send simple classification tasks to your most expensive model. Use a cheap model (Haiku, Flash-Lite, GPT-4.1 Nano) for simple work and reserve the flagship for complex tasks.
- Monitor before you scale. Every provider has a dashboard showing your current usage against limits. Check it before architecting around limits you haven’t actually hit yet.
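The first tip can be sketched in a few lines. This assumes a generic SDK exception; `RateLimitError` here is a stand-in for your client's 429 error (e.g. `openai.RateLimitError`):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for your SDK's 429 exception."""

def call_with_backoff(fn, max_retries: int = 5, base: float = 1.0, cap: float = 60.0):
    """Retry fn() on rate-limit errors with capped exponential backoff + full jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the 429
            # full jitter: sleep a random amount up to the capped exponential
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The jitter matters: it spreads retries out so a fleet of workers that all hit the limit at once doesn't hammer the API again in lockstep.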
Further Reading
- OpenAI rate limits documentation — tiers, usage tracking, and batch API details
- Anthropic rate limits documentation — build tier system and prompt caching behavior
- Google AI rate limits — free vs paid tier limits for Gemini
- Groq rate limits — real-time limits dashboard and daily token caps
- xAI Grok API documentation — rate limits and pricing for Grok models
- DeepSeek API docs — pricing and concurrent request limits
- Mistral rate limits — workspace-level limits and tier upgrades
- AWS Bedrock quotas — per-model per-region limits and provisioned throughput
- Azure AI model quotas — PTU-based capacity and global deployment limits
- Alibaba Qwen rate limits — region-specific limits for Qwen models