API Rate Limits Compared: Every Major LLM Provider in One Place

A side-by-side comparison of rate limits across 15 LLM API providers — OpenAI, Anthropic, Google, Groq, xAI, DeepSeek, Mistral, Perplexity, Alibaba, Moonshot, and more — as of March 2026.

The post you bookmark. One topic, covered end to end.

If you’re building on LLM APIs, rate limits will bite you eventually. Every provider enforces them, they all measure them slightly differently, and the documentation is scattered across a dozen different dashboards. This post puts it all in one place.


Last updated: March 22, 2026.


What are rate limits?

Rate limits cap how much you can use an API within a given time window. Providers enforce them to prevent abuse, manage capacity, and ensure fair access across customers. When you exceed a limit, you get a 429 Too Many Requests error and have to wait before retrying.

Most providers use the token bucket algorithm: your capacity refills continuously up to your maximum, rather than resetting at fixed intervals. So a 60 RPM limit isn’t “60 requests then wait a minute.” It’s closer to 1 request per second, refilling steadily.

[Diagram: API request → token bucket (refills continuously) → tokens available? Yes: process request. No: 429 Too Many Requests, retry later.]

The token bucket refills steadily — capacity isn’t a hard reset at the top of each minute.
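The refill behavior can be sketched in a few lines of Python. This is an illustrative limiter, not any provider's actual implementation; the rate and capacity values are assumptions:

```python
import time

class TokenBucket:
    """Illustrative token bucket: capacity refills continuously at `rate` per second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill continuously based on elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost   # spend tokens for this request
            return True
        return False              # caller should back off (a 429, conceptually)

# A 60 RPM limit behaves like 1 request/second with a burst of up to 60:
bucket = TokenBucket(rate=1.0, capacity=60)
```

After draining the burst, requests are admitted at roughly one per second as capacity trickles back in.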

The metrics

  • RPM (Requests per minute): How many API calls you can make per minute, regardless of size. A 10-token request and a 100K-token request both count as one request.
  • RPD (Requests per day): A daily cap on total API calls. Some providers use this instead of (or alongside) RPM, especially on free tiers.
  • TPM (Tokens per minute): The total number of tokens (input + output) you can process per minute. This is usually the limit that matters most in production, since it directly determines throughput.
  • TPD (Tokens per day): A daily token cap. More common on free tiers and providers like Groq.
  • ASH (Audio seconds per hour): For speech-to-text models like Whisper. Limits how many seconds of audio you can transcribe per hour.
  • ASD (Audio seconds per day): Daily cap on audio transcription.

Some providers split tokens further. Anthropic separates ITPM (input tokens per minute) and OTPM (output tokens per minute), which gives you more granular control. OpenAI uses a combined TPM for most models.

Why they matter for production

RPM limits determine your concurrency ceiling: how many parallel requests you can sustain. TPM limits determine your throughput: how much data you can move through the API. For batch workloads, RPD and TPD matter more. For real-time apps, RPM and TPM are the constraints you’ll hit first.
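A quick way to see which constraint binds first is to convert your TPM budget into a request ceiling at your average request size. The tier numbers below are hypothetical, not any specific provider's:

```python
def binding_limit(rpm: int, tpm: int, avg_tokens_per_request: int) -> str:
    """Return which limit caps sustained request rate: RPM or TPM."""
    # Max requests/minute the TPM budget allows at this request size
    tpm_ceiling = tpm // avg_tokens_per_request
    return "TPM" if tpm_ceiling < rpm else "RPM"

# Hypothetical tier: 500 RPM, 200K TPM
print(binding_limit(500, 200_000, 300))    # small requests: RPM binds
print(binding_limit(500, 200_000, 4_000))  # large requests: TPM binds
```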


Free Tier — Requests per Minute (RPM) & Requests per Day (RPD)

| Provider | Model | RPM | RPD |
|---|---|---|---|
| OpenAI | GPT-5 | 3 | 200 |
| Google | 3.1 Pro ⍺ | 10 | 100 |
| Google | 2.5 Pro | 5 | 100 |
| Google | 2.5 Flash | 10 | 250 |
| Groq | Llama 4 Scout | 30 | 1,000 |
| Groq | Llama 3.1 8B | 30 | 14,400 |
| xAI | Grok 3 | 60 | — |
| Mistral | Large 3 | ~60 | — |
| NVIDIA | NIM | 40 | — |
| Together AI | OS models | 60 | — |
| Fireworks AI | OS models | 10 | — |
| Cohere | Command R+ | 20 | ~33* |

*Cohere trial keys get 1,000 calls/month across all endpoints, roughly ~33/day. ⍺ = preview model, limits may change.

Note: Anthropic and DeepSeek have no free API tier. Anthropic’s lowest entry is $5 (Tier 1).

Free Tier — Tokens per Minute (TPM) & Tokens per Day (TPD)

| Provider | Model | TPM | TPD |
|---|---|---|---|
| OpenAI | GPT-5 | 40,000 | — |
| Google | 3.1 Pro ⍺ | 250,000 | — |
| Google | 2.5 Pro | 250,000 | — |
| Google | 2.5 Flash | 250,000 | — |
| Groq | Llama 4 Scout | 30,000 | 500,000 |
| Groq | Llama 3.1 8B | 6,000 | 500,000 |
| xAI | Grok 3 | 100,000 | — |
| Mistral | Large 3 | 500,000 | — |

Standout: Google’s free tier gives you 250K TPM across all Gemini models, which beats most providers’ paid Tier 1 for throughput. The catch is the tight RPM/RPD caps (5-10 RPM, 100-250 RPD), so it’s best for fewer, larger requests. Groq is the only provider with TPD limits, meaning daily token budgets can run out fast under sustained load.


Entry Paid Tier — Requests per Minute (RPM) & Requests per Day (RPD)

The tier most developers start with. OpenAI Tier 1 ($5), Anthropic Tier 1 ($5), Google Tier 1 (pay-as-you-go), and others at their lowest paid level.

| Provider | Model | RPM | RPD |
|---|---|---|---|
| OpenAI | GPT-5 | 500 | — |
| OpenAI | GPT-4.1 Nano | 1,000 | — |
| Anthropic | Sonnet 4.x | 50 | — |
| Anthropic | Haiku 4.5 | 50 | — |
| Google | 3.1 Pro ⍺ | 150 | 1,000 |
| Google | 2.5 Pro | 150 | 1,000 |
| Google | 2.5 Flash | 300 | 1,500 |
| Mistral | Large 3 | 300 | — |
| xAI | Grok 3 | 1,200 | — |
| DeepSeek | V3.2 | 60 | — |
| Together AI | OS models | 60+ | — |
| Fireworks AI | OS models | ≤6,000 | — |
| Cohere | Command R+ | 500 | — |

Standout: xAI leads at 1,200 RPM guaranteed. Fireworks can spike to 6,000 RPM but it’s a dynamic ceiling, not guaranteed. Anthropic’s 50 RPM at Tier 1 is the lowest here, but jumps 20x to 1,000 RPM at Tier 2 ($40). Google is the only paid provider still enforcing RPD.

Entry Paid Tier — Tokens per Minute (TPM)

| Provider | Model | TPM |
|---|---|---|
| OpenAI | GPT-5 | 200K |
| OpenAI | GPT-4.1 Nano | 4M |
| Anthropic | Sonnet 4.x | 30K in / 8K out |
| Anthropic | Haiku 4.5 | 50K in / 10K out |
| Google | 3.1 Pro ⍺ | 1M |
| Google | 2.5 Pro | 1M |
| Google | 2.5 Flash | 2M |
| Mistral | Large 3 | 2M |
| xAI | Grok 3 | 600K |
| DeepSeek | V3.2 | 1M |

Standout: Mistral and Google 2.5 Flash lead at 2M TPM. GPT-4.1 Nano is OpenAI’s throughput beast at 4M TPM. Anthropic looks low on paper (30K ITPM for Sonnet), but cached input tokens don’t count toward the limit. With 80% cache hit rate, effective throughput is 5x higher.
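The caching arithmetic generalizes: if only uncached input tokens count toward ITPM, effective input throughput is the limit divided by the uncached fraction. A sketch, assuming cached tokens are fully exempt:

```python
def effective_itpm(itpm_limit: int, cache_hit_rate: float) -> int:
    """Total input tokens/minute you can push when only uncached tokens count."""
    # Only the uncached fraction of traffic consumes the limit
    return int(itpm_limit / (1 - cache_hit_rate))

# 30K ITPM with an 80% cache hit rate -> 150K effective input tokens/minute (5x)
print(effective_itpm(30_000, 0.80))
```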


Scaled Tier — RPM & TPM at Higher Spend

For teams that have scaled past entry-level. OpenAI Tier 3 ($100+), Anthropic Tier 3 ($200+), Google Tier 2 ($250+).

| Provider (Tier) | Model | RPM | TPM |
|---|---|---|---|
| OpenAI (Tier 3) | GPT-5 | 5,000 | 4M |
| OpenAI (Tier 3) | GPT-4.1 Nano | 5,000 | 20M |
| Anthropic (Tier 3) | Sonnet 4.x | 2,000 | 800K in / 160K out |
| Anthropic (Tier 3) | Haiku 4.5 | 2,000 | 1M in / 200K out |
| Google (Tier 2) | 2.5 Pro | 1,000 | 2M |
| Google (Tier 2) | 2.5 Flash | 2,000 | 4M |

And at the highest standard tiers — OpenAI Tier 5 ($1,000+), Anthropic Tier 4 ($400+):

| Provider (Tier) | Model | RPM | TPM |
|---|---|---|---|
| OpenAI (Tier 5) | GPT-5 | 10,000 | 30M |
| OpenAI (Tier 5) | GPT-4.1 Nano | 30,000 | 150M |
| Anthropic (Tier 4) | Sonnet 4.x | 4,000 | 2M in / 400K out |
| Anthropic (Tier 4) | Haiku 4.5 | 4,000 | 4M in / 800K out |

Standout: OpenAI’s Tier 5 limits are staggering. GPT-4.1 Nano at 150M TPM and 30K RPM is designed for high-volume classification and routing workloads. Anthropic’s Tier 4 caps at 4K RPM and 4M ITPM for Haiku, but remember cached tokens are free, so effective throughput can be dramatically higher.


Audio Models — ASH & ASD

Only Groq currently publishes audio-specific rate limits for Whisper models.

Groq (Free Tier):

| Metric | Whisper Large v3 | Whisper Large v3 Turbo |
|---|---|---|
| RPM | 20 | 20 |
| RPD | 2,000 | 2,000 |
| ASH | 7,200 | 7,200 |
| ASD | 28,800 | 28,800 |

7,200 ASH = 2 hours of audio per hour of wall time. 28,800 ASD = 8 hours of audio per day. For a podcast transcription pipeline or meeting summarizer, that’s plenty on the free tier.


Cloud Aggregators — Azure AI & AWS Bedrock

Azure and Bedrock don’t fit neatly into the fixed-tier tables above because their limits are configured per deployment rather than assigned by spend tier. Both use a 6 RPM per 1,000 TPM ratio.
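Under that ratio, the request ceiling follows directly from the TPM quota. A quick sketch (the quota value is illustrative):

```python
def rpm_from_tpm(tpm_quota: int) -> int:
    """Azure/Bedrock convention: 6 RPM granted per 1,000 TPM of quota."""
    return (tpm_quota // 1_000) * 6

# A deployment with a 100K TPM quota gets a 600 RPM ceiling
print(rpm_from_tpm(100_000))
```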

Azure AI (Microsoft)

| Deployment Type | How Limits Work |
|---|---|
| Pay-as-you-go (Standard) | TPM quota per model per region, auto-scales with usage |
| Provisioned (PTU) | Reserved throughput units, no per-request limits |
| Global/Data Zone | Higher default quotas, multi-region routing |

If you’re already on Azure, this is the path of least resistance for OpenAI models with enterprise compliance (SOC 2, HIPAA BAA). Azure AI Quotas & Limits

AWS Bedrock

Hosts Claude, Llama, Mistral, and other models with per-model, per-region quotas. Default quotas vary by model and can be increased via AWS Service Quotas console. Provisioned Throughput deployments remove rate limits entirely. AWS Bedrock Quotas


More Providers — Perplexity, Alibaba (Qwen), Moonshot (Kimi)

| Metric | Perplexity Sonar Pro | Perplexity Sonar | Alibaba Qwen3 Max | Alibaba Qwen3.5 Plus | Alibaba Qwen3.5 Flash | Moonshot Kimi K2.5 |
|---|---|---|---|---|---|---|
| RPM (T0) | 50 | 50 | — | — | — | — |
| RPM (T1) | 150 | 150 | 600 | 15,000 | 15,000 | 200 |
| RPM (T3+) | 1,000 | 1,000 | 600 | 30,000 | 30,000 | — |
| TPM | — | — | 1M | 5M | 10M | — |
| TPD (T0) | — | — | — | — | — | 1.5M |

Perplexity tiers: T0 = new account, T1 = $50+, T3 = $500+. Alibaba limits shown for international (Singapore) deployment. Moonshot T1 = $10+ cumulative recharge; T0 has 1.5M TPD cap, T1+ is unlimited.

Standout: Alibaba’s Qwen3.5 Flash at 15K RPM and 10M TPM is the highest entry-tier throughput in this post; only OpenAI’s Tier 5 ($1,000+ spend) exceeds it. Perplexity is unique as a search-augmented LLM, so its value is in grounded answers, not raw throughput. Moonshot’s Kimi K2.5 has competitive pricing with an automatic 75% caching discount.


Provider Notes

  • Anthropic: Cached input tokens don’t count toward ITPM limits for most models. With effective prompt caching, a 2M ITPM limit can handle 10M+ total input tokens/minute. No free tier; lowest entry is $5. Docs
  • OpenAI: RPD limits only on the Free tier. Paid tiers drop daily caps entirely. Batch API offers 50% discount with higher limits. Docs
  • Google: Gemini 3.1 Pro is the current flagship; Gemini 3.1 Flash Lite is the budget tier. Limits enforced per GCP project, not per API key. Most generous free-tier TPM. Docs
  • Groq: Only provider using TPD (daily token budgets) alongside TPM. Cached tokens don’t count. Runs on custom LPU hardware. Docs
  • Mistral: European data residency (GDPR). Simple 2-tier structure. Docs
  • xAI: Highest guaranteed RPM at entry level (1,200). Docs
  • DeepSeek: No tiers, flat limits. Competitive pricing, low RPM. Docs
  • NVIDIA NIM: Hosted API is for prototyping (40 RPM). Self-hosted NIM containers have no limits. Docs
  • Together AI: Dynamic rate limits since Jan 2026. Limits grow with sustained usage. Docs
  • Fireworks AI: 6,000 RPM spike arrest ceiling, not guaranteed. On-demand GPU deployments remove limits entirely. Docs
  • Cohere: Endpoint-specific limits. Best for Embed (2K RPM) and Rerank (1K RPM) in RAG pipelines. Docs
  • Perplexity: Search-augmented LLM API. 6 tiers (T0-T5) based on cumulative spend. Uses leaky bucket algorithm. Sonar Deep Research model has much lower limits (5-100 RPM). Docs
  • Alibaba (Qwen): Massive model catalog with region-specific limits (Singapore, US, EU, Beijing). Qwen3.5 Flash at 15K RPM / 10M TPM is the highest entry-tier throughput of any provider listed here. International deployment via Singapore. Docs
  • Moonshot (Kimi): Tiered by cumulative recharge ($1 minimum). T0 has 1.5M TPD daily cap; T1+ ($10) unlocks unlimited daily tokens and 200 RPM. Automatic 75% caching discount. Docs
  • AWS Bedrock: Hosts Claude, Llama, Mistral with per-model per-region quotas. Same 6 RPM per 1K TPM ratio as Azure. Provisioned Throughput removes limits. Docs

Tips for managing rate limits

  • Use exponential backoff when you hit 429s. Most SDKs handle this automatically.
  • Batch where possible. OpenAI and Anthropic both offer batch APIs with higher limits and lower prices (50% discount on OpenAI).
  • Cache aggressively. Anthropic’s prompt caching effectively multiplies your ITPM limit. Groq’s cached tokens also don’t count.
  • Route by model. Don’t send simple classification tasks to your most expensive model. Use a cheap model (Haiku, Flash-Lite, GPT-4.1 Nano) for simple work and reserve the flagship for complex tasks.
  • Monitor before you scale. Every provider has a dashboard showing your current usage against limits. Check it before architecting around limits you haven’t actually hit yet.
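The first tip can be sketched with stdlib-only retry logic. This is a minimal example, not any SDK's actual behavior; `RateLimitError` stands in for whatever 429 exception your client raises, and real SDKs (openai, anthropic) ship equivalent retry logic:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for an SDK's 429 exception."""

def with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0):
    """Retry `fn` on 429s with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the 429 to the caller
            # Delay doubles each attempt; jitter de-synchronizes concurrent clients
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

If the provider returns a `Retry-After` header on 429s, prefer honoring it over the computed delay.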

Further Reading