The Inference Stack Top to Bottom
What happens between your API call and a streamed token — routing, batching, KV cache, quantization, and speculative decoding explained.
Every API call to a hosted LLM triggers a cascade of systems — load balancers, routers, batch schedulers, model parallelism coordinators, KV cache managers, quantized matrix multiplications, and sampling logic — before a single token reaches the client. Understanding this stack explains why latency varies between providers, why streaming feels faster than it is, and why the difference between a $0.50/M and $15/M token price reflects genuine engineering tradeoffs rather than arbitrary markup.
Table of Contents
- The Request Path Overview
- API Gateway and Authentication
- Routing and Model Selection
- Prompt Preprocessing and Tokenization
- The Scheduling Layer: Continuous Batching
- Model Parallelism: Tensor, Pipeline, and Expert
- The KV Cache: Why Memory Is the Real Bottleneck
- Quantization at Inference Time
- Attention Kernels and FlashAttention
- Speculative Decoding
- Sampling: Temperature, Top-p, and Beyond
- Streaming and Token Delivery
- Prefix Caching and Prompt Caching
- Putting It All Together: Latency Budget Breakdown
- Summary
- Further Reading
The Request Path Overview
A mental model of the full path from HTTP request to streamed token:
End-to-end inference path from API call to streamed token delivery.
The entire round-trip for a first token (time-to-first-token, TTFT) on a flagship model like GPT-5.4 or Claude Opus 4.6 typically ranges from 200ms to 2s depending on prompt length, load, and provider. Subsequent tokens arrive at 30–120 tokens/second for most hosted endpoints. Each layer in this stack contributes latency, and each represents a place where engineering teams make cost/quality/speed tradeoffs.
API Gateway and Authentication
The request hits an API gateway before anything model-related happens. This layer handles:
- TLS termination — typically at an edge proxy (Envoy, nginx, or a cloud load balancer)
- Authentication — API key validation, OAuth token verification, or in enterprise deployments, workload identity via mTLS
- Rate limiting — token-aware rate limiting (not just request counting) that inspects the max_tokens parameter and estimated input token count
- Request validation — schema validation of the JSON body, parameter bounds checking, content policy pre-screening
- Region routing — directing requests to the nearest inference cluster with available capacity
At providers like Anthropic and OpenAI, this layer also handles organization-level quota tracking, billing metering, and abuse detection. The gateway logs the estimated input token count before tokenization happens — providers use a fast approximation (roughly len(text) / 4 for English) for rate limiting decisions, then compute exact counts downstream.
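The fast approximation can be sketched in a couple of lines — the 4-characters-per-token ratio is the rough English heuristic from the text, not any provider's actual constant:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Cheap token-count approximation for rate-limiting decisions.

    Exact counts are computed downstream by the real tokenizer; this
    heuristic only needs to be fast and roughly right for English text.
    """
    return max(1, round(len(text) / chars_per_token))

# A 400-character English prompt is treated as roughly 100 tokens for
# quota purposes; the exact count comes later from the tokenizer.
```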
Gateway latency: 5–20ms for most providers, occasionally higher during quota lookup for enterprise accounts with complex billing hierarchies.
Routing and Model Selection
After authentication, the router decides which physical GPU cluster serves the request. This involves several decisions:
Model version resolution. A request for claude-sonnet-4.6 maps to a specific model checkpoint, weight version, and system prompt configuration. Providers maintain multiple versions simultaneously during rollouts — a model alias like gpt-5.4 might resolve to gpt-5.4-0312 or gpt-5.4-0319 depending on rollout state.
Capacity-aware routing. The router checks available capacity across multiple GPU clusters, often spanning regions. If the primary cluster is saturated, requests overflow to secondary clusters. This is why latency can vary by 2–5x between requests to the same endpoint — different clusters have different utilization and hardware configurations.
Priority queuing. Most providers implement tiered priority queues. Batch API requests (OpenAI’s batch endpoint, Anthropic’s message batches) go into low-priority queues that fill idle GPU capacity. Real-time requests get higher priority. Enterprise customers with reserved capacity get dedicated queues.
Model-specific routing. Different model sizes need different hardware configurations. Claude Haiku 4.5 can run on fewer GPUs with less memory than Claude Opus 4.6. The router directs requests to appropriately provisioned clusters.
Routing layer distributes requests across clusters by priority and capacity.
Router latency: 2–10ms, but queue wait time during peak load can add 100ms–several seconds, which is the primary source of TTFT variance.
Prompt Preprocessing and Tokenization
The raw text prompt gets converted to token IDs through the model’s tokenizer. This is CPU-bound work and fast, but the details matter.
Tokenizer specifics by model family:
| Model Family | Tokenizer | Vocab Size | Avg Chars/Token (English) |
|---|---|---|---|
| GPT-5.4 | cl200k_base (extended) | ~210,000 | ~3.8 |
| Claude Opus/Sonnet 4.6 | Custom SentencePiece | ~160,000 | ~3.6 |
| Gemini 2.5/3.x | SentencePiece (v2) | ~256,000 | ~4.0 |
| Llama 4 | SentencePiece BPE | ~128,000 | ~3.5 |
Larger vocabularies generally mean fewer tokens per prompt (reducing compute and KV cache usage) but increase the size of the embedding and output projection matrices. Gemini’s 256K vocabulary is notably large and contributes to its relatively efficient tokenization of multilingual text.
System prompt injection happens at this stage. The provider prepends its system prompt (safety instructions, behavioral guidelines, tool-use formatting) to the user’s messages. These system tokens count toward the context window but are typically prefix-cached (covered later), so they don’t add meaningful latency after the first request.
Chat template formatting converts the structured messages array into the model’s expected token format — special tokens like <|im_start|>, [INST], or <human> delimit roles. Getting this wrong during self-hosted inference is a common source of degraded quality.
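A minimal formatter makes the idea concrete. This sketch assumes ChatML-style delimiters (the <|im_start|>/<|im_end|> tokens mentioned above); real templates are model-specific, which is exactly why applying the wrong one degrades quality:

```python
# Minimal chat-template formatter (ChatML-style delimiters assumed).
def apply_chat_template(messages: list[dict]) -> str:
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    # Leave the assistant turn open so the model continues from here
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = apply_chat_template([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
])
```

Self-hosted stacks usually get this from the model repo (e.g. a bundled Jinja template) rather than hand-writing it.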
Tokenization latency: <1ms for most prompts, up to 5–10ms for very long contexts (100K+ tokens).
The Scheduling Layer: Continuous Batching
This is where modern inference engines diverge most from naive implementations. A naive approach processes one request at a time: prefill all input tokens, then decode output tokens one at a time until done. Continuous batching (also called iteration-level batching or in-flight batching) changed everything.
The problem with static batching: Requests have wildly different input and output lengths. If you batch 8 requests together, 7 might finish while 1 is still generating. Static batching forces those 7 to wait, wasting GPU cycles.
Continuous batching allows the scheduler to:
- Insert new requests into the batch at every decode iteration
- Remove completed requests immediately
- Preempt low-priority requests when high-priority ones arrive
- Maintain different requests at different stages (some prefilling, some decoding)
vLLM popularized this approach, and it’s now standard in TensorRT-LLM, TGI (Text Generation Inference), and every major provider’s custom stack.
Prefill vs. decode — the two phases:
The prefill phase processes all input tokens in parallel. This is compute-bound — the GPU performs matrix multiplications across all input tokens simultaneously. For a 4,000-token prompt on an H100, prefill takes ~50–200ms depending on model size.
The decode phase generates tokens one at a time (autoregressively). Each decode step processes only the new token plus KV cache lookups. This is memory-bandwidth-bound — the GPU spends most of its time loading model weights from HBM rather than doing math.
Prefill processes input in parallel; decode generates tokens one at a time against the cached keys/values.
This prefill/decode asymmetry creates a scheduling challenge: prefill operations are bursty and compute-heavy, while decode operations are steady and memory-bound. Sophisticated schedulers separate these phases, sometimes running prefill on dedicated GPU partitions (a technique called disaggregated prefill) or chunking long prefills into smaller pieces to avoid stalling concurrent decode operations.
```python
# Simplified continuous batching loop (conceptual)
class ContinuousBatcher:
    def __init__(self, model, max_batch_size=256):
        self.model = model
        self.waiting_queue = []
        self.running_batch = {}  # request_id -> state
        self.max_batch = max_batch_size

    def step(self):
        # Admit new requests while capacity is available
        while self.waiting_queue and len(self.running_batch) < self.max_batch:
            req = self.waiting_queue.pop(0)
            # Run prefill for this request (chunked if long); real engines
            # also take the first output token from the prefill logits
            kv_cache = self.model.prefill(req.input_ids)
            self.running_batch[req.id] = {
                "kv_cache": kv_cache,
                "generated": [],
                "req": req,
            }

        if not self.running_batch:
            return

        # Single decode step for all running requests
        batch_logits = self.model.decode_step(
            [s["kv_cache"] for s in self.running_batch.values()],
            [s["generated"][-1] if s["generated"] else None
             for s in self.running_batch.values()],
        )

        # Sample, check stopping criteria, then evict finished requests
        finished = []
        for req_id, logits in zip(self.running_batch, batch_logits):
            state = self.running_batch[req_id]
            token = sample(logits, state["req"].params)
            state["generated"].append(token)
            if token == EOS or len(state["generated"]) >= state["req"].max_tokens:
                finished.append(req_id)
        for req_id in finished:
            self.finish(req_id)
```
The actual implementations are far more complex — they handle memory allocation for KV caches, preemption policies, chunked prefill scheduling, and multi-node coordination — but the core loop is: admit new requests, run one decode step for all active requests, evict finished ones.
Model Parallelism: Tensor, Pipeline, and Expert
Frontier models don’t fit on a single GPU. An H100 has 80GB of HBM3. A dense 70B-parameter model in FP16 requires ~140GB just for weights. Flagship models (GPT-5.4, Claude Opus 4.6) are substantially larger. Parallelism strategies distribute the model across multiple GPUs.
Tensor Parallelism (TP) splits individual matrix multiplications across GPUs. A single attention head’s QKV projection gets partitioned column-wise, each GPU computes its portion, and results are combined via all-reduce. TP requires high-bandwidth interconnects (NVLink at 900 GB/s between H100s) because GPUs communicate every layer.
Pipeline Parallelism (PP) assigns different layers to different GPUs. GPU 0 runs layers 0–15, GPU 1 runs layers 16–31, etc. Communication happens only between adjacent stages (one GPU sends activations to the next). PP introduces pipeline bubbles — idle time when some stages wait for input — but uses less interconnect bandwidth than TP.
Expert Parallelism (EP) applies specifically to Mixture-of-Experts (MoE) models. Llama 4’s MoE variant, DeepSeek-V3, and reportedly GPT-5.4’s architecture use MoE. Each GPU holds a subset of experts. A routing network selects which experts to activate per token, and tokens get dispatched to the GPUs holding those experts. EP reduces compute per token (only ~2 of 16 experts fire, for example) but introduces load-balancing challenges when certain experts are popular.
Three parallelism strategies, often combined. TP needs high-bandwidth NVLink; PP and EP tolerate lower bandwidth.
Typical configurations for hosted inference:
| Model | Architecture | Likely Parallelism | GPUs per Instance |
|---|---|---|---|
| Claude Haiku 4.5 | Dense (smaller) | TP=2 or TP=4 | 2–4 H100s |
| Claude Sonnet 4.6 | Dense | TP=4 to TP=8 | 4–8 H100s |
| Claude Opus 4.6 | Dense (largest) | TP=8, possibly PP=2 | 8–16 H100s |
| GPT-5.4 | Likely MoE | TP + EP | 8–16+ H100s |
| DeepSeek-V3 | MoE (671B total, ~37B active) | TP + EP | 8 H100s |
| Llama 4 70B (dense) | Dense | TP=4 | 4 H100s (FP8) |
These are informed estimates — providers don’t publish exact configurations — but they align with known model sizes and reported inference characteristics.
The parallelism strategy directly affects latency and throughput. More TP means lower latency (each GPU does less work per layer) but worse throughput per GPU dollar. Providers tune these parameters based on the target latency SLA for each model tier.
The KV Cache: Why Memory Is the Real Bottleneck
The KV (key-value) cache is the single most important concept for understanding inference economics. During autoregressive generation, each new token attends to all previous tokens. Without caching, generating token N would require recomputing attention over all N-1 previous tokens — O(N²) total compute for N tokens.
The KV cache stores the key and value projections from every previous token at every layer, so each decode step only computes attention for the new token against cached K/V pairs. This makes decode O(N) total but introduces a massive memory problem.
KV cache memory per token:
bytes_per_token = 2 × num_layers × num_kv_heads × head_dim × bytes_per_element
For a model with 80 layers, 8 KV heads (GQA), 128 head dimension, in FP16:
2 × 80 × 8 × 128 × 2 = 327,680 bytes ≈ 320 KB per token
For a 128K context window: 128,000 × 320 KB = ~40 GB of KV cache per request. On an 8-GPU H100 cluster with 640GB total HBM, the KV cache for just 16 concurrent 128K-context requests would consume the entire memory.
This is why context length directly determines how many concurrent requests a cluster can serve, and why long-context requests cost more per token at some providers.
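The per-token formula translates directly into code; the example numbers below match the 80-layer GQA configuration used in the text:

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_element: int = 2) -> int:
    """KV cache footprint per token:
    2 (K and V) x layers x kv_heads x head_dim x dtype size."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
per_request = per_token * 128_000  # one 128K-context request
# per_token == 327_680 bytes (~320 KB), so per_request is ~40 GB --
# sixteen such requests fill an 8xH100 node's 640 GB of HBM.
```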
PagedAttention (from vLLM) addresses KV cache memory fragmentation. Rather than pre-allocating contiguous memory for the maximum possible sequence length, PagedAttention allocates KV cache in small pages (blocks) and maps them through a page table, similar to OS virtual memory. This eliminates internal fragmentation and improves memory utilization from ~50-60% to ~95%+.
PagedAttention maps virtual KV cache blocks to physical GPU memory pages, eliminating fragmentation.
Grouped-Query Attention (GQA) reduces KV cache size by sharing key/value heads across multiple query heads. Instead of 64 KV heads matching 64 query heads (MHA), GQA might use 8 KV heads shared across 64 query heads — an 8x reduction in KV cache size with minimal quality loss. Nearly all models deployed after 2024 use GQA or its extreme variant, Multi-Query Attention (MQA, 1 KV head).
Multi-head Latent Attention (MLA), used by DeepSeek-V3, goes further by projecting keys and values into a low-rank latent space before caching, reducing per-token KV cache to ~70 KB — roughly 4–5x smaller than GQA at comparable model sizes.
Quantization at Inference Time
Quantization reduces the precision of model weights (and sometimes activations and KV cache) from FP16/BF16 (16 bits) to lower bitwidths. The tradeoff is straightforward: less precision means less memory, faster memory bandwidth utilization, and cheaper inference, but potentially degraded output quality.
Common quantization formats for inference:
| Format | Bits | Memory Savings | Quality Impact | Hardware Support |
|---|---|---|---|---|
| FP16/BF16 | 16 | Baseline | None | Universal |
| FP8 (E4M3/E5M2) | 8 | 2x | Minimal for most models | H100, L40S, MI300X, B200 |
| INT8 (W8A8) | 8 | 2x | Minimal with calibration | H100, A100 |
| INT4 (W4A16) | 4 (weights only) | ~3.5x | Noticeable on small models | CUDA via kernels |
| GPTQ / AWQ | 4 | ~3.5x | Moderate, model-dependent | CUDA via kernels |
| GGUF Q4_K_M | ~4.5 effective | ~3x | Good for chat models | CPU + GPU (llama.cpp) |
FP8 is the current production sweet spot. H100s have native FP8 tensor cores that deliver ~2x the FLOPS of FP16. Most major providers run FP8 inference for their flagship models, with careful calibration to preserve quality. The quality difference between FP8 and FP16 inference is generally within noise on standard benchmarks for models above 30B parameters.
Weight-only quantization (W4A16, W8A16) quantizes stored weights but dequantizes to FP16 for computation. This saves memory and bandwidth but doesn’t improve compute throughput. It’s popular for self-hosted deployments where GPU memory is the constraint.
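A toy version of the weight-only idea, in pure Python with a single per-tensor scale (real kernels use per-channel or per-group scales and packed storage):

```python
# Symmetric weight-only INT8 quantization (the W8A16 idea in miniature).
# Weights are stored as int8 plus one float scale; they are dequantized
# back to full precision before the matmul, so memory and bandwidth
# shrink by 2x but compute precision is unchanged.
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q: list[int], scale: float) -> list[float]:
    return [qi * scale for qi in q]

w = [0.31, -1.27, 0.04, 0.9]
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
# Round-trip error per weight is bounded by scale / 2 (~0.005 here).
```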
W4A4 quantization is emerging for smaller models — both weights and activations in 4-bit. NVIDIA’s B200/B100 GPUs have native FP4 tensor cores for this. Quality degradation is nontrivial and requires quantization-aware training or very careful post-training quantization.
For self-hosted inference, the practical guidance:
```bash
# FP8 inference with vLLM (H100/L40S)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-4-70B \
    --quantization fp8 \
    --tensor-parallel-size 4 \
    --max-model-len 131072
```
```bash
# 4-bit inference with llama.cpp (consumer hardware)
./llama-server \
    -m llama-4-70b-Q4_K_M.gguf \
    -ngl 99 \
    --ctx-size 8192 \
    --host 0.0.0.0 --port 8080
```
Attention Kernels and FlashAttention
Standard attention computes softmax(QK^T / √d) × V, which naively requires materializing the full N×N attention matrix in GPU memory. For N=128K tokens, that’s a 128K × 128K matrix of floats — 64GB for a single attention head in FP32.
FlashAttention (now at v3 for Hopper GPUs) restructures this computation using tiling and kernel fusion to:
- Never materialize the full attention matrix — it computes attention in blocks that fit in GPU SRAM (shared memory / L2)
- Fuse the softmax, scaling, and V multiplication into a single kernel — reducing HBM reads/writes
- Support variable sequence lengths efficiently via jagged/ragged tensors
FlashAttention tiles the computation to fit in fast SRAM, avoiding the N×N memory bottleneck.
FlashAttention-3 on H100s exploits the asynchronous TMA (Tensor Memory Accelerator) and warp-group level parallelism specific to the Hopper architecture. Benchmarks show ~1.5–2x speedup over FlashAttention-2 on H100s.
Ring Attention extends FlashAttention across multiple GPUs for extremely long contexts. Each GPU holds a block of the KV cache and attention is computed in a ring pattern — GPU i computes attention against its local KV block, then passes its Q block to GPU i+1. This enables context lengths beyond what fits in single-GPU memory without the full materialization cost.
FlashDecoding optimizes the decode phase specifically. During decode, the query is a single token but keys/values span the entire sequence. FlashDecoding parallelizes across the KV sequence dimension rather than the batch dimension, which is critical for long-context decode performance.
These kernel-level optimizations are invisible at the API layer but directly determine the tokens/second and TTFT characteristics of every provider.
Speculative Decoding
Standard autoregressive decoding generates one token per forward pass through the full model. For a 70B-parameter model, each decode step costs roughly 140 GFLOPs (about 2 FLOPs per parameter per token). Speculative decoding generates multiple tokens per step by using a smaller “draft” model.
The algorithm:
- A small draft model (e.g., 1-7B parameters) generates K candidate tokens autoregressively (fast, because the model is small)
- The full target model verifies all K tokens in a single forward pass (parallel, like prefill)
- Tokens are accepted left-to-right using a rejection sampling scheme that guarantees the output distribution exactly matches the target model
- If the first M ≤ K tokens are accepted, one additional token is sampled from an adjusted distribution at position M+1
The key insight: verification is parallel (one forward pass checks all K tokens), while draft generation is sequential but cheap. If the draft model matches the target model’s distribution well (high acceptance rate), speculative decoding provides ~2-3x speedup for decode-bound workloads.
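The accept/reject rule can be written down concretely. This sketch treats the draft and target distributions as plain probability dicts for a single position, and takes the uniform draw as an argument so the logic is deterministic; real implementations operate on batched logits for all K positions at once:

```python
def speculative_accept(p: dict, q: dict, draft_token: str, u: float):
    """One accept/reject step of speculative sampling.

    p: target-model distribution, q: draft-model distribution,
    u: a uniform(0, 1) draw. Returns (accepted, residual distribution
    to sample the replacement token from when rejected).
    Accepting with prob min(1, p(x)/q(x)) and resampling rejections from
    normalize(max(0, p - q)) leaves the output distributed exactly as p.
    """
    accept_prob = min(1.0, p.get(draft_token, 0.0) / q[draft_token])
    if u < accept_prob:
        return True, None
    residual = {t: max(0.0, p.get(t, 0.0) - q.get(t, 0.0)) for t in p}
    z = sum(residual.values())
    return False, {t: v / z for t, v in residual.items()}

p = {"a": 0.7, "b": 0.3}  # target distribution
q = {"a": 0.4, "b": 0.6}  # draft distribution
# Draft proposed "a": p(a)/q(a) = 1.75 >= 1, so it is always accepted.
# Draft proposed "b": accepted only when u < 0.3/0.6 = 0.5.
```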
Acceptance rates depend on how well the draft model approximates the target model. For well-matched pairs (e.g., a distilled 1B draft for a 70B target), acceptance rates of 70-85% are typical for natural language, lower for code or reasoning.
Medusa and EAGLE are alternative approaches that attach extra prediction heads to the target model itself, avoiding the need for a separate draft model. EAGLE-2, in particular, achieves ~3-4x speedup on some benchmarks by predicting multiple future tokens from the target model’s hidden states.
Providers probably use speculative decoding for their mid-tier models (Haiku, Flash) where decode speed is the key selling point. For the largest models, the draft model needs to be proportionally capable, and the engineering complexity increases.
Speculative decoding uses a small draft model to propose tokens verified by the full model in parallel.
Sampling: Temperature, Top-p, and Beyond
After the model produces logits (one score per vocabulary token), the sampling strategy converts these into the next token. This is the last step before a token enters the output stream.
Temperature scales logits before softmax: p(token) = softmax(logits / temperature). Temperature 0 (greedy) always picks the highest-probability token. Temperature 1.0 samples from the model’s natural distribution. Temperature >1 flattens the distribution (more random). Most API defaults are 1.0 with top_p=1.0.
Top-p (nucleus sampling) sorts tokens by probability, then samples from the smallest set whose cumulative probability exceeds p. Top-p=0.95 means: find the top tokens that account for 95% of probability mass, sample from those. This dynamically adjusts the candidate set size based on the model’s confidence.
Top-k is simpler: only consider the top K tokens by probability. Top-k=50 means sample from the 50 most likely tokens. Less adaptive than top-p.
Min-p (supported by llama.cpp, vLLM, and some providers) sets a minimum probability threshold relative to the top token. If the top token has probability 0.9 and min_p=0.1, only tokens with probability ≥ 0.09 are considered. This adapts naturally to the model’s confidence level.
Repetition penalty and frequency/presence penalties modify logits based on tokens that have already appeared in the output. These reduce repetitive text but can degrade quality if set too high.
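Temperature scaling plus nucleus filtering, written out in plain Python over a toy logit dict (real engines do this on GPU logit tensors for the whole batch):

```python
import math

def sample_filter(logits: dict, temperature: float = 1.0,
                  top_p: float = 1.0) -> dict:
    """Apply temperature, then top-p (nucleus) filtering; return the
    renormalized candidate distribution to sample from."""
    if temperature == 0:  # greedy: the argmax token with probability 1
        best = max(logits, key=logits.get)
        return {best: 1.0}
    # softmax(logits / temperature), shifted by the max for stability
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(v - m) for t, v in scaled.items()}
    z = sum(exps.values())
    probs = {t: e / z for t, e in exps.items()}
    # keep the smallest high-probability set whose mass reaches top_p
    kept, cum = {}, 0.0
    for t, pr in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[t] = pr
        cum += pr
        if cum >= top_p:
            break
    z = sum(kept.values())
    return {t: pr / z for t, pr in kept.items()}
```

With temperature 0 this degenerates to greedy decoding, which is why deterministic settings are preferred for structured output.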
The sampling order matters and varies between implementations:
Sampling pipeline — order varies between implementations, producing subtly different outputs.
Some implementations apply penalties before temperature; others after. This produces subtly different output distributions with the same parameters. It’s one reason why self-hosted models with identical weights can produce different outputs than API endpoints.
Practical guidance on sampling parameters:
| Use Case | Temperature | Top-p | Notes |
|---|---|---|---|
| Code generation | 0 – 0.2 | 1.0 | Determinism matters |
| Factual Q&A | 0 – 0.3 | 0.95 | Low creativity |
| Creative writing | 0.7 – 1.0 | 0.95 | Higher variety |
| Brainstorming | 1.0 – 1.3 | 1.0 | Maximally diverse |
| Structured output (JSON) | 0 | 1.0 | Greedy avoids format errors |
Streaming and Token Delivery
Most inference APIs stream tokens via Server-Sent Events (SSE) over HTTP. Each token (or small group of tokens) is sent as it’s generated, rather than waiting for the complete response.
The streaming response contains a sequence of events:
```
data: {"id":"msg_01X","type":"content_block_delta","delta":{"type":"text_delta","text":"The"}}

data: {"id":"msg_01X","type":"content_block_delta","delta":{"type":"text_delta","text":" inference"}}

data: {"id":"msg_01X","type":"content_block_delta","delta":{"type":"text_delta","text":" stack"}}
```
Detokenization subtleties. The model generates token IDs, not text. Token ID 1234 might map to " inference" (with a leading space) or "inf" (a partial word). The detokenizer must handle:
- Token boundaries that don’t align with character boundaries (especially in multibyte UTF-8)
- Tokens that are prefixes of other tokens — the server sometimes buffers 1-2 tokens to ensure the detokenized text is valid UTF-8
- Special tokens that should be filtered from the output
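The multibyte-buffering concern is exactly what an incremental UTF-8 decoder handles; a sketch of the server-side behavior using Python's standard library:

```python
import codecs

# Token pieces arrive as raw bytes; a 2-byte UTF-8 character ("é") can
# straddle two pieces. The incremental decoder buffers the dangling
# prefix instead of ever emitting invalid text to the stream.
decoder = codecs.getincrementaldecoder("utf-8")()

pieces = [b"caf", b"\xc3", b"\xa9"]  # "café" split mid-character
emitted = [decoder.decode(piece) for piece in pieces]
# emitted == ["caf", "", "é"]: the lone lead byte yields nothing, and
# the accented character is released once its final byte arrives.
```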
Token grouping. Some providers batch multiple tokens into a single SSE event for efficiency, especially at high generation speeds. A server generating at 100 tokens/s might send events every 30-50ms containing 3-5 tokens each rather than one event per token.
Backpressure. If the client reads slowly (or has high network latency), the server’s output buffer fills. Well-implemented inference servers apply backpressure to the decode loop — they won’t generate tokens faster than the client can consume them. This prevents unbounded memory growth but means slow clients get lower throughput.
Prefix Caching and Prompt Caching
Prefix caching stores the KV cache for common prompt prefixes across requests. If 1,000 requests share the same system prompt (which they do for every API call to a given model), the KV cache for that system prompt can be computed once and reused.
Automatic prefix caching (APC), implemented in vLLM and TensorRT-LLM, hashes prompt token sequences and caches their KV states. When a new request arrives with a matching prefix, the prefill phase skips the cached portion.
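Hash-based prefix lookup can be sketched with fixed-size token blocks, roughly in the spirit of vLLM's APC — the block size and chained-hash scheme here are illustrative choices, not vLLM's actual implementation:

```python
BLOCK = 16  # tokens per KV cache block (illustrative)

def block_hashes(token_ids: list[int]) -> list[int]:
    """Chained hash per full block: a block's hash commits to the entire
    prefix before it, so two prompts share a cache entry only if they
    agree on every token up to that block boundary."""
    hashes, prev = [], 0
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK, BLOCK):
        prev = hash((prev, tuple(token_ids[i:i + BLOCK])))
        hashes.append(prev)
    return hashes

def cached_prefix_tokens(token_ids: list[int], cache: set[int]) -> int:
    """Number of leading tokens whose KV states can be reused."""
    n = 0
    for h in block_hashes(token_ids):
        if h not in cache:
            break
        n += BLOCK
    return n

system = list(range(40))              # shared 40-token system prompt
cache = set(block_hashes(system))     # filled by an earlier request
# A later request with the same system prompt skips prefill for the
# first 32 tokens (two full blocks); the partial third block is recomputed.
```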
Provider-level prompt caching is more explicit:
| Provider | Feature | Cache Duration | Discount |
|---|---|---|---|
| Anthropic | Prompt caching (beta) | 5 min TTL, extended on hit | 90% off cached tokens |
| OpenAI | Automatic caching | Session-based | 50% off cached tokens |
| Google (Gemini) | Context caching | Configurable TTL | ~75% off cached tokens |
Prefix caching skips redundant prefill for shared prompt prefixes across requests.
The economics are substantial. A 10K-token system prompt + few-shot examples that appear in every request costs ~$0.15/M with GPT-5.4 at standard input pricing. With caching, those tokens are $0.075/M or less after the initial cache fill. For high-volume applications, this can reduce input token costs by 60-80%.
Implementation detail: Cached KV states must be stored in GPU HBM (or at minimum in CPU RAM with fast copy-back). This creates a tension between cache size and batch capacity — memory used for prefix caches is memory not available for active request KV caches. Providers balance this using LRU eviction and usage-frequency-weighted caching policies.
For self-hosted deployments:
```python
# vLLM automatic prefix caching
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-70B",
    enable_prefix_caching=True,  # Enable APC
    max_model_len=32768,
)

# First request: full prefill
result1 = llm.generate(
    ["<system>You are a helpful assistant...</system>\n<user>What is 2+2?</user>"],
    SamplingParams(temperature=0),
)

# Second request with same prefix: prefill skips cached portion
result2 = llm.generate(
    ["<system>You are a helpful assistant...</system>\n<user>What is 3+3?</user>"],
    SamplingParams(temperature=0),
)
```
Putting It All Together: Latency Budget Breakdown
A typical API request to a frontier model, broken down by where time is spent:
Scenario: 2,000 input tokens, 500 output tokens, Claude Sonnet 4.6, streaming
| Stage | Latency | Notes |
|---|---|---|
| Network round-trip (client ↔ edge) | 10–50ms | Geography dependent |
| API Gateway (auth, validation) | 5–15ms | |
| Routing + queue wait | 5–500ms | Highly variable under load |
| Tokenization + preprocessing | 1–3ms | |
| Prefill (2K tokens) | 50–150ms | Compute-bound, depends on batch state |
| First decode step | 10–30ms | Produces first output token |
| Time to first token (TTFT) | ~100–750ms | Sum of above |
| Decode (499 remaining tokens) | 5–10s | At ~50-100 tok/s |
| Total request time | ~5–11s | |
The variance is dominated by queue wait time and prefill duration. During off-peak hours, TTFT for a short prompt on Sonnet is ~200ms. During peak, it can exceed 1s.
For long-context requests (100K+ input tokens):
Prefill dominates. 100K tokens at ~100ms per 2K tokens (rough estimate) means 5+ seconds of prefill alone. This is why long-context TTFT is noticeably slower and why providers like Anthropic charge more for >32K context tokens.
For batch API requests:
Queue wait time can be minutes to hours. But per-token cost is 50% less (OpenAI, Anthropic) because providers fill GPU idle capacity with batch work. If latency isn’t a constraint, batch API pricing is the most cost-efficient option.
Decode speed by provider (approximate, March 2026):
| Provider + Model | Decode Speed (tok/s) | TTFT (short prompt) |
|---|---|---|
| GPT-5.4 | 60–90 | 200–600ms |
| GPT-4.1 Nano | 150–250 | 80–200ms |
| Claude Opus 4.6 | 40–60 | 400–1000ms |
| Claude Sonnet 4.6 | 80–120 | 150–400ms |
| Claude Haiku 4.5 | 150–200 | 50–200ms |
| Gemini 3.1 Flash Lite | 180–300 | 80–250ms |
| Gemini 3.1 Pro | 60–100 | 200–500ms |
These are output tokens per second as observed from the client side. Actual GPU-side generation is faster; network and SSE framing add overhead.
Summary
The inference stack is a series of engineering tradeoffs stacked vertically:
- Routing trades latency variance for throughput via multi-cluster distribution and priority queuing
- Continuous batching maximizes GPU utilization by interleaving requests at the iteration level rather than waiting for static batches to complete
- Model parallelism (TP, PP, EP) distributes models that don’t fit on one GPU, with each strategy trading off communication overhead against pipeline efficiency
- KV cache is the binding constraint on concurrent request capacity — PagedAttention and GQA/MLA are the primary mitigations
- Quantization (FP8 as the current production default) trades minimal quality loss for ~2x memory and throughput improvement
- FlashAttention eliminates the N² memory bottleneck in attention, making long-context inference practical
- Speculative decoding breaks the one-token-per-forward-pass bottleneck using draft models, with ~2–3x speedup at matched quality
- Sampling parameters control the quality/diversity tradeoff, and implementation ordering varies across providers
- Prefix caching amortizes repeated prompt computation across requests, reducing costs by 50–90% for shared prefixes
The cost difference between GPT-4.1 Nano and Claude Opus 4.6 (roughly 30–50x per token) maps directly to differences in these layers: smaller model → fewer GPUs → more concurrent requests → faster decode → less KV cache per request → dramatically lower cost per token. The quality difference comes from model capacity. Every pricing tier reflects a specific configuration of this stack.
Further Reading
- vLLM Documentation — Reference implementation for PagedAttention, continuous batching, and automatic prefix caching
- FlashAttention GitHub — Source code and benchmarks for FlashAttention-2 and FlashAttention-3 (Hopper)
- Efficient Memory Management for Large Language Model Serving with PagedAttention — The original vLLM/PagedAttention paper by Kwon et al.
- TensorRT-LLM — NVIDIA’s optimized inference library with FP8 quantization, speculative decoding, and in-flight batching
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty — EAGLE and EAGLE-2 speculative decoding implementations
- llama.cpp — C/C++ inference engine for quantized models on consumer hardware
- DeepSeek-V3 Technical Report — Details on MLA (Multi-head Latent Attention) and the MoE architecture behind DeepSeek-V3
- Anthropic Prompt Caching Documentation — Implementation details and pricing for Anthropic’s prefix caching feature
- Medusa: Simple LLM Inference Acceleration Framework — Multi-head speculative decoding without a separate draft model
- SGLang — Inference engine with RadixAttention for automatic prefix sharing and advanced scheduling