The Inference Stack Top to Bottom

What happens between your API call and a streamed token — routing, batching, KV cache, quantization, and speculative decoding explained.

The post you bookmark. One topic, covered end to end.


Every API call to a hosted LLM triggers a cascade of systems — load balancers, routers, batch schedulers, model parallelism coordinators, KV cache managers, quantized matrix multiplications, and sampling logic — before a single token reaches the client. Understanding this stack explains why latency varies between providers, why streaming feels faster than it is, and why the difference between a $0.50/M and $15/M token price reflects genuine engineering tradeoffs rather than arbitrary markup.


The Request Path Overview

A mental model of the full path from HTTP request to streamed token:

Client → API Gateway → Router → Tokenizer → Continuous Batcher → GPU Cluster → Sampler → SSE Stream

End-to-end inference path from API call to streamed token delivery.

The entire round-trip for a first token (time-to-first-token, TTFT) on a flagship model like GPT-5.4 or Claude Opus 4.6 typically ranges from 200ms to 2s depending on prompt length, load, and provider. Subsequent tokens arrive at 30–120 tokens/second for most hosted endpoints. Each layer in this stack contributes latency, and each represents a place where engineering teams make cost/quality/speed tradeoffs.


API Gateway and Authentication

The request hits an API gateway before anything model-related happens. This layer handles:

  • TLS termination — typically at an edge proxy (Envoy, nginx, or a cloud load balancer)
  • Authentication — API key validation, OAuth token verification, or in enterprise deployments, workload identity via mTLS
  • Rate limiting — token-aware rate limiting (not just request counting) that inspects the max_tokens parameter and estimated input token count
  • Request validation — schema validation of the JSON body, parameter bounds checking, content policy pre-screening
  • Region routing — directing requests to the nearest inference cluster with available capacity

At providers like Anthropic and OpenAI, this layer also handles organization-level quota tracking, billing metering, and abuse detection. The gateway logs the estimated input token count before tokenization happens — providers use a fast approximation (roughly len(text) / 4 for English) for rate limiting decisions, then compute exact counts downstream.
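A minimal sketch of that fast-path estimate, assuming a hypothetical gateway with a per-organization token budget (`estimate_tokens` and `admits` are illustrative names, not any provider's API):

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Fast token-count approximation for rate-limiting decisions.

    Exact counts require running the tokenizer, which is too slow for
    the gateway hot path; len(text) / ~4 is close enough for English.
    """
    return max(1, round(len(text) / chars_per_token))

def admits(request_body: dict, remaining_token_budget: int) -> bool:
    # Token-aware rate limiting: estimated input plus requested max_tokens
    estimated = estimate_tokens(request_body.get("prompt", ""))
    estimated += request_body.get("max_tokens", 0)
    return estimated <= remaining_token_budget
```

Exact counts computed downstream are reconciled against this estimate for billing.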

Gateway latency: 5–20ms for most providers, occasionally higher during quota lookup for enterprise accounts with complex billing hierarchies.


Routing and Model Selection

After authentication, the router decides which physical GPU cluster serves the request. This involves several decisions:

Model version resolution. A request for claude-sonnet-4.6 maps to a specific model checkpoint, weight version, and system prompt configuration. Providers maintain multiple versions simultaneously during rollouts — a model alias like gpt-5.4 might resolve to gpt-5.4-0312 or gpt-5.4-0319 depending on rollout state.

Capacity-aware routing. The router checks available capacity across multiple GPU clusters, often spanning regions. If the primary cluster is saturated, requests overflow to secondary clusters. This is why latency can vary by 2–5x between requests to the same endpoint — different clusters have different utilization and hardware configurations.

Priority queuing. Most providers implement tiered priority queues. Batch API requests (OpenAI’s batch endpoint, Anthropic’s message batches) go into low-priority queues that fill idle GPU capacity. Real-time requests get higher priority. Enterprise customers with reserved capacity get dedicated queues.

Model-specific routing. Different model sizes need different hardware configurations. Claude Haiku 4.5 can run on fewer GPUs with less memory than Claude Opus 4.6. The router directs requests to appropriately provisioned clusters.
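The routing decisions above can be sketched as a tiered priority queue plus capacity-aware cluster selection. Everything here is illustrative: "free slots" stands in for free KV-cache memory, and the class is not modeled on any provider's implementation:

```python
import heapq
import itertools

class PriorityRouter:
    """Sketch of tiered priority queuing + capacity-aware routing.

    Priorities: 0 = reserved/enterprise, 1 = real-time, 2 = batch.
    """

    def __init__(self, clusters):
        self.clusters = dict(clusters)   # cluster name -> free slots
        self.queue = []                  # heap of (priority, seq, request)
        self.seq = itertools.count()     # FIFO tie-breaker within a tier

    def submit(self, request, priority):
        heapq.heappush(self.queue, (priority, next(self.seq), request))

    def dispatch(self):
        """Pop the highest-priority request and bind it to the
        least-loaded cluster with capacity; None if nothing can run."""
        free = {n: s for n, s in self.clusters.items() if s > 0}
        if not self.queue or not free:
            return None
        _, _, request = heapq.heappop(self.queue)
        # Overflow falls out naturally: when the preferred cluster is
        # saturated it has no free slots, so another cluster is chosen.
        cluster = max(free, key=free.get)
        self.clusters[cluster] -= 1
        return request, cluster
```

Real routers also track per-model hardware requirements and weight-version placement; this sketch only shows the priority and capacity dimensions.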

Request → Load Balancer → Cluster A (us-east) [real-time] / Cluster B (us-west) [overflow] / Batch Queue [batch API, fills idle capacity]

Routing layer distributes requests across clusters by priority and capacity.

Router latency: 2–10ms, but queue wait time during peak load can add 100ms–several seconds, which is the primary source of TTFT variance.


Prompt Preprocessing and Tokenization

The raw text prompt gets converted to token IDs through the model’s tokenizer. This is CPU-bound work and fast, but the details matter.

Tokenizer specifics by model family:

| Model Family | Tokenizer | Vocab Size | Avg Chars/Token (English) |
|---|---|---|---|
| GPT-5.4 | cl200k_base (extended) | ~210,000 | ~3.8 |
| Claude Opus/Sonnet 4.6 | Custom SentencePiece | ~160,000 | ~3.6 |
| Gemini 2.5/3.x | SentencePiece (v2) | ~256,000 | ~4.0 |
| Llama 4 | SentencePiece BPE | ~128,000 | ~3.5 |

Larger vocabularies generally mean fewer tokens per prompt (reducing compute and KV cache usage) but increase the size of the embedding and output projection matrices. Gemini’s 256K vocabulary is notably large and contributes to its relatively efficient tokenization of multilingual text.

System prompt injection happens at this stage. The provider prepends its system prompt (safety instructions, behavioral guidelines, tool-use formatting) to the user’s messages. These system tokens count toward the context window but are typically prefix-cached (covered later), so they don’t add meaningful latency after the first request.

Chat template formatting converts the structured messages array into the model’s expected token format — special tokens like <|im_start|>, [INST], or <human> delimit roles. Getting this wrong during self-hosted inference is a common source of degraded quality.
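A minimal sketch of that formatting step, using ChatML-style `<|im_start|>` delimiters as the assumed convention (real models each ship their own template; self-hosted stacks should use the tokenizer's `apply_chat_template` rather than hand-rolling this):

```python
def format_chatml(messages, system=""):
    """Render a structured messages list into a ChatML-style prompt
    string. Hypothetical helper; each model family defines its own
    special tokens and layout."""
    parts = []
    if system:
        parts.append(f"<|im_start|>system\n{system}<|im_end|>")
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    # Generation prompt: the model continues from the assistant header
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)
```

Mismatched delimiters (wrong role tokens, missing generation prompt) are exactly the "wrong chat template" failure mode mentioned above.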

Tokenization latency: <1ms for most prompts, up to 5–10ms for very long contexts (100K+ tokens).


The Scheduling Layer: Continuous Batching

This is where modern inference engines diverge most from naive implementations. A naive approach processes one request at a time: prefill all input tokens, then decode output tokens one at a time until done. Continuous batching (also called iteration-level batching or in-flight batching) changed everything.

The problem with static batching: Requests have wildly different input and output lengths. If you batch 8 requests together, 7 might finish while 1 is still generating. Static batching forces those 7 to wait, wasting GPU cycles.

Continuous batching allows the scheduler to:

  1. Insert new requests into the batch at every decode iteration
  2. Remove completed requests immediately
  3. Preempt low-priority requests when high-priority ones arrive
  4. Maintain different requests at different stages (some prefilling, some decoding)

vLLM popularized this approach, and it’s now standard in TensorRT-LLM, TGI (Text Generation Inference), and every major provider’s custom stack.

Prefill vs. decode — the two phases:

The prefill phase processes all input tokens in parallel. This is compute-bound — the GPU performs matrix multiplications across all input tokens simultaneously. For a 4,000-token prompt on an H100, prefill takes ~50–200ms depending on model size.

The decode phase generates tokens one at a time (autoregressively). Each decode step processes only the new token plus KV cache lookups. This is memory-bandwidth-bound — the GPU spends most of its time loading model weights from HBM rather than doing math.

Input Prompt (all tokens) → Prefill (parallel, compute-bound) → KV Cache (stored per layer) → Decode (sequential, memory-bound) → Output Tokens (appended each step)

Prefill processes input in parallel; decode generates tokens one at a time against the cached keys/values.

This prefill/decode asymmetry creates a scheduling challenge: prefill operations are bursty and compute-heavy, while decode operations are steady and memory-bound. Sophisticated schedulers separate these phases, sometimes running prefill on dedicated GPU partitions (a technique called disaggregated prefill) or chunking long prefills into smaller pieces to avoid stalling concurrent decode operations.
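Why decode is memory-bandwidth-bound falls out of one line of arithmetic: at batch size 1, every weight byte must stream from HBM once per generated token. The numbers below are assumptions (70B dense FP16 weights, ~3.35 TB/s of H100-class HBM bandwidth, KV cache reads ignored):

```python
def decode_step_floor_ms(num_params,
                         bytes_per_param=2.0,        # FP16
                         hbm_bandwidth_gb_s=3350.0):  # H100 SXM HBM3, approx.
    """Lower bound on one decode step at batch size 1: the time just to
    stream every weight from HBM once (ignores compute and KV reads)."""
    bytes_moved = num_params * bytes_per_param
    return bytes_moved / (hbm_bandwidth_gb_s * 1e9) * 1e3

# 70B dense FP16: ~41.8 ms per step, a ceiling near 24 tok/s at batch 1
floor_ms = decode_step_floor_ms(70e9)
```

Batching amortizes that same weight traffic across many concurrent requests, which is why continuous batching is the core throughput lever.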

# Simplified continuous batching loop (conceptual)
class ContinuousBatcher:
    def __init__(self, model, max_batch_size=256):
        self.model = model
        self.waiting_queue = []
        self.running_batch = {}  # request_id -> state
        self.max_batch = max_batch_size

    def step(self):
        # Admit new requests if capacity is available
        while self.waiting_queue and len(self.running_batch) < self.max_batch:
            req = self.waiting_queue.pop(0)
            # Run prefill for this request (chunked if long)
            kv_cache = self.model.prefill(req.input_ids)
            self.running_batch[req.id] = {
                "kv_cache": kv_cache,
                "generated": [],
                "req": req,
            }

        if not self.running_batch:
            return

        # Single decode step for all running requests
        batch_logits = self.model.decode_step(
            [s["kv_cache"] for s in self.running_batch.values()],
            [s["generated"][-1] if s["generated"] else None
             for s in self.running_batch.values()],
        )

        # Sample and check stopping criteria for each request
        finished = []
        for req_id, logits in zip(self.running_batch, batch_logits):
            state = self.running_batch[req_id]
            token = sample(logits, state["req"].params)
            state["generated"].append(token)
            if token == EOS or len(state["generated"]) >= state["req"].params.max_tokens:
                finished.append(req_id)

        # Evict after the loop; never mutate the batch mid-iteration
        for req_id in finished:
            self.finish(req_id)

    def finish(self, req_id):
        # Free the request's KV cache blocks; token delivery elided
        del self.running_batch[req_id]

The actual implementations are far more complex — they handle memory allocation for KV caches, preemption policies, chunked prefill scheduling, and multi-node coordination — but the core loop is: admit new requests, run one decode step for all active requests, evict finished ones.


Model Parallelism: Tensor, Pipeline, and Expert

Frontier models don’t fit on a single GPU. An H100 has 80GB of HBM3. A dense 70B-parameter model in FP16 requires ~140GB just for weights. Flagship models (GPT-5.4, Claude Opus 4.6) are substantially larger. Parallelism strategies distribute the model across multiple GPUs.

Tensor Parallelism (TP) splits individual matrix multiplications across GPUs. A single attention head’s QKV projection gets partitioned column-wise, each GPU computes its portion, and results are combined via all-reduce. TP requires high-bandwidth interconnects (NVLink at 900 GB/s between H100s) because GPUs communicate every layer.

Pipeline Parallelism (PP) assigns different layers to different GPUs. GPU 0 runs layers 0–15, GPU 1 runs layers 16–31, etc. Communication happens only between adjacent stages (one GPU sends activations to the next). PP introduces pipeline bubbles — idle time when some stages wait for input — but uses less interconnect bandwidth than TP.

Expert Parallelism (EP) applies specifically to Mixture-of-Experts (MoE) models. Llama 4’s MoE variant, DeepSeek-V3, and reportedly GPT-5.4’s architecture use MoE. Each GPU holds a subset of experts. A routing network selects which experts to activate per token, and tokens get dispatched to the GPUs holding those experts. EP reduces compute per token (only ~2 of 16 experts fire, for example) but introduces load-balancing challenges when certain experts are popular.

Tensor Parallel (split matrices, all-reduce) · Pipeline Parallel (split layers across GPUs) · Expert Parallel (route tokens to expert GPUs); TP+PP combine for large dense models, TP+EP for MoE models

Three parallelism strategies, often combined. TP needs high-bandwidth NVLink; PP and EP tolerate lower bandwidth.

Typical configurations for hosted inference:

| Model | Architecture | Likely Parallelism | GPUs per Instance |
|---|---|---|---|
| Claude Haiku 4.5 | Dense (smaller) | TP=2 or TP=4 | 2–4 H100s |
| Claude Sonnet 4.6 | Dense | TP=4 to TP=8 | 4–8 H100s |
| Claude Opus 4.6 | Dense (largest) | TP=8, possibly PP=2 | 8–16 H100s |
| GPT-5.4 | Likely MoE | TP + EP | 8–16+ H100s |
| DeepSeek-V3 | MoE (671B total, ~37B active) | TP + EP | 8 H100s |
| Llama 4 70B (dense) | Dense | TP=4 | 4 H100s (FP8) |

These are informed estimates — providers don’t publish exact configurations — but they align with known model sizes and reported inference characteristics.

The parallelism strategy directly affects latency and throughput. More TP means lower latency (each GPU does less work per layer) but worse throughput per GPU dollar. Providers tune these parameters based on the target latency SLA for each model tier.


The KV Cache: Why Memory Is the Real Bottleneck

The KV (key-value) cache is the single most important concept for understanding inference economics. During autoregressive generation, each new token attends to all previous tokens. Without caching, every decode step would recompute the key and value projections for all previous tokens, redundant work whose total grows quadratically with sequence length.

The KV cache stores the key and value projections from every previous token at every layer, so each decode step computes projections only for the new token and attends against cached K/V pairs. This keeps per-step compute proportional to context length but introduces a massive memory problem.

KV cache memory per token:

bytes_per_token = 2 × num_layers × num_kv_heads × head_dim × bytes_per_element

For a model with 80 layers, 8 KV heads (GQA), 128 head dimension, in FP16:

2 × 80 × 8 × 128 × 2 = 327,680 bytes ≈ 320 KB per token

For a 128K context window: 128,000 × 320 KB = ~40 GB of KV cache per request. On an 8-GPU H100 cluster with 640GB total HBM, the KV cache for just 16 concurrent 128K-context requests would consume the entire memory.
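The arithmetic above, as a small helper (the 80-layer / 8-KV-head / 128-dim configuration is the same illustrative one from the text):

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim,
                             bytes_per_element=2):
    # Factor of 2: keys AND values are cached at every layer
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

per_token = kv_cache_bytes_per_token(80, 8, 128)   # 327,680 B ~= 320 KB
per_request_gib = per_token * 128_000 / 2**30      # ~39 GiB at 128K context
```

Plugging in a model's actual layer count and GQA head configuration gives the concurrency ceiling for a given HBM budget.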

This is why context length directly determines how many concurrent requests a cluster can serve, and why long-context requests cost more per token at some providers.

PagedAttention (from vLLM) addresses KV cache memory fragmentation. Rather than pre-allocating contiguous memory for the maximum possible sequence length, PagedAttention allocates KV cache in small pages (blocks) and maps them through a page table, similar to OS virtual memory. This eliminates internal fragmentation and improves memory utilization from ~50-60% to ~95%+.

Request A (blocks 0–3) / Request B (blocks 0–2) → Page Table → GPU HBM (non-contiguous blocks, shared prefix)

PagedAttention maps virtual KV cache blocks to physical GPU memory pages, eliminating fragmentation.

Grouped-Query Attention (GQA) reduces KV cache size by sharing key/value heads across multiple query heads. Instead of 64 KV heads matching 64 query heads (MHA), GQA might use 8 KV heads shared across 64 query heads — an 8x reduction in KV cache size with minimal quality loss. Nearly all models deployed after 2024 use GQA or its extreme variant, Multi-Query Attention (MQA, 1 KV head).

Multi-head Latent Attention (MLA), used by DeepSeek-V3, goes further by projecting keys and values into a low-rank latent space before caching, reducing per-token KV cache to roughly 70 KB — about 4–5x smaller than GQA at comparable model sizes.


Quantization at Inference Time

Quantization reduces the precision of model weights (and sometimes activations and KV cache) from FP16/BF16 (16 bits) to lower bitwidths. The tradeoff is straightforward: less precision means less memory, faster memory bandwidth utilization, and cheaper inference, but potentially degraded output quality.

Common quantization formats for inference:

| Format | Bits | Memory Savings | Quality Impact | Hardware Support |
|---|---|---|---|---|
| FP16/BF16 | 16 | Baseline | None | Universal |
| FP8 (E4M3/E5M2) | 8 | 2x | Minimal for most models | H100, L40S, MI300X, B200 |
| INT8 (W8A8) | 8 | 2x | Minimal with calibration | H100, A100 |
| INT4 (W4A16) | 4 (weights only) | ~3.5x | Noticeable on small models | CUDA via kernels |
| GPTQ / AWQ | 4 | ~3.5x | Moderate, model-dependent | CUDA via kernels |
| GGUF Q4_K_M | ~4.5 effective | ~3x | Good for chat models | CPU + GPU (llama.cpp) |

FP8 is the current production sweet spot. H100s have native FP8 tensor cores that deliver ~2x the FLOPS of FP16. Most major providers run FP8 inference for their flagship models, with careful calibration to preserve quality. The quality difference between FP8 and FP16 inference is generally within noise on standard benchmarks for models above 30B parameters.

Weight-only quantization (W4A16, W8A16) quantizes stored weights but dequantizes to FP16 for computation. This saves memory and bandwidth but doesn’t improve compute throughput. It’s popular for self-hosted deployments where GPU memory is the constraint.
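A toy NumPy sketch of the weight-only pattern: weights stored as per-channel INT8, dequantized at matmul time. Real kernels fuse the dequantize into the GEMM; this shows only the numerics:

```python
import numpy as np

def quantize_per_channel(w):
    """Symmetric INT8 weight-only quantization, one scale per output row."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float16)

def dequant_matmul(q, scale, x):
    # Dequantize to float at compute time (the W8A16 pattern): saves
    # storage and bandwidth, but not compute FLOPs.
    return (q.astype(np.float32) * scale.astype(np.float32)) @ x

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
x = rng.standard_normal((8, 3)).astype(np.float32)
q, s = quantize_per_channel(w)
max_err = np.abs(dequant_matmul(q, s, x) - w @ x).max()  # small vs. FP32
```

Per-channel scales keep the quantization error proportional to each row's magnitude, which is why calibration-free weight-only INT8 holds up well in practice.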

W4A4 quantization is emerging for smaller models — both weights and activations in 4-bit. NVIDIA’s B200/B100 GPUs have native FP4 tensor cores for this. Quality degradation is nontrivial and requires quantization-aware training or very careful post-training quantization.

For self-hosted inference, the practical guidance:

# FP8 inference with vLLM (H100/L40S)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-4-70B \
    --quantization fp8 \
    --tensor-parallel-size 4 \
    --max-model-len 131072

# 4-bit inference with llama.cpp (consumer hardware)
./llama-server \
    -m llama-4-70b-Q4_K_M.gguf \
    -ngl 99 \
    --ctx-size 8192 \
    --host 0.0.0.0 --port 8080

Attention Kernels and FlashAttention

Standard attention computes softmax(QK^T / √d) × V, which naively requires materializing the full N×N attention matrix in GPU memory. For N=128K tokens, that’s a 128K × 128K matrix of floats — 64GB for a single attention head in FP32.

FlashAttention (now at v3 for Hopper GPUs) restructures this computation using tiling and kernel fusion to:

  1. Never materialize the full attention matrix — it computes attention in blocks that fit in GPU SRAM (shared memory / L2)
  2. Fuse the softmax, scaling, and V multiplication into a single kernel — reducing HBM reads/writes
  3. Support variable sequence lengths efficiently via jagged/ragged tensors

Q/K/V Matrices → Tiled Blocks (fit in GPU SRAM) → Fused Kernel (softmax + scale + matmul) → Attention Output (no N×N materialization, single pass)

FlashAttention tiles the computation to fit in fast SRAM, avoiding the N×N memory bottleneck.
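The tiling idea can be reproduced in NumPy with the online-softmax recurrence that FlashAttention builds on. This is a numerics-only sketch; the real win comes from running these block updates inside a fused GPU kernel, which NumPy cannot express:

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """softmax(Q K^T / sqrt(d)) V computed over K/V tiles with a running
    max and denominator, never materializing the full N x N score matrix."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full(n, -np.inf)                    # running row-wise max
    l = np.zeros(n)                            # running softmax denominator
    acc = np.zeros_like(Q, dtype=np.float64)   # running unnormalized output
    for s in range(0, K.shape[0], block):
        Kb, Vb = K[s:s + block], V[s:s + block]
        scores = (Q @ Kb.T) * scale            # one tile: n x block
        m_new = np.maximum(m, scores.max(axis=1))
        corr = np.exp(m - m_new)               # rescale earlier partial sums
        p = np.exp(scores - m_new[:, None])
        l = l * corr + p.sum(axis=1)
        acc = acc * corr[:, None] + p @ Vb
        m = m_new
    return acc / l[:, None]
```

The output is bit-for-bit the same attention result; only the memory traffic pattern changes.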

FlashAttention-3 on H100s exploits the asynchronous TMA (Tensor Memory Accelerator) and warp-group level parallelism specific to the Hopper architecture. Benchmarks show ~1.5–2x speedup over FlashAttention-2 on H100s.

Ring Attention extends FlashAttention across multiple GPUs for extremely long contexts. Each GPU holds a block of the KV cache and attention is computed in a ring pattern — GPU i computes attention against its local KV block, then passes its Q block to GPU i+1. This enables context lengths beyond what fits in single-GPU memory without the full materialization cost.

FlashDecoding optimizes the decode phase specifically. During decode, the query is a single token but keys/values span the entire sequence. FlashDecoding parallelizes across the KV sequence dimension rather than the batch dimension, which is critical for long-context decode performance.

These kernel-level optimizations are invisible at the API layer but directly determine the tokens/second and TTFT characteristics of every provider.


Speculative Decoding

Standard autoregressive decoding generates one token per forward pass through the full model. For a 70B-parameter model, each forward pass costs roughly 140 GFLOPs per generated token (about 2 FLOPs per parameter). Speculative decoding generates multiple tokens per step by using a smaller “draft” model.

The algorithm:

  1. A small draft model (e.g., 1-7B parameters) generates K candidate tokens autoregressively (fast, because the model is small)
  2. The full target model verifies all K tokens in a single forward pass (parallel, like prefill)
  3. Tokens are accepted left-to-right using a rejection sampling scheme that guarantees the output distribution exactly matches the target model
  4. If the first M < K tokens are accepted and token M+1 is rejected, a replacement token is sampled from an adjusted (target-minus-draft) distribution; if all K are accepted, a bonus token is sampled directly from the target model’s verification pass

The key insight: verification is parallel (one forward pass checks all K tokens), while draft generation is sequential but cheap. If the draft model matches the target model’s distribution well (high acceptance rate), speculative decoding provides ~2-3x speedup for decode-bound workloads.

Acceptance rates depend on how well the draft model approximates the target model. For well-matched pairs (e.g., a distilled 1B draft for a 70B target), acceptance rates of 70-85% are typical for natural language, lower for code or reasoning.
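The accept/reject step (step 3 above) is compact enough to sketch. This toy operates on precomputed per-position probability vectors and omits the draft/target forward passes and the all-accepted bonus token:

```python
import numpy as np

def speculative_accept(draft_probs, target_probs, draft_tokens, rng):
    """Rejection-sample K drafted tokens so the accepted output exactly
    follows the target distribution. draft_probs/target_probs are
    per-position probability vectors over the same vocabulary."""
    out = []
    for t, p_d, p_t in zip(draft_tokens, draft_probs, target_probs):
        if rng.random() < min(1.0, p_t[t] / p_d[t]):
            out.append(int(t))                 # accept the draft token
        else:
            # Resample from the adjusted residual distribution, then stop:
            # every later draft token was conditioned on the rejected one.
            residual = np.maximum(p_t - p_d, 0.0)
            out.append(int(rng.choice(len(p_t), p=residual / residual.sum())))
            break
    return out
```

Where the target agrees with the draft (ratio ≥ 1), tokens are always accepted; disagreement lowers the acceptance probability exactly enough to preserve the target distribution.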

Medusa and EAGLE are alternative approaches that attach extra prediction heads to the target model itself, avoiding the need for a separate draft model. EAGLE-2, in particular, achieves ~3-4x speedup on some benchmarks by predicting multiple future tokens from the target model’s hidden states.

Providers probably use speculative decoding for their mid-tier models (Haiku, Flash) where decode speed is the key selling point. For the largest models, the draft model needs to be proportionally capable, and the engineering complexity increases.

Draft Model (1–7B) → K candidate tokens → Target Model (70B+) verifies in parallel → Rejection Sampling → continue from last accepted token

Speculative decoding uses a small draft model to propose tokens verified by the full model in parallel.


Sampling: Temperature, Top-p, and Beyond

After the model produces logits (one score per vocabulary token), the sampling strategy converts these into the next token. This is the last step before a token enters the output stream.

Temperature scales logits before softmax: p(token) = softmax(logits / temperature). Temperature 0 (greedy) always picks the highest-probability token. Temperature 1.0 samples from the model’s natural distribution. Temperature >1 flattens the distribution (more random). Most API defaults are 1.0 with top_p=1.0.

Top-p (nucleus sampling) sorts tokens by probability, then samples from the smallest set whose cumulative probability exceeds p. Top-p=0.95 means: find the top tokens that account for 95% of probability mass, sample from those. This dynamically adjusts the candidate set size based on the model’s confidence.

Top-k is simpler: only consider the top K tokens by probability. Top-k=50 means sample from the 50 most likely tokens. Less adaptive than top-p.

Min-p (supported by llama.cpp, vLLM, and some providers) sets a minimum probability threshold relative to the top token. If the top token has probability 0.9 and min_p=0.1, only tokens with probability ≥ 0.09 are considered. This adapts naturally to the model’s confidence level.

Repetition penalty and frequency/presence penalties modify logits based on tokens that have already appeared in the output. These reduce repetitive text but can degrade quality if set too high.
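The methods above fit in one short function. This uses one possible ordering (temperature, then top-k, then top-p; penalties and min-p omitted):

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Convert raw logits to a token id. temperature=0 means greedy;
    top_k=0 and top_p=1.0 disable those filters."""
    rng = rng or np.random.default_rng()
    if temperature == 0:
        return int(np.argmax(logits))                 # greedy decoding
    z = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    if top_k > 0:                                     # keep K most likely
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p < 1.0:                                   # nucleus filter
        order = np.argsort(probs)[::-1]
        csum = np.cumsum(probs[order])
        cut = int(np.searchsorted(csum, top_p)) + 1   # smallest set >= p
        keep = np.zeros_like(probs)
        keep[order[:cut]] = probs[order[:cut]]
        probs = keep
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```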

The sampling order matters and varies between implementations:

Raw Logits → Temperature Scaling → Top-k Filter → Top-p Filter → Sample → Token

Sampling pipeline — order varies between implementations, producing subtly different outputs.

Some implementations apply penalties before temperature; others after. This produces subtly different output distributions with the same parameters. It’s one reason why self-hosted models with identical weights can produce different outputs than API endpoints.

Practical guidance on sampling parameters:

| Use Case | Temperature | Top-p | Notes |
|---|---|---|---|
| Code generation | 0 – 0.2 | 1.0 | Determinism matters |
| Factual Q&A | 0 – 0.3 | 0.95 | Low creativity |
| Creative writing | 0.7 – 1.0 | 0.95 | Higher variety |
| Brainstorming | 1.0 – 1.3 | 1.0 | Maximally diverse |
| Structured output (JSON) | 0 | 1.0 | Greedy avoids format errors |

Streaming and Token Delivery

Most inference APIs stream tokens via Server-Sent Events (SSE) over HTTP. Each token (or small group of tokens) is sent as it’s generated, rather than waiting for the complete response.

The streaming response contains a sequence of events:

data: {"id":"msg_01X","type":"content_block_delta","delta":{"type":"text_delta","text":"The"}}

data: {"id":"msg_01X","type":"content_block_delta","delta":{"type":"text_delta","text":" inference"}}

data: {"id":"msg_01X","type":"content_block_delta","delta":{"type":"text_delta","text":" stack"}}

Detokenization subtleties. The model generates token IDs, not text. Token ID 1234 might map to " inference" (with a leading space) or "inf" (a partial word). The detokenizer must handle:

  • Token boundaries that don’t align with character boundaries (especially in multibyte UTF-8)
  • Tokens that are prefixes of other tokens — the server sometimes buffers 1-2 tokens to ensure the detokenized text is valid UTF-8
  • Special tokens that should be filtered from the output
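The buffering trick for multibyte characters is small enough to show in full. This sketch takes raw token bytes (whatever the tokenizer's vocab maps each id to) and emits only complete UTF-8:

```python
class StreamDecoder:
    """Buffers raw token bytes until they form valid UTF-8 before
    emitting text: a multibyte character can span token boundaries.
    Sketch only; assumes an append-only stream of tokenizer bytes."""

    def __init__(self):
        self.buf = b""

    def push(self, token_bytes: bytes) -> str:
        self.buf += token_bytes
        try:
            text = self.buf.decode("utf-8")
            self.buf = b""
            return text
        except UnicodeDecodeError as e:
            # Emit the valid prefix; keep the partial character buffered
            text = self.buf[: e.start].decode("utf-8")
            self.buf = self.buf[e.start:]
            return text
```

For example, if "café" is split so one token ends with the first byte of "é", the decoder emits "caf" now and "é" once the second byte arrives.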

Token grouping. Some providers batch multiple tokens into a single SSE event for efficiency, especially at high generation speeds. A server generating at 100 tokens/s might send events every 30-50ms containing 3-5 tokens each rather than one event per token.

Backpressure. If the client reads slowly (or has high network latency), the server’s output buffer fills. Well-implemented inference servers apply backpressure to the decode loop — they won’t generate tokens faster than the client can consume them. This prevents unbounded memory growth but means slow clients get lower throughput.


Prefix Caching and Prompt Caching

Prefix caching stores the KV cache for common prompt prefixes across requests. If 1,000 requests share the same system prompt (which they do for every API call to a given model), the KV cache for that system prompt can be computed once and reused.

Automatic prefix caching (APC), implemented in vLLM and TensorRT-LLM, hashes prompt token sequences and caches their KV states. When a new request arrives with a matching prefix, the prefill phase skips the cached portion.

Provider-level prompt caching is more explicit:

| Provider | Feature | Cache Duration | Discount |
|---|---|---|---|
| Anthropic | Prompt caching (beta) | 5 min TTL, extended on hit | 90% off cached tokens |
| OpenAI | Automatic caching | Session-based | 50% off cached tokens |
| Google (Gemini) | Context caching | Configurable TTL | ~75% off cached tokens |

Request 1 (system + user A): cache miss → compute full prefill; Request 2 (system + user B): cache hit on shared system-prompt KV → prefill only the new tokens

Prefix caching skips redundant prefill for shared prompt prefixes across requests.

The economics are substantial. A 10K-token system prompt + few-shot examples that appear in every request costs ~$0.15 per request at a $15/M input price. With caching, those tokens drop to ~$0.075 per request or less after the initial cache fill. For high-volume applications, this can reduce input token costs by 60–80%.

Implementation detail: Cached KV states must be stored in GPU HBM (or at minimum in CPU RAM with fast copy-back). This creates a tension between cache size and batch capacity — memory used for prefix caches is memory not available for active request KV caches. Providers balance this using LRU eviction and usage-frequency-weighted caching policies.

For self-hosted deployments:

# vLLM automatic prefix caching
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-70B",
    enable_prefix_caching=True,  # Enable APC
    max_model_len=32768,
)

# First request: full prefill
result1 = llm.generate(
    ["<system>You are a helpful assistant...</system>\n<user>What is 2+2?</user>"],
    SamplingParams(temperature=0),
)

# Second request with same prefix: prefill skips cached portion
result2 = llm.generate(
    ["<system>You are a helpful assistant...</system>\n<user>What is 3+3?</user>"],
    SamplingParams(temperature=0),
)

Putting It All Together: Latency Budget Breakdown

A typical API request to a frontier model, broken down by where time is spent:

Scenario: 2,000 input tokens, 500 output tokens, Claude Sonnet 4.6, streaming

| Stage | Latency | Notes |
|---|---|---|
| Network round-trip (client ↔ edge) | 10–50ms | Geography dependent |
| API Gateway (auth, validation) | 5–15ms | |
| Routing + queue wait | 5–500ms | Highly variable under load |
| Tokenization + preprocessing | 1–3ms | |
| Prefill (2K tokens) | 50–150ms | Compute-bound, depends on batch state |
| First decode step | 10–30ms | Produces first output token |
| Time to first token (TTFT) | ~100–750ms | Sum of above |
| Decode (499 remaining tokens) | 5–10s | At ~50–100 tok/s |
| Total request time | ~5–11s | |

The variance is dominated by queue wait time and prefill duration. During off-peak hours, TTFT for a short prompt on Sonnet is ~200ms. During peak, it can exceed 1s.

For long-context requests (100K+ input tokens):

Prefill dominates. 100K tokens at ~100ms per 2K tokens (rough estimate) means 5+ seconds of prefill alone. This is why long-context TTFT is noticeably slower and why providers like Anthropic charge more for >32K context tokens.

For batch API requests:

Queue wait time can be minutes to hours. But per-token cost is 50% less (OpenAI, Anthropic) because providers fill GPU idle capacity with batch work. If latency isn’t a constraint, batch API pricing is the most cost-efficient option.

Decode speed by provider (approximate, March 2026):

| Provider + Model | Decode Speed (tok/s) | TTFT (short prompt) |
|---|---|---|
| GPT-5.4 | 60–90 | 200–600ms |
| GPT-4.1 Nano | 150–250 | 80–200ms |
| Claude Opus 4.6 | 40–60 | 400–1000ms |
| Claude Sonnet 4.6 | 80–120 | 150–400ms |
| Claude Haiku 4.5 | 150–200 | 50–200ms |
| Gemini 3.1 Flash Lite | 180–300 | 80–250ms |
| Gemini 3.1 Pro | 60–100 | 200–500ms |

These are output tokens per second as observed from the client side. Actual GPU-side generation is faster; network and SSE framing add overhead.


Summary

The inference stack is a series of engineering tradeoffs stacked vertically:

  • Routing trades latency variance for throughput via multi-cluster distribution and priority queuing
  • Continuous batching maximizes GPU utilization by interleaving requests at the iteration level rather than waiting for static batches to complete
  • Model parallelism (TP, PP, EP) distributes models that don’t fit on one GPU, with each strategy trading off communication overhead against pipeline efficiency
  • KV cache is the binding constraint on concurrent request capacity — PagedAttention and GQA/MLA are the primary mitigations
  • Quantization (FP8 as the current production default) trades minimal quality loss for ~2x memory and throughput improvement
  • FlashAttention eliminates the N² memory bottleneck in attention, making long-context inference practical
  • Speculative decoding breaks the one-token-per-forward-pass bottleneck using draft models, with ~2–3x speedup at matched quality
  • Sampling parameters control the quality/diversity tradeoff, and implementation ordering varies across providers
  • Prefix caching amortizes repeated prompt computation across requests, reducing costs by 50–90% for shared prefixes

The cost difference between GPT-4.1 Nano and Claude Opus 4.6 (roughly 30–50x per token) maps directly to differences in these layers: smaller model → fewer GPUs → more concurrent requests → faster decode → less KV cache per request → dramatically lower cost per token. The quality difference comes from model capacity. Every pricing tier reflects a specific configuration of this stack.

