Inference Optimization, where milliseconds become millions of dollars.

A trained model is only as valuable as the inference economics that bring it to users. A 70B-parameter LLM that takes a second to start generating, runs at 5 tokens per second, and costs $0.10 per request is unusable for a chatbot at scale; the same model running at 50 ms time-to-first-token, 200 tokens per second, and $0.001 per request is the foundation of a viable product. That gap (100× on cost, more than an order of magnitude on latency and throughput) is inference optimisation. Batching amortises GPU compute across multiple requests; KV caching avoids recomputing attention's key-value pairs for tokens already generated; speculative decoding uses a small draft model to propose tokens that a large model verifies in parallel; FlashAttention redesigns attention to fit the GPU memory hierarchy; serving frameworks (vLLM, TensorRT-LLM, SGLang, TGI, the various 2024–2026 entrants) bundle all of this into deployable systems. This chapter develops the methodology with the depth a working ML practitioner needs: the algorithmic optimisations, the systems-level techniques, the framework choices, and the operational realities that distinguish a productive serving deployment from one that is haemorrhaging money.

Prerequisites & orientation

This chapter assumes the hardware material of Ch 01 (memory bandwidth, the roofline model, GPU architecture), the distributed-training material of Ch 02 (since some inference patterns mirror training-time parallelism), and the model-compression material of Ch 03 (which composes naturally with inference optimisation). Familiarity with the deployment material of Ch 03 of Part XVI (Model Deployment & Serving) is assumed. The chapter is written for ML engineers, platform engineers, and infrastructure engineers who serve production ML — particularly LLMs and other generative models with non-trivial inference cost. Pure-research contexts where inference is run occasionally on small validation sets have less use for this material; teams that serve millions of inferences per day have substantial use.

Three threads run through the chapter. The first is the memory-bandwidth-bound reality of LLM inference: most LLM-inference time is spent moving data, not computing on it, and optimisations target memory traffic rather than FLOPs. The second is the throughput-vs-latency tension: a serving system optimised for high throughput (many tokens per second across all users) often sacrifices per-user latency, and the right balance depends on the deployment's user-facing characteristics. The third is the continuous-batching shift: classical batched inference required all requests in a batch to start and end together; modern continuous-batching systems (vLLM, the various 2024–2026 entrants) support requests entering and leaving freely, which fundamentally changes serving economics. The chapter develops each in turn.

01

Why Inference Optimisation Is Its Own Discipline

Inference optimisation has become one of the highest-ROI engineering activities in modern ML. The reason is straightforward: training is a one-time investment, but inference is a recurring cost that scales with usage. A model serving 100 million users runs through more inference compute in a week than its training run consumed; even modest inference-cost reductions translate to enormous absolute savings. The 2023–2026 era has produced a remarkable amount of methodology in this space, transforming what's economically viable to deploy.

The economics of inference at scale

For frontier-AI deployments, inference costs dominate operating expenses. ChatGPT-class services serve hundreds of millions of inferences per day; at $0.001 per inference (already an aggressive number), that's hundreds of thousands of dollars per day, or tens of millions per year. A 2× inference-cost reduction is therefore worth tens of millions annually — easily justifying substantial engineering investment. The 2023–2026 industry investment in inference optimisation reflects this: vLLM was open-sourced in 2023; SGLang in 2024; the various commercial inference platforms (Together, Anyscale, Fireworks, Replicate) compete on inference economics.

[Figure: the inference-optimisation methodology stack. §3 and §7: batch & schedule (continuous batching, in-flight scheduling, priority & fairness, SLO management, throughput & latency). §4–5: cache & attention (PagedAttention, prefix sharing, FlashAttention 2/3, attention variants, memory engineering). §6: speculate (draft + verifier, Medusa/Eagle, lookahead decoding, self-speculation, parallel verification). §8: frameworks; §9: operational; §10: frontier. Application layer: vLLM, TensorRT-LLM, SGLang, TGI, llama.cpp, Triton.]

What makes LLM inference distinctive

LLM inference has several properties that distinguish it from classical-ML inference. Autoregressive generation: each token depends on all previous tokens, so the model runs once per generated token rather than once per request. Variable-length outputs: a request might generate 10 tokens or 10,000, with no way to know in advance. Memory-bandwidth-bound: the dominant operation is reading the model weights and the KV cache, not computing on them. Heterogeneous request shapes: short prompt + long output is operationally different from long prompt + short output. None of these properties apply to classical CNN inference; all of them shape modern LLM serving.

The compounding gains

Modern inference optimisation stacks multiple techniques. A typical production deployment combines: 4-bit weight quantisation (4× memory savings, Ch 03), continuous batching (5–10× throughput vs naive batching), PagedAttention (eliminates KV-cache fragmentation, enables larger batches), FlashAttention (2–3× attention speedup), speculative decoding (1.5–3× decode-stage speedup), and a high-quality serving framework (correct implementation of all the above). The compound speedup vs naive serving is often 50–100×. Each technique is incremental; the system effect is transformative.

The serving-stack convergence

The 2023–2026 inference-stack landscape has rapidly converged. Two years ago, every serving framework had different APIs, different optimisations, different gaps. Today, vLLM is the dominant open-source choice; TensorRT-LLM is dominant for NVIDIA-specific high-throughput; SGLang is rising for advanced workloads; the various commercial platforms (Together, Anyscale, Fireworks) wrap these primitives. The convergence has substantially reduced the "engineering tax" of inference optimisation; teams that would have built custom stacks two years ago now use vLLM with project-specific tuning.

The downstream view

Operationally, inference optimisation sits between the model artefact (a registered, possibly-compressed model from Ch 03 of Part XVI) and the user-facing service. Upstream: a deployable model file. Inside this chapter's scope: batching, caching, attention algorithms, speculative decoding, scheduling, frameworks. Downstream: a running service with latency and throughput characteristics that meet user-facing requirements at acceptable cost. The remainder of this chapter develops each piece: §2 prefill vs decode, §3 batching, §4 KV caching, §5 FlashAttention, §6 speculative decoding, §7 scheduling, §8 frameworks, §9 operational realities, §10 the frontier.

02

Prefill and Decode: Two Different Workloads

An LLM inference request has two distinct phases with completely different performance characteristics. Prefill processes the entire input prompt in one pass, computing attention over all input tokens; it's compute-bound and produces the first generated token. Decode generates each subsequent token autoregressively; it's memory-bandwidth-bound and dominates total inference time for long generations. Understanding the prefill-vs-decode distinction is the foundational insight for modern inference engineering.

The prefill phase

When a request arrives with a prompt of N tokens, the system runs prefill: a single forward pass through the model that computes attention over all N tokens, producing the first output token plus the KV cache for all input positions. The prefill compute scales as O(N²) for attention plus O(N) for the linear layers. For typical prompts (1K–10K tokens), prefill is compute-bound on the GPU's matrix-multiply units. The prefill is what determines time-to-first-token (TTFT), the latency that users experience as "how long does the assistant take to start answering."
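As a rough back-of-envelope check (a sketch only: the 2-FLOPs-per-parameter-per-token rule and the utilisation figure are assumptions, and real TTFT also includes queueing, tokenisation, and scheduling overhead), prefill time can be estimated from model size and prompt length:

```python
def estimate_prefill_ttft(params_b: float, prompt_tokens: int,
                          gpu_tflops: float, mfu: float = 0.5) -> float:
    """Rough time-to-first-token estimate for the prefill pass.

    params_b      : model size in billions of parameters
    prompt_tokens : prompt length N
    gpu_tflops    : peak matmul throughput of the GPU in TFLOP/s
    mfu           : assumed fraction of peak actually achieved
    """
    flops = 2 * params_b * 1e9 * prompt_tokens        # ~2 FLOPs per parameter per token
    return flops / (gpu_tflops * 1e12 * mfu)          # seconds

# Example: 70B model, 4K-token prompt, ~1000 TFLOP/s GPU at 50% utilisation
print(f"{estimate_prefill_ttft(70, 4096, 1000):.2f} s")   # ~1.15 s
```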

The decode phase

After prefill, the system enters decode: each subsequent output token requires a forward pass that computes attention against the already-cached keys and values. Each decode step generates one token, computes its key/value contribution, and appends to the cache. Each decode step's attention compute scales as O(N) in the current context length N (one query attends to all cached positions), so generating M tokens costs O(M·N) in total; per-token compute grows with the running context length. Crucially, decode is memory-bandwidth-bound: at decode time, the GPU is mostly reading the model weights (per-token) and the KV cache (per-token), with relatively few FLOPs done per byte of memory traffic. The decode rate determines tokens per second — the rate at which the assistant produces output.

The bandwidth-bound implication

Decode being memory-bandwidth-bound has significant operational implications. Bigger batches help: with a larger decode batch, more useful work is done per byte of weight memory traffic, since the same weights serve all batched requests. Faster memory wins: H100's 3 TB/s HBM3 enables substantially better decode throughput than A100's 2 TB/s HBM2e, even at similar FLOPs ratings. Smaller models help: the bandwidth-bound nature of decode means smaller models (which have less to read per token) decode faster than the FLOPs ratio alone would suggest. Lower-precision weights help: 4-bit quantisation gives roughly a 4× decode speedup because the bandwidth required is 4× lower.
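A minimal roofline sketch of why these levers matter: in the bandwidth-bound regime, decode throughput is roughly memory bandwidth divided by the bytes read per step, and batching shares the weight traffic across requests. All numbers below are illustrative assumptions, not measurements.

```python
def decode_tokens_per_sec(params_b: float, bytes_per_weight: float,
                          hbm_tbs: float, batch: int = 1,
                          kv_gb_per_req: float = 0.0) -> float:
    """Bandwidth-bound decode throughput (aggregate tokens/s across the batch)."""
    weight_bytes = params_b * 1e9 * bytes_per_weight   # read once per step, shared by the batch
    kv_bytes = batch * kv_gb_per_req * 1e9              # each request reads its own KV cache
    step_time = (weight_bytes + kv_bytes) / (hbm_tbs * 1e12)
    return batch / step_time

# 70B model on ~3 TB/s HBM: FP16 vs 4-bit weights, batch 1 vs 32
print(decode_tokens_per_sec(70, 2.0, 3.0, batch=1))                      # ~21 tok/s
print(decode_tokens_per_sec(70, 0.5, 3.0, batch=1))                      # ~86 tok/s
print(decode_tokens_per_sec(70, 2.0, 3.0, batch=32, kv_gb_per_req=2.0))  # ~470 tok/s
```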

Disaggregated inference

Because prefill and decode have such different characteristics, the 2024 frontier of LLM inference is disaggregated inference: prefill and decode run on different hardware, optimised separately. Prefill servers (high-FLOP machines optimised for compute throughput) handle the compute-bound prefill phase; decode servers (high-bandwidth machines, possibly with smaller GPUs) handle the bandwidth-bound decode. The KV cache is transferred from prefill to decode via fast network. Disaggregated inference has produced 2–4× total cost reduction for major LLM deployments. The 2025–2026 frontier inference stacks at major AI labs are increasingly disaggregated.

Chunked prefill and continuous batching's prefill-decode mixing

For aggregated (non-disaggregated) systems, modern serving frameworks mix prefill and decode in the same batch. Chunked prefill: a long prompt is broken into smaller chunks; each batch contains some chunks of long prompts plus decode steps from in-progress requests. The mixing improves utilisation: prefill chunks fill the compute headroom that decode steps can't use, while decode steps fill the bandwidth headroom that prefill doesn't need. vLLM, TensorRT-LLM, and SGLang all support chunked prefill; it's a substantial optimisation for typical production workloads.
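In vLLM, for instance, chunked prefill is a configuration switch rather than application code. A hedged sketch (argument names reflect recent vLLM releases and may move between versions; the model id is a placeholder):

```python
from vllm import LLM, SamplingParams

# Sketch: enable chunked prefill so long prompts are split into chunks and
# co-scheduled with decode steps of in-flight requests (argument names assumed).
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",   # placeholder model id
    enable_chunked_prefill=True,
    max_num_batched_tokens=8192,   # per-step token budget shared by prefill chunks and decodes
)
outputs = llm.generate(["Summarise the attached report ..."],
                       SamplingParams(max_tokens=256))
```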

The TTFT vs throughput trade-off

Prefill and decode also embody a fundamental serving trade-off. Time-to-first-token (TTFT) is dominated by prefill latency; it's the latency the user feels before any output appears. Tokens per second after the first is dominated by decode throughput. Optimising for TTFT (e.g., dedicating compute to prefill) hurts throughput; optimising for throughput (e.g., batching prefills, deferring them) hurts TTFT. The right balance depends on the user-facing workload — interactive chat needs low TTFT, batch processing tolerates higher TTFT. The serving framework's scheduler manages this balance (Section 7).

03

Batching: Static, Dynamic, and Continuous

Batching is the foundational throughput optimisation: process multiple requests simultaneously, amortising the GPU's per-step overhead across many requests. The challenge for LLM inference is that requests have variable lengths, arrive at different times, and complete at different times. Three batching strategies — static, dynamic, and continuous — represent successive sophistications in handling this variability.

Static batching: the naive starting point

Static batching: collect a fixed number of requests, run them through the model together, return all results when all are complete. The simplest pattern; the workhorse for non-LLM serving (CV models, etc.). For LLMs, it has a fatal flaw: requests have different output lengths, so the batch is bottlenecked by the longest-output request. A batch of 8 requests where 7 generate 100 tokens but 1 generates 5000 tokens runs at the speed of the longest one — the seven short requests sit idle for most of the runtime, wasting GPU time. Static batching produces 5–20% GPU utilisation for typical LLM workloads.
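The waste in that example is easy to quantify: every slot is held until the longest request finishes, so utilisation is the useful token-steps divided by the occupied slot-steps (a toy calculation).

```python
lengths = [100] * 7 + [5000]              # output lengths in one static batch
useful = sum(lengths)                     # token-steps that produced output
occupied = len(lengths) * max(lengths)    # slot-steps the batch held the GPU for
print(useful / occupied)                  # ~0.14, i.e. roughly 14% utilisation
```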

Dynamic batching: arrival-time optimisation

Dynamic batching: batches form dynamically based on request arrivals. The serving system collects incoming requests for a short window (e.g., 10 ms), batches whatever arrived, runs them. Improves over static batching by adapting to actual traffic patterns. NVIDIA Triton (the general-purpose inference server) uses dynamic batching for non-LLM models. For LLMs, dynamic batching still has the variable-length problem — once a batch starts, it runs until the longest request finishes.

Continuous batching: the LLM-era breakthrough

Continuous batching (also called in-flight batching) was the 2022–2023 breakthrough that fundamentally changed LLM serving economics. The key idea: at every decode step, the batch can change. Requests that finish are removed; new requests are added; the batch composition is dynamically managed. A request that finishes after 100 tokens immediately frees its slot for a newly-arriving request, regardless of what other requests in the batch are doing. The result: GPU utilisation rises from 5–20% (static batching) to 60–90% (continuous batching).
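A deliberately simplified sketch of the continuous-batching loop, with hypothetical `model.prefill`, `model.decode_step`, and request methods standing in for a real engine (production schedulers add admission control, chunked prefill, preemption, and memory accounting):

```python
from collections import deque

def serve(model, request_queue: deque, max_batch: int) -> None:
    """Toy continuous-batching loop: the batch composition may change every step."""
    running = []                                           # in-flight requests
    while request_queue or running:
        # Admit new requests into free slots (a real system also checks KV-memory budget).
        while request_queue and len(running) < max_batch:
            req = request_queue.popleft()
            req.kv_cache = model.prefill(req.prompt)       # first token + KV cache
            running.append(req)

        # One decode step for every in-flight request, fused into one forward pass.
        next_tokens = model.decode_step([r.kv_cache for r in running])
        for req, tok in zip(running, next_tokens):
            req.output.append(tok)

        # Finished requests leave immediately, freeing their slot and KV memory
        # for whatever is waiting in the queue; nobody waits for the longest request.
        for req in [r for r in running if r.is_finished()]:
            req.free_kv()
            running.remove(req)
```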

The vLLM contribution

The 2023 vLLM paper (Kwon et al., SOSP 2023) was the watershed moment for continuous batching at scale. The paper combined continuous batching with PagedAttention (Section 4) to substantially advance both batch-management efficiency and memory-management efficiency. vLLM achieved 24× throughput improvement over the previous-generation Hugging Face Transformers baseline, primarily through these two innovations. The methodology has been adopted by essentially every modern LLM serving framework; vLLM itself has become the dominant open-source LLM serving platform.

Mixing prefill and decode

Continuous batching also enables mixing prefill and decode tokens in the same batch. A batch might contain: 5 ongoing requests doing decode (one token each), 1 new request doing prefill (256 tokens), 2 newly-arrived requests starting prefill (small chunks of their prompts). All these heterogeneous activities run in the same forward pass, which the serving framework orchestrates. The mixing requires careful kernel implementation but produces substantial throughput gains over treating prefill and decode separately.

Batch-size tuning

The maximum batch size is bounded by GPU memory: each request's KV cache occupies memory proportional to its context length. Larger batches mean more concurrent requests but require more KV-cache memory; the trade-off determines the achievable throughput. PagedAttention (Section 4) substantially improves the practical batch size by reducing fragmentation in KV-cache memory. Modern serving frameworks expose batch-size limits as tuning parameters; finding the right value for a specific workload is configuration engineering.

04

KV Caching and PagedAttention

The KV cache stores the keys and values from attention's earlier-position computations so they don't have to be recomputed at every decode step. Without KV caching, every decode step would recompute keys and values for all previous positions, costing O(N²) work per step late in a length-N sequence; with KV caching, each step costs O(N). KV caching is universal in modern LLM serving, but managing the cache efficiently — particularly across many concurrent requests — is its own substantial engineering challenge.

What the KV cache is

For each transformer layer, attention computes K = X · W_K and V = X · W_V from the input X (the activations at each position). At decode time, only the new token's K and V need to be computed; the previous tokens' K and V are read from the cache. The cache structure: a tensor of shape (batch, num_heads, sequence_length, head_dim), separately for keys and values, per layer. For a 70B-parameter model with 80 layers and 8K context, the KV cache occupies from a few gigabytes per request (with grouped-query attention, Section 5) to tens of gigabytes (with full multi-head attention); that is substantial memory, especially with many concurrent requests.
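The per-request footprint follows directly from that shape; a small calculator under stated assumptions (head counts and dtypes vary by model, so the example numbers are illustrative):

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size for one request: two tensors (K and V) per layer,
    each of shape (num_kv_heads, context_len, head_dim)."""
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem

# Llama-2-70B-like: 80 layers, GQA with 8 KV heads, head_dim 128, FP16 cache
per_request = kv_cache_bytes(80, 8, 128, 8192)
print(per_request / 1e9)                  # ~2.7 GB per 8K-context request

# Rough concurrency limit given 40 GB of HBM left over after the weights:
print(int(40e9 // per_request))           # ~14 concurrent 8K-context requests
```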

The fragmentation problem

Naive KV-cache implementations pre-allocate cache memory based on a maximum context length. If most requests use far less than the maximum, this wastes most of the allocated memory. For a 32K-context system serving requests that average 1K tokens, naive allocation wastes 31/32 of the memory — and reduces achievable batch size by 32×. The 2022–2023 inference frameworks were heavily limited by this fragmentation; PagedAttention solved it.

PagedAttention

The 2023 vLLM paper introduced PagedAttention, an OS-style, page-based approach to KV-cache management. Memory is divided into fixed-size pages (typically 16 tokens' worth); each request's KV cache is a sequence of pages; pages are allocated and freed as the request grows and completes. The result is essentially zero fragmentation — memory usage tracks actual cache usage rather than pre-allocated maximum. PagedAttention has become the dominant KV-cache architecture; vLLM popularised it but TensorRT-LLM, SGLang, and other modern frameworks have all adopted variants.
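A toy sketch of the idea (block size and class structure are invented for illustration; vLLM's real block manager also handles reference counting, copy-on-write, and prefix sharing):

```python
class PagedKVAllocator:
    """Toy paged KV-cache allocator: fixed-size blocks, per-request block tables."""

    def __init__(self, total_blocks: int, block_tokens: int = 16):
        self.block_tokens = block_tokens
        self.free_blocks = list(range(total_blocks))   # pool of physical block ids
        self.block_tables = {}                         # request_id -> [physical block ids]
        self.token_counts = {}                         # request_id -> tokens cached so far

    def append_token(self, request_id: str) -> None:
        """Called once per cached token; allocates a new block only on block boundaries."""
        count = self.token_counts.get(request_id, 0)
        if count % self.block_tokens == 0:             # current block is full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV pool exhausted: preempt or reject a request")
            self.block_tables.setdefault(request_id, []).append(self.free_blocks.pop())
        self.token_counts[request_id] = count + 1

    def release(self, request_id: str) -> None:
        """Request finished: its blocks return to the free pool with no fragmentation."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)
```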

Prefix caching and shared prompts

Many real-world LLM workloads have repeated prefixes: a common system prompt across many requests, retrieval-augmented contexts that share retrieved documents, multi-turn conversations where earlier turns are fixed. Prefix caching (also called automatic prefix caching or APC) detects these repeated prefixes and reuses their KV-cache pages across requests. Properly implemented, prefix caching can dramatically reduce prefill cost for prefix-heavy workloads (system-prompt-dominated chatbots can see 50%+ prefill savings). vLLM, SGLang, and recent TensorRT-LLM versions all support prefix caching.
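In vLLM this is again a configuration flag rather than application code; a hedged sketch (flag name as of recent releases, model id is a placeholder):

```python
from vllm import LLM, SamplingParams

# Sketch: automatic prefix caching lets requests that share a system prompt
# reuse its KV-cache blocks instead of re-running prefill over it.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",   # placeholder model id
          enable_prefix_caching=True)

system = "You are a support assistant for ExampleCorp. Answer concisely.\n"
prompts = [system + "User: How do I reset my password?",
           system + "User: Where can I download my invoice?"]
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))
```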

KV-cache compression

Beyond efficient management, the KV cache itself can be compressed. FP8 KV-cache: store keys and values in 8-bit floating-point rather than the model's native precision. Quantised KV-cache (KIVI, the various 2024 entrants): more aggressive quantisation, sometimes down to 4-bit. Attention sinks: keep only a few "sink" tokens at the start plus a sliding window of recent tokens, dropping the middle. Token eviction (H2O, the various streaming-attention methods): identify and drop tokens whose attention has been low. Each technique trades some quality for lower KV-cache memory; the right choice depends on workload sensitivity.

The long-context KV-cache challenge

For very long contexts (1M+ tokens, the 2024–2026 frontier), KV-cache memory is the binding constraint. A 1M-token KV cache for a 70B-parameter model is hundreds of gigabytes; serving multiple concurrent long-context requests requires either substantial memory or aggressive cache management. The 2024–2026 work on long-context serving (the various streaming-attention methods, ring-attention-based serving, the KV-cache compression methods above) is rapidly maturing; whether long-context serving becomes economically routine or remains a specialised tier is open.

05

FlashAttention and Memory-Aware Algorithms

FlashAttention (Dao et al. 2022) was the textbook example of a hardware-aware algorithm redesign. By recognising that attention's bottleneck on GPUs was memory traffic rather than compute, FlashAttention restructured the attention computation to keep intermediate values in fast SRAM rather than spilling to slow HBM. The result was 2–3× speedup with no quality loss. FlashAttention has become standard infrastructure; understanding it is essential for understanding modern attention engineering.

The naive-attention bandwidth problem

Standard attention computes attention scores S = Q · K^T / √d (an N×N matrix), applies softmax, multiplies by V to get the output. The N×N attention matrix can be enormous — for context length 8K, it's 64M entries per attention head per layer — and writing it to HBM and reading it back dominates the runtime. The compute is straightforward (just matrix multiplies), but the memory traffic is what makes naive attention slow.

The FlashAttention insight

FlashAttention recognised that the N×N attention matrix never needs to be materialised: it's an intermediate value that can be computed block-by-block on the fly. The algorithm tiles the attention computation: process Q, K, V in small blocks that fit in SRAM, accumulate the output incrementally. The softmax is computed online (using a running maximum and running sum to maintain numerical stability without seeing all the entries at once). The result is a single fused operation that reads Q, K, V from HBM, computes attention entirely in SRAM, and writes the output to HBM — never materialising the N×N matrix.
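The online softmax is the heart of the trick. The following toy single-query version in plain PyTorch (not a fused kernel, and ignoring the query tiling and SRAM management of the real implementation) shows how the output and the softmax normaliser can be accumulated block by block without ever holding all the scores at once:

```python
import torch

def attention_online_softmax(q, k, v, block=128):
    """Single-query attention computed block-by-block with an online softmax,
    never materialising the full score vector at once."""
    d = q.shape[-1]
    m = torch.tensor(float("-inf"))     # running max of scores seen so far
    l = torch.tensor(0.0)               # running softmax denominator
    acc = torch.zeros_like(q)           # running (unnormalised) output
    for start in range(0, k.shape[0], block):
        k_blk, v_blk = k[start:start + block], v[start:start + block]
        s = (k_blk @ q) / d ** 0.5      # scores for this block only
        m_new = torch.maximum(m, s.max())
        scale = torch.exp(m - m_new)    # rescale previously accumulated results
        p = torch.exp(s - m_new)
        acc = acc * scale + p @ v_blk
        l = l * scale + p.sum()
        m = m_new
    return acc / l

q = torch.randn(64); k = torch.randn(1024, 64); v = torch.randn(1024, 64)
ref = torch.softmax((k @ q) / 64 ** 0.5, dim=0) @ v
assert torch.allclose(attention_online_softmax(q, k, v), ref, atol=1e-4)
```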

FlashAttention 2 and 3

The methodology has continued to evolve. FlashAttention 2 (Dao 2023) improved the parallelism strategy and reduced non-matmul FLOPs, achieving ~2× speedup over the original. FlashAttention 3 (Shah et al. 2024) targets H100 specifically with asynchronous warp specialisation and FP8 support, achieving another ~2× speedup. Each version has been integrated into PyTorch's `scaled_dot_product_attention` and into the major serving frameworks; using FlashAttention is essentially free for users — the framework picks the best implementation for the hardware.
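From user code, the usual entry point is PyTorch's fused attention, which dispatches to a FlashAttention-style backend when the shapes, dtypes, and hardware allow it (exact backend selection depends on the installed PyTorch and GPU; this is a minimal usage sketch):

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) in half precision on a recent GPU typically
# dispatches to a FlashAttention backend; unsupported cases fall back silently.
q = torch.randn(1, 32, 4096, 128, dtype=torch.float16, device="cuda")
k = torch.randn(1, 32, 4096, 128, dtype=torch.float16, device="cuda")
v = torch.randn(1, 32, 4096, 128, dtype=torch.float16, device="cuda")

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
# The 4096×4096 score matrix is never materialised in HBM.
```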

Attention variants and their efficient implementations

FlashAttention has been extended to many attention variants. Multi-Query Attention (MQA, Shazeer 2019) and Grouped-Query Attention (GQA, Ainslie et al. 2023) reduce the number of K/V heads, dramatically shrinking the KV cache; FlashAttention handles these with minor modifications. Sliding-window attention (used by Mistral and others) restricts attention to a local window; specialised FlashAttention kernels exploit the sparsity. ALiBi and RoPE position encodings have FlashAttention-compatible implementations. The general pattern: any reasonable attention variant has a FlashAttention-style efficient implementation in modern serving frameworks.

FlashDecoding for long-context decode

For the decode phase with long contexts, the standard FlashAttention is suboptimal — there's only one query token but a long key sequence. FlashDecoding (Dao et al. 2023) parallelises the decode-step attention across the key sequence dimension, exposing more parallelism for long contexts. FlashDecoding++ further optimises this. Modern serving frameworks switch between FlashAttention (prefill, long queries) and FlashDecoding (decode, single-query) automatically based on the workload shape.

Beyond attention: kernel fusion broadly

FlashAttention is the most-famous example of kernel fusion in ML — combining what would naively be multiple GPU kernels into a single fused operation. The same methodology applies to many other operations: layer normalisation followed by linear projection, residual connections, RMSNorm, the various activation functions. Modern compiled-model paths (torch.compile, JAX with XLA, TensorRT-LLM's compilation) automatically fuse where possible. The 2024–2026 work on kernel fusion has substantially advanced; production inference stacks routinely have hand-tuned fused kernels for the hot paths.

06

Speculative Decoding and Draft-Verifier Methods

Speculative decoding exploits a structural property of LLM decoding: most tokens are easy to predict, and the large model's "judgment" is only needed for the harder ones. A small draft model proposes tokens speculatively; the large verifier model checks them in parallel. When the draft is right (most of the time), many tokens are accepted in a single forward pass of the large model, dramatically increasing decode throughput. The methodology has matured into a standard production technique; serving frameworks integrate it natively.

The basic mechanic

The procedure: the draft model autoregressively generates K tokens cheaply (e.g., 5–10 tokens, each a single forward pass of a small model). The large model takes these K tokens as input and runs a single forward pass that produces probability distributions for all K positions plus the (K+1)-th. The acceptance step: for each draft token in turn, compare the draft's distribution to the verifier's. Accept the token if a sampling-equivalence test passes; reject and resample if not. The crucial property: the procedure is mathematically equivalent to plain autoregressive sampling from the large model — no quality is lost. The speedup comes from amortising K large-model forward passes into one.
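A sketch of one verification round under the standard accept/reject rule (the arrays are assumed to be full vocabulary-sized distributions; the residual-resampling step is what makes the output distribution exactly match the large model's):

```python
import numpy as np

def verify(draft_tokens, draft_probs, target_probs, rng=np.random.default_rng()):
    """One speculative-decoding verification round.

    draft_tokens : K proposed token ids
    draft_probs  : K vocab-sized distributions q_i from the draft model
    target_probs : K+1 vocab-sized distributions p_i from the verifier's single pass
    Returns the accepted tokens plus one extra token: either a resample after the
    first rejection, or the free (K+1)-th token when everything was accepted.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        if rng.random() < min(1.0, p / q):            # accept with probability min(1, p/q)
            accepted.append(tok)
        else:                                         # reject: resample from residual max(0, p - q)
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            residual /= residual.sum()
            accepted.append(rng.choice(len(residual), p=residual))
            return accepted
    accepted.append(rng.choice(len(target_probs[-1]), p=target_probs[-1]))
    return accepted
```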

The acceptance-rate question

The speedup depends on the acceptance rate: what fraction of draft tokens does the verifier accept. With a perfect draft (always agrees with verifier), all K tokens accepted, K× speedup. With a useless draft (random tokens), 0 tokens accepted, no speedup. Real drafts achieve 60–80% acceptance for typical chatbot workloads, producing 1.5–3× speedup. The acceptance rate is a function of the draft model's quality, the workload, and the lookahead length K. Tuning K is a hyperparameter; longer K helps when acceptance rate is high but hurts when it's low.
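Under the simplifying assumption of a constant per-token acceptance probability α, the expected number of tokens emitted per verifier pass has a closed form, which makes the K trade-off easy to eyeball (an illustrative model, not a benchmark):

```python
def expected_tokens_per_verify(alpha: float, k: int) -> float:
    """Expected tokens emitted per large-model pass, given per-token acceptance
    probability alpha and lookahead K (geometric-series result)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.6, 0.7, 0.8):
    print(alpha, [round(expected_tokens_per_verify(alpha, k), 2) for k in (2, 5, 10)])
# 0.6 [1.96, 2.38, 2.49]
# 0.7 [2.19, 2.94, 3.27]
# 0.8 [2.44, 3.69, 4.57]
```

Note how the curve flattens: once K exceeds a handful of tokens, extra lookahead adds little at realistic acceptance rates while increasing wasted draft work.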

Draft model selection

The draft model should be much faster than the verifier (otherwise the speculation overhead exceeds the savings) but accurate enough to achieve high acceptance. Common choices: a small fine-tuned model from the same family (e.g., a 1B-parameter draft for a 70B-parameter verifier) — accurate but adds infrastructure complexity. An n-gram draft (use the input tokens to look up plausible continuations) — fast and free but less accurate. A self-speculation method (the verifier model itself in some compressed form) — see below. A specially-trained tiny draft (Medusa, Eagle) — purpose-built for speculation.

Medusa, Eagle, and self-speculation

The 2023–2024 work on self-speculation avoids the need for a separate draft model. Medusa (Cai et al. 2024) adds extra decoding heads to the verifier model that produce multiple speculative tokens from a single forward pass. Eagle (Li et al. 2024) trains a lightweight predictor on the verifier's hidden states to generate speculative tokens. Lookahead decoding (Fu et al. 2024) generates speculative tokens via a different parallel-decoding approach. All eliminate the separate-draft-model infrastructure overhead; many achieve competitive speedups with simpler deployment.

Tree-based speculation

Beyond linear K-token speculation, tree-based speculation generates multiple speculation branches in parallel. The verifier accepts the longest accepted prefix across all branches, often capturing more total accepted tokens than linear speculation. SpecInfer, Eagle-2 (the 2024 follow-up), and the various tree-based methods have substantially advanced the speedup ceiling. Modern production speculative decoding often uses tree-based methods.

The integration with serving frameworks

Speculative decoding is now standard in production serving frameworks. vLLM supports speculative decoding with multiple draft-model options. TensorRT-LLM has Medusa, Eagle, and related methods built in. SGLang includes specialised speculative-decoding optimisations. The user typically configures the draft model and acceptance parameters; the framework handles the orchestration. Speculative decoding's adoption has been one of the major 2024–2026 throughput improvements at production scale.

07

Request Scheduling and Multi-Tenancy

Beyond per-request optimisations, a serving system handles many concurrent requests with diverse priorities, latency budgets, and resource needs. The scheduler decides which requests run in which batch at which time; getting this right substantially affects both individual-user experience and aggregate cost. This section unpacks the scheduling layer.

The basic scheduler

A serving framework's scheduler runs an event loop: incoming requests enter a queue; at every time step, the scheduler decides which queued requests start (begin prefill), which in-flight requests continue (do another decode step), and which complete and free their resources. The scheduler tracks per-request state: how many tokens generated, how much KV-cache memory used, accumulated time in queue, priority. Modern schedulers (vLLM's, SGLang's, TensorRT-LLM's) make these decisions at every step, integrating dozens of considerations into the allocation.

Priority and fairness

Real serving has heterogeneous request priorities. Premium-tier users get higher priority. Internal/test traffic gets lower priority. Long-running batch jobs get lowest priority. The scheduler implements priority through some combination of: first-class priority queues, weighted round-robin, deficit-round-robin, lottery-scheduling. Fairness guarantees prevent starvation: a low-priority request shouldn't wait forever even if higher-priority requests keep arriving. Modern schedulers expose priority/fairness as configurable parameters; getting the configuration right is operational engineering.
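As a toy illustration of one such policy, the sketch below implements priority with aging so that low-priority requests cannot starve (the policy and constants are invented for illustration; real schedulers combine this with KV-memory admission control):

```python
import time

class AgingPriorityQueue:
    """Toy queue: effective score = static priority minus waiting-time credit, so
    long-waiting low-priority requests eventually outrank fresh premium traffic."""

    def __init__(self, aging_rate: float = 0.1):
        self.aging_rate = aging_rate        # priority levels "earned" per second of waiting
        self._entries = []                  # (enqueue_time, static_priority, request)

    def push(self, request, priority: int) -> None:
        self._entries.append((time.monotonic(), priority, request))

    def pop(self):
        now = time.monotonic()
        best = min(self._entries,
                   key=lambda e: e[1] - (now - e[0]) * self.aging_rate)  # lower = served first
        self._entries.remove(best)
        return best[2]
```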

Preemption and continuation

For LLM serving, preemption has a cost: if a request's KV cache is evicted to free space for higher-priority traffic, the cache must be either swapped to CPU memory (slow but feasible) or recomputed (expensive but predictable). vLLM and other modern frameworks support both; the scheduler decides which strategy to use based on the request's resumability. Preempt-and-recompute is appropriate for short prefixes that recompute quickly; preempt-and-swap is appropriate for very long contexts where recomputation would be prohibitive.

SLO management

Production serving has Service Level Objectives (SLOs) — explicit commitments like "P99 time-to-first-token under 500ms" or "P95 tokens-per-second above 50". The scheduler manages traffic to meet these SLOs: rejecting incoming requests when load is too high, dropping low-priority traffic during overload, scaling out additional capacity. SLO-aware scheduling is a major topic in 2024–2026 serving research; modern stacks increasingly include SLO-driven schedulers (the various 2024–2026 entrants) as opposed to pure FIFO or priority-based scheduling.

Multi-tenancy and quotas

Production serving usually serves many tenants (different customers, applications, user tiers). Quota enforcement: each tenant has limits on requests per second, tokens per minute, etc. Tenant isolation: one tenant's traffic shouldn't degrade another tenant's experience. Modern serving frameworks have explicit multi-tenancy support; configuring tenant quotas is part of standard operational setup.

Cost-aware scheduling

Beyond SLO management, modern serving increasingly considers cost: batching more aggressively when load is high to maximise throughput per dollar, scheduling longer contexts to dedicated machines that have the memory, deferring non-urgent traffic to off-peak hours. Cost-aware scheduling is its own engineering discipline; the 2024–2026 work on inference-cost optimisation has produced several methodological advances (the various continuous-batching-with-cost-models papers, the FinOps-for-inference frameworks). Mature ML platforms increasingly track per-request cost as a first-class metric.

08

Serving Frameworks Compared

The 2023–2026 era has produced a substantial ecosystem of LLM serving frameworks. Each makes different trade-offs between performance, flexibility, ease of deployment, and hardware support. Choosing the right framework is a substantial architectural decision; this section surveys the dominant options.

vLLM: the open-source default

vLLM (UC Berkeley, 2023) became the dominant open-source LLM serving framework essentially overnight. The key innovations — PagedAttention and continuous batching — set the technical bar; the open-source release with permissive license, broad model support, and active community made it the path of least resistance. By 2024, vLLM was the default starting point for any serious LLM serving project; by 2026, it's the substrate of much of the open-source LLM ecosystem. The vLLM project has expanded substantially: distributed inference, speculative decoding, quantisation support, prefix caching, and deep integration with Hugging Face Transformers.

TensorRT-LLM: NVIDIA's high-performance stack

TensorRT-LLM (NVIDIA, 2023) is NVIDIA's optimised LLM serving framework, built on top of TensorRT and CUDA. The key advantage is ultimate performance on NVIDIA hardware: hand-tuned kernels, the latest hardware features (FP8, FP4 on Blackwell), and tight integration with NVIDIA's ecosystem. The trade-off is NVIDIA-specificity and somewhat-heavier deployment than vLLM. Production deployments at the very-high-throughput tier (commercial inference platforms, frontier-AI labs' internal serving) often use TensorRT-LLM for its raw performance edge over vLLM.

SGLang: the structured-output frontier

SGLang (LMSYS, 2024) is an emerging serving framework that has matched or exceeded vLLM's throughput on many workloads while adding sophisticated support for structured output and complex request patterns. Specifically, SGLang's RadixAttention implements aggressive prefix caching with radix-tree-based cache management; its support for grammar-constrained decoding, JSON-schema-constrained generation, and multi-modal inputs has been ahead of vLLM on these features. The 2024–2026 trajectory has SGLang gaining ground on vLLM, particularly for advanced workloads.

TGI (Text Generation Inference)

Text Generation Inference (Hugging Face, 2023) is Hugging Face's official LLM serving framework. TGI provides good performance, deep Hugging Face Hub integration, and a friendly developer experience. The 2024–2026 status is that TGI is competitive with vLLM for typical workloads and is the path of least resistance for teams already deeply invested in the Hugging Face ecosystem; for state-of-the-art performance on demanding workloads, vLLM and TensorRT-LLM are typically preferred.

llama.cpp and the local-LLM ecosystem

llama.cpp (Gerganov, 2023) takes a different approach: lightweight, C++-based, optimised for running LLMs on consumer hardware (CPUs and consumer GPUs). The key innovations are aggressive quantisation (the GGUF format supports 2-bit through 8-bit quantisation), flexible hardware support, and minimal dependencies. For consumer-hardware LLM deployment (laptops, phones, browsers via WebGPU/WASM), llama.cpp is the dominant choice. Ollama, LM Studio, and many other consumer-focused LLM tools wrap llama.cpp.

Triton Inference Server and the multi-framework option

NVIDIA Triton Inference Server (already discussed in Ch 03 of MLOps) provides a more general-purpose serving abstraction — supporting multiple frameworks (PyTorch, TensorFlow, ONNX, TensorRT) and integrating with all of them. For LLM workloads, Triton can host TensorRT-LLM models behind a unified gRPC/HTTP interface; for non-LLM workloads (classical CV, structured-data inference), Triton is the standard choice. The "vLLM for LLMs, Triton for everything else" pattern is common in mixed deployments.

The framework matrix

Practical guidance for framework choice. Just want it to work, open source: vLLM is the default. Maximum NVIDIA performance: TensorRT-LLM. Advanced workloads (constrained generation, multi-modal): SGLang. Hugging Face-centric workflow: TGI. Consumer hardware, edge: llama.cpp. Multi-framework production: Triton with framework-specific backends. Cloud-managed: SageMaker, Vertex AI, the various commercial inference platforms — typically wrap one of the above. The 2024–2026 landscape has consolidated substantially; the choice is increasingly driven by deployment context rather than technical-superiority arguments.

09

Operational Realities of LLM Serving

Beyond the algorithmic and framework layers, LLM serving has its own operational realities — capacity planning for variable workloads, cold-start latency for autoscaling, multi-region deployment, cost engineering. This section covers the operational layer that shapes what production LLM serving actually looks like.

Capacity planning for LLMs

LLM capacity planning is harder than classical-ML capacity planning because the workload is variable in two dimensions: arrival rate of requests, and complexity per request (prompt length × output length). A "1000 QPS" LLM service can be wildly different in compute requirements depending on whether each request generates 50 or 5000 tokens. Mature LLM platforms track tokens-per-second (input + output combined) as the primary capacity metric, not requests-per-second. Capacity planning targets aggregate token throughput at acceptable per-request latency; matching capacity to demand requires forecasting both dimensions.
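A capacity-planning sketch in the tokens-per-second frame (the workload numbers and per-GPU throughput are placeholders chosen to show the arithmetic, not benchmarks):

```python
import math

def gpus_needed(requests_per_sec: float, avg_prompt_tokens: float,
                avg_output_tokens: float, tokens_per_sec_per_gpu: float,
                headroom: float = 0.7) -> int:
    """Size a fleet by aggregate token throughput rather than request rate."""
    demand = requests_per_sec * (avg_prompt_tokens + avg_output_tokens)  # tokens/s to serve
    supply_per_gpu = tokens_per_sec_per_gpu * headroom                   # derate for peaks and SLO slack
    return math.ceil(demand / supply_per_gpu)

# The same "100 QPS" service looks very different depending on request shape:
print(gpus_needed(100, 500, 100, 3000))    # short chat turns  -> 29 GPUs
print(gpus_needed(100, 500, 2000, 3000))   # long generations  -> 120 GPUs
```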

Cold start and warm-pool sizing

LLM serving has substantial cold-start latency: loading a 70B-parameter model into GPU memory takes 30–60 seconds; warming up the kernels takes more. For autoscaling-based deployments, cold starts mean either accepting startup-time degradation or maintaining warm-pool capacity. Warm-pool sizing is a real cost lever: ten minutes' worth of over-provisioned headroom prevents cold-start incidents but incurs a steady idle cost; a smaller headroom saves money but risks user-facing latency spikes during traffic ramps. Production teams have explicit warm-pool policies tuned to traffic patterns.

Multi-region deployment

For latency-sensitive applications, LLM serving often spans multiple geographic regions. Geo-routing: route each user's request to the nearest region. Per-region capacity: capacity must be sized regionally based on regional traffic. Failover: if one region is overloaded or down, traffic shifts to others — but other regions need spare capacity to absorb the spillover. The economics push toward fewer larger regions; the latency requirements push toward more smaller regions; the right balance depends on the application.

Cost-per-token engineering

The unifying cost metric for LLM serving is cost per million tokens (typically broken into input and output prices). For a 70B-parameter model on H100 in 2026, well-tuned serving achieves $0.50–$2.00 per million output tokens. Naive serving can be 5–10× higher. The gap is engineering investment in the optimisations of this chapter. Major LLM-as-a-service providers compete on this metric; the public benchmark leaderboards (artificialanalysis.ai and similar) measure cost-per-token across providers and serve as competitive pressure. Internal teams should benchmark their own serving against these public numbers; substantial gaps indicate engineering opportunities.
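The unit economics reduce to simple arithmetic once sustained throughput is known; a sketch with placeholder assumptions for the hourly rate and the achieved throughput:

```python
def cost_per_million_output_tokens(gpu_hourly_usd: float, num_gpus: int,
                                   output_tokens_per_sec: float) -> float:
    """Serving cost per 1M output tokens at a given sustained decode throughput."""
    tokens_per_hour = output_tokens_per_sec * 3600
    return (gpu_hourly_usd * num_gpus) / tokens_per_hour * 1e6

# 8×H100 node at an assumed $3 per GPU-hour:
print(cost_per_million_output_tokens(3.0, 8, 1000))    # poorly tuned  -> ~$6.67 / M tokens
print(cost_per_million_output_tokens(3.0, 8, 10000))   # well tuned    -> ~$0.67 / M tokens
```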

Quality monitoring

Inference optimisations can subtly change model behaviour: a 4-bit-quantised model produces slightly different outputs than the FP16 original; speculative decoding can have edge cases; KV-cache compression can affect quality. Quality monitoring for production serving (Ch 04 of MLOps) needs to catch these regressions. The discipline is to A/B-test optimisation changes before full rollout, monitor downstream metrics continuously, and have rollback paths ready.

Long-tail latency management

LLM serving has long-tail latencies that classical ML doesn't: a request that happens to need 10K output tokens takes orders of magnitude longer than the median. The P99-to-P50 latency ratio for LLMs is often 100×, versus 5–10× for classical ML. Tail-latency management includes: explicit max-token limits, streaming output (so users see progress even on long requests), per-request timeouts, and the various tail-trimming techniques. Modern serving frameworks support these natively; configuring them appropriately is operational engineering.

10

The Frontier and the Operational Question

LLM inference optimisation is mature in 2026 for the well-trodden patterns, but several frontiers remain active. Disaggregated inference is producing further cost reductions; long-context serving is methodologically distinctive; agent serving has new operational shapes; the various 2025–2026 inference-architecture experiments are reshaping what's possible. This section traces the open frontiers and the directions the field is moving in.

Disaggregated inference at scale

The 2024 introduction of disaggregated inference (Section 2) at major AI labs has substantially advanced production deployment. Frontier deployments now routinely run prefill on H100/B200 clusters and decode on lower-cost hardware (AMD MI300, the various inference-specialised ASICs). The KV-cache transfer between prefill and decode pods uses RDMA over InfiniBand or 200+ Gbps Ethernet; the orchestration is non-trivial but increasingly mature. The 2025–2027 trajectory plausibly has disaggregated inference becoming the default for high-scale deployment; the specialised frameworks (DistServe, Splitwise, the various 2024–2026 entrants) are continuing to mature.

Long-context serving

The 2024–2026 push toward 1M+ token contexts has substantially shifted serving challenges. Linear but enormous memory: KV-cache size grows linearly with context length, so even with PagedAttention, very long contexts produce KV caches in the hundreds of GB. Quadratic compute for prefill: a 1M-token prompt takes minutes, not seconds, to prefill. Sparse attention (sliding window, hierarchical attention, the various 2024–2026 long-context architectures) reduces the asymptotic cost. Caching aggressively: production long-context deployments rely heavily on prefix caching, since most long contexts are document-grounded with substantial reuse. Whether very-long-context serving becomes routinely affordable is open.

Agent serving

LLM-based agents (Part XI) have introduced new serving patterns. Multi-step inference: an agent's response involves multiple LLM calls plus tool calls, with state preserved across calls. Tool-call latency: external tool calls (web fetch, code execution) introduce latencies an order of magnitude larger than typical inference. Streaming with control flow: agents need to stream partial outputs while also making decisions about whether to continue, call tools, or finish. The serving-frameworks support for agent patterns (LangGraph integration, Temporal-style durable execution, the various 2024–2026 agent-runtime entrants) is rapidly maturing.

The MoE serving challenge

MoE models present their own serving challenges. The active parameters per token are small but the total model is large; routing changes per-token compute. Expert parallelism for inference is more complex than for training because token routing varies dynamically. Load balancing across experts: some experts may be much more popular than others, creating utilisation imbalance. The 2024–2026 work on MoE serving (DeepSeek-V3's serving methodology, the various 2024–2026 MoE inference papers) is still consolidating; production MoE serving is doable but less mature than dense-model serving.

Hardware-software co-design and FP4

The 2024 introduction of FP4 native support on Blackwell GPUs (B200) has substantially changed inference economics. FP4-native inference doubles throughput vs FP8 with comparable quality after appropriate calibration. The 2025–2026 push toward FP4-native deployment is reshaping serving stacks; older serving frameworks are being updated to support FP4 efficiently. Whether even-lower-precision (1-bit, ternary, see Ch 03) reaches production deployment is open; if BitNet-class methodologies become production-ready, the inference-cost frontier could drop another order of magnitude.

What this chapter has not covered

Several adjacent topics are out of scope. AI chips and custom silicon at depth — the Groq inference chips, the Cerebras inference platform, the various 2024–2026 inference-specialised ASICs — is the topic of Ch 05. The detailed methodology of distributed serving (multi-GPU inference for very large models) is touched only briefly. Edge and on-device LLM serving has been mentioned in passing; the complete on-device-AI methodology is its own substantial topic. The chapter focused on the methodology of inference optimisation for the dominant cloud-deployed serving pattern; the broader inference landscape develops adjacent topics in subsequent chapters.

Further reading

Foundational papers and references for inference optimisation. The vLLM/PagedAttention paper; the FlashAttention 1/2/3 papers; the speculative-decoding original (Leviathan, Kalman, Matias); the Medusa and Eagle papers; the various 2024–2026 disaggregated-inference papers; and the dominant serving-framework documentation (vLLM, TensorRT-LLM, SGLang, TGI) form the right starting kit.