Distributed Training, where one model lives across thousands of chips.
A modern frontier model has hundreds of billions to trillions of parameters; a single GPU has tens to hundreds of gigabytes of memory. The mismatch is stark: even the largest single chip cannot hold a frontier model, let alone train one efficiently. Distributed training is the methodology for splitting a single training run across many accelerators in a coordinated fashion. Data parallelism replicates the model on each worker and splits the batch — the simplest pattern, the foundation of multi-GPU training. Tensor (model) parallelism splits each layer across workers — needed when individual layers won't fit on a single chip. Pipeline parallelism splits the model by depth across workers — useful for very deep models with limited interconnect bandwidth. ZeRO and FSDP are the dominant memory-optimisation strategies that shard optimiser state, gradients, and parameters across data-parallel workers, dramatically reducing per-worker memory at modest communication cost. Modern frontier training combines all of these — 3D parallelism with ZeRO/FSDP — across tens of thousands of accelerators in a single coordinated run. This chapter develops the methodology with the depth a working ML practitioner needs: the parallelism strategies, the memory-management techniques, the communication patterns, and the operational realities that distinguish a productive training cluster from one that wastes most of its compute.
Prerequisites & orientation
This chapter assumes the hardware material of Ch 01 (GPUs, memory hierarchy, interconnect, the roofline model) and the deep-learning material of Part VI. Familiarity with PyTorch or JAX is assumed; familiarity with collective-communication primitives (all-reduce, all-gather, reduce-scatter) helps for §3 and §6 but we cover the essentials. The chapter is written for ML engineers and ML researchers who train models at scale — anywhere from a single 8-GPU server up to tens of thousands of accelerators. Pure-application contexts where models are consumed via APIs have less use for this material; teams that train their own models above ~1B parameters have substantial use.
Three threads run through the chapter. The first is the memory-vs-communication trade-off: every parallelism strategy is a different way to spend memory or communication bandwidth, and the right strategy depends on which is more constrained for the workload. The second is the effect of scale: parallelism strategies that work well at 8 GPUs may collapse at 8,000, and the methodology of frontier training is fundamentally different from small-cluster training. The third is the framework-and-stack ecosystem: PyTorch FSDP, DeepSpeed, Megatron-LM, JAX with sharding, the various 2024–2026 entrants. Different frameworks make different parallelism strategies easier or harder, and the choice substantially shapes what training looks like operationally. The chapter develops each in turn.
Why Distributed Training Is Necessary
Three pressures force distributed training: model size exceeds single-chip memory, training time on a single chip would be prohibitive, and total compute scales with model and data size in well-established ways. The combination is what has driven the field from single-GPU training (the AlexNet era) through multi-GPU servers (the GPT-2 era) to tens-of-thousands-of-GPU clusters (the current frontier). This section orients the reader to the why before the chapter develops the how.
The memory wall
A frontier-2026 model has 1–2 trillion parameters in BF16, occupying 2–4 TB of memory just for the weights. The optimiser state (Adam's moments) typically requires another 8–16 TB. Activations during the forward pass and gradients during the backward pass add several more TB depending on batch size and sequence length. Total training-time memory for a frontier model is comfortably in the 30–60 TB range. A B200 GPU has 192 GB. The arithmetic is unavoidable: hundreds to thousands of GPUs are required just to fit the training state, before any consideration of speed.
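A back-of-envelope sketch of this accounting (illustrative Python; the 16-bytes-per-parameter optimiser figure assumes an FP32 master copy plus Adam's two moments, one common mixed-precision accounting, and activations are excluded):

```python
# Back-of-envelope training-state estimate for a dense transformer.
# Assumptions: BF16 weights and gradients, FP32 optimiser state
# (master weights + two Adam moments = 16 bytes/param).
def training_state_bytes(n_params: float) -> dict:
    weights = 2 * n_params      # BF16 parameters
    grads = 2 * n_params        # BF16 gradients
    optim = 16 * n_params       # FP32 master copy + Adam moments
    return {"weights": weights, "grads": grads, "optim": optim,
            "total": weights + grads + optim}

TB = 1e12
est = training_state_bytes(1.5e12)               # a 1.5T-parameter model
print({k: round(v / TB, 1) for k, v in est.items()})
# -> ~3 TB weights, ~3 TB gradients, ~24 TB optimiser state: ~30 TB
#    of training state before activations, against 192 GB per GPU.
```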
The time wall
Even if memory were not a constraint, training time on a single chip would be prohibitive. A frontier model trained on multi-trillion-token corpora requires ~10²³–10²⁵ FLOPs of compute. A single H100 sustains on the order of 10¹⁵ FLOPs/sec. The arithmetic gives single-chip training times of years to centuries, not weeks. Distributed training reduces wall-clock time by spreading compute across many chips simultaneously. The 2024–2026 frontier training runs use 16,000–100,000+ GPUs to compress what would otherwise be infeasible into months.
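The same arithmetic as a short check, using the rough order-of-magnitude throughput quoted above:

```python
# Single-chip wall-clock sketch for the compute budgets quoted above,
# assuming ~1e15 FLOP/s sustained per chip.
SECONDS_PER_YEAR = 3.156e7
for total_flops in (1e23, 1e24, 1e25):
    years = total_flops / 1e15 / SECONDS_PER_YEAR
    print(f"{total_flops:.0e} FLOPs -> ~{years:,.0f} year(s) on one chip")
# 1e23 -> ~3 years; 1e24 -> ~32 years; 1e25 -> ~317 years.
```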
The scaling-laws context
The motivation for ever-larger training is the empirical scaling laws (Kaplan et al. 2020, Hoffmann et al. 2022 "Chinchilla"): test loss decreases as a power law in model size, training data, and compute, jointly. Doubling compute, with model and data scaled correspondingly, produces predictable improvements in capability. This empirical relationship has driven the industry's compute-investment curve; distributed training is the enabling infrastructure. Without distributed training, scaling laws would be theory; with it, they are operational.
The economics
Distributed training is expensive. A frontier-model training run in 2024–2026 costs roughly $50M–$500M in GPU-hours alone, before counting datacentre infrastructure, networking, electricity, engineering payroll, and data acquisition. The cost is justified by capability — the trained model is the firm's competitive asset — but distributed-training efficiency directly translates to capital efficiency. A training run at 30% MFU vs 50% MFU is a 67% cost increase for the same training. The operational engineering of getting MFU as high as possible is therefore one of the highest-ROI activities in the entire ML pipeline.
The downstream view
Operationally, distributed training sits between hardware (Ch 01) and the broader ML pipeline. Upstream: GPU clusters with appropriate interconnect, training data prepared by data pipelines. Inside this chapter's scope: the parallelism strategies, the memory-management techniques, the framework integrations, the operational pipeline. Downstream: a trained model that goes through evaluation (Ch 09 of the AI Safety part), deployment (Ch 03 of MLOps), monitoring, and the rest of the production lifecycle. The remainder of this chapter develops each piece: §2 data parallelism, §3 tensor parallelism, §4 pipeline parallelism, §5 ZeRO/FSDP, §6 3D parallelism, §7 sequence parallelism, §8 frameworks, §9 operational realities, §10 the frontier.
Data Parallelism: The Foundational Pattern
Data parallelism is the simplest, most-widely-used distributed training pattern. The model is replicated on each worker; the batch is split across workers; each worker computes a forward and backward pass on its shard; gradients are averaged across workers via all-reduce; each worker applies the averaged gradient identically. Conceptually elegant, operationally well-understood, and the foundation that everything else builds on.
The DDP mechanics
The dominant implementation is PyTorch's DistributedDataParallel (DDP). The mechanics: at each step, every worker computes its local batch's gradient; an all-reduce sums gradients across workers and divides by the worker count; all workers apply the same averaged gradient and stay in sync. Mathematical equivalence with single-worker training using the larger global batch is the defining property — DDP doesn't change what the model converges to (modulo numerical noise from the all-reduce); it just makes the training step run on more hardware.
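A minimal sketch of the pattern, assuming a torchrun launch with one process per GPU; the model, data, and hyperparameters are stand-ins, not a recommended recipe:

```python
# Minimal DDP sketch. Launch with: torchrun --nproc_per_node=8 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")            # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(4096, 4096).cuda()         # stand-in for a real model
model = DDP(model, device_ids=[local_rank],
            bucket_cap_mb=25)                      # gradient-bucket size (§ below)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(10):
    x = torch.randn(32, 4096, device="cuda")       # stand-in for a DataLoader
    loss = model(x).square().mean()                # with a DistributedSampler
    loss.backward()                                # all-reduce overlaps with backward
    opt.step(); opt.zero_grad()                    # every rank applies the same update

dist.destroy_process_group()
```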
The all-reduce primitive
The communication operation at the heart of DDP is all-reduce: every worker contributes a tensor; every worker receives the sum (or average) of all contributions. The classical implementation is ring all-reduce (Patarasuk & Yuan 2009; popularised for deep learning by Baidu and by Horovod, Sergeev & Del Balso): each worker participates in 2(N−1) communication rounds, sending and receiving 1/N of the tensor each round. It is bandwidth-optimal — total data sent per worker is 2(N−1)/N times the tensor size, asymptotic to 2× regardless of N. Modern NCCL implementations use ring all-reduce on small clusters and tree-based variants on larger clusters; for very-large clusters, hierarchical all-reduces respect the topology (Section 7 of Ch 01).
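A single-process simulation of the algorithm's two phases (reduce-scatter, then all-gather) illustrates why each worker sends only 2(N−1)/N of the tensor; the function name and layout are illustrative:

```python
# Simulate ring all-reduce across n "workers" in one process.
import numpy as np

def ring_all_reduce(tensors):
    n = len(tensors)
    chunks = [list(np.array_split(np.asarray(t, dtype=float), n))
              for t in tensors]
    # Phase 1, reduce-scatter: after n-1 rounds, worker i holds the
    # fully summed chunk (i + 1) % n. Sends are snapshotted first to
    # model simultaneous exchange.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n].copy())
                 for i in range(n)]
        for i, c, payload in sends:
            chunks[(i + 1) % n][c] += payload      # neighbour accumulates
    # Phase 2, all-gather: n-1 rounds of passing finished chunks around.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy())
                 for i in range(n)]
        for i, c, payload in sends:
            chunks[(i + 1) % n][c] = payload       # neighbour overwrites
    return [np.concatenate(c) for c in chunks]

grads = [np.full(8, r, dtype=float) for r in range(4)]     # 4 workers
out = ring_all_reduce(grads)
assert all(np.allclose(o, 0 + 1 + 2 + 3) for o in out)     # every worker has the
# sum; dividing by the worker count gives the averaged gradient.
```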
Gradient bucketing and overlap
Naive DDP would all-reduce the full gradient at the end of the backward pass — but most of the time would be communication, with computation idle. Modern DDP uses gradient bucketing: gradients are grouped into chunks (typically 25–250 MB each), and each bucket is all-reduced as soon as it's ready, in parallel with continued backward computation. This communication-computation overlap hides much of the all-reduce latency and is what makes DDP efficient at scale. PyTorch DDP has bucketing built in; the bucket size is a tuning parameter.
The large-batch problem
Data parallelism naturally pushes toward large global batch sizes — N workers each with batch B gives global batch NB. For very large N, the global batch can become very large, and large-batch training has its own statistical issues. Linear scaling (Goyal et al. 2017, "Accurate, Large Minibatch SGD") established that learning rate should be scaled linearly with batch size and warmed up gradually; this lets DDP scale to thousands of workers without quality loss. LARS and LAMB are layer-wise-adaptive-rate optimisers designed specifically for very-large-batch training. The empirical limit on batch size depends on the workload but is typically 16K–256K sequences for LLMs; beyond the critical batch size, each additional example per step contributes diminishing optimisation progress, and training can become unstable without layer-wise adaptive methods.
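A minimal sketch of the linear-scaling rule with warmup (the function and its defaults are illustrative, not taken from the paper's codebase):

```python
# Linear LR scaling with warmup (after Goyal et al. 2017).
# base_lr is the rate tuned at base_batch; the peak rate scales
# linearly with the global batch and is reached via a linear ramp.
def scaled_lr(step: int, base_lr: float, base_batch: int,
              global_batch: int, warmup_steps: int = 500) -> float:
    peak = base_lr * global_batch / base_batch     # linear scaling rule
    if step < warmup_steps:                        # gradual warmup from ~0
        return peak * (step + 1) / warmup_steps
    return peak

# e.g. base_lr=0.1 tuned at batch 256, run at a 8,192 global batch:
print(scaled_lr(1_000, 0.1, 256, 8_192))           # -> 3.2 after warmup
```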
The memory limitation
DDP's central limitation is that each worker holds a full copy of the model: weights, gradients, optimiser state. For models that don't fit on a single GPU, pure DDP isn't enough. ZeRO and FSDP (Section 5) extend DDP by sharding these states across workers; tensor and pipeline parallelism (Sections 3–4) split the model itself across workers. For models small enough to fit per-worker, DDP is the simplest and often-best choice; for models that don't fit, it must be combined with model-splitting strategies.
When DDP is the right choice
Data parallelism alone is appropriate for: models that fit comfortably on a single GPU (anything below ~10B parameters in BF16 with reasonable batch sizes), training-data-bound regimes where adding more compute on a smaller model is the right way to use a budget, fine-tuning workflows where the model fits and compute is the bottleneck. For larger models, DDP is one ingredient of a 3D-parallelism stack rather than the whole solution. Most modern PyTorch training scripts use FSDP (Section 5) rather than raw DDP because FSDP's memory benefits compound naturally; raw DDP is mostly used for smaller models where memory isn't a concern.
Tensor (Model) Parallelism
When a single layer is too large to fit on one GPU — for very wide transformer layers (hidden dimension >10,000), embedding tables of multi-million-vocabulary models, or attention layers with very long sequences — the layer itself must be split across workers. Tensor parallelism (also called model parallelism in some sources) does this by partitioning the matrix multiplications within a layer across multiple GPUs. Megatron-LM (NVIDIA, 2019) introduced the dominant tensor-parallel pattern for transformers; it has become standard infrastructure for frontier-scale training.
Column and row partitioning
The transformer's main building block is a sequence of matrix multiplications: an MLP layer is Y = GeLU(X · A) · B, where A and B are weight matrices. Tensor parallelism splits A and B across N GPUs. Column-parallel partitioning: split A by columns, so each GPU has a slice [A_i] and computes Y_i = GeLU(X · A_i). No communication needed yet — each GPU has its slice. Row-parallel partitioning: split B by rows. Each GPU computes its share Z_i = Y_i · B_i, and an all-reduce sums them: Z = Σ Z_i. The pattern: column-parallel into the activation, row-parallel out of it, with one all-reduce per layer. Megatron's contribution was choosing the partitioning pattern that minimises communication while preserving mathematical equivalence with single-GPU training.
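The equivalence is easy to verify numerically. A NumPy sketch (shapes and the tensor-parallel degree are illustrative); note that the column-parallel shards pass through the elementwise GeLU without any communication, which is exactly why the column-then-row ordering is chosen:

```python
# Numerical check of the Megatron MLP partitioning: column-parallel A,
# row-parallel B, one all-reduce (here a plain sum) per layer.
import numpy as np

def gelu(x):  # tanh approximation of GeLU, applied elementwise
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 64))        # [batch, hidden]
A = rng.normal(size=(64, 256))      # up-projection
B = rng.normal(size=(256, 64))      # down-projection

Z_ref = gelu(X @ A) @ B             # single-device reference

n = 4                               # tensor-parallel degree
A_cols = np.split(A, n, axis=1)     # column-parallel shards of A
B_rows = np.split(B, n, axis=0)     # row-parallel shards of B
partials = [gelu(X @ A_cols[i]) @ B_rows[i] for i in range(n)]  # no comms yet
Z_tp = sum(partials)                # the per-layer all-reduce

assert np.allclose(Z_ref, Z_tp)     # mathematically identical
```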
Attention layer tensor parallelism
The attention layer has a similar structure. The query/key/value projections are column-parallel; the output projection is row-parallel; one all-reduce per layer suffices. Multi-head attention partitions naturally — each tensor-parallel rank handles a subset of attention heads. The Megatron-LM paper details the exact partitioning; modern frameworks (Megatron-LM itself, NeMo, the various 2024–2026 entrants) implement it as a standard configuration.
Communication-bandwidth requirements
Tensor parallelism requires substantial bandwidth: every layer does an all-reduce of the activation tensor, which can be hundreds of MB to GB depending on batch size and sequence length. The all-reduce must complete before the next layer can start, so it's on the critical path — there's no overlap with computation as in DDP. The implication: tensor parallelism is bandwidth-bound and only works well within a single high-bandwidth domain. Practical deployments use tensor parallelism only across GPUs connected by NVLink (within a single 8-GPU server, or within an NVL72 rack), never across InfiniBand to a different server.
Sequence parallelism
An important extension: sequence parallelism partitions the activations along the sequence dimension during the layer-norm and dropout operations (which are not parallelised by tensor parallelism alone). This reduces memory by another factor proportional to the tensor-parallel degree, with modest additional communication. Modern training stacks (Megatron-LM, ColossalAI) include sequence parallelism by default; it's often referred to as a separate parallelism dimension but is most-naturally combined with tensor parallelism.
The TP-degree choice
The tensor-parallel degree — how many GPUs share each layer — is a substantial tuning parameter. TP=8 is common because it fits within an 8-GPU NVLink domain. TP=16 or TP=32 are used in NVL72 systems where higher-bandwidth domains exist. Higher TP degrees reduce per-GPU memory but increase communication overhead. The choice interacts with other parallelism dimensions (pipeline depth, data parallel size); finding the right combination is non-trivial and usually involves profiling.
When to use tensor parallelism
Tensor parallelism is appropriate when individual layers don't fit on a single GPU — typically for models with hidden dimensions ≥10K or vocabulary sizes ≥250K. Smaller models can use TP=1 (no tensor parallelism) and rely on data parallelism plus FSDP. Large models invariably use TP≥4. The methodology is well-understood; the tuning is the engineering work.
Pipeline Parallelism
For very deep models, an alternative to tensor parallelism is pipeline parallelism: split the model by layer (depth), with each worker holding a contiguous range of layers. Activations flow forward through the pipeline; gradients flow backward through it. The communication cost is much lower than tensor parallelism (only activations between adjacent stages need to cross worker boundaries), but the introduction of pipeline bubbles — periods when workers are idle waiting for upstream stages — is the central efficiency challenge.
The basic GPipe pattern
The original pipeline-parallelism paper (GPipe, Huang et al. 2019) introduced the basic pattern. Split the model into stages, one per GPU. The full mini-batch is divided into smaller micro-batches; each micro-batch flows through the pipeline. While stage 1 processes micro-batch 2, stage 2 is still processing micro-batch 1; this creates the parallelism. The bubble appears at the start (stages 2..P are idle until stage 1 finishes the first micro-batch) and at the end (stage 1 finishes early). The bubble fraction is (P−1) / (M+P−1), where P is the pipeline depth and M is the number of micro-batches; minimising the bubble means choosing many micro-batches per step.
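The formula as a three-line calculator, with a few illustrative values:

```python
# Pipeline bubble fraction (P - 1) / (M + P - 1) for pipeline depth P
# and M micro-batches per step.
def bubble_fraction(P: int, M: int) -> float:
    return (P - 1) / (M + P - 1)

for M in (1, 8, 32, 128):
    print(M, round(bubble_fraction(P=16, M=M), 3))
# 1 -> 0.938 (nearly serial), 8 -> 0.652, 32 -> 0.319, 128 -> 0.105
```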
The 1F1B schedule
An improvement: 1F1B (one-forward-one-backward) scheduling, introduced by PipeDream (Narayanan et al. 2019) and adopted by Megatron-LM. Instead of doing all forward passes for all micro-batches before any backward passes, alternate forward and backward passes to keep all stages busier and reduce activation memory. The 1F1B schedule is now standard in production pipeline-parallel implementations; the bubble is the same fraction but activation memory is substantially lower.
Interleaved 1F1B
A further improvement: interleaved 1F1B (the "virtual pipeline" approach). Instead of each stage holding consecutive layers, each stage holds non-consecutive layer chunks (stage 1 holds layers 1–4 and 13–16, stage 2 holds layers 5–8 and 17–20, etc.). The interleaving reduces the bubble by enabling more pipeline stages to be active simultaneously. Megatron-LM and DeepSpeed both implement this; it adds complexity but materially reduces bubble overhead at large pipeline depths.
Pipeline parallelism's communication advantage
The defining advantage of pipeline parallelism is its low communication: only the activation tensor between adjacent stages needs to cross worker boundaries, and the activations are typically small. This makes pipeline parallelism feasible across InfiniBand connections — unlike tensor parallelism, you can have pipeline stages on different servers without crippling throughput. Most large-scale training uses pipeline parallelism to span servers and tensor parallelism within servers.
The memory argument for pipeline parallelism
Pipeline parallelism's memory advantage is also distinctive: each stage holds only its layers' weights and activations, not the full model. This is similar to ZeRO's memory benefits but achieved through model splitting rather than state sharding. Pipeline parallelism is particularly attractive for models where layers are small individually but the model is very deep — deep transformers with hundreds of layers, for example. Modern frontier models combine pipeline parallelism with tensor parallelism: tensor parallelism within the high-bandwidth domain, pipeline parallelism across servers.
The micro-batch tuning problem
The central tuning parameter is the number of micro-batches per step. More micro-batches reduce the bubble fraction but increase activation memory and per-micro-batch overhead. The sweet spot depends on the model, hardware, and other parallelism dimensions; finding it usually involves profiling. Mature frameworks (Megatron-LM, DeepSpeed) make this configurable; ad-hoc pipeline implementations often hard-code suboptimal choices.
ZeRO and FSDP: Sharded Data Parallelism
Pure data parallelism wastes memory: every worker holds an identical copy of the model, optimiser state, and gradients. ZeRO (Zero Redundancy Optimizer, Microsoft 2019) and FSDP (Fully Sharded Data Parallel, Facebook/PyTorch 2021) eliminate this redundancy by sharding the various pieces of training state across data-parallel workers, dramatically reducing per-worker memory at modest communication cost. They have become the dominant memory-optimisation strategy in modern training.
The ZeRO three stages
Microsoft's ZeRO paper (Rajbhandari et al. 2020) introduced three sharding stages of increasing aggressiveness. ZeRO-1: shard the optimiser state across workers (Adam's first and second moments — typically 8 bytes per parameter at FP32). Each worker holds only its slice of the optimiser state; before the optimiser step, workers all-gather to reconstruct local parameter copies; after the step, only the local slice is kept. Memory saving: reduces optimiser-state memory by N (the worker count). Communication: minimal additional cost beyond DDP. ZeRO-2: also shard gradients. After the backward pass, instead of all-reduce, do a reduce-scatter so each worker holds the averaged gradient for its parameter shard. Memory saving: further reduces gradient memory by N. ZeRO-3: also shard parameters themselves. Each worker holds only its slice of the model weights; before each forward pass, the necessary weights are all-gathered just-in-time; after the pass, the gathered weights are released. Memory saving: model weights also reduced by N. Communication: substantial — every forward and backward pass requires all-gathers for weights.
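A per-parameter memory calculator following the ZeRO paper's mixed-precision accounting (2 bytes of weights, 2 of gradients, K = 12 of optimiser state, sharded across N data-parallel workers):

```python
# Per-worker bytes per parameter under the ZeRO stages.
def bytes_per_param(stage: int, N: int, K: float = 12.0) -> float:
    if stage == 0: return 2 + 2 + K          # plain DDP: full replication
    if stage == 1: return 2 + 2 + K / N      # shard optimiser state
    if stage == 2: return 2 + (2 + K) / N    # ...and gradients
    if stage == 3: return (2 + 2 + K) / N    # ...and parameters
    raise ValueError(stage)

N = 64                                       # data-parallel workers
for s in range(4):
    print(f"ZeRO-{s}: {bytes_per_param(s, N):.2f} B/param")
# ZeRO-0: 16.00, ZeRO-1: 4.19, ZeRO-2: 2.22, ZeRO-3: 0.25
```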
FSDP: PyTorch's ZeRO-3
Facebook's Fully Sharded Data Parallel (FSDP) is essentially ZeRO-3 implemented natively in PyTorch. The semantics are the same: parameters, gradients, and optimiser state are sharded across data-parallel workers; weights are gathered just-in-time for each forward and backward pass; gradients are reduce-scattered rather than all-reduced. FSDP is now the recommended PyTorch distributed-training primitive (replacing DDP for large models) and is the substrate of most major PyTorch-based training systems in 2026. FSDP2 (the rewrite shipped in PyTorch 2.4) substantially improved performance and ergonomics over the original.
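A minimal sketch using the long-stable FullyShardedDataParallel API (the model and wrapping threshold are illustrative; FSDP2's fully_shard entry point has different spelling but the same semantics):

```python
# Minimal FSDP sketch. Run under torchrun, one process per GPU.
import functools
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.Sequential(
    *[torch.nn.Linear(4096, 4096) for _ in range(8)]).cuda()  # stand-in model

# Parameters are sharded across ranks; each submodule above the size
# threshold becomes its own FSDP unit, all-gathered just-in-time per pass.
wrap_policy = functools.partial(size_based_auto_wrap_policy,
                                min_num_params=1_000_000)
model = FSDP(model, auto_wrap_policy=wrap_policy, use_orig_params=True)

opt = torch.optim.AdamW(model.parameters(), lr=1e-4)  # state is sharded too
```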
Activation checkpointing
Beyond parameter and optimiser sharding, activation checkpointing (also called gradient checkpointing) reduces memory at the cost of extra compute. Instead of storing all activations during the forward pass for use in backward, only store a subset; recompute the missing activations during backward. This roughly doubles forward-pass FLOPs for the recomputed layers but can reduce activation memory by 2–10×. Modern training combines FSDP/ZeRO-3 with activation checkpointing for maximum memory efficiency.
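In PyTorch the primitive is torch.utils.checkpoint; a sketch (block sizes illustrative):

```python
# Activation checkpointing: the block's intermediate activations are
# discarded after forward and recomputed during backward.
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(4096, 16384), torch.nn.GELU(),
    torch.nn.Linear(16384, 4096))

x = torch.randn(8, 4096, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)  # ~2x forward FLOPs for this
y.sum().backward()                             # block, far less activation memory
```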
CPU and NVMe offload
For models too large even with ZeRO-3 + activation checkpointing, ZeRO-Infinity (Microsoft, 2021) extends offloading to CPU memory and NVMe storage. The optimiser state can live on CPU memory (much larger than GPU memory but slower); some parameters can live on NVMe (very slow, but enormous capacity). The trade-off is throughput: each level of offload reduces effective MFU substantially. Offload is appropriate for fine-tuning very large models on small clusters where throughput is less critical than feasibility.
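A sketch of what offload looks like in DeepSpeed's configuration dictionary (the keys follow the documented ZeRO-3 schema; paths and sizes are illustrative, and the full set of options is in the DeepSpeed docs):

```python
# Illustrative DeepSpeed ZeRO-3 config with CPU optimiser offload and
# NVMe parameter offload (ZeRO-Infinity style).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
}
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```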
The communication trade-off
ZeRO-3 / FSDP introduce substantial additional communication: parameter all-gathers and gradient reduce-scatters at every layer. The performance impact depends on the network: on NVLink-connected GPUs, the overhead is modest (tens of percent); on InfiniBand-connected GPUs across servers, it can be substantial (50% or more). Mitigations include hierarchical sharding (shard within a server but not across servers), prefetching (start the next layer's all-gather before the current layer finishes), and the various overlapping techniques. The 2024–2026 work on FSDP optimisation has substantially reduced the overhead.
When to use what
The practical guidance: DDP for models that comfortably fit on a single GPU. FSDP / ZeRO-3 as the default for larger models — it's a drop-in replacement for DDP with substantial memory benefits. FSDP + tensor parallelism for models whose individual layers are too large. FSDP + tensor + pipeline parallelism (3D parallelism, Section 6) for the largest frontier models. The dominant production stacks (Megatron-LM, NeMo, the various MosaicML-derived frameworks) configure all of these as parameters; the engineering work is choosing the configuration that maximises MFU for the specific workload.
3D Parallelism and the Frontier Pattern
Modern frontier-AI training combines all three parallelism dimensions: data parallelism across one axis, tensor parallelism across another, pipeline parallelism across a third, with FSDP/ZeRO layered on top of the data-parallel axis. The combination is called 3D parallelism and is the dominant frontier-training pattern. This section unpacks how the dimensions interact and how to reason about the combined system.
The device mesh abstraction
The unifying conceptual model is the device mesh: a multi-dimensional grid of GPUs, with each parallelism dimension assigned to one mesh axis. A typical configuration: 8 GPUs along the tensor-parallel axis, 16 GPUs along the pipeline-parallel axis, 64 GPUs along the data-parallel axis — total 8,192 GPUs. The placement matters: tensor-parallel dimension on NVLink (within a server); pipeline-parallel and data-parallel dimensions across InfiniBand (across servers). The mesh abstraction is supported natively in JAX and modern PyTorch (PyTorch's DeviceMesh and DTensor).
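In PyTorch the abstraction is init_device_mesh; a sketch of the configuration above (dimension names are conventions, not requirements, and the shape assumes an 8,192-rank job):

```python
# A 3D device mesh in recent PyTorch. Ranks are laid out so the
# fastest-varying axis ("tp") maps to GPUs within one NVLink domain.
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (64, 16, 8),
                        mesh_dim_names=("dp", "pp", "tp"))
tp_group = mesh["tp"].get_group()   # NVLink-local process group
dp_mesh = mesh["dp"]                # e.g. handed to FSDP for sharding
```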
Communication patterns by axis
Each parallelism axis has different communication characteristics. Tensor-parallel axis: high-frequency, large-tensor all-reduces inside every layer; needs maximum bandwidth. Pipeline-parallel axis: low-frequency, small-tensor point-to-point activation exchanges between adjacent stages; tolerates higher latency. Data-parallel axis: medium-frequency reduce-scatter and all-gather operations (with FSDP); medium bandwidth. Mapping these correctly to the network hierarchy (NVLink for tensor, NVLink-or-fast-IB for pipeline, IB for data) is essential for performance.
The compound MFU
Each parallelism dimension introduces its own efficiency loss. Tensor parallelism's all-reduces eat MFU. Pipeline parallelism's bubbles eat MFU. FSDP's all-gathers eat MFU. The compound MFU of a 3D-parallel system is roughly the product of each dimension's individual efficiency: axes running at 90%, 85%, and 80% efficiency compound to roughly 0.9 × 0.85 × 0.8 ≈ 61%. Frontier-training MFU of 40–55% on H100/B200 is the typical target; achieving this requires careful tuning of every dimension and significant engineering investment in the training stack.
Layer placement and stage balancing
For pipeline parallelism, stage balancing matters: each pipeline stage should take roughly the same time for forward+backward, otherwise the slowest stage limits throughput. For transformer models, this is typically achieved by giving each stage the same number of layers; for non-uniform models (varying layer widths, special embedding/head layers), more sophisticated placement is needed. Padding tokens in variable-length sequences is another balancing issue — different micro-batches may have different actual sequence lengths, leading to load imbalance.
The configuration-search problem
Finding the optimal 3D parallelism configuration for a specific workload is a search problem. Variables include: tensor-parallel degree, pipeline-parallel degree, data-parallel degree, micro-batch count, ZeRO stage, activation-checkpointing strategy. The search space is large and the right answer is hardware-and-model-specific. Mature training stacks have either heuristic auto-tuners or manually-curated configuration libraries for common scales; the 2024–2026 work on automatic parallelism (Alpa, the various 2024–2026 entrants) is moving toward fully-automatic configuration selection, but expert manual tuning remains the frontier-AI norm.
Examples from frontier training
Public papers describe specific 3D configurations. The Llama 3 paper documents training with TP=8, PP=16, DP=128 on 16,384 H100s for the 405B model. The PaLM paper describes 12-way model parallelism combined with 256-way data parallelism on TPU v4 pods, notably without pipeline parallelism. The DeepSeek-V3 paper describes a different configuration with finer-grained MoE-aware parallelism. Reading these papers gives concrete intuition for the configuration space; they are required reading for anyone training models at frontier scale.
Sequence and Context Parallelism
The 2023–2026 push toward longer context windows (millions of tokens for frontier LLMs) has introduced a new parallelism dimension: sequence parallelism, also called context parallelism (distinct from the Megatron-style sequence parallelism of Section 3, which partitions layer-norm and dropout activations). The motivation is that attention's compute scales quadratically with sequence length, and at long contexts the per-step memory becomes the binding constraint. Sequence parallelism splits the sequence dimension across workers, with carefully-orchestrated communication to maintain mathematical equivalence.
Why long context is hard
Attention's compute scales O(N²) in sequence length N, and even the linear terms become enormous at long context. For a 1M-token context with hidden dimension 8192, a single layer's activation tensor is roughly 16 GB in BF16 (10⁶ tokens × 8192 × 2 bytes), times the layer count; a naively materialised N×N attention-score matrix would run to terabytes per head. Even with FlashAttention's memory optimisations, very long contexts blow out single-GPU memory, and the O(N²) compute makes single-GPU training of very long contexts prohibitively slow.
Ring attention
The dominant approach to sequence parallelism in 2024–2026 is ring attention (Liu et al. 2023). The sequence is split into chunks across workers; each worker holds query, key, and value tensors for its sequence chunk; during attention, key/value tensors are passed around the workers in a ring, with each worker computing partial attention scores against received KVs. The communication is overlapped with computation. The memory savings are linear in the worker count; the communication cost is similarly linear in the worker count but with relatively small per-step constants.
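A single-process NumPy simulation of the core idea: KV chunks rotate around the ring while each worker maintains flash-style running softmax statistics. Checked against vanilla attention; no causal mask, and all names are illustrative:

```python
import numpy as np

def reference_attention(Q, K, V):
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ V

def ring_attention_sim(Q, K, V, n_workers):
    d = Q.shape[-1]
    Qs, Ks, Vs = (np.array_split(t, n_workers) for t in (Q, K, V))
    outs = []
    for i in range(n_workers):                 # worker i owns sequence chunk i
        q = Qs[i]
        m = np.full(q.shape[0], -np.inf)       # running row-max
        l = np.zeros(q.shape[0])               # running softmax denominator
        acc = np.zeros_like(q)                 # running numerator
        for step in range(n_workers):          # KV chunks arrive around the ring
            j = (i + step) % n_workers
            s = q @ Ks[j].T / np.sqrt(d)
            m_new = np.maximum(m, s.max(axis=-1))
            scale = np.exp(m - m_new)          # rescale old statistics
            p = np.exp(s - m_new[:, None])
            acc = acc * scale[:, None] + p @ Vs[j]
            l = l * scale + p.sum(axis=-1)
            m = m_new
        outs.append(acc / l[:, None])
    return np.concatenate(outs)

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(16, 8)) for _ in range(3))
assert np.allclose(ring_attention_sim(Q, K, V, 4),
                   reference_attention(Q, K, V))
```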
Context-parallel training stacks
Modern training stacks (Megatron-LM, NeMo, ColossalAI, the various 2024–2026 entrants) have added context-parallelism support. The integration with other parallelism dimensions is non-trivial: context parallelism interacts with tensor parallelism (which also splits attention computation), with sequence parallelism in the layer-norm sense, and with FSDP. Getting the combination right is engineering work that the major training-stack maintainers have done; users typically configure the parallelism degrees and let the framework handle the details.
Why long context matters operationally
The push toward long context (1M tokens for frontier LLMs in 2024–2026) is driven by use cases: agents that need to track long conversations, code-understanding models that need to see entire codebases, document-processing models that need to handle entire books. The training methodology has had to evolve to support these contexts; sequence parallelism is the enabling primitive. Without it, the long-context capability that defines current frontier LLMs would not be trainable.
The inference connection
Sequence parallelism methodology connects directly to inference techniques like vLLM's PagedAttention and Ring Attention's inference-time variant. Modern long-context inference uses similar partitioning across multiple GPUs to handle prompts that don't fit in a single GPU's HBM. The training-and-inference parallelism stories are converging in 2025–2026; the same primitives are used in both regimes.
The 4D-parallelism era
Adding sequence parallelism to the existing 3D parallelism produces what some sources call 4D parallelism. The configuration space grows; the engineering complexity grows; but the capability ceiling — what can be trained at all — substantially extends. The 2026 frontier training jobs increasingly use 4D parallelism with the additional axis dedicated to sequence-dimension splitting. Whether further dimensions emerge is open; expert-parallelism (for mixture-of-experts models, where different "experts" can be placed on different workers) is sometimes treated as another dimension.
Frameworks: PyTorch, JAX, DeepSpeed, Megatron
Distributed training requires substantial framework support: collective communication, parallelism primitives, checkpointing, optimisation. Several mature frameworks compete in 2026, with somewhat different design philosophies and trade-offs. This section surveys the dominant options.
PyTorch and FSDP/DDP
PyTorch remains the dominant ML framework, and its native distributed training stack is extensive: DDP for basic data parallelism, FSDP/FSDP2 for sharded data parallelism, DeviceMesh and DTensor for multi-dimensional parallelism, torch.distributed.pipelining for pipeline parallelism. The 2024–2026 work has substantially closed the gap with specialised distributed-training frameworks; PyTorch native is now competitive for most large-scale training. For frontier scale (>10K GPUs), PyTorch is typically combined with Megatron-LM or DeepSpeed for the additional optimisations.
JAX and the GSPMD approach
JAX takes a different approach: rather than explicit parallelism primitives, the user expresses computation in terms of arrays sharded across a device mesh, and the XLA compiler ("GSPMD" or "SPMD" — Single Program Multiple Data) automatically inserts the necessary collective communication. The user writes single-device-looking code and gets distributed execution for free. The approach scales remarkably well — Google's Gemini models are trained with JAX SPMD on TPU pods at very large scale. The trade-off is that JAX is less popular than PyTorch in the broader ML ecosystem; ecosystem effects matter for tooling, library support, and onboarding.
DeepSpeed
DeepSpeed (Microsoft, 2020) was the original ZeRO implementation and the dominant distributed-training stack on top of PyTorch through 2022–2023. DeepSpeed bundled ZeRO, pipeline parallelism, mixed-precision training, custom optimised kernels, and a simplified configuration interface. The 2024–2026 trajectory has been gradual displacement by PyTorch FSDP (which absorbed many DeepSpeed ideas natively), but DeepSpeed remains widely used, particularly for training-from-scratch frontier work. DeepSpeed-Chat and DeepSpeed-MII (inference) extend the framework into the post-training and serving regimes.
Megatron-LM and NeMo
Megatron-LM (NVIDIA) is the reference implementation of tensor parallelism and is widely used for transformer-specific training at scale. Megatron's tensor-parallel and pipeline-parallel implementations are the gold standard; many other frameworks borrow from or wrap around Megatron. NeMo (NVIDIA's broader ML framework) wraps Megatron in a higher-level training API with better ergonomics. The 2024–2026 frontier training stacks at major AI labs typically include Megatron-LM components even when the rest of the stack is something else.
MosaicML and the high-throughput stacks
MosaicML (acquired by Databricks in 2023) developed Composer and Streaming, frameworks emphasising high throughput and ergonomic configuration for large-scale training. The Mosaic stack is widely used, particularly for fine-tuning and mid-scale training, and has substantially influenced how FSDP-based training is configured at scale. MosaicML's training platform ran some of the most prominent open-source LLM training jobs of 2023–2024.
The 2024–2026 entrants
Several newer frameworks target specific niches. Levanter (Stanford CRFM) emphasises reproducibility and JAX-native training. torchtitan (Meta, 2024) is a reference implementation showcasing PyTorch's native distributed-training stack. EasyDeL (JAX-based, 2024) provides a friendlier JAX wrapper. Axolotl (community-driven) targets fine-tuning. The ecosystem continues to consolidate around PyTorch + FSDP + Megatron and JAX + GSPMD as the two dominant production patterns; smaller frameworks fill specialised niches.
Operational Realities at Scale
Beyond the parallelism methodology, distributed training at frontier scale has substantial operational realities that shape what teams actually do. Failures happen constantly; checkpointing must be efficient enough to survive them; convergence behaviour can be subtly different from single-GPU training; the engineering effort to maintain a high-MFU training run is substantial. This section covers the operational layer.
The failure-rate reality
A training cluster of 10,000+ GPUs has roughly daily individual GPU failures, in addition to memory errors, network glitches, cooling issues, and software bugs. The Llama 3 paper documents an average of 8 unscheduled interruptions per day during their 16K-GPU training run. The implication: the training stack must handle failures gracefully — checkpoint frequently, restart cleanly from checkpoints, gracefully replace failed workers. Frontier-AI training is best understood as continuous training with frequent restarts rather than a single uninterrupted run.
Checkpointing strategies
The dominant checkpointing approach for sharded training is to save the sharded state: each worker saves its slice of the parameters/optimiser state to durable storage; the full checkpoint is the union of all workers' shards. Asynchronous checkpointing overlaps the save with continued training: while one step's checkpoint is being persisted, the next step proceeds. Hierarchical checkpointing uses fast-but-volatile local storage (peer-GPU memory or local NVMe) for frequent intermediate checkpoints, with periodic persistent checkpoints to remote storage. The 2024–2026 work on checkpointing has substantially reduced overhead — frontier-training stacks now checkpoint with <2% overhead even at very large scale.
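A sketch of sharded saving with torch.distributed.checkpoint, available in recent PyTorch (the path is illustrative and the model/optimizer objects are assumed from the earlier FSDP sketch):

```python
# Sharded checkpointing: each rank writes its own shards; the
# checkpoint directory is the union of all ranks' output.
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict

# model and opt are the FSDP-wrapped objects from the earlier sketch.
model_sd, optim_sd = get_state_dict(model, opt)
dcp.save({"model": model_sd, "optim": optim_sd},
         checkpoint_id="/checkpoints/step_12000")
# Recent PyTorch also provides dcp.async_save, which overlaps
# persistence with continued training as described above.
```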
Debugging at scale
Debugging distributed training is its own discipline. Numerical issues: a NaN in one GPU's activations propagates across workers via collective operations; identifying which worker's input caused the NaN is non-trivial. Hangs and deadlocks: a single slow worker can stall the entire training run; identifying which one and why requires distributed tracing. Convergence issues: at large batch sizes, training can become unstable in subtle ways; identifying root cause requires careful experimentation. Mature training teams have substantial debugging infrastructure — distributed tracing, NaN watchdogs, hang detectors, gradient-norm monitors — that catch issues quickly.
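A minimal NaN watchdog of the kind described: gradient hooks that name the offending parameter and rank the moment a non-finite gradient appears (illustrative, not taken from any particular framework):

```python
import os
import torch

def install_nan_watchdog(model: torch.nn.Module) -> None:
    """Fail fast, and loudly, on the first non-finite gradient."""
    rank = int(os.environ.get("RANK", 0))
    def make_hook(name):
        def hook(grad):
            if not torch.isfinite(grad).all():
                raise RuntimeError(
                    f"non-finite gradient in {name} on rank {rank}")
            return grad
        return hook
    for name, p in model.named_parameters():
        if p.requires_grad:
            p.register_hook(make_hook(name))  # fires as each grad is computed
```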
MFU engineering
Achieving high MFU requires continuous engineering investment. The typical iteration: profile to identify the current bottleneck, apply a targeted optimisation, re-profile, repeat. Common optimisations include kernel fusion (combining multiple operations into a single kernel), better tensor layouts, communication-computation overlap, micro-batch tuning, and the various framework-specific tricks. Frontier-training MFU has steadily improved over 2020–2026: GPT-3 reportedly trained at roughly 20% MFU; modern Llama 3-class training reaches 40–50% MFU; the most-optimised stacks reach 50–60%. The remaining gap to 100% is fundamentally communication overhead, memory-bandwidth limits, and other losses that no amount of tuning removes entirely.
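The MFU calculation itself, using the standard ≈6 × (parameters) × (tokens) approximation for dense-transformer training FLOPs; the throughput number below is illustrative:

```python
# MFU estimate: achieved model FLOP/s divided by the cluster's peak.
# Uses the ~6 * params * tokens approximation for dense transformers
# (attention FLOPs excluded, as in the usual accounting).
def mfu(n_params: float, tokens_per_sec: float,
        n_gpus: int, peak_flops_per_gpu: float) -> float:
    achieved = 6 * n_params * tokens_per_sec
    return achieved / (n_gpus * peak_flops_per_gpu)

# e.g. a 405B model at 2.7M tokens/s on 16,384 H100s (~989 TFLOPS BF16 peak):
print(round(mfu(405e9, 2.7e6, 16_384, 989e12), 3))   # -> ~0.405
```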
Convergence and stability
Distributed training at very large batch sizes can have subtle convergence issues: training divergence (sudden loss spikes), spike-and-recover patterns, or simply slower convergence than expected. Loss spike monitoring is standard practice; spikes are typically handled by automatic learning-rate warmup or by skipping the offending batch. Gradient clipping at large batch sizes is essential. Numerical stability at FP8/BF16 precisions requires care — some models train stably at FP8, others require BF16 or FP32 for specific operations. The 2024–2026 work on FP8 training (Transformer Engine, the various 2024–2026 entrants) has substantially advanced the state of the art.
The training-run lifecycle
A frontier training run lifecycle: launch (allocate cluster, start workers, load initial checkpoint), warmup (early steps where everything is being initialised), steady-state (the bulk of training, hopefully high MFU), interventions (responses to monitoring alerts, occasional restarts, hyperparameter adjustments), cooldown (final steps with reduced learning rate), finalisation (last checkpoint, evaluation, model registry promotion). Each phase has its own engineering considerations. Mature training teams have explicit playbooks for each.
The Frontier and the Operational Question
Distributed training is mature in 2026 for the well-trodden patterns, but several frontiers remain active. Cluster sizes have crossed 100,000 GPUs at major labs; multi-datacentre training is becoming necessary for the largest jobs; mixture-of-experts models change the parallelism story; asynchronous training reduces communication at the cost of some convergence rigour. This section traces the open frontiers and the directions the field is moving in.
The 100K+ GPU regime
The 2024–2025 generation of frontier training jobs crossed 100,000 GPUs in single coordinated training runs (xAI's Colossus cluster, Meta's clusters, Microsoft/OpenAI clusters). At this scale, communication latencies become substantial even within a single datacentre, failure rates make continuous training a function of restart efficiency, and power infrastructure becomes a binding constraint. The 2025–2027 trajectory is toward 200K–1M-GPU clusters; the engineering challenges and the economics are both unprecedented (single training runs approaching $1B in compute).
Multi-datacentre training
For the very largest training jobs, single datacentres are reaching power and physical-space limits. Multi-datacentre training spreads a single training run across multiple geographic locations, connected by inter-datacentre networks at much lower bandwidth than intra-datacentre. The methodology requires very-high-asynchrony tolerance: workers in different datacentres cannot exchange gradients at every step. The 2024–2026 work on async-friendly training methodologies (DiLoCo, the various asynchronous-SGD revivals) is starting to make multi-datacentre training operationally feasible. Whether multi-datacentre training becomes the standard for frontier or remains a specialised tool is open.
Mixture-of-experts and expert parallelism
Mixture-of-experts (MoE) models route different tokens to different "expert" sub-networks; the active parameters per token are much smaller than the total. MoE models have different parallelism considerations: expert parallelism places different experts on different workers, with token routing requiring all-to-all communication at every layer. The communication patterns are complex; the methodology has matured substantially through 2023–2026 (DeepSeek-V3 and similar) and is increasingly important at frontier scale. The 2026 frontier appears increasingly MoE-flavoured.
Asynchronous and federated training
Beyond the synchronous-SGD norm, several alternative training paradigms have emerged. Asynchronous SGD (workers update a shared model without strict synchronisation) has theoretical and practical issues but works in some regimes. Federated training (Part XIII Ch 10) trains across decentralised data sources without centralising the data; useful for privacy-sensitive workloads but with a different operational profile. The 2024–2026 work on hybrid sync/async approaches is methodologically active; whether any displaces standard synchronous training at scale is open.
Compiler-based parallelism
Frameworks like JAX/XLA and the various MLIR-based stacks are pushing toward automatic parallelism: the user writes single-device code and the compiler determines the optimal sharding and communication. Alpa (2022), the various 2023–2025 follow-ups, and OpenXLA are advancing this direction. The 2026 status is that automatic parallelism works well for some workloads but doesn't yet match expert-tuned configurations for frontier-scale training; expert tuning remains the norm at the very-largest scale. Whether automatic parallelism eventually displaces manual configuration is open.
What this chapter has not covered
Several adjacent topics are out of scope. Model compression (quantisation, pruning, distillation) is the topic of Ch 03. Inference optimisation (batching, KV caching, speculative decoding) is Ch 04. AI chips and custom silicon (TPUs, Trainium, the various ASICs at depth) is Ch 05. The detailed methodology of post-training fine-tuning (LoRA, the various PEFT methods) is in Part IX (LLMs). The chapter focused on the methodology of distributed training at scale; the broader infrastructure landscape develops adjacent topics in subsequent chapters.
Further reading
Foundational papers and references for distributed training. The DDP, Horovod, and ring-all-reduce papers; the ZeRO series; the FSDP paper; the Megatron-LM and Megatron-Turing-NLG papers; the Llama 3 and PaLM training papers; and the 2024–2026 long-context-training papers form the right starting kit.
- PyTorch Distributed: Experiences on Accelerating Data Parallel Training. The DDP paper. Documents PyTorch's distributed-data-parallel implementation and the gradient-bucketing approach that underlies modern multi-GPU training. The reference for PyTorch DDP.
- Horovod: fast and easy distributed deep learning in TensorFlow. The Horovod paper. Popularised ring all-reduce as the dominant approach to multi-GPU gradient communication. The reference for ring all-reduce.
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. The ZeRO paper. Introduces the three-stage memory-optimisation approach behind FSDP and DeepSpeed-ZeRO. The ZeRO reference.
- PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. The FSDP paper. Documents the design choices and performance characteristics of fully sharded data parallel in PyTorch. The PyTorch FSDP reference.
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. The Megatron-LM paper. Introduces tensor parallelism for transformers in the form that has become standard. The tensor-parallelism reference.
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. The GPipe paper. Introduces pipeline parallelism with micro-batch scheduling, the foundation of modern pipeline-parallel training. The pipeline-parallelism reference.
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. The 3D-parallelism paper from NVIDIA. Documents the combination of tensor, pipeline, and data parallelism in Megatron-LM, with detailed performance analysis and configuration guidance. The 3D-parallelism reference.
- The Llama 3 Herd of Models. The Llama 3 paper (cross-referenced from Ch 01). Substantial sections on training infrastructure, parallelism configuration, failure rates, and the operational realities of training at 16K-GPU scale. The reference for frontier distributed-training operational practice.
- Ring Attention with Blockwise Transformers for Near-Infinite Context. The Ring Attention paper. Introduces sequence-parallel attention with ring-based key/value passing, the foundation of long-context training. The sequence-parallelism reference.
- Training Compute-Optimal Large Language Models. The Chinchilla paper. Establishes the compute-optimal scaling laws that motivate large-scale training: model size and training data should scale together. The compute-optimal-scaling reference.
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. The Alpa paper. The most mature attempt at automatic parallelism: determining the optimal combination of data, tensor, and pipeline parallelism for a given model and cluster. The automatic-parallelism reference.
- DiLoCo: Distributed Low-Communication Training of Language Models. The DiLoCo paper. Introduces a low-communication, async-friendly training methodology suited to multi-datacentre training. The reference for multi-datacentre and asynchronous training.