Chapter 01 showed how a neural network is defined, how a forward pass is computed, how backpropagation produces gradients, and how plain SGD updates parameters. On a two-hidden-layer MLP for MNIST, that machinery is enough. On a fifty-layer convnet, a transformer with a few billion parameters, or a large-language-model training run that spans a thousand GPUs, it is very far from enough. Dozens of engineering problems stand between the toy training loop of Chapter 01 and a full-scale run that actually converges: choosing an optimiser that handles ill-conditioned and stochastic losses, scheduling the learning rate so that neither divergence in the early iterations nor premature stalling late in training kills the run, picking a batch size that balances hardware utilisation against generalisation behaviour, initialising weights so that activations and gradients neither vanish nor explode, normalising intermediate activations so that training stays stable as depth grows, clipping gradients so that rare outlier examples do not cause single-step divergence, training in reduced-precision arithmetic without losing numerical fidelity, distributing the training workload across many accelerators, monitoring the dozens of scalar metrics that flag problems before they waste days of compute, and reproducing the whole thing reliably across runs and hardware generations. This chapter is the training-engineering layer of deep learning — the body of accumulated empirical practice that turns the core SGD algorithm into a reliable tool. Every working modern model, from a 200-line MNIST classifier to a multi-trillion-token LLM, relies on most of the material in this chapter; this is the chapter where the difference between a toy-scale researcher and a production-scale practitioner lives.
Sections one through five are the optimiser zoo. Section one revisits vanilla SGD as the baseline, with a precise statement of the objective it is minimising and the noise characteristics that make it a regulariser. Section two introduces momentum in its Polyak-heavy-ball and Nesterov forms, and the intuition that a running average of past gradients turns a noisy descent into a directed flow. Section three covers the adaptive-learning-rate family — Adagrad (Duchi, Hazan, Singer 2011), RMSprop (Tieleman, Hinton 2012), Adam (Kingma, Ba 2015), AdamW (Loshchilov, Hutter 2019) — the optimisers that scale the step size per parameter and have become the default for most deep-learning tasks. Section four surveys the 2018–present generation: LAMB, LARS, Lion (Chen et al. 2023), Sophia, Shampoo, and second-order methods in general, the optimisers that appear in frontier-model training. Section five treats learning-rate scheduling in its own right: warmup, cosine decay, one-cycle, cyclical rates, and the schedule choices that matter as much as the optimiser choice.
Sections six through nine are the per-layer training machinery. Section six is the initialisation deep dive — Xavier, He, LeCun, orthogonal, identity, transformer-specific 1/√L scaling — each derived from a signal-propagation argument. Section seven is batch normalisation (Ioffe, Szegedy 2015), the technique that made training convnets of hundreds of layers practical and whose theoretical explanation is still debated a decade later. Section eight covers the alternatives — layer, group, instance, and RMS normalisation — and the reasons different normalisation schemes suit different architectures and batch sizes. Section nine is residual and skip connections (He, Zhang, Ren, Sun 2015), the architectural innovation that made networks with hundreds or thousands of layers trainable by keeping gradient signal paths short.
Sections ten through fourteen are the numerical-and-scale engineering. Section ten is gradient clipping: norm clipping, value clipping, and when each prevents training divergence. Section eleven covers the batch-size and learning-rate interaction — the linear-scaling rule, warmup, and the regime in which very large batches stop behaving like many small ones. Section twelve is mixed-precision training (Micikevicius et al. 2018): FP16, BF16, FP8, loss scaling, and the hardware-software co-design that turned precision reduction into a lossless throughput doubling. Section thirteen is distributed training in its three forms — data, model, and pipeline parallelism — and the engineering frameworks (DDP, FSDP, ZeRO, Megatron-LM, DeepSpeed) that implement them. Section fourteen is gradient accumulation and memory-efficient backprop: checkpointing, activation recomputation, and the techniques that let a 10-billion-parameter model train on a single accelerator at the cost of compute for memory.
Sections fifteen through eighteen are the operational discipline. Section fifteen is the training-dynamics diagnostic — the dozen scalar metrics (loss, gradient norm, weight norm, activation statistics, learning-rate trace) you monitor during training and the failure modes each signals. Section sixteen is reproducibility: random seeds, deterministic algorithms, numerical non-determinism on GPU, the replication crisis in ML, and the protocols that make a training run replicable. Section seventeen is compute budgeting: the Kaplan and Chinchilla scaling laws as practical planning tools, how to spend a fixed compute budget on the right combination of model size, dataset size, and training duration. Section eighteen places training-engineering inside the broader deep-learning landscape — the cost of frontier training runs, the open-vs-closed-weights debate, and the techniques (LoRA, QLoRA, pruning, distillation) that turn the expensive initial training into cheap derivative models.
Chapter 01 introduced stochastic gradient descent as the workhorse algorithm that drives neural-network training. This chapter treats optimisation as an engineering discipline in its own right, and we start by taking a harder look at what plain SGD actually does — the precise update rule, the noise statistics of the mini-batch gradient, the convergence theory, and the cases where unadorned SGD is still the right choice.
Given parameters θ ∈ ℝᵖ, a training set of N examples, a loss L(θ) = (1/N) ∑ᵢ ℓ(f(xᵢ; θ), yᵢ), a learning rate η, and a randomly sampled mini-batch B ⊂ {1, …, N} of size |B|, one step of stochastic gradient descent is θ ← θ − η · ĝ where ĝ = (1/|B|) ∑ᵢ∈B ∇θ ℓ(f(xᵢ; θ), yᵢ). The mini-batch gradient ĝ is an unbiased estimator of the full-batch gradient ∇L(θ) — its expectation over random mini-batches equals the true gradient. What distinguishes stochastic from full-batch descent is not bias but variance: ĝ is a noisy version of ∇L, with variance that scales as 1/|B|. That noise is a feature, not a bug — it helps SGD escape saddle points and flat regions of the loss surface — but it also means that SGD's trajectory is not a deterministic curve but a random walk biased in the direction of steepest descent.
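The update rule above fits in a few lines of NumPy. The least-squares problem, learning rate, and batch size below are illustrative stand-ins, not values from the text:

```python
import numpy as np

def sgd_step(theta, grad_fn, batch, lr):
    """One SGD step: average the per-example gradients over the batch, step downhill."""
    g_hat = np.mean([grad_fn(theta, i) for i in batch], axis=0)
    return theta - lr * g_hat

# Illustrative problem: least-squares regression. The per-example gradient of
# 0.5 * (x_i . theta - y_i)**2 with respect to theta is (x_i . theta - y_i) * x_i.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
grad_fn = lambda th, i: (X[i] @ th - y[i]) * X[i]

theta = np.zeros(3)
for _ in range(500):
    batch = rng.choice(len(X), size=8, replace=False)   # random mini-batch
    theta = sgd_step(theta, grad_fn, batch, lr=0.05)
```

Because this toy problem is realisable (y was generated exactly from w_true), the gradient noise vanishes at the solution, and theta converges to w_true despite the batch-size-8 sampling.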
Define the per-example gradient covariance Σ(θ) = Covᵢ[∇θ ℓᵢ(θ)], a p × p matrix that captures how much the per-example gradients disagree with one another at the current parameters. The variance of the mini-batch estimator is Σ(θ) / |B|; the variance of a single-example SGD step is the full Σ(θ). Smith and Le (2018) argued that the effective noise in an SGD update is proportional to η / |B| — large learning rates amplify noise, large batches dampen it. This linear scaling rule (Goyal et al. 2017) is the reason practitioners who want to double their batch size typically also double their learning rate: the trajectory of SGD, viewed as an approximation to a continuous-time stochastic differential equation, depends only on the ratio η / |B|, so preserving that ratio preserves the optimisation dynamics.
For a convex L-smooth loss, SGD with step size η = 1/(Lt) at iteration t (or η = 1/(L√t) with appropriate averaging) converges to the global minimum at rate O(1/√t) — the optimum expected loss satisfies E[L(θ_t) − L*] = O(1/√t). For strongly convex losses with a decreasing step-size schedule the rate improves to O(1/t). For non-convex neural-network losses no such guarantees exist in general, but partial results — that SGD almost-surely escapes strict saddle points (Ge et al. 2015, Lee et al. 2017), that under realisability assumptions the overparameterised regime is effectively convex (Du et al. 2019, Neural Tangent Kernel theory) — give theoretical hope where classical theory despaired. What matters in practice is the empirical observation: SGD on a well-initialised, well-normalised deep network reliably drives training loss to near-zero, and does so for architectures whose loss surfaces are known to contain exponentially many local minima.
Adaptive optimisers (Adam and its descendants, treated below) dominate the deep-learning literature, but plain SGD with momentum remains the standard for computer vision tasks trained with convolutional networks — the ResNet, EfficientNet, and ConvNeXt families all report their best results with SGD-momentum. Wilson et al. (2017), in The marginal value of adaptive gradient methods in machine learning, showed that SGD-momentum generalises better than Adam on several vision benchmarks, attributed variously to Adam's implicit L2 geometry, its sensitivity to weight decay, or its tendency to converge to "sharper" minima. The rule of thumb: use SGD-momentum for computer-vision training from scratch, use AdamW for transformers and for fine-tuning almost anything, and don't be surprised if the rule fails on a new architecture — run a small hyperparameter sweep and let the validation loss decide.
A separate strand of theory argues that SGD's noise is not merely a means to faster convergence but a regulariser in its own right. Keskar et al. (2017) showed that large-batch training finds "sharper" minima — ones whose loss rises quickly in a small neighbourhood — while small-batch training finds flatter minima that generalise better. Zhang et al. (2017), in Understanding deep learning requires rethinking generalization, showed that gradient noise, together with the architecture's inductive bias, is what keeps an overparameterised network from simply memorising its training set. Smith et al. (2020) formalised this as the "implicit bias" of SGD: among the many zero-training-loss configurations the optimiser could converge to, it preferentially finds those that have additional geometric properties (low norm, flatness, low-complexity Fourier components) which correlate with good test performance. This is one more reason the modern practitioner does not switch to full-batch gradient descent "because it's more accurate": the noise is doing useful work.
The first and most durable upgrade to plain SGD is momentum — adding a running-average of past gradients to the update direction. The cost is one extra vector of memory per parameter tensor; the benefit is a dramatic acceleration on loss landscapes with narrow ravines, and a qualitatively different notion of convergence speed for smooth convex functions.
Polyak's 1964 "heavy-ball" method (Some methods of speeding up the convergence of iteration methods, USSR Computational Mathematics and Mathematical Physics) augments gradient descent with a velocity variable v that tracks the running exponential average of past gradients. The update is v ← β v + ∇L(θ), then θ ← θ − η v, where β ∈ [0, 1) is the momentum coefficient — typical values are 0.9 or 0.99. The name comes from the physical analogy: a heavy ball rolling down the loss surface accumulates momentum, so it keeps moving through flat regions and oscillates less across narrow ravines than a frictionless particle would. The effective step size in a direction along which the gradient has been consistently non-zero is η / (1 − β) — for β = 0.9, that is 10η, which is why momentum speeds up training on problems with low-curvature directions.
Read the documentation carefully: PyTorch's torch.optim.SGD implements a slightly different momentum variant in which the velocity is applied directly, without a 1 − β dampening factor. The update is v ← β v + ∇L(θ), θ ← θ − η v — which is Polyak's heavy ball. TensorFlow's tf.keras.optimizers.SGD and some theoretical presentations use the dampened form v ← β v + (1 − β) ∇L(θ), in which the steady-state velocity magnitude matches the gradient rather than being inflated by the geometric sum. The two forms differ only by a constant factor that can be absorbed into the learning rate — but if you are porting hyperparameters between frameworks, be aware that η = 0.1 with PyTorch-style momentum β = 0.9 behaves like η = 1.0 with dampened momentum, and a naïve copy can blow up training.
Yurii Nesterov's 1983 method (A method for unconstrained convex minimization problem with the rate of convergence O(1/k²)) is a look-ahead variant of momentum. Classical heavy-ball computes the gradient at the current parameters θ; Nesterov computes the gradient at the look-ahead point θ − η β v, the position the momentum step would take us to. The update is v ← β v + ∇L(θ − η β v), θ ← θ − η v. The intuition: if momentum is carrying us to a new point, we might as well evaluate the gradient there — the look-ahead gives us an extra half-step of information that corrects momentum's tendency to overshoot. For smooth convex functions, Nesterov's method achieves O(1/k²) convergence, provably optimal among first-order methods, compared with O(1/k) for ordinary gradient descent and O(1/k) for heavy-ball (Nesterov's rate is a genuine improvement). For non-convex neural networks the picture is murkier — the acceleration guarantee no longer applies — but in practice Nesterov momentum consistently edges out classical momentum by a small amount, and both PyTorch's SGD(..., nesterov=True) and Keras's SGD(nesterov=True) expose it as a flag.
The canonical illustration of why momentum helps is the ravine — a long, narrow valley in the loss surface, steep in one direction and shallow in the other. Plain SGD in a ravine bounces between the steep walls while making only slow progress along the valley floor; the component of its update along the shallow direction is proportional to the small gradient there. Momentum, by averaging successive gradients, damps the oscillations across the steep walls (which point in alternating directions and cancel) and reinforces the motion along the shallow direction (which points consistently the same way and accumulates). The loss surfaces of deep networks, especially those without batch normalisation, are full of such ravines — the condition numbers can be 10⁶ or worse — and momentum is the minimum equipment needed to train through them.
Momentum is not universally a win. Very early in training, before the parameters are well-scaled, the velocity can grow so large that one step destabilises the network — the "cold start" problem. The standard workaround is a warmup schedule that ramps the learning rate from near-zero to its target value over the first few hundred to few thousand iterations (treated in §5 below). A second failure mode is when the loss landscape has sharp discontinuities or non-differentiable kinks — momentum can overshoot across them, and the velocity variable takes a long time to reset. Gradient clipping (§10) addresses the first; careful learning-rate selection addresses the second.
Momentum gives every parameter the same learning rate, only the direction of the update varies. Adaptive methods take a harder look at that assumption: if some parameters have gradients that are consistently large and others consistently small, shouldn't the effective step sizes differ? This intuition launched a family of optimisers — Adagrad, RMSprop, Adam, AdamW — that have become the default choice for training transformers and most modern deep networks.
John Duchi, Elad Hazan, and Yoram Singer's 2011 Adaptive subgradient methods for online learning and stochastic optimization (JMLR) introduced the key move: accumulate a running sum of squared gradients, Gₜ = ∑ₛ₌₁ᵗ gₛ ⊙ gₛ, and normalise each parameter's step by the square root of its own accumulated squared gradient. The update is θₜ₊₁ ← θₜ − η · gₜ / (√Gₜ + ε). Parameters that have received many large gradients get a smaller effective learning rate; parameters with consistently small gradients keep their full step. Adagrad was a breakthrough for sparse-gradient problems — NLP and recommender systems in the pre-deep-learning era — because it gave frequently-updated features and rarely-updated features appropriate learning rates automatically. Its fatal flaw: because Gₜ is monotonically increasing, the effective learning rate decays to zero, and training stalls. On convex problems this decay is desirable (it matches the η/√t schedule the theory prescribes), but on non-convex deep-learning losses it prematurely freezes training.
Geoffrey Hinton's 2012 Coursera lecture on RMSprop (never published as a paper, but cited as "Tieleman and Hinton, 2012, Coursera lecture 6.5") replaced Adagrad's monotonic sum with an exponential moving average of squared gradients: Eₜ ← α Eₜ₋₁ + (1 − α) gₜ ⊙ gₜ, θₜ₊₁ ← θₜ − η · gₜ / (√Eₜ + ε). The typical decay rate α is 0.99, which means the running-average window is about 100 iterations. Because Eₜ does not grow without bound, RMSprop does not stall the way Adagrad does — it adapts the per-parameter step size to the recent gradient scale rather than the cumulative one. RMSprop was a standard for training recurrent networks in the 2013–2016 period and remains a common choice for reinforcement-learning algorithms where the gradient statistics change over the course of training.
Diederik Kingma and Jimmy Ba's 2015 Adam: A Method for Stochastic Optimization (ICLR) combines RMSprop's second-moment adaptation with a momentum-like first-moment accumulator. The update is: mₜ ← β₁ mₜ₋₁ + (1 − β₁) gₜ (biased first moment), vₜ ← β₂ vₜ₋₁ + (1 − β₂) gₜ ⊙ gₜ (biased second moment), m̂ₜ = mₜ / (1 − β₁ᵗ), v̂ₜ = vₜ / (1 − β₂ᵗ) (bias-corrected moments), θₜ₊₁ ← θₜ − η · m̂ₜ / (√v̂ₜ + ε). The defaults β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸ work astonishingly well across an enormous range of problems. Adam is the optimiser that launched a thousand deep-learning benchmarks: its per-parameter adaptive step size handles the wildly different gradient scales that a multi-layer network produces, its bias correction makes early-training updates well-scaled before the running averages have warmed up, and its insensitivity to hyperparameter choices makes it a reliable default when the practitioner has not had time to tune carefully.
Ilya Loshchilov and Frank Hutter's 2019 Decoupled Weight Decay Regularization (ICLR) identified a subtle but important bug in the standard Adam-with-weight-decay implementation. The classical way to add L2 regularisation to Adam is to augment the loss with (λ/2)‖θ‖², which adds λθ to the gradient before it is fed through the adaptive scaling. But this means parameters with large second-moment v̂ (large historical gradients) see a small effective weight decay, and parameters with small v̂ see a large one — the regularisation strength is coupled to the optimiser's scaling, which is almost certainly not what the practitioner wanted. AdamW decouples: instead of adding λθ to the gradient, it subtracts η · λ · θ from the parameters after the Adam update. The decoupled form recovers the correct interpretation of λ as a uniform parameter-norm penalty, and empirically produces substantially better generalisation — Loshchilov and Hutter's paper reports 1–2% top-1 accuracy improvements on ImageNet and similar gains on language tasks. Modern frameworks expose both: PyTorch's torch.optim.Adam uses the classical (buggy) form; torch.optim.AdamW uses the decoupled form. The practical recommendation, almost universal in the modern literature, is always use AdamW, never use plain Adam.
Reddi, Kale, and Kumar's 2018 On the Convergence of Adam and Beyond (ICLR) gave a construction of a simple convex problem on which Adam fails to converge — the exponential moving average of second moments can decrease in a way that produces ever-increasing effective step sizes. Their fix, AMSGrad, replaces v̂ₜ with max(v̂₁, …, v̂ₜ) to ensure the second-moment normalisation is non-decreasing. AMSGrad recovers a convergence guarantee at the cost of one extra element-wise max per step. In practice, on deep-learning problems, AMSGrad and Adam behave similarly, and the convergence issue Reddi et al. highlighted is rarely a practical problem — it is more a cautionary tale that Adam's convergence theory is thin than a bug that needs fixing. Nadam (Dozat 2016) takes a different angle and folds Nesterov-style look-ahead into Adam; it sees occasional use.
Adam and its variants dominated deep-learning optimisation throughout the mid-2010s. A newer wave of optimisers, driven partly by the demand for very-large-batch training and partly by distillation of Adam's behaviour into simpler rules, has produced several interesting designs. None of these has displaced AdamW as a default, but several are competitive for specific workloads and worth knowing about.
Yang You, Igor Gitman, and Boris Ginsburg's 2017 Large Batch Training of Convolutional Networks introduced LARS (Layer-wise Adaptive Rate Scaling). The idea: the ratio of parameter norm to gradient norm varies dramatically across layers of a deep network, and a single global learning rate is a poor fit to layers with wildly different scales. LARS computes a per-layer trust ratio ‖θₗ‖ / ‖gₗ‖ and scales each layer's update by that ratio, clipped to a reasonable range. The resulting optimiser enabled ResNet-50 on ImageNet to be trained with batch size 32k (eight times the usual) without loss of accuracy — a landmark result for distributed training. You et al.'s 2020 follow-up Large Batch Optimization for Deep Learning: Training BERT in 76 minutes (ICLR) adapted the idea to Adam, producing LAMB (Layer-wise Adaptive Moments for Batch training), which enabled BERT pretraining at batch size 32k. Both LARS and LAMB are now standard tools for the supercomputer-scale training of the largest models.
Xiangning Chen et al.'s 2023 Symbolic Discovery of Optimization Algorithms (NeurIPS) used program search to look for simpler, better optimisers than AdamW — symbolic regression over the space of optimiser update rules, selected on held-out training-efficiency metrics. The winner, Lion (EvoLved Sign Momentum), uses only a sign operation and a single momentum buffer. The update is u ← sign(β₁ · m + (1 − β₁) · g), θ ← θ − η · (u + λ · θ), m ← β₂ · m + (1 − β₂) · g. There is no second-moment tensor — Lion saves roughly half of AdamW's optimiser state, a significant memory win for very-large-model training. On image classification and some language-model workloads, Lion matches or slightly beats AdamW while using less memory. Adoption has been gradual but steady: Google's Gemma models trained with Lion for part of their compute budget, several papers report competitive or better results, and PyTorch has a community-maintained lion-pytorch package. The hyperparameter recommendation in the paper: Lion's learning rate should be set about 10× smaller than AdamW's, and its weight decay 3× larger.
Hong Liu et al.'s 2023 Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training takes a different angle — it approximates the Hessian cheaply (via a diagonal Gauss-Newton or Hutchinson stochastic estimate) and uses it as a curvature-aware preconditioner. Sophia claims a 2× speedup over AdamW on GPT-2-scale language-model pretraining at equal token budget. The method is more complex than Lion — it maintains a running estimate of the per-parameter Hessian diagonal and uses a clipping rule on the update — but its wall-clock numbers have attracted interest. As of early 2026 Sophia has been adopted in a handful of open-source pretraining stacks and remains a plausible but not yet dominant choice for new LLM pretraining runs.
Vineet Gupta et al.'s 2018 Shampoo: Preconditioned Stochastic Tensor Optimization is a second-order optimiser that maintains a Kronecker-factored approximation of the full covariance matrix of gradients. For a weight matrix W ∈ ℝᵐˣⁿ, Shampoo maintains two smaller preconditioner matrices L ∈ ℝᵐˣᵐ and R ∈ ℝⁿˣⁿ and preconditions the gradient as L⁻¹ᐟ⁴ g R⁻¹ᐟ⁴. Roger Grosse and James Martens' 2015 K-FAC (Optimizing Neural Networks with Kronecker-factored Approximate Curvature) is a similar idea applied to the Fisher information matrix rather than the gradient covariance. Both methods incur substantial memory overhead — the Kronecker factors can be gigabytes — and require periodic matrix inverses. In exchange they converge in many fewer gradient evaluations. DeepMind's 2024 "distributed Shampoo" implementation made the method tractable at the scale of the largest language models, and there are scattered reports of 1.5–2× speedups over AdamW at equivalent wall-clock cost; as of early 2026 it is still a research-track optimiser rather than a production default.
Across all of these methods, a single meta-pattern emerges. Every modern optimiser can be described as adaptive per-parameter step size + momentum-like smoothing + something. The "something" is what distinguishes them: Adagrad's accumulated-squared-gradient, RMSprop's exponential-moving-average variant, Adam's first-moment correction, AdamW's decoupled weight decay, LARS/LAMB's layer-norm-based trust ratio, Lion's sign-only update, Sophia's Hessian-diagonal preconditioner, Shampoo's Kronecker-factored curvature. The zoo is wide, but the choice for most practitioners is narrow: use AdamW as the default, try Lion if you are memory-constrained, and reach for LARS/LAMB or Shampoo only when the problem (very-large-batch training, very-large-model pretraining) justifies the complexity.
A fixed learning rate is almost never the right choice for deep-network training. The optimal step size at the start of training — when parameters are random and gradients are large — is very different from the optimal step size near the end, when the optimiser is polishing a nearly-converged solution. A learning-rate schedule is a time-varying function η(t) that rises and falls through training according to a principled recipe; choosing a good schedule is often a larger lever than choosing the optimiser.
At the very start of training, the parameters are random and the gradients are noisy estimates of a loss surface the network has never seen. A large learning rate at this stage can send the network into a region of the loss landscape from which it cannot recover — activations saturate, gradients vanish, training never catches up. The standard fix is a warmup: ramp the learning rate from near-zero to its target value over the first few hundred to few thousand iterations, typically linearly. Warmup became standard practice for transformer training after Vaswani et al.'s 2017 Attention is All You Need paper, which used a 4000-step warmup with the target learning rate proportional to d_model^(−0.5). Goyal et al.'s 2017 Accurate, Large Minibatch SGD paper (the ImageNet-in-1-hour result) argued that warmup is especially important at large batch size, because the linear scaling rule (batch size k× larger → learning rate k× larger) breaks down early in training before the loss surface has been "locally linearised" by a few steps of training. The modern default: 1% to 5% of total training steps in warmup.
The classical step decay drops the learning rate by a constant factor (typically 0.1) at one or more milestones — the multistep schedule used by the ResNet paper drops at 1/3 and 2/3 of the total training time. Cosine decay (Loshchilov and Hutter 2017, SGDR: Stochastic Gradient Descent with Warm Restarts) replaces the discontinuous drops with a smooth η(t) = η_min + (η_max − η_min) · (1 + cos(πt/T)) / 2. The schedule starts at η_max, smoothly decreases to η_min at the end of training, and has a shape that approximately matches the empirical optimal learning-rate curve for gradient-based optimisation of smooth functions. Cosine decay is the default for most transformer pretraining runs — it produces consistently better final-epoch results than step decay, is easy to implement, and has only two hyperparameters (the initial rate and, optionally, the minimum rate).
Leslie Smith's 2018 A disciplined approach to neural network hyper-parameters: Part 1 — learning rate, batch size, momentum, and weight decay proposed the "one-cycle" policy: the learning rate rises linearly from a minimum to a maximum over the first half of training, then decreases back to a very small value over the second half, with momentum moving inversely. The surprising empirical observation is that the maximum learning rate can be quite aggressive — 10× higher than a fixed-schedule optimiser would tolerate — and that the aggressive peak combined with the final slow polishing produces what Smith called "super-convergence", a 3–5× wall-clock speedup on tasks like CIFAR-10. PyTorch exposes this schedule as torch.optim.lr_scheduler.OneCycleLR, and it remains a popular recipe for from-scratch training of small-to-medium networks on image-classification benchmarks.
Smith's earlier 2017 paper Cyclical learning rates for training neural networks proposed repeatedly oscillating the learning rate between a minimum and maximum, on a triangular or sinusoidal schedule with period tens of thousands of iterations. The idea — that the periodic high-learning-rate phases help the optimiser escape saddle points and narrow minima — is not fully settled in theory, but the empirical result is robust: CLR schedules produce solutions that generalise as well or better than fixed schedules, and they eliminate the need to tune a single learning rate precisely. Warm restarts (Loshchilov and Hutter's original 2017 proposal) are a cyclical-schedule variant based on cosine decay followed by an abrupt reset to the maximum; the Stochastic Gradient Descent with Warm Restarts (SGDR) schedule is a standard choice for reinforcement learning and occasionally for supervised training.
For transformer pretraining: linear warmup for 1% to 5% of steps, followed by cosine decay to 10% of the peak learning rate. For ConvNet pretraining on ImageNet: warmup for the first epoch, step decay at 1/3 and 2/3, or cosine decay — both are competitive. For fine-tuning: lower peak learning rate (often 10× smaller than pretraining), linear warmup for 10% of steps, linear or cosine decay. For from-scratch training of small networks: one-cycle with a learning-rate-range-test-derived peak. These recipes are starting points — they always require tuning for the specific problem — but they are far better than a constant learning rate, and they are the defaults built into every modern deep-learning framework for good reason.
Chapter 01 introduced the Xavier and He initialisation schemes as the remedies for vanishing and exploding gradients in shallow networks. Real networks are deep — 100, 1000, 10 000 layers — and at those depths, initialisation choices that a shallow network tolerates compound into training failures. This section digs deeper into the geometry of forward and backward signal propagation and the initialisation schemes that make very deep networks trainable.
Consider a d-layer MLP with weights Wₗ, biases bₗ, and activations φ. The forward pass computes hₗ = φ(Wₗ hₗ₋₁ + bₗ). If the inputs h₀ have unit variance, we want Var(hₗ) ≈ Var(h₀) for all ℓ — otherwise the activations either vanish (drift toward zero) or explode (drift toward ±∞) with depth. Writing σ²ₗ = Var(hₗ), one layer propagates variance as σ²ₗ ≈ α · nₗ₋₁ · σ²_W · σ²ₗ₋₁, where nₗ₋₁ is the input dimension and α is an activation-dependent constant (1/2 for ReLU, 1 for linear/tanh/sigmoid near zero). The stationarity condition — σ²ₗ = σ²ₗ₋₁ — requires σ²_W = 1/(α · nₗ₋₁), which recovers Xavier (for tanh, α = 1, so σ²_W = 1/n) and He (for ReLU, α = 1/2, so σ²_W = 2/n) as special cases. The backward pass has a symmetric condition on the variance of gradients, and Xavier-Glorot's original proposal uses the harmonic mean of the two, σ²_W = 2/(n_in + n_out); He's proposal drops the harmonic mean in favour of the forward-only condition, σ²_W = 2/n_in, which is better matched to ReLU's asymmetric non-linearity.
The three classical initialisations are close cousins that differ only in their constant. Xavier (Glorot): W ∼ 𝒩(0, 2/(n_in + n_out)), designed for tanh/sigmoid. He (Kaiming): W ∼ 𝒩(0, 2/n_in), designed for ReLU. LeCun: W ∼ 𝒩(0, 1/n_in), designed for tanh or SELU-based self-normalising networks (Klambauer et al. 2017). PyTorch exposes all three: torch.nn.init.xavier_normal_, torch.nn.init.kaiming_normal_, and torch.nn.init.normal_ with explicit standard deviation. The practical rule: match the initialisation to the activation function — ReLU → He, tanh → Xavier, SELU → LeCun. Using He on a tanh network will not catastrophically fail, but it will make training a few percent slower and less reliable.
Andrew Saxe, James McClelland, and Surya Ganguli's 2014 Exact solutions to the nonlinear dynamics of learning in deep linear neural networks (ICLR) argued that the ideal initialisation for a deep linear network preserves not just the magnitude of signals but their geometry — distances, angles, and rank. An orthogonal matrix W satisfies WᵀW = I, and so it preserves norms and inner products exactly; initialising each weight matrix as a random orthogonal matrix (scaled appropriately for the activation) makes the forward and backward maps exact isometries at initialisation, regardless of depth. In practice orthogonal initialisation is a small improvement over Gaussian He/Xavier for feedforward networks, a larger improvement for recurrent networks (where the same matrix is applied over and over and scale imbalances compound), and the standard for the recurrent connectivity of LSTMs. PyTorch exposes it as torch.nn.init.orthogonal_.
The initialisation problem becomes qualitatively harder for very deep networks — those with dozens or hundreds of residual layers — because the signal-propagation condition must hold not just at initialisation but throughout training. Hongyi Zhang, Yann Dauphin, and Tengyu Ma's 2019 Fixup Initialization showed that it is possible to train 110-layer residual networks without batch normalisation if every residual branch is initialised to have output norm scaled by L⁻¹ᐟ², where L is the number of residual blocks. The intuition: if each of the L residual branches is scaled by L⁻¹ᐟ², each contributes variance O(1/L), so the total variance after L additions is O(1), not O(L). Subsequent work — De and Smith 2020 Batch Normalization Biases Residual Blocks Towards the Identity Function, Brock et al. 2021 Characterizing signal propagation to close the performance gap in unnormalised ResNets — refined the idea and showed that normalisation-free deep networks can match the accuracy of their batch-normalised counterparts if their initialisations are chosen carefully enough. For transformer pretraining, the analogous insight is that the residual branches should be initialised with variance O(1/L) to prevent the accumulated depth from amplifying signals; GPT-2's original training recipe initialised the output projections of every residual block as W ∼ 𝒩(0, σ²) with σ = 0.02/√(2L), the factor 2L counting the attention and MLP residual layers in L blocks.
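The O(1)-versus-O(L) variance argument can be checked directly. The sketch below (NumPy; the width and block count are arbitrary, and the random linear branch is a stand-in for a real residual branch, not Fixup itself) stacks L residual additions: unscaled unit-gain branches roughly double the mean-square activation at every block, while the L⁻¹ᐟ² scaling keeps it near (1 + 1/L)ᴸ ≈ e.

```python
import numpy as np

rng = np.random.default_rng(1)
L, n = 64, 256
x = rng.standard_normal(n)

def final_mean_square(branch_scale):
    """Mean-square of the hidden state after L residual additions,
    each branch a fresh random unit-gain linear map times branch_scale."""
    h = x.copy()
    for _ in range(L):
        W = rng.standard_normal((n, n)) * np.sqrt(1.0 / n)   # unit-gain branch
        h = h + branch_scale * (W @ h)
    return float((h ** 2).mean())

grown = final_mean_square(1.0)          # compounds roughly like 2^L
tamed = final_mean_square(L ** -0.5)    # stays O(1): (1 + 1/L)^L ~ e
```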
Biases are usually initialised to zero — there is no good reason for the network to start with a preferred offset before it has seen any data. The one exception is the bias of a gated unit (an LSTM forget gate, a highway-network gate) where a small positive initialisation (b = 1 or b = 2) biases the gate open at the start of training and is empirically known to help. The other exception is the final-layer bias of a classifier on a heavily imbalanced dataset, which can be initialised to the log-odds of the class prevalence (b = log(π/(1−π))) so that the network's initial predictions match the marginal distribution of labels and do not have to learn that trivial baseline in its early epochs.
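The log-odds initialisation is one line of arithmetic. The sketch below assumes a binary classifier with a sigmoid output; the function name is illustrative.

```python
import math

def imbalanced_logit_bias(pos_fraction):
    """Final-layer bias b = log(pi / (1 - pi)) so that sigmoid(b)
    equals the positive-class prevalence pi."""
    return math.log(pos_fraction / (1.0 - pos_fraction))

b = imbalanced_logit_bias(0.01)      # 1% positives -> b around -4.6
```

With this bias the network's initial mean prediction already matches the 1% base rate, instead of the 50% that a zero bias implies.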
Batch normalisation, introduced by Sergey Ioffe and Christian Szegedy in 2015, is arguably the single most impactful architectural innovation in deep learning between the original 2012 AlexNet and the 2017 transformer. It made much deeper networks trainable, dramatically reduced sensitivity to initialisation and learning rate, and added a form of implicit regularisation. A modern deep-learning practitioner needs to understand not just how to use it, but what it is doing — and the surprisingly deep literature on why.
Batch normalisation normalises the activations of a layer across the mini-batch dimension. For a layer output x with batch dimension B and feature dimension D, the BN transform computes per-feature mini-batch statistics μ_d = (1/B) ∑ᵢ xᵢ_d and σ²_d = (1/B) ∑ᵢ (xᵢ_d − μ_d)², normalises each feature as x̂ᵢ_d = (xᵢ_d − μ_d) / √(σ²_d + ε), and then applies a learnable affine transformation yᵢ_d = γ_d · x̂ᵢ_d + β_d. The affine parameters γ, β allow the network to recover the identity function if that is the optimal transformation, so BN does not reduce expressive capacity. At inference time, the mini-batch statistics are replaced by running averages accumulated during training.
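The transform is a few lines of array code. The sketch below implements only the training-time forward pass (no running averages, no backward), with γ = 1 and β = 0 so the output is purely standardised.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (B, D). Normalise each feature over the batch dimension,
    then apply the learnable affine transform."""
    mu = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # standardise
    return gamma * x_hat + beta

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=5.0, size=(128, 16))   # shifted, scaled input
y = batch_norm_forward(x, gamma=np.ones(16), beta=np.zeros(16))
```

At inference time the per-batch `mu` and `var` would be replaced by the running averages accumulated during training, which is exactly the train/test asymmetry discussed below.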
Ioffe and Szegedy's 2015 Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift argued that BN stabilises training by reducing "internal covariate shift" — the phenomenon in which earlier layers change their output distributions during training, forcing later layers to continuously adapt. Santurkar et al.'s 2018 How Does Batch Normalization Help Optimization? (NeurIPS) challenged this account with direct measurements showing that BN does not reduce internal covariate shift by any natural metric. Their alternative account: BN makes the loss landscape smoother — the gradients obey a smaller effective Lipschitz constant — so gradient descent can take larger steps without diverging, which translates into higher learning rates and faster convergence. The revised story is more consistent with the empirical fact that BN-equipped networks tolerate learning rates 10× higher than their un-normalised counterparts, but the full theoretical picture is still not settled.
The original ResNet paper placed BN before the ReLU nonlinearity (Conv → BN → ReLU), which is now the standard for ConvNets. Some architectures place it after (Conv → ReLU → BN), and the difference is subtle but real. BN-before-activation keeps the linear transformation's statistics Gaussian and applies a nonlinearity to a standardised input, which works well for ReLU. BN-after-activation standardises the post-nonlinearity, which for ReLU means the distribution is strictly non-negative and no longer zero-mean after normalisation — a waste of BN's centering capability. The pre-activation ResNet (He et al. 2016, Identity Mappings in Deep Residual Networks) takes the before-convention to its extreme by moving BN-and-ReLU to the start of each residual branch, which turned out to ease training of networks up to 1000 layers deep.
Batch normalisation has several well-known failure modes. First, it performs poorly with small batches — the per-feature mini-batch statistics are high-variance estimates of the true distribution when B is small, which destabilises training. The rule of thumb is that BN needs B ≥ 16 to be well-behaved; for detection, segmentation, and video tasks where memory constraints force B = 1 or 2, BN's accuracy degrades sharply and group-norm or layer-norm should be substituted. Second, BN's train-time and test-time behaviours are different (mini-batch vs. running-average statistics), which produces a train-test skew that is a common source of bugs — forgetting to call .eval() before validation, or tracking running averages with an insufficient warmup, can produce test-time numbers that do not match training. Third, BN interacts poorly with sequence models (RNNs, transformers) where the batch dimension is not the natural one to normalise over; this is why transformers use layer-norm instead.
A side effect of BN's noisy mini-batch statistics is that it adds stochasticity to the training forward pass — each example sees slightly different statistics depending on which other examples it is batched with. This has been shown, by Luo et al. 2019 and others, to act as an implicit regulariser similar to (but weaker than) dropout. The empirical consequence: BN-equipped networks can often be trained with weaker explicit regularisation (smaller weight decay, less dropout) than their un-normalised counterparts, because BN is doing part of the regularisation job automatically. Conversely, when switching from BN to a deterministic normalisation (like layer-norm or no normalisation at all), you may need to add more explicit regularisation to compensate.
Batch normalisation normalises across the batch axis. A family of alternatives — layer norm, group norm, instance norm, RMS norm — normalise along different axes of the activation tensor, and each is the right choice for some class of architectures. The practitioner who wants to train anything other than a medium-batch-size ConvNet should know all four.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey Hinton's 2016 Layer Normalization (arXiv) normalises across the feature dimension of a single example, not across the batch. For a layer output x ∈ ℝᴮˣᴰ, LN computes per-example statistics μᵢ = (1/D) ∑_d xᵢ_d and σ²ᵢ = (1/D) ∑_d (xᵢ_d − μᵢ)², then normalises each example independently. Because the statistics are computed per-example, LN is insensitive to batch size — it works identically for B = 1 and B = 1024, and its train-time and test-time behaviours are the same. LN is the standard normalisation for transformer architectures: every attention block and every MLP block in every major language model since GPT-1 has used LN, or its RMS variant, as its normalisation of choice. It is also the standard for RNNs, for the same batch-independence reason.
Yuxin Wu and Kaiming He's 2018 Group Normalization (ECCV) normalises within groups of channels of a single example. For a convolutional layer output x ∈ ℝᴮˣᶜˣᴴˣᵂ, GN partitions the C channels into G groups (typically G = 32) and normalises within each group across the channel-and-spatial dimensions of each example. GN is the workaround for BN's small-batch failure mode — it gives near-BN accuracy at B = 1 or 2, which makes it the default for object detection and segmentation architectures where memory constraints force tiny per-worker batches. The special case G = C (one group per channel, normalise per-channel) recovers instance normalisation, and G = 1 (one group total, normalise across all channels) recovers layer normalisation applied to a 4D tensor.
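The group/instance/layer relationship is concrete enough to verify in code. The sketch below implements GN over a (B, C, H, W) tensor without the learnable affine; setting G = 1 reproduces layer-norm statistics and G = C reproduces instance-norm statistics.

```python
import numpy as np

def group_norm(x, G, eps=1e-5):
    """x: (B, C, H, W). Normalise each example within G channel groups,
    over the channels of the group and the spatial dimensions."""
    B, C, H, W = x.shape
    g = x.reshape(B, G, C // G, H, W)
    mu = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mu) / np.sqrt(var + eps)).reshape(B, C, H, W)

rng = np.random.default_rng(3)
x = rng.standard_normal((2, 8, 4, 4)) * 3.0 + 1.0
ln_like = group_norm(x, G=1)    # one group total: layer norm over (C, H, W)
in_like = group_norm(x, G=8)    # one group per channel: instance norm
```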
Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky's 2016 Instance Normalization: The Missing Ingredient for Fast Stylization normalises per-channel, per-example — the statistics are computed over the spatial dimensions H × W for each (i, c) pair independently. IN is the standard normalisation for style-transfer and image-to-image translation architectures, where removing per-instance contrast and brightness biases is desirable. It is almost never used for classification or detection because it strips away the cross-channel signal that those tasks need to carry.
Biao Zhang and Rico Sennrich's 2019 Root Mean Square Layer Normalization (NeurIPS) simplifies layer-norm by dropping the mean-subtraction: RMS-Norm computes only the mean of squared activations, ρ²ᵢ = (1/D) ∑_d xᵢ²_d, and normalises as yᵢ_d = γ_d · xᵢ_d / √(ρ²ᵢ + ε). The arithmetic is cheaper (one reduction instead of two, no subtraction), and empirically RMS-Norm matches LN's accuracy on most tasks. LLaMA, Mistral, and most open-source language models since 2023 use RMS-Norm for its marginal speedup. The trade-off: RMS-Norm is less numerically stable when activations are close to zero, so it requires slightly more careful initialisation and slightly larger ε.
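RMS-Norm is short enough to state in full; the sketch below omits nothing but the framework plumbing.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """x: (B, D). Divide each example by the RMS of its features
    (no mean subtraction), then apply the learnable gain gamma."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

rng = np.random.default_rng(4)
x = rng.standard_normal((8, 64)) * 2.5
y = rms_norm(x, gamma=np.ones(64))    # per-example mean square is now ~1
```

Compared with layer norm there is one reduction instead of two and no subtraction, which is the source of its marginal speedup.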
A cross-cutting design choice: in a transformer residual block of the form x ← x + f(x), should the normalisation be applied before f (x + f(LN(x)), "pre-norm") or after (LN(x + f(x)), "post-norm")? The original transformer used post-norm. Every major language model since GPT-2 (including GPT-3, PaLM, LLaMA, Claude) uses pre-norm, because pre-norm makes deeper networks trainable without warmup tricks. The theoretical reason (Xiong et al. 2020 On Layer Normalization in the Transformer Architecture): post-norm creates a path from input to output that amplifies gradients exponentially in depth, while pre-norm provides a clean residual path whose gradient norms remain bounded. Post-norm transformers can be trained but typically require elaborate warmup-and-schedule recipes; pre-norm transformers are much more forgiving.
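The structural difference is visible even in a toy stack. The sketch below (NumPy; a random ReLU branch stands in for a real attention/MLP sub-layer, and all sizes are arbitrary) applies 24 pre-norm and 24 post-norm blocks to the same input: the post-norm stream is re-standardised at every block, while the pre-norm residual stream keeps the original input signal alive.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def pre_norm_block(x, f):
    return x + f(layer_norm(x))       # skip path is a clean identity

def post_norm_block(x, f):
    return layer_norm(x + f(x))       # LN sits on the residual path itself

rng = np.random.default_rng(5)
W = rng.standard_normal((32, 32)) * np.sqrt(1.0 / 32)
f = lambda h: np.maximum(h @ W, 0.0)  # stand-in residual branch
x = rng.standard_normal((4, 32)) * 20.0

h_pre, h_post = x, x
for _ in range(24):
    h_pre = pre_norm_block(h_pre, f)
    h_post = post_norm_block(h_post, f)

# how much of the original input direction survives in the pre-norm stream
cos_with_input = (h_pre * x).sum(-1) / (
    np.linalg.norm(h_pre, axis=-1) * np.linalg.norm(x, axis=-1))
```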
Before residual connections, networks beyond 20 or 30 layers deep were essentially untrainable — they suffered from degradation, where adding more layers increased both training and test error. Kaiming He et al.'s 2015 ResNet paper changed this overnight by adding a single arithmetic operation — the skip connection — that made 152-layer networks easy to train, and subsequently enabled transformers of any depth. Residuals are not just a training trick; they are a structural prior on what a deep network is.
He, Zhang, Ren, and Sun's 2015 Deep Residual Learning for Image Recognition (CVPR 2016) introduced the residual block: instead of computing y = f(x), compute y = f(x) + x. The function f is a small stack of layers — in the original ResNet, two or three Conv → BN → ReLU blocks — and the identity x is added through a "skip connection" that bypasses them. The critical observation: at initialisation, if f starts near zero (which good initialisation schemes arrange), the block starts as the identity transformation, and the network can deepen without degrading. Subsequent training shapes f to add useful residual signals on top of the identity baseline. He et al.'s paper showed that this simple change enabled training of networks up to 152 layers on ImageNet, with accuracy improving monotonically as depth increased — a result that had never been achieved before.
The gradient view: in a residual network, the gradient at layer ℓ is the gradient at layer ℓ+1 plus the gradient through the residual branch's f. The skip path provides a direct route for gradients to flow from deep layers to shallow ones without being attenuated by the chain of multiplicative Jacobians. This is the original motivation — residuals solve the vanishing-gradient problem by construction. The ensemble view: Veit, Wilber, and Belongie's 2016 Residual Networks Behave Like Ensembles of Relatively Shallow Networks argued that a depth-L residual network is effectively an ensemble of 2^L paths through the network, most of which have only modest depth. Pruning individual layers of a residual network, they showed, has much less impact than pruning layers of a plain network — consistent with the ensemble interpretation. The ODE view: a residual network of the form x_{ℓ+1} = x_ℓ + ε · f(x_ℓ) is a forward-Euler discretisation of an ODE dx/dt = f(x(t)). Neural ODEs (Chen et al. 2018) take this view seriously and parameterise f directly as a continuous-time dynamical system; even without that extension, the ODE view explains why very deep residual networks train smoothly — they are discretisations of a well-posed continuous process.
The original ResNet placed the nonlinearity after the residual addition: y = ReLU(f(x) + x). He et al.'s 2016 follow-up Identity Mappings in Deep Residual Networks (ECCV) proposed the pre-activation variant: y = f(x) + x, where the nonlinearity is inside f and the residual path has no nonlinearity at all. Pre-activation produces a fully linear "identity highway" from the input to the output, making signal propagation cleaner and enabling the training of 1001-layer networks on CIFAR. Almost all modern residual architectures — ResNeXt, Wide ResNet, DenseNet, EfficientNet, ConvNeXt — use pre-activation. Transformer residual blocks, by analogy, use a pre-norm structure where the layer normalisation is inside the residual branch and the skip connection is a clean identity.
Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber's 2015 Training Very Deep Networks (NeurIPS) introduced highway networks a few months before ResNet, with a structurally similar idea: y = T(x) · f(x) + (1 − T(x)) · x, where T(x) is a learned "transform gate" that interpolates between the identity and the transformed input. Highway networks preceded ResNets and achieved similar depth-scaling benefits, but ResNets' simpler additive form — no gate to learn, no extra parameters — turned out to be a better fit for most architectures. Highway-style gates echo the earlier gating of LSTMs (forget gate) and GRUs (reset and update gates), which Srivastava et al. cite as their inspiration; the connection is that all three are instances of a broader idea — let the network choose what to pass through unchanged versus what to transform — and residual connections are the minimum-parameter version of that idea.
Every transformer block — every attention block, every MLP block, every one of them — is wrapped in a residual connection. The transformer architecture is fundamentally a sequence of residual refinements to a stream of hidden states: each attention layer reads from the stream and adds back a transformed version; each MLP layer does the same. This pattern explains a great deal about how transformers behave. Features computed in early layers persist in the residual stream and remain available to every subsequent layer. Interpretability research (Elhage et al. 2021, A Mathematical Framework for Transformer Circuits) treats the residual stream as a common communication bus onto which each layer reads and writes. The depth of modern transformers — 96 layers for GPT-3, over 200 for the largest models — is tractable only because residual connections hold signals stable across that depth.
Even a well-initialised, well-normalised network can produce an occasional gradient of enormous magnitude — a single corrupt batch, a pathological example near a cliff in the loss surface, or a numerical overflow — that destroys the running parameters if applied in full. Gradient clipping is the simple, cheap, and nearly universal safeguard: if the gradient norm exceeds a threshold, rescale it before applying the update.
In a deep network, the gradient of the loss with respect to early-layer parameters is a product of L Jacobians. If the spectral norm of each Jacobian exceeds 1, the product grows exponentially in depth, and the gradient at the first layer can become numerically infinite. Residual connections and careful initialisation make this rare but not impossible — a single batch in which most examples happen to produce large Jacobian norms, or a learning-rate spike, can push the network into a regime where a single step destroys all previous progress. In recurrent networks the problem is more acute because the same recurrence is applied at every time step, and gradient explosion is one of the standard failure modes of RNN training that Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio catalogued in their 2013 On the difficulty of training recurrent neural networks paper.
The standard technique, introduced in Pascanu et al. 2013, is global-norm clipping. Compute the overall gradient norm ‖g‖ = √(∑ᵢ ‖gᵢ‖²) across all parameter tensors, and if it exceeds a threshold τ, rescale all gradients by τ / ‖g‖. This preserves the direction of the gradient — the network still updates in the same direction it would have without clipping — but caps its magnitude. Typical values of τ are 1.0 for transformer training and 0.5 or 0.1 for more aggressive RL or scientific-computing workloads. PyTorch exposes this as torch.nn.utils.clip_grad_norm_, and it is called once per training step after the backward pass and before the optimiser step.
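The operation itself is a few lines. The sketch below mirrors the semantics of clip_grad_norm_ over plain arrays: rescale everything by τ/‖g‖ when the threshold is exceeded, and return the pre-clip norm.

```python
import numpy as np

def clip_global_norm(grads, max_norm):
    """If the combined L2 norm of all gradient tensors exceeds max_norm,
    rescale every tensor in place by max_norm / total_norm (direction is
    preserved). Returns the pre-clip norm, as clip_grad_norm_ does."""
    total = float(np.sqrt(sum((g ** 2).sum() for g in grads)))
    if total > max_norm:
        scale = max_norm / total
        for g in grads:
            g *= scale
    return total

grads = [np.full(3, 4.0), np.full(4, 3.0)]    # global norm = sqrt(84)
pre = clip_global_norm(grads, max_norm=1.0)
```

After the call the combined norm is exactly the threshold, and the ratio between any two gradient components is unchanged.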
An alternative (rarely preferred) technique clips each individual gradient element to a range [−c, +c]. Value clipping distorts the gradient direction — a large gradient component is clipped while small ones are not, which can drastically change the update vector — and is generally worse than norm clipping in practice. PyTorch exposes it as torch.nn.utils.clip_grad_value_, but it is rarely used in modern pipelines.
Brock et al.'s 2021 High-Performance Large-Scale Image Recognition Without Normalization (NF-Nets) introduced adaptive gradient clipping: instead of a fixed threshold, clip each parameter tensor's gradient to a multiple of the parameter tensor's own norm. The update rule is: if ‖gₗ‖ / ‖θₗ‖ > λ, rescale gₗ so that ‖gₗ‖ = λ · ‖θₗ‖. This is a layer-wise, scale-aware version of clipping that handles networks with dramatically different parameter norms across layers. NF-Nets use adaptive clipping with λ = 0.01 as part of their normalisation-free training recipe and report that it is essential to stability at very deep depths.
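A simplified sketch of the rule, applied per parameter tensor (the NF-Nets paper actually applies it unit-wise, per output row; the per-tensor version is shown here for brevity):

```python
import numpy as np

def adaptive_grad_clip(params, grads, lam=0.01, eps=1e-3):
    """Per-tensor sketch of adaptive gradient clipping: cap each
    gradient norm at lam times the corresponding parameter norm
    (with an eps floor so freshly-zeroed tensors still clip sanely)."""
    for theta, g in zip(params, grads):
        p_norm = max(float(np.linalg.norm(theta)), eps)
        g_norm = float(np.linalg.norm(g))
        if g_norm > lam * p_norm:
            g *= lam * p_norm / g_norm      # in-place rescale

params = [np.ones(100)]                 # ||theta|| = 10
grads = [np.full(100, 0.5)]             # ||g|| = 5, far above 0.01 * 10
adaptive_grad_clip(params, grads)
```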
Beyond its stabilising role, gradient clipping is a useful diagnostic tool. Log the fraction of training steps on which clipping actually fires (the norm exceeds the threshold), and the distribution of gradient norms. If clipping fires on more than 10% of steps, the threshold is too tight — you are losing information about the gradient direction on a substantial fraction of updates, and the learning rate should probably be lowered. If it fires on less than 0.1%, the threshold may be unnecessary — though keeping it as a safety net costs nothing. A sudden spike in clip-firing frequency — from a steady 1% to 50% over a few hundred steps — is a reliable early warning of training instability, visible many thousands of iterations before the loss itself diverges.
Batch size and learning rate are the two most consequential hyperparameters in deep-learning training, and they are tightly coupled — changing one without adjusting the other will typically produce a worse result than changing neither. Understanding their interaction is the difference between a modest compute budget well-spent and a large compute budget wasted.
Priya Goyal et al.'s 2017 Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour crystallised a heuristic that had been floating around the distributed-training community: when scaling the batch size from B to kB, scale the learning rate from η to kη. The intuition is that the variance of the mini-batch gradient estimator scales as 1/B, and the noise-to-signal ratio of an SGD update step is roughly η/B. Keeping η/B constant keeps the noise statistics constant, and the SGD trajectory (viewed as a stochastic differential equation) remains approximately the same. Goyal et al.'s paper validated the rule by training ResNet-50 on ImageNet at batch size 8192 with learning rate scaled accordingly, matching the baseline accuracy at batch size 256. The rule is a good starting point, but it breaks down at extreme scalings: Goyal et al. themselves saw accuracy degrade beyond batch 8192, and past that point other heuristics (square-root scaling for adaptive optimisers, LAMB/LARS layer-wise rules) are needed.
For Adam and AdamW, the linear scaling rule overshoots — the empirical evidence (Smith, Kindermans, Ying, and Le 2018 Don't Decay the Learning Rate, Increase the Batch Size; Krizhevsky 2014 One weird trick for parallelizing convolutional neural networks) suggests that the optimal learning rate for adaptive optimisers scales as √B, not B. The practical recommendation: when scaling batch size for an adaptive optimiser, multiply the learning rate by √k, not k. This reflects the fact that Adam's per-parameter step size is already adaptively normalised, so the naive noise-scale argument does not quite apply.
Sam McCandlish et al.'s 2018 An Empirical Model of Large-Batch Training (OpenAI) introduced the concept of a critical batch size — the batch size beyond which increasing the batch further gives diminishing returns in sample efficiency, even with optimal learning-rate scaling. Below the critical batch size, doubling the batch produces roughly a 2× wall-clock speedup (with linear-scaling learning rate) with no loss of final accuracy. Above it, the speedup curve flattens — a 4× batch increase gives only a 1.5× time saving, and a 16× batch increase gives almost none. The critical batch size varies with model scale and dataset: roughly 2048 for ResNet-50 on ImageNet, 4M tokens for GPT-3 scale language models. Training at the critical batch size is the sweet spot — smaller wastes sequential efficiency, larger wastes compute.
Keskar et al.'s 2017 On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima showed that naively scaling batch size beyond a few thousand examples often hurts generalisation — the network converges to a training loss comparable to small-batch training but tests worse. The original explanation (flat vs. sharp minima) has been revised (Dinh et al. 2017 showed that "flatness" is ill-defined under reparameterisation), but the empirical phenomenon is robust: very-large-batch training requires additional techniques — LARS/LAMB layer-wise optimisers, longer warmup, extended training schedules — to close the generalisation gap. The lesson: batch size is not just a throughput knob; it changes the loss surface the optimiser is solving, and very-large batches require compensating algorithmic changes.
For a practitioner choosing a batch size: start with the largest batch that fits in GPU memory, set the learning rate according to the linear (SGD) or square-root (Adam) scaling rule relative to a known-good configuration, and verify that the learning-rate schedule still makes sense (warmup length and total steps scale inversely with batch size). If training is stable and converges, you are in the good-batch-size regime. If training becomes unstable with large batch, either add warmup, switch to LAMB/LARS, or back off to a smaller batch. The critical-batch-size concept is a useful reality check — if you are far below it, you are leaving wall-clock speedup on the table; if you are far above it, you are wasting compute.
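The two scaling rules condense into a small helper; the function name and signature are illustrative, and the base configuration is assumed to be a known-good one.

```python
def scaled_lr(base_lr, base_batch, new_batch, optimiser="sgd"):
    """Linear scaling (Goyal et al.) for SGD; square-root scaling
    for Adam-family optimisers."""
    k = new_batch / base_batch
    return base_lr * (k if optimiser == "sgd" else k ** 0.5)

# e.g. a known-good SGD recipe at batch 256 with lr 0.1, scaled to batch 1024
sgd_lr = scaled_lr(0.1, 256, 1024)                      # linear: 4x
adam_lr = scaled_lr(3e-4, 256, 1024, optimiser="adamw") # sqrt: 2x
```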
By default a deep network is trained in float32 — 32-bit floating-point numbers for all activations, weights, and gradients. Modern GPUs compute float16 and bfloat16 arithmetic two to four times faster than float32, and the corresponding tensors take half the memory. Mixed-precision training harnesses that speedup while preserving the numerical properties that gradient-based optimisation relies on. Getting it right is a now-standard piece of engineering that almost every modern training pipeline depends on.
The two 16-bit floating-point formats have the same total width but different precision / range trade-offs. fp16 (IEEE 754 half-precision) has a 10-bit mantissa and 5-bit exponent — ample precision but a narrow dynamic range of roughly 6 × 10⁻⁸ to 6.5 × 10⁴. bfloat16 (Google's brain-float, now standard on NVIDIA Ampere and later) has a 7-bit mantissa and 8-bit exponent — less precision but the same dynamic range as fp32 (10⁻³⁸ to 10³⁸). For neural-network training, dynamic range is what matters: gradients can span 20 orders of magnitude across layers, and a format that cannot represent very small or very large values without underflow or overflow will produce silent numerical failures. bfloat16's wide range makes it essentially drop-in compatible with fp32 for training, while fp16 requires careful loss-scaling techniques to avoid underflow. bfloat16 is the modern default on hardware that supports it; fp16 with dynamic loss scaling remains common on older GPUs.
Paulius Micikevicius et al.'s 2018 Mixed Precision Training (ICLR) laid out the standard mixed-precision recipe, which is now the template for every framework's automatic-mixed-precision system. First, keep a master copy of the weights in fp32 — optimiser state (Adam's first and second moments, momentum buffers) stays in fp32 as well. Second, at the start of each forward pass, cast the fp32 weights to fp16 for the forward computation; activations are fp16 throughout. Third, compute the loss and gradients in fp16, then upcast the gradients to fp32 for the optimiser update. Fourth, if using fp16, multiply the loss by a large scalar (loss scaling) before computing gradients, then divide it out before the update — this shifts small gradients into the fp16 representable range without affecting fp32 updates. Automatic loss scaling (AMP in PyTorch, mixed_precision in TensorFlow) adjusts the scale dynamically based on whether gradient overflow has been observed.
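The underflow problem that loss scaling fixes can be demonstrated directly with NumPy's float16 type: a gradient that arises as the product of two small half-precision factors vanishes, and reappears once the computation is shifted up by a scale factor and divided back out in full precision. The specific magnitudes below are chosen to illustrate the effect.

```python
import numpy as np

# A gradient term that is the product of two small fp16 factors.
a = np.float16(1e-4)
b = np.float16(2e-4)

naive = a * b          # true value 2e-8 sits below fp16's smallest
                       # subnormal (~6e-8) and rounds to zero

scale = np.float16(1024.0)
scaled = (a * scale) * b            # shifted into the representable range
recovered = float(scaled) / 1024.0  # unscale in full precision
```

The unscaled product is exactly zero, while the scaled-then-unscaled value recovers 2e-8 to within fp16 rounding error — which is precisely what the loss scaler does to every gradient in the backward pass.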
Mixed precision is mostly seamless, but there are several gotchas. Reductions (sum, mean, variance) done in fp16 can underflow or lose precision — batch-norm statistics, for instance, should be computed in fp32 even in a mixed-precision pipeline, which most frameworks arrange automatically. Softmax output can saturate — if any logit is very negative, exp of it underflows to zero in fp16 — so softmax is typically kept in fp32 or computed with a log-sum-exp stabilisation. Large matrix multiplications can accumulate roundoff; NVIDIA's TensorCores internally use fp32 accumulation even when inputs are fp16, which largely fixes this. Gradient accumulation across mini-batches should be done in fp32 — accumulating fp16 gradients can lose precision if the sum becomes much larger than individual contributions.
PyTorch's torch.cuda.amp context manager (now part of torch.amp) and TensorFlow's tf.keras.mixed_precision handle the entire mixed-precision recipe automatically. The user wraps the forward pass in with autocast():, the framework decides per-operator which precision to use (matmul and conv in fp16, reductions and norms in fp32), and a loss scaler object manages dynamic loss-scaling in the background. On modern hardware — an A100, H100, or similar Ampere-or-later GPU — AMP typically gives a 2–3× wall-clock speedup over fp32 training with no loss of final accuracy, and it is a mandatory component of any serious training pipeline.
As of 2026, the frontier has moved to fp8 training. NVIDIA's Hopper GPUs support fp8 arithmetic with two variants (e4m3 for activations and weights, e5m2 for gradients), and the Transformer Engine library provides a mixed-fp8-fp16 recipe that gives another 2× speedup over bf16. The numerical subtlety is substantial — fp8 has only 3 or 4 bits of mantissa, so per-tensor scaling factors must be recomputed frequently — and fp8 training is not yet drop-in the way bf16 is. For now, bf16 remains the mainstream default; fp8 is an emerging option for the largest training runs where the speedup is worth the engineering complexity.
Modern deep-learning models are too large to train on a single GPU — a transformer with 100 billion parameters needs roughly 1.6 TB of memory for its fp16 weights, fp16 gradients, and fp32 optimiser state (about 16 bytes per parameter under mixed-precision Adam), which is far beyond any single accelerator. Distributed training is the collection of techniques that spread the work across many GPUs (or many nodes of many GPUs) while preserving, as much as possible, the semantics of a single-node training loop. Chapter 05 of Part III surveyed the distributed-systems primitives; this section focuses on how those primitives specifically support model training.
The simplest distributed-training scheme is data parallelism. Each worker holds a complete copy of the model, the mini-batch is split into per-worker shards, each worker computes gradients on its shard, and the gradients are averaged (all-reduced) across workers before the optimiser step. Data parallelism scales linearly in compute until the batch size reaches its critical value (§11), at which point further parallelism produces diminishing returns in sample efficiency. PyTorch's DistributedDataParallel and TensorFlow's MirroredStrategy are the standard implementations. The all-reduce operation is typically implemented as a ring-all-reduce or tree-all-reduce over a high-speed interconnect (NVLink within a node, InfiniBand or RoCE between nodes); its latency is the dominant communication cost at moderate scale.
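The key semantic fact — averaging per-shard gradients reproduces the full-batch gradient exactly when shards are equal-sized — is easy to verify on a linear least-squares model. In the sketch below, the np.mean over per-worker results stands in for the all-reduce.

```python
import numpy as np

rng = np.random.default_rng(6)
W = rng.standard_normal((10, 1))             # linear model weights
X = rng.standard_normal((32, 10))            # one mini-batch of 32 examples
y = rng.standard_normal((32, 1))

def mse_grad(Xs, ys):
    """Gradient of 0.5 * mean((Xs @ W - ys)^2) with respect to W."""
    return Xs.T @ (Xs @ W - ys) / len(Xs)

# single-device gradient on the full mini-batch
full = mse_grad(X, y)

# 4 "workers": shard the batch, compute local grads, all-reduce (mean) them
shards = np.split(np.arange(32), 4)
local = [mse_grad(X[idx], y[idx]) for idx in shards]
allreduced = np.mean(local, axis=0)          # identical to the full-batch grad
```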
Data parallelism replicates the full model on every worker, which is wasteful when the model is large: each worker's copy of the fp32 weights, fp16 weights, gradients, and optimiser state occupies identical memory. Samyam Rajbhandari et al.'s 2020 ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (DeepSpeed) introduced a family of stage-by-stage optimisations: ZeRO-1 partitions the optimiser state across workers, ZeRO-2 additionally partitions the gradients, ZeRO-3 additionally partitions the weights themselves. At ZeRO-3, the memory footprint per worker is 1/N of the full model's, where N is the number of workers — at the cost of an all-gather communication each time weights need to be fully materialised for a forward or backward computation. PyTorch's FullyShardedDataParallel (FSDP) is a native implementation of the same idea. FSDP is now standard for training models that exceed the per-GPU memory capacity of data parallelism.
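A back-of-the-envelope calculator makes the stage-by-stage savings concrete, following the ZeRO paper's accounting of 2 + 2 + 12 bytes per parameter for fp16 weights, fp16 gradients, and fp32 Adam state (master weights, momentum, variance). The helper name is illustrative.

```python
def zero_bytes_per_gpu(n_params, n_gpus, stage):
    """Per-GPU memory (bytes) for mixed-precision Adam under ZeRO
    stages 0-3, using the paper's 2 + 2 + 12 bytes-per-parameter model."""
    weights = 2 * n_params      # fp16 weights
    grads = 2 * n_params        # fp16 gradients
    optim = 12 * n_params       # fp32 master weights + Adam moments
    if stage >= 1:
        optim /= n_gpus         # ZeRO-1: shard optimiser state
    if stage >= 2:
        grads /= n_gpus         # ZeRO-2: shard gradients too
    if stage >= 3:
        weights /= n_gpus       # ZeRO-3: shard the weights themselves
    return weights + grads + optim

per_gpu_plain = zero_bytes_per_gpu(1e9, 64, stage=0)   # 16 GB for 1B params
per_gpu_zero3 = zero_bytes_per_gpu(1e9, 64, stage=3)   # 1/64 of that
```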
For models whose individual layers are too large to fit on one GPU — even after ZeRO-3 partitioning — tensor parallelism splits each weight matrix across workers. Mohammad Shoeybi et al.'s 2019 Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism showed how to parallelise a transformer layer with minimal communication: split the MLP's first weight matrix column-wise (so the activation stays local after the first matmul) and the second row-wise (so a single all-reduce at the end recovers the full output). Attention is parallelised similarly, with heads distributed across workers. Tensor parallelism incurs communication on every forward and backward pass, so it is typically confined to a single compute node, where fast intra-node interconnects like NVLink keep that cost acceptable.
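The column-then-row split can be illustrated with plain Python lists. This is a toy sketch with a hidden size of 2 split across two hypothetical workers; the elementwise nonlinearity between the two matmuls is omitted for brevity (it would apply locally to each worker's activation slice, requiring no communication).

```python
# Toy Megatron-style tensor parallelism: split the first weight matrix
# column-wise (each worker keeps its slice of the activation locally) and
# the second row-wise (partial outputs are combined by one all-reduce sum).

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def add(A, B):  # stand-in for the all-reduce sum across the two workers
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(A, B)]

x  = [[1.0, 2.0]]                                       # one token, d = 2
W1 = [[1.0, 0.0, 2.0, -1.0], [0.0, 1.0, 1.0, 3.0]]      # 2 x 4
W2 = [[1.0, 1.0], [2.0, 0.0], [0.0, 1.0], [1.0, 2.0]]   # 4 x 2

# worker A holds the first two columns of W1 and the matching rows of W2;
# worker B holds the rest
W1_a = [row[:2] for row in W1]; W1_b = [row[2:] for row in W1]
W2_a = W2[:2];                  W2_b = W2[2:]

h_a = matmul(x, W1_a)                     # local on worker A, no comm
h_b = matmul(x, W1_b)                     # local on worker B, no comm
y = add(matmul(h_a, W2_a), matmul(h_b, W2_b))  # the single all-reduce

assert y == matmul(matmul(x, W1), W2)     # identical to the serial result
```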
Pipeline parallelism splits the model's layers across workers — worker 1 has layers 1–8, worker 2 has layers 9–16, and so on — so that each worker stores only its share of the parameters. Naive pipeline parallelism has terrible throughput: only one worker computes at a time, and the rest wait. GPipe (Huang et al. 2018) and PipeDream (Harlap et al. 2018) introduced micro-batching — split each mini-batch into smaller micro-batches and pipeline them so that multiple workers compute simultaneously on different micro-batches. The 1F1B (one-forward-one-backward) schedule, used in modern implementations, interleaves forward and backward passes to minimise pipeline bubbles. Pipeline parallelism is the standard way to scale past a single node, typically composed with tensor parallelism (within-node) and data parallelism (across replicas).
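Under the usual idealised analysis of a GPipe-style schedule, the pipeline "bubble" — the fraction of time each stage sits idle — is (p − 1)/(m + p − 1) for p stages and m micro-batches, which is why raising the micro-batch count recovers throughput:

```python
# Idealised GPipe bubble arithmetic: with p stages and m micro-batches, each
# stage is idle for (p - 1) micro-batch slots out of (m + p - 1) total, so
# the idle fraction shrinks as the micro-batch count grows.

def bubble_fraction(p_stages, m_microbatches):
    return (p_stages - 1) / (m_microbatches + p_stages - 1)

for m in (1, 4, 16, 64):
    print(f"8 stages, {m:2d} micro-batches: "
          f"{bubble_fraction(8, m):.0%} of time idle")
```

With a single micro-batch the 8-stage pipeline is idle 7/8 of the time — the "terrible throughput" of the naive scheme — while 64 micro-batches bring the bubble below 10%.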
For the largest models — GPT-3 at 175B parameters and every successor — training requires the composition of all three forms of parallelism, often called 3D parallelism: tensor parallelism within a node (typically 8-way, matching the number of GPUs per node), pipeline parallelism across nodes (typically 4-way to 32-way, defining the depth of the pipeline), and data parallelism on top of both (typically tens to hundreds of pipeline replicas, each fed its own shard of the mini-batch). NVIDIA's Megatron-LM framework and Microsoft's DeepSpeed are the reference implementations; Meta's FairScale, Hugging Face's Accelerate, and the PyTorch-native torch.distributed stack expose variants of the same ideas.
The cross-cutting engineering concern is overlapping communication with computation. While one layer's gradients are being all-reduced across workers, the next layer's backward pass should already be in flight. Every serious distributed-training framework pipelines these operations aggressively — bucketing small gradients into a single larger all-reduce to amortise latency, prefetching weights before they are needed in ZeRO-3, and scheduling pipeline stages to minimise bubbles. The effective fraction of theoretical peak FLOPS achieved in a real training run (called model FLOPS utilisation, or MFU) is the engineering figure of merit; values of 50–60% are considered excellent at the scale of GPT-3.
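MFU can be estimated from observed token throughput using the standard approximation of roughly 6 FLOPs per parameter per token for transformer training (forward plus backward). A minimal sketch, with all the numbers below hypothetical:

```python
# Estimating model FLOPS utilisation (MFU) from observed throughput, using
# the ~6 FLOPs per parameter per token rule of thumb for transformer
# training. All inputs here are hypothetical illustration values.

def mfu(n_params, tokens_per_sec, n_gpus, peak_flops_per_gpu):
    achieved = 6 * n_params * tokens_per_sec      # achieved training FLOPS
    return achieved / (n_gpus * peak_flops_per_gpu)

# e.g. a hypothetical 13B model at 130k tokens/s on 64 GPUs rated 312 TFLOPS
u = mfu(13e9, 1.3e5, 64, 312e12)
print(f"MFU = {u:.1%}")
```

Note this counts only the "model FLOPs" implied by the parameter count; activation recomputation raises hardware utilisation without raising MFU, which is exactly why MFU is the preferred figure of merit.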
For everyday multi-GPU training, the practical advice is simple: use DistributedDataParallel or its FSDP variant and trust the framework. For the frontier-scale training that needs 3D parallelism, rely on DeepSpeed, Megatron-LM, or a descendant framework with years of production hardening.
Two engineering techniques — gradient accumulation and activation checkpointing — are the workhorses of memory-constrained training. Neither changes the final model in any way; they trade compute time for memory, allowing training to happen on hardware that would otherwise be too small. Both are essential tools when the per-GPU memory is the binding constraint.
Gradient accumulation simulates a large effective batch size on hardware that can only fit a small per-step batch. Instead of computing gradients on a batch of size B and stepping the optimiser, accumulate gradients over k consecutive mini-batches of size B/k each, then step the optimiser once with the averaged gradients. The arithmetic is identical to a single step of batch size B, provided the forward pass does not depend on batch statistics that would differ (batch normalisation in particular, which is a common source of subtle bugs in accumulated-batch training — the BN statistics are computed at the per-micro-batch level, not the effective-batch level). The cost is wall-clock time: k forward-backward passes per optimiser step instead of one. The benefit is the ability to train at a given effective batch size on hardware that cannot fit it as a single mini-batch, which is often the difference between "can train this model" and "cannot". PyTorch's DDP supports gradient accumulation through its no_sync() context manager, which skips the per-backward all-reduce on intermediate micro-batches and synchronises only on the final one before the optimiser step.
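The "arithmetic is identical" claim can be verified directly. A pure-Python sketch with a toy one-parameter least-squares model (standing in for a real network with no batch-statistics layers): scale each micro-batch gradient by 1/k, accumulate, and the step matches a single full-batch step.

```python
# Gradient-accumulation sketch: sum k micro-batch gradients, each scaled by
# 1/k (mirroring loss = loss / k in a real training loop), then step once.
# The result matches one step on the full batch exactly.

def grad_mse(w, batch):
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(0.5, 1.1), (1.0, 2.2), (1.5, 2.9), (2.0, 4.1),
        (2.5, 5.2), (3.0, 5.8), (3.5, 7.1), (4.0, 8.3)]
w, lr, k = 0.3, 0.01, 4

# accumulate over k micro-batches of 2 examples each, then step once
accum = 0.0
for i in range(k):
    micro = data[2 * i: 2 * i + 2]
    accum += grad_mse(w, micro) / k
w_accum = w - lr * accum

# reference: one optimiser step on the full batch of 8
w_full = w - lr * grad_mse(w, data)
print(abs(w_accum - w_full) < 1e-12)  # True
```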
In a standard forward-backward pass, all intermediate activations are retained in memory because they are needed by the backward pass. For a deep network with a large activation footprint, this is often the largest single contributor to peak memory — for a transformer, activations scale as O(L · B · S · d) where L is depth, B is batch size, S is sequence length, d is hidden dimension, and even a single transformer block can occupy gigabytes. Activation checkpointing (Chen et al. 2016, Training Deep Nets with Sublinear Memory Cost) drops a subset of activations during the forward pass and recomputes them during the backward pass. The optimal strategy, for a depth-L network, keeps only √L of the activations and recomputes the rest — a √L× reduction in memory for a ~1.33× increase in compute (one extra forward pass per segment). PyTorch exposes this as torch.utils.checkpoint.checkpoint, which wraps a submodule in the recompute protocol; most modern training frameworks apply it automatically to transformer blocks at scale.
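The recompute protocol can be shown in miniature without a framework. The sketch below backpropagates through a chain of L tanh layers while storing only every s-th activation (s ≈ √L, and L is assumed divisible by s), recomputing each segment's activations from its checkpoint during the backward pass:

```python
import math

# Checkpointed backprop through a chain of L tanh layers: keep only every
# s-th activation during the forward pass, recompute the rest segment by
# segment during the backward pass. Assumes L % s == 0 for simplicity.

def forward_checkpointed(x, L, s):
    saved = {0: x}                    # checkpointed segment inputs
    a = x
    for i in range(1, L + 1):
        a = math.tanh(a)
        if i % s == 0:
            saved[i] = a
    return a, saved

def backward_checkpointed(saved, L, s):
    grad = 1.0                        # d(output)/d(output)
    for seg_end in range(L, 0, -s):
        seg_start = seg_end - s
        # recompute this segment's activations from its checkpoint
        acts = [saved[seg_start]]
        for _ in range(s):
            acts.append(math.tanh(acts[-1]))
        # backprop through the segment: d tanh(u)/du = 1 - tanh(u)^2
        for a_out in reversed(acts[1:]):
            grad *= 1.0 - a_out ** 2
    return grad

def backward_plain(x, L):             # reference: retain every activation
    acts = [x]
    for _ in range(L):
        acts.append(math.tanh(acts[-1]))
    grad = 1.0
    for a_out in reversed(acts[1:]):
        grad *= 1.0 - a_out ** 2
    return grad

L, s = 16, 4                          # s = sqrt(L) checkpoints
_, saved = forward_checkpointed(0.7, L, s)
assert len(saved) == L // s + 1       # only 5 activations kept, not 17
assert abs(backward_checkpointed(saved, L, s) - backward_plain(0.7, L)) < 1e-12
```

In PyTorch the same trade is a one-liner — wrap the segment in torch.utils.checkpoint.checkpoint — but the mechanics are exactly these: cheap forward storage, one extra forward per segment at backward time.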
For the largest models, even ZeRO-3 with activation checkpointing is insufficient, and one must offload tensors to CPU (or NVMe) memory. DeepSpeed's ZeRO-Offload and ZeRO-Infinity offload optimiser state, gradients, and even parameters to CPU memory, prefetching them back to GPU memory just before they are needed. The wall-clock cost is substantial — PCIe bandwidth to host memory is an order of magnitude below GPU-to-GPU bandwidth — but the memory capacity gained is enormous, and for some training configurations offloading is the only path to a working training run. Like 3D parallelism, offloading is a library concern: rely on DeepSpeed or FSDP's offload modes rather than implementing it by hand.
Beyond these general-purpose memory tools, a recent generation of operator-specific optimisations has reshaped what is possible at the edge of memory. Tri Dao et al.'s 2022 FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (NeurIPS) rewrote transformer attention to avoid materialising the O(S²) attention matrix in GPU memory — instead computing attention block by block using a tiled-softmax approach and writing only the final output to memory. FlashAttention is exact, not approximate, and gives 2–4× speedup and 5–20× memory reduction on long sequences. It is now a standard component of transformer training. Fused kernels for common operator sequences (layernorm-plus-residual, softmax-plus-cross-entropy, adamw-update) similarly reduce memory traffic at the cost of some flexibility. The lesson: memory-conscious implementations of common operations matter as much as memory-conscious high-level algorithms.
Most of a practitioner's time on a training project is actually spent watching training curves, debugging why they look strange, and iterating on configurations. A small number of diagnostic signals — gradient norms, activation statistics, weight-update ratios, learning-rate schedule alignment — capture most of what can go wrong. A practitioner who logs the right things will catch problems within minutes; one who logs only the loss will be mystified for days.
The first and most important diagnostic is the loss curve itself. Plot training and validation loss on a log-scale y-axis, linear-scale x-axis; a well-behaved training run shows a smooth monotonic decrease that slows near the end. Spikes in the loss are almost always bad — a loss spike is evidence of a numerical failure, a gradient explosion that clipping did not catch, or a rare-example issue in the data. A training–validation gap that is too small suggests underfitting; one that is too large suggests overfitting. The appropriate gap depends on the task and model, but it should be consciously targeted rather than accepted by default. The shape of the loss curve also reveals the learning-rate schedule: warmup phases produce an initial flat or slowly-decreasing segment, cosine decay produces a visibly slowing tail, and constant learning rates produce a noisier trajectory than well-scheduled ones.
Log the global gradient norm at every step (or every few steps, to reduce overhead). A healthy gradient-norm trajectory starts at some moderate value, decreases as training progresses, and remains in a bounded range — typically 0.1 to 10 for a transformer in steady state. Sudden spikes indicate training instability, slow upward drift indicates that the learning rate is too high for the current loss-surface region, and drift downward to near-zero indicates stalled training. Log also the gradient norm per parameter tensor: if one layer's gradient norm is dramatically out of line with the rest, it usually means that layer is experiencing different dynamics (initialisation bug, dead ReLU, exploding activations) and needs investigation.
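The global norm is just the L2 norm of the concatenation of all per-tensor gradients — the same reduction torch.nn.utils.clip_grad_norm_ performs internally. A pure-Python sketch, with plain lists standing in for gradient tensors and the layer names hypothetical:

```python
import math

# Gradient-norm logging sketch: the global L2 norm every step, plus
# per-tensor norms occasionally so an out-of-line layer is easy to spot.

def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))

def global_grad_norm(named_grads):
    # L2 norm of the concatenation of all per-tensor gradients
    return math.sqrt(sum(l2_norm(g) ** 2 for g in named_grads.values()))

grads = {
    "embed.weight":  [0.02, -0.01, 0.03],
    "layer1.weight": [0.30, -0.20, 0.10],
    "layer9.weight": [40.0, -35.0, 20.0],   # suspiciously large: investigate
}
for name, g in grads.items():
    print(f"{name:14s} |g| = {l2_norm(g):9.4f}")
print(f"global grad norm = {global_grad_norm(grads):.4f}")
```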
The mean and standard deviation of activations at representative layers is a powerful diagnostic. A healthy network keeps activations roughly mean-zero, unit-variance (within a factor of 10) across depth; drift outside this range over training indicates a failure of normalisation or a cascading instability. Per-layer histograms are especially informative for ReLU networks — a layer whose ReLU outputs are almost entirely zero (well beyond the roughly 50% expected for mean-zero pre-activations) is exhibiting "dying ReLU" symptoms, and a layer whose pre-activations have shifted so that almost all outputs are non-zero has effectively become linear. Tools like TensorBoard, Weights & Biases, and the built-in PyTorch profiler make collecting and visualising these statistics cheap.
A useful per-layer metric is the ratio of the update magnitude to the weight magnitude: ‖Δθₗ‖ / ‖θₗ‖. For a well-tuned training run, this ratio lives in a range around 10⁻³ to 10⁻⁴ — the weights change by roughly 0.1% per step, which is slow enough to preserve stability but fast enough to make progress. Ratios above 10⁻² indicate the learning rate is too high (or the gradient is too large) for that layer; ratios below 10⁻⁵ indicate the layer has effectively stopped learning. Per-layer ratio imbalances — a two-orders-of-magnitude range across layers — are a signal that a layer-wise adaptive optimiser (LAMB, LARS) or a per-layer learning-rate multiplier is warranted.
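For plain SGD the update is lr · grad, so the ratio can be estimated without keeping an extra copy of the weights. A minimal sketch, with hypothetical layer names and values and the healthy band drawn roughly where the text places it:

```python
import math

# Update-to-weight-ratio diagnostic: lr * ||g|| / ||w|| per tensor, flagging
# layers whose ratio falls outside a rough healthy band.

def norm(vs):
    return math.sqrt(sum(v * v for v in vs))

def update_ratio(weights, grads, lr):
    return lr * norm(grads) / norm(weights)

layers = {
    "layer1.weight": ([0.8, -0.5, 0.3], [0.40, -0.10, 0.20]),
    "layer2.weight": ([0.6, 0.7, -0.2], [0.0004, 0.0001, -0.0002]),  # stalled?
}
for name, (w, g) in layers.items():
    r = update_ratio(w, g, lr=1e-2)
    flag = "" if 1e-4 <= r <= 1e-2 else "  <-- out of range"
    print(f"{name}: ratio = {r:.1e}{flag}")
```

For Adam-family optimisers the actual update is not lr · grad, so in practice one logs ‖Δθ‖ directly by differencing the weights across a step.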
Compute validation loss and per-task metrics at a regular cadence (every epoch, or every few thousand iterations for long training runs). Plot validation and training curves on the same axes to track the generalisation gap over time. Early stopping — terminating training when validation loss has not improved for some patience period — is the simplest and often most effective regulariser. In modern pretraining, where the entire training budget is a single epoch through a very large dataset, explicit early stopping is less common, but tracking validation across checkpoint intervals allows the best-checkpoint selection that modern pretraining pipelines depend on.
Separate from the training-health diagnostics above, profiling diagnostics measure wall-clock efficiency. Is the GPU actually busy? What fraction of time is spent on communication, memory transfers, or Python overhead? PyTorch's torch.profiler and NVIDIA's Nsight Systems expose this detail; the rule of thumb is that a well-tuned training run achieves >50% GPU utilisation, and anything below 30% usually indicates a fixable bottleneck. Common culprits: a CPU-bound data loader (fixed with more worker processes), a synchronous logging step inside the inner loop (fixed by buffering), or a suboptimal batch size (too small, leaving GPU compute idle).
Deep-learning training runs are expensive — thousands of GPU-hours per experiment is common, millions of GPU-hours for frontier models — and a result that cannot be reproduced is a result that the practitioner does not really understand. Reproducibility is not a luxury; it is the discipline that makes research cumulative and engineering dependable.
A deep-learning training run draws from several sources of randomness: weight initialisation, mini-batch order, data augmentation, dropout masks, and occasionally the GPU operations themselves (some CUDA kernels, such as atomic-add-based reductions, are non-deterministic). To make a run reproducible, each of these must be controlled. Set the random seeds for every library used: Python's random, NumPy's np.random, PyTorch's torch.manual_seed, and (if using CUDA) torch.cuda.manual_seed_all. Use deterministic data-loader settings: num_workers=0 or a seeded generator for worker initialisation. For strict reproducibility of CUDA kernels, set torch.use_deterministic_algorithms(True) and the CUBLAS_WORKSPACE_CONFIG environment variable — this disables a few fast non-deterministic kernels in exchange for bit-identical results across runs.
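A seed-everything helper is the conventional way to bundle these calls. The sketch below uses only the stdlib directly and guards the NumPy and PyTorch calls so it runs even where those libraries are absent; the demonstration at the bottom checks that identical seeds reproduce an identical "mini-batch order".

```python
import os
import random

# Minimal seed-everything helper. The NumPy/PyTorch calls are the standard
# ones named above, guarded so this sketch runs without them installed.

def seed_everything(seed: int) -> None:
    os.environ["PYTHONHASHSEED"] = str(seed)  # effective for subprocesses
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)      # all visible GPUs
    except ImportError:
        pass

# identical seeds must reproduce an identical shuffle order
seed_everything(1234)
order_a = random.sample(range(10), 10)
seed_everything(1234)
order_b = random.sample(range(10), 10)
assert order_a == order_b
```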
A reproducible experiment needs three things versioned: code, data, and environment. Code means a clean git commit hash that captures the training script, model definition, and configuration — a training run should always log the commit it was launched from, ideally as part of the run name. Data means a content hash of the training data and a record of the exact preprocessing pipeline applied to it; DVC (Data Version Control) and Git-LFS are the standard tools, though at pretraining scale a custom data-manifest system is usually required. Environment means a record of the Python version, the CUDA version, the PyTorch version, and the exact set of installed Python packages (a pip freeze dump or a Conda lock file). Docker containers capture all three in a single image and are the most robust form of environment reproducibility.
Logging which hyperparameters produced which validation metrics, across hundreds or thousands of experiments, is essential for any research or production ML workflow. The standard tools are Weights & Biases, MLflow, TensorBoard, Neptune, and CometML; all let the practitioner log hyperparameters, metrics, model checkpoints, and diagnostic plots from their training script with a few extra lines of code. Pick one, adopt it consistently across a team, and let every experiment — successful or failed — be recorded automatically. The downstream benefits (cross-experiment comparison, hyperparameter search, post-hoc investigation of why one run diverged) more than justify the integration cost.
Most training runs live within a hyperparameter search, and the search methodology matters more than it seems. Grid search is simple but scales poorly; random search (Bergstra and Bengio 2012, Random Search for Hyperparameter Optimization) uses the same budget more efficiently. Bayesian optimisation (Gaussian-process-based methods, Tree-structured Parzen Estimator) performs better still when each trial is expensive; Optuna and Ray Tune are the standard frameworks. For deep-learning hyperparameters, where the interaction between learning rate, batch size, warmup, and weight decay is highly non-trivial, a structured two-stage search — coarse random sweep, then fine-grained search near the best region — is usually the right balance between efficiency and robustness.
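A coarse random sweep is a dozen lines of code. In the sketch below the learning rate is sampled log-uniformly (it spans orders of magnitude) and dropout uniformly; fake_val_loss is a hypothetical stand-in for the validation loss a real trial would produce after a full training run.

```python
import math
import random

# Random-search sketch: sample configurations, score each, keep the best.
# fake_val_loss is a hypothetical response surface standing in for a real
# training run's validation loss (minimised near lr = 1e-3, dropout = 0.1).

def sample_config(rng):
    return {
        "lr": 10 ** rng.uniform(-5, -1),   # log-uniform over [1e-5, 1e-1]
        "dropout": rng.uniform(0.0, 0.5),
    }

def fake_val_loss(cfg):
    return (math.log10(cfg["lr"]) + 3.0) ** 2 + (cfg["dropout"] - 0.1) ** 2

def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    trials = [(lambda c: (fake_val_loss(c), c))(sample_config(rng))
              for _ in range(n_trials)]
    return min(trials, key=lambda t: t[0])

best_loss, best_cfg = random_search(100)
print(f"best loss {best_loss:.3f} at lr={best_cfg['lr']:.2e}, "
      f"dropout={best_cfg['dropout']:.2f}")
```

The two-stage recipe from the text corresponds to calling this once over the wide ranges, then again with the ranges narrowed around best_cfg.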
When reporting a result, report the full configuration, not just the result. This means the random seed (or seeds, if averaging over multiple runs, which is always preferred), the exact learning-rate schedule, the batch size and any gradient accumulation, the optimiser and its hyperparameters, the initialisation scheme, the normalisation layers, the loss function, the regularisation settings, the dataset version, and the evaluation protocol. Papers that report only a single metric value without a full configuration are unreproducible by design; practitioners who document their configurations (internal engineering runbooks, model cards, paper appendices) make their work usable by others and by their future selves.
The modern deep-learning engineer must allocate a fixed compute budget across model size, training data, and training duration. Scaling laws — empirical power-law relationships between these quantities and final model performance — make this allocation a quantitative exercise rather than guesswork. The scaling-law literature, which matured over 2020–2022, is some of the most consequential deep-learning work of the decade.
Jared Kaplan et al.'s 2020 Scaling Laws for Neural Language Models (OpenAI) fitted power laws across several orders of magnitude of model size, dataset size, and compute. Their headline claim: test loss scales as a power law in each of the three variables independently, and jointly in model and data size as L(N, D) ≈ (N_c/N)^α_N + (D_c/D)^α_D. The exponents they fit were α_N ≈ 0.076 and α_D ≈ 0.095 — small but non-trivial, meaning that a 10× increase in model size buys only about a 16% reduction in loss (a factor of 10^0.076 ≈ 1.19). Kaplan's most influential conclusion was that at a fixed compute budget, the optimal allocation is to train a large model for relatively few tokens — a recommendation that shaped the design of models like GPT-3.
Jordan Hoffmann et al.'s 2022 Training Compute-Optimal Large Language Models (DeepMind, the "Chinchilla paper") revisited Kaplan's analysis with a much wider range of model sizes and a more careful experimental design. They found that Kaplan's analysis had under-estimated the benefit of training longer — the optimal compute-allocation ratio is approximately 20 tokens per parameter, not the 1–2 tokens per parameter Kaplan's laws implied. The practical consequence is dramatic: a 70B-parameter model trained on 1.4T tokens (Chinchilla's configuration) outperforms a 280B-parameter model trained on 300B tokens (Gopher's configuration) despite using roughly the same compute. The Chinchilla recommendation — scale data and parameters roughly together, at a 20:1 ratio of tokens to parameters — has guided most subsequent large-model training, though more recent work (LLaMA, Mistral) has moved to even higher token-to-parameter ratios to produce models cheap to serve at inference time.
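The Chinchilla allocation falls out of two lines of algebra under the standard cost model C ≈ 6·N·D (6 FLOPs per parameter per token): substituting the D = 20·N rule of thumb gives N* = √(C/120). A sketch:

```python
import math

# Chinchilla-style compute-optimal allocation under the C = 6*N*D cost
# model and a fixed tokens-per-parameter ratio r: D = r*N implies
# C = 6*r*N^2, so N* = sqrt(C / (6*r)) and D* = r*N*.

def compute_optimal(compute_flops, tokens_per_param=20.0):
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# roughly Chinchilla's own budget: ~5.76e23 FLOPs -> ~70B params, ~1.4T tokens
n, d = compute_optimal(5.76e23)
print(f"N* = {n / 1e9:.0f}B parameters, D* = {d / 1e12:.1f}T tokens")
```

Raising tokens_per_param to 300, as in the inference-aware regimes discussed below, shifts the same budget toward a much smaller model trained much longer.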
A separate strand of the scaling literature studies emergent capabilities — tasks on which a model shows near-random performance at small scale, then rapidly rises to high performance past some threshold. Jason Wei et al.'s 2022 Emergent Abilities of Large Language Models catalogued dozens of such step-function capability emergences in language models from 1B to 500B parameters. Subsequent critique (Schaeffer, Miranda, and Koyejo 2023 Are Emergent Abilities of Large Language Models a Mirage?) argued that many apparent emergences are artefacts of non-linear evaluation metrics and disappear under smoother metrics. The debate is still unresolved, but the practical relevance is clear: capabilities do not always improve smoothly with scale, and there is a range of model sizes at which a given benchmark is essentially useless for driving progress.
Similar scaling laws have been fit for vision models (Zhai et al. 2022 Scaling Vision Transformers), multimodal models (Aghajanyan et al. 2023 Scaling Laws for Generative Mixed-Modal Language Models), and speech recognition. The exponents differ across domains — vision tends to have smaller α_N, reflecting the lower intrinsic complexity of image classification compared to language modelling — but the qualitative picture is consistent. Power-law improvement with compute is a robust feature of deep-learning model families across domains, which is the best reason yet to believe that further compute scaling will continue to improve models in the near future.
The compute-optimal analysis assumes the practitioner cares about training compute alone, but inference compute often dominates over the model's lifetime — a deployed language model might serve billions of tokens per day, each a forward pass that pays inference cost. This shifts the Chinchilla calculus: if inference matters, spend the compute budget on training a smaller model for more tokens (LLaMA 2's 7B model, trained on 2T tokens — roughly 300 tokens per parameter, well above Chinchilla's 20:1) to get a smaller-but-still-capable model that is cheap to serve. Mistral, Gemma, Phi, and most recent open-source language-model families are designed this way. The general lesson: the compute-optimal allocation depends on whether you are budgeting training, inference, or the sum of both over the model's deployment lifetime.
Having covered the individual techniques that make deep-network training work, we close the chapter by placing them in their larger context — where the training loop sits in the ML lifecycle, how it connects to upstream data engineering and downstream deployment, and what the next several chapters of Part V will build on this foundation.
A production ML system has a data pipeline (Part III Chapter 03) that assembles training examples, an experimentation environment in which model variants are trained and compared, an evaluation harness that tests candidates against a fixed benchmark, a deployment system that ships winning models to production, and a monitoring layer that watches for drift. Training sits in the middle of this stack: it consumes prepared training data, produces a checkpoint artifact, and is tracked by the experiment system. The techniques in this chapter — schedulers, optimisers, normalisation, clipping, distributed training — are best understood not as isolated tricks but as engineering levers inside that broader pipeline, each justified by the effect it has on either training-time throughput, final model quality, or the operational feasibility of the rest of the stack.
Much of the training discipline of Part V has direct precedent in classical ML: learning-rate schedules in classical convex optimisation (Nesterov 1983), mini-batch SGD in large-scale linear models (Bottou 2010), regularisation via weight decay (Hoerl and Kennard 1970 ridge regression), cross-validation and hyperparameter search as treated in Chapter 09 of Part IV. What is distinctively deep-learning is the interaction between these classical ideas and the non-convex, high-dimensional, highly overparameterised regime that neural networks live in. Good deep-learning practice is neither classical ML scaled up nor a completely new discipline; it is the subset of classical techniques that survive the transition to the new regime, combined with a new set of techniques (batch normalisation, residual connections, Adam) that emerged from empirical necessity.
Chapter 03 of Part V (Regularisation and Generalisation) treats the techniques — dropout, weight decay, early stopping, data augmentation, label smoothing, stochastic depth — that complement the training dynamics of this chapter to control the generalisation gap. Chapter 04 (Convolutional Networks) specialises the training loop to grid-structured data. Chapter 05 (Recurrent Networks and Sequence Modelling) extends it to sequential data, with the additional twist that the temporal recurrence compounds the training difficulties this chapter treated. Chapter 06 (Attention and Transformers) is where the training techniques of this chapter come into their own — transformers are unusually sensitive to optimiser choice, normalisation placement, warmup length, and batch-size / learning-rate scaling, and the recipes of this chapter are the foundation on which the transformer revolution was built.
A practitioner coming from classical ML often thinks of training as "run the optimiser until it converges". A deep-learning practitioner has to think of training as a continuous-dial decision: what model size, for how many tokens, at what batch size, with what learning-rate schedule, on what hardware — and the answers interact. The compute budget is the most important constraint, and the scaling-law analysis of §17 is the framework for using it well. The diagnostic signals of §15 are the feedback loop that tells you whether your choices are working. The engineering techniques of §13 and §14 are the tools that keep training feasible at the scale the problem demands.
A final note: deep-learning training, more than almost any other computational discipline, rewards patience and pattern-recognition. Loss curves are read like ECG traces — experienced practitioners can diagnose problems from their shape alone. Training runs are launched and nursed over days or weeks. Debugging a failed run requires a working knowledge of linear algebra, numerical analysis, distributed systems, and the particular quirks of whichever framework is in use. These skills are built by experience; the chapters of Part V can explain the techniques but cannot substitute for actually running training pipelines, watching them succeed and fail, and building the intuition that makes the dial-setting decisions good ones. Alternative sources of experience — open-source training log posts, reproducing published results from scratch, participating in Kaggle competitions — are valuable complements to reading.
Training deep networks is as much an engineering discipline as an algorithmic one, and the literature reflects that. The list below begins with anchor textbooks that cover the whole training pipeline coherently, then lists the specific historical papers — optimisers, normalisation, residual connections, scaling laws — whose ideas appear named throughout the chapter, the modern extensions that have defined best practice in 2025–2026, and the software frameworks that make every technique on the list runnable. The tiered structure is designed so a practitioner can start at the top and get a full working picture of Chapter 02's material in a few weeks of focused reading, and return to specific primary sources when a detail needs to be looked up.
The torch.amp and tf.keras.mixed_precision APIs implement the recipe described here essentially unchanged.

PyTorch's torch.optim, torch.amp, torch.utils.checkpoint, DistributedDataParallel, and FullyShardedDataParallel are the standard interfaces to the techniques of Part V. The official tutorials cover every major topic, and the source code is readable for anyone who wants to understand what a given optimiser actually does.

This page is Chapter 02 of Part V: Deep Learning Foundations. Chapter 01 introduced the mathematical primitives of a trained neural network — forward pass, loss, backward pass, gradient step. Chapter 02 has treated training as an engineering discipline in its own right: the optimisers that make convergence robust, the schedules that allocate learning rate over time, the normalisations that stabilise deep signal propagation, the residual connections that make depth tractable, the distributed-training and memory techniques that make the whole enterprise fit on the available hardware, the diagnostics that let a practitioner catch problems early, and the scaling laws that let compute budgets be allocated quantitatively. Chapter 03 turns to regularisation and generalisation — the dropout, weight decay, early stopping, data augmentation, label smoothing, and stochastic depth techniques that keep overparameterised networks from simply memorising their training sets. Chapters 04 through 06 then specialise the training loop to convolutional, recurrent, and attention-based architectures. The training techniques of Chapter 02 are foundational to every one of them; the chapters that follow assume and extend this one's material.