Chapter 01 showed how a neural network is defined and trained; Chapter 02 showed how that training is made to converge at modern scale. Both chapters took the training loss as the objective to minimise. This chapter is about the much harder and more interesting question: why does the resulting model work on data it has never seen? A deep network with a few hundred million parameters has enough capacity to memorise the labels of the entire ImageNet training set; as Chiyuan Zhang and colleagues showed in their celebrated 2017 paper Understanding Deep Learning Requires Rethinking Generalization, the same architecture that achieves 75% top-1 test accuracy on natural ImageNet can be driven to zero training loss on a copy of ImageNet where every label has been replaced by a uniformly random one. Capacity is not the constraint; something else must be doing the work. That something else is regularisation — the sum of inductive biases, explicit penalties, stochastic training perturbations, architectural choices, data manipulations, and optimisation-trajectory quirks that together cause stochastic gradient descent on an overparameterised network to produce a function that generalises rather than one that memorises. This chapter is the catalogue of those techniques, from the classical textbook penalties (L1, L2, early stopping) through the techniques invented for deep networks (dropout, batch-norm-as-regulariser, stochastic depth) to the data-side methods (augmentation, mixup, cutmix) and the modern theoretical frame (double descent, implicit bias, flat-minima hypothesis) that places the whole field in context. This is the chapter that makes deep learning work on the test set, and understanding it is the difference between a practitioner who can fit a training set and one who can ship a model.
Sections one and two set the theoretical stage. Section one is the motivating question — why, given an overparameterised model that can memorise any training set, does deep learning usually generalise? We review the empirical evidence, the generalisation gap as a measured quantity, and the canonical shapes of the learning curve. Section two revisits the bias–variance decomposition for deep networks, where the classical U-shaped curve has been replaced by the more interesting double descent (Belkin, Hsu, Ma, Mandal 2019; Nakkiran et al. 2020) and the older intuition that "bigger models overfit more" has turned out to be almost entirely wrong in the overparameterised regime.
Sections three through six are the classical regularisation toolbox. Section three is weight decay — L2 regularisation, its Bayesian-MAP interpretation, the crucial distinction between coupled and decoupled weight decay (Loshchilov–Hutter 2019), and why it behaves differently in deep networks than in linear regression. Section four is L1 and sparsity-inducing regularisation — Lasso (Tibshirani 1996), group Lasso, structured sparsity, and the limited role sparsity plays in modern deep learning. Section five is dropout (Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov 2014), the defining deep-learning regulariser of the 2010s: its definition, its interpretation as an approximate ensemble, its variants (DropConnect, spatial dropout, variational dropout), and the reasons it has partly been displaced by alternatives in modern architectures. Section six is data augmentation in its basic form — random crops, flips, colour jitter for images; synonym replacement, back-translation for text — and the argument that training on an expanded dataset is often the single most effective form of regularisation.
Sections seven through twelve are the modern deep-learning regularisation kit. Section seven is advanced augmentation — mixup (Zhang, Cissé, Dauphin, Lopez-Paz 2018), CutMix (Yun et al. 2019), Cutout (DeVries, Taylor 2017), AutoAugment (Cubuk et al. 2019), RandAugment, TrivialAugment — the policy-learning and combinatorial methods that push augmentation far beyond hand-designed crops. Section eight covers label smoothing (Szegedy et al. 2016), the deceptively simple trick of training against soft targets, and its connections to calibration and knowledge distillation. Section nine is early stopping — the oldest form of regularisation (Prechelt 1998, 2012; Caruana, Lawrence, Giles 2000), its interaction with modern learning-rate schedules, and the checkpoint-averaging refinement. Section ten explains how batch normalisation acts as a regulariser — the mechanism distinct from its training-speedup role, why it contributes less in the very-large-batch regime, and what this implies for modern LayerNorm-based architectures. Section eleven covers stochastic depth (Huang, Sun, Liu, Sedra, Weinberger 2016), DropConnect (Wan et al. 2013), LayerDrop (Fan, Grave, Joulin 2019), and the broader family of "randomly remove part of the network" regularisers. Section twelve is noise injection — adding noise to inputs, activations, weights, or gradients — and the Jacobian-regularisation and input-perturbation methods that formalise the idea.
Sections thirteen through eighteen are the advanced topics and theoretical frame. Section thirteen is adversarial training (Goodfellow, Shlens, Szegedy 2015; Madry et al. 2018), which trains on worst-case perturbations and doubles as a regulariser with robustness properties. Section fourteen is ensembling — from classical bagging through snapshot ensembles (Huang et al. 2017) to Stochastic Weight Averaging (Izmailov, Podoprikhin, Garipov, Vetrov, Wilson 2018), the techniques that convert multiple training trajectories into one better-generalising model. Section fifteen is the classical generalisation theory — VC dimension, Rademacher complexity, PAC-Bayes, margin theory — and an honest assessment of why it has struggled to explain what deep networks actually do. Section sixteen is double descent — the Belkin–Hsu–Ma–Mandal 2019 picture of the modern generalisation curve, the model-size-, sample-size-, and epoch-wise versions (Nakkiran 2020), and the reason the classical bias–variance story needed updating. Section seventeen is implicit regularisation — the idea, due to Neyshabur and collaborators and sharpened by Keskar, Smith, and others, that SGD itself biases the solution it finds toward functions that generalise, via a preference for flat minima over sharp ones. Section eighteen places regularisation in the broader ML landscape — its relationship to inductive bias, to pretraining and transfer learning, to robustness, and to the argument that "architecture, data, and optimiser together define the regularisation" of a modern system.
Training machine-learning models on a finite sample and hoping they perform well on unseen data is the central problem of the field. The classical theory said the answer is capacity control: limit the expressive power of the model until it fits the signal but not the noise. Deep learning broke this picture. A modern neural network has enough capacity to memorise any training set you hand it, yet the same architecture, trained with the same optimiser, generalises remarkably well on real-world data. Understanding why — and engineering the ingredients that make it so — is the project of this chapter.
We observe a training sample S = {(x₁, y₁), …, (x_N, y_N)} drawn from a data distribution 𝒟 over input–label pairs. We choose a function f_θ that minimises the empirical risk R̂(θ) = (1/N) ∑ᵢ ℓ(f_θ(xᵢ), yᵢ). What we actually care about is the true risk R(θ) = 𝔼_{(x,y)∼𝒟}[ℓ(f_θ(x), y)], the expected loss on a fresh sample. The generalisation gap is the difference R(θ) − R̂(θ) — how much worse the model performs on unseen data than on the training set. The entire project of regularisation is to control this gap without sacrificing too much empirical performance.
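Both quantities can be estimated directly whenever fresh draws from 𝒟 are available. The sketch below does so for an illustrative one-dimensional regression problem; the ground-truth function, noise level, model degree, and sample sizes are arbitrary choices for the demonstration, not anything from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: a linear ground truth with Gaussian label noise.
def true_function(x):
    return 2.0 * x

N = 20
x_train = rng.uniform(-1, 1, N)
y_train = true_function(x_train) + rng.normal(0, 0.3, N)

# Fit a degree-15 polynomial: high capacity relative to N, so it overfits.
theta = np.polyfit(x_train, y_train, deg=15)
f = lambda x: np.polyval(theta, x)
sq_loss = lambda pred, y: (pred - y) ** 2

# Empirical risk R_hat: average loss on the training sample.
R_hat = sq_loss(f(x_train), y_train).mean()

# True risk R: approximated by a large fresh sample from the same distribution.
x_test = rng.uniform(-1, 1, 100_000)
y_test = true_function(x_test) + rng.normal(0, 0.3, 100_000)
R = sq_loss(f(x_test), y_test).mean()

gap = R - R_hat   # the generalisation gap: large and positive here
```

The training loss falls below the noise floor (the model is memorising noise) while the true risk cannot, so the gap is strictly positive for this overfit model.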
The modern story of regularisation begins with Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals's 2017 paper Understanding deep learning requires rethinking generalization (ICLR best-paper award). Their experiment was devastatingly simple. Take a standard image classification dataset — CIFAR-10 or ImageNet. Replace every label with a uniformly random draw from the label set. The mapping from input to output is now pure noise: there is no structure to learn, and the theoretical minimum test error is that of random guessing. Train a standard deep network — Inception, AlexNet, a small MLP — on this random-label training set with exactly the same hyperparameters you would use on real data. What happens? The network converges to zero training loss. It fits the random labels perfectly. Its capacity is more than enough to memorise the entire training set. Yet on the real training labels, the same architecture trained the same way generalises well. The conclusion is inescapable: it is not the architecture's capacity that controls generalisation, because the capacity is clearly sufficient to memorise. Something else — in the training process, in the optimiser, in the loss landscape, in the interaction between data and architecture — must be responsible.
For a well-tuned modern model, the generalisation gap is surprisingly small. A ResNet-50 trained from scratch on ImageNet-1k typically reaches ~99% training accuracy and ~76% top-1 test accuracy — the gap of ~23 percentage points is substantial, but the model is certainly not failing to generalise. A GPT-class language model pretrained on web text typically has training and validation losses within a few percent of each other over most of training; the generalisation gap is essentially negligible. At the other extreme, a network trained on a tiny dataset with a poor regularisation recipe can show gaps of 50 percentage points or more — driving training accuracy to 100% while test accuracy stalls at chance. The practical observable is the gap's trajectory over training: small, stable, and growing slowly is healthy; rapidly opening up in the late stages of training is the signal that the model is starting to memorise.
The techniques that close the generalisation gap fall into several broad categories. Explicit penalties add a term to the loss that discourages large weights (L2, L1, spectral-norm penalties); these are classical and transfer directly from linear regression. Stochastic perturbations inject noise into the training process — dropout randomly zeroes activations, stochastic depth randomly skips layers, noisy SGD has noise in the updates themselves. Data-space methods expand or modify the training set — augmentation creates new examples, mixup interpolates between them, adversarial training creates worst-case perturbed examples. Architectural biases build prior knowledge into the function class — convolution's translation equivariance, transformer attention's permutation structure, residual connections' identity prior. Optimisation-trajectory effects let the dynamics of SGD itself bias the solution — early stopping truncates the trajectory, checkpoint averaging smooths it, the implicit bias of SGD prefers flat minima. No single technique accounts for why deep learning generalises; a modern training recipe uses several simultaneously, and their interactions matter.
Everything in this chapter depends on being able to measure generalisation, and measurement requires data discipline. The standard protocol splits data into three disjoint sets: a training set used for optimisation, a validation set used for hyperparameter selection and early stopping, and a test set used once to report final performance. Violations of this discipline are the single most common source of inflated generalisation claims in the literature: re-using the validation set as a test set after many rounds of hyperparameter tuning; evaluating on a leaderboard enough times that the test set effectively becomes a validation set (the famous CIFAR-10.1 replication study of Recht, Roelofs, Schmidt, Shankar 2019 showed test-set overfitting even on community benchmarks); training on data that overlaps the test set by accident (ImageNet-LSVRC has documented cases of near-duplicates across the train/test boundary). The protocol is simple to state and unforgiving of shortcuts.
The bias–variance decomposition is the oldest formal tool for thinking about generalisation. It was the backbone of classical statistical learning theory and of most textbook treatments through the 2010s. For modern overparameterised deep networks its central predictions turn out to be wrong, or more precisely, wrong in the regime practitioners actually operate in. Understanding where the classical picture applies and where it fails is the conceptual groundwork for everything that follows.
For squared-error loss, the expected test error of a learned function f̂ at a point x decomposes into three terms: 𝔼[(y − f̂(x))²] = Bias(f̂(x))² + Var(f̂(x)) + σ². Here the expectation is taken over the training sample, Bias(f̂(x)) = 𝔼[f̂(x)] − f*(x) measures systematic error (how far the average prediction is from the ground truth), Var(f̂(x)) = 𝔼[(f̂(x) − 𝔼[f̂(x)])²] measures how much predictions fluctuate across different training samples, and σ² is irreducible label noise. The classical narrative: as model capacity grows, bias drops (the model class can better approximate the truth) but variance rises (a high-capacity model fits idiosyncrasies of each training sample). Somewhere in the middle sits the sweet spot that minimises test error. The canonical diagram is the U-shaped test-error curve with the sweet spot at its bottom.
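The decomposition can be verified numerically by refitting the same model class on many independent training samples and comparing the measured test error at a point to the sum of the three terms. Everything below (target function, polynomial degree, sample sizes) is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5                       # label-noise std; sigma**2 is irreducible
f_star = lambda x: np.sin(3 * x)  # ground-truth function (illustrative)
x0 = 0.3                          # evaluate the decomposition at one point

def fit_once(deg=3):
    """Draw a fresh 30-point training sample, fit a polynomial, predict at x0."""
    x = rng.uniform(-1, 1, 30)
    y = f_star(x) + rng.normal(0, sigma, 30)
    return np.polyval(np.polyfit(x, y, deg), x0)

preds = np.array([fit_once() for _ in range(5_000)])

bias_sq = (preds.mean() - f_star(x0)) ** 2   # systematic error, squared
variance = preds.var()                        # fluctuation across samples

# Expected test error at x0: a fresh label y = f*(x0) + noise, with the noise
# independent of the training sample that produced each prediction.
noise = rng.normal(0, sigma, 5_000)
test_err = ((f_star(x0) + noise - preds) ** 2).mean()

# test_err should match bias_sq + variance + sigma**2 up to Monte-Carlo error
```

Repeating the experiment with a higher degree raises the variance term and lowers the bias term, tracing out the classical trade-off by hand.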
For classical model families — polynomial regression, kernel methods with a fixed kernel bandwidth, shallow decision trees — the U-shape is real. On a regression problem with twenty points you can draw it by hand: a degree-one polynomial has high bias and low variance, a degree-twenty polynomial passes through every point but extrapolates wildly, and a degree-five polynomial hits the sweet spot. The Vapnik–Chervonenkis theory of the 1970s formalised this: if a model class has VC dimension d, then test error with high probability is bounded by training error plus a penalty of roughly √(d/N), where N is the sample size. The recipe for avoiding overfitting was clear: cap the VC dimension (or its continuous relatives) at a value small compared to N.
Modern deep networks sit in the overparameterised regime: number of parameters P is orders of magnitude larger than the number of training samples N. ResNet-50 has 25 million parameters and is trained on 1.3 million ImageNet samples — P/N ≈ 20. A 70-billion-parameter LLM is trained on perhaps 10 trillion tokens — P/N ≈ 0.007 at the token level, but P/N in the hundreds or thousands at the batch-gradient level. Classical VC-style bounds in this regime predict test error around 100% — they are vacuous, because √(d/N) ≫ 1. Yet the networks generalise. Something is wrong with the classical picture, or more precisely, the classical picture applies to a regime (P ≪ N) that modern deep learning has left behind.
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal's 2019 paper Reconciling modern machine-learning practice and the classical bias–variance trade-off (PNAS) unified the classical and modern pictures with the double-descent curve. As model capacity grows, test error first follows the classical U-shape, reaching a local minimum at the sweet spot, then climbing to a local maximum at the interpolation threshold where the model has just enough capacity to exactly fit the training data (and generalises terribly there, because the interpolating solution is essentially forced: with no excess capacity, there is no freedom left to choose a well-behaved function among those that fit the data). Beyond the interpolation threshold, however, test error descends again, often far below the classical sweet-spot value. We treat double descent in detail in §16; here the key point is that the classical U-shape tells only half the story. Modern deep networks operate on the right side of the interpolation threshold, where "more capacity" means "better generalisation" — the opposite of the classical prediction.
The most striking failure of the classical picture is its prediction that variance should blow up as P grows: with more parameters to fit the sample, the learned function should depend more and more on which sample you drew. Empirically this does not happen. Nakkiran, Kaplun, Bansal, Yang, Barak, and Sutskever's 2020 Deep double descent paper documented the phenomenon across model sizes, dataset sizes, and training epochs, and direct measurements of the variance term confirm that in the overparameterised regime, variance decreases as P grows beyond the interpolation threshold. The mechanism is the implicit regularisation of gradient descent: among the infinite family of zero-training-loss solutions a highly overparameterised network admits, SGD preferentially finds ones with specific properties (small weight norm, flatness, simplicity measured in Fourier space) that make them robust to sample-level variation. The variance doesn't explode because the optimiser is choosing systematically, not arbitrarily.
The oldest, simplest, and still almost-universal form of explicit regularisation is to penalise the squared norm of the parameter vector. It has half a dozen equivalent names — L2 regularisation, Tikhonov regularisation, ridge regression (in the linear case), weight decay — and two subtly different implementations that behave identically in SGD but diverge sharply in Adam. Understanding the equivalence and the divergence is one of the small but consequential details of a modern training recipe.
Add a term (λ/2)‖θ‖² to the loss: L_reg(θ) = L(θ) + (λ/2)‖θ‖². The hyperparameter λ (weight-decay coefficient) is typically in the range 10⁻⁵ to 10⁻². The gradient becomes ∇L_reg(θ) = ∇L(θ) + λθ. In gradient-descent update form, θ ← θ − η · (∇L(θ) + λθ) = (1 − ηλ)θ − η · ∇L(θ): every step shrinks the parameter toward zero by a multiplicative factor (1 − ηλ) before applying the gradient update. This multiplicative shrinkage is where the name weight decay comes from — weights decay exponentially toward zero in the absence of gradient signal. For vanilla SGD, adding the L2 penalty to the loss and applying weight decay directly to the parameters are algebraically equivalent, and most textbook presentations use the terms interchangeably.
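For vanilla SGD the equivalence can be checked directly. The quadratic loss and the constants below are illustrative, not from the text:

```python
eta, lam = 0.1, 0.01
grad = lambda theta: 2.0 * (theta - 3.0)   # gradient of the toy loss (theta - 3)^2

theta_a = theta_b = 5.0
for _ in range(100):
    # (a) L2 penalty added to the loss: the gradient gains a +lam*theta term
    theta_a = theta_a - eta * (grad(theta_a) + lam * theta_a)
    # (b) weight decay: shrink by (1 - eta*lam), then take the gradient step
    theta_b = (1 - eta * lam) * theta_b - eta * grad(theta_b)

# Both trajectories coincide, and both converge to the regularised optimum
# 6 / 2.01 ≈ 2.985, pulled slightly below the unregularised minimiser 3.0.
```

The fixed point solves 2(θ − 3) + λθ = 0: the penalty shrinks the solution toward zero exactly as the multiplicative-decay reading predicts.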
L2 regularisation has a clean Bayesian interpretation. If we place a Gaussian prior θ ∼ 𝒩(0, σ²_prior · I) on the parameters and compute the maximum a posteriori estimate, the log-posterior is the log-likelihood plus −(1/(2σ²_prior))‖θ‖² — exactly the L2 penalty, with λ = 1/σ²_prior. Large λ corresponds to a tight prior (most of the prior mass concentrated near zero, strong belief that weights should be small); small λ corresponds to a loose prior. This interpretation is pedagogically useful but somewhat misleading in practice: deep networks are not approximately Bayesian in any meaningful sense, and the "prior" over their weights does not correspond to any meaningful belief about the world. A better frame is capacity regularisation: the L2 penalty makes it cheaper for the optimiser to leave a weight small, so weights that are not needed to drive the training loss downward end up small.
For a linear model, L2 regularisation has a clean effect: it shrinks coefficients toward zero in a way that is well-conditioned and has a closed-form solution θ_ridge = (XᵀX + λI)⁻¹ Xᵀy. For a deep network the effect is more complex and less well understood. Zhang et al. 2017 observed that L2 regularisation has only a modest effect on generalisation for standard image classifiers — removing it typically costs a point or two of top-1 accuracy, not the catastrophic loss a classical statistical analysis would predict. The scale invariance of batch-normalised networks (the norm of the pre-BN weights doesn't matter for the function) makes weight decay in a BN-ed network not a capacity constraint but an effective learning-rate modifier: as weights shrink, gradients get amplified, so weight decay in BN nets mostly controls the dynamics rather than the final function. Van Laarhoven's 2017 note L2 regularization versus batch and weight normalization and Zhang, Wang, Grosse 2018 Three mechanisms of weight decay regularization give careful mechanistic accounts; the summary is that the effect of weight decay in deep networks is both smaller and more subtle than the L2 textbook narrative suggests.
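In the linear case the closed form is two lines of NumPy. The data here is synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + rng.normal(0, 0.1, 50)
lam = 1.0

# Ridge solution: theta = (X^T X + lam*I)^{-1} X^T y  (solve, don't invert)
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)

# Unregularised least squares, for comparison
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# For any lam > 0 the ridge solution has strictly smaller norm than OLS
```

Using `solve` rather than forming the inverse is the standard numerically stable choice; the added λI term is also what makes the system well-conditioned even when XᵀX is nearly singular.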
For vanilla SGD, adding λθ to the gradient and applying multiplicative decay (1 − ηλ)θ are equivalent. For Adam they are not. If we add λθ to the gradient before Adam's adaptive scaling, then parameters with large second-moment v̂ receive small effective decay (because their gradients are divided by √v̂), and parameters with small v̂ receive large effective decay. This is almost certainly not the intended behaviour: the regularisation strength has become coupled to the optimiser's adaptive scaling, rather than being a uniform pressure on the parameter norm. Ilya Loshchilov and Frank Hutter's 2019 Decoupled weight decay regularization (ICLR) identified this and introduced AdamW, which applies decay directly to the parameter after the Adam update: θ ← θ − η · m̂/√v̂ − η · λ · θ. The decoupled form restores the uniform-shrinkage interpretation and empirically gives 1–2% better top-1 accuracy on ImageNet and commensurate improvements on language tasks. The modern recommendation is almost unanimous: always use AdamW rather than Adam-with-L2, and set weight decay directly rather than through the loss.
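The difference is easy to see in a stripped-down implementation. With the data gradient held at zero, decay is the only force acting: AdamW shrinks every parameter by the same fraction, while Adam-with-L2's decay term is flattened by the adaptive normalisation, so small and large parameters lose roughly the same absolute amount. A sketch, with illustrative hyperparameters and not a production optimiser:

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              lam=0.0, decoupled=False):
    """One step of Adam (coupled L2) or AdamW (decoupled decay)."""
    if not decoupled:
        g = g + lam * theta                 # L2 enters the gradient, so it is
                                            # rescaled by 1/sqrt(v_hat) below
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        theta = theta - eta * lam * theta   # uniform multiplicative shrinkage
    return theta, m, v

# Two parameters of very different size, zero data gradient throughout.
results = {}
for decoupled in (False, True):
    th, m, v = np.array([10.0, 0.1]), np.zeros(2), np.zeros(2)
    for t in range(1, 101):
        th, m, v = adam_step(th, np.zeros(2), m, v, t, lam=0.01,
                             decoupled=decoupled)
    results["adamw" if decoupled else "adam_l2"] = th
```

After 100 steps AdamW preserves the 100:1 ratio between the two parameters (both decayed by the same factor), while coupled Adam has removed about 0.1 from each in absolute terms: a 1% loss for the large parameter and essentially the whole value of the small one.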
Typical weight-decay values for AdamW on vision classification are 0.01 to 0.1; for transformer language models, 0.01 to 0.1 applied to non-bias, non-LayerNorm parameters (biases and normalisation scale/shift are commonly exempted); for fine-tuning pre-trained models, often 0.0 or a small value like 0.01 (one doesn't want to pull a carefully-trained prior away from its initialisation). The hyperparameter is log-scale: sweep [10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹] to find the right order of magnitude, then refine if necessary. Weight decay interacts with learning rate (both have a scale effect on effective step size in BN nets) and with the learning-rate schedule (decay is effectively stronger at small learning rates late in training), so it should be tuned jointly with those, not in isolation.
If L2 regularisation shrinks parameters smoothly toward zero, L1 regularisation is the technique that forces parameters to zero exactly — producing sparse solutions where only a subset of features or connections carry signal. L1 was a central tool in the statistical-learning decade of 1996–2010 through Lasso and its cousins. Its role in deep learning is narrower than one might expect; understanding why is part of understanding what regularisation in deep networks is actually for.
Robert Tibshirani's 1996 Regression shrinkage and selection via the lasso (Journal of the Royal Statistical Society B) introduced L1-penalised linear regression: minimise (1/2)‖y − Xθ‖² + λ‖θ‖₁, where the penalty is ‖θ‖₁ = ∑ⱼ |θⱼ|. The geometry of the L1 ball — it has corners along the coordinate axes — means the optimum of this objective sits at a corner whenever the unregularised loss gradient points at the interior of the ball: exactly zero on some coordinates and nonzero on others. Lasso simultaneously fits a regression model and selects the subset of features with nonzero coefficients. This made it transformative in the genomics, econometrics, and signal-processing literatures of the 2000s, where problems with thousands of candidate features and hundreds of samples suddenly had a principled answer to "which features matter?". Bickel, Ritov, Tsybakov 2009, Bühlmann and van de Geer 2011, and Hastie, Tibshirani, Wainwright 2015 (Statistical Learning with Sparsity) are the canonical theoretical and practical references.
The intuition for why L1 produces sparsity and L2 does not is geometric. The contours of ‖θ‖₁ = c are diamonds in two dimensions (L1 balls are cross-polytopes, octahedra in three dimensions), with sharp corners along each axis. The contours of ‖θ‖² = c are spheres. The regularised solution sits where a level set of the data loss first touches the smallest ball; for a squared-L2 ball the tangency point generically lies off the coordinate axes, producing a small but nonzero coefficient on every feature. For an L1 ball, the tangency frequently occurs at a corner, producing exactly zero on several coefficients. A crisper way to see it: the L1 penalty's derivative is sign(θ), which does not go to zero as θ → 0, so there is a finite push toward zero at every point, which overwhelms a small data-gradient and produces an exactly-zero fixed point. The L2 penalty's derivative is θ, which vanishes as θ → 0, so zero is not a stable fixed point unless the data gradient is also exactly zero.
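The fixed-point argument corresponds to the two penalties' proximal operators: soft-thresholding for L1 (exact zeros below the threshold) versus multiplicative shrinkage for L2 (never exactly zero). A minimal sketch:

```python
import numpy as np

def prox_l1(theta, t):
    """Proximal map of t*||.||_1: soft-thresholding snaps small entries to 0."""
    return np.sign(theta) * np.maximum(np.abs(theta) - t, 0.0)

def prox_l2(theta, t):
    """Proximal map of (t/2)*||.||^2: shrink every entry; none reaches 0."""
    return theta / (1.0 + t)

theta = np.array([3.0, 0.5, -0.2, 0.05])
sparse = prox_l1(theta, 0.3)   # entries with |theta_j| <= 0.3 become exactly 0
dense = prox_l2(theta, 0.3)    # every entry shrinks toward 0 but stays nonzero
```

Alternating a gradient step on the data loss with `prox_l1` is exactly the iterative shrinkage-thresholding scheme that proximal methods build on.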
L1 regularisation gets used in deep networks, but far less than L2. Three reasons. First, weight-level sparsity is not the natural unit of interpretability in a neural network the way it is in linear regression. In a linear model, "coefficient of feature j is zero" means "feature j is irrelevant" — a human-readable statement. In a deep network, "weight from neuron 17 of layer 4 to neuron 31 of layer 5 is zero" is not a meaningful statement about the data; it is a statement about a particular circuit path, whose interpretation depends on every other weight in the network. Second, the non-differentiability of L1 at zero complicates optimisation. The sub-gradient is ∂|θ| = sign(θ) for θ ≠ 0 and [−1, +1] at zero; plain SGD with L1 tends to oscillate around zero rather than sitting there. Proximal methods (Beck, Teboulle 2009 A fast iterative shrinkage-thresholding algorithm) solve this cleanly for convex problems, but implementing them in a general-purpose neural-network framework is fiddly. Third, pruning is a better tool for sparsity — if you want a sparse network, training a dense one and then pruning (Han, Pool, Tran, Dally 2015 Learning both weights and connections for efficient neural networks, Frankle and Carbin 2019 The lottery ticket hypothesis) typically produces a sparser, better-performing network than L1-training does.
Hui Zou and Trevor Hastie's 2005 Regularization and variable selection via the elastic net combined L1 and L2 into α‖θ‖₁ + (1 − α)/2 · ‖θ‖². Elastic net keeps Lasso's sparsity-inducing behaviour when features are uncorrelated but groups correlated features together (rather than picking one arbitrarily, as Lasso does). Ming Yuan and Yi Lin's 2006 Model selection and estimation in regression with grouped variables introduced group Lasso, which applies an L2 norm within each pre-defined group of variables and an L1 across groups — producing solutions that zero entire groups together, useful when features have natural grouping structure. Both extensions saw use in statistical applications; neither has a central role in modern deep learning, where structured sparsity is usually pursued through pruning rather than through regularisation.
Where L1-style regularisation does see active use in deep learning is for model compression and efficiency. Channel pruning (Wen et al. 2016 Learning structured sparsity in deep neural networks, He et al. 2017 Channel pruning for accelerating very deep neural networks) applies group Lasso to entire channels of a convnet, producing a network where whole feature maps can be removed without affecting inference accuracy. This converts the regularisation objective into a hardware-efficient architecture, because a zeroed channel can be skipped at inference time. Similarly, N:M sparsity (Nvidia's 2:4 sparse format) is enforced during fine-tuning to produce networks that map well to sparse-tensor-core hardware. These uses of L1 are closer to architecture search under constraints than to regularisation in the generalisation sense — the goal is a smaller network at inference time, not better test accuracy.
If there is a single technique emblematic of deep-learning-era regularisation, it is dropout. Proposed by Geoffrey Hinton's Toronto group in 2012 and formalised in a 2014 JMLR paper, dropout randomly "drops" a fraction of neurons on each training step — a deceptively simple scheme that turned out to be one of the most effective regularisers of the decade. Its role has narrowed in modern architectures, but its ideas — stochastic training perturbations as approximate ensembling — recur across every modern regulariser in this chapter.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov's 2014 Dropout: A simple way to prevent neural networks from overfitting (JMLR) is the canonical reference. On each training forward pass, for each hidden unit, flip a biased coin: with probability p (the dropout rate, typically 0.1–0.5) set the unit's activation to zero; with probability 1 − p keep it. The backward pass uses the same mask: gradients only flow through the units that were kept. Crucially, at inference time no units are dropped — the full network is used, producing the final prediction. Because more units are active at inference than during training, the expected activations at each layer are inflated by a factor of 1/(1 − p) at inference vs training; the standard fix is inverted dropout, which scales the kept activations by 1/(1 − p) during training, so that expected activations match at train and inference time. This is the implementation every modern framework uses; the original paper used the non-inverted form and rescaled at inference time, but this is equivalent.
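An inverted-dropout layer is a few lines of NumPy; this is the sketch, not any framework's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p, training):
    """Inverted dropout: zero each unit with prob p, scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return x                        # inference: full network, no scaling
    mask = rng.random(x.shape) >= p     # keep each unit with probability 1 - p
    return x * mask / (1.0 - p)

x = np.ones(100_000)
out = dropout(x, p=0.3, training=True)

# About 30% of units are zeroed, yet the mean activation stays near 1.0,
# matching the inference-time value dropout(x, 0.3, training=False).mean().
```

A real layer would also apply the same mask in the backward pass; with an autodiff framework that falls out automatically, since the mask-and-scale is part of the forward computation.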
Srivastava et al.'s central argument: training with dropout is approximately equivalent to training an ensemble of exponentially many sub-networks (one for each possible dropout mask) that share weights, then averaging their predictions at inference. For a network with n hidden units, there are 2ⁿ possible dropout masks; training with dropout samples one at random per step and updates the weights to reduce the loss on that sub-network. At inference, using the full network is an approximation to averaging over all 2ⁿ sub-networks — exact for linear models, approximate but empirically good for nonlinear ones. Ensembling is famously a good way to reduce variance (bagging, boosting, random forests all exploit it), and dropout gets most of the benefit of a full ensemble at almost no additional cost — just a multiplicative mask per step. This interpretation is intuitive and has been cited in nearly every dropout paper since, though it is not a rigorous equivalence, and a precise characterisation of what dropout does at the loss-landscape level remains open.
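For a single linear unit the averaging claim is exact, and a Monte-Carlo check makes it concrete (the weights and input below are arbitrary random draws):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8)    # weights of one linear unit (illustrative)
x = rng.normal(size=8)    # a single input
p = 0.5

# Sample many sub-networks: each mask keeps a unit with probability 1 - p,
# with inverted-dropout scaling so expectations line up.
masks = rng.random((200_000, 8)) >= p
sub_network_preds = ((masks / (1 - p)) * x) @ w
mc_average = sub_network_preds.mean()

full_network = w @ x      # inference-time prediction: no units dropped

# For a linear model, the exact average over all 2^8 masks equals w @ x;
# the Monte-Carlo estimate converges to it.
```

For nonlinear networks the full-network forward pass is only an approximation to this average, which is the sense in which the ensemble interpretation is heuristic rather than exact.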
Several variants have been developed to address specific weaknesses of vanilla dropout. Spatial dropout (Tompson et al. 2015 Efficient object localization using convolutional networks) drops entire feature maps rather than individual activations in a convnet; independent per-activation dropout in convnets doesn't help as much because neighbouring spatial activations are strongly correlated, so dropping one doesn't remove much information from the next layer. Variational (recurrent) dropout (Gal and Ghahramani 2016, A theoretically grounded application of dropout in recurrent neural networks) uses the same mask across all timesteps of an RNN, giving a Bayesian interpretation and fixing a failure mode that per-step independent dropout had in recurrent networks. Monte-Carlo dropout (Gal and Ghahramani 2016, Dropout as a Bayesian approximation) keeps dropout on at inference time and samples multiple predictions, using the spread to estimate predictive uncertainty — a crude but practically useful uncertainty quantification. Zoneout (Krueger et al. 2017) preserves rather than zeros the hidden state in RNNs with probability p, a regulariser better suited to temporal dependencies.
Dropout was near-universal in convnets and feed-forward networks of the 2012–2017 era. By the late 2010s its role had diminished in two important families of architectures. Modern convnets with batch normalisation mostly don't use dropout in convolutional layers — batch norm plus weight decay turns out to be sufficient regularisation, and dropout on convolutional features actually hurts on ResNet-style architectures (as shown by He et al.'s original ResNet paper and many follow-ups). Dropout is typically applied only on the final fully-connected classifier head, if at all. Transformers use dropout — typically on attention weights, residual streams, and MLP outputs — but with relatively low rates (0.1 is standard). And for very large transformers trained on massive datasets, dropout is often removed entirely (the Chinchilla and Llama training recipes use little or no dropout), because at sufficient data scale, the generalisation gap is small enough that any regularisation beyond weight decay is unnecessary — and the regularisation is training-slowing. So dropout is no longer a universal default but a tool that may or may not help depending on the architecture and data scale.
For a modern image classifier (ResNet, ConvNeXt, EfficientNet), apply dropout at rate 0.2–0.5 to the final classifier head, not to the convolutional features. For a transformer trained on moderate data (tens of millions of tokens), use dropout 0.1 on attention and residual streams; on large-scale pretraining, consider removing dropout entirely and relying on weight decay plus data scale. For RNNs and LSTMs, use variational dropout (same mask across timesteps) or zoneout rather than per-step independent dropout. The dropout rate is a hyperparameter worth tuning — too high and training plateaus, too low and it fails to regularise; sweep the 0.0–0.5 range and pick the value with the best validation accuracy.
The most effective regulariser in deep learning, for many tasks and by a wide margin, is not a penalty or a stochastic layer — it is more data. When more real data is unavailable, the next best thing is to synthesise plausible variations of the data you have. Data augmentation is the craft of doing this well, and it is one of the relatively few areas of the deep-learning playbook where domain knowledge about the problem reliably beats raw architectural horsepower.
A training example (x, y) is typically just one representative of a large orbit of examples that should map to the same label. A picture of a cat remains a picture of a cat if you flip it left–right, shift it by a few pixels, change the lighting slightly, or crop to a slightly different region. The function we want to learn is invariant — or at least approximately invariant — to these transformations. If we generate new training examples by applying such transformations at random each epoch and train on the enlarged dataset, we force the network to learn the invariance explicitly rather than hoping it emerges. Formally, data augmentation replaces the empirical risk (1/N) ∑ᵢ ℓ(f(xᵢ), yᵢ) with (1/N) ∑ᵢ 𝔼_T[ℓ(f(T(xᵢ)), yᵢ)], where T is a random transformation drawn from a family the labels are invariant under.
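The augmented-risk formulation corresponds to a very simple training-loop change: each epoch, every example is replaced by a fresh random transform of itself. A minimal pure-Python sketch, with illustrative function names and a toy two-by-two "image":

```python
import random

def hflip(img):
    """Horizontal flip of a row-major H x W image (list of rows)."""
    return [row[::-1] for row in img]

def random_transform(img, rng):
    """Draw T from a small label-preserving family: identity or horizontal flip."""
    return hflip(img) if rng.random() < 0.5 else img

def augmented_epoch(dataset, rng):
    """One epoch's view of the data: each (x, y) becomes (T(x), y) with a fresh
    random T, so averaging over epochs approximates E_T[loss(f(T(x)), y)]."""
    return [(random_transform(x, rng), y) for x, y in dataset]

rng = random.Random(0)
cat = [[1, 2], [3, 4]]        # tiny 2x2 "image"
data = [(cat, "cat")]
epoch1 = augmented_epoch(data, rng)
epoch2 = augmented_epoch(data, rng)
```

The label travels unchanged with the transformed input; the expectation over T is realised implicitly by resampling T every epoch rather than by enumerating the orbit.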
For images, a standard augmentation pipeline since the Krizhevsky, Sutskever, Hinton 2012 AlexNet paper includes: random crop (take a random square sub-region of the image at training time, resize to the network's input dimension); horizontal flip (mirror left–right with probability 0.5; vertical flip is usually not used because upside-down images are rare in natural data); colour jitter (small random adjustments to brightness, contrast, saturation, and hue); rotation (small random rotations, typically ±10° — larger rotations often make the task easier to solve in an undesirable way, by cueing on image borders); Gaussian noise or blur (small additive perturbations); random erasing (Zhong et al. 2020). This pipeline is enough to roughly double or triple the effective dataset size, and its use is the difference between a ResNet-50 that reaches ~72% top-1 on ImageNet and one that reaches ~76%.
Augmenting text is harder than augmenting images because the invariances are less clean — swapping two words, deleting a word, or replacing a word with a synonym can change the meaning in ways that are hard to detect automatically. Still, several techniques are in active use. Back-translation (Sennrich, Haddow, Birch 2016 Improving neural machine translation models with monolingual data) translates a sentence to a pivot language and back, producing paraphrases that largely preserve meaning. Synonym replacement and the EDA (Easy Data Augmentation, Wei and Zou 2019) perturbations — synonym swap, random insertion, random swap, random deletion — offer small but measurable improvements on low-resource NLP tasks. Span masking (BERT's masked-language-model pretraining, Devlin et al. 2019) can be viewed as a form of augmentation: the same sentence with different masks is effectively a new training example. For language-model pretraining at scale, augmentation is mostly replaced by sheer data volume — there are so many tokens in the web corpus that explicit augmentation is redundant.
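Two of the EDA perturbations are short enough to sketch directly, as a hedged illustration rather than the reference implementation (the original EDA code also handles synonym replacement via WordNet, which is omitted here):

```python
import random

def random_deletion(words, p, rng):
    """EDA random deletion: drop each word with prob p; keep at least one word."""
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]

def random_swap(words, n_swaps, rng):
    """EDA random swap: swap the words at two random positions, n_swaps times."""
    words = list(words)
    for _ in range(n_swaps):
        i, j = rng.randrange(len(words)), rng.randrange(len(words))
        words[i], words[j] = words[j], words[i]
    return words

rng = random.Random(0)
sent = "the cat sat on the mat".split()
aug1 = random_deletion(sent, p=0.2, rng=rng)
aug2 = random_swap(sent, n_swaps=2, rng=rng)
```

Both perturbations are noisy rather than strictly label-preserving, which is exactly why EDA's gains are confined to low-resource settings where the extra variation outweighs the occasional corrupted example.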
Audio augmentation uses time stretching, pitch shifting, adding background noise, and the SpecAugment technique (Park et al. 2019), which masks frequency and time bands in the spectrogram — the audio analogue of masking pixel patches in an image. Tabular data augmentation is harder still because the structure of tabular features (mix of numerical and categorical, often with semantic meaning) makes automatic transformation dangerous; SMOTE (Chawla et al. 2002) and its variants synthesise minority-class examples via interpolation, and Gaussian-noise addition to numerical features is a simple and mostly-safe baseline. For point clouds and 3D data, rotation, translation, and jittering are standard.
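SMOTE's interpolation step can be sketched in a few lines; this is a minimal single-sample version under simplifying assumptions (plain Euclidean distance, purely numerical features), with illustrative function names:

```python
import random

def smote_sample(minority, k, rng):
    """SMOTE-style synthesis: pick a minority point, one of its k nearest
    minority neighbours, and interpolate a random fraction of the way."""
    x = rng.choice(minority)
    # k nearest neighbours by squared Euclidean distance, excluding x itself
    others = sorted((p for p in minority if p is not x),
                    key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))[:k]
    nbr = rng.choice(others)
    t = rng.random()  # interpolation fraction in [0, 1)
    return [a + t * (b - a) for a, b in zip(x, nbr)]

rng = random.Random(1)
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
synthetic = smote_sample(minority, k=2, rng=rng)
```

Each synthetic point lies on a segment between two real minority examples, which is why SMOTE is safe for numerical features but needs special handling (SMOTE-NC and relatives) for categorical ones.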
Not every transformation is a valid augmentation. The transformation must be label-preserving: if flipping the image left–right makes "left turn" become "right turn", you have corrupted the training signal. For natural-image classification the usual invariances are safe; for medical imaging, some of them are not (a flipped chest X-ray changes anatomical-side information). For tabular data, the safe invariances are often near-zero — augmentation may not be appropriate. The generic rule: augment only under transformations you are confident the target labels are invariant under, and verify on validation performance that the augmentation is helping rather than adding noise. An aggressive augmentation policy is a way to waste compute if it is generating examples whose labels are wrong.
It is tempting to describe augmentation simply as "making the dataset bigger", but the mechanism is subtler. A network trained with aggressive augmentation learns explicit invariance — a systematic insensitivity to the augmented transformations that extends to test-time variations drawn from the same family. This produces models that are robust to input perturbations in a way that simply having a larger dataset does not. Empirically this shows up as: (i) smaller train–test gap even when train and test are drawn from the same distribution, (ii) better out-of-distribution performance on perturbed test sets (ImageNet-C, ImageNet-R), (iii) better transferability to downstream tasks. Augmentation is one of the few regularisers that buys you both in-distribution and out-of-distribution benefits, which is part of why it sits at the top of most practitioners' regularisation toolboxes.
Hand-designed crop-and-flip augmentation was the state of the art from 2012 through about 2017. A wave of techniques between 2017 and 2020 — interpolation-based augmentation, combinatorial erasing, reinforcement-learning-based policy search — pushed augmentation well beyond the basic pipeline and became standard components of the modern vision training recipe. They share a common idea: the space of valid augmentations is much richer than the classical pipeline suggests, and performance improves when you search or sample it intelligently.
Hongyi Zhang, Moustapha Cissé, Yann Dauphin, and David Lopez-Paz's 2018 mixup: Beyond empirical risk minimization (ICLR) is one of the most influential augmentation papers of the last decade. The rule is startlingly simple: to form a training example, take two random training examples (x_i, y_i) and (x_j, y_j), draw a mixing coefficient λ ∼ Beta(α, α) (typically α ≈ 0.2, so most mixes are near-original), and train on (λ x_i + (1 − λ) x_j, λ y_i + (1 − λ) y_j). The network is asked to predict a soft-mixed label on a pixel-linearly-mixed image. This sounds like it should corrupt the training signal — most pixel-linear mixes of two natural images look unnatural — yet it reliably improves generalisation, reduces memorisation of corrupt labels, and improves robustness to adversarial examples. The paper offers several mechanistic explanations: mixup encourages linear behaviour between training examples, it regularises the gradient norm, it approximates vicinal risk minimisation (training on small perturbations around each training point). Empirically, mixup gives ~1% top-1 improvement on ImageNet and similar gains on speech and NLP.
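The mixup rule is a one-liner per tensor; a minimal pure-Python sketch over flat feature vectors and one-hot label vectors (function name and toy data are illustrative):

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """mixup: blend two examples with lam ~ Beta(alpha, alpha).
    y1, y2 are one-hot (or soft) label vectors of the same length."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

rng = random.Random(0)
x, y, lam = mixup([0.0, 1.0], [1, 0], [2.0, 3.0], [0, 1], alpha=0.2, rng=rng)
```

With alpha = 0.2 the Beta distribution is bimodal near 0 and 1, so most mixed examples stay close to one of the two originals, which is why mixup perturbs rather than destroys the training signal.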
Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo's 2019 CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features (ICCV) combines mixup's idea with spatially local replacement. Instead of linearly blending two images, CutMix cuts a random rectangular patch from one image and pastes it into another. The label is mixed in proportion to the areas of the two images in the composite. CutMix avoids mixup's unnatural-blend visual artefacts while still getting the regularisation benefit. On ImageNet it outperforms mixup by a small margin, and on object-localisation tasks (weakly supervised localisation, image retrieval) it improves feature-map quality more than mixup does — the spatially-coherent patches keep object-like structure in the composite. For classification tasks where both techniques apply, CutMix is typically the first to try; the more recent Manifold Mixup (Verma et al. 2019) applies mixup in feature space rather than pixel space and is an interesting further variant.
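The patch-paste-and-area-mix rule can be sketched over row-major list-of-lists images; this is a simplified version of CutMix (the paper samples the patch size from the Beta distribution, whereas here it is uniform, and all names are illustrative):

```python
import random

def cutmix(img1, y1, img2, y2, rng):
    """CutMix: paste a random rectangle from img2 into img1; mix the labels
    in proportion to the pasted area. Images are H x W lists of rows."""
    h, w = len(img1), len(img1[0])
    ch, cw = rng.randint(1, h), rng.randint(1, w)            # patch size
    r0, c0 = rng.randint(0, h - ch), rng.randint(0, w - cw)  # patch origin
    out = [row[:] for row in img1]
    for r in range(r0, r0 + ch):
        out[r][c0:c0 + cw] = img2[r][c0:c0 + cw]
    lam = 1.0 - (ch * cw) / (h * w)                # fraction of img1 remaining
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return out, y

rng = random.Random(0)
ones = [[1] * 4 for _ in range(4)]
twos = [[2] * 4 for _ in range(4)]
mixed, y = cutmix(ones, [1.0, 0.0], twos, [0.0, 1.0], rng)
```

The mixed label exactly tracks pixel ownership: the weight on each class equals the fraction of the composite image contributed by that class's source.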
Terrance DeVries and Graham Taylor's 2017 Improved regularization of convolutional neural networks with cutout and Zhong et al.'s 2020 Random erasing data augmentation (AAAI) replace a random patch of the image with a constant value (zero, random noise, or the image's mean). This forces the network not to rely on a single salient region of the image — if it has memorised "this patch decides the class" then the augmentation will routinely remove that patch. Cutout is cheap to implement, improves ResNet ImageNet accuracy by ~0.5% and CIFAR accuracy by ~1–2%, and is complementary to mixup/CutMix (you can stack them). It is the simplest of the "remove information" augmentations and is often included in modern vision recipes.
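Cutout really is cheap to implement; a minimal sketch with illustrative names (the original paper applies a fixed-size square per image, as here, while random erasing also randomises the patch's aspect ratio and fill):

```python
import random

def cutout(img, size, rng, fill=0):
    """Cutout / random erasing: overwrite a random size x size patch with a
    constant fill value. The label is unchanged."""
    h, w = len(img), len(img[0])
    r0, c0 = rng.randint(0, h - size), rng.randint(0, w - size)
    out = [row[:] for row in img]
    for r in range(r0, r0 + size):
        for c in range(c0, c0 + size):
            out[r][c] = fill
    return out

rng = random.Random(0)
img = [[1] * 5 for _ in range(5)]
erased = cutout(img, size=2, rng=rng)
```

Unlike mixup and CutMix, the label is left untouched, which is what makes Cutout trivially stackable on top of them.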
Ekin Cubuk, Barret Zoph, Dandelion Mané, Vijay Vasudevan, and Quoc Le's 2019 AutoAugment: Learning augmentation policies from data (CVPR) turned augmentation itself into a search problem. They defined a policy space of sub-policies — small sequences of image operations (rotate, shear, colour, solarize, etc.) with probability and magnitude parameters — and used reinforcement learning over a smaller proxy dataset to find the policy that maximised validation accuracy. The discovered ImageNet policy gave ~0.5–1% top-1 improvement over the standard pipeline, at the cost of thousands of GPU-hours for the policy search. RandAugment (Cubuk et al. 2020) simplified: instead of searching for an optimal sequence, randomly sample n operations from a fixed list at fixed magnitude m, treating n and m as two tunable hyperparameters. RandAugment matches or exceeds AutoAugment's performance at a fraction of the search cost. TrivialAugment (Müller and Hutter 2021) went further still: a single randomly-chosen operation per image, with no hyperparameters — and showed that this parameter-free baseline matches RandAugment on most benchmarks. The lesson is deflating: much of what AutoAugment "learned" is a specific magnitude schedule and operation mix, but the key ingredient is just applying an interesting menu of augmentations uniformly — the exact schedule matters less than practitioners assumed.
The state-of-the-art image-classification recipes (the "DeiT" recipe of Touvron et al. 2021, the Wightman et al. 2021 ResNet strikes back recipe) stack several augmentations simultaneously: RandAugment or TrivialAugment for image-level perturbation, mixup or CutMix for cross-image blending, random erasing / Cutout for patch-level removal, and label smoothing (§8) for target softening. On top of the classical pipeline (crop, flip, colour jitter), these additions together typically add 2–4% top-1 ImageNet accuracy to a ResNet-50 and similar improvements to transformer classifiers. The cost is a more complex augmentation pipeline and slower CPU-side data loading (augmentation can become the training-throughput bottleneck), but the throughput cost is usually worth it.
Label smoothing is one of those techniques that sounds like it shouldn't work: the target labels are the ground truth, and softening them seems like deliberately injecting noise into the supervision. Yet a simple change — replacing the one-hot target with a small mixture of the one-hot and a uniform distribution — reliably improves classification accuracy, improves calibration, and interacts cleanly with distillation. It is one of the cheapest regularisers in the chapter.
For a K-class classification problem with a one-hot target y = e_c (the c-th standard basis vector), label smoothing replaces y with y_smooth = (1 − ε) e_c + (ε/K) · 1_K, where ε is the smoothing coefficient (typically 0.1) and 1_K is the all-ones vector. The correct class has target probability 1 − ε + ε/K ≈ 0.9 (for ε = 0.1); every other class has target probability ε/K ≈ 0.01 (for K = 10). The cross-entropy loss L = − ∑_k y_smooth,k · log p_k then rewards the network for putting most of its mass on the correct class without pushing the non-target logits to −∞. The technique was introduced in passing in Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna's 2016 Rethinking the Inception architecture for computer vision (CVPR), as one of a handful of refinements to the Inception-v3 training recipe; it has since become nearly universal for classification.
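The smoothed target and the resulting cross-entropy are a few lines of arithmetic; a minimal sketch with illustrative function names, following the formula above:

```python
import math

def smooth_labels(c, num_classes, eps=0.1):
    """Replace the one-hot target e_c with (1 - eps) * e_c + (eps/K) * 1_K."""
    base = eps / num_classes
    y = [base] * num_classes
    y[c] += 1.0 - eps
    return y

def cross_entropy(y, p):
    """L = -sum_k y_k * log p_k for a target distribution y and prediction p."""
    return -sum(yk * math.log(pk) for yk, pk in zip(y, p))

y = smooth_labels(c=0, num_classes=10, eps=0.1)   # target 0.91 / 0.01 / ... / 0.01
```

Because every target entry is strictly positive, the loss is unbounded below zero on none of the logits: pushing a wrong-class probability all the way to 0 (logit to minus infinity) now increases the loss, which is the mechanism discussed next.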
Rafael Müller, Simon Kornblith, and Geoffrey Hinton's 2019 When does label smoothing help? (NeurIPS) gave the most thorough analysis. Their findings: label smoothing tightens the clusters of penultimate-layer activations for each class — the features extracted for class-c examples become more concentrated around a class centroid and further from centroids of other classes. This improves the network's calibration (its predicted probabilities more closely match true accuracy rates; treated in detail in Part IV Chapter 09 Model Evaluation) and its test accuracy by a consistent ~0.5–1% on standard benchmarks. The mechanism: without smoothing, the network is incentivised to drive correct-class logits to +∞ and incorrect-class logits to −∞, overfitting to the training set's specific noise patterns. With smoothing, the gradient vanishes once logits reach a finite target, and the network stops pushing when it has "fit enough". This is a form of capacity constraint on the training dynamics.
Müller et al. also showed that label-smoothed models are worse teachers for knowledge distillation. Knowledge distillation (Hinton, Vinyals, Dean 2015) uses a large teacher model's soft predictions to supervise a smaller student; the teacher's soft targets carry information about which wrong classes are "nearly right", which is precisely the information label smoothing destroys by collapsing all wrong-class targets to the same low value. If you are training a model primarily for downstream use as a distillation teacher, disable label smoothing; if you are training a model for deployment, enable it. This asymmetry is not especially intuitive but is a well-documented practical guideline.
A related technique is focal loss (Lin, Goyal, Girshick, He, Dollár 2017 Focal loss for dense object detection, ICCV), which modifies the cross-entropy loss to down-weight examples the network is already confident on: L_focal = − (1 − p_c)^γ · log p_c, where γ (typically 2) is the focusing parameter. For γ = 0 this is standard cross-entropy; for γ > 0 the loss is essentially zero on examples with correct-class probability near 1, forcing the gradient to concentrate on hard examples. Focal loss was designed for object-detection problems with severe class imbalance (where most candidate boxes are background and confidently classified), but it is used more broadly as a regulariser that shifts attention to hard examples. Its effect is complementary to label smoothing: smoothing softens targets uniformly; focal loss re-weights examples based on confidence. Both are cheap to implement and both are standard options in modern classification recipes.
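The down-weighting behaviour is easy to verify numerically; a minimal sketch of the focal-loss formula for the correct-class probability (the detection version also includes a class-balancing weight alpha, omitted here):

```python
import math

def focal_loss(p_c, gamma=2.0):
    """Focal loss on the correct-class probability p_c:
    L = -(1 - p_c)^gamma * log(p_c). gamma = 0 recovers cross-entropy."""
    return -((1.0 - p_c) ** gamma) * math.log(p_c)

easy = focal_loss(0.99)   # confident example: loss is nearly zero
hard = focal_loss(0.10)   # hard example: loss stays large
```

With gamma = 2 a confidently-correct example (p_c = 0.99) contributes about a millionth of the loss of a hard one (p_c = 0.1), so the gradient budget is spent almost entirely on the hard examples.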
Set ε = 0.1 for most classification problems — this is the value used in nearly every modern ImageNet recipe and in transformer language-model training. For problems with few classes (binary or 3-way classification), slightly larger ε (0.15 or 0.2) can be appropriate; for problems with thousands of classes, the default 0.1 is usually correct. Combine with mixup and/or CutMix; these all act on the target side of the loss function but are largely orthogonal in their effects. Always disable label smoothing when training a teacher for distillation. Measure calibration (expected calibration error, reliability diagrams) in addition to accuracy, especially if the downstream system relies on calibrated probabilities.
The oldest form of regularisation is also one of the cheapest: stop training before the network has had a chance to memorise the training data. Early stopping was a standard tool in the neural-network literature of the 1980s and 90s, and it remains effective in the deep-learning era, though its interaction with modern learning-rate schedules and the double-descent phenomenon has changed how practitioners use it.
Train for a fixed number of epochs; at each epoch (or every k steps) evaluate on a held-out validation set; track the best-so-far validation metric; when the validation metric has failed to improve for P consecutive checks (the patience), stop training. Lutz Prechelt's 1998 chapter Early stopping — but when? in the Neural Networks: Tricks of the Trade volume (Montavon, Orr, Müller eds.) is the classical reference; it formalises several stopping criteria (generalisation loss, progress, validation-plateau) and compares their performance. The intuition is that training loss decreases monotonically while validation loss first decreases (as the network learns the signal) and then begins to increase (as the network starts memorising noise); the sweet spot is near the validation-loss minimum. Early stopping is sometimes called poor man's regularisation: it does not prevent overfitting; it merely stops training before overfitting has had time to develop.
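The patience criterion described above can be sketched as a pure function of the validation-loss history (names are illustrative; a real training loop would also checkpoint the weights at each improvement):

```python
def early_stopping(val_losses, patience):
    """Return the index of the checkpoint to restore: track the best-so-far
    validation loss and stop after `patience` consecutive checks with no
    improvement."""
    best_idx, best = 0, float("inf")
    since_improved = 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best_idx, best = i, loss
            since_improved = 0
        else:
            since_improved += 1
            if since_improved >= patience:
                break
    return best_idx

# Validation loss falls, then rises: stop near the minimum (index 3).
history = [1.00, 0.80, 0.70, 0.65, 0.68, 0.72, 0.75, 0.80]
best = early_stopping(history, patience=3)
```

Note that the function returns the index of the best checkpoint, not the index at which training stopped: restoring the best checkpoint is the part practitioners most often forget.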
One line of analysis connects early stopping to L2 regularisation. For a linear-regression loss on a positive-semi-definite Hessian, early-stopped gradient descent is (approximately) equivalent to ridge regression with a regularisation parameter that decreases as training progresses (Friedman and Popescu 2003). The intuition: gradient descent first moves in the direction of the largest-eigenvalue component of the Hessian, then gradually incorporates smaller-eigenvalue components; early stopping prevents the small-eigenvalue (high-variance) components from being fitted. For deep networks the equivalence doesn't hold exactly, but the qualitative story is similar — early in training the network learns coarse, generalising features, and late in training it memorises fine-grained training-set details.
Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever's 2020 Deep double descent: Where bigger models and more data hurt (ICLR) documented an epoch-wise double descent: for some models and datasets, the validation loss first decreases, then increases as the model starts to interpolate, then decreases again at even longer training. The classical early-stopping advice — "stop at the first validation-loss minimum" — can leave significant generalisation on the table in these cases; training longer through the interpolation peak to the deeper second minimum is the correct move. Whether epoch-wise double descent occurs depends on the dataset size, model size, and amount of regularisation; in practice, for very-long training runs (LLM pretraining with trillions of tokens), validation loss is essentially monotonically decreasing, and early stopping is effectively never triggered. For shorter training runs on fixed datasets, early stopping is still the default.
Rich Caruana, Steve Lawrence, and Lee Giles's 2000 Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping (NIPS) introduced checkpoint averaging: instead of using the single best checkpoint, average the parameters of several nearby checkpoints to get a model that generalises better than any individual checkpoint. The intuition is that the loss landscape near a minimum is locally convex, so averaging points in the basin of attraction produces a point closer to the basin's centre — a flatter, better-generalising point. Modern incarnations include Stochastic Weight Averaging (Izmailov, Podoprikhin, Garipov, Vetrov, Wilson 2018; treated in detail in §14), which generalises checkpoint averaging to work well with cyclical learning-rate schedules, and Exponential Moving Average of parameters (the EMA that is standard in diffusion-model training and in the Polyak-averaged-target approach used in reinforcement learning).
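The EMA variant is the simplest of these averaging schemes to sketch; a minimal pure-Python version over a flat parameter list (class name is illustrative; frameworks apply the same update per tensor):

```python
class EMA:
    """Exponential moving average of parameters: after each optimiser step,
    shadow <- decay * shadow + (1 - decay) * current. The shadow copy is
    used at inference time."""
    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = list(params)

    def update(self, params):
        d = self.decay
        self.shadow = [d * s + (1 - d) * p for s, p in zip(self.shadow, params)]

ema = EMA([0.0], decay=0.9)
for _ in range(50):
    ema.update([1.0])   # parameters have jumped to 1.0; the shadow trails smoothly
```

The shadow parameters lag the live ones by roughly 1/(1 - decay) steps, which is exactly the local-averaging-in-the-basin effect that checkpoint averaging and SWA exploit at coarser granularity.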
For standard supervised training on a fixed dataset, implement early stopping with a patience of 5–20 epochs on validation loss. Save the best-validation-loss checkpoint explicitly and restore it at the end of training rather than using the final-epoch weights. For very-long pretraining runs with cosine schedules, early stopping is rarely triggered — the schedule ends before validation loss rises — but still configure it as a safety net. For fine-tuning with small learning rates and small datasets, early stopping is essential; validation loss can start rising within the first epoch, and without early stopping the fine-tune will immediately start overfitting. An EMA of parameters (decay 0.999 to 0.9999) as an inference-time average is nearly always a small win for generative models (diffusion, VAE, GAN) and is worth trying for classification models too.
Batch normalisation was introduced (Chapter 02 Part V, §7) primarily as an optimisation tool — it stabilises training, enables larger learning rates, and lets practitioners train hundreds of layers without divergence. It is also, incidentally, a regulariser: a network trained with batch norm generalises better than the same network without, even after controlling for optimisation-stability effects. Understanding the regularisation channel of BN, distinct from its training-speedup role, matters for architectural choices and for understanding why LayerNorm-based transformers need different regularisation recipes than BN-based convnets.
At training time, BN normalises each feature by the mini-batch statistics — the mean and variance computed over the current batch. Because different batches contain different examples, the same input, processed in different batches, sees slightly different normalisation statistics. This is effectively noise injected into the forward pass: the pre-BN activation of a specific neuron is scaled and shifted by a batch-statistic-dependent quantity that varies from step to step. The noise is small when the batch is large (because sample means and variances are low-variance estimators of the population mean and variance), and large when the batch is small. The effect is similar in spirit to dropout: stochastic perturbations of the forward pass that the network cannot rely on, forcing it to learn robust features. Ioffe and Szegedy's original 2015 paper noted the regularisation effect in passing; Santurkar, Tsipras, Ilyas, Madry's 2018 How does batch normalization help optimization? (NeurIPS) formalised some of the mechanics of how BN's gradient landscape differs from plain training's.
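The batch-statistic noise channel is easy to demonstrate numerically: normalise the same input value inside two different batches and observe that it receives two different normalised values. A minimal sketch over scalar features, with illustrative names (real BN also has learnable scale and shift parameters, omitted here):

```python
import statistics

def batchnorm(batch, eps=1e-5):
    """Normalise a batch of scalar activations by the batch mean and variance."""
    mu = statistics.fmean(batch)
    var = statistics.fmean((x - mu) ** 2 for x in batch)
    return [(x - mu) / (var + eps) ** 0.5 for x in batch]

# The same activation value 1.0, processed in two different batches, sees
# different normalisation statistics: this is the injected noise.
out_a = batchnorm([1.0, 2.0, 3.0])[0]
out_b = batchnorm([1.0, 2.0, 6.0])[0]
```

Because the normalised value of a given example depends on which other examples happen to share its batch, the network cannot rely on any exact activation value, much as it cannot rely on any single unit under dropout.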
Several experiments isolate BN's regularisation effect from its optimisation effect. Luo, Wang, Shao, Peng's 2019 Towards understanding regularization in batch normalization (ICLR) trained networks with and without BN using learning rates carefully matched for training-loss trajectory (so that the final training loss reached the same value in both cases). The BN networks still generalised better — by a few percent on ImageNet — demonstrating that the regularisation effect is not just a side-effect of better optimisation. Li, Xu, Taylor, Studer, and Goldstein's 2018 Visualizing the loss landscape of neural nets (NeurIPS) showed that BN networks converge to flatter regions of the loss landscape, which by the flat-minima hypothesis (§17) should generalise better. The quantitative magnitude of BN's regularisation effect is roughly 1–3% top-1 accuracy on ImageNet; substantial but not dominant compared to data augmentation or weight decay.
BN's regularisation effect is batch-size-dependent. Small batches produce noisy statistics and strong regularisation; large batches produce stable statistics and weak regularisation. This becomes problematic in the very-large-batch distributed-training regime, where each GPU sees a small fraction of a per-step mini-batch of 32k or 64k. Using BN with per-GPU statistics gives strong regularisation but different statistics on different GPUs (bad for model consistency); using BN with synchronised statistics across GPUs gives the regularisation of an effective batch of 32k, which is so weak as to be almost nonexistent. This is one of the reasons the largest-scale training runs typically use LayerNorm (which has no batch dimension and therefore no batch-size dependence) rather than BN — treated in Chapter 02 §8.
If you train a BN convnet, you get "free" regularisation from the batch-statistic noise. If you train a LayerNorm transformer, you do not — LayerNorm normalises over the feature dimension, not across examples, so there is no cross-example noise channel. Transformers, correspondingly, tend to need stronger explicit regularisation (dropout, weight decay, augmentation) than BN convnets do. This is empirical and somewhat hand-wavy — an exact comparison is hard because transformers and convnets also differ in many other ways — but the architectural asymmetry is a reason modern transformer training recipes include regularisers that would be considered redundant in a BN convnet context. The general principle: when you choose a normalisation scheme, you are also choosing how much implicit regularisation the architecture will provide, and you should plan your explicit regularisation accordingly.
If the regularisation effect of BN at small batch size is beneficial, one can engineer it back into a large-batch training regime: Ghost Batch Normalisation (Hoffer, Hubara, Soudry 2017 Train longer, generalize better) computes BN statistics over sub-groups of the mini-batch rather than the whole batch, recovering the small-batch noise even when the overall batch is large. This is one of several tricks that attempt to decouple BN's optimisation and regularisation behaviours. Its use is niche — most practitioners either use standard BN or switch to LayerNorm/GroupNorm entirely — but it is a useful technique to know about when large-batch training is hurting generalisation in a specific experiment.
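Ghost Batch Normalisation is a one-loop change to standard BN: statistics are computed per sub-group rather than over the full batch. A minimal sketch over scalar activations, with illustrative names (the ghost size is a hyperparameter, typically 16–64 in the paper's experiments):

```python
import statistics

def ghost_batchnorm(batch, ghost_size, eps=1e-5):
    """Ghost Batch Normalisation: split the large batch into ghost batches of
    size ghost_size and normalise each sub-group by its own statistics,
    recovering small-batch noise inside a large batch."""
    out = []
    for i in range(0, len(batch), ghost_size):
        ghost = batch[i:i + ghost_size]
        mu = statistics.fmean(ghost)
        var = statistics.fmean((x - mu) ** 2 for x in ghost)
        out.extend((x - mu) / (var + eps) ** 0.5 for x in ghost)
    return out

big_batch = [1.0, 2.0, 6.0, 3.0, 3.5, 10.0]
ghosted = ghost_batchnorm(big_batch, ghost_size=3)
```

Each ghost group is normalised to zero mean by its own noisy statistics, so the per-example perturbation matches what a genuinely small batch would have produced.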
Dropout randomly zeros individual activations. A family of related regularisers randomly drops larger structural units — entire residual blocks, entire weight matrices, entire transformer layers. These "remove part of the network" regularisers have been especially important for very deep networks, where the interaction between depth and regularisation is delicate: the deeper the network, the more capacity for memorisation, but also the greater the risk that any single regulariser destabilises training. Random-depth methods split the difference.
Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger's 2016 Deep networks with stochastic depth (ECCV) was motivated by the observation that very-deep ResNets (1000+ layers) are hard to train — partly because of the sheer number of operations per forward pass, partly because of the risk of vanishing or exploding gradients through so many layers, and partly because training is computationally expensive. Their fix: during training, for each residual block, randomly skip the block (use only the identity path) with probability p_l that depends on the layer index l — early layers are kept nearly always, late layers are skipped with probability up to 0.5. The effect is threefold: (i) the expected depth during training is shorter than the nominal depth, so training is faster; (ii) the network is forced to produce valid representations at many depths (early blocks must be useful even when later blocks are skipped); (iii) the randomly-skipped blocks introduce the ensemble-style regularisation familiar from dropout. ResNet-1202 with stochastic depth reached error rates substantially below ResNet-1202 without it on CIFAR, and the technique has become standard for very deep networks.
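The linear survival-probability rule and the train/test asymmetry can be sketched over scalar "blocks"; names and the toy residual branches are illustrative, and the test-time scaling by the survival probability follows the paper's expectation rule:

```python
import random

def survival_prob(layer, num_layers, p_final=0.5):
    """Linear-decay rule: early layers are kept almost always; the deepest
    layer is kept with probability 1 - p_final."""
    return 1.0 - (layer / num_layers) * p_final

def stochastic_depth_forward(x, blocks, num_layers, rng, train=True):
    """During training each residual block is skipped (identity path only) with
    prob 1 - p_l; at test time every block runs, scaled by its survival prob."""
    for l, block in enumerate(blocks, start=1):
        p = survival_prob(l, num_layers)
        if train:
            if rng.random() < p:
                x = x + block(x)      # keep the residual branch
            # else: identity path only, block skipped this step
        else:
            x = x + p * block(x)      # expected contribution at test time
    return x

rng = random.Random(0)
blocks = [lambda v: 1.0, lambda v: 1.0]   # toy residual branches returning 1
train_out = stochastic_depth_forward(0.0, blocks, 2, rng, train=True)
test_out = stochastic_depth_forward(0.0, blocks, 2, rng, train=False)
```

The training-time output is one of several sub-network outputs (0, 1, or 2 here, depending on which blocks survive), while the test-time output is the deterministic expectation over those sub-networks.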
Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus's 2013 Regularization of neural networks using dropconnect (ICML) dropped individual weights rather than individual activations. The mask is applied to the weight matrix: on each forward pass, a random subset of weights is zeroed, forcing every activation to be computed from a randomly-chosen subset of its inputs. DropConnect is a strict generalisation of dropout (dropout corresponds to the special case of dropping entire columns of the weight matrix — every weight leading out of a dropped neuron gets masked). Empirically, DropConnect matches or slightly exceeds dropout on several benchmarks at comparable hyperparameter settings. It is less commonly used because its implementation is slightly more expensive (the mask is weight-matrix-sized rather than activation-sized) and the performance difference is usually small. Variational weight noise — adding Gaussian noise to the weights at each step — is a continuous relative of DropConnect that sees occasional use (Blundell et al. 2015 Weight uncertainty in neural networks).
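The weight-matrix-sized mask is the defining difference from dropout; a minimal sketch of a DropConnect linear layer over plain Python lists (names illustrative; the 1/(1-p) rescaling mirrors inverted dropout so the expectation matches the clean layer):

```python
import random

def dropconnect_linear(x, W, p, rng):
    """DropConnect: zero each individual weight with prob p on this forward
    pass, scaling survivors by 1/(1-p). y_i = sum_j mask_ij * W_ij * x_j."""
    scale = 1.0 / (1.0 - p)
    return [sum((0.0 if rng.random() < p else w * scale) * xj
                for w, xj in zip(row, x))
            for row in W]

rng = random.Random(0)
W = [[1.0, 2.0], [3.0, 4.0]]   # 2x2 weight matrix
y = dropconnect_linear([1.0, 1.0], W, p=0.5, rng=rng)
```

Dropping an entire column of W recovers dropout on the corresponding input unit, which is the strict-generalisation relationship the paragraph describes.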
Angela Fan, Edouard Grave, and Armand Joulin's 2019 Reducing transformer depth on demand with structured dropout (ICLR 2020) adapted stochastic depth to transformer architectures. Each transformer layer (an attention-plus-feedforward block with a residual connection) is randomly dropped during training with some probability p. The dual purpose: regularisation, and enabling inference-time depth reduction — a LayerDrop-trained transformer can have some of its layers removed at inference with only a small accuracy cost, producing a family of models at different speed-accuracy trade-offs from a single training run. LayerDrop has become a common component in modern transformer training, particularly for models that will be deployed across multiple compute budgets.
All three techniques share the ensemble-approximation intuition from dropout, scaled up to larger structural units. Each training step corresponds to a different sub-architecture with a different subset of layers/weights active; the network must produce good features from every sub-architecture, averaged over the training distribution of masks. This is stronger than dropout's unit-level ensembling because the variation between sub-architectures is larger — different stochastic-depth masks produce meaningfully different computation paths, not just different activations of similar paths. The regularisation is correspondingly stronger, though it can also make training less stable if the stochastic probabilities are too aggressive. The standard practice is to use moderate probabilities (0.1–0.3), often scheduled (start low, increase) over the course of training.
For very-deep convnets (more than 50 layers), stochastic depth is nearly always helpful — it regularises and speeds up training. Modern vision architectures (ConvNeXt, Vision Transformer variants) routinely include stochastic depth with rates ramping from 0.0 (early layers) to 0.2–0.3 (late layers). For large transformer language models, LayerDrop is used selectively — it can interact poorly with the autoregressive training objective if applied too aggressively — but probabilities of 0.1–0.2 are standard in several open-source pretraining stacks. DropConnect sees less use in new architectures, having largely been superseded by its structural descendants. The general rule: the deeper the network, the more benefit from these techniques; for a shallow network (fewer than 20 layers), they are rarely worth the complexity.
A class of regularisation techniques does not drop components but adds continuous random noise to them — to the input, to intermediate activations, to the weights, or to the gradients themselves. These techniques have a long history, a clean Bayesian interpretation, and a recurring role in the regulariser's toolbox. They generalise dropout (which is noise with a particular Bernoulli distribution) to arbitrary noise models, and they connect naturally to robustness and generalisation theory.
The oldest and simplest form is input noise: add Gaussian noise η ∼ 𝒩(0, σ²I) to each input before the forward pass. Chris Bishop's 1995 paper Training with noise is equivalent to Tikhonov regularization (Neural Computation) showed that for a Gaussian-MSE loss and small noise variance, training on noisy inputs is equivalent, to first order, to minimising the original loss plus a regularisation term proportional to the Jacobian of the network's output with respect to the input: L_reg ≈ L + (σ²/2) · 𝔼[‖∂f/∂x‖²]. The input-noise regulariser is a Jacobian regulariser — it penalises the sensitivity of the network's predictions to small input perturbations. This is an intuitive notion of smoothness: a function that generalises well ought to vary gracefully with its inputs rather than spike on adversarial perturbations.
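Bishop's identity can be checked numerically. For a linear model the first-order expansion is exact, so a Monte-Carlo estimate of the noisy-input loss should match the clean loss plus the Jacobian penalty; the sketch below uses illustrative toy weights and data:

```python
import random, math

random.seed(0)
w, x, y, sigma = [0.7, -1.3], [0.5, 2.0], 1.0, 0.1

def f(v):
    # Linear model: f(x) = w . x, so df/dx = w exactly.
    return sum(wi * vi for wi, vi in zip(w, v))

def mse(v):
    return 0.5 * (f(v) - y) ** 2

# Monte-Carlo estimate of E[L(x + eta)] with eta ~ N(0, sigma^2 I).
n = 100_000
noisy_loss = sum(
    mse([xi + random.gauss(0.0, sigma) for xi in x]) for _ in range(n)
) / n

# Bishop's prediction: L(x) + (sigma^2 / 2) * ||df/dx||^2.
predicted = mse(x) + 0.5 * sigma ** 2 * sum(wi * wi for wi in w)
```

For nonlinear networks the identity holds only to first order in σ², but the same numerical check shows the approximation is close for small noise.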
Noise can be added at any layer of the network: Gaussian noise on activations (equivalent in spirit to dropout with continuous rather than Bernoulli masks), Gaussian noise on weights (equivalent to Bayesian variational inference with a factorised Gaussian posterior, as in Blundell et al. 2015 Weight uncertainty in neural networks). Activation noise has seen less practical use than dropout, largely because dropout's Bernoulli formulation is cheaper to compute and usually works equivalently well. Weight noise is the foundation of Bayesian deep learning and variational inference — a substantial topic in its own right (Chapter 03 of Part VI Probabilistic Machine Learning in this compendium) — and is mostly used when the goal is predictive uncertainty rather than test accuracy.
Neelakantan, Vilnis, Le, Sutskever, Kaiser, Kurach, and Martens's 2016 Adding gradient noise improves learning for very deep networks (ICLR workshop) showed that adding Gaussian noise to the gradient before applying the SGD update — θ ← θ − η · (∇L + ε_t), with ε_t ∼ 𝒩(0, σ²_t · I) and a decaying schedule on σ_t — helps train very-deep feed-forward networks that otherwise fail to converge. Gradient noise is a cousin of SGD's natural noise: the mini-batch gradient is itself a noisy estimator of the full gradient, and adding extra noise essentially simulates a smaller effective batch. The technique is rarely used in modern training; with well-designed initialisation, batch normalisation, and good learning-rate schedules, very-deep networks are trainable without extra gradient noise. But it remains a useful technique for the rare cases where training is stuck and one wants to encourage escape from a narrow local minimum.
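On a toy quadratic loss the technique is a few lines; the schedule below follows the paper's decaying-variance form σ²_t ∝ 1/(1 + t)^0.55, but the constants and function names are illustrative:

```python
import random, math

random.seed(0)

def grad(theta):
    # Gradient of the toy loss L(theta) = 0.5 * theta^2.
    return theta

def sgd_with_gradient_noise(theta, lr=0.1, steps=500,
                            sigma0_sq=1.0, gamma=0.55):
    for t in range(steps):
        # Decaying noise scale: sigma_t^2 = sigma0^2 / (1 + t)^gamma.
        sigma = math.sqrt(sigma0_sq / (1 + t) ** gamma)
        theta -= lr * (grad(theta) + random.gauss(0.0, sigma))
    return theta

theta_final = sgd_with_gradient_noise(5.0)
```

Because the noise variance decays, the iterate settles near the minimum despite the early perturbations.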
Several papers have proposed adding the Jacobian norm directly to the loss rather than inducing it via input noise. Sokolić, Giryes, Sapiro, Rodrigues 2017 Robust large margin deep neural networks (IEEE Trans. Signal Processing) added a contractive-autoencoder-style penalty on the per-example input Jacobian. Hoffman, Roberts, Yaida 2019 Robust learning with Jacobian regularization made this efficient by using a stochastic estimator of the Frobenius norm. These methods are adversarial-robustness cousins to input noise; they are less commonly used than input noise itself because the direct penalty is expensive (computing Jacobians of deep networks is not free) and the gain over cheaper alternatives (mixup, augmentation) is modest.
The Jacobian regulariser ‖∂f/∂x‖² penalises sensitivity to random perturbations. Adversarial training (§13 below) penalises sensitivity to worst-case perturbations — which, for a network whose Jacobian is bounded, are necessarily small. This connection is made rigorous in several places (Finlay, Oberman 2018, Simon-Gabriel et al. 2019) and is one of the reasons adversarial training and input-noise training often have overlapping effects. Both encourage smoothness; they differ in whether smoothness is measured at typical perturbations (noise) or worst-case ones (adversarial).
In modern practice, explicit noise injection is rarely the first regulariser reached for. Data augmentation (§6–7), dropout (§5), and label smoothing (§8) cover most of the ground input-noise and weight-noise would cover, with better-tuned recipes and more extensive empirical support. The noise-injection literature is valuable as a theoretical frame (the Jacobian-regularisation interpretation of several techniques, the Bayesian interpretation of weight noise) and as a fallback for unusual problems where the standard toolbox underperforms. If you find yourself needing to add Gaussian noise to inputs as an explicit regulariser, it is a signal to check whether the right data augmentations have been used first.
Adversarial training is the technique of augmenting each training example with a worst-case perturbation — an input that has been modified within a small ball around the original to maximise the network's loss. It started life as a defence against adversarial examples (a specific security-flavoured failure mode of neural networks) but has come to be understood also as a general-purpose regulariser: training against worst-case perturbations produces smoother, more robust, and frequently better-generalising models.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus's 2014 Intriguing properties of neural networks (ICLR) made the original observation: standard deep networks are easily fooled by small, targeted input perturbations that are imperceptible to humans but cause confident misclassification. In the now-canonical illustration (the panda figure itself appears in Goodfellow et al.'s 2015 follow-up), a picture of a panda, perturbed by a pattern invisible to the eye, is classified as a gibbon with 99% confidence. These adversarial examples were a shock to the field — they showed that the features deep networks actually use to classify are not the features humans would expect, and they raised obvious security concerns for any deployment of neural networks in security-critical settings. The literature on adversarial attacks and defences has since become a major sub-field of its own (Chapter 05 of Part XI of this compendium).
Ian Goodfellow, Jonathon Shlens, and Christian Szegedy's 2015 Explaining and harnessing adversarial examples (ICLR) introduced the Fast Gradient Sign Method (FGSM): generate an adversarial example by taking one step of sign-of-gradient ascent on the loss, x_adv = x + ε · sign(∇_x L(f(x), y)), for a small ε (typically scaled to the image's pixel-value range). FGSM produces adversarial examples cheaply (one backward pass per example) but not very strong ones; stronger attacks use multiple steps. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu's 2018 Towards deep learning models resistant to adversarial attacks (ICLR) formalised Projected Gradient Descent (PGD) as the standard attack: iterate x ← Π_{B_ε(x_0)}(x + α · sign(∇_x L)) for k steps, where the projection Π keeps the adversarial example within a ball of radius ε around the original. PGD-based adversarial training is the standard defence: at each training step, for each example, run PGD to find an adversarial version and train on the adversarial example rather than the clean one.
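For a logistic-regression "network" the input gradient has a closed form, which makes FGSM a few-line sketch; the weights and inputs below are toy values, not a trained model:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_loss(w, b, x, y):
    # Cross-entropy for a label y in {0, 1}.
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def fgsm(w, b, x, y, eps):
    # d loss / d x = (p - y) * w for logistic regression; FGSM takes one
    # signed step of size eps per coordinate in the ascent direction.
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    sign = lambda v: (v > 0) - (v < 0)
    return [xi + eps * sign((p - y) * wi) for xi, wi in zip(x, w)]

w, b, x, y = [2.0, -1.0], 0.1, [0.8, 0.3], 1
x_adv = fgsm(w, b, x, y, eps=0.25)
loss_clean = logistic_loss(w, b, x, y)
loss_adv = logistic_loss(w, b, x_adv, y)
```

The adversarial point stays inside the ℓ∞ ball of radius ε around x while strictly increasing the loss; a PGD attack iterates this step with a projection back into the ball.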
Madry et al. framed adversarial training as a min–max optimisation: min_θ 𝔼_{(x,y)}[max_{δ : ‖δ‖ ≤ ε} L(f_θ(x + δ), y)]. The outer minimisation is standard training; the inner maximisation finds the worst-case perturbation within the ε-ball. The gradient of the min–max objective, using the envelope theorem, is just the gradient of the loss at the (approximate) inner-max point, which PGD provides. This is robust empirical risk minimisation — minimising expected worst-case loss rather than expected loss. The robustness it provides is against input perturbations of bounded norm; for perturbations outside that norm bound, or from a different threat model, the defence may not hold.
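One full min–max step — k rounds of ℓ∞ PGD for the inner maximisation, then one SGD step at the adversarial point — can be sketched for a toy logistic model. All names, constants, and the two-point dataset are illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def adversarial_train_step(w, b, x, y, eps, lr, k=5):
    alpha = eps / k                        # PGD step size within the eps-ball
    # Inner maximisation: k steps of L-infinity PGD on the input.
    xa = list(x)
    for _ in range(k):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, xa)) + b)
        g = [(p - y) * wi for wi in w]     # d loss / d x
        xa = [xi + alpha * ((gi > 0) - (gi < 0)) for xi, gi in zip(xa, g)]
        # Project back into the eps-ball around the clean input.
        xa = [x0 + min(eps, max(-eps, xi - x0)) for x0, xi in zip(x, xa)]
    # Outer minimisation: by the envelope theorem, the gradient of the
    # min-max objective is the loss gradient at the inner-max point.
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, xa)) + b)
    w = [wi - lr * (p - y) * xi for wi, xi in zip(w, xa)]
    return w, b - lr * (p - y)

# Two-point toy dataset: the robust model must separate the eps-balls.
data = [([1.0, 0.0], 1), ([-1.0, 0.0], 0)]
w, b = [0.0, 0.0], 0.0
for _ in range(50):
    for x, y in data:
        w, b = adversarial_train_step(w, b, x, y, eps=0.1, lr=0.5)
p_pos = sigmoid(w[0] * 1.0 + b)
p_neg = sigmoid(w[0] * -1.0 + b)
```

Training on the adversarial points rather than the clean ones is exactly the robust-ERM recipe; each outer step pays k extra gradient computations, which is where the 3–10× cost multiplier comes from.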
Adversarial training's primary motivation is robustness, but an important observation in the literature (Tsipras, Santurkar, Engstrom, Turner, Madry 2019 Robustness may be at odds with accuracy, ICLR; and the many replies and refinements) is that adversarial training also acts as a regulariser — it changes the learned features in a systematic way, producing models whose features are more aligned with human perception (Engstrom, Ilyas, Salman, Santurkar, Tsipras 2019), more interpretable (Shafahi et al. 2019), and sometimes better-generalising on natural-distribution test sets (though this is model- and dataset-dependent and often there is a clean-accuracy cost to robust training). For applications where worst-case perturbation robustness matters (security, safety-critical deployment), adversarial training is essentially the only established defence; as a general-purpose regulariser for clean accuracy, it is expensive (typically 3–10× the training cost of standard training) and is usually beaten by cheaper alternatives.
Zhang, Yu, Jiao, Xing, El Ghaoui, Jordan 2019 Theoretically principled trade-off between robustness and accuracy (ICML) introduced TRADES, which decomposes adversarial training into a clean-accuracy term plus a robustness-regularisation term and tunes the weight between them explicitly. Certified defences (Cohen, Rosenfeld, Kolter 2019 Certified adversarial robustness via randomized smoothing; Wong and Kolter 2018 Provable defenses against adversarial examples) aim for formal guarantees rather than empirical resistance — they produce classifiers that can be proven to be robust to all perturbations within some norm ball, not just robust to the specific attacks the defender thought to test. These are active research areas; the gap between certified guarantees and empirical robust accuracy remains a substantial open challenge.
Unless robustness to adversarial perturbations is an explicit requirement, do not use adversarial training as a regulariser — its compute cost is not worth the generalisation gains, which are matched or exceeded by mixup, label smoothing, and modern augmentation for a tiny fraction of the compute. Where robustness is required (safety-critical computer vision, some NLP applications where robustness to paraphrase matters), PGD-based adversarial training with ε set appropriately for the threat model is the standard starting point. Expect a 3–10× training slowdown and a 5–20% drop in clean accuracy relative to standard training; these are the costs of the guarantee.
If training one neural network produces a good model, training several and combining them produces a better one. The classical ensemble techniques — bagging (Breiman 1996), boosting (Freund and Schapire 1997), random forests (Breiman 2001) — are treated in Chapter 03 of Part IV. This section is about the deep-learning-specific twist: producing multiple models from one training run, by exploiting the fact that a single SGD trajectory visits many nearby solutions, all of which can be turned into a useful ensemble at modest extra cost.
The simplest ensemble is "train M models from independent random initialisations, average their predictions at inference". Lakshminarayanan, Pritzel, Blundell 2017 Simple and scalable predictive uncertainty estimation using deep ensembles (NeurIPS) showed that deep ensembles — typically M = 5 or 10 — consistently beat single models on both accuracy and calibration, at roughly M× the training cost and M× the inference cost. The ensembling works through a variance-reduction mechanism: different random seeds produce models that make different mistakes, and averaging their predictions cancels the idiosyncratic errors while reinforcing the common signal. Deep ensembles are the gold standard for out-of-distribution generalisation and predictive uncertainty, but the compute cost — both training and inference — makes them impractical for many production settings.
Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John Hopcroft, Kilian Weinberger's 2017 Snapshot ensembles: Train 1, get M for free (ICLR) observed that a cyclical learning-rate schedule produces a sequence of local minima over the course of a single training run — each time the learning rate cycles from high to low, the network converges to a nearby but distinct minimum. Save a checkpoint at each of these low-learning-rate phases, and you have M models from a single training run. Averaging their predictions at inference gives an ensemble that approaches the quality of independent random-seed ensembles, at 1/M the training cost. The technique was influential in demonstrating that much of the ensemble benefit comes from finding different solutions, not specifically from independent initialisations — and that SGD with aggressive cyclical schedules naturally finds different solutions within a single trajectory.
Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Wilson's 2018 Averaging weights leads to wider optima and better generalization (UAI) introduced Stochastic Weight Averaging (SWA). Instead of ensembling predictions at inference time, SWA averages the parameters of checkpoints collected during training, producing a single model whose parameters are the running average. The averaged model is used for inference in the usual way — no ensemble overhead at test time. Izmailov et al. showed that the averaged weights sit in a flatter region of the loss landscape than any individual checkpoint (because the average of several points in a basin is closer to the basin's centre than any individual point is), and that this flatness translates into better generalisation. SWA typically adds 0.5–1.5% on classification benchmarks over the best individual checkpoint, at negligible extra cost — just maintain a running average of the weights during training, and use it as the final model. It has become standard in competitive training recipes and is available as torch.optim.swa_utils in PyTorch.
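The weight-averaging core of SWA is just a running mean of checkpoints. The toy below shows the geometric point — the average of checkpoints scattered around a basin centre lies closer to the centre than any individual checkpoint; it is an illustration, not the torch.optim.swa_utils API:

```python
import math

def swa_update(avg, new, n_averaged):
    # Running mean of checkpoints: avg <- (avg * n + new) / (n + 1).
    return [(a * n_averaged + w) / (n_averaged + 1) for a, w in zip(avg, new)]

centre = [0.0, 0.0]                        # pretend basin centre
checkpoints = [[1.0, 0.2], [-0.8, 0.4], [0.3, -0.9], [-0.4, 0.5]]

swa = list(checkpoints[0])
for n, ckpt in enumerate(checkpoints[1:], start=1):
    swa = swa_update(swa, ckpt, n)

dist = lambda p: math.dist(p, centre)
```

Here `swa` ends at the arithmetic mean [0.025, 0.05], nearer the centre than every individual checkpoint — the one-dimensional shadow of "the average of several points in a basin is closer to the basin's centre".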
The two techniques address the same "multiple solutions from one run" opportunity in different ways: snapshot ensembling keeps the multiple solutions and averages their predictions at inference (requiring M forward passes); SWA averages the weights of the solutions and uses the single averaged model at inference (requiring one forward pass). Which is better depends on the geometry: if the different solutions sit in distinct basins, weight-space averaging lands outside either basin and performs worse than prediction-space averaging; if they sit in a common basin, weight-space averaging lands at a flat point and performs similarly or better than prediction-space averaging, at a fraction of the inference cost. Garipov et al. 2018 Loss surfaces, mode connectivity, and fast ensembling of DNNs showed that solutions found by different SGD trajectories from the same initialisation are typically connected by low-loss paths — they sit in a common "basin" in a generalised sense — making SWA competitive with snapshot ensembling for many modern architectures.
A continuous version of SWA is the Exponential Moving Average of parameters: maintain θ_EMA ← β · θ_EMA + (1 − β) · θ with β ≈ 0.999 or 0.9999 at each step, and use θ_EMA for inference. EMA is near-universal in modern generative-model training (diffusion models, VAEs, GANs) — the EMA weights produce substantially better samples than the training-step weights. It is nearly free to implement, costs one extra parameter tensor in memory, and usually helps a bit on classification too. Use EMA if there is any noise in the training trajectory; the cost is trivial, and it is one of the clearest cases of "a simple technique that practitioners really should use more".
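EMA is a one-line update. The toy below uses β = 0.99 (smaller than the production 0.999–0.9999, so a short run settles) and a stand-in for a noisy SGD trajectory; it shows the EMA weights sitting near the optimum while the raw iterates bounce around it:

```python
import random

random.seed(0)

def ema_update(theta_ema, theta, beta=0.99):
    # theta_EMA <- beta * theta_EMA + (1 - beta) * theta, elementwise.
    return [beta * e + (1.0 - beta) * t for e, t in zip(theta_ema, theta)]

optimum = 1.0
theta = [optimum]
theta_ema = list(theta)
for _ in range(5000):
    # Stand-in for a noisy SGD iterate fluctuating around the optimum.
    theta = [optimum + random.gauss(0.0, 0.5)]
    theta_ema = ema_update(theta_ema, theta)
```

The raw iterate has standard deviation 0.5 about the optimum; the EMA averages over roughly 1/(1 − β) recent steps and lands much closer.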
The empirical success of deep learning outran its theoretical explanation for most of the last decade, and to a substantial extent still does. The classical statistical-learning-theory framework — VC dimension, Rademacher complexity, PAC-Bayes, margin theory — is a beautiful and deep body of work that gives clean generalisation guarantees for simple model classes. For modern overparameterised deep networks, most of its bounds are vacuous, and understanding why is important context for the empirical techniques in this chapter.
Vladimir Vapnik and Alexey Chervonenkis's 1971 paper On the uniform convergence of relative frequencies of events to their probabilities introduced the VC dimension: the largest number of points that can be shattered (labelled arbitrarily) by functions in a given class. A higher VC dimension means a richer hypothesis class, which classical theory translates into weaker generalisation guarantees: with training sample size N, the generalisation error is bounded (with high probability) by training error plus O(√(d/N)), where d is the VC dimension. For linear classifiers in p dimensions, the VC dimension is p + 1; for a neural network with P parameters the known bounds are polynomial in P, of order O(P² log P) (Bartlett, Maiorov, Meir 1998). For a modern network with P = 10⁸ parameters trained on N = 10⁶ samples, √(d/N) ≫ 1 — the VC bound is vacuous, predicting that the network could have arbitrary test error. Since the network in fact generalises, something is wrong with either the theory or the way we are applying it.
Peter Bartlett and Shahar Mendelson's 2002 Rademacher and Gaussian complexities: Risk bounds and structural results (JMLR) refined the VC framework with Rademacher complexity, which measures how well functions in a class can fit random sign-labels. A function class with low Rademacher complexity cannot fit arbitrary random labels well, so training error on real labels is informative about test error. For linear classifiers with bounded weight norm, Rademacher complexity is bounded by a function of the weight norm rather than the parameter count — suggesting a natural escape from the VC-bound vacuousness. This was a genuine theoretical advance, but its direct application to deep networks also produces vacuous bounds: the standard estimates bound a deep network's Rademacher complexity by products of layer norms that grow exponentially with depth, and the resulting bounds are not useful. Neyshabur, Bhojanapalli, McAllester, Srebro 2017 Exploring generalization in deep learning (NeurIPS) gave a detailed analysis of why Rademacher-based bounds fail to explain deep-network generalisation.
David McAllester's 1999 PAC-Bayes framework (PAC-Bayesian model averaging, COLT) produces generalisation bounds in terms of the KL divergence between a posterior over hypotheses and a prior. Gintare Karolina Dziugaite and Daniel Roy's 2017 Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data (UAI) produced the first PAC-Bayes bound for a deep network that was non-vacuous — a bound that was less than 100% and therefore informative. Their trick was to optimise a stochastic neural network (a distribution over weights) to minimise the PAC-Bayes bound directly, producing a bound of around 20% on MNIST for a specific network. This was a breakthrough — the first non-vacuous bound for a real deep network — but the bound was still much looser than the measured test accuracy, and the technique has not (yet) generalised to give useful bounds on state-of-the-art networks on realistic datasets.
Peter Bartlett, Dylan Foster, Matus Telgarsky 2017 Spectrally-normalized margin bounds for neural networks (NeurIPS) developed a theory based on margins — the gap between a correct classification and an incorrect one — normalised by the product of the network's layer spectral norms. The intuition is that a network with large margins and small spectral norms is, in some precise sense, a "simple" function despite having many parameters, and should generalise. The bounds are still loose in absolute terms but correlate meaningfully with actual test accuracy across different architectures; in this sense they are explanatorily non-vacuous even when numerically vacuous.
As of the mid-2020s, no theoretical framework produces a tight generalisation bound for a state-of-the-art deep network on a realistic dataset. Zhang et al. 2017's challenge — explain why networks that can fit random labels perfectly nevertheless generalise on real labels — has been answered at the level of mechanism (implicit bias, flatness preferences, data-architecture alignment) but not at the level of a clean theorem with a tight constant. The neural tangent kernel analysis (Jacot, Gabriel, Hongler 2018; Arora, Du et al. 2019) produces clean theorems in the infinite-width limit, but the theorems apply to a regime (infinite width, small learning rate) that real deep networks have left behind. Modern theoretical work on generalisation is an active and fertile research area; the practitioner's short summary is that the theory is still catching up to the practice, and the practice is quite firmly established even where the theory is still in motion.
If you had to name a single phenomenon that forced the machine-learning community to rethink its classical narrative about regularisation, it would be double descent. The observation — that test error first rises as models grow beyond the training-set size, then falls again, often to levels well below the classical sweet spot — was hiding in plain sight for decades before Belkin and collaborators identified it in 2018, and it has become the organising frame for thinking about overparameterised learning ever since.
Recall from §2 the classical bias–variance picture: as model capacity increases, bias decreases but variance rises, producing a U-shaped test-error curve with a sweet spot in the middle. This picture is correct for classical low-capacity models, but it describes only what happens for P < N — models with fewer parameters than training samples. What happens when P > N, the overparameterised regime where deep networks live? The classical theory has no prediction, and for decades practitioners simply extrapolated the U-shape, expecting generalisation to get worse and worse as model capacity grew beyond the training-set size. Empirically, this extrapolation is wrong — test error drops again in the overparameterised regime, eventually reaching levels well below the classical sweet spot.
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal's 2019 Reconciling modern machine learning practice and the classical bias–variance trade-off (PNAS) was the paper that put the two halves of the picture together. They showed, on a range of classical and modern models — random Fourier features, decision trees, neural networks — that the full test-error curve is double-descent shaped: the classical U on the under-parameterised side, a peak at the interpolation threshold (where P = N, the model has just enough capacity to exactly fit the training data), and a second descent on the overparameterised side. The peak at the interpolation threshold occurs because there is exactly one model in the class that fits the training data, and that one model is highly sample-specific and generalises poorly. Beyond the threshold, there are many models that fit the training data, and the implicit regularisation of the optimiser selects one that generalises well.
Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever's 2020 Deep double descent: Where bigger models and more data hurt (ICLR) demonstrated that double descent appears not just when varying model size but also when varying training-set size and number of training epochs. Model-wise double descent is the Belkin picture: fix the data and training, vary the model size. Sample-wise double descent fixes the model and training, varies the dataset size — more data can hurt generalisation when it pushes the model through the interpolation threshold. Epoch-wise double descent fixes the model and data, varies the training duration — test error can first decrease, then increase around the interpolation epoch, then decrease again at longer training. All three are theoretically predicted and empirically observed; their existence means that the classical "more capacity, more data, longer training → worse generalisation" heuristic is dangerous in the overparameterised regime.
The mechanism of double descent is the implicit regularisation of the optimiser (§17). Under-parameterised models have a unique best fit, and more capacity extends the model's reach until it can exactly fit the training set. At the interpolation threshold, the unique interpolating model is sensitive to every training example — it has essentially no slack. Beyond the threshold, many models can interpolate, and the one chosen by the optimiser is not a random interpolator but the one with smallest weight norm (for linear models) or smallest implicit-bias-related complexity measure (for deep networks). This selected model has a smoother behaviour away from the training data and generalises better. The key theoretical result is that for linear models, the min-norm interpolator recovers the ridge-regression sweet-spot as the model grows beyond interpolation (Hastie, Montanari, Rosset, Tibshirani 2022 Surprises in high-dimensional ridgeless least squares interpolation), making the overparameterised regime effectively a "free" regularisation through large capacity.
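The min-norm mechanism is easy to exhibit for a linear model. In the NumPy sketch below (illustrative data and feature map), random Fourier features with p > n admit infinitely many exact interpolators; `np.linalg.pinv` returns the minimum-norm one, and adding any null-space direction gives another interpolator with strictly larger norm:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
x = rng.uniform(-1, 1, n)
y = np.sin(3 * x)

def design(x, p, seed=1):
    # p random Fourier features of a scalar input (feature seed fixed so
    # any train/test split would share the same feature map).
    fr = np.random.default_rng(seed)
    w, b = fr.normal(0, 3, p), fr.uniform(0, 2 * np.pi, p)
    return np.cos(np.outer(x, w) + b)

p = 50                                     # overparameterised: p > n
Phi = design(x, p)
beta = np.linalg.pinv(Phi) @ y             # minimum-norm interpolator
train_mse = float(np.mean((Phi @ beta - y) ** 2))

# Any null-space direction yields another exact interpolator, but with
# larger norm: the pinv solution is orthogonal to the null space.
null_dir = np.linalg.svd(Phi)[2][-1]       # right singular vector, sigma = 0
beta_alt = beta + 0.5 * null_dir
alt_mse = float(np.mean((Phi @ beta_alt - y) ** 2))
```

Both coefficient vectors drive the training error to numerical zero; the optimiser's "choice" among them — here made explicit by the pseudoinverse — is the entire difference between a good and a bad interpolator.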
Double descent is not universal. Heavy explicit regularisation (strong weight decay, aggressive early stopping) can smooth out the peak at the interpolation threshold, recovering a monotonically-decreasing curve. Nakkiran, Venkat, Kakade, Ma 2021 Optimal regularization can mitigate double descent showed that when the regularisation strength is tuned optimally for each model size, the peak disappears and test error becomes monotonic in model size. Real-world training recipes typically do not tune regularisation this carefully, so double descent is visible on standard ImageNet and CIFAR training curves. The practical implication: a training run where test error is rising might be suffering from double descent rather than from "true" overfitting — the correct response might be to train more, not to stop.
Three practical takeaways from double descent. First, bigger is often better even if your training data seems small — overparameterisation is not the enemy it was taken to be in the classical picture. Second, more data is usually but not always better — when a training-set-size increase pushes you across the interpolation threshold, you may transiently see worse generalisation until you push further beyond. Third, long training sometimes helps when early stopping would suggest it hurts — especially for the interpolation-threshold epoch-wise peak. All of these are counter to the classical intuition, and getting them right is the difference between a training recipe that takes the modern geometry seriously and one that still thinks in terms of the 20th-century U-curve.
The techniques in §3–§14 are all explicit regularisers — things the practitioner deliberately adds to the training recipe. This section is about the far murkier story of implicit regularisation — the fact that SGD, even without any added regularisers, systematically biases the solution it finds toward ones that generalise. The mechanism is not fully understood, but the empirical and theoretical picture is clear enough that most modern intuitions about "why deep learning generalises" rest on some version of it.
Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro's 2014 In search of the real inductive bias: On the role of implicit regularization in deep learning framed the question. Consider two networks with identical architectures and capacity, initialised differently and trained on the same data with the same optimiser. Do they converge to the same solution? Empirically, no — they converge to different solutions, and typically both solutions generalise well. This means the optimiser is selecting from among many possible zero-training-loss configurations, and the selection is systematic: different runs find different solutions but the distribution of solutions is biased toward good-generalising ones. This is implicit regularisation — the optimiser's bias toward specific solutions is what makes overparameterised networks generalise.
Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang's 2017 On large-batch training for deep learning: Generalization gap and sharp minima (ICLR) gave the most influential empirical characterisation. They showed that models trained with small batch sizes converge to flat minima — regions of the loss landscape where loss is approximately constant in all directions — while models trained with large batch sizes converge to sharp minima — regions where loss rises steeply in some direction. Flat minima generalise better than sharp ones, and Keskar et al. argued that the noise in small-batch SGD updates preferentially finds flat minima because the sharpness of the landscape makes the effective loss (the expected training-batch loss) higher at sharp minima than at flat ones. The flatness–generalisation connection is one of the most robust empirical observations in deep-learning theory.
Samuel Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc Le's 2018 Don't decay the learning rate, increase the batch size (ICLR) articulated one clean mechanism. Think of SGD as discretising a continuous-time stochastic differential equation where the noise variance is proportional to the per-example gradient covariance divided by the batch size. The equilibrium distribution of this SDE is biased toward flat regions of the loss landscape because the noise lets the parameters escape sharp minima (where even a small perturbation substantially increases loss) but leaves them at flat minima (where small perturbations do not). The effective noise scale is η / |B|, which gives the famous linear-scaling rule: doubling the batch size is equivalent to halving the learning rate, and both move the equilibrium distribution toward sharper minima. This is one reason training recipes for very-large-batch SGD explicitly add noise or use larger learning rates to recover the flat-minimum bias.
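The 1/|B| variance scaling is directly measurable: the mini-batch gradient is a mean of B per-example gradients, so doubling B should halve its variance about the full-batch gradient. A quick NumPy check on synthetic per-example gradients (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic per-example gradients with unit variance — stand-ins for the
# per-example gradient of one parameter over a large training set.
per_example = rng.normal(0.0, 1.0, size=100_000)

def minibatch_grad_variance(batch_size, n_draws=20_000):
    # Variance of the mini-batch mean over many sampled batches.
    batches = rng.choice(per_example, size=(n_draws, batch_size))
    return float(batches.mean(axis=1).var())

v32 = minibatch_grad_variance(32)
v64 = minibatch_grad_variance(64)
ratio = v32 / v64                          # roughly 2
```

At fixed learning rate η the injected noise therefore scales as η/|B|, which is the invariant behind the linear-scaling rule: double B, double η, same noise scale.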
For linear regression in the overparameterised regime, there is a clean theorem: SGD from zero initialisation converges to the minimum-norm solution among those that fit the training data (Gunasekar, Lee, Soudry, Srebro 2017 Implicit regularization in matrix factorization). For logistic regression on separable data, gradient descent converges to the max-margin solution — the one separating the classes with the widest margin, the same one an SVM would find (Soudry et al. 2018 The implicit bias of gradient descent on separable data). These are beautiful clean theorems; they show that for simple model classes the optimiser does indeed pick among interpolators by a specific rule, and that the rule matches common regularisation intuitions (small norm, large margin). For deep networks the analogous theorems are fragmentary and mostly hold only in the infinite-width limit, but the qualitative story carries over: the optimiser picks solutions with specific geometric properties, not arbitrarily.
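The linear-regression theorem is checkable in a few lines: gradient descent from zero initialisation on an underdetermined least-squares problem stays in the row space of the design matrix, so it converges to the same solution `np.linalg.pinv` returns (sizes and step count below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 12))               # 5 equations, 12 unknowns: many fits
b = rng.normal(size=5)

theta = np.zeros(12)                       # zero initialisation is essential
lr = 0.01
for _ in range(20_000):
    theta -= lr * A.T @ (A @ theta - b)    # gradient of 0.5 * ||A theta - b||^2

theta_min_norm = np.linalg.pinv(A) @ b     # the minimum-norm interpolator
gap = float(np.linalg.norm(theta - theta_min_norm))
residual = float(np.linalg.norm(A @ theta - b))
```

Starting from a nonzero initialisation breaks the result: the iterates then converge to the interpolator closest to the starting point, not the minimum-norm one — a small but telling illustration of how the optimiser, not the loss, picks the solution.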
Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio's 2017 Sharp minima can generalize for deep nets (ICML) challenged the flat-minima hypothesis on a technicality: because deep networks have reparameterisation symmetries (rescaling one layer's weights up and the next layer's weights down gives an identical function), any flat minimum can be continuously reparameterised into a sharp one without changing the function at all. This means "flatness" as measured by the Hessian's eigenvalues is not a well-defined property of a model; it depends on the parameterisation. More refined notions of flatness (relative flatness, or flatness measured in the tangent space to the symmetry orbit) recover the generalisation connection, but the naive picture of "flat = good, sharp = bad" was always more intuition than theorem.
Implicit regularisation is the answer to "why does an un-regularised network generalise?" — and the answer has practical implications. It explains why small-batch training often generalises better than large-batch training (more SGD noise, flatter minima). It explains why SWA works (averaging parameters in a basin lands at a flatter point than any individual checkpoint). It explains why Adam sometimes generalises worse than SGD-momentum on vision tasks (Adam's adaptive scaling can bias toward sharper minima in specific regimes). And it motivates a class of sharpness-aware optimisers — SAM (Foret et al. 2021 Sharpness-aware minimization for efficiently improving generalization) and its descendants — which explicitly penalise sharpness during training, achieving better generalisation on many benchmarks at a modest compute cost.
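The core of the SAM update fits in a few lines. A toy sketch on a quadratic loss (not the authors' implementation, which operates per-minibatch inside a deep-learning framework): ascend by ρ along the normalised gradient to the worst-case nearby point, take the gradient there, and apply it to the original weights:

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM update (sketch): gradient at the worst-case nearby point
    w + rho * g/||g||, applied to the ORIGINAL weights."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    return w - lr * grad_fn(w + eps)

# Toy quadratic L(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = sam_step(w, lambda v: v)
# SAM settles in a small neighbourhood of the minimum; the rho-perturbation
# prevents exact convergence but is what penalises sharp basins.
assert np.linalg.norm(w) < 0.01
```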
Regularisation is not a self-contained topic; it is the connective tissue that joins architecture, data, and optimiser into a trainable system. In every modern ML application — from computer vision to NLP to robotics to foundation models — the regularisation recipe determines what the model actually learns. This closing section sketches how the techniques of this chapter compose, and where they sit inside the broader deep-learning landscape.
A modern training recipe combines half a dozen regularisers: weight decay, dropout, data augmentation, label smoothing, mixup, EMA. Each is small on its own; together they compose into a training procedure that generalises substantially better than any individual component. The regularisers are usually thought of as independent additions to the loss or training loop, but they are not independent in their effects — weight decay and augmentation both constrain the function the network learns, and using both does not give twice the effect of using one. Zhang et al.'s 2019 Identity crisis: Memorization and generalization under extreme overparameterization and subsequent work have documented these interactions; the short summary is that empirical recipe-finding (sweep hyperparameters, measure on validation) still outperforms principled combinations in most cases.
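How two of these regularisers compose mechanically can be sketched in a few lines of numpy (illustrative only; real recipes apply these per-minibatch inside the training loop):

```python
import numpy as np

rng = np.random.default_rng(0)

def smooth(y, eps=0.1):
    """Label smoothing: move eps of the target mass to the uniform distribution."""
    return y * (1 - eps) + eps / y.shape[-1]

def mixup(x1, y1, x2, y2, lam):
    """mixup: a convex combination of two examples and their targets."""
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

x1, x2 = rng.standard_normal(8), rng.standard_normal(8)
y1 = np.array([1.0, 0.0, 0.0, 0.0])          # one-hot labels, 4 classes
y2 = np.array([0.0, 1.0, 0.0, 0.0])

lam = rng.beta(0.2, 0.2)                     # mixup coefficient
x_mix, y_mix = mixup(x1, y1, x2, y2, lam)
y_target = smooth(y_mix)                     # the two regularisers compose
assert np.isclose(y_target.sum(), 1.0)       # still a valid distribution
```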
Before the first explicit regulariser is added, architectural choices have already specified most of the network's inductive bias. Convolutional weight sharing enforces translation equivariance — a strong prior about natural images that dramatically narrows the function class without reducing expressive power. Attention's query-key-value structure enforces permutation-equivariance — a prior that matches many natural-language and sequence tasks. Residual connections enforce an identity prior, making it easy for the network to approximate the identity function before learning deviations from it. Each architectural choice can be seen as a form of regularisation: a restriction of the function class toward one that matches the domain's symmetries. Often the largest regularisation-like gains in a new domain come from finding the right architecture, not from tuning explicit regularisers on top of a generic architecture.
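Translation equivariance can be verified directly. A toy 1-D sketch with circular padding (chosen so the equivariance is exact at the boundary):

```python
import numpy as np

def circ_conv1d(x, k):
    """1-D convolution with circular padding: output length equals input length."""
    n = len(x)
    return np.array([sum(k[j] * x[(i + j) % n] for j in range(len(k)))
                     for i in range(n)])

rng = np.random.default_rng(0)
x = rng.standard_normal(32)
k = rng.standard_normal(3)

# Equivariance: convolving a shifted input equals shifting the convolved output.
assert np.allclose(circ_conv1d(np.roll(x, 5), k), np.roll(circ_conv1d(x, k), 5))
```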
Pretraining on a large auxiliary dataset and then fine-tuning on the target task is itself a form of regularisation — perhaps the most important one of the 2020s. The pretrained weights encode broad features that match the structure of natural data (images, text, speech), and fine-tuning from those weights constrains the network to stay close to this broad-feature manifold. Kumar, Raghunathan, Jones, Ma, Liang 2022 Fine-tuning can distort pretrained features and underperform out-of-distribution (ICLR) showed that aggressive fine-tuning can erode the pretrained features' quality; gentler fine-tuning recipes (low learning rate, parameter-efficient methods like LoRA) preserve the pretraining regularisation. In the foundation-model era, "how to regularise a fine-tune" has become nearly as important a question as "how to regularise a from-scratch training".
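A minimal numpy sketch of the LoRA parameterisation (shapes and scaling are illustrative): the frozen weight gets a rank-r additive correction, and initialising B to zero means fine-tuning starts exactly at the pretrained function:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 4, 8.0
W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                     # trainable factor, initialised to zero

def lora_forward(x):
    """W is never updated; only the rank-r correction B @ A is trained."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d)
# With B = 0 the update vanishes: the fine-tune begins at the pretrained model.
assert np.allclose(lora_forward(x), W @ x)
```

The regularisation is structural: however the 2·r·d entries of A and B move during fine-tuning, the learned deviation from the pretrained weight can never exceed rank r.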
One of the most surprising empirical observations of the last five years is that dataset scale is itself a regulariser — as training sets grow from millions to billions to trillions of tokens, explicit regularisation becomes less necessary, and at the largest scales, many regularisers that were standard at smaller scales actively hurt performance. Chinchilla-scale language model training (Hoffmann et al. 2022) uses essentially no dropout and only modest weight decay, because with that much data the network has no opportunity to overfit — the training set is larger than the network's capacity for memorisation. This pattern — regularisation as a compensation for data scarcity that becomes redundant at scale — is one of the more consequential observations for planning a modern training recipe. If you are training on a small dataset, regularise heavily; if you are training on a dataset approaching the model's capacity, consider whether the regularisation is still needed.
Everything in this chapter has been about in-distribution generalisation — test data drawn from the same distribution as training data. Out-of-distribution generalisation (test data from a related but different distribution) is a separate and harder problem, and the regularisers most effective for in-distribution generalisation are not always the most effective for OOD. Augmentation transfers best (building explicit invariance robustly helps OOD too); adversarial training tends to help robustness but can hurt in-distribution accuracy; simple explicit penalties like weight decay have mixed OOD effects. The frontier in 2026 is understanding which regularisers produce models whose features transfer and whose predictions are reliable on data they have never seen — a topic treated in detail in the Robustness chapter of Part VIII of this compendium.
Chapter 04 of this Part V begins the architecture-specific treatment: convolutional neural networks, their origin in LeCun's 1989 paper, the AlexNet-to-ResNet-to-ConvNeXt development arc, and the specific regularisation recipes that have become standard for convolutional training. Chapters 05 and 06 treat sequence models and transformers respectively, each with their own regularisation idioms. Part VI develops the probabilistic-machine-learning view of regularisation through Bayesian neural networks, variational inference, and the formal connection between dropout and Monte-Carlo approximate Bayesian inference. Part VII covers deep-learning theory in more detail, including the NTK analysis, neural-network feature learning, and the emerging theoretical frames for overparameterised learning. Everything in this chapter is the vocabulary for those deeper dives: know the regularisers, know why they work, and the next two volumes of this compendium will be considerably easier going.
Regularisation and generalisation sit at the intersection of classical statistical learning theory and the empirical engineering practice of modern deep learning. The list below starts with anchor textbooks that cover the whole topic coherently, then moves to the foundational papers — Lasso, dropout, AdamW's decoupling, mixup, CutMix, label smoothing, stochastic depth, adversarial examples, SWA — whose ideas are named throughout the chapter, followed by the modern theoretical work (double descent, implicit regularisation, flat minima) and the software that ships these techniques in every practitioner's toolbox. Start at the top for a working picture of the whole chapter; descend into the primary sources when a detail needs to be verified.
PyTorch: torch.nn.Dropout and its spatial variants, torch.nn.functional.dropout, torch.optim.AdamW with decoupled weight decay, torch.optim.swa_utils for SWA, and label-smoothing support in torch.nn.CrossEntropyLoss. The documentation is thorough and the source code is readable. The standard starting point for any regularisation experiment.

This page is Chapter 03 of Part V: Deep Learning Foundations. Chapter 01 introduced the mathematical primitives of a trained neural network; Chapter 02 treated training as an engineering discipline of optimisers, schedules, normalisation, and scale. Chapter 03 has been about the complementary question: given that modern networks can memorise, why do they generalise? We have catalogued the explicit regularisers (weight decay, L1, dropout, label smoothing), the data-space regularisers (augmentation, mixup, CutMix, adversarial training), the structural regularisers (stochastic depth, batch-norm-as-regulariser), the trajectory-based regularisers (early stopping, SWA, EMA), and the theoretical frames (bias–variance, double descent, implicit regularisation, flat minima) that place them in context. Chapter 04 turns to convolutional neural networks — the first architecture family built from the ground up for a specific data modality, and the historical proving ground for almost every regularisation technique on the list above. Chapters 05 and 06 treat sequence models and attention respectively, each with their own regularisation idioms. The regularisers of Chapter 03 are part of the shared vocabulary of every one of them; a reader who has worked through this chapter will find what comes next considerably easier.