Bayesian Deep Learning: neural networks that know what they don't know.
Standard deep learning produces a single set of weights and a single prediction; Bayesian deep learning produces a distribution over weights and a distribution over predictions. The promise is calibrated uncertainty: a model that is confident where the training distribution supports it and flags out-of-distribution inputs as ones it cannot judge. The challenge is computational — exact Bayesian inference over millions of weights is intractable, and every method in this chapter is a different compromise. This chapter develops the Bayesian-neural-network framework, the practical posterior approximations (variational inference, MC dropout, stochastic-gradient MCMC, Laplace), the deep-Gaussian-process family that comes at uncertainty from the other direction, and the deployment patterns where calibrated uncertainty actually matters.
Prerequisites & orientation
This chapter assumes the probability and Bayesian-reasoning background of Part I Ch 04 and Ch 07, neural-network fundamentals (Part V Ch 01–02), and basic familiarity with optimisation (Part I Ch 03). The Gaussian-process material of Part IV Ch 05 is helpful for Section 8, but the chapter develops what is needed. No prior exposure to Bayesian deep learning is assumed; readers comfortable with the difference between maximum-likelihood and maximum-a-posteriori training will be on solid ground.
Two threads run through the chapter. The first is uncertainty decomposition: a Bayesian model distinguishes aleatoric uncertainty (irreducible noise in the data) from epistemic uncertainty (ignorance that more data could resolve). This distinction is what calibrated uncertainty actually buys you, and it is what standard deep learning loses. The second thread is the approximation hierarchy: exact posterior inference is intractable, and every practical method approximates the posterior in some way. The chapter's methods span this hierarchy, from the cheapest approximations (MC dropout, deep ensembles, last-layer Laplace) to the more principled but more expensive ones (variational inference, stochastic-gradient MCMC, deep GPs). The right method for a given application depends on how much uncertainty quality you need versus how much compute you can spend.
Why Standard Deep Learning Is Overconfident
A trained neural network outputs a single number for each prediction — say a probability of 0.97 that an image contains a cat. That number looks like a probability, walks like a probability, and software pipelines treat it like a probability. The problem is that it isn't one. Standard neural networks are systematically miscalibrated, dramatically overconfident on out-of-distribution inputs, and silent when they have no business making a prediction at all. Bayesian deep learning exists because deployment under real-world distribution shift demands models that quantify what they don't know, and standard training procedures actively destroy that information.
Point estimates and what they hide
A neural network is a function y = f(x; θ); training finds a single weight setting θ̂ by minimising a loss on training data. That single setting becomes the model. For any test input x, the model outputs f(x; θ̂) — a point prediction, with no indication of how much confidence it deserves. Two models that fit the training data equally well can disagree wildly on out-of-distribution inputs, and standard training picks one of them by accident of initialisation and SGD trajectory. The hidden assumption is that the chosen weights are the only plausible ones; reality is that the data underdetermines the weights, and the gap between "the loss minimum we found" and "the set of weights compatible with the training data" is exactly what a Bayesian treatment captures.
Aleatoric vs. epistemic uncertainty
The clean way to think about uncertainty in prediction is to decompose it into two sources. Aleatoric uncertainty is irreducible noise in the data-generating process — even with infinite training data and a perfect model, you cannot predict a coin flip with probability 1. Epistemic uncertainty is uncertainty about the model itself — different model parameters fit the training data; we don't know which is right; more training data could resolve the ambiguity. Aleatoric uncertainty stays even at infinite data; epistemic shrinks toward zero as data accumulates.
Standard neural networks express aleatoric uncertainty (through their output distribution — softmax probabilities for classification, predicted variance for regression) but not epistemic. Bayesian deep learning's central contribution is an explicit treatment of epistemic uncertainty: a posterior distribution over weights yields a posterior predictive distribution that widens on inputs unlike anything in the training set. That widening is what "knowing what you don't know" means in practice.
Calibration: the empirical headline
The most measurable failure mode is miscalibration: when a model says "I'm 95% confident," does it actually get 95% of those predictions right? Modern deep networks tend to be massively overconfident — Guo et al. (2017) showed that ResNets on ImageNet routinely output 99% confidence on inputs they get wrong. The summary metric, expected calibration error (ECE), measures the gap between predicted and observed accuracy across confidence buckets; pre-calibration ResNets often have ECEs of 5–15%, whereas even a few percentage points counts as serious in safety-critical domains. Section 9 develops the calibration evaluation toolkit; the takeaway here is that the problem is real, measurable, and addressable.
Out-of-distribution failure modes
The second failure mode is silent on in-distribution data and catastrophic on out-of-distribution. A network trained on cats and dogs will confidently classify an X-ray as a dog. A medical-imaging model trained on one hospital's scanner will hallucinate confident diagnoses on another hospital's scanner. The standard softmax output gives no signal that the input is anomalous. OOD detection using uncertainty estimates — flagging inputs where the model's epistemic uncertainty is high and abstaining or routing to a human — is one of the most-cited motivations for Bayesian deep learning, and Section 9 covers the standard benchmarks and metrics for it.
Every Bayesian deep-learning method is a compromise: full posterior inference is intractable, and the chapter's methods sit on a frontier where you trade computational cost for posterior accuracy. The simplest methods (MC dropout, deep ensembles, last-layer Laplace) cost almost nothing on top of standard training and provide useful but limited uncertainty. The principled methods (variational inference, stochastic-gradient MCMC) are 5–50× more expensive but more reliable. The right choice for a deployment depends on whether downstream decisions need calibrated probabilities or just a coarse confident/uncertain signal.
Bayesian Neural Networks: The Foundation
A Bayesian neural network treats the weights themselves as random variables with a prior distribution and a posterior conditioned on data. The prediction for a new input is then an integral over the posterior — a weighted average of the predictions that all compatible weight settings make. Setting up this object precisely, even before worrying about how to approximate it, is the foundation for every method in the chapter.
The Bayesian setup
Let D = {(xᵢ, yᵢ)} be the training data, θ the weights of a neural network f(x; θ), and p(θ) a prior over weights. Bayes' rule gives the posterior over weights:

p(θ | D) = p(D | θ) p(θ) / p(D)

Predictions for a new input x* then average over this posterior, giving the posterior predictive distribution:

p(y* | x*, D) = ∫ p(y* | x*, θ) p(θ | D) dθ
The integral is the entire Bayesian content. A point estimate replaces it with p(y* | x*, θ̂) for a single θ̂; a Bayesian treatment averages over many θ values weighted by their posterior probability. When the posterior is concentrated on one mode, the two answers agree; when it is broad or multimodal, they diverge dramatically — and on out-of-distribution inputs they diverge in exactly the way that produces calibrated uncertainty.
Prior choice
The prior p(θ) is the modelling decision that classical neural-network training hides. Standard L2 regularisation (weight decay) with coefficient λ is mathematically equivalent to a zero-mean Gaussian prior with variance 1/(2λ) followed by maximum-a-posteriori estimation — so every regularised network is implicitly Bayesian, but only at the MAP point. A full Bayesian treatment uses the same prior plus a posterior approximation that captures the spread around the MAP.
For most BNN methods, an isotropic Gaussian prior 𝒩(0, σ²I) is the default. The variance σ² is a hyperparameter — too small produces a brittle prior that prevents the posterior from fitting the data; too large produces a vague prior that lets the posterior overfit. The 2020s literature also explores cold posteriors (tempering the posterior by raising it to a power greater than one, i.e. temperature T < 1, which empirically improves generalisation), function-space priors (specifying priors over the output function rather than the weights), and structured priors that favour known invariances. For most practical work, an isotropic Gaussian with reasonable variance is the right starting point.
The intractability problem
For a network with D parameters (a million for a small model, a billion for a large one), the posterior p(θ | D) is a probability distribution over a D-dimensional space. The normalising constant p(D) = ∫ p(D | θ) p(θ) dθ is an integral over that whole space, which is computationally hopeless. The posterior predictive integral is similarly intractable. Two strategies dominate the response: variational approximation (replace the true posterior with a simpler family and optimise within that family — Sections 3, 4, 8) or Monte Carlo (draw samples from the true posterior and average — Sections 5, 7). Sections 6 and 7 add a third strategy — second-order curvature approximations — that is faster than MCMC and more principled than vanilla VI for many problems.
The MAP-and-Hessian decomposition
A useful conceptual decomposition: a posterior approximation can be thought of as (MAP location, posterior spread around the MAP). Standard training already finds a MAP estimate. The Bayesian extensions in this chapter mostly add procedures for computing or sampling the spread — the second moment, or higher moments, of the posterior around that mode. This is why almost every BNN method can be added on top of an already-trained network, often without retraining: the trained network is the MAP, and the BNN method adds the rest of the posterior structure.
Variational Inference and Bayes by Backprop
Variational inference replaces the intractable posterior p(θ | D) with a tractable family qφ(θ) indexed by parameters φ, and optimises φ to make q as close to the true posterior as possible. The result is a variational posterior that is cheap to sample from and from which a posterior-predictive average can be computed by Monte Carlo. The Bayes-by-Backprop algorithm of Blundell et al. (2015) is the canonical instantiation for neural networks.
The ELBO
Variational inference minimises the KL divergence between qφ(θ) and p(θ | D), which is equivalent to maximising the evidence lower bound (ELBO):

ELBO(φ) = 𝔼θ∼qφ[ log p(D | θ) ] − KL( qφ(θ) ‖ p(θ) )

The first term rewards explaining the data; the second penalises straying from the prior. Maximising the ELBO over φ tightens a lower bound on the log evidence log p(D), hence the name.
Mean-field Gaussian and the reparameterisation trick
The standard variational family for BNNs is the mean-field Gaussian: each weight has an independent Gaussian variational distribution, qφ(θ) = ∏ᵢ 𝒩(θᵢ; μᵢ, σᵢ²). The variational parameters are the per-weight means and standard deviations — twice as many as a standard network. Sampling from q is straightforward, and the KL term against a Gaussian prior is closed-form.
The optimisation challenge is that the data-fit term involves an expectation over q of a function of θ, and the parameter φ appears inside the sampling step. The reparameterisation trick resolves this: write θ = μ + σ ⊙ ε with ε ∼ 𝒩(0, I); gradients flow through μ and σ directly, allowing standard backpropagation through the sampled weights. This trick — applicable to any location-scale variational family — is the key technical move that makes Bayes by Backprop work and underlies most modern variational deep learning.
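To make the trick concrete, here is a minimal mean-field Gaussian linear layer in PyTorch. The class name, the initialisation constants, and the unit-variance prior are illustrative choices, not the Blundell et al. reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesLinear(nn.Module):
    """Mean-field Gaussian linear layer trained with the reparameterisation trick."""
    def __init__(self, n_in, n_out):
        super().__init__()
        # Variational parameters: a mean and a (softplus-parameterised) scale per weight.
        self.w_mu = nn.Parameter(torch.randn(n_out, n_in) * 0.05)
        self.w_rho = nn.Parameter(torch.full((n_out, n_in), -5.0))
        self.b_mu = nn.Parameter(torch.zeros(n_out))
        self.b_rho = nn.Parameter(torch.full((n_out,), -5.0))
        self.prior_std = 1.0  # isotropic Gaussian prior N(0, 1), an illustrative default

    def forward(self, x):
        w_sigma = F.softplus(self.w_rho)
        b_sigma = F.softplus(self.b_rho)
        # Reparameterisation: theta = mu + sigma * eps, so gradients flow through mu, sigma.
        w = self.w_mu + w_sigma * torch.randn_like(w_sigma)
        b = self.b_mu + b_sigma * torch.randn_like(b_sigma)
        return F.linear(x, w, b)

    def kl(self):
        # Closed-form KL between N(mu, sigma^2) and the N(0, prior_std^2) prior.
        def kl_gauss(mu, sigma):
            return (torch.log(self.prior_std / sigma)
                    + (sigma ** 2 + mu ** 2) / (2 * self.prior_std ** 2) - 0.5).sum()
        return (kl_gauss(self.w_mu, F.softplus(self.w_rho))
                + kl_gauss(self.b_mu, F.softplus(self.b_rho)))
```

A per-mini-batch training loss would be the task's negative log-likelihood plus the summed kl() of all Bayesian layers, scaled by one over the number of mini-batches per epoch, so that a full epoch estimates the dataset-level ELBO without double-counting the KL term.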
Flipout and other variance-reduction tricks
Vanilla Bayes-by-Backprop has a noisy gradient because every example in a mini-batch shares the same weight sample. Flipout (Wen et al. 2018) decorrelates the weight samples within a batch by applying random sign flips, dramatically reducing gradient variance and making variational training more stable. Local reparameterisation samples the pre-activations rather than the weights, achieving similar variance reduction. Both are now standard in production VI implementations and turn what was a finicky training procedure in 2015 into a near-drop-in replacement for standard SGD.
Predictive distribution at test time
For prediction, sample K weight configurations from qφ, run the network forward on each, and average the outputs (or for classification, average the softmax probabilities). K = 10 to 100 is typical; K = 1 collapses to a single-sample stochastic prediction. The averaged prediction has both a mean (the point prediction) and a spread (the epistemic uncertainty). For regression with a learned output variance, the total predictive variance decomposes neatly into aleatoric (the average output variance across samples) plus epistemic (the variance of the means across samples) — a clean accounting that downstream decisions can use separately.
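A sketch of that accounting in NumPy, assuming the K per-sample predicted means and aleatoric variances have already been collected (the function name and array layout are illustrative):

```python
import numpy as np

def decompose_predictive_variance(means, variances):
    """Split total predictive variance for a batch of regression inputs.

    means, variances: arrays of shape (K, N) — K weight samples, N inputs.
    Total variance = E[predicted variance] (aleatoric) + Var[predicted mean] (epistemic).
    """
    means, variances = np.asarray(means), np.asarray(variances)
    point_prediction = means.mean(axis=0)
    aleatoric = variances.mean(axis=0)   # average noise level the samples predict
    epistemic = means.var(axis=0)        # disagreement between the weight samples
    return point_prediction, aleatoric, epistemic
```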
What VI gets right and what it misses
Mean-field VI is computationally cheap, easy to implement, and provides serviceable epistemic uncertainty for many production needs. It has known weaknesses: the mean-field assumption (independence between weights) systematically underestimates posterior covariance, leading to overconfident predictions even after the variational fit; the KL-divergence objective is mode-seeking, so multimodal posteriors collapse to a single mode. Structured VI with low-rank or matrix-Gaussian variational families addresses the covariance issue at higher cost; normalising-flow VI uses invertible neural networks to express richer variational distributions. For most practical work, mean-field with flipout is the right default; reach for the structured variants when uncertainty quality matters and the compute budget allows.
MC Dropout: Cheap and Cheerful Bayesian Inference
Most architectures already use dropout for regularisation. Gal and Ghahramani's 2016 paper observed that if you keep dropout active at inference time and average predictions across multiple stochastic forward passes, you get a Monte Carlo approximation to a particular variational posterior. The result is a Bayesian deep-learning method that requires no training changes, no architecture changes, and almost no extra compute — and it is one of the most-deployed approaches in production today.
Dropout as variational inference
Standard dropout randomly zeroes out hidden-unit activations during training; at inference time, dropout is disabled and weights are scaled to compensate. MC dropout instead leaves dropout active at inference time and computes the prediction as the average of many stochastic forward passes:

p(y* | x*, D) ≈ (1/K) Σₖ p(y* | x*, θ̂ₖ),  k = 1, …, K

where each θ̂ₖ is the trained network's weights under an independently sampled dropout mask.
What MC dropout gets right
The cost-benefit ratio is what makes MC dropout dominant in practice. Almost zero implementation cost (flip a flag at inference time), zero training cost, well-understood theoretical foundation, and uncertainty quality that is competitive with more expensive methods on most benchmarks. For applications where calibrated uncertainty is needed but the compute budget is tight — most production deployments — MC dropout is the right starting point.
Concrete dropout
One weakness of vanilla MC dropout is that the dropout rate is a fixed hyperparameter, and uncertainty estimates are sensitive to its value. Concrete dropout (Gal et al. 2017) makes the dropout rate learnable using the Concrete (continuous relaxation of Bernoulli) distribution, allowing the network to adjust its uncertainty per layer based on what the data needs. The result is better-calibrated uncertainty at the cost of a slightly more complex training procedure. Most modern MC-dropout deployments use concrete dropout at the most uncertainty-critical layers.
Caveats and limitations
MC dropout is not a fully general Bayesian inference method — it makes specific architectural assumptions (dropout is applied at every layer of interest) and the implicit variational distribution is restricted. The uncertainty estimates can be miscalibrated when dropout patterns interact poorly with batch normalisation, when the network is very deep, or when out-of-distribution inputs trigger pathological activation patterns. Several studies have shown that MC dropout estimates underestimate uncertainty far from training data — they capture variation among the trained model's plausible predictions but not the lack of any plausible prediction at all. For applications where this matters, deep ensembles (Section 7) or proper variational methods (Section 3) provide stronger guarantees.
Practical deployment recipe
The standard recipe in 2026: train a network with dropout layers (any modern architecture with dropout — most do); at inference time, run K = 30–100 forward passes with dropout still active; average the predictions for the point estimate and use the standard deviation across the K samples as the uncertainty estimate. For classification, average the softmax probabilities and use predictive entropy as the uncertainty measure. The whole thing is 50 lines of code on top of an existing model and provides reasonable epistemic uncertainty for most production needs.
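A minimal PyTorch version of the classification recipe. The helper names are illustrative; note the deliberate handling of batch normalisation, per the caveats above:

```python
import torch

def enable_dropout(model):
    """Re-enable only the dropout modules after model.eval(), so batch-norm
    statistics stay frozen (see the batch-norm caveat above)."""
    for m in model.modules():
        if isinstance(m, (torch.nn.Dropout, torch.nn.Dropout2d, torch.nn.Dropout3d)):
            m.train()

@torch.no_grad()
def mc_dropout_predict(model, x, k=50):
    """Average K stochastic forward passes with dropout left active."""
    model.eval()
    enable_dropout(model)
    probs = torch.stack([model(x).softmax(dim=-1) for _ in range(k)])
    mean = probs.mean(dim=0)                                     # point prediction
    entropy = -(mean * mean.clamp_min(1e-12).log()).sum(dim=-1)  # predictive entropy
    return mean, entropy
```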
Stochastic-Gradient MCMC
Markov chain Monte Carlo is the gold-standard tool for sampling from a posterior, but classical MCMC scales catastrophically with model size. Stochastic-gradient MCMC adapts the SGD machinery already used for neural-network training to draw approximate samples from the BNN posterior at near-zero overhead. The result is a principled Bayesian deep-learning method that lives squarely inside the deep-learning training loop.
SGLD: SGD plus noise equals sampling
The simplest stochastic-gradient MCMC method is stochastic-gradient Langevin dynamics (Welling & Teh 2011). The update rule is standard SGD with one modification — Gaussian noise injected at each step:

Δθₜ = (εₜ/2) ( ∇ log p(θₜ) + (N/n) Σᵢ ∇ log p(yᵢ | xᵢ, θₜ) ) + ηₜ,  ηₜ ∼ 𝒩(0, εₜ I)

where the sum runs over the n examples of the current mini-batch, N is the dataset size, and εₜ is the (decaying) step size. As εₜ → 0, the iterates converge to samples from the posterior.
Stored iterates from the latter part of training serve as the posterior samples — typically dozens to hundreds. Predictions are made by averaging across these samples, just as for VI or MC dropout. The cost over standard SGD is the noise injection (negligible) and the storage of the samples (a few network copies); the benefit is a principled posterior approximation with strong theoretical guarantees in the limit.
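A sketch of the update step, assuming the caller has already formed the mini-batch estimate of the log-posterior gradient (the prior gradient plus the likelihood gradient rescaled by N/n, as in the equation above); the helper name is illustrative:

```python
import torch

def sgld_step(params, grads_log_post, lr):
    """One SGLD update: a scaled ascent step on the estimated log-posterior,
    plus Gaussian noise whose variance equals the step size."""
    with torch.no_grad():
        for p, g in zip(params, grads_log_post):
            p.add_(0.5 * lr * g)                      # drift toward high posterior mass
            p.add_(lr ** 0.5 * torch.randn_like(p))   # Langevin noise, std = sqrt(lr)
```

Late in training, snapshot the parameter vector every few hundred steps; those snapshots are the posterior samples averaged over at test time.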
SGHMC and momentum-based variants
Stochastic-gradient Hamiltonian Monte Carlo (SGHMC) extends SGLD with momentum, mirroring the standard SGD-with-momentum update. The momentum term improves mixing on poorly-conditioned posteriors and is the more common choice for production deep-learning use. Variants like preconditioned SGLD use Adam-style adaptive preconditioners; the trade-off between principled-but-slow mixing and fast-but-biased sampling is application-specific.
Cyclical schedules and exploration
A pragmatic improvement over vanilla SGLD is the cyclical learning-rate schedule (Zhang et al. 2020): periodically warm-restart the learning rate, allowing the chain to escape from one mode and explore another. The samples stored from the low-learning-rate phases of each cycle then represent multiple modes, addressing the multimodal-posterior problem that vanilla single-trajectory MCMC has on neural-network landscapes. Cyclical SG-MCMC is now a strong baseline for BNN posterior sampling and is competitive with deep ensembles at lower compute cost.
Storage and efficient inference
Storing many full network copies is expensive. Two tricks reduce the cost. First, store only the parameters of a subset of layers (often the last layer, which carries most of the predictive uncertainty); the rest can be the deterministic MAP. Second, store low-rank or sparse perturbations from a base model rather than full copies — exploiting the fact that posterior samples are usually clustered near the MAP. The SWAG method (Section 7) takes the limit of this idea: store the running mean and a low-rank-plus-diagonal covariance of the SGD iterates, treating the result as a Gaussian posterior approximation.
When SG-MCMC is the right tool
SG-MCMC shines when (a) calibration quality matters more than headline speed, (b) the training pipeline can absorb a small per-iteration overhead, and (c) you want an asymptotically correct method rather than a heuristic. The cost relative to standard training is typically 1.5–3×, far cheaper than training a full ensemble. The catch is implementation complexity — getting the noise scaling, the schedule, and the sample-storage right requires more care than MC dropout. For research-grade Bayesian deep learning and for high-stakes deployments where uncertainty matters, SG-MCMC is the principled choice; for routine production needs, MC dropout or deep ensembles usually suffice.
Laplace Approximation and the GGN
A neural network's loss is a function of its weights. At a local minimum, the loss is well-approximated by a quadratic — the Hessian-defined "bowl" around the minimum. Translating that quadratic into a Gaussian posterior gives the Laplace approximation: an instant Bayesian deep-learning method that needs only a trained network and a curvature computation. It has become the dominant cheap-but-principled approach in the post-2020 BNN literature.
The basic idea
Suppose θ̂ is a MAP estimate — the maximum of the log-posterior L(θ) = log p(D | θ) + log p(θ). Taylor expanding L around θ̂ to second order (the first-order term vanishes at a maximum) gives:

L(θ) ≈ L(θ̂) − ½ (θ − θ̂)ᵀ H (θ − θ̂),  where H = −∇²L(θ̂)

Exponentiating the quadratic yields a Gaussian:

⇒ p(θ | D) ≈ 𝒩( θ̂, H⁻¹ )
The Laplace approximation has two appealing properties: it requires only a single trained network plus a Hessian computation (no retraining, no MCMC), and it is exact when the posterior is Gaussian, recovering classical statistics in the linear-regression limit. It is a strong default whenever you already have a trained model and want to add uncertainty post-hoc.
The Hessian problem and the GGN solution
For a network with D parameters, the full Hessian is a D×D matrix — completely infeasible to compute or store for D in the millions. Two classical approximations make Laplace tractable. The diagonal Hessian ignores cross-weight curvature and stores only the diagonal — D numbers, easy. The generalised Gauss-Newton (GGN) approximation replaces the true Hessian with the Gauss-Newton matrix, which is positive-semi-definite by construction (avoiding the indefinite-Hessian pathology near saddle points) and admits efficient block-diagonal and Kronecker-factored forms. The K-FAC Kronecker-factored approximation gives a per-layer block structure that is the standard production choice for Laplace approximations on modern networks.
Last-layer Laplace
An even cheaper variant is last-layer Laplace: keep the feature-extracting layers deterministic and apply the Laplace approximation only to the final linear layer. The last layer is a linear-regression problem given the learned features, where the Laplace approximation is exact in closed form. Empirically this captures most of the calibration benefit at near-zero cost — Kristiadi et al. (2020) showed last-layer Laplace recovers most of the OOD-detection benefit of full BNNs at <1% additional compute. For applications where you want uncertainty without changing the training pipeline, last-layer Laplace is the right starting point.
The linearised predictive
An important subtlety: applying the Laplace Gaussian directly to the network and propagating samples through it tends to give worse predictions than the deterministic network. The standard fix is the linearised predictive — at test time, treat the network as locally linear in the weights around the MAP and propagate the Gaussian uncertainty through this linearisation. The linearised predictive matches or beats deeper Bayesian methods on calibration benchmarks while preserving the deterministic network's accuracy. The 2021 laplace-torch library implements all of these variants and is the standard reference implementation in 2026.
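With laplace-torch, the post-hoc pipeline is a few calls. The sketch below follows the library's published interface (a Kronecker-factored last-layer Laplace with the default linearised predictive); argument names and defaults may differ between versions:

```python
from laplace import Laplace  # pip install laplace-torch

# model: an already-trained PyTorch classifier; train_loader: its training data.
la = Laplace(model, "classification",
             subset_of_weights="last_layer",   # Bayesian treatment of the final layer only
             hessian_structure="kron")         # Kronecker-factored curvature (K-FAC style)
la.fit(train_loader)                           # one pass to accumulate the curvature
la.optimize_prior_precision(method="marglik")  # tune the prior via marginal likelihood

# x_test: a batch of test inputs; the call returns calibrated class probabilities
# via the linearised predictive.
probs = la(x_test)
```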
Limitations and reach
Laplace is a local approximation: it captures posterior structure around one mode of the loss and misses the rest. For unimodal or near-unimodal posteriors this is fine, and most deep networks have well-defined minima where Laplace is a reasonable description. For multi-mode posteriors — relevant when the network has many equivalent solutions — Laplace will under-represent uncertainty. The fix is to combine multiple Laplace approximations centred at different modes, which leads naturally to the deep-ensemble approach of the next section.
Deep Ensembles and Implicit Bayesianism
Train the same network architecture five times with different random initialisations; average their predictions. This is the deep-ensemble method, and it is the strongest baseline in Bayesian deep learning today. Lakshminarayanan et al. (2017) introduced it as a "non-Bayesian" approach to uncertainty; subsequent work showed it functions as an approximate Bayesian inference method, sampling from different modes of the posterior. Whatever the framing, deep ensembles consistently match or beat formal BNN methods on every empirical benchmark of 2020–2026.
The deep-ensemble recipe
The procedure is simple: train M independent networks with different random seeds (different weight initialisations, different mini-batch shuffles); for prediction, average across the M networks. M = 5 is the standard choice; M = 10 or 20 helps on harder problems. Each network is trained the standard way, with no Bayesian machinery — the ensemble emerges from the diversity of the local minima that different runs converge to.
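The entire method, as a sketch; make_model and train_fn stand in for whatever architecture and training loop already exist:

```python
import torch

def train_ensemble(make_model, train_fn, m=5):
    """Train M independent members; diversity comes only from the random seed."""
    models = []
    for seed in range(m):
        torch.manual_seed(seed)   # different initialisation and mini-batch order
        models.append(train_fn(make_model()))
    return models

@torch.no_grad()
def ensemble_predict(models, x):
    """Average softmax probabilities; member disagreement signals epistemic uncertainty."""
    for net in models:
        net.eval()
    probs = torch.stack([net(x).softmax(dim=-1) for net in models])
    return probs.mean(dim=0), probs.var(dim=0).sum(dim=-1)
```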
The empirical headline is striking: deep ensembles produce calibrated probabilities (often dramatic improvements over a single network's softmax), strong OOD detection, and they accomplish this with no training-time changes. The cost is a multiple of the original training cost, which is real but not prohibitive for many applications. In every careful benchmark of BNN methods (the Wilson-Izmailov line of work, the Uncertainty Baselines suite at Google), deep ensembles either win or tie with the more elaborate methods.
Why ensembles work: diversity and posterior modes
The Bayesian-leaning interpretation: the loss landscape of a deep network has many local minima, and SGD from different initialisations finds different ones. These minima correspond to different modes of the posterior — different "explanations" of the data that fit equally well. Averaging across modes is exactly what the posterior predictive integral does. Vanilla VI and Laplace methods centre on a single mode and miss the others; deep ensembles capture multiple modes by construction.
This interpretation suggests that diversity across ensemble members is what actually drives the benefit. Methods that explicitly diversify ensembles — different architectures, different data subsets, anti-correlation losses — sometimes outperform the vanilla random-seed approach but at higher complexity. In practice, the random-seed ensemble is dominant.
SWAG: ensemble-on-the-cheap
The SWAG method (Maddox et al. 2019) approximates a deep ensemble at the cost of a single training run. The idea: the SGD trajectory in the late stages of training visits a region around a local minimum, and the empirical mean and covariance of the iterates form a Gaussian approximation to that local posterior mode. SWAG stores the running mean of late-stage iterates plus a low-rank-plus-diagonal covariance, then samples from this Gaussian at test time as if it were a posterior approximation. The result is a "single-pass ensemble" that captures within-mode uncertainty at a fraction of the ensemble cost. Combined with multiple training runs, SWAG-Ensemble (multiple SWAG models from different seeds) is among the strongest BNN methods in the literature.
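A sketch of the diagonal part of SWAG: running first and second moments of late-stage iterates, sampled as a Gaussian at test time. Full SWAG adds a low-rank deviation term, and the class name here is illustrative:

```python
import torch

class SwagDiagonal:
    """Diagonal-only SWAG: a Gaussian posterior from SGD-iterate moments."""
    def __init__(self, params):
        self.mean = [p.detach().clone() for p in params]
        self.sq_mean = [p.detach().clone() ** 2 for p in params]
        self.n = 1

    def collect(self, params):
        # Update running first and second moments with the current iterate.
        self.n += 1
        for mu, sq, p in zip(self.mean, self.sq_mean, params):
            mu.mul_((self.n - 1) / self.n).add_(p.detach() / self.n)
            sq.mul_((self.n - 1) / self.n).add_(p.detach() ** 2 / self.n)

    def sample(self):
        # Draw one weight sample from N(mean, diag(E[p^2] - E[p]^2)).
        return [mu + (sq - mu ** 2).clamp_min(1e-30).sqrt() * torch.randn_like(mu)
                for mu, sq in zip(self.mean, self.sq_mean)]
```

Call collect every few hundred steps during the late, constant-learning-rate phase of training; at test time, load each sample() into the network and average predictions as for any other method in this chapter.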
When ensembles are the right answer
Deep ensembles are the right default for any problem where (a) calibrated predictions matter, (b) the per-prediction compute can absorb the M-fold ensemble cost, and (c) the simplicity of "train M models, average outputs" has organisational appeal over more elaborate Bayesian machinery. The downside is the M-fold training cost and the M-fold storage. For very large models — billion-parameter LLMs, for instance — full ensembling becomes uneconomic and methods like SWAG, Laplace, or LoRA-based weight-perturbation ensembles dominate.
In 2026 the working hierarchy for Bayesian deep learning looks roughly: last-layer Laplace when you can't change the training pipeline; MC dropout when the network already uses dropout and you want zero changes; deep ensembles when you can afford M-fold training and want the best calibration; SG-MCMC or full VI when you need theoretically grounded posteriors. Most production deployments use one of the first three.
Deep Gaussian Processes and Neural Processes
A Gaussian process places a prior directly over functions rather than over weights, with closed-form posterior updates from data and exact predictive uncertainty. GPs are the gold standard for low-dimensional regression with calibrated uncertainty, but their O(N³) scaling and limited expressiveness on high-dimensional inputs have kept them at the margin of deep learning. Deep Gaussian processes stack GPs to gain neural-network-like depth, and neural processes bring the GP-style posterior-conditioning machinery into the neural-network world. Both families address Bayesian deep learning from the function-space side rather than the weight-space side.
From single-layer GPs to deep GPs
A single-layer GP defines a distribution over functions f: X → ℝ via a mean function and a kernel. Predictive distributions for new inputs are Gaussian with mean and variance given by closed-form formulas in terms of training data and the kernel. The deep Gaussian process (Damianou & Lawrence 2013) stacks these — the output of one GP layer becomes the input of the next — producing a hierarchical model with neural-network-like depth and GP-like calibrated uncertainty. Inference uses variational sparse approximations (inducing points) to scale to large datasets; the standard implementation is in GPflow or GPyTorch.
Deep GPs are theoretically appealing — they inherit the GP's calibrated uncertainty and gain depth's expressiveness — but practically lag deep ensembles and proper VI on large benchmarks. Where they win is on small-to-medium structured datasets where the calibration quality matters: scientific applications, Bayesian optimisation, active learning. The 2024 generation (Wilson, Salimbeni, and others) has made them genuinely competitive at moderate scales.
Deep kernel learning
An intermediate architecture is deep kernel learning (Wilson et al. 2016): a neural network produces a feature representation, and a GP operates on those features. Train the network and the GP kernel parameters jointly. The result combines neural-network feature learning with GP-style calibrated uncertainty on the final prediction layer. Deep kernel learning is roughly the function-space dual of last-layer Laplace and has become standard in scientific machine learning where small-data regimes meet high-dimensional inputs.
Neural processes: meta-learning meets uncertainty
Neural processes (Garnelo et al. 2018) take a different approach: train a neural network to map a context set of (x, y) pairs to a predictive distribution over a target set, mimicking what a GP does explicitly but learning the mapping end-to-end. The network is trained on many tasks (each with its own context and target), and at test time it conditions on the available labelled examples to produce predictions with uncertainty. The Conditional Neural Process is the simpler version; the full Neural Process adds a global latent variable for additional stochasticity; Attentive Neural Processes use attention over the context set for sharper predictions.
Neural processes scale to high-dimensional inputs that vanilla GPs cannot handle, condition naturally on variable-size context sets, and produce closed-form (Gaussian) predictive distributions for downstream use. They underperform GPs on small structured problems but excel on tasks where the context is high-dimensional (image few-shot learning, time-series imputation, meta-learning for control). The 2022–2026 literature has produced increasingly powerful NP variants — Transformer Neural Processes, Permutation-Invariant NPs — that occupy a useful middle ground between deep ensembles and full deep GPs.
Function-space versus weight-space Bayesianism
The conceptual divide between Sections 2–7 (weight-space) and this section (function-space) matters for several reasons. Function-space methods specify the prior in terms of properties of the predicted function (smoothness, periodicity, known invariances), which is often more interpretable than a Gaussian over weights. Function-space posteriors are easier to inspect — you can plot predicted-function samples directly and see what the model thinks the data implies. The cost is computational: function-space methods are expensive at scale, and the engineering tooling lags behind the weight-space alternatives. Most practical Bayesian deep learning is weight-space; function-space methods are reserved for cases where their advantages clearly justify the cost.
Calibration, OOD, and Practical Evaluation
A Bayesian deep-learning method's worth is measured by how its uncertainty estimates perform on downstream tasks: calibration, out-of-distribution detection, selective prediction, active learning. The benchmark machinery is mature enough by 2026 that any production deployment can be evaluated on a standard suite of metrics; the right metrics depend on what the uncertainty is being used for.
Calibration metrics
The headline calibration metric is the expected calibration error (ECE): bin predictions by confidence, compute observed accuracy per bin, and average the gap between confidence and accuracy weighted by bin population. ECE near zero means a model's "I'm 90% confident" predictions are right 90% of the time. The standard binning is 10–15 equal-width bins; adaptive binning (equal-mass bins) reduces noise on imbalanced distributions. ECE has known weaknesses — it can hide compensating errors and depends on bin choices — but it remains the universal headline metric.
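ECE in a few lines of NumPy, with the equal-width binning described above (a sketch; the adaptive equal-mass variant changes only the bin edges):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Population-weighted mean |accuracy - confidence| over confidence bins."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap   # weight by bin population
    return ece
```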
The negative log-likelihood (NLL) on the test set is a complementary metric: it rewards both correct point predictions and well-calibrated probabilities. A model with low NLL is doing both jobs well; a model with high accuracy but high NLL has miscalibrated probabilities. NLL is the standard Bayesian-deep-learning paper metric for regression and classification alike.
Temperature scaling and other post-hoc calibration
The simplest calibration fix is temperature scaling: divide the pre-softmax logits by a learned temperature T, with T chosen on a held-out validation set to minimise NLL. Guo et al. (2017) showed temperature scaling reduces ECE on ResNets from 5–15% to under 1% with a single learnable scalar. It is now standard practice — even non-Bayesian production deployments routinely apply temperature scaling as a final calibration step. The catch: temperature scaling is a global recalibration that does nothing for OOD inputs, so it complements but does not replace proper uncertainty estimation.
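Fitting the temperature is a one-parameter optimisation on held-out logits; a PyTorch sketch, with T parameterised through its logarithm so it stays positive:

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, max_iter=100):
    """Learn a scalar T minimising NLL of (logits / T) on a validation set."""
    log_t = torch.zeros(1, requires_grad=True)
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)

    def nll():
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    opt.step(nll)
    return log_t.exp().item()  # divide test logits by this T before the softmax
```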
Out-of-distribution detection
OOD detection asks: given an input, is it from the training distribution or not? The standard benchmark setup trains on dataset A, evaluates on inputs drawn from datasets A (in-distribution) and B (OOD), and computes the AUROC of an OOD score (typically uncertainty or 1 − max-softmax) at separating the two. Strong methods achieve AUROC > 0.9 on near-OOD pairs (CIFAR-10 vs. CIFAR-100) and > 0.95 on far-OOD pairs (CIFAR vs. SVHN). Bayesian methods with epistemic uncertainty reliably outperform softmax confidence on this task; the gap is typically 5–15 AUROC points.
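The benchmark computation is short once uncertainty scores exist for both input sets; a sketch using scikit-learn:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ood_auroc(scores_in, scores_out):
    """AUROC of an uncertainty score at separating in-distribution (label 0)
    from out-of-distribution (label 1) inputs."""
    scores_in, scores_out = np.asarray(scores_in), np.asarray(scores_out)
    scores = np.concatenate([scores_in, scores_out])
    labels = np.concatenate([np.zeros(len(scores_in)), np.ones(len(scores_out))])
    return roc_auc_score(labels, scores)
```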
Selective prediction
The deployment use of uncertainty is often selective prediction: output a prediction when confident, abstain (escalate to a human or a safer fallback) when uncertain. Evaluation plots coverage (fraction of inputs predicted on) against error rate on the predicted subset; sweeping the abstention threshold traces an accuracy-coverage curve, and the area under the curve summarises performance. A well-calibrated uncertainty estimator buys higher accuracy at any coverage level than a poorly-calibrated one, and selective-prediction curves are the most useful single picture for deployment-relevant comparisons.
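Tracing the curve amounts to sorting by confidence and computing cumulative accuracy; a NumPy sketch:

```python
import numpy as np

def accuracy_coverage_curve(confidences, correct):
    """Accuracy on the most-confident fraction of inputs, at every coverage level."""
    order = np.argsort(-np.asarray(confidences))   # most confident first
    correct = np.asarray(correct, dtype=float)[order]
    n = len(correct)
    coverage = np.arange(1, n + 1) / n
    accuracy = np.cumsum(correct) / np.arange(1, n + 1)
    return coverage, accuracy   # the area under this curve is the summary metric
```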
Active learning and label-efficient settings
Active learning is another natural use of epistemic uncertainty: when labelling is expensive, prioritise the inputs the model is most uncertain about. The BALD acquisition function (Houlsby et al. 2011) uses mutual information between predictions and weights as the uncertainty measure and has been the standard choice for Bayesian active learning since its introduction. The 2020s literature has refined this with batch-aware variants (BatchBALD), application-specific scoring, and integrations with foundation-model fine-tuning. The headline: epistemic uncertainty quality predicts active-learning effectiveness, so if BALD-style queries don't improve over random labelling, the underlying uncertainty estimates are probably bad.
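Given K posterior samples of class probabilities, whether from MC dropout, an ensemble, or any other method in this chapter, BALD is two entropy computations. A NumPy sketch; the (K, N, C) array layout is an assumption:

```python
import numpy as np

def bald_scores(probs):
    """BALD: mutual information between the prediction and the weights.

    probs: array of shape (K, N, C) — K weight samples, N inputs, C classes.
    Score = entropy of the mean prediction minus the mean per-sample entropy.
    """
    probs = np.asarray(probs)
    eps = 1e-12
    mean = probs.mean(axis=0)                                    # (N, C)
    entropy_of_mean = -(mean * np.log(mean + eps)).sum(axis=-1)  # total uncertainty
    mean_entropy = -(probs * np.log(probs + eps)).sum(axis=-1).mean(axis=0)  # aleatoric part
    return entropy_of_mean - mean_entropy   # high score = epistemically uncertain
```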
Applications and Frontier
Bayesian deep learning shows up wherever calibrated uncertainty is the difference between a useful product and a dangerous one. The deployment pattern varies by domain — medical imaging weighs OOD robustness most heavily, autonomous driving the speed-quality trade-off, scientific applications the small-data regime — but the toolkit of Sections 2–9 covers most production needs. This final section surveys the application landscape and the frontiers where Bayesian methods are reshaping the field.
Medical imaging and diagnostic decision support
Medical imaging is perhaps the canonical application: a model that confidently misclassifies a malignant lesion is worse than one that flags the same lesion as uncertain and routes to a radiologist. Deep ensembles and MC dropout are both common in production radiology pipelines, with selective-prediction thresholds set so that the most uncertain 20–40% of cases receive human review. Regulatory pathways (FDA's Software-as-a-Medical-Device guidance, the EU AI Act's high-risk-system requirements) increasingly treat calibrated uncertainty as a deployment requirement rather than a nice-to-have.
Autonomous driving and robotics
Self-driving systems use uncertainty estimates for decision-making in two ways: as inputs to safety controllers (which take more conservative actions when perception is uncertain) and as triggers for fallback policies (hand control to a remote operator or perform a safe stop when the perception stack reports high uncertainty). Lightweight methods (MC dropout, last-layer Laplace) dominate in this domain because of the latency budget — full ensembles are usually too expensive. The Tesla, Waymo, and Cruise stacks all incorporate uncertainty-aware perception modules, though specific methods are commercially confidential.
Scientific applications
Scientific machine learning — molecular property prediction, climate emulators, particle-physics simulations — is a heavy user of Bayesian deep learning because the data is small, the cost of bad predictions is high, and the downstream tasks (Bayesian optimisation, active learning, decision-theoretic experimental design) explicitly require calibrated uncertainty. Deep kernel learning and deep GPs dominate here; deep ensembles are common as well. The DeepMind AlphaFold confidence scores (pLDDT, PAE) are essentially Bayesian deep-learning uncertainty estimates trained alongside the structural prediction network.
Large language models and uncertainty
The 2023–2026 literature on LLM uncertainty is messy and active. Standard sampling-based uncertainty (sample multiple completions, measure agreement) is the dominant practical approach. Verbalised confidence — train the LLM to output its own confidence score — has produced surprisingly well-calibrated estimates in some studies and badly miscalibrated ones in others. Token-level entropy and log-probability statistics are weaker uncertainty signals than sampling-based methods but cheaper. The frontier here includes integrating retrieval as an uncertainty-reduction mechanism, training models to abstain explicitly on uncertain queries, and theory work on whether LLMs even have well-defined posterior beliefs in the BNN sense.
Frontier methods
Several frontiers are particularly active in 2026. Diffusion-model uncertainty: extending Bayesian methods to score-based generative models, where the natural object is an uncertainty over the score function rather than over discrete weights. Bayesian foundation models: pretraining transformers with explicit uncertainty quantification baked in, rather than as a post-hoc add-on. Function-space inference for neural networks: methods that specify priors over predicted functions and infer in function space directly, sidestepping the weight-space intractability altogether. Calibrated decision-making under model uncertainty: the meta-question of how to combine BNN outputs with downstream optimisation in a way that respects the uncertainty rather than collapsing it.
What this chapter does not cover
Several adjacent areas are out of scope. Classical Bayesian linear models, generalised linear mixed models, and Bayesian hierarchical regression are well-developed in the statistics literature with their own software ecosystem (Stan, PyMC, brms) and warrant separate treatment. The Bayesian-non-parametric family (Dirichlet processes, Indian buffet processes, Pitman-Yor) extends the ideas of this chapter to infinite-dimensional priors but lives mostly outside the deep-learning literature. Probabilistic programming languages and the corresponding inference engines (Pyro, NumPyro, Edward) are the practical tooling for many of the chapter's methods but warrant their own treatment. The reinforcement-learning literature on Bayesian RL and Thompson sampling intersects this chapter but is conventionally treated through the RL lens. And the ample literature on conformal prediction — a frequentist alternative to Bayesian uncertainty that produces calibrated prediction sets without distributional assumptions — is increasingly common in production but deserves its own chapter.
Further reading
Foundational papers and surveys for Bayesian deep learning. The Murphy textbook, Neal's thesis, and a couple of the modern papers below make the right starting kit for practitioners.
- Bayesian Learning for Neural Networks. The foundational text. Establishes the Bayesian-neural-network framework, develops Hamiltonian Monte Carlo for BNN posterior sampling, and proves the connection between infinite-width Bayesian neural networks and Gaussian processes. Thirty years on, the conceptual material remains current and the thesis is the definitive starting point for anyone serious about the foundations. The reference for Bayesian neural networks.
- Weight Uncertainty in Neural Networks (Bayes by Backprop). The Bayes-by-Backprop paper. Introduces variational inference with the reparameterisation trick for neural network weights, making mean-field VI tractable at modern scale. The natural reading after Neal's thesis for the modern variational story. Pair with Wen et al. 2018 (Flipout) for the variance-reduction extension. The reference for variational BNNs.
- Dropout as a Bayesian Approximation. The MC dropout paper. Establishes the equivalence between dropout-with-multiple-forward-passes and a particular variational posterior, turning standard dropout networks into approximate Bayesian models for free. The single most-cited "applied Bayesian deep learning" paper and the right reading for anyone deploying uncertainty estimates in production. The reference for MC dropout.
- Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. The deep-ensembles paper. Establishes that random-seed ensembles of standard networks produce strong calibrated uncertainty without any Bayesian machinery, setting the empirical baseline that subsequent BNN methods have struggled to beat. The natural reading for understanding why "non-Bayesian" methods often dominate Bayesian benchmarks. The reference for deep ensembles.
- Bayesian Deep Learning and a Probabilistic Perspective of Generalization. A modern reference paper. Frames deep ensembles as approximate Bayesian inference, develops the SWAG method for ensemble-on-the-cheap, and presents extensive empirical comparisons across BNN methods. The right reading for understanding how the field thinks about the relationship between Bayesian inference and standard deep-learning generalisation. The reference for the modern unified view.
- Laplace Redux — Effortless Bayesian Deep Learning. The Laplace-redux paper plus the laplace-torch library. Develops the modern Laplace approximation toolkit (full, Kronecker-factored, last-layer, linearised predictive) and provides clean software for applying Laplace post-hoc to any trained network. The natural reading for production-deployable Laplace and the canonical implementation reference. The reference for modern Laplace approximations.
- Bayesian Learning via Stochastic Gradient Langevin Dynamics. The SGLD paper. Introduces the stochastic-gradient MCMC framework that adapts SGD to draw approximate posterior samples at minimal extra cost, opening MCMC to modern-scale neural networks. Pair with Zhang et al. 2020 (cyclical SG-MCMC) for the modern multimodal extension. The reference for stochastic-gradient MCMC.
- On Calibration of Modern Neural Networks. The calibration paper. Documents the systematic miscalibration of modern deep networks, popularises the expected-calibration-error metric, and shows that simple temperature scaling fixes most of the problem. The right reading for understanding why standard softmax outputs cannot be trusted as probabilities. The reference for calibration of deep networks.
- Conditional Neural Processes / Neural Processes. The Neural Process papers. Introduce a family of meta-learning models that combine the GP-style condition-on-context-set machinery with neural-network expressiveness, scaling Bayesian-style predictive uncertainty to high-dimensional inputs. Pair with the GPyTorch documentation for the deep-GP and deep-kernel-learning side. The reference for function-space neural Bayesianism.