Part I · Mathematical Foundations · Chapter 07

Bayesian reasoning, uncertainty made quantitative.

Bayesian statistics begins with a single commitment: probability describes degrees of belief. From that premise follows a rule for updating beliefs in light of data (Bayes' theorem), a language for modelling structured problems (priors, likelihoods, hierarchies), and a computational toolkit (MCMC, variational inference, Laplace) for turning abstract posteriors into actionable predictions. This chapter develops that machinery from Bayes' theorem through to variational autoencoders and Thompson sampling — everywhere modern machine learning has, quietly, become Bayesian.

How to read this chapter

Sections build on one another, so read in order the first time through. The prose carries the argument, the equations make it precise, the diagrams keep it concrete. Where this chapter overlaps with the Probability and Statistics chapters — particularly on likelihoods and estimators — the earlier material is assumed, but not required for a first reading.

Notation: $\theta$ is an unknown parameter (possibly a vector); $y$ is observed data; $p(\theta)$ is the prior, $p(y \mid \theta)$ the likelihood, $p(\theta \mid y)$ the posterior, $p(y)$ the marginal likelihood or evidence; $\tilde{y}$ is a future, unobserved datum with predictive distribution $p(\tilde{y} \mid y)$. Hats denote point estimates ($\hat{\theta}_{\text{MAP}}$); bold lowercase are vectors; capitals are matrices. Code names appear in monospace. Key terms when first defined appear in a slightly different color.

Contents

  1. Why Bayesian reasoning? · Motivation
  2. Bayes' theorem · The update rule
  3. Priors · Where knowledge enters
  4. The posterior and its summaries · What you actually report
  5. Conjugate families · Closed-form updates
  6. Bayesian linear regression · The worked example
  7. Hierarchical models · Partial pooling and shrinkage
  8. Model comparison · Bayes factors, BIC, LOO
  9. Exchangeability and de Finetti · Foundations
  10. Markov chain Monte Carlo · Sampling posteriors
  11. Variational inference · Optimisation, not sampling
  12. The Laplace approximation · Gaussians around MAP
  13. Prior and posterior predictive checks · Model criticism
  14. Bayesian decision theory · Posterior to action
  15. Where it shows up in ML · Payoff
Section 01

Why Bayesian reasoning?

Probability theory gives us a mathematics of chance. Statistics gives us a machinery for estimation and testing. Bayesian reasoning makes one further commitment — that probability describes degrees of belief about unknown quantities — and builds an entire worldview from that single premise.

In the frequentist chapter we estimated a parameter $\theta$ by picking the value that made the data most likely: the maximum-likelihood estimator. The parameter itself had no distribution; it was a fixed unknown, and our estimator was the random thing, re-rolled under hypothetical replications of the experiment. That framing is excellent for repeated, controllable experiments. It is awkward almost everywhere else.

The Bayesian alternative is to treat $\theta$ itself as a random variable — not because it is physically random, but because we are uncertain about it. Before seeing data, we summarise that uncertainty with a prior distribution $p(\theta)$. The data $y$ arrive through a likelihood $p(y \mid \theta)$. Bayes' theorem combines them into a posterior $p(\theta \mid y)$, which is our new, updated description of uncertainty. That is the whole story. Everything else — priors, conjugacy, MCMC, variational inference, Bayesian neural networks — is machinery for doing that update in cases where the integrals get hard.

Key idea

Bayesian inference is a rule for updating beliefs. The rule is Bayes' theorem; the beliefs are probability distributions over unknown quantities; the update is conditioning on observed data. If you accept that probabilities can represent uncertainty, the rest is bookkeeping.

The payoff is large. Bayesian reasoning handles small samples, hierarchical structure, and missing data gracefully; it produces probabilistic predictions rather than point estimates; it gives honest uncertainty estimates for downstream decisions; and it connects cleanly to information theory, decision theory, and many modern deep-learning ideas — variational autoencoders, Thompson sampling, Bayesian optimisation, active learning, and the uncertainty quantification that production ML systems increasingly require.

The cost is that the integrals are usually intractable, which is why so much of this chapter is about how to do Bayesian inference in practice. The conceptual apparatus is small; the computational ingenuity is where the subject lives.

Section 02

Bayes' theorem.

A one-line identity. Two and a half centuries of controversy about how to interpret it. A worldview that falls out either way.

Start from the definition of conditional probability. For events $A$ and $B$ with $p(B) > 0$,

$$ p(A \mid B) = \frac{p(A, B)}{p(B)} \quad \text{and equally} \quad p(B \mid A) = \frac{p(A, B)}{p(A)}. $$

Equating the two expressions for the joint $p(A, B)$ gives Bayes' theorem:

$$ p(A \mid B) \;=\; \frac{p(B \mid A)\, p(A)}{p(B)}. $$

Applied to inference — replace $A$ with the parameter $\theta$ and $B$ with the observed data $y$ — we get the form every Bayesian argument starts from:

$$ \underbrace{p(\theta \mid y)}_{\text{posterior}} \;=\; \frac{\overbrace{p(y \mid \theta)}^{\text{likelihood}}\; \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(y)}_{\text{evidence}}}. $$

The denominator $p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta$ is the marginal likelihood or evidence. It is a constant with respect to $\theta$, so for most purposes we work with the unnormalised form

$$ p(\theta \mid y) \;\propto\; p(y \mid \theta)\, p(\theta), $$

reading the symbol "$\propto$" as "equal up to a normalising constant we will fix later, if we need to." The proportionality is often all you need: it is enough to decide whether $\theta_1$ or $\theta_2$ is more probable, to find the mode, or to run an MCMC sampler.

The disease-testing example

A textbook classic, worth doing once: a disease has prevalence $0.1\%$ in the population. A test has sensitivity $99\%$ (true positive rate) and specificity $99\%$ (true negative rate). You test positive. What is the probability you have the disease?

$$ p(D \mid +) = \frac{p(+ \mid D)\, p(D)}{p(+)} = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.01 \times 0.999} \approx 0.09. $$

About $9\%$. The posterior is dominated by the prior: because the disease is rare, almost all positives are false positives. This is the canonical demonstration that Bayes' theorem is not intuitive — human reasoning routinely ignores base rates — and that the denominator, which counts every way a positive result can arise, matters a great deal when the prior is lopsided.
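The arithmetic is worth checking mechanically; a minimal sketch:

```python
# Numerical check of the disease-testing example: prevalence 0.1%,
# sensitivity 99%, specificity 99%.
prevalence = 0.001
sensitivity = 0.99          # p(+ | D)
false_pos_rate = 0.01       # 1 - specificity = p(+ | no D)

# Law of total probability for the evidence p(+)
p_pos = sensitivity * prevalence + false_pos_rate * (1 - prevalence)

# Bayes' theorem
p_disease_given_pos = sensitivity * prevalence / p_pos
print(round(p_disease_given_pos, 3))  # 0.09
```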

A vocabulary reminder

The likelihood $p(y \mid \theta)$ is a function of $\theta$ for fixed $y$. It is not a probability distribution over $\theta$ — it does not integrate to $1$ as $\theta$ varies. The posterior $p(\theta \mid y)$ is a probability distribution over $\theta$. The difference is the prior and the normalising constant.

Section 03

Priors.

The prior is where the Bayesian framework asks you to say what you already believe. This is a feature, not a bug — but it is also the single most contested choice you will make.

A prior $p(\theta)$ is any probability distribution over the parameter space. You can pick it to reflect genuine knowledge ("previous experiments have given us roughly a Gaussian centred on $0.3$"), to encode mild regularisation ("the parameter is probably small"), or to be as uninformative as possible. Different choices lead to different posteriors — often similar when data are plentiful, sometimes wildly different when data are scarce.

A rough taxonomy

Informative priors use domain knowledge: a Normal$(0.3, 0.05^2)$ prior on a conversion rate because the last six A/B tests landed there. These are appropriate when you have real prior information and are willing to defend it.

Weakly informative priors encode "common sense" without strong commitment — a Normal$(0, 10^2)$ on a regression coefficient, say, which rules out physically impossible magnitudes without preferring any plausible value. The Stan community, following Gelman, recommends these as the default in applied work.

Non-informative priors try to let the data speak for themselves. The uniform prior $p(\theta) \propto 1$ is the simplest, but it is not invariant under reparameterisation — a uniform prior on $\theta$ is not a uniform prior on $\log \theta$. Jeffreys priors fix this by defining $p(\theta) \propto \sqrt{\det F(\theta)}$, where $F$ is the Fisher information matrix; because they are invariant under smooth reparameterisation, they are the mathematically principled non-informative choice.

Improper priors — priors that do not integrate to one, such as $p(\theta) \propto 1$ on an unbounded parameter — are often fine in practice as long as the resulting posterior is proper. But they can lead to paradoxes in model comparison, and are best avoided in hierarchical models.

Empirical Bayes estimates the prior from the data itself — for example, fitting a Gaussian to the observed variation across groups and using it as a prior on each group's parameter. This is philosophically awkward (you are using the data twice) but computationally convenient; in the hierarchical section we will see it emerge naturally as an approximation to the fully Bayesian approach.

Priors as regularisation

Any prior that peaks at zero acts as a regulariser on the posterior mode. A Gaussian prior on regression weights gives ridge regression; a Laplace prior gives the lasso. Bayesian MAP estimation is regularised maximum likelihood; the regulariser is $-\log p(\theta)$.

In practice, the question "what prior should I use?" is usually better phrased as "what does my model say before seeing data, and does that look reasonable?" Simulating from the prior predictive distribution (Section 13) makes this question concrete and answerable.

Section 04

The posterior, and how to summarise it.

Once you have a posterior distribution in hand, you have — strictly speaking — a complete answer to every inferential question. The trick is reducing that distribution to numbers a human can act on.

The full Bayesian answer to "what have I learned about $\theta$?" is the distribution $p(\theta \mid y)$. In practice you usually want scalar summaries. The standard choices:

The posterior mean $\mathbb{E}[\theta \mid y] = \int \theta\, p(\theta \mid y)\, d\theta$ minimises squared-error loss. It is the Bayes estimator under quadratic loss, and the most common "best single number" summary.

The posterior median minimises absolute-error loss; it is more robust to skewness and heavy tails than the mean.

The posterior mode, or maximum a posteriori (MAP) estimate, maximises $p(\theta \mid y)$. It is particularly convenient because for point estimation it only requires optimisation, not integration:

$$ \hat{\theta}_{\text{MAP}} = \arg\max_\theta \, [\, \log p(y \mid \theta) + \log p(\theta) \,]. $$

This is maximum likelihood plus a log-prior penalty — the same object we called "regularised MLE" in the statistics chapter. MAP is fast and often good but it loses the one thing that makes Bayesian inference distinctive: a full distribution. A thin, sharply peaked posterior and a wide, flat one can share the same mode.

Credible intervals

A $95\%$ credible interval is any interval $[a, b]$ with $\int_a^b p(\theta \mid y)\, d\theta = 0.95$. It has a satisfyingly direct interpretation: "given the model and data, there is a $95\%$ probability that $\theta$ is between $a$ and $b$." This is what most people think a frequentist confidence interval means — and what only a Bayesian credible interval actually means.

Two common choices fix the arbitrariness of $[a, b]$. The equal-tailed interval uses the $2.5\%$ and $97.5\%$ posterior quantiles; it is easy to compute from MCMC samples but can be misleading when the posterior is skewed. The highest posterior density (HPD) interval is the shortest interval containing $95\%$ of the mass, and always contains the mode. HPD is the principled choice; equal-tailed is what everyone actually reports.
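Both intervals reduce to sorting posterior draws; a sketch using a skewed Beta(9, 5) distribution as the stand-in posterior:

```python
import random

# Equal-tailed vs. HPD 95% intervals from posterior draws. Both come
# down to sorting the samples once.
rng = random.Random(0)
draws = sorted(rng.betavariate(9, 5) for _ in range(20000))
n, k = len(draws), int(0.95 * 20000)

# Equal-tailed: the 2.5% and 97.5% quantiles.
equal_tailed = (draws[int(0.025 * n)], draws[int(0.975 * n)])

# HPD: the shortest window that covers 95% of the sorted draws.
i = min(range(n - k), key=lambda j: draws[j + k] - draws[j])
hpd = (draws[i], draws[i + k])

# For a skewed posterior the HPD interval is (weakly) shorter.
print(hpd[1] - hpd[0] <= equal_tailed[1] - equal_tailed[0])  # True
```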

The posterior predictive distribution

Predictions about future data $\tilde{y}$ marginalise out the parameter:

$$ p(\tilde{y} \mid y) \;=\; \int p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta. $$

This is the posterior predictive distribution. It incorporates both observational noise (through $p(\tilde{y} \mid \theta)$) and parameter uncertainty (through $p(\theta \mid y)$) into a single forecast. A frequentist plug-in prediction $p(\tilde{y} \mid \hat{\theta})$ uses only the first; it is overconfident by exactly the variance that averaging over $\theta$ would have added.

Section 05

Conjugate families.

For certain pairings of prior and likelihood, the posterior lives in the same family as the prior — and updating reduces to adding numbers. Conjugacy is the closed-form shortcut that made Bayesian inference tractable before computers could sample.

A prior $p(\theta)$ is conjugate to a likelihood $p(y \mid \theta)$ if the posterior $p(\theta \mid y)$ is in the same parametric family as the prior. The practical payoff is enormous: there is no integral to do. You just update the parameters of the prior using a simple rule.

Beta–Binomial

The classic example. Observe $y$ successes in $n$ trials with unknown success probability $\theta$. The likelihood is $p(y \mid \theta) = \binom{n}{y} \theta^y (1-\theta)^{n-y}$. With a Beta prior $\theta \sim \mathrm{Beta}(\alpha, \beta)$,

$$ p(\theta \mid y) \;\propto\; \theta^{\alpha + y - 1}(1-\theta)^{\beta + n - y - 1}, $$

which is $\mathrm{Beta}(\alpha + y,\, \beta + n - y)$. The update is "add successes to $\alpha$, failures to $\beta$" — exactly what intuition says it should be. The prior parameters $\alpha$ and $\beta$ act as pseudo-counts: they behave like $\alpha$ previously observed successes and $\beta$ failures.
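The update is small enough to write out directly; a sketch with a Beta(2, 2) prior and 7 successes in 10 trials:

```python
# Beta-Binomial conjugate update: pseudo-counts in, pseudo-counts out.
alpha0, beta0 = 2.0, 2.0     # prior Beta(2, 2)
successes, n = 7, 10

alpha_post = alpha0 + successes        # add observed successes
beta_post = beta0 + (n - successes)    # add observed failures

# Posterior is Beta(9, 5); its mean blends the prior mean and the data.
posterior_mean = alpha_post / (alpha_post + beta_post)
print(alpha_post, beta_post, round(posterior_mean, 3))  # 9.0 5.0 0.643
```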

[Figure: Beta(2, 2) prior, likelihood for 7 successes in 10 trials, and Beta(9, 5) posterior, plotted over $\theta \in [0, 1]$.]

Other standard pairs

Likelihood | Conjugate prior | Posterior
Binomial($n, \theta$) | Beta($\alpha, \beta$) | Beta($\alpha + y,\, \beta + n - y$)
Poisson($\lambda$) | Gamma($\alpha, \beta$) | Gamma($\alpha + \sum y_i,\, \beta + n$)
Normal($\mu, \sigma^2$), $\sigma^2$ known | Normal($\mu_0, \tau_0^2$) | Normal (updated mean and variance)
Normal($\mu, \sigma^2$), $\mu$ known | Inverse-Gamma($\alpha, \beta$) | Inverse-Gamma (updated $\alpha, \beta$)
Multinomial | Dirichlet($\alpha_1, \dots, \alpha_k$) | Dirichlet (add counts)
Exponential family, general | Natural conjugate | Same family

Every exponential-family likelihood has a natural conjugate prior, and the update is a linear operation on natural parameters. This is more than a coincidence: conjugacy is the exponential-family's way of being well-behaved under Bayesian updating, and it is the reason so many classical Bayesian results have closed forms.

Why conjugacy still matters

In the age of MCMC and variational inference, conjugacy is no longer strictly necessary — we can sample any posterior. But conjugate updates are exact, fast, and form the building blocks inside Gibbs samplers and variational methods. They also give the sanity-check formulas you can verify against an MCMC run: if the two disagree, you have a bug.

Section 06

Bayesian linear regression.

The smallest non-trivial example where conjugacy, posteriors, predictions, and the connection to classical regularisation all show up in a single closed form.

The frequentist model is $y = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ with $\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 I)$. The Bayesian version adds a prior on $\boldsymbol{\beta}$. Take a Gaussian prior $\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \tau^2 I)$ for concreteness and assume $\sigma^2$ is known.

The posterior is Gaussian (conjugate!), with mean and covariance

$$ \boldsymbol{\beta} \mid y \;\sim\; \mathcal{N}(\boldsymbol{\mu}_n, \Sigma_n), \qquad \Sigma_n^{-1} = \frac{1}{\sigma^2} X^\top X + \frac{1}{\tau^2} I, $$ $$ \boldsymbol{\mu}_n = \Sigma_n \left( \frac{1}{\sigma^2} X^\top y \right). $$

The posterior mean is the ridge regression estimator: $(X^\top X + \lambda I)^{-1} X^\top y$ with $\lambda = \sigma^2/\tau^2$. This is the exact correspondence we saw in the statistics chapter, now with the prior showing where the penalty came from. A Laplace prior would have given the lasso. Classical regularisation is Bayesian MAP estimation; the hyperparameter is the prior variance.
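The correspondence can be checked numerically in a few lines; a sketch with synthetic data (NumPy assumed, all numbers illustrative):

```python
import numpy as np

# Closed-form posterior for Bayesian linear regression with known noise
# variance sigma^2 and prior N(0, tau^2 I), checked against ridge.
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
sigma, tau = 0.5, 1.0
y = X @ np.array([1.0, -2.0, 0.5]) + sigma * rng.normal(size=n)

# Posterior covariance and mean, exactly as in the formulas above
Sigma_n = np.linalg.inv(X.T @ X / sigma**2 + np.eye(d) / tau**2)
mu_n = Sigma_n @ (X.T @ y / sigma**2)

# Ridge estimator with lambda = sigma^2 / tau^2
lam = sigma**2 / tau**2
ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(np.allclose(mu_n, ridge))  # True
```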

Predictive distribution

For a new input $\mathbf{x}^*$, the posterior predictive is

$$ p(y^* \mid \mathbf{x}^*, y) \;=\; \mathcal{N}\!\left(\mathbf{x}^{*\top}\boldsymbol{\mu}_n,\; \sigma^2 + \mathbf{x}^{*\top}\Sigma_n\mathbf{x}^*\right). $$

The predictive variance has two components. The $\sigma^2$ term is the irreducible observation noise — you cannot predict it away. The $\mathbf{x}^{*\top}\Sigma_n \mathbf{x}^*$ term is the uncertainty in $\boldsymbol{\beta}$, propagated through the new input. It is largest in regions where the training data are sparse or the directions of $\mathbf{x}^*$ are poorly constrained by $X^\top X$. This is the Bayesian answer to "how confident am I at this input?" — and one reason Bayesian regression is still the first stop for small-data problems with meaningful uncertainty requirements.
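Both components are visible numerically; a small sketch (synthetic 1-D data with an intercept, NumPy assumed) showing the parameter-uncertainty term growing away from the training inputs:

```python
import numpy as np

# Predictive variance sigma^2 + x*^T Sigma_n x* for a 1-D regression
# with intercept, fit on inputs confined to [-1, 1].
rng = np.random.default_rng(0)
n = 30
x = rng.uniform(-1, 1, size=n)
X = np.column_stack([np.ones(n), x])
sigma, tau = 0.3, 1.0
y = 1.0 + 2.0 * x + sigma * rng.normal(size=n)

Sigma_n = np.linalg.inv(X.T @ X / sigma**2 + np.eye(2) / tau**2)

def pred_var(x_star):
    x_vec = np.array([1.0, x_star])
    return sigma**2 + x_vec @ Sigma_n @ x_vec

# Uncertainty is smallest inside the data and grows outside it.
print(pred_var(0.0) < pred_var(5.0))  # True
```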

The conjugate prior on $\sigma^2$

In full generality you also put a prior on the noise variance: $\sigma^2 \sim \text{Inverse-Gamma}$, giving a Normal-Inverse-Gamma joint prior on $(\boldsymbol{\beta}, \sigma^2)$. The posterior is again Normal-Inverse-Gamma; integrating out $\sigma^2$ gives a Student-$t$ predictive distribution with heavier tails than the known-$\sigma$ version. This is where the $t$-distribution earns its name in Bayesian inference — as the marginal predictive under Gaussian-Inverse-Gamma.

Section 07

Hierarchical models.

The single most useful construct in modern Bayesian statistics. Groups share information through a higher-level distribution, and the result — partial pooling — is almost always better than estimating each group alone or pooling them into one.

You have measurements of some quantity in $J$ groups: conversion rates at $J$ stores, test scores at $J$ schools, batting averages for $J$ players. Two extreme analyses present themselves. No pooling: estimate each group's parameter $\theta_j$ from its own data, treat them as unrelated. Complete pooling: assume $\theta_1 = \cdots = \theta_J$, pool all the data, and estimate one number. The first over-fits; the second under-fits.

A hierarchical (or multilevel) model sits between them. Each group has its own $\theta_j$, but the $\theta_j$ themselves come from a shared distribution:

$$ \theta_j \mid \mu, \tau \;\sim\; \mathcal{N}(\mu, \tau^2), \qquad y_{ij} \mid \theta_j \;\sim\; p(y \mid \theta_j), $$

with priors on $\mu$ and $\tau$. The hyperparameters $\mu$ and $\tau$ are learned from the data just like any other parameters. The posterior for each $\theta_j$ is then pulled — shrunk — toward the overall mean $\mu$, by an amount that depends on how much data that group has and how spread out the groups appear to be.

The 8-schools problem

Gelman's canonical example: an SAT coaching programme is evaluated separately at $8$ schools. Each school reports a point estimate $\hat{\theta}_j$ and standard error $\sigma_j$. The no-pooling analysis is noisy; the complete-pooling analysis washes out any real differences. The hierarchical analysis shrinks each school's estimate toward the grand mean by a factor depending on $\tau$, producing estimates that are simultaneously more accurate (on average) and more honest about uncertainty than either extreme.
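The shrinkage arithmetic is a one-liner once $\mu$ and $\tau$ are fixed. A sketch with the published 8-schools numbers (Rubin, 1981); note that this conditions on chosen values of $\mu$ and $\tau$ rather than learning them, which the full hierarchical model would do:

```python
import numpy as np

# Conditional-posterior shrinkage for the 8-schools data (point
# estimates and standard errors as reported in Rubin, 1981). Given the
# hypermean mu and between-school scale tau, each school's conditional
# posterior mean is a precision-weighted blend of its estimate and mu.
y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])

def shrunk_means(mu, tau):
    w = (1 / sigma**2) / (1 / sigma**2 + 1 / tau**2)  # weight on the data
    return w * y + (1 - w) * mu

# Small tau pulls every school close to mu; large tau recovers the
# no-pooling estimates.
print(np.round(shrunk_means(8.0, tau=2.0), 1))
print(np.round(shrunk_means(8.0, tau=100.0), 1))
```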

Shrinkage is not ad hoc

Shrinkage falls out of Bayesian inference automatically. You do not have to add it; you get it by writing down the joint model and turning the crank. The James-Stein estimator, empirical Bayes, ridge regression, and random-effects ANOVA are all versions of the same hierarchical shrinkage, viewed from different angles.

Hierarchical models are the most practically useful Bayesian technique — especially when data are unbalanced across groups, when some groups have very little data, or when borrowing strength across groups is itself the modelling goal. Gelman & Hill's Data Analysis Using Regression and Multilevel/Hierarchical Models is essentially a whole book about the technique, and deservedly so.

Centred vs. non-centred parameterisation

A numerical footnote that matters in practice: the obvious parameterisation $\theta_j \sim \mathcal{N}(\mu, \tau^2)$ (called centred) produces posteriors that are often pathologically hard for samplers to explore when $\tau$ is small. The non-centred form writes $\theta_j = \mu + \tau z_j$ with $z_j \sim \mathcal{N}(0, 1)$, and samplers handle it much better. Any serious Stan or PyMC model uses non-centred by default.

Section 08

Model comparison.

Bayesian inference is elegant at comparing competing hypotheses — in principle. In practice, model comparison is the place where the framework most frequently fails to deliver on its promise, and where practitioners have retreated to predictive-accuracy criteria instead.

Suppose two models, $M_1$ and $M_2$, each with its own parameters $\theta_1$ and $\theta_2$. Bayes' theorem at the model level gives posterior model probabilities

$$ p(M_k \mid y) \;\propto\; p(y \mid M_k)\, p(M_k), $$

and the ratio between two models is the Bayes factor,

$$ \mathrm{BF}_{12} \;=\; \frac{p(y \mid M_1)}{p(y \mid M_2)} \;=\; \frac{\int p(y \mid \theta_1, M_1)\, p(\theta_1 \mid M_1)\, d\theta_1}{\int p(y \mid \theta_2, M_2)\, p(\theta_2 \mid M_2)\, d\theta_2}. $$

Each numerator and denominator is a marginal likelihood — an integral over the full parameter space — which automatically penalises complexity: a model that spreads prior mass over regions of parameter space where the likelihood turns out to be low pays for that spread in the integral. In this sense Bayes factors are the principled Bayesian version of "Occam's razor": simpler models that still fit the data win.

Why it often does not work

Bayes factors are extraordinarily sensitive to prior choice. Doubling the width (standard deviation) of a vague prior roughly halves the prior density at the MLE, and with it the marginal likelihood — and so the Bayes factor. This is unproblematic when priors are strongly informed; it is a disaster when priors are vague. For non-informative priors — the exact situation in which you might want a "neutral" comparison — Bayes factors can be arbitrary.

The Bayesian Information Criterion (BIC) is a large-sample approximation,

$$ \mathrm{BIC} \;=\; -2 \log p(y \mid \hat{\theta}_{\text{MLE}}) + k \log n, $$

with $k$ parameters and $n$ observations. Lower is better; differences of $2$–$6$ indicate positive evidence, $6$–$10$ strong, $>10$ very strong (Raftery's scale). BIC is a rough approximation to $-2\log p(y \mid M)$ but comes for free once you have the MLE. It is widely used and widely criticised.

What modern practitioners use instead

Increasingly, Bayesian model comparison reaches for predictive criteria that compare models on how well they would forecast held-out data: cross-validation, WAIC (the widely applicable information criterion), and PSIS-LOO (Pareto-smoothed importance-sampling leave-one-out cross-validation), the latter two estimated directly from posterior draws.

These are less theoretically pure than Bayes factors but vastly more usable. They are the tools you will actually reach for in practice.

Section 09

Exchangeability and de Finetti.

Every probabilistic model we have written so far has assumed i.i.d. data given parameters. Why? The deep answer is de Finetti's theorem: it is the only consistent way to treat a sequence of observations where the order does not matter.

A sequence $(Y_1, Y_2, \dots)$ is exchangeable if its joint distribution is invariant under permutations — the order of the observations tells us nothing. This is a weaker assumption than i.i.d., which additionally requires independence.

De Finetti's theorem says that every infinitely exchangeable sequence can be written as conditionally i.i.d. given some (possibly infinite-dimensional) parameter $\theta$ drawn from a distribution $\pi$:

$$ p(y_1, \dots, y_n) \;=\; \int \left[ \prod_{i=1}^n p(y_i \mid \theta) \right] \pi(\theta)\, d\theta. $$

Read this carefully. The left side is a joint distribution over observations — no explicit parameter. The right side is the Bayesian decomposition — prior, likelihood, marginal. The theorem says that if you believe in exchangeability, you are already a Bayesian, even if you did not know it. A prior and a likelihood are not optional modelling choices; they are the unique representation of any exchangeable sequence.

What this bought us

De Finetti is the philosophical justification for why the "parameter plus i.i.d. data" story is not an arbitrary modelling choice but a consequence of a very weak symmetry assumption. It is also the reason mixture models, Gaussian processes, and Dirichlet processes can be understood as giving up the assumption of a finite-dimensional $\theta$ while preserving exchangeability.

The theorem generalises. Partial exchangeability (order within groups does not matter, between groups does) gives hierarchical models. Markov exchangeability gives latent-state models. Every time you make a modelling choice, there is a symmetry assumption behind it, and a de Finetti-style representation that corresponds.

Section 10

Markov chain Monte Carlo.

When you cannot compute the posterior in closed form — which is almost always — you sample from it. The workhorse family of algorithms for doing this is MCMC, and it is responsible for the modern practicality of Bayesian statistics.

The goal: draw samples $\theta^{(1)}, \theta^{(2)}, \dots$ whose long-run distribution is the posterior $p(\theta \mid y)$. Because we only need the unnormalised density $p(y \mid \theta)\, p(\theta)$ — the constant of proportionality cancels — we can sample from any posterior we can evaluate up to a constant, which is essentially all of them.

Metropolis-Hastings

The simplest general MCMC algorithm. Given a current state $\theta$, propose a new state $\theta'$ from a proposal density $q(\theta' \mid \theta)$. Accept it with probability

$$ \alpha(\theta \to \theta') \;=\; \min\!\left(1,\; \frac{p(\theta' \mid y)\, q(\theta \mid \theta')}{p(\theta \mid y)\, q(\theta' \mid \theta)}\right). $$

Otherwise stay at $\theta$. Under mild conditions, the sequence of states converges to the posterior. A symmetric random-walk proposal $q(\theta' \mid \theta) = q(\theta \mid \theta')$ simplifies the ratio to the posterior ratio alone.

Metropolis-Hastings always works in principle. In practice, high-dimensional posteriors force impossibly small step sizes (or exponentially low acceptance rates), and mixing becomes glacial.
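The algorithm fits in a dozen lines. A sketch of a random-walk Metropolis sampler targeting the Beta(9, 5) posterior from the conjugacy section, using only the unnormalised log-density:

```python
import math
import random

# Random-walk Metropolis on an unnormalised log-posterior:
# log p(theta | y) = 8 log(theta) + 4 log(1 - theta) + const  (Beta(9, 5)).
def log_post(theta):
    if theta <= 0.0 or theta >= 1.0:
        return float("-inf")        # zero density outside (0, 1)
    return 8.0 * math.log(theta) + 4.0 * math.log(1.0 - theta)

rng = random.Random(42)
theta, samples = 0.5, []
for _ in range(20000):
    proposal = theta + rng.gauss(0.0, 0.1)   # symmetric proposal
    # Accept with probability min(1, posterior ratio)
    if rng.random() < math.exp(min(0.0, log_post(proposal) - log_post(theta))):
        theta = proposal
    samples.append(theta)

burned = samples[2000:]                      # discard burn-in
print(round(sum(burned) / len(burned), 2))   # near the exact mean 9/14
```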

Gibbs sampling

A special case: cycle through parameters one at a time, sampling each from its full conditional $p(\theta_j \mid \theta_{-j}, y)$. When full conditionals are conjugate (the usual case in hierarchical models), each step is an exact draw and acceptance is trivially 1. Gibbs sampling was the original BUGS/JAGS algorithm and is still the right tool for conditionally conjugate models.

Hamiltonian Monte Carlo and NUTS

Modern Bayesian computation runs on Hamiltonian Monte Carlo (HMC) and its self-tuning variant, the No-U-Turn Sampler (NUTS). HMC treats the negative log-posterior as a potential energy, augments the parameters with momentum variables drawn from a Gaussian, and simulates Hamiltonian dynamics to propose moves that travel long distances through parameter space while still respecting the target distribution.

The upshot: HMC/NUTS scales to posteriors with thousands of dimensions, handles strong correlation structure naturally, and requires only gradients of the log-posterior — computed automatically by modern PPLs (Stan, PyMC, NumPyro, Turing.jl) via reverse-mode autodiff, the same technique that trains neural networks.

Diagnostics that matter

Every MCMC run needs convergence diagnostics. The key ones: $\hat{R}$ (Gelman-Rubin) should be $\lesssim 1.01$ across all chains; effective sample size should be at least a few hundred per parameter; divergence counts in HMC should be zero; trace plots should look like hairy caterpillars, not trending lines. A model that passes all four diagnostics is not guaranteed correct, but a model that fails any of them is definitely wrong.

Section 11

Variational inference.

MCMC is exact but slow. Variational inference is fast but approximate. The trade-off has become central to modern probabilistic machine learning — and the ELBO, the variational objective, is the exact same loss that trains every VAE.

Variational inference turns posterior inference into optimisation. Pick a family $\mathcal{Q}$ of tractable distributions $q(\theta; \phi)$ parameterised by $\phi$, and find the member closest to the true posterior in KL divergence:

$$ \phi^* \;=\; \arg\min_\phi\, D_{\text{KL}}\!\left(q(\theta; \phi) \,\|\, p(\theta \mid y)\right). $$

The true posterior is unknown, so we cannot compute the KL directly. But a little algebra shows that minimising this KL is equivalent to maximising a lower bound on the log marginal likelihood — the Evidence Lower BOund, or ELBO:

$$ \log p(y) \;\geq\; \mathcal{L}(\phi) \;=\; \mathbb{E}_{q(\theta; \phi)}[\log p(y, \theta)] - \mathbb{E}_{q(\theta; \phi)}[\log q(\theta; \phi)]. $$

The two terms have separate interpretations. The first is the expected log-joint, which rewards fit to the data. The second, $-\mathbb{E}_{q(\theta; \phi)}[\log q(\theta; \phi)]$, is the entropy of $q$, which pulls $q$ toward being broad (and so toward covering more of the posterior mass). Together they balance fit against regularisation.
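The bound can be checked by Monte Carlo on the Beta-Binomial model from the conjugacy section (Beta(2, 2) prior, 7 successes in 10 trials), where both the exact posterior Beta(9, 5) and the exact evidence are available in closed form; a sketch, with the binomial constant dropped consistently from both sides:

```python
import math
import random

# Monte Carlo check that the ELBO lower-bounds the log evidence. With q
# equal to the exact Beta(9, 5) posterior the bound is tight; any other
# q falls short by exactly the KL divergence.
def log_beta_fn(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def elbo(qa, qb, n_samples=20000, seed=0):
    rng = random.Random(seed)
    a0, b0, y, n = 2.0, 2.0, 7, 10
    total = 0.0
    for _ in range(n_samples):
        t = rng.betavariate(qa, qb)
        log_joint = ((y + a0 - 1) * math.log(t)
                     + (n - y + b0 - 1) * math.log(1 - t)
                     - log_beta_fn(a0, b0))
        log_q = ((qa - 1) * math.log(t) + (qb - 1) * math.log(1 - t)
                 - log_beta_fn(qa, qb))
        total += log_joint - log_q
    return total / n_samples

# Exact log p(y), binomial constant dropped
log_evidence = log_beta_fn(9, 5) - log_beta_fn(2, 2)

print(abs(elbo(9, 5) - log_evidence) < 1e-9)  # tight at the posterior: True
print(elbo(3, 3) < log_evidence)              # strictly below otherwise: True
```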

Mean-field and beyond

The simplest family assumes $q$ factorises across parameters: $q(\theta) = \prod_j q_j(\theta_j)$. This is mean-field VI, and it leads to closed-form coordinate updates that resemble Gibbs sampling but optimise rather than sample. Mean-field chronically underestimates posterior variance because it cannot capture correlations — a warning that is worth taking seriously in practice.

Richer families fix this: structured variational families, normalising flows as $q$, amortised VI where $\phi$ is the output of a neural network, and hybrid approaches that mix VI and MCMC.

VAEs are variational inference

In a variational autoencoder, the "latent" $z$ plays the role of $\theta$, the "encoder" is an amortised $q(z \mid x; \phi)$ parameterised by a neural network, the "decoder" is $p(x \mid z; \theta)$, and the loss is the ELBO. The reparameterisation trick — $z = \mu(x) + \sigma(x) \odot \epsilon$ — is what makes the expectation differentiable, and the rest is stochastic gradient descent on a neural-network-parameterised variational approximation. This is the single clearest point of contact between classical Bayesian inference and modern deep learning.

Section 12

The Laplace approximation.

Simpler than VI, older than MCMC, and still the first thing to reach for when you need a rough Gaussian approximation to a posterior you can already optimise.

The Laplace approximation fits a Gaussian to the posterior by matching the mode and local curvature. Around the MAP estimate $\hat{\theta}$, a second-order Taylor expansion of the log-posterior gives

$$ \log p(\theta \mid y) \;\approx\; \log p(\hat{\theta} \mid y) - \tfrac{1}{2}(\theta - \hat{\theta})^\top H (\theta - \hat{\theta}), $$

where $H = -\nabla^2 \log p(\theta \mid y)\big|_{\hat{\theta}}$ is the Hessian of the negative log-posterior. Exponentiating gives

$$ p(\theta \mid y) \;\approx\; \mathcal{N}(\hat{\theta},\, H^{-1}). $$

That is the whole method. Run an optimiser to find $\hat{\theta}$; compute the Hessian once at $\hat{\theta}$; report a Gaussian. It reproduces the posterior exactly when the log-posterior is quadratic (as in conjugate Gaussian cases), and is surprisingly accurate when the posterior is unimodal and approximately Gaussian near the mode — which, by the Bernstein-von Mises theorem, it is in large samples for regular models.
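For the Beta(9, 5) posterior from the conjugacy section, every quantity has a closed form, so the recipe can be checked end to end:

```python
import math

# Laplace approximation to the Beta(9, 5) posterior, whose unnormalised
# log-density is 8*log(theta) + 4*log(1 - theta).
a, b = 9.0, 5.0

# MAP: the mode of Beta(a, b) is (a - 1) / (a + b - 2).
theta_map = (a - 1) / (a + b - 2)

# Curvature: H = -d^2/dtheta^2 log p = (a-1)/theta^2 + (b-1)/(1-theta)^2.
H = (a - 1) / theta_map**2 + (b - 1) / (1 - theta_map)**2

laplace_sd = math.sqrt(1 / H)
exact_sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
print(round(theta_map, 3), round(laplace_sd, 3), round(exact_sd, 3))
# 0.667 0.136 0.124  -- close, but Laplace misses the skew
```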

Connection to Fisher information

As $n \to \infty$, the Hessian of the negative log-posterior converges to $n F(\theta_0)$, where $F$ is the Fisher information matrix. The Laplace posterior becomes $\mathcal{N}(\hat{\theta}, (n F)^{-1})$ — exactly the asymptotic sampling distribution of the MLE. This is the Bernstein-von Mises theorem in practice: the Bayesian posterior and the frequentist sampling distribution agree in the limit, both governed by Fisher information.

Why it still matters

Laplace is the basis for most practical approximations in Bayesian deep learning. Bayesian neural networks with the Laplace approximation treat the Hessian at a pre-trained MAP as the posterior covariance — which, because the Hessian is enormous, has to be further approximated (diagonal, K-FAC, last-layer). The story there is Laplace approximations all the way down; the underlying idea is what you just read.

Section 13

Prior and posterior predictive checks.

A Bayesian model is a story about how data are generated. The single most useful thing you can do with it — before reporting a number, publishing a paper, or deploying a system — is to run that story forward and see if it produces data that look real.

A prior predictive check draws parameters from the prior, simulates data given those parameters, and plots the result. If your prior implies that a conversion rate of $0.7$ or $-0.4$ is routine, you will see that immediately — and fix the prior before the model ever touches data.

$$ \tilde{y} \sim p(\tilde{y}) = \int p(\tilde{y} \mid \theta)\, p(\theta)\, d\theta. $$
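A sketch of such a check for the conversion-rate example, with invented priors: a Normal prior placed directly on the probability scale (the careless choice) against a Beta prior with the same mean:

```python
import numpy as np

rng = np.random.default_rng(1)
n_draws, n_customers = 5000, 100

# Careless prior: rate ~ Normal(0.1, 0.3), placed directly on the probability scale.
rates = rng.normal(0.1, 0.3, size=n_draws)
frac_impossible = np.mean((rates < 0) | (rates > 1))
print(frac_impossible)  # ~0.37: over a third of prior draws are not probabilities at all

# Prior predictive check with a sensible prior: rate ~ Beta(2, 18), same mean 0.1.
rates = rng.beta(2, 18, size=n_draws)
y_sim = rng.binomial(n_customers, rates)  # simulated conversion counts per 100 customers
print(np.percentile(y_sim, [5, 50, 95]))  # do these look like plausible data?
```

Nothing here touched the observed data; the check is entirely about what the prior implies.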

A posterior predictive check does the same thing with the fitted posterior: draw $\theta$ from $p(\theta \mid y)$, simulate $\tilde{y}$, and compare its distribution — means, medians, extremes, group-wise summaries, whatever is domain-appropriate — to the observed data.

$$ \tilde{y} \sim p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta. $$

The comparison does not need to be formal. A scatter of simulated versus observed group means, overlaid histograms of simulated versus observed values, a quantile-quantile plot — any of these will catch gross misspecification. Gelman advocates "model criticism" as the third leg of Bayesian workflow (alongside prior and posterior inference), and in applied work it is where most real debugging happens.
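A sketch with made-up group counts: fit a single pooled rate, replicate whole datasets from the posterior predictive, and ask how often the replications reproduce the most extreme group:

```python
import numpy as np

rng = np.random.default_rng(2)

# Observed successes out of n=20 in each of 8 groups (one group is an outlier).
y_obs = np.array([3, 4, 2, 5, 3, 4, 3, 15])
n = 20

# Pooled Binomial model with a Beta(1, 1) prior: the posterior is Beta in closed form.
a_post = 1 + y_obs.sum()
b_post = 1 + n * len(y_obs) - y_obs.sum()

# Posterior predictive replication: draw theta, simulate a full replicated dataset.
theta = rng.beta(a_post, b_post, size=4000)
y_rep = rng.binomial(n, theta[:, None], size=(4000, len(y_obs)))

# Test statistic: the maximum group count. How often do replications match the data?
T_obs = y_obs.max()
T_rep = y_rep.max(axis=1)
p_value = np.mean(T_rep >= T_obs)
print(p_value)  # near 0: the pooled model cannot reproduce the outlying group
```

The verdict is not "reject the model" so much as "the pooling assumption is wrong here", which points directly back at the hierarchical models of earlier sections.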

Bayes's unsung virtue

Because a Bayesian model defines a full joint distribution over parameters and data, it defines a way to simulate synthetic data. Simulation is inspection. There is no frequentist analogue that is nearly as general.

Section 14

Bayesian decision theory.

A posterior distribution is the right object for describing uncertainty, but decisions need a single action. Decision theory prescribes how to turn a posterior into an action by minimising expected loss.

Suppose you must choose an action $a$ from a space $\mathcal{A}$, and the cost of taking action $a$ when the true parameter is $\theta$ is measured by a loss function $L(\theta, a)$. The posterior expected loss (sometimes called the posterior risk) of action $a$ is

$$ \rho(a) \;=\; \mathbb{E}_{\theta \mid y}[L(\theta, a)] \;=\; \int L(\theta, a)\, p(\theta \mid y)\, d\theta. $$

The Bayes action is $a^* = \arg\min_a \rho(a)$. Different losses give different optimal actions:

- Squared-error loss $L(\theta, a) = (\theta - a)^2$ is minimised by the posterior mean.
- Absolute-error loss $L(\theta, a) = |\theta - a|$ is minimised by the posterior median.
- Zero-one loss is minimised by the posterior mode, the MAP estimate.
- Asymmetric misclassification loss, with cost $c_{\text{FP}}$ for a false positive and $c_{\text{FN}}$ for a false negative, is minimised by predicting the positive class whenever its posterior probability exceeds $c_{\text{FP}} / (c_{\text{FP}} + c_{\text{FN}})$.

This last case is the correct answer to "what should I predict if false positives and false negatives cost different amounts?" and is a clean example of how Bayesian decision theory connects posterior inference to downstream economics.
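A sketch of the machinery with invented costs and a stand-in posterior (Beta samples standing in for MCMC output):

```python
import numpy as np

rng = np.random.default_rng(3)

# Posterior over a fraud probability, represented by samples (e.g. from MCMC).
theta_samples = rng.beta(3, 9, size=10_000)  # stand-in posterior, mean ~0.25

C_FP, C_FN = 1.0, 8.0  # cost of flagging a good transaction vs missing a fraud

def expected_loss(action, theta):
    """Posterior expected loss of each action, averaged over posterior samples."""
    if action == "flag":
        return np.mean(C_FP * (1 - theta))  # pay C_FP whenever there is no fraud
    return np.mean(C_FN * theta)            # pay C_FN whenever there is fraud

losses = {a: expected_loss(a, theta_samples) for a in ("flag", "pass")}
bayes_action = min(losses, key=losses.get)

# Equivalent threshold rule: flag iff P(fraud | y) > C_FP / (C_FP + C_FN).
threshold = C_FP / (C_FP + C_FN)
print(bayes_action, losses, "threshold:", threshold)
```

With these costs the threshold is $1/9 \approx 0.11$: even a modest posterior fraud probability justifies flagging, because misses are eight times as expensive as false alarms.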

The language of decision theory (loss functions, risk, admissibility, minimax) is also the native language of classical statistics, machine learning, and reinforcement learning. Bayesian decision theory is the bridge: for whatever loss you choose, the Bayes action minimises posterior expected loss; Bayes decision rules are admissible under mild conditions; and frequentist minimax procedures often coincide with Bayes rules under a least-favourable (often improper) prior.

Section 15

Where Bayesian reasoning shows up in ML.

Modern machine learning has quietly absorbed most of the Bayesian apparatus — sometimes under its own names, sometimes in simplified form, sometimes at full rigour. This is the payoff section.

The Bayesian influence on ML is broader than "Bayesian ML" would suggest. Here is a sampler of where the ideas in this chapter actually land in practice:

- Regularisation is a prior in disguise: MAP estimation with a Gaussian prior on the weights is ridge (L2) regression, and a Laplace prior gives the lasso.
- Variational autoencoders are amortised variational inference, with a neural-network variational family and the ELBO as the training loss.
- Bayesian deep learning leans on the Laplace approximation, fitting a Gaussian around a pre-trained MAP with a structured Hessian approximation.
- Thompson sampling turns exploration in bandits into posterior sampling: draw a parameter from the posterior and act as if it were true.

A conspicuous pattern runs through this list: Bayesian ideas keep getting rediscovered under new names because they work. If you take away one thing from this chapter, let it be the habit of asking, for any ML method you encounter, "what is the implicit prior, and what is the implicit likelihood?" The answers are usually illuminating.

Further reading

Where to go next

The Bayesian literature is unusually rich for a mathematical subject — it spans foundations, practical workflow, computation, and application to almost every field. Work through one modern practical textbook (Gelman or McElreath) alongside one foundational treatment (Jaynes or Bernardo & Smith) and you will have a working map of the terrain.

Modern practical textbooks

Foundations and theory

Computation

Bayesian machine learning

Software and tools

This is Chapter 07 of an eight-chapter tour of the mathematical foundations of AI. One chapter remains: signal processing — Fourier transforms, convolution, filtering, sampling — the last prerequisite for audio, time series, and the full picture of classical engineering mathematics that modern deep learning still relies on.