Part I · Mathematical Foundations · Chapter 04

Probability, the calculus of uncertainty.

Every prediction a model makes is a guess under uncertainty, and every loss it minimises is a statement about how surprised that guess is allowed to be. Probability theory gives that whole vocabulary its grammar: random variables, distributions, expectation, independence, the law of large numbers, the central limit theorem, and the concentration inequalities that tell you how quickly an empirical average becomes trustworthy. This chapter develops it from sample spaces and σ-algebras up to the high-dimensional concentration results that justify why training on finite data works at all.

How to read this chapter

Each section builds on the previous one, so read in order the first time through. The prose is the primary explanation; the diagrams and small worked examples exist to make the prose less abstract; the equations exist to make the prose precise. If a formula looks scary, read the sentence before it first — it almost always says in English what the symbols are about to say in math.

Notation: scalars are lowercase italic ($x$, $t$, $\lambda$); random variables are uppercase italic ($X$, $Y$, $Z$); vectors are bold lowercase ($\mathbf{x}$, $\boldsymbol{\mu}$); matrices and covariances are capitals ($A$, $\Sigma$). The probability of an event $A$ is $\mathbb{P}(A)$; the expectation of a random variable $X$ is $\mathbb{E}[X]$; its variance is $\operatorname{Var}(X)$. $\mathbb{R}^n$ is the set of real $n$-vectors, $\mathbb{N}$ the natural numbers, and $\mathbb{1}_A$ the indicator function of an event $A$. The notation $X \sim \mathcal{D}$ means "$X$ is distributed according to $\mathcal{D}$." The symbol $\stackrel{\text{i.i.d.}}{\sim}$ means independent and identically distributed.

Contents

  1. What is probability theory about? · Motivation
  2. Sample spaces, events, and σ-algebras · The setup
  3. The axioms of probability · Kolmogorov
  4. Conditional probability and independence · Updating beliefs
  5. Random variables · Numerical outcomes
  6. Mass, density, and CDF · How probability is described
  7. Expectation, variance, moments · Summary statistics
  8. Common discrete distributions · Bernoulli to Poisson
  9. Common continuous distributions · Uniform to Gamma
  10. The multivariate Gaussian · The workhorse
  11. Joint, marginal, and conditional distributions · Multiple variables
  12. Transformations and change of variables · Pushforwards
  13. Sums of random variables · Convolutions and MGFs
  14. The law of large numbers · Averages converge
  15. The central limit theorem · Why Gaussians are everywhere
  16. Concentration inequalities · Quantitative LLN
  17. Stochastic processes · Random walks, Markov chains, Brownian motion
  18. Where it shows up in ML · Payoff

Section 01

What probability theory is about

Probability is the mathematics of uncertainty — a language for reasoning quantitatively about outcomes that are not yet determined, or that are determined but not yet known. Every prediction a machine learning model makes is a probabilistic statement; every loss function it minimises can be rewritten as a statement about probabilities.

There are two related but philosophically distinct ways to think about what a probability is. The frequentist reading says that a probability is the long-run relative frequency of an event: if you flip a fair coin many times, the fraction of heads converges to $1/2$. The Bayesian reading says that a probability is a degree of belief, consistent with a set of rational axioms, that can be updated in light of new evidence. The beauty of Kolmogorov's 1933 axiomatisation (section 3) is that both readings end up doing the same calculations; the disagreement is about what the numbers mean, not what they are.

For a machine learning practitioner, the bilingualism is a feature. A discriminative classifier trained by maximum likelihood is a frequentist object; its output, a conditional probability, is often then interpreted Bayesianly as a posterior over class labels. Almost every loss you will meet in later chapters — cross-entropy, KL divergence, negative log-likelihood, evidence lower bounds — is a probabilistic quantity in disguise. Read this chapter once as if probabilities were frequencies, and a second time as if they were beliefs. Both readings will pay off.

Why ML is probabilistic. A deterministic model predicts a single output. A probabilistic model predicts a distribution over outputs and reports how confident it is. The first sounds simpler but is almost never what you want: knowing "the label is probably cat, maybe dog, and definitely not airplane" is far more actionable than knowing "the label is cat." Probability is the machinery that makes "probably" and "definitely" precise.

The plan of this chapter is to develop enough probability to read the rest of the compendium without getting stuck. We begin with the set-theoretic scaffolding (sample spaces, σ-algebras, the three axioms), build up to random variables and their distributions, cover the distributions you will actually meet (Bernoulli, Gaussian, and a small zoo of siblings), introduce expectation and variance as the summary statistics that make distributions tractable, and then climb the two classical theorems — the law of large numbers and the central limit theorem — that explain why averaging works. We close with concentration inequalities, the finite-sample cousins of the LLN, which quantify exactly how fast averages become reliable. Those inequalities are the theoretical reason you can train a neural network on a million examples and expect it to generalise to a billion.

Section 02

Sample spaces, events, and σ-algebras

Probability theory starts not with numbers but with sets. Before you can ask "what is the probability of …?", you have to specify the universe of possible outcomes and the collection of statements you are willing to assign a probability to.

The sample space $\Omega$ is the set of all possible outcomes of a random experiment. For a single coin flip, $\Omega = \{H, T\}$. For the roll of a six-sided die, $\Omega = \{1, 2, 3, 4, 5, 6\}$. For the lifetime of a lightbulb, $\Omega = [0, \infty)$. A particular outcome $\omega \in \Omega$ is sometimes called a sample point or elementary event.

An event is a subset $A \subseteq \Omega$ — a question that can be answered "yes" or "no" after the experiment runs. "The die rolled even" is the event $\{2, 4, 6\}$; "the coin landed heads" is $\{H\}$. The empty set $\varnothing$ is the event that nothing happens (impossible); the full set $\Omega$ is the event that something happens (certain). Set operations become logical operations: $A \cup B$ is "$A$ or $B$," $A \cap B$ is "$A$ and $B$," $A^c = \Omega \setminus A$ is "not $A$."
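To make the set-to-logic dictionary concrete, here is a minimal sketch (plain Python, standard library only) that treats events as subsets of a finite $\Omega$ and computes probabilities as exact fractions under the uniform measure:

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die.
omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a subset of omega) under the uniform measure."""
    return Fraction(len(event & omega), len(omega))

even = {2, 4, 6}          # "the die rolled even"
big = {4, 5, 6}           # "the die rolled at least 4"

print(prob(even))          # 1/2
print(prob(even | big))    # union = "even or at least 4" = {2, 4, 5, 6}
print(prob(even & big))    # intersection = "even and at least 4" = {4, 6}
print(prob(omega - even))  # complement = "odd"
```

Set union, intersection, and difference map directly onto "or", "and", and "not", exactly as in the prose above.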

For finite or countable $\Omega$ you can assign a probability to every subset and no further machinery is needed. For continuous $\Omega$ — the uncountable reals, say — assigning a probability to every subset consistently is impossible: non-measurable sets exist (the Vitali construction on the line, and the Banach–Tarski paradox in three dimensions). The fix is to restrict yourself to a well-behaved collection of subsets called a σ-algebra.

A σ-algebra $\mathcal{F}$ on $\Omega$ is a collection of subsets satisfying three closure properties:

$$ \Omega \in \mathcal{F}, \qquad A \in \mathcal{F} \implies A^c \in \mathcal{F}, \qquad A_1, A_2, \ldots \in \mathcal{F} \implies \bigcup_{n=1}^{\infty} A_n \in \mathcal{F}. $$

In words: $\mathcal{F}$ contains the whole sample space, is closed under complementation, and is closed under countable unions. Intersections come along for free by De Morgan's laws. Elements of $\mathcal{F}$ are called measurable events; the pair $(\Omega, \mathcal{F})$ is a measurable space.

The σ-algebra you will actually use. On $\mathbb{R}$, the standard σ-algebra is the Borel σ-algebra $\mathcal{B}(\mathbb{R})$ — the smallest σ-algebra containing every open interval. It contains every set you will ever write down in ML, including intervals, half-lines, finite unions, and countable intersections of those. In practice you can think of it as "every reasonable subset of $\mathbb{R}$" and not worry about pathologies until you take a measure-theory course.

The reason for this abstraction is to distinguish between "questions we will assign probabilities to" and "subsets that are too ill-behaved to be assigned a probability consistently." Outside that technical corner, the sample space $\Omega$, the events $A \subseteq \Omega$, and their union / intersection / complement are all you need to keep straight. The next section assigns probabilities to those events.

Section 03

The axioms of probability

All of probability theory — the rules, the theorems, the whole edifice — rests on three axioms Kolmogorov wrote down in 1933. Everything else is derived.

A probability measure $\mathbb{P}$ on a measurable space $(\Omega, \mathcal{F})$ is a function $\mathbb{P}: \mathcal{F} \to [0, 1]$ satisfying

$$ \begin{aligned} &\text{(A1)} \quad \mathbb{P}(A) \geq 0 \text{ for all } A \in \mathcal{F}, \\ &\text{(A2)} \quad \mathbb{P}(\Omega) = 1, \\ &\text{(A3)} \quad \text{if } A_1, A_2, \ldots \text{ are disjoint, then } \mathbb{P}\!\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} \mathbb{P}(A_n). \end{aligned} $$

Probabilities are non-negative, they sum to one over the whole sample space, and they are countably additive over disjoint events. The triple $(\Omega, \mathcal{F}, \mathbb{P})$ is a probability space. Every probability calculation in this compendium takes place inside one.

A handful of identities follow in two or three lines of algebra. The probability of the empty event is zero: $\mathbb{P}(\varnothing) = 0$. The probability of the complement is $\mathbb{P}(A^c) = 1 - \mathbb{P}(A)$. Monotonicity: if $A \subseteq B$ then $\mathbb{P}(A) \leq \mathbb{P}(B)$. And the inclusion–exclusion formula for two events,

$$ \mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B) - \mathbb{P}(A \cap B), $$

which generalises to $n$ events with alternating signs. The union bound — frequently invoked in ML theory — drops the subtraction and asserts an inequality:

$$ \mathbb{P}\!\left(\bigcup_{i=1}^{n} A_i\right) \leq \sum_{i=1}^{n} \mathbb{P}(A_i). $$

The union bound is everywhere. When you want to upper-bound the probability that any of $n$ bad things happens, you often cannot compute the exact probability. The union bound gives you a safe upper bound in one line: just sum the individual probabilities. It is exact when the events are disjoint and loose when they overlap heavily, but it is almost always the first inequality to reach for in generalisation proofs, confidence-interval constructions, and concentration arguments.
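Both inclusion–exclusion and the union bound can be checked exactly by enumerating a small product sample space. A sketch with two fair dice and exact fractions:

```python
from fractions import Fraction
from itertools import product

# Sample space: two independent fair dice, uniform measure.
omega = list(product(range(1, 7), repeat=2))

def prob(pred):
    """Probability of the event {w : pred(w)} under the uniform measure."""
    return Fraction(sum(pred(w) for w in omega), len(omega))

A = lambda w: w[0] == 6          # first die shows 6
B = lambda w: w[1] == 6          # second die shows 6

p_union = prob(lambda w: A(w) or B(w))
# Inclusion-exclusion: P(A u B) = P(A) + P(B) - P(A n B), exactly.
assert p_union == prob(A) + prob(B) - prob(lambda w: A(w) and B(w))
# Union bound: P(A u B) <= P(A) + P(B), strict here because A, B can co-occur.
assert p_union <= prob(A) + prob(B)
print(p_union)   # 11/36
```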

Everything in the rest of this chapter — random variables, expectation, independence, the LLN, the CLT — is built on these three axioms and the two-line identities they imply. The entire edifice of modern probability is, in a real sense, elementary.

Section 04

Conditional probability and independence

Probability only becomes useful once you learn how to update it. "The probability of $A$ given that $B$ has occurred" is the quantity that makes probability a calculus of evidence, not just a calculus of outcomes.

For events $A, B \in \mathcal{F}$ with $\mathbb{P}(B) > 0$, the conditional probability of $A$ given $B$ is

$$ \mathbb{P}(A \mid B) \;=\; \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}. $$

Read as: restrict the sample space from $\Omega$ to $B$, renormalise so the restricted space has total probability $1$, and ask how much of $A$ lies inside. The map $A \mapsto \mathbb{P}(A \mid B)$ is itself a probability measure — on the same $\sigma$-algebra, just conditioned.

Rearranging the definition gives the multiplication rule:

$$ \mathbb{P}(A \cap B) = \mathbb{P}(A \mid B)\, \mathbb{P}(B) = \mathbb{P}(B \mid A)\, \mathbb{P}(A), $$

which by induction chains into a product expansion for $n$ events, and — swapping the two right-hand expressions and dividing — immediately yields Bayes' theorem:

$$ \mathbb{P}(A \mid B) = \frac{\mathbb{P}(B \mid A)\, \mathbb{P}(A)}{\mathbb{P}(B)}. $$

Bayes' theorem takes up its own chapter later in this compendium (Part I, Chapter 07), so we will not dwell on its interpretation here. Note only that the three factors on the right are usually called the likelihood, the prior, and the evidence — and that the formula was already implicit in the axioms; Bayes merely noticed it.
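As a worked numeric instance of the formula (the sensitivity, specificity, and prevalence figures below are illustrative inventions, not taken from the text), Bayes' theorem inverts a diagnostic test:

```python
from fractions import Fraction

# Made-up numbers: a test with 99% sensitivity and 95% specificity,
# applied to a condition with 1% prevalence.
p_d = Fraction(1, 100)                # prior P(D)
p_pos_given_d = Fraction(99, 100)     # likelihood P(+ | D)
p_pos_given_not_d = Fraction(5, 100)  # false-positive rate P(+ | not D)

# Evidence P(+) via the law of total probability.
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Bayes' theorem: P(D | +) = P(+ | D) P(D) / P(+).
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(p_d_given_pos)   # 1/6
```

With a 1% prior, even a strong positive result leaves the posterior at $1/6$: the prior dominates when the condition is rare.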

Two events $A$ and $B$ are independent, written $A \perp B$, if the occurrence of one tells you nothing about the other:

$$ \mathbb{P}(A \cap B) = \mathbb{P}(A)\, \mathbb{P}(B), $$

or, equivalently (when $\mathbb{P}(B) > 0$), $\mathbb{P}(A \mid B) = \mathbb{P}(A)$. Independence is not a physical property of events — it is a property of the probability measure. The same two events can be independent under one distribution and dependent under another.

Pairwise independence is not mutual independence. Three events $A, B, C$ are mutually independent if they are pairwise independent and $\mathbb{P}(A \cap B \cap C) = \mathbb{P}(A)\mathbb{P}(B)\mathbb{P}(C)$. The second condition does not follow from the first. This is almost never a practical issue in ML, but it is a classic exam trap.
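The classic counterexample behind that trap is small enough to enumerate: two fair coin flips, with $C$ the event that the flips agree. A short sketch verifies pairwise independence and the failure of the triple product:

```python
from fractions import Fraction
from itertools import product

# Two fair coin flips, uniform measure on four outcomes.
omega = list(product([0, 1], repeat=2))

def prob(pred):
    return Fraction(sum(pred(w) for w in omega), len(omega))

A = lambda w: w[0] == 1          # first flip is heads
B = lambda w: w[1] == 1          # second flip is heads
C = lambda w: w[0] == w[1]       # the two flips agree

# Every pair factorises: P(X n Y) = P(X) P(Y) for all three pairs...
for X, Y in [(A, B), (A, C), (B, C)]:
    assert prob(lambda w: X(w) and Y(w)) == prob(X) * prob(Y)

# ...but the triple product fails: knowing A and B determines C.
p_abc = prob(lambda w: A(w) and B(w) and C(w))
assert p_abc != prob(A) * prob(B) * prob(C)
print(p_abc)   # 1/4, not the 1/8 mutual independence would require
```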

A useful partner to Bayes is the law of total probability. If $B_1, B_2, \ldots$ partition $\Omega$ into disjoint pieces of positive probability, then for any event $A$,

$$ \mathbb{P}(A) = \sum_{i} \mathbb{P}(A \mid B_i)\, \mathbb{P}(B_i). $$

This is the marginalisation identity you will apply a hundred times when summing out nuisance variables in a joint distribution.

Section 05

Random variables

A random variable is not a variable that is random. It is a function — a deterministic assignment of a number to each possible outcome. Randomness lives in the sample space; the random variable merely reports a summary of what happened.

Formally, a random variable on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ is a measurable function $X: \Omega \to \mathbb{R}$. "Measurable" means that for every Borel set $B \subseteq \mathbb{R}$, the pre-image $X^{-1}(B) = \{\omega : X(\omega) \in B\}$ is an event in $\mathcal{F}$. The measurability requirement is technical but essential: it guarantees that statements like "$X \leq 2$" have well-defined probabilities.

In practice, you rarely think about $\omega \in \Omega$ at all. Instead, you work with the distribution that $X$ induces on $\mathbb{R}$:

$$ \mathbb{P}_X(B) \;=\; \mathbb{P}(X^{-1}(B)) \;=\; \mathbb{P}(X \in B). $$

This pushforward measure $\mathbb{P}_X$ — the distribution of $X$ — is the object that gets names like "Gaussian" and "Poisson." The underlying sample space is almost always swept under the rug.

Random variables come in flavours. A discrete random variable takes values in a countable set; its distribution is a probability mass function. A continuous random variable takes values in an uncountable set (usually $\mathbb{R}$) and has a probability density function whenever one exists. A mixed random variable does some of both — think of a variable that is $0$ with probability $1/2$ and exponentially distributed otherwise. All three cases are handled uniformly by the cumulative distribution function in the next section.
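The pushforward picture can be sketched in miniature: $\Omega$ is two fair dice, $X(\omega)$ is their sum, and the distribution of $X$ is computed by summing $\mathbb{P}(\omega)$ over each pre-image $X^{-1}(\{k\})$:

```python
from fractions import Fraction
from itertools import product
from collections import Counter

# A random variable is a function on the sample space: here X(w) = sum of dice.
omega = list(product(range(1, 7), repeat=2))
X = lambda w: w[0] + w[1]

# Pushforward: accumulate P(w) over each pre-image X^{-1}({k}).
pmf = Counter()
for w in omega:
    pmf[X(w)] += Fraction(1, len(omega))

print(pmf[7])    # 1/6, the most likely sum (six of 36 outcomes)
assert sum(pmf.values()) == 1
```

Once `pmf` is built, the sample space plays no further role, which is exactly the sweeping-under-the-rug the prose describes.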

Vector-valued random variables. Everything generalises: a random vector $\mathbf{X}: \Omega \to \mathbb{R}^n$ is an $n$-tuple of random variables on the same probability space. Its distribution lives on $\mathbb{R}^n$. In ML, almost every "random variable" you care about is actually a random vector — a feature vector, a weight matrix, a batch of labels.

Two random variables $X$ and $Y$ defined on the same probability space are independent, $X \perp Y$, if the events $\{X \in A\}$ and $\{Y \in B\}$ are independent for all Borel $A, B$. Equivalently, their joint distribution factors as the product of marginals (section 11). A sequence $X_1, X_2, \ldots$ is i.i.d. — independent and identically distributed — if the variables are mutually independent and all share the same distribution. Every dataset you train on is modelled as i.i.d. draws from some unknown $\mathcal{D}$; every theoretical result about learning starts with that assumption.

Section 06

Mass, density, and cumulative distribution

Three ways to describe a one-dimensional distribution, matched to the three flavours of random variable. Each is a different description of the same underlying measure, and they convert into one another by summing, integrating, or differentiating, wherever each applies.

The cumulative distribution function (CDF) of a real-valued random variable $X$ is

$$ F_X(x) \;=\; \mathbb{P}(X \leq x), \qquad x \in \mathbb{R}. $$

$F_X$ is always non-decreasing, right-continuous, with $\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to \infty} F_X(x) = 1$. It is the most general description of a one-dimensional distribution: it works for discrete, continuous, and mixed variables, and it uniquely determines the distribution.

For a discrete random variable taking values $x_1, x_2, \ldots$, the probability mass function (PMF) is

$$ p_X(x_i) \;=\; \mathbb{P}(X = x_i), \qquad \sum_i p_X(x_i) = 1. $$

The CDF is the cumulative sum of the PMF; the PMF is the jump heights of the CDF.

For a continuous random variable, $\mathbb{P}(X = x) = 0$ for every single point $x$. The useful object is the probability density function (PDF), a non-negative function $f_X$ such that

$$ \mathbb{P}(a < X \leq b) \;=\; \int_a^b f_X(x)\,dx, \qquad \int_{-\infty}^{\infty} f_X(x)\,dx = 1. $$

The CDF is the integral of the PDF; the PDF is the derivative of the CDF (where it is differentiable). A density at a point is not a probability — densities can be arbitrarily large — it is a probability per unit length. A standard Gaussian density at $x = 0$ is $f_X(0) = 1/\sqrt{2\pi} \approx 0.399$, perfectly fine because it is a rate, not a probability.

Change of variables. If $Y = g(X)$ for a smooth, monotonic $g$, then the density transforms as $f_Y(y) = f_X(g^{-1}(y)) \left| \frac{d}{dy} g^{-1}(y) \right|$. The Jacobian factor matters: densities are rates, and rescaling the axis rescales them inversely. This is the one-dimensional root of the $|\det J|$ factor that appears in normalising flows, reparameterisation tricks, and every change-of-variables argument in probability.
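A quick numeric sanity check of the formula, under the illustrative assumption $X \sim \mathcal{N}(0,1)$ and $g(x) = e^x$, so that $f_Y$ is the log-normal density. A Monte Carlo estimate of $\mathbb{P}(1 \leq Y \leq 2)$ is compared against a numerical integral of the transformed density:

```python
import math
import random

random.seed(0)

# X ~ N(0,1), Y = exp(X).  The change-of-variables formula gives
# f_Y(y) = f_X(log y) * |d(log y)/dy| = f_X(log y) / y.
f_X = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
f_Y = lambda y: f_X(math.log(y)) / y        # the Jacobian factor 1/y

# Monte Carlo estimate of P(1 <= Y <= 2) by sampling X and pushing it forward.
n = 200_000
mc = sum(1.0 <= math.exp(random.gauss(0, 1)) <= 2.0 for _ in range(n)) / n

# Midpoint-rule integral of f_Y over [1, 2].
m = 10_000
integral = sum(f_Y(1 + (k + 0.5) / m) / m for k in range(m))

print(round(mc, 2), round(integral, 4))   # both close to 0.256
assert abs(mc - integral) < 0.01
```

Dropping the $1/y$ factor makes the two numbers disagree, which is the quickest way to convince yourself the Jacobian is not optional.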

In higher dimensions the picture is the same with a little more bookkeeping: a random vector has a joint CDF $F_{\mathbf{X}}(\mathbf{x}) = \mathbb{P}(X_1 \leq x_1, \ldots, X_n \leq x_n)$ and, when one exists, a joint density $f_{\mathbf{X}}: \mathbb{R}^n \to [0, \infty)$ with integral $1$. Section 11 expands on the joint picture; for now, think of a density as a probabilistic mass-per-unit-volume that you can integrate over a region to get a probability.

Section 07

Expectation, variance, and moments

A distribution is a complicated object; an expectation is a single number. The first few moments of a distribution — mean, variance, skewness, kurtosis — compress its shape into a handful of scalars that are almost always what a learning algorithm actually sees.

The expectation (or mean, or expected value) of a random variable $X$ is

$$ \mathbb{E}[X] \;=\; \begin{cases} \displaystyle \sum_i x_i\, p_X(x_i) & \text{discrete}, \\[0.6em] \displaystyle \int_{-\infty}^{\infty} x\, f_X(x)\, dx & \text{continuous},\end{cases} $$

whenever the sum or integral converges absolutely. It is the probability-weighted average of the values $X$ can take. For a random vector, $\mathbb{E}[\mathbf{X}]$ is the vector of component-wise expectations.

Expectation is linear: for any random variables $X, Y$ (independent or not) and scalars $a, b$,

$$ \mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]. $$

Linearity of expectation is the single most useful identity in probability. It does not require independence, and it turns hard-looking combinatorial problems into one-line arithmetic. Counting the expected number of fixed points of a random permutation, for instance, reduces to summing $n$ copies of $1/n$ and getting $1$ — independently of any correlations between the events.
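The fixed-point example can be simulated directly. A minimal sketch using the standard library's shuffle, with trial counts chosen arbitrarily:

```python
import random

random.seed(1)

def fixed_points(n):
    """Number of fixed points of a uniform random permutation of n elements."""
    perm = list(range(n))
    random.shuffle(perm)
    return sum(perm[i] == i for i in range(n))

# Linearity: E[#fixed points] = sum of n indicators, each with mean 1/n,
# so the expectation is exactly 1 for every n, correlations notwithstanding.
trials = 100_000
avg = sum(fixed_points(10) for _ in range(trials)) / trials
print(avg)   # close to 1.0
assert abs(avg - 1.0) < 0.05
```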

The law of the unconscious statistician says that for a function $g$,

$$ \mathbb{E}[g(X)] \;=\; \int g(x)\, f_X(x)\, dx, $$

i.e. you can compute $\mathbb{E}[g(X)]$ without first finding the distribution of $Y = g(X)$. This is how the loss $\mathbb{E}[\ell(f_\theta(X), Y)]$ is always written.

The $k$-th moment of $X$ is $\mathbb{E}[X^k]$. The variance is the second central moment,

$$ \operatorname{Var}(X) \;=\; \mathbb{E}\!\left[(X - \mathbb{E}[X])^2\right] \;=\; \mathbb{E}[X^2] - (\mathbb{E}[X])^2, $$

measuring the expected squared deviation from the mean. Its square root is the standard deviation $\sigma_X = \sqrt{\operatorname{Var}(X)}$, with the same units as $X$. For a scalar $a$ and variables $X, Y$, $\operatorname{Var}(aX) = a^2 \operatorname{Var}(X)$, and $\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X, Y)$, where the covariance $\operatorname{Cov}(X, Y) = \mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])]$ measures how $X$ and $Y$ co-vary. When $X$ and $Y$ are independent, the covariance is zero and variances add.

Jensen's inequality. For a convex function $\varphi$, $\mathbb{E}[\varphi(X)] \geq \varphi(\mathbb{E}[X])$; for a concave function, the inequality reverses. Applied to $\varphi(x) = x^2$, it yields $\mathbb{E}[X^2] \geq (\mathbb{E}[X])^2$, i.e. $\operatorname{Var}(X) \geq 0$. Applied to $\varphi(x) = -\log x$, it gives the non-negativity of KL divergence. Jensen is one of the most-reached-for inequalities in information theory and variational inference.
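A tiny exact check of Jensen for $\varphi(x) = x^2$, on a two-point distribution chosen arbitrarily for illustration:

```python
from fractions import Fraction

# X takes the values 0 and 4, each with probability 1/2.
values = [Fraction(0), Fraction(4)]
p = Fraction(1, 2)

EX = sum(p * x for x in values)        # E[X]   = 2
EX2 = sum(p * x * x for x in values)   # E[X^2] = 8

assert EX2 >= EX ** 2                  # Jensen for the convex map x -> x^2
print(EX2 - EX ** 2)                   # 4, and the gap is exactly Var(X)
```

The gap $\mathbb{E}[X^2] - (\mathbb{E}[X])^2$ is the variance, so Jensen applied to $x^2$ really is the statement $\operatorname{Var}(X) \geq 0$ from the prose.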

Higher moments — skewness (the third central moment, normalised) and kurtosis (the fourth) — measure asymmetry and tail-heaviness. For a Gaussian, skewness is $0$ and excess kurtosis (kurtosis minus the Gaussian baseline of $3$) is $0$; anything heavier-tailed than a Gaussian has positive excess kurtosis. A whole subfield of robust statistics exists because sample means and variances are very sensitive to the tails that kurtosis measures.

Section 08

Common discrete distributions

A short zoo of the discrete distributions that appear constantly in ML: Bernoulli for single yes/no events, binomial for counts of yes/no events, geometric for waiting times, categorical for labels, and Poisson for rare events.

Bernoulli$(p)$: the distribution of a single coin flip with probability $p$ of landing $1$. $\mathbb{P}(X = 1) = p$, $\mathbb{P}(X = 0) = 1 - p$. Mean $p$, variance $p(1-p)$, maximised at $p = 1/2$. Every binary classifier outputs the parameter $p$ of a Bernoulli.

Binomial$(n, p)$: the number of successes in $n$ independent Bernoulli$(p)$ trials. PMF

$$ \mathbb{P}(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \qquad k = 0, 1, \ldots, n. $$

Mean $np$, variance $np(1-p)$. For large $n$, the binomial is well-approximated by a Gaussian (central limit theorem, section 15) — this is, in fact, the oldest instance of the CLT, noticed by De Moivre in 1733.

Geometric$(p)$: the number of Bernoulli$(p)$ trials up to and including the first success. PMF $\mathbb{P}(X = k) = (1-p)^{k-1} p$ for $k = 1, 2, \ldots$; mean $1/p$. The geometric is the discrete analogue of the exponential (section 9) and shares its memoryless property: $\mathbb{P}(X > m + n \mid X > m) = \mathbb{P}(X > n)$, i.e. waiting time doesn't age.

Categorical$(\mathbf{p})$: the $K$-way generalisation of Bernoulli. $X \in \{1, 2, \ldots, K\}$ with $\mathbb{P}(X = k) = p_k$, $\sum_k p_k = 1$. The output of a softmax classifier is the parameter vector $\mathbf{p}$ of a categorical. A multinomial$(n, \mathbf{p})$ is $n$ independent categorical draws aggregated into counts, just as binomial is the $n$-trial version of Bernoulli.

Poisson$(\lambda)$: the distribution of the number of rare events in a fixed interval, when events occur independently at a constant rate $\lambda$. PMF

$$ \mathbb{P}(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \ldots. $$

Mean $\lambda$, variance $\lambda$. The Poisson arises as the $n \to \infty$, $p \to 0$ limit of Binomial$(n, p)$ with $np \to \lambda$ held fixed — the law of small numbers. In ML, Poisson appears in counting models (arrivals per minute, events per page, tokens per document) and as the emission distribution of Poisson regression and certain latent-Dirichlet-style generative models.

Why these five and not others? The discrete exponential family — Bernoulli, binomial, geometric, categorical, Poisson, negative binomial — has the property that the log-PMF is linear in the parameters. This makes maximum-likelihood estimation closed-form or convex, and gives rise to generalised linear models (logistic regression, Poisson regression) and most of the simple generative models in ML. You do not need to memorise the formulas; you need to recognise which distribution is staring at you from a given problem.
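The law of small numbers is easy to check numerically. The sketch below compares Binomial$(n, \lambda/n)$ PMFs against the Poisson$(\lambda)$ PMF at a single point, with $\lambda = 3$ and $k = 2$ chosen arbitrarily:

```python
import math

lam, k = 3.0, 2

def binom_pmf(n, p, k):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Poisson(lam) PMF at k, the claimed n -> infinity limit.
poisson = lam**k * math.exp(-lam) / math.factorial(k)

for n in (10, 100, 1000):
    print(n, round(binom_pmf(n, lam / n, k), 4))   # converges toward poisson
print(round(poisson, 4))

assert abs(binom_pmf(1000, lam / 1000, k) - poisson) < 1e-3
```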

Section 09

Common continuous distributions

Six continuous distributions cover almost every modelling need in ML: uniform, exponential, Gaussian, Laplace, Beta, and Gamma. The Dirichlet (a multivariate generalisation of the Beta) is thrown in because it is the conjugate prior of the categorical.

Uniform$(a, b)$: density $f(x) = 1/(b - a)$ for $x \in [a, b]$ and zero elsewhere. Mean $(a+b)/2$, variance $(b-a)^2/12$. The "no information" distribution on a bounded interval; every pseudorandom generator in a language's standard library ultimately produces Uniform$(0,1)$ samples and transforms them to other distributions via inverse-CDF or rejection methods.

Exponential$(\lambda)$: density $f(x) = \lambda e^{-\lambda x}$ for $x \geq 0$. Mean $1/\lambda$, variance $1/\lambda^2$. The continuous analogue of the geometric distribution, and the unique continuous distribution with the memoryless property $\mathbb{P}(X > s + t \mid X > s) = \mathbb{P}(X > t)$. It describes waiting times between events in a Poisson process.
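The inverse-CDF method mentioned under the uniform distribution is a one-liner for the exponential, because $F(x) = 1 - e^{-\lambda x}$ inverts in closed form to $F^{-1}(u) = -\ln(1-u)/\lambda$. A sketch with an arbitrary rate parameter:

```python
import math
import random

random.seed(0)

# If U ~ Uniform(0, 1), then X = -ln(1 - U) / lam ~ Exponential(lam).
lam = 2.0
n = 200_000
samples = [-math.log(1 - random.random()) / lam for _ in range(n)]

mean = sum(samples) / n
print(round(mean, 2))   # close to 1/lam = 0.5
assert abs(mean - 1 / lam) < 0.01
```

The same recipe works for any distribution whose CDF you can invert, which is why Uniform$(0,1)$ is the universal raw material for samplers.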

Gaussian (or normal) $\mathcal{N}(\mu, \sigma^2)$: density

$$ f(x) \;=\; \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right). $$

Mean $\mu$, variance $\sigma^2$. The Gaussian appears everywhere for three interrelated reasons: it is the maximum-entropy distribution for a given mean and variance (so it's the "least committal" density you can choose); it is the CLT limit of averages (section 15) so any quantity that is a sum of many small independent effects ends up approximately Gaussian; and it is closed under a long list of operations (sums, conditioning, linear transformations) that makes it uniquely tractable. A linear regression residual, a weight initialisation, a reparameterised VAE latent — all Gaussians.

Laplace$(\mu, b)$: density $f(x) = (2b)^{-1} e^{-|x - \mu|/b}$. A double-sided exponential — heavier tails than a Gaussian and a pointier mode. The negative log-density is the $\ell_1$ loss $|x - \mu|/b$, which is why Lasso regression and robust regression are Laplace-likelihood models in disguise.

Beta$(\alpha, \beta)$: density $f(x) = x^{\alpha-1}(1-x)^{\beta-1} / B(\alpha, \beta)$ on $[0, 1]$, where $B$ is the beta function. A flexible two-parameter distribution on the unit interval that becomes uniform at $\alpha = \beta = 1$, symmetric and bell-shaped for large equal parameters, and bimodal (pushed to the ends) for small parameters. It is the conjugate prior for the Bernoulli parameter $p$ — so if your prior is Beta and your likelihood is Bernoulli, your posterior is Beta too.

Gamma$(\alpha, \beta)$: density $f(x) \propto x^{\alpha-1} e^{-\beta x}$ on $x \geq 0$. Generalises the exponential (Gamma$(1, \beta) = $ Exponential$(\beta)$) and the $\chi^2$ (Gamma$(k/2, 1/2) = \chi^2_k$). It is the conjugate prior for the Poisson rate and the Gaussian precision, and shows up constantly in Bayesian analysis.

Dirichlet$(\boldsymbol{\alpha})$: the multivariate generalisation of the Beta — a distribution over probability vectors on the simplex $\{\mathbf{p} \in \mathbb{R}^K : p_i \geq 0, \sum_i p_i = 1\}$. Density proportional to $\prod_i p_i^{\alpha_i - 1}$. It is the conjugate prior of the categorical / multinomial, and the backbone of Latent Dirichlet Allocation, Dirichlet Process mixture models, and many Bayesian non-parametric methods.

Conjugacy as a design principle. A likelihood–prior pair is conjugate if the posterior belongs to the same family as the prior. Beta–Bernoulli, Gamma–Poisson, Gaussian–Gaussian, Dirichlet–categorical are the classical conjugate pairs. Conjugacy is not a mathematical requirement — it is a computational convenience: it turns Bayesian updating into a one-line update of the prior's parameters. Before variational inference and MCMC made arbitrary posteriors tractable, conjugacy was the only way to do Bayesian analysis at all.

Section 10

The multivariate Gaussian

The multivariate Gaussian is the single most useful distribution in applied probability. It is the joint distribution implied by independent Gaussian noise, the limit of averages of arbitrary random vectors, and the only family closed under every operation you are likely to perform in a linear model.

A random vector $\mathbf{X} \in \mathbb{R}^d$ has the multivariate Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ if its density is

$$ f(\mathbf{x}) \;=\; \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\, \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right), $$

where $\boldsymbol{\mu} \in \mathbb{R}^d$ is the mean vector and $\Sigma \in \mathbb{R}^{d \times d}$ is the covariance matrix, a symmetric positive-definite matrix whose $(i, j)$ entry is $\operatorname{Cov}(X_i, X_j)$. The quantity in the exponent is half the squared Mahalanobis distance of $\mathbf{x}$ from $\boldsymbol{\mu}$; contours of constant density are ellipsoids centred at $\boldsymbol{\mu}$ with principal axes given by the eigenvectors of $\Sigma$ and principal radii proportional to the square roots of its eigenvalues.

[Figure: contours of $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ in the $(x_1, x_2)$ plane (left) and the marginal density $f_{X_1}(x_1)$, peaked at $\mu_1$ (right).]

The multivariate Gaussian is the last distribution you would design by hand and the first one that works. Three of its properties explain why it is everywhere:

1. Closure under affine transformations. If $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$ and $\mathbf{Y} = A\mathbf{X} + \mathbf{b}$ with $A$ any matrix, then $\mathbf{Y} \sim \mathcal{N}(A\boldsymbol{\mu} + \mathbf{b}, A\Sigma A^\top)$. Linear regression, PCA, and Kalman filtering all exploit this relentlessly.

2. Closure under marginalisation and conditioning. Marginals of a multivariate Gaussian are Gaussian (just drop the rows/columns of $\boldsymbol{\mu}$ and $\Sigma$ you don't care about). Conditionals are Gaussian too, with mean and covariance given by the Schur complement formulas

$$ \mathbb{E}[\mathbf{X}_1 \mid \mathbf{X}_2 = \mathbf{x}_2] = \boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2), $$ $$ \operatorname{Cov}(\mathbf{X}_1 \mid \mathbf{X}_2) = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}. $$

3. Uncorrelated implies independent. In general, zero covariance does not imply independence. For jointly Gaussian random variables, it does. This is why the uncorrelated projections that PCA produces are genuinely independent under a Gaussianity assumption.
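Property 1 doubles as a sampler: factor $\Sigma = LL^\top$ and push a standard normal through the affine map $\mathbf{x} = \boldsymbol{\mu} + L\mathbf{z}$. A standard-library sketch for a hand-picked $2 \times 2$ covariance, checked against the empirical moments:

```python
import math
import random

random.seed(0)

# If Z ~ N(0, I) and X = mu + L Z with L L^T = Sigma, then X ~ N(mu, Sigma).
mu = (1.0, -1.0)
Sigma = ((2.0, 0.6), (0.6, 1.0))

# Cholesky factor of a 2x2 matrix, written out by hand.
l11 = math.sqrt(Sigma[0][0])
l21 = Sigma[1][0] / l11
l22 = math.sqrt(Sigma[1][1] - l21 * l21)

def sample():
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    return (mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2)

n = 200_000
xs = [sample() for _ in range(n)]
m1 = sum(x for x, _ in xs) / n
m2 = sum(y for _, y in xs) / n
cov12 = sum((x - m1) * (y - m2) for x, y in xs) / n

print(round(m1, 1), round(m2, 1), round(cov12, 1))   # close to 1.0, -1.0, 0.6
assert abs(cov12 - Sigma[0][1]) < 0.05
```

This is exactly how libraries draw multivariate Gaussian samples: one Cholesky factorisation, then cheap affine pushes of i.i.d. standard normals.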

The precision matrix. Working with the inverse $\Lambda = \Sigma^{-1}$ — the precision matrix — is often more convenient than $\Sigma$ itself. The pattern of zeros in $\Lambda$ encodes the conditional independence structure of the Gaussian: $\Lambda_{ij} = 0$ iff $X_i \perp X_j \mid \text{rest}$. This is the foundation of Gaussian graphical models and an entire literature on sparse-inverse-covariance estimation.

Section 11

Joint, marginal, and conditional distributions

Two or more random variables are described jointly by their joint distribution. Marginals come from integrating out the variables you do not care about; conditionals from holding the others fixed and renormalising.

The joint density of continuous random variables $X$ and $Y$ is a function $f_{X,Y}: \mathbb{R}^2 \to [0, \infty)$ such that

$$ \mathbb{P}((X, Y) \in A) = \iint_A f_{X,Y}(x, y)\, dx\, dy. $$

The marginal density of $X$ alone is obtained by integrating out $Y$:

$$ f_X(x) = \int f_{X,Y}(x, y)\, dy. $$

This is marginalisation in the continuous world. In the discrete world, you sum instead of integrate: $p_X(x) = \sum_y p_{X,Y}(x, y)$.

The conditional density of $X$ given $Y = y$, defined whenever $f_Y(y) > 0$, is

$$ f_{X \mid Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}. $$

Multiplying both sides by $f_Y(y)$ and integrating over $y$ recovers the law of total probability. Dividing the joint two ways and equating gives the continuous analogue of Bayes' theorem:

$$ f_{X \mid Y}(x \mid y) = \frac{f_{Y \mid X}(y \mid x)\, f_X(x)}{f_Y(y)}. $$

$X$ and $Y$ are independent precisely when $f_{X,Y}(x, y) = f_X(x) f_Y(y)$, i.e. when the joint density factors. Equivalently, $f_{X \mid Y}(x \mid y) = f_X(x)$: knowing $Y$ tells you nothing new about $X$.
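In the discrete case all of this is just array arithmetic. A minimal sketch (numpy; the joint table is an arbitrary toy of ours) builds marginals, conditionals, and Bayes' theorem from one joint pmf:

```python
import numpy as np

# A small discrete joint distribution p(x, y) on {0,1,2} x {0,1}.
# Rows index x, columns index y; entries sum to 1.
p_xy = np.array([[0.10, 0.20],
                 [0.25, 0.05],
                 [0.15, 0.25]])

# Marginals: sum out the variable you do not care about.
p_x = p_xy.sum(axis=1)          # p(x)
p_y = p_xy.sum(axis=0)          # p(y)

# Conditional p(x | y): divide the joint by the marginal of the conditioned variable.
p_x_given_y = p_xy / p_y        # broadcasting divides each column by p(y)

# Bayes' theorem: p(x | y) = p(y | x) p(x) / p(y) reconstructs the same table.
p_y_given_x = p_xy / p_x[:, None]
bayes = p_y_given_x * p_x[:, None] / p_y

print(np.allclose(p_x_given_y, bayes))        # True
# Independence would require p(x, y) = p(x) p(y) everywhere — not the case here:
print(np.allclose(p_xy, np.outer(p_x, p_y)))  # False
```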

The covariance of $X$ and $Y$ is $\operatorname{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]$; the correlation coefficient $\rho_{XY} = \operatorname{Cov}(X, Y) / (\sigma_X \sigma_Y)$ rescales it to $[-1, 1]$. Correlation measures linear dependence only: two variables can be perfectly dependent (say $Y = X^2$) and have zero correlation. Mutual information, introduced in the Information Theory chapter, is the correlation-free alternative that captures all dependence, linear or not.

Law of iterated expectations. $\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid Y]]$, and $\operatorname{Var}(X) = \mathbb{E}[\operatorname{Var}(X \mid Y)] + \operatorname{Var}(\mathbb{E}[X \mid Y])$. These identities — "tower rule" and "law of total variance" — justify Monte Carlo decompositions where you average over one variable at a time, and they are the algebraic source of the bias–variance decomposition in supervised learning.
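Both identities are easy to confirm by simulation. A sketch under an assumed toy hierarchy — $Y \sim \operatorname{Exp}(1)$, then $X \mid Y \sim \mathcal{N}(Y, 4)$ — where the tower rule predicts $\mathbb{E}[X] = \mathbb{E}[Y] = 1$ and total variance predicts $\operatorname{Var}(X) = 4 + \operatorname{Var}(Y) = 5$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Hierarchical model: Y ~ Exp(1), then X | Y ~ N(Y, 4).
Y = rng.exponential(1.0, size=n)
X = rng.normal(Y, 2.0)

# Tower rule:      E[X] = E[E[X|Y]] = E[Y] = 1.
# Total variance:  Var(X) = E[Var(X|Y)] + Var(E[X|Y]) = 4 + Var(Y) = 5.
print(X.mean())   # ≈ 1.0
print(X.var())    # ≈ 5.0
```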
Section 12

Transformations and the change of variables

If you know the distribution of $\mathbf{X}$ and you apply a deterministic function $\mathbf{Y} = g(\mathbf{X})$, what is the distribution of $\mathbf{Y}$? The change-of-variables formula answers this question, and its Jacobian factor is the mathematical core of normalising flows, the reparameterisation trick, and affine data augmentation.

In one dimension, if $Y = g(X)$ for a smooth, strictly monotonic $g$ with inverse $g^{-1}$, then

$$ f_Y(y) \;=\; f_X\!\left(g^{-1}(y)\right) \left| \frac{d}{dy} g^{-1}(y) \right|. $$

The absolute-value factor accounts for the stretching or squeezing of space under $g$: where $g$ stretches $x$-values apart, probability density thins out; where it compresses them, density piles up.
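The one-dimensional formula can be checked numerically. Taking $X \sim \mathcal{N}(0,1)$ and $g(x) = e^x$ (so $g^{-1}(y) = \log y$ and $|{d g^{-1}}/{dy}| = 1/y$), the formula predicts the lognormal density $f_Y(y) = \varphi(\log y)/y$ — a sketch with a toy evaluation point of our choosing:

```python
import numpy as np

rng = np.random.default_rng(2)

# X ~ N(0,1), Y = exp(X).  Change of variables predicts f_Y(y) = phi(log y) / y.
X = rng.standard_normal(2_000_000)
Y = np.exp(X)

def f_Y(y):
    phi = np.exp(-np.log(y) ** 2 / 2) / np.sqrt(2 * np.pi)
    return phi / y   # the 1/y is the |d/dy g^{-1}(y)| Jacobian factor

# Empirical density of Y in a small bin around y0 vs the formula.
y0, h = 1.5, 0.01
empirical = np.mean((Y > y0 - h) & (Y < y0 + h)) / (2 * h)
print(empirical, f_Y(y0))   # the two numbers agree up to Monte Carlo noise
```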

In $d$ dimensions, the same logic gives the multivariate change of variables formula. If $\mathbf{Y} = g(\mathbf{X})$ for a smooth invertible $g: \mathbb{R}^d \to \mathbb{R}^d$ with Jacobian matrix $J_g$, then

$$ f_{\mathbf{Y}}(\mathbf{y}) \;=\; f_{\mathbf{X}}\!\left(g^{-1}(\mathbf{y})\right) \left| \det J_{g^{-1}}(\mathbf{y}) \right|. $$

The absolute value of the Jacobian determinant plays the role of "local volume ratio" — by how much does $g$ stretch or shrink a tiny box around a point. This determinant is the single quantity that makes the whole formula work, and the reason every paper on normalising flows spends most of its time designing invertible layers whose log-determinant is cheap to evaluate.

The reparameterisation trick. If you want to sample from $\mathcal{N}(\mu, \sigma^2)$ but also differentiate through the sampling, write $X = \mu + \sigma Z$ with $Z \sim \mathcal{N}(0, 1)$. The randomness is now in $Z$, independent of the parameters $(\mu, \sigma)$, and the gradient flows through the deterministic map. This is change-of-variables in its most practically consequential form, and the whole reason variational autoencoders train by gradient descent instead of REINFORCE.
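A minimal sketch of the trick, using the toy loss $L(X) = X^2$ where the true gradients are known in closed form ($\mathbb{E}[X^2] = \mu^2 + \sigma^2$, so $\partial_\mu = 2\mu$ and $\partial_\sigma = 2\sigma$); the parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 1.5, 0.7
Z = rng.standard_normal(1_000_000)

# Reparameterise: X = mu + sigma * Z.  For L(X) = X^2 the chain rule gives
# dL/dmu = 2X * dX/dmu = 2X  and  dL/dsigma = 2X * dX/dsigma = 2X * Z,
# because the randomness lives in Z, not in the parameters.
X = mu + sigma * Z
grad_mu = (2 * X).mean()          # estimates d/dmu    E[X^2] = 2*mu
grad_sigma = (2 * X * Z).mean()   # estimates d/dsigma E[X^2] = 2*sigma

print(grad_mu, 2 * mu)            # ≈ 3.0
print(grad_sigma, 2 * sigma)      # ≈ 1.4
```

In a real VAE an autodiff framework does the chain rule for you; the point of the sketch is that the gradient estimator is just an ordinary Monte Carlo average once the noise is factored out.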

A common non-invertible transformation you still want to handle is $Y = g(X)$ where $g$ is many-to-one. For a smooth $g$, you split the domain into monotone pieces and sum:

$$ f_Y(y) = \sum_{x: g(x) = y} \frac{f_X(x)}{|g'(x)|}. $$

This is how the distribution of $Y = X^2$ (a $\chi^2_1$ when $X$ is standard normal) is derived. It is also the source of the Jacobian factors in marginalising out latent rotations or symmetries — an exercise you will meet in generative modelling.
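For $Y = X^2$ the two branches are $x = \pm\sqrt{y}$ with $|g'(x)| = 2\sqrt{y}$, so the branch sum collapses to $f_Y(y) = \varphi(\sqrt{y})/\sqrt{y}$ — exactly the $\chi^2_1$ density. A numerical sketch (evaluation point chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(4)

# X standard normal, Y = X^2.  Branches x = ±sqrt(y), each with |g'(x)| = 2 sqrt(y):
# f_Y(y) = 2 * phi(sqrt(y)) / (2 sqrt(y)) = phi(sqrt(y)) / sqrt(y), the chi^2_1 density.
X = rng.standard_normal(2_000_000)
Y = X ** 2

def f_Y(y):
    phi = np.exp(-y / 2) / np.sqrt(2 * np.pi)
    return phi / np.sqrt(y)

y0, h = 1.0, 0.01
empirical = np.mean((Y > y0 - h) & (Y < y0 + h)) / (2 * h)
print(empirical, f_Y(y0))   # both ≈ 0.242
```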

Section 13

Sums of random variables

Sums of independent random variables are the most important compound objects in probability. Their distributions are computed by convolution; their moments can be read off moment generating functions; and their asymptotic behaviour is the subject of the law of large numbers and the central limit theorem.

If $X$ and $Y$ are independent with densities $f_X$ and $f_Y$, the density of $Z = X + Y$ is the convolution

$$ f_Z(z) \;=\; \int_{-\infty}^{\infty} f_X(x)\, f_Y(z - x)\, dx \;\equiv\; (f_X * f_Y)(z). $$

For discrete distributions, the integral becomes a sum. Convolution is commutative and associative; iterating it is how the distribution of a sum of many independent variables is — in principle — computed.

In practice, convolving many densities by hand is painful. The moment generating function (MGF)

$$ M_X(t) \;=\; \mathbb{E}[e^{tX}] $$

(when finite in a neighbourhood of $0$) turns convolution into multiplication: if $X$ and $Y$ are independent, $M_{X+Y}(t) = M_X(t)\, M_Y(t)$. The MGF also packages all moments: $\mathbb{E}[X^k] = M_X^{(k)}(0)$, i.e. the $k$-th derivative at zero. A distribution is uniquely determined by its MGF whenever the MGF exists near zero.

When the MGF fails to exist (e.g. heavy-tailed distributions like Cauchy), the characteristic function $\varphi_X(t) = \mathbb{E}[e^{itX}]$ — always finite since $|e^{itX}| = 1$ — plays the same role and gives the same product-under-independence property. Characteristic functions are the standard tool in proofs of the CLT (section 15).

A handful of closure identities are worth memorising, all for independent summands:

- Gaussian: $\mathcal{N}(\mu_1, \sigma_1^2) + \mathcal{N}(\mu_2, \sigma_2^2) = \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.
- Poisson: $\operatorname{Pois}(\lambda_1) + \operatorname{Pois}(\lambda_2) = \operatorname{Pois}(\lambda_1 + \lambda_2)$.
- Binomial (same $p$): $\operatorname{Bin}(n_1, p) + \operatorname{Bin}(n_2, p) = \operatorname{Bin}(n_1 + n_2, p)$.
- Gamma (same rate): $\operatorname{Gamma}(\alpha_1, \beta) + \operatorname{Gamma}(\alpha_2, \beta) = \operatorname{Gamma}(\alpha_1 + \alpha_2, \beta)$; in particular $\chi^2_{k_1} + \chi^2_{k_2} = \chi^2_{k_1 + k_2}$.

Each of these follows in a line or two from MGFs. They are also the reason these particular distributions are so often the "right" choice — sums are invariants, and invariance is what lets you push sums under a model without recomputing.
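The Poisson closure, for instance, can be verified directly by comparing the empirical pmf of a sum against fresh draws from the predicted distribution — a sketch with toy rates of our choosing:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000

# Sum of independent Poissons is Poisson: X ~ Pois(2), Y ~ Pois(3)  =>  X+Y ~ Pois(5).
Z = rng.poisson(2.0, n) + rng.poisson(3.0, n)
ref = rng.poisson(5.0, n)

# Compare the empirical pmfs at k = 0..12 (covers nearly all the mass of Pois(5)).
k = np.arange(13)
pmf_sum = np.array([(Z == i).mean() for i in k])
pmf_ref = np.array([(ref == i).mean() for i in k])
print(np.abs(pmf_sum - pmf_ref).max())   # small: Monte Carlo noise only
```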

Section 14

The law of large numbers

Averages converge. Take enough i.i.d. samples from a distribution with a finite mean, average them, and you get the mean. This single theorem is the reason empirical risk minimisation works, why Monte Carlo integration works, and why the frequentist interpretation of probability is even internally consistent.

Let $X_1, X_2, \ldots$ be i.i.d. random variables with $\mathbb{E}[X_i] = \mu$ and let $\bar{X}_n = (X_1 + \cdots + X_n)/n$ be their running sample mean. The weak law of large numbers (WLLN) says

$$ \bar{X}_n \xrightarrow{\;\mathbb{P}\;} \mu, $$

i.e. for every $\varepsilon > 0$, $\mathbb{P}(|\bar{X}_n - \mu| > \varepsilon) \to 0$ as $n \to \infty$. Convergence in probability: the chance that the sample mean is more than $\varepsilon$ away from $\mu$ shrinks to zero, but any particular realisation might wobble.

The strong law of large numbers (SLLN) makes a sharper claim:

$$ \bar{X}_n \xrightarrow{\;\text{a.s.}\;} \mu, $$

i.e. with probability $1$, the sequence of sample means actually converges to $\mu$. Almost-sure convergence is a statement about trajectories; convergence in probability is a statement about marginal distributions. The strong law implies the weak law; the converse is false.

Both laws require only that $\mathbb{E}[|X|]$ be finite. They do not require a variance, let alone higher moments. The Chebyshev proof of the weak law is particularly clean: if $\operatorname{Var}(X) = \sigma^2 < \infty$, then $\operatorname{Var}(\bar{X}_n) = \sigma^2/n$, and Chebyshev's inequality (next section) gives

$$ \mathbb{P}(|\bar{X}_n - \mu| > \varepsilon) \leq \frac{\sigma^2}{n \varepsilon^2} \to 0. $$
Why ML works, in one theorem. Empirical risk minimisation replaces the true risk $\mathbb{E}[\ell(f_\theta(X), Y)]$ with its empirical version $\frac{1}{n}\sum_i \ell(f_\theta(x_i), y_i)$ and minimises the latter. The LLN guarantees that, for any fixed $\theta$, the empirical risk converges to the true risk as $n \to \infty$. Uniform convergence (over all $\theta$) is harder and requires VC-dimension or Rademacher complexity arguments, but the LLN is the kernel the whole statistical learning theory edifice is built on.

Pathological counter-examples are instructive. The Cauchy distribution has no mean; in fact the sample mean of $n$ i.i.d. Cauchy variates has exactly the same Cauchy distribution as a single observation, so averaging buys you nothing at all. Heavy-tailed distributions obey the LLN only if the tail decays fast enough for the mean to exist — and even then, finite-sample convergence can be painfully slow. The concentration inequalities of section 16 quantify how fast.
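Both the convergence and the Cauchy failure are easy to see in simulation — a sketch with toy sample sizes:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1_000_000

# LLN in action: the mean of a million Exp(1) draws settles onto the true mean 1.
exp_mean = rng.exponential(1.0, n).mean()

# Cauchy failure: each sample mean of Cauchy draws is itself Cauchy-distributed,
# so repeating the experiment shows the spread never shrinks with sample size.
cauchy_means = np.array([rng.standard_cauchy(10_000).mean() for _ in range(50)])

print(exp_mean)                                # ≈ 1.0
print(cauchy_means.min(), cauchy_means.max())  # wildly spread despite 10k draws each
```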

Section 15

The central limit theorem

Averages don't just converge to the mean — they converge in a particular shape. The central limit theorem says the deviation $\sqrt{n}(\bar{X}_n - \mu)$ is asymptotically Gaussian, regardless of the shape of the underlying distribution. This is why Gaussian noise is everywhere and why confidence intervals look the way they do.

Let $X_1, X_2, \ldots$ be i.i.d. with mean $\mu$ and finite variance $\sigma^2 > 0$. Then

$$ \sqrt{n}\,(\bar{X}_n - \mu) \;\xrightarrow{\;d\;}\; \mathcal{N}(0, \sigma^2), $$

where $\xrightarrow{d}$ denotes convergence in distribution: the CDF of the left side converges pointwise to the CDF of a $\mathcal{N}(0, \sigma^2)$. Equivalently, for large $n$,

$$ \bar{X}_n \;\approx\; \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right). $$

The sample mean is approximately Gaussian, with variance shrinking like $1/n$, no matter whether the underlying $X_i$ were Bernoulli, exponential, or something exotic — so long as the variance is finite.
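A quick numerical sketch of this, using Bernoulli draws (parameters are arbitrary toys): standardise the sample mean and check that its distribution matches a standard normal at a reference point, $\Phi(1) \approx 0.8413$.

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 1_000, 200_000

# Bernoulli(0.3) is far from Gaussian, but its standardised sample mean is not.
p = 0.3
means = rng.binomial(n, p, size=reps) / n
z = (means - p) / np.sqrt(p * (1 - p) / n)

# Compare P(Z <= 1) with the standard normal CDF value Phi(1) ≈ 0.8413.
print((z <= 1.0).mean())
print(z.mean(), z.var())   # ≈ 0 and ≈ 1
```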

The standard proof uses characteristic functions: the characteristic function of $\sqrt{n}(\bar{X}_n - \mu)/\sigma$ is the $n$-th power of that of a single standardised summand evaluated at $t/\sqrt{n}$, and Taylor's theorem gives $\left(1 - \frac{t^2}{2n} + o(1/n)\right)^{n} \to e^{-t^2/2}$, the characteristic function of a standard normal. Lévy's continuity theorem then upgrades pointwise convergence of characteristic functions to convergence in distribution.

The multivariate CLT. If $\mathbf{X}_1, \mathbf{X}_2, \ldots$ are i.i.d. random vectors in $\mathbb{R}^d$ with mean $\boldsymbol{\mu}$ and finite covariance $\Sigma$, then $\sqrt{n}(\bar{\mathbf{X}}_n - \boldsymbol{\mu}) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \Sigma)$. This is what justifies Gaussian approximations for vector-valued estimators (MLE asymptotics, bootstrap confidence intervals, asymptotic normality of gradient-descent iterates) — all of which say "for large $n$, the estimator is approximately Gaussian around its target with a specific covariance."

The CLT also has a non-i.i.d. version (the Lindeberg–Feller CLT) that allows sums of non-identically-distributed independent variables, provided no single variable dominates the sum. This is the right tool when you are summing gradients from heterogeneous mini-batches.

Rate of convergence is measured by the Berry–Esseen theorem, which states

$$ \sup_x \left| F_{\bar{X}_n}(x) - \Phi\!\left(\frac{x - \mu}{\sigma / \sqrt{n}}\right) \right| \leq \frac{C \rho}{\sigma^3 \sqrt{n}}, $$

where $\Phi$ is the standard normal CDF, $\rho = \mathbb{E}[|X - \mu|^3]$, and $C$ is a universal constant currently bounded by $C \leq 0.4748$ (Shevtsova, 2011). The error decays like $1/\sqrt{n}$, which is why $n$ in the thousands is usually "large enough" for CLT-based approximations to be reliable.

Section 16

Concentration inequalities

The LLN tells you averages converge. Concentration inequalities tell you how fast — with explicit, non-asymptotic bounds on how unlikely it is for a sample mean (or a more general function of many independent variables) to deviate from its expected value. Almost every finite-sample guarantee in modern machine learning is a concentration inequality in disguise.

The grandparent of the family is Markov's inequality: for a non-negative random variable $X$ with finite mean and any $t > 0$,

$$ \mathbb{P}(X \geq t) \;\leq\; \frac{\mathbb{E}[X]}{t}. $$

Applied to $(X - \mathbb{E}[X])^2$, it yields Chebyshev's inequality:

$$ \mathbb{P}(|X - \mathbb{E}[X]| \geq t) \;\leq\; \frac{\operatorname{Var}(X)}{t^2}. $$

Chebyshev gives polynomial decay of the tail and is sharp if all you know is the variance. For sums of independent bounded variables, the exponential-tail inequalities are much stronger.

Hoeffding's inequality. If $X_1, \ldots, X_n$ are independent with $X_i \in [a_i, b_i]$, then for any $t > 0$,

$$ \mathbb{P}\!\left( \left| \bar{X}_n - \mathbb{E}[\bar{X}_n] \right| \geq t \right) \;\leq\; 2 \exp\!\left( -\frac{2 n^2 t^2}{\sum_i (b_i - a_i)^2} \right). $$

For $X_i \in [0, 1]$, the denominator is $n$, so the bound becomes $2\exp(-2nt^2)$. This is exponential in $n$ for any fixed $t$ — the precise reason averages concentrate so tightly so fast. Hoeffding is the canonical inequality for bounding empirical-mean deviations of bounded losses.

Bernstein's inequality improves Hoeffding when the variance is small: for zero-mean bounded $X_i$ with $\operatorname{Var}(X_i) = \sigma^2$ and $|X_i| \leq M$,

$$ \mathbb{P}\!\left( \left| \bar{X}_n \right| \geq t \right) \;\leq\; 2 \exp\!\left( -\frac{n t^2 / 2}{\sigma^2 + M t / 3} \right). $$

When $t$ is small compared to $\sigma^2 / M$, this is an $\exp(-nt^2/(2\sigma^2))$ bound — much tighter than Hoeffding's $\exp(-2nt^2)$ whenever $\sigma^2 \ll 1$. Bernstein captures low-variance concentration; Hoeffding captures bounded-range concentration.

McDiarmid's inequality (also called the bounded-difference inequality) generalises Hoeffding from sums to any function $f(X_1, \ldots, X_n)$ of independent random variables whose value changes by at most $c_i$ when you perturb the $i$-th coordinate alone. It states

$$ \mathbb{P}(|f - \mathbb{E}[f]| \geq t) \;\leq\; 2 \exp\!\left( -\frac{2 t^2}{\sum_i c_i^2} \right). $$

McDiarmid is the concentration tool behind Rademacher-complexity generalisation bounds: an empirical Rademacher average is a bounded-difference function, so it concentrates tightly around its expectation.

Sub-Gaussian random variables. A zero-mean random variable is sub-Gaussian with parameter $\sigma^2$ if $\mathbb{E}[e^{\lambda X}] \leq e^{\lambda^2 \sigma^2 / 2}$ for all $\lambda \in \mathbb{R}$. Centred Gaussians, centred bounded random variables, and many "nice" distributions are sub-Gaussian; their tails satisfy $\mathbb{P}(|X| \geq t) \leq 2 e^{-t^2/(2\sigma^2)}$. The whole theory of high-dimensional probability — covering everything from random matrices to JL lemmas to compressive sensing — is built on sub-Gaussian (and its heavier-tailed cousin, sub-exponential) calculus.

Concentration inequalities are, collectively, the reason finite-sample learning works at all. Every time you see a PAC bound, a generalisation gap in expectation, a high-probability guarantee, or a result of the form "with probability $1 - \delta$, …", one of the inequalities on this page is doing the work behind the scenes.

Section 17

Stochastic processes

A stochastic process is a sequence (or continuum) of random variables indexed by time. Random walks, Markov chains, martingales, and Brownian motion are the four families that cover almost every probabilistic model with memory — and they are the theoretical substrate of MCMC, reinforcement learning, and diffusion models.

A random walk is the simplest example: $S_n = \sum_{i=1}^n X_i$ where the $X_i$ are i.i.d. $\pm 1$ with probability $1/2$ each. Simple as it is, random walks already exhibit most of the phenomena of the general theory. The expected position is zero; the standard deviation grows like $\sqrt{n}$ (a direct variance calculation: $\operatorname{Var}(S_n) = n$); paths are surprisingly wide-ranging — the walk returns to the origin infinitely often with probability $1$ in one and two dimensions, but only finitely often in three or more (Pólya's theorem). The CLT says that, properly scaled, the walk approaches a Brownian motion.

A Markov chain is a sequence $X_0, X_1, X_2, \ldots$ with the Markov property: the distribution of $X_{n+1}$ given the entire past depends only on $X_n$. The past is summarised by the present. A Markov chain is fully described by its transition kernel $P(x, y) = \mathbb{P}(X_{n+1} = y \mid X_n = x)$. An ergodic Markov chain has a unique stationary distribution $\pi$ to which the chain's marginals converge, regardless of the starting state; this is what MCMC methods exploit, constructing a chain whose stationary distribution is the target posterior and then running the chain until it has mixed.
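Convergence to the stationary distribution is visible by just iterating the kernel. A sketch with a toy 3-state transition matrix of our choosing:

```python
import numpy as np

# A 3-state transition matrix P: rows are the current state, columns the next state.
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

# Ergodic chain: iterating the kernel forgets the start.  After many steps every
# row of P^n is (approximately) the stationary distribution pi, which solves pi P = pi.
Pn = np.linalg.matrix_power(P, 50)
pi = Pn[0]

print(pi)
print(np.allclose(pi @ P, pi))    # stationarity:  pi P = pi
print(np.allclose(Pn[0], Pn[1]))  # start-state independence: all rows agree
```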

A martingale is a sequence $(M_n)$ adapted to a filtration $(\mathcal{F}_n)$ — an increasing sequence of σ-algebras representing "information at time $n$" — with

$$ \mathbb{E}[M_{n+1} \mid \mathcal{F}_n] = M_n. $$

"On average, given everything known up to time $n$, the next value equals the current one." Martingales formalise "fair game" dynamics and unlock a suite of concentration and convergence results (Doob's inequalities, optional stopping, Azuma's inequality). In ML theory, martingale arguments underpin online-learning regret bounds, online-to-batch conversions, and the analysis of SGD trajectories.

Brownian motion (or the Wiener process) $W_t$ is the continuous-time, continuous-space analogue of a random walk: $W_0 = 0$, independent Gaussian increments with $W_t - W_s \sim \mathcal{N}(0, t-s)$, continuous but nowhere-differentiable sample paths. It is the scaling limit of nearly every discrete random walk by Donsker's theorem, and it is the noise process driving stochastic differential equations. Modern diffusion models for generative AI — DDPMs, score-based models, rectified flows — are built around an SDE driven by Brownian motion and its time-reversed pair.
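Donsker's scaling limit can be sketched numerically: rescale a $\pm 1$ walk by $1/\sqrt{n}$ and check the two defining Brownian properties, $\operatorname{Var}(W_t) = t$ and independent increments (step count and path count are illustrative toys):

```python
import numpy as np

rng = np.random.default_rng(9)

# Donsker in miniature: a +/-1 random walk with n steps, rescaled by 1/sqrt(n),
# behaves like Brownian motion on [0, 1].
n, paths = 100, 100_000
steps = rng.choice([-1.0, 1.0], size=(paths, n))
W = steps.cumsum(axis=1) / np.sqrt(n)

t = 0.5
Wt = W[:, int(t * n) - 1]
print(Wt.mean(), Wt.var())   # ≈ 0 and ≈ t = 0.5

# Independent Gaussian increments: W_1 - W_{0.5} has variance 0.5 and is
# uncorrelated with W_{0.5}.
inc = W[:, -1] - Wt
print(inc.var(), np.corrcoef(Wt, inc)[0, 1])   # ≈ 0.5 and ≈ 0
```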

Ergodicity in one sentence. A stationary process is ergodic if its long-run time averages equal its ensemble averages. The ergodic theorem is the LLN for dependent data: it is why a single long Markov-chain trajectory suffices to estimate integrals under the stationary distribution, without requiring independent samples. MCMC convergence diagnostics, policy-gradient estimators in RL, and the statistical analysis of long text corpora all rest on this one idea.
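The ergodic theorem, too, is checkable by simulation: run one long trajectory of a small ergodic chain (a toy 3-state kernel of our choosing) and watch the visit frequencies converge to the stationary distribution without ever drawing independent samples.

```python
import numpy as np

rng = np.random.default_rng(10)

# One long trajectory; time averages of state visits converge to pi.
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
T = 100_000
state, visits = 0, np.zeros(3)
for _ in range(T):
    visits[state] += 1
    state = rng.choice(3, p=P[state])

freq = visits / T
print(freq)   # ≈ (0.244, 0.463, 0.293), the stationary distribution of P
```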
Section 18

Where it shows up in ML

A tour of where the machinery in this chapter turns into working machine learning — so you can see the bones of probability behind the names you have already heard.

Keep the bestiary in mind. Most ML papers are not introducing new ideas so much as rearranging the pieces of this chapter into new combinations — a different likelihood, a different prior, a different Markov kernel, a different change-of-variables. The language in which those rearrangements are expressed is the language developed here.

Further reading

Where to go next

Probability is a subject where the same ideas get retold many times, with each telling a little more rigorous than the last. Reading two or three of these in parallel — one intuitive, one measure-theoretic, one applied — is the fastest way in.

Textbooks — first pass

Textbooks — measure-theoretic

Textbooks — high-dimensional and concentration

Free video courses

References and papers

This is the fourth chapter of an ongoing compendium. The mathematical foundations continue with information theory, graphs and relations, dynamical systems, and — to close Part I — numerical linear algebra at scale. Part II then uses all of it to build machine learning from the ground up.