Part I · Mathematical Foundations · Chapter 05

Statistics is probability, run backwards.

Probability assumes a model and predicts data; statistics receives data and infers the model. This chapter builds the frequentist machinery from the ground up — estimators, confidence intervals, hypothesis testing, regression, resampling, causal inference — and names each piece's ML descendant. By the end, cross-entropy losses, train/test splits, A/B tests, regularisation, and cross-validation should all look like the same set of ideas in modern clothing.

How to read this chapter

Each section builds on the previous one, so read in order the first time through. The prose is the primary explanation; the equations make the prose precise; the callouts flag the places where intuition or terminology typically goes wrong. If a formula looks scary, read the sentence before it.

Notation: scalars are lowercase italic ($x$, $\theta$); random variables are uppercase ($X$, $Y$, $T$); vectors are bold lowercase ($\mathbf{x}$, $\boldsymbol{\beta}$); matrices are capitals ($\mathbf{X}$, $\Sigma$). $\mathbb{P}$ is probability, $\mathbb{E}$ is expectation, $\operatorname{Var}$ is variance. $X \sim F$ means "$X$ has distribution $F$"; $X_1, \ldots, X_n \stackrel{\text{i.i.d.}}{\sim} F$ means all are independent with the same distribution. Hats denote estimators ($\hat{\theta}$, $\hat{\boldsymbol{\beta}}$). Key terms when first defined appear in a slightly different color.

Contents

  1. Why statistics? · Inverse problem
  2. Samples and sampling distributions · Populations vs. samples
  3. Estimators and their properties · Bias, variance, MSE, consistency
  4. Method of moments and maximum likelihood · MLE, Fisher information
  5. Confidence intervals · Interval estimation
  6. The logic of hypothesis testing · Null, alternative, p-values
  7. The common tests · t, $\chi^2$, F, ANOVA
  8. Multiple testing and the replication crisis · FWER vs. FDR
  9. Linear regression · OLS, projection, likelihood
  10. Inference inside a regression · t, F, prediction intervals
  11. Generalised linear models · Logistic, Poisson, link functions
  12. Bias–variance, ridge, lasso · Regularisation as prior
  13. Bootstrap, jackknife, cross-validation · Resampling
  14. Bayesian and frequentist inference, side by side · Two paradigms
  15. Correlation, causation, and the gap between · Potential outcomes, DAGs, IVs
  16. Designing experiments · RCTs, power, A/B tests
  17. When observations are not independent · Time series, HAC, grouped CV
  18. Where statistics shows up in modern ML · Payoff

Section 01

Statistics is probability, run backwards.

Probability assumes a model and asks what data should look like. Statistics receives the data and asks what model produced it. Every method in this chapter is a different answer to that one inverse problem.

The previous chapter built up the machinery of distributions and expectations on the assumption that you know the underlying probability law. In real applications you almost never do. You see a finite sample of measurements — gene expression levels, click-through rates, sensor readings, image pixels — and you have to infer the distribution that produced them, or some functional of it: a mean, a variance, a regression coefficient, a class boundary, a treatment effect. That inversion problem is statistics.

The setup looks like this. We posit a statistical model: a family of probability distributions $\{p_\theta : \theta \in \Theta\}$ indexed by an unknown parameter $\theta$ in some parameter space $\Theta$. The data $X_1, \ldots, X_n$ are assumed to be drawn i.i.d. from one distribution in that family — call it $p_{\theta^*}$ — but we do not know which $\theta^*$ is the true one. The job is to use the data to make defensible statements about $\theta^*$: a single best guess (point estimation), a range of plausible values (confidence intervals), or a yes/no decision about a candidate value (hypothesis testing).

The two great schools. Frequentist statistics treats $\theta^*$ as a fixed but unknown constant and randomness as coming entirely from the sampling process. Probability statements are about the procedure (e.g. "this interval will contain $\theta^*$ in 95% of repeated samples"). Bayesian statistics treats $\theta$ as itself a random variable with a prior $\pi(\theta)$, and uses Bayes' theorem to compute a posterior $\pi(\theta \mid X)$. The two schools agree on the data, the model, and the algebra; they disagree on the interpretation of probability. This chapter is mostly frequentist; Chapter 07 is the Bayesian side of the story.

The chapter that follows is structured around the four tasks frequentists care about most. Sections 02–04 build the theory of estimation: what makes one estimator better than another. Section 05 turns single-number estimates into intervals. Sections 06–08 cover testing — formal procedures for deciding between hypotheses, and the modern problem of doing so honestly when you run many tests at once. Sections 09–12 develop regression, the most important applied branch of the field. Section 13 covers resampling — the bootstrap and friends — which made many of the older asymptotic formulas obsolete. Section 14 is a brief frequentist–Bayesian reconciliation. Sections 15–17 cover experimental design and other topics that bridge into causal inference and time series. Section 18 closes by mapping every concept onto its modern incarnation in machine learning.

The prerequisites are exactly the chapters that came before: linear algebra (for regression and multivariate distributions), calculus (for likelihoods and score functions), optimization (for estimating $\theta$ by minimising a loss), and probability (for everything). If those chapters felt like a long detour, this one is where the road rejoins the highway.

Section 02

Samples, statistics, and sampling distributions.

Anything you compute from data is itself a random variable. Understanding which distribution it follows is the central trick that turns "I measured a number" into "here is what I know about the truth."

A random sample of size $n$ from a distribution $F$ is a sequence $X_1, \ldots, X_n \stackrel{\text{i.i.d.}}{\sim} F$. A statistic is any function of the sample,

$$ T = T(X_1, \ldots, X_n), $$

that does not depend on the unknown parameter $\theta$. The sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$, the sample variance $S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X}_n)^2$, the sample median, the sample maximum, the sample correlation — all are statistics. Because each is a function of random inputs, each has its own distribution: the sampling distribution of the statistic.

The most studied sampling distribution is that of the sample mean. If the $X_i$ have mean $\mu$ and variance $\sigma^2$, then by linearity of expectation and the variance of a sum of independents,

$$ \mathbb{E}[\bar{X}_n] = \mu, \qquad \operatorname{Var}(\bar{X}_n) = \frac{\sigma^2}{n}. $$

The standard deviation of the sample mean — sometimes called the standard error of the mean — is therefore $\sigma / \sqrt{n}$. By the central limit theorem, for large $n$,

$$ \bar{X}_n \;\overset{\text{approx.}}{\sim}\; \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right). $$

This single fact is the engine behind most classical confidence intervals and tests. It also explains the famous "$1/\sqrt{n}$" rate of convergence: doubling your dataset reduces uncertainty about the mean by only $\sqrt{2}$.
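A minimal NumPy simulation makes the $1/\sqrt{n}$ rate concrete (the Exponential(1) population is an arbitrary choice; it has $\sigma = 1$):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0  # an Exponential(1) population has standard deviation 1
for n in (100, 400):
    # 10,000 replications of "draw n points, take their mean"
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    print(n, round(means.std(ddof=1), 4), round(sigma / np.sqrt(n), 4))
```

Quadrupling $n$ from 100 to 400 halves the empirical standard error, exactly as $\sigma/\sqrt{n}$ predicts, even though the population is far from normal.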

Standard error vs. standard deviation. The standard deviation describes the spread of the population (or one observation). The standard error describes the spread of an estimator. They differ by a factor of $\sqrt{n}$, and confusing them is the most common mistake in undergraduate statistics. When in doubt: the SE shrinks as you collect more data; the SD does not.

Three other classical sampling distributions show up constantly. If $X_1, \ldots, X_n \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(\mu, \sigma^2)$, then

$$ \frac{(n-1)\,S_n^2}{\sigma^2} \sim \chi^2_{n-1}, \qquad \frac{\bar{X}_n - \mu}{S_n / \sqrt{n}} \sim t_{n-1}, \qquad \frac{S_1^2 / \sigma_1^2}{S_2^2 / \sigma_2^2} \sim F_{n_1 - 1,\, n_2 - 1}. $$

The chi-squared distribution is the law of a sum of squared standard normals; it powers variance estimation and goodness-of-fit tests. The t-distribution is what you get when you replace the unknown $\sigma$ by the data's own $S_n$ — it has heavier tails than a normal, especially for small $n$. The F-distribution is the ratio of two chi-squareds; it powers ANOVA and tests of nested regression models. All three converge to standard reference distributions as $n \to \infty$, but their finite-sample form is what makes small-sample inference rigorous.
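The first of these laws is easy to check numerically. A sketch, assuming SciPy is available, that simulates $(n-1)S_n^2/\sigma^2$ and compares it to the $\chi^2_{n-1}$ distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
n, sigma = 8, 2.0
x = rng.normal(0.0, sigma, size=(50_000, n))

s2 = x.var(axis=1, ddof=1)           # sample variance S_n^2 for each row
q = (n - 1) * s2 / sigma**2          # should follow chi^2_{n-1}
print(round(q.mean(), 2))            # chi^2_{n-1} has mean n - 1 = 7

# Kolmogorov-Smirnov distance to the chi^2_{n-1} cdf (small = good fit)
ks = stats.kstest(q, stats.chi2(df=n - 1).cdf)
print(round(ks.statistic, 4))
```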

Section 03

Estimators and their properties.

An estimator is a recipe for turning data into a guess about a parameter. A good estimator is one whose guesses are right on average, narrow around the truth, and behave correctly as data accumulate.

A point estimator for a parameter $\theta$ is a statistic $\hat{\theta}_n = \hat{\theta}(X_1, \ldots, X_n)$ that we use as our best single guess. Since $\hat{\theta}_n$ is a function of random data, it is itself a random variable; its quality is measured by properties of its distribution.

Bias. The bias of an estimator is its expected error,

$$ \operatorname{Bias}(\hat{\theta}) \;=\; \mathbb{E}_\theta[\hat{\theta}] - \theta. $$

An estimator is unbiased if its bias is zero for every $\theta$. The sample mean is unbiased for $\mu$. The sample variance with $1/(n-1)$ is unbiased for $\sigma^2$; the version with $1/n$ is biased downward. The MLE is famously biased in finite samples but its bias usually shrinks at rate $1/n$.

Variance and mean squared error. Two unbiased estimators can still differ wildly in spread. The mean squared error of $\hat{\theta}$ is

$$ \operatorname{MSE}(\hat{\theta}) \;=\; \mathbb{E}_\theta\!\left[(\hat{\theta} - \theta)^2\right] \;=\; \operatorname{Var}(\hat{\theta}) + \operatorname{Bias}(\hat{\theta})^2. $$

This is the bias–variance decomposition that will reappear under different names in regression (Section 12) and in machine learning (Section 18). It says you can pay for variance reduction with bias and vice versa, and the right trade is the one with lowest total error — not the one with the most reassuring zero in the bias column.
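The decomposition, and the fact that a biased estimator can win on MSE, can be verified directly. A sketch comparing the $1/(n-1)$ and $1/n$ variance estimators in a normal model (the specific $n$ and $\sigma^2$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 10, 4.0
x = rng.normal(0.0, np.sqrt(sigma2), size=(200_000, n))

# ddof=1: the unbiased 1/(n-1) estimator; ddof=0: the biased 1/n (MLE) version
for ddof in (1, 0):
    est = x.var(axis=1, ddof=ddof)
    bias = est.mean() - sigma2
    var = est.var()
    mse = ((est - sigma2) ** 2).mean()   # equals var + bias**2
    print(ddof, round(bias, 3), round(var, 3), round(mse, 3))
```

For normal data the $1/n$ estimator's bias costs less than its variance saving, so it has the lower MSE: a first taste of the trade described above.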

Consistency. An estimator is consistent if it converges in probability to the true parameter as $n \to \infty$:

$$ \hat{\theta}_n \xrightarrow{p} \theta \qquad \text{i.e.}\qquad \mathbb{P}(|\hat{\theta}_n - \theta| > \varepsilon) \to 0 \text{ for every } \varepsilon > 0. $$

Consistency is a minimum requirement: an estimator that doesn't get closer to the truth as you collect more data is unfit for purpose. A sufficient condition is unbiasedness with variance going to zero (immediate from Chebyshev), but the property is more general than that.

Sufficiency. A statistic $T$ is sufficient for $\theta$ if the conditional distribution of the data given $T$ does not depend on $\theta$ — informally, $T$ contains all the information in the sample about $\theta$. The sum $\sum X_i$ is sufficient for $\mu$ in a normal model with known variance; the pair $(\sum X_i, \sum X_i^2)$ is sufficient when both $\mu$ and $\sigma^2$ are unknown. The Rao–Blackwell theorem says that if you have any unbiased estimator, conditioning it on a sufficient statistic produces an estimator with smaller variance — a constructive way to improve estimators and a hint of why MLEs are usually nearly optimal.

The Cramér–Rao lower bound. Among all unbiased estimators, how small can the variance possibly be? The answer involves the Fisher information

$$ I(\theta) \;=\; -\,\mathbb{E}_\theta\!\left[\frac{\partial^2 \log p_\theta(X)}{\partial \theta^2}\right], $$

a quantity that measures how sharply the likelihood is curved around the true parameter. The Cramér–Rao inequality states that for any unbiased estimator $\hat{\theta}$ in a regular model,

$$ \operatorname{Var}(\hat{\theta}) \;\geq\; \frac{1}{n\, I(\theta)}. $$

An estimator that achieves this bound is called efficient. The MLE is asymptotically efficient under mild regularity conditions, which is one of the deepest reasons it dominates statistical practice. (And it generalises: the multivariate Cramér–Rao bound replaces $I(\theta)$ with the Fisher information matrix and the inequality with one between covariance matrices.)
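For a concrete case, the Bernoulli model has $I(p) = 1/(p(1-p))$, and the MLE — the sample mean — attains the bound exactly. A quick simulation sketch (the choices of $p$ and $n$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(11)
p, n = 0.3, 50
# Fisher information of one Bernoulli(p) draw: I(p) = 1/(p(1-p)),
# so the Cramer-Rao bound for unbiased estimators is p(1-p)/n.
bound = p * (1 - p) / n

p_hats = rng.binomial(n, p, size=200_000) / n   # the MLE is the sample mean
print(round(p_hats.var(), 5), round(bound, 5))  # the bound is attained
```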

Section 04

Method of moments and maximum likelihood.

Two recipes generate almost every estimator you will meet. One matches sample moments to model moments. The other maximises the probability of having seen what you saw.

Method of moments (MoM). If a model has $k$ parameters $\theta = (\theta_1, \ldots, \theta_k)$, set the first $k$ population moments equal to the corresponding sample moments and solve:

$$ \mathbb{E}_\theta[X^j] = \frac{1}{n}\sum_{i=1}^n X_i^j, \qquad j = 1, 2, \ldots, k. $$

For a normal model $\mathcal{N}(\mu, \sigma^2)$ with two parameters, this gives the sample mean for $\mu$ and the sample variance (with $1/n$) for $\sigma^2$ — clean and intuitive. MoM estimators are usually consistent but not always efficient; their main role today is producing decent starting points for iterative procedures and supplying closed-form alternatives in models where the likelihood is intractable.
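A sketch for a two-parameter Gamma$(k, \theta)$ model, where matching $\mathbb{E}[X] = k\theta$ and $\operatorname{Var}(X) = k\theta^2$ gives closed-form MoM estimators (the true parameter values here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
k_true, theta_true = 3.0, 2.0
x = rng.gamma(shape=k_true, scale=theta_true, size=100_000)

# Solve E[X] = k*theta and Var(X) = k*theta^2 for (k, theta)
m, v = x.mean(), x.var()   # sample moments (1/n variance, as in the text)
theta_hat = v / m          # scale
k_hat = m**2 / v           # shape
print(round(k_hat, 2), round(theta_hat, 2))
```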

Maximum likelihood (MLE). Given data $X_1, \ldots, X_n \stackrel{\text{i.i.d.}}{\sim} p_\theta$, the likelihood is the joint density viewed as a function of $\theta$:

$$ L(\theta) \;=\; \prod_{i=1}^n p_\theta(X_i). $$

The log-likelihood $\ell(\theta) = \log L(\theta) = \sum_i \log p_\theta(X_i)$ is easier to differentiate and almost always what people actually optimise. The maximum likelihood estimator is

$$ \hat{\theta}_{\text{MLE}} \;=\; \operatorname*{arg\,max}_{\theta \in \Theta} \ell(\theta). $$

Setting the derivative — the score function $U(\theta) = \partial \ell / \partial \theta$ — to zero and solving yields the MLE in models smooth enough to admit calculus. For a normal model, $\hat{\mu}_{\text{MLE}} = \bar{X}_n$ and $\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_i (X_i - \bar{X}_n)^2$ — note the $1/n$, not $1/(n-1)$: the MLE for the variance is biased downward in this model.

Why MLE is everywhere in ML. Cross-entropy loss is negative log-likelihood for a categorical model. Mean-squared-error loss is negative log-likelihood for a Gaussian model with constant variance. Logistic regression's loss is negative log-likelihood for a Bernoulli model. Most "loss functions" you have ever minimised by gradient descent are negative log-likelihoods in disguise, and minimising them computes an MLE; gradient descent is just a numerical solver for the score equation $U(\theta) = 0$.
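The Bernoulli case can be checked in two lines: binary cross-entropy and the average Bernoulli negative log-likelihood are the same number (the labels and predicted probabilities below are made up for illustration):

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])              # observed labels
q = np.array([0.9, 0.2, 0.7, 0.6, 0.1])    # model's predicted P(y = 1)

# Binary cross-entropy, as an ML practitioner writes it
bce = -(y * np.log(q) + (1 - y) * np.log(1 - q)).mean()

# Average negative log-likelihood of the same Bernoulli model
nll = -np.log(np.where(y == 1, q, 1 - q)).mean()
print(round(bce, 6), round(nll, 6))
```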

Asymptotic normality of the MLE. Under mild regularity conditions (identifiability, smooth densities, an interior maximum), the MLE satisfies

$$ \sqrt{n}\,(\hat{\theta}_{\text{MLE}} - \theta^*) \;\xrightarrow{d}\; \mathcal{N}\!\left(0, \, I(\theta^*)^{-1}\right). $$

In words: MLEs are asymptotically unbiased, asymptotically Gaussian, and asymptotically efficient — their variance attains the Cramér–Rao bound. This is the theoretical justification for using $\hat{I}(\hat{\theta})^{-1}/n$ as an approximate covariance matrix when reporting standard errors of fitted parameters in everything from logistic regression to deep neural networks (where it shows up as the observed Fisher information or its diagonal approximation in second-order optimisers).

Numerical reality. Closed-form MLEs are rare outside textbook examples. In practice the log-likelihood is maximised by the optimisation algorithms of Chapter 03: Newton's method (using the Hessian, which is minus the observed Fisher information), Fisher scoring (replacing the Hessian with $-\,I(\theta)$), or first-order gradient methods. For very large datasets the gradient is approximated by stochastic mini-batches, and "MLE" silently becomes "stochastic gradient ascent on the log-likelihood" — i.e., training a neural network.
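A sketch of that numerical reality for the normal model, using a generic optimiser from SciPy and checking the answer against the closed forms (parametrising by $\log\sigma$ to keep the optimiser in-bounds is a standard trick, not something the text prescribes):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = rng.normal(5.0, 2.0, size=500)

def neg_log_lik(params):
    # Optimise over (mu, log sigma) so that sigma stays positive
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    return 0.5 * len(x) * np.log(2 * np.pi * sigma**2) \
        + ((x - mu) ** 2).sum() / (2 * sigma**2)

res = minimize(neg_log_lik, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

# Closed-form MLEs for comparison: sample mean, 1/n sample variance
print(round(mu_hat, 3), round(x.mean(), 3))
print(round(sigma_hat, 3), round(np.sqrt(x.var(ddof=0)), 3))
```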

Section 05

Confidence intervals.

A point estimate without an interval is a number without context. Confidence intervals turn the sampling distribution of an estimator into a precise statement about how far the truth might plausibly be.

A $(1-\alpha)$ confidence interval for $\theta$ is a pair of statistics $L(X), U(X)$ such that

$$ \mathbb{P}_\theta\!\left(L(X) \leq \theta \leq U(X)\right) \;\geq\; 1 - \alpha \qquad \text{for every } \theta \in \Theta. $$

The interval is random; the parameter is fixed. The probability statement is about the procedure: if you repeated the entire experiment many times, in $1-\alpha$ of those replications the produced interval would cover the true $\theta$. It is not a statement about the probability that this particular interval contains $\theta$ — that probability is either 1 or 0, since both endpoints and $\theta$ are fixed once data are observed. (This is the line that separates the frequentist and Bayesian readings; the Bayesian credible interval, Section 14, has the natural "probability that $\theta$ lies in the interval" interpretation, but pays a different price.)

The Normal interval for a mean. If $X_1, \ldots, X_n \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(\mu, \sigma^2)$ with known $\sigma$, then $\bar{X}_n \sim \mathcal{N}(\mu, \sigma^2/n)$ and

$$ \bar{X}_n \;\pm\; z_{1 - \alpha/2}\, \frac{\sigma}{\sqrt{n}} $$

is an exact $(1-\alpha)$ interval, where $z_{1-\alpha/2}$ is the relevant standard-normal quantile (1.96 for $\alpha = 0.05$). When $\sigma$ is unknown, replace it by the sample standard deviation $S_n$ and the normal quantile by a t-quantile $t_{n-1, 1-\alpha/2}$:

$$ \bar{X}_n \;\pm\; t_{n-1, 1-\alpha/2}\, \frac{S_n}{\sqrt{n}}. $$

For large $n$ the t-quantile is essentially the normal quantile and you can stop worrying about which one to use; for small samples (say $n < 30$) it matters.
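The t-interval in code, as a sketch (the data are simulated; `scipy.stats.t.interval` is a convenience wrapper around the same quantile arithmetic done by hand above it):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(10.0, 3.0, size=25)
n = len(x)

xbar, s = x.mean(), x.std(ddof=1)
t_crit = stats.t.ppf(0.975, df=n - 1)      # the t_{n-1, 0.975} quantile
half = t_crit * s / np.sqrt(n)
print(f"95% CI by hand: ({xbar - half:.2f}, {xbar + half:.2f})")

# scipy builds the same interval directly from the t distribution
lo, hi = stats.t.interval(0.95, df=n - 1, loc=xbar, scale=s / np.sqrt(n))
print(f"95% CI (scipy):  ({lo:.2f}, {hi:.2f})")
```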

Pivots. A pivot quantity is a function of the data and the parameter whose distribution does not depend on the parameter. For the normal-mean problem above, $(\bar{X}_n - \mu)/(S_n/\sqrt{n})$ is a pivot — its distribution is $t_{n-1}$ regardless of what $\mu$ or $\sigma$ are. Inverting a pivot's known distribution is the cleanest route to an exact confidence interval, and is the technique behind almost every classical interval in this section.

Asymptotic intervals via the MLE. When closed-form pivots are unavailable, the asymptotic normality of the MLE gives

$$ \hat{\theta}_{\text{MLE}} \;\pm\; z_{1 - \alpha/2}\, \frac{1}{\sqrt{n\, \hat{I}(\hat{\theta})}}, $$

a "Wald interval" valid for large $n$. It is the workhorse for parameter uncertainty in logistic regression, GLMs, and generic likelihood models. Two close cousins — the score interval (inverting a test based on the score function) and the likelihood-ratio interval (the set $\{\theta : 2[\ell(\hat\theta) - \ell(\theta)] \leq \chi^2_{1, 1-\alpha}\}$) — have better small-sample behaviour but require more computation.

Bootstrap intervals. When even asymptotics are inconvenient — heavy-tailed data, small samples, complicated functionals — the bootstrap (Section 13) offers a fully numerical alternative: simulate the sampling distribution by resampling from the data itself. In modern practice, intervals for almost any awkward quantity — variances of trained model parameters, complicated functionals of estimated distributions, derived quantities of regression coefficients — are bootstrap intervals.
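A minimal percentile-bootstrap sketch for a statistic with no convenient textbook interval, the sample median of skewed data (the Exponential population and 5,000 resamples are arbitrary choices; Section 13 treats the method properly):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(scale=2.0, size=80)   # skewed data

# Percentile bootstrap for the sample median: resample, recompute, read off quantiles
boot = np.array([
    np.median(rng.choice(x, size=len(x), replace=True))
    for _ in range(5_000)
])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"median = {np.median(x):.2f}, 95% bootstrap CI = ({lo:.2f}, {hi:.2f})")
```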

Section 06

The logic of hypothesis testing.

A test is a decision rule. Frame the question as "could the data have plausibly come from this null distribution?" and the rest is bookkeeping about which kinds of mistakes you are willing to make.

The setup is two competing models of the world: the null hypothesis $H_0: \theta \in \Theta_0$ and the alternative hypothesis $H_1: \theta \in \Theta_1$. A test is a function $\phi(X) \in \{0, 1\}$ that decides whether to reject $H_0$ (output 1) or fail to reject (output 0). Two kinds of error are possible: a Type I error (rejecting $H_0$ when it is true) and a Type II error (failing to reject $H_0$ when it is false). The level of a test is its worst-case Type I error probability, $\alpha = \sup_{\theta \in \Theta_0} \mathbb{P}_\theta(\phi(X) = 1)$; its power at an alternative $\theta \in \Theta_1$ is $\mathbb{P}_\theta(\phi(X) = 1)$, one minus the Type II error probability there.

The classical paradigm fixes the level $\alpha$ — typically 0.05 or 0.01 — and then designs a test with the largest possible power against alternatives of interest. Note the asymmetry: a test cannot generally control both errors simultaneously, and the convention is to protect against false rejections.

The p-value, properly stated. For a test statistic $T(X)$ with larger values being more extreme under $H_1$, the p-value is $$ p = \mathbb{P}_{H_0}\!\left(T(X) \geq T(x_{\text{obs}})\right). $$ It is the probability — under the null — of seeing a statistic at least as extreme as the one observed. It is not the probability that $H_0$ is true; it is not the probability that the result is due to chance; and it is not a measure of effect size. The rule "reject $H_0$ if $p < \alpha$" is exactly the level-$\alpha$ test based on $T$.
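The definition translates directly into code. A sketch that computes a two-sided one-sample t-test p-value by hand and checks it against `scipy.stats.ttest_1samp`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(0.3, 1.0, size=40)

# Hand-rolled two-sided one-sample t-test of H0: mu = 0
t_obs = x.mean() / (x.std(ddof=1) / np.sqrt(len(x)))
# P(|T| >= |t_obs|) under H0, T ~ t_{n-1}; sf is the survival function 1 - cdf
p_manual = 2 * stats.t.sf(abs(t_obs), df=len(x) - 1)

t_scipy, p_scipy = stats.ttest_1samp(x, popmean=0.0)
print(round(p_manual, 4), round(p_scipy, 4))
```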

The Neyman–Pearson lemma. For two simple hypotheses $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$, the most powerful test of level $\alpha$ rejects when the likelihood ratio is large:

$$ \Lambda(x) \;=\; \frac{p_{\theta_1}(x)}{p_{\theta_0}(x)} \;>\; k, $$

where $k$ is chosen so the false-positive rate equals $\alpha$. This is the only "optimality theorem" of classical testing, and it underwrites essentially every test in the next section. For composite hypotheses (e.g. $H_1: \theta \neq \theta_0$) the likelihood-ratio idea generalises to the generalised likelihood ratio test (GLRT)

$$ \Lambda(x) \;=\; \frac{\sup_{\theta \in \Theta_0} L(\theta)}{\sup_{\theta \in \Theta} L(\theta)}. $$

Under regularity, $-2 \log \Lambda$ (large when the null fits poorly) converges in distribution to $\chi^2_k$, where $k$ is the difference in dimension between the full and null parameter spaces — Wilks' theorem — which is what powers the F-test for nested regression models in Section 10.

One-sided vs. two-sided. When the alternative has a direction ($\theta > \theta_0$), use a one-sided test. When it does not ($\theta \neq \theta_0$), use a two-sided test. The two-sided test splits its rejection region between the two tails, which doubles the p-value relative to the corresponding one-sided test. Reporting one-sided p-values when the direction was chosen after seeing the data is one of the classic and unfortunately common ways to inflate false positives.

Section 07

The common tests, on one page.

Most applied statistics happens with a small set of named tests. Understanding what each one assumes — and what it does when those assumptions fail — is more useful than memorising its formula.

z-test for a mean. $H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$, with $\sigma^2$ known. The statistic

$$ Z = \frac{\bar{X}_n - \mu_0}{\sigma / \sqrt{n}} \;\sim\; \mathcal{N}(0, 1) \text{ under } H_0 $$

is compared to standard normal quantiles. In practice the variance is rarely known; this test is a stepping stone, not an end product.

One-sample t-test. Same as above but with $\sigma$ unknown, replaced by $S_n$:

$$ T = \frac{\bar{X}_n - \mu_0}{S_n / \sqrt{n}} \;\sim\; t_{n-1} \text{ under } H_0. $$

The "Student's t-test" — published anonymously by Gosset in 1908 because his employer, Guinness, considered the result a trade secret. It is robust to mild departures from normality and is the standard tool for comparing a sample mean to a hypothesised value.

Two-sample t-test. $H_0: \mu_1 = \mu_2$ for two independent samples. The pooled-variance version assumes $\sigma_1 = \sigma_2$; Welch's version does not and is the safer default. Either way the statistic is approximately t-distributed under $H_0$.

Paired t-test. When the two samples are paired (same subjects measured before and after), reduce to a one-sample t-test on the differences $D_i = X_i - Y_i$. This eliminates between-subject variability and usually yields much higher power than the two-sample version on the same data.

$\chi^2$ tests. Two flavours, both based on the same statistic but applied differently:

$$ X^2 \;=\; \sum_{\text{cells}} \frac{(O_i - E_i)^2}{E_i} \;\sim\; \chi^2_{\text{df}} \text{ under } H_0, $$

where $O_i$ is the observed count in cell $i$ and $E_i$ is the count expected under $H_0$. The goodness-of-fit test compares observed counts to a hypothesised distribution; df = (cells $-$ 1 $-$ estimated parameters). The test of independence in a contingency table compares observed cell counts to those expected under independence of the row and column variables; df = (rows $-$ 1)(cols $-$ 1).

F-test for equality of variances. Ratio of two sample variances; under $H_0: \sigma_1 = \sigma_2$ in normal samples,

$$ F = \frac{S_1^2}{S_2^2} \;\sim\; F_{n_1-1,\, n_2-1}. $$

Highly sensitive to non-normality; rarely used as a standalone test today and largely replaced by Levene's or Brown–Forsythe tests. The F-distribution itself remains central — it is the reference distribution for ANOVA and for nested-model comparisons in regression.

ANOVA. Analysis of variance compares means across $k \geq 3$ groups by partitioning the total sum of squares into "between-groups" and "within-groups" components. The ratio

$$ F = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}} \;\sim\; F_{k-1,\, n - k} \text{ under } H_0: \mu_1 = \cdots = \mu_k $$

is large when the group means differ substantially relative to within-group variation. ANOVA is exactly the F-test for nested regression models with categorical predictors, a connection made explicit in Section 10.

Non-parametric alternatives. When normality fails or the data are ordinal, the Wilcoxon signed-rank test (paired) and Mann–Whitney U test (two-sample) replace t-tests; the Kruskal–Wallis test replaces ANOVA; the permutation test replaces almost everything by computing the test statistic's exact null distribution from random shuffles of group labels.
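A permutation test is short enough to write from scratch. A sketch for the two-sample difference in means (the group sizes, effect size, and 10,000 shuffles are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)
a = rng.normal(0.0, 1.0, size=30)
b = rng.normal(0.8, 1.0, size=30)

# Observed difference in means, then its null distribution via label shuffles
obs = b.mean() - a.mean()
pooled = np.concatenate([a, b])
count = 0
n_perm = 10_000
for _ in range(n_perm):
    rng.shuffle(pooled)                               # random relabelling
    diff = pooled[30:].mean() - pooled[:30].mean()
    count += abs(diff) >= abs(obs)                    # two-sided comparison
p_value = count / n_perm
print(round(p_value, 4))
```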

The right reflex. Before any test, plot the data. Most "significant" results from t-tests evaporate when you realise the data have a heavy outlier; most "non-significant" results from $\chi^2$ tests come from cells with expected counts under 5. A test is only as good as the assumptions it rests on, and the cheapest sanity check is the one your eyes do in two seconds.

Section 08

Multiple testing and the replication crisis.

The 5% false-positive rate of a single test becomes near-certain when you run thousands. The correct response is not to give up testing, but to control the right error rate.

Run $m$ independent tests at level $\alpha$ each, all with true nulls. The probability that at least one rejects is

$$ 1 - (1 - \alpha)^m \;\approx\; m\alpha \quad \text{for small } \alpha. $$

With $\alpha = 0.05$ and $m = 20$, that's already 64%; with $m = 1{,}000$ (a typical genome-wide scan, or a typical hyperparameter sweep) it is essentially 1. Any culture that rewards "find a significant result" while ignoring the count of tests performed will produce a steady stream of false positives. This was the diagnosis of the replication crisis in psychology, biology, and economics in the 2010s — many "discoveries" failed to replicate not because the science was sloppy but because the multiple-testing arithmetic was ignored.

Two error rates have to be distinguished. The family-wise error rate (FWER) is the probability of making at least one false rejection across the whole family of tests. The false discovery rate (FDR) is the expected proportion of false rejections among all rejections made. Controlling FWER is strict and suits confirmatory studies with few hypotheses; controlling FDR is more permissive and suits exploratory screens with thousands.

Bonferroni correction controls FWER by testing each hypothesis at level $\alpha / m$:

$$ \mathbb{P}(\text{any false rejection}) \;\leq\; \sum_{i=1}^m \frac{\alpha}{m} = \alpha. $$

Simple, conservative, and the right tool for confirmatory studies with a handful of hypotheses. Holm's step-down procedure is uniformly more powerful than Bonferroni at the same FWER and is essentially free; there is no reason to prefer Bonferroni in modern software.

Benjamini–Hochberg (BH). Controls FDR rather than FWER. Sort the p-values $p_{(1)} \leq \cdots \leq p_{(m)}$ and reject all hypotheses with rank $\leq k^*$ where

$$ k^* \;=\; \max\!\left\{\,k : p_{(k)} \leq \frac{k}{m}\,q\,\right\} $$

for a target FDR level $q$. BH is the workhorse of high-dimensional screening — used routinely in genomics, neuroimaging, and any setting where thousands of features are tested and a small fraction of false positives is acceptable.
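The procedure is a few lines of NumPy. A sketch implementation (the example p-values are made up; at $q = 0.05$ BH keeps the two smallest, while Bonferroni at $\alpha = 0.05$, threshold $0.005$, would keep only the first):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level q."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    sorted_p = p[order]
    # Largest k (1-based) with p_(k) <= (k/m) * q
    below = sorted_p <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k_star = np.max(np.nonzero(below)[0])   # 0-based index of k*
        reject[order[: k_star + 1]] = True      # reject ALL ranks <= k*
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
reject_mask = benjamini_hochberg(pvals, q=0.05)
print(reject_mask)
```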

The garden of forking paths. Multiple-testing corrections only fix the problem you can see. A subtler problem — Gelman and Loken's "garden of forking paths" — is that a single analysis can implicitly test many hypotheses through choices of variable transformations, exclusion criteria, and subgroup comparisons made after looking at the data. Pre-registering analysis plans, sharing code, and reporting all tests performed (not just the significant ones) are the cultural fixes; they matter as much as the algebraic ones.

For ML practitioners the analogues are everywhere: hyperparameter sweeps, dataset pre-processing choices, picking the random seed that worked, comparing against many baselines until one loses. The discipline that has emerged in recent years — train/validation/test splits with the test set touched once, results reported with confidence intervals across seeds, ablations rather than cherry-picked best runs — is essentially the multiple-testing discipline of statistics, transferred to a new field.

Section 09

Linear regression, three ways.

Linear regression is statistics' most successful idea. It is simultaneously a maximum-likelihood estimator, an orthogonal projection in linear algebra, and the right answer to a particular optimisation problem — and seeing all three at once is what makes it usable in modern ML.

The model: an outcome $Y$ is a linear function of predictors $\mathbf{x} = (x_1, \ldots, x_p)^\top$ plus noise,

$$ Y \;=\; \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2). $$

Stack $n$ observations into a vector $\mathbf{y} \in \mathbb{R}^n$ and an $n \times (p+1)$ design matrix $\mathbf{X}$ whose first column is all ones (for the intercept) and whose remaining columns hold the predictors. Then $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ with $\boldsymbol{\beta} \in \mathbb{R}^{p+1}$ unknown.

The OLS estimator. The ordinary-least-squares estimator minimises the sum of squared residuals:

$$ \hat{\boldsymbol{\beta}} \;=\; \operatorname*{arg\,min}_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 \;=\; (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}. $$

The closed-form solution comes from setting the gradient of the squared loss to zero. The matrix $(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top$ is the Moore–Penrose pseudoinverse when $\mathbf{X}$ has full column rank; numerically you should use a QR or SVD factorisation rather than forming and inverting $\mathbf{X}^\top \mathbf{X}$ explicitly (the condition number squares).
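A sketch contrasting the textbook formula with the numerically preferred route (the data are simulated; `numpy.linalg.lstsq` solves the least-squares problem via an SVD rather than the normal equations):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + predictors
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Textbook normal equations (fine here, but squares the condition number)
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Preferred: least squares via an orthogonal factorisation
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_lstsq, 3))
```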

[Figure: scatter of $(x, y)$ points with the fitted least-squares line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$.]

The geometric picture. View $\mathbf{y}$ as a vector in $\mathbb{R}^n$ and the columns of $\mathbf{X}$ as spanning a $(p+1)$-dimensional subspace. Then $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$ is the orthogonal projection of $\mathbf{y}$ onto that subspace, and the residual vector $\hat{\boldsymbol{\varepsilon}} = \mathbf{y} - \hat{\mathbf{y}}$ is orthogonal to every column of $\mathbf{X}$. The "normal equations" $\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}$ are exactly that orthogonality condition. The "$R^2$" reported with every regression is the cosine-squared of the angle between the projected and total signals.

The likelihood picture. Under the assumed Gaussian noise, the log-likelihood is

$$ \ell(\boldsymbol{\beta}, \sigma^2) \;=\; -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2. $$

Maximising over $\boldsymbol{\beta}$ for any fixed $\sigma^2$ minimises $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$ — i.e., gives OLS. So least squares is maximum likelihood under Gaussian noise, and that is why mean-squared error is the natural loss for so many models. The corresponding MLE for the variance is $\hat{\sigma}^2 = \|\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\|^2 / n$; a slightly larger denominator $(n - p - 1)$ gives the unbiased version.

Gauss–Markov. Among all linear unbiased estimators of $\boldsymbol{\beta}$, the OLS estimator has the smallest variance (minimum-variance linear unbiased estimator, "BLUE"). The conditions: $\mathbb{E}[\boldsymbol{\varepsilon}] = \mathbf{0}$, $\operatorname{Cov}(\boldsymbol{\varepsilon}) = \sigma^2 \mathbf{I}$ — homoscedastic and uncorrelated noise. Crucially, the theorem makes no normality assumption. Normality buys you exact small-sample inference (Section 10); Gauss–Markov gives you the optimality of OLS even without it.

The next two sections build on this scaffold: Section 10 turns the fit into inference about which coefficients matter, and Section 11 generalises beyond Gaussian noise to logistic regression, Poisson regression, and the rest of the GLM family.

Section 10

Inference inside a regression.

Once you have fitted a regression, you almost always want to know which coefficients are real, how well the fit explains the data, and what range of outcomes to expect for a new observation. Each question has its own classical machinery.

Under the standard linear model with Gaussian noise,

$$ \hat{\boldsymbol{\beta}} \;\sim\; \mathcal{N}\!\left(\boldsymbol{\beta},\, \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}\right). $$

The diagonal entries of $\sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$ are the variances of the individual coefficient estimates. Replacing $\sigma^2$ by its unbiased estimator $\hat{\sigma}^2 = \|\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\|^2 / (n - p - 1)$ gives standard errors $\operatorname{SE}(\hat{\beta}_j)$, and

$$ T_j \;=\; \frac{\hat{\beta}_j - \beta_j^{\text{(null)}}}{\operatorname{SE}(\hat{\beta}_j)} \;\sim\; t_{n - p - 1}\text{ under the null.} $$

These are the t-statistics and p-values printed next to every regression coefficient in every statistical software output. The null against which they are tested is almost always $\beta_j = 0$ — "this predictor has no marginal effect after adjusting for the others" — and the corresponding p-value answers exactly that question, no more.
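The same arithmetic can be done by hand; the following sketch (simulated data, arbitrary seed, one truly null predictor) rebuilds the standard-error and t-statistic columns of a regression table from the formulas above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])           # intercept + 2 predictors
y = 1.0 + 2.0 * x1 + 0.0 * x2 + rng.normal(size=n)  # x2 truly has no effect

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
dof = n - X.shape[1]                                 # n - p - 1, intercept counted
sigma2_hat = resid @ resid / dof                     # unbiased variance estimate
cov = sigma2_hat * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov))

t_stats = beta_hat / se                              # testing beta_j = 0
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)
```

The p-value for $x_1$ should be vanishingly small, while the one for $x_2$ behaves like a draw from a uniform distribution, exactly as the theory says.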

F-test for nested models. To compare a full model with $p$ predictors against a reduced model with $q < p$ of them, fit both and compute

$$ F \;=\; \frac{(\text{RSS}_{\text{red}} - \text{RSS}_{\text{full}}) / (p - q)}{\text{RSS}_{\text{full}} / (n - p - 1)} \;\sim\; F_{p-q,\, n-p-1} \text{ under the null.} $$

A large $F$ means the additional predictors meaningfully reduce residual variance. The same machinery is what powers ANOVA: each group is a categorical predictor, and the F-test compares the model with all groups to the intercept-only model.
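A hedged sketch of the computation on simulated data, where the last two predictors are null by construction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 120
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
beta = np.array([1.0, 1.5, -0.8, 0.0, 0.0])          # last two predictors are null
y = X_full @ beta + rng.normal(size=n)

def rss(X, y):
    """Residual sum of squares of the OLS fit of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return r @ r

X_red = X_full[:, :3]                                # drop the two null predictors
p_full, p_red = X_full.shape[1] - 1, X_red.shape[1] - 1  # predictors, excl. intercept
df1, df2 = p_full - p_red, n - p_full - 1

F = ((rss(X_red, y) - rss(X_full, y)) / df1) / (rss(X_full, y) / df2)
p_value = stats.f.sf(F, df1, df2)
```

Because the extra predictors are truly null here, the p-value is roughly uniform over repeated simulations; a small p-value would be evidence that they matter.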

$R^2$ and adjusted $R^2$. $R^2 = 1 - \text{RSS}/\text{TSS}$ is the fraction of variance in $\mathbf{y}$ explained by the fit. It never decreases when you add predictors, so it cannot be used to compare models with different numbers of variables. The adjusted $R^2$ penalises model size; AIC and BIC penalise it more. Pick one criterion, report it, and stop expecting any single number to decide model complexity for you.

Confidence vs. prediction intervals. Two distinct intervals are routinely confused. For a new test point $\mathbf{x}_0$:

$$ \text{CI for } \mathbb{E}[Y \mid \mathbf{x}_0]:\quad \mathbf{x}_0^\top \hat{\boldsymbol{\beta}} \;\pm\; t_{n-p-1, 1-\alpha/2}\, \hat{\sigma}\sqrt{\mathbf{x}_0^\top (\mathbf{X}^\top\mathbf{X})^{-1} \mathbf{x}_0}, $$ $$ \text{PI for new } Y_0:\quad \mathbf{x}_0^\top \hat{\boldsymbol{\beta}} \;\pm\; t_{n-p-1, 1-\alpha/2}\, \hat{\sigma}\sqrt{1 + \mathbf{x}_0^\top (\mathbf{X}^\top\mathbf{X})^{-1} \mathbf{x}_0}. $$

The confidence interval bounds the regression line; the prediction interval bounds an individual future observation and is always wider, because a new observation has its own irreducible noise $\sigma^2$ on top of the uncertainty in $\hat{\boldsymbol{\beta}}$.
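Both intervals in code, on a simulated univariate regression (all numbers are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 80
x = rng.uniform(-2, 2, size=n)
X = np.column_stack([np.ones(n), x])
y = 0.5 + 1.2 * x + rng.normal(scale=0.7, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
dof = n - 2
sigma_hat = np.sqrt(resid @ resid / dof)
XtX_inv = np.linalg.inv(X.T @ X)

x0 = np.array([1.0, 1.5])                   # new point, intercept term included
tcrit = stats.t.ppf(0.975, dof)
h = x0 @ XtX_inv @ x0                       # uncertainty in beta-hat at x0
fit = x0 @ beta_hat

ci = (fit - tcrit * sigma_hat * np.sqrt(h),
      fit + tcrit * sigma_hat * np.sqrt(h))
pi = (fit - tcrit * sigma_hat * np.sqrt(1 + h),
      fit + tcrit * sigma_hat * np.sqrt(1 + h))
```

The extra $1$ under the square root is the new observation's own noise, which is why the prediction interval strictly contains the confidence interval.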

Diagnostics. The classical assumptions — linearity, homoscedasticity, normality of residuals, independence — are checked by inspecting residual plots, not just by tests. A scatter of residuals versus fitted values that fans out signals heteroscedasticity. A Q–Q plot of residuals against normal quantiles checks normality. High-leverage points (large diagonal entries of the hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$) are observations whose predictors lie far from the rest; they exert outsized influence on the fit and deserve scrutiny. Cook's distance and the various standardised-residual flavours are formalisations of the same idea.

Section 11

Generalised linear models.

Linear regression assumes Gaussian noise on a continuous outcome. Logistic regression handles binary outcomes, Poisson regression handles counts, and a single template — the GLM — covers them all.

A generalised linear model has three ingredients:

  1. A random component: a distribution for $Y$ from the exponential family — Bernoulli, Binomial, Poisson, Gaussian, Gamma, etc.
  2. A linear predictor: $\eta = \mathbf{x}^\top \boldsymbol{\beta}$.
  3. A link function $g$ relating the mean of $Y$ to the linear predictor: $g(\mathbb{E}[Y]) = \eta$.

Choosing the distribution and link recovers a familiar model. Gaussian + identity link gives ordinary linear regression. Bernoulli + logit link gives logistic regression:

$$ \log \frac{p}{1 - p} \;=\; \mathbf{x}^\top \boldsymbol{\beta}, \qquad p = \mathbb{P}(Y = 1 \mid \mathbf{x}) = \sigma(\mathbf{x}^\top \boldsymbol{\beta}), $$

where $\sigma(z) = 1/(1 + e^{-z})$ is the logistic function. Coefficients in logistic regression are log odds ratios: a one-unit increase in $x_j$ multiplies the odds of $Y = 1$ by $e^{\beta_j}$.

Poisson + log link gives Poisson regression for count data:

$$ \log \mathbb{E}[Y] \;=\; \mathbf{x}^\top \boldsymbol{\beta}, \qquad Y \mid \mathbf{x} \sim \operatorname{Poisson}(e^{\mathbf{x}^\top \boldsymbol{\beta}}). $$

A coefficient is then a log rate ratio: the multiplicative effect on the expected count. The standard correction for varying exposure (length of follow-up, area sampled, time at risk) is to add an offset $\log E_i$ to the linear predictor.

The link function is not optional. Fitting a linear regression to a 0/1 outcome can be done — and is sometimes called the linear probability model — but the predicted probabilities can leave $[0, 1]$, the residual variance is automatically heteroscedastic, and the standard errors are wrong. Pick the right link function and the right family, and the model lives in the right space by construction.

Fitting GLMs. The MLE for a GLM has no closed form (logistic regression notoriously does not), so it is computed iteratively. The classical algorithm is iteratively reweighted least squares (IRLS): at each iteration, linearise the link function and solve a weighted OLS problem; the weights depend on the current fit. IRLS is just Newton–Raphson on the log-likelihood with a particularly clean update; the asymptotic covariance of the resulting MLE is the inverse Fisher information, exactly as in Section 04, and software returns it as the standard error matrix for the coefficients.
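A compact IRLS sketch for logistic regression — a bare Newton loop on simulated data, with no convergence test or step-size control, so a teaching aid rather than a production fitter:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_irls(X, y, n_iter=25):
    """Fit logistic regression by iteratively reweighted least squares (Newton)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        w = p * (1 - p)                        # diagonal of the weight matrix
        # Newton step: solve (X^T W X) delta = X^T (y - p)
        delta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - p))
        beta = beta + delta
    return beta

rng = np.random.default_rng(4)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([-0.5, 1.0])
y = (rng.uniform(size=n) < sigmoid(X @ beta_true)).astype(float)

beta_hat = logistic_irls(X, y)
```

Each iteration is literally a weighted least-squares solve, which is the "reweighted least squares" in the name.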

Inference. Wald tests, likelihood-ratio tests (using deviance differences against $\chi^2_{p-q}$ from Wilks' theorem), and score tests all transfer from linear regression to GLMs with minor changes. The deviance $D = 2[\ell_{\text{sat}} - \ell_{\text{model}}]$ — twice the log-likelihood gap to a "saturated" model that interpolates the data — replaces the residual sum of squares as the basic measure of fit, and behaves like a $\chi^2$ for nested model comparisons.

Modern ML inherits this scaffolding wholesale. A neural network with a sigmoid output trained with binary cross-entropy is a logistic regression with a learned, non-linear feature map. A network with a softmax output trained with categorical cross-entropy is a multinomial logistic regression. The optimisers are different — SGD instead of IRLS — but the loss functions, the link functions, and the asymptotic standard errors all come from this one chapter.

Section 12

Bias–variance, ridge, lasso.

With many predictors and finite data, OLS fits the noise as well as the signal. Shrinking coefficients toward zero trades a little bias for a lot of variance — and is the bridge from classical statistics into modern machine learning.

Decompose the prediction error of an estimator $\hat{f}$ at a fixed input $\mathbf{x}_0$:

$$ \mathbb{E}\!\left[(\hat{f}(\mathbf{x}_0) - f(\mathbf{x}_0))^2\right] \;=\; \underbrace{(\mathbb{E}[\hat{f}(\mathbf{x}_0)] - f(\mathbf{x}_0))^2}_{\text{bias}^2} + \underbrace{\operatorname{Var}(\hat{f}(\mathbf{x}_0))}_{\text{variance}} + \sigma^2. $$

The irreducible noise $\sigma^2$ is what it is. Bias and variance, however, are knobs the modeller controls. Simple models (few predictors, large penalties) have low variance and high bias; complex models (many predictors, no regularisation) have low bias and high variance. The total error is U-shaped in complexity; the goal is to land in the trough.

Ridge regression. Add an $\ell_2$ penalty on the coefficients to the OLS loss:

$$ \hat{\boldsymbol{\beta}}_{\text{ridge}} \;=\; \operatorname*{arg\,min}_{\boldsymbol{\beta}} \;\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|_2^2 \;=\; (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top \mathbf{y}. $$

Two consequences are worth seeing immediately. First, the matrix $\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}$ is invertible for any $\lambda > 0$ even if $\mathbf{X}^\top \mathbf{X}$ is singular — ridge handles collinearity and high-dimensional ($p > n$) regimes that OLS cannot. Second, every coefficient is shrunk toward zero, with shrinkage strongest in the directions of small singular values of $\mathbf{X}$ (this is most easily seen via SVD). The penalty parameter $\lambda$ controls the bias–variance trade.
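A small demonstration of the first point, on a made-up $p > n$ problem where OLS is undefined but ridge is not:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 30, 100                                 # more predictors than observations
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 1.5]               # sparse truth
y = X @ beta_true + rng.normal(scale=0.5, size=n)

lam = 1.0
# X^T X is singular here (rank at most n), but X^T X + lam*I is invertible
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

The normal equations alone would fail on this design; the ridge system solves without complaint for any $\lambda > 0$.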

Lasso. Replace the $\ell_2$ penalty with $\ell_1$:

$$ \hat{\boldsymbol{\beta}}_{\text{lasso}} \;=\; \operatorname*{arg\,min}_{\boldsymbol{\beta}} \;\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|_1. $$

The geometry of the $\ell_1$ ball — a polytope with corners on the coordinate axes — means the optimum frequently lands exactly on those axes, setting some coefficients to exactly zero. Lasso therefore performs variable selection as a side effect of fitting. There is no closed form; the standard solver is coordinate descent or proximal gradient methods, both fast in practice.
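A minimal coordinate-descent lasso, written against the objective $\tfrac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda\|\boldsymbol{\beta}\|_1$ (note the $\tfrac{1}{2}$ — conventions for scaling $\lambda$ vary); the data are simulated with only two non-zero coefficients:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent on 0.5*||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    r = y.copy()                                # current residual y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            r = r + X[:, j] * beta[j]           # remove coordinate j from the fit
            rho = X[:, j] @ r                   # correlation with partial residual
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            r = r - X[:, j] * beta[j]           # add the updated coordinate back
    return beta

rng = np.random.default_rng(6)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_lasso = lasso_cd(X, y, lam=50.0)
```

The soft-threshold step is where the exact zeros come from: any coordinate whose correlation with the residual falls below $\lambda$ is snapped to zero.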

Elastic net. A convex combination of the two penalties,

$$ \lambda \big(\alpha \|\boldsymbol{\beta}\|_1 + (1 - \alpha) \|\boldsymbol{\beta}\|_2^2\big), $$

keeps the variable-selection behaviour of lasso while smoothing the path through correlated predictors that lasso treats erratically. It is the default in many high-dimensional applications and the loss surface most ML libraries default to when "$\ell_1$ + $\ell_2$" is offered.

Penalty as prior. Ridge corresponds to a Gaussian prior on $\boldsymbol{\beta}$ with mean zero and variance $\sigma^2/\lambda$; lasso corresponds to a Laplace prior. The "regularised MLE" is just the MAP estimate under that prior. This identification is the formal bridge between regularisation and Bayesian thinking, and it is the lens Chapter 07 develops in detail.

Choosing $\lambda$. Cross-validation. Fit the model for a grid of $\lambda$ values; pick the one with smallest out-of-sample error. The whole pipeline — split, fit, score, repeat — is the empirical core of modern ML and is revisited in Section 13.

Section 13

Bootstrap, jackknife, cross-validation.

Resampling is the modern shortcut around the classical reliance on asymptotics: simulate the sampling distribution by drawing from the data itself.

The bootstrap. The non-parametric bootstrap replaces the unknown population distribution $F$ with its empirical analogue $\hat{F}_n$ — the discrete distribution that puts mass $1/n$ on each observed datum. Sampling from $\hat{F}_n$ means drawing $n$ observations from the original sample with replacement. Repeat $B$ times to get bootstrap replicates $\hat{\theta}^{*1}, \ldots, \hat{\theta}^{*B}$ of any statistic $\hat{\theta}$.

The empirical distribution of the replicates approximates the sampling distribution of $\hat{\theta}$. Standard errors:

$$ \widehat{\operatorname{SE}}_{\text{boot}}(\hat{\theta}) \;=\; \sqrt{\frac{1}{B-1}\sum_{b=1}^B (\hat{\theta}^{*b} - \bar{\hat{\theta}^*})^2}. $$

Confidence intervals come in several flavours: the simple percentile interval takes the $\alpha/2$ and $1-\alpha/2$ quantiles of the bootstrap replicates; the basic bootstrap reflects them around $\hat{\theta}$; the BCa (bias-corrected accelerated) interval adjusts for skewness and bias and is the recommended default for modern work. All three are implemented one library call away in R, Python, and Julia.
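The percentile interval is a few lines of NumPy; the sketch below bootstraps the median of a skewed sample (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.exponential(scale=2.0, size=200)     # skewed data

B = 2000
# Bootstrap replicates: resample n observations with replacement, recompute the statistic
boot = np.array([np.median(rng.choice(data, size=data.size, replace=True))
                 for _ in range(B)])

se_boot = boot.std(ddof=1)                      # bootstrap standard error
ci_percentile = np.percentile(boot, [2.5, 97.5])  # simple percentile interval
```

Swapping `np.median` for any other statistic — a trimmed mean, a correlation, a regression coefficient — requires changing one line, which is the whole appeal.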

Why it works. The bootstrap is justified by a single deep observation: $\hat{F}_n \to F$ uniformly (Glivenko–Cantelli), so for "smooth enough" statistics, the sampling distribution under $\hat{F}_n$ converges to the sampling distribution under $F$. The technical conditions (Hadamard differentiability of the functional $\hat{\theta}$ in $F$) hold for almost everything practitioners care about; the famous failures (sample maximum, exact estimators on the boundary) are well documented and have specialised remedies.

Parametric vs. non-parametric bootstrap. The non-parametric bootstrap resamples from the data. The parametric bootstrap fits a model $p_{\hat{\theta}}$ to the data and then simulates new datasets from $p_{\hat{\theta}}$. The parametric version is more efficient when the model is approximately right and necessary in some structured settings (e.g. dependent data); the non-parametric version is more robust when the model is misspecified.

The jackknife. An older resampling idea: leave out one observation at a time, recompute $\hat{\theta}_{(-i)}$, and use the variability across the $n$ leave-one-out estimates to estimate $\operatorname{Var}(\hat{\theta})$. The jackknife is computationally cheaper than the bootstrap and conceptually clean; for smooth statistics it gives nearly identical answers and is built into most regression diagnostics (DFFITS, Cook's distance, leave-one-out cross-validation).

Cross-validation. The natural extension to predictive performance. Partition the data into $K$ folds, train on $K-1$, evaluate on the held-out fold, average. The result, $\text{CV}_K = \frac{1}{n}\sum_i L(y_i, \hat{f}_{(-k(i))}(\mathbf{x}_i))$, is a nearly unbiased estimate of the model's expected test error. With $K = n$ this is leave-one-out CV — high variance but minimal bias; with $K = 5$ or 10 a sweet spot for most applications. CV is how every hyperparameter in modern ML — regularisation strengths, network depths, learning rates — gets chosen, and is the practical engine behind the bias–variance trade of Section 12.
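A from-scratch $K$-fold loop choosing the ridge penalty on simulated data (grid and seed are illustrative):

```python
import numpy as np

def kfold_mse(X, y, lam, K=5, seed=0):
    """Average held-out MSE of ridge regression across K folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        Xtr, ytr, Xte, yte = X[train], y[train], X[test], y[test]
        beta = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(X.shape[1]), Xtr.T @ ytr)
        errs.append(np.mean((yte - Xte @ beta) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(8)
n, p = 200, 10
X = rng.normal(size=(n, p))
y = X[:, 0] * 2.0 + rng.normal(size=n)          # one strong predictor

cv_scores = {lam: kfold_mse(X, y, lam) for lam in [0.01, 1.0, 100.0, 10000.0]}
best_lam = min(cv_scores, key=cv_scores.get)
```

With a strong signal and ample data, heavy shrinkage hurts: the very large $\lambda$ should score clearly worse than the small ones.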

Section 14

Bayesian and frequentist inference, side by side.

Two interpretations of probability lead to two paradigms of inference. They produce identical answers in many common cases, opposite ones in others, and a complete account of statistics is incomplete without both.

The frequentist paradigm of this chapter so far treats $\theta$ as a fixed unknown and randomness as coming from sampling. Probability statements are about procedures: "in 95% of repeated experiments, the interval would contain $\theta$." The Bayesian paradigm treats $\theta$ itself as a random variable with a prior distribution $\pi(\theta)$ encoding beliefs before seeing data. After seeing data $X$, beliefs update via Bayes' theorem to the posterior:

$$ \pi(\theta \mid X) \;=\; \frac{p(X \mid \theta)\, \pi(\theta)}{\int p(X \mid \theta')\, \pi(\theta')\, d\theta'} \;\propto\; L(\theta)\, \pi(\theta). $$

The posterior contains everything the data and prior together imply about $\theta$. From it you read off point estimates (the posterior mean, median, or mode), credible intervals (regions containing a stated fraction of the posterior mass), and predictive distributions for future observations.

Where they agree. With a flat prior and a regular likelihood, the posterior mode equals the MLE and the posterior covariance approaches the inverse Fisher information — so the Bayesian credible interval and the frequentist Wald interval coincide asymptotically. For most practical purposes, MLE-with-Wald-intervals and Bayesian-with-flat-prior return the same numbers; a working data scientist switches between the two phrasings without thinking about it.

Where they disagree. When data are scarce, when the prior is informative, or when the parameter is on the boundary of its space, the two paradigms diverge meaningfully. A Bayesian credible interval for a probability parameter shrinks toward the prior mean (often 0.5) and stays inside $[0, 1]$; a frequentist Wald interval for the same parameter can extend beyond $[0, 1]$ and only catches up asymptotically. Bayesian methods also handle hierarchical structure ("partial pooling" across groups) more naturally and quantify uncertainty in derived quantities through straightforward posterior simulation.
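The boundary problem in miniature — one success in ten trials; the Wald interval dips below zero while the flat-prior credible interval stays inside $[0, 1]$:

```python
import numpy as np
from scipy import stats

k, n = 1, 10                                   # 1 success in 10 trials
p_hat = k / n

# Frequentist Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
z = stats.norm.ppf(0.975)
half = z * np.sqrt(p_hat * (1 - p_hat) / n)
wald = (p_hat - half, p_hat + half)

# Bayesian credible interval under a flat Beta(1, 1) prior:
# the posterior is Beta(k + 1, n - k + 1), supported on [0, 1] by construction
cred = stats.beta.ppf([0.025, 0.975], k + 1, n - k + 1)
```

Both are 95% intervals for the same parameter, yet only one of them respects the parameter space at this sample size.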

Penalised MLE = MAP. Ridge regression's $\ell_2$ penalty is equivalent to a Gaussian prior on $\boldsymbol{\beta}$; lasso's $\ell_1$ penalty is equivalent to a Laplace prior. The penalised solution is exactly the posterior mode (MAP estimate) under that prior. Every regularised loss in machine learning corresponds to some Bayesian prior; making the prior explicit is sometimes informative and almost always sobering.

Chapter 07 develops the Bayesian machinery in full — conjugate priors that yield closed-form posteriors, hierarchical models, MCMC and variational inference for everything else. For now the takeaway is that the frequentist methods of this chapter and the Bayesian methods of the next one are not in opposition: they are two complementary languages for the same act of statistical inference, and fluency in both is part of being literate in modern data analysis.

Section 15

Correlation, causation, and the gap between them.

Statistical models describe associations. Causal claims require either a randomised experiment or strong assumptions about the data-generating process — and an entire framework, distinct from estimation, for reasoning about both.

"Correlation does not imply causation" is the slogan; the modern framework that makes it rigorous goes by the name of potential outcomes (or, in Pearl's formulation, the do-calculus). Let $T \in \{0, 1\}$ denote a treatment and $Y$ an outcome. For each subject define two counterfactual outcomes: $Y(1)$, the outcome they would have had under treatment, and $Y(0)$, the outcome they would have had under control. The fundamental causal estimand is the average treatment effect

$$ \tau \;=\; \mathbb{E}[Y(1) - Y(0)]. $$

The fundamental problem of causal inference is that we observe only one of the two potential outcomes for each subject. The other is missing, and no amount of statistical machinery alone can recover it.

The randomised controlled trial. Random assignment of $T$ severs any dependence between treatment and the potential outcomes: $T \perp \{Y(0), Y(1)\}$. Under this independence,

$$ \tau \;=\; \mathbb{E}[Y \mid T=1] - \mathbb{E}[Y \mid T=0], $$

which is exactly the difference in observed sample means. The RCT is the gold standard not because randomisation is magic but because randomisation makes the missing-data problem solvable. Drug trials, A/B tests, and field experiments in economics all rest on this single identification result.
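A simulation of the identification result, with the potential outcomes invented for illustration: randomisation makes the difference in observed means estimate $\tau$:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 100_000
# Both potential outcomes for every subject; treatment adds 1.0 on average
y0 = rng.normal(loc=0.0, size=n)
y1 = y0 + 1.0 + rng.normal(scale=0.5, size=n)

# Randomised assignment is independent of (y0, y1) by construction
t = rng.integers(0, 2, size=n)
y_obs = np.where(t == 1, y1, y0)       # only one potential outcome is ever observed

ate_hat = y_obs[t == 1].mean() - y_obs[t == 0].mean()
```

The simulator sees both counterfactuals, but the estimator uses only what a real trial would observe — and still recovers the true effect of 1.0.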

Observational data. When you cannot randomise — most settings outside controlled experiments — treatment and potential outcomes are not independent. Subjects who choose treatment differ from those who do not in ways that also affect $Y$ (this is confounding). Several strategies address it: adjusting for observed confounders (stratification, regression adjustment, matching, propensity scores); exploiting quasi-random variation (instrumental variables, regression discontinuity, difference-in-differences); and sensitivity analysis for the confounding you cannot measure.

Pearl's hierarchy. Judea Pearl distinguishes three rungs of causal reasoning: association ("how does $Y$ vary with $X$?"), intervention ("how would $Y$ change if I set $X = x$?"), and counterfactual ("would this patient have recovered if I had given them the drug?"). Statistics as taught in most undergraduate curricula lives entirely on rung one. Modern causal inference — and any honest application of ML to decisions — lives on rungs two and three, and requires the additional language of directed acyclic graphs and structural equations to navigate them.

The relevance to ML is enormous. Predictive models trained on observational data answer rung-one questions ("what is $\mathbb{E}[Y \mid X]$?") perfectly well. They will fail spectacularly if deployed to answer rung-two questions ("what happens if we change $X$?") because they cannot tell correlation from causation by themselves. Recommender systems that recommend books based on past sales conflate "people who already wanted this book bought it" with "showing this book to someone causes them to want it" — and an entire subfield (off-policy evaluation, causal inference for policy learning) exists to disentangle them.

Section 16

Designing experiments.

An experiment that asks the wrong question, allocates units inefficiently, or fails to randomise will not be saved by any analysis. Design is the part of statistics that happens before you collect a single data point.

The classical principles, due to Fisher, are three: randomisation (assign treatments by chance, so confounders balance in expectation), replication (enough independent units to estimate error), and blocking (group similar units and randomise within blocks, removing known sources of variation from the comparison).

Power and sample size. Before running an experiment, decide how large it needs to be. The power of a test is its probability of detecting a true effect of a given size. For a two-sample t-test detecting a mean difference of $\delta$ with significance level $\alpha$ and power $1 - \beta$, the required per-group sample size is approximately

$$ n \;\approx\; \frac{2\sigma^2 (z_{1-\alpha/2} + z_{1-\beta})^2}{\delta^2}. $$

Halving the detectable effect size quadruples the required sample. Underpowered experiments waste resources and produce a body of literature dominated by inflated effect sizes (the winner's curse: only large estimates survive significance filtering, so published estimates are systematically too large). Power calculations are not optional.
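The formula in code, using the normal approximation (so slightly optimistic relative to an exact t-based calculation):

```python
import numpy as np
from scipy import stats

def sample_size_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sample test of mean difference delta."""
    za = stats.norm.ppf(1 - alpha / 2)          # significance threshold
    zb = stats.norm.ppf(power)                  # power requirement
    return int(np.ceil(2 * sigma**2 * (za + zb)**2 / delta**2))

n1 = sample_size_per_group(delta=0.5, sigma=1.0)
n2 = sample_size_per_group(delta=0.25, sigma=1.0)   # half the effect size
```

Comparing `n1` and `n2` shows the quadrupling directly: halving $\delta$ multiplies the required per-group size by four (up to rounding).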

Factorial designs. When studying multiple factors, do not vary them one at a time. A $2^k$ factorial design tests all $2^k$ combinations of $k$ binary factors and lets you estimate main effects and interactions from the same data. With limited resources, a fractional factorial sacrifices some interaction estimates to fit more factors in the same number of runs — the foundation of design of experiments (DoE) in industrial settings.

A/B testing as designed experiment. The internet's favourite statistical method is exactly a two-sample comparison from a randomised experiment. The complications come from scale (millions of users), continuous monitoring (which inflates type I error if you peek at p-values mid-experiment — see sequential testing and always-valid p-values), heterogeneous treatment effects (interaction with user characteristics), and network effects (one user's treatment affecting another's outcome, violating SUTVA, the stable unit treatment value assumption). All of these are extensions of the same Fisherian core, not departures from it.

Bandits and adaptive design. When you can adapt the assignment probabilities as data accumulate, you trade some statistical efficiency for faster identification of the best treatment. Multi-armed bandit algorithms — UCB, Thompson sampling, $\varepsilon$-greedy — are exactly this: experimental designs where the next assignment depends on previous outcomes. They are the bridge from classical experimental design to reinforcement learning, where every action is also a measurement.
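A Thompson-sampling sketch with Beta–Bernoulli arms (the click-through rates are invented):

```python
import numpy as np

rng = np.random.default_rng(10)
true_rates = np.array([0.05, 0.06, 0.10])      # arm 2 is best, unknown to the algorithm
n_rounds = 5000

wins = np.ones(3)                               # Beta(1, 1) prior on each arm's rate
losses = np.ones(3)
pulls = np.zeros(3, dtype=int)

for _ in range(n_rounds):
    # Thompson sampling: draw one rate from each posterior, play the largest draw
    samples = rng.beta(wins, losses)
    arm = int(np.argmax(samples))
    reward = rng.uniform() < true_rates[arm]    # Bernoulli outcome
    wins[arm] += reward
    losses[arm] += 1 - reward
    pulls[arm] += 1
```

Allocation concentrates on the best arm as its posterior separates from the others — the design adapts to its own data, which is exactly what a fixed randomised experiment forbids.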

Section 17

When observations are not independent.

Most of the chapter assumes i.i.d. data. Real-world observations are often correlated through time, space, or social networks — and pretending otherwise quietly inflates apparent precision until conclusions evaporate.

A time series $\{X_t\}_{t \in \mathbb{Z}}$ is a sequence of observations indexed by time. The fundamental quantity is the autocovariance

$$ \gamma(h) \;=\; \operatorname{Cov}(X_t, X_{t+h}). $$

For most useful theory we assume stationarity: the joint distribution of $(X_{t_1}, \ldots, X_{t_k})$ depends only on the spacing of the indices, not on $t$ itself. Under stationarity, $\gamma(h)$ is well-defined and the autocorrelation function $\rho(h) = \gamma(h)/\gamma(0)$ measures linear dependence at lag $h$.

The classical model families. Three workhorses cover most stationary series: autoregressive models (AR), which regress $X_t$ on its own recent past; moving-average models (MA), which build $X_t$ from a finite window of past noise terms; and their combination ARMA — with differencing to remove trends, ARIMA.

Effective sample size. The price of dependence is paid in how much information each new observation contributes. For positively correlated data, the variance of the sample mean is

$$ \operatorname{Var}(\bar{X}_n) \;\approx\; \frac{\sigma^2}{n}\left(1 + 2\sum_{h=1}^{\infty}\rho(h)\right) \;=\; \frac{\sigma^2}{n_{\text{eff}}}, $$

so the effective sample size $n_{\text{eff}} = n / (1 + 2\sum_h \rho(h))$ can be far smaller than $n$. Standard errors computed assuming independence are then too small, and confidence intervals too narrow. Newey–West and other heteroscedasticity-and-autocorrelation-consistent (HAC) standard errors correct for this in regression on time-series data; cluster-robust standard errors do the analogous job for grouped data.
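For an AR(1) process, $\rho(h) = \phi^h$, so the inflation factor has a closed form; a short sketch with $\phi = 0.8$ (the series itself is simulated only to have something concrete):

```python
import numpy as np

rng = np.random.default_rng(11)
n, phi = 20_000, 0.8                          # AR(1): X_t = phi * X_{t-1} + eps_t
x = np.zeros(n)
eps = rng.normal(size=n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + eps[t]

# rho(h) = phi^h, so 1 + 2 * sum_h rho(h) = (1 + phi) / (1 - phi)
inflation = (1 + phi) / (1 - phi)             # = 9 for phi = 0.8
n_eff = n / inflation
```

Twenty thousand observations of this series carry only about as much information about the mean as ~2,200 independent ones — a ninefold haircut on apparent precision.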

Dependence in ML. Cross-validation assumes the test fold is independent of the training fold. For time series this fails — naïvely shuffling violates the temporal order and leaks future information into the past. Time-series CV (expanding or rolling windows) is the right replacement. For grouped data (multiple observations per user), group K-fold keeps all of one user's observations together. For graph-structured data (neighbouring nodes share information), keeping connected components together becomes essential. The general lesson: every statistical procedure that promises independence has to be re-examined when the data have structure.

Modern time-series ML — recurrent networks, temporal convolutions, transformers with positional encodings, state-space models — replaces the linear AR/MA structure with arbitrary non-linear function classes. The statistical principles (stationarity, effective sample size, sequential evaluation) remain. Most published headline results on forecasting that compare deep models to ARIMA do so on truly stationary, well-behaved series; the picture is murkier on the messy real-world data where Box and Jenkins still hold up surprisingly well.

Section 18

Where statistics shows up in modern ML.

Almost every loss, every diagnostic, every honest claim of model quality in modern machine learning has a statistical ancestor in this chapter. Naming the connections makes the field smaller and the vocabulary much more efficient.

A working list of correspondences worth memorising:

  1. Minimising cross-entropy is maximising likelihood — the losses of deep learning are the MLE of Section 04 in modern clothing.
  2. Sigmoid and softmax output layers are GLM link functions (Section 11); a network head is a logistic or multinomial regression on learned features.
  3. $\ell_2$ and $\ell_1$ regularisation are Gaussian and Laplace priors; the regularised fit is a MAP estimate (Section 12).
  4. Train/test splits and cross-validation are out-of-sample estimates of generalisation error (Section 13).
  5. Bagging is bootstrap aggregation — Section 13's resampling applied to predictions.
  6. A/B tests are randomised controlled trials, inheriting every design caveat of Section 16.

The pattern is consistent. Almost every "advance" in ML is a familiar statistical idea applied at a scale or with a function class that earlier statisticians could not access. Knowing the older language does not make you slower; it makes you faster, because you stop reinventing things and start composing them.

Keep these connections in mind as you read the rest of the compendium. The remaining chapters of Part I — information theory, Bayesian reasoning, signal processing — extend this scaffolding rather than replace it; Part II then uses the whole structure to build modern ML from the ground up.

Further reading

Where to go next

Statistics is a craft as much as a body of theory; the books below mix mathematical depth with applied judgment. Pick one introductory text, one regression-and-design text, and one modern-methods text in parallel — that combination covers the working data scientist's day-to-day.

Textbooks — first pass

Textbooks — regression and modeling

Textbooks — causal inference and design

Free video courses

References and applied resources

This is the fifth chapter of an ongoing compendium. The mathematical foundations continue with information theory (entropy, KL, mutual information), Bayesian reasoning (priors, posteriors, conjugacy, hierarchical models), and signal processing (Fourier, convolution, sampling). Part II then builds machine learning from the foundation laid in these eight chapters.