Part IV · Classical Machine Learning · Chapter 01

Regression: the oldest and still the most instructive place to start learning from data.

A regression model tries to predict a continuous outcome — a price, a temperature, a time, a concentration, a click-through rate — from a set of input features. It is one of the oldest ideas in statistics (Legendre published the least-squares method in 1805, and Gauss promptly claimed priority) and one of the most heavily used techniques in modern practice, usually either as the production model itself or as the baseline every more complicated model has to beat. The point of this chapter is not to memorise formulas; it is to understand linear regression deeply enough that every later technique in classical and deep learning — regularisation, optimisation, generalised linear models, kernel methods, neural networks — can be read as an extension, variation, or replacement of the ideas introduced here. Chapter 05 of Part I touched on regression from the inferential side (confidence intervals, hypothesis tests about slopes); this chapter picks it up from the machine-learning side, where the target is predictive accuracy on new data rather than a p-value on a specific coefficient.

How to read this chapter

The first section motivates the problem and frames regression as a learning problem rather than an inferential one. Section two introduces the linear model and the design matrix — the notation every subsequent section uses. Section three derives ordinary least squares (OLS) in closed form from the normal equations; section four covers the same problem from the optimisation side — gradient descent and stochastic gradient descent, the workhorse algorithms that scale to datasets the closed form cannot. Sections five and six are the diagnostic pair: the classical assumptions of linear regression (linearity, independence, homoscedasticity, normality) and the residual-plot-and-leverage toolkit that tells you when they have broken. Sections seven and eight are the conceptual core of the chapter: polynomial regression and basis expansion are how a linear model flexes into non-linear relationships, and the bias-variance tradeoff is the single most important idea in all of machine learning — the reason regularisation, cross-validation, and every later technique exist. Sections nine through eleven are the regularisation family: ridge (L2), lasso (L1), and elastic net, with the geometric and probabilistic intuitions that explain why each one behaves the way it does. Section twelve is feature scaling — the unglamorous but essential preprocessing step without which regularisation does the wrong thing. Section thirteen opens the generalised-linear-model (GLM) framework: the realisation that the same machinery extends cleanly to Poisson counts, gamma-distributed positives, binomial outcomes, and (via the logistic link) all the way to the classification models that Chapter 02 will build on. Section fourteen covers robust regression techniques — Huber, RANSAC, quantile — for when outliers are a structural part of the data rather than a bug. 
Section fifteen handles categorical features and interaction terms, the two ingredients that let a linear model express surprisingly rich relationships. Section sixteen is evaluation: RMSE, MAE, R², MAPE, residual diagnostics, and the cross-validation protocols that give those metrics a chance of predicting new-data performance. Section seventeen closes with where regression compounds in ML: as a baseline that every more complex model has to justify itself against, as the linear layer that dominates the end of every deep network, and as the ingredient that quietly shows up inside nearly every other technique in the compendium.

A note on what this chapter is not. It is not a re-derivation of the statistical-inference theory around linear regression — hypothesis tests about coefficients, confidence intervals on predictions, F-statistics, the Gauss-Markov theorem — which Chapter 05 of Part I treats on its own terms. The ML perspective and the statistical-inference perspective on linear regression are the same mathematics read through different priorities: the statistician wants to characterise the parameter estimates given the data, and the ML practitioner wants to characterise the predictions on new data given the parameter estimates. The chapter notes the overlap explicitly where it helps and otherwise stays on the ML side. A second note: every section in this chapter has a one-line summary somewhere in the body that answers the question "what would you actually do in code". Regression is a topic where the conceptual weight is high and the implementation weight is low — scikit-learn will do most of what this chapter describes in one or two lines — so the chapter deliberately keeps the prose aimed at why rather than how.

Contents

  1. Why regression is the right starting place · Predicting a continuous target
  2. The linear model and the design matrix · Notation the rest of the chapter uses
  3. Ordinary least squares · The normal equations, closed form, geometry
  4. Optimisation: gradient descent and SGD · When the closed form is not an option
  5. The assumptions of linear regression · Linearity, independence, homoscedasticity, normality
  6. Residual diagnostics · Plots, leverage, influence, outliers
  7. Polynomial regression and basis expansion · Non-linearity from a linear model
  8. The bias-variance tradeoff · The deepest idea in supervised learning
  9. Ridge regression · L2 regularisation, shrinkage
  10. The lasso · L1 regularisation, sparsity
  11. Elastic net and regularisation paths · The L1 + L2 compromise
  12. Feature scaling and preprocessing · Why standardisation matters for regularisation
  13. Generalised linear models · Poisson, gamma, binomial via link functions
  14. Robust regression · Huber, RANSAC, quantile
  15. Categorical features and interactions · One-hot, interaction terms, splines
  16. Evaluation and cross-validation · RMSE, MAE, R², MAPE, K-fold
  17. Where it compounds in ML · Baselines, linear probes, final layers

Section 01

Regression is where every machine-learning practitioner should start, because almost everything else is a variation on its ideas

A regression model maps a vector of inputs to a real-valued output. That is a deliberately narrow definition, but the narrowness is the point: it is the simplest problem in supervised learning, and the ideas invented for it — least squares, overfitting, regularisation, cross-validation, link functions — turn out to be the same ideas that structure every later technique. Starting here means that classification, ensembles, neural networks, and even transformers can later be read as refinements of a framework the reader already understands, rather than as separate topics.

The prediction problem, stated cleanly

Given a training set of pairs (xᵢ, yᵢ), where each xᵢ is a feature vector of length p and each yᵢ is a real number, the regression problem is to learn a function f such that f(x) predicts y accurately on new inputs drawn from the same distribution. Everything else in the chapter is a technical specification of what "accurately" means and of what hypothesis class f is drawn from. The split between training data (used to fit the model) and test data (used only to evaluate it) is the core honesty ritual of machine learning: a model's performance on the data it was trained on is nearly always optimistic, sometimes wildly so.
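
A minimal sketch of that honesty ritual, on synthetic NumPy data (all numbers here are illustrative): fit OLS on a training split and compare in-sample error to held-out error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on 5 of 30 features, plus noise.
n, p = 60, 30
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = [2.0, -1.0, 0.5, 1.5, -2.0]
y = X @ beta_true + rng.normal(scale=1.0, size=n)

# The honesty ritual: hold out rows the model never sees during fitting.
idx = rng.permutation(n)
train, test = idx[:40], idx[40:]

beta_hat, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

mse = lambda a, b: float(np.mean((a - b) ** 2))
train_mse = mse(X[train] @ beta_hat, y[train])
test_mse = mse(X[test] @ beta_hat, y[test])

# With 30 free parameters and only 40 training rows, in-sample error
# is far smaller than held-out error -- the optimism the text warns about.
print(train_mse, test_mse)
```

The gap between the two numbers is the whole argument for the train/test split.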

Inference vs prediction

Chapter 05 of Part I covered linear regression from the inferential angle — how confident can we be that this coefficient is non-zero, what is the standard error on this slope, does the F-test reject the null. Here the emphasis is predictive — how close will the model's outputs be to reality on data it has never seen. The two perspectives share almost all of their mathematics and disagree on almost all of their priorities. The inference-first practitioner is happy with an in-sample R² and a confidence interval; the prediction-first practitioner wants a held-out test-set error and does not much care whether any single coefficient is "significant". This chapter takes the second stance throughout.

Why linear models have not gone away

A naive reading of the last decade of deep learning might suggest that linear models have been obsoleted; the reality is the opposite. Linear models remain the default baseline in almost every serious ML engineering team, for four concrete reasons: they fit in milliseconds on datasets where gradient-boosted trees take hours and neural networks take days; their coefficients are directly interpretable, which matters for regulatory and business reasons; they extrapolate predictably outside the training distribution, where more flexible models often go haywire; and on tabular data with strong linear structure, they are often competitive with anything fancier. A team that skips this chapter because "everyone uses deep learning now" will pay for it later by not knowing when to reach for the simpler tool.

What counts as regression

The word "regression" in this chapter covers any model whose output is a continuous scalar. That includes linear regression (with and without regularisation), polynomial regression, the regression arm of generalised linear models (Poisson, gamma, quasi-Poisson), kernel ridge regression, Gaussian-process regression, regression trees, regression forests and gradient-boosted regressors, support-vector regression, and neural-network regressors with a single real-valued output head. Chapters 02 through 09 will handle most of those. The framing of this chapter — loss, optimisation, regularisation, evaluation, the bias-variance tradeoff — applies to all of them, which is exactly why it is worth being the first chapter of Part IV.

Regression is the language, not just a technique

By the end of this chapter, the words loss function, regularisation, bias-variance tradeoff, cross-validation, feature scaling, and held-out set should feel like part of a single coherent vocabulary. That vocabulary is the one the rest of Part IV will speak; Parts V through VIII will refine it, but they will not replace it.

Section 02

The linear model is a matrix product and almost every later technique inherits its notation

The linear regression model predicts the output as a weighted sum of the inputs plus an intercept: ŷ = β₀ + β₁x₁ + ⋯ + βₚxₚ. Written in matrix form — which is the form every subsequent section will use — this becomes ŷ = Xβ, where X is the design matrix and β is the coefficient vector. That compact notation is the reason linear algebra (Chapter 01 of Part I) was placed first in the compendium.

The design matrix

For n training examples and p features, the design matrix X is an n × (p+1) matrix whose rows are the feature vectors and whose first column is all-ones (the intercept). The coefficient vector β has length p+1. The vector of predictions ŷ and the vector of observed outputs y each have length n. The residual vector is r = y − Xβ. This notation is surprisingly expressive: almost every model in this chapter can be written as some modification of "pick β to minimise some function of r plus some function of β".
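
The construction is a few lines of NumPy (the feature values and coefficients below are made up for illustration):

```python
import numpy as np

# Three training examples, two raw features (p = 2).
X_raw = np.array([[1800.0, 3.0],
                  [2400.0, 4.0],
                  [1200.0, 2.0]])
y = np.array([420_000.0, 560_000.0, 300_000.0])

# Design matrix: prepend an all-ones intercept column -> shape (n, p + 1).
X = np.column_stack([np.ones(len(X_raw)), X_raw])
assert X.shape == (3, 3)

beta = np.array([50_000.0, 150.0, 25_000.0])  # (intercept, per-sqft, per-bed)
y_hat = X @ beta                              # predictions ŷ = Xβ
r = y - y_hat                                 # residual vector r = y − Xβ
print(r)                                      # → [25000. 50000. 20000.]
```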

Why write it this way

Matrix notation is compact, but more importantly it makes every subsequent extension legible. Polynomial regression (Section 07) becomes "use the same algebra but replace X with a basis-expanded Φ". Ridge regression (Section 09) becomes "the same algebra plus a λ on the diagonal". Generalised linear models (Section 13) become "the same design matrix, but the link function g wraps the prediction: g(ŷ) = Xβ". Reading every extension as "the same framework plus one change" is what makes the whole chapter memorable instead of a list of unrelated recipes.

The loss function as a design choice

The standard loss for regression is squared error: L(β) = ‖y − Xβ‖², the sum of squared residuals. It is the loss that gives OLS its name and most of its properties. But it is a choice, not a mathematical necessity, and other choices produce other models: L1 loss (absolute error) gives quantile regression; Huber loss gives robust regression; log-likelihood losses give the GLM family. The loss function encodes the practitioner's cost model for mistakes — what it costs to over-predict versus under-predict, how rapidly the cost grows with the size of the error. Section 14 returns to this point; for now the point is only that squared error is the default, not the default's justification.
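
The three losses can be sketched directly; note how differently each one prices a single large residual (delta = 1.0 is an arbitrary illustrative choice):

```python
import numpy as np

def squared(r):             # OLS: cost grows quadratically with the residual
    return r ** 2

def absolute(r):            # median (quantile) regression loss
    return np.abs(r)

def huber(r, delta=1.0):    # quadratic near zero, linear in the tails
    small = np.abs(r) <= delta
    return np.where(small, 0.5 * r ** 2, delta * (np.abs(r) - 0.5 * delta))

residuals = np.array([-3.0, -0.5, 0.0, 0.5, 10.0])
# The outlier at r = 10 contributes 100 to squared loss but only
# 10 to absolute loss and 9.5 to Huber loss.
for loss in (squared, absolute, huber):
    print(loss.__name__, loss(residuals).sum())
```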

A concrete example

A housing dataset has rows like (sqft=1800, beds=3, baths=2, age=15, price=420000). The design matrix has a column for each of sqft, beds, baths, age, plus an intercept column of 1s; the target is price. A fitted linear regression produces a coefficient vector like β = (intercept=80000, 180, −5000, 12000, −1500), which says that the baseline price is $80k, each square foot adds $180, each additional bedroom, holding the other features fixed, subtracts $5k, and so on. The hedge "holding the other features fixed" is important — a linear coefficient always has that conditional interpretation, and missing it is the single most common error in reading regression output.

The whole chapter in one equation

Every technique in this chapter will end up being some variant of: minimise ‖y − Xβ‖² + λ · complexity(β). Ridge sets complexity to ‖β‖²; lasso sets it to ‖β‖₁; OLS sets λ = 0; GLMs replace the squared-error term with a log-likelihood. Holding that single template in mind is enough to get through Part IV.

Section 03

Ordinary least squares has a closed-form solution and the geometry of that solution is worth understanding

Ordinary least squares (OLS) is the regression model you get by minimising the sum of squared residuals — the classical starting point, with a closed-form solution dating to Legendre and Gauss. The formula is short enough to memorise, but the geometric and probabilistic interpretations behind it are what make it useful; the formula without the interpretations becomes a recipe that breaks whenever the data does not cooperate.

The normal equations

Setting the gradient of the squared-error loss to zero gives the normal equations: XᵀXβ = Xᵀy. When XᵀX is invertible (which holds when X has full column rank), the solution is β̂ = (XᵀX)⁻¹Xᵀy. That is the closed form; in practice numerical libraries never actually compute the inverse. They solve the system using a QR decomposition of X (numerically stable, O(np²) time) or an SVD (stable even when X is rank-deficient, slower). On modern hardware both are fast enough that "use the closed form" is the default for any dataset small enough to hold in memory.
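
Both routes can be checked against each other in a few lines of NumPy (synthetic data; the explicit solve of the normal equations is shown for exposition, not as a recommendation):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -3.0, 0.5, 0.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Textbook route: solve the normal equations XᵀXβ = Xᵀy directly.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Library route: SVD-based least squares, stable even when XᵀX is ill-conditioned.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Same answer on a well-conditioned problem.
assert np.allclose(beta_normal, beta_lstsq)

# The residual is orthogonal to every column of X: Xᵀr = 0.
r = y - X @ beta_lstsq
assert np.allclose(X.T @ r, 0, atol=1e-6)

print(np.round(beta_lstsq, 2))
```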

The geometric picture

Read y as a vector in n-dimensional space. The columns of X span a (p+1)-dimensional subspace of that space, called the column space. The prediction vector ŷ = Xβ̂ is the orthogonal projection of y onto that subspace — the closest point in the column space to the observed vector. The residual r = y − ŷ is the component of y orthogonal to the column space, which is precisely why it is orthogonal to every column of X: Xᵀr = 0. The Linear Algebra chapter (Part I Chapter 01) is the reference for this picture; once it is in the reader's head, OLS stops feeling like an algebraic trick and starts feeling like a theorem about projections.

The probabilistic interpretation

If the errors yᵢ − xᵢᵀβ are assumed to be independent, zero-mean Gaussian with the same variance, then the maximum-likelihood estimate for β is exactly the least-squares solution. This is not a coincidence: the Gaussian log-likelihood, stripped of constants, is the negative squared-error loss. That equivalence is why squared error is the default loss — it is the loss implied by the most common noise model in applied statistics. Section 13 generalises this by replacing the Gaussian noise model with other distributions and getting the other members of the GLM family.

When the closed form breaks

OLS fails cleanly when XᵀX is not invertible. This happens for two distinct reasons. Exact collinearity: two columns of X are linearly dependent (for example, a dummy-variable trap where both indicator columns are included in addition to an intercept). The fix is to drop one of the collinear columns. High-dimensional regime: p ≥ n, so the design matrix has more columns than rows and cannot possibly be full rank. For this case OLS has no unique solution and must be replaced — ridge regression (Section 09) is the standard fix, because the λI term it adds makes the system invertible regardless of the rank of X.
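
A small NumPy sketch of both the failure and the ridge fix, using an exactly collinear column (λ = 10⁻³ is arbitrary, and note that this toy version penalises the intercept too, which production ridge implementations avoid):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
x2 = 2.0 * x1                           # exactly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)

# XᵀX is singular, so the normal equations have no unique solution.
print(np.linalg.matrix_rank(X.T @ X))   # 2, not 3

# Ridge fix: add λI to the diagonal; the system is invertible for any λ > 0.
lam = 1e-3
I = np.eye(X.shape[1])                  # toy version: penalises intercept too
beta_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)
print(np.round(beta_ridge, 3))
```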

The Gauss-Markov theorem

Under the classical assumptions (linearity, zero-mean errors, homoscedasticity, no autocorrelation, no perfect collinearity), the Gauss-Markov theorem says the OLS estimator is the best linear unbiased estimator (BLUE) — it has the smallest variance among all linear unbiased estimators of β. This is a remarkable result for a method as simple as OLS. It does not say OLS is best in general — biased estimators like ridge regression can have lower mean-squared error — and it does not say OLS is best when the classical assumptions fail. But in the idealised textbook setting, OLS is optimal, which is part of why it is the default.

In code, one line

In scikit-learn, OLS is LinearRegression().fit(X, y). In NumPy, np.linalg.lstsq(X, y). In R, lm(y ~ ., data). The algebra above is what those one-liners are doing under the hood; the value of knowing it is that when the output does not make sense — a coefficient of the wrong sign, a refusal to converge, an impossible R² — the reader can diagnose the problem instead of guessing.

Section 04

Gradient descent and its stochastic cousin are what replace the closed form when the closed form cannot cope

The closed-form OLS solution requires inverting a p × p matrix and materialising the n × p design matrix in memory. For p = 10⁴ features and n = 10⁹ rows, neither is feasible. Gradient descent and stochastic gradient descent are the iterative alternatives that scale to those regimes, and they are also the template for every optimisation algorithm in Parts V through VIII. Learning them here, on a convex problem with a known closed-form answer, is the cleanest setting in which to understand what they do.

The gradient-descent update

The gradient of the squared-error loss is ∇L(β) = −2Xᵀ(y − Xβ). Gradient descent starts with some initial β and repeatedly updates β ← β − η · ∇L(β) for a learning rate η. For OLS the loss is convex with a unique minimum, so gradient descent converges to the OLS solution for any sensible learning rate. In the optimisation-theory terms of Chapter 03 of Part I, the problem is strongly convex, η = 1 / (largest eigenvalue of XᵀX) is a safe upper bound, and convergence is geometric.
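
The update is short enough to write out and verify against the closed form. This sketch works with the halved loss ½‖y − Xβ‖², which absorbs the factor of 2 in the gradient; the step size is then 1 over the largest eigenvalue of the Hessian XᵀX:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 500, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

# Safe learning rate: 1 / largest eigenvalue of XᵀX.
eta = 1.0 / np.linalg.eigvalsh(X.T @ X).max()

beta = np.zeros(p)
for _ in range(500):
    grad = -X.T @ (y - X @ beta)     # ∇L(β) for L(β) = ½‖y − Xβ‖²
    beta -= eta * grad

# Converges to the closed-form answer on this convex problem.
beta_closed, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.max(np.abs(beta - beta_closed)))
```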

Stochastic gradient descent

The cost of each full gradient-descent step is O(np) — it touches every row. Stochastic gradient descent (SGD) approximates the gradient using only one random training example (or, in practice, a small mini-batch) per step. The cost per step is O(p) or O(bp); the number of steps required to reach a given accuracy is higher because the noisy gradient makes irregular progress. The tradeoff heavily favours SGD for large n, which is why it is the dominant optimiser for modern ML and the workhorse that the Adam, AdamW, and Lion variants you will meet in Chapter 02 of Part V all descend from.
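
A mini-batch SGD sketch under the same setup (the batch size, learning rate, and epoch count are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, batch = 10_000, 5, 32
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + rng.normal(scale=0.1, size=n)

beta = np.zeros(p)
eta = 0.01
for epoch in range(5):
    order = rng.permutation(n)           # reshuffle every epoch
    for start in range(0, n, batch):
        idx = order[start:start + batch]
        Xb, yb = X[idx], y[idx]
        # Noisy gradient estimate from a 32-row mini-batch: O(bp) per step.
        grad = -Xb.T @ (yb - Xb @ beta) / len(idx)
        beta -= eta * grad

beta_closed, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.max(np.abs(beta - beta_closed)))   # close, not exact: SGD is noisy
```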

Why learn this on regression

On a convex problem, gradient descent and SGD are well-understood and reliable. The reader can compare the iterative solution to the known closed-form answer and confirm that they match to any desired tolerance. That makes the regression setting the ideal place to pick up the conceptual machinery — learning rate, step size, mini-batch, epoch, convergence criterion, optimiser state — that the rest of the compendium uses. When neural networks break convexity and the algorithms become harder to reason about, the vocabulary and intuitions from this section are the foundation.

Momentum, preconditioning, and second-order methods

Plain gradient descent is easy to improve on. Momentum adds a running average of past gradients to the update, smoothing out oscillation. Preconditioning multiplies the gradient by an approximation of the inverse Hessian, which for OLS is (XᵀX)⁻¹ up to a constant factor and brings us back to the closed form. Genuine second-order methods — Newton, BFGS, L-BFGS — exploit this fact and are what most off-the-shelf optimisers for logistic regression and GLMs actually use. Part V and Part VI will revisit this territory; for OLS the first-order methods are already enough to see the shape of the landscape.

Iterative solvers in the libraries

In scikit-learn, SGDRegressor is the iterative alternative to LinearRegression, with configurable loss, penalty, learning-rate schedule, and mini-batch options. PyTorch and JAX let you write the loop yourself in five lines. The broad rule of thumb is: use the closed-form estimator when the problem fits in memory and the design matrix is well-conditioned; use SGD when it does not; use Adam or its cousins for anything non-convex that resembles a neural network.

Two algorithms that answer the same question

For OLS, the closed-form estimator and SGD converge to the same β (up to numerical precision). They are not competing methods; they are appropriate to different scales. The conceptual value of learning SGD on regression is that the same algorithm, with obvious modifications, solves every supervised-learning problem in the compendium.

Section 05

Linear regression rests on four assumptions and each one fails in a recognisable, diagnosable way

The classical theory of linear regression — BLUE estimators, valid confidence intervals, predictable prediction errors — rests on a short list of assumptions about the relationship between the inputs, the output, and the noise. They are not assumptions the practitioner can wish into truth; they are claims about the data that should be checked. The four standard assumptions, sometimes taught with the mnemonic LINE, are linearity, independence, normality of errors, and equal variance (homoscedasticity).

Linearity

Linearity says the expected value of y is a linear function of x: E[y | x] = xᵀβ. It does not say the relationship in the raw variables is linear — basis expansions (Section 07) extend the linear model to curved relationships while keeping the assumption technically true at the level of the transformed features. What it does say is that the relationship, after any transformations, is additive in the inputs. The most common failure is a real-world relationship that is non-linear in a way the chosen features do not capture; the diagnostic is a residual plot that shows systematic curvature.

Independence

Independence says the residuals are independent of each other. This is usually violated in one of three ways. Time series data: residuals are autocorrelated, because yesterday's error predicts today's. Clustered data: residuals within a cluster (a household, a school, a medical practice) are correlated because of shared unobserved structure. Spatial data: residuals at nearby locations are correlated. The Gauss-Markov theorem fails in each case, standard errors become wrong, and the practitioner needs to reach for a different model — time-series-specific regressions, mixed-effects models, or geographically weighted regression.

Homoscedasticity

Homoscedasticity — equal error variance — says the residuals have the same spread across the range of the predictions. The opposite, heteroscedasticity, is common in real data: residuals are larger for larger predicted values (the classic "megaphone" residual plot). The point estimate from OLS is still unbiased under heteroscedasticity, but the standard errors are wrong, which matters if the reader cares about confidence intervals. Fixes include weighted least squares, robust (sandwich) standard errors, and log-transforming the outcome to stabilise variance. The ML-first practitioner who does not care about confidence intervals can often ignore heteroscedasticity entirely; the inference-first practitioner cannot.

Normality of errors

Normality says the residuals are normally distributed. This is the assumption most commonly violated without consequence: the central limit theorem means that estimates of β are approximately normal even when the errors are not, given enough data, so inferences on the coefficients are robust to non-normality at sample sizes above about 30. Normality matters most for prediction intervals on individual predictions, which use the noise distribution directly. For ML-style predictive modelling, normality is usually the least consequential of the four assumptions.

Endogeneity — the hidden fifth assumption

Implicit in all of the above is a fifth condition: the inputs are exogenous, meaning the error term is uncorrelated with the features. When a relevant feature has been omitted and it correlates with the included features, endogeneity is violated and the coefficient estimates become biased — the classical omitted variable bias. Econometrics spends most of its effort on this problem; ML practice usually ignores it, because the goal is prediction rather than causal inference, and biased coefficients are tolerable as long as predictions are accurate. This is one of the sharpest places where the two disciplines diverge. Chapter 09 of Part X (Causal Inference, if included) is where it comes back.

Assumptions are not a loyalty test

When the assumptions fail, the linear model does not explode; it degrades. Point predictions remain useful but standard errors lie; extrapolations become untrustworthy; specific coefficients become hard to interpret. The practitioner's job is to know which failures matter for the use case and which do not. Section 06 is the toolkit for diagnosing the failures empirically.

Section 06

Residual diagnostics are how a linear-regression practitioner actually detects when something is wrong

The residual vector r = y − ŷ is the data's reply to the model. If the model is well-specified, the residuals should look like noise — centred at zero, with constant variance, no systematic pattern, and no extreme outliers. The discipline of residual diagnostics is a small collection of plots and numerical measures that quickly reveal when the residuals do not look like noise, and which assumption is most likely to blame.

The four-plot ritual

Every serious linear-regression package generates a standard set of diagnostic plots, and looking at them takes a minute. Residuals vs fitted values: should be a formless band centred at zero; curvature indicates a linearity violation, a megaphone shape indicates heteroscedasticity. Q-Q plot of residuals: should lie on a straight line if the residuals are normal; systematic deviations indicate non-normality. Scale-location plot: the square-root of the absolute standardised residuals versus fitted values; should be flat if variance is constant. Residuals vs leverage: highlights points with unusual combinations of x and residual; points in the upper or lower corners are potentially influential. These four together catch the majority of specification problems.
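
The four plots can be produced without a regression package — a sketch with NumPy, SciPy, and Matplotlib, using a crude standardisation of the residuals (a proper implementation would use the studentised formula):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                    # render off-screen
import matplotlib.pyplot as plt
import scipy.stats as stats

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# Leverage hᵢᵢ: diagonal of the hat matrix, via xᵢᵀ(XᵀX)⁻¹xᵢ per row.
H_diag = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
std_resid = resid / (resid.std() * np.sqrt(1 - H_diag))   # crude standardisation

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].scatter(fitted, resid, s=10)                   # 1. residuals vs fitted
axes[0, 0].axhline(0, color="grey")
stats.probplot(std_resid, plot=axes[0, 1])                # 2. Q-Q plot
axes[1, 0].scatter(fitted, np.sqrt(np.abs(std_resid)), s=10)  # 3. scale-location
axes[1, 1].scatter(H_diag, std_resid, s=10)               # 4. residuals vs leverage
fig.tight_layout()
```

On this well-specified synthetic model all four panels should look unremarkable, which is the point of the ritual.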

Leverage and influence

Leverage measures how far a point's feature vector is from the centre of the feature distribution; high-leverage points have an outsized ability to pull the fit toward themselves. The leverage of row i is hᵢᵢ, the i-th diagonal entry of the hat matrix H = X(XᵀX)⁻¹Xᵀ. Influence combines leverage with the size of the residual to identify points that both have unusual inputs and disagree with the fitted model — these are the points that, if removed, would change the coefficient estimates most. Cook's distance is the standard scalar influence measure, and points with Cook's D above 1 (or above 4/n, depending on the rule of thumb) warrant investigation.
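
Both quantities are a few lines of NumPy. This sketch plants one high-leverage, high-residual point and confirms that Cook's distance flags it:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)
X[0, 1] = 8.0        # give row 0 an extreme feature value (high leverage)
y[0] = -30.0         # ...and an output the fitted model will disagree with

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Leverage: diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ.
h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)

# Cook's distance combines leverage and residual size.
k = X.shape[1]                      # number of parameters (p + 1)
s2 = resid @ resid / (n - k)        # residual variance estimate
cooks = resid**2 / (k * s2) * h / (1 - h) ** 2

print(int(np.argmax(cooks)))        # row 0 stands out, far above Cook's D = 1
```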

Multicollinearity

Multicollinearity is the situation where predictor variables are strongly but not perfectly correlated with each other. The coefficient estimates remain unbiased, but they become highly variable: a small perturbation of the data can swing individual coefficients wildly, even reversing signs. The standard diagnostic is the variance inflation factor (VIF), computed per feature as 1 / (1 − Rⱼ²) where Rⱼ² is from regressing feature j on all the others. VIFs above 5 or 10 are the conventional warning signs. The usual fixes are to drop redundant features, combine them through PCA (Part IV Chapter 05), or switch to ridge regression (Section 09), which is specifically designed to behave well in collinear regimes.
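
A from-scratch VIF sketch (statsmodels offers an equivalent helper), applied to a feature that is nearly a copy of another:

```python
import numpy as np

def vif(X):
    """VIF per column: 1 / (1 − Rⱼ²) from regressing column j on the rest."""
    out = []
    n, p = X.shape
    for j in range(p):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # include an intercept
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        ss_res = np.sum((X[:, j] - A @ coef) ** 2)
        ss_tot = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(7)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + 0.1 * rng.normal(size=n)     # nearly a copy of a
X = np.column_stack([a, b, c])

# Columns 0 and 2 come out far above the 5-10 warning level; column 1 near 1.
print(np.round(vif(X), 1))
```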

Outliers, and what to do about them

A small fraction of the data points in most real datasets are outliers — genuine anomalies, mis-measurements, or data-entry errors. OLS is highly sensitive to them because squared loss grows quadratically with the residual, so a single outlier can dominate the fit. The diagnostic is straightforward (look at the residuals, look at Cook's distance); the response is not. Discarding outliers by rule is dangerous because it can discard real signal; keeping them is dangerous because it can distort the fit. The compromise options are to use robust regression (Section 14), which downweights large residuals automatically, or to investigate the outliers by hand and decide case-by-case.

The null-model comparison

A final diagnostic that is often neglected: compare the model to the null model that predicts ȳ for every input. The difference in error is what R² measures. A model that does not beat the null model is uninformative, and this happens more often than one might think — particularly when the features are weakly correlated with the outcome. Reporting the null baseline alongside the fitted model is a simple honesty practice that prevents overselling; it also carries over directly to ML evaluation in Section 16.
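
The comparison costs almost nothing to compute — a sketch where the signal is deliberately weak, so the fitted model barely beats the ȳ baseline:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 300
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=2.0, size=n)   # weak signal, strong noise

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

sse_model = np.sum((y - X @ beta) ** 2)
sse_null = np.sum((y - y.mean()) ** 2)        # null model: always predict ȳ
r2 = 1.0 - sse_model / sse_null
print(round(r2, 3))                           # small: the model barely beats ȳ
```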

Plot first, formulate later

The most common mistake in applied regression is to fit the model first and look at the residuals only if the output seems wrong. The better practice is to look at the residuals before reporting any results at all — the four-plot ritual takes a minute and saves a surprising number of misleading analyses.

Section 07

Polynomial regression and basis expansion show how a linear model can capture almost any shape

"Linear" in linear regression refers to linearity in the parameters, not in the inputs. Adding a column for x2, or for log(x), or for sin(x) to the design matrix makes the model non-linear in the raw inputs while remaining a linear-in-parameters regression. This is the trick that lets a method with a closed-form solution and clean diagnostics capture curved, bent, and oscillating relationships.

Polynomial regression

Polynomial regression augments the design matrix with polynomial powers of the original features: for a single predictor x, the model becomes y = β₀ + β₁x + β₂x² + ⋯ + β_d xᵈ. The order d controls flexibility: order 1 is ordinary linear regression, order 2 bends the line into a parabola, order 3 adds an inflection point. High-order polynomials are flexible but unstable — Runge's phenomenon makes them oscillate badly near the edges of the data — and they almost always benefit from centering the predictors (x − x̄) before raising them to powers to reduce collinearity between the polynomial terms.
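
The collinearity point is easy to demonstrate numerically: compare the condition number of the raw-power design matrix with the centred one (the interval [90, 110] is an arbitrary far-from-zero choice):

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(90, 110, 200)        # predictor far from zero

# Raw powers: the columns 1, x, x², x³ are almost perfectly collinear here.
raw = np.column_stack([x**k for k in range(4)])

# Centre first, then raise to powers: collinearity drops dramatically.
xc = x - x.mean()
centred = np.column_stack([xc**k for k in range(4)])

print(f"{np.linalg.cond(raw):.1e}")      # enormous condition number
print(f"{np.linalg.cond(centred):.1e}")  # orders of magnitude smaller
```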

The basis-expansion perspective

Polynomial regression is a special case of a broader idea: replace the raw feature vector x with a transformed feature vector φ(x), where φ is a fixed non-linear basis, and then fit ordinary linear regression in the φ(x) space. Common basis choices include polynomials, B-splines and natural cubic splines (piecewise-polynomial with smoothness constraints), radial basis functions (bumps centred at anchor points), and Fourier basis functions (sines and cosines). The design matrix Φ has the basis values as columns, and everything in Sections 02–06 proceeds identically — just with Φ in place of X.

Splines are the practitioner's default

For most real-world non-linear relationships on a single predictor, natural cubic splines are a better choice than polynomials. They are piecewise polynomials of degree three, joined smoothly at knots, with linear extrapolation beyond the outermost knots to tame the edge behaviour that plagues polynomials. Spline regression is local — the fit near x = 10 is determined by data near x = 10, not by observations far away — which makes the resulting curves far more stable and interpretable. Packages like patsy (Python) and the splines package (R) expose splines through a compact formula syntax that drops into any regression model.

Interaction terms between basis expansions

Basis expansion generalises naturally to multiple features. For two predictors x1 and x2 one can include a tensor-product basis, where the columns of Φ include every pairwise product of the per-variable basis functions. Related machinery powers generalised additive models (GAMs, Hastie and Tibshirani 1986), which fit one smooth per feature and add them, and multivariate adaptive regression splines (MARS, Friedman 1991), which build products of hinge functions to capture interactions. GAMs in particular are the right choice when interpretability matters — they retain a per-feature "partial effect" plot that is directly readable — and are covered in most good modern regression texts.

Danger: flexibility invites overfitting

The more flexible the basis, the more data is required to fit it safely. A cubic polynomial on 10 rows of data has five parameters for ten observations and is almost guaranteed to overfit. A 20-knot spline on 100 observations has a similar problem. The standard remedies are to limit the number of knots or polynomial powers (which is a form of structural regularisation), to apply shrinkage via ridge or lasso on the basis coefficients (Sections 09 and 10), or to choose the amount of flexibility via cross-validation (Section 16). The next section — bias-variance — is the theoretical frame for thinking about these tradeoffs.

The trick behind kernel methods

Basis expansion is the conceptual ancestor of the kernel trick that Chapter 07 of Part IV covers for SVMs and kernel ridge regression. The kernel trick is a computational shortcut: it lets you compute the inner products φ(xi)⊤φ(xj) without ever materialising φ(x) itself, which is what makes it possible to use infinite-dimensional bases. Everything in this section is the same idea, done the slow way; Chapter 07 makes it efficient.

Section 08

The bias-variance tradeoff is the single most important idea in supervised learning

Every model's expected prediction error decomposes into three terms: squared bias (the systematic error from using an overly restrictive model), variance (the sensitivity of the fit to the specific training sample), and irreducible noise (the part of the output that is unpredictable from the inputs). The practitioner's control knob is the flexibility of the model: simpler models have higher bias and lower variance; more flexible models have lower bias and higher variance; the total error is minimised somewhere in between. Almost every technique in this chapter and the next two is a different way of tuning that knob.

The formal decomposition

For a fixed point x0 and a model trained on a random sample, the expected squared error of the prediction decomposes as E[(y0 − ŷ(x0))²] = (E[ŷ(x0)] − f(x0))² + Var(ŷ(x0)) + σ², where f is the true function, the first term is squared bias, the second is variance, and σ² is the irreducible noise. The derivation is a few lines of algebra (Elements of Statistical Learning, §2.9) and is worth working through once. The point of the decomposition is not the formula but the realisation that the two controllable components pull in opposite directions.
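The decomposition can be checked numerically. A Monte-Carlo sketch — the true function, the noise level, and the evaluation point are all arbitrary choices for the demo:

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(2 * x)   # illustrative "true" function
sigma = 0.3                   # irreducible noise standard deviation
x0 = 1.0                      # fixed evaluation point

def fit_once(degree=1):
    """Fit a polynomial to a fresh training sample; predict at x0."""
    x = rng.uniform(0, 2, 30)
    y = f(x) + rng.normal(0, sigma, 30)
    return np.polyval(np.polyfit(x, y, degree), x0)

preds = np.array([fit_once() for _ in range(2000)])
bias_sq = (preds.mean() - f(x0)) ** 2        # squared bias of the line fit
variance = preds.var()                       # fit-to-fit variability
decomposed = bias_sq + variance + sigma**2   # the three-term decomposition

# Direct Monte-Carlo estimate of E[(y0 - yhat(x0))^2] for comparison.
y0 = f(x0) + rng.normal(0, sigma, preds.size)
direct = np.mean((y0 - preds) ** 2)
```

For a degree-1 fit to a sine, bias_sq dominates; raising the degree moves error into the variance term instead — the two terms trade off exactly as the formula says.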

The U-shape

Plotting expected prediction error against model flexibility produces a characteristic U-shape. On the left, where models are too simple, bias dominates: the model cannot represent the true relationship and is systematically wrong. On the right, where models are too flexible, variance dominates: the model fits the training sample's quirks rather than the underlying relationship, and its predictions are unstable from sample to sample. The minimum of the U is the optimal tradeoff. A linear regression with too few features sits on the left; a 20-degree polynomial with 30 rows of data sits on the right; the point of cross-validation (Section 16) is to find where the bottom is empirically.

Overfitting and underfitting

Underfitting is the high-bias regime — the training error and the test error are both high, because the model cannot represent the signal. Overfitting is the high-variance regime — training error is low but test error is high, because the model has memorised noise. The classic diagnostic is the training-error-and-test-error curve as a function of model complexity: underfitting gives two high flat lines, overfitting gives a widening gap, the sweet spot is where test error is minimised and the two curves are close. This curve, called a learning curve when plotted against training-set size, is one of the most diagnostic plots in machine learning.
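A sketch of the complexity curve on synthetic cubic data — degrees 1, 3, and 15 play the underfit, well-matched, and overfit models (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: x**3 - x                       # illustrative true signal
x_train = rng.uniform(-1, 1, 30)
x_test = rng.uniform(-1, 1, 500)
y_train = f(x_train) + rng.normal(0, 0.1, x_train.size)
y_test = f(x_test) + rng.normal(0, 0.1, x_test.size)

def train_test_mse(degree):
    """Fit a polynomial on the training set; return (train MSE, test MSE)."""
    coef = np.polyfit(x_train, y_train, degree)
    tr = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    te = np.mean((np.polyval(coef, x_test) - y_test) ** 2)
    return tr, te

errors = {d: train_test_mse(d) for d in (1, 3, 15)}  # underfit, right, overfit
```

The training error falls monotonically with degree — that is exactly why it cannot be used for model selection — while the test error traces the U.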

Regularisation as a bias-variance lever

Regularisation (Sections 09–11) is the standard way to trade a little bias for a lot of variance reduction. A ridge-regression model with penalty λ > 0 is deliberately biased — its coefficient estimates are shrunk toward zero — but its variance is much lower than OLS when features are correlated or numerous. For many realistic datasets the net effect is a substantial reduction in prediction error. This is one of the few places in statistics where introducing bias deliberately improves predictions; the intuition is essential for reading the rest of the chapter.

Modern wrinkles: double descent

The U-shape is the textbook picture, and it describes what happens for the range of model complexities covered in classical ML. Recent work (Belkin, Hsu, Ma, Mandal 2019) has shown that for very large models — deep neural networks, huge random-feature regressions — there is a second descent past the "interpolation threshold" where the model exactly fits the training data: increasing capacity further reduces test error again. This double-descent phenomenon is one of the reasons over-parameterised neural networks generalise as well as they do. Classical regression does not reach that regime; the U-shape is the right mental model for Chapters 01 through 09 of Part IV.

The reason for almost every later technique

Cross-validation, regularisation, ensembling, early stopping, dropout, weight decay, data augmentation — nearly every ML technique invented in the last sixty years is, at its core, a bias-variance lever. Understanding that is what turns a list of methods into a coherent field.

Section 09

Ridge regression penalises large coefficients and gets a lot of stability in return

Ridge regression (Hoerl and Kennard 1970), also called L2 regularisation or Tikhonov regularisation, modifies OLS by adding a penalty proportional to the sum of squared coefficients: the objective becomes ‖y − Xβ‖² + λ‖β‖². The closed-form solution is β̂ = (X⊤X + λI)⁻¹X⊤y — almost identical to OLS, with a λ added to the diagonal. That small change rescues the method from collinearity, rescues it from the p > n regime, and reduces its variance at the cost of a modest bias.
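The closed form is a one-liner. A sketch on synthetic data with a deliberately near-duplicate column that would make plain OLS unstable (solving the linear system rather than forming the inverse, as good practice dictates):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 5
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + rng.normal(0, 1e-6, n)  # near-duplicate column: OLS unstable
y = X @ np.array([1.0, 1.0, 0.5, 0.0, -0.5]) + rng.normal(0, 0.1, n)

lam = 1.0
# Ridge closed form: solve (X'X + lam*I) beta = X'y.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

On the collinear pair, ridge splits the shared effect almost exactly in half — the stability that OLS lacks.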

What ridge buys you

Three things, in order of practical importance. First, numerical stability: adding λI makes X⊤X + λI invertible even when X has perfectly or near-perfectly collinear columns, which resolves the single biggest technical failure mode of OLS. Second, variance reduction: ridge coefficients have lower variance than OLS coefficients because the penalty pulls them toward zero; on datasets with many features or correlated features, the reduction in variance dominates the introduced bias and total prediction error falls. Third, high-dimensional feasibility: ridge works fine when p > n, which OLS does not, which is why it remains the default baseline for wide datasets in genomics and text.

The geometric picture

Ridge regression can be equivalently formulated as: "minimise squared error subject to ‖β‖² ≤ t". The constraint set is a ball in coefficient space, centred at the origin. The OLS loss contours are ellipses centred at the OLS solution. The ridge solution is the point where the innermost ellipse that intersects the ball touches the ball — a shrunken version of the OLS solution pulled toward the origin. As λ → 0 ridge reduces to OLS; as λ → ∞ every coefficient is driven toward zero, though none reaches exactly zero at any finite λ — that distinction belongs to the lasso. The right λ is in between and is chosen by cross-validation.

The Bayesian interpretation

Ridge regression is the maximum-a-posteriori estimator with a Gaussian prior on β: the prior says each coefficient is a priori small, and the data pulls them away from zero in proportion to how strongly they improve the fit. The tuning parameter λ is inversely proportional to the prior variance — a tighter prior (smaller prior variance, larger λ) shrinks more aggressively. This Bayesian framing becomes important in Chapter 06 of Part IV (probabilistic graphical models) and again in Bayesian deep learning; for now it explains why ridge "works" in a way that goes beyond the bias-variance argument.

Scaling matters

Unlike OLS, ridge regression is not scale-invariant. If one feature is measured in millimetres and another in kilometres, the penalty treats them as comparable when they are not, and the fit becomes nonsense. The universal preprocessing step is to standardise features to zero mean and unit variance before fitting (Section 12); this is why virtually every scikit-learn pipeline involving ridge regression starts with a StandardScaler.

Choosing λ

The penalty parameter λ is a hyperparameter and must be tuned. The standard approach is K-fold cross-validation (Section 16): for a grid of candidate λ values, compute the average held-out error across folds, and pick the λ that minimises it. Scikit-learn's RidgeCV does this in one line. A widely used shortcut is the "one-standard-error rule": pick the largest λ whose cross-validated error is within one standard error of the minimum, trading a little accuracy for extra regularisation and a simpler model. The rule comes from Breiman, Friedman, Olshen, and Stone's CART book (Classification and Regression Trees, 1984) and is a good default.
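In scikit-learn the grid search is one line; a sketch with synthetic data and an arbitrary log-spaced grid (sklearn calls λ "alpha"):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 200)

# Standardise first, then pick the penalty by (efficient leave-one-out) CV.
grid = np.logspace(-3, 3, 13)
model = make_pipeline(StandardScaler(), RidgeCV(alphas=grid))
model.fit(X, y)
best_alpha = model.named_steps["ridgecv"].alpha_
```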

The 90% case

If an ML engineer is asked to fit a regression model on a tabular dataset with dozens or hundreds of features and is given five minutes, the reasonable default is "standardise features, fit ridge with cross-validated λ". The result is usually competitive with much more elaborate approaches and is almost always a strong baseline for something fancier to justify itself against.

Section 10

The lasso shrinks some coefficients exactly to zero, which is feature selection as a side effect of the fit

The lasso (Tibshirani 1996) — Least Absolute Shrinkage and Selection Operator — replaces ridge's sum of squared coefficients with a sum of absolute values: the objective becomes ‖y − Xβ‖² + λ‖β‖1. The change from L2 to L1 norm looks small and is anything but. L1 shrinks some coefficients exactly to zero, producing a sparse model that has built-in feature selection — a property ridge does not have, and one that turned out to matter enormously as datasets got wider.

Why sparsity happens

The geometry is the clearest explanation. For the constrained formulation min ‖y − Xβ‖² subject to ‖β‖1 ≤ t, the constraint set is an ℓ1 ball — a diamond in two dimensions — with corners on the coordinate axes. The OLS loss contours are ellipses. The loss contours are more likely to touch the constraint set at a corner than on a flat face, and touching at a corner means one coordinate of β is exactly zero. As λ increases, more corners are hit, and more coefficients go to zero. This sharp-corner geometry is what the L2 ball (smooth sphere, no corners) does not have.

No closed form, but convex

The lasso objective is convex but not differentiable (the absolute value has a kink at zero), so there is no matrix formula like ridge's. Standard solvers use coordinate descent, LARS (Least Angle Regression, Efron et al. 2004), or proximal-gradient methods like ISTA/FISTA. These are fast — coordinate descent typically needs only a few dozen passes over the features on well-conditioned problems — and scikit-learn's Lasso and LassoCV are production-ready.
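Coordinate descent itself is short enough to sketch. Each coordinate update is a soft-threshold, which is where the exact zeros come from; this follows scikit-learn's objective convention (1/(2n) on the squared error) but is a teaching sketch, not production code:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/2n)||y - X beta||^2 + lam*||beta||_1 by cyclic
    coordinate descent with soft-thresholding (teaching sketch)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X**2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with feature j's current contribution removed.
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r / n
            # Soft-threshold: shrink toward zero, clip at exactly zero.
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return beta

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 100)
beta = lasso_cd(X, y, lam=0.1)
```

With a true two-feature signal and eight null features, the soft-threshold leaves the null coefficients at exactly zero — sparsity as a by-product of the update rule.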

Feature selection vs variable selection

The lasso's sparsity is genuinely useful but should not be read as classical feature selection. When features are correlated, the lasso tends to pick one of the group more or less arbitrarily and zero out the others, which can make the specific selection unstable across bootstrap samples. The elastic net (next section) was partly invented to address this. Stability-selection techniques — fitting the lasso on many bootstrap resamples and picking features selected most often — are the standard remedy when the identity of the selected features matters (pharmaceutical applications, for example) rather than just their number.

The Bayesian interpretation

The lasso is the MAP estimator with a Laplace prior on each coefficient — a distribution with a sharp peak at zero and heavier tails than a Gaussian. The sharp peak is why coefficients stay at zero; the heavy tails are why large signals still come through relatively unshrunken. This asymmetry is exactly the behaviour one wants for a sparse problem, and it explains why lasso tends to outperform ridge when the true signal really is sparse and to underperform it when it is not.

The high-dimensional era

Lasso's killer use case is the modern high-dimensional regime: genomics (millions of SNPs predicting a trait), text regression (hundreds of thousands of word counts), image feature extraction. In each of these settings OLS has no unique solution, ridge returns a fit with every feature kept at a nonzero but small coefficient, and lasso returns a sparse model that is actually interpretable. The theoretical guarantees (oracle properties, restricted eigenvalue conditions, sign consistency) are mostly from the 2000s and 2010s; Hastie, Tibshirani, and Wainwright's Statistical Learning with Sparsity (2015) is the definitive reference.

In one line

Ridge keeps all features with small coefficients; lasso keeps a subset of features with larger coefficients. On sparse problems lasso tends to win; on dense problems ridge tends to win; when you do not know which you have, the next section is the answer.

Section 11

The elastic net combines L1 and L2 and behaves better than either alone on correlated features

The elastic net (Zou and Hastie 2005) is the convex combination of ridge and lasso: its penalty is λ1‖β‖1 + λ2‖β‖², usually reparameterised as λ(α‖β‖1 + ½(1 − α)‖β‖²) with a single mixing parameter α ∈ [0, 1]. At α = 0 it is ridge; at α = 1 it is lasso; in between it inherits sparsity from lasso and stability-under-correlation from ridge. An intermediate α is almost always better than either extreme when features are correlated.

Why the mixture helps

Lasso's weakness — picking one feature arbitrarily from a correlated group — is cured by adding a small L2 term, which pulls all members of the group toward each other in addition to toward zero. The result is the grouping effect: correlated predictors tend to be selected or discarded together, with similar coefficient values. On genomic data, text data, and any other high-dimensional application with correlated features, the elastic net is typically the best of the three.

Regularisation paths

A regularisation path traces the fitted coefficients as λ varies from very large (all coefficients zero) to zero (the unpenalised solution). For lasso and elastic net, the path is piecewise linear, and efficient algorithms (LARS, glmnet's coordinate descent) compute the whole path in about the time of a single fit. Plotting the path is one of the more informative diagnostics in sparse regression: which features enter first (strongest signal), which come in as groups (correlated), which coefficients are stable as λ changes (reliable) versus which swing wildly (unreliable). Trevor Hastie's glmnet (R) and scikit-learn's lasso_path are the standard tools.
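Scikit-learn's lasso_path returns the whole path in one call; a sketch in which feature 0 carries the strongest (synthetic) signal and should therefore enter the path first:

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(7)
X = rng.normal(size=(120, 8))
y = 2 * X[:, 0] + 0.5 * X[:, 3] + rng.normal(0, 0.2, 120)

# Whole path in one call: alphas descend from the all-zero model toward OLS.
alphas, coefs, _ = lasso_path(X, y, n_alphas=50)

# Which features are active just below the largest penalty?
active_early = np.nonzero(np.abs(coefs[:, 1]) > 1e-12)[0]
```

Plotting coefs against log(alphas) gives the standard path picture; here the early active set is read off directly instead.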

Tuning α and λ together

Elastic net has two hyperparameters, which sounds worse for tuning but rarely is. The standard approach is a two-dimensional cross-validation grid: a small set of α values (0, 0.25, 0.5, 0.75, 1 is typical) crossed with the full regularisation path for each. Scikit-learn's ElasticNetCV does this in one call. In practice α = 0.5 is a reasonable default when you do not want to tune.
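A sketch of the two-dimensional search, with three deliberately correlated copies of one latent feature (all sizes and values illustrative; sklearn calls α "l1_ratio"):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(8)
base = rng.normal(size=(300, 1))
# Three strongly correlated copies of one latent signal, plus 7 noise features.
X = np.hstack([base + rng.normal(0, 0.05, (300, 1)) for _ in range(3)]
              + [rng.normal(size=(300, 7))])
y = base.ravel() + rng.normal(0, 0.3, 300)

# Cross-validate over the mixing parameter and the full path jointly.
model = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.25, 0.5, 0.75, 1.0], cv=5),
)
model.fit(X, y)
enet = model.named_steps["elasticnetcv"]
```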

Solvers and scale

For the problem sizes typical in classical ML (thousands to millions of rows, dozens to tens of thousands of features), the glmnet coordinate-descent solver and its derivatives are astonishingly fast — the full elastic-net path in seconds to minutes. For genomics-scale problems (hundreds of thousands of features, millions of rows) stochastic and distributed solvers take over, but the coefficient structure and intuitions stay the same. The chapter mentions but does not dwell on the numerical-optimisation details; Boyd and Vandenberghe's Convex Optimization (Chapter 03 of Part I's reading list) is the reference.

Beyond elastic net

More exotic penalties exist: group lasso (selects whole groups of features together — for example, all dummy columns encoding a single categorical), adaptive lasso (re-weighted L1 for sign consistency), SCAD and MCP (non-convex penalties with better oracle properties). They are specialised tools; the working practitioner gets 95 percent of the value from ridge, lasso, and elastic net, all of which are one-line fits in the standard libraries.

The default recipe

For the wide-data regression problem, a reasonable default pipeline is: standardise features, fit ElasticNetCV with a grid of α values, plot the regularisation path, inspect the coefficients at the cross-validated λ, and refit on all the data with the chosen hyperparameters. That recipe is the direct analogue of the ridge recipe from Section 09 and is the workhorse for almost any tabular regression where interpretability matters.

Section 12

Feature scaling is unglamorous and regularised regression does the wrong thing without it

OLS is invariant to the units of the features: multiplying one column of X by 1000 divides that column's coefficient by 1000 and leaves predictions unchanged. Ridge, lasso, and elastic net are not — their penalties treat all coefficients symmetrically, so the feature with larger numerical scale gets a larger-in-absolute-value coefficient and a disproportionate share of the penalty, producing a model that reflects the measurement units rather than the underlying relationships. The fix is standardisation, and it is the first preprocessing step in essentially every regularised-regression pipeline.

Standardisation vs normalisation

Standardisation subtracts the mean and divides by the standard deviation, producing features with zero mean and unit variance. It is the default for linear models, PCA, and any distance-based method. Min-max normalisation linearly rescales to [0, 1]; it preserves the shape of the distribution and is useful for neural-network inputs where bounded ranges matter. Robust scaling uses medians and interquartile ranges instead of means and standard deviations; it is the right choice when outliers are present. All three are one-call operations in scikit-learn's preprocessing module.
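The three scalers behave visibly differently on data with an outlier; a small sketch with a made-up contaminated sample:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

rng = np.random.default_rng(16)
# 99 well-behaved values plus one gross outlier.
x = np.concatenate([rng.normal(50, 5, 99), [500.0]]).reshape(-1, 1)

std = StandardScaler().fit_transform(x)   # zero mean, unit variance
mm = MinMaxScaler().fit_transform(x)      # squeezed into [0, 1]
rob = RobustScaler().fit_transform(x)     # median 0, IQR-scaled
```

The outlier drags the min-max version so that the 99 good points are crushed into a tiny sub-interval, while the robust version leaves them spread out — the reason RobustScaler is the right choice under contamination.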

Scale-dependent vs scale-independent methods

The practitioner should know which methods require scaling and which do not. Scale-sensitive: ridge, lasso, elastic net, k-nearest neighbours, k-means, SVMs, PCA, neural networks, logistic regression with regularisation. Scale-invariant: unregularised OLS, decision trees, random forests, gradient-boosted trees (mostly). When in doubt, standardise — it never hurts the scale-invariant methods and it rescues the scale-sensitive ones.

Pipelines and leakage

The single most common scaling bug is data leakage: fitting the scaler on the entire dataset, including rows that will later be used as held-out test data. This leaks information about the test distribution into the training pipeline and produces optimistic error estimates. The correct pattern is to fit the scaler on the training fold only and apply the fitted scaler to the validation and test folds. Scikit-learn's Pipeline does this automatically when used with cross-validation; writing the loop by hand is a common source of silent errors.
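The safe pattern in scikit-learn is a Pipeline: the scaler is re-fit on each training fold inside cross_val_score, so held-out rows never influence the preprocessing. A sketch with features on wildly different (made-up) scales:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 10))
X[:, 0] *= 1000  # one feature in "millimetres", the rest in "kilometres"
y = X[:, 0] / 1000 + rng.normal(0, 0.1, 200)

# The pipeline guarantees scaler statistics come from the training fold only.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
```

Calling StandardScaler().fit(X) on the full matrix before splitting would give the same numbers here, but on real data it silently inflates the estimate — the pipeline makes the bug impossible rather than merely unlikely.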

Handling skewed features

Feature distributions are often heavily skewed — income, population, prices, counts. Linear models work better when features are roughly symmetric and not heavy-tailed; a log transformation (or log-plus-one for non-negative counts, or a Box-Cox for a family of power transformations) usually helps. The logic is the same on the target: a right-skewed target like house prices often becomes better-behaved on the log scale, and a linear model on log-target predicts multiplicative rather than additive effects. Both scaling and log-transforming are cheap and both should be considered by default.

Missing values

Regression models do not handle missing values gracefully. The options are: drop rows with any missing value (fast, reduces data); impute with the column mean or median (simple, introduces bias); impute using a regression of the missing column on the others (the EM-style iterative imputer, more accurate, more expensive); add a binary "is missing" indicator and impute arbitrarily (preserves information about the missingness itself). The choice depends on how much data has missing values, whether the missingness is random, and whether the missingness itself carries signal. The scikit-learn IterativeImputer and SimpleImputer implement the main options.

Boring, mandatory, often decisive

The preprocessing pipeline is one of the places where a careful practitioner separates themselves from a careless one. Two people can fit the same algorithm on the same dataset with the same regularisation parameter and produce substantially different predictions, because one standardised and log-transformed and the other did not. The effort is a few extra lines; the upside on real-world performance is frequently large.

Section 13

Generalised linear models extend the same machinery to counts, rates, and binary outcomes

Linear regression is one member of a larger family. By replacing the Gaussian error assumption with other distributions and by wrapping the linear predictor in a non-identity link function, the same framework handles counts, rates, positive skewed variables, and — crucially — the binary outcomes that open the door to classification.

The three-part structure

A generalised linear model (GLM), introduced by Nelder and Wedderburn in 1972, consists of three parts. First, a random component: the response y is assumed to come from a distribution in the exponential family (Gaussian, Poisson, gamma, binomial, negative binomial, inverse Gaussian, and others). Second, a systematic component: a linear predictor η = Xβ. Third, a link function g that connects the expected value of the response to the linear predictor: g(E[y]) = η. OLS is the special case where the distribution is Gaussian and the link is the identity.

Poisson regression for counts

When the target is a count — emails received per hour, defects per wafer, hospital admissions per day — the Gaussian assumption fails because counts are non-negative integers and their variance typically grows with the mean. The natural choice is Poisson regression: assume y | x ~ Poisson(μ(x)) and use the log link, log μ(x) = x⊤β. The exponential of each coefficient is then a multiplicative effect on the rate: a coefficient of 0.2 means a one-unit increase in that feature multiplies the expected count by e^0.2 ≈ 1.22. When counts are over-dispersed (variance larger than the mean, which is almost always), switch to negative binomial regression, which adds a dispersion parameter.
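Scikit-learn (≥ 0.23) ships a Poisson GLM directly; a sketch with made-up coefficients, checking the multiplicative-rate reading of exp(β):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(10)
X = rng.normal(size=(500, 2))
# Illustrative true rate: log(mu) = 0.5 + 0.2*x1 - 0.3*x2.
mu = np.exp(0.5 + 0.2 * X[:, 0] - 0.3 * X[:, 1])
y = rng.poisson(mu)

model = PoissonRegressor(alpha=0.0).fit(X, y)  # alpha=0: no L2 penalty
rate_multipliers = np.exp(model.coef_)  # multiplicative effect per unit change
```

The first multiplier should sit near e^0.2 ≈ 1.22: a one-unit increase in the first feature scales the expected count by about 22 percent.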

Gamma regression for positive continuous

For variables that are continuous, strictly positive, and right-skewed — insurance claim sizes, time-to-event durations, income — the gamma distribution with a log link is a standard choice. The gamma's variance scales with the square of the mean, which matches the empirical pattern that larger quantities vary more on an absolute scale but similarly on a relative scale. An alternative is a log-normal model fit by OLS on log y, but the interpretation of coefficients and the back-transformation to the original scale differ subtly; the gamma GLM is cleaner.

The logistic link — bridge to classification

When the target is binary — click or no click, default or repay, churn or stay — the binomial distribution with the logit link yields logistic regression. Write the linear predictor η = x⊤β and define P(y=1|x) = σ(η) = 1 / (1 + e^{-η}). The coefficient on a feature is then the change in the log-odds of the positive class for a one-unit change in that feature. This is the same algorithmic machinery as linear regression — a linear predictor, coefficients estimated by maximum likelihood, a loss that is convex — but the target is categorical. Logistic regression is the workhorse of classical classification and forms the natural bridge to the next chapter.

Estimation by IRLS and deviance

For the general GLM, the maximum-likelihood solution has no closed form, but an elegant iterative procedure called iteratively reweighted least squares (IRLS) works for the entire family: at each iteration, form a working response and working weights, solve a weighted least squares problem, and repeat until convergence. The algorithm is fast, stable, and converges in a handful of iterations for well-posed problems. The GLM analogue of the residual sum of squares is the deviance, defined as twice the difference between the log-likelihood of a saturated model and the fitted model; it reduces to SSR in the Gaussian case and plays an analogous role for model comparison, goodness-of-fit tests, and analysis-of-deviance tables.
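IRLS is short enough to write out for the Poisson/log-link case. The working response and weights below follow the standard GLM derivation (z = η + (y − μ)/μ, w = μ); this is a teaching sketch, not a production fitter:

```python
import numpy as np

def irls_poisson(X, y, n_iter=25):
    """IRLS for Poisson regression with log link (teaching sketch)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu        # working response
        w = mu                         # working weights
        # One weighted least-squares solve: (X'WX) beta = X'Wz.
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta

rng = np.random.default_rng(11)
X = np.column_stack([np.ones(400), rng.normal(size=400)])
y = rng.poisson(np.exp(0.3 + 0.4 * X[:, 1]))  # made-up true coefficients
beta = irls_poisson(X, y)
```

Each pass is just a weighted OLS solve — the "same bones" claim made concrete: the GLM fitter is linear regression in a loop.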

Same bones, different skin

The GLM framework is one of the most durable ideas in applied statistics because it scales an intuition — that the mean of a response should depend linearly on features, on some appropriate scale — across almost every kind of target. When the data are counts, use a log link and a Poisson or negative binomial family; when the data are positive and skewed, use a log link and a gamma family; when the data are binary, use a logit link and a binomial family. The implementation in R's glm() and Python's statsmodels.GLM is unified across all these cases.

Section 14

Robust regression keeps the fit sensible when a minority of the data is contaminated

OLS squares the residuals, so a single wildly wrong point can dominate the fit. Robust regression replaces the squared loss with a function that down-weights extreme residuals, or finds a subset of inliers and fits to them alone. The result is a regression that tolerates a minority of contaminated data.

Why OLS is fragile to outliers

The quadratic residual term in OLS means that a point whose residual is twice as large contributes four times as much to the loss. A point ten times as far contributes one hundred times as much. This gives a single bad measurement — a data-entry typo, a sensor dropout, a transcription error, a fraudulent record — disproportionate influence over every coefficient in the model. In clean laboratory data this is a theoretical concern; in real-world data it is a daily reality. Robust regression acknowledges that outliers exist and builds estimators that do not collapse in their presence.

Huber loss

Huber loss is quadratic for small residuals and linear for large ones, with a smooth transition at a threshold δ. The quadratic region preserves the efficiency of OLS for most of the data; the linear region prevents any single large residual from dominating the sum. Huber regression is convex, differentiable, and can be solved with the same optimisation machinery as OLS (with a modified loss); the cost is a single hyperparameter, δ, which sets the outlier threshold. A reasonable default is 1.35 times the median absolute deviation of the residuals, which targets 95% efficiency under a Gaussian model.
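A sketch contrasting OLS and Huber on data with a few gross outliers (the +50 shift plays the data-entry typo; the true line y = 2x + 1 is made up):

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(12)
x = rng.uniform(0, 10, 100).reshape(-1, 1)
y = 2 * x.ravel() + 1 + rng.normal(0, 0.5, 100)
y[:5] += 50  # five gross outliers, e.g. data-entry errors

ols = LinearRegression().fit(x, y)
huber = HuberRegressor(epsilon=1.35).fit(x, y)  # 1.35 is sklearn's default threshold
```

The OLS fit is dragged visibly upward by five bad points out of a hundred; the Huber fit stays near the true line.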

RANSAC — consensus from random subsets

RANSAC (Random Sample Consensus), originally developed for computer vision in 1981, takes a different approach: instead of down-weighting outliers, it ignores them. Repeatedly sample a minimal subset of points, fit a model, count how many of the remaining points lie within a tolerance of the fitted model (the consensus set), and keep the fit with the largest consensus. After enough random trials, the algorithm finds a subset dominated by inliers and produces a fit that would have been invisible under any weighted average. RANSAC is the standard for fitting lines in noisy images, estimating homographies, and any problem where a substantial fraction of the data is inliers and the outliers are structurally different from them.
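Scikit-learn's RANSACRegressor wraps the sample-fit-count loop; a sketch with 30% structural outliers that have nothing to do with the (made-up) true line:

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor, LinearRegression

rng = np.random.default_rng(13)
x = rng.uniform(0, 10, 200).reshape(-1, 1)
y = 3 * x.ravel() - 1 + rng.normal(0, 0.2, 200)
y[:60] = rng.uniform(0, 40, 60)  # 30% structural outliers, unrelated to x

ransac = RANSACRegressor(LinearRegression(), residual_threshold=1.0,
                         random_state=0).fit(x, y)
slope = ransac.estimator_.coef_[0]
inlier_frac = ransac.inlier_mask_.mean()  # fraction in the final consensus set
```

Thirty percent contamination would wreck both OLS and (at this severity) Huber; RANSAC recovers the inlier line essentially exactly because the outliers never enter the final fit.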

Quantile regression

Quantile regression, introduced by Koenker and Bassett in 1978, targets a conditional quantile rather than the conditional mean. Minimising an asymmetric piecewise-linear loss — penalty τ on positive residuals and (1-τ) on negative — produces the conditional τ-quantile. The median regression (τ = 0.5) is the robust analogue of OLS; fitting τ = 0.1 and τ = 0.9 produces a prediction interval that makes no Gaussian assumption. Quantile regression is a staple of econometrics, finance (Value-at-Risk), and any problem where the tails of the conditional distribution matter as much as the mean.

M-estimators more broadly

All these methods are instances of M-estimators: estimators defined by minimising a sum of some loss function ρ(r_i) over the residuals. Choosing ρ(r) = r² recovers OLS; ρ(r) = |r| gives median regression; the Huber and Tukey biweight fall in between. The theory of M-estimators provides asymptotic standard errors, breakdown points, and efficiency measures; the breakdown point of OLS is zero (one bad point is enough to ruin it) while Tukey's biweight has a breakdown point near 50%. Scikit-learn's HuberRegressor, RANSACRegressor, and QuantileRegressor cover the main cases with a standard interface.

Section 15

Categorical variables and interactions determine what a linear model can see in real-world data

Real data contains categories — country, industry, device type, zip code — not just numbers. Encoding them correctly is one of the quiet places where regression goes wrong. And interactions, the idea that the effect of one feature depends on another, are how linear models recover some of the flexibility that would otherwise require non-linear methods.

One-hot, dummy, and effect coding

A categorical variable with k levels cannot be fed directly to a regression as a single numeric column — the numeric values would imply an ordering and a distance that the categories do not have. The standard encoding is one-hot: create k binary columns, one per level, with a 1 in the column corresponding to the observation's level. In OLS with an intercept, using all k columns creates perfect collinearity (the columns sum to the all-ones vector), known as the dummy variable trap; drop one level as the reference and the remaining k-1 coefficients represent deviations from it. Effect coding uses −1 and +1 so the reference is the grand mean rather than an arbitrary level; it is more interpretable for balanced ANOVA-style analyses.
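In pandas the encoding and the trap-avoidance are one call; a sketch with a made-up three-level categorical:

```python
import pandas as pd

df = pd.DataFrame({
    "device": ["phone", "tablet", "desktop", "phone", "desktop"],
    "hours": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# drop_first avoids the dummy-variable trap: k-1 columns, one reference level.
encoded = pd.get_dummies(df, columns=["device"], drop_first=True)
```

With three levels, two dummy columns remain; the dropped level (here the alphabetically first, "desktop") becomes the reference against which the other coefficients are read.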

Interactions

An interaction term is the product of two features, added as its own column: x1 * x2. The coefficient on the product measures how the effect of one feature changes as the other changes. A classical example: the effect of education on income is different for men and women — an interaction between education and gender captures this, whereas additive main effects cannot. Interactions with categorical variables are pairwise products of each one-hot column, producing slope-by-group estimates. Interactions with continuous variables are products, and their interpretation requires centring the continuous variable to make the main effect meaningful.
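A sketch of a pure interaction recovered by adding the product column — the coefficients are made up, and PolynomialFeatures with interaction_only=True generates x1, x2, and x1·x2 but no squares:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(15)
X = rng.normal(size=(300, 2))
# The effect of x1 depends on x2: the 1.5 coefficient is pure interaction.
y = (1 + 2 * X[:, 0] + 0.5 * X[:, 1] + 1.5 * X[:, 0] * X[:, 1]
     + rng.normal(0, 0.2, 300))

inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
Xi = inter.fit_transform(X)  # columns: x1, x2, x1*x2
model = LinearRegression().fit(Xi, y)
```

A main-effects-only model would miss the interaction entirely; with the product column the coefficient on x1·x2 is recovered directly.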

High-cardinality categoricals

One-hot encoding scales poorly when the categorical variable has thousands of levels — US zip codes, product SKUs, user IDs. The design matrix becomes enormous and mostly zeros, and most levels see only a handful of observations. The three standard responses are: target encoding, replacing each level with the mean (or smoothed mean) of the target within that level, which compresses the encoding but can leak target information if done before cross-validation splits; hashing, mapping levels to a smaller fixed number of buckets, which loses some distinguishability but is constant-memory; and embeddings, learning a low-dimensional dense vector per level jointly with the model, which is now standard in neural-network approaches and extends naturally to classical regression via a learned lookup table.
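A minimal sketch of smoothed target encoding; the column names and the smoothing weight m are invented for illustration, and the encoding must be fit on training folds only or target information leaks into validation:

```python
import pandas as pd

def smoothed_target_encode(train, col, target, m=10.0):
    """Shrink each level's mean toward the global mean:
    (n * level_mean + m * global_mean) / (n + m).
    Rare levels get pulled strongly toward the global mean."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    smooth = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return smooth, global_mean   # mapping level -> encoding; fallback for unseen levels

train = pd.DataFrame({"zip": ["a", "a", "a", "b", "c"],
                      "y":   [10.0, 12.0, 11.0, 50.0, 30.0]})
enc, fallback = smoothed_target_encode(train, "zip", "y", m=2.0)

# Level 'a' (3 observations, raw mean 11.0) is pulled toward the global mean 22.6
print(enc["a"], enc["b"], fallback)
```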

Ordinal variables

Categorical variables with an inherent order — education level (none, high school, bachelor, graduate), satisfaction score (1–5), risk rating (AAA, AA, A, BBB) — are called ordinal. They can be encoded as integers if the distances between levels are plausibly equal; encoded as one-hot if the distances are arbitrary; or encoded as a thermometer (cumulative one-hot) which preserves order while allowing the model to learn non-uniform gaps. The right choice depends on how much you believe the ordering reflects actual linearity on the outcome.
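A small sketch of thermometer (cumulative one-hot) encoding, using the education levels above; the helper function is illustrative, not a library API:

```python
import numpy as np

def thermometer_encode(values, ordered_levels):
    """Cumulative one-hot: level i becomes i ones followed by zeros.
    Order is preserved, but the model fits a separate gap between each
    adjacent pair of levels instead of assuming equal spacing."""
    index = {lvl: i for i, lvl in enumerate(ordered_levels)}
    out = np.zeros((len(values), len(ordered_levels) - 1), dtype=int)
    for row, v in enumerate(values):
        out[row, : index[v]] = 1
    return out

levels = ["none", "high school", "bachelor", "graduate"]
X = thermometer_encode(["none", "bachelor", "graduate"], levels)
print(X)
# [[0 0 0]
#  [1 1 0]
#  [1 1 1]]
```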

Encoding is feature engineering

The topic runs deeper than can fit in a single section; a later chapter (Feature Engineering & Selection) returns to it in full. The operating principle for regression is the same one that recurs throughout machine learning: the model can only learn what the features let it see, and encoding choices determine what the features let it see. A good encoding can make a simple model outperform a complicated one fed garbage.

16

Evaluation and cross-validation

Fitting a model is only half of the work; estimating how well it will perform on unseen data is the other half. The standard metrics for regression each embody a different loss philosophy, and cross-validation gives us a principled way to estimate out-of-sample error without sacrificing data.

RMSE, MAE, and their tradeoffs

The two most common regression metrics are root mean squared error (RMSE) — the square root of the mean squared residual — and mean absolute error (MAE) — the mean absolute residual. RMSE is consistent with OLS (the model that minimises MSE also minimises RMSE) and penalises large errors heavily, which is appropriate when large errors are disproportionately costly. MAE is robust to outliers and corresponds to median regression; it reads directly as the typical size of a miss in the target's units, which makes it more interpretable. If the errors are roughly Gaussian, RMSE ≈ 1.25 × MAE (the exact factor is √(π/2) ≈ 1.253); a much larger gap between the two is a signal of heavy-tailed residuals.
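The Gaussian RMSE-to-MAE ratio can be checked directly on simulated residuals; for a normal distribution the ratio converges to √(π/2) ≈ 1.253:

```python
import numpy as np

rng = np.random.default_rng(0)
residuals = rng.normal(0.0, 2.0, size=100_000)   # roughly Gaussian errors

rmse = np.sqrt(np.mean(residuals ** 2))
mae = np.mean(np.abs(residuals))

# RMSE/MAE near sqrt(pi/2) ~ 1.253 is consistent with Gaussian errors;
# a much larger ratio signals heavy tails
print(rmse / mae)
```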

R² and adjusted R²

The coefficient of determination, R² = 1 − SSR/SST, expresses the fraction of target variance explained by the model. On training data it ranges from 0 (model predicts the mean) to 1 (perfect fit); on test data it can be negative if the model does worse than predicting the mean. The key caution: training R² never decreases as predictors are added, so it cannot be used to compare models with different numbers of features. Adjusted R² applies a complexity penalty and can decrease when useless predictors are added, making it a better comparison tool. On test data either is fine; on training data, use adjusted R² or — better — use cross-validation.
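A small demonstration on synthetic data with deliberately useless predictors: training R² can only rise when columns are added, while adjusted R² pays the standard complexity penalty 1 − (1 − R²)(n − 1)/(n − p − 1):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 200
X_signal = rng.normal(size=(n, 2))
y = X_signal @ np.array([1.5, -2.0]) + rng.normal(0, 1.0, size=n)
X_noise = rng.normal(size=(n, 20))               # pure noise predictors

def r2_and_adjusted(X, y):
    r2 = LinearRegression().fit(X, y).score(X, y)  # training R^2
    n_obs, p = X.shape
    adj = 1 - (1 - r2) * (n_obs - 1) / (n_obs - p - 1)
    return r2, adj

r2_small, adj_small = r2_and_adjusted(X_signal, y)
r2_big, adj_big = r2_and_adjusted(np.hstack([X_signal, X_noise]), y)

# Adding noise columns never hurts training R^2; adjusted R^2 discounts them
print(r2_small, r2_big, adj_small, adj_big)
```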

MAPE and scale-free metrics

When the target spans orders of magnitude — prices, populations, volumes — an absolute error is misleading because a ten-dollar miss on a hundred-dollar prediction is different from a ten-dollar miss on a million-dollar prediction. MAPE (mean absolute percentage error) divides each absolute error by the corresponding target value, yielding a scale-free relative metric. It has well-known pathologies (it blows up when the target is near zero, and it penalises over-prediction more heavily than under-prediction, since the denominator is the actual value) and variants like sMAPE and wMAPE patch some of these. Modelling the log-transformed target and evaluating MAE on the log scale is often a cleaner choice.
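A short illustration of MAPE's scale-freeness and its near-zero pathology, on invented numbers:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# The same ten-dollar miss looks very different in relative terms
print(mape(np.array([100.0]), np.array([110.0])))          # 10% on a small target
print(mape(np.array([1e6]), np.array([1e6 + 10.0])))       # 0.001% on a large one

# Pathology: a near-zero target makes MAPE explode
print(mape(np.array([0.01]), np.array([1.0])))             # 9900%
```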

K-fold cross-validation

K-fold cross-validation partitions the data into K disjoint folds, fits the model on K-1 folds and evaluates on the held-out fold, and averages the K test errors. It is the standard tool for estimating generalisation error and for selecting hyperparameters. K=5 or K=10 is the customary choice; leave-one-out cross-validation (LOOCV) sets K = n and has a lower bias but higher variance. Stratified K-fold preserves the distribution of the target across folds (important with skewed or rare targets); grouped K-fold ensures that entire groups — patients, users, schools — stay in a single fold (preventing leakage); time-series K-fold uses a forward-chaining scheme so that training always precedes validation temporally.
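A minimal K-fold sketch with scikit-learn, using a synthetic dataset and ridge as the model under evaluation:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv,
                         scoring="neg_root_mean_squared_error")

# Five held-out RMSE estimates, one per fold; their mean is the CV estimate
# and their spread is a rough measure of its stability
print(-scores.mean(), scores.std())
```

StratifiedKFold, GroupKFold, and TimeSeriesSplit drop in as the `cv` argument for the stratified, grouped, and forward-chaining variants.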

The train–validation–test discipline

The classical split is three-way: the training set fits the model, the validation set tunes hyperparameters, and the test set produces the single, untouched estimate of generalisation. Once the test set is used, it is contaminated and must not be reused to drive further decisions. In a small-data setting, nested cross-validation achieves the same discipline automatically: an outer K-fold loop for honest error estimation, an inner K-fold loop for hyperparameter selection. The principle is old-fashioned and iron-clad: to estimate a model's performance on data it has never seen, it must actually be evaluated on data it has never seen.
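Nested cross-validation is a one-line composition in scikit-learn: a GridSearchCV (the inner loop) passed as the estimator to cross_val_score (the outer loop). A sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_regression(n_samples=150, n_features=10, noise=15.0, random_state=0)

inner = KFold(n_splits=5, shuffle=True, random_state=1)  # hyperparameter selection
outer = KFold(n_splits=5, shuffle=True, random_state=2)  # honest error estimation

search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]},
                      cv=inner, scoring="neg_root_mean_squared_error")
nested_scores = cross_val_score(search, X, y, cv=outer,
                                scoring="neg_root_mean_squared_error")

# Each outer fold retunes alpha from scratch on its own training portion,
# so the outer estimate is never contaminated by the tuning
print(-nested_scores.mean())
```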

The cross-validated estimate is the whole point

Every hyperparameter discussed in this chapter — the polynomial degree, the regularisation strength λ, the elastic-net mixing parameter α, the robustness threshold δ, the feature set — is selected by cross-validation. Reasoning about which model to pick, or which features to include, without having a cross-validation estimate in hand, is guessing. Reasoning about them with a cross-validation estimate in hand, using a sensible grid or a smart search, is doing machine learning.

17

Regression in the ML stack

Linear regression is sometimes treated as a relic — the thing students study before they get to the real models. It is not. Linear regression is the default baseline, the universal diagnostic, the final layer of most neural networks, and the analytical lens through which deeper models are interpreted. Understanding it well is not a step on the ladder; it is the ladder.

The universal baseline

Before deploying any complex model, the first question to ask is how a regularised linear regression performs on the same task. Often the answer is surprisingly close to the best model available; occasionally the answer is that the linear model is the best model, full stop. A complicated model that does not beat a careful linear baseline by a meaningful margin has no business being in production — it is paying interpretability and reliability and compute costs for no gain. The baseline is also the fastest feedback loop the practitioner has: it trains in seconds, its coefficients are readable, its failure modes are obvious.

The final layer of neural networks

A neural network with a regression head is, in its last layer, a linear regression on top of learned features. The hidden layers produce a vector of features from the raw input; the last layer takes a linear combination of those features to produce the prediction. The loss, the gradient, the notion of residual — the regression machinery of this chapter is unchanged. What is different is that the features are learned jointly with the regression rather than being hand-engineered. The consequence: every neural network practitioner is also, implicitly, a regression practitioner, and the intuitions about residuals, leverage, regularisation, and bias–variance carry over directly.

Linear probes and representation analysis

A staple diagnostic in deep learning is the linear probe: freeze a pretrained model, extract its intermediate activations on a dataset, and fit a linear regression (or classification) from those activations to a target of interest. If the linear probe performs well, the representation linearly encodes the target — the deeper layers would not add much. If the probe performs poorly, the representation requires non-linear readout. Linear probes are the cheapest, most interpretable tool for auditing what a large model has learned, and they are only possible because linear regression is fast, stable, and universally applicable.
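A schematic linear probe, with a random matrix standing in for the frozen activations; in practice these would be extracted from a pretrained network on your dataset, and the synthetic target here is constructed to be linearly decodable:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-in for frozen intermediate activations: n examples, d-dimensional
n, d = 500, 64
activations = rng.normal(size=(n, d))
target = activations[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=n)

# The probe itself is just a regularised linear regression on the activations
probe_r2 = cross_val_score(Ridge(alpha=1.0), activations, target,
                           cv=5, scoring="r2").mean()

# High cross-validated R^2 => the representation linearly encodes the target
print(probe_r2)
```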

Calibration and uncertainty

A linear regression produces not only a point estimate but — under Gaussian errors — a full predictive distribution: the prediction interval is ŷ ± t × SE where SE depends on leverage and the estimated residual variance. Extending this idea, Bayesian linear regression treats the coefficients themselves as random, yielding a posterior distribution over models and a posterior-predictive distribution for new predictions. The same calibration principles generalise to conformal prediction, a distribution-free approach that wraps any regression model in a prediction interval with guaranteed coverage. The tools for honest uncertainty in regression are mature; the same cannot yet be said of deep learning.
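A sketch of the classical prediction interval for simple OLS, computed from the design matrix on simulated data; the prediction SE includes both the leverage term x'(XᵀX)⁻¹x and the irreducible noise term:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, size=n)
y = 3.0 + 2.0 * x + rng.normal(0, 1.5, size=n)

X = np.column_stack([np.ones(n), x])       # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
s2 = resid @ resid / (n - 2)               # residual variance estimate
XtX_inv = np.linalg.inv(X.T @ X)

x_new = np.array([1.0, 5.0])               # predict at x = 5
y_hat = x_new @ beta
# leverage term (uncertainty in the fit) plus 1 (irreducible noise)
se_pred = np.sqrt(s2 * (1 + x_new @ XtX_inv @ x_new))
t = stats.t.ppf(0.975, df=n - 2)

lo, hi = y_hat - t * se_pred, y_hat + t * se_pred
print(lo, y_hat, hi)                       # 95% prediction interval at x = 5
```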

Bridge to classification

The next chapter moves from regression to classification — predicting a discrete label rather than a continuous value. The transition is smoother than it sounds: logistic regression is the GLM with a binomial family and a logit link, and it inherits everything developed here — the bias–variance tradeoff, the regularisation strategies, the cross-validation discipline, the importance of feature scaling, the caution with categorical encodings. The loss function changes (from squared error to cross-entropy), the output is a probability rather than a real number, and a few new evaluation metrics appear (accuracy, ROC curves, calibration). But the geometry of linear models, the mathematics of regularisation, and the discipline of held-out evaluation carry straight through.

Regression is the grammar of learning

The algorithms in the later chapters of this compendium — gradient boosting, support vector machines, neural networks, generative models — all contain some piece of the regression framework, whether in their loss functions, their regularisation, their evaluation, or their interpretation. A practitioner who understands linear regression deeply understands the shared substrate of machine learning. Everything after is variation and elaboration on the same theme.

Further reading

Where to go next

Regression is one of the most well-covered subjects in all of statistics, and the literature spans a wide range of depths. The list below picks the few works that repay re-reading: the two Hastie-Tibshirani textbooks that anchor modern regression teaching, the regularisation and sparsity follow-ups, the applied masterclasses by Harrell and by Gelman & Hill, the quantile and robust classics, and the software documentation that has become the de facto reference implementation for the methods in this chapter.

The anchor textbooks

The applied practitioner's texts

Robust and quantile regression

Software documentation

Papers and historical notes

This page opens Part IV: Classical Machine Learning. The next chapter — Supervised Learning: Classification — continues directly from this one. Logistic regression is the GLM of the binomial family and so inherits every piece of machinery developed here; the only substantive changes are the loss function, the output interpretation, and the evaluation metrics. Beyond that, Part IV will extend the regression framework in several directions: trees and ensembles that replace the linear predictor with piecewise constant pieces, kernel methods that lift the linear model into feature spaces, clustering and dimensionality reduction that are the unsupervised companions of supervised regression and classification. The regularisation, scaling, cross-validation, and bias–variance ideas of this chapter recur in every one.