Part IV · Classical Machine Learning · Chapter 02

Classification, where learning from data finally turns into decisions.

A classification model predicts a discrete label — spam or ham, malignant or benign, dog or cat or elephant — from a set of features. It is the other half of the supervised-learning picture introduced in Chapter 01: the mathematics shifts only slightly, but the consequences shift a great deal. A regression mistake of a few dollars is usually a few dollars; a classification mistake is almost always the wrong decision. The entire industry built on the back of machine learning — the spam filter that made email usable, the click predictor that funds the web, the credit model that decides who gets a loan, the medical classifier that decides who is recalled for a biopsy — runs on classification. This chapter introduces the canonical classical methods (logistic regression, k-nearest neighbours, naive Bayes, discriminant analysis, decision trees), the evaluation machinery that turns a black-box classifier into a trustworthy one (confusion matrices, ROC and PR curves, calibration, cost-sensitive thresholds), and the vocabulary (generative versus discriminative, margin versus likelihood) that every later chapter of this part will inherit. Chapter 05 of Part I touched on a piece of this (Bayes' theorem, hypothesis testing); Chapter 01 of this part built the regression scaffolding; here the two come together.

How to read this chapter

The first section motivates the classification problem and draws the two canonical distinctions — probabilistic versus geometric classifiers, and generative versus discriminative models — that organise the rest of the chapter. Section two introduces logistic regression, the binomial GLM that carries over directly from Chapter 01 and is the single most important classifier in classical machine learning. Section three extends it to multinomial (softmax) regression for more than two classes; section four covers estimation, the cross-entropy loss, and L1/L2 regularisation, inheriting most of the treatment from the regression chapter. Section five is naive Bayes, the simple generative model that punches far above its weight for text and high-dimensional sparse data. Section six is k-nearest neighbours, the non-parametric baseline that is both the simplest and the most stubbornly effective method ever invented. Sections seven and eight are the other two classical families: decision trees (with Gini/entropy splits, depth limits, and pruning, setting the stage for Chapter 03's ensembles) and linear and quadratic discriminant analysis, the Gaussian generative classifiers that complete the picture. Section nine develops decision boundaries as a unifying geometric lens across every classifier in the chapter. Sections ten and eleven are the practitioner's survival kit for real-world data: class imbalance (resampling, class weights, SMOTE) and thresholds and costs (choosing an operating point once the classifier is trained). Sections twelve through fourteen are evaluation, in three layers: the confusion-matrix metrics (precision, recall, F1), the threshold-free curves (ROC, PR, AUC), and calibration — the property that the predicted probabilities actually reflect empirical frequencies, which is where most deployed classifiers quietly fail. 
Section fifteen extends the chapter's machinery to multiclass and multi-label problems; section sixteen looks back at the generative-versus-discriminative divide in light of everything the chapter has introduced; section seventeen closes with where classification compounds in modern ML — as the final layer of almost every deep network, as the evaluation surface for representations, and as the statistical scaffolding under ranking, retrieval, and decision-making systems.

A note on what this chapter is not. It is not a treatment of support-vector machines or kernel methods (Chapter 07), ensemble methods (Chapter 03), or neural-network classifiers (Part V) — those are all classification methods, but each has enough machinery to deserve its own chapter. It is also not a deep treatment of model-evaluation theory, which Chapter 09 will cover from the model-selection side. Classification is a topic where the conceptual basics are unusually durable: logistic regression, cross-entropy, the confusion matrix, ROC curves, and calibration have changed very little in forty years and are exactly the vocabulary used inside modern deep classifiers. Every section of the chapter aims to leave the reader with an intuition that will still be useful in the last chapter of the compendium.

Contents

  1. Why classification is its own problem · Discrete targets, decisions, and distinctions
  2. Logistic regression · The binomial GLM and the workhorse classifier
  3. Multinomial and softmax regression · More than two classes
  4. Estimation, cross-entropy, and regularisation · MLE, convex loss, L1/L2
  5. Naive Bayes · Generative, conditionally-independent, surprisingly strong
  6. k-nearest neighbours · The non-parametric baseline that refuses to die
  7. Decision trees · Axis-aligned splits, CART, pruning
  8. Linear and quadratic discriminant analysis · Gaussian class-conditionals, closed-form boundaries
  9. Decision boundaries · Geometry as a unifying lens
  10. Class imbalance · Resampling, class weights, SMOTE
  11. Thresholds and asymmetric costs · Choosing an operating point
  12. Evaluation metrics · Confusion matrix, precision, recall, F1
  13. ROC and PR curves · Threshold-free evaluation and AUC
  14. Calibration · When the probabilities actually mean something
  15. Multiclass and multi-label · One-vs-rest, classifier chains, ordinal targets
  16. Generative vs discriminative · Two philosophies, one problem
  17. Where it compounds in ML · Softmax heads, probes, retrieval, decisions
01

Why classification is its own problem

Classification and regression are close relatives — the same features, the same fitting machinery, often the same loss functions in disguise — but the discreteness of the target changes almost everything about how the problem is framed, evaluated, and deployed. The chapter opens by drawing the two cuts that organise everything that follows: probabilistic versus geometric classifiers, and generative versus discriminative models.

The classification problem

A classifier is a function that takes a feature vector x and returns a discrete label y from a finite set of classes — two classes (binary classification: spam/ham, malignant/benign, click/no-click), several classes (multiclass: digit recognition, topic tagging, dog breeds), or a structured set (multi-label: all the tags that apply to one image). The classifier is trained on a labelled dataset {(x_i, y_i)} of examples and then applied to new inputs. The mathematical frame is identical to regression's; the difference is only that the target is categorical. Almost every consequence — the loss function, the evaluation metrics, the decision procedure, the uncertainty representation — follows from that one change.

Probabilistic vs geometric classifiers

The classifiers of this chapter split cleanly into two philosophies. Probabilistic classifiers estimate P(y = k | x) — the conditional probability of each class given the features — and make decisions by taking the class with the highest probability (or by thresholding one probability against a cost-weighted cutoff). Logistic regression, naive Bayes, LDA/QDA, and the softmax output of a neural network all fall into this camp. Geometric or margin-based classifiers do not produce probabilities; they produce a decision rule, typically by learning a boundary in feature space that separates the classes. k-nearest neighbours, support vector machines (Chapter 07), and untuned decision trees are classical examples. The two philosophies answer different questions — how confident is the model versus which side of the line is this point on — and choosing between them depends on whether downstream use requires a probability or only a decision.

Generative vs discriminative

A related but distinct cut is between generative and discriminative models. A generative classifier models the joint distribution P(x, y) = P(x | y) P(y) — it learns what each class looks like — and then derives P(y | x) via Bayes' rule. Naive Bayes and LDA are generative. A discriminative classifier models P(y | x) directly, without any attempt to model the features themselves; logistic regression is the canonical example. The distinction is not cosmetic: generative models can generate synthetic data, handle missing features, and often need less labelled data; discriminative models usually achieve better classification accuracy per example when labels are plentiful, because they spend their capacity on the conditional distribution rather than on modelling the features. Section sixteen revisits the tradeoff in the light of everything the chapter builds.

What changes from regression

The handoff from the previous chapter is closer than it first appears. The regression loss (squared error) is replaced by a loss that respects the discreteness of the target (cross-entropy); the residual analysis is replaced by the confusion matrix; the prediction interval is replaced by the calibrated probability; and the threshold that turns a regression prediction into an action is replaced by a decision threshold chosen under an asymmetric cost. Almost every other piece — regularisation, feature scaling, cross-validation, the generalisation vs memorisation tension — carries over unchanged. The chapter is structured to make that inheritance visible: each classical method's section ends with a pointer to the regression analogue it extends or replaces.

Classification is where learning becomes decision

The reason classification deserves its own chapter — despite sharing so much machinery with regression — is that its outputs are directly actionable. A regression prediction is a number; a classification prediction is usually an action (approve, decline, flag, route, recommend). Every section of the chapter is ultimately about how to produce reliable actions from imperfect data. That is why evaluation takes three full sections, and why calibration gets a section of its own.

02

Logistic regression

If the last chapter had one protagonist (OLS), this one does too. Logistic regression is the binomial GLM — a direct descendant of the generalised linear model framework introduced in Section 13 of Chapter 01 — and it is the classifier that every more complicated method gets compared against.

The model

Logistic regression models the log-odds of the positive class as a linear function of the features. With η = x⊤β the linear predictor and σ(z) = 1 / (1 + e^{-z}) the logistic (sigmoid) function, the model is

$$P(y = 1 \mid x) = \sigma(x^\top \beta) = \frac{1}{1 + e^{-x^\top \beta}}.$$

The linear predictor can take any real value; the sigmoid squashes it into (0, 1); the boundary where P = 0.5 corresponds to η = 0 — a linear hyperplane in feature space. Every logistic regression, no matter how many features, has a linear decision boundary. The classifier is intrinsically linear; the non-linear part is only the squashing.
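The squashing and the boundary can be seen in a few lines of numpy; the coefficient vector here is invented purely for illustration:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for illustration.
beta = np.array([1.5, -2.0])

# Any x with x . beta = 0 sits exactly on the decision boundary, where P = 0.5.
x_boundary = np.array([2.0, 1.5])     # 2.0 * 1.5 + 1.5 * (-2.0) = 0
print(sigmoid(x_boundary @ beta))     # 0.5
print(sigmoid(10.0), sigmoid(-10.0))  # near 1 and near 0: eta is unbounded, P is not
```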

Coefficients and log-odds

The interpretation of a logistic regression coefficient is one of the cleanest in statistics: a coefficient β_j is the change in the log-odds of the positive class for a one-unit change in feature j. The exponentiated coefficient e^{β_j} is the odds ratio — a coefficient of 0.693 doubles the odds; a coefficient of -0.693 halves them. This interpretation makes logistic regression a staple in epidemiology, credit scoring, and political science, where coefficients need to be explained to non-technical audiences. The interpretability is not shared by most later classifiers; a deep network's weights do not map directly to any effect on log-odds.
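The log-odds arithmetic is worth doing once by hand:

```python
import numpy as np

beta_j = np.log(2.0)         # a coefficient of ~0.693 ...
odds_ratio = np.exp(beta_j)  # ... multiplies the odds by exactly 2
print(odds_ratio)            # 2.0

# Converting odds back to a probability: odds of 2:1 means P = 2 / (1 + 2).
odds = 2.0
print(odds / (1 + odds))     # ~0.667
```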

Why a logistic model, not a linear one

It is tempting simply to fit y = x⊤β + ε with OLS on a 0/1 target. This is the linear probability model, and it is still used in some fields (particularly economics) because coefficients are easy to interpret as percentage-point effects. There are three problems: predictions can lie outside [0, 1], where they have no probability interpretation; the residuals are heteroscedastic by construction (the variance depends on the mean); and OLS is not the efficient estimator. Logistic regression fixes all three at once, at the cost of a non-closed-form estimator — a cost that IRLS (Section 04) makes almost invisible.

The decision rule

The default decision is predict class 1 if P(y = 1 | x) > 0.5, equivalently if η > 0. But the choice of threshold is not fixed by the math; it is a business decision that depends on the costs of the two kinds of errors, which can be enormously asymmetric (a missed cancer diagnosis is not the same as a false positive on a mammogram). Section eleven returns to the threshold explicitly. For now, note that logistic regression produces a calibrated probability under its own model assumptions — a number that can be thresholded, combined with costs, or fed into a downstream decision system — and that this calibration-for-free is one of the reasons it stays in production even when black-box classifiers beat it on raw accuracy.
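A minimal sketch of moving the operating point, using scikit-learn on synthetic data; the 0.2 cutoff is an arbitrary illustration of a recall-favouring threshold, not a recommendation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem; imagine false negatives are much costlier
# than false positives, so we lower the threshold below 0.5.
X, y = make_classification(n_samples=500, random_state=0)
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)[:, 1]    # P(y = 1 | x)
default = (proba > 0.5).astype(int)   # the default rule: eta > 0
cautious = (proba > 0.2).astype(int)  # recall-favouring operating point

# Lowering the threshold can only flag more (never fewer) positives.
print(default.sum(), cautious.sum())
```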

The one-line code summary

In scikit-learn: LogisticRegression(penalty='l2', C=1.0).fit(X, y). Everything else in the chapter is variations, diagnostics, and choices around what is in this one line.

03

Multinomial and softmax regression

The binary classifier of the previous section generalises cleanly to more than two classes. The resulting model — multinomial logistic regression, or equivalently softmax regression — is the GLM analogue for the multinomial family, and is the exact final layer of almost every modern deep classifier.

Softmax and the multinomial likelihood

With K classes, each class has its own coefficient vector β_k, and the probability of class k is given by the softmax function:

$$P(y = k \mid x) = \frac{e^{x^\top \beta_k}}{\sum_{j=1}^{K} e^{x^\top \beta_j}}.$$

The probabilities are positive and sum to one. The logistic model of the previous section is the special case K = 2 with one of the β_k set to zero as a reference class. The decision rule is to predict the class with the largest linear score — equivalently, the largest probability.
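The softmax and its reduction to the sigmoid can be checked directly; the scores below are invented for illustration:

```python
import numpy as np

def softmax(scores):
    """Softmax over class scores; shift by the max for numerical stability."""
    z = scores - np.max(scores)
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, -1.0])  # x . beta_k for K = 3 classes (illustrative)
p = softmax(scores)
print(p.sum())     # ~1.0: a valid distribution
print(p.argmax())  # 0: predict the class with the largest score

# K = 2 with the reference class pinned at zero reduces to the sigmoid:
eta = 1.3
two_class = softmax(np.array([eta, 0.0]))
print(np.isclose(two_class[0], 1 / (1 + np.exp(-eta))))  # True
```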

One-vs-rest, one-vs-one, and native multinomial

Three approaches to multiclass classification coexist in practice. One-vs-rest (OvR) trains K binary classifiers, each of which distinguishes one class from all the others; prediction is the argmax of their scores. OvR is the default in scikit-learn for most binary-native models. One-vs-one (OvO) trains K(K-1)/2 pairwise classifiers and uses majority voting; it scales worse but can work better when classes are highly imbalanced. Native multinomial fits the softmax model above in a single joint optimisation; it is the standard for logistic regression, softmax-head neural networks, and modern gradient-boosted tree libraries. All three converge to similar accuracy on balanced, separable problems; native multinomial tends to be better calibrated because the probabilities are properly normalised by construction.
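The three strategies can be compared side by side with scikit-learn's meta-estimators; iris is an arbitrary small multiclass dataset for the sketch:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # 3 classes, 4 features

native = LogisticRegression(max_iter=1000).fit(X, y)                # joint softmax fit
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)  # K binary fits
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)   # K(K-1)/2 pairwise fits

# On an easy, balanced problem all three land close together.
print(native.score(X, y), ovr.score(X, y), ovo.score(X, y))
```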

The identifiability of softmax coefficients

An oddity of softmax regression is that the coefficients β_1, …, β_K are only identified up to a common offset: adding the same vector to every β_k leaves the probabilities unchanged. The two standard fixes are to pin one class as the reference (β_K = 0), which makes the other coefficients interpretable as log-odds ratios against the reference, or to add an L2 penalty that breaks the tie by shrinking all coefficients towards zero. Both choices give identical predictions; only the interpretation of the coefficients differs.
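The shift-invariance takes two lines to verify; the scores and offset are arbitrary:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

scores = np.array([0.5, -1.2, 2.0])  # arbitrary class scores
shift = 7.0                          # the same offset added to every class

# Identical probabilities: the common offset cancels in the softmax ratio.
print(np.allclose(softmax(scores), softmax(scores + shift)))  # True
```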

Deep learning's debt to softmax

Every deep classifier in production — image classifiers, language models, speech recognisers, retrieval rankers — has a softmax on the top. The features fed into the softmax are learned by the deeper layers, but the final step is exactly the softmax regression of this section. The cross-entropy loss used to train those networks is the negative log-likelihood of the multinomial. This is not a coincidence or a historical accident; it is the shortest path from a linear predictor to a well-calibrated probability distribution over a discrete alphabet, and nothing better has been found.

04

Estimation, cross-entropy, and regularisation

Fitting a logistic or softmax model is maximum-likelihood estimation under the binomial or multinomial distribution. The loss is cross-entropy, the optimisation is convex, and the regularisation machinery of the regression chapter carries over with almost no change.

Cross-entropy as negative log-likelihood

For a binary problem, the log-likelihood of the training data is

$$\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right], \quad p_i = \sigma(x_i^\top \beta).$$

Maximising the log-likelihood is equivalent to minimising its negative, which is called the cross-entropy loss (or log loss). For the multinomial case the sum over classes generalises: a single term -log p_{y_i} for each example. Cross-entropy has a clean information-theoretic interpretation as the expected number of bits needed to encode the true label under the model's predicted distribution — large penalty for confident-and-wrong, small penalty for confident-and-right.
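The loss is a few lines of numpy; the labels and probabilities here are invented to show the confident-and-wrong penalty:

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Negative mean log-likelihood of 0/1 labels under predicted probabilities."""
    p = np.clip(p, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])
confident_right = np.array([0.95, 0.05, 0.90, 0.85])
confident_wrong = np.array([0.05, 0.95, 0.10, 0.15])

print(binary_cross_entropy(y, confident_right))  # small loss
print(binary_cross_entropy(y, confident_wrong))  # much larger loss
```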

Convexity and the IRLS solution

The cross-entropy loss is convex in β — every local minimum is a global minimum, and gradient-based optimisation is guaranteed to converge to it. There is no closed form, but the iteratively reweighted least squares (IRLS) algorithm introduced in Chapter 01 applies: at each iteration, form a weighted-least-squares problem whose solution is the next iterate, and repeat until convergence. IRLS is equivalent to Newton's method on the cross-entropy loss, converges quadratically near the optimum, and is the default solver for small and medium logistic regressions. For very large n or very large p, first-order methods (stochastic gradient descent and variants such as Adam) or quasi-Newton methods (L-BFGS) take over.
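IRLS fits in a short function; this is a minimal sketch (no intercept, no convergence check, a tiny ridge term to keep the Hessian invertible), tested on synthetic data with a known coefficient vector:

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, ridge=1e-8):
    """Newton / IRLS for logistic regression.

    Each step is the Newton update beta += (X^T W X)^{-1} X^T (y - p),
    with W = diag(p(1 - p)) — a weighted least-squares solve per iteration.
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        prob = 1 / (1 + np.exp(-(X @ beta)))           # current predictions
        W = prob * (1 - prob)                          # per-example Newton weights
        H = X.T @ (W[:, None] * X) + ridge * np.eye(p) # Hessian of the loss
        g = X.T @ (y - prob)                           # gradient of the log-likelihood
        beta = beta + np.linalg.solve(H, g)
    return beta

# Synthetic check: data generated from a known beta should be recovered
# up to sampling noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
true_beta = np.array([2.0, -1.0])
y = (rng.random(400) < 1 / (1 + np.exp(-X @ true_beta))).astype(float)

beta_hat = irls_logistic(X, y)
print(beta_hat)  # close to [2, -1]
```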

L2 and L1 regularisation

The ridge, lasso, and elastic-net penalties of Chapter 01 apply unchanged: add λ‖β‖²₂ for L2 (shrinks all coefficients towards zero, stabilises the optimisation, is the default in scikit-learn's LogisticRegression), λ‖β‖₁ for L1 (produces sparse coefficients, performs feature selection), or a convex combination for elastic net. The intuition for why regularisation matters is different from regression though: without regularisation, logistic regression on linearly separable data has β → ∞ — the log-likelihood can always be increased by making the coefficients larger. Regularisation is not optional for logistic regression; it is what makes the optimisation well-posed on clean separable classes.

The inverse regularisation parameter C

A cosmetic but persistent source of confusion: scikit-learn parameterises the penalty strength as C, the inverse of λ. Large C means weak regularisation; small C means strong regularisation. This convention came from the SVM literature; it is historically unfortunate. When porting code between libraries (glmnet, statsmodels, sklearn), the direction of the parameter is one of the most common sources of silent bugs.
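The direction of C is easy to check empirically (the dataset and the two C values are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

strong = LogisticRegression(C=0.01).fit(X, y)   # small C = strong penalty
weak = LogisticRegression(C=100.0).fit(X, y)    # large C = weak penalty

# The strongly penalised model has the smaller coefficient norm.
print(np.linalg.norm(strong.coef_), np.linalg.norm(weak.coef_))
```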

Why cross-entropy, not squared error, on probabilities

A natural-seeming alternative is to train on squared error between the predicted probability and the 0/1 label (the Brier score). It works, but it is worse, for two reasons: cross-entropy is the negative log-likelihood of the binomial, so minimising it is exact maximum likelihood with the right asymptotic efficiency; and squared error through the sigmoid has vanishing gradients when predictions are confidently wrong, while cross-entropy keeps pushing confidently-wrong predictions towards the right answer. Cross-entropy is the loss every modern classifier — classical or deep — is trained on, and it is cross-entropy precisely because logistic regression derived it from the binomial likelihood fifty years ago.

05

Naive Bayes

Naive Bayes is the archetypal generative classifier — model each class's feature distribution, then apply Bayes' rule. The "naive" in the name is the pretence that features are conditionally independent given the class. The pretence is almost always false, and the classifier almost always works well anyway.

The model

Bayes' rule expresses the posterior class probability as

$$P(y = k \mid x) \propto P(y = k) \prod_{j=1}^{p} P(x_j \mid y = k),$$

where the key simplifying assumption is that each feature's conditional distribution depends only on the class, not on the other features. The class prior P(y = k) is usually estimated as the empirical class frequency; the likelihoods P(x_j | y = k) are estimated separately for each (feature, class) pair. The prediction is the class with the largest posterior. Every piece of the calculation is a one-dimensional density estimate; there are no simultaneous parameters to juggle.

The three flavours

Three standard flavours handle different feature types. Gaussian naive Bayes assumes each continuous feature is normally distributed within each class; it is the default for generic tabular data and has a closed-form estimator (per-class means and variances). Multinomial naive Bayes is designed for count features — the classical spam filter uses it on word-frequency vectors, with each word's per-class probability estimated from its in-class frequency. Bernoulli naive Bayes uses binary features, the word-occurred-or-didn't version of the spam filter. All three are one-line models in scikit-learn (GaussianNB, MultinomialNB, BernoulliNB) and all three train in a single pass over the data.
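The multinomial flavour on word counts is the classical spam filter in miniature; the corpus below is invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy spam/ham corpus (invented examples, not real training data).
texts = ["win money now", "cheap meds win prize", "meeting at noon",
         "lunch tomorrow?", "win win win money", "project status meeting"]
labels = ["spam", "spam", "ham", "ham", "spam", "ham"]

# Word counts -> multinomial NB with Laplace smoothing (alpha=1.0).
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(texts, labels)

print(model.predict(["free money prize"]))       # spam-flavoured words
print(model.predict(["project status meeting"])) # ham-flavoured words
```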

Why the independence assumption is OK

The conditional-independence assumption is almost always violated — words in a spam email are correlated, image pixels are spatially dependent, medical features co-vary in well-understood patterns. Yet naive Bayes frequently wins competitions on small, high-dimensional problems. Domingos and Pazzani (1997) analysed why: naive Bayes only needs the ordering of the class posteriors to be correct, not the posteriors themselves. The independence assumption corrupts the probability estimates but often leaves the argmax intact, so classification accuracy holds up while probability calibration suffers. This is why naive Bayes is worth reaching for when class decisions matter and the probabilities are secondary.

Laplace smoothing and sparse features

A recurring practical wrinkle is the zero-probability problem: if a word never appeared in the training spam, its likelihood under the spam class is zero, and so the posterior for any email containing that word is zero, no matter what the other words say. The fix is Laplace smoothing: add a small count (usually 1) to every (word, class) cell before normalising. Smoothing is a crude form of Bayesian inference — equivalent to placing a Dirichlet prior on the multinomial — and is why naive Bayes never outputs a hard zero. Scikit-learn exposes the strength as an alpha parameter; the default alpha = 1.0 is usually fine.
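The smoothing formula itself is one line; the toy counts below show the hard zero disappearing:

```python
import numpy as np

counts = np.array([5, 3, 0, 2])  # occurrences of each word in one class (toy numbers)
alpha = 1.0                      # Laplace smoothing strength

unsmoothed = counts / counts.sum()
smoothed = (counts + alpha) / (counts.sum() + alpha * len(counts))

print(unsmoothed[2])   # 0.0 — the hard zero that vetoes the whole posterior
print(smoothed[2])     # small but nonzero
print(smoothed.sum())  # still a valid distribution
```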

The two-decade spam filter

One of the first large-scale deployed machine-learning systems — the wave of Bayesian spam filters that followed Paul Graham's 2002 essay A Plan for Spam — was naive Bayes on email word counts. It cut the volume of spam reaching inboxes by orders of magnitude and kept email usable through the 2000s. Even today, on small text-classification problems with scarce labelled data, naive Bayes is often the first model to reach for — training is instant, deployment is trivial, and performance is rarely far from the ceiling.

06

k-nearest neighbours

kNN is the simplest classifier that can be described in a single sentence: predict the majority class among the k training examples closest to the query point. It is also one of the hardest to beat when the data is clean and moderately-dimensional, and it is the conceptual ancestor of every retrieval-based method in modern ML.

The method

Choose a distance metric (Euclidean is the default; Manhattan, cosine, and Mahalanobis are common alternatives). To classify a new point x: compute its distance to every training example; take the k nearest; predict the majority class among them. For a probability estimate, return the fraction of neighbours in each class. The classifier has essentially one hyperparameter (k) and no training — all the work happens at query time, which is why kNN is called a lazy learner.
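The whole method fits in a dozen lines; a minimal from-scratch sketch (Euclidean distance, no tie handling), tested on two invented, well-separated clusters:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    """Classify one query point by majority vote among its k nearest
    training examples under Euclidean distance."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # distance to every example
    nearest = np.argsort(dists)[:k]                    # indices of the k closest
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

print(knn_predict(X, y, np.array([0.1, 0.2])))  # 0
print(knn_predict(X, y, np.array([2.9, 3.1])))  # 1
```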

Bias and variance with k

k = 1 produces a classifier with zero training error and very high variance — every training point is classified correctly, but the decision surface is jagged and fragile. Increasing k smooths the boundary, trading variance for bias: at k = n the classifier predicts the overall majority class and the boundary is gone entirely. The optimal k depends on the data; k = 5 or k = 7 is a typical starting point for moderately-sized datasets, chosen by cross-validation. The classical result of Cover & Hart (1967) shows that the asymptotic 1-NN error is at most twice the Bayes error; letting k grow with n (k → ∞ with k/n → 0) drives the kNN error all the way down to the Bayes error — asymptotically, kNN is as good as any classifier can be.
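The sweep over k is a direct cross-validation loop; a sketch on scikit-learn's built-in breast-cancer data (the particular k values are arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaling matters for kNN: distances are meaningless across unscaled features.
scores = {}
for k in (1, 5, 25):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores[k] = cross_val_score(model, X, y, cv=5).mean()
    print(k, round(scores[k], 3))
```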

The curse of dimensionality

kNN fails in high dimensions in a specific way. As the number of features grows, the distances between all pairs of points concentrate — every point is roughly the same distance from every other — and the notion of "nearest" loses discriminative power. A kNN on raw pixel intensities of natural images is disastrous; a kNN on learned embeddings of those same images is excellent. This is the story of modern retrieval: the expensive part is computing a representation in which Euclidean distance is meaningful, and then kNN over that representation is competitive with much more complex methods.
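Distance concentration is easy to demonstrate with random points in the unit hypercube; the dimensions below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = []
for p in (2, 100, 10_000):
    X = rng.random((200, p))                  # uniform points in the unit hypercube
    d = np.linalg.norm(X[0] - X[1:], axis=1)  # distances from one point to all others
    ratios.append(d.std() / d.mean())         # relative spread of the distances

# The spread shrinks as p grows: "nearest" becomes barely nearer than "farthest".
print([round(float(r), 3) for r in ratios])
```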

Scaling kNN to large data

Naively, classification takes O(np) per query — do this for every query and the cost balloons fast. The standard accelerations are KD-trees (for p < 20 or so), ball trees (for larger p, sensitive to the metric), and approximate nearest neighbour (ANN) algorithms like HNSW, Annoy, FAISS, and ScaNN, which sacrifice a small amount of accuracy for enormous query-time speedups. ANN is now the standard substrate for vector databases and for every retrieval-augmented-generation (RAG) system; the classifier is kNN in spirit, even when the operative word is "retrieval".

Weighted voting and distance kernels

Vanilla kNN gives every neighbour an equal vote. Distance-weighted kNN weights each neighbour's vote by 1/d or exp(-d²/2σ²), so nearer neighbours count more. This is often a small improvement, and it extends naturally to kernel density classification (use a kernel to smooth the neighbour contributions), which is the non-parametric limit of the method. Scikit-learn's KNeighborsClassifier(weights='distance') is the one-line version.

The simplest classifier is one of the strongest

There is a recurring lesson in this chapter and the next: the simplest and the most complex classifiers tend to tie, and what loses is the middle. kNN on a good embedding regularly ties a deep network on the same task, at a tiny fraction of the training cost and with essentially no tunable parameters. Modern retrieval is kNN at industrial scale; modern few-shot learning is kNN in a learned metric. The method that was old in 1967 remains the default baseline, and understanding its strengths and failure modes is permanent intellectual equipment.

07

Decision trees

A decision tree partitions feature space into axis-aligned rectangles, making a greedy sequence of splits that carve the data into purer and purer regions. Alone, a single tree is high-variance and rarely the production classifier. In ensembles (Chapter 03), trees become the most-used family of models in classical machine learning.

How a tree makes a decision

Start at the root. Each internal node holds a test: "is feature j greater than threshold t?". The data splits left or right on the answer. Descend until reaching a leaf, which holds the predicted class (for a classifier) or the predicted value (for a regressor, if the tree is used for regression). The entire decision path is a conjunction of simple rules — "age > 40 AND income < 50k AND prior-default = yes" — which makes trees exceptionally easy to explain to non-technical stakeholders. A radiologist or a loan officer can audit a tree's decision the way they audit a decision flowchart.

The splitting criterion

The tree is grown greedily, top-down. At each node, consider every feature and every possible threshold, and choose the split that maximises some measure of impurity reduction. Two standard measures: Gini impurity, 1 − Σ p_k², and entropy, −Σ p_k log p_k. Both reward splits that produce children with a concentrated class distribution. Gini is slightly faster to compute and is the default in scikit-learn and most production libraries; entropy is more interpretable (information gain in bits) and is the default in C4.5. The choice rarely affects accuracy meaningfully.
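Both impurity measures are one-liners; a numpy sketch evaluating them on a maximally impure and a pure node:

```python
import numpy as np

def gini(p):
    """Gini impurity of a class-probability vector: 1 - sum p_k^2."""
    p = np.asarray(p)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy in bits: -sum p_k log2 p_k (terms with p_k = 0 contribute nothing)."""
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5 and 1.0: maximally impure
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))  # both vanish at a pure node
```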

Pruning and regularisation

An unpruned tree grown to full depth will achieve zero training error by producing a leaf for every unique training point — and, predictably, it will generalise very poorly. The classical remedy is cost-complexity pruning (also called weakest-link pruning), which fits a full tree and then prunes back internal nodes using a penalty on the number of leaves, with the penalty strength chosen by cross-validation. Modern implementations use a simpler and almost equally effective family of regularisations: maximum tree depth, minimum samples per leaf, minimum impurity decrease for a split. Each of these is a single hyperparameter to tune.
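Cost-complexity pruning is exposed in scikit-learn as `ccp_alpha`; a sketch on the built-in breast-cancer data (the alpha value is an arbitrary illustration, not a tuned choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_tr, y_tr)

# The unpruned tree memorises the training set; pruning shrinks it.
print(full.score(X_tr, y_tr))                      # at or near 1.0
print(full.get_n_leaves(), pruned.get_n_leaves())  # pruned tree has fewer leaves
```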

Handling categoricals, missing values, and scale invariance

Trees have three durable conveniences that regression-family models lack. First, they handle categorical features natively — every split is a question about a single feature, with no need to encode. Second, they handle missing values without explicit imputation — the standard approach is "surrogate splits", where each internal node records a backup rule for when the primary feature is missing. Third, they are scale-invariant: multiplying a feature by 1000 has no effect on any split the tree can make. These three properties make trees the first-choice classifier on messy tabular data with minimal preprocessing.

Why a single tree is rarely enough

For all their interpretability, decision trees have a fatal flaw: they are high-variance. A small perturbation of the training data (removing one row, adding a new row) can produce a completely different tree, with a completely different first split. This is the reason the next chapter exists: averaging many slightly-randomised trees (random forests) or sequentially correcting their mistakes (gradient boosting) collapses the variance and produces what is, in 2025, still the most used family of classical ML models on structured data. Read this section as setup for Chapter 03, not as a recommendation to ship a single tree to production.

Interpretability as the last moat

The one reason to ship a single decision tree to production is when the business requires a human-readable rule set — credit decisions, medical triage, regulatory compliance. In those domains, a pruned tree or a small rule list is sometimes preferred even when a black-box model would be more accurate. The Rule Lists literature (Letham et al., Rudin's work on interpretable ML) is the ongoing project of pushing that accuracy gap towards zero.

08

Linear and quadratic discriminant analysis

Discriminant analysis is the oldest classification method in statistics, introduced by Ronald Fisher in 1936. It is a generative classifier like naive Bayes, but without the independence assumption: it models each class as a multivariate Gaussian and classifies by Bayes' rule. With shared covariance, it reduces to a linear classifier; with per-class covariance, to a quadratic one.

The Gaussian class-conditional

Assume each class is drawn from a multivariate Gaussian with its own mean and (possibly its own) covariance: x | y = k ~ N(μ_k, Σ_k). Combined with the class priors π_k = P(y = k), Bayes' rule gives the posterior class probability as proportional to π_k · N(x; μ_k, Σ_k). Taking logs and dropping terms that don't depend on the class produces a discriminant function for each class; the decision rule is to predict the class with the largest discriminant.

LDA: shared covariance gives a linear boundary

In linear discriminant analysis (LDA), all classes share a single covariance matrix Σ. The quadratic term in the Gaussian log-density cancels across classes, and the discriminant functions become linear in x. The decision boundary between any two classes is a hyperplane, identical in form to logistic regression's boundary but derived from a different estimation procedure: LDA estimates μ_k by class means, Σ by the pooled within-class covariance, and π_k by class frequencies — all closed-form, all one pass over the data. When the Gaussian assumption holds, LDA is the Bayes-optimal linear classifier.
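The closed-form recipe is short enough to sketch directly. A minimal numpy illustration on hypothetical two-class data — class means, pooled within-class covariance, class frequencies, and the linear discriminant δ_k(x) = xᵀΣ⁻¹μ_k − ½ μ_kᵀΣ⁻¹μ_k + log π_k:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy two-class data with a shared covariance, as LDA assumes
X0 = rng.normal(loc=[0, 0], scale=1.0, size=(200, 2))
X1 = rng.normal(loc=[3, 3], scale=1.0, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 100)

classes = np.unique(y)
mus = np.array([X[y == k].mean(axis=0) for k in classes])   # class means
priors = np.array([(y == k).mean() for k in classes])       # class frequencies
# Pooled within-class covariance from the centred data
centred = X - mus[y]
Sigma = centred.T @ centred / (len(X) - len(classes))
Sigma_inv = np.linalg.inv(Sigma)

def discriminants(x):
    """Linear discriminant delta_k(x) for each class k."""
    return np.array([
        x @ Sigma_inv @ mu - mu @ Sigma_inv @ mu / 2 + np.log(pi)
        for mu, pi in zip(mus, priors)
    ])

pred = np.array([discriminants(x).argmax() for x in X])
print(f"training accuracy: {(pred == y).mean():.2f}")
```

Everything is one pass over the data plus one matrix inversion — no iterative optimisation anywhere.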

QDA: per-class covariance gives a quadratic boundary

In quadratic discriminant analysis (QDA), each class has its own covariance Σ_k. The quadratic term no longer cancels, and the decision boundaries become conic sections — ellipses, parabolas, hyperbolas. QDA is strictly more flexible than LDA but requires estimating K · p(p+1)/2 parameters instead of p(p+1)/2, which means QDA needs substantially more data per class. On small datasets LDA usually wins; on large datasets with genuinely different within-class covariances, QDA can be substantially better.

Fisher's formulation and dimensionality reduction

Fisher's original 1936 formulation framed LDA as a projection: find the linear combination of features that maximises the ratio of between-class variance to within-class variance. This is a dimensionality-reduction technique that preserves class separability, and it survives in modern practice under the name Fisher LDA. Unlike PCA (Chapter 05), which is unsupervised and preserves total variance, Fisher LDA uses the labels and preserves discriminative variance. The projection to K − 1 dimensions is often an excellent visualisation for a multiclass problem.
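Concretely, scikit-learn's LinearDiscriminantAnalysis doubles as a supervised transformer — a sketch on the iris dataset (three classes, so at most K − 1 = 2 projection dimensions):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)                 # 3 classes, 4 features
lda = LinearDiscriminantAnalysis(n_components=2)  # at most K - 1 components
Z = lda.fit_transform(X, y)                       # supervised projection
print(Z.shape)                                    # (150, 2): ready to scatter-plot
```

Unlike a PCA projection of the same data, the two axes here are chosen to separate the labelled classes, not to preserve total variance.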

Relationship to logistic regression and naive Bayes

LDA produces a linear decision boundary, as does logistic regression, but estimates it differently: LDA's closed-form estimator assumes Gaussian class-conditionals; logistic regression's MLE makes no such assumption. When the Gaussian assumption holds, LDA is more efficient (lower variance of coefficient estimates); when it fails badly, logistic regression is more robust. Discriminant analysis with diagonal covariances is exactly Gaussian naive Bayes — QDA with per-class diagonal covariances in general, LDA when the diagonal covariance is also shared — so the discriminant-analysis family generalises naive Bayes in one direction (allowing feature correlations) and differs from logistic regression in another (specifying the feature distribution).

Why LDA stays in the curriculum

LDA is rarely the production choice today — logistic regression is usually at least as accurate and does not impose distributional assumptions — but it is a permanent part of the classification curriculum for two reasons. One, it makes the generative approach concrete: the model, the estimation, and the Bayes-rule classification are all transparent. Two, its Fisher dimensionality-reduction interpretation is the foundation of a whole line of supervised embedding methods that reappear throughout modern ML.

09

Decision boundaries as a unifying lens

Every classifier in the chapter can be understood through the geometry of its decision boundary — the surface in feature space where the predicted class changes. Putting the classifiers side by side through this single lens reveals which ones are structurally similar and which are genuinely different.

Linear boundaries

Logistic regression, LDA, and linear SVMs (Chapter 07) all produce linear decision boundaries — hyperplanes in feature space. Geometrically, they all solve the same problem ("find the best separating hyperplane") with different objectives: logistic regression maximises the conditional log-likelihood, LDA maximises the likelihood of its Gaussian generative model, linear SVM maximises the margin. On separable classes, all three produce broadly similar boundaries; on non-separable or noisy classes, they differ in which mistakes they tolerate. The linear family is a good default when the problem is known to be linearly separable or when interpretability matters.

Piecewise linear / axis-aligned boundaries

Decision trees produce piecewise-axis-aligned boundaries — the decision surface is a union of rectangles, each labelled with one class. kNN with k = 1 produces a Voronoi tessellation, a piecewise-linear boundary at equal distance between the closest points of each class. Both families can approximate arbitrary boundaries given enough depth or enough data, but both pay for that flexibility with high variance: small changes in the data produce large changes in the boundary.

Quadratic boundaries

QDA produces quadratic boundaries — conic sections in two dimensions, quadric surfaces in higher dimensions. A classifier with polynomial feature expansion of degree two followed by a linear model (logistic regression, linear SVM) also produces a quadratic boundary, and in fact is equivalent to a kernel method (Chapter 07) with a quadratic kernel. The equivalence is one of the most useful unifications in classical ML: non-linear boundaries can be produced by either lifting into a non-linear feature space and using a linear classifier, or by using a classifier that is intrinsically non-linear. Kernel methods make the first approach cheap.
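The lifting trick is easy to demonstrate. A sketch on scikit-learn's make_circles data (concentric rings, which no hyperplane in the raw features can separate): a degree-2 polynomial expansion followed by plain logistic regression learns the quadratic boundary.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=0)

# Linear boundary in the raw features: roughly chance on rings
linear = LogisticRegression().fit(X, y)
# Same linear classifier after a degree-2 feature lift: quadratic boundary
quadratic = make_pipeline(PolynomialFeatures(degree=2),
                          LogisticRegression()).fit(X, y)

print(f"linear boundary:    {linear.score(X, y):.2f}")
print(f"quadratic boundary: {quadratic.score(X, y):.2f}")
```

The lifted model succeeds because the expansion includes x₁² and x₂², so "distance from the origin" becomes a linear function of the new features.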

Smooth vs jagged

A cross-cutting distinction: some classifiers produce smooth boundaries (logistic regression, LDA, QDA, naive Bayes) while others produce jagged boundaries (decision trees, kNN). Smooth boundaries are typically preferred when the data is noisy or imbalanced — they extrapolate more sensibly to new data. Jagged boundaries can be exactly right when the underlying structure has sharp discontinuities — think of hardware faults that trigger above a specific temperature threshold, which a tree finds easily and a logistic regression approximates poorly.

The modern synthesis

Deep classifiers produce boundaries that are smooth in some regions, jagged in others, and almost arbitrary overall — the price of flexibility. A recurring finding in the interpretability literature is that the boundary learned by a deep classifier is often close to one a classical method could have learned, once the features have been transformed into the right representation. This is the thread that connects the classical methods of this chapter to the deep methods of Part V: the role of the classifier family is to choose the shape of the boundary; the role of the features is to make that shape sufficient.

10

Class imbalance

Real-world classification targets are almost never balanced. Fraud is 0.1% of transactions, defaults are 3% of loans, cancer is a fraction of a percent of mammograms. A classifier trained naively on such data learns the majority class and stops; the interesting events become invisible. Class imbalance is where good classification gets hard.

What goes wrong with naive training

A 99%-accurate classifier on a 1%-positive problem may be predicting "negative" for every example — and nobody will notice if they only look at accuracy. The loss function sees thousands of easy negatives and a handful of positives; the gradient is dominated by the easy cases; the model converges to the trivial solution. Every step of the pipeline — the loss, the metric, the evaluation, the threshold — has to be reconsidered in the imbalanced regime.

Resampling

The first family of fixes is to rebalance the training data. Random undersampling throws away negative examples until classes are balanced — fast, but discards information. Random oversampling duplicates positive examples — preserves information, but can cause overfitting on the duplicates. SMOTE (Chawla et al., 2002) interpolates new synthetic positives between existing ones along feature-space line segments, a compromise that usually outperforms simple oversampling. The imbalanced-learn Python library implements all the major variants with a scikit-learn compatible interface.
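A minimal numpy sketch of SMOTE's core interpolation step (a simplified illustration of the idea, not a replacement for imbalanced-learn's implementation, which adds variants and edge-case handling):

```python
import numpy as np

def smote_like(X_min, n_new, k=5, rng=None):
    """Synthesise minority points on segments between minority
    examples and their k nearest minority neighbours."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Pairwise distances within the minority class only
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]   # k nearest per point
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))            # random minority point
        j = rng.choice(neighbours[i])           # one of its neighbours
        lam = rng.random()                      # position along the segment
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new)

X_min = np.random.default_rng(1).normal(size=(20, 2))
synthetic = smote_like(X_min, n_new=50)
print(synthetic.shape)  # (50, 2)
```

Because each synthetic point is a convex combination of two real minority points, the new examples stay inside the minority class's region of feature space rather than being exact duplicates.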

Class weights

The alternative to resampling is to reweight the loss: each example contributes to the loss with a weight proportional to the inverse frequency of its class, so that the rare class counts as much in aggregate as the common one. Scikit-learn exposes this as class_weight='balanced'; almost every modern classifier library has the equivalent. Class weighting is equivalent (in expectation) to oversampling with appropriate duplicates, and is preferred in practice because it leaves the dataset unchanged and plays well with cross-validation.
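The 'balanced' heuristic is just inverse-frequency weighting; a sketch of the arithmetic behind it (w_k = n / (K · n_k), the formula scikit-learn uses):

```python
import numpy as np

y = np.array([0] * 990 + [1] * 10)          # 1% positives

# class_weight='balanced': w_k = n / (K * n_k)
n, classes = len(y), np.unique(y)
counts = np.array([(y == k).sum() for k in classes])
weights = n / (len(classes) * counts)
print(dict(zip(classes.tolist(), weights)))  # roughly {0: 0.505, 1: 50.0}

# Each class now contributes equally to the weighted loss in aggregate
print(weights * counts)                      # both sides ~500
```

The second print is the point: after weighting, the 10 positives carry the same total loss mass as the 990 negatives.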

Anomaly detection as a limit

When the positive class is extremely rare — well under 1% — supervised classification starts to blur into anomaly detection. If there are not enough positive examples to learn their feature distribution reliably, treat the problem as modelling the negative class and flagging points that don't fit. One-class SVMs, isolation forests, and reconstruction-error-based autoencoders (Chapter 05) are the canonical tools. A later chapter (Part VIII, Time Series & Cross-Cutting Methods) covers anomaly detection in depth.

Don't touch the test set

An iron rule: resampling is a training-set operation only. The test set must reflect the production distribution, including its imbalance, because the metrics measured on it are meant to predict production behaviour. Oversampling or synthesising the test set produces metrics that look better but are not informative. A correctly constructed pipeline (imbalanced-learn's Pipeline, which applies resampling only to the training folds inside cross-validation) avoids the mistake by design; manual resampling before a train-test split is a common and silent source of inflated metrics.

Don't evaluate an imbalanced classifier on accuracy

If the positive class is 1% of the data, the trivial classifier is 99% accurate. Precision, recall, F1, PR-AUC, and calibration are the metrics that matter on an imbalanced problem. The next three sections develop them in detail.

11

Thresholds and asymmetric costs

A probabilistic classifier outputs a number between 0 and 1; a decision classifier outputs a label. The bridge between them is a threshold, and choosing the threshold is a business decision, not a statistical one. Getting this step wrong is one of the most common ways a well-fit classifier produces a bad deployment.

Why 0.5 is almost always wrong

The default "predict class 1 if P(y = 1 | x) > 0.5" rule is only optimal under two strong assumptions: the two types of error (false positive, false negative) have equal cost, and the training set has the same class balance as the deployment distribution. Both are usually false. Spam filters should favour missing a spam over marking a real email as spam. Cancer screeners should prefer extra biopsies over missed diagnoses. Loan models have to balance the cost of a defaulted loan against the opportunity cost of a declined good customer. None of these problems should live at 0.5.

The cost-sensitive threshold

Given costs C_FP and C_FN for false positives and false negatives, the Bayes-optimal decision rule is to predict positive when P(y = 1 | x) > t*, where t* = C_FP / (C_FP + C_FN); the class prior π = P(y = 1) enters only through the posterior itself. A cost ratio of 10:1 in favour of catching positives (missed positive ten times worse than a false alarm) gives t* = 1/11 ≈ 0.09. A 1:100 ratio in the other direction gives t* = 100/101 ≈ 0.99. The threshold depends only on the relative costs and on the classifier's calibration.
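The rule is two lines of code; a sketch with the two cost ratios from the text:

```python
def bayes_threshold(c_fp, c_fn):
    """Optimal threshold on P(y=1|x) given the two error costs."""
    return c_fp / (c_fp + c_fn)

def decide(p, c_fp, c_fn):
    """Hard decision from a calibrated probability and the costs."""
    return int(p > bayes_threshold(c_fp, c_fn))

print(bayes_threshold(1, 10))    # missed positive 10x worse -> 1/11 ~ 0.09
print(bayes_threshold(100, 1))   # false alarm 100x worse   -> 100/101 ~ 0.99

# A 20% positive probability is "positive" under one cost structure
# and "negative" under the other
print(decide(0.2, 1, 10), decide(0.2, 100, 1))
```

Note the premise hidden in the function signature: p must be a calibrated probability, which is why the calibration section at the end of this chapter matters for deployment.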

Threshold tuning by validation

When costs are hard to quantify precisely, the threshold is chosen by maximising some evaluation metric on a held-out validation set — highest F1 score, highest balanced accuracy, best Youden's J, lowest regret against a target precision or recall. This tuning is a separate hyperparameter selection and has to respect the usual train/validation/test split, or else the final test metrics are overfit to the threshold choice. Scikit-learn's model-evaluation documentation walks through the common patterns.
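A sketch of the sweep, on hypothetical validation scores — every observed score is a candidate threshold, and the F1-maximising one wins:

```python
import numpy as np

def best_f1_threshold(y_true, scores):
    """Sweep candidate thresholds on validation data; return the F1-best."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(scores):
        pred = (scores >= t).astype(int)
        tp = ((pred == 1) & (y_true == 1)).sum()
        fp = ((pred == 1) & (y_true == 0)).sum()
        fn = ((pred == 0) & (y_true == 1)).sum()
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Hypothetical validation set: positives score systematically higher
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
scores = np.clip(y * 0.3 + rng.random(500) * 0.7, 0, 1)
t, f1 = best_f1_threshold(y, scores)
print(f"best threshold {t:.2f}, validation F1 {f1:.2f}")
```

The same loop works for any of the metrics above — swap the F1 line for balanced accuracy or Youden's J; the split discipline is what matters, not the metric.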

Business constraints and the operating point

Real deployments often have hard constraints, not costs. "Send at most 10,000 alerts per day." "Keep false positive rate below 1%." "Achieve recall of at least 95%." Each constraint picks out a single operating point on the classifier's ROC or PR curve — a specific threshold value — and the training process is ultimately judged by how well the classifier performs at that point. A classifier that is excellent on average over all thresholds can still be poor at the one threshold that matters. Training and evaluation should target the operating point, not a threshold-averaged metric, once the operating point is known.

The threshold is a first-class parameter

Most real classification failures in production are threshold failures, not model failures. The underlying classifier was fine; the threshold was left at 0.5; the costs were not elicited; the operating point drifted as class balance changed. A disciplined deployment process audits the threshold on every retraining and monitors it in production, the same way it monitors accuracy and latency.

12

Evaluation metrics

The confusion matrix is the foundation of classification evaluation; almost every threshold-dependent metric is some arithmetic on its four cells. Knowing which metric to report when — and which ones are lying to you — separates a trustworthy classifier from a false one.

The confusion matrix

For a binary classifier and a fixed threshold, every prediction falls into one of four bins: true positive (TP — predicted positive, actually positive), false positive (FP), true negative (TN), false negative (FN). The confusion matrix is the 2×2 table of these counts. For a K-class problem, it is a K × K matrix with the (i, j) cell counting predictions of class j on examples whose true class is i. The diagonal is correct classifications; the off-diagonal is errors, and the pattern of off-diagonal errors often reveals which classes are confusable with which (hard-to-distinguish dog breeds cluster off-diagonal).

Precision, recall, F1

Accuracy, (TP + TN) / n, is the proportion of correct predictions. It is the right metric only when classes are balanced and costs are symmetric; imbalanced problems render it almost useless. Precision, TP / (TP + FP), is the fraction of positive predictions that are correct — "of the things the model said were positive, how many really were?". Recall (or sensitivity, or true positive rate), TP / (TP + FN), is the fraction of actual positives that the model caught — "of the things that really were positive, how many did the model find?". Precision and recall trade off: lowering the threshold increases recall and lowers precision. The F1 score is their harmonic mean, 2 · P · R / (P + R), and is the standard single-number summary for imbalanced binary problems.
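The four definitions in one place, applied to a hypothetical 1%-positive problem — accuracy looks superb while precision tells the real story:

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, F1 from the four confusion-matrix cells."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical imbalanced problem, per 10,000 examples:
# 100 positives, of which the model catches 80, at 120 false alarms
acc, p, r, f1 = binary_metrics(tp=80, fp=120, tn=9780, fn=20)
print(f"accuracy {acc:.3f}  precision {p:.2f}  recall {r:.2f}  F1 {f1:.2f}")
# accuracy 0.986 looks superb; precision 0.40 says most alerts are false
```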

Specificity, sensitivity, and the medical vocabulary

Medical literature uses a different vocabulary. Sensitivity is recall; specificity, TN / (TN + FP), is the true negative rate — the fraction of negatives correctly classified. Youden's J is sensitivity + specificity − 1, a single number that is 0 at chance and 1 at perfect classification. Classifiers for diagnostic tests are often evaluated by sensitivity at a fixed specificity, the threshold chosen to match clinical operating constraints.

Macro vs micro averaging

For multiclass problems, precision, recall, and F1 are computed per-class and then averaged. Macro averaging gives each class equal weight (small classes count the same as big ones); micro averaging pools the TP/FP/FN counts across all classes before computing the metric (large classes dominate). They can differ enormously on imbalanced multiclass problems. Weighted averaging takes a weighted mean by class frequency. None is always right; the answer depends on whether the rare classes matter as much as the common ones.

The Matthews correlation coefficient

The Matthews correlation coefficient (MCC), (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)), is a single-number classification metric that ranges from -1 (perfect disagreement) to +1 (perfect agreement), 0 at chance. MCC is widely regarded as a better summary than F1 on imbalanced data because it uses all four cells of the confusion matrix; F1 ignores TN entirely. On problems where the negative class is the one that matters (e.g. predicting which machines will not fail), MCC is often the right default.
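The formula translated directly, scored on the same hypothetical confusion matrix as above so F1 and MCC can be compared side by side:

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from the four cells."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

tp, fp, tn, fn = 80, 120, 9780, 20
precision, recall = tp / (tp + fp), tp / (tp + fn)
print(f"F1  {2 * precision * recall / (precision + recall):.3f}")  # ~0.533
print(f"MCC {mcc(tp, fp, tn, fn):.3f}")                            # ~0.560
```

Unlike F1, swapping which class is labelled "positive" leaves MCC unchanged, because it uses all four cells symmetrically.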

13

ROC and PR curves

Every metric in the previous section depends on a choice of threshold. Sometimes that threshold is known; more often, the classifier is evaluated across all possible thresholds and summarised by the area under a curve. ROC and precision-recall curves are the two standard visualisations, and they tell subtly different stories.

The ROC curve

The receiver operating characteristic (ROC) curve plots true positive rate (recall, TPR) against false positive rate (FPR = FP / (FP + TN)) as the threshold varies from 0 to 1. A perfect classifier hugs the top-left corner — 100% recall with 0% false positives. A random classifier falls on the diagonal. The curve is threshold-free by construction: each point corresponds to one threshold, but the curve as a whole summarises how well the classifier separates the two classes regardless of the threshold. The name comes from its 1940s origin in signal-detection analysis of radar receiver operators during the Second World War.

AUC: the area under the ROC curve

The area under the ROC curve (ROC-AUC or just AUC) summarises the curve as a single number between 0 and 1. It has a clean probabilistic interpretation: AUC is the probability that a randomly chosen positive example has a higher predicted score than a randomly chosen negative example. An AUC of 0.5 is chance; an AUC of 1.0 is perfect separation; a useful classifier usually lands between 0.75 and 0.95. AUC is threshold-invariant and scale-invariant, which makes it a robust summary but also strips out information about where on the curve the classifier actually operates.
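The probabilistic interpretation can be computed directly: compare every positive's score to every negative's score and count the wins (ties count half). A minimal sketch on hypothetical scores:

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as P(score of a random positive > score of a random negative)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

y = np.array([0, 0, 1, 0, 1, 1])
s = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])
print(pairwise_auc(y, s))   # 6 of 9 positive-negative pairs ranked correctly
```

Scikit-learn's roc_auc_score computes the same quantity from the sorted scores in O(n log n) rather than this O(n²) double loop; the pairwise form is the one worth remembering, because it explains why AUC ignores the absolute scale of the scores.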

The PR curve

The precision-recall curve plots precision against recall as the threshold varies. On a balanced problem it carries similar information to the ROC curve, but on an imbalanced problem it differs sharply. When negatives vastly outnumber positives, a large absolute number of false positives can still produce a low false positive rate — so the ROC curve looks optimistic — while precision catches exactly the problem of false alarms among the rare positives. On imbalanced problems (fraud, rare diseases, anomaly detection), the PR curve is the right picture and PR-AUC or average precision (AP) is the right summary.

When ROC-AUC is misleading

The canonical cautionary tale: a 99%-negative dataset. A classifier that catches 90% of positives at the cost of a 2% false positive rate seems good on ROC-AUC. But 2% of the 99% negatives is roughly two false positives for every real positive caught — about 31% precision at 90% recall. The PR curve makes that bluntly visible; the ROC curve hides it. Saito & Rehmsmeier's 2015 comparison of the two is the reference treatment. On imbalanced problems, report PR-AUC; on balanced problems, either is fine, and ROC-AUC is historically more conventional.
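The arithmetic behind the cautionary tale, worked per 10,000 examples:

```python
# 1% positives, 90% recall, 2% false positive rate
prevalence, tpr, fpr = 0.01, 0.90, 0.02
n = 10_000
positives, negatives = n * prevalence, n * (1 - prevalence)

tp = tpr * positives        # 90 positives caught
fp = fpr * negatives        # 198 false alarms
precision = tp / (tp + fp)
print(f"TP {tp:.0f}, FP {fp:.0f}, precision {precision:.4f}")  # precision 0.3125
```

A false positive rate that sounds tiny (2%) is applied to the huge negative class, which is exactly the asymmetry precision captures and FPR does not.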

Per-class ROC in the multiclass case

For a K-class problem, there is no single ROC curve. The standard practice is to compute one curve per class (each class against all others) and report either the macro average (equal weight across classes) or the one-vs-one variant. Scikit-learn's roc_auc_score with multi_class='ovr' or 'ovo' and average='macro' produces the standard summaries; the per-class curves are the right thing to plot when one class is more important than the others.

AUC is not accuracy

One of the most common confusions in the classification literature: ROC-AUC is not an estimate of accuracy, and a higher AUC does not guarantee a higher accuracy at the deployment threshold. A classifier with AUC 0.95 that is poorly calibrated can have worse accuracy than a classifier with AUC 0.85 that is well-calibrated. Always report the threshold-specific metrics (precision, recall, F1 at the operating threshold) alongside the threshold-free AUC.

14

Calibration

A classifier is calibrated if its predicted probabilities match observed frequencies — when it says 70%, things actually happen 70% of the time. Calibration is the property that turns a black-box score into a trustworthy probability, and it is the property most deployed classifiers quietly fail.

The reliability diagram

The standard visualisation is the reliability diagram: bin the predictions into deciles (0-10%, 10-20%, …), and for each bin, plot the mean predicted probability against the empirical frequency of positives. A perfectly calibrated classifier lies on the diagonal. A classifier below the diagonal is overconfident — its predicted 90% probabilities only come true 70% of the time. A classifier above the diagonal is underconfident. Most modern classifiers — in particular boosted trees and neural networks — are systematically overconfident, because their training procedures reward confident fits on the training data more than calibrated probabilities on new data.

Why calibration matters

Calibration matters whenever a downstream system consumes the probability rather than the hard decision. A risk model that scores a loan at 3% default probability is used to price the interest rate; if the actual default rate at that score is 6%, the pricing is wrong. A medical model that predicts 85% likelihood of a condition is used to decide whether to biopsy; if the actual rate is 50%, the biopsy decision is uncalibrated. A ranking model that combines multiple classifier scores needs all the scores on the same probability scale, or the combination is arbitrary. Calibration is the hinge between a score and an action.

Platt scaling

Platt scaling (Platt, 1999) is the classical post-hoc calibration method: fit a one-dimensional logistic regression that maps the classifier's raw score to a calibrated probability, using a held-out calibration set. It assumes the miscalibration has a sigmoidal shape, which holds reasonably well for SVMs and for some neural networks. Platt scaling is trivially simple, fast, and often the only calibration step needed. It is the default in scikit-learn's CalibratedClassifierCV.

Isotonic regression

Isotonic regression is the non-parametric alternative: fit a monotonic step-function from score to probability, using the pool-adjacent-violators algorithm on the calibration set. It makes no parametric assumption about the miscalibration shape, so it is more flexible than Platt scaling, but it needs more calibration data to avoid overfitting. Zadrozny & Elkan (2002) compared the two; the rule of thumb is isotonic for more than 1000 calibration examples, Platt below that. Both are one-line wrappers in scikit-learn.

Temperature scaling for neural networks

Guo et al. (2017) showed that modern neural networks are badly miscalibrated — their softmax probabilities are much more confident than their actual accuracy — and proposed temperature scaling: divide the logits by a single learned scalar T before the softmax, with T chosen to minimise cross-entropy on a held-out set. One parameter, no shape change in the predictions, often dramatic calibration improvements. Temperature scaling is now the default post-hoc calibration step for deep classifiers, and is the natural deep-learning generalisation of Platt scaling.
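A numpy sketch of the mechanism, on a hypothetical overconfident classifier (about 80% accurate but near-certain on every prediction); a grid search stands in for the usual one-parameter optimisation of T:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the T that minimises held-out negative log-likelihood."""
    def nll(T):
        p = softmax(logits, T)
        return -np.log(p[np.arange(len(labels)), labels]).mean()
    return min(grid, key=nll)

# Hypothetical logits: a huge margin for the predicted class,
# but the prediction is only right ~80% of the time
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=1000)
winner = np.where(rng.random(1000) < 0.7, labels,
                  rng.integers(0, 3, size=1000))
logits = rng.normal(scale=0.5, size=(1000, 3))
logits[np.arange(1000), winner] += 6.0

T = fit_temperature(logits, labels)
print(f"fitted T = {T:.2f}")   # T > 1: the softmax is softened
```

Dividing by T > 1 shrinks every logit gap by the same factor, so the ranking of classes — and hence the accuracy — is untouched; only the confidence changes.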

Expected Calibration Error

The summary metric for calibration is Expected Calibration Error (ECE): bin the predictions, measure the gap between mean predicted probability and empirical frequency within each bin, take the weighted average. ECE is to calibration what accuracy is to classification — a single number, easy to report, with known pathologies (it depends on the binning, it ignores sample-size differences between bins). Brier score is a related but differently-scaled alternative that folds calibration and sharpness into a single number.
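The binning recipe in code — a minimal ECE sketch, checked on hypothetical data that is calibrated by construction against data that is not:

```python
import numpy as np

def ece(probs, labels, n_bins=10):
    """Expected Calibration Error: per-bin |confidence - frequency| gap,
    weighted by bin occupancy."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            total += mask.mean() * gap
    return total

rng = np.random.default_rng(0)
p = rng.random(100_000)
# Calibrated by construction: events occur at exactly the predicted rate
y_calibrated = (rng.random(100_000) < p).astype(int)
# Miscalibrated: the true rate is 50% no matter what the model says
y_miscalibrated = (rng.random(100_000) < 0.5).astype(int)

print(f"calibrated ECE:    {ece(p, y_calibrated):.3f}")     # near zero
print(f"miscalibrated ECE: {ece(p, y_miscalibrated):.3f}")  # large
```

The two pathologies mentioned above are visible in the code: the answer moves if n_bins changes, and a bin with three examples counts its noisy gap the same per-example as a bin with three thousand.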

Always measure calibration, often fix it

Calibration is cheap to measure (a reliability diagram and a single ECE number) and cheap to fix (Platt or temperature scaling on a held-out set). There is almost no excuse for deploying a classifier whose probabilities are not calibrated; the downstream systems that rely on them become wrong as a direct function of that miscalibration. Yet the step is routinely skipped, because the training process does not flag it. Adding calibration to the standard post-training pipeline is one of the highest-leverage changes a team can make.

15

Multiclass and multi-label classification

Binary classification is the simplest case. Multiclass (one of many) and multi-label (any subset of many) extend it in two different directions, each with its own practical quirks. Ordinal classification — classes with an inherent ordering — is a third variant.

Multiclass via softmax

The native multiclass model is the softmax regression of Section 03: one linear predictor per class, softmax normalisation, argmax decision. For deep networks, the final layer is a softmax, and training uses cross-entropy over K classes. For classical methods, one-vs-rest (K binary classifiers, argmax their scores) is the alternative and is often necessary for binary-native models like SVMs and gradient-boosted trees. Scikit-learn handles both transparently via multi_class and OneVsRestClassifier.

Multi-label problems

Multi-label classification assigns an arbitrary subset of classes to each example — an image can have both "beach" and "sunset" and "seagull" tags simultaneously. The standard decomposition is binary relevance: train an independent binary classifier per label and predict each independently. This is simple and scalable but assumes labels are independent, which is rarely true (sunset and seagull co-occur with beach). Classifier chains (Read et al., 2009) train the labels sequentially, each conditioned on the previous ones; label powerset enumerates subsets; neural approaches use a sigmoid output per label trained on binary cross-entropy. The scikit-multilearn library implements the main variants.

Ordinal classification

When classes have an inherent ordering — star ratings 1-5, disease severity stages, risk categories — neither the softmax (which treats classes as exchangeable) nor regression (which treats them as continuous) is ideal. Ordinal regression (also called ordered logit or proportional-odds regression) fits K-1 cumulative logistic regressions sharing coefficients but differing in intercepts; the predictions are structured probabilities that respect the class ordering. For deep models, the analogue is a ranking-style loss or a cumulative link head. Ignoring ordinality — treating "stage 1" and "stage 5" as equally different from each other as from "stage 2" — wastes information and often degrades accuracy on the close classes.

Hierarchical classification

When classes have a tree structure — biological taxonomy, product categories, ICD-10 codes — hierarchical classification exploits the structure. The standard approaches are: train a classifier at each internal node of the taxonomy (local approach), train a single flat classifier and post-process the predictions to respect the hierarchy (global approach), or use a hierarchical loss that penalises mistakes between distant classes more than between close ones. Hierarchical classification is the right choice for problems with thousands of classes — a flat softmax over 10,000 classes loses to a hierarchical decomposition, both in training speed and in generalisation.

Open-set classification

A subtle real-world problem: the training set has K classes, but at deployment the classifier encounters examples from classes it has never seen. Should it pick the closest known class (closed-set classification) or refuse to classify (open-set classification)? The closed-set assumption is implicit in almost every textbook method and is broken in almost every real deployment. The open-set recognition literature (Scheirer et al. 2013 and later) develops methods to learn a rejection option jointly with the classification. For production classifiers, a conservative heuristic is to threshold the maximum softmax probability — if the classifier's top prediction is below 30%, return "unknown" rather than commit — but this is a crude fix and the proper methods are worth understanding.

16

Generative vs discriminative, revisited

The section-one sketch can now be filled in. Every classifier in this chapter falls on one side of the generative/discriminative divide, and the tradeoff between them has some non-obvious implications that become visible only after seeing the methods side by side.

The Ng & Jordan result

Ng & Jordan (2001) compared the generative classifier (naive Bayes, LDA) and the discriminative classifier (logistic regression) on the same feature set and derived a clean result: the discriminative classifier has a lower asymptotic error rate (when n → ∞), but the generative classifier converges to its asymptotic error rate faster (with fewer training examples). In the small-data regime, the generative model wins; in the big-data regime, the discriminative model wins. The crossover point depends on the problem but is often in the hundreds-to-thousands range.

What generative classifiers give you for free

A generative classifier models P(x, y), which makes several things fall out automatically. Missing features: marginalise the unobserved features out of the class conditionals; the model still predicts. Unsupervised data: semi-supervised learning is natural, because unlabelled x contributes to the estimation of P(x). Anomaly detection: flag examples whose marginal likelihood P(x) is small. Generation: sample new x from each class. A discriminative classifier has none of these; it only knows P(y | x) on inputs it has been asked about.

The cost of specifying P(x | y)

The price is that the generative model requires specifying the distribution of features within each class — a hard task in high dimensions. Getting P(x | y) right is much harder than getting P(y | x) right, because the former is a p-dimensional density and the latter is a scalar. Naive Bayes sidesteps the difficulty with the independence assumption; LDA sidesteps it with the Gaussian assumption; both pay for their simplicity when the real features do not obey the assumption. The discriminative model has no analogous commitment — it can live with any feature distribution, as long as the log-odds happen to be linear.

Why modern ML is mostly discriminative

The last decade has been a nearly unbroken run of discriminative models dominating the leaderboards — classifiers, regressors, segmenters, detectors. The reason is data: modern ML systems are trained on millions or billions of labelled examples, and in that regime the Ng & Jordan argument's big-data limit applies. Generative classifiers still exist in modern practice where labels are expensive (few-shot learning, medical imaging with small cohorts, scientific data), where generation is itself the goal (language models, image generation), or where the downstream system requires P(x) (anomaly detection). But the default production classifier, at scale, is discriminative.

The modern generative revival

The revival of generative models for classification comes from an unexpected direction: large pretrained generative models (LLMs, image diffusion models) can be used as zero-shot classifiers by asking them to score the log-likelihood of a label given an input. A chat LLM asked "is this email spam? Answer yes or no" is performing a generative classification. This blurs the classical distinction and is one of the active research directions in 2024-2025. The classical theory still applies — the Ng & Jordan argument does not care whether the generative model is naive Bayes or GPT — but the empirical balance shifts when the generative model is very large and was pretrained on billions of tokens.

17

Classification in the ML stack

Classification is the universal backbone. Image recognisers, speech recognisers, language models, search ranking, recommender systems, medical triage, fraud detection, content moderation — almost every consumer-facing AI system is a classifier somewhere in the pipeline. The vocabulary of this chapter is permanent equipment.

The softmax head of every deep network

A deep classifier is a feature extractor — the hidden layers — plus the softmax regression of Section 03 on top. The feature extractor has grown dramatically in expressiveness over the last decade; the softmax head has barely changed. Cross-entropy loss, argmax prediction, temperature scaling for calibration, cost-sensitive thresholds for decisions — all carry over unchanged. A practitioner fluent in classical classification is fluent in the last layer of every deep classifier ever built.
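That unchanged last layer can be sketched in a few lines. This is a minimal, self-contained version of the softmax head: logits in, probabilities out, with argmax prediction and temperature scaling — dividing the logits by a fitted temperature T > 1 to soften an over-confident distribution. The logit values are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax head: turn raw logits into class probabilities.
    temperature > 1 softens the distribution (temperature scaling, the
    standard post-hoc calibration fix for over-confident deep nets)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                 # illustrative logits from a feature extractor
probs = softmax(logits)                  # cross-entropy loss would be -log(probs[true class])
pred = probs.index(max(probs))           # argmax prediction
soft = softmax(logits, temperature=2.0)  # softer, typically better-calibrated probabilities
```

Note that temperature scaling never changes the argmax — it rescales all logits by the same factor — so it repairs calibration without touching accuracy.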

Classification as the evaluation surface

Linear probes (Section 17 of Chapter 01 in its regression form) have a classification analogue that is equally universal: train a logistic regression on the activations of a frozen pretrained model and see how well it classifies. The probe's accuracy is a measure of how much of the target task's information is linearly readable from the representation. Every paper in modern representation learning — CLIP, DINO, MAE, SAM — reports linear-probe accuracies as its primary evaluation. The classifier underneath is logistic regression; the science is which representation makes the probe succeed.
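A minimal probe sketch, on simulated rather than real activations: we stand in for a frozen pretrained model with random 64-dimensional features into which a linearly decodable class signal has been injected along one direction (an assumption of this toy setup, not a property of any real model). The probe itself is exactly what the papers use — an off-the-shelf logistic regression.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Simulated "frozen activations": mostly noise, with the class identity
# linearly readable along dimension 0 (illustrative stand-in for a real model).
rng = np.random.default_rng(0)
n, d = 1000, 64
y = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d))
acts[:, 0] += 3.0 * y                    # inject a linearly decodable signal

X_tr, X_te, y_tr, y_te = train_test_split(acts, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)   # the linear probe
print(f"linear-probe accuracy: {probe.score(X_te, y_te):.3f}")
```

With a real representation, `acts` would be the frozen model's activations on a labelled dataset; everything after that line is identical, which is why probe accuracy is such a cheap and comparable measurement across representations.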

Retrieval, ranking, and recommendations

Search ranking is classification: is this document relevant or not? Recommender systems are classification: will this user click or not? Both are trained with a cross-entropy or pairwise ranking loss that is a close relative of the classification losses of this chapter, and both are evaluated with PR-AUC or its ranking cousins (NDCG, MAP). A significant fraction of the applied ML research of the last twenty years is, under the surface, classification research applied to ranking and retrieval.

Decisions, costs, and the business layer

A deployed classifier is almost always part of a larger decision system. The classifier produces a calibrated probability; a cost model turns the probability into an expected utility; a decision policy chooses among actions. The classifier is the statistical piece; the costs and the policy are the business piece. A common failure mode is to treat the classifier as the whole system — to maximise AUC without ever asking what threshold the product will ship at, what the costs of the two mistakes are, or whether the probabilities are calibrated. Getting the classifier right is necessary; it is almost never sufficient.
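The probability-to-decision step can be written down in a few lines. This is a minimal sketch of the decision layer for a fraud-screening example with invented costs: a missed fraud case (false negative) costs far more than the manual review triggered by a false positive, so the optimal operating point is nowhere near 0.5.

```python
# Illustrative business costs -- the assumption driving the whole example.
COST_FP = 5.0     # cost of flagging a legitimate transaction for review
COST_FN = 100.0   # cost of letting a fraudulent transaction through

def expected_cost(p_fraud, action):
    """Expected cost of an action given a *calibrated* fraud probability."""
    if action == "flag":
        return (1 - p_fraud) * COST_FP   # pay the review cost if it was legitimate
    return p_fraud * COST_FN             # pay the miss cost if it was fraud

def decide(p_fraud):
    """Pick the cheaper action. Equivalent to thresholding the probability
    at COST_FP / (COST_FP + COST_FN), not at 0.5."""
    return min(("flag", "pass"), key=lambda a: expected_cost(p_fraud, a))

threshold = COST_FP / (COST_FP + COST_FN)   # about 0.048 with these costs
```

Two things follow immediately: the threshold is set by the costs, not by the classifier, and the whole construction collapses if the probabilities are not calibrated — which is why the calibration section matters to the business layer, not just the statistics.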

Bridge to ensemble methods

The next chapter — Chapter 03: Ensemble Methods — takes the classifiers of this chapter (particularly decision trees) and combines them to produce the most-used family of classical ML models: random forests and gradient boosting. The classification machinery of this chapter (cross-entropy loss, confusion matrices, ROC and PR curves, calibration, thresholds, class imbalance) carries over unchanged. Ensembles change the model, not the evaluation, not the calibration, not the decision machinery.

Classification is ML's most durable vocabulary

Every technique in this chapter was developed before 2005. Every one is still in active use in 2025. The reason is that classification is what ML is for, at the end of most pipelines — the final translation of data into a discrete action or label. Regardless of what the features look like or where they come from, the vocabulary of probabilities, decisions, costs, and calibration is the common language. A practitioner who understands classification deeply is equipped to work at any layer of the modern stack, because the problems upstream always reduce, eventually, to choosing among a finite set of outcomes.

Further reading

Where to go next

Classification is one of the oldest and most heavily written-about subjects in machine learning. The references below are the ones that reward re-reading: the standard textbooks whose classification chapters most teachers assign, the more mathematical treatments for the probability-theory and decision-theory angle, the classic papers that introduced the techniques the chapter relies on, and the software documentation that is the de facto reference implementation for deploying classifiers in practice.

The anchor textbooks

The applied practitioner's texts

Foundational papers

Calibration and evaluation

Software documentation

This page is Chapter 02 of Part IV: Classical Machine Learning. The next chapter — Ensemble Methods — takes the decision tree introduced in Section 07 and turns it into the most-used family of models in modern classical ML: random forests, gradient boosting, and the XGBoost/LightGBM/CatBoost lineage that dominates tabular data. The classification machinery developed here (cross-entropy loss, confusion matrices, ROC and PR curves, calibration, thresholds, imbalance handling) carries over unchanged; ensembles change only the model family, not the evaluation or decision machinery. Later chapters of Part IV extend to clustering and dimensionality reduction (the unsupervised cousins), probabilistic graphical models (structured classification), kernel methods and SVMs (a different route to non-linear boundaries), feature engineering (the quiet decisive step that precedes every classifier), and model evaluation and selection (the discipline that turns a classifier into a trustworthy deployment).