Part IV · Classical Machine Learning · Chapter 04

Clustering, where the structure of the data has to be discovered rather than supplied.

Clustering is the first serious problem in unsupervised learning: given a dataset without labels, partition it into groups of points that are similar to each other and different from points in other groups. The problem is ancient — Linnaeus was doing a kind of clustering when he classified species in 1735, and the k-means algorithm was independently invented at least four times between 1950 and 1965 — and it is also oddly unresolved. Unlike regression and classification, where a training loss and a held-out error together pin down success, clustering has no ground truth: different algorithms carve the same data into different partitions, and there is no single criterion that picks the right one. This chapter develops the four algorithmic families that do most of the work — centroid-based (k-means and its variants), hierarchical (agglomerative and divisive), density-based (DBSCAN, HDBSCAN), and model-based (Gaussian mixtures, soft clustering via EM) — the evaluation machinery that lets you defend a choice among them, and the practical habits that separate a useful clustering from a cargo-cult one. Chapter 05's dimensionality reduction is the usual companion: cluster first in a reduced space, interpret, iterate.

How to read this chapter

The first section motivates the clustering problem and lays out the four algorithmic families the rest of the chapter will inhabit. Section two is the quiet prerequisite — similarity and distance — because every clustering algorithm is a function of a distance metric, and choosing the metric is often the most consequential decision in the whole pipeline. Sections three through five are the centroid-based family: k-means itself (Lloyd's algorithm and its Forgy/MacQueen variants), the variants that fix its common failure modes (k-means++, mini-batch k-means, kernel k-means), and the robust relatives k-medoids and k-medians. Section six covers hierarchical clustering — the agglomerative linkage family and the dendrogram as a full multi-scale picture of the dataset. Sections seven and eight are the density-based family: DBSCAN as the canonical algorithm that discovers clusters of arbitrary shape and rejects noise, and its descendants HDBSCAN and OPTICS that turn the single density threshold into a hierarchy.

Section nine introduces Gaussian mixture models as the model-based cousin of k-means — k-means is the limit of a GMM with spherical covariances and zero variance — and the natural entry point into soft clustering. Section ten develops the EM algorithm in the detail it deserves, because EM is the tool every later probabilistic model in Part V will inherit. Section eleven is spectral clustering, the linear-algebra approach that handles non-convex geometry by clustering in the eigenvectors of a Laplacian. Section twelve is the hardest and most important: how many clusters — the elbow method, silhouette, gap statistic, BIC, information criteria, and the honest acknowledgement that the question often has no single right answer. Section thirteen is the rest of evaluation, with and without ground-truth labels: ARI, NMI, purity, V-measure, and the internal indices (silhouette, Davies-Bouldin, Calinski-Harabasz). Section fourteen is stability and consensus clustering — the idea that if a clustering is real, it should survive perturbations and agree with itself across methods. Sections fifteen and sixteen are practical: high-dimensional clustering (where Euclidean distance fails and dimensionality reduction becomes almost mandatory) and the operational recipe for attacking a new clustering problem. Section seventeen closes with where clustering shows up in modern ML — as the scaffolding under vector-quantisation, tokenisation, prototype-based few-shot learning, self-supervised representation learning, and the recommendation and retrieval systems that shape the modern web.

Contents

  1. Why clustering is its own problem · No labels, no truth, four families of answers
  2. Similarity and distance · The metric choice is the model choice
  3. k-means · Lloyd's algorithm and the objective it minimises
  4. k-means variants · k-means++, mini-batch, kernel k-means
  5. k-medoids and k-medians · Robust centroids for non-Euclidean data
  6. Hierarchical clustering · Linkage, the dendrogram, multi-scale partitions
  7. DBSCAN · Density-based clusters of arbitrary shape
  8. HDBSCAN and OPTICS · Varying-density hierarchies
  9. Gaussian mixture models · Soft clustering with probabilistic components
  10. The EM algorithm · Expectation-maximisation and latent-variable inference
  11. Spectral clustering · Eigenvectors of the graph Laplacian
  12. Choosing the number of clusters · Elbow, silhouette, gap statistic, BIC
  13. Evaluation metrics · ARI, NMI, silhouette, purity, Davies-Bouldin
  14. Stability and consensus clustering · Resampling, bootstrapping, agreement across methods
  15. Clustering in high dimensions · The curse, dimensionality reduction, subspace methods
  16. Clustering in practice · A recipe for a new dataset
  17. Where it compounds in ML · Vector quantisation, tokenisation, self-supervision, retrieval
01

Why clustering is its own problem

Clustering is what learning from data looks like when the data does not come with labels. It is the defining problem of unsupervised learning and a surprisingly hard one — the objective is underdetermined, the evaluation is circular, and two reasonable algorithms will often produce genuinely different partitions of the same dataset. The chapter opens with what that unresolved condition implies for how to approach the problem.

The clustering problem

Given a set of data points X = {x_1, …, x_n} in some feature space, partition them into clusters C_1, …, C_K such that points in the same cluster are more similar to each other than to points in different clusters. The problem sounds simple but conceals an entire hierarchy of decisions: what "similar" means (Euclidean distance, cosine, mutual information, graph adjacency), what shape clusters are allowed to take (convex blobs, arbitrary density ridges, elongated manifolds), whether each point must belong to exactly one cluster (hard clustering) or can belong to several (soft / fuzzy clustering), and — the killer — how many clusters there are. A single dataset can be partitioned into 3 meaningful clusters or 30 meaningful clusters depending on which of these questions gets answered first.

The four algorithmic families

Almost every classical clustering algorithm falls into one of four families, and choosing between them is more consequential than tuning any single algorithm. Centroid-based methods (k-means, k-medoids) represent each cluster by a prototype and assign points to the nearest one; they are fast, well-understood, and assume roughly convex, equal-sized, equal-variance clusters. Hierarchical methods build a tree of nested clusters (the dendrogram), giving a multi-scale picture rather than a single partition; they are slow (O(n²) or worse) but expose structure at every granularity. Density-based methods (DBSCAN, HDBSCAN) find clusters as connected regions of high density and explicitly label low-density points as noise; they discover arbitrary shapes and need no prior K. Model-based methods (Gaussian mixtures) fit a probability distribution as a mixture of components and assign points to the component that most likely generated them; they give soft assignments and a principled path to model selection via likelihood. The rest of the chapter is a tour of each family.

Clustering has no ground truth. The right number of clusters, the right distance, the right algorithm are all choices a practitioner makes in light of what the clustering is for. A clustering used to compress data needs different properties from a clustering used to discover customer segments, and no evaluation metric separates them automatically.

Why clustering is ill-posed

Jon Kleinberg's 2002 impossibility theorem (An Impossibility Theorem for Clustering) formalised the everyday frustration: no clustering function can simultaneously satisfy three very reasonable axioms — scale invariance (rescaling the distance shouldn't change the clustering), richness (any partition should be achievable for some distance matrix), and consistency (shrinking within-cluster distances and expanding between-cluster ones shouldn't change the clustering). Any algorithm that satisfies any two will fail the third. This is not a practical obstacle — most real algorithms sacrifice richness, picking a family of achievable partitions — but it is the theoretical reason that clustering admits no universal answer. Every algorithm is biased in favour of some partitions and against others; the art is matching the bias to the problem.

What this chapter does not cover

This chapter is about clustering as it sits inside classical machine learning: working on vector data, returning partitions, evaluated against internal or external criteria. It does not cover topic models (soft clustering of documents via latent Dirichlet allocation; see Part V), community detection on networks (modularity, Louvain, Leiden — a large adjacent literature), deep clustering (end-to-end learning of both embeddings and cluster assignments; a mid-2010s-onward line of work), or time-series clustering as a specialised domain. The techniques developed here — distances, centroids, densities, mixtures, hierarchies — are the vocabulary all those extensions assume. Chapter 05's dimensionality reduction is the natural first extension and is almost always used in conjunction with the clustering methods in this chapter.

02

Similarity and distance

Every clustering algorithm is a function of a distance or similarity matrix. The choice of metric is not a preprocessing step; it is a statement about what kind of structure the clustering is meant to find. A practitioner who ignores the metric has already made a choice — usually Euclidean — and often the wrong one.

Metric distances

A metric on a set is a function d(x, y) satisfying non-negativity, identity of indiscernibles (d(x, y) = 0 iff x = y), symmetry, and the triangle inequality d(x, z) ≤ d(x, y) + d(y, z). The classical metrics on R^p are the Euclidean distance ‖x − y‖₂, the Manhattan (L1, taxicab) distance Σ|x_i − y_i|, and the Chebyshev (L∞, max-coordinate) distance max|x_i − y_i| — all special cases of the Minkowski family (Σ|x_i − y_i|^q)^(1/q) of order q ≥ 1. The Euclidean default comes from its closed-form relationship to the mean (the mean is exactly the point minimising a cluster's sum of squared Euclidean distances, which is what k-means exploits) and to Gaussian likelihoods. The L1 default comes from its robustness to outliers. Pick Euclidean when the data is roughly Gaussian and in comparable units; pick L1 when outliers are common.
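A minimal NumPy sketch of the Minkowski family on made-up vectors (using q for the order, since p names the dimension):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

def minkowski(x, y, q):
    """Minkowski distance of order q: q=1 is Manhattan, q=2 is Euclidean."""
    return float(np.sum(np.abs(x - y) ** q) ** (1.0 / q))

manhattan = minkowski(x, y, 1)            # 3 + 2 + 0 = 5
euclidean = minkowski(x, y, 2)            # sqrt(9 + 4 + 0)
chebyshev = float(np.max(np.abs(x - y)))  # the q -> infinity limit: 3
```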

Non-metric similarities

Many useful notions of closeness are not metrics. Cosine similarity, cos(x, y) = x·y / (‖x‖‖y‖), measures the angle between vectors and ignores magnitude; it is the dominant similarity on text (tf-idf and embedding vectors) and retrieval (dense vector search). It is not a metric — the triangle inequality fails — but 1 − cos(x, y) is bounded in [0, 2] and is the distance surrogate used in practice. Jaccard distance 1 − |A ∩ B|/|A ∪ B| is the canonical measure on sets (bag-of-words, user-click sets) and does satisfy the triangle inequality, making it a true metric. Hamming distance counts mismatches in fixed-length strings and is the right measure for categorical data encoded as binary vectors. Edit distance (Levenshtein) on sequences is a metric but expensive to compute. The choice among these is almost always dictated by the data type rather than the algorithm.
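A sketch of the two most common non-metric-adjacent measures, on made-up vectors and sets:

```python
import numpy as np

def cosine_distance(x, y):
    """1 - cos(x, y): bounded in [0, 2]; the usual distance surrogate for cosine."""
    return 1.0 - float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def jaccard_distance(a, b):
    """1 - |A n B| / |A u B| on sets; this one is a true metric."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

d_cos = cosine_distance(np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # orthogonal: 1.0
d_jac = jaccard_distance({"a", "b", "c"}, {"b", "c", "d"})           # 1 - 2/4 = 0.5
```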

Mahalanobis distance and whitening

The Mahalanobis distance d(x, y) = √((x − y)ᵀ Σ⁻¹ (x − y)) generalises Euclidean distance by accounting for the covariance Σ of the data: it is the Euclidean distance after whitening (multiplying by Σ^(−1/2)), so that features with correlated variation count less. A cluster that is visibly elongated in one direction in Euclidean space is a round cluster in Mahalanobis space. The Gaussian mixture model in Section 09 uses exactly this distance in its likelihood formula — one Mahalanobis per component — and that is why GMMs handle elliptical clusters that k-means cannot. For quick work, whitening the data (standardising each feature, or doing a full PCA whitening) before running k-means is a cheap approximation.
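A small sketch on synthetic correlated data, checking the identity numerically: Mahalanobis distance equals plain Euclidean distance after multiplying by a whitening factor.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 1.2], [0.0, 0.5]])  # correlated features

Sigma = np.cov(X, rowvar=False)
Sigma_inv = np.linalg.inv(Sigma)
W = np.linalg.cholesky(Sigma_inv)  # whitening factor: W @ W.T == Sigma_inv

def mahalanobis(x, y):
    d = x - y
    return float(np.sqrt(d @ Sigma_inv @ d))

x, y = X[0], X[1]
d_maha = mahalanobis(x, y)
d_white = float(np.linalg.norm((x - y) @ W))  # Euclidean distance in whitened space
# d_maha and d_white agree up to floating point
```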

The scaling trap. On heterogeneous numerical data (height in metres, weight in kilograms, income in dollars), Euclidean distance is dominated by whichever feature has the largest numerical range — income, almost always. Standardise (z-score) or min-max scale before computing distances, unless you have a reason not to. This single fix recovers more clusterings than any algorithm choice.
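A minimal sketch of the fix, with made-up height/weight/income columns:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(1.75, 0.1, 200),        # height in metres
                     rng.normal(70, 10, 200),           # weight in kilograms
                     rng.normal(50_000, 20_000, 200)])  # income in dollars

# z-score each feature: zero mean, unit variance, so no column dominates distances
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
```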

Learned distances

For problems where the right metric is not obvious, metric learning learns the distance function from data. Classical methods (Xing et al., 2003; LMNN, Weinberger and Saul, 2009; ITML, Davis et al., 2007) learn a Mahalanobis matrix that pulls same-class points closer and pushes different-class points apart, given some side information about which pairs are similar. Modern deep-learning analogues (Siamese networks, triplet loss, contrastive learning) do the same thing with non-linear encoders — the embedding itself is the learned metric, and the resulting cosine distances are the effective similarity. For clustering, the relevance is that the same raw data embedded by different representation-learning pipelines can produce very different cluster structures, and the embedding stage is often more consequential than the clustering algorithm that follows.

03

k-means

k-means is the most-used clustering algorithm in the world. It was invented at least four times between 1950 and 1965 (Steinhaus, Lloyd, MacQueen, Forgy), it has a beautifully simple objective, and it is fast enough to run on millions of points. It is also badly behaved in several specific ways, and its failure modes organise the next few sections.

The objective

k-means partitions X into K clusters C_1, …, C_K and assigns to each cluster a centroid μ_k, minimising the total within-cluster sum of squares (WCSS, also called inertia): J = Σ_k Σ_{x ∈ C_k} ‖x − μ_k‖². For fixed cluster assignments, the optimal centroid is the mean of the cluster's points. For fixed centroids, the optimal assignment is to the nearest centroid. The k-means algorithm (Lloyd, 1957; published 1982) alternates these two steps: assign each point to its nearest centroid; recompute each centroid as the mean of its newly-assigned points; repeat until nothing changes. This is coordinate descent on J, and it is guaranteed to converge because neither step can increase J and there are only finitely many possible assignments.
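The two alternating steps can be sketched directly in NumPy (a teaching implementation, not a replacement for a library k-means):

```python
import numpy as np

def lloyd_kmeans(X, K, n_iter=100, seed=0):
    """Plain Lloyd iterations: assign to nearest centroid, recompute means."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # Forgy start
    for _ in range(n_iter):
        # Assignment step: nearest centroid under squared Euclidean distance.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                        else centroids[k] for k in range(K)])
        if np.allclose(new, centroids):
            break
        centroids = new
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    return labels, centroids, float(d2.min(axis=1).sum())  # J = final WCSS

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids, J = lloyd_kmeans(X, K=2)
```

Because the final centroids are cluster means, the returned J can never exceed the one-cluster WCSS, which makes for a quick sanity check on any run.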

Assumptions and failure modes

The objective quietly encodes several strong assumptions. Spherical clusters: because WCSS penalises Euclidean distance uniformly in every direction, k-means prefers round clusters. Elliptical clusters get split. Equal variance: clusters of very different sizes or spreads are incorrectly partitioned — a tight cluster next to a diffuse one will have the diffuse one's edge points assigned to the tight one. Equal size: k-means is biased toward clusters with roughly equal numbers of points, because the WCSS objective can usually be lowered more by splitting a large, spread-out cluster than by keeping a small one separate. Known K: the number of clusters must be supplied externally. Convex clusters only: non-convex shapes (two interlocking half-moons) cannot be captured by any Voronoi partition. Sensitive to outliers: a single extreme point in a cluster yanks the centroid toward it. Every clustering method in the rest of the chapter exists to fix one or more of these failures.

Initialisation matters

Lloyd's algorithm only finds a local minimum of J, and which local minimum depends on where the centroids start. The original Forgy initialisation picks K random points as centroids; random partitioning assigns each point to a random cluster and computes centroids. Both can produce terrible solutions when the random starts happen to land multiple centroids in the same true cluster. The standard fix is to run the algorithm R times with different random seeds and keep the solution with the lowest J. A better fix — k-means++, Section 04 — is smarter about the initialisation itself. The lesson: treat k-means as a randomised algorithm, not a deterministic one, and always run multiple restarts.

Complexity. Each Lloyd iteration is O(nKp) for n points in p dimensions with K clusters. The number of iterations is usually small (5–50) and largely independent of n. k-means comfortably handles millions of points on a single machine; billions need the mini-batch variant in Section 04.

Why it's still the default

Despite its limitations, k-means remains the first thing to try on almost every new clustering problem. It is fast, the output is interpretable (centroids are themselves points in feature space), the objective is a single number that you can compare across values of K, and the algorithm is one of the best-understood optimisation problems in unsupervised learning. Many real problems are well-approximated by spherical equal-variance clusters, especially after standardisation and dimensionality reduction, and a k-means baseline is the reference point against which every fancier method is measured. The chapter's strategy is to understand k-means thoroughly, then meet every other method as a specific fix for one of its failure modes.

04

k-means variants

Each common failure mode of k-means has a specific fix. k-means++ addresses bad initialisation. Mini-batch k-means addresses scale. Kernel k-means addresses non-convexity. Fuzzy c-means addresses the hard-assignment assumption. Together these variants cover most of the real-world cases where plain k-means struggles.

k-means++

k-means++ (Arthur and Vassilvitskii, 2007) is a smarter initialisation scheme. Pick the first centroid uniformly at random from the data. For each subsequent centroid, sample a point with probability proportional to its squared distance from the nearest already-chosen centroid. This biases the initialisation toward spread-out centroids — a point in a region already covered by a centroid has low probability of being chosen next — and provably achieves O(log K)-approximation to the optimal J in expectation. The practical effect: fewer random restarts needed, better final solutions, and the new default in every major library (scikit-learn's KMeans uses k-means++ by default; Spark's MLlib does too). Use k-means++ always; plain random initialisation is a historical footnote.
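A sketch of the seeding step on synthetic blobs (illustration only; library implementations add refinements such as sampling several candidates per step):

```python
import numpy as np

def kmeans_pp_init(X, K, rng):
    """k-means++ seeding: sample each new centroid with probability
    proportional to squared distance from the nearest centroid chosen so far."""
    centroids = [X[rng.integers(len(X))]]          # first seed: uniform at random
    for _ in range(K - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, (50, 2)) for m in (0.0, 4.0, 8.0)])
C = kmeans_pp_init(X, K=3, rng=rng)
```

A point already chosen has squared distance zero to itself, so it can never be picked twice; that is the "spread-out" bias in action.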

Mini-batch k-means

Full k-means recomputes centroids against every point at every iteration. Mini-batch k-means (Sculley, 2010) computes centroid updates from random mini-batches of size 100–1000 at each step, much like stochastic gradient descent. The quality of the final solution is slightly worse than full k-means (around 1–3% higher J on standard benchmarks), but it is 10–50× faster at scale, and for streaming data it is the only option. Mini-batch is the right default at n > 10^6; at smaller scales full k-means is still preferred. The scikit-learn MiniBatchKMeans class implements it with sensible defaults.
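A usage sketch with scikit-learn's MiniBatchKMeans on synthetic blobs (the batch size and restart count here are illustrative, not recommendations):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (5000, 2)), rng.normal(5, 0.5, (5000, 2))])

# Each step refines the centroids from a small random batch, SGD-style.
km = MiniBatchKMeans(n_clusters=2, batch_size=256, n_init=10,
                     random_state=0).fit(X)
```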

Kernel k-means

Kernel k-means replaces the Euclidean distance with a kernelised distance d(x, y) = K(x, x) + K(y, y) − 2 K(x, y), effectively running k-means in an implicit high-dimensional feature space without computing coordinates there. With a Gaussian RBF kernel, this gives the algorithm the flexibility to fit non-convex clusters that plain k-means cannot represent. The cost: no explicit centroid in input space (the centroid lives in the RKHS), O(n²) kernel-matrix storage, and the usual Gaussian-kernel bandwidth choice. In practice, spectral clustering (Section 11) is the more common route to non-convex clusters, but kernel k-means is the direct generalisation and the bridge to the kernel-methods chapter.

Fuzzy c-means and soft assignments

Fuzzy c-means (Bezdek, 1981) allows each point to belong to every cluster with a degree of membership in [0, 1] that sums to 1. The membership of point x_i to cluster k is u_{ik} ∝ (1/d(x_i, μ_k))^(2/(m−1)), where m > 1 is the "fuzziness" parameter (m = 2 is standard). Centroids become weighted means of all points, weighted by their memberships. The algorithm is still alternating optimisation and still requires K supplied externally, but it captures the intuition that boundary points are genuinely ambiguous. For overlapping clusters, fuzzy c-means often gives a more faithful picture than a hard partition. Gaussian mixture models (Section 09) are the probabilistic version of the same idea and have largely superseded fuzzy c-means in modern practice, but the fuzzy version remains common in image segmentation and bioinformatics.
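The membership update can be sketched as follows; a point equidistant from two centroids gets membership 0.5 in each:

```python
import numpy as np

def fuzzy_memberships(X, centroids, m=2.0):
    """Membership u_ik proportional to (1/d(x_i, mu_k))^(2/(m-1)), normalised over k."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)               # guard a point sitting exactly on a centroid
    u = (1.0 / d) ** (2.0 / (m - 1.0))
    return u / u.sum(axis=1, keepdims=True)

X = np.array([[0.5, 0.0]])                          # midway between the two centroids
centroids = np.array([[0.0, 0.0], [1.0, 0.0]])
u = fuzzy_memberships(X, centroids)                 # -> [[0.5, 0.5]]
```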

The default stack. Start with k-means++ with 10 restarts. If the data is too large, switch to mini-batch k-means. If the clusters are clearly non-convex, move to DBSCAN or spectral clustering. If points seem to belong to multiple clusters genuinely, move to a Gaussian mixture. Each step is a specific response to a specific failure of the previous one.

Other variants worth knowing

Bisecting k-means starts with all points in one cluster and recursively splits the worst cluster into two via k-means until K clusters are produced — a hierarchical cousin of k-means that is often more stable than the flat version. Spherical k-means projects all points onto the unit sphere and clusters on cosine similarity, and is the algorithm inside classical topic-model initialisation and dense-embedding clustering. Constrained k-means enforces must-link and cannot-link constraints from external information (semi-supervised clustering). Each is a tactical variant; none displaces the core algorithm.

05

k-medoids and k-medians

k-means uses the arithmetic mean as its prototype, which is not robust to outliers and not meaningful for non-Euclidean data. k-medoids and k-medians replace the mean with a robust alternative — an actual data point, or a coordinate-wise median — making clustering possible on data where means are either fragile or undefined.

k-medoids (PAM)

k-medoids (Kaufman and Rousseeuw, 1987, published as PAM — Partitioning Around Medoids) replaces each cluster's centroid with a medoid: the actual data point in the cluster that minimises the sum of distances to all other points in the cluster. The medoid is a real point, not a synthetic average, which is essential when points have no meaningful average — categorical data, string data, graph-node data, any data whose distances are defined but whose arithmetic is not. The objective is J = Σ_k Σ_{x ∈ C_k} d(x, m_k) for an arbitrary distance d, not necessarily Euclidean. The classical PAM algorithm is O(K(n − K)²) per iteration, which limits it to smaller datasets; CLARA (Clustering LARge Applications) approximates by running PAM on random subsamples, and CLARANS does randomised neighbour search over medoid assignments.
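A sketch of medoid selection on a toy one-dimensional cluster with one gross outlier, showing the mean being dragged while the medoid stays put:

```python
import numpy as np

pts = np.array([0.0, 1.0, 2.0, 3.0, 100.0])   # one gross outlier
D = np.abs(pts[:, None] - pts[None, :])       # any pairwise distance matrix works here

def medoid(D, idx):
    """The member of idx minimising total distance to the other members."""
    sub = D[np.ix_(idx, idx)]
    return idx[sub.sum(axis=1).argmin()]

m = medoid(D, np.arange(len(pts)))   # index 2: the medoid stays at 2.0
mean = pts.mean()                    # the mean is dragged to 21.2
```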

Why medoids matter

Two reasons. First, robustness: the medoid is the geometric median of the cluster (in the L1 sense) and is insensitive to outliers in a way the arithmetic mean is not. A single point 100 standard deviations from a cluster's centre will pull the mean substantially but not the medoid. Second, interpretability: a medoid is itself a data point that can be examined, explained, described. For a customer-segmentation project, "the medoid customer of cluster 3 is customer #4172, who has these properties" is often more useful than "the centroid customer has these synthetic averaged properties". Every interpretable-clustering pipeline eventually benefits from medoids as exemplars, even if the partitioning itself was computed with k-means.

k-medians

k-medians uses the coordinate-wise median as the cluster prototype and minimises the L1 sum J = Σ_k Σ_{x ∈ C_k} Σ_j |x_j − μ_{kj}|. Unlike the medoid (which is a data point) the coordinate-wise median is a synthetic point, but unlike the mean it is robust to outliers — a single extreme point in one coordinate does not move the median at all. For data with heavy-tailed feature distributions (income, response time, transaction value), k-medians frequently produces more stable clusterings than k-means without requiring the switch to arbitrary-distance medoids. It is the natural choice when the L1 distance is the right metric and the data remains in R^p.

When to use each. k-means: Euclidean data, no severe outliers, speed matters. k-medians: Euclidean data, heavy-tailed features or outliers. k-medoids: non-Euclidean data (categorical, strings, graphs, any custom distance), interpretability of prototypes matters, and n is not enormous. The three share the alternating-optimisation structure and the same K-choosing problem.

Affinity propagation as a generalisation

Affinity propagation (Frey and Dueck, 2007) is a message-passing algorithm that identifies exemplars — points that are representative of the clusters — without requiring K to be supplied. It passes responsibility and availability messages between all pairs of points and converges on a set of exemplars that double as cluster prototypes. It is elegantly parameter-light, it handles non-Euclidean distances naturally, and it finds its own cluster count (controlled indirectly by a "preference" parameter). The cost is O(n²) memory and many iterations, limiting it to moderate datasets; for those datasets, affinity propagation is often a strong alternative to k-medoids with the cluster count pre-specified.

06

Hierarchical clustering

Hierarchical clustering does not produce a single partition of the data; it produces a tree of nested partitions — a dendrogram — that shows how points merge (or split) at every scale. The tree is often more informative than any single cut of it. For small-to-medium datasets where the right number of clusters is unclear, hierarchical clustering is usually the first method to try.

Agglomerative and divisive

Agglomerative (bottom-up) hierarchical clustering starts with every point as its own singleton cluster and repeatedly merges the two closest clusters until only one remains. Divisive (top-down) clustering starts with everything in one cluster and recursively splits it, usually via k-means with K = 2. Agglomerative is overwhelmingly more common because it is simpler to implement, runs at a predictable cost (O(n²) memory, O(n² log n) to O(n³) time depending on the linkage), and admits several natural definitions of cluster-to-cluster distance that the divisive form lacks. The output of either is a dendrogram — a tree where leaves are data points and internal nodes are merges, with merge heights equal to the distance at which the merge happened.

Linkage criteria

The single knob in agglomerative clustering is the linkage: how is the distance between two clusters defined? Single linkage uses the minimum distance between any two points, one from each cluster; it produces long chain-like clusters and suffers from the chaining effect, where a single bridge of outliers glues distant clusters together. Complete linkage uses the maximum distance; it produces tight compact clusters and tends to break large clusters apart. Average linkage (UPGMA) uses the mean pairwise distance; a reasonable compromise. Ward's linkage (Ward, 1963) minimises the increase in within-cluster variance at each merge, and it produces clusterings that most closely resemble k-means — Ward is often called the "k-means of hierarchical clustering" and is the default in scikit-learn. Picking the linkage is a choice about what kind of clusters you are looking for, and the four standard linkages can produce genuinely different dendrograms on the same data.
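A usage sketch with SciPy's hierarchy module on two synthetic blobs: linkage builds the merge tree, and fcluster cuts it into a flat partition.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

Z = linkage(X, method="ward")                    # one row per merge: ids, height, size
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into exactly 2 clusters
```

The same Z array feeds scipy.cluster.hierarchy.dendrogram for plotting, so the tree built once serves both the flat partition and the visualisation.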

Reading a dendrogram

The dendrogram is the main output of hierarchical clustering. Read it by drawing a horizontal line at a chosen height: every vertical line the horizontal crosses is a cluster, and the line's leaves are the cluster's members. Small changes in the height produce small changes in the number of clusters, which is itself a diagnostic — if cluster counts jump abruptly with tiny threshold changes, the clustering is unstable. The tallest un-split branches indicate the cleanest cluster separations; very short branches at the top usually indicate weak structure. Dendrograms also reveal hierarchical structure the flat-partition methods cannot: a clustering of species that contains "carnivore" as a super-cluster of "canid" and "felid" is a genuinely useful piece of information that k-means would collapse into three unrelated partitions.

The dendrogram is both an algorithm output and a visualisation. For datasets under a few thousand points, always plot it before committing to a flat partition; it tells you whether the clustering structure is nested, whether the cluster count is robust, and which points are unambiguously assigned versus boundary cases.

Scaling and limitations

The O(n²) memory cost of hierarchical clustering makes it impractical above n ≈ 50,000 — the full pairwise distance matrix becomes a bottleneck before any algorithm runs. BIRCH (Zhang, Ramakrishnan, Livny, 1996) is a scalable hierarchical alternative that maintains a compact in-memory summary (the CF-tree) and processes data in a single pass; it is the right choice when hierarchical structure matters at scale. For most real use cases, the workflow is: sample a subset, run agglomerative clustering on the sample to understand the structure and choose K, then run k-means on the full dataset with that K. This hybrid recipe combines hierarchical clustering's exploratory strength with k-means' scale.

07

DBSCAN

DBSCAN (Ester, Kriegel, Sander, Xu, 1996) is the canonical density-based clustering algorithm: it finds clusters as connected regions of high point density, allows clusters of arbitrary shape, does not require K to be specified, and explicitly labels sparse points as noise. It is the first method to reach for whenever the clusters are not going to be round.

The algorithm

DBSCAN takes two parameters: a distance threshold ε and a minimum point count minPts. A point is a core point if at least minPts points (including itself) lie within ε of it. A border point is within ε of a core point but has fewer than minPts neighbours itself. A noise point is neither. The algorithm picks an arbitrary unvisited point; if it is a core point, start a new cluster and recursively include all points within ε of any core point in the cluster; if not, label it as noise (it may be reassigned to a cluster later as a border point). Continue until every point has been visited. The output is a partition of the core and border points into clusters, with noise points set aside. The clusters take whatever shape the dense regions of the data happen to have.
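A usage sketch on the classic two-moons dataset, where no Voronoi partition works but density does (parameter values here are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)  # two interlocking arcs

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
# Label -1 marks noise; the remaining labels are the discovered clusters.
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
```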

What DBSCAN gets right

Three things that k-means gets wrong. Arbitrary shape: a dense S-curve or concentric rings are detected as coherent clusters, because density propagates along any path where neighbours exist. No K required: the number of clusters emerges from the data; you pick ε and minPts instead. Explicit noise: points in sparse regions are not forced into a cluster. A cluster of 20 customers who behave coherently and 5 idiosyncratic outliers is partitioned into a 20-point cluster plus 5 noise points, which is usually what you want. For geographic data, astronomical data, fraud-detection patterns, and any setting where some points just don't belong to any cluster, DBSCAN's noise-awareness is a structural advantage.

Choosing ε and minPts

The two parameters trade off in a specific way. minPts controls how many points constitute "dense" — smaller values find more clusters (and more noise); larger values find fewer, denser clusters. The standard heuristic is minPts = 2p or 2p + 1 where p is the feature-space dimension. ε controls what counts as "close". The standard choice: plot the k-distance graph (for each point, the distance to its k-th nearest neighbour, where k = minPts), sort the distances, and look for the elbow — the knee of the curve is a reasonable ε. Getting ε right is genuinely harder than getting K right in k-means, and DBSCAN is sensitive to it: too small and everything becomes noise; too large and separate clusters bleed together.
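The k-distance heuristic can be sketched with scikit-learn's nearest-neighbour index; the elbow of the sorted curve suggests ε:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)
min_pts = 5

nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
dists, _ = nn.kneighbors(X)              # column 0 is the point itself (distance 0)
k_dist = np.sort(dists[:, -1])[::-1]     # the k-distance curve; look for the elbow
```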

DBSCAN's failure mode. Clusters of very different densities. A dataset with one dense tight cluster and one diffuse cluster cannot be captured by any single ε: the small ε that finds the dense cluster breaks the diffuse one into fragments; the large ε that holds the diffuse one together merges the dense one with everything nearby. This is the problem HDBSCAN solves.

Complexity and scale

Naive DBSCAN is O(n²) because every point needs its neighbours computed. With a spatial index (KD-tree, ball tree, R-tree), the cost drops to O(n log n) in low dimensions, which is the regime where DBSCAN is most useful. At p > 20, the curse of dimensionality kicks in (every pair of points becomes roughly equidistant) and density-based methods struggle; this is where dimensionality reduction before clustering is essentially mandatory. For low-dimensional geographic or tabular data, DBSCAN scales comfortably to millions of points and is the right default.

08

HDBSCAN and OPTICS

DBSCAN's single ε parameter is its main weakness: clusters of varying density cannot be captured. HDBSCAN and OPTICS both fix this by making density a hierarchy rather than a threshold. HDBSCAN, in particular, has become the modern default density-based clusterer.

OPTICS

OPTICS (Ordering Points To Identify the Clustering Structure; Ankerst, Breunig, Kriegel, Sander, 1999) runs a DBSCAN-like sweep across all values of ε simultaneously by recording a reachability distance for each point — how far out you have to look to reach the point from the already-processed part of the dataset. The resulting reachability plot, a bar chart of reachability distances in the processing order, shows clusters as valleys: deep narrow valleys are dense tight clusters; wide shallow ones are diffuse clusters. Any horizontal cut of the plot at a given reachability gives a DBSCAN-like clustering; cuts at different heights give different-density clusterings. OPTICS is rarely used directly — the reachability plot is hard to automate — but it was the first algorithm to make the density hierarchy explicit.

HDBSCAN

HDBSCAN (Campello, Moulavi, Sander, 2013) replaces OPTICS's reachability plot with a more principled condensed cluster tree and an automatic extraction rule. The algorithm: (1) transform the data by replacing each point's distance to its neighbours with a mutual reachability distance that accounts for local density; (2) build a minimum spanning tree in the transformed space; (3) compute the hierarchical clustering tree of the MST; (4) condense the tree by collapsing clusters smaller than min_cluster_size; (5) extract the flat clustering by selecting, for each path through the condensed tree, the cluster that maximises a persistence/stability measure. The output: clusters of arbitrary shape, at arbitrary (and mixed) densities, with noise points explicitly labelled, with no ε parameter at all. The single knob is min_cluster_size, which is much easier to set sensibly than ε.
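Step (1), the mutual reachability transform, is small enough to write out: with core_k(x) the distance from x to its k-th nearest neighbour, d_mreach(a, b) = max(core_k(a), core_k(b), d(a, b)). A numpy sketch (here `k` stands in for HDBSCAN's density-smoothing parameter):

```python
import numpy as np

def mutual_reachability(X, k):
    """Pairwise mutual reachability: max(core_k(a), core_k(b), d(a, b))."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    core = np.sort(d, axis=1)[:, k]   # distance from each point to its k-th neighbour
    return np.maximum(d, np.maximum(core[:, None], core[None, :]))

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
mr = mutual_reachability(X, k=5)
# Edges touching a sparse point are inflated to at least its core distance,
# which is what makes the subsequent minimum spanning tree density-aware.
```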

Why HDBSCAN has displaced DBSCAN

HDBSCAN's advantages over DBSCAN are practical rather than theoretical: it handles varying density; it has one intuitive parameter instead of two; the parameter min_cluster_size directly controls what you usually care about (minimum useful cluster size); and it returns a soft cluster membership probability in [0, 1] as a bonus. The cost is some complexity in the implementation and slightly higher runtime than DBSCAN. In practice, HDBSCAN is now the default density-based clusterer in scientific Python (hdbscan package), and DBSCAN is mostly used when there is a specific reason to fix ε externally (e.g. a known physical distance scale in the data).

The density-based pipeline, 2026. For a new clustering problem where the shapes may be arbitrary: reduce dimensionality (UMAP to 5–15 dimensions is standard); run HDBSCAN with min_cluster_size chosen to match the smallest useful cluster; inspect the results; adjust min_cluster_size and the UMAP n_neighbors if the clusters seem wrong. This pipeline — UMAP + HDBSCAN — has become the de facto standard for exploratory unsupervised analysis, particularly in single-cell biology, NLP embeddings, and customer-analytics work.

Related density methods

Two other density-based methods are worth knowing. Mean-shift (Fukunaga and Hostetler, 1975; Comaniciu and Meer, 2002) treats each point as a seed and iteratively shifts it toward the mode of the local density (via kernel density estimation), converging on local density maxima that become cluster centres. Clusters are then defined by which mode each point converges to. Mean-shift is elegant and parameter-light (one bandwidth), but O(n²) per iteration limits it to smaller datasets. DENCLUE is a generalisation that uses arbitrary density-estimation kernels. For most practical work, HDBSCAN has absorbed the density-based niche; mean-shift persists in image-segmentation and computer-vision applications.

09

Gaussian mixture models

A Gaussian mixture model is k-means in its full probabilistic form: each cluster is a Gaussian distribution, each point belongs to every cluster with some probability, and the model is fitted by maximising the likelihood of the data. GMMs handle elliptical clusters that k-means cannot, give proper soft assignments, and plug into a principled model-selection framework via BIC.

The model

A Gaussian mixture model assumes the data was generated as follows: first, a hidden cluster label z ∈ {1, …, K} is drawn from a categorical distribution with probabilities π_1, …, π_K; then the observed point x is drawn from a Gaussian N(μ_{z}, Σ_{z}) specific to that cluster. The marginal density of x is then the mixture p(x) = Σ_k π_k · N(x; μ_k, Σ_k). A GMM fits this model by finding the parameters {π_k, μ_k, Σ_k} that maximise the log-likelihood Σ_i log p(x_i). Once fitted, the posterior responsibility γ_{ik} = p(z_i = k | x_i) ∝ π_k N(x_i; μ_k, Σ_k) is the probability that point i belongs to cluster k — a soft assignment that sums to 1 across clusters.
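The responsibility formula is a two-liner once the Gaussian density is written down; a one-dimensional numpy sketch (function names illustrative):

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def responsibilities(x, pi, mu, var):
    """gamma[i, k] = p(z_i = k | x_i) for a one-dimensional Gaussian mixture."""
    weighted = pi[None, :] * gauss(x[:, None], mu[None, :], var[None, :])
    return weighted / weighted.sum(axis=1, keepdims=True)

pi = np.array([0.5, 0.5])
mu = np.array([-2.0, 2.0])
var = np.array([1.0, 1.0])
gamma = responsibilities(np.array([-2.0, 0.0, 2.0]), pi, mu, var)
# Each row sums to 1; the point at 0 sits midway, so its row is [0.5, 0.5].
```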

Covariance structures

The covariance matrix Σ_k controls the shape of each Gaussian, and different choices make GMMs more or less flexible. Spherical (Σ_k = σ_k² I): each cluster is a round blob with its own radius; in the limit σ → 0, this is k-means. Diagonal (Σ_k diagonal): axis-aligned ellipsoids. Tied (all Σ_k equal): one shared Gaussian shape across clusters, the same assumption linear discriminant analysis makes about its classes. Full (arbitrary positive-definite Σ_k): each cluster is a general ellipsoid in any orientation; the most flexible, and the most prone to overfitting with few data points. The default in scikit-learn is "full" and it is usually right; switch to "diag" or "tied" if you are fitting a GMM to high-dimensional data with few samples.

GMM vs k-means

k-means is the MAP (hard) version of a GMM with spherical equal-variance covariances and uniform mixing weights. Every improvement GMMs add is a relaxation of one of these assumptions. Unequal variances: the GMM can fit a tight cluster and a diffuse one simultaneously; k-means cannot. Elliptical clusters: a GMM with full covariance fits any ellipsoid; k-means forces spherical. Soft assignments: the responsibility γ_{ik} is a real probability, not a binary indicator, and carries uncertainty about boundary points. Unequal priors: the π_k let small clusters coexist with large ones without getting absorbed. The cost is more parameters to fit and slower convergence; the benefit is a substantially more flexible model that plugs directly into the EM framework of the next section.

Practical gotcha. GMMs with full covariance can collapse: if a Gaussian gets reduced to one or two points, its covariance shrinks toward zero, the likelihood at those points becomes unbounded, and the model diverges. The fix is a small covariance regulariser added to the diagonal (scikit-learn's reg_covar, default 1e-6). Set it larger for small datasets; 1e-4 is a reasonable bump.

Model selection for GMMs

The GMM is a proper probabilistic model, so standard likelihood-based model-selection criteria apply. The Bayesian Information Criterion BIC = −2 log L + k log n penalises the log-likelihood by the number of parameters k and the sample size n; picking the K (and covariance structure) that minimises BIC is a principled answer to the number-of-clusters question. The Akaike Information Criterion (AIC) uses a weaker penalty; for clustering, BIC is usually preferred because it penalises complexity more heavily and produces fewer spurious clusters. This is the main reason GMMs are often chosen over k-means even when the clusters are approximately round: they let you choose K by optimisation rather than by heuristic.
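A minimal BIC sweep, assuming scikit-learn's GaussianMixture is available (the data and the range of K are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Three well-separated spherical clusters of 100 points each.
X = np.vstack([rng.normal(c, 0.3, size=(100, 2))
               for c in ([0, 0], [4, 0], [0, 4])])

bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
       for k in range(1, 7)}
best_k = min(bic, key=bic.get)
# BIC bottoms out at the true K: an extra component improves the likelihood a
# little, but its parameters pay the full k log n penalty.
```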

10

The EM algorithm

Expectation-Maximisation is the iterative scheme that fits latent-variable models — GMMs, hidden Markov models, latent Dirichlet allocation, any model where some of the variables are hidden and their values have to be inferred along with the parameters. It is one of the most broadly useful algorithms in machine learning, and the cleanest way to understand it is through its application to the GMM.

The problem EM solves

In a GMM, we want to find the parameters θ = {π_k, μ_k, Σ_k} that maximise the log-likelihood log p(X | θ) = Σ_i log Σ_k π_k N(x_i; μ_k, Σ_k). Direct maximisation is awkward: the log of a sum does not decouple across components, so setting the gradient to zero yields no closed-form solution, and plain gradient ascent has to handle the simplex constraint on π and the positive-definiteness of each Σ_k explicitly. The EM algorithm (Dempster, Laird, Rubin, 1977) sidesteps this by introducing the missing cluster labels z_i and alternating between inferring them given the parameters (E-step) and re-estimating the parameters given the inferred labels (M-step). Each iteration provably increases (or leaves unchanged) the log-likelihood, and the algorithm converges to a local maximum.

The two steps for a GMM

E-step (Expectation): given the current parameters, compute the responsibilities — the posterior probabilities of each hidden label — γ_{ik} = π_k N(x_i; μ_k, Σ_k) / Σ_j π_j N(x_i; μ_j, Σ_j). Each point gets a probability distribution over cluster memberships. M-step (Maximisation): given the responsibilities, update the parameters as weighted sample statistics. Each cluster's mean is the responsibility-weighted mean of the data; each cluster's covariance is the responsibility-weighted empirical covariance; each cluster's prior π_k is the average responsibility. Iterate until the log-likelihood change falls below a tolerance. This is the cleanest example of the general EM recipe, and it generalises directly to other latent-variable models.
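The two steps translate almost line-for-line into code. A one-dimensional sketch, assuming numpy, with quantile initialisation and a small variance floor playing the role of a covariance regulariser (all names illustrative):

```python
import numpy as np

def em_gmm_1d(x, k, iters=200):
    """EM for a one-dimensional Gaussian mixture; returns (pi, mu, var)."""
    pi = np.full(k, 1.0 / k)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # spread the initial means out
    var = np.full(k, x.var())
    for _ in range(iters):
        # E-step: responsibilities gamma[i, j] ∝ pi_j N(x_i; mu_j, var_j),
        # computed in log space for numerical stability.
        logw = (np.log(pi) - 0.5 * np.log(2 * np.pi * var)
                - 0.5 * (x[:, None] - mu) ** 2 / var)
        w = np.exp(logw - logw.max(axis=1, keepdims=True))
        gamma = w / w.sum(axis=1, keepdims=True)
        # M-step: parameters become responsibility-weighted sample statistics.
        nk = gamma.sum(axis=0)
        pi = nk / len(x)
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return pi, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 1, 300), rng.normal(3, 1, 700)])
pi, mu, var = em_gmm_1d(x, k=2)
# mu converges near (-3, 3) and pi near (0.3, 0.7).
```

A fixed iteration count keeps the sketch short; a real implementation stops when the log-likelihood change falls below a tolerance, as the text describes.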

Why EM works

EM can be understood through the evidence lower bound (ELBO). The log-likelihood log p(X | θ) satisfies, for any distribution q(Z) over the hidden variables, log p(X | θ) ≥ E_q[log p(X, Z | θ)] − E_q[log q(Z)]. The right-hand side is the ELBO; the gap between it and the true log-likelihood is the KL divergence KL(q ‖ p(Z | X, θ)). The E-step sets q(Z) = p(Z | X, θ), making the gap zero; the M-step maximises the ELBO over θ. Each step moves the ELBO up, and each E-step tightens the bound against the true likelihood. This framing is what connects EM to variational inference, where the E-step's exact posterior is replaced by a tractable approximation.

EM is coordinate ascent on the ELBO: alternate between tightening the bound (E-step: infer hidden variables) and pushing up the parameters under the bound (M-step: maximise over parameters). Every latent-variable model in classical ML — GMMs, HMMs, LDA, factor analysis, naive Bayes with missing data — uses this template.

EM's limitations

EM has three main weaknesses. Local maxima: EM converges to the nearest local maximum, which depends on initialisation; run multiple restarts and pick the solution with the highest log-likelihood. Slow convergence: near the optimum, EM behaves like a first-order method; quasi-Newton and stochastic variants exist but complicate the implementation. Intractable E-steps: EM needs the exact posterior over the hidden variables, which for complex graphical models is unavailable; variational EM (with an approximate posterior) or sampling (MCMC) takes over. For GMMs, none of these is prohibitive, but they become central in the more complex latent-variable models of Part V. The lesson: understand EM at the GMM level and you have the scaffolding for almost every probabilistic model you will see later.

11

Spectral clustering

Spectral clustering takes a different route to non-convex clusters: represent the data as a graph of nearest neighbours, then cluster in the space of the top eigenvectors of the graph Laplacian. It handles arbitrary shapes that k-means cannot, without the single-density assumption of DBSCAN, and it's one of the cleanest applications of linear algebra to unsupervised learning.

From points to a similarity graph

The first step is to build a similarity graph on the data: each point is a vertex, and edges connect similar points. The standard construction is a k-nearest-neighbour graph (each point connected to its k nearest neighbours) or a Gaussian (RBF) graph (edge weights w_{ij} = exp(−‖x_i − x_j‖² / 2σ²) for all pairs). The graph encodes the local structure of the data, and the key shift is that clustering now happens on this graph, not on the original coordinates. Two points can be close on the graph (many short paths between them) even if they are far in Euclidean distance, which is exactly what lets spectral clustering find elongated clusters and manifold structure.

The graph Laplacian

Given adjacency matrix W and degree matrix D (diagonal, with D_{ii} = Σ_j W_{ij}), the unnormalised Laplacian is L = D − W. The normalised symmetric Laplacian is L_sym = I − D^(−1/2) W D^(−1/2). Both have a critical property: the multiplicity of the eigenvalue 0 equals the number of connected components of the graph, and the corresponding eigenvectors are indicator vectors on the components. If the graph has exactly K perfectly-disconnected clusters, the K eigenvectors with eigenvalue 0 trivially reveal them. When the graph is not perfectly disconnected (real data is always a little noisy), the eigenvectors of the K smallest eigenvalues still embed the clusters in a low-dimensional space where they become approximately linearly separable.

The algorithm

Spectral clustering (Ng, Jordan, Weiss, 2002; Shi and Malik, 2000) proceeds as follows: (1) build the similarity graph W; (2) compute the Laplacian L; (3) find the eigenvectors corresponding to the K smallest eigenvalues; (4) stack them as columns of an n × K matrix; (5) cluster the rows of that matrix with k-means. The last step is usually plain k-means, which works well because in the spectral-embedded space the clusters have become round and linearly separable even when they were arbitrarily shaped in the original data. The algorithm handles concentric rings, two moons, elongated manifolds, and any other non-convex structure that k-means alone cannot.
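The five steps, in a numpy-only sketch with a hand-rolled Lloyd loop for step (5) (the farthest-first seeding is an implementation convenience that keeps the toy example deterministic, not part of the algorithm):

```python
import numpy as np

def spectral_clustering(X, n_neighbors, k):
    """Minimal spectral clustering: kNN graph -> Laplacian -> embed -> Lloyd."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = np.zeros((n, n))
    nn = np.argsort(d, axis=1)[:, 1:n_neighbors + 1]    # skip self (column 0)
    for i in range(n):
        W[i, nn[i]] = 1.0
    W = np.maximum(W, W.T)                              # symmetrise the graph
    L = np.diag(W.sum(axis=1)) - W                      # unnormalised Laplacian
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]                                     # k smallest eigenvalues
    # Lloyd's algorithm on the embedded rows, with farthest-first seeding.
    idx = [0]
    for _ in range(1, k):
        gap = np.min(np.linalg.norm(U[:, None] - U[idx], axis=-1), axis=1)
        idx.append(int(np.argmax(gap)))
    centres = U[idx]
    for _ in range(50):
        labels = np.argmin(np.linalg.norm(U[:, None] - centres, axis=-1), axis=1)
        centres = np.array([U[labels == j].mean(axis=0) for j in range(k)])
    return labels

theta = np.linspace(0, 2 * np.pi, 100, endpoint=False)
ring = np.c_[np.cos(theta), np.sin(theta)]
labels = spectral_clustering(np.vstack([ring, 3 * ring]), n_neighbors=5, k=2)
# Each ring is recovered as one cluster, a partition k-means on the raw
# coordinates cannot produce.
```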

When to reach for spectral. Small-to-medium datasets (n < 10^4) with visibly non-convex clusters, or clusters on a manifold. Spectral clustering is O(n³) in the standard implementation (eigendecomposition of an n × n matrix), which limits scale. For larger problems, use Nyström approximation of the Laplacian or fall back to HDBSCAN on a UMAP embedding, which achieves similar ends through a different route.

The tuning knob

The single most consequential parameter is the graph construction — specifically, the k in the k-NN graph or the bandwidth σ in the RBF kernel. Too sparse and the graph disconnects into spurious components (spectral clustering will pick those up as clusters); too dense and the clusters smear together. The k in the k-NN graph is typically between log n and 20; the σ for RBF is often set to the median pairwise distance or tuned by cross-validation against an internal criterion. Unlike the choice of K (which can be automated via eigengap analysis — the gap between λ_K and λ_{K+1} signals the right number), the graph-construction parameter has to be set with care.

12

Choosing the number of clusters

The single hardest question in clustering is how many clusters there are. Several families of heuristics address it — elbow plots, silhouette scores, the gap statistic, information criteria — and none of them gives a clean answer. A practitioner's real answer is usually some mixture of these metrics, stability analysis, and the downstream use case.

The elbow method

Plot the within-cluster sum of squares J(K) (or whatever objective the algorithm uses) as a function of K, and look for a knee or elbow — the point where the curve transitions from steep to shallow. The intuition: adding a cluster beyond the true K only explains noise, so the marginal reduction in J drops substantially. The elbow is the oldest heuristic for choosing K, and it is the one practitioners reach for first. It is also notoriously subjective — on real data, the curve is often smooth with no clear knee — so it is best treated as the first step of a broader analysis rather than a final answer.
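A minimal elbow sweep, assuming scikit-learn's KMeans (data illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(60, 2))
               for c in ([0, 0], [3, 0], [0, 3])])
inertia = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
           for k in range(1, 9)]
# Plotted against K, J(K) falls steeply up to K = 3 and then flattens: past the
# true K, each extra centroid only shaves noise off an existing cluster.
```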

Silhouette score

The silhouette coefficient (Rousseeuw, 1987) for a point i is s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the average distance from i to the other points in its own cluster and b(i) is the average distance from i to the points in the nearest other cluster. s(i) ∈ [−1, 1]: close to 1 means well-clustered, close to 0 means boundary, negative means probably in the wrong cluster. The average silhouette across the dataset is a single number quantifying cluster quality, and the K that maximises it is a reasonable choice. Silhouette works across algorithms (not tied to a particular objective), is interpretable at the point level (silhouette plots are genuinely informative), and is the most widely-used internal metric. Its main weakness: it favours convex round clusters, so it undervalues HDBSCAN or spectral-clustering outputs on non-convex data.
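The silhouette is easy to compute straight from the definition; a numpy sketch comparing a real partition against a deliberately bad one (names illustrative):

```python
import numpy as np

def mean_silhouette(X, labels):
    """Average silhouette coefficient, computed directly from the definition."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    ks = np.unique(labels)
    n = len(X)
    s = np.empty(n)
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        a = d[i, own].mean()                                # cohesion: a(i)
        b = min(d[i, labels == k].mean()                    # separation: b(i)
                for k in ks if k != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
good = np.repeat([0, 1], 50)   # the true two-blob partition
bad = np.tile([0, 1], 50)      # alternating labels, cutting through both blobs
# mean_silhouette(X, good) is close to 1; mean_silhouette(X, bad) is near zero.
```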

The gap statistic

The gap statistic (Tibshirani, Walther, Hastie, 2001) compares the within-cluster sum of squares on the real data to the expected WCSS under a null reference distribution (usually uniform in the bounding box of the data) at each K. If W_k is the real WCSS and E[W_k] is the null expectation, the gap is log E[W_k] − log W_k. The selection rule: pick the smallest K such that Gap(K) ≥ Gap(K+1) − s_{K+1}, where s_{K+1} is the simulation standard error of the null estimate. This method is more principled than the elbow (it has a null model baked in) and often works where the elbow is ambiguous. The cost: computing E[W_k] requires Monte Carlo simulation with many null datasets.

BIC and AIC for GMMs

For model-based clustering (GMMs), choosing K is a proper model-selection problem. BIC = −2 log L + k log n (where k is the parameter count) and AIC = −2 log L + 2k both trade off fit against complexity. Pick the K minimising BIC for most work; use AIC only when you expect to have many more clusters than the data can support (AIC penalises complexity less). Both naturally handle the covariance-structure choice (spherical vs diagonal vs full) as part of the same optimisation. This is the main methodological advantage of GMMs over k-means: you don't have to guess K, you optimise it.

The honest answer to "how many clusters" is "try several and compare". Run k-means for K ∈ {2, …, 20}, compute silhouette and gap at each, fit a GMM with BIC model selection, inspect the dendrogram if hierarchical clustering is feasible, and look at the top three or four candidates by eye. The right K is usually one or two of the contenders. If the metrics disagree wildly, your clusters may not exist.

The hardest case: no clusters

Not every dataset has clusters. A well-behaved continuum — a single Gaussian blob in the middle of feature space — will be happily partitioned by k-means into arbitrary equal-area slices; the elbow plot will be smooth, the silhouette will be weak, and a GMM's BIC may keep drifting down as extra components model the shape of a single distribution rather than distinct clusters. The diagnostic: run the clustering on the data and on a null dataset (permute each feature independently to break any cluster structure), and compare. If the metrics look similar, there are no clusters. Consensus clustering (Section 14) formalises this: if small perturbations of the data change the cluster assignments dramatically, the clustering is not stable and is probably not real.

13

Evaluating a clustering, with and without labels

Choosing K was one evaluation problem. A separate one is: given a clustering, is it any good? The metrics split cleanly in two — internal indices use only the data and the clustering, external indices compare to known labels when you happen to have them.

Internal indices

Internal metrics measure properties of the clustering itself, typically in terms of compactness (points within a cluster are close) and separation (points across clusters are far). The three standards are the silhouette coefficient (already defined in Section 12), the Davies–Bouldin index, and the Calinski–Harabasz index. Davies–Bouldin averages, over clusters, the ratio of within-cluster scatter to between-cluster separation — smaller is better. Calinski–Harabasz (also called the variance-ratio criterion) is the ratio of between-cluster dispersion to within-cluster dispersion normalised by cluster count — larger is better. All three have a bias toward convex, equal-variance clusters because they use Euclidean geometry, so they systematically undervalue the solutions of DBSCAN, HDBSCAN, and spectral clustering on non-convex data.
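Both of the newer indices ship with scikit-learn; a quick check, assuming sklearn is available, that each ranks a true partition above a deliberately bad one:

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
good = np.repeat([0, 1], 50)   # the true two-blob partition
bad = np.tile([0, 1], 50)      # alternating labels, cutting through both blobs

db_good, db_bad = davies_bouldin_score(X, good), davies_bouldin_score(X, bad)
ch_good, ch_bad = calinski_harabasz_score(X, good), calinski_harabasz_score(X, bad)
# Lower Davies–Bouldin and higher Calinski–Harabasz both single out the true split.
```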

External indices: comparing to ground truth

If you have labels — either because the dataset was labelled for a supervised problem and you want to see whether clustering rediscovers them, or because you've set up a controlled benchmark — you can measure how well the clustering agrees with the labels. The naive metric is purity: for each cluster, take the fraction of points belonging to its majority class, then average over clusters weighted by size. Purity is easy to understand but biased: you can always maximise it by producing many tiny clusters.

The two metrics to actually use are adjusted Rand index (ARI) and normalised mutual information (NMI). The Rand index counts the fraction of point pairs that are either in the same cluster in both partitions or in different clusters in both; ARI corrects for chance by subtracting the expected value under random labellings. ARI is 1 for identical clusterings, 0 for random agreement, and can be slightly negative for worse-than-random. NMI is the mutual information between cluster and label, normalised by the geometric or arithmetic mean of the two entropies; it is 1 for identical partitions and 0 for statistical independence. ARI penalises cluster-number mismatches more strongly than NMI; NMI is more forgiving when a clustering is a refinement of the labels (splits a true class into two).
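A small worked comparison, assuming scikit-learn's metrics module:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
same = [2, 2, 2, 0, 0, 0, 1, 1, 1]    # identical partition under renamed labels
split = [0, 0, 0, 1, 1, 1, 2, 2, 3]   # one true class split into two clusters

ari_same = adjusted_rand_score(truth, same)     # 1.0: label names are irrelevant
ari_split = adjusted_rand_score(truth, split)   # below 1: pair-counting penalty
nmi_split = normalized_mutual_info_score(truth, split)
# nmi_split exceeds ari_split: NMI is gentler on refinements of the true classes.
```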

Stability and reproducibility

A third class of evaluation measures how reproducible the clustering is under perturbation — different random seeds, bootstrap resamples, added noise, feature subsampling. Stable clusterings are, other things equal, more trustworthy than unstable ones. Section 14 makes this concrete with consensus clustering. For now: always report a clustering along with how sensitive it is to seed and sample.

What the metric does not tell you

Every clustering metric optimises some geometric or statistical property. None of them knows what the clusters mean. A clustering can have a high silhouette, high ARI against the labels you have, and still be useless for the downstream task because the labels you have aren't the structure that matters. Always close the loop with domain experts or with a downstream application metric.

Reporting

At minimum report: the algorithm and its hyperparameters, the feature set and how it was scaled, the distance or similarity metric, K if applicable, the random seed, and one or two internal metrics. If ground truth exists, add ARI and NMI. If stability matters, add a stability score across B bootstrap runs. Include a 2-D projection (PCA or UMAP, reproducible seed) with the clusters coloured, so a reader can see what you saw.

14

Stability and consensus: is this clustering reproducible?

Unsupervised results are easy to overinterpret. Two runs with different seeds can produce rather different clusterings, and if you only ever looked at one, you'd never know. Stability analysis asks how much the output changes under small perturbations — resampled data, added noise, different initialisation — and consensus clustering aggregates many runs into a single more trustworthy partition.

Why stability matters

Suppose you run k-means with K = 5 on a dataset and you run it again on a random 80 % subsample of the same dataset. If the two results agree on most points, the five clusters are a real feature of the data. If the two results disagree on most points, the five clusters are an artefact of the particular points you happened to include, and the algorithm is fitting noise at the boundaries. Stability is therefore a proxy for signal that requires no ground truth.

Bootstrap and subsampling

The procedure: draw B subsamples of the data (either bootstrap with replacement, or uniform without replacement at 80 % size), cluster each, and compare the resulting partitions on the points they share. Pairwise ARI between the B partitions gives a stability distribution; its mean (or median) is the summary. A mean pairwise ARI above about 0.75 is usually considered stable; below 0.5 is shaky. This is also one of the best ways to pick K: choose the K that is most stable across resamples (Ben-Hur, Elisseeff, Guyon, 2002).
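The whole procedure is a short loop, assuming scikit-learn for k-means and ARI (the helper `stability` is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability(X, k, n_runs=10, frac=0.8, seed=0):
    """Mean pairwise ARI of k-means across random subsamples (shared points only)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    runs = []
    for r in range(n_runs):
        sub = rng.choice(n, size=int(frac * n), replace=False)
        km = KMeans(n_clusters=k, n_init=10, random_state=r).fit(X[sub])
        runs.append(dict(zip(sub.tolist(), km.labels_.tolist())))
    scores = []
    for a in range(n_runs):
        for b in range(a + 1, n_runs):
            shared = sorted(set(runs[a]) & set(runs[b]))
            scores.append(adjusted_rand_score([runs[a][i] for i in shared],
                                              [runs[b][i] for i in shared]))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, size=(80, 2))
               for c in ([0, 0], [4, 0], [2, 4])])
# At the true K the partition barely moves across subsamples; at an inflated K
# the extra boundaries wander: stability(X, 3) is near 1, stability(X, 7) is not.
```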

Consensus clustering

Consensus clustering (Monti et al., 2003) formalises the idea that agreement across many runs is more trustworthy than any single run. For each pair of points, compute the fraction of bootstrap runs in which they landed in the same cluster — this is the consensus matrix, an n × n co-occurrence matrix with entries in [0, 1]. A final clustering is obtained by hierarchical clustering on 1 − consensus as a distance, or by spectral clustering on the consensus matrix treated as similarity. The consensus matrix also visualises well: sorted by the final clusters, it should have sharp high-value blocks on the diagonal and low off-diagonal values. If the blocks are fuzzy, the clustering is not stable.

Choosing K via stability

For each candidate K, run B bootstrap clusterings, compute the mean pairwise ARI across runs at that K, and plot ARI vs K. The most stable K is often a sharper signal than the elbow or the silhouette — especially when the clusters exist but are slightly uneven in size or shape.

When stability disagrees with the metric

The most informative case: silhouette prefers one K, stability prefers another. Usually this means silhouette is rewarding a clustering that fits the data well on average but carves through a real cluster, while stability is rewarding the clustering that cleanly separates the populations but leaves some points in boundary regions. Which is right depends on the downstream task; seeing the disagreement is more valuable than either number alone.

15

Clustering in high dimensions: the curse bites hardest here

In high dimensions, everything is far from everything else, and "far" starts to mean nothing in particular. Euclidean distance concentrates, nearest-neighbour rankings become unstable, and every clustering algorithm that depends on a distance metric becomes dubious. This section is about what breaks, why, and how to work around it.

Distance concentration

For random points in d-dimensional space, as d → ∞ the ratio of the maximum pairwise distance to the minimum pairwise distance tends to 1. In words: every point becomes equidistant from every other point. Distance-based clustering depends on the relative ordering of distances, and in high dimensions that ordering becomes noisy. Beyond about d = 50 for typical problems, k-means and DBSCAN start to behave erratically and their outputs depend heavily on the particular features included. This is the curse of dimensionality for clustering.
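The concentration effect is easy to demonstrate; a numpy sketch comparing the relative contrast (max − min)/min of pairwise distances in 2 and 1000 dimensions:

```python
import numpy as np

def relative_contrast(d, n=500, seed=0):
    """(max - min) / min over pairwise distances of n uniform random points."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, d))
    sq = (X ** 2).sum(axis=1)
    # Squared distances via the Gram trick; clip tiny negatives from rounding.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    pair = np.sqrt(d2[np.triu_indices(n, k=1)])
    return (pair.max() - pair.min()) / pair.min()

low, high = relative_contrast(2), relative_contrast(1000)
# In 2-D the farthest pair is orders of magnitude farther than the nearest;
# in 1000-D every pair sits at nearly the same distance.
```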

The standard fix: reduce first, cluster later

The most common fix is to run dimensionality reduction (covered in depth in Chapter 05) before clustering. PCA is cheap and captures variance — cluster the first 10–50 principal components rather than the 1000 raw features. UMAP and t-SNE produce 2-D or low-D embeddings specifically designed to preserve local structure; running HDBSCAN on a UMAP embedding is a strong default for exploratory work on high-dimensional data. Autoencoders learn a nonlinear low-dimensional representation that can be clustered with the same methods as the raw data. The caveat in all three cases is that the reduction is non-trivial — PCA emphasises variance, UMAP emphasises local neighbourhoods, autoencoders emphasise reconstruction — and different clusterings will appear depending on what was preserved.

Subspace clustering

Sometimes different clusters live in different subspaces of the feature space — cluster A is tight in features 1, 5, 7 but diffuse in the others; cluster B is tight in features 3, 8, 10. No single global distance metric captures both. Subspace clustering methods (CLIQUE, SUBCLU, PROCLUS) search for clusters together with the subspaces in which they exist. The combinatorial search is expensive and the outputs are harder to interpret, but for genuine high-dimensional cluster structure (common in gene-expression and text data) they outperform global methods.

Metric choice matters more

In low dimensions, Euclidean is the default and other metrics make small differences. In high dimensions, switching from Euclidean to cosine distance (for direction, not magnitude) or to Mahalanobis distance (which whitens against covariance) can make the difference between a usable clustering and an unusable one. For sparse, high-dimensional data (documents, one-hot-encoded categoricals, user-item matrices), cosine is nearly always correct. For dense, high-dimensional continuous data, Mahalanobis or a learned metric is often needed. Section 02 flagged this; it becomes mandatory once d is large.
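A three-vector illustration of why cosine wins on sparse count data (the "documents" here are illustrative term-count vectors):

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

short = np.array([1.0, 2.0, 0.0])    # a brief document's term counts
long = np.array([10.0, 20.0, 0.0])   # same topic mix, ten times the words
other = np.array([0.0, 0.0, 5.0])    # a different topic entirely

# Euclidean distance puts the two same-topic documents far apart (length
# dominates) and ranks the off-topic document as the closer neighbour;
# cosine distance sees the same-topic pair as identical in direction.
```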

Clustering on embeddings

Modern practice: for text, image, or graph data, don't cluster the raw features. Encode each item with a pretrained neural network — a sentence transformer, a vision model, a graph neural network — and cluster the resulting embeddings. The embedding moves similar items close in Euclidean or cosine space, by construction, so the downstream clustering is doing something meaningful. This is the workflow behind every modern topic-discovery, document-retrieval, and customer-segmentation system.

Density methods handle high-D worse than you expect

DBSCAN and HDBSCAN are sometimes sold as the answer to high-dimensional clustering, but they depend on local density, and in high dimensions every point's local density looks the same. Without reduction, they tend to put everything in one cluster or everything in noise, depending on ε. Always pre-reduce before density-based methods in more than 20 or so dimensions.

16

Clustering in practice: a recipe for a new dataset

You have a dataset and the goal "find structure". This is the recipe — an ordered sequence of steps that will, nine times out of ten, get you from raw data to a defensible clustering. The tenth time is when the data genuinely has no clusters, which is itself a useful finding.

Step 1: understand the data

Before any clustering, answer: what is a row? What is a column? What are the units? Are there missing values, and if so, are they informative or random? Are there obvious outliers that should be handled, and how? What do the domain experts expect the structure to look like? Clustering answers that come out of a vacuum are nearly always wrong; spending an hour here saves a day later.

Step 2: preprocess and scale

Standardise continuous features (z-score or min-max), encode categoricals (one-hot for low cardinality, target-encoded or embedded for high), decide how to handle missing values (imputation, indicators, or drop rows), and pick a distance metric based on feature types. If features are on wildly different scales and you haven't standardised, every distance-based clusterer is secretly clustering on the biggest-variance feature.
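The biggest-variance failure is easy to show; a numpy sketch measuring each feature's share of the total pairwise squared distance before and after z-scoring (the feature names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.normal(30_000, 10_000, size=200)   # variance ~ 1e8
age = rng.normal(40, 12, size=200)              # variance ~ 1.4e2
X = np.c_[income, age]

def feature_share(X):
    """Each feature's share of the total pairwise squared Euclidean distance."""
    per_feature = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=(0, 1))
    return per_feature / per_feature.sum()

raw = feature_share(X)                       # income contributes ~100 %
Z = (X - X.mean(axis=0)) / X.std(axis=0)     # z-score both columns
scaled = feature_share(Z)                    # an even 50/50 split
```

Before scaling, any distance-based clusterer on X is effectively clustering on income alone; after z-scoring, both features carry equal weight.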

Step 3: reduce dimensionality if needed

If d is more than about 50, run PCA to capture ~90 % of variance and cluster the PCA scores. For exploratory work, produce a UMAP embedding for visualisation regardless of whether you cluster on it. Keep both the raw-feature clustering and the embedded clustering and compare them.
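scikit-learn's PCA accepts a float n_components as a variance fraction, which makes the "capture ~90 % of variance" step a one-liner; a minimal sketch of reduce-then-cluster:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def pca_then_kmeans(X, var=0.90, k=5, seed=0):
    """Standardise, keep enough components for ~var of the variance,
    then cluster the PCA scores."""
    Xs = StandardScaler().fit_transform(X)
    # A float n_components means "smallest number of components whose
    # cumulative explained variance exceeds var"; requires the full solver.
    Z = PCA(n_components=var, svd_solver="full").fit_transform(Xs)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z)
    return labels, Z
```

For the UMAP half of this step you would use the separate umap-learn package; it is omitted here to keep the sketch dependency-light.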

Step 4: try three algorithms

Run three clusterers as a baseline: k-means (the prior-free default), hierarchical with Ward linkage (produces the dendrogram, lets you pick K after the fact), and HDBSCAN (finds whatever density-based structure exists without you specifying K). If two or more of them agree strongly, that is a sign of robust structure. If they disagree, inspect the disagreements — they tell you about the shape of your data.
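A minimal version of the three-clusterer baseline, with DBSCAN standing in for HDBSCAN so the sketch runs on any scikit-learn version; pairwise agreement is measured by adjusted Rand index:

```python
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import adjusted_rand_score

def baseline_trio(X, k=3, eps=0.5):
    """Fit three clusterers and report pairwise label agreement (ARI)."""
    labels = {
        "kmeans": KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X),
        "ward": AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X),
        "density": DBSCAN(eps=eps, min_samples=5).fit_predict(X),
    }
    agreement = {(a, b): adjusted_rand_score(labels[a], labels[b])
                 for a in labels for b in labels if a < b}
    return labels, agreement
```

An ARI near 1 between two of the methods is the "agrees strongly" signal the text describes; an ARI near 0 means the two partitions are no more alike than chance.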

Step 5: pick K (or don't)

For k-means and hierarchical, sweep K ∈ {2, …, 20}, compute silhouette and gap, run bootstrap stability at each K, and pick a K supported by at least two of the three signals. For density methods, tune min_cluster_size instead of K and let the algorithm choose K.
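A sketch of the silhouette half of that sweep (the gap statistic and bootstrap stability checks are omitted here for brevity, but they follow the same loop):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k_by_silhouette(X, k_range=range(2, 21), seed=0):
    """Sweep K, score each partition by mean silhouette, return the best."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores
```

In line with the text's advice, treat the winner as one vote among three, not a verdict: a K that wins on silhouette but is unstable under bootstrap resampling should not be trusted.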

Step 6: evaluate and interpret

Compute internal metrics (silhouette, DB, CH); if labels exist, ARI and NMI; report stability across bootstrap runs. Then — and this is the step everyone skips — look at the clusters. Compute per-cluster summary statistics on the original features. Plot examples. Ask a domain expert to name each cluster. If you can't name them, you don't have clusters; you have a partition.

A clustering that nobody can name is a clustering that nobody should ship. The quantitative metrics tell you that the partition is mathematically defensible; only the qualitative interpretation tells you that it is useful.
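In code, the quantitative half of step 6 is a few scikit-learn one-liners, and the "look at the clusters" half starts from per-cluster summaries on the original features:

```python
import pandas as pd
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

def internal_metrics(X, labels):
    """Silhouette (higher better), Davies-Bouldin (lower better),
    Calinski-Harabasz (higher better)."""
    return {"silhouette": silhouette_score(X, labels),
            "davies_bouldin": davies_bouldin_score(X, labels),
            "calinski_harabasz": calinski_harabasz_score(X, labels)}

def cluster_profile(df, labels):
    """Per-cluster means on the original features, plus cluster sizes,
    as the starting point for naming each cluster."""
    grouped = df.groupby(pd.Series(labels, index=df.index, name="cluster"))
    profile = grouped.mean()
    profile["size"] = grouped.size()
    return profile
```

The profile table, not the metric dictionary, is what you put in front of the domain expert.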

Step 7: downstream use

If the clustering is going into a product — customer segmentation, anomaly detection, pre-labelling for a supervised problem — set up an evaluation harness with a downstream metric (conversion rate per segment, precision of anomaly flags, label efficiency), and monitor cluster stability over time as new data arrives. Clusters drift as the underlying population changes; re-fit on a schedule and watch for degradation.

When to stop and say "there are no clusters"

If, across three algorithms, metrics don't agree on a good K; if bootstrap stability is low at every K; if the consensus matrix has no blocks; if domain experts look at the clusters and can't name them — accept that the data is a continuum, not a set of groups. This is a real finding. Report it. Move on to regression, density modelling, or a different framing of the problem.


Clustering inside modern machine learning

Clustering rarely appears as the final product of a large ML system. It appears as a subroutine — inside vector quantisation, inside tokenisation, inside retrieval, inside self-supervised learning, inside semi-supervised labelling — and the clusters themselves are the outputs that drive other components. This section is a tour of where clustering hides in systems that don't obviously use it.

Vector quantisation

Vector quantisation (VQ) replaces a continuous vector with the index of its nearest cluster centre from a fixed codebook. This is k-means in disguise: run k-means on a training set of vectors, keep the centroids as the codebook, and at inference replace each vector with its nearest centroid index. VQ underlies product quantisation for approximate nearest-neighbour search (FAISS), and it underlies VQ-VAE and the discrete codes used by AudioLM, MusicLM, and many image generation models. The core idea is the same as Lloyd's algorithm from 1957; the context is different.
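The whole VQ pipeline fits in a few lines; the brute-force nearest-code search below is the toy version of what product quantisation makes fast at scale:

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(X, n_codes=256, seed=0):
    """k-means centroids as the codebook."""
    return KMeans(n_clusters=n_codes, n_init=4, random_state=seed).fit(X).cluster_centers_

def vq_encode(X, codebook):
    """Replace each vector with the index of its nearest code."""
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(1)

def vq_decode(codes, codebook):
    """Reconstruct: each index becomes its centroid."""
    return codebook[codes]
```

Encode-then-decode is lossy by design: the reconstruction error is exactly the k-means within-cluster distortion, which is what Lloyd's algorithm minimises.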

Tokenisation as clustering

BPE, WordPiece, and SentencePiece — the tokenisers used by every modern language model — are agglomerative clustering algorithms in disguise. They start with characters and greedily merge the most-frequent adjacent pairs into new tokens, producing a hierarchical vocabulary. The training data gets segmented into clusters of characters that occur together often. The token embeddings are then learned on top of this clustering. If you squint, training a language model is a two-stage procedure: cluster the character sequences, then learn dense representations on the cluster indices.
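A toy version of the greedy merge loop, run on a single string rather than a frequency-weighted corpus, with naive adjacent-pair counting (real BPE counts within word boundaries and weights by word frequency):

```python
from collections import Counter

def bpe_merges(word, n_merges):
    """Toy BPE: start from characters, repeatedly merge the most
    frequent adjacent pair into a new token."""
    tokens = list(word)
    merges = []
    for _ in range(n_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:          # nothing worth merging
            break
        merges.append(a + b)
        merged, i = [], 0
        while i < len(tokens):  # left-to-right replacement of the pair
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b); i += 2
            else:
                merged.append(tokens[i]); i += 1
        tokens = merged
    return tokens, merges
```

The merges list is the learned vocabulary, in order; the agglomerative flavour is visible in how later tokens ("abab") are built from earlier merges ("ab").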

Retrieval and recommendation

Large-scale retrieval systems (Google's image search, a recommender's candidate generator) pre-cluster billions of items into an inverted index based on embedding similarity, then at query time search only the top-matching clusters. This is k-means on vectors, at internet scale, often with hierarchical cluster trees or product quantisation on top. The clusters are invisible to the user; they make the system a thousand times faster.
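A toy inverted-file (IVF) index makes the structure concrete: a k-means coarse quantiser, one posting list per centroid, and a search that probes only the nprobe nearest lists. This is an illustrative sketch of the idea, not the FAISS API:

```python
import numpy as np
from sklearn.cluster import KMeans

class IVFIndex:
    """Toy inverted-file index: k-means coarse quantiser + per-cluster lists."""

    def __init__(self, X, n_lists=16, seed=0):
        self.X = X
        self.km = KMeans(n_clusters=n_lists, n_init=4, random_state=seed).fit(X)
        # One posting list of item indices per coarse centroid.
        self.lists = {c: np.where(self.km.labels_ == c)[0]
                      for c in range(n_lists)}

    def search(self, q, k=5, nprobe=2):
        # Rank coarse centroids by distance to the query, probe the top nprobe.
        d2c = ((self.km.cluster_centers_ - q) ** 2).sum(1)
        cand = np.concatenate([self.lists[c] for c in np.argsort(d2c)[:nprobe]])
        # Exact search only within the candidate lists.
        d2 = ((self.X[cand] - q) ** 2).sum(1)
        return cand[np.argsort(d2)[:k]]
```

The speedup comes from the candidate set: with n_lists coarse clusters and nprobe probes, the exact distance computation touches roughly nprobe/n_lists of the database instead of all of it, at the cost of occasionally missing a true neighbour that landed in an unprobed list.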

Self-supervised learning and pseudo-labelling

SwAV, DeepCluster, and related self-supervised vision methods use clustering as a pretext task: cluster the unlabelled images into K groups, treat the cluster assignments as pseudo-labels, train a network to predict them, use the network's new features to re-cluster, iterate. The cluster labels are throwaway — what is learned is the representation. This is a modern application of the old idea that structure in the input, even without labels, contains signal.

Prototype-based few-shot learning

Prototypical networks (Snell, Swersky, Zemel, 2017) represent each few-shot class as the mean of the support-set embeddings — a one-centroid cluster per class — and classify new points by nearest centroid. This is one iteration of k-means with fixed assignments. Despite its simplicity it is a strong baseline for few-shot classification, and it works because the embedding (from a pre-trained backbone) already places the classes in well-separated clusters.
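The whole method fits in a dozen lines, assuming the support and query embeddings have already been produced by some backbone:

```python
import numpy as np

def prototypes(support_X, support_y):
    """One centroid per class: the mean of that class's support embeddings."""
    classes = np.unique(support_y)
    protos = np.stack([support_X[support_y == c].mean(0) for c in classes])
    return classes, protos

def classify(query_X, classes, protos):
    """Nearest-centroid classification of the query embeddings."""
    d2 = ((query_X[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    return classes[d2.argmin(1)]
```

This is exactly one k-means iteration with the assignments fixed by the labels, which is why the text calls it a one-centroid cluster per class.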

Pre-labelling and active learning

When you have a lot of unlabelled data and a small budget for annotation, cluster the unlabelled set first and label a few points per cluster rather than labelling a uniformly random sample. This gives the supervised model a much more representative training set, often at a fraction of the annotation cost. The same principle drives active learning: use uncertainty and cluster membership to choose the next point to label.
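A sketch of that annotation strategy: cluster the unlabelled pool, then nominate the few points nearest each centroid as the first labelling batch (the per_cluster budget here is a hypothetical knob, not a standard parameter):

```python
import numpy as np
from sklearn.cluster import KMeans

def annotation_candidates(X, n_clusters=10, per_cluster=3, seed=0):
    """Indices of the per_cluster points closest to each k-means centroid,
    as a representative first batch for human labelling."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    picks = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        d2 = ((X[idx] - km.cluster_centers_[c]) ** 2).sum(1)
        picks.extend(idx[np.argsort(d2)[:per_cluster]].tolist())
    return sorted(picks)
```

Labelling near-centroid points covers the pool's modes; an active-learning loop would then alternate this with uncertainty-based picks near cluster boundaries.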

What clustering isn't good at

Clustering will not match supervised performance when labels are available — if you have labels, use them. Clustering doesn't find causal structure; two variables can be perfectly clustered together and have nothing to do with each other. Clustering finds co-occurrence and similarity, not mechanism. For mechanism, you need interventions, counterfactuals, and the techniques of Chapter 06.

The next chapter, Dimensionality Reduction, is the natural companion to this one. Clustering and dimensionality reduction are both unsupervised techniques for finding structure; the difference is that clustering finds groups of points, while dimensionality reduction finds directions or manifolds along which the points vary. The two are routinely combined — reduce first, then cluster, or cluster first, then reduce for visualisation — and the boundary between them blurs for methods like spectral clustering (which is a reduction followed by a clustering) and self-organising maps (which are both at once). We'll cover PCA, factor analysis, ICA, kernel PCA, t-SNE, UMAP, and autoencoders; by the end you'll have the full toolkit for unsupervised representation learning.

Further reading

Where to go next

Clustering is one of the oldest topics in data analysis — Lloyd's k-means paper is from 1957, Ward's hierarchical method from 1963 — and most of its foundational literature is short, readable, and still in print. The modern research frontier has moved to density-based and graph-based methods, to clustering on learned embeddings, and to consensus and stability analysis. The references below cover the textbooks to anchor the subject, the classical papers that introduced each major algorithm, the modern papers that extend them, and the software documentation where most of the day-to-day work actually happens.

The anchor textbooks

Foundational papers

Evaluation and foundations

Modern extensions

Software documentation

This page is Chapter 04 of Part IV: Classical Machine Learning. The next chapter — Dimensionality Reduction — is the close sibling of this one. Clustering asks "what are the groups", dimensionality reduction asks "what are the directions"; the two are routinely combined (reduce then cluster, or cluster then reduce for visualisation), and several methods live in both worlds at once — spectral clustering is a reduction followed by k-means, self-organising maps are a clustering on a low-dimensional grid. Later chapters of Part IV extend to probabilistic graphical models (which add the structure of conditional independence), kernel methods and SVMs (where the Mahalanobis-like trick of changing the similarity metric is made rigorous), feature engineering and selection, and finally model evaluation — the discipline that turns any of these trained models, supervised or unsupervised, into a trustworthy deployment.