Anomaly Detection: Finding What Doesn't Belong

Fraud, hardware failure, network intrusion, manufacturing defects, novel disease patterns, fleet-wide bug emergence — every modern operational system has a job that translates to "tell me when something is unusual." Anomaly detection is the family of techniques that does it, and the modelling decisions are unusually consequential because, by definition, the data you most need to learn from is the data you have least of.

Prerequisites & orientation

This chapter assumes basic familiarity with linear algebra (Part I Ch 01), probability and statistics (Part I Ch 04–05), and the deep-learning fundamentals of Part V — particularly autoencoders (Part X Ch 01), which appear prominently in Section 05. The time-series anomaly section (Section 07) builds on Chapter 01 of this Part; readers should glance there if they have not already. No prior anomaly-detection experience is assumed.

Two threads run through the chapter. The first is the supervision regime: most anomaly-detection problems have very few labelled anomalies, and many have none at all, which means most methods in this chapter operate in unsupervised or semi-supervised settings rather than in the fully-supervised regime that dominates most of ML. The second thread is the operator burden: an anomaly detector that fires too often loses the operator's trust within days, and one that fires too rarely loses operational value. The threshold-and-evaluation question is the most consequential design decision in any deployment, and it is treated explicitly throughout the chapter rather than left to a single section.

01

Why Anomaly Detection Is Hard

Anomaly detection sits awkwardly between supervised learning, density estimation, and classification, and it inherits the worst of all three. The data is heavily imbalanced. The labels, when they exist at all, are sparse and unreliable. The cost of being wrong is asymmetric and operationally specific. Most failures of deployed anomaly detectors trace back to under-appreciating one of these three structural difficulties.

The imbalance problem

By definition, anomalies are rare. A credit-card fraud detector might see one fraudulent transaction in 10,000; a manufacturing defect detector might see one bad widget in 100,000; a network-intrusion detector might process a billion benign packets for every attack. This extreme imbalance breaks most of standard supervised learning's assumptions: cross-entropy loss is dominated by the majority class, accuracy is uninformative (a model that predicts "normal" always achieves 99.99% accuracy), and the ROC curve is misleading because its false-positive rate is computed over the enormous negative class, so even thousands of false alarms barely move it. The solutions — class weighting, focal loss, oversampling, area-under-precision-recall as the metric — exist but each has its own pathologies.

The supervision regime

Most anomaly-detection problems live in one of three supervision regimes. Supervised AD has labelled examples of both normal and anomalous data; this is the comfortable case but rare in practice because anomaly labels are expensive and often only collected after damage has been done. Semi-supervised AD (sometimes called "novelty detection") has labelled examples of normal data only and trains a model to recognise that distribution; new data is flagged when it falls outside. This is the most common practical regime — most operational data is normal by default, and the question is whether the next observation matches that normality. Unsupervised AD has no labels at all; the model must discover anomalies by looking at the structure of an unlabelled dataset and identifying points that look unlike the others. Many of the methods in this chapter (isolation forest, LOF, autoencoders trained on a mixture) work in any of these regimes with minor adaptations, but the choice of regime has substantial consequences for what kind of evaluation is possible.

The operator-burden problem

The metric that ultimately matters is whether human operators trust the alarms. A detector with 99% precision firing 100 alarms per day produces one false alarm a day, which an operator can tolerate. A detector with 99% precision firing 100,000 alarms per day produces a thousand false alarms a day, which any operator will silence by Friday. The rate of alarms is as consequential as the precision per alarm, and it is determined by the threshold the deployment uses to convert continuous anomaly scores into binary alerts. Threshold selection — covered in detail in Section 09 — is therefore the most consequential single design decision in any deployed AD system, and one that pure-research treatments tend to under-emphasise.

The taxonomy preview

One vocabulary distinction is used throughout the chapter and developed in detail in Section 08. Point anomalies are individual observations that look unusual on their own — a temperature reading of 200°C in a system that normally runs at 30°C is anomalous regardless of context. Contextual anomalies are observations that look unusual only in context — a temperature reading of 30°C is normal in summer but anomalous in winter. Collective anomalies are sets of observations that are individually unremarkable but collectively unusual — five normal-looking transactions in five different cities in fifteen minutes is a fraud pattern even though no single transaction is suspicious. Different methods detect different anomaly types; matching the method to the type is half the battle.

The Anomaly That Wasn't

Every anomaly-detection deployment eventually has a moment where the detector flags something dramatic — and on investigation, the flagged event turns out to be a normal occurrence the detector had simply not seen before. This is not a bug; it is the definition of unsupervised AD on novel data. Production stacks therefore invest heavily in feedback loops: every alarm gets a disposition (true positive, false positive, unknown), the dispositions feed back into model retraining, and the alarm threshold is recalibrated. The detector that ships in week one is rarely the detector that runs in month six.

02

Statistical Methods

The simplest anomaly-detection methods are statistical: fit a distribution to the data, flag anything in the tail. They work surprisingly well on small, low-dimensional, well-behaved data, and they remain the right starting point for many practical problems where more sophisticated methods are overkill.

z-scores and the normal-distribution baseline

The simplest possible AD method assumes the data is approximately Gaussian and flags points whose z-score exceeds a threshold. The z-score is the number of standard deviations a point is from the mean: z = (x − μ) / σ. A common threshold is |z| > 3, which under the Gaussian assumption flags about 0.27% of points as anomalies. The method is fast, interpretable, and nearly parameter-free — the threshold is its only knob — but it depends on the Gaussian assumption, which is rarely exactly right and can be very wrong (especially in the tails). Modified z-scores using the median and MAD (median absolute deviation) instead of mean and standard deviation are robust to outliers in the calibration data and are typically a better default.
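A minimal sketch of both rules, assuming a one-dimensional NumPy array of calibration data; the function names are illustrative, and the 0.6745 constant and 3.5 cutoff for the modified z-score follow the common Iglewicz–Hoaglin convention:

```python
import numpy as np

def zscore_flags(x, threshold=3.0):
    """Classical z-score rule: flag points more than `threshold`
    standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

def modified_zscore_flags(x, threshold=3.5):
    """Robust variant: median and MAD replace mean and std.
    0.6745 rescales the MAD to be consistent with the standard
    deviation under Gaussianity."""
    med = np.median(x)
    mad = np.median(np.abs(x - med)) + 1e-12  # guard against MAD = 0
    return np.abs(0.6745 * (x - med) / mad) > threshold
```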

Interquartile range

The IQR method is even simpler. Compute the 25th and 75th percentiles, call the difference IQR, and flag anything below Q1 − 1.5·IQR or above Q3 + 1.5·IQR (the same thresholds used in Tukey box-and-whisker plots). Like the modified z-score, the IQR method is robust to outliers in the calibration data because it depends on quantiles rather than moments. It is the default behind the box-plot whiskers in nearly every statistical package, and it is what you should reach for first on a single low-dimensional variable.
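The same pattern as a sketch — the function name is illustrative, and k = 1.5 is Tukey's conventional multiplier:

```python
import numpy as np

def iqr_flags(x, k=1.5):
    """Tukey's fences: flag anything more than k * IQR outside [Q1, Q3]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)
```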

Mahalanobis distance

For multivariate data, the natural generalisation of the z-score is the Mahalanobis distance. Given a multivariate normal distribution with mean vector μ and covariance matrix Σ, the Mahalanobis distance of a point x is:

D_M(x) = √((x − μ)ᵀ Σ⁻¹ (x − μ))
Mahalanobis distance accounts for the correlations between variables — a point that is far from the mean in the direction of low variance is more anomalous than one that is far in the direction of high variance. Under the multivariate Gaussian assumption, the squared Mahalanobis distance follows a chi-squared distribution, which gives a calibrated p-value for the anomaly. The catch is that Σ has to be estimated from the data, and the estimate is unreliable when the dimensionality is comparable to the sample size.
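A sketch of the computation with NumPy and SciPy; in practice μ and Σ should be estimated on training data only, and a robust covariance estimator (e.g. scikit-learn's MinCovDet) is often preferable when outliers contaminate the calibration set:

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_pvalues(X_train, X_test):
    """p-value per test point: the squared Mahalanobis distance referred
    to the chi-squared distribution with d degrees of freedom (valid
    under the multivariate-Gaussian assumption). Small p = anomalous."""
    mu = X_train.mean(axis=0)
    prec = np.linalg.inv(np.cov(X_train, rowvar=False))
    diff = X_test - mu
    d2 = np.einsum('ij,jk,ik->i', diff, prec, diff)  # per-row x' P x
    return chi2.sf(d2, df=X_train.shape[1])
```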

Mixture models and density estimation

A single Gaussian is rarely the right model for real data. Gaussian mixture models (GMMs) fit a weighted sum of Gaussians and flag points whose density under the fitted mixture is below a threshold. GMMs handle multimodal data — the case where "normal" comprises several distinct sub-populations — that a single Gaussian fundamentally cannot. The component count is a hyperparameter; cross-validation or BIC are the standard ways to choose it.

For data that is not well-modelled by any parametric family, kernel density estimation (KDE) provides a nonparametric alternative. KDE places a small Gaussian (or other kernel) at each training point and scores new points by the resulting density estimate. KDE works well in low dimensions but suffers severely from the curse of dimensionality: in high dimensions, all points become equally far from each other and the density estimate becomes uniformly flat. Above 10–20 dimensions, KDE is rarely the right choice without dimensionality reduction first.
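For concreteness, a sketch of density-based scoring with scikit-learn on synthetic stand-in data; the component count, bandwidth, and the 99.9th-percentile threshold are illustrative choices, not recommendations:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 2))   # stand-in for normal data
X_new = rng.normal(size=(100, 2))

gmm = GaussianMixture(n_components=5, random_state=0).fit(X_train)
kde = KernelDensity(bandwidth=0.5).fit(X_train)

# score_samples returns log-density; negate so higher = more anomalous
scores = -gmm.score_samples(X_new)     # or -kde.score_samples(X_new)
threshold = np.quantile(-gmm.score_samples(X_train), 0.999)
flags = scores > threshold
```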

03

Distance- and Density-Based Methods

A second family of methods skips explicit distribution fitting and instead uses local geometric structure: an anomaly is a point whose neighbours are far away or whose neighbourhood is sparse. These methods scale to higher dimensions than the parametric statistical methods and are the workhorse of practical AD on tabular data above a few features.

k-NN distance

The simplest distance-based method is to score each point by the distance to its k-th nearest neighbour, or the average distance to its k nearest neighbours. Points in dense regions of the data have small k-NN distance; points in sparse regions or far from the bulk of the data have large k-NN distance. Choosing k is a hyperparameter — small k is sensitive to noise, large k washes out local structure. Common defaults are k in the range 10–50 for medium-sized datasets.

The downside is computational: computing k-NN distance for every point in a dataset of n points is naively O(n²), prohibitive on large datasets. Approximate-nearest-neighbour libraries (FAISS, Annoy, ScaNN) reduce this to roughly linear time at the cost of approximation, which is usually acceptable for AD purposes.
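A sketch using scikit-learn's exact nearest-neighbour search; on large datasets the same pattern applies with an approximate index (FAISS, Annoy) swapped in:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_scores(X_train, X_test, k=20):
    """Anomaly score = distance to the k-th nearest training neighbour."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    dist, _ = nn.kneighbors(X_test)   # shape (n_test, k), sorted ascending
    return dist[:, -1]
```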

Local Outlier Factor (LOF)

The k-NN distance ignores a critical structural feature: when the data contains regions of very different density, a normal point in a legitimately sparse region can be farther from its neighbours than a true anomaly sitting just outside a dense cluster. LOF (Local Outlier Factor, Breunig et al. 2000) corrects for this by comparing each point's local density to the local densities of its neighbours. A point in a sparse region whose neighbours share that sparsity scores low (it is locally typical); a point whose local density is much lower than that of its neighbours scores high (it is locally atypical, regardless of the global density structure).

LOF was the dominant unsupervised AD method through the 2000s and remains in wide use. Its parameter is the same k as in k-NN, with similar tuning considerations, plus a threshold on the LOF score above which points are flagged. LOF degrades on high-dimensional data for the same curse-of-dimensionality reasons as KDE, but on moderate-dimensional tabular data it is consistently strong.
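A sketch of the semi-supervised usage with scikit-learn, on synthetic stand-in data; novelty=True lets the fitted model score unseen points:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 4))   # assumed-normal calibration data
X_new = rng.normal(size=(20, 4))

lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)
scores = -lof.score_samples(X_new)     # higher = more anomalous
flags = lof.predict(X_new) == -1       # thresholded via `contamination`
```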

DBSCAN and HDBSCAN

Density-based clustering algorithms can be repurposed for AD. DBSCAN assigns points to clusters based on density connectivity; points that do not belong to any cluster (which DBSCAN labels as "noise") are anomalies. The natural advantage is that DBSCAN does not require pre-specifying the number of clusters or assuming any particular cluster shape; the disadvantage is that the density-radius parameter ε is hard to tune, and DBSCAN can label whole legitimate but sparse sub-populations as noise. HDBSCAN (the hierarchical version) is more robust and is the modern default — it adapts the density radius automatically and produces cluster-membership probabilities that double as anomaly scores.

The advantage of repurposing clustering for AD is that the same algorithm produces both a clustering and an anomaly score, which is useful when the data structure is itself the question of interest (segmenting the customer base while flagging weird customers in one pass).
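A minimal sketch with scikit-learn's DBSCAN, where the noise label doubles as the anomaly flag; treat the ε and min_samples values as illustrative. (The hdbscan package additionally exposes continuous per-point outlier scores if a graded score is needed.)

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 2)),    # dense cluster
               rng.uniform(-8, 8, size=(10, 2))])  # scattered points

labels = DBSCAN(eps=0.7, min_samples=10).fit_predict(X)
anomalies = labels == -1   # DBSCAN's noise label = anomaly flag
```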

04

Isolation Forest and Tree-Based Methods

Isolation Forest (Liu, Ting & Zhou 2008) is one of the most elegant ideas in machine learning: anomalies are easier to isolate than normal points. The method has been the strongest single unsupervised AD baseline since its publication, and remains the right starting point for tabular anomaly-detection problems in 2026.

The core idea

Build a random binary tree by recursively splitting the data on a randomly-chosen feature at a randomly-chosen split value. Continue until every point is alone in its own leaf. The path length from root to leaf for any given point is a measure of how easy it was to isolate that point: anomalies, being far from the bulk of the data, get cut off early and end up at shallow leaves; normal points end up at deep leaves because they have many neighbours that need to be split apart. Build many such trees and average the path lengths to get an isolation score; points with the shortest average path lengths are the most anomalous.

The mathematical elegance is that the method makes no distributional assumptions and uses no distance metric — it relies only on the geometry of axis-aligned splits. The computational complexity is essentially linear in n and very low per tree, so building thousands of trees is cheap. And the empirical performance on standard tabular AD benchmarks is consistently strong, often beating LOF and other density-based methods at a fraction of the computational cost.
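In practice the method is a few lines with scikit-learn; the hyperparameters and the quantile-based cutoff below are illustrative defaults, not recommendations:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 8))   # stand-in for normal data

iforest = IsolationForest(n_estimators=500, random_state=0).fit(X_train)
scores = -iforest.score_samples(X_train)  # higher = shorter path = more anomalous
threshold = np.quantile(scores, 0.999)    # alarm-rate-based cutoff
```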

[Figure: Isolation Forest's intuition. A single anomaly far from the data can be isolated by a random axis-aligned split or two; a point inside the dense cluster needs many splits to be separated from its neighbours. The expected path length is a calibrated anomaly score.]

Extended and improved variants

Extended Isolation Forest (Hariri et al. 2018) addresses a known weakness of the original method: axis-aligned splits produce biased anomaly scores that depend on the orientation of the data. Extended IF uses random hyperplane splits at arbitrary angles, eliminating the bias at slightly higher computational cost. Several other variants — RRCF (Robust Random Cut Forest, used in AWS), iForestASD (for streaming data), and ICAD-IF for high-dimensional cases — are practical refinements rather than fundamental changes.

When Isolation Forest is the right choice

Isolation Forest works best on tabular data with a moderate number of features (a few up to a few thousand), with no strong reliance on feature scaling, and where the anomalies are global in some sense (far from the bulk of the data along at least some axis). It struggles when anomalies are subtle local deviations within otherwise normal-looking neighbourhoods; that is the regime where reconstruction-based or density-based methods do better. As a default first method on any new tabular AD problem, isolation forest remains hard to beat.

05

Reconstruction-Based: Autoencoders and PCA

A third approach: train a model to compress and decompress the data, and flag any input whose reconstruction is far from the original. The intuition is that a model trained on normal data learns to reconstruct normal data well; an anomaly, being unlike the training distribution, will reconstruct poorly. This is the dominant approach for high-dimensional AD problems — images, sensor streams, and anything where the raw input has too many dimensions for distance- or density-based methods to work.

PCA as the linear baseline

The simplest reconstruction-based method is principal component analysis. Fit PCA to the training data, retain the top k components, and reconstruct each input by projecting it onto the principal subspace and back. The reconstruction error — the L2 distance between the original input and its reconstruction — is the anomaly score. Inputs that lie inside the linear subspace spanned by the principal components reconstruct well; inputs that lie off it reconstruct badly.

PCA is fast, interpretable, and surprisingly effective on data whose normal variation is approximately linear. It is the right baseline against which more sophisticated reconstruction-based methods should be compared, and is often competitive on simpler problems despite the linearity assumption.
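A sketch of the full pipeline, with an illustrative function name; features should typically be standardised before fitting PCA:

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_recon_error(X_train, X_test, k=10):
    """Anomaly score = squared L2 distance between each input and its
    reconstruction from the top-k principal components."""
    pca = PCA(n_components=k).fit(X_train)
    recon = pca.inverse_transform(pca.transform(X_test))
    return ((X_test - recon) ** 2).sum(axis=1)
```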

Autoencoders

An autoencoder generalises PCA's compress-then-decompress structure to non-linear functions. The encoder maps the input to a low-dimensional latent representation; the decoder maps back to the original space; the model is trained to minimise reconstruction error on normal data. Once trained, the reconstruction error of a new input is the anomaly score. The non-linearity of the encoder and decoder lets the autoencoder learn curved manifolds that PCA cannot — important for image data, sensor data with non-linear dynamics, and most other rich modalities.
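A minimal PyTorch sketch of the idea — an illustrative fully-connected architecture trained full-batch on stand-in data, not a tuned recipe:

```python
import torch
import torch.nn as nn

class AE(nn.Module):
    """Minimal fully-connected autoencoder for tabular inputs."""
    def __init__(self, d_in, d_latent=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(),
                                     nn.Linear(64, d_latent))
        self.decoder = nn.Sequential(nn.Linear(d_latent, 64), nn.ReLU(),
                                     nn.Linear(64, d_in))

    def forward(self, x):
        return self.decoder(self.encoder(x))

x_train = torch.randn(1000, 16)   # stand-in for normal data
model = AE(d_in=16)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(50):               # train to reconstruct normal data
    loss = ((model(x_train) - x_train) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():             # per-sample reconstruction error = score
    scores = ((model(x_train) - x_train) ** 2).mean(dim=1)
```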

The architecture choice matters. Denoising autoencoders are trained to reconstruct clean inputs from noisy versions; this prevents the network from learning the trivial identity mapping and produces more robust reconstruction errors. Variational autoencoders add a probabilistic latent space and an explicit prior; they produce likelihood-based scores (via the evidence lower bound) rather than raw reconstruction errors, at the cost of harder training. Sparse autoencoders add a sparsity penalty on the latent code; this forces the network to use only a few latent dimensions per input and tends to produce more interpretable reconstructions.

Reconstruction error pitfalls

The honest pathology of reconstruction-based AD is that a sufficiently expressive model can learn to reconstruct anomalies as well as normal data, defeating the entire premise. The mitigation is to constrain the model — limit the latent dimension, regularise heavily, train for fewer epochs, use a denoising objective — to ensure that the learned reconstruction is specific to the training distribution. The empirical balance is delicate: too much constraint and reconstruction error becomes uninformative on subtle anomalies; too little constraint and the model reconstructs everything.

The image case

Image-domain anomaly detection is one of the most active sub-areas, with applications in industrial quality control (manufacturing-defect detection on assembly lines), medical imaging (catching unusual scans), and security (detecting deepfakes and image manipulation). The dominant 2024–2026 architectures combine pretrained vision features (DINOv2, CLIP) with reconstruction-based scoring on top of those features rather than on the raw pixels — Section 10 returns to this.

06

Generative and Probabilistic Models

A reconstruction-based method gives an anomaly score but not a probability. Generative models — those that learn an explicit density over the data — give a calibrated likelihood, which is in principle the cleanest possible anomaly signal. The trick is making density estimation work in high dimensions.

Normalizing flows

Normalizing flows are a family of generative models that compute exact likelihoods via invertible transformations. A flow learns a bijective mapping from the data distribution to a simple base distribution (typically a multivariate Gaussian); the change-of-variables formula then gives an exact log-likelihood for any input. For anomaly detection, the log-likelihood is the score: low likelihood means anomalous.

Flows have an attractive property: the density estimates are calibrated, in the sense that a 1-in-1000 anomaly under the model genuinely corresponds to a region of the data space with model probability mass 1/1000. The Glow, RealNVP, and FFJORD architectures are common choices. The downside is that flows are computationally heavier than autoencoders to train and to evaluate, and they sometimes assign surprisingly high likelihood to true anomalies — particularly to pathological inputs that a model has never seen but happen to lie in regions of the latent space the flow has not constrained well. The failure mode in which an out-of-distribution test set receives higher likelihood than the in-distribution one (Nalisnick et al. 2019) was a major embarrassment in the deep-learning AD literature and remains an active research issue.

GAN-based AD

An older family uses generative adversarial networks: train a GAN on normal data, and score new inputs by how easily the generator can reproduce them. AnoGAN (2017) was the original; later variants (BiGAN/ALI for inverse mappings, EGBAD, GANomaly) improved the formulation. GAN-based AD looked promising in the late 2010s but has been largely supplanted by simpler reconstruction-based methods using diffusion or autoencoder backbones, which are more stable to train and require fewer hyperparameters. GAN-based AD remains in some niche industrial deployments.

Diffusion-based anomaly detection

A more recent family uses diffusion models. The idea: take a noisy version of the input, run the diffusion model's denoising process, and measure how much the model's denoised output differs from the original input. Inputs from the training distribution denoise back to themselves; anomalies denoise toward the nearest in-distribution example, producing a large reconstruction discrepancy. DDPM-AD (2023) and related methods have shown competitive performance on image AD benchmarks and are the current frontier of generative-model-based AD.

Energy-based and probabilistic neural networks

A final family — energy-based models, deep Bayesian methods, ensembles of neural networks — produce anomaly scores via uncertainty rather than reconstruction. A point with high predictive entropy under an ensemble is likely out of distribution. These methods connect to the broader topic of out-of-distribution detection in deep learning, treated separately in Section 10.

07

Time-Series Anomaly Detection

Time-series data is the most common deployment context for AD in production. Servers monitor latency, factories monitor sensors, financial systems monitor trades — all as streams of timestamped values where the question is "is this point, or this window, unusual given the history?" The temporal structure both complicates the problem and provides leverage that the methods of the previous sections cannot directly exploit.

The forecast-residual approach

The cleanest framing reduces time-series AD to time-series forecasting. Fit a forecasting model (Chapter 01 of this Part covers the options); at each time step compare the actual observation to the model's forecast; the residual r_t = y_t − ŷ_t is the anomaly score. Large residuals indicate observations the model did not expect. This pattern works with any forecaster — ARIMA, ETS, neural-network forecasters, foundation models — and is the most common architecture for production time-series AD.

The advantage is that the forecasting model handles all the seasonal and trend structure that would otherwise contaminate raw values; the residuals against a good forecaster are approximately stationary and can be scored with the simple statistical methods of Section 02. The disadvantage is that a poor forecaster produces noisy residuals that mask real anomalies; the AD quality is bounded by the forecast quality.
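A sketch of the residual-scoring half of the pattern, agnostic to the forecaster; the trailing-window length and the MAD-based scoring rule are illustrative choices:

```python
import numpy as np

def residual_scores(y, y_hat, window=200):
    """Robust z-scores of the residuals r_t = y_t - y_hat_t, calibrated
    on a trailing window. Works with any forecaster's output."""
    r = y - y_hat
    med = np.median(r[-window:])
    mad = np.median(np.abs(r[-window:] - med)) + 1e-9
    return 0.6745 * (r - med) / mad   # |score| > 3.5 is a common rule
```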

STL residuals and ESD

For seasonal time series, the STL decomposition of Chapter 01 is a useful preprocessing step. Decompose the series into trend, seasonal, and residual components; do AD on the residuals. The STL residuals are roughly stationary and roughly Gaussian, which makes simple threshold rules effective. The Twitter SH-ESD algorithm (Seasonal Hybrid ESD, 2014) formalises this: STL-decompose, compute robust z-scores on the residuals, and apply the generalised ESD test for outliers. SH-ESD was the workhorse of Twitter's production anomaly detection for years and remains a strong default for seasonal time-series AD.
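A simplified sketch of the idea using statsmodels' STL — a fixed robust-z cutoff stands in for the full generalised ESD test that SH-ESD actually runs:

```python
import numpy as np
from statsmodels.tsa.seasonal import STL

def stl_residual_flags(series, period, threshold=3.5):
    """STL-decompose, then apply a robust z-score rule to the residuals.
    (Full SH-ESD runs the generalised ESD test instead of a fixed
    cutoff; this is the simplified core of the idea.)"""
    resid = STL(series, period=period).fit().resid
    med = np.median(resid)
    mad = np.median(np.abs(resid - med)) + 1e-9
    return np.abs(0.6745 * (resid - med) / mad) > threshold
```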

Change-point detection

A related but distinct problem is change-point detection: rather than flagging individual points, identify the time at which the underlying distribution of the series changed. This is the right framing when an anomaly is a regime change (a server suddenly running hot, a customer suddenly abandoning the platform) rather than a transient excursion. Methods range from classical (CUSUM, the Bayesian Online Change-Point Detection algorithm by Adams & MacKay 2007) to modern (deep change-point detection with neural representations). The output is a list of change times rather than a per-point anomaly score, which fits some operational pipelines better than others.
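A sketch of the classical one-sided CUSUM (upward shifts only; a mirrored statistic catches downward shifts). The drift allowance k and decision threshold h are in the stream's own units, commonly expressed in standard deviations, and need tuning per application:

```python
def cusum_alarms(x, mu, k=0.5, h=5.0):
    """One-sided CUSUM: accumulate drift above mu + k and alarm when
    the cumulative sum exceeds h."""
    s, alarms = 0.0, []
    for t, xt in enumerate(x):
        s = max(0.0, s + (xt - mu - k))
        if s > h:
            alarms.append(t)
            s = 0.0   # restart after each alarm
    return alarms
```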

Multivariate time-series AD

Most production time-series AD problems involve many parallel series — every CPU on every server, every sensor on every machine. The univariate-per-series approach (run an AD model on each series independently) is simple but misses cross-series patterns. The multivariate approach (treat all series jointly) catches patterns like "CPU and disk both spike together" that the univariate approach would miss but is computationally heavier. Modern deep methods (USAD, GDN graph-based AD, OmniAnomaly) are designed for the multivariate case at scale; classical Mahalanobis-based methods can also work when the series count is moderate.

08

Contextual vs. Collective Anomalies

The taxonomy introduced briefly in Section 01 deserves its own treatment. Choosing the right method depends on which kind of anomaly you are looking for, and the kinds are conceptually distinct enough that the wrong choice produces a detector that simply cannot find the patterns that matter.

Point anomalies

Point anomalies are individual observations that are unusual on their own, regardless of any context. The classical examples are sensor readings far outside the operating range, transactions of unusually large amounts, network packets of unusually large size. The methods of Sections 02–04 (statistical, distance-based, isolation forest) are designed for this kind of anomaly and are the right starting point when the anomalies are point-like.

Contextual anomalies

Contextual anomalies are observations that look unusual only within some context. The same numeric value might be normal in one context and anomalous in another. A temperature of 30°C is normal in summer, anomalous in winter. A web-traffic level of 1,000 requests per second is normal during a Black Friday sale, anomalous on a Tuesday afternoon in January. A transaction of $2,000 is normal for one customer, anomalous for another.

Detecting contextual anomalies requires the detector to incorporate the context. The forecast-residual approach of Section 07 does this naturally for time-series data — the forecast is, in effect, the model's expected value given the temporal context. For non-time-series contextual anomalies, the typical approach is to condition the AD model on the context (training a separate model per customer segment, or feeding the context as input features). Mahalanobis distance and conditional density estimation are the classical workhorses; conditional autoencoders and conditional flows are the modern ones.

Collective anomalies

Collective anomalies are sets of observations that are individually unremarkable but collectively unusual. Examples: a sequence of small transactions in different cities (each individually normal but collectively suspicious as fraud), a sequence of normal-looking trades that together violate position limits, a sequence of read operations that individually look like benign queries but together exfiltrate a database. By construction, no point-anomaly detector will catch a collective anomaly — every individual observation is normal — and detection requires looking at the joint structure of observation sequences.

The methods that work for collective anomalies are sequence-aware: HMMs, sequential pattern mining, RNN/transformer-based sequence models that score whole windows, and trace-mining methods from process analytics. Time-series AD with a window-level rather than point-level granularity (the forecast residual over the next k steps, scored as a vector) is the most common production approach. Genuine collective-anomaly detection in non-temporal sequence data (sequences of API calls, sequences of database queries) is an active research area.

Choosing among the three

The honest practice is to ask, for any given application, which of the three anomaly types is operationally most consequential. Most failures of production AD systems trace to a mismatch — point detectors deployed against contextual or collective threats. Production AD systems often run multiple detectors of different types in parallel and combine their alerts via a meta-classifier or a manual triage queue.

09

Production Considerations

An AD model that produces good ROC numbers in offline evaluation is not yet a deployable system. The operational work — threshold selection, alarm-rate management, drift handling, evaluation, and the human factors around alarm fatigue — determines whether the model actually delivers value once it is live.

The threshold problem

Most AD methods produce a continuous score; the deployed system needs to convert that into a binary alarm. The threshold determines the trade-off between precision and recall, and the right point on that trade-off depends on operational economics that are rarely well-quantified upfront. The common patterns are: pick the threshold to produce a target alarm rate (e.g., 50 alarms per day, what the operator team can review), pick the threshold to achieve a target precision on a labelled validation set, or pick a quantile of the historical score distribution (e.g., the 99.9th percentile of yesterday's scores). All three are reasonable; choose the one that matches how the detector will actually be used.
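A sketch of the first pattern — the alarm-rate-based threshold — with illustrative numbers:

```python
import numpy as np

def threshold_for_alarm_rate(score_history, alarms_per_day, events_per_day):
    """Choose the score cutoff that would have produced the target
    alarm rate on historical scores."""
    q = 1.0 - alarms_per_day / events_per_day
    return np.quantile(score_history, q)

# e.g. 50 reviewable alarms/day out of 1,000,000 scored events/day:
# threshold = threshold_for_alarm_rate(yesterday_scores, 50, 1_000_000)
```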

Alarm fatigue and the operator's perspective

The dominant failure mode of production AD is not detector accuracy but alarm fatigue: a detector that produces too many alarms — even if most of them are technically true positives — becomes ignored by the operators it is meant to serve. The threshold selection above is the first line of defence, but two further mitigations are standard. Alarm aggregation groups related alarms into a single incident (multiple servers in the same data centre simultaneously failing become one alarm, not fifty). Severity scoring attaches a numeric urgency to each alarm so operators can triage, with the highest-severity ones getting attention first.

Concept drift

The data the AD model sees in production is different from the data it was trained on, and the difference grows over time. Concept drift can be slow (gradual changes in the normal operating distribution as the system evolves) or sudden (a software update, a new product launch, a switched-on data source). Production AD systems include drift monitors that compare current data distribution to training data distribution and trigger retraining when drift exceeds a threshold. The retraining cadence is itself a tuning decision: too frequent and the model never stabilises, too rare and the model becomes obsolete.
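A minimal per-feature drift check, using the two-sample Kolmogorov–Smirnov test as the simplest reasonable baseline; production monitors are usually more elaborate (multivariate tests, population-stability indices):

```python
from scipy.stats import ks_2samp

def feature_drifted(train_values, live_values, alpha=0.01):
    """Two-sample KS test on one feature: a p-value below alpha suggests
    the live distribution has drifted from the training distribution."""
    return ks_2samp(train_values, live_values).pvalue < alpha
```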

Evaluation is harder than it looks

Evaluating a deployed AD detector requires labelled data — actual confirmed anomalies vs. confirmed normals. Most deployments do not have this; they have a stream of alarms and an operator's after-the-fact disposition of each. The resulting "labels" are biased by the alarms themselves (if the detector never alarms in some region of the data, no labels exist there). Methods that work despite this bias — bootstrap evaluation, semi-supervised evaluation, off-policy evaluation — exist but are not yet widely deployed. The honest answer for most teams is to track simple operational metrics (alarm volume, operator-confirmed precision, time-to-investigation) and supplement with periodic deep-dive evaluations on labelled subsets.

The Detector That Lasts

The AD detectors that survive longest in production are usually not the ones with the highest offline AUC. They are the ones with simple, interpretable scores that operators can trust; with calibrated thresholds that produce a manageable alarm rate; with feedback loops that incorporate operator dispositions into the next training round; and with monitoring that catches drift before the detector becomes useless. Algorithm choice is the easy part. The infrastructure around the algorithm determines whether the detector ships and stays useful.

10

Frontier: Foundation Models, Self-Supervised AD, OOD

The 2024–2026 frontier in anomaly detection is the same direction as in most other ML subfields: the use of pretrained foundation models as the feature backbone, self-supervised methods that need fewer labels, and a tighter integration between anomaly detection and out-of-distribution detection in deep learning.

Foundation features for AD

A pretrained vision-language or vision-only foundation model (DINOv2, CLIP, MAE) produces features that capture semantic structure of images. Running classical AD methods (k-NN, Mahalanobis, isolation forest) on these features rather than on raw pixels typically produces dramatically better anomaly detectors than purpose-trained networks for the AD task. The 2023–2024 literature is full of papers reporting state-of-the-art AD results from "DINOv2 features + simple AD method." The pattern has become standard practice for image AD; the same pattern is now spreading to other modalities (using LLMs for text-AD, using audio foundation models for audio-AD).
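A sketch of the pattern, assuming the embeddings have already been computed offline by whatever backbone is chosen (DINOv2, CLIP, ...); the function name is illustrative:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def feature_knn_scores(train_embs, test_embs, k=5):
    """Anomaly score = mean distance to the k nearest normal embeddings.
    `train_embs` / `test_embs` are pretrained-backbone features
    (e.g. DINOv2 or CLIP image embeddings), computed offline."""
    nn = NearestNeighbors(n_neighbors=k).fit(train_embs)
    dist, _ = nn.kneighbors(test_embs)
    return dist.mean(axis=1)
```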

Self-supervised AD

A separate trend trains AD models on self-supervised pretext tasks rather than reconstruction. Examples: predict the rotation applied to an image, solve a jigsaw permutation of its patches, or predict the value of a masked feature given the rest. The pretext task forces the model to learn rich representations of normal data without explicit reconstruction. The empirical advantage over reconstruction-based methods is that self-supervised methods are less prone to the "expressive enough to reconstruct anomalies too" failure mode — the pretext task is harder to overfit to.

OOD detection

The closely-related field of out-of-distribution detection in deep learning treats AD as a question about a classifier's trustworthiness: can we tell when a classifier is being asked to predict on an input that is far from the training distribution? Methods range from output-based (entropy of softmax outputs, max softmax probability) to feature-based (Mahalanobis distance in feature space) to gradient-based (the magnitude of the input-gradient). OOD detection is what enables a deployed classifier to abstain rather than confidently misclassify; it is increasingly considered part of the responsible-deployment story for any production ML system. The methods overlap heavily with classical AD but are framed in terms of classifier reliability rather than data anomalousness.
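A sketch of the simplest output-based score, the maximum-softmax-probability baseline of Hendrycks & Gimpel (2017); `logits` is a batch of classifier outputs:

```python
import torch.nn.functional as F

def msp_ood_scores(logits):
    """Maximum-softmax-probability baseline: higher score = lower
    top-class confidence = more likely out-of-distribution."""
    return 1.0 - F.softmax(logits, dim=1).max(dim=1).values
```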

Foundation models for tabular AD

The most recent direction is foundation models trained specifically for tabular anomaly detection, in the same spirit as the time-series foundation models of Chapter 01. TabPFN (and its later variants) is the closest thing in 2026 to a foundation model for tabular data; it is being adapted for AD with promising early results. Whether tabular AD will get its "foundation model moment" the way images and time series did is one of the open empirical questions of the field as of 2026.

What this chapter does not cover

The full causal-inference framework — which connects to anomaly detection because true anomalies often correspond to changes in the underlying causal process, not just statistical fluctuations — is the subject of Chapter 03 (Causal Inference) and Chapter 04 (Causal Machine Learning). Survival analysis, which models time-to-event data and is sometimes used to detect early signs of failure events, is the subject of Chapter 06 (Survival Analysis & Event Modeling). The deep Bayesian methods that produce calibrated uncertainty estimates — useful for flagging when a model does not know — are the subject of Chapter 07 (Bayesian Deep Learning). And the federated and privacy-preserving variants of AD, important for use cases like distributed fraud detection, are the subject of Chapter 10 (Federated Learning & Privacy-Preserving ML).

Anomaly detection is the operational substrate of every monitoring system in modern infrastructure. The classical methods of this chapter remain the foundation; the deep-learning era added the ability to handle high-dimensional unstructured data; the foundation-model era is reshaping what features the methods run on. A practitioner who matches the method to the anomaly type, calibrates the threshold to the operator's tolerance, and builds the feedback infrastructure around the model — that practitioner builds detectors that ship and stay useful.

Further Reading