Part VII · Computer Vision · Chapter 01

Image representation and classical vision, the body of mathematics and engineering that turns a grid of colour-sample numbers into something a computer can reason about — pixels, colour spaces, histograms, filters, edges, corners, scale-space, feature descriptors, camera geometry, stereo, and optical flow — the layer beneath the neural network where every modern vision model still silently rests.

A digital image is nothing more than a rectangular array of numbers, and yet within that humble grid lives the entire discipline of computer vision. Before we ask a network to recognise a cat, we have to decide what the pixels are: how they were sampled from the continuous world, how colour is encoded, how brightness relates to photons, how a filter or a gradient can be computed from them, and how two grids taken from slightly different viewpoints can be related back to three-dimensional geometry. Classical vision — the four decades of work before deep learning swept the field — answered these questions with elegant, interpretable, mostly closed-form machinery: histograms and point operations for contrast; convolutions for smoothing and differentiation; Sobel and Canny for edges; Harris for corners; Gaussian pyramids and scale-space for invariance to zoom; SIFT, SURF, ORB, and HOG for local descriptors that survive rotation and lighting changes; thresholding, watershed, and graph cuts for segmentation; morphology for binary shape operations; pinhole geometry, calibration, and epipolar constraints for camera models; Lucas-Kanade and Horn-Schunck for optical flow; and bag-of-features plus RANSAC for the first generation of image recognition. Much of this machinery is still inside every modern vision system — it preprocesses, it post-processes, it calibrates, it teaches researchers what the problem actually is. Understanding it is the foundation on which every later chapter in this Part builds.

How to read this chapter

Sections one and two establish the substrate. Section one is why start with classical vision — the argument that decades of pre-deep-learning work on pixels, filters, and geometry did not become irrelevant when CNNs arrived but instead quietly became the preprocessing and postprocessing of every modern pipeline. Section two is image formation: how the continuous world is sampled onto a grid, what resolution and bit depth actually mean, and the signal-processing assumptions that sit underneath every image file.

Sections three and four cover colour and pixel statistics. Section three is colour spaces — RGB, HSV, CIE Lab, YCbCr, and why a given task wants one representation rather than another. Section four is pixel statistics and point operations — histograms, equalisation, gamma correction, the simplest transformations of an image that still have real industrial use.

Sections five and six are the heart of classical image processing. Section five is linear filtering and convolution — Gaussian blur, mean and median filters, separability, and the critical fact that convolution is the same operation modern CNNs use. Section six is edge detection — Sobel, Prewitt, Laplacian, and the full Canny pipeline with non-maximum suppression and hysteresis thresholding.

Sections seven through ten cover local features — the classical answer to "where in an image is something interesting". Section seven is corner detection (Harris, Shi-Tomasi, FAST); section eight is scale-space and image pyramids (Gaussian/Laplacian pyramids, DoG); section nine is SIFT-style descriptors (SIFT, SURF and their invariance properties); section ten is binary descriptors (ORB, BRIEF, BRISK) designed for real-time matching.

Section eleven is the HOG descriptor — the Histograms of Oriented Gradients feature that drove a decade of pedestrian detection and dominated object recognition before CNNs. Sections twelve and thirteen cover shape and region. Section twelve is classical segmentation — thresholding, region growing, watershed, graph cuts, mean-shift. Section thirteen is mathematical morphology — erosion, dilation, opening, closing, and the binary-image operators that still clean up every segmentation mask produced by modern nets.

Sections fourteen and fifteen introduce geometry. Section fourteen is camera models and multi-view geometry — pinhole projection, intrinsic and extrinsic calibration, homographies, epipolar geometry. Section fifteen is stereo and optical flow — disparity estimation, Lucas-Kanade, Horn-Schunck, and the classical motion/depth reconstruction that modern networks now learn end-to-end but still evaluate against.

Section sixteen is the first classical recognition pipeline: bag-of-features image matching — SIFT+RANSAC for image retrieval, BoW for category recognition, and the reasons this approach plateaued and was replaced. The closing section places classical vision in the modern machine-learning lifecycle, where it lives today as preprocessing, post-processing, calibration, evaluation tooling, and as a conceptual vocabulary for understanding what vision networks actually learn.

Contents

  1. Why start with classical vision — The through-line from pixel grids to learned representations, and why the old machinery still matters
  2. Image formation, sampling, and quantisation — From continuous radiance to pixel grids: resolution, bit depth, aliasing, the Nyquist picture
  3. Colour spaces — RGB, HSV, CIE Lab, YCbCr, gamut and perceptual uniformity
  4. Pixel statistics and point operations — Histograms, equalisation, gamma correction, contrast and brightness transforms
  5. Linear filtering and convolution — Gaussian, mean, median, bilateral — the operator shared with every CNN
  6. Edge detection — Sobel, Prewitt, Laplacian, and the Canny pipeline with NMS and hysteresis
  7. Corner detection — Harris, Shi-Tomasi, FAST — the first interest-point detectors
  8. Scale-space and image pyramids — Gaussian and Laplacian pyramids, Difference of Gaussians, coarse-to-fine processing
  9. SIFT-style feature descriptors — SIFT, SURF, rotation/scale invariance, descriptor matching
  10. Binary feature descriptors — ORB, BRIEF, BRISK, AKAZE — real-time descriptors for embedded vision
  11. HOG — Histograms of Oriented Gradients — The Dalal-Triggs descriptor, pedestrian detection, deformable parts models
  12. Classical segmentation — Thresholding, region growing, watershed, graph cuts, mean-shift
  13. Mathematical morphology — Erosion, dilation, opening, closing, structuring elements, grayscale morphology
  14. Camera models and multi-view geometry — Pinhole model, calibration, homographies, epipolar geometry, the fundamental matrix
  15. Stereo and optical flow — Disparity, rectification, Lucas-Kanade, Horn-Schunck, dense flow
  16. Bag-of-features and image matching — SIFT+RANSAC retrieval, BoW recognition, the pre-CNN recognition pipeline
  17. Classical vision in the modern ML lifecycle — Preprocessing, evaluation, data augmentation, and the vocabulary of learned vision
§1

Why start with classical vision

A deep learning curriculum can be tempted to skip straight from pixels to convolutional networks, treating classical vision as a historical curiosity. This is a mistake. The forty years of work that preceded AlexNet did not vanish in 2012; it went underground, and it now runs invisibly inside every production vision system — as preprocessing, as postprocessing, as calibration, as evaluation tooling, and as the conceptual vocabulary that lets researchers even describe what a network has learned. Understanding classical vision is not optional archaeology. It is how you debug modern models, clean up the masks they produce, calibrate the cameras that feed them, and recognise when a learned representation has merely rediscovered something that was hand-engineered in 1986.

The most direct through-line is the convolution operator itself. The hand-designed filters that bear the names of Sobel, Prewitt, and Gabor — for edges, for gradients, for oriented bars — are structurally identical to the filters learned by the first layers of a CNN. Visualisation studies of trained networks — from the first-layer filters of AlexNet and Zeiler and Fergus's deconvolutional analysis onward, through every subsequent interpretability paper — consistently find that the opening layers of any trained vision network recover Gabor-like edge and blob detectors. The network is not inventing a new theory of vision; it is rediscovering the theory that classical computer vision already had, then composing it deeper. If you understand a Gaussian-derivative kernel and the orientation tuning of a Gabor filter, you have an immediate bridge into what the first layer of any modern vision network is doing.

The second through-line is geometry. Deep networks are excellent at pattern-matching on textures, but the three-dimensional structure of the world — where the camera is, how the scene projects to the image plane, how multiple views relate — is still largely handled with classical tools. A smartphone camera's image stabilisation, a SLAM system for a robot, a structure-from-motion pipeline for aerial mapping, a stereo depth sensor on a self-driving car — each of these uses pinhole models, homographies, epipolar geometry, and Lucas-Kanade-style optical flow as load-bearing components. Learned networks produce the features; classical geometry ties them to the world. The recent rise of neural radiance fields and Gaussian splatting has, if anything, increased the importance of knowing this material, because those systems are taught in the language of camera matrices and projective geometry.

The evaluation ecology. Modern vision papers report numbers — IoU, mAP, PSNR, SSIM, pixel accuracy — whose definitions come straight from the classical literature. Image quality metrics like Structural Similarity (Wang, Bovik, Sheikh and Simoncelli, 2004) are still the default targets for any image-generation or restoration model. Pedestrian-detection benchmarks like INRIA and Caltech were collected by the same community that invented the HOG descriptor, and their evaluation protocols — miss rate at a given false-positive density — were designed for it. Even the standard dataset of deep vision, ImageNet, inherited its object-recognition protocol from the PASCAL VOC challenge that preceded it, which was run by the classical vision community.

Another through-line is postprocessing. A segmentation network outputs a soft probability mask; the hard, usable mask that ends up in a product has almost always been cleaned with morphological operations, connected-component labelling, or a graph-cut refinement — all classical techniques. A document-understanding pipeline that starts with a vision-language model still uses classical deskewing, page segmentation, and table-line detection. A medical-imaging pipeline that uses a CNN for tumour detection still uses thresholding and region-growing to refine the output. The learned part of the system is typically surrounded by a layer of classical-vision plumbing that makes the output actually usable.

Finally, there is data augmentation. Every training pipeline for a vision network applies random crops, flips, rotations, colour jitter, Gaussian blur, JPEG compression, and a long tail of geometric and photometric transformations — each one a classical image-processing operation — alongside newer mixing schemes such as MixUp and CutMix. The space of augmentations is the space of classical transformations the practitioner believes preserve the semantic content of an image. You cannot reason about augmentation strategies without understanding, concretely, what a rotation does to a pixel grid, what Gaussian blur does to high frequencies, and why a random crop implicitly teaches translation invariance. Augmentation is classical vision, running as a training subroutine.

The rest of this chapter is a guided tour through that classical machinery, in the order that makes each step easiest to motivate: first what an image is as a numerical object, then how we transform its intensities, then how we extract structure from it, then how we describe interesting points, then how we reason about shape and geometry. Every later chapter in this Part — modern classification, detection, video, 3D, vision-language — will be told against this backdrop.

§2

Image formation, sampling, and quantisation

An image file is a rectangular array of integers. The question of how that array came to exist — how continuous light in a three-dimensional scene ended up as a discrete grid of bytes — sets the terms for everything that follows. The machinery of sampling, quantisation, colour filtering, and bit depth determines what information is present in the pixels, what has been irretrievably lost, and which signal-processing assumptions any downstream algorithm can reasonably make. Getting these foundations wrong leads to ringing, colour casts, chroma subsampling artefacts, and a category of mysterious failures that look like modelling problems but are actually representation problems.

The starting point is the physical scene. The world around a camera contains a continuous radiance function — for each direction in space, each wavelength of light, each moment in time, there is some amount of energy flowing through a point. A camera samples this function: it integrates over a small interval of time (the shutter), over a small solid angle per pixel (the lens aperture), over a limited range of wavelengths (the colour filter), and over a discrete spatial grid (the sensor). Each of these four sampling operations is a source of information loss and, potentially, of artefacts. The subsequent machinery of computer vision is downstream of all four.

The spatial sampling step is the one with the most mathematical structure. The Nyquist–Shannon theorem tells us that a band-limited signal can be recovered from samples taken at twice its highest frequency. Real scenes are not band-limited — they contain arbitrarily fine detail — so the practical consequence is that any camera must low-pass filter before it samples, either optically (through the lens's point-spread function and the anti-aliasing filter in front of the sensor) or implicitly (through the finite size of each sensor pixel, which integrates over its footprint). If this low-pass filtering is inadequate, frequencies above Nyquist fold down into lower frequencies and appear as aliasing: moiré on textile patterns, jagged lines on high-contrast edges, rainbow fringes on fine text.
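
The fold-down effect is easy to demonstrate numerically. In this sketch (NumPy only; a 1D signal stands in for one image row), a 7 Hz sinusoid sampled at 10 Hz — below the 14 Hz the sampling theorem demands — shows up in the spectrum at the aliased frequency |7 − 10| = 3 Hz:

```python
import numpy as np

# A 7 Hz sinusoid sampled at 10 Hz violates Nyquist (recovery needs fs > 14 Hz),
# so the 7 Hz component folds down to |7 - 10| = 3 Hz.
fs = 10.0                       # sampling rate, Hz
t = np.arange(0, 10, 1 / fs)    # 10 seconds -> 100 samples
x = np.sin(2 * np.pi * 7 * t)   # true frequency: 7 Hz, above Nyquist (5 Hz)

spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), d=1 / fs)
apparent = freqs[np.argmax(spectrum)]
print(round(float(apparent), 6))  # 3.0 -- the aliased frequency, not 7 Hz
```

This is exactly what moiré is in two dimensions: scene detail above the sensor's Nyquist limit reappearing as a spurious low-frequency pattern.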

Resolution vs. effective resolution. A 48-megapixel smartphone sensor does not produce 48 megapixels of usable information. The lens optically limits resolution to perhaps 12 megapixels; the Bayer filter throws away two-thirds of the colour information at each pixel and reconstructs it by demosaicing; noise reduction further blurs fine detail. The nominal pixel count is a marketing number. A rigorous test — such as photographing a Siemens star target — reveals that effective resolution is typically a fraction of the sensor resolution. This matters for computer vision: features smaller than the effective resolution are reconstruction artefacts, not scene content.

The intensity quantisation step converts the continuous sensor voltage into a discrete digital value. Most consumer images store 8 bits per channel — 256 levels — a quantisation that is adequate for perceptual viewing but coarse for any computation that amplifies contrast. Scientific imaging, medical imaging, and modern photography all routinely use 10, 12, 14, or 16 bits per channel to preserve dynamic range. The difference matters when you apply a gamma transform, push shadows, or do any computation that linearly combines small differences in pixel values; at 8 bits you quickly hit posterisation — visible banding in smooth gradients.

The colour sampling step is the most lossy and the most frequently underappreciated. A consumer sensor is a grid of monochrome wells, each covered by a single colour filter — typically red, green, or blue in the Bayer mosaic pattern (RGGB: two greens, one red, one blue per 2×2 cell). At each physical pixel, only one colour is measured; the other two are interpolated from neighbours by a demosaicing algorithm. The green channel is sampled twice as densely as red and blue because human vision is more sensitive to luminance variation in the green band. The colour reconstruction step introduces artefacts — false colour at high-contrast edges, zippering, and a general softness — that no subsequent algorithm can undo.

After sampling, a raw sensor image is not yet a consumer-visible photograph. A typical imaging pipeline applies black-level subtraction (the sensor has a nonzero dark current), white balance (scaling the R, G, B channels so that neutral grey appears neutral under the ambient light colour), demosaicing, colour-space conversion (from the sensor's native primaries to a standard like sRGB), tone mapping (a nonlinear curve to fit a high-dynamic-range scene into 8-bit display values), sharpening, and noise reduction. Each step is reversible only in principle; practical pipelines discard information at several stages. When you load a JPEG into a deep learning pipeline, you are working with a heavily processed, lossily compressed, nonlinearly mapped representation of the scene — not the raw photon counts.

A final note on bit depth and linearity. Pixel values stored in a standard JPEG are in sRGB space, which applies an approximately gamma 2.2 nonlinearity: the value 128 does not mean "half as much light as 255" but rather "half as much perceptual brightness", which is far less light physically. Any computation that claims to be linear in light — averaging exposures, physically-motivated blending, radiometric augmentation — must first decode the gamma, operate in linear space, and re-encode. Treating sRGB values as linear produces subtly wrong gradients, shifted colours in blends, and poor results in any physics-based vision task. Deep learning practitioners rarely account for this; most augmentation libraries operate directly on sRGB pixel values, which is one reason photometric augmentation sometimes destroys more signal than necessary.
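
The sRGB transfer function is published in IEC 61966-2-1: a linear toe plus a 2.4-exponent segment, which the familiar "gamma 2.2" merely approximates. A minimal NumPy sketch of the decode/encode pair (the function names are mine) makes the "128 is not half the light" point concrete:

```python
import numpy as np

# The IEC 61966-2-1 sRGB transfer pair: a linear toe below the breakpoint,
# a 2.4-exponent segment above it. Inputs and outputs are in [0, 1].
def srgb_to_linear(s):
    s = np.asarray(s, dtype=np.float64)
    return np.where(s <= 0.04045, s / 12.92, ((s + 0.055) / 1.055) ** 2.4)

def linear_to_srgb(l):
    l = np.asarray(l, dtype=np.float64)
    return np.where(l <= 0.0031308, l * 12.92, 1.055 * l ** (1 / 2.4) - 0.055)

# Encoded mid-grey (128/255) is only about 21.6% linear light, not 50%:
mid = srgb_to_linear(128 / 255)
print(round(float(mid), 3))  # 0.216
```

Any linear-light computation should be wrapped between `srgb_to_linear` and `linear_to_srgb`, ideally in float, before re-quantising to 8 bits.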

§3

Colour spaces

"Colour" in computer vision is not a single concept — it is a family of coordinate systems, each tuned to a different task. The three-tuple (R, G, B) is the most common, but by no means the most useful. HSV, CIE Lab, YCbCr, XYZ, and Lab's perceptually uniform successors (Luv, CAM02) all encode the same underlying physics differently, and the choice of colour space materially changes what a subsequent algorithm can see. A skin-tone detector that works beautifully in HSV may be useless in RGB; a compression codec that quantises chrominance in YCbCr introduces no visible artefacts, while the same quantisation in RGB produces garish colour casts. Knowing when to use which colour space is a recurring classical-vision skill that survives unchanged into the deep-learning era — typically as a preprocessing step or as the input to a loss function.

RGB is the native representation of almost every digital image, because it corresponds directly to the red, green, and blue sensors in a camera and the red, green, and blue phosphors or LEDs in a display. Its great virtue is additive linearity: if you double the amount of red light, the R value doubles (in a linear RGB space). Its great flaw is that its axes are not perceptually meaningful. The RGB distance between two pixels is weakly correlated with how different a human perceives them to be, and the axes mix hue, saturation, and brightness together. No classical segmentation or colour-based detection algorithm wants RGB as its input; they almost all start by converting to something else.

HSV (Hue, Saturation, Value) separates the what-colour (hue) from the how-colourful (saturation) from the how-bright (value). It is a non-linear transformation of RGB, computable in closed form. Its main utility is the isolation of hue: under changing illumination, a red object's H value stays roughly constant while its RGB values shift dramatically. Colour-based object tracking, segmentation by colour, and colour-key (green-screen) compositing all use HSV or its sibling HSL for exactly this reason. The price is that HSV is not a Euclidean space — hue is circular (red at 0° and 360°), and the meaning of S and V depends on each other in awkward ways. Distance in HSV is informally useful but formally broken; never use L2 distance on raw HSV vectors.
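
The hue-stability property is easy to verify with the standard library's colorsys module; the specific RGB triples below are illustrative choices, not measurements:

```python
import colorsys

# The same red surface under full and 40% illumination: the RGB values change
# substantially, but hue (and here saturation) stay fixed; only V scales.
bright_red = (0.9, 0.2, 0.1)
dim_red = tuple(0.4 * c for c in bright_red)

h1, s1, v1 = colorsys.rgb_to_hsv(*bright_red)
h2, s2, v2 = colorsys.rgb_to_hsv(*dim_red)
print(abs(h1 - h2) < 1e-9, abs(s1 - s2) < 1e-9)  # True True
print(round(v1, 2), round(v2, 2))                # 0.9 0.36
```

This is the entire case for HSV-based tracking: threshold on H, largely ignore V.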

CIE Lab and the perceptually uniform ideal. Lab (CIE 1976 L*a*b*) is an attempt to build a colour space in which Euclidean distance matches perceived colour difference. It has three channels: L* is lightness (0 = black, 100 = diffuse white), a* is the green-to-red axis, b* is the blue-to-yellow axis. Lab is the space in which image-processing operations that should preserve colour while changing brightness — or vice versa — are most defensible. It is also the default space for most image quality metrics, colour-difference computations, and colour-constancy algorithms. Its successors (CAM02-UCS, ICtCp) are more perceptually uniform but less widely implemented.

YCbCr (and its close relative YUV) separates luminance from chrominance: Y is a weighted sum of R, G, B that approximates perceived brightness; Cb and Cr are the blue-minus-luma and red-minus-luma differences. It is the space of every modern video codec (H.264, H.265, AV1) and of JPEG, because it enables chroma subsampling: human vision is much less sensitive to spatial variation in colour than in brightness, so the Cb and Cr channels can be downsampled to half or quarter resolution with little visible loss. Computer vision systems that work with video — automated video analysis, broadcast quality monitoring, low-bitrate surveillance — often need to operate in YCbCr natively because that is how the stream arrives.
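
A minimal sketch of the full-range BT.601 conversion used inside JPEG (the matrix constants are the standard JFIF ones; the helper name is mine):

```python
import numpy as np

# Full-range BT.601 / JFIF conversion. Y approximates perceived brightness;
# Cb and Cr carry colour and are what chroma subsampling downsamples.
M = np.array([[ 0.299,     0.587,     0.114   ],
              [-0.168736, -0.331264,  0.5     ],
              [ 0.5,      -0.418688, -0.081312]])

def rgb_to_ycbcr(rgb):                        # rgb: (..., 3) in 0-255
    ycc = np.asarray(rgb, dtype=np.float64) @ M.T
    ycc[..., 1:] += 128.0                     # centre the chroma channels
    return ycc

# A neutral grey has all its information in Y; chroma sits at the midpoint:
print(np.round(rgb_to_ycbcr([200, 200, 200]), 6))  # [200. 128. 128.]
```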

XYZ is the foundational colour space: a linear, device-independent reference defined by the CIE in 1931 from human colour-matching experiments. Every other colour space can be defined as a function of XYZ. It is rarely used directly in computer vision, but it is the lingua franca for colour-space conversions, profile management, and colour-accurate rendering. When a deep learning system claims to produce "sRGB" output, the precise meaning of that claim is defined via XYZ and a standard illuminant.

Other colour spaces make appearances in specialised domains. rgb (lowercase — the chromaticity normalisation R/(R+G+B), G/(R+G+B)) strips out brightness entirely and is useful for colour-constancy work. Luv is a perceptually uniform space historically favoured for emissive-display work. Hunter Lab is an older industrial colour-difference space, while CIECAM02 is a full colour-appearance model. Opponent-colour spaces, inspired by the biology of the human visual system, separate red-green and blue-yellow axes. A good classical-vision engineer chooses the colour space to match the problem: Lab for colour difference, HSV for colour-based segmentation, YCbCr for video, RGB for nothing much except display.

Deep learning usually trains end-to-end in RGB, which feels philosophically lazy — a network could surely learn any colour transformation it needed — and empirically mostly works. But not always: specialised applications that use Lab loss functions for colourisation, YCbCr inputs for video, or HSV space for robust skin-tone detection are reminders that the space in which you present the data still matters. When a CNN fails mysteriously on colour-related tasks, the first diagnostic move is often to try the task in a different colour space.

§4

Pixel statistics and point operations

A point operation is the simplest possible class of image transformation: a function applied independently to each pixel, with no reference to neighbours. Point operations include brightness and contrast adjustment, gamma correction, thresholding, histogram equalisation, and tone mapping. They are computationally trivial and algorithmically powerful — enough to solve a surprising fraction of real industrial-vision problems without any spatial processing at all, and they provide the vocabulary for most of the photometric augmentations used in deep learning training.

The histogram of an image is the single most informative global summary. It is an array indexed by intensity — typically 256 bins for an 8-bit channel — whose value at each bin counts the number of pixels with that intensity. From the histogram alone, you can read off exposure (is the distribution pushed left toward black, right toward white, or centred?); dynamic range (do pixels span the full 0-255 range or a narrow subset?); the presence of saturation (spikes at 0 and 255); and often the number and brightness of distinct objects. A human eye looking at a histogram can diagnose most common photographic flaws in under a second.
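
Computing and reading a histogram takes a few lines of NumPy; the synthetic underexposed image below is an illustrative stand-in for a real photograph:

```python
import numpy as np

# np.bincount over the flattened pixel array gives the 256-bin histogram.
# A synthetic underexposed image: intensities clustered in the shadows.
rng = np.random.default_rng(0)
img = rng.normal(loc=60, scale=20, size=(128, 128)).clip(0, 255).astype(np.uint8)

hist = np.bincount(img.ravel(), minlength=256)
mean = (np.arange(256) * hist).sum() / hist.sum()
print(hist.sum() == img.size)    # every pixel counted exactly once
print(round(float(mean)))        # near 60: the mass sits toward black
```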

Histogram equalisation is the classical technique for improving the contrast of a low-contrast image. It redistributes pixel values so that the histogram becomes approximately uniform. The algorithm is a one-liner: compute the cumulative distribution of the histogram, then map each pixel value to its percentile. An image with pixels clustered tightly around middle grey emerges with its intensities spread across the full range; structure that was invisible becomes visible. The limitation is that equalisation is a global operation — it optimises overall contrast at the expense of local contrast in already-bright or already-dark regions.
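
The one-liner, spelled out as a NumPy sketch (the helper name is mine): cumulative distribution, normalise, apply as a lookup table.

```python
import numpy as np

# Equalisation: map each intensity to its percentile via the histogram's CDF.
def equalise(img):                            # img: uint8 array
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum() / img.size            # fraction of pixels <= each level
    lut = np.round(255 * cdf).astype(np.uint8)
    return lut[img]                           # apply as a lookup table

# Pixels clustered tightly around middle grey spread across the full range:
rng = np.random.default_rng(1)
low = rng.integers(110, 146, size=(64, 64)).astype(np.uint8)
out = equalise(low)
print(int(low.min()), int(low.max()), "->", int(out.min()), int(out.max()))
```

The printed range shows a ~35-level input stretched to nearly the full 0-255 span.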

CLAHE — Contrast-Limited Adaptive Histogram Equalisation. The fix for the limitations of global equalisation is adaptive equalisation: divide the image into a grid of tiles, equalise each tile independently, and bilinearly interpolate between them to avoid block artefacts. CLAHE (Zuiderveld, 1994) adds a clipping step that prevents noise amplification in uniform regions by capping how much the histogram in each tile can be stretched. CLAHE is standard in medical imaging, where it reveals faint structures in X-rays and mammograms, and in many low-light photographic pipelines.

Gamma correction is a simple power-law point operation: output = input^γ, with intensities normalised to [0, 1] and γ typically between 0.4 and 2.5. Values of γ < 1 brighten midtones while preserving blacks and whites; values of γ > 1 darken midtones. The historical origin is display calibration — cathode-ray tube phosphors have a natural response that is approximately gamma-2.2 — but gamma correction is still used extensively as a general-purpose tone adjustment, and the sRGB standard is built around an approximate γ=2.2 encoding for exactly this reason.
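
As a sketch (assuming 8-bit input; the helper name is my own):

```python
import numpy as np

# Gamma as a point operation on intensities normalised to [0, 1]:
# gamma < 1 brightens midtones, gamma > 1 darkens them; 0 and 255 are fixed.
def gamma_correct(img, gamma):
    x = np.asarray(img, dtype=np.float64) / 255.0
    return (x ** gamma * 255.0).round().astype(np.uint8)

mid = np.array([128], dtype=np.uint8)
print(gamma_correct(mid, 0.45)[0])   # 187 -- brightened midtone
print(gamma_correct(mid, 2.2)[0])    # 56  -- darkened midtone
```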

Thresholding is the simplest segmentation operation: everything above some threshold becomes white, everything below becomes black. Despite its primitive flavour, thresholding remains the right answer for a huge class of industrial vision problems — presence/absence of a part, reading a barcode, segmenting text from paper — where the scene is controlled enough that a single intensity value cleanly separates foreground from background. The classical improvement is Otsu's method (1979), which automatically chooses the threshold that maximises between-class variance. More sophisticated variants (adaptive thresholding, Sauvola, Niblack) handle spatially varying illumination by computing a different threshold for each local window.

Beyond these canonical operations, a family of photometric transformations — contrast stretching, brightness shift, colour jitter, saturation adjustment, hue rotation, blue/yellow balance — are the vocabulary of data augmentation in deep learning. A training pipeline's colour jitter transform is a parametrised point operation, applied with random intensity, to teach a network invariance to lighting conditions. The augmentation library albumentations implements dozens of these point operations, and their correct parametrisation is the subject of an ongoing literature (RandAugment, AugMix, AutoAugment) that asks: given a fixed compute budget, what combination of photometric transformations most improves generalisation?

The crucial property of most point operations is that they are invertible, at least approximately, when applied with known parameters — you can undo a gamma correction or reverse a brightness shift, and even histogram equalisation can be reversed wherever its mapping is one-to-one. (Thresholding, which destroys all intensity structure, is the obvious exception, and 8-bit quantisation makes any inversion only approximate.) This makes them safe for reversible data representation, safe for augmentation (the label does not change), and safe for composition. It also makes them relatively interpretable: when a point operation fails or causes unexpected behaviour, the cause is usually visible in the histogram, and the fix is usually a different point operation.

§5

Linear filtering and convolution

Convolution is the single most important operator in classical computer vision and, by historical accident, the same operator that forms the basis of every convolutional neural network. A filter kernel — a small grid of weights — is slid across the image; at each position, the corresponding patch of the image is multiplied element-wise with the kernel and the results summed to produce one output pixel. From this primitive, nearly every classical low-level operation can be built: blurring, sharpening, edge detection, embossing, denoising, and feature extraction. Understanding convolution concretely, with small kernels and small images, is the fastest route to understanding what the early layers of a CNN are doing.
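
A from-the-definition implementation, deliberately slow, makes the sliding-window arithmetic concrete (a NumPy sketch; real libraries use far faster algorithms):

```python
import numpy as np

# Convolution from the definition: flip the kernel, slide it, sum products.
# (CNN layers skip the flip, i.e. they compute cross-correlation.)
def conv2d(img, kernel):
    kh, kw = kernel.shape
    k = np.flipud(np.fliplr(kernel))          # true convolution flips the kernel
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))  # "valid" output: no padding
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + kh, j:j + kw] * k).sum()
    return out

# A 3x3 box (mean) kernel applied to a constant image returns the constant:
img = np.full((5, 5), 7.0)
box = np.ones((3, 3)) / 9.0
print(conv2d(img, box))   # a 3x3 block of 7s
```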

The Gaussian blur is the prototypical filter. Its kernel is a 2D Gaussian function, truncated and normalised. Applied to an image, it attenuates high spatial frequencies — fine detail, noise, sharp edges — while preserving low frequencies. The Gaussian is the unique filter that is isotropic (no preferred direction), separable (a 2D Gaussian is the product of two 1D Gaussians, making the filter O(n) instead of O(n²)), and scale-parametrised by a single number σ. Most blurs you see in image-editing software, most noise-reduction preprocessing, and most multi-scale analysis start with a Gaussian blur for these reasons.
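
Constructing a Gaussian kernel from σ takes a few lines; truncating at 3σ is a common convention rather than a requirement:

```python
import numpy as np

# Sample the Gaussian on an integer grid, truncate at ~3*sigma, and normalise
# so the weights sum to 1 (the filter then preserves mean brightness).
def gaussian_kernel_1d(sigma):
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    g = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return g / g.sum()

g = gaussian_kernel_1d(1.0)
print(len(g), round(float(g.sum()), 6))  # 7 1.0 -- seven taps, unit mass
```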

A mean filter averages each pixel with its neighbours in a square window; it is the simplest conceivable smoother, but its box-shaped kernel has a sinc-shaped frequency response whose sidelobes leak high frequencies and produce visible artefacts, so it is rarely preferred over a Gaussian when resources allow. A median filter replaces each pixel with the median of its neighbours, and its non-linearity gives it one superpower: removal of salt-and-pepper noise without blurring edges. Medical imaging, document scanning, and any pipeline with impulsive noise benefit from a median filter as preprocessing.

The bilateral filter. A Gaussian blur weights neighbours only by spatial distance; it smooths edges and detail together. The bilateral filter (Tomasi and Manduchi, 1998) weights each neighbour by both spatial distance and intensity difference, so pixels across a strong edge contribute little to each other. The result is an edge-preserving smoother: noise is removed in flat regions while sharp edges remain sharp. Bilateral filtering is the classical ancestor of today's edge-aware learned denoisers, and variants (joint bilateral, guided filter) are still standard in computational photography. The operator is non-linear — it cannot be expressed as a convolution — and requires O(k²) work per pixel for a k×k window, motivating a literature on fast approximations.
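
A direct, deliberately slow sketch of the bilateral filter (parameter names and defaults are my own illustrative choices):

```python
import numpy as np

# Each neighbour is weighted by spatial distance AND intensity difference, so
# pixels on the far side of a strong edge contribute almost nothing.
def bilateral(img, radius=2, sigma_s=2.0, sigma_r=25.0):
    H, W = img.shape
    out = np.zeros((H, W))
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(x ** 2 + y ** 2) / (2 * sigma_s ** 2))
    padded = np.pad(img.astype(np.float64), radius, mode="edge")
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            rng_w = np.exp(-((patch - img[i, j]) ** 2) / (2 * sigma_r ** 2))
            w = spatial * rng_w
            out[i, j] = (w * patch).sum() / w.sum()
    return out

# A noisy step edge: flat regions are smoothed, the 50 -> 200 jump survives.
step = np.where(np.arange(16) < 8, 50.0, 200.0)
img = np.tile(step, (16, 1)) + np.random.default_rng(4).normal(0, 5, (16, 16))
out = bilateral(img)
print(out[:, :6].std() < img[:, :6].std())              # noise reduced
print(abs(out[:, 7].mean() - out[:, 8].mean()) > 100)   # edge preserved
```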

A sharpening filter is the logical inverse of a blur: it amplifies high frequencies at the expense of low ones. The classical implementation is unsharp masking: compute a blurred version of the image, subtract it from the original to get the high-frequency residual, multiply by a gain, and add back. The name comes from analogue photographic darkroom practice (an unsharp mask was a blurred negative), but the digital implementation is a one-line convolution. Most image-editing software's "sharpen" tool is an unsharp mask with hidden parameters.
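
Unsharp masking in a few lines of NumPy, using a small separable blur as the low-pass step (the 3-tap kernel is an illustrative choice):

```python
import numpy as np

# Blur, subtract to isolate the high-frequency residual, scale, add back.
def blur_rows(a, k=np.array([0.25, 0.5, 0.25])):
    return np.apply_along_axis(np.convolve, 1, a, k, mode="same")

def unsharp(img, gain=1.0):
    blurred = blur_rows(blur_rows(img).T).T   # separable 2D blur
    return img + gain * (img - blurred)       # add back the scaled residual

# A soft ramp edge: sharpening produces the classic undershoot/overshoot halo.
img = np.tile(np.array([10., 10., 10., 30., 50., 50., 50.]), (5, 1))
out = unsharp(img, gain=1.0)
print(out[2])   # dips below 10 left of the ramp, peaks above 50 right of it
```

The dip and peak flanking the ramp are the sharpening "halo" visible in any over-sharpened photograph.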

Convolution's key mathematical property is that it is linear and shift-invariant: the same kernel applied everywhere, with proportional response to input intensity. This means the Fourier transform diagonalises it: convolution in the spatial domain equals multiplication in the frequency domain. This equivalence is more than theoretical. Large convolutions (say, with kernels larger than 50×50) are computed faster by FFT than by direct convolution, a trick called fast convolution. It also clarifies what each filter is doing: a Gaussian blur is a low-pass filter in the frequency domain; a sharpening filter is a high-pass filter; a bandpass filter (e.g., difference of Gaussians) isolates a specific frequency band.
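
The equivalence can be checked directly. Zero-padding both signals to length len(a) + len(k) − 1 turns the FFT's inherently circular convolution into ordinary linear convolution:

```python
import numpy as np

# Convolution theorem in practice: multiply spectra, inverse-transform.
rng = np.random.default_rng(5)
a = rng.random(64)              # a 1D "image" row
k = rng.random(9)               # a 9-tap kernel

n = len(a) + len(k) - 1         # pad length for linear (non-wrapping) result
fft_conv = np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(k, n), n)
direct = np.convolve(a, k)      # direct full linear convolution

print(np.allclose(fft_conv, direct))  # True
```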

Separability is a critical computational property. A 2D filter is separable if its kernel can be written as the outer product of two 1D filters. A k×k separable filter requires 2k multiplies per pixel instead of k², a substantial speedup for larger kernels. The Gaussian is separable; a pure box filter is separable; the Sobel gradient filters are separable. Whenever possible, classical vision code implements large filters as two consecutive 1D convolutions. Modern CNNs have rediscovered this idea under the name depthwise separable convolution (MobileNet, Xception), where separating spatial and channel operations produces networks with an order of magnitude less compute for comparable accuracy.
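A quick numerical check of the identity — the 5-tap binomial filter here is an illustrative stand-in for a sampled Gaussian:

```python
import numpy as np

def conv_valid(img, kernel):
    """Brute-force 'valid' 2D correlation (no padding) — enough to
    demonstrate the identity, since the kernels here are symmetric."""
    kh, kw = kernel.shape
    h, w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((h, w))
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * img[dy:dy + h, dx:dx + w]
    return out

# A 5-tap binomial filter (a cheap Gaussian approximation): the 2D kernel
# is the outer product of the 1D filter with itself.
g1 = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
g2 = np.outer(g1, g1)

img = np.random.default_rng(2).standard_normal((20, 20))
full_2d = conv_valid(img, g2)                    # 25 multiplies per pixel
rows = conv_valid(img, g1[np.newaxis, :])        # 5 multiplies per pixel
separable = conv_valid(rows, g1[:, np.newaxis])  # + 5 more
```

The two-pass result matches the full 2D convolution exactly, at 10 multiplies per pixel instead of 25.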

Finally, there are boundary conditions: what to do at pixels within one kernel radius of the image border. The standard options are zero-padding (pretend the outside is black), reflection (mirror the image across its boundary), replication (extend edge pixels outward), wrap (treat the image as toroidal), and crop (produce a smaller output). Each has consequences — zero-padding introduces dark edges in convolutions, yet it is what most CNN libraries default to; reflection and replication preserve local statistics better. These boundary details become invisibly important when a model underperforms near image edges, which is often a boundary-condition problem rather than a modelling one.
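The padding-based options map directly onto NumPy's `np.pad` modes; a one-signal illustration:

```python
import numpy as np

# The four padding-based boundary conditions on a tiny 1D signal
# (the fifth option, crop, simply produces a smaller output instead).
x = np.array([1, 2, 3, 4])
zero = np.pad(x, 2, mode='constant')  # pretend the outside is black
refl = np.pad(x, 2, mode='reflect')   # mirror across the boundary
repl = np.pad(x, 2, mode='edge')      # replicate the edge pixels
wrap = np.pad(x, 2, mode='wrap')      # treat the signal as circular
```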

§6

Edge detection

An edge in an image is a curve along which the intensity changes rapidly. Edges are the skeleton of an image's structural content: they outline objects, mark boundaries between materials, and are the first and most reliable cue for recognition. Edge detection was the original problem of classical computer vision, and the techniques developed for it — Sobel, Prewitt, Laplacian of Gaussian, Canny — remain the canonical reference. They are also, not coincidentally, the operations that the first convolutional layer of every trained vision network learns to implement.

The simplest edge detectors are gradient operators. The image intensity I(x,y) is treated as a function of spatial coordinates, and its gradient vector ∇I = (∂I/∂x, ∂I/∂y) points in the direction of steepest intensity change with magnitude proportional to how steep that change is. To compute this discretely, we need a finite-difference approximation. The simplest is a 1×2 kernel [−1, 1], but this is so noise-sensitive as to be unusable. The Sobel operator uses a 3×3 kernel that combines a differentiating 1D filter with a smoothing 1D filter perpendicular to it, producing a gradient estimate that is both more accurate and less noisy. The Prewitt operator is similar, with a uniform smoother instead of a Gaussian-like one.

Given the x-gradient Gx and the y-gradient Gy, the gradient magnitude is √(Gx² + Gy²) and the gradient direction is atan2(Gy, Gx) — the two-argument arctangent, which handles Gx = 0 and preserves the quadrant. Thresholding the magnitude produces an edge map: a binary image in which edge pixels are white and flat regions are black. This naive approach produces thick, noisy edge maps that need considerable cleanup, which is what more sophisticated detectors do next.
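A minimal NumPy sketch of the Sobel gradient, exploiting the separable structure described above (edge-replication padding is an arbitrary choice here):

```python
import numpy as np

def sobel(img):
    """Sobel gradients via separable passes: a central difference along
    one axis, a [1,2,1] smoother along the other."""
    p = np.pad(img.astype(float), 1, mode='edge')
    dx = p[:, 2:] - p[:, :-2]                      # difference in x
    gx = dx[:-2, :] + 2 * dx[1:-1, :] + dx[2:, :]  # smooth in y
    dy = p[2:, :] - p[:-2, :]                      # difference in y
    gy = dy[:, :-2] + 2 * dy[:, 1:-1] + dy[:, 2:]  # smooth in x
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)  # two-argument arctangent: no Gx = 0 trouble
    return gx, gy, mag, ang

# A vertical step edge: strong horizontal gradient, zero vertical gradient.
img = np.hstack([np.zeros((10, 8)), np.ones((10, 8))])
gx, gy, mag, ang = sobel(img)
```

The magnitude is large only in the two columns straddling the step, and the direction there is 0 radians — the gradient points in +x, across the edge.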

The Canny edge detector. John Canny's 1986 paper A Computational Approach to Edge Detection is the most influential single paper in classical low-level vision. Canny formulated edge detection as an optimisation: a good edge detector should maximise detection (finding true edges), maximise localisation (placing the edge at the right pixel), and minimise responses to noise. The optimum of these competing criteria is a Gaussian derivative filter, which he approximated with a multi-step pipeline: (1) Gaussian blur to suppress noise; (2) Sobel gradient; (3) non-maximum suppression along the gradient direction (keep only the ridge-top of each edge response, producing one-pixel-wide edges); (4) double thresholding with hysteresis (accept pixels above a high threshold, and pixels above a low threshold that are connected to accepted pixels). The result — clean, thin, well-localised edge maps — has been the gold standard for nearly forty years.
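Step (4) is the least obvious of the four. A minimal sketch of double thresholding with hysteresis, assuming the gradient magnitude has already been computed and non-maximum-suppressed (the thresholds and the tiny test row are illustrative):

```python
import numpy as np
from collections import deque

def hysteresis(mag, low, high):
    """Canny step 4: keep pixels above `high`, plus pixels above `low`
    that are 8-connected (directly or transitively) to a kept pixel."""
    strong = mag >= high
    weak = mag >= low
    keep = strong.copy()
    queue = deque(zip(*np.nonzero(strong)))   # flood outward from strong pixels
    while queue:
        y, x = queue.popleft()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if (0 <= ny < mag.shape[0] and 0 <= nx < mag.shape[1]
                        and weak[ny, nx] and not keep[ny, nx]):
                    keep[ny, nx] = True
                    queue.append((ny, nx))
    return keep

mag = np.array([[0.0, 3.0, 9.0, 3.0, 0.5, 3.0]])
edges = hysteresis(mag, low=2.0, high=8.0)
```

The two 3.0 pixels adjacent to the strong 9.0 survive, while the isolated 3.0 beyond the 0.5 gap is rejected — weak evidence counts only when it continues a confident edge.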

The Laplacian operator, ∇²I = ∂²I/∂x² + ∂²I/∂y², is the second-derivative analogue of the gradient. Edges correspond to zero crossings of the Laplacian: places where the second derivative changes sign. The practical version is the Laplacian of Gaussian (LoG), introduced by David Marr and Ellen Hildreth in 1980, which first blurs with a Gaussian of scale σ and then applies the Laplacian. Detecting zero crossings of the LoG produces edges at the scale determined by σ. This Marr-Hildreth operator was a foundational moment in vision, connecting edge detection to the biological receptive fields of retinal ganglion cells, and its logic still underlies the Difference of Gaussians operator used in SIFT-style keypoint detection.

Why do neural networks rediscover edge detectors? The answer is that a bank of oriented Gabor filters — edge detectors tuned to different orientations and scales — is the optimal linear feature extractor for natural images under standard statistical models (sparse coding; independent component analysis applied to natural image patches). When a CNN is trained on natural images with any reasonable loss, its first layer will develop filters that look like rotated Sobel kernels at multiple scales, because that is where the information is. This is not a quirk; it is a deep property of natural images that the classical vision community discovered a generation before deep learning.

Modern deep-learning edge detection — Holistically-Nested Edge Detection (HED), Richer Convolutional Features (RCF), and their successors — has pushed benchmark performance past classical detectors on natural images by fusing features across multiple layers of a pretrained backbone. But classical detectors remain the right tool in three situations: when you have no training data, when interpretability is required (industrial inspection, scientific imaging), and when preprocessing a downstream learned component. Almost every document-understanding pipeline that starts with a vision-language model still runs Canny edge detection somewhere in its preprocessing for deskewing, page segmentation, or table extraction. The classical techniques are not replaced; they are the infrastructure.

§7

Corner detection

Corners — points in the image where intensity changes strongly in two directions — are the canonical candidates for local features: small patches that can be uniquely identified, repeatably located, and matched across different views of the same scene. A corner is more distinctive than an edge, because an edge extends in one direction and cannot be localised along it. Corner detection is the foundation of image matching, tracking, structure-from-motion, and every algorithm that needs to answer "is the same physical point visible in both of these images?"

The canonical corner detector is the Harris operator, introduced by Chris Harris and Mike Stephens in 1988. The core idea is to measure how much the intensity pattern in a small window changes when the window is shifted in any direction. At a flat region, shifting in any direction produces little change; at an edge, shifting along the edge produces no change but shifting across it produces large change; at a corner, shifting in any direction produces large change. Harris formalised this by computing, for each pixel, a 2×2 matrix called the structure tensor:

M(x,y) = Σ_w ∇I ∇Iᵀ = Σ_w [ Ix²   IxIy ;  IxIy   Iy² ]

where the sum is over a Gaussian-weighted window w. The eigenvalues of M tell you everything: both small → flat region; one large, one small → edge (change in one direction only); both large → corner (change in both directions). Rather than compute eigenvalues directly, Harris used the cornerness response R = det(M) − k·trace(M)², a polynomial in the eigenvalues that is large and positive at corners, negative at edges, and near zero in flat regions. A threshold on R, followed by non-maximum suppression in a local neighbourhood, yields a sparse set of corner locations.
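A compact NumPy sketch of the Harris response on a synthetic image — a box window stands in for the Gaussian weighting, and k = 0.05 sits in the usual 0.04–0.06 range:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def harris_response(img, k=0.05, win=5):
    """Cornerness R = det(M) - k*trace(M)^2 from the structure tensor M,
    summed over a win x win box window (a box stands in for the usual
    Gaussian weighting to keep the sketch short)."""
    iy, ix = np.gradient(img.astype(float))
    sums = lambda a: sliding_window_view(a, (win, win)).sum(axis=(2, 3))
    ixx, iyy, ixy = sums(ix * ix), sums(iy * iy), sums(ix * iy)
    det = ixx * iyy - ixy ** 2
    trace = ixx + iyy
    return det - k * trace ** 2   # offset by win//2 relative to img

# A bright square: four corners, four edges, lots of flat region.
img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0
R = harris_response(img)
```

On this image R peaks near the four corners of the square, goes negative along the edges, and sits at zero in the flat regions, matching the eigenvalue analysis above.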

Harris variants and descendants. The Shi-Tomasi detector (1994), also called Good Features to Track, replaces the Harris response R with the minimum eigenvalue λmin(M), which is both simpler and more directly related to the geometric intuition. OpenCV's goodFeaturesToTrack uses Shi-Tomasi; many modern SLAM systems use Shi-Tomasi corners as their first step. The FAST detector (Rosten and Drummond, 2006) takes an entirely different approach — it classifies a pixel as a corner if a circular ring of 16 surrounding pixels contains a contiguous arc of at least 9 that are all much brighter or much darker than the centre. FAST is an order of magnitude faster than Harris and is the workhorse of real-time vision on embedded systems.

Beyond the basic detector, practical use of corners raises several engineering questions. How many corners? Too few and you lose tracking through view changes; too many and you waste compute and match incorrectly. The standard heuristic is to sort corners by response strength and keep the top N (often 500–2000), or to apply a minimum spacing constraint so corners are spread across the image rather than clustered. At what scale? A corner detected at one scale may not be visible at another; the scale-space extension — detecting corners at multiple σ — addresses this. At what precision? Integer-pixel corner locations are not good enough for structure-from-motion; sub-pixel refinement, by fitting a paraboloid to the response surface, gives locations accurate to roughly 0.1 pixel.

The critical property of corner detectors is repeatability: the same physical corner should be detected in different images of the same scene despite view changes, lighting changes, and noise. This is the property that separates a useful local feature from an irrelevant one, and the quality of a corner detector is typically evaluated by measuring repeatability on standard datasets (the Oxford affine-covariant dataset being the canonical benchmark). Harris and Shi-Tomasi are reasonably repeatable under translation, small rotation, and moderate illumination change; they fail under scale change and large viewpoint shift, which motivates the scale-space and affine-invariant extensions that lead directly to SIFT.

Corner detection's role in the modern deep-learning era is not what it once was. When a task requires matching across views — SLAM, visual odometry, image stitching — learned local features (SuperPoint, D2-Net, LightGlue) now typically outperform classical corners on matching benchmarks. But classical corner detection is still used for bootstrapping (initialising a learned feature system), for interpretability, and for systems where training data is unavailable. And the concept of a "corner" — a point in an image with informative local structure — has carried over essentially unchanged into the learned detectors; they are, structurally, deep descendants of Harris.

§8

Scale-space and image pyramids

A given physical structure — a face, a letter, a doorknob — can appear at many different sizes in an image, depending on its distance from the camera. A vision system that detects or matches features at only one scale is therefore fundamentally limited. Scale-space theory formalises the idea of representing an image at all scales simultaneously, and image pyramids are its most common practical implementation. Together they provide the multiresolution scaffolding on which scale-invariant feature detectors (SIFT, SURF), object detectors (Feature Pyramid Networks), and a large fraction of image-processing algorithms are built.

The mathematical foundation is scale-space theory, introduced by Witkin and Koenderink in the 1980s and developed into a systematic framework by Tony Lindeberg in the 1990s. Given an image I, the scale-space representation L(x, y, σ) is the convolution of I with a Gaussian of scale σ. Varying σ sweeps out a continuous stack of increasingly blurred images, with the property that structure smaller than about σ pixels is suppressed at that scale. Lindeberg proved that the Gaussian is the essentially unique kernel that produces a scale-space with no spurious structure — no new local maxima created by increasing σ — a result that picks the Gaussian out as the canonical multiscale smoother.

A Gaussian pyramid is the discrete, computationally efficient version of scale-space. Starting from the original image at scale 1, you produce a second level by Gaussian-blurring and then downsampling by a factor of two; a third level by the same operation on the second; and so on, until the image is a handful of pixels. The downsampling makes successive levels geometrically smaller, so the pyramid's total storage is only 4/3 the original image size. Each level represents the image smoothed to a particular scale and sampled at a matching resolution. The pyramid was introduced by Burt and Adelson in 1983 and has been a workhorse ever since.
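A sketch of pyramid construction, with a separable 3-tap binomial blur standing in for the Burt-Adelson 5-tap kernel:

```python
import numpy as np

def blur3(img):
    """Separable 3-tap binomial blur [1,2,1]/4 with edge replication
    (a cheap stand-in for the Burt-Adelson 5-tap kernel)."""
    k = np.array([1.0, 2.0, 1.0]) / 4.0
    p = np.pad(img, 1, mode='edge')
    h = k[0] * p[:, :-2] + k[1] * p[:, 1:-1] + k[2] * p[:, 2:]
    return k[0] * h[:-2, :] + k[1] * h[1:-1, :] + k[2] * h[2:, :]

def gaussian_pyramid(img, levels):
    """Blur, then drop every other row and column, repeatedly."""
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(blur3(pyr[-1])[::2, ::2])
    return pyr

img = np.random.default_rng(3).standard_normal((64, 64))
pyr = gaussian_pyramid(img, 5)
total = sum(level.size for level in pyr)
```

The total pixel count across the five levels comes to just under 4/3 of the original image, as the geometric series 1 + 1/4 + 1/16 + … predicts.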

Laplacian pyramid and DoG. The Laplacian pyramid stores the difference between consecutive Gaussian levels rather than the Gaussians themselves. Each level of the Laplacian pyramid is a band-pass-filtered image that captures the structure at one octave of scale, and the original image can be reconstructed exactly by upsampling and summing. This makes the Laplacian pyramid an ideal representation for multi-scale image editing, blending, and compression. The closely related Difference of Gaussians (DoG) — the difference of two Gaussian blurs at different σ — approximates the Laplacian of Gaussian and is the operator at the heart of SIFT keypoint detection.
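The exact-reconstruction property is worth seeing directly. In this sketch a 2×2 box blur and nearest-neighbour upsampling are deliberate simplifications of the Burt-Adelson filters, but the reconstruction identity holds for any choice, because each Laplacian level stores exactly the detail the downsampling discarded:

```python
import numpy as np

def down(img):
    """One pyramid step: 2x2 box average, then 2x downsample."""
    return (img[0::2, 0::2] + img[1::2, 0::2]
            + img[0::2, 1::2] + img[1::2, 1::2]) / 4.0

def up(img):
    """Nearest-neighbour 2x upsample."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def laplacian_pyramid(img, levels):
    gauss = [img]
    for _ in range(levels - 1):
        gauss.append(down(gauss[-1]))
    # Each Laplacian level is the detail lost between consecutive Gaussians.
    lap = [g - up(g_next) for g, g_next in zip(gauss[:-1], gauss[1:])]
    return lap, gauss[-1]

def reconstruct(lap, top):
    img = top
    for detail in reversed(lap):
        img = up(img) + detail
    return img

img = np.random.default_rng(4).standard_normal((32, 32))
lap, top = laplacian_pyramid(img, 4)
restored = reconstruct(lap, top)
```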

Why multi-scale? Because nearly every vision problem is inherently multi-scale. Classification: the same object appears at many sizes. Detection: objects in an image vary in size from a few pixels to thousands. Matching: two images of the same scene may have been taken from different distances. Edge detection: the edges that matter at one resolution (e.g., the boundary of a forest) are different from those that matter at another (the veins of a leaf). A multi-scale representation lets an algorithm process each scale appropriately and select the ones most relevant to the task.

Classical scale-space has a modern successor in deep learning: Feature Pyramid Networks (Lin et al., 2017). An FPN takes the natural multi-scale structure of a CNN (where successive pooling layers produce feature maps at successively lower resolutions) and adds top-down lateral connections to combine semantic content from deep layers with spatial resolution from shallow ones. The result is a feature pyramid where each level has both high-level semantic meaning and high spatial precision, at a scale appropriate to that level. FPNs are standard in object detection (RetinaNet, Mask R-CNN) and are a direct intellectual descendant of classical pyramids.

Pyramids also enable coarse-to-fine processing: start at the pyramid's top (small, heavily smoothed), solve a problem approximately there, use that solution to initialise the problem at the next level down, and iterate. Classical optical flow (Lucas-Kanade with pyramids), stereo matching, image registration, and template matching all benefit from coarse-to-fine approaches because they dramatically reduce the search space and the risk of getting stuck in a local minimum. This pattern recurs throughout vision: solve a small problem first, then use its answer to solve a larger, more complex version.

§9

SIFT-style feature descriptors

The Scale-Invariant Feature Transform (SIFT), published by David Lowe in 1999 and refined in 2004, was the most important single algorithm in classical computer vision. It combined a keypoint detector (which found distinctive points in scale-space) with a descriptor (a 128-dimensional vector describing the appearance around each keypoint) that was, famously, invariant to image scale, rotation, and small changes in illumination and viewpoint. For roughly a decade — from 2000 to 2012 — SIFT was the dominant building block of image matching, panorama stitching, structure-from-motion, object recognition, and robot localisation. When CNNs swept the field after 2012, they replaced SIFT in most of these applications, but the conceptual vocabulary of "interest point plus descriptor" — and the evaluation methodology — carried over unchanged.

SIFT's detector is scale-space extrema detection in the Difference of Gaussians. The image is convolved with Gaussians at a dense set of scales, and consecutive blurred images are subtracted to produce a DoG pyramid. Each pixel in the DoG pyramid is compared to its 26 neighbours — 8 in the same scale and 9 at each of the adjacent scales — and kept if it is the extremum (max or min) of that neighbourhood. This identifies blob-like structures at the scale where they are most prominent. Sub-pixel and sub-scale localisation, plus edge-response rejection, refine and filter the detections.

The SIFT descriptor is more subtle. Around each detected keypoint, SIFT computes the dominant orientation (from a histogram of local gradient directions) and uses it to rotate the local patch to a canonical orientation. The patch is then divided into a 4×4 grid of cells, and each cell produces an 8-bin histogram of gradient orientations weighted by gradient magnitude. The concatenation of these 16 histograms produces a 128-dimensional vector. The vector is normalised to unit length, clipped to reduce the influence of large gradients, and re-normalised. The result is a descriptor that is invariant to rotation (via the canonical orientation), roughly invariant to small affine changes, and robust to illumination changes (gradients discard absolute brightness, and the normalisation steps absorb contrast changes).

Why SIFT worked. SIFT's remarkable robustness comes from four specific design choices: orientation normalisation (rotation invariance), gradient histograms (illumination invariance and tolerance to small spatial misalignment), scale-space detection (scale invariance), and large descriptor dimensionality (discriminative power). Each choice is defensible in isolation; the combination produced a feature that worked well enough that one algorithm dominated a field for a decade. Lowe's 2004 IJCV paper is one of the most-cited in computer science, and the descriptor is still the default when you need a classical local feature.

SIFT inspired a family of descendants. SURF (Speeded-Up Robust Features, Bay et al. 2006) replaces the DoG pyramid with integral-image-based box filters, reducing the compute cost by an order of magnitude at modest cost in accuracy. SURF descriptors are 64-dimensional, built from Haar wavelet responses rather than gradient histograms. For a period in the late 2000s, SURF was the fast-but-rough alternative to SIFT in real-time systems. GLOH, DAISY, LIOP, and various other variants explored the descriptor-design space; none fully displaced SIFT in quality, and the SURF-vs-SIFT trade-off of speed vs. accuracy persisted for years.

The practical use of SIFT is almost always in the match-and-filter paradigm. Extract SIFT features from two images; for each feature in the first image, find its nearest neighbour in the second by L2 distance in descriptor space; accept the match if the nearest-neighbour distance is significantly smaller than the second-nearest (Lowe's ratio test, typically with a ratio around 0.7–0.8); then filter the matches by geometric consistency — RANSAC on a homography or fundamental matrix, rejecting matches that do not fit the geometric model. This pipeline underlies nearly every panorama stitcher, structure-from-motion system, and image-retrieval engine from 2000 to 2015.
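The ratio test is a few lines of NumPy. In this sketch the descriptors are synthetic (ten noisy copies of true matches plus twenty unrelated vectors — dimensions and noise level are illustrative), and the RANSAC geometric-verification stage is omitted:

```python
import numpy as np

def ratio_test_matches(desc1, desc2, ratio=0.7):
    """Lowe's ratio test: accept a match only when the nearest neighbour
    (L2 in descriptor space) is clearly closer than the second nearest."""
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    order = np.argsort(d, axis=1)
    rows = np.arange(len(desc1))
    best, second = order[:, 0], order[:, 1]
    keep = d[rows, best] < ratio * d[rows, second]
    return [(int(i), int(best[i])) for i in np.nonzero(keep)[0]]

rng = np.random.default_rng(5)
desc2 = rng.standard_normal((50, 128))
# Image 1: ten noisy copies of image 2's first ten descriptors (true
# matches), plus twenty unrelated descriptors (should all be rejected).
desc1 = np.vstack([desc2[:10] + 0.05 * rng.standard_normal((10, 128)),
                   rng.standard_normal((20, 128))])
matches = ratio_test_matches(desc1, desc2)
```

The unrelated descriptors fail the test because, in high dimensions, their nearest and second-nearest distances are nearly equal — exactly the ambiguity the ratio test is designed to detect.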

The patent encumbrance of SIFT (expired 2020) drove the community toward patent-free alternatives throughout the 2000s and 2010s, which produced the binary descriptors (ORB, BRIEF, BRISK — covered in the next section) and, eventually, deep descriptors (SuperPoint, D2-Net, R2D2). Modern learned local features are now better than SIFT on the standard matching benchmarks, but SIFT remains the reference: every new feature paper compares against SIFT, and SIFT is still the default in teaching and in many production systems that cannot afford to ship a neural network for this task.

§10

Binary feature descriptors

SIFT and SURF descriptors are real-valued vectors — 128 or 64 floats per feature — and matching them requires L2 distance, which is expensive when you have millions of descriptors to compare. For real-time applications on mobile phones, drones, robots, and embedded systems, this cost is prohibitive. The response in the late 2000s was a family of binary descriptors — features composed of a few hundred bits each, matchable by Hamming distance (a single XOR and popcount per comparison). The flagship is ORB (Oriented FAST and Rotated BRIEF), published by Ethan Rublee and colleagues in 2011, which combines a FAST-based detector with a binary descriptor and has been a workhorse of real-time vision ever since.

The idea behind binary descriptors is radical: forget gradient histograms and inner-product matching. Around each keypoint, pick a fixed pattern of pixel pairs, compare the intensity at each pair, and store a 1 if the first is brighter or a 0 otherwise. The descriptor is simply a bit string indicating the outcome of these pairwise comparisons. Matching becomes XOR-plus-popcount — a single CPU cycle on modern hardware — and storage is 32 or 64 bytes per descriptor, compared to SIFT's 512.

BRIEF (Binary Robust Independent Elementary Features, Calonder et al. 2010) was the first clean formulation: given a patch around a keypoint, compute a 256-bit descriptor by performing 256 pre-selected intensity comparisons. The patch is first Gaussian-smoothed to reduce noise, and the comparison pattern is fixed in advance (BRIEF authors tested several sampling strategies and found that random sampling from a Gaussian distribution around the centre worked well). BRIEF is fast and surprisingly discriminative, but it has no intrinsic rotation invariance.

ORB: Oriented FAST and Rotated BRIEF. ORB adds rotation invariance to BRIEF by first computing the dominant orientation of the patch (using the intensity centroid), rotating the BRIEF sampling pattern by that angle, and then computing the comparisons. ORB also uses FAST corners at multiple scales (a Gaussian pyramid) for the detection stage, giving scale invariance. The final descriptor is 256 bits, rotation- and scale-invariant, and roughly an order of magnitude faster to extract and match than SIFT. ORB is the default detector in OpenCV for real-time applications and is the feature used in ORB-SLAM, one of the most successful monocular SLAM systems of the 2010s.

BRISK (Binary Robust Invariant Scalable Keypoints, Leutenegger et al. 2011) uses a symmetric concentric sampling pattern and an adaptive scale-space detector; its descriptors are 512 bits. FREAK (Fast Retina Keypoint, Alahi et al. 2012) takes inspiration from the human retina's coarser-peripheral, finer-foveal sampling and uses a retinal-inspired comparison pattern. AKAZE (Accelerated KAZE, Alcantarilla et al. 2013) uses non-linear scale-space to preserve edges and has a binary descriptor option. These variations differ in detail but share the structure: detect, compute intensity comparisons, compress to bits.

Binary descriptors match via Hamming distance: the number of bit positions at which two descriptors differ. On modern CPUs, Hamming distance between two 256-bit descriptors is four 64-bit XORs, each followed by a popcount — a handful of cycles in total. Compared to a SIFT L2 distance (128 subtractions, 128 multiplications, and 127 additions — hundreds of cycles, even when the square root is skipped because only comparisons matter), this is a transformative speedup. For nearest-neighbour search, binary descriptors can be indexed with LSH (Locality-Sensitive Hashing) or hierarchical clustering trees, both of which scale to millions of descriptors on embedded hardware.
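The XOR-plus-popcount pattern in NumPy form, where `np.unpackbits` stands in for the hardware popcount instruction:

```python
import numpy as np

# 256-bit descriptors stored as 32 uint8 bytes each.
rng = np.random.default_rng(6)
db = rng.integers(0, 256, size=(1000, 32), dtype=np.uint8)  # 1000 descriptors
query = db[123].copy()
query[5] ^= 0b00000001  # flip a single bit: Hamming distance 1 to db[123]

# Hamming distance to every descriptor: XOR, then count set bits.
dists = np.unpackbits(np.bitwise_xor(db, query), axis=1).sum(axis=1)
best = int(np.argmin(dists))
```

The nearest neighbour is found at distance 1, while the unrelated random descriptors sit near the expected distance of 128 (half of 256 bits).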

The trade-off is accuracy. Binary descriptors are less discriminative than SIFT or SURF: fewer bits, less expressive comparison, more sensitivity to affine distortions. In matching benchmarks with large viewpoint changes, they underperform the classical real-valued descriptors. But in the regime they were designed for — fast matching across modest view changes, e.g., visual odometry where consecutive frames are mostly the same — they are often good enough, and their speed makes many applications possible that would otherwise require a GPU.

The modern counterpart is learned binary descriptors. Networks trained with contrastive losses and binary activations (e.g., DeepBit, LDLR) can outperform classical binary descriptors on matching benchmarks, while learned real-valued descriptors (SuperPoint, R2D2) beat both. Binary descriptors are now the niche choice for applications with tight compute budgets — embedded vision, drone SLAM, augmented reality on phones without neural accelerators — where their efficiency still matters. ORB remains the most common single choice in this niche.

§11

HOG — Histograms of Oriented Gradients

SIFT-style descriptors describe local patches around interest points. For many problems — pedestrian detection, face detection, sliding-window object detection in general — you need a descriptor of a larger image region, computed densely everywhere rather than at sparse keypoints. The dominant answer for roughly a decade before CNNs took over was HOG, the Histogram of Oriented Gradients, introduced by Navneet Dalal and Bill Triggs in a 2005 paper on pedestrian detection. HOG plus a linear SVM was the state-of-the-art pedestrian detector until 2012 and remains a standard baseline in any sliding-window detection benchmark.

The HOG descriptor computes the gradient at every pixel in an image region, bins the gradients by orientation in small spatial cells, normalises the bins in larger overlapping blocks, and concatenates everything into a single large vector. The typical parameters: divide the region into 8×8 pixel cells; for each cell, compute a histogram of 9 orientation bins (each covering 20° of the 0-180° range, because opposite gradients are equivalent); group cells into 2×2 overlapping blocks; normalise each block's 36-dim vector by its L2 norm; concatenate. A 64×128 pedestrian detection window produces a 3780-dimensional HOG descriptor.
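The 3780 figure falls straight out of the block arithmetic; as a worked check:

```python
# The Dalal-Triggs arithmetic for a 64x128 detection window.
win_w, win_h = 64, 128
cell = 8     # cells are 8x8 pixels
bins = 9     # 9 orientation bins over 0-180 degrees
block = 2    # blocks are 2x2 cells
stride = 1   # block stride of one cell (8 pixels): blocks overlap

cells_x, cells_y = win_w // cell, win_h // cell     # 8 x 16 cells
blocks_x = (cells_x - block) // stride + 1          # 7
blocks_y = (cells_y - block) // stride + 1          # 15
dim = blocks_x * blocks_y * (block * block * bins)  # 105 blocks x 36 dims
```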

The Dalal-Triggs pedestrian detector trained a linear SVM on these 3780-dimensional descriptors: positive examples were labelled pedestrian images, negatives were random image patches. At test time, the image was scanned at multiple scales with a sliding window, HOG was computed for each window, and the SVM score was thresholded. A second stage of hard-negative mining — retraining on the false positives of the first round — substantially improved precision. The final system, trained on the INRIA Person dataset they released alongside the paper, outperformed every previous pedestrian detector by a wide margin.

Why HOG worked. HOG's power comes from three properties. Gradient-based: gradients are invariant to absolute brightness and roughly invariant to small contrast changes. Local normalisation: the block-level L2 normalisation removes variation due to local illumination changes, so a pedestrian in shadow and a pedestrian in sun produce similar descriptors. Spatial partitioning: the 8×8 cells encode where a gradient was observed, not just what, so the descriptor captures overall shape — the vertical gradients at a human's torso edges, the horizontal gradients at head-and-shoulders boundaries. The linear SVM then learns a weighted template: the weights visualised as an image look strikingly like a human silhouette.

HOG's descendants were legion. Deformable Part Models (Felzenszwalb, Girshick, McAllester, Ramanan 2008-2010) extended HOG+SVM to a structured model of object parts: a coarse root filter plus several finer part filters, connected by deformation costs that allow the parts to move slightly relative to the root. DPMs dominated the PASCAL VOC detection benchmark from 2008 until CNNs arrived. Integral Channel Features and the Viola-Jones face detector are close relatives, using rectangular features computed from integral images instead of gradient histograms.

HOG was replaced, starting in 2014, by region-based CNNs (R-CNN, Fast R-CNN, Faster R-CNN) that learned richer features end-to-end from detection labels. The learned features quickly exceeded HOG's accuracy by a wide margin on VOC and MS-COCO. But HOG has not disappeared: it remains the canonical example in every textbook; OpenCV's pedestrian detector is still HOG+SVM; and for applications with extreme compute constraints (low-power sensors, old hardware, real-time embedded systems) HOG's cost-accuracy trade-off can still be competitive.

The intellectual legacy is larger than the direct legacy. Every design choice in HOG — gradient features, local histograms, block normalisation, spatial pyramids — has a direct analogue in modern CNNs. The first layer of a CNN learns oriented gradients; pooling computes local histograms; batch normalisation and layer normalisation are distant cousins of block normalisation. Reading the Dalal-Triggs paper with a CNN practitioner's eye is an exercise in seeing how deep the continuity runs between classical and learned vision.

§12

Classical segmentation

Segmentation is the problem of partitioning an image into coherent regions — separating foreground from background, finding the boundaries of objects, identifying uniform materials. Modern segmentation is dominated by U-Net, DeepLab, and Mask R-CNN, which produce pixel-level masks end-to-end. But beneath and around those learned systems, classical segmentation algorithms — thresholding, region growing, watershed, graph cuts, mean-shift — remain essential tools: they provide initialisations, post-processing, fallback modes when training data is absent, and the conceptual foundations on which modern methods were built.

Thresholding, introduced in §4, is the simplest segmentation: classify each pixel as foreground or background by comparing its intensity to a threshold. When scene conditions are controlled — industrial inspection on a conveyor belt, document scanning, barcode reading — thresholding is still the right answer. Otsu's method chooses the threshold automatically by maximising between-class variance; adaptive thresholding computes a different threshold per local region to handle varying illumination. For multi-class segmentation, multi-level Otsu and clustering on the histogram (e.g., k-means over intensities) extend the idea.
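A minimal NumPy implementation of Otsu's method, maximising the between-class variance over the histogram — the bimodal test image and its parameters are illustrative:

```python
import numpy as np

def otsu_threshold(img, nbins=256):
    """Otsu's method: pick the threshold t maximising the between-class
    variance  (mu_T*w0 - mu0)^2 / (w0*(1-w0))  over the histogram,
    where w0 and mu0 are the cumulative class-0 weight and (unnormalised)
    mean, and mu_T is the global mean."""
    hist, edges = np.histogram(img, bins=nbins, range=(0.0, 1.0))
    p = hist / hist.sum()
    w0 = np.cumsum(p)                      # class-0 probability
    mu0 = np.cumsum(p * np.arange(nbins))  # class-0 cumulative mean (unnormalised)
    mu_t = mu0[-1]
    with np.errstate(divide='ignore', invalid='ignore'):
        sigma_b = (mu_t * w0 - mu0) ** 2 / (w0 * (1.0 - w0))
    t = int(np.argmax(np.nan_to_num(sigma_b)))
    return edges[t + 1]

# A bimodal image: dark background near 0.2, bright foreground near 0.8.
rng = np.random.default_rng(7)
fg = rng.random((64, 64)) < 0.3
img = np.clip(np.where(fg, rng.normal(0.8, 0.05, (64, 64)),
                           rng.normal(0.2, 0.05, (64, 64))), 0.0, 1.0)
t = otsu_threshold(img)
```

The chosen threshold lands in the valley between the two modes, and thresholding at it recovers the foreground mask almost perfectly.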

Region growing starts from a set of seed pixels and iteratively adds neighbouring pixels that are similar enough to the region's current statistics — in mean intensity, colour, or texture. The algorithm is greedy but intuitive; its output depends on the seeds, which can be placed by the user (semi-automatic segmentation in medical imaging) or generated automatically from local extrema. Split-and-merge is the complementary top-down approach: start with the whole image, recursively split quadrants that contain too much variation, then merge adjacent regions that are similar enough. Both algorithms are fast and produce connected regions, at the cost of sensitivity to seed placement and similarity thresholds.

Watershed transform. The watershed algorithm treats the image as a topographic surface (high intensity = high elevation), floods it from local minima (or user-placed markers), and draws region boundaries where rising water from different catchment basins meets. The result is a segmentation whose boundaries lie along image gradients — which is usually where we want them — and whose regions are exactly connected. Pure watershed over-segments badly (every local minimum becomes a region); marker-controlled watershed, where the user or an algorithm places seed markers and only those basins are allowed to exist, fixes this. Watershed is a standard algorithm in OpenCV, scikit-image, and every medical-imaging toolbox; it remains the default in many industrial segmentation pipelines.

Graph cuts formulate segmentation as a min-cut problem on a graph whose nodes are pixels and whose edges encode similarity. In the classic Boykov-Jolly interactive segmentation (2001), the user scribbles on foreground and background; an algorithm constructs a graph with pixel-to-pixel edges weighted by intensity similarity and pixel-to-label edges (to source and sink) weighted by the label probabilities from the scribbles; a max-flow/min-cut algorithm finds the cut that partitions the graph with minimum total edge weight. The resulting segmentation is globally optimal for a well-defined energy function, smooth (the pairwise term discourages label changes across similar pixels), and interactive (a few seconds for a typical image). GrabCut extends this to require only a bounding box instead of scribbles, alternating between colour-model estimation and graph-cut inference. GrabCut was a standard segmentation tool for years and still ships in many image editors.

Mean-shift segmentation (Comaniciu and Meer, 2002) takes a different approach: treat each pixel as a point in a joint spatial-plus-colour space, compute a kernel density estimate, and iteratively move each point uphill to a mode of the density. Pixels that converge to the same mode become the same segment. Mean-shift produces good edge-preserving segmentations without requiring the number of regions in advance, at the cost of slow inference and sensitivity to bandwidth parameters.
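The mode-seeking iteration is easy to sketch for a handful of feature-space points. This is a hedged, minimal version (`mean_shift` is an illustrative name): each point moves toward the Gaussian-weighted mean of the original data until it settles at a mode, and points sharing a mode share a segment.

```python
import numpy as np

def mean_shift(points, bandwidth=1.0, iters=30):
    """Move every point uphill on a Gaussian kernel-density estimate
    of the original point set."""
    modes = points.astype(float).copy()
    for _ in range(iters):
        for i in range(len(modes)):
            # Gaussian weights of all original points around the current estimate
            w = np.exp(-((points - modes[i]) ** 2).sum(axis=1)
                       / (2 * bandwidth ** 2))
            modes[i] = (w[:, None] * points).sum(axis=0) / w.sum()
    return modes
```

Two well-separated clusters collapse onto two distinct modes; the bandwidth plays exactly the sensitivity role noted in the text.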

Superpixels — SLIC, SEEDS, LSC — partition an image into several hundred or thousand small, roughly uniform regions that respect edges. They are not segmentations themselves but a useful intermediate representation: a subsequent algorithm can reason about superpixels rather than pixels, a roughly 100× reduction in compute, while preserving the ability to recover precise boundaries. Superpixel-based CRFs, graph neural networks, and boundary-aware CNNs all leverage this abstraction.

Modern learned segmentation networks — U-Net, DeepLab, SegFormer, SAM — have largely taken over the high-level task of producing pixel masks. But they typically produce soft probability masks that still need classical post-processing: thresholding to obtain hard masks, morphology (next section) to clean them up, watershed to separate touching instances, and graph-cut refinement at the boundary. Classical segmentation did not disappear; it became the layer surrounding the learned core.

§13

Mathematical morphology

Mathematical morphology is a small, elegant algebra for manipulating binary and grayscale images with shape-based operations: erosion, dilation, opening, and closing, built from a structuring element that defines the local neighbourhood. It was developed by Georges Matheron and Jean Serra at the École des Mines de Paris starting in the 1960s, primarily for analysing microscope images of minerals, and has remained a fundamental tool ever since. Every modern segmentation pipeline, OCR system, industrial vision inspection, and medical imaging application uses morphology somewhere — typically to clean up the output of a detector, refine a mask, or extract shape features. Morphology is one of the best classical-vision investments per unit of conceptual overhead.

The structuring element is a small binary pattern — a 3×3 cross, a 5×5 disc, a 7×3 rectangle — that defines what "neighbourhood" means for the operation. The choice of structuring element determines what morphology does; a horizontal rectangle picks up horizontal features, a disc is rotationally symmetric, a cross gives 4-connectivity.

Erosion replaces each pixel with the minimum over its neighbourhood (as defined by the structuring element). For a binary image, this means a foreground pixel survives only if every pixel in its neighbourhood is also foreground — the result is the original shape shrunk by the radius of the structuring element. Erosion eliminates small objects, thins thick objects, and breaks narrow connections.

Dilation is the dual: replace each pixel with the maximum over its neighbourhood. For a binary image, a pixel becomes foreground if any pixel in its neighbourhood is foreground — the shape expands by the radius of the structuring element. Dilation fills small holes, thickens thin objects, and merges nearby objects.

Opening and closing. The composite operations are what you actually use in practice. Opening is erosion followed by dilation: it removes small foreground objects and thin connections, without substantially changing the large objects. Closing is dilation followed by erosion: it fills small holes and gaps without substantially changing the boundaries. Together, opening and closing are the default morphological cleanup operations: after thresholding an image or producing a segmentation mask, an opening removes stray foreground noise, a closing fills small holes, and the result is a clean, usable mask. A typical image-processing recipe is "threshold, open, close" — three lines of code that solve a huge number of practical problems.
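The "threshold, open, close" recipe can be demonstrated with nothing but NumPy. The sketch below implements binary erosion, dilation, opening, and closing with a square structuring element; the function names and the zero-padding convention are choices made for this example, and `cv2.morphologyEx` is the production route.

```python
import numpy as np

def erode(img, k=3):
    """Binary erosion with a k x k square structuring element (zero padding)."""
    pad = k // 2
    p = np.pad(img, pad, constant_values=0)
    win = np.lib.stride_tricks.sliding_window_view(p, (k, k))
    return win.min(axis=(2, 3))          # survive only if the whole window is 1

def dilate(img, k=3):
    """Binary dilation: the local maximum over the same neighbourhood."""
    pad = k // 2
    p = np.pad(img, pad, constant_values=0)
    win = np.lib.stride_tricks.sliding_window_view(p, (k, k))
    return win.max(axis=(2, 3))          # become 1 if any neighbour is 1

def opening(img, k=3):
    return dilate(erode(img, k), k)      # removes specks, keeps large objects

def closing(img, k=3):
    return erode(dilate(img, k), k)      # fills small holes, keeps boundaries
```

Opening deletes an isolated noise pixel while leaving a solid square essentially intact; closing fills a one-pixel hole inside the square.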

Beyond the four basic operations, morphology has a rich library of derived operators. Top-hat (image minus its opening) isolates small bright features. Black-hat (closing minus image) isolates small dark features. Morphological gradient (dilation minus erosion) is an edge-strength estimate. Hit-or-miss transforms test for specific pixel patterns — corners, endpoints, isolated points — and are the basis of skeletonisation, which thins a shape to its topological skeleton. Geodesic reconstruction — iterating dilations limited to a mask — can grow regions starting from markers, and is the tool used to implement marker-controlled watershed.

Morphology extends to grayscale images naturally: erosion becomes the local minimum, dilation the local maximum, and the composite operations follow. Grayscale opening suppresses bright structures smaller than the structuring element; grayscale closing suppresses dark structures. The grayscale versions are used extensively in industrial inspection (detecting defects that are darker or brighter than their surroundings) and medical imaging (highlighting or suppressing particular tissue contrasts).

In modern deep-learning pipelines, morphology is typically applied to the binary masks produced by a segmentation network: open to remove noise, close to fill holes, possibly skeletonise to extract medial structure, and extract connected components for instance identification. It is also used in data annotation workflows: a scribble-based labelling tool dilates the user's brush strokes, and a skeletonisation step converts hand-drawn paths into single-pixel-wide topological skeletons. The operations are cheap (linear in image size times structuring element size), robust, and interpretable — a useful package that has aged well.

§14

Camera models and multi-view geometry

Pixels do not exist in a vacuum — they are projections of a three-dimensional world through a specific optical system. A camera model formalises this projection: how a 3D point in the scene becomes a 2D pixel coordinate. Understanding camera models is the foundation of multi-view geometry — the mathematics of relating multiple images of the same scene — which underlies structure-from-motion, SLAM, stereo depth, augmented reality, and every system that reconstructs 3D from 2D images. This is a deep and beautiful subject (the canonical reference is Hartley and Zisserman's Multiple View Geometry in Computer Vision, 2nd edition 2004, still the standard text) and it remains largely classical: even neural networks that infer 3D structure are typically taught and evaluated in this language.

The pinhole camera is the simplest useful model. An infinitely small aperture projects light from a 3D point X = (X, Y, Z) onto an image plane, producing a 2D point (x, y) = (fX/Z, fY/Z), where f is the focal length. The formula is intuitive once drawn: similar triangles, with the pinhole at the origin and the image plane at distance f. Adding the principal-point offset (cx, cy) and potentially different x and y focal lengths (for non-square pixels), the projection becomes x = fx·X/Z + cx, y = fy·Y/Z + cy. These four numbers — fx, fy, cx, cy — are the intrinsic parameters of the camera.

In homogeneous coordinates, projection becomes a matrix multiplication: x ∼ K [R | t] X, where K is the 3×3 intrinsic matrix, [R | t] is the 3×4 matrix of extrinsic parameters (rotation R and translation t that describe the camera's pose in world coordinates), and ∼ denotes equality up to scale. This matrix form is the compact, composable language of projective geometry and it shows up everywhere.
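In code, the homogeneous projection x ∼ K [R | t] X is a few lines. A minimal sketch with assumed intrinsics (fx = fy = 800, principal point (320, 240)) and a camera sitting at the world origin:

```python
import numpy as np

# Hypothetical intrinsics for this example:
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                       # extrinsics: camera aligned with world axes
t = np.zeros(3)                     # camera at the world origin

def project(X, K, R, t):
    """Project a 3-D world point to pixel coordinates: x ~ K [R | t] X."""
    x_cam = R @ X + t               # world -> camera coordinates
    x_hom = K @ x_cam               # camera -> homogeneous pixel coordinates
    return x_hom[:2] / x_hom[2]     # divide out the scale

# A point 2 m in front of the camera, 0.5 m to the right:
px = project(np.array([0.5, 0.0, 2.0]), K, R, t)
# fx*X/Z + cx = 800*0.5/2 + 320 = 520 ; fy*Y/Z + cy = 240
```

The same `project` function works for any pose once R and t describe a real camera; only the matrices change, which is the point of the matrix form.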

Lens distortion. Real lenses are not pinholes. They introduce radial distortion (barrel or pincushion — straight lines become curved, especially near image edges) and tangential distortion (from imperfect lens alignment). These effects are modelled with polynomial corrections: the distorted image coordinates (x_d, y_d) are related to the undistorted ones (x, y) by polynomials in radius r = √(x² + y²). Zhang's calibration method (2000) estimates both intrinsics and distortion coefficients simultaneously from multiple images of a known planar pattern (a chessboard). Calibrating your camera is the first step of any serious geometric vision pipeline; the OpenCV calibrateCamera function implements Zhang's method and is the standard tool.
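The radial part of the distortion model is a two-line polynomial. A sketch in normalised image coordinates (`distort` is an illustrative name; k1 and k2 follow the usual low-order radial model, with tangential terms omitted):

```python
def distort(x, y, k1, k2=0.0):
    """Radial distortion in normalised image coordinates:
    (x_d, y_d) = (x, y) * (1 + k1*r^2 + k2*r^4)."""
    r2 = x * x + y * y
    s = 1 + k1 * r2 + k2 * r2 * r2
    return x * s, y * s

# Barrel distortion (k1 < 0) pulls off-axis points toward the centre:
xd, yd = distort(1.0, 0.0, k1=-0.1)   # -> (0.9, 0.0)
```

Calibration solves the inverse problem: OpenCV's `calibrateCamera` estimates k1, k2 (and the intrinsics) from chessboard views via Zhang's method, and `undistort` then inverts this mapping numerically.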

A homography is a 3×3 matrix that relates pixel coordinates between two images of the same planar scene. If you photograph a whiteboard from two different angles, every pixel in one image is related to the corresponding pixel in the other by a homography H, such that x' ∼ H x. Homographies have 8 degrees of freedom and can be estimated from as few as four corresponding points. They underpin image rectification, panorama stitching, document rectification, and augmented reality overlays onto planar surfaces.
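Estimating a homography from four correspondences is a small singular-value-decomposition exercise, the direct linear transform (DLT). A minimal sketch (`homography_dlt` and `apply_h` are names invented for this example; `cv2.findHomography` is the production call):

```python
import numpy as np

def homography_dlt(src, dst):
    """Estimate H from >= 4 point pairs with the direct linear transform.
    Each correspondence contributes two rows of the homogeneous system A h = 0;
    the solution is the right singular vector with the smallest singular value."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.array(A))
    return Vt[-1].reshape(3, 3)

def apply_h(H, p):
    """Map a 2-D point through H in homogeneous coordinates."""
    q = H @ np.array([p[0], p[1], 1.0])
    return q[:2] / q[2]
```

With exactly four points in general position the system has a one-dimensional null space and the estimate reproduces the true homography up to scale, so a fifth point maps identically under both.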

When the scene is not planar but general 3D, two cameras are related by epipolar geometry. The key constraint is: given a point in image 1, its corresponding point in image 2 must lie on a specific line (the epipolar line) in image 2, regardless of the 3D depth of the point. The relationship is encoded in the fundamental matrix F — a 3×3 matrix of rank 2 with 7 degrees of freedom — such that x'ᵀ F x = 0 for any pair of corresponding points (x, x'). If the intrinsics are known, F reduces to the essential matrix E, which decomposes into a rotation and translation that describe the relative pose of the two cameras (up to scale).
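The constraint is easy to verify numerically. For calibrated cameras the essential matrix is E = [t]× R, and a sketch with a pure x-translation shows the residual x'ᵀ E x vanishing for a true correspondence (all the names below are local to this example):

```python
import numpy as np

def skew(t):
    """The 3x3 skew-symmetric matrix [t]x such that skew(t) @ v == cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Two calibrated cameras: camera 1 at the origin, camera 2 translated along x.
R = np.eye(3)
t = np.array([1.0, 0.0, 0.0])
E = skew(t) @ R                      # essential matrix

X = np.array([0.3, 0.2, 2.0])        # a 3-D scene point
x1 = np.append(X[:2] / X[2], 1.0)    # normalised image coords, camera 1
X2 = R @ X + t                       # the same point in camera-2 coordinates
x2 = np.append(X2[:2] / X2[2], 1.0)  # normalised image coords, camera 2

residual = x2 @ E @ x1               # epipolar constraint: ~0 for a true match
```

The residual is zero to machine precision for any depth Z, which is exactly the depth-independence the text describes; E is rank 2 by construction.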

The full Structure-from-Motion (SfM) pipeline is: detect features in every image; match features across images; estimate fundamental/essential matrices between image pairs; chain these together to get camera poses; triangulate 3D points from the matched features; optimise everything jointly with bundle adjustment (a sparse non-linear least-squares problem that refines all cameras and points simultaneously). COLMAP (Schönberger and Frahm, 2016) is the canonical open-source SfM pipeline and is still the backbone of NeRF and Gaussian Splatting preprocessing today — classical geometry feeding into modern learned models.

Modern learned geometry systems (deep SfM, learned feature matching like SuperGlue/LightGlue, implicit representations like NeRF) have raised the ceiling on accuracy and robustness, but they almost always inherit the classical structure: features are extracted, matches are made, poses are estimated, and a bundle-adjustment-like optimisation refines everything. The neural component provides better features, better matches, and sometimes better initialisation; the geometric machinery that turns these into actual 3D reconstructions is still the classical framework.

§15

Stereo and optical flow

Two specific problems dominate the geometric side of classical vision: stereo matching (given two images from horizontally displaced cameras, recover a depth map) and optical flow (given two frames of a video, recover a 2D motion vector for each pixel). Both are correspondence problems — finding which pixel in image 2 corresponds to a given pixel in image 1 — and their classical solutions (block matching, dynamic programming, variational methods, Lucas-Kanade, Horn-Schunck) remain the reference algorithms against which modern learned methods are evaluated. They are also the rare vision problems where the classical solutions still shine in the right niches: GPU-less embedded systems, structured-light depth sensors, and low-compute scientific imaging.

The stereo setup assumes two cameras separated horizontally (a baseline b), both with known intrinsics and with image planes that have been rectified so that corresponding points lie on the same horizontal scanline. For a 3D point at depth Z seen in both images, the displacement of the point between left and right images — the disparity d — relates to depth by the simple formula Z = fb/d. Recovering disparity per pixel is therefore equivalent to recovering depth per pixel, and the stereo problem reduces to: for each pixel (x, y) in the left image, find the x-shift d that best matches the corresponding pixel in the right image.

The classical stereo algorithm is block matching: for each pixel, take a small window around it in the left image, slide it horizontally in the right image, compute a matching cost at each candidate disparity, and pick the disparity with the lowest cost. The cost can be sum of squared differences (SSD), sum of absolute differences (SAD), or normalised cross-correlation (NCC); the last is more robust to illumination differences. Block matching is fast and embarrassingly parallel; it underpins most cheap depth sensors (passive stereo cameras from consumer manufacturers) and countless industrial measurement systems.
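Block matching is a dozen lines of NumPy. The sketch below (`block_match` is an illustrative name, SSD cost, square window) recovers the disparity of a synthetically shifted random texture; converting disparity to depth is then just Z = f·b/d.

```python
import numpy as np

def block_match(left, right, max_disp, half=2):
    """SSD block matching: for each left-image pixel, slide a (2*half+1)^2
    window horizontally across the right image and keep the cheapest disparity."""
    h, w = left.shape
    disp = np.zeros((h, w), dtype=int)
    for y in range(half, h - half):
        for x in range(half + max_disp, w - half):
            patch = left[y - half:y + half + 1, x - half:x + half + 1]
            costs = [np.sum((patch - right[y - half:y + half + 1,
                                          x - d - half:x - d + half + 1]) ** 2)
                     for d in range(max_disp + 1)]
            disp[y, x] = int(np.argmin(costs))
    return disp

# Synthetic rectified pair: the left image is the right image shifted 3 px,
# so every pixel has true disparity 3.
rng = np.random.default_rng(0)
right = rng.random((20, 40))
left = np.roll(right, 3, axis=1)
disp = block_match(left, right, max_disp=5)
```

On real images the cost would be SAD or NCC and the loops vectorised or run on dedicated hardware, but the structure is exactly this.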

Semi-Global Matching (SGM). Pure block matching produces noisy, locally inconsistent disparity maps. The state-of-the-art classical stereo algorithm, Semi-Global Matching (Hirschmüller, 2008), adds a smoothness prior via dynamic programming along multiple 1D paths across the image (typically 8 directions), aggregating matching costs with penalties for disparity discontinuities. SGM produces much smoother, more accurate disparity maps than block matching and is the algorithm at the heart of most modern passive stereo sensors — still shipping inside industrial vision systems and robotics platforms today, often alongside a learned CNN refinement.

Optical flow is the 2D analogue: given two consecutive frames, assign a 2D displacement vector (u, v) to each pixel indicating where that pixel moved from frame 1 to frame 2. Optical flow is the foundation of motion-based video understanding, visual tracking, video stabilisation, and many action-recognition systems.

The classical point-based method is Lucas-Kanade (1981). Assume that within a small neighbourhood of a pixel, the optical flow is constant, and that brightness is constant along flow vectors (the brightness constancy assumption: I(x, y, t) = I(x+u, y+v, t+1)). Linearising the brightness constancy equation gives Ix·u + Iy·v + It = 0, one equation per pixel. With N pixels in the neighbourhood, this becomes an overdetermined linear system that can be solved by least squares. Lucas-Kanade produces sparse flow — computed only at well-conditioned points (corners, not smooth regions) — and is fast enough for real-time use. The pyramidal Lucas-Kanade extension runs the method coarse-to-fine on a Gaussian pyramid, enabling it to handle large displacements.
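The single-window solve can be sketched directly from the linearised equation. Assumptions in this example: central-difference gradients, a made-up function name `lucas_kanade_point`, and a smooth synthetic blob whose known sub-pixel shift the solver should approximately recover.

```python
import numpy as np

def lucas_kanade_point(I1, I2, y, x, half=7):
    """Single-window LK: least-squares solution of Ix*u + Iy*v + It = 0
    over a (2*half+1)^2 neighbourhood around (y, x)."""
    Ix = (np.roll(I1, -1, axis=1) - np.roll(I1, 1, axis=1)) / 2   # d/dx
    Iy = (np.roll(I1, -1, axis=0) - np.roll(I1, 1, axis=0)) / 2   # d/dy
    It = I2 - I1                                                  # temporal diff
    win = (slice(y - half, y + half + 1), slice(x - half, x + half + 1))
    A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)
    b = -It[win].ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)                # overdetermined
    return u, v

# A Gaussian blob translated by (0.5, 0.3) pixels between frames:
yy, xx = np.mgrid[0:31, 0:31].astype(float)
I1 = np.exp(-((xx - 15) ** 2 + (yy - 15) ** 2) / 50)
I2 = np.exp(-((xx - 15.5) ** 2 + (yy - 15.3) ** 2) / 50)
u, v = lucas_kanade_point(I1, I2, 15, 15)
```

The estimate lands close to the true (0.5, 0.3); the pyramidal extension would repeat this solve coarse-to-fine for larger motions.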

The classical dense method is Horn-Schunck (1981). Instead of assuming local flow constancy, Horn-Schunck adds a global smoothness prior — neighbouring pixels should have similar flow — and formulates flow as the minimiser of an energy that combines the brightness constancy residual with a smoothness penalty. The solution is the zero of an Euler-Lagrange equation, which can be computed iteratively. Horn-Schunck produces dense flow fields but is sensitive to brightness changes and fails at motion discontinuities. Variational extensions in the 2000s (Brox et al., TV-L1) addressed these issues and were the state of the art until CNNs arrived.
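The standard iterative scheme for the Euler-Lagrange equations alternates a neighbourhood average of the flow with a data-term correction. A minimal sketch (`horn_schunck` is an illustrative name; the periodic test texture keeps `np.roll`'s wrap-around consistent):

```python
import numpy as np

def horn_schunck(I1, I2, alpha=0.5, iters=200):
    """Dense Horn-Schunck flow via Jacobi iterations: smooth the current flow,
    then correct it toward the brightness-constancy constraint."""
    Ix = (np.roll(I1, -1, axis=1) - np.roll(I1, 1, axis=1)) / 2
    Iy = (np.roll(I1, -1, axis=0) - np.roll(I1, 1, axis=0)) / 2
    It = I2 - I1
    u = np.zeros_like(I1)
    v = np.zeros_like(I1)
    for _ in range(iters):
        ub = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
              np.roll(u, 1, 1) + np.roll(u, -1, 1)) / 4   # 4-neighbour mean
        vb = (np.roll(v, 1, 0) + np.roll(v, -1, 0) +
              np.roll(v, 1, 1) + np.roll(v, -1, 1)) / 4
        r = (Ix * ub + Iy * vb + It) / (alpha ** 2 + Ix ** 2 + Iy ** 2)
        u = ub - Ix * r
        v = vb - Iy * r
    return u, v

# Periodic texture translated by (0.5, 0.3) px between frames:
yy, xx = np.mgrid[0:32, 0:32].astype(float)
I1 = np.sin(2 * np.pi * xx / 16) + np.cos(2 * np.pi * yy / 16)
I2 = np.sin(2 * np.pi * (xx - 0.5) / 16) + np.cos(2 * np.pi * (yy - 0.3) / 16)
u, v = horn_schunck(I1, I2)
```

The recovered field is dense and near-constant, approximating the true shift everywhere, including at pixels where the local gradient alone could not determine the motion — that is the smoothness prior at work.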

Modern learned stereo and flow — GC-Net, PSMNet, RAFT, FlowNet, Unimatch — have pushed accuracy substantially beyond the classical methods on standard benchmarks (Middlebury, KITTI, Sintel), and they generalise better to difficult cases (reflective surfaces, thin structures, large displacements). But they are orders of magnitude more compute-intensive and require large training sets with ground-truth flow/disparity — typically synthetic data or carefully constructed real datasets. For applications where compute is constrained (embedded stereo sensors, older robotics platforms, scientific imaging on CPU-only systems), SGM and pyramidal Lucas-Kanade remain standard. And the correspondence problem itself — the framing that dominates both classical and learned approaches — is still the same.

§16

Bag-of-features and image matching

Given local feature detectors (§7, §8) and descriptors (§9, §10), what can you do with them? The classical answer — which dominated image matching, retrieval, and recognition from roughly 2000 to 2012 — is the bag-of-features pipeline, built out of SIFT-like descriptors, vector quantisation, and geometric verification with RANSAC. It was the reigning paradigm for image retrieval (Google Images, Tineye, and their peers), visual place recognition (re-localising in SLAM), and pre-CNN object classification. Understanding it is useful both historically and practically: variants of this pipeline still drive production image-search and visual-localisation systems where CNNs are impractical to deploy.

The core idea of image matching — determining whether two images show the same scene — is straightforward. Extract SIFT features from both images; for each feature in image 1, find its nearest neighbour in descriptor space in image 2; apply Lowe's ratio test (accept only matches where the best neighbour is substantially better than the second-best) to filter unambiguous matches; then check geometric consistency with RANSAC.
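The matching-plus-ratio-test step looks like this on synthetic descriptors (`ratio_test_matches` is a name invented for the sketch; with real images you would use OpenCV's SIFT and a `BFMatcher`):

```python
import numpy as np

def ratio_test_matches(desc1, desc2, ratio=0.75):
    """Nearest-neighbour matching with Lowe's ratio test: keep (i, j) pairs
    whose best match is clearly better than the second best."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)
        j, j2 = np.argsort(dists)[:2]          # best and second-best neighbours
        if dists[j] < ratio * dists[j2]:       # unambiguous match only
            matches.append((i, int(j)))
    return matches
```

Descriptors that are near-copies of a specific row in the other image survive the test; ambiguous ones, whose two nearest neighbours are comparably close, are discarded before RANSAC ever sees them.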

RANSAC (Random Sample Consensus, Fischler and Bolles 1981) is a general algorithm for fitting a model to data contaminated with outliers. For image matching, the model is typically a homography (for planar scenes) or a fundamental matrix (for general scenes). RANSAC works by: (1) randomly sample the minimum number of matches needed to estimate the model (4 for a homography, 7 for a fundamental matrix — or 8 with the simpler linear eight-point algorithm); (2) estimate the model from those matches; (3) count how many other matches are consistent with the estimated model (inliers) within a pixel threshold; (4) repeat many times; (5) keep the model with the most inliers, then re-estimate from all inliers. RANSAC turns SIFT's noisy putative matches into a clean set of geometrically consistent ones, and its output is both a refined correspondence set and an estimate of the geometric transformation between the images.
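Steps (1)-(5) are generic; the sketch below applies them to the simplest possible model, a 2-D line, with a least-squares refit on the final consensus set (`ransac_line` is an illustrative name and the iteration count and threshold are arbitrary choices for the example):

```python
import numpy as np

def ransac_line(points, iters=200, thresh=0.1, rng=None):
    """RANSAC for a 2-D line y = a*x + b: sample a minimal pair, fit,
    count inliers, keep the best model, then refit on its consensus set."""
    if rng is None:
        rng = np.random.default_rng(0)
    best = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        i, j = rng.choice(len(points), size=2, replace=False)   # minimal sample
        (x1, y1), (x2, y2) = points[i], points[j]
        if x1 == x2:
            continue                       # vertical pair: cannot fit y = a*x + b
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        inliers = np.abs(points[:, 1] - (a * points[:, 0] + b)) < thresh
        if inliers.sum() > best.sum():     # keep the largest consensus set
            best = inliers
    x, y = points[best].T
    a, b = np.polyfit(x, y, 1)             # final least-squares refit on inliers
    return a, b, best
```

For homographies or fundamental matrices, only the minimal sample size, the model-fitting step, and the residual change; the hypothesise-and-verify loop is identical.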

The SIFT + RANSAC paradigm. For fifteen years, nearly every production system that needed to determine whether two images showed the same scene ran some version of SIFT + RANSAC. Panorama stitchers in consumer photography software (Photoshop's Photomerge, Lightroom's Panorama, and every smartphone's native stitcher) used this pipeline. Structure-from-motion tools (Bundler, VisualSFM, and today COLMAP) used it for pairwise matching. Augmented-reality systems used it to align virtual content to a known planar marker. Visual-localisation systems for robot SLAM used it to find the robot's pose in a prebuilt map. The pipeline is still shipping; COLMAP is still the canonical SfM tool for NeRF and Gaussian Splatting preprocessing in 2025.

Bag-of-features takes this idea from pairs of images to collections of thousands. The insight, due to Sivic and Zisserman (2003), is to treat an image as a histogram of visual words, analogous to how a document is a histogram of actual words. The pipeline: (1) collect a large set of SIFT descriptors from a training corpus; (2) cluster them with k-means into, say, 10,000 clusters — each cluster centre is a visual word; (3) for a new image, extract SIFT features, assign each to its nearest visual word, and build a histogram over the vocabulary; (4) compare images by the cosine similarity of their visual-word histograms (with TF-IDF weighting as in text retrieval). This turns image matching into a problem that could be indexed and searched with inverted files, the same data structure that powers text search engines.
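Steps (1)-(4) can be miniaturised to a few dozen points in 2-D. The sketch below (function names invented for the example, TF-IDF weighting omitted, and a deliberately naive k-means initialised from the first k descriptors) builds a tiny vocabulary and compares visual-word histograms by cosine similarity:

```python
import numpy as np

def build_vocab(descriptors, k, iters=20):
    """Tiny k-means: the cluster centres play the role of visual words.
    Initialises from the first k descriptors for simplicity."""
    centres = descriptors[:k].astype(float).copy()
    for _ in range(iters):
        assign = np.argmin(((descriptors[:, None] - centres) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (assign == c).any():
                centres[c] = descriptors[assign == c].mean(axis=0)
    return centres

def bof_histogram(descriptors, centres):
    """L1-normalised histogram of visual-word assignments for one image."""
    assign = np.argmin(((descriptors[:, None] - centres) ** 2).sum(-1), axis=1)
    hist = np.bincount(assign, minlength=len(centres)).astype(float)
    return hist / hist.sum()

def cosine(h1, h2):
    return float(h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2) + 1e-12))
```

Images drawn from the same visual words score high; images with disjoint words score near zero — the property that makes inverted-file indexing over the non-zero histogram bins work.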

Bag-of-features powered the first generation of large-scale image retrieval. Google Images (in its earliest incarnation), Tineye, and many academic systems were essentially bag-of-features engines. The approach scaled to millions of images with inverted-file retrieval, and its extensions — spatial verification (use RANSAC to re-rank the top candidates by geometric consistency), soft assignment (assign each descriptor to multiple visual words), Vector of Locally Aggregated Descriptors (VLAD), Fisher Vectors — progressively improved retrieval accuracy.

For image classification, bag-of-features was the dominant approach from roughly 2003 to 2012. The pipeline — SIFT features, visual-word assignment, spatial pyramid pooling (Lazebnik, Schmid, Ponce 2006), linear SVM on the resulting high-dimensional vector — was the state of the art on PASCAL VOC, Caltech-101, and the early ImageNet benchmarks. Krizhevsky's 2012 AlexNet paper beat this pipeline by ten points on ImageNet top-5 accuracy, and within two years the entire community had moved to CNNs.

The bag-of-features era is over, but its skeleton — detect features, describe them, quantise, aggregate, match — recurs in every vision system built since. Learned descriptors (SuperPoint, DISK), learned matchers (SuperGlue, LightGlue), learned retrieval (NetVLAD, deep global descriptors) have replaced SIFT, RANSAC, and k-means respectively, but the pipeline shape is unchanged. And in compute-constrained niches — embedded visual localisation, drone re-localisation, loop closure in SLAM — SIFT + RANSAC + BoW is still the deployed solution.

§17

Classical vision in the modern ML lifecycle

The question this final section addresses: given that deep learning has swept object recognition, segmentation, detection, and most other high-level vision tasks, what is the ongoing role of classical vision in 2025? The answer is both prosaic and pervasive. Classical vision lives in every production pipeline as preprocessing, evaluation, augmentation, post-processing, and calibration — rarely glamorous, always load-bearing. It is also the vocabulary in which modern vision is described, analysed, and taught. Understanding where classical techniques still reside, and why, is a useful map of the modern practitioner's toolkit.

Data preprocessing is the most visible role. Every vision training pipeline resizes, crops, colour-corrects, and normalises its inputs — operations that are pure classical image processing, typically implemented by OpenCV, Pillow, or torchvision's transform utilities. Medical-imaging pipelines that feed CNNs routinely apply CLAHE, denoising, bias-field correction, and skull-stripping as preprocessing — classical techniques that make the learning problem substantially easier. Document-understanding pipelines run Canny, Hough, and classical deskewing before any learned model sees the image.

Data augmentation is the training-side equivalent. Every modern augmentation library — albumentations, Kornia, imgaug, torchvision.transforms — is a collection of classical image-processing transformations: random crop, flip, rotate, affine, perspective, Gaussian blur, motion blur, JPEG compression, colour jitter, hue/saturation shift, gamma, noise injection, Cutout, MixUp, CutMix. The strength of these augmentations, and the specific choices of which to apply, are major drivers of CNN and ViT generalisation; the design space is the space of classical vision operations that preserve semantic content.

Mask post-processing. Every production segmentation system I am aware of (Meta's SAM-based tools, Google's medical-imaging segmentation, autonomous-driving perception stacks, satellite-imagery pipelines) combines a learned segmentation network with classical post-processing: threshold the probability mask, apply morphological opening to remove noise and closing to fill holes, optionally refine boundaries with a graph-cut or CRF, extract connected components, compute per-component properties. Without this post-processing, the raw output of a segmentation network is rarely usable. With it, the learned model and the classical plumbing form a system whose combined accuracy exceeds either alone.

Camera geometry and calibration remain purely classical. Camera calibration with a chessboard pattern, lens-distortion modelling and correction, stereo rectification, projection matrices — these are the same algorithms (mostly Zhang's 2000 method for intrinsics, and classical multi-view-geometry for extrinsics) that they were twenty years ago. The geometric backbone of autonomous driving, photogrammetry, 3D reconstruction, and augmented reality is classical multi-view geometry, with learned components inserted where they help (feature extraction, matching, and depth refinement) but never replacing the underlying geometric framework.

Evaluation metrics are classical. Intersection-over-Union, mean Average Precision, pixel accuracy, boundary F-measure, PSNR, SSIM — every metric used to evaluate learned vision models comes straight from classical vision, where they were designed to evaluate classical algorithms. The image quality metric SSIM is now routinely used as a loss function in learned image-restoration and super-resolution. Panoptic quality, COCO's AP definition, and KITTI's evaluation protocols were developed in the classical era.

Interpretability and analysis lean on classical concepts. Analysing what a CNN has learned begins with visualising its first-layer filters — typically in the language of oriented edges and Gabor-like filters — and continues with activation maximisation, Grad-CAM, and saliency methods that ultimately produce a kind of attention map analogous to classical saliency. The Gestalt principles, the scale-space framework, and feature hierarchies from the classical era are the conceptual backdrop against which learned representations are analysed.

Fallback modes. When you have no training data, when you need to ship on a device without a GPU, when interpretability is legally required, or when a task is so narrow that collecting data is more expensive than tuning an algorithm — classical vision is the alternative. Industrial inspection, OCR on simple documents, barcode reading, marker-based AR, real-time pose estimation from fiducials, medical-imaging preprocessing — all of these applications routinely deploy classical pipelines in 2025. Not because deep learning cannot solve them, but because classical solutions are cheaper, more predictable, and easier to certify.

The through-line to subsequent chapters is direct. Chapter 02 picks up modern image classification and architectures, from AlexNet to Vision Transformers, built on the convolution primitive introduced here. Chapter 03 covers modern detection and segmentation, whose outputs are still cleaned up with the morphology and watershed of §§12-13. Chapter 04 on video builds on the optical flow of §15. Chapter 05 on 3D vision extends the camera geometry of §14 with learned depth and implicit representations. Chapter 06 on vision-language models replaces HOG (§11) and bag-of-features (§16) with CLIP-style contrastive learning, but the need for reliable local features and their calibration persists. Everything that follows rests on what is in this chapter — that is why it is first.

Further reading

Classical computer vision is a mature field with a few canonical textbooks and a long tail of influential papers from the 1980s–2010s. The selections below emphasise the references that still matter in 2025: the two standard graduate textbooks, the foundational papers that every vision engineer encounters (Canny, Harris, Lowe, Dalal–Triggs, Lucas–Kanade, Fischler–Bolles), a small number of more recent papers that updated the classical toolkit, and the software libraries (OpenCV, scikit-image, Kornia) that ship these algorithms as production code. Much of this material predates deep learning, but the primitives — filtering, edge detection, feature matching, geometry — remain load-bearing inside modern pipelines.

Textbooks & tutorials

Textbook
Computer Vision: Algorithms and Applications (2nd ed.)
Richard Szeliski, Springer, 2022 (free PDF online)
The standard modern reference. Covers image formation, features, segmentation, stereo, motion, and recognition, with a 2022 revision that integrates deep-learning methods alongside the classical algorithms. The go-to single-volume textbook for this chapter's material.
Textbook
Multiple View Geometry in Computer Vision (2nd ed.)
Richard Hartley & Andrew Zisserman, Cambridge University Press, 2004
The canonical reference on projective geometry for vision: pinhole models, homographies, epipolar geometry, fundamental and essential matrices, triangulation, bundle adjustment. Still the primary source for multi-view material; dense but definitive.
Textbook
Digital Image Processing (4th ed.)
Rafael Gonzalez & Richard Woods, Pearson, 2018
The standard reference for image-processing fundamentals: sampling, transforms, filtering in the spatial and frequency domains, colour, restoration, morphology, and segmentation. Engineering-oriented, with worked examples in every chapter.
Textbook
Vision: A Computational Investigation into the Human Representation and Processing of Visual Information
David Marr, Freeman, 1982 (MIT Press reprint, 2010)
The foundational theoretical work that framed vision as computation at three levels — computational, algorithmic, implementational — and introduced the primal sketch, 2½-D sketch, and 3-D model. Historically essential; much modern terminology descends from it.
Tutorial
OpenCV tutorials & the scikit-image user guide
OpenCV contributors; scikit-image contributors, ongoing
The two most widely-used open-source vision libraries both ship extensive tutorials covering filtering, edges, features, segmentation, geometry, and stereo. The best place to map algorithms from this chapter onto concrete function calls and working code.
Course
Stanford CS131 & UMich EECS 442 — Computer Vision lecture notes
Fei-Fei Li, Juan Carlos Niebles (Stanford); Justin Johnson (Michigan)
Two freely-available undergraduate / early-graduate course sequences that cover the classical material at textbook depth with modern pedagogy. Both are reasonable self-study paths into the field.

Foundational papers

Tier 1
A Computational Approach to Edge Detection
John Canny, IEEE TPAMI 1986
Derives the edge detector that still bears the author's name from three explicit criteria — good detection, good localisation, single response — and introduces non-maximum suppression plus hysteresis thresholding. The most-cited edge-detection paper.
Tier 1
A Combined Corner and Edge Detector
Chris Harris & Mike Stephens, Alvey Vision Conference 1988
The Harris detector: a rotation-invariant corner score based on the autocorrelation matrix of image gradients. Four pages that launched the feature-detection industry; Shi–Tomasi (1994) is a direct descendant, as is essentially every subsequent detector.
Tier 1
Distinctive Image Features from Scale-Invariant Keypoints
David Lowe, IJCV 2004
The SIFT paper. Consolidates scale-space detection, dominant-orientation assignment, and a 128-dimensional gradient-histogram descriptor into a single matching pipeline. The single most cited paper in classical computer vision; still the reference against which new descriptors are measured.
Tier 1
Histograms of Oriented Gradients for Human Detection
Navneet Dalal & Bill Triggs, CVPR 2005
HOG + linear SVM, demonstrated on the INRIA pedestrian dataset. Cemented the "dense block of oriented-gradient histograms" template that dominated object detection until R-CNN, and remains the textbook example of hand-crafted-feature plus linear-classifier detection.
Tier 1
An Iterative Image Registration Technique with an Application to Stereo Vision
Bruce Lucas & Takeo Kanade, IJCAI 1981
The Lucas–Kanade optical-flow algorithm. Linearises brightness constancy, solves a local least-squares system for per-patch flow, and introduces the iterative warping framework that underlies most feature-based tracking and modern deep-flow architectures.
Tier 1
Determining Optical Flow
Berthold Horn & Brian Schunck, Artificial Intelligence 1981
The complementary dense variational formulation: a global smoothness prior added to the brightness-constancy constraint, yielding a dense flow field rather than per-patch estimates. The founding paper of variational optical flow.
Tier 1
The Laplacian Pyramid as a Compact Image Code
Peter Burt & Edward Adelson, IEEE Trans. Communications 1983
Introduces Gaussian and Laplacian pyramids as a multi-scale image representation. The paper that made scale-space computation practical on 1980s hardware, and the direct ancestor of today's feature pyramids and multi-scale CNN decoders.
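The construction is easy to sketch; a toy numpy version with block-average downsampling standing in for the paper's 5-tap Gaussian kernel — reconstruction is exact by construction whatever the kernel:

```python
import numpy as np

def down(img):
    """Halve resolution by 2x2 block averaging (a box stand-in for the 5-tap kernel)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up(img):
    """Double resolution by nearest-neighbour replication."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(img, levels=3):
    """Each level stores the detail lost by down-then-up; the coarse residue goes last."""
    pyr, cur = [], img.astype(float)
    for _ in range(levels):
        small = down(cur)
        pyr.append(cur - up(small))    # band-pass residual: the 'Laplacian' level
        cur = small
    pyr.append(cur)                    # low-pass residue
    return pyr

def reconstruct(pyr):
    """Invert the pyramid: upsample the residue and add back each detail level."""
    cur = pyr[-1]
    for lap in reversed(pyr[:-1]):
        cur = up(cur) + lap
    return cur
```

The compactness claim in the title comes from the Laplacian levels being near-zero almost everywhere, so they compress well; the multi-scale structure is what modern feature pyramids inherit.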
Tier 1
A Threshold Selection Method from Gray-Level Histograms
Nobuyuki Otsu, IEEE Trans. Systems, Man & Cybernetics 1979
The automatic binarisation algorithm. Chooses the threshold that maximises between-class variance of the two resulting pixel populations; still the default for document binarisation and the textbook worked example for thresholding.
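The algorithm is a single pass over the 256-bin histogram; a numpy sketch that scans every candidate threshold and returns the maximiser of the between-class variance:

```python
import numpy as np

def otsu_threshold(img):
    """Return the grey level t maximising between-class variance; pixels <= t
    form one class, pixels > t the other (Otsu, 1979)."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()                    # probability of each grey level
    omega = np.cumsum(p)                     # class-0 probability up to level t
    mu = np.cumsum(p * np.arange(256))       # class-0 mean mass up to level t
    mu_t = mu[-1]                            # global mean
    denom = omega * (1.0 - omega)
    denom[denom == 0] = np.nan               # thresholds leaving one class empty
    sigma_b2 = (mu_t * omega - mu) ** 2 / denom
    return int(np.nanargmax(sigma_b2))
```

Usage is `binary = img > otsu_threshold(img)`; on a bimodal histogram the maximiser lands in the valley between the two populations.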
Tier 1
Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography
Martin Fischler & Robert Bolles, Comm. ACM 1981
The RANSAC paper. Introduces the hypothesise-and-verify loop that is now the default robust estimator in vision: sample a minimal subset, fit a model, count inliers, repeat. Used everywhere from homography estimation to modern SLAM.
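The loop is worth writing out once; a minimal numpy sketch for robust 2-D line fitting, with the minimal sample of two points and the inlier count as the score (iteration count and tolerance are hypothetical defaults):

```python
import numpy as np

def ransac_line(pts, n_iters=200, tol=0.05, rng=None):
    """Fit y = a*x + b robustly: sample a minimal subset (2 points), fit a
    model, count inliers, keep the best — the hypothesise-and-verify loop."""
    rng = np.random.default_rng(0) if rng is None else rng
    best_model, best_inliers = None, -1
    for _ in range(n_iters):
        i, j = rng.choice(len(pts), size=2, replace=False)
        (x1, y1), (x2, y2) = pts[i], pts[j]
        if x1 == x2:
            continue                          # degenerate sample: vertical line
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        resid = np.abs(pts[:, 1] - (a * pts[:, 0] + b))
        n_in = int((resid < tol).sum())
        if n_in > best_inliers:
            best_model, best_inliers = (a, b), n_in
    return best_model, best_inliers
```

The same skeleton fits homographies, fundamental matrices, or planes — only the minimal sample size and the model-fitting step change.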
Tier 1
A Flexible New Technique for Camera Calibration
Zhengyou Zhang, IEEE TPAMI 2000
"Zhang's method": calibrate intrinsics and distortion from a handful of planar-target images, no expensive calibration rig required. The practical standard for camera calibration for the last 25 years; implemented in every vision library.
Tier 2
Good Features to Track
Jianbo Shi & Carlo Tomasi, CVPR 1994
Reformulates Harris corners around the minimum eigenvalue of the structure tensor, giving a cleaner criterion for which points are worth tracking. The Shi–Tomasi detector is still the default in OpenCV's goodFeaturesToTrack.
Tier 2
SURF: Speeded Up Robust Features
Herbert Bay, Tinne Tuytelaars & Luc Van Gool, ECCV 2006
A fast approximation to SIFT using integral images and box filters. Roughly as accurate on many benchmarks, several times faster; the workhorse for real-time feature matching through the 2000s and early 2010s.
Tier 2
Scale-Space Theory in Computer Vision
Tony Lindeberg, Kluwer 1994 (book) & subsequent IJCV papers
The theoretical foundation for multi-scale representations: why Gaussian smoothing is the unique scale-space kernel satisfying non-enhancement of local extrema, and how to detect scale-invariant features by finding extrema of normalised derivatives across scale.
Tier 2
Mean Shift: A Robust Approach Toward Feature Space Analysis
Dorin Comaniciu & Peter Meer, IEEE TPAMI 2002
Establishes mean-shift as a general non-parametric clustering algorithm and applies it to image segmentation and tracking. The standard reference for mean-shift segmentation and the mean-shift tracker.
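The iteration itself is tiny; a 1-D numpy sketch of the mode-seeking loop with a Gaussian kernel (the paper develops the general kernel theory and the multi-dimensional case):

```python
import numpy as np

def mean_shift_1d(data, start, bandwidth=1.0, n_iters=50):
    """Shift a point to the kernel-weighted mean of the data around it,
    repeating until it sits on a mode of the estimated density."""
    x = float(start)
    for _ in range(n_iters):
        w = np.exp(-0.5 * ((data - x) / bandwidth) ** 2)   # Gaussian kernel weights
        x_new = (w * data).sum() / w.sum()                 # the mean-shift step
        if abs(x_new - x) < 1e-6:
            break
        x = x_new
    return x
```

Starting points in different basins converge to different modes; clustering points by which mode they reach is exactly mean-shift segmentation in feature space.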
Tier 2
Efficient Graph-Based Image Segmentation
Pedro Felzenszwalb & Daniel Huttenlocher, IJCV 2004
A fast graph-based segmentation algorithm that merges regions when the between-region edge weight is smaller than a data-adaptive internal-variance threshold. Very widely used as an over-segmentation front-end for other vision systems.
Tier 2
"GrabCut" — Interactive Foreground Extraction using Iterated Graph Cuts
Carsten Rother, Vladimir Kolmogorov & Andrew Blake, SIGGRAPH 2004
The standard interactive-segmentation algorithm: a user draws a loose bounding box, the system alternates between fitting Gaussian mixture models to foreground/background and running a graph-cut energy minimisation. Still the reference for box-supervised segmentation.
Tier 2
Object Detection with Discriminatively Trained Part-Based Models
Pedro Felzenszwalb, Ross Girshick, David McAllester & Deva Ramanan, IEEE TPAMI 2010
The deformable-part model (DPM): multi-component HOG templates with deformable parts, trained with a latent-SVM. Dominated the PASCAL VOC benchmark for several years and was the detection state-of-the-art immediately before R-CNN.
Tier 2
Video Google: A Text Retrieval Approach to Object Matching in Videos
Josef Sivic & Andrew Zisserman, ICCV 2003
Introduces the bag-of-visual-words paradigm: quantise SIFT descriptors into a vocabulary, treat an image as a document, apply TF–IDF and inverted-index retrieval. The founding paper of image retrieval by local features.
Tier 2
Robust Real-Time Face Detection
Paul Viola & Michael Jones, IJCV 2004
Haar-like features, integral images, AdaBoost cascade. The first face detector fast enough for real-time use on consumer hardware; for a decade, the algorithm behind the rectangles that appeared in every digital camera's viewfinder.
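The integral image is the reason the cascade runs in real time; a numpy sketch of constant-time box sums — any Haar-like feature is just a signed combination of such boxes:

```python
import numpy as np

def integral_image(img):
    """S[i, j] = sum of img[:i, :j]; the extra zero row/column keeps the
    box-sum formula free of boundary special cases."""
    S = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    S[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return S

def box_sum(S, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] from four table lookups, independent of box size."""
    return S[r1, c1] - S[r0, c1] - S[r1, c0] + S[r0, c0]
```

A two-rectangle Haar feature is then `box_sum(S, ...) - box_sum(S, ...)` over adjacent boxes — two features per lookup quartet, evaluated thousands of times per window at effectively no cost.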
Tier 2
Bilateral Filtering for Gray and Color Images
Carlo Tomasi & Roberto Manduchi, ICCV 1998
Introduces the bilateral filter: a weighted average whose weights combine spatial proximity with intensity similarity, giving edge-preserving smoothing. A staple of denoising, tone mapping, and portrait-mode blurring.
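A brute-force numpy sketch makes the two weights explicit; production implementations vectorise or approximate this heavily, and the parameter values here are hypothetical:

```python
import numpy as np

def bilateral_filter(img, sigma_s=2.0, sigma_r=0.2, radius=4):
    """Each output pixel is a weighted mean of its neighbours; the weight is
    (close in space) x (similar in intensity), so averaging stops at edges."""
    H, W = img.shape
    out = np.empty((H, W))
    dy, dx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(dx ** 2 + dy ** 2) / (2 * sigma_s ** 2))
    pad = np.pad(img.astype(float), radius, mode='edge')
    for i in range(H):
        for j in range(W):
            win = pad[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            similar = np.exp(-(win - img[i, j]) ** 2 / (2 * sigma_r ** 2))
            w = spatial * similar
            out[i, j] = (w * win).sum() / w.sum()
    return out
```

On a noisy step edge the filter averages away the noise within each flat region while leaving the step almost untouched, because pixels across the edge differ by far more than sigma_r and receive near-zero weight.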
Tier 2
Image Retrieval: Ideas, Influences, and Trends of the New Age
Ritendra Datta, Dhiraj Joshi, Jia Li & James Wang, ACM Computing Surveys 2008
A comprehensive survey of content-based image retrieval circa 2008: from colour-histogram matching through bag-of-features to the early indexing schemes. Useful as a historical map of pre-deep-learning retrieval.

Modern extensions

Tier 1
ORB: An Efficient Alternative to SIFT or SURF
Ethan Rublee, Vincent Rabaud, Kurt Konolige & Gary Bradski, ICCV 2011
Combines FAST keypoints with a rotation-aware BRIEF descriptor, producing a binary feature that is roughly two orders of magnitude faster than SIFT with competitive matching accuracy. The default feature in many real-time SLAM systems (ORB-SLAM, ORB-SLAM2/3).
Tier 2
BRIEF: Binary Robust Independent Elementary Features
Michael Calonder, Vincent Lepetit, Christoph Strecha & Pascal Fua, ECCV 2010
The binary descriptor that ORB and BRISK build on: a 256-bit string produced by comparing pairs of smoothed pixel intensities, matched by Hamming distance. Extremely fast and memory-light.
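The whole descriptor is a few comparisons; a numpy sketch with random (hypothetical) test-pair geometry and none of the paper's pre-smoothing:

```python
import numpy as np

def brief_descriptor(patch, pairs):
    """256 boolean bits: bit k answers 'is the patch darker at p_k than at q_k?'."""
    p, q = pairs[:, 0], pairs[:, 1]
    return patch[p[:, 0], p[:, 1]] < patch[q[:, 0], q[:, 1]]

def hamming(d1, d2):
    """Matching distance: the number of disagreeing bits."""
    return int((d1 != d2).sum())

rng = np.random.default_rng(0)
pairs = rng.integers(0, 32, size=(256, 2, 2))   # 256 random point pairs in a 32x32 patch
```

Because every bit is a pairwise comparison, an additive brightness change leaves the descriptor unchanged, and two unrelated patches disagree on roughly half their bits.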
Tier 2
BRISK: Binary Robust Invariant Scalable Keypoints
Stefan Leutenegger, Margarita Chli & Roland Siegwart, ICCV 2011
A scale- and rotation-invariant binary descriptor using a concentric sampling pattern. An improvement over ORB for outdoor scenes with large scale variation and a common alternative in mobile SLAM stacks.
Tier 2
SLIC Superpixels Compared to State-of-the-Art Superpixel Methods
Radhakrishna Achanta et al., IEEE TPAMI 2012
SLIC: a localised k-means in a 5-D colour + position space that produces compact, near-uniform superpixels in near-linear time. The default modern superpixel method and a common pre-processing step for segmentation pipelines.
Tier 2
Stereo Processing by Semiglobal Matching and Mutual Information
Heiko Hirschmüller, IEEE TPAMI 2008
Semi-global matching (SGM): approximates a full 2-D MRF stereo optimisation by aggregating 1-D dynamic-programming paths along multiple directions. A sweet spot of accuracy and speed; still the default stereo algorithm in many real-time systems.
Tier 2
Structure-from-Motion Revisited & Pixelwise View Selection for Unstructured Multi-View Stereo (COLMAP papers)
Johannes Schönberger, Jan-Michael Frahm et al., CVPR/ECCV 2016
The two papers describing COLMAP, the most widely used open-source SfM + MVS pipeline. Incremental SfM with careful geometric verification, followed by dense reconstruction with depth-map fusion; the default ground-truth generator for modern NeRF and 3D Gaussian Splatting work.
Tier 2
Bundle Adjustment — A Modern Synthesis
Bill Triggs, Philip McLauchlan, Richard Hartley & Andrew Fitzgibbon, Vision Algorithms Workshop 1999
The canonical reference for bundle adjustment: a unified treatment of the nonlinear least-squares problem that jointly refines camera parameters and 3-D structure. Every multi-view reconstruction system ends with a bundle-adjustment step written with this paper open on the desk.
Tier 2
Aggregating Local Descriptors into a Compact Image Representation (VLAD)
Hervé Jégou, Matthijs Douze, Cordelia Schmid & Patrick Pérez, CVPR 2010
VLAD: represent an image by the sum of residuals of its local descriptors from cluster centres, then power- and L2-normalise. A middle ground between bag-of-features and Fisher Vectors; widely used before deep global descriptors.
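The encoding is a short function; a numpy sketch with hypothetical toy centres (real systems learn the centres with k-means over a training corpus):

```python
import numpy as np

def vlad(descriptors, centres):
    """Assign each local descriptor to its nearest centre, sum the residuals
    per centre, then power- and L2-normalise the concatenation."""
    d2 = ((descriptors[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    assign = d2.argmin(axis=1)                   # nearest visual word per descriptor
    K, D = centres.shape
    v = np.zeros((K, D))
    for k in range(K):
        sel = descriptors[assign == k]
        if len(sel):
            v[k] = (sel - centres[k]).sum(axis=0)
    v = v.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))          # power normalisation
    n = np.linalg.norm(v)
    return v / n if n > 0 else v
```

The result is a fixed K·D-dimensional vector per image regardless of how many local descriptors it contains, which is what makes it directly indexable for retrieval.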
Tier 2
Improving the Fisher Kernel for Large-Scale Image Classification
Florent Perronnin, Jorge Sánchez & Thomas Mensink, ECCV 2010
Fisher Vectors — gradients of a GMM log-likelihood with respect to its parameters — as an image representation. Held the PASCAL VOC classification state-of-the-art immediately before CNNs and is still competitive for tiny-data regimes.
Tier 2
LSD: A Fast Line Segment Detector with a False Detection Control
Rafael Grompone von Gioi, Jérémie Jakubowicz, Jean-Michel Morel & Gregory Randall, IEEE TPAMI 2010
A parameter-free line-segment detector with theoretical guarantees on false alarms. The modern replacement for the Hough transform in line detection; an implementation ships in OpenCV's contrib modules.
Tier 2
KAZE & A-KAZE Features
Pablo Alcantarilla, Adrien Bartoli & Andrew Davison, ECCV 2012; BMVC 2013
Feature detectors in non-linear diffusion scale spaces, which respect object boundaries better than Gaussian scale space. The accelerated variant, A-KAZE, uses binary descriptors and fast explicit diffusion to rival ORB's speed with SIFT-like accuracy.
Tier 3
Learned Invariant Feature Transform (LIFT) & SuperPoint
K. M. Yi et al., ECCV 2016; DeTone, Malisiewicz & Rabinovich, CVPRW 2018
The transition from hand-crafted to learned local features. LIFT replicates the SIFT pipeline with neural networks; SuperPoint trains a single-shot keypoint + descriptor head with self-supervision. Now standard inputs to modern SLAM and image-matching stacks.
Tier 3
SuperGlue: Learning Feature Matching with Graph Neural Networks
Paul-Edouard Sarlin et al., CVPR 2020
Replaces nearest-neighbour matching with a graph neural network that jointly reasons over two sets of keypoints, learning which matches to accept. The current default matcher for SuperPoint features in SLAM and image-retrieval pipelines.

Software & tools

Library
OpenCV
OpenCV contributors, since 2000
The dominant open-source computer-vision library; ships optimised C++ implementations with Python, Java, and JS bindings. Every algorithm in this chapter — filtering, edges, features, segmentation, calibration, stereo, optical flow — has a battle-tested OpenCV implementation.
Library
scikit-image
scikit-image contributors, ongoing
A pure-Python image-processing library in the SciPy ecosystem. Slower than OpenCV but with cleaner APIs; excellent for teaching and prototyping, with strong coverage of morphology, segmentation, and classical filtering.
Library
Kornia
Edgar Riba et al., ongoing
A differentiable computer-vision library built on PyTorch: filters, edges, features, geometry, and augmentations implemented as autograd-aware operators. The bridge between classical vision primitives and modern deep-learning pipelines.
Library
COLMAP & OpenMVG
Johannes Schönberger et al.; Pierre Moulon et al., ongoing
The two standard open-source structure-from-motion pipelines. COLMAP is the de facto default for academic 3-D reconstruction; OpenMVG offers a more modular library-style API. Both implement the full SIFT → matching → RANSAC → incremental SfM → bundle-adjustment stack.
Library
VLFeat
Andrea Vedaldi & Brian Fulkerson, 2008–
A C / MATLAB library of classical vision primitives — SIFT, HOG, MSER, k-means trees, VLAD, Fisher Vectors — with particularly clean implementations. Less actively maintained today but still the canonical reference for many of these algorithms.
Library
FFmpeg & libvips
FFmpeg developers; John Cupitt et al., ongoing
The two workhorse libraries for reading and pre-processing pixels at scale: FFmpeg for video decoding and conversion, libvips for memory-efficient streaming image processing. Most production vision pipelines start with one or both of these.
Tool
ImageJ / Fiji
Wayne Rasband et al.; Fiji contributors, ongoing
The standard image-analysis application for scientific imaging (microscopy, astronomy, medical). Interactive, scriptable via macros, and with a plugin ecosystem covering classical segmentation, tracking, and measurement.
Tool
GIMP & darktable
GIMP developers; darktable developers, ongoing
Two open-source photo editors that expose classical image-processing operations — colour-space conversions, tone curves, filters, morphological operations — in interactive form. Useful for building intuition about what the algorithms in this chapter actually do to pixels.