Image classification is the problem that made modern deep learning. Everything in this Part — detection, segmentation, video, 3-D, vision-language — inherits the backbones, training tricks, and scaling intuitions that were worked out on the single task of mapping a 224×224 crop to one of 1000 ImageNet labels. The story moves through four stylistic eras. AlexNet (2012) and VGG (2014) demonstrated that deep convolutional networks trained on a large labelled corpus could outperform a decade of hand-engineered features; Inception introduced 1×1 bottlenecks and parallel branches; ResNet (2015) solved deep-network trainability with skip connections and took networks from twenty-something to a hundred-plus layers; BatchNorm made all of this optimise stably. Efficient architectures — MobileNet, ShuffleNet, EfficientNet — pushed ImageNet-level accuracy onto phones by replacing dense convolutions with depthwise-separable ones and by discovering good scaling rules via neural architecture search. Vision Transformers (2020) threw convolution out altogether, treating an image as a sequence of patches and transplanting the NLP transformer stack; when paired with very large pretraining corpora they matched and then exceeded the best CNNs. The modern mainstream is a hybrid: hierarchical transformers like Swin, modernised convolutions like ConvNeXt, and a layer of self-supervised pretraining (SimCLR, MoCo, DINO, MAE) and web-scale multimodal pretraining (CLIP, SigLIP) on top of which the ImageNet numbers finally stopped moving. This chapter walks through that lineage and ends with the training recipes, robustness story, and distillation/compression toolkit that turn a research backbone into production code.
Sections one and two frame the problem. Section one is why modern classification matters — the ImageNet moment, the collapse of hand-crafted features, and the argument that the classification backbone is the compiler of computer vision, providing the features that every downstream task imports. Section two is the benchmark landscape — ImageNet-1k, ImageNet-21k, JFT-300M, LAION-400M/5B, and the finer-grained, robustness, and out-of-distribution test sets that replaced top-1 accuracy once the leaderboard saturated.
Sections three through seven walk through the classical CNN lineage. Section three is the early deep nets — AlexNet, ZFNet, VGG, and the training tricks (ReLU, dropout, data augmentation, GPU training) that made them work. Section four is the Inception family — 1×1 convolutions, the Inception module, auxiliary classifiers, and the gradual refinement from GoogLeNet through Inception-v4 and Xception. Section five is the residual networks that define the modern backbone: plain ResNet, pre-activation ResNet, ResNeXt's grouped convolutions, and the squeeze-and-excitation attention module. Section six is normalisation — why batch normalisation works, what batch-size dependence it introduces, and the family of LayerNorm / GroupNorm / InstanceNorm alternatives. Section seven surveys activation functions from ReLU through GELU.
Sections eight and nine cover efficient architectures. Section eight is mobile and efficient CNNs — depthwise-separable convolutions, MobileNet-v1/v2/v3, ShuffleNet, SqueezeNet, and the techniques that push classification into the milliwatt regime. Section nine is neural architecture search and EfficientNet — NASNet, DARTS, ProxylessNAS, and compound scaling as a recipe rather than a search.
Sections ten and eleven are the transformer turn. Section ten is the Vision Transformer (ViT) — patch tokenisation, positional encoding, class token, and the insight that with enough data you do not need convolutional inductive biases. Section eleven is the hybrid and hierarchical descendants — Swin, ConvNeXt, MaxViT, MobileViT — which reintroduced locality and multi-scale processing on top of the transformer chassis, producing the current state-of-the-art backbones.
Sections twelve and thirteen are about training. Section twelve is the modern training recipe — optimiser, learning-rate schedule, weight decay, Mixup/CutMix/RandAugment/label smoothing, and the recipe-sized performance gap that separates a 2015 and a 2022 ResNet-50. Section thirteen is self-supervised pretraining for vision — contrastive methods (SimCLR, MoCo, BYOL), clustering (SwAV), self-distillation (DINO), and masked image modelling (BEiT, MAE) — the paradigm that made web-scale vision pretraining practical.
Section fourteen examines scaling laws in vision: how accuracy varies with model, data, and compute; why ViTs scale while CNNs saturate; and where vision-language pretraining fits into this picture. Section fifteen is the deployment toolkit: distillation, pruning, quantisation, and the compression pipeline that sits between a research checkpoint and a production model. Section sixteen is robustness and reliable evaluation — ImageNet-C / -R / -A / -Sketch, adversarial benchmarks, calibration, and the question of whether benchmark accuracy still predicts real-world behaviour. The closing section is the operational picture: how a vision backbone fits into the machine-learning lifecycle today, the timm and torchvision ecosystems, and the relationship between a classification head and the rest of Part VII.
Classification is the task that made deep learning for vision: recognising which of a thousand ImageNet categories a single 224×224 crop belongs to. The reason this synthetic-sounding problem matters is that the backbone trained to solve it provides the feature extractor for essentially every other vision task in this Part. Detection heads, segmentation decoders, video temporal models, and vision-language encoders almost all start with a classification backbone and adapt its penultimate features.
The field was transformed in a single year. In 2011, the best ILSVRC top-5 error was 25.8%, achieved by a carefully engineered pipeline of SIFT features, Fisher Vectors, and linear SVMs — essentially the state of the art from the previous chapter, extrapolated. In 2012, AlexNet won with 15.3% top-5 error, using a single end-to-end trained convolutional network, ReLU activations, dropout, and two GPUs. The gap was not incremental. Within three years, every serious entry in the challenge was a deep CNN; within five, hand-crafted visual features had effectively disappeared from the research literature.
The pattern has a name now: the ImageNet moment. What it demonstrated is that when you have a large labelled corpus, a loss that scores end-to-end task performance, and enough compute, a sufficiently deep network will learn features that dominate anything hand-designed. What made the pattern generalise beyond classification is transfer: the convolutional features learned on ImageNet turn out to be remarkably useful for downstream tasks, from medical imaging to satellite remote sensing to object detection. A research paper on pedestrian detection that in 2010 would have described a bespoke HOG-and-DPM pipeline now begins "we take a ResNet-50 pretrained on ImageNet, fine-tune it on…" — the classification backbone has become the compiler of computer vision.
This chapter therefore works two layers simultaneously. On the surface, it is a history: from AlexNet through ResNet, EfficientNet, ViT, Swin, and ConvNeXt, with stops for BatchNorm, SE blocks, NAS, and the modern training recipe. Underneath, it is a sustained argument about representation learning — how deep networks acquire useful feature hierarchies, why certain architectural choices (skip connections, normalisation, multi-head attention) keep outperforming their alternatives, and how the answer to "what is a good image encoder" has shifted over fifteen years from carefully-engineered CNN stacks to large transformers trained with self-supervision or multimodal contrastive losses.
A final framing question is where classification actually sits inside a modern vision pipeline. For most practitioners, it is not an endpoint. The question is rarely "which ImageNet class is this"; it is "what vector should we extract from this image to feed into a detector, a retrieval index, or a language model". Classification is the training signal that produced a good vector, and the classifier's final-layer weights are thrown away at deployment time. Keeping this in mind while reading the rest of the chapter helps explain why a one-percent ImageNet improvement is sometimes worth a major engineering effort and sometimes completely invisible in the thing that actually ships.
Deep vision was built on top of ImageNet. Understanding the benchmarks — how they are structured, what they measure, and how they were gradually supplemented as the original numbers saturated — is a prerequisite for reading any paper in this chapter. The field's story of progress is also a story about benchmarks being designed, broken, and replaced.
The canonical benchmark is ImageNet-1k, properly the ILSVRC-2012 classification subset: 1,281,167 training images and 50,000 validation images, each labelled with exactly one of 1000 synset categories drawn from WordNet. Accuracy is reported as top-1 (the network's single top-ranked label matches) or top-5 (the correct label is in the top five). Top-5 was introduced because many of the categories are near-duplicates (dozens of dog breeds, multiple snake species) and single-label accuracy systematically underestimates a classifier that identifies the correct rough concept.
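The top-1/top-5 distinction is easy to make concrete. A minimal NumPy sketch (the function name and toy logits are illustrative, not from any library):

```python
import numpy as np

def topk_accuracy(logits, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    topk = np.argsort(-logits, axis=1)[:, :k]   # indices of the k largest logits per row
    hits = (topk == labels[:, None]).any(axis=1)
    return hits.mean()

# three samples, three classes
logits = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.3, 0.5]])
labels = np.array([1, 1, 0])
print(topk_accuracy(logits, labels, k=1))  # 1 of 3 correct at top-1
print(topk_accuracy(logits, labels, k=2))  # 2 of 3 within the top 2
```

On real ImageNet validation logits the same two calls (with k=1 and k=5) produce the numbers reported in papers.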
Two expanded sets sit above it. ImageNet-21k (also called ImageNet-22k) uses approximately 14 million images across roughly 21,000 WordNet synsets — a noisier, much larger corpus used almost exclusively as a pretraining set that is later fine-tuned on ImageNet-1k. JFT-300M and its later JFT-3B cousin are Google-internal datasets, roughly 300 million and 3 billion noisy web-labelled images respectively; they are the pretraining corpora behind many of the strongest published ViT numbers, though the data are not public. For non-Google researchers, LAION-400M and LAION-5B serve as public web-scale substitutes, primarily used for vision-language pretraining (CLIP-style).
The robustness test sets probe different failure modes. ImageNet-V2 is a fresh test set collected by replicating the original ImageNet protocol; models score 5–10 points lower on it, quantifying the selection bias of the original validation set. ImageNet-C applies 15 categories of corruption (Gaussian noise, blur, weather, compression) at five severities, measuring robustness under common distribution shifts. ImageNet-R contains artistic renditions (paintings, sketches, toys) of 200 ImageNet classes, probing whether the network has learned the category concept or the photographic style. ImageNet-A contains natural adversarial examples — real photos that standard classifiers get wrong — and ImageNet-Sketch is a pencil-sketch-only variant. Together these are the default robustness suite in any serious 2020s paper.
Several finer-grained classification benchmarks live alongside ImageNet and stress different aspects of the problem. CIFAR-10/100 are small (32×32 images) but remain classical quick-turnaround benchmarks. Places365 classifies scenes rather than objects. iNaturalist is species identification with a long-tailed, heavily imbalanced class distribution. Stanford Cars, FGVC-Aircraft, and CUB-200 test fine-grained classification where different classes can be nearly identical. These smaller benchmarks are particularly important for evaluating transfer learning and for spotting architectures that only shine when saturated with data.
Finally, a growing cluster of transfer benchmarks evaluates a frozen backbone's linear-probe and few-shot accuracy across many datasets. The VTAB (Visual Task Adaptation Benchmark) suite is the most-cited; ELEVATER and the linear-probe protocol from the CLIP paper are close relatives. These reflect the modern reality: the purpose of an image classifier is to be the feature extractor for everything else, and how well it transfers is a better measure than its ImageNet-1k top-1 number.
The three architectures that opened the deep-vision era — AlexNet, ZFNet, and VGG — are, by 2026 standards, small and unsophisticated. Their importance is that together they established the template that every subsequent CNN refined: a tall stack of convolutional layers with ReLU activations, interspersed pooling, data augmentation, dropout on the fully-connected head, and training by SGD with momentum on large labelled data.
AlexNet (Krizhevsky, Sutskever & Hinton, 2012) was eight weight layers deep: five convolutional blocks followed by three fully-connected layers. Its innovations were a combination of things that individually existed but had not been packaged together on ImageNet scale. ReLU activations replaced the saturating tanh or sigmoid, giving a severalfold speedup in training convergence. Local response normalisation normalised activation magnitudes across feature maps (later dropped in favour of BatchNorm). Dropout at rate 0.5 on the fully-connected layers provided strong regularisation. Data augmentation (random crops, horizontal flips, PCA-based colour jittering) effectively enlarged the training set. And all of it was done on two GPUs with model parallelism — a practical necessity in 2012 that shaped the architecture itself.
ZFNet (Zeiler & Fergus, 2013) is essentially a tuned AlexNet. What made the paper influential was its companion visualisation method: deconvolutional networks that project hidden activations back into pixel space, producing the now-familiar pictures of what a given filter responds to (edges in layer 1, textures in layer 2, object parts in layers 3–4, whole objects in the final conv layer). That visualisation idea — more than the accuracy bump — established the intuition of a feature hierarchy that every subsequent architecture paper assumes as background.
VGG (Simonyan & Zisserman, 2014) reduced the design space to a single principle: use only 3×3 convolutions, stacked deep — two 3×3s cover the receptive field of a 5×5 with fewer parameters and an extra non-linearity — producing the 16- and 19-layer VGG-16 and VGG-19. It has two more lasting legacies beyond the 3×3 principle. First, it demonstrated that depth rather than width is the more productive axis of scaling — VGG-16's 13 conv layers beat fewer, wider ones — though the trend would eventually reverse at the extreme depths solved by ResNet. Second, the pretrained VGG features became the de facto perceptual loss backbone for generative models and style transfer; a decade later, papers on diffusion-model evaluation still compute VGG-based LPIPS scores. The network is too slow and too parameter-heavy for production inference, but it remains a useful reference for how a plain deep CNN behaves before any of the tricks that followed.
A technical note on training these early networks: none of them used batch normalisation (introduced in 2015), so training required careful weight initialisation (He or Xavier), a learning rate warmup period, and frequent monitoring for exploding or vanishing activations. Getting a 16-layer VGG to converge from scratch in 2014 was serious engineering; today it would run as a homework exercise in an hour. This compute/engineering gap is a recurring theme — every generation of architecture looks easy from the next generation's training infrastructure.
Where VGG scaled by stacking identical 3×3 blocks, Inception (Szegedy et al., 2014 — also known as GoogLeNet) scaled by going wider inside each block: multiple parallel branches with different receptive fields, concatenated at the output. The motivation was to let the network learn which scale of feature to emphasise rather than committing to a single kernel size at each depth.
The core Inception module places four branches in parallel — a 1×1 conv, a 3×3 conv, a 5×5 conv, and a 3×3 max-pool — and concatenates their outputs along the channel dimension. This has the obvious problem that channels blow up quadratically with depth. The key trick, borrowed from Lin, Chen & Yan's Network in Network, is to prefix each of the larger branches with a 1×1 convolution that first reduces channel count, then lets the more expensive operation run on a lean representation. The 1×1 conv is the single most important primitive that Inception contributed to the field: it is a linear mixing of channels that effectively learns a feature-combining layer, and it appears everywhere in later architectures (as the "bottleneck" in ResNet, as the pointwise conv in MobileNet, as the projection layer in SE blocks).
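The saving from the 1×1 reduction is worth working through once. A back-of-the-envelope comparison, with illustrative channel counts chosen for the example:

```python
def conv_params(k, c_in, c_out):
    """Parameter count of a k x k convolution, ignoring bias."""
    return k * k * c_in * c_out

# a 5x5 branch mapping 256 -> 256 channels, directly vs via a 1x1 bottleneck to 64
naive = conv_params(5, 256, 256)
bottleneck = conv_params(1, 256, 64) + conv_params(5, 64, 256)
print(naive, bottleneck, round(naive / bottleneck, 1))  # ~4x fewer parameters
```

The same arithmetic, applied branch by branch, is why GoogLeNet fits 22 layers into a fraction of AlexNet's parameter budget.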
GoogLeNet was 22 layers deep but had only ~7 million parameters — roughly one-twelfth of AlexNet — because the 1×1 bottlenecks aggressively shrank channel counts inside each module. It also used two auxiliary classifiers: shallower branches attached to intermediate layers that were supervised with the same loss. These were intended to help gradient flow into the early layers; the same problem would be solved more directly by BatchNorm and residual connections, and later Inception versions dropped them.
Inception-v2 (2015) added batch normalisation and replaced 5×5 convs with two stacked 3×3s (the same VGG trick). Inception-v3 (2015) further factored 3×3 convs into asymmetric 1×3 then 3×1 pairs, saving more parameters, and introduced label smoothing — a regularisation that mixes the one-hot target with a uniform distribution, which has since become a default ingredient in every modern training recipe. Inception-v4 and Inception-ResNet (2016) grafted residual connections onto Inception blocks, essentially admitting that ResNet's idea was a strictly better chassis.
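Label smoothing is simple enough to state as code. A minimal NumPy sketch of the target construction (the function name is illustrative; frameworks usually fold this into the cross-entropy loss):

```python
import numpy as np

def smooth_labels(targets, num_classes, eps=0.1):
    """Mix one-hot targets with a uniform distribution (the Inception-v3 recipe)."""
    one_hot = np.eye(num_classes)[targets]
    return one_hot * (1.0 - eps) + eps / num_classes

print(smooth_labels(np.array([2]), num_classes=4, eps=0.1))
# correct class gets 1 - eps + eps/K = 0.925, every other class eps/K = 0.025
```

Training against these soft targets penalises over-confident predictions and measurably improves calibration.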
Xception (Chollet, 2017) is the natural limit of the Inception idea: if 1×1 convs decouple cross-channel mixing from spatial mixing, take that principle all the way and replace each regular conv with a depthwise 3×3 followed by a 1×1 pointwise conv. This is the depthwise-separable convolution that MobileNet and the entire efficient-architecture family would build around. Xception's accuracy was competitive with Inception-v3 at similar parameter count, but the larger contribution was framing the depthwise/pointwise decomposition as a first-class design primitive.
The Inception family is no longer the state-of-the-art backbone, but its 1×1 bottlenecks, factored convolutions, label smoothing, and multi-scale feature aggregation are ubiquitous inside every modern architecture. Reading an Inception paper is the cleanest way to understand these primitives in isolation before encountering them bolted into later designs.
The ResNet paper (He et al., 2015) solved a problem that seemed on the face of it strange: past about 20 layers, making a CNN deeper made it worse, not better. The issue was not overfitting — training error also increased — but an optimisation failure. The residual-connection idea that fixed it is a one-line architectural change that produced what remains, a decade later, the single most influential vision architecture.
A residual block computes y = F(x) + x instead of y = F(x), where F is typically two or three convolutions. The intuition is that the block only needs to learn the residual function from input to output, not the full mapping. If the optimal solution at some depth is approximately the identity, the block can drive F toward zero — a much easier target than learning to reproduce x through several non-linearities. The same argument applies at every depth; gradient flow during backpropagation is also improved because the identity path carries gradient unchanged. With residual connections, networks of 50, 101, 152 layers trained stably; a 1000-layer version (He et al., 2016b) even trained, though without accuracy gains.
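The identity argument can be checked in a few lines. A minimal NumPy sketch with linear layers standing in for the convolutions of a real block:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """y = F(x) + x, with F = two linear layers and a ReLU (convs in a real net)."""
    return relu(x @ W1) @ W2 + x

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))

# if F's weights are driven to zero, the block is exactly the identity --
# the "easy target" the residual formulation makes available at every depth
W_zero = np.zeros((8, 8))
assert np.allclose(residual_block(x, W_zero, W_zero), x)
```

A plain block y = F(x) would instead have to learn to reproduce x through its non-linearities to achieve the same thing.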
The standard ResNet family has five depth variants — ResNet-18, ResNet-34, ResNet-50, ResNet-101, ResNet-152 — with ResNet-50 the de facto reference backbone that every subsequent architecture measures itself against. The 50/101/152 variants use a three-layer bottleneck block: a 1×1 conv that reduces channels, a 3×3 conv at the reduced width, and a 1×1 conv that restores channels — the Inception idea applied inside a residual block. Downsampling happens at a few specific stages by striding the 3×3 conv and using a 1×1 projection on the skip path.
Pre-activation ResNet (He et al., 2016b) swapped the order inside each block: instead of conv → BN → ReLU + identity, it does BN → ReLU → conv + identity. The small change made training even more stable at extreme depths and marginally improved accuracy. The canonical torchvision ResNet keeps the original post-activation ordering, but pre-activation variants are available in timm and are the usual choice when pushing depth.
Two notable ResNet descendants extended the template. ResNeXt (Xie et al., 2016) replaced each bottleneck's 3×3 conv with a grouped convolution: split the channels into G groups, run independent 3×3 convs on each, concatenate. With the right choice of G (typically 32) this matches full-convolution accuracy at significantly lower compute — the "cardinality" lever of ResNeXt is the clearest empirical demonstration that channel count is not the only width knob.
Squeeze-and-Excitation networks (SE-ResNet, Hu et al., 2017) add a lightweight channel-attention module at the end of each block: global-average-pool each channel to a scalar, run two FC layers to produce a per-channel gate, then multiply. This adds a few percent more parameters at negligible extra FLOPs and gives consistent 1–2-point top-1 improvements across architectures. SE blocks (or the very similar ECA and CBAM variants) are now routinely added to any production CNN; EfficientNet, for instance, uses them inside every MBConv.
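The squeeze-excite-reweight sequence fits in a few lines. A minimal NumPy sketch on an (N, C, H, W) feature map, with FC layers as plain matrices (function names are illustrative):

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def relu(x):    return np.maximum(0.0, x)

def se_block(x, W1, W2):
    """Squeeze-and-excitation: pool -> two FC layers -> per-channel gate."""
    s = x.mean(axis=(2, 3))                # squeeze: global average pool -> (N, C)
    g = sigmoid(relu(s @ W1) @ W2)         # excitation: gates in (0, 1), shape (N, C)
    return x * g[:, :, None, None]         # reweight each channel of the feature map

rng = np.random.default_rng(0)
C, r = 64, 16                              # r is the channel-reduction ratio (typically 16)
x = rng.standard_normal((2, C, 8, 8))
y = se_block(x, rng.standard_normal((C, C // r)), rng.standard_normal((C // r, C)))
print(y.shape)  # (2, 64, 8, 8) -- same shape, channels rescaled
```

Because the gates lie in (0, 1), the block can only attenuate channels, never amplify them; the network learns which channels to keep for each input.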
The broader point is that ResNet's residual stream — a persistent identity-plus-delta spine — is the architectural pattern that lets networks go deep. ViTs retain it exactly: each block computes x ← x + Attention(LN(x)) and then x ← x + FFN(LN(x)). ConvNeXt retains it. Segmentation decoders retain it. Residual connections are now a default ingredient everywhere.
The other architectural change that enabled deep vision training is normalisation: rescaling intermediate activations so their distribution stays well-conditioned regardless of depth. Between residual connections and normalisation, training a 50-layer CNN went from a research feat to a default workflow in about twelve months.
Batch normalisation (Ioffe & Szegedy, 2015) normalises each channel independently across the batch-and-spatial dimensions during training, then applies a learned scale and shift. The original paper framed this as reducing "internal covariate shift"; later work has argued the true reason it helps is a smoother loss landscape, though the practical consequence — much larger usable learning rates, no careful weight initialisation required, faster convergence — was immediately obvious. BN became a near-universal ingredient in CNNs within months.
BN has a subtle but important failure mode: its statistics depend on the batch. At training time it uses the current batch's mean and variance; at inference, it uses an exponential moving average estimated during training. When the inference-time distribution differs from training, or when batch size shrinks (small-GPU training, detection with high-resolution inputs), the two statistics drift apart and accuracy suffers. This prompted a family of batch-free normalisers that compute statistics over different axes.
LayerNorm normalises each sample over all of its channels and spatial dimensions. It is the default in transformers because it is completely independent of batch size and gives stable training at batch-size 1. GroupNorm (Wu & He, 2018) is a compromise: group the channels into G groups (typically 32), normalise within each group. It gives BN-like accuracy but without the batch dependence; it is the default normalisation in detection and segmentation systems where small-batch training is common. InstanceNorm normalises per-sample per-channel and is the traditional choice for style-transfer networks.
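The whole BN/LN/GN/IN family differs only in which axes the statistics are computed over. A sketch on an NCHW tensor (axis choices are the standard ones; the helper name is illustrative):

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Zero-mean, unit-variance over the given axes (learned scale/shift omitted)."""
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(0).standard_normal((8, 32, 14, 14))  # N, C, H, W

bn   = normalize(x, axes=(0, 2, 3))   # BatchNorm: per channel, across batch + space
ln   = normalize(x, axes=(1, 2, 3))   # LayerNorm: per sample, across everything else
inst = normalize(x, axes=(2, 3))      # InstanceNorm: per sample, per channel
# GroupNorm: reshape channels into G groups, normalise within each group
G = 4
gn = normalize(x.reshape(8, G, 32 // G, 14, 14), axes=(2, 3, 4)).reshape(x.shape)
```

Only the BatchNorm variant touches axis 0, which is exactly why it alone inherits a batch-size dependence.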
Several modern alternatives deserve mention. RMSNorm (Zhang & Sennrich, 2019) drops the centring (mean-subtraction) step of LayerNorm, keeping only the RMS-based scaling; it is slightly faster and has become the default in recent large language and vision models. Weight Standardisation normalises the weights rather than the activations, and Filter Response Normalisation (Singh & Krishnan, 2020) normalises activations per-channel without batch statistics; both avoid the batch dependence entirely. BatchReNorm and Switchable Normalisation are earlier attempts in the same direction. Outside specific niches, these have not displaced BN/LN/GN as the standard choices.
A final architectural point: where normalisation sits inside a residual block matters. The original ResNet uses post-norm (BN after each conv). Pre-activation ResNet and all modern transformers use pre-norm (normalise before the operation, not after). Pre-norm trains more stably at depth and at very high learning rates; post-norm occasionally gives slightly higher final accuracy when it trains successfully but is more fragile. The 2020s default across both CNNs and transformers is pre-norm with RMSNorm or LayerNorm.
The activation function is the non-linearity applied between linear layers. Despite the vast attention that architectures receive, the choice of activation is essentially a solved problem: ReLU is the default for CNNs, GELU or SiLU/Swish for transformers, and the differences between the smoothly-differentiable variants usually cash out as a fraction of a percent on ImageNet.
ReLU — max(0, x) — became the standard with AlexNet. Its virtues are extreme simplicity, no vanishing-gradient problem for positive inputs, and fast computation. Its vice is the "dying ReLU" problem: a neuron whose inputs are consistently negative gets zero gradient and stops updating. In practice, with modern initialisation and normalisation this is rarely a problem, and ReLU remains the default activation in ResNet-family CNNs.
LeakyReLU — max(αx, x) with small α around 0.01 — addresses the dying-neuron issue by giving a small gradient for negative inputs. PReLU (parametric ReLU) makes α a learned per-channel parameter. ELU (exponential linear unit) is smooth for x<0, approaching -α asymptotically. None of these consistently outperform plain ReLU by more than ~0.3% top-1, but they are useful on networks that are difficult to train from scratch.
Swish (Ramachandran, Zoph & Le, 2017) was discovered via a neural architecture search over activation functions; it is x · σ(x), also written as SiLU (Sigmoid Linear Unit). It is smooth everywhere, non-monotonic (a small negative dip with its minimum near x ≈ −1.28), and typically gives 0.5–1% accuracy improvements over ReLU on EfficientNet-scale models. Hard-Swish approximates Swish with piecewise-linear pieces for mobile deployment where sigmoid is expensive.
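The functions discussed so far are one-liners, which makes their differences easy to inspect directly. A NumPy sketch (the GELU line is the tanh approximation most frameworks use):

```python
import numpy as np

def relu(x):          return np.maximum(0.0, x)
def leaky(x, a=0.01): return np.where(x > 0, x, a * x)
def swish(x):         return x / (1.0 + np.exp(-x))   # x * sigmoid(x), a.k.a. SiLU
def gelu(x):          # tanh approximation of the exact Gaussian CDF form
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

xs = np.linspace(-3, 1, 9)
print(np.round(swish(xs), 3))  # note the shallow negative dip before the rise
```

Plotting these side by side shows how GELU and Swish are essentially smoothed ReLUs, which is why swapping between them moves accuracy so little.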
Two more recent variants appear in current large models. GeGLU and SwiGLU (Shazeer, 2020) are "gated" activations: the FFN's first linear layer produces twice as many features, half of which are passed through GELU/Swish and used as a multiplicative gate on the other half. This costs 50% more parameters in the FFN but gives consistent ImageNet and language-modelling wins; modern LLMs and recent ViT variants use SwiGLU almost universally.
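The gating pattern is clearer in code than in prose. A minimal NumPy sketch of a SwiGLU-style FFN, with linear layers as plain matrices (names illustrative):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_in, W_out):
    """Gated FFN: the first projection emits 2x features, half gating the other half."""
    h = x @ W_in                      # (..., 2 * d_ff)
    a, b = np.split(h, 2, axis=-1)    # gate input and value, each (..., d_ff)
    return (silu(a) * b) @ W_out      # elementwise gate, then project back to d

rng = np.random.default_rng(0)
d, d_ff = 16, 32
x = rng.standard_normal((4, d))
y = swiglu_ffn(x, rng.standard_normal((d, 2 * d_ff)), rng.standard_normal((d_ff, d)))
print(y.shape)  # (4, 16)
```

Replacing silu with gelu in the gate gives GeGLU; everything else is unchanged.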
A practical recommendation: for standard ImageNet-scale training of ResNets and similar CNNs, ReLU. For efficient architectures in the MobileNet / EfficientNet family, Swish (or Hard-Swish if deploying to mobile). For any transformer, GELU or SwiGLU. Anywhere else, try ReLU first, and spend the search budget on data augmentation or architecture choices rather than activation tuning.
A parallel track to the accuracy-chasing architectures has always been efficiency: producing models that run in real time on a phone, in a browser, or on embedded hardware, at a cost (accuracy-wise) of a few top-1 points. The techniques that drove this track — depthwise-separable convolutions, inverted residual blocks, channel shuffling — also fed back into the mainstream architectures.
A depthwise-separable convolution factors a standard K×K×C_in×C_out convolution into two steps. First, a depthwise convolution applies a K×K filter to each input channel independently (so C_in separate spatial convolutions, with no cross-channel mixing). Then a pointwise 1×1 conv mixes the C_in channels to produce C_out. The parameter count drops from K²·C_in·C_out to K²·C_in + C_in·C_out — approaching a factor of K² when the channel counts are large, and a factor of 8–10× in practice for 3×3 kernels. Accuracy drops by around 1–2% top-1 if you match FLOPs; if you match parameter count and therefore use a wider network, you often come out ahead.
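The parameter arithmetic is worth running once with concrete numbers (channel counts here are illustrative):

```python
def standard_params(k, c_in, c_out):
    """K x K x C_in x C_out dense convolution, ignoring bias."""
    return k * k * c_in * c_out

def separable_params(k, c_in, c_out):
    """Depthwise K x K (per input channel) plus pointwise 1x1 mixing."""
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 256, 256
s, d = standard_params(k, c_in, c_out), separable_params(k, c_in, c_out)
print(s, d, round(s / d, 1))  # 589824 67840 8.7 -- the 8-10x regime from the text
```

The ratio K²·C_out / (K² + C_out) tends to K² as C_out grows, which is why the saving is best at the wide later stages of a network.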
MobileNet-v1 (Howard et al., 2017) was the first widely-used architecture built around depthwise-separable convs: a simple stack of them, with two hyperparameters (a width multiplier α and a resolution multiplier ρ) to sweep out a curve of accuracy vs. latency. MobileNet-v2 (Sandler et al., 2018) introduced the inverted residual block: expand channels by factor 6 with a pointwise conv, run a depthwise 3×3 on the expanded representation, and contract back with another pointwise conv. Non-linearities are omitted on the contracting layer — the "linear bottleneck" — because ReLU destroys information when applied to a low-dimensional manifold. This became the canonical efficient block.
MobileNet-v3 (Howard et al., 2019) added squeeze-and-excitation blocks, switched to Hard-Swish activations, and used neural architecture search to tune per-layer widths. It remains the go-to architecture for on-device classification on mobile CPUs as of the mid-2020s.
SqueezeNet (Iandola et al., 2016) took a different approach: keep full convolutions but aggressively shrink channel counts through 1×1-conv "fire modules". It achieved AlexNet-level accuracy with 50× fewer parameters, making it an early proof of concept that parameter count and accuracy were not tightly coupled — but it lost ground to depthwise-separable architectures on FLOPs and latency.
An important benchmark to know is the accuracy-vs-latency Pareto frontier on mobile hardware. MobileNet-v3-Small achieves about 67% top-1 at around 6 ms on a Pixel 4 CPU; MobileNet-v3-Large reaches 75% at ~12 ms; EfficientNet-B0 reaches ~77% at similar cost. Below the frontier sits every serious modern mobile architecture; above it sits everything too big for practical on-device inference. Knowing where your target sits on this frontier tells you, immediately, which architecture family to reach for.
Neural architecture search (NAS) automates the design of a network's topology: treat the architecture itself as a set of discrete choices (which block at which layer, which kernel size, how many channels) and search over them by treating accuracy-under-training as a black-box function to optimise. The technique's most famous output is the EfficientNet family, and its broader contribution is the compound scaling rule that turned "how do I make this network bigger" into a one-line formula.
The first generation of NAS — NASNet (Zoph & Le, 2017) and its successors — used a reinforcement-learning controller that proposed architectures, trained them to completion, and used validation accuracy as reward. This worked but cost thousands of GPU-days per search. A series of efficiency improvements followed. ENAS (Pham et al., 2018) shared weights across candidate architectures via a supernet. DARTS (Liu et al., 2018) made the architecture parameters continuous and differentiable, letting SGD optimise topology and weights jointly. ProxylessNAS (Cai et al., 2018) searched directly on the target hardware to optimise latency rather than proxy metrics. By 2019 NAS had become cheap enough to use routinely.
The result most worth remembering is EfficientNet (Tan & Le, 2019). Starting from a small NAS-searched backbone (EfficientNet-B0), the paper poses a clean scaling question: given extra compute, how should you use it — more depth, more width, higher resolution, or some combination? The answer is a compound scaling rule that grows all three together according to α, β, γ exponents constrained so that αβ²γ² ≈ 2 per "unit" of extra compute. Following that rule produces a family of models (EfficientNet-B1 through B7) on a clean Pareto frontier: B0 is small and fast, B7 is large and accurate, and you simply pick the point that matches your budget.
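The compound scaling rule really is a one-line formula. A sketch using the α, β, γ coefficients reported for the B0 baseline (the scale helper and its defaults are illustrative):

```python
# EfficientNet compound scaling: depth *= alpha**phi, width *= beta**phi,
# resolution *= gamma**phi, with alpha * beta**2 * gamma**2 ~ 2 so that each
# increment of phi roughly doubles FLOPs (FLOPs grow ~ depth * width^2 * res^2).
alpha, beta, gamma = 1.2, 1.1, 1.15   # coefficients from the EfficientNet paper

def scale(phi, base_depth=1.0, base_width=1.0, base_res=224):
    return (base_depth * alpha**phi,       # depth multiplier
            base_width * beta**phi,        # width multiplier
            round(base_res * gamma**phi))  # input resolution

print(round(alpha * beta**2 * gamma**2, 2))   # ~1.92, close to the target 2
for phi in range(4):                          # phi = 0 is B0; larger phi, larger model
    print(phi, scale(phi))
```

Picking a point on the B0–B7 curve amounts to choosing phi to match your compute budget.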
EfficientNet-v2 (Tan & Le, 2021) refined this with Fused-MBConv blocks (combining the first pointwise-expand and the depthwise conv into a single regular conv at early stages, which runs faster on modern hardware), progressive resizing during training (start at low resolution and grow), and a slightly different scaling rule. EfficientNet-v2-L reached about 87% ImageNet top-1 with ImageNet-21k pretraining, with strong transfer to downstream tasks.
NAS is not a universal win. Its output is only as good as the search space; reasonable hand-designed architectures (like a well-tuned ConvNeXt or Swin) routinely compete with NAS-produced ones at similar parameter counts. What NAS does reliably win at is hardware-specific optimisation: given a specific mobile CPU, NPU, or DSP, searching directly against measured latency on that device produces architectures that beat any generic design. Production deployment pipelines at major platforms (Google Tensor, Apple Neural Engine) lean heavily on NAS for this reason.
A philosophical note: NAS's main contribution to vision architecture design was not the specific networks it produced but the legitimisation of grid search over design decisions. Choices about activation function, normalisation placement, kernel size, and channel count that had been made by taste in the VGG/ResNet era were suddenly empirical questions, and once the community internalised that, hand-designed architectures also improved substantially. Every modern vision paper's ablation table is an NAS-adjacent artefact.
In 2020, ViT (Dosovitskiy et al.) demonstrated that a pure transformer — the same block that powers GPT and BERT — could match and exceed the best CNNs on ImageNet, given enough pretraining data. The paper was a paradigm shift: for five years the question had been "how should we design a CNN"; after ViT, the question became "how much do we even need convolution".
The architecture is almost embarrassingly simple once you have the transformer chassis in hand. Take an image, split it into fixed-size patches (typically 16×16 pixels, sometimes 14×14 or 32×32), flatten each patch into a vector, and linearly project it to the transformer's hidden dimension. This gives a sequence of patch tokens. Prepend a learned class token, add learned positional embeddings (one per patch position plus one for the class token), and feed the sequence into a standard transformer encoder — multi-head self-attention blocks with FFN sublayers, pre-norm, residual connections. The class token's output after the final block goes through a linear classifier to produce logits. That is the entire network.
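The front end of that pipeline — image to patch tokens — fits in a few lines. This sketch uses plain Python over a nested-list "image" and stops at the flattening step; a real implementation would then apply a learned linear projection (equivalently a stride-P convolution) to each flattened patch.

```python
# Patch tokenisation sketch in plain Python (no framework): an H x W image
# becomes (H/P)*(W/P) flattened patch vectors. A real ViT then linearly
# projects each patch to the model dimension; we stop at flattening.
def image_to_patch_tokens(image, patch_size):
    h, w = len(image), len(image[0])
    assert h % patch_size == 0 and w % patch_size == 0
    tokens = []
    for py in range(0, h, patch_size):
        for px in range(0, w, patch_size):
            patch = [image[py + dy][px + dx]
                     for dy in range(patch_size)
                     for dx in range(patch_size)]
            tokens.append(patch)   # one flattened vector per patch
    return tokens

# A 224x224 single-channel image with 16x16 patches gives 196 tokens of
# dimension 256 (768 with three RGB channels); a class token is prepended.
image = [[0.0] * 224 for _ in range(224)]
tokens = image_to_patch_tokens(image, 16)
print(len(tokens), len(tokens[0]))  # 196 256
```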
The catch, and it is a big one, is data efficiency. On ImageNet-1k alone, ViT underperforms a ResNet of comparable size by several percentage points. The reason is that CNNs encode strong inductive biases — locality, translation equivariance, hierarchical feature extraction — that ViTs do not have; ViTs instead learn these from data. With 1.3M ImageNet-1k images that learning doesn't happen; with 14M ImageNet-21k images it partially does; with 300M JFT images ViT matches CNNs; with 3B images ViT substantially exceeds them. The original paper's headline result was ViT-L/16 pretrained on JFT-300M then fine-tuned on ImageNet-1k, beating the best CNN baselines.
DeiT (Touvron et al., 2020) made ViT trainable from scratch on ImageNet-1k by combining strong data augmentation (RandAugment, Mixup, CutMix), aggressive regularisation (stochastic depth, label smoothing), and a distillation token that absorbed knowledge from a teacher CNN. DeiT-B reached 83.4% top-1 trained only on ImageNet-1k in roughly three days (53 hours) on a single 8-GPU machine — removing the "you need JFT-300M" asterisk and making ViTs broadly accessible.
A recurring practical concern with ViTs is attention complexity. Self-attention costs O(N²) in the number of tokens; at 224×224 with 16×16 patches, N = 196, which is fine. At higher resolutions or smaller patches, N grows and memory/compute blow up. This motivates the hierarchical transformers of the next section, which restrict attention to local windows. It also motivates token-pruning methods (DynamicViT, A-ViT) that discard uninformative tokens at intermediate layers, and linear-attention variants that approximate the full attention with O(N) operations.
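The arithmetic is worth doing once by hand. This back-of-envelope sketch tracks only the proportionality (pairwise score matrix per head per layer), not the constants:

```python
# Token counts and attention cost, to make the O(N^2) scaling concrete.
# Only the proportionality matters here, not the constant factors.
def num_tokens(resolution, patch_size):
    return (resolution // patch_size) ** 2

def attention_cost(n_tokens):
    return n_tokens ** 2   # pairwise score matrix per head per layer

for res in (224, 384, 1024):
    n = num_tokens(res, 16)
    print(res, n, attention_cost(n))
# 224 -> 196 tokens; 1024 -> 4096 tokens: ~21x more tokens but ~437x the
# attention cost, which is why high-resolution ViTs need windowed or
# linear attention.
```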
Three ViT architectural details deserve note. Relative positional encoding (ViT used absolute learned positions, but RPE in Swin and others generalises better to resolutions not seen at training time). Pre-norm LayerNorm (always before attention and FFN — post-norm transformers train less stably at depth). And MLP expansion ratio of 4× (the FFN's hidden dimension is 4× the model dimension — a value inherited from BERT/GPT and retained in almost every ViT since).
Plain ViTs have two weaknesses that matter in practice: quadratic attention cost with token count, and a single flat spatial resolution that makes them awkward for dense-prediction tasks (detection, segmentation) that consume features at multiple scales. The 2021–2023 wave of architectures fixed these by reintroducing hierarchy and locality on top of the transformer chassis — and, in parallel, by modernising CNNs using everything the transformer era had learned.
Swin Transformer (Liu et al., 2021) is the canonical hierarchical ViT. It operates attention within non-overlapping local windows (typically 7×7 patches) rather than globally. Between consecutive blocks it shifts the window partitioning, so that information flows across window boundaries. Between stages it merges 2×2 patch tokens into one — a transformer analogue of CNN downsampling — so that the network has four resolution stages exactly as a ResNet does. This gives linear complexity in image size and a multi-scale feature pyramid compatible with standard detection and segmentation heads. Swin-B reached 84% ImageNet top-1 and became the default backbone for detection/segmentation for several years.
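The shifted-window mechanism is easiest to see on a toy grid. The sketch below uses a 4×4 token grid with window size 2 for readability (Swin uses 7×7 windows on much larger grids); the cyclic shift is the wraparound roll that Swin implements with torch.roll.

```python
# Minimal sketch of Swin's shifted-window trick on a toy token grid: windows
# partition the grid; shifting the partition between consecutive blocks lets
# information cross window boundaries. Window size 2 on a 4x4 grid.
def partition_windows(grid, ws):
    n = len(grid)
    return [[grid[y + dy][x + dx] for dy in range(ws) for dx in range(ws)]
            for y in range(0, n, ws) for x in range(0, n, ws)]

def cyclic_shift(grid, s):
    # wraparound roll of rows and columns by s (what torch.roll does)
    n = len(grid)
    return [[grid[(y + s) % n][(x + s) % n] for x in range(n)] for y in range(n)]

grid = [[4 * y + x for x in range(4)] for y in range(4)]
wins = partition_windows(grid, 2)
shifted_wins = partition_windows(cyclic_shift(grid, 1), 2)
print(wins[0])          # [0, 1, 4, 5] -- one window of the original grid
print(shifted_wins[0])  # mixes tokens from several original windows
```

After the shift, the same windowed attention now connects tokens that sat in different windows in the previous block, which is how information propagates across the whole grid without any global attention.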
MaxViT, MViT-v2, Focal Transformer, and CvT are close cousins: all reduce attention complexity by combining local windowed attention with some form of coarse global attention (grid attention, pooling attention, coarse-to-fine attention), achieving linear or near-linear scaling while retaining long-range information flow.
The CNN counter-argument came from ConvNeXt (Liu et al., 2022). The paper's thesis was that ViT's success was not about attention per se but about a package of training and architectural choices that had been co-developed with transformers: AdamW, cosine-schedule LR, heavy augmentation, LayerNorm instead of BN, GELU instead of ReLU, a 7×7 depthwise conv instead of 3×3 (expanding the receptive field to match attention), inverted-bottleneck MBConv blocks, fewer activation and normalisation layers. Applying all of these to a ResNet produced ConvNeXt, which matched Swin accuracy and inference speed. The broader message: "transformers vs CNNs" was a largely-false dichotomy — the modern vision recipe outperforms the 2015 recipe regardless of which primitive is at the core.
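The 7×7 depthwise choice above is affordable precisely because depthwise convolution decouples spatial mixing from channel mixing. The parameter arithmetic for a single conv layer (bias terms omitted, channel count illustrative) makes the point:

```python
# Parameter arithmetic behind the ConvNeXt 7x7 depthwise choice: a 7x7
# depthwise conv is far cheaper than a dense 3x3 conv at the same channel
# count. Single conv layer, bias terms omitted.
def dense_conv_params(channels, k):
    return channels * channels * k * k   # every input-output channel pair

def depthwise_conv_params(channels, k):
    return channels * k * k              # one spatial filter per channel

c = 384
dense_3x3 = dense_conv_params(c, 3)
depthwise_7x7 = depthwise_conv_params(c, 7)
print(dense_3x3, depthwise_7x7)  # 1327104 vs 18816
# ~70x fewer parameters despite the much larger kernel, so channel mixing
# can move into cheap 1x1 (pointwise) layers, as in the MBConv block.
```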
MobileViT, EfficientFormer, and MobileNetV4 apply the same lessons to the mobile regime: small transformers with aggressive use of convolutions for the early stages (where locality is cheap) and attention only at lower-resolution later stages (where its global receptive field earns its complexity cost). These are the architectures of choice for on-device inference; they typically reach 75–78% ImageNet top-1 at single-digit-millisecond latency on a phone.
The practical takeaway from this landscape is that in 2026, "which backbone should I use" depends on two questions: how much pretraining data do you have access to, and what downstream task are you serving. For dense prediction (detection, segmentation) on standard budgets, a Swin or ConvNeXt backbone is a strong default. For large-scale retrieval or vision-language work, a plain ViT pretrained with CLIP or DINO is usually the right answer. For on-device inference, MobileNet-v4 or MobileViT. The relative rankings shift year to year; the abstraction — "pick a pretrained backbone, attach a task-specific head" — is stable.
Perhaps the single largest source of "free" accuracy in modern vision is the training recipe — the combination of optimiser, schedule, augmentation, regularisation, and evaluation protocol that sits around the architecture. The same ResNet-50 trained with a 2015 recipe gets ~76% ImageNet top-1; retrained with a 2022 recipe, it gets 80.4%. That gap is larger than most architecture differences.
The optimiser of choice for CNNs has historically been SGD with Nesterov momentum at learning rate ~0.1 and weight decay ~1e-4, trained for 90–120 epochs with a step or cosine schedule. For transformers, AdamW (Loshchilov & Hutter, 2017) — Adam with decoupled weight decay — is the default, at learning rate ~1e-3 with weight decay 0.05 and a cosine schedule. Both work with the right hyperparameters; SGD is slightly more accurate on small CNNs, AdamW much more stable on anything involving attention or long training runs. Most modern code uses AdamW as a default because it is more robust to hyperparameter choice.
A learning-rate warmup over the first few epochs (linear from zero to the peak LR) is standard. A cosine decay schedule from the peak LR back to a small minimum over the remaining epochs outperforms step decay in essentially every setting. Exponential moving average (EMA) of model weights — maintaining a slowly-updated copy of the weights as an ensemble-of-one — typically adds 0.2–0.5 top-1 and is used in essentially every modern recipe.
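The warmup-plus-cosine schedule is a one-function affair. In the sketch below the epoch counts, peak LR, and floor are illustrative placeholders, not a recommendation for any particular model:

```python
import math

# Warmup-plus-cosine learning-rate schedule as used in most modern recipes.
# Epoch counts and learning rates here are illustrative placeholders.
def lr_at_epoch(epoch, total_epochs, peak_lr, warmup_epochs=5, min_lr=1e-6):
    if epoch < warmup_epochs:
        # linear warmup from zero toward the peak LR
        return peak_lr * (epoch + 1) / warmup_epochs
    # cosine decay from peak LR down to min_lr over the remaining epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

schedule = [lr_at_epoch(e, 300, peak_lr=1e-3) for e in range(300)]
print(max(schedule))   # peak reached at the end of warmup
print(schedule[-1])    # close to min_lr at the final epoch
```

EMA sits alongside this: after each optimiser step, update a shadow copy of the weights as `ema = decay * ema + (1 - decay) * weights` with decay around 0.999–0.9999, and evaluate the shadow copy.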
Regularisation stacks complementarily on top. Stochastic depth (Huang et al., 2016) randomly drops entire residual blocks during training, with the drop probability typically increasing linearly with depth — cheap regularisation that also shortens the network's expected depth during training. Label smoothing (introduced in the Inception-v3 paper, Szegedy et al., 2016) replaces the hard one-hot target with a smoothed distribution — typically moving 0.1 probability mass from the correct class onto the others — reducing overconfidence and boosting top-1 modestly. Dropout is retained in ViTs and MLP-Mixer architectures but largely absent from modern CNNs (where BN provides enough regularisation).
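Label smoothing in particular is just a target-construction rule. A minimal sketch of the uniform-smoothing formulation (ε spread over all K classes, so the correct class gets 1 − ε + ε/K):

```python
import math

# Label-smoothing target construction and its cross-entropy, in plain
# Python, using the common uniform-smoothing formulation.
def smoothed_targets(num_classes, correct_class, eps=0.1):
    base = eps / num_classes
    targets = [base] * num_classes
    targets[correct_class] += 1.0 - eps
    return targets

def cross_entropy(targets, probs):
    return -sum(t * math.log(p) for t, p in zip(targets, probs))

targets = smoothed_targets(num_classes=5, correct_class=2)
print(targets)                 # correct class 0.92, each other class 0.02
print(round(sum(targets), 6))  # still a valid distribution: sums to 1
```

Because the target is never exactly one-hot, a maximally confident prediction is penalised relative to a slightly hedged one — the calibration effect label smoothing is after.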
A landmark paper on training recipes is Wightman, Touvron & Jégou (2021), "ResNet Strikes Back": it retrained ResNet-50 with a modern recipe and produced 80.4% top-1, up from 76.5% in the original paper. The paper's ablation is the clearest reference for which recipe ingredients matter. The second landmark is the ConvNeXt paper itself, which carefully isolates each training-recipe change and shows that roughly half of the "transformer beats CNN" gap came from recipe differences rather than architecture.
Two evaluation details deserve mention. Training image resolution typically differs from inference resolution; inference at 256×256 after training at 224×224 gives a free ~0.5% top-1. Test-time augmentation (averaging predictions over several augmented versions of a test image) gives another ~0.5%. Both are standard at ImageNet-leaderboard-top scores and deliberately absent when measuring production latency. When reading accuracy numbers, always check which resolution and whether TTA was used.
Self-supervised learning (SSL) for vision — pretraining a backbone on unlabelled images by solving a pretext task derived from the data itself — is the technique that finally made web-scale vision pretraining practical. By 2023, an SSL-pretrained ViT typically outperformed the same architecture pretrained on ImageNet-1k labels for almost every downstream task.
Vision SSL has moved through four families in sequence. The first generation (Context prediction, Jigsaw puzzles, Rotation prediction, Colourisation) defined pretext tasks that required the network to reason about spatial layout without any external labels. These worked but did not scale; the features were useful but not competitive with supervised ImageNet pretraining.
The second generation was contrastive learning. SimCLR (Chen et al., 2020) and MoCo (He et al., 2019) both train by pulling augmented views of the same image together in embedding space while pushing different images apart. SimCLR used very large batches for the negative pool; MoCo used a momentum-updated memory queue to keep many negatives cheap. Both closed the gap with supervised pretraining on ImageNet-1k. BYOL (Grill et al., 2020) and SimSiam (Chen & He, 2020) removed negative examples altogether: two networks predict each other's representations, with a momentum target network that prevents collapse. The surprising observation — that the loss landscape of "predict your own EMA teacher" admits useful solutions — reshaped the field.
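The mechanics SimCLR and MoCo share reduce to one loss. This toy InfoNCE/NT-Xent computation uses hand-made 2-D embeddings (purely illustrative) to show how the positive pair's similarity is pushed up against all negatives through a temperature-scaled softmax:

```python
import math

# Toy InfoNCE/NT-Xent computation on hand-made embeddings: the positive
# pair's similarity is raised relative to all negatives.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def info_nce(anchor, positive, negatives, temperature=0.1):
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

anchor = [1.0, 0.0]
positive = [0.9, 0.1]                   # augmented view of the same image
negatives = [[0.0, 1.0], [-1.0, 0.0]]   # other images in the batch/queue

easy = info_nce(anchor, positive, negatives)
hard = info_nce(anchor, [0.0, 1.0], negatives)  # misaligned 'positive'
print(easy < hard)  # loss is low when views of the same image align
```

SimCLR's contribution was making the negative pool huge (batch size), MoCo's was making it cheap (a momentum-updated queue); the loss itself is the same shape.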
The third generation is clustering and self-distillation. SwAV (Caron et al., 2020) clusters image features online into learned prototypes and uses cluster assignments as pseudo-labels for a classification loss — combining contrastive and clustering ideas. iBOT combines DINO-style self-distillation with masked-patch prediction. The features produced by these methods are qualitatively different from contrastive features and often stronger for semantic segmentation.
The fourth generation, and current dominant paradigm, is masked image modelling. BEiT (Bao et al., 2021) tokenised image patches with a learned discrete VAE and trained a ViT to predict masked tokens — a direct translation of BERT to vision. MAE (He et al., 2021) made this work on raw pixels: mask 75% of patches, reconstruct them from the remaining 25% using an asymmetric encoder-decoder (encoder only sees visible patches). MAE's dramatic compute reduction — the encoder processes 4× fewer tokens — made ViT-H/14 and ViT-G/14 pretraining affordable, and MAE-pretrained backbones became the default for segmentation and detection transfer in 2022–2023.
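The masking arithmetic that drives MAE's compute saving is trivial to sketch. Token count below matches a 224×224 image with 16×16 patches; the random seed is arbitrary:

```python
import random

# MAE masking sketch: mask 75% of patch tokens; the encoder runs only on
# the visible 25%, and a lightweight decoder reconstructs the rest.
random.seed(0)
num_patches = 196          # 224x224 image, 16x16 patches
mask_ratio = 0.75

indices = list(range(num_patches))
random.shuffle(indices)
num_visible = int(num_patches * (1 - mask_ratio))
visible = sorted(indices[:num_visible])   # the only tokens the encoder sees
masked = sorted(indices[num_visible:])    # reconstruction targets

print(len(visible), len(masked))  # 49 147
# Encoder compute scales with its token count: 49 tokens instead of 196 is
# the ~4x saving that made very large ViT pretraining affordable.
```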
A parallel track, multimodal contrastive pretraining, is covered in detail in Chapter 6 of this Part. The short version: CLIP (Radford et al., 2021) and its successors pretrain on image-text pairs by aligning image and text embeddings contrastively. The resulting image encoder is a competitive backbone for classification, retrieval, and as the vision tower of downstream vision-language models. Whether CLIP, DINOv2, or MAE is "better" depends on the task; on ImageNet linear probe DINOv2 wins, on zero-shot classification CLIP wins, on dense-prediction fine-tuning MAE wins. All three are near the 2026 frontier.
Scaling laws — power-law relationships between model size, dataset size, compute budget, and final loss — are one of the sharpest empirical tools in modern vision. They are less clean than their language-model counterparts but consistent enough to inform architecture and training decisions at the frontier.
The earliest large-scale vision scaling study is Mahajan et al. (2018), which trained ResNeXt-101 on up to 3.5 billion Instagram hashtag-labelled images and found that classification accuracy grew logarithmically with dataset size over six orders of magnitude. A similar pattern showed up for model size: ImageNet top-1 grew approximately linearly with log(parameters) at fixed compute budget.
The more systematic study is Zhai et al. (2022), "Scaling Vision Transformers". They scaled ViT from 5 million to 2 billion parameters, varied JFT-3B subset size from 3M to 3B, and varied compute from 0.01 to 1.6 ExaFLOPs. The key findings: ViTs follow a power law in compute similar to language models, with a "compute-optimal" point that trades off model size and training duration. ViT-G/14 at 2B parameters pretrained on 3B images reached 90.4% ImageNet top-1, the 2022 state of the art. The scaling exponent is different from language (roughly 0.4 vs 0.5 for language), but the functional form is the same.
A second finding is the importance of pretraining distribution. LiT (Zhai et al., 2022) and SigLIP (Zhai et al., 2023) show that pretraining on paired image-text data unlocks strong zero-shot classification capabilities that are absent from supervised or self-supervised image-only pretraining at the same scale. Vision-language pretraining has become, in effect, the dominant way to consume web-scale image data — the labels are cheap (the captions are already on the web) and the resulting features transfer excellently.
Compute-optimal scaling also reshaped the architecture-design conversation. The Chinchilla-style observation from language — that training longer on fewer tokens is sub-optimal compared to training shorter on more tokens at fixed compute — has a vision analogue: training a smaller ViT for longer often beats training a larger ViT for shorter, at fixed compute budget. Current frontier vision pretraining runs (DINOv2, SigLIP, CLIP-XL) all sit near this compute-optimal frontier.
The practical use of scaling laws in 2026 is to answer a simple question: given a compute budget, what model size, data size, and training duration should I aim for? Scaling-law fits from similar architectures give approximate answers before you spend the budget. This is why every major frontier vision model is preceded by a scaling-law ablation over small configurations — you measure the exponents first, then extrapolate to the target budget rather than guessing.
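The workflow — fit on cheap pilot runs, extrapolate to the target budget — is a log-log linear regression. In the sketch below the (compute, loss) pairs are synthetic, generated from a known power law purely to demonstrate the fit; real pilot runs would supply noisy measurements instead.

```python
import math

# Scaling-law sketch: fit loss = a * compute^(-b) from pilot runs by
# linear regression in log-log space, then extrapolate.
def fit_power_law(compute, loss):
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return math.exp(intercept), -slope   # a, b in loss = a * compute^(-b)

compute = [1e18, 1e19, 1e20, 1e21]             # small pilot runs
loss = [2.0 * c ** -0.05 for c in compute]     # synthetic, exponent 0.05
a, b = fit_power_law(compute, loss)
predicted = a * (1e23) ** -b                   # extrapolate to target budget
print(round(b, 3), round(predicted, 4))
```

On synthetic data the fit recovers the exponent exactly; on real runs the residuals around the fitted line tell you how far the extrapolation can be trusted.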
Research checkpoints are typically larger than what ships: a 600M-parameter ViT-L is excellent for a paper but unacceptable for real-time inference. Distillation, pruning, and quantisation are the three techniques that turn frontier models into deployable ones, typically reducing model size 4–16× with accuracy losses under 1%.
Knowledge distillation (Hinton, Vinyals & Dean, 2015) trains a smaller student network to match the probability distribution (softmax outputs) of a larger teacher, usually at a high softmax temperature that softens the teacher's distribution and exposes its "dark knowledge" — the relative confidence the teacher assigns to incorrect classes. In vision, standard distillation targets include logits (the original formulation), intermediate feature maps (FitNets, Romero et al.), attention maps (attention transfer), and relational information (RKD). Distillation is particularly effective when the teacher was pretrained on data the student cannot access at training time — distilling a CLIP-trained ViT-G into a MobileViT-S gives the student CLIP-like features despite having seen only ImageNet.
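The temperature-softened objective from Hinton et al. is short enough to write out. Logits below are invented; the T² factor is the standard correction that keeps gradient magnitudes comparable across temperatures:

```python
import math

# Distillation loss sketch: KL divergence between teacher and student
# distributions, both softened by the same temperature T.
def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=4.0):
    p = softmax(teacher_logits, temperature)   # softened teacher targets
    q = softmax(student_logits, temperature)
    # KL(p || q), scaled by T^2 to keep gradient magnitudes comparable
    return temperature ** 2 * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [8.0, 2.0, 1.0, -1.0]   # confident, but with 'dark knowledge'
print([round(x, 3) for x in softmax(teacher, temperature=1.0)])
print([round(x, 3) for x in softmax(teacher, temperature=4.0)])
# At T=4 the near-zero probabilities on wrong classes become visible
# training signal; the student learns to match those relative confidences.
print(round(kd_loss(teacher, [6.0, 3.0, 1.0, -1.0]), 4))
```

In practice this KD term is combined with the ordinary cross-entropy on the hard labels, weighted by a mixing coefficient.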
DeiT's distillation token is an elegant special case: a second class token fed through the transformer alongside the main class token, trained to match the teacher's prediction. It transfers 1–2 top-1 points of accuracy at negligible inference cost and is the most widely-adopted distillation recipe in the ViT family.
Quantisation reduces the numerical precision of weights and activations from 32-bit float to lower-precision formats. The standard options are INT8 post-training quantisation (typically ~0.5% top-1 loss on ImageNet, 4× memory reduction, 2–4× inference speedup on modern CPUs), INT4 weight-only quantisation with float activations (~1–2% loss, 8× weight memory reduction, common in LLM serving), and mixed-precision training (FP16 or BF16 for most operations, FP32 for sensitive accumulations — standard practice in all modern training). Quantisation-aware training (QAT) simulates quantisation during training and typically recovers the accuracy gap entirely.
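The core arithmetic of symmetric INT8 quantisation fits in a few lines. This sketch uses one scale per tensor; real toolchains add per-channel scales, calibration data, and fused kernels, none of which are shown here:

```python
# Symmetric INT8 post-training quantisation of a weight tensor, sketched:
# one scale per tensor, values rounded into [-127, 127], then dequantised
# for comparison. Real toolchains add per-channel scales and calibration.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.42, -1.27, 0.003, 0.9, -0.6]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, recovered))
print(q)
print(round(max_err, 5))  # bounded by scale/2: half a quantisation step
```

The per-element error bound of half a quantisation step is why the ImageNet accuracy cost is so small for well-conditioned weight distributions, and why outlier values (which inflate the scale) are the usual failure mode.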
Newer formats like FP8 (Nvidia H100 and later) and FP6/FP4 (Nvidia Blackwell) sit between INT8 and FP16 on the precision curve and are becoming standard for large-model training. For inference, binary or ternary networks (1–2 bit weights) have been researched but have not broken through to production use outside extreme-edge scenarios.
The three techniques compose. A standard compression pipeline is: (1) train a large teacher, (2) distil into a smaller student architecture, (3) apply structured pruning to the student, (4) quantise to INT8 with QAT. Each step is measured against a fixed accuracy budget — "keep within 1% of the teacher on ImageNet-1k" — and the final model is often 20–30× smaller and 5–10× faster than the teacher at inference. The compression pipeline, not the architecture search, is what decides whether a research advance becomes a product feature.
A model that achieves 88% ImageNet top-1 is not necessarily a model that works. Modern evaluation of image classifiers includes a constellation of robustness tests, calibration measures, and fairness probes that together give a more honest picture of whether a network will generalise to real-world data.
The core robustness benchmarks were introduced in section 2. Briefly: ImageNet-V2 measures selection bias (roughly a 10-point top-1 drop relative to ImageNet-1k); ImageNet-C measures common corruption robustness; ImageNet-R measures domain-shift robustness (paintings, sketches, toys); ImageNet-A measures natural adversarial examples; ImageNet-Sketch measures line-art generalisation. Reporting these together became standard in the early 2020s. The clear pattern is that robustness metrics correlate with ImageNet-1k accuracy but with substantial noise — some architectures (ViTs, hybrids) are more robust at a given top-1 than others (plain CNNs).
Adversarial robustness is a separate axis. FGSM (Goodfellow et al., 2014) and PGD (Madry et al., 2017) are the standard attacks: compute an input perturbation within some ε-ball (commonly ε=4/255 or 8/255 under the L∞ norm) that maximally changes the prediction. Standard networks' accuracy under even weak PGD drops to zero. Adversarial training — training on PGD-attacked inputs — recovers robustness at the cost of clean-sample accuracy (typically 5–15% lower clean top-1). The canonical adversarial-robustness leaderboard is RobustBench, which as of 2026 has the best ImageNet L∞ ε=4/255 robust accuracy at around 55% — substantially below clean accuracy.
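FGSM itself is one gradient step. The sketch below applies it to a toy logistic-regression "classifier" with an analytic input gradient, so no autodiff framework is needed; the weights, input, and ε are invented, but the update — perturb each coordinate by ε·sign(∂loss/∂input) — is exactly the FGSM step:

```python
import math

# FGSM on a toy logistic-regression 'classifier' with an analytic input
# gradient. Weights, input, and epsilon are illustrative.
w = [2.0, -3.0, 1.5]   # fixed 'trained' weights

def predict(x):
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))   # P(class = 1)

def fgsm(x, true_label, epsilon):
    # d loss / d x_i for sigmoid + cross-entropy is (p - y) * w_i
    p = predict(x)
    grad = [(p - true_label) * wi for wi in w]
    sign = lambda g: 1.0 if g > 0 else -1.0 if g < 0 else 0.0
    return [xi + epsilon * sign(gi) for xi, gi in zip(x, grad)]

x = [0.5, -0.2, 0.3]          # confidently classified as class 1
x_adv = fgsm(x, true_label=1.0, epsilon=0.3)
print(round(predict(x), 3), round(predict(x_adv), 3))
# A small per-coordinate perturbation moves the prediction sharply toward
# the decision boundary -- the effect adversarial training defends against.
```

PGD is this same step iterated several times with projection back into the ε-ball, which is why it is a much stronger attack at the same ε.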
Out-of-distribution (OOD) detection asks whether a classifier can identify inputs that don't belong to any training class. Standard approaches score inputs by softmax confidence, energy, or distance-to-nearest-class-prototype in feature space; the Mahalanobis and ViM methods are current defaults. The associated benchmark is OpenOOD, which tracks both near-OOD (held-out ImageNet categories) and far-OOD (completely different domains like iNaturalist) performance. No method currently achieves near-human reliability on OOD.
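The energy score mentioned above is just a negative logsumexp over the logits, which makes it a near-free addition to any classifier. The logits below are invented to show the contrast between an in-distribution and an OOD input:

```python
import math

# Energy-based OOD scoring sketch: E(x) = -logsumexp(logits). A confident
# in-distribution input has low energy; an input that lights up no class
# has high energy. Threshold is fit on validation data.
def energy_score(logits):
    m = max(logits)   # stabilised logsumexp
    return -(m + math.log(sum(math.exp(l - m) for l in logits)))

in_dist_logits = [9.0, 1.0, 0.5, -2.0]   # one class dominates: low energy
ood_logits = [0.2, 0.1, -0.1, 0.0]       # nothing dominates: high energy

print(round(energy_score(in_dist_logits), 3))
print(round(energy_score(ood_logits), 3))
# Flag an input as OOD when its energy exceeds the validation threshold.
```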
Finally, fairness and bias evaluation has become a standard part of classification assessment. ImageNet-Gender, FairFace, and ablations on geographic distribution of the training data (Wang, Narayanan & Russakovsky, 2019) have documented that ImageNet-trained classifiers systematically underperform on images from the Global South, on darker-skinned subjects, and on women in many "profession" categories where training data reflects occupational demographics. Mitigations — balanced sampling, debiasing losses, data augmentation targeting under-represented subgroups — partially close the gap but do not eliminate it. Any production classifier should include a subgroup-accuracy breakdown in its evaluation.
The pragmatic summary: ImageNet-1k top-1 alone is no longer an adequate model card. A 2026-standard evaluation reports top-1 plus ImageNet-V2 / -C / -R / -A accuracy, ECE after temperature scaling, and (for production models) subgroup accuracy on a representative demographic evaluation set. Papers that report only top-1 are flagged by reviewers as underspecified.
The point of all this machinery is not, for most practitioners, to beat an ImageNet leaderboard. It is to produce a pretrained backbone that can be adapted to a specific task — classification of retail products, medical imagery, satellite imagery, manufacturing defects, user-uploaded photographs. This closing section is the operational picture: how classification backbones get selected, adapted, served, and evaluated in a real ML system.
The first decision is backbone selection. The current defaults: for a general-purpose task, a ConvNeXt-B or Swin-B pretrained on ImageNet-21k, fine-tuned to the downstream task. For transfer from frozen features with a linear probe, DINOv2 ViT-L/14 or CLIP ViT-L/14. For on-device inference, MobileNetV4 or MobileViT. For anything sensitive to compute latency, consider EfficientNet-v2 or a distilled ViT-S/16. These choices are not rigid; a reasonable workflow is to train two or three candidates and pick the best on a validation set.
The ecosystem is important to name. torchvision.models provides the reference implementations and the official pretrained weights for classical architectures (ResNet, EfficientNet, ViT, Swin, ConvNeXt). timm (PyTorch Image Models, Ross Wightman) provides a more comprehensive zoo — hundreds of architectures with multiple pretrained configurations each, consistent training recipes, and the best-tuned weights for many models. For ViT-specific pretrained weights, Hugging Face transformers hosts the checkpoints (CLIP, SigLIP, DINOv2, MAE, ViT-JFT). For cloud-scale model serving, ONNX Runtime, TensorRT, and OpenVINO are the standard optimised backends.
Production classifiers rarely output 1000 ImageNet classes. More common is a 10–1000-class hierarchy tailored to the product, with custom labels, multi-label outputs, or a threshold-based "no detection" regime. The standard adaptation recipe is: (1) take a pretrained backbone, (2) replace the final classifier head with a new linear layer of the correct output dimension, (3) fine-tune with a small learning rate on the pretrained weights and a larger learning rate on the new head, (4) evaluate with the target metric (often not accuracy — see below).
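Step (3) of that recipe — small LR on pretrained weights, large LR on the new head — is usually expressed as optimiser parameter groups. The sketch below builds the group structure that PyTorch optimisers such as AdamW accept; the parameter names, prefix, and learning rates are illustrative placeholders:

```python
# Discriminative learning rates via optimiser parameter groups (the list-of-
# dicts structure PyTorch optimisers accept). Names and LRs are placeholders.
def build_param_groups(named_params, head_prefix="head.",
                       backbone_lr=1e-5, head_lr=1e-3):
    backbone, head = [], []
    for name, param in named_params:
        (head if name.startswith(head_prefix) else backbone).append(param)
    return [
        {"params": backbone, "lr": backbone_lr},  # gentle: pretrained weights
        {"params": head, "lr": head_lr},          # aggressive: random init
    ]

# stand-in for model.named_parameters(); values are placeholders
named = [("backbone.stage1.conv", "w1"), ("backbone.stage2.conv", "w2"),
         ("head.fc.weight", "w3"), ("head.fc.bias", "w4")]
groups = build_param_groups(named)
print(len(groups[0]["params"]), len(groups[1]["params"]))  # 2 2
```

With a real model the returned list would be passed straight to the optimiser constructor, e.g. `torch.optim.AdamW(groups, weight_decay=0.05)`.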
Target metrics in production classification almost never include plain accuracy. For imbalanced classes, use balanced accuracy, macro F1, or per-class recall. For threshold-sensitive applications (defect detection, medical triage), use ROC-AUC or precision at fixed recall. For cost-sensitive applications (spam filtering, moderation), weight the confusion matrix by business cost and optimise the expected cost. The choice of metric often dominates the choice of architecture in impact.
Finally, monitoring. A classifier in production degrades silently: the input distribution shifts, the relevant categories evolve, the training data ages. Standard monitoring includes input-distribution drift detection (compare feature-space distributions of production traffic to training traffic), confidence-distribution monitoring (flag if the confidence histogram shifts), and periodic re-evaluation on a freshly-labelled holdout that reflects current traffic. A classifier trained to 85% balanced accuracy can be down to 75% within a year of deployment without anyone noticing if monitoring is absent. The classification backbone is the compiler; the monitoring pipeline is what keeps the compiled program running.
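A minimal version of the input-drift check can be as simple as comparing a scalar feature statistic (say, a pooled-embedding coordinate) between a training reference window and current production traffic. The feature values and alarm threshold below are illustrative; production systems use richer multivariate tests, but the shape is the same:

```python
import statistics

# Minimal input-drift check: how many reference standard deviations has the
# production mean of a monitored feature moved? Values are illustrative.
def drift_score(reference, production):
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference)
    return abs(statistics.mean(production) - ref_mean) / ref_std

reference = [0.1, 0.2, 0.15, 0.12, 0.18, 0.11, 0.16, 0.14]  # training-time
stable = [0.13, 0.17, 0.15, 0.12]     # production traffic, no shift
shifted = [0.45, 0.52, 0.48, 0.50]    # production traffic after a shift

print(round(drift_score(reference, stable), 2))   # small: no alarm
print(round(drift_score(reference, shifted), 2))  # large: investigate
```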
Modern image classification has a richer, better-structured literature than almost any other area of deep learning — the papers below trace the continuous fifteen-year arc from AlexNet to the latest scaled vision transformers and self-supervised backbones. The selections emphasise the canonical architecture papers, the training-recipe and self-supervised-learning papers that reshaped what backbones are, and the software (timm, torchvision, Hugging Face) that turns published checkpoints into deployable code. A small number of 2022–2025 papers are included where they have already become standard references.