Detection and segmentation sit one level above classification: instead of naming a whole image, they localise and delineate every object of interest. The task has two historical lineages. The two-stage lineage — R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN — first proposes candidate regions and then classifies and refines them. The one-stage lineage — SSD, YOLO, RetinaNet — predicts boxes and classes densely on a grid in a single pass, trading a small amount of accuracy for an order-of-magnitude speed-up. Anchor-free detectors (FCOS, CenterNet) removed the hand-designed anchor boxes; DETR (2020) removed hand-designed non-maximum suppression by casting detection as set prediction with a transformer decoder. On the segmentation side, FCN and U-Net established dense per-pixel prediction; DeepLab added atrous convolutions and pyramid pooling; Mask R-CNN added per-instance masks to two-stage detectors; Mask2Former and similar unified architectures now treat semantic, instance, and panoptic segmentation as one problem. The current frontier is promptable and open-vocabulary perception — SAM segments anything you point at, Grounding DINO and OWL-ViT detect objects named by free-text queries, and the line between detection, segmentation, and vision-language has blurred. This chapter follows that arc: the task landscape and metrics, the classical two-stage and one-stage families, anchor-free and transformer detectors, the segmentation hierarchy, and the promptable / open-vocabulary frontier.
Sections one through four establish the problem. Section one is why detection and segmentation matter — the move from whole-image recognition to structured spatial outputs, and the argument that detection is the parsing layer that turns pixels into entities. Section two is the task taxonomy: classification, localisation, bounding-box detection, semantic segmentation, instance segmentation, panoptic segmentation, and keypoint detection — what each outputs, and where the hard parts are. Section three is evaluation metrics: intersection-over-union, average precision at a fixed IoU threshold, mean AP across thresholds and categories, the COCO protocol, and the panoptic-quality metric. Section four is bounding-box encodings and anchor boxes — centre-size vs. corner formats, anchor priors, aspect-ratio scales, and the IoU-based matching that turns detection into a per-anchor classification problem.
Sections five through nine walk through the detection lineage. Section five is the two-stage family — R-CNN, SPP-net, Fast R-CNN, Faster R-CNN, and the RoIAlign refinement that made Mask R-CNN possible. Section six introduces one-stage detectors — the SSD architecture, the YOLO-v1 grid-prediction idea, and the accuracy/speed trade-off. Section seven is the YOLO family — from v2 and v3 through the Ultralytics YOLOv5 / v8 and successors — a case study in how a single-pass architecture absorbed a decade of ideas. Section eight is anchor-free detection — FCOS, CenterNet, CornerNet, RepPoints — which removed the hand-engineered anchor priors. Section nine is feature pyramids — FPN, PANet, BiFPN, NAS-FPN — the multi-scale feature fusion shared by almost every modern detector.
Sections ten and eleven close out the detection story. Section ten is detection losses and label assignment — the focal loss that made one-stage detectors trainable, the IoU-family regression losses (IoU, GIoU, DIoU, CIoU), and dynamic label assignment (ATSS, OTA, SimOTA). Section eleven is the transformer era — DETR's set-prediction formulation, Deformable DETR's multi-scale attention, DINO's denoising queries, and RT-DETR's real-time variant.
Sections twelve through fourteen cover segmentation. Section twelve is semantic segmentation — FCN, U-Net, SegNet, the DeepLab family with atrous convolutions and ASPP, and the modern transformer segmentors (SegFormer, Segmenter). Section thirteen is instance segmentation — Mask R-CNN's per-RoI masks, YOLACT's prototype masks, SOLO's position-conditioned masks, and CondInst / BlendMask. Section fourteen is panoptic segmentation — stuff vs. things, Panoptic FPN, and the unified Mask2Former / OneFormer architectures that treat every segmentation task as mask classification.
Section fifteen is the promptable segmentation frontier: the Segment Anything Model (SAM) and its video-capable successor SAM 2, the heavy-image-encoder / light-prompt-decoder split, and how a single model trained on a billion masks can be prompted with a click, a box, or a mask. Section sixteen is open-vocabulary detection and segmentation — CLIP-aligned classifiers, GLIP, Grounding DINO, OWL-ViT, and the unified open-vocabulary segmentors (ODISE, OV-Seg). Section seventeen is the efficient-deployment toolkit: edge detectors, knowledge distillation, INT8/FP16 quantisation, TensorRT and ONNX export, and the engineering gap between a research checkpoint and a 30-fps phone model. The closing section is the operational picture: annotation tools and dataset formats (COCO, LVIS, Open Images), the timm/torchvision/Detectron2/MMDetection ecosystems, and how detection and segmentation plug into the larger vision pipeline covered in the rest of Part VII.
Image classification answers one question per image: "what is this a picture of?" Most real applications need more. A self-driving car needs to know where each pedestrian and cyclist is, not just whether there are people in the frame. A medical tool needs to outline the tumour's precise boundary, not report "positive". A retail checkout needs to count each item on the belt. These tasks — object detection, which places a bounding box around every object of interest, and segmentation, which assigns a class (and often an instance identity) to every pixel — are the parsing layer of computer vision: they turn a dense grid of pixels into a structured list of entities.
Classification, covered in the previous chapter, gives a single label for a whole image. Detection replaces this with a variable-length output: a set of (class, bounding box, confidence) triples, one per detected object. Segmentation goes further, producing a per-pixel label map: either a single class per pixel (semantic segmentation), a unique instance identity per foreground pixel (instance segmentation), or a unified stuff-and-things labelling (panoptic segmentation). The jump from "one label for an image" to "a variable set of labels localised in space" is what makes the engineering and evaluation of these tasks qualitatively different.
Detection and segmentation inherit the classification backbones from the previous chapter — ResNets, EfficientNets, ViT, Swin, ConvNeXt — and add task-specific heads: region-proposal networks, box regressors, mask decoders. The trend over the last decade has been to unify these heads. In 2015 you would build a different architecture for detection (Faster R-CNN), semantic segmentation (FCN), and instance segmentation (a CRF-plus-detector stack). By 2022 the same transformer backbone with different decoders could handle all of them; by 2024, a single promptable model like SAM could produce segmentation masks for anything a user pointed at, and models like Grounding DINO could detect categories described in free-text language.
The applications that pushed these tasks from curiosity to necessity are, roughly, autonomous driving (where the entire perception stack is detection and segmentation), medical imaging (pixel-accurate tumour / organ boundaries), retail and industrial inspection (counting, sorting, defect detection), robotics (manipulation requires knowing where objects are in 3-D), and content moderation / analysis (detecting weapons, faces, text, logos). Each has produced its own benchmark and its own engineering dialect, but the underlying architectures converged.
This chapter treats detection and segmentation as a single unified topic, because by 2026 they are. The same loss functions, the same backbones, the same transformer decoders, and increasingly the same foundation models serve both. The chapter walks through the historical two-stage and one-stage detector families, the segmentation hierarchy, and the modern transformer-based and promptable architectures that have replaced most of the older task-specific designs.
Before diving into architectures, it is worth fixing the vocabulary. "Detection" and "segmentation" each split into several subtasks with different outputs, different metrics, and different difficulty levels. Getting the taxonomy straight prevents confusion when reading papers that benchmark on apparently similar but subtly different tasks.
Image classification outputs one label per image (or a score for each of K classes). This is the Part VII Chapter 02 task. Image-level multi-label classification is the same but with multiple positive labels — a photo that contains both a dog and a cat. No spatial information is produced.
Localisation outputs a single bounding box and class for the dominant object in the image. It is an intermediate task between classification and detection — easier than detection because there is exactly one output per image — and is rarely a research target on its own. The ImageNet localisation challenge used it as a transitional benchmark before detection took over.
Object detection outputs a variable-length list of (class, box, score) triples: one for every object instance in the image. A box is typically four numbers — (xmin, ymin, xmax, ymax) or (cx, cy, w, h) in pixels. The standard benchmarks are PASCAL VOC (20 classes, legacy), COCO (80 classes, the modern reference), and LVIS (1200+ classes with a long-tailed distribution). Open Images has several million labelled boxes across 600 classes and is commonly used for pretraining.
Keypoint detection outputs a set of labelled point locations — classically facial landmarks, human body joints, or hand joints. The standard benchmark is the COCO keypoint task (17 human body keypoints). Keypoints share much of the detection machinery: they can be framed as heatmap regression (one heatmap per keypoint class) or as points-on-detected-objects (detect the person, then regress keypoints inside the box).
Two further variants are worth naming. Oriented bounding box detection uses rotated rectangles rather than axis-aligned ones — needed for aerial imagery (DOTA benchmark), text detection, and industrial inspection. 3-D detection outputs cuboids in 3-D space rather than 2-D boxes; it is the standard formulation for autonomous-driving perception and is covered in Chapter 05 (3-D vision). The current chapter restricts attention to 2-D tasks on still images; video-specific detection (tracking, spatio-temporal action localisation) appears in Chapter 04.
The output format of each task determines almost everything about the corresponding architecture. Dense per-pixel tasks (segmentation) need a decoder that preserves or restores spatial resolution. Sparse per-object tasks (detection, keypoint) need a mechanism for proposing candidate locations and suppressing duplicates. The distinction between a dense head and a sparse head is the deepest architectural split in the chapter; once you recognise which side of it a given method is on, the rest of the design choices follow from it.
Detection and segmentation metrics are less obvious than classification's accuracy. A detector can be right about class but slightly off on the box; a segmentor can be right on most pixels but blur every boundary. The metrics were designed to reward the behaviours people actually want.
The foundational primitive is Intersection-over-Union (IoU), also called the Jaccard index. For two regions A and B, IoU = |A ∩ B| / |A ∪ B|, a number between 0 (disjoint) and 1 (identical). For bounding boxes the two regions are the axis-aligned rectangles; for segmentation they are pixel masks. IoU is the standard matching criterion between predictions and ground truth. A predicted box with IoU ≥ 0.5 against a ground-truth box of the same class is usually called a true positive at that threshold.
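The IoU computation is short enough to write out directly. A minimal pure-Python sketch for corner-format boxes (the function name `box_iou` is ours, not from any particular library):

```python
def box_iou(a, b):
    """IoU (Jaccard index) of two axis-aligned boxes in (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection top-left
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])   # intersection bottom-right
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Disjoint boxes give 0, identical boxes give 1, and partial overlap lands in between — exactly the behaviour the matching criterion relies on.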
Average Precision (AP) is the area under the precision-recall curve for a single class at a fixed IoU threshold. Predictions are ranked by confidence, matched to ground-truth boxes in rank order, and P/R are computed at each rank; integrating P over R gives AP. mAP is AP averaged over classes. PASCAL VOC's legacy metric was AP@0.5: AP at IoU threshold 0.5.
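A sketch of that ranking procedure, using all-point (non-interpolated) integration and assuming the matching step has already been done, so each prediction carries a confidence and a true-positive flag:

```python
def average_precision(preds, num_gt):
    """AP for one class. preds: list of (confidence, is_true_positive);
    num_gt: number of ground-truth boxes for this class."""
    preds = sorted(preds, key=lambda p: -p[0])   # rank by confidence, descending
    tp = fp = 0
    ap = 0.0
    for _, is_tp in preds:
        if is_tp:
            tp += 1
            ap += tp / (tp + fp)   # precision at this recall step (each TP adds 1/num_gt recall)
        else:
            fp += 1
    return ap / num_gt if num_gt else 0.0
```

With one false positive ranked between two true positives out of two ground truths, precision is 1.0 at the first recall step and 2/3 at the second, giving AP = 5/6.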
For instance segmentation, the same AP framework applies but with mask IoU instead of box IoU. COCO tables separately report APbb (box AP) and APmask (mask AP) from the same predictions. A good instance segmentor's mask AP typically lands 3–5 points below its box AP, reflecting the extra difficulty of pixel-accurate boundaries.
Semantic segmentation uses mean Intersection-over-Union (mIoU): for each class, compute the pixel-wise IoU of the predicted mask against the ground-truth mask across all test images, then average over classes. Pixel accuracy (fraction of correctly classified pixels) is also reported but is dominated by the common classes; mIoU is the more informative metric, especially for long-tailed class distributions. The reference benchmarks are Cityscapes (street scenes, 19 classes), ADE20K (150 classes over diverse scenes), and Pascal Context.
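The mIoU computation reduces to a per-class confusion matrix. A NumPy sketch over flat label arrays (classes absent from both prediction and ground truth are excluded from the mean — a common, but not universal, convention):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """mIoU from flat integer label arrays of equal length."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt, pred), 1)               # confusion matrix: rows = gt, cols = pred
    inter = np.diag(conf).astype(float)          # correctly labelled pixels per class
    union = conf.sum(0) + conf.sum(1) - np.diag(conf)
    valid = union > 0                            # skip classes absent from both
    return (inter[valid] / union[valid]).mean()
```

Note how the metric differs from pixel accuracy: a class covering 1% of pixels weighs exactly as much in the mean as one covering 50%.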
Panoptic segmentation introduced its own metric: Panoptic Quality (PQ). For each class, PQ = (sum of IoUs over matched segments) / (|TP| + 0.5·|FP| + 0.5·|FN|). It factors into a Segmentation Quality (SQ) term — average IoU over matched segments — and a Recognition Quality (RQ) term — an F1 score over segment matches. A model can achieve high PQ only by both correctly identifying each segment and covering it accurately.
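The PQ formula is easy to implement once segments are matched (the matching itself uses an IoU > 0.5 rule, assumed already done here). A sketch for a single class:

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ, SQ, RQ for one class.
    matched_ious: IoUs of the matched (true-positive) segment pairs."""
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp if tp else 0.0   # segmentation quality: mean matched IoU
    rq = tp / denom                              # recognition quality: an F1 over matches
    return sq * rq, sq, rq                       # PQ factorises as SQ * RQ
```

The factorisation is exact: summing IoUs over matches and dividing by the F1 denominator gives the same number as SQ·RQ.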
Two further details affect reproducibility. Non-maximum suppression (NMS) — removing overlapping predictions of the same class — has knobs (IoU threshold, score threshold) that can shift reported numbers by a full AP point. And soft NMS, which decays overlapping boxes' scores rather than removing them outright, can yield small gains. COCO evaluation has standardised on specific settings, and papers that tune NMS aggressively are sometimes flagged as benchmark-chasing rather than architectural wins.
A box is four numbers — but which four, and how they are parameterised, matters more than it might seem. The anchor-box formulation that dominated detection from 2015 to about 2020 is built entirely on a choice of these parameters, and understanding it is the key to reading any pre-transformer detection paper.
There are two standard bounding-box formats. Corner format — (xmin, ymin, xmax, ymax) — gives the top-left and bottom-right corners in pixel coordinates. Centre-size format — (cx, cy, w, h) — gives the centre point and the width and height. Corner is convenient for IoU computation and for clipping to the image; centre-size is convenient for regression (the centre coordinates scale linearly, while corners are coupled to each other). Most modern detectors regress in centre-size format and convert to corner for evaluation.
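The two formats convert losslessly into each other; a minimal sketch (the function names are ours):

```python
def cxcywh_to_xyxy(cx, cy, w, h):
    """Centre-size to corner format."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def xyxy_to_cxcywh(x1, y1, x2, y2):
    """Corner to centre-size format."""
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)
```

A surprising fraction of detection bugs in practice are silent format mismatches — a (cx, cy, w, h) tensor fed to an IoU routine expecting corners — so making the conversion explicit at module boundaries is worth the two extra lines.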
The anchor box idea, introduced by Faster R-CNN, is to place a dense grid of pre-defined box shapes ("priors") across the image at a set of spatial locations and scales, and have the detector predict deltas from each anchor rather than absolute box coordinates. At each location on, say, a 50×50 feature map, the network places nine anchors — three scales × three aspect ratios — for a total of 22 500 candidate boxes. The detector then predicts, for each anchor, a classification score and a four-dimensional delta (dx, dy, dw, dh) that shifts and scales the anchor to match a ground-truth box.
The standard delta parameterisation (Faster R-CNN) is: tx = (x − xa) / wa, ty = (y − ya) / ha, tw = log(w / wa), th = log(h / ha), where subscript a denotes the anchor. Log-encoding the width and height keeps the regression loss well-scaled across a wide range of box sizes; the centre offsets are normalised by anchor size for the same reason. The regression targets are trained with a smooth-L1 loss (Huber loss) that transitions from L2 near zero to L1 in the tails, preventing outliers from dominating the gradient.
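The encode/decode pair can be written directly from those formulas; a sketch in centre-size format (function names are ours):

```python
import math

def encode_deltas(gt, anchor):
    """Faster R-CNN delta targets; gt and anchor are (cx, cy, w, h)."""
    x, y, w, h = gt
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha,       # centre offsets, normalised by anchor size
            math.log(w / wa), math.log(h / ha)) # log-scaled size ratios

def decode_deltas(deltas, anchor):
    """Invert encode_deltas: apply predicted deltas to an anchor."""
    tx, ty, tw, th = deltas
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))
```

Decode is the exact inverse of encode, which is the property the training setup relies on: the network learns deltas, and recovering the box is a deterministic transform.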
A subtlety is positive / negative assignment: deciding which anchors should be trained to predict which ground-truth boxes. The classical rule is threshold-based. An anchor is positive if either it has IoU ≥ 0.7 with any ground-truth box, or it is the highest-IoU anchor for some ground-truth box (the second clause ensures every ground-truth gets matched). It is negative if it has IoU < 0.3 with every ground-truth box. Anchors in between are ignored. This matching is done independently per image and is the engine that turns detection into a per-anchor classification-and-regression problem.
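A sketch of that threshold rule, taking a precomputed anchor-by-ground-truth IoU matrix (the label encoding — a gt index for positives, −1 for negative, −2 for ignore — is our convention, not a library's):

```python
def assign_anchors(ious, pos_thresh=0.7, neg_thresh=0.3):
    """ious[a][g]: IoU of anchor a with ground-truth box g.
    Returns one label per anchor: gt index, -1 (negative), or -2 (ignored)."""
    num_gt = len(ious[0])
    # Clause 2: the highest-IoU anchor for each gt is always positive,
    # so every ground truth gets at least one match.
    best_anchor_per_gt = [max(range(len(ious)), key=lambda a: ious[a][g])
                          for g in range(num_gt)]
    labels = []
    for a, row in enumerate(ious):
        best = max(row)
        if best >= pos_thresh or a in best_anchor_per_gt:
            labels.append(row.index(best))   # assign to its highest-IoU gt
        elif best < neg_thresh:
            labels.append(-1)
        else:
            labels.append(-2)                # in the dead zone: excluded from the loss
    return labels
```

The second anchor in the first test case is positive purely through clause 2 (its best IoU, 0.4, is below the 0.7 threshold, but it is the best available anchor for the second ground truth).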
The downsides of anchors are why the field eventually moved past them. Anchor shape, count, and scale are hyperparameters that must be tuned per dataset — COCO and VOC don't use the same anchors. Most anchors are background, which creates heavy class imbalance during training. Large objects and small objects get different amounts of effective capacity depending on which anchor scales you include. Anchor-free detectors (§8) remove the anchor prior by predicting boxes directly from feature-map locations, and transformer detectors (§11) remove anchors entirely by learning a small set of object queries.
The first successful CNN-based detectors followed a two-stage pipeline: propose candidate regions, then classify and refine each region. The family culminated in Faster R-CNN (2015) and Mask R-CNN (2017), and even after the one-stage and transformer detectors took the speed lead, the two-stage designs held the accuracy crown on COCO for several years.
R-CNN (Girshick et al., 2014) — "Regions with CNN features" — is the ancestor. It uses selective search, a classical bottom-up region-proposal algorithm, to produce about 2000 candidate regions per image. Each region is warped to a fixed size, passed through a CNN (AlexNet originally), and classified by per-class linear SVMs. The network is fine-tuned from ImageNet pretraining on the detection classes. R-CNN was the first method to clearly beat hand-crafted features on PASCAL VOC, but it had two problems: it ran the CNN 2000 times per image (slow), and selective search was independent of the learned features (locked-in).
SPP-net (He et al., 2014) and Fast R-CNN (Girshick, 2015) solved the first problem. Both run the CNN once on the whole image to produce a shared feature map, then extract per-region features by cropping the appropriate sub-region of the feature map. Fast R-CNN introduced the RoI Pooling operator — given a feature map and a region-of-interest box, produce a fixed-size feature grid (say 7×7) by pooling the feature map within the box. This lets a classification-and-regression head run on each region in a single forward pass. Fast R-CNN is roughly 200× faster at inference than R-CNN and slightly more accurate.
Faster R-CNN (Ren et al., 2015) replaced selective search with a learned proposal mechanism: the Region Proposal Network (RPN). The RPN is a small conv network that slides across the shared feature map and, at each location, predicts objectness scores and box refinements for each of k anchors (typically 9 — see §4). Its output is a ranked list of candidate regions, which is then fed to the Fast R-CNN head as before. The key point is that the RPN shares features with the detection head; the whole system trains end-to-end. Faster R-CNN at inference is about 5–10× faster than Fast R-CNN and cleaner architecturally, because there is no hand-designed proposal step.
Training a two-stage detector involves three losses applied at the two stages. At the RPN: a binary objectness classification loss (is this anchor a foreground object?) and an anchor-delta regression loss, applied to a sampled subset of positive and negative anchors. At the head: a K+1-way classification loss (one per class plus background) and a per-class box-delta regression loss, applied to the RPN's top-N proposals that survived NMS. Mask R-CNN adds a per-class binary cross-entropy mask loss on the mask branch. All four (or five) losses are weighted and summed; getting the weights right is part of the recipe.
Two-stage detectors are still used — Cascade R-CNN, HTC — when accuracy matters more than latency, particularly in domain-specific industrial and medical applications. The engineering trade-off is that the two stages give more capacity per candidate region (at the cost of throughput), while one-stage detectors (§6-§7) give up some accuracy for an order-of-magnitude speed-up.
The one-stage detector family predicts class labels and bounding-box deltas directly from a grid of feature-map locations in a single forward pass, skipping the region-proposal step. This trades a few points of AP for a 5–10× speed-up, and for most real-time applications (autonomous driving, video analysis, mobile inference) that trade is worth making.
SSD — "Single Shot MultiBox Detector" (Liu et al., 2016) — was the first widely-adopted one-stage detector. The architecture is a VGG-16 backbone plus a pyramid of extra convolutional feature maps at decreasing resolutions. At each spatial location of each feature map, a dense set of anchor boxes is predicted (both a softmax over classes and a box delta). Large feature maps detect small objects; small feature maps detect large objects — the multi-scale pyramid gives SSD natural scale coverage without an explicit proposal step. Post-processing is a per-class non-maximum suppression over all predictions.
YOLO-v1 (Redmon et al., 2016) — "You Only Look Once" — took an even more radical approach. It divides the image into a 7×7 grid; each grid cell predicts B=2 bounding boxes (x, y, w, h, confidence) plus a distribution over classes. All predictions come from a single pass through a single network with a single loss. YOLO-v1 was dramatically faster than any previous detector (45 fps on a GPU, versus Faster R-CNN's 5 fps) but meaningfully less accurate, especially on small objects where the 7×7 grid was too coarse. It set the aesthetic for the YOLO family: a clean, fast, single-pass detector that would absorb improvements from elsewhere in the field over the next decade.
What made dense one-stage training workable was the focal loss (Lin et al., 2017), which tames the extreme foreground-background imbalance by down-weighting easy examples: FL(pt) = −(1 − pt)^γ log pt, where pt is the predicted probability of the true class. With γ=2, an easy example with pt=0.9 contributes well over 100× less loss than a hard one with pt=0.5. Focal loss closed the accuracy gap between one-stage and two-stage detectors at matched speed; it is one of the single most important training-time innovations in the detection literature.
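The per-example focal loss is nearly a one-liner; a binary sketch, including the α-balancing weight that the RetinaNet paper uses alongside the focusing term:

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for a single prediction.
    p: predicted foreground probability; y: 1 for foreground, 0 for background."""
    pt = p if y == 1 else 1.0 - p            # probability assigned to the true class
    a = alpha if y == 1 else 1.0 - alpha     # alpha-balancing between classes
    return -a * (1.0 - pt) ** gamma * math.log(pt)
```

Setting γ=0 and α=1 recovers plain cross-entropy, which makes the focusing effect easy to measure: the (1 − pt)^γ factor is what crushes the contribution of the millions of easy background anchors.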
RetinaNet (Lin et al., 2017) combined focal loss with a Feature Pyramid Network (FPN, §9) on a ResNet backbone, and was the first one-stage detector to match Faster R-CNN on COCO. It established the canonical one-stage template — backbone + FPN + per-level detection head with anchor-based classification and regression — that subsequent YOLO, FCOS, and RTMDet variants all inherited.
The basic one-stage training recipe became: forward the image, for each anchor compute a classification loss (focal or cross-entropy) against the matched ground-truth class, compute a regression loss (smooth-L1 or IoU-based) against the matched ground-truth box, and sum. Evaluation runs the forward pass, applies per-class NMS, and keeps the top-K predictions by score. The simplicity is the point; there is no proposal-then-refine stage and no hand-designed anchor search.
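Greedy per-class NMS, the post-processing step in that recipe, can be sketched in a few lines — a quadratic reference version for clarity, not the vectorised kernels real libraries ship:

```python
def _iou(a, b):
    """IoU of two corner-format boxes (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS over one class. Returns indices of kept boxes, best first."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)                 # highest-scoring survivor is kept
        keep.append(i)
        order = [j for j in order        # drop everything it overlaps too much
                 if _iou(boxes[i], boxes[j]) < iou_thresh]
    return keep
```

The two knobs mentioned in §3 are visible here: `iou_thresh` decides what counts as a duplicate, and any score threshold is applied before calling this function.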
Practically, one-stage and two-stage detectors have converged. DETR-family transformer detectors (§11) are architecturally one-stage; modern anchor-free detectors (§8) look one-stage but also eliminate the anchor priors; the latest YOLO variants perform comparably to the best two-stage detectors on COCO AP while running 5–10× faster. The division matters more as a historical organising principle than as a current design choice.
The YOLO name has been carried by roughly a dozen different architectures over a decade, some by the original authors (Redmon, Farhadi) and others by different research groups and companies. The family is less a unified design and more a living lineage: a commitment to a single-pass, real-time detector that repeatedly absorbed whatever new technique (FPN, focal loss, anchor-free heads, transformer modules) the rest of the field produced.
YOLO-v2 (Redmon & Farhadi, 2017) replaced the fixed 7×7 grid of v1 with anchor boxes clustered over training data via k-means — making the anchor priors data-dependent rather than hand-set. It also introduced DarkNet-19, a custom backbone; switched to BatchNorm throughout; and added multi-scale training. YOLO-v3 (2018) upgraded to DarkNet-53 with residual connections, added a three-level feature pyramid for better small-object detection, and moved to binary-cross-entropy per-class losses (allowing overlapping labels). v3 was the last version by Redmon; it became the de-facto real-time detector for several years.
The post-v3 YOLO world fragmented. YOLO-v4 (Bochkovskiy et al., 2020) is a careful ablation study assembling the best modules of the preceding years: CSPDarknet backbone, PANet neck, SPP module, Mish activation, CIoU loss, mosaic data augmentation. YOLO-v5 (Ultralytics, 2020) was not a paper but an open-source release that emphasised PyTorch implementation quality, ease of training, and a clean Python API. Its actual architectural differences from v4 are modest, but it rapidly became the most-used detector in industry.
Later variants converged on an anchor-free, decoupled-head design. YOLOX (Megvii, 2021) switched to an anchor-free head (predicting box corners directly from each feature-map point), decoupled the classification and regression heads (separate conv stacks), and introduced SimOTA, a dynamic label-assignment strategy that picks the best candidate anchors per ground-truth at training time. YOLO-v6 (Meituan, 2022) and YOLO-v7 (Wang et al., 2022) continued refinement: re-parameterised blocks for inference, improved label assignment, and careful quantisation support.
YOLO-v8 (Ultralytics, 2023) unified detection, instance segmentation, classification, and pose estimation in a single codebase, all built on a similar CSPDarknet-plus-anchor-free-head architecture. It is anchor-free, uses distribution-focal-loss for box regression, and has been the practical "default" detector for real-time applications through 2024–2025. Subsequent YOLOv9, v10, and v11 variants by various groups continue to push the accuracy/speed Pareto frontier with techniques borrowed from transformer detectors and modern training recipes.
The consolidated picture: over YOLO v1 through v8+, the architecture migrated from a naive grid-prediction to a mature anchor-free, multi-scale, decoupled-head detector that benefits from almost every advance in the rest of the chapter. The family is a case study in how a clean baseline, accessible code, and a real-time design constraint can produce a continuously-improving line of models without a single unifying paper.
Anchor boxes are a useful inductive bias, but they are also a set of hand-designed priors that must be tuned per dataset and that introduce a substantial training-time imbalance. A family of anchor-free detectors, developed mostly between 2018 and 2020, predicts boxes directly from feature-map locations without any anchor priors — typically with comparable or better accuracy than their anchor-based predecessors.
FCOS — "Fully Convolutional One-Stage detection" (Tian et al., 2019) — is the clearest anchor-free design. At every location (x, y) on every level of a feature pyramid, FCOS predicts: a per-class classification score, a 4-vector (l, t, r, b) giving distances from (x, y) to the four sides of the target box, and a centreness score that down-weights predictions whose location is far from the centre of their target. Training assigns each location to the smallest ground-truth box that contains it (within a level-specific size range). At inference, the (l, t, r, b) vector plus the location determine the box; NMS handles duplicates.
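Decoding an FCOS prediction is pure arithmetic; a sketch of the box decode and the centreness target (the centreness formula is the paper's: the geometric mean of the left/right and top/bottom ratios):

```python
def fcos_decode(x, y, ltrb):
    """Convert an FCOS (l, t, r, b) prediction at feature location (x, y)
    into a corner-format box."""
    l, t, r, b = ltrb
    return (x - l, y - t, x + r, y + b)

def centreness(ltrb):
    """FCOS centreness target: 1.0 when the location is at the box centre,
    falling towards 0 as it approaches an edge."""
    l, t, r, b = ltrb
    return ((min(l, r) / max(l, r)) * (min(t, b) / max(t, b))) ** 0.5
```

At inference the centreness score multiplies the classification score before NMS, so off-centre, low-quality boxes are suppressed first.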
FCOS outperformed anchor-based RetinaNet at matched backbones on COCO, with a simpler and more elegant formulation. The centreness head is a small trick that turned out to generalise: it is effectively a learned prior that central locations should produce higher-quality predictions, implemented as an additional scalar per location. Almost every anchor-free detector since has either used centreness directly or an IoU-based equivalent.
RepPoints (Yang et al., 2019) replaces the four-number box representation with a set of nine learned point locations; the box is computed as the min-max bounding rectangle of these points. The benefit is that the points can learn to attach to semantically meaningful object parts, giving a richer representation than a box. ATSS — "Adaptive Training Sample Selection" (Zhang et al., 2020) — showed that the real performance gap between anchor-based and anchor-free detectors came from the label assignment strategy, not the presence or absence of anchors; with matched assignment rules, the two are indistinguishable. ATSS became the default assigner in modern detectors: it computes candidate positive anchors/locations per ground-truth using a statistical mean-plus-standard-deviation IoU rule, rather than a fixed threshold.
The lesson of the modern lineage (FCOS, ATSS, CenterNet, SimOTA) is that a one-stage detector does not need anchor priors as long as it has a carefully designed label-assignment and centreness scheme. Almost every post-2021 detector — YOLOX, YOLO-v6+, RTMDet, RTMPose — is anchor-free. The anchor-based detectors still in wide use (Faster R-CNN, RetinaNet) are valuable reference points more than competitive systems; a new detection paper in 2026 is almost certainly anchor-free.
Objects in natural images appear at drastically different scales — a bicycle close to the camera might cover a 400-pixel region; another in the distance might cover ten. A detector with a single feature-map resolution will struggle with one end of this range. The feature pyramid is the standard architectural answer: extract and fuse features at multiple spatial resolutions so that small objects are handled by fine-grained feature maps and large objects by coarse-grained ones.
FPN — "Feature Pyramid Networks" (Lin et al., 2017) — is the canonical design. Starting from a backbone's feature hierarchy (say C2, C3, C4, C5 at strides 4, 8, 16, 32), FPN adds a top-down pathway: upsample the coarsest feature map (C5), add it element-wise to a lateral 1×1 projection of C4, apply a 3×3 conv to smooth the result (P4), and iterate. The output is a pyramid P2…P5 where every level has both coarse-grained semantic information (from C5) and fine-grained spatial information (from the same-stride backbone level). Detection heads are attached at each level; small objects go to P2, large to P5.
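The top-down pathway is a short loop. A NumPy sketch that keeps only the fusion logic — the 1×1 lateral projections are assumed already applied to the inputs, and the per-level 3×3 smoothing convs are omitted:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fpn_topdown(laterals):
    """Top-down FPN fusion. laterals: 1x1-projected backbone maps, fine to
    coarse (e.g. C2'..C5' at strides 4..32). Returns [P2, P3, P4, P5]."""
    ps = [laterals[-1]]                          # coarsest level: P5 = projected C5
    for lat in reversed(laterals[:-1]):
        ps.append(lat + upsample2x(ps[-1]))      # upsample coarser map, add lateral
    return ps[::-1]                              # reorder fine-to-coarse
```

The key property the test checks is information flow: semantic content injected only at the coarsest level (C5) propagates down to every finer level, which is exactly why small objects on P2 benefit from C5's semantics.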
FPN's impact was immediate: attaching an FPN to RetinaNet, Faster R-CNN, or Mask R-CNN gave 2–4 AP points of improvement across the board, with small objects (APS) improving even more. It rapidly became a default, to the point that "with FPN" is the assumed baseline in any post-2017 detection paper.
An orthogonal refinement is the neck vs. backbone vs. head separation that the field settled on. A modern detector is built from three loosely-coupled components: a backbone (ResNet, ConvNeXt, Swin) that produces feature maps; a neck (FPN, BiFPN, PANet) that fuses features across scales; and a head (anchor-based, anchor-free, transformer query) that makes the actual predictions. This decomposition lets researchers swap components independently — a ResNet-backbone FCOS head and a Swin-backbone ATSS head share the same neck API.
Transformer detectors (DETR and descendants) typically replace the multi-scale neck with a single feature map plus learned object queries, or — in the case of Deformable DETR — use a multi-scale feature set but without the explicit FPN top-down pathway. For non-transformer detectors, an FPN-family neck is essentially mandatory; the lineage from FPN (2017) through BiFPN (2020) and beyond is one of the quieter-but-important stories in detection architecture.
A detector's loss function has to balance a classification signal (is this anchor a foreground object, and which class?), a regression signal (how much do I shift the box to hit the target?), and a label-assignment rule (which anchor is responsible for which ground-truth box?). The evolution of each of these three pieces drove roughly as much of the last decade's accuracy gains as any architectural change.
Classification losses. Early detectors used cross-entropy or softmax cross-entropy directly, which collapsed under the 1000:1 foreground-background imbalance of one-stage detectors. Focal loss (§6) solved this by down-weighting easy examples with a modulating factor (1 − p_t)^γ. Quality Focal Loss (Li et al., 2020, GFL) replaces the binary (positive/negative) target with a continuous IoU-based quality score — a positive anchor's target is its predicted box's IoU with the ground truth, not 1. This aligns the classification score with the actual localisation quality, so that NMS (which sorts by classification score) selects boxes with both high class confidence and tight localisation.
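As a concrete sketch (pure Python, scalar case; β = 2 as in the GFL paper), Quality Focal Loss is cross-entropy against a soft IoU target, modulated by how far the prediction sits from that target:

```python
import math

def quality_focal_loss(pred, quality, beta=2.0):
    """Quality Focal Loss (GFL): binary cross-entropy against a soft
    IoU-quality target in [0, 1], scaled by |quality - pred|^beta so
    predictions already matching their target contribute almost nothing."""
    bce = -(quality * math.log(pred) + (1 - quality) * math.log(1 - pred))
    return abs(quality - pred) ** beta * bce

# A score that matches its box's IoU costs nothing; a confident score
# attached to a poorly-localised box is penalised heavily.
print(quality_focal_loss(0.70, 0.70))  # 0.0
print(quality_focal_loss(0.95, 0.30))  # large: overconfident for a loose box
```

The key property is the alignment: the classification branch is trained to output the localisation quality, not a binary label, so NMS ranking becomes localisation-aware.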
Regression losses. Classical detectors used smooth L1 (Huber loss) on the four anchor deltas, independently. The problem: optimising four scalar losses does not directly optimise IoU, which is what the evaluation metric actually cares about. IoU loss (Yu et al., 2016) directly uses 1 − IoU, but it has zero gradient for non-overlapping boxes. GIoU (Rezatofighi et al., 2019) adds a term that accounts for the smallest enclosing rectangle, giving a gradient even when boxes do not overlap. DIoU (Zheng et al., 2020) adds a centre-distance term; CIoU adds aspect-ratio alignment. These are drop-in replacements for smooth-L1 that give 1–2 AP of improvement on COCO.
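A minimal GIoU-loss implementation for (x1, y1, x2, y2) boxes makes the enclosing-box term explicit (a scalar sketch, not a batched tensor version):

```python
def giou_loss(a, b):
    """GIoU loss for axis-aligned boxes (x1, y1, x2, y2):
    1 - IoU + (enclosing area not covered by the union) / enclosing area.
    Unlike plain 1 - IoU, the enclosing-box term still provides a
    gradient when the boxes do not overlap at all."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest box C enclosing both a and b
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    area_c = cw * ch
    return 1.0 - iou + (area_c - union) / area_c

print(giou_loss((0, 0, 10, 10), (0, 0, 10, 10)))   # 0.0: perfect overlap
print(giou_loss((0, 0, 10, 10), (20, 0, 30, 10)))  # > 1: disjoint boxes still get signal
```

DIoU and CIoU follow the same pattern with extra centre-distance and aspect-ratio penalty terms added to this loss.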
Distribution Focal Loss (DFL) is a specific 2020-era refinement. Rather than regress a single scalar (e.g. the distance to the left side of the box), the network predicts a discrete distribution over possible values; the predicted value is the expectation, and the loss encourages the distribution to concentrate on the correct value while staying smooth. This is useful for boundary-ambiguous objects — e.g. a dog's tail — where the exact box edge is genuinely uncertain. DFL is used in GFL and in the YOLOv6–YOLOv8 family.
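A scalar sketch of the DFL decode-and-loss step, assuming integer bins and a fractional target (the logits below are illustrative):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def dfl(logits, target):
    """Distribution Focal Loss over integer bins 0..n: the regressed value
    is the expectation of softmax(logits); the loss pushes probability mass
    onto the two bins bracketing the (generally fractional) target."""
    p = softmax(logits)
    value = sum(i * pi for i, pi in enumerate(p))   # decoded prediction
    lo = int(math.floor(target))
    hi = lo + 1
    loss = -((hi - target) * math.log(p[lo]) + (target - lo) * math.log(p[hi]))
    return value, loss

# Mass concentrated on bins 2 and 3 decodes to a value near the target 2.4.
logits = [-9.0, -9.0, 2.0, 1.6, -9.0]
value, loss = dfl(logits, 2.4)
print(round(value, 2))  # 2.4
```

A flat distribution over the two bracketing bins still decodes correctly, which is exactly the uncertainty-friendly behaviour that motivates DFL for ambiguous boundaries.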
Mask losses mostly come down to per-pixel binary cross-entropy or a Dice-loss variant. Dice loss = 1 − 2·|A ∩ B| / (|A| + |B|) is essentially F1 at the pixel level; it is more robust to class imbalance than per-pixel BCE for very-small-object segmentation (cell microscopy, crack detection). Modern segmentors often use a weighted sum of BCE and Dice.
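A pure-Python sketch of the weighted BCE + Dice combination on flattened masks (the 50/50 weights and the tiny example masks are illustrative):

```python
import math

def bce_dice_loss(pred, target, eps=1e-6, w_bce=0.5, w_dice=0.5):
    """Weighted sum of per-pixel BCE and soft Dice on flat lists of
    probabilities / binary labels. Dice = 2|A∩B| / (|A| + |B|) ignores
    the sea of background pixels, so it stays informative when the
    foreground object is tiny."""
    bce = -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
               for p, t in zip(pred, target)) / len(pred)
    inter = sum(p * t for p, t in zip(pred, target))
    dice = (2 * inter + eps) / (sum(pred) + sum(target) + eps)
    return w_bce * bce + w_dice * (1 - dice)

# A tiny-object mask: 2 foreground pixels out of 8.
target = [0, 0, 0, 1, 1, 0, 0, 0]
good = [0.01, 0.01, 0.01, 0.95, 0.95, 0.01, 0.01, 0.01]
bad  = [0.01, 0.01, 0.01, 0.05, 0.05, 0.01, 0.01, 0.01]
print(bce_dice_loss(good, target) < bce_dice_loss(bad, target))  # True
```

Note how the Dice term barely moves when background predictions change; that asymmetry is the whole point for small-object segmentation.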
The overall message is that detection accuracy is bottlenecked roughly equally by architecture and by loss/assignment design. A ResNet-50 Mask R-CNN with 2017-era losses scores around 37 AP on COCO; with 2022-era losses and assignment (DFL, GIoU, ATSS, an appropriate focal variant) on the same backbone it reaches ~43 AP. Before reaching for a new architecture, try a new loss function; it is the cheapest experiment.
The DEtection TRansformer (Carion et al., 2020) is the architectural equivalent of taking the Transformer chassis from NLP (Part VI Chapter 04) and transplanting it directly into detection. DETR eliminated two of the classical pipeline's hand-engineered pieces — anchor boxes and non-maximum suppression — in favour of a transformer decoder whose outputs are a fixed-size set of object predictions, matched to ground-truth via bipartite matching during training.
DETR's architecture is: a CNN backbone produces a single feature map; a transformer encoder refines it; a transformer decoder takes N=100 learned object queries as its input and produces N output vectors via cross-attention to the encoder features; a small MLP head on each output produces (class, box). At training time, Hungarian matching pairs each prediction with at most one ground-truth box, minimising a cost that combines classification and box losses. Unmatched predictions are trained against a "no object" class, and unmatched ground truths are ignored.
The elegance of DETR is that the whole detector is a single end-to-end differentiable pipeline with no hand-designed post-processing, no anchor boxes, and no NMS. Its downside — on release — was convergence: vanilla DETR took 500 training epochs to reach parity with Faster R-CNN, where a classical detector needs 36 epochs. The bottleneck was the learned queries' slow convergence on where to attend.
Bipartite matching deserves a moment of explanation because it is the key training-time construct of DETR and its descendants. Let N predictions and M ground-truth boxes be given (with M ≤ N). Hungarian matching finds a one-to-one assignment π: {1..M} → {1..N} that minimises Σ_i cost(π(i), i), where the cost combines classification and box-regression terms. This is computed per image at every training step; the predictions not matched to any ground truth get the "no object" target. The result is that at training time there is no positive/negative ambiguity and no duplicate predictions per ground truth — the set-prediction formulation has this built in.
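Production code uses `scipy.optimize.linear_sum_assignment` for this step; the brute-force sketch below (feasible only for tiny M and N) just makes the objective explicit, with an illustrative cost matrix:

```python
from itertools import permutations

def hungarian_match(cost):
    """Exhaustive bipartite matching for M ground truths x N predictions,
    where cost[i][j] is the cost of assigning prediction j to ground
    truth i. Real code uses scipy.optimize.linear_sum_assignment; this
    O(N!) version only exists to make the objective explicit."""
    m, n = len(cost), len(cost[0])
    best, best_assign = float("inf"), None
    for perm in permutations(range(n), m):   # injective gt -> prediction map
        total = sum(cost[i][perm[i]] for i in range(m))
        if total < best:
            best, best_assign = total, perm
    return best_assign, best

# 2 ground truths, 3 predictions; the unmatched prediction (index 2)
# would be trained against the "no object" class.
cost = [[0.9, 0.1, 0.8],
        [0.2, 0.7, 0.6]]
assign, total = hungarian_match(cost)
print(assign, round(total, 2))  # (1, 0) 0.3  -- gt0 -> pred1, gt1 -> pred0
```

The one-to-one constraint is what removes the need for NMS: no two predictions can be trained toward the same ground truth.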
DETR generalises cleanly to segmentation. The same decoder can produce per-query mask embeddings that, dotted with a per-pixel feature map, give a mask for each detected instance — this is the core idea of Mask2Former (§14). It also generalises to panoptic segmentation and video instance segmentation with the same query mechanism, which is why the transformer-detector family is the common ancestor of essentially every post-2022 mask-prediction architecture.
Practically, a 2026 detector is probably either a DETR variant (DINO, RT-DETR, MaskDINO for segmentation) or a YOLO-v8+/RTMDet-style efficient dense prediction head. The DETR family is preferred when accuracy matters most (top of COCO is uniformly DETR-descended); the dense-head family is preferred for real-time deployment. The architectural convergence is clear: both are anchor-free, both use multi-scale features, both use dynamic label assignment, and both share backbones with classification.
Semantic segmentation assigns a class label to every pixel of an image. The task is simpler to specify than instance or panoptic segmentation — no instance identities, just a per-pixel class map — but it still requires producing outputs at the input image's spatial resolution, which is a very different architectural problem from classification.
FCN — "Fully Convolutional Networks" (Long, Shelhamer & Darrell, 2015) — is the foundational architecture. It takes a classification network (VGG-16), replaces the final fully-connected layers with 1×1 convolutions, and upsamples the coarse output feature map back to input resolution via transposed convolutions. Skip connections from earlier layers provide fine-grained spatial detail that the deep coarse feature map alone lacks. FCN-8s (with three levels of skip connections) was the first widely-adopted architecture to show that end-to-end CNN training could match or exceed classical CRF-based segmentation methods.
The encoder-decoder pattern generalised this. The encoder is a classification-style backbone that progressively downsamples; the decoder is a mirror that progressively upsamples. SegNet (Badrinarayanan et al., 2015) kept track of max-pool indices during encoding and reused them during decoder upsampling for sharper boundaries. U-Net (Ronneberger, Fischer & Brox, 2015) added symmetric skip connections that concatenate encoder features with decoder features at matching resolutions, giving the decoder direct access to fine-grained spatial information. U-Net was developed for biomedical segmentation and remains the default architecture for that domain; variants appear everywhere from satellite imagery to diffusion-model backbones.
The transformer era brought new architectures. SETR (Zheng et al., 2021) was a straightforward application: a ViT encoder on image patches, then a decoder that upsampled the token representations back to pixel-level predictions. SegFormer (Xie et al., 2021) introduced a hierarchical transformer backbone (multiple stages with decreasing resolution, similar to Swin) and a lightweight MLP decoder that aggregates across scales. Segmenter (Strudel et al., 2021) used a transformer decoder with class tokens to produce per-class segmentation masks in a DETR-like formulation.
Modern unified architectures like Mask2Former (§14) treat semantic segmentation as a special case of mask classification: every semantic class corresponds to a single "mask" prediction covering all pixels of that class. This is architecturally identical to how the same model handles instance and panoptic segmentation — just with the matching rule and the post-processing adjusted. As of 2026 Mask2Former-style unified architectures have replaced task-specific semantic-segmentation models for most benchmarks.
For deployment, the choice is still often a simple encoder-decoder like DeepLab-v3+ or SegFormer: these are lighter than Mask2Former, run at real time on modest hardware, and provide solid accuracy for the common case where every pixel simply needs a class label. The research frontier has moved on, but the production-reality floor is firmly in the convolutional-encoder-decoder family.
Instance segmentation combines detection and segmentation: for each object in the image, predict both a bounding box and a pixel-accurate mask. Two objects of the same class must receive distinct instance IDs. The task is strictly harder than either detection or semantic segmentation because it requires both object-level understanding and pixel-level precision.
Mask R-CNN (He et al., 2017) is the canonical two-stage instance segmentor. Architecturally it extends Faster R-CNN with a parallel mask head: for each RoI, in addition to the (class, box) predictions, a small FCN produces an m×m binary mask (28×28 in the paper) for each of the K classes; only the predicted class's mask is kept at inference. The combination of RoIAlign (§5) and the parallel mask branch gives mask AP that sits 3–5 points below the corresponding box AP, which is roughly the quality that downstream applications need for most use cases.
One-stage and anchor-free instance segmentors developed in parallel. YOLACT (Bolya et al., 2019) — "You Only Look At CoefficienTs" — generates a set of prototype masks (global mask candidates) for the whole image, and each detected instance's mask is produced as a linear combination of prototypes with predicted per-instance coefficients. The decoupling of "where masks live" (prototypes, computed once) from "which prototypes does this instance use" (coefficients, per-detection) makes YOLACT real-time on a GPU.
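The prototype-plus-coefficients idea can be sketched in a few lines (toy 4×4 prototypes and hand-picked coefficients; real YOLACT learns ~32 prototype maps at much higher resolution):

```python
import math

def assemble_mask(prototypes, coeffs, threshold=0.5):
    """YOLACT-style mask assembly: an instance's mask is the sigmoid of a
    linear combination of shared prototype maps, one coefficient per
    prototype. Prototypes are computed once per image; only the tiny
    coefficient vector is per-detection."""
    h, w = len(prototypes[0]), len(prototypes[0][0])
    mask = []
    for y in range(h):
        row = []
        for x in range(w):
            logit = sum(c * p[y][x] for c, p in zip(coeffs, prototypes))
            row.append(1 if 1 / (1 + math.exp(-logit)) > threshold else 0)
        mask.append(row)
    return mask

# Two 4x4 prototypes: one activates on the left half, one on the right.
left  = [[3, 3, -3, -3]] * 4
right = [[-3, -3, 3, 3]] * 4
# An instance whose coefficients select only the left prototype:
print(assemble_mask([left, right], [1.0, 0.0]))  # every row is [1, 1, 0, 0]
```

Because the expensive part (the prototypes) is shared across all detections, adding another instance costs only a dot product per pixel, which is what makes the method real-time.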
CondInst (Tian et al., 2020) and BlendMask (Chen et al., 2020) are further refinements in the "dynamic head" family: the detection network predicts not just a box and class per instance, but also a set of dynamic convolution weights that are used to produce the instance's mask from a shared feature map. This gives mask quality comparable to Mask R-CNN at one-stage speed, without the hand-designed RoI operator.
The transformer family unified instance segmentation with detection through MaskFormer (Cheng et al., 2021) and Mask2Former (Cheng et al., 2022). MaskFormer reframes instance segmentation as mask classification: the decoder produces a fixed set of binary mask predictions plus per-mask class labels, and the task's inductive biases are encoded in the loss and matching rather than the architecture. Mask2Former added masked cross-attention — each query attends only within its predicted mask region — which dramatically improves convergence and accuracy. These architectures are the current state-of-the-art on COCO, LVIS, and Cityscapes instance-segmentation benchmarks.
For practitioners, the choice is typically: Mask R-CNN (or its Detectron2 variant) for easy integration and wide tooling; YOLOv8-seg / YOLACT-descendants for real-time inference on commodity hardware; Mask2Former or its descendants for accuracy-critical applications at research cost. LVIS (1200 classes, long-tailed) has become a more demanding benchmark than COCO, and ranking on LVIS is a better proxy for real-world long-tail performance.
Panoptic segmentation (Kirillov et al., 2019) unifies semantic and instance segmentation into a single task: every pixel gets a (class, instance-id) pair. For thing classes (countable objects like cars, people), each instance gets a unique ID; for stuff classes (uncountable regions like sky or road), all pixels share a single ID. The result is a complete scene parse — every pixel explained, with instances for things and coherent regions for stuff.
The task was introduced specifically to push the field toward unified architectures. Before 2019 a production vision system wanting scene understanding would run a semantic segmentor and an instance segmentor in parallel and merge their outputs post-hoc — an expensive and inelegant solution. Panoptic segmentation gave the field a single benchmark that required producing coherent unified output.
Panoptic FPN (Kirillov et al., 2019) was the first widely-cited architecture: a Mask R-CNN head for things and a semantic-segmentation head for stuff, both sharing an FPN backbone, with a small merging step. UPSNet and Panoptic-DeepLab added learned logic for resolving conflicts between the two heads (what happens when a thing mask overlaps with a stuff prediction of a different class).
A follow-up, OneFormer (Jain et al., 2023), trains a single model on all three tasks simultaneously with a task token that conditions the decoder. At inference time, switching the task token changes the output format without retraining. This is the foundation-model direction for segmentation: a single weight set that can be prompted for any segmentation task.
The PQ metric (§3) enforces the coherence requirement: a panoptic segmentor must both correctly label each pixel and produce non-overlapping, complete coverage. Classical post-hoc merging of a detector and semantic segmentor tends to score poorly on PQ because the two outputs can disagree on overlapping pixels; unified architectures avoid this by construction.
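Given a matching, PQ itself is one line of arithmetic; the sketch below assumes the IoU > 0.5 matches have already been found (the counts and IoU values are illustrative):

```python
def panoptic_quality(matched_ious, n_fp, n_fn, iou_thresh=0.5):
    """Panoptic Quality: predicted and ground-truth segments of the same
    class match when IoU > 0.5 (such a match is provably unique).
    PQ = (sum of TP IoUs) / (|TP| + 0.5 |FP| + 0.5 |FN|), which factors
    into segmentation quality (mean TP IoU) x recognition quality (an
    F1-style detection score)."""
    tp = [iou for iou in matched_ious if iou > iou_thresh]
    sq = sum(tp) / len(tp) if tp else 0.0
    rq = len(tp) / (len(tp) + 0.5 * n_fp + 0.5 * n_fn)
    return sq * rq

# 3 matched segments, one spurious prediction (FP), one missed object (FN):
# SQ = 0.8, RQ = 0.75, PQ = 0.6.
print(round(panoptic_quality([0.9, 0.8, 0.7], n_fp=1, n_fn=1), 3))  # 0.6
```

The SQ × RQ factorisation is why overlapping or incoherent outputs score poorly: every disputed pixel either breaks a match (hurting RQ) or degrades an IoU (hurting SQ).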
Panoptic segmentation is used in autonomous driving (where the whole scene matters), robotic scene understanding, and augmented reality. It is less commonly deployed than semantic or instance segmentation alone, largely because the extra capability has engineering cost without always matching an application need. The task's largest impact has been conceptual: forcing the field to build architectures that handle the full structure of a scene rather than one task at a time.
The Segment Anything Model (SAM, Kirillov et al., 2023) marked a phase change in segmentation: instead of training one model per dataset and task, SAM is a single foundation model that produces a mask for any object a user points at. The model is trained on ~1 billion masks across 11 million images (the SA-1B dataset), and it segments objects in image domains it has never seen — microscopy, satellite imagery, medical scans — often zero-shot.
SAM's architecture is a clean three-part design. An image encoder (a large ViT) runs once per image and produces a dense feature embedding. A prompt encoder turns user inputs — points, boxes, or rough masks — into embedding vectors. A lightweight mask decoder combines the two and produces one or more candidate masks in ~50 ms. The image encoder is expensive (hundreds of milliseconds on a GPU) but only runs once; after that, every prompt is near-instant. This makes SAM practical as an interactive tool — click on a part of an image, get a mask back immediately.
The dataset — SA-1B — was assembled via a data engine: an initial SAM was trained, used to propose masks on unlabelled images, human annotators corrected the proposals, and the corrected data was used to train the next SAM. Three rounds of this bootstrap produced the final billion-mask training set. The scale is ~400× larger than any prior segmentation dataset and is the primary reason SAM generalises as broadly as it does.
SAM 2 (Ravi et al., 2024) extended the model to video. The key addition is a memory mechanism: a running summary of which objects have been seen, prompted at which frames, with what masks, letting the model maintain instance identity across frames. SAM 2 segments a video object from a single-frame prompt (click on the dog in frame 1, track it through 10 000 frames). This brings promptable segmentation into the video-understanding domain covered in Chapter 04.
Beyond SAM specifically, the promptable-segmentation paradigm is influential. SEEM (Segment Everything Everywhere Model) added text prompts and audio prompts. SAM-HQ (High Quality SAM) added a boundary-refinement module. Semantic SAM handles hierarchical segmentations (one click can return parts-of-parts). The general design — a large image encoder, a prompt-conditioned mask decoder, trained on a very large mask dataset — has become the template for vision foundation models.
Practically, SAM has replaced a great deal of task-specific segmentation code. Many vision pipelines in 2025–2026 use SAM for the segmentation step and restrict task-specific training to the classifier on top of the segmented region. For interactive annotation, SAM has also cut the time to label a new dataset by 5–10×, which has downstream effects across every vision benchmark.
A classical detector is trained on a fixed set of categories — COCO's 80, LVIS's 1200 — and cannot produce a prediction for any class outside that set. Open-vocabulary detection and segmentation loosen this restriction by accepting a category described in free-text at inference time: "red bicycle", "stop sign", "Van Gogh painting", anything the user types. The bridge from vision to text is typically a CLIP-style joint embedding.
CLIP (Radford et al., 2021, covered further in Chapter 06) trained separate image and text encoders with a contrastive loss over 400 million image-text pairs. The resulting embedding space aligns images with their descriptive text. Open-vocabulary detectors use CLIP-style text embeddings as their classifier weights: instead of a fixed K-way classification head, the detector embeds text prompts for the categories of interest and uses the dot product between region embeddings and text embeddings as the classification score.
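The scoring step can be sketched with toy vectors standing in for CLIP embeddings (the prompts, vectors, and temperature below are all illustrative, not real CLIP outputs):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def classify_region(region_emb, text_embs, names, temperature=0.01):
    """Open-vocabulary classification: score a detected region's embedding
    against text embeddings of arbitrary category prompts via cosine
    similarity, then softmax. Swapping the prompt list changes the
    vocabulary with no retraining."""
    sims = [cosine(region_emb, t) for t in text_embs]
    exps = [math.exp(s / temperature) for s in sims]
    z = sum(exps)
    probs = [e / z for e in exps]
    return max(zip(names, probs), key=lambda kv: kv[1])

region = [0.9, 0.1, 0.2]                               # toy region embedding
texts = {"a photo of a bicycle": [0.88, 0.12, 0.18],   # toy text embeddings
         "a photo of a dog":     [0.1, 0.9, 0.3]}
name, p = classify_region(region, list(texts.values()), list(texts.keys()))
print(name)  # a photo of a bicycle
```

The detector's fixed K-way softmax head has been replaced by a dot product against whatever text the user supplies, which is the entire architectural change.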
ViLD (Gu et al., 2021) was the first widely-cited open-vocabulary detector. It distils knowledge from a pretrained CLIP image encoder into a detector's RoI features via an auxiliary loss, aligning the detector's region embeddings with CLIP image embeddings. OWL-ViT (Minderer et al., 2022) is simpler: use a ViT backbone trained jointly with a text encoder (LiT-style pretraining), and predict boxes and their CLIP-aligned embeddings end-to-end. At inference, embed any category text and retrieve the matching boxes.
Open-vocabulary segmentation follows the same pattern. ODISE (Xu et al., 2023) uses diffusion-model features as the dense encoder, then open-vocabulary mask classification on top. OV-Seg trains a CLIP-aligned mask classifier; X-Decoder and SEEM unify many tasks (grounding, referring-expression segmentation, panoptic segmentation, captioning) into a single decoder. The common architectural idea is: produce mask proposals in a class-agnostic way, then classify each proposal against an open-vocabulary CLIP text embedding.
An important sub-problem is the long tail. On LVIS (1200 classes with power-law frequency), open-vocabulary methods often outperform closed-vocabulary counterparts on rare classes because they benefit from CLIP's web-scale pretraining even for classes with a handful of detection annotations. This is the clearest empirical win of the open-vocabulary approach: for the head of the distribution, closed-vocabulary methods still dominate, but the gap closes (and can reverse) for rare classes.
The practical shape of open-vocabulary vision in 2026 is: Grounding DINO or OWLv2 for detection, SAM for segmentation, often chained so a text query produces a detection that is then refined into a mask. The composed system — call it "detect-then-segment" — is the closest current implementation of a general-purpose vision interface, where the user describes what they want in text and the system localises it pixel-accurately.
A research checkpoint of Mask R-CNN or Mask2Former is rarely production-ready: it runs at 5–15 fps on a server GPU, consumes several gigabytes of memory, and expects FP32 inputs. Real applications need detectors and segmentors running at 30+ fps on mobile chips, embedded accelerators, or in-browser WASM. The efficiency toolkit — distillation, quantisation, pruning, and format conversion — bridges the gap.
Knowledge distillation trains a small student detector to match a larger teacher's predictions. For detection specifically, feature-level distillation (student feature maps mimic teacher feature maps, often with attention masks that focus on object regions) works better than pure output-level distillation, because detection outputs are sparse and very sensitive to NMS. LD (Localisation Distillation, Zheng et al., 2022) distils the distribution-focal-loss representations. A modern distilled mobile detector — YOLO-NAS, RTMDet-Tiny — retains ~90% of the teacher's AP at 20% of the compute.
Quantisation converts weights and activations from 32-bit floats to lower-precision formats. INT8 (8-bit signed integers) is the de-facto standard for CPU/mobile inference via TensorRT, ONNX Runtime, and CoreML; it typically yields 3–4× speedup with ≤1 AP loss. FP16 (half precision) is near-lossless and standard for GPU inference. INT4 quantisation is an active research frontier — useful for transformers but often fragile for detection, where the spread of activation magnitudes across the multi-scale feature pyramid is hostile to aggressive quantisation. Quantisation-aware training (QAT) simulates the quantisation during training and usually recovers most of the post-training-quantisation accuracy gap.
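Symmetric per-tensor INT8 quantisation is simple enough to sketch directly (post-training, no calibration set; real toolchains add per-channel scales and activation calibration):

```python
def quantize_int8(weights):
    """Post-training symmetric INT8 quantisation of one weight tensor:
    a single scale per tensor, q = round(w / scale), clamped to
    [-127, 127]. Dequantisation multiplies back by the scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.02, -0.51, 0.33, 1.27, -0.99]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, recovered))
print(q)                      # small integers in [-127, 127]
print(max_err <= scale / 2)   # rounding error bounded by half a step: True
```

The half-step error bound is exactly the failure mode in detection: the multi-scale pyramid's activation magnitudes vary widely, so a single per-tensor scale wastes most of the INT8 range on outlier channels, which is why per-channel scales and QAT matter.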
The deployment format question matters more than people think. Most detectors are trained in PyTorch or JAX; most deployments run on TensorRT (NVIDIA GPUs), ONNX Runtime (CPU, cross-platform), OpenVINO (Intel), CoreML (Apple), or TFLite (mobile). Converting a PyTorch checkpoint to any of these formats involves tracing the graph, handling operators that the target runtime supports differently (NMS, RoIAlign, multi-scale feature gathering), and validating that the converted model's outputs match the original. The toolchain has improved: torch.compile, ONNX Runtime's dynamic shapes, and TensorRT's plugin system cover most cases in 2026, but "exports cleanly to TensorRT" is still a non-trivial architectural constraint.
Recent efficient-detection research has focused on architectures specifically designed to quantise and export cleanly. RTMDet (Lyu et al., 2022) is a representative: a CSPNeXt backbone, large-kernel convolutions, and an anchor-free head with dynamic label assignment, all designed so every operator maps to a supported TensorRT/ONNX primitive. On COCO it achieves ~50 AP at real-time speeds, matching YOLOv8 while being cleaner architecturally.
For promptable and open-vocabulary models, the efficiency picture is different. SAM's image encoder is the bottleneck; distilled versions (MobileSAM, FastSAM) replace it with a small ViT or CNN while keeping the same prompt decoder, giving 50–100× speedups with modest quality loss. This pattern — distil the heavy encoder but keep the lightweight task-specific head — generalises across the vision-foundation-model landscape.
The chapter's architectures and techniques only produce useful systems when embedded in a full ML lifecycle: choosing and curating a dataset, labelling it, training, evaluating, iterating, and shipping. The operational picture is almost as important as the architectural one.
The canonical datasets are COCO (80 classes, 330k images, ~1.5M instances — the standard research benchmark), LVIS (1200+ classes, long-tailed, the more rigorous modern benchmark), Open Images (600 classes, 9M images, larger but noisier), Objects365 (365 classes, 2M images), and task-specific ones: Cityscapes (autonomous driving), ADE20K (scene parsing), BDD100K (driving video), Mapillary Vistas (street view). Each has its own annotation format and quirks; the COCO JSON format is the lingua franca that most tooling supports.
Annotation tooling is a bottleneck for most real-world applications. Bounding boxes take ~5–10 seconds per box; polygon masks take 30–60 seconds. A dataset of 10 000 images with 10 masks each is weeks of annotation work. Modern tools — CVAT, LabelStudio, Roboflow, V7 — accelerate this with model-in-the-loop pre-labelling (often using SAM or a small pretrained detector), active learning, and quality-control workflows. Annotation cost has dropped by ~5–10× since SAM's release, enabling datasets that were previously impractical.
Evaluation beyond mAP is increasingly important. Real deployments care about per-class accuracy (some classes are safety-critical), per-size accuracy (can we reliably detect the cyclist 80 m away?), calibration (are 0.8-confidence predictions correct 80% of the time?), failure modes (what does the model do on out-of-distribution inputs?), and throughput / latency (not just FPS, but p95 latency). A COCO mAP score is a useful summary statistic but not a sufficient basis for deployment.
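Calibration in particular is cheap to measure. A minimal expected-calibration-error sketch over detection confidences (the bin count and toy data are illustrative):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected Calibration Error: bin predictions by confidence and
    average |accuracy - mean confidence| per bin, weighted by bin size.
    A perfectly calibrated detector has ECE 0: its 0.8-confidence boxes
    are correct 80% of the time."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(acc - avg_conf)
    return ece

# Four detections at 0.8 confidence, 3 of them true positives:
# the bin's accuracy is 0.75 vs. confidence 0.8, so ECE = 0.05.
print(round(expected_calibration_error([0.8] * 4, [1, 1, 1, 0]), 3))  # 0.05
```

For a detector, "correct" is typically defined per prediction as matching a ground-truth box above an IoU threshold, so the same TP/FP bookkeeping used for AP feeds directly into the calibration check.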
Detection and segmentation occupy a specific role in the larger Part VII pipeline. Chapter 02 provides the backbones; this chapter turns backbones into parsing primitives. Chapter 04 (video) extends detection/segmentation to temporal sequences. Chapter 05 (3-D vision) builds on 2-D detection with depth and multi-view reasoning. Chapter 06 (vision-language) closes the loop by letting detectors and segmentors be driven by language queries — which, as §15–§16 showed, is where the architectural convergence is most advanced. A reader finishing this chapter has roughly the ingredients needed to build a full scene-understanding system; the remaining chapters add time, depth, and language.
Closing thought: the task of detection in 2026 is unrecognisable from the task in 2014. R-CNN's 2000 region proposals per image and SIFT-era feature descriptors have been replaced by promptable foundation models that can detect and segment arbitrary text-described objects at real-time speeds. The research frontier now asks whether detection-as-a-task is even a useful abstraction, or whether it should be absorbed entirely into multimodal foundation models. The tooling, datasets, and deployment infrastructure are mature; the architectural convergence is nearly complete. The frontier is less about new detectors and more about what detection looks like inside a larger model that also reads text, generates images, and reasons about scenes.
Detection and segmentation have produced some of the most cited papers in deep learning — R-CNN, Faster R-CNN, Mask R-CNN, YOLO, DETR, SAM — and a substantial engineering literature that surrounds them. The selections below trace the two-stage, one-stage, transformer-detector, and segmentation families, plus the open-vocabulary and promptable-foundation-model frontier that has emerged since 2022. Software references point to the research frameworks (Detectron2, MMDetection, Ultralytics) that most practitioners start from and the foundation-model APIs (SAM, Grounding DINO) that many systems now depend on.