Part VII · Computer Vision · Chapter 03

Object detection and instance segmentation, the vision tasks that move beyond "what is in this image?" to "what is where, and exactly which pixels does it occupy?" — the story of bounding-box regressors (R-CNN, YOLO, SSD), transformer-based detectors (DETR), pixel-accurate mask predictors (Mask R-CNN, SOLO), unified panoptic models (Mask2Former), and the promptable foundation models (SAM, Grounding DINO) that turned detection into a generic perception primitive.

Detection and segmentation sit one level above classification: instead of naming a whole image, they localise and delineate every object of interest. The task has two historical lineages. The two-stage lineage — R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN — first proposes candidate regions and then classifies and refines them. The one-stage lineage — SSD, YOLO, RetinaNet — predicts boxes and classes densely on a grid in a single pass, trading a small amount of accuracy for an order-of-magnitude speed-up. Anchor-free detectors (FCOS, CenterNet) removed the hand-designed anchor boxes; DETR (2020) removed hand-designed non-maximum suppression by casting detection as set prediction with a transformer decoder. On the segmentation side, FCN and U-Net established dense per-pixel prediction; DeepLab added atrous convolutions and pyramid pooling; Mask R-CNN added per-instance masks to two-stage detectors; Mask2Former and similar unified architectures now treat semantic, instance, and panoptic segmentation as one problem. The current frontier is promptable and open-vocabulary perception — SAM segments anything you point at, Grounding DINO and OWL-ViT detect objects named by free-text queries, and the line between detection, segmentation, and vision-language has blurred. This chapter follows that arc: the task landscape and metrics, the classical two-stage and one-stage families, anchor-free and transformer detectors, the segmentation hierarchy, and the promptable / open-vocabulary frontier.

How to read this chapter

Sections one through four establish the problem. Section one is why detection and segmentation matter — the move from whole-image recognition to structured spatial outputs, and the argument that detection is the parsing layer that turns pixels into entities. Section two is the task taxonomy: classification, localisation, bounding-box detection, semantic segmentation, instance segmentation, panoptic segmentation, and keypoint detection — what each outputs, and where the hard parts are. Section three is evaluation metrics: intersection-over-union, average precision at a fixed IoU threshold, mean AP across thresholds and categories, the COCO protocol, and the panoptic-quality metric. Section four is bounding-box encodings and anchor boxes — centre-size vs. corner formats, anchor priors, aspect-ratio scales, and the IoU-based matching that turns detection into a per-anchor classification problem.

Sections five through nine walk through the detection lineage. Section five is the two-stage family — R-CNN, SPP-net, Fast R-CNN, Faster R-CNN, and the RoIAlign refinement that made Mask R-CNN possible. Section six introduces one-stage detectors — the SSD architecture, the YOLO-v1 grid-prediction idea, and the accuracy/speed trade-off. Section seven is the YOLO family — from v2 and v3 through the Ultralytics YOLOv5 / v8 and successors — a case study in how a single-pass architecture absorbed a decade of ideas. Section eight is anchor-free detection — FCOS, CenterNet, CornerNet, RepPoints — which removed the hand-engineered anchor priors. Section nine is feature pyramids — FPN, PANet, BiFPN, NAS-FPN — the multi-scale feature fusion shared by almost every modern detector.

Sections ten and eleven close out the detection story. Section ten is detection losses and label assignment — the focal loss that made one-stage detectors trainable, the IoU-family regression losses (IoU, GIoU, DIoU, CIoU), and dynamic label assignment (ATSS, OTA, SimOTA). Section eleven is the transformer era — DETR's set-prediction formulation, Deformable DETR's multi-scale attention, DINO's denoising queries, and RT-DETR's real-time variant.

Sections twelve through fourteen cover segmentation. Section twelve is semantic segmentation — FCN, U-Net, SegNet, the DeepLab family with atrous convolutions and ASPP, and the modern transformer segmentors (SegFormer, Segmenter). Section thirteen is instance segmentation — Mask R-CNN's per-RoI masks, YOLACT's prototype masks, SOLO's position-conditioned masks, and CondInst / BlendMask. Section fourteen is panoptic segmentation — stuff vs. things, Panoptic FPN, and the unified Mask2Former / OneFormer architectures that treat every segmentation task as mask classification.

Section fifteen is the promptable segmentation frontier: the Segment Anything Model (SAM) and its video-capable successor SAM 2, the heavy-image-encoder / light-prompt-decoder split, and how a single model trained on a billion masks can be prompted with a click, a box, or a mask. Section sixteen is open-vocabulary detection and segmentation — CLIP-aligned classifiers, GLIP, Grounding DINO, OWL-ViT, and the unified open-vocabulary segmentors (ODISE, OV-Seg). Section seventeen is the efficient-deployment toolkit: edge detectors, knowledge distillation, INT8/FP16 quantisation, TensorRT and ONNX export, and the engineering gap between a research checkpoint and a 30-fps phone model. The closing section — detection and segmentation in the ML lifecycle — is the operational picture: annotation tools and dataset formats (COCO, LVIS, Open Images), the timm/torchvision/Detectron2/MMDetection ecosystems, and how detection and segmentation plug into the larger vision pipeline covered in the rest of Part VII.

Contents

  1. Why detection and segmentation matter — From whole-image labels to localised, structured spatial predictions
  2. The task taxonomy — Classification, detection, semantic/instance/panoptic segmentation, keypoints
  3. Evaluation metrics — IoU, AP, mAP, the COCO protocol, panoptic quality, boundary metrics
  4. Bounding boxes and anchors — Box encodings, anchor priors, IoU matching, positive/negative assignment
  5. Two-stage detectors — R-CNN, SPP-net, Fast R-CNN, Faster R-CNN, RoIPool and RoIAlign
  6. One-stage detectors — SSD, YOLO-v1, the dense-prediction grid, and the speed/accuracy trade-off
  7. The YOLO family — YOLO-v2 through v8 and beyond — a decade of single-pass detectors
  8. Anchor-free detection — FCOS, CenterNet, CornerNet, RepPoints — removing hand-designed anchors
  9. Feature pyramids — FPN, PANet, BiFPN, NAS-FPN — multi-scale feature fusion
  10. Detection losses and label assignment — Focal loss, GIoU / DIoU / CIoU, ATSS, SimOTA, DETR bipartite matching
  11. DETR and transformer detectors — Set prediction, cross-attention, Deformable DETR, DINO queries, RT-DETR
  12. Semantic segmentation — FCN, U-Net, DeepLab, atrous / ASPP, SegFormer, Segmenter
  13. Instance segmentation — Mask R-CNN, YOLACT, SOLO, CondInst, BlendMask — per-object masks
  14. Panoptic segmentation — Stuff vs. things, Panoptic FPN, Mask2Former, OneFormer — unified mask prediction
  15. Promptable segmentation — SAM, SAM 2, the prompt encoder / mask decoder split, video segmentation
  16. Open-vocabulary detection and segmentation — GLIP, Grounding DINO, OWL-ViT, ODISE, OV-Seg — detecting the long tail by text
  17. Efficient deployment — Distillation, INT8 / FP16 quantisation, TensorRT / ONNX, edge detectors
  18. Detection and segmentation in the ML lifecycle — Datasets (COCO, LVIS, Open Images), annotation, toolkits, downstream integration
§1

Why detection and segmentation matter

Image classification answers one question per image: "what is this a picture of?" Most real applications need more. A self-driving car needs to know where each pedestrian and cyclist is, not just whether there are people in the frame. A medical tool needs to outline the tumour's precise boundary, not report "positive". A retail checkout needs to count each item on the belt. These tasks — object detection, which places a bounding box around every object of interest, and segmentation, which assigns a class (and often an instance identity) to every pixel — are the parsing layer of computer vision: they turn a dense grid of pixels into a structured list of entities.

Classification, covered in the previous chapter, gives a single label for a whole image. Detection replaces this with a variable-length output: a set of (class, bounding box, confidence) triples, one per detected object. Segmentation goes further, producing a per-pixel label map: either a single class per pixel (semantic segmentation), a unique instance identity per foreground pixel (instance segmentation), or a unified stuff-and-things labelling (panoptic segmentation). The jump from "one label for an image" to "a variable set of labels localised in space" is what makes the engineering and evaluation of these tasks qualitatively different.

Detection and segmentation inherit the classification backbones from the previous chapter — ResNets, EfficientNets, ViT, Swin, ConvNeXt — and add task-specific heads: region-proposal networks, box regressors, mask decoders. The trend over the last decade has been to unify these heads. In 2015 you would build a different architecture for detection (Faster R-CNN), semantic segmentation (FCN), and instance segmentation (a CRF-plus-detector stack). By 2022 the same transformer backbone with different decoders could handle all of them; by 2024, a single promptable model like SAM could produce segmentation masks for anything a user pointed at, and models like Grounding DINO could detect categories described in free-text language.

Why this is harder than classification. Detection has to produce a set of outputs whose size depends on the image, which requires solving three problems at once: where objects are (localisation), what they are (classification), and how many there are (duplicate suppression). Segmentation compounds this by demanding pixel-accurate boundaries at the resolution of the input image, which requires careful handling of spatial resolution through the network. The error surface is noisier than classification's, the evaluation metrics are harder to optimise directly, and small architectural choices (anchor sizes, RoI alignment, loss weighting) can move benchmark numbers by several points.

The applications that pushed these tasks from curiosity to necessity are, roughly, autonomous driving (where the entire perception stack is detection and segmentation), medical imaging (pixel-accurate tumour / organ boundaries), retail and industrial inspection (counting, sorting, defect detection), robotics (manipulation requires knowing where objects are in 3-D), and content moderation / analysis (detecting weapons, faces, text, logos). Each has produced its own benchmark and its own engineering dialect, but the underlying architectures converged.

This chapter treats detection and segmentation as a single unified topic, because by 2026 they are. The same loss functions, the same backbones, the same transformer decoders, and increasingly the same foundation models serve both. The chapter walks through the historical two-stage and one-stage detector families, the segmentation hierarchy, and the modern transformer-based and promptable architectures that have replaced most of the older task-specific designs.

§2

The task taxonomy

Before diving into architectures, it is worth fixing the vocabulary. "Detection" and "segmentation" each split into several subtasks with different outputs, different metrics, and different difficulty levels. Getting the taxonomy straight prevents confusion when reading papers that benchmark on apparently similar but subtly different tasks.

Image classification outputs one label per image (or a score for each of K classes). This is the Part VII Chapter 02 task. Image-level multi-label classification is the same but with multiple positive labels — a photo that contains both a dog and a cat. No spatial information is produced.

Localisation outputs a single bounding box and class for the dominant object in the image. It is an intermediate task between classification and detection — easier than detection because there is exactly one output per image — and is rarely a research target on its own. The ImageNet localisation challenge used it as a transitional benchmark before detection took over.

Object detection outputs a variable-length list of (class, box, score) triples: one for every object instance in the image. A box is typically four numbers — (xmin, ymin, xmax, ymax) or (cx, cy, w, h) in pixels. The standard benchmarks are PASCAL VOC (20 classes, legacy), COCO (80 classes, the modern reference), and LVIS (1200+ classes with a long-tailed distribution). Open Images has several million labelled boxes across 600 classes and is commonly used for pretraining.

Segmentation: stuff, things, and the three flavours. Things are countable objects (people, cars, dogs) that have well-defined instance boundaries. Stuff is amorphous background (sky, grass, road) that is typically described in aggregate. Semantic segmentation assigns each pixel a single class label and does not distinguish instances — two adjacent cars share one "car" region. Instance segmentation labels things with instance IDs (so the two cars are separate), but typically ignores stuff. Panoptic segmentation (Kirillov et al., 2019) unifies both: every pixel is assigned a (class, instance_id) pair, where instance_id is a distinct integer for things and a shared stuff-label for stuff.

Keypoint detection outputs a set of labelled point locations — classically facial landmarks, human body joints, or hand joints. The standard benchmark is the COCO keypoint task (17 human body keypoints). Keypoints share much of the detection machinery: they can be framed as heatmap regression (one heatmap per keypoint class) or as points-on-detected-objects (detect the person, then regress keypoints inside the box).

Two further variants are worth naming. Oriented bounding box detection uses rotated rectangles rather than axis-aligned ones — needed for aerial imagery (DOTA benchmark), text detection, and industrial inspection. 3-D detection outputs cuboids in 3-D space rather than 2-D boxes; it is the standard formulation for autonomous-driving perception and is covered in Chapter 05 (3-D vision). The current chapter restricts attention to 2-D tasks on still images; video-specific detection (tracking, spatio-temporal action localisation) appears in Chapter 04.

The output format of each task determines almost everything about the corresponding architecture. Dense per-pixel tasks (segmentation) need a decoder that preserves or restores spatial resolution. Sparse per-object tasks (detection, keypoint) need a mechanism for proposing candidate locations and suppressing duplicates. The distinction between a dense head and a sparse head is the deepest architectural split in the chapter; once you recognise which side of it a given method is on, the rest of the design choices follow from it.

§3

Evaluation metrics

Detection and segmentation metrics are less obvious than classification's accuracy. A detector can be right about class but slightly off on the box; a segmentor can be right on most pixels but blur every boundary. The metrics were designed to reward the behaviours people actually want.

The foundational primitive is Intersection-over-Union (IoU), also called the Jaccard index. For two regions A and B, IoU = |A ∩ B| / |A ∪ B|, a number between 0 (disjoint) and 1 (identical). For bounding boxes the two regions are the axis-aligned rectangles; for segmentation they are pixel masks. IoU is the standard matching criterion between predictions and ground truth. A predicted box with IoU ≥ 0.5 against a ground-truth box of the same class is usually called a true positive at that threshold.
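The definition above translates directly into a few lines of code. A minimal sketch in plain Python for corner-format boxes (the helper name box_iou is illustrative, not from any particular library):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes in (xmin, ymin, xmax, ymax) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection top-left
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])   # intersection bottom-right
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Two unit squares overlapping in a 1×1 corner give IoU = 1/7: intersection area 1, union 4 + 4 − 1 = 7.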

Average Precision (AP) is the area under the precision-recall curve for a single class at a fixed IoU threshold. Predictions are ranked by confidence, matched to ground-truth boxes in rank order, and P/R are computed at each rank; integrating P over R gives AP. mAP is AP averaged over classes. PASCAL VOC's legacy metric was AP@0.5: AP at IoU threshold 0.5.
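The rank-match-integrate procedure can be sketched as follows. This is a simplified all-point interpolation for one class; the official COCO implementation instead samples precision at 101 fixed recall points, and the input here assumes predictions have already been matched to ground truth (the (confidence, is_true_positive) pairs are an illustrative representation):

```python
def average_precision(scored_tp, num_gt):
    """AP for one class at one IoU threshold.
    scored_tp: list of (confidence, is_true_positive) pairs.
    num_gt:    number of ground-truth boxes for this class."""
    scored_tp = sorted(scored_tp, key=lambda t: -t[0])   # rank by confidence
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in scored_tp:
        tp += 1 if is_tp else 0
        fp += 0 if is_tp else 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # make the precision envelope non-increasing from the right
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):   # integrate P over R
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```

With two ground-truth boxes and the ranked matches (0.9, TP), (0.8, FP), (0.7, TP), the curve integrates to 5/6.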

COCO metrics are the modern default. The COCO protocol averages AP across ten IoU thresholds (0.50, 0.55, …, 0.95) and 80 classes, calling the result mAP (or just AP in COCO tables). It also breaks out AP50 (loose localisation, VOC-style), AP75 (tight localisation), and per-size buckets APS, APM, APL for small / medium / large objects. A detector can have a strong AP50 and a weak AP75, meaning it finds objects but localises them coarsely. Reading "the model achieves 55 AP on COCO" means "55% averaged across the IoU range and class set"; a strong modern detector is around 55–65 AP.

For instance segmentation, the same AP framework applies but with mask IoU instead of box IoU. COCO tables separately report APbb (box AP) and APmask (mask AP) from the same predictions. A good instance segmentor gets mask AP typically 3–5 points below its box AP, reflecting the extra difficulty of pixel-accurate boundaries.

Semantic segmentation uses mean Intersection-over-Union (mIoU): for each class, compute the pixel-wise IoU of the predicted mask against the ground-truth mask across all test images, then average over classes. Pixel accuracy (fraction of correctly classified pixels) is also reported but is dominated by the common classes; mIoU is the more informative metric, especially for long-tailed class distributions. The reference benchmarks are Cityscapes (street scenes, 19 classes), ADE20K (150 classes over diverse scenes), and Pascal Context.

Panoptic segmentation introduced its own metric: Panoptic Quality (PQ). For each class, PQ = (sum of IoUs over matched segments) / (|TP| + 0.5·|FP| + 0.5·|FN|). It factors into a Segmentation Quality (SQ) term — average IoU over matched segments — and a Recognition Quality (RQ) term — an F1 score over segment matches. A model can achieve high PQ only by both correctly identifying each segment and covering it accurately.
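The PQ formula and its SQ × RQ factorisation are simple arithmetic once segments have been matched (in the standard protocol a prediction matches a ground-truth segment of the same class when their IoU exceeds 0.5). A minimal per-class sketch, with the matching itself assumed done upstream:

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ for one class, given IoUs of matched segment pairs and counts of
    unmatched predicted (FP) and unmatched ground-truth (FN) segments.
    Returns (PQ, SQ, RQ) with PQ = SQ * RQ."""
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp if tp else 0.0   # segmentation quality
    rq = tp / denom                              # recognition quality (an F1)
    return sq * rq, sq, rq
```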

Two further details affect reproducibility. Non-maximum suppression (NMS) — removing overlapping predictions of the same class — has knobs (IoU threshold, score threshold) that change reported numbers by a full point. And soft NMS, which reduces rather than removes overlapping boxes' scores, can yield small gains. COCO evaluation has standardised on specific settings, and papers that tune NMS aggressively are sometimes flagged as benchmark-chasing rather than architectural wins.
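Greedy NMS itself is a short loop: keep the highest-scoring surviving box, discard everything of the same class that overlaps it above the IoU threshold, repeat. A minimal sketch (corner-format boxes; the quadratic loop is fine for illustration, production code vectorises it):

```python
def _iou(a, b):
    # IoU of two corner-format boxes, as defined earlier in this section
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy per-class NMS; returns indices of kept boxes, best-first."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if _iou(boxes[i], boxes[j]) < iou_thresh]
    return keep
```

The iou_thresh knob is exactly the reproducibility hazard described above: raising it keeps more near-duplicates, lowering it merges distinct adjacent objects.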

§4

Bounding boxes and anchors

A box is four numbers — but which four, and parameterised how, matters surprisingly much. The anchor-box formulation that dominated detection from 2015 to about 2020 is built entirely on a choice of these parameters, and understanding it is the key to reading any pre-transformer detection paper.

There are two standard bounding-box formats. Corner format — (xmin, ymin, xmax, ymax) — gives the top-left and bottom-right corners in pixel coordinates. Centre-size format — (cx, cy, w, h) — gives the centre point and the width and height. Corner is convenient for IoU computation and for clipping to the image; centre-size is convenient for regression (the centre coordinates scale linearly, while corners are coupled to each other). Most modern detectors regress in centre-size format and convert to corner for evaluation.
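The two formats are interconvertible with a pair of one-line helpers (illustrative names; torchvision ships an equivalent box_convert utility):

```python
def corner_to_center(b):
    """(xmin, ymin, xmax, ymax) -> (cx, cy, w, h)."""
    xmin, ymin, xmax, ymax = b
    return ((xmin + xmax) / 2, (ymin + ymax) / 2, xmax - xmin, ymax - ymin)

def center_to_corner(b):
    """(cx, cy, w, h) -> (xmin, ymin, xmax, ymax)."""
    cx, cy, w, h = b
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```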

The anchor box idea, introduced by Faster R-CNN, is to place a dense grid of pre-defined box shapes ("priors") across the image at a set of spatial locations and scales, and have the detector predict deltas from each anchor rather than absolute box coordinates. At each location on, say, a 50×50 feature map, the network places nine anchors — three scales × three aspect ratios — for a total of 22 500 candidate boxes. The detector then predicts, for each anchor, a classification score and a four-dimensional delta (dx, dy, dw, dh) that shifts and scales the anchor to match a ground-truth box.
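Anchor generation is mechanical: one centre per feature-map cell, one anchor per (scale, aspect-ratio) pair. A minimal sketch — the stride, scales, and the area-preserving ratio convention (r = h/w, area held at scale²) are common choices but not the only ones, and the exact values are dataset-tuned:

```python
def make_anchors(fm_size, stride, scales, ratios):
    """Centre-size anchors (cx, cy, w, h) for a square fm_size x fm_size
    feature map; one anchor per (scale, ratio) pair at every cell centre."""
    anchors = []
    for gy in range(fm_size):
        for gx in range(fm_size):
            cx, cy = (gx + 0.5) * stride, (gy + 0.5) * stride
            for s in scales:
                for r in ratios:            # r = h / w, area kept at s*s
                    w = s / r ** 0.5
                    h = s * r ** 0.5
                    anchors.append((cx, cy, w, h))
    return anchors

# 50x50 map, 3 scales x 3 ratios -> the 22 500 candidates quoted in the text
anchors = make_anchors(50, 16, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0))
```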

Why anchors? Direct regression of absolute box coordinates from image features is difficult: the target distribution is wide and multimodal (there may be many objects at different scales). Anchors decompose the problem. Each anchor takes responsibility for a narrow range of sizes and aspect ratios; the regression target for that anchor becomes small and well-behaved. The cost is a hand-designed prior over what anchor shapes and sizes are worth including — typically tuned for each dataset — and a proliferation of negative anchors for every positive, which requires hard-example mining or focal loss to train.

The standard delta parameterisation (Faster R-CNN) is: tx = (x − xa) / wa, ty = (y − ya) / ha, tw = log(w / wa), th = log(h / ha), where subscript a denotes the anchor. Log-encoding the width and height keeps the regression loss well-scaled across a wide range of box sizes; the centre offsets are normalised by anchor size for the same reason. The regression targets are trained with a smooth-L1 loss (Huber loss) that transitions from L2 near zero to L1 in the tails, preventing outliers from dominating the gradient.
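The parameterisation and its inverse can be written as an encode/decode pair (centre-size boxes throughout; helper names are illustrative). Decoding must exactly invert encoding, which the round-trip test below checks:

```python
import math

def encode_deltas(box, anchor):
    """Faster R-CNN regression targets (tx, ty, tw, th) from an anchor
    to a ground-truth box, both in centre-size format."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha,
            math.log(w / wa), math.log(h / ha))

def decode_deltas(deltas, anchor):
    """Apply predicted deltas to an anchor to recover a centre-size box."""
    tx, ty, tw, th = deltas
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha,
            wa * math.exp(tw), ha * math.exp(th))
```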

A subtlety is positive / negative assignment: deciding which anchors should be trained to predict which ground-truth boxes. The classical rule is threshold-based. An anchor is positive if either it has IoU ≥ 0.7 with any ground-truth box, or it is the highest-IoU anchor for some ground-truth box (the second clause ensures every ground-truth gets matched). It is negative if it has IoU < 0.3 with every ground-truth box. Anchors in between are ignored. This matching is done independently per image and is the engine that turns detection into a per-anchor classification-and-regression problem.
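The two-clause rule fits in a short function. A minimal sketch operating on a precomputed anchor×ground-truth IoU matrix (the label convention — gt index for positive, −1 for negative, None for ignored — is illustrative):

```python
def assign_anchors(iou_matrix, pos_thresh=0.7, neg_thresh=0.3):
    """iou_matrix[i][j] = IoU(anchor i, ground-truth j). Returns one label
    per anchor: a gt index if positive, -1 if negative, None if ignored."""
    num_anchors = len(iou_matrix)
    num_gt = len(iou_matrix[0]) if num_anchors else 0
    labels = []
    for row in iou_matrix:                       # clause 1: threshold rule
        best = max(range(num_gt), key=lambda j: row[j])
        if row[best] >= pos_thresh:
            labels.append(best)
        elif row[best] < neg_thresh:
            labels.append(-1)
        else:
            labels.append(None)                  # in-between: ignored
    for j in range(num_gt):                      # clause 2: every gt matched
        i = max(range(num_anchors), key=lambda k: iou_matrix[k][j])
        labels[i] = j
    return labels
```

Note how the second loop can promote an otherwise-ignored anchor: the gt at column 1 below has no anchor above 0.7, so its best anchor (IoU 0.4) becomes positive anyway.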

The downsides of anchors are why the field eventually moved past them. Anchor shape, count, and scale are hyperparameters that must be tuned per dataset — COCO and VOC don't use the same anchors. Most anchors are background, which creates heavy class imbalance during training. Large objects and small objects get different amounts of effective capacity depending on which anchor scales you include. Anchor-free detectors (§8) remove the anchor prior by predicting boxes directly from feature-map locations, and transformer detectors (§11) remove anchors entirely by learning a small set of object queries.

§5

Two-stage detectors: R-CNN to Faster R-CNN

The first successful CNN-based detectors followed a two-stage pipeline: propose candidate regions, then classify and refine each region. The family culminated in Faster R-CNN (2015) and Mask R-CNN (2017), and even after the one-stage and transformer detectors took the speed lead, the two-stage designs held the accuracy crown on COCO for several years.

R-CNN (Girshick et al., 2014) — "Regions with CNN features" — is the ancestor. It uses selective search, a classical bottom-up region-proposal algorithm, to produce about 2000 candidate regions per image. Each region is warped to a fixed size, passed through a CNN (AlexNet originally), and classified by per-class linear SVMs. The network is fine-tuned from ImageNet pretraining on the detection classes. R-CNN was the first method to clearly beat hand-crafted features on PASCAL VOC, but it had two problems: it ran the CNN 2000 times per image (slow), and selective search was independent of the learned features (locked-in).

SPP-net (He et al., 2014) and Fast R-CNN (Girshick, 2015) solved the first problem. Both run the CNN once on the whole image to produce a shared feature map, then extract per-region features by cropping the appropriate sub-region of the feature map. Fast R-CNN introduced the RoI Pooling operator — given a feature map and a region-of-interest box, produce a fixed-size feature grid (say 7×7) by pooling the feature map within the box. This lets a classification-and-regression head run on each region in a single forward pass. Fast R-CNN is roughly 200× faster at inference than R-CNN and slightly more accurate.

Faster R-CNN (Ren et al., 2015) replaced selective search with a learned proposal mechanism: the Region Proposal Network (RPN). The RPN is a small conv network that slides across the shared feature map and, at each location, predicts objectness scores and box refinements for each of k anchors (typically 9 — see §4). Its output is a ranked list of candidate regions, which is then fed to the Fast R-CNN head as before. The key point is that the RPN shares features with the detection head; the whole system trains end-to-end. Faster R-CNN at inference is about 5–10× faster than Fast R-CNN and cleaner architecturally, because there is no hand-designed proposal step.

RoIAlign and the Mask R-CNN refinement. RoI Pooling has a quantisation problem: continuous box coordinates are rounded to integer feature-map cells, causing small misalignments that hurt pixel-accurate mask prediction. Mask R-CNN (He et al., 2017) introduced RoIAlign, which uses bilinear interpolation to sample the feature map at exact box coordinates without quantisation. Mask R-CNN also added a parallel mask-prediction branch to Faster R-CNN: for each RoI, the head produces both (class, box) and a per-class K×K binary mask. RoIAlign alone gives ~3 mask AP improvement over RoI Pooling; the combined system established the two-stage detection+segmentation paradigm that dominated until 2020.

Training a two-stage detector involves three losses applied at the two stages. At the RPN: a binary objectness classification loss (is this anchor a foreground object?) and an anchor-delta regression loss, applied to a sampled subset of positive and negative anchors. At the head: a K+1-way classification loss (one per class plus background) and a per-class box-delta regression loss, applied to the RPN's top-N proposals that survived NMS. Mask R-CNN adds a per-class binary cross-entropy mask loss on the mask branch. All four (or five) losses are weighted and summed; getting the weights right is part of the recipe.

Refined two-stage designs — Cascade R-CNN, HTC — are still used when accuracy matters more than latency, particularly in domain-specific industrial and medical applications. The engineering trade-off is that the two stages give more capacity per candidate region (at the cost of throughput), while one-stage detectors (§6–§7) give up some accuracy for an order-of-magnitude speed-up.

§6

One-stage detectors: SSD and YOLO-v1

The one-stage detector family predicts class labels and bounding-box deltas directly from a grid of feature-map locations in a single forward pass, skipping the region-proposal step. This trades a few points of AP for a 5–10× speed-up, and for most real-time applications (autonomous driving, video analysis, mobile inference) that trade is worth making.

SSD — "Single Shot MultiBox Detector" (Liu et al., 2016) — was the first widely-adopted one-stage detector. The architecture is a VGG-16 backbone plus a pyramid of extra convolutional feature maps at decreasing resolutions. At each spatial location of each feature map, a dense set of anchor boxes is predicted (both a softmax over classes and a box delta). Large feature maps detect small objects; small feature maps detect large objects — the multi-scale pyramid gives SSD natural scale coverage without an explicit proposal step. Post-processing is a per-class non-maximum suppression over all predictions.

YOLO-v1 (Redmon et al., 2016) — "You Only Look Once" — took an even more radical approach. It divides the image into a 7×7 grid; each grid cell predicts B=2 bounding boxes (x, y, w, h, confidence) plus a distribution over classes. All predictions come from a single pass through a single network with a single loss. YOLO-v1 was dramatically faster than any previous detector (45 fps on a GPU, versus Faster R-CNN's 5 fps) but meaningfully less accurate, especially on small objects where the 7×7 grid was too coarse. It set the aesthetic for the YOLO family: a clean, fast, single-pass detector that would absorb improvements from elsewhere in the field over the next decade.

The class-imbalance problem. One-stage detectors evaluate tens of thousands of anchor locations per image, of which only a few dozen overlap an actual object. The resulting 1000:1 class imbalance overwhelmed the standard cross-entropy loss: background anchors, even easy ones, collectively dominated the gradient. Focal loss (Lin et al., 2017, RetinaNet) solved this by down-weighting easy examples: FL(pt) = −(1 − pt)^γ · log(pt). With γ = 2, an example with pt = 0.9 contributes over 100× less loss than one with pt = 0.5. Focal loss closed the accuracy gap between one-stage and two-stage detectors at matched speed; it is one of the single most important training-time innovations in the detection literature.
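The down-weighting is easy to verify numerically. A minimal per-example sketch (the α-balancing factor from the RetinaNet paper is omitted for clarity; γ = 0 recovers plain cross-entropy):

```python
import math

def focal_loss(p_t, gamma=2.0):
    """Focal loss for one example. p_t is the predicted probability of the
    true class; the (1 - p_t)^gamma factor down-weights easy examples."""
    return -((1 - p_t) ** gamma) * math.log(p_t)
```

With γ = 2, the modulating factor alone shrinks an easy example's contribution by (0.1/0.5)² = 25×, and the smaller log term pushes the full ratio well past 100×.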

RetinaNet (Lin et al., 2017) combined focal loss with a Feature Pyramid Network (FPN, §9) on a ResNet backbone, and was the first one-stage detector to match Faster R-CNN on COCO. It established the canonical one-stage template — backbone + FPN + per-level detection head with anchor-based classification and regression — that subsequent YOLO, FCOS, and RTMDet variants all inherited.

The basic one-stage training recipe became: forward the image, for each anchor compute a classification loss (focal or cross-entropy) against the matched ground-truth class, compute a regression loss (smooth-L1 or IoU-based) against the matched ground-truth box, and sum. Evaluation runs the forward pass, applies per-class NMS, and keeps the top-K predictions by score. The simplicity is the point; there is no proposal-then-refine stage and no hand-designed anchor search.

Practically, one-stage and two-stage detectors have converged. DETR-family transformer detectors (§11) are architecturally one-stage; modern anchor-free detectors (§8) look one-stage but also eliminate the anchor priors; the latest YOLO variants perform comparably to the best two-stage detectors on COCO AP while running 5–10× faster. The division matters more as a historical organising principle than as a current design choice.

§7

The YOLO family

The YOLO name has been carried by roughly a dozen different architectures over a decade, some by the original authors (Redmon, Farhadi) and others by different research groups and companies. The family is less a unified design and more a living lineage: a commitment to a single-pass, real-time detector that repeatedly absorbed whatever new technique (FPN, focal loss, anchor-free heads, transformer modules) the rest of the field produced.

YOLO-v2 (Redmon & Farhadi, 2017) replaced the fixed 7×7 grid of v1 with anchor boxes clustered over training data via k-means — making the anchor priors data-dependent rather than hand-set. It also introduced DarkNet-19, a custom backbone; switched to BatchNorm throughout; and added multi-scale training. YOLO-v3 (2018) upgraded to DarkNet-53 with residual connections, added a three-level feature pyramid for better small-object detection, and moved to binary-cross-entropy per-class losses (allowing overlapping labels). v3 was the last version by Redmon; it became the de-facto real-time detector for several years.

The post-v3 YOLO world fragmented. YOLO-v4 (Bochkovskiy et al., 2020) is a careful ablation study assembling the best modules of the preceding years: CSPDarknet backbone, PANet neck, SPP module, Mish activation, CIoU loss, mosaic data augmentation. YOLO-v5 (Ultralytics, 2020) was not a paper but an open-source release that emphasised PyTorch implementation quality, ease of training, and a clean Python API. Its actual architectural differences from v4 are modest, but it rapidly became the most-used detector in industry.

Why YOLO-v5 took over. Detection research had accumulated a large gap between research papers and deployable code — Faster R-CNN, RetinaNet, FCOS all had official repos that were difficult to run on custom data. YOLO-v5's Ultralytics repo prioritised the practitioner experience: pip-installable, YAML-configured, reproducible training on a custom dataset in an afternoon. This is a case where engineering polish created a much larger real-world footprint than the underlying research contribution would suggest.

Later variants converged on an anchor-free, decoupled-head design. YOLOX (Megvii, 2021) switched to an anchor-free head (predicting a box offset and size directly from each feature-map location), decoupled the classification and regression heads (separate conv stacks), and introduced SimOTA, a dynamic label-assignment strategy that picks the best candidate predictions per ground truth at training time. YOLO-v6 (Meituan, 2022) and YOLO-v7 (Wang et al., 2022) continued the refinement: re-parameterised blocks for inference, improved label assignment, and careful quantisation support.

YOLO-v8 (Ultralytics, 2023) unified detection, instance segmentation, classification, and pose estimation in a single codebase, all built on a similar CSPDarknet-plus-anchor-free-head architecture. It is anchor-free, uses distribution-focal-loss for box regression, and has been the practical "default" detector for real-time applications through 2024–2025. Subsequent YOLOv9, v10, and v11 variants by various groups continue to push the accuracy/speed Pareto frontier with techniques borrowed from transformer detectors and modern training recipes.

The consolidated picture: over YOLO v1 through v8+, the architecture migrated from a naive grid-prediction to a mature anchor-free, multi-scale, decoupled-head detector that benefits from almost every advance in the rest of the chapter. The family is a case study in how a clean baseline, accessible code, and a real-time design constraint can produce a continuously-improving line of models without a single unifying paper.

§8

Anchor-free detection

Anchor boxes are a useful inductive bias, but they are also a set of hand-designed priors that must be tuned per dataset and that introduce a substantial training-time imbalance. A family of anchor-free detectors, developed mostly between 2018 and 2020, predicts boxes directly from feature-map locations without any anchor priors — typically with comparable or better accuracy than their anchor-based predecessors.

FCOS — "Fully Convolutional One-Stage detection" (Tian et al., 2019) — is the clearest anchor-free design. At every location (x, y) on every level of a feature pyramid, FCOS predicts: a per-class classification score, a 4-vector (l, t, r, b) giving distances from (x, y) to the four sides of the target box, and a centreness score that down-weights predictions whose location is far from the centre of their target. Training assigns each location to the smallest ground-truth box that contains it (within a level-specific size range). At inference, the (l, t, r, b) vector plus the location determine the box; NMS handles duplicates.
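The per-location decoding and the centreness target are simple enough to write out. A dependency-free sketch (the numbers are toy values, not from the paper):

```python
import math

def fcos_decode(x, y, l, t, r, b, stride=1):
    """Decode an FCOS (l, t, r, b) prediction at feature-map location (x, y)
    into an (x1, y1, x2, y2) box in image coordinates."""
    cx, cy = x * stride, y * stride          # map the feature cell to image space
    return (cx - l, cy - t, cx + r, cy + b)  # distances to the four box sides

def centreness(l, t, r, b):
    """FCOS centreness target: 1.0 when the location sits at the box centre,
    falling toward 0 as it approaches an edge."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

# A stride-8 location at feature cell (10, 8) predicting distances 20/12/30/16:
box = fcos_decode(10, 8, 20, 12, 30, 16, stride=8)   # (60, 52, 110, 80)
score = centreness(20, 12, 30, 16)                   # ~0.71 — slightly off-centre
```

At inference the classification score is multiplied by the centreness score before NMS, which suppresses the low-quality boxes produced by off-centre locations.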

FCOS outperformed anchor-based RetinaNet at matched backbones on COCO, with a simpler and more elegant formulation. The centreness head is a small trick that turned out to generalise: it is effectively a learned prior that central locations should produce higher-quality predictions, implemented as an additional scalar per location. Almost every anchor-free detector since has either used centreness directly or an IoU-based equivalent.

CenterNet and CornerNet: objects as keypoints. A parallel thread reframes detection as keypoint estimation. CornerNet (Law & Deng, 2018) predicts heatmaps for top-left and bottom-right corners plus a per-corner embedding used to group corners into the same object. CenterNet (Zhou et al., 2019) predicts one heatmap per class with a peak at each object's centre, plus a regression head for box size. These methods frame detection as a dense per-pixel prediction task that looks more like segmentation. They produce clean real-time detectors and generalise naturally to 3-D detection, pose, and tracking — modern autonomous-driving detection stacks are often CenterNet-descendants.
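The peak-extraction step that replaces NMS in CenterNet is just a local-maximum test on the class heatmap (the paper implements it as a 3×3 max-pool). A toy pure-Python version:

```python
def heatmap_peaks(hm, thresh=0.3):
    """Find local maxima of a 2-D heatmap (list of lists): a cell is a peak
    if it exceeds `thresh` and is >= all of its 8 neighbours."""
    H, W = len(hm), len(hm[0])
    peaks = []
    for y in range(H):
        for x in range(W):
            v = hm[y][x]
            if v < thresh:
                continue
            neighbours = [hm[j][i]
                          for j in range(max(0, y - 1), min(H, y + 2))
                          for i in range(max(0, x - 1), min(W, x + 2))
                          if (i, j) != (x, y)]
            if all(v >= n for n in neighbours):
                peaks.append((x, y, v))
    return peaks

hm = [[0.1, 0.2, 0.1],
      [0.2, 0.9, 0.2],
      [0.1, 0.2, 0.7]]
# Only (1, 1) survives: the 0.7 at (2, 2) is adjacent to the stronger 0.9 peak.
```

Each surviving peak is then paired with the size-regression output at the same location to produce a box — no anchors, no NMS.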

RepPoints (Yang et al., 2019) replaces the four-number box representation with a set of nine learned point locations; the box is computed as the min-max bounding rectangle of these points. The benefit is that the points can learn to attach to semantically meaningful object parts, giving a richer representation than a box. ATSS — "Adaptive Training Sample Selection" (Zhang et al., 2020) — showed that the real performance gap between anchor-based and anchor-free detectors came from the label assignment strategy, not the presence or absence of anchors; with matched assignment rules, the two are indistinguishable. ATSS became the default assigner in modern detectors: it computes candidate positive anchors/locations per ground-truth using a statistical mean-plus-standard-deviation IoU rule, rather than a fixed threshold.
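The ATSS rule itself is a one-liner. A sketch with made-up IoU values, showing how the adaptive threshold replaces a fixed 0.5 cutoff:

```python
def atss_threshold(candidate_ious):
    """ATSS: the positive/negative IoU threshold for one ground-truth box is
    the mean plus standard deviation of its candidate anchors' IoUs."""
    n = len(candidate_ious)
    mean = sum(candidate_ious) / n
    var = sum((v - mean) ** 2 for v in candidate_ious) / n
    return mean + var ** 0.5

ious = [0.1, 0.2, 0.3, 0.6, 0.7]          # candidate anchors for one ground truth
t = atss_threshold(ious)                  # mean 0.38 + std ~0.23 -> ~0.61
positives = [v for v in ious if v >= t]   # only the 0.7 anchor is positive
```

A ground truth whose candidates all overlap it well gets a high bar; a small or awkwardly shaped object whose best candidates overlap poorly still gets some positives — the adaptation that a fixed threshold cannot provide.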

The lesson of the modern lineage (FCOS, ATSS, CenterNet, SimOTA) is that a one-stage detector does not need anchor priors, provided it has a carefully designed label-assignment and prediction-quality scheme. Almost every post-2021 detector — YOLOX, YOLO-v6+, RTMDet, RTMPose — is anchor-free. The anchor-based detectors still in wide use (Faster R-CNN, RetinaNet) are valuable reference points more than competitive systems; a new detection paper in 2026 is almost certainly anchor-free.

§9

Feature pyramids

Objects in natural images appear at drastically different scales — a bicycle close to the camera might cover a 400-pixel region; another in the distance might cover ten. A detector with a single feature-map resolution will struggle with one end of this range. The feature pyramid is the standard architectural answer: extract and fuse features at multiple spatial resolutions so that small objects are handled by fine-grained feature maps and large objects by coarse-grained ones.

FPN — "Feature Pyramid Networks" (Lin et al., 2017) — is the canonical design. Starting from a backbone's feature hierarchy (say C2, C3, C4, C5 at strides 4, 8, 16, 32), FPN adds a top-down pathway: upsample the coarsest feature map (C5), add it element-wise to a lateral 1×1 projection of C4, apply a 3×3 conv to smooth the result (P4), and iterate. The output is a pyramid P2…P5 where every level has both coarse-grained semantic information (from C5) and fine-grained spatial information (from the same-stride backbone level). Detection heads are attached at each level; small objects go to P2, large to P5.
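The top-down pathway is easy to sketch. In this dependency-free toy, feature maps are nested lists, and the lateral 1×1 projections and 3×3 smoothing convs are modelled as identity so the fusion structure stands out:

```python
def upsample2x(fm):
    """Nearest-neighbour 2x upsampling of a 2-D feature map (list of lists)."""
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def fpn_topdown(c_levels):
    """Top-down pathway over backbone maps ordered fine -> coarse (C2..C5).
    Real FPNs apply a 1x1 lateral conv before the add and a 3x3 conv after;
    both are identity here."""
    p = [c_levels[-1]]                        # coarsest level: P5 = lateral(C5)
    for c in reversed(c_levels[:-1]):         # then C4, C3, C2
        p.append(add(c, upsample2x(p[-1])))   # P_i = lateral(C_i) + up(P_{i+1})
    return list(reversed(p))                  # fine -> coarse: P2..P5

c5 = [[1]]                    # 1x1 coarse map
c4 = [[1, 0], [0, 1]]         # 2x2 map one stride finer
p_levels = fpn_topdown([c4, c5])
# The finer output mixes its own spatial detail with the upsampled coarse map:
# p_levels[0] == [[2, 1], [1, 2]]
```

The pattern — upsample, add laterally, smooth, repeat — is all FPN is; everything after it (PANet, BiFPN) changes which edges exist and how they are weighted.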

FPN's impact was immediate: attaching an FPN to RetinaNet, Faster R-CNN, or Mask R-CNN gave 2–4 AP points of improvement across the board, with small objects (APS) improving even more. It rapidly became a default, to the point that "with FPN" is the assumed baseline in any post-2017 detection paper.

PANet, BiFPN, NAS-FPN: refining the neck. PANet (Liu et al., 2018) adds a second bottom-up pathway on top of FPN's top-down one, so the finest-grained features also get combined back into the coarser levels. BiFPN (Tan et al., 2020, EfficientDet) goes further: bidirectional cross-level connections with learned per-edge weights and repeated blocks. NAS-FPN (Ghiasi et al., 2019) uses neural architecture search to discover a non-trivial connection pattern that outperforms hand-designed alternatives. Each of these adds 1–2 AP on COCO over vanilla FPN at small FLOP cost; BiFPN is the best-known modern standard.

An orthogonal refinement is the neck vs. backbone vs. head separation that the field settled on. A modern detector is built from three loosely-coupled components: a backbone (ResNet, ConvNeXt, Swin) that produces feature maps; a neck (FPN, BiFPN, PANet) that fuses features across scales; and a head (anchor-based, anchor-free, transformer query) that makes the actual predictions. This decomposition lets researchers swap components independently — a ResNet-backbone FCOS head and a Swin-backbone ATSS head share the same neck API.

Transformer detectors (DETR and descendants) typically replace the multi-scale neck with a single feature map plus learned object queries, or — in the case of Deformable DETR — use a multi-scale feature set but without the explicit FPN top-down pathway. For non-transformer detectors, an FPN-family neck is essentially mandatory; the lineage from FPN (2017) through BiFPN (2020) and beyond is one of the quieter-but-important stories in detection architecture.

§10

Detection losses and label assignment

A detector's loss function has to balance a classification signal (is this anchor a foreground object, and which class?), a regression signal (how much do I shift the box to hit the target?), and a label-assignment rule (which anchor is responsible for which ground-truth box?). The evolution of each of these three pieces drove roughly as much of the last decade's accuracy gains as any architectural change.

Classification losses. Early detectors used cross-entropy or softmax cross-entropy directly, which collapsed under the 1000:1 foreground-background imbalance of one-stage detectors. Focal loss (§6) solved this by exponentially down-weighting easy examples. Quality Focal Loss (Li et al., 2020, GFL) replaces the binary (positive/negative) target with a continuous IoU-based quality score — a positive anchor's target is its predicted box's IoU with the ground truth, not 1. This aligns the classification score with the actual localisation quality, so that NMS (which sorts by classification score) selects boxes with both high class confidence and tight localisation.

Regression losses. Classical detectors used smooth L1 (Huber loss) on the four anchor deltas, independently. The problem: optimising four scalar losses does not directly optimise IoU, which is what the evaluation metric actually cares about. IoU loss (Yu et al., 2016) directly uses 1 − IoU, but it has zero gradient for non-overlapping boxes. GIoU (Rezatofighi et al., 2019) adds a term that accounts for the smallest enclosing rectangle, giving a gradient even when boxes do not overlap. DIoU (Zheng et al., 2020) adds a centre-distance term; CIoU adds aspect-ratio alignment. These are drop-in replacements for smooth-L1 that give 1–2 AP of improvement on COCO.
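The GIoU computation is short enough to show in full; note how the enclosing-box term produces a non-zero value (and hence a gradient) even for disjoint boxes, where plain IoU is flat at zero:

```python
def giou(a, b):
    """Generalised IoU for axis-aligned boxes (x1, y1, x2, y2).
    GIoU = IoU - (|C| - |union|) / |C|, with C the smallest enclosing box."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))      # intersection width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))      # intersection height
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    cw = max(ax2, bx2) - min(ax1, bx1)                # enclosing box C
    ch = max(ay2, by2) - min(ay1, by1)
    c = cw * ch
    return iou - (c - union) / c

g_overlap = giou((0, 0, 2, 2), (1, 1, 3, 3))   # IoU 1/7, union 7, |C| 9 -> 1/7 - 2/9
g_disjoint = giou((0, 0, 1, 1), (2, 0, 3, 1))  # IoU 0, but GIoU = -1/3: a usable signal
```

The training loss is 1 − GIoU; DIoU and CIoU extend the same skeleton with centre-distance and aspect-ratio terms.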

Dynamic label assignment is the second-biggest story. Classical assignment uses a fixed IoU threshold (e.g. IoU ≥ 0.5 → positive). ATSS computes a per-ground-truth threshold from the mean-plus-standard-deviation of candidate anchor IoUs. OTA (Optimal Transport Assignment, Ge et al., 2021) casts assignment as a transport problem, matching anchors to ground truths by cost minimisation. SimOTA is a simplified, faster version used by YOLOX and v6+. DETR's bipartite matching (§11) is the transformer-detector cousin: every query gets uniquely matched to one ground-truth via Hungarian matching. These all implement the same principle — adapt the assignment to each ground-truth and each training step — and each one advanced the COCO leaderboard by roughly a point.

Distribution Focal Loss (DFL) is a specific 2020-era refinement. Rather than regress a single scalar (e.g. distance to left side of box), the network predicts a discrete distribution over possible values; the predicted value is the expectation, and the loss encourages the distribution to concentrate on the correct value while staying smooth. This is useful for boundary-ambiguous objects — e.g. a dog's tail — where the exact box edge is genuinely uncertain. DFL is used in GFL, YOLO-v6, v7, v8.
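The inference side of DFL — reading a scalar out of the predicted distribution — is a softmax followed by an expectation. A minimal sketch with illustrative logits:

```python
import math

def dfl_expectation(logits):
    """DFL predicts a discrete distribution over integer bin values 0..n-1;
    the regressed quantity is the expectation under the softmax."""
    m = max(logits)                             # subtract max for stability
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return sum(i * e / z for i, e in enumerate(exps))

# Mass concentrated on bins 3 and 4 regresses a value between them — exactly
# the boundary-ambiguity case (a dog's tail) that motivates the formulation:
d = dfl_expectation([0, 0, 0, 5, 5, 0, 0, 0])   # 3.5
```

The loss side pushes probability mass onto the two integer bins bracketing the true continuous value, so the expectation lands on the target while the distribution stays smooth.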

Mask-loss-family choices mostly use per-pixel binary cross-entropy or a dice-loss variant. Dice loss = 1 − 2·|A ∩ B| / (|A| + |B|) is essentially F1 at the pixel level; it is more robust to class imbalance than pixel BCE for very-small-object segmentation (cell microscopy, crack detection). Modern segmentors often use a weighted sum of BCE and Dice.
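The soft Dice loss used in practice replaces set cardinalities with sums of per-pixel probabilities, so it is differentiable. A dependency-free sketch:

```python
def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss over flattened per-pixel probabilities in [0, 1]:
    1 - 2|A ∩ B| / (|A| + |B|), with soft intersection sum(p * t).
    eps guards against an empty mask making the denominator zero."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (total + eps)

target = [0, 0, 1, 1]
perfect = dice_loss([0, 0, 1, 1], target)   # ~0.0
disjoint = dice_loss([1, 1, 0, 0], target)  # ~1.0
```

Because both numerator and denominator scale with object size, a 20-pixel cell contributes as strongly as a 20 000-pixel road — which is exactly why Dice beats per-pixel BCE on very small objects.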

The overall message is that detection accuracy is bottlenecked roughly equally by architecture and by loss/assignment design. A ResNet-50 Mask R-CNN with 2017-era losses scores around 37 AP on COCO; with 2022-era losses (DFL, GIoU, ATSS, appropriate focal) on the same backbone it reaches ~43 AP. Before reaching for a new architecture, swapping in modern losses and assignment is the cheapest experiment to run.

§11

DETR and transformer detectors

The DEtection TRansformer (Carion et al., 2020) is the architectural equivalent of taking the Transformer chassis from NLP (Part VI Chapter 04) and transplanting it directly into detection. DETR eliminated two of the classical pipeline's hand-engineered pieces — anchor boxes and non-maximum suppression — in favour of a transformer decoder whose outputs are a fixed-size set of object predictions, matched to ground-truth via bipartite matching during training.

DETR's architecture is: a CNN backbone produces a single feature map; a transformer encoder refines it; a transformer decoder takes N=100 learned object queries as its input and produces N output vectors via cross-attention to the encoder features; a small MLP head on each output produces (class, box). At training time, Hungarian matching pairs each prediction with at most one ground-truth box, minimising a cost that combines classification and box losses. Unmatched predictions are trained against a "no object" class, and unmatched ground truths are ignored.

The elegance of DETR is that the whole detector is a single end-to-end differentiable pipeline with no hand-designed post-processing, no anchor boxes, and no NMS. Its downside — on release — was convergence: vanilla DETR took 500 training epochs to reach parity with Faster R-CNN, whereas a classical detector needs 36. The bottleneck was the learned queries' slow convergence on where to attend.

The DETR lineage that fixed convergence. Deformable DETR (Zhu et al., 2020) replaces dense cross-attention with a sparse deformable attention that attends to a handful of learned offsets from each query's reference point — reducing compute and, more importantly, giving each query a localised inductive bias that speeds convergence 10×. DN-DETR and DINO (Zhang et al., 2022, 2023) add denoising training: inject noisy ground-truth boxes as extra queries and train the decoder to denoise them. RT-DETR (Lv et al., 2023) is a real-time variant that trims the decoder and switches to an efficient hybrid encoder. As of 2026 DETR-family detectors sit at or near the top of COCO.

Bipartite matching deserves a moment of explanation because it is the key training-time construct of DETR and its descendants. Let N predictions and M ground-truth boxes be given (with M ≤ N). Hungarian matching finds a one-to-one assignment π: {1..M} → {1..N} that minimises ∑i cost(π(i), i), where cost combines classification and box-regression terms. This is computed per image at every training step; the predictions not matched to any ground-truth get the "no object" target. The result is that at training time, there is no positive/negative ambiguity and no duplicate predictions per ground-truth — the set-prediction formulation has this built in.
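For intuition, here is the matching problem solved by brute force on a toy cost matrix (real implementations use the O(N³) Hungarian algorithm, e.g. scipy's linear_sum_assignment; permutation search is only viable at this scale):

```python
from itertools import permutations

def bipartite_match(cost):
    """Minimum-cost one-to-one assignment of M ground truths to N predictions
    (M <= N), by exhaustive search over length-M permutations of predictions."""
    M, N = len(cost), len(cost[0])
    best, best_assign = float("inf"), None
    for perm in permutations(range(N), M):
        c = sum(cost[i][perm[i]] for i in range(M))
        if c < best:
            best, best_assign = c, perm
    return best, best_assign

# cost[i][j]: combined class + box cost of matching ground truth i to prediction j
cost = [[0.9, 0.1, 0.8],
        [0.4, 0.7, 0.2]]
total, assign = bipartite_match(cost)
# assign == (1, 2): ground truth 0 -> prediction 1, ground truth 1 -> prediction 2.
# Prediction 0 is unmatched and is trained against the "no object" class.
```

Because the assignment is exclusive, two queries can never both be the positive for the same object, which is why DETR needs no NMS.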

DETR generalises cleanly to segmentation. The same decoder can produce per-query mask embeddings that, dotted with a per-pixel feature map, give a mask for each detected instance — this is the core idea of Mask2Former (§14). It also generalises to panoptic segmentation and video instance segmentation with the same query mechanism, which is why the transformer-detector family is the common ancestor of essentially every post-2022 mask-prediction architecture.

Practically, a 2026 detector is probably either a DETR variant (DINO, RT-DETR, MaskDINO for segmentation) or a YOLO-v8+/RTMDet-style efficient dense prediction head. The DETR family is preferred when accuracy matters most (top of COCO is uniformly DETR-descended); the dense-head family is preferred for real-time deployment. The architectural convergence is clear: both are anchor-free, both use multi-scale features, both use dynamic label assignment, and both share backbones with classification.

§12

Semantic segmentation

Semantic segmentation assigns a class label to every pixel of an image. The task is simpler to specify than instance or panoptic segmentation — no instance identities, just a per-pixel class map — but it still requires producing outputs at the input image's spatial resolution, which is a very different architectural problem from classification.

FCN — "Fully Convolutional Networks" (Long, Shelhamer & Darrell, 2015) — is the foundational architecture. It takes a classification network (VGG-16), replaces the final fully-connected layers with 1×1 convolutions, and upsamples the coarse output feature map back to input resolution via transposed convolutions. Skip connections from earlier layers provide fine-grained spatial detail that the deep coarse feature map alone lacks. FCN-8s (with three levels of skip connections) was the first widely-adopted architecture to show that end-to-end CNN training could match or exceed classical CRF-based segmentation methods.

The encoder-decoder pattern generalised this. The encoder is a classification-style backbone that progressively downsamples; the decoder is a mirror that progressively upsamples. SegNet (Badrinarayanan et al., 2015) kept track of max-pool indices during encoding and reused them during decoder upsampling for sharper boundaries. U-Net (Ronneberger, Fischer & Brox, 2015) added symmetric skip connections that concatenate encoder features with decoder features at matching resolutions, giving the decoder direct access to fine-grained spatial information. U-Net was developed for biomedical segmentation and remains the default architecture for that domain; variants appear everywhere from satellite imagery to diffusion-model backbones.

Atrous convolutions and the DeepLab family. A classification backbone downsamples aggressively — the final feature map is typically 1/32 of the input resolution. Upsampling this back to full resolution loses detail. Atrous convolution (also called dilated convolution) solves the same problem differently: by inserting zeros between kernel weights, the same 3×3 kernel covers a larger receptive field without downsampling. Replacing the last two downsamplings of a ResNet with atrous convs lets you maintain a 1/8-resolution feature map while still having the receptive field of a 1/32 one. The DeepLab family (v1/v2/v3/v3+) built around this idea, adding ASPP — Atrous Spatial Pyramid Pooling — to capture multiple scales in parallel. DeepLab-v3+ was the strongest convolutional semantic segmentor until the transformer era.
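The receptive-field arithmetic makes the trade-off concrete. For a stack of stride-1 convs, each layer adds (k − 1)·d pixels of receptive field, so dilation grows the field exponentially at constant parameter and FLOP cost:

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field of stacked stride-1 (dilated) convolutions:
    each layer with kernel k and dilation d adds (k - 1) * d pixels.
    Equivalently, one dilated layer has effective kernel k + (k-1)(d-1)."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf

plain = receptive_field([3, 3, 3], [1, 1, 1])    # three plain 3x3 convs: 7 pixels
atrous = receptive_field([3, 3, 3], [1, 2, 4])   # dilations 1, 2, 4: 15 pixels
```

ASPP exploits the same arithmetic in parallel rather than in series: several 3×3 branches with different dilation rates see different scales of context at the same resolution.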

The transformer era brought new architectures. SETR (Zheng et al., 2021) was a straightforward application: a ViT encoder on image patches, then a decoder that upsampled the token representations back to pixel-level predictions. SegFormer (Xie et al., 2021) introduced a hierarchical transformer backbone (multiple stages with decreasing resolution, similar to Swin) and a lightweight MLP decoder that aggregates across scales. Segmenter (Strudel et al., 2021) used a transformer decoder with class tokens to produce per-class segmentation masks in a DETR-like formulation.

Modern unified architectures like Mask2Former (§14) treat semantic segmentation as a special case of mask classification: every semantic class corresponds to a single "mask" prediction covering all pixels of that class. This is architecturally identical to how the same model handles instance and panoptic segmentation — just with the matching rule and the post-processing adjusted. As of 2026 Mask2Former-style unified architectures have replaced task-specific semantic-segmentation models for most benchmarks.

For deployment, the choice is still often a simple encoder-decoder like DeepLab-v3+ or SegFormer: these are lighter than Mask2Former, run in real time on modest hardware, and provide solid accuracy for the common case where every pixel simply needs a class label. The research frontier has moved on, but the production-reality floor is firmly in the convolutional-encoder-decoder family.

§13

Instance segmentation

Instance segmentation combines detection and segmentation: for each object in the image, predict both a bounding box and a pixel-accurate mask. Two objects of the same class must receive distinct instance IDs. The task is strictly harder than either detection or semantic segmentation because it requires both object-level understanding and pixel-level precision.

Mask R-CNN (He et al., 2017) is the canonical two-stage instance segmentor. Architecturally it extends Faster R-CNN with a parallel mask head: for each RoI, in addition to the (class, box) predictions, a small FCN produces an m×m binary mask (28×28 in the paper) for each of the K classes; only the predicted class's mask is kept at inference. The combination of RoIAlign (§5) and the parallel mask branch gives mask AP that sits 3–5 points below the corresponding box AP — roughly the quality that most downstream applications need.

One-stage and anchor-free instance segmentors developed in parallel. YOLACT (Bolya et al., 2019) — "You Only Look At CoefficienTs" — generates a set of prototype masks (global mask candidates) for the whole image, and each detected instance's mask is produced as a linear combination of prototypes with predicted per-instance coefficients. The decoupling of "where masks live" (prototypes, computed once) from "which prototypes does this instance use" (coefficients, per-detection) makes YOLACT real-time on a GPU.
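The prototype-plus-coefficients assembly is a sigmoid over a weighted sum of shared masks. A toy sketch with two hand-written 2×2 prototypes (real YOLACT uses ~32 prototypes at quarter resolution):

```python
import math

def yolact_mask(prototypes, coeffs, thresh=0.5):
    """Assemble one instance's mask as a sigmoid-activated linear combination
    of shared prototype masks, thresholded to a binary mask."""
    H, W = len(prototypes[0]), len(prototypes[0][0])
    mask = []
    for y in range(H):
        row = []
        for x in range(W):
            s = sum(c * p[y][x] for c, p in zip(coeffs, prototypes))
            row.append(1.0 / (1.0 + math.exp(-s)) > thresh)   # sigmoid + threshold
        mask.append(row)
    return mask

# One prototype lights up the left column, the other the right column:
protos = [[[4, -4], [4, -4]],
          [[-4, 4], [-4, 4]]]
m = yolact_mask(protos, [1.0, 0.0])   # coefficients select the left-column prototype
# m == [[True, False], [True, False]]
```

The prototypes are computed once per image; each detection only contributes a small coefficient vector, which is what makes the per-instance cost negligible.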

SOLO and position-conditioned masks. SOLO — "Segmenting Objects by Locations" (Wang et al., 2020) — took a different anchor-free route. Divide the image into an S×S grid. Each grid cell predicts a mask for the instance whose centre lies in that cell, and a per-class score. Instances are distinguished by position — the grid cell that contains their centre — rather than by detected boxes. SOLO-v2 refined this with dynamic convolutions that conditioned on the cell's features. The result is an instance segmentor with no box detection, no RoI operator, and no second stage — a clean conceptual simplification of the task.

CondInst (Tian et al., 2020) and BlendMask (Chen et al., 2020) are further refinements in the "dynamic head" family: the detection network predicts not just a box and class per instance, but also a set of dynamic convolution weights that are used to produce the instance's mask from a shared feature map. This gives mask quality comparable to Mask R-CNN at one-stage speed, without the hand-designed RoI operator.

The transformer family unified instance segmentation with detection through MaskFormer (Cheng et al., 2021) and Mask2Former (Cheng et al., 2022). MaskFormer reframes instance segmentation as mask classification: the decoder produces a fixed set of binary mask predictions plus per-mask class labels, and the task's inductive biases are encoded in the loss and matching rather than the architecture. Mask2Former added masked cross-attention — each query attends only within its predicted mask region — which dramatically improves convergence and accuracy. These architectures are the current state-of-the-art on COCO, LVIS, and Cityscapes instance-segmentation benchmarks.

For practitioners, the choice is typically: Mask R-CNN (or its Detectron2 variant) for easy integration and wide tooling; YOLOv8-seg / YOLACT-descendants for real-time inference on commodity hardware; Mask2Former or its descendants for accuracy-critical applications at research cost. LVIS (1200 classes, long-tailed) has become the more demanding benchmark than COCO, and ranking on LVIS is a better proxy for real-world long-tail performance.

§14

Panoptic segmentation

Panoptic segmentation (Kirillov et al., 2019) unifies semantic and instance segmentation into a single task: every pixel gets a (class, instance-id) pair. For thing classes (countable objects like cars, people), each instance gets a unique ID; for stuff classes (uncountable regions like sky or road), all pixels share a single ID. The result is a complete scene parse — every pixel explained, with instances for things and coherent regions for stuff.

The task was introduced specifically to push the field toward unified architectures. Before 2019 a production vision system wanting scene understanding would run a semantic segmentor and an instance segmentor in parallel and merge their outputs post-hoc — an expensive and inelegant solution. Panoptic segmentation gave the field a single benchmark that required producing coherent unified output.

Panoptic FPN (Kirillov et al., 2019) was the first widely-cited architecture: a Mask R-CNN head for things and a semantic-segmentation head for stuff, both sharing an FPN backbone, with a small merging step. UPSNet and Panoptic-DeepLab added learned logic for resolving conflicts between the two heads (what happens when a thing mask overlaps with a stuff prediction of a different class).

Mask2Former is the unifier. Mask2Former (Cheng et al., 2022) proved that all three segmentation tasks — semantic, instance, panoptic — can be solved by a single architecture with a single training recipe. Each task is framed as predicting a fixed-size set of (mask, class) pairs, where the loss and matching scheme determine the task-specific inductive bias. With a Swin backbone it achieves state-of-the-art results on COCO panoptic, Cityscapes semantic, ADE20K semantic, and several other benchmarks (one training run per task). The architectural convergence — one model family, one recipe, every segmentation task — is perhaps the clearest sign of how mature the mask-prediction paradigm has become.

A follow-up, OneFormer (Jain et al., 2023), trains a single model on all three tasks simultaneously with a task token that conditions the decoder. At inference time, switching the task token changes the output format without retraining. This is the foundation-model direction for segmentation: a single weight set that can be prompted for any segmentation task.

The PQ metric (§3) enforces the coherence requirement: a panoptic segmentor must both correctly label each pixel and produce non-overlapping, complete coverage. Classical post-hoc merging of a detector and semantic segmentor tends to score poorly on PQ because the two outputs can disagree on overlapping pixels; unified architectures avoid this by construction.

Panoptic segmentation is used in autonomous driving (where the whole scene matters), robotic scene understanding, and augmented reality. It is less commonly deployed than semantic or instance segmentation alone, largely because the extra capability has engineering cost without always matching an application need. The task's largest impact has been conceptual: forcing the field to build architectures that handle the full structure of a scene rather than one task at a time.

§15

Promptable segmentation and SAM

The Segment Anything Model (SAM, Kirillov et al., 2023) marked a phase change in segmentation: instead of training one model per dataset and task, SAM is a single foundation model that produces a mask for any object a user points at. The model is trained on ~1 billion masks across 11 million images (the SA-1B dataset), and it segments objects in image domains it has never seen — microscopy, satellite imagery, medical scans — often zero-shot.

SAM's architecture is a clean three-part design. An image encoder (a large ViT) runs once per image and produces a dense feature embedding. A prompt encoder turns user inputs — points, boxes, or rough masks — into embedding vectors. A lightweight mask decoder combines the two and produces one or more candidate masks in ~50 ms. The image encoder is expensive (hundreds of milliseconds on a GPU) but only runs once; after that, every prompt is near-instant. This makes SAM practical as an interactive tool — click on a part of an image, get a mask back immediately.

The dataset — SA-1B — was assembled via a data engine: an initial SAM was trained, used to propose masks on unlabelled images, human annotators corrected the proposals, and the corrected data was used to train the next SAM. Three rounds of this bootstrap produced the final billion-mask training set. The scale is ~400× larger than any prior segmentation dataset and is the primary reason SAM generalises as broadly as it does.

The ambiguity problem. A click at a point can mean "segment the shirt", "segment the person", or "segment the whole group". SAM handles this by producing three candidate masks per prompt, corresponding to different plausible object scopes, plus a confidence score. Downstream tools can display all three and let the user pick, or take the highest-confidence one. This triple-output design is one of the small but important details that makes SAM usable in interactive settings.

SAM 2 (Ravi et al., 2024) extended the model to video. The key addition is a memory mechanism: a running summary of which objects have been seen, prompted at which frames, with what masks, letting the model maintain instance identity across frames. SAM 2 segments a video object from a single-frame prompt (click on the dog in frame 1, track it through 10 000 frames). This brings promptable segmentation into the video-understanding domain covered in Chapter 04.

Beyond SAM specifically, the promptable-segmentation paradigm is influential. SEEM (Segment Everything Everywhere Model) added text prompts and audio prompts. SAM-HQ (High Quality SAM) added a boundary-refinement module. Semantic SAM handles hierarchical segmentations (one click can return parts-of-parts). The general design — a large image encoder, a prompt-conditioned mask decoder, trained on a very large mask dataset — has become the template for vision foundation models.

Practically, SAM has replaced a great deal of task-specific segmentation code. Many vision pipelines in 2025–2026 use SAM for the segmentation step and restrict task-specific training to the classifier on top of the segmented region. For interactive annotation, SAM has also collapsed the time to label a new dataset by 5–10×, which has downstream effects across every vision benchmark.

§16

Open-vocabulary detection and segmentation

A classical detector is trained on a fixed set of categories — COCO's 80, LVIS's 1200 — and cannot produce a prediction for any class outside that set. Open-vocabulary detection and segmentation loosen this restriction by accepting a category described in free-text at inference time: "red bicycle", "stop sign", "Van Gogh painting", anything the user types. The bridge from vision to text is typically a CLIP-style joint embedding.

CLIP (Radford et al., 2021, covered further in Chapter 06) trained separate image and text encoders with a contrastive loss over 400 million image-text pairs. The resulting embedding space aligns images with their descriptive text. Open-vocabulary detectors use CLIP-style text embeddings as their classifier weights: instead of a fixed K-way classification head, the detector embeds text prompts for the categories of interest and uses the dot product between region embeddings and text embeddings as the classification score.
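The "text embeddings as classifier weights" idea fits in a few lines. A dependency-free sketch with 3-d toy vectors standing in for real CLIP embeddings (the prompts and values are illustrative, not from any model):

```python
def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return num / (na * nb)

def open_vocab_classify(region_embedding, text_embeddings):
    """Score a detected region against arbitrary text prompts. The classifier
    is just similarity to text embeddings, so adding a category costs one
    text-encoder forward pass instead of retraining a K-way head."""
    scores = {name: cosine(region_embedding, emb)
              for name, emb in text_embeddings.items()}
    return max(scores, key=scores.get), scores

prompts = {"a red bicycle": [0.9, 0.1, 0.0],
           "a stop sign":   [0.0, 0.2, 0.9]}
label, scores = open_vocab_classify([0.8, 0.2, 0.1], prompts)
# label == "a red bicycle"
```

In a real system the region embedding comes from the detector's RoI features (ViLD) or from a jointly trained box head (OWL-ViT); the scoring step is exactly this dot product, usually with a learned temperature.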

ViLD (Gu et al., 2021) was the first widely-cited open-vocabulary detector. It distils knowledge from a pretrained CLIP image encoder into a detector's RoI features via an auxiliary loss, aligning the detector's region embeddings with CLIP image embeddings. OWL-ViT (Minderer et al., 2022) is simpler: use a ViT backbone trained jointly with a text encoder (LiT-style pretraining), and predict boxes and their CLIP-aligned embeddings end-to-end. At inference, embed any category text and retrieve the matching boxes.

GLIP and grounding. GLIP — "Grounded Language-Image Pretraining" (Li et al., 2022) — unifies phrase grounding (locate this phrase in this image) with detection. Training data combines detection datasets (COCO, Objects365) with grounding datasets (Flickr30K, Visual Genome), all reformulated as phrase-to-region matching. The resulting model handles both fixed-category detection and arbitrary-text grounding with the same weights. Grounding DINO (Liu et al., 2023) is the DETR-family successor: a DINO transformer detector augmented with text cross-attention in the encoder and decoder. It is the practical open-vocabulary detector of choice in 2025, used as a first-stage component in many systems (often chained with SAM for promptable detect-then-segment pipelines).

Open-vocabulary segmentation follows the same pattern. ODISE (Xu et al., 2023) uses diffusion-model features as the dense encoder, then open-vocabulary mask classification on top. OV-Seg trains a CLIP-aligned mask classifier; X-Decoder and SEEM unify many tasks (grounding, referring-expression segmentation, panoptic segmentation, captioning) into a single decoder. The common architectural idea is: produce mask proposals in a class-agnostic way, then classify each proposal against an open-vocabulary CLIP text embedding.


An important sub-problem is the long tail. On LVIS (1200 classes with power-law frequency), open-vocabulary methods often outperform closed-vocabulary counterparts on rare classes because they benefit from CLIP's web-scale pretraining even for classes with a handful of detection annotations. This is the clearest empirical win of the open-vocabulary approach: for the head of the distribution, closed-vocabulary methods still dominate, but the gap closes (and can reverse) for rare classes.

The practical shape of open-vocabulary vision in 2026 is: Grounding DINO or OWLv2 for detection, SAM for segmentation, often chained so a text query produces a detection that is then refined into a mask. The composed system — call it "detect-then-segment" — is the closest current implementation of a general-purpose vision interface, where the user describes what they want in text and the system localises it pixel-accurately.
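The detect-then-segment composition has a simple control flow, sketched below with stub functions. The function names and signatures (`detect`, `segment`) are illustrative placeholders, not the actual Grounding DINO or SAM APIs; in practice each stub would wrap a model call.

```python
# Stubs standing in for the real models; names and signatures are
# hypothetical, chosen only to show the shape of the pipeline.
def detect(image, text_prompt, box_threshold=0.35):
    """Placeholder open-vocabulary detector: text -> (box, score) pairs."""
    return [((10, 20, 110, 220), 0.91)]  # one fake (x0, y0, x1, y1) box

def segment(image, box):
    """Placeholder promptable segmentor: box prompt -> binary mask."""
    h, w = image["height"], image["width"]
    x0, y0, x1, y1 = box
    return [[x0 <= x < x1 and y0 <= y < y1 for x in range(w)]
            for y in range(h)]

def text_to_masks(image, text_prompt):
    """Detect-then-segment: a text query becomes pixel-accurate masks."""
    results = []
    for box, score in detect(image, text_prompt):
        results.append({"box": box, "score": score,
                        "mask": segment(image, box)})
    return results

image = {"height": 240, "width": 320}
masks = text_to_masks(image, "red bicycle")
```

The design point is that the two stages communicate only through boxes, so either component can be swapped (a different detector, a distilled SAM variant) without touching the other.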

§17

Efficient deployment

A research checkpoint of Mask R-CNN or Mask2Former is rarely production-ready: it runs at 5–15 fps on a server GPU, consumes several gigabytes of memory, and expects FP32 inputs. Real applications need detectors and segmentors running at 30+ fps on mobile chips, embedded accelerators, or in-browser WASM. The efficiency toolkit — distillation, quantisation, pruning, and format conversion — bridges the gap.

Knowledge distillation trains a small student detector to match a larger teacher's predictions. For detection specifically, feature-level distillation (student feature maps mimic teacher feature maps, often with attention masks that focus on object regions) works better than pure output-level distillation, because detection outputs are sparse and very sensitive to NMS. LD (Localisation Distillation, Zheng et al., 2022) distils the distribution-focal-loss representations. A modern distilled mobile detector — YOLO-NAS, RTMDet-Tiny — retains ~90% of the teacher's AP at 20% of the compute.
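A minimal sketch of the feature-level idea, assuming student and teacher feature maps have already been projected to the same channel count: a weighted MSE that up-weights locations inside object regions. The weighting scheme and the 4× foreground weight are illustrative, not the specific FGD or LD formulation.

```python
import numpy as np

def masked_feature_distill_loss(student_feat, teacher_feat, fg_mask,
                                fg_weight=4.0):
    """Weighted MSE between (C, H, W) feature maps, up-weighting the
    foreground locations given by the (H, W) binary mask fg_mask."""
    weights = np.where(fg_mask > 0, fg_weight, 1.0)       # (H, W)
    sq_err = ((student_feat - teacher_feat) ** 2).mean(0)  # (H, W)
    return float((weights * sq_err).sum() / weights.sum())

rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 16, 16))
student = teacher + 0.1 * rng.normal(size=(8, 16, 16))    # imperfect mimic
mask = np.zeros((16, 16))
mask[4:12, 4:12] = 1                                      # object region

loss = masked_feature_distill_loss(student, teacher, mask)
```

The mask typically comes from the ground-truth boxes rasterised onto the feature grid; concentrating the mimicking loss there is what makes feature distillation work for sparse detection outputs.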

Quantisation converts weights and activations from 32-bit floats to lower-precision formats. INT8 (8-bit signed integers) is the de facto standard for CPU/mobile inference via TensorRT, ONNX Runtime, and CoreML; it typically yields 3–4× speedup with ≤1 AP loss. FP16 (half precision) is near-lossless and standard for GPU inference. INT4 quantisation is an active research frontier — useful for transformers but often fragile for detection, where the spread of activation magnitudes across the multi-scale feature pyramid is hostile to aggressive quantisation. Quantisation-aware training (QAT) simulates quantisation during training and usually recovers most of the accuracy lost to post-training quantisation.
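The core of post-training INT8 quantisation fits in a few lines. This is the simplest symmetric per-tensor scheme; production stacks use per-channel scales and calibration data, but the round-trip error bound (half the scale step) is the same idea.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8: w is approximated by scale * q,
    with q an int8 in [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# Round-to-nearest bounds the per-weight error by half a quantisation step.
max_err = np.abs(w - w_hat).max()
```

The fragility mentioned above for detection comes from activations, not weights: one outlier-heavy feature-pyramid level inflates the scale and crushes the resolution available to every other value in the tensor.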

Pruning is the weakest lever. Unstructured pruning (zeroing individual weights) offers high compression ratios but requires sparse-kernel hardware support to translate into real speedup — which most inference stacks do not provide. Structured pruning (removing whole channels or filters) translates directly to faster dense inference but gives smaller compression ratios before accuracy degrades. For detection specifically, distillation and quantisation usually outperform pruning on the AP/latency frontier. Pruning's main value is as a complement: a distilled-and-quantised model can be pruned 10–20% further without major regression.
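Structured pruning is mechanically simple; the sketch below ranks a convolution's output channels by L1 norm and keeps the largest, which is the classic channel-pruning criterion. Real pipelines must also slice the next layer's input channels (hence the returned indices) and fine-tune afterwards.

```python
import numpy as np

def prune_channels(conv_w, keep_ratio=0.8):
    """Structured pruning: drop the output channels of a (out_ch, in_ch,
    kH, kW) conv weight with the smallest L1 norms. Returns the kept
    weights and kept channel indices (the next layer must be sliced to
    match)."""
    norms = np.abs(conv_w).sum(axis=(1, 2, 3))          # per-channel L1
    n_keep = max(1, int(round(keep_ratio * conv_w.shape[0])))
    keep = np.sort(np.argsort(norms)[-n_keep:])          # largest norms
    return conv_w[keep], keep

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 32, 3, 3))
w_pruned, kept = prune_channels(w, keep_ratio=0.8)       # 51 of 64 channels
```

Because whole channels disappear, the pruned layer is an ordinary smaller dense convolution, which is why structured pruning needs no sparse-kernel support to realise its speedup.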

The deployment format question matters more than people think. Most detectors are trained in PyTorch or JAX; most deployments run on TensorRT (NVIDIA GPUs), ONNX Runtime (CPU, cross-platform), OpenVINO (Intel), CoreML (Apple), or TFLite (mobile). Converting a PyTorch checkpoint to any of these formats involves tracing the graph, handling operators that the target runtime supports differently (NMS, RoIAlign, multi-scale feature gathering), and validating that the converted model's outputs match the original. The toolchain has improved: torch.compile, ONNX Runtime's dynamic shapes, and TensorRT's plugin system cover most cases in 2026, but "exports cleanly to TensorRT" is still a non-trivial architectural constraint.
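The validation step at the end of a conversion is worth automating. A sketch of a numeric output-parity check, assuming the original and converted models have already been run on the same validation batch; the tensor names and tolerances here are illustrative.

```python
import numpy as np

def validate_conversion(ref_outputs, converted_outputs,
                        atol=1e-3, rtol=1e-2):
    """Compare original vs converted model outputs tensor by tensor;
    accept the conversion only if every output stays within tolerance."""
    report = {}
    for name in ref_outputs:
        a = np.asarray(ref_outputs[name])
        b = np.asarray(converted_outputs[name])
        report[name] = {
            "max_abs_diff": float(np.abs(a - b).max()),
            "ok": bool(np.allclose(a, b, atol=atol, rtol=rtol)),
        }
    return report

# Simulated run: the "converted" model differs by FP16-level noise.
rng = np.random.default_rng(0)
boxes = rng.uniform(0, 640, size=(100, 4)).astype(np.float32)
scores = rng.uniform(size=100).astype(np.float32)
ref = {"boxes": boxes, "scores": scores}
conv = {"boxes": boxes + 1e-4, "scores": scores + 1e-5}

report = validate_conversion(ref, conv)
```

For detectors the comparison is subtler than plain `allclose` when NMS is involved, since a tiny score perturbation can reorder boxes; comparing matched boxes by IoU rather than by index is the usual workaround.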

Recent efficient-detection research has focused on architectures specifically designed to quantise and export cleanly. RTMDet (Lyu et al., 2022) is a representative: a CSPNeXt backbone, large-kernel convolutions, and an anchor-free head with dynamic label assignment, all designed so every operator maps to a supported TensorRT/ONNX primitive. On COCO it achieves ~50 AP at real-time speeds, matching YOLOv8 while being cleaner architecturally.

For promptable and open-vocabulary models, the efficiency picture is different. SAM's image encoder is the bottleneck; distilled versions (MobileSAM, FastSAM) replace it with a small ViT or CNN while keeping the same prompt decoder, giving 50–100× speedups with modest quality loss. This pattern — distil the heavy encoder but keep the lightweight task-specific head — generalises across the vision-foundation-model landscape.

§18

Detection and segmentation in the ML lifecycle

The chapter's architectures and techniques only produce useful systems when embedded in a full ML lifecycle: choosing and curating a dataset, labelling it, training, evaluating, iterating, and shipping. The operational picture is almost as important as the architectural one.

The canonical datasets are COCO (80 classes, 330k images, ~1.5M instances — the standard research benchmark), LVIS (1200+ classes, long-tailed, the more rigorous modern benchmark), Open Images (600 classes, 9M images, larger but noisier), Objects365 (365 classes, 2M images), and task-specific ones: Cityscapes (autonomous driving), ADE20K (scene parsing), BDD100K (driving video), Mapillary Vistas (street view). Each has its own annotation format and quirks; the COCO JSON format is the lingua franca that most tooling supports.

Annotation tooling is a bottleneck for most real-world applications. Bounding boxes take ~5–10 seconds per box; polygon masks take 30–60 seconds. A dataset of 10 000 images with 10 masks each is weeks of annotation work. Modern tools — CVAT, Label Studio, Roboflow, V7 — accelerate this with model-in-the-loop pre-labelling (often using SAM or a small pretrained detector), active learning, and quality-control workflows. Annotation cost has dropped by ~5–10× since SAM's release, enabling datasets that were previously impractical.
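The "weeks of work" claim is easy to make concrete. Taking the mid-range of the per-mask estimate above:

```python
# Back-of-envelope annotation budget for the dataset described above:
# 10,000 images with 10 polygon masks each, at ~45 s per mask
# (mid-range of the 30-60 s estimate).
n_images = 10_000
masks_per_image = 10
seconds_per_mask = 45

total_hours = n_images * masks_per_image * seconds_per_mask / 3600
person_weeks = total_hours / 40   # one annotator, 40-hour weeks

# ~1250 hours, i.e. ~31 person-weeks before any review passes.
```

At the ~5–10× reduction that SAM-assisted pre-labelling brings, the same dataset drops to roughly 3–6 person-weeks, which is the difference between impractical and routine for a small team.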

The toolkit landscape. For research, Detectron2 (Meta FAIR) and MMDetection (OpenMMLab) are the two leading frameworks; both offer implementations of essentially every architecture in this chapter with consistent training recipes. For production, Ultralytics (YOLO family), torchvision's detection models, and HuggingFace Transformers' DETR models are the mainstream. For segmentation specifically, segmentation-models-pytorch and mmsegmentation are the go-to libraries. The choice is increasingly ecosystem-driven rather than accuracy-driven: which framework best fits the rest of your stack.

Evaluation beyond mAP is increasingly important. Real deployments care about per-class accuracy (some classes are safety-critical), per-size accuracy (can we reliably detect the cyclist 80 m away?), calibration (are 0.8-confidence predictions correct 80% of the time?), failure modes (what does the model do on out-of-distribution inputs?), and throughput / latency (not just FPS, but p95 latency). A COCO mAP score is a useful summary statistic but not a sufficient basis for deployment.
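The calibration question ("are 0.8-confidence predictions correct 80% of the time?") is answered by binning predictions by confidence and comparing per-bin accuracy to per-bin mean confidence, the data behind a reliability diagram and the expected calibration error (ECE). A sketch on simulated detections:

```python
import numpy as np

def bin_calibration(confidences, correct, n_bins=10):
    """Per-bin (mean confidence, accuracy, count) plus the expected
    calibration error: the count-weighted mean of |confidence - accuracy|."""
    bins = np.clip((confidences * n_bins).astype(int), 0, n_bins - 1)
    rows, ece, n = [], 0.0, len(confidences)
    for b in range(n_bins):
        m = bins == b
        if not m.any():
            continue
        conf, acc = confidences[m].mean(), correct[m].mean()
        rows.append((b, float(conf), float(acc), int(m.sum())))
        ece += m.sum() / n * abs(conf - acc)
    return rows, float(ece)

# Simulated detections whose confidence equals the true probability of
# being correct, so the "model" is well calibrated and ECE is small.
rng = np.random.default_rng(0)
conf = rng.uniform(0.05, 0.95, size=20_000)
correct = (rng.uniform(size=20_000) < conf).astype(float)

rows, ece = bin_calibration(conf, correct)
```

Detectors trained with focal-loss-style objectives are often systematically under-confident, so a check like this, run per class, frequently reveals miscalibration that a COCO mAP number hides.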

Detection and segmentation occupy a specific role in the larger Part VII pipeline. Chapter 02 provides the backbones; this chapter turns backbones into parsing primitives. Chapter 04 (video) extends detection/segmentation to temporal sequences. Chapter 05 (3-D vision) builds on 2-D detection with depth and multi-view reasoning. Chapter 06 (vision-language) closes the loop by letting detectors and segmentors be driven by language queries — which, as §15–§16 showed, is where the architectural convergence is most advanced. A reader finishing this chapter has roughly the ingredients needed to build a full scene-understanding system; the remaining chapters add time, depth, and language.

Closing thought: the task of detection in 2026 is unrecognisable from the task in 2014. R-CNN's 2000 region proposals per image and SIFT-era feature descriptors have been replaced by promptable foundation models that can detect and segment arbitrary text-described objects at real-time speeds. The research frontier now asks whether detection-as-a-task is even a useful abstraction, or whether it should be absorbed entirely into multimodal foundation models. The tooling, datasets, and deployment infrastructure are mature; the architectural convergence is nearly complete. The frontier is less about new detectors and more about what detection looks like inside a larger model that also reads text, generates images, and reasons about scenes.

Further reading

Detection and segmentation have produced some of the most cited papers in deep learning — R-CNN, Faster R-CNN, Mask R-CNN, YOLO, DETR, SAM — and a substantial engineering literature that surrounds them. The selections below trace the two-stage, one-stage, transformer-detector, and segmentation families, plus the open-vocabulary and promptable-foundation-model frontier that has emerged since 2022. Software references point to the research frameworks (Detectron2, MMDetection, Ultralytics) that most practitioners start from and the foundation-model APIs (SAM, Grounding DINO) that many systems now depend on.

Textbooks & tutorials

Textbook
Computer Vision: Algorithms and Applications (2nd ed.), chapters 6 & 7
Richard Szeliski, Springer, 2022 (free PDF online)
The canonical reference. Chapter 6 covers segmentation (classical through CNN-based), chapter 7 covers recognition including detection. Updated 2022 edition integrates transformer detectors and Mask R-CNN.
Textbook
Multiple View Geometry in Computer Vision (2nd ed.)
Richard Hartley & Andrew Zisserman, Cambridge, 2004
Necessary background for the geometric side of detection — projective transformations, camera models, and the box representations that detection and segmentation inherit from earlier vision.
Course
Stanford CS231n — Lecture 11: Detection and Segmentation
Fei-Fei Li, Justin Johnson & successors
The standard graduate introduction. The detection / segmentation lecture is freely available online and covers R-CNN through DETR at a level appropriate for a strong undergraduate or early graduate student.
Tutorial
Detectron2 documentation and tutorials
Meta AI (FAIR), ongoing
The reference implementation for two-stage detectors. Tutorials cover training Mask R-CNN on custom data, the RoIAlign operator, and the configuration system. Most Meta-originated detection papers ship a Detectron2 config.
Tutorial
MMDetection documentation
OpenMMLab, ongoing
The broadest open-source detection framework. Covers essentially every architecture in this chapter with a unified training recipe, including DINO, Mask2Former, Grounding DINO, and the YOLO family.
Tutorial
Ultralytics YOLOv8 documentation
Ultralytics, ongoing
The most-used detector library in industry. Tutorials cover detection, segmentation, classification, and pose estimation from a single codebase with pip-installable training on custom datasets.
Tutorial
COCO evaluation protocol and pycocotools
Microsoft COCO team
The detection and instance-segmentation evaluation code that every paper uses. Reading the pycocotools source is the cleanest way to understand the AP, AP50, AP75, APS/M/L metrics in detail.
Tutorial
Segment Anything Model demos and documentation
Meta AI, 2023–2024
Interactive demos at segment-anything.com; model code and checkpoints on GitHub. The cleanest way to build intuition for what promptable segmentation produces.

Foundational papers

Paper
Rich feature hierarchies for accurate object detection and semantic segmentation (R-CNN)
Ross Girshick, Jeff Donahue, Trevor Darrell & Jitendra Malik, CVPR 2014
The paper that brought CNN features to detection. Set the two-stage template (region proposals → per-region CNN classification) that dominated the next five years.
Paper
Fast R-CNN
Ross Girshick, ICCV 2015
Introduced RoI Pooling and the shared-feature-map architecture. The 200× speedup over R-CNN was the key engineering step that made learned detection practical.
Paper
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Shaoqing Ren, Kaiming He, Ross Girshick & Jian Sun, NeurIPS 2015
Introduced the Region Proposal Network and the anchor-box formulation. Still the single most-cited detection paper; the architectural vocabulary it established (RPN, anchor deltas, IoU matching) is the basis of the whole anchor-based detector family.
Paper
Mask R-CNN
Kaiming He, Georgia Gkioxari, Piotr Dollár & Ross Girshick, ICCV 2017
Introduced RoIAlign and the parallel mask branch, turning Faster R-CNN into an instance segmentor. Still a strong baseline in 2026 and the paper that defined the two-stage instance-segmentation template.
Paper
You Only Look Once: Unified, Real-Time Object Detection (YOLO-v1)
Joseph Redmon, Santosh Divvala, Ross Girshick & Ali Farhadi, CVPR 2016
The first single-pass detector. Less accurate than contemporaneous Faster R-CNN but far faster; established the one-stage detector aesthetic and opened the YOLO family.
Paper
SSD: Single Shot MultiBox Detector
Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu & Alexander Berg, ECCV 2016
The multi-scale anchor-based one-stage detector. Set the pyramid-of-feature-maps template that FPN later refined. Strong accuracy-speed trade-off that Faster R-CNN did not match at comparable latency.
Paper
Focal Loss for Dense Object Detection (RetinaNet)
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He & Piotr Dollár, ICCV 2017
Introduced focal loss and closed the one-stage vs. two-stage accuracy gap. Focal loss is one of the most widely-reused ideas in the chapter — it appears in almost every modern dense-head detector.
Paper
Feature Pyramid Networks for Object Detection
Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan & Serge Belongie, CVPR 2017
The top-down feature pyramid that every post-2017 detector assumes. Two pages of method; several hundred pages of downstream papers.
Paper
Fully Convolutional Networks for Semantic Segmentation
Jonathan Long, Evan Shelhamer & Trevor Darrell, CVPR 2015
The paper that defined end-to-end CNN-based semantic segmentation. Established the encoder-upsample-decoder architecture that every later segmentation model is a refinement of.
Paper
U-Net: Convolutional Networks for Biomedical Image Segmentation
Olaf Ronneberger, Philipp Fischer & Thomas Brox, MICCAI 2015
The encoder-decoder with symmetric skip connections. Still the default segmentation architecture for biomedical imaging and the backbone shape behind many diffusion-model denoisers.
Paper
DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy & Alan Yuille, TPAMI 2018
The atrous-convolution approach to dense prediction. DeepLab-v3+ remains the reference strong convolutional semantic segmentor.
Paper
End-to-End Object Detection with Transformers (DETR)
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov & Sergey Zagoruyko, ECCV 2020
The paper that eliminated anchors and NMS with a transformer decoder and set prediction. Slow to converge on release but rapidly fixed by follow-ups; the architectural template for essentially every post-2020 research detector.
Paper
Panoptic Segmentation
Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother & Piotr Dollár, CVPR 2019
The paper that introduced the task, the PQ metric, and the stuff-vs-things distinction. The task-defining paper that reshaped the segmentation landscape.
Paper
Segment Anything (SAM)
Alexander Kirillov et al., ICCV 2023
Foundation-model segmentation. Introduced the image-encoder / prompt-encoder / mask-decoder architecture and the SA-1B dataset. Collapsed annotation cost across the field and became a standard pre-processing step in many vision pipelines.
Paper
Grounded Language-Image Pre-training (GLIP)
Liunian Harold Li et al., CVPR 2022
Unified detection and phrase grounding in a single pre-training task. A foundational paper for open-vocabulary detection; Grounding DINO is a direct descendant.

Modern extensions

Paper
YOLOv3: An Incremental Improvement
Joseph Redmon & Ali Farhadi, arXiv 2018
The last YOLO version by the original authors. A refreshingly honest "tech report" style paper that codifies the YOLO-v3 design used as a baseline for years afterward.
Paper
YOLOv4: Optimal Speed and Accuracy of Object Detection
Alexey Bochkovskiy, Chien-Yao Wang & Hong-Yuan Mark Liao, arXiv 2020
The careful ablation study that assembled the "bag of freebies" and "bag of specials" — mosaic augmentation, CIoU loss, CSPDarknet, Mish activation — into a coherent detector.
Paper
YOLOX: Exceeding YOLO Series in 2021
Zheng Ge, Songtao Liu, Feng Wang, Zeming Li & Jian Sun, arXiv 2021
Anchor-free YOLO with decoupled head and SimOTA label assignment. Demonstrated that the YOLO chassis could absorb the ideas from FCOS/ATSS at no speed cost.
Paper
FCOS: Fully Convolutional One-Stage Object Detection
Zhi Tian, Chunhua Shen, Hao Chen & Tong He, ICCV 2019
The anchor-free detector that matched or beat anchor-based RetinaNet. Introduced centreness and the per-location (l, t, r, b) box representation that became standard in modern dense-head detectors.
Paper
Objects as Points (CenterNet)
Xingyi Zhou, Dequan Wang & Philipp Krähenbühl, arXiv 2019
Reframed detection as keypoint estimation. The lightweight, extensible design extends cleanly to 3-D detection and pose estimation; many autonomous-driving perception systems descend from it.
Paper
Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection (ATSS)
Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei & Stan Z. Li, CVPR 2020
The clarifying paper: the gap between anchor-based and anchor-free detectors was almost entirely about label assignment. ATSS became the default assigner in the subsequent generation of detectors.
Paper
Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection (GFL)
Xiang Li, Wenhai Wang, Lijun Wu, Shuo Chen, Xiaolin Hu, Jun Li, Jinhui Tang & Jian Yang, NeurIPS 2020
Quality Focal Loss and Distribution Focal Loss. The loss-engineering paper that made the biggest COCO AP move of 2020–2022 and is used in YOLO-v6/v7/v8.
Paper
Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression (GIoU)
Hamid Rezatofighi et al., CVPR 2019
Extends IoU to give a non-zero gradient when boxes do not overlap, enabling direct optimisation of the evaluation metric. DIoU and CIoU are straightforward extensions.
Paper
Deformable DETR: Deformable Transformers for End-to-End Object Detection
Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang & Jifeng Dai, ICLR 2021
Fixed DETR's convergence by replacing dense cross-attention with sparse deformable attention. The architectural improvement that made DETR-family detectors practical.
Paper
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection
Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni & Heung-Yeung Shum, ICLR 2023
The denoising-training DETR variant. Combined with deformable attention, DINO sits at the top of COCO with straightforward ResNet-50 and Swin backbones.
Paper
DETRs Beat YOLOs on Real-time Object Detection (RT-DETR)
Yian Zhao et al., CVPR 2024
Demonstrated that a carefully-tuned DETR variant can match YOLO-v8 at real-time speeds with higher accuracy. The practical implication is that transformer detectors are no longer only the accuracy-leader; they are also competitive at the speed frontier.
Paper
Per-Pixel Classification is Not All You Need for Semantic Segmentation (MaskFormer)
Bowen Cheng, Alexander G. Schwing & Alexander Kirillov, NeurIPS 2021
Reframed semantic segmentation as mask classification using DETR-style queries. The conceptual precursor to Mask2Former.
Paper
Masked-attention Mask Transformer for Universal Image Segmentation (Mask2Former)
Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov & Rohit Girdhar, CVPR 2022
The universal segmentation architecture. One model, one training recipe, state-of-the-art on COCO panoptic, Cityscapes, and ADE20K simultaneously. The strongest demonstration of the mask-classification paradigm.
Paper
SOLOv2: Dynamic and Fast Instance Segmentation
Xinlong Wang, Rufeng Zhang, Tao Kong, Lei Li & Chunhua Shen, NeurIPS 2020
Instance segmentation without boxes or RoI operators. The anchor-free, box-free instance segmentor that inspired several follow-ups and the dynamic-head family of architectures.
Paper
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers
Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez & Ping Luo, NeurIPS 2021
A hierarchical transformer backbone with a simple MLP decoder — a clean, efficient alternative to the heavier DeepLab-v3+ and Mask2Former designs. Common choice for deployment-oriented semantic segmentation.
Paper
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu & Lei Zhang, arXiv 2023
The current standard for open-vocabulary detection. Text-prompt-to-bounding-box with a DINO-based transformer detector. Often used as a first stage in promptable perception pipelines; pairs naturally with SAM for text-to-mask workflows.
Paper
Simple Open-Vocabulary Object Detection with Vision Transformers (OWL-ViT)
Matthias Minderer et al., ECCV 2022
An earlier and more minimalist open-vocabulary detector. Trained jointly on image-text contrastive learning and detection, produces CLIP-aligned region embeddings that can be queried with arbitrary text at inference.
Paper
SAM 2: Segment Anything in Images and Videos
Nikhila Ravi et al., arXiv 2024
Extends SAM to video with a memory mechanism that maintains instance identity across frames. Enables single-click tracking of arbitrary objects through long video sequences; the bridge from still-image promptable segmentation to video understanding.
Paper
RTMDet: An Empirical Study of Designing Real-Time Object Detectors
Chengqi Lyu et al., arXiv 2022
A careful empirical study of every design choice in the real-time detector family — backbone, neck, head, training recipe. RTMDet is the strongest export-friendly detector in the MMDetection ecosystem.
Paper
EfficientDet: Scalable and Efficient Object Detection
Mingxing Tan, Ruoming Pang & Quoc V. Le, CVPR 2020
BiFPN plus compound scaling applied to detection. The most principled efficiency-architecture-search paper for detection, with a clean Pareto frontier from EfficientDet-D0 through D7.
Paper
Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models (ODISE)
Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang & Shalini De Mello, CVPR 2023
Open-vocabulary panoptic segmentation using internal representations of a pretrained text-to-image diffusion model as dense features. Demonstrated that generative models produce strong discriminative features for open-vocabulary tasks.
Paper
Cascade R-CNN: Delving Into High Quality Object Detection
Zhaowei Cai & Nuno Vasconcelos, CVPR 2018
A sequence of RoI heads at increasing IoU thresholds, each refining the previous stage's boxes. A simple but effective idea that advanced the AP75 frontier and is still used in high-accuracy two-stage detectors.
Paper
Focal and Global Knowledge Distillation for Detectors (FGD)
Zhendong Yang, Zhe Li, Xiaohu Jiang, Yuan Gong, Zehuan Yuan, Danpei Zhao & Chun Yuan, CVPR 2022
A refined feature-level distillation loss specifically for detectors. Shows the empirical benefits of distillation with a careful ablation of what features matter.
Paper
Fast Segment Anything (FastSAM) and MobileSAM
Multiple authors, 2023
Distilled / simplified SAM variants. FastSAM replaces the ViT encoder with a YOLOv8-Seg backbone; MobileSAM uses a small ViT distilled from SAM's encoder. Both demonstrate the "distil the encoder, keep the decoder" template for promptable models.
Paper
OneFormer: One Transformer to Rule Universal Image Segmentation
Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov & Humphrey Shi, CVPR 2023
A task-token-conditioned Mask2Former that handles semantic, instance, and panoptic segmentation with a single weight set. Representative of the unification trend where one model serves all segmentation tasks.
Paper
HTC / HTC++: Hybrid Task Cascade for Instance Segmentation
Kai Chen et al., CVPR 2019 (and follow-ups)
A strong cascade-style instance segmentor that long held the top of COCO; the detailed architecture is instructive as an example of how far two-stage detectors can be pushed.
Paper
Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation
Bowen Cheng et al., CVPR 2020
A clean bottom-up panoptic segmentor based on DeepLab. The opposite design to Panoptic FPN and a useful baseline in the family of pre-Mask2Former panoptic architectures.
Paper
Sparse R-CNN: End-to-End Object Detection with Learnable Proposals
Peize Sun et al., CVPR 2021
Sparse R-CNN applies DETR's set-prediction idea to a classical R-CNN chassis with a fixed set of learnable proposals. An instructive intermediate between classical two-stage detection and DETR.
Paper
OTA: Optimal Transport Assignment for Object Detection
Zheng Ge, Songtao Liu, Zeming Li, Osamu Yoshie & Jian Sun, CVPR 2021
Framed label assignment as an optimal transport problem. SimOTA (a simpler variant) is what YOLOX and later YOLOs use. A clear example of how a clean mathematical formulation advanced a practical sub-problem.

Software & tools

Software
Detectron2 (github.com/facebookresearch/detectron2)
Meta AI Research (FAIR)
The reference implementation for two-stage detectors and Mask R-CNN family. Well-engineered, widely-studied codebase; most Meta-authored detection papers release Detectron2 configs.
Software
MMDetection (github.com/open-mmlab/mmdetection)
OpenMMLab
The broadest open-source detection framework, with hundreds of reference implementations. Pairs with MMSegmentation and MMYOLO in the OpenMMLab ecosystem.
Software
Ultralytics YOLO (github.com/ultralytics/ultralytics)
Ultralytics
The most-deployed detection library in industry. Detection, segmentation, classification, pose — all from a single pip-install. The practitioner's default; research-grade reimplementations often target this codebase.
Software
torchvision models.detection
PyTorch team
Stable, minimal reference implementations of Faster R-CNN, Mask R-CNN, FCOS, RetinaNet, and SSD. Smaller in scope than Detectron2 but better integrated into a standard PyTorch stack.
Software
Segment Anything and SAM 2 (github.com/facebookresearch/segment-anything)
Meta AI Research (FAIR)
The SAM and SAM 2 reference implementations. Include prompt encoders, mask decoders, and interactive demos. Often used as drop-in segmentation backends in larger pipelines.
Software
Grounding DINO (github.com/IDEA-Research/GroundingDINO)
IDEA-Research
The reference open-vocabulary text-prompted detector. Often chained with SAM for text-to-mask pipelines; one of the most-downloaded vision foundation models on Hugging Face.
Software
segmentation-models-pytorch (github.com/qubvel/segmentation_models.pytorch)
Pavel Iakubovskii and contributors
A large set of encoder-decoder segmentation architectures (U-Net, Linknet, FPN, PSPNet, DeepLab, UperNet) with pretrained backbones. The default library for practical semantic-segmentation work.
Software
CVAT (github.com/opencv/cvat)
OpenCV / CVAT Team
Open-source annotation tool for detection, segmentation, and keypoint labelling. Supports model-in-the-loop pre-labelling and multiple annotators. The standard self-hosted option.
Software
Label Studio (github.com/HumanSignal/label-studio)
HumanSignal
A modern multi-modal annotation platform. Supports SAM-assisted labelling out of the box; common choice for teams that need to label images, text, and audio in one tool.
Software
TensorRT, ONNX Runtime, and OpenVINO inference stacks
NVIDIA, Microsoft, Intel
The three major production-inference stacks. Detection/segmentation deployment goes through one of these; each has specific operator-support quirks that shape which architectures quantise and export cleanly.
Software
Roboflow (roboflow.com)
Roboflow team
A managed platform for dataset curation, annotation, and deployment of detection/segmentation models. Useful as a reference for how the full annotation-to-deployment loop is packaged commercially.
Software
Hugging Face Transformers — detection and segmentation models
Hugging Face
DETR, Mask2Former, OWL-ViT, Grounding DINO, SAM — all accessible via a unified API. The default way to try a new detection or segmentation foundation model from Python with minimal setup.