Part VII · Computer Vision · Chapter 04

Video understanding, the branch of computer vision that adds a time axis to the pixel grid — the story of optical flow (Lucas-Kanade, Horn-Schunck, FlowNet, RAFT), two-stream and 3D-convolutional networks (I3D, SlowFast), video transformers (TimeSformer, ViViT, MViT, Video Swin), self-supervised video pretraining (VideoMAE), object tracking (SORT, ByteTrack), video segmentation (SAM 2), and the video-language foundation models (InternVideo, Video-LLaVA) that now parse full clips end-to-end.

Video is not just a stack of images. A clip of a cat jumping onto a table and a clip of a cat falling off a table contain almost the same pixels frame-by-frame; the difference lives in the temporal structure. Video understanding is the family of tasks that extract that structure: action recognition (which of 400 verbs is happening here?), temporal action localisation (where in the video does the serve start and the rally end?), object tracking (same person, new frame), video object segmentation (pixel-accurate mask for each instance across the clip), and video-language grounding (which moment in the video answers the question?). The technical arc runs from hand-crafted motion features — optical flow, improved dense trajectories — through two-stream networks that processed RGB and flow in parallel, 3D convolutions that learned spatio-temporal filters jointly (C3D, I3D, R(2+1)D, SlowFast, X3D), to the modern video transformers — TimeSformer's factorised attention, ViViT's tubelet embedding, MViT's pooling attention, Video Swin's shifted windows — that now dominate the Kinetics and Something-Something benchmarks. On the self-supervised side, VideoMAE and MAE-ST showed that masked reconstruction scales to video; on the multimodal side, VideoCLIP, Frozen-in-Time, and InternVideo align video with text for retrieval, captioning, and open-vocabulary classification, and the latest Video-LLMs (Video-LLaVA, VideoChat, Video-ChatGPT) feed tokenised clips into language models that can answer arbitrary questions about a scene. This chapter follows the full arc: the data-and-metrics landscape, optical flow, the CNN era (two-stream and 3D), video transformers and masked video modelling, video-language pretraining, tracking, video segmentation, temporal action segmentation, and the efficiency and deployment engineering that turns research clips into 30-fps production systems.

How to read this chapter

Sections one through three set up the problem. Section one is why video matters — what temporal structure adds over still images, the canonical task list (classification, detection-in-video, tracking, segmentation, retrieval, captioning, question answering), and why a naïve per-frame pipeline is both wasteful and wrong. Section two is video as a data type — containers and codecs (H.264, H.265, AV1), frame rates, temporal sampling strategies (dense vs. sparse, segment-based, clip-based), memory budgets, and the gap between decoding a clip and feeding it to a GPU. Section three is evaluation metrics: top-1 and top-5 accuracy on trimmed clips, mean Average Precision for action detection, temporal IoU and the tIoU-thresholded detection metric, and the tracking metrics (MOTA, MOTP, IDF1, HOTA) that replaced earlier ad-hoc measures.

Sections four through seven cover the classical motion and CNN era. Section four is optical flow — the Horn-Schunck variational formulation, Lucas-Kanade sparse flow, the Middlebury benchmark, and the deep-flow family (FlowNet, PWC-Net, RAFT, GMA). Section five is per-frame baselines — running a 2D CNN on each frame and pooling the scores, why it is surprisingly competitive on classification but blind to fine-grained motion. Section six is the two-stream architecture that started the modern era: Simonyan and Zisserman's spatial-plus-flow networks, TSN's sparse temporal sampling, and I3D's trick of "inflating" 2D ImageNet weights into 3D kernels. Section seven is the 3D convolutional family — C3D, I3D, R(2+1)D, SlowFast's dual-pathway architecture, and X3D's network-scaling family — and the engineering trade-offs between clip length, input resolution, and compute.

Sections eight and nine are the detection-style tasks. Section eight is action recognition — the task formulation for trimmed clips, the Kinetics / Something-Something / AVA / Moments-in-Time benchmarks, and the different styles of "action" each tests (appearance-dominant, motion-dominant, context-dependent, compositional). Section nine is temporal action localisation: action detection in untrimmed videos, the BSN / BMN proposal families, one-stage ActionFormer / TriDet regressors, and the evaluation gotchas on ActivityNet and THUMOS.

Sections ten and eleven cover the transformer era. Section ten is video transformers — TimeSformer's divided space-time attention, ViViT's tubelet embedding and factorised variants, MViT's multi-scale pooling, Video Swin's shifted 3D windows, and Uniformer's convolution-attention hybrid. Section eleven is masked video modelling — the self-supervised pretraining story from VideoMAE through MAE-ST, Siamese masked modelling, and V-JEPA's joint-embedding predictive architecture.

Sections twelve through fifteen cover the multimodal and dense-prediction frontiers. Section twelve is video-language pretraining — contrastive video-text learning (MIL-NCE, VideoCLIP), the Frozen-in-Time curriculum, InternVideo's generalist pretraining, and the generation of Video-LLMs (Video-LLaVA, VideoChat, Video-ChatGPT) that feed clip tokens into an LLM. Section thirteen is object tracking — Kalman-filter + Hungarian assignment baselines (SORT, DeepSORT, ByteTrack), detection-driven trackers (Tracktor, CenterTrack), and end-to-end transformer trackers (MOTR, TrackFormer). Section fourteen is video segmentation: video object segmentation (OSVOS, STM, XMem), video instance segmentation (MaskTrack R-CNN, VisTR, IDOL), and the SAM 2 promptable video segmentor. Section fifteen is temporal action segmentation — the dense per-frame action-label task, MS-TCN, ASFormer, and the evaluation metrics that reward both correct labels and correct boundaries.

Section sixteen is efficient video: the token-reduction, temporal-coarsening, and dynamic-inference techniques (AdaViT, Ada3D, STTS, STA) that make video transformers practical, plus the deployment story — quantisation, ONNX/TensorRT export, edge inference, and streaming-window architectures for real-time video. The closing in-ml section is the operational picture: dataset catalogues (Kinetics-700, Something-Something v2, AVA, ActivityNet, HowTo100M, WebVid, InternVid), the tooling ecosystem (MMAction2, PyTorchVideo, Decord, Kornia), labelling platforms, and how the video pipeline plugs back into the rest of Part VII and into Part XIV's video-generation models.

Why video understanding mattersFrom frames to temporal structure, and the task landscape that time enables
Video as a data typeCodecs, frame rates, sampling strategies, decoding pipelines, memory budgets
Evaluation metricsTop-1 / top-5, action-detection mAP, temporal IoU, MOTA / IDF1 / HOTA
Optical flowHorn-Schunck, Lucas-Kanade, FlowNet, PWC-Net, RAFT, GMA
Per-frame baselines2D CNNs on individual frames, score averaging, and the motion blind spot
Two-stream networksSimonyan-Zisserman, TSN, I3D — parallel RGB and optical-flow processing
3D convolutional networksC3D, I3D, R(2+1)D, SlowFast, X3D — spatio-temporal convolutions
Action recognitionKinetics, Something-Something, AVA, Moments-in-Time — what counts as an action
Temporal action localisationAction detection in untrimmed video — BSN, BMN, ActionFormer, TriDet
Video transformersTimeSformer, ViViT, MViT, Video Swin, Uniformer — factorised spatio-temporal attention
Masked video modellingVideoMAE, MAE-ST, Siamese masked modelling, V-JEPA — self-supervised video
Video-language modelsMIL-NCE, VideoCLIP, Frozen, InternVideo, Video-LLaVA, VideoChat
Object trackingSORT, DeepSORT, ByteTrack, Tracktor, CenterTrack, MOTR, TrackFormer
Video segmentationVOS (OSVOS, STM, XMem), VIS (MaskTrack, VisTR), SAM 2
Temporal action segmentationDense per-frame action labels — MS-TCN, ASFormer, boundary-aware losses
Efficient videoToken reduction, temporal coarsening, dynamic inference, streaming deployment
Video understanding in the ML lifecycleDatasets, toolkits (MMAction2, PyTorchVideo), labelling, downstream integration

§1

Why video understanding matters

A still image tells you what a scene looks like. A video tells you what is happening in it. The difference is large enough that almost every task in computer vision has a video variant, and for many real applications — self-driving cars, surveillance, sports analytics, surgical assistance, content moderation — the video variant is the one that matters. Video understanding is the body of techniques that extracts structure from a sequence of frames: classifying an action, localising it in time, following an object through occlusion, segmenting a mask across a clip, or answering a natural-language question about what happened.

The naïve approach — run a still-image model on each frame independently and average the results — is a useful baseline but misses the point. A person raising a hand and a person lowering a hand are almost identical at any single instant; they differ only in the direction of motion. A glass being filled and a glass being poured out swap the causal arrow of the water. "Pretending to put something into a pile" and "putting something into a pile" (a classic Something-Something category) differ in an intention that is only visible across many frames. Image models cannot see any of this; video models have to.

The task landscape is broader than it is for images. Video classification (Kinetics-400/600/700) asks which of a fixed set of action labels a trimmed clip belongs to. Temporal action localisation (ActivityNet, THUMOS) asks where in a long untrimmed video each action starts and ends. Object tracking (MOT, DanceTrack) asks for consistent identities of every object across a clip. Video object segmentation (DAVIS, YouTube-VOS) asks for pixel masks of a specified instance on every frame. Video instance segmentation (YouTube-VIS) asks the same for all instances jointly. Temporal action segmentation (Breakfast, 50Salads) asks for a per-frame action label on every frame of a long clip. Video retrieval (MSR-VTT, HowTo100M) asks for the clip that matches a text query; video captioning and video question answering go the other way.

Why this is harder than images. A single clip in a research benchmark is typically 2–10 seconds long and contains 30–300 frames — an order of magnitude more pixels than a single image. Training on it naïvely costs an order of magnitude more compute. Memory is the dominant constraint: a 16-frame 224 × 224 clip has 2.4× the activations of a single image; a 32-frame 384 × 384 clip has 18×. Every architectural choice in video — input length, spatial resolution, temporal sampling rate, token reduction — is ultimately a compute budget decision.

The field has moved through three broad eras. The hand-crafted-features era (roughly through 2013) computed motion features — HOG, HOF, MBH, improved dense trajectories (iDT) — and fed them into SVM or Fisher-vector classifiers; iDT was the state of the art on Hollywood2 and HMDB for years. The CNN era (2014–2020) built on the ImageNet backbone tradition: two-stream networks, 3D CNNs, and spatio-temporal architectures like SlowFast. The transformer era (2020–) imported the ViT inductive bias, giving TimeSformer, ViViT, MViT, Video Swin, and their masked-modelling self-supervised cousins. Each era roughly doubled the accuracy ceiling on Kinetics-400, and each was enabled by a new dataset — UCF-101 for iDT, Kinetics-400 for 3D CNNs, Kinetics-700 plus HowTo100M and WebVid for transformers — that put pressure on model capacity.

This chapter treats video understanding as a unified topic rather than as a collection of sub-tasks. The backbones, pretraining objectives, and attention mechanisms that work for action recognition also serve tracking, segmentation, and video-language models; the differences are in the heads and the labels. The chapter walks through motion features (optical flow), the CNN era (two-stream, 3D), the transformer era, self-supervised pretraining, multimodal video-language models, and the dense-prediction tasks (tracking, segmentation, temporal action segmentation), finishing with the efficiency and deployment engineering that brings research clips to 30-fps production pipelines.

§2

Video as a data type

Before any model can learn from a video, the video has to be decoded into tensors. The engineering around that step — containers and codecs, frame rates, sampling strategies, and the CPU/GPU boundary — is where half of a video pipeline's wall-clock time is typically spent, and where most reproducibility problems live. A model can look state-of-the-art or badly broken depending entirely on how its input frames were sampled.

A video file is a container (MP4, MKV, WebM, MOV) wrapping one or more streams — one video stream, usually one or more audio streams, sometimes subtitle streams — each encoded with a specific codec. The dominant video codecs are H.264/AVC (still the default for interoperability), H.265/HEVC (better compression, royalty-encumbered), VP9 and AV1 (royalty-free, slowly replacing H.264 on the web), and ProRes or DNxHD for professional post-production. Codecs use temporal prediction — I-frames are standalone; P-frames predict from previous frames; B-frames predict bidirectionally — which means "decode frame 1000" typically requires decoding the preceding GOP (group of pictures). Keyframe-only seeking is fast but coarse; exact seeking is accurate but slow.

Frame rate varies enormously. Hollywood film is 24 fps, broadcast television is 25 or 29.97 fps, YouTube encourages 30 or 60 fps, gaming capture is often 60 or 120 fps, smartphone slow-motion is 240 fps, and scientific high-speed cameras can reach 10 000+ fps. Most video-understanding benchmarks operate at 25–30 fps; most models consume 8–64 frames per clip. A 10-second clip at 30 fps has 300 frames; sampling 16 means keeping one out of every 18. The sampling strategy — dense (consecutive frames with stride 1), sparse (uniform stride over the clip's duration), segment-based (TSN-style: divide the clip into N segments and sample one frame per segment) — is a model-design choice with large accuracy consequences.

Decoding is the bottleneck. A naïve pipeline decodes video on CPU with ffmpeg or OpenCV, transfers RGB tensors to GPU, and runs the model; the decode often takes as long as the forward pass. Decord and PyAV give frame-accurate random access with acceptable throughput; NVIDIA DALI and NVIDIA Video Codec SDK decode H.264/H.265 directly on the GPU via dedicated video engines, bypassing the CPU entirely; FFmpeg-cuvid offers a middle ground. Cloud training pipelines typically pre-decode videos into TAR archives of JPEG frames, or into WebDataset shards, to amortise decode cost across training runs; this inflates storage by 5–20× but turns training into an image-loading problem. Some recent work (MViT, Hiera, InternVideo) still uses this pattern.

The temporal sampling decision. Different tasks need different sampling. For trimmed action classification on Kinetics, 8 or 16 frames sampled uniformly across a 2–10 s clip captures enough structure; SlowFast famously uses two pathways at different rates (4 frames slow + 32 frames fast). For Something-Something, which tests fine motion, denser sampling helps. For temporal action localisation on long untrimmed video, you either window the video into overlapping clips or use a memory-efficient backbone that can ingest a full minute. For tracking and VOS, you want every frame (frame-to-frame identity matters). For video-language retrieval, coarse 4–8 frame sampling is often enough because text supervision is loose. There is no universally correct sampling rate; it is a dataset-task combination.

Resolution is the other axis. Kinetics research typically uses 224 × 224 crops; detection and segmentation in video push to 448 × 448 or higher; clinical and surveillance video can be 1080p or 4K. Every spatial doubling roughly quadruples compute, which interacts with the temporal dimension multiplicatively. The compute-accuracy trade-off for video is therefore a 3-D trade-off (frames × height × width), and a model that wins on Kinetics at 16 × 224 × 224 may lose at 8 × 320 × 320 or at 64 × 160 × 160.

Other data considerations: audio is often available alongside the video stream and can be a strong supervisory signal (for retrieval, action recognition, and AV-language pretraining); metadata (capture device, timestamp, geolocation) can leak through into annotations and cause shortcut learning; data provenance matters legally (scraped YouTube datasets have been retracted), and most benchmarks now ship only video IDs, requiring users to download their own copies, which means datasets decay as creators delete content. For production systems, live video means streaming inference — a windowed model sees frames as they arrive and must produce outputs at matched latency — which changes both the architecture and the evaluation protocol.

§3

Evaluation metrics

Video tasks inherit their metrics from adjacent image tasks but add a temporal dimension that complicates each one. Classification adds top-N accuracy across clip samples; detection and segmentation add temporal intersection-over-union; tracking has its own vocabulary of identity and association metrics; dense prediction adds boundary and edit scores. The specific protocol a paper uses can change "state of the art" by several points.

For trimmed action classification (Kinetics, Something-Something, UCF-101, HMDB-51), the headline numbers are top-1 and top-5 accuracy. The subtlety is the inference protocol: most papers sample multiple clips from each test video (often 10 temporal clips × 3 spatial crops = 30 views) and average softmax scores. A model evaluated with 1-crop 1-clip will typically score 1–4 points lower than the same model with 10-crop 3-view, so comparing across papers without matching the protocol is misleading. The inference FLOPs × views product is a more honest efficiency metric than model FLOPs alone.

For temporal action localisation (ActivityNet, THUMOS-14), the headline metric is mAP at temporal IoU thresholds, typically averaged over tIoU = 0.3, 0.4, 0.5, 0.6, 0.7 on THUMOS and tIoU = 0.5, 0.75, 0.95 on ActivityNet, with the ActivityNet average often reported as average mAP. A prediction is a (class, start-time, end-time, score) tuple; a ground-truth match requires class agreement and temporal IoU above the threshold; the AP follows the standard precision-recall computation per class, then is averaged across classes. The metric rewards accurate boundaries — a one-second error in a two-second action can push tIoU below 0.5 — which is why anchor-based and anchor-free localisation methods compete primarily on boundary quality.

For multi-object tracking (MOTChallenge, DanceTrack, BDD100K MOT), the metrics encode two separate failure modes: detection quality and identity association. MOTA (multi-object tracking accuracy) penalises false positives, false negatives, and identity switches with equal weight; MOTP scores localisation precision; IDF1 focuses on identity preservation across the clip; HOTA (Higher Order Tracking Accuracy), introduced in 2020, factorises into detection (DetA) and association (AssA) components and is now the preferred summary on most benchmarks because it balances the two sub-problems more fairly than MOTA.

Tracking metrics in one page. Given ground-truth trajectories and predicted trajectories, a matching is first found on each frame (Hungarian assignment on IoU). MOTA = 1 − (FN + FP + IDS) / GT, where FN/FP are frame-level false negatives/positives and IDS counts identity switches. IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN), where IDTP/IDFP/IDFN are identity-based true/false positives/negatives. HOTA(α) = √(DetA(α) · AssA(α)), averaged over localisation thresholds α. MOTA can be improved by a better detector alone; IDF1 requires a better associator; HOTA requires both. In practice, ByteTrack and its descendants dominate MOTA by improving the detector side; association-focused trackers (MOTR, TrackFormer) dominate IDF1 and HOTA.

For video object segmentation (DAVIS, YouTube-VOS), the metrics are the region Jaccard index 𝒥 (mean IoU of mask predictions) and the contour F-measure ℱ (boundary accuracy); the summary score 𝒥&ℱ is their mean. Semi-supervised VOS assumes the first frame is given; unsupervised VOS does not. For video instance segmentation (YouTube-VIS), the metric is video mAP — AP computed over 3-D (space + time) IoU between predicted and ground-truth tubes, averaged over IoU thresholds 0.5 to 0.95.

For temporal action segmentation (Breakfast, 50Salads, GTEA), the common metrics are frame-accuracy (percentage of frames with the right label), edit score (normalised Levenshtein distance between predicted and ground-truth segment sequences, which rewards getting the order right even when boundaries are off), and segmental F1 at overlap thresholds (typically 0.10, 0.25, 0.50). Frame accuracy rewards long stable predictions; edit and F1 reward structure.

For video retrieval and question answering, the standard metrics are retrieval Recall@K (R@1, R@5, R@10) plus median rank, and for VQA top-1 accuracy on a fixed answer vocabulary or open-ended accuracy (exact-match or normalised BLEU / CIDEr for captioning). The retrieval numbers are extremely dataset-sensitive — MSR-VTT has 20 captions per video, which makes R@1 easier than on a benchmark with one caption per video — so cross-dataset comparisons need careful normalisation.

§4

Optical flow

Optical flow is the per-pixel 2-D displacement field that aligns one frame to the next. It is the most basic form of motion representation, and for roughly twenty years it was the dominant motion feature in video understanding — two-stream action recognition, tracking, video compression, and frame interpolation all depend on it. Modern deep action recognisers do not always need explicit flow any more, but the underlying mathematics and the best-in-class estimators remain central: flow is still the first thing you compute for structure-from-motion, video stabilisation, and optical-flow-guided generation.

The classical formulation starts from the brightness-constancy assumption: a pixel's intensity does not change as the object moves from frame t to frame t+1. Writing that constraint as a Taylor expansion gives the optical-flow equation I_xu + I_yv + I_t = 0, where (u, v) is the unknown 2-D velocity and I_x, I_y, I_t are spatial and temporal image gradients. This is one equation in two unknowns — the aperture problem — so flow estimation needs a regulariser.

Horn-Schunck (1981) adds a global smoothness penalty: minimise the sum of the brightness-constancy residual and an L2 penalty on flow gradients. The solution is a large sparse linear system solved iteratively; the result is a dense flow field but one that smooths across motion boundaries. Lucas-Kanade (1981) makes a local rather than global assumption: flow is constant within a small neighbourhood around each pixel. The resulting 2 × 2 linear system is solvable where the image structure tensor is well-conditioned — corners, textured regions — and ambiguous elsewhere. Multi-scale Lucas-Kanade (coarse-to-fine warping) handles large displacements. Sparse Lucas-Kanade tracking of feature points is the basis of KLT, one of the oldest and most robust trackers.

Classical flow peaked with variational methods — TV-L1, Brox, EpicFlow — which blended the brightness-constancy residual with a total-variation regulariser and edge-preserving smoothness, often guided by matched features. The Middlebury and Sintel benchmarks charted the progress: average endpoint error on Sintel dropped from around 10 pixels in 2010 to around 3 pixels with the best variational methods by 2015.

The deep-learning era began with FlowNet (Dosovitskiy et al., 2015), which trained a U-Net-like encoder-decoder to regress flow directly from a pair of frames. FlowNetS used plain concatenation; FlowNetC introduced a correlation layer that computed cosine similarity between patches in the two frames. FlowNet2 (2017) stacked three FlowNet networks and matched classical accuracy while running at 10 fps on a GPU. PWC-Net (Sun et al., 2018) combined three classical ideas in a compact learned network — pyramid, warping, cost volume — and dominated for several years. RAFT (Teed & Deng, 2020) replaced the encoder-decoder with a single-scale recurrent refinement of a shared 4-D cost volume and won essentially every flow benchmark by a wide margin; its average endpoint error on Sintel-clean dropped below 1 pixel. GMA (2021) added a global motion aggregation step for occlusions; FlowFormer (2022) replaced RAFT's GRU refinement with a transformer. SEA-RAFT and UniMatch (2023–2024) extended RAFT-style cost volumes to unified depth/flow/stereo estimation.

Flow in action recognition — then and now. In the two-stream era (2014–2018), flow was computed offline with TV-L1 and cached as JPEG-compressed displacement images; training read these alongside RGB frames. The cost was substantial — computing flow for Kinetics takes longer than training on it — but flow features gave a 3–6% absolute top-1 boost on UCF-101 and Kinetics-400. 3D CNNs like I3D partially absorbed the motion information into the RGB stream, and modern video transformers with masked pretraining (VideoMAE) now match two-stream accuracy without explicit flow. Flow is still used when you need explicit, interpretable motion: video stabilisation, frame interpolation (Super-SloMo, RIFE), video compression, and increasingly as a guidance signal for video generation models.

The evaluation metrics are endpoint error (EPE) — the L2 norm of the predicted-minus-true flow vector, averaged over all pixels — and Fl-all, the fraction of pixels with EPE > 3 px and > 5% relative error (the KITTI-style threshold). Benchmarks include Middlebury (small motions), Sintel (animation-rendered, ground-truth available), KITTI (automotive, LiDAR-derived partial ground truth), and increasingly Spring (2023), which has 120 Hz ground-truth flow for high-resolution scenes.

§5

Per-frame baselines

The simplest way to do video classification is to ignore the temporal axis: pick N frames from the clip, run each through an image classifier, and average the softmax scores. This per-frame baseline is a surprisingly strong starting point on any benchmark where the action is visually characteristic — running looks like running in a single frame, juggling looks like juggling — and it is an indispensable sanity check. Any more sophisticated model that does not beat the frame baseline by a clear margin is suspect.

The baseline traces back to Karpathy et al. (2014) "Large-scale Video Classification with Convolutional Neural Networks", which ran a 2D CNN on single frames and on fixed-length clips and reported — surprisingly at the time — that the single-frame model on Sports-1M was within 2% of the full-clip model. This was a major result: it suggested that early 2D architectures were not actually using the temporal axis effectively, and that much of action recognition was static appearance.

The engineering choices in a frame baseline are the frame-sampling strategy (random vs. uniform vs. segment-based), the pooling strategy (max, mean, weighted, attention), and whether to pool logits, features, or both. Mean-logit pooling (Karpathy's default) is the simplest; late fusion runs a small MLP over pooled features. Temporal Segment Networks (TSN) (Wang et al., 2016) formalised segment-based sparse sampling: divide the clip into T segments and sample one frame per segment, which gives temporally distributed coverage without explicit temporal modelling. TSN with a ResNet-50 backbone was within a point of early 3D CNNs on Kinetics at a fraction of the compute, and TSN-style sampling is still the default in most open-source video toolkits.

On tasks where motion is the signal — Something-Something, Jester, Diving-48 — the frame baseline collapses. On Something-Something v2, a ResNet-50 frame baseline scores around 30% top-1 while I3D scores around 58% and VideoMAE scores around 75%. The gap is the direct cost of ignoring temporal order. A useful diagnostic is the shuffled-frame test: randomly permute the frames before inference. If a model's accuracy barely changes, it is functionally a frame baseline; if it drops 10+ points, the model is genuinely using temporal order.

Temporal Shift Module (TSM) (Lin, Gan & Han, 2019) is a clever mid-point: it leaves the 2D CNN weights untouched but inserts a zero-parameter module that shifts a subset of channel activations forward and backward in time between layers. The effect is to let 2D convolutions "see" a small temporal window without any 3D weights or extra compute, and TSM matched I3D on Kinetics while keeping 2D-CNN inference cost. Variants include TIN (Temporal Interlacing Network) and GSM (Gate-Shift Module).

The frame baseline as a sanity check. Before training any video model, run the 2D backbone on single frames. On Kinetics-400 you should reach ~70% with a ResNet-50 and ~75% with a ViT-B. On Something-Something v2 you should reach ~25–30%. If your proposed temporal model does not beat these numbers, the problem is elsewhere — sampling, augmentation, or the label distribution — not temporal modelling. Many mid-2010s "novel video architectures" turned out to be frame baselines in disguise once the sampling was matched.

Frame baselines remain the go-to approach for video tasks where the temporal signal is weak: coarse content tagging, ad-brand detection, thumbnail generation, and most retrieval. They are also the reference for evaluating video-language alignment — CLIP-style models can be applied to single frames and score competitively on video retrieval because a strong still-image text embedding is already informative enough. The field moved past the frame baseline for fine-grained action recognition, but it is still the most honest first baseline for any new video task.

§6

Two-stream networks

The breakthrough that moved video models past the frame baseline was the two-stream architecture of Simonyan and Zisserman (2014): run one CNN on RGB frames for appearance, another CNN on pre-computed optical flow for motion, and fuse the two at the classifier. The split encoded a useful prior — appearance and motion are different kinds of information and benefit from different features — and two-stream networks dominated video classification for roughly four years.

The original Two-Stream ConvNet (Simonyan & Zisserman, 2014) used a VGG-16 on single RGB frames and a second VGG-16 on stacks of 10 optical-flow frames (representing 10 consecutive motion fields as a 20-channel input). The two networks were trained separately; at inference, their softmax scores were averaged. On UCF-101, the combination reached 88% top-1 versus 73% for the RGB-only stream and 83% for flow-only, establishing that motion carries complementary information that RGB alone does not capture. Pretrained ImageNet weights transferred to the RGB stream; the flow stream had to be trained from scratch on video (there was no ImageNet for flow).

Temporal Segment Networks (TSN, Wang et al., 2016) took the two-stream idea and added the segment-sampling strategy of the previous section: the clip was divided into T segments (typically 3–8), one frame per segment was sampled for the RGB stream, and one flow stack per segment for the flow stream. TSN was the first architecture to demonstrate that temporally distributed sampling — even very sparse sampling — outperformed the dense short-clip sampling the original two-stream paper used. TSN with BN-Inception backbones hit 94% on UCF-101 and became the open-source workhorse for action recognition through 2018.

I3D (Carreira & Zisserman, 2017), "Quo Vadis, Action Recognition?", made two seminal contributions. First, it introduced the Kinetics-400 dataset — 300 000 ten-second YouTube clips across 400 action classes — which was 10× the size of the then-largest video benchmark (UCF-101) and finally large enough to train modern architectures from scratch. Second, it showed that you could inflate a pretrained 2D ImageNet architecture into 3D: take the ImageNet-trained Inception-v1 weights, replicate each 2D kernel across a new temporal dimension (dividing by the temporal size to preserve magnitude), and fine-tune on Kinetics. The resulting Inflated 3D ConvNet (I3D) trained stably, made efficient use of the ImageNet prior, and set the new state of the art. Two-stream I3D — RGB plus flow, both with inflated backbones — reached 82% top-1 on Kinetics-400, which stood for two years.

The fusion question in two-stream networks is non-trivial. The original paper fused at the softmax layer; later work explored feature-level fusion (concatenate or sum intermediate feature maps, then pass through a shared head), spatio-temporal fusion with 3D convolutions at the fusion point (Feichtenhofer, Pinz & Zisserman, 2016), and attention-based fusion. Most systems settled on feature-level fusion around a mid-network layer as a good compromise.

Why flow eventually became optional. Optical flow is expensive to compute — roughly 0.5 s per frame for TV-L1 on CPU — and cached flow inflates storage 3–5×. By the late 2010s the flow stream was contributing a 3–4% absolute gain but adding 50% of the total training-pipeline cost. As backbones got deeper and datasets got bigger (Kinetics-600, Kinetics-700), 3D CNNs and later video transformers learned the motion-related features that flow used to provide, and the marginal gain from a flow stream fell below 1%. By 2022, essentially no leading video model on Kinetics used explicit optical flow: VideoMAE, MViT, Video Swin, and their successors are RGB-only. Flow is now specialised — still essential for frame interpolation and video compression, but not for action recognition.

The two-stream idea lives on conceptually. SlowFast (Feichtenhofer et al., 2019) replaced "RGB stream + flow stream" with "slow pathway (low frame rate, high capacity) + fast pathway (high frame rate, low capacity)", using two RGB streams at different temporal resolutions to separate appearance and motion within a single architecture. Audio-visual networks use video stream + audio stream. Multimodal video-language models use video stream + text stream. The pattern of "two specialised streams fused late" remains one of the most robust architectural templates in video.

§7

3D convolutional networks

A 2D convolution slides a 2D kernel over the spatial grid of a single frame. A 3D convolution slides a 3D kernel over a spatio-temporal volume: the kernel has an extra temporal dimension, and it learns spatio-temporal features jointly rather than treating each frame separately and aggregating afterwards. The 3D CNN family — C3D, I3D, R(2+1)D, SlowFast, X3D — dominated video classification from roughly 2015 to 2020 and established the modern template for video backbones.

C3D (Tran et al., 2015) was the first purely-3D CNN for video and introduced the basic architectural choice: 3 × 3 × 3 convolutions stacked in a VGG-like pyramid. C3D trained on Sports-1M produced generic video features that transferred reasonably well but were computationally expensive and did not benefit from ImageNet pretraining because the architecture was 3D from the start. The big realisation of I3D was that you could sidestep that problem by inflating a 2D pretrained model.

R(2+1)D (Tran et al., 2018) "A Closer Look at Spatio-temporal Convolutions" decomposed each 3 × 3 × 3 kernel into a 1 × 3 × 3 spatial convolution followed by a 3 × 1 × 1 temporal convolution. This (2+1)D factorisation kept most of the parameter budget but added non-linearity between the spatial and temporal steps and allowed better optimisation; it also made 3D networks cheaper to train than their full-3D counterparts. The paper also introduced the I3D-R3D baseline — 3D-convolutional ResNet inflated from ImageNet — which became the standard reference backbone.

SlowFast (Feichtenhofer et al., 2019) is the most successful mid-era 3D CNN. It uses two pathways operating on the same clip at different temporal resolutions: a slow pathway at low frame rate (e.g., 4 frames sampled from the clip) with high channel capacity (more filters), and a fast pathway at high frame rate (e.g., 32 frames from the same clip) with low channel capacity (1/8 the filters). The fast pathway captures motion; the slow pathway captures appearance; lateral connections fuse information between them at every stage. SlowFast-R101 reached 79% top-1 on Kinetics-400, and the pattern of "two pathways at different temporal rates" has been influential in video transformers as well.

X3D (Feichtenhofer, 2020) applied neural-architecture-search-style scaling rules to 3D networks: start from a small, efficient base and progressively expand along six axes — temporal length, temporal stride, spatial resolution, width, depth, bottleneck ratio — picking the expansion that most improves accuracy-per-FLOP at each step. The resulting X3D-M/L/XL family hit Pareto-optimal points on Kinetics and remained the leading efficient 3D CNNs until transformers took over. The lesson was that 3D networks are over-parameterised if scaled with ad-hoc rules, and that principled scaling matters as much as architectural choices.

The cost of 3D convolutions. A 3 × 3 × 3 kernel has 27 parameters vs. 9 for a 2D kernel, and it multiplies the compute of each convolution by T (the temporal kernel size). A direct 3D inflation of ResNet-50 has roughly 3× the parameters and 3–4× the FLOPs of the 2D original, and training time scales with the number of input frames. The (2+1)D factorisation recovers a factor of 2–3 in compute; SlowFast's two-pathway design trades capacity for temporal resolution; X3D's narrow-and-deep architectures trade parameters for depth. The general lesson is that every extra temporal frame costs roughly as much as a spatial factor increase, so "just add more frames" is not a free lunch.

Specialised variants are worth knowing about. P3D (Pseudo-3D) decomposed 3D into 1×3×3 and 3×1×1 like R(2+1)D, but earlier. Non-local neural networks (Wang et al., 2018) added self-attention-style blocks into 3D CNNs and were an important stepping stone toward pure-transformer video models. CSN (Channel-Separated Networks) showed that most 3D FLOPs in a bottleneck could be replaced by depth-wise 3D convolutions with minimal accuracy loss. MoViNet (Kondratyuk et al., 2021) extended NAS to mobile-efficient 3D networks, producing a family that runs at 10–30 fps on a single CPU core.

The 3D CNN era plateaued around 80–82% on Kinetics-400 top-1, and its successor — the video transformer — took over from roughly 2021. Modern benchmarks often still report SlowFast and X3D numbers for context because they remain strong, efficient baselines; many deployed systems at production scale still use an X3D-style backbone because 3D CNNs are cheaper to quantise and deploy than video transformers.

§8

Action recognition

Action recognition is the canonical video task: given a trimmed clip, output a single class label from a fixed vocabulary of actions. The task sounds simple — "classify the action in this 3-second clip" — but its difficulty depends entirely on the dataset. A benchmark built around appearance (Kinetics) is easy for image models; one built around motion (Something-Something) punishes them; one built around compositional semantics (ActivityNet, AVA) requires both.

The canonical benchmarks each test a different aspect. Kinetics-400/600/700 (DeepMind, 2017–2020) are the reference leaderboards: 300 000 / 500 000 / 650 000 YouTube clips of 10 seconds each, each labelled with one of 400/600/700 human action classes. Kinetics tests broad, appearance-rich action recognition — "playing tennis", "cutting vegetables", "shovelling snow" — with enough scene context that a strong image classifier already scores 70+ top-1. Something-Something v1/v2 (20BN, 2017 and 2020) were built specifically to fight this: 220 000 crowdsourced clips of people performing 174 "templated" actions like "putting something onto something", "pretending to put something into something", or "moving something up". The scene content is nearly constant across classes, so any model that is not using temporal order collapses to near-chance; frame baselines score ~30% while the best video models score ~75%.

AVA (Atomic Visual Actions, Gu et al., 2018) is the action-detection benchmark: for each one-second interval in 430 movies, annotate every person with a bounding box and one of 80 atomic actions ("stand", "walk", "hold", "watch (a person)", "carry/hold (an object)"). The evaluation is frame-level mAP, and the task combines person detection, action classification, and spatio-temporal context reasoning. Modern AVA models (SlowFast, MViT) typically integrate a person detector plus a 3D CNN head. Moments-in-Time (Monfort et al., 2018) is a 1-million-clip, 339-class benchmark of three-second moments, optimised for diversity and shortness. HACS, YouCook2, EPIC-Kitchens, and Ego4D cover more specific domains (cooking, egocentric activity, long-form). Legacy benchmarks (UCF-101, HMDB-51) are now saturated and mostly used for efficiency or few-shot ablations.

The evaluation protocol matters a great deal. Clip-level inference samples many clips per test video (commonly 10 temporal × 3 spatial crops) and averages predictions; video-level inference makes one prediction per test video after longer aggregation. Test-time augmentation typically buys 1–3 points on Kinetics top-1. The choice of augmentation pipeline — RandAugment, Mixup, CutMix, RandErase, RandomResizedCrop with specific scale ranges — is sometimes larger than the backbone choice. Reproducing a paper's numbers requires matching the augmentation recipe, not just the architecture.

What different datasets test. Kinetics tests what is in the frame — scene, objects, context — plus a little bit of motion. Something-Something tests what happens across frames — direction of motion, temporal order, compositional relationships. AVA tests spatio-temporal action detection — who is doing what, where, when. EPIC-Kitchens tests egocentric, long-tail, fine-grained actions. A "state of the art" on Kinetics does not necessarily transfer to Something-Something; a model that wins Something-Something often fails on EPIC-Kitchens because of its long-tail class distribution. Citing only Kinetics-400 top-1 therefore understates how narrow many published improvements actually are.

Current frontier numbers (as of 2025) are roughly: Kinetics-400 top-1 around 90% with a ViT-Huge VideoMAE v2, Something-Something v2 around 77% with an MViT-v2 or VideoMAE v2, AVA v2.2 mAP around 45%, EPIC-Kitchens verb top-1 around 75%. The models that achieve these numbers are all transformer-based, all use masked-video-modelling pretraining (see Section 11), and all consume 16–32 frame clips at 224 × 224 or 336 × 336 resolution.

Beyond the benchmark leaderboard, two practical issues dominate real action recognition: label noise and ambiguity — many videos in Kinetics and Something-Something contain multiple valid labels and crowd annotators disagree on fine-grained cases — and domain shift — a model trained on YouTube clips shot in well-lit environments often fails on surveillance footage, medical video, or egocentric clips. Few-shot and zero-shot action recognition (using video-language models, see Section 12) has become a preferred evaluation for generalisation.

§9

Temporal action localisation

Action recognition assumes a clip has already been trimmed to contain one action. Real video is not trimmed: an hour of a tennis match contains dozens of serves, rallies, and pauses; a surgical recording contains many procedure steps of varying length; a cooking video contains a sequence of ingredient, cutting, mixing, and cooking actions. Temporal action localisation (also called temporal action detection) is the task of finding, within an untrimmed video, the start time, end time, and class of every action instance.

The task is analogous to 2D object detection but along the time axis. Like detection, it splits into two-stage and one-stage families. A two-stage temporal detector first proposes candidate temporal windows (pairs of start/end times) regardless of class, then classifies and refines each proposal; a one-stage detector predicts class and boundaries jointly on a dense temporal grid. And, like 2D detection, the family migrated toward anchor-free transformer-based designs in recent years.

BSN (Boundary-Sensitive Network, Lin et al., 2018) was the canonical two-stage approach. A shallow 1-D network predicted, at every temporal location, three probabilities: probability of being a start boundary, probability of being an end boundary, and probability of being inside an action. Proposals were generated by combining start-probability peaks with end-probability peaks (picking those where the "inside" integral was high) and scored with a second classifier. BMN (Boundary Matching Network, Lin et al., 2019) densified the proposal evaluation by constructing a boundary-matching map: a 2-D grid indexed by (start time, duration) that scored every possible proposal in parallel. BMN was the dominant localiser for several years and the backbone of most ActivityNet challenge winners.

The transition to one-stage came with ActionFormer (Zhang, Wu & Li, 2022), which adapted the object-detection transformer backbone to temporal action localisation. ActionFormer runs a multi-scale 1-D transformer over the video's feature sequence and predicts, at every temporal location and scale, a class vector and a pair of boundary offsets. It simplified the BSN/BMN pipeline dramatically, trained end-to-end, and set a new state of the art on THUMOS-14 and ActivityNet. TriDet (Shi et al., 2023) refined ActionFormer with a more explicit boundary-prediction head and has held the leading numbers on several benchmarks since.

Modern localisers are almost always feature-based rather than end-to-end: the untrimmed video is first turned into a feature sequence by a frozen pretrained backbone (I3D, SlowFast, VideoMAE) running on overlapping short clips, and the localisation head operates on this 1-D feature sequence. This factorisation keeps GPU memory tractable — you cannot fit 30 minutes of raw frames in memory, but you can fit 30 minutes of backbone features — and lets the community share standardised features (Kinetics-pretrained I3D features on ActivityNet, for example) for easy comparison.

Where the difficulty lies. Temporal action localisation is harder than classification for three reasons. First, class imbalance is extreme: most of a long video is background (no action), and actions can be a few seconds inside a 20-minute clip. Second, boundary ambiguity: when does "diving" start — when the diver walks to the platform, when they step onto it, when their feet leave it? Annotators disagree by several seconds. Third, actions can nest or overlap: "serving" is part of "playing tennis"; two people can "talk" and "walk" simultaneously. The evaluation metrics (mAP at various tIoU thresholds) partly mask these sources of noise, but they do not eliminate them.

Related tasks include temporal action proposal generation (produce class-agnostic temporal windows — a building block for localisation and a useful pre-processing step for retrieval), dense video captioning (jointly localise actions and produce a natural-language description for each, e.g., ActivityNet Captions), and temporal grounding / moment retrieval (given a free-text query, find the clip boundaries that match it — Charades-STA, QVHighlights). These all share the same feature-extraction and temporal-modelling machinery as localisation but with a text-conditioned head.

Current benchmark numbers: THUMOS-14 mAP at tIoU 0.5 reached ~75% with TriDet+VideoMAE features; ActivityNet-1.3 average mAP is around 40%; Ego4D Moment Queries is substantially harder (the actions are fine-grained and egocentric). For many production applications — podcast chapter marking, sports highlight detection, surgical workflow monitoring — localisation with a custom class vocabulary and a few hours of annotation is now achievable with off-the-shelf ActionFormer / TriDet on top of a strong pretrained backbone.

§10

Video transformers

The transformer architecture, which had already replaced RNNs in NLP and CNNs on ImageNet, arrived in video in 2021. The challenge was compute: a naïve ViT applied to a 16-frame clip at 224 × 224 has 16 × (224/16)² = 3136 tokens, and self-attention is quadratic in token count, so full attention is 30× more expensive than on a single image. The video transformer family — TimeSformer, ViViT, MViT, Video Swin, Uniformer — is a set of architectural choices for keeping this cost tractable while preserving the benefits of global self-attention.

TimeSformer (Bertasius, Wang & Torresani, 2021) "Is Space-Time Attention All You Need?" introduced divided space-time attention: each transformer block applies temporal attention (each token attends only to tokens at the same spatial position across frames) followed by spatial attention (each token attends only to tokens in the same frame). This factorisation reduces complexity from O(T²S²) to O(T² + S²) while preserving enough expressive power to beat I3D and SlowFast on Kinetics-400. The paper's systematic comparison of attention strategies (joint, divided, axial, sparse, local) became the reference for later designs.

ViViT (Arnab et al., 2021) "Video Vision Transformer" introduced tubelet embedding: instead of embedding each frame's patches separately, extract small 3-D patches (e.g., 2 × 16 × 16) spanning both space and time, then treat them as tokens. ViViT also proposed factorised encoder variants — "factorised encoder" (separate spatial then temporal transformers), "factorised self-attention" (similar to TimeSformer's divided attention), "factorised dot product" (split attention heads across axes) — and ran a clean ablation showing tubelet embedding + factorised encoder was the most accurate-for-compute combination.

MViT (Multiscale Vision Transformers, Fan et al., 2021) and MViT-v2 (Li et al., 2022) introduced pooling attention: as the network goes deeper, progressively downsample the token grid in both space and time by pooling the key/value tensors inside each attention block. The design mirrors the pyramid structure of CNNs — early layers have many tokens at high resolution, later layers have few tokens at low resolution — and it dramatically reduces the cost of later attention blocks. MViT-v2 was the first video transformer to convincingly beat SlowFast on both Kinetics and AVA and remains a standard baseline.

Video Swin Transformer (Liu et al., 2021) adapted the Swin image architecture to video by extending the shifted-window self-attention to a 3-D window. Self-attention is computed only within small non-overlapping 3-D windows (e.g., 8 × 7 × 7 tubes of tokens); windows shift by half their size between layers to enable cross-window information flow. Video Swin is easy to implement, quantises well, and deploys cleanly to production hardware; it remains one of the most widely used video backbones in industry.

Uniformer (Li et al., 2022) is a convolution-attention hybrid: early layers are depth-wise 3D convolutions (cheap, strong inductive bias for local motion); late layers are full self-attention (expensive, global context). The split roughly matches what SlowFast's two pathways did separately, but within a single unified architecture. Hiera (Ryali et al., 2023) went further: a hierarchical vision transformer with no explicit convolutions but a strong positional hierarchy, trained with masked pretraining; it matches or beats MViT-v2 with fewer parameters.

Factorising attention — the trade-off. Full joint space-time attention is the most expressive but blows up compute. Divided (TimeSformer) and factorised (ViViT) attention keep most of the expressivity with O(T² + S²) cost. Pooling (MViT) reduces token count progressively so later layers are cheap. Windowed (Video Swin) restricts attention to local 3-D neighbourhoods and reintroduces global context through shifting. Each choice has a "sweet spot" of clip length and spatial resolution; there is no single best architecture. On Kinetics at 16 × 224², MViT-v2 and Video Swin are close; at 64 × 224² or 16 × 336², MViT's pooling gives it more headroom; at very long clips, memory-efficient windowed designs dominate.

Current state-of-the-art video transformers (as of 2025) routinely reach 88–90% top-1 on Kinetics-400, 76–77% on Something-Something v2, and 45–48% mAP on AVA v2.2. The gap over 3D CNNs is 3–8 absolute points and comes primarily from the self-supervised pretraining covered in the next section. Without masked pretraining, pure supervised video transformers are only a point or two above SlowFast — which is the kind of result the research community had to work through before settling on the current recipe of "strong transformer backbone + masked self-supervised pretraining + supervised fine-tuning".

§11

Masked video modelling

The defining development of 2022–2024 in video understanding was not an architectural change but a pretraining change. Masked video modelling — the direct adaptation of MAE-style masked reconstruction to video — produced pretrained video backbones that, when fine-tuned on Kinetics or Something-Something, outscored the best supervised-trained models by several absolute points. Almost every current leaderboard entry uses masked video pretraining; understanding the recipe is essential.

VideoMAE (Tong et al., 2022) and MAE-ST (Feichtenhofer et al., 2022), published within weeks of each other, are the canonical papers. The recipe has four key elements. First, very high masking ratio: video has much more redundancy than images (adjacent frames are highly correlated), so masking 90–95% of tokens is stable and necessary — lower ratios let the model learn trivial frame-to-frame interpolation. Second, tube masking: mask the same spatial positions across all frames in a clip, forcing the model to reason about motion rather than just interpolating across a single frame. Third, an asymmetric encoder-decoder: a large ViT encoder processes only the visible tokens (saving 90% of compute), and a small decoder reconstructs the masked tokens from encoder outputs plus learned mask tokens. Fourth, the reconstruction target is raw pixels (normalised) for VideoMAE, or feature targets for later variants.

VideoMAE v2 (Wang et al., 2023) scaled the recipe: billion-parameter ViT-Huge encoders, pretraining on 1.35 million unlabelled videos from UnlabeledHybrid (a combination of Kinetics, Something-Something, HowTo100M, and YouTube-8M clips), and a modified progressive-training curriculum. The resulting ViT-g model pushed Kinetics-400 top-1 to 90.0% — the first video model to cross that bar — and Something-Something v2 to 77.0%. VideoMAE v2 effectively replaced supervised Kinetics pretraining as the default starting point for downstream video tasks.

Siamese masked modelling and contrastive-masked hybrids added further refinement. MaskFeat (Wei et al., 2022) reconstructed HOG features rather than pixels and showed that the choice of reconstruction target matters. BEVT used discretised VQ tokens like BEiT. MVD (Motion-Aware Video Discriminator) added an auxiliary motion-prediction loss. V-JEPA (Bardes et al., 2024) replaced pixel reconstruction with a joint-embedding predictive objective: predict target features in a separate representation space rather than raw pixels, avoiding the wasteful low-level-detail reconstruction. V-JEPA matches VideoMAE accuracy with more compute-efficient pretraining.

Why masked video modelling works so well. The intuition is twofold. First, video has enormous temporal redundancy — adjacent frames share 95%+ of their pixels — so masking at video-scale ratios (90%+) is stable in a way that image-scale masking (75%) is not. Second, video pretraining data is essentially unlimited: YouTube has orders of magnitude more unlabelled video than ImageNet has labelled images, and masked modelling does not require labels. The combination means video transformers can absorb more pretraining data than their supervised counterparts and emerge with broader, more transferable representations. The same argument applies to images (and was the MAE thesis), but the effect is larger for video because the data-label asymmetry is larger.

The practical recipe for a modern downstream task is now stable: (1) start from a VideoMAE-v2-pretrained ViT-L or ViT-H checkpoint (available publicly), (2) apply task-specific head — classification head, localisation head, detection/segmentation head, or language-model adapter — and (3) fine-tune for 30–100 epochs with a cosine schedule. This recipe dominates Kinetics, Something-Something, AVA, ActivityNet, and most downstream video benchmarks. Bespoke supervised-from-scratch training is now essentially only used when pretrained weights are unavailable or when the target domain (surgical, satellite, underwater) is too distant from the pretraining distribution.

Multimodal masked pretraining is the current frontier. MAViL (Huang et al., 2023) added audio reconstruction alongside video masking. OmniMAE unified images and videos in a single masked-modelling run. VideoMAE + video-language alignment (as in InternVideo-2) combines masked reconstruction with contrastive video-text matching in a shared objective. The hypothesis — not yet proven but heavily suggestive — is that a single "general video encoder" pretrained jointly on masked reconstruction, audio, and language can replace the zoo of task-specific video backbones entirely.

§12

Video-language models

Text is a rich supervisory signal for video. A clip with a caption "a golden retriever catching a frisbee at the beach" implicitly teaches the model what the action is, who is doing it, where it is happening, and what the camera framing typically looks like — richer than any fixed class label. Video-language models (VLMs) are the architectures that learn from large web collections of (video, text) pairs and produce representations that serve retrieval, captioning, question answering, and zero-shot classification.

The earliest practical results came from MIL-NCE (Miech et al., 2020), which trained an S3D-G video encoder and a text encoder on HowTo100M (136 million instructional-video clips with ASR-generated captions). The trick — Multiple-Instance-Learning NCE — handled the misalignment between spoken narration and the action on screen (narration often leads or lags the demonstration by several seconds) by letting each positive video clip match any of several nearby caption windows. MIL-NCE produced strong transfer to zero-shot action recognition and text-video retrieval and demonstrated that noisy, web-scraped instructional video was a sufficient pretraining signal.

VideoCLIP (Xu et al., 2021) followed the CLIP recipe more directly: contrastive pretraining of a video encoder and a text encoder with in-batch negatives on 1.2 million (video, caption) pairs. Frozen-in-Time (Bain et al., 2021) used a ViT image-backbone for video (treating frames as independent tokens with temporal positional encoding) and trained on both image-text pairs (CC3M, CC12M) and video-text pairs (WebVid-2.5M), producing a unified image-and-video-language encoder. Frozen's joint curriculum — images first, then video — became the template for several later models.

InternVideo (Wang et al., 2022) and InternVideo-2 (Wang et al., 2024) are the most ambitious "generalist video encoder" projects. InternVideo-2 combines masked video pretraining (VideoMAE-v2 style), cross-modal contrastive learning (video-text, video-audio, video-speech), and next-token prediction on video-text pairs into a single training objective, on 400 million clips. The resulting ViT-6B encoder transfers to action recognition, detection, localisation, retrieval, captioning, and question answering with state-of-the-art numbers in each category.

The Video-LLM family is the 2023–2025 frontier: feed video tokens into a large language model and ask it to answer arbitrary natural-language queries about the clip. Video-LLaVA (Lin et al., 2023) projects a video encoder's tokens (typically 8–16 frames, each encoded by a CLIP ViT) into an LLM's token space via a learned linear adapter, then fine-tunes on a mixture of video-instruction-following data. VideoChat and Video-ChatGPT use similar architectures with different instruction-tuning recipes. Gemini-family multimodal models and Anthropic's vision models support video natively by either sampling frames and treating them as images or by using a dedicated video encoder. These models can answer zero-shot questions ("how many people are in this clip?", "is anyone wearing a red shirt?", "does the recipe call for garlic?"), generate summaries, and locate moments by natural-language description.

Data is the bottleneck. Video-language pretraining's biggest challenge is dataset quality. HowTo100M (136M clips with ASR captions) is noisy — ASR mis-transcriptions, mis-aligned narration — but cheap to produce. WebVid-10M was scraped from stock-footage sites with alt-text captions (and has since been retracted for copyright reasons). InternVid (2024) is a 7M-clip dataset with machine-generated captions (a caption model, not ASR). Panda-70M, Koala-36M, and OpenVid-1M followed similar strategies. The trend is toward higher-quality machine-generated captions over noisier ASR, because the gain from cleaner text often outweighs the gain from more raw video. The dataset-construction pipeline — scraping, deduplication, caption generation, quality filtering — is now as important a research artefact as the model architecture.

Current benchmark numbers (as of 2025): MSR-VTT retrieval R@1 around 55% text-to-video (InternVideo-2 / UMT), MSVD QA top-1 around 80%, MSRVTT-QA around 50%, ActivityNet-QA around 50%, NExT-QA multiple-choice around 75%. The frontier of long-video understanding — summarising a 30-minute show, answering questions about a feature-length movie, navigating instructional video archives — is still open; dedicated benchmarks (MoVQA, Movie101, CinePile) are actively being built and most current models struggle beyond the 2–5 minute range.

Zero-shot action recognition is a useful practical endpoint. An InternVideo-2 or UMT encoder with the Kinetics-400 class names projected as text queries reaches ~80% top-1 on Kinetics-400 without ever fine-tuning on it — close to a supervised SlowFast from 2019. For most production applications with an open-ended and evolving action vocabulary, video-language models are now the pragmatic choice.

§13

Object tracking

Multi-object tracking (MOT) is the task of maintaining consistent identities for every object of interest across a video. A tracker has to detect each object on each frame, decide which detections belong to the same physical object over time, and continue that identity through partial occlusions, pose changes, and brief disappearances. The field splits cleanly by philosophy: detection-plus-association (strong detector, simple matcher) vs. end-to-end transformer trackers (unified detection and association).

The dominant paradigm for a decade has been tracking-by-detection. Run a strong per-frame object detector; at each frame, associate new detections with existing tracks using motion prediction (Kalman filter) and appearance similarity (ReID features); spawn new tracks for unmatched detections; terminate tracks that go unmatched for too long. SORT (Bewley et al., 2016) was the minimal version: a linear-Kalman-filter motion model plus Hungarian-algorithm IoU matching. Fast, simple, and surprisingly strong; identity switches are the main failure mode. DeepSORT (Wojke, Bewley & Paulus, 2017) added a learned appearance embedding (a ReID network trained on person-reidentification data) to the matching cost, which cut identity switches by half.

ByteTrack (Zhang et al., 2022) made a surprising observation: low-confidence detections — which SORT-family trackers discard — are often real objects that the detector is just uncertain about, typically under occlusion. ByteTrack matches high-confidence detections to tracks first, then does a second matching pass with low-confidence detections, recovering objects that would otherwise drop out. The recipe is almost embarrassingly simple and was briefly state of the art on MOT17 and MOT20; it is now the reference baseline. BoT-SORT, StrongSORT, OC-SORT, and SparseTrack are 2022–2023 refinements with better ReID, camera-motion compensation, and observation-centric re-identification.

Detection-driven trackers leverage a single model for both detection and tracking. Tracktor (Bergmann, Meinhardt & Leal-Taixé, 2019) reused a Faster R-CNN's regression head to propagate each track's bounding box to the next frame, effectively converting a detector into a tracker with no added parameters. CenterTrack (Zhou, Koltun & Krähenbühl, 2020) used a CenterNet detector that took two consecutive frames as input and predicted both the detections and the displacement vector from the previous frame — a joint detection-and-association head. FairMOT and JDE added an appearance-embedding head to one-stage detectors so that a single forward pass produces detections plus ReID features.

The end-to-end transformer tracker family recasts MOT as sequence-to-sequence: the decoder produces a set of "track queries" that persist across frames and predict each track's bounding box at every frame. TrackFormer (Meinhardt et al., 2022), MOTR (Zeng et al., 2022), and MOTRv2 / MOTRv3 (2023) follow this template. MOTRv2 bootstraps from a pretrained YOLOX detector to stabilise training. The appeal is conceptual cleanliness — no separate Kalman filter, no separate ReID network, no bespoke association logic — but end-to-end trackers are still harder to train than tracking-by-detection and do not consistently outperform ByteTrack variants on standard benchmarks.

When each family wins. Tracking-by-detection (ByteTrack, BoT-SORT, OC-SORT) dominates benchmarks like MOT17 and MOT20 because strong detectors already exist and appearance-plus-motion matching is hard to beat. Transformer trackers (MOTR, TrackFormer) shine on DanceTrack, where appearance is uninformative (many dancers look the same) and end-to-end temporal reasoning matters. BDD100K MOT (autonomous driving) rewards robustness to camera motion and illumination, and detection-driven trackers have been competitive. For most production applications, ByteTrack on top of a YOLO-v8 detector is the pragmatic default; transformer trackers are chosen when the scene has heavy appearance ambiguity.

Beyond multi-object tracking, single-object tracking (SOT) follows one target specified by a first-frame box. The SOT lineage went from correlation filters (KCF, DSST) through Siamese networks (SiamFC, SiamRPN, SiamRPN++) to transformers (TransT, OSTrack, MixFormer). SOT benchmarks include GOT-10k, LaSOT, and TrackingNet, and modern SOT systems approach 75+ AUC on LaSOT. A related task — referring expression tracking — finds the object described by a natural-language phrase in the first frame and tracks it, and bridges to video-language models.

Tracking tooling has converged: MMTracking, PyTracking, Norfair, and Supervision provide reference implementations of SORT-family trackers that plug on top of any detector. Real production systems often need additional engineering — ROI rejection for static cameras, tracking across camera hand-offs in multi-camera setups, track fragmentation handling over minutes or hours — that goes well beyond research benchmarks. The core association machinery (Kalman + Hungarian + appearance) has remained remarkably stable since 2016.

§14

Video segmentation

Image segmentation, covered in the previous chapter, produces per-pixel masks on a single frame. Video segmentation extends that to a clip: produce per-pixel masks on every frame, with consistent instance identities across frames. The task splits into video object segmentation (VOS), video instance segmentation (VIS), and video panoptic segmentation (VPS); each has its own benchmark and architectural tradition.

Video object segmentation (VOS) assumes that the first-frame mask of each object of interest is given, and asks for the mask on every subsequent frame. This is the "semi-supervised" setting and is the one most real applications care about — you mark the object you want to track, and the system follows it. Benchmarks: DAVIS-2017 (150 clips, carefully annotated), YouTube-VOS (4 000+ clips, larger scale), MOSE (2023, designed for complex scenes with heavy occlusion). The evaluation metric is 𝒥&ℱ.

The VOS family went through three generations. OSVOS (Caelles et al., 2017) took a simple but influential approach: pretrain a segmentation network on a large dataset, then fine-tune per test clip on the first-frame mask for a few hundred iterations. Accurate but slow (minutes per clip). STM (Oh et al., 2019) introduced Space-Time Memory: encode every previous frame's features into a memory bank, and query it with the current frame to find similar regions; the current mask is computed from the retrieved memory. STM eliminated per-clip fine-tuning and became the dominant approach. XMem (Cheng & Schwing, 2022) refined the memory into three tiers (short-term, long-term, sensory) based on the Atkinson-Shiffrin memory model, and runs comfortably on long videos (minutes) with bounded memory. DeAOT and Cutie are further refinements.

Video instance segmentation (VIS) is the fully-automatic version: no first-frame prompt; the system must detect, segment, and track every object of a set of classes. The reference benchmark is YouTube-VIS-2019 (40 classes) and its successor YouTube-VIS-2021, evaluated with 3-D video mAP (spatial plus temporal IoU on object "tubes"). MaskTrack R-CNN (Yang, Fan & Xu, 2019) added a tracking head to Mask R-CNN, producing per-frame detections with cross-frame association. VisTR (Wang et al., 2021) treated VIS as end-to-end set prediction in 3-D: each query predicts a full spatio-temporal tube. IDOL, SeqFormer, and Mask2Former-VIS refined this pattern. The current frontier (VMT, VITA, GenVIS) combines strong image Mask2Former backbones with temporal association modules and reaches ~60 mAP on YouTube-VIS-2021.

Video panoptic segmentation (VPS) unifies VOS and VIS with stuff classes (road, sky, building) that have no instance identity. VPSNet (Kim et al., 2020) introduced the task and a benchmark based on Cityscapes-VPS; Video K-Net, Tube-Link, and TarViS followed. VPS matters most in autonomous-driving perception, where stuff and things both need to be tracked — sky and road don't move, but the bus behind you does.

SAM 2 and promptable video segmentation. SAM 2 (Ravi et al., 2024) extends the Segment Anything Model to video. The architecture adds a memory encoder and a memory attention module that lets the mask decoder condition on the history of the clip; a single click or box on any frame propagates a mask across the entire video. SAM 2 trained on the SA-V dataset (600 000 video clips with 35 million masks) and dominates DAVIS, YouTube-VOS, and MOSE with zero-shot transfer, beating specialised VOS systems in most cases. SAM 2 has effectively collapsed VOS into a solved problem: for any application that can provide a first-frame prompt, SAM 2 is the default starting point, and bespoke VOS architectures are mostly obsolete as of 2025.

The practical toolkit converged. Most teams now use SAM 2 for any promptable segmentation (VOS-style), Mask2Former-VIS or its descendants for fully-automatic VIS, and specialised trackers (ByteTrack + SAM) when they want boxes plus masks. Training a VIS system from scratch on custom classes is rare; starting from a large pretrained image segmentor (Mask2Former on COCO) and adding a temporal association head is the standard recipe.

§15

Temporal action segmentation

Temporal action segmentation is the task of labelling every frame of a long untrimmed video with an action class. Unlike temporal localisation (where you predict a small number of action intervals with precise boundaries), temporal action segmentation treats the problem as a dense 1-D labelling task: for a 10-minute cooking video, output a label sequence like [idle, wash-vegetables, wash-vegetables, ..., chop, chop, ..., mix, mix, ..., cook, cook, ..., plate, plate, idle]. The task is canonical for cooking-instruction, surgical-workflow, and industrial-process video.

The task differs from action recognition in three ways. First, the label vocabulary is typically small (20–50 classes specific to the domain, like "peel carrot", "stir pot"), not large-scale. Second, long temporal context matters: the same gesture ("holding a knife") could be part of "chopping" or "peeling" depending on what came before. Third, the evaluation metrics — frame accuracy, edit score, segmental F1 — reward both correct labels and reasonable boundary placement; a method that produces many tiny segments is penalised even if individual frame predictions are accurate.

Classical approaches used hidden Markov models on top of hand-crafted features. Modern approaches use a 1-D temporal convolutional network (TCN) on top of frame-level features. MS-TCN (Multi-Stage TCN, Farha & Gall, 2019) stacked multiple refinement stages: each stage takes the previous stage's logits as input and produces a refined label sequence. The dilated-convolution architecture gives each frame a large temporal receptive field (thousands of frames) at modest compute. MS-TCN was the dominant approach for several years and set the baseline on Breakfast, 50Salads, and GTEA.

ASFormer (Yi, Wen & Jiang, 2021) replaced the dilated convolutions with a transformer, adding local-window self-attention to the refinement stages. LTContext, UVAST, and DiffAct (2023) — the last using a denoising-diffusion formulation for action boundaries — pushed accuracy further. The current leaderboards on Breakfast and 50Salads are saturating; the open challenge is generalisation to new domains (surgical video, industrial assembly) where labelled data is scarcer.

The frame-level feature extractor matters as much as the temporal model. Early work used fixed I3D features (Kinetics-pretrained); more recent work uses VideoMAE or InternVideo features. The jump from I3D to VideoMAE features often buys several points of frame accuracy at no change in the temporal head, confirming the general lesson that downstream video tasks are bottlenecked by the upstream representation.

Why edit score matters. On Breakfast, MS-TCN reports frame accuracy around 66%. A baseline that simply predicts the majority class at every frame can reach 40% frame accuracy — not much worse. The edit score (normalised Levenshtein distance between predicted and ground-truth segment sequences) rejects such trivial solutions: it rewards getting the order and count of actions right, not just the per-frame labels. A method with 50% frame accuracy but the right segment structure is often more useful in practice than one with 65% frame accuracy and many false-positive micro-segments. Modern architectures optimise both simultaneously with hybrid frame-and-boundary losses.

Related tasks are action anticipation (predict the next action given the current context — EPIC-Kitchens anticipation challenges), procedure learning (discover the key steps of a procedure from unlabelled instructional video), and surgical phase recognition (a specialised variant on operating-room video with its own benchmarks, Cholec80 and M2Cai16). These share the same temporal-modelling architecture as action segmentation but have different label conventions and evaluation protocols.

For production applications, temporal action segmentation is the go-to tool for procedure monitoring — has every step in a surgical checklist been performed; is the worker on an assembly line executing the correct sequence; at what point in a recipe is the cook. The practical pipeline is: pretrain or download a strong video encoder, extract frame features on your labelled clips, train MS-TCN or ASFormer for a few dozen epochs, evaluate on frame accuracy and edit score, iterate on class-balance and augmentation. Training from scratch with only a few hundred labelled hours of video is routine.

§16

Efficient video

A research-scale video transformer runs at 1–5 fps on a server GPU — acceptable for offline processing, catastrophic for real-time applications. Surveillance, autonomous driving, live sports analysis, and on-device video effects demand 30+ fps at battery-friendly power budgets. The efficiency toolkit for video shares ideas with efficient image models (distillation, quantisation, pruning) but adds video-specific techniques: token reduction, temporal coarsening, dynamic inference, and streaming-window architectures.

Token reduction exploits the redundancy across video frames. A typical video transformer has thousands of tokens per clip (16 frames × 196 spatial tokens = 3136), many of which are uninformative. STTS (Wang et al., 2022) "Efficient Video Transformers with Spatial-Temporal Token Selection" scores each token and keeps only the top-K informative ones per layer; redundant background patches are discarded. AdaViT and Ada3D learn per-input policies: simple clips use fewer tokens; complex clips use more. ATS (Adaptive Token Sampling, Fayyaz et al., 2022) applies a similar idea with a discrete sampling step that is end-to-end differentiable. Token reduction routinely buys 2–3× speedups with <1% accuracy drop.

Temporal coarsening operates at the architectural level. MoViNet (Kondratyuk et al., 2021) uses neural-architecture-search to find 3D networks that are efficient at very short clip lengths (single-stream); X3D's family of models (S, M, L, XL) lets you pick a compute budget from 2 to 50 GFLOPs. TSM (Temporal Shift Module), covered in Section 5, is still a very strong efficient choice because it inherits 2D CNN inference speed while learning temporal features. Uniformer and Hiera combine convolutions in early layers (cheap, high-resolution) with attention in later layers (expensive, low-resolution) to get a favourable compute profile.

Quantisation and pruning port directly from the image world. Video transformers are more sensitive to INT8 quantisation than image ones because the accumulated error across 16–32 frames can drift; PTQ4DM and Q-ViT include video-specific calibration strategies. 3D CNNs (X3D, SlowFast) are more quantisation-friendly because their compute pattern maps more cleanly onto INT8 hardware kernels; this is part of why X3D-family models dominate deployed video inference despite being slightly less accurate than transformers.

Streaming video transformers are a separate class of efficient models designed specifically for real-time inference. A standard clip-based video model re-processes the full clip each time a new output is needed; a streaming model processes each frame once and maintains a compact temporal state (like an RNN). Stream-MViT, STT, and TeSTra are research examples. In production, frame-wise 2D backbones with a temporal aggregation module (TSM or a small LSTM) are still the most common streaming architecture because they are easy to deploy and verify.

A concrete efficiency recipe. For a 30-fps on-device action recogniser: start from MobileViT-v3 or X3D-M, train with TSM-style temporal shift, knowledge-distil from a large VideoMAE teacher, quantise to INT8 with post-training calibration, export to ONNX, and run with TensorRT or CoreML. Expected performance: 30–60 fps on a phone SoC at 80–85% of the teacher's Kinetics accuracy. This recipe has been reproduced at several companies (Meta AR, Apple VideoAnalytics, Google Photos) and is now the pragmatic default for edge video.

Deployment formats matter. ONNX is the portable interchange format; TensorRT is NVIDIA's optimised runtime; OpenVINO targets Intel CPUs and VPUs; CoreML targets Apple silicon; TensorFlow Lite and PyTorch Mobile target Android. Video-specific export quirks include 3D convolutions (sometimes not natively supported; need to be rewritten as sequences of 2D convolutions), tensor-shape dynamism (variable clip lengths), and preprocessing (frame sampling has to happen on-device if the runtime can't decode video natively). Mature production pipelines handle these via wrapper code that sits between the model and the runtime.

§17

Video understanding in the ML lifecycle

The architectures, pretraining recipes, and benchmarks of the previous sections only become useful inside a full ML lifecycle: picking a dataset, annotating it, training a model, evaluating it on the right metrics, iterating, and shipping. Video makes every stage of that lifecycle roughly 10× harder than images: the data is larger, the annotation is denser, the compute is higher, and the failure modes are more varied.

Dataset selection is the first choice. Kinetics-700 and Something-Something v2 are the default for action-recognition research; Kinetics-710 unions both. Moments-in-Time for three-second clips; AVA and AVA-Kinetics for spatio-temporal action detection. ActivityNet-1.3 and THUMOS-14 for temporal localisation. EPIC-Kitchens-100 and Ego4D for egocentric video; HowTo100M, WebVid-10M (retracted), InternVid, and Panda-70M for large-scale video-language pretraining. SA-V (SAM 2's training set, 600 K videos with 35 M masks) for segmentation. DAVIS-2017, YouTube-VOS, and MOSE for VOS; YouTube-VIS-2019/2021 for VIS. Most benchmarks ship only video IDs and require users to download from YouTube or other hosts, which creates a gradual decay — videos get deleted, and the effective dataset shrinks — and a reproducibility challenge.

Annotation is dominated by specialised platforms: CVAT (Computer Vision Annotation Tool, open source), Label Studio, V7, Scale, Encord, and VideoLAT. Video annotation workflows bundle several primitive operations: bounding-box tracking (propagate a box across frames), mask interpolation (draw a mask on keyframes and interpolate between), action-interval labelling (mark start/end of each action), and text-clip-pair labelling (for captioning and QA). Modern annotation platforms now include SAM 2 for mask propagation and tracking-by-detection tools that cut annotation time by 5–10×. Still, labelling one hour of video for dense per-frame action segmentation typically costs 10–20 hours of annotator time.

The training-infrastructure story is dominated by a few toolkits. MMAction2 (OpenMMLab) is the reference implementation of most action-recognition and localisation methods, with a config-driven design and pretrained checkpoints. PyTorchVideo (Meta) offers similar coverage with more integration into the broader PyTorch ecosystem. MMTracking covers tracking; VOSNet, XMem, and Cutie codebases cover VOS; MMSegmentation's video fork covers VIS and VPS. For video-language, LAVIS and VideoChat provide end-to-end pipelines for retrieval and QA. For data loading, Decord and PyAV dominate Python-side decoding; NVIDIA DALI is used for GPU-side decoding at scale; Kornia's video module handles augmentation.

Common failure modes. Video models fail in a few characteristic ways. Scene bias: a model trained on Kinetics may classify any forest scene as "hiking" even when no one is in it. Static-frame leakage: a model can memorise which YouTube channel a clip came from and predict the label from the channel's visual style. Clip-length sensitivity: a model trained on 2-second clips often fails on 10-second clips with the same action. Camera-motion confusion: aggressive camera motion can be misread as object motion by models that don't normalise for it. Evaluation diagnostics — confusion matrices, per-class accuracy, shuffled-frame tests, video-level cross-validation — catch most of these before deployment.

Downstream integration: video understanding rarely stands alone. An autonomous-driving stack combines video detection (Part VII Ch 03), video tracking and segmentation (this chapter), and 3-D perception (Part VII Ch 05 — depth estimation, point clouds, NeRFs); a sports-analytics product combines action recognition with keypoint tracking and tactical event detection; a content-moderation pipeline combines object detection, action recognition, and video-language classification; a surgical-assistance tool combines phase recognition with object (instrument) detection. The tendency is toward unified multimodal foundation models — InternVideo-2, the video-LLM family — that serve many of these tasks with a single shared encoder, much as CLIP unified image-text tasks.

The interface to the rest of Part VII is straightforward. Chapter 01 covers the classical image primitives (colour, gradients, features) that still matter as preprocessing and baselines. Chapter 02 covers the image backbones (ResNet, EfficientNet, ViT, Swin, ConvNeXt) that every video model inflates from or adapts. Chapter 03 covers detection and segmentation, which generalise to video as discussed in Sections 13–14 of this chapter. Chapter 05 covers 3-D vision (depth, point clouds, NeRF), which increasingly fuses with video for dynamic-scene reconstruction. Chapter 06 covers vision-language models, which Section 12 here ties directly into. And Part XIV (Generative Models) covers video generation — Sora, Gen-3, Stable Video — which uses the same backbones and evaluation infrastructure as this chapter but in the decoder role. Video understanding is therefore both a leaf task and a connective tissue for much of the rest of the compendium.