Vision-language models are the bridge between the two largest domains of modern AI: pixels and text. Their task is deceptively broad — given an image (or video, or document, or 3D scene) and a piece of language (a caption, a question, a referring expression, a long instruction), produce a response that respects both modalities. That single framing covers image captioning, visual question answering, text-to-image retrieval, image-to-text retrieval, zero-shot classification, visual grounding, optical character recognition, document understanding, visual dialogue, embodied planning, and the general-purpose "describe / explain / analyse this image" that anchors today's multimodal assistants. The field's history is a sequence of sharp phase transitions. Phase one (2014 – 2018) was the encoder-decoder era: CNN encoders produced image features, LSTM or transformer decoders emitted captions, and bottom-up-and-top-down attention (Anderson 2017) elevated region proposals into a default visual vocabulary. Phase two (2019 – 2020) was the BERT-for-vision era: ViLBERT, LXMERT, VisualBERT, VL-BERT, UNITER, and OSCAR fused image regions and text tokens inside a single transformer and pre-trained on masked-language, image-text-matching, and masked-region objectives. Phase three (2021) was the dual-encoder earthquake: OpenAI's CLIP and Google's ALIGN showed that a simple contrastive loss over four hundred million web image-text pairs produced a zero-shot image classifier competitive with fully supervised ImageNet ResNets, and — more importantly — a joint embedding space that seeded Stable Diffusion, segmentation (CLIPSeg, SAM), detection (GLIP, GroundingDINO), retrieval, and open-vocabulary pipelines across the field. 
Phase four (2022 – 2023) was the generative-and-frozen era: BLIP and BLIP-2 coupled vision encoders to generative decoders via Q-Formers; Flamingo showed that a frozen LLM augmented with gated cross-attention and a Perceiver Resampler could few-shot new visual tasks in context; LLaVA and MiniGPT-4 made the recipe open — take a frozen vision encoder, a frozen LLM, a thin projector, and a few hundred thousand GPT-4-generated instruction pairs, and you have a conversational multimodal assistant. Phase five (2024 – 2025) brought frontier closed systems (GPT-4V, Gemini 1.5 / 2.0, Claude 3, Claude 3.5, Claude 4) that handle images, long documents, video, and multi-image prompts natively, alongside a vibrant open ecosystem (Qwen-VL, InternVL, DeepSeek-VL, Molmo, Llama 3.2-Vision, Pixtral) and specialised branches — grounding (Kosmos-2, Set-of-Mark, Florence-2), documents (Donut, Nougat, ColPali), video (Video-LLaVA, VideoChat, InternVideo2), 3D (3D-LLM), and embodied reasoning (PaLM-E, RT-2, OpenVLA). This chapter follows the full arc: task landscape, data, architecture families, evaluation, deployment, and the integration back into the rest of the ML stack.
Sections one through three are the framing. Section one is why vision-language models matter — the convergence of vision and language, the difficulty of joint reasoning, the reason CLIP turned out to be the cornerstone of modern computer vision, and the specific capabilities (zero-shot classification, open-vocabulary anything, visual chat, document AI, embodied reasoning) that VLMs unlocked. Section two is the task landscape: image captioning, VQA, visual reasoning, retrieval, zero-shot classification, visual grounding, referring expressions, dialogue, OCR, document QA, video QA, and long-context multi-image reasoning — with the canonical benchmarks for each. Section three is image-text data: the pipeline from early Flickr8K / 30K / MSCOCO Captions, through Conceptual Captions / CC3M / CC12M / SBU, to the web-scale era (LAION-400M, LAION-5B, COYO-700M, DataComp-1B, WIT), the de-duplication and filtering practices (LAION aesthetics, DataComp-Small, LLaVAR / LLaVA-Next instruction data), and the licensing / safety story that has shaped what is and is not public.
Sections four and five cover the pre-CLIP era. Section four is early vision-language models — the Show-and-Tell / Show-Attend-and-Tell CNN-LSTM era, the bottom-up-and-top-down (BUTD) attention that fed Faster R-CNN regions into LSTMs, and the first transformer-based fusion attempts. Section five is the ViLBERT / LXMERT family: single-stream (VisualBERT, VL-BERT, UNITER, OSCAR, VinVL) versus two-stream (ViLBERT, LXMERT) fusion architectures; the pre-training objectives (masked language modelling, masked region modelling, image-text matching, word-region alignment); the Faster R-CNN region bottleneck that dominated until ViT made patch tokens the default visual input.
Sections six through eight cover the contrastive and generative turns. Section six is CLIP: the dual-encoder formulation, the symmetric InfoNCE objective, the four-hundred-million WebImageText corpus, the zero-shot classification recipe (prompt engineering, prompt ensembles), and the reasons CLIP became the universal vision backbone. Section seven is the CLIP family and contrastive scaling: ALIGN (1.8B noisy pairs), BASIC, FLIP (masked image modelling for CLIP), OpenCLIP (the open-source reproduction), SigLIP (sigmoid pairwise loss), EVA-CLIP, MetaCLIP, DFN, and the DataComp / LAION-2B compute / data scaling studies. Section eight is the BLIP family and generative captioning: BLIP's bootstrapped caption filtering, BLIP-2's Q-Former bridging a frozen ViT to a frozen LLM, InstructBLIP, and the generative-plus-contrastive hybrid losses (CoCa, BEiT-3, ONE-PEACE) that unify encoders and decoders.
Sections nine through eleven cover the LLM-coupled multimodal models that define today's frontier. Section nine is Flamingo and frozen-LM few-shot: the Perceiver Resampler, gated cross-attention layers inserted into a frozen Chinchilla, and the few-shot in-context multimodal prompting that Flamingo pioneered. Section ten is LLaVA and open visual instruction tuning: the minimalist recipe (vision encoder, MLP projector, LLM, GPT-4-synthesised instruction data), LLaVA-1.5 / 1.6 / NeXT, MiniGPT-4, Qwen-VL / Qwen2-VL, InternVL, DeepSeek-VL, Molmo, Idefics, CogVLM, Yi-VL, Llama 3.2-Vision, and the open VLM ecosystem. Section eleven is proprietary frontier VLMs: GPT-4V / GPT-4o / GPT-5, Gemini 1.5 / 2.0 / 2.5 with its native long-context multi-modality, Claude 3 / 3.5 / 4 with computer-use vision, and the architecture and capability story as far as it has been publicly disclosed.
Sections twelve through fifteen cover specialised VLM branches. Section twelve is visual grounding and referring expressions: GLIP / GLIPv2, GroundingDINO, Kosmos-2 (grounded captioning with output bounding boxes), Set-of-Mark prompting, Florence-2, and the open-vocabulary detection / referring-segmentation story that wires VLMs back into dense-prediction tasks. Section thirteen is document and OCR VLMs: Donut's OCR-free document understanding, LayoutLMv3, Nougat's scientific-document parser, ColPali's retrieval-first document pages, Qwen-VL-Chat's OCR, and the document-QA benchmarks (DocVQA, InfographicVQA, ChartQA, AI2D) that now serve as capability probes. Section fourteen is video-language models: Video-LLaVA, VideoChat / VideoChat2, VideoLLaMA / VideoLLaMA2, InternVideo / InternVideo2, LLaVA-Video, Gemini's long-context native video, and the temporal QA / long-video-understanding benchmarks. Section fifteen is 3D and embodied VLMs: 3D-LLM, PointLLM, SpatialVLM, PaLM-E, RT-2, OpenVLA, and the spatial-reasoning / robotic-action VLMs that plug vision into the embodiment stack.
Section sixteen is evaluation and benchmarks: the VQAv2 / OK-VQA / GQA / ScienceQA / AI2D origin; the modern MMMU / MMBench / MME / MM-Vet / SEED-Bench / MathVista / ChartQA / MMMU-Pro capability exams; the hallucination benchmarks (POPE, HallusionBench); the robustness suites (VLMEvalKit, LMMs-Eval, LLaVA-Bench-in-the-Wild); and the long-standing question of how to measure open-ended multimodal generation. Section seventeen is efficient VLMs and deployment: quantisation (AWQ, GPTQ for VLMs), distillation, sparse / MoE VLMs (MoVA, LLaVA-MoD), small-footprint variants (MobileVLM, TinyLLaVA, nanoVLM), edge deployment (ONNX, TensorRT-LLM, llama.cpp multimodal, Core ML), serving (vLLM-multimodal, SGLang), and the engineering of latency, memory, and batching under image prompts. The closing section is the operational picture: how VLMs are trained end-to-end (pretraining → SFT → RLHF / DPO), how they slot into retrieval-augmented and agentic systems, how they power vision pipelines (CLIP embeddings for search, open-vocabulary detection, captioning for dataset labelling), and how the next chapters of Part VII — audio, generative models, RL, robotics, agents — reuse the VLM backbone as their visual front end.
For the first ten years of deep learning, computer vision and natural-language processing were parallel worlds. CNNs learned from ImageNet; transformers learned from text; they shared optimisation theory and little else. Vision-language models (VLMs) are the neural architectures that join these two worlds: they take images (or video, or documents, or 3D scans) together with language (captions, questions, instructions, dialogue) and produce responses that depend on both modalities. In the four years since CLIP's release (January 2021), VLMs have become the cornerstone of modern computer vision — the default encoder for search, for open-vocabulary detection and segmentation, for text-to-image generation, for dataset labelling — and the visual front end of today's multimodal assistants.
The list of tasks that VLMs have reshaped is long. Zero-shot image classification: given any set of textual class names, CLIP can classify an image into them without additional training. Open-vocabulary detection and segmentation: GLIP, GroundingDINO, and OWL-ViT extend this to localisation with textual object descriptions. Image captioning: BLIP and its successors turn an image into a natural-language description, which powers accessibility, search, and dataset curation. Visual question answering: given an image and a question, a VLM returns an answer in natural language — from simple colour / counting questions (VQAv2) to university-level multimodal reasoning (MMMU). Visual dialogue: multi-turn conversations about an image, now the default interaction pattern of ChatGPT, Gemini, and Claude. Document understanding: parsing invoices, contracts, scientific papers, and spreadsheets that contain both text and visual structure. Visual grounding: pointing to the region of an image that corresponds to a phrase. Text-to-image retrieval and image-to-text retrieval: the search primitive behind Pinterest, Google Lens, and every product-search engine that cares about visual similarity. Embodied reasoning: PaLM-E and RT-2 use a VLM as the policy head of a robot, turning "pick up the red block" into a sequence of end-effector commands.
A distinctive property of VLMs is open-endedness. An ImageNet classifier knows a thousand categories; a VLM, conditioned on natural language, can classify anything expressible in words. A Faster R-CNN detector localises eighty COCO classes; an open-vocabulary detector is asked "where is the defective weld?" and produces a bounding box. This open-endedness is why CLIP embeddings appear in almost every modern vision pipeline — as a backbone for segmentation (CLIPSeg, SAM), as the text-image alignment used in Stable Diffusion, as the scorer for retrieval, as the zero-shot classifier for data filtering. The VLM is not just a better captioner; it is a reusable visual reasoning engine.
The second wave — generative, LLM-coupled VLMs — generalises this further. A vision encoder produces patch embeddings; a lightweight projector (MLP, Q-Former, Perceiver Resampler) maps them into an LLM's token space; the LLM produces text that is jointly conditioned on images and the user's instruction. The result is a single system that captions, answers, dialogues, reasons about charts, explains screenshots, reads signs in dozens of languages, and — with the appropriate grounding modules — points to specific regions. GPT-4V (September 2023), Gemini 1.5 (February 2024), and Claude 3 (March 2024) made this the default frontier-model capability; LLaVA, Qwen-VL, and InternVL brought the same recipe to the open-source world.
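The connector pattern is easy to see in miniature. A numpy sketch with illustrative shapes (576 patches of dimension 1024 projected into a 4096-dimensional LLM space; the dimensions and initialisation are assumptions, not any specific model's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: 576 patch embeddings of dim 1024 from a vision
# encoder, projected into a 4096-dim LLM token-embedding space.
n_patches, d_vision, d_llm = 576, 1024, 4096

patches = rng.standard_normal((n_patches, d_vision))   # vision encoder output
W1 = rng.standard_normal((d_vision, d_llm)) * 0.02     # two-layer MLP projector,
W2 = rng.standard_normal((d_llm, d_llm)) * 0.02        # in the LLaVA-1.5 style

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

visual_tokens = gelu(patches @ W1) @ W2                # (576, 4096)

# The LLM then sees [visual tokens ; embedded text tokens] as one sequence.
text_tokens = rng.standard_normal((12, d_llm))         # embedded user prompt
sequence = np.concatenate([visual_tokens, text_tokens], axis=0)
print(sequence.shape)  # (588, 4096)
```

The trainable surface of the projector is tiny compared with the frozen encoder and frozen LLM on either side of it, which is why the first stage of LLaVA-style training is cheap.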
VLMs also matter because they are the layer the rest of the AI stack is increasingly built on. Text-to-image generation uses CLIP or T5 as its text conditioning. Retrieval-augmented generation over PDFs uses ColPali's document-image embeddings instead of OCR. Robotics foundation models (RT-2, OpenVLA, π0) are VLMs with action tokens. The next chapters of Part VII — audio, generative models, RL, robotics, agents — all use a VLM as their perception and grounding layer. Understanding how VLMs are trained, which data they see, where they fail, and how to deploy them is central to almost every applied AI problem in 2026.
This chapter traces the arc from the pre-transformer caption-generation era through the BERT-for-vision fusion encoders, the CLIP / ALIGN dual-encoder earthquake, the BLIP-2 / Flamingo generative turn, the LLaVA open-instruction-tuning explosion, the proprietary frontier, and the specialised branches (grounding, documents, video, 3D, embodied). We end with evaluation, deployment, and how VLMs integrate with the broader ML stack.
Before surveying architectures, it is worth mapping the tasks VLMs are expected to solve. Each task has its own canonical dataset, its own evaluation quirks, its own failure modes. Reading papers without this map is disorienting; with it, the landscape resolves into a small number of natural clusters.
Image captioning — given an image, produce a natural-language description — is the oldest vision-language task. The datasets are Flickr8K / 30K (small, human-written), MSCOCO Captions (120 000 images, five captions each; the benchmark for a decade), NoCaps (testing novel-object captioning), Conceptual Captions / CC3M / CC12M (web-scraped alt text, large but noisy), and the synthetic GPT-4V re-captioning that became the de facto training signal from 2023 onward. The standard metrics — BLEU, METEOR, ROUGE, CIDEr, SPICE — all measure n-gram or scene-graph overlap with reference captions; they are known to undervalue fluent paraphrases and reward memorised phrases, which is why recent work (e.g. CLIPScore, BLIP's ITM-based eval) prefers embedding-space or learned metrics.
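To see why overlap metrics penalise paraphrase, here is a minimal sketch of clipped unigram precision, the core of BLEU-1 (real BLEU adds higher-order n-grams, a brevity penalty, and corpus-level aggregation):

```python
from collections import Counter

def unigram_precision(candidate, references):
    """Clipped unigram precision, the BLEU-1 core.

    Illustrative only: real BLEU combines 1-4-gram precisions,
    a brevity penalty, and corpus-level aggregation."""
    cand_counts = Counter(candidate.lower().split())
    # Max count of each word across all references (the "clipping").
    max_ref = Counter()
    for ref in references:
        for w, c in Counter(ref.lower().split()).items():
            max_ref[w] = max(max_ref[w], c)
    clipped = sum(min(c, max_ref[w]) for w, c in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

refs = ["a dog runs across the grass"]
print(unigram_precision("a dog runs across the grass", refs))    # 1.0
print(unigram_precision("a puppy sprints over the lawn", refs))  # ≈ 0.33, despite being a fine paraphrase
```

The second caption describes the image as well as the first, but shares only "a" and "the" with the reference, which is exactly the failure mode that motivates CLIPScore-style embedding metrics.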
Visual question answering (VQA) — given an image and a question, produce an answer — is the workhorse benchmark of the VLM era. The foundational dataset is VQAv2 (Goyal 2017): over a million image-question-answer triples, balanced by pairing each question with complementary images that yield different answers, to prevent language-only cheating. Extensions focus on specific reasoning types: OK-VQA and A-OKVQA (outside knowledge), GQA (compositional reasoning over scene graphs), ScienceQA (textbook-style multimodal science questions), AI2D (diagram understanding), MathVista (mathematical visual reasoning), ChartQA and PlotQA (chart understanding). Modern comprehensive benchmarks — MMMU, MMBench, MME, MM-Vet, SEED-Bench — batch together dozens of reasoning types and are reported as the single-number capability score for every new frontier VLM.
Visual grounding and referring expressions map a phrase to an image region. Grounding datasets include RefCOCO / RefCOCO+ / RefCOCOg (three tiers of referring-expression difficulty), Flickr30K Entities (phrase-to-box alignments), and the modern GRIT / GroundingDINO annotations. The output is a bounding box, a segmentation mask, or a point — the task type determines whether you are doing referring-expression comprehension (REC) or referring-expression segmentation (RES).
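Referring-expression comprehension is conventionally scored as accuracy at an intersection-over-union threshold of 0.5 against the ground-truth box. A minimal IoU implementation:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# A REC prediction is conventionally counted correct when IoU >= 0.5.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ≈ 0.14 → this prediction would be scored wrong
```

RES replaces boxes with masks and the same ratio is computed over pixels; point-output tasks (e.g. Molmo-style pointing) instead check whether the point falls inside the ground-truth region.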
Document and OCR understanding is the fastest-growing specialisation. Datasets include DocVQA (document pages + questions), InfographicVQA (complex infographics), ChartQA / ChartBench, TextVQA and TextCaps (images with prominent text), OCR-Bench / OCRBench-v2 (targeted OCR evaluation), and the academic-paper benchmarks (Nougat's arXiv test set, MMLongBench-Doc for long-document QA). Modern VLMs (Qwen2-VL, InternVL2.5, GPT-4o, Gemini 2.0) treat documents as images and perform OCR-free understanding; they rival or beat specialised OCR-plus-LLM pipelines on most academic benchmarks.
Video-language tasks extend all of the above to video. Video captioning (MSVD, MSR-VTT, ActivityNet-Captions); video QA (TGIF-QA, MSRVTT-QA, ActivityNet-QA, NExT-QA, EgoSchema for egocentric long-form, Video-MME for comprehensive video VLM evaluation); video-text retrieval; moment retrieval (localising the timespan described by a sentence). Long-context video — a thirty-minute video with dozens of events — has been a signature capability of Gemini 1.5 since February 2024.
Embodied and interactive tasks sit at the boundary with robotics and agents. ALFRED (household instruction following), Habitat (embodied navigation), EmbodiedQA, and the RT / OpenVLA robotic-manipulation benchmarks expect a VLM to output actions in a physical simulator or on a real robot. The Part VIII chapters on RL and robotics cover these in depth; this chapter focuses on the VLM backbone that feeds them.
Finally, modern frontier VLMs are evaluated as much on real-world capability — OCR on receipts, reading a weather map, debugging a web screenshot, counting items in a messy drawer — as on any single benchmark. That blend of lab benchmarks and in-the-wild capabilities is what the frontier closed systems increasingly compete on, and it pushes benchmark design toward "practical multimodal intelligence" suites like MMMU-Pro, MM-Vet, and LLaVA-Bench-in-the-Wild.
The story of VLM progress is, at least as much, the story of VLM data. The move from 300 000 paired examples (Flickr30K, MSCOCO) to 3 million (Conceptual Captions) to 400 million (CLIP's WIT) to 5 billion (LAION-5B) closely tracks what the best model could do. Every serious VLM paper has a data section as important as its architecture section; data scaling, filtering, and re-captioning are active research areas in their own right.
The early datasets — Flickr8K / 30K, MSCOCO Captions, Visual Genome, SBU Captions — were curated. Human annotators wrote five captions per image, or crowd workers paraphrased existing captions, producing hundreds of thousands of clean examples. These datasets trained the first CNN-LSTM captioners and remain standard evaluation suites. Their problem is scale: 120 000 images is not enough to train a modern VLM. Their advantage is grounded quality: every caption describes the image, not its surrounding webpage.
Conceptual Captions (CC3M, Sharma 2018) was the first at-scale web alt-text dataset, filtered from several billion candidate image-alt-text pairs down to 3.3 million. CC12M scaled this to 12 million with looser filters. The quality is visibly lower than MSCOCO — alt-text is written for accessibility, not as a description, and often names the photographer or product instead of the content — but the scale was a step change. SBU Captions (Ordonez 2011) was a parallel line: ~1 million Flickr images with user-written captions. RedCaps (Desai 2021) curated 12 million captions from Reddit.
CLIP's WIT (WebImageText) was the breakthrough: 400 million image-text pairs scraped from the public web, assembled as CLIP's training set but never released, and it was the data — more than the model — that made CLIP work. ALIGN (Jia 2021) scaled further to 1.8 billion pairs with almost no filtering, showing that scale could compensate for noise. LAION-400M and especially LAION-5B (Schuhmann 2022) reproduced these datasets in the open, using CommonCrawl plus CLIP-based filtering to select image-text pairs with high cosine similarity. LAION became the training set for Stable Diffusion, OpenCLIP, and countless downstream VLMs. Related open datasets include COYO-700M (Kakao), WuKong (Chinese, 100M), and WIT (Wikipedia-based Image-Text).
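The CLIP-based filtering step is simple enough to sketch. Assuming image and text embeddings have already been computed with a CLIP model (the random stand-in embeddings below are illustration only), the filter is a cosine-similarity threshold:

```python
import numpy as np

def clip_filter(img_emb, txt_emb, threshold=0.28):
    """Keep image-text pairs whose CLIP cosine similarity clears a cutoff.

    Sketch of LAION-style filtering; LAION reported cutoffs around
    0.3 (LAION-400M) and 0.28 (LAION-5B) for ViT-B/32 similarities.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sims = (img * txt).sum(axis=1)           # row-wise cosine similarity
    keep = np.nonzero(sims >= threshold)[0]  # indices of pairs to keep
    return keep, sims

rng = np.random.default_rng(0)
img_emb = rng.standard_normal((1000, 512))                   # stand-in CLIP image embeddings
txt_emb = img_emb + 3.0 * rng.standard_normal((1000, 512))   # noisy "captions" of the same content
keep, sims = clip_filter(img_emb, txt_emb)
print(f"kept {len(keep)} / 1000 pairs")
```

The subtlety DataComp later made explicit: because the filter is itself a CLIP model, the filtered dataset inherits that model's blind spots, which is one motivation for learned data-filtering networks (DFN).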
DataComp (Gadre 2023) is the most important recent contribution: instead of releasing a dataset, it releases a pool of 13 billion image-text candidates plus a benchmark that evaluates different filtering strategies on a fixed training compute budget. This re-frames dataset construction as a competitive research task — filter recipes are the units of progress — and has produced the DataComp-Small, DataComp-Medium, and DFN (Data Filtering Networks) filter pipelines that underpin the strongest open CLIPs.
Instruction-tuning data is a second, equally important, data regime. LLaVA's original instruction set (Liu 2023) was 158 000 image-conversation pairs synthesised by prompting GPT-4 (text only, with COCO captions and bounding boxes as context) to ask and answer questions about images. LLaVA-Instruct-665K, LLaVA-NeXT, ShareGPT4V, LLaVA-OneVision's 3.2M mix, and the academic datasets (Idefics2-Pretraining, M3IT, MIMIC-IT) scaled this up and broadened it to include OCR, charts, grounding, and multi-image. A recurring pattern: generate candidate dialogues with a strong closed model (GPT-4V, GPT-4o, Claude 3.5), filter for quality, train your open model on the result. This synthetic distillation is how most open VLMs close the gap with frontier closed models.
A third regime is document and OCR data: the IIT-CDIP collection of scanned documents (OCR-free pretraining for Donut), the arXiv PDF corpora (Nougat, Surya), the PubLayNet layouts, the DocLayNet / DocVQA / ChartQA triples, and the synthetic receipt/form generators (LayoutLMv3 used IIT-CDIP plus in-house synthetic data). The strongest document VLMs (Qwen2.5-VL-72B, InternVL2.5, GPT-4o, Claude 3.5 Sonnet) consume tens of millions of document pages during pretraining, which is why they rival specialised OCR pipelines.
The licensing story is unresolved. LAION, COYO, and CC are image-text pairs of publicly accessible content, distributed as URL lists that users re-crawl themselves; the actual images remain owned by their creators. Training on web-scraped data is an active legal and policy topic (the Andersen, Getty, New York Times, and Sarah Silverman cases all touch it), with 2024–25 decisions edging toward a "training is transformative but output must not reproduce copyrighted work" doctrine. This is directly relevant to VLM deployment: a captioner that paraphrases a news photograph is different, legally, from one that reproduces it.
Before the transformer took over, vision-language research was an encoder-decoder story. The encoder was a convolutional network pre-trained on ImageNet; the decoder was a recurrent network that emitted a caption one word at a time; the objective was maximum-likelihood over the next token. This simple recipe — cross-entropy, teacher forcing, CNN-LSTM — dominated captioning from 2014 to 2018 and established the baseline benchmarks that every later paper compares to.
Show and Tell (Vinyals 2014) is the starting point. A GoogLeNet encoder produced a single image feature vector; this vector was passed as the initial hidden state of an LSTM decoder that generated the caption one word at a time. Trained on MSCOCO with standard next-token cross-entropy, Show and Tell substantially beat every hand-engineered template-based captioner and won the 2015 MSCOCO captioning challenge. It demonstrated the core pattern — pretrained visual encoder + language decoder — that every vision-language system since has reused.
Show, Attend and Tell (Xu 2015) added the most important architectural idea: soft attention over spatial feature locations. Instead of condensing the image into a single vector, the encoder's 14 × 14 × 512 convolutional feature map was preserved, and at each decoding step the decoder computed an attention distribution over the 196 spatial locations and pooled them weighted by that distribution. This gave captions that could look at different regions for different words ("girl" → face region, "horse" → the horse region) and was a direct forerunner of the transformer-era cross-attention that all modern VLMs use.
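A numpy sketch of the mechanism, using the paper's 14 × 14 × 512 feature map but illustrative sizes for the hidden and attention dimensions (all weights random; a real model learns them):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)

# 14 x 14 grid of 512-dim convolutional features, flattened to 196 locations.
features = rng.standard_normal((196, 512))
hidden = rng.standard_normal(256)             # decoder LSTM state at this step (size assumed)

# Additive (Bahdanau-style) attention, as in Show, Attend and Tell.
W_f = rng.standard_normal((512, 128)) * 0.1   # projects each location
W_h = rng.standard_normal((256, 128)) * 0.1   # projects the decoder state
v = rng.standard_normal(128) * 0.1            # scores the combination

scores = np.tanh(features @ W_f + hidden @ W_h) @ v   # (196,) one score per location
alpha = softmax(scores)                               # attention distribution over the grid
context = alpha @ features                            # (512,) attention-weighted pooled feature
print(context.shape)
```

At each decoding step the LSTM consumes `context` alongside the previous word, so the pooled feature — and the image regions it emphasises — changes from word to word.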
Bottom-Up-and-Top-Down attention (BUTD, Anderson 2017) replaced the regular convolutional feature grid with Faster R-CNN region proposals: run a detector on the image, extract features for the top-36 (or 100) proposed regions, and attend over this set rather than over spatial grid cells. The intuition was that objects, not pixels, are the natural units of visual attention. BUTD won the 2017 VQA Challenge and became the default visual feature extractor for every vision-language paper until the ViT era, which is why so many 2018–2021 VL papers cite "36 or 100 BUTD region features" as their image input.
In parallel, SCAN (Lee 2018), MCB / MLB / MUTAN (bilinear pooling methods, 2016–2018), and a zoo of fusion architectures — SAN (stacked attention), HieCoAtt (hierarchical co-attention), DAN (dual attention), MFB, BAN — combined image region features and question word embeddings via increasingly elaborate attention, bilinear, and tensor-product operations. These methods moved the VQAv2 state of the art from 54% (LSTM+CNN baseline, 2015) to 67% (BUTD+Counter, 2017) but never cracked 70%, and the engineering was baroque.
Dense captioning (Johnson 2016) extended captioning from one caption per image to one caption per region, producing a grid of localised descriptions — a precursor to modern grounded captioning (Kosmos-2, GLaMM). Visual Genome (Krishna 2016) provided the dense annotations (object labels, attributes, relationships, region descriptions) that trained many of these systems and that still appear as supervisory signal in modern grounding data.
Three lessons from the pre-transformer era carried forward. First, pretrained visual encoders are necessary: end-to-end learning from raw pixels to words has never worked; the encoder is always pre-trained on ImageNet (or, later, LAION). Second, attention over spatial or region features is necessary: condensing the image to a single vector discards too much. Third, autoregressive language decoding works: cross-entropy on the next word, teacher forcing, and beam search at inference is the default recipe, unchanged today. What the transformer revolution added was (i) a unified attention primitive that could handle both vision and language tokens, and (ii) the scale of pretraining data that let the encoder-decoder recipe finally break through.
Between BERT's release in late 2018 and CLIP's in early 2021, vision-language research ran a near-exhaustive search over how to fuse a transformer with vision features. Every permutation was tried: single-stream versus two-stream; Faster R-CNN regions versus grid features; masked language versus image-text-matching versus masked-region-modelling objectives. This era produced ViLBERT, LXMERT, VisualBERT, VL-BERT, UNITER, OSCAR, VinVL, ImageBERT, Unicoder-VL, and more — a dense, fast-moving body of work whose lessons still inform current VLM architectures.
ViLBERT (Lu 2019) was the first: a two-stream transformer with separate BERT-style encoders for image region features and text tokens, coupled via co-attention layers (image queries attending to text keys/values and vice versa). It was pre-trained on Conceptual Captions with three objectives: masked multi-modal language modelling (predict masked text tokens given images and remaining text), masked image region modelling (predict masked region class given remaining regions and text), and alignment prediction (is this image-text pair a match?). The visual input was 36 Faster R-CNN region features per image.
LXMERT (Tan 2019) used a similar two-stream design with object-relationship and cross-modality encoders, and added VQA-style supervised pre-training tasks (object prediction, attribute prediction, VQA fine-tuning on GQA and VQAv2 during pretraining itself). LXMERT was for several years the strongest VQA-specific pre-trained model.
VisualBERT (Li 2019), VL-BERT (Su 2019), and Unicoder-VL (Li 2019) took the opposite approach: single-stream transformers that concatenated image regions and text tokens into one sequence and let self-attention handle the fusion. Single-stream designs are conceptually simpler and have fewer parameters per visual-text interaction, but they need token-type (segment) embeddings to tell the model which positions are visual and which are textual.
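The single-stream input construction is little more than concatenation plus learned modality markers. A sketch with illustrative shapes (36 region features and 20 text tokens at BERT-base width; the markers are random stand-ins for learned embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768                                       # BERT-base hidden size

regions = rng.standard_normal((36, d))        # 36 Faster R-CNN region features, projected to d
words = rng.standard_normal((20, d))          # text token embeddings

# Learned token-type embeddings mark which positions are text and which are image.
type_text = rng.standard_normal(d) * 0.02
type_image = rng.standard_normal(d) * 0.02

# One joint sequence; a plain transformer encoder then self-attends over it.
sequence = np.concatenate([words + type_text, regions + type_image], axis=0)
print(sequence.shape)  # (56, 768)
```

Everything after this line is a standard BERT encoder, which is precisely why the single-stream family could reuse BERT weights and training infrastructure wholesale.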
UNITER (Chen 2020) consolidated the single-stream family with a larger BERT-large-scale backbone, four pretraining datasets (COCO, VG, CC3M, SBU), and a thorough ablation over objectives — masked language modelling, masked region modelling with feature regression, image-text matching, and a clever word-region alignment objective via optimal transport. UNITER's ablations set the reference points for which objectives actually contributed to downstream performance.
OSCAR (Li 2020) added a simple but influential idea: include object tags (words like "dog", "umbrella", detected from the image) as an explicit part of the input sequence, acting as an anchor between visual regions and text tokens. This object-tag trick consistently improved VL-BERT-style models by several points on VQA and image-text retrieval, and inspired many follow-ups. VinVL (Zhang 2021) showed that upgrading the visual feature extractor — from ResNet-101 Faster R-CNN to a much larger ResNeXt-152 — was a bigger lever than architectural changes, foreshadowing the "scale the vision encoder" theme that would dominate the CLIP era.
The fusion-transformer family was eventually subsumed by ViT-based approaches. ViLT (Kim 2021) dropped the region extractor entirely — image patches fed directly into a single transformer — and demonstrated that you could match UNITER-level performance without Faster R-CNN. ALBEF (Li 2021) added a contrastive image-text objective alongside the masked-language and ITM objectives, anticipating BLIP. VLMo (Wang 2021) introduced mixture-of-modality-experts — shared parameters plus vision-specific and language-specific experts — a design that persists in BEiT-3 and some modern VLMs.
The fusion-transformer era taught the field three lessons. First, pre-training objectives matter enormously: the specific mix of MLM / MRM / ITM / contrastive / alignment objectives has large downstream effects, and no single objective is dominant. Second, the visual feature source is the biggest lever: Faster R-CNN regions, then ViT patches, then CLIP-initialised ViTs each unlocked a step change. Third, fine-tuning is annoying: every downstream task (VQA, retrieval, captioning, REC) needs its own head and its own fine-tuning recipe, which makes the resulting models practically hard to use. CLIP's zero-shot promise was a direct response to this.
In January 2021, OpenAI released CLIP (Contrastive Language-Image Pre-training; Radford 2021) alongside DALL·E 1. CLIP did not win any of the fine-tuned VL-BERT benchmarks its contemporaries had been chasing. What it did was show that a contrastive dual-encoder trained on 400 million web image-text pairs produced a zero-shot image classifier competitive with a fully supervised ResNet-50 on ImageNet. This single result reshaped computer vision.
The architecture is extreme in its simplicity. Two encoders — a ViT or modified ResNet for images, a transformer for text — each project their input to a shared 512- or 768-dimensional embedding space via a linear layer. Given a batch of N image-text pairs, compute the N × N similarity matrix of image embeddings against text embeddings. The symmetric InfoNCE loss asks that, along each row, the correct text (the one that actually accompanied this image) be closest in cosine similarity to this image, and that along each column the correct image be closest to this text. Optimising this objective on 400 million pairs for 32 epochs (the largest ViT took 12 days on 256 V100s) produced the released CLIP models (ViT-B/32, ViT-B/16, ViT-L/14, plus ResNet variants).
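The objective is short enough to write out in full. A numpy sketch (the released models learn the temperature as a parameter; it is fixed at 0.07 here, and the random embeddings are stand-ins for encoder outputs):

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over the N x N in-batch similarity matrix."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature               # (N, N) scaled cosine similarities
    labels = np.arange(len(logits))                  # pair i is the match for row/column i

    def xent(l):                                     # cross-entropy with the diagonal as target
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (xent(logits) + xent(logits.T))     # image→text plus text→image

rng = np.random.default_rng(0)
aligned = rng.standard_normal((64, 256))
loss_matched = clip_loss(aligned, aligned)                        # near zero: diagonal dominates
loss_random = clip_loss(aligned, rng.standard_normal((64, 256)))  # roughly log(64): no signal
print(loss_matched, loss_random)
```

The in-batch negatives are the whole trick: every other pair in the batch serves as a negative example, which is why CLIP training used batch sizes in the tens of thousands.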
What made CLIP a cornerstone of computer vision is the way you use it at inference. To classify an image into K categories with CLIP, you do not fine-tune anything. You (i) embed the image with the image encoder, (ii) form a textual prompt for each category ("a photo of a {class}"), embed each prompt with the text encoder, and (iii) pick the class whose text embedding has the highest cosine similarity to the image embedding. This zero-shot prompt-based classification — no labels required, no task-specific training — gave CLIP 76% on ImageNet (matching a supervised ResNet-50), strong performance on 30 downstream benchmarks, and — crucially — distribution-robust behaviour on ImageNet-R, ImageNet-A, ImageNet-Sketch, and ObjectNet, where supervised ImageNet classifiers collapse.
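The recipe in code, with a toy stand-in for the encoders (the hashing "encoder" and the fake image embedding are pure illustration; in practice you would call a real model, e.g. via the open_clip package):

```python
import numpy as np

def zero_shot_classify(image_emb, class_names, embed_text):
    """CLIP-style zero-shot classification sketch.

    `embed_text` stands in for the CLIP text encoder; `image_emb` for
    the image encoder's output in the same shared space."""
    prompts = [f"a photo of a {c}" for c in class_names]
    txt = np.stack([embed_text(p) for p in prompts])
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    img = image_emb / np.linalg.norm(image_emb)
    sims = txt @ img                                  # cosine similarity per class
    return class_names[int(np.argmax(sims))], sims

# Toy deterministic stand-in encoder: buckets words by character sum (illustration only).
def toy_embed(text, dim=64):
    v = np.zeros(dim)
    for w in text.split():
        v[sum(map(ord, w)) % dim] += 1.0
    return v

image_emb = toy_embed("a photo of a dog")   # pretend this came from the image encoder
label, sims = zero_shot_classify(image_emb, ["cat", "dog", "car"], toy_embed)
print(label)  # dog
```

Prompt ensembles are the same loop with several templates per class ("a photo of a {c}", "a drawing of a {c}", ...) and the text embeddings averaged before normalisation.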
CLIP's transfer properties are the reason it is everywhere in vision. The CLIP image encoder — especially ViT-L/14 and its 336-pixel input variant — is one of the strongest off-the-shelf visual representations available. It is the default backbone in: Stable Diffusion's and DALL·E 2's text-to-image conditioning (via CLIP text embeddings); open-vocabulary detection (ViLD, DetPro, OWL-ViT); open-vocabulary segmentation (CLIPSeg, LSeg, OpenSeg, MaskCLIP); captioning (ClipCap, CoCa); LLaVA and nearly every open VLM since (which use CLIP-ViT-L/14 or SigLIP as the vision tower); and evaluation (CLIPScore, CLIP-IQA, CMMD). If you need a general-purpose "what's in this image" vector in 2026, CLIP or a CLIP descendant is still the answer.
CLIP is not without its weaknesses. Its text encoder is small (a 12-layer transformer with a 77-token context), which is why complex prompts — relative spatial terms, counting, compositional relationships — tend to fail. It struggles with fine-grained distinctions (bird species, dog breeds, near-duplicate objects) where the text supervision was not discriminative enough. Its zero-shot performance on specialised domains (medical imaging, satellite, art-historical) is only modest, which is why domain-specific CLIP variants (BiomedCLIP and PubMedCLIP for medical imaging, RemoteCLIP for satellite imagery) emerged. And it is a retrieval / classification model: it cannot generate text, describe an image, or answer a question — those require coupling it to a generative decoder (BLIP, Flamingo, LLaVA).
What CLIP really was, in retrospect, is a recipe: a dual-encoder contrastive objective, run at web scale, with a text encoder strong enough to capture natural-language description, and image features that are clean enough to align. Every subsequent contrastive VLM is a variation on this recipe. The next section surveys the CLIP family — ALIGN, SigLIP, OpenCLIP, EVA-CLIP, MetaCLIP, DFN — and what we have learned by scaling, filtering, and re-engineering each part of the original formula.
CLIP was a recipe. In the four years since its release, that recipe has been scaled, cleaned, reformulated, and reproduced in the open. The result is a family of dual-encoder VLMs — ALIGN, OpenCLIP, SigLIP, EVA-CLIP, MetaCLIP, DFN, Jina-CLIP — each of which swapped one or two components of the original formula and pushed the Pareto frontier of zero-shot performance versus training cost.
ALIGN (Jia 2021) was released four months after CLIP. Its innovation was brute scale: 1.8 billion image-text pairs with almost no filtering, versus CLIP's 400 million with some filtering. ALIGN used an EfficientNet-L2 image encoder and a BERT-Large text encoder, and matched or exceeded CLIP on most benchmarks despite lower-quality data. The headline lesson: with enough scale, noisy data still works — an insight that would later justify LAION-5B and COYO-700M.
OpenCLIP (Ilharco 2021 onward) is the open-source reproduction that matters most in practice. The original OpenAI CLIP weights were released, but the training data (WIT) and the code to reproduce it were not. OpenCLIP — built by LAION and the broader open community on LAION-400M, LAION-2B, and later LAION-5B and DataComp — reproduced and extended CLIP, eventually surpassing OpenAI's CLIP on zero-shot ImageNet. The OpenCLIP checkpoints (e.g. ViT-B/16-LAION-2B, ViT-L/14-LAION-2B, ViT-H/14-LAION-2B, ViT-G/14-LAION-2B, ViT-bigG-14-LAION-2B) are the default CLIP weights in most 2023–25 papers, because the training data is public and the licensing is less restrictive.
SigLIP (Zhai 2023) replaced the softmax-based InfoNCE loss with a pairwise sigmoid loss: each image-text pair is treated as an independent binary classification (match or not), and the normalising constant over the batch is eliminated. The practical consequences are (i) smaller optimal batch sizes — SigLIP reaches CLIP-batch-32 768 performance at batch 4 096 — which makes training accessible to smaller labs, and (ii) faster convergence. SigLIP-SO400M and SigLIP 2 (2025) are now the most common vision towers for open VLMs (InternVL, Qwen-VL, LLaVA-NeXT), having largely displaced OpenAI's ViT-L/14.
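The change is local to the loss. A NumPy sketch, with the temperature and bias fixed at the paper's initialisation values (both are learnable in SigLIP itself):

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: each of the N*N (image, text) cells is an
    independent binary match/no-match problem, so there is no batch-wide
    softmax normaliser to compute or synchronise across devices.

    img_emb, txt_emb: (N, D) L2-normalised embeddings; pair i matches.
    """
    logits = t * (img_emb @ txt_emb.T) + b        # (N, N)
    z = 2.0 * np.eye(len(logits)) - 1.0           # +1 on the diagonal, -1 off it
    # -log sigmoid(z * logits) == logaddexp(0, -z * logits), averaged over cells
    return np.logaddexp(0.0, -z * logits).mean()
```

Because every cell is scored independently, the loss decomposes across devices without an all-gather of the full similarity matrix — the property behind SigLIP's batch-size economics.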
EVA-CLIP (Sun 2023) initialised the image tower from EVA, a masked-image-modeling-pre-trained ViT-G/14, and added LAMB optimiser tricks and careful hyperparameters. EVA-CLIP ViT-E/14 with 4B parameters reached 82.0% on ImageNet zero-shot, which remained state of the art for many months and established the "initialise from a strong vision SSL model" pattern that several follow-ups reused.
Longer-context and multilingual CLIPs extend CLIP's 77-token text encoder. Long-CLIP (Zhang 2024) replaced the text encoder with a longer-context variant, enabling paragraph-length prompts. Multilingual CLIP, Jina-CLIP-v2, M-CLIP, and Chinese CLIP train text encoders on non-English data, which is critical for non-English retrieval, open-vocabulary detection, and document understanding in languages outside the original WIT distribution.
Masked-CLIP variants speed up training. FLIP (Li 2023) randomly masks 50% of image patches during CLIP training; training is roughly 2× faster with a 1-point zero-shot gain because the model sees more pairs per unit compute. CLIPA (Li 2023) uses a staged training schedule — low-resolution pretraining then high-resolution fine-tuning — to reach ViT-H/14 quality at 1/10 the cost.
Finally, hybrid contrastive-generative models — CoCa (Yu 2022), ALBEF, BEiT-3, ONE-PEACE — add a captioning decoder to the CLIP dual-encoder and train both objectives jointly. CoCa's 91% accuracy on ImageNet (fine-tuned) and 86% zero-shot showed that a single model can be both a retrieval engine and a captioner, anticipating BLIP-2 and the modern VLM formula.
Three lessons stand out from the CLIP-family era. First, data quality matters as much as scale: DFN's learned-filter-selected pairs beat LAION-5B's more loosely filtered ones at comparable scale. Second, the loss matters: SigLIP's sigmoid beat CLIP's softmax per batch-size and per compute dollar. Third, vision-tower pretraining matters: initialising from EVA / DINOv2 / MAE gave CLIP-style contrastive training a consistent head start. These three levers — data, loss, vision pretraining — continue to drive the frontier of contrastive VLMs in 2026.
CLIP was contrastive-only — it could match images and text but could not generate. A parallel research line, led by Salesforce's BLIP series, coupled a vision encoder to a text decoder and produced one of the two archetypes of the modern VLM: the bridge-module VLM that transforms vision features into a frozen language model's input space.
BLIP (Li 2022) introduced a multi-task architecture — MED, Multimodal mixture of Encoder-Decoder — with three functional modes: (i) unimodal image/text encoders for contrastive learning, (ii) image-grounded text encoder for image-text-matching classification, and (iii) image-grounded text decoder for caption generation. All three share most parameters; differences are a few attention-layer toggles. BLIP trained on MSCOCO + CC3M + CC12M + SBU + LAION-115M.
BLIP's CapFilt data-cleaning trick became at least as influential as the architecture. Web captions are noisy; BLIP used its own captioner to generate synthetic captions for web images, its own ITM head to filter low-scoring web-caption and synthetic-caption pairs, and retrained on the cleaned data. This bootstrapped captioning loop is an early example of what later became a universal VLM trick: use a smaller model to clean data for a larger one.
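Schematically, the loop looks like this; `captioner` and `itm_score` are stand-ins for BLIP's caption decoder and image-text-matching head, and the threshold is illustrative:

```python
def capfilt(web_pairs, captioner, itm_score, threshold=0.5):
    """BLIP-style CapFilt: for each web image, consider both its noisy web
    caption and a synthetic caption from the model's own captioner, and
    keep only the captions the ITM head scores as genuine matches. The
    cleaned pairs then become the next round's training data.
    """
    cleaned = []
    for image, web_caption in web_pairs:
        for caption in (web_caption, captioner(image)):
            if itm_score(image, caption) >= threshold:
                cleaned.append((image, caption))
    return cleaned
```

The same skeleton — generate candidates, score them with a trusted head, keep the survivors — recurs in nearly every later synthetic-data pipeline for VLMs.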
BLIP-2 (Li 2023) is the cleaner architectural statement and the template most modern VLMs follow. It has three components: (i) a frozen vision encoder (ViT-g/14 from EVA), (ii) a lightweight Q-Former — a transformer with a small set of learnable query tokens that attend to the vision features and distil them into a fixed number of output vectors — and (iii) a frozen language decoder (OPT or FlanT5). Only the Q-Former (~180M parameters) is trained; both the vision encoder and the language model are kept frozen. Training is two-stage: first, Q-Former pre-training with ITC / ITM / ITG objectives on vision-text pairs; second, Q-Former-to-LM pre-training where the Q-Former outputs are prepended to the LM's context.
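The distillation step at the heart of the Q-Former — learnable queries cross-attending to however many patch features the vision encoder emits, producing a fixed-length output — can be sketched as single-head attention (the real module stacks such layers with self-attention, MLPs, and LayerNorm):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def qformer_step(queries, vision_feats, Wq, Wk, Wv):
    """One cross-attention read: K learnable query tokens attend over P
    frozen patch features and come out as K vectors — a fixed-length
    visual summary for the frozen LM, whatever P happens to be.
    """
    q = queries @ Wq                      # (K, D)
    k = vision_feats @ Wk                 # (P, D)
    v = vision_feats @ Wv                 # (P, D)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)   # (K, P) weights
    return attn @ v                       # (K, D): independent of P
```

The output length depending only on K, never on P, is what lets the same bridge serve any input resolution or patch count.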
BLIP-2's practical recipe — freeze, bridge, condition — dominates today. Freezing the LM preserves its language capabilities (BLIP-2 could multi-turn-dialogue with FlanT5-XXL without any dialogue training data), and keeping the vision encoder frozen makes training cheap enough to do on a few A100s rather than a cluster. InstructBLIP (Dai 2023) extended this by instruction-tuning the Q-Former on 26 vision-language tasks, producing the first open VLM with strong zero-shot instruction-following.
CoCa (Yu 2022) and BEiT-3 (Wang 2022) are two contemporaneous alternatives that combined contrastive and generative objectives into a single trainable backbone. CoCa had a contrastive ViT-style encoder plus a unimodal-then-multimodal decoder trained with joint contrastive + captioning loss, reaching 86% zero-shot and 91% fine-tuned accuracy on ImageNet. BEiT-3 used a multiway transformer — shared layers plus modality-specific experts — and a single "masked data modelling" objective that unified image, text, and image-text data. These architectures did not become templates the way BLIP-2 did, but their results contributed to the understanding that generative and contrastive objectives compose, an idea central to the LLaVA-era VLMs.
ONE-PEACE (Wang 2023) extended BEiT-3 to vision, text, and audio in a single tri-modal encoder. PaLI / PaLI-X (Chen 2023) from Google scaled a similar encoder-decoder architecture to 55B parameters and 100+ languages with ViT encoders of up to 22B parameters, producing a strong multilingual multimodal foundation that later evolved into Gemini's vision stack.
The lesson of the BLIP family is the bridge paradigm: train a small projection module that connects a pre-trained vision encoder to a pre-trained language model, keeping both towers frozen. This is the minimum-viable recipe for building a VLM from pre-existing components, and it is precisely the recipe that the LLaVA series would make open, cheap, and ubiquitous.
In April 2022, DeepMind's Flamingo (Alayrac 2022) showed that a frozen large language model could be taught to few-shot new visual tasks in context — the way GPT-3 learns tasks from in-context examples in text — by bolting on a carefully designed vision-to-language bridge. Flamingo's capability demo (an 80B-parameter model that set few-shot state of the art on all 16 benchmarks it was evaluated on, beating fully fine-tuned specialists on six of them, using only a handful of examples per task) was a turning point: it proved that VLMs could be truly general-purpose and that LLM in-context learning extended to pictures.
Flamingo has three architectural innovations. First, a Perceiver Resampler takes a variable-size sequence of vision features (flattened patches from a frozen NFNet-F6 vision encoder) and cross-attends a small set of learnable query tokens to them, producing a fixed 64-token visual representation per image. This is the same idea that BLIP-2's Q-Former would later arrive at independently.
Second, gated cross-attention layers are inserted between the frozen LM's existing transformer blocks. Each gated layer applies cross-attention from text tokens to the current image's visual tokens, gated by a learnable tanh coefficient initialised at zero — so at the start of training, the cross-attention does nothing and the LM behaves exactly as before. The gates gradually open during training, letting the LM condition on vision without disrupting its pretrained language capability. Only the cross-attention layers, the resampler, and the embedding alignment are trained; the rest of the LM and all of the vision encoder stay frozen.
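The gating itself is one line. The sketch below treats the cross-attention sub-layer as an opaque callable, which is all that matters for seeing why zero-initialisation leaves the frozen LM untouched at step 0:

```python
import numpy as np

def gated_xattn(text_h, visual_tokens, alpha, xattn):
    """Flamingo-style gated residual: h + tanh(alpha) * xattn(h, v).

    alpha is a learnable scalar initialised to 0, so tanh(alpha) = 0 and
    this layer is the identity at the start of training; the gate opens
    as alpha moves away from zero. `xattn` stands in for the inserted
    cross-attention sub-layer (an assumption of this sketch, not a real API).
    """
    return text_h + np.tanh(alpha) * xattn(text_h, visual_tokens)
```

However badly behaved `xattn` is at initialisation, the frozen LM's outputs are bit-for-bit unchanged until training moves `alpha` off zero — which is the whole point of the design.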
Third, the training format explicitly supports multi-image, interleaved-context prompts: the input is a mixed stream of text and images, and each image is associated (via special tokens) with the positions where it should be attended. This is what enables few-shot in-context learning: at inference you prepend a few image-text exemplars and then an unlabelled image, and the model performs the demonstrated task on the new input without weight updates.
Flamingo was never open-sourced. Its successors are. OpenFlamingo (Awadalla 2023) reproduced the architecture on top of LLaMA / MPT backbones with LAION-derived data. IDEFICS (Laurençon 2023) from HuggingFace trained an 80B Flamingo-style model on the OBELICS interleaved web corpus (141M documents with interleaved images and text, scraped from CommonCrawl with careful filtering), reaching strong open-source parity with Flamingo. IDEFICS-2 (2024) and IDEFICS-3 moved to ViT-based vision encoders and denser fine-tuning data, producing the strongest open VLMs of their respective release windows.
A key lesson of Flamingo that informs almost every later VLM is the interleaved-web-data training regime. Curated image-caption pairs (COCO, CC3M) teach captioning but not in-context multimodal reasoning; interleaved web pages (OBELICS, mmc4, OmniCorpus) have the structural property that each image is embedded in a long paragraph of context, which is what teaches the model to reason jointly about pictures and surrounding text. This is also the data format used by proprietary VLMs — GPT-4V and Gemini were pre-trained on scale-equivalent interleaved corpora — and it is the single most important difference between a "captions-only" VLM and a "chat about any screenshot" VLM.
Flamingo's architectural choices are less uniformly adopted. Gated cross-attention is elegant but adds trainable parameters inside the LM, which means you cannot simply swap in a new LM without retraining; LLaVA's simpler prefix-token design is less elegant but strictly more modular. Perceiver Resampler is still used in Qwen-VL and a handful of others, but the MLP-prefix design (LLaVA) is the dominant simpler alternative. What survived from Flamingo is the paradigm: take a frozen capable LLM, give it a vision adapter, train on interleaved web data, and reap in-context multimodal generalisation.
In April 2023, Haotian Liu and colleagues released LLaVA — Large Language and Vision Assistant (Liu 2023) — a paper and model that compressed the BLIP-2 / Flamingo recipe into something so minimal it could be trained on eight A100s in one day. LLaVA's recipe is now the default template for open VLMs. Every major open model released between 2023 and 2026 — MiniGPT-4, Qwen-VL, InternVL, DeepSeek-VL, CogVLM, Yi-VL, Idefics-2/3, Molmo, Pixtral, Llama-3.2-Vision, Aria — is a LLaVA descendant.
The LLaVA recipe is three parts. First, a frozen CLIP ViT-L/14-336 vision encoder produces per-patch embeddings. Second, a single two-layer MLP projects those patch embeddings into the LLM's token space (earlier LLaVA-1.0 used a single linear layer; LLaVA-1.5 upgraded to an MLP for a meaningful bump). Third, a pretrained LLM (originally Vicuna-13B, then LLaMA-2, LLaMA-3, Qwen, etc.) consumes the projected patch tokens as a prefix followed by the user's textual question. Training is two-stage: (i) feature alignment on 558K image-text pairs from CC3M with a frozen LLM, training only the projector; (ii) instruction tuning on 158K (LLaVA-1.0) or 665K (LLaVA-1.5) GPT-4-synthesised multimodal conversations, training projector plus LLM weights (sometimes LoRA, sometimes full fine-tuning).
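The bridge really is this small. A sketch of the LLaVA-1.5-style forward path, with illustrative sizes (GELU is the paper's activation; weight names here are placeholders):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def llava_prefix(patch_emb, W1, b1, W2, b2, question_tokens):
    """Project (P, d_vit) patch embeddings through a two-layer MLP into
    the LLM's d_llm token space, then prepend them to the question's
    token embeddings — the entire vision-to-language interface of a
    LLaVA-1.5-style model.
    """
    projected = gelu(patch_emb @ W1 + b1) @ W2 + b2              # (P, d_llm)
    return np.concatenate([projected, question_tokens], axis=0)  # (P + T, d_llm)
```

Stage (i) trains only `W1, b1, W2, b2`; stage (ii) unfreezes the LLM behind them — nothing else in the pipeline changes between stages.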
The second critical LLaVA contribution is the GPT-4-generated instruction data. Because there was no natural corpus of "detailed instruction-following conversations about images", Liu et al. used the COCO image annotations (captions + detected object bounding boxes) as input to text-only GPT-4, and prompted it to generate three types of question-answer pairs per image: (a) conversation (multi-turn dialogue about the image), (b) detailed description, and (c) complex reasoning ("why is X happening?"). This recipe is now universal: every open VLM generates or borrows similar synthetic conversation data.
LLaVA-1.5 (Liu 2023) scaled the vision input to 336 × 336, the LLM to Vicuna-13B, and the data to 665K, and set the state of the art among open models on a broad benchmark suite including MME and MMBench. LLaVA-1.6 / LLaVA-NeXT (January 2024) introduced any-resolution tiling — split a high-resolution image into 336 × 336 sub-crops, process each, and concatenate — which gave a several-point boost on OCR- and chart-heavy tasks. LLaVA-OneVision (August 2024) added video and multi-image support and scaled training to a 3.2M-sample mix from 64+ source datasets.
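The tiling step can be sketched as a grid enumeration; note that LLaVA-NeXT additionally chooses the grid from a fixed set of aspect-ratio configurations and appends a downscaled global view of the whole image, which this sketch omits:

```python
def anyres_tiles(width, height, tile=336):
    """Cover an image with tile x tile sub-crops (last row/column clipped
    to the image border), returning (x0, y0, x1, y1) boxes. Each crop is
    then encoded separately and the patch tokens concatenated.
    """
    return [(x, y, min(x + tile, width), min(y + tile, height))
            for y in range( 0, height, tile)
            for x in range(0, width, tile)]
```

The cost is linear in the number of tiles, which is why high-resolution documents can multiply a VLM's visual token count several-fold.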
A structural lesson of the LLaVA era is that the vision tower and the language tower can be improved almost independently. Swap in a stronger vision encoder (SigLIP-SO400M, InternViT-6B, AIMv2) and downstream numbers improve uniformly. Swap in a stronger LLM (Qwen2.5-72B, Llama-3.1-70B, DeepSeek-V3) and likewise. The projector is trivial compared to the towers. This is exactly what makes LLaVA-style recipes so robust: each tower lives its own development life, and the VLM inherits everyone else's progress.
Practical LLaVA-style training numbers in 2026 look like this. Feature-alignment stage: 0.5–2M image-caption pairs, projector-only, a few hours on 8 × A100. Instruction-tuning stage: 1–5M multimodal conversations (mix of LLaVA-NeXT, ShareGPT4V, Chart/Doc/Video subsets), full fine-tune of projector + LLM, 1–3 days on 32–128 GPUs. DPO/RLHF stage: additional preference-data tuning with RLAIF / DPO / Iterative-DPO, small compute, large quality gain. This is within reach of most academic labs — one of the reasons open VLMs have caught up with frontier closed ones in many benchmarks, though still lagging in the messier real-world capabilities.
Where LLaVA descendants still trail is long-context multimodality, grounding, and robustness on truly novel visual inputs. GPT-4o, Gemini 2.5, and Claude 4 benefit from closed-data training corpora that open models cannot match at scale, and they have had more iterations of proprietary RLHF / safety tuning. The next section covers the frontier closed systems that define the upper end of today's VLM capability.
The strongest VLMs as of 2026 are proprietary: GPT-4V / GPT-4o / GPT-5 (OpenAI), Gemini 1.5 / 2.0 / 2.5 (Google DeepMind), and the Claude 3 / 3.5 / 4 family (Anthropic). They share a common architectural template — a large LLM (dense or mixture-of-experts) coupled to a high-capacity vision encoder via a bridge trained on large interleaved-web + document + video corpora, followed by heavy RLHF — but differ substantially in their public disclosures, their multimodal coverage, and their real-world capabilities.
GPT-4V (OpenAI, September 2023) was the first closed frontier VLM publicly accessible via API. It handled image inputs natively in the ChatGPT product, with strong OCR, chart reading, screenshot understanding, and visual reasoning. GPT-4o (May 2024) was the full native-multimodal successor: text, image, audio, and video inputs, and audio + text outputs, all in a single unified autoregressive model with significantly lower latency. GPT-4o-mini (July 2024) packaged the recipe into a cheap, fast model that became the default VLM for most production integrations. OpenAI has disclosed little about the architecture; the public capabilities suggest native interleaved-token training on a massive multimodal corpus, plus substantial RLHF.
Gemini 1.0 (December 2023) was Google DeepMind's first frontier multimodal model, trained — as disclosed in the technical report — from scratch as a native multimodal transformer rather than as a vision adapter on an LM. Gemini 1.5 Pro (February 2024) introduced the capability that has most defined Gemini: extreme long-context multimodality. A single prompt can include an hour of video, entire codebases, or hundreds of images, across a 1M-token (later 2M-token) context window — a capability that no other frontier model matched for most of 2024. Gemini 1.5 Flash was the cheap, fast variant. Gemini 2.0 (December 2024) and Gemini 2.5 added native audio output, image generation within the model, and even longer-context video understanding.
Claude 3 (Anthropic, March 2024) introduced vision to the Claude family across three sizes — Haiku, Sonnet, Opus. Claude 3.5 Sonnet (June / October 2024) became particularly influential in the agentic space with its computer-use beta, in which the model is given screenshots of a desktop and produces mouse/keyboard actions — the first production VLM agent that could drive arbitrary GUIs. Claude 4 (2025) continued this line with stronger long-form coding and multimodal reasoning. Anthropic's disclosure pattern is spare on architecture but dense on capability evaluation and safety behaviour.
Benchmark performance for frontier VLMs clusters in narrow ranges — typically within 2–4 points on MMMU, MMBench, VQAv2, DocVQA, and MathVista — so the meaningful differences lie in RLHF data, safety tuning, and real-world behaviour rather than in headline numbers. The practical question in 2026 is rarely "which is best at MMMU" (any of the three will do) but "which behaves best on my pipeline": Gemini wins for long video and large document bundles, GPT-4o wins for latency-sensitive conversational products, Claude wins for coding-with-screenshots and computer-use agents.
There are secondary closed-but-available players. Grok (xAI) has a multimodal variant. Mistral's Pixtral is technically open but competitive with closed models at its scale. Qwen2.5-VL-72B and InternVL-3-78B are open weights that approach closed-frontier performance on most academic benchmarks and often surpass them on Chinese and document tasks. The open–closed gap has narrowed to the point that, for many applications, an open 70B-parameter VLM is the sensible default.
The frontier-VLM story is far from finished. Native multimodality (including output), long-context video, embodied vision-language-action, and real-time agentic computer-use are all active frontiers that the 2024–26 release pattern has shown continue to move quickly. What is durable is the architectural template — a capable LLM with a native multimodal input path and a long-context memory — and the data flywheel that fills it with interleaved documents, conversations, videos, and screenshots.
A VLM that describes "a cat on a red blanket" without being able to point to the cat and the blanket is missing half the job. Visual grounding — mapping words to image regions — is the bridge between VLM-level semantic understanding and pixel-level computer vision, and it is the feature that turns a chatty multimodal assistant into something you can build agents and pipelines on top of.
The classical formulation has two variants. Referring expression comprehension (REC) takes an image and a phrase ("the woman in the red dress") and outputs a single bounding box. Referring expression segmentation (RES) outputs a pixel mask instead. The canonical benchmarks are RefCOCO (short, object-category-heavy expressions), RefCOCO+ (no absolute location words), and RefCOCOg (longer, descriptive expressions). The REC state of the art has moved from 70% accuracy (MAttNet, 2018) to 95%+ (GroundingDINO 2023, Grounding-GPT 2024), a level at which the benchmarks themselves are the bottleneck.
GLIP (Li 2022) and GLIPv2 (Zhang 2022) unified object detection and phrase grounding into a single contrastive framework: replace a detector's 80-class classification head with a region-to-phrase similarity computation, and train on mixed detection and grounding data (Objects365, COCO, Visual Genome, Cap4M). GLIP's result was an open-vocabulary detector that takes a natural-language phrase at inference time and returns bounding boxes. This was the first VLM-grade approach to detection.
GroundingDINO (Liu 2023) extended GLIP with a DETR-style transformer detector, producing the strongest open-vocabulary detector of its time (52.5% on COCO zero-shot, 63% with fine-tuning). GroundingDINO is now a standard building block — you ask "where are the defective welds?" and get bounding boxes out — and it pairs with SAM for Grounded-SAM: grounded detection followed by promptable segmentation, yielding open-vocabulary instance masks for any phrase.
Kosmos-2 (Peng 2023) was the first generative VLM with native grounding: the model outputs text interleaved with special location tokens that encode bounding boxes, trained on GRIT (Grounded Image-Text), a 91M-pair corpus with phrase-to-box annotations derived automatically from noun-phrase alignment plus open-vocabulary detection. Kosmos-2 can produce grounded captions ("a dog<box x1 y1 x2 y2> chasing a ball<box ...>"), answer "point-at" questions, and do REC natively.
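The encoding can be sketched as grid quantisation. The bin count and token spelling below are illustrative, not Kosmos-2's exact vocabulary:

```python
def box_to_loc_tokens(x1, y1, x2, y2, width, height, bins=32):
    """Quantise a box's top-left and bottom-right corners onto a
    bins x bins grid and emit one discrete location token per corner —
    a bounding box becomes ordinary vocabulary the LM can generate
    inline with its caption text.
    """
    def cell_token(x, y):
        col = min(int(x / width * bins), bins - 1)
        row = min(int(y / height * bins), bins - 1)
        return f"<loc_{row * bins + col}>"
    return cell_token(x1, y1), cell_token(x2, y2)
```

Because the tokens are just vocabulary entries, grounding needs no extra head: the same next-token objective that trains captioning also trains localisation.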
Florence-2 (Xiao 2023) from Microsoft is a notable alternative: a compact (0.2B / 0.7B parameters) unified prompting VLM that handles captioning, detection, grounding, segmentation, and OCR with a single text-prompted interface. Florence-2 is not frontier-quality but is the strongest "small model that does everything" in the 1B-class, and it illustrates the point that architecture can be surprisingly compact when the training data covers the right task mix.
LISA (Lai 2023) and PixelLM tackled reasoning segmentation: given a complex query ("segment the food that the child should avoid"), produce a mask. LISA's trick was to add a <SEG> token to a LLaVA-style VLM whose hidden representation is used to prompt SAM; training combines reasoning supervision from GPT-4 and segmentation masks from COCO / RefCOCOg. GLaMM (Rasheed 2024) extended this to multi-region and free-form grounded captioning.
Modern frontier VLMs have absorbed grounding to varying degrees. Gemini 1.5 and Claude 3.5 Sonnet produce bounding boxes natively for phrases. GPT-4V needs SoM or similar prompting tricks. Qwen2-VL and InternVL-2.5 have native bounding-box outputs and REC benchmarks in their evaluation suites. The trend is clear: grounding is moving from a specialised capability to a default one, because so many downstream applications (agentic computer use, document understanding, robotic manipulation, AR overlay) require pixel-level pointing rather than just text descriptions.
Documents — invoices, receipts, scientific papers, forms, contracts, screenshots, UI mockups, tables, infographics — are the most common visual input in real-world VLM deployments. Their distinguishing feature is dense text with visual layout: text meaning depends on position, tables have implicit grid structure, forms have key-value pairs, charts have data encoded in axes and marks. The document VLM lineage is specifically about getting this right, and it has gone from a separate OCR pipeline (2018) to an integrated OCR-free VLM capability (2024) in six years.
The pre-VLM approach is OCR + structure + LLM: run a text detector and recogniser (PaddleOCR, Tesseract, Google Document AI, Amazon Textract), extract a bag of words with their bounding boxes, run a structure parser (LayoutLMv3, Donut's post-processor) to recover reading order and region types (title, paragraph, table, figure), then feed the structured text to an LLM. This pipeline still dominates production — it is robust, cheap, and auditable — but it loses information (OCR errors cascade; spatial layout gets linearised; figures and charts get ignored) and is awkward to extend to new document types.
LayoutLM (Xu 2019) was the first transformer that took text + bounding-box coordinates as input, encoding each token with its 2D positional encoding. LayoutLMv2 added visual embeddings from a separate encoder; LayoutLMv3 (Huang 2022) unified text, layout, and image in a single multimodal transformer with masked-image, masked-language, and word-patch-alignment objectives. LayoutLM models set the SOTA on FUNSD, CORD, SROIE, and DocVQA for several years, and are still the go-to for production form understanding.
Donut (Kim 2022) pioneered the OCR-free paradigm: a Swin-Transformer encoder takes the raw document image and a BART-style decoder generates structured output (JSON for receipts, SQuAD-style answers for DocVQA). No OCR stage at all. Donut matched or beat OCR-plus-pipeline approaches on multiple benchmarks and became the template for every VLM that now reads receipts and forms end-to-end.
Modern frontier VLMs — Qwen2.5-VL-72B, InternVL-3, GPT-4o, Gemini 2.0, Claude 3.5 — perform OCR-free document understanding end-to-end and, on most academic benchmarks, rival or exceed the classical OCR-plus-LLM pipeline. They read receipts, parse charts, answer questions about slide decks, and summarise PDFs directly from pixels. The benchmarks that track this capability are DocVQA, InfographicVQA, ChartQA, AI2D, TextVQA, OCRBench, MathVista (for equation-heavy diagrams), and the long-document suite MMLongBench-Doc and SlideVQA. By early 2026, Qwen2.5-VL-72B and GPT-4o both score > 90 on DocVQA test, a level where benchmark saturation is a real concern.
ColPali (Faysse 2024) is a notable recent idea: retrieve document pages by late-interaction multi-vector embeddings of page images, produced by a PaliGemma backbone. Instead of extracting text and embedding it (which loses layout) or embedding whole pages (which loses granularity), ColPali embeds every patch and scores against query-token embeddings via MaxSim — the ColBERT trick applied to document images. The result is a visual-retrieval system for PDFs that surpasses OCR-plus-text-retrieval baselines on most DocQA benchmarks and has become a staple of document RAG pipelines in 2025–26.
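The MaxSim scoring rule itself is a few lines; the query-token and page-patch embeddings below stand in for ColPali's PaliGemma outputs:

```python
import numpy as np

def maxsim_score(query_tok_emb, page_patch_emb):
    """ColBERT-style late interaction as ColPali applies it to page
    images: every query-token embedding picks its best-matching page
    patch, and the per-token maxima are summed into the page's score.

    query_tok_emb: (T, D), page_patch_emb: (P, D), L2-normalised rows.
    """
    sims = query_tok_emb @ page_patch_emb.T   # (T, P) cosine similarities
    return float(sims.max(axis=1).sum())      # best patch per token, summed
```

Keeping one vector per patch (rather than one per page) is what preserves granularity: a query term about a table cell can match the patch containing that cell directly.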
Practical lessons for document-VLM deployment. First, resolution matters enormously: many documents need 1536 px or higher on the long edge to resolve small fonts, which is why LLaVA-NeXT / Qwen2-VL use any-resolution tiling and which is why Gemini and GPT-4o accept high-DPI inputs. Second, OCR-free is not free: VLMs still hallucinate on tiny or low-quality text, and production pipelines often run OCR in parallel and reconcile. Third, tables and charts are harder than plain text: a VLM that reads prose at 95% accuracy often drops to 70% on cells-within-tables or numbers-on-a-chart, which is why specialised benchmarks (ChartQA, TabFact, WikiTableQuestions) remain useful stress tests. Fourth, non-English documents are covered by frontier models and by Qwen / InternVL on the open side, but specialised OCR for minority scripts often still beats VLMs. The document-VLM story is one of rapid convergence: what required three components in 2020 now runs as one model, and the benchmarks are saturating faster than new ones can be built.
Video is the largest unsolved modality in practical VLM deployment. The data is vast (YouTube alone ingests over 500 hours per minute); the temporal dimension adds a factor of dozens to hundreds more tokens per clip; the relevant tasks (long-form summarisation, moment localisation, action understanding, sports analysis, surveillance) rarely reduce to image VQA. This section covers how VLMs have been extended to handle video and where the state of the art stands in 2026.
The simplest recipe is sparse frame sampling: select k frames from the video (uniformly or via a learned selector), embed each with a CLIP or SigLIP encoder, concatenate the frame tokens with a <FRAME_n> position encoding, feed to an LLM. Video-ChatGPT, Video-LLaVA (Lin 2023), and VideoChat / VideoChat2 are built this way, typically sampling 8–32 frames. These systems handle short clips (a few seconds to a minute) competently but degrade for long-form video, where 32 frames are simply too few.
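The baseline selection step reduces to one function; learned or query-aware selectors replace this function while keeping the surrounding embed-and-concatenate plumbing unchanged:

```python
import numpy as np

def sample_frames_uniform(num_frames, k):
    """Pick k frame indices spread evenly across the clip — the uniform
    sparse-sampling baseline in Video-LLaVA-style pipelines. Each chosen
    frame is then embedded (e.g. with a CLIP/SigLIP encoder) and its
    tokens concatenated behind a per-frame position marker.
    """
    return np.linspace(0, num_frames - 1, num=k).round().astype(int).tolist()
```

For a 90-minute video at 30 fps this keeps 8–32 of roughly 162 000 frames, which makes clear why long-form understanding degrades: almost everything is thrown away before the model sees anything.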
VideoLLaMA and VideoLLaMA 2 (Zhang 2023) added a temporal Q-Former that aggregates frame features over time, and an audio branch that processes the video's audio track in parallel, giving the model access to speech and music. PLLaVA (Xu 2024) and LLaVA-Video (Zhang 2024) pushed frame counts higher (64–128) by pooling spatially and compressing the resulting token sequences. InternVideo and InternVideo 2 (Wang 2024) scaled up the vision tower specifically for video and are among the strongest open video VLMs.
The video-VLM benchmark landscape is still crystallising. Video-MME (Fu 2024) is the most comprehensive: 900 videos spanning six domains, with three duration categories (short, medium, long) and 2700 manually annotated QA pairs. MVBench, MMBench-Video, TempCompass, and NExT-QA cover narrower slices. EgoSchema and EgoPlan focus on egocentric long-form video. TVQA and MovieQA emphasise narrative understanding. Classical video-retrieval (MSR-VTT, ActivityNet-Captions, DiDeMo) and caption (MSVD, YouCook2) datasets are still referenced but increasingly saturated.
Practical deployment of a video VLM has three recurring engineering issues. Frame sampling strategy: uniform, content-adaptive, or query-aware? Learned frame selection (Video-LLaMA's query-frame retrieval, LongVA's needle-in-haystack sampling) consistently beats uniform on long videos. Audio track: most benchmarks ignore audio, but real videos — lectures, interviews, instructional content — lose 30%+ of their information without it. Frontier VLMs (GPT-4o, Gemini 2.0) ingest audio natively; open models need a separate Whisper transcript fed as text context. Token budget: even after aggressive compression, an hour of video at 1 fps with 576 tokens/frame costs 2M tokens, which is at the edge of current context windows and dominates inference cost.
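The token-budget arithmetic is worth making explicit. A small helper, under the simplifying assumptions of a constant frame rate and a constant per-frame token count:

```python
def video_token_cost(duration_s: float, fps: float = 1.0,
                     tokens_per_frame: int = 576,
                     compression: float = 1.0) -> int:
    """Total visual tokens for a clip; compression < 1 models token pruning."""
    frames = int(duration_s * fps)
    return int(frames * tokens_per_frame * compression)

# One hour at 1 fps with 576 tokens/frame is ~2M tokens, as in the text.
print(video_token_cost(3600))                    # 2073600
# A 50% visual-token pruning pass (FastV-style) halves the budget.
print(video_token_cost(3600, compression=0.5))   # 1036800
```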
The convergence direction is unified video-plus-audio-plus-text frontier models. GPT-4o's end-to-end audio, Gemini 2.0's long-context multimodality, and the continued expansion of context windows all point toward "the video is just a long interleaved token stream" as the right abstraction. Specialised video architectures (3D convolutions, video-specific Q-Formers) have not been abandoned, but they are increasingly deprecated in favour of long-context multimodal transformers. For downstream applications — surveillance AI, sports analytics, lecture summarisation, robotic-task demonstrations — 2026 is the first year that off-the-shelf VLMs are genuinely useful on long video, and the pace of improvement is sharp.
A VLM that understands images but not space or action cannot fully serve robots, AR systems, or autonomous vehicles. The 3D and embodied VLM literature extends the visual–language interface to (i) three-dimensional scene representations (point clouds, voxels, meshes, NeRFs, 3D Gaussians) as input, (ii) spatial reasoning (distances, directions, relative position) as output, and (iii) continuous control (robot actions) as the final downstream task. This is the juncture where Part VII's computer-vision chapters meet Part IX's robotics chapters, and it is moving fast.
On the input side, 3D-LLM (Hong 2023) was the first VLM to ingest point clouds. The pipeline renders the point cloud from multiple viewpoints, encodes each rendering with a CLIP-style vision encoder, and treats the result as a multi-view input to a LLaVA-style LLM. PointLLM (Xu 2024) went more direct: encode the point cloud with a Point-BERT-style 3D encoder and feed the resulting features through an MLP projector into the LLM, mirroring the LLaVA 2D recipe in 3D. LL3DA and Chat-3D extend these ideas to dense scene understanding, multi-object reasoning, and 3D QA benchmarks (ScanQA, SQA3D, ScanRefer).
SpatialVLM (Chen 2024) and SpatialRGPT (Cheng 2024) attacked a complementary problem: VLMs are weak at 2D-image spatial reasoning ("is the cup to the left or right of the bottle?", "how far apart are these two objects?"). These systems produce large synthetic datasets of spatial reasoning questions (generated from RGB-D images with ground-truth geometry) and instruction-tune a LLaVA-style VLM on them, producing dramatic improvements on distance, direction, and relative-position tasks without needing native 3D input.
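The synthetic-data pipeline can be caricatured in a few lines: given ground-truth object positions, templated question-answer pairs fall out of the geometry. The coordinate convention (x right, y up, z depth) and the question templates below are illustrative assumptions, not SpatialVLM's exact pipeline:

```python
import math

def spatial_qa(objects: dict[str, tuple[float, float, float]]):
    """Generate left/right and distance QA pairs from ground-truth 3D
    positions (x right, y up, z depth -- a hypothetical convention)."""
    qa = []
    names = sorted(objects)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            (ax, ay, az), (bx, by, bz) = objects[a], objects[b]
            side = "left" if ax < bx else "right"
            qa.append((f"Is the {a} to the left or right of the {b}?", side))
            d = math.dist(objects[a], objects[b])
            qa.append((f"How far apart are the {a} and the {b}?",
                       f"about {d:.1f} m"))
    return qa

pairs = spatial_qa({"cup": (-0.2, 0.0, 1.0), "bottle": (0.3, 0.0, 1.0)})
```

Run over millions of RGB-D frames with depth-derived positions, this kind of templating yields the instruction-tuning sets that the paragraph above describes.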
On the action side, PaLM-E (Driess 2023) was the first frontier-scale vision-language-action (VLA) model: a 562B-parameter multi-embodiment system that ingested images and robot sensor readings, output language-mediated plans, and generated low-level action sequences. PaLM-E established two important ideas: (i) a VLM trained as a robot policy inherits language-grounded planning for free, and (ii) mixing internet-scale VQA/caption data with robot trajectory data during training produces better robots than either alone.
RT-2 (Google DeepMind, 2023) built on PaLM-E by fine-tuning a PaLI-X or PaLM-E backbone on robot trajectories represented as discretised action tokens — each action becomes a vocabulary token, the policy becomes an autoregressive language-modelling problem. RT-2 showed transfer of web-scale semantic knowledge to novel instructions and objects ("pick up the dinosaur") that the robot had never seen in demonstration data. OpenVLA (Kim 2024) replicated the recipe in the open on top of LLaMA 7B and the Open X-Embodiment dataset (1M+ robot trajectories from 70+ robots), producing the strongest open VLA through most of 2024. π0 (Physical Intelligence, 2024) introduced a flow-matching action head for smoother continuous control and became the production reference for generalist manipulation.
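The action-token idea is simple enough to sketch: discretise each continuous action dimension into uniform bins and map each bin to a reserved vocabulary token. The bin count, range, and `<ACT_b>` naming below are illustrative, not RT-2's exact scheme:

```python
def action_to_tokens(action, low=-1.0, high=1.0, bins=256):
    """Discretise each action dimension into `bins` uniform buckets and
    emit one reserved-vocabulary token per dimension."""
    tokens = []
    for x in action:
        x = min(max(x, low), high)                      # clip to range
        b = min(int((x - low) / (high - low) * bins), bins - 1)
        tokens.append(f"<ACT_{b}>")
    return tokens

def tokens_to_action(tokens, low=-1.0, high=1.0, bins=256):
    """Invert the mapping: return the bin centre for each <ACT_b> token."""
    ids = [int(t[5:-1]) for t in tokens]
    return [low + (b + 0.5) * (high - low) / bins for b in ids]

toks = action_to_tokens([0.0, -1.0, 0.5])   # ['<ACT_128>', '<ACT_0>', '<ACT_192>']
```

Once actions are tokens, next-action prediction is ordinary autoregressive decoding, which is exactly why web-scale language pretraining transfers.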
Where 3D and embodied VLMs remain brittle is in generalisation and metric accuracy. A VLM that describes a pile of blocks in English still struggles to estimate "how many centimetres between the top two blocks" with millimetre-level accuracy, which is what robot manipulation requires. Generalisation across embodiments (arms of different sizes, gripper types, camera angles) is the central challenge of open-ended robot VLMs, and is the reason the Open X-Embodiment dataset and its descendants have been such a focus. Research on 3D foundation models (e.g. DUSt3R, MAST3R, 3D Gaussian VLMs) is producing the encoders that embodied VLMs will eventually use as their spatial front end.
For a developer in 2026, the practical recipes look like this. For spatial 2D reasoning (left / right / near / far), fine-tune a LLaVA-style VLM on SpatialVLM-style synthetic data, or use Qwen2.5-VL / GPT-4o directly for modest accuracy. For 3D-scene understanding from point clouds, use 3D-LLM or PointLLM as reference architectures. For robot manipulation, use OpenVLA or π0 for base policies and fine-tune on task-specific demonstrations. For computer-use / web agents, use Claude 3.5 Sonnet or GPT-4o with screenshot input plus a structured action vocabulary. The embodied-VLM landscape is one of the most active frontiers, and the pace of public capability demos through 2025 showed no sign of slowing.
How do you measure a VLM that might be asked anything about any image? VQAv2's single-answer matching served the 2017–2021 era; it does not capture the open-ended capabilities of a 2026 frontier model. The VLM-benchmark landscape has expanded into a patchwork of comprehensive capability suites, hallucination tests, robustness probes, and long-form judgment benchmarks — each useful, none sufficient alone.
The comprehensive capability suites are the headline numbers reported in every new VLM paper. MMMU (Yue 2023) is the most cited: 11 500 multimodal questions from university exams across 30 subjects (art, business, science, medicine, engineering), split into ~10k test and 900 validation. MMMU's intended message was "can your VLM pass a freshman exam?" and its difficulty made it a useful frontier discriminator for a while. MMMU-Pro (Yue 2024) strengthened the filter by removing text-only-solvable questions, tightening the multimodal requirement.
MMBench (Liu 2023) evaluates 20 fine-grained abilities (object localisation, OCR, attribute comparison, social reasoning, etc.) with circular shuffled multiple-choice answers to mitigate positional bias. MME (Fu 2023) gives 14 perception and cognition tasks with yes/no answers, totalling 2800 questions. MM-Vet (Yu 2023) evaluates six integrated capabilities (recognition, OCR, knowledge, spatial reasoning, maths, language generation) with a GPT-4-judged open-ended response. SEED-Bench / SEED-Bench-2 (Li 2023) batches together 24 000 multiple-choice questions from diverse categories including 3D and video.
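MMBench's circular protocol can be sketched directly: each question is asked once per rotation of its options, and the model is credited only if it is right under every rotation, which filters out position-biased guessing. The rendering below is a simplification of the actual protocol:

```python
def rotations(options):
    """All circular shifts of the option list."""
    return [options[i:] + options[:i] for i in range(len(options))]

def circular_correct(model, question, options, answer):
    """Credit the model only if it picks `answer` under every rotation."""
    labels = "ABCD"
    for opts in rotations(options):
        pick = model(question, list(zip(labels, opts)))   # pick is a label
        if opts[labels.index(pick)] != answer:
            return False
    return True

# A model that always answers "A" gets no credit under circular evaluation.
always_a = lambda question, labelled_options: "A"
ok = circular_correct(always_a, "What colour is the sky?",
                      ["blue", "red", "green", "black"], "blue")
# ok is False: the positional bias is exposed by the second rotation.
```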
Specialised reasoning benchmarks probe specific weaknesses. MathVista (Lu 2023) stresses mathematical visual reasoning (geometry, algebra from diagrams, statistical charts). ChartQA, PlotQA, and CharXiv drill into chart understanding. AI2D and TQA test diagram-and-textbook science. VCR tests commonsense and causal reasoning with images. A-OKVQA tests outside-knowledge VQA. ScienceQA tests multi-modal chain-of-thought reasoning on textbook-style problems.
Open-ended generation is harder to evaluate. LLaVA-Bench-in-the-Wild (Liu 2023) uses GPT-4 as a judge to score responses to 60 real-world prompts; it was the benchmark that most visibly tracked LLaVA improvements in 2023. MM-Vet similarly uses GPT-4 judging. WildVision-Bench / WildVision-Arena (Lu 2024) is the image-multimodal analogue of LMSYS Chatbot Arena: anonymous pairwise human voting on real-world prompts produces an Elo ranking across all major VLMs. Arena-style rankings have become the most trusted public capability measure, because they are hard to game and reflect genuine user preference.
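Arena rankings rest on the standard Elo update over pairwise votes. A minimal version (the K-factor of 32 is a conventional chess-style choice, not WildVision's exact parameterisation, and real arenas batch votes and bootstrap confidence intervals):

```python
def elo_update(r_a, r_b, a_wins, k=32.0):
    """One pairwise vote: return updated ratings for models A and B."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    r_a += k * (score_a - expected_a)
    r_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a, r_b

# Equal-rated models: a single win is worth exactly k/2 points.
print(elo_update(1000.0, 1000.0, True))   # (1016.0, 984.0)
```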
Frameworks for running benchmarks have consolidated around VLMEvalKit (OpenCompass) and LMMs-Eval (LMMs-Lab), which together provide unified, reproducible evaluation against dozens of VLMs on 50+ benchmarks. These toolkits are now the default way to evaluate a new VLM: integrate a model once, get numbers on every benchmark in a consistent protocol.
Three recurrent lessons stand out. First, no single benchmark is sufficient: a model can top MMMU while hallucinating on POPE, ace DocVQA while failing MathVista, or dominate OCR-heavy suites while being weak on commonsense. Reported capability should always span several orthogonal axes. Second, multiple-choice benchmarks saturate: MME's yes/no and MMBench's A-B-C-D formats are now scored 80–95% by every frontier VLM, which makes them increasingly uninformative. Open-ended benchmarks and arena-style rankings are more durable. Third, real-world capability is under-measured: a model that aces all academic benchmarks can still fail on "read this prescription label" in a way that matters. In 2026, the most honest capability assessment combines aggregate benchmark scores, arena Elo, and task-specific in-situ evaluation against your own use case.
A modern VLM deployment touches many of the same engineering concerns as an LLM deployment — quantisation, distillation, KV-cache management, batching, serving — plus vision-specific ones: how to pre-compute and cache image embeddings, how to handle variable-resolution inputs efficiently, how to fit high-resolution multi-image prompts into tight context budgets. This section surveys the techniques and toolkits that make VLM inference tractable at production scale.
The first lever is quantisation. Standard LLM quantisation (AWQ, GPTQ, GPTQv2, SmoothQuant) applies to the language tower; the vision tower is smaller and less sensitive. VLMs in 4-bit (W4A16 AWQ) routinely preserve > 98% of original benchmark performance, which makes 70B-parameter VLMs runnable on a single 48GB GPU. Frameworks — llama.cpp multimodal, vLLM-multimodal, SGLang, MLC-LLM — support quantised VLM serving as a first-class feature. Eight-bit (INT8 / FP8) quantisation is the default for H100-class deployment; four-bit is the default for consumer GPUs and edge.
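The core of weight-only quantisation is a per-group scale. A toy symmetric 4-bit version in pure Python — AWQ and GPTQ add activation-aware scaling and error compensation on top of this skeleton, so treat it as the idea, not the method:

```python
def quantize_int4(weights, group=4):
    """Symmetric per-group absmax quantisation to 4-bit ints in [-8, 7]."""
    out = []
    for i in range(0, len(weights), group):
        g = weights[i:i + group]
        scale = max(abs(w) for w in g) / 7.0 or 1.0   # avoid zero scale
        q = [max(-8, min(7, round(w / scale))) for w in g]
        out.append((scale, q))
    return out

def dequantize_int4(groups):
    """Reconstruct approximate weights from (scale, int4-list) groups."""
    return [v * scale for scale, qs in groups for v in qs]

qg = quantize_int4([0.7, -0.35, 0.1, 0.0])
```

Each group stores one float scale plus 4-bit codes, which is where the ~4x memory reduction over FP16 comes from.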
The second lever is vision-token compression. A 336 × 336 CLIP image produces 576 tokens; at LLaVA-NeXT's 4-tile any-resolution, a single image can cost 2304 tokens; a video can cost tens of thousands. Many token-compression approaches have emerged: Q-Former-style compression to a small fixed set (BLIP-2: 32 tokens; Flamingo: 64 tokens); adaptive patch selection (PruMerge, LLaVA-PruMerge); resolution-adaptive encoding (Dynamic-ViT, Matryoshka VLM); KV-cache compression (FastV, which drops 50% of visual KV-cache entries after layer 2 with minimal accuracy loss). For any production VLM, token compression is where the latency and cost wins actually come from.
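The pruning idea behind FastV can be sketched independently of any framework: score each visual token, keep the top fraction, and drop the rest from subsequent layers. Here the per-token `scores` argument stands in for the attention each visual token receives from text tokens, which is an assumption about where the scores come from:

```python
def prune_visual_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring fraction of visual tokens, preserving
    their original order (scores stand in for attention received)."""
    k = max(1, int(len(tokens) * keep_ratio))
    keep = sorted(sorted(range(len(tokens)), key=lambda i: -scores[i])[:k])
    return [tokens[i] for i in keep]

kept = prune_visual_tokens(["t0", "t1", "t2", "t3"],
                           [0.1, 0.9, 0.05, 0.4])   # keeps t1 and t3
```

Because the pruning happens after an early layer, the encoder cost is paid once but every subsequent layer processes half the visual KV-cache.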
The third lever is small VLMs. A year ago (2024), a 7B VLM was the smallest credible chat model; by 2026, models in the 1–4B range (MobileVLM, TinyLLaVA, nanoVLM, PaliGemma-3B, Phi-3-Vision at 4.2B, MiniCPM-V, InternVL-1B) can run on a laptop CPU or smartphone NPU and handle common tasks (captioning, OCR, simple VQA, screenshot QA) acceptably. The distillation pipeline is standard: generate a large synthetic SFT / DPO dataset from a frontier teacher (GPT-4o or Claude 3.5), train the small VLM on it, and iterate.
The fourth lever is serving infrastructure. vLLM-multimodal supports continuous batching with image inputs, handling variable-length image token sequences without wasteful padding. SGLang optimises complex multimodal prompt templates with prefix caching and structured generation. TensorRT-LLM and NVIDIA Triton have first-class VLM support for large-scale production. For on-device deployment, CoreML and ONNX Runtime ship quantised VLMs to iPhones, Macs, and Windows laptops with NPUs; MLC-LLM compiles VLMs to WebGPU for in-browser inference. The engineering complexity of VLM serving is genuinely higher than text-only LLM serving, but the 2024–26 tooling wave has made production deployment routine.
The fifth lever is caching and retrieval. If your application has a fixed set of images (a product catalogue, a document library), precomputing and caching image embeddings turns image queries into lookup-plus-generate rather than re-encode-plus-generate. The same idea at finer granularity — visual KV-cache reuse — lets the same image be referenced many times in a conversation without repeated vision-tower passes, which is what makes long agentic conversations affordable.
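A content-addressed cache is the standard pattern for the fixed-image case: key the store on a hash of the image bytes so repeated references skip the vision tower entirely. A sketch, with `encode_image` standing in for the vision-tower forward pass:

```python
import hashlib

class ImageEmbeddingCache:
    """Cache vision-tower outputs by content hash so repeated references
    to the same image skip the encoder entirely."""

    def __init__(self, encode_image):
        self.encode_image = encode_image
        self.store = {}
        self.hits = 0

    def get(self, image_bytes: bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self.store:
            self.hits += 1
        else:
            self.store[key] = self.encode_image(image_bytes)
        return self.store[key]

calls = []
cache = ImageEmbeddingCache(lambda b: calls.append(b) or len(b))
cache.get(b"catalogue-image-001")
cache.get(b"catalogue-image-001")   # second lookup hits the cache
```

Visual KV-cache reuse is the same idea one level deeper: the cached object is the image's per-layer key/value tensors rather than its embedding.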
The practical calculus in 2026 looks like this. For high-throughput production APIs with aggressive latency targets, a 7B–13B open VLM (Qwen2.5-VL-7B, InternVL-2.5-8B, LLaVA-OneVision-7B) quantised to 4 bits and served by vLLM or TRT-LLM is the sweet spot — 20–50 ms per image prefill, 50–100 tok/s generation, single H100 serving hundreds of QPS. For quality-critical applications where a few seconds of latency are acceptable, frontier closed APIs (GPT-4o, Gemini 2.0, Claude 4) are still the best choice, at roughly 10× the cost. For edge and mobile, MobileVLM / Phi-3-Vision / MiniCPM-V run locally. Matching the model to the constraints — and pushing token compression and quantisation as far as accuracy allows — is the standard deployment problem.
VLMs began as a research curiosity and ended up — within five years — as the default cross-modal primitive of the ML stack. This closing section surveys how VLMs integrate with the rest of modern machine learning: their training pipelines, their role in retrieval and agents, their use as infrastructure for other vision tasks, and where they fit in the broader Part VII and Part VIII arc of this compendium.
Training a modern VLM is a three- or four-stage process. Stage 1 — pretraining: the vision tower (CLIP / SigLIP / InternViT) and the language tower (Llama / Qwen / Mistral / Gemma) are typically trained separately at their respective web scales. Stage 2 — alignment: the projector (MLP, Q-Former, cross-attention) is fit on a modest (0.5–10M) image-caption or interleaved-image-text set with both towers frozen. Stage 3 — multimodal SFT / instruction tuning: the model is trained end-to-end (or with LoRA) on a curated instruction-data mix that covers captioning, VQA, OCR, chart, document, grounding, video, and dialogue. Stage 4 — RLHF / DPO: preference tuning on human or AI-generated pairwise comparisons sharpens helpfulness, reduces hallucinations, and improves instruction-following.
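The freeze schedule across the stages can be written down directly. The table below reflects the typical open recipe described above, not any specific model's configuration (some recipes also unfreeze the vision tower in SFT):

```python
# Which parameter groups receive gradients at each stage of the recipe.
STAGES = {
    "pretraining": {"vision_tower", "language_tower"},  # trained separately
    "alignment":   {"projector"},                       # both towers frozen
    "sft":         {"projector", "language_tower"},     # often LoRA on the LM
    "rlhf_dpo":    {"projector", "language_tower"},     # preference tuning
}

def trainable(stage: str, param_group: str) -> bool:
    """True if `param_group` is updated during `stage`."""
    return param_group in STAGES[stage]

assert trainable("alignment", "projector")
assert not trainable("alignment", "vision_tower")
```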
The VLM as retriever: CLIP-style embedding models are the default backbone of visual search and multimodal retrieval. Product-search at scale, image-deduplication pipelines, stock-photo retrieval, academic-paper figure search, dataset curation — all now use CLIP (or SigLIP, or a domain-specialised CLIP) as the embedding function, with FAISS or ScaNN as the vector index. The RAG chapter of Part VI covered retrieval for text; the VLM extension uses ColPali or dense CLIP-over-regions for document and image retrieval. Multimodal RAG — retrieve relevant images given a text query, then pass them plus context to a VLM — is now standard for any product that answers questions over a document corpus with figures or screenshots.
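Retrieval over a precomputed embedding table reduces to a dot product of normalised vectors. A dependency-free sketch — at catalogue scale a FAISS or ScaNN index replaces the linear scan, and a real CLIP/SigLIP encoder replaces the toy 2-dimensional embeddings used here for illustration:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def top_k(query_vec, table, k=2):
    """Cosine similarity = dot product of L2-normalised vectors.
    `table` maps item id -> embedding; returns the k best (id, score)."""
    q = normalize(query_vec)
    scored = [(item, sum(a * b for a, b in zip(q, normalize(v))))
              for item, v in table.items()]
    return sorted(scored, key=lambda s: -s[1])[:k]

catalogue = {"red-shoe": [1.0, 0.1], "blue-mug": [0.0, 1.0],
             "red-boot": [0.9, 0.2]}
hits = top_k([1.0, 0.0], catalogue, k=2)   # red items rank first
```

Because CLIP puts text and images in the same space, the same `top_k` serves text-to-image, image-to-image, and image-to-text retrieval unchanged.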
The VLM as labeller: one of the most impactful applications of frontier VLMs has been dataset labelling. Cloud vendors and vision-AI startups use GPT-4o or Claude 3.5 to caption millions of images, classify fine-grained categories, annotate bounding boxes (via SoM or native grounding), and produce training data for smaller specialised models. This synthetic-annotation-at-scale pattern has effectively replaced a large fraction of human annotation for tasks where a good VLM gets 95% right. The flip side — models trained on VLM-annotated data can inherit VLM biases and errors — is a known concern that drives ongoing research on filtering, human-in-the-loop correction, and model-generated-data detection.
In terms of Part VII and the broader compendium: the previous chapters of Part VII — Image Representation & Classical Vision, Modern Image Classification & Architectures, Object Detection & Instance Segmentation, Video Understanding, 3D Vision & Spatial Understanding — provided the visual representations (ViT, CNN, 3D encoders) and the classical vision tasks (classification, detection, segmentation, tracking, depth, 3D reconstruction) that VLMs either use as components or subsume. The next chapters of Part VII — audio classification and music generation — will show a parallel story in audio, culminating in the end-to-end multimodal models (GPT-4o, Gemini 2.0) that combine all of it. Part VI's Transformer Architecture, Pretraining Paradigms, Instruction Tuning & Alignment, and Retrieval-Augmented Generation supply the text-side machinery that every modern VLM inherits.
Looking forward: the Part VIII chapters on agents and embodied AI use VLMs as their perception backbone; the Part IX chapter on autonomous vehicles uses VLMs for open-ended scene understanding; the Part X application chapters (healthcare, law, education) are overwhelmingly driven by VLMs operating on documents, medical images, lecture videos, and classroom artefacts. The practical claim is strong: if you are doing any applied AI work in 2026 that involves images, you should probably start with a VLM. The theoretical claim is stronger: VLMs are the cleanest empirical demonstration that scaling a single autoregressive objective over interleaved multimodal tokens produces intelligence that is not tied to any one modality — a lesson that shapes the broader multimodal-foundation-model programme across the entire field.
The engineering reality is that a VLM is never deployed alone. It sits inside retrieval pipelines, agent harnesses, evaluation harnesses, monitoring and guardrails, cost-aware caching, and user-interface layers that together make the model useful. The rest of this compendium covers each of those layers — and the VLM chapter is where the pixel half of the story meets the language half, and where modern AI's multimodal future is decided.
Vision-language research moves quickly; the canon below is a mix of foundational papers, the landmark system papers that define each era, and the toolkits and benchmarks that any practitioner will actually touch. It is organised the same way as the chapter — foundations, architectures, specialised branches, evaluation, deployment, datasets.
Foundations of Vision-Language Models — Du et al. (2024)
Comprehensive survey covering pretraining objectives, architectures, and benchmarks up to early 2024. A good orientation map for the entire field.
Multimodal Foundation Models: From Specialists to General-Purpose Assistants — Li et al. (2023)
The Microsoft CVPR tutorial paper. Organises VLMs, text-to-image, multimodal LLMs, and multimodal agents in one framework.
A Survey on Multimodal Large Language Models — Yin et al. (2023)
Focused on the LLM-coupled VLM era (Flamingo, BLIP-2, LLaVA); regularly updated with new releases. The go-to reference for anyone starting out.
Deep Learning for Vision-Language Pretraining — Khan et al. (2022)
Covers the fusion-transformer era (ViLBERT through UNITER) in depth; still the best reference for that generation of models.
Show and Tell: A Neural Image Caption Generator — Vinyals et al. (2015)
The CNN-LSTM paper that started modern image captioning. Short and easy to read.
Show, Attend and Tell — Xu et al. (2015)
Introduced soft attention over spatial feature maps. The conceptual ancestor of every modern VLM cross-attention layer.
Bottom-Up and Top-Down Attention for Image Captioning and VQA — Anderson et al. (2017)
Introduced Faster R-CNN region features as the default visual input. Dominated VL research for four years.
VQA: Visual Question Answering — Antol et al. (2015)
The original VQA dataset and task formulation. VQAv2 (Goyal 2017) is the balanced successor.
ViLBERT — Lu et al. (2019)
Two-stream transformer with cross-attention. The first VL-BERT and the template for the two-stream family.
LXMERT — Tan & Bansal (2019)
Two-stream with object-relationship and cross-modality encoders. The strongest VQA-focused pretrained model of its era.
VisualBERT — Li et al. (2019)
Single-stream BERT with visual tokens as prefix. Minimalist and influential.
UNITER — Chen et al. (2020)
Large single-stream with four carefully ablated pretraining objectives. The clearest taxonomy of what each objective contributes.
OSCAR — Li et al. (2020)
The object-tag-as-anchor trick; consistent improvements on VQA and retrieval.
ViLT — Kim et al. (2021)
Dropped Faster R-CNN features for raw patch embeddings; a bridge to the ViT era of VLMs.
Learning Transferable Visual Models from Natural Language Supervision (CLIP) — Radford et al. (2021)
The paper that changed vision. Essential read; the ablations are unusually thorough.
Scaling Up Visual and Vision-Language Representation Learning (ALIGN) — Jia et al. (2021)
Google's contemporaneous 1.8B-pair dual encoder. Showed that scale absorbs noise.
OpenCLIP: Reproducible Scaling Laws for Contrastive Language-Image Learning — Cherti et al. (2022)
The open-source CLIP reproduction that became most researchers' default CLIP weights.
Sigmoid Loss for Language Image Pre-Training (SigLIP) — Zhai et al. (2023)
Replaced softmax with pairwise sigmoid. Smaller optimal batch sizes, faster convergence; now the default vision tower for most open VLMs.
EVA-CLIP: Improved Training Techniques for CLIP at Scale — Sun et al. (2023)
Initialised image tower from EVA's MIM pretraining. The strongest open CLIP in 2023.
DataComp: In Search of the Next Generation of Multimodal Datasets — Gadre et al. (2023)
Recasts CLIP training as a dataset-filtering competition. Introduces DFN, DataComp-S, and the CommonPool.
MetaCLIP — Xu et al. (2024)
Metadata-matched filtering with concept balancing. Strong open CLIP weights.
BLIP — Li et al. (2022)
Multi-task image-text transformer with bootstrapped CapFilt data cleaning. The start of the BLIP line.
BLIP-2 — Li et al. (2023)
Q-Former plus frozen vision and language towers. The cleanest statement of the modular VLM paradigm.
Flamingo: A Visual Language Model for Few-Shot Learning — Alayrac et al. (2022)
Frozen LM + Perceiver Resampler + gated cross-attention. Demonstrated in-context multimodal learning.
CoCa: Contrastive Captioners Are Image-Text Foundation Models — Yu et al. (2022)
Combines contrastive and captioning objectives in a single model. A bridge between CLIP and generative VLMs.
BEiT-3: A General-Purpose Multimodal Foundation Model — Wang et al. (2022)
Multiway transformer with shared + modality-specific experts. Strong results across vision-only, language-only, and multimodal tasks.
PaLI: A Jointly-Scaled Multilingual Language-Image Model — Chen et al. (2022)
Google's encoder-decoder VLM; the predecessor line to Gemini's vision stack.
InstructBLIP — Dai et al. (2023)
Instruction-tunes BLIP-2 on 26 VL tasks; the first strong zero-shot instruction-following open VLM.
Visual Instruction Tuning (LLaVA) — Liu et al. (2023)
The paper that opened the floodgates. CLIP + MLP + Vicuna + 158K GPT-4 conversations; the recipe every subsequent open VLM follows.
LLaVA-1.5 — Liu et al. (2023)
Stronger MLP projector, larger Vicuna, more data. The first serious practical open VLM.
MiniGPT-4 — Zhu et al. (2023)
LLaVA-contemporary with a BLIP-2 Q-Former and Vicuna. A second popular minimalist recipe.
Qwen-VL — Bai et al. (2023)
Alibaba's frontier open VLM. Dynamic resolution, strong OCR/document performance, bilingual (Chinese/English).
Qwen2-VL — Wang et al. (2024)
Second-generation with naive dynamic resolution ViT and multimodal RoPE. As of 2024 release, the strongest open VLM in its size class.
InternVL: Scaling up Vision Foundation Models — Chen et al. (2023)
InternViT-6B plus a large language head. Shanghai AI Lab's flagship open VLM series.
IDEFICS and OBELICS — Laurençon et al. (2023–2024)
Flamingo-style open model trained on the OBELICS interleaved-web corpus. IDEFICS-2/3 moved to a simpler fully autoregressive architecture with stronger data.
Molmo — Deitke et al. (2024)
Open VLM with detailed data disclosures and strong human-preference scores. PixMo instruction data is particularly well-curated.
CogVLM — Wang et al. (2023)
Expert modules for visual-specific processing inside the LM; strong OCR and document performance.
GLIP: Grounded Language-Image Pre-training — Li et al. (2022)
Unified detection and phrase grounding. The first VLM-grade open-vocabulary detector.
GroundingDINO — Liu et al. (2023)
DETR-style open-vocabulary detector with strong phrase-grounding. A default building block for modern vision pipelines.
Kosmos-2: Grounding Multimodal Large Language Models to the World — Peng et al. (2023)
Native grounded captioning VLM using special location tokens. The GRIT dataset remains a reference for grounding data.
Set-of-Mark Prompting — Yang et al. (2023)
Inference-time trick that gives any VLM reliable grounding via numbered-mask overlays. Quick to adopt.
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks — Xiao et al. (2023)
Compact (0.2B / 0.7B) unified-prompting VLM. Surprisingly capable for its size.
LISA: Reasoning Segmentation via Large Language Model — Lai et al. (2023)
Couples LLaVA with SAM via a <SEG> token; produces masks from free-form reasoning queries.
LayoutLMv3 — Huang et al. (2022)
Unified text, layout, and image pretraining for documents. The production standard for forms and structured-document tasks.
Donut: OCR-free Document Understanding Transformer — Kim et al. (2022)
The OCR-free paradigm applied to receipts and forms. Direct ancestor of Qwen-VL's document pipeline.
Nougat: Neural Optical Understanding for Academic Documents — Blecher et al. (2023)
OCR-free parsing of scientific PDFs to Markdown+LaTeX. Indispensable for research-paper ingestion pipelines.
ColPali: Efficient Document Retrieval with Vision Language Models — Faysse et al. (2024)
Late-interaction multi-vector retrieval over document-page images. A new default for PDF RAG.
Video-LLaVA — Lin et al. (2023)
Unified image-and-video LLaVA variant with aligned pre-encoders. A clean baseline for video VLMs.
InternVideo2 — Wang et al. (2024)
Large-scale video-specific vision tower. Strongest open video understanding as of its release.
3D-LLM: Injecting the 3D World into Large Language Models — Hong et al. (2023)
First VLM that ingests point clouds via multi-view rendering. The template for spatial VLMs.
PaLM-E: An Embodied Multimodal Language Model — Driess et al. (2023)
Pioneered vision-language-action models. Set the template for RT-2 and OpenVLA.
RT-2 — Brohan et al. (2023)
Fine-tunes a VLM on robot trajectories as action tokens. Demonstrated web-scale-knowledge transfer to manipulation.
OpenVLA — Kim et al. (2024)
Open reproduction of the RT-2 recipe on Llama 7B and Open X-Embodiment. The reference open VLA.
GPT-4 Technical Report — OpenAI (2023)
Includes the initial GPT-4V capabilities and evaluation. Architecture and data are undisclosed but capabilities are documented.
Gemini: A Family of Highly Capable Multimodal Models — Google DeepMind (2023)
The initial Gemini 1.0 technical report. Native multimodal training from pretraining onwards.
Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context — Google DeepMind (2024)
Long-context multimodality with MoE architecture. The paper to read on long-video VLMs.
Claude 3 Model Card — Anthropic (2024)
The capability, eval, and safety disclosures for Claude 3 Haiku, Sonnet, and Opus.
MMMU — Yue et al. (2023)
University-exam-style multimodal benchmark across 30 subjects. Headline frontier-VLM capability measure.
MMBench — Liu et al. (2023)
20 ability dimensions with circular multi-choice to reduce positional bias. One of the two most-cited comprehensive VLM suites.
MME — Fu et al. (2023)
Perception + cognition yes/no benchmark. Fast and cheap to run; now largely saturated at the frontier.
MM-Vet — Yu et al. (2023)
Open-ended integrated-capability benchmark with GPT-4 judging. Better signal than multiple-choice on real-world capability.
SEED-Bench & SEED-Bench-2 — Li et al. (2023)
Comprehensive multiple-choice suites covering 12–24 dimensions including video and 3D.
POPE: Polling-based Object Probing Evaluation for Object Hallucination — Li et al. (2023)
Yes/no probing of object-existence hallucination. Ubiquitous in hallucination-focused papers.
MathVista — Lu et al. (2023)
Mathematical visual reasoning benchmark. One of the more revealing frontier-VLM challenges.
ChartQA — Masry et al. (2022)
Chart QA with visual reasoning over bar, line, and pie charts. A core document-VLM benchmark.
DocVQA — Mathew et al. (2021)
The reference document-VQA dataset. Now largely saturated by frontier VLMs and strong open ones.
Video-MME — Fu et al. (2024)
The most comprehensive video-VLM benchmark to date. 900 videos, short/medium/long, 2700 QA pairs.
VLMEvalKit & LMMs-Eval — OpenCompass; LMMs-Lab (2023–2024)
The two standard evaluation harnesses. One or both is how every recent VLM paper produces its numbers.
WildVision-Bench / WildVision-Arena — Lu et al. (2024)
Human-preference arena for VLMs. Elo-based rankings with hard-to-game real-world prompts.
MSCOCO Captions — Chen et al. (2015)
Five captions per image, 120K images. The canonical captioning benchmark and still a standard evaluation set.
Visual Genome — Krishna et al. (2016)
Objects, attributes, relationships, region descriptions. The supervisory source for the BUTD / fusion-transformer era.
Conceptual Captions (CC3M / CC12M) — Sharma et al. (2018); Changpinyo et al. (2021)
The first at-scale web alt-text datasets. Still used in LLaVA's feature-alignment stage.
LAION-5B — Schuhmann et al. (2022)
The 5.85B-pair open image-text dataset. Trained Stable Diffusion and most open CLIPs. See the 2024 Re-LAION re-release.
OBELICS — Laurençon et al. (2023)
141M interleaved-document corpus. The reference open data for Flamingo-style in-context-capable VLMs.
Hugging Face Transformers & TRL (multimodal support)
The default way to load and serve open VLMs. All major models (LLaVA, Qwen-VL, InternVL, Idefics, Molmo, Phi-3-Vision) are supported.
vLLM-multimodal
Continuous-batching serving with native VLM support. The first-choice inference stack for production VLM APIs.
SGLang
Structured-generation-aware multimodal serving. Excellent prefix-cache behaviour for agentic VLM workflows.
llama.cpp (multimodal)
CPU and GPU VLM inference with 4-bit quantisation. The go-to edge / laptop deployment path.
Open X-Embodiment — Padalkar et al. (2023)
1M+ robot trajectories from 70+ robots. The open reference data for embodied VLAs.