Multimodal Foundation Models: seeing, hearing, and reading all at once.

Language models learned to read and write. Vision models learned to see. The next leap is unifying these capabilities in a single architecture that can perceive images, hear audio, and reason across all modalities simultaneously — not by chaining separate models but by training one that genuinely integrates sight and language from the start. Flamingo, GPT-4V, and Gemini each made different bets on how to fuse modalities; all three turned out to be partially right.

Prerequisites

This chapter assumes solid footing with the transformer architecture (Part VI Ch 04), large language model pretraining and instruction tuning (Part VI Ch 05–07), and the basics of vision-language contrastive learning via CLIP (Part VII Ch 06). Familiarity with latent diffusion models (Part X Ch 04) helps with the generative multimodal sections, but the discriminative/understanding half of the chapter can be read independently. Some equations reference cross-attention (Part V Ch 05).

The Multimodal Fusion Problem

Section 01 · Early vs. late fusion · the modality gap · tokenization strategies

A purely unimodal language model sees only token sequences. A purely unimodal vision model sees only pixel grids. A multimodal model must bridge both: take heterogeneous signals — images, text, audio, video — and produce a unified internal representation that supports reasoning over all of them simultaneously. The seemingly straightforward question of when to merge these streams has spawned an entire taxonomy of architectures.

Early fusion converts every modality to tokens before the first transformer layer, so all modalities interact from the very first layer of computation. This requires finding a shared token vocabulary — either a discrete codebook (treating image patches like text tokens) or a continuous patch embedding projected into the LLM's embedding space. The advantage is maximum cross-modal interaction; the disadvantage is that vision and language encoders are not easily initialized from the rich pretrained checkpoints that exist for each modality separately.

Late fusion runs separate unimodal encoders to produce modality-specific embeddings, then merges them at a high level — typically by projection into a shared space and concatenation before the language model decoder. This allows the language backbone to remain frozen (leveraging all its pretraining) while only a lightweight adapter is trained. The disadvantage is shallower integration: vision features interact with text only in the language model layers, not in a dedicated vision encoder fine-tuned for multimodal understanding.

Cross-attention fusion, used most prominently by Flamingo, runs the vision encoder in parallel with the language model and injects visual information via cross-attention layers inserted between self-attention layers. This is a middle ground: the language backbone is largely preserved while visual context can be injected at multiple depths.

The Modality Gap

Even when vision and language encoders are jointly trained with contrastive objectives (as in CLIP), their representations lie in different regions of the embedding space — a phenomenon called the modality gap. Simple linear projection between these spaces loses information. Bridging the gap well requires either extensive joint training, nonlinear adapters with many parameters, or dedicated cross-modal attention mechanisms that can query visual features on demand. Much of the architectural innovation in multimodal models is ultimately about closing this gap.

A key design question is how to tokenize images. The two dominant approaches are: (1) patch embeddings — divide an image into \(N\) fixed-size patches, project each to a \(d\)-dimensional vector, and treat the resulting \(N\)-length sequence like text tokens; or (2) discrete visual tokens — encode images with a VQ-VAE or VQGAN to produce a sequence of integer tokens drawn from a shared vocabulary with text. The first approach is simpler and loses less information; the second enables true next-token autoregressive generation that spans both modalities seamlessly.
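The patch-embedding route can be sketched in a few lines of numpy. The 14-pixel patch size and 768-dimensional projection below follow ViT conventions but are otherwise illustrative:

```python
import numpy as np

def patchify(image, patch=14):
    """Split an HxWxC image into non-overlapping flattened patches."""
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    patches = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)
    return patches                        # (N, patch*patch*C)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
patches = patchify(image)                 # N = (224/14)^2 = 256 patches
W_proj = 0.02 * rng.standard_normal((patches.shape[1], 768))
tokens = patches @ W_proj                 # (256, 768): a text-like token sequence
assert tokens.shape == (256, 768)
```

The resulting sequence can be consumed by a transformer exactly like embedded text tokens, which is what makes early fusion straightforward to express.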

Flamingo: Few-Shot Multimodal Learning

Section 02 · Perceiver Resampler · gated cross-attention · in-context learning with images

Flamingo (Alayrac et al., 2022) was a landmark from DeepMind that demonstrated large-scale in-context learning for multimodal tasks — analogous to what GPT-3 achieved for text, but now with interleaved images and text. The key insight was that a large pretrained language model could be augmented with visual capability without extensive retraining: only the newly introduced components were trained while the frozen language backbone provided the linguistic reasoning engine.

The Perceiver Resampler

A vision encoder (NFNet pretrained on image-text contrastive data) produces a variable-length feature grid depending on the input resolution. To make visual features consumable by the language model, Flamingo introduces the Perceiver Resampler: a fixed set of \(N_q\) learned query vectors (typically 64) that attend to the variable-length image features via cross-attention, producing a fixed-length visual representation regardless of input resolution. This borrows from the Perceiver architecture and elegantly solves the length mismatch problem.

Perceiver Resampler
\[X_{\text{vis}} = \text{Perceiver}(Q_{\text{learned}},\; F_{\text{image}})\] \[X_{\text{vis}} \in \mathbb{R}^{N_q \times d}, \quad F_{\text{image}} \in \mathbb{R}^{HW \times d_v}\]
The \(N_q\) learned queries attend to the spatial image features \(F_{\text{image}}\), producing a compact fixed-length summary that is injected into the language model via cross-attention layers.
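A minimal sketch of the resampler's core cross-attention (single head, no feed-forward or layer norms, toy dimensions rather than Flamingo's) shows how a fixed query set makes the output length independent of input resolution:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_resample(queries, feats, Wq, Wk, Wv):
    """Cross-attention: N_q learned queries attend to HW image features."""
    Q, K, V = queries @ Wq, feats @ Wk, feats @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (N_q, HW)
    return attn @ V                                  # (N_q, d): fixed length

rng = np.random.default_rng(0)
d, d_v, N_q = 64, 96, 8                  # toy sizes; Flamingo uses N_q = 64
queries = rng.standard_normal((N_q, d))  # learned, independent of the image
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d_v, d))
Wv = rng.standard_normal((d_v, d))
for HW in (49, 196, 576):                # any input resolution ...
    feats = rng.standard_normal((HW, d_v))
    out = perceiver_resample(queries, feats, Wq, Wk, Wv)
    assert out.shape == (N_q, d)         # ... yields the same output length
```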

Gated Cross-Attention Layers

Flamingo freezes the pretrained language model entirely and inserts new gated cross-attention layers between the existing self-attention and feed-forward layers. Each such layer computes cross-attention between text tokens and the Perceiver-compressed visual features, then applies a learned scalar gate \(\tanh(\alpha)\) initialized to zero:

Gated Cross-Attention
\[h' = h + \tanh(\alpha)\cdot\text{CrossAttn}(h,\; X_{\text{vis}})\]
Initializing \(\alpha = 0\) means the layer starts as an identity — the pretrained language model is unperturbed at initialization and visual influence grows only as training proceeds. This stabilizes training considerably.
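The gating mechanism is simple enough to verify directly. In the sketch below, `cross_attn` is a stand-in for the real attention computation:

```python
import numpy as np

def gated_cross_attn_layer(h, x_vis, cross_attn, alpha):
    """h' = h + tanh(alpha) * CrossAttn(h, x_vis); alpha starts at 0."""
    return h + np.tanh(alpha) * cross_attn(h, x_vis)

rng = np.random.default_rng(0)
h = rng.standard_normal((10, 64))      # text hidden states
x_vis = rng.standard_normal((8, 64))   # Perceiver-compressed visual tokens
fake_attn = lambda h, x: x.mean(axis=0, keepdims=True) + 0 * h  # stand-in

out0 = gated_cross_attn_layer(h, x_vis, fake_attn, alpha=0.0)
assert np.allclose(out0, h)            # identity at init: LM is unperturbed
out1 = gated_cross_attn_layer(h, x_vis, fake_attn, alpha=1.0)
assert not np.allclose(out1, h)        # visual influence grows with alpha
```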

The model handles interleaved sequences of images and text: each image is encoded into \(N_q\) visual tokens, and each cross-attention layer attends to only the most recent image appearing before the current text position — implementing a natural form of positional locality.

Few-Shot Multimodal Learning

Trained on a mixture of interleaved web documents and large image-text and video-text pair corpora, Flamingo exhibits strong few-shot performance on visual question answering, image captioning, and classification simply by constructing in-context examples: a context window of image-answer pairs followed by a query image, with the model completing the answer. With a handful of in-context examples (up to 32), Flamingo-80B surpassed fine-tuned state-of-the-art models on several of its evaluation benchmarks while using no task-specific training — a striking demonstration that in-context learning generalizes to multimodal settings.

BLIP and the Q-Former Lineage

Section 03 · Bootstrapped captioning · Q-Former · InstructBLIP · efficient VLM adapters

The BLIP series from Salesforce Research pursued a different philosophy: rather than starting from a frozen LLM and adding cross-attention, it focused on pre-training strategies that could efficiently bootstrap multimodal capability from noisy web data, and then proposed increasingly lightweight connector modules between a frozen vision encoder and a frozen language model.

BLIP: Bootstrapping Language-Image Pretraining

The original BLIP (Li et al., 2022) combined three objectives in a single model sharing a common image encoder and text encoder/decoder: image-text contrastive learning (ITC, pushing matching pairs together), image-text matching (ITM, a binary classifier for alignment), and image-conditioned language modeling (LM, generating captions). The key innovation was CapFilt (Captioning and Filtering): a bootstrapping loop where the model itself generates synthetic captions for web images and then filters out noisy (mismatched) pairs using its ITM classifier. This dramatically improved data quality and downstream performance even when starting from identical raw data.

BLIP-2: The Q-Former

BLIP-2 (Li et al., 2023) addressed efficiency: training a full multimodal model from scratch is enormously expensive. Instead, it keeps both the image encoder and the LLM completely frozen and trains only a lightweight Querying Transformer (Q-Former) that bridges them. The Q-Former contains \(N\) learnable query tokens (32 by default, producing 32 output vectors) that interact with frozen image features via self-attention among queries and cross-attention to image features. The same queries then interact with text via self-attention, enabling the Q-Former to learn what visual information is most relevant to language.

[Figure: BLIP-2 pipeline: frozen image encoder (ViT-G/14) → Q-Former (trained; 32 learnable query tokens with cross-attention to image features) → linear projection → frozen LLM (OPT / FlanT5).]
BLIP-2 architecture: only the Q-Former and linear projection are trained. Both the image encoder and the LLM remain completely frozen, dramatically reducing compute while preserving the quality of both pretrained components.

InstructBLIP: Instruction-Following Vision

InstructBLIP (Dai et al., 2023) builds on BLIP-2 by fine-tuning the Q-Former on a diverse collection of vision-language instruction-following datasets (converted from 26 publicly available datasets). A key modification: the instruction text is also fed into the Q-Former's self-attention layers alongside the query tokens, making the extracted visual features instruction-conditioned — the queries attend to whichever visual features are most relevant to the current question. This boosted performance on held-out tasks by a large margin, demonstrating that generic visual feature extraction is suboptimal when task context is available.

LLaVA: Visual Instruction Tuning

Section 04 · Simple projection · GPT-4 generated data · LLaVA-1.5 · scaling open-source VLMs

LLaVA (Liu et al., 2023) made a bold simplifying bet: you do not need a sophisticated connector. A single linear projection from CLIP vision features to the LLM's embedding space is enough — if you pair it with high-quality instruction-following data. This frugality turned out to be highly competitive, launching an influential open-source lineage.

Architecture

LLaVA's architecture is almost comically simple: a CLIP ViT-L/14 vision encoder produces a \(256\)-length sequence of patch embeddings. A single linear layer \(W\) projects these to the LLM's token dimension \(d\): \(H_v = W \cdot Z_v\) where \(Z_v \in \mathbb{R}^{256 \times d_v}\) and \(H_v \in \mathbb{R}^{256 \times d}\). These 256 visual tokens are prepended to the text token sequence and processed together by the language model (LLaMA, Vicuna, or similar). That is the entire architecture — no Q-Former, no cross-attention layers, no perceiver.

LLaVA Visual Projection
\[H_v = W_p \cdot Z_v, \quad Z_v = f_{\text{CLIP}}(X_v)\] \[\text{Input to LLM} = [H_v;\; H_q]\]
\(W_p\) is a learned linear projection. \(H_v\) are visual tokens, \(H_q\) are text instruction tokens. The LLM generates the response autoregressively conditioned on the full concatenated sequence.
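Because the connector is a single matrix multiply, the whole visual pathway fits in a few lines. The widths below are illustrative (CLIP ViT-L has hidden size 1024; LLaMA-class 7B models use 4096):

```python
import numpy as np

rng = np.random.default_rng(0)
d_v, d, n_txt = 1024, 4096, 20           # vision width, LLM width, text length
Z_v = rng.standard_normal((256, d_v))    # 256 CLIP patch embeddings
W_p = 0.01 * rng.standard_normal((d_v, d))
H_v = Z_v @ W_p                          # visual tokens in the LLM's space
H_q = rng.standard_normal((n_txt, d))    # embedded instruction tokens
llm_input = np.concatenate([H_v, H_q])   # [H_v; H_q] -> fed to the decoder
assert llm_input.shape == (256 + n_txt, d)
```

Everything downstream of this concatenation is an unmodified language model, which is precisely LLaVA's point.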

GPT-4-Generated Instruction Data

What makes LLaVA powerful is its training data. Rather than manually annotating instruction-following examples, the authors queried GPT-4 (text-only) with image captions and bounding boxes to generate rich conversational and reasoning examples at scale: 158K instruction-following examples covering three formats — conversational Q&A, detailed descriptions, and complex reasoning — alongside 595K filtered image-caption pairs used for the initial feature-alignment stage. GPT-4's strong reasoning capabilities propagated into the smaller LLaVA model through this distillation pipeline.

LLaVA-1.5 and Beyond

LLaVA-1.5 (Liu et al., 2023b) replaced the linear projection with a two-layer MLP — a seemingly small change that improved performance substantially across all benchmarks. It also switched to CLIP ViT-L/14@336px for higher resolution and adopted a richer mix of academic VQA datasets during instruction tuning. Despite remaining architecturally simple, LLaVA-1.5 matched or exceeded the performance of much more complex models on most standard benchmarks, validating the "simple connector, great data" philosophy. Later variants — LLaVA-NeXT, LLaVA-HD — addressed the key remaining weakness, poor handling of high-resolution images, via dynamic resolution tiling: splitting an image into tiles that are encoded separately and whose tokens are then concatenated.

Open-Source VLM Ecosystem

LLaVA's simplicity and open release catalyzed an explosion of open-source vision-language models. InternVL, Idefics, Qwen-VL, MiniGPT-4, mPLUG-Owl, CogVLM, and dozens of others followed similar patterns: CLIP (or SigLIP) encoder, lightweight connector, open-source LLM backbone. The field moved fast enough that open-source models closed much of the gap with proprietary systems within months of the original GPT-4V release.

GPT-4V and the Proprietary Frontier

Section 05 · Vision integration · multimodal RLHF · GPT-4o · real-time omni model

GPT-4 with Vision (GPT-4V), released by OpenAI in late 2023, demonstrated capabilities qualitatively beyond contemporaneous open-source models: complex document understanding, scientific diagram interpretation, medical image reasoning, and multi-step visual problem solving. The architectural details remain proprietary, but several aspects are publicly known or strongly inferred from the technical report and behavior.

Architecture and Training

GPT-4V integrates a vision encoder (believed to be a large ViT) with the GPT-4 language model, likely via a projection layer or adapter connecting vision features to the LLM embedding space — in the late-fusion style. What distinguishes GPT-4V is not architectural novelty but training scale and quality: pretraining on enormous quantities of image-text data followed by careful RLHF alignment specifically for visual tasks. The system card reveals extensive red-teaming focused on multimodal content safety — a dimension largely absent from academic research.

Critically, GPT-4V exhibited strong multimodal chain-of-thought reasoning: it could describe what it sees in an image, reason step by step about the implications, and arrive at nuanced answers — in a way that felt qualitatively different from the retrieval-style answers of earlier VQA models. This likely reflects both the power of GPT-4's underlying language reasoning and RLHF training that rewarded careful, grounded explanations.

GPT-4o: The Omni Model

GPT-4o (released May 2024) represented a further step: a single end-to-end model handling text, images, audio, and video rather than a language model augmented with a separate vision adapter. GPT-4o processes audio natively (rather than routing through Whisper), enabling real-time speech interaction with far lower latency than pipeline approaches. The model can seamlessly switch between modalities mid-conversation and respond with audio output — demonstrating that the modality boundary is becoming more a design choice than an architectural necessity. Crucially, GPT-4o reportedly uses a jointly trained approach where all modalities are learned together from the start, rather than freezing a language backbone and adding vision.

Why Proprietary Models Still Lead

Despite the rapid rise of open-source VLMs, GPT-4V and later models maintained a lead on complex reasoning tasks — particularly those requiring fine-grained spatial understanding, multi-step visual reasoning, and reading comprehension from documents. The gap appears driven less by architecture and more by: (1) the sheer scale of training data, (2) the quality and diversity of RLHF demonstrations, and (3) extended post-training alignment passes targeting specific failure modes. Each of these is a resource advantage rather than a secret architectural trick.

Gemini: Natively Multimodal Training

Section 06 · Joint token space · multimodal pretraining · chain-of-thought across modalities

Gemini (Google DeepMind, 2023) took perhaps the most ambitious architectural stance of the major multimodal models: train multimodality in from the very beginning, rather than adding vision capability to a pretrained text model. This natively multimodal pretraining — jointly training on text, images, audio, and video from scratch — was intended to produce deeper integration than post-hoc adapter approaches could achieve.

Architecture

Gemini is built on a transformer decoder with a modified tokenization scheme. Images are encoded via a vision encoder (a variant of ViT) into sequences of soft tokens that are interleaved with text tokens in the input stream. Critically, these image tokens are fed into the very first layer of the transformer, not injected midway via cross-attention — making this architecturally similar to early fusion. Audio and video receive analogous treatment: audio is represented as log-mel spectrogram patches, and video is sampled as sequences of frames processed as images. The model learns to predict the next token (text or image codebook entry) given this mixed-modality sequence.

Gemini comes in three sizes — Nano, Pro, and Ultra — designed for deployment contexts ranging from on-device inference to datacenter-scale reasoning. The scaling philosophy echoes GPT-4: multiple capability tiers trained with the same architecture but different compute budgets, enabling deployment across diverse latency and cost constraints.

Multimodal Chain-of-Thought

One of Gemini's most compelling demonstrations was multimodal chain-of-thought reasoning: interleaving visual observations with textual reasoning steps. For example, when presented with a multi-page scientific paper including figures, Gemini could cite specific figures, reason about the data they display, and integrate that with claims in the text — treating the document as a unified multimodal artifact rather than separate text and image streams. This behavior emerges most strongly in Ultra-scale models and appears to require the deep modality integration that native joint training provides.

Model        | Fusion Strategy            | Vision Encoder                | Key Differentiator
Flamingo-80B | Gated cross-attention      | NFNet (frozen)                | In-context multimodal learning
BLIP-2       | Q-Former adapter           | ViT-G (frozen)                | Both encoder and LLM fully frozen
LLaVA-1.5    | MLP projection             | CLIP ViT-L (frozen)           | Simplicity + instruction data quality
GPT-4V       | Proprietary (late fusion)  | Large ViT (proprietary)       | Scale + multimodal RLHF alignment
Gemini Ultra | Native early fusion        | ViT variant (jointly trained) | Joint multimodal pretraining from scratch
GPT-4o       | End-to-end (joint)         | Jointly trained               | Real-time audio+vision+text omni model

Audio and Speech in Multimodal Models

Section 07 · Whisper encoder · AudioPaLM · SeamlessM4T · speech-native LLMs

Vision and language have received the most research attention in multimodal learning, but audio — and speech specifically — is an equally rich and practically important modality. The path to audio-language integration has largely mirrored the vision-language playbook: start with a powerful unimodal encoder, bridge it to a language model, and scale.

Whisper as a Universal Audio Encoder

OpenAI's Whisper (Radford et al., 2022) trained a sequence-to-sequence transformer on 680,000 hours of weakly supervised audio-transcript pairs, producing a general-purpose audio encoder with excellent representations of speech, music, and ambient sound. Much as CLIP became the default frozen vision encoder for many VLMs, Whisper's encoder has become a popular frozen audio encoder for multimodal models. Its log-mel spectrogram features can be projected to an LLM's embedding space with a simple adapter, enabling models to understand spoken questions, transcribe audio, and reason about sound events.

AudioPaLM and Speech-Text Joint Models

AudioPaLM (Rubenstein et al., 2023) extends PaLM-2 to handle speech by representing audio as discrete tokens from a speech tokenizer (SoundStream-based). The model operates on a shared vocabulary of text BPE tokens and speech tokens, enabling it to condition text generation on speech input and vice versa — performing tasks like spoken question answering, speech translation, and voice cloning within a single unified model. This shared token-space approach directly enables generation of speech responses rather than just transcription.

SeamlessM4T: Massively Multilingual Multimodal Translation

Meta's SeamlessM4T (Barrault et al., 2023) targets a specific but enormously impactful application: speech-to-speech, speech-to-text, text-to-speech, and text-to-text translation across nearly 100 languages — all in one model. The architecture uses a unified encoder (supporting both speech mel-spectrograms and text tokens), a shared representation space, and modality-specific decoders. Training leverages massively multilingual paired data and pseudo-labeling to cover low-resource languages. SeamlessM4T represents a practical instantiation of the multimodal promise: one model replacing an entire pipeline of specialized translators.

Audio Tokenization Approaches

Discrete audio tokens can be generated via: (1) Semantic tokens from the internal representations of a speech encoder (like HuBERT or wav2vec 2.0 k-means clusters) — capturing linguistic content but not acoustic details; (2) Acoustic tokens from a neural codec (EnCodec, SoundStream) — multiple residual quantization levels that together reconstruct waveforms with high fidelity; or (3) Hybrid tokens combining both levels for models that need to both understand and generate speech. The choice depends on whether the model must generate high-quality speech audio (requiring acoustic tokens) or merely understand spoken content (where semantic tokens suffice).
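A toy sketch of residual quantization illustrates the acoustic-token idea: each codebook level quantizes whatever the previous levels missed. The random codebooks and dimensions below are illustrative; real codecs like EnCodec learn their codebooks end-to-end:

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual VQ: each level quantizes the residual left by earlier levels."""
    residual, codes = x.copy(), []
    for cb in codebooks:                          # cb: (K, d) codebook
        idx = np.argmin(((residual[:, None] - cb[None]) ** 2).sum(-1), axis=1)
        codes.append(idx)
        residual -= cb[idx]                       # pass the residual onward
    return np.stack(codes, axis=1), x - residual  # token ids, reconstruction

rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 16))            # 50 audio frames, toy dim 16
books = [0.3 * rng.standard_normal((256, 16)) for _ in range(4)]  # 4 levels
codes, recon = rvq_encode(frames, books)
assert codes.shape == (50, 4)                     # 4 discrete tokens per frame
err1 = np.linalg.norm(frames - books[0][codes[:, 0]])
err4 = np.linalg.norm(frames - recon)
assert err4 < err1                                # deeper levels add fidelity
```

This is why acoustic-token models emit several tokens per frame: a generator must predict all residual levels to reconstruct high-fidelity audio, whereas a semantic-token model predicts one coarse token per frame.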

Visual Token Efficiency

Section 08 · Token compression · dynamic resolution · visual token pruning · long-context vision

A practical bottleneck in vision-language models is the sheer number of visual tokens generated for high-resolution images. A CLIP ViT-L/14 at 336px produces \(24 \times 24 = 576\) patch tokens. At 1024px the number grows past 5,000. When these tokens are fed into a large LLM, they occupy most of the context window and dramatically increase computation — quadratically in attention. Addressing this is one of the most active engineering fronts in multimodal research.
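The scaling is easy to verify. The helper below assumes a square image, a patch size of 14, and resolutions that are multiples of it:

```python
def n_vis_tokens(res, patch=14):
    """Patch tokens for a square image: one token per (patch x patch) tile."""
    return (res // patch) ** 2

assert n_vis_tokens(336) == 576     # 24 x 24 grid
assert n_vis_tokens(672) == 2304    # 4x the tokens for 2x the resolution
assert n_vis_tokens(1008) == 5184   # attention cost grows with the square of this
```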

Dynamic Resolution and Tiling

Rather than resizing all images to a fixed resolution, dynamic resolution approaches tile images into sub-images that are each processed at the encoder's native resolution, then concatenated. LLaVA-NeXT uses this approach: a high-resolution image is split into a grid of 336px tiles plus one downsampled global view, multiplying the visual token count severalfold while preserving fine spatial detail. InternVL-1.5 extends this further with dynamic tiling that adapts the number and arrangement of tiles to the image's aspect ratio.
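A schematic tiling routine makes the token accounting concrete. This sketch uses a non-overlapping grid and a crudely strided global view; real implementations use proper resizing and aspect-ratio-aware grid selection:

```python
import numpy as np

def tile_image(img, tile=336):
    """Grid of native-resolution tiles plus one downsampled global view."""
    H, W, C = img.shape
    tiles = [img[i:i + tile, j:j + tile]
             for i in range(0, H, tile) for j in range(0, W, tile)]
    global_view = img[::H // tile, ::W // tile][:tile, :tile]  # crude downsample
    return tiles, global_view

img = np.zeros((672, 672, 3))
tiles, glob = tile_image(img)
assert len(tiles) == 4 and tiles[0].shape == (336, 336, 3)
assert glob.shape == (336, 336, 3)   # 5 views total -> 5 x 576 visual tokens
```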

Token Compression Modules

An alternative is explicit compression: reduce the number of visual tokens before passing them to the LLM. Methods include:

  1. Spatial pooling: average or downsample neighboring patch tokens into one, trading fine spatial detail for a shorter sequence.
  2. Learned query resampling: a Perceiver- or Q-Former-style module distills the full feature grid into a small, fixed set of query outputs.
  3. Token pruning and merging: drop or fuse tokens judged redundant by attention scores or feature similarity.

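One straightforward compression scheme, spatial pooling, can be sketched directly on a square token grid (toy dimensions):

```python
import numpy as np

def pool_tokens(tokens, k=2):
    """Merge each k x k patch-token neighborhood by averaging (4x fewer for k=2)."""
    N, d = tokens.shape
    g = int(N ** 0.5)                       # assume a square g x g token grid
    grid = tokens.reshape(g, g, d)
    grid = grid.reshape(g // k, k, g // k, k, d).mean(axis=(1, 3))
    return grid.reshape((g // k) ** 2, d)

tokens = np.random.default_rng(0).standard_normal((576, 64))   # 24 x 24 grid
assert pool_tokens(tokens).shape == (144, 64)                  # 4x compression
```

Pooling is cheap and training-free but indiscriminate; the learned alternatives spend parameters to decide what to keep.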
Long-Context Vision

A related challenge arises when the input contains many images — a multi-page document, a video clip, or a multi-turn conversation with repeated image references. Context window scaling for multimodal models requires both architectural changes (e.g., sliding window attention) and training curriculum design (progressively longer sequences). Models like Gemini 1.5 support contexts of 1–2 million tokens; long-context open models get there partly by aggressively compressing visual tokens and using efficient attention implementations (FlashAttention, ring attention).

Multimodal Training: Data and Recipes

Section 09 · Interleaved data · DataComp · stage-wise training · modality balance

The quality of a multimodal model depends as much on its training data and curriculum as on its architecture. This section examines the data sources, curation strategies, and training schedules that underpin modern VLMs.

Data Sources

Multimodal pretraining draws from several types of data. Image-caption pairs (LAION-5B, CC12M, WIT) provide broad visual coverage at the cost of noisy captions scraped from alt-text. Interleaved image-text documents (MMC4, OBELICS, MINT-1T) — web pages with images naturally embedded in text — are more representative of how multimodal understanding actually works and are critical for few-shot in-context learning. Curated instruction data (LLaVA-Instruct, ShareGPT4V, TextVQA, DocVQA) provides task-aligned examples for supervised fine-tuning. High-quality synthetic data generated by stronger models (GPT-4V captions, Gemini-generated reasoning chains) is increasingly important for bridging capability gaps.

DataComp and Curation at Scale

DataComp (Gadre et al., 2023) studied the impact of data curation on CLIP-trained models with a controlled benchmark: the total compute budget is fixed, and the variable is what data you train on. Their finding — that quality curation dominates over quantity — validated aggressive filtering strategies: keeping only image-caption pairs where the CLIP similarity exceeds a threshold, filtering by English language, and deduplicating using perceptual hashing. The resulting DataComp-1B models outperformed LAION-trained models trained on 10× more data.
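The filtering step itself reduces to a thresholded cosine similarity between image and caption embeddings. The sketch below uses synthetic embeddings, and the 0.28 cutoff is merely illustrative of CLIP-score thresholds used in practice:

```python
import numpy as np

def clip_filter(img_emb, txt_emb, threshold=0.28):
    """Keep indices of pairs whose cosine similarity clears the threshold."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return np.where((img * txt).sum(axis=1) > threshold)[0]

rng = np.random.default_rng(0)
img = rng.standard_normal((1000, 512))
txt = rng.standard_normal((1000, 512))
txt[:500] += img[:500]                         # first half: captions match images
keep = clip_filter(img, txt)
assert len(keep) == 500 and keep.max() < 500   # mismatched alt-text is dropped
```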

Stage-Wise Training

Most production VLMs use a multi-stage training curriculum:

  1. Stage 1 — Feature alignment: freeze both the vision encoder and the LLM; train only the connector (linear projection, Q-Former, or MLP) on large-scale image-caption pairs. This is cheap and aligns the embedding spaces.
  2. Stage 2 — Instruction tuning: unfreeze the LLM (and sometimes the vision encoder); fine-tune on curated instruction-following datasets. This teaches the model to follow diverse task formats.
  3. Stage 3 (optional) — RLHF / DPO alignment: human feedback or preference data to improve helpfulness, safety, and instruction adherence, correcting systematic failure modes identified in evaluation.

Multimodal Language Modeling Objective
\[\mathcal{L} = -\sum_{t} \log p_\theta\bigl(y_t \mid y_{<t},\; H_v\bigr)\]
The model is trained to predict text tokens \(y_t\) autoregressively, conditioned on previous text and visual tokens \(H_v\). Only text tokens contribute to the loss during language modeling — visual token prediction is handled separately if the model also generates images.
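A sketch of this masked objective (toy vocabulary; in practice the mask also covers padding and, often, prompt tokens):

```python
import numpy as np

def lm_loss(logits, targets, loss_mask):
    """Autoregressive NLL averaged over text positions; visual positions masked."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]
    return (nll * loss_mask).sum() / loss_mask.sum()

rng = np.random.default_rng(0)
T, V, n_vis = 12, 100, 4
logits = rng.standard_normal((T, V))            # decoder outputs, one per position
targets = rng.integers(0, V, size=T)
mask = np.r_[np.zeros(n_vis), np.ones(T - n_vis)]   # no loss on visual tokens
loss = lm_loss(logits, targets, mask)
assert loss > 0
```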

Modality Balance and Catastrophic Forgetting

A persistent challenge is maintaining language model quality when adding vision. Naive fine-tuning on vision-heavy data degrades language performance — models become better at describing images but worse at reasoning, coding, or following complex text-only instructions. Practitioners mix text-only data into every training stage (typically 30–50% of the training mix) and monitor language-only benchmark performance during multimodal training. LoRA-based adapters for vision integration (training adapter modules rather than the full LLM) further reduce forgetting by separating the parameter updates for visual capability from those for language capability.

Multimodal Generation: Unified Models

Section 10 · Chameleon · Show-o · Janus · discrete image tokens

Most VLMs discussed so far are discriminative: they understand visual input but generate only text. The richer vision of multimodal AI is a model that can both understand and generate across modalities — answering questions about images and creating new images, in the same forward pass, within a single architecture. Several approaches to unified understanding-and-generation have emerged.

The Unified Token Stream

The cleanest approach treats both understanding and generation as next-token prediction over a shared discrete vocabulary containing both text BPE tokens and visual codebook tokens. The model receives interleaved text and image token sequences and predicts the next token — which might be a word or an image patch. At generation time, producing a sequence of visual tokens followed by decoding them with a VQGAN yields a generated image. This approach was pioneered by DALL-E (Part X Ch 06) and extended to joint models that handle both directions.

Chameleon (Meta, 2024) is one of the most thorough implementations: a single transformer trained from scratch on interleaved image-text data, using a shared BPE + image codebook vocabulary. Chameleon handles text-to-image, image-to-text, and mixed-modality generation in one model without any specialized modules. The key challenge was training stability — image tokens and text tokens have very different statistical properties, requiring architectural modifications (query-key normalization, revised layer-norm placement, dropout) to prevent the model from collapsing to always predicting one modality.

Show-o (NUS, 2024) takes a hybrid approach: it uses autoregressive prediction for text tokens and discrete diffusion for image tokens, allowing images to be generated non-autoregressively (in parallel, via iterative denoising over masked tokens) while text remains autoregressive. This hybrid gives faster image generation than fully sequential AR while retaining the flexibility of a single model handling both tasks.

Janus (DeepSeek, 2024) observes that the visual representations optimal for understanding (capturing semantic content) differ from those optimal for generation (capturing appearance detail), and proposes a decoupled architecture: a separate vision encoder path for understanding queries and a separate image generation path, both operating within the same LLM backbone but with different front-ends.

Discrete vs. Continuous Image Tokens

Unified AR generation requires discrete image tokens (integers from a codebook, predicted by softmax). But the best vision encoders — CLIP, SigLIP — produce continuous embeddings. This creates a fundamental tension: the tokenization that works best for understanding (continuous CLIP features) is incompatible with the tokenization needed for generation (discrete VQ codes). Models like Janus resolve this by using two separate visual pathways. Others (Chameleon) accept continuous→discrete quantization loss for simplicity. Finding representations that are both good for understanding and suitable for generation is an active research area.

Evaluation: Benchmarks and Failure Modes

Section 11 · VQAv2 · MMMU · MMBench · hallucination · spatial reasoning · OCR

Evaluating multimodal models is substantially harder than evaluating text-only models. Benchmark saturation, train-test contamination, and evaluation protocol differences all complicate comparison. Nevertheless, a set of benchmarks has emerged as the standard suite.

Standard Benchmarks

VQAv2 tests visual question answering on natural images with a balanced answer distribution: each question is paired with complementary images that yield different answers, so language priors alone cannot determine the answer. Despite near-human accuracy on this benchmark (>80% for top models vs. ~81% human), models can still partially game it through language biases without genuine visual understanding — motivating harder benchmarks.

MMMU (Massive Multi-discipline Multimodal Understanding) tests college-level subject-matter reasoning across 30 subjects spanning six disciplines, including medicine, engineering, and art, with questions requiring domain knowledge and visual reasoning together. Top models (GPT-4o, Gemini Ultra) score 60–70%, still short of the ~88% expert human baseline. Unlike VQAv2, MMMU resists shortcut solutions.

MMBench and MMStar evaluate a wide spectrum of capabilities — object recognition, spatial reasoning, OCR, scene understanding, relationship comprehension — with careful contamination controls. MMStar in particular was constructed by ensuring none of its images appear in common web training data.

DocVQA and TextVQA test reading comprehension from document images and scene text — skills that require both OCR capability and language understanding. Models that use high-resolution tiling tend to excel here; those using aggressive token compression often fail on fine text.
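The tiling trade-off can be sketched as a token-budget calculation. The numbers below (336-pixel tiles, a 14-pixel ViT patch size, a global thumbnail) follow LLaVA-NeXT-style "anyres" recipes but are illustrative, not any specific model's configuration:

```python
import math

def plan_tiles(w: int, h: int, tile: int = 336, max_tiles: int = 12):
    """Split a w x h image into encoder-resolution crops plus one
    downscaled global thumbnail, capped at max_tiles crops.
    Returns (cols, rows, total_visual_tokens)."""
    cols = min(math.ceil(w / tile), max_tiles)
    rows = min(math.ceil(h / tile), max_tiles)
    while cols * rows > max_tiles:          # shrink grid to fit token budget
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    tokens_per_tile = (tile // 14) ** 2     # ViT-L/14 grid: 24 x 24 = 576
    n_crops = cols * rows + 1               # +1 for the global thumbnail
    return cols, rows, n_crops * tokens_per_tile

print(plan_tiles(1280, 960))  # (4, 3, 7488)
```

A 1280×960 document page costs ~7.5k visual tokens under this scheme — which is exactly why models that compress aggressively (fewer tokens per tile, or no tiling) lose fine text.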

Failure Modes

Hallucination — producing plausible-sounding but visually unsupported statements — is the primary failure mode. Unlike text-only hallucination (confabulating facts), visual hallucination involves claiming to see objects, relationships, or text that do not exist in the image. POPE (Polling-based Object Probing Evaluation) and CHAIR (Caption Hallucination Assessment with Image Relevance) specifically measure this. RLHF with visual grounding rewards reduces but does not eliminate it.
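POPE reduces hallucination measurement to balanced yes/no polling over objects ("Is there a dog in the image?"), scored with "yes" as the positive class. A minimal sketch of that scoring:

```python
def pope_f1(preds: list[str], labels: list[str]) -> float:
    """F1 over yes/no object-presence answers, 'yes' positive.
    A hallucinating model says 'yes' to absent objects, inflating
    false positives and dragging precision (hence F1) down."""
    tp = sum(p == "yes" and l == "yes" for p, l in zip(preds, labels))
    fp = sum(p == "yes" and l == "no" for p, l in zip(preds, labels))
    fn = sum(p == "no" and l == "yes" for p, l in zip(preds, labels))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

labels = ["yes", "yes", "no", "no", "no", "yes"]   # ground truth
preds  = ["yes", "yes", "yes", "no", "yes", "yes"] # two hallucinated objects
print(round(pope_f1(preds, labels), 3))            # 0.75
```

Balancing the present/absent object pools matters: an always-"yes" model would otherwise look deceptively good on recall alone.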

Spatial reasoning remains surprisingly hard: tasks like "which object is to the left of X" or counting identical objects consistently trip up models that otherwise show strong high-level image understanding. This reflects the fact that CLIP-style training emphasizes global semantic alignment over precise spatial localization.

Multi-image and video understanding — reasoning about changes across frames, tracking objects through time, or comparing two photographs — remains well below human performance. Token budget constraints and the lack of temporal training data compound each other here.

| Benchmark | Focus | Human Baseline | Top Model Score (approx.) |
|---|---|---|---|
| VQAv2 | Natural image QA | ~81% | ~86% (GPT-4o) |
| MMMU | Expert multidisciplinary reasoning | ~89% | ~70% (GPT-4o) |
| MMBench | Broad capability sweep | ~75% | ~84% (GPT-4o) |
| TextVQA | Scene text reading | ~85% | ~78% (InternVL-2) |
| POPE | Object hallucination (F1) | — | ~87% (top models) |
| MMStar | Contamination-free evaluation | ~70% | ~67% (GPT-4o) |

The Multimodal Landscape

Section 12 · Convergence · omni models · video understanding · next frontiers

The multimodal foundation model landscape has undergone a remarkable convergence in just a few years. What began as separate communities — vision-language researchers adapting CLIP, NLP researchers adapting GPT-2, speech researchers extending wav2vec — have been steadily merging into a unified field centered on one question: what does the ideal general-purpose multimodal architecture look like?

The Omni Model Trajectory

The trajectory is clear: models are accumulating modalities. Early VLMs handled images and text. The current generation adds audio (GPT-4o, Gemini 1.5), video (Gemini 1.5 Pro with 1M context supporting 1-hour videos), and is beginning to incorporate additional modalities like PDFs as first-class inputs rather than image conversions. The end point of this trajectory — a model that processes any signal type with equal facility — looks increasingly plausible.

The key remaining architectural question is whether native joint pretraining (Gemini, GPT-4o style) or modular adapter approaches (Flamingo, LLaVA style) will dominate. Joint pretraining is computationally expensive but produces deeper integration; modular approaches are cheaper and more flexible but may hit an integration ceiling. The current evidence suggests joint pretraining wins on deeply multimodal tasks (tasks requiring simultaneous vision and language reasoning) while modular approaches remain competitive on tasks that can be decomposed into sequential vision-then-language processing.

Video Understanding

Video is the natural next frontier. A one-hour video at 1 fps produces ~3,600 frames — far more than any current model can process without aggressive subsampling or compression. Gemini 1.5's million-token context makes direct frame processing feasible at low frame rates; specialized video models (Video-LLaMA, Video-ChatGPT, LLaVA-Video) use learned temporal compression. The key open problem is temporal reasoning — understanding that event A caused event B, that an object moved between frames, that a person's emotional state changed — which requires not just processing frames but building causal models of time.
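A back-of-envelope sketch of the token-budget constraint. The ~258 tokens per frame is an assumption borrowed from commonly cited per-image figures for Gemini 1.5; the reserve for prompt and response is likewise illustrative:

```python
def frame_budget(context_tokens: int,
                 tokens_per_frame: int = 258,   # assumed per-frame cost
                 reserve: int = 8_192) -> int:
    """How many video frames fit in the context window after
    reserving room for the text prompt and the response?"""
    return max((context_tokens - reserve) // tokens_per_frame, 0)

def sample_fps(duration_s: float, frames: int) -> float:
    """Effective uniform sampling rate for a video of given length."""
    return frames / duration_s

frames = frame_budget(1_000_000)               # 1M-token context
print(frames, round(sample_fps(3600, frames), 2))
```

Under these assumptions a million-token context holds roughly one frame per second of a one-hour video — consistent with the low-frame-rate regime described above, and a direct illustration of why longer or denser video forces learned temporal compression.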

Reasoning and Grounding

The next capability frontier is richer spatial grounding and physical reasoning. Current models excel at semantic image understanding but struggle with precise localization, counting, and physical simulation ("if I push this object, what happens?"). Models augmented with object detection heads (OFA, Shikra, Qwen-VL with bounding box tokens) take first steps toward grounding. Combining multimodal understanding with tool use — calling a segmentation model, using a depth estimator — may bridge the gap between what can be solved with direct perception and what requires structured reasoning about scene geometry.

Open Problems

Several deep challenges remain. Compositional visual reasoning — understanding complex spatial relationships between multiple objects, especially in novel combinations — is systematically weak. Low-resource language + vision: nearly all VLMs are English-dominant; extending to low-resource language communities requires multilingual multimodal training data that largely does not yet exist. Embodied perception: translating visual understanding to action — predicting where to reach, how to grasp, what will fall — requires grounding in physical world dynamics that static image training cannot provide. And fundamentally, measuring genuine understanding vs. sophisticated pattern matching remains an unsolved philosophical and empirical problem: it is unclear whether any current benchmark can distinguish them.

The Convergence Thesis

The history of deep learning is a history of convergence: convolutions gave way to attention, speech models and language models converged on transformers, vision and language converged on CLIP-style pretraining. Multimodal foundation models represent the current convergence frontier — the bet that one architecture, trained on enough diverse data, can be the substrate for all perceptual and reasoning tasks. Whether this bet is correct — or whether some tasks require modality-specific inductive biases that a universal model cannot learn — is the central empirical question of the next five years.

Further Reading