Image & Video Generation: pixels conjured from text and latent space.
The algorithms from Chapters 1–5 were the science. This chapter is the engineering — the systems that turned diffusion, autoregressive models, and latent spaces into the products that changed how images and video get made. DALL-E, Stable Diffusion, Imagen, Sora: each represents a different set of architectural and training bets, and each pushed what was thought possible. Understanding these systems means understanding how the theory gets deployed at scale.
Prerequisites
This chapter assumes you have read the diffusion models chapter (Part X Ch 04) — latent diffusion, classifier-free guidance, U-Nets, DiT, DDIM, and consistency models are all used here without re-derivation. CLIP (Part VII Ch 06) and transformer architecture (Part VI Ch 04) are referenced extensively. The autoregressive chapter (Part X Ch 05) provides background for DALL-E 1 and discrete-token image generation. Some sections on video reference transformer attention patterns from Part VI Ch 04.
The Text-to-Image Landscape
Text-to-image generation went from a research curiosity to a transformative creative technology between 2021 and 2023. The inflection point was the simultaneous maturation of three ingredients: large-scale contrastive vision-language models (CLIP) that could align text and image representations; high-capacity diffusion models that could generate photorealistic images; and the availability of billion-image training datasets scraped from the web. When these came together, the results were qualitatively different from anything that had come before.
The systems built on these ingredients can be grouped into three architectural families. The first is autoregressive token generation, where images are tokenized into discrete codes and generated left-to-right by a transformer — the approach of DALL-E 1 and Parti. The second is diffusion in pixel space, where the denoising U-Net operates directly on image pixels — used by DALL-E 2's prior stage and Imagen's base model. The third and now dominant approach is latent diffusion, where a VAE compresses the image into a compact latent space and diffusion is run there — the approach of Stable Diffusion, SDXL, SD3, Flux, and DALL-E 3's underlying model.
A second axis of variation is the text encoder. CLIP's text encoder was the default early choice because CLIP embeddings were known to carry semantically rich visual concepts. But CLIP was trained on short captions and struggles with long, complex prompts. Imagen's key finding was that replacing CLIP with a large language model text encoder — specifically T5-XXL — dramatically improved prompt following, especially for complex compositions and unusual attribute bindings. SD3 and Flux use a combination: both a CLIP encoder and a T5 encoder, concatenating their outputs and conditioning the DiT on both simultaneously.
DALL-E: OpenAI's Three-Generation Arc
OpenAI's DALL-E series spans three distinct paradigms over three years, making it a useful lens for understanding how the field evolved. Each generation solved the failures of its predecessor by adopting a fundamentally different architecture.
DALL-E 1: Autoregressive VQ-VAE
The original DALL-E (Ramesh et al., 2021) was a 12-billion parameter autoregressive transformer. Images were tokenized into 32×32 = 1024 discrete tokens using a dVAE (discrete VAE with 8192 codebook entries). The text prompt was tokenized with BPE and prepended to the image token sequence. The transformer was trained to predict the next token — a joint sequence of up to 256 text tokens followed by 1024 image tokens — using standard cross-entropy loss.
The quality was remarkable for its time but limited by the resolution of the 32×32 discrete grid and the compounding errors of 1024 sequential sampling steps. The model also struggled with precise spatial relationships: generating "a red cube to the left of a blue sphere" requires the model to plan ahead across a flattened raster scan, which transformers trained on text do not naturally do. CLIP was used as a reranking filter — generate 512 candidates, score each by CLIP similarity to the prompt, and return the best — which significantly improved apparent quality.
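The reranking step is simple enough to sketch. A minimal version in numpy, assuming CLIP embeddings have already been computed for the prompt and for each candidate image (the function name and toy data here are illustrative):

```python
import numpy as np

def clip_rerank(image_embs: np.ndarray, text_emb: np.ndarray, k: int = 1) -> np.ndarray:
    """Return indices of the k candidates most similar to the prompt.

    image_embs: (N, d) candidate image embeddings; text_emb: (d,) prompt embedding.
    Embeddings are L2-normalized so the dot product is cosine similarity.
    """
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb)
    scores = img @ txt                       # cosine similarity per candidate
    return np.argsort(scores)[::-1][:k]      # best-scoring first

# Toy example: candidate 2 is constructed to align with the "prompt" direction.
rng = np.random.default_rng(0)
text = np.array([1.0, 0.0, 0.0, 0.0])
cands = rng.normal(size=(8, 4))
cands[2] = text * 5.0 + 0.01 * rng.normal(size=4)
best = clip_rerank(cands, text, k=1)
```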
DALL-E 2: Hierarchical Diffusion with CLIP Prior
DALL-E 2 (unCLIP, Ramesh et al., 2022) introduced a two-stage architecture. The first stage is a prior that maps CLIP text embeddings to CLIP image embeddings — trained to predict, given a text prompt's CLIP embedding, what the CLIP image embedding of a matching image would look like. The second stage is a decoder that generates images conditioned on CLIP image embeddings, implemented as a diffusion model operating in pixel space with CLIP embedding injected via cross-attention.
This two-stage design was motivated by the structure of the CLIP embedding space. CLIP image embeddings are known to capture semantics but not fine visual details (hair texture, exact color values, lighting), since CLIP's contrastive training objective cares only about image-text alignment, not pixel reconstruction. The decoder therefore has two conditioning signals: the CLIP image embedding (for semantics) and a CLIP text embedding (for prompt-specific details the image embedding may not capture). The resulting system could generate semantically coherent images with better prompt fidelity than DALL-E 1, but fine-grained control remained limited.
DALL-E 3: Recaptioning at Scale
DALL-E 3 (Betker et al., 2023) made a different bet: rather than improving the architecture, improve the training data. The key observation was that models trained on web-scraped image-caption pairs learn to ignore prompt details, because web captions are short, vague, and often inconsistent with the image content. A model trained on "a dog" learns to generate dogs when given "a golden retriever sitting on a red chair in front of a white fence at sunset" — it simply doesn't need the detail because the training data never provided it.
Betker et al. used a large vision-language model (similar to GPT-4V) to generate detailed synthetic captions for every training image — captions that actually describe what is in the image, including object positions, colors, lighting, and style. Models trained on these synthetic captions learned to use prompt details because following prompt details was necessary to minimize training loss. The result was dramatically better prompt adherence with no architectural change.
DALL-E 3 also integrated with ChatGPT for prompt rewriting: user inputs are expanded and clarified by GPT-4 before being passed to the image model, leveraging the language model's world knowledge to fill in details and resolve ambiguities. The combination of recaptioning, a strong text encoder, and LLM prompt expansion pushed DALL-E 3's prompt fidelity well ahead of contemporaries.
Stable Diffusion: The Open Ecosystem
Stable Diffusion is unique among major image generation systems in being fully open-source — weights, architecture, and training code all released publicly. This decision, made by Stability AI in August 2022, triggered an explosion of community innovation: fine-tuning, LoRA adapters, ControlNet, inpainting models, upscalers, prompt tools, and entirely new user interfaces were developed within months, by thousands of researchers and hobbyists who would never have had access to a closed model.
SD v1 and v2: Establishing the Baseline
Stable Diffusion v1 was an 860M-parameter U-Net trained on subsets of LAION-5B at 512×512, conditioned on CLIP ViT-L/14 text embeddings. The VAE encoder/decoder was pretrained separately. v1.4 and v1.5 became the most widely used bases for community fine-tuning due to their manageable size and strong generalization. SD v2 switched to OpenCLIP ViT-H/14 (a larger, openly-trained CLIP variant), trained at 512×512 and later 768×768. Despite the improved text encoder, v2 was widely perceived as producing worse artistic quality than v1 — likely because OpenCLIP trained on a differently filtered dataset with fewer artistic images, and because the new CLIP space was incompatible with the extensive v1 community tooling.
SDXL: Scaling the Base Model
SDXL (Podell et al., 2023) made several simultaneous improvements. The U-Net grew to 2.6B parameters, with more transformer blocks added at lower resolutions in the network. The text conditioning was upgraded to use two CLIP models simultaneously: CLIP ViT-L (used in v1) and OpenCLIP ViT-bigG, with their embeddings concatenated and fed to the U-Net via cross-attention. A novel conditioning signal — the original image size and the crop coordinates used during training — was embedded and added to the timestep conditioning, helping the model understand what "full-frame" versus "cropped" content looks like.
SDXL also introduced a refiner model: a separate U-Net specialized for low-noise-level denoising (operating only in the high-SNR regime). The pipeline runs the base SDXL model from full noise to an intermediate noise level, then hands off to the refiner for the final steps. The refiner adds fine detail and texture that the base model generates only coarsely. This two-model cascade improved output quality substantially but doubled the inference cost.
SD3 and Flux: Rectified Flow and Diffusion Transformers
Stable Diffusion 3 (Esser et al., 2024) replaced the U-Net with a Multimodal Diffusion Transformer (MMDiT) and the diffusion framework with rectified flow (flow matching with straight-line trajectories). The MMDiT processes text and image tokens jointly using separate weight matrices but the same attention mechanism — unlike the U-Net cross-attention where image features query text features asymmetrically, the MMDiT allows bidirectional information flow between modalities at every layer. The text encoder combination (CLIP-L + OpenCLIP-bigG + T5-XXL) provided rich long-form prompt understanding.
Flux (Black Forest Labs, 2024), from the team that built Stable Diffusion, extended the MMDiT architecture further with hybrid attention blocks that alternate between full joint image-text attention and single-stream image-only attention. Flux.1 Pro/Dev/Schnell achieved state-of-the-art prompt adherence, detail rendering, and text generation (a historically weak point of diffusion models) at 1024×1024 and above.
| Model | Architecture | Text encoder | Resolution | Params |
|---|---|---|---|---|
| SD v1.5 | U-Net | CLIP ViT-L | 512 | 860M |
| SD v2.1 | U-Net | OpenCLIP ViT-H | 768 | 865M |
| SDXL | U-Net (2× depth) | CLIP-L + OpenCLIP-bigG | 1024 | 2.6B + 1.5B refiner |
| SD3 Medium | MMDiT | CLIP-L + OpenCLIP-bigG + T5-XXL | 1024 | 2B |
| Flux.1 | Hybrid DiT | CLIP-L + T5-XXL | 1024+ | 12B |
Imagen and Cascaded Generation
Google's Imagen (Saharia et al., 2022) made one architectural bet that turned out to be particularly consequential: using a large language model — T5-XXL (4.6 billion parameters) — as the text encoder, rather than a vision-language model like CLIP. The argument was that text encoders trained on pure language modeling had richer representations of syntax, semantics, and world knowledge than CLIP encoders trained on noisy image-text pairs. The empirical result was striking: replacing a CLIP-L encoder with T5-XXL, with everything else held constant, improved human-rated image quality and prompt fidelity dramatically.
The generation pipeline is a cascade of three separate diffusion models. A base model at 64×64 generates the initial semantic content. A 64→256 super-resolution model adds detail. A 256→1024 super-resolution model adds high-frequency texture. Each stage conditions on the T5 text embeddings via cross-attention. Training all three stages end-to-end would be intractable; instead, each stage is trained independently, with the output of the previous stage used as conditioning input during the next stage's training.
Imagen introduced dynamic thresholding as a replacement for the standard static clipping used in pixel-space diffusion. When guidance scale is high, the predicted clean image can have pixel values far outside the valid range [−1, 1]. Static clipping (clamping to [−1, 1]) causes oversaturation. Dynamic thresholding instead finds the percentile of the absolute pixel values and normalizes by it, preserving relative contrast while keeping values in range — a small change with large impact on high-guidance quality.
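Dynamic thresholding is a few lines of array code. A numpy sketch following the paper's description (the 99.5 percentile is one setting Imagen ablates, not a universal constant):

```python
import numpy as np

def dynamic_threshold(x0_pred: np.ndarray, percentile: float = 99.5) -> np.ndarray:
    """Imagen-style dynamic thresholding of a predicted clean image.

    s = the given percentile of |pixel values|; if s > 1, clip to [-s, s]
    and divide by s, mapping the prediction back into [-1, 1] while
    preserving relative contrast.
    """
    s = np.percentile(np.abs(x0_pred), percentile)
    s = max(s, 1.0)                     # never shrink an already in-range image
    return np.clip(x0_pred, -s, s) / s

# High guidance pushed values far outside [-1, 1]:
x = np.array([-3.0, -1.0, 0.0, 1.5, 3.0])
y = dynamic_threshold(x)
```

Compare static clipping, which would map both −3.0 and −1.0 to −1.0, destroying their contrast; dynamic thresholding rescales so their ratio survives.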
Parti (Yu et al., 2022), also from Google, took the autoregressive token approach to its logical extreme: a 20-billion-parameter sequence-to-sequence transformer operating on discrete codes from a ViT-VQGAN image tokenizer, treating text-to-image as a translation task. Parti achieved excellent fine-grained text rendering and complex scene composition at the cost of very slow sampling. Together, Imagen and Parti demonstrated that cascaded diffusion and scaled autoregressive generation were competitive approaches, with different failure modes.
Personalization: DreamBooth, Textual Inversion, and LoRA
A major practical limitation of text-to-image models is that they cannot generate a specific real-world subject — your pet, your face, a particular product — that they have not seen in training. Personalization methods solve this by adapting the model to a handful of user-provided images (typically 3–25) of the target subject.
Textual Inversion
Gal et al. (2022) proposed the simplest possible approach: freeze the entire model and only optimize a new text token embedding. A placeholder token like S* is introduced into the tokenizer vocabulary with a randomly initialized embedding vector. Given 3–5 images of the target concept, gradient descent updates only this embedding vector — keeping all model weights frozen — until the model generates images similar to the target when prompted with S*. Because only a single embedding vector (typically 768 dimensions) is trained, the method is extremely parameter-efficient. The weakness is limited expressiveness: a single token cannot capture all the visual variation of a complex concept, and the optimization can overfit to specific poses or backgrounds in the training images.
DreamBooth
Ruiz et al. (2022) fine-tune the entire U-Net and text encoder on the small set of subject images, using a rare token identifier (e.g., a sks dog) as the subject label. The key insight is a prior preservation loss: along with the subject images, the model simultaneously trains on synthetic images of the same class (a dog) generated by the original model. This prevents the fine-tuned model from forgetting the general concept — without it, fine-tuning on 5 images of one dog would cause the model to lose the ability to generate any other dogs.
DreamBooth produces significantly better subject fidelity than textual inversion, especially for subjects with complex geometry or appearance. The cost is a full fine-tuned model per subject, which is expensive to store and serve at scale.
LoRA for Efficient Personalization
Low-Rank Adaptation (LoRA) adapts a model by adding small trainable rank-decomposition matrices to the attention weight matrices, while keeping the original weights frozen. For a weight matrix \(\mathbf{W} \in \mathbb{R}^{m \times n}\), LoRA trains matrices \(\mathbf{A} \in \mathbb{R}^{m \times r}\) and \(\mathbf{B} \in \mathbb{R}^{r \times n}\) where \(r \ll \min(m, n)\), and uses \(\mathbf{W} + \alpha \mathbf{A}\mathbf{B}\) at inference. Typical ranks are \(r = 4\) to \(r = 64\), reducing trainable parameters by 100–10,000×.
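A numpy sketch of the update, using the shapes from the text; zero-initializing one factor so the adapter starts as an exact no-op mirrors the original LoRA recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 768, 768, 8

W = rng.normal(size=(m, n))             # frozen pretrained weight
A = rng.normal(size=(m, r)) * 0.01      # trainable low-rank factor
B = np.zeros((r, n))                    # zero-init: W + alpha*A@B == W at start

def lora_forward(x: np.ndarray, W, A, B, alpha: float = 1.0) -> np.ndarray:
    # Equivalent to (W + alpha * A @ B) @ x, without forming the full update.
    return W @ x + alpha * (A @ (B @ x))

x = rng.normal(size=n)
assert np.allclose(lora_forward(x, W, A, B), W @ x)   # no-op at initialization

full_params = m * n          # 589,824 trainable params for full fine-tuning
lora_params = r * (m + n)    # 12,288 trainable params for the adapter
```

For this 768×768 matrix at rank 8, the adapter trains 48× fewer parameters; at lower ranks on larger matrices the ratio reaches the 100–10,000× range quoted above.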
For diffusion model personalization, LoRA applied to the U-Net's attention matrices achieves quality close to full DreamBooth fine-tuning at a fraction of the storage cost — a LoRA adapter for a subject might be 4–100 MB rather than several GB for a full model checkpoint. The Hugging Face hub and Civitai host hundreds of thousands of community-created LoRA adapters for SD and SDXL, enabling instant model customization. Multiple LoRAs can be merged at inference time with per-adapter weights, combining artistic styles with subject personalization simultaneously.
Controlled Generation: ControlNet and Friends
Text prompts provide semantic guidance but poor spatial control. If you want the model to generate a person standing in a specific pose, or a scene with specific depth geometry, or an image with the composition of an existing reference image, text alone is inadequate. A series of techniques add structured spatial conditioning signals to pre-trained diffusion models.
ControlNet
Zhang et al. (2023) introduced a method for adding new spatial conditioning to a frozen diffusion model by training a trainable copy of the U-Net encoder alongside the frozen decoder. The trainable encoder takes both the noisy latent and the conditioning signal (edge map, depth map, pose skeleton, semantic segmentation, etc.) as input. Its output at each resolution is added to the corresponding frozen decoder feature map via zero-initialized convolutions. The zero initialization is critical: at the start of training, ControlNet contributes nothing to the output, so the model's pretrained behavior is preserved. Gradients flow back only through the trainable encoder branch.
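The zero-initialization property is easy to verify concretely. A numpy sketch treating the 1×1 convolution as a per-position linear map over channels (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def zero_conv(features: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """1x1 convolution = an independent linear map over channels at each position.

    features: (H, W, C_in); weight: (C_in, C_out); bias: (C_out,).
    """
    return features @ weight + bias

c_in, c_out, H, W = 16, 16, 8, 8
weight = np.zeros((c_in, c_out))   # zero-initialized, as in ControlNet
bias = np.zeros(c_out)

frozen_decoder_feat = rng.normal(size=(H, W, c_out))
control_branch_feat = rng.normal(size=(H, W, c_in))

# At the start of training the control branch contributes exactly nothing,
# so the pretrained model's behavior is untouched:
out = frozen_decoder_feat + zero_conv(control_branch_feat, weight, bias)
```

Gradients with respect to `weight` and `bias` are nonzero even though the output is, so training gradually "opens" the control pathway from a safe starting point.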
The result is a model that can be precisely controlled by the conditioning signal while still following the text prompt for style, color, and content. A user can sketch a rough outline of a room layout and let the model fill in realistic furniture, lighting, and materials. A photographer can provide a depth map of a scene and have the model generate a new image matching that depth structure. ControlNet adapters for SD v1.5 and SDXL now exist for dozens of conditioning types and are standard tools in AI-assisted design workflows.
IP-Adapter
Image Prompt Adapter (Ye et al., 2023) enables image-based conditioning — providing a reference image whose style, color palette, or specific elements should appear in the output. The reference image is encoded by a CLIP image encoder, and its embeddings are fed to the U-Net via decoupled cross-attention: new cross-attention layers are added in parallel with the existing text cross-attention, each processing the image prompt embedding independently. At inference, the outputs of both attention layers are summed, allowing the text prompt to control content and the image prompt to control style or identity.
IP-Adapter is particularly useful for identity preservation (generating a person in different settings while preserving their appearance), style transfer (applying the look of an artwork to a new scene), and product photography (placing a product in generated lifestyle contexts). IP-Adapter-FaceID additionally uses face recognition embeddings rather than CLIP embeddings for more precise identity control.
T2I-Adapter and Other Conditioning Methods
T2I-Adapter (Mou et al., 2023) takes a lighter approach: a small adapter network (fewer than 80M parameters) processes the conditioning signal and produces feature maps that are added to the U-Net's residual blocks. Multiple adapters can be combined — for example, a pose adapter and a color adapter applied simultaneously — with per-adapter weighting. This modularity, with no changes to the frozen base model, makes adapters trivially composable.
Image Editing: SDEdit to InstructPix2Pix
Generating new images from scratch is only part of the creative workflow. Editing existing images — changing the style, replacing objects, modifying attributes — while preserving the parts that should stay the same requires specialized techniques. The core challenge is the same in all cases: how do you retain structure from an input image while allowing semantic changes driven by a text instruction?
SDEdit: Noise and Denoise
Meng et al. (2021) proposed the simplest possible approach: add noise to the input image up to an intermediate timestep \(t^*\), then run the reverse diffusion process starting from that noisy image. Choosing \(t^*\) small means the output stays close to the input (low noise = small edits). Choosing \(t^*\) large gives the model more freedom to deviate. The text prompt steers what structure the model puts back during denoising. SDEdit requires no training beyond the base diffusion model, but it offers crude control — the model may not preserve the exact structures you want, and the boundary between "edited" and "preserved" regions is blurry.
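The starting point is just the forward-noising formula from the diffusion chapter. A numpy sketch, using \(\bar{\alpha}_{t^*}\) to parameterize edit strength (values chosen for illustration):

```python
import numpy as np

def sdedit_start(x0: np.ndarray, alpha_bar_t: float, rng) -> np.ndarray:
    """Noise the input to timestep t*: x_t = sqrt(ab)*x0 + sqrt(1 - ab)*eps."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

rng = np.random.default_rng(0)
x0 = rng.normal(size=(64, 64))

# Small edit: t* early, alpha_bar near 1, start point stays close to the input.
x_small = sdedit_start(x0, alpha_bar_t=0.95, rng=np.random.default_rng(1))
# Large edit: t* late, alpha_bar near 0, the input signal is mostly destroyed.
x_large = sdedit_start(x0, alpha_bar_t=0.05, rng=np.random.default_rng(1))

# Correlation with the input quantifies how much structure survives to be preserved.
corr_small = np.corrcoef(x0.ravel(), x_small.ravel())[0, 1]
corr_large = np.corrcoef(x0.ravel(), x_large.ravel())[0, 1]
```

The denoiser then runs from `x_small` or `x_large` backward to \(t = 0\); everything the noise destroyed is reinvented under the text prompt, which is exactly why large \(t^*\) trades faithfulness for editing freedom.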
Prompt-to-Prompt
Hertz et al. (2022) exploited an observation about cross-attention maps: when a model generates an image from a prompt, the cross-attention maps between text tokens and spatial positions encode which spatial regions correspond to which words. "A cat sitting on a chair" will produce high attention from the "cat" token over the cat's pixels and high attention from "chair" over the chair's pixels. Prompt-to-Prompt edits images by injecting the cross-attention maps from one generation (original prompt) into a second generation (edited prompt), forcing the spatial layout to be preserved even as the semantic content changes. Replacing "cat" with "dog" produces a dog in the same pose, on the same chair, with the same background — only the animal changes.
InstructPix2Pix
Brooks et al. (2023) trained an edit-conditional diffusion model directly on (image, instruction, edited-image) triples, where the training data was generated synthetically. GPT-3 generated diverse editing instructions ("make it raining," "change the wallpaper to brick," "add a hat"). Prompt-to-Prompt with Stable Diffusion generated paired before/after images. The resulting dataset, filtered by CLIP consistency, trained a model to follow natural-language editing instructions directly. The model is conditioned on both the input image (injected via channel concatenation into the U-Net) and the text instruction. Two separate guidance scales control how closely the output follows the instruction versus how closely it preserves the input image, enabling fine-grained control over the editing strength.
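The two-scale guidance combines three noise predictions per step. A numpy sketch of the combination described in the paper (variable names are illustrative):

```python
import numpy as np

def ip2p_guidance(eps_uncond, eps_img, eps_full, s_image: float, s_text: float):
    """InstructPix2Pix guidance: two scales over three noise predictions.

    eps_uncond: eps(z_t, no image, no text)
    eps_img:    eps(z_t, input image, no text)
    eps_full:   eps(z_t, input image, instruction)
    """
    return (eps_uncond
            + s_image * (eps_img - eps_uncond)
            + s_text * (eps_full - eps_img))

rng = np.random.default_rng(0)
e_u, e_i, e_f = rng.normal(size=(3, 4, 4))

# With both scales at 1 the rule reduces to the fully conditioned prediction:
assert np.allclose(ip2p_guidance(e_u, e_i, e_f, 1.0, 1.0), e_f)
```

Raising `s_text` pushes the output toward the instruction; raising `s_image` pushes it toward the input image, so the two scales directly trade edit strength against preservation.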
Inpainting — filling in a masked region while keeping the rest intact — is a common editing primitive handled by a specialized diffusion model variant. The mask is channel-concatenated with the noisy latent, and the non-masked region's latent is blended in at each diffusion step. Stable Diffusion's inpainting model was fine-tuned specifically on masked image reconstruction. More recent approaches like differential diffusion allow soft masks with per-pixel editing strength, enabling smooth transitions between edited and preserved regions.
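The per-step blending is a one-line masked combination. A numpy sketch, where the generated latent and the forward-noised known latent at the same timestep are assumed given:

```python
import numpy as np

def inpaint_blend(x_t_generated: np.ndarray, x_t_known: np.ndarray,
                  mask: np.ndarray) -> np.ndarray:
    """Blend at each diffusion step: keep known content outside the mask.

    mask is 1 where the model may repaint and 0 where the input must be kept;
    a soft mask in (0, 1) gives per-pixel editing strength, as in
    differential diffusion.
    """
    return mask * x_t_generated + (1.0 - mask) * x_t_known

rng = np.random.default_rng(0)
gen = rng.normal(size=(8, 8))     # model's denoised latent at step t
known = rng.normal(size=(8, 8))   # forward-noised input latent at step t
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0              # edit only the center patch

out = inpaint_blend(gen, known, mask)
```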
Video Generation: Temporal Consistency
Video generation is a strict superset of image generation — every frame must be high-quality on its own, and consecutive frames must be temporally consistent: objects maintain their identity, motion is smooth and physically plausible, lighting evolves coherently. This last requirement is what makes video hard. A model that generates each frame independently from a text prompt will produce a sequence of individually good images with completely inconsistent content — objects teleport, backgrounds change between frames, lighting flickers randomly.
The straightforward extension — treat a video clip as a 3D volume of shape \(T \times H \times W\) and run diffusion over this volume — runs into the curse of dimensionality immediately. A 16-frame 256×256 video has \(16 \times 256 \times 256 = 1{,}048{,}576\) elements per channel, compared to \(256 \times 256 = 65{,}536\) for a single image. Training memory grows linearly with the element count, and full self-attention cost grows quadratically: the 16× larger input implies roughly 256× more attention compute.
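A quick check of the arithmetic, including the quadratic cost of full self-attention over those elements:

```python
# Per-channel element counts for a 16-frame 256x256 clip vs. a single image,
# and the blow-up in full self-attention cost, which is quadratic in the
# number of positions attended over.
T, H, W = 16, 256, 256

video_elems = T * H * W        # elements per channel for the full 3D volume
image_elems = H * W            # elements per channel for one frame

memory_ratio = video_elems / image_elems            # linear: 16x
attn_ratio = (video_elems ** 2) / (image_elems ** 2)  # quadratic: 256x
```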
Temporal Attention and 3D U-Nets
The canonical approach (Ho et al., 2022; Singer et al., 2022) extends the image U-Net by adding temporal attention layers interleaved with the existing spatial attention layers. The spatial attention layers process each frame independently — attention is computed within the spatial dimensions of each frame. The temporal attention layers process each spatial position across all frames simultaneously — attention is computed along the time axis. The resulting "pseudo-3D" U-Net has significantly less memory overhead than full 3D attention while still allowing information to propagate across frames.
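The factorization is just two different reshapes around the same attention operation. A toy numpy sketch (no learned projections, single head), showing how spatial attention sequences run over \(H \times W\) within a frame while temporal sequences run over \(T\) at a fixed position:

```python
import numpy as np

def attention(x: np.ndarray) -> np.ndarray:
    """Minimal self-attention over the second-to-last axis (no projections)."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

T, H, W, C = 4, 8, 8, 16
video = np.random.default_rng(0).normal(size=(T, H, W, C))

# Spatial attention: each frame attends over its own H*W positions.
spatial = attention(video.reshape(T, H * W, C)).reshape(T, H, W, C)

# Temporal attention: each spatial position attends over the T frames.
as_time_seqs = video.transpose(1, 2, 0, 3).reshape(H * W, T, C)
temporal = attention(as_time_seqs).reshape(H, W, T, C).transpose(2, 0, 1, 3)
```

Spatial attention alone leaks no information across frames; interleaving the two layer types is what lets content propagate through time without paying for full \((T \cdot H \cdot W)^2\) attention.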
Training strategy matters enormously. Make-A-Video (Singer et al., 2022) froze the pretrained image diffusion weights and only trained the new temporal attention layers, initializing temporal attention as identity operations (each frame attends only to itself). This allowed the model to leverage the image model's rich visual knowledge while learning motion separately. The resulting model generates coherent short clips from text prompts even though no paired text-video data was used in training — the temporal attention layers learn what "plausible motion" looks like from unlabeled video clips, while the spatial quality is inherited from the image model.
Latent Video Diffusion
VideoLDM (Blattmann et al., 2023) and Stable Video Diffusion (SVD, Blattmann et al., 2023) run the diffusion process in the latent space of a video VAE that compresses both spatially and temporally. A video VAE with temporal compression factor 4 and spatial factor 8 compresses a \(T \times 512 \times 512\) video to \(T/4 \times 64 \times 64 \times 4\) latents, making the diffusion problem tractable. SVD is particularly notable for its image-to-video capability: given a single conditioning frame, it generates a 14- or 25-frame clip that animates from that image, following motion priors learned from large video datasets. This image-to-video setup is simpler to control than text-to-video because the first frame anchors the scene content.
Sora and Transformer-Based Video
OpenAI's Sora (Brooks et al., 2024) represented a qualitative leap in video generation quality, duration, and physical coherence. The technical report revealed that Sora replaced the temporal U-Net architecture with a Diffusion Transformer operating on spacetime patches, scaled to orders of magnitude more parameters and training compute than prior video models.
Spacetime Patch Tokenization
Sora's key architectural contribution is treating video as a sequence of spacetime patches — rectangular patches extracted from the video's spatial and temporal dimensions simultaneously — rather than as a sequence of frames. A video is encoded by a video VAE into a compressed latent, then divided into patches of shape \(t_p \times h_p \times w_p\) in the latent space. These patches are linearly projected into token embeddings and processed by a standard transformer with full self-attention across all spacetime positions.
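Patch extraction itself is a pair of reshapes. A numpy sketch on a toy latent (even divisibility is assumed here, whereas real systems pad and mask variable-size inputs):

```python
import numpy as np

def patchify(latent: np.ndarray, tp: int, hp: int, wp: int) -> np.ndarray:
    """Split a (T, H, W, C) latent into flattened spacetime patch tokens.

    Returns (num_tokens, tp*hp*wp*C). A linear projection of each row would
    then produce the token embeddings fed to the transformer.
    """
    T, H, W, C = latent.shape
    x = latent.reshape(T // tp, tp, H // hp, hp, W // wp, wp, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)   # group the three patch axes together
    return x.reshape(-1, tp * hp * wp * C)

# Toy latent: 8 latent frames of 32x32 with 4 channels.
latent = np.random.default_rng(0).normal(size=(8, 32, 32, 4))
tokens = patchify(latent, tp=2, hp=4, wp=4)  # (8/2)*(32/4)*(32/4) = 256 tokens
```

Doubling the clip length or frame size just yields proportionally more tokens from the same function, which is the mechanism behind the resolution- and duration-agnostic property described below.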
This approach has several important properties. First, it is resolution and duration agnostic: any video, regardless of aspect ratio, resolution, or length, produces a sequence of tokens of the appropriate length. The same model can generate a 2-second 480p clip or a 60-second 1080p clip by simply changing the number of tokens. Second, because attention is computed across all spacetime positions, the model can reason about global motion patterns, camera trajectories, and long-range temporal dependencies without the receptive field limitations of convolutional temporal U-Nets.
Training at Scale
Sora was trained on a large diverse dataset of videos with generated captions (recaptioned, per the DALL-E 3 approach) and images. Images are treated as single-frame videos during training, which allows the vast supply of labeled images to improve spatial quality while video data trains temporal dynamics. The model was trained at varying resolutions and durations simultaneously, using padding and masking to handle variable-length sequences within a batch.
Physical Plausibility and the "World Model" Debate
Sora's demos showed remarkable physical plausibility — objects cast consistent shadows, liquids flow realistically, camera motion matched the scene geometry — which prompted debate about whether large video generation models constitute a form of world model (a learned simulator of physics). The technical report explicitly describes Sora as "a world simulator" that has learned physical intuitions from video data. Critics noted that Sora also produces physically impossible generations (objects passing through each other, physics violations in complex interactions) and that its apparent physical understanding may be pattern matching to frequent video motifs rather than grounded physical simulation. The resolution of this debate has implications for whether video generation models can be directly used for autonomous agent planning and robotics.
Consistency Models in Practice
The diffusion models chapter introduced the theory of consistency models and progressive distillation. Here we examine how these techniques have been applied to production image generation systems, enabling interactive and real-time applications that were impossible with 50-step DDIM sampling.
Latent Consistency Models (LCM)
LCM (Luo et al., 2023) applied consistency distillation directly to Stable Diffusion's latent space. Rather than distilling the pixel-space diffusion ODE, LCM distills the augmented probability flow ODE that accounts for classifier-free guidance — producing a model that performs the equivalent of guided diffusion in 2–4 steps. LCM-LoRA further packages the distilled consistency property as a LoRA adapter that can be applied to any fine-tuned SD model without retraining, making 4-step generation compatible with the entire ecosystem of community models.
The perceptual improvement of LCM over naive step-reduction (taking 4 DDIM steps instead of 50) is substantial. DDIM at 4 steps produces blurry, low-detail images because the ODE integration error is too large. Consistency distillation teaches the model to produce the endpoint of the ODE trajectory in a single (or few) steps, which is a fundamentally different prediction target that does not accumulate integration error.
SDXL-Turbo and Adversarial Distillation
SDXL-Turbo (Sauer et al., 2023) used Adversarial Diffusion Distillation (ADD) — training a student SDXL model to match the teacher's output distribution using a combination of score distillation and adversarial loss. A discriminator is trained to distinguish real images from 1-step student samples, providing a richer training signal than the MSE-based consistency loss. The resulting model generates high-quality 512×512 images in a single denoising step — approximately 200ms on an A100 GPU — which is fast enough for near-real-time interactive applications.
FLUX.1-Schnell used a similar distillation approach on the Flux architecture, producing a 4-step model that retains most of the quality of the full Flux.1 Pro while being suitable for local deployment on consumer hardware. These fast models have enabled new interaction paradigms: real-time sketch-to-image (the model regenerates as the user draws), interactive style transfer, and on-device generation on mobile phones.
| Method | Base model | Steps | Technique | Quality vs. full |
|---|---|---|---|---|
| LCM | SD v1.5 / SDXL | 2–4 | Consistency distillation | Good (slight detail loss) |
| SDXL-Turbo | SDXL | 1–4 | Adversarial distillation | Very good |
| Lightning | SDXL | 1–8 | Progressive distillation | Very good |
| Flux.1-Schnell | Flux | 4 | Flow matching distillation | Excellent |
| Hyper-SD | SD / SDXL | 1 | Consistency + adversarial | Good |
Evaluating Image and Video Generation
Evaluating generative models is genuinely difficult. We want to measure both quality (do individual samples look good?) and diversity (does the model cover the full range of the data distribution?), as well as prompt fidelity (do outputs match the conditioning text?) and safety (does the model avoid generating harmful content?). No single metric captures all of these simultaneously.
Fréchet Inception Distance (FID)
FID remains the most widely reported single metric for image generation quality. It computes the Fréchet distance between the distribution of Inception-v3 features extracted from generated images and from real images:

\[ \mathrm{FID} = \lVert \boldsymbol{\mu}_r - \boldsymbol{\mu}_g \rVert_2^2 + \operatorname{Tr}\!\left( \boldsymbol{\Sigma}_r + \boldsymbol{\Sigma}_g - 2\,(\boldsymbol{\Sigma}_r \boldsymbol{\Sigma}_g)^{1/2} \right) \]

where \(\boldsymbol{\mu}_r, \boldsymbol{\Sigma}_r\) and \(\boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g\) are the mean and covariance of the real and generated feature distributions, each modeled as a Gaussian. Lower is better.
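A numpy sketch of the computation from precomputed \((N, d)\) feature arrays, fitting a Gaussian to each set; the identity \(\operatorname{Tr}((\Sigma_r\Sigma_g)^{1/2}) = \operatorname{Tr}((\Sigma_g^{1/2}\Sigma_r\Sigma_g^{1/2})^{1/2})\) keeps every matrix square root symmetric PSD:

```python
import numpy as np

def psd_sqrtm(S: np.ndarray) -> np.ndarray:
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """FID from (N, d) feature arrays (Inception features, in practice)."""
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    S_r = np.cov(feats_real, rowvar=False)
    S_g = np.cov(feats_gen, rowvar=False)
    sg_half = psd_sqrtm(S_g)
    cross = psd_sqrtm(sg_half @ S_r @ sg_half)   # Tr equals Tr((S_r S_g)^{1/2})
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(S_r) + np.trace(S_g) - 2.0 * np.trace(cross))

# Sanity checks on synthetic features: identical sets score ~0; a unit mean
# shift in each of 8 dimensions adds exactly 8 to the distance.
rng = np.random.default_rng(0)
a = rng.normal(size=(2000, 8))
score_same = fid(a, a)
score_shift = fid(a, a + 1.0)
```

Real FID uses 2048-dimensional Inception-v3 pool features and tens of thousands of samples; with too few samples the covariance estimates are noisy, which is one source of the sample-size sensitivity noted below.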
FID has known limitations: it is sensitive to the number of samples used, the Inception network used for feature extraction may not capture all perceptually important dimensions, and it tends to reward diversity in Inception feature space rather than fine-grained visual quality. FID also does not measure text-image alignment — a model that generates perfectly diverse realistic images that ignore the text prompt entirely would score well.
CLIP Score
CLIP score measures semantic alignment between generated images and their text prompts: compute the CLIP embedding of both image and text, and report their cosine similarity. High CLIP score means the image semantically matches the prompt in CLIP's embedding space. The limitation is that CLIP embeddings are coarse — they may not capture fine-grained attribute binding, precise counts, or spatial relationships. A model can get a high CLIP score by generating a dog next to a ball without correctly placing them in the described spatial arrangement.
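The metric itself is just a rescaled cosine similarity. The sketch below follows the common CLIPScore convention of scaling by 2.5 and clipping at zero; the embeddings are placeholders for real CLIP encoder outputs.

```python
import numpy as np

def clip_score(image_emb, text_emb, w=2.5):
    """CLIP score: w * max(0, cosine similarity) between an image
    embedding and a text embedding from the same CLIP model."""
    i = image_emb / np.linalg.norm(image_emb)
    t = text_emb / np.linalg.norm(text_emb)
    return w * max(0.0, float(i @ t))
```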
GenEval
GenEval (Ghosh et al., 2023) is a compositional generation benchmark that evaluates specific capabilities using automated VQA (visual question answering) rather than embedding similarity. It tests single object generation, two-object generation, counting, colors, color attribution (binding colors to specific objects), and spatial relationships (left/right, above/below). A vision-language model (e.g., OWL-ViT or BLIP-2) answers questions about generated images, allowing fine-grained measurement of specific failure modes. GenEval revealed that most models, including DALL-E 3 and SDXL, fail frequently on two-object color attribution and spatial relationships — "a red cube to the left of a blue sphere" is still a hard generation task.
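The evaluation pattern is easy to see in a sketch. The `vqa` callable below is a hypothetical stand-in for the detector/VQA model; the point is that each attribute binding in the prompt becomes a separately checkable yes/no question, so failures can be attributed to a specific skill.

```python
def geneval_color_attribution(vqa, image, objects):
    """GenEval-style color attribution check (illustrative only).
    vqa: callable (image, question) -> bool, standing in for a real
         vision-language model.
    objects: list of (name, color) pairs from the prompt.
    The sample passes only if every binding is confirmed."""
    return all(
        vqa(image, f"Is there a {color} {name}?") for name, color in objects
    )
```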
Video Metrics: FVD and VBench
Fréchet Video Distance (FVD) is the video analog of FID, comparing distributions of I3D (Inflated 3D ConvNet) features between real and generated video clips. VBench (Huang et al., 2023) provides a more comprehensive battery of video generation metrics: temporal consistency (do adjacent frames agree?), motion smoothness, subject consistency across frames, background consistency, aesthetic quality, imaging quality, and semantic alignment with the text prompt — each evaluated by specialized models. Together these metrics provide a more nuanced picture of video generation quality than any single number.
Human Evaluation
Automated metrics do not fully capture human preferences, which are sensitive to subtle artifacts (unnatural skin texture, impossible hand anatomy, font-like text) that automated feature extractors miss. Standard human evaluation uses side-by-side comparisons (A/B tests) where evaluators choose the better image given the same prompt, or absolute ratings on quality/fidelity/aesthetics scales. ELO-style ranking systems like LMSYS's image arena aggregate pairwise human preferences across many models and prompts, providing a more robust ranking than any single study. Human evaluation is expensive and subject to annotator biases but remains the gold standard for measuring real-world generation quality.
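The aggregation step in arena-style leaderboards can be illustrated with the classic online Elo update (real leaderboards typically fit a Bradley-Terry model over all comparisons, but the intuition is the same: each pairwise preference nudges two ratings in opposite directions).

```python
def elo_update(r_a, r_b, a_wins, k=32.0):
    """One Elo update from a single pairwise human preference.
    r_a, r_b: current ratings of models A and B.
    a_wins: True if the evaluator preferred A's image."""
    expect_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expect_a)
    return r_a + delta, r_b - delta
```

A win over an equally rated opponent moves both ratings by k/2; a win over a much stronger opponent moves them by nearly k, which is what makes the aggregate ranking converge from noisy individual judgments.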
The Generation Landscape
By 2024 the image generation landscape had stratified into two tiers. Closed commercial systems (DALL-E 3, Midjourney, Adobe Firefly, Imagen 3) competed on quality, ease of use, and safety filters, with pricing models based on image credits or subscriptions. Open systems (Stable Diffusion, Flux) competed on customizability, local deployment, and community tooling. The two tiers developed distinct user bases: professionals wanting seamless integration and content safety for commercial work; artists and hobbyists wanting maximum control and freedom from content restrictions.
Safety and Content Moderation
Safety is a first-class concern for deployed generation systems. The primary mechanisms are: training data filtering (removing CSAM, non-consensual imagery, and other harmful content from training datasets); NSFW classifiers on outputs (blocking generated images that match categories of harmful content); prompt filtering (blocking prompts that match known harmful patterns); and watermarking (adding imperceptible signals to generated images that identify them as AI-generated for downstream verification). Adobe Firefly was specifically trained on licensed Adobe Stock imagery to avoid copyright concerns, demonstrating that responsible training data curation is technically feasible at scale. Google's SafeSearch and OpenAI's safety systems add additional layers, but adversarial prompt techniques to bypass them remain an ongoing challenge.
Creative Tools Integration
Generation models are increasingly embedded in professional creative tools rather than accessed as standalone APIs. Adobe Firefly is integrated into Photoshop's Generative Fill, Illustrator's Generative Expand, and Express. Canva integrates multiple models for magic generation and background replacement. Video generation is being integrated into Adobe Premiere, DaVinci Resolve, and Runway's creative platform. This integration shifts the interaction model from "generate image from prompt" to "assist with specific creative steps in an existing workflow" — a harder but more commercially valuable task.
Remaining Hard Problems
Despite remarkable progress, several problems remain persistently difficult. Hands and anatomy improved substantially between 2022 and 2024 but still fail on complex poses and occlusions. Text rendering in images is now feasible with Flux and DALL-E 3 on short strings but degrades on long or styled text. Compositional generation — correctly binding attributes to objects and placing objects in specified spatial relationships — remains below human performance on systematic benchmarks. Long video coherence — maintaining consistent character identity, background, and physics across clips longer than a few seconds — is improving rapidly but not solved. Consistency across views — generating the same scene or character from multiple viewpoints with 3D consistency — is the bridge to 3D generation covered in the next chapter.
Perhaps the deepest open question is whether scaling alone closes these gaps, or whether new architectural or training innovations are required. The history of the field suggests both: scaling delivers consistent quality improvements, but the qualitative leaps — from GANs to diffusion, from pixel space to latent space, from CLIP to T5 to recaptioning — have been architectural. The current bet in the field is on scaling video models specifically: the intuition is that videos are an implicit 3D simulation, and a model that has seen enough video has implicitly learned enough about 3D structure and physics to generalize beyond current capabilities.
Key Papers
- Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2 / unCLIP). Introduces the CLIP prior + diffusion decoder two-stage architecture. Demonstrates CLIP-guided generation and image variation. The architectural midpoint between DALL-E 1's autoregressive approach and DALL-E 3's latent diffusion.
- Improving Image Generation with Better Captions (DALL-E 3). Introduces synthetic recaptioning as the key lever for improved prompt fidelity. Shows that training data quality dominates architectural improvements. The most influential training insight since classifier-free guidance.
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen). Demonstrates the superiority of T5-XXL over CLIP for text conditioning, introduces dynamic thresholding, and shows cascaded diffusion surpassing prior SOTA on DrawBench. The paper that established large language model text encoders as the default for high-fidelity generation.
- DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. Introduces subject-driven fine-tuning with prior preservation loss for personalized image generation. The personalization paper most responsible for the explosion of custom model fine-tuning.
- Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet). Introduces zero-initialized trainable encoder copies for spatial conditioning without retraining the base model. The foundational technique for spatial control that every production image pipeline now incorporates.
- InstructPix2Pix: Learning to Follow Image Editing Instructions. Generates synthetic training data with GPT-3 + Prompt-to-Prompt and trains an instruction-following image edit model. The standard approach to natural-language-guided image editing.
- Video Generation Models as World Simulators (Sora). Describes the spacetime patch tokenization, DiT architecture, and variable-duration training that produced Sora's video generation capabilities. The architecture paper that redefined state of the art for video generation and sparked the "world model" debate.
- Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference. Applies consistency distillation to SD's latent space with CFG-augmented ODE, enabling 2–4 step high-quality generation. Brought real-time image generation within reach of consumer hardware.
- GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment. A compositional generation benchmark using VQA evaluation of specific skills: object presence, counting, color attribution, spatial relations. The most diagnostic evaluation framework for understanding what text-to-image models actually get right and wrong.