Diffusion Models: Learning to Reverse the Noise
Diffusion models learn to generate data by learning to reverse a gradual noising process. Start with a clean image, systematically destroy it with Gaussian noise over hundreds of steps, then train a neural network to run that process in reverse. The insight is profound in its simplicity: if you can learn to remove noise, you can learn to create structure from nothing. This framework now underlies the best image, audio, and video generators in the world.
Prerequisites
This chapter assumes solid footing in probability theory — particularly Gaussian distributions, Bayes' theorem, and KL divergence (Part I Ch 04 and Ch 06). Variational autoencoders (Part X Ch 01) are the most important conceptual predecessor; the ELBO derivation and reparameterization trick appear in a new form here. Score functions and Langevin dynamics are introduced from scratch, but familiarity with stochastic differential equations will enrich Section 8. Neural network fundamentals (Part V Ch 01–02) and U-Net architectures are assumed for the denoiser model sections.
The Core Intuition: Destruction and Reconstruction
Every generative model has to answer the same question: how do we describe a probability distribution over high-dimensional data — images, audio waveforms, molecular structures — in a way that lets us both evaluate likelihoods and draw new samples? GANs avoid the question with an adversarial game. VAEs introduce a latent space and bound the likelihood. Normalizing flows construct exact invertible maps. Diffusion models take a completely different approach: they define the generative process as the time-reversal of a noise-injection process that they already know.
Here is the core idea. Take a real image \(\mathbf{x}_0\). Add a small amount of Gaussian noise to get \(\mathbf{x}_1\). Add more noise to get \(\mathbf{x}_2\). Keep going for \(T\) steps, each time adding a little more noise, until at step \(T\) you have \(\mathbf{x}_T\) — a sample that looks essentially like pure noise, indistinguishable from \(\mathcal{N}(\mathbf{0}, \mathbf{I})\). This is the forward process, and it is fully defined by the noise schedule. It requires no learning.
Now ask: what is the reverse? If you knew the distribution \(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\) — the conditional distribution of the slightly less noisy image given a noisy one — you could run the process backward. Start from random noise at step \(T\) and iteratively denoise to recover a clean image at step \(0\). That reverse conditional is intractable to compute directly, but a neural network can learn to approximate it. This is the reverse process, and it is what gets trained.
The elegance of this framework is that the forward process is mathematically tractable — you can write down its marginals in closed form — which makes it possible to train the reverse process efficiently. And unlike GANs, training is stable: there is no adversarial game, just a regression objective. Unlike VAEs, there is no encoder at inference time and the latent space has the same dimensionality as the data. Unlike normalizing flows, the architecture of the denoiser is unconstrained.
Each denoising step only asks the network to solve a relatively easy problem: given an image with a known noise level, estimate what the slightly cleaner version looks like. No single step asks the model to conjure structure from nothing. The hard work of creating coherent global structure is distributed across hundreds of small steps, each individually tractable.
The Forward Process: Gradual Corruption
The forward process is a fixed (not learned) Markov chain that gradually corrupts data. At each step \(t\), we add a small amount of Gaussian noise, scaling the signal down and adding noise up:
\[ q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\ \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\ \beta_t \mathbf{I}\right), \qquad t = 1, \ldots, T \]
where \(\{\beta_t\}_{t=1}^{T}\) is the variance (noise) schedule.
The most useful property of this Gaussian chain is that you can sample \(\mathbf{x}_t\) at any arbitrary timestep directly from \(\mathbf{x}_0\), without having to iterate through all intermediate steps. Define \(\alpha_t = 1 - \beta_t\) and \(\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s\). Then the marginal distribution at step \(t\) is:
\[ q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_t;\ \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\right), \quad \text{equivalently} \quad \mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \]
This reparameterization is crucial for training efficiency. Rather than running the Markov chain step by step during each training iteration, we can directly sample any noisy version of any training image in a single operation. At \(t=0\), \(\bar{\alpha}_0 = 1\) (the empty product), so the sample is the clean image with no noise. As \(t \to T\), \(\bar{\alpha}_T \to 0\), so the distribution approaches \(\mathcal{N}(\mathbf{0}, \mathbf{I})\).
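A minimal NumPy sketch of this one-shot sampling, using the DDPM linear schedule and a random array standing in for a normalized image:

```python
import numpy as np

def make_linear_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule from the original DDPM paper, plus cumulative alpha-bars."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot: sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps, eps

rng = np.random.default_rng(0)
betas, alpha_bars = make_linear_schedule()
x0 = rng.standard_normal((3, 32, 32))         # stand-in for a clean, normalized image
xt, eps = q_sample(x0, 500, alpha_bars, rng)  # noisy sample at t=500, one operation
```

No Markov-chain iteration is needed; any \((\mathbf{x}_t, \boldsymbol{\epsilon})\) training pair comes from a single call.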
Noise Schedules
The choice of schedule \(\{\beta_t\}\) matters. The original DDPM paper used a linear schedule with \(\beta_t\) increasing uniformly from \(10^{-4}\) to \(0.02\). However, with a linear schedule on raw pixel space, the final steps are largely wasted — \(\bar{\alpha}_t\) reaches near-zero well before \(t = T\), meaning many timesteps add effectively no new information. Improved DDPM (Nichol & Dhariwal, 2021) introduced a cosine schedule:
\[ \bar{\alpha}_t = \frac{f(t)}{f(0)}, \qquad f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right) \]
where \(s\) is a small offset (0.008 in the paper) that prevents the noise variance from becoming vanishingly small near \(t = 0\).
More recent work frames schedule design in terms of the signal-to-noise ratio (SNR): \(\text{SNR}(t) = \bar{\alpha}_t / (1 - \bar{\alpha}_t)\). A good schedule should cover a wide range of SNR values, from the high-SNR regime where only fine details are corrupted, down to the low-SNR regime where global structure is being determined. Research continues into optimal schedule design, including learned schedules and schedules tailored to specific data modalities.
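The two schedules can be compared directly through their \(\bar{\alpha}_t\) curves. A NumPy sketch (using the \(s = 0.008\) offset from Improved DDPM) shows the linear schedule driving the signal to near zero long before \(t = T\), while the cosine schedule decays more evenly:

```python
import numpy as np

def alpha_bar_linear(T=1000, b0=1e-4, b1=0.02):
    """Cumulative product of (1 - beta) for the linear schedule."""
    return np.cumprod(1.0 - np.linspace(b0, b1, T))

def alpha_bar_cosine(T=1000, s=0.008):
    """Cosine schedule of Improved DDPM: abar(t) = f(t) / f(0)."""
    t = np.arange(1, T + 1)
    f = np.cos(((t / T + s) / (1 + s)) * np.pi / 2) ** 2
    f0 = np.cos((s / (1 + s)) * np.pi / 2) ** 2
    return f / f0

lin, cos_ = alpha_bar_linear(), alpha_bar_cosine()
snr = lambda ab: ab / (1.0 - ab)   # signal-to-noise ratio at each timestep
# At t = 600 the linear schedule has already collapsed (lin[599] ~ 0.03)
# while the cosine schedule still retains substantial signal (cos_[599] ~ 0.34).
```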
The Reverse Process and DDPM
Ho et al. (2020) — the DDPM paper — define the reverse process as a learned Markov chain running backward from \(t=T\) to \(t=0\). The joint distribution of the reverse process is:
\[ p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t), \qquad p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\ \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\ \sigma_t^2 \mathbf{I}\right) \]
with the prior \(p(\mathbf{x}_T) = \mathcal{N}(\mathbf{0}, \mathbf{I})\); Ho et al. fix the variances \(\sigma_t^2\) to untrained, time-dependent constants.
The reverse conditional \(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\) is intractable because it requires integrating over all possible data. But a remarkable fact saves us: conditioned on \(\mathbf{x}_0\), the reverse conditional is tractable — it is Gaussian with a closed-form mean and variance:
\[ q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\ \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0),\ \tilde{\beta}_t \mathbf{I}\right) \]
\[ \tilde{\boldsymbol{\mu}}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\mathbf{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t, \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t \]
This is the target: we want \(p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\) to match \(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)\). Since we do not have \(\mathbf{x}_0\) at inference time, the network must predict it — or, equivalently, predict the noise \(\boldsymbol{\epsilon}\) that was added. Ho et al. found that parameterizing the network to predict noise led to better results:
\[ \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right) \]
The Simplified Training Objective
After deriving the ELBO (detailed in the next section), Ho et al. found that a simplified loss — ignoring most weighting terms — works just as well or better in practice:
\[ \mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\,\mathbf{x}_0,\,\boldsymbol{\epsilon}}\left[\left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\!\left(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\ t\right) \right\|^2\right] \]
with \(t\) drawn uniformly from \(\{1, \ldots, T\}\) and \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\).
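A minimal sketch of one training-loss evaluation, with a placeholder `eps_model` standing in for the trained denoiser network:

```python
import numpy as np

def ddpm_loss(eps_model, x0, alpha_bars, rng):
    """L_simple: pick a uniform random t, noise x0 to x_t, regress the added noise."""
    t = rng.integers(0, len(alpha_bars))                    # uniform timestep
    eps = rng.standard_normal(x0.shape)                     # the noise actually added
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(xt, t)) ** 2)           # uniform weighting over t

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
x0 = rng.standard_normal((3, 32, 32))
# Placeholder model predicting zero noise: the loss is then roughly E||eps||^2 = 1.
loss = ddpm_loss(lambda xt, t: np.zeros_like(xt), x0, alpha_bars, rng)
```

Each training step touches only one random timestep per example, which is what makes the objective cheap despite the thousand-step chain.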
Sampling from a trained DDPM requires running the reverse chain for all \(T\) steps (typically \(T = 1000\)), which is the main computational bottleneck. At each step, we compute the predicted noise, reconstruct the mean, and add a small amount of noise appropriate to that step. The output at step 0 is the generated sample.
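The full ancestral sampling loop can be sketched as follows (NumPy, again with a placeholder noise predictor in place of a trained network):

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, alpha_bars, rng):
    """Run the learned reverse chain from pure noise at t = T-1 down to t = 0."""
    x = rng.standard_normal(shape)                          # x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = eps_model(x, t)
        # Mean of p(x_{t-1} | x_t) under the noise parameterization
        mean = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(1 - betas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0   # no noise at the last step
        x = mean + np.sqrt(betas[t]) * noise
    return x

rng = np.random.default_rng(1)
betas = np.linspace(1e-4, 0.02, 100)        # short chain just for illustration
alpha_bars = np.cumprod(1.0 - betas)
sample = ddpm_sample(lambda x, t: np.zeros_like(x), (4,), betas, alpha_bars, rng)
```

Note the loop is inherently sequential — each step needs the previous one — which is exactly the bottleneck that DDIM and later samplers attack.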
The ELBO and Training Objective
The formal justification for the DDPM training objective comes from maximizing a variational lower bound on the log-likelihood of the data. The derivation reveals the deep connection between diffusion models and VAEs — diffusion can be viewed as an extremely deep hierarchical VAE with a fixed encoder and learned decoder.
We want to maximize \(\log p_\theta(\mathbf{x}_0)\). Using the forward process as a variational approximation to the true posterior, the ELBO is:
\[ \log p_\theta(\mathbf{x}_0) \geq \mathbb{E}_{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\left[\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\right] \]
This ELBO can be decomposed into interpretable terms by applying the Markov structure of both the forward and reverse processes. After careful manipulation, three types of terms emerge:
- Reconstruction term \(\mathcal{L}_0 = -\mathbb{E}_q[\log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)]\): how well the final denoising step recovers the original image. In practice this is a discrete likelihood term over pixel values.
- Prior matching term \(\mathcal{L}_T = D_{\mathrm{KL}}(q(\mathbf{x}_T \mid \mathbf{x}_0) \| p(\mathbf{x}_T))\): how close the fully-noised distribution is to the prior. Since the forward process is fixed, this term has no trainable parameters and can be ignored during training.
- Denoising terms \(\mathcal{L}_{t-1} = D_{\mathrm{KL}}(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \| p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t))\) for \(t = 2, \ldots, T\): the KL divergence between the tractable forward posterior and the learned reverse step. This is where the main learning happens.
Because both distributions in each \(\mathcal{L}_{t-1}\) term are Gaussian, the KL divergence has a closed form that reduces to a weighted MSE between the true and predicted means. Substituting the noise parameterization yields the simplified objective above.
Loss Weighting
The full ELBO involves a per-timestep weight \(\lambda_t = \frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar{\alpha}_t)}\). This weight is large for small \(t\) (fine-detail denoising) and small for large \(t\) (rough structure). Ho et al. found empirically that dropping this weighting — using uniform weighting across all timesteps — improved sample quality. This upweights the high-noise timesteps, which might seem counterintuitive but works because those steps are harder and more informative for learning global structure.
Diffusion as a Hierarchical VAE
A VAE has a single latent variable and a learned encoder. Diffusion can be seen as a \(T\)-level hierarchical VAE where the encoder \(q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)\) is fixed (no parameters), and the decoder \(p_\theta(\mathbf{x}_{0:T})\) is learned. The ELBO structure is identical. The key difference is that fixing the encoder removes the need for amortized inference and makes each training step cheaper, at the cost of requiring many decoding steps at inference time.
Score Matching and Score-Based Models
In parallel with DDPM, Song & Ermon (2019, 2020) developed score-based generative models from a different starting point — and the two approaches turned out to be deeply equivalent. Their framework starts with the score function: the gradient of the log-probability density with respect to the data.
If you knew the score function of the data distribution, you could generate samples via Langevin dynamics: start from any point and follow a noisy gradient ascent toward regions of high density:
\[ \mathbf{x}_{k+1} = \mathbf{x}_k + \frac{\epsilon}{2}\,\nabla_{\mathbf{x}} \log p(\mathbf{x}_k) + \sqrt{\epsilon}\,\mathbf{z}_k, \qquad \mathbf{z}_k \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \]
As the step size \(\epsilon \to 0\) and the number of steps grows, the iterates converge to samples from \(p(\mathbf{x})\).
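Langevin dynamics is easy to sketch on a toy target where the score is known in closed form — for \(\mathcal{N}(\boldsymbol{\mu}, \mathbf{I})\) the score is \(-(\mathbf{x} - \boldsymbol{\mu})\):

```python
import numpy as np

def langevin_sample(score, x_init, step=0.01, n_steps=1000, rng=None):
    """Unadjusted Langevin dynamics: x <- x + (step/2) * score(x) + sqrt(step) * z."""
    rng = rng if rng is not None else np.random.default_rng(0)
    x = np.array(x_init, dtype=float)
    for _ in range(n_steps):
        x = x + 0.5 * step * score(x) + np.sqrt(step) * rng.standard_normal(x.shape)
    return x

# Toy target N(mu, I): chains started at the origin should settle near mu.
mu = np.array([3.0, -1.0])
samples = np.stack([
    langevin_sample(lambda x: -(x - mu), np.zeros(2), rng=np.random.default_rng(i))
    for i in range(100)
])
```

In a real diffusion model the analytic score is replaced by the learned network \(\mathbf{s}_\theta\).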
Score Matching
We cannot directly compute \(\nabla_{\mathbf{x}} \log p(\mathbf{x})\) since \(p(\mathbf{x})\) is unknown. But we can train a neural network \(\mathbf{s}_\theta(\mathbf{x})\) to approximate it. The score matching objective minimizes the expected squared distance between the network's output and the true score:
\[ \mathcal{L}_{\text{SM}} = \mathbb{E}_{p(\mathbf{x})}\left[\left\| \mathbf{s}_\theta(\mathbf{x}) - \nabla_{\mathbf{x}} \log p(\mathbf{x}) \right\|^2\right] \]
This looks circular, since the true score appears in the objective, but denoising score matching replaces it with a tractable target computed from noise-perturbed data.
The denoising score matching objective is strikingly familiar: if you perturb \(\mathbf{x}_0\) by noise \(\boldsymbol{\epsilon}\) to get \(\mathbf{x}_t\), the optimal score network satisfies \(\mathbf{s}_\theta(\mathbf{x}_t, t) = -\boldsymbol{\epsilon} / \sqrt{1-\bar{\alpha}_t}\). This is precisely the noise prediction objective of DDPM, scaled by \(-1/\sqrt{1-\bar{\alpha}_t}\). The two frameworks are equivalent.
Noise Conditional Score Networks (NCSN)
Song & Ermon's NCSN perturbs data with noise at multiple scales \(\sigma_1 > \sigma_2 > \cdots > \sigma_L\) and trains a single score network \(\mathbf{s}_\theta(\mathbf{x}, \sigma)\) conditioned on the noise level. Sampling proceeds via annealed Langevin dynamics: start from a high noise level where the score landscape is smooth and unimodal, run Langevin dynamics to approximate a sample from the perturbed distribution, then anneal to the next lower noise level and repeat. Each successive noise level provides a warm start from the previous level. This annealing is conceptually identical to the DDPM reverse chain.
Unifying View: Diffusion as SDEs
Song et al. (2021) unified DDPM and score-based models by taking the continuous-time limit. As \(T \to \infty\) and step sizes \(\beta_t\) become infinitesimal, the discrete Markov chain converges to a stochastic differential equation (SDE):
\[ d\mathbf{x} = f(\mathbf{x}, t)\,dt + g(t)\,d\mathbf{w} \]
where \(\mathbf{w}\) is a Wiener process. The DDPM chain corresponds to the variance-preserving (VP) SDE, \(d\mathbf{x} = -\tfrac{1}{2}\beta(t)\,\mathbf{x}\,dt + \sqrt{\beta(t)}\,d\mathbf{w}\).
By Anderson's (1982) reverse-time SDE theorem, the reverse process is also an SDE, and its drift depends on the score:
\[ d\mathbf{x} = \left[f(\mathbf{x}, t) - g(t)^2\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right]dt + g(t)\,d\bar{\mathbf{w}} \]
where \(\bar{\mathbf{w}}\) is a Wiener process running backward in time. Learning the score \(\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\) at every noise level is therefore sufficient to sample.
Probability Flow ODE and DDIM
Every SDE has a corresponding probability flow ODE that induces the same marginal distributions at every time \(t\), but without stochastic noise:
\[ \frac{d\mathbf{x}}{dt} = f(\mathbf{x}, t) - \frac{1}{2}\,g(t)^2\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \]
DDIM (Song et al., 2020) derived a non-Markovian sampling process that generalizes DDPM. Whereas DDPM requires all \(T\) reverse steps, DDIM can take large strides through the timestep sequence — skipping from \(t=1000\) to \(t=800\) to \(t=600\), etc. — and still produce good samples. With 50 steps instead of 1000, DDIM achieves comparable quality to DDPM while being 20× faster. The key insight is that DDIM's update rule is a discretization of the probability flow ODE, not the stochastic SDE — and the ODE is smoother, so larger steps remain accurate.
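A sketch of the deterministic DDIM update (\(\eta = 0\)), striding through a subset of timesteps, with a placeholder noise predictor:

```python
import numpy as np

def ddim_sample(eps_model, shape, alpha_bars, n_steps=50, rng=None):
    """Deterministic DDIM: visit only n_steps of the T timesteps."""
    rng = rng if rng is not None else np.random.default_rng(0)
    ts = np.linspace(len(alpha_bars) - 1, 0, n_steps).round().astype(int)
    x = rng.standard_normal(shape)                       # start from pure noise
    for i, t in enumerate(ts):
        ab_t = alpha_bars[t]
        ab_prev = alpha_bars[ts[i + 1]] if i + 1 < len(ts) else 1.0
        eps_hat = eps_model(x, t)
        x0_hat = (x - np.sqrt(1 - ab_t) * eps_hat) / np.sqrt(ab_t)      # predicted clean image
        x = np.sqrt(ab_prev) * x0_hat + np.sqrt(1 - ab_prev) * eps_hat  # deterministic jump
    return x

alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
sample = ddim_sample(lambda x, t: np.zeros_like(x), (4,), alpha_bars, n_steps=50)
```

Each iteration first predicts \(\hat{\mathbf{x}}_0\), then re-noises it to the next (much earlier) timestep — there is no fresh randomness after the initial draw.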
Because DDIM defines a deterministic mapping from noise \(\mathbf{x}_T\) to sample \(\mathbf{x}_0\), you can encode two images into their latent noise vectors, interpolate between those vectors, and decode to get semantically meaningful interpolations. This would be impossible with the stochastic DDPM sampler.
Denoiser Architecture: U-Nets and Transformers
The choice of architecture for the denoising network \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\) is separate from the diffusion mathematics — any architecture that takes a noisy image and a timestep as input and outputs a noise estimate can be plugged in. In practice, two architectures have dominated: U-Net with attention, and the Diffusion Transformer (DiT).
U-Net with Attention
The original DDPM paper and most subsequent image diffusion models use a U-Net: an encoder-decoder convolutional network with skip connections between corresponding encoder and decoder layers. The architecture was originally developed for biomedical image segmentation, and it turns out to be well-suited for denoising because it captures both local texture (via shallow features) and global structure (via bottleneck features).
Key modifications for diffusion: the timestep \(t\) is embedded using sinusoidal position encodings (borrowing from transformers), then projected and added to the feature maps via learned scale-and-shift operations (adaptive group normalization). Multi-head self-attention is added at lower-resolution feature maps in the bottleneck and some intermediate layers, allowing long-range spatial dependencies to be captured. The ADM model (Dhariwal & Nichol, 2021) achieved state-of-the-art FID by significantly scaling up this architecture and adding class conditioning.
Diffusion Transformer (DiT)
Peebles & Xie (2023) showed that a pure transformer architecture, DiT, matches or outperforms U-Net diffusion models when trained at scale. The image is first patchified into a sequence of tokens (as in ViT), then processed by a stack of transformer blocks with adaptive layer norm (adaLN) for conditioning on timestep and class label. The simplicity of the transformer scaling laws — more layers and wider dimensions predictably improve FID — made DiT an attractive foundation for large-scale systems. Stable Diffusion 3, Flux, and other recent systems use DiT or DiT-derived architectures.
Timestep Embedding
The denoiser must know what noise level it is operating at. The standard approach uses sinusoidal embeddings analogous to transformer position encodings, then passes through a two-layer MLP to produce a conditioning vector. This vector modulates the feature maps via scale-and-shift (AdaGN or adaLN), with learned per-layer scale and bias that depend on the timestep. Ensuring the network can distinguish nearby timesteps precisely is critical: a network that cannot tell the difference between \(t=100\) and \(t=120\) will incorrectly calibrate its denoising strength.
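The sinusoidal embedding itself is a few lines; a sketch (the `dim` and `max_period` values are illustrative defaults):

```python
import numpy as np

def timestep_embedding(t, dim=128, max_period=10000.0):
    """Sinusoidal timestep embedding, analogous to transformer position encodings."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)  # geometric frequency ladder
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

e100, e120 = timestep_embedding(100), timestep_embedding(120)
# Nearby timesteps map to similar but clearly distinguishable vectors,
# which is what lets the MLP on top calibrate denoising strength per step.
```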
Classifier Guidance and Classifier-Free Guidance
An unconditional diffusion model generates diverse samples but cannot be steered toward a particular class or text description. Guidance mechanisms allow a trained diffusion model to sample from a conditional distribution \(p(\mathbf{x} \mid \mathbf{c})\) — where \(\mathbf{c}\) is a class label, text prompt, or any other conditioning signal — by using the score of the conditional distribution.
Classifier Guidance
Dhariwal & Nichol (2021) showed that a separately trained classifier on noisy images can guide the diffusion sampling process. Applying Bayes' theorem to the score:
\[ \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t \mid \mathbf{c}) = \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t) + \nabla_{\mathbf{x}_t} \log p(\mathbf{c} \mid \mathbf{x}_t) \]
The first term is the unconditional score the diffusion model already provides; the second is the gradient of a classifier's log-probability with respect to its noisy input.
A guidance scale \(s \geq 1\) amplifies the classifier gradient: the guided score becomes \(\nabla \log p(\mathbf{x}_t) + s \cdot \nabla \log p(\mathbf{c} \mid \mathbf{x}_t)\). Higher \(s\) produces samples more clearly matching the condition at the cost of diversity. Classifier guidance produced the first diffusion models to surpass GANs on FID, but requires training an additional noise-aware classifier.
Classifier-Free Guidance
Ho & Salimans (2022) eliminated the need for a separate classifier with classifier-free guidance (CFG). The trick is to train a single conditional model that also occasionally receives a null condition (dropping the conditioning signal with some probability during training). At inference, two forward passes produce the conditional and unconditional score estimates, and they are combined:
\[ \tilde{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, \mathbf{c}) = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \varnothing) + s\left(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \mathbf{c}) - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \varnothing)\right) \]
CFG is now the standard approach for conditional generation in all major diffusion systems — Stable Diffusion, DALL-E 2, Imagen, and their successors all use it. A higher guidance scale makes outputs more faithful to the prompt but more saturated and less diverse. The guidance scale is a critical hyperparameter that practitioners tune based on the desired trade-off. Recent research (e.g., Imagen Video, Emu) has explored adaptive guidance, timestep-dependent guidance scales, and extensions beyond the simple linear extrapolation formula.
The guidance equation can be interpreted as: take the unconditional direction, then push further in the direction that the conditional model prefers relative to unconditional. The extrapolation beyond the conditional model's natural output sharpens the distribution at the cost of coverage — a principled way of trading sample diversity for fidelity. At \(s=1\), the guided output is just the conditional model; at \(s=0\), it is unconditional; and at \(s > 1\), you get a sharper conditional than the model naturally produces.
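The combination step is a one-liner; the sketch below checks the algebra with a toy model that simply echoes its condition (all names here are illustrative):

```python
import numpy as np

def cfg_eps(eps_model, x, t, cond, null, scale):
    """Classifier-free guidance: extrapolate from the unconditional estimate
    toward the conditional one. scale=0 -> unconditional, scale=1 -> conditional."""
    eps_u = eps_model(x, t, null)    # forward pass with the null condition
    eps_c = eps_model(x, t, cond)    # forward pass with the real condition
    return eps_u + scale * (eps_c - eps_u)

# Toy model whose noise estimate is just its conditioning vector.
toy = lambda x, t, c: c
cond, null = np.array([1.0, 0.0]), np.zeros(2)
x = np.zeros(2)
```

Note the cost: every guided sampling step requires two forward passes through the denoiser (some implementations batch the conditional and unconditional inputs together).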
Latent Diffusion and Stable Diffusion
Running diffusion directly in pixel space is expensive: a 512×512 image has 786,432 dimensions, and each forward or reverse step requires a full U-Net pass over this entire space. Rombach et al. (2022) introduced Latent Diffusion Models (LDM), which run the diffusion process in a compressed latent space learned by a separately trained autoencoder. This is the architecture underlying Stable Diffusion, the most widely deployed open-source image generation model.
Two-Stage Architecture
The first stage is a VQ-regularized or KL-regularized autoencoder that learns to encode images into a compact latent representation and decode latents back to pixel space. For a 512×512 image, a typical encoder produces a 64×64×4 latent (an 8× spatial downsampling), reducing dimensionality by a factor of 48. The autoencoder is trained with a combination of reconstruction loss, perceptual loss (comparing VGG features), and an adversarial loss from a patch discriminator. Once trained, the encoder and decoder are frozen.
The second stage trains the diffusion model in this learned latent space. Because the latent space is much smaller, the U-Net processes 64×64 feature maps rather than 512×512, making each forward pass dramatically faster. The semantic content of images is well-preserved in the latent space (the autoencoder handles perceptual compression), so the diffusion model focuses its capacity on learning semantic structure rather than pixel-level texture.
Text Conditioning via Cross-Attention
For text-to-image generation, the text prompt is encoded by a pretrained language model — CLIP's text encoder in Stable Diffusion v1, OpenCLIP or T5 in later versions — to produce a sequence of text token embeddings. These embeddings condition the U-Net denoiser through cross-attention layers inserted into the U-Net's residual blocks:
\[ \mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}}\right)\mathbf{V} \]
where the queries \(\mathbf{Q}\) come from the U-Net's spatial feature maps and the keys \(\mathbf{K}\) and values \(\mathbf{V}\) are projections of the text token embeddings.
Stable Diffusion v1 was trained on subsets of LAION-5B at 512×512 resolution. Subsequent versions improved on this: v2 switched to an OpenCLIP text encoder, SDXL added a second text encoder and multi-aspect-ratio training, and SD3 moved to a DiT-based architecture with a T5 text encoder. SDXL also introduced a two-stage pipeline: a base model at 1024×1024 followed by a refinement model that applies diffusion at low noise levels to sharpen details.
Training diffusion in latent space separates two distinct learning problems: perceptual compression (what information to keep) handled by the autoencoder, and semantic generation (how concepts, objects, and styles arrange themselves) handled by the diffusion model. This decomposition lets each component be optimized independently, and lets the diffusion model focus its parameters entirely on semantic content rather than low-level texture.
Fast Sampling: DDIM, DPM-Solver, and Beyond
The main practical limitation of diffusion models is sampling speed: DDPM requires 1000 sequential neural network evaluations per sample. A great deal of research has focused on reducing this to dozens or even a single step while preserving sample quality.
DDIM and Other ODE Samplers
As discussed in Section 6, DDIM reformulates sampling as solving a probability flow ODE, enabling large step sizes. Using 50 DDIM steps instead of 1000 DDPM steps gives roughly the same FID scores at 20× the speed. Further improvements come from higher-order ODE solvers: while DDIM uses an Euler method (first-order), PLMS, DPM-Solver, and DPM-Solver++ use second- and third-order methods that are more accurate per step, enabling further step-count reduction to 10–20 steps with minimal quality loss.
DPM-Solver++ (Lu et al., 2022) introduced a multistep solver specifically designed for the semi-linear ODE structure of diffusion models, exploiting the fact that the noise prediction network changes slowly between nearby timesteps. With 5–10 function evaluations, it achieves quality comparable to DDIM with 100 steps. Most modern image generation pipelines use DPM-Solver++ or similar as their default sampler.
Consistency Models
Song et al. (2023) introduced consistency models, which learn a mapping directly from any point on a probability flow ODE trajectory to the trajectory's endpoint (the clean sample). The consistency property is: if two points \(\mathbf{x}_t\) and \(\mathbf{x}_{t'}\) lie on the same ODE trajectory, then \(f_\theta(\mathbf{x}_t, t) = f_\theta(\mathbf{x}_{t'}, t')\). Training enforces this by minimizing differences between predictions at adjacent timesteps.
A trained consistency model can generate samples in a single neural network evaluation (consistency sampling), or in a small number of steps with alternating denoising and noising (consistency trajectory models). Consistency distillation — initializing from a pretrained diffusion model and distilling the consistency property — produces high-quality single-step generators. LCM (Latent Consistency Model) applied consistency distillation to Stable Diffusion, enabling 2–4 step generation of competitive images.
Progressive and Rectified Flow Distillation
Progressive distillation (Salimans & Ho, 2022) trains a student model to match the output of two teacher steps in a single step, then uses the student as the new teacher and repeats. After \(K\) rounds of halving, the number of required steps drops from \(T\) to \(T / 2^K\). Flow matching (Lipman et al., 2022) and rectified flow (Liu et al., 2022) take a complementary approach: rather than diffusing along curved trajectories, they learn straight-line (or near-straight-line) flows from noise to data, which can be integrated accurately with far fewer steps. Flux and Stable Diffusion 3 use rectified flow as their sampling framework.
| Method | Min steps | Quality | Training | Key idea |
|---|---|---|---|---|
| DDPM | 1000 | Good | From scratch | Stochastic reverse chain |
| DDIM | 50 | Good | No retraining | Deterministic ODE sampler |
| DPM-Solver++ | 10–20 | Very good | No retraining | Higher-order ODE solver |
| Consistency distillation | 1–2 | Good | Distillation | Learn ODE endpoint directly |
| Progressive distillation | 4–8 | Very good | Distillation | Iterative step halving |
| Rectified flow | 5–25 | Excellent | From scratch | Straight-line flow paths |
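The rectified-flow idea reduces to a very simple training target. Under one common convention (some papers swap the roles of data and noise), the interpolant and velocity target are:

```python
import numpy as np

def rf_interpolant(x0, eps, t):
    """Straight-line path x_t = (1 - t) x0 + t * eps; the velocity-field
    regression target along this path is the constant direction eps - x0."""
    return (1 - t) * x0 + t * eps, eps - x0

rng = np.random.default_rng(0)
x0, eps = rng.standard_normal(4), rng.standard_normal(4)
xt, v = rf_interpolant(x0, eps, 0.3)
```

Because the path is a straight line, its velocity is constant in \(t\) — which is exactly why a trained flow of this form can be integrated accurately with very few steps.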
Variants and Extensions
The diffusion framework is highly modular. Once the core forward/reverse process is defined and the denoiser is trained, the conditioning mechanism and the data modality are largely orthogonal design choices. This flexibility has led to an explosion of variants across domains.
Cascaded Diffusion Models
Generating high-resolution images (e.g., 1024×1024) directly in a single model is challenging. Cascaded diffusion (Ho et al., 2022) chains multiple diffusion models: a base model generates a low-resolution image (e.g., 64×64), then a cascade of super-resolution models upsamples sequentially (64→256, 256→1024). Each model in the cascade is conditioned on the output of the previous stage plus the text prompt. Imagen (Saharia et al., 2022) demonstrated that cascading a small base model with large super-resolution models, using a strong frozen text encoder (T5-XXL), outperformed prior methods, including DALL-E 2 and GLIDE, on COCO FID.
Video Diffusion
Extending diffusion to video requires modeling temporal coherence across frames. Video Diffusion Models (Ho et al., 2022) extend the 2D U-Net with 3D convolutions and temporal attention, treating the video as a spacetime volume. Imagen Video uses a cascaded approach with multiple spatial and temporal super-resolution stages. More recent models like Sora (OpenAI, 2024) use DiT-based architectures operating in compressed video latent spaces, training on vast collections of video to learn physical dynamics, camera motion, and long-range temporal coherence.
Audio Diffusion
DiffWave and WaveGrad apply diffusion to raw audio waveforms, treating the 1D waveform as the data space. AudioLM takes a different route, combining discrete acoustic tokens with autoregressive language-model generation rather than diffusion. Stable Audio uses latent diffusion on a VAE-compressed audio representation with CLAP (Contrastive Language-Audio Pretraining) conditioning, enabling high-quality music generation from text prompts with timing control.
Molecular Generation
Diffusion over 3D molecular geometries — atom positions and types — enables drug discovery applications. DiffSBDD and DiffDock use equivariant diffusion models that respect the rotational and translational symmetries of 3D space, generating ligand structures conditioned on protein binding site geometry. The score network must be SE(3)-equivariant: if the protein is rotated, the generated ligand should rotate accordingly.
Image Editing and Inpainting
SDEdit (Meng et al., 2022) enables image editing without any fine-tuning: add noise to an edited image up to an intermediate timestep, then run the reverse diffusion process from that point. This produces images that blend the user's edits with the model's learned image manifold. Inpainting can be done by masking certain regions, applying diffusion only in the masked area while keeping the rest fixed, then blending — though more sophisticated approaches maintain consistency between masked and unmasked regions through the full reverse chain.
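SDEdit reuses machinery already defined in this chapter: noise the edited image with the forward marginal up to an intermediate timestep, then run the ordinary DDPM reverse chain from there. A sketch with a placeholder noise predictor:

```python
import numpy as np

def sdedit(eps_model, x_edit, t_start, betas, alpha_bars, rng):
    """SDEdit sketch: noise the edit to timestep t_start, then denoise it back
    so the result falls onto the model's learned image manifold."""
    x = (np.sqrt(alpha_bars[t_start]) * x_edit
         + np.sqrt(1 - alpha_bars[t_start]) * rng.standard_normal(x_edit.shape))
    for t in range(t_start, -1, -1):               # partial reverse chain only
        eps_hat = eps_model(x, t)
        mean = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(1 - betas[t])
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

betas = np.linspace(1e-4, 0.02, 100)               # short chain for illustration
alpha_bars = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
out = sdedit(lambda x, t: np.zeros_like(x), np.ones(4), 50, betas, alpha_bars, rng)
```

The choice of `t_start` is the faithfulness knob: a small value keeps the user's edit nearly intact, a large value lets the model override more of it.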
ControlNet (Zhang et al., 2023) adds conditioning on structured signals — edge maps, depth maps, human pose skeletons, semantic segmentation masks — by copying U-Net encoder blocks into a trainable side network and adding their outputs back into the frozen decoder. This enables precise spatial control over generated images without retraining the base model.
The Diffusion Landscape
Diffusion models arrived relatively late compared to GANs (2014), VAEs (2013), and normalizing flows (2014–2018). Their explosive rise — from Ho et al.'s DDPM in 2020 to dominating image, audio, and video generation by 2022–2023 — represents one of the fastest paradigm shifts in deep learning history. Understanding why requires comparing them honestly to alternatives.
Diffusion vs. GANs
GANs were the quality leaders for image generation from 2018–2021. Their weakness is training instability and mode collapse: the adversarial game can fail to converge, and generators often learn a limited subset of the true distribution. Diffusion models train with a stable regression objective and tend to cover the data distribution more completely. The initial weakness of diffusion — slow sampling — has been largely addressed by DDIM, DPM-Solver++, and consistency models. FID scores of modern diffusion systems handily surpass the best GANs on most benchmarks, with better diversity and fewer training failure modes.
Diffusion vs. Normalizing Flows
Normalizing flows provide exact likelihood computation and exact inversion, which diffusion does not. However, flows require invertible architectures that constrain their expressiveness. Diffusion imposes no such constraint on the denoiser — it can be any neural network. In practice, diffusion models produce qualitatively superior samples at equivalent parameter counts. The probability flow ODE formulation does enable approximate likelihood computation and exact encoding, partially bridging the gap.
Diffusion vs. Autoregressive Models
For images and audio, diffusion models generally produce higher quality samples than autoregressive models of similar scale, because the iterative refinement allows global structure to be corrected during sampling in a way that strictly left-to-right generation cannot. For text, the situation is reversed: autoregressive transformers remain dominant because text is naturally discrete and sequential, and the token-by-token factorization is exact without approximation. There is active research into discrete diffusion for text (masked diffusion, absorbing diffusion) but it has not yet matched the quality of AR language models at scale.
| Property | Diffusion | GANs | Flows | AR models |
|---|---|---|---|---|
| Sample quality | Excellent | Very good | Good | Good (images) |
| Sample diversity | Excellent | Moderate | Good | Excellent |
| Training stability | High | Low | High | High |
| Exact likelihood | No (approx.) | No | Yes | Yes |
| Sampling speed | Moderate | Fast | Fast | Slow |
| Conditioning | Excellent (CFG) | Moderate | Moderate | Excellent |
| Dominant domain | Images, audio, video | Images (historic) | Audio, density est. | Text |
Open Questions
Despite their success, diffusion models have open challenges. Compositional generation — producing images that correctly compose multiple objects, attributes, and spatial relationships — remains imperfect; models still fail on prompts like "a red cube to the left of a blue sphere." Text rendering was a persistent weakness of early diffusion models (Stable Diffusion v1–2 could barely produce legible text); recent models have improved significantly but the problem is not fully solved. Consistency and 3D understanding — generating coherent multi-view scenes or videos with consistent object identities — requires new architectural and training innovations beyond the basic diffusion framework.
The boundary between diffusion and other generative paradigms is also blurring. Flow matching and rectified flows simplify diffusion's curved trajectories into straight lines, making the ODE easier to integrate. Masked diffusion and continuous-time discrete diffusion extend the framework to language. Diffusion is no longer a single model but a family of related generative frameworks unified by the idea of learning a noising-denoising transformation — and that family continues to grow.
Key Papers
- Denoising Diffusion Probabilistic Models (Ho et al., 2020). The paper that brought diffusion models to the forefront. Derives the simplified training objective, demonstrates high-quality unconditional image generation, and establishes the DDPM framework. The essential starting point — every subsequent diffusion paper builds on this one.
- Score-Based Generative Modeling through Stochastic Differential Equations (Song et al., 2021). Unifies DDPM and NCSN under the SDE framework, introduces the probability flow ODE and the VP/VE-SDE taxonomy, and enables exact likelihood computation via the instantaneous change-of-variables formula. The theoretical framework that lets you reason about the entire family of diffusion-like models.
- Denoising Diffusion Implicit Models (Song et al., 2020). Derives DDIM as a non-Markovian generalization of DDPM that enables deterministic, 50-step generation, along with latent-space interpolation and exact encoding. The paper that made diffusion models practical — a 20× speed-up with comparable quality.
- Diffusion Models Beat GANs on Image Synthesis (Dhariwal & Nichol, 2021). Introduces classifier guidance and the ADM U-Net architecture with attention at multiple resolutions, and demonstrates that diffusion models surpass the best GANs on ImageNet FID. The paper that established diffusion as the new state of the art for image generation.
- Classifier-Free Diffusion Guidance (Ho & Salimans, 2022). Introduces classifier-free guidance — training a single model to serve as both conditional and unconditional model, then extrapolating at inference. Now the standard approach in all major image generation systems; understanding CFG is essential for understanding how modern text-to-image models work.
- High-Resolution Image Synthesis with Latent Diffusion Models (Rombach et al., 2022). Introduces LDM: running diffusion in a compressed VAE latent space with cross-attention conditioning for text-to-image generation. The architecture behind Stable Diffusion, the most widely deployed open-source image generation model.
- Scalable Diffusion Models with Transformers (Peebles & Xie, 2023). DiT: replaces the U-Net backbone with a Vision Transformer, demonstrating clean scaling laws (more parameters → lower FID) and competitive results. The architecture adopted by SD3, Flux, and Sora-class models.
- Consistency Models (Song et al., 2023). Introduces the consistency property and trains models that generate high-quality samples in 1–2 evaluations. The clearest framework for understanding how to close the gap between diffusion quality and single-step generation speed.
- Flow Matching for Generative Modeling (Lipman et al., 2022). Introduces flow matching — learning straight probability paths between noise and data, enabling simpler ODEs that require fewer integration steps. The framework behind SD3, Flux, and other modern high-quality generators.