Diffusion Models: Learning to Reverse the Noise

Diffusion models learn to generate data by learning to reverse a gradual noising process. Start with a clean image, systematically destroy it with Gaussian noise over hundreds of steps, then train a neural network to run that process in reverse. The insight is profound in its simplicity: if you can learn to remove noise, you can learn to create structure from nothing. This framework now underlies the best image, audio, and video generators in the world.

Prerequisites

This chapter assumes solid footing in probability theory — particularly Gaussian distributions, Bayes' theorem, and KL divergence (Part I Ch 04 and Ch 06). Variational autoencoders (Part X Ch 01) are the most important conceptual predecessor; the ELBO derivation and reparameterization trick appear in a new form here. Score functions and Langevin dynamics are introduced from scratch, but familiarity with stochastic differential equations will enrich Section 8. Neural network fundamentals (Part V Ch 01–02) and U-Net architectures are assumed for the denoiser model sections.

The Core Intuition: Destruction and Reconstruction

Section 01 · Forward and reverse processes · the key insight

Every generative model has to answer the same question: how do we describe a probability distribution over high-dimensional data — images, audio waveforms, molecular structures — in a way that lets us both evaluate likelihoods and draw new samples? GANs avoid the question with an adversarial game. VAEs introduce a latent space and bound the likelihood. Normalizing flows construct exact invertible maps. Diffusion models take a completely different approach: they define the generative process as the time-reversal of a noise-injection process that they already know.

Here is the core idea. Take a real image \(\mathbf{x}_0\). Add a small amount of Gaussian noise to get \(\mathbf{x}_1\). Add more noise to get \(\mathbf{x}_2\). Keep going for \(T\) steps, each time adding a little more noise, until at step \(T\) you have \(\mathbf{x}_T\) — a sample that looks essentially like pure noise, indistinguishable from \(\mathcal{N}(\mathbf{0}, \mathbf{I})\). This is the forward process, and it is fully defined by the noise schedule. It requires no learning.

Now ask: what is the reverse? If you knew the distribution \(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\) — the conditional distribution of the slightly less noisy image given a noisy one — you could run the process backward. Start from random noise at step \(T\) and iteratively denoise to recover a clean image at step \(0\). That reverse conditional is intractable to compute directly, but a neural network can learn to approximate it. This is the reverse process, and it is what gets trained.

The forward process q gradually adds Gaussian noise over T steps until the data becomes pure noise. The reverse process pθ, parameterized by a neural network, learns to undo each noising step. Sampling is done by starting from noise and running the reverse process.

The elegance of this framework is that the forward process is mathematically tractable — you can write down its marginals in closed form — which makes it possible to train the reverse process efficiently. And unlike GANs, training is stable: there is no adversarial game, just a regression objective. Unlike VAEs, there is no encoder at inference time and the latent space has the same dimensionality as the data. Unlike normalizing flows, the architecture of the denoiser is unconstrained.

Why this works intuitively

Each denoising step only asks the network to solve a relatively easy problem: given an image with a known noise level, estimate what the slightly cleaner version looks like. No single step asks the model to conjure structure from nothing. The hard work of creating coherent global structure is distributed across hundreds of small steps, each individually tractable.

The Forward Process: Gradual Corruption

Section 02 · Markov chain definition · closed-form marginals · noise schedules

The forward process is a fixed (not learned) Markov chain that gradually corrupts data. At each step \(t\), we add a small amount of Gaussian noise, scaling the signal down and adding noise up:

Forward transition kernel
\[ q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\; \beta_t \mathbf{I}\right) \]
where \(\{\beta_t\}_{t=1}^T\) is the noise schedule — a sequence of small positive scalars, typically increasing from \(\beta_1 \approx 10^{-4}\) to \(\beta_T \approx 0.02\).

The most useful property of this Gaussian chain is that you can sample \(\mathbf{x}_t\) at any arbitrary timestep directly from \(\mathbf{x}_0\), without having to iterate through all intermediate steps. Define \(\alpha_t = 1 - \beta_t\) and \(\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s\). Then the marginal distribution at step \(t\) is:

Closed-form marginal — the key shortcut
\[ q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_t;\; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\; (1-\bar{\alpha}_t)\mathbf{I}\right) \]
or equivalently: \(\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}\) where \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\).

This reparameterization is crucial for training efficiency. Rather than running the Markov chain step by step during each training iteration, we can directly sample any noisy version of any training image in a single operation. With the convention \(\bar{\alpha}_0 = 1\), the marginal at \(t=0\) is the data itself (no noise). As \(t \to T\), \(\bar{\alpha}_T \to 0\), so the distribution approaches \(\mathcal{N}(\mathbf{0}, \mathbf{I})\).
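The shortcut is easy to verify numerically. The NumPy sketch below (the helper names `make_alpha_bar` and `q_sample` are illustrative, not from any library) draws \(\mathbf{x}_t\) for an arbitrary \(t\) in a single operation, using the linear schedule values quoted above:

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and its cumulative product alpha_bar."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) in one shot via the closed-form marginal."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((8, 8))          # stand-in for a training image
xt, eps = q_sample(x0, 500, alpha_bar, rng)
```

Note that `alpha_bar` is computed once up front; every training step then needs only a single multiply-add, not 500 chain iterations.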

Noise Schedules

The choice of schedule \(\{\beta_t\}\) matters. The original DDPM paper used a linear schedule with \(\beta_t\) increasing uniformly from \(10^{-4}\) to \(0.02\). However, with a linear schedule on raw pixel space, the final steps are largely wasted — \(\bar{\alpha}_t\) reaches near-zero well before \(t = T\), meaning many timesteps add effectively no new information. Improved DDPM (Nichol & Dhariwal, 2021) introduced a cosine schedule:

Cosine noise schedule
\[ \bar{\alpha}_t = \frac{f(t)}{f(0)}, \quad f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right) \]
where \(s = 0.008\) prevents \(\bar{\alpha}_t\) from dropping too quickly near \(t = 0\). The cosine schedule keeps the signal-to-noise ratio changing more uniformly across timesteps.

More recent work frames schedule design in terms of the signal-to-noise ratio (SNR): \(\text{SNR}(t) = \bar{\alpha}_t / (1 - \bar{\alpha}_t)\). A good schedule should cover a wide range of SNR values, from the high-SNR regime where only fine details are corrupted, down to the low-SNR regime where global structure is being determined. Research continues into optimal schedule design, including learned schedules and schedules tailored to specific data modalities.
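The two schedules and their SNR curves can be compared directly. In the sketch below (function names are illustrative), both \(\bar{\alpha}_t\) sequences follow the formulas above, and `snr` implements \(\bar{\alpha}_t / (1 - \bar{\alpha}_t)\):

```python
import numpy as np

def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative alpha_bar for the linear DDPM beta schedule."""
    return np.cumprod(1.0 - np.linspace(beta_start, beta_end, T))

def cosine_alpha_bar(T=1000, s=0.008):
    """Cosine schedule of Nichol & Dhariwal (2021): alpha_bar(t) = f(t)/f(0)."""
    t = np.arange(1, T + 1)
    f = lambda u: np.cos(((u / T + s) / (1 + s)) * np.pi / 2) ** 2
    return f(t) / f(0)

def snr(alpha_bar):
    """Signal-to-noise ratio SNR(t) = alpha_bar / (1 - alpha_bar)."""
    return alpha_bar / (1.0 - alpha_bar)
```

Plotting `snr(linear_alpha_bar())` against `snr(cosine_alpha_bar())` on a log scale shows the linear schedule collapsing to near-zero SNR long before \(t = T\), while the cosine schedule spreads the decay more evenly.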

The Reverse Process and DDPM

Section 03 · Reverse Markov chain · parameterizing the denoiser · training objective

Ho et al. (2020) — the DDPM paper — define the reverse process as a learned Markov chain running backward from \(t=T\) to \(t=0\). The joint distribution of the reverse process is:

Reverse process definition
\[ p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T)\prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) \]
\[ p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\; \sigma_t^2 \mathbf{I}\right) \]
where \(p(\mathbf{x}_T) = \mathcal{N}(\mathbf{0}, \mathbf{I})\) and the variance \(\sigma_t^2\) can be fixed (e.g., \(\sigma_t^2 = \beta_t\) or \(\sigma_t^2 = \tilde{\beta}_t\)) or learned.

The reverse conditional \(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\) is intractable because it requires integrating over all possible data. But a remarkable fact saves us: conditioned on \(\mathbf{x}_0\), the reverse conditional is tractable — it is Gaussian with a closed-form mean and variance:

Posterior conditioned on x₀ (tractable)
\[ q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\; \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0),\; \tilde{\beta}_t \mathbf{I}\right) \]
\[ \tilde{\boldsymbol{\mu}}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}\mathbf{x}_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t, \quad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t \]

This is the target: we want \(p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\) to match \(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)\). Since we do not have \(\mathbf{x}_0\) at inference time, the network must predict it — or, equivalently, predict the noise \(\boldsymbol{\epsilon}\) that was added. Ho et al. found that parameterizing the network to predict noise led to better results:

Noise parameterization
\[ \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right) \]
The network \(\boldsymbol{\epsilon}_\theta\) takes the noisy image and timestep \(t\) as input and predicts the noise that was added. Given this prediction, the denoised mean \(\boldsymbol{\mu}_\theta\) is fully determined.

The Simplified Training Objective

After deriving the ELBO (detailed in the next section), Ho et al. found that a simplified loss — ignoring most weighting terms — works just as well or better in practice:

DDPM training objective (simplified)
\[ \mathcal{L}_{\text{simple}} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\!\left[\left\lVert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\!\left(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon},\; t\right)\right\rVert^2\right] \]
Training algorithm: (1) Sample a clean image \(\mathbf{x}_0\); (2) sample a timestep \(t \sim \text{Uniform}(1, T)\); (3) sample noise \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\); (4) compute \(\mathbf{x}_t\) using the closed-form marginal; (5) minimize the MSE between actual and predicted noise. The elegance is that each training step sees a fresh random \(t\), training the denoiser uniformly across all noise levels.
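The five-step recipe above can be written as a single function. In this sketch, `eps_model` is a stand-in for the real denoiser network — any callable `(x_t, t) -> noise estimate` works, and the function names are illustrative:

```python
import numpy as np

def ddpm_training_step(x0, alpha_bar, eps_model, rng):
    """One DDPM training step: return L_simple for a random timestep.

    eps_model(xt, t) is a stand-in for the neural denoiser (in practice,
    a U-Net or DiT whose parameters this loss would update).
    """
    T = len(alpha_bar)
    t = rng.integers(1, T)                      # (2) sample t ~ Uniform(1, T)
    eps = rng.standard_normal(x0.shape)         # (3) sample target noise
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps  # (4)
    eps_pred = eps_model(xt, t)
    return np.mean((eps - eps_pred) ** 2)       # (5) L_simple: MSE on noise

# Toy usage with a dummy "denoiser" that predicts zero noise.
rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1 - np.linspace(1e-4, 0.02, 1000))
x0 = rng.standard_normal((16, 16))
loss = ddpm_training_step(x0, alpha_bar, lambda xt, t: np.zeros_like(xt), rng)
```

In a real training loop, step (1) is the minibatch sampler and the returned scalar would be backpropagated through `eps_model`.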

Sampling from a trained DDPM requires running the reverse chain for all \(T\) steps (typically \(T = 1000\)), which is the main computational bottleneck. At each step, we compute the predicted noise, reconstruct the mean, and add a small amount of noise appropriate to that step. The output at step 0 is the generated sample.
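Sampling is likewise a short loop. The sketch below implements the stochastic reverse chain with fixed variance \(\sigma_t^2 = \beta_t\) and the mean parameterization from the previous equations; `eps_model` again stands in for a trained network:

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng):
    """Ancestral sampling: run the learned reverse chain from pure noise."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)              # x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):
        eps = eps_model(x, t)
        # Reconstruct the posterior mean from the predicted noise.
        mean = (x - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                               # no noise injected at the final step
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
        else:
            x = mean
    return x
```

Each loop iteration costs one full network evaluation, which is exactly why \(T = 1000\) sequential steps is the bottleneck the fast-sampling literature attacks.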

The ELBO and Training Objective

Section 04 · Variational lower bound · KL terms · connection to VAEs

The formal justification for the DDPM training objective comes from maximizing a variational lower bound on the log-likelihood of the data. The derivation reveals the deep connection between diffusion models and VAEs — diffusion can be viewed as an extremely deep hierarchical VAE with a fixed encoder and learned decoder.

We want to maximize \(\log p_\theta(\mathbf{x}_0)\). Using the forward process as a variational approximation to the true posterior, the ELBO is:

ELBO decomposition
\[ \log p_\theta(\mathbf{x}_0) \geq \mathbb{E}_q\!\left[\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\right] = -\mathcal{L}_{\text{ELBO}} \]

This ELBO can be decomposed into interpretable terms by applying the Markov structure of both the forward and reverse processes. After careful manipulation, three types of terms emerge: a prior matching term \(\mathcal{L}_T = D_{\mathrm{KL}}\!\left(q(\mathbf{x}_T \mid \mathbf{x}_0) \,\|\, p(\mathbf{x}_T)\right)\), which contains no learnable parameters and is near zero by construction of the noise schedule; denoising matching terms \(\mathcal{L}_{t-1} = D_{\mathrm{KL}}\!\left(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\right)\) for \(t = 2, \dots, T\), which carry almost all of the training signal; and a reconstruction term \(\mathcal{L}_0 = -\mathbb{E}_q\!\left[\log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)\right]\).

Because both distributions in each \(\mathcal{L}_{t-1}\) term are Gaussian, the KL divergence has a closed form that reduces to a weighted MSE between the true and predicted means. Substituting the noise parameterization yields the simplified objective above.

Loss Weighting

The full ELBO involves a per-timestep weight \(\lambda_t = \frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar{\alpha}_t)}\). This weight is large for small \(t\) (fine-detail denoising) and small for large \(t\) (rough structure). Ho et al. found empirically that dropping this weighting — using uniform weighting across all timesteps — improved sample quality. This upweights the high-noise timesteps, which might seem counterintuitive but works because those steps are harder and more informative for learning global structure.

Diffusion as a Hierarchical VAE

A VAE has a single latent variable and a learned encoder. Diffusion can be seen as a \(T\)-level hierarchical VAE where the encoder \(q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)\) is fixed (no parameters), and the decoder \(p_\theta(\mathbf{x}_{0:T})\) is learned. The ELBO structure is identical. The key difference is that fixing the encoder removes the need for amortized inference and makes each training step cheaper, at the cost of requiring many decoding steps at inference time.

Score Matching and Score-Based Models

Section 05 · Score function · Langevin dynamics · NCSN · annealed sampling

In parallel with DDPM, Song & Ermon (2019, 2020) developed score-based generative models from a different starting point — and the two approaches turned out to be deeply equivalent. Their framework starts with the score function: the gradient of the log-probability density with respect to the data.

Score function
\[ \mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x}) \]
The score points in the direction of increasing data density — toward regions where the data distribution concentrates. It does not require knowing the normalizing constant of \(p(\mathbf{x})\).

If you knew the score function of the data distribution, you could generate samples via Langevin dynamics: start from any point and follow a noisy gradient ascent toward regions of high density:

Langevin MCMC update
\[ \mathbf{x}_{k+1} = \mathbf{x}_k + \frac{\eta}{2}\nabla_{\mathbf{x}} \log p(\mathbf{x}_k) + \sqrt{\eta}\,\mathbf{z}_k, \quad \mathbf{z}_k \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \]
As step size \(\eta \to 0\) and iterations \(k \to \infty\), the distribution of \(\mathbf{x}_k\) converges to \(p(\mathbf{x})\).
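Langevin dynamics is easy to demonstrate when the score is known analytically. For \(\mathcal{N}(\boldsymbol{\mu}, \mathbf{I})\) the score is \(\boldsymbol{\mu} - \mathbf{x}\); the sketch below (function name illustrative) runs the update above and recovers samples centered on \(\boldsymbol{\mu}\):

```python
import numpy as np

def langevin_sample(score_fn, x0, eta=0.1, n_steps=1000, rng=None):
    """Unadjusted Langevin dynamics toward the density whose score is score_fn."""
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + 0.5 * eta * score_fn(x) + np.sqrt(eta) * z
    return x

# Target N(mu, I): analytic score is grad log p(x) = mu - x.
mu = np.array([3.0, -2.0])
samples = np.stack([
    langevin_sample(lambda x: mu - x, np.zeros(2), rng=np.random.default_rng(i))
    for i in range(200)
])
```

The empirical mean of `samples` lands near \(\boldsymbol{\mu} = (3, -2)\); with a finite step size \(\eta\) the chain samples a slightly inflated variance, which is why practical schemes either shrink \(\eta\) or add a Metropolis correction.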

Score Matching

We cannot directly compute \(\nabla_{\mathbf{x}} \log p(\mathbf{x})\) since \(p(\mathbf{x})\) is unknown. But we can train a neural network \(\mathbf{s}_\theta(\mathbf{x})\) to approximate it. The score matching objective minimizes the expected squared distance between the network's output and the true score:

Score matching objective
\[ \mathcal{J}_{\text{SM}} = \mathbb{E}_{p(\mathbf{x})}\!\left[\tfrac{1}{2}\left\lVert \mathbf{s}_\theta(\mathbf{x}) - \nabla_{\mathbf{x}} \log p(\mathbf{x})\right\rVert^2\right] \]
Through integration by parts (Hyvärinen 2005), this can be rewritten to depend only on data samples and first-order network Jacobians, without requiring the true score. Denoising score matching (Vincent 2011) further simplifies this by perturbing data with known Gaussian noise and matching the score of the perturbed distribution.

The denoising score matching objective is strikingly familiar: if you perturb \(\mathbf{x}_0\) by noise \(\boldsymbol{\epsilon}\) to get \(\mathbf{x}_t\), the optimal score network satisfies \(\mathbf{s}_\theta(\mathbf{x}_t, t) = -\boldsymbol{\epsilon} / \sqrt{1-\bar{\alpha}_t}\). This is precisely the noise prediction objective of DDPM, scaled by \(-1/\sqrt{1-\bar{\alpha}_t}\). The two frameworks are equivalent.

Noise Conditional Score Networks (NCSN)

Song & Ermon's NCSN perturbs data with noise at multiple scales \(\sigma_1 > \sigma_2 > \cdots > \sigma_L\) and trains a single score network \(\mathbf{s}_\theta(\mathbf{x}, \sigma)\) conditioned on the noise level. Sampling proceeds via annealed Langevin dynamics: start from a high noise level where the score landscape is smooth and unimodal, run Langevin dynamics to approximate a sample from the perturbed distribution, then anneal to the next lower noise level and repeat. Each successive noise level provides a warm start from the previous level. This annealing is conceptually identical to the DDPM reverse chain.
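A minimal sketch of annealed Langevin dynamics follows, using an analytic score for a toy target: a point mass at \(\boldsymbol{\mu}\) perturbed by noise level \(\sigma\) has score \((\boldsymbol{\mu} - \mathbf{x})/\sigma^2\). The per-level step-size rule \(\eta_i \propto \sigma_i^2\) follows the NCSN recipe, though the constants here are illustrative:

```python
import numpy as np

def annealed_langevin(score_fn, sigmas, x0, eps=2e-3, steps_per_level=50, rng=None):
    """Annealed Langevin dynamics (NCSN): anneal from large sigma to small,
    warm-starting each noise level from the previous one."""
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for sigma in sigmas:                          # sigmas sorted high -> low
        eta = eps * (sigma / sigmas[-1]) ** 2     # step size shrinks with sigma
        for _ in range(steps_per_level):
            z = rng.standard_normal(x.shape)
            x = x + 0.5 * eta * score_fn(x, sigma) + np.sqrt(eta) * z
    return x

# Toy target: point mass at mu perturbed by sigma -> score (mu - x) / sigma^2.
mu = np.array([1.0, 1.0])
sigmas = np.geomspace(10.0, 0.1, 10)
x_final = annealed_langevin(lambda x, s: (mu - x) / s**2, sigmas, np.zeros(2))
```

Scaling \(\eta_i\) with \(\sigma_i^2\) keeps the effective drift \(\eta_i / (2\sigma_i^2)\) constant across levels, so each level mixes at the same rate before handing off to the next.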

Unifying View: Diffusion as SDEs

Section 06 · Continuous-time limit · VP-SDE · VE-SDE · probability flow ODE · DDIM

Song et al. (2021) unified DDPM and score-based models by taking the continuous-time limit. As \(T \to \infty\) and step sizes \(\beta_t\) become infinitesimal, the discrete Markov chain converges to a stochastic differential equation (SDE):

Forward SDE (general form)
\[ d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,dt + g(t)\,d\mathbf{w} \]
where \(\mathbf{w}\) is a standard Wiener process (Brownian motion), \(\mathbf{f}\) is the drift coefficient, and \(g\) is the diffusion coefficient. Two important special cases: VP-SDE (Variance Preserving, corresponding to DDPM) and VE-SDE (Variance Exploding, corresponding to NCSN).

By Anderson's (1982) reverse-time SDE theorem, the reverse process is also an SDE, and its drift depends on the score:

Reverse-time SDE
\[ d\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right]dt + g(t)\,d\bar{\mathbf{w}} \]
where \(\bar{\mathbf{w}}\) is a Wiener process running backward in time, and \(p_t(\mathbf{x})\) is the marginal density at time \(t\). Learning the score at all noise levels is sufficient to simulate the reverse SDE and generate samples.

Probability Flow ODE and DDIM

Every SDE has a corresponding probability flow ODE that induces the same marginal distributions at every time \(t\), but without stochastic noise:

Probability flow ODE
\[ d\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - \tfrac{1}{2}g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right]dt \]
This ODE can be solved with any numerical ODE solver. Because it is deterministic, it enables exact likelihood computation (via the instantaneous change of variables formula) and exact encoding of images into noise space.

DDIM (Song et al., 2020) derived a non-Markovian sampling process that generalizes DDPM. Whereas DDPM requires all \(T\) reverse steps, DDIM can take large strides through the timestep sequence — skipping from \(t=1000\) to \(t=800\) to \(t=600\), etc. — and still produce good samples. With 50 steps instead of 1000, DDIM achieves comparable quality to DDPM while being 20× faster. The key insight is that DDIM's update rule is a discretization of the probability flow ODE, not the stochastic SDE — and the ODE is smoother, so larger steps remain accurate.
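The deterministic DDIM update is compact enough to write out. Given \(\mathbf{x}_t\), it first forms the clean-image estimate \(\hat{\mathbf{x}}_0 = (\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta)/\sqrt{\bar{\alpha}_t}\), then jumps directly to an arbitrary earlier timestep \(s\). The sketch below (function names illustrative; `eps_model` stands in for the trained denoiser) strides through a handful of timesteps instead of all 1000:

```python
import numpy as np

def ddim_step(x_t, t, s, eps_model, alpha_bar):
    """One deterministic DDIM step from timestep t to an earlier timestep s
    (s can be far from t, e.g. t=1000 -> s=800)."""
    eps = eps_model(x_t, t)
    x0_hat = (x_t - np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
    if s < 0:                                     # final step: return the clean estimate
        return x0_hat
    return np.sqrt(alpha_bar[s]) * x0_hat + np.sqrt(1 - alpha_bar[s]) * eps

def ddim_sample(eps_model, shape, alpha_bar, n_steps=50, rng=None):
    """Sample along a strided deterministic DDIM trajectory."""
    rng = rng or np.random.default_rng(0)
    ts = np.linspace(len(alpha_bar) - 1, 0, n_steps).astype(int)
    x = rng.standard_normal(shape)
    for i, t in enumerate(ts):
        s = ts[i + 1] if i + 1 < len(ts) else -1
        x = ddim_step(x, t, s, eps_model, alpha_bar)
    return x
```

A useful sanity check: if `eps_model` is the exact oracle for a dataset concentrated at a single point \(c\), every \(\hat{\mathbf{x}}_0\) equals \(c\) and the sampler returns \(c\) regardless of stride — the deterministic map is exact on that trivial case.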

Determinism enables interpolation

Because DDIM defines a deterministic mapping from noise \(\mathbf{x}_T\) to sample \(\mathbf{x}_0\), you can encode two images into their latent noise vectors, interpolate between those vectors, and decode to get semantically meaningful interpolations. This would be impossible with the stochastic DDPM sampler.

Denoiser Architecture: U-Nets and Transformers

Section 07 · U-Net backbone · timestep conditioning · attention · DiT

The choice of architecture for the denoising network \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\) is separate from the diffusion mathematics — any architecture that takes a noisy image and a timestep as input and outputs a noise estimate can be plugged in. In practice, two architectures have dominated: U-Net with attention, and the Diffusion Transformer (DiT).

U-Net with Attention

The original DDPM paper and most subsequent image diffusion models use a U-Net: an encoder-decoder convolutional network with skip connections between corresponding encoder and decoder layers. The architecture was originally developed for biomedical image segmentation, and it turns out to be well-suited for denoising because it captures both local texture (via shallow features) and global structure (via bottleneck features).

Key modifications for diffusion: the timestep \(t\) is embedded using sinusoidal position encodings (borrowing from transformers), then projected and added to the feature maps via learned scale-and-shift operations (adaptive group normalization). Multi-head self-attention is added at lower-resolution feature maps in the bottleneck and some intermediate layers, allowing long-range spatial dependencies to be captured. The ADM model (Dhariwal & Nichol, 2021) achieved state-of-the-art FID by significantly scaling up this architecture and adding class conditioning.

Diffusion Transformer (DiT)

Peebles & Xie (2023) showed that a pure transformer architecture, DiT, matches or outperforms U-Net diffusion models when trained at scale. The image is first patchified into a sequence of tokens (as in ViT), then processed by a stack of transformer blocks with adaptive layer norm (adaLN) for conditioning on timestep and class label. The simplicity of the transformer scaling laws — more layers and wider dimensions predictably improve FID — made DiT an attractive foundation for large-scale systems. Stable Diffusion 3, Flux, and other recent systems use DiT or DiT-derived architectures.

Timestep Embedding

The denoiser must know what noise level it is operating at. The standard approach uses sinusoidal embeddings analogous to transformer position encodings, then passes through a two-layer MLP to produce a conditioning vector. This vector modulates the feature maps via scale-and-shift (AdaGN or adaLN), with learned per-layer scale and bias that depend on the timestep. Ensuring the network can distinguish nearby timesteps precisely is critical: a network that cannot tell the difference between \(t=100\) and \(t=120\) will incorrectly calibrate its denoising strength.
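A minimal sketch of the sinusoidal embedding (the frequency layout follows the common transformer convention; the function name is illustrative). Geometrically spaced frequencies guarantee that nearby timesteps such as \(t=100\) and \(t=120\) map to distinguishable vectors:

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000):
    """Sinusoidal timestep embedding used to condition the denoiser.

    Returns a length-`dim` vector of sin/cos features at geometrically
    spaced frequencies, so nearby timesteps receive distinct codes.
    """
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])
```

In the full model this vector would pass through the two-layer MLP and then drive the per-layer scale-and-shift (AdaGN/adaLN) parameters.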

Classifier Guidance and Classifier-Free Guidance

Section 08 · Conditional generation · guidance scale · CFG · diversity trade-off

An unconditional diffusion model generates diverse samples but cannot be steered toward a particular class or text description. Guidance mechanisms allow a trained diffusion model to sample from a conditional distribution \(p(\mathbf{x} \mid \mathbf{c})\) — where \(\mathbf{c}\) is a class label, text prompt, or any other conditioning signal — by using the score of the conditional distribution.

Classifier Guidance

Dhariwal & Nichol (2021) showed that a separately trained classifier on noisy images can guide the diffusion sampling process. Applying Bayes' theorem to the score:

Classifier-guided score
\[ \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t \mid \mathbf{c}) = \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t) + \nabla_{\mathbf{x}_t} \log p(\mathbf{c} \mid \mathbf{x}_t) \]
The first term is provided by the unconditional diffusion model. The second term is the gradient of a classifier's log-probability with respect to the noisy image. During sampling, this gradient nudges the denoising toward the direction that increases the classifier's confidence in class \(\mathbf{c}\).

A guidance scale \(s \geq 1\) amplifies the classifier gradient: the guided score becomes \(\nabla \log p(\mathbf{x}_t) + s \cdot \nabla \log p(\mathbf{c} \mid \mathbf{x}_t)\). Higher \(s\) produces samples more clearly matching the condition at the cost of diversity. Classifier guidance produced the first diffusion models to surpass GANs on FID, but requires training an additional noise-aware classifier.

Classifier-Free Guidance

Ho & Salimans (2022) eliminated the need for a separate classifier with classifier-free guidance (CFG). The trick is to train a single conditional model that also occasionally receives a null condition (dropping the conditioning signal with some probability during training). At inference, two forward passes produce the conditional and unconditional score estimates, and they are combined:

Classifier-free guidance
\[ \tilde{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, \mathbf{c}) = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \varnothing) + s\!\left[\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \mathbf{c}) - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \varnothing)\right] \]
where \(\varnothing\) is the null condition (e.g., an empty string or a dropped class label) and \(s > 1\) is the guidance scale. This is equivalent to amplifying the conditional direction beyond what the model naturally outputs. Typical guidance scales are \(s \in [3, 10]\) for text-to-image models.
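The CFG combination itself is one line; the real cost is the two forward passes that produce its inputs. A sketch (function name illustrative):

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, scale):
    """Classifier-free guidance: extrapolate past the conditional prediction.

    scale = 0 recovers the unconditional model, scale = 1 the conditional
    model, and scale > 1 pushes beyond it in the conditional direction.
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

In a sampler, `eps_cond = eps_model(x_t, c)` and `eps_uncond = eps_model(x_t, null)` would be computed per step (often fused into one batched forward pass), and the combined estimate replaces \(\boldsymbol{\epsilon}_\theta\) in the update rule.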

CFG is now the standard approach for conditional generation in all major diffusion systems — Stable Diffusion, DALL-E 2, Imagen, and their successors all use it. A higher guidance scale makes outputs more faithful to the prompt but more saturated and less diverse. The guidance scale is a critical hyperparameter that practitioners tune based on the desired trade-off. Recent research (e.g., Imagen Video, Emu) has explored adaptive guidance, timestep-dependent guidance scales, and extensions beyond the simple linear extrapolation formula.

Why guidance works so well

The guidance equation can be interpreted as: take the unconditional direction, then push further in the direction that the conditional model prefers relative to unconditional. The extrapolation beyond the conditional model's natural output sharpens the distribution at the cost of coverage — a principled way of trading sample diversity for fidelity. At \(s=1\), the guided output is just the conditional model; at \(s=0\), it is unconditional; and at \(s > 1\), you get a sharper conditional than the model naturally produces.

Latent Diffusion and Stable Diffusion

Section 09 · Perceptual compression · LDM · CLIP text conditioning · cross-attention

Running diffusion directly in pixel space is expensive: a 512×512 image has 786,432 dimensions, and each forward or reverse step requires a full U-Net pass over this entire space. Rombach et al. (2022) introduced Latent Diffusion Models (LDM), which run the diffusion process in a compressed latent space learned by a separately trained autoencoder. This is the architecture underlying Stable Diffusion, the most widely deployed open-source image generation model.

Two-Stage Architecture

The first stage is a VQ-regularized or KL-regularized autoencoder that learns to encode images into a compact latent representation and decode latents back to pixel space. For a 512×512 image, a typical encoder produces a 64×64×4 latent (an 8× spatial downsampling), reducing dimensionality by a factor of 48. The autoencoder is trained with a combination of reconstruction loss, perceptual loss (comparing VGG features), and an adversarial loss from a patch discriminator. Once trained, the encoder and decoder are frozen.

The second stage trains the diffusion model in this learned latent space. Because the latent space is much smaller, the U-Net processes 64×64 feature maps rather than 512×512, making each forward pass dramatically faster. The semantic content of images is well-preserved in the latent space (the autoencoder handles perceptual compression), so the diffusion model focuses its capacity on learning semantic structure rather than pixel-level texture.

Text Conditioning via Cross-Attention

For text-to-image generation, the text prompt is encoded by a pretrained language model — CLIP's text encoder in Stable Diffusion v1, OpenCLIP or T5 in later versions — to produce a sequence of text token embeddings. These embeddings condition the U-Net denoiser through cross-attention layers inserted into the U-Net's residual blocks:

Cross-attention in the denoiser U-Net
\[ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V \]
where \(Q = W_Q \phi(\mathbf{x}_t)\) is projected from spatial features, and \(K = W_K \tau(\mathbf{c})\), \(V = W_V \tau(\mathbf{c})\) are projected from the text embeddings \(\tau(\mathbf{c})\). Each spatial location in the feature map attends over the entire text sequence, learning which text tokens are relevant for which spatial regions.
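A single-head version of this cross-attention in NumPy (names and shapes are illustrative; real implementations are multi-head, batched, and learned end to end):

```python
import numpy as np

def cross_attention(x_feats, text_emb, W_Q, W_K, W_V):
    """Single-head cross-attention: spatial features attend over text tokens.

    x_feats:  (HW, d_model) flattened spatial feature map
    text_emb: (L, d_text)   text token embeddings
    """
    Q = x_feats @ W_Q                                 # (HW, d) queries from pixels
    K = text_emb @ W_K                                # (L, d)  keys from text
    V = text_emb @ W_V                                # (L, d)  values from text
    scores = Q @ K.T / np.sqrt(Q.shape[-1])           # (HW, L) scaled dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over text tokens
    return weights @ V                                # (HW, d) text-informed features
```

Each row of `weights` is a distribution over the text tokens, which is exactly the "which words matter for this spatial location" interpretation in the text above.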

Stable Diffusion v1 was trained on subsets of the LAION-5B dataset at 512×512 resolution. Subsequent versions improved resolution and added multi-aspect-ratio training, adopted stronger text encoders (OpenCLIP in v2 and SDXL, T5 in SD3), and, with SD3, moved to a DiT-based architecture. SDXL introduced a two-stage pipeline: a base model at 1024×1024 followed by a refinement model that applies diffusion at low noise levels to sharpen details.

Why the two-stage approach matters

Training diffusion in latent space separates two distinct learning problems: perceptual compression (what information to keep) handled by the autoencoder, and semantic generation (how concepts, objects, and styles arrange themselves) handled by the diffusion model. This decomposition lets each component be optimized independently, and lets the diffusion model focus its parameters entirely on semantic content rather than low-level texture.

Fast Sampling: DDIM, DPM-Solver, and Beyond

Section 10 · Non-Markovian samplers · ODE solvers · consistency models · one-step generation

The main practical limitation of diffusion models is sampling speed: DDPM requires 1000 sequential neural network evaluations per sample. A great deal of research has focused on reducing this to dozens or even a single step while preserving sample quality.

DDIM and Other ODE Samplers

As discussed in Section 6, DDIM reformulates sampling as solving a probability flow ODE, enabling large step sizes. Using 50 DDIM steps instead of 1000 DDPM steps gives roughly the same FID scores at 20× the speed. Further improvements come from higher-order ODE solvers: while DDIM uses an Euler method (first-order), PLMS, DPM-Solver, and DPM-Solver++ use second- and third-order methods that are more accurate per step, enabling further step-count reduction to 10–20 steps with minimal quality loss.

DPM-Solver++ (Lu et al., 2022) introduced a multistep solver specifically designed for the semi-linear ODE structure of diffusion models, exploiting the fact that the noise prediction network changes slowly between nearby timesteps. With 5–10 function evaluations, it achieves quality comparable to DDIM with 100 steps. Most modern image generation pipelines use DPM-Solver++ or similar as their default sampler.

Consistency Models

Song et al. (2023) introduced consistency models, which learn a mapping directly from any point on a probability flow ODE trajectory to the trajectory's endpoint (the clean sample). The consistency property is: if two points \(\mathbf{x}_t\) and \(\mathbf{x}_{t'}\) lie on the same ODE trajectory, then \(f_\theta(\mathbf{x}_t, t) = f_\theta(\mathbf{x}_{t'}, t')\). Training enforces this by minimizing differences between predictions at adjacent timesteps.

A trained consistency model can generate samples in a single neural network evaluation (consistency sampling), or in a small number of steps that alternate denoising and re-noising (multistep consistency sampling). Consistency distillation — initializing from a pretrained diffusion model and distilling the consistency property — produces high-quality single-step generators. LCM (Latent Consistency Model) applied consistency distillation to Stable Diffusion, enabling 2–4 step generation of competitive images.

Progressive and Rectified Flow Distillation

Progressive distillation (Salimans & Ho, 2022) trains a student model to match the output of two teacher steps in a single step, then uses the student as the new teacher and repeats. After \(K\) rounds of halving, the number of required steps drops from \(T\) to \(T / 2^K\). Flow matching (Lipman et al., 2022) and rectified flow (Liu et al., 2022) take a complementary approach: rather than diffusing along curved trajectories, they learn straight-line (or near-straight-line) flows from noise to data, which can be integrated accurately with far fewer steps. Flux and Stable Diffusion 3 use rectified flow as their sampling framework.
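The straight-path construction is simple to state in code. The sketch below (function names illustrative) builds a rectified-flow training pair under the convention \(\mathbf{x}_t = (1-t)\,\mathbf{x}_0 + t\,\boldsymbol{\epsilon}\), and integrates a velocity model with plain Euler steps from noise at \(t=1\) to data at \(t=0\):

```python
import numpy as np

def rectified_flow_pair(x0, rng):
    """One rectified-flow training pair: interpolate along a straight line
    x_t = (1-t) x0 + t * noise; the regression target is the constant
    velocity (noise - x0) of that path."""
    eps = rng.standard_normal(x0.shape)
    t = rng.uniform()
    xt = (1 - t) * x0 + t * eps
    return xt, t, eps - x0

def rf_sample(v_model, shape, n_steps, rng):
    """Euler integration of a learned velocity field from t=1 (noise) to t=0."""
    x = rng.standard_normal(shape)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        x = x - dt * v_model(x, t)    # follow the flow backward toward data
    return x
```

Because the target paths are straight, a well-trained velocity field changes slowly along a trajectory, and coarse Euler steps stay accurate — the property Flux and SD3 exploit for few-step sampling.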

| Method | Min steps | Quality | Training | Key idea |
| --- | --- | --- | --- | --- |
| DDPM | 1000 | Good | From scratch | Stochastic reverse chain |
| DDIM | 50 | Good | No retraining | Deterministic ODE sampler |
| DPM-Solver++ | 10–20 | Very good | No retraining | Higher-order ODE solver |
| Consistency distillation | 1–2 | Good | Distillation | Learn ODE endpoint directly |
| Progressive distillation | 4–8 | Very good | Distillation | Iterative step halving |
| Rectified flow | 5–25 | Excellent | From scratch | Straight-line flow paths |

Variants and Extensions

Section 11 · Cascaded diffusion · video · audio · molecules · image editing

The diffusion framework is highly modular. Once the core forward/reverse process is defined and the denoiser is trained, the conditioning mechanism and the data modality are largely orthogonal design choices. This flexibility has led to an explosion of variants across domains.

Cascaded Diffusion Models

Generating high-resolution images (e.g., 1024×1024) directly with a single model is challenging. Cascaded diffusion (Ho et al., 2022) chains multiple diffusion models: a base model generates a low-resolution image (e.g., 64×64), then a cascade of super-resolution models upsamples sequentially (64→256, 256→1024). Each model in the cascade is conditioned on the output of the previous stage plus the text prompt. Imagen (Saharia et al., 2022) demonstrated that cascading a small base model with large super-resolution models, driven by a strong frozen text encoder (T5-XXL), outperformed prior systems such as DALL-E 2 and GLIDE on COCO FID and human preference evaluations.

Video Diffusion

Extending diffusion to video requires modeling temporal coherence across frames. Video Diffusion Models (Ho et al., 2022) extend the 2D U-Net with 3D convolutions and temporal attention, treating the video as a spacetime volume. Imagen Video uses a cascaded approach with multiple spatial and temporal super-resolution stages. More recent models like Sora (OpenAI, 2024) use DiT-based architectures operating in compressed video latent spaces, training on vast collections of video to learn physical dynamics, camera motion, and long-range temporal coherence.

Audio Diffusion

DiffWave and WaveGrad apply diffusion to raw audio waveforms, treating the 1D waveform as the data space. (AudioLM, by contrast, is not a diffusion model: it generates discrete acoustic tokens autoregressively with a language model, an alternative route to the same goal.) Stable Audio uses latent diffusion on a VAE-compressed audio representation with CLAP (Contrastive Language-Audio Pretraining) conditioning, enabling high-quality text-to-music generation with timing control.

Molecular Generation

Diffusion over 3D molecular geometries — atom positions and types — enables drug-discovery applications. DiffSBDD generates ligand structures conditioned on the geometry of a protein binding site, while DiffDock frames molecular docking as diffusion over a ligand's rotations, translations, and torsion angles. Both rely on score networks that respect the rotational and translational symmetries of 3D space: the score must be SE(3)-equivariant, so that if the protein is rotated, the predicted score — and hence the generated ligand — rotates accordingly.
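Equivariance can be obtained by construction rather than learned. A minimal sketch (a toy stand-in for an equivariant score network, not any particular model's architecture): build the per-atom output from pairwise displacement vectors weighted by an invariant radial function, and verify numerically that rotating the input rotates the output.

```python
import numpy as np

def toy_equivariant_score(x):
    """Toy rotation-equivariant 'score': each atom's output is a sum of
    displacement vectors to other atoms, weighted by a radial function
    that depends only on (rotation-invariant) interatomic distances."""
    diff = x[:, None, :] - x[None, :, :]               # (N, N, 3) displacements
    dist = np.linalg.norm(diff, axis=-1, keepdims=True) + 1e-9
    weights = np.exp(-dist)                             # invariant radial weights
    return (weights * diff).sum(axis=1)                 # equivariant combination

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 3))                             # 5 atoms in 3D

# Random orthogonal matrix via QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))

# Equivariance check: score(R x) == R score(x)
lhs = toy_equivariant_score(x @ Q.T)
rhs = toy_equivariant_score(x) @ Q.T
assert np.allclose(lhs, rhs)
```

Because the construction uses only differences of positions, it is also translation-invariant, which is the other half of the SE(3) symmetry requirement.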

Image Editing and Inpainting

SDEdit (Meng et al., 2022) enables image editing without any fine-tuning: add noise to an edited image up to an intermediate timestep, then run the reverse diffusion process from that point. This produces images that blend the user's edits with the model's learned image manifold. Inpainting can be done by masking certain regions, applying diffusion only in the masked area while keeping the rest fixed, then blending — though more sophisticated approaches maintain consistency between masked and unmasked regions through the full reverse chain.
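The SDEdit procedure is short enough to sketch end to end. In this illustrative version (assuming the DDPM forward marginal \(\mathbf{x}_{t} = \sqrt{\bar\alpha_{t}}\,\mathbf{y} + \sqrt{1-\bar\alpha_{t}}\,\boldsymbol\epsilon\); `reverse_step` is a hypothetical stand-in for one step of any trained sampler):

```python
import numpy as np

def sdedit(y, alpha_bar, t0, reverse_step, rng):
    """SDEdit sketch: noise the user-edited image y up to an intermediate
    timestep t0, then run the learned reverse chain from there instead of
    from pure noise. Smaller t0 stays faithful to the edit; larger t0
    gives the model more freedom to project onto its learned manifold."""
    eps = rng.normal(size=y.shape)
    x = np.sqrt(alpha_bar[t0]) * y + np.sqrt(1 - alpha_bar[t0]) * eps
    for t in range(t0, 0, -1):
        x = reverse_step(x, t)   # one step of a trained sampler (stand-in)
    return x

# Smoke test with an identity "sampler": output is finite, correctly shaped.
rng = np.random.default_rng(2)
T = 100
alpha_bar = np.cumprod(1 - np.linspace(1e-4, 0.02, T + 1))
y = rng.normal(size=(8, 8))
out = sdedit(y, alpha_bar, t0=30, reverse_step=lambda x, t: x, rng=rng)
assert out.shape == y.shape
```

The single knob \(t_0\) controls the realism–faithfulness trade-off, which is why SDEdit works with any pretrained diffusion model and no fine-tuning.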

ControlNet (Zhang et al., 2023) adds conditioning on structured signals — edge maps, depth maps, human pose skeletons, semantic segmentation masks — by copying U-Net encoder blocks into a trainable side network and adding their outputs back into the frozen decoder. This enables precise spatial control over generated images without retraining the base model.
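The reason ControlNet can be trained without disturbing the base model is the "zero convolution": the side network's contribution enters through a projection initialized to zero, so at the start of training the combined model is exactly the frozen base model. A schematic sketch (scalar weights and lambda "blocks" stand in for real convolutions and U-Net blocks):

```python
import numpy as np

def controlnet_block(x, cond, frozen_block, trainable_copy, zero_conv_w):
    """ControlNet-style residual: the frozen U-Net block's output plus a
    zero-initialized projection of a trainable copy that also receives
    the control signal (edge map, depth map, pose, ...)."""
    return frozen_block(x) + zero_conv_w * trainable_copy(x + cond)

frozen = lambda x: 2 * x        # stand-in for a frozen decoder block
copy_ = lambda x: 3 * x         # stand-in for the trainable encoder copy
x = np.ones(4)
cond = np.full(4, 0.5)          # stand-in for a structured control signal

# At initialization zero_conv_w == 0, so the base model is unchanged:
assert np.allclose(controlnet_block(x, cond, frozen, copy_, 0.0), frozen(x))

# As zero_conv_w grows during training, the control signal takes effect:
out = controlnet_block(x, cond, frozen, copy_, 0.1)
assert not np.allclose(out, frozen(x))
```

This zero-initialization trick is what lets ControlNet start training from a state that is functionally identical to the pretrained model, avoiding the destructive early gradients a randomly initialized side network would inject.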

The Diffusion Landscape

Section 12 · Diffusion vs. GANs vs. flows vs. AR · quality-diversity trade-off · current position

Diffusion models arrived relatively late compared to GANs (2014), VAEs (2013), and normalizing flows (2014–2018). Their explosive rise — from Ho et al.'s DDPM in 2020 to dominating image, audio, and video generation by 2022–2023 — represents one of the fastest paradigm shifts in deep learning history. Understanding why requires comparing them honestly to alternatives.

Diffusion vs. GANs

GANs were the quality leaders for image generation from 2018–2021. Their weakness is training instability and mode collapse: the adversarial game can fail to converge, and generators often learn only a limited subset of the true distribution. Diffusion models train with a stable regression objective and tend to cover the data distribution more completely. The initial weakness of diffusion — slow sampling — has been largely addressed by DDIM, DPM-Solver++, and consistency models. Modern diffusion systems achieve better (lower) FID than the best GANs on most benchmarks, with higher sample diversity and fewer training failure modes.

Diffusion vs. Normalizing Flows

Normalizing flows provide exact likelihood computation and exact inversion, which diffusion does not. However, flows require invertible architectures that constrain their expressiveness. Diffusion imposes no such constraint on the denoiser — it can be any neural network. In practice, diffusion models produce qualitatively superior samples at equivalent parameter counts. The probability flow ODE formulation does enable approximate likelihood computation and exact encoding, partially bridging the gap.

Diffusion vs. Autoregressive Models

For images and audio, diffusion models generally produce higher quality samples than autoregressive models of similar scale, because the iterative refinement allows global structure to be corrected during sampling in a way that strictly left-to-right generation cannot. For text, the situation is reversed: autoregressive transformers remain dominant because text is naturally discrete and sequential, and the token-by-token factorization is exact without approximation. There is active research into discrete diffusion for text (masked diffusion, absorbing diffusion) but it has not yet matched the quality of AR language models at scale.

| Property | Diffusion | GANs | Flows | AR models |
| --- | --- | --- | --- | --- |
| Sample quality | Excellent | Very good | Good | Good (images) |
| Sample diversity | Excellent | Moderate | Good | Excellent |
| Training stability | High | Low | High | High |
| Exact likelihood | No (approx.) | No | Yes | Yes |
| Sampling speed | Moderate | Fast | Fast | Slow |
| Conditioning | Excellent (CFG) | Moderate | Moderate | Excellent |
| Dominant domain | Images, audio, video | Images (historic) | Audio, density est. | Text |

Open Questions

Despite their success, diffusion models have open challenges. Compositional generation — producing images that correctly compose multiple objects, attributes, and spatial relationships — remains imperfect; models still fail on prompts like "a red cube to the left of a blue sphere." Text rendering was a persistent weakness of early diffusion models (Stable Diffusion v1–2 could barely produce legible text); recent models have improved significantly but the problem is not fully solved. Consistency and 3D understanding — generating coherent multi-view scenes or videos with consistent object identities — requires new architectural and training innovations beyond the basic diffusion framework.

The boundary between diffusion and other generative paradigms is also blurring. Flow matching and rectified flows simplify diffusion's curved trajectories into straight lines, making the ODE easier to integrate. Masked diffusion and continuous-time discrete diffusion extend the framework to language. Diffusion is no longer a single model but a family of related generative frameworks unified by the idea of learning a noising-denoising transformation — and that family continues to grow.

Key Papers