Normalizing Flows: bending simple distributions into complex shapes.
A normalizing flow is a sequence of invertible transformations that warps a simple probability distribution — a Gaussian, say — into an arbitrarily complex one. Unlike VAEs, which approximate likelihood, and GANs, which abandon likelihood entirely, flows provide exact log-likelihoods and exact samples in a single unified framework. The price is a demanding constraint: every transformation must be both invertible and have a tractable Jacobian determinant.
Prerequisites
This chapter assumes familiarity with probability distributions, the chain rule of probability, and multivariable calculus including Jacobians and determinants (Parts I Ch 02–04). Neural network fundamentals (Part V) are assumed throughout. The VAE chapter (Part X Ch 01) provides useful comparative context for the ELBO vs. exact likelihood distinction, but is not required. The change of variables formula is derived from first principles here.
The Change of Variables Formula
Everything in normalizing flows rests on a single theorem from calculus: when you pass a random variable through an invertible function, the probability density of the output can be computed exactly from the density of the input and the local stretching or squishing of space at each point.
Suppose you have a random variable $z$ with a known probability density $p_z(z)$ — say, a standard Gaussian — and you apply an invertible, differentiable function $f: \mathbb{R}^d \to \mathbb{R}^d$ to obtain a new random variable $x = f(z)$. What is the density $p_x(x)$ of $x$?
The answer comes from the requirement that probability mass must be conserved: the probability of $x$ falling in a small region around a point must equal the probability of $z$ falling in the corresponding region under $f^{-1}$. In one dimension this is elementary — if $x = f(z)$, then $p_x(x) = p_z(f^{-1}(x)) \cdot |df^{-1}/dx|$ — and the generalization to $d$ dimensions replaces the scalar derivative with a matrix determinant:
$$p_x(x) = p_z(f^{-1}(x)) \cdot \left| \det \frac{\partial f^{-1}}{\partial x} \right|$$
More conveniently, using the inverse function theorem (the Jacobian of $f^{-1}$ is the inverse of the Jacobian of $f$) together with the identity $|\det A^{-1}| = 1/|\det A|$:
$$p_x(x) = p_z(f^{-1}(x)) \cdot \left| \det \frac{\partial f}{\partial z} \right|^{-1}$$
Taking the log, which is more convenient for training:
$$\log p_x(x) = \log p_z(f^{-1}(x)) - \log \left| \det \frac{\partial f}{\partial z} \right|$$
This is the central equation of normalizing flows. Every architectural choice in the field is an attempt to make the log-determinant term $\log |\det \partial f / \partial z|$ cheap to compute while keeping $f$ expressive enough to model complex distributions.
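A quick numerical sanity check makes the formula concrete. This is a one-dimensional sketch with $f(z) = e^z$, chosen because its pushforward of a standard Gaussian is the standard log-normal, whose density is known in closed form (the example is illustrative, not from the text):

```python
import numpy as np

def log_normal_pdf(z):
    # log-density of the standard Gaussian base distribution
    return -0.5 * (z ** 2 + np.log(2 * np.pi))

# f(z) = exp(z): invertible on R, with df/dz = exp(z), so log|df/dz| = z
x = 2.0
z = np.log(x)                    # f^{-1}(x)
log_px = log_normal_pdf(z) - z   # change of variables: subtract log|det df/dz|

# compare against the textbook log-normal density 1/(x sqrt(2 pi)) exp(-(ln x)^2/2)
log_px_analytic = -np.log(x) - 0.5 * (np.log(x) ** 2 + np.log(2 * np.pi))
assert np.isclose(log_px, log_px_analytic)
```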
What the Jacobian determinant means geometrically
The Jacobian $\partial f / \partial z$ is a $d \times d$ matrix whose $(i,j)$ entry is $\partial f_i / \partial z_j$ — how the $i$-th output changes as you perturb the $j$-th input. Its determinant measures the local volume scaling factor of $f$ at each point: $|\det J_f(z)|$ is the ratio of the volume of a tiny region around $f(z)$ to the volume of the corresponding tiny region around $z$. If $|\det J_f| > 1$, the transformation is locally expanding; if $|\det J_f| < 1$, it is contracting; if $|\det J_f| = 1$, it is volume-preserving.
The change of variables formula makes intuitive sense in this language: if the transformation expands a region by a factor of 10, the probability density in that region must decrease by a factor of 10 to keep total probability equal to 1. The log-determinant term in the log-likelihood precisely implements this density correction.
The computational challenge
For a general $d \times d$ matrix, computing the determinant costs $O(d^3)$ operations — unacceptable when $d$ is the dimensionality of an image (perhaps $d = 3 \times 256 \times 256 \approx 200{,}000$). The entire research agenda of normalizing flows is about designing bijections $f$ whose Jacobian determinants can be computed in $O(d)$ time, typically by engineering $f$ so that $\partial f / \partial z$ has special structure: triangular, block-diagonal, or low-rank plus diagonal.
The Normalizing Flow Framework
A single invertible transformation is usually not expressive enough. The power of normalizing flows comes from composing many such transformations into a deep stack — each layer refines the distribution further, and the total log-determinant is the sum of the individual ones.
A normalizing flow is a sequence of $K$ invertible transformations $f_1, f_2, \ldots, f_K$, where each $f_k: \mathbb{R}^d \to \mathbb{R}^d$ is differentiable and its inverse is also differentiable. Starting from a base distribution $p_z$ (typically a standard Gaussian), we define:
$$z_0 \sim p_z, \quad z_k = f_k(z_{k-1}), \quad x = z_K$$
The composed transformation $f = f_K \circ \cdots \circ f_1$ is itself invertible, with inverse $f^{-1} = f_1^{-1} \circ \cdots \circ f_K^{-1}$. Applying the change of variables formula to the full composition, and using the chain rule for Jacobians ($J_{f \circ g}(z) = J_f(g(z)) \cdot J_g(z)$), yields the log-likelihood of a data point $x$:
$$\log p_x(x) = \log p_z(z_0) + \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k^{-1}}{\partial z_k} \right|$$
or equivalently, writing in the generative direction ($z_0 \to x$):
$$\log p_x(x) = \log p_z(f^{-1}(x)) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|$$
This is a sum of log-determinants — one per layer — which means each layer's Jacobian only needs to be computable independently, not jointly. The log-likelihood of the full model decomposes cleanly across layers.
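The per-layer summing of log-determinants can be verified on a toy composition of two one-dimensional bijections (the particular choices of $f_1$ and $f_2$ below are purely illustrative):

```python
import numpy as np

# two toy 1-D bijections with known log|det J| (here just log|f'|)
f1 = lambda z: 2.0 * z + 1.0
f2 = lambda z: np.exp(z)
log_det_f1 = lambda z: np.log(2.0)   # |f1'| = 2 everywhere
log_det_f2 = lambda z: z             # |f2'(z)| = exp(z)

z0 = 0.3
z1 = f1(z0)
x = f2(z1)
# chain rule: the composed log-det is the sum of the per-layer terms,
# each evaluated at that layer's own input
log_det_sum = log_det_f1(z0) + log_det_f2(z1)

# finite-difference check of d(f2 . f1)/dz at z0
eps = 1e-6
numeric = (f2(f1(z0 + eps)) - f2(f1(z0 - eps))) / (2 * eps)
assert np.isclose(np.log(abs(numeric)), log_det_sum, atol=1e-5)
```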
Two directions: normalizing vs. generative
The model operates in two directions with different purposes. The generative direction runs $z_0 \to x$: sample $z_0$ from the simple base distribution, pass it through $f_1, f_2, \ldots, f_K$ to produce a sample $x$. This is how you generate new data. The normalizing direction runs $x \to z_0$: given a data point $x$, invert all transformations to obtain the corresponding latent code $z_0 = f^{-1}(x)$. This direction is used to evaluate the log-likelihood of data points for training. The name "normalizing" refers to the fact that applying $f^{-1}$ to a complex data distribution $p_x$ "normalizes" it back to a simple (Gaussian) distribution.
Training
Flows are trained by maximum likelihood: minimize the negative log-likelihood of the training data $\{x^{(i)}\}_{i=1}^N$:
$$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^N \log p_\theta(x^{(i)}) = -\frac{1}{N} \sum_{i=1}^N \left[ \log p_z(f_\theta^{-1}(x^{(i)})) - \sum_{k=1}^K \log \left|\det J_{f_k}\right| \right]$$
This is exact maximum likelihood — no ELBO approximation, no adversarial training, no learned prior. The gradient of the exact log-likelihood is computed by backpropagation through the entire sequence of transformations, including through the log-determinant terms. The main challenge, again, is ensuring that each $|\det J_{f_k}|$ is tractable to compute and differentiate.
The three constraints on flow layers: Each transformation $f_k$ must be (1) invertible — so the normalizing direction is well-defined; (2) have a tractable Jacobian determinant — so the log-likelihood can be evaluated; and (3) be expressive — a composition of many simple bijections should be capable of modeling complex distributions. Most of the flow literature is about architectural choices that navigate these three constraints.
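As a minimal illustration of exact maximum-likelihood training, the sketch below fits a single one-dimensional affine flow $x = az + b$ to Gaussian data by gradient descent, with the NLL gradients written out by hand (a toy setup for exposition, not a recipe from the literature):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=2000)   # target: N(3, 2^2)

# one affine flow x = a*z + b with base z ~ N(0,1); the exact NLL is
#   mean[0.5 * z^2] + log|a| + const,  with  z = (x - b) / a
a, b = 1.0, 0.0
lr = 0.05
for _ in range(500):
    z = (data - b) / a
    grad_a = np.mean(-z * (data - b)) / a ** 2 + 1.0 / a   # d NLL / da
    grad_b = np.mean(-z) / a                               # d NLL / db
    a -= lr * grad_a
    b -= lr * grad_b

# maximum likelihood recovers the data's location and scale
assert abs(b - 3.0) < 0.2
assert abs(a - 2.0) < 0.2
```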
Early Flows: Planar, Radial & NICE
The first practical normalizing flows used simple, analytically tractable transformations with closed-form Jacobian determinants. Their expressiveness was limited, but they established the key architectural ideas that all subsequent flows build upon.
Rezende & Mohamed (2015) introduced planar flows and radial flows as flexible approximate posteriors for variational inference. A planar flow applies a rank-1 update to the identity map:
$$f(z) = z + u \cdot h(w^T z + b)$$
where $w, u \in \mathbb{R}^d$ are parameter vectors, $b \in \mathbb{R}$ is a bias, and $h$ is a smooth nonlinearity (typically $\tanh$). This transformation pushes all of $\mathbb{R}^d$ in the direction $u$, with the amount of push varying along the hyperplane defined by $w$. The key property is that the Jacobian has a very special structure: it is the identity plus a rank-1 matrix, $J_f = I + u \, (h'(w^T z + b) \, w)^T$. By the matrix determinant lemma, $\det(I + uv^T) = 1 + v^T u$, so the log-determinant of a planar flow is a single dot product — $O(d)$ rather than $O(d^3)$. A composition of $K$ planar flows can represent increasingly complex distributions, but each layer adds only one hyperplane of "push," which limits how quickly the distribution can be reshaped.
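A numpy sketch of the planar-flow log-determinant, checking the matrix determinant lemma against the full Jacobian (the vectors are random stand-ins; flipping the sign of $u$ is one simple way to keep the layer invertible):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
u, w = rng.normal(size=d), rng.normal(size=d)
if w @ u < 0:
    u = -u           # ensure 1 + h'*(w.u) > 0, so the layer is invertible
b = 0.2
z = rng.normal(size=d)

# f(z) = z + u * tanh(w.z + b); Jacobian is I + u (h'(w.z + b) * w)^T
h_prime = 1.0 - np.tanh(w @ z + b) ** 2
# matrix determinant lemma: det(I + u v^T) = 1 + v.u, an O(d) dot product
logdet_lemma = np.log(abs(1.0 + h_prime * (w @ u)))

# brute-force check against the full d x d Jacobian determinant
J = np.eye(d) + np.outer(u, h_prime * w)
assert np.isclose(logdet_lemma, np.log(abs(np.linalg.det(J))))
```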
Radial flows apply a similar idea but in spherical coordinates around a reference point $z_0$:
$$f(z) = z + \beta h(\alpha, r)(z - z_0), \quad r = \|z - z_0\|$$The transformation contracts or expands a ball of radius $r$ around $z_0$, with $\alpha$ and $\beta$ controlling the radius and strength of the effect. The Jacobian determinant can again be computed in $O(d)$ time from the radial symmetry.
NICE: Non-linear Independent Components Estimation
The much more influential early work was NICE (Dinh et al., 2015), which introduced the additive coupling layer — the architectural primitive that underlies RealNVP and Glow. The idea is to partition the $d$-dimensional input $z$ into two groups, $(z^A, z^B)$, and define:
$$x^A = z^A$$ $$x^B = z^B + m_\theta(z^A)$$
where $m_\theta$ is an arbitrary neural network. This transformation is trivially invertible: given $(x^A, x^B)$, we recover $z^A = x^A$ and $z^B = x^B - m_\theta(z^A)$. No inversion of $m_\theta$ itself is needed. The Jacobian has a triangular structure: the upper-left block is the identity ($\partial x^A / \partial z^A = I$), the lower-right block is the identity ($\partial x^B / \partial z^B = I$), and the off-diagonal block is $\partial x^B / \partial z^A = \partial m_\theta / \partial z^A$. The determinant of a triangular matrix is the product of its diagonal entries, so $\det J_f = 1$ — the transformation is volume-preserving. This means the Jacobian determinant makes no contribution to the log-likelihood, enormously simplifying computation.
The price of volume preservation is that all the expressiveness must come from the network $m_\theta$ reshaping one half of the coordinates as a function of the other half. NICE uses multiple coupling layers with different partitions to progressively reshape the full distribution. The resulting model can capture complex dependencies across all dimensions but requires many layers to do so, and the additive coupling alone limits how dramatically the distribution can be warped per layer.
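An additive coupling layer is only a few lines of code. The sketch below uses a small random two-layer network as a stand-in for $m_\theta$ and verifies exact invertibility:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
W1 = rng.normal(size=(d // 2, d // 2))
W2 = rng.normal(size=(d // 2, d // 2))
m = lambda a: W2 @ np.tanh(W1 @ a)   # stand-in for the coupling network m_theta

def nice_forward(z):
    zA, zB = z[:d // 2], z[d // 2:]
    return np.concatenate([zA, zB + m(zA)])   # x^A = z^A, x^B = z^B + m(z^A)

def nice_inverse(x):
    xA, xB = x[:d // 2], x[d // 2:]
    return np.concatenate([xA, xB - m(xA)])   # no inversion of m itself

z = rng.normal(size=d)
# exact bijection regardless of how complicated m is; log|det J| = 0
assert np.allclose(nice_inverse(nice_forward(z)), z)
```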
Autoregressive Flows
Autoregressive models have a natural connection to normalizing flows: any autoregressive model defines a bijection between data and noise, whose Jacobian is triangular and therefore cheaply computable. The direction in which you run the model — sampling or likelihood evaluation — determines the speed trade-off.
An autoregressive model factorizes the joint density as a product of conditionals: $p(x) = \prod_{i=1}^d p(x_i | x_{1:i-1})$. If each conditional is a Gaussian with mean $\mu_i$ and standard deviation $\sigma_i$ predicted by a neural network from the previous dimensions:
$$p(x_i | x_{1:i-1}) = \mathcal{N}(x_i; \mu_i(x_{1:i-1}), \sigma_i^2(x_{1:i-1}))$$
then we can write each $x_i$ as a deterministic function of latent noise $z_i \sim \mathcal{N}(0,1)$:
$$x_i = \mu_i(x_{1:i-1}) + \sigma_i(x_{1:i-1}) \cdot z_i$$
This is a bijection from $z = (z_1, \ldots, z_d)$ to $x = (x_1, \ldots, x_d)$. The Jacobian $\partial x / \partial z$ is lower-triangular (dimension $i$ of $x$ depends only on $z_i$ directly, and on $z_{1:i-1}$ indirectly through $x_{1:i-1}$), so its determinant is just the product of diagonal entries: $\det J = \prod_i \sigma_i$. This gives an exact log-likelihood at $O(d)$ cost.
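The triangular-Jacobian claim can be checked numerically. The sketch below uses a toy conditioner for $\mu_i, \sigma_i$ (a hypothetical function, not MADE) and compares $\sum_i \log \sigma_i$ against a brute-force Jacobian of the generative map:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4

def mu_sigma(x_prev):
    # toy autoregressive conditioner: depends only on earlier dimensions
    s = float(np.sum(x_prev))
    return 0.5 * s, np.exp(0.1 * s)      # sigma > 0 by construction

def generate(z):
    x = np.zeros(d)
    for i in range(d):                   # sequential generative direction
        mu, sig = mu_sigma(x[:i])
        x[i] = mu + sig * z[i]
    return x

z = rng.normal(size=d)
x = generate(z)
# lower-triangular Jacobian: log|det dx/dz| = sum_i log sigma_i
logdet = sum(np.log(mu_sigma(x[:i])[1]) for i in range(d))

# numerical Jacobian check
eps = 1e-6
J = np.zeros((d, d))
for j in range(d):
    e = np.zeros(d); e[j] = eps
    J[:, j] = (generate(z + e) - generate(z - e)) / (2 * eps)
assert np.isclose(np.log(abs(np.linalg.det(J))), logdet, atol=1e-4)
```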
MADE: Masked Autoencoder for Distribution Estimation
MADE (Germain et al., 2015) provided a critical implementation tool: a way to evaluate all $d$ conditional means and variances simultaneously in a single neural network forward pass, by masking the network's weights to enforce the autoregressive ordering. Without MADE, computing all $d$ conditionals would require $d$ separate forward passes. With MADE, the same output $(\mu_i, \sigma_i)$ for all $i$ can be computed in one pass, making density evaluation fast.
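A sketch of MADE-style mask construction for a single hidden layer, using one common degree-assignment scheme (the network weights themselves are omitted; only the connectivity argument is shown):

```python
import numpy as np

rng = np.random.default_rng(4)
d, H = 4, 10
m_in = np.arange(1, d + 1)           # input/output degrees 1..d
m_hid = rng.integers(1, d, size=H)   # hidden degrees in 1..d-1

# connection allowed input -> hidden iff hidden degree >= input degree;
# hidden -> output iff output degree > hidden degree (strict, for the diagonal)
M1 = (m_hid[:, None] >= m_in[None, :]).astype(float)
M2 = (m_in[:, None] > m_hid[None, :]).astype(float)

# with weights elementwise-multiplied by these masks, output i can only
# see inputs j < i -- the autoregressive property, in one forward pass
conn = (M2 @ M1) > 0
assert not np.any(np.triu(conn))   # no output depends on itself or later inputs
assert conn[d - 1, 0]              # the last output does see the first input
```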
MAF: Masked Autoregressive Flow
MAF (Papamakarios et al., 2017) stacks multiple MADE blocks as flow layers, reversing the variable ordering between layers to allow each dimension to condition on all others across the full stack. MAF is fast at density evaluation — the normalizing direction $x \to z$ runs one MADE pass per layer, giving log-likelihoods in $O(K \cdot d)$ time — but slow at sampling. To generate a sample, one must compute $x_1$ first (easy, no conditioning), then $x_2$ (conditioning on $x_1$), then $x_3$, and so on — requiring $d$ sequential network evaluations per layer. At high dimensions, this is prohibitively slow.
IAF: Inverse Autoregressive Flow
IAF (Kingma et al., 2016) inverts the direction: it is designed for fast sampling at the cost of slow density evaluation. The IAF transformation defines $x_i = \mu_i(z_{1:i-1}) + \sigma_i(z_{1:i-1}) \cdot z_i$ — conditioning on the latent noise rather than the data. Sampling is now fast: all of $z$ is sampled first, then all $x_i$ can be computed in parallel (since they all condition on $z$, which is already fully available). But computing the log-likelihood of a data point $x$ requires solving for $z$ sequentially — which is the slow direction.
MAF vs. IAF: MAF and IAF are, in a precise sense, inverses of each other. A MAF flow is fast at density evaluation (suitable for density estimation, variational inference, annealed importance sampling) and slow at sampling. An IAF flow is fast at sampling (suitable as a flexible approximate posterior in a VAE) and slow at density evaluation. Parallel WaveNet (van den Oord et al., 2018) used exactly this relationship: a teacher WaveNet (MAF-like) trained a student IAF for fast speech synthesis.
Coupling Layers in Depth
Coupling layers are the workhorse of practical normalizing flows. By restricting the Jacobian to have a block-triangular structure, they make both sampling and density evaluation fast — at the cost of requiring multiple layers with different masks to allow all dimensions to interact.
The coupling layer idea from NICE generalizes naturally to affine coupling layers, which replace the additive shift with an affine (scale and shift) transformation. Given input $z$ split into $(z^A, z^B)$:
$$x^A = z^A$$ $$x^B = z^B \odot \exp(s_\theta(z^A)) + t_\theta(z^A)$$
where $s_\theta$ ("scale") and $t_\theta$ ("translation") are arbitrary neural networks, and $\odot$ denotes elementwise multiplication. The inverse is:
$$z^A = x^A$$ $$z^B = (x^B - t_\theta(x^A)) \odot \exp(-s_\theta(x^A))$$
Crucially, neither direction requires inverting $s_\theta$ or $t_\theta$ themselves. The networks are used in the forward direction for both the transformation and its inverse — only the input changes. This means $s_\theta$ and $t_\theta$ can be arbitrarily complex (deep ResNets, attention-based networks, etc.) without increasing the complexity of inversion. The log-Jacobian determinant is simply:
$$\log |\det J_f| = \sum_j s_\theta(z^A)_j$$
the sum of the scale network's outputs over the $B$ components — computed for free as a side product of the forward pass. Both sampling and density evaluation are $O(d)$ in the number of network evaluations, making affine coupling layers ideal for large-scale models.
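The affine coupling layer, its inverse, and its free log-determinant fit in a short sketch; small random networks stand in for $s_\theta$ and $t_\theta$, and a brute-force Jacobian confirms the log-determinant:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 6
Ws = rng.normal(size=(d // 2, d // 2)) * 0.5
Wt = rng.normal(size=(d // 2, d // 2)) * 0.5
s = lambda a: np.tanh(Ws @ a)        # stand-in scale network (bounded)
t = lambda a: Wt @ a                 # stand-in translation network

def forward(z):
    zA, zB = z[:d // 2], z[d // 2:]
    xB = zB * np.exp(s(zA)) + t(zA)
    return np.concatenate([zA, xB]), np.sum(s(zA))   # log|det J| for free

def inverse(x):
    xA, xB = x[:d // 2], x[d // 2:]
    return np.concatenate([xA, (xB - t(xA)) * np.exp(-s(xA))])

z = rng.normal(size=d)
x, logdet = forward(z)
assert np.allclose(inverse(x), z)

# brute-force Jacobian check of the log-determinant
eps = 1e-6
J = np.zeros((d, d))
for j in range(d):
    e = np.zeros(d); e[j] = eps
    J[:, j] = (forward(z + e)[0] - forward(z - e)[0]) / (2 * eps)
assert np.isclose(np.log(abs(np.linalg.det(J))), logdet, atol=1e-4)
```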
The expressiveness limitation and how to overcome it
The obvious limitation of coupling layers is that half the dimensions ($z^A$) pass through unchanged in each layer. Dimensions in $z^A$ determine the scale and shift applied to $z^B$, but their own values are not changed. A single coupling layer thus only partially reshapes the distribution. The standard solution is to alternate which half is transformed: odd layers transform the first half, even layers transform the second half. With enough alternating layers and expressive enough $s_\theta$, $t_\theta$ networks, a coupling flow can approximate any continuous density to arbitrary precision (in theory). In practice, tens to hundreds of layers are used for complex distributions.
An alternative that avoids the strict partition is the masked coupling layer, which uses a binary mask $m \in \{0,1\}^d$ rather than a literal partition: $x = m \odot z + (1-m) \odot (z \odot \exp(s_\theta(m \odot z)) + t_\theta(m \odot z))$. This allows more flexible masking patterns — checkerboard patterns for images, for example — that can be more efficient than strict partitions.
RealNVP
RealNVP was the first flow model to produce convincing high-dimensional image samples, combining affine coupling layers with a multiscale architecture that progressively factored out dimensions rather than transforming them all the way to the final layer.
Dinh et al. (2017) introduced Real-valued Non-Volume Preserving transformations (RealNVP) by replacing NICE's additive coupling with the affine coupling layers described above. The multiplicative term $\exp(s_\theta)$ adds a degree of freedom that the purely additive shift $t_\theta$ lacks, dramatically increasing the expressiveness of each layer. The scale term allows each layer to control how much each output dimension contributes to the log-likelihood, essentially learning to concentrate probability mass in regions of high data density more aggressively.
Masking strategies for images
For spatial data like images, RealNVP uses two complementary masking strategies. The checkerboard mask assigns alternating pixels to the two groups (like a checkerboard pattern in 2D), allowing each group to condition on the other in a spatially interleaved way. The channel mask splits along the channel dimension: after reshaping the image from $H \times W \times C$ to $H \times W \times 4C$ via a squeezing operation (grouping 2×2 spatial neighborhoods into a single spatial location with 4× more channels), half the channels pass through unchanged and the other half are transformed. Alternating between checkerboard and channel masks over many layers allows the model to capture spatial correlations at multiple scales.
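The squeezing operation is a pure reindexing of the tensor, so it is trivially invertible and volume-preserving. A minimal numpy version:

```python
import numpy as np

def squeeze(x):
    # H x W x C -> H/2 x W/2 x 4C: each 2x2 spatial block becomes one
    # spatial location with four times the channels (a pure reindexing)
    H, W, C = x.shape
    x = x.reshape(H // 2, 2, W // 2, 2, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H // 2, W // 2, 4 * C)

x = np.arange(4 * 4 * 3).reshape(4, 4, 3)
y = squeeze(x)
assert y.shape == (2, 2, 12)
# the first C channels of the output location are the top-left pixel of the block
assert np.array_equal(y[0, 0, :3], x[0, 0, :])
```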
The multiscale architecture
The most important architectural innovation in RealNVP is the multiscale architecture. Rather than passing all dimensions through every coupling layer, after every few layers half the dimensions are "factored out" — removed from the active set and directly connected to the Gaussian latent space. Specifically, after a squeeze + coupling block, the dimensions are split: half continue to be transformed by subsequent layers, half are directly output as part of the latent $z_0$.
This has several benefits. First, it reduces the computational cost of later layers (fewer dimensions to transform). Second, it provides a natural multiscale latent representation: early-factored dimensions capture low-level texture; late-factored dimensions capture global structure. Third, it improves gradient flow by providing shorter paths from the objective to early layers. The overall log-likelihood is the sum of the log-likelihoods of all factored components plus the final remaining components, each under the appropriate Gaussian marginal.
RealNVP trained on CIFAR-10 achieved bits-per-dimension (BPD) competitive with other density models of the time, while also producing visually coherent image samples and enabling realistic interpolations between images in latent space. It was the proof of concept that flows could serve as practical image generative models.
Glow
Glow (Kingma & Dhariwal, 2018) refined RealNVP's architecture into a cleaner, more powerful design — replacing channel shuffles with learned invertible 1×1 convolutions and introducing ActNorm to stabilize training at large batch sizes. The result was flow-based 256×256 face synthesis of remarkable quality.
RealNVP alternated between fixed checkerboard and channel masks to ensure all dimensions were eventually transformed. The fixed alternation pattern is suboptimal: it restricts which dimensions interact with which. Kingma & Dhariwal (2018) replaced the fixed mask alternation with a learned, invertible linear transformation applied to the channel dimension before each coupling layer — an invertible 1×1 convolution. This generalization allows the model to learn the optimal permutation and rotation of channels at each layer, rather than being constrained to fixed mask patterns.
Invertible 1×1 convolutions
A 1×1 convolution with kernel $W \in \mathbb{R}^{C \times C}$ applies a linear transformation to each spatial location's $C$-dimensional channel vector independently. It is invertible if and only if $W$ is a non-singular matrix. The log-determinant of the Jacobian across the full $H \times W \times C$ feature map is simply:
$$\log \left|\det J\right| = H \cdot W \cdot \log |\det W|$$
Computing $|\det W|$ naively costs $O(C^3)$. Glow uses an LU decomposition $W = PLU$ (where $P$ is a fixed permutation matrix, $L$ is lower-triangular with unit diagonal, and $U$ is upper-triangular) to reduce this to $O(C)$: the determinant of a triangular matrix is the product of its diagonal. The LU factors are learned rather than the full matrix $W$, keeping the parameterization constrained to invertible matrices without any explicit constraint. This is a significant practical improvement over RealNVP's fixed shuffles, which are a special case of this framework.
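A sketch verifying the $H \cdot W \cdot \log|\det W|$ identity on a tiny feature map (using `np.linalg.slogdet` rather than the LU parameterization, for brevity):

```python
import numpy as np

rng = np.random.default_rng(6)
C, H, W = 3, 2, 2
Wmat = rng.normal(size=(C, C))

def conv1x1(x):                 # same C x C linear map at every spatial site
    return x @ Wmat.T

# log|det J| over the whole H x W x C map is H * W * log|det W|
sign, logabsdet_W = np.linalg.slogdet(Wmat)
logdet_conv = H * W * logabsdet_W

# brute force: the Jacobian of the flattened map is block-diagonal,
# with one copy of W per spatial location
x = rng.normal(size=(H, W, C))
n = H * W * C
eps = 1e-6
J = np.zeros((n, n))
for j in range(n):
    e = np.zeros(n); e[j] = eps
    xp = (x.ravel() + e).reshape(H, W, C)
    xm = (x.ravel() - e).reshape(H, W, C)
    J[:, j] = (conv1x1(xp) - conv1x1(xm)).ravel() / (2 * eps)
assert np.isclose(np.log(abs(np.linalg.det(J))), logdet_conv, atol=1e-4)
```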
ActNorm
A second important innovation in Glow is Activation Normalization (ActNorm), which replaces batch normalization in the flow. Batch normalization is problematic for flows: its Jacobian depends on the batch statistics, which couples the log-determinants of different examples in a mini-batch in a way that makes the per-example log-likelihood technically incorrect during training. ActNorm instead applies an affine transformation with per-channel scale and bias parameters that are initialized using the first mini-batch (so that the activations have zero mean and unit variance after the first step) and then treated as ordinary learned parameters thereafter. This provides the normalization benefits of batch norm without the batch-coupling pathology, and its Jacobian is simply the product of the scale parameters — trivially computed.
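A minimal ActNorm sketch with data-dependent initialization, using flattened per-channel feature vectors rather than images for simplicity:

```python
import numpy as np

rng = np.random.default_rng(7)
batch = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # (batch, channels)

# data-dependent initialization from the first mini-batch: after this step
# the parameters are treated as ordinary learned parameters
scale = 1.0 / batch.std(axis=0)
bias = -batch.mean(axis=0) * scale

def actnorm(x):
    return x * scale + bias          # per-channel affine transformation

y = actnorm(batch)
assert np.allclose(y.mean(axis=0), 0.0, atol=1e-8)
assert np.allclose(y.std(axis=0), 1.0, atol=1e-8)
# per-example log|det J| is just the sum of log|scale| over channels
logdet = np.sum(np.log(np.abs(scale)))
```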
The three-step flow
Each level of the Glow architecture repeats a three-step building block: (1) ActNorm — normalize each channel; (2) Invertible 1×1 convolution — mix channels; (3) Affine coupling layer — conditionally transform half the channels. This three-step pattern is repeated $K$ times within each level (with $K = 32$ in the published model), and $L$ levels in total are nested within the multiscale architecture. The complete model trained on 256×256 face images (CelebA-HQ) produced results that were, at the time of publication, competitive with the best GANs while offering exact likelihoods and semantically meaningful latent interpolations.
Glow demonstrated that you could manipulate attributes of a face image — adding a smile, changing hair color, adjusting age — by performing simple arithmetic in the model's latent space: encoding an image to get its latent code, adding a vector corresponding to the desired attribute direction (estimated from the difference between average latent codes for examples with and without the attribute), and decoding back to pixel space. This capability, which emerges naturally from the exact inference that flows provide, is much more difficult to achieve with GANs or VAEs, which lack access to the exact posterior or exact likelihood.
Continuous Normalizing Flows & Neural ODEs
Instead of composing a discrete sequence of transformations, continuous normalizing flows let the transformation evolve through continuous time, governed by an ordinary differential equation. This removes the constraint of discrete layer design and introduces new theoretical tools — at the cost of numerical ODE solving.
The discrete flow framework composes finitely many transformations $z_K = f_K \circ \cdots \circ f_1(z_0)$. Taking the limit as the number of layers $K \to \infty$ while each transformation approaches the identity, we arrive at a continuous normalizing flow (CNF), where the transformation is defined by an ordinary differential equation:
$$\frac{dz(t)}{dt} = v_\theta(z(t), t), \quad z(0) = z_0, \quad x = z(T)$$
Here $v_\theta: \mathbb{R}^d \times [0, T] \to \mathbb{R}^d$ is a time-dependent velocity field parameterized by a neural network. The data $x$ is obtained by integrating this ODE from time 0 to time $T$. Sampling is then ODE solving in the forward direction (easy); density evaluation requires solving in the reverse direction (also an ODE solve, with the same cost).
The instantaneous change of variables theorem
In the continuous-time setting, the change of variables formula takes a differential form. Rather than tracking the log-determinant at discrete steps, one tracks its continuous-time evolution. The instantaneous change of variables theorem (Chen et al., 2018) states:
$$\frac{d \log p(z(t))}{dt} = -\text{tr}\left(\frac{\partial v_\theta}{\partial z}\right)$$
The log-density changes at a rate equal to the negative trace of the velocity field's Jacobian (the divergence of the vector field). This is a dramatic simplification: instead of a full $d \times d$ determinant, we need only the trace of the Jacobian. But computing the trace exactly still requires $d$ backward passes (one to get each diagonal entry of $\partial v_\theta / \partial z$). The total change in log-density from $t=0$ to $t=T$ is:
$$\log p_x(x) = \log p_z(z_0) - \int_0^T \text{tr}\left(\frac{\partial v_\theta(z(t),t)}{\partial z(t)}\right) dt$$
FFJORD: Free-form Jacobians of Reversible Dynamics
Grathwohl et al. (2019) introduced FFJORD, which makes CNFs computationally practical using the Hutchinson trace estimator. Rather than computing the full trace exactly, one estimates it stochastically: for a random vector $\epsilon \sim p(\epsilon)$ with $\mathbb{E}[\epsilon \epsilon^T] = I$, we have $\text{tr}(A) = \mathbb{E}_\epsilon[\epsilon^T A \epsilon]$. The vector-Jacobian product $\epsilon^T (\partial v_\theta / \partial z)$ can be computed by a single backpropagation call, giving an unbiased trace estimate from one backward pass rather than the $d$ passes the exact trace requires. FFJORD uses this estimator at each step of the ODE solver, making the full log-likelihood evaluation cost comparable to a discrete flow with the same number of function evaluations.
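The Hutchinson estimator in isolation, with a random matrix standing in for the Jacobian $\partial v_\theta / \partial z$ and Rademacher probe vectors:

```python
import numpy as np

rng = np.random.default_rng(8)
d = 20
A = rng.normal(size=(d, d))          # stand-in for the Jacobian dv/dz

# Hutchinson: tr(A) = E[eps^T A eps] for any eps with E[eps eps^T] = I.
# In a real CNF, each probe costs one vector-Jacobian product.
n_probes = 100_000
eps = rng.choice([-1.0, 1.0], size=(n_probes, d))    # Rademacher probes
estimates = np.einsum('ni,ij,nj->n', eps, A, eps)
assert abs(estimates.mean() - np.trace(A)) < 0.5     # unbiased, modest variance
```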
CNFs have several appealing theoretical properties. The velocity field can be any neural architecture — no coupling-layer constraints, no triangular Jacobians, no restricted masking. The transformation is defined implicitly by the ODE, which means the effective "depth" is controlled by the ODE solver's step count rather than an architectural choice. CNFs also make it natural to model flows on manifolds (spheres, tori) by choosing ODE dynamics that keep the trajectory on the manifold, which is difficult to achieve with discrete coupling layers.
The practical disadvantage is computational cost: ODE solving requires many function evaluations (often 10–100× more than a fixed forward pass through an equivalent discrete flow), and the adjoint method used for memory-efficient backpropagation through the ODE solver introduces additional complexity and numerical error. For large-scale image synthesis, discrete flows remain more practical; CNFs have found their strongest applications in scientific settings (molecular dynamics, density estimation for particle physics) where their theoretical flexibility is worth the cost.
Expressive Bijections: Spline Flows
Affine coupling layers are expressive in depth but limited in the flexibility each individual layer can apply. Replacing the affine transform with a learned monotone spline dramatically increases per-layer expressiveness while retaining tractable inverses and Jacobians.
In an affine coupling layer, each component of $z^B$ is transformed by $z^B_i \mapsto z^B_i \cdot e^{s_i} + t_i$ — an affine function of $z^B_i$ given $z^A$. This is a highly constrained family: only two parameters ($s_i, t_i$) per dimension per layer, and the function is strictly linear in $z^B_i$. A spline flow replaces the affine transform with a monotone spline: a piecewise function defined by a set of knots and values, where the piece between each pair of adjacent knots is a polynomial (linear, quadratic, or cubic). Monotonicity ensures invertibility; having many knots ensures high expressiveness.
Neural spline flows
Durkan et al. (2019) introduced neural spline flows, which use rational-quadratic splines — piecewise rational functions whose numerator and denominator are both quadratic — defined over $K$ bins per dimension. Given the knot positions, the function values at the knots, and the derivatives at the interior knots (all predicted by the coupling network from $z^A$), the rational-quadratic spline is uniquely determined, monotone by construction, and has a closed-form analytic inverse (solvable by the quadratic formula). The log-Jacobian is the sum of the log-derivatives of the splines, which are also available in closed form.
The result is a coupling layer where each $z^B_i$ is passed through a different learned monotone function with $O(K)$ parameters (rather than just 2), enabling much more flexible per-dimension transformations without any change to the overall coupling structure or the $O(d)$ Jacobian complexity. In practice, $K = 8$ or $K = 16$ bins per dimension is typical. Neural spline flows consistently outperform affine flows on density estimation benchmarks with fewer layers, and they are now the standard choice for practitioners building flow models for scientific data.
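Rational-quadratic splines take more code, but a monotone piecewise-linear spline (a simplification for exposition, not the parameterization of Durkan et al.) already shows the mechanics: closed-form forward, inverse, and log-derivative:

```python
import numpy as np

# monotone piecewise-linear bijection on [0,1]: positive bin widths and
# heights (normalized to sum to 1) guarantee monotonicity
rng = np.random.default_rng(9)
K = 8
widths = np.exp(rng.normal(size=K)); widths /= widths.sum()
heights = np.exp(rng.normal(size=K)); heights /= heights.sum()
xk = np.concatenate([[0.0], np.cumsum(widths)])    # input knots
yk = np.concatenate([[0.0], np.cumsum(heights)])   # output knots

def forward(z):
    i = np.clip(np.searchsorted(xk, z) - 1, 0, K - 1)   # which bin
    slope = heights[i] / widths[i]
    return yk[i] + slope * (z - xk[i]), np.log(slope)   # value, log-derivative

def inverse(x):
    i = np.clip(np.searchsorted(yk, x) - 1, 0, K - 1)
    return xk[i] + (x - yk[i]) * widths[i] / heights[i]

z = 0.37
x, logdet = forward(z)
assert np.isclose(inverse(x), z)   # closed-form inverse, no root finding
```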
Other expressive bijection families include sum-of-squares polynomial flows (Jaini et al., 2019), which parameterize the CDF as a sum-of-squares polynomial (automatically non-decreasing), and invertible neural networks based on monotone networks (using constrained weights to ensure monotonicity). Each trades off expressiveness per layer against the cost of evaluating the inverse, which for general monotone networks may require numerical root finding rather than a closed-form formula.
Discrete & Categorical Flows
Standard normalizing flows operate on continuous variables. Extending the framework to discrete data — integers, categorical tokens, binary variables — requires fundamentally different tools, since the change of variables formula assumes differentiable, volume-changing transformations that make no sense for discrete spaces.
Discrete data such as text, discrete audio (PCM with 8-bit quantization), or categorical labels presents a genuine challenge for flow models. The standard change of variables formula requires both the forward and inverse transformations to be differentiable and to have well-defined Jacobians — neither of which applies to integer-valued maps. Several approaches have been proposed to bring flow-like modeling to discrete spaces.
Integer discrete flows
Hoogeboom et al. (2019) introduced integer discrete flows for ordinal (integer-valued) data. The key insight is to replace the affine coupling transform with an integer affine coupling: given input integers $(z^A, z^B) \in \mathbb{Z}^{d_A} \times \mathbb{Z}^{d_B}$, define:
$$x^B = z^B + \text{round}(t_\theta(z^A)) \pmod{M}$$
where $M$ is the range of integers and the rounding ensures the output stays in $\mathbb{Z}$. This is invertible by $z^B = x^B - \text{round}(t_\theta(z^A)) \pmod{M}$, and it preserves the uniform distribution on $\{0, \ldots, M-1\}^d$ (analogous to a volume-preserving flow in continuous space). Because the transformation is a bijection on the discrete space, no Jacobian correction is needed at all: the probability of a data point is simply $p_x(x) = p_z(f^{-1}(x))$. Training maximizes the exact discrete log-likelihood.
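An integer additive coupling layer in a few lines, with a small random network as a stand-in for $t_\theta$ (using the modular form, so values stay in range):

```python
import numpy as np

rng = np.random.default_rng(10)
M, dA, dB = 256, 4, 4
Wt = rng.normal(size=(dB, dA))
t = lambda a: Wt @ a                       # stand-in coupling network t_theta

def forward(z):
    zA, zB = z[:dA], z[dA:]
    xB = (zB + np.round(t(zA)).astype(int)) % M   # integer coupling, mod M
    return np.concatenate([zA, xB])

def inverse(x):
    xA, xB = x[:dA], x[dA:]
    zB = (xB - np.round(t(xA)).astype(int)) % M
    return np.concatenate([xA, zB])

z = rng.integers(0, M, size=dA + dB)
assert np.array_equal(inverse(forward(z)), z)   # exact bijection on integers
```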
Categorical flows and the argmax trick
For truly categorical (non-ordinal) variables — like vocabulary tokens — the problem is harder. One approach uses the Gumbel-softmax or straight-through estimator to approximate discrete gradients, but these introduce bias. Hoogeboom et al. (2021) introduced argmax flows for categorical data: a flow is defined in a continuous relaxation space, where the base distribution is a Gumbel distribution and the categorical variable is recovered via the argmax. The flow transforms Gumbel noise into the continuous pre-argmax activations, and the categorical variable is then the index of the maximum. This enables exact categorical density evaluation and sampling at the cost of operating in the continuous pre-categorical space.
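The Gumbel-max mechanism that argmax flows build on is easy to demonstrate. In the toy below the noise is plain Gumbel noise (in an argmax flow, a learned flow would transform this noise); taking argmax of log-probabilities plus Gumbel noise yields an exact categorical sample:

```python
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.1, 0.6, 0.3])   # toy categorical distribution

# Gumbel-max trick: argmax(log p + Gumbel noise) is an exact sample from
# the categorical distribution p. Gumbel noise via -log(-log(Uniform)).
g = -np.log(-np.log(rng.uniform(size=(100_000, 3))))
samples = np.argmax(np.log(probs) + g, axis=1)
freqs = np.bincount(samples, minlength=3) / len(samples)
# empirical frequencies match probs to Monte Carlo error
```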
Discrete flows remain an active and somewhat unsatisfying research area. The change of variables framework that makes continuous flows so elegant does not transfer gracefully to discrete settings. For text generation specifically, autoregressive language models (Part X Ch 05) are generally more effective than flow-based approaches because they naturally handle the sequential and discrete structure of language. Discrete flows have found more success in domains like lossless image compression, where the connection between discrete flows and arithmetic coding provides principled compression algorithms with competitive compression ratios.
Flows vs. VAEs vs. GANs
Each of the three classical generative frameworks — flows, VAEs, and GANs — makes different trade-offs between exact likelihood, sample quality, latent structure, and computational cost. Understanding these trade-offs guides which framework to reach for in a given application.
The three frameworks occupy different regions of a multi-dimensional design space. Comparing them along several axes illuminates both their strengths and their limitations.
Likelihood tractability
Flows provide exact log-likelihoods via the change of variables formula — no approximation, no bounds. This is their defining advantage. VAEs provide a lower bound (the ELBO) on the log-likelihood; the gap between the ELBO and the true log-likelihood is the KL divergence between the approximate posterior and the true posterior, which can be substantial. GANs provide no likelihood at all; the adversarial objective is not a proxy for the log-likelihood of a probabilistic model and cannot be directly used for density estimation, anomaly detection, or principled comparison across models.
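As a concrete sketch (toy parameters, not a trained model), here is the exact log-likelihood of a one-layer elementwise affine flow $x = e^{s} \odot z + t$, computed by inverting the flow and applying the change of variables formula:

```python
import numpy as np

def log_prob_affine_flow(x, s, t):
    """Exact log p_x(x) for x = exp(s) * z + t with z ~ N(0, I).
    The Jacobian is diag(exp(s)), so log|det J| = sum(s)."""
    z = (x - t) * np.exp(-s)                       # invert the flow
    log_pz = -0.5 * np.sum(z**2 + np.log(2 * np.pi))
    return log_pz - np.sum(s)                      # change of variables

s = np.array([0.1, -0.3, 0.5])    # toy log-scales
t = np.array([1.0, 0.0, -2.0])    # toy shifts
x = np.array([1.5, -0.2, 0.3])
ll = log_prob_affine_flow(x, s, t)   # an exact number, not a bound
```

For this simple flow the result agrees with the closed-form Gaussian density $\mathcal{N}(x; t, \operatorname{diag}(e^{2s}))$, which is a useful sanity check when implementing deeper flows.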
Sample quality
Historically, GANs produced the highest sample quality — sharpest textures, most realistic images — because the discriminator provides a rich, learned perceptual loss that penalizes blurriness directly. Flows and VAEs, trained on pixel-level log-likelihoods, tend to produce slightly blurrier samples or spread probability mass more uniformly across the distribution. The gap has narrowed with improved architectures and training techniques, but for raw image quality GANs (and now diffusion models) generally remain superior. Flows are competitive on density estimation benchmarks even if their image samples are slightly softer.
Latent space quality and inference
Flows have the clearest latent structure of the three: the mapping $x \mapsto z_0 = f^{-1}(x)$ is an exact bijection, so every data point has a unique latent code and latent arithmetic corresponds precisely to data-space arithmetic. VAEs also have meaningful latent spaces, but inference is approximate (the encoder gives only the approximate posterior mean and variance). GANs have no inference procedure by default; recovering a latent code for a given image (GAN inversion) requires iterative optimization and is approximate.
Computational cost
Flows are generally the most computationally expensive to train per parameter, because invertibility and the tractable-Jacobian requirement constrain the architecture (e.g., a coupling layer transforms only half the dimensions, so many layers must be stacked for comparable expressiveness). They also require more parameters than equivalent-quality VAEs or GANs to achieve comparable density estimates, because the expressive power is constrained to invertible architectures. GANs are generally the cheapest to sample from (a single generator forward pass), though training can be unstable. VAEs offer a good compromise: the ELBO is a stable, well-behaved objective, and both encoder and decoder are unconstrained architectures.
| Property | Flows | VAEs | GANs |
|---|---|---|---|
| Exact log-likelihood | ✓ Yes | ✗ ELBO (lower bound) | ✗ Not applicable |
| Sample quality (images) | Good | Fair (can be blurry) | Excellent |
| Exact posterior inference | ✓ Yes | ✗ Approximate | ✗ Not defined |
| Training stability | High | High | Low |
| Architecture constraints | Strong (invertibility) | Mild | None |
| Mode coverage | Good (MLE) | Good (MLE) | Can collapse |
| Latent space arithmetic | Excellent | Good | Poor (requires inversion) |
Applications & Current Standing
Flows never dominated image synthesis the way GANs did, but they found important niches — especially wherever exact likelihoods matter, whether for density estimation, anomaly detection, Bayesian inference, or lossless compression.
Density estimation and anomaly detection
The killer application for flows is wherever you need to know the actual probability density of a data point. Anomaly detection requires distinguishing low-density (anomalous) from high-density (normal) samples — flows provide exact densities, making them directly applicable without the approximations required by VAEs or the ad hoc heuristics needed with GANs. Industrial quality control (detecting defective parts from camera images), cybersecurity (detecting unusual network traffic), and medical monitoring (flagging unusual vital signs) all benefit from principled density estimation that flows naturally provide.
Flexible variational inference
Flows found immediate application as flexible approximate posteriors in VAEs and Bayesian neural networks. The ELBO in a VAE is tightest when the approximate posterior $q(z|x)$ is flexible enough to closely match the true posterior $p(z|x)$. Using a normalizing flow starting from a simple Gaussian posterior and transforming it through several coupling layers produces a much richer posterior approximation, tightening the ELBO and improving both generation quality and latent space structure. Normalizing flows used in this way are called posterior flows or auxiliary flows and do not need to model the full data distribution — only the posterior, which is a much simpler object.
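As a concrete sketch, a single planar flow step (the simplest posterior flow, introduced in the Rezende and Mohamed paper listed under Further Reading) transforms a posterior sample and updates its log-density in closed form. The parameters below are illustrative toy values:

```python
import numpy as np

def planar_step(z, log_q, u, w, b):
    """One planar flow step f(z) = z + u * tanh(w.z + b); the log-density
    is updated by subtracting log|det df/dz| = log|1 + u.psi(z)|."""
    a = np.dot(w, z) + b
    z_new = z + u * np.tanh(a)
    psi = (1.0 - np.tanh(a) ** 2) * w            # gradient of tanh(w.z + b)
    return z_new, log_q - np.log(np.abs(1.0 + np.dot(u, psi)))

# toy 2-D posterior sample drawn from the base Gaussian q0 = N(0, I)
rng = np.random.default_rng(0)
z0 = rng.normal(size=2)
log_q0 = -0.5 * np.sum(z0**2 + np.log(2 * np.pi))
u, w, b = np.array([0.5, -0.3]), np.array([1.0, 2.0]), 0.1
z1, log_q1 = planar_step(z0, log_q0, u, w, b)    # richer, non-Gaussian q
```

Stacking several such steps (with learned, amortized parameters) is exactly the "posterior flow" construction: the reparameterized sample and its exact log-density both remain available for the ELBO.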
Lossless compression
There is a deep mathematical connection between probability models and lossless compression: Shannon's source coding theorem guarantees that the optimal code length for a symbol is its negative log-probability $-\log_2 p(x)$ bits. A flow model that assigns accurate probabilities to image patches can therefore drive a lossless compressor that approaches the entropy of the image distribution. Townsend et al. (2019) implemented this using bits-back coding with flow models, achieving lossless compression rates that beat standard codecs for natural images. This is a uniquely flow-appropriate application: neither GANs (no density) nor diffusion models (computationally expensive density evaluation) are well-suited to it.
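The Shannon correspondence is simple arithmetic. The model likelihood below is a hypothetical number chosen for illustration, not a benchmark result:

```python
import numpy as np

def ideal_code_length_bits(log_prob_nats):
    """Shannon optimal code length in bits: -log2 p(x), converted from
    a natural-log probability."""
    return -log_prob_nats / np.log(2.0)

# Suppose a flow assigns log p(x) = -3466 nats to a 32x32 RGB image
# (a made-up value). The ideal lossless code length is then about
# 3466 / ln 2 = 5000 bits, i.e. roughly 1.63 bits per sub-pixel.
bits = ideal_code_length_bits(-3466.0)
bpd = bits / (32 * 32 * 3)    # bits per dimension, the standard metric
```

This is why density estimation benchmarks report "bits per dimension": it is literally the compression rate an ideal entropy coder driven by the model would achieve.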
Speech synthesis: WaveGlow and WaveFlow
Flows were successfully applied to high-fidelity speech synthesis. WaveGlow (Prenger et al., 2019) combined WaveNet-like dilated convolutions with RealNVP-style coupling layers to generate raw waveforms in a single flow forward pass — much faster than autoregressive WaveNet, which generates samples one timestep at a time. WaveGlow produced human-quality speech with generation speeds fast enough for real-time deployment. WaveFlow (Ping et al., 2020) later unified this design space with a compact model that interpolates between fully autoregressive and fully parallel generation. The flow framework is a natural fit here: exact likelihood training is stable, and the invertible architecture can synthesize an entire waveform in parallel rather than sample by sample.
Scientific simulation and molecular generation
Perhaps the most active current application of flows is in the physical sciences. Boltzmann generators (Noé et al., 2019) use flows to learn the Boltzmann distribution of molecular configurations, enabling efficient sampling of low-energy molecular conformations without expensive molecular dynamics simulations. Flows for particle physics density estimation allow physicists to model the complex high-dimensional distributions of collision products. Continuous normalizing flows on molecular graphs have been used to generate candidate drug-like molecules with desired chemical properties. These scientific applications benefit from flows' exact densities and the fact that domain-specific priors can be incorporated into the base distribution or the flow architecture in principled ways.
Where flows stand today
Normalizing flows occupy a somewhat niche but intellectually important position in the modern generative modeling landscape. For raw image synthesis at scale, diffusion models now dominate — their training is more stable, their sample quality is higher, and their conditional generation (text-to-image) capabilities are unmatched. For density estimation, anomaly detection, and applications where exact likelihoods are essential, flows remain the method of choice. In scientific computing and Bayesian statistics, continuous normalizing flows and their relatives (like flow matching) are an active area of growth. The mathematical elegance of the change of variables formula ensures that flows will remain an important part of the practitioner's toolkit even if they no longer headline the generative modeling race.
Flow matching — the current frontier: Lipman et al. (2022) and Liu et al. (2022) introduced flow matching and rectified flows, which train continuous normalizing flows by regressing the velocity field $v_\theta$ against a prescribed vector field connecting data to noise — without ever solving the ODE during training. This dramatically reduces training cost compared to FFJORD while retaining the CNF's theoretical properties. Flow matching is now a core component of several state-of-the-art image and video generation systems, representing the most active current use of flow ideas.
Further Reading
Foundational Papers
- **Variational Inference with Normalizing Flows.** The paper that coined "normalizing flows" and introduced planar and radial flows. Primarily motivated by improving VAE posteriors, but contains the first clean exposition of the flow framework. Essential historical context; read alongside the NICE paper.
- **NICE: Non-linear Independent Components Estimation.** The additive coupling layer and its volume-preserving property. The intellectual ancestor of RealNVP and Glow. Short and readable; the coupling layer idea is the core innovation to understand.
- **Density Estimation using Real-valued Non-Volume Preserving (RealNVP) Transformations.** Affine coupling layers, multiscale architecture, and the first convincing flow-based image synthesis. Still one of the most clearly written flow papers. Read the architecture section carefully — the multiscale factoring idea is subtle but crucial.
- **Glow: Generative Flow with Invertible 1×1 Convolutions.** ActNorm, invertible 1×1 convolutions, and 256×256 face synthesis. Concise and well-written; the three-step flow block became the standard template. The face attribute manipulation results are worth spending time on.
- **Masked Autoregressive Flow for Density Estimation (MAF).** The connection between autoregressive models and normalizing flows, and the MAF/IAF duality. Clear derivations throughout. Essential reading for understanding the speed trade-off between sampling and density evaluation.
Advanced Architectures
- **Neural Spline Flows.** Rational-quadratic splines as bijections with closed-form inverses and Jacobians. Currently the state of the art for tabular and scientific density estimation. The expressiveness comparison to affine coupling is illuminating.
- **FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models.** The Hutchinson trace estimator applied to continuous normalizing flows, enabling free-form velocity fields. Read alongside the Neural ODE paper (Chen et al.) for full context.
- **Flow Matching for Generative Modeling.** Flow matching: train CNFs by regressing a simple prescribed vector field, without ODE solving during training. A conceptual breakthrough that makes CNFs scalable. The current frontier — highly recommended for understanding where flows are going.
Reviews & Tutorials
- **Normalizing Flows for Probabilistic Modeling and Inference.** The definitive survey of the normalizing flow literature. Covers the full taxonomy — coupling flows, autoregressive flows, continuous flows, discrete flows — with unified notation and careful theoretical treatment. Over 100 pages. The go-to reference. Start with sections 2–4 for the core framework, then read the application sections selectively.
- **Boltzmann Generators: Sampling Equilibrium States of Many-Body Systems with Deep Learning.** Flow models applied to molecular dynamics — one of the most impressive scientific applications. Uses flows to sample the Boltzmann distribution of protein conformations. A compelling demonstration that flows' exact likelihoods enable applications impossible for GANs.