Variational Autoencoders: learning the shape of data's hidden space.

A variational autoencoder is simultaneously a compression algorithm, a probabilistic model, and a generative machine. By marrying the representational power of deep networks with the rigor of Bayesian inference, VAEs were the first framework to give neural networks a principled way to both understand and synthesize high-dimensional data — paving the road for diffusion models, latent diffusion, and the modern era of generative AI.

Prerequisites

This chapter assumes familiarity with feedforward and convolutional neural networks (Parts IV–V), basic probability including Gaussian distributions and Bayes' theorem (Part I Ch 04–05), and the concept of KL divergence from information theory (Part I Ch 06). No prior knowledge of variational inference is assumed — it is developed from first principles.

01

The Generative Modeling Problem

Generative modeling asks a deceptively simple question: given a collection of data points — images, molecules, pieces of text — can we build a model that captures the underlying process that produced them, well enough to generate plausible new examples?

Every dataset we observe is the product of some unknown data-generating process. Photographs of faces, for instance, are not arbitrary configurations of pixels. They occupy a vanishingly small region of the space of all possible images, governed by deep constraints: lighting interacts coherently with three-dimensional geometry; eyes have consistent structure across individuals; expression correlates with context. A generative model's task is to learn this structure well enough to be useful — for synthesis, for compression, for understanding.

Formally, we observe a dataset $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$ of data points drawn i.i.d. from some unknown distribution $p_{\text{data}}(x)$. We want to fit a parametric model $p_\theta(x)$ that approximates this distribution closely. The standard objective is maximum likelihood estimation:

$$\max_\theta \sum_{i=1}^{N} \log p_\theta(x^{(i)})$$

The challenge is that for high-dimensional data like images — where $x$ might be a vector of hundreds of thousands of pixel values — directly parameterizing $p_\theta(x)$ as a tractable distribution is nearly impossible. The space is too vast, the structure too complex, and the constraints too intricate to capture with any simple probability family.

The Role of Latent Variables

The key insight that motivates VAEs, and indeed most deep generative models, is that observable complexity often arises from simple underlying factors. A face image is complicated pixel-by-pixel, but it can be described compactly by a small number of latent attributes: identity, pose, expression, lighting direction, age. If we could model these latent factors explicitly, the relationship between factors and observed pixels might be much simpler to express.

A latent variable model introduces an unobserved variable $z$ such that:

$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$$

The generative story is: first, sample a latent code $z$ from a prior $p(z)$ (say, a standard Gaussian); then, sample the observation $x$ from a conditional distribution $p_\theta(x \mid z)$ parameterized by a neural network. This is an elegant decomposition — the decoder network $p_\theta(x \mid z)$ only needs to learn the conditional mapping from a compact code to an observation, while the prior $p(z)$ handles the overall structure of the latent space.
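The two-step generative story is easy to express in code. A minimal sketch, using a toy hand-written decoder in place of a trained network (the `decode` function and all dimensions are illustrative only):

```python
import math
import random

random.seed(0)

def decode(z):
    """Hypothetical decoder: a fixed nonlinear map from a 2-D latent
    code to the mean of a 4-D observation (stands in for a trained net)."""
    z1, z2 = z
    return [math.tanh(z1), math.tanh(z2), math.tanh(z1 + z2), math.tanh(z1 - z2)]

def sample_observation(noise_scale=0.1):
    # Step 1: draw a latent code from the prior p(z) = N(0, I).
    z = [random.gauss(0.0, 1.0) for _ in range(2)]
    # Step 2: draw x from p(x | z), here a Gaussian centred on decode(z).
    mean = decode(z)
    x = [m + random.gauss(0.0, noise_scale) for m in mean]
    return z, x

z, x = sample_observation()
assert len(z) == 2 and len(x) == 4
```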

The fundamental problem is the integral. For any non-trivial decoder network, the marginal $p_\theta(x) = \int p_\theta(x|z)\,p(z)\,dz$ is intractable — there are infinitely many values of $z$, and the integrand has no analytic form. This is the core technical challenge that variational autoencoders are designed to address.

Why not just use MCMC? Markov chain Monte Carlo methods can approximate intractable integrals, but they are far too slow for training large neural networks. We need a gradient-based optimization approach that gives us unbiased or low-variance estimates of the log-likelihood gradient at each step. Variational inference provides this.

02

Autoencoders & Their Limits

Before VAEs, autoencoders were the dominant approach to learning compressed representations. Understanding what they do well — and precisely where they fail — motivates every design decision in the variational extension.

A conventional autoencoder consists of two networks: an encoder $f_\phi: \mathcal{X} \to \mathcal{Z}$ that maps an input to a latent code, and a decoder $g_\theta: \mathcal{Z} \to \mathcal{X}$ that maps the code back to an approximate reconstruction. The network is trained end-to-end to minimize reconstruction loss — typically mean squared error for continuous data or cross-entropy for binary data:

$$\mathcal{L}_{AE} = \mathbb{E}_{x \sim \mathcal{D}}\left[\|x - g_\theta(f_\phi(x))\|^2\right]$$

By constraining the latent dimension to be much smaller than the input dimension, the network is forced to learn a compressed representation that captures the most important structure in the data. Autoencoders learn remarkably useful features — they have been used for dimensionality reduction, denoising, anomaly detection, and pre-training for downstream classification tasks.

The Fundamental Problem: Discontinuous Latent Space

The fatal flaw of a standard autoencoder for generation is that its latent space has no meaningful structure. The encoder maps each training point to a specific code, and the decoder learns to reconstruct from those specific codes. But the space between these codes is essentially unconstrained — if you sample a random latent vector $z$ that doesn't correspond to any training point's encoded representation, the decoder has no obligation to produce anything sensible. It has never seen those intermediate values and has received no gradient signal from them.

Consider a concrete example: suppose you train an autoencoder on handwritten digit images. The encoder learns that "5" maps to some vector $z_5$ and "7" maps to some vector $z_7$. If you linearly interpolate between $z_5$ and $z_7$ in latent space and decode the midpoint, you might expect a smooth blend of the two digits. Instead, you often get visual garbage — the decoder has memorized a mapping at the endpoints and nothing coherent in between.

A second, subtler problem is that autoencoders admit arbitrarily sharp encodings. Because nothing penalizes the encoder for clustering similar inputs far apart in latent space, or for encoding different inputs very close together, the learned geometry can be pathological. Dimensions that happen to encode important variation for one batch of data may collapse to near-zero for another.

What Would a Good Generative Latent Space Look Like?

For a latent space to be useful for generation, we need three properties. First, coverage: every point in a region of latent space should decode to something plausible. Second, continuity: nearby points in latent space should correspond to similar observations. Third, regularity: the aggregate distribution of latent codes across the training set should match a known prior that we can sample from at test time. Standard autoencoders guarantee none of these. Variational autoencoders are specifically engineered to enforce all three simultaneously.

03

Latent Variable Models & Variational Inference

Variational inference is a technique for approximating intractable posterior distributions. It reframes an integration problem as an optimization problem — one we can solve with gradient descent.

Given our latent variable model $p_\theta(x, z) = p_\theta(x|z)\,p(z)$, the key quantity for maximum likelihood training is the log marginal likelihood $\log p_\theta(x)$. We cannot compute it directly, but we can reason about it using the posterior distribution $p_\theta(z|x)$, which tells us which latent codes are likely given an observation. By Bayes' theorem:

$$p_\theta(z \mid x) = \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(x)}$$

The denominator is exactly the intractable marginal we're trying to compute. So the posterior itself is intractable. We're caught in a circle: computing the posterior requires knowing the marginal, and computing the marginal requires integrating over the posterior.

Approximating the Posterior

Variational inference breaks this circle by introducing a family of approximate distributions $q_\phi(z|x)$, parameterized by $\phi$, and optimizing $\phi$ so that $q_\phi(z|x)$ is as close as possible to the true posterior $p_\theta(z|x)$. The closeness is measured by KL divergence:

$$\mathrm{KL}(q_\phi(z|x) \| p_\theta(z|x)) = \mathbb{E}_{z \sim q_\phi}\left[\log \frac{q_\phi(z|x)}{p_\theta(z|x)}\right]$$

Classical variational inference, developed in statistics before the deep learning era, used structured families of approximate posteriors (fully factorized "mean field" distributions) and hand-derived coordinate-ascent update rules. This worked for small models with conjugate structure, but it scaled very poorly. Computing a new set of variational parameters $\phi$ for each new data point $x$ required solving a separate optimization problem — infeasible for training on millions of examples.

Amortized Inference: The VAE Breakthrough

The key innovation of Kingma and Welling (2013) was amortized inference: instead of optimizing a separate $\phi_i$ for each data point $x^{(i)}$, they parameterize the approximate posterior with a single neural network (the encoder) that takes $x$ as input and outputs the parameters of $q_\phi(z|x)$ directly. The encoder "amortizes" the cost of variational inference across all data points by learning a shared inference function.

This means we no longer need to solve an inner optimization loop at test time. Given a new data point, we simply run a forward pass through the encoder to get a posterior approximation. The encoder parameters $\phi$ are learned jointly with the decoder parameters $\theta$ via a single end-to-end gradient-based training procedure. This combination of amortized variational inference with neural network function approximators is the technical heart of the VAE framework.

Why diagonal Gaussian? The most common choice for $q_\phi(z|x)$ is a diagonal Gaussian: $q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \sigma^2_\phi(x) \cdot I)$, where the encoder outputs a mean vector $\mu$ and a variance vector $\sigma^2$ (often parameterized as $\log \sigma^2$ for numerical stability). The diagonal structure assumes conditional independence among the latent dimensions given $x$, which is a simplifying approximation but one that works well in practice and keeps the KL term analytically tractable.
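The independence assumption has a concrete computational consequence: the log-density of a diagonal Gaussian is just a sum of one-dimensional Gaussian log-densities, one per latent dimension. A quick numerical check (the values of $x$, $\mu$, $\sigma^2$ are arbitrary):

```python
import math

def logpdf_1d(x, mu, var):
    """Log-density of a univariate Gaussian N(mu, var)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def logpdf_diag(x, mu, var):
    """Diagonal-covariance Gaussian: dimensions are independent,
    so the per-dimension log-densities simply add."""
    return sum(logpdf_1d(xi, mi, vi) for xi, mi, vi in zip(x, mu, var))

def logpdf_full_2d(x, mu, var):
    """Full 2-D multivariate formula with an explicit diagonal covariance,
    for comparison: -0.5 [d log 2pi + log|Sigma| + (x-mu)^T Sigma^-1 (x-mu)]."""
    det = var[0] * var[1]
    quad = (x[0] - mu[0]) ** 2 / var[0] + (x[1] - mu[1]) ** 2 / var[1]
    return -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + quad)

x, mu, var = [0.3, -1.2], [0.1, 0.4], [0.5, 2.0]
assert abs(logpdf_diag(x, mu, var) - logpdf_full_2d(x, mu, var)) < 1e-12
```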

04

The Evidence Lower Bound

The ELBO is the central object of VAE training — an objective that is simultaneously a lower bound on the log-likelihood we want to maximize, and a sum of two interpretable terms: a reconstruction quality term and a regularization term.

We want to maximize $\log p_\theta(x)$ but cannot compute it directly. The ELBO gives us something we can compute that lower-bounds this quantity. The derivation begins by introducing the approximate posterior and applying Jensen's inequality:

$$\log p_\theta(x) = \log \int p_\theta(x, z)\, dz = \log \int q_\phi(z|x) \cdot \frac{p_\theta(x, z)}{q_\phi(z|x)}\, dz$$

Since $\log$ is concave, Jensen's inequality gives us $\log \mathbb{E}[f] \geq \mathbb{E}[\log f]$, so:

$$\log p_\theta(x) \geq \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] \equiv \mathcal{L}(\theta, \phi; x)$$

This is the evidence lower bound (ELBO), also called the variational lower bound. Expanding the joint $p_\theta(x,z) = p_\theta(x|z)\,p(z)$:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{z \sim q_\phi}\left[\log p_\theta(x|z) + \log p(z) - \log q_\phi(z|x)\right]$$

Regrouping the last two terms as a KL divergence:

$$\mathcal{L}(\theta, \phi; x) = \underbrace{\mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\log p_\theta(x|z)\right]}_{\text{reconstruction term}} - \underbrace{\mathrm{KL}(q_\phi(z|x) \| p(z))}_{\text{regularization term}}$$

Interpreting the Two Terms

The reconstruction term $\mathbb{E}_{z \sim q_\phi}[\log p_\theta(x|z)]$ measures how well the decoder recovers the original input from samples of the encoder's posterior. For a Gaussian decoder $p_\theta(x|z) = \mathcal{N}(x; \mu_\theta(z), \sigma^2 I)$, this term is proportional to mean squared reconstruction error. For a Bernoulli decoder (used for binary images), it becomes binary cross-entropy. Maximizing this term pushes the network to encode enough information in $z$ to reconstruct $x$ faithfully — the autoencoding objective.

The regularization term $-\mathrm{KL}(q_\phi(z|x) \| p(z))$ penalizes the encoder for making the approximate posterior deviate from the prior. Intuitively, it pushes each data point's latent distribution toward the standard Gaussian prior — ensuring that the latent space remains regular and that different data points' codes overlap enough to enable smooth interpolation and random sampling. This is the term that standard autoencoders lack and that gives the VAE its generative capability.

The Exact ELBO Gap

The gap between the ELBO and the true log-likelihood is exactly the KL divergence between the approximate and true posteriors:

$$\log p_\theta(x) = \mathcal{L}(\theta, \phi; x) + \mathrm{KL}(q_\phi(z|x) \| p_\theta(z|x))$$

Since KL divergence is always non-negative, maximizing the ELBO simultaneously pushes the log-likelihood up (improving the model) and the KL gap down (improving the posterior approximation). The ELBO therefore serves as a unified objective for joint optimization of $\theta$ and $\phi$.
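The identity can be verified end-to-end in a toy model where every quantity is tractable: a one-dimensional linear-Gaussian model with conjugate structure. All closed forms below follow from standard Gaussian algebra; the specific numbers are arbitrary:

```python
import math

def log_normal(x, mu, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def kl_gauss(m_q, v_q, m_p, v_p):
    """KL( N(m_q, v_q) || N(m_p, v_p) ) in one dimension."""
    return 0.5 * (math.log(v_p / v_q) + (v_q + (m_q - m_p) ** 2) / v_p - 1.0)

# Toy model: p(z) = N(0, 1), p(x|z) = N(z, s2). Everything is tractable.
s2 = 0.5
x = 1.3

# Exact marginal and exact posterior, from conjugate Gaussian algebra.
log_px = log_normal(x, 0.0, 1.0 + s2)
post_m, post_v = x / (1.0 + s2), s2 / (1.0 + s2)

# Any approximate posterior q(z|x) = N(m, v), deliberately not the true one.
m, v = 0.2, 0.9
recon = log_normal(x, m, s2) - 0.5 * v / s2   # E_q[log p(x|z)] in closed form
elbo = recon - kl_gauss(m, v, 0.0, 1.0)       # ELBO = recon - KL(q || prior)
gap = kl_gauss(m, v, post_m, post_v)          # KL(q || true posterior)

assert abs(log_px - (elbo + gap)) < 1e-10     # log p(x) = ELBO + gap, exactly
```

Because the gap term is a KL divergence, it is non-negative, so `elbo <= log_px` for every choice of `m` and `v`, with equality only when $q$ matches the true posterior.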

Analytic KL for Gaussians

When both $q_\phi(z|x) = \mathcal{N}(\mu, \text{diag}(\sigma^2))$ and $p(z) = \mathcal{N}(0, I)$ are Gaussian, the KL term has a closed-form expression that requires no Monte Carlo estimation:

$$\mathrm{KL}(q_\phi(z|x) \| p(z)) = \frac{1}{2}\sum_{j=1}^{d}\left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)$$

This is a key practical advantage. Rather than estimating the KL by sampling, we compute it analytically from the encoder's output $(\mu, \sigma^2)$. Only the reconstruction term requires Monte Carlo approximation — and in practice a single sample of $z$ per data point suffices, since the resulting gradient estimate is unbiased and its noise is modest compared with the minibatch noise already present in stochastic gradient descent.
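A sketch of both computations, the closed-form KL and a Monte Carlo estimate of the same quantity, confirms they agree (the encoder outputs $\mu$, $\sigma^2$ below are hypothetical):

```python
import math
import random

random.seed(0)

mu, sigma2 = [0.5, -0.3], [0.64, 1.5]   # hypothetical encoder outputs for one x

# Closed form: KL( N(mu, diag(sigma2)) || N(0, I) )
kl_exact = 0.5 * sum(m * m + v - math.log(v) - 1.0 for m, v in zip(mu, sigma2))

def log_normal(x, m, v):
    return -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)

# Monte Carlo estimate of the same KL: average of log q(z) - log p(z)
n, total = 100_000, 0.0
for _ in range(n):
    z = [random.gauss(m, math.sqrt(v)) for m, v in zip(mu, sigma2)]
    log_q = sum(log_normal(zi, m, v) for zi, m, v in zip(z, mu, sigma2))
    log_p = sum(log_normal(zi, 0.0, 1.0) for zi in z)
    total += log_q - log_p
kl_mc = total / n

assert abs(kl_mc - kl_exact) < 0.02   # MC agrees with the analytic formula
```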

05

The Reparameterization Trick

Sampling is not differentiable. If we naively sample $z$ from the encoder's posterior and pass it to the decoder, the gradient cannot flow back through the sampling operation to the encoder. The reparameterization trick is the elegant fix.

To compute the reconstruction term of the ELBO, we need to evaluate $\mathbb{E}_{z \sim q_\phi(\cdot|x)}[\log p_\theta(x|z)]$. In practice, we approximate this expectation by drawing a sample $z \sim q_\phi(\cdot|x)$ and computing $\log p_\theta(x|z)$ — a simple, unbiased Monte Carlo estimate. The problem is that we then need to backpropagate gradients through this operation to update the encoder parameters $\phi$.

The gradient of an expectation over a distribution that depends on the parameters is tricky. The sampling step $z \sim \mathcal{N}(\mu_\phi(x), \sigma^2_\phi(x) I)$ is a stochastic node in the computation graph. The sample $z$ depends on $\phi$ through the distribution's parameters, but standard automatic differentiation cannot propagate gradients through a sampling operation because sampling is not a deterministic function.

Separating Randomness from Parameters

The reparameterization trick solves this by factoring the sampling into a deterministic function of the parameters and a separate, parameter-free noise term. Instead of sampling $z$ directly from $\mathcal{N}(\mu, \sigma^2 I)$, we sample a noise variable $\varepsilon \sim \mathcal{N}(0, I)$ and compute:

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \varepsilon$$

where $\odot$ denotes element-wise multiplication. This is mathematically equivalent — the result is still Gaussian with mean $\mu_\phi(x)$ and variance $\sigma_\phi^2(x)$ — but now $z$ is expressed as a deterministic, differentiable function of $\phi$ and the random variable $\varepsilon$. The randomness lives entirely in $\varepsilon$, which has no dependence on any parameters. Gradients flow freely from the decoder through $z$ back to the encoder's outputs $\mu$ and $\sigma$.
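A few lines of code make the equivalence concrete: sampling via the reparameterized form produces exactly the statistics of $\mathcal{N}(\mu, \sigma^2)$, while exposing $z$ as a differentiable function of $\mu$ and $\sigma$ (the values below are hypothetical encoder outputs):

```python
import math
import random

random.seed(0)

mu, sigma = 1.5, 0.4   # hypothetical encoder outputs for one input x

# Reparameterized sampling: z = mu + sigma * eps, with eps ~ N(0, 1).
n = 100_000
zs = [mu + sigma * random.gauss(0.0, 1.0) for _ in range(n)]

mean = sum(zs) / n
std = math.sqrt(sum((z - mean) ** 2 for z in zs) / n)

# The samples are distributed as N(mu, sigma^2) ...
assert abs(mean - mu) < 0.01
assert abs(std - sigma) < 0.01
# ... but z is now a deterministic function of mu and sigma for each fixed eps:
# dz/dmu = 1 and dz/dsigma = eps, so gradients pass straight through.
```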

[Figure: VAE architecture with reparameterization. The encoder $q_\phi(z|x)$ outputs μ and σ²; a noise sample ε ~ N(0,I) is drawn independently; z = μ + σ⊙ε is computed deterministically and passed to the decoder $p_\theta(x|z)$, which reconstructs x̂, while KL(q_φ‖p) is computed analytically. Gradients flow back through z all the way to the encoder, since the path is now fully differentiable — there is no sampling barrier.]

Why This Works: A Closer Look

Before reparameterization, the gradient of the ELBO with respect to $\phi$ involved differentiating through a sampling step, which requires the REINFORCE estimator (also known as the score function estimator or likelihood-ratio estimator). REINFORCE provides unbiased gradient estimates, but it is notoriously high-variance in practice — the gradient signal is so noisy that training does not converge reliably for the complex encoder networks needed in VAEs.

The reparameterization trick provides a pathwise gradient estimator — the gradient flows along a deterministic path from the loss through the decoder, through $z = \mu + \sigma \odot \varepsilon$, and into the encoder. This estimator has dramatically lower variance, making gradient-based optimization practical. The intuition: rather than asking "how does changing $\phi$ affect the distribution of $z$?" (a distributional question), we ask "for this particular $\varepsilon$, how does changing $\phi$ affect the value of $z$?" (a deterministic question). The latter is much easier to estimate.
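The variance difference is easy to demonstrate on a toy objective. For $\mathbb{E}_{z \sim \mathcal{N}(\mu, 1)}[z^2]$, whose true gradient with respect to $\mu$ is $2\mu$, both estimators are unbiased, but their per-sample variances differ by a large factor (analytically, 4 versus 30 at $\mu = 1$):

```python
import random

random.seed(0)

mu, n = 1.0, 50_000

# Objective: d/dmu of E_{z ~ N(mu, 1)}[ z^2 ]  (true value: 2 * mu).
pathwise, reinforce = [], []
for _ in range(n):
    eps = random.gauss(0.0, 1.0)
    z = mu + eps
    # Pathwise estimator: differentiate z^2 through z = mu + eps  ->  2z.
    pathwise.append(2.0 * z)
    # REINFORCE / score-function estimator:
    # f(z) * d/dmu log N(z; mu, 1) = z^2 * (z - mu).
    reinforce.append(z * z * (z - mu))

def mean_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)

pw_mean, pw_var = mean_var(pathwise)
rf_mean, rf_var = mean_var(reinforce)

# Both estimators are unbiased for 2 * mu = 2.0 ...
assert abs(pw_mean - 2.0) < 0.1 and abs(rf_mean - 2.0) < 0.2
# ... but the pathwise estimator has far lower variance.
assert pw_var < rf_var / 5
```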

Beyond Gaussians. The reparameterization trick applies whenever the distribution $q_\phi$ can be expressed as $z = g_\phi(\varepsilon)$ for some differentiable function $g$ and parameter-free noise $\varepsilon$. This covers most exponential family distributions. For distributions without a closed-form reparameterization (such as categorical or Dirichlet distributions), other techniques — including the Gumbel-softmax trick and implicit reparameterization — extend the idea to broader distribution families.

06

VAE Architecture in Practice

Translating the mathematical framework into a working implementation involves several concrete choices about network architecture, output parameterization, and the relationship between reconstruction loss and KL weight.

In its simplest form, a VAE for image data consists of four elements. The encoder $q_\phi(z|x)$ is a convolutional neural network that takes an image $x$ and outputs two vectors: a mean $\mu \in \mathbb{R}^d$ and a log-variance $\log \sigma^2 \in \mathbb{R}^d$, where $d$ is the latent dimension. Outputting $\log \sigma^2$ rather than $\sigma^2$ directly ensures the variance stays positive and improves numerical stability (the network can output any real number without constraint). The latent code is then sampled via $z = \mu + \sigma \odot \varepsilon$ with $\varepsilon \sim \mathcal{N}(0,I)$ and $\sigma = \exp(\frac{1}{2}\log \sigma^2)$.

The decoder $p_\theta(x|z)$ is a transposed-convolutional or upsampling network that takes a latent vector $z \in \mathbb{R}^d$ and produces the parameters of a distribution over observations. For continuous image data normalized to $[0,1]$, the decoder often outputs a per-pixel mean and treats reconstruction as a Bernoulli or Gaussian likelihood; for natural images at full quality, a richer decoder output (or a separate flow model on top) is often needed. The decoder's job is to map from the compact latent code back to a full-resolution image.

Choosing the Latent Dimension

The latent dimension $d$ is one of the most important hyperparameters. Too small, and the bottleneck prevents the model from capturing sufficient variation — the reconstruction quality suffers and the model fails to represent the full diversity of the data. Too large, and the KL regularization term weakens relative to the reconstruction pressure: the encoder has more dimensions than it needs, so surplus dimensions drift to match the prior exactly ($\mu \approx 0$, $\sigma^2 \approx 1$) and carry no information about the input, while a handful of active dimensions do all the work. This pathological behavior is called posterior collapse (see Section 07). In practice, latent dimensions range from 2 for visualization experiments to several hundred or even thousands for complex high-resolution data.

Training Loss in Code

The complete training loss for a mini-batch of $B$ examples, computed per sample, is:

$$\mathcal{L} = -\frac{1}{B}\sum_{i=1}^{B}\left[\underbrace{\log p_\theta(x^{(i)}|z^{(i)})}_{\text{reconstruction}} - \underbrace{\frac{1}{2}\sum_{j=1}^{d}\left(\mu_{ij}^2 + \sigma_{ij}^2 - \log\sigma_{ij}^2 - 1\right)}_{\text{KL divergence}}\right]$$

The reconstruction loss can also be written as mean squared error (for Gaussian decoder with fixed variance), binary cross-entropy (for Bernoulli decoder), or more elaborate perceptual losses that better capture human notions of image similarity. Perceptual losses — computed in the feature space of a pretrained classifier rather than pixel space — often produce visually sharper reconstructions at the cost of mathematical elegance, since they break the clean probabilistic interpretation of the ELBO.

The β parameter. Many implementations weight the KL term by a scalar $\beta$, computing $\mathcal{L} = \text{reconstruction} - \beta \cdot \text{KL}$. Setting $\beta = 1$ corresponds to the original VAE objective. Setting $\beta < 1$ prioritizes reconstruction quality; setting $\beta > 1$ (as in β-VAE, Section 09) prioritizes latent space regularity and disentanglement. This single hyperparameter controls the fundamental tradeoff between faithfulness to the data and structure in the latent space.
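Putting the pieces together, a minimal sketch of the per-batch loss with the β weight. It is written in pure Python for clarity (a real implementation would use a tensor library), and it assumes a Gaussian decoder with fixed unit variance, so the reconstruction term reduces to squared error up to constants:

```python
import math

def vae_loss(x, x_hat, mu, logvar, beta=1.0):
    """Negative ELBO for one batch: Gaussian-decoder reconstruction
    (squared error / 2, up to constants) plus beta times the analytic KL."""
    B = len(x)
    recon, kl = 0.0, 0.0
    for xi, xh, m, lv in zip(x, x_hat, mu, logvar):
        recon += sum((a - b) ** 2 for a, b in zip(xi, xh)) / 2.0
        kl += 0.5 * sum(mj * mj + math.exp(lvj) - lvj - 1.0
                        for mj, lvj in zip(m, lv))
    return (recon + beta * kl) / B

# Toy batch: B = 2, data dim 3, latent dim 2 (all values hypothetical).
x      = [[0.0, 1.0, 0.5], [1.0, 0.0, 0.2]]
x_hat  = [[0.1, 0.9, 0.5], [0.8, 0.1, 0.3]]
mu     = [[0.0, 0.0], [0.5, -0.5]]
logvar = [[0.0, 0.0], [0.0, 0.0]]   # sigma^2 = 1 everywhere

# Perfect reconstruction with beta = 0 gives exactly zero loss,
# and adding the KL term (beta = 1) can only increase the loss.
assert vae_loss(x, x, mu, logvar, beta=0.0) == 0.0
assert vae_loss(x, x_hat, mu, logvar, beta=0.0) < vae_loss(x, x_hat, mu, logvar, beta=1.0)
```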

07

Training Dynamics

VAEs have a well-known failure mode called posterior collapse, where the decoder ignores the latent code entirely. Understanding why it happens — and how to prevent it — is essential for getting VAEs to work on complex data.

Posterior collapse (also called KL vanishing or inactive latents) occurs when the encoder learns to set $\sigma^2 \approx 1$ and $\mu \approx 0$ for some or all latent dimensions, making $q_\phi(z|x) \approx p(z)$ and driving the KL term to zero for those dimensions. When this happens, those latent dimensions carry no information about the input — the posterior matches the prior, the encoder is not using that dimension, and the decoder ignores the corresponding latent activations. The model degenerates into a simpler model that ignores part of its capacity.

Posterior collapse is most severe with powerful autoregressive decoders. An autoregressive decoder — one where each output pixel conditions on all previously generated pixels — is so expressive that it can model the training data well without any help from the latent code. The decoder simply ignores $z$ and relies entirely on its autoregressive context. The encoder, receiving no gradient signal that would penalize it for matching the prior, happily collapses. This is a fundamental tension: we want a powerful decoder for sample quality, but powerful decoders make it easy to ignore the latent code.

Mitigations

KL annealing is the most widely used fix. Training begins with the KL weight set to zero (or nearly so), which allows the encoder to first learn to encode information usefully. The weight is then gradually increased to 1 over thousands of training steps. Because the reconstruction loss has already established useful latent representations before the KL regularization kicks in at full strength, the model resists the temptation to collapse.

Free bits (Kingma et al., 2016) is a complementary approach that imposes a minimum KL per latent dimension: $\mathcal{L}_{KL} = \sum_j \max(\lambda, \mathrm{KL}(q_\phi(z_j|x) \| p(z_j)))$ for some threshold $\lambda$. Dimensions whose KL falls below $\lambda$ contribute the constant $\lambda$ and therefore receive no gradient from the KL term — their first $\lambda$ nats of information are "free." Only dimensions whose KL exceeds the threshold are penalized as usual. This removes the incentive to push a dimension's KL all the way to zero, preventing full collapse while still regularizing dimensions that encode substantial information.
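The clamping itself is a one-liner. A sketch with hypothetical per-dimension KL values:

```python
def free_bits_kl(kl_per_dim, lam):
    """Free-bits KL: clamp each dimension's KL at a floor lam, so dimensions
    below the floor contribute a constant (no gradient pushing them lower)."""
    return sum(max(lam, kl) for kl in kl_per_dim)

kl_per_dim = [0.01, 0.002, 1.3, 0.7]   # hypothetical per-dimension KLs (nats)
lam = 0.1

# The two near-collapsed dimensions are lifted to the floor; the two active
# dimensions are penalized exactly as before.
assert abs(free_bits_kl(kl_per_dim, lam) - (0.1 + 0.1 + 1.3 + 0.7)) < 1e-12
```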

Cyclical annealing — periodically cycling the KL weight between zero and its target value rather than increasing it just once — has been found to produce models with more active latent dimensions than simple monotonic annealing. Each cycle gives the model a fresh opportunity to discover and activate latent dimensions that collapsed in the previous cycle.
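Both schedules reduce to simple functions of the training step. A sketch (the warmup length, cycle period, and ramp fraction are hypothetical knobs):

```python
def linear_anneal(step, warmup_steps):
    """Monotonic KL annealing: ramp the weight from 0 to 1, then hold."""
    return min(1.0, step / warmup_steps)

def cyclical_anneal(step, period, ramp_fraction=0.5):
    """Cyclical annealing: within each cycle, ramp 0 -> 1 over the first
    ramp_fraction of the period, then hold at 1 for the remainder."""
    pos = (step % period) / period
    return min(1.0, pos / ramp_fraction)

assert linear_anneal(0, 1000) == 0.0
assert linear_anneal(500, 1000) == 0.5
assert linear_anneal(5000, 1000) == 1.0      # weight saturates at 1

assert cyclical_anneal(0, 1000) == 0.0       # start of a cycle
assert cyclical_anneal(250, 1000) == 0.5     # halfway up the ramp
assert cyclical_anneal(750, 1000) == 1.0     # plateau
assert cyclical_anneal(1000, 1000) == 0.0    # next cycle: weight resets
```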

Monitoring Training

Practitioners track several diagnostic metrics during VAE training. The KL per dimension (averaged and plotted as a histogram) reveals how many dimensions are active — a healthy VAE uses most of its latent dimensions, not just a few. The reconstruction loss should decrease steadily. The ELBO should increase overall, but a decreasing KL without any accompanying improvement in reconstruction is a warning sign of collapse. Generating random samples $z \sim p(z)$ and decoding them is the most direct qualitative diagnostic: if the samples look plausible and diverse, the VAE is working; if they look blurry or all alike, the model has collapsed or has insufficient capacity.

08

Latent Space Geometry

One of the most striking properties of a well-trained VAE is that its latent space has meaningful geometric structure — distances, directions, and paths correspond to semantically coherent transformations of the data.

Because the KL regularization pushes all encoder distributions toward a shared standard Gaussian prior, the latent codes of different training examples are forced to occupy overlapping, contiguous regions of space. This produces the "good latent space" properties we identified in Section 02: coverage (decoding any point in a reasonable neighborhood of the prior produces something plausible), continuity (nearby latent codes decode to similar observations), and regularity (we can sample from the prior and get plausible generations).

Interpolation

Given two observations $x_1$ and $x_2$, we can encode them to posterior means $\mu_1$ and $\mu_2$ and then decode points along the linear interpolation $z(\alpha) = (1-\alpha)\mu_1 + \alpha\mu_2$ for $\alpha \in [0,1]$. In a well-trained VAE, this produces a smooth, semantically meaningful transition between the two inputs: faces morph gradually, digits transform coherently, molecular structures change continuously. This is impossible in a standard autoencoder — the interpolation path passes through regions of latent space that the decoder has never been trained on and produces incoherent outputs.
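The interpolation itself is elementary; the interesting part is what the decoder does with the interpolated codes. A sketch, with a comment marking where a trained decoder would plug in:

```python
def lerp(mu1, mu2, alpha):
    """Linear interpolation between two latent codes."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(mu1, mu2)]

# In a real pipeline mu1 and mu2 would be encoder means of two inputs,
# and each interpolated code would be decoded:
#   frames = [decoder(lerp(mu1, mu2, a)) for a in alphas]
mu1, mu2 = [0.0, 2.0], [1.0, -2.0]
path = [lerp(mu1, mu2, a) for a in (0.0, 0.5, 1.0)]

assert path[0] == mu1 and path[-1] == mu2   # endpoints are the original codes
assert path[1] == [0.5, 0.0]                # midpoint is the average
```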

Latent Arithmetic

Inspired by the analogy with word2vec's famous "king − man + woman ≈ queen" arithmetic, VAE latent codes support analogous vector operations. If a face dataset encodes "smiling woman" as $z_1$, "neutral woman" as $z_2$, and "neutral man" as $z_3$, then decoding $z_1 - z_2 + z_3$ often produces a plausible "smiling man." This arithmetic works because the KL regularization encourages the model to encode semantic attributes as directions in the latent space rather than as isolated point clusters — a consequence of the fact that the prior treats all dimensions symmetrically.

Visualizing Two-Dimensional Latent Spaces

A common pedagogical experiment is to train a VAE with $d = 2$ on a dataset like MNIST and plot the encoder means $\mu^{(i)}$ colored by class label. In a well-trained 2D VAE, the classes form continuous, overlapping clusters arranged in the latent plane, with smooth transitions between classes at the boundaries. The regularization prevents the classes from being separated by large empty voids — a hallmark of good generative latent structure. This 2D setting sacrifices generation quality (2 dimensions is far too few to capture all MNIST variation) but provides invaluable intuition for understanding what the VAE objective is accomplishing in higher dimensions.

Curvature of the latent manifold. The Gaussian prior implicitly treats all directions in latent space as equally important. But the true data manifold may have non-uniform curvature — some directions may encode much more variation than others. Riemannian geometry-aware VAE variants (such as RVAE, Hauberg et al.) explicitly model this curvature, using pullback metrics to define distances on the latent space that respect the decoder's local geometry. This produces better interpolations and more meaningful distances between encoded points.

09

Disentangled Representations

A disentangled representation is one where each latent dimension independently controls a single, interpretable factor of variation in the data. Disentanglement has been a central research goal in representation learning — and the VAE framework provides a natural setting for pursuing it.

In an ideal disentangled representation of face images, one dimension would control age, another pose, another hair color, another expression — with each dimension independently variable without affecting the others. Such representations would be highly useful: you could modify a specific attribute of a generated face by nudging a single latent coordinate, making generative editing precise and controllable. More broadly, disentangled representations are hypothesized to be more robust, more transferable to downstream tasks, and more interpretable than entangled alternatives.

β-VAE

Higgins et al. (2017) proposed the β-VAE, which modifies the VAE objective by increasing the weight on the KL term:

$$\mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}_{q_\phi}[\log p_\theta(x|z)] - \beta\,\mathrm{KL}(q_\phi(z|x) \| p(z)), \quad \beta > 1$$

The intuition is that a stronger KL penalty more aggressively forces the encoder to use an efficient, disentangled code. The prior $\mathcal{N}(0,I)$ has independent dimensions, and a large $\beta$ pushes the posterior toward this factorized structure. If the data has $K$ underlying generative factors and the model is given a latent dimension of exactly $K$, the β-VAE tends to allocate one dimension per factor. Higgins et al. showed compelling qualitative results on synthetic datasets: with the right $\beta$, each latent dimension corresponded to a single interpretable axis of variation (position, rotation, scale, shape) with no correlation between dimensions.

The cost is a reconstruction quality tradeoff. Higher $\beta$ means less capacity devoted to reconstruction — the model is more regularized but produces blurrier, less faithful outputs. Finding the right $\beta$ requires balancing these competing objectives and often depends on the dataset and the evaluation criterion.

FactorVAE and TC Decomposition

Kim and Mnih (2018) analyzed the β-VAE objective and showed that the excess KL penalty $(\beta - 1)\,\mathrm{KL}$ implicitly penalizes, among other terms, the total correlation — the KL divergence between the aggregate posterior $q_\phi(z) = \mathbb{E}_{x}[q_\phi(z|x)]$ and the product of marginals $\prod_j q_\phi(z_j)$. Minimizing total correlation directly encourages statistical independence among latent dimensions without unnecessarily penalizing dimensions that need high KL to encode variation. FactorVAE adds a discriminator-based total correlation penalty to the standard VAE objective, achieving better disentanglement at the same reconstruction quality as β-VAE with smaller $\beta$.
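Written out, the decomposition in question (made fully explicit in Chen et al.'s β-TCVAE analysis, 2018) splits the average KL term into three pieces: the mutual information between data point and code, the total correlation, and a dimension-wise KL to the prior:

$$\mathbb{E}_{p_{\text{data}}(x)}\left[\mathrm{KL}(q_\phi(z|x) \| p(z))\right] = \underbrace{I_q(x; z)}_{\text{index-code MI}} + \underbrace{\mathrm{KL}\left(q_\phi(z) \,\middle\|\, \prod_j q_\phi(z_j)\right)}_{\text{total correlation}} + \underbrace{\sum_j \mathrm{KL}(q_\phi(z_j) \| p(z_j))}_{\text{dimension-wise KL}}$$

The middle term is the total correlation; penalizing it alone, rather than the whole KL, is exactly what FactorVAE's discriminator-based estimate targets.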

Measuring Disentanglement

Quantifying disentanglement is surprisingly difficult. The most widely used metrics include the Mutual Information Gap (MIG), which measures the average gap between the highest and second-highest mutual information between each generative factor and each latent dimension — a perfectly disentangled model would have exactly one latent dimension per factor with zero MI for all others. The DCI (Disentanglement, Completeness, Informativeness) framework measures three complementary properties: whether each latent dimension is relevant to at most one factor (disentanglement), whether each factor is captured by at most one latent dimension (completeness), and whether the overall code is predictive of the factors (informativeness).
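MIG is straightforward to compute once the mutual-information matrix between factors and latent dimensions has been estimated. A sketch on a hand-made MI matrix (estimating the MI values, the hard part in practice, is assumed already done; all numbers are hypothetical):

```python
def mig(mi, H):
    """Mutual Information Gap: for each generative factor, the gap between
    the largest and second-largest MI with any latent dimension, normalized
    by the factor's entropy, averaged over factors."""
    gaps = []
    for row, h in zip(mi, H):
        top, second = sorted(row, reverse=True)[:2]
        gaps.append((top - second) / h)
    return sum(gaps) / len(gaps)

mi = [[1.0, 0.1],    # factor 0 is captured almost entirely by latent dim 0
      [0.2, 0.8]]    # factor 1 mostly by latent dim 1
H = [1.0, 1.0]       # factor entropies (nats)

# Per-factor gaps: (1.0 - 0.1) and (0.8 - 0.2); their mean is 0.75.
assert abs(mig(mi, H) - 0.75) < 1e-12
```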

A sobering empirical finding by Locatello et al. (2019) demonstrated that disentanglement in VAEs depends critically on inductive biases and random seeds, and that without supervision, no unsupervised method reliably achieves disentanglement across different datasets. This suggests that true disentanglement may require some form of weak supervision — at minimum, knowledge of which factors exist — rather than being achievable purely from the structure of the ELBO.

10

Conditional & Semi-Supervised VAEs

By conditioning the encoder and decoder on auxiliary information — class labels, attributes, text descriptions — VAEs become controllable generative models. The semi-supervised variant elegantly handles datasets where only some examples are labeled.

A Conditional VAE (CVAE) extends the basic framework by conditioning both the encoder and decoder on a context variable $c$:

$$q_\phi(z|x,c) \quad\text{and}\quad p_\theta(x|z,c)$$

The ELBO becomes $\mathbb{E}_{z \sim q_\phi(\cdot|x,c)}[\log p_\theta(x|z,c)] - \mathrm{KL}(q_\phi(z|x,c) \| p(z|c))$, where the prior may also be conditioned (often $p(z|c) = \mathcal{N}(0,I)$ for simplicity). In practice, conditioning is implemented by concatenating a one-hot label vector (or learned embedding) to the input of both encoder and decoder networks. At generation time, we specify the desired class $c$, sample $z \sim \mathcal{N}(0,I)$, and decode to get a sample of the desired class. This gives precise, label-conditional generation without needing a separate discriminator or classifier.
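The concatenation trick mentioned above is simple enough to show directly. A hedged NumPy sketch (the `condition` helper is hypothetical; in practice the concatenated vector feeds the first layer of the encoder or decoder network):

```python
import numpy as np

def one_hot(labels, num_classes):
    """Map integer class labels to one-hot row vectors."""
    return np.eye(num_classes)[labels]

def condition(inputs, labels, num_classes):
    """Concatenate a one-hot class vector onto each input row — the
    simplest way to condition both encoder and decoder in a CVAE."""
    return np.concatenate([inputs, one_hot(labels, num_classes)], axis=-1)

x = np.random.randn(4, 784)          # batch of flattened 28x28 images
c = np.array([0, 3, 3, 9])           # class labels for the batch
enc_in = condition(x, c, num_classes=10)   # shape (4, 794)
```

The decoder receives the same treatment, with the sampled $z$ in place of $x$, so both networks see the label at every step of training and generation.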

Semi-Supervised Learning

Kingma et al.'s M1+M2 model (2014) showed how the VAE framework can leverage unlabeled data to improve classification. The M1 model is a standard VAE trained on all data (labeled and unlabeled) to produce a latent representation. These latent codes are then used as features for a classifier trained on the labeled subset. Because the VAE exploits all the data to learn the structure of the distribution, the resulting features are richer than anything that could be learned from the small labeled subset alone.

The M2 model treats the class label itself as a latent variable. For unlabeled examples, the label is marginalized out in the ELBO: $\log p_\theta(x) \geq \sum_y q_\phi(y|x)\,\mathcal{L}(x,y) + \mathcal{H}(q_\phi(y|x))$, where the classifier $q_\phi(y|x)$ provides soft label assignments, and $\mathcal{H}$ is the entropy of the classifier's distribution over classes. For labeled examples, the label is observed and the standard ELBO applies. The classifier and generative model are trained jointly, so each component benefits from the other: the generative model uses classifier predictions to structure the latent space, and the latent space provides richer features for classification.
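The unlabeled-example bound can be sketched numerically. Assuming the per-class labeled ELBOs $\mathcal{L}(x,y)$ have already been evaluated (the `unlabeled_bound` helper and its toy inputs are illustrative):

```python
import numpy as np

def unlabeled_bound(elbo_per_class, class_probs):
    """M2 bound for an unlabeled example:
    sum_y q(y|x) * L(x, y) + H(q(y|x)).

    elbo_per_class[y] is the labeled ELBO L(x, y) evaluated under each
    candidate label; class_probs is the classifier output q(y|x).
    """
    expected_elbo = np.sum(class_probs * elbo_per_class)
    entropy = -np.sum(class_probs * np.log(class_probs + 1e-12))
    return expected_elbo + entropy

# Uniform classifier over 4 classes, identical per-class ELBOs:
# the bound is the shared ELBO plus the entropy log(4).
bound = unlabeled_bound(np.full(4, -10.0), np.full(4, 0.25))
```

Note that the entropy term rewards classifier uncertainty here; in the full M2 objective it is balanced by an explicit classification loss on the labeled examples.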

Text-Conditioned VAEs

When the condition is free-form text rather than a discrete label, the condition is typically encoded by a separate encoder (often a language model) whose representation is then provided to both the image encoder and decoder. Text-conditioned VAEs predate the CLIP-based approaches that now dominate multimodal generation, and they established many of the architectural patterns — cross-attention between latent codes and text representations, hierarchical conditioning — that later became standard in diffusion models.

11

Advanced Variants

The standard Gaussian VAE with continuous latents is just one point in a large design space. Key variants introduce discrete latents, hierarchical structure, or entirely different prior families — each addressing specific limitations of the baseline.

The Vector Quantised VAE (VQ-VAE), introduced by van den Oord et al. (2017), replaces the continuous Gaussian latent space with a discrete codebook. Instead of encoding $x$ to a Gaussian distribution, the encoder maps $x$ to a continuous vector $z_e$, which is then quantized by replacing it with the nearest vector in a learned codebook $\{e_1, \ldots, e_K\}$:

$$z_q = e_k \quad \text{where } k = \arg\min_j \|z_e - e_j\|$$

The discrete code $k$ is the latent representation. Because quantization is not differentiable, gradients are passed from the decoder through a straight-through estimator: the gradient from the decoder with respect to $z_q$ is copied directly to $z_e$ as if the quantization were the identity function. The codebook vectors are updated by an exponential moving average of the encoder outputs assigned to them. The result is a latent space consisting of a finite vocabulary of codes, much like the vocabulary of a language model — enabling the training of a separate autoregressive prior over code sequences, which serves as the generative model at test time.
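The quantization step itself is a nearest-neighbour lookup. A minimal forward-pass sketch in NumPy (the `quantize` helper is illustrative; the straight-through gradient copy and EMA codebook updates only matter during training in a framework with autodiff):

```python
import numpy as np

def quantize(z_e, codebook):
    """Nearest-neighbour codebook lookup used in VQ-VAE.

    Returns the quantized vectors z_q and their integer code indices.
    In training, the straight-through estimator copies decoder gradients
    from z_q back to z_e; here we show only the forward pass.
    """
    # Squared distance between each encoder output and each codebook entry
    dists = np.sum((z_e[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    codes = np.argmin(dists, axis=1)
    return codebook[codes], codes

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
z_e = np.array([[0.9, 1.2], [0.1, -0.2]])
z_q, codes = quantize(z_e, codebook)   # codes → [1, 0]
```

The integer `codes` array is what the autoregressive prior is later trained on, exactly like token IDs in a language model.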

VQ-VAE is the direct ancestor of the tokenizers used in modern image generation systems like DALL-E, Parti, and LlamaGen. The idea of discretizing an image into a sequence of tokens — enabling language-model-style generation over images — flows directly from this architecture. VQ-VAE-2 (Razavi et al., 2019) extended the approach to hierarchical codebooks, with a bottom-level codebook capturing fine local details and a top-level codebook capturing global structure.

Hierarchical VAEs

Standard VAEs with a single Gaussian latent variable struggle with very high-resolution, complex images — the latent space must simultaneously capture both global structure (overall composition) and fine detail (textures, edges), which is difficult with a flat representation. Hierarchical VAEs address this by stacking multiple latent variables in a hierarchy:

$$p_\theta(x, z_1, \ldots, z_L) = p(z_L)\,\Bigg[\prod_{l=2}^{L} p_\theta(z_{l-1}\mid z_l)\Bigg]\, p_\theta(x\mid z_1)$$

The top-level latent $z_L$ captures the most abstract, global aspects of the data. Each lower level refines the representation with progressively more local, detailed information. The encoder runs a bottom-up pass computing approximate posteriors at each level, and the decoder runs a top-down pass generating each level conditioned on the one above.
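Generation is then ancestral sampling down the hierarchy. A toy sketch, assuming unit-variance conditionals and standing in hypothetical linear functions for the learned top-down networks:

```python
import numpy as np

def sample_hierarchy(top_down_layers, z_dim, rng):
    """Ancestral sampling in a hierarchical VAE: draw the top latent from
    the prior, then sample each lower level conditioned on the one above.

    top_down_layers[i] maps z_l to the mean of p(z_{l-1} | z_l); unit
    variance is assumed for every conditional in this toy example.
    """
    z = rng.standard_normal(z_dim)       # z_L ~ N(0, I)
    for layer in top_down_layers:        # one conditional sample per level
        z = layer(z) + rng.standard_normal(z_dim)
    return z                             # z_1, fed to the decoder p(x | z_1)

rng = np.random.default_rng(0)
layers = [lambda z: 0.5 * z, lambda z: 0.5 * z]   # toy "networks" for L=3
z_1 = sample_hierarchy(layers, z_dim=8, rng=rng)
```

In a real model each `layer` is a deep network and the variances are predicted rather than fixed, but the top-down control flow is the same.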

NVAE (Vahdat and Kautz, 2020) showed that hierarchical VAEs could achieve generation quality competitive with GANs on CelebA-HQ and CIFAR-10, using very deep residual networks and careful normalization for stability. Key design choices include depth-wise separable convolutions in the encoder/decoder, batch normalization adapted for VAE training dynamics, and spectral regularization to prevent training instabilities that arise when the KL term is small. NVAE demonstrated that the blurriness commonly associated with VAEs was a capacity problem, not a fundamental limitation of the framework.

Diffusion Models as Hierarchical VAEs

There is a deep connection between hierarchical VAEs and diffusion models. A diffusion model can be viewed as a hierarchical VAE with a fixed (non-learned) encoder that progressively adds Gaussian noise at each level, and a learned decoder that denoises at each level. The ELBO derivation carries over exactly, with the reconstruction term becoming the denoising score matching objective. This perspective unifies two seemingly different families of generative models and provides theoretical insight into why diffusion models work so well: they are implicitly hierarchical VAEs with an encoder that does not need to be learned, eliminating the posterior collapse and encoder-decoder balance problems that plague standard VAEs.

12

Applications & Connections

VAEs are not just a theoretical curiosity — they underpin applications from drug discovery to anomaly detection to the latent spaces of modern image generators. Understanding where VAEs shine, where they struggle, and how they connect to other generative frameworks completes the picture.

Drug Discovery and Molecular Generation

Molecular design is one of the clearest success stories for VAEs. Molecules can be represented as graphs (atoms are nodes, bonds are edges) or as SMILES strings. Gómez-Bombarelli et al. (2018) showed that a VAE trained on SMILES sequences produces a smooth, continuous latent space over chemical structures, enabling gradient-based optimization in latent space: starting from a known molecule and moving in the direction of higher predicted solubility or bioactivity, the model generates novel molecular structures that optimize the target property. This approach — encoding molecules into a continuous space, optimizing, and decoding — is fundamentally more efficient than discrete search over combinatorial chemical space, and has been used to discover candidate drug molecules with specific target properties.

Anomaly Detection

VAEs are natural anomaly detectors. A model trained on normal data learns to reconstruct normal inputs well; anomalous inputs — those from a different distribution — typically receive high reconstruction error (because the decoder hasn't learned to reconstruct them) and high KL divergence (because the encoder can't find a good latent code for them within the prior). The ELBO itself serves as an anomaly score: a low ELBO indicates that the input is poorly explained by the model, flagging it as potentially anomalous. This is used in industrial quality control, medical imaging, and network intrusion detection.
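Turning the score into a detector requires only a threshold, typically calibrated on held-out normal data. A sketch under that assumption (the `anomaly_threshold` helper and the synthetic scores are illustrative):

```python
import numpy as np

def anomaly_threshold(train_scores, quantile=0.99):
    """Pick a score threshold from negative-ELBO scores on normal data;
    test inputs scoring above it are flagged as anomalous."""
    return np.quantile(train_scores, quantile)

rng = np.random.default_rng(1)
train_scores = rng.normal(1.0, 0.1, size=1000)   # scores on normal data
thr = anomaly_threshold(train_scores)

test_scores = np.array([0.9, 1.05, 3.0])         # last input fits poorly
is_anomaly = test_scores > thr                   # → [False, False, True]
```

The quantile controls the false-positive rate on normal data directly, which is often the quantity domain experts actually care about.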

Latent Diffusion and the VAE Renaissance

Perhaps the most impactful application of VAEs in modern AI is as the perceptual compression stage of latent diffusion models. Rombach et al. (2022) observed that pixel-space diffusion models spend enormous compute modeling imperceptible high-frequency pixel variation that has no semantic content. Their solution: first train a VAE to compress images to a much smaller latent representation (e.g., a 512×512 image becomes a 64×64×4 latent map), then train a diffusion model in this compressed latent space rather than pixel space. The VAE handles the perceptual compression; the diffusion model handles the semantic generation. This is the architecture of Stable Diffusion, which can generate high-quality 512×512 images in seconds on consumer hardware — a feat that would be computationally prohibitive in pixel space.
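The compute savings are easy to quantify from the figures above: the latent map carries 48 times fewer values per image than the raw pixels.

```python
# Elements per image in pixel space vs. the VAE latent space
# (sizes taken from the 512x512 -> 64x64x4 example above).
pixel_elems = 512 * 512 * 3          # RGB image: 786,432 values
latent_elems = 64 * 64 * 4           # latent map: 16,384 values
ratio = pixel_elems / latent_elems   # → 48.0x fewer values per image
```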

Comparison with Other Generative Frameworks

| Framework | Training objective | Sample quality | Latent space | Training stability |
|---|---|---|---|---|
| VAE | ELBO (log-likelihood lower bound) | Good (blurry at baseline) | Continuous, regularized | Very stable |
| GAN | Adversarial min-max game | Excellent (sharp) | Implicit | Notoriously unstable |
| Normalizing flow | Exact log-likelihood | Good | Bijective mapping | Stable but slow |
| Diffusion | Denoising score matching (ELBO) | Excellent | Hierarchical (implicit) | Very stable |
| Autoregressive | Exact log-likelihood | Excellent | None (sequential) | Stable but slow to sample |

VAEs occupy a distinctive niche: they are the only major framework that explicitly learns a structured, navigable latent space. This makes them uniquely suitable for applications that require latent space operations — interpolation, arithmetic, optimization, structured sampling — even as diffusion models surpass them on raw generation quality. For representation learning tasks, VAEs remain highly competitive: their posterior provides principled uncertainty estimates about latent codes, which pure discriminative encoders cannot.

Blurriness and the Gaussian decoder. The notorious blurriness of VAE samples is largely a consequence of using a Gaussian decoder with a fixed variance, not a fundamental limitation of the framework. A Gaussian decoder minimizes MSE, which under uncertainty about the exact pixel values produces averaged, blurred outputs. Replacing the decoder with an autoregressive model, a normalizing flow, or a diffusion model eliminates the blurriness entirely — at the cost of slower sampling and more complex training. Architectures like DALL-E (VQ-VAE + autoregressive prior) and Stable Diffusion (VAE + diffusion) embody this principle.

Further Reading

Foundational Papers

Kingma, D. P., and Welling, M. (2014). Auto-Encoding Variational Bayes. ICLR.
Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. (2014). Semi-Supervised Learning with Deep Generative Models. NeurIPS.

Disentanglement

Higgins, I., et al. (2017). β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. ICLR.
Kim, H., and Mnih, A. (2018). Disentangling by Factorising. ICML.
Locatello, F., et al. (2019). Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations. ICML.

Advanced Architectures

van den Oord, A., Vinyals, O., and Kavukcuoglu, K. (2017). Neural Discrete Representation Learning. NeurIPS.
Razavi, A., van den Oord, A., and Vinyals, O. (2019). Generating Diverse High-Fidelity Images with VQ-VAE-2. NeurIPS.
Vahdat, A., and Kautz, J. (2020). NVAE: A Deep Hierarchical Variational Autoencoder. NeurIPS.
Rombach, R., et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR.