Generative Adversarial Networks, two networks locked in creative competition.
Ian Goodfellow's 2014 paper introduced an idea that was both absurdly simple and deeply radical: train two neural networks against each other, one forging images and one detecting forgeries, and let competition drive both toward perfection. The adversarial principle unlocked a decade of photorealistic synthesis, transforming GANs into the engine behind deepfakes, artistic style transfer, data augmentation, and the architectural blueprints that still echo through modern diffusion models.
Prerequisites
This chapter assumes solid familiarity with deep neural networks and backpropagation (Parts IV–V), convolutional architectures (Part V Ch 03), and basic probability including KL divergence and the concept of a probability distribution over high-dimensional spaces (Part I Ch 04–06). Exposure to VAEs (Chapter 01 of this part) is helpful context for contrasting generative frameworks but is not required.
The Adversarial Idea
Before GANs, every generative model required an explicit formula for the likelihood of observed data. Goodfellow's insight was that you could sidestep likelihood entirely — you just needed a critic good enough to tell real from fake.
The classical approach to generative modeling is density estimation: you posit a family of probability distributions $p_\theta(x)$ and find the parameters $\theta$ that maximize the likelihood of your training data. Pixel-level autoregressive models, normalizing flows, VAEs — all operate within this paradigm. The problem is that defining a tractable, expressive density over a 1024×1024 image (roughly a million dimensions) is extraordinarily hard. Every pixel must be consistent with every other pixel, and the density must be normalizable.
Ian Goodfellow, while arguing with colleagues at a Montreal bar in 2014, sketched a different answer. What if we replaced the explicit density with an implicit model? Instead of specifying $p_\theta(x)$ analytically, we define it procedurally: sample a noise vector $z \sim p(z)$ (typically a standard Gaussian), pass it through a deep neural network $G_\theta$, and declare that the output $G_\theta(z)$ is a sample from the learned distribution. We never write down or evaluate $p_\theta(x)$ — it is whatever distribution $G_\theta$ implicitly induces over output space.
The question becomes: how do you train such a model without a likelihood to maximize? The adversarial answer is to introduce a second network, the discriminator $D_\phi$, whose sole job is to distinguish real samples $x \sim p_\text{data}$ from generated fakes $G_\theta(z)$. The discriminator outputs a scalar in $[0,1]$, interpreted as the probability that its input is real. The two networks are trained simultaneously in opposition: the generator tries to fool the discriminator, and the discriminator tries not to be fooled. As training progresses, the generator is forced to produce increasingly convincing images, because a naive generator is easily caught.
This setup draws on a game-theoretic metaphor Goodfellow himself described as a forger versus an art detective, or a currency counterfeiter versus a bank. The forger starts by producing crude forgeries; the bank quickly learns to spot them. The forger studies the feedback and improves; the bank sharpens its scrutiny. At equilibrium — if one exists — the forger produces forgeries so perfect that the bank is reduced to guessing at chance. In GAN terms, the generator has learned to match the data distribution so well that the discriminator assigns probability $\frac{1}{2}$ to everything, unable to tell real from synthetic.
The elegance of this setup is that the discriminator acts as a learned loss function. Traditional approaches to image generation used hand-crafted losses — pixel-wise MSE, perceptual losses, etc. — that often produced blurry outputs, because minimizing pixel-wise error under uncertainty drives predictions toward a blurry average. The discriminator, being a neural network, can learn arbitrarily complex structure: it discovers what makes an image look real on its own, without anyone telling it. This learned loss is exactly what allows GANs to produce crisp, photorealistic textures where VAEs often produce soft, oversmoothed results.
The GAN Objective & JS Divergence
The original minimax formulation has a clean theoretical story: at the optimum, the generator minimizes the Jensen–Shannon divergence between its distribution and the data distribution. But this elegance comes with a sharp practical problem.
Formally, Goodfellow et al. proposed the following two-player minimax game. The discriminator $D_\phi$ is trained to maximize the probability of correctly labeling real and fake samples. The generator $G_\theta$ is trained to minimize the same quantity — or equivalently, to maximize the probability that the discriminator misclassifies its outputs as real:
$$\min_{G} \max_{D} \; V(D, G) = \mathbb{E}_{x \sim p_\text{data}} [\log D(x)] + \mathbb{E}_{z \sim p(z)} [\log (1 - D(G(z)))]$$The discriminator wants both terms large: $D(x) \approx 1$ on real data (first term large) and $D(G(z)) \approx 0$ on fakes (second term large when expressed as $\log(1 - D(G(z)))$). The generator wants the second term small — it wants $D(G(z)) \approx 1$, so that $\log(1 - D(G(z)))$ becomes very negative, driving the overall objective down. This is the standard binary cross-entropy objective for a binary classifier, just applied in a minimax sense across two competing optimizers.
The optimal discriminator
Suppose we fix $G$ and ask: what is the optimal discriminator? This is a pointwise optimization problem. For any input $x$, the discriminator must choose $D(x)$ to maximize $p_\text{data}(x) \log D(x) + p_G(x) \log(1 - D(x))$, where $p_G$ is the distribution induced by the generator. Taking the derivative and setting it to zero yields the optimal discriminator:
$$D^*(x) = \frac{p_\text{data}(x)}{p_\text{data}(x) + p_G(x)}$$This is exactly the posterior probability that $x$ came from the data distribution given that it came from either the data or generator distribution with equal prior probability. If $p_G = p_\text{data}$ everywhere, then $D^*(x) = \frac{1}{2}$ everywhere — the optimal discriminator cannot do better than random guessing, which is exactly the condition we want at convergence.
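The derivation is easy to check numerically. The sketch below (toy 1-D Gaussians chosen purely for illustration) evaluates $D^*(x) = p_\text{data}(x)/(p_\text{data}(x) + p_G(x))$ on a grid and confirms it equals $\frac{1}{2}$ exactly where the two densities agree:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def optimal_discriminator(x, p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_G(x)) for a fixed generator."""
    return p_data(x) / (p_data(x) + p_g(x))

# Toy example: data is N(0, 1); the generator currently produces N(2, 1).
p_data = lambda t: gaussian_pdf(t, 0.0, 1.0)
p_g    = lambda t: gaussian_pdf(t, 2.0, 1.0)

x = np.linspace(-4, 6, 1001)
d_star = optimal_discriminator(x, p_data, p_g)
# At x = 1 (equidistant from both means) the densities are equal, so
# D*(1) = 1/2; far left D* -> 1 (surely real), far right D* -> 0.
```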
The JS divergence connection
Substituting the optimal discriminator back into $V(D^*, G)$ and doing some algebra, the generator's objective reduces to:
$$V(D^*, G) = -\log 4 + 2 \cdot \text{JSD}(p_\text{data} \| p_G)$$where $\text{JSD}$ is the Jensen–Shannon divergence — a symmetrized, bounded version of the KL divergence that is always between 0 and $\log 2$. Minimizing the GAN objective with respect to $G$ is therefore equivalent to minimizing the JS divergence between the generated and data distributions. The unique global minimum, achieved when $p_G = p_\text{data}$, yields $V(D^*, G) = -\log 4$.
This is a beautiful theoretical result. In practice, however, it is also the source of a major practical failure mode. When $p_G$ and $p_\text{data}$ have disjoint or nearly disjoint supports — which is typical early in training, since a random generator produces outputs nowhere near real images — the JS divergence is constant at $\log 2$ and provides zero gradient information. The generator learns nothing from a perfect discriminator. Worse still, a nearly perfect discriminator drives $D(G(z))$ toward zero, so $\log(1 - D(G(z))) \approx \log(1) = 0$: the loss sits in a flat region and its gradients vanish entirely.
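The flat-gradient pathology is easy to reproduce on discretized toy distributions. In the sketch below, two point masses have disjoint support, and the JSD saturates at $\log 2$ whether the generator is one bin away from the data or eighty:

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence; bins with p = 0 contribute nothing."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Point-mass distributions on 100 bins with disjoint supports.
p = np.zeros(100); p[10] = 1.0        # "data" concentrated at bin 10
q_near = np.zeros(100); q_near[11] = 1.0  # generator one bin away
q_far  = np.zeros(100); q_far[90] = 1.0   # generator eighty bins away

# JSD is saturated at log 2 in both cases -- it carries no information
# about *how far* the generator is from the data.
```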
The saturation problem: When the discriminator is too good early in training, $D(G(z)) \approx 0$ for all generated samples, so $\log(1 - D(G(z))) \approx 0$ and its gradient approaches zero. The generator receives no useful learning signal. This is the core instability of the original GAN formulation.
Training Dynamics & Instabilities
GAN training is a moving-target optimization problem: neither player is minimizing a fixed loss. The landscape is non-stationary, equilibria can be unstable, and the practical tricks needed to keep training on track took years to accumulate.
The non-saturating loss
The first and most important practical fix is the non-saturating generator loss, proposed by Goodfellow in the original paper as a heuristic alternative. Instead of minimizing $\log(1 - D(G(z)))$ (which saturates when $D$ is good), the generator maximizes $\log D(G(z))$:
$$\mathcal{L}_G = -\mathbb{E}_{z \sim p(z)} [\log D(G(z))]$$This objective is not equivalent to the minimax game — it changes the game's Nash equilibrium — but it provides much stronger gradients early in training. When $D(G(z)) \approx 0$, $-\log D(G(z))$ is large and its gradient is steep, pointing the generator in a useful direction. The discriminator's loss remains the standard binary cross-entropy:
$$\mathcal{L}_D = -\mathbb{E}_{x \sim p_\text{data}} [\log D(x)] - \mathbb{E}_{z \sim p(z)} [\log (1 - D(G(z)))]$$In practice, virtually every GAN implementation uses the non-saturating loss for the generator while keeping the minimax discriminator objective. It is one of many asymmetries that practitioners have learned to live with.
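The difference in gradient strength can be seen directly by differentiating both generator losses with respect to the discriminator's logit $a$, where $D(G(z)) = \sigma(a)$. A small hand-derived sketch:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Gradients of the two generator losses w.r.t. the discriminator logit a,
# where D(G(z)) = sigmoid(a), derived by hand:
#   saturating loss      L = log(1 - sigmoid(a))  =>  dL/da = -sigmoid(a)
#   non-saturating loss  L = -log(sigmoid(a))     =>  dL/da = sigmoid(a) - 1
def grad_saturating(a):
    return -sigmoid(a)

def grad_non_saturating(a):
    return sigmoid(a) - 1.0

# Early in training a confident discriminator rejects fakes: a << 0.
# At a = -10 (D(G(z)) ~ 4.5e-5) the saturating gradient is ~ -4.5e-5,
# essentially no signal, while the non-saturating gradient is ~ -1.0.
```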
Balancing the two networks
GAN training is a delicate balance. If the discriminator improves much faster than the generator, it becomes too good and the generator's gradients vanish. If the generator improves much faster, it might find a specific set of images that fool a weak discriminator without actually matching the data distribution. The conventional wisdom — which varies by architecture and dataset — is to train the discriminator for one or a few steps per generator step, keep learning rates similar and low, and use momentum carefully (Adam with $\beta_1 = 0.5$ rather than 0.9 was found empirically to help).
Spectral normalization
A powerful and now widely-adopted stabilization technique is spectral normalization (Miyato et al., 2018), which constrains the discriminator's Lipschitz constant by normalizing each weight matrix by its largest singular value. If $W$ is a weight matrix and $\sigma_1(W)$ its spectral norm, the normalized weight is $\hat{W} = W / \sigma_1(W)$. This ensures that the discriminator cannot change its output too dramatically in response to small input changes, preventing the extreme confidence that causes vanishing gradients. Spectral normalization adds almost no computational overhead and became a standard component in large-scale GAN training.
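A minimal sketch of spectral normalization, using full power iteration for clarity (library implementations persist the estimate $u$ across training steps and run a single iteration per update):

```python
import numpy as np

def spectral_normalize(W, n_iters=100):
    """Normalize W by its largest singular value, estimated with power
    iteration: repeatedly apply W and W^T to a random vector until it
    aligns with the top singular direction."""
    rng = np.random.default_rng(0)
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma1 = u @ W @ v  # estimate of the spectral norm sigma_1(W)
    return W / sigma1

W = np.random.default_rng(1).standard_normal((64, 32))
W_hat = spectral_normalize(W)
# The normalized weight has largest singular value 1, capping how fast
# this layer (and hence the discriminator) can vary with its input.
```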
Label smoothing and noise
Several other regularization techniques improve stability in practice. One-sided label smoothing replaces the discriminator's target of 1.0 for real samples with a softer target of, say, 0.9, preventing the discriminator from making overconfident predictions that saturate the loss. Instance noise adds Gaussian noise to both real and fake inputs to the discriminator — this has the theoretical effect of blurring the two distributions together, giving the generator non-zero gradient signal even when its distribution barely overlaps with the data. The noise is gradually annealed to zero during training.
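One-sided label smoothing is a one-line change to the discriminator loss. A sketch, with an assumed soft target of 0.9:

```python
import numpy as np

def d_loss_smoothed(d_real, d_fake, real_target=0.9):
    """Discriminator BCE with one-sided label smoothing: real samples get
    a soft target (0.9) while fake targets stay exactly 0.0 -- one-sided."""
    real_term = -(real_target * np.log(d_real)
                  + (1 - real_target) * np.log(1 - d_real))
    fake_term = -np.log(1 - d_fake)
    return np.mean(real_term) + np.mean(fake_term)

# Cross-entropy against target 0.9 is minimized at D(x) = 0.9, so the
# discriminator gains nothing by saturating its outputs toward 1.0.
```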
Practical GAN training checklist: Use the non-saturating generator loss. Apply spectral normalization or gradient penalty to the discriminator. Use Adam with $\beta_1 = 0.5$. Normalize inputs to $[-1, 1]$. Use batch normalization in the generator (but not the discriminator, or use instance normalization instead). Monitor both losses carefully — if the discriminator loss hits zero, training has collapsed.
Mode Collapse
The worst failure mode of GAN training is not instability but deception: a generator that has found a small set of outputs that reliably fool the discriminator, abandoning the diversity of the real data distribution entirely.
The data distribution $p_\text{data}$ is typically multimodal: for face images, it contains millions of distinct people with distinct ages, ethnicities, expressions, and lighting conditions. A generator minimizing the adversarial loss is incentivized to produce outputs that the discriminator cannot distinguish from real. But it is not incentivized to produce diverse outputs. If the generator can produce a single face that consistently fools the discriminator, it has no gradient signal pushing it to also produce other faces. The result is mode collapse: the generator assigns all of its probability mass to a small number of outputs, ignoring most of the true distribution.
Mode collapse can be partial (the generator covers some modes but not all) or complete (the generator produces essentially a single output or a tiny cluster of outputs regardless of the input $z$). In the partial case, $z$ may still have some effect on the output but the outputs cluster into a small number of distinct visual types. Complete collapse is usually visible immediately from visual inspection of generated samples.
Why does mode collapse happen?
The fundamental cause is the adversarial objective's asymmetry. The JS divergence and its relatives penalize the generator for producing outputs that are clearly not in $p_\text{data}$, but they do not separately penalize it for ignoring parts of $p_\text{data}$ that it has decided to neglect. As long as $p_G$ is concentrated on high-density regions of $p_\text{data}$, the generator can achieve a low loss even if it misses most of the distribution. This is in contrast to maximum likelihood estimation, where the KL divergence $\text{KL}(p_\text{data} \| p_G)$ diverges to infinity if $p_G$ assigns zero probability to any region where $p_\text{data} > 0$ — a much stronger diversity guarantee.
Minibatch discrimination
One early fix, proposed by Salimans et al. (2016), is minibatch discrimination: the discriminator is shown not just a single sample but an entire batch at once, and can compute statistics across the batch — for example, checking whether all images in the batch are suspiciously similar to each other. A real batch of faces contains huge diversity; a batch of collapsed generator outputs will be nearly identical. By letting the discriminator use this batch-level signal, we give it the ability to detect and penalize collapse directly. The generator must then produce diverse outputs to avoid detection, even if any single output would fool the discriminator on its own.
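A simplified stand-in for minibatch discrimination (not the exact Salimans et al. formulation, which learns a projection tensor) illustrates the batch-level signal: each sample's mean distance to the rest of the batch collapses to zero exactly when the generator collapses:

```python
import numpy as np

def batch_similarity_features(feats):
    """Per-sample mean L1 distance to every other sample in the batch --
    an illustrative sketch of the minibatch-discrimination signal.
    Collapsed batches yield values near zero; diverse batches do not."""
    n = feats.shape[0]
    dists = np.abs(feats[:, None, :] - feats[None, :, :]).sum(axis=2)  # (n, n)
    return dists.sum(axis=1) / (n - 1)  # mean distance to the other n-1 samples

rng = np.random.default_rng(0)
diverse_batch = rng.standard_normal((16, 128))                     # healthy generator
collapsed_batch = np.tile(rng.standard_normal((1, 128)), (16, 1))  # mode collapse
# Appending these per-sample scalars to the discriminator's features lets
# it learn that near-zero batch diversity means "fake".
```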
Unrolled GANs and historical averaging
Another approach, Unrolled GANs (Metz et al., 2017), addresses the root cause of collapse: the generator is trained against a discriminator that has been updated for $k$ steps into the future, rather than the current discriminator. This prevents the generator from exploiting the discriminator's current weaknesses in ways that will immediately be corrected, reducing the oscillatory dynamics that lead to collapse. Historical averaging, another technique from the same era, adds a penalty proportional to the L2 distance between current parameters and their time-averaged values, discouraging rapid oscillations.
In practice, the most effective solution to mode collapse was not any of these clever tricks but rather better loss functions — specifically, the Wasserstein distance, which provides a meaningful gradient signal proportional to how far the generated distribution is from the data distribution, rather than a flat zero-gradient region.
Wasserstein GANs
The Wasserstein distance — also called the Earth Mover's Distance — is a fundamentally better measure of how far apart two distributions are when they don't overlap. WGAN and WGAN-GP were among the most important algorithmic advances in GAN training.
The Jensen–Shannon divergence and KL divergence share a critical flaw: they are undefined or uninformative when the two distributions have non-overlapping support. For two distributions living on low-dimensional manifolds in a high-dimensional space — exactly the situation for image distributions — their supports will generically be disjoint, and the JS divergence will be constant regardless of how close or far apart the manifolds actually are. This gives the generator zero gradient signal with a perfect discriminator.
The Earth Mover's Distance
The Wasserstein-1 distance (also called the Earth Mover's Distance, or EMD) has a beautiful geometric interpretation: it is the minimum cost of transporting all the mass of one distribution to match another, where cost is measured by the distance each unit of mass travels. Formally:
$$W(p, q) = \inf_{\gamma \in \Pi(p,q)} \mathbb{E}_{(x,y) \sim \gamma}[\|x - y\|]$$where $\Pi(p,q)$ is the set of all joint distributions with marginals $p$ and $q$. Unlike the JS divergence, $W(p_\text{data}, p_G)$ is well-defined and continuous even when the two distributions don't overlap — it grows with the geometric distance between their supports. A generator that produces a distribution far from $p_\text{data}$ receives a large Wasserstein distance and a strong gradient, regardless of whether there is any overlap.
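In one dimension the infimum has a closed form — the optimal transport plan simply matches sorted samples — which makes the distance's behaviour easy to verify on toy data:

```python
import numpy as np

def wasserstein1_1d(x, y):
    """W1 between two 1-D empirical distributions with equal sample counts.
    In 1-D the optimal transport plan matches sorted samples."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 10000)
# A generator whose output is far from the data still receives a
# meaningful, distance-proportional signal -- unlike the saturated JSD.
gen_near = rng.normal(1.0, 1.0, 10000)  # W1 ~ 1
gen_far  = rng.normal(8.0, 1.0, 10000)  # W1 ~ 8
```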
The Kantorovich dual formulation
Computing the Wasserstein distance directly requires solving an expensive optimization problem. Arjovsky et al. (2017) used the Kantorovich–Rubinstein duality to derive a tractable alternative. The Wasserstein distance can be written as:
$$W(p_\text{data}, p_G) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim p_\text{data}}[f(x)] - \mathbb{E}_{x \sim p_G}[f(x)]$$where the supremum is over all 1-Lipschitz functions $f$ — functions whose outputs change by no more than the distance between their inputs. The WGAN replaces the GAN discriminator with a critic $f_w$ (no sigmoid output, no probability interpretation) trained to approximate this supremum. The generator then minimizes:
$$\mathcal{L}_G = -\mathbb{E}_{z \sim p(z)}[f_w(G_\theta(z))]$$and the critic maximizes $\mathbb{E}_{x \sim p_\text{data}}[f_w(x)] - \mathbb{E}_{z \sim p(z)}[f_w(G_\theta(z))]$. The critic must be 1-Lipschitz, which Arjovsky et al. enforced by weight clipping — constraining all critic weights to lie in $[-c, c]$ after each update. This is crude but effective enough to demonstrate that the Wasserstein objective dramatically reduces mode collapse and provides stable, meaningful loss curves that actually correlate with sample quality (a huge practical benefit: you can monitor training by watching the Wasserstein estimate).
WGAN-GP: gradient penalty
Weight clipping has a significant pathology: it pushes the critic toward learning very simple functions (clipped weights tend to collapse to the clipping boundaries $\pm c$ rather than take nuanced values), which limits its expressivity. Gulrajani et al. (2017) replaced weight clipping with a gradient penalty that directly enforces the Lipschitz constraint. The 1-Lipschitz condition is equivalent to requiring that the gradient of $f$ has norm at most 1 everywhere. Rather than enforcing this everywhere (impossible in practice), WGAN-GP enforces it on interpolated points $\hat{x} = \epsilon x + (1 - \epsilon) G(z)$ for $\epsilon \sim \text{Uniform}[0,1]$:
$$\mathcal{L}_\text{GP} = \mathbb{E}_{\hat{x}}[(\|\nabla_{\hat{x}} f_w(\hat{x})\|_2 - 1)^2]$$The full WGAN-GP critic loss is the Wasserstein estimate minus $\lambda \cdot \mathcal{L}_\text{GP}$ (typically $\lambda = 10$). WGAN-GP became one of the most widely adopted GAN formulations: it is stable, interpretable, and produces excellent results across many domains. The gradient penalty approach was later generalized into the broader category of consistency regularization for discriminators, which remains active research.
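To see the penalty's effect without an autograd framework, the sketch below uses a toy linear critic $f(x) = w \cdot x$, whose input gradient is $w$ everywhere, so the penalty reduces to $(\|w\|_2 - 1)^2$ at every interpolated point:

```python
import numpy as np

def gradient_penalty_linear(w, x_real, x_fake, rng):
    """WGAN-GP penalty for a toy linear critic f(x) = w . x. Its gradient
    w.r.t. x is w at every point, so the penalty has the closed form
    (||w||_2 - 1)^2 at each interpolate x_hat."""
    eps = rng.uniform(size=(x_real.shape[0], 1))
    x_hat = eps * x_real + (1 - eps) * x_fake      # where the penalty acts
    grad_norms = np.full(x_hat.shape[0], np.linalg.norm(w))  # grad f = w
    return np.mean((grad_norms - 1.0) ** 2)

rng = np.random.default_rng(0)
x_real = rng.standard_normal((64, 10))
x_fake = rng.standard_normal((64, 10)) + 3.0
# A critic with ||w|| = 1 is exactly 1-Lipschitz: zero penalty.
# A steeper critic (||w|| = 4) is pushed back toward the Lipschitz ball.
```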
An important practical note: WGAN-GP should not be used with batch normalization in the critic, since batch norm creates dependencies between samples in a batch that interfere with the gradient penalty computation. Use layer normalization or instance normalization in the critic instead.
Conditional GANs & Pix2Pix
Unconditional GANs generate samples from the full data distribution with no user control. Conditioning on class labels, text, or input images unlocks a far wider range of applications — from class-conditional synthesis to semantic image editing.
The Conditional GAN (cGAN, Mirza & Osindero, 2014) extends the GAN framework by conditioning both the generator and discriminator on auxiliary information $y$ — typically a class label, but potentially any structured side information. The generator becomes $G_\theta(z, y)$ and the discriminator $D_\phi(x, y)$, where the conditioning is achieved by concatenating $y$ (or a learned embedding of $y$) with the input to each network. The objective becomes:
$$\min_G \max_D \; \mathbb{E}_{x,y}[\log D(x,y)] + \mathbb{E}_{z,y}[\log(1 - D(G(z,y), y))]$$With class conditioning, the user can specify which class of image to generate. Rather than sampling from the full distribution of, say, ImageNet classes, one can request specifically a "dog" or "sports car." This is a critical capability for practical applications. More sophisticated conditioning mechanisms — projection discriminators (Miyato & Koyama, 2018), where the inner product between $y$'s learned embedding and the discriminator's penultimate feature vector is added to the output — proved more effective than simple concatenation for high-resolution class-conditional synthesis.
Pix2Pix: image-to-image translation
Isola et al. (2017) pushed conditional GANs into a particularly powerful application: image-to-image translation. Rather than conditioning on a label vector, the generator conditions on an entire input image. The training data consists of paired images — a semantic segmentation map and its corresponding photograph, a satellite image and its map, a day photo and its night version. The generator learns to translate from one domain to the other; the discriminator judges whether the translation is realistic and consistent with the input.
The Pix2Pix loss combines an adversarial term with an L1 reconstruction loss:
$$\mathcal{L}_\text{Pix2Pix} = \mathcal{L}_\text{cGAN}(G, D) + \lambda \cdot \mathbb{E}_{x,y,z}[\|y - G(x,z)\|_1]$$The L1 term prevents the generator from ignoring the input image entirely and ensures rough correctness; the adversarial term drives it toward crisp, realistic textures. The combination works remarkably well, producing high-fidelity translations across wildly different domain pairs.
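A sketch of the combined generator loss, using the non-saturating adversarial term for the cGAN component (a common practical choice) and the paper's $\lambda = 100$:

```python
import numpy as np

def pix2pix_g_loss(d_on_fake, target_img, generated_img, lam=100.0):
    """Pix2Pix generator loss: a non-saturating adversarial term plus a
    weighted L1 reconstruction term."""
    adv = -np.mean(np.log(d_on_fake))                  # fool the discriminator
    l1 = np.mean(np.abs(target_img - generated_img))   # stay close to target
    return adv + lam * l1

# Toy 8x8 single-channel "images" in [-1, 1].
rng = np.random.default_rng(0)
target = rng.uniform(-1, 1, (8, 8))
generated = target + 0.1 * rng.standard_normal((8, 8))
loss = pix2pix_g_loss(np.array([0.4]), target, generated)
# The L1 term anchors the output to the paired target; the adversarial
# term rewards convincing the discriminator (higher d_on_fake, lower loss).
```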
The PatchGAN discriminator
Pix2Pix introduced the PatchGAN discriminator, which has become widely influential. Rather than classifying the entire image as real or fake with a single scalar, the discriminator classifies overlapping patches of the image independently. Its output is a grid of real/fake scores, one per patch, and the loss averages over this grid. This has several advantages: it runs fully convolutionally on any image size, it focuses the discriminator's attention on local texture and high-frequency structure (which matters most for realism), and it has far fewer parameters than a full-image discriminator. PatchGAN with patch size 70×70 was found to work well across many image sizes and has been adopted in numerous subsequent architectures.
Progressive Growing of GANs
Training a GAN at high resolution directly from random initialization is extremely unstable — the generator must simultaneously learn coarse structure and fine detail. Progressive growing solves this by starting simple and gradually increasing complexity.
By 2017, GANs could produce decent images at 64×64 or 128×128 pixels. But 1024×1024 — the resolution needed for convincing portraits — remained out of reach. The fundamental problem is that high-resolution training exposes extreme instability: the generator at random initialization produces noise, and the discriminator can detect this trivially, leading to vanishing gradients or mode collapse before any meaningful learning begins. Karras et al. (2018) at NVIDIA introduced Progressive GAN (PGGAN) as a solution.
The core idea is to grow both the generator and discriminator incrementally, starting from a tiny resolution (4×4 pixels) and progressively adding new layers that double the spatial resolution as training proceeds. At 4×4, the networks learn to produce and evaluate only the most basic global structure: is this image dark or light? Does it have a face-like blob? Only when training has stabilized at this resolution do new 8×8 layers get added, allowing the networks to develop slightly finer structure while retaining everything learned at 4×4. This continues through 16×16, 32×32, 64×64, 128×128, 256×256, 512×512, and finally 1024×1024.
Fade-in transitions
Simply adding new layers abruptly would shock the training dynamics. PGGAN instead introduces new layers with a fade-in mechanism: the new higher-resolution layer's contribution is linearly interpolated from zero to full weight over a training period, while the existing upsampling path (which simply doubles the previous resolution by nearest-neighbor upsampling) is simultaneously faded out. At $\alpha = 0$ the network is identical to the previous stage; at $\alpha = 1$ the new layers are fully activated. This smooth transition prevents any discontinuity in the learning dynamics.
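The fade-in is a simple linear blend. A sketch for the 8×8 → 16×16 transition, with nearest-neighbour upsampling standing in for the skip path:

```python
import numpy as np

def upsample2x(img):
    """Nearest-neighbour 2x upsampling of an (H, W) feature map."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def fade_in(prev_stage_img, new_layer_img, alpha):
    """Blend the upsampled previous-stage output with the new layer's
    output; alpha ramps from 0 to 1 over the transition period."""
    return (1 - alpha) * upsample2x(prev_stage_img) + alpha * new_layer_img

rng = np.random.default_rng(0)
img_8 = rng.standard_normal((8, 8))     # output of the stabilized 8x8 stage
img_16 = rng.standard_normal((16, 16))  # output of the freshly added 16x16 layers
# alpha = 0: identical to the old network; alpha = 1: new layers fully active.
```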
PGGAN also introduced several other important techniques: equalized learning rate (scaling weights dynamically to ensure all layers receive gradients of similar magnitude, rather than using careful per-layer initialization); minibatch standard deviation (a simplified version of minibatch discrimination that adds a single feature map summarizing within-batch diversity to the discriminator); and pixelwise feature normalization in the generator (normalizing each feature vector at every spatial location to unit length, preventing runaway signal magnitudes). Together these techniques enabled the first convincing 1024×1024 face synthesis, producing results that shocked the research community when published.
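Of these, minibatch standard deviation is simple enough to sketch in a few lines — here a simplified version that averages one scalar over the whole batch and appends it as a constant extra feature:

```python
import numpy as np

def minibatch_stddev_feature(feats):
    """PGGAN-style minibatch stddev (simplified): average the per-feature
    standard deviation over the batch into one scalar, then append it as
    a constant extra feature on every sample."""
    s = np.mean(np.std(feats, axis=0))       # one number for the whole batch
    extra = np.full((feats.shape[0], 1), s)  # same value for every sample
    return np.concatenate([feats, extra], axis=1)

rng = np.random.default_rng(0)
healthy = rng.standard_normal((16, 64))
collapsed = np.tile(rng.standard_normal((1, 64)), (16, 1))
# The appended feature is ~1 for the diverse batch and exactly 0 for the
# collapsed one, handing the discriminator a direct collapse detector.
```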
PGGAN's celebrity face dataset (CelebA-HQ) outputs were the first GAN results widely shared on social media as demonstrations of AI-generated photorealism. The images were indistinguishable from real photographs to most observers, marking a cultural inflection point for public awareness of generative AI.
StyleGAN & StyleGAN2
StyleGAN transformed GAN architecture by separating high-level style from stochastic detail — giving users unprecedented control over generated images and redefining what fine-grained image synthesis could look like.
PGGAN produced stunning images but offered little control over what those images looked like. The input noise $z$ was mapped directly to the generator's first layer, and features at all scales — overall pose, facial identity, fine hair texture — were all entangled in a single latent code. Interpolating between two codes in $z$ space produced incoherent blends. Karras et al. (2019) at NVIDIA designed StyleGAN to address this by completely rethinking the generator architecture.
The mapping network and W space
StyleGAN begins by passing the input noise $z \in \mathcal{Z}$ (a unit Gaussian) through a learned mapping network — a sequence of fully-connected layers — to produce an intermediate latent code $w \in \mathcal{W}$. The mapping network is not just a convenience; it serves a crucial function. The Gaussian latent space $\mathcal{Z}$ is highly entangled: to generate a face with glasses, you may need to move in a direction that also changes ethnicity, because these factors are correlated in training data and the Gaussian must accommodate them. The mapping network learns to "disentangle" these factors by mapping $\mathcal{Z}$ to a learned $\mathcal{W}$ space that more closely follows the true variation in the data distribution. In $\mathcal{W}$ space, directions are more semantically meaningful: moving along one direction changes age, another changes expression, another changes lighting, with less interference between factors.
Adaptive instance normalization
Rather than feeding $w$ into the first generator layer and letting it propagate through, StyleGAN uses $w$ to control the style of each resolution level independently. At each resolution, a learned affine transformation converts $w$ into scale and bias parameters $(\gamma, \beta)$, which are then applied via Adaptive Instance Normalization (AdaIN):
$$\text{AdaIN}(x_i, y) = y_{s,i} \cdot \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}$$where $x_i$ is the $i$-th feature map, $\mu$ and $\sigma$ are its mean and standard deviation, and $y_{s,i}, y_{b,i}$ are the learned scale and bias derived from $w$. The generator's spatial feature maps start from a learned constant (not from $z$ or $w$) and are reshaped entirely by the style modulation at each layer. Stochastic noise is added separately at each resolution to control fine-grained variation — pore texture, individual hairs, slight lighting randomness — without affecting global style.
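A direct NumPy transcription of the AdaIN formula for a single (C, H, W) feature tensor, with toy style parameters standing in for the learned affine transform of $w$:

```python
import numpy as np

def adain(x, y_s, y_b, eps=1e-8):
    """Adaptive Instance Normalization for a (C, H, W) tensor: each channel
    is normalized to zero mean / unit variance over its spatial dimensions,
    then rescaled by per-channel style scale and bias."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True)
    return y_s[:, None, None] * (x - mu) / (sigma + eps) + y_b[:, None, None]

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8)) * 5.0 + 2.0  # 4 channels, 8x8 spatial
y_s = np.array([1.0, 2.0, 0.5, 1.0])            # style scales derived from w
y_b = np.array([0.0, -1.0, 3.0, 0.0])           # style biases derived from w
out = adain(x, y_s, y_b)
# Channel statistics are now dictated entirely by the style, not the input.
```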
Style mixing
This architecture enables style mixing: one can use $w_1$ to control the style of coarse layers (4×4 through 16×16, affecting overall pose and identity) while using a different $w_2$ to control fine layers (64×64 through 1024×1024, affecting skin texture, hair detail, and color). The result is an image with the pose of one person and the skin texture of another — a capability that is impossible with conventional generators. Style mixing is used as a regularization technique during training (randomly mixing styles across two latent codes prevents the network from becoming overly reliant on correlations between adjacent layers' styles), but it also became a powerful tool for interactive image editing.
StyleGAN2
StyleGAN2 (Karras et al., 2020) revisited the architecture to address several artifacts. The original StyleGAN produced characteristic "water droplet" artifacts in generated images — subtle blobs that appeared when zooming in. These were traced to the normalization: when AdaIN normalizes feature statistics before each style application, it can produce degenerate activation patterns. StyleGAN2 removed instance normalization and instead uses a weight demodulation scheme: the convolutional weights are scaled by the style vector and then normalized to have unit expected output variance per output feature map, without actually computing any statistics from the activations. This eliminates the artifacts. StyleGAN2 also adds a path length regularization term that encourages the mapping from $w$ to images to be smooth and invertible, further improving disentanglement and the quality of interpolations in $\mathcal{W}$ space. At 1024×1024 on FFHQ (Flickr-Faces-HQ), StyleGAN2 achieved an FID of 2.84, a new state of the art at the time and still a benchmark for unconditional face synthesis.
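The demodulation idea can be sketched on a dense layer (a simplification: the real operation acts on convolution kernels, with the style vector produced from $w$):

```python
import numpy as np

def modulate_demodulate(W, style, eps=1e-8):
    """StyleGAN2-style weight demodulation, sketched for a dense layer:
    scale the input-feature columns of W by the style, then rescale each
    output row so unit-variance inputs yield unit-variance outputs --
    no activation statistics are ever computed."""
    W_mod = W * style[None, :]                    # modulate input features
    demod = 1.0 / np.sqrt(np.sum(W_mod ** 2, axis=1) + eps)
    return W_mod * demod[:, None]                 # demodulate output rows

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32))        # (out_features, in_features)
style = np.exp(rng.standard_normal(32))  # positive per-input style scales
W_dd = modulate_demodulate(W, style)
# Every output row now has unit L2 norm, so the style reshapes the
# layer's behaviour without blowing up the signal magnitude.
```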
BigGAN & Large-Scale Synthesis
What happens when you train a class-conditional GAN at unprecedented scale — more parameters, more data, more compute? BigGAN showed that scale dramatically improves quality but introduces new instability challenges that require novel solutions.
By 2018, most GAN research focused on algorithmic improvements: better loss functions, better normalization, better architectures. Brock et al. (2019) took a different approach: train a class-conditional GAN on ImageNet (1000 classes, 1.2 million training images) at 128×128 and 256×256 resolution with massively increased model capacity — batch size of 2048, 4× more parameters than previous ImageNet GANs — and see what happens. The answer was dramatic improvements in both image quality and class diversity, but also new and serious training instabilities at large scale.
Class-conditional architecture
BigGAN uses a projection discriminator for class conditioning and class-conditional batch normalization (cBN) in the generator: the batch normalization scale and shift parameters are learned as a linear function of the class embedding. The generator architecture is a residual network with self-attention layers at 64×64 resolution (following Zhang et al.'s SAGAN, 2019), and the discriminator uses spectral normalization throughout. The class embedding is concatenated with $z$ and injected into every generator layer via cBN, ensuring that class information permeates the entire generation process rather than only entering at the first layer.
The truncation trick
A key insight in BigGAN is that the Gaussian prior on $z$ has long tails — occasionally sampling $z$ far from the origin produces low-quality or unusual images. The truncation trick improves sample quality by resampling each component of $z$ whose magnitude exceeds a threshold $\psi$. This truncated sampling trades diversity for quality: with a large threshold (little truncation) the samples are diverse but some are low-quality; with $\psi = 0.5$ the samples are consistently high-quality but cover fewer modes. The truncation level becomes a user-controlled dial: reduce $\psi$ to get cleaner images at the cost of variety. This is only possible because BigGAN has learned a well-structured latent space where nearby $z$ values produce similar, coherent images.
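A sketch of truncation-trick sampling, implemented here by resampling out-of-range components (one common way to draw from a truncated normal):

```python
import numpy as np

def truncated_z(rng, dim, psi, batch=1):
    """Truncation-trick sampling sketch: draw z ~ N(0, I) and resample any
    component whose magnitude exceeds the threshold psi."""
    z = rng.standard_normal((batch, dim))
    while True:
        mask = np.abs(z) > psi
        if not mask.any():
            return z
        z[mask] = rng.standard_normal(mask.sum())

rng = np.random.default_rng(0)
z_full = rng.standard_normal((1000, 128))          # untruncated prior samples
z_trunc = truncated_z(rng, 128, psi=0.5, batch=1000)
# Truncated latents hug the high-density core of the Gaussian: consistently
# "typical" (high-quality) samples, at the cost of tail diversity.
```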
Training instability at scale
BigGAN training frequently collapses catastrophically — the training loss suddenly spikes and image quality drops to noise-like outputs. Brock et al. identified the cause as rapidly growing singular values in the generator's weight matrices, indicating that the generator is becoming ill-conditioned. They used periodic singular value decomposition monitoring as an early warning signal. Applying orthogonal regularization — an additional loss term encouraging weight matrices to have orthonormal rows and columns — reduced but did not eliminate collapse. Ultimately, training stability at BigGAN scale required careful hyperparameter tuning and accepting that some training runs would collapse, requiring restarts. This fragility was one of the key motivations for the subsequent shift toward diffusion models, which have far more predictable and stable training dynamics.
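The orthogonal regularizer used in BigGAN penalizes only the off-diagonal entries of $W^\top W$, encouraging orthogonal columns without constraining their norms. A minimal NumPy sketch (the coefficient `beta=1e-4` follows the value reported in the paper; the function name is hypothetical):

```python
import numpy as np

def orthogonal_reg(W, beta=1e-4):
    """BigGAN-style orthogonal regularizer: penalize the off-diagonal entries
    of W^T W, pushing columns toward orthogonality without fixing their norms."""
    WtW = W.T @ W
    off_diag = WtW * (1.0 - np.eye(W.shape[1]))
    return beta * np.sum(off_diag ** 2)

# A matrix with orthonormal columns incurs (near-)zero penalty; a rank-one
# matrix, whose columns are all parallel, does not.
Q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(8, 4)))
print(orthogonal_reg(Q))                # ~0
print(orthogonal_reg(np.ones((8, 4))))  # > 0
```

This loss is added to the generator objective; it nudges the weight matrices toward well-conditioned maps, which is why it damps (but does not eliminate) the singular-value explosions described above.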
CycleGAN & Unpaired Translation
Pix2Pix required paired training data — photographs and their corresponding segmentation maps, or day and night pairs. CycleGAN removed this requirement entirely, learning to translate between image domains using only unpaired examples from each domain.
Paired image translation datasets are expensive to collect. A database of 10,000 photographs with their hand-labeled segmentation maps requires significant human annotation. Photographs of the same scene at day and night require specialized time-lapse setups. For many interesting translation tasks — converting a landscape photo to an oil painting, transferring the style of Monet to a photograph, translating between horse and zebra images — no paired dataset exists and collecting one is impossible. Zhu et al. (2017) designed CycleGAN for unpaired image-to-image translation.
The cycle-consistency loss
The key insight is that while we cannot directly supervise the translation (there is no target image to compare against), we can impose a cycle-consistency constraint: if we translate an image from domain $X$ to domain $Y$ and then translate the result back to domain $X$, we should recover the original image. This is analogous to requiring that if you translate a sentence from English to French, the French translation should translate back to the original English sentence.
CycleGAN trains two generators simultaneously: $G: X \to Y$ and $F: Y \to X$, along with two discriminators $D_Y$ (judging realism in domain $Y$) and $D_X$ (judging realism in domain $X$). The full loss is:
$$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_\text{GAN}(G, D_Y) + \mathcal{L}_\text{GAN}(F, D_X) + \lambda \cdot \mathcal{L}_\text{cyc}(G, F)$$

where the cycle-consistency loss is:

$$\mathcal{L}_\text{cyc}(G, F) = \mathbb{E}_{x \sim p_X}[\|F(G(x)) - x\|_1] + \mathbb{E}_{y \sim p_Y}[\|G(F(y)) - y\|_1]$$

The cycle-consistency loss is the crucial constraint that prevents trivial solutions. Without it, each generator could independently map its inputs to any point in the output domain that fools its discriminator — there would be nothing connecting $G$ and $F$. With cycle-consistency, the two generators must form a coherent inverse pair: $G$ must produce outputs that $F$ can invert, which forces $G$ to preserve the structural content of its input even while changing its visual style.
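The cycle term is just a pair of L1 reconstruction penalties. A minimal NumPy sketch with stand-in callables for the two generators:

```python
import numpy as np

def cycle_consistency_loss(G, F, x_batch, y_batch):
    """L_cyc = E_x ||F(G(x)) - x||_1 + E_y ||G(F(y)) - y||_1.
    G: X -> Y and F: Y -> X are stand-in callables for the generators."""
    forward = np.mean(np.abs(F(G(x_batch)) - x_batch))   # x -> y -> x
    backward = np.mean(np.abs(G(F(y_batch)) - y_batch))  # y -> x -> y
    return forward + backward

# With perfect inverses (here: trivial scale maps), the loss is exactly zero.
G = lambda x: 2.0 * x
F = lambda y: 0.5 * y
rng = np.random.default_rng(0)
x, y = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(cycle_consistency_loss(G, F, x, y))  # 0.0
```

In training, this scalar is weighted by $\lambda$ (the paper uses $\lambda = 10$) and added to the two adversarial losses.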
Identity loss
CycleGAN also optionally uses an identity loss: when $G$ is given an image already in domain $Y$, it should produce that image unchanged. This prevents color shifts and encourages the generator to preserve chrominance when translating between domains where the color mapping is not clearly defined. For example, translating photographs to Monet paintings should not arbitrarily change the hue of grass from green to purple; the identity loss anchors the color palette.
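The identity term is a single L1 penalty per generator. A one-line NumPy sketch (function name hypothetical):

```python
import numpy as np

def identity_loss(G, y_batch):
    """Penalize G for changing an image that is already in its output domain Y."""
    return np.mean(np.abs(G(y_batch) - y_batch))

y = np.ones((4, 8))
print(identity_loss(lambda a: a, y))        # 0.0: identity-preserving generator
print(identity_loss(lambda a: a + 0.5, y))  # > 0: generator shifts the palette
```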
CycleGAN demonstrated results across dozens of domain pairs — horses ↔ zebras, summer ↔ winter, photographs ↔ paintings — and inspired a large family of unpaired translation methods. Its key limitation is that it cannot learn one-to-many mappings: since the cycle-consistency loss penalizes any translation that cannot be exactly inverted, the generator is forced to produce a deterministic output for each input rather than sampling from a distribution of plausible translations. This is appropriate for some tasks (there is essentially one "zebra version" of a given horse) but not for others (a sketch could correspond to many different colored renderings).
Evaluation Metrics
Evaluating generative models is genuinely hard. Unlike classification, where accuracy is unambiguous, generative quality is multidimensional — fidelity, diversity, and correspondence to conditioning all matter, and they can trade off against each other.
The ideal generative model produces samples that are (a) individually realistic — indistinguishable from real data by any test, (b) diverse — covering the full distribution of real data, not just a few modes, and (c) conditionally appropriate — matching whatever conditioning information was provided. These three desiderata are in genuine tension: a model that always copies training data achieves (a) perfectly but fails (b); a model that generates maximum diversity may produce incoherent samples that fail (a). No single metric captures all three dimensions.
Inception Score
The Inception Score (IS, Salimans et al., 2016) was the first widely adopted automatic metric. The idea: a good generative model should produce images that (a) are individually recognizable by a classifier (high confidence on some class) and (b) produce a diverse set of classes across many samples. IS measures both properties simultaneously using a pretrained Inception network $p(y|x)$:
$$\text{IS} = \exp\left(\mathbb{E}_{x \sim p_G}\left[\text{KL}(p(y|x) \| p(y))\right]\right)$$

where $p(y) = \mathbb{E}_{x \sim p_G}[p(y|x)]$ is the marginal class distribution across generated samples. When $p(y|x)$ is sharp (each image clearly belongs to one class — high fidelity) and $p(y)$ is uniform across classes (the model generates many different classes — high diversity), the KL divergence is large and IS is high. IS has serious limitations: it only uses the Inception network's class predictions (ignoring whether images actually look good), it does not compare to real data (a model trained on ImageNet can score well even if its images look nothing like ImageNet images, as long as they're confidently classified), and it does not detect mode collapse within a class.
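Given the matrix of class posteriors $p(y|x)$ from the Inception network, the score is a few lines of NumPy. A minimal sketch (real evaluations also average over splits of the sample set, omitted here):

```python
import numpy as np

def inception_score(p_yx, eps=1e-12):
    """IS = exp(E_x KL(p(y|x) || p(y))) from an (N, K) matrix of posteriors."""
    p_y = p_yx.mean(axis=0, keepdims=True)  # marginal class distribution
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Sharp, diverse predictions over K classes give the maximum score IS = K;
# maximally uncertain predictions give the minimum score IS = 1.
sharp_diverse = np.eye(4)         # 4 samples, each confidently a distinct class
uniform = np.full((4, 4), 0.25)
print(inception_score(sharp_diverse))  # ~4.0
print(inception_score(uniform))        # ~1.0
```

The two extremes make the metric's range concrete: IS lives in $[1, K]$ for a $K$-class classifier.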
Fréchet Inception Distance
The Fréchet Inception Distance (FID, Heusel et al., 2017) became the dominant metric for GAN evaluation and remains standard today. FID compares the distribution of real and generated images in the feature space of a pretrained Inception network. Specifically, it fits a multivariate Gaussian to the pool3 features of both real and generated images, then computes the Fréchet distance between the two Gaussians:
$$\text{FID} = \|\mu_r - \mu_g\|^2 + \text{Tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the means and covariances of real and generated feature distributions. Lower FID is better; FID = 0 would mean the generated and real distributions are identical in Inception feature space. FID captures both fidelity and diversity in a single number: generated images that are unrealistic (far from real images in feature space) increase $\|\mu_r - \mu_g\|^2$; generated images that are too uniform (low variance) increase the trace term. FID is far more sensitive to mode collapse and quality degradation than IS, and its scores correlate well with human judgments of quality. However, FID requires a large sample (50,000 images is conventional) to be reliable, is sensitive to the choice of feature extractor (Inception v3 trained on ImageNet is standard, but domain-specific extractors are sometimes better), and cannot distinguish between the fidelity and diversity components of quality.
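Given feature batches in place of Inception pool3 activations, FID is directly computable. A minimal NumPy sketch — the matrix square root term is evaluated via the eigenvalues of $\Sigma_r \Sigma_g$, which are real and non-negative for a product of two PSD matrices (production code typically uses `scipy.linalg.sqrtm` instead):

```python
import numpy as np

def fid(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    # Tr((Sigma_r Sigma_g)^{1/2}) = sum of sqrt eigenvalues of Sigma_r Sigma_g
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    tr_sqrt = np.sum(np.sqrt(np.maximum(eigvals.real, 0.0)))
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 8))
matched = rng.normal(size=(2000, 8))
shifted = rng.normal(loc=3.0, size=(2000, 8))
print(fid(real, matched))  # near 0: same underlying distribution
print(fid(real, shifted))  # roughly ||mu_r - mu_g||^2 = 8 * 3^2 = 72
```

The shifted batch illustrates why the mean term dominates when samples are systematically off-distribution, while subtler diversity failures show up in the trace term.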
Precision and Recall
Sajjadi et al. (2018) and Kynkäänniemi et al. (2019) introduced Precision and Recall for generative models, directly addressing the fidelity/diversity decomposition. In this framework, Precision measures the fraction of generated samples that fall within the support of the real distribution (how many generated images look real), while Recall measures the fraction of real samples covered by the generated distribution (how much of the diversity in real data is represented in generated data). High Precision with low Recall indicates a high-quality but collapsed model; low Precision with high Recall indicates a diverse but low-quality model. Plotting Precision vs. Recall for different truncation values of $\psi$ in BigGAN reveals the fundamental trade-off explicitly, providing a more complete picture than any single scalar metric.
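The k-NN estimator of Kynkäänniemi et al. can be sketched in NumPy: a sample counts as covered if it falls inside the ball around some point of the other set whose radius is that point's distance to its $k$-th nearest neighbour. This is a simplified toy version — real implementations operate on deep feature embeddings with tens of thousands of samples:

```python
import numpy as np

def knn_radii(points, k=3):
    """Distance from each point to its k-th nearest neighbour in the set."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k]  # column 0 is the point itself

def coverage(queries, support, radii):
    """Fraction of queries falling inside some support point's k-NN ball."""
    d = np.linalg.norm(queries[:, None, :] - support[None, :, :], axis=-1)
    return float(np.mean((d <= radii[None, :]).any(axis=1)))

def precision_recall(real, fake, k=3):
    precision = coverage(fake, real, knn_radii(real, k))  # fakes near real manifold
    recall = coverage(real, fake, knn_radii(fake, k))     # reals near fake manifold
    return precision, recall

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 2))
collapsed = rng.normal(scale=0.1, size=(500, 2))  # realistic but mode-collapsed
p, r = precision_recall(real, collapsed)
print(p, r)  # high precision, low recall
```

The collapsed toy generator makes the decomposition visible: its samples sit squarely on the real manifold (high precision) but cover almost none of it (low recall) — exactly the failure mode that FID would compress into one ambiguous number.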
| Metric | What it measures | Limitation |
|---|---|---|
| IS | Confidence + diversity of class predictions | No comparison to real data; ignores within-class quality |
| FID | Feature-space distance between real and generated distributions | Single number; conflates fidelity and diversity; sensitive to sample size |
| Precision | Fraction of generated samples that look real | Does not capture diversity; needs large sample |
| Recall | Fraction of real distribution covered by generator | Does not capture quality; needs large sample |
| Human eval | Perceptual realism and naturalness | Expensive, slow, non-reproducible, domain-dependent |
Applications & Legacy
GANs powered a generation of applications — some transformative, some troubling — and their architectural innovations echo throughout modern generative AI, from diffusion models to large language model training.
What GANs enabled
The most immediate applications of GAN technology were in creative and entertainment domains. StyleGAN's face synthesis enabled tools like thispersondoesnotexist.com, and GAN-based face synthesis and swapping drove the rapid rise of deepfakes — synthetic face-swap videos that spawned a large ecosystem of detection research and concern over media authenticity. In more constructive domains, GANs drove progress in data augmentation for small medical imaging datasets (synthesizing training images to supplement limited real data), in drug discovery (generating candidate molecular structures with desired pharmacological properties), and in artistic tools for style transfer and image manipulation.
In the medical imaging space, GANs were used to synthesize rare pathology cases for training radiologists and classifiers, to translate between imaging modalities (e.g., MRI to CT without paired data, using CycleGAN), and to improve the resolution of low-dose CT scans. In material science and drug discovery, molecular graph generation using GAN-like frameworks produced candidate molecules that were later synthesized and tested. These are tangible scientific contributions that predate and in some cases still exceed the capabilities of diffusion-based generative models for structured data.
Architectural inheritance
The transition from GANs to diffusion models as the dominant paradigm for image synthesis began around 2021–2022. But this transition was not a clean break — diffusion models inherited substantial architectural DNA from GAN research. The U-Net backbone used in virtually every diffusion model was popularized for image-to-image tasks through GAN-era work. Self-attention inserted at multiple resolutions was a key innovation in SAGAN that later became standard in diffusion U-Nets. The principle of conditioning generation on structured signals — class labels, CLIP text embeddings — was developed through years of GAN conditional generation research. Classifier-free guidance, the technique that dramatically improves diffusion model quality and controllability, draws on the same intuitions as GAN truncation.
Perhaps most importantly, the GAN era established the conceptual vocabulary of modern generative AI: latent spaces, the fidelity/diversity trade-off, the importance of learned perceptual losses, the value of multi-scale discriminators, and the challenge of evaluating generative quality rigorously. Researchers who built intuitions about generation through years of GAN debugging carried those intuitions directly into diffusion model development.
Why GANs gave way to diffusion models
The decline of GANs as the dominant framework was driven by several persistent limitations. Training instability — mode collapse, discriminator saturation, sudden catastrophic collapse — remained a constant challenge even in state-of-the-art systems like BigGAN. GAN training is sensitive to hyperparameters, architecture details, and random seeds in ways that made reliable reproduction difficult and open-source accessibility limited. GANs also struggled with text conditioning: while class-conditional GANs worked well, the open-ended, compositional nature of natural language descriptions proved difficult to integrate effectively. Diffusion models, with their stable denoising objective, explicit probabilistic formulation, and natural accommodation of classifier-free guidance, produced higher-quality, more diverse images with less engineering overhead — and were far easier to condition on arbitrary text. By 2023, the GAN era was effectively over for state-of-the-art image synthesis, though GANs retain advantages in specific regimes, particularly real-time inference (single-pass synthesis is much faster than iterative denoising), high-resolution video, and structured data domains where the denoising prior of diffusion models is less natural.
The GAN decade in summary: From Goodfellow's 2014 paper to the rise of diffusion models, GANs drove an extraordinary decade of progress in generative modeling. They established implicit density estimation as a viable alternative to explicit likelihood, created the first photorealistic synthesizers of human faces and objects, and built the architectural and conceptual foundations on which modern generative AI rests. The problems they never fully solved — training stability, mode coverage, evaluation — drove the research that ultimately displaced them.
Further Reading
Foundational Papers
- **Generative Adversarial Nets** (Goodfellow et al., 2014). The original paper. Unusually readable for a landmark paper — Goodfellow's exposition of the minimax objective, the optimal discriminator derivation, and the JS divergence connection is still the clearest treatment. Read the theory section even if you implement from a tutorial.
- **Wasserstein GAN** (Arjovsky et al., 2017). Careful mathematical analysis of why the original GAN objective fails, followed by the Wasserstein distance derivation. Required reading for anyone serious about GAN theory. The appendix has complete proofs.
- **Improved Training of Wasserstein GANs (WGAN-GP)** (Gulrajani et al., 2017). The gradient penalty that made WGAN practical. Short, clearly written, and directly implementable. The go-to reference for discriminator regularization.
- **Image-to-Image Translation with Conditional Adversarial Networks (Pix2Pix)** (Isola et al., 2017). The PatchGAN discriminator and L1+GAN loss combination. Demonstrates the power of paired conditional generation across a wide range of domain pairs. The project page is a wonderful demonstration of the technique.
- **Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks** (Zhu et al., 2017). CycleGAN. The cycle-consistency loss is an elegant constraint that enables translation without paired data. One of the most impactful GAN papers. Read alongside Pix2Pix to appreciate the progression from paired to unpaired settings.
Architecture Advances
- **Progressive Growing of GANs for Improved Quality, Stability, and Variation** (Karras et al., 2018). PGGAN. The progressive training paradigm that enabled 1024×1024 face synthesis. Also introduces equalized learning rate and minibatch standard deviation. An engineering tour-de-force that reads clearly despite its technical depth.
- **A Style-Based Generator Architecture for Generative Adversarial Networks** (Karras et al., 2019). StyleGAN. The mapping network, AdaIN, and style injection. Introduced FFHQ, the standard high-quality face dataset, and the concept of $\mathcal{W}$ space. The supplementary material's style mixing results are stunning.
- **Analyzing and Improving the Image Quality of StyleGAN** (Karras et al., 2020). StyleGAN2. Weight demodulation, path length regularization, and the elimination of the characteristic droplet artifacts. Essential companion to the original StyleGAN paper.
- **Large Scale GAN Training for High Fidelity Natural Image Synthesis (BigGAN)** (Brock et al., 2019). The scale-up study that showed what class-conditional GANs could achieve on ImageNet at unprecedented compute. The truncation trick and training stability analysis are particularly valuable. Best read alongside the BigGAN-deep and self-conditioning follow-ups.
Evaluation & Theory
- **GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium (FID)** (Heusel et al., 2017). Introduces the Fréchet Inception Distance. The theoretical justification and practical guidance on sample sizes make this worth reading. Also contains a useful overview of the convergence theory for two-time-scale GAN updates.
- **Improved Precision and Recall Metric for Assessing Generative Models** (Kynkäänniemi et al., 2019). The precision/recall decomposition for generative evaluation. Directly addresses the fidelity vs. diversity trade-off that FID cannot decompose. Essential context for understanding what metrics actually measure.