Autoregressive Generative Models: predicting each token from all those before it.
The simplest possible generative model just predicts the next element from all the previous ones — one token, one pixel, one audio sample at a time. This deceptively mundane idea, grounded in the probability chain rule, turned out to be the most scalable and capable generative framework ever discovered. PixelCNN modeled images pixel-by-pixel; WaveNet synthesized speech sample-by-sample; GPT predicts text token-by-token. All three are the same idea at different scales and on different data types.
Prerequisites
This chapter assumes familiarity with convolutional neural networks (Part V Ch 03), sequence models and recurrent networks (Part V Ch 04), and the transformer architecture (Part VI Ch 04). Basic probability and the chain rule of probability (Part I Ch 04) are essential. The connection to normalizing flows (Part X Ch 03) is made explicit in several places but is not required reading. Language model pretraining is treated in depth in Part VI Ch 05.
The Chain Rule as a Generative Model
Every joint probability distribution over a sequence can be written as a product of conditional probabilities. Autoregressive models make this factorization operational by parameterizing each conditional with a neural network — no approximations, no adversaries, no latent variables.
For any sequence $x = (x_1, x_2, \ldots, x_T)$ of random variables, the chain rule of probability gives an exact, always-valid factorization:
$$p(x) = p(x_1) \cdot p(x_2 | x_1) \cdot p(x_3 | x_1, x_2) \cdots p(x_T | x_1, \ldots, x_{T-1}) = \prod_{t=1}^T p(x_t | x_{1:t-1})$$

This is not a model assumption — it is a mathematical identity. Every joint distribution over sequences admits this factorization exactly, regardless of the structure of the distribution. The sequence can be tokens of text, pixels in raster-scan order, audio samples at 24kHz, or any other element of a structured space. The factorization always exists.
An autoregressive model (AR model) approximates each conditional factor $p(x_t | x_{1:t-1})$ with a parametric model $p_\theta(x_t | x_{1:t-1})$ — typically a neural network that takes all previous elements as input and outputs a distribution over the next element. The full joint is then:
$$p_\theta(x) = \prod_{t=1}^T p_\theta(x_t | x_{1:t-1})$$

This is an exact probability model — not a lower bound, not an approximation of the marginal, not an implicit density. For any data point $x$, the log-likelihood $\log p_\theta(x) = \sum_t \log p_\theta(x_t | x_{1:t-1})$ is computable exactly in a single forward pass through the network, as long as we can evaluate each conditional. Training by maximum likelihood is then straightforward: minimize the negative log-likelihood, which decomposes into a sum of cross-entropy losses — one per position in the sequence — each averaged over the training data.
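As a concrete illustration, the exact log-likelihood is just a sum of per-position conditional log-probabilities. The probability table below stands in for a trained network's outputs; this is a minimal numpy sketch, not any particular model's API.

```python
import numpy as np

def sequence_log_likelihood(cond_log_probs, tokens):
    """Exact log-likelihood of a sequence under an autoregressive model.

    cond_log_probs: (T, V) array, row t holding log p_theta(. | x_{1:t-1})
    tokens: (T,) array, the observed sequence x_1..x_T
    """
    T = len(tokens)
    return sum(cond_log_probs[t, tokens[t]] for t in range(T))

# Toy example: T=3 positions over a vocabulary of size V=2.
probs = np.array([[0.5, 0.5],   # p(x_1)
                  [0.9, 0.1],   # p(x_2 | x_1)
                  [0.2, 0.8]])  # p(x_3 | x_1, x_2)
x = np.array([0, 0, 1])
ll = sequence_log_likelihood(np.log(probs), x)
# ll = log 0.5 + log 0.9 + log 0.8 = log(0.36)
```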
The autoregressive constraint: causality
The key structural requirement on autoregressive models is causality: the prediction of $x_t$ must depend only on $x_{1:t-1}$, never on future elements $x_{t+1:T}$. This is the defining property that separates autoregressive from bidirectional models. In practice, causality is enforced architecturally — either by ensuring that the network has no path from future inputs to current outputs (masked convolutions in PixelCNN, causal self-attention masking in GPT) or by the natural sequential structure of recurrent networks (which process inputs left-to-right). The causality constraint is what enables both exact training and exact sampling: during training, all positions can be predicted in parallel (each conditioning on the ground-truth left context); during sampling, they must be generated sequentially (each conditioning on the previously generated tokens).
Why the chain rule direction matters
The chain rule factorization is not unique: any ordering of the elements gives a valid factorization. For text, the natural left-to-right ordering is linguistically motivated (people read and write left-to-right) and practically convenient (sampling produces naturally fluent text). For images, the raster-scan ordering (row by row, left to right) is conventional but arbitrary — an image has no intrinsic "first" pixel. For audio, the temporal ordering is natural. The choice of ordering affects what structure the model is encouraged to learn at each position, and some orderings are empirically better than others for different data types. Permutation-based autoregressive models that marginalize over orderings (like XLNet for text) were explored but generally found not to outperform fixed-order models at scale.
Autoregressive models are normalizing flows. The map $x \mapsto z = (z_1, \ldots, z_T)$ defined by $z_t = F_t(x_t | x_{1:t-1})$ — where $F_t$ is the CDF of $p_\theta(x_t | x_{1:t-1})$ — is an invertible transformation with a triangular Jacobian (dimension $t$ of $z$ depends only on $x_{1:t}$). Its log-determinant is $\sum_t \log \frac{\partial F_t}{\partial x_t} = \sum_t \log p_\theta(x_t | x_{1:t-1})$, exactly the log-likelihood. Every autoregressive model is a normalizing flow, and every autoregressive flow (MAF/IAF from Chapter 03) is an autoregressive model.
A Brief History: From N-Grams to Neural LMs
The autoregressive idea for language is older than deep learning by decades. Understanding the progression from simple count-based n-gram models to recurrent neural language models reveals how the core challenge — learning long-range context — drove the architectural innovations that culminate in transformers.
The earliest practical autoregressive language models were n-gram models, which approximate each conditional with a limited context window: $p(x_t | x_{1:t-1}) \approx p(x_t | x_{t-n+1:t-1})$. A bigram model ($n=2$) estimates $p(\text{word} | \text{previous word})$ from co-occurrence counts in a corpus; a trigram model uses the previous two words. N-gram models are simple, fast, and require no gradient descent — just counting and normalizing. They were the state of the art in language modeling for speech recognition, machine translation, and other NLP tasks from the 1980s through the mid-2000s.
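The counting-and-normalizing recipe fits in a few lines. This is an illustrative sketch with no smoothing or backoff; the `<s>`/`</s>` boundary tokens and whitespace tokenization are simplifying assumptions.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Estimate p(word | previous word) by counting and normalizing."""
    counts = defaultdict(Counter)
    for sent in corpus:
        tokens = ["<s>"] + sent.split() + ["</s>"]  # sentence boundary markers
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    # normalize each row of counts into a conditional distribution
    return {prev: {w: c / sum(ctr.values()) for w, c in ctr.items()}
            for prev, ctr in counts.items()}

model = train_bigram(["the cat sat", "the dog sat"])
# p(cat | the) = 0.5, p(sat | cat) = 1.0
```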
The fundamental limitation of n-gram models is their finite context window. Natural language exhibits long-range dependencies — a pronoun's referent may appear many sentences earlier, a subject and its verb may be separated by a long relative clause — that trigram and even 5-gram models cannot capture. They also suffer from the curse of dimensionality: the number of possible n-grams grows exponentially with $n$, so even massive corpora leave most n-grams unseen, requiring extensive smoothing and backoff heuristics.
Neural language models
Bengio et al. (2003) introduced the first practical neural language model: a feedforward network that takes learned word embeddings for the previous $n-1$ words as input and predicts a distribution over the next word. The embeddings allow generalization across semantically similar words — the model that sees "the cat sat on the" learns something useful about "the dog sat on the" even without that exact 5-gram in training data. This distributed representation approach broke the curse of dimensionality and achieved significantly better perplexity than n-gram models on the same data, at the cost of slower training and inference.
RNN language models
Mikolov et al. (2010) demonstrated that recurrent neural language models (RNNLMs) could significantly outperform both n-gram and feedforward neural models by processing sequences with a hidden state that carries information across arbitrary time horizons. The RNN reads the sequence left-to-right, updating its hidden state $h_t = f(h_{t-1}, x_t)$ at each step and predicting $p(x_{t+1} | h_t)$ from the accumulated hidden state. In principle, the hidden state can remember arbitrarily distant context; in practice, vanilla RNNs suffer from vanishing gradients and fail to propagate information across more than a few dozen steps. LSTMs and GRUs (Part V Ch 04) substantially alleviate this, enabling recurrent language models that became competitive for speech recognition and machine translation by 2013–2015. The PixelRNN and WaveNet architectures below apply exactly these recurrent principles to non-text sequence data.
PixelRNN: Images as Sequences
van den Oord et al. (2016) demonstrated that images could be modeled as sequences of pixels, with a recurrent network predicting each pixel's color distribution from all previous pixels. The result was the first autoregressive image model to produce coherent, globally-consistent generated images.
The idea of modeling images autoregressively requires first choosing an ordering. PixelRNN adopts the raster-scan order: pixels are processed row by row, left to right within each row, with the three color channels (R, G, B) at each pixel modeled sequentially. For a 32×32 image this gives a sequence of $32 \times 32 \times 3 = 3{,}072$ elements; for 64×64 images, $12{,}288$ elements. The model must predict each element conditioned on all previous elements — a recurrent network with a very long context.
Each pixel's value is treated as a discrete random variable over 256 intensity levels (0–255). The model outputs a softmax distribution over 256 values per channel at each step, trained to maximize the log-likelihood of the training images under this factorized distribution. The joint distribution over all pixels and channels is:
$$p(x) = \prod_{i=1}^{n^2} p(x_{i,R} \mid x_{<i})\, p(x_{i,G} \mid x_{<i}, x_{i,R})\, p(x_{i,B} \mid x_{<i}, x_{i,R}, x_{i,G})$$

where $x_{<i}$ denotes all pixels preceding pixel $i$ in raster-scan order.

The Diagonal BiLSTM
The naive application of a standard LSTM along the raster-scan sequence is too slow: processing $3{,}072$ steps sequentially makes training prohibitively expensive. PixelRNN's key architectural innovation is the Diagonal BiLSTM, which enables parallel computation across pixels along skewed diagonals. The intuition is that pixels on the same anti-diagonal of an image (bottom-left to top-right) have no autoregressive dependency on each other (each depends only on pixels above and to the left), so they can be processed simultaneously. By reordering computation along these diagonals and using a carefully designed recurrence, the Diagonal BiLSTM achieves full use of the left-and-above context while processing $O(\sqrt{n})$ sequential steps rather than $O(n)$, dramatically accelerating training. A simpler Row LSTM variant restricts context to a triangular region above each pixel, trading completeness for further parallelism.
PixelRNN produced the first high-quality autoregressive image samples: generated MNIST digits were sharp and correct, generated CIFAR-10 images showed consistent global structure (skies above, ground below) even if fine details were imperfect. The bits-per-dimension metric — which measures log-likelihood under the model in bits per pixel — gave a principled, comparable score across models, and PixelRNN set strong benchmarks on MNIST and CIFAR-10. The main limitation was training speed: even with the diagonal parallelism, recurrent processing of thousands of sequential steps remained slow.
PixelCNN: Parallel Training, Sequential Sampling
PixelCNN replaced the slow recurrence of PixelRNN with masked convolutions — filters that see only previous pixels in the raster-scan order. Training became massively parallel, though sampling remained sequential. A subtle blind-spot artifact required an architectural fix.
The core challenge with applying convolutions to autoregressive image modeling is maintaining the causality constraint: the prediction of pixel $(i,j)$ must not depend on any pixel $(i',j')$ that comes later in raster-scan order (i.e., any pixel below $(i,j)$ or to its right on the same row). Standard convolutions look at a neighborhood centered on the current pixel — which includes future pixels. Masked convolutions solve this by zeroing out the filter weights that correspond to future positions:
A 3×3 masked convolution zeroes out every kernel position to the right of center in the same row and all positions in the rows below; the first layer's mask also zeroes the center position itself, so a pixel never sees its own value. Stacking many such layers builds up receptive fields that grow with depth: the first layer sees a small neighborhood, but after $L$ layers the effective receptive field covers the entire upper-left triangular region of the image above the current pixel. Unlike recurrent models, all pixel predictions can be computed simultaneously during training — the entire sequence $x_{1:T}$ is fed as input, and all $T$ conditional distributions are predicted in a single forward pass by ensuring no layer can "see" future pixels. Backpropagation flows through all positions simultaneously, giving a dramatic training speedup over PixelRNN.
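The mask itself is just a binary kernel pattern. A minimal sketch following the type-A (first layer) / type-B (later layers) convention from the PixelCNN paper; the helper name is illustrative.

```python
import numpy as np

def causal_mask(k, mask_type="A"):
    """Binary mask for a k x k PixelCNN convolution kernel.

    Type 'A' (first layer) also zeroes the center position, so a pixel
    cannot see its own value; type 'B' (later layers) keeps it.
    Positions right of center on the same row and all rows below are
    always zeroed.
    """
    mask = np.ones((k, k))
    c = k // 2
    mask[c, c + 1:] = 0      # right of center, same row
    mask[c + 1:, :] = 0      # all rows below
    if mask_type == "A":
        mask[c, c] = 0       # center itself (first layer only)
    return mask

# 3x3 type-A mask: only the row above and the left neighbor survive:
# [[1. 1. 1.]
#  [1. 0. 0.]
#  [0. 0. 0.]]
```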
The blind spot problem
A subtle flaw in the naive masked convolution stack is the blind spot: due to the rectangular shape of the masked kernel, pixels directly to the right of the current position in rows above it are not accessible within a finite stack of 3×3 convolutions. The receptive field has a vertical gap on the right side. van den Oord et al. fixed this with the gated PixelCNN, which uses two separate convolutional stacks — a vertical stack (processing columns above the current row) and a horizontal stack (processing the current row up to the current position) — that are combined via a gating mechanism. The vertical stack has no blind spot; the horizontal stack conditions on the vertical stack's output to provide the missing context.
Gated activation units
Both PixelCNN and WaveNet use a gated activation unit borrowed from the LSTM gate structure:
$$y = \tanh(W_f * x) \odot \sigma(W_g * x)$$

where $W_f$ and $W_g$ are separate convolutional filters, $*$ denotes convolution, and $\sigma$ is the sigmoid function. The $\tanh$ branch produces the candidate feature values; the sigmoid branch produces a gate that controls how much of each feature passes through. This gated nonlinearity proved more expressive than ReLU for autoregressive generative models and significantly improved log-likelihoods on image and audio modeling benchmarks.
PixelCNN++ & Conditional PixelCNN
Several refinements to the base PixelCNN architecture significantly improved image quality and expanded the model to conditional settings — generating images conditioned on class labels, other images, or any structured side information.
Salimans et al. (2017) introduced PixelCNN++, which addressed two limitations of the original: the categorical softmax output head (which treats 256 intensity levels as entirely independent, ignoring their natural ordering) and the architectural inefficiency of modeling all spatial scales with uniform-resolution convolutions.
Discretized logistic mixture output
Instead of a 256-way softmax over raw pixel values, PixelCNN++ models each channel's conditional distribution as a discretized mixture of logistics: a weighted sum of $K$ logistic distributions, where each logistic has a learned mean $\mu_k$ and scale $s_k$, and the continuous distribution is discretized to the 256 integer values by integrating within each interval $[v - 0.5, v + 0.5]$. The parameters $(\mu_k, s_k, \pi_k)$ are predicted by the network. This output distribution has three major advantages: it is much more parameter-efficient than 256 independent logits (needing only $3K$ parameters per channel rather than 256); it respects the ordering structure of pixel intensities (pixel value 128 is between 127 and 129, which the logistic mixture captures naturally); and it provides much smoother gradients for learning the output distribution. PixelCNN++ also uses downsampled residual connections at multiple resolutions and dropout regularization, achieving state-of-the-art bits-per-dimension on CIFAR-10 at the time of publication.
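The discretized likelihood can be written directly as CDF differences. A minimal single-channel sketch (the function names are illustrative; real implementations also clamp the log-scales and evaluate everything in log-space for numerical stability):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def disc_logistic_mixture_prob(v, means, log_scales, mix_logits):
    """P(pixel = v) under a K-component discretized mixture of logistics.

    v: integer pixel value in [0, 255].
    means, log_scales, mix_logits: shape-(K,) parameters, as would be
    predicted by the network per channel per position.
    Each component's mass on [v - 0.5, v + 0.5] is a difference of
    logistic CDFs; the edge bins at 0 and 255 absorb the tails.
    """
    w = np.exp(mix_logits - mix_logits.max())
    w /= w.sum()                         # mixture weights pi_k via softmax
    inv_s = np.exp(-log_scales)
    cdf_hi = np.ones_like(means) if v == 255 else sigmoid((v + 0.5 - means) * inv_s)
    cdf_lo = np.zeros_like(means) if v == 0 else sigmoid((v - 0.5 - means) * inv_s)
    return float(np.sum(w * (cdf_hi - cdf_lo)))

means = np.array([100.0, 180.0])
log_scales = np.array([2.0, 1.5])
mix_logits = np.array([0.0, 0.0])
total = sum(disc_logistic_mixture_prob(v, means, log_scales, mix_logits)
            for v in range(256))
# the 256 bin probabilities sum to 1: a proper distribution
```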
Conditional PixelCNN
van den Oord et al. also introduced conditional PixelCNN, which extends the model to generate images conditioned on auxiliary information $h$ — for example, a class label, another image, or a high-level description. Conditioning is implemented by adding a learned bias to each gated activation layer:
$$y = \tanh(W_f * x + V_f^T h) \odot \sigma(W_g * x + V_g^T h)$$

where $V_f$ and $V_g$ are learned matrices mapping the conditioning vector to the activation space. This simple additive injection works remarkably well: a PixelCNN conditioned on ImageNet class labels produces class-consistent images across diverse ImageNet categories. A more powerful variant uses spatial conditioning, where $h$ is a full spatial feature map (from an encoder network) rather than a single vector, allowing fine-grained pixel-level conditioning. Spatial conditional PixelCNN was used in the original VQ-VAE paper as the prior over codebook indices, producing much sharper samples than an unconditional prior.
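The conditioning bias enters each gate additively and is broadcast over all spatial positions. A minimal numpy sketch with illustrative shapes; a real implementation would fold this into the masked convolution layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cond_gated_activation(conv_f, conv_g, h, V_f, V_g):
    """y = tanh(conv_f + V_f h) * sigmoid(conv_g + V_g h).

    conv_f, conv_g: outputs of the two masked-convolution branches,
    shape (C, H, W). h: conditioning vector, shape (D,). V_f, V_g:
    learned (C, D) projections; the projected bias is broadcast over
    every spatial position.
    """
    bias_f = (V_f @ h)[:, None, None]
    bias_g = (V_g @ h)[:, None, None]
    return np.tanh(conv_f + bias_f) * sigmoid(conv_g + bias_g)

# Illustrative shapes: C=4 channels, an 8x8 feature map, D=10-dim class embedding.
rng = np.random.default_rng(0)
y = cond_gated_activation(rng.normal(size=(4, 8, 8)), rng.normal(size=(4, 8, 8)),
                          rng.normal(size=10),
                          rng.normal(size=(4, 10)), rng.normal(size=(4, 10)))
```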
Super-resolution is a particularly compelling conditional application: the model conditions on a low-resolution image and generates a plausible high-resolution version, pixel by pixel. Because the low-resolution image strongly constrains global structure while leaving fine details ambiguous, the PixelCNN can sample diverse, sharp completions — each globally consistent with the low-resolution input but differing in texture and fine detail. This stochastic super-resolution produces qualitatively more realistic results than deterministic upsampling methods.
WaveNet: Autoregressive Audio at Scale
WaveNet applied the PixelCNN architecture directly to raw audio waveforms, predicting each sample from up to thousands of previous samples using dilated causal convolutions. The results — speech synthesis approaching human quality — were striking when published in 2016.
Audio presents a more extreme version of the image modeling challenge. A 24kHz audio waveform has 24,000 samples per second; even one second of speech is a sequence of 24,000 values. Meaningful prosodic patterns (rhythm, intonation) span hundreds of milliseconds; the fundamental frequency of a voice involves periodicities of 5–10 milliseconds. Capturing these multi-scale dependencies requires an effective receptive field covering tens of thousands of samples — far beyond what a small stack of convolutions can achieve.
Dilated causal convolutions
WaveNet's architectural solution is dilated causal convolutions. A standard causal convolution with filter length $k$ at each layer adds $k-1$ to the effective receptive field. With dilation factor $d$, the filter skips $d-1$ samples between each tap, increasing the layer's contribution to $d(k-1)$. WaveNet stacks convolutions with exponentially increasing dilation: $d = 1, 2, 4, 8, 16, 32, 64, 128, 256, 512$ — a single block of 10 dilated layers of 2-tap filters achieves a receptive field of $1 + (1 + 2 + 4 + \cdots + 512) = 1{,}024$ samples. Stacking multiple such blocks (typically 5, for a total of 50 layers) extends the receptive field to $5{,}116$ samples, covering roughly 200ms of audio at 24kHz — sufficient to capture phoneme-level temporal structure.
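The receptive-field arithmetic can be checked mechanically: count the current sample plus each layer's $d(k-1)$ widening, so a 10-layer block covers the current sample plus 1,023 past samples.

```python
def receptive_field(dilations, kernel_size=2, blocks=1):
    """Receptive field of stacked dilated causal convolutions.

    The current sample counts as 1; each layer with kernel size k and
    dilation d widens the field by d * (k - 1) past samples.
    """
    rf = 1
    for _ in range(blocks):
        for d in dilations:
            rf += d * (kernel_size - 1)
    return rf

dilations = [2 ** i for i in range(10)]             # 1, 2, 4, ..., 512
one_block = receptive_field(dilations)              # 1,024 samples
five_blocks = receptive_field(dilations, blocks=5)  # 5,116 samples, ~213 ms at 24 kHz
```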
Each dilated layer uses the gated activation unit from PixelCNN, with residual connections bypassing each block and skip connections summing contributions from all layers into the final output. The output predicts a $\mu$-law companded 256-way softmax (or a discretized logistic mixture) for the next sample. Training is fully parallel — all samples' conditionals computed simultaneously from the ground-truth waveform — while inference (sample generation) must proceed sample by sample, which makes synthesis slow on standard hardware.
Conditional WaveNet for text-to-speech
WaveNet can be conditioned on linguistic features derived from text — phoneme sequences, pitch targets, speaking rate — to produce controllable speech synthesis. In the original Google implementation, a separately trained text-analysis front-end produces a frame-level linguistic feature vector, which is upsampled to the waveform sample rate and injected into every WaveNet layer via the conditioning mechanism. The result was rated by human listeners as significantly more natural than any previous TTS system, with mean opinion scores (MOS) approaching the quality of recorded human speech. Google deployed WaveNet in production for Google Assistant, though the original autoregressive model was eventually replaced by faster non-autoregressive variants (including distillation into IAF models — see Chapter 03's WaveGlow discussion).
WaveNet's legacy in audio: Beyond speech synthesis, the WaveNet architecture and its dilated convolution principle influenced nearly every subsequent neural audio model. WaveGlow (flows), MelGAN (adversarial), HiFi-GAN (adversarial), and EnCodec (codec-based) all draw architectural lineage from WaveNet, whether using its dilated convolutions, its gated activations, or its multi-scale design philosophy.
The Transformer as an Autoregressive Model
The transformer's self-attention mechanism is ideally suited to autoregressive modeling: a causal mask enforces the causality constraint, and attention's global receptive field captures long-range context that RNNs and dilated convolutions can only approximate. This combination — transformer + next-token prediction — became the dominant generative architecture.
The original transformer (Vaswani et al., 2017) was designed for sequence-to-sequence tasks (machine translation) using an encoder-decoder structure. The decoder uses causal (masked) self-attention: at each position $t$, self-attention is restricted to attend only to positions $1, \ldots, t$ by setting attention logits for future positions to $-\infty$ before softmax. This is trivially implemented by masking the upper triangle of the attention score matrix. With this mask, the decoder can predict all $T$ output positions in parallel during training — attending to the correct (teacher-forced) ground-truth inputs shifted right by one — while sampling at inference time proceeds one token at a time.
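The masking described above can be sketched in a few lines. A minimal single-head version, with learned projections, batching, and multiple heads omitted:

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Single-head causal self-attention (projections and batching omitted).

    Q, K, V: (T, d) arrays. Attention logits for future positions are set
    to -inf before the softmax, so position t attends only to 1..t.
    """
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # strictly upper triangle
    scores[future] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
out, w = causal_self_attention(rng.normal(size=(5, 8)),
                               rng.normal(size=(5, 8)),
                               rng.normal(size=(5, 8)))
# w is lower-triangular: no attention mass on future positions
```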
The decoder-only transformer removes the cross-attention to an encoder entirely, using only self-attention and feedforward layers. This is the GPT architecture: a stack of causal self-attention layers processing a single sequence, predicting the next token at each position. The attention mechanism's global receptive field — every position can, in principle, attend to every earlier position — gives it a fundamental advantage over PixelRNN and WaveNet for sequences with long-range dependencies. A 512-token context window connects every token pair with a direct attention path; a WaveNet-style dilated convolution stack needs many more layers to achieve a comparable receptive field.
The KV cache for efficient inference
During autoregressive inference, the model generates tokens one at a time. At each step $t$, it must compute self-attention over all previous tokens $1, \ldots, t$. Without optimization, this requires recomputing the key and value vectors $K, V$ for all previous tokens at every step — quadratic in the sequence length. The KV cache eliminates this redundancy: the key and value projections for all tokens up to position $t-1$ are computed once and stored; at step $t$, only the new query, key, and value for the most recent token need to be computed, and attention is computed between the new query and all cached keys. This reduces per-step cost from $O(t^2 d)$ to $O(t d)$, making long-context autoregressive generation practical. The KV cache is a memory-compute trade-off: for long contexts or large models, the cache can consume gigabytes of GPU memory, which constrains batch size during inference.
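A minimal sketch of the cache for a single head of a single layer, with the query/key/value projections omitted; real implementations store per-layer, per-head tensors preallocated for the maximum context length.

```python
import numpy as np

class KVCache:
    """Single-head KV cache for one attention layer (projections omitted).

    Keys and values for past tokens are computed once and stored; each
    decode step appends one (k, v) pair and attends the new query to
    every cached key.
    """
    def __init__(self, d):
        self.keys = np.empty((0, d))
        self.values = np.empty((0, d))

    def step(self, q, k, v):
        self.keys = np.vstack([self.keys, k[None]])
        self.values = np.vstack([self.values, v[None]])
        scores = self.keys @ q / np.sqrt(len(q))  # (t,): new query vs all cached keys
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ self.values                    # attention output, shape (d,)

rng = np.random.default_rng(0)
cache = KVCache(d=8)
for _ in range(4):                     # four decode steps
    q, k, v = rng.normal(size=(3, 8))  # stand-ins for the new token's projections
    out = cache.step(q, k, v)
# cache.keys now holds one key per generated token: shape (4, 8)
```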
Training objective and teacher forcing
Autoregressive transformers are trained with teacher forcing: rather than feeding the model's own generated tokens as input at training time, the ground-truth tokens are used as context for all positions simultaneously. This enables fully parallelized training — the entire sequence is forward-passed at once, producing predictions for all positions in $O(T^2)$ attention operations. The training loss is the sum of cross-entropy losses at all positions:
$$\mathcal{L}(\theta) = -\sum_{t=1}^T \log p_\theta(x_t | x_1, \ldots, x_{t-1})$$

This is exact maximum likelihood — the model learns to assign high probability to the actual next token given the ground-truth context. Teacher forcing introduces a distributional shift between training (ground-truth context) and inference (model-generated context): if the model makes a mistake early in generation, it must continue from an out-of-distribution prefix. This exposure bias is a known limitation of teacher-forced training, though in practice large models trained on diverse data are robust to it.
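The teacher-forced loss is then a gather over ground-truth targets, assuming the logits for all $T$ positions came from one parallel forward pass. A numpy sketch:

```python
import numpy as np

def teacher_forced_nll(logits, targets):
    """Sum of per-position cross-entropies for next-token prediction.

    logits: (T, V) predictions at every position, computed in one
    parallel forward pass over the ground-truth (teacher-forced) context.
    targets: (T,) the ground-truth next token at each position.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)   # stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

# Uniform logits over V=4: each position contributes log 4 to the loss.
loss = teacher_forced_nll(np.zeros((3, 4)), np.array([1, 2, 0]))
```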
GPT: Scaling Autoregressive Language Models
The GPT series demonstrated that language generation quality improves dramatically and predictably with scale — more parameters, more data, more compute — transforming a simple next-token predictor into a general-purpose generative engine capable of in-context learning, instruction following, and reasoning.
Radford et al. (2018) introduced GPT-1 — a 117-million parameter decoder-only transformer trained on BooksCorpus with a next-token prediction objective. The key insight was to pretrain on language modeling (easy to obtain massive data, clear training signal) and then fine-tune on downstream tasks with minimal task-specific architecture changes. GPT-1 showed that language model pretraining produced rich representations useful for tasks far beyond generation: sentiment analysis, textual entailment, question answering, and semantic similarity all improved with pretrained features.
GPT-2: language models as unsupervised multitask learners
GPT-2 (Radford et al., 2019) scaled to 1.5 billion parameters trained on WebText — 40GB of curated web text — and produced a surprising finding: without any task-specific fine-tuning, the model could perform translation, summarization, reading comprehension, and question answering by simply conditioning on naturally-occurring task specifications in the prompt (e.g., "TL;DR:" as a summarization cue). This zero-shot generalization was not explicitly trained for — it emerged from learning to complete text across a diverse corpus that contained many formats and tasks. GPT-2's text generation was fluent enough that OpenAI initially withheld the full model due to concerns about misuse.
GPT-2 also established that language model quality — measured by perplexity on held-out text — scales smoothly with model size over several orders of magnitude, with no indication of diminishing returns at the sizes tested. This empirical observation laid the groundwork for the systematic scaling laws investigated later.
GPT-3: few-shot learners
GPT-3 (Brown et al., 2020) pushed to 175 billion parameters and introduced the framing of in-context learning: rather than updating model weights for new tasks, GPT-3 learns from a few examples provided directly in the prompt context. Given a prompt showing 5 examples of "English word → French translation," GPT-3 continues the pattern with high accuracy for new words — not by gradient descent on those examples, but by pattern-completing from the prompt. This few-shot capability emerged from scale in a way that smaller models could not match.
As a generative model specifically, GPT-3 demonstrated remarkable text generation capabilities: coherent long-form articles, poetry matching the style of classic poets, programming code, and convincing human-like dialogue. These emergent generation capabilities arise from the model having compressed an enormous amount of world knowledge, stylistic variation, and textual structure into its parameters through next-token prediction alone. The diversity and coherence of GPT-3's generations at temperature 0.9–1.0 established that scale alone — no architectural innovation — was sufficient to produce qualitatively new generative capabilities.
Why next-token prediction works so well: To accurately predict the next word in a Wikipedia article about quantum mechanics, you must know quantum mechanics. To predict dialogue in a novel, you must model character psychology and narrative structure. To predict code completions, you must understand algorithms and syntax. Next-token prediction on a large enough corpus forces the model to internalize an enormous breadth of knowledge and skills as a necessary side-effect of learning to predict text accurately. The task is hard enough that there is no shortcut — the model must actually understand the content to predict it well.
Autoregressive Image Generation via Discrete Tokens
Applying a language-model-scale transformer directly to raw pixels is impractical — even a 256×256 image is 196,608 values. The solution: learn a discrete "visual vocabulary" with a VQ-VAE, compress images into short sequences of codebook indices, then train an autoregressive transformer over those indices.
The transformer's quadratic attention cost with sequence length makes direct autoregressive pixel modeling intractable at large resolutions. The key enabling idea — introduced in VQ-VAE and made famous by DALL-E 1 — is to first train a discrete image tokenizer that compresses an image from, say, 256×256 pixels to a 32×32 grid of discrete token indices, each drawn from a learned vocabulary of $K$ visual codebook entries. The autoregressive transformer then models sequences of length $32 \times 32 = 1{,}024$ over a $K$-way vocabulary, rather than sequences of length 196,608 over 256 pixel values — a ~200× reduction in sequence length.
VQ-VAE and discrete latent codes
The Vector Quantized VAE (VQ-VAE, van den Oord et al., 2017) trains an encoder-decoder architecture where the encoder's continuous feature map is quantized to the nearest entry in a learned codebook $\{e_1, \ldots, e_K\} \subset \mathbb{R}^D$. Each spatial location $i,j$ of the encoder output is replaced by the nearest codebook vector: $z_{ij} = e_{k^*}$ where $k^* = \arg\min_k \|z_e(x)_{ij} - e_k\|$. The decoder reconstructs the image from these quantized vectors. Gradient flow through the discrete quantization step uses the straight-through estimator: in the backward pass, gradients flow directly from the decoder's input to the encoder's output as if the quantization were the identity. A commitment loss encourages the encoder outputs to stay close to codebook entries, and a codebook update rule keeps the codebook vectors near their assigned encoder outputs. VQ-VAE learns a compressed discrete visual representation with good reconstruction quality; the codebook is a discrete "visual vocabulary" where each entry represents a frequently-occurring visual pattern at a certain scale.
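The quantization step is a nearest-neighbor lookup against the codebook. A numpy sketch of the forward pass only; the straight-through backward pass is noted in the comment.

```python
import numpy as np

def vector_quantize(z_e, codebook):
    """Nearest-codebook quantization for a VQ-VAE bottleneck.

    z_e: (N, D) encoder outputs (spatial positions flattened).
    codebook: (K, D) learned embedding vectors.
    Returns the quantized vectors and their integer indices. In a real
    implementation the backward pass is the straight-through estimator:
    z_q is treated as z_e + stop_gradient(z_q - z_e), so gradients flow
    to the encoder as if quantization were the identity.
    """
    # squared distances to every codebook entry: (N, K)
    d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
z_q, idx = vector_quantize(np.array([[0.1, -0.2], [0.9, 1.2]]), codebook)
# each encoder output snaps to its nearest codebook entry: indices [0, 1]
```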
ImageGPT
ImageGPT (Chen et al., 2020) applied a GPT-style autoregressive transformer directly to sequences of quantized pixel clusters (using a k-means color palette of 512 colors), generating images autoregressively in raster-scan order. ImageGPT trained at scales up to 6.8 billion parameters and produced surprisingly strong self-supervised visual representations — the internal activations of the model, when linearly probed, achieved competitive ImageNet classification accuracy. This was an early demonstration that autoregressive prediction of visual content, not just text, forces the model to learn semantically rich internal representations.
DALL-E 1 and VQGAN
OpenAI's original DALL-E (Ramesh et al., 2021) combined a dVAE (discrete VAE with learned codebook) image tokenizer with a GPT-3-scale transformer trained on 250 million image-text pairs. Text tokens and image tokens were concatenated into a single sequence — first the text caption tokens (up to 256 BPE tokens), then the image tokens (1,024 dVAE indices at 32×32) — and the transformer was trained to predict all tokens autoregressively. At generation time, a text prompt conditions the model, which then samples image tokens that complete the sequence. The generated image tokens are decoded back to pixels by the dVAE decoder. DALL-E 1 demonstrated that large-scale autoregressive pretraining on image-text pairs could produce semantically coherent image generations from natural language descriptions — the first large-scale text-to-image system, predating the diffusion-based systems that would later surpass it.
VQGAN (Esser et al., 2021) improved the visual tokenizer by using an adversarial training objective for the encoder-decoder: rather than training with pixel-level reconstruction loss alone, VQGAN adds a patch-based discriminator (similar to PatchGAN) that pushes the decoded images toward photorealistic quality. VQGAN codes are more perceptually compact than VQ-VAE codes — they capture more visual detail with fewer tokens — which improves the quality of subsequent autoregressive generation. The VQGAN + transformer pipeline (often called "taming transformers") became the standard recipe for high-quality autoregressive image synthesis before the rise of diffusion models.
Sampling Strategies
The output of an autoregressive model at each step is a full probability distribution over the next token. How you sample from that distribution — whether deterministically, randomly, or with a constrained budget — profoundly affects the quality, diversity, and coherence of generated sequences.
At each generation step, the model outputs logits $l_1, \ldots, l_V$ over the vocabulary of size $V$, which are converted to probabilities by softmax. The choice of sampling strategy is a fundamental design decision with strong effects on generation quality. Several strategies are commonly used:
Greedy decoding and beam search
Greedy decoding always selects the highest-probability token at each step: $x_t = \arg\max_v p(v | x_{1:t-1})$. It is fast and deterministic but often produces repetitive, low-diversity text — the model locks onto the locally safest tokens and fails to explore the distribution. Beam search maintains $B$ candidate sequences (beams) at each step and expands each by all possible tokens, keeping only the $B$ highest-probability complete sequences. It achieves better global optimization of the joint sequence probability but still tends toward boring, repetitive output and is more expensive ($B\times$ slower). Both methods are primarily used for translation and summarization tasks where fidelity to a ground truth is paramount; for open-ended generation, sampling is preferred.
Temperature sampling
Temperature scaling sharpens or flattens the distribution before sampling by dividing logits by a scalar temperature $\tau > 0$: $p_\tau(v) = \text{softmax}(l / \tau)$. At $\tau = 1$, sampling is from the model's distribution. At $\tau < 1$, the distribution is sharpened (higher confidence, more conservative choices); at $\tau > 1$, it is flattened (lower confidence, more random choices). Values of $\tau \approx 0.7$–$0.9$ typically produce the best quality/diversity tradeoff for text generation, slightly concentrating probability on likely tokens while retaining diversity. Very low $\tau$ approaches greedy; $\tau \to \infty$ approaches uniform random sampling.
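As a concrete illustration, the following sketch applies temperature scaling to a toy logit vector; the logit values are invented for demonstration:

```python
import numpy as np

def sample_with_temperature(logits, tau, rng):
    """Sample one token id after scaling logits by 1/tau (minimal sketch)."""
    scaled = logits / tau
    scaled = scaled - scaled.max()            # numerical stability before exp
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return rng.choice(len(logits), p=probs), probs

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, -1.0])      # invented next-token logits

_, p_cold = sample_with_temperature(logits, 0.5, rng)  # tau < 1: sharpened
_, p_hot = sample_with_temperature(logits, 2.0, rng)   # tau > 1: flattened
print(p_cold.round(3), p_hot.round(3))
```

At low temperature the probability mass concentrates on the top logit; at high temperature the distribution spreads toward uniform, exactly as described above.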
Top-k sampling
Top-k sampling restricts sampling to the $k$ highest-probability tokens at each step, renormalizing the remaining probability mass among those $k$ tokens. With $k=40$ or $k=50$, this eliminates very unlikely tokens that would produce incoherent text while retaining considerable diversity. The limitation of top-k is that $k$ is a fixed hyperparameter: when the model is very confident (the top token has probability 0.9), sampling from $k=50$ still includes many implausible tokens; when the model is very uncertain (a flat distribution), restricting to $k=50$ may be too aggressive.
Nucleus (top-p) sampling
Nucleus sampling or top-p sampling (Holtzman et al., 2020) addresses this adaptively: rather than a fixed $k$, it samples from the smallest set of tokens whose cumulative probability mass exceeds a threshold $p$ (typically 0.9 or 0.95). When the model is confident, the nucleus is small (perhaps just 2–3 tokens); when uncertain, it expands to include many more options. This adaptive cutoff provides more appropriate diversity at each step. Nucleus sampling became the dominant decoding strategy for open-ended language generation after its introduction, outperforming top-k sampling in human evaluations of text quality and diversity.
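The two truncation rules can be sketched in a few lines of NumPy; the example distribution is invented (a real model would produce these probabilities via softmax):

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep only the k most probable tokens, renormalizing them to sum to 1."""
    cutoff = np.sort(probs)[-k]
    filtered = np.where(probs >= cutoff, probs, 0.0)
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative mass reaches p (nucleus)."""
    order = np.argsort(probs)[::-1]                  # most to least likely
    csum = np.cumsum(probs[order])
    nucleus = order[: np.searchsorted(csum, p) + 1]  # +1 includes the crossing token
    filtered = np.zeros_like(probs)
    filtered[nucleus] = probs[nucleus]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])   # invented model distribution
print(top_k_filter(probs, 2))    # only the two most likely tokens survive
print(top_p_filter(probs, 0.85)) # nucleus adapts: here the top three tokens
```

Note how the nucleus size follows the shape of the distribution: a more peaked `probs` would shrink the kept set automatically, while top-k always keeps exactly $k$ tokens.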
Typical sampling and other variants
Typical sampling (Meister et al., 2023) takes a different perspective: rather than sampling from the top of the distribution, it samples tokens whose information content $-\log p(x_t | x_{1:t-1})$ is close to the expected information content (conditional entropy) of the distribution. Tokens far from the expected information — either too predictable (boring) or too surprising (incoherent) — are excluded. This produces text that is more "typical" of the model's distribution rather than concentrated at high-probability peaks, and is argued to better approximate genuine samples from the model. Repetition penalties, which reduce logits for tokens already appearing in the context, are often combined with any of these sampling strategies to reduce the looping and repetition that plague autoregressive generation without explicit anti-repetition mechanisms.
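A minimal sketch of the typical-sampling filter, under the simplifying assumption that the whole distribution fits in memory (it does for any softmax output):

```python
import numpy as np

def typical_filter(probs, mass=0.9):
    """Locally typical sampling (sketch): keep tokens whose surprisal is
    closest to the distribution's entropy, up to cumulative mass `mass`."""
    surprisal = -np.log(probs)
    entropy = (probs * surprisal).sum()              # expected information content
    order = np.argsort(np.abs(surprisal - entropy))  # most "typical" tokens first
    csum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(csum, mass) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# With a peaked distribution, the head token is also the most typical one here.
peaked = np.array([0.97, 0.01, 0.01, 0.01])
print(typical_filter(peaked))
```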
| Strategy | Mechanism | Best for | Risk |
|---|---|---|---|
| Greedy | Always pick max-probability token | Translation, structured output | Repetitive, degenerate |
| Beam search | Maintain top-B candidate sequences | Summarization, translation | Boring; high compute |
| Temperature | Scale logits by 1/τ before softmax | General creative generation | Incoherent at high τ |
| Top-k | Sample from top k tokens only | General; fast | Fixed k doesn't adapt to confidence |
| Nucleus (top-p) | Sample from smallest set summing to p | Open-ended generation; default choice | Hyperparameter sensitivity |
| Typical | Sample tokens near expected entropy | Diverse, natural-sounding text | Less explored at scale |
The Sequential Bottleneck & Acceleration
Autoregressive generation is fundamentally sequential — each token depends on all previous tokens. On modern GPU hardware this makes generation memory-bandwidth-bound rather than compute-bound, and various techniques have been developed to either hide the latency or break the sequential dependency partially.
Training an autoregressive transformer is fast: all token predictions are computed in a single forward pass, and GPUs are highly utilized. Inference is another matter. Generating $T$ tokens requires $T$ sequential forward passes, each computing attention over all previous tokens. For a large model with $L$ layers, model dimension $d$, and a KV cache of $T$ tokens, each step involves $O(T \cdot d \cdot L)$ multiply-accumulate operations for the attention over the cache and $O(d^2 \cdot L)$ operations for the feedforward layers. At large batch sizes the feedforward layers are compute-bound, but single-stream decoding is memory-bandwidth-bound: the bottleneck at each step is streaming the weights and the KV cache from GPU HBM (high-bandwidth memory), not performing the multiplications. This means generation speed for large models with long contexts is limited by the GPU's memory bandwidth — which scales much more slowly than compute as GPUs improve. A 70-billion-parameter model generating at 10 tokens/second is largely limited by the bandwidth needed to read its 140GB of fp16 weights, plus the KV cache, at each step.
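The bandwidth argument can be checked with back-of-envelope arithmetic; the numbers below (fp16 weights, roughly H100-class HBM bandwidth) are illustrative assumptions:

```python
# Back-of-envelope decode-speed ceiling for a bandwidth-bound model.
# All numbers are illustrative assumptions, not measurements.
params = 70e9                # 70B parameters
bytes_per_param = 2          # fp16/bf16 weights
weight_bytes = params * bytes_per_param       # 140 GB streamed per generated token

hbm_bandwidth = 3.35e12      # ~3.35 TB/s, roughly H100-class HBM

# Each decoding step must read every weight from HBM once (ignoring the KV
# cache, overlap, and multi-GPU sharding), so the single-stream ceiling is:
max_tokens_per_s = hbm_bandwidth / weight_bytes
print(f"{max_tokens_per_s:.1f} tokens/s upper bound")   # roughly 24 tokens/s
```

The estimate ignores the KV cache entirely, so real long-context decoding sits below this ceiling; sharding the weights across several GPUs raises it roughly in proportion to the aggregate bandwidth.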
Speculative decoding
Speculative decoding (Leviathan et al., 2023; Chen et al., 2023) exploits the observation that a small draft model can quickly propose several tokens at once, and the large target model can verify them all in a single parallel forward pass. Concretely: a small "draft" model $M_q$ (say, 7B parameters) generates $k$ tokens speculatively ($k$ sequential small-model steps); the large "target" model $M_p$ (say, 70B parameters) computes its conditional distributions at all $k$ proposed positions in one parallel forward pass (since the target model can process any prefix in parallel); each draft token is accepted with probability $\min(1, p/q)$, and at the first rejection the remaining draft is discarded and the next token is drawn from a corrected residual distribution. Because the target model's parallel forward pass costs about the same as a single sequential step, verifying $k$ tokens is no more expensive than generating 1 — and when the draft model is accurate, speculative decoding achieves $2$–$3\times$ speedups on generation throughput while producing mathematically exact samples from the target model's distribution.
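The accept/reject rule at the heart of the algorithm can be sketched as follows; `speculative_step` is a hypothetical helper operating on precomputed draft and target distributions, and it omits the bonus token sampled when every draft is accepted:

```python
import numpy as np

def speculative_step(p, q, draft_tokens, rng):
    """One accept/reject pass of speculative decoding (sketch).

    p and q hold the target and draft next-token distributions at each of the
    k draft positions (shape (k, V)); draft_tokens are the k tokens the draft
    model sampled. Returns the accepted prefix, plus one corrected token on
    the first rejection.
    """
    out = []
    for t, x in enumerate(draft_tokens):
        # Accept draft token x with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p[t, x] / q[t, x]):
            out.append(x)
        else:
            # On rejection, resample from the residual max(0, p - q), renormalized;
            # this correction makes the overall samples exact under p.
            residual = np.maximum(p[t] - q[t], 0.0)
            out.append(rng.choice(len(residual), p=residual / residual.sum()))
            break
    return out

rng = np.random.default_rng(0)
p = np.array([[0.1, 0.9], [0.5, 0.5]])   # target distributions at 2 draft positions
q = np.array([[0.2, 0.8], [0.5, 0.5]])   # draft distributions at the same positions
print(speculative_step(p, q, [1, 0], rng))   # both drafts accepted: [1, 0]
```

Whenever the target assigns a draft token at least as much probability as the draft did, the token is always kept; the residual resampling on rejection is what makes the combined procedure an exact sampler for $p$.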
Parallel decoding and Medusa
A complementary approach modifies the model itself to enable partial parallelism. Medusa (Cai et al., 2024) adds multiple additional prediction heads to the autoregressive model — one head predicts the next token, a second head predicts the token two positions ahead, a third predicts three positions ahead, and so on. All heads run in parallel on the current hidden state, proposing a "tree" of candidate continuations. A verification step using the original model's first head selects the longest valid prefix among the candidates. This achieves $2$–$3\times$ speedup without a separate draft model, at the cost of additional parameters and a modified training procedure (the lookahead heads are trained separately, keeping the main model weights unchanged). Other parallel decoding schemes like Lookahead Decoding and Jacobi decoding similarly trade additional memory and careful algorithmic design for reduced wall-clock generation time.
Why not generate all tokens in parallel? Non-autoregressive generation (NAR) does exactly this — predicts all tokens of the output simultaneously in a single forward pass. For machine translation, NAR models (Gu et al., 2018) can match autoregressive quality at a fraction of the decoding cost. But for open-ended generation without a strong conditioning signal, NAR models suffer from multi-modality: without left-to-right context, predicting each token independently causes incoherence. Masked diffusion models (e.g., MDLM, MDLM-T5) are a modern approach that re-introduces partial sequential structure through iterative refinement, bridging the gap between fully autoregressive and fully parallel generation.
The Autoregressive Landscape
Autoregressive modeling is not a niche technique for one data type — it is a universal framework that has produced the state of the art in language, code, audio, images, and scientific data. Understanding where it excels and where it faces fundamental limits is essential for choosing the right generative approach.
Autoregressive models vs. diffusion models
Since 2022, diffusion models have dominated image generation — Stable Diffusion, DALL-E 2 and 3, Midjourney, and Imagen all use diffusion. But the comparison is more nuanced than it appears. Diffusion models have a fundamental advantage for continuous data (images, audio waveforms) where the denoising prior is a strong inductive bias. Autoregressive models have a fundamental advantage for discrete data (text, code, symbolic sequences) where the chain rule factorization is natural and the softmax output distribution is well-suited to categorical choices. The most capable generative systems of 2024 — GPT-4o, Gemini, Claude — are primarily autoregressive, because language and reasoning remain the dominant generative tasks, and autoregressive models excel there.
Recent work has blurred the boundary: autoregressive models over visual tokens (LlamaGen, Chameleon) produce competitive image quality with appropriate tokenizers. Masked diffusion models add partial ordering to otherwise parallel diffusion. Flow matching, a training method for continuous normalizing flows, has discrete-data analogues. The convergence of these paradigms suggests that the distinction between "autoregressive" and "diffusion" may matter less than the data type, tokenization, and scale.
Perplexity as a universal metric
One underappreciated advantage of autoregressive models is that their training objective — negative log-likelihood — directly produces a principled evaluation metric: perplexity, defined as $\text{PPL} = \exp\left(-\frac{1}{T}\sum_t \log p_\theta(x_t|x_{1:t-1})\right)$. Perplexity is the exponential of the average surprisal per token under the model — lower is better, and a perplexity of $k$ means the model is, on average, as uncertain as a uniform choice among $k$ options. It enables direct, comparable evaluation across different models trained on the same data without any task-specific benchmark. This is a meaningful scientific advantage: the loss function is the evaluation metric, which is rarely true in GAN or diffusion model research (where FID and IS are computed separately and are subject to their own biases).
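Computing perplexity from per-token log-probabilities is a one-liner; the sanity check below uses a made-up sequence in which every token receives probability 1/4:

```python
import numpy as np

def perplexity(token_log_probs):
    """PPL = exp(-(1/T) * sum_t log p(x_t | x_{<t})) from per-token log-probs."""
    return float(np.exp(-np.mean(token_log_probs)))

# Sanity check on invented data: a model assigning every token probability 1/4
# should have perplexity 4 -- as uncertain as a uniform 4-way choice per step.
log_probs = np.log(np.full(10, 0.25))
print(round(perplexity(log_probs), 6))   # 4.0
```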
Why autoregressive models dominate at scale
The deep reason for the dominance of autoregressive models at large scale is the simplicity and stability of the training objective. The cross-entropy loss is convex in the model's output probabilities, has clean gradients, decomposes perfectly across tokens and training examples (enabling large-batch distributed training), and can be computed exactly without any approximation or adversarial instability. When you train 100 billion parameters on 10 trillion tokens, you need a training procedure that works reliably at that scale — and next-token prediction, more than any other generative objective, has proven itself there. Every architectural improvement (Flash Attention, RoPE, SwiGLU, grouped-query attention) has been compatible with the same fundamental training loop, allowing steady progress without overturning the paradigm.
The autoregressive framework's limits are equally clear. Sequential sampling is slow and cannot be trivially parallelized. Long-context processing is expensive: attention is quadratic in sequence length, and even with linear-complexity alternatives (state-space models, linear attention) the memory of $n$ context tokens must be maintained somehow. The exposure bias between teacher-forced training and autoregressive inference is a persistent but manageable challenge. And for continuous data at very high resolution — 4K video, hours of audio — the tokenization overhead of mapping to discrete sequences introduces compression artifacts that diffusion models avoid by working directly in pixel space. None of these limitations are likely fatal; they are engineering challenges that ongoing research continues to address.
The transformer decade: Every dominant generative language model since 2018 — GPT-2, GPT-3, PaLM, LLaMA, Claude, Gemini — is an autoregressive transformer or a close variant. The architecture has proven extraordinarily durable, scaling from 100 million to over a trillion parameters with the same basic training recipe. Autoregressive prediction may be the simplest possible generative model, but simplicity at scale has proven to be the most powerful formula in the history of machine learning.
Further Reading
Foundational Papers
- **Pixel Recurrent Neural Networks**: The paper that brought autoregressive modeling to images, introducing PixelRNN and the masked convolution PixelCNN. The diagonal BiLSTM and gated activation unit are explained clearly. Read alongside the PixelCNN++ paper to see how the architecture evolved.
- **Conditional Image Generation with PixelCNN Decoders**: The gated PixelCNN architecture and the blind-spot fix. Introduces class-conditional and spatially-conditional image generation. The class-conditional generation results on ImageNet show the power of conditioning clearly.
- **WaveNet: A Generative Model for Raw Audio**: Dilated causal convolutions applied to audio waveforms. The architecture and the human evaluation results remain compelling reading. The audio samples on the accompanying webpage are worth listening to for historical context.
- **PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications**: The discretized logistic mixture output and architectural refinements that significantly improved image model quality. The ablation study clearly shows the contribution of each component.
GPT and Language Model Scaling
- **Language Models are Unsupervised Multitask Learners (GPT-2)**: The paper demonstrating zero-shot generalization across tasks purely from language model pretraining. The writing is unusually clear for an OpenAI technical report. The generated text examples in the appendix reward careful reading.
- **Language Models are Few-Shot Learners (GPT-3)**: In-context learning at 175B parameters. The few-shot evaluation methodology and the performance vs. model size plots are the two essential sections. The appendix contains hundreds of prompting examples that remain useful for understanding GPT-3's capabilities.
Discrete Image Tokens & Sampling
- **Neural Discrete Representation Learning (VQ-VAE)**: The vector-quantized VAE and the straight-through gradient estimator. Essential background for DALL-E, VQGAN, and all discrete visual tokenization work. The commitment loss and codebook update rule require careful reading.
- **Taming Transformers for High-Resolution Image Synthesis (VQGAN)**: Adversarially-trained visual tokenizers combined with transformer priors. The recipe that made high-resolution autoregressive image synthesis practical before diffusion models arrived. The architecture figure and the comparison of codebook quality metrics are particularly useful.
- **The Curious Case of Neural Text Degeneration**: Introduces nucleus (top-p) sampling and diagnoses why greedy and beam search produce degenerate text. The analysis of probability mass concentration is insightful. Read this before choosing a decoding strategy for any text generation application.
- **Fast Inference from Transformers via Speculative Decoding**: The speculative decoding algorithm, with a clean proof that samples are exactly distributed as the target model. The acceptance rate analysis tells you when speculative decoding helps most.