The previous chapter built convolutional networks around a spatial inductive bias: translation equivariance and local receptive fields turned the naturally grid-structured image into the first domain where deep learning decisively beat hand-engineered pipelines. This chapter turns to the other great inductive bias of classical deep learning — temporal structure. A sentence, a speech waveform, a heart-rate trace, a stream of user clicks, a protein's amino-acid sequence: each is a string of observations where order carries meaning and length is not known in advance. The fully-connected network cannot represent such inputs without fixing a window size, and the convolutional network can only reach a fixed receptive field. The recurrent neural network — a single cell applied at every time step, carrying a hidden state forward — is the architecture that natively handles variable-length input, and for three decades it was the default choice for any sequential data. The history of sequence models is a history of fighting the vanishing-gradient problem: Sepp Hochreiter's 1991 diploma thesis named the pathology, Bengio, Simard, and Frasconi's 1994 paper proved it was fundamental, and Hochreiter and Schmidhuber's 1997 Long Short-Term Memory paper introduced the gated cell that solved it in practice. What followed was the great age of RNNs — Graves's speech recognition work, Sutskever–Vinyals–Le's 2014 sequence-to-sequence paper, Bahdanau–Cho–Bengio's attention mechanism, Google's 2016 neural machine translation system — before the 2017 transformer made everything in this chapter look, overnight, like the previous era. It wasn't, quite: the ideas developed here — gating, state carried across time, encoder–decoder, teacher forcing, beam search — are all still in daily use, and modern state-space models and linear RNNs are bringing recurrence back under new names. This chapter is the chapter on how neural networks learned to remember.
Sections one through four set up the sequence-modelling problem and the vanilla recurrent architecture that was the starting point for everything in this chapter. Section one argues for the recurrent design — why fully-connected and convolutional networks are not enough, what it means for an architecture to handle variable-length inputs, and the distinction between stateful and stateless models. Section two defines the recurrent neural network itself — the single-cell, shared-weight, state-carrying recurrence h_t = f(W_hh h_{t−1} + W_xh x_t) that unrolls into a deep computation graph through time, plus the three-in-one idea that an RNN is simultaneously a dynamical system, an encoder of arbitrary-length inputs, and a recursive function approximator. Section three is backpropagation through time (BPTT) — Werbos 1990, Rumelhart–Hinton–Williams 1986 — the algorithm that makes an RNN trainable by unrolling the recurrence into a feedforward graph and running ordinary backprop, together with the truncated variant that makes it tractable for long sequences. Section four is the central theoretical obstacle this chapter is about: the vanishing and exploding gradient problem (Hochreiter 1991; Bengio, Simard, Frasconi 1994; Pascanu, Mikolov, Bengio 2013) — the mathematical reason a vanilla RNN cannot learn long-range dependencies, and the reason every architecture in the rest of this chapter exists.
Sections five and six are the gated architectures that solved the vanishing-gradient problem. Section five is Long Short-Term Memory (Hochreiter and Schmidhuber 1997) — the cell state that flows almost unchanged along a linear conveyor belt, the forget/input/output gates that decide what to read and write, and the remarkable fact that a single paper published fifteen years before deep learning's modern era had already solved the problem. Section six is the Gated Recurrent Unit (Cho, van Merriënboer, Gulcehre, Bahdanau, Bougares, Schwenk, Bengio 2014) — a simplification of the LSTM with one fewer gate and no separate cell state, empirically competitive on most tasks and computationally cheaper. Section seven covers bidirectional and deep RNNs — the forward-and-backward pass that made RNNs the default for labelling tasks like named-entity recognition and part-of-speech tagging, and the stacking of multiple recurrent layers that eventually proved necessary for strong language-modelling performance. Section eight is character-level language models — Karpathy's 2015 Unreasonable Effectiveness of Recurrent Neural Networks, the generative char-RNNs that wrote fake Shakespeare and fake C code, and the surprisingly interpretable hidden-state visualisations that came out of that work.
Sections nine through thirteen are the sequence-to-sequence era — the five-year period (2014–2017) when RNN-based encoder–decoder models were the state of the art for machine translation, dialogue, and summarisation, and when almost every idea that survives in today's large language models was first invented. Section nine is the sequence-to-sequence architecture itself (Sutskever, Vinyals, Le 2014; Cho et al. 2014) — two RNNs, one reading the source, one emitting the target, connected by a single fixed-length context vector. Section ten is attention (Bahdanau, Cho, Bengio 2014; Luong, Pham, Manning 2015) — the decisive invention that replaced the fixed-length bottleneck with a learned soft alignment over the encoder states, foreshadowing everything the transformer would do three years later. Section eleven covers teacher forcing and the exposure bias it introduces — the training–inference mismatch that motivated scheduled sampling and minimum-risk training. Section twelve is decoding — greedy decoding, beam search, nucleus/top-k sampling, and the length and diversity pathologies each introduces. Section thirteen is neural machine translation as the flagship application that drove all of this — the Google NMT system (Wu et al. 2016) that replaced phrase-based statistical MT in production.
Sections fourteen through seventeen cover adjacent applications and alternative architectures. Section fourteen is speech recognition and the Connectionist Temporal Classification (CTC) loss (Graves, Fernández, Gomez, Schmidhuber 2006) — the change-of-variables trick that lets an RNN map an audio frame sequence to a character sequence without explicit alignment, and the DeepSpeech lineage (Hannun et al. 2014) that made end-to-end speech recognition work. Section fifteen is Temporal Convolutional Networks (Bai, Kolter, Koltun 2018) — the quiet result that causal dilated convolutions, trained with the same compute, match or beat RNNs on most sequence benchmarks, undermining the idea that recurrence is essential to sequence modelling. Section sixteen is the modern revival of recurrence — linear RNNs, state-space models (S4, Mamba), the RWKV architecture, and the argument that the transformer's O(N²) attention is only one of several valid answers to the sequence-modelling question. Section seventeen is the limits of RNNs and the transformer succession — why Vaswani et al.'s 2017 Attention Is All You Need retired recurrence as the default and what the consequences were for every domain in this chapter. The closing section places sequence models in the broader landscape: as the archetype of temporal inductive bias, as the generation of ideas that transformers inherited wholesale, and as the architecture family that taught the field how to frame every problem — translation, classification, generation — as a conditional sequence model.
A photograph has a fixed shape: H by W pixels, three channels, and every photograph in a dataset has the same H and W after resizing. A sentence does not. One sentence has seven words, the next has forty-three, and tomorrow's sentence might have a thousand. The architectures of the last chapter — fully-connected and convolutional — were built assuming the input shape is known ahead of time. Real-world sequential data breaks that assumption on contact. Handling it requires a different architectural idea.
Sequence data is anything where the input is a string of observations x_1, x_2, …, x_T in which (i) T varies across examples, (ii) the order carries information, and (iii) observations at nearby positions are typically more related than observations far apart. This is an enormous category: natural-language text (tokens or characters), speech and audio (frames of a spectrogram), time series (stock prices, ECGs, sensor readings), user behaviour logs (clickstreams, search sessions), genomic sequences (DNA, proteins), video (a sequence of images), and streams of actions in reinforcement learning. A photograph can be cast as a sequence of patches; an image can be modelled autoregressively pixel-by-pixel. The sequence framing is so general that in the modern transformer era almost every problem has been reformulated as sequence modelling.
A fully-connected network has an input layer of a specific width d. To feed it a sentence, you must pick d in advance — say, a 512-word ceiling, padding shorter sentences and truncating longer ones. Padding wastes compute on empty positions; truncation throws away information; and the model cannot share weights across positions, so word i and word j learn completely separate parameterisations even though they play interchangeable roles grammatically. A convolutional network is better — stride-1 convs work on any length T and share weights across positions — but its receptive field is a fixed window determined by kernel sizes and depth, and information outside that window cannot influence the current output. Both architectures can be made to work for sequences with heroic effort, and temporal convnets (section fifteen) remain a credible alternative. But the natural architecture for sequential data is one whose computation itself grows with the input length — that reads the input one step at a time, carrying information forward as it goes.
The recurrent architecture's move is to maintain a hidden state h_t that summarises everything the network has seen up to time t, and to update it via a fixed cell h_t = f(h_{t−1}, x_t). The same function f (with the same weights) is applied at every time step, regardless of T. The network's "depth" in time scales with the sequence length; its parameter count does not. The hidden state h_t plays the role the feedforward network's activations played in the spatial domain — a learned representation of input seen so far — with the additional property that nothing in the architecture fixes how long "so far" is allowed to be. This is the stateful computation idea.
A complementary framing that dominated the mid-2010s is sequence modelling as conditional probability. A sentence y_1, …, y_T is modelled as p(y) = ∏_t p(y_t | y_{<t}), the product of next-token predictions given everything before. A translation is modelled as p(y | x) = ∏_t p(y_t | y_{<t}, x), conditional on a source sentence x. Once you frame the problem this way, the model only has to produce one categorical distribution at a time — next token — and the sequence distribution falls out of repeated sampling. Almost every ambitious sequence task of the last decade — machine translation, speech recognition, image captioning, code completion, chatbots, large language models — is in this form. The RNN's hidden state h_t is the natural place to put the conditioning information: it is a sufficient statistic of the past, and the next-token probability p(y_t | h_{t−1}) only has to read it.
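The factorisation can be made concrete with a toy model. The vocabulary, probabilities, and bigram conditioning below are all invented for illustration; an RNN would replace the previous-token lookup with a learned hidden state that summarises the whole prefix:

```python
import math

# Hypothetical next-token tables, standing in for what an RNN's softmax
# head would produce at each step. Words and probabilities are invented.
probs = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.9, "ran": 0.1},
}

def sequence_log_prob(tokens):
    """log p(y) = sum_t log p(y_t | y_{<t}).  Here the conditioning 'state'
    is just the previous token; an RNN would carry a full hidden state."""
    logp, prev = 0.0, "<s>"
    for tok in tokens:
        logp += math.log(probs[prev][tok])
        prev = tok
    return logp

lp = sequence_log_prob(["the", "cat", "sat"])  # p = 0.6 * 0.5 * 0.9 = 0.27
```

The point of the factorisation is visible in the loop: the model only ever produces one categorical distribution at a time, and the whole-sequence probability is just the product.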
Every sequence architecture in this chapter answers three questions: (i) How is the hidden state updated? — what function f takes (h_{t−1}, x_t) to h_t. (ii) How are long-range dependencies preserved? — what mechanism prevents information from degrading as it flows through many steps of f. (iii) How does training work? — what loss signal reaches the weights despite the long computation graph. The vanilla RNN gives the simplest answers to all three, and they turn out to be insufficient. Most of this chapter is the story of better answers, one subproblem at a time.
The vanilla RNN is a single nonlinear cell applied at every step of a sequence, with a shared weight matrix for the recurrent connection and another for the input. It was first proposed by Rumelhart, Hinton, and Williams (1986) and named and analysed in its modern form by Elman (1990). Its simplicity is the reason it is the starting point of every sequence-modelling discussion; its limitations are the reason the rest of this chapter exists.
The vanilla RNN cell takes the previous hidden state h_{t−1} ∈ ℝ^d and the current input x_t ∈ ℝ^k and produces a new hidden state: h_t = σ(W_hh h_{t−1} + W_xh x_t + b), where W_hh ∈ ℝ^{d×d} is the recurrent weight matrix, W_xh ∈ ℝ^{d×k} is the input weight matrix, b ∈ ℝ^d is a bias, and σ is a nonlinearity — almost always tanh, occasionally ReLU, rarely anything exotic. An output may be produced at each step via y_t = W_hy h_t + b_y, with the particular form depending on the task (softmax for categorical prediction, identity for regression). Training is end-to-end via gradient descent on the loss summed across time steps.
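The cell translates directly into code. This is a minimal numpy sketch (dimensions and weight scales are illustrative, not tuned); note that the same two matrices are reused at every step, so the loop works for any sequence length T:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 5                                   # hidden and input dims (illustrative)

# Shared parameters, applied at every time step.
W_hh = rng.normal(0, 1 / np.sqrt(d), (d, d))  # recurrent weight matrix
W_xh = rng.normal(0, 1 / np.sqrt(k), (d, k))  # input weight matrix
b = np.zeros(d)

def rnn_step(h_prev, x):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t + b)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x + b)

def run_rnn(xs):
    """Apply the same cell at every step; T is whatever len(xs) happens to be."""
    h, states = np.zeros(d), []
    for x in xs:
        h = rnn_step(h, x)
        states.append(h)
    return np.stack(states)

hs = run_rnn(rng.normal(size=(12, k)))        # a length-12 sequence
```

An output head (softmax or identity) would read each row of `hs`; nothing in the cell itself fixes the sequence length.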
An RNN applied to a sequence of length T unfolds into a computation graph that is T layers deep — every time step adds one copy of the cell to the graph, with shared weights tied across all copies. This unrolling is the conceptual key to training: an RNN is not actually a "recurrent" computation in the optimisation sense; it is a very deep feedforward network with tied weights, and we train it with ordinary backpropagation applied to the unrolled graph. The tied weights mean the network has one parameterisation of the cell that is applied at every step, and gradients from all T applications accumulate into the same W_hh and W_xh matrices.
The hidden state h_t wears three hats. It is (i) a summary of the past — a running compression of x_1, …, x_t that downstream computation can read. It is (ii) a working memory — a buffer the cell can write information into and read from on the next step. And it is (iii) the carrier of gradients during backpropagation — information about the loss at time T has to flow backwards through every hidden state between T and the point where it influences the weights. These three roles are not always compatible. A hidden state that is a compact summary may be too low-dimensional to serve as working memory; a hidden state that is a good gradient carrier (wide, near-identity transitions) may not do much computation. The gated architectures of sections five and six largely work by decoupling these three roles.
The hyperparameters are the input dimension k (fixed by the problem — one-hot vocabulary size, mel-spectrogram channel count, feature count), the hidden dimension d (a design choice; typical values 128–2048), and the output dimension (fixed by the task). The parameter count is d(d + k + 1) + d · output_dim, dominated by the recurrent matrix when d ≫ k. A 1024-dim RNN on a 50,000-word vocabulary has about 5 × 10⁷ parameters in the cell and another 5 × 10⁷ in the output head, roughly 10⁸ in total, modest by modern standards; with a vocabulary this large it is the input and output matrices, not the d × d recurrence, that dominate the count. Width has to be chosen carefully: too narrow a state cannot carry enough information across many steps; too wide a state means more parameters in the d × d recurrence and slower training.
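The bookkeeping is worth checking once by hand. The function below simply evaluates the formula above for the quoted 1024-dim, 50,000-word example (a weights-only linear head, matching the d · output_dim term in the text):

```python
def rnn_param_count(d, k, out_dim):
    """Vanilla-cell parameters plus a weights-only linear output head."""
    cell = d * (d + k + 1)   # W_hh (d*d) + W_xh (d*k) + bias (d)
    head = d * out_dim
    return cell, head

cell, head = rnn_param_count(d=1024, k=50_000, out_dim=50_000)
# cell = 52,249,600 and head = 51,200,000: with a one-hot vocabulary this
# large, the input and output matrices dwarf the 1024 x 1024 recurrence.
```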
An RNN can be wired for four basic task shapes. Many-to-one — read the whole sequence, produce one output at the end (sentiment classification, sequence classification). One-to-many — condition on one input, produce a sequence (image captioning given a CNN encoding). Many-to-many with aligned outputs — produce one output per input (part-of-speech tagging, frame-level speech features). Many-to-many with unaligned outputs — read one sequence, produce another of different length (translation, summarisation). The aligned case is handled by reading out y_t at every step; the unaligned case requires an encoder–decoder architecture, which is the subject of section nine.
From a more theoretical angle, the RNN recurrence is a nonlinear dynamical system h_t = f(h_{t−1}; θ, x_t) driven by external input x_t. Without input, it settles into fixed points, limit cycles, or chaos depending on the eigenspectrum of W_hh. With input, those attractors are deformed into state-space trajectories that encode the history. This view became especially important in the echo-state-network and reservoir-computing literature (Jaeger 2001), which argued that a randomly initialised W_hh with carefully tuned spectral radius already has enough computational power, and only the output map needs training. The view also explains, retroactively, why RNN training is so hard: the dynamical behaviour of the recurrence depends sensitively on the eigenstructure of a matrix that gradient descent is trying to modify.
Training an RNN is mechanically the same as training any feedforward network: unroll the computation graph through time, compute the loss, and run backpropagation in reverse. The idea was worked out by Werbos in 1990 and independently by Rumelhart, Hinton, and Williams as part of their 1986 backpropagation paper. What makes it distinctive is what happens when T is large — the graph gets very deep, the memory cost of storing every intermediate activation blows up, and the gradient signal has to flow backwards through every intermediate state before it reaches the cell's weights.
Unroll the RNN into a T-layer feedforward graph with tied weights. For a loss L = ∑_t ℓ_t(y_t), the gradient ∂L/∂W_hh accumulates contributions from every time step: ∂L/∂W_hh = ∑_t (∂L/∂h_t) · (∂h_t/∂W_hh at time t). The key factor is ∂L/∂h_t, which depends on everything the loss sees at times ≥ t, and propagates backwards via ∂L/∂h_t = (∂L/∂h_{t+1}) · (∂h_{t+1}/∂h_t) + ∂ℓ_t/∂h_t. This is ordinary chain-rule backpropagation, applied to the unrolled graph. Modern autograd frameworks (PyTorch's autograd, TensorFlow's GradientTape, JAX's grad) handle the whole thing automatically once you write the recurrence as ordinary Python or tensor code.
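The backward recursion above can be implemented in a few lines and checked against finite differences. This is a minimal numpy sketch with a loss only at the final step; all sizes and weight scales are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, T = 4, 3, 6
W = rng.normal(0, 0.5, (d, d))          # W_hh
U = rng.normal(0, 0.5, (d, k))          # W_xh
xs = rng.normal(size=(T, k))
target = rng.normal(size=d)

def forward(W):
    """Unrolled forward pass; squared-error loss at the final step only."""
    h, hs = np.zeros(d), [np.zeros(d)]
    for x in xs:
        h = np.tanh(W @ h + U @ x)
        hs.append(h)
    return 0.5 * np.sum((h - target) ** 2), hs

def bptt_grad(W):
    """dL/dW_hh: the chain rule run backwards through the unrolled graph,
    accumulating one contribution per time step into the shared matrix."""
    _, hs = forward(W)
    delta = hs[-1] - target             # dL/dh_T
    gW = np.zeros_like(W)
    for t in range(T, 0, -1):
        dpre = (1 - hs[t] ** 2) * delta # through tanh: D_t . dL/dh_t
        gW += np.outer(dpre, hs[t - 1]) # step t's contribution to dL/dW_hh
        delta = W.T @ dpre              # dL/dh_{t-1}
    return gW

# Finite-difference check on one entry of W_hh.
g = bptt_grad(W)
eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[1, 2] += eps
Wm[1, 2] -= eps
fd = (forward(Wp)[0] - forward(Wm)[0]) / (2 * eps)
```

The accumulation `gW += ...` is the tied-weights point made earlier: every application of the cell contributes a gradient term to the same matrix.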
Backpropagation requires the forward-pass activations to be stored — you need h_t for every t in order to compute ∂h_{t+1}/∂h_t. On a T-step sequence with hidden dimension d and batch size B, the activation memory is O(BTd). For a long document (T = 10,000), a wide model (d = 2048), and a batch size of 32, that is about 650 million floats — roughly 2.6 GB at fp32 — just for the hidden states, before weights or gradients are accounted for. This is why training RNNs on long sequences is memory-bound rather than compute-bound, and why the truncation trick below is universal in practice.
The practical fix is truncated BPTT (TBPTT): chop the sequence into chunks of length k (typical values 20–200), run the forward pass through the whole chunk, backpropagate only within the chunk, and carry the final hidden state forward into the next chunk's forward pass without gradient flow. The hidden state is stateful across chunks; the gradient is not. This caps memory at O(Bkd) and bounds the effective gradient horizon at k steps. The trade-off is that the model cannot learn dependencies longer than k, because no gradient signal ever reaches parameters through a path longer than that. A very long-range dependency — say, a pronoun on page 2 referring to a noun on page 1 — is literally invisible to truncated BPTT, which is part of why long-context language modelling was so hard in the RNN era.
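The TBPTT pattern, state carried forward but gradient recursion reset at each boundary, is easiest to see in a toy scalar linear RNN (the end-of-chunk squared loss and all constants below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
w = 0.9                          # scalar "recurrent weight"
xs = rng.normal(size=20)
k = 5                            # truncation length

h, chunk_grads = 0.0, []
for start in range(0, len(xs), k):
    dh_dw = 0.0                  # gradient recursion reset: no flow across the boundary
    for x in xs[start:start + k]:
        dh_dw = h + w * dh_dw    # dh_t/dw = h_{t-1} + w * dh_{t-1}/dw
        h = w * h + x            # the state itself IS carried across chunks
    # toy end-of-chunk loss 0.5 * h^2: its gradient sees at most k steps back
    chunk_grads.append(h * dh_dw)
```

Setting `dh_dw = 0.0` at each boundary is the scalar analogue of `h.detach()` in a PyTorch training loop: the forward value survives, the gradient path does not.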
An alternative to BPTT is real-time recurrent learning (RTRL, Williams and Zipser 1989), which propagates gradients forwards through time rather than backwards. It maintains, at each step, a running Jacobian ∂h_t/∂θ (a d × |θ| matrix, so O(d³) memory for a cell with O(d²) parameters), updated recursively alongside the hidden state. RTRL's memory cost is independent of T — attractive for streaming settings where you cannot store the full sequence — but it costs O(d⁴) time per step, which is catastrophically expensive for realistic hidden dimensions. RTRL survives in theoretical work and in low-dimensional control settings, but has never been practical for training the RNNs of this chapter. More recent work (Tallec and Ollivier 2017, UORO; Mujika et al. 2018, KF-RTRL) proposes unbiased low-rank approximations that trade a little variance for much lower cost.
The memory cost of full BPTT can be reduced by gradient checkpointing (Chen, Xu, Zhang, Guestrin 2016), which stores only every √T-th hidden state on the forward pass and recomputes the intermediate states during the backward pass. This cuts activation memory from O(T) to O(√T) at the cost of recomputing each segment once (roughly one extra forward pass, typically a 30–50% compute overhead). On long sequences it is the tool of last resort when truncation is unacceptable — e.g. for training a model that genuinely needs the full gradient signal for the whole sequence.
The reason a vanilla RNN cannot learn long-range dependencies is that the gradient signal carrying information about the loss has to flow backwards through the recurrence, and that recurrence multiplies the gradient by a Jacobian at every step. After many steps, the gradient has either vanished to zero — so no learning happens — or exploded to infinity, so training becomes unstable. Hochreiter named the problem in his 1991 diploma thesis; Bengio, Simard, and Frasconi proved in 1994 that it is fundamental to the architecture. Everything in the rest of this chapter is a response.
Consider the vanilla RNN h_t = σ(W h_{t−1} + U x_t). The Jacobian of the transition is ∂h_t/∂h_{t−1} = D_t · W, where D_t = diag(σ′(·)) is a diagonal matrix of derivatives. To propagate a gradient from step T to step t, you multiply T − t such Jacobians: ∂h_T/∂h_t = ∏_{i=t+1}^{T} D_i W. If the largest singular value of every D_i W stays below 1, the product shrinks exponentially in T − t — the gradient vanishes. If the Jacobians consistently stretch some direction by a factor above 1, the product grows exponentially along that direction — the gradient explodes. There is no middle ground: for a generic W, one of these two fates is inevitable once T is large.
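The dichotomy is easy to demonstrate numerically. The sketch below pushes a gradient vector backwards through 50 Jacobian-transpose products, with W taken as an orthogonal matrix times a scale factor and the diagonal entries of D_t drawn in (0.5, 1) as a stand-in for tanh derivatives; all of these choices are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
d, steps = 32, 50

def backprop_norm(scale):
    """Norm of a gradient vector after `steps` products with (D_t W)^T,
    where W = scale * (random orthogonal) and D_t mimics tanh derivatives."""
    W = scale * np.linalg.qr(rng.normal(size=(d, d)))[0]
    g = np.ones(d) / np.sqrt(d)          # unit-norm starting gradient
    for _ in range(steps):
        D = rng.uniform(0.5, 1.0, d)     # stand-in for diag(sigma'(.)) < 1
        g = W.T @ (D * g)                # one step of the backward chain rule
    return np.linalg.norm(g)

vanished = backprop_norm(0.9)   # effective per-step factor < 1: shrinks to ~0
exploded = backprop_norm(2.0)   # effective per-step factor > 1: blows up
```

With 50 steps the two regimes differ by more than ten orders of magnitude, which is why no learning-rate tuning can rescue a vanilla RNN on long horizons.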
The derivative of tanh is bounded in (0, 1], with a maximum of 1 at the origin. In a well-behaved hidden state most pre-activations are not near zero — if they were, the nonlinearity would be wasted — so σ′(·) is typically well below 1: tanh′(1) ≈ 0.42, tanh′(2) ≈ 0.07, and deep in saturation it approaches zero. This multiplies the Jacobian by a factor less than 1 at every step before the W multiplication, tilting the network towards vanishing. ReLU would not have this problem, but ReLU in an RNN introduces its own pathology: the recurrence is now unbounded, and the hidden state can blow up additively.
Pascanu, Mikolov, and Bengio's 2013 paper On the difficulty of training recurrent neural networks showed that exploding gradients can be controlled by gradient clipping: rescale the gradient's norm whenever it exceeds a threshold c, g ← c · g / ||g||, before taking the SGD step. This does not change the descent direction, only the step size, and it prevents the rare huge-gradient events that otherwise destabilise training. Gradient clipping has been ubiquitous in RNN training since that paper, and is still recommended for transformer training on long sequences. The threshold c is usually set to something like 1 or 5; too small and the optimiser never moves, too large and the clipping does nothing.
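Clipping by norm is a three-line operation. A minimal sketch (the threshold and gradient values are illustrative; this is the same rescaling that `torch.nn.utils.clip_grad_norm_` performs across a parameter list):

```python
import numpy as np

def clip_by_norm(grad, c):
    """g <- c * g / ||g|| whenever ||g|| > c; direction is unchanged."""
    norm = np.linalg.norm(grad)
    if norm > c:
        return grad * (c / norm)
    return grad

g = np.array([3.0, 4.0])             # norm 5: a rare huge-gradient event
clipped = clip_by_norm(g, 1.0)       # rescaled to norm 1, same direction
small = clip_by_norm(np.array([0.3, 0.4]), 1.0)  # norm 0.5: untouched
```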
Clipping does nothing for the vanishing case — you cannot un-shrink a gradient that has already been multiplied by a product of factors each less than 1. The information about what happened at step t literally does not reach the parameters being updated. Bengio et al.'s 1994 paper proved a stronger version: for the network to store information robustly against noise, its state must sit in a contracting region of the dynamics, and gradients flowing backwards through a contracting region necessarily vanish. Latching information and propagating gradient signal pull in opposite directions, so for the vanilla RNN, robust long-term storage and trainability are mutually exclusive.
The architectures of sections five and six — LSTM and GRU — sidestep the dilemma by providing an additive path for information through time that is not routed through the multiplicative recurrence. In LSTM's terms, the cell state c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t is updated additively with gated contributions, so its Jacobian with respect to c_{t−1} is f_t — a diagonal of gate values near 1 when information should flow unchanged. When f_t ≈ 1 over many steps, the gradient through the cell state does not shrink. When f_t is small, the cell deliberately forgets. The gate values are learned, so the network figures out for itself when to preserve and when to discard — a far better outcome than the vanilla RNN's fixed multiplicative shrinkage.
A parallel line of work attacks the vanishing-gradient problem at initialisation. Saxe, McClelland, and Ganguli's 2013 paper showed that orthogonal initialisations of W preserve singular values near 1 and allow much longer dependencies to be learned. Arjovsky, Shah, and Bengio's 2016 Unitary Evolution RNN forces W to stay unitary throughout training, guaranteeing no vanishing or exploding. The uRNN and its descendants (EURNN, Jing et al. 2017) work respectably but have not displaced LSTM in practice — the unitary constraint limits expressivity and the parameterisation is fiddly. They remain a reminder that architecture is one answer and initialisation is another.
Sepp Hochreiter and Jürgen Schmidhuber's 1997 paper Long Short-Term Memory introduced the gated cell that solved the vanishing-gradient problem in practice. For two decades it was the default sequence-modelling architecture, and for the sequence-to-sequence era (2014–2017) it was the workhorse of machine translation, speech recognition, and most successful industrial deployments. The LSTM cell is not beautiful — it has gates on top of gates, many matrix multiplies per step — but it works.
The LSTM separates the hidden state into two pieces: a cell state c_t that flows through time along a nearly-linear pathway with only elementwise gating, and a hidden state h_t that is a nonlinear readout of the cell state used for outputs and gate computations. The cell state is the gradient highway — it carries long-range information with multiplicative factors near 1 when the forget gate is open, so gradients flowing backwards through c_t do not vanish over many steps. The hidden state is the expressive summary that downstream layers and output heads actually read.
Three gates control the flow. The forget gate f_t = σ(W_f [h_{t−1}, x_t] + b_f) decides which entries of the cell state to keep (1) or erase (0). The input gate i_t = σ(W_i [h_{t−1}, x_t] + b_i) decides which entries of the new candidate content to write. The output gate o_t = σ(W_o [h_{t−1}, x_t] + b_o) decides which entries of the cell state to expose in the hidden state. The candidate content is g_t = tanh(W_g [h_{t−1}, x_t] + b_g), the new information that might be written. The update rules are c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t and h_t = o_t ⊙ tanh(c_t). All gate values are in (0, 1) via the sigmoid.
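The five equations translate directly into code. This is a minimal numpy sketch with arbitrary small dimensions and untrained random weights (a real implementation would fuse the four matrix multiplies into one):

```python
import numpy as np

rng = np.random.default_rng(4)
d, k = 6, 4

# One weight block per gate plus one for the candidate content,
# each acting on the concatenation [h_{t-1}, x_t].
W_f, W_i, W_o, W_g = (rng.normal(0, 0.3, (d, d + k)) for _ in range(4))
b_f, b_i, b_o, b_g = (np.zeros(d) for _ in range(4))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x):
    hx = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ hx + b_f)      # forget gate: what to keep in c
    i = sigmoid(W_i @ hx + b_i)      # input gate: what to write
    o = sigmoid(W_o @ hx + b_o)      # output gate: what to expose in h
    g = np.tanh(W_g @ hx + b_g)      # candidate content
    c = f * c_prev + i * g           # additive cell-state update
    h = o * np.tanh(c)               # nonlinear readout
    return h, c

h, c = np.zeros(d), np.zeros(d)
for x in rng.normal(size=(10, k)):
    h, c = lstm_step(h, c, x)
```

The line `c = f * c_prev + i * g` is the conveyor belt: the only thing standing between c_{t−1} and c_t is an elementwise multiplication by the forget gate.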
The Jacobian ∂c_t/∂c_{t−1} = diag(f_t). When the forget gate is open — f_t near 1 — the Jacobian is near the identity, and gradient signals flowing backwards through c are preserved almost intact across many steps. When the forget gate closes, the cell deliberately erases information and the gradient through that entry stops there, which is the correct behaviour. The learned nature of the gates means the network decides when to remember and when to forget, and gradient descent shapes those decisions by rewarding networks that preserve information relevant to the loss.
An LSTM cell with hidden dimension d and input dimension k has four weight blocks (three gates plus the candidate content), each with a (d + k) × d weight matrix and a bias — 4d(d + k + 1) parameters total. Compared to a vanilla RNN cell's d(d + k + 1), the LSTM is four times as expensive in parameters and compute per step. Most of the modelling power in practice comes from the extra capacity, not the gating structure per se — ungated RNNs of equivalent parameter count, when trainable, often approach LSTM performance. But the gating structure is what makes the LSTM trainable at all on the long-range-dependency tasks where the vanilla RNN fails.
The 1997 paper's original LSTM did not have a forget gate — the cell state just accumulated without a way to erase. Gers, Schmidhuber, and Cummins's 1999 paper added the forget gate, and this is the LSTM everyone means today. Peephole connections (Gers and Schmidhuber 2000) let the gates see the cell state directly, a modification with mixed empirical support. Greff, Srivastava, Koutník, Steunebrink, and Schmidhuber's 2017 paper LSTM: A Search Space Odyssey systematically ablated every component and found that the forget gate and the output activation are the components that matter most — the rest of the architecture is surprisingly robust to simplification. This motivated the GRU (section six), a simpler cell with similar empirical performance.
Jozefowicz, Zaremba, and Sutskever's 2015 paper An empirical exploration of recurrent network architectures made a small but important observation: initialising the forget-gate bias b_f to a positive value (typically 1) starts training with the forget gate open by default, so information flows through the cell state almost unchanged at the start. Without this, the forget gate starts at 0.5 (sigmoid of 0), and the cell state decays geometrically with factor 0.5 per step, erasing signals over short horizons before training has a chance to shape the gate. The forget-bias-1 initialisation is in every modern LSTM implementation.
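The size of the effect is a one-line calculation. Over an illustrative 20-step horizon, compare how much of a cell-state entry survives with a zero forget bias versus a bias of 1:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

steps = 20
decay_b0 = sigmoid(0.0) ** steps   # bias 0: gate 0.5, ~1e-6 of the signal left
decay_b1 = sigmoid(1.0) ** steps   # bias 1: gate ~0.73, ~2e-3 of the signal left
ratio = decay_b1 / decay_b0        # roughly three orders of magnitude more signal
```

Neither gate preserves everything, but the bias-1 start leaves enough gradient signal over tens of steps for training to discover which entries should stay open.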
Cho, van Merriënboer, Gulcehre, Bahdanau, Bougares, Schwenk, and Bengio's 2014 paper Learning phrase representations using RNN encoder–decoder for statistical machine translation introduced the Gated Recurrent Unit — a simplification of the LSTM with one fewer gate and no separate cell state. The GRU performs comparably to LSTM on most benchmarks with roughly a quarter fewer parameters and slightly less compute per step. For a long stretch of the mid-2010s, GRU vs LSTM was one of the recurring hyperparameter debates.
The GRU combines the LSTM's forget and input gates into a single update gate and merges the cell state and hidden state into one vector. The reset gate controls how much of the previous hidden state is used to compute the candidate content. The equations: r_t = σ(W_r [h_{t−1}, x_t]), z_t = σ(W_z [h_{t−1}, x_t]), h̃_t = tanh(W_h [r_t ⊙ h_{t−1}, x_t]), h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t. The update gate z_t interpolates between keeping the old state and writing the new candidate; the reset gate r_t lets the cell ignore irrelevant parts of the old state when computing the new candidate.
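The four equations in numpy form, again with illustrative dimensions and untrained random weights (biases omitted for brevity, matching the equations as written above):

```python
import numpy as np

rng = np.random.default_rng(5)
d, k = 6, 4
W_r, W_z, W_h = (rng.normal(0, 0.3, (d, d + k)) for _ in range(3))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x):
    hx = np.concatenate([h_prev, x])
    r = sigmoid(W_r @ hx)                                     # reset gate
    z = sigmoid(W_z @ hx)                                     # update gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x]))  # candidate
    return (1 - z) * h_prev + z * h_tilde                     # gated interpolation

h = np.zeros(d)
for x in rng.normal(size=(10, k)):
    h = gru_step(h, x)
```

The final line is the additive highway: the new state is a per-entry convex combination of the old state and the candidate, with the update gate choosing the mix.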
The GRU has three weight matrices vs the LSTM's four — a 25% saving in parameters and compute. Both handle long-range dependencies well, both use additive state updates (the (1 − z_t) ⊙ h_{t−1} term is the gradient highway), and both are trained with the same machinery. Chung, Gulcehre, Cho, and Bengio's 2014 paper Empirical evaluation of gated recurrent neural networks on sequence modelling found the two architectures roughly equivalent on polyphonic music and speech tasks, with GRU slightly ahead on smaller datasets and LSTM slightly ahead on larger ones. Jozefowicz et al. 2015 searched through thousands of gated architectures and concluded that variants within this family are mostly interchangeable: what matters is that you have additive state updates and at least one gate controlling what is preserved.
Modern practice in the narrow domain where GRUs and LSTMs still dominate — small sequence models, on-device audio, time-series forecasting — is roughly: GRU when compute or parameter count matters more than the last 0.5% of accuracy; LSTM when the task has very long dependencies or when you are matching existing literature. In machine translation and language modelling, both have been displaced by the transformer, so the question is now mostly historical. A notable exception is that ONNX, TFLite, and other on-device runtimes have heavily optimised LSTM kernels, which can make LSTMs faster to deploy in practice even though they are theoretically more expensive.
The space of gated recurrent cells is enormous. Highway networks (Srivastava, Greff, Schmidhuber 2015) applied gated shortcuts to feedforward nets — a forerunner of residual connections. Minimal gated units (Zhou, Wu, Zhang, Zhou 2016) reduce the GRU to a single gate. SRU (Simple Recurrent Unit, Lei, Zhang, Wang, Dai, Artzi 2018) parallelises most of the computation across time steps, approaching a CNN's throughput. IndRNN (Li, Li, Chen, Wu, Gao 2018) applies an elementwise-only recurrent matrix to avoid gradient coupling across neurons. None of these has displaced LSTM/GRU as the default gated architecture, but the family tree is wide, and the 2020s have seen renewed interest (section sixteen) from the state-space-model community.
A left-to-right RNN's hidden state at time t depends only on x_1, …, x_t — it has no access to the future. For tasks where the full sequence is available at training time and prediction time — part-of-speech tagging, named-entity recognition, parsing, protein-structure prediction — the future is informative and there is no reason not to use it. The bidirectional RNN runs two recurrences, one forward and one backward, and concatenates their hidden states at each position. Deep RNNs stack multiple recurrent layers vertically. The combination of the two is what most "RNN" experiments from the mid-2010s actually meant.
Schuster and Paliwal's 1997 paper Bidirectional recurrent neural networks proposed the architecture: run a forward RNN to produce →h_t summarising x_1, …, x_t, a backward RNN to produce ←h_t summarising x_t, …, x_T, and concatenate [→h_t ; ←h_t] as the per-position representation. Training is ordinary BPTT through both recurrences. BiRNNs substantially outperform unidirectional RNNs on tagging and parsing tasks, because disambiguating a word often depends on what comes after it — "bank" is a river bank or a financial bank depending on the rest of the sentence. The cost is that a BiRNN cannot be used for streaming prediction, because the backward pass requires the whole sequence.
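The forward/backward/concatenate recipe can be sketched directly; the plain tanh cells below are illustrative stand-ins for the LSTM or GRU cells a real BiRNN would use.

```python
import numpy as np

def run_birnn(xs, step_fwd, step_bwd, h0):
    """Run a forward and a backward recurrence over the same sequence
    and concatenate the two hidden states at each position."""
    fwd, h = [], h0
    for x in xs:                      # left to right: ->h_t
        h = step_fwd(h, x)
        fwd.append(h)
    bwd, h = [], h0
    for x in reversed(xs):            # right to left: <-h_t
        h = step_bwd(h, x)
        bwd.append(h)
    bwd.reverse()                     # realign so bwd[t] summarises x_t..x_T
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

# toy cells: plain tanh RNN steps, independent weights per direction
rng = np.random.default_rng(0)
H, X, T = 4, 3, 5
Wf, Wb = (0.1 * rng.standard_normal((H, H + X)) for _ in range(2))
step_f = lambda h, x: np.tanh(Wf @ np.concatenate([h, x]))
step_b = lambda h, x: np.tanh(Wb @ np.concatenate([h, x]))
reps = run_birnn(list(rng.standard_normal((T, X))), step_f, step_b, np.zeros(H))
```

Each per-position representation has dimension 2H — which is why the streaming limitation is structural: bwd[t] cannot be computed until the whole sequence has arrived.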
A deep RNN stacks L recurrent layers, where layer ℓ's cell takes as input the hidden state from layer ℓ − 1 at the same time step: h^{(ℓ)}_t = f(h^{(ℓ)}_{t−1}, h^{(ℓ−1)}_t). Each additional layer deepens the per-step computation — giving the network time to mix features at a single position — while the recurrence across time continues to carry temporal state. El Hihi and Bengio's 1996 paper Hierarchical recurrent neural networks for long-term dependencies and Graves's 2013 Generating sequences with recurrent neural networks established deep recurrent stacks as the default for strong language modelling. Typical depths were 2–4 for BiLSTMs, 4–8 for large LSTMs; going deeper ran into the same training difficulties that motivated residual connections in CNNs.
By analogy with ResNet in CNNs, residual connections between stacked RNN layers — h^{(ℓ)}_t = h^{(ℓ−1)}_t + RNN_ℓ(h^{(ℓ−1)}_t, h^{(ℓ)}_{t−1}) — allow much deeper recurrent stacks to be trained. Google's 2016 NMT system (section thirteen) used an 8-layer LSTM with residual connections in both encoder and decoder, which would have been untrainable without the skip connections. The same pattern — deeper is better, but only with skip connections — generalises across architectures and is one of the durable lessons of the deep-learning era.
Applying dropout naively to an RNN — randomly dropping hidden units at every time step — destroys the recurrent signal, because the dropped dimensions break the continuity of the hidden state across time. Zaremba, Sutskever, and Vinyals's 2014 paper Recurrent neural network regularization proposed applying dropout only on the non-recurrent connections (input and output, not the recurrence itself). Gal and Ghahramani's 2016 A theoretically grounded application of dropout in recurrent neural networks proposed variational dropout: use the same dropout mask at every time step, so the recurrent signal's consistency is preserved while the regularisation is maintained. Variational dropout is in every modern LSTM/GRU implementation.
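The variational-dropout idea is small enough to show directly: sample one mask per sequence and reuse it at every time step (a sketch; the inverted-dropout scaling by 1/(1−p) keeps expected activations unchanged).

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, p = 10, 6, 0.5
mask = (rng.random(H) > p) / (1 - p)     # sampled ONCE for the whole sequence

hidden_states = rng.standard_normal((T, H))
dropped = hidden_states * mask           # broadcast: same mask at every step
```

Under naive dropout a hidden dimension flickers on and off across time, destroying the recurrent signal; here a dropped dimension is zero at every step of the sequence, so whatever survives the mask flows through the recurrence uninterrupted.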
Batch normalisation works poorly in RNNs because activation statistics differ at every time step — a batch of T-step sequences has T distinct activation distributions, so BN must either keep separate running statistics per position (which breaks for variable-length sequences) or pool statistics across positions that are not identically distributed. Ba, Kiros, and Hinton's 2016 paper Layer normalization proposed normalising across the features of a single example rather than across the batch, which removes the batch-size dependence and works cleanly with the recurrent structure. Every production RNN of the late 2010s used LayerNorm instead of BatchNorm; the transformer inherited this choice.
One of the most striking demonstrations of what RNNs could do was Andrej Karpathy's 2015 blog post The Unreasonable Effectiveness of Recurrent Neural Networks, which trained a multi-layer LSTM on a character-level language-modelling objective — predict the next character given the previous ones — and got back a network that could generate fake Shakespeare, fake C code, fake Wikipedia, and fake mathematics with surprising fluency. The char-RNN was a proof of concept for the next decade of generative language modelling.
Tokenise a text corpus as a sequence of characters — typically 70–256 unique symbols, depending on whether you include Unicode, whitespace handling, and case normalisation. Train an LSTM to predict the next character given the previous ones, with a cross-entropy loss summed across every position. At inference, sample one character at a time from the network's output distribution, feed it back as the next input, and continue until a length limit or stop token. The whole apparatus can be trained on a few megabytes of text overnight on a single GPU, which made the char-RNN the first generative language model that most practitioners actually ran rather than read about.
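The inference loop — sample one symbol, feed it back, repeat — can be sketched end to end with a stand-in model. Here an add-one-smoothed character-bigram table plays the role of the trained LSTM's next-character distribution; this is a deliberate simplification, since the loop itself is the point.

```python
import numpy as np

# Toy stand-in for a trained char-LM: smoothed bigram counts.
corpus = "hello world hello world "
vocab = sorted(set(corpus))
idx = {c: i for i, c in enumerate(vocab)}
counts = np.ones((len(vocab), len(vocab)))       # add-one smoothing
for a, b in zip(corpus, corpus[1:]):
    counts[idx[a], idx[b]] += 1

def next_char_probs(c):
    row = counts[idx[c]]
    return row / row.sum()

# The char-RNN sampling loop: draw, feed back, stop at a length limit.
rng = np.random.default_rng(0)
out = ["h"]
for _ in range(20):
    p = next_char_probs(out[-1])
    out.append(vocab[rng.choice(len(vocab), p=p)])
generated = "".join(out)
```

Swapping the bigram table for an LSTM forward pass (and the seed character for a start token) gives exactly Karpathy's generation procedure.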
Karpathy's visualisations showed that individual cells in the trained char-RNN's hidden state track interpretable features: one cell tracks whether the current position is inside quotation marks, another tracks how many tab characters ago the last newline was (indentation depth in code), another tracks whether a Markdown header is being written. These learned features are not supervised — they emerge from the next-character prediction task — and they suggest that RNNs discover interpretable structure when the task rewards it. The discovery was one of the earliest examples of what would become a major research theme (mechanistic interpretability, circuits, superposition).
Four years before Karpathy's blog post, Ilya Sutskever, James Martens, and Geoffrey Hinton had already trained a character-level multiplicative-RNN language model in their 2011 paper Generating text with recurrent neural networks. Their model produced fake Wikipedia articles that were coherent at the sentence level and occasionally at the paragraph level — a strong result for the era. The paper is historically underappreciated; it contained the seeds of sequence-to-sequence, of large-scale LM training, and of the sampling-based evaluation methodology that dominated the 2010s. Sutskever carried many of these ideas with him into the OpenAI era a few years later.
Parallel to the char-RNN line, the word-level language-modelling benchmark (Penn Treebank, later Wikitext-2 and Wikitext-103) became the standard for comparing sequence models. Mikolov et al.'s 2010 paper Recurrent neural network based language model set the early baseline; Zaremba–Sutskever–Vinyals 2014 pushed perplexity on PTB below 80; Merity, Keskar, and Socher's 2017 AWD-LSTM (ASGD weight-dropped LSTM) set the final RNN-era state of the art with perplexity around 60. Perplexity — the exponential of average negative log-likelihood per token — was the metric everyone tracked, and it was the metric the transformer eventually demolished.
Char-RNNs produce locally coherent text — correct spelling, plausible syntax within a few tokens — but rapidly lose coherence over longer spans. The 2015 char-RNN's "fake Shakespeare" is recognisably Shakespearean for a few lines and then drifts into gibberish. This is the practical face of the long-range-dependency problem: even an LSTM with forget gates cannot maintain a coherent thread of meaning for hundreds or thousands of tokens. The transformer's ability to do this — via attention that directly connects distant positions — is a large part of why it replaced the RNN as the default language-model architecture.
In 2014 two papers — Sutskever, Vinyals, and Le's Sequence to Sequence Learning with Neural Networks and Cho et al.'s Learning phrase representations using RNN encoder–decoder — proposed the architecture that would dominate machine translation for the next three years and would shape every generative language model that came after. The idea is stunningly simple: use one RNN to read the input sequence and compress it into a single vector, and use a second RNN to produce the output sequence conditioned on that vector.
The encoder is an RNN (typically an LSTM) that reads the source sequence x_1, …, x_n and produces a final hidden state c = h_n. This context vector is supposed to summarise the entire source. The decoder is a second RNN that starts from c as its initial hidden state and generates the output sequence y_1, …, y_m one token at a time, with each y_t conditioned on the decoder's previous tokens and the context. A start-of-sequence token begins decoding; an end-of-sequence token terminates it. Both encoder and decoder are trained end-to-end to maximise the log-likelihood of the target sequence given the source.
Before seq2seq, machine translation was done with statistical machine translation (SMT) pipelines — phrase tables, language models, reordering models, a log-linear combination, a beam-search decoder. Each component was trained separately and tuned jointly, and the pipeline had hundreds of hyperparameters and several specialist PhDs worth of feature engineering. Seq2seq replaced the entire pipeline with a single neural network and a single loss function. For a while the neural model was not quite competitive; within two or three years it was winning every benchmark by wide margins and every production MT system had switched. The same framing then generalised: summarisation, dialogue, question answering, code generation — all tasks that can be expressed as "read one sequence, produce another" — became seq2seq tasks.
Sutskever, Vinyals, and Le made a small but crucial empirical observation: if you reverse the source sequence before encoding it, translation quality goes up substantially. Their explanation is that the final encoder state c then contains the beginning of the source in recent memory, which is where the decoder starts generating — the first target tokens correspond to the first source tokens, so having the first source tokens freshly in the context vector helps. This trick is a very concrete illustration of the fixed-context-vector bottleneck: the information is there in some sense, but its position in the encoder state matters, because long-range dependencies (the end of the source influencing the first target token) are not reliably preserved.
Compressing a sentence of arbitrary length into a single fixed-size vector is the obvious architectural flaw of vanilla seq2seq. For short sentences it works; for long sentences the BLEU score collapses — papers from the era routinely showed performance degrading past about 30 source words. The fundamental issue is information-theoretic: a d-dimensional vector has finite capacity, and longer sentences have more to remember. Bahdanau, Cho, and Bengio's 2014 Neural Machine Translation by Jointly Learning to Align and Translate solved the problem by introducing attention — the decoder, at each step, reads not a single context vector but a weighted combination of all encoder states — which is the subject of the next section and the ancestor of the transformer's self-attention.
After the MT success, seq2seq was tried on everything. Vinyals and Le's 2015 A neural conversational model trained a seq2seq on movie dialogue and demonstrated a chatbot that could hold coherent conversations — a result that seemed uncanny at the time and quaint now. Vinyals, Bengio, Kudlur's 2015 Order matters: sequence to sequence for sets extended the framework to set inputs. Vinyals, Fortunato, Jaitly's 2015 Pointer networks used the attention weights themselves as outputs, solving combinatorial problems like convex hull and TSP. Pointer networks foreshadowed copy mechanisms in summarisation and, more distantly, the output heads of modern QA systems. The seq2seq frame was a generator of research directions for a full two years.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio's 2014 paper Neural Machine Translation by Jointly Learning to Align and Translate is the most important sequence-modelling paper of the 2010s. It introduced attention — the mechanism that replaces the fixed-length context vector with a learned weighted combination over all encoder states — and in doing so created the architectural primitive that would, three years later, displace recurrence entirely in the transformer. Attention was invented to patch a specific flaw in seq2seq; it turned out to be general enough to replace the whole framework.
At each decoder step t, compute an alignment score e_{t,i} between the current decoder state s_t and each encoder state h_i, for i = 1, …, n. Normalise these with a softmax to get attention weights α_{t,i} = softmax(e_{t,·})_i. The decoder's context for this step is a weighted sum c_t = ∑_i α_{t,i} h_i, and the next-token prediction uses both s_t and c_t. Now the decoder can "look at" different parts of the source at different output steps — attending to the subject when generating the target verb, attending to the object when generating the target object, and so on — and the fixed-length-vector bottleneck is gone.
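One decoder step of this mechanism is a few lines of numpy. For brevity this sketch uses the simplest (dot-product) form of the alignment score rather than Bahdanau's learned MLP; the shapes and random values are illustrative.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def attend(s_t, enc_states):
    """One attention step. s_t: decoder state, shape (d,);
    enc_states: encoder states h_1..h_n stacked, shape (n, d)."""
    scores = enc_states @ s_t        # e_{t,i}, here a plain dot product
    alpha = softmax(scores)          # attention weights alpha_{t,i}
    c_t = alpha @ enc_states         # context c_t = sum_i alpha_{t,i} h_i
    return c_t, alpha

rng = np.random.default_rng(0)
H = rng.standard_normal((6, 8))      # n = 6 source positions, d = 8
s = rng.standard_normal(8)
c_t, alpha = attend(s, H)
```

The weights alpha always sum to one, so c_t lives in the convex hull of the encoder states — a different mixture at every decoder step, which is exactly what the fixed vector c could not provide.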
Bahdanau's 2014 formulation used an additive alignment score e_{t,i} = v^⊤ tanh(W_s s_{t−1} + W_h h_i), which is a small MLP applied to each (decoder-state, encoder-state) pair. Luong, Pham, and Manning's 2015 Effective approaches to attention-based neural machine translation proposed simpler alternatives — dot-product attention e_{t,i} = s_t^⊤ h_i and general attention e_{t,i} = s_t^⊤ W h_i — and showed they work comparably at lower compute. The transformer's scaled dot-product attention is a direct descendant of Luong's dot-product form, with the scaling factor 1/√d added to keep the softmax well-behaved in high dimensions.
One of the striking results in Bahdanau et al.'s paper was the visualisation of attention weights. For an English-French translation, the α matrix showed that when the decoder produced the French verb, it attended heavily to the English verb; when it produced a French preposition, it attended to the corresponding English preposition; when word order was different between the languages, the attention diagonal bent appropriately. No alignment signal had been given to the network during training — only the source-target pairs and a cross-entropy loss — and yet alignment emerged as a side effect of the model learning to translate. Attention visualisations became a standard diagnostic tool, and the interpretability they offered was one of the main selling points of the architecture.
Bahdanau's attention is soft: every encoder state contributes a fraction, and the whole mechanism is differentiable. Hard attention, in which the decoder picks a single encoder state stochastically, was explored in Xu, Ba, Kiros, Cho, Courville, Salakhutdinov, Zemel, and Bengio's 2015 Show, attend and tell paper on image captioning. Hard attention is more interpretable and potentially more efficient, but requires REINFORCE-style policy gradients (not direct backprop) to train, which converge more slowly and with higher variance. Soft attention's dominance was a choice to prioritise end-to-end trainability over theoretical purity.
In Bahdanau's seq2seq-with-attention, the attention is cross-attention: the decoder queries the encoder states, not its own past. Self-attention — where every position attends to every other position in the same sequence — was a natural extension, first applied in Cheng, Dong, and Lapata's 2016 Long short-term memory-networks for machine reading and in Parikh, Täckström, Das, and Uszkoreit's 2016 A decomposable attention model for natural language inference. Both papers used attention in addition to, not instead of, recurrence. Lin et al.'s 2017 A structured self-attentive sentence embedding pushed further. The full step — drop the recurrence, use only self-attention and cross-attention, add positional encodings — was taken by Vaswani et al.'s 2017 Attention is all you need (section seventeen), which ended the seq2seq era overnight.
A seq2seq decoder is trained to predict the next token given the previous ones. At training time, the "previous ones" are the ground-truth target tokens; at inference time, they are the model's own previous predictions. This mismatch — training on true tokens, testing on predicted tokens — is called exposure bias, and it is a subtle but pervasive source of degradation in sequence models.
During training, the decoder's input at step t is the true target token y_{t−1}^*, not the decoder's own previous prediction ŷ_{t−1}. This is called teacher forcing (Williams and Zipser 1989), and it has two virtues: (i) it stabilises training, because errors do not compound — if the decoder predicts a wrong token at step 3, steps 4–T are still conditioned on the correct token at step 3. (ii) it parallelises across time, because all decoder steps can be computed independently given the ground-truth sequence — you do not need to wait for the sampling of ŷ_{t−1} to compute the step-t forward pass. Teacher forcing is how every seq2seq paper from 2014–2020 trained its models.
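The teacher-forced training objective can be sketched concretely. The toy tanh decoder below is a stand-in for an LSTM cell, and all names and dimensions are illustrative; the one essential line is the last one in the loop, where the ground-truth token — not the model's own prediction — is fed back.

```python
import numpy as np

V = H = 5                 # toy vocab size = hidden size, for simplicity
BOS = 0
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((H, H))

def decode_step(h, prev_tok):
    """Toy decoder step: consume the previous token, emit log-probs."""
    h = np.tanh(W @ h + np.eye(V)[prev_tok])
    logp = h - np.log(np.exp(h).sum())        # log-softmax over the vocab
    return h, logp

def teacher_forced_nll(targets):
    h, prev, nll = np.zeros(H), BOS, 0.0
    for y in targets:
        h, logp = decode_step(h, prev)
        nll -= logp[y]    # cross-entropy at this position
        prev = y          # ground-truth token fed back, NOT argmax(logp)
    return nll

loss = teacher_forced_nll([2, 3, 1, 4])
```

Because every input to the loop is known in advance, nothing in one step waits on a sample from the previous step — which is the parallelism property (ii) above exploited at scale.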
At inference time, the decoder consumes its own predictions. If it makes an error at step 3, step 4 sees an input it never saw during training — a wrong token in context — and the decoder's behaviour on that distribution is undefined. Errors compound: a small mistake at step 3 can produce a different step-4 prediction, which produces a different step-5 prediction, and so on. The model is being evaluated on a distribution its training never covered. This is exposure bias, and it is a specific instance of the general train/test distribution shift pathology in sequence modelling.
Bengio, Vinyals, Jaitly, and Shazeer's 2015 Scheduled sampling for sequence prediction with recurrent neural networks proposed a fix: during training, randomly use the model's own previous prediction ŷ_{t−1} instead of the ground truth y_{t−1}^* with some probability p, annealed from 0 at the start of training to some positive value later. This exposes the model to its own errors during training, closing the distribution gap. Scheduled sampling works in practice but introduces a subtle optimisation issue — the training loss is no longer a proper log-likelihood, because the conditioning is now a mix of true and predicted tokens — and Huszár 2015 showed it can push the model towards incoherent modes. The method was popular for a few years and then largely abandoned for NMT in favour of better training stability.
Lamb, Goyal, Zhang, Zhang, Courville, and Bengio's 2016 Professor forcing: a new algorithm for training recurrent networks proposed an adversarial alternative: train a discriminator to distinguish the decoder's hidden states under teacher forcing from its hidden states under free sampling, and train the decoder to fool the discriminator. The decoder thus learns to have the same internal dynamics in both regimes, which eliminates exposure bias without breaking teacher-forced training. The method is more stable than scheduled sampling but adds significant complexity.
Shen, Cheng, He, He, Liu, Sun, Liu's 2016 Minimum risk training for neural machine translation proposed training against a sequence-level metric (BLEU) using REINFORCE-style policy gradients, with samples drawn from the model's own distribution. This directly optimises what we care about at inference and avoids exposure bias by construction. Ranzato, Chopra, Auli, and Zaremba's 2016 MIXER paper (Sequence level training with recurrent neural networks) mixed ML training and sequence-level RL in a schedule. These methods squeeze out another 1–2 BLEU on top of teacher forcing, at the cost of more complex training. They are mostly historical now — large-scale pretraining plus reinforcement-learning-from-human-feedback (RLHF) in the LLM era is the modern descendant of this idea.
In the 2020s, exposure bias largely disappeared as a central concern — not because it was solved, but because the scale of training data and the quality of the resulting models made its effects negligible relative to other error sources. Very large language models trained with plain teacher forcing on trillions of tokens seem to cope with their own outputs well enough that the old exposure-bias concerns feel irrelevant. Whether this is because scale masks the problem or because it has somehow been dissolved is an open question; the historical framing remains important because it shaped two decades of research on sequence training.
A sequence model defines a distribution p(y | x). At inference time, you want to produce some y — either the most likely, or a sample from the distribution, or something in between that balances quality and diversity. The problem is that finding the exact argmax is intractable for any non-trivial sequence model: the search space is |V|^T, exponentially large in the output length. Every sequence model of the last decade uses some approximate decoding algorithm, and the choice of algorithm shapes the output distribution in sometimes surprising ways.
The simplest strategy: at each step, output the argmax of the model's distribution ŷ_t = argmax p(y_t | y_{<t}, x). Greedy is fast (one forward pass per step), deterministic, and often works well enough for strongly peaked distributions. Its failure mode is local: a greedy choice at step t that looks slightly better than the alternative may lead to a dead end at step t + 3, where the locally worse choice would have led to a much better continuation. Because greedy never looks ahead, it cannot escape these traps.
Beam search keeps the k best partial sequences at every step, where k is the beam width. At step t + 1, each of the k beams is extended by every possible next token, producing k · |V| candidates; the top k by accumulated log-probability are kept. Typical beam widths are 4–10 for machine translation and 1–5 for language generation. Beam search is O(kT|V|) — more expensive than greedy but much cheaper than exact search, and it catches most of the short-range errors that greedy commits. For deterministic tasks (translation, summarisation, code generation) it is the default.
A naive beam search has a well-documented bias: it prefers short sequences, because every extra token multiplies probability by a factor less than 1, so longer sequences have lower joint probability almost by definition. The fix is length normalisation (or "length penalty"): divide the log-probability by T^α for some α in [0, 1], tuned on a validation set. Wu et al.'s 2016 GNMT paper used a more elaborate coverage penalty that also penalises beams that ignore parts of the source. Without length normalisation, beam search consistently produces translations shorter than the reference.
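Beam search with length normalisation fits in a short function. This is a sketch — real implementations batch the forward passes and cache decoder state — and the toy scoring model below is purely illustrative (a fixed distribution that makes the end-of-sequence token likely after two steps).

```python
import math

def beam_search(step_logprobs, k=3, max_len=5, eos=0, alpha=0.6):
    """step_logprobs(prefix) -> log p(token | prefix) for every vocab token.
    Keeps the k best unfinished prefixes; finished hypotheses are ranked
    by the length-normalised score log p / len**alpha."""
    beams = [((), 0.0)]                      # (prefix, accumulated log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, lp in beams:
            for tok, tok_lp in enumerate(step_logprobs(prefix)):
                candidates.append((prefix + (tok,), lp + tok_lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, lp in candidates:
            if prefix[-1] == eos:            # hypothesis finished: normalise
                finished.append((prefix, lp / len(prefix) ** alpha))
            elif len(beams) < k:             # keep the k best live beams
                beams.append((prefix, lp))
        if not beams:
            break
    # normalise any still-unfinished beams the same way
    finished += [(p, lp / len(p) ** alpha) for p, lp in beams]
    return max(finished, key=lambda c: c[1])[0]

# toy model: token 2 is likely early, then EOS (token 0) becomes likely
def toy_model(prefix):
    probs = [0.1, 0.2, 0.7] if len(prefix) < 2 else [0.8, 0.1, 0.1]
    return [math.log(q) for q in probs]

best = beam_search(toy_model)
```

Setting alpha = 0 recovers the naive, short-sequence-biased search; alpha near 1 divides by full length.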
For open-ended generation — creative writing, dialogue, code where there are many plausible continuations — maximum-likelihood decoding produces dull, repetitive, safe output. Holtzman, Buys, Du, Forbes, and Choi's 2019 The curious case of neural text degeneration documented this clearly: beam search on a large language model produces repetitive, degenerate output; sampling from the model's distribution produces more varied output at the cost of occasional errors. Two popular sampling strategies: top-k sampling (Fan, Lewis, Dauphin 2018) restricts the next-token distribution to the top k tokens and renormalises; nucleus (top-p) sampling (Holtzman et al. 2019) restricts to the smallest set of tokens whose cumulative probability exceeds p, which adapts the cut-off to the entropy of the distribution. Nucleus with p = 0.9 or p = 0.95 is the standard for creative generation.
A softmax with temperature τ replaces softmax(z) with softmax(z / τ). τ < 1 sharpens the distribution (more deterministic), τ > 1 flattens it (more random), τ → 0 recovers the argmax (greedy), and τ → ∞ approaches uniform sampling. Temperature is orthogonal to top-k and top-p — you can combine them (sample from top-p with temperature). Modern language-model APIs expose temperature, top-p, and sometimes top-k as user-tunable parameters, because the right combination depends on the application: τ near 0 for factual QA, τ around 0.7–1.0 and p around 0.9 for creative writing.
Vijayakumar, Cogswell, Selvaraju, Sun, Lee, Crandall, Batra's 2016 Diverse beam search modifies beam search to produce multiple distinct outputs by penalising similarity between beams. Constrained decoding (Post and Vilar 2018) forces the output to contain or exclude specific tokens or substrings. Speculative decoding (Leviathan, Kalman, Matias 2023) uses a small "draft" model to propose token blocks that a larger "verifier" model accepts or rejects in parallel, accelerating inference 2–3× without changing the distribution. Decoding, despite being the "obvious" part of sequence modelling, has remained a live research area.
Machine translation was the flagship application that drove seq2seq research in 2014–2017 and the field where the benefits of the new architecture were most dramatic. The old phrase-based statistical machine-translation systems, painstakingly built over two decades and responsible for every commercial translation product at the start of that period, were completely replaced by neural models within three years. The story is the clearest example of a deep-learning method sweeping an entire sub-field.
Statistical machine translation (Brown et al. 1990, Koehn 2010) decomposed translation into separate components: a phrase table (learned alignments of source phrases to target phrases), a language model (fluency of the target), a reordering model, and a log-linear combination of these and other features tuned on held-out data. The pipeline had hundreds of hyperparameters, required millions of aligned sentence pairs to train, and depended on linguistic preprocessing (tokenisation, morphological analysis, parsing) that varied per language pair. Moses (Koehn et al. 2007) was the dominant open-source toolkit; Google Translate in 2014 ran phrase-based SMT. Early NMT systems (Bahdanau et al. 2014; Sutskever et al. 2014) were roughly competitive with SMT on high-resource language pairs; by 2016 they were winning by 3–5 BLEU across the board.
Wu, Schuster, Chen, Le, Norouzi, Macherey, Krikun, Cao, Gao, Macherey, Klingner, Shah, Johnson, Liu, Kaiser, Gouws, Kato, Kudo, Kazawa, Stevens, Kurian, Patil, Wang, Young, Smith, Riesa, Rudnick, Vinyals, Corrado, Hughes, Dean's 2016 paper Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation was the production NMT system that replaced phrase-based SMT inside Google Translate. It was a deep stacked LSTM — eight encoder layers, eight decoder layers, residual connections between layers, Bahdanau attention between encoder and decoder, wordpiece tokenisation for handling rare words, and a large multi-GPU training infrastructure. In human side-by-side evaluations the system reduced translation errors by an average of 60% relative to the phrase-based production system, with particularly dramatic gains on morphologically rich languages (Finnish, Turkish, German). The paper is a model of how to describe a production deep-learning system and remains one of the most read industry papers of the decade.
A key practical innovation was the handling of out-of-vocabulary words. A closed vocabulary of, say, 50,000 words misses proper names, typos, rare technical terms, and compounds — which SMT handled by memorising phrases but for which NMT had no good answer. Sennrich, Haddow, and Birch's 2016 Neural machine translation of rare words with subword units applied byte-pair encoding (BPE) to build a subword vocabulary, merging common character bigrams into tokens iteratively until the vocabulary reaches a target size (typically 30,000–50,000). Unknown words are tokenised into their constituent subwords, so "unpresidented" becomes "un@@ president@@ ed" and the model can still represent it. BPE (and its descendants SentencePiece, WordPiece) became universal in sequence modelling, and is how modern LLMs still tokenise input.
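The merge-learning loop of BPE fits in a short function. This is a sketch: the toy word-frequency table is illustrative, and Sennrich et al. additionally use end-of-word markers and corpus-scale counts, omitted here for brevity.

```python
from collections import Counter

def bpe_merges(word_freqs, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent
    symbol pair, rewriting the vocabulary after each merge."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for sym, f in vocab.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += f                 # weight pair by word freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)           # most frequent pair
        merges.append(best)
        new_vocab = {}
        for sym, f in vocab.items():               # apply the merge everywhere
            out, i = [], 0
            while i < len(sym):
                if i + 1 < len(sym) and (sym[i], sym[i + 1]) == best:
                    out.append(sym[i] + sym[i + 1]); i += 2
                else:
                    out.append(sym[i]); i += 1
            new_vocab[tuple(out)] = f
        vocab = new_vocab
    return merges, vocab

word_freqs = {"low": 5, "lower": 2, "lowest": 3, "newer": 4}
merges, vocab = bpe_merges(word_freqs, 4)
```

On this toy table the first merges are ("l","o") then ("lo","w"), so "low" becomes a single token — exactly the frequency-driven compression that lets rare words fall back to smaller subwords.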
Sennrich, Haddow, and Birch's 2016 Improving neural machine translation models with monolingual data proposed back-translation: take monolingual target-language data, translate it into the source language with a preliminary model, and add the synthetic (source, target) pairs to training. This effectively expands the training set several-fold at the cost of some noise, and is one of the most reliably effective NMT techniques. The idea has been picked up in other domains (back-translation of code, speech-to-text with synthetic audio) and remains a standard data-augmentation tool for seq2seq problems.
Johnson, Schuster, Le, Krikun, Wu, Chen, Thorat, Viégas, Wattenberg, Corrado, Hughes, Dean's 2017 Google's multilingual neural machine translation system: enabling zero-shot translation trained a single NMT model on many language pairs with a "target language" token prepended to the source. The surprise was that the model could translate between pairs it had never seen — zero-shot translation — by interpolating across languages it had seen. This foreshadowed the multilingual capabilities of later large language models, which exhibit strong cross-lingual transfer without explicit multilingual training.
By 2018 every serious NMT system had switched from LSTM-based seq2seq to transformer-based seq2seq. The transformer's training parallelism made much larger models practical, its attention mechanism scaled better with sentence length, and its stacked self-attention improved quality on every language pair. GNMT's 8-layer LSTM was a high-water mark for recurrent NMT; every system since has been some variant of the encoder–decoder transformer. The ideas that survived from the RNN era — subword tokenisation, back-translation, length penalty, beam search, the basic encoder–decoder structure — were all carried into the transformer era largely unchanged.
Speech recognition — mapping a variable-length audio signal to a variable-length sequence of characters or words — is the other major application that drove RNN development. Unlike translation, the input and output are unaligned in a specific way: an audio frame does not correspond to any single character; a character can span many frames or none. Alex Graves's 2006 Connectionist Temporal Classification loss is the mathematical trick that makes end-to-end RNN training feasible for this setting, and the DeepSpeech lineage is the industrial application of it.
Classical hidden-Markov-model speech recognisers handled the alignment problem by explicitly modelling phoneme states and decoding with the Viterbi algorithm. For a neural network to produce a character sequence from an audio sequence, it has to somehow emit characters at the "right" time without being told when that is. One naive approach: run the RNN frame-by-frame and have it emit one character per frame. This fails because the audio frame rate (100/s) is much higher than the character rate (5–10/s) and there is no sensible way to assign each frame a target character during training.
Graves, Fernández, Gomez, and Schmidhuber's 2006 Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks proposed a clean solution. Extend the output alphabet with a blank symbol. At each frame, the RNN emits a distribution over {characters, blank}. A frame-level output sequence like "h h h _ e _ l l _ l _ o _ _" collapses to the target "hello" by the rules (i) merge adjacent duplicates and then (ii) remove blanks. Many frame-level sequences collapse to the same target, and the CTC loss marginalises over all of them using a forward–backward dynamic-programming algorithm — it computes the total probability the RNN assigns to any frame sequence that collapses to the target, and the loss is the negative log of that. Backprop through the CTC loss is a standard chain-rule extension. No alignment supervision is needed.
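The collapse function itself is a few lines (the full CTC loss needs the forward–backward recursion, which this sketch deliberately omits):

```python
def ctc_collapse(frames, blank="_"):
    """Collapse a frame-level CTC output: merge adjacent duplicates first,
    then drop blanks. The order matters — removing blanks first would
    wrongly merge genuine double letters ("hello" -> "helo")."""
    out, prev = [], None
    for f in frames:
        if f != prev and f != blank:   # new symbol that isn't a blank
            out.append(f)
        prev = f
    return "".join(out)

assert ctc_collapse("hhh_e_ll_l_o__") == "hello"
assert ctc_collapse("a_a") == "aa"     # a blank separates a genuine repeat
assert ctc_collapse("aa") == "a"       # adjacent duplicates merge
```

The blank symbol thus does double duty: it lets the network emit "nothing" on frames between characters, and it is the only way to express a genuinely doubled letter.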
Hannun, Case, Casper, Catanzaro, Diamos, Elsen, Prenger, Satheesh, Sengupta, Coates, Ng's 2014 Deep Speech: scaling up end-to-end speech recognition at Baidu was the production application of CTC with a deep RNN. A stack of bidirectional LSTMs (later GRUs) read mel-spectrogram frames and emitted character distributions at each frame; the CTC loss trained it end-to-end on thousands of hours of audio. DeepSpeech 2 (Amodei et al. 2016) pushed to English and Mandarin with 11,000 hours of training data. The system did not need explicit phoneme dictionaries, hand-crafted language models (though an external LM helped), or hand-aligned audio. The simplicity was striking.
Chan, Jaitly, Le, and Vinyals's 2016 Listen, attend and spell replaced CTC with attention-based seq2seq for speech recognition. The encoder processes the audio at multiple resolutions and the decoder attends over encoder states while generating the character sequence. LAS has advantages (joint modelling of acoustics and language, no independence assumption between output tokens) and disadvantages (not streaming, exposure bias from teacher-forced training, more data-hungry). The LAS vs CTC debate continued for years, and modern production ASR systems typically combine elements of both — hybrid CTC/attention losses, transducer architectures (Graves 2012), or joint CTC–attention decoding.
Graves's 2012 Sequence transduction with recurrent neural networks proposed the RNN transducer (RNN-T), which combines a CTC-style frame-level encoder with a language-model-style output decoder that predicts each output token conditioned on previous outputs. RNN-T allows streaming decoding (unlike LAS) and captures output-token dependencies (unlike pure CTC). It is the architecture underlying most production on-device speech-recognition systems in the late 2010s and early 2020s — Google's, Apple's, and Amazon's mobile speech recognisers have all been versions of RNN-T.
For most of the 2010s, "sequence model" meant "RNN". Bai, Kolter, and Koltun's 2018 paper An empirical evaluation of generic convolutional and recurrent networks for sequence modeling made a quiet but striking contribution: causal dilated convolutions — i.e. 1D CNNs — match or beat LSTMs on every sequence benchmark they tested, at comparable or lower compute. This was not the claim that convolutional networks are generally better, but the narrower observation that the recurrence itself is not the essential ingredient.
A Temporal Convolutional Network stacks 1D convolutions along the time axis with two key constraints. (i) Causal: the convolution at position t sees only positions ≤ t, never the future. This is enforced by zero-padding the left edge of each layer's input. (ii) Dilated: successive layers use exponentially growing dilation factors (1, 2, 4, 8, …), so the receptive field grows exponentially in depth. A 10-layer stack with kernel size 3, one convolution per layer, and dilations 1, 2, …, 512 reaches a receptive field of 1 + 2·(2^{10} − 1) = 2047 time steps (Bai et al.'s residual blocks use two convolutions each, roughly doubling this) — comparable to what an LSTM could in principle carry but cannot reliably learn.
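The causal constraint is worth seeing concretely. A minimal NumPy sketch — function name and shapes are illustrative — that left-pads by (k − 1)·d and then verifies that perturbing a future input leaves all earlier outputs untouched:

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation):
    """1D causal convolution: output[t] depends only on x[t], x[t-d], x[t-2d], ..."""
    k = len(w)
    pad = (k - 1) * dilation                      # left-pad so no future leaks in
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(w[j] * xp[t + pad - j * dilation] for j in range(k))
                     for t in range(len(x))])

rng = np.random.default_rng(0)
x = rng.normal(size=64)
w = rng.normal(size=3)

# Causality check: perturbing x[40] changes no output before position 40.
y1 = causal_dilated_conv1d(x, w, dilation=4)
x2 = x.copy()
x2[40] += 10.0
y2 = causal_dilated_conv1d(x2, w, dilation=4)
assert np.allclose(y1[:40], y2[:40])              # past is untouched
assert not np.allclose(y1[40:], y2[40:])          # future outputs do change
print("causal: outputs before the perturbation are identical")
```

A real TCN would of course use a framework's optimised conv1d kernel; the loop here only makes the dependency structure explicit.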
TCNs inherit everything good about CNN training: parallelism across the time axis (every position's output is computed independently in the forward pass), no vanishing-gradient problem through time (only through depth, and residual connections handle that), stable gradients, and fast throughput on modern hardware. The price is that the receptive field is fixed at training time — you cannot adapt it dynamically based on what the input looks like — whereas an RNN's effective receptive field is in principle unbounded. In practice the RNN's "in principle" is the vanishing-gradient problem, and the TCN's fixed receptive field of a few thousand steps is usually more than enough.
Before Bai et al., Van den Oord, Dieleman, Zen, Simonyan, Vinyals, Graves, Kalchbrenner, Senior, and Kavukcuoglu's 2016 WaveNet: a generative model for raw audio had already used causal dilated convolutions to generate speech at sample-level — 16 kHz audio, 16,000 samples for every second of output. WaveNet was the first deep generative model to match the naturalness of concatenative speech synthesis in a blind listening test and remains the foundation of most modern text-to-speech. It is in some sense the strongest possible argument for causal dilated convolutions: a problem with very high-frequency structure and enormous sequences, solved by a CNN architecture.
The same causal-convolution idea, applied in 2D, underlies the PixelCNN family (Van den Oord, Kalchbrenner, Kavukcuoglu 2016; Van den Oord et al. 2016 gated version). These autoregressive image models generate images pixel-by-pixel with a 2D analog of causal convolutions — each pixel's distribution conditions on pixels above and to the left of it — and were the strongest generative image models before GANs and VAEs pulled ahead on different quality axes. The PixelCNN lineage also influenced autoregressive transformers by demonstrating that explicit autoregressive modelling at pixel scale was feasible.
Despite Bai et al.'s result, RNNs continued to be the default for sequence modelling until the transformer made the question moot. The reasons were conservatism of practitioners, the accumulated tooling around LSTMs, and the specific tasks (streaming ASR, on-device audio) where the RNN's streaming property is non-negotiable. TCNs survive in production speech synthesis, sample-level audio modelling, and some time-series forecasting pipelines. The larger point they made — that recurrence is not the only route to sequence modelling — set the intellectual stage for the transformer.
After the transformer took over in 2017–2018, recurrent architectures were widely considered obsolete — elegant historically, still useful in a few niches, but not a credible competitor for the serious modelling work of the era. Then, around 2020, a quiet comeback began. State-space models, linear RNNs, and new variants like Mamba and RWKV started to match transformer quality on sequence-modelling benchmarks at lower inference cost and genuinely linear-time scaling. The story is still unfolding.
Transformer self-attention has O(N²) time and memory complexity in sequence length N. This is fine at N = 2048 but painful at N = 100,000 and prohibitive at N = 1,000,000. RNNs have O(N) inference cost and O(1) memory per step (the hidden state), which becomes a decisive advantage once context lengths cross some threshold. The modern-RNN revival is largely a response to the quadratic cost of attention on long sequences.
A linear RNN uses h_t = A h_{t−1} + B x_t — no nonlinearity in the recurrence — and reads out via y_t = C h_t. This seems too restrictive, but when stacked with nonlinearities between layers (like the FFN in a transformer), the resulting architecture has surprising expressive power. Crucially, a linear recurrence can be computed as a parallel scan or equivalently as a carefully structured matrix multiply, so training throughput matches a transformer's even though the inference-time computation is sequential. Orvieto, Smith, Gu, Fernando, Gulcehre, Pascanu, De's 2023 Resurrecting recurrent neural networks for long sequences showed that linear RNNs with careful initialisation and normalisation match or beat transformers on long-range benchmarks like Long Range Arena.
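The parallel-scan claim rests on a simple algebraic fact: composing two affine updates is itself an affine update, so the update operator is associative. A NumPy sketch with illustrative shapes, verifying that a prefix scan over (matrix, vector) pairs reproduces the sequential recurrence — real implementations evaluate this scan in O(log T) parallel depth rather than with the sequential loop shown here:

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 4, 16
A = rng.normal(size=(d, d)) * 0.3    # recurrence matrix, scaled for stability
B = rng.normal(size=(d, 1))
xs = rng.normal(size=T)              # scalar inputs for simplicity

# Sequential recurrence: h_t = A h_{t-1} + B x_t, h_0 = 0
h = np.zeros((d, 1))
seq = []
for x in xs:
    h = A @ h + B * x
    seq.append(h.copy())

# Associative operator: applying (M1, v1) then (M2, v2) to a state h gives
# M2 (M1 h + v1) + v2 = (M2 M1) h + (M2 v1 + v2).
def combine(e1, e2):
    M1, v1 = e1
    M2, v2 = e2
    return (M2 @ M1, M2 @ v1 + v2)

# Prefix scan over (A, B x_t) pairs; associativity is what lets hardware
# evaluate all prefixes in logarithmic depth.
elems = [(A, B * x) for x in xs]
acc = elems[0]
scan = [acc]
for e in elems[1:]:
    acc = combine(acc, e)
    scan.append(acc)

for (M, v), h_ref in zip(scan, seq):
    assert np.allclose(v, h_ref)     # prefix vector equals the hidden state
print("prefix-scan formulation matches the sequential recurrence")
```

The nonlinearity in a vanilla RNN breaks exactly this associativity, which is why the trick is specific to linear recurrences.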
Gu, Goel, and Ré's 2021 Efficiently modeling long sequences with structured state spaces (S4) proposed a deeper formalism. A continuous-time state-space model ẋ = Ax + Bu, y = Cx can be discretised into a linear recurrence h_t = Āh_{t−1} + B̄u_t, with the matrix Ā constrained to a structured form (HiPPO-style) that captures long-range dependencies well. S4 matched transformers on Long Range Arena and pushed the frontier on genuinely long sequences (tens of thousands of tokens). Mehta, Gupta, Cutkosky, Neyshabur's 2022 Long range language modeling via gated state spaces (GSS) and a series of follow-ups refined the approach.
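For a diagonal A — the structured case S4-style models reduce to — the zero-order-hold discretisation has an elementwise closed form: Ā = exp(Δa) and B̄ = (exp(Δa) − 1)/a · b. A hedged NumPy sketch with illustrative values, checked against a fine-grained Euler integration of the continuous system:

```python
import numpy as np

# Diagonal continuous-time SSM x' = a x + b u, discretised by zero-order hold.
# The a, b, dt values below are illustrative, not from any trained model.
a = np.array([-1.0, -0.5, -0.1])       # diagonal of A (negative => stable)
b = np.array([1.0, 1.0, 1.0])          # B
dt = 0.1                               # step size Delta

a_bar = np.exp(dt * a)                 # A_bar = exp(Delta a)
b_bar = (a_bar - 1.0) / a * b          # B_bar = (exp(Delta a) - 1)/a * b

# Run the discrete recurrence h_t = A_bar h_{t-1} + B_bar u_t ...
u = np.ones(50)                        # piecewise-constant input
h = np.zeros(3)
for ut in u:
    h = a_bar * h + b_bar * ut

# ... and compare against Euler integration of the ODE with tiny substeps.
x = np.zeros(3)
sub = 1000                             # Euler substeps per ZOH step
for ut in u:
    for _ in range(sub):
        x = x + (dt / sub) * (a * x + b * ut)

assert np.allclose(h, x, rtol=1e-2)    # ZOH tracks the continuous dynamics
print(h)
```

The point of the exercise: the discrete recurrence is exact for piecewise-constant input, so the "RNN" view and the "continuous dynamical system" view of an SSM are the same object at two resolutions.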
Gu and Dao's 2023 Mamba: linear-time sequence modeling with selective state spaces made the state-space parameters input-dependent — the A, B, C matrices become functions of the input x_t, adding back some of the content-dependent routing that fixed-matrix SSMs lacked. Mamba matches transformer quality on language modelling at similar parameter counts and scales to long sequences (up to 1M tokens) with linear compute. It is the most credible non-transformer architecture at the time of this writing, and whether it will displace transformers in mainstream use is one of the live questions of the field.
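The selectivity idea can be sketched in a few lines: the discretisation step and the B, C projections become functions of x_t. This is an illustrative toy — random projections stand in for Mamba's actual parameterisation, and there is none of its hardware-aware scan:

```python
import numpy as np

# Toy "selective" SSM: recurrence parameters depend on the current input.
# All weight matrices here are illustrative random values, not Mamba's.
rng = np.random.default_rng(2)
d_in, d_state, T = 2, 4, 8
W_dt = rng.normal(size=(1, d_in)) * 0.1   # input -> step size Delta_t
W_B  = rng.normal(size=(d_state, d_in))   # input -> B_t (folded with u_t)
W_C  = rng.normal(size=(d_state, d_in))   # input -> C_t
a    = -np.abs(rng.normal(size=d_state))  # fixed stable diagonal A

xs = rng.normal(size=(T, d_in))
h, ys = np.zeros(d_state), []
for x in xs:
    dt = np.log1p(np.exp(W_dt @ x))[0]    # softplus keeps the step positive
    a_bar = np.exp(dt * a)                # input-dependent discretisation
    b_bar = (a_bar - 1.0) / a * (W_B @ x) # input-dependent drive
    h = a_bar * h + b_bar                 # selective state update
    ys.append((W_C @ x) @ h)              # input-dependent readout
print(np.round(ys, 3))
```

Because Δ_t depends on x_t, the model can effectively "skip" uninformative inputs (small step, state barely changes) or "reset" on salient ones (large step) — the content-dependent routing a fixed-matrix SSM cannot express.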
RWKV (Peng et al. 2023) combines an RNN-style linear recurrence with transformer-style attention in a single architecture, giving it the training parallelism of a transformer and the O(1) inference memory of an RNN. The architecture has been scaled to 14B parameters and produces competitive language models. Other hybrid designs — Hyena (Poli et al. 2023) using long convolutions, Megalodon (Ma et al. 2024) using exponential moving averages, RetNet (Sun et al. 2023) using retention mechanisms — explore the same space. Whether any of these will become the dominant architecture of the 2030s, or whether the transformer will remain default, is not yet clear.
The modern-RNN revival reinforces the lesson of this chapter: the specific architectural form (RNN vs transformer vs state-space) matters less than the three design questions — how state is updated, how long-range dependencies are preserved, and how training works. The transformer won the 2017 round because self-attention gave elegant answers to all three. The modern RNNs and SSMs are a genuine competitor because they give answers that are at least as good on most axes and much better on inference cost at very long sequences. The architecture family that seemed decisively obsolete five years ago is having a quiet renaissance.
Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin's 2017 Attention is all you need ended the seq2seq era. The transformer, a model with no recurrence and no convolutions, matched and quickly surpassed every recurrent NMT system, then every recurrent language model, then every recurrent speech model, then almost every sequence model in every domain. The reasons for the succession are a useful catalogue of what RNNs could not do.
First: training parallelism. An RNN's computation at step t depends on t − 1, so forward and backward passes are inherently sequential across time. A transformer's self-attention — every position attending to every other — is parallelisable across all T positions, and training throughput scales with the number of GPUs rather than being bottlenecked on sequence length. This alone made transformers 5–10× faster to train at equivalent parameter count, and on large-scale data the faster training compounded. Second: long-range interaction. An RNN must route information between positions i and j through a chain of |j − i| intermediate hidden states, each of which can degrade the signal. A transformer routes it in a single attention operation. Even with gating, RNNs in practice struggled with dependencies beyond a few hundred tokens; transformers with 2048-token contexts were immediately stronger. Third: positional resolution. An RNN encodes position implicitly in the order it processes tokens; a transformer encodes position explicitly via positional embeddings, which can be designed or learned to suit the task.
On the 1-billion-word language modelling benchmark — the standard LM benchmark from 2015–2018 — the best large LSTMs (Jozefowicz et al. 2016) reached perplexity around 24, the product of years of RNN engineering. Transformer-based models surpassed that within two years of the 2017 paper — Transformer-XL (Dai et al. 2019) pushed below 22 — and, more importantly, kept improving with scale where the LSTMs had plateaued. A small per-token advantage in log-likelihood, multiplied over millions of tokens, compounds into a decisive quality gap.
Almost every non-architectural idea from the RNN era survived into the transformer era. Subword tokenisation (BPE, SentencePiece). Sequence-to-sequence framing with encoder and decoder. Cross-attention between encoder and decoder. Teacher forcing and beam search. Dropout and LayerNorm. Back-translation. Scheduled sampling. Adam with warmup schedules. Residual connections and layer stacking. The transformer swapped out one architectural primitive — the recurrence — and kept almost everything else. This continuity is why the transition felt rapid from the outside: it was the same game with a better piece.
RNNs have genuine advantages that transformers sacrificed. Inference memory — an RNN's hidden state is O(d), independent of sequence length; a transformer's KV cache grows linearly with context. Streaming — an RNN naturally processes tokens one at a time; a transformer has to be carefully engineered for streaming. Long-sequence cost — generating a sequence of length T costs an RNN O(T); a transformer pays O(T²), since each new token attends over the whole growing context. These are the three points the modern-RNN revival (section sixteen) is trying to reclaim. The transformer's dominance is less complete for very long sequences, on-device inference, and settings where the quadratic cost is prohibitive.
The 2020s direction seems to be hybrid: architectures that combine transformer-style attention with RNN-style recurrent state, picking up each family's strengths. Retrieval-augmented generation puts an external memory behind a transformer. Mixture-of-experts reduces transformer inference cost without losing capacity. State-space layers (Mamba) handle very long contexts cheaply alongside transformer layers. The clean architectural taxonomy of RNN vs transformer vs CNN is blurring into a soup of composite architectures, each optimised for a specific deployment regime. The sequence-modelling problem — handling variable-length ordered data with long-range dependencies — is still the same problem it was in 1990; the tools we have for it are just much better.
Sequence models were the architecture family that taught neural networks to handle variable-length ordered data. Even in the post-transformer era, the ideas in this chapter are everywhere — because the transformer inherited them rather than replacing them, and because the sequence-modelling frame is now the frame for almost every ambitious problem in ML.
Almost everything interesting about the transformer came from the RNN era. The encoder–decoder structure is seq2seq's. The attention mechanism is Bahdanau's. The beam-search decoding is MT's. Subword tokenisation, cross-attention, teacher forcing, layer normalisation, dropout on non-recurrent connections, residual connections, deep stacking — all developed in the LSTM era, carried into transformers largely unchanged. The transformer's contribution was one architectural swap: drop the recurrence, use parallel self-attention instead. That swap was enough to end an era, but it was a much smaller change than the surrounding engineering suggests.
Sequence models gave the field a frame that has proven nearly universal. Translation: p(target | source). Summarisation: p(summary | document). Speech recognition: p(text | audio). Code completion: p(continuation | prefix). Dialogue: p(response | history). Image captioning: p(caption | image). Image generation (autoregressive): p(pixels | prompt). Question answering: p(answer | question, context). By the time transformers took over, every serious ML problem had been reframed as a conditional sequence model — and the transformer simply provided a better architecture for the frame the RNN era had established.
Some niches remain firmly RNN territory. On-device speech recognition (RNN transducers have small hidden states, streaming decoding). Time-series forecasting where an interpretable, low-parameter model is preferred (statsforecast and neuralforecast toolkits still emphasise RNN variants). Reinforcement-learning policies that need to aggregate a history of observations (LSTMs are the default in most RL frameworks). Mid-size financial and scientific time-series applications where transformer overhead is not justified. The LSTM is a tool every ML practitioner should still reach for when the sequence length and computational budget are modest.
Sequence models are the architecture family that built the bridge from "deep learning works on images" (the previous chapter) to "deep learning works on language, speech, and everything else". The bridge took twenty years — Hochreiter's 1991 thesis to Sutskever's 2014 seq2seq paper — because the vanishing-gradient problem was fundamental and required a real architectural innovation (gating) to solve. Once solved, the field advanced rapidly through seq2seq, attention, and eventually the transformer. The intellectual heritage of those twenty years is embedded in every modern generative model, every chatbot, every code-completion system, every speech recogniser. Even the current interest in state-space models and hybrid architectures is, in some sense, the field circling back to question whether the transformer's architectural choices are the last word — a question that this chapter argues is still open.
This chapter ends with gated recurrence as the core of the sequence-model family and self-attention as the revolution that replaced it as the default. The next chapter is about that revolution in detail: attention mechanisms — soft vs hard, self vs cross, single-head vs multi-head — the primitive that the transformer elevated from a clever fix for seq2seq into the foundation of modern deep learning.
Sequence modelling has a long historical arc, from the early 1990s dynamical-systems view of RNNs through the 1997 LSTM paper, the 2014–2017 seq2seq era, and the 2017-onwards transformer succession with its 2020s state-space-model coda. The reading list below traces that arc, mixing anchor textbooks, the foundational papers that introduced each architectural primitive, the modern extensions that are still evolving, and the software that makes everything usable.
Chapter 10 on sequence modelling and recurrent networks is the standard textbook treatment — BPTT, LSTM, GRU, bidirectional and deep RNNs, echo-state networks, explicit memory. The vanishing-gradient discussion and the long-term-dependency analysis are especially clear.
The canonical NLP textbook, freely available in draft form. Chapters on RNNs, LSTMs, seq2seq, machine translation, and speech recognition are all live-updated and are the best single source for both the classical and neural views of sequence tasks in language.
The best single-book treatment of neural NLP at the moment of peak RNN dominance. Chapters on recurrent networks, encoder–decoder, attention, and training techniques give compact, careful derivations — still the clearest introduction to these ideas I know.
Graves's thesis, expanded into a short monograph, is the definitive treatment of BPTT, LSTM, bidirectional RNNs, and CTC. Written before the 2014–2017 seq2seq explosion, so it focuses on the labelling-and-classification end of sequence modelling and covers every detail of the mathematics.
Not a sequence-modelling course, but the CS231n RNN/LSTM lecture and the famous assignment on Karpathy's char-RNN are the most effective hands-on introduction to recurrent networks in the literature. The lecture slides are the reference many practitioners learned RNNs from.
The canonical modern-NLP course, with lectures on RNNs, LSTMs, seq2seq, attention, and eventually transformers. Video lectures and assignments are freely available and have been the training ground for a generation of NLP practitioners.
The backpropagation paper, which in its original form already discussed backprop through time for recurrent networks. The starting point of every discussion in this chapter.
The paper that introduced the modern formulation of the simple recurrent network and popularised the architecture in cognitive science. "Elman networks" are still the name for vanilla RNNs in parts of the literature.
The systematic exposition of BPTT, including the unrolled-graph view and the handling of shared weights across time. The algorithmic reference for everything in section three.
The proof that long-term-dependency learning and stable training are mutually exclusive in a vanilla RNN. The theoretical backbone of section four, and the paper every subsequent gated architecture cites.
The LSTM paper itself — dense, careful, fifteen years ahead of its time. Published in 1997 and largely ignored until the late 2000s when sufficient compute made it practical. Read the original to see how much of the modern formulation was already present.
The paper that added the forget gate to LSTM and made it the architecture we mean today. Every current LSTM implementation uses this version.
The bidirectional architecture paper. For tagging and classification tasks where the full sequence is available, the bidirectional pass is a free accuracy bump and has been standard since this paper.
The CTC paper. The marginalisation trick that makes end-to-end speech recognition possible. Every speech-recognition discussion of the 2010s starts here.
The paper that established neural networks as viable competitors to n-gram language models. Early, simple, decisive — and the ancestor of every modern language model.
The seq2seq paper. Two LSTMs, one reading the source, one emitting the target — the framework that replaced phrase-based SMT. The reverse-input trick alone is worth the read.
The companion seq2seq paper, which also introduced the GRU. Worth reading together with Sutskever et al. as the pair of papers that founded the neural-MT era.
The attention paper. Arguably the single most important paper in this chapter, because it introduced the primitive that the transformer elevated into the architecture of the modern era.
The careful analysis of exploding and vanishing gradients, and the gradient-clipping solution for the exploding case. Every modern RNN training pipeline uses the clipping recipe from this paper.
The systematic ablation of every LSTM component on three representative tasks. The paper that clarified which parts of the LSTM actually matter and which are vestigial.
Graves's single-author masterpiece on generative sequence modelling — handwriting generation, character-level text, mixture-density networks for continuous outputs. Beautifully written and still instructive.
The transformer paper. The paper that ended the RNN era by showing that attention alone, with no recurrence or convolution, outperforms every sequence model of the previous decade. The direct bridge from this chapter to the next one.
The production NMT system that replaced phrase-based SMT inside Google Translate. The model of how to describe a serious deep-learning system, and the last great industrial application of stacked LSTMs at the top of a product.
The paper that simplified Bahdanau's additive attention to a dot-product form and introduced global and local attention variants. The transformer's scaled dot-product attention is a direct descendant of this formulation.
Baidu's industrial-scale end-to-end ASR system. The paper that pushed CTC-based speech recognition into production and demonstrated what a simple, scaled-up deep learning pipeline could do.
The blog post that put char-RNNs in front of the broader ML audience and launched a thousand hobbyist projects. Read it for the worked examples and the hidden-state visualisations — both made RNNs feel understandable in a way the papers did not.
The AWD-LSTM paper — averaged SGD, DropConnect ("weight drop") on the hidden-to-hidden weights, activation regularisation. The final state of the art for RNN language modelling before transformers took over, and a masterclass in RNN training tricks.
The BPE subword-tokenisation paper. The innovation that handled the out-of-vocabulary problem in NMT and got carried into every transformer-based LM as standard practice.
The TCN paper. The quiet result that causal dilated convolutions match or beat LSTMs on every benchmark the authors tested, undermining the idea that recurrence is essential.
The dilated-convolutional generative model for raw audio. WaveNet is still the architecture at the bottom of most neural text-to-speech systems and the proof that causal CNNs can model very long sequences.
The S4 paper. A structured state-space model that handles sequences of tens of thousands of tokens with linear compute and matches transformer quality on long-range benchmarks. The start of the modern-SSM revival.
The Mamba paper. An input-dependent state-space model that matches transformer language-modelling quality at similar parameter counts with O(N) inference. The most credible non-transformer architecture at the time of this writing.
The paper that showed that properly initialised and normalised linear RNNs match state-space models on long-range tasks. A strong argument that the architecture family of this chapter is not obsolete.
The paper that documented the pathology of beam search for open-ended generation and proposed nucleus (top-p) sampling. Every modern LM API exposes top-p because of this work.
The scheduled-sampling paper. The most-cited attempt to fix exposure bias in seq2seq training and a reference point for every subsequent discussion of the problem.
The RWKV paper. A hybrid linear-RNN / transformer architecture that trains with transformer parallelism and infers with RNN efficiency, scaled to multi-billion-parameter models.
The default deep-learning framework. Built-in nn.RNN, nn.LSTM, nn.GRU, and nn.Transformer modules cover every architecture in this chapter. The cuDNN-backed recurrent kernels are fast enough for production use.
Despite the name, the library supports seq2seq and encoder–decoder architectures including classical LSTM-based MT models. Worth looking at the BART, T5, and M2M-100 implementations for a pragmatic tour of modern seq2seq.
Facebook/Meta's sequence-to-sequence toolkit. Originally built for NMT with LSTMs, now covers convolutional seq2seq, transformer NMT, and speech. A real reference implementation of every decoding trick — beam search, length normalisation, ensemble decoding.
The open-source seq2seq toolkit that grew out of the Harvard NLP group. PyTorch and TensorFlow ports, very clean reference implementations of attention-based NMT, and an excellent entry point for learning seq2seq by modifying working code.
End-to-end speech-processing toolkit. Implements CTC, attention-based encoder–decoder, RNN-T, and joint CTC–attention systems for ASR, TTS, and speech translation. The reference codebase for academic speech work.
Reference implementation of the Mamba selective-state-space architecture, with optimised CUDA kernels for the linear-time selective-scan operation. The fastest path to experimenting with modern state-space models on realistic-size problems.
The next chapter, Attention Mechanisms, picks up where this one ends. Sections 9 and 10 here introduced attention as a patch for the seq2seq bottleneck; the transformer turned it into the entire architecture. We will spend a full chapter on the taxonomy of attention — soft vs hard, self vs cross, single-head vs multi-head, linear-time variants — before arriving at the transformer itself in Chapter 07. The intellectual continuity from Bahdanau's 2014 paper through Vaswani's 2017 paper to the modern large language models is one of the clearest arcs in the history of deep learning, and it is the arc the next chapter traces.