In June 2017 a team at Google Brain posted a paper to arXiv with the extravagant title Attention Is All You Need. Its argument was straightforward. For half a decade sequence modelling had been dominated by recurrent networks — LSTMs and GRUs — whose left-to-right processing made them slow to train and hard to parallelise. A small group of recent papers had added attention to these recurrent encoders as an auxiliary mechanism, and attention had turned out to be where most of the work was actually done. So Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin did the obvious experiment: they removed the recurrence entirely. What remained was a stack of self-attention blocks with feed-forward layers between them, residual connections, layer normalisation, and a clever sinusoidal positional encoding. The paper reported state-of-the-art results on English-German machine translation at a fraction of the training time. Within eighteen months the transformer was the architecture of BERT, GPT, and every subsequent pretrained language model. Within three years it had swallowed computer vision (ViT), speech (Conformer, Whisper), protein folding (AlphaFold), reinforcement learning (Decision Transformer), and code generation (Codex). It is now the single most important neural architecture in applied machine learning — the substrate on which GPT-4, Claude, Gemini, LLaMA, PaLM, and every other large language model is built. This chapter is about what's inside the box: self-attention, the query/key/value decomposition, multi-head attention, positional encoding and its variants, the residual-plus-layernorm skeleton, the encoder and decoder stacks, the three major architectural variants (encoder-only, decoder-only, encoder-decoder), and the engineering extensions (FlashAttention, MoE, grouped-query attention, long-context techniques) that keep the transformer viable at frontier scale.
Sections one through four establish why the transformer exists and how its core mechanism works. Section one is the motivation — the limitations of recurrent and convolutional sequence models that attention was designed to fix, and what it meant to discover that "attention is all you need". Section two introduces self-attention itself: the intuition that each position in a sequence should be a weighted combination of all other positions, and the mathematical form this takes. Section three unpacks the query/key/value decomposition — the three learned projections that let self-attention learn what to attend to and what to retrieve. Section four covers scaled dot-product attention, the specific equation that the original paper proposed and that every modern transformer still uses at its core.
Sections five through eight are the other three pieces of the basic transformer block. Section five is multi-head attention — why one attention computation is not enough, what the heads learn in practice, and how the concatenation and output projection fit together. Section six addresses positional encoding — attention is inherently permutation-equivariant, so the model needs some way to know where each token sits in the sequence, and this section walks through the sinusoidal encoding of the original paper along with learned absolute positional embeddings. Section seven covers residual connections and layer normalisation, the two pieces of training-stability machinery that surround every attention and feed-forward sublayer, with close attention to the pre-norm vs post-norm debate. Section eight handles the position-wise feed-forward network — the often-underrated second sublayer, which is where a substantial fraction of a transformer's parameters and capacity actually live.
Sections nine through twelve assemble the full architecture. Section nine walks through the encoder stack — the left half of the original Vaswani diagram. Section ten covers the decoder stack, including its distinguishing masked self-attention. Section eleven is cross-attention, the mechanism by which the decoder queries into the encoder's outputs — the technical heart of the original encoder-decoder and still the backbone of every seq2seq transformer. Section twelve is a careful treatment of masking — padding masks, causal masks, and the attention-pattern engineering that structured sparsity, sliding windows, and efficient long-context approaches all build on.
Sections thirteen through sixteen cover the three architectural variants and the evolution of positional representations. Section thirteen is encoder-only transformers — BERT, RoBERTa, DeBERTa, and the family of models built for representation and classification. Section fourteen is decoder-only transformers — GPT, LLaMA, Mistral, and the family that won the generative race. Section fifteen revisits positional variants — relative position, rotary embeddings (RoPE), ALiBi, and the engineering story of how we got from sinusoidal encodings to 100K-plus context windows. Section sixteen covers efficient attention — FlashAttention's IO-aware reformulation, sparse and linear approximations, multi-query and grouped-query attention, sliding window and ring attention.
Sections seventeen and eighteen close the chapter. Section seventeen is scaling and training dynamics — depth vs width, initialisation, activation functions, mixed precision, the practical matters that separate a transformer that works at 100M parameters from one that works at 100B. The closing section places the transformer in the broader landscape: vision (ViT, DETR, SAM), audio (Whisper, Conformer), protein structure (AlphaFold 2, 3), reinforcement learning (Decision Transformer, Gato), diffusion (DiT), and the small number of genuinely novel post-transformer architectures (Mamba, RWKV) that are beginning to challenge it. The chapter ends positioned for Pretraining Paradigms, the next chapter, which takes the architecture described here and asks: what do we train it to do?
Recurrent networks had two fundamental limitations — they were slow to train because they could not be parallelised along the sequence dimension, and they struggled to model long-range dependencies because the gradient signal decayed across many steps. The transformer solved both problems at once, by the simple expedient of removing recurrence altogether and replacing it with attention.
From 2014 through 2017 the dominant architecture for sequence-to-sequence tasks — machine translation, summarisation, speech recognition, dialogue — was the recurrent encoder-decoder. Sutskever, Vinyals, and Le's 2014 paper introduced the architecture; Bahdanau, Cho, and Bengio's 2015 paper added attention as a bridge between encoder and decoder. These models were state-of-the-art but expensive to train. A recurrent network processes tokens strictly left to right: the hidden state at position t depends on the hidden state at position t-1, which depends on t-2, and so on. There is no way to compute the hidden state at position t in parallel with the hidden state at position t-1. On GPUs, which thrive on massive parallelism, this is catastrophic. Training a large recurrent model on a billion tokens could take weeks.
Vanishing gradients were the second pathology. Backpropagation through a recurrent network multiplies many Jacobian matrices, one per timestep. Unless these matrices are near-isometric, the product either blows up or shrinks to zero across many steps. LSTMs and GRUs were designed to mitigate this via gating, and they partially succeeded — an LSTM can in practice remember information across a few hundred tokens — but long-range dependencies (like resolving a pronoun to an antecedent mentioned twenty sentences earlier) remained fragile. Attention, introduced by Bahdanau as an add-on to recurrent models, was already doing much of the work that the recurrence was nominally responsible for. Each decoder step would learn which encoder positions to attend to, pulling information directly from anywhere in the input rather than squeezing everything through a single recurrent bottleneck.
By 2017 several papers were asking the obvious question: if attention is doing the work, do we need the recurrence at all? Parikh et al. 2016 had shown that attention alone could match LSTMs on natural language inference. Kaiser and Bengio 2016 had introduced "Can Active Memory Replace Attention?" — a provocation more than a result. Then in June 2017 Vaswani and colleagues posted Attention Is All You Need, which simply removed all the recurrence from a sequence-to-sequence model and replaced it with stacks of self-attention layers. The results were striking: state-of-the-art English-German translation (28.4 BLEU on WMT 2014), achieved with a fraction of the training compute — a few days on eight GPUs rather than weeks on a cluster.
The win was not subtle. Because every token's representation could attend to every other token's representation in a single parallel operation, self-attention trained dramatically faster per wall-clock step than recurrence. Because every pair of positions was directly connected by exactly one attention computation rather than by a chain of |i - j| recurrent steps, long-range dependencies were handled natively. The cost — quadratic memory and compute in sequence length — was a problem that later work would address but that was, for the sequence lengths considered in 2017, entirely acceptable on available hardware. The transformer was not an incremental improvement on LSTMs; it was a generational replacement.
The transformer is the right architecture because it maximises what matters on modern hardware — throughput of parallelisable compute — while natively handling what recurrence did badly: long-range dependency and gradient flow. Every subsequent attempt to replace it has had to match both of those properties, and most have failed.
The core idea of self-attention is that each position in a sequence computes its new representation as a weighted average of all positions — including itself — where the weights are learned from the content of the sequence. The output at each position is therefore a function of the entire input, computed in a single parallel step.
Consider a sequence of n tokens, each represented as a d-dimensional vector, stacked into a matrix X ∈ ℝ^{n×d}. Self-attention transforms X into a new matrix Y ∈ ℝ^{n×d} where each row of Y is a weighted sum of the rows of X, with the weights determined by how similar each pair of rows is. If tokens i and j are semantically related, Y_i will pull strongly from X_j. If they are not, the contribution will be negligible. The result: a new representation for each position that is informed by the entire rest of the sequence.
Contrast this with the alternatives. A recurrent layer transforms each position by combining it with the single previous hidden state — information flows serially along the sequence. A convolutional layer transforms each position by combining it with a fixed-radius window of neighbours — information flows locally. Self-attention transforms each position by combining it with every other position — information flows globally, in a single step, from any position to any other. The connectivity pattern is the complete graph on n nodes rather than a chain or a small-neighbourhood grid.
This global connectivity has three consequences. First, the path length between any two positions is exactly one. In a recurrent network the shortest path between positions i and j is |i - j|, so gradient signals have to travel through many intermediate computations. In a self-attention layer the path is one hop. This is why transformers handle long-range dependency better than LSTMs: there is no long path for the signal to decay along. Second, the computation at every position is fully parallelisable — all n outputs can be computed simultaneously. Third, the cost is quadratic in sequence length: n² pairwise interactions must be computed. For n in the hundreds or low thousands, this is cheap; for n in the hundreds of thousands, it is the bottleneck that efficient-attention variants (§16) are designed to address.
Self-attention is also permutation-equivariant: if you shuffle the rows of X, the rows of Y are shuffled the same way, and the values are unchanged. Attention knows what other tokens the current token should look at, but it does not know where any of them are in the sequence. This is a feature — it decouples content from position, letting the model focus on what matters — but it also means that position must be supplied separately, via the positional encoding of §6. Without positional encoding, a transformer would be bag-of-tokens: it would treat the dog bit the man and the man bit the dog identically. Position is how the architecture learns that order matters.
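The permutation-equivariance is easy to verify numerically. The sketch below uses the simplest possible form of self-attention — weights taken directly from dot products of the input with itself, with no learned projections — to show that shuffling the input rows shuffles the output rows identically:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def simple_self_attention(X):
    # Attention weights from raw dot products of the input with itself,
    # with no learned projections — the bare mechanism only.
    return softmax(X @ X.T, axis=-1) @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))          # 6 tokens, 8 dimensions
perm = rng.permutation(6)
# Permuting the input rows permutes the output rows identically:
assert np.allclose(simple_self_attention(X[perm]), simple_self_attention(X)[perm])
```

The assertion passes for any permutation: nothing in the computation refers to position, which is exactly why positional encoding (§6) has to be added separately.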
Rather than computing attention weights directly from input representations, the transformer first projects each token into three separate spaces: queries (what am I looking for?), keys (what do I have to offer?), and values (what do I pass along if selected?). This decomposition gives the network far more representational flexibility than a single pairwise-dot-product scheme would allow.
The input matrix X ∈ ℝ^{n×d} is projected through three learned weight matrices: W_Q, W_K ∈ ℝ^{d×d_k} and W_V ∈ ℝ^{d×d_v} (with d_k = d_v in the original paper), producing Q = X W_Q, K = X W_K, V = X W_V. Each row of Q is a query vector; each row of K is a key vector; each row of V is a value vector. The three projections serve distinct roles and are allowed to learn whatever the data demands of them.
The conceptual metaphor, borrowed from information retrieval, is that self-attention is a soft database lookup. The query is the request a particular token is making ("I am a pronoun looking for an antecedent"). The keys are labels that other tokens advertise ("I am a noun phrase that could plausibly be an antecedent"). The values are the content that gets retrieved when a query matches a key ("here is my semantic vector, pass it through"). Attention weights are computed by comparing queries to keys: the closer a query is to a key (in dot-product similarity), the higher the weight assigned to that position's value. The output at each position is a weighted combination of values, where the weights come from query-key similarity.
Why three projections rather than one? Because the content a token offers (value) need not equal the label it advertises (key), which need not equal the request it is making (query). A word like she might advertise "I am an animate singular female noun" as its key, request "looking for a recent animate singular female noun phrase" as its query, and offer a semantic summary of the entity it refers to as its value. Forcing a single vector to play all three roles would entangle what a token seeks with what it provides.
The projections WQ, WK, WV are trainable parameters; there is nothing special about their initial values, and the training signal shapes each projection to do whatever the loss function demands. Empirically, the learned projections often exhibit interpretable structure: in a trained BERT, some attention heads align strongly with syntactic dependencies; others with coreference; others with semantic similarity. But this structure is emergent from the training objective, not imposed by design. The architecture simply provides three degrees of freedom per head; the data decides how to use them.
A useful mnemonic: query = what I need, key = what I advertise, value = what I give. In a well-trained head, queries and keys together determine who talks to whom, and values determine what gets said.
The single most recognisable equation in modern deep learning is softmax(Q K^T / √d_k) V. It computes pairwise query-key similarities, scales them by a factor that keeps the softmax well-behaved, converts the scores to a probability distribution, and uses that distribution to take a weighted average of values.
The equation unpacks into four steps. First, Q K^T computes an n × n matrix of raw attention scores, where entry (i, j) is the dot product between the i-th query and the j-th key — a measure of how well they align. Second, divide by √d_k, where d_k is the query/key dimension. Third, apply a softmax along each row, converting scores into a probability distribution that sums to one across the sequence. Fourth, multiply by V to produce the final n × d_v output: each row is a convex combination of value vectors.
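The four steps translate directly into code. A minimal NumPy sketch (illustrative, not an optimised implementation; shapes as in the text):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (n, d_k); V: (n, d_v) -> output: (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # steps 1-2: similarities, scaled
    weights = softmax(scores, axis=-1)  # step 3: each row sums to one
    return weights @ V                  # step 4: convex combination of values

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 64, 64
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 64)
```

Each row of `weights` is a probability distribution over the sequence, so each output row stays inside the convex hull of the value vectors.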
The √d_k scaling is essential and easy to miss. The dot product of two random d_k-dimensional vectors whose components have zero mean and unit variance has variance d_k, so its standard deviation is √d_k. At d_k = 64 (a typical per-head dimension) the raw scores therefore have standard deviation 8, which after softmax produces extremely peaked distributions where one position gets essentially all the weight and the rest get zero. These near-one-hot distributions have vanishing gradients with respect to the non-selected positions, stalling training. Dividing by √d_k brings the variance of the scores to order one, keeping the softmax in a regime where gradients flow to many positions. This scaling is one of the transformer's most load-bearing "tricks" — remove it and large-d_k training often fails to converge.
The softmax has a second important property. Because it normalises across all keys, attention forces the model to allocate attention rather than simply accumulate it. If a query matches ten keys equally well, it cannot attend to all ten at full strength — each gets one-tenth. This encourages the model to find distinctive matches and discriminate between positions, rather than treating attention as a generic "see everything" operator. The tradeoff is that the weights sum to one, so a query that spreads attention over many positions gives each of them only a sliver of the mass; this is one motivation for multi-head attention (§5) and for some of the alternative attention formulations (§16).
Finally, the whole computation is dense and matrix-shaped. Implementing scaled dot-product attention on modern hardware is just four tensor operations — a matmul, a scale, a softmax, a matmul — each of which maps perfectly onto GPU tensor cores. This hardware fit is why the original paper could report state-of-the-art translation at a small fraction of the training cost of the best recurrent and convolutional models. The efficiency is not an implementation detail; it is the economic argument for the entire architecture.
Instead of a single attention operation over the full hidden dimension, the transformer splits the hidden state into h smaller chunks and runs h parallel attention operations — one per head — each with its own learned projections. The results are concatenated and projected back. This turns out to be far more expressive than a single wider head.
The original paper uses h = 8 heads each with d_k = d_v = 64, which together span the full d_model = 512 hidden dimension. Formally: for each head i ∈ {1, …, h} there are three learned matrices W_Q^i, W_K^i, W_V^i; the input X is projected through them to produce Q_i, K_i, V_i; scaled dot-product attention is applied independently in each head; the h outputs are concatenated along the feature dimension; a final linear projection W^O ∈ ℝ^{h·d_v × d_model} mixes the heads back to the model dimension. The whole thing costs approximately the same as single-head attention at the full dimension, because the per-head cost is smaller by a factor of h.
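The split-attend-concatenate-project pipeline can be sketched in NumPy as follows (single unbatched sequence, all heads computed at once via a reshape; illustrative rather than optimised):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """X: (n, d_model); W_Q/W_K/W_V/W_O: (d_model, d_model); h heads."""
    n, d_model = X.shape
    d_k = d_model // h
    # Project at full width, then split the feature dimension into h heads.
    def split(W):
        return (X @ W).reshape(n, h, d_k).transpose(1, 0, 2)  # (h, n, d_k)
    Q, K, V = split(W_Q), split(W_K), split(W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)          # (h, n, n)
    heads = softmax(scores, axis=-1) @ V                      # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)     # concatenate heads
    return concat @ W_O                                       # mix heads back

rng = np.random.default_rng(0)
n, d_model, h = 10, 512, 8
Ws = [rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4)]
Y = multi_head_attention(rng.normal(size=(n, d_model)), *Ws, h=h)
print(Y.shape)  # (10, 512)
```

Note that the h per-head projections are stored here as slices of one full-width matrix — the standard implementation trick that keeps the cost equal to a single full-width attention.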
Why does multi-head work? Because different heads can learn to attend to different kinds of relationships. A 2019 paper by Clark, Khandelwal, Levy, and Manning — What Does BERT Look At? — found that in trained BERT some heads align almost perfectly with syntactic dependency arcs (subject-verb, determiner-noun); some heads track coreference (pronouns attending to antecedents); some heads handle positional structure (attending to the previous token, or to the next period); others seem to spread broadly with no clear specialisation. A single head with the same total compute would have to somehow encode all of these patterns simultaneously in one attention map, which it cannot do — softmax forces each query to commit to a small number of keys.
The head width vs head count tradeoff is a subtle design dimension. Very narrow heads (d_k = 16 or 32) may lack the dimensionality to represent fine-grained distinctions; very wide heads (d_k = 128 or more) fall back toward single-head behaviour. The d_model = 512, h = 8, d_k = 64 choice in the original paper has held up remarkably well as a default; most subsequent transformers use variations around d_k = 64–128. For very large models, multi-query attention and grouped-query attention (§16) break the symmetry by keeping head-specific queries but sharing keys and values, trading a small amount of expressiveness for substantial inference-time memory savings.
A practical consequence: multi-head attention is the first place where a transformer deviates meaningfully from a "single big attention op" model. All the per-head projections are learned independently; the final output mixer learns how to combine their outputs. This is why attention-map visualisations are per-head: there is no single attention pattern the model uses at a given layer, but rather h patterns running in parallel, each specialising on a different aspect of the input.
Self-attention is permutation-equivariant: shuffle the inputs, shuffle the outputs identically. To make the transformer sensitive to order, positional information is added to the input embeddings before attention begins. The original paper used a fixed sinusoidal encoding; most modern systems use learned absolute or rotary encodings.
The original sinusoidal encoding produces, for each position pos and each embedding dimension i, a value drawn from a sine or cosine of pos at a wavelength that depends on i. Concretely: PE(pos, 2i) = sin(pos / 10000^{2i/d}) and PE(pos, 2i+1) = cos(pos / 10000^{2i/d}). The result is a fixed (non-learned) vector for every position, added to the token embedding before the first attention layer. The wavelengths form a geometric progression from 2π (for the highest-frequency dimensions) to 10000 · 2π (for the lowest), giving the model a multi-scale positional signal: some dimensions fire on fine-grained position differences, others on coarse ones.
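The formula translates into a few lines of NumPy (a minimal sketch; the 10000 base and the sin/cos interleaving follow the equations above):

```python
import numpy as np

def sinusoidal_encoding(max_len, d):
    """Return the (max_len, d) matrix of fixed positional encodings."""
    pos = np.arange(max_len)[:, None]       # (max_len, 1)
    i = np.arange(0, d, 2)[None, :]         # even dimension indices 0, 2, ...
    angles = pos / (10000 ** (i / d))       # (max_len, d // 2)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)            # even dims: sine
    pe[:, 1::2] = np.cos(angles)            # odd dims: cosine
    return pe

pe = sinusoidal_encoding(max_len=100, d=64)
print(pe.shape)  # (100, 64)
```

The resulting matrix is simply added to the token embeddings; every entry lies in [-1, 1], the same scale as typical embedding initialisations.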
Vaswani et al. argued two properties motivated the sinusoidal choice. First, the encoding can extrapolate to sequence lengths not seen during training, because the formula is defined for arbitrary pos. Second, it has the property that PE(pos + k) can be expressed as a linear function of PE(pos), which makes learning relative positional relationships easier. In practice the extrapolation argument has turned out to be weak — trained transformers generalise poorly beyond their training sequence length even with sinusoidal encoding — and modern positional schemes (§15) address this through different mechanisms.
The common alternative is learned absolute positional embeddings: a lookup table of max_length position vectors, each the same dimension as the token embedding, trained from scratch alongside the model. BERT, GPT-2, and many other systems use this. Learned encodings are more flexible than sinusoidal (they can capture idiosyncratic positional regularities in the training data) but they cannot extrapolate at all — positions beyond max_length have no embedding, and attempts to use the model on longer sequences produce random behaviour at those positions.
Both schemes share a crucial property: positional information is added to token embeddings, not concatenated. The same d-dimensional space is used to carry both "what" (the token) and "where" (the position); the attention mechanism must learn to disentangle them via its query and key projections. This is simple and efficient but has awkward consequences — the model has finite capacity to represent position, and position can "leak" into content in trained representations. Modern variants (rotary position embeddings, ALiBi, relative position biases; all in §15) move away from the add-to-embedding scheme in various ways, with the general theme that position should be injected where attention happens (at the QK dot product) rather than at the input.
Positional encoding is not cosmetic: without it, a transformer cannot tell the dog bit the man from the man bit the dog. How you encode position is a first-order architectural decision with direct consequences for context length, extrapolation behaviour, and training dynamics.
Every sublayer in a transformer — attention and feed-forward alike — is wrapped in a residual connection and a layer normalisation. This scaffolding is not cosmetic: it is what lets deep transformers train at all, and the choice of pre-norm vs post-norm materially affects stability at scale.
The residual connection around each sublayer computes output = sublayer(x) + x: the sublayer produces an update to the input representation, and the original input is added back. This trick, borrowed from ResNets, has two consequences. First, gradients flow through the network along the residual pathway even if the sublayer's own gradient signal is small — the network can never be worse than its input, because the identity is always available. Second, each layer only has to learn a difference from its input, which is typically a much easier optimisation problem than learning a whole new representation from scratch. Without residual connections, the original paper's 6-layer encoder would have been hard to train and a 96-layer GPT-3 would have been impossible.
Layer normalisation, introduced by Ba, Kiros, and Hinton in 2016, normalises each token's representation independently — subtracting the mean and dividing by the standard deviation of its d-dimensional vector, then applying a learned scale and shift. Unlike batch normalisation (which is computed across the batch dimension), layer norm is computed per-token, which makes it well-suited to variable-length sequences and small batches. In a transformer, layer norm prevents the scale of activations from exploding across many stacked sublayers and keeps the input distribution to each sublayer roughly stable.
Where the layer norm sits relative to the residual connection is the subject of a long-running architectural debate. The original paper used post-norm: output = LayerNorm(x + sublayer(x)). This was the convention in the first wave of transformers (BERT, GPT-1, GPT-2). But post-norm has a subtle stability problem at depth — gradient magnitudes can drift across layers, requiring learning-rate warmup and careful initialisation to train without diverging. The alternative is pre-norm: output = x + sublayer(LayerNorm(x)), which normalises the input to the sublayer rather than the output of the residual connection. Pre-norm is markedly more stable at scale: gradient norms stay better-behaved across depth, warmup requirements are reduced or eliminated, and training fails less catastrophically when hyperparameters are off. Nearly every modern transformer (GPT-3 and later, LLaMA, PaLM, Gemini) uses pre-norm.
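The two orderings differ by a single line. A minimal sketch (the learned scale and shift are omitted from the layer norm for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each token vector to zero mean, unit variance (scale/shift omitted).
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def post_norm_block(x, sublayer):
    # Original 2017 ordering: normalise after the residual addition.
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Modern ordering: normalise the sublayer input; the residual path is
    # untouched, so an identity path runs from block input to block output.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
sublayer = lambda z: 0.5 * z     # stand-in for attention or the FFN
print(post_norm_block(x, sublayer).shape, pre_norm_block(x, sublayer).shape)
# (4, 8) (4, 8)
```

The unbroken identity path in the pre-norm version is the intuition behind its stability: gradients can reach any depth without passing through a single normalisation.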
Several efficient variants of layer norm have appeared in recent years. RMSNorm (Zhang and Sennrich 2019) drops the mean-centering and only rescales by root-mean-square, reducing computation with negligible quality cost — LLaMA uses it. DeepNorm (Wang et al. 2022) modifies the residual branch's weights to allow extremely deep post-norm training. The specific choice matters for training dynamics but rarely dominates final model quality; the important thing is that some form of normalisation is present. A transformer without layer norm is a transformer that will not converge.
The attention sublayer gets most of the attention, but every transformer block also contains a feed-forward sublayer — a small two-layer MLP applied independently to each position — that accounts for roughly two-thirds of the parameters and, by some interpretations, does most of the actual "computation" that attention merely routes between.
The feed-forward network (FFN) in each transformer block is a function FFN(x) = W_2 · activation(W_1 x + b_1) + b_2 applied independently to the representation at each sequence position. Crucially, no interaction between positions happens here — the FFN is a per-token transformation. The first linear layer projects the d_model-dimensional input up to a wider hidden dimension d_ff (typically d_ff = 4 · d_model), applies a pointwise nonlinearity, then the second linear layer projects back down to d_model. This is just a standard MLP.
Although it looks simple, the FFN is where a substantial fraction of the model's parameters live. For a transformer with d_model = 512 and d_ff = 2048, each FFN block has approximately 2 · 512 · 2048 ≈ 2.1M parameters — more than the attention sublayer's roughly 4 · 512² ≈ 1M. For modern large models this ratio holds: a 70B-parameter transformer has more FFN parameters than attention parameters. Scaling-law and sparsity studies have converged on the view that FFN capacity is as important as attention capacity, and mixture-of-experts (§16) targets exactly this sublayer when seeking scale-efficient growth.
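The parameter arithmetic is easy to check. A minimal sketch using the base-model dimensions (FFN biases included, attention biases omitted, matching the counts above):

```python
import numpy as np

def ffn(x, W_1, b_1, W_2, b_2):
    # Position-wise: the same MLP applied to every token (broadcasts over n).
    return np.maximum(x @ W_1 + b_1, 0.0) @ W_2 + b_2  # ReLU, as in the original

d_model, d_ff = 512, 2048
W_1, b_1 = np.zeros((d_model, d_ff)), np.zeros(d_ff)
W_2, b_2 = np.zeros((d_ff, d_model)), np.zeros(d_model)

ffn_params = W_1.size + b_1.size + W_2.size + b_2.size
attn_params = 4 * d_model * d_model     # W_Q, W_K, W_V, W_O, biases omitted
print(ffn_params, attn_params)          # 2099712 1048576

x = np.zeros((5, d_model))
print(ffn(x, W_1, b_1, W_2, b_2).shape)  # (5, 512) — shape-preserving
```

Roughly two FFN parameters for every attention parameter per block, consistent with the two-thirds figure quoted above.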
Different activation functions have come and gone. The original paper used ReLU. GPT-2 switched to GELU, a smooth approximation that has become the standard in encoder models. Recent systems — PaLM, LLaMA — use SwiGLU (a gated linear unit with a Swish activation), which empirically works a little better per parameter. The pattern in this sequence is small quality gains from smoother activations with gating; the underlying FFN structure has not changed.
An interpretability thread, starting with Geva et al.'s 2021 paper Transformer Feed-Forward Layers Are Key-Value Memories, argues that FFN weights encode factual and linguistic knowledge in a key-value form: specific input patterns in the first layer trigger specific output patterns in the second. On this view, attention handles routing and composition while FFN handles storage and retrieval — a complementary architecture that together provides much more capacity than either in isolation. The view is not uncontested but has guided a lot of subsequent work on editing and steering language models.
A transformer encoder is a stack of N identical blocks, each containing a self-attention sublayer and a feed-forward sublayer, both wrapped in residual connections and layer normalisation. Running a sequence through the stack produces a contextual representation for every token — each position's output vector is a function of the entire input.
The encoder block has two sublayers: multi-head self-attention over the input sequence, followed by a position-wise feed-forward network. In the original post-norm formulation, each sublayer is wrapped as LayerNorm(x + Sublayer(x)); in modern pre-norm systems, as x + Sublayer(LayerNorm(x)). Both sublayers map ℝ^{n×d} to ℝ^{n×d}, so blocks can be freely stacked. The input to the first block is the sum of token embeddings and positional encodings; the output of the last block (after a final layer norm in pre-norm systems) is the encoder's final representation.
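Shape-preservation is what makes stacking work. A minimal pre-norm encoder block in NumPy (single-head attention and untrained random weights, purely to show that each block maps (n, d) to (n, d)):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_block(X, params):
    """One pre-norm encoder block: X (n, d) -> (n, d)."""
    W_Q, W_K, W_V, W_O, W_1, W_2 = params
    # Sublayer 1: (single-head) self-attention, pre-norm + residual.
    Z = layer_norm(X)
    d_k = W_Q.shape[1]
    A = softmax((Z @ W_Q) @ (Z @ W_K).T / np.sqrt(d_k), axis=-1) @ (Z @ W_V) @ W_O
    X = X + A
    # Sublayer 2: position-wise feed-forward, pre-norm + residual.
    Z = layer_norm(X)
    return X + np.maximum(Z @ W_1, 0.0) @ W_2

rng = np.random.default_rng(0)
n, d, d_ff = 5, 16, 64
params = tuple(rng.normal(size=s) * 0.1
               for s in [(d, d), (d, d), (d, d), (d, d), (d, d_ff), (d_ff, d)])
X = rng.normal(size=(n, d))
for _ in range(3):            # blocks map (n, d) -> (n, d), so they stack freely
    X = encoder_block(X, params)
print(X.shape)  # (5, 16)
```

A real encoder would use multi-head attention, biases, and learned norm parameters; the skeleton — two residual-wrapped sublayers per block, N blocks in a row — is exactly as above.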
After N encoder blocks, the representation at each position has incorporated information from every other position, potentially many times over. Position 7 in the final layer is a nonlinear function of positions 1 through n in the input, mediated through N rounds of attention. This is why encoder representations are usefully described as contextual: the vector for a given token captures not only that token's identity but everything the model has learned about how that token fits into the surrounding sequence. It is the foundation of every representation-learning use of transformers.
The original paper used N = 6 layers for both its base and big models, with d_model = 512 and 1024 respectively. BERT-base used N = 12, BERT-large N = 24. Recent systems have pushed depth much further — PaLM-540B has N = 118; GPT-3 has N = 96; LLaMA 2-70B has N = 80. Exactly how deep vs wide to go is a hyperparameter subject to scaling-law analysis (§17): for a fixed parameter budget, deeper-and-narrower and shallower-and-wider produce different capability profiles, but both can be trained to similar overall performance with the right scaling.
Crucially, an encoder does bidirectional self-attention: at each position, the attention mechanism can see tokens both before and after — there is no causal mask. This is what makes an encoder useful for representation and classification but ill-suited to open-ended generation. Running an encoder on a sequence produces context-aware vectors for every token, which you can then pool or classify; it does not produce a next-token distribution, because the encoder has no notion of "next" — it sees the entire sequence at once. For generation, you need the decoder (§10).
A transformer decoder is similar to an encoder but with two key differences: its self-attention is masked so each position can only attend to earlier positions, and it includes an additional cross-attention sublayer that reads from the encoder's output. The result is an architecture designed to generate sequences token by token while conditioning on an encoded input.
The decoder block has three sublayers rather than two. First, masked self-attention: attention over the decoder's own token sequence, but with an additive mask that sets -∞ in the score matrix for all position pairs (i, j) where j > i. After the softmax these become exactly zero, so position i can only attend to positions at or before itself. This causal mask is what makes the decoder autoregressive: each output position is computed using only past context, which means the same forward pass that encodes a full training sequence also simulates next-token generation at every position simultaneously.
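The mask's effect can be verified in a few lines. A sketch (numpy; illustrative, not from any library) showing that the additive -∞ mask yields exactly-zero attention to future positions while each row remains a valid distribution:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

n = 4
scores = np.random.default_rng(1).normal(size=(n, n))

# Additive causal mask: -inf wherever j > i (strict upper triangle).
mask = np.triu(np.full((n, n), -np.inf), k=1)
weights = softmax(scores + mask)

assert np.allclose(weights.sum(-1), 1.0)        # each row is still a distribution
assert np.allclose(np.triu(weights, k=1), 0.0)  # future positions get exactly zero
```

Row i of `weights` distributes all of its probability mass over positions 0..i, which is precisely the statement that position i "can only attend to positions at or before itself".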
Second, cross-attention: a self-attention-shaped computation where queries come from the decoder's current representation but keys and values come from the encoder's final output (§11 covers this in detail). Third, a position-wise feed-forward network, identical in form to the encoder's. Each of the three sublayers is wrapped in the same residual-plus-LayerNorm scaffold as the encoder. Decoders are typically stacked to the same depth as their paired encoders.
The decoder's masking is what lets generation be autoregressive: at inference time, you feed the decoder whatever tokens you have generated so far, compute its output at the last position, sample the next token from that distribution, append it to the input, and repeat. At training time, thanks to the causal mask, you can feed the entire target sequence at once and compute the loss for next-token prediction at every position in parallel — a massive speedup over recurrent generation, which has no equivalent trick. This property, teacher forcing with a causal mask, is why decoder-style transformers train so efficiently: the forward pass naturally computes every prediction in the sequence with one matrix multiplication per layer.
In the original encoder-decoder design, the decoder's output at each position is projected to the vocabulary via a final linear layer (often tied to the input embedding matrix), yielding logits over the vocabulary at every position. Softmax over those logits gives next-token probabilities. Training minimises cross-entropy between the predicted distribution and the ground-truth next token at each position. This is the same objective that every language model uses; the encoder-decoder variant just augments it with cross-attention to a source sequence. Pure decoder-only systems (GPT-style) drop the encoder and cross-attention and are described in §14.
Cross-attention is how a decoder reads from an encoder. The mechanism is identical to self-attention — queries, keys, values, scaled dot product, softmax — but the queries come from one sequence (the decoder's current state) and the keys and values come from another (the encoder's final output). It is the single piece of machinery that makes encoder-decoder transformers work.
In the original paper's encoder-decoder, each decoder block contains a cross-attention sublayer between the masked self-attention and the feed-forward network. The input to this sublayer is the decoder's current representation; the queries are computed from that input, Q = X_dec W_Q, while the keys and values are computed from the encoder's final output, K = X_enc W_K and V = X_enc W_V. The attention computation proceeds exactly as before: softmax(QKᵀ/√d_k)V. The result is that each decoder position's output is a weighted combination of encoder-output vectors, with weights determined by how well the decoder's queries match the encoder's keys.
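A minimal cross-attention sketch (numpy; all names illustrative) makes the asymmetry concrete — queries from one sequence, keys and values from another, one output vector per decoder position:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def cross_attention(x_dec, x_enc, Wq, Wk, Wv):
    # Queries from the decoder; keys and values from the encoder output.
    Q = x_dec @ Wq                                      # (n_dec, d)
    K = x_enc @ Wk                                      # (n_enc, d)
    V = x_enc @ Wv                                      # (n_enc, d)
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # (n_dec, n_enc)
    return weights @ V                                  # (n_dec, d)

rng = np.random.default_rng(2)
n_enc, n_dec, d = 7, 3, 8
x_enc = rng.normal(size=(n_enc, d))
x_dec = rng.normal(size=(n_dec, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out = cross_attention(x_dec, x_enc, Wq, Wk, Wv)
assert out.shape == (n_dec, d)   # one retrieved vector per decoder position
```

Note that the two sequences can have different lengths: the score matrix is (n_dec, n_enc), not square, and each decoder position retrieves a soft mixture of encoder vectors.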
The intuition is that the decoder, as it generates each output token, needs to decide which parts of the input to look at. For translation, generating the English word cat probably requires attending to the encoder's representation of the French word chat. For summarisation, generating a sentence in the summary requires attending to the relevant parts of the source document. The encoder-decoder attention pattern in a trained translation system is often interpretable as an approximate soft alignment between source and target — a learned, content-dependent analog of the word-alignment tables that classical statistical machine translation systems used to compute.
Cross-attention is also the mechanism that makes many multi-modal transformers work. In Flamingo-style vision-language models, a language model cross-attends to image encoder outputs (CLIP itself, by contrast, is a contrastive dual-encoder with no cross-attention at all); in a speech model, a text decoder cross-attends to audio encoder outputs. In text-to-image systems such as Stable Diffusion, the image generator cross-attends to text encoder outputs. In every case the story is the same: one sequence of queries, another sequence of keys and values, a soft retrieval from the second conditioned on the first. The generality of this mechanism — the same one-line formula applied across modalities — is a large part of why the transformer has eaten so much of modern ML.
In decoder-only models (§14), there is no cross-attention because there is no separate encoder to cross-attend to. The entire input sequence — prompt plus generated tokens — is fed through the same stack of decoder blocks, and masked self-attention alone handles both "reading" the prompt and "writing" the response. This architectural simplification is one reason decoder-only models have scaled so well: fewer moving parts, a single unified attention pattern, and the same training objective for prompts and completions alike.
Encoder-decoder cross-attention was the mechanism that first broke NMT benchmarks in 2017 and remains the workhorse of seq2seq transformers (T5, BART, mT5, Whisper, most image captioners). Decoder-only systems (GPT, LLaMA) fold its job into self-attention over a concatenated prompt-plus-generation sequence.
Attention masks do one simple job — they add -∞ to scores in the attention matrix, which softmax then zeroes out — but they are the single most important lever for controlling what a transformer can and cannot see. Padding masks, causal masks, sliding-window masks, sparse-attention masks — they are all the same mechanism, wielded differently.
The simplest mask is the padding mask. Transformers process sequences in batches, but different sequences in a batch have different lengths. To pack a batch into a rectangular tensor, shorter sequences are padded with a reserved <PAD> token. Without intervention, attention would happily attend to these meaningless padding positions, causing the model to treat <PAD>-filled regions as real content. The padding mask sets -∞ in the attention score matrix at positions corresponding to padded tokens, so no query ever attends to them. Both encoder self-attention and decoder cross-attention use padding masks over their respective sources.
The causal mask (also called the look-ahead mask or autoregressive mask) sets -∞ in the upper-triangular portion of the attention score matrix, so position i can attend only to positions j ≤ i. This is the mechanism that makes the decoder autoregressive (§10). Without a causal mask, the decoder could see future tokens during training — including the very token it is trying to predict — and training would collapse to trivially copying the answer rather than learning to predict it. With a causal mask, training proceeds normally and the trained model can generate one token at a time at inference.
More elaborate masks carve out structured sparsity patterns. A sliding-window mask (used in Longformer, Mistral) lets each position attend only to a fixed neighbourhood of nearby positions, reducing attention cost from quadratic to linear in sequence length. A global-plus-local mask lets a handful of designated positions attend everywhere while most positions only attend locally. A block-sparse mask (used in BigBird, Sparse Transformer) divides the sequence into blocks and allows attention only within blocks plus a few global connections. All of these are variations on the same theme: decide what attention patterns your model needs, encode them as a mask, feed that mask into the standard attention computation.
Masks also encode more specialised constraints. Prefix-LM masks allow bidirectional attention within a designated prefix and causal attention thereafter, enabling a single decoder-only model to use encoder-style attention for prompts and decoder-style attention for generation. Segment masks prevent attention across document boundaries in packed training batches. FIM (fill-in-the-middle) masks support infilling tasks. The general lesson: most of what distinguishes one transformer variant from another — at the level of "what can attend to what" — is implemented as a different attention mask. The underlying attention operation is the same.
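As a sketch, a prefix-LM mask is just a causal mask with the prefix block opened up (boolean convention: True = may attend; code is illustrative, not from any library):

```python
import numpy as np

def prefix_lm_mask(n, prefix_len):
    # Start causal everywhere: position i may attend to j <= i.
    allowed = np.tril(np.ones((n, n), dtype=bool))
    # Then grant full bidirectional visibility within the prefix.
    allowed[:prefix_len, :prefix_len] = True
    return allowed

m = prefix_lm_mask(6, 3)
assert m[0, 2]        # prefix positions see "future" prefix tokens
assert not m[3, 5]    # the generation region stays causal
assert m[5, 0]        # generation can read the entire prefix
```

Converting such a boolean mask to the additive form used in the attention computation is one line: `np.where(allowed, 0.0, -np.inf)` added to the score matrix before the softmax.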
The first major architectural descendant of the original transformer kept only the encoder half. BERT, RoBERTa, DeBERTa, and their relatives are encoder-only models, trained with bidirectional objectives for representation and classification rather than generation. For a long period (2018–2022) they were the state of the art for nearly every NLP benchmark.
The defining move of encoder-only transformers is that they drop the decoder entirely, along with the causal mask and the autoregressive generation objective. What remains is an encoder stack with bidirectional self-attention: every position can attend to every other position in both directions. This is trained not for next-token prediction but for masked language modelling — a certain fraction of input tokens (15% in the original BERT) are replaced with a [MASK] symbol, and the model is trained to predict the original identities from context. Because the context is bidirectional, the model must integrate evidence from tokens on both sides to make its prediction, which pushes it to learn rich contextual representations.
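A toy sketch of the corruption step (numpy; names illustrative — real BERT additionally replaces some selections with random or unchanged tokens in an 80/10/10 split, omitted here for brevity):

```python
import numpy as np

MASK_ID = 0   # reserved [MASK] token id (illustrative choice)

def mlm_corrupt(tokens, mask_frac=0.15, rng=None):
    if rng is None:
        rng = np.random.default_rng(4)
    tokens = tokens.copy()
    n_mask = max(1, int(len(tokens) * mask_frac))
    positions = rng.choice(len(tokens), size=n_mask, replace=False)
    targets = tokens[positions].copy()   # the model must recover these
    tokens[positions] = MASK_ID
    return tokens, positions, targets

orig = np.arange(1, 21)                  # 20 toy token ids
corrupted, pos, tgt = mlm_corrupt(orig)
assert len(pos) == 3                     # 15% of 20 tokens
assert (corrupted[pos] == MASK_ID).all()
assert (tgt == orig[pos]).all()          # loss is computed only at masked positions
```

The loss is then cross-entropy over the model's predictions at `pos` against `tgt` — token-level supervision at every masked position, with bidirectional context available for each prediction.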
BERT (Devlin et al. 2018) was the breakthrough. A pair of encoders (110M-parameter base, 340M-parameter large) pretrained on BooksCorpus plus English Wikipedia, with masked LM plus a next-sentence-prediction auxiliary loss. Fine-tuning this pretrained encoder on downstream tasks — question answering, NLI, NER, sentiment, sentence classification — set new state-of-the-art results on essentially every benchmark. RoBERTa (Liu et al. 2019) was an engineering refinement: more data, longer training, dynamic masking, no next-sentence-prediction, better preprocessing. DeBERTa (He et al. 2020) added disentangled positional attention and produced another incremental but real improvement. ELECTRA (Clark et al. 2020) replaced masked LM with a more sample-efficient replaced-token-detection objective. By 2021 encoder-only fine-tuning was the default workflow for supervised NLP.
For classification and representation tasks, encoder-only has several advantages over decoder-only. Bidirectional attention gives every position full left-and-right context, which is strictly more information than causal attention's left-only view. The models can be pretrained with token-level supervision (masked LM predicts every masked position's token), which is more sample-efficient per training token than next-token prediction. And the fine-tuning recipe — stack a classifier on top of the pretrained encoder's [CLS] or pooled output, train a few epochs — is simple and reliable.
The limitation is that encoder-only models cannot generate open-ended text. They can predict masked tokens one at a time, but generating a coherent paragraph requires a decoder's autoregressive structure. As the field's interest moved from representation to generation — and from fine-tuning to zero-shot prompting — encoder-only models receded from the frontier. They remain highly useful as embedding models, classifiers, and retrievers (Sentence-BERT, E5, BGE, ColBERT all use encoder-only backbones), and continue to dominate tasks where bidirectional context matters most. But for the general-purpose pretrained models that define the 2023–2025 era, encoder-only is no longer the default choice.
The GPT family — and by extension LLaMA, Mistral, PaLM, Gemini, Claude, and essentially every frontier language model of 2025 — are decoder-only transformers. They strip out the encoder and cross-attention entirely, leaving a single stack of causally-masked self-attention blocks trained with next-token prediction. This simplicity has turned out to be the architecture that scales best.
A decoder-only transformer has exactly two sublayers per block: masked self-attention and a feed-forward network. There is no cross-attention, no separate encoder, no distinction between input and output. The model takes an arbitrary sequence of tokens (the prompt plus whatever has been generated so far) and produces a probability distribution over the next token. Training is next-token prediction on raw text: given a document, shift the target by one, compute cross-entropy at every position. At inference, generation is autoregressive: sample a token from the distribution, append it to the input, repeat.
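The training objective is a one-liner once the logits exist: shift the targets by one and take cross-entropy at every position. A numpy sketch (toy random logits; names illustrative):

```python
import numpy as np

def next_token_loss(logits, tokens):
    # logits: (n, vocab) from the decoder; tokens: (n,) input token ids.
    # Position i predicts token i+1, so targets are the inputs shifted by one.
    targets = tokens[1:]
    preds = logits[:-1]
    # Cross-entropy via a numerically stable log-softmax.
    z = preds - preds.max(-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(3)
n, vocab = 6, 10
logits = rng.normal(size=(n, vocab))
tokens = rng.integers(0, vocab, size=n)
loss = next_token_loss(logits, tokens)
assert loss > 0   # random logits give a loss near ln(vocab)
```

One forward pass produces predictions (and hence loss terms) at every position simultaneously — the parallel-training property that the causal mask makes possible.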
GPT (Radford et al. 2018) introduced the pattern; GPT-2 (Radford et al. 2019) scaled it to 1.5B parameters and demonstrated unexpectedly good zero-shot behaviour on many tasks; GPT-3 (Brown et al. 2020) scaled to 175B parameters and showed that a sufficiently large decoder-only model could perform a wide range of tasks from a few in-context examples, without any task-specific fine-tuning. Post-GPT-3, the field mostly converged on decoder-only as the default. Open models followed: GPT-NeoX, OPT, BLOOM, LLaMA 1 and 2 and 3, Mistral, Mixtral, Qwen, Gemma, DeepSeek. All share the same core architecture, differing mainly in scale, positional encoding choice (§15), efficient-attention variants (§16), and pretraining data.
The decoder-only architecture has three practical advantages at scale. First, architectural simplicity: one sublayer pattern, one attention type, one training objective. Fewer moving parts means fewer things to tune, fewer places for training instability to develop, and a cleaner hardware mapping. Second, unified input-output handling: the same mechanism that encodes the prompt generates the response, so any downstream task can be cast as conditional text generation — translation, summarisation, classification, question answering, reasoning, code — all reduce to continuing a prompt. Third, in-context learning: at sufficient scale, decoder-only models exhibit few-shot learning behaviours that encoder-only models do not obviously share, and the mechanism by which this happens remains only partially understood.
Decoder-only has weaknesses too. Causal attention gives less information per training token than bidirectional attention, so encoder-only models can be more sample-efficient at equal scale. Representation-focused tasks — retrieval, classification, clustering — still often work better with encoder-only backbones. For some applications, a prefix-LM (bidirectional attention over the input, causal attention over the output) or a proper encoder-decoder system offers the best of both. But the gravity of the field since GPT-3 has been strongly toward decoder-only; it is the architecture on which every major frontier LLM of 2024 and 2025 is built.
The decoder-only architecture won because it scales. At a few hundred million parameters, encoder-only and decoder-only are roughly matched. At 100B+ parameters, the decoder-only model's unified training and inference story — the same forward pass handles prompt and generation, the same cross-entropy loss applies everywhere — makes it dramatically easier to train and operate than the alternatives.
The sinusoidal and learned absolute positional encodings of the original transformer have largely been replaced by schemes that inject position at the attention operation rather than at the input embedding. Rotary position embeddings (RoPE) and ALiBi are the dominant choices in modern models; each makes context-length extension cheaper than the original design.
The basic problem with absolute positional encoding is that the model learns a representation for position i that is specific to i being that exact number. Positions beyond the training length have no learned representation, and the model's dependence on the specific numeric embeddings makes extrapolation fragile. The alternative is to encode relative position — how far apart two tokens are — which is both more linguistically natural (syntax and semantics generally care about relative order, not absolute position) and easier to generalise across sequence lengths.
Relative position encodings (Shaw et al. 2018) inject learned embeddings of the position difference i − j into the attention computation; T5 simplifies this to a learned scalar bias indexed by the offset i − j, added directly to each attention score. The mechanism is simple, the encoding is explicitly relative, and extrapolating to longer sequences requires only extending the bias table. T5's relative position bias bucket-encodes distance at log-scale, which limits resolution at long range but works well in practice. ALiBi (Press et al. 2021) simplifies further: a fixed, non-learned linear penalty that grows with distance, parameterised only by a per-head slope. ALiBi-trained models extrapolate smoothly to sequences much longer than their training length, a property the original sinusoidal encoding was supposed to have but mostly did not.
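ALiBi's penalty can be sketched in a few lines (numpy; the slope schedule follows the paper's 2^(−8h/n_heads) geometric sequence for power-of-two head counts — treat the exact constants as illustrative):

```python
import numpy as np

def alibi_bias(n, n_heads):
    # Per-head slopes: 2^(-8h/n_heads) for h = 1..n_heads.
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    dist = np.arange(n)[None, :] - np.arange(n)[:, None]   # j - i
    dist = np.minimum(dist, 0)      # future positions handled by the causal mask
    return slopes[:, None, None] * dist[None]              # (heads, n, n), all <= 0

bias = alibi_bias(8, 4)
assert bias.shape == (4, 8, 8)
assert (bias <= 0).all()
# The penalty grows with distance: attending further back costs more.
assert bias[0, 7, 0] < bias[0, 7, 6]
```

The bias is simply added to the attention scores before the softmax; no parameters are learned and no position is ever out of range, which is why extrapolation to unseen lengths is smooth.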
Rotary Position Embedding (RoPE; Su et al. 2021) has become the most widely adopted modern positional scheme. The idea is to rotate the query and key vectors by an angle proportional to their position in the sequence, using pairs of dimensions to encode rotations in 2D subspaces. The result is that the attention dot product between two positions depends naturally on their position difference (via the difference of rotation angles) rather than on absolute positions. RoPE preserves norms, is analytically clean, and — crucially — allows context length extension via various interpolation and scaling tricks (NTK-aware interpolation, YaRN, Dynamic NTK) that have pushed production models from 2K to 128K or even 1M context without full retraining. LLaMA, Mistral, Qwen, DeepSeek, and most recent open models use RoPE.
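A minimal RoPE sketch (numpy, single vector, the standard 10000^(−2i/d) frequencies; illustrative code, not any library's implementation) makes the relative-position property directly checkable:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate each (even, odd) dimension pair by angle pos * theta_i,
    # with theta_i = base^(-2i/d) — one 2D rotation per pair of dims.
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(5)
q, k = rng.normal(size=8), rng.normal(size=8)

# The dot product depends only on the position difference (3 in both cases):
s1 = rope(q, 10) @ rope(k, 7)
s2 = rope(q, 103) @ rope(k, 100)
assert np.isclose(s1, s2)

# Rotations preserve norms, so RoPE never changes a vector's magnitude.
assert np.isclose(np.linalg.norm(rope(q, 42)), np.linalg.norm(q))
```

The first assertion is the whole point: shifting both positions by the same amount leaves every attention score unchanged, which is what makes interpolation and scaling tricks for context extension tractable.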
Extending context length is now a significant research subfield. Beyond the positional-encoding changes above, techniques include training on longer sequences (expensive), continued pretraining with position interpolation (cheaper), context extension with LongRoPE, and architectural tricks like sliding-window attention (§16). The landscape is moving fast and the specific best-practice changes every few months, but the general direction is clear: positional encoding is increasingly something to design for long context from the start, not something to graft on later.
Standard self-attention costs O(n²) memory and compute in sequence length, which is fine at n = 512 and ruinous at n = 100,000. The efficient-attention literature is a decade of cleverness aimed at breaking that quadratic barrier while preserving quality — and a separate thread of hardware-aware reformulations that have yielded the biggest practical wins.
There are three main approaches. The first is sparse attention — use attention masks that constrain each position to attend to only a small, structured subset of others. Sparse Transformer (Child et al. 2019) uses strided patterns; Longformer (Beltagy et al. 2020) combines a sliding local window with a few designated global tokens; BigBird (Zaheer et al. 2020) adds random attention on top. All of these preserve the softmax-attention mechanism but sparsify the connectivity, reducing cost from O(n²) to O(n·k) where k is the per-token budget. They work well in practice for documents of tens of thousands of tokens.
The second is linear attention — reformulate the attention computation to avoid the n × n score matrix entirely. Linformer approximates the keys and values with a low-rank projection; Performer uses positive random features to approximate the softmax kernel; Linear Transformer replaces the softmax with a kernel feature map, which allows the matrix multiplications to be reordered to compute (KᵀV) first, yielding linear time and memory. These methods achieve genuine linearity but trade off quality — in head-to-head benchmarks they generally underperform full attention at equal parameter counts. They remain active research but have not displaced quadratic attention at frontier scale.
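The reordering trick can be verified numerically. A sketch assuming the elu(x)+1 feature map from the Linear Transformer paper (names illustrative): with the softmax gone, the quadratic and linear orderings compute the same result, but the linear one never materialises an n × n matrix.

```python
import numpy as np

def phi(x):
    # elu(x) + 1: strictly positive feature map, as in Linear Transformer.
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(6)
n, d = 64, 8
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
qf, kf = phi(Q), phi(K)

# Quadratic ordering: build the (n, n) score matrix, then weight the values.
scores = qf @ kf.T                                        # (n, n)
out_quad = (scores / scores.sum(-1, keepdims=True)) @ V

# Linear ordering: a (d, d) summary and a length-n normaliser — no (n, n) matrix.
kv = kf.T @ V                                             # (d, d)
norm = qf @ kf.sum(0)                                     # (n,)
out_lin = (qf @ kv) / norm[:, None]

assert np.allclose(out_quad, out_lin)   # identical output, O(n·d²) vs O(n²·d)
```

The equality is just associativity of matrix multiplication; it is only available because the row-wise softmax — which couples all n scores in a row — has been replaced by a per-element feature map.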
The third — and by far the most consequential in production — is hardware-aware exact attention. FlashAttention (Dao et al. 2022, with v2 and v3 improvements) observes that the bottleneck in standard attention implementations is not the FLOPs but the memory traffic between GPU HBM and on-chip SRAM. By tiling the computation to keep intermediate values in fast memory and recomputing them on the backward pass, FlashAttention achieves the exact same result as standard softmax attention but 2–4× faster and with memory that is linear in sequence length rather than quadratic. It has become the de facto standard; nearly every modern training framework uses FlashAttention or an equivalent kernel.
Two inference-time optimisations deserve separate mention. Multi-query attention (MQA; Shazeer 2019) shares a single key and value tensor across all attention heads, with each head keeping its own query projection. This dramatically shrinks the KV cache that dominates inference memory for long-context generation, at a small quality cost. Grouped-query attention (GQA; Ainslie et al. 2023) sits between MQA and full multi-head attention, with query heads grouped to share a smaller number of KV heads — a tunable tradeoff between quality and KV-cache size. LLaMA 2, Mistral, and most other recent open models use GQA. Combined with FlashAttention and KV-cache quantisation, these tricks make 32K+ context inference practical on a single GPU.
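The KV-cache arithmetic is simple enough to sketch. Assuming an illustrative 70B-class configuration — 80 layers, 64 query heads of dimension 128, 32K context, 2-byte BF16 cache entries; numbers chosen for illustration, not quoted from any model card:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_val=2):
    # Per layer and position the cache stores K and V:
    # 2 * n_kv_heads * head_dim values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_val

mha = kv_cache_bytes(80, 64, 128, 32_768)   # full multi-head: 64 KV heads
gqa = kv_cache_bytes(80, 8, 128, 32_768)    # grouped-query: 8 KV heads
mqa = kv_cache_bytes(80, 1, 128, 32_768)    # multi-query: 1 shared KV head

assert mqa < gqa < mha
assert mha // gqa == 8   # the cache shrinks by exactly the grouping factor
print(f"MHA {mha / 2**30:.1f} GiB, GQA {gqa / 2**30:.1f} GiB, MQA {mqa / 2**30:.2f} GiB")
```

Under these assumptions the full multi-head cache is 80 GiB at 32K context — larger than the GPU — while the 8-head GQA cache is 10 GiB, which is why GQA is the default in recent open models.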
Scaling a transformer from 100M to 100B parameters is not a matter of setting a bigger config and hitting run. Every order of magnitude exposes new failure modes — loss spikes, gradient explosions, attention entropy collapse — and the best-practice recipe for large-scale training is a collection of mutually reinforcing choices that the community has only converged on in the last few years.
The depth-vs-width tradeoff is the first scaling decision. For a fixed parameter budget you can allocate to more layers (deeper) or to wider per-layer dimensions (wider). Empirically both work, but deeper models are harder to train — gradient flow across many layers is fragile — and wider models are more communication-efficient under tensor parallelism. Scaling laws (Kaplan et al. 2020; Hoffmann et al. 2022's Chinchilla analysis) give quantitative guidance: for a given compute budget, there is an optimal ratio of parameters to training tokens, and within that, a rough optimal aspect ratio for depth vs width. Modern frontier models tend to be wider than they are deep: GPT-3's 175B has 96 layers with d_model = 12288; LLaMA 2-70B has 80 layers with d_model = 8192.
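The rule of thumb behind these numbers: ignoring embeddings, a transformer layer holds roughly 12·d_model² parameters — 4·d² for the Q/K/V/output projections plus 8·d² for a 4×-wide FFN. A sketch (the formula is a standard back-of-envelope estimate, not an exact count for any particular model):

```python
def approx_params(n_layers, d_model):
    # 4*d^2 (attention projections) + 8*d^2 (4x-wide FFN), per layer.
    return 12 * n_layers * d_model ** 2

gpt3 = approx_params(96, 12288)        # ~174B, close to the quoted 175B
llama2_70b = approx_params(80, 8192)   # ~64B; rough, as LLaMA's FFN differs

assert 160e9 < gpt3 < 190e9
assert 55e9 < llama2_70b < 75e9
```

Because the count is linear in depth but quadratic in width, doubling d_model costs four times as many parameters as doubling the layer count — the basic tension behind every depth-vs-width decision.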
Initialisation and normalisation together govern training stability at scale. Xavier and He initialisations, designed for feed-forward networks, do not quite work for transformers; modern schemes scale residual-branch weights by a factor that depends on depth (so that the total variance through many layers stays bounded) and often initialise the final projection of each sublayer at a smaller scale. Pre-norm (§7) largely supersedes post-norm at scale. RMSNorm replaces layer norm in many recent models. Combined, these choices eliminate much of the training instability that plagued the first wave of giant models.
Activation functions and mixed precision are the next two knobs. ReLU has been replaced by GELU, SwiGLU, or GeGLU in most modern FFN sublayers — the smoother, gated variants offer small but consistent quality gains and tend to be more stable at scale. Training in FP16 or BF16 (with selective FP32 accumulations) roughly doubles throughput relative to pure FP32, at the cost of requiring careful loss scaling and — occasionally — recovery from loss spikes where FP16 underflow corrupts gradients. BF16 has largely supplanted FP16 for large-scale training because its wider exponent range avoids most underflow issues. Inference quantisation to INT8, INT4, or lower is now routine and is how consumer-grade LLMs fit on single GPUs.
Finally, training parallelism. At scale, a single model does not fit on a single GPU. Data parallelism replicates the model across GPUs and splits the batch; tensor parallelism splits individual weight matrices across GPUs; pipeline parallelism splits layers across GPUs; ZeRO and FSDP shard optimizer state, gradients, and parameters across the data-parallel dimension. Megatron-LM, DeepSpeed, and FSDP offer different combinations of these approaches, all with the common goal of keeping a trillion-parameter model training without any single GPU needing to hold the whole thing. The specifics are beyond this chapter, but the conceptual point matters: modern transformer training is as much a distributed-systems problem as it is a numerical-optimisation problem.
The transformer began as an NLP architecture. Within five years it was also the state of the art in vision, speech, protein structure, reinforcement learning, and multi-modal modelling. The same attention-plus-FFN-plus-residual template, trained on increasingly ambitious data, has eaten most of the map of applied machine learning.
In computer vision, the Vision Transformer (ViT; Dosovitskiy et al. 2020) demonstrated that images could be treated as sequences — divide an image into patches, flatten each into a vector, add positional encoding, pass through a transformer encoder — and that the resulting model could match or exceed CNNs on ImageNet classification given enough data. DETR applied transformers to object detection, Swin Transformer added a hierarchical structure that reintroduced some CNN-like inductive bias, and SAM (Segment Anything) became the dominant segmentation model. The vision-language models of 2023–2025 — CLIP, BLIP, LLaVA, Flamingo, GPT-4V, Claude 3, Gemini — all use transformer backbones for both modalities.
In speech, Conformer (a hybrid of self-attention and convolution) took over automatic speech recognition in 2020. Whisper (Radford et al. 2022) trained a straightforward encoder-decoder transformer on 680,000 hours of multilingual speech and became the effectively universal baseline for ASR. Modern TTS systems (VALL-E, StyleTTS 2) and end-to-end speech models (GPT-4o's voice mode) are decoder-only transformers operating on audio codec tokens rather than text tokens.
In biology, AlphaFold 2 (Jumper et al. 2021) used a specialised transformer-like attention mechanism (the Evoformer) over multiple sequence alignments to predict protein structure to near-experimental accuracy — arguably the most consequential scientific application of the architecture. AlphaFold 3 extended this to arbitrary biomolecular complexes. ESM (Lin et al. 2023) demonstrated that even without the MSA input, a large transformer trained on protein sequences alone learns structure implicitly in its representations. Protein language models and genome language models are now established subfields.
In reinforcement learning, Decision Transformer (Chen et al. 2021) reformulated RL as a sequence-modelling problem — given a trajectory of states, actions, and desired returns, predict the next action — and showed that a causal transformer trained on offline data could match or exceed classical RL algorithms on many benchmarks. Gato (Reed et al. 2022) trained a single transformer on 604 tasks spanning text, images, and physical control. And in generative modelling more broadly, Diffusion Transformers (DiT) replaced the U-Net backbone of diffusion models with a transformer, which is now the architectural choice for Stable Diffusion 3 and Sora.
For most of its history the transformer has looked unreplaceable. A small but growing subfield explores alternatives: state-space models (S4, Mamba) that scale linearly in sequence length and match transformer quality on some language benchmarks; RWKV, a recurrent formulation with transformer-like parallel training; Hyena and related long-convolution architectures. None has displaced the transformer at frontier scale as of 2025, but the landscape is active and the question of what comes next is genuinely open. The next chapter in this series — Pretraining Paradigms — takes the architecture described here as given and asks the question that has shaped the last five years of applied ML: what do we train it on, and to do what?
The transformer is now the dominant architecture across most of applied ML, not because it has no weaknesses but because its weaknesses are easier to engineer around than the weaknesses of the alternatives. Scaling works; training is tractable; the same machinery transfers across modalities. Any successor will need to match all three properties.
The transformer literature is vast, fast-moving, and — because so much of it lives on arXiv rather than in journals — spread out. The selection below is curated to give you the route from the original paper to the modern frontier. Read Vaswani et al. first. Read the GPT and BERT papers second. Everything else branches from there.
The canonical deep-learning textbook. Predates the transformer but provides the foundation — residual connections, layer norm, attention as bridge between sequences, the RNN story the transformer replaced. Chapter 10 on sequence modelling is the essential background for understanding what the transformer inherited and what it jettisoned.
The freely available draft of the standard NLP textbook now has excellent chapters dedicated to transformers (Chapter 10), large language models (Chapter 11), and fine-tuning (Chapter 12). Consistently well-explained, with the diagrams and notation used in most of the field.
A practical, code-first treatment of the transformer from three Hugging Face engineers. The book walks through the architecture and its training using the transformers library, covering encoder-only, decoder-only, and encoder-decoder variants with hands-on examples. The right book if you want to build rather than just read.
A line-by-line walkthrough of the original transformer paper, with executable PyTorch code interleaved with the paper's text. The single best way to understand the architecture end-to-end — every detail the paper glosses over is pinned down here in working code. Updated periodically to track modern conventions.
The most widely read informal introduction to the transformer — a series of diagrams that walk through self-attention, multi-head attention, and the encoder-decoder structure. Ideal for first contact with the architecture. Alammar's follow-ups on BERT and GPT are equally good.
A careful, mechanistic rewrite of transformer mathematics aimed at understanding what individual attention heads do. The paper's decomposition of attention as a sum of per-head operations on the residual stream is the conceptual scaffolding behind most modern interpretability research. Essential if you want to think about transformers internally rather than just externally.
The original transformer paper. Introduces self-attention, multi-head attention, scaled dot product, sinusoidal positional encoding, and the encoder-decoder stack — all of the pieces this chapter is about. Short (15 pages), clear, and the mathematical notation is still the standard. Read this before anything else.
The paper that launched the encoder-only branch of the transformer family. Introduces masked language modelling as a pretraining objective and demonstrates that a single pretrained encoder can be fine-tuned to state-of-the-art results on essentially every NLP benchmark. The single most cited NLP paper of the last decade.
The paper that launched the decoder-only branch. A generative pretraining objective (next-token prediction) followed by task-specific fine-tuning beat every task-specific architecture of its day. Overshadowed at the time by BERT; in retrospect the architecture whose descendants won.
Scales GPT-1 from 117M to 1.5B parameters and demonstrates that sufficiently large language models exhibit surprising zero-shot task performance without any fine-tuning. The paper that first made "emergence" a real topic; its staged initial release is also a case study in the complications of releasing frontier models responsibly.
The GPT-3 paper. 175B parameters, and the demonstration that in-context learning from a few examples could match or exceed fine-tuning on many tasks. The paper that shifted the field's emphasis from task-specific models to general-purpose ones accessed via prompting — the paradigm this chapter's architecture now lives in.
T5 is the modern reference point for encoder-decoder transformers, and the paper is an unusually thorough ablation study — pretraining objectives, architecture choices, data mixtures, all tested at scale. If you want to see how encoder-decoder, encoder-only, and decoder-only compare head-to-head under matched conditions, this is the paper.
The original layer-norm paper. Predates the transformer but supplies its second-most-important piece of training-stability machinery (after residual connections). Compact and clearly written.
The ResNet paper, which introduced the residual connections that the transformer borrowed wholesale. The identity shortcut that makes deep stacks trainable is the single most important architectural trick in post-2015 deep learning.
The paper that introduced attention. Bolted onto a recurrent encoder-decoder, the attention mechanism of this paper was the seed idea that the original transformer paper then took to its logical conclusion. Reading it is the best way to see why the transformer is not as strange a leap as it first appears.
The original encoder-decoder paper, using stacked LSTMs. The architecture the transformer replaced, and the setting in which encoder-decoder structure (and cross-attention in a crude form) was first explored for NLP.
The other significant encoder-decoder transformer, using a denoising objective that combines masked-LM-style corruption with seq2seq training. Highlighted here as the complement to T5 — a slightly different take on what encoder-decoder pretraining should look like.
A careful engineering study of BERT's training recipe. Bigger batches, more data, longer training, no next-sentence-prediction, dynamic masking. RoBERTa is the encoder-only model you should actually use rather than BERT, and the paper is a lesson in how much of pretraining quality is recipe rather than architecture.
An influential interpretability study of BERT's attention heads. Shows that specific heads align with specific linguistic phenomena — syntactic dependencies, coreference, positional patterns — providing empirical support for the multi-head-diversity argument in §5.
The paper that refocused attention (no pun intended) on the feed-forward sublayer, arguing that FFN weights encode factual and linguistic knowledge as key-value memories. The backbone of a substantial subsequent literature on model editing and knowledge localisation.
The original scaling-laws paper. Empirical regularities relating loss to parameters, data, and compute, across six orders of magnitude. The paper that turned scale from a tactic into a research programme, and a prerequisite for understanding the decisions behind GPT-3 and its successors.
A corrective to Kaplan et al., showing that previous large models had been trained with too few tokens for their parameter count. The "Chinchilla" prescription — roughly 20 training tokens per parameter — now guides most frontier training runs and explains the heavy data emphasis of post-2022 models.
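The prescription itself is just arithmetic. A minimal sketch of the two rules of thumb most often quoted from this literature — ~20 tokens per parameter for compute-optimal training, and the standard ~6ND estimate of dense-transformer training FLOPs (both approximations, not exact results from the paper):

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Rough compute-optimal token count under the Chinchilla rule of thumb."""
    return n_params * tokens_per_param

def training_flops(n_params, n_tokens):
    """Standard back-of-envelope ~6ND estimate of dense training FLOPs."""
    return 6 * n_params * n_tokens

# A 70B-parameter model wants on the order of 1.4 trillion training tokens:
tokens = chinchilla_tokens(70e9)
flops = training_flops(70e9, tokens)
```

Running the numbers like this makes the paper's point concrete: at fixed compute, a smaller model trained on more tokens beats a larger one trained on fewer.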
The RoPE paper. Introduces the rotation-based positional encoding that has become the default in LLaMA, Mistral, Qwen, and most modern open models. The mathematical setup in §3 of the paper is the single clearest exposition of why rotating queries and keys works.
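The core property is easy to verify in two dimensions: if each query and key pair is rotated by an angle proportional to its position, their dot product depends only on the *relative* offset between positions. A toy demonstration (one 2-D frequency; the real encoding applies this rotation to many 2-D slices of each head):

```python
import math

def rotate(vec, pos, theta=1.0):
    """Rotate a 2-D query/key pair by the position-dependent angle pos*theta."""
    x, y = vec
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (0.3, -1.2), (0.8, 0.5)
# The rotated dot product depends only on the relative offset m - n:
a = dot(rotate(q, 5), rotate(k, 3))   # positions 5 and 3, offset 2
b = dot(rotate(q, 9), rotate(k, 7))   # positions 9 and 7, offset 2
print(abs(a - b) < 1e-9)  # → True
```

This is why RoPE injects relative position into attention without any additive position embedding: the rotation composes directly with the QK dot product.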
The ALiBi paper. A different approach to positional encoding that uses a fixed linear penalty on attention scores, with the virtue of extrapolating cleanly to sequences much longer than training. Used in MPT and a handful of other models.
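ALiBi's mechanism fits in one function: subtract from each pre-softmax attention score a per-head constant times the distance between query and key. A minimal sketch for a single head (the paper assigns each head a different slope from a geometric sequence):

```python
def alibi_bias(n, slope):
    """ALiBi bias matrix for a causal sequence of length n, one head's slope.

    Entry [i][j] is added to the attention score of query i looking at
    key j: zero at the current token, increasingly negative further back,
    and -inf for future positions (the causal mask)."""
    return [[slope * (j - i) if j <= i else float("-inf")
             for j in range(n)]
            for i in range(n)]

bias = alibi_bias(4, slope=0.5)
# Row 3: attends freely to itself, pays 0.5 per step of distance behind it.
print(bias[3])  # → [-1.5, -1.0, -0.5, 0.0]
```

Because the penalty is a fixed function of distance rather than a learned embedding, nothing about it is tied to the training length — which is exactly why it extrapolates.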
The paper behind the FlashAttention kernel. A beautifully engineered combination of tiling, recomputation, and memory-hierarchy awareness that delivers 2–4× speedup over standard attention with no loss in quality. The default on every modern training stack, and a model for how hardware-aware ML engineering should look.
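The trick that makes the tiling possible is the online softmax: the normaliser can be accumulated tile by tile with a running maximum and a rescaling step, so the full score matrix never needs to be materialised. A pure-Python sketch of that accumulation (the kernel does this fused with the value accumulation, on GPU tiles):

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def online_softmax(tiles):
    """Softmax over a stream of score tiles using a running max and sum —
    the rescaling trick at the heart of FlashAttention-style kernels."""
    m, s, out = float("-inf"), 0.0, []
    for tile in tiles:
        new_m = max(m, max(tile))
        scale = math.exp(m - new_m)   # rescale everything seen so far
        out = [v * scale for v in out]
        s = s * scale
        for x in tile:
            e = math.exp(x - new_m)
            out.append(e)
            s += e
        m = new_m
    return [v / s for v in out]

scores = [0.1, 2.3, -1.0, 0.7, 3.1, 0.0]
full = softmax(scores)
tiled = online_softmax([scores[:2], scores[2:4], scores[4:]])
print(all(abs(a - b) < 1e-9 for a, b in zip(full, tiled)))  # → True
```

The same rescaling applied to partial output accumulators is what lets the kernel trade a small amount of recomputation for never touching slow memory with the N×N score matrix.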
Successor papers. v2 doubles throughput with better parallelism; v3 targets Hopper-era hardware (H100, H200) with asynchrony and FP8. The pace of improvement on the attention kernel alone is a lesson in how much low-level engineering still matters at frontier scale.
The terse original paper on multi-query attention. Six pages, one idea: share keys and values across heads to dramatically shrink the KV cache at inference. A small quality cost in exchange for a large operational gain. The foundation on which GQA and modern long-context inference are built.
The grouped-query attention paper. Sits between MQA and full multi-head attention with a tunable number of KV groups, recovering most of MQA's inference savings with very little quality loss. Used in LLaMA 2/3, Mistral, and most modern open models.
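The whole MHA/GQA/MQA spectrum reduces to one question: how many distinct KV heads do the query heads share? A sketch of the mapping and the resulting KV-cache saving (the 32-query/8-KV configuration below is illustrative, in the style of LLaMA-2-70B):

```python
def kv_head_for(query_head, n_heads, n_kv_heads):
    """Which KV head a given query head reads from under grouped-query attention.

    n_kv_heads == n_heads  -> standard multi-head attention
    n_kv_heads == 1        -> multi-query attention
    in between             -> grouped-query attention
    """
    group_size = n_heads // n_kv_heads
    return query_head // group_size

# 32 query heads sharing 8 KV heads: consecutive groups of 4 share a KV head.
mapping = [kv_head_for(h, 32, 8) for h in range(32)]
print(mapping[:8])  # → [0, 0, 0, 0, 1, 1, 1, 1]
# The KV cache shrinks by the group factor: 4x fewer K and V tensors to store.
```

The tunable n_kv_heads is the paper's contribution: it lets a model sit anywhere between MQA's memory savings and MHA's quality.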
The paper that introduced sliding-window plus global-token attention for long documents. The combination of dilated sliding windows and a small set of global attention points has influenced most subsequent sparse-attention schemes.
Local + global + random attention, proving that carefully chosen sparse attention can match full attention theoretically. An important companion to Longformer and one of the canonical references for sparse attention patterns.
The Vision Transformer. Demonstrated that a plain transformer encoder, given image patches as tokens, could match or exceed CNNs on ImageNet. The paper that started the transformer's conquest of vision.
Reintroduces hierarchical structure and locality into ViT, producing a vision transformer that is more practical for dense-prediction tasks. One of the most widely deployed vision-transformer variants.
A straightforward encoder-decoder transformer trained on 680,000 hours of multilingual speech with simple next-token prediction. Became the default baseline for ASR. A lesson in how far the transformer plus data and scale can go without architectural exoticism.
The AlphaFold 2 paper. Uses a specialised transformer-style attention mechanism (Evoformer) over multiple sequence alignments and pair representations. Arguably the most scientifically consequential single application of the transformer to date.
Reformulates RL as sequence prediction: given a sequence of states, actions, and returns, predict the next action. The paper that showed the transformer architecture could absorb reinforcement learning as another instance of sequence modelling.
The most serious post-transformer architectural contender of the last two years. A selective state-space model with linear scaling in sequence length that matches or exceeds transformer quality on some language benchmarks. Whether it generalises to the frontier is the open question.
The LLaMA paper. A carefully engineered decoder-only transformer — pre-norm, RMSNorm, SwiGLU, RoPE — trained on 1–1.4T tokens. The model release that triggered the current open-LLM ecosystem, and the reference architecture for most 2023–2025 open-source work.
A surgically precise modification to residual scaling that stabilises training at extreme depth. Demonstrates a 1000-layer transformer training smoothly. A case study in how the seemingly mundane choice of initialisation and residual scaling governs scaling behaviour.
The library that effectively standardised access to pretrained transformers. Implements hundreds of model architectures under a common API, with tokenisers, training utilities, and a model hub of hundreds of thousands of checkpoints. The default starting point for any applied transformer work.
The official FlashAttention CUDA kernels, now integrated into PyTorch's scaled_dot_product_attention and every major training framework. Reading the implementation is an education in how to write fast, memory-aware ML kernels.
A library of efficient transformer building blocks from Meta, including memory-efficient attention, sparse patterns, and composable block configurations. Widely used for ViT-style vision models and as a research sandbox for attention variants.
NVIDIA's reference large-scale transformer training code. Implements tensor parallelism, pipeline parallelism, and sequence parallelism, and has been used to train many of the largest open models (GPT-NeoX, BLOOM, OPT, Mistral, MPT). The single best place to see how frontier training is actually done.
Microsoft's training and inference framework, home of the ZeRO family of optimizer-state-sharding techniques. Its partner library DeepSpeed-Inference offers efficient serving of very large transformers.
A high-throughput LLM serving system built around PagedAttention — an operating-systems-style virtual memory scheme for the KV cache. Has become the open-source reference for high-QPS LLM inference, and a lesson in how much of serving performance is memory management rather than compute.
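The core idea is an OS-style page table for the KV cache: each sequence owns a list of fixed-size blocks drawn from a shared pool, so memory is allocated on demand rather than reserved at maximum length per request. A toy sketch of the bookkeeping (names and structure are illustrative, not vLLM's actual internals):

```python
class PagedKVCache:
    """Toy sketch of PagedAttention-style KV-cache management: sequences own
    lists of fixed-size physical blocks from a shared free pool, page-table
    style, instead of one contiguous max-length buffer per sequence."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # shared pool of physical blocks
        self.tables = {}                     # seq_id -> list of block ids
        self.lengths = {}                    # seq_id -> tokens written

    def append_token(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:
            table.append(self.free.pop())    # allocate a new block on demand
        self.lengths[seq_id] = length + 1

    def free_sequence(self, seq_id):
        # Finished requests return their blocks to the pool immediately.
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(6):
    cache.append_token("req-1")  # 6 tokens -> 2 blocks, not a max-length buffer
print(len(cache.tables["req-1"]), len(cache.free))  # → 2 6
```

Near-zero fragmentation plus immediate reclamation is what lets a server batch far more concurrent sequences into the same GPU memory — the memory-management point the annotation above makes.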
A library for mechanistic-interpretability work on transformers, with rich hooks for intermediate activations and a clean decomposition of the residual stream. The workhorse tool of modern interpretability research.