The last chapter ended with attention as a patch: Bahdanau, Cho, and Bengio's 2014 soft-alignment mechanism, introduced to fix the fixed-length-vector bottleneck in RNN sequence-to-sequence models, quietly opened a door that three years later swallowed the rest of sequence modelling. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin's 2017 Attention Is All You Need demonstrated that a network built only from attention — no recurrence, no convolutions — could train faster and match or beat every recurrent sequence model. By 2018 the transformer was winning every NLP benchmark; within a few more years it had absorbed computer vision (ViT), speech (Whisper, Conformer), biology (AlphaFold), reinforcement learning (Decision Transformer), and almost every other domain that deep learning touched. This chapter is about the mechanism at the centre of that revolution. We build attention from first principles: the query–key–value abstraction that came out of information retrieval, the scaled dot-product operation that generalises Bahdanau and Luong, the multi-head architecture that lets different heads attend to different relations, the positional encodings that recover word order after attention's permutation invariance, and the masking schemes that make causal decoding work. We then examine the cost models (quadratic in sequence length, where the efficient-attention literature tries to push back) and the interpretability programme (what attention patterns reveal, and what they do not). The story is the clearest case in recent ML history of a single architectural idea replacing an entire research programme: attention did to RNNs in three years what convolution did to hand-engineered image features in one. This chapter is about why it worked, how it works, and what its limits are.
Sections one through four build the attention primitive from scratch. Section one motivates the mechanism by returning to the encoder–decoder bottleneck of the last chapter and asking what happens if the decoder could look back at the encoder's hidden states at every step; the answer is attention, and the formulation generalises well beyond sequence-to-sequence models. Section two catalogues the soft-versus-hard-attention distinction and the additive-versus-multiplicative score variants that dominated the 2014–2017 literature, ending with the scaled dot-product form that the transformer adopted as default. Section three introduces the query–key–value abstraction that made attention into a general retrieval primitive — a learnable soft-memory lookup that can be invoked anywhere in a network. Section four is the theoretical core of the chapter: self-attention — the special case in which queries, keys, and values all come from the same sequence, making the operation a differentiable message-passing layer whose receptive field is all other positions at once.
Sections five through eight build the architecture that wraps around self-attention. Section five develops multi-head attention — running h parallel attention operations with different learned projections, concatenating the outputs, and projecting back down — and argues that the multiplicity of heads, rather than their individual capacity, is where the expressive power comes from. Section six is the positional encoding problem: self-attention is permutation-equivariant, so the model cannot distinguish "the dog chased the cat" from "the cat chased the dog" without an explicit encoding of position. We cover Vaswani et al.'s 2017 sinusoidal encodings, learned positional embeddings (GPT, BERT), relative position biases (T5, Transformer-XL), and the modern rotary positional embeddings (RoPE) that dominate contemporary LLMs. Section seven assembles the pieces into the transformer block — multi-head self-attention, feed-forward network, residual connections, layer normalisation — and discusses the pre-norm versus post-norm variants. Section eight distinguishes encoder, decoder, and encoder–decoder configurations and maps them to real models (BERT, GPT, T5).
Sections nine through twelve cover the operational details that attention requires in practice. Section nine is cross-attention — the decoder-queries-encoder mechanism that inherits directly from Bahdanau 2014 and remains the backbone of seq2seq transformers, retrieval-augmented generation, and multi-modal models. Section ten is the causal mask — the lower-triangular attention mask that turns a permutation-equivariant operation into an autoregressive one, and the reason GPT-style language models can be trained with teacher forcing in parallel. Section eleven covers attention patterns and what they reveal — induction heads, syntactic attention, positional attention, the "attention is not explanation" debate — drawing on the circuits-interpretability literature (Elhage et al. 2021; Anthropic circuits thread). Section twelve is efficient attention — the linear-time variants (Performer, Linformer, Linear Attention), sparse variants (Longformer, BigBird), and the empirical finding that simpler is often better than clever.
Sections thirteen through seventeen examine attention's reach and limits. Section thirteen covers FlashAttention (Dao, Fu, Ermon, Rudra, Ré 2022) — the IO-aware implementation that made attention memory-efficient on modern GPUs and is now the default kernel in every serious training stack. Section fourteen explores long-context methods: sliding window, sink tokens (StreamingLLM), landmark attention, RoPE scaling (NTK-aware, YaRN), and the modern million-token-context models. Section fifteen is the beyond-text story — Vision Transformers (Dosovitskiy et al. 2020), Perceiver (Jaegle et al. 2021), AlphaFold's axial attention, Decision Transformer — the evidence that attention generalises well beyond language. Section sixteen is interpretability — what attention patterns tell us about what a model is doing, and the current state of mechanistic interpretability's attempt to understand transformers from first principles. Section seventeen covers alternatives to attention — state-space models (Mamba), linear RNNs, MLP-Mixer — that revisit the core design question and propose different answers. The closing section places attention in the larger landscape: as the primitive that made the last eight years of frontier ML possible, as the reason language modelling became the universal task, and as the operation every post-2017 model is ultimately built out of.
Recurrence forces every piece of information in a sequence through a single fixed-size bottleneck. Attention lets the model look — directly, selectively, and in parallel — at any position it needs.
The sequence-to-sequence models of Sutskever, Vinyals, and Le (2014) compressed an entire source sentence into a single hidden vector before producing any output. That vector carried everything: subjects, verbs, rare names, negations, the tone of a question. For short inputs it worked remarkably well. For anything longer than a couple of clauses, accuracy collapsed. The bottleneck was not subtle — it was the architecture telling the model to forget.
Bahdanau, Cho, and Bengio framed the alternative in 2014 with brutal economy. Instead of forcing the encoder to summarise the whole source into one vector, let the decoder align to the source at every output step. At each decoding position, compute a distribution over source positions — a soft pointer — and use it to mix the encoder states into a context vector tailored for that step. The decoder is no longer reading a summary; it is reading the source directly, focused differently each time it produces a token.
The core intuition is content-based lookup. The decoder holds a query (what am I trying to produce next?); the encoder exposes a set of keys and values (what does each source position represent?); attention computes how similar the query is to each key, turns those similarities into weights, and returns the corresponding weighted sum of values. It is associative memory done with soft assignments instead of hard indices, learned end-to-end by gradient descent.
The attention move. Replace compress-then-decode with look-up-while-decoding. The encoder keeps all of its hidden states. The decoder chooses, at every step, which of them matter now.
Once the principle was named, it escaped its original setting within three years. Attention over encoder states became attention within the encoder — self-attention — and then the only primitive the model used. The transformer (Vaswani et al., 2017) kept attention and threw out recurrence entirely, and the shape of modern deep learning was decided.
Attention exists in two flavours — a differentiable weighted average and a discrete sampled choice. The soft variant won, for reasons that are really about gradients.
Soft attention assigns a probability to every position and returns the expectation: context = Σᵢ αᵢ · vᵢ, where α is a softmax over scores. Every weight is nonzero; every gradient flows. The operation is smooth, fully differentiable, and trains with ordinary backpropagation. This is the form Bahdanau et al. (2014) introduced and the one essentially every modern system uses.
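The expectation is a few lines of code. A minimal NumPy sketch (the scores and values here are arbitrary illustrative numbers, not from any trained model):

```python
import numpy as np

def soft_attention(scores, values):
    """Turn raw scores into a softmax distribution and return the expectation."""
    weights = np.exp(scores - scores.max())   # subtract max for numerical stability
    weights /= weights.sum()                  # α: a distribution over positions
    return weights @ values                   # context = Σ_i α_i · v_i

# Three source positions, each exposing a 2-d value vector.
values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
scores = np.array([2.0, 0.1, -1.0])           # query-key similarities
context = soft_attention(scores, values)      # mostly the first value, softly mixed
```

Every weight is nonzero, so the gradient of `context` with respect to every score and every value is nonzero too; that is the property hard attention gives up.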
Hard attention, by contrast, samples a single position (or a small set) and attends only to it. It is faithful to the metaphor — we really do look at one place — but it introduces a discrete choice into the computation. Gradients cannot pass through the sampling step; training relies on REINFORCE-style estimators with high variance, or on Gumbel-softmax relaxations. Xu et al. (2015), in "Show, Attend and Tell," used both on image captioning and found soft attention easier to train while hard attention occasionally found sharper alignments. The trade-off is not decided by elegance — it is decided by how much noise the optimiser can tolerate.
Within soft attention, the scoring function is itself a design choice. Bahdanau used an additive score: a small MLP combining query and key, scoring each pair. Luong, Pham, and Manning (2015) compared additive with multiplicative (dot-product) scoring and found the multiplicative version cheaper and comparable in quality when dimensions were modest. The dot product wins on hardware: it is a single matmul, exactly the operation GPUs are built to do fastest.
Scaled dot product. For query dimension d_k, the variance of the raw dot product grows with d_k, so its typical magnitude grows with √d_k and pushes the softmax into saturation. Dividing by √d_k keeps scores in a stable range — a tiny change with an outsized effect on training stability.
The scaling trick, named almost as an afterthought in the transformer paper, is what makes multiplicative attention trainable at scale. Nearly every large model since uses it without reflection — a fix so natural that it disappears into the definition.
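The saturation effect is easy to see numerically. A small sketch (random unit-variance vectors, arbitrary dimensions) comparing the softmax with and without the √d_k divisor:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 512
q = rng.standard_normal(d_k)          # one query
K = rng.standard_normal((10, d_k))    # ten keys, unit-variance entries

raw = K @ q                  # entries have std ≈ √d_k ≈ 22.6: softmax saturates
scaled = raw / np.sqrt(d_k)  # entries have std ≈ 1: softmax stays soft
peaky, soft = softmax(raw), softmax(scaled)
```

The unscaled distribution concentrates almost all its mass on one key; through a saturated softmax, almost no gradient reaches the other nine.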
The three-way split of attention inputs is the clean abstraction that turned attention from a trick into a general-purpose primitive.
In the sequence-to-sequence era, attention was tightly coupled to its setting: a decoder hidden state scored against encoder hidden states to produce a context vector. The transformer paper refactored this into a general pattern. Every attention operation consumes three tensors: queries Q, keys K, and values V. Each is produced by a linear projection of some input — possibly the same input, possibly different ones.
The formula is almost anticlimactic:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
QKᵀ gives a matrix of pairwise similarities. The softmax turns each row into a distribution over keys. Multiplying by V returns, for every query, a weighted sum of values — the query's personalised read of the value store. Three matmuls, one softmax, one scalar divide; this is the whole operation.
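The whole operation fits in a short function. A NumPy sketch (toy shapes, random inputs; a real implementation would batch this and run it on an accelerator):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QKᵀ/√d_k) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)   # stabilise the softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)             # each row: distribution over keys
    return w @ V                                   # one weighted read per query

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))   # 4 queries of dimension 8
K = rng.standard_normal((6, 8))   # 6 keys...
V = rng.standard_normal((6, 8))   # ...with matching values
out = attention(Q, K, V)          # (4, 8): a personalised read per query
```

Three matmuls, one softmax, one scalar divide, exactly as described.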
Why separate keys from values? The keys decide who to attend to; the values decide what to return. Decoupling them lets the model learn addresses and contents independently — a routing decision distinct from the payload.
This abstraction is the reason attention became ubiquitous. Pick where Q, K, and V come from and you recover essentially every variant. Encoder–decoder attention: Q from the decoder, K and V from the encoder. Self-attention: all three from the same sequence. Cross-attention in a vision model: Q from text tokens, K and V from image patches. Retrieval-augmented generation: Q from the current context, K and V from a stored document collection. The same three-line kernel, endlessly reused.
It is also why attention is often described as a differentiable hash table. A hash table takes a query, matches it against keys, and returns an associated value — an exact, discrete lookup. Attention does the same thing with soft matching and weighted averaging. It is the kind of data structure you would have invented by hand and been surprised to find was trainable.
When Q, K, and V all come from the same sequence, every position can attend directly to every other position. Recurrence becomes unnecessary.
Self-attention is the special case where a sequence attends to itself. Each token produces its own query, key, and value by separate linear projections. The query at position i scores against every key in the sequence, producing attention weights; those weights combine the values into a new representation for position i. Every position is updated in parallel, each with its own view of the whole.
Three consequences make this the primitive of modern deep learning. First, the path length between any two tokens is one. An RNN has to propagate information through t hidden-state updates to connect the first and last tokens; self-attention connects them in a single step. Long-range dependencies are no longer a credit-assignment problem. Second, the computation parallelises perfectly. Every query–key–value triple can be computed independently, and the QKᵀ matmul uses the full width of the GPU. This is why transformers train so much faster than RNNs on modern hardware — not because the FLOP count is smaller, but because it is far more parallel. Third, the operation is permutation-equivariant. Attention, by itself, has no notion of position — a property that is both a feature (architectural neutrality) and a bug (something else must supply order).
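The third consequence can be checked directly. A sketch with random, untrained projection matrices (illustrative shapes only): shuffling the input rows shuffles the output rows identically.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Queries, keys, and values all come from the same sequence X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    s = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V                      # every position updated in parallel

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
W_q, W_k, W_v = (rng.standard_normal((8, 8)) * 0.3 for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)

# Permutation equivariance: permute the inputs, get the same outputs permuted.
perm = np.array([3, 0, 4, 1, 2])
shuffled = self_attention(X[perm], W_q, W_k, W_v)
```

The operation genuinely has no notion of order; nothing in the computation distinguishes position 0 from position 4.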
Cost. Self-attention is O(n²) in sequence length — the n × n attention matrix has to exist, at least conceptually. Section 12 is the efficiency story; Section 13 is the hardware rewrite that let the quadratic cost become tolerable.
The quadratic cost is the one honest drawback of attention. For the first several years it was a limitation everyone accepted in exchange for the training speed and capability gains. The later work on efficient attention, FlashAttention, and long context tries to recover the scaling properties without giving up the modelling advantages.
One attention operation is one lookup pattern. Doing several in parallel — each in its own subspace — lets the model track several kinds of relationships at once.
A single attention head computes one distribution over positions per query. A token can attend, strongly, to one other token — or spread its mass across a few — but it has one attention pattern to work with. That is often not enough. In a sentence, a given word might want to attend to its syntactic governor for agreement and to a semantic antecedent for meaning and to a nearby modifier for phrasing, all at once.
Multi-head attention solves this by running h attention operations in parallel, each with its own projections W_Q^i, W_K^i, W_V^i, and concatenating their outputs before a final linear projection. With h heads and projection dimension d/h, the total parameter count and FLOP cost are essentially the same as one head at full width — but the model now has h independent attention patterns per layer.
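The split-into-subspaces mechanics are mostly reshaping. A NumPy sketch (random untrained projections, toy dimensions):

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """h parallel heads, each in a d/h subspace, concatenated then projected."""
    n, d = X.shape
    d_h = d // h
    def heads_of(W):                 # (n, d) → (h, n, d_h): one subspace per head
        return (X @ W).reshape(n, h, d_h).transpose(1, 0, 2)
    Q, K, V = heads_of(W_q), heads_of(W_k), heads_of(W_v)
    s = Q @ K.transpose(0, 2, 1) / np.sqrt(d_h)    # (h, n, n): h attention patterns
    w = np.exp(s - s.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    out = w @ V                                    # (h, n, d_h)
    concat = out.transpose(1, 0, 2).reshape(n, d)  # concatenate the heads
    return concat @ W_o                            # final linear projection

rng = np.random.default_rng(0)
n, d, h = 6, 16, 4
X = rng.standard_normal((n, d))
W_q, W_k, W_v, W_o = (rng.standard_normal((d, d)) * 0.25 for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
```

The projection matrices are the same total size as a single full-width head; only the pattern count changes.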
In practice, heads specialise. Empirical studies of trained transformers find heads that track positional offsets ("the previous token"), syntactic relations ("the subject of this verb"), coreference, copy patterns, and, in language models, the remarkable induction heads that implement in-context pattern matching. Specialisation is not enforced — no loss term demands it — but it emerges reliably from the inductive bias of parallel subspaces.
Heads as perspectives. Multi-head attention is not just a capacity trick. It is the mechanism by which a transformer layer can simultaneously compute along many relational axes — a fan-out that a single dot-product could not produce.
The choice of h is mostly empirical. The original base transformer used 8 heads; modern large models use 16, 32, or more. Recent work (multi-query attention, grouped-query attention) has found that sharing keys and values across heads while keeping queries distinct preserves most of the benefit at a fraction of the memory cost during inference — a tuning that matters enormously for serving large models.
Self-attention is permutation-equivariant — shuffle the inputs and you shuffle the outputs, with no change to the relationships it learns. Language is not permutation-equivariant. Something has to put position back in.
The original transformer injected position through sinusoidal positional encodings: a fixed matrix where each row is a combination of sines and cosines at geometrically-spaced frequencies. Adding this matrix to the input embeddings gives each position a unique fingerprint that is easy to compute, requires no parameters, and — because a fixed positional offset corresponds to a linear transformation (a rotation) of the encoding — lets the model learn to attend by relative offset.
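The construction is compact. A sketch following the original formulation, PE[pos, 2i] = sin(pos / 10000^(2i/d)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d)), for an even model dimension:

```python
import numpy as np

def sinusoidal_encoding(n_positions, d_model):
    """Fixed positional fingerprints: sines and cosines at geometric frequencies."""
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / 10000 ** (2 * i / d_model)   # low i: fast; high i: slow
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                # even dimensions
    pe[:, 1::2] = np.cos(angles)                # odd dimensions
    return pe

pe = sinusoidal_encoding(50, 16)   # one 16-d fingerprint per position
```

No parameters, no training, and any position beyond 50 can be computed on demand, which is exactly what learned embeddings give up.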
Learned positional embeddings dropped the sinusoids and simply allocated a trainable vector per position, used by BERT and early GPT. They are more flexible but cap at the training context length — a position the model never saw has no embedding.
Two developments pushed positional encoding into its modern form. Relative positional encodings (Shaw et al., 2018; Raffel et al.'s T5) argued the model cares about relative offsets rather than absolute positions, and modified the attention scores to depend on the difference i − j. Rotary positional embeddings (RoPE; Su et al., 2021) implement this cleanly: rotate the query and key vectors by an angle that depends on position before computing their dot product. The resulting similarity depends only on the relative offset, and — because rotations extend naturally to unseen positions — RoPE extrapolates to longer contexts with modest scaling tricks.
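RoPE's defining property, that similarity depends only on relative offset, follows from rotations composing. A minimal single-vector sketch (pairing convention and base are the common choices; real implementations vectorise this over the whole sequence):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of x by angles proportional to pos."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)    # per-pair frequencies
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin              # 2-d rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
# Dot products depend only on the offset between positions:
s_near = rope(q, 3) @ rope(k, 1)       # offset 2
s_far = rope(q, 103) @ rope(k, 101)    # offset 2 again, shifted by 100
```

Because the rotation by position m applied to q and by position n applied to k composes into a single rotation by n − m, the two scores above are identical.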
ALiBi. Press, Smith, and Lewis (2021) proposed Attention with Linear Biases — no positional embedding at all, just a linearly-decaying bias added to attention scores. The simplest positional scheme to describe, and strong at extrapolation. A reminder that the right inductive bias can beat elaborate engineering.
RoPE dominates modern open-source language models; ALiBi appears in some long-context systems; learned and sinusoidal encodings persist in older and smaller designs. The taxonomy keeps growing, but the question never changes: how does the model know where it is?
Attention is the headline primitive, but a transformer layer is attention wrapped in a specific scaffolding — residual connections, normalisation, a feed-forward network — that matters as much as the attention itself.
A standard transformer block has two sub-layers. The first is multi-head self-attention. The second is a position-wise feed-forward network — an MLP with one hidden layer, applied identically to every position, typically widening the hidden dimension by a factor of four before projecting back. Each sub-layer is wrapped in a residual connection and a LayerNorm, following the pattern x ← x + Sublayer(Norm(x)) (pre-norm) or x ← Norm(x + Sublayer(x)) (post-norm).
The residual connections are not cosmetic. They create additive paths through the network, letting gradients flow directly back to earlier layers — the same trick that made ResNet trainable at depth. Without residuals, stacking dozens of attention layers produces the same optimisation pathologies that recurrence suffered from. With them, hundreds of layers train routinely.
LayerNorm (Ba, Kiros, and Hinton, 2016) normalises each token's activations independently, in contrast to BatchNorm's across-batch normalisation. The choice matters because attention sequences are variable length and depend on neighbouring tokens through their activations — batch statistics would mix information across examples in ways self-attention was designed to keep separate. Pre-norm (normalise before the sub-layer) has become the default; it trains more stably at depth than the post-norm used in the original paper.
Attention mixes, MLP processes. A useful decomposition: the attention sub-layer moves information between positions; the feed-forward sub-layer transforms information within each position. Alternating the two — mix, process, mix, process — is the transformer's core loop.
Variations on the theme are everywhere. Modern large models commonly swap ReLU for GELU or SwiGLU in the feed-forward; swap LayerNorm for RMSNorm; add parallel attention and MLP paths rather than sequential ones (as in GPT-J and PaLM). These are tuning choices, not architectural departures. The block is still attention + MLP + residuals + normalisation. Everything else is parameter count and training tricks.
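The pre-norm block structure can be sketched end to end. The pieces below are untrained stand-ins (Q = K = V = x in the attention, no learned LayerNorm scale, random feed-forward weights), chosen only to show the wiring:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each token's activations independently (no learned gain here)."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def self_attn(x):
    """Untrained stand-in for multi-head self-attention: Q = K = V = x."""
    s = x @ x.T / np.sqrt(x.shape[-1])
    w = np.exp(s - s.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ x

def transformer_block(x, W1, W2):
    """Pre-norm: x ← x + Sublayer(Norm(x)), once per sub-layer."""
    x = x + self_attn(layer_norm(x))                # mix: move info across positions
    x = x + np.maximum(layer_norm(x) @ W1, 0) @ W2  # process: per-position MLP
    return x

rng = np.random.default_rng(0)
d = 16
X = rng.standard_normal((5, d))
W1 = rng.standard_normal((d, 4 * d)) * 0.1   # widen by the conventional 4×...
W2 = rng.standard_normal((4 * d, d)) * 0.1   # ...then project back down
out = transformer_block(X, W1, W2)
```

Note that both sub-layers add their output to x rather than replacing it; the residual stream is the additive path the gradients flow through.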
The transformer block can be wired in three main configurations. Each supports a different family of tasks, and each has its flagship model.
Encoder-only transformers — most famously BERT (Devlin et al., 2019) — stack blocks with bidirectional self-attention. Every token can attend to every other. They produce contextual representations rather than generating text directly; training uses masked language modelling (predict hidden tokens from context) and related objectives. Encoder-only models dominate classification, retrieval, and representation learning. They are optimal when you need to understand an input, not produce an output.
Decoder-only transformers — GPT (Radford et al., 2018; Brown et al., 2020) and its successors — stack blocks with causal self-attention, in which each position can only attend to positions at or before it. Training is autoregressive next-token prediction. The same architecture generates by sampling one token at a time, feeding each prediction back as input. Decoder-only models dominate open-ended generation and have become the default architecture for general-purpose language models — GPT-4, Claude, Llama, Gemini, Mistral, Qwen, DeepSeek — because next-token prediction at scale turned out to be a shockingly general objective.
Encoder–decoder transformers preserve the original Vaswani design: a bidirectional encoder consumes a source sequence; a causal decoder generates the target sequence, with cross-attention layers that let the decoder attend to the encoder's output. T5 (Raffel et al., 2020) is the canonical modern instance. BART (Lewis et al., 2019) and mT5 extend it with denoising and multilingual pre-training. Encoder–decoders are the natural fit for conditional generation tasks — translation, summarisation, structured transformation — where input and output are clearly distinct sequences.
The three shapes, roughly. Encoder-only: understand. Decoder-only: generate. Encoder–decoder: transform input to output. Most tasks can be reformulated to fit any of the three, but the alignment between shape and problem has real consequences for sample efficiency and quality.
The research gravity has moved decisively toward decoder-only designs for general-purpose models, mostly because the autoregressive objective absorbs so many tasks cleanly and scales so well. But encoder-only and encoder–decoder models remain the right default for specific problems — search and classification for the former, task-specific translation and summarisation for the latter — and the gap has narrowed but not disappeared.
Self-attention mixes a sequence with itself. Cross-attention mixes one sequence into another — the mechanism for conditioning generation on external context.
In cross-attention, Q comes from one sequence and K, V come from another. It is exactly the Bahdanau–Cho–Bengio encoder–decoder attention of 2014, relocated to the transformer era and generalised. The decoder asks questions (queries); the encoder provides an indexable library (keys and values); the decoder retrieves the combination tailored to its current need.
Cross-attention is where conditioning lives. When a text-to-image model generates an image from a caption, cross-attention layers in the denoising network read the caption's embeddings — queries from image positions attending to keys and values from text. When a vision-language model answers a question about an image, the language decoder cross-attends to patch embeddings from a vision encoder. When Perceiver (Jaegle et al., 2021) processes arbitrary modalities, a small set of learned queries cross-attends to a large, modality-agnostic input set, bringing the O(n²) cost under control by making n on the query side small.
The same mechanism also powers retrieval-augmented generation. Queries come from the current context; keys and values come from an external document store retrieved at inference time. The model is not just recalling from its parameters; it is actively looking things up in a corpus. This is attention used as a tool for grounding — a soft read of a body of text that exists outside the model weights.
Cross-attention as conditioning. If you want a transformer to behave differently depending on some external input — a caption, an image, a database, a task description — cross-attention is the standard way to plumb that input into the computation. It is the transformer's conditional operation.
In modern decoder-only large language models, cross-attention is often implicit: the "context" and the "generation" share the same sequence, and the decoder's causal self-attention handles both. But the moment there is a real distinction between input and output — text to image, image to text, document store to question — cross-attention reappears, doing exactly what Bahdanau's original alignment did, on a larger stage.
Language generation is causal: position t cannot depend on position t+1. A single mask added to the attention scores enforces this, and that mask is what turned self-attention into a generative engine.
In causal self-attention, each position can only attend to positions at or before it. This is implemented by setting the strictly upper-triangular portion of the QKᵀ matrix to −∞ before the softmax — the −∞ entries become zero weights, so the forbidden positions contribute nothing. Training a causal language model is then pleasingly parallel: every position sees only its past, but all positions compute in a single forward pass. The target for position t is simply the token at position t+1.
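The mask is one line on top of ordinary attention. A sketch (random toy inputs):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Set the strictly upper triangle of QKᵀ to −∞ before the softmax."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    scores[np.triu_indices(n, k=1)] = -np.inf    # position i never sees j > i
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                # −∞ rows entries become 0 weights
    return w @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))
out = causal_attention(Q, K, V)   # out[0] can only be V[0]: no past to mix in
```

Position 0 has exactly one visible key, so its output is exactly V[0]; more usefully, changing the last token changes no earlier output, which is the property autoregressive training depends on.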
This parallelism during training is the single biggest reason transformers overtook RNNs for language modelling. An RNN has to unroll sequentially to compute a loss on a 2,048-token sequence — 2,048 dependent steps. A causal transformer computes the loss on the same sequence in one parallel matmul, subject to the attention mask. On GPU hardware, this collapses a thousand-fold serial dependence into a single compute kernel.
Autoregressive decoding at inference time is the reverse: the model produces tokens one at a time, each conditioned on all previously-generated tokens. The causal mask guarantees that training and inference compute the same function — the model never had access to future tokens during training, so it is well-defined to sample without them.
KV cache. During generation, the keys and values for past positions never change. Caching them avoids recomputing attention for the whole prefix at every step. This is a pure inference optimisation — nothing architectural changes — but it is the difference between a model that serves cheaply and one that does not.
The combination of a causal mask during training and a KV-cached generation loop during inference is the operational heart of modern decoder-only large language models. Everything that matters about running them — latency, throughput, memory pressure — is determined by how efficiently this loop executes. Section 13 on FlashAttention is, in large part, the story of making this loop faster.
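The cached generation loop is short enough to sketch. A minimal single-head version (a Python list stands in for the preallocated cache a real server would use):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(q, k_new, v_new, cache):
    """One autoregressive step: extend the cache, attend over the whole prefix."""
    cache["K"].append(k_new)   # keys and values for past positions never change,
    cache["V"].append(v_new)   # so each is computed exactly once
    K, V = np.stack(cache["K"]), np.stack(cache["V"])
    return softmax(q @ K.T / np.sqrt(len(q))) @ V

rng = np.random.default_rng(0)
cache = {"K": [], "V": []}
steps = [rng.standard_normal((3, 8)) for _ in range(3)]   # (q, k, v) per token
outputs = [decode_step(q, k, v, cache) for q, k, v in steps]
```

Each step is O(t · d) against a cache of t positions, instead of recomputing the full prefix attention from scratch; the memory the cache occupies is the serving cost that multi-query and grouped-query attention attack.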
Peer inside a trained transformer and you find heads that implement legible computations — positional offsets, coreference, copy mechanisms, pattern-matching circuits. Attention patterns are the closest thing neural networks have to program listings.
Because attention weights are normalised distributions, they are directly interpretable as soft pointers. Visualising the attention matrix of a trained head often reveals a structured pattern: a diagonal (attend to yourself), an off-diagonal (attend to the previous token), a column (everyone attends to some salient token), or a block structure that tracks sentence boundaries. Heads in the same layer develop different patterns; heads at different depths compose these patterns into more elaborate behaviours.
The most striking finding is the induction head, identified by Olsson, Elhage, and colleagues at Anthropic (Elhage et al., 2021; Olsson et al., 2022). An induction head, implemented as a composition of two attention heads across two layers, detects repeated sequences of the form "A B ... A" and predicts "B" — the model completing the pattern it has just seen. Induction heads appear reliably during training, tend to form in a sharp transition rather than gradually, and are strongly associated with the emergence of in-context learning — the ability of large language models to adapt their behaviour based on examples in the prompt without any weight updates.
This line of work — mechanistic interpretability — treats a transformer as a program whose components can be reverse-engineered. It has produced a catalogue of circuits: small combinations of heads and MLP neurons implementing specific functions. The field is young and the results are still partial, but it has already overturned the intuition that neural networks are inscrutable; for small and medium models, many behaviours can be traced to specific mechanisms.
Attention as legible structure. Convolutional filters, after much effort, yielded interpretable feature maps. Attention weights are interpretable by construction — they are distributions over positions, not coefficients of unknown basis vectors. This is an underrated reason the transformer era is also the interpretability era.
It is worth being careful about what this means. As we will see in section 16, attention weights are not always a faithful explanation of a model's predictions; the MLP sub-layers do work that attention weights cannot describe; and the story gets more tangled at scale. But the transparency of attention, relative to previous architectures, is real, and it is one of the reasons interpretability research has accelerated so much since 2017.
The O(n²) cost of full attention is tolerable for a few thousand tokens and ruinous past that. A whole subfield has tried to replace the quadratic with something subquadratic — sometimes by changing what attention computes, sometimes by changing how.
The first wave of efficient attention was sparse. Restrict each query to attend only to a subset of keys — a local window, or a window plus some global anchors — and the cost scales linearly. Longformer (Beltagy, Peters, and Cohan, 2020) combined sliding-window attention with a few global tokens that every position attends to. BigBird (Zaheer et al., 2020) added random connections to the local-plus-global mix, arguing for theoretical coverage properties. Sparse Transformer (Child et al., 2019) factorised the attention pattern across dimensions for images. These models scale to tens of thousands of tokens at linear cost.
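The structural idea behind the sparse wave is just an attention mask. An illustrative Longformer-style pattern (window size and global-token count are arbitrary choices here, not the published defaults):

```python
import numpy as np

def sliding_window_mask(n, window, n_global=1):
    """Local window plus a few global anchor tokens: True = attention allowed."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    local = np.abs(i - j) <= window                # each token sees its neighbours
    global_ = (i < n_global) | (j < n_global)      # anchors see and are seen by all
    return local | global_

mask = sliding_window_mask(8, window=1)
# Each row allows O(window + n_global) positions, so total cost is linear in n.
```

Applying this mask (as −∞ on the disallowed scores) instead of the full pattern is what drops the cost from quadratic to linear, at the price of longer multi-hop paths between distant tokens.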
The second wave was kernelised. The Performer (Choromanski et al., 2020) and Linear Attention (Katharopoulos et al., 2020) observed that the softmax kernel can be approximated by explicit feature maps — essentially random-feature approximations of the exponential kernel — so that attention becomes a sequence of matrix multiplications whose shape is independent of n. The cost drops to O(n · d²). The accuracy cost is real but bounded, and for very long sequences it pays back.
The third wave was low-rank. Linformer (Wang et al., 2020) projected the key and value sequences into a small fixed-dimension subspace along the sequence axis, collapsing the n × n matrix into n × k for small k. Simple, effective on specific tasks, but with an awkward dependence on input length.
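A sketch of the low-rank move, with a random projection standing in for Linformer's learned one (and, as Linformer optionally does, shared between keys and values); dimensions are illustrative:

```python
import numpy as np

def linformer_attention(Q, K, V, E):
    """Low-rank attention: project K and V along the *sequence* axis.

    E is a (k, n) projection with k << n, so the score matrix is n x k
    rather than n x n. Here E is random for illustration; Linformer
    learns it, which ties the parameterisation to the input length n.
    """
    d = Q.shape[1]
    Kp, Vp = E @ K, E @ V                        # (k, d) each
    scores = Q @ Kp.T / np.sqrt(d)               # (n, k), not (n, n)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ Vp

rng = np.random.default_rng(0)
n, k, d = 16, 4, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E = rng.standard_normal((k, n)) / np.sqrt(n)
out = linformer_attention(Q, K, V, E)
print(out.shape)  # (16, 8)
```

The fixed shape of E is the "awkward dependence on input length" in action: a model trained with n = 16 here cannot accept a longer sequence without a new projection.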
The bitter lesson, again. Most of these methods were proposed when quadratic attention seemed unaffordable. FlashAttention (next section) made the quadratic itself fast enough that many efficient-attention variants became unnecessary for moderate contexts. The ones that survived are the ones that also offered genuine modelling gains — sliding windows for true long-range tasks — not just speed.
The efficient-attention literature is sprawling, and many of its methods now feel historical. But the motivation that produced it — context length is the binding constraint on what attention-based systems can reason about — has only grown more pressing as long-context models have become a serious design target.
Attention is bottlenecked not by FLOPs but by memory movement. Rewrite the kernel to respect the GPU's memory hierarchy, and the quadratic becomes affordable.
FlashAttention (Dao, Fu, Ermon, Rudra, and Ré, 2022) is the most consequential piece of systems work in the transformer era. The observation is simple: a naive attention implementation computes the full n × n attention matrix, writes it to HBM (the GPU's main memory), and reads it back for the softmax and the weighted sum. On modern GPUs, HBM reads and writes are the bottleneck — the actual matmul is fast; shuttling the intermediate matrix through memory is slow.
FlashAttention rewrites the attention computation to never materialise the full n × n matrix in HBM. It tiles Q, K, and V into blocks that fit in the much faster on-chip SRAM, uses an online softmax (Milakov and Gimelshein, 2018) to compute the normalisation incrementally, and writes only the final output. The FLOP count is unchanged — it is still the same attention — but the memory traffic drops by an order of magnitude. Training is two to four times faster; longer sequences become feasible at the same memory budget.
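The online softmax at the core of the tiling can be shown for a single query row. A NumPy sketch under simplifying assumptions (the block size is illustrative, and real kernels tile Q as well and keep all block state in SRAM):

```python
import numpy as np

def attention_online_softmax(q, K, V, block=2):
    """One query row of attention, processing keys block by block.

    Maintains a running max, running normaliser, and running weighted
    sum, so the full score vector is never stored — the rescaling trick
    (Milakov and Gimelshein, 2018) at the heart of FlashAttention.
    """
    m = -np.inf                    # running max of scores seen so far
    l = 0.0                        # running softmax denominator
    o = np.zeros(V.shape[1])       # running weighted sum of values
    for start in range(0, len(K), block):
        s = K[start:start + block] @ q           # scores for this block
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                # rescale the old state
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        o = o * scale + p @ V[start:start + block]
        m = m_new
    return o / l

rng = np.random.default_rng(0)
q = rng.standard_normal(4)
K = rng.standard_normal((8, 4))
V = rng.standard_normal((8, 4))
full = np.exp(K @ q - (K @ q).max())
ref = (full / full.sum()) @ V                    # naive full-matrix result
assert np.allclose(attention_online_softmax(q, K, V), ref)
```

The rescaling by `exp(m - m_new)` is what lets the running sums remain exact as new blocks shift the maximum; the final answer is bit-for-bit the same attention, computed without the intermediate matrix.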
The cascade is remarkable. FlashAttention-2 (Dao, 2023) refined the tiling and work partitioning, extracting more from the same insight. FlashAttention-3 (Shah et al., 2024) adapted the approach to Hopper-generation hardware, using asynchronous execution and lower-precision formats. Each generation of GPUs gets a matched generation of attention kernel, and the gap between "attention in a textbook" and "attention on real hardware" keeps widening.
Memory is the new FLOPs. For transformers at scale, the binding resource is not arithmetic throughput — it is memory bandwidth and memory capacity. FlashAttention was the first mass-adopted rewrite to take this seriously, and it shifted the whole field's sense of which optimisations matter.
FlashAttention also changed the efficient-attention conversation. Many sparse and low-rank schemes were proposed to bring the quadratic scaling down; FlashAttention kept the quadratic but made it fast enough that, up to contexts of 16K or 32K tokens on modern hardware, it dominates. The honest benchmarks are now "full attention with FlashAttention" versus "approximate attention" — and full attention, suddenly, is hard to beat.
The context window is the most-discussed number on any model card. Extending it — from 2K to 8K to 128K to millions of tokens — is a combination of positional encodings that extrapolate, kernels that stay fast, and training regimes that teach the model to use the space.
The first wall was positional. A model trained at 2 048 tokens with learned absolute positions simply has no embedding for position 2 049. RoPE and ALiBi (section 6) removed this wall by making positions relative and extrapolable. But even then, accuracy degraded sharply beyond the training context length — the model had never practised attending across such spans.
Context extension techniques — YaRN, NTK-scaling, position interpolation (Chen et al., 2023) — modify the base frequency of RoPE so that the same rotary angles cover a longer stretch, usually followed by a short fine-tuning run. This is the standard move that gets base models from 4K or 8K training contexts to 32K, 64K, or 128K serving contexts with modest compute.
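Position interpolation is essentially a one-line change to how the rotary angles are computed. A NumPy sketch of the angle computation only (the rotation of Q and K by these angles is omitted; base 10000 is the usual RoPE default):

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    """Rotary angles for each position and frequency pair.

    `scale` < 1 implements position interpolation (Chen et al., 2023):
    positions are compressed so a longer context reuses the angle range
    seen in training. NTK-scaling and YaRN adjust `base` instead.
    """
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # per-pair frequencies
    return np.outer(positions * scale, freqs)       # (n, dim/2) angles

train_len, new_len, dim = 2048, 8192, 64
plain = rope_angles(np.arange(new_len), dim)
interp = rope_angles(np.arange(new_len), dim, scale=train_len / new_len)
# Interpolated angles at position 8191 equal plain angles at position
# 2047.75 — within the [0, 2048) position range covered in training.
assert np.allclose(interp[-1], rope_angles(np.array([2047.75]), dim)[0])
```

A short fine-tuning run after the change lets the model adapt to the compressed angular resolution, which is why these recipes are cheap relative to training at the long length from scratch.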
The second wall was kernel efficiency. FlashAttention made full attention fast at moderate lengths; beyond those, the quadratic cost dominates again. Ring attention (Liu et al., 2023) shards the key and value matrices across devices and rotates them in a ring, letting a cluster of GPUs compute attention over millions of tokens collectively. Sliding-window attention at inference (used in Mistral's architecture) restricts each token's attention span to a window while letting information flow through many layers, approximating a long context at sublinear cost.
Retrieval as a long-context alternative. You can pay for long context in tokens — keep everything in the window — or you can pay for it in retrieval — store the corpus externally and fetch what the query needs. Both end up using attention; retrieval-augmented generation is cross-attention over a retrieved subset, and long-context is self-attention over the whole. The trade-off is precision versus recall versus cost, and neither approach dominates.
The frontier keeps moving. By 2024 several production systems advertised context windows of 1M tokens or more — enough to read entire codebases or books into a single prompt. Whether models actually use those tokens effectively, as opposed to merely tolerating their presence, is a separate empirical question; benchmarks like needle-in-a-haystack tests and long-document QA try to measure real usage, and the answers are improving but not solved.
Attention was born in machine translation. Within five years it had conquered vision, speech, biology, reinforcement learning, and protein structure prediction. The primitive is too general to stay in one modality.
The Vision Transformer (Dosovitskiy et al., 2021) applied the transformer unchanged to images by splitting each image into fixed-size patches, treating each patch as a token, and running ordinary self-attention. At sufficient scale and data it matched or exceeded convolutional networks on classification benchmarks. The premise of section 4 of the CNN chapter — that spatial inductive bias is necessary for vision — turned out to be replaceable by scale and attention, at least for many tasks.
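The patch-to-token step is the only image-specific part of the pipeline. A NumPy sketch (patch size 4 is illustrative; ViT-Base uses 16 × 16 patches):

```python
import numpy as np

def patchify(image, patch=4):
    """Split an H x W x C image into non-overlapping flattened patches.

    Each row of the result is one 'token' — the only image-specific step
    before a standard transformer takes over. A learned linear projection
    and positional embeddings would follow in an actual ViT.
    """
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    tokens = (image
              .reshape(H // patch, patch, W // patch, patch, C)
              .transpose(0, 2, 1, 3, 4)          # group each patch's pixels
              .reshape(-1, patch * patch * C))   # flatten each patch
    return tokens                                # (num_patches, patch_dim)

img = np.arange(8 * 8 * 3, dtype=float).reshape(8, 8, 3)
tokens = patchify(img, patch=4)
print(tokens.shape)  # (4, 48)
```

Everything downstream of this function is the same transformer as in the language case, which is precisely the point.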
Whisper (Radford et al., 2022) did the same for speech: an encoder–decoder transformer consuming log-mel spectrogram patches, trained on hundreds of thousands of hours of weakly-supervised audio. It outperformed specialist speech-recognition pipelines that had been tuned for decades, simply by scaling attention on raw acoustic inputs.
AlphaFold 2 (Jumper et al., 2021) placed attention at the heart of protein structure prediction. Its Evoformer module interleaves attention over a multiple sequence alignment with attention over a pair representation of amino-acid relationships, combining evolutionary information with geometric reasoning. The accuracy gap it opened over prior methods was the largest single leap in structural biology in decades.
Perceiver (Jaegle et al., 2021) pushed generality further, using a small set of learned queries that cross-attend to arbitrarily-structured inputs — images, point clouds, audio, video — all processed by the same architecture. Decision Transformer (Chen et al., 2021) recast reinforcement learning as conditional sequence modelling with attention. Every modality the field cared about ended up with an attention-based baseline, and most of those baselines eventually became the state of the art.
Attention as general primitive. CNNs were vision architectures; RNNs were sequence architectures; MLPs were feature-vector architectures. Attention is none of these and all of them. Tokenise the input — patches, spectrogram frames, amino acids, trajectories — and attention takes over. The modality-specific inductive biases the previous era prized have been repeatedly traded for more data and attention layers.
The pattern is not universal. Physics, signal processing, and tasks with strong symmetries still benefit from architectures that bake the symmetry into the network (equivariant models, for instance). But the default choice, for most new problems involving structured input and abundant data, is now a transformer.
The transparency of attention weights invites interpretation. The invitation is partially misleading — attention is legible but not always faithful — and the careful work is about knowing which reads hold up.
The early enthusiasm — just look at the attention weights to understand what the model is doing — ran into a sharp critique. Jain and Wallace (2019), in "Attention is not Explanation," demonstrated that a model's attention weights could often be radically permuted without changing its predictions; the same output was consistent with many different weight patterns, so no single pattern could claim to be the explanation. Wiegreffe and Pinter (2019) replied that this did not mean attention weights were useless for interpretation, only that naive reading of a single head's pattern could mislead.
The modern consensus is that attention weights are one legible signal among several and should be interpreted alongside the MLP sub-layers, residual stream flows, and the outputs of other heads. Mechanistic interpretability (section 11) has matured into a programme that traces how information moves through a transformer: which tokens carry which features, which heads route those features, which MLP neurons transform them, how heads compose across layers.
Sparse autoencoders (Cunningham et al., 2023; Templeton et al., 2024) have emerged as a powerful complement, decomposing the transformer's residual stream into a large dictionary of sparse, interpretable features — individual directions in activation space that correspond to recognisable concepts. The combination of head-level analysis and feature-level analysis is the current frontier of mechanistic interpretability in large models.
What interpretability buys. Not a full explanation of any large model — that remains out of reach. But a growing catalogue of named mechanisms (induction heads, copy heads, name-mover heads, feature circuits) that let researchers make specific, testable claims about what a model is doing, and in some cases intervene on behaviour by editing the mechanism.
The stakes are not merely academic. Safety arguments for large models rest heavily on whether their computations can be audited at some level of detail. Interpretability research, made possible in large part by attention's structural legibility, is one of the few avenues by which such audits might become practical.
Attention is dominant, not final. A new generation of state-space models and hybrid architectures is testing whether the quadratic cost and the specific inductive bias of attention are actually necessary.
The most visible challenger is the state-space model family. S4 (Gu, Goel, and Ré, 2022) showed that carefully-parameterised linear time-invariant systems could model sequences with millions of time steps at linear cost, outperforming transformers on the Long Range Arena benchmark. Mamba (Gu and Dao, 2023) introduced selective state-space models with input-dependent dynamics, closing the gap to transformers on language modelling and, at comparable scales, occasionally exceeding them. RWKV (Peng et al., 2023) and Retentive Networks (Sun et al., 2023) are related designs that keep a recurrent form at inference while training in parallel.
These alternatives share a common insight: the all-pairs interaction of self-attention is not always necessary. A selective recurrent state, if the recurrence is expressive enough, can carry similar information at linear cost. For very long sequences — genomic data, long audio, streamed video — this is a real advantage. For language at moderate contexts, the evidence is mixed; attention's strengths are hardest to beat exactly where most current applications sit.
The practical response has been hybridisation. Jamba (Lieber et al., 2024) interleaves Mamba and transformer layers; Griffin and RecurrentGemma combine local attention with recurrent blocks; Zamba takes a similar approach. The hypothesis is that the two primitives are complementary — recurrence for long-range state, attention for in-context pattern matching — and that hybrid stacks get the best of both.
Post-attention, or not. It is an open empirical question whether the next dominant architecture will still be transformer-based, will be a full state-space successor, or will be some hybrid whose attention layers are a minority of the stack. What is clear is that the transformer is no longer the only serious candidate — for the first time since 2017.
Predictions are cheap and usually wrong. The more conservative read is that attention has become a primitive like convolution: foundational, well-understood, still heavily used, and increasingly combined with other primitives rather than standing alone. The exotic architectures of the next decade may look less like "attention" or "recurrence" and more like careful compositions of both.
Attention is not just an architecture — it is a set of ideas the rest of the field has absorbed. The transformer chapter of ML is partly the history of what attention taught the field to take for granted.
The most portable lesson is the QKV abstraction. It turned out that almost any content-based lookup, between any two things, could be phrased as queries, keys, and values — and that making the lookup differentiable made it trainable end-to-end. Retrieval-augmented generation, memory-augmented networks, neural caches, external tool use, mixture-of-experts routing: all rely, explicitly or implicitly, on the same soft-lookup template. The attention kernel escaped the attention layer.
The second lesson is all-pairs interaction. Self-attention's n² structure was originally seen as a cost; more recent work treats it as a feature. When you can afford it, letting every part of the input interact with every other part is a remarkably strong default. Graph neural networks have moved in this direction (graph transformers); molecular and protein models have adopted it; physics simulators have started to as well. The inductive bias of "everything can talk to everything" is less restrictive than most priors, and more data turns that permissiveness into generalisation.
The third lesson is scale-first design. Transformers became dominant not because they were the most elegant architecture for small problems but because they scaled better than the alternatives. The design philosophy they exemplified — few inductive biases, maximally parallel computation, trust the data — has reshaped how new architectures are proposed. The first question asked of a new design is now: how does it scale?
The compounding bet. Attention was introduced to fix a translation bottleneck. It ended up rewriting the default architecture across domains, reshaping the interpretability agenda, and supplying the abstractions that the field now uses to plumb models together. It is one of the clearest examples in machine learning of a single idea compounding across problems because its framing was right.
The next chapter on Foundation Models and Transfer Learning picks up this thread from a different angle — what happens when you pre-train one of these attention-based architectures at scale on broad data, and then re-use the result across tasks. The transformer is the vessel; the vessel's contents, and the economics of filling and re-using them, are a story of their own.
Attention's history is unusually compressed — a 2014 encoder–decoder alignment trick, a 2017 "you can throw out recurrence" paper, and then a decade of reshaping every area of deep learning. The reading list below follows that arc: anchor textbooks that teach the modern view, the foundational papers that introduced each primitive, modern extensions that are still actively evolving, and the software that makes these ideas usable at scale.
The canonical NLP textbook, freely available in draft form. The chapters on transformers, self-attention, language models, and machine translation have been rewritten in the transformer era and are the clearest textbook entry point to these ideas. Updated essentially yearly.
The Hugging Face team's practitioner guide. Walks through encoder, decoder, and encoder–decoder transformers with working code, covering tokenisation, fine-tuning, efficient attention, and deployment. Excellent for connecting the conceptual picture to the actual stack.
Chapter 12 on transformers is the most careful modern textbook derivation of self-attention, multi-head attention, positional encodings, and the full transformer block. Equations are pinned to figures; the exercises are worth doing.
The Bishop successor text, fully rewritten for the deep-learning era. Chapter 12 on transformers covers self-attention, multi-head attention, the transformer block, and pre-training with unusual care for the mathematical structure.
The interactive textbook with runnable PyTorch, TensorFlow, MXNet, and JAX implementations side by side. Chapters 10 (attention mechanisms) and 11 (transformer) build the whole architecture from scratch with executable code — the best single source for going from equations to a training run.
The two most influential online expositions of the transformer architecture. Careful diagrams of every step — tokenisation, embedding, QKV projection, attention, multi-head concatenation, decoding. Still the recommended first read after the Vaswani paper itself.
The paper that introduced attention. Replaces the fixed-length encoder vector of seq2seq with a context vector computed by soft alignment over source hidden states at every decoding step. The idea that touched off everything that followed.
The systematic comparison of global vs. local attention and additive vs. multiplicative scoring. The paper that made dot-product attention the default by showing it was cheaper and comparable in quality.
The first high-profile application of attention outside machine translation, and the clearest early comparison of soft and hard attention. The attention-over-image-regions visualisations became the standard way to explain what attention does.
The transformer paper. Introduces scaled dot-product attention, multi-head attention, positional encodings, the transformer block, and the encoder–decoder stack — the core vocabulary of modern deep learning. One of the most influential papers of the decade.
The encoder-only transformer that established pre-training + fine-tuning as the dominant paradigm for language understanding. Masked language modelling and next-sentence prediction as objectives that produce transferable representations.
The decoder-only transformer scaled up, and the first serious demonstration that autoregressive language modelling at scale produced surprisingly general capabilities without task-specific training. The template for everything that followed.
The paper that made in-context learning visible to the whole field. Scaling GPT-2 by two orders of magnitude unlocked the ability to perform new tasks from a handful of examples in the prompt — no weight updates required. The moment the implications of scale became undeniable.
The systematic encoder–decoder study that framed every NLP task as text-to-text. Covers relative positional embeddings, pre-training objectives, model sizes, and fine-tuning at a level of empirical rigour that made it the standard reference for encoder–decoder design choices.
The Vision Transformer. Split an image into patches, treat each patch as a token, run a standard transformer. At sufficient scale and data it matched state-of-the-art CNNs. The paper that ended the assumption that vision required convolutional inductive biases.
Attention reshaping a domain that had resisted decades of effort. The Evoformer module combines attention over multiple sequence alignments with attention over pair representations, producing a system that solved protein structure prediction at a level of accuracy essentially matching experimental methods.
The rotary positional embedding (RoPE) paper. Introduces a way to encode relative position through rotations of query and key vectors that has since become the dominant positional encoding in modern large language models, largely because it extrapolates gracefully to longer contexts.
The most consequential systems paper of the transformer era. A tiled, IO-aware rewrite of the attention kernel that never materialises the full attention matrix in HBM, making quadratic attention fast enough to stay the default up to long contexts. Now the standard implementation in virtually every serious framework.
An encoder–decoder transformer trained on 680 000 hours of weakly-supervised audio. A demonstration that attention plus scale plus noisy data could leapfrog decades of specialist speech-recognition engineering. Open-sourced and widely used.
The paper that identified induction heads as a mechanism behind in-context learning. One of the defining documents of mechanistic interpretability, and the clearest demonstration that specific attention patterns implement specific, identifiable computations.
No positional encoding at all — just a linearly-decaying bias on attention scores. The simplest positional scheme, strong at extrapolation, and a counterweight to more elaborate engineering. A reminder that the right inductive bias beats clever parameterisation.
The influential sceptical paper that showed attention weights can be radically permuted without changing predictions — undermining naive reading of attention as explanation. The correct antidote to early over-enthusiasm about attention as transparent reasoning.
Sliding-window self-attention plus a small set of global tokens — linear cost in sequence length, strong performance on long-document tasks. The clearest example of structured sparse attention that survived the FlashAttention reset.
A sparse attention pattern combining random, windowed, and global connections, with theoretical arguments about coverage. An influential design point in the efficient-attention literature.
Kernelised attention with explicit feature maps for the softmax kernel, giving linear-cost attention via FAVOR+. The most theoretically careful of the linear-attention variants.
Multi-query attention — share keys and values across heads while keeping per-head queries. Modest quality cost, major inference-time memory savings. The pattern underlying grouped-query attention and much modern serving infrastructure.
Grouped-query attention. Interpolates between multi-head and multi-query attention by sharing keys and values across groups of heads. The tuning now standard in open-weight models that need both quality and cheap inference.
The refinement of FlashAttention that extracts substantially more throughput from the same tiling idea. Now the default in most large-model training stacks.
The Hopper-generation rewrite, exploiting asynchronous execution and FP8 to push another substantial speedup. Demonstrates how tightly attention kernels are now coupled to specific hardware generations.
The simplest effective recipe for extending a RoPE-based model's context length — interpolate position indices rather than extrapolate. The foundation for most long-context extension techniques that followed.
A distributed attention algorithm that shards keys and values across devices and rotates them in a ring. Enables attention over millions of tokens by sharing the quadratic cost across a cluster. One of the key ingredients in modern million-token-context systems.
The paper that put state-space models back in serious contention with transformers for long-sequence tasks. A carefully-parameterised linear time-invariant system with strong scaling and competitive quality on Long Range Arena.
Selective state-space models — input-dependent dynamics that close much of the remaining quality gap to transformers on language modelling at linear cost. The most serious recent challenger to attention as the dominant sequence primitive.
The hybrid architecture that interleaves Mamba blocks with transformer layers at scale, producing a competitive general-purpose language model. A concrete exemplar of the hybrid thesis — that attention and state-space recurrence are complementary rather than competing.
A small set of learned queries cross-attending to arbitrary input modalities. One of the most elegant uses of cross-attention as a general conditioning mechanism, and a concrete design for handling very large input sets without quadratic cost on them.
The paper that brought sparse autoencoders into mainstream interpretability. Decomposes the residual stream into sparse, interpretable features — a complement to head-level analysis that has become a pillar of mechanistic interpretability.
Scaling sparse-autoencoder feature extraction to a production-scale language model. A demonstration that mechanistic interpretability can say something specific and testable about the features real models represent.
The dominant open-source library for transformer models. Thousands of pre-trained weights, unified interfaces for encoder, decoder, and encoder–decoder configurations, and the de facto standard for fine-tuning, inference, and research distribution.
The reference implementation of FlashAttention and its successors. Directly integrated into PyTorch's scaled_dot_product_attention and into essentially every serious training stack.
A high-throughput inference server built around PagedAttention — a KV-cache memory manager inspired by virtual memory. Made serving large transformer-based models at low cost tractable for a wide range of teams.
NVIDIA's optimised transformer inference stack — FlashAttention kernels, quantisation, tensor and pipeline parallelism. The canonical vendor-optimised path for large-model serving on NVIDIA hardware.
The main library for mechanistic interpretability research on transformers. Instruments attention heads, residual streams, and MLP activations with the hooks needed to run the circuit-level analyses described in §11 and §16.
Reference implementations of the selective state-space models discussed in §17. The home for research code on the main non-attention sequence architecture in serious contention today.