The story of the last seven years in language modelling is, to a first approximation, a story of scale. The transformer architecture introduced in 2017 has not changed in any essential way. The pretraining objectives laid out in 2018–2020 have not changed in any essential way. What has changed is size — models a hundred times larger, trained on a hundred times more data, with a hundred times more compute — and what has surprised almost everyone is how much that one lever matters. Capabilities that were absent or barely present at small scale become routine at large scale: following instructions from examples in a prompt, reasoning through multi-step problems, writing correct code, translating between languages the model was barely trained on, using tools, coordinating multi-turn dialogue. Some of these capabilities appear gradually; some appear sharply at a threshold and have been called emergent, which is a word doing a lot of work. This chapter catalogues what large language models can do, argues carefully about what they cannot, looks hard at the failure modes that scale does not fix, and sketches the open empirical questions about why any of this works at all.
Sections one and two frame scale as a research programme. Section one is why scale — the narrative from GPT-2 onwards, the bet that size would substitute for cleverness, and the reasons that bet paid off further than almost anyone predicted. Section two covers scaling laws in the specific form that makes them useful for thinking about capabilities, rather than just training loss — the difference between a loss curve that falls smoothly with compute and a capability curve that does not.
Sections three and four are the central theoretical debate. Section three presents emergent capabilities as originally described by Wei et al. 2022 — behaviours that appear at a scale threshold and are absent below it. Section four covers the debate that followed — Schaeffer et al. 2023's argument that apparent emergence is often an artefact of threshold-based metrics, and the more careful empirical picture that has since developed. Neither position is decisive; both are essential for thinking clearly about what scaling gives you.
Sections five and six introduce the most distinctive new capability of large models. Section five is in-context learning — the phenomenon that a sufficiently large language model can acquire new behaviours from examples in the prompt, without any parameter updates. Section six is few-shot prompting as the practical technique, with the dependence of performance on the number, selection, ordering, and framing of in-context examples.
Sections seven through nine examine reasoning behaviour in detail. Section seven is chain-of-thought reasoning, which turned step-by-step reasoning into a competent behaviour via prompting alone. Section eight surveys what these models cannot reliably do — compositional generalisation, long-horizon planning, problems that require working memory across many steps. Section nine zooms in on mathematical reasoning as the sharpest testbed.
Sections ten through twelve cover capability-mode specialisations that matter economically. Section ten is code generation — HumanEval, coding assistants, and why code is a particularly clean signal for reasoning. Section eleven is multilinguality — the surprising cross-lingual transfer that falls out of mostly-English training, and where it breaks. Section twelve is tool use and function calling — the turn from "text in, text out" to models that can invoke external systems.
Sections thirteen through sixteen are about failure modes and their mitigations. Section thirteen is memorisation, which is both a capability and a liability. Section fourteen covers hallucination and factuality — why language models confabulate, and the partial remedies. Section fifteen is safety behaviours and jailbreaks, the arms race between alignment training and adversarial prompting. Section sixteen is interpretability — what we actually know about the inside of a large language model, and what we do not. The closing section sketches what the frontier looks like in early 2026 and where the remaining uncertainties live.
The most influential empirical finding of the last seven years is that a simple architecture trained on much more data and with much more compute gets much better at language, and continues to do so over many orders of magnitude. Scale did not fix every problem, but it dissolved enough of them that "just make it bigger" became a rational research strategy.
Richard Sutton's 2019 essay The Bitter Lesson argued that, across decades of AI research, the methods that ultimately succeeded were the ones that could leverage more compute — search, then learning — and that the methods that tried to hand-code domain knowledge consistently lost to methods that learned it from data. Sutton's observation was a historical claim, not a prediction, but by the time GPT-3 appeared in 2020 it was being quoted as prophetic. The decade-long effort to build elaborate linguistic pipelines, structured priors, and clever inductive biases into language models had been eclipsed by a single transformer pretrained on the internet. The bitter lesson, once again, had won.
The specific evidence was accumulating throughout 2019 and 2020. GPT-2, at 1.5B parameters, could write plausible paragraphs of text and — in its most striking result — perform simple tasks like translation and summarisation with no task-specific training, given only an appropriate prompt. GPT-3, at 175B parameters, could do this at a level competitive with fine-tuned specialist models across dozens of benchmarks. Nothing about the architecture had fundamentally changed. Nothing about the pretraining objective had fundamentally changed. What had changed was the number of parameters, the number of training tokens, and the number of GPU-hours spent. The scale hypothesis — that intelligence-like behaviour would emerge from sufficiently large models trained on sufficient data — was no longer a speculation.
Three structural properties made scale a rational investment rather than a gamble. First, loss falls predictably with compute: scaling laws (§2) mean that if you know your budget, you know roughly what loss you will achieve, and therefore roughly what capability bucket the resulting model will fall into. Second, the marginal returns to scale did not obviously diminish across the ranges people could afford to explore — every time someone built a model an order of magnitude larger, it turned out to have new or sharper capabilities, and each step justified the next. Third, scaling a single architecture is an industrial problem, not a research problem: once the recipe works at 1B parameters, taking it to 10B or 100B is engineering, not exploration. This made scale tractable in a way that, say, neural architecture search was not.
This is not to say scale is free, or that there are no other levers. Post-training techniques (RLHF, DPO, Constitutional AI — the subject of the next chapter) now contribute as much to the behaviour of a frontier model as pretraining does. Data curation has grown into a research area of its own. Inference-time compute, via chain-of-thought and tool use and repeated sampling, has become a second axis of capability orthogonal to training compute. But the background against which all of these techniques operate is a pretrained base model, and the pretrained base model is overwhelmingly determined by scale. The current question is not whether scale matters — it plainly does — but what scale alone does not give you. The rest of this chapter is, in effect, an attempt at a careful answer to that question.
Scale is not a theoretical principle. It is an empirical regularity: across several orders of magnitude, making the same model bigger and giving it more data makes it substantially better, in ways that are predictable for training loss and surprising for behaviour. That regularity is the core empirical fact of the modern era of language modelling.
Scaling laws relate training loss to compute, parameters, and data. They are clean, they are predictive, and they have been stable across roughly six orders of magnitude. But loss is not capability, and the relationship between the two is where most of the interesting argumentation in this chapter lives.
The empirical picture, covered in more detail in the previous chapter's §13–14, is this: Kaplan et al. 2020 established that language-model cross-entropy loss L falls as a power law in compute C, parameters N, and data D, over a range of roughly 10^-9 to 10^-2 PF-days of compute, and did so with remarkable cleanness — the log-log plots are straight lines with almost no scatter, the kind of experimental result that looks artificially tidy. Hoffmann et al. 2022 (Chinchilla) refined the allocation rule to roughly 20 tokens per parameter. The relationship between compute and loss is, at this point, one of the best-characterised empirical laws in the field.
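The allocation arithmetic can be made concrete. A minimal sketch, using the standard approximation C ≈ 6ND for training FLOPs and the Chinchilla rule of thumb D ≈ 20N; the function name and the exactness of the constants are illustrative, not part of either paper's code:

```python
import math

def chinchilla_allocation(compute_flops: float) -> tuple[float, float]:
    """Split a training compute budget between parameters and tokens.

    Assumes C ~= 6 * N * D FLOPs (the standard estimate for a dense
    transformer) and the Chinchilla rule D ~= 20 * N. Substituting
    gives C ~= 120 * N**2, hence N = sqrt(C / 120).
    """
    n_params = math.sqrt(compute_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Chinchilla itself used roughly 5.9e23 FLOPs: ~70B params, ~1.4T tokens.
n, d = chinchilla_allocation(5.9e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

Running the budget backwards like this is exactly what makes scale plannable: the loss the resulting model will reach is then read off the fitted power law.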
The subtlety is that the loss that scales so cleanly is the pretraining loss, which is the cross-entropy of the next token. That quantity is well-behaved mathematically — it is a smooth function of the model's output distribution — and it is what the training objective explicitly minimises. But it is not directly the quantity users care about. What we care about is downstream capability: can the model do arithmetic, follow instructions, write a program that runs, translate from Icelandic? The mapping from pretraining loss to capability is what turns a clean scaling law into a messy question about emergence, thresholds, and saturation.
Often the mapping is smooth. On benchmarks measured as continuous scores — perplexity on held-out text, BLEU on translation, ROUGE on summarisation — capability tends to improve monotonically with reduced pretraining loss, and scaling laws can be recast as capability-vs-compute curves with reasonable fidelity. For these tasks, bigger really does mean proportionally better, and the investment case for scale is straightforward. Sometimes the mapping is discontinuous. On tasks measured as accuracy with a sharp threshold — a problem is solved correctly or it is not — the capability curve can look flat for many orders of magnitude of scale and then rise sharply. This is what "emergent capabilities" refers to, and it is the subject of §3 and §4.
A practical consequence worth naming: scaling laws are useful for planning compute budgets but are less useful for predicting which specific capabilities a frontier model will have. A lab running a 10^26-FLOP training run can predict pretraining loss within a few per cent, but it cannot reliably predict whether the resulting model will solve a particular benchmark, because the loss-to-capability mapping is task-dependent and sometimes sharp. This is part of why frontier models continue to surprise both their developers and external evaluators — not because the loss curve is unpredictable, but because the mapping from loss to behaviour has no general closed form.
Wei et al. 2022 catalogued behaviours that appear in large models and are absent in smaller ones, under the heading of emergent abilities. The paper reshaped how people talked about scale: the gains from bigger models were not merely quantitative; in some cases they looked categorical. The idea is contested (§4), but the empirical catalogue is real.
Wei et al.'s definition is operational: an ability is emergent if it is absent in smaller models and present in larger ones, where "absent" means performance is at or near random-chance level. By this definition, they identified dozens of benchmarks — arithmetic, multi-step reasoning, code generation, cross-lingual transfer — where performance was flat at chance across many orders of magnitude and then rose sharply when models crossed some threshold scale. The threshold varied by task: for some, it was around 10^22 FLOPs of training compute; for others, an order of magnitude more. But the pattern — flat, then sharp, then continued improvement — was consistent enough to look like a real structural property of scaling rather than a measurement coincidence.
Specific examples from the paper are illustrative. Three-digit addition was essentially unsolvable at small scale and crossed to near-perfect accuracy in the 10^22–10^23 FLOP range. The ability to follow instructions expressed in natural language showed a similar pattern. Multi-step word problems — "If Sally has 3 apples and gives 2 to Bob…" — went from random to consistently correct as models grew. In-context learning itself (§5) was called out as an emergent ability: small models did not learn from prompt examples; large models did. If you believed the framing, capabilities were appearing in kind with scale, not merely in degree.
The concept resonated far beyond the paper. "Emergent capabilities" became a rhetorical anchor for scale-enthusiast arguments: the gains from making models bigger were not merely incremental, and therefore we had reason to expect further, unpredictable capabilities from further scaling. AGI-adjacent discussions leaned on this vocabulary heavily. So did funding pitches for bigger compute clusters. The framing was influential even outside the community of researchers who were actually running the experiments, and it is fair to say that "emergence" became the most-cited concept in popular writing about LLMs for the next two years.
Three caveats are worth holding before §4 introduces the critique. First, Wei et al. themselves were careful to note that "emergence" as they used it is a descriptive claim about observed scaling curves, not a mechanistic claim about phase transitions or genuinely novel structure. Second, they noted that the set of emergent abilities is not fixed — some previously-emergent abilities become solvable at smaller scales as training improves, data improves, or prompting improves. Third, the catalogue of emergent abilities is much shorter than the catalogue of continuously improving abilities. Most capabilities that benefit from scale do so smoothly. The discontinuous subset is interesting, but it is a subset.
Schaeffer, Miranda & Koyejo (NeurIPS 2023) pushed back on the emergence framing with a simple argument: if you use discontinuous metrics, you will see discontinuous behaviour, regardless of whether anything discontinuous is actually happening. Their critique substantially recalibrated the field, without eliminating emergence as a phenomenon entirely.
The argument runs as follows. Many of the benchmarks Wei et al. used to demonstrate emergence are scored by exact-match accuracy: an answer either is correct or is not. For a task like three-digit addition, you need every digit to be right; one mistake and you score zero. Under this metric, a model that is gradually improving — producing the right answer most of the time, then slightly more often, then slightly more often — will look flat for a while and then jump, because accuracy only registers changes that cross the binary threshold. Schaeffer et al. showed that if you rescored the same benchmarks with continuous metrics such as the cross-entropy of the correct answer, or partial-credit scoring (e.g. token edit distance to the correct string), the apparent emergence largely disappeared. The underlying capability was in fact improving smoothly with scale; it was the metric that was binary.
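The metric argument can be reproduced with a toy model. Assume, purely for illustration, that per-token accuracy improves as a smooth logistic function of log-compute; exact-match over a multi-token answer is then that probability raised to the number of tokens, which looks flat-then-sharp even though nothing discontinuous is happening. All constants below are invented for the sketch:

```python
import math

def per_token_accuracy(log10_compute: float) -> float:
    # Hypothetical smooth improvement: logistic in log10(FLOPs),
    # centred arbitrarily at 10^21 for illustration.
    return 1 / (1 + math.exp(-(log10_compute - 21)))

def exact_match(log10_compute: float, n_tokens: int = 10) -> float:
    # Exact match requires every token correct: p raised to n_tokens.
    # This is the binary-metric view of the same smooth capability.
    return per_token_accuracy(log10_compute) ** n_tokens

for lc in range(18, 25):
    p, em = per_token_accuracy(lc), exact_match(lc)
    print(f"10^{lc} FLOPs: per-token {p:.3f}, exact-match {em:.6f}")
```

The per-token column rises steadily from the start; the exact-match column sits indistinguishable from zero for several orders of magnitude and then climbs steeply — an "emergent" curve produced by a smooth underlying process.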
This is a serious critique because it is partly correct. For some of the benchmarks in Wei et al.'s original catalogue — three-digit addition is the canonical example — the Schaeffer rescoring is convincing, and the apparent threshold really is an artefact of exact-match accuracy. But the critique does not apply uniformly. Some capabilities do appear relatively sharply, even under continuous metrics. In-context learning itself (a small model really does not learn from prompt examples; a large model really does) is hard to dissolve into a metric artefact. Cross-lingual transfer has genuine threshold behaviour. Certain multi-step reasoning benchmarks continue to look discontinuous under partial-credit metrics. The empirical picture is more nuanced than either "emergence is real" or "emergence is a measurement artefact."
The subsequent work — notably Schaeffer's own follow-ups and several replication studies from 2023–24 — has produced something like a consensus: a substantial fraction of what Wei et al. originally called emergent is indeed a metric artefact, but a smaller residue does appear to involve genuine threshold-like behaviour. The interesting research question is now why that residue exists — whether it reflects something about the structure of in-context learning, or compositional reasoning, or certain kinds of retrieval from parametric memory — rather than whether emergence as a blanket phenomenon is real. The word "emergent" has become more precisely used, and the framing has been trimmed back to where the evidence actually supports it.
Two practical upshots follow from this debate. First, when someone reports an "emergent capability" on a new benchmark, the first question should be whether the metric is binary and whether the emergence survives a continuous rescoring. A surprising amount of the time it does not. Second, the continuously-improving view of scale — capabilities get gradually better with more compute, in a way scaling laws already predict — is a reasonable default, and the burden of proof is on claims of discontinuous behaviour. Emergence has not been eliminated, but it has been domesticated.
This is a good example of how careful empirical methodology matters even when the underlying technology works spectacularly well. The broad fact that scale produces capability is not in doubt. The specific claim that capabilities appear in kind at threshold scales, with no antecedent at smaller scales, is in doubt. Keeping those two claims distinct is essential for clear thinking about what scale actually does.
The most distinctive new capability of large language models is the ability to acquire new behaviours from examples inside the input prompt, with no parameter updates. Show the model a few input-output pairs and then a query, and it will complete the query in the same pattern. This is not how any previous generation of ML systems worked.
In-context learning was named and studied by Brown et al. in the GPT-3 paper. The observation: if you provide a large language model with a prompt of the form Q: [question] A: [answer], repeated several times with different examples, followed by a final Q: and no answer, the model will complete the final answer in the style of the preceding examples. It will translate if the examples were translations; it will classify if the examples were classifications; it will extract if the examples were extractions. No gradient updates occur during this process. The model's weights are unchanged. The "learning" happens entirely inside the forward pass.
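The prompt format is simple enough to sketch in a few lines. The helper below only assembles the string a model would be asked to complete — nothing here calls a model, and the Q:/A: delimiters follow the paper's example:

```python
def few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Assemble a GPT-3-style few-shot prompt from (question, answer)
    pairs. The model sees worked examples in a consistent Q:/A: format
    and is left to complete the answer to the final query. No weights
    change; the 'learning' happens in the forward pass over this string.
    """
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)

prompt = few_shot_prompt(
    [("Translate to French: cheese", "fromage"),
     ("Translate to French: bread", "pain")],
    "Translate to French: apple",
)
print(prompt)
```

A sufficiently large model completes the trailing "A:" with "pomme"; a small model completes it with more text in the style of the prompt but not the answer.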
Mechanistically, in-context learning is a forward-pass phenomenon: the model's attention pattern over the prompt identifies the task structure and applies it to the query. There is a growing body of interpretability work — Olsson et al. 2022 on induction heads, Akyürek et al. 2022 on gradient-descent emulation, Garg et al. 2022 on trained-from-scratch in-context learners — that gives partial mechanistic explanations for how this works in specific cases, but no complete theory. The high-level intuition is that the model has been pretrained on documents that contain implicit task demonstrations (tutorials, examples, question-and-answer pairs) and has learned a general "do what the preceding examples are doing" circuit that transfers to unseen tasks at inference time.
In-context learning is the capability that made prompting, as practised today, viable. Before GPT-3, adapting a pretrained model to a new task required fine-tuning on labelled data. After GPT-3, adapting a model to a new task could be as simple as writing a prompt. This lowered the barrier to using language models from "train a machine-learning model" to "write a good prompt," which is several orders of magnitude easier, and is the reason LLM products can be built by users who have never trained anything themselves. The shift from fine-tuning to prompting, as the default adaptation mechanism, is one of the defining industrial effects of large-scale language modelling.
The flip side is that in-context learning is sample-inefficient relative to gradient-based learning, and it is sensitive to the exact form of the prompt in ways that fine-tuning is not. Give a model three good examples and it often performs well; give it three slightly different ones, or in a different order, and performance can swing substantially (§6). The capability is real and surprising, but it is not magic, and treating it as a replacement for careful fine-tuning in all circumstances misses what each technique is good for.
Few-shot prompting is the practical technique that exploits in-context learning: give the model a handful of example input-output pairs in the prompt, and rely on the model to generalise the pattern to your query. The performance is strong enough to be useful and sensitive enough to be annoying.
The basic recipe is simple: select k example input-output pairs (typically k between 1 and 32), format them consistently, concatenate them with the query, and let the model complete. The number of examples matters — in general, more examples help, with returns diminishing past 4–8 for most tasks — but the selection and ordering of examples also matter, sometimes more than the count. Lu et al. 2022 showed that example order alone could swing accuracy by twenty points on some classification benchmarks, which is uncomfortably large for a technique that is supposed to be robust.
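One blunt but common response to ordering sensitivity is to score every ordering on a small labelled development set and keep the best. A minimal sketch of the enumeration step (the prompt format and function name are illustrative, and k! orderings is only feasible for small k):

```python
from itertools import permutations

def all_orderings(examples: list[tuple[str, str]]) -> list[str]:
    """Enumerate every ordering of a few-shot example set as a prompt
    body. Lu et al. 2022 found accuracy can swing widely across these
    orderings, so scoring each candidate on held-out data and keeping
    the best is a simple (if expensive) calibration strategy.
    """
    prompts = []
    for order in permutations(examples):
        body = "\n\n".join(f"Input: {x}\nLabel: {y}" for x, y in order)
        prompts.append(body)
    return prompts

examples = [("great film", "positive"), ("dull plot", "negative"),
            ("loved it", "positive")]
print(len(all_orderings(examples)))  # 3! = 6 candidate prompts
```

For k = 4 this is 24 prompts; for k = 8 it is 40,320, which is why published methods fall back on heuristics or sampled orderings rather than exhaustive search.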
A subtler finding, from Min et al. 2022, is that the labels in few-shot examples matter less than you would expect — sometimes replacing them with random labels from the label set barely degrades performance. What seems to matter more is the format (what kind of thing is being asked), the label space (what kinds of answers are permitted), and the distribution of inputs (what does a typical query look like). The model appears to be using the examples primarily to infer task structure rather than to learn a specific input-output mapping from them. This makes in-context learning closer to "schema induction from demonstrations" than to "supervised learning on a small dataset."
The practical consequence for prompt engineering is that getting few-shot prompting to work well is a mix of science and craft. Several heuristics reliably help: using examples that are representative of the input distribution; including diverse examples rather than redundant ones; structuring prompts with consistent delimiters; placing harder or more informative examples near the end of the context; asking the model to respond in a specific format. Several heuristics reliably hurt: inconsistent formatting across examples; confusingly verbose or noisy examples; prompts that implicitly contradict each other. The craft of prompt engineering is largely a craft of minimising accidental noise in a technique that is already sensitive to it.
Few-shot prompting has also been mostly superseded for frontier-model deployment by instruction-tuned models (the subject of the next chapter). Once a model has been fine-tuned to follow natural-language task descriptions — "classify the sentiment of the following review" — the benefit of concrete examples in the prompt diminishes, because the model can often do the task zero-shot. But few-shot prompting remains the canonical paradigm for research comparisons, for cases where a task is hard to describe in words, and as a fallback when zero-shot performance is inadequate.
Asking a model to "think step by step" before answering unlocks substantial gains on reasoning-heavy tasks, with no change to the model itself. Chain-of-thought prompting turned multi-step reasoning from a visible weakness of large language models into a reliable capability, and established inference-time compute as a lever orthogonal to training compute.
Wei, Wang, Schuurmans et al. 2022 introduced chain-of-thought (CoT) prompting as a simple modification: instead of few-shot examples that pair an input with a final answer, use examples that pair an input with an intermediate reasoning trace and then a final answer. "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have? A: Roger started with 5 tennis balls. 2 cans of 3 tennis balls each is 6. 5 + 6 = 11. The answer is 11." The effect on downstream performance was substantial — GSM8K accuracy more than doubled for PaLM-540B — and it scaled: CoT helped larger models more than smaller ones, sometimes dramatically.
Kojima et al. 2022 showed that even zero-shot chain-of-thought worked: simply appending "Let's think step by step." to a prompt produced much of the gain, without any examples at all. This suggested that the capability to reason in explicit steps was latent in the pretrained model and merely needed to be elicited. The mechanism appears to be that chain-of-thought gives the model more inference-time compute — more forward-pass tokens in which to build up an internal representation of the problem — and structures that compute around explicit intermediate states that the model can attend to when producing the final answer.
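A minimal sketch of the zero-shot recipe: append the trigger phrase, sample a trace, and recover a final answer from it. Kojima et al. used a second extraction prompt; the regex-based extraction below is a common, brittle stand-in, not the paper's method:

```python
import re

STEP_BY_STEP = "Let's think step by step."

def zero_shot_cot_prompt(question: str) -> str:
    # Kojima et al.'s trigger phrase, appended verbatim after "A:".
    return f"Q: {question}\nA: {STEP_BY_STEP}"

def extract_final_number(trace: str):
    """Take the last number in a reasoning trace as the answer.

    A widespread convention in evaluation scripts; real pipelines
    usually enforce a stricter answer format instead.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", trace)
    return numbers[-1] if numbers else None

trace = ("Roger started with 5 balls. 2 cans of 3 balls each is 6. "
         "5 + 6 = 11. The answer is 11.")
print(extract_final_number(trace))  # -> 11
```

The extraction step matters more than it looks: a model can reason correctly and still be scored wrong if the trace's final answer is not where the parser expects it.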
Several refinements extended the basic idea. Self-consistency (Wang et al. 2022) samples multiple chain-of-thought traces and takes the most frequent final answer, exploiting the fact that the traces explore different reasoning paths. Tree-of-thoughts (Yao et al. 2023) structured the search over reasoning steps more explicitly. Reflexion, ReAct, and various agentic frameworks combined CoT with tool use and external feedback. All of these are variations on a single theme: scaffold the model's inference-time compute in ways that help it decompose hard problems into tractable sub-problems.
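The self-consistency aggregation step is a plain majority vote over final answers, sketched here under the assumption that the final answers have already been extracted from each sampled trace:

```python
from collections import Counter

def self_consistency(final_answers: list[str]) -> str:
    """Majority vote over final answers from independently sampled
    chain-of-thought traces (Wang et al. 2022). The traces explore
    different reasoning paths; only their conclusions are aggregated.
    """
    return Counter(final_answers).most_common(1)[0][0]

# Five sampled traces; three distinct reasoning paths arrive at "11".
print(self_consistency(["11", "9", "11", "12", "11"]))  # -> 11
```

The intuition is that there are many ways to reason wrongly but (usually) a small number of ways to reason correctly, so correct answers cluster across samples while errors scatter.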
The broader implication is that reasoning capability is not fully determined by training compute. A model that is pretrained to the same loss can produce quite different answers depending on how its inference-time compute is structured, and this has become a second axis of research in its own right. The current frontier — exemplified by OpenAI's o-series models, Anthropic's extended-thinking Claude variants, and DeepSeek's R1 — treats inference-time reasoning as a first-class trainable capability, sometimes with reinforcement learning targeted explicitly at the quality of the reasoning trace. The line between "prompting" and "training" has blurred considerably in the two years since CoT became standard practice.
Large language models are better at reasoning than their predecessors by any reasonable measure, but there is a clear and consistent set of things they still struggle with — compositional generalisation, long-horizon planning, problems that require careful tracking of intermediate state. Understanding these failure modes is essential to using the models responsibly.
Compositional generalisation — the ability to combine known primitives in novel combinations — has been a persistent weakness. Benchmarks like SCAN (Lake & Baroni 2018) and COGS (Kim & Linzen 2020) showed that neural sequence models could learn individual primitives and many of their combinations but failed systematically on held-out combinations not seen during training. Scale has improved the picture but not dissolved the problem: models still make systematic errors on out-of-distribution compositions, especially when the relevant primitives are rare. The errors tend to be the ones humans find surprising — simple patterns extrapolated wrongly in obvious ways.
Long-horizon planning is a second reliable weakness. A model can usually handle one or two dependent reasoning steps well; five or ten becomes unreliable; twenty rarely succeeds without an external scaffold. Problems that require explicit search — planning a multi-step solution, backtracking from a dead end, maintaining multiple candidate plans — tend to expose this weakness. The architectural reason is hypothesised to be that transformers at inference time do a fixed amount of compute per token; they have no native mechanism for "thinking longer" about a hard step without producing more tokens. Chain-of-thought is the workaround, but even extensive chain-of-thought has a ceiling, and frontier research in 2024–26 has focused on getting models to allocate inference-time compute more adaptively.
Working-memory limits show up on tasks that require tracking precise intermediate state across many tokens: multi-digit multiplication by hand, tracking multiple variables through a multi-step story, carrying partial sums across a long table. Models get confused in predictable ways, often in ways humans would also find hard without a scratchpad. Explicit scratchpadding — asking the model to write down its work — helps, as does tool use (§12), because either mechanism offloads working memory to the visible context rather than relying on hidden activations.
Reasoning failures have been widely documented on puzzles designed to exploit them. The reversal curse (Berglund et al. 2023) — a model that learns "A is B" during training cannot, in general, make the reverse inference and answer questions of the form "B is…?" with A — is a famous example of an asymmetry that is hard to explain mechanistically but robust empirically. Simple logical reasoning problems that require careful tracking of quantifiers and scope remain challenging. Planning benchmarks like PlanBench (Valmeekam et al. 2023) show that even frontier models struggle with certain kinds of formal planning, often in ways that do not improve much with scale.
The practical lesson is not that language models cannot reason — they plainly can, on most reasonably-sized problems — but that their reasoning is uneven. They are strong at pattern-matching kinds of reasoning and weaker at the kinds that require careful bookkeeping. When you need reliable long-horizon reasoning, you typically need an external mechanism: search scaffolds, verifiers, tools that can do the bookkeeping, or a human in the loop. This is not a failure of language models; it is a design constraint for building systems on top of them.
Mathematics is the sharpest empirical testbed for reasoning claims about language models. Answers are right or wrong, proofs check or do not, and partial credit is rarely ambiguous. The trajectory of mathematical performance from GPT-3 through the present is a useful history of how LLM reasoning has actually progressed.
The relevant benchmarks cluster into two groups. Grade-school word problems — the GSM8K dataset (Cobbe et al. 2021) — test whether models can parse a natural-language problem, translate it into arithmetic, and execute correctly. Olympiad-level problems — MATH (Hendrycks et al. 2021) and its successors — test competition-level skill across algebra, calculus, combinatorics, and number theory. GSM8K is essentially a language-understanding benchmark with a numerical answer; MATH is a genuine reasoning benchmark. The two have tracked each other in rough lockstep as models have improved.
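GSM8K grading makes the "language-understanding benchmark with a numerical answer" point concrete. Reference solutions in the dataset end with a line of the form "#### <answer>"; the sketch below follows the common evaluation recipe (extract the last number from the model's output and compare numerically), not the official harness:

```python
import re

def gsm8k_reference_answer(solution: str) -> str:
    # GSM8K solutions terminate with "#### <answer>"; commas appear
    # in large numbers and are stripped before comparison.
    return solution.split("####")[-1].strip().replace(",", "")

def is_correct(model_output: str, solution: str) -> bool:
    """Exact-match numeric grading: last number in the model output
    versus the reference answer. A sketch of the usual recipe; real
    harnesses add answer-format prompting and more careful parsing."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output.replace(",", ""))
    if not numbers:
        return False
    return float(numbers[-1]) == float(gsm8k_reference_answer(solution))

solution = "2 cans of 3 is 6. 5 + 6 = 11.\n#### 11"
print(is_correct("Step by step: 5 + 6 = 11. The answer is 11.", solution))
```

The binary pass/fail scoring here is exactly the kind of threshold metric that §4's debate is about: a model whose arithmetic is improving smoothly will register nothing on this grader until whole answers start coming out right.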
The progression is stark. GPT-3 (2020) scored around 5% on GSM8K and under 5% on MATH. Minerva (Lewkowycz et al. 2022), a PaLM model continue-pretrained on mathematical content and evaluated with chain-of-thought prompting, reached 78% on GSM8K and 50% on MATH. GPT-4 (2023), instruction-tuned and with chain-of-thought enabled, reached around 92% on GSM8K and over 50% on MATH out of the box. The o-series reasoning models (late 2024) and their successors — combining chain-of-thought with reinforcement learning on reasoning traces — pushed MATH scores past 95% and opened serious progress on olympiad-level benchmarks like AIME and USAMO. DeepSeek-R1 (2025) demonstrated that comparable performance was achievable with open weights and much less compute than had been previously assumed necessary.
Several features of this trajectory are worth noting. First, the gains came from multiple stacked interventions — scale, data curation, chain-of-thought, tool use, reinforcement learning on reasoning — not from any single magic ingredient. Second, the relationship between scale and math performance is one of the cleaner examples of smoothly-improving capability: bigger models are reliably better, with no obvious discontinuities once chain-of-thought is applied. Third, mathematical reasoning has been one of the few areas where relatively cheap interventions at the post-training or inference stage (CoT, RL-on-reasoning-traces) produced returns comparable to large amounts of additional pretraining.
Caveats remain. A model that can solve 95% of GSM8K can still produce confidently wrong answers on the remaining 5%, sometimes with no warning signal. Benchmark contamination is a real concern — many of these problems are now trivially searchable online, and the training corpora have almost certainly seen them. Olympiad-level results need to be interpreted carefully, as the hardest problems often require insight or creativity that is not well-captured by averaged benchmark scores. But the broad trajectory — from essentially no quantitative reasoning ability in 2020 to near-expert mathematical performance on average-case problems by 2025 — is a genuine empirical achievement, and it is the strongest evidence we have that scale plus careful training can produce genuine reasoning capability, not merely pattern-matched surface imitation.
Code is an unusually clean testbed for language models because it is verifiable — a generated program either runs and returns the right answer or it does not — and because its training data is large, well-structured, and free of most of the noise that affects natural-language corpora. Code capability has been a leading indicator of reasoning capability more broadly.
The modern era of LLM coding starts with Codex (Chen et al. 2021), a GPT-3 variant continue-pretrained on GitHub. The accompanying HumanEval benchmark — 164 hand-written Python problems with hidden test cases — became the standard measure of coding competence, and its progression traces the field: Codex at 28.8% pass@1; GPT-4 at 67% pass@1; Code Llama 70B at 67% pass@1; DeepSeek-Coder at 70%+; the most recent frontier closed-source and open-source models at 85–95%. HumanEval has been saturated to the point that the field has moved on to harder benchmarks (MBPP, APPS, SWE-bench, LiveCodeBench) that test more realistic engineering tasks.
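The pass@k numbers above are typically computed with the unbiased estimator introduced in the Codex paper: generate n samples per problem, count the c that pass the hidden tests, and estimate the probability that at least one of k random draws (without replacement) would pass. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. 2021.

    n: total samples generated per problem
    c: number of those samples that passed the hidden tests
    k: budget being estimated
    Returns the probability that at least one of k samples drawn
    without replacement from the n passes.
    """
    if n - c < k:  # fewer failures than draws: success is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 200 samples with 30 passing: single-draw success vs. best-of-10.
print(pass_at_k(200, 30, 1))   # -> 0.15
print(round(pass_at_k(200, 30, 10), 3))
```

The naive alternative, estimating pass@k as 1 − (1 − c/n)^k, is biased upward for small n; averaging the estimator above over all benchmark problems gives the headline number.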
Code capability benefits from most of the techniques covered in this chapter stacked together. Scale helps (bigger models write better code). Chain-of-thought helps (reasoning about the problem before writing code improves correctness). Tool use helps (running the code and iterating based on test results improves correctness substantially). Domain-specific pretraining on code corpora helps (Code Llama's continued pretraining added 20+ points over the base Llama). Reinforcement learning from execution feedback helps (several modern code models have been RL-tuned on pass/fail signals). The combined effect is that coding performance has advanced faster than almost any other capability area, and several commercial coding assistants now achieve >80% first-shot correctness on realistic tasks.
Two practical integrations deserve mention. Coding assistants — GitHub Copilot, Cursor, Zed, Windsurf, the integrated code-editing modes in frontier chat products — combine LLM-generated suggestions with IDE integration, real-time diagnostics, and direct execution. They are the first generative AI product category to achieve clear product-market fit among a professional user base, and their usage has changed the day-to-day practice of software engineering in ways that are still being absorbed. Agentic coding — Claude Code, OpenAI's Codex agent, Aider, and similar systems — gives the model a shell and a filesystem, turns coding into a multi-step task with iteration and tool use, and handles tasks that previously required a human engineer. The SWE-bench benchmark (Jimenez et al. 2024) measures performance on real-world bug-fixing tasks from GitHub issues; frontier agents have moved from 2% (early 2024) to 70%+ (mid-2025) in about eighteen months.
The reason code has been such a successful capability area is partly intrinsic — formal languages are easier for language models than natural language in some ways, because the syntax is unambiguous and the semantics are executable — and partly training-data-driven: there are enormous amounts of well-formatted open-source code available, with implicit quality signals from stars, downloads, and test suites. Whether this will generalise to other formal-language domains (theorem proving, formal verification, scientific simulation) is an active research question. Early results on Lean-based theorem proving are promising but far less mature than code generation.
Large language models trained predominantly on English develop surprising cross-lingual transfer — they can often read, translate, and generate in languages that made up single-digit percentages of their training mix. The asymmetries are real and persistent, but the capability is genuine.
The empirical pattern, documented across many frontier models, is that a model trained on a corpus that is 90% English can nevertheless operate reasonably in French, Spanish, German, Chinese, Japanese, and dozens of other languages. Quality varies: the model is typically best in the language that dominates the training mix, somewhat less fluent in other high-resource languages, noticeably less fluent in medium-resource languages like Polish, Indonesian, or Vietnamese, and frequently unreliable in low-resource languages like Yoruba, Swahili, or Tigrinya. But the fact that any reasonable capability exists in languages with essentially no deliberate training effort is a non-trivial empirical finding.
The mechanism appears to be a combination of shared subword vocabulary — tokenisers trained on multilingual corpora create overlapping vocabulary across related languages — and shared latent representations — linguistic regularities that hold across languages (subject-verb-object structure, predicate-argument relations, temporal deixis) can be encoded once and applied across the languages that share them. Aryabumi et al. 2024 and related work have shown that even small amounts of text in a target language, mixed into a predominantly-English corpus, suffice to unlock substantial capability in that language, suggesting that the model is building shared machinery with the target language providing just enough anchoring.
The asymmetries show up in several places. Tokenisation efficiency favours English: English text tokenises at roughly 1.3 tokens per word (about 0.75 words per token), while Chinese is close to 1 token per character, and languages like Thai or Burmese can be 2–5× less efficient. This directly affects context-window economics and inference cost for non-English users. Factual knowledge is strongest in English and weakest in low-resource languages, because the training data is unevenly distributed. Cultural grounding skews to the cultures represented in the training data, which are disproportionately Western and Anglophone. Dialectal variation is often flattened — a model may handle standard Mandarin or standard Arabic well but struggle with regional varieties that are under-represented in the written corpus.
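One mechanical source of the asymmetry is easy to see: byte-level BPE vocabularies bottom out at UTF-8 bytes, and many scripts cost more bytes per character than Latin text. The toy measurement below shows only that byte floor, not real token counts; multilingual vocabularies with deliberate data mixing recover much of the gap, but rarely all of it:

```python
def bytes_per_char(text: str) -> float:
    """UTF-8 bytes per character: the floor a byte-level BPE starts from."""
    return len(text.encode("utf-8")) / len(text)

samples = {
    "English": "The cat sat on the mat.",
    "Chinese": "猫坐在垫子上。",
    "Thai":    "แมวนั่งอยู่บนเสื่อ",
}
for lang, text in samples.items():
    # ASCII is 1 byte/char; CJK and Thai code points are 3 bytes each.
    print(f"{lang}: {bytes_per_char(text):.2f} bytes/char")
```

BPE merges then compress English far below one token per byte, because English dominates the merge statistics; scripts that start three bytes deep and appear rarely in the training mix get fewer merges, compounding the disadvantage.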
The policy implications are real. A language model deployed globally serves its users unequally — faster for English speakers, cheaper for English speakers, more accurate for English speakers — and closing that gap requires deliberate effort in data collection, tokeniser design, and evaluation. Several multilingual-first models (Aya from Cohere, the Qwen series in Chinese/English, the various Mistral multilingual editions) have made this a design priority. The trend in frontier models is toward larger vocabularies and more deliberate multilingual data mixing, which narrows the gap without eliminating it.
A language model that can call a calculator when it needs to do arithmetic, or search the web when it needs current information, or execute code when it needs to verify a claim, is substantially more capable than one that cannot. Tool use turns the LLM from a static answer-box into the reasoning core of a larger system.
The insight behind tool use is that language models are excellent at knowing when they need external help and relatively poor at doing certain kinds of work internally. Arithmetic with many digits is a classic example — the model is much worse at multiplication than a calculator would be at multiplication, but the model is very good at recognising when multiplication is called for. If you let the model emit a calculator call, execute it externally, and feed the result back, you get the best of both. The same logic applies to web search for fresh information, Python execution for data processing, database queries for structured lookups, and vector retrieval for grounding in specific document collections.
Toolformer (Schick et al. 2023) was an early paper that demonstrated the basic recipe: have the model generate (or be fine-tuned to generate) tool-call tokens that the runtime intercepts, executes, and injects back into the context. ReAct (Yao et al. 2022) formalised the interleaving of reasoning steps with action steps. Modern function-calling APIs — Anthropic's tool-use, OpenAI's function calling, and their equivalents — expose this as a first-class interface: the developer describes the available tools in JSON schema, and the model decides which to call with what arguments. The runtime handles execution and returns results; the model continues with the results in context.
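The runtime loop these APIs implement is simple enough to sketch in full. Everything here is illustrative: the tool registry, the hard-coded `fake_model` stand-in for an actual LLM call, and the message format are hypothetical, not any vendor's schema:

```python
# Hypothetical tool registry -- names and structure are illustrative.
TOOLS = {
    "calculator": {
        "description": "Evaluate an arithmetic expression.",
        "run": lambda args: str(eval(args["expression"], {"__builtins__": {}})),
    },
}

def fake_model(messages):
    """Stand-in for an LLM: decides whether to call a tool.

    A real system would send `messages` plus the tool schemas to a
    model API and parse the response; here the decision is hard-coded.
    """
    last = messages[-1]
    if last["role"] == "user" and "*" in last["content"]:
        return {"tool_call": {"name": "calculator",
                              "arguments": {"expression": last["content"]}}}
    return {"text": f"The answer is {last['content']}."}

def run(prompt: str) -> str:
    messages = [{"role": "user", "content": prompt}]
    while True:
        reply = fake_model(messages)
        if "tool_call" not in reply:
            return reply["text"]
        call = reply["tool_call"]
        result = TOOLS[call["name"]]["run"](call["arguments"])
        # Inject the tool result back into the context and loop.
        messages.append({"role": "tool", "content": result})

print(run("12345 * 6789"))  # -> "The answer is 83810205."
```

The essential structure — model proposes, runtime executes, result re-enters the context — is the same whether the tool is a calculator, a search engine, or a shell, and the loop is what distinguishes tool use from single-shot generation.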
Tool use transforms several capability areas. Arithmetic and symbolic computation become reliable, because the model delegates to a computer-algebra system. Factual accuracy on recent events improves dramatically, because the model can search rather than relying on parametric memory (see also Chapter 09 on retrieval-augmented generation). Complex reasoning benefits, because the model can offload bookkeeping and verification to tools. Code generation in particular benefits from an execution tool — the model can write code, run it, see the output, and iterate, which catches a substantial fraction of the errors a single forward pass would make.
The natural extension is agentic systems that execute sequences of tool calls over long horizons with minimal human intervention. Anthropic's Claude Code (which powers this chapter's generation), OpenAI's Codex agent, Cognition's Devin, and dozens of domain-specific agents are all applications of this pattern. The current capability boundary — which is moving fast — is roughly at "tasks that require tens of tool calls over minutes to hours of wall-clock time, with clear success signals along the way." Tasks that require hundreds or thousands of coherent steps, or long-horizon goal pursuit with ambiguous signals, remain difficult. The weaknesses identified in §8 on reasoning limits are the weaknesses that also bound agentic performance, which is not a coincidence — agents are just reasoning over longer horizons with more side effects.
Large language models memorise substantial portions of their training data verbatim. This is sometimes a capability — named facts, direct quotes, specific code snippets — and sometimes a liability — copyrighted content, private information, training-data leaks. The line between useful memorisation and problematic memorisation is where much of the current legal and technical argument lives.
Carlini et al. 2020–23 established the empirical baseline: a language model will reproduce, verbatim, a non-trivial fraction of the sequences it was trained on, given the right prompt. The fraction scales with model size (bigger models memorise more), with training-data repetition (duplicated content is memorised more reliably), and with how "distinctive" a piece of text is (common phrases are harder to attribute; distinctive passages are memorised as units). Training-data extraction attacks — carefully crafted prompts that elicit verbatim training sequences — have been demonstrated successfully against every frontier model that has been tested, including commercial ones.
From one perspective, memorisation is central to the model's usefulness. A model that cannot reproduce facts — who wrote Pride and Prejudice, when the French Revolution started, how numpy.argmax works — is much less useful than one that can. The model's parametric knowledge is its memorisation of training data, abstracted and compressed but fundamentally rooted in what it has read. In this sense there is no sharp line between "reasoning" and "memorisation" for models: the capability to answer a factual question depends on the model having encountered that fact in training, even if it paraphrases rather than quotes verbatim.
From another perspective, memorisation is a serious liability. Copyright holders argue that verbatim reproduction of protected works constitutes infringement, and several ongoing lawsuits (Authors Guild v. OpenAI, the NYT v. OpenAI, Bartz v. Anthropic) turn partly on empirical demonstrations of verbatim regurgitation. Privacy concerns arise when training data includes personal information (email addresses, phone numbers, medical details) and the model can be coerced into reproducing it. Security concerns arise when training data includes credentials — API keys, passwords — that appear in public scrapes of code repositories.
Mitigations are partial. Deduplication of training data reduces the rate of verbatim memorisation substantially (Lee et al. 2022). Differential-privacy training gives mathematical guarantees at the cost of significant capability loss, and is not currently practical at frontier scale. Output filtering — refusing to produce long verbatim matches to known copyrighted works — is a weaker guarantee but is widely deployed. Licensing data to have the legal right to train on it shifts the legal question without changing the technical one. None of these dissolves the underlying fact that a model that learns from text necessarily retains some of it, and that the boundary between "useful knowledge" and "verbatim reproduction" is not currently well-defined.
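The deduplication machinery is worth seeing concretely. The sketch below fingerprints documents by hashed word n-grams and flags pairs with high Jaccard overlap as duplicate candidates; it is a toy in the spirit of, but much cruder than, the suffix-array and MinHash methods used at corpus scale:

```python
import hashlib

def ngram_hashes(text: str, n: int = 5) -> set:
    """Hashes of all word n-grams: a cheap document fingerprint."""
    words = text.lower().split()
    return {
        hashlib.sha1(" ".join(words[i:i + n]).encode()).hexdigest()
        for i in range(len(words) - n + 1)
    }

def overlap(a: str, b: str, n: int = 5) -> float:
    """Jaccard similarity over n-gram fingerprints."""
    ha, hb = ngram_hashes(a, n), ngram_hashes(b, n)
    if not ha or not hb:
        return 0.0
    return len(ha & hb) / len(ha | hb)

doc = "the quick brown fox jumps over the lazy dog near the river bank"
near_dup = "the quick brown fox jumps over the lazy dog near the river"
print(overlap(doc, near_dup) > 0.5)   # high overlap: duplicate candidate
print(overlap(doc, "completely unrelated text about language models here"))
```

Removing near-duplicates this way attacks the repetition factor directly: content seen once is memorised far less reliably than content seen hundreds of times.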
Language models generate fluent, confident-sounding text whether or not the underlying content is true. The failure mode — producing plausible-sounding fabrications — has been named "hallucination," and it is one of the most persistent and least-solved problems in practical LLM deployment.
Hallucination is a broad term that covers several distinct failure modes: invented facts, invented citations, invented quotes, invented code library functions, invented historical events, confidently-wrong answers to questions with real answers, and plausible-sounding but subtly incorrect explanations. What these share is that the model has no mechanism to know it is wrong — it is producing high-probability continuations of the prompt, and "high probability" and "factual" are often correlated but not identical. A fluent wrong answer is, for the model's internal metric, indistinguishable from a fluent right answer, because the training objective does not penalise confident wrongness.
Several factors contribute. The pretraining objective rewards confident prediction; it does not reward honest uncertainty. The training data, much of it scraped from the internet, contains false information that the model may learn as fact. Coverage is uneven — the model knows a lot about famous authors and frequently-discussed topics, less about obscure ones, and may invent plausible-sounding content in the gaps. Instruction-tuning and RLHF typically reward sounding helpful, which can trade off against acknowledging uncertainty. The combination produces a model that will produce an answer of the expected shape for nearly any question, regardless of whether it has reliable information on the topic.
Mitigations fall into three buckets. Retrieval — the subject of Chapter 09 — grounds generation in specific retrieved documents, so the model can cite sources and the user can verify. Retrieval does not eliminate hallucination (models can still misread retrieved passages or invent details not in them), but it reduces it substantially and provides auditability. Calibration training — tuning the model to express uncertainty appropriately and to refuse to answer when unsure — helps, but is hard to do without over-training refusals. Verification — having a second model or a rule-based system check the first model's output against trusted sources — is increasingly common in production systems but adds latency and cost.
Frontier-model hallucination rates have dropped meaningfully from GPT-3 to the present, on both internal benchmarks and external evaluations. But the fundamental problem — that the model has no mechanism to know the limits of its own knowledge — remains unsolved. Any user-facing deployment has to assume some hallucination rate and design around it, whether through human review, retrieval grounding, tool-based verification, or tolerance of occasional error. Treating LLM outputs as authoritative without these safeguards is a mistake that the field has collectively learned to stop making, but that individual deployers still occasionally have to relearn.
A useful distinction: hallucination is not the same as being wrong. A calibrated model that says "I'm not sure, but I think X" when in fact X is false is wrong, but is not hallucinating in the problematic sense. A model that says "X, definitively" when X is false is hallucinating. The fix is partly accuracy (knowing more things) and partly calibration (knowing what you know). Progress has come from both angles.
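Calibration in this sense can be measured. Expected calibration error (ECE) bins predictions by the model's stated confidence and compares each bin's average confidence to its empirical accuracy; a sketch on toy data:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, compare mean confidence to
    empirical accuracy within each bin, and weight by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total, ece = len(confidences), 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# A model claiming 90% confidence but right only half the time is badly
# calibrated; one claiming 50% and right half the time is perfect.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
calibrated = expected_calibration_error([0.5] * 10, [True] * 5 + [False] * 5)
print(round(overconfident, 3))  # -> 0.4
print(round(calibrated, 3))     # -> 0.0
```

Both toy models above have the same 50% accuracy; ECE separates them, which is exactly the accuracy-versus-calibration distinction drawn in the paragraph above.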
Frontier models are trained to refuse certain requests — to avoid producing content that is harmful, illegal, or seriously disallowed by their developers' policies. These safety behaviours are the product of extensive post-training, and they are in a perpetual cat-and-mouse dynamic with adversarial prompts designed to circumvent them.
A frontier chat model, out of the box, refuses a broad but somewhat fuzzy set of requests: instructions for synthesising weapons of mass destruction, sexual content involving minors, clearly-identified disinformation campaigns, practical assistance with committing fraud, detailed guidance for self-harm. The specific contours of the refusal policy vary by developer (OpenAI, Anthropic, Google, Meta, and various open-source projects all differ), but there is substantial agreement on the strictly-refused categories, and substantial disagreement on the borderline ones — should the model help with dual-use security research, with controversial political claims, with explicit content between adults? The policy choices are visible through the models' behaviour and have become subjects of public debate in their own right.
Safety behaviours are instilled primarily through post-training: supervised fine-tuning on curated refusal examples, reinforcement learning from human feedback that rewards safe completions and penalises unsafe ones, and (in Anthropic's case) Constitutional AI that bakes policy into the training signal via model-generated critiques. These techniques are the subject of Chapter 07. They are moderately effective at producing consistent refusal on clearly-bad requests, less effective at producing consistent behaviour on borderline requests, and quite brittle under adversarial pressure.
Jailbreaks are adversarial prompts that bypass safety training. The early examples were simple: roleplay scenarios ("pretend you are an AI with no restrictions"), instructional framings ("I'm a nurse and need to know for medical reasons"), and basic prompt injection ("ignore your previous instructions"). These have been largely patched. More sophisticated attacks — gradient-based attacks that optimise prompt token strings against a model, translation into rare languages, encoding the request in base64 or ROT13, multi-turn conversational manipulation, prompt-injection via untrusted context — remain effective against frontier models with variable success rates. Zou et al. 2023 showed that gradient-based attacks could achieve near-100% success rates against several open models; closed models are somewhat more robust but not immune.
The adversarial dynamic is asymmetric in favour of attackers in important respects: they can iterate quickly, they only need to find one working attack, and many attack vectors generalise across models. Defenders face the opposite situation — they need to cover every plausible attack, without over-refusing benign requests and without degrading the model's usefulness. The current state is a stable equilibrium at "most users cannot jailbreak models without effort; motivated adversaries can." Whether this equilibrium is adequate for the risks involved depends on the specific risks — a model that declines to help with homework queries has a different risk surface than one being considered for biological-weapons defence, and the safety evaluation has to match the stakes.
What actually happens inside a large language model when it produces an output? The honest answer is that we understand a small fraction of it, and that fraction is growing. Interpretability research — probing, features, circuits, mechanistic analysis — has become an active subfield, partly motivated by safety concerns and partly by scientific curiosity.
Probing is the oldest interpretability technique: train a small classifier on the internal activations of a language model to predict some property — syntactic role, coreference, sentiment, knowledge of a specific fact — and see whether it succeeds. If it does, the model's activations encode that property; if it does not, they may not. Probing established, as early as 2018–19, that BERT and its successors encode substantial linguistic structure (syntax, semantic roles, factual associations) in interpretable ways across their layers. It is a blunt tool — it tells you what information is present, not how it is used — but it is also a reliable one.
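The probing recipe is simple enough to show end-to-end on synthetic data. Below, the "activations" are fabricated so that a binary property is linearly encoded along a hidden direction, and a bare perceptron serves as the probe; every part of this is a toy stand-in for real model activations:

```python
import random

random.seed(0)
DIM = 16
# Hidden direction along which the synthetic activations encode a
# binary property -- standing in for e.g. "the subject is plural".
direction = [random.gauss(0, 1) for _ in range(DIM)]

def make_example():
    label = random.random() < 0.5
    noise = [random.gauss(0, 0.3) for _ in range(DIM)]
    sign = 1.0 if label else -1.0
    return [sign * d + n for d, n in zip(direction, noise)], label

train = [make_example() for _ in range(200)]
test = [make_example() for _ in range(100)]

# A perceptron is the simplest possible probe: if even this recovers
# the property, the information is linearly present in the activations.
w = [0.0] * DIM
for _ in range(20):
    for act, label in train:
        pred = sum(wi * a for wi, a in zip(w, act)) > 0
        if pred != label:
            sign = 1.0 if label else -1.0
            w = [wi + sign * a for wi, a in zip(w, act)]

accuracy = sum(
    (sum(wi * a for wi, a in zip(w, act)) > 0) == label
    for act, label in test
) / len(test)
print(accuracy > 0.9)  # the probe recovers the encoded property
```

The standard caveat applies here too: a successful probe shows the information is present and linearly decodable, not that the model uses it, which is why probing results need causal follow-ups of the kind described next.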
Mechanistic interpretability is a newer and more ambitious programme: reverse-engineer specific circuits inside the model that implement specific behaviours. The early reference point is Olah et al. 2020 on vision models; for language models, the landmark work has been Anthropic's "circuits" thread, Neel Nanda's open-source TransformerLens, and a growing body of academic work. The most influential single result is the identification of induction heads (Elhage et al. 2021; Olsson et al. 2022) — attention heads that implement a "copy the next token after a previous occurrence of the same token" pattern, which turns out to be the mechanistic substrate for much of in-context learning. Identifying and understanding induction heads was one of the first genuinely mechanistic explanations of a non-trivial LLM behaviour.
More recent work has pushed in several directions. Sparse autoencoders (SAEs) — neural networks trained to decompose model activations into sparse combinations of interpretable "features" — have produced dictionaries of monosemantic features in large models (Anthropic 2024, OpenAI 2024, DeepMind 2024). Many of these features correspond to surprisingly specific concepts — "talking about the Golden Gate Bridge," "code that handles security errors," "references to the Roman Empire" — and can be manipulated causally to change the model's behaviour. Activation patching — swapping activations between runs to isolate which components of the model are responsible for a specific output — has allowed researchers to attribute specific behaviours to specific layers and attention heads. Circuit analysis continues to reverse-engineer specific capabilities (indirect-object identification, modular arithmetic, factual recall) in detail.
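Activation patching is easy to illustrate on a toy model. The hand-built two-layer network below routes all input signal through one hidden unit; splicing activations from a corrupted run into a clean run reveals which unit carries the behaviour:

```python
# Toy activation patching on a hand-built two-layer network.
# Hidden unit 0 computes the sum of the inputs; unit 1 ignores them.
# The output depends only on unit 0, and patching confirms it causally.

def hidden(x):
    return [x[0] + x[1],  # unit 0: carries the input signal
            0.5]          # unit 1: constant, carries nothing

def output(h):
    return 3.0 * h[0] + 0.0 * h[1]

def patched_output(x_clean, x_corrupt, unit):
    """Run on x_clean, but splice in unit `unit`'s activation from
    the x_corrupt run -- the core move of activation patching."""
    h = hidden(x_clean)
    h[unit] = hidden(x_corrupt)[unit]
    return output(h)

clean, corrupt = [1.0, 2.0], [0.0, 0.0]
base = output(hidden(clean))                         # 9.0
effect_0 = base - patched_output(clean, corrupt, 0)  # large: unit 0 matters
effect_1 = base - patched_output(clean, corrupt, 1)  # zero: unit 1 is inert
print(effect_0, effect_1)  # -> 9.0 0.0
```

Real patching experiments do exactly this over thousands of attention heads and MLP neurons in a transformer, ranking components by how much swapping their activations moves the output of interest.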
The state of the art as of early 2026 is partial but increasingly structured. We have a reasonable mechanistic story for some aspects of in-context learning, some aspects of factual recall, and some aspects of basic reasoning. We do not have a complete story for any large model, and the effort required to understand a given behaviour grows quickly with model size. Whether mechanistic interpretability will scale to frontier models — or whether fundamentally different tools will be needed — is an open empirical question, and one that matters beyond academic curiosity because mechanistic understanding is one of the most plausible paths to verifying that a model is behaving the way we want it to.
Where do large language models stand, as of early 2026, relative to what we wanted them to be? Substantially more capable than anyone expected in 2020, substantially less capable than the most optimistic projections of 2023, and genuinely useful for a large and growing set of practical tasks — with a residue of stubborn limitations that further scale has not obviously fixed.
The capabilities that have clearly arrived include: fluent multilingual writing and translation; competent code generation and agentic coding on non-trivial tasks; near-expert performance on most standardised mathematical benchmarks; reasonable performance on professional-grade tests in law, medicine, and finance; useful tool use and function calling; coherent multi-turn conversation; summarisation, extraction, and classification at levels that match or exceed trained specialists on most standard evaluations. These are not speculative claims — they are the measured behaviour of deployed commercial systems in routine use by tens of millions of professionals.
The capabilities that have not arrived, or have arrived only partially, include: robust long-horizon planning with ambiguous success signals; reliable performance on tasks requiring careful working-memory tracking without scratchpadding; generalisation to genuinely novel composition of primitives not present during training; calibrated uncertainty that accurately tracks the model's own reliability; consistent behaviour under adversarial pressure; autonomous operation over days or weeks without human intervention and without drift; the kind of domain expertise that professional specialists acquire over decades of practice. The frontier is moving on several of these; others are proving structurally hard.
Two structural trends are worth naming at this point. The first is that post-training and inference-time techniques have become as important as pretraining for frontier capability. Models that are pretrained to the same loss can have dramatically different behaviours depending on their RLHF regimes, their reasoning-training regimes, and the inference-time scaffolding they are deployed under. The next chapter (Instruction Tuning & Alignment) covers this in detail; its relevance here is that "scale alone" is no longer a sufficient description of how frontier capability is produced. The second trend is that retrieval, tools, and agents are consuming an increasing share of the capability conversation. A competent language model plus a good retrieval system plus the right set of tools outperforms a more expensive language model without those things on most realistic tasks. The productive frontier is increasingly systemic rather than model-internal.
The open questions for the next few years are recognisable to anyone who has followed the field closely. Will scaling laws continue to hold, or will the data wall or compute-efficiency curves bend the trajectory? Will post-training techniques continue to extract disproportionate value from pretrained base models, or will we hit ceilings on what they can do? Will mechanistic interpretability mature enough to support meaningful verification of model behaviour, or will we continue to deploy systems we cannot fully audit? Will agentic architectures with long-horizon autonomy become reliable enough for high-stakes use, or will they plateau at short-horizon reliability? The chapters that follow — alignment (Ch 07), fine-tuning (Ch 08), retrieval (Ch 09), and evaluation (Ch 10) — are the current field's best answers to these questions, but they are answers in progress rather than settled results.
The modern era of language modelling has been, to an unexpected degree, a long exercise in asking "what happens if we make it bigger?" and getting non-trivial answers. Scale matters, and will continue to matter. But the systems that work best today are no longer pure-scale artefacts; they are carefully post-trained, carefully scaffolded, carefully equipped with retrieval and tools, and carefully evaluated. The rest of this part of the compendium is about the techniques that turn a pretrained base model into a useful deployed system.
The LLM literature moves faster than any other area of this compendium. The anchor textbooks below are useful for stable framings — scaling arguments, what emergence means, how evaluations are built — but most of the substantive material lives in papers and technical reports. The selections below favour papers that either (a) introduced a lasting idea (scaling laws, in-context learning, chain-of-thought, tool use) or (b) offer honest, critical measurement of what models can and cannot do.