Part VI · NLP & Large Language Models · Chapter 05

Pretraining paradigms: the training recipes that turn a transformer into a language model — the self-supervised objectives, the data pipelines, the scaling decisions, and the specific choices that separate a GPT from a BERT from a T5.

A transformer is an architecture. A language model is a transformer that has been trained — typically on hundreds of billions of tokens of raw text, over weeks or months on thousands of accelerators, using some self-supervised objective that teaches the network to predict part of its input from the rest. What that objective is, what data you use, how you tokenise it, how you schedule the training, how you scale compute relative to data: these choices — collectively, the pretraining paradigm — are what determine whether your model becomes a BERT-style encoder or a GPT-style generator, whether it is useful for classification or for open-ended reasoning, whether it generalises across tasks or overfits to a narrow specialty. Pretraining is where the modern foundation model is made. The architecture gets the headlines; the pretraining paradigm decides what the model can actually do. This chapter covers the three canonical objectives (causal LM, masked LM, denoising), the three canonical architectural branches (encoder-only, decoder-only, encoder-decoder), the data story (scraping the web, deduplicating it, filtering it, mixing it), the tokenisation question (BPE vs WordPiece vs SentencePiece and why it still matters), the scaling laws that now govern every training run, and the way pretraining has spread from language to vision, to code, to proteins, to the single universal paradigm of contemporary machine learning.

How to read this chapter

Sections one and two establish the conceptual frame. Section one argues why pretraining exists — the data bottleneck in supervised learning, the insight that raw text itself contains enormous amounts of structure the model can learn from, and how pretraining-then-fine-tuning replaced train-from-scratch as the default workflow. Section two introduces self-supervised learning, the broader family of techniques that pretraining belongs to, and the reason self-supervision scales when fully supervised learning does not.

Sections three through five cover the three canonical pretraining objectives. Section three is causal language modelling — next-token prediction, the objective behind GPT, LLaMA, and every decoder-only model. Section four is masked language modelling — BERT's objective, in which random tokens are replaced with a mask and the model predicts their identity from bidirectional context. Section five covers denoising and span-corruption objectives — T5's span masking, BART's noise family, and FIM (fill-in-the-middle) infilling, which sit between MLM and CLM on the spectrum of supervision.

Sections six through nine are the architectural branches. Section six is encoder-only pretraining — BERT, RoBERTa, ELECTRA, DeBERTa — the family of models built for representation and classification. Section seven is decoder-only pretraining — the GPT lineage that ended up dominating the field. Section eight is encoder-decoder pretraining — T5, BART, mT5 — the models that retain the original Vaswani structure and the objectives that exploit it. Section nine is prefix-LM and unified objectives — UL2, FLAN, and the attempts to get the best of all three branches from a single pretraining run.

Sections ten through twelve cover the data story. Section ten is tokenisation — BPE, WordPiece, SentencePiece, and why the choice of tokeniser still has first-order effects on model quality. Section eleven is pretraining data sources — CommonCrawl, C4, the Pile, RedPajama, Dolma, and the question of what "the text corpus of the internet" actually contains. Section twelve is data curation and filtering — deduplication, quality filters, toxicity and safety filters, and the increasingly sophisticated recipes that separate a pretraining corpus that works from one that does not.

Sections thirteen through fifteen are about scale. Section thirteen covers scaling laws — Kaplan et al. 2020 and the realisation that loss drops as a clean power law in compute, parameters, and data. Section fourteen is the Chinchilla correction — Hoffmann et al. 2022, which showed that earlier large models were under-trained, and gave the field its current 20-tokens-per-parameter prescription. Section fifteen covers training dynamics in practice — loss spikes, recovery, warmup schedules, learning-rate decay, and the operational reality of training a model for months on thousands of GPUs.

Sections sixteen through eighteen close the chapter. Section sixteen is multi-modal pretraining — CLIP's contrastive image-text objective, vision-language models, audio models, and the generalisation of pretraining beyond text. Section seventeen covers domain-specific pretraining — code models (Codex, StarCoder), math models (Minerva, DeepSeekMath), protein language models (ESM), and the question of when to pretrain on a domain rather than fine-tune a general model. The closing section reflects on pretraining as the unifying paradigm of contemporary ML, the rise of the foundation model as an engineering and economic unit, and the transition to the next chapter on scale and emergent capabilities.

Contents

  1. Why pretraining · The supervised-data bottleneck, transfer, pretrain-then-fine-tune
  2. Self-supervised learning · Predicting hidden parts of the input, why it scales
  3. Causal language modelling · Next-token prediction, the GPT objective, autoregressive training
  4. Masked language modelling · BERT's objective, bidirectional context, [MASK] and its replacements
  5. Denoising & span corruption · T5's span masking, BART's noise family, FIM infilling
  6. Encoder-only pretraining · BERT, RoBERTa, ELECTRA, DeBERTa
  7. Decoder-only pretraining · GPT, LLaMA, the lineage that won
  8. Encoder-decoder pretraining · T5, BART, mT5 — the seq2seq branch
  9. Prefix-LM & unified objectives · UL2, FLAN, mixtures of MLM and CLM
  10. Tokenisation · BPE, WordPiece, SentencePiece — the first-order choice
  11. Pretraining data sources · CommonCrawl, C4, the Pile, RedPajama, Dolma, the books problem
  12. Data curation & filtering · Deduplication, quality filters, safety, mixture weighting
  13. Scaling laws · Kaplan 2020, power laws, loss as a function of compute
  14. Compute-optimal training · Chinchilla, 20 tokens per parameter, over-training, the data wall
  15. Training dynamics in practice · Loss spikes, warmup, learning-rate schedules, checkpointing
  16. Multi-modal pretraining · CLIP, vision-language models, audio, joint encoders
  17. Domain-specific pretraining · Code, math, biology — when to specialise vs fine-tune
  18. Pretraining as a paradigm · Foundation models, the economic turn, the next chapter

§1

Why pretraining

Supervised learning is expensive — labelled data is scarce, task-specific, and often hostage to human annotators. Pretraining turns that economic problem on its head: you learn almost everything the model needs to know from unlabelled text, and reserve the labelled data for the last and smallest step. This is the shift that made language models feasible at scale.

Before 2018, the dominant workflow for NLP was to train a model from scratch on each task. If you wanted a sentiment classifier, you collected a few thousand labelled reviews and trained a small recurrent or convolutional network end-to-end. If you wanted a question-answering model, you collected a few thousand question-answer pairs and trained a different model end-to-end. The embedding layer might be initialised from word2vec or GloVe (see §3 of Word Embeddings) — that was the extent of knowledge transfer. Every other parameter was learned from the task's labelled data alone. When the labelled data ran out, performance plateaued. And labelled data always ran out.

The deeper issue was that supervised signals are narrow. A sentiment label tells the model whether a review is positive or negative; it does not tell the model what English grammar looks like, or what words mean, or how sentences cohere into discourse. All that rich structural knowledge had to be re-learned from scratch from the task signal alone — which is hopeless, because the task signal is far too weak to support learning a whole language. The result was that models with the capacity to represent language well could not, in practice, be trained well enough to use that capacity.

Pretraining breaks the bottleneck by supplying a much richer signal from a much larger resource. The insight — visible already in word2vec, crystallised by ULMFiT and ELMo in 2018, and industrialised by BERT and GPT later that year — is that the text itself contains enormous amounts of supervision, if you choose an objective that forces the model to model the text's own structure. Predicting the next word, predicting a hidden word, reconstructing a corrupted span — each of these objectives gives the model a dense, self-generated training signal from every token of every document. Once a model has been trained to do these things well, adapting it to a downstream task (sentiment, QA, translation) requires only a small amount of labelled data, because most of the hard work — learning the language — has already been done.

The economic consequence is profound. A lab with a large pretraining run and small per-task budgets can now beat a lab that trains every task from scratch, on every task. A researcher who fine-tunes a pretrained model on 500 labelled examples can match what used to require 50,000. And because pretrained models can be downloaded and reused, the cost of the pretraining run is amortised across every downstream user — one expensive training, millions of cheap adaptations. This is the shape of the pretrain-then-fine-tune paradigm, which is not a technique so much as a new industrial logic for how NLP gets built.

Pretraining is the answer to the supervised-data bottleneck. It replaces task-specific labelled data with self-supervised signal from large text corpora, producing a general-purpose starting point that can be specialised cheaply. Everything else in this chapter — the choice of objective, the choice of architecture, the choice of data, the scaling — is a commentary on this one shift.

§2

Self-supervised learning

Self-supervision is the trick that makes pretraining possible: take raw data, hide part of it, and train the model to predict the hidden part from the visible part. The labels are not human-generated; they come from the data itself. This turns every document into training data and makes the available supervision scale with the internet.

Formally, a self-supervised objective defines a function f(x) → (x_visible, x_hidden) that splits an input into two parts, and asks the model to produce x_hidden given x_visible. The choice of split is the objective. If x is a sentence of n tokens and we hide the last token given the first n - 1, we have causal language modelling (§3). If we hide 15% of the tokens at random, we have masked language modelling (§4). If we hide whole spans of tokens and ask the model to reproduce them in order, we have span corruption (§5). Each choice induces a different pretraining distribution and different downstream strengths, but the logical shape is the same: the model predicts missing data from the rest of the data.
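The two ends of this spectrum can be sketched in a few lines of Python. This is a toy illustration of the split, with string tokens standing in for the integer IDs a real tokeniser would produce, and the 15% mask rate borrowed from §4:

```python
import random

def clm_split(tokens):
    """Causal LM split: the first n-1 tokens are visible, the last is hidden."""
    return tokens[:-1], tokens[-1:]

def mlm_split(tokens, mask_rate=0.15, seed=0):
    """Masked LM split: a random ~15% of positions are hidden, the rest visible."""
    rng = random.Random(seed)
    n_hidden = max(1, int(len(tokens) * mask_rate))
    hidden_positions = set(rng.sample(range(len(tokens)), n_hidden))
    visible = ["[MASK]" if i in hidden_positions else t
               for i, t in enumerate(tokens)]
    hidden = [tokens[i] for i in sorted(hidden_positions)]
    return visible, hidden

tokens = ["the", "cat", "sat", "on", "the", "mat"]
visible, hidden = clm_split(tokens)
# visible: ['the', 'cat', 'sat', 'on', 'the']   hidden: ['mat']
```

The objectives of §§3–5 are all choices of this split function; everything else about the training loop is shared.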

Why does this work so well? Because predicting the hidden part of natural text well requires the model to learn almost everything we want it to know. To predict the next word of a paragraph, the model must understand the grammatical structure of the sentence so far, the meaning of the content words, the topic of the paragraph, the logical flow of the argument, sometimes world knowledge about the entities mentioned. A model that predicts next words at the level of a competent reader is a model that has acquired, implicitly, a great deal of linguistic and factual competence. The next-word-prediction loss is a summary statistic — a single number — but it stands in for a rich joint distribution over language.

Self-supervision also scales with data and with compute in a way that supervised learning does not. Adding more labelled examples requires more human effort; adding more pretraining data requires only crawling more web pages. Adding more supervised training steps eventually overfits the labels; adding more pretraining steps, within reason, continues to reduce the pretraining loss, and the downstream gains have historically kept coming. This is what Kaplan 2020 and the Chinchilla paper (§14) later formalised as scaling laws: loss falls as a power law in compute, data, and parameters, over many orders of magnitude.

There is one more property worth naming. Self-supervised objectives are task-agnostic: they do not know or care what downstream problem the model will be adapted to. This is unusual. Most of machine learning, since its inception, has tailored the objective to the task. Self-supervision instead trains on a generic density-modelling objective and lets the downstream task sort itself out during fine-tuning — or, in the largest models, during prompting alone, with no parameter updates. The fact that a single pretrained model can be pointed at hundreds of tasks, with no retraining, is one of the defining empirical findings of the 2020s.

§3

Causal language modelling

The simplest and most consequential pretraining objective is causal language modelling: given a sequence of tokens, predict the next one. The model sees the left context, computes a probability distribution over the vocabulary, and is trained to put its mass on the token that actually came next. Iterate this over every position of every document, and you have a complete training signal.

Causal LM — also called autoregressive LM or left-to-right LM — is the oldest pretraining objective for neural language models. Bengio et al. 2003 trained a feedforward network to predict the next word given a fixed-width window. Mikolov et al. 2010 did the same with recurrent networks. Radford et al. 2018 applied it to transformer decoders at scale (GPT-1), and the GPT line of models — GPT-2, GPT-3, GPT-4, GPT-5 — has used essentially the same objective throughout, with increasing model size, increasing data, and decreasing assumptions about what will not work.

Mechanically, causal LM requires the transformer's attention to be causally masked: a token at position t may attend to tokens at positions ≤ t, but not to any position > t (see Transformer Architecture §12). Without this mask the model would trivially copy the next token from the attention input and learn nothing. With the mask, the model must build a representation of the left context that predicts what the next token will be — this is the difficult thing, and the thing the model ends up being good at. The loss is the sum of cross-entropies across positions: L = −Σ_t log p(x_t | x_{<t}).
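As a toy numerical illustration of the mask and the loss (assuming the logits have already been computed; a real implementation applies this per batch inside the training loop):

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular boolean mask: position t may attend to positions <= t."""
    return np.tril(np.ones((n, n), dtype=bool))

def causal_lm_loss(logits, token_ids):
    """Sum of cross-entropies: L = -sum_t log p(x_t | x_{<t}).

    logits:    (seq_len, vocab) scores at each position for the NEXT token.
    token_ids: (seq_len,) the actual token sequence.
    Only the seq_len - 1 positions that have a next token contribute."""
    # numerically stable log-softmax over the vocabulary at each position
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    targets = token_ids[1:]                       # position t predicts x_{t+1}
    return -log_probs[np.arange(len(targets)), targets].sum()
```

With uniform logits over a vocabulary of size V, each position contributes log V, which is the loss of a model that has learned nothing; training pushes the sum below that baseline.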

Causal LM has several virtues that have made it the dominant objective in frontier models. First, it is generative by construction: a model trained to predict the next token can be used at inference time to generate text by sampling from its own predictions. There is no architectural mismatch between pretraining and generation; you train the same model you deploy. Second, it is dense: every position of every document contributes a training signal, rather than just the masked 15% that MLM uses. For the same wall-clock time, causal LM extracts more gradient per document. Third, it has scaled extraordinarily well: the same simple objective, applied to 100× more data and 100× more parameters, yields models of dramatically greater capability without any change in the training recipe.

Its weakness, as a representation-learning objective, is that it only sees left context. For a pure classification task — where the whole input is available and we just want the best representation of it — bidirectional context (§4) is strictly more informative. For years this was the accepted trade-off: causal LM for generation, masked LM for classification. Empirically, at large enough scale, this trade-off has largely collapsed. GPT-style models are now used for classification via prompting, and the gap in representation quality has narrowed enough that causal LM's other advantages — generation, simplicity, density — dominate. The asymmetry in attention is no longer the decisive factor it was in 2018.

§4

Masked language modelling

Masked language modelling hides a random subset of tokens and asks the model to predict them from the surrounding context — on both sides. This gives the model bidirectional access to the whole sequence and, for a while, produced the strongest text representations on the planet. BERT, RoBERTa, and DeBERTa all use variants of this objective.

Devlin et al. 2018 introduced masked language modelling as the primary objective of BERT. The recipe is simple: pick 15% of the input tokens at random; replace 80% of those with a special [MASK] token, replace 10% with a random vocabulary token, and leave the remaining 10% unchanged. Train the model to predict, at each of the selected positions, the original token that was there. Because the transformer sees the full input as its attention field — no causal mask — predictions can use left and right context simultaneously. A masked token in the middle of a sentence sees everything around it.
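The 80/10/10 recipe is easy to state as code. A minimal sketch, again with string tokens in place of integer IDs, and with a toy vocabulary passed in for the random-replacement branch:

```python
import random

def bert_mask(tokens, vocab, select_rate=0.15, seed=0):
    """BERT-style corruption: select ~15% of positions; of those, 80% become
    [MASK], 10% become a random vocabulary token, 10% stay unchanged.
    Returns (corrupted_input, labels); labels is None except at selected
    positions, where it holds the original token the model must predict."""
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)
    n_select = max(1, int(len(tokens) * select_rate))
    for i in rng.sample(range(len(tokens)), n_select):
        labels[i] = tokens[i]              # loss applies only at this position
        r = rng.random()
        if r < 0.8:
            inputs[i] = "[MASK]"           # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # else: 10% chance the token is left as-is
    return inputs, labels
```

Note that the loss is computed only where labels is set; the other 85% of positions are attended to but never predicted, which is the signal-density weakness discussed below.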

The 80/10/10 recipe is not arbitrary. If 100% of the selected positions were replaced by [MASK], the model would learn to predict tokens only when they are masked, creating a distribution mismatch with fine-tuning (where [MASK] never appears). The 10% random and 10% unchanged positions force the model to maintain a reasonable distribution over every token it sees, because it cannot tell from the input alone whether a position is being "asked about" or not. This is a small engineering detail with large empirical consequences.

MLM's win over causal LM, for representation-learning downstream tasks, was substantial at BERT's release: state-of-the-art on GLUE, SQuAD, and most of the benchmarks that existed in 2018. The reason is intuitive — bidirectional context is strictly more informative than left-only context for producing a representation of the current token — but the magnitude of the win was still surprising. It persuaded much of the research community that encoder-only MLM models were the future of NLP representation, and for the next two years BERT and its descendants dominated classification, retrieval, and extractive QA benchmarks.

The trade-off, as already noted, is that MLM is not natively generative. You cannot easily sample text from a BERT, because it does not model the distribution of next tokens; it models the distribution of masked tokens given their neighbours. You can generate token-by-token by iteratively masking and predicting, but the result is slow and lower-quality than autoregressive sampling. MLM also has lower signal density than CLM — only 15% of positions contribute to the loss per step, compared to every position for CLM. For classification this is tolerable; for pure scaling, it looks increasingly like a weakness.

Subsequent variants refined the recipe. RoBERTa (Liu et al. 2019) showed that BERT was undertrained and that dropping the next-sentence-prediction auxiliary objective, training longer, and using bigger batches improved results substantially. ELECTRA (Clark et al. 2020) replaced masking with replaced-token detection, training a discriminator to distinguish real tokens from ones generated by a small generator — a more sample-efficient objective. DeBERTa (He et al. 2020) added disentangled attention over content and position. Each of these made encoder-only models better without departing from the fundamental MLM idea.

§5

Denoising & span corruption

Between pure left-to-right causal LM and pure token-level masked LM lies a family of denoising objectives — corrupt the input in various ways, and ask the model to reconstruct the original. Span corruption, as used by T5, is the most influential member of this family. BART, UL2, and fill-in-the-middle variants are its close relatives.

The generalisation step, articulated most clearly in the T5 paper (Raffel et al. 2020), is this: instead of masking individual tokens, mask contiguous spans of tokens, and instead of asking the model to predict the masked tokens in place, have an encoder-decoder model read the corrupted input and generate the missing spans in order. The corruption function is part of the objective's hyperparameters; the encoder sees the corrupted sequence, the decoder emits the missing parts. This is more general than MLM (which assumes token-level masking) and more general than CLM (which assumes the "missing" part is always the suffix).

T5's span corruption objective drops 15% of the tokens, arranged into spans of mean length 3, and replaces each span with a sentinel token <extra_id_0>, <extra_id_1>, etc. The decoder then has to output the original spans, each preceded by the corresponding sentinel, in order. This is a harder prediction problem than MLM because the spans are longer, and it is a more natural one for encoder-decoder architectures because the decoder already has the machinery to emit arbitrary token sequences. T5 used this objective to pretrain a family of encoder-decoder models that matched or exceeded BERT on most classification benchmarks while also handling generation tasks natively.
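The sentinel mechanics are clearest in code. A sketch with the span positions fixed by hand rather than sampled, so the input/target pair is easy to read:

```python
def span_corrupt(tokens, spans):
    """T5-style span corruption. spans is a list of (start, end) half-open
    intervals, in order. Each span is replaced by a sentinel in the encoder
    input; the decoder target is each sentinel followed by its span."""
    enc_input, target = [], []
    prev = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        enc_input += tokens[prev:start] + [sentinel]
        target += [sentinel] + tokens[start:end]
        prev = end
    enc_input += tokens[prev:]          # trailing uncorrupted text
    return enc_input, target

tokens = "thank you for inviting me to your party last week".split()
enc, tgt = span_corrupt(tokens, [(1, 2), (5, 8)])
# enc: thank <extra_id_0> for inviting me <extra_id_1> last week
# tgt: <extra_id_0> you <extra_id_1> to your party
```

In the real objective the spans are sampled (15% of tokens, mean span length 3); the fixed spans here are just for legibility.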

BART (Lewis et al. 2020) explored a wider family of corruption functions: token masking, token deletion, text infilling (span masking), sentence permutation, document rotation. The encoder sees a corrupted document; the decoder reconstructs the original in full. Different corruption functions emphasise different downstream strengths — span infilling helps fill-in-the-blank tasks, sentence permutation helps discourse tasks, document rotation helps summarisation — and Lewis et al. found that text infilling plus sentence permutation was the best single combination for a generalist model. BART became a standard backbone for summarisation and controlled generation.

A close cousin, fill-in-the-middle (FIM) pretraining (Bavarian et al. 2022), adapts causal LM to fill holes as well as extend text. The training data is reformatted so that a document A | B | C is presented as <pre> A <suf> C <mid> B: the decoder-only model sees the prefix and suffix and is trained to generate the missing middle. This lets a single autoregressive model support both continuation (classic completion) and insertion (code editing, document editing) without changing architectures. FIM became the de facto objective for code models — OpenAI's post-Codex code models, Code LLaMA, and most modern code-editing models use it or a close variant.
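The FIM transformation is a pure data-reformatting step. A sketch, with the split points passed explicitly (in practice they are sampled per document) and sentinel names following the chapter's notation rather than any specific tokenizer:

```python
def fim_transform(tokens, i, j):
    """Reorder a document A | B | C (A = tokens[:i], B = tokens[i:j],
    C = tokens[j:]) into <pre> A <suf> C <mid> B, so that a causal LM,
    reading left to right, conditions on prefix and suffix before
    generating the middle."""
    prefix, middle, suffix = tokens[:i], tokens[i:j], tokens[j:]
    return ["<pre>"] + prefix + ["<suf>"] + suffix + ["<mid>"] + middle

code = ["def", "f", "(", "x", ")", ":"]
fim = fim_transform(code, 2, 4)
# ['<pre>', 'def', 'f', '<suf>', ')', ':', '<mid>', '(', 'x']
```

Because the reordered sequence is still trained with the ordinary causal LM loss, no architectural change is needed; continuation and insertion become two prompting patterns of the same model.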

§6

Encoder-only pretraining

Encoder-only models — BERT and its descendants — use a masked language modelling objective on a bidirectional transformer encoder. They produce excellent contextual representations for classification, retrieval, and other "read the whole thing" tasks, but they are awkward for generation. The encoder-only lineage peaked around 2020 and has since ceded the leading position to decoder-only models.

The architectural commitment of an encoder-only model is that every position attends to every other position without a causal mask. This makes sense when the goal is to produce a representation of the input that downstream task heads will consume: for classification, the representation at the [CLS] token is fed into a linear classifier; for sequence labelling (NER, POS), each position's representation is fed into a per-token classifier; for extractive QA, the representations are used to predict answer-span boundaries; for retrieval, the [CLS] representation becomes a document embedding. In each case, the whole input is available, and bidirectional context is unambiguously the right thing.
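The task heads themselves are tiny. A numpy sketch of the two most common ones, with random stand-ins for the encoder output and head weights (in practice both come from a pretrained checkpoint plus fine-tuning):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_classes = 8, 16, 3

# Stand-in for the encoder's output; by convention position 0 is [CLS].
H = rng.normal(size=(seq_len, d_model))

# Classification head: a single linear layer on the [CLS] representation.
W_cls = rng.normal(size=(d_model, n_classes))
class_logits = H[0] @ W_cls               # shape (n_classes,)

# Sequence-labelling head (NER, POS): a per-token classifier over all positions.
W_tok = rng.normal(size=(d_model, n_classes))
token_logits = H @ W_tok                  # shape (seq_len, n_classes)
```

The point of the sketch is the asymmetry of effort: almost all the parameters sit in the encoder, and the head is a thin projection — which is why fine-tuning on a few thousand labelled examples suffices.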

BERT (Devlin et al. 2018) was the first to industrialise this design at scale. BERT-base had 110M parameters, 12 layers, and was pretrained on Wikipedia plus BooksCorpus — about 3.3B words total — using MLM plus a next-sentence-prediction (NSP) auxiliary objective that was later shown to be unhelpful. BERT-large had 340M parameters. Both models set state-of-the-art results on essentially every NLP benchmark that existed, and the fine-tuning recipe — add a thin task head, train end-to-end on the task's labelled data for a few epochs — became the default workflow for two or three years.

RoBERTa (Liu et al. 2019) revisited BERT's training recipe and found it dramatically undertrained. By dropping NSP, increasing the batch size, training on 160GB of text rather than 16GB, using dynamic masking (re-sampling masked positions each epoch), and training for much longer, RoBERTa substantially outperformed BERT at the same parameter count. The lesson — compute, not architecture, had been the bottleneck — was repeated many times in subsequent years.

ELECTRA (Clark et al. 2020) replaced MLM with a more sample-efficient alternative. A small generator network produces plausible replacements for corrupted tokens; the main model, the discriminator, is trained to distinguish real tokens from generated ones at every position. This produces a signal on 100% of positions rather than 15%, and is computationally cheaper per training example. ELECTRA reached BERT-large quality at a fraction of BERT-large's compute. DeBERTa (He et al. 2020, v3 in 2021) added disentangled attention that separates content and position, and was for some time the best encoder-only model on GLUE and SuperGLUE.

Encoder-only models remain the right choice for certain workloads — classification at scale, retrieval, reranking, extractive QA — because they produce a representation in one forward pass and do not need decoding. Modern retrieval stacks (cross-encoders, bi-encoders, ColBERT) still use encoder-only backbones. But their share of frontier research attention shrank as decoder-only models, with their generative flexibility and stronger scaling behaviour, came to dominate.

§7

Decoder-only pretraining

Decoder-only models — GPT and its descendants — use causal language modelling on a transformer decoder stack. They are generative by construction, they scale extraordinarily well, and they have become the default architecture for frontier language models. Everything from GPT-4 to Claude to LLaMA to Gemini is a decoder-only transformer.

A decoder-only model keeps only the decoder stack of the original transformer, and removes the cross-attention layers that would have connected it to an encoder. What remains is a stack of transformer blocks with causally masked self-attention: each token attends only to tokens at earlier positions, and a language-model head predicts the next token at each position. Training is causal LM on a large text corpus; inference is autoregressive sampling from the model's own output distribution.

The lineage is Radford et al.'s GPT-1 (2018), GPT-2 (2019), and GPT-3 (2020), each substantially larger than the last: 117M, then 1.5B, then 175B parameters. GPT-2 was the first to demonstrate strong zero-shot behaviour — the model could do new tasks from a prompt alone, without task-specific training — and GPT-3 made that behaviour the main story, with in-context learning from a handful of examples matching fine-tuned specialist models on many benchmarks. The architecture barely changed across these generations; the gains came almost entirely from scale.

From 2020 onwards the open-weights ecosystem built a parallel lineage: Meta's OPT (2022), LLaMA 1 (2023), LLaMA 2 (2023), and LLaMA 3 (2024); Mistral (2023); Falcon (2023); Qwen (2023 onwards); Gemma (2024 onwards). All of these are decoder-only causal LMs. They differ in their positional encoding schemes (RoPE vs ALiBi, see §15 of Transformer Architecture), in their activation functions (SwiGLU has become dominant), in their normalisation (pre-norm with RMSNorm is now standard), and in the details of their data and training schedules. But the fundamental recipe — stack transformer decoder blocks, train with causal LM on a large corpus — is shared.

Why did decoder-only win? Several reasons compounded. First, the objective is dense: every token contributes to the loss, so the model extracts more training signal per document than MLM. Second, pretraining and generation use the same forward pass, so there is no train-test mismatch when the model is deployed as a generator. Third, decoder-only models expose a very simple interface at inference — give it text, it produces more text — which composes well with prompting as a task interface. Fourth, scaling laws (§13) turned out to be friendlier to decoder-only models in practice; their loss curves are smoother and their capabilities emerge more reliably with scale. None of these advantages was decisive in isolation, but together they were enough to collapse the earlier encoder-only/decoder-only/encoder-decoder trichotomy into a single dominant branch.

§8

Encoder-decoder pretraining

The original transformer was an encoder-decoder, and for a while the best general-purpose pretrained models — T5, BART, mT5 — kept that architecture. The encoder reads the input; the decoder writes the output. This remains the natural fit for translation, summarisation, and any task where input and output are linguistically distinct.

An encoder-decoder transformer consists of two stacks connected by cross-attention: the encoder reads the input sequence bidirectionally and produces a set of contextualised representations; the decoder generates the output sequence autoregressively, and at each layer cross-attends to the encoder's output to pull in information from the input. Training is typically done on a denoising objective — T5's span corruption, BART's text-infilling — where the encoder sees a corrupted sequence and the decoder emits the original. The decoder can also be trained with plain sequence-to-sequence supervision once pretraining is done.

T5 (Raffel et al. 2020) was the flagship encoder-decoder model of the BERT era. It framed every NLP task as text-to-text: translation was "translate English to German: X" → "Y"; classification was "cola sentence: X" → "acceptable" or "not acceptable"; summarisation was "summarize: X" → "Y". This unified interface, combined with span-corruption pretraining on the C4 corpus, produced a family of models (T5-small through T5-11B) that matched or exceeded BERT-style models on classification while handling generation natively. BART (Lewis et al. 2020) pursued a similar design with a richer corruption family and became a popular choice for summarisation.

mT5 (Xue et al. 2021) extended T5 to 101 languages using the same recipe, and remained one of the strongest multilingual models for several years. Flan-T5 (Chung et al. 2022) applied large-scale instruction-tuning to T5, showing that the encoder-decoder architecture could produce competitive instruction-following models. Within Google, the encoder-decoder lineage continued through UL2 and Flan-UL2, although the company's flagship conversational models (LaMDA, PaLM, and their successors) adopted decoder-only designs, and the exact architecture of later systems such as Gemini is not always public.

Why did encoder-decoder lose ground? Not because it was worse on translation or summarisation — there it remains perfectly respectable — but because decoder-only models, scaled up, ate its lunch on almost everything else. A sufficiently large decoder-only model can treat any input-output task as a prompt-completion problem, handling input and output within a single causal stream. The cross-attention machinery that distinguishes encoder-decoder is an architectural commitment that does not pay for itself at the largest scales, and the extra engineering complexity (two stacks instead of one; cross-attention cache management; separate handling of the input and output sequences) is real. The decoder-only architecture's winning move was simplicity that composed well with scale.

§9

Prefix-LM & unified objectives

Between pure causal LM and pure encoder-decoder lies a family of unified objectives that try to get the best of both. Prefix-LM, UL2, and various mixtures let a single model learn bidirectional context over some prefix and autoregressive continuation over some suffix — combining the representational advantages of MLM with the generative fluency of CLM.

A prefix-LM is a decoder-only model with a modified attention mask: within a designated prefix portion of the sequence, attention is bidirectional (every prefix token attends to every other prefix token); within the suffix portion, attention is causal (each suffix token attends only to earlier suffix tokens and to the full prefix). Training loss is applied only over the suffix. The model thereby learns to use the full prefix bidirectionally — as if it were an encoder — while generating the suffix autoregressively — as if it were a decoder. This is architecturally simpler than a full encoder-decoder, because there is only one stack of transformer blocks, but it matches some of encoder-decoder's behavioural properties.
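The modified mask is a one-line change to the standard causal mask. A NumPy sketch, where 1 means "may attend" (real implementations build the same pattern as an additive float mask):

```python
import numpy as np

def prefix_lm_mask(seq_len, prefix_len):
    """Attention mask for a prefix-LM. Prefix tokens attend bidirectionally
    among themselves; suffix tokens attend causally to earlier suffix
    tokens and to the full prefix."""
    mask = np.tril(np.ones((seq_len, seq_len), dtype=int))  # standard causal mask
    mask[:prefix_len, :prefix_len] = 1  # open up the prefix block fully
    return mask
```

Note that prefix rows keep zeros over suffix columns: the prefix never sees the suffix, exactly as an encoder never sees the decoder's output.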

Prefix-LM was introduced by Dong et al. 2019 (UniLM) as a way to unify BERT-style and GPT-style training in a single model, with the attention mask selected per-example. Raffel et al. 2020 revisited it in the T5 paper and found it competitive with encoder-decoder on many tasks. Tay et al. 2022 (UL2) pushed the idea further by mixing multiple objectives during pretraining: regular causal LM on some batches, prefix-LM on some, and denoising spans of various lengths on others. The resulting model could be prompted in different modes — "use this example as an MLM task" or "use this example as a CLM task" — and performed well across the whole spectrum.

FLAN (Wei et al. 2022, and successors) is not a pretraining objective but a related idea: take a pretrained model and fine-tune it on a massive collection of instruction-following tasks, formatted uniformly as input-output pairs. The fine-tuning teaches the model to interpret natural-language task descriptions — "summarise the following article" — as specifications of what to do, rather than as prose to be continued. FLAN-T5 and FLAN-PaLM demonstrated that instruction-tuning generalises: a model tuned on 1,800 tasks can follow instructions on tasks it never saw during tuning. Scaled up, this technique underpins the post-training pipelines of GPT-3.5 and later models, which are really instruction-tuning plus a preference-learning stage stacked on top.

The conceptual point here is that there is no single "correct" pretraining objective; there is a spectrum between full bidirectional encoders and full autoregressive decoders, and different points on the spectrum serve different downstream profiles. The empirical winner at frontier scale has been plain causal LM plus post-training — instruction-tuning, RLHF, constitutional AI — but the objective-mixing research line continues to explore whether a richer pretraining objective could yield better models at fixed compute. For now, that exploration has produced interesting results without dislodging CLM from the top of the tree.

§10

Tokenisation

Before a model can be pretrained, text must be turned into integers. The choice of tokeniser — how text is segmented into units — is a foundational commitment that affects vocabulary size, sequence length, multilingual behaviour, and the cost of everything downstream. BPE, WordPiece, and SentencePiece are the three algorithms that matter.

The tokeniser's job is to convert a Unicode string into a sequence of integers drawn from a fixed vocabulary. The two extremes are clear: character-level tokenisation produces very long sequences with a small vocabulary, and word-level tokenisation produces short sequences but cannot handle out-of-vocabulary words. Modern models use subword tokenisation, which sits in between: common sequences become single tokens, rare sequences are broken into smaller pieces, and every possible input can be tokenised without an out-of-vocabulary token.

Byte-Pair Encoding (BPE), originally a compression algorithm, was adapted for NLP by Sennrich et al. 2016. Start with a vocabulary of single characters (or bytes). Repeatedly find the most frequent adjacent pair of tokens in the training corpus and merge them into a new token, extending the vocabulary. Stop when the vocabulary reaches a target size (typically 32k–256k). The result is a vocabulary where common words are single tokens, common morphemes (like -ing, un-) are single tokens, and rare words decompose into pieces. GPT-2 and most decoder-only models use BPE, often operating on UTF-8 bytes rather than Unicode characters so that every possible input is representable.
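The merge loop can be sketched in a few lines. This toy trainer works on a flat character list; production BPE operates on bytes and on pre-tokenised words with counts, but the core algorithm is the same:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent
    symbol pair, recording the merge rules in order."""
    symbols = list(corpus)  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:
            break  # nothing repeats; further merges are pointless
        merges.append((a, b))
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)  # apply the new merge everywhere
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return merges, symbols
```

At inference time the recorded merge rules are replayed in the same order on new text, which is why a BPE tokeniser is fully determined by its merge list.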

WordPiece (Schuster & Nakajima 2012, popularised by BERT) is a close cousin of BPE with a different merge criterion: instead of picking the most frequent pair, it picks the merge that most increases the likelihood of the training corpus under the tokeniser's language model (equivalently, the pair with the highest pointwise mutual information). The practical differences from BPE are small. SentencePiece (Kudo & Richardson 2018) is a library rather than a new algorithm: it operates directly on raw text (no pre-tokenisation into whitespace-delimited words) and implements both the BPE and unigram-LM training algorithms. It is the standard choice for multilingual models and for languages that do not delimit words with whitespace (Chinese, Japanese, Thai).

Tokeniser choice has consequences that persist for the life of the model. Vocabulary size trades off between model parameters (the embedding matrix and LM head grow linearly with vocabulary) and sequence length (a bigger vocabulary means shorter sequences, which means fewer tokens to pay for in attention). Multilingual vocabularies disproportionately penalise languages with smaller training footprints: non-Latin scripts often consume 2–5× more tokens per unit of text than English does, which means the same context window holds less of their content. Recent tokenisers — GPT-4's cl100k, LLaMA 3's 128k-token vocabulary — have widened vocabularies partly to address this, but the asymmetry remains visible.

A practical consequence worth internalising: pretraining data composition interacts with tokeniser training. If your corpus is 90% English and 10% Chinese, BPE will allocate most of its vocabulary slots to English patterns, and Chinese text will tokenise into many short pieces. This shows up at inference time as higher cost per character and lower effective context length for non-English users. Vocabulary design is not a mere engineering detail; it is a policy decision that shapes who the model serves.

§11

Pretraining data sources

A frontier language model is trained on trillions of tokens of text. Where does that text come from? The short answer is: mostly the internet. The longer answer — which public corpora matter, how they are curated, and what is missing from them — is the subject of several years of intensive research and some high-stakes legal disputes.

The foundational public resource is CommonCrawl, a non-profit that has been crawling the web since 2008 and publishes the raw crawls as monthly snapshots totalling tens of petabytes. CommonCrawl is the substrate from which most pretraining corpora are derived — but in raw form it is mostly junk: boilerplate, navigation, spam, near-duplicates, machine-translated content. Almost all modern pipelines filter it heavily before use. The filtering is itself a research area (§12).

Specific derived corpora have been influential. C4 (Raffel et al. 2020) is the "Colossal Clean Crawled Corpus", a filtered version of CommonCrawl (one snapshot, about 750GB after filtering) that was used to train T5. The Pile (Gao et al. 2020) is an 825GB corpus assembled by EleutherAI, combining CommonCrawl-derived text with curated high-quality sources (ArXiv, PubMed, StackExchange, Wikipedia, GitHub, books, legal texts, mathematics). The Pile was the training corpus for GPT-Neo and GPT-J. RedPajama (Together 2023) reproduced the LLaMA-1 training data mixture as an open 1.2T-token dataset. Dolma (Soldaini et al. 2024) is AI2's 3T-token open corpus, used to train OLMo.

Beyond web text, the high-quality curated buckets matter enormously per-token. Books contribute long-form structure, complex argumentation, and vocabulary that is rare on the web. Academic papers (ArXiv, PubMed) contribute technical writing. Code repositories (GitHub) contribute programming ability. Mathematics corpora (proof repositories, competition problems) contribute symbolic reasoning. Wikipedia contributes encyclopaedic factual content. Each of these is smaller than CommonCrawl but up-weighted in the training mixture because per-token they teach the model more useful things.

The books problem — whether training on copyrighted books constitutes fair use — has become a major legal issue. Early open corpora like BooksCorpus and the books component of The Pile included material of uncertain copyright status. Several lawsuits (Authors Guild v. OpenAI, Bartz v. Anthropic, etc.) are now shaping the space. In parallel, there is a push toward provenance-certified corpora — data that was licensed, public-domain, or opt-in — which has produced projects like CommonPile (EleutherAI's licensed-content corpus). How this resolves will determine the composition of open pretraining data for the rest of the decade.

§12

Data curation & filtering

Raw crawled text is not training data. It must be deduplicated, filtered for quality, stripped of unsafe content, and mixed in careful proportions. The pipeline between CommonCrawl and a pretraining-ready corpus is long, and the choices made inside it have disproportionate effects on the resulting model.

Deduplication comes first and matters most. CommonCrawl contains enormous amounts of near-duplicate content: the same news article on a hundred mirror sites; the same Stack Overflow answer on a hundred scraper sites; boilerplate headers, footers, and navigation chrome repeated across millions of pages. Training on duplicates wastes compute at best and degrades generalisation at worst — the model effectively trains multiple times on the duplicated examples, which is not the same as training once. Lee et al. 2022 showed that aggressive deduplication (MinHash-based, at paragraph or document level) substantially improves model quality at fixed compute.
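The MinHash idea behind this kind of deduplication fits in a short sketch. The choices here (character 5-shingles, 64 salted MD5 hash functions) are illustrative; production systems use word shingles, faster hashes, and locality-sensitive-hashing buckets so that documents never need all-pairs comparison:

```python
import hashlib

def shingles(text, k=5):
    """Character k-shingles of a document (assumes len(text) >= k)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(doc, num_hashes=64):
    """For each salted hash function, keep the minimum hash value over
    the document's shingles. Similar shingle sets give similar minima."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(doc))
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity
    of the underlying shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two near-duplicate documents produce signatures that agree on most slots, so the full corpus can be deduplicated by comparing short signatures instead of full texts.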

Quality filtering is the next stage. The goal is to keep coherent, informative text and discard boilerplate, machine-generated noise, keyword-stuffed SEO pages, and low-effort social-media chatter. Early filters were handcrafted: heuristics on line length, punctuation ratios, stopword frequencies, language detection, and so on. Modern filters are model-based: train a classifier (often a small LM or a reference-model likelihood ratio) to distinguish "high-quality" text from the raw CommonCrawl distribution, and keep only the highest-scoring documents. The reference of "high quality" is typically a curated corpus like Wikipedia or books — the filter learns to pick crawled text that looks similar. Penedo et al. 2024 (FineWeb) is a public example of the full pipeline.
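The handcrafted-heuristic stage can be illustrated with a toy scorer. The signals below (length, mean word length, terminal punctuation, stopword presence) are classic ones, but the specific thresholds are illustrative, not taken from any published pipeline:

```python
def heuristic_quality_score(doc):
    """Score a document with a few handcrafted quality signals.
    Higher is better; 0.0 means reject. Thresholds are illustrative."""
    lines = [ln for ln in doc.splitlines() if ln.strip()]
    words = doc.split()
    if not lines or len(words) < 50:         # too short to judge
        return 0.0
    mean_word_len = sum(len(w) for w in words) / len(words)
    if not 3 <= mean_word_len <= 10:         # gibberish or token soup
        return 0.0
    # Fraction of lines that end like sentences, not navigation chrome.
    terminal = sum(ln.rstrip().endswith((".", "!", "?", '"')) for ln in lines)
    score = terminal / len(lines)
    stopwords = {"the", "and", "of", "to", "a", "in", "is", "that"}
    if not any(w.lower() in stopwords for w in words):
        score *= 0.1                         # English prose without stopwords is suspect
    return score
```

A model-based filter replaces this score with a classifier probability, but the plumbing around it (score every document, keep the top fraction) is the same.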

Safety filtering removes content that is illegal, abusive, or otherwise unsuitable. This includes sexually explicit material involving minors (strictly removed), personally identifiable information (often redacted or removed), and extreme toxicity or hate speech (usually filtered, though the threshold is contested). Safety filtering trades off coverage against risk: too aggressive and the model becomes less knowledgeable about the world; too permissive and the model internalises harmful content that will surface later. There is no neutral answer, and different labs make different calls.

Mixture weighting is the final lever. Given multiple subcorpora — crawled web, books, code, papers, Wikipedia — the training-time mixture determines what the model ends up looking like. A model trained on 100% web text will be good at casual English and bad at mathematics. A model with 10% code in its mixture will be competent at programming; a model with 50% code (like Code LLaMA's continued pretraining) will be an expert. Ratios are tuned experimentally, often using small proxy models to pick mixtures before committing to a full-scale run. The practical upshot is that "more data" is not a scalar; the shape of the data matters as much as the quantity.
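At training time, mixture weighting often reduces to weighted sampling across subcorpus streams. A minimal sketch, with illustrative weights rather than any lab's actual recipe:

```python
import itertools
import random

def mixture_sampler(corpora, weights, seed=0):
    """Yield (source_name, document) pairs, drawing each document from a
    named subcorpus with the given sampling probability. Subcorpora are
    cycled, so small high-quality sources are effectively up-weighted
    (seen for multiple epochs while the big source is seen once)."""
    rng = random.Random(seed)
    names = list(corpora)
    probs = [weights[n] for n in names]
    streams = {n: itertools.cycle(corpora[n]) for n in names}
    while True:
        name = rng.choices(names, weights=probs)[0]
        yield name, next(streams[name])
```

Over a long run the realised token proportions converge to the target weights, which is exactly the lever the mixture designer is pulling.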

A striking finding from the last two years of data research: the marginal return on extra pretraining tokens falls off fast once you are past the obvious duplicates and low-quality pages. Data quality has become at least as important as data quantity. Phi-1, Phi-2, and Phi-3 (Microsoft, 2023–24) demonstrated that carefully curated, textbook-quality training data can produce models that punch substantially above their token-count weight. This line of work pushed the field from "scrape everything" toward "curate carefully".

§13

Scaling laws

The central empirical finding of the 2020s is that language-model loss falls as a power law in compute, data, and parameters — over many orders of magnitude, with remarkably smooth curves. This is what makes scaling a rational investment rather than a gamble. If you know your budget, you can predict your loss.

Kaplan et al. 2020 carried out the first systematic study of transformer language-model scaling across a wide range of scales. They trained many transformer language models of varying size (from 768 to 1.5 billion non-embedding parameters) for varying amounts of compute on varying amounts of data, and plotted the resulting loss. What they found was strikingly clean: loss as a function of compute C followed a power law L(C) = (C_c / C)^α with α ≈ 0.05; similar power laws held for L(N) (number of parameters) and L(D) (number of training tokens). Over six orders of magnitude of compute, loss fell predictably, with no visible knee or plateau.
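The power-law form has a useful invariance: each doubling of compute multiplies loss by the same fixed factor, wherever you are on the curve. A sketch with placeholder constants (not Kaplan's fitted values):

```python
def power_law_loss(compute, c_c=1.0, alpha=0.05):
    """Scaling-law loss L(C) = (C_c / C)^alpha. c_c and alpha here are
    illustrative placeholders, not the paper's fitted constants."""
    return (c_c / compute) ** alpha

# Doubling compute shrinks loss by the constant factor 2**-alpha,
# whether you are at 1 unit of compute or a trillion.
ratio_small = power_law_loss(2.0) / power_law_loss(1.0)
ratio_large = power_law_loss(2e12) / power_law_loss(1e12)
```

This scale-invariance is what makes the curves so useful for planning: the improvement purchased by the next doubling is the same as the improvement purchased by the last one.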

The implication was extraordinary. If you want to build a better language model, you do not need to come up with a better architecture or a cleverer objective; you just need to train a bigger model for longer on more data, and the loss will fall by a predictable amount. This reframed the research agenda: the question was no longer "what is the right architecture?" but "given this compute budget, what is the cheapest way to spend it?" Kaplan's answer — keep the data fixed and grow the model — was the conventional wisdom from 2020 through early 2022.

Kaplan's recommendations also had a specific bias that would later be corrected. Their analysis held data roughly fixed at a few hundred billion tokens, and suggested that given more compute, you should grow the model (not the data) aggressively. The result was a wave of enormous models trained on surprisingly little data: GPT-3 (175B parameters, 300B tokens), Megatron-Turing NLG (530B parameters, 270B tokens), Gopher (280B parameters, 300B tokens). These models were impressive, but as the next subsection explains, they were not compute-optimal — they were undertrained for their size.

Subsequent work (Hoffmann et al. 2022; see §14) refined scaling laws by varying both parameters and data, and found substantially different optimal allocations. More recent work (Muennighoff et al. 2023 on data-constrained scaling; DeepSeek 2024; and others) has extended scaling laws to include data quality effects, inference cost, and the downstream capability-emergence phenomena that Wei et al. 2022 popularised as "emergent abilities". The picture is more nuanced than Kaplan suggested in 2020, but the core empirical finding — loss is a smooth, predictable function of compute over many orders of magnitude — remains the foundation on which modern pretraining planning rests.

§14

Compute-optimal training

Hoffmann et al. 2022 (the "Chinchilla" paper) showed that frontier models of the GPT-3 era were dramatically undertrained — that at a fixed compute budget, you should use roughly 20 tokens per parameter, not 2. This recalibrated the field and gave us the 20-to-1 heuristic that still underpins most pretraining planning.

Hoffmann et al. 2022 ran a more thorough scaling study than Kaplan had, training over 400 models from 70M to 16B parameters on a range of dataset sizes from 5B to 500B tokens. Their analysis varied parameters and data jointly, and asked: at a given compute budget, what pairing of (N, D) minimises loss? Their answer: N and D should scale roughly equally. If you double compute, you should double both the parameter count and the training-token count, not just one or the other. This contradicts the Kaplan recipe, which kept data roughly fixed and scaled only parameters.

The practical upshot was the "Chinchilla ratio": roughly 20 tokens per parameter is compute-optimal. GPT-3 at 175B parameters had been trained on 300B tokens, or about 1.7 tokens per parameter — twelve times too few. Chinchilla itself was a 70B-parameter model trained on 1.4T tokens (20 tokens per parameter) — much smaller than GPT-3 but with more training. It substantially outperformed GPT-3 on benchmarks despite the smaller size. This was the first clear demonstration that the existing frontier had been allocating compute wrongly.
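The 20-to-1 heuristic combines with the standard approximation C ≈ 6·N·D (six FLOPs per parameter per token) into a back-of-envelope allocator. This is a sketch of that arithmetic, not the paper's fitted parametric loss:

```python
import math

def chinchilla_allocation(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget between parameters N and tokens D using
    C = 6 * N * D with D = tokens_per_param * N, so N = sqrt(C / (6 * r))."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens
```

Feeding in Chinchilla's own budget (6 · 70B · 1.4T ≈ 5.9e23 FLOPs) recovers its 70B-parameter, 1.4T-token configuration, which is a quick sanity check on the arithmetic.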

The Chinchilla recalibration had immediate consequences. LLaMA 1 (Touvron et al. 2023) trained a 65B model on 1.4T tokens, hewing to the Chinchilla ratio and reporting results competitive with GPT-3. PaLM 2 (2023) moved in the same direction. For a few years, "Chinchilla-optimal" was the planning default. But within eighteen months the field had moved again — this time toward over-training, where models are trained well past the 20:1 ratio because the inference cost over a model's deployment lifetime so vastly exceeds the one-time training cost that spending extra training compute to shrink the model is net cheaper. LLaMA 3 (Dubey et al. 2024) trained its 8B model on 15T tokens, which is 1,875 tokens per parameter — roughly 90× the Chinchilla-optimal ratio. The resulting model is small and cheap to serve.

Looming over all of this is the data wall: the estimate that high-quality public text on the internet amounts to something like 10–100 trillion tokens total, and the largest training runs are now consuming meaningful fractions of that. If you extrapolate current trends, the field will exhaust available high-quality text within a few years. This has prompted active research on synthetic data, on training on multiple epochs of the same data, on transfer between modalities, and on smarter data curation — all aimed at stretching the effective token budget past where naive scaling would predict diminishing returns. The data wall may be the structural force that eventually bends the scaling curves, but for now it remains forecast rather than realised.

§15

Training dynamics in practice

Pretraining a frontier language model for weeks on thousands of GPUs is an engineering feat. Loss curves spike, gradients blow up, hardware fails, and learning-rate schedules have to be tuned with care. The practical recipes — warmup, cosine decay, checkpointing, loss monitoring — are lore accumulated over hundreds of failed runs.

The canonical schedule, shared with only minor variation across most modern pretraining runs, is linear warmup followed by cosine decay. During warmup (the first 0.1–1% of total steps) the learning rate rises linearly from zero to its peak value, which prevents the model from taking destructive steps early when gradients are noisy. During the main phase the learning rate follows a cosine curve from peak to a small fraction of peak (typically 0.1× or 0). This schedule was popularised by Loshchilov & Hutter's SGDR paper and is now essentially universal.
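The schedule is simple enough to write down directly. A sketch using typical values from the text as defaults (warmup over the first 1% of steps, decay to 0.1× peak):

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.01, floor_frac=0.1):
    """Linear warmup from 0 to peak_lr, then cosine decay to
    floor_frac * peak_lr. Defaults are typical, not universal."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps              # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    floor = peak_lr * floor_frac
    return floor + 0.5 * (peak_lr - floor) * (1 + math.cos(math.pi * progress))
```

The function is stateless, which is deliberate: a resumed run can recompute the learning rate for any step from the step index alone, with no schedule state to checkpoint.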

Loss spikes — sudden, dramatic increases in training loss, sometimes by orders of magnitude — are a notorious failure mode of large-scale pretraining. They usually trace to a single training example or microbatch that produces extreme activations or gradients, sometimes to numerical instability in specific layers (attention softmax is a common culprit), sometimes to hardware faults. When a spike happens, the standard response is to roll back to the last healthy checkpoint, skip the offending batch, and resume. OPT-175B's training logs (Zhang et al. 2022) are a candid public record of how messy this can be — dozens of rollbacks, several emergency architecture changes, and many weeks of wall-clock debugging.

Mixed-precision training (fp16 or bf16 forward and backward, fp32 optimiser state) is universal at scale, for memory and throughput reasons. The main stability trick is bf16, which has the same exponent range as fp32 and thus avoids the overflow/underflow problems that plagued fp16 training. Optimiser choice is nearly always AdamW. Batch size is large — millions of tokens per step in frontier runs — because gradient variance at the token scale is enormous and has to be averaged out.

Checkpointing — saving model state periodically — is the last line of defence against disaster. Checkpoints are written every few thousand steps; each one is hundreds of gigabytes to terabytes; storage and I/O bandwidth become real concerns. Frontier runs additionally checkpoint the optimiser state (which is roughly 2× the model size for AdamW in fp32) and the data-loader position, so that a resumed run sees the same data in the same order as it would have seen in an uninterrupted run. This determinism matters more than it sounds — tiny differences in data order compound into different final models.
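What a resumable checkpoint must contain can be sketched as a plain dictionary. The field names are illustrative; real frameworks (DeepSpeed, Megatron-LM) shard each piece across ranks:

```python
def make_checkpoint(model_state, optim_state, step, dataloader_state, rng_state):
    """Everything needed for a bit-exact resume (sketch)."""
    return {
        "step": step,
        "model": model_state,            # parameters (bf16/fp32)
        "optimizer": optim_state,        # AdamW moments + fp32 master weights
        "dataloader": dataloader_state,  # position in the shuffled token stream
        "rng": rng_state,                # so stochastic ops replay identically
    }
```

Omitting the dataloader position or the RNG state still produces a loadable checkpoint, but the resumed run diverges from the uninterrupted one, which defeats the determinism the text describes.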

There is a whole subdiscipline around parallelism in pretraining: data parallelism across GPUs in a group, tensor parallelism across GPUs inside a transformer block, pipeline parallelism across layers, expert parallelism for mixture-of-experts models, ZeRO sharding for optimiser state, FSDP for everything. Frameworks like Megatron-LM, DeepSpeed, and torch's FSDP abstract these, but the configurations that actually run well on a given cluster are tuned empirically. A month of a pretraining run is often preceded by weeks of benchmarking.

§16

Multi-modal pretraining

The same self-supervised logic that works for text works, with modifications, for images, audio, and their combinations. CLIP pretrained a joint image-text space with contrastive loss; vision-language models extend the causal-LM objective across modalities; audio models like Whisper pretrain on aligned transcripts. Pretraining is no longer just about text.

CLIP (Radford et al. 2021) was the breakthrough that introduced contrastive image-text pretraining. Take 400 million image-caption pairs scraped from the web; train an image encoder and a text encoder jointly such that matching pairs have high cosine similarity and non-matching pairs have low similarity. The result: a joint embedding space where images and text can be compared directly. This enabled zero-shot image classification (embed the image, embed each candidate label's text description, pick the nearest), which had been considered out of reach for deep-learning models. CLIP's embeddings also turned out to be extraordinarily useful as a building block — they are the conditioning signal for Stable Diffusion, DALL-E 2's text encoder, and many downstream vision systems.
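The symmetric contrastive objective can be sketched in NumPy: normalise both embedding batches, compute the all-pairs similarity matrix, and apply cross-entropy in both directions with the diagonal as the target. The fixed temperature of 0.07 matches CLIP's initialisation, though CLIP learns it during training:

```python
import numpy as np

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matching (image, text)
    pairs, in the spirit of CLIP (numpy sketch)."""
    # L2-normalise so dot products are cosine similarities.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (batch, batch) similarities
    labels = np.arange(len(logits))           # matching pairs on the diagonal

    def xent(l):                              # cross-entropy, diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image->text and text->image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Each row of the logits matrix is a classification problem ("which caption in the batch belongs to this image?"), which is why larger batches give a harder, more informative objective.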

Vision-language models (VLMs) extend the approach to generative settings. Flamingo (Alayrac et al. 2022), BLIP-2 (Li et al. 2023), LLaVA (Liu et al. 2023), and the GPT-4V / Claude / Gemini multimodal systems all operate on the same template: use a pretrained image encoder to turn images into a sequence of embeddings, project those embeddings into the text model's embedding space, and interleave them with text tokens. The language model then processes the combined sequence autoregressively as if the image tokens were just another kind of word. Training is on (image, text) pairs where the model predicts the text given the image — essentially captioning as a causal-LM objective.

Audio pretraining has followed a parallel path. Wav2Vec 2.0 (Baevski et al. 2020) applied a masked-reconstruction objective to raw audio, learning self-supervised representations that could then be fine-tuned on small amounts of labelled speech. HuBERT (Hsu et al. 2021) refined the approach. Whisper (Radford et al. 2022) took a different tack: instead of self-supervised pretraining, it trained a sequence-to-sequence transformer on 680,000 hours of (audio, transcript) pairs scraped from the web — effectively a weakly supervised encoder-decoder, which turned out to generalise exceptionally well to new languages and domains. For music, MusicGen (Copet et al. 2023) and related models apply decoder-only causal LM to neural audio codec tokens.

The logical end of this trend is models that accept arbitrary mixtures of modalities as input and produce arbitrary mixtures as output. Gemini and GPT-4o have moved in this direction, accepting interleaved text, images, audio, and (in some versions) video, and generating text, images, or speech. The tokenisation step for non-text modalities is the main new machinery — images get VQ-VAE or vision-transformer tokens, audio gets neural-codec tokens — but once everything is tokens, the pretraining objective looks remarkably like plain causal LM.

§17

Domain-specific pretraining

General-purpose pretraining produces a generalist; many applications need a specialist. There are three ways to get one — pretrain from scratch on domain data, continue-pretrain a generalist on domain data, or fine-tune a generalist on domain tasks — and each has a different cost/capability trade-off. Code, biology, and mathematics are the three domains where specialisation has paid off most clearly.

Code pretraining is the most economically significant specialisation. Codex (Chen et al. 2021) was a GPT-3 variant continue-pretrained on GitHub code and produced the first GPT-quality programming assistant. Code LLaMA (Rozière et al. 2023) is LLaMA 2 continue-pretrained on 500B tokens of code, with FIM-style infilling to support editing. StarCoder, DeepSeek-Coder, and Qwen-Coder pursue similar recipes with open weights. The effect of code-specialisation is dramatic — pass@1 on HumanEval jumps by tens of points — and the resulting models now underpin essentially all commercial code-assistant products.

Biomedical models form the second large specialisation cluster. BioBERT, PubMedBERT, ClinicalBERT, and BlueBERT are encoder-only models pretrained on PubMed abstracts, clinical notes, or both; they outperform general BERT on biomedical NER, relation extraction, and question answering. Galactica (Taylor et al. 2022) attempted a generalist science model trained on 48M scientific papers; it was withdrawn after being shown to confidently produce plausible-sounding nonsense, a cautionary tale about pretraining on high-prestige domains without the post-training that calibrates confidence. More recent specialist biomedical models (Med-PaLM, Clinical Camel) layer domain pretraining with instruction-tuning and human feedback to manage the confidence problem.

Mathematical pretraining is the newest of the three. Minerva (Lewkowycz et al. 2022) took PaLM and continue-pretrained on 38B tokens of mathematical content (ArXiv, curated web pages); the resulting model achieved state-of-the-art on MATH and GSM8K by substantial margins. DeepSeek-Math (2024), Llemma (2023), and Qwen-Math (2024) pursue the same playbook with open weights. The pattern is clear: pretraining on a mathematically dense corpus makes the model substantially more capable at mathematical reasoning, even if general-purpose capability is slightly reduced.

When to specialise vs fine-tune is a practical question with a fairly reliable answer. If the domain has trillion-token resources available (code, legal corpora, some scientific literatures), continued pretraining is worth the cost — it adds capabilities rather than just adapting existing ones. If the domain has only billion-to-ten-billion-token resources, fine-tuning (possibly with LoRA or similar parameter-efficient methods) extracts most of the value at a fraction of the cost. If the domain is small enough to fit in a prompt — dozens of examples, maybe — just use in-context learning. The three regimes correspond, roughly, to the three scales of data you have available.

§18

Pretraining as a paradigm

Pretraining is not just a technique — it is the economic engine of modern AI. It produced the foundation-model concept, reshaped the research agenda from "invent new models" to "scale existing ones", and concentrated frontier development in a small number of labs with the capital to train them. Understanding that paradigm is the precondition for understanding everything that comes after.

The foundation model framing, introduced by Bommasani et al. 2021, captured what was new: a single pretrained model serves as the foundation for many downstream applications, rather than each application having its own model. This is the industrial logic of pretraining made explicit. Foundations are capital-intensive to build, but cheap to specialise and deploy. The structural parallel is to cloud infrastructure: AWS is expensive to build, cheap to use; GPT-4 is expensive to train, cheap to call. The business model follows the cost structure.

This reshaped the research agenda in ways that are still being absorbed. Before 2018, NLP research was dominated by architecture and objective innovation — new attention mechanisms, new loss functions, new regularisers. After 2020, almost all the research that moved the frontier was about scaling — making pretraining cheaper, making data cleaner, understanding what emerges at scale. Architecture research continued, but with a new constraint: whatever you proposed had to play well with scale, or it did not matter. The field re-focused around a narrower set of questions, and the resulting concentration of effort produced dramatic capability gains in just a few years.

The economic turn is the other durable change. Frontier pretraining now costs in the tens to hundreds of millions of dollars per run. The clusters required are available to a handful of labs globally. This has concentrated model development in ways that the previous decade of ML never saw, and it has made who trains the models a question of public concern rather than academic interest. Open-weights releases (LLaMA, Mistral, Qwen, DeepSeek) are a partial counter-weight, but the marginal cost of a frontier run has grown faster than most institutions' budgets.

The next chapter picks up where this one ends. A pretrained model — the output of everything described here — is a raw base model, good at predicting text but not yet calibrated to be useful. Turning it into a helpful, honest, harmless assistant requires fine-tuning and alignment: supervised task-adaptation, instruction-tuning, RLHF, constitutional AI, and all the other techniques that make a language model into a product. That story — how you take the foundation and build on top of it — is the subject of the next chapter.

Pretraining is to modern AI what electricity was to the twentieth century: a general-purpose capability produced at massive scale in a few places and distributed everywhere to do everything. The techniques in this chapter are the turbines; the techniques in later chapters are the wiring and the appliances. Together they constitute an infrastructure, and like any infrastructure they favour those who can afford to build it and shape the world around those who cannot.

Further reading

Pretraining literature is voluminous and evolves quickly. The anchor textbooks are few — the field is too young for many — but the foundational papers are well-defined, the modern extensions are where most of the action is, and the open-source releases that define the public state of the art are where to start if you want to run experiments of your own.

Anchor textbooks & tutorials

Textbook

Speech and Language Processing

Daniel Jurafsky & James H. Martin (3rd edition, online draft)

The standard NLP textbook, with chapters on pretraining, transformer language models, and fine-tuning that track the state of the art across editions. Free online drafts update regularly; the pretraining chapters are the cleanest textbook treatment available.

Textbook

Foundation Models for Natural Language Processing

Gerhard Paaß & Sven Giesselbach (Springer, 2023)

A textbook-length treatment of pretrained models from BERT forward, covering objectives, architectures, fine-tuning, and evaluation. A reasonable single volume for someone who wants the pretraining-era story in one place rather than scattered across papers.

Tutorial

On the Opportunities and Risks of Foundation Models

Bommasani et al. (Stanford CRFM, 2021)

The paper that introduced and defined the term "foundation model." Part survey, part position paper, part research agenda — long, contentious, and essential context for understanding the conceptual vocabulary the field now uses for pretrained models.

Tutorial

The Illustrated BERT, ELMo, and co.

Jay Alammar (blog, 2018)

The visual companion to Alammar's Transformer post, covering the pretraining revolution of 2018 — how BERT's masked LM works, how ELMo stacks bidirectional language models, how fine-tuning plugs into downstream tasks. Still the cleanest first-pass explanation.

Tutorial

Hugging Face NLP Course

Hugging Face team (ongoing)

A free online course that walks through modern NLP using the Transformers library — tokenisation, pretrained models, fine-tuning, and deployment. Practical rather than theoretical, with runnable notebooks for every chapter, and maintained in sync with library updates.

Tutorial

Stanford CS324: Large Language Models

Percy Liang et al. (Stanford, 2022; CS336 update 2024)

Lecture notes from the first dedicated university course on large language models, covering pretraining objectives, scaling laws, evaluation, and capabilities. The 2024 successor (CS336) extends the treatment with more recent material.

Survey

Pre-Trained Models: Past, Present and Future

Han et al. (AI Open, 2021)

Comprehensive survey of pretrained models up to 2021, organised by architecture, objective, and application. A useful map of the pre-scaling-era landscape — the period when the objective and architecture choices that this chapter catalogues were being actively contested.

Foundational papers

Paper

A Neural Probabilistic Language Model

Yoshua Bengio, Réjean Ducharme, Pascal Vincent & Christian Jauvin (JMLR, 2003)

The paper that established neural language modelling as a research programme. A feedforward network predicts the next word from a fixed-width context, learning distributed word representations as a side effect. Two decades of subsequent work — word2vec, ELMo, GPT — are continuations of this line.

Paper

Semi-supervised Sequence Learning

Andrew M. Dai & Quoc V. Le (NeurIPS, 2015)

Early proof that language-model pretraining followed by task fine-tuning beats training on the task from scratch. Foreshadows ULMFiT, ELMo, and BERT by three years — the intuition was there, but the compute and the transformer architecture were not yet in place.

Paper

Universal Language Model Fine-tuning for Text Classification (ULMFiT)

Jeremy Howard & Sebastian Ruder (ACL, 2018)

The first paper to make pretrain-then-fine-tune work cleanly for text classification. Introduced discriminative learning rates, slanted triangular schedules, and gradual unfreezing — techniques that became standard practice and that demonstrated the full potential of transfer learning in NLP.

Paper

Deep Contextualized Word Representations (ELMo)

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee & Luke Zettlemoyer (NAACL, 2018)

Bidirectional LSTM language models pretrained on a billion words, whose internal states gave task-specific contextual embeddings when plugged into downstream models. The last major pretraining result from the pre-transformer era, and a direct predecessor of BERT.

Paper

Improving Language Understanding by Generative Pre-Training (GPT-1)

Alec Radford, Karthik Narasimhan, Tim Salimans & Ilya Sutskever (OpenAI, 2018)

The paper that started the GPT lineage. A decoder-only transformer pretrained with a causal LM objective on BooksCorpus, then fine-tuned on downstream tasks. The decoder-only architecture is essentially unchanged in every subsequent GPT — only the scale has moved.

Paper

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee & Kristina Toutanova (NAACL, 2019)

The paper that made pretraining the default for NLP. Masked LM on a bidirectional transformer encoder, fine-tuned on downstream tasks, achieving state-of-the-art on essentially every benchmark that existed in 2018. The most-cited NLP paper of the 2010s.

Paper

Language Models are Unsupervised Multitask Learners (GPT-2)

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei & Ilya Sutskever (OpenAI, 2019)

The paper that demonstrated zero-shot task behaviour from a large language model. A 1.5B-parameter decoder-only transformer trained on 40GB of web text could perform translation, summarisation, and QA from a prompt alone, without task-specific fine-tuning. The paper that made scaling look like the main story.

Paper

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer & Veselin Stoyanov (Facebook AI, 2019)

The systematic ablation showing that BERT had been substantially undertrained. Drop next-sentence-prediction, use dynamic masking, scale up data and batch size, and train longer — the resulting model beats BERT at the same parameter count. An influential lesson about compute as the real bottleneck.

Paper

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5)

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li & Peter J. Liu (JMLR, 2020)

The canonical encoder-decoder pretraining paper. Introduced span corruption as an objective, the C4 corpus, and the text-to-text framing that unifies classification, translation, and generation under one interface. A methodological touchstone for large-scale ablations.

Paper

BART: Denoising Sequence-to-Sequence Pre-training

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov & Luke Zettlemoyer (ACL, 2020)

The companion to T5 in the encoder-decoder family, with a richer corruption function family (token masking, deletion, text infilling, sentence permutation, document rotation). Especially influential as a backbone for summarisation and controlled generation.

Paper

Language Models are Few-Shot Learners (GPT-3)

Tom Brown et al. (OpenAI, NeurIPS 2020)

The 175B-parameter decoder-only model that made in-context learning famous. Demonstrated that a sufficiently large language model can perform new tasks from a handful of examples in the prompt, with no gradient updates. The paper that made scaling the core research strategy of the field.

Paper

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Kevin Clark, Minh-Thang Luong, Quoc V. Le & Christopher D. Manning (ICLR, 2020)

A sample-efficient replacement for masked language modelling. A small generator proposes token replacements; the main model is trained as a discriminator to detect which tokens are real. Produces BERT-large quality at a fraction of the compute, and introduced one of the cleanest self-supervised losses in the field.

Paper

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu & Dario Amodei (OpenAI, 2020)

The paper that established power-law scaling as the empirical backbone of language-model planning. Loss falls predictably as compute, parameters, and data grow. Its specific allocation recommendations were later revised (see Chinchilla), but the core finding underpins the field.

Paper

Training Compute-Optimal Large Language Models (Chinchilla)

Jordan Hoffmann et al. (DeepMind, NeurIPS 2022)

The refinement of Kaplan's scaling laws that recalibrated the field. For compute-optimal training, scale parameters and data together in roughly equal proportion — about 20 tokens per parameter. Showed that GPT-3-era models had been dramatically undertrained and reset planning for every subsequent frontier run.
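The Chinchilla allocation lends itself to back-of-envelope sizing. A minimal sketch, assuming the rough 20-tokens-per-parameter rule of thumb and the standard C ≈ 6ND estimate for training FLOPs (both approximations, not the paper's fitted constants):

```python
def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Split a FLOP budget C as C = 6*N*D under the constraint D = 20*N."""
    n_params = (compute_flops / (6 * 20)) ** 0.5  # N = sqrt(C / 120)
    n_tokens = 20 * n_params                      # D = 20 * N
    return n_params, n_tokens

# Example: a 1e23-FLOP budget (roughly GPT-3-scale compute)
n, d = chinchilla_optimal(1e23)
print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e9:.0f}B")  # params ~ 29B, tokens ~ 577B
```

Feeding the same budget through the Kaplan-era allocation would put far more of it into parameters and far less into data — which is exactly the miscalibration Chinchilla corrected.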

Paper

Emergent Abilities of Large Language Models

Jason Wei et al. (TMLR, 2022)

Catalogues capabilities that appear only past certain scales — multi-step reasoning, arithmetic, specific kinds of task-following. Whether "emergence" in the paper's sense is a real phenomenon or an artefact of discontinuous metrics (Schaeffer et al. 2023) is still debated, but the paper framed the vocabulary the field now uses.

Paper

Learning Transferable Visual Models From Natural Language Supervision (CLIP)

Alec Radford et al. (ICML, 2021)

The paper that made contrastive image-text pretraining the default approach for vision-language. 400M image-caption pairs, a contrastive objective, and joint embeddings that enable zero-shot image classification. The foundation of most modern multimodal systems.

Modern extensions

Paper

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron et al. (Meta AI, 2023)

The paper that kicked off the modern open-weights ecosystem. A family of decoder-only models from 7B to 65B parameters, trained Chinchilla-style on up to 1.4T tokens of public data, with weights released to researchers. Its architectural choices — RoPE, SwiGLU, pre-norm — became de facto standards.

Paper

LLaMA 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron et al. (Meta AI, 2023)

LLaMA 1's successor, trained on 2T tokens with more permissive licensing, and accompanied by instruction-tuned "Llama 2 Chat" variants built with RLHF. The release that turned open-weight LLMs into a viable alternative to proprietary APIs for many workloads.

Paper

The Llama 3 Herd of Models

Abhimanyu Dubey et al. (Meta AI, 2024)

LLaMA 3 at 8B, 70B, and (later) 405B parameters, trained on 15T tokens — far past the Chinchilla-optimal ratio, which the paper justifies on inference-cost grounds. One of the most transparent accounts of a modern frontier training run, with detailed pipeline and evaluation coverage.

Paper

Mistral 7B

Albert Q. Jiang et al. (Mistral AI, 2023)

A 7B decoder-only model that outperformed LLaMA 2 13B on most benchmarks, released under Apache 2.0. Used grouped-query attention and sliding-window attention for efficiency. Demonstrated that careful engineering could extract more capability per parameter than size alone predicts.

Paper

DeepSeek-V3 Technical Report

DeepSeek-AI (2024)

A 671B mixture-of-experts model with 37B active parameters per token, trained on 14.8T tokens at a fraction of the cost of comparable frontier runs. A detailed open account of the architectural and infrastructure choices that made this possible, including FP8 training and a new load-balancing scheme.

Paper

Qwen Technical Report

Alibaba Qwen team (2023 onwards)

The Qwen series of open-weights models, with particularly strong multilingual and coding variants. The technical reports are candid about training data mixtures, tokeniser design for Chinese, and the trade-offs between specialist and generalist capabilities.

Paper

OLMo: Accelerating the Science of Language Models

Dirk Groeneveld et al. (AI2, 2024)

A fully open frontier model — not just weights but training data, training code, intermediate checkpoints, and evaluation suites. Designed to make the full training process reproducible, as a counterweight to the partial transparency of commercial releases.

Paper

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao et al. (EleutherAI, 2020)

The open pretraining corpus that powered GPT-Neo and GPT-J, and influenced most subsequent open corpus designs. Combines Common Crawl with curated high-quality sources in documented proportions — a working example of how mixture weighting shapes downstream model behaviour.

Paper

Dolma: An Open Corpus of Three Trillion Tokens

Luca Soldaini et al. (AI2, 2024)

The training corpus for OLMo, documented in unprecedented detail — source inventory, filtering pipeline, deduplication strategy, and per-source token counts. One of the cleanest public accounts of how a modern pretraining corpus is actually constructed.

Paper

FineWeb: Decanting the Web for the Finest Text Data at Scale

Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Colin Raffel et al. (Hugging Face, 2024)

A large, carefully deduplicated and filtered open web corpus (15T tokens) with detailed ablations showing which filtering choices matter and which don't. A running argument that data quality, not just data quantity, is the right lever for future gains.

Paper

Deduplicating Training Data Makes Language Models Better

Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch & Nicholas Carlini (ACL, 2022)

Empirical demonstration that aggressive deduplication of training data improves model quality and reduces memorisation. One of the few papers that unambiguously moved the needle on pretraining practice: modern pipelines dedupe much more aggressively because of this work.

Paper

Textbooks Are All You Need (Phi-1)

Suriya Gunasekar et al. (Microsoft, 2023)

The paper that made data quality into a first-class lever. A 1.3B model trained on a small, carefully curated corpus of textbook-quality code data matched much larger models on coding benchmarks. The Phi-2 and Phi-3 follow-ups extended the argument into general-purpose pretraining.

Paper

Code Llama: Open Foundation Models for Code

Baptiste Rozière et al. (Meta AI, 2023)

A canonical example of domain-specialisation via continued pretraining. Llama 2 continue-pretrained on 500B tokens of code, with fill-in-the-middle support, producing strong open-weights coding models. The methodology is now standard for any domain specialisation.

Paper

Solving Quantitative Reasoning Problems with Language Models (Minerva)

Aitor Lewkowycz et al. (Google, 2022)

PaLM continue-pretrained on 38B tokens of mathematical content (ArXiv, curated web pages). State-of-the-art on MATH and GSM8K when published, and the paper that demonstrated how much mathematical reasoning ability could be unlocked by domain-specific pretraining alone.

Paper

Whisper: Robust Speech Recognition via Large-Scale Weak Supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey & Ilya Sutskever (OpenAI, 2022)

An encoder-decoder model trained with weak supervision on 680,000 hours of web-scraped audio-transcript pairs. Generalised to 99 languages and many domains out of the box. A standout example of data scale, rather than self-supervision, as the lever for cross-modal pretraining.

Paper

Flamingo: a Visual Language Model for Few-Shot Learning

Jean-Baptiste Alayrac et al. (DeepMind, NeurIPS 2022)

A template for vision-language models: freeze a pretrained language model, insert perceiver-based cross-attention blocks, and train on interleaved image-text data. Established the architectural pattern that BLIP-2, LLaVA, and most modern VLMs still follow.

Paper

Scaling Data-Constrained Language Models

Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf & Colin Raffel (NeurIPS, 2023)

Scaling laws under the constraint that high-quality data is limited, including the regime where repeating data for multiple epochs is necessary. Essential reading for thinking about the "data wall" and what happens as frontier runs exhaust public text.

Software & tooling

Software

Hugging Face Transformers

Hugging Face

The lingua franca for loading, fine-tuning, and deploying pretrained models. Thousands of pretrained models from dozens of labs, with a unified API for tokenisation, training, and inference. Effectively indispensable for any serious pretrained-model work.

Software

Megatron-LM

NVIDIA

NVIDIA's reference implementation for large-scale transformer training, with tensor and pipeline parallelism, efficient fused kernels, and production-tested training loops. The backbone of many of the largest open pretraining runs, including OPT and BLOOM.

Software

DeepSpeed & ZeRO

Microsoft Research

A memory-optimisation library that shards optimiser state, gradients, and parameters across GPUs (ZeRO 1/2/3) to enable training models many times larger than a single GPU's memory could hold. The enabling infrastructure for most mid-budget open pretraining runs.

Software

PyTorch FSDP

PyTorch team (Meta)

PyTorch's native Fully Sharded Data Parallel implementation, building on the same ideas as ZeRO but integrated directly into the core framework. The increasingly dominant choice for large-model training inside the PyTorch ecosystem.

Software

SentencePiece & tiktoken

Google / OpenAI

SentencePiece is the canonical multilingual tokeniser training library (unigram-LM and BPE modes). tiktoken is OpenAI's fast BPE encoder, targeting its own vocabularies. Between them they cover most of the tokenisation stack that modern pretraining depends on.
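The merge loop at the heart of BPE — which both libraries implement at scale, over bytes and with precomputed merge tables — can be sketched in toy form (illustrative only, not either library's API):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Most common adjacent pair in the current token sequence."""
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def apply_merge(tokens, pair):
    """Replace every occurrence of `pair` with the concatenated token."""
    merged, out, i = pair[0] + pair[1], [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Two merge steps on a toy corpus, starting from character-level tokens
tokens = list("low lower lowest")
for _ in range(2):
    tokens = apply_merge(tokens, most_frequent_pair(tokens))
print(tokens)  # ['low', ' ', 'low', 'e', 'r', ' ', 'low', 'e', 's', 't']
```

Real tokeniser training records each chosen pair as a merge rule; at encoding time the rules are replayed in order, which is what tiktoken's precomputed vocabularies make fast.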

Software

nanoGPT

Andrej Karpathy

A tiny, readable reference implementation of GPT-style decoder-only pretraining in ~300 lines of PyTorch. Not a production system — an education. Pair it with the companion "Let's build GPT from scratch" video for one of the clearest introductions to how pretraining actually works.
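The loss nanoGPT minimises — mean negative log-likelihood of each token given its predecessors — can be shown in miniature with a toy bigram model standing in for the transformer (the model here is hypothetical; only the shape of the loss computation carries over):

```python
import math
from collections import Counter, defaultdict

def bigram_probs(corpus):
    """Maximum-likelihood bigram model: P(next token | current token)."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(corpus, corpus[1:]):
        counts[cur][nxt] += 1
    return {cur: {nxt: k / sum(c.values()) for nxt, k in c.items()}
            for cur, c in counts.items()}

def causal_lm_loss(model, tokens):
    """The causal-LM objective: average NLL of each next token, in nats."""
    nlls = [-math.log(model.get(cur, {}).get(nxt, 1e-12))
            for cur, nxt in zip(tokens, tokens[1:])]
    return sum(nlls) / len(nlls)

corpus = "the cat sat on the mat".split()
model = bigram_probs(corpus)
print(f"{causal_lm_loss(model, corpus):.3f} nats/token")  # 0.277 nats/token
```

In a real run the bigram table is replaced by transformer logits and the average NLL is minimised by gradient descent, but the quantity reported as "training loss" is exactly this average.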

Software

llm-foundry & composer

Databricks (formerly MosaicML)

Production-grade training stack for open LLMs, with MPT-series as the reference models. Includes streaming data loaders, efficient attention kernels, and a battle-tested recipe for multi-node pretraining. A useful counterpart to Megatron-LM for teams running on AWS or Azure.

Software

datatrove & dolma toolkit

Hugging Face / AI2

Open tooling for building pretraining corpora — web crawling, language identification, deduplication, quality filtering, and mixture assembly. The reference implementations of FineWeb (datatrove) and Dolma (dolma toolkit) are useful starting points for running a serious pretraining pipeline.