For most of deep learning's first decade, every new task started from random initialisation. If you wanted to classify medical images, or translate a new language pair, or detect spam, you collected a labelled dataset, picked an architecture, and trained until the loss stopped moving. Every model was its own island. The first cracks in this paradigm came from computer vision in the early 2010s, when it became clear that features learned by a large ImageNet classifier were reusable — a network trained on 1.2 million photographs had discovered general-purpose visual primitives that transferred to radiographs, satellite images, and every downstream vision benchmark. The pattern was formalised, accelerated, and eventually universalised: first with word embeddings (word2vec 2013, GloVe 2014), then with contextual word vectors (ELMo 2018), then with the decisive BERT and GPT papers of 2018–2020, which established that a model pretrained on enormous amounts of unlabelled text could be adapted to essentially any downstream NLP task with modest fine-tuning or, for large-enough models, with no weight updates at all. By the mid-2020s the frontier had moved to foundation models trained on trillions of tokens, adapted through instruction tuning, preference optimisation, and parameter-efficient fine-tuning — and the underlying idea in every case was the same one that ImageNet pretraining had foreshadowed a decade earlier: the expensive part of deep learning is the representation, and the representation generalises. This chapter is about that idea. We walk through what transfer learning is, why it works, how it evolved from feature extraction to self-supervised pretraining to foundation models, and the toolkit — fine-tuning, LoRA, instruction tuning, RLHF, in-context learning — that makes it usable in practice.
Sections one through five establish the landscape. Section one motivates transfer learning with the basic economic argument — representations are expensive to learn and cheap to reuse — and states the problem in its modern form: pretraining on abundant data, adapting to scarce data. Section two introduces the dominant paradigm, pretrain-then-finetune, and its two main modes (feature extraction and full fine-tuning). Section three unpacks the spectrum from frozen-feature reuse to full end-to-end adaptation, including the middle ground of partial unfreezing and layer-wise learning rates. Section four revisits the historical case that established the playbook — ImageNet pretraining — and explains why vision was the first field to discover that large supervised pretraining gave transferable representations. Section five is the conceptual pivot: self-supervised learning, the idea that you can get useful representations from unlabelled data by inventing surrogate tasks the network must solve.
Sections six through ten walk through the four canonical pretraining objectives that now dominate the field. Section six is masked language modelling — BERT's denoising objective, the paradigm-defining moment for NLP, and the template for every subsequent "mask, predict, reuse" training recipe. Section seven is autoregressive pretraining — the GPT line, next-token prediction at scale, the objective whose unreasonable effectiveness ended up underwriting every frontier language model. Section eight is contrastive learning, from SimCLR and MoCo to CLIP's enormously influential multimodal version — the family that taught vision to pretrain without labels and taught the field to think in terms of embedding geometry. Section nine is masked image modelling (MAE, BEiT, SimMIM) — the vision equivalent of BERT, and the return of reconstruction-based objectives in vision after a half-decade detour through contrastive methods. Section ten steps back and addresses domain adaptation — the more classical subfield concerned with covariate shift, label shift, and the old problem that a model trained on one distribution has to work on another.
Sections eleven through fifteen cover the adaptation side of the ledger. Section eleven handles zero-shot and few-shot transfer — the surprising ability of a sufficiently-pretrained model to perform new tasks with no examples or a handful. Section twelve takes up in-context learning and prompting — the GPT-3 observation that patterns shown in the prompt can steer model behaviour without weight updates, and the prompting literature that grew out of it. Section thirteen is instruction tuning — fine-tuning on demonstrations of how to follow instructions (FLAN, T0, Instruct-GPT), which turned raw language models into models that do what you ask. Section fourteen is alignment — RLHF (Ouyang et al. 2022), DPO, and the preference-optimisation family that pushes models beyond imitation toward behaviours humans prefer. Section fifteen is parameter-efficient fine-tuning — adapters, LoRA, prefix tuning, IA³ — the methods that let you customise a billion-parameter model by training a tiny fraction of its weights.
Sections sixteen through eighteen close the chapter. Section sixteen covers catastrophic forgetting and the continual-learning problem — what happens when you fine-tune and how to avoid destroying the general-purpose capabilities you started with (elastic weight consolidation, replay, LoRA's accidental preservation). Section seventeen is scaling laws and emergent abilities — Kaplan et al. 2020, Chinchilla, the data-model-compute frontier, and the phenomenon that certain capabilities seem to appear only past specific scale thresholds (with the caveat that "emergence" depends on the metric you pick). The closing section places pretraining in the wider landscape: as the economic engine of modern ML, as the reason foundation models exist as a commercial category, and as the abstraction that turned deep learning from an artisanal practice into a two-layer market of pretraining specialists and downstream adapters.
Labelled data is scarce, representations are expensive, and most of what a deep network learns on one task is reusable on the next. The question is not whether to transfer — it is how.
Training a deep network from scratch is costly in every dimension that matters. It requires a labelled dataset large enough to identify the relevant features, compute capacity to run thousands of gradient steps over that dataset, and — most subtly — a problem clean enough that the optimisation actually finds the features you want. Any one of these can fail. Many tasks worth solving are data-starved: medical imaging with hundreds of examples per condition, low-resource languages with no curated parallel corpora, scientific problems where a single labelled example takes a week of experiment time.
The foundational observation of transfer learning is that the expensive part of deep learning — the learning of representations — generalises more widely than the downstream task does. A network trained to classify 1 000 categories of everyday objects learns edges, textures, shapes, part compositions, and object-level concepts that are useful for classifying radiographs or satellite images. A network trained to predict the next word in a sentence learns syntax, discourse structure, factual associations, and reasoning patterns that transfer to summarisation, classification, translation, or dialogue.
Framed economically, transfer learning changes the cost structure of ML. Instead of each downstream task bearing the full cost of representation learning, that cost is paid once — in advance, on abundant data — and amortised across arbitrarily many downstream adaptations. The pretraining compute looks enormous in isolation (GPT-4-scale training runs cost tens of millions of dollars) but looks tiny when divided across the thousands of products and applications the resulting model supports.
The central claim. The representations that make a model good at a large, data-rich source task are substantially the same representations that make it good at a related, data-scarce target task. This is an empirical claim, not a theorem — but it has held across domains with a regularity that has restructured the field.
Why it works, at the deepest level, is still partly open. Classical theory gives no guarantee that features learned on one distribution will help on another; modern learning theory offers frameworks (Ben-David et al. 2010; invariant representation learning) that bound the transfer gap under specific assumptions. In practice the empirical answer is the one the field has run with: pretraining does work, it works better the more you scale it, and a large fraction of downstream tasks benefit.
Train once on a large source task; adapt cheaply to each target task. The two-stage recipe is the backbone of essentially every modern deep-learning system.
The paradigm has two phases. In pretraining, a model is trained on a source task chosen for abundance of data rather than direct utility — ImageNet classification, next-token prediction, masked token reconstruction, image–text contrastive alignment. The pretraining loss is rarely the thing the model is ultimately used for; it is a pretext for learning representations. In fine-tuning (or, more generally, adaptation), the pretrained weights are used as the initialisation for a target task, and a much smaller amount of task-specific data drives further optimisation.
The simplicity of the interface — take a checkpoint, resume training on new data, get a better starting point — is part of why the paradigm became so dominant. Any practitioner who can run gradient descent can fine-tune a pretrained model; doing so routinely outperforms any amount of from-scratch training on a data-scarce target.
The two stages differ in almost every operational detail. Pretraining uses enormous datasets, vast compute clusters, careful curriculum and data mixing, specialised loss functions, and long training runs with sophisticated schedulers. Fine-tuning is the opposite: small datasets, modest compute, a few epochs at a low learning rate, a task-specific head on top of the pretrained body, and a short schedule that stops before the base representations drift too far. The scale asymmetry is the whole point — pretraining is the expensive specialty, fine-tuning is the cheap commodity.
The checkpoint economy. A trained checkpoint is a marketable artefact. Hugging Face's Hub hosts hundreds of thousands of them; enterprises maintain private versions; the entire open-source LLM ecosystem is organised around downloading, adapting, and republishing them. This is not incidental — it is the natural market structure once pretraining is too expensive for any single adapter to do.
The question that the rest of this chapter answers is: what exactly varies between source and target tasks, and what adaptation methods bridge that gap? What follows is, in a sense, a catalogue of answers to that question — one per pretraining objective, adaptation method, and deployment setting.
The pretrained network can be used as a frozen feature extractor, as a fully trainable initialisation, or as anything in between. The trade-offs are about data, cost, and how different the target is from the source.
Feature extraction uses the pretrained network as a fixed function. The backbone weights are frozen; only a small task-specific head — typically a linear classifier or a shallow MLP — is trained on the target dataset. Cost is minimal (the backbone computes forward only, once per example), the base capability is preserved perfectly, and the method is a strong baseline in regimes where the target task is similar to the source.
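As a concrete sketch, the snippet below plays out the recipe in miniature: a frozen random projection stands in for the pretrained backbone (a real pipeline would run an actual pretrained network in inference mode), and only a logistic-regression head is trained on its outputs. All sizes, names, and data here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained backbone: a fixed (frozen) random projection.
# A real pipeline would use an actual pretrained network, run forward-only.
W_frozen = rng.normal(size=(8, 32)) / np.sqrt(8)   # 8-d inputs -> 32-d features

def backbone(x):
    return np.tanh(x @ W_frozen)                   # frozen: never updated

# Tiny labelled target dataset (synthetic, for illustration only).
X = rng.normal(size=(200, 8))
y = (X[:, 0] > 0).astype(float)

feats = backbone(X)                                # computed once, forward-only

# Train only the head: logistic regression on the frozen features.
w, b = np.zeros(32), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))     # sigmoid
    w -= 0.5 * feats.T @ (p - y) / len(y)          # cross-entropy gradient step
    b -= 0.5 * (p - y).mean()

acc = ((feats @ w + b > 0).astype(float) == y).mean()
```

The division of labour is the point: the backbone is a fixed feature map computed once per example, and all gradient work happens in the small head.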
Full fine-tuning unfreezes everything. Gradients flow through the entire network, backbone included, usually at a lower learning rate than pretraining. Capacity is highest — the network can rewrite its internal representations to fit the target — but so is the risk: enough fine-tuning can erase the general-purpose features the pretraining installed. Full fine-tuning tends to dominate on tasks that diverge noticeably from the source and where the target dataset is large enough to support representation updates without overfitting.
The middle ground is where most practical work lives. Partial unfreezing fine-tunes the last few layers and leaves the earlier ones frozen — a reasonable default because earlier layers tend to encode more general features. Layer-wise learning rates (ULMFiT; Howard and Ruder 2018) apply smaller updates to early layers and larger updates to later layers, preserving general features while adapting task-specific ones. Discriminative fine-tuning varies these schedules in even more structured ways. The parameter-efficient methods of §15 — LoRA, adapters, prefix tuning — can be read as another generation of this same design move: restrict the updates to a small, well-chosen subspace rather than the whole parameter tensor.
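A minimal sketch of the layer-wise schedule: each layer's rate is the base rate decayed once per layer of depth, so the earliest (most general) layers move least. The decay factor 0.65 is an illustrative choice, not a value from the ULMFiT paper.

```python
# Discriminative (layer-wise) learning rates in the spirit of ULMFiT:
# early layers get small updates, later layers larger ones.
def layerwise_lrs(base_lr, num_layers, decay=0.65):
    # layer 0 = earliest / most general; layer num_layers-1 = nearest the head
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

lrs = layerwise_lrs(base_lr=1e-3, num_layers=4)
# monotonically increasing: the last layer trains at the full base rate
```

In a real framework each rate would be attached to the corresponding parameter group of the optimiser.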
Choose by regime. Small target dataset, similar-to-source task: feature extraction. Small target dataset, different task: feature extraction plus a carefully-sized head. Large target dataset, different task: full fine-tuning with a low learning rate. Very large base model, any target: one of the parameter-efficient methods.
The empirical literature here is unusually well-documented. Kornblith, Shlens, and Le (2019) systematically compared fine-tuning and feature extraction across 16 ImageNet-pretrained backbones and 12 downstream tasks — the kind of exhaustive study that tends to settle debate. Their headline: better ImageNet accuracy translates to better transfer, and fine-tuning usually wins, but the gap is modest when the target is ImageNet-like.
Computer vision discovered transfer learning first. The playbook it wrote — pretrain a large classifier on ImageNet, reuse its features everywhere — ran for almost a decade before NLP caught up.
The moment is usually dated to Donahue et al.'s 2014 "DeCAF" paper and Yosinski et al.'s 2014 "How transferable are features in deep neural networks?" Both showed that features extracted from AlexNet — trained on ImageNet classification — outperformed hand-engineered vision features on essentially every downstream benchmark (scene recognition, fine-grained classification, domain generalisation). The representations the network learned on 1 000 ImageNet categories were not really about those categories at all; they were general-purpose visual primitives that happened to emerge under the classification pressure.
Within two years, "ImageNet pretraining + task-specific head" had become the default recipe in vision. Every object-detection system (R-CNN, Fast R-CNN, Faster R-CNN) was built on an ImageNet backbone. Every semantic segmentation system (FCN, U-Net-with-ResNet) was built the same way. Medical imaging, satellite imagery, agricultural monitoring, manufacturing QA — all used ImageNet-pretrained CNNs as their starting point, with astonishingly little domain-specific data required. The paradigm was so pervasive that for years the standard benchmark for a new architecture was "how well does it transfer from ImageNet?"
The limits of the recipe also became visible. Supervised ImageNet pretraining inherits ImageNet's biases — photographic images, certain geographic and demographic distributions, certain object-versus-context balance. Medical and scientific imaging often pushed this too far. The next wave of vision pretraining moved toward self-supervised and large-scale noisy-supervised alternatives (§5, §8, §9), in part to escape the specific character of ImageNet labels.
Rethinking ImageNet pretraining. He, Girshick, and Dollár (2019) argued that once datasets were large enough, training from scratch could match fine-tuning — provoking a useful debate about when pretraining actually helps. The answer that emerged was nuanced: pretraining still helps when target data is scarce and remains a useful safety margin even when it is abundant, but its dominance was less absolute than the paradigm implied.
The historical arc matters for understanding NLP's later path. Vision had a decade of experience with supervised pretraining before it began to transition to self-supervised methods; NLP essentially skipped supervised pretraining and went directly to large-scale self-supervision. The reasons are in §5 and §6.
Labels are a bottleneck. Self-supervision uses the structure of unlabelled data itself to generate training signal — the move that made pretraining scale.
Supervised pretraining runs into a hard constraint: the labels. ImageNet has 1.2 million labelled images, which is a lot in absolute terms and essentially nothing compared to the billions of images a serious pretraining run could consume. Expanding the supervised-label pool is expensive, slow, and subject to diminishing returns. The same issue is even more severe in NLP, where high-quality task labels are scarcer than in vision, and in speech and biology, where labelling is sometimes infeasible.
Self-supervised learning (SSL) solves the labelling problem by inventing training signal from the data itself. The model is given a pretext task — a surrogate objective chosen not because its solution is useful in its own right but because solving it requires learning useful representations. Hide a word in a sentence and predict it; crop two patches from an image and ask whether they came from the same source; mask out regions of an image and reconstruct them; shuffle the frames of a video and predict the original order. Each of these tasks has a ground-truth answer derivable from the raw data alone — no human annotation required — and solving them well turns out to demand something close to genuine understanding.
The early self-supervised vision literature (2015–2019) tried a parade of pretext tasks: jigsaw puzzles, rotation prediction, colourisation, exemplar discrimination, and several others. Most produced usable features; none became dominant. The real breakthroughs came when the pretext task was chosen well enough and scaled hard enough that it stopped mattering whether the task was "aligned" with downstream use — BERT's masked language modelling (§6), GPT's next-token prediction (§7), and the contrastive objectives of §8.
The lottery of pretext tasks. Most pretext tasks produce mediocre features. A few produce extraordinary ones. The difference is usually that the good pretext tasks require the model to capture the same semantic structure the downstream task depends on — not because they were designed to, but because that structure is what the data actually has to teach.
The broader shift this represents is one of the largest in modern ML. Supervised learning needed labels as input; self-supervised learning needs only the data. The supply curve shifts from "how many annotators can we hire" to "how much data can we store and train on," and the latter scales much better.
Mask a word. Ask the network to fill it in. Repeat for a billion sentences. BERT's 2018 recipe opened the modern era of NLP pretraining and defined the denoising template for a decade.
Devlin, Chang, Lee, and Toutanova's BERT (Bidirectional Encoder Representations from Transformers, 2018–2019) formalised the masked language modelling (MLM) objective and paired it with an encoder-only transformer. Pretraining procedure: take a sentence; replace 15% of the tokens with a [MASK] placeholder (with small fractions swapped for random or kept-as-is tokens for training stability); ask the model to reconstruct the original tokens from their context. The loss is cross-entropy on the masked positions only.
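The corruption step can be sketched directly. The 80/10/10 split below follows BERT's published recipe; the token ids and vocabulary size are illustrative (103 happens to be [MASK] in BERT-base's vocabulary), and the model itself is elided.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID, VOCAB = 103, 30_000        # illustrative ids; 103 is [MASK] in BERT-base

def mlm_corrupt(tokens, p_select=0.15):
    """Select ~15% of positions for the loss; of those, 80% become [MASK],
    10% a random token, and 10% stay unchanged (BERT's 80/10/10 split)."""
    tokens = np.asarray(tokens)
    corrupted = tokens.copy()
    selected = rng.random(len(tokens)) < p_select
    roll = rng.random(len(tokens))
    to_mask = selected & (roll < 0.8)
    to_rand = selected & (roll >= 0.8) & (roll < 0.9)
    corrupted[to_mask] = MASK_ID
    corrupted[to_rand] = rng.integers(0, VOCAB, size=to_rand.sum())
    # remaining selected positions are kept as-is but still contribute loss
    return corrupted, np.flatnonzero(selected)

seq = np.arange(1000, 1200)          # a toy 200-token sequence
corrupted, loss_positions = mlm_corrupt(seq)
```

Cross-entropy is then computed only at `loss_positions`, against the uncorrupted originals.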
The objective is a form of denoising autoencoding specialised to discrete inputs. Because every masked position can attend to the entire uncorrupted context on both sides, the representations the model learns are bidirectional — each token's embedding carries information about everything around it. This is the property that made BERT useful for classification and span-extraction tasks: fine-tune a small head on top of the [CLS] token or on top of token-level representations, and the pretrained backbone provides features that outperformed every prior NLP system on GLUE, SQuAD, and a dozen other benchmarks.
The BERT family expanded quickly. RoBERTa (Liu et al. 2019) showed that longer training and more data closed most of the gap between BERT-base and task-specific architectures; ALBERT (Lan et al. 2019) shared parameters to reduce model size; DeBERTa (He et al. 2020) decoupled content and position. The denoising template proliferated beyond language too: MAE for images (§9), Graph-BERT, masked-autoencoders for proteins, speech, and molecular graphs.
Why MLM worked. Predicting a masked token from bidirectional context is hard enough to require real language understanding (syntax, selectional restrictions, discourse), easy enough to provide a reliable gradient at every position, and general enough that the features transfer broadly. It hit the same pretext-task sweet spot that next-token prediction would hit on the generative side.
For a time, MLM-pretrained encoders were the practitioner's default for any NLP task that was not open-ended generation — classification, tagging, retrieval, extractive QA. Their dominance has eroded as decoder-only models have absorbed more tasks, but encoder-only MLM models remain the best choice for many bounded-input-bounded-output tasks, and the objective itself has lasting influence wherever denoising-style self-supervision appears.
Predict the next token. Repeat across trillions of tokens. The objective is embarrassingly simple; the capabilities it unlocks, at sufficient scale, are the defining result of the decade.
The GPT line (Radford et al. 2018, 2019; Brown et al. 2020; and the closed-source successors) is built on a single objective: next-token prediction. Given tokens x₁ … x_{t−1}, predict x_t. Repeat over every position in every training sequence. The loss is standard cross-entropy; the model is a decoder-only transformer with causal masking (§10 of the Attention chapter). There is nothing ornate about the recipe.
The remarkable fact is how much capability this objective loads into the model at scale. To reliably predict the next token in arbitrary text, a model has to learn spelling, morphology, syntax, lexical semantics, discourse structure, world knowledge, arithmetic, basic reasoning, stylistic patterns, common coding conventions, and dozens of other things. None of these are directly asked for; they are all implicitly required by the loss. The result is that a sufficiently large GPT-style model can be steered, via prompt, to perform an enormous range of tasks without any task-specific training — the "generalist model" phenomenon that §11 and §12 develop.
Autoregressive pretraining has two properties that make it especially favourable. First, it is a dense objective: every token in the sequence contributes a loss term, rather than only the 15% masked positions of MLM. The gradient signal per example is roughly six times richer, which translates to faster convergence per token of data. Second, the objective matches the inference mode exactly — a generative autoregressive model produces sequences one token at a time, precisely the operation it was trained on. There is no mismatch between pretraining and deployment the way there is for MLM, where inference-time use rarely involves masking.
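The density of the objective is easy to make concrete. The sketch below computes the autoregressive cross-entropy given a matrix of logits (here random stand-ins for a model's output under causal masking): every position but the last contributes a term.

```python
import numpy as np

def next_token_loss(logits, tokens):
    """Mean cross-entropy of predicting each token from the positions before it.
    logits: (T, V) array, where logits[t] is the prediction for tokens[t+1]
    under causal masking; tokens: length-T integer sequence."""
    logits = logits[:-1]                        # position t predicts token t+1
    targets = np.asarray(tokens)[1:]
    # numerically stabilised log-softmax
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # one loss term per position: the objective is dense, unlike MLM's ~15%
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
T, V = 8, 50
loss = next_token_loss(rng.normal(size=(T, V)), rng.integers(0, V, size=T))
# with uninformative logits the loss sits near ln(V); training drives it down
```

With all-zero logits the loss is exactly ln(V), the entropy of a uniform guess — a handy sanity check when wiring up a real training loop.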
The objective that won. For a few years, MLM and autoregressive pretraining competed. As scale grew, autoregressive pretraining won decisively — not because MLM was wrong but because next-token prediction scales more gracefully, transfers to more tasks, and directly supports generation. By 2023 every frontier model was decoder-only, autoregressively pretrained.
This is not the same as saying MLM is obsolete. Encoder representations from BERT-style models remain competitive on bounded classification and retrieval tasks, and often cheaper at inference. But the gravitational centre of the field has moved decisively toward autoregressive pretraining, for reasons the chapter on Foundation Models will take up at greater length.
Pull together embeddings that should be alike; push apart embeddings that should be different. The resulting geometry carries most of what a supervised classifier would have learned, without requiring labels at all.
Contrastive learning's central mechanism is almost embarrassingly simple. Take a data point; produce two views of it (two augmentations of the same image, a caption and its image, two sentences about the same event). Pass both through an encoder. Pull their embeddings together; push them away from the embeddings of all other data points in the batch. The resulting loss — InfoNCE, introduced by van den Oord, Li, and Vinyals (2018) — is a softmax over similarities and backpropagates cleanly.
In vision, SimCLR (Chen, Kornblith, Norouzi, Hinton 2020) and MoCo (He et al. 2020) established that contrastive pretraining with heavy augmentation on unlabelled ImageNet could produce features competitive with supervised ImageNet pretraining. The augmentations (random crops, colour jitter, Gaussian blur) force the model to learn invariances rather than memorise pixels, and the "push-apart" term against the rest of the batch forces the embeddings to spread across the space.
The most influential instance is CLIP (Radford et al. 2021), which paired a vision encoder with a text encoder and trained them to align embeddings of matched image–caption pairs from a 400-million-pair dataset scraped from the web. The resulting joint embedding space supports zero-shot image classification (encode candidate class names as text; embed the image; pick the nearest), zero-shot image retrieval, and a broad array of multimodal applications. CLIP became the most-used vision encoder of the 2020s essentially because of this capability, and the technique has since been extended to video (Video-CLIP), audio (CLAP), and beyond.
Non-contrastive siblings. BYOL (Grill et al. 2020), DINO (Caron et al. 2021), and SimSiam (Chen and He 2021) showed you could remove the negative-pair term entirely and still learn good representations, as long as you used an asymmetric architecture (stop-gradients, momentum encoders, predictor networks). This was a surprise at the time and is still only partly understood — the family is now called self-distillation and has become a dominant approach for image pretraining.
The contrastive era largely pushed the field to think in terms of embedding geometry — what kinds of neighbourhoods the encoder produces, how uniformly it spreads representations across the unit sphere, how quickly similarity under the encoder tracks similarity in semantic labels. The shift has outlasted the specific algorithms; even the generative frontier of the 2020s (diffusion models, latent-space VAEs) borrows heavily from the geometric intuitions that contrastive methods made standard.
Mask image patches, reconstruct their contents, transfer the backbone. BERT's denoising template, relocated to vision, produced the strongest self-supervised image representations of the early 2020s.
For a few years contrastive methods dominated self-supervised vision. The tide turned with Masked Autoencoders (MAE; He et al. 2021). The recipe is a direct port of BERT: split an image into patches (the Vision Transformer tokenisation of §15 of the Attention chapter); mask out 75% of them; train an asymmetric encoder–decoder transformer to reconstruct the original pixels from only the visible patches. The high mask ratio is not a detail — it is the whole reason the method works. With too few patches hidden, the task reduces to copying; with 75% hidden, genuine reconstruction demands real understanding.
The asymmetric architecture is a speed trick. The encoder runs only on the visible patches — a quarter of the image — which makes pretraining roughly four times faster per forward pass. The decoder is small and operates on the full sequence (visible encodings plus mask tokens) to reconstruct pixels. At fine-tuning time only the encoder is kept; the decoder is thrown away.
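The masking and loss placement can be sketched without the networks themselves; below, a zero array stands in for the decoder's output, and the patch sizes are the standard ViT numbers used for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mae_split(patches, mask_ratio=0.75):
    """Split patchified input into the visible subset (what the encoder sees)
    and the indices of the masked patches (where the loss is computed)."""
    n = len(patches)
    n_visible = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    visible_idx, masked_idx = perm[:n_visible], perm[n_visible:]
    return patches[visible_idx], visible_idx, masked_idx

# A 224x224 RGB image in 16x16 patches -> 14*14 = 196 patches of 768 values.
patches = rng.normal(size=(196, 16 * 16 * 3))
visible, visible_idx, masked_idx = mae_split(patches)

# The encoder runs on 49 patches instead of 196 -- the speed-up from the text.
reconstruction = np.zeros_like(patches)        # stand-in for the decoder's output
# MAE computes mean squared error on the masked patches only:
loss = ((reconstruction[masked_idx] - patches[masked_idx]) ** 2).mean()
```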
BEiT (Bao, Dong, Wei 2021) used a discrete visual tokeniser to convert image patches into codebook indices and then ran a BERT-style MLM on those indices — bringing the analogy to NLP even closer. SimMIM (Xie et al. 2021) ran masked-pixel reconstruction with a simpler architecture and showed that many of MAE's design choices were interchangeable. The family collectively established masked image modelling as the strongest self-supervised baseline at ViT-B scale and above, surpassing contrastive methods on fine-tuning performance for many downstream tasks.
When reconstruction beats contrast. Contrastive methods learn invariances; reconstruction methods learn to represent the input faithfully. The difference matters for downstream tasks: dense prediction (segmentation, depth, detection) tends to prefer reconstruction-style features; linear classification tends to prefer contrastive ones. The methods are complementary more than competing.
The MAE recipe has proved portable. Masked-autoencoder variants have been applied to video (ST-MAE), audio (Audio-MAE), point clouds (Point-MAE), and molecular graphs. The underlying idea — mask a large fraction of the input, reconstruct, reuse the encoder — has become a canonical template for self-supervised pretraining in any modality with a natural tokenisation.
Transfer learning's classical subfield: the task is the same, the distribution is different, and the question is how to bridge the gap without the target labels that would make it easy.
Domain adaptation is the older, narrower cousin of transfer learning. The canonical setting is: a source distribution with abundant labels, a target distribution with many unlabelled examples and few or no labels, and the same underlying task (classify images, detect objects, tag sentences). The question is how to produce a classifier that works well on the target given this asymmetry. Two generic failure modes motivate the field: covariate shift (the input distribution differs but the conditional label distribution is the same) and label shift (the label marginals differ). Both break the classical iid assumption behind supervised learning.
The pre-deep-learning playbook emphasised importance weighting, kernel mean matching, and feature-space transformations that align the two distributions. The deep-learning era contributed the domain-adversarial neural network (DANN; Ganin et al. 2016), which trains a feature extractor to simultaneously minimise a task loss on the source and confuse a domain classifier that tries to tell source from target. If the domain classifier cannot distinguish domains, the feature extractor has found a representation where the two distributions overlap — a necessary condition for the source-trained classifier to work on the target.
Other approaches: CORAL (Sun, Feng, Saenko 2016) matches second-order statistics of source and target features; MMD-based methods penalise kernel mean discrepancy between the two distributions; self-training and pseudo-labelling use the source classifier to label target data and iterate. The modern incarnation is test-time adaptation — adjust the model (often just its normalisation statistics) on the fly as target examples arrive, without retraining.
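CORAL's penalty is simple enough to state in a few lines. The arrays below are synthetic stand-ins for the activations a shared feature extractor would produce on source and target batches.

```python
import numpy as np

def coral_loss(source_feats, target_feats):
    """CORAL's alignment penalty: squared Frobenius distance between the
    feature covariance matrices of the two domains, scaled by 1/(4 d^2).
    Driving it to zero matches second-order statistics across domains."""
    d = source_feats.shape[1]
    cs = np.cov(source_feats, rowvar=False)
    ct = np.cov(target_feats, rowvar=False)
    return ((cs - ct) ** 2).sum() / (4 * d * d)

rng = np.random.default_rng(0)
src = rng.normal(size=(500, 8))
tgt_same = rng.normal(size=(500, 8))            # same distribution as source
tgt_shifted = rng.normal(size=(500, 8)) * 3.0   # covariate-shifted (rescaled)
# the shifted target incurs a much larger penalty than the matched one
```

In Deep CORAL this term is added to the source-task loss, so the feature extractor is pushed toward representations whose statistics agree across domains.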
Is domain adaptation still a distinct field? Large foundation models, pretrained on diverse web data, often solve what were once thought to be hard domain-adaptation problems by default — their source distribution is broad enough that the target shift is within its coverage. The subfield has not disappeared, but much of its motivation has shifted to specialised settings (medical imaging, industrial sensor data, low-resource languages) where foundation models still fall short.
The framing that domain adaptation provides is still useful even when its specific algorithms are not. Whenever transfer fails — when fine-tuning a pretrained model does not recover source-like performance on a target — the question "what is the distribution shift between source and target?" is the right diagnostic. The answers drive most practical adaptation decisions, from augmentation design to data-mixing strategies.
Fine-tuning needs data. Zero-shot and few-shot transfer can produce useful performance on a new task with no task-specific weight updates — and sometimes no examples at all.
Zero-shot transfer is the ability of a pretrained model to perform a task it has never been explicitly trained on, given only a natural-language description (or, in multimodal models, a set of candidate classes). The archetype is CLIP: present the model with an image and a set of class names phrased as "a photo of a cat," "a photo of a dog," etc.; compute the image–text similarity for each; pick the highest. No fine-tuning, no target labels, no task-specific training. The pretraining has done all the work; the task is fully specified at inference time.
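The inference-time recipe reduces to a cosine-similarity argmax. A toy sketch with stand-in embeddings (in real CLIP these would come from the image and text encoders):

```python
import numpy as np

def zero_shot_classify(image_emb, class_embs, class_names):
    """Pick the class whose text embedding is most similar to the image
    embedding: the CLIP zero-shot recipe in miniature."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    sims = txt @ img                      # cosine similarity per class
    return class_names[int(np.argmax(sims))]

# Toy embeddings standing in for encoder outputs
names = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]
text_embs = np.eye(3)                     # pretend each prompt maps to a basis vector
cat_image = np.array([0.9, 0.1, 0.05])    # an image embedding close to "cat"
print(zero_shot_classify(cat_image, text_embs, names))  # → a photo of a cat
```

Nothing here is trained at classification time; the class set is data, supplied per query, which is why new class names cost nothing.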
Few-shot transfer relaxes this to a handful of labelled examples. The approach ranges from genuine fine-tuning on tiny datasets (5 examples per class) to in-context learning (next section), where examples are supplied in the prompt without any weight updates. The GPT-3 paper (Brown et al. 2020) popularised this framing: a single large pretrained model, steered by natural-language instructions and a few examples, producing usable output on dozens of tasks.
Both modes depend on the same precondition: the pretraining distribution has to be broad enough that the target task is roughly covered. CLIP can do zero-shot image classification on new class names only because the ∼400 million captioned images it saw during pretraining included roughly the right semantic coverage. GPT-3 can do few-shot translation only because web text includes enough parallel or near-parallel material to implicitly teach the task. Transfer in these regimes is real capability retrieval, not extrapolation.
Why it works. Zero-shot and few-shot transfer are downstream consequences of pretraining at sufficient scale. The capability is not a separate technique; it is evidence that the pretrained model has absorbed the task distribution well enough that task specification (the prompt, the candidate classes) is sufficient to surface the right behaviour.
The quality of zero-shot performance has become a standard benchmark for pretrained models, because it measures the breadth of capability without the confound of fine-tuning. A model that fine-tunes well but zero-shots poorly suggests that its pretraining representations are narrow, even if they can be adapted with effort. A model that zero-shots well on diverse tasks demonstrates genuinely broad representation — the foundation-model thesis in operational form.
Put the examples in the prompt. The model adapts its behaviour without any weight updates. This is not a training technique — it is a capability that emerges from scale.
In-context learning (ICL) is the phenomenon, identified empirically in GPT-3, that a sufficiently large language model can perform a task it was never explicitly trained on simply by being shown examples of that task in its prompt. Write "English: dog → French: chien. English: cat → French: chat. English: bird → French:" and the model completes "oiseau." No gradient steps; no fine-tuning; no retraining. The mechanism by which this works is still partly open — induction heads (§11 of Attention) play some of the role — but the capability has been reproduced across every frontier language model.
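The mechanics are worth making concrete: the "adaptation" is pure string assembly. A tiny helper (hypothetical, stdlib only) that builds the translation prompt from the text above:

```python
def few_shot_prompt(examples, query):
    """Assemble an in-context-learning prompt: worked examples followed by
    the unanswered query. The model's continuation is the prediction."""
    lines = [f"English: {en} -> French: {fr}" for en, fr in examples]
    lines.append(f"English: {query} -> French:")
    return "\n".join(lines)

prompt = few_shot_prompt([("dog", "chien"), ("cat", "chat")], "bird")
print(prompt)
```

Sent to a large enough model, the expected completion is "oiseau"; the task is specified entirely by the prompt's format and examples.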
In-context learning changes what "adaptation" means. A single pretrained model serves dozens or hundreds of tasks, each specified at inference time by a different prompt. Prompts can contain instructions, examples, formatting templates, reasoning scaffolds, or all of the above. Prompt engineering — the discipline of writing prompts that reliably produce desired behaviour — has become a substantial practice with its own literature and tooling.
One particularly consequential technique: chain-of-thought prompting (Wei et al. 2022). Include explicit reasoning steps in the few-shot examples before each answer; the model's own responses acquire similar reasoning structure, and accuracy on arithmetic, symbolic, and commonsense-reasoning tasks improves substantially. (The zero-shot variant, appending "Let's think step by step" to the prompt, followed shortly after with Kojima et al. 2022.) This is a case where prompt structure directly influences what computation the model runs — intermediate tokens are scratch space for the model to think in.
The prompt as a programming interface. For a large enough pretrained model, the prompt is the primary user-facing API. Instructions, formatting, examples, tools, and context all get combined into a single text string that configures the model's behaviour for a single inference. The practice is empirical; the theory is fragmentary; the field moves faster than either.
ICL is also the reason a single foundation model can serve an enormously diverse product surface. What used to require fine-tuning a distinct model per task now requires a well-written prompt. The cost structure shifts again: the expensive part of deployment is the inference compute, not the training compute, and the cost per task collapses to near zero if the pretrained model can handle the task by prompt alone.
A raw pretrained language model is not a useful product — it completes text, it does not follow instructions. Instruction tuning bridges the gap.
A pure next-token-prediction model, trained on web text, inherits the behaviour of web text: it continues passages, stylistically imitates its context, and will happily produce plausible but off-topic completions when given a direct question. The model has the capabilities; what it lacks is the habit of using them the way a user would want. The gap between raw capability and reliable instruction-following is what instruction tuning fixes.
The technique is straightforward: collect a corpus of demonstrations — (instruction, ideal response) pairs — and continue pretraining on this corpus with the same next-token loss. The model, still a language model, learns to continue "Write a haiku about autumn:" with an actual haiku rather than with, say, a cluttered blog post about autumn. FLAN (Wei et al. 2022), T0 (Sanh et al. 2022), and Natural Instructions (Mishra et al. 2022) built early instruction-tuning corpora by reformatting existing NLP datasets into instruction–response pairs. The resulting models zero-shot to new tasks much better than their pre-instruction-tuning counterparts, because the instruction-following habit generalises.
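One common data-formatting convention (an assumption here, not something mandated by the papers above) is to train the next-token loss on the concatenated sequence while masking out the instruction tokens, so the model is only graded on the response:

```python
def build_sft_example(instruction_ids, response_ids, ignore_index=-100):
    """Format one (instruction, response) pair for supervised fine-tuning.
    The model sees the full sequence, but positions whose label is
    ignore_index contribute nothing to the loss, so only the response
    is supervised. Token ids here are placeholders; a real pipeline
    would produce them with a tokenizer."""
    input_ids = list(instruction_ids) + list(response_ids)
    labels = [ignore_index] * len(instruction_ids) + list(response_ids)
    return input_ids, labels

inp, labels = build_sft_example([101, 5, 6], [7, 8, 102])
print(inp)     # [101, 5, 6, 7, 8, 102]
print(labels)  # [-100, -100, -100, 7, 8, 102]
```

The value −100 is the conventional ignore index in PyTorch-style cross-entropy losses; some corpora (FLAN-style) train on the full sequence instead, and both choices appear in practice.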
InstructGPT (Ouyang et al. 2022) extended this with human-written demonstrations of good behaviour on open-ended prompts and combined instruction tuning with the preference-optimisation methods of §14. The resulting model was preferred by human raters over much larger base models at a variety of helpfulness benchmarks — the moment that crystallised "instruction tuning plus preference optimisation" as the standard recipe for turning a base model into an assistant.
The three layers. The current production stack for large language models is usually three layers: base pretraining (next-token on web-scale data), instruction tuning (supervised fine-tuning on demonstration data), and alignment (preference optimisation; §14). Each layer serves a distinct purpose, and each uses its own kind of data — ubiquitous, curated, and preference-labelled, respectively.
The boundary between instruction tuning and alignment has blurred. The term "supervised fine-tuning" (SFT) increasingly covers both, since the data format (instruction → preferred response) is the same. What distinguishes them is how the training signal is generated: SFT uses fixed demonstrations, alignment uses human or model comparisons over pairs of model outputs.
Even after instruction tuning, models make systematic mistakes that demonstrations alone cannot easily correct. Preference optimisation trains the model on pairwise comparisons instead — the dominant method for producing aligned assistants.
Demonstrations are a one-sided signal: they say "do this" but not "don't do that." Many of the failure modes of a naive language model — confabulation, verbosity, refusals to engage, subtle toxicity — are easier to describe as comparisons than as demonstrations. Given two possible responses, a human can reliably say which is better; constructing the "better" response from scratch is much harder.
Reinforcement Learning from Human Feedback (RLHF; Christiano et al. 2017; Ziegler et al. 2019; Ouyang et al. 2022) operationalises this insight in three stages. First, train a reward model on pairwise preference data: present humans with two candidate responses to the same prompt; ask them which is better; train a model to predict these preferences. Second, use reinforcement learning (PPO is the standard algorithm) to fine-tune the language model to maximise the reward model's score. Third, regularise the updates with a KL penalty against the instruction-tuned baseline so that the model does not drift too far from its supervised behaviour. The resulting model acquires preferences that demonstrations alone could not teach.
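The training signal for stage one is the Bradley–Terry pairwise loss: push the chosen response's scalar reward above the rejected one's. A scalar sketch (the rewards here are toy numbers, not model outputs):

```python
import math

def reward_pair_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss for reward-model training:
    -log sigmoid(r_chosen - r_rejected). Zero margin gives loss log 2;
    a correct, confident margin drives the loss toward zero."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(reward_pair_loss(2.0, 0.0))   # small: ranking already correct
print(reward_pair_loss(0.0, 2.0))   # large: ranking is wrong
```

Averaged over a dataset of human comparisons, minimising this loss fits a scalar reward whose differences predict the observed preferences.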
RLHF is unwieldy in practice — it requires maintaining four models at once (policy, reference, reward, value), its training is unstable, and the reward model itself is often the bottleneck. Direct Preference Optimisation (DPO; Rafailov et al. 2023) re-derived the same objective as a simple supervised loss on the preference data, collapsing the pipeline into a single fine-tuning step with no RL at all. DPO and its successors (IPO, KTO, ORPO, SimPO) have largely displaced PPO-based RLHF for new projects, though both remain in active use.
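The collapsed objective is compact enough to state directly. A minimal sketch of the per-pair DPO loss, with placeholder log-probabilities standing in for real policy and reference model outputs:

```python
import math

def dpo_loss(pi_logp_chosen, pi_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO objective (Rafailov et al. 2023): the implicit reward
    is beta times the log-probability ratio between the policy and the
    frozen reference model; the loss is -log sigmoid of the reward margin."""
    chosen_margin = pi_logp_chosen - ref_logp_chosen
    rejected_margin = pi_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# A policy that has raised the chosen response relative to the reference
better = dpo_loss(-10.0, -14.0, -12.0, -12.0)
# A policy identical to the reference: loss is exactly log 2
same = dpo_loss(-12.0, -12.0, -12.0, -12.0)
print(better < same)  # → True
```

The KL regularisation of RLHF survives implicitly in the reference log-probabilities and the beta coefficient; no reward model, value model, or RL loop is needed.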
Alignment is an empirical field. The techniques here are less than a decade old; the best practices change every few months; the open questions (how to scale oversight, whether preferences generalise beyond training distribution, how to align for behaviours humans cannot directly evaluate) are serious research problems. The standard RLHF-then-DPO recipe is the current state of the art, not a settled answer.
Preference optimisation is also not limited to post-hoc alignment of language models. The same pattern — "train a reward model on comparisons, optimise the policy to maximise it" — underlies preference-based reinforcement learning in robotics, game playing, and increasingly in image and video generation, where human comparisons of samples shape the output distribution more reliably than likelihood-based objectives.
Full fine-tuning updates every weight. Parameter-efficient methods update a tiny fraction — often less than 1% — and recover most of the quality at a small fraction of the cost.
Fine-tuning a 70-billion-parameter model on a task-specific dataset is expensive in every dimension: memory (you need to store gradients and optimiser state for every weight), storage (a fine-tuned checkpoint is as large as the base), and deployment (you now have two copies of a 140 GB model). Parameter-efficient fine-tuning (PEFT) methods train only a small number of new parameters while freezing the base model, reducing memory and storage to a fraction of full fine-tuning.
Adapters (Houlsby et al. 2019) insert small bottleneck MLPs between transformer layers and train only those. Prefix tuning (Li and Liang 2021) and prompt tuning (Lester et al. 2021) prepend a small number of learned vectors to the input and train only those. BitFit (Zaken et al. 2022) trains only the bias terms. All of these work; none became dominant — for different combinations of task type and base model, each has its strengths.
The method that broke through was LoRA (Low-Rank Adaptation; Hu et al. 2021). LoRA adds, to each weight matrix it targets, a low-rank update ΔW = BA where B ∈ ℝ^{d×r} and A ∈ ℝ^{r×k} with rank r ≪ d, k. Only A and B are trained; the base W is frozen. For a typical transformer, LoRA with rank 8 trains perhaps 0.1% of the parameters and matches full fine-tuning on most tasks. QLoRA (Dettmers et al. 2023) combined LoRA with 4-bit base-model quantisation, making it possible to fine-tune 65B-parameter models on a single 48 GB GPU — a practical breakthrough that unlocked open-source model customisation at scale.
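The shapes tell most of the story. A numpy sketch of one LoRA-adapted layer (dimensions and initialisation chosen for illustration; B starts at zero so the adapted model initially equals the base model):

```python
import numpy as np

d, k, r = 4096, 4096, 8                 # weight shape and LoRA rank
rng = np.random.default_rng(0)
W = rng.normal(size=(d, k))             # frozen pretrained weight
B = np.zeros((d, r))                    # trained; zero-init so ΔW starts at 0
A = rng.normal(size=(r, k)) * 0.01      # trained

def lora_forward(x):
    """Forward pass with the low-rank update: W x + B (A x).
    Only A and B receive gradients; W never changes."""
    return W @ x + B @ (A @ x)

# Trainable fraction for this one matrix: 2·r/d for square W
frac = (B.size + A.size) / W.size
print(f"{frac:.4%}")  # ~0.4% of this matrix; well under 1% across a model
```

At deployment, ΔW = BA can be merged into W for zero inference overhead, or kept separate so many adapters can share one frozen base.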
LoRA as infrastructure. LoRA adapters are now a standard deployment object. A single base model serves many fine-tuned behaviours by swapping LoRA adapters at inference time; large-scale serving systems (vLLM, TensorRT-LLM) support multi-LoRA inference where thousands of small adapters share one base model. The adapter has become the productised unit of fine-tuning.
The success of PEFT is a case of economic structure shaping research direction. The cost of full fine-tuning was making base-model ownership concentrate in a few organisations; PEFT methods restored the ability of smaller teams to customise without re-training, and in the process made it possible to have many specialised variants of a single base model coexist cheaply.
Fine-tuning a network on a new task can silently destroy its performance on the old one. The phenomenon is old, its solutions are partial, and it sets the terms for how we adapt foundation models.
McCloskey and Cohen (1989) gave the phenomenon its name: catastrophic forgetting, the empirical fact that a neural network fine-tuned on task B loses most of its performance on task A, even when A and B are closely related. The mechanism is straightforward — gradient updates on B modify weights that were tuned for A — but the consequences are inconvenient. Any fine-tuning run that matters has to reckon with what it destroys along the way.
The classical subfield of continual learning (or lifelong learning) tries to structure training so that new tasks can be learned without erasing old ones. Elastic Weight Consolidation (EWC; Kirkpatrick et al. 2017) adds a penalty proportional to how important each weight was for the old task, computed from the Fisher information. Replay methods periodically retrain on old data alongside new. Dynamically expanding networks add new parameters for new tasks. None of these methods solve the problem in generality; all of them help in specific settings.
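The EWC penalty itself is one line of arithmetic. A numpy sketch with toy weights and Fisher values (in practice the Fisher diagonal is estimated from gradients on the old task's data):

```python
import numpy as np

def ewc_penalty(weights, old_weights, fisher, lam=1.0):
    """EWC regulariser (Kirkpatrick et al. 2017). Total loss on the new
    task is: task_loss + (lam/2) * sum_i F_i * (w_i - w*_i)^2,
    so weights with high Fisher importance for the old task are
    expensive to move and weights with low importance are nearly free."""
    return 0.5 * lam * np.sum(fisher * (weights - old_weights) ** 2)

w_star = np.array([1.0, -2.0, 0.5])     # weights after task A
fisher = np.array([10.0, 0.1, 5.0])     # importance of each weight for A
w_new  = np.array([1.1, 0.0, 0.5])      # candidate weights during task B
# Moving the unimportant weight (index 1) by 2.0 costs less than
# moving the important weight (index 0) by 0.1
print(ewc_penalty(w_new, w_star, fisher))  # → 0.25
```

The penalty makes forgetting a soft constraint rather than an impossibility, which is why EWC helps in specific settings without solving the problem in general.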
The rise of foundation models has changed the shape of the problem without eliminating it. Full fine-tuning a large pretrained model on a narrow target task can degrade its general capabilities — a model fine-tuned for customer-service dialogue may become measurably worse at coding or mathematical reasoning. The fix, broadly, is to do as little damage as possible: parameter-efficient methods (§15) keep the base weights frozen; instruction tuning is done on broad, diverse demonstration mixtures rather than narrow ones; RLHF is followed by careful evaluation on capability benchmarks to detect regression.
LoRA as forgetting insurance. LoRA's accidentally-strong property is that the base model is never modified, which makes forgetting structurally impossible — throw away the adapter and you have exactly the original model back. This is part of why LoRA has become the default for application-level fine-tuning: not just because it is cheap, but because it is safe.
The deeper question — whether a neural network could, in principle, accumulate skills over its lifetime without forgetting — is an active research area. The current consensus is that the answer is "partially," and that for serious capability portfolios the cleanest solution is to keep the base model fixed and adapt locally through prompts, adapters, or retrieval rather than through gradient updates to the base weights.
Loss curves for pretrained models follow predictable power laws in data, parameters, and compute. Some capabilities appear only past specific scale thresholds — or appear to, depending on the metric.
Kaplan et al. (2020) measured how pretraining loss for autoregressive transformers scales with model size, dataset size, and training compute. The relationships turned out to be clean power laws over many orders of magnitude: loss decreases roughly as N^{−α} in parameters N, D^{−β} in dataset size D, and C^{−γ} in compute C. The exponents are small — typically around 0.05 to 0.3 — but the regularity is remarkable. Pretraining behaves more like a physics experiment than like traditional engineering.
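The parameter scaling law is simple enough to evaluate directly. A sketch using the form and constants reported in the Kaplan paper for the data- and compute-unconstrained regime (the exact constants depend on tokenisation and setup, so treat them as illustrative):

```python
def kaplan_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Kaplan-style parameter scaling law: L(N) = (N_c / N)^alpha.
    Constants are the paper's reported values for non-embedding params."""
    return (n_c / n_params) ** alpha

# A power law means every 10x in parameters buys the same
# multiplicative loss reduction, regardless of where you start
for n in (1e8, 1e9, 1e10):
    print(f"N={n:.0e}  L={kaplan_loss(n):.3f}")
```

The constant per-decade ratio is what makes small-scale runs predictive of large ones: fit the curve cheaply, then extrapolate.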
Hoffmann et al.'s Chinchilla paper (2022) refined this with the compute-optimal frontier: for a given compute budget, what combination of model size and training data produces the lowest loss? The answer reversed a tacit assumption — that bigger models were always better — and showed that most pre-2022 large models were substantially undertrained. A compute-optimal model trains on roughly 20 tokens per parameter; GPT-3 had used around 2. Chinchilla's prescription restructured training budgets across the industry.
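The 20-tokens-per-parameter rule, combined with the standard approximation C ≈ 6·N·D training FLOPs, gives a back-of-envelope compute-optimal split. This is the rule of thumb only; the paper fits its own scaling laws rather than using a fixed ratio:

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a compute budget into model size and token count using the
    Chinchilla rule of thumb. With C = 6*N*D and D = tokens_per_param*N:
    N = sqrt(C / (6 * tokens_per_param)), D = tokens_per_param * N."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla's own budget: roughly 6 * 70e9 * 1.4e12 FLOPs
n, d = chinchilla_optimal(5.88e23)
print(f"{n/1e9:.0f}B params, {d/1e9:.0f}B tokens")  # → 70B params, 1400B tokens
```

Plugging in GPT-3's compute budget gives a far smaller, longer-trained model than the 175B it actually used, which is the undertrained-models finding in miniature.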
Scaling laws do more than describe pretraining. They provide a predictive tool: before spending tens of millions of dollars on a training run, one can forecast its loss from runs 1/1000th the scale. The projection is reliable enough that major training decisions — model size, token budget, architectural choices — are routinely made by extrapolating scaling curves from small-scale experiments.
Emergent abilities. Wei et al. (2022) catalogued emergent abilities — tasks on which performance is near-chance at small scale and rises sharply past some threshold (multi-step arithmetic, instruction following, complex reasoning). The phenomenon is real in the sense that measured benchmark performance shows this pattern; Schaeffer et al. (2023) argued that the appearance of emergence is partly an artefact of discontinuous metrics (exact-match accuracy jumping from 0 to 1), and that smoother metrics often show smooth scaling instead. Both claims are probably right for different tasks.
Scaling laws are the reason the last decade's compute and data escalation has been economically rational. When improvements are predictable, investing at larger scale is a calculable bet rather than a gamble. The ceiling — if there is one — is a matter of ongoing empirical work, and the question "will scaling continue to pay off?" is perhaps the single most consequential open question in modern ML.
Transfer learning turned deep learning from an artisanal practice into a two-layer economy. The pretrained artefact is now the primary commodity; almost everything built on top is adaptation.
The first compounding effect is economic. Pretraining at frontier scale has become too expensive for most organisations — a single large training run costs tens to hundreds of millions of dollars, more than the entire R&D budget of most labs. Downstream adaptation is orders of magnitude cheaper, and the gap is growing. This is not a bug; it is the point. A foundation model is infrastructure, amortised across thousands of downstream uses, each of which pays a small fraction of the total cost. The result is a two-layer market: a handful of labs doing pretraining, thousands of teams doing adaptation. Much of the practical work of modern ML happens in the second layer.
The second compounding effect is methodological. Transfer learning turned representation learning from a task-specific problem into a general-purpose problem: learn representations once, on abundant data, in a way that is useful for everything. This reframing is what made the scaling thesis coherent. It is hard to justify a billion-dollar training run for one benchmark; it is obvious to justify one if the resulting checkpoint is infrastructure for the next five years of downstream work.
The third compounding effect is conceptual. Transfer learning established a vocabulary — pretraining, fine-tuning, representations, adaptation, few-shot, instruction tuning, alignment — that now structures how the entire field talks about deep learning systems. Earlier frames (supervised vs. unsupervised; classification vs. regression; train vs. test) have not disappeared, but they have been embedded inside this larger two-stage picture, where the expensive stage happens once and the cheap stage happens everywhere.
The thesis in one sentence. Representations are expensive to learn and cheap to reuse, and that asymmetry — paid once upfront, amortised over everything downstream — is the economic engine that made modern deep learning both possible and irreversible.
The next chapter on Generative Models — VAEs, GANs, normalising flows, diffusion — picks up the thread from the generative side. Many of those models are themselves foundation-model-shaped: pretrain on broad data, adapt to specific generative tasks. Transfer learning is the underlying logic; generative pretraining is one of the specific forms it takes, and the one that has moved the frontier furthest in the last five years.
Transfer learning has two distinct literatures. The first is the classical domain-adaptation subfield — covariate shift, importance weighting, domain-adversarial training — that predates deep learning and remains active for specialised distribution-shift problems. The second is the modern pretraining literature — BERT, GPT, CLIP, MAE, LoRA, RLHF — which is less than a decade old and is the organising theory behind every frontier ML system. The reading list below tracks both, with weight on the second, and includes the software that makes these ideas operational.
The standard textbook on the classical side of the field — formalisms for domain adaptation, task transfer, multi-task learning, transfer metric learning, and heterogeneous transfer. Covers the pre-foundation-model literature with unusual care; complementary to the modern pretraining-focused texts below.
The chapters on pretrained language models, transfer learning in NLP, and large language models are the clearest textbook treatment of the modern paradigm. Regularly updated; the draft tends to incorporate recent work faster than most textbooks.
The Hugging Face team's practitioner guide. Walks through pretraining, fine-tuning, and adapting transformer models with working code. The best single source for going from "I understand the concept" to "I have a fine-tuned model running."
Chapters 12 and 20 cover transformer pretraining and self-supervised learning respectively, with careful derivations and clear figures. The exercises give a practical feel for the trade-offs between pretraining objectives.
The Bishop successor text. The chapters on transfer learning, self-supervised learning, and large-scale pretraining give a unified theoretical treatment that spans the classical and modern traditions.
The 200-page survey that gave foundation models their name and laid out the landscape — capabilities, applications, limitations, societal implications. Uneven in places but indispensable as a snapshot of how the field saw itself at the moment the paradigm consolidated.
The paper that showed ImageNet-trained AlexNet activations served as general-purpose visual features, outperforming hand-engineered descriptors on a long list of downstream benchmarks. The empirical demonstration that launched the ImageNet-pretraining era in vision.
The systematic companion study to DeCAF. Measured how feature transferability varies with layer depth and task similarity, establishing the now-standard intuition that early layers transfer broadly and late layers transfer narrowly.
The paper that put transfer learning decisively on the NLP map before BERT. Introduced discriminative fine-tuning, slanted triangular learning rates, and gradual unfreezing — practical techniques that are still part of every serious fine-tuning playbook.
The first widely-used contextual word representation — bidirectional LSTM language models pretrained on a billion tokens and used as feature extractors for downstream NLP tasks. The bridge between static word embeddings and the transformer-based pretraining paradigm.
The paper that made pretrain-then-finetune the universal NLP paradigm. Masked language modelling plus next-sentence prediction on a bidirectional transformer produced representations that decisively set state of the art on essentially every NLP benchmark and catalysed the foundation-model era.
The paper that crystallised the decoder-only autoregressive pretraining recipe. Scaled-up next-token prediction produced a model that could zero-shot and few-shot across tasks without task-specific training — the prototype of the foundation-model pattern.
The paper that made in-context learning the central phenomenon of large-model research. 175 billion parameters, broad few-shot capabilities, and the observation that pretrained models could be steered by natural-language prompts without any weight updates. Reshaped the entire adaptation literature.
The most systematic empirical study of pretraining design choices ever conducted — objectives, architectures, dataset sizes, corruption strategies, fine-tuning variations. Reframed NLP as a text-to-text task and became a long-running reference for adaptive-pretraining decisions.
The paper that made contrastive self-supervised vision pretraining competitive with supervised pretraining, and did so with a strikingly clean recipe — heavy augmentation, NT-Xent loss, large batch sizes. The benchmark that the next several years of self-supervised vision chased.
The paper that changed how the field thought about vision pretraining. Contrastive alignment between a 400M-pair image–text dataset produced a joint embedding space that supports zero-shot classification on essentially any visual category. The foundation of multimodal learning in the 2020s.
The paper that brought BERT-style masked reconstruction to vision with decisive results. 75% masking of image patches, an asymmetric encoder–decoder, and a pixel-reconstruction loss produced the strongest self-supervised ViT representations of the early 2020s.
The paper that established that language-model pretraining loss follows clean power laws in parameters, data, and compute — over many orders of magnitude. The empirical foundation on which the scaling thesis of the decade rests.
The paper that corrected the scaling-law prescription. For a given compute budget, the compute-optimal model is smaller and trained on more data than the Kaplan paper suggested — a finding that reshaped training budgets across the field.
The paper that introduced preference-based training — train a reward model on human pairwise comparisons, then optimise a policy against it. Originally a reinforcement-learning paper; later became the template for RLHF in language models.
The paper that consolidated the supervised-fine-tuning + RLHF recipe for turning base language models into useful assistants. Methodologically precise, empirically decisive, and the blueprint for every subsequent aligned-assistant model.
The paper that made parameter-efficient fine-tuning mainstream. Adds small low-rank matrices to weight matrices, trains only those, and matches full fine-tuning at a fraction of the cost. The technique that keeps the cost of model customisation manageable at frontier scale.
Elastic weight consolidation: a Fisher-information-weighted penalty that keeps weights important for old tasks near their previously learned values while new tasks are trained. The most influential classical approach to catastrophic forgetting and a useful baseline for all continual-learning work.
The domain-adaptation paper that demonstrated adversarial alignment of feature distributions. A gradient-reversal layer turns a source-vs-target discriminator into a domain-invariance objective on the feature extractor. The method that made deep domain adaptation mainstream.
The contrastive method that used a momentum-updated encoder and a queue of negative samples to decouple contrastive learning from large-batch training. Influential both for its results and for the design pattern, which reappears in later self-supervised methods.
The paper that showed contrastive pretraining could drop the negative-pair term — a result that was surprising at the time and still not fully understood. The starting point of the self-distillation family of self-supervised methods.
Self-distillation applied to Vision Transformers, producing features with remarkable emergent properties — object boundaries appearing in attention maps, unsupervised semantic segmentation, strong k-NN classification. A landmark in self-supervised vision.
The paper that introduced instruction tuning as a distinct stage — fine-tune a pretrained language model on a broad collection of tasks reformatted as instructions, and its zero-shot performance on held-out tasks jumps substantially. The template that every subsequent instruction-tuned model follows.
The BigScience counterpart to FLAN, built from a much more diverse set of prompt templates across tasks. Demonstrated that prompt variety matters for zero-shot generalisation and that instruction-tuning benefits can be had at much smaller scales than GPT-3.
The paper that identified chain-of-thought as a major in-context technique — include reasoning steps in prompt examples, and accuracy on arithmetic and reasoning tasks rises sharply. Launched the prompting-as-reasoning-scaffold literature.
The paper that collapsed the RLHF pipeline into a single supervised loss. Derived that maximising the implicit reward under a KL-regularised policy reduces to a simple classification objective on preference pairs. Now the dominant alternative to PPO-based RLHF.
Combined 4-bit base-model quantisation with LoRA to make fine-tuning of 65B-parameter models possible on a single 48 GB GPU. A practical breakthrough that unlocked open-source LLM customisation at scale and made high-quality fine-tuning accessible to individual researchers.
The paper that catalogued tasks where performance transitions sharply from chance to useful past some scale threshold — with accompanying debate about whether emergence is real or a metric artefact (see Schaeffer et al. 2023 below). Together the two papers define the modern discussion of scale-dependent capability.
The paper that argued emergence is often an artefact of discontinuous metrics — switching from exact-match to continuous metrics makes many "emergent" curves smooth. Did not end the debate but reframed it: real task-capability transitions exist; the specific metrology matters enormously.
The original adapter paper. Introduces small trainable bottleneck modules inserted between transformer layers, keeping the backbone frozen. The conceptual ancestor of LoRA and every subsequent parameter-efficient method.
Prefix tuning: prepend a small number of learned vectors to the input of every transformer layer and train only those. Competitive with full fine-tuning on generation tasks and a significant influence on later soft-prompt and prefix-based methods.
The most systematic empirical study of supervised-pretraining transfer: 12 ImageNet-pretrained models, 12 downstream tasks, careful controls. The answer — "yes, with caveats" — quieted much of the debate and set the empirical baseline against which self-supervised methods were later measured.
The central infrastructure for the pretrained-model ecosystem. Hundreds of thousands of checkpoints hosted and versioned; unified APIs for loading, fine-tuning, and inference; the de facto distribution channel for open-weight models. The reason the two-layer economy is accessible to non-specialists.
The standard library for parameter-efficient fine-tuning. Implementations of LoRA, QLoRA, prefix tuning, prompt tuning, IA³, and adapters behind a unified API, with native integration into the Transformers stack. The tool that most fine-tuning projects actually use.
The reference implementation for preference-optimisation methods — SFT, PPO-based RLHF, DPO, ORPO, KTO, and their variants. The library that most open-source post-training work builds on.
PyTorch-native fine-tuning library with a focus on simplicity, auditability, and full-precision performance. Recipes for full fine-tuning, LoRA, QLoRA, and DPO, with explicit configuration rather than abstracted trainers.
The main open-source stack for self-supervised vision pretraining — reference implementations of SimCLR, MoCo, BYOL, DINO, MAE, BEiT, and their variants, with unified benchmarking. The place to go for self-supervised vision baselines.
The open-source reimplementation of CLIP and its successors — trained on LAION's public image–text datasets and now the foundation of most open multimodal work. Includes scaled variants (ViT-L, ViT-H, ViT-G) that match or exceed the original closed-source CLIP on zero-shot benchmarks.