Meta-Learning & Few-Shot Learning: learning how to learn from few examples.

Standard supervised learning starts each task from scratch with millions of labelled examples. Many problems do not have that luxury — a new bird species with five photographs, a rare disease with twenty patients, a new robot trying its first manipulation task. Meta-learning is the family of methods that learns, from a population of related tasks, how to adapt rapidly to new ones, and few-shot learning is the deployment regime that the resulting models target. This chapter develops the meta-learning framework, the major model families (MAML and gradient-based meta-learning, prototypical networks and metric-based methods, memory-augmented and recurrent meta-learners), the modern reframing of in-context learning as implicit meta-learning, and the deployment patterns where rapid adaptation matters more than raw asymptotic accuracy.

Prerequisites & orientation

This chapter assumes neural-network fundamentals (Part V Ch 01–02), basic optimisation (Part I Ch 03), and familiarity with attention mechanisms (Part V Ch 06) and transformers (Part VI Ch 02) for the in-context-learning sections. The transfer-learning material of Part V Ch 07 is the natural prerequisite — meta-learning generalises transfer learning from "pretrain on one big task, fine-tune on another" to "pretrain on a distribution of tasks, adapt to any new one." No prior exposure to meta-learning is assumed.

Two threads run through the chapter. The first is the task distribution framing: meta-learning is supervised learning where each "datapoint" is a task — a small support set plus a query set — and the goal is to learn a procedure that produces good predictions on the query given the support. Once you internalise this framing, MAML, ProtoNets, and in-context learning are all instances of the same problem with different inductive biases. The second thread is the adaptation cost trade-off: gradient-based methods (MAML and friends) adapt by fine-tuning at test time, which costs gradient steps but is principled and bounds error explicitly; metric-based methods (ProtoNets) adapt by computing embeddings, which is fast and amortised but limited in expressiveness; in-context methods adapt by conditioning on examples in a forward pass, which is fastest of all but offloads the adaptation work to a much larger pretrained model. The chapter is organised so the methods appear in roughly this complexity order.

01

Why Meta-Learning Exists

Modern deep learning is profoundly data-hungry: ImageNet has a million examples, GPT-4 was trained on something like 10 trillion tokens, and a single new image-classification class commonly needs hundreds of labels to be learned reliably. Many problems we want to solve do not provide that many examples. Meta-learning is the response — train a system on a population of related tasks so that, when it sees a new task with only a handful of examples, it can adapt quickly. The chapter exists because the standard "more data, more compute" recipe genuinely fails in the few-shot regime, and meta-learning is the principled alternative.

The few-shot regime and why standard ML breaks

The defining setup of few-shot learning: at test time you are given a small support set of labelled examples (say, 5 examples per class for 5 new classes) and a query set of unlabelled examples to classify. Standard supervised learning on the support set alone is hopeless — 25 examples is not enough to fit a deep network from scratch, and naïve fine-tuning of a pretrained model overfits dramatically at this scale. The technical problem is that standard optimisation tries to find a good weight setting for this task in isolation, when the right thing to do is to find the weight setting that lets you adapt rapidly given the structure of this family of related tasks.

The framing matters. Few-shot learning is not "learning from few examples" in some absolute sense — that is impossible without prior knowledge — it is "transferring prior knowledge from many related tasks to a new task with few examples." Meta-learning makes the prior-knowledge step explicit and learnable.

Where the data scarcity comes from

Real applications run into few-shot regimes for several distinct reasons. Rare categories: in fine-grained classification (rare bird species, exotic plant diseases), the long tail of categories simply has few labelled examples. New categories: every product launch, every new species discovery, every new fraud pattern is a "task with no training data yet." Personalisation: each user's behaviour is its own task, with however few interactions they have produced. Domain shift: the model trained on one hospital's scanner has to adapt to a new hospital with limited labelled scans. Cost of labelling: medical, scientific, and legal labelling is expensive enough that even nominally large datasets are small relative to deep-learning needs. Meta-learning addresses all of these with the same machinery.

Why transfer learning is not enough

Transfer learning — pretrain on a big task, fine-tune on a small one — is an obvious first answer, and it often works. The reason meta-learning is needed beyond transfer learning is that fine-tuning is not always reliable on very small support sets, and the pretraining objective has nothing to do with adapting quickly. A model pretrained for "do well on ImageNet" is not the same as a model pretrained for "be the kind of model that, given five examples of a new class, will classify correctly." Meta-learning bakes the second objective directly into the training loss, producing models that are explicitly adaptable rather than merely starting from a useful initial point.

The empirical evidence has gone back and forth on this. Early meta-learning papers (2017–2018) showed dramatic gains over transfer learning. Subsequent careful comparisons (Tian et al. 2020, others) showed that strong transfer-learning baselines often catch up. The modern view is that meta-learning helps when (a) the task distribution is narrow and well-defined, (b) the support set is very small (1–5 examples), and (c) adaptation latency matters; transfer learning suffices when the new task has even tens of examples and there is no urgency about adaptation speed.

Why This Matters in 2026

The rise of in-context learning in large language models — where GPT-style models adapt to new tasks at inference time given a handful of examples in the prompt — is, mechanically, a form of meta-learning. Section 8 develops this connection. The implication is that the meta-learning framework is now central to how the largest models are deployed, even though the formal "meta-learning" label is no longer always attached. Understanding the framework is essential context for anyone working on prompt engineering, foundation-model adaptation, or rapid task generalisation.

02

The Task Distribution and Episodic Training

Meta-learning starts from a simple but powerful reframing: a "datapoint" is not a single labelled example, it is a whole task — a small set of training examples plus a set of query examples to predict on. The model is trained on many such tasks so that, at test time, it can handle a new task drawn from the same distribution. Getting the formalism right is the foundation for every method in the chapter.

Tasks and the task distribution

Formally, meta-learning assumes a task distribution p(𝒯). Each task 𝒯i consists of: a support set Si = {(xs, ys)} of labelled examples, and a query set Qi = {(xq, yq)} of additional examples used to score the model's adaptation. The meta-objective is to learn a model that, given a new task's support set, produces good predictions on the query set:

Meta-learning objective
minθ 𝔼𝒯 ∼ p(𝒯) [ ℒQ( fθ( S, xq ), yq ) ]
fθ(S, x) denotes the model's prediction on input x given support set S, parameterised by meta-parameters θ. The meta-loss is averaged over tasks drawn from p(𝒯), with the inner expectation over the query examples within each task. The meta-parameters θ are exactly what gets optimised; how θ is used to produce a prediction varies by method (gradient descent on copies of θ, attention over support embeddings, etc.).

The N-way K-shot benchmark

The standard evaluation protocol is N-way K-shot classification. At test time, sample N classes from a pool unseen during meta-training; sample K examples per class as the support set; sample some additional examples per class as the query set. Score the model's classification accuracy on the query. Common settings are 5-way 1-shot (5 classes, 1 example each — 5 support examples total) and 5-way 5-shot. Higher N and lower K make the problem harder; the regime where meta-learning dominates over transfer learning is roughly K ≤ 5.
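The sampling protocol above can be sketched in a few lines. A minimal numpy version, in which `dataset` is assumed to be a dict mapping class labels to example arrays (an illustrative format, not a library API); sampled classes are relabelled 0..N−1 within the episode, mirroring the fact that meta-test classes are novel:

```python
import numpy as np

def sample_episode(dataset, n_way=5, k_shot=1, n_query=15, rng=None):
    """Sample one N-way K-shot episode from a class -> examples mapping."""
    if rng is None:
        rng = np.random.default_rng()
    keys = list(dataset.keys())
    classes = [keys[i] for i in rng.choice(len(keys), size=n_way, replace=False)]
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        order = rng.permutation(len(dataset[cls]))
        for i in order[:k_shot]:                       # K support examples
            support.append((dataset[cls][i], episode_label))
        for i in order[k_shot:k_shot + n_query]:       # disjoint query examples
            query.append((dataset[cls][i], episode_label))
    return support, query

# Toy dataset: 20 classes with 30 two-dimensional examples each.
rng = np.random.default_rng(0)
data = {c: rng.normal(size=(30, 2)) for c in range(20)}
S, Q = sample_episode(data, n_way=5, k_shot=1, n_query=15, rng=rng)
print(len(S), len(Q))  # 5 75
```

The same sampler, pointed at a disjoint class pool, produces meta-test episodes; only the class universe changes between the two phases.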

[Figure: one episode of meta-training.]
One episode of meta-training. The support set S feeds the meta-learner, which performs an inner-loop adaptation to produce a task-specific model fθ'. That adapted model predicts on the query set Q, and the resulting query loss flows back as a meta-gradient ∇θℒQ that updates the meta-parameters θ — not θ'. The episode is one "datapoint" in the outer optimisation; meta-training runs thousands of such episodes drawn from the meta-train class pool, and meta-test repeats the same forward pass on a disjoint set of classes.

The crucial discipline is the meta-train / meta-test split: the classes used at meta-training time are disjoint from the classes used at meta-test time. The model never sees the test classes during training, and its meta-test performance measures how well it generalises to genuinely novel classes. Without this discipline you are measuring something else entirely (the model's ability to reuse classes it has memorised), and a depressing fraction of pre-2020 meta-learning papers got this subtly wrong.

Episodic training

The standard meta-training procedure is episodic training: at each meta-training step, sample a task from p(𝒯), run whatever adaptation procedure the method uses on the support set, score on the query set, and backpropagate through the whole thing into the meta-parameters. Each "episode" is one complete N-way K-shot task. Tens of thousands of episodes are typical for meta-training. The result is a model whose meta-parameters are explicitly tuned to perform well on the average sampled task.

Episodic training mirrors the test-time setup, which is the crucial discipline: the meta-train objective directly measures what we care about at meta-test. This is what distinguishes meta-learning from a standard pretrain-then-fine-tune workflow, where the pretraining objective is task-agnostic and the fine-tuning at test is a separate procedure not optimised against. Section 9 returns to this in the context of evaluation pitfalls.

Standard benchmarks

Three datasets dominate meta-learning evaluation. Omniglot (Lake et al. 2015) is 50 alphabets of handwritten characters, often called "the transpose of MNIST" — 1623 classes with 20 examples each; the canonical small benchmark for few-shot character recognition. miniImageNet (Vinyals et al. 2016) is a 100-class subset of ImageNet split into 64 train / 16 validation / 20 test classes; the canonical mid-scale benchmark. Meta-Dataset (Triantafillou et al. 2020) is a more diverse 10-source benchmark designed to measure cross-domain generalisation; the modern standard for serious meta-learning evaluation. Section 9 covers these and their common pitfalls in detail.

03

Metric-Based Meta-Learning

The simplest and most empirically successful family of meta-learning methods is metric-based. The idea: learn an embedding network so that examples from the same class cluster together, and classify new examples by their distance to the support-set embeddings. No gradient updates at test time, no task-specific optimisation — just compute embeddings and compare. Prototypical Networks are the canonical instance, and the family remains the strongest baseline on standard few-shot benchmarks.

Siamese and triplet origins

The conceptual ancestor is the Siamese network (Bromley et al. 1993, popularised by Koch et al. 2015 for one-shot recognition): train a network to embed pairs of inputs so that same-class pairs are close and different-class pairs are far. Classify a new example by finding the support example with the most similar embedding. The training loss is contrastive — minimise distance for same-class pairs, maximise for different-class. The triplet loss (anchor, positive, negative) is a refinement that operates on triples rather than pairs. Both train embedding networks that work as similarity engines at test time.
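The contrastive and triplet objectives are simple enough to state directly. A minimal numpy sketch of the hinge-style triplet loss (`triplet_loss` is an illustrative helper, not a library function):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss on embedding vectors: pull the anchor toward the
    positive (same class) and away from the negative (different class)
    until the distance gap exceeds `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # same class: close
n = np.array([2.0, 0.0])   # different class: far
print(triplet_loss(a, p, n))  # 0.0, the margin is already satisfied
```

In training, the loss is averaged over many sampled triplets and backpropagated into the embedding network; at test time only the embedding-and-compare step remains.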

Matching Networks

Matching Networks (Vinyals et al. 2016) extend the Siamese idea with two refinements. First, they use attention over the entire support set rather than nearest-neighbour to a single example — the predicted label is a weighted average of support labels, with weights given by a soft-attention over support embeddings. Second, they propose full-context embeddings, where each support example's embedding depends on the rest of the support set (via an LSTM or attention module). The result is a meta-learning system that is fully differentiable end-to-end and uses the entire support set rather than only the closest example.

Prototypical Networks

The dominant member of the family is Prototypical Networks (Snell et al. 2017). The recipe is unusually simple. Train an embedding network fθ. For each support example, compute its embedding. For each class, compute the prototype — the mean of its support embeddings:

Prototypical Networks (Snell et al. 2017)
ck = (1/|Sk|) Σ(x, y) ∈ Sk fθ(x)
p(y = k | x*) = exp( −d(fθ(x*), ck) ) / Σk' exp( −d(fθ(x*), ck') )
ck is the prototype for class k. The classification of a query x* is a softmax over negative distances to all prototypes — typically squared Euclidean distance, sometimes cosine. Training maximises the log-likelihood of the correct class on each query example, averaged over many sampled tasks.
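The two equations translate almost line for line into code. A minimal numpy sketch, assuming the support and query embeddings have already been computed by fθ and using squared Euclidean distance:

```python
import numpy as np

def prototypes(support_emb, support_labels, n_way):
    """Class prototypes: the mean of each class's support embeddings."""
    return np.stack([support_emb[support_labels == k].mean(axis=0)
                     for k in range(n_way)])

def proto_probs(query_emb, protos):
    """Softmax over negative squared Euclidean distances to the prototypes."""
    d2 = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

# 2-way 2-shot toy episode; the "embedding network" is the identity here.
S = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
y = np.array([0, 0, 1, 1])
protos = prototypes(S, y, n_way=2)             # [[0, 0.5], [5, 5.5]]
probs = proto_probs(np.array([[0.2, 0.4]]), protos)
print(probs.argmax())  # 0: the query sits near class 0's prototype
```

Episodic training wraps this in a loop: sample a task, embed support and query, compute the query log-likelihood from `proto_probs`, and backpropagate into the embedding network.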

The simplicity is deceptive. Prototypical Networks beat or match more elaborate methods on Omniglot and miniImageNet, work for arbitrary N and K without retraining (the same trained model handles 5-way 1-shot, 20-way 5-shot, etc.), and have linear inference cost in the support-set size. The 2020s production deployments of metric-based meta-learning are almost all ProtoNet-flavoured.

Relation Networks and learned distances

A natural extension is to learn the distance function rather than fixing Euclidean. Relation Networks (Sung et al. 2018) replace the Euclidean distance with a small learned MLP that takes two embeddings and outputs a similarity score. The whole system — embedding network plus relation network — is trained end-to-end. Relation Networks tend to outperform vanilla ProtoNet on harder benchmarks where Euclidean similarity is a poor surrogate, at the cost of some interpretability.

Why metric methods win in practice

Three reasons explain the empirical dominance of metric-based meta-learning. First, no test-time optimisation is needed — predictions are a single forward pass, which matters in production. Second, there are fewer hyperparameters than gradient-based methods (no inner-loop step size, no inner-loop step count). Third, the inductive bias is appropriate: most few-shot problems really do reduce to "find the support example most similar to this query," and metric methods bake that bias in directly. The cases where gradient-based methods (Section 4) outperform metric methods are typically those where the per-task adaptation involves more than re-classifying — for instance, regression with novel input ranges, or RL with novel dynamics.

04

MAML and Gradient-Based Meta-Learning

Where metric-based methods avoid test-time optimisation, gradient-based methods embrace it. Model-Agnostic Meta-Learning (MAML), introduced by Finn, Abbeel, and Levine in 2017, is the canonical instance: learn a weight initialisation such that, after a few gradient steps on the support set, the resulting model performs well on the query. The method is conceptually elegant, model-agnostic (works for classification, regression, or RL), and the most-studied paradigm in meta-learning.

The inner-outer loop structure

MAML has a nested optimisation structure that is the conceptual heart of gradient-based meta-learning. The inner loop performs task-specific adaptation: starting from meta-parameters θ, take K gradient steps on the support-set loss to produce task-specific parameters θ'. The outer loop updates the meta-parameters θ so that the inner-loop result performs well on the query set:

MAML (Finn et al. 2017)
θ'i = θ − α ∇θ ℒSi(θ)
θ ← θ − β ∇θ Σi ℒQi(θ'i)
α is the inner-loop step size and β the outer-loop step size; the inner loop typically takes a single gradient step, sometimes 5–10. The outer-loop gradient is taken with respect to the original θ, but it flows through the inner-loop update — meaning the gradient computation involves second-order derivatives. This is what makes MAML computationally heavy and is the primary engineering difficulty in scaling it.
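For a scalar linear model the derivative through the inner step can be written out by hand, which makes the nested structure concrete. A minimal numpy sketch of exact second-order MAML on tasks y = w·x (all function names illustrative); the `dtheta_prime` factor is precisely the second-order term:

```python
import numpy as np

def loss_grad(theta, x, y):
    """Squared-error loss and gradient for the scalar model f(x) = theta * x."""
    err = theta * x - y
    return np.mean(err ** 2), 2 * np.mean(x * err)

def maml_meta_grad(theta, xs, ys, xq, yq, alpha):
    """Exact (second-order) MAML meta-gradient for the scalar model.

    Inner step:  theta' = theta - alpha * g_S(theta)
    Outer grad:  dL_Q(theta')/dtheta = g_Q(theta') * dtheta'/dtheta,
    where dtheta'/dtheta = 1 - 2*alpha*mean(xs**2) is the second-order
    term that first-order approximations discard.
    """
    _, g_s = loss_grad(theta, xs, ys)
    theta_prime = theta - alpha * g_s
    _, g_q = loss_grad(theta_prime, xq, yq)
    dtheta_prime = 1.0 - 2.0 * alpha * np.mean(xs ** 2)
    return g_q * dtheta_prime, theta_prime

# Meta-training over tasks y = w * x with task-specific slope w.
rng = np.random.default_rng(0)
theta, alpha, beta = 0.0, 0.1, 0.05
for _ in range(500):
    w = rng.uniform(0.5, 2.5)                  # sample a task
    xs, xq = rng.normal(size=5), rng.normal(size=10)
    meta_g, _ = maml_meta_grad(theta, xs, w * xs, xq, w * xq, alpha)
    theta -= beta * meta_g                     # outer-loop update on theta
```

Setting `dtheta_prime = 1.0` in the sketch recovers the first-order approximation discussed next; everything else is unchanged.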

The conceptual interpretation: MAML doesn't try to find weights that perform well on every task, it tries to find weights that are one gradient step away from performing well on every task. The weights serve as a meta-initialisation tuned for rapid adaptation rather than for direct prediction.

The second-order gradient problem

The exact MAML update requires computing gradients through the inner-loop gradient — a second-order derivative. For a network with D parameters and K inner steps, the cost is roughly K × D² in the worst case, prohibitive for large networks. Two common simplifications dominate practice. First-order MAML (FOMAML) drops the second-order terms, treating the inner-loop update as if it were a fixed function of θ; the gradient becomes much cheaper at the cost of theoretical correctness. The empirical loss is small — FOMAML matches MAML on most benchmarks at a fraction of the compute. Reptile (Nichol et al. 2018) is even simpler: do K gradient steps on the support set, then move θ a fraction of the way toward the resulting θ'. No second-order anything; despite the simplification, Reptile competes with MAML on standard benchmarks.
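Reptile is simple enough to state in full. A toy numpy sketch on the scalar model y = θ·x, with tasks differing only in slope (illustrative code, not the reference implementation):

```python
import numpy as np

def reptile_step(theta, xs, ys, inner_steps=5, alpha=0.02, eps=0.5):
    """One Reptile meta-update: run a few SGD steps on the support set,
    then move theta a fraction eps of the way toward the adapted
    parameters. No second-order terms anywhere."""
    theta_adapted = theta
    for _ in range(inner_steps):
        grad = 2 * np.mean(xs * (theta_adapted * xs - ys))
        theta_adapted -= alpha * grad
    return theta + eps * (theta_adapted - theta)

rng = np.random.default_rng(1)
theta = 0.0
for _ in range(200):
    w = rng.uniform(1.0, 3.0)          # task: y = w * x, slope in [1, 3]
    xs = rng.normal(size=10)
    theta = reptile_step(theta, xs, w * xs)
print(theta)
```

After meta-training, θ settles near the middle of the slope range: the initialisation from which a few inner steps reach any sampled task fastest.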

Per-parameter inner-loop step sizes

A useful refinement is to make the inner-loop step size α a learnable per-parameter (or per-layer) quantity. Meta-SGD (Li et al. 2017) does exactly this: α becomes a vector with the same shape as θ, learned alongside θ in the outer loop. ALFA (Baik et al. 2020) generalises further with adaptive layer-wise hyperparameters. The empirical pattern: per-parameter step sizes consistently improve over the single-scalar α at modest extra cost and are now standard in production MAML implementations.
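The Meta-SGD inner update differs from MAML's only in that α is a vector applied elementwise. An illustrative numpy sketch on a linear-regression model (the outer loop that learns α is omitted):

```python
import numpy as np

def meta_sgd_inner(theta, alpha, xs, ys):
    """Meta-SGD inner step: alpha has the same shape as theta, so each
    parameter gets its own learned step size. Illustrative model:
    linear regression y = xs @ theta."""
    grad = 2 * xs.T @ (xs @ theta - ys) / len(ys)
    return theta - alpha * grad            # elementwise per-parameter step

theta = np.zeros(3)
alpha = np.array([0.10, 0.01, 0.50])       # meta-learned in the outer loop
xs = np.eye(3)                             # 3 support examples
ys = np.array([1.0, 1.0, 1.0])
print(meta_sgd_inner(theta, alpha, xs, ys))  # each coordinate moves 2*alpha/3
```

The outer loop treats α exactly like θ: both receive the meta-gradient of the query loss.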

When MAML wins and when it loses

MAML's strength is its model-agnosticity — the same machinery applies to classification, regression, or RL with no architectural changes. This made it dominant in meta-RL and in problems where the per-task adaptation involves more than recomputing class boundaries. On standard few-shot classification benchmarks, MAML and its descendants are usually competitive with but not dominant over Prototypical Networks; the choice between them is application-specific. Where MAML genuinely shines is in continuous adaptation scenarios — robotics tasks where each task has different dynamics, recommendation systems where each user is a slightly different task — where the gradient-based adaptation captures structural changes that metric-based methods cannot.

05

Learned Optimisers and Hypernetworks

MAML treats the inner-loop optimiser as fixed (SGD) and learns only the initialisation. A more ambitious branch of meta-learning learns the optimiser itself — a recurrent network that takes the current parameters and gradient and outputs the next parameter update. A related idea: learn a hypernetwork that, given the support set, outputs the task-specific weights directly without any iterative inner loop. Both approaches push the boundary of what counts as "adaptation" and have produced some of the most expressive meta-learners in the literature.

Learning to learn by gradient descent by gradient descent

Andrychowicz et al. (2016) introduced the original learned optimiser: an LSTM that takes the current gradient and previous hidden state and outputs the next parameter update. Train the LSTM by unrolling its updates over many steps and backpropagating through the unrolled trajectory — meta-learning by direct optimisation of the optimiser's behaviour. The result was a learned optimiser that outperformed Adam on the task families it was trained on (small MLPs, small CNNs), demonstrating that the standard optimisers are not optimal but merely well-engineered defaults.

The 2010s wave of learned optimisers struggled to generalise — an LSTM trained to optimise small MNIST classifiers failed to optimise large ResNets — and the approach lost popularity for a while. The 2020s revival, anchored by the VeLO project (Metz et al. 2022) and successors, used much larger optimiser networks and much more diverse training distributions to produce learned optimisers that genuinely outperform Adam on out-of-distribution tasks. The compute cost of training a good learned optimiser is enormous, but once trained it can be applied at the scale of a normal optimiser.

Hypernetworks: skip the inner loop entirely

A complementary idea: rather than learning to optimise, learn to predict the weights directly. A hypernetwork (Ha et al. 2017) is a network that takes some input — for meta-learning, the support set — and outputs the weights of a task-specific network. The task-specific network is then used for predictions. There is no inner loop; the entire adaptation happens in the hypernetwork's forward pass. Training is end-to-end: backpropagate through both the prediction loss and the hypernetwork.
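A minimal forward-pass sketch of the idea, assuming a linear hypernetwork that maps each class's mean support embedding to that class's classifier weight vector (all names illustrative; the end-to-end training loop is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def hypernet_forward(support_emb, support_labels, n_way, W_h):
    """Linear hypernetwork sketch: W_h (the hypernetwork's own meta-learned
    parameters) maps each class's mean support embedding to that class's
    classifier weights. No inner-loop optimisation anywhere."""
    class_means = np.stack([support_emb[support_labels == k].mean(axis=0)
                            for k in range(n_way)])
    return class_means @ W_h                 # (n_way, d) generated weights

d = 4
W_h = 0.5 * rng.normal(size=(d, d))          # stands in for trained parameters
S = rng.normal(size=(6, d))                  # 2-way, 3-shot support embeddings
y = np.repeat(np.arange(2), 3)
task_weights = hypernet_forward(S, y, n_way=2, W_h=W_h)
logits = rng.normal(size=(1, d)) @ task_weights.T    # classify one query
print(task_weights.shape, logits.shape)  # (2, 4) (1, 2)
```

With a mean-pooled encoder like this, the hypernetwork is close to a ProtoNet with a learned linear head; richer hypernetworks condition on the full support set.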

Hypernetworks shine on problems where the support set has structure that strongly determines the right task-specific weights — for instance, where the task is parameterised by a small set of context variables. Their weakness is parameter efficiency: a hypernetwork that outputs the weights of a large network must itself be very large, which can be wasteful. Most modern hypernetwork applications use them for last-layer or low-rank weight prediction rather than full network synthesis.

LEO: latent-space adaptation

An influential intermediate point is LEO (Latent Embedding Optimization, Rusu et al. 2019). The idea: rather than running the inner-loop optimisation in the high-dimensional weight space, encode the support set into a low-dimensional latent space, run the inner-loop optimisation there, then decode back to weights. The lower-dimensional inner loop is more sample-efficient and stable. LEO held the state of the art on miniImageNet and tieredImageNet for several years and remains a strong baseline; the latent-space-optimisation idea has been recycled in many subsequent methods.

Adaptive layer scaling and conditional computation

A pragmatic family of methods sits between full hypernetworks and pure inner-loop adaptation: keep the network's weights fixed across tasks, but learn task-specific scaling parameters (FiLM layers, batch-norm parameters, low-rank adapters). The adaptation predicts only these few parameters from the support set, which is cheap and parameter-efficient. This is essentially what LoRA-style fine-tuning of large models has become — a form of meta-learning where the support set conditions a low-rank perturbation of the base model's weights. The connection to in-context learning in Section 8 will make this intuition explicit.
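The FiLM case reduces to a per-channel scale and shift. A minimal numpy sketch in which the task encoder that would predict γ and β from the support set is stubbed out with fixed values:

```python
import numpy as np

def film(h, gamma, beta):
    """FiLM layer: per-channel scale and shift of features h. In this
    meta-learning use, gamma and beta are predicted from the support set
    while the backbone producing h stays frozen across tasks."""
    return gamma * h + beta

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 16))            # 8 examples, 16 channels
# A task encoder (not shown) would map the support set to these:
gamma = np.ones(16) * 1.2
beta = np.zeros(16)
out = film(h, gamma, beta)
print(out.shape)  # (8, 16)
```

The adaptation touches 32 numbers per layer rather than the full weight tensor, which is why this family is so cheap to condition per task.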

06

Memory-Augmented and Recurrent Meta-Learners

A different angle: what if you treat the support set as a sequence and feed it through a recurrent network or a memory module before predicting on the query? The task-specific information ends up in the network's hidden state or in an external memory, not in changed weights. This was the dominant paradigm before MAML and ProtoNets and remains influential, especially as the conceptual bridge to in-context learning in Section 8.

Memory-augmented networks

The earliest serious meta-learners (Santoro et al. 2016, "Meta-Learning with Memory-Augmented Neural Networks") used Neural Turing Machines and related architectures that pair a controller network with an external differentiable memory. The support set is fed in sequentially; each support example is written to memory; the query is predicted by attending over the memory. The key insight: the controller learns a memory-read-and-write policy that effectively implements one-shot learning, with the external memory acting as the learned-on-the-fly classifier.

Memory-augmented architectures have appealing properties — a structured representation of "what we've seen so far in this task," explicit read/write operations that can be inspected. They lost popularity around 2017–2018 as MAML and ProtoNets matched or beat them with simpler architectures, but the underlying idea is enjoying a revival in the modern context-window-as-memory framing of long-context LLMs.

SNAIL: temporal convolutions plus attention

SNAIL (Mishra et al. 2018) is an influential architecture in this family that anticipated much of what transformers later did. SNAIL processes the support set with alternating temporal-convolution and attention blocks, builds up a context representation, and predicts on the query by attending into the context. The architecture predates and prefigures the transformer's role in in-context learning: the same general computational pattern (attend over the support set, predict on the query) is what GPT-style models do implicitly when given a few-shot prompt.

Sequence-to-task models

The general framing is "treat the task as a sequence — support examples followed by query inputs — and use a sequence model to produce the predictions." This is essentially what in-context learning will turn out to be in Section 8. The 2018-era meta-learning literature explored this with custom architectures (SNAIL, MANN, the various NTM variants) trained on episodic batches. The 2023-era LLM literature explored the same idea with off-the-shelf transformers trained on Internet-scale text.

Attention-based meta-learners

Pure-attention meta-learners — using a transformer encoder over the support set and a transformer decoder for the query — emerged as the standard architecture by 2020. CrossTransformers (Doersch et al. 2020) and the various Set Transformer variants are representative. The advantage over earlier RNN-based meta-learners is parallel processing of the support set; the disadvantage is the quadratic-in-support-size cost. For typical few-shot regimes (5–20 support examples), the cost is negligible.

Why this matters for in-context learning

The conceptual continuity is the chapter's most important payoff. A meta-learner that processes the support set with attention, then predicts on the query with cross-attention, is structurally identical to a transformer that takes a few-shot prompt and generates a response. Once you internalise this, you can read modern LLM behaviour through a meta-learning lens: GPT-style models trained on diverse text effectively meta-learn over an enormous task distribution, and their in-context-learning capabilities at deployment time are an emergent meta-learning skill. Section 8 develops this in full.

07

Bayesian Meta-Learning

Few-shot learning is structurally a regime of high uncertainty — five examples cannot pin down a complex decision boundary, and a sensible model should reflect that uncertainty in its predictions. Bayesian meta-learning brings the framework of Part XIII Ch 07 into the few-shot setting, producing meta-learners that output calibrated predictive distributions rather than point predictions. The methods in this section are the natural unification of meta-learning and Bayesian deep learning.

The probabilistic interpretation of MAML

One useful way to read MAML is as approximate Bayesian inference. The meta-learned weights θ play the role of a prior; the inner-loop gradient steps perform approximate posterior inference given the support set; the resulting θ' is a point estimate of the posterior. Grant et al. (2018) made this connection rigorous by showing that MAML with a quadratic regulariser performs approximate Bayesian inference under a Gaussian prior. The framing matters because it suggests how to extend MAML: replace the inner-loop point estimate with a proper posterior approximation, producing a probabilistic meta-learner.

PLATIPUS and probabilistic MAML

PLATIPUS (Finn et al. 2018) is the canonical Bayesian extension: instead of a single meta-learned θ, learn a distribution q(θ) over initialisations. At test time, sample initialisations from q, run MAML on each, and aggregate predictions. The result is a meta-learner that outputs a distribution over predictions reflecting both the data uncertainty in the support set and the model uncertainty in the meta-learned weights.

BMAML (Yoon et al. 2018) takes a related approach using stochastic-gradient Hamiltonian Monte Carlo to sample multiple particles in the inner loop, with each particle representing one posterior sample. Both methods explicitly target calibrated few-shot uncertainty and outperform vanilla MAML on benchmarks where uncertainty quality matters (active learning, downstream decision-making).

Neural processes as probabilistic meta-learners

The Neural Process family (Garnelo et al. 2018, covered in Part XIII Ch 07) is structurally a probabilistic meta-learner. A neural process takes a support set, encodes it into a context representation, and produces a predictive distribution (typically Gaussian) over the query outputs. Training is episodic — sample tasks, compute predictive log-likelihood on the query given the support, optimise. The Conditional Neural Process is the deterministic version; the full Neural Process adds a global latent variable for additional uncertainty.
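A forward-pass sketch of a Conditional Neural Process in numpy (untrained, all helper names illustrative): encode each support pair, mean-pool into a context vector, and decode query inputs into a Gaussian predictive distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(params, x):
    """Two-layer tanh MLP; params = (W1, b1, W2, b2)."""
    W1, b1, W2, b2 = params
    return np.tanh(x @ W1 + b1) @ W2 + b2

def init(d_in, d_hidden, d_out):
    return (0.3 * rng.normal(size=(d_in, d_hidden)), np.zeros(d_hidden),
            0.3 * rng.normal(size=(d_hidden, d_out)), np.zeros(d_out))

def cnp_forward(enc, dec, xs, ys, xq):
    """CNP forward pass: encode (x, y) support pairs, mean-pool into a
    context vector r, then decode (r, x_query) into mean and std."""
    r = mlp(enc, np.concatenate([xs, ys], axis=1)).mean(axis=0)
    r_tiled = np.broadcast_to(r, (len(xq), len(r)))
    out = mlp(dec, np.concatenate([r_tiled, xq], axis=1))
    mu, log_sigma = out[:, :1], out[:, 1:]
    return mu, np.exp(log_sigma)

enc = init(2, 32, 8)                    # (x, y) pair -> 8-dim representation
dec = init(9, 32, 2)                    # (r, x) -> (mu, log_sigma)
xs, ys = rng.normal(size=(5, 1)), rng.normal(size=(5, 1))
mu, sigma = cnp_forward(enc, dec, xs, ys, rng.normal(size=(10, 1)))
print(mu.shape, sigma.shape)  # (10, 1) (10, 1)
```

Episodic training maximises the Gaussian log-likelihood of the query outputs under (mu, sigma); the mean-pooling makes the predictor invariant to support-set ordering.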

The conceptual unification is clean: NPs are meta-learners that explicitly model uncertainty; ProtoNets and MAML are meta-learners that produce point predictions. For most practical few-shot classification, ProtoNets remain dominant; for regression, active learning, or scientific applications where calibrated probabilities matter, the NP family is the right starting point.

Posterior predictive in few-shot

The right object for evaluating few-shot uncertainty is the same as for standard Bayesian deep learning: the posterior predictive distribution. Bayesian meta-learners produce this naturally; non-Bayesian meta-learners can be augmented with the methods of Ch 07 (deep ensembles of meta-learners, MC dropout in the inner loop, Laplace approximations on the post-adaptation weights). The empirical evidence: Bayesian meta-learners produce better-calibrated predictions than vanilla meta-learners, especially in the very-few-shot regime where uncertainty matters most. The pattern mirrors standard Bayesian deep learning: deep ensembles of MAML or ProtoNets are a strong baseline that more elaborate Bayesian methods need to beat.

08

In-Context Learning as Meta-Learning

The most influential meta-learning result of the 2020s did not come from the meta-learning literature. GPT-3 (Brown et al. 2020) demonstrated that a sufficiently large language model trained on Internet text could perform new tasks at inference time given a handful of examples in the prompt — a behaviour the paper called "in-context learning" and explicitly framed as few-shot learning. The connection between this and the meta-learning framework of the chapter is exact, and understanding it changes how you think about both fields.

Few-shot prompting and the GPT-3 result

The basic in-context learning setup: prepend a sequence of (input, output) pairs to a prompt, then ask the model to produce the output for a new input. For text classification: "Review: Great movie! Sentiment: positive. Review: Terrible. Sentiment: negative. Review: Pretty good. Sentiment:" — the model completes with "positive." No gradient updates, no fine-tuning, no inner-loop optimisation; the adaptation is entirely a function of the prompt content during a single forward pass.
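The setup is mechanical enough to sketch directly. A minimal prompt builder for the sentiment example above (the field labels and layout are illustrative; real deployments tune both):

```python
def few_shot_prompt(support, query, in_label="Review", out_label="Sentiment"):
    """Format (input, output) support pairs plus one query into a
    few-shot prompt; the support set IS the adaptation data."""
    lines = [f"{in_label}: {x} {out_label}: {y}" for x, y in support]
    lines.append(f"{in_label}: {query} {out_label}:")
    return "\n".join(lines)

support = [("Great movie!", "positive"), ("Terrible.", "negative")]
prompt = few_shot_prompt(support, "Pretty good.")
print(prompt)
```

The model's completion of the final line is the few-shot prediction; nothing else changes between tasks.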

GPT-3 demonstrated that this works surprisingly well across a wide range of NLP tasks, with quality scaling smoothly in both model size and the number of in-context examples. The 2022–2024 generation of LLMs (GPT-4, Claude, Gemini) extended this to longer context windows and richer task repertoires. By 2026, in-context learning is the default deployment pattern for most foundation-model applications, with full fine-tuning reserved for cases where ICL is not enough.

The meta-learning interpretation

Reading ICL through the meta-learning framework: the LLM's pretraining is meta-training, the prompt's example pairs are the support set, the prompt's final query is the query example, and the model's autoregressive output is the prediction. The meta-task distribution is, implicitly, the population of "tasks" present in Internet text — sentiment analysis, translation, summarisation, code completion, arithmetic, every other task that can be expressed as text. The transformer architecture, processing the prompt with attention, has all the components of the metric-based and attention-based meta-learners of Sections 3 and 6.

Several theoretical papers (Garg et al. 2022, von Oswald et al. 2023, Akyürek et al. 2023) have made the meta-learning interpretation precise for restricted settings. The core result: transformers trained on synthetic regression and classification tasks implement learned algorithms — gradient descent, Bayesian inference, ridge regression — in their forward passes, with attention layers serving as the algorithmic primitive. ICL is not a mystical emergent capability; it is meta-learning, accomplished by training a model big enough and on data diverse enough that meta-learning becomes implicit in the standard language-modelling objective.

Transformers as Bayesian meta-learners

A particularly clean theoretical result: transformers trained on linear regression tasks implement, in their forward pass, something close to Bayesian linear regression on the in-context examples. The model has implicitly inferred the prior from training data and performs approximate posterior inference over the in-context examples to predict the query. The same broad picture extends to classification (the model implements something close to Bayesian classification) and more complex tasks (the model approximates Bayesian inference over the implicit task distribution).

This framing makes specific predictions: ICL should work better with more examples (more data tightens the posterior), should fail when the new task is far from the training distribution (the prior is wrong), and should be sensitive to example ordering only insofar as the architecture is sensitive to it (transformers without explicit position encoding are theoretically permutation-invariant; in practice modern position encodings break this). All of these predictions are empirically supported.
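The tightening-posterior prediction can be checked directly in the explicit model the transformer is claimed to approximate. A sketch of the Bayesian-linear-regression posterior predictive on the "in-context" examples (prior and noise variances are arbitrary illustrative choices):

```python
import numpy as np

def blr_predictive(X, y, x_query, prior_var=1.0, noise_var=0.25):
    """Posterior predictive of Bayesian linear regression on the in-context
    examples (X, y): the computation trained transformers approximate in a
    forward pass on synthetic linear-regression tasks."""
    d = X.shape[1]
    precision = X.T @ X / noise_var + np.eye(d) / prior_var  # posterior precision
    cov = np.linalg.inv(precision)
    mean_w = cov @ X.T @ y / noise_var                       # posterior mean over weights
    return x_query @ mean_w, noise_var + x_query @ cov @ x_query

rng = np.random.default_rng(0)
w_true = rng.normal(size=4)
X = rng.normal(size=(40, 4))
y = X @ w_true + rng.normal(0, 0.5, 40)
xq = rng.normal(size=4)

_, var_small = blr_predictive(X[:2], y[:2], xq)   # 2 in-context examples
_, var_large = blr_predictive(X, y, xq)           # 40 in-context examples
print(var_large < var_small)   # True: each extra example adds a PSD term
                               # to the precision, so query variance shrinks
```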

Implications and limits

The implications are large. First, the meta-learning framework of this chapter remains the right conceptual lens for foundation models, even though the formal "meta-learning" label has fallen out of fashion. Second, the design of in-context examples (their order, their labels, their format) is genuinely a meta-learning problem and should be treated as one. Third, the limitations of ICL are predictable from meta-learning principles: tasks far from the implicit training distribution fail, support-set size matters, and combining ICL with task-specific fine-tuning often beats either alone (echoing the classical "transfer + meta-learning" recipe). For practitioners, the right way to think about prompt engineering is as meta-test-time data design — what examples should I include to get the strongest few-shot adaptation?

09

Benchmarks, Pitfalls, and Evaluation

Meta-learning evaluation is fraught. Subtle violations of the meta-train/meta-test discipline have produced many published results that did not replicate, and a handful of careful comparison papers in 2019–2020 substantially deflated the early meta-learning hype. This section covers the standard benchmarks, the common pitfalls, and the evaluation protocols that actually work.

The standard benchmarks

The chapter has already mentioned the three dominant benchmarks. Omniglot is small (28×28 grayscale character images, 1623 classes, 20 examples each) and was the canonical first benchmark. miniImageNet is the standard mid-scale benchmark — 100 ImageNet classes split 64/16/20 for meta-train/val/test, with 600 images per class at 84×84 resolution. tieredImageNet is similar but with a hierarchical class split that requires more genuine generalisation. Meta-Dataset (Triantafillou et al. 2020) is the modern serious benchmark — 10 image datasets sourced from different domains, designed to measure cross-domain generalisation; it is harder than miniImageNet by every measure and is the right primary benchmark for new methods in 2026.
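All three benchmarks are consumed through the same episodic protocol. A minimal N-way K-shot episode sampler over a `{class_name: [examples]}` mapping (the toy data and defaults mirror the common miniImageNet 5-way setup):

```python
import random

def sample_episode(dataset, n_way=5, k_shot=1, q_queries=15, rng=None):
    """Sample one N-way K-shot episode: pick n_way classes, then split
    k_shot + q_queries examples per class into support and query sets."""
    rng = rng or random.Random()
    classes = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):            # relabel 0..n_way-1
        examples = rng.sample(dataset[cls], k_shot + q_queries)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query

# toy dataset: 20 classes, 30 examples each
data = {f"class{i}": [f"img{i}_{j}" for j in range(30)] for i in range(20)}
support, query = sample_episode(data, rng=random.Random(0))
print(len(support), len(query))   # 5 75
```

Meta-training draws episodes from the train classes; meta-test accuracy is averaged over many such episodes drawn from the held-out classes.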

The pretrain-baseline result

The single most important result in 2019–2020 meta-learning evaluation was Tian et al.'s "Rethinking Few-Shot Image Classification" (and parallel work by Chen et al., Dhillon et al.). The headline: a strong baseline that simply pretrains a feature extractor with standard supervised learning on the meta-training classes, then trains a logistic-regression classifier on the support set, matches or beats most published meta-learning methods on miniImageNet. The result was widely interpreted as deflating meta-learning — much of the apparent gain over baselines was attributable to better feature extractors rather than to the meta-learning machinery itself.
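The baseline is simple enough to sketch end to end. A numpy version of the linear head (the "frozen features" here are synthetic Gaussians standing in for a pretrained backbone's embeddings; the Tian et al. pipeline uses a real ResNet):

```python
import numpy as np

def fit_linear_head(feats, labels, n_classes, steps=300, lr=0.5):
    """Multinomial logistic regression on frozen support-set features,
    trained by plain full-batch gradient descent."""
    W = np.zeros((feats.shape[1], n_classes))
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = feats @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * feats.T @ (p - onehot) / len(feats)
    return W

rng = np.random.default_rng(0)
# stand-in for pretrained features: 5-way 5-shot with separated class means
means = rng.normal(0, 3, (5, 16))
feats = np.concatenate([means[c] + rng.normal(0, 1, (5, 16)) for c in range(5)])
labels = np.repeat(np.arange(5), 5)
W = fit_linear_head(feats, labels, n_classes=5)
queries = means + rng.normal(0, 1, (5, 16))        # one held-out query per class
acc = ((queries @ W).argmax(axis=1) == np.arange(5)).mean()
```

No episodic training, no inner loop: the only "meta" component is the quality of the feature extractor.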

The right interpretation, in 2026, is more nuanced. Strong feature extractors are most of the few-shot story when the support set is at the larger end of the few-shot regime (5+ examples per class). Meta-learning methods retain advantages in the very-low-K regime (1-shot, sometimes 5-shot) and in cross-domain settings (Meta-Dataset, where the test domain differs from training). For pure within-domain few-shot classification at moderate K, transfer learning with a strong backbone is hard to beat.

The episodic-training pitfall

A subtle issue: standard episodic training assumes the meta-test tasks are sampled from the same distribution as meta-training tasks. This is straightforward for within-domain benchmarks but breaks for cross-domain evaluation. Strong meta-learners trained episodically on miniImageNet often do worse on Meta-Dataset's cross-domain tasks than non-episodic baselines, because the episodic training overfits to the meta-training task distribution. Modern recommendations (Triantafillou et al. 2021): mix episodic and non-episodic training, evaluate on diverse benchmarks, and report task-distribution-shift performance separately.

What good evaluation looks like

The 2026 standard for serious meta-learning evaluation: report results on Meta-Dataset (not just miniImageNet), include a strong transfer-learning baseline, evaluate at multiple shot levels (1-shot, 5-shot, 20-shot), separate within-domain from cross-domain performance, and report confidence intervals over many random task samplings. Compute cost should also be reported — a method that beats baselines at 10× the compute is making a different trade-off than one that beats them at the same cost. The Meta-Dataset codebase enforces most of these protocols; new methods can be reasonably trusted only if they evaluate within it.
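The confidence-interval convention is worth pinning down, since papers report it inconsistently. The standard form is mean episode accuracy ± 1.96 standard errors over the sampled tasks; a stdlib-only sketch:

```python
import statistics

def mean_ci95(episode_accuracies):
    """Mean accuracy and 95% CI half-width over per-episode accuracies,
    the usual reporting format for few-shot benchmarks."""
    m = statistics.fmean(episode_accuracies)
    se = statistics.stdev(episode_accuracies) / len(episode_accuracies) ** 0.5
    return m, 1.96 * se

# illustrative accuracies from five sampled test episodes (real protocols
# use 600+ episodes, which is what shrinks the interval to useful width)
m, half = mean_ci95([0.62, 0.58, 0.71, 0.66, 0.60])
```

Because the half-width shrinks as the square root of the episode count, two methods can only be distinguished honestly by sampling enough tasks that their intervals separate.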

Negative results and replication

Several widely-cited meta-learning results have failed to replicate or have shrunk dramatically under stronger baselines. The 2019 wave of "rethinking meta-learning" papers established the modern, more skeptical evaluation culture; subsequent papers have been better behaved. The general lesson — strong baselines are essential, pretrained feature extractors are part of the comparison, episodic training is not a free lunch — applies broadly across the deep-learning literature and should be the default framing for any meta-learning result.

10

Applications and Frontier

Meta-learning shows up wherever rapid adaptation to new tasks matters: drug discovery with rare disease targets, robotics with new manipulation skills, personalisation with limited per-user data, foundation-model deployment via in-context learning. The deployment patterns differ — some applications need explicit episodic training, others get meta-learning for free from large-model pretraining — but the conceptual framework of the chapter applies broadly.

Drug discovery and molecular property prediction

Few-shot molecular property prediction is one of meta-learning's strongest application areas. The setting: a new biological target has a handful of measured molecules; predict activity for proposed novel molecules. Datasets like FS-Mol (Stanley et al. 2021) provide proper few-shot benchmarks for this regime. ProtoNet-style methods on graph-neural-network features are competitive baselines; MAML-style adaptation works well when the new task involves more than just new compounds (different assay conditions, different readouts). Production pipelines at AstraZeneca, Insitro, and several other companies use meta-learning machinery for early-stage virtual screening.
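A sketch of the ProtoNet-style classifier in this setting, with random Gaussians standing in for the pretrained GNN embeddings (the 2-way active/inactive framing and all sizes are illustrative):

```python
import numpy as np

def prototype_classify(support_feats, support_labels, query_feats):
    """One prototype (mean embedding) per class; queries are assigned to
    the nearest prototype by squared Euclidean distance."""
    classes = np.unique(support_labels)
    protos = np.stack([support_feats[support_labels == c].mean(axis=0)
                       for c in classes])
    dists = ((query_feats[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return classes[dists.argmin(axis=1)]

rng = np.random.default_rng(0)
# toy 2-way 3-shot task: embeddings of active vs inactive molecules
active = rng.normal(2.0, 0.5, (3, 8))
inactive = rng.normal(-2.0, 0.5, (3, 8))
feats = np.concatenate([active, inactive])
labels = np.array([1, 1, 1, 0, 0, 0])
preds = prototype_classify(feats, labels, rng.normal(2.0, 0.5, (2, 8)))
print(preds)   # both queries sit near the active cluster: [1 1]
```

Adaptation to a new target is just the prototype computation, which is why the approach is cheap enough for screening loops.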

Robotics and meta-RL

Meta-RL — meta-learning applied to reinforcement learning — is a major application domain. The setting: a robot must adapt to a new task (new object, new dynamics, new goal) from a handful of episodes. PEARL (Rakelly et al. 2019), MAML-RL (Finn et al. 2017), and the various sim-to-real adaptation methods of Part XII Ch 04 all sit in this space. The 2024 generation of robot foundation models (RT-2, Octo, OpenVLA) treat meta-learning implicitly via large-scale pretraining on diverse manipulation tasks — once again, the in-context-learning pattern applied to a non-language domain.

Personalisation and recommendation

Each user's interaction history is a small support set; the recommendation task is to predict their next item. Meta-learning frameworks treat the user as the task and meta-learn from a population of users. Production deployments at TikTok, YouTube, and similar platforms use meta-learning machinery for cold-start recommendation, where a new user has very few interactions to base recommendations on. The pattern of "shared backbone plus per-user adaptation" is essentially MAML applied to recommendation, and remains a strong baseline against which more elaborate methods are evaluated.
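The per-user adaptation step can be sketched as a MAML-style inner loop. A linear model with squared loss stands in for the shared backbone (real systems adapt a small head on deep features; the sizes and learning rate here are illustrative):

```python
import numpy as np

def inner_adapt(w_meta, user_x, user_y, lr=0.1, steps=50):
    """MAML-style inner loop: start from the shared meta-learned weights
    and take a few gradient steps on one user's small interaction history."""
    w = w_meta.copy()
    for _ in range(steps):
        grad = 2 * user_x.T @ (user_x @ w - user_y) / len(user_x)
        w -= lr * grad
    return w

rng = np.random.default_rng(1)
w_meta = np.zeros(8)                    # shared initialisation (meta-learned in practice)
x = rng.normal(size=(6, 8))             # six interactions for one new user
y = x @ rng.normal(size=8)              # that user's preferences
w_user = inner_adapt(w_meta, x, y)      # cheap per-user specialisation
```

The outer loop (not shown) would update `w_meta` so that this inner loop works well on average across the user population, which is exactly the MAML objective.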

Foundation-model adaptation

The dominant 2026 application is foundation-model adaptation via in-context learning, fine-tuning, or LoRA. Most users are not aware they are doing meta-learning when they craft a few-shot prompt or fine-tune a model on a small downstream dataset, but they are. The meta-learning framework provides the conceptual scaffolding for prompt engineering (treat prompts as test-time data), for understanding fine-tuning behaviour (the pretrained model is the meta-learned initialisation), and for the recurring observation that diverse pretraining produces stronger downstream adaptation than narrow pretraining (the meta-task distribution determines what the model can adapt to).

Frontier methods

Several frontiers are active in 2026. Algorithm distillation: train a transformer to imitate the trajectories of an RL algorithm, producing a model that performs RL-like exploration in its forward pass. Mechanistic interpretability of ICL: papers like Olsson et al.'s induction-heads work and the von Oswald gradient-descent-in-attention results form a programme of reverse-engineering the meta-learning behaviour of trained transformers. Scaling-law analysis of meta-learning: Chan et al. and others have studied how meta-learning quality scales with task-distribution diversity and model size, with practical implications for foundation-model training. Continual meta-learning: the meta-learning version of continual learning, where the task distribution itself drifts over time.

What this chapter does not cover

Several adjacent areas are out of scope. Pure transfer learning, including the pretraining-and-fine-tuning pipeline that dominates real deployments, is the subject of Part V Ch 07 and warrants separate treatment despite its conceptual overlap with meta-learning. Self-supervised representation learning is a related but distinct topic where the support set is implicit in the data structure rather than explicit. Multi-task learning is a precursor to meta-learning but distinct: it trains one model on many tasks simultaneously without an inner-loop adaptation step. Continual learning is the related problem of accommodating new tasks over time without forgetting old ones; Part XIII Ch 09 covers it. The neuroscience-flavoured literature on biologically-inspired few-shot learning, including work on episodic memory and the hippocampus-cortex distinction, deserves its own chapter and is best treated through that lens. And the Bayesian-non-parametric meta-learning literature (Pitman-Yor processes, hierarchical Dirichlet processes for clustering tasks) intersects with this chapter but lives mostly in classical Bayesian statistics.

Further reading

Foundational papers and surveys for meta-learning and few-shot learning. The Hospedales survey, the canonical MAML and ProtoNet papers, and the Brown ICL paper together form the right starting kit for practitioners.