Continual & Lifelong Learning: training that doesn't forget.
Train a neural network on task A, then on task B, and the network's performance on A collapses — sometimes to chance. This is catastrophic forgetting, and it is the fundamental obstacle to building learning systems that accumulate knowledge over time. Continual learning is the family of methods that fights forgetting: regularising weights that mattered for previous tasks, replaying old data, expanding the network's capacity, separating shared and task-specific parameters. This chapter develops the framework, the three method families (regularisation, replay, architectural), the connection to neuroscience-inspired memory consolidation, and the deployment patterns for the lifelong-learning future where models update continuously rather than getting retrained from scratch.
Prerequisites & orientation
This chapter assumes neural-network fundamentals (Part V Ch 01–02), basic optimisation (Part I Ch 03), and familiarity with regularisation (Part V Ch 04). The Bayesian-deep-learning material of Part XIII Ch 07 is helpful background — several continual-learning methods (EWC and friends) have a clean Bayesian interpretation as posterior updates — but the chapter develops what is needed. The meta-learning framework of Ch 08 is a related but distinct topic; meta-learning trains for rapid adaptation to new tasks, continual learning trains for retention of old ones. Both share the underlying problem of "how does a network handle a sequence of tasks?" but their objectives are different.
Two threads run through the chapter. The first is the stability-plasticity dilemma: a network must be plastic enough to learn new tasks, but stable enough to retain old ones, and these requirements are in genuine tension. Every method in the chapter is a different way of negotiating that trade-off. The second thread is the three method families: regularisation methods (penalise changes to important weights), replay methods (rehearse old data while training on new), and architectural methods (allocate new capacity for new tasks). These are the standard taxonomy in 2026, and most production continual-learning systems combine techniques from at least two families.
Catastrophic Forgetting and Why It Happens
Train a network to classify cats versus dogs, then continue training it on cars versus boats, and a striking thing happens: the network's performance on cats versus dogs collapses, often to chance. This is not subtle degradation; it is wholesale erasure of the earlier task. The phenomenon is called catastrophic forgetting, and understanding why it happens is the precondition for every method in the chapter.
The phenomenon
Catastrophic forgetting (also called catastrophic interference) was identified in the late 1980s by McCloskey and Cohen, in the context of multi-layer perceptrons. Their experiments showed that a network trained sequentially on two tasks did not gracefully accumulate knowledge — it learned the second task by overwriting the first. The result was so consistent and so severe that it became a standard argument against connectionist models: real intelligence requires accumulating skills over time, and standard neural networks couldn't do it.
The forgetting is mechanically simple to demonstrate. Take any modern network, train it on MNIST digits 0–4, then continue training on digits 5–9. Test accuracy on 0–4 will be near 0% by the end of the second phase. The network has not forgotten how to classify; it has reconfigured its weights to classify only the second set of digits. The output neurons for the first set never fire because the upstream features no longer support them.
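A minimal PyTorch sketch of that experiment, assuming a hypothetical `mnist_loader(digits)` helper that yields batches restricted to the given digit classes (any split-MNIST loader works):

```python
import torch
import torch.nn as nn

# Small MLP with a single shared 10-way head, as in the split-MNIST example.
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def train_phase(loader, epochs=1):
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

def accuracy(loader):
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total

train_phase(mnist_loader(digits=range(0, 5)))    # task A: digits 0-4
acc_before = accuracy(mnist_loader(digits=range(0, 5)))
train_phase(mnist_loader(digits=range(5, 10)))   # task B: digits 5-9
acc_after = accuracy(mnist_loader(digits=range(0, 5)))
# Typically acc_before is high and acc_after collapses toward zero.
```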
Why standard SGD causes forgetting
The mechanism is straightforward. SGD on the second task's loss has no reason to preserve the weight configuration that solved the first task — the gradient points purely toward the second task's optimum. If the first task's weight configuration happens to be far from the second task's optimum, the optimiser walks away from it. Crucially, the network has many equivalent solutions: different weight settings often give similar second-task performance, but only some of those settings preserve first-task performance. Plain SGD has no preference among them, so it usually picks one that destroys the first task.
This is the stability-plasticity dilemma in its sharpest form. A network must be plastic enough to fit the new data and stable enough to retain the old. Standard SGD is all plasticity, no stability; methods in this chapter restore stability through different mechanisms.
The biological contrast
Humans and animals do not catastrophically forget. A child who learns to recognise dogs can later learn to recognise cars without losing the dog skill. The neuroscience explanation involves several mechanisms: synaptic consolidation (synapses important for prior memories become harder to modify over time), memory replay during sleep (the hippocampus replays recent experiences to the cortex, interleaving them with consolidated knowledge), and structural plasticity (new neurons are added in some brain regions). The continual-learning literature draws explicitly on these mechanisms — EWC (Section 3) is the synaptic-consolidation analogue, replay methods (Section 4) are the hippocampal-replay analogue, and architectural methods (Section 6) are the structural-plasticity analogue.
What the chapter is and isn't about
The chapter focuses on the case where tasks arrive sequentially and only the current task's data is available (or a small buffer of past data). This is genuinely hard. The easier related problems — multi-task learning where all tasks' data is available simultaneously, fine-tuning where the original training data can be re-accessed, online learning on a stationary distribution — have their own well-developed methods that are mostly not the focus here. The defining constraint of continual learning is the data restriction, and most of the chapter's complexity comes from working around it.
For most current ML production, catastrophic forgetting is dodged by retraining from scratch — collect all data, train once, deploy. This works for problems with stable distributions and rare retrains. It fails for systems that must adapt continuously: robots learning new skills in the wild, recommendation systems tracking shifting user preferences, foundation models being updated with new knowledge. As these regimes become more common in 2026, continual learning moves from "research curiosity" to "production requirement."
Continual-Learning Scenarios and Evaluation
"Continual learning" covers several distinct scenarios that differ in what the model knows about task boundaries and what changes between tasks. Confusing them — which is easy because much of the literature is sloppy about it — leads to comparing methods that solve different problems. The standard taxonomy of van de Ven and Tolias (2019) cuts the space cleanly and is the right starting point.
Three scenarios: task, domain, class incremental
Task-incremental learning: each task has its own output head, and the task identity is provided at test time. The network knows which task's head to use for each test example, and only needs to learn the per-task feature representations without interference. This is the easiest scenario; many methods (especially regularisation-based ones) handle it well. Example: classifying digits in five separate two-class problems, with task ID provided.
Domain-incremental learning: the input distribution changes across tasks but the label space does not, and task identity is not provided at test time. The network must produce the right label without knowing which task it came from, but the labels are the same set across tasks. Example: classifying digits where each task changes the rendering style (handwriting, printed, etc.) but the digit labels remain 0–9.
Class-incremental learning: new classes arrive over time and task identity is not provided at test time. The network must distinguish all classes seen so far, including across task boundaries. This is the hardest scenario and the most realistic — it matches the way new categories appear in the real world. Example: recognising bird species with new species added over time, with the model required to distinguish all species (including species from previous tasks) at any test point.
Why class-incremental is so hard
The class-incremental case is dramatically harder than the task-incremental case because the model has to compare classes from different tasks against each other. With only the current task's data available, the network has no way to learn appropriate calibration between current-task classes and prior-task classes — it never sees them in the same gradient step. The empirical pattern in benchmarks: many regularisation-based methods that work well in task-incremental settings collapse to near-chance performance in class-incremental settings. Replay methods (Section 4) close most of this gap because they explicitly bring prior-task examples back into the current gradient.
Standard metrics
Three metrics dominate continual-learning evaluation. Average accuracy (ACC) is the model's accuracy on all tasks seen so far, averaged after the final task. Backward transfer (BWT) measures how learning a new task affects performance on prior tasks — negative BWT is forgetting, positive BWT is unusual but possible (where new tasks reinforce old ones). Forward transfer (FWT) measures how prior tasks help future ones, compared to learning each task in isolation.
The standard reporting pattern: a method's headline number is its ACC, with BWT reported alongside as a forgetting diagnostic. A method with high ACC and BWT near zero is forgetting little; a method with high ACC at the cost of strongly negative BWT is essentially "good only on the most recent task."
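In practice these metrics are computed from an accuracy matrix R, where R[i, j] is the accuracy on task j after training through task i (the formulation of Lopez-Paz and Ranzato 2017). A minimal sketch, using the "learned in isolation" baseline accuracies b for FWT as described above:

```python
import numpy as np

def cl_metrics(R, b=None):
    """R[i, j]: accuracy on task j after training through task i (T x T).
    b[j]: accuracy of a model trained on task j alone (baseline for FWT)."""
    T = R.shape[0]
    acc = R[-1].mean()                            # average accuracy after the final task
    bwt = (R[-1, :-1] - np.diag(R)[:-1]).mean()   # negative values indicate forgetting
    fwt = None
    if b is not None:
        # accuracy on task j just before training on it, relative to the baseline
        fwt = np.mean([R[j - 1, j] - b[j] for j in range(1, T)])
    return acc, bwt, fwt
```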
Memory and compute constraints
Crucially, continual-learning evaluation must specify what the model is allowed to store. Methods that keep a small buffer of past examples (a few hundred per class) are fundamentally different from methods that keep nothing. Most modern benchmarks specify a fixed memory budget and require methods to compete within it; the resulting trade-off curve (memory vs. forgetting) is the most informative way to compare methods. The 2020s standard is to report results across several memory budgets (0, 200, 500, 1000 stored examples per task) so that the memory-forgetting trade-off is explicit.
Regularisation Methods: EWC and Friends
The first principled answer to catastrophic forgetting was Elastic Weight Consolidation (EWC), introduced by Kirkpatrick et al. in 2017. The idea is biologically motivated and mathematically clean: identify which weights mattered most for the previous task, and penalise changes to those weights when training on the new task. EWC and its successors form the regularisation family of continual-learning methods, the conceptually simplest family and a strong baseline against which more complex methods are evaluated.
EWC: posterior-as-prior
Elastic Weight Consolidation takes a Bayesian perspective on continual learning. After training on task A, the network's weights θ are a (point) approximation to the posterior p(θ | D_A). When task B arrives, the right Bayesian update is to use this posterior as the prior for task B: p(θ | D_A, D_B) ∝ p(D_B | θ) · p(θ | D_A). The technical move is to approximate the previous posterior as a Gaussian centred at the post-task-A weights θ*_A with covariance given by the inverse Fisher information. The resulting loss for task B is

L(θ) = L_B(θ) + (λ/2) Σ_i F_i (θ_i − θ*_A,i)²,

where L_B is task B's loss, F_i is the i-th diagonal Fisher entry, and λ controls how strongly the old-task weights are anchored.
The Fisher information is the natural measure of "importance" because it captures how much the loss is curved around the current parameters — high Fisher means the parameter's value is well-determined by task A's data. The diagonal approximation is the practical concession; the full Fisher matrix is too expensive to store. EWC works well in task-incremental settings, less well in class-incremental settings, and remains the canonical regularisation-based continual-learning method.
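A minimal PyTorch sketch of the diagonal-Fisher estimate and the resulting penalty, reusing the illustrative `model`/`loss_fn` names from the earlier sketch and using the common empirical-Fisher shortcut of averaging squared loss gradients:

```python
import torch

def diagonal_fisher(model, loader, loss_fn):
    """Empirical diagonal Fisher: average squared gradient over task A's data."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for x, y in loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            fisher[n] += p.grad.detach() ** 2
        n_batches += 1
    return {n: f / n_batches for n, f in fisher.items()}

# Snapshot after finishing task A.
fisher_A = diagonal_fisher(model, loader_A, loss_fn)
theta_A = {n: p.detach().clone() for n, p in model.named_parameters()}

def ewc_penalty(model, lam=1000.0):
    """Quadratic pull toward the task-A weights, weighted by importance."""
    penalty = sum((fisher_A[n] * (p - theta_A[n]) ** 2).sum()
                  for n, p in model.named_parameters())
    return 0.5 * lam * penalty

# Task-B training objective: loss_fn(model(x), y) + ewc_penalty(model)
```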
Synaptic Intelligence and online importance
EWC requires a separate Fisher computation after each task, which can be expensive. Synaptic Intelligence (SI, Zenke et al. 2017) tracks per-parameter importance online during training, using the path integral of the gradient over the training trajectory. The intuition: a parameter's importance is its cumulative contribution to reducing the loss, accumulated by multiplying its gradient by its update at every step. The importance estimate is updated continuously, with no separate Fisher pass. SI matches or outperforms EWC on most benchmarks at lower computational cost and is the more common choice in 2026 production deployments.
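A minimal sketch of the online accumulation; names are illustrative, and the caller is responsible for snapshotting gradients and parameter deltas around each optimiser step:

```python
import torch

class SynapticIntelligence:
    """Online path-integral importance, in the style of Zenke et al. 2017."""
    def __init__(self, model, xi=0.1):
        self.model, self.xi = model, xi
        params = dict(model.named_parameters())
        self.w = {n: torch.zeros_like(p) for n, p in params.items()}      # running path integral
        self.omega = {n: torch.zeros_like(p) for n, p in params.items()}  # consolidated importance
        self.start = {n: p.detach().clone() for n, p in params.items()}

    def accumulate(self, grads, deltas):
        """Call once per optimiser step with the pre-step gradients and the
        parameter changes that step produced: w += -g * delta_theta."""
        for n in self.w:
            self.w[n] -= grads[n] * deltas[n]

    def consolidate(self):
        """At a task boundary, normalise by total parameter movement."""
        for n, p in self.model.named_parameters():
            total = p.detach() - self.start[n]
            self.omega[n] += self.w[n] / (total ** 2 + self.xi)
            self.w[n].zero_()
            self.start[n] = p.detach().clone()
```

The per-task penalty then mirrors the EWC form, with Ω in place of the Fisher.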
Memory Aware Synapses
Aljundi et al.'s Memory Aware Synapses (MAS, 2018) takes another angle: importance is estimated from the sensitivity of the network's output (rather than its loss) to each parameter. The advantage is that MAS can compute importance on unlabelled data — useful when labels for past tasks are unavailable. MAS, EWC, and SI form the standard regularisation-based trio, and most continual-learning papers compare against all three as baselines.
Strengths and limitations
Regularisation methods are appealingly simple: no replay buffer, no architectural changes, just an extra term in the loss. They work reasonably well in task-incremental settings and are computationally cheap. Their fundamental weakness is that the importance estimate is a point estimate of curvature; when many parameters become "important" after enough tasks, the regulariser becomes too restrictive and the network loses plasticity. In class-incremental settings, regularisation methods alone typically achieve only 50–70% of the accuracy of the strongest replay-based methods. The modern view: regularisation is part of the toolkit, not the whole solution; production systems usually combine regularisation with replay.
Replay and Rehearsal Methods
If catastrophic forgetting is caused by the gradient never seeing prior-task data, the most direct fix is to make sure it does. Replay methods store a small subset of past examples and interleave them with current-task training, typically achieving the strongest empirical results in continual learning at the cost of memory and (often) privacy concerns about storing user data. The replay family is the dominant approach in 2026.
Experience replay: the core idea
Experience replay in its simplest form: maintain a small buffer of examples from past tasks, and at each training step on the current task, also include some samples from the buffer in the gradient computation. The buffer is typically much smaller than the original training data — a few hundred to a few thousand examples per task — but its presence in the gradient prevents the network from drifting away from past-task solutions.
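The whole method fits in a few lines. A sketch, assuming `buffer` is a list of stored (x, y) example pairs and the other names follow the earlier sketches:

```python
import random
import torch

def replay_step(model, opt, loss_fn, batch, buffer, replay_size=32):
    """One current-task step with rehearsal: mix buffer samples into the gradient."""
    x, y = batch
    if buffer:
        rx, ry = zip(*random.sample(buffer, min(replay_size, len(buffer))))
        x = torch.cat([x, torch.stack(rx)])
        y = torch.cat([y, torch.stack(ry)])
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
```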
The empirical headline: even very simple experience replay matches or beats almost every regularisation-based method on standard benchmarks, and crushes them on class-incremental settings where regularisation methods fall apart. The 2019–2021 literature established this clearly, and the attention shifted from "how do we avoid storing data?" to "how do we store the right data and use it well?"
Reservoir sampling and buffer management
The buffer must be filled with care. Naive strategies (for example, keeping only the most recent examples) bias the buffer toward recent tasks. Reservoir sampling maintains a uniformly random sample from the entire stream: once the buffer is full, when example k arrives, it replaces a random buffer slot with probability buffer-size/k. This produces a buffer where every past example has equal probability of being retained, regardless of how long ago it appeared. Reservoir sampling is the standard buffer-management strategy when no per-task structure is available.
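Reservoir sampling (algorithm R) is a few lines; a sketch:

```python
import random

def reservoir_update(buffer, example, k, capacity):
    """Keep a uniform random sample of the stream seen so far.
    k is the 1-based index of the incoming example."""
    if len(buffer) < capacity:
        buffer.append(example)
    else:
        j = random.randrange(k)   # uniform over 0 .. k-1
        if j < capacity:
            buffer[j] = example   # replace with probability capacity / k
```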
For class-incremental learning, buffers are usually maintained per-class: each class gets a fixed number of slots, filled with its most representative examples. The iCaRL method (Rebuffi et al. 2017) selects exemplars so that their mean in feature space stays close to the true class mean; this herding-style selection produces more representative buffers than random selection.
iCaRL and the nearest-mean classifier
iCaRL (incremental Classifier and Representation Learning) is the most influential replay method. The recipe combines several pieces. First, a feature extractor is trained on the current task with the standard loss plus a distillation loss against the previous network. Second, the buffer is updated using herding-style selection. Third — and this is the key — classification at test time uses a nearest-mean classifier in feature space rather than the network's output head. The class means are computed over the buffer; new test examples are assigned to the closest class mean. This sidesteps the calibration problems that plague class-incremental learning's softmax output and dramatically improves results.
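A sketch of the nearest-mean-of-exemplars step, assuming a hypothetical `features(x)` embedding function and a per-class exemplar dictionary `buffer`:

```python
import torch

def class_means(features, buffer):
    """Mean of L2-normalised exemplar features per class, itself renormalised."""
    means = {}
    for c, exemplars in buffer.items():
        f = features(torch.stack(exemplars))
        f = f / f.norm(dim=1, keepdim=True)
        mu = f.mean(dim=0)
        means[c] = mu / mu.norm()
    return means

def nearest_mean_predict(features, means, x):
    """Assign each test example to the class with the closest mean."""
    f = features(x)
    f = f / f.norm(dim=1, keepdim=True)
    classes = sorted(means)
    M = torch.stack([means[c] for c in classes])   # (num_classes, d)
    return [classes[i] for i in torch.cdist(f, M).argmin(dim=1).tolist()]
```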
iCaRL became the canonical baseline for class-incremental benchmarks and remains competitive in 2026, with variants (LUCIR, BiC, PODNet, FOSTER) refining different parts of the recipe.
GEM and A-GEM: gradient projections
Gradient Episodic Memory (GEM, Lopez-Paz and Ranzato 2017) uses the buffer differently. Rather than mixing buffer examples into the loss, GEM computes the gradient on the buffer for each past task and projects the current-task gradient so that it does not increase the loss on any past task. The constraint: the projected gradient has a non-negative inner product with each past-task gradient. Enforcing this requires solving a quadratic program at each step, which is why GEM is computationally expensive. A-GEM (Chaudhry et al. 2019) simplifies to a single constraint against the average gradient across past tasks rather than per-task constraints, dropping the per-step cost dramatically while retaining most of the benefit.
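The A-GEM projection itself is one line of algebra on flattened gradient vectors; a sketch:

```python
import torch

def agem_project(g, g_ref):
    """Project the current-task gradient g (flattened) so it does not conflict
    with the average buffer gradient g_ref (A-GEM, Chaudhry et al. 2019)."""
    dot = torch.dot(g, g_ref)
    if dot < 0:   # g would increase the buffer loss; remove the conflicting component
        g = g - (dot / torch.dot(g_ref, g_ref)) * g_ref
    return g
```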
Privacy and the data-storage problem
Storing past examples is at odds with privacy regulations (GDPR's right to erasure, HIPAA, sectoral data-retention rules). For applications with regulated data, replay buffers may simply not be allowed, which is why generative replay (Section 5) remains an active research area despite the practical superiority of stored-example replay. The compromise often used in production: replay buffers are short-lived (held only during active retraining) and rotated regularly, with the rotation policy designed to comply with the relevant data-retention regime.
Generative Replay and Pseudo-Rehearsal
If you can't store past data, generate it. Train a generative model alongside the classifier; when a new task arrives, sample synthetic past examples from the generator and replay them during training. The generative-replay idea sidesteps the privacy and storage issues of literal replay buffers, at the cost of generative-model quality bounding overall performance. The 2024 generation of diffusion models has made this approach much more practical than it was even five years ago.
The pseudo-rehearsal idea
Pseudo-rehearsal goes back to Robins (1995): generate synthetic inputs (in the early days, random noise; later, samples from a learned generator) and pair them with the current model's predictions, then mix these into the training data alongside the new task. The hope: the synthetic inputs and the model's predictions on them encode the model's current input-output mapping, which gets preserved through training on the new task. The early versions used trivial generators and worked badly; the modern versions use real generative models and work surprisingly well.
Deep generative replay
Deep Generative Replay (Shin et al. 2017) was the first scaled version: train a VAE or GAN alongside the main classifier. When a new task arrives, sample synthetic past examples from the generator and label them using the previous classifier (the "scholar-generator" pair). Train the new classifier on the new task data plus the labelled synthetic past examples; train a new generator on the new task data plus synthetic past examples. The new pair becomes the scholar-generator for the next task.
DGR worked moderately well on simple datasets but struggled on complex ones because GANs and VAEs of the 2017 era couldn't produce convincing-enough samples. As generation quality degrades over chained re-training, the method's fidelity drifts; this is the canonical failure mode of generative-replay approaches.
Diffusion-based replay
The 2022–2026 generation of diffusion models has revived generative replay. Diffusion models produce dramatically higher-quality samples than VAEs or GANs at comparable scale, and unlike GANs they admit principled likelihood-based training that resists mode collapse during chained retraining. Recent papers (e.g., DDGR, CL-Diffusion) demonstrate that diffusion-based generative replay closes most of the gap to literal-replay methods on benchmarks where past data cannot be stored. The compute cost is higher than literal replay (you have to train and sample from a generator), but the privacy properties are dramatically better.
Latent replay: a middle ground
An intermediate strategy is to store examples in feature space rather than in the input space — the network's intermediate representations from past examples are kept while the inputs themselves are discarded. This compresses the storage requirement (features are usually smaller than raw images) and provides some privacy protection (features are harder to invert to recover the original input than literal storage). Latent replay is the standard choice for on-device continual learning where storage is constrained and the input data is sensitive.
When generative replay is the right choice
Three situations favour generative replay over literal replay. First, when the data cannot be stored (regulatory, privacy, or device-storage constraints). Second, when the data distribution is naturally generative-friendly (images, audio, text — domains where generative models work well). Third, when the goal is generative-replay-style data augmentation as an end in itself, beyond just continual learning. For most other applications, literal replay outperforms generative replay at lower complexity, which is why production continual-learning systems still mostly use literal replay where they can.
Architectural Methods: Progressive Networks and Beyond
A different angle on the stability-plasticity dilemma: instead of asking the same network to handle all tasks, allocate fresh capacity for each new task while keeping prior-task parameters frozen. Architectural methods sidestep forgetting by construction — the parameters that solved task A are not modified when task B arrives, so they cannot be overwritten. The cost is parameter growth and sometimes loss of cross-task generalisation; the benefit is a clean theoretical guarantee against forgetting.
Progressive Networks
Progressive Networks (Rusu et al. 2016) is the canonical architectural method. For each new task, instantiate a new "column" — a fresh copy of the network architecture — and freeze all prior columns. The new column receives lateral connections from each prior column at every layer, allowing it to reuse prior-task features but not modify them. By construction, prior-task performance cannot degrade because prior-task weights cannot change; the new column adds whatever capacity is needed for the new task while leveraging prior representations through the lateral connections.
The cost is parameter growth — N tasks means N columns of weights — which is prohibitive for tasks numbering in the dozens. Progressive Networks were popular in early continual-RL work where tasks were few (5–10) and the per-column cost was acceptable. They remain a strong baseline against which more parameter-efficient architectural methods are evaluated.
PackNet and parameter pruning
PackNet (Mallya and Lazebnik 2018) pushes the per-task capacity argument in the opposite direction: rather than adding parameters per task, partition the existing network's parameters across tasks via pruning. After training on task A, prune the lowest-magnitude weights — say 50% of them — fix the remaining ones, and train on task B using only the pruned-out (zeroed) parameters. After task B, repeat for task C using the remaining unallocated parameters. The result is a single fixed-size network in which each subset of parameters is dedicated to a specific task, and forgetting is impossible by construction.
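A sketch of the mask bookkeeping for a single weight tensor (illustrative; the actual method prunes layer-wise and retrains briefly after each prune):

```python
import torch

def packnet_allocate(weight, free_mask, keep_frac=0.5):
    """After training task t on the free weights, keep the largest-magnitude
    keep_frac of them for task t and zero the rest for future tasks."""
    free_vals = weight[free_mask].abs()
    k = max(1, int(keep_frac * free_vals.numel()))
    threshold = free_vals.topk(k).values.min()
    task_mask = free_mask & (weight.abs() >= threshold)
    weight.data[free_mask & ~task_mask] = 0.0    # released back into the free pool
    return task_mask, free_mask & ~task_mask     # (this task's weights, still-free weights)
```

At inference time, each task records its mask and uses only the weights allocated to it plus those frozen by earlier tasks.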
PackNet's elegance is its zero parameter growth. Its cost is that the per-task capacity declines as more tasks are added — task 5 gets the smallest fraction of the network. The number of tasks it can handle is bounded by the network's redundancy. In practice, PackNet handles 5–10 tasks well on networks designed with extra capacity, and is a popular choice for embedded continual-learning deployments where parameter count is fixed by hardware.
Adapter-based methods
The modern architectural approach uses adapters — small task-specific modules inserted into a fixed backbone network. The backbone is trained once (or pretrained from a foundation model) and frozen; each new task gets a fresh small adapter. Piggyback (Mallya, Davis, and Lazebnik 2018) takes this further with binary masks per task; HAT (Serra et al. 2018) uses learned attention masks. Adapter-based methods scale to more tasks than Progressive Networks (per-task cost is much smaller) and have become the dominant architectural strategy in the foundation-model era — every task-specific LoRA module is conceptually an adapter doing one-task continual learning on top of a frozen pretrained model.
Strengths and weaknesses
Architectural methods have the cleanest forgetting guarantees — they cannot forget by construction. Their cost is twofold: the parameter overhead per task (or per-task capacity reduction), and the difficulty of cross-task generalisation. A method that walls off task-specific parameters cannot easily benefit from positive transfer between tasks, and tasks that share underlying structure (like classifying different bird species) end up not sharing as much as they should. The 2026 consensus: architectural methods are the right choice when forgetting is intolerable and tasks are well-separated; replay-based methods are the right choice when transfer between tasks matters; production systems often combine both.
Bayesian Continual Learning
Continual learning is structurally a Bayesian problem: the posterior after task A becomes the prior for task B. Done exactly, this would solve the forgetting problem entirely — past data informs the prior, new data updates the posterior, no information is lost. Done approximately, with the kinds of variational and Laplace tools of Bayesian deep learning (Part XIII Ch 07), it produces a principled family of continual-learning methods. The connection to EWC is direct, and the recent literature has produced more powerful Bayesian continual learners.
The sequential-Bayesian formulation
The Bayesian continual-learning framework: maintain a posterior p(θ | D_1:t) over the network weights given all data seen up to task t. When task t+1 arrives, the right update is

p(θ | D_1:t+1) ∝ p(D_t+1 | θ) · p(θ | D_1:t),

i.e., the old posterior serves as the prior against which the new task's likelihood is weighed.
EWC (Section 3) is the most basic instance: the posterior is approximated as a Gaussian centred at the MAP with covariance from the diagonal Fisher; the new-task loss adds a quadratic regulariser that pulls toward the previous mean with strength given by the Fisher. The Bayesian framing makes the EWC choice transparent — it is a particular Gaussian posterior approximation propagated through tasks.
Variational Continual Learning
Variational Continual Learning (VCL, Nguyen et al. 2018) extends EWC to a full mean-field-Gaussian variational posterior — every weight has its own Gaussian variational distribution rather than a point estimate plus diagonal covariance. The posterior after task t is used as the prior for task t+1, and the new variational posterior is fit by maximising the standard ELBO. The result is a continual-learning method that produces calibrated uncertainty as a side benefit and tends to outperform EWC on benchmarks where multiple posterior modes matter.
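A sketch of the per-task objective for mean-field Gaussian posteriors, where the closed-form diagonal-Gaussian KL pulls q_t toward the previous task's posterior (names are illustrative):

```python
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, summed."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    return 0.5 * (logvar_p - logvar_q
                  + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0).sum()

# Per-task negative ELBO (one common convention scales the KL by dataset size):
# loss = expected_nll(q_t, batch) + gaussian_kl(mu_t, logvar_t,
#                                               mu_prev, logvar_prev) / N_t
```

The expected negative log-likelihood is estimated with the reparameterisation trick, exactly as in standard variational Bayesian deep learning.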
The cost is the standard variational-inference cost — twice as many parameters (means and variances), reparameterisation-trick training. The benefit is a more honest posterior that supports the kind of selective-prediction and OOD detection covered in Ch 07. Production deployments of Bayesian continual learning are still rare but the method has strong empirical support.
Posterior-distillation methods
An alternative to maintaining an explicit posterior is posterior distillation: when training on a new task, ensure the network's output distribution remains close to the previous network's output distribution on representative inputs. Methods like Learning without Forgetting (LwF, Li and Hoiem 2017) do exactly this with a knowledge-distillation loss against the previous network's logits. LwF is very simple (one extra loss term) and works well as a baseline in task-incremental settings; it was an important transitional method between regularisation-based and replay-based families.
Why Bayesian methods are not yet dominant
Despite the principled framing, Bayesian continual-learning methods have not become the dominant production choice. The reasons are mostly empirical: the approximation error in mean-field Gaussian posteriors compounds over many tasks, leading to pathologies (over-regularisation, brittle uncertainty estimates) that simpler replay-based methods avoid. Recent work has explored richer posterior approximations (rank-1 perturbations, normalising-flow posteriors, structured-covariance forms) that close some of this gap; the direction is active but not yet settled. For most applications in 2026, the Bayesian framing is conceptually useful but the production methods are replay-based.
Meta-Continual Learning and Online Adaptation
If meta-learning (Ch 08) trains a model to adapt rapidly to new tasks, and continual learning trains a model to retain old ones, the natural combination is a model that does both — adapts quickly without forgetting. The intersection has produced some of the most interesting recent continual-learning methods and connects directly to the way large foundation models behave when fine-tuned on streams of new data.
The OML framework
Online aware Meta-Learning (OML, Javed and White 2019) explicitly meta-learns representations that resist forgetting. The setup: meta-train on streams of tasks with the explicit objective of low forgetting after sequential updates. The inner loop performs a sequence of gradient updates on a stream; the outer loop measures total accuracy across the stream and updates the meta-parameters accordingly. The result is a representation that, when fine-tuned sequentially, retains past-task knowledge much better than a representation trained without this objective.
OML demonstrated something important: forgetting is largely a property of the representation, not just the optimisation. A well-meta-learned representation forgets dramatically less than a representation that was never trained for sequential robustness, with no other changes to the training pipeline. This suggests that pretraining on diverse tasks (which approximates the OML setup implicitly) is a powerful continual-learning strategy in itself.
ANML and modulating networks
A Neuromodulated Meta-Learning Algorithm (ANML, Beaulieu et al. 2020) extends OML with an explicit modulation network. The architecture has two parts: a prediction network that produces predictions and a neuromodulation network that gates which neurons fire on each input. The gating provides a built-in mechanism for selectively activating different parts of the network for different tasks, reducing interference. ANML achieves stronger continual-learning results than OML and remains a strong meta-CL baseline.
MER: meta-experience replay
Meta-Experience Replay (MER, Riemer et al. 2019) combines explicit experience replay with a meta-learning-style gradient alignment. At each step, MER computes gradients on current-task and replay examples and uses Reptile-style meta-updates (Section 4 of Ch 08) to align the gradients across tasks. The result is a method that explicitly trains for compatible gradients across tasks, reducing interference. MER outperforms vanilla replay on several benchmarks at modest extra cost.
Continual learning as a meta-learning instance
Stepping back: continual learning can be framed as a particular meta-learning problem where the "tasks" are sequences of tasks and the "performance" metric is the average post-stream accuracy. The meta-learner learns to perform online updates that respect both current performance and prior retention. This framing connects continual learning to the in-context-learning literature of Ch 08 — a sufficiently large foundation model trained on enough sequential-task data could in principle handle continual updates implicitly via in-context conditioning, without explicit weight updates at all.
The 2024–2026 literature on long-context LLMs as continual learners has produced encouraging early results: a foundation model with a long context window can implicitly maintain a "task memory" via the context and adapt to streams of tasks without weight updates. Whether this scales to true lifelong learning (millions of tasks over months) remains an open empirical question; the early evidence suggests partial success.
Benchmarks, Pitfalls, and Evaluation
Continual-learning evaluation is fraught. Several published-and-hyped methods have failed to replicate, several "obvious" baselines outperform elaborate methods on careful comparison, and the gap between benchmark performance and real-world deployment robustness is large. This section covers the standard benchmarks, the common evaluation pitfalls, and the protocols that actually work.
Standard benchmarks
The dominant benchmarks fall into three tiers. Split-MNIST and Permuted-MNIST are toy benchmarks built from MNIST: split the digit classes into 5 sequential 2-class tasks (Split-MNIST) or apply different fixed pixel permutations to define each task (Permuted-MNIST). They are too easy by 2026 standards but remain useful as sanity checks. Split-CIFAR-100 and Split-TinyImageNet are mid-tier — CIFAR-100 split into 10 or 20 sequential tasks of 10 or 5 classes each. Most published continual-learning methods report results on these. CORe50 and Stream-51 are real-world streaming benchmarks designed for class-incremental learning with realistic distributional shifts; they are the right standard for production-relevant evaluation in 2026.
The "online vs. offline" distinction
A subtle but crucial evaluation choice: how many epochs does the model get on each task's data? Offline continual learning allows multiple epochs per task (training to convergence on each task's data before moving on); online continual learning allows only a single pass through each task's stream. Online is dramatically harder — there is no chance to refine the per-task representation — and corresponds more closely to real deployment scenarios. Many published methods report only offline results; the online numbers are typically much weaker. The 2024 standard is to report both, with online as the primary metric.
The naive baseline problem
Several "obvious" baselines have proven surprisingly hard to beat. Naive fine-tuning (just train sequentially with no special continual-learning machinery) is the trivial baseline — it usually fails badly, as expected. But joint training (train on all tasks' data simultaneously, as if continual learning never applied) is the upper bound — methods trying to approximate this performance with sequential access to data are evaluated against it. The gap between the best continual-learning methods and joint training is typically 5–30 percentage points on hard benchmarks, sometimes much more in class-incremental settings. Closing this gap is the central open problem.
More embarrassingly, several papers have shown that simply scaling up model capacity closes much of the continual-learning gap without any specialised method — bigger networks forget less. This deflates the apparent value of fancy continual-learning methods and is part of why the field's claims have moderated in the post-2022 era.
Hyperparameter selection traps
A persistent pitfall: many continual-learning methods are hyperparameter-sensitive (the regularisation coefficient λ in EWC, the buffer size and memory-management policy in replay methods), and the "right" hyperparameter often depends on which task is being evaluated. Standard ML evaluation uses validation data from the same distribution as test data; in continual learning this becomes problematic because the validation set must come from the same task-stream distribution as test. Many published methods have implicitly used per-task hyperparameter tuning that would not be available in a real deployment. The modern standard (Hadsell et al., Mirzadeh et al.) is to specify hyperparameters before any task is seen and report sensitivity.
What good evaluation looks like
The 2026 standard for serious continual-learning evaluation: report ACC and BWT after the final task, evaluate at multiple memory budgets, separate task-incremental from class-incremental scenarios, evaluate online (single-epoch) and offline (multi-epoch) separately, fix hyperparameters before seeing tasks, and report both small-network and large-network variants. The Avalanche library (Lomonaco et al. 2021) implements most of these protocols; using it gives the field a consistent comparison surface that the more ad-hoc 2017–2020 papers lacked.
Applications and Frontier
Continual learning shows up wherever a model must adapt over time without losing prior knowledge — edge devices learning from local data streams, robots picking up new manipulation skills, recommendation systems tracking shifting preferences, and the increasingly important question of how to update large foundation models with new information. Each domain has its own deployment constraints, and the methods of the chapter combine in different ways for different applications.
Edge devices and on-device learning
Smartphones, smart cameras, and IoT devices increasingly need to learn from their local data streams without sending the data to the cloud (privacy, latency, bandwidth). The continual-learning machinery is essential — the device cannot store all past data, must adapt to user-specific patterns, and must do all this with limited compute. The 2024 generation of on-device continual learning uses some combination of tiny replay buffers (latent-feature replay), efficient adapter updates, and aggressive regularisation. Apple's Personalized Federated Learning, Google's Federated Analytics, and similar systems all incorporate continual-learning components.
Robotics and lifelong skill acquisition
Robots that learn new manipulation skills in deployment without forgetting old ones are the canonical lifelong-learning use case. The challenge is severe: new skills are learned from few demonstrations (Part XII Ch 03), the robot cannot store all prior experience, and forgetting a critical skill (don't drop the egg) is unacceptable. Production robotics continual-learning systems combine replay buffers with progressive-network-style architectural growth and aggressive evaluation gates that prevent deployment of updates that degrade safety-critical behaviours.
Recommendation systems and concept drift
Every recommendation system faces continuous concept drift — user preferences change, items churn, seasonal effects shift the distribution. Standard continual-learning machinery applied here typically uses experience replay (a rolling buffer of recent interactions), online updating (streaming SGD on new interactions), and periodic offline retraining as a stability anchor. The deployments at TikTok, YouTube, and similar platforms run continual-learning loops at production scale, with the explicit acknowledgement that "the model is never done training."
Foundation-model updating
The largest continual-learning challenge in 2026 is updating foundation models. A trained LLM has knowledge cut off at its pretraining date; integrating new knowledge without expensive full retraining is an open problem. Several approaches are active: continued pretraining on new data plus a fragment of the original (a form of replay), retrieval-augmented generation (sidestep the problem by storing new knowledge in an external retrieval index rather than the model), and edit methods (ROME, MEMIT — surgical updates to specific factual knowledge). The retrieval-augmented approach has so far been the most successful in production; the edit methods remain promising but unreliable; continued-pretraining-with-replay is the heaviest hammer and the most reliable.
Frontier methods
Several frontiers are active in 2026. Continual learning of foundation models at scale: how to incorporate millions of new tokens into a 100B-parameter model without retraining from scratch, with quality matching joint training. Continual reinforcement learning: agents that keep learning in deployment while preserving prior capabilities; the challenges combine continual-learning machinery with the stability problems of RL. Modular continual architectures: routing-based and mixture-of-experts approaches that allocate per-task experts implicitly, sidestepping forgetting through routing rather than parameter freezing. Theoretical analysis of forgetting: papers like Doan et al. and Mirzadeh et al. on the geometry of loss landscapes in continual learning, which connect to the broader effort to understand why and when methods work.
What this chapter does not cover
Several adjacent areas are out of scope. Online learning on stationary distributions — the classical statistical-learning setup where data arrives in a stream but the underlying distribution does not change — is well-developed in classical ML and lives mostly outside the modern continual-learning literature. Multi-task learning with simultaneous access to all tasks' data (as opposed to sequential access) is conceptually related but has a different methodological focus. Federated learning, where models are updated across distributed devices, intersects with continual learning when the per-device data streams shift over time but is conventionally treated through the federation lens. The cognitive-science literature on human memory consolidation and forgetting is the conceptual ancestor of continual learning but is largely descriptive rather than algorithmic. And the formal-learning-theory analysis of regret-style guarantees in online learning, while elegantly relevant, mostly produces theoretical bounds that do not translate to deep-learning practice.
Further reading
Foundational papers and surveys for continual and lifelong learning. The Parisi survey, the canonical EWC and iCaRL papers, and the Avalanche documentation are the right starting kit for practitioners.
- Overcoming Catastrophic Forgetting in Neural Networks (EWC). The EWC paper. Introduces Elastic Weight Consolidation as the canonical regularisation-based continual-learning method, with a Bayesian-posterior-as-prior interpretation that grounds the entire regularisation family. The single most-cited continual-learning paper and the right reading for the foundational concepts. The reference for regularisation-based continual learning.
- iCaRL: Incremental Classifier and Representation Learning. The iCaRL paper. The canonical replay-based class-incremental method, combining herding-style buffer selection, knowledge distillation, and a nearest-mean classifier. The right reading after EWC for understanding how replay-based methods bridge the regularisation-vs-replay divide. Pair with Lopez-Paz & Ranzato 2017 (GEM) for the gradient-projection variant. The reference for replay-based class-incremental learning.
- Continual Lifelong Learning with Neural Networks: A Review. The standard survey. Comprehensive taxonomy of continual-learning methods (regularisation-, replay-, and architecture-based), with biological motivations and standard benchmarks. The right second reading after the canonical method papers and a useful organisational framework for the literature. The survey reference for the field.
- Three Scenarios for Continual Learning. The taxonomy paper. Establishes the task-incremental, domain-incremental, and class-incremental distinction that is the standard scenario taxonomy in the field. Required reading for understanding what any given continual-learning method is actually solving and what benchmark numbers mean. The reference for continual-learning scenarios.
- Progressive Neural Networks. The Progressive Networks paper. Introduces architectural growth as a continual-learning strategy, with frozen prior columns and lateral connections to enable transfer without forgetting. The natural reading for understanding the architectural family of methods and the foundation for adapter-based continual learning. The reference for architectural continual learning.
- Continual Learning with Deep Generative Replay. The DGR paper. Establishes generative replay as the privacy-preserving alternative to literal replay buffers. While the original GAN/VAE-based approach has been overtaken by diffusion-based variants, the framework remains the standard reference for the generative-replay family. The reference for generative replay.
- Variational Continual Learning. The VCL paper. Extends EWC to a full mean-field variational posterior, with the previous task's posterior used as the prior for the next. The right reading for the Bayesian framing of continual learning and the natural follow-up to Ch 07's variational-inference machinery. The reference for Bayesian continual learning.
- Gradient Episodic Memory for Continual Learning (GEM). The GEM paper. Introduces the gradient-projection approach to replay, where current-task gradients are projected to be compatible with past-task gradients computed on a memory buffer. Pair with A-GEM (Chaudhry et al. 2019) for the scalable variant that became the production-friendly version. The reference for gradient-based replay methods.
- Avalanche: an End-to-End Library for Continual Learning. The Avalanche library paper. The standard production-grade library for continual learning, implementing all the major method families and the standard benchmarks (Split-MNIST, Split-CIFAR, CORe50, Stream-51) with consistent evaluation protocols. The right tooling reference for any new continual-learning project. The library you will actually use.