Continual & Lifelong Learning: training that doesn't forget.
Train a neural network on task A, then on task B, and the network's performance on A collapses — sometimes to chance. This is catastrophic forgetting, and it is the fundamental obstacle to building learning systems that accumulate knowledge over time. Continual learning is the family of methods that fights forgetting: regularising weights that mattered for previous tasks, replaying old data, expanding the network's capacity, separating shared and task-specific parameters. This chapter develops the framework, the three method families (regularisation, replay, architectural), the connection to neuroscience-inspired memory consolidation, and the deployment patterns for the lifelong-learning future where models update continuously rather than getting retrained from scratch.
Prerequisites & orientation
This chapter assumes neural-network fundamentals (Part V Ch 01–02), basic optimisation (Part I Ch 03), and familiarity with regularisation (Part V Ch 04). The Bayesian-deep-learning material of Part XIII Ch 07 is helpful background — several continual-learning methods (EWC and friends) have a clean Bayesian interpretation as posterior updates — but the chapter develops what is needed. The meta-learning framework of Ch 08 is a related but distinct topic; meta-learning trains for rapid adaptation to new tasks, continual learning trains for retention of old ones. Both share the underlying problem of "how does a network handle a sequence of tasks?" but their objectives are different.
Two threads run through the chapter. The first is the stability-plasticity dilemma: a network must be plastic enough to learn new tasks, but stable enough to retain old ones, and these requirements are in genuine tension. Every method in the chapter is a different way of negotiating that trade-off. The second thread is the three method families: regularisation methods (penalise changes to important weights), replay methods (rehearse old data while training on new), and architectural methods (allocate new capacity for new tasks). These are the standard taxonomy in 2026, and most production continual-learning systems combine techniques from at least two families.
Catastrophic Forgetting and Why It Happens
Train a network to classify cats versus dogs, then continue training it on cars versus boats, and a striking thing happens: the network's performance on cats versus dogs collapses, often to chance. This is not subtle degradation; it is wholesale erasure of the earlier task. The phenomenon is called catastrophic forgetting, and understanding why it happens is the precondition for every method in the chapter.
The phenomenon
Catastrophic forgetting (also called catastrophic interference) was identified in the late 1980s by McCloskey and Cohen, in the context of multi-layer perceptrons. Their experiments showed that a network trained sequentially on two tasks did not gracefully accumulate knowledge — it learned the second task by overwriting the first. The result was so consistent and so severe that it became a standard argument against connectionist models: real intelligence requires accumulating skills over time, and standard neural networks couldn't do it.
The forgetting is mechanically simple to demonstrate. Take any modern network, train it on MNIST digits 0–4, then continue training on digits 5–9. Test accuracy on 0–4 will be near 0% by the end of the second phase. The network has not forgotten how to classify; it has reconfigured its weights to classify only the second set of digits. The output neurons for the first set never fire because the upstream features no longer support them.
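A minimal PyTorch sketch of that experiment, assuming a hypothetical `mnist_loader(digits)` helper that yields batches restricted to the given digit classes (any split-MNIST loader works):

```python
import torch
import torch.nn as nn

# Small MLP with a single shared 10-way head, as in the split-MNIST example.
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def train_phase(loader, epochs=1):
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

def accuracy(loader):
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total

train_phase(mnist_loader(digits=range(0, 5)))    # task A: digits 0-4
acc_before = accuracy(mnist_loader(digits=range(0, 5)))
train_phase(mnist_loader(digits=range(5, 10)))   # task B: digits 5-9
acc_after = accuracy(mnist_loader(digits=range(0, 5)))
# Typically acc_before is high and acc_after collapses toward zero.
```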
Why standard SGD causes forgetting
The mechanism is straightforward. SGD on the second task's loss has no reason to preserve the weight configuration that solved the first task — the gradient points purely toward the second task's optimum. If the first task's weight configuration happens to be far from the second task's optimum, the optimiser walks away from it. Crucially, the network has many equivalent solutions: different weight settings often give similar second-task performance, but only some of those settings preserve first-task performance. Plain SGD has no preference among them, so it usually picks one that destroys the first task.
This is the stability-plasticity dilemma in its sharpest form. A network must be plastic enough to fit the new data and stable enough to retain the old. Standard SGD is all plasticity, no stability; methods in this chapter restore stability through different mechanisms.
The biological contrast
Humans and animals do not catastrophically forget. A child who learns to recognise dogs can later learn to recognise cars without losing the dog skill. The neuroscience explanation involves several mechanisms: synaptic consolidation (synapses important for prior memories become harder to modify over time), memory replay during sleep (the hippocampus replays recent experiences to the cortex, interleaving them with consolidated knowledge), and structural plasticity (new neurons are added in some brain regions). The continual-learning literature draws explicitly on these mechanisms — EWC (Section 3) is the synaptic-consolidation analogue, replay methods (Section 4) are the hippocampal-replay analogue, and architectural methods (Section 6) are the structural-plasticity analogue.
What the chapter is and isn't about
The chapter focuses on the case where tasks arrive sequentially and only the current task's data is available (or a small buffer of past data). This is genuinely hard. The easier related problems — multi-task learning where all tasks' data is available simultaneously, fine-tuning where the original training data can be re-accessed, online learning on a stationary distribution — have their own well-developed methods that are mostly not the focus here. The defining constraint of continual learning is the data restriction, and most of the chapter's complexity comes from working around it.
For most current ML production, catastrophic forgetting is dodged by retraining from scratch — collect all data, train once, deploy. This works for problems with stable distributions and rare retrains. It fails for systems that must adapt continuously: robots learning new skills in the wild, recommendation systems tracking shifting user preferences, foundation models being updated with new knowledge. As these regimes become more common in 2026, continual learning moves from "research curiosity" to "production requirement."
Continual-Learning Scenarios and Evaluation
"Continual learning" covers several distinct scenarios that differ in what the model knows about task boundaries and what changes between tasks. Confusing them — which is easy because much of the literature is sloppy about it — leads to comparing methods that solve different problems. The standard taxonomy of van de Ven and Tolias (2019) cuts the space cleanly and is the right starting point.
Three scenarios: task, domain, class incremental
Task-incremental learning: each task has its own output head, and the task identity is provided at test time. The network knows which task's head to use for each test example, and only needs to learn the per-task feature representations without interference. This is the easiest scenario; many methods (especially regularisation-based ones) handle it well. Example: classifying digits in five separate two-class problems, with task ID provided.
Domain-incremental learning: the input distribution changes across tasks but the label space does not, and task identity is not provided at test time. The network must produce the right label without knowing which task it came from, but the labels are the same set across tasks. Example: classifying digits where each task changes the rendering style (handwriting, printed, etc.) but the digit labels remain 0–9.
Class-incremental learning: new classes arrive over time and task identity is not provided at test time. The network must distinguish all classes seen so far, including across task boundaries. This is the hardest scenario and the most realistic — it matches the way new categories appear in the real world. Example: recognising bird species with new species added over time, with the model required to distinguish all species (including species from previous tasks) at any test point.
Why class-incremental is so hard
The class-incremental case is dramatically harder than the task-incremental case because the model has to compare classes from different tasks against each other. With only the current task's data available, the network has no way to learn appropriate calibration between current-task classes and prior-task classes — it never sees them in the same gradient step. The empirical pattern in benchmarks: many regularisation-based methods that work well in task-incremental settings collapse to near-chance performance in class-incremental settings. Replay methods (Section 4) close most of this gap because they explicitly bring prior-task examples back into the current gradient.
Standard metrics
Three metrics dominate continual-learning evaluation. Average accuracy (ACC) is the model's accuracy on all tasks seen so far, averaged after the final task. Backward transfer (BWT) measures how learning a new task affects performance on prior tasks — negative BWT is forgetting, positive BWT is unusual but possible (where new tasks reinforce old ones). Forward transfer (FWT) measures how prior tasks help future ones, compared to learning each task in isolation.
The standard reporting pattern: a method's headline number is its ACC, with BWT reported alongside as a forgetting diagnostic. A method with high ACC and BWT near zero is forgetting little; a method with high ACC at the cost of strongly negative BWT is essentially "good only on the most recent task."
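In practice these metrics are computed from an accuracy matrix R, where R[i, j] is the accuracy on task j after training through task i (the formulation of Lopez-Paz and Ranzato 2017). A minimal sketch, using the "learned in isolation" baseline accuracies b for FWT as described above:

```python
import numpy as np

def cl_metrics(R, b=None):
    """R[i, j]: accuracy on task j after training through task i (T x T).
    b[j]: accuracy of a model trained on task j alone (baseline for FWT)."""
    T = R.shape[0]
    acc = R[-1].mean()                            # average accuracy after the final task
    bwt = (R[-1, :-1] - np.diag(R)[:-1]).mean()   # negative values indicate forgetting
    fwt = None
    if b is not None:
        # accuracy on task j just before training on it, relative to the baseline
        fwt = np.mean([R[j - 1, j] - b[j] for j in range(1, T)])
    return acc, bwt, fwt
```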
Memory and compute constraints
Crucially, continual-learning evaluation must specify what the model is allowed to store. Methods that keep a small buffer of past examples (a few hundred per class) are fundamentally different from methods that keep nothing. Most modern benchmarks specify a fixed memory budget and require methods to compete within it; the resulting trade-off curve (memory vs. forgetting) is the most informative way to compare methods. The 2020s standard is to report results across several memory budgets (0, 200, 500, 1000 stored examples per task) so that the memory-forgetting trade-off is explicit.
Regularisation Methods: EWC and Friends
The first principled answer to catastrophic forgetting was Elastic Weight Consolidation (EWC), introduced by Kirkpatrick et al. in 2017. The idea is biologically motivated and mathematically clean: identify which weights mattered most for the previous task, and penalise changes to those weights when training on the new task. EWC and its successors form the regularisation family of continual-learning methods, the conceptually simplest family and a strong baseline against which more complex methods are evaluated.
EWC: posterior-as-prior
Elastic Weight Consolidation takes a Bayesian perspective on continual learning. After training on task A, the network's weights θ are a (point) approximation to the posterior p(θ | D_A). When task B arrives, the right Bayesian update is to use this posterior as the prior for task B: p(θ | D_A, D_B) ∝ p(D_B | θ) · p(θ | D_A). The technical move is to approximate the previous posterior as a Gaussian centred at the post-task-A weights θ*_A with covariance given by the inverse Fisher information. The resulting loss for task B is

L(θ) = L_B(θ) + (λ/2) Σ_i F_i (θ_i − θ*_A,i)²,

where L_B is task B's loss, F_i is the i-th diagonal Fisher entry, and λ controls how strongly the old-task weights are anchored.
The Fisher information is the natural measure of "importance" because it captures how much the loss is curved around the current parameters — high Fisher means the parameter's value is well-determined by task A's data. The diagonal approximation is the practical concession; the full Fisher matrix is too expensive to store. EWC works well in task-incremental settings, less well in class-incremental settings, and remains the canonical regularisation-based continual-learning method.
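A minimal PyTorch sketch of the diagonal-Fisher estimate and the resulting penalty, reusing the illustrative `model`/`loss_fn` names from the earlier sketch and using the common empirical-Fisher shortcut of averaging squared loss gradients:

```python
import torch

def diagonal_fisher(model, loader, loss_fn):
    """Empirical diagonal Fisher: average squared gradient over task A's data."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for x, y in loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            fisher[n] += p.grad.detach() ** 2
        n_batches += 1
    return {n: f / n_batches for n, f in fisher.items()}

# Snapshot after finishing task A.
fisher_A = diagonal_fisher(model, loader_A, loss_fn)
theta_A = {n: p.detach().clone() for n, p in model.named_parameters()}

def ewc_penalty(model, lam=1000.0):
    """Quadratic pull toward the task-A weights, weighted by importance."""
    penalty = sum((fisher_A[n] * (p - theta_A[n]) ** 2).sum()
                  for n, p in model.named_parameters())
    return 0.5 * lam * penalty

# Task-B training objective: loss_fn(model(x), y) + ewc_penalty(model)
```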
Synaptic Intelligence and online importance
EWC requires a separate Fisher computation after each task, which can be expensive. Synaptic Intelligence (SI, Zenke et al. 2017) tracks per-parameter importance online during training, using the path integral of the gradient over the training trajectory. The intuition: a parameter's importance is its cumulative contribution to reducing the loss, accumulated by multiplying its gradient by its update at every step. The importance estimate is updated continuously, with no separate Fisher pass. SI matches or outperforms EWC on most benchmarks at lower computational cost and is the more common choice in 2026 production deployments.
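A minimal sketch of the online accumulation; names are illustrative, and the caller is responsible for snapshotting gradients and parameter deltas around each optimiser step:

```python
import torch

class SynapticIntelligence:
    """Online path-integral importance, in the style of Zenke et al. 2017."""
    def __init__(self, model, xi=0.1):
        self.model, self.xi = model, xi
        params = dict(model.named_parameters())
        self.w = {n: torch.zeros_like(p) for n, p in params.items()}      # running path integral
        self.omega = {n: torch.zeros_like(p) for n, p in params.items()}  # consolidated importance
        self.start = {n: p.detach().clone() for n, p in params.items()}

    def accumulate(self, grads, deltas):
        """Call once per optimiser step with the pre-step gradients and the
        parameter changes that step produced: w += -g * delta_theta."""
        for n in self.w:
            self.w[n] -= grads[n] * deltas[n]

    def consolidate(self):
        """At a task boundary, normalise by total parameter movement."""
        for n, p in self.model.named_parameters():
            total = p.detach() - self.start[n]
            self.omega[n] += self.w[n] / (total ** 2 + self.xi)
            self.w[n].zero_()
            self.start[n] = p.detach().clone()
```

The per-task penalty then mirrors the EWC form, with Ω in place of the Fisher.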
Memory Aware Synapses
Aljundi et al.'s Memory Aware Synapses (MAS, 2018) takes another angle: importance is estimated from the sensitivity of the network's output (rather than its loss) to each parameter. The advantage is that MAS can compute importance on unlabelled data — useful when labels for past tasks are unavailable. MAS, EWC, and SI form the standard regularisation-based trio, and most continual-learning papers compare against all three as baselines.
Strengths and limitations
Regularisation methods are appealingly simple: no replay buffer, no architectural changes, just an extra term in the loss. They work reasonably well in task-incremental settings and are computationally cheap. Their fundamental weakness is that the importance estimate is a point estimate of curvature; when many parameters become "important" after enough tasks, the regulariser becomes too restrictive and the network loses plasticity. In class-incremental settings, regularisation methods alone typically achieve only 50–70% of the accuracy of the strongest replay-based methods. The modern view: regularisation is part of the toolkit, not the whole solution; production systems usually combine regularisation with replay.
Replay and Rehearsal Methods
If catastrophic forgetting is caused by the gradient never seeing prior-task data, the most direct fix is to make sure it does. Replay methods store a small subset of past examples and interleave them with current-task training, typically achieving the strongest empirical results in continual learning at the cost of memory and (often) privacy concerns about storing user data. The replay family is the dominant approach in 2026.
Experience replay: the core idea
Experience replay in its simplest form: maintain a small buffer of examples from past tasks, and at each training step on the current task, also include some samples from the buffer in the gradient computation. The buffer is typically much smaller than the original training data — a few hundred to a few thousand examples per task — but its presence in the gradient prevents the network from drifting away from past-task solutions.
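The whole method fits in a few lines. A sketch, assuming `buffer` is a list of stored (x, y) example pairs and the other names follow the earlier sketches:

```python
import random
import torch

def replay_step(model, opt, loss_fn, batch, buffer, replay_size=32):
    """One current-task step with rehearsal: mix buffer samples into the gradient."""
    x, y = batch
    if buffer:
        rx, ry = zip(*random.sample(buffer, min(replay_size, len(buffer))))
        x = torch.cat([x, torch.stack(rx)])
        y = torch.cat([y, torch.stack(ry)])
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
```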
The empirical headline: even very simple experience replay matches or beats almost every regularisation-based method on standard benchmarks, and crushes them on class-incremental settings where regularisation methods fall apart. The 2019–2021 literature established this clearly, and the attention shifted from "how do we avoid storing data?" to "how do we store the right data and use it well?"
Reservoir sampling and buffer management
The buffer must be filled with care. Naive strategies (for example, keeping only the most recent examples) bias the buffer toward recent tasks. Reservoir sampling maintains a uniformly random sample from the entire stream: once the buffer is full, when example k arrives, it replaces a random buffer slot with probability buffer-size/k. This produces a buffer where every past example has equal probability of being retained, regardless of how long ago it appeared. Reservoir sampling is the standard buffer-management strategy when no per-task structure is available.
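Reservoir sampling (algorithm R) is a few lines; a sketch:

```python
import random

def reservoir_update(buffer, example, k, capacity):
    """Keep a uniform random sample of the stream seen so far.
    k is the 1-based index of the incoming example."""
    if len(buffer) < capacity:
        buffer.append(example)
    else:
        j = random.randrange(k)   # uniform over 0 .. k-1
        if j < capacity:
            buffer[j] = example   # replace with probability capacity / k
```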
For class-incremental learning, buffers are usually maintained per-class: each class gets a fixed number of slots, filled with its most representative examples. The iCaRL method (Rebuffi et al. 2017) selects exemplars so that their mean in feature space stays close to the true class mean; this herding-style selection produces more representative buffers than random selection.
iCaRL and the nearest-mean classifier
iCaRL (incremental Classifier and Representation Learning) is the most influential replay method. The recipe combines several pieces. First, a feature extractor is trained on the current task with the standard loss plus a distillation loss against the previous network. Second, the buffer is updated using herding-style selection. Third — and this is the key — classification at test time uses a nearest-mean classifier in feature space rather than the network's output head. The class means are computed over the buffer; new test examples are assigned to the closest class mean. This sidesteps the calibration problems that plague class-incremental learning's softmax output and dramatically improves results.
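A sketch of the nearest-mean-of-exemplars step, assuming a hypothetical `features(x)` embedding function and a per-class exemplar dictionary `buffer`:

```python
import torch

def class_means(features, buffer):
    """Mean of L2-normalised exemplar features per class, itself renormalised."""
    means = {}
    for c, exemplars in buffer.items():
        f = features(torch.stack(exemplars))
        f = f / f.norm(dim=1, keepdim=True)
        mu = f.mean(dim=0)
        means[c] = mu / mu.norm()
    return means

def nearest_mean_predict(features, means, x):
    """Assign each test example to the class with the closest mean."""
    f = features(x)
    f = f / f.norm(dim=1, keepdim=True)
    classes = sorted(means)
    M = torch.stack([means[c] for c in classes])   # (num_classes, d)
    return [classes[i] for i in torch.cdist(f, M).argmin(dim=1).tolist()]
```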
iCaRL became the canonical baseline for class-incremental benchmarks and remains competitive in 2026, with variants (LUCIR, BiC, PODNet, FOSTER) refining different parts of the recipe.
GEM and A-GEM: gradient projections
Gradient Episodic Memory (GEM, Lopez-Paz and Ranzato 2017) uses the buffer differently. Rather than mixing buffer examples into the loss, GEM computes the gradient on the buffer for each past task and projects the current-task gradient so that it does not increase the loss on any past task. The constraint: the projected gradient has a non-negative inner product with each past-task gradient. Enforcing this requires solving a quadratic program at each step, which is why GEM is computationally expensive. A-GEM (Chaudhry et al. 2019) simplifies to a single constraint against the average gradient across past tasks rather than per-task constraints, dropping the per-step cost dramatically while retaining most of the benefit.
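The A-GEM projection itself is one line of algebra on flattened gradient vectors; a sketch:

```python
import torch

def agem_project(g, g_ref):
    """Project the current-task gradient g (flattened) so it does not conflict
    with the average buffer gradient g_ref (A-GEM, Chaudhry et al. 2019)."""
    dot = torch.dot(g, g_ref)
    if dot < 0:   # g would increase the buffer loss; remove the conflicting component
        g = g - (dot / torch.dot(g_ref, g_ref)) * g_ref
    return g
```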
Privacy and the data-storage problem
Storing past examples is at odds with privacy regulations (GDPR's right to erasure, HIPAA, sectoral data-retention rules). For applications with regulated data, replay buffers may simply not be allowed, which is why generative replay (Section 5) remains an active research area despite the practical superiority of stored-example replay. The compromise often used in production: replay buffers are short-lived (held only during active retraining) and rotated regularly, with the rotation policy designed to comply with the relevant data-retention regime.
Generative Replay and Pseudo-Rehearsal
If you can't store past data, generate it. Train a generative model alongside the classifier; when a new task arrives, sample synthetic past examples from the generator and replay them during training. The generative-replay idea sidesteps the privacy and storage issues of literal replay buffers, at the cost of generative-model quality bounding overall performance. The 2024 generation of diffusion models has made this approach much more practical than it was even five years ago.
The pseudo-rehearsal idea
Pseudo-rehearsal goes back to Robins (1995): generate synthetic inputs (in the early days, random noise; later, samples from a learned generator) and pair them with the current model's predictions, then mix these into the training data alongside the new task. The hope: the synthetic inputs and the model's predictions on them encode the model's current input-output mapping, which gets preserved through training on the new task. The early versions used trivial generators and worked badly; the modern versions use real generative models and work surprisingly well.
Deep generative replay
Deep Generative Replay (Shin et al. 2017) was the first scaled version: train a VAE or GAN alongside the main classifier. When a new task arrives, sample synthetic past examples from the generator and label them using the previous classifier (the "scholar-generator" pair). Train the new classifier on the new task data plus the labelled synthetic past examples; train a new generator on the new task data plus synthetic past examples. The new pair becomes the scholar-generator for the next task.
DGR worked moderately well on simple datasets but struggled on complex ones because GANs and VAEs of the 2017 era couldn't produce convincing-enough samples. As generation quality degrades over chained re-training, the method's fidelity drifts; this is the canonical failure mode of generative-replay approaches.
Diffusion-based replay
The 2022–2026 generation of diffusion models has revived generative replay. Diffusion models produce dramatically higher-quality samples than VAEs or GANs at comparable scale, and unlike GANs they admit principled likelihood-based training that resists mode collapse during chained retraining. Recent papers (e.g., DDGR, CL-Diffusion) demonstrate that diffusion-based generative replay closes most of the gap to literal-replay methods on benchmarks where past data cannot be stored. The compute cost is higher than literal replay (you have to train and sample from a generator), but the privacy properties are dramatically better.
Latent replay: a middle ground
An intermediate strategy is to store examples in feature space rather than in the input space — the network's intermediate representations from past examples are kept while the inputs themselves are discarded. This compresses the storage requirement (features are usually smaller than raw images) and provides some privacy protection (features are harder to invert to recover the original input than literal storage). Latent replay is the standard choice for on-device continual learning where storage is constrained and the input data is sensitive.
When generative replay is the right choice
Three situations favour generative replay over literal replay. First, when the data cannot be stored (regulatory, privacy, or device-storage constraints). Second, when the data distribution is naturally generative-friendly (images, audio, text — domains where generative models work well). Third, when the goal is generative-replay-style data augmentation as an end in itself, beyond just continual learning. For most other applications, literal replay outperforms generative replay at lower complexity, which is why production continual-learning systems still mostly use literal replay where they can.
Architectural Methods: Progressive Networks and Beyond
A different angle on the stability-plasticity dilemma: instead of asking the same network to handle all tasks, allocate fresh capacity for each new task while keeping prior-task parameters frozen. Architectural methods sidestep forgetting by construction — the parameters that solved task A are not modified when task B arrives, so they cannot be overwritten. The cost is parameter growth and sometimes loss of cross-task generalisation; the benefit is a clean theoretical guarantee against forgetting.
Progressive Networks
Progressive Networks (Rusu et al. 2016) is the canonical architectural method. For each new task, instantiate a new "column" — a fresh copy of the network architecture — and freeze all prior columns. The new column receives lateral connections from each prior column at every layer, allowing it to reuse prior-task features but not modify them. By construction, prior-task performance cannot degrade because prior-task weights cannot change; the new column adds whatever capacity is needed for the new task while leveraging prior representations through the lateral connections.
The cost is parameter growth — N tasks means N columns of weights — which is prohibitive for tasks numbering in the dozens. Progressive Networks were popular in early continual-RL work where tasks were few (5–10) and the per-column cost was acceptable. They remain a strong baseline against which more parameter-efficient architectural methods are evaluated.
PackNet and parameter pruning
PackNet (Mallya and Lazebnik 2018) pushes the per-task capacity argument in the opposite direction: rather than adding parameters per task, partition the existing network's parameters across tasks via pruning. After training on task A, prune the lowest-magnitude weights — say 50% of them — fix the remaining ones, and train on task B using only the pruned-out (zeroed) parameters. After task B, repeat for task C using the remaining unallocated parameters. The result is a single fixed-size network in which each subset of parameters is dedicated to a specific task, and forgetting is impossible by construction.
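A sketch of the mask bookkeeping for a single weight tensor (illustrative; the actual method prunes layer-wise and retrains briefly after each prune):

```python
import torch

def packnet_allocate(weight, free_mask, keep_frac=0.5):
    """After training task t on the free weights, keep the largest-magnitude
    keep_frac of them for task t and zero the rest for future tasks."""
    free_vals = weight[free_mask].abs()
    k = max(1, int(keep_frac * free_vals.numel()))
    threshold = free_vals.topk(k).values.min()
    task_mask = free_mask & (weight.abs() >= threshold)
    weight.data[free_mask & ~task_mask] = 0.0    # released back into the free pool
    return task_mask, free_mask & ~task_mask     # (this task's weights, still-free weights)
```

At inference time, each task records its mask and uses only the weights allocated to it plus those frozen by earlier tasks.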
PackNet's elegance is its zero parameter growth. Its cost is that the per-task capacity declines as more tasks are added — task 5 gets the smallest fraction of the network. The number of tasks it can handle is bounded by the network's redundancy. In practice, PackNet handles 5–10 tasks well on networks designed with extra capacity, and is a popular choice for embedded continual-learning deployments where parameter count is fixed by hardware.
Adapter-based methods
The modern architectural approach uses adapters — small task-specific modules inserted into a fixed backbone network. The backbone is trained once (or pretrained from a foundation model) and frozen; each new task gets a fresh small adapter. Piggyback (Mallya, Davis, and Lazebnik 2018) takes this further with binary masks per task; HAT (Serra et al. 2018) uses learned attention masks. Adapter-based methods scale to more tasks than Progressive Networks (per-task cost is much smaller) and have become the dominant architectural strategy in the foundation-model era — every task-specific LoRA module is conceptually an adapter doing one-task continual learning on top of a frozen pretrained model.
Strengths and weaknesses
Architectural methods have the cleanest forgetting guarantees — they cannot forget by construction. Their cost is twofold: the parameter overhead per task (or per-task capacity reduction), and the difficulty of cross-task generalisation. A method that walls off task-specific parameters cannot easily benefit from positive transfer between tasks, and tasks that share underlying structure (like classifying different bird species) end up not sharing as much as they should. The 2026 consensus: architectural methods are the right choice when forgetting is intolerable and tasks are well-separated; replay-based methods are the right choice when transfer between tasks matters; production systems often combine both.
Bayesian Continual Learning
Continual learning is structurally a Bayesian problem: the posterior after task A becomes the prior for task B. Done exactly, this would solve the forgetting problem entirely — past data informs the prior, new data updates the posterior, no information is lost. Done approximately, with the kinds of variational and Laplace tools of Bayesian deep learning (Part XIII Ch 07), it produces a principled family of continual-learning methods. The connection to EWC is direct, and the recent literature has produced more powerful Bayesian continual learners.
The sequential-Bayesian formulation
The Bayesian continual-learning framework: maintain a posterior p(θ | D_1:t) over the network weights given all data seen up to task t. When task t+1 arrives, the right update is

p(θ | D_1:t+1) ∝ p(D_t+1 | θ) · p(θ | D_1:t),

i.e., the old posterior serves as the prior against which the new task's likelihood is weighed.
EWC (Section 3) is the most basic instance: the posterior is approximated as a Gaussian centred at the MAP with covariance from the diagonal Fisher; the new-task loss adds a quadratic regulariser that pulls toward the previous mean with strength given by the Fisher. The Bayesian framing makes the EWC choice transparent — it is a particular Gaussian posterior approximation propagated through tasks.
Variational Continual Learning
Variational Continual Learning (VCL, Nguyen et al. 2018) extends EWC to a full mean-field-Gaussian variational posterior — every weight has its own Gaussian variational distribution rather than a point estimate plus diagonal covariance. The posterior after task t is used as the prior for task t+1, and the new variational posterior is fit by maximising the standard ELBO. The result is a continual-learning method that produces calibrated uncertainty as a side benefit and tends to outperform EWC on benchmarks where multiple posterior modes matter.
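A sketch of the per-task objective for mean-field Gaussian posteriors, where the closed-form diagonal-Gaussian KL pulls q_t toward the previous task's posterior (names are illustrative):

```python
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, summed."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    return 0.5 * (logvar_p - logvar_q
                  + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0).sum()

# Per-task negative ELBO (one common convention scales the KL by dataset size):
# loss = expected_nll(q_t, batch) + gaussian_kl(mu_t, logvar_t,
#                                               mu_prev, logvar_prev) / N_t
```

The expected negative log-likelihood is estimated with the reparameterisation trick, exactly as in standard variational Bayesian deep learning.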
The cost is the standard variational-inference cost — twice as many parameters (means and variances), reparameterisation-trick training. The benefit is a more honest posterior that supports the kind of selective-prediction and OOD detection covered in Ch 07. Production deployments of Bayesian continual learning are still rare but the method has strong empirical support.
Posterior-distillation methods
An alternative to maintaining an explicit posterior is posterior distillation: when training on a new task, ensure the network's output distribution remains close to the previous network's output distribution on representative inputs. Methods like Learning without Forgetting (LwF, Li and Hoiem 2017) do exactly this with a knowledge-distillation loss against the previous network's logits. LwF is very simple (one extra loss term) and works well as a baseline in task-incremental settings; it was an important transitional method between regularisation-based and replay-based families.
Why Bayesian methods are not yet dominant
Despite the principled framing, Bayesian continual-learning methods have not become the dominant production choice. The reasons are mostly empirical: the approximation error in mean-field Gaussian posteriors compounds over many tasks, leading to pathologies (over-regularisation, brittle uncertainty estimates) that simpler replay-based methods avoid. Recent work has explored richer posterior approximations (rank-1 perturbations, normalising-flow posteriors, structured-covariance forms) that close some of this gap; the direction is active but not yet settled. For most applications in 2026, the Bayesian framing is conceptually useful but the production methods are replay-based.
Meta-Continual Learning and Online Adaptation
If meta-learning (Ch 08) trains a model to adapt rapidly to new tasks, and continual learning trains a model to retain old ones, the natural combination is a model that does both — adapts quickly without forgetting. The intersection has produced some of the most interesting recent continual-learning methods and connects directly to the way large foundation models behave when fine-tuned on streams of new data.
The OML framework
Online aware Meta-Learning (OML, Javed and White 2019) explicitly meta-learns representations that resist forgetting. The setup: meta-train on streams of tasks with the explicit objective of low forgetting after sequential updates. The inner loop performs a sequence of gradient updates on a stream; the outer loop measures total accuracy across the stream and updates the meta-parameters accordingly. The result is a representation that, when fine-tuned sequentially, retains past-task knowledge much better than a representation trained without this objective.
OML demonstrated something important: forgetting is largely a property of the representation, not just the optimisation. A well-meta-learned representation forgets dramatically less than a representation that was never trained for sequential robustness, with no other changes to the training pipeline. This suggests that pretraining on diverse tasks (which approximates the OML setup implicitly) is a powerful continual-learning strategy in itself.
ANML and modulating networks
A Neuromodulated Meta-Learning Algorithm (ANML, Beaulieu et al. 2020) extends OML with an explicit modulation network. The architecture has two parts: a prediction network that produces predictions and a neuromodulation network that gates which neurons fire on each input. The gating provides a built-in mechanism for selectively activating different parts of the network for different tasks, reducing interference. ANML achieves stronger continual-learning results than OML and remains a strong meta-CL baseline.
MER: meta-experience replay
Meta-Experience Replay (MER, Riemer et al. 2019) combines explicit experience replay with a meta-learning-style gradient alignment. At each step, MER computes gradients on current-task and replay examples and uses Reptile-style meta-updates (Section 4 of Ch 08) to align the gradients across tasks. The result is a method that explicitly trains for compatible gradients across tasks, reducing interference. MER outperforms vanilla replay on several benchmarks at modest extra cost.
Continual learning as a meta-learning instance
Stepping back: continual learning can be framed as a particular meta-learning problem where the "tasks" are sequences of tasks and the "performance" metric is the average post-stream accuracy. The meta-learner learns to perform online updates that respect both current performance and prior retention. This framing connects continual learning to the in-context-learning literature of Ch 08 — a sufficiently large foundation model trained on enough sequential-task data could in principle handle continual updates implicitly via in-context conditioning, without explicit weight updates at all.
The 2024–2026 literature on long-context LLMs as continual learners has produced encouraging early results: a foundation model with a long context window can implicitly maintain a "task memory" via the context and adapt to streams of tasks without weight updates. Whether this scales to true lifelong learning (millions of tasks over months) remains an open empirical question; the early evidence suggests partial success.
Benchmarks, Pitfalls, and Evaluation
Continual-learning evaluation is fraught. Several published-and-hyped methods have failed to replicate, several "obvious" baselines outperform elaborate methods on careful comparison, and the gap between benchmark performance and real-world deployment robustness is large. This section covers the standard benchmarks, the common evaluation pitfalls, and the protocols that actually work.
Standard benchmarks
The dominant benchmarks fall into three tiers. Split-MNIST and Permuted-MNIST are toy benchmarks built from MNIST: split the digit classes into 5 sequential 2-class tasks (Split-MNIST) or apply different fixed pixel permutations to define each task (Permuted-MNIST). They are too easy by 2026 standards but remain useful as sanity checks. Split-CIFAR-100 and Split-TinyImageNet are mid-tier — CIFAR-100 split into 10 or 20 sequential tasks of 10 or 5 classes each. Most published continual-learning methods report results on these. CORe50 and Stream-51 are real-world streaming benchmarks designed for class-incremental learning with realistic distributional shifts; they are the right standard for production-relevant evaluation in 2026.
The "online vs. offline" distinction
A subtle but crucial evaluation choice: how many epochs does the model get on each task's data? Offline continual learning allows multiple epochs per task (training to convergence on each task's data before moving on); online continual learning allows only a single pass through each task's stream. Online is dramatically harder — there is no chance to refine the per-task representation — and corresponds more closely to real deployment scenarios. Many published methods report only offline results; the online numbers are typically much weaker. The 2024 standard is to report both, with online as the primary metric.
The naive baseline problem
Several "obvious" baselines have proven surprisingly hard to beat. Naive fine-tuning (just train sequentially with no special continual-learning machinery) is the trivial baseline — it usually fails badly, as expected. But joint training (train on all tasks' data simultaneously, as if continual learning never applied) is the upper bound — methods trying to approximate this performance with sequential access to data are evaluated against it. The gap between the best continual-learning methods and joint training is typically 5–30 percentage points on hard benchmarks, sometimes much more in class-incremental settings. Closing this gap is the central open problem.
More embarrassingly, several papers have shown that simply scaling up model capacity closes much of the continual-learning gap without any specialised method — bigger networks forget less. This deflates the apparent value of fancy continual-learning methods and is part of why the field's claims have moderated in the post-2022 era.
Hyperparameter selection traps
A persistent pitfall: many continual-learning methods are hyperparameter-sensitive (the regularisation coefficient λ in EWC, the buffer size and memory-management policy in replay methods), and the "right" hyperparameter often depends on which task is being evaluated. Standard ML evaluation uses validation data from the same distribution as test data; in continual learning this becomes problematic because the validation set must come from the same task-stream distribution as test. Many published methods have implicitly used per-task hyperparameter tuning that would not be available in a real deployment. The modern standard (Hadsell et al., Mirzadeh et al.) is to specify hyperparameters before any task is seen and report sensitivity.
What good evaluation looks like
The 2026 standard for serious continual-learning evaluation: report ACC and BWT after the final task, evaluate at multiple memory budgets, separate task-incremental from class-incremental scenarios, evaluate online (single-epoch) and offline (multi-epoch) separately, fix hyperparameters before seeing tasks, and report both small-network and large-network variants. The Avalanche library (Lomonaco et al. 2021) implements most of these protocols; using it gives the field a consistent comparison surface that the more ad-hoc 2017–2020 papers lacked.
Applications and Frontier
Continual learning shows up wherever a model must adapt over time without losing prior knowledge — edge devices learning from local data streams, robots picking up new manipulation skills, recommendation systems tracking shifting preferences, and the increasingly important question of how to update large foundation models with new information. Each domain has its own deployment constraints, and the methods of the chapter combine in different ways for different applications.
Edge devices and on-device learning
Smartphones, smart cameras, and IoT devices increasingly need to learn from their local data streams without sending the data to the cloud (privacy, latency, bandwidth). The continual-learning machinery is essential — the device cannot store all past data, must adapt to user-specific patterns, and must do all this with limited compute. The 2024 generation of on-device continual learning uses some combination of tiny replay buffers (latent-feature replay), efficient adapter updates, and aggressive regularisation. Apple's Personalized Federated Learning, Google's Federated Analytics, and similar systems all incorporate continual-learning components.
Robotics and lifelong skill acquisition
Robots that learn new manipulation skills in deployment without forgetting old ones are the canonical lifelong-learning use case. The challenge is severe: new skills are learned from few demonstrations (Part XII Ch 03), the robot cannot store all prior experience, and forgetting a critical skill (don't drop the egg) is unacceptable. Production robotics continual-learning systems combine replay buffers with progressive-network-style architectural growth and aggressive evaluation gates that prevent deployment of updates that degrade safety-critical behaviours.
Recommendation systems and concept drift
Every recommendation system faces continuous concept drift — user preferences change, items churn, seasonal effects shift the distribution. Standard continual-learning machinery applied here typically uses experience replay (a rolling buffer of recent interactions), online updating (streaming SGD on new interactions), and periodic offline retraining as a stability anchor. The deployments at TikTok, YouTube, and similar platforms run continual-learning loops at production scale, with the explicit acknowledgement that "the model is never done training."
Foundation-model updating
The largest continual-learning challenge in 2026 is updating foundation models. A trained LLM has knowledge cut off at its pretraining date; integrating new knowledge without expensive full retraining is an open problem. Several approaches are active: continued pretraining on new data plus a fragment of the original (a form of replay), retrieval-augmented generation (sidestep the problem by storing new knowledge in an external retrieval index rather than the model), and edit methods (ROME, MEMIT — surgical updates to specific factual knowledge). The retrieval-augmented approach has so far been the most successful in production; the edit methods remain promising but unreliable; continued-pretraining-with-replay is the heaviest hammer and the most reliable.
Frontier methods
Several frontiers are active in 2026. Continual learning of foundation models at scale: how to incorporate millions of new tokens into a 100B-parameter model without retraining from scratch, with quality matching joint training. Continual reinforcement learning: agents that keep learning in deployment while preserving prior capabilities; the challenges combine continual-learning machinery with the stability problems of RL. Modular continual architectures: routing-based and mixture-of-experts approaches that allocate per-task experts implicitly, sidestepping forgetting through routing rather than parameter freezing. Theoretical analysis of forgetting: papers like Doan et al. and Mirzadeh et al. on the geometry of loss landscapes in continual learning, which connect to the broader effort to understand why and when methods work.
What this chapter does not cover
Several adjacent areas are out of scope. Online learning on stationary distributions — the classical statistical-learning setup where data arrives in a stream but the underlying distribution does not change — is well-developed in classical ML and lives mostly outside the modern continual-learning literature. Multi-task learning with simultaneous access to all tasks' data (as opposed to sequential access) is conceptually related but has a different methodological focus. Federated learning, where models are updated across distributed devices, intersects with continual learning when the per-device data streams shift over time but is conventionally treated through the federation lens. The cognitive-science literature on human memory consolidation and forgetting is the conceptual ancestor of continual learning but is largely descriptive rather than algorithmic. And the formal-learning-theory analysis of regret-style guarantees in online learning, while elegantly relevant, mostly produces theoretical bounds that do not translate to deep-learning practice.
Further reading
Foundational papers and surveys for continual and lifelong learning. The Parisi survey, the canonical EWC and iCaRL papers, and the Avalanche documentation are the right starting kit for practitioners.
- Overcoming Catastrophic Forgetting in Neural Networks (EWC). The EWC paper. Introduces Elastic Weight Consolidation as the canonical regularisation-based continual-learning method, with a Bayesian-posterior-as-prior interpretation that grounds the entire regularisation family. The single most-cited continual-learning paper and the right reading for the foundational concepts. The reference for regularisation-based continual learning.
- iCaRL: Incremental Classifier and Representation Learning. The iCaRL paper. The canonical replay-based class-incremental method, combining herding-style buffer selection, knowledge distillation, and a nearest-mean classifier. The right reading after EWC for understanding how replay-based methods bridge the regularisation-vs-replay divide. Pair with Lopez-Paz & Ranzato 2017 (GEM) for the gradient-projection variant. The reference for replay-based class-incremental learning.
- Continual Lifelong Learning with Neural Networks: A Review. The standard survey. Comprehensive taxonomy of continual-learning methods (regularisation-, replay-, and architecture-based), with biological motivations and standard benchmarks. The right second reading after the canonical method papers and a useful organisational framework for the literature. The survey reference for the field.
- Three Scenarios for Continual Learning. The taxonomy paper. Establishes the task-incremental, domain-incremental, and class-incremental distinction that is the standard scenario taxonomy in the field. Required reading for understanding what any given continual-learning method is actually solving and what benchmark numbers mean. The reference for continual-learning scenarios.
- Progressive Neural Networks. The Progressive Networks paper. Introduces architectural growth as a continual-learning strategy, with frozen prior columns and lateral connections to enable transfer without forgetting. The natural reading for understanding the architectural family of methods and the foundation for adapter-based continual learning. The reference for architectural continual learning.
- Continual Learning with Deep Generative Replay. The DGR paper. Establishes generative replay as the privacy-preserving alternative to literal replay buffers. While the original GAN/VAE-based approach has been overtaken by diffusion-based variants, the framework remains the standard reference for the generative-replay family. The reference for generative replay.
- Variational Continual Learning. The VCL paper. Extends EWC to a full mean-field variational posterior, with the previous task's posterior used as the prior for the next. The right reading for the Bayesian framing of continual learning and the natural follow-up to Ch 07's variational-inference machinery. The reference for Bayesian continual learning.
- Gradient Episodic Memory for Continual Learning (GEM). The GEM paper. Introduces the gradient-projection approach to replay, where current-task gradients are projected to be compatible with past-task gradients computed on a memory buffer. Pair with A-GEM (Chaudhry et al. 2019) for the scalable variant that became the production-friendly version. The reference for gradient-based replay methods.
- Avalanche: an End-to-End Library for Continual Learning. The Avalanche library paper. The standard production-grade library for continual learning, implementing all the major method families and the standard benchmarks (Split-MNIST, Split-CIFAR, CORe50, Stream-51) with consistent evaluation protocols. The right tooling reference for any new continual-learning project. The library you will actually use.