Learning from Demonstration & Imitation: watching, copying, generalising.

Hand-coding a robot to do anything novel is hard. Showing it how to do something — by teleoperating it, kinaesthetically guiding it, or recording a human doing the task — turns out to be much easier, and the policies you can learn from those demonstrations are often more capable than anything you could write by hand. This chapter covers how that works: from vanilla behaviour cloning, through DAgger and inverse reinforcement learning, to the foundation-scale generalist policies of 2024–2026.

Prerequisites & orientation

This chapter assumes familiarity with deep learning (Part V Ch 01–02), basic reinforcement learning (Part IX Ch 01), and the planning/control concepts of Chapter 02 (Motion Planning & Control). The architectures discussed in Section 09 build on attention mechanisms (Part V Ch 05) and diffusion models (Part X Ch 04). The data-collection sections (07–08) connect to the perception material of Chapter 01.

Two threads run through the chapter. The first is the central technical problem of imitation learning: distribution shift. Naive supervised learning on demonstrations fails on long-horizon tasks because the robot's own mistakes lead it to states the expert never visited, and the policy has no idea what to do there. Most of the algorithmic content of this chapter — DAgger, IRL, modern action-chunking architectures — amounts to different attempts to fix this problem. The second thread is the data flywheel: imitation learning's progress over the past five years has been driven less by algorithm changes than by larger and more diverse demonstration datasets, and the trend looks set to continue.

01

Why Learn from Demonstrations?

For most of robotics' history, telling a robot how to do something meant writing a controller — a hand-coded function from sensor inputs to motor outputs that, given enough engineering time, made the robot behave the right way. The cost of that approach is enormous: every new task is a new project, every change in the environment risks breaking the controller, and tasks where the right behaviour is hard to articulate verbally (folding cloth, pouring liquids, assembling small parts) are nearly impossible to write down at all. Learning from demonstration sidesteps the verbalisation problem by letting the human provide the behaviour directly: do the task, record what happened, train a policy that reproduces it. Whether you got there with one demonstration or a thousand, the artefact you walk away with is a function from observations to actions — and it works on the cases the human showed it.

The data-efficiency argument

Reinforcement learning, in principle, can solve any task that can be defined by a reward signal and explored by trial and error. In practice, RL is desperately data-hungry: the per-step reward signal is so weak that even simple manipulation tasks routinely require millions of environment interactions to converge. On a real robot, where each interaction takes seconds and risks hardware damage, that is impractical. A demonstration provides a much stronger signal — at every state visited by the expert, the entire action vector is given — which is why a few hundred demonstrations can produce policies that millions of RL steps cannot.

The corollary is the data-efficiency curve: imitation learning beats RL when demonstrations are easier to collect than rewards to design and explore. For tasks where the reward is well-shaped and exploration is cheap (game playing, simulated locomotion), RL still wins. For tasks where the reward is unclear or sparse and exploration is expensive (fine manipulation, mobile navigation in human spaces), imitation is the better tool.

The reward-specification argument

A separate argument: even if exploration were free, you still have to specify what success looks like. For tasks like "pour water into the cup without spilling," the reward function that captures every nuance of "not spilling" is itself a research project. Demonstrations dodge this entirely: the expert's behaviour is the specification. The reward is implicit, latent in the data, and the policy never has to know what it formally is. Inverse reinforcement learning (Section 06) makes this explicit by recovering the reward; behaviour cloning (Section 03) sidesteps it altogether.

What you give up

Imitation learning is bounded by the demonstrations it sees. A behaviour-cloning policy will not be better than the human who taught it, will not solve cases the human did not demonstrate, and will not gracefully handle situations far from the training distribution. RL, given a good reward function, can in principle exceed human performance — and on tasks where that is what you want (chess, Go, complex video games), RL or RL-on-top-of-imitation is the right architecture. The two paradigms complement rather than compete: most modern robotics systems are imitation-pretrained and then RL-finetuned, taking the data efficiency of demonstrations and the capability ceiling of RL.

The Engineering Case

Outside academic comparisons, the practical reason imitation learning has won most of robotics manipulation in 2024–2026 is engineering pragmatism. Demonstrations can be collected on existing hardware by existing operators with no specialised infrastructure. Rewards have to be carefully designed by an ML expert and verified against task semantics. For most teams, "have someone teleoperate the robot for an afternoon" is achievable; "design and tune a reward function" is not.

02

The Imitation Learning Setup

The basic imitation-learning setup is one of the cleanest in machine learning. There is an expert — usually a human — modelled as a policy π_E mapping observations to actions. The expert demonstrates the task many times, producing a dataset D = {(o_i, a_i)} of observation-action pairs sampled by rolling out the expert's policy in the environment. The learner's job is to fit a parameterised policy π_θ that approximates π_E, with the parameters trained on D. At test time, π_θ is rolled out in the environment to produce behaviour. That is the entire formalism.

Observations, actions, and the choice of representation

The observation space is whatever the robot's sensors produce: camera images, joint angles, force readings, language instructions, or some combination. The action space is whatever the controller below the policy expects: joint positions, joint velocities, end-effector poses, motor torques, or — increasingly — high-level skill tokens. The choice of representation has enormous practical consequences: a policy that outputs end-effector pose deltas at 10 Hz is solving a much easier learning problem than one that outputs raw joint torques at 1 kHz, because the underlying low-level control loops absorb the high-frequency dynamics. Modern manipulation policies almost universally output end-effector poses or joint targets at 10–30 Hz, with a separate impedance or PID controller running underneath at higher rate.
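To make the hierarchy concrete, here is a minimal simulation sketch; the rates, gains, and stand-in policy are all illustrative, not taken from any particular system. A learned policy emits end-effector position targets at 10 Hz, while a critically damped PD loop running at 1 kHz tracks them and absorbs the fast dynamics.

import numpy as np

POLICY_HZ, CONTROL_HZ = 10, 1000
STEPS_PER_TARGET = CONTROL_HZ // POLICY_HZ

def policy(obs):
    # Stand-in for a learned policy: command a 1 cm step along x.
    return np.array([0.01, 0.0, 0.0])

def pd_step(pos, vel, target, kp=400.0, kd=40.0, dt=1.0 / CONTROL_HZ):
    # Critically damped PD law on position error; this loop, not the
    # learned policy, handles the high-frequency dynamics.
    acc = kp * (target - pos) - kd * vel
    vel = vel + acc * dt
    return pos + vel * dt, vel

pos, vel = np.zeros(3), np.zeros(3)
for _ in range(POLICY_HZ):                 # one second of execution
    target = pos + policy(pos)             # slow, learned decision (10 Hz)
    for _ in range(STEPS_PER_TARGET):      # fast, hand-tuned tracking (1 kHz)
        pos, vel = pd_step(pos, vel, target)
print(pos)  # the end-effector has advanced along x, tracked smoothly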

Loss functions

Once the action representation is fixed, the loss function is the next major design decision. For continuous actions, mean-squared error (MSE) is the default — minimise ‖π_θ(o) − a‖² averaged over the dataset. MSE is simple and works, but it has a known weakness: it produces unimodal predictions even when the demonstration distribution is multimodal. If the expert sometimes goes left and sometimes goes right around an obstacle, MSE-trained policies converge on going through the middle — into the obstacle. Cross-entropy with discretised action spaces, mixture-of-Gaussians outputs, and (most recently) diffusion-based action sampling all sidestep the unimodality problem. Section 09 returns to this question with modern architectures.
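The mode-averaging failure is easy to reproduce numerically. In this toy sketch, the expert's demonstrations at a single observation split between steering left (−1) and right (+1); the MSE-optimal constant prediction is their mean.

import numpy as np

# At the same observation the expert sometimes goes left (-1) and
# sometimes right (+1).  MSE is minimised by the mean of the labels.
expert_actions = np.array([-1.0, +1.0, -1.0, +1.0])

mse = lambda a: np.mean((expert_actions - a) ** 2)
candidates = np.linspace(-1.5, 1.5, 301)
best = candidates[np.argmin([mse(a) for a in candidates])]
print(best)  # ~0.0: "through the middle", a mode neither expert chose

# A multimodal output head (mixture of Gaussians, discretised bins, or
# diffusion sampling) can place mass on -1 and +1 and sample either mode.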

What "good" means

Evaluating an imitation-learned policy is harder than evaluating a supervised classifier. The natural metric — average action-prediction loss on a held-out demonstration set — turns out to correlate weakly with task success. A policy with low loss can fail catastrophically because its small per-step errors compound (the subject of Section 04), and a policy with relatively high loss can succeed because its errors don't matter for the task. The honest evaluation is rolling the policy out in the environment and measuring task success rate over many trials. That is expensive, especially on real hardware, which is why much of the imitation-learning literature lives in simulation — and which is why the gap between "simulation success" and "real-world success" is one of the persistent themes in this chapter and the next.

03

Behaviour Cloning

Behaviour cloning (BC) is the most direct application of supervised learning to control: treat each (observation, action) pair from the demonstrations as a training example, fit a neural network with a regression or classification loss, and use the trained network as a policy. Conceptually it is no more sophisticated than image classification, with the labels being actions instead of class indices. In practice, that simplicity is BC's main virtue and its main weakness — it inherits all of supervised learning's strengths (mature tooling, sample efficiency on iid data, easy debugging) and one structural failure mode (Section 04) that supervised learning was never designed to handle.

The training procedure

Behaviour cloning — basic procedure
Given dataset D = {(o_i, a_i)} from expert rollouts:

θ* = argmin_θ ∑_i L(π_θ(o_i), a_i)

At test time: o → π_θ*(o) → environment → o' → ...

L is typically MSE for continuous actions or cross-entropy for discrete ones. Training is just standard supervised optimisation. The crucial assumption — and the one that breaks at test time — is that observations seen during training are i.i.d. with observations seen during execution. They are not, because the policy's mistakes lead it to states the expert never visited.
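In code, the procedure really is ordinary supervised learning. A minimal PyTorch sketch, with random tensors standing in for a real demonstration dataset (network size, shapes, and hyperparameters are illustrative):

import torch
import torch.nn as nn

obs_dim, act_dim = 32, 7
obs = torch.randn(4096, obs_dim)    # stand-in for expert observations
acts = torch.randn(4096, act_dim)   # stand-in for expert action labels

policy = nn.Sequential(
    nn.Linear(obs_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, act_dim),
)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(obs, acts), batch_size=256, shuffle=True)

for epoch in range(10):
    for o, a in loader:
        loss = nn.functional.mse_loss(policy(o), a)  # L = MSE
        opt.zero_grad()
        loss.backward()
        opt.step()

# At test time the trained network is simply rolled out:
#   a_t = policy(o_t), executed in the environment, repeat.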

What works

BC works surprisingly well when three conditions are met. First, the task is short enough that compounding errors don't have time to accumulate — pick-and-place tasks of a few seconds are well within reach, hour-long sequences are not. Second, the demonstrations are relatively consistent: the same situation produces similar expert actions across demonstrations, so the network has a coherent function to learn. Third, the policy outputs a representation that is forgiving of small errors — end-effector pose deltas absorbed by an underlying impedance controller are far more robust than raw torques.

When these conditions hold, BC produces strong policies cheaply. The 2023–2026 manipulation explosion (ALOHA, Diffusion Policy, RT-1) is mostly powered by BC, dressed up with modern architectures and large datasets but algorithmically still BC at its core. The lesson is that BC's failures are real but addressable, and most of what looks like cleverer algorithms in modern imitation learning is fundamentally still BC plus some way to make BC more robust.

What goes wrong

BC fails predictably on long-horizon tasks, on tasks where the demonstration distribution is highly multimodal, and on tasks where the policy's training data poorly covers the states the policy actually visits at execution. The first failure mode is severe enough to deserve its own section.

04

Compounding Errors and Distribution Shift

The single most important fact in imitation learning is this: behaviour cloning trains on the distribution of states the expert visits, but is evaluated on the distribution of states the learned policy visits. These are not the same distribution, and the gap between them is the central problem the rest of this chapter is trying to solve.

Covariate shift in policy execution

Suppose the expert never makes a particular kind of mistake — say, never gets its end-effector tilted past 30 degrees during a manipulation task. The training data contains no examples of recovering from such a tilt. Now the learned policy, due to small errors, eventually drifts past 30 degrees during execution. From the policy's perspective, this is a state it has never seen — it has no idea what action to take, and the action it produces is essentially uninformed. That action causes a further mistake, which produces a state even further from the expert distribution, and so on. The mistakes compound: each error is the input to the next decision, and small errors are amplified rather than corrected.

[Figure: the compounding-errors problem. The expert's trajectory (green) defines the distribution the policy was trained on. The learned policy (pink) starts at the same state but accumulates small errors at every step; each error places it slightly further from the training distribution, which makes the next prediction worse, which compounds the drift. Behaviour cloning has no mechanism to correct this — it only knows what the expert would do at expert states, not what to do once it has drifted away.]

The quadratic regret bound

Ross and Bagnell (2010) made this rigorous: a behaviour-cloned policy with ε per-step error can suffer total regret that scales as O(εT²) over an episode of length T, where the quadratic factor comes precisely from compounding. By contrast, a supervised classifier facing i.i.d. data has regret linear in T. The T² blow-up explains why BC works fine on short tasks and falls apart on long ones — and it motivates DAgger (Section 05), which provably reduces this back to linear by training on the policy's own state distribution.
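A toy simulation makes the mechanism concrete. Assume a 1-D state the expert always keeps near zero; the cloned policy imitates correctly inside the region the expert visited and guesses randomly outside it (the thresholds and noise scales below are invented for illustration):

import numpy as np

rng = np.random.default_rng(0)

def rollout(T=200, eps=0.1, region=0.25):
    x, errors = 0.0, []
    for _ in range(T):
        if abs(x) < region:
            a = -x                        # in-distribution: imitate the expert
        else:
            a = rng.normal(0.0, 1.0)      # off-distribution: uninformed guess
        x = x + a + rng.normal(0.0, eps)  # small execution noise every step
        errors.append(abs(x))
    return np.array(errors)

drift = rollout()
print(drift[:50].mean(), drift[-50:].mean())
# Late-episode error typically dwarfs early-episode error: once the state
# leaves the expert's region, errors feed on themselves instead of decaying.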

Practical mitigations short of DAgger

If you cannot collect on-policy data — common in production where querying a human expert mid-execution is not practical — several tricks make BC more robust without reaching for DAgger: injecting small perturbations during demonstration collection so the expert's own demonstrations contain recoveries (the idea behind DART); deliberately recording recovery demonstrations from off-nominal states; augmenting observations (crops, colour jitter, camera shifts) so the policy tolerates visual drift; and action chunking (Section 09), which reduces the number of decisions per episode and with it the opportunities to compound.

05

DAgger and the Online Imitation Family

DAgger — Dataset Aggregation, introduced by Ross, Gordon & Bagnell in 2011 — is the canonical fix for compounding errors. The key insight is that the policy's failure mode is being trained on the wrong distribution. The solution: roll out the policy in the environment, collect the states it visits, and ask the expert to label those states with what the expert would do. Train on the augmented dataset, repeat. After a few iterations, the policy is trained on its own state distribution, which means there is no distribution shift between training and testing.

The DAgger algorithm

DAgger
D ← expert demonstrations
π_1 ← train on D
for i = 1 to N:
  τ_i ← rollout π_i in environment, recording observations
  for each o in τ_i:
    a_E ← query expert for what they would do at o
    D ← D ∪ {(o, a_E)}
  π_{i+1} ← train on D
return best π_i

The algorithm collects observations under the policy's distribution but supervises with the expert's actions. The dataset is aggregated across iterations rather than replaced, so the policy retains experience from the expert's states as well as its own. The theoretical guarantee: total regret drops from O(εT²) to O(εT) — back to the linear scaling of ordinary supervised learning.
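The same loop in runnable form, as a sketch under stated assumptions: expert(obs) is queryable at arbitrary states (say, a privileged planner in simulation), env exposes a reset/step interface, and fit(obs, acts) trains any supervised regressor and returns it as a callable policy. All of these names are placeholders, not a specific library's API.

import numpy as np

def dagger(env, expert, fit, n_iters=10, horizon=200, n_demos=50):
    # Seed the dataset with ordinary expert demonstrations.
    D_obs, D_act = [], []
    for _ in range(n_demos):
        o = env.reset()
        for _ in range(horizon):
            a = expert(o)
            D_obs.append(o); D_act.append(a)
            o, done = env.step(a)
            if done:
                break
    policy = fit(np.array(D_obs), np.array(D_act))
    for _ in range(n_iters):
        # Collect states under the *policy's* distribution...
        o = env.reset()
        for _ in range(horizon):
            D_obs.append(o)
            D_act.append(expert(o))        # ...but label them with the expert
            o, done = env.step(policy(o))  # step with the policy's own action
            if done:
                break
        # Aggregate (never replace) and retrain on the grown dataset.
        policy = fit(np.array(D_obs), np.array(D_act))
    return policy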

The cost: expert availability

DAgger's elegance hides a practical cost: it requires querying the expert on the policy's state distribution, repeatedly, throughout training. For a human teleoperator, this means watching the policy roll out in the environment and providing the action they would take at each visited state — labour-intensive and disruptive. For a programmatic expert (a slow optimal planner whose decisions the learner is approximating, or a privileged-information expert in simulation), DAgger is straightforward; for a human expert on physical hardware, it is hard to scale. Most "DAgger" deployments in practice are really hybrid pipelines that collect a few rounds of expert correction, then fall back to BC for the rest of training.

The variant zoo

Several DAgger variants address its practical weaknesses: HG-DAgger lets the human expert decide when to intervene rather than labelling every visited state; SafeDAgger queries the expert only where the policy's action deviates enough from the expert's, cutting query volume; EnsembleDAgger uses disagreement within a policy ensemble to decide when the expert is needed; and DART sidesteps on-policy queries entirely by injecting noise during the expert's own demonstrations so that recovery behaviour appears in the data.

When DAgger is the right tool

DAgger is the right choice when on-policy data collection is cheap (simulation, programmatic expert), when the task is long-horizon enough that compounding errors matter, and when the expert is queryable at arbitrary states. It is the wrong choice when expert queries are expensive, when the expert is a separate human who cannot easily be summoned for label collection, or when the task is short enough that vanilla BC works well. Most production manipulation systems in 2026 do not use DAgger directly; they use BC with action chunking and a large enough dataset that compounding errors are practically tolerable. The conceptual lesson DAgger established — that training distribution must match test distribution — survives in those systems even when the algorithm itself does not.

06

Inverse Reinforcement Learning

Inverse reinforcement learning (IRL) reframes the imitation problem. Instead of training a policy to copy the expert's actions, IRL infers the reward function the expert was implicitly optimising, and then runs standard reinforcement learning against that reward to produce a policy. The policy may end up doing things the expert never explicitly demonstrated — IRL generalises by recovering the underlying objective rather than the surface behaviour. The trade-off is that IRL is harder to make work, requires solving an RL problem inside a learning loop, and the recovered reward is not unique (many rewards explain the same demonstrations). For tasks where the reward is fundamental and BC's surface-level mimicry is insufficient, IRL has been the right tool; for tasks where surface behaviour is exactly what you want, BC is simpler.

Why the reward is ambiguous

Given any set of demonstrations, infinitely many reward functions are consistent with them — including the trivial one that assigns reward zero everywhere (under which any policy is optimal). The classical IRL problem is therefore ill-posed without additional assumptions. The dominant resolution is the maximum-entropy formulation (Ziebart et al., 2008): among all reward functions consistent with the demonstrations, prefer the one under which the demonstration distribution has highest entropy. This makes IRL well-posed and gives a clean probabilistic interpretation: trajectories with high reward are exponentially more likely than trajectories with low reward.

Maximum-entropy IRL
P(τ | r_θ) ∝ exp( ∑_t r_θ(s_t, a_t) )

maximise_θ   ∑_{τ∈D} log P(τ | r_θ)

The reward parameters θ are chosen to maximise the likelihood of the observed expert trajectories under the soft Boltzmann distribution. Computing P(τ | r_θ) requires summing over all possible trajectories — tractable in small MDPs, approximated by sampling in larger ones. After fitting the reward, an RL policy is trained against it.
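A toy instance of the fit, small enough to enumerate exactly. The "trajectory space" here is every length-3 state sequence over three states, the reward is linear in state-visit features (r_θ(s) = θ[s]), and the partition function is computed by enumeration, the small-MDP regime mentioned above. The log-likelihood gradient takes the classic max-ent form: expert feature counts minus model-expected feature counts.

import numpy as np
from itertools import product

n_states, T = 3, 3
trajs = list(product(range(n_states), repeat=T))     # all 27 "trajectories"
counts = np.array([np.bincount(t, minlength=n_states) for t in trajs])

demos = [(2, 2, 1), (2, 1, 2), (2, 2, 2)]            # expert prefers state 2
expert_feat = np.mean(
    [np.bincount(t, minlength=n_states) for t in demos], axis=0)

theta = np.zeros(n_states)
for _ in range(500):
    logits = counts @ theta                          # sum_t r_theta(s_t) per traj
    p = np.exp(logits - logits.max()); p /= p.sum()  # P(tau | r_theta), exact
    model_feat = p @ counts                          # expected visit counts
    theta += 0.1 * (expert_feat - model_feat)        # ascend the log-likelihood
print(theta)  # state 2 ends up with the highest inferred reward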

Apprenticeship learning and the feature-matching view

Abbeel and Ng's apprenticeship learning (2004) takes a different angle: rather than recovering the reward explicitly, find a policy whose expected feature-counts match the expert's. If the true reward is linear in some feature representation, matching feature expectations matches expected reward — and the policy that does so is at least as good as the expert. The procedure iterates: train a policy, compute its feature expectations, update the implicit reward, repeat. The feature-counts framing remains influential as a way to think about what IRL actually does even when the algorithms have moved on.

GAIL and adversarial imitation

Generative Adversarial Imitation Learning (Ho & Ermon, 2016) sidestepped explicit reward recovery by importing the GAN framework. Train a discriminator D to distinguish the expert's state-action pairs from the policy's; train the policy π to produce state-action pairs the discriminator cannot tell from the expert's. The discriminator implicitly defines a reward (negative log probability of being the policy), and the policy is trained to maximise it. GAIL avoids the need to solve an inner RL problem to convergence at every IRL iteration, dramatically improving sample efficiency on continuous-control tasks. It dominated the IRL literature for the latter half of the 2010s.
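A sketch of GAIL's two alternating updates. Sign and reward conventions differ across implementations; here the discriminator is trained to output 1 on expert pairs, and the policy's reward is log D, to be handed to any on-policy RL algorithm (PPO, TRPO). Dimensions are illustrative.

import torch
import torch.nn as nn

SA_DIM = 39  # concatenated state-action dimension (illustrative)
disc = nn.Sequential(nn.Linear(SA_DIM, 256), nn.ReLU(), nn.Linear(256, 1))
d_opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
bce = nn.BCEWithLogitsLoss()

def discriminator_step(expert_sa, policy_sa):
    # Expert pairs labelled 1, policy pairs labelled 0.
    loss = bce(disc(expert_sa), torch.ones(len(expert_sa), 1)) \
         + bce(disc(policy_sa), torch.zeros(len(policy_sa), 1))
    d_opt.zero_grad(); loss.backward(); d_opt.step()

def imitation_reward(sa):
    # High where the discriminator finds the pair expert-like; a policy
    # maximising this is pushed toward the expert's occupancy measure.
    with torch.no_grad():
        return nn.functional.logsigmoid(disc(sa)).squeeze(-1)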

Where IRL stands now

In 2026, IRL is a smaller part of the imitation-learning landscape than the BC-and-architecture lineage covered in the rest of this chapter. The reasons are practical: BC scaled with data better than IRL did, modern architectures (Diffusion Policy, ACT) handle the multimodality that originally motivated IRL, and recovering an explicit reward is rarely useful in production unless you plan to run RL on top. IRL retains its niche for cases where reward generalisation matters (autonomous driving in unusual scenarios, behaviour transfer across environments) and as a conceptual tool for understanding what demonstrations actually communicate. The deepest version of the question — "what is the expert's reward, really?" — connects to alignment research and the broader programme of inferring human values from observed behaviour, which is the subject of Part XVIII.

07

Teleoperation: Capturing Demonstrations

All of the algorithms above are useless without demonstrations. The bottleneck of imitation learning, in practice, is collecting enough data of high enough quality on real robotic hardware. Several teleoperation paradigms have emerged, each with different trade-offs in capture rate, accuracy, hardware cost, and operator skill required.

Leader-follower teleoperation

The leader-follower (or puppeteering) approach uses two robots — a leader that the human moves directly with their hands, and a follower that mirrors the leader's motion in real time. The leader is mechanically simple (passive joints with encoders, no motors) and so is cheap; the follower is the actual robot the policy will run on. The recorded data — joint positions of the leader at each time step — becomes the action labels for the follower's observations.

The 2023 ALOHA system (A Low-cost Open-source HArdware system for bimanual teleoperation, by Zhao et al.) made this approach famous and cheap: roughly $20,000 of off-the-shelf parts buys a bimanual setup that an operator can teleoperate intuitively. ALOHA was the data-collection backbone behind several breakthrough manipulation results in 2023–2024, and its descendants (Mobile ALOHA, ALOHA Unleashed) extended it to mobile manipulation and longer-horizon tasks.

VR and motion-tracking teleoperation

Consumer VR headsets (Quest, Vision Pro) provide six-DOF hand tracking and a head-mounted display at low cost. The natural mapping is to project the operator's hand poses onto the robot's end-effector poses, with the headset displaying the robot's camera feed for visual feedback. This works well for end-effector-controlled robots and is the dominant approach for rapid prototyping in academic robotics. The downside is the visuomotor mismatch — the operator's proprioceptive sense of where their hands are does not match what they see, and most operators take time to learn to teleoperate well. VR teleoperation produces noisier data than leader-follower setups, but it scales: one VR headset and a single operator can produce hundreds of hours of demonstrations.

Kinesthetic teaching

Kinesthetic teaching is the simplest approach: physically grab the robot and move it through the desired motion while it records its own joint positions. Used in industrial robotics for decades, it remains useful for tasks where the operator wants to feel contact forces directly (assembly, polishing). The downside is that it works only when the robot can be backdriven safely — most modern collaborative robots support this, but high-stiffness industrial arms cannot.

Motion capture and the human-as-template approach

Rather than teleoperating the robot, capture a human doing the task with motion-capture or computer vision, and re-target the captured motion onto the robot. This generalises easily across embodiments — once you have human demonstrations, you can deploy them on any robot whose kinematics map onto a human's — but the re-targeting itself is non-trivial, especially for tasks involving contact. The approach has gained traction with humanoid robotics, where the kinematic similarity between human and robot makes re-targeting cleaner.

Simulation-based demonstrations

The cheapest source of demonstrations is simulation. A scripted policy or a slow optimal planner produces synthetic trajectories in a simulator, which become training data for a fast neural-network policy. The data is essentially free, but it inherits whatever distributional gap exists between the simulator and reality — the sim-to-real problem that Chapter 04 will treat in detail. Production stacks frequently combine simulated demonstrations (for breadth) with a small amount of real-world demonstrations (for grounding), with the simulator deliberately randomised to make the policy robust to the differences (domain randomisation, also Chapter 04).

Operator Quality Dominates

The quietest but most consequential variable in imitation-learning data collection is operator quality. Two operators producing the same number of demonstrations can produce policies of very different quality, simply because one operator's demonstrations are more consistent, cleaner, and cover the relevant variations better. Most practitioners learn this the hard way: the first six months of an imitation-learning project go into discovering that the data collection itself is a craft, not a button press.

08

Modern Datasets and Data Quality

Imitation learning's progress in 2022–2026 has been driven less by algorithmic innovation than by the appearance of very large, very diverse demonstration datasets. The transition mirrors what happened in language modelling a few years earlier: the same architectures applied to ten times the data produced qualitatively different capabilities. This section covers the major datasets, what makes them good or bad, and the open problems around data curation.

Open X-Embodiment and RT-X

The Open X-Embodiment dataset (2023, Padalkar et al.), accompanied by the RT-X model family from Google DeepMind, was a watershed: 22 different robot embodiments, 60+ source datasets contributed by 34 academic labs, more than 1 million teleoperated trajectories. The dataset was deliberately heterogeneous — different robots, different tasks, different action representations — and the central finding of the RT-X paper was that policies trained jointly across all of them outperformed policies trained on any single subset, even when evaluated on tasks from that subset. The cross-embodiment effect was the first concrete evidence that there is value in training on more than just your robot's data.

DROID, Bridge, Robomimic, and friends

Several other datasets have followed, each filling a different niche. DROID (2024, 76,000 trajectories collected with a single Franka-arm setup distributed to 13 institutions) emphasises consistency of hardware over breadth of embodiment. BridgeData V2 contains 60,000 trajectories on a single WidowX arm across many household tasks and is the standard dataset for kitchen-style manipulation. Robomimic is older (2021) and smaller, but its high-quality teleoperation and diverse policy classes make it the standard benchmark for behaviour-cloning methods. The 2026 picture is that most modern manipulation policies are trained on a mixture of these datasets, with weights chosen to emphasise embodiments and tasks similar to the deployment target.

Quality vs. quantity

The honest finding from the past few years is that quality matters more than quantity, up to a point. A dataset of 1,000 high-quality demonstrations from a single skilled teleoperator routinely outperforms a dataset of 10,000 demonstrations of variable quality. The mechanism is subtle: poor demonstrations contain inconsistent action labels for similar observations, which pulls the policy in conflicting directions and ultimately produces a worse network. Curation — filtering out failed demonstrations, near-misses, and operator inconsistencies — has become a standard preprocessing step in serious deployments.
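A minimal curation pass of the kind described above, under assumed episode metadata (a success flag and an action array per episode); the duration-outlier heuristic, a cheap proxy for fumbled and near-miss demonstrations, is illustrative:

import numpy as np

def curate(episodes, max_duration_ratio=1.5):
    # Drop outright failures first.
    successes = [ep for ep in episodes if ep["success"]]
    # Then drop episodes that took abnormally long: fumbles and
    # near-misses produce inconsistent action labels for similar
    # observations, which is exactly what hurts the policy.
    durations = np.array([len(ep["actions"]) for ep in successes])
    cutoff = max_duration_ratio * np.median(durations)
    return [ep for ep in successes if len(ep["actions"]) <= cutoff]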

Beyond a threshold, however, more quantity does help, especially for diversity: the cross-embodiment, cross-task generalisation seen in RT-X requires the kind of breadth that no single-source dataset can match. The practical answer is "high-quality data first, then scale." Teams that try to skip the first step usually end up redoing it later.

The action-label problem

A subtle issue with large heterogeneous datasets is that "the action" is not a single thing. One dataset records joint positions, another end-effector poses, another delta-poses, another absolute Cartesian targets, another joint torques. To train a policy across all of them, the actions have to be unified — which means choosing a canonical representation and re-projecting each dataset's actions into it. The choice has consequences for what the policy learns and how it transfers. Open X-Embodiment uses a 7-DOF end-effector pose representation; subsequent foundation models (π0, OpenVLA) have largely followed.
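A sketch of that re-projection for three common source formats into an assumed canonical 7-DOF layout [Δx, Δy, Δz, Δrx, Δry, Δrz, gripper]. A real pipeline composes rotations properly (quaternions or rotation matrices); the naive subtraction below only shows the shape of the problem.

import numpy as np

def from_delta_pose(a):
    # Source already records end-effector deltas: canonical as-is.
    return np.asarray(a, dtype=np.float32)

def from_absolute_pose(a, current_pose):
    # Source records absolute Cartesian targets: difference against the
    # logged current pose.  (Rotation handled naively here; see above.)
    a = np.asarray(a, dtype=np.float32)
    delta = a[:6] - np.asarray(current_pose, dtype=np.float32)[:6]
    return np.concatenate([delta, a[6:]])

def from_joint_positions(q_target, q_now, jacobian):
    # Source records joint targets: map the joint step through the
    # manipulator Jacobian to an approximate end-effector twist
    # (valid only for small steps).
    twist = jacobian @ (np.asarray(q_target) - np.asarray(q_now))
    return np.concatenate([twist, np.zeros(1)]).astype(np.float32)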

09

Architectures: ACT, Diffusion Policy, VLAs

The neural-network architecture that turns demonstrations into a policy has evolved rapidly. The 2018-era manipulation policies were CNN-MLP combinations producing single-step actions; the 2026-era policies are transformers or diffusion models predicting chunks of future actions, often conditioned on language. Three architectural ideas dominate the modern landscape, and any production imitation policy uses some combination of them.

Action chunking (ACT)

The Action Chunking Transformer (ACT, Zhao et al. 2023, the architecture behind ALOHA's results) made the empirical observation that policies which predict a chunk of future actions — say, 16 actions covering the next 0.5–1.0 seconds — outperform policies which predict one action at a time. The advantages are several: chunking averages out per-step noise in the demonstrations, reduces compounding by committing to coherent multi-step plans, and lets the network use its capacity to model temporal structure rather than re-predicting from scratch each step. ACT also uses a conditional-VAE objective during training to handle multimodality, sampling from the latent variable to produce different valid action sequences for the same observation.

The action-chunking paradigm has become near-universal in modern manipulation policies. The chunk size, the prediction frequency, and the way overlapping chunks are merged at inference time are tuning knobs, but the basic idea — predict a sequence, execute a few of its actions, re-predict — is the dominant inference pattern.
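A sketch of that inference pattern with temporal ensembling, in the spirit of ACT. Weighting schemes and even the direction of the decay vary by implementation (ACT itself weights older predictions more heavily); the decay constant here is invented.

import numpy as np

K, M = 16, 0.1   # chunk length and ensemble decay (both illustrative)

def run(policy, env, horizon=100):
    obs = env.reset()
    live = []                          # (start_time, chunk) pairs
    for t in range(horizon):
        live.append((t, policy(obs)))  # chunk: array of shape (K, act_dim)
        # Every still-live chunk votes on the action for time t.
        preds = np.array([c[t - t0] for t0, c in live])
        w = np.array([np.exp(-M * (t - t0)) for t0, _ in live])
        action = (w[:, None] * preds).sum(axis=0) / w.sum()
        obs = env.step(action)
        live = [(t0, c) for t0, c in live if t + 1 - t0 < K]  # drop expired
    # Re-predicting every step is the most conservative pattern; many
    # systems execute several actions per chunk before re-predicting.
    return obs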

Diffusion Policy

Diffusion Policy (Chi et al., 2023) treats the action chunk itself as the output of a diffusion model. Conditioned on the current observation, the network reverses a diffusion process to sample a sequence of actions. The advantage over ACT's CVAE is that diffusion models naturally represent multimodal distributions — when there are multiple valid action sequences, the diffusion process can produce any of them, and the sampled outputs cover the full mode landscape rather than collapsing to an average. Diffusion Policy has been the dominant manipulation architecture since 2023 and provides the strongest published behaviour-cloning results across most benchmarks.

The cost is inference latency: each action sample requires several denoising steps, which increases the time-per-action by 5–20× compared to a feedforward network. The standard mitigations are low-step diffusion schedulers (DDIM with 5–10 steps), one-step distillation (training a single-step network to mimic the multi-step diffusion output), and chunk-based execution (since each prediction covers many future actions, the per-action cost is amortised).
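A sketch of the few-step sampling that makes this workable, assuming a trained noise-prediction network eps_model(obs, x, t) and a precomputed cumulative-alpha schedule alpha_bar: a deterministic DDIM update with five steps, matching the mitigation above. All names are placeholders.

import torch

@torch.no_grad()
def sample_chunk(eps_model, obs, chunk_shape, alpha_bar,
                 steps=(999, 749, 499, 249, 0)):
    x = torch.randn(chunk_shape)                 # start from pure noise
    for i, t in enumerate(steps):
        eps = eps_model(obs, x, torch.tensor([t]))
        # Clean chunk implied by the current noise estimate.
        x0 = (x - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
        if i + 1 < len(steps):
            t_next = steps[i + 1]                # deterministic (eta = 0) jump
            x = alpha_bar[t_next].sqrt() * x0 \
                + (1 - alpha_bar[t_next]).sqrt() * eps
        else:
            x = x0                               # final denoised action chunk
    return x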

Vision-Language-Action models (VLAs)

The third architectural lineage is the vision-language-action model: a foundation-scale transformer that takes images and a language instruction as input and produces an action sequence as output. RT-1 (2022) was the first to scale this approach; RT-2 (2023) showed that a co-trained VLA inheriting from a vision-language pretrained model dramatically outperformed action-only baselines. OpenVLA (Stanford/UC Berkeley, 2024) is the leading open-source instance; π0 (Physical Intelligence, 2024) is the leading commercial instance. The RT-style models tokenise actions (typically as discrete bins on each action dimension) and predict them autoregressively or in a single pass with a token-prediction head; π0 instead emits continuous actions through a flow-matching head.
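A sketch of the binning used by the tokenising models; 256 bins and symmetric [−1, 1] bounds are illustrative, and real models pick per-dimension ranges from dataset statistics.

import numpy as np

N_BINS = 256
LOW, HIGH = -1.0, 1.0

def tokenize(action):
    # Clip each dimension to the bounds, then map to an integer bin.
    a = np.clip(action, LOW, HIGH)
    return np.round((a - LOW) / (HIGH - LOW) * (N_BINS - 1)).astype(np.int64)

def detokenize(tokens):
    # Invert the binning back to continuous values.
    return LOW + tokens.astype(np.float64) / (N_BINS - 1) * (HIGH - LOW)

a = np.array([0.03, -0.5, 0.0, 0.0, 0.0, 0.2, 1.0])  # 7-DOF action
print(tokenize(a))               # 7 integer tokens the transformer predicts
print(detokenize(tokenize(a)))   # quantisation error is ~1/255 per dimension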

The argument for VLAs is that language-grounded action prediction is the natural endpoint of imitation learning: the policy should know what task it is doing, generalise to novel instructions, and benefit from internet-scale visual and linguistic priors. The argument against is that VLAs are huge (multiple billions of parameters), expensive to run at inference time (limited to ~10 Hz even on serious hardware), and show inconsistent gains over smaller specialist policies on specific tasks. The 2026 production picture is mixed: VLAs are dominant in research and frontier deployments, while specialist policies (Diffusion Policy and ACT-style architectures) remain common in narrow industrial use.

Action Representation Is the Quietest Important Choice

The most underdiscussed design choice in a modern imitation-learning policy is the action representation. Joint positions, joint velocities, end-effector poses, end-effector deltas, force-torque commands, motion primitives, language tokens — each induces a different learning problem and a different deployment story. Two papers reporting the same architecture and dataset can get very different results because they made different action-representation choices. When reproducing or comparing methods, this is the variable to check first.

10

Frontier: Generalist Imitation Policies

The 2024–2026 frontier in imitation learning is the same as in the rest of machine learning: foundation models. The thesis is that a single very large policy, trained on a very large mixture of demonstration data across many robots, tasks, and modalities, will produce a generalist that outperforms any specialist trained from scratch. The evidence so far is mixed but trending positive, and the engineering effort going into this direction is substantial.

Cross-embodiment learning

The clearest near-term frontier is cross-embodiment generalisation. Open X-Embodiment showed that joint training across 22 robots produced better single-robot performance than single-robot training; subsequent work (e.g., CrossFormer, Octo, RDT-1B) has scaled this approach and showed transfer to robots not seen during training. The mechanism appears to be that the cross-embodiment data forces the policy to learn task-level abstractions rather than embodiment-specific quirks, and those abstractions transfer. The open question is how far this scales — whether a single policy can be trained across arms, mobile bases, humanoids, and grippers without sacrificing per-embodiment performance.

π0 and the foundation-imitation lineage

Physical Intelligence's π0 (released October 2024) is the most influential commercial instance: a roughly 3-billion-parameter VLA trained on a mix of teleoperation data across multiple robots, built on a pretrained vision-language backbone, and capable of zero-shot performance on many manipulation tasks given only a language instruction. π0 was followed in 2025 by π0.5 and other variants, and sits alongside a growing crop of open-source counterparts (OpenVLA, RDT, NORA). The lineage is the closest thing in robotics today to GPT-4-for-action: a foundation model that can be prompted to do many things and progressively fine-tuned to do new ones.

OpenVLA and the open-source ecosystem

OpenVLA (Stanford, 2024) is the corresponding open-source instance: a 7-billion-parameter VLA fine-tuned from a vision-language model on the Open X-Embodiment dataset. Its release made foundation-imitation accessible to academic labs and startups, and a substantial fine-tuning ecosystem has grown around it (LoRA adapters for specific tasks, distilled smaller variants for inference, domain-specific fine-tunes). In 2026 the open-source VLA ecosystem looks structurally similar to the open LLM ecosystem of 2023 — a few dominant base models, many derivative fine-tunes, and a flow of techniques from the proprietary frontier into the open community within months.

The data flywheel

The big-picture argument for generalist imitation policies is the data flywheel: a deployed policy generates more demonstrations (intervention data, on-policy rollouts, success/failure signals), which improves the next version of the policy, which makes deployment more useful, which produces more data. The flywheel has worked in language and vision; whether it works in robotics depends on whether deployed robots produce useful training data and whether the costs of doing so (privacy, hardware wear, safety) are bearable. The serious robotic-foundation-model companies in 2026 — Physical Intelligence, Skild, 1X, Figure, Tesla — are explicitly betting on this flywheel, and their fleets are accumulating data orders of magnitude faster than any academic-lab data-collection effort.

What remains open

Several questions are genuinely open. Does scale work in robotics the way it works in language? The early evidence is encouraging but not conclusive — robotics scaling laws are noisier and the data is more heterogeneous than text. How much of robotics will end up being imitation learning vs. RL? The current trajectory is imitation-pretrained, RL-fine-tuned, but the balance might shift if simulation gets good enough that pure-RL approaches become viable. What is the action representation that will scale? Discrete action tokens have worked for VLAs but may not be optimal for fine motor control; continuous diffusion outputs have worked for specialists but are harder to scale. The next several years will resolve these questions empirically.

What this chapter does not cover

The simulators that produce synthetic demonstrations and the techniques that close the sim-to-real gap belong to Chapter 04 (Sim-to-Real Transfer). The vision-language-action models of Section 09 and the foundation-imitation lineage of this section are treated in more depth in Chapter 05 (Foundation Models for Robotics), with explicit coverage of RT-2, π0, and the broader VLA architecture space. The integration of imitation-learned policies into a complete autonomous system, with all of its safety and behavioural guarantees, belongs to Chapter 06 (Autonomous Vehicles).

Imitation learning is the layer at which human knowledge enters a robot. The classical methods of this chapter remain the backbone of the field; the modern foundation-scale architectures are reshaping what that backbone can do. The skills the chapter has tried to ground — understanding distribution shift, choosing between BC and DAgger, designing data-collection pipelines, picking action representations — apply equally to a small lab project and to a billion-parameter generalist policy. The methods change; the discipline does not.

Further Reading