Offline RL & Imitation Learning
Online RL requires a live environment to learn in. But many of the most valuable decisions — surgical robotics, intensive-care medicine, autonomous driving — cannot be rehearsed at scale in the real world. This chapter covers how agents learn from fixed datasets of past experience: through imitation, through recovering reward signals, and through pessimistic offline optimization.
Prerequisites
This chapter assumes familiarity with MDPs, Q-learning, and policy gradient methods (Chapters 01–04). Concepts from supervised learning (classification, regression) are used in the imitation learning sections. The offline model-based section connects to Chapter 05.
The Offline Setting
Online RL agents improve by acting. Offline RL agents must improve without acting at all — learning solely from a fixed dataset of transitions collected by some other policy, possibly suboptimal, possibly long ago.
In offline reinforcement learning (also called batch RL), the agent has access to a dataset $\mathcal{D} = \{(s_i, a_i, r_i, s_i')\}$ of transitions collected by one or more behavior policies $\pi_\beta$. Once learning begins, the agent cannot interact with the environment to gather new data. The goal is to extract the best possible policy from $\mathcal{D}$ alone.
The setting is practically important for several reasons. First, many domains are too costly, dangerous, or slow to allow extensive online exploration: a surgical robot cannot practice on real patients, an autonomous vehicle cannot rehearse collisions, and a clinical decision-support system cannot randomly experiment on critically ill patients. Second, enormous historical datasets often exist — hospital records, driving logs, industrial sensor histories — that contain rich information about good and bad decisions, even if they were never collected for RL purposes.
The Behavior Policy and Dataset Quality
The behavior policy $\pi_\beta$ that generated $\mathcal{D}$ is usually not the optimal policy. It might be a human expert, a scripted heuristic, a mix of policies, or even a random agent. The dataset might be expert-only (near-optimal demonstrations), mixed (a combination of expert and suboptimal trajectories), or random (essentially no structure). These regimes are qualitatively different: an expert dataset is relatively easy to exploit through imitation, while a random dataset requires the algorithm to piece together high-reward trajectories from fragmented experience — a problem sometimes called stitching.
Distributional Shift: The Core Challenge
The fundamental difficulty of offline RL is distributional shift. Any offline algorithm must extrapolate — make predictions at state-action pairs $(s, a)$ that are not in $\mathcal{D}$. A function approximator trained on $\mathcal{D}$ has no guarantee of accuracy outside the data distribution. In online RL, distributional shift is mild because the agent continuously visits the states it cares about. In offline RL, the agent's policy can drift far from $\pi_\beta$, querying the value function at points where it is systematically wrong.
The offline RL triangle: The three vertices of the space are (1) pure imitation learning — copy $\pi_\beta$ directly; (2) inverse RL — recover the reward that $\pi_\beta$ is implicitly optimizing; and (3) offline RL — directly optimize a policy using $\mathcal{D}$, potentially exceeding $\pi_\beta$. Each involves a different treatment of the reward signal and a different posture toward distributional shift.
Behavior Cloning
The simplest approach to imitation: treat expert demonstrations as labeled data, and train a policy with supervised learning. No reward signal is needed. No RL is performed. And yet the results can be surprisingly good — and surprisingly brittle.
The Method
Behavior cloning (BC) reduces imitation to supervised learning. Given a dataset of (state, action) pairs from an expert, BC trains a policy $\pi_\theta$ to minimize the imitation loss:
$$\mathcal{L}_{BC}(\theta) = \mathbb{E}_{(s, a) \sim \mathcal{D}}\left[-\log \pi_\theta(a \mid s)\right]$$

For discrete actions this is cross-entropy; for continuous actions it is typically a negative log-likelihood under a Gaussian or a mean-squared-error regression objective. No environment interaction is required during training — the entire procedure is offline supervised learning.
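As a concrete sketch (toy data and a linear softmax policy; all names and dimensions here are illustrative, not from any particular codebase), behavior cloning is just gradient descent on this loss:

```python
import numpy as np

def bc_loss_and_grad(W, states, actions):
    """Mean negative log-likelihood of expert actions under a linear softmax policy."""
    logits = states @ W
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    n = len(actions)
    nll = -np.log(probs[np.arange(n), actions] + 1e-12).mean()
    onehot = np.zeros_like(probs)
    onehot[np.arange(n), actions] = 1.0
    grad = states.T @ (probs - onehot) / n        # gradient of the cross-entropy
    return nll, grad

# Toy "expert": in a 4-dim state, the right action is the index of the largest
# coordinate. Cloning it is plain supervised learning on (state, action) pairs.
rng = np.random.default_rng(0)
S = rng.normal(size=(256, 4))
A = S.argmax(axis=1)
W = np.zeros((4, 4))
for _ in range(200):
    _, g = bc_loss_and_grad(W, S, A)
    W -= 0.5 * g

loss, _ = bc_loss_and_grad(W, S, A)
accuracy = ((S @ W).argmax(axis=1) == A).mean()
```

Note that nothing here ever queries the environment: the policy is fit purely to the demonstration pairs.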
Compounding Errors
BC's weakness is compounding errors. Because $\pi_\theta$ is trained on states from the expert's trajectory distribution, any deviation at test time puts the agent in states not covered by training, where its error rate is no longer controlled. Even a small per-step error rate $\epsilon$ accumulates: over a horizon of $T$ steps, the expected cost of the cloned policy can be $O(\epsilon T^2)$ worse than the expert's (Ross & Bagnell, 2010). In practice this means a cloned driving policy that works at 0.01% error rate per frame can still crash after a few hundred frames — the tiny drift compounds into catastrophic failure.
The compounding error problem is a manifestation of covariate shift: the distribution of inputs at test time (states the cloned policy visits) differs from the distribution at training time (states the expert visited). Standard supervised learning theory assumes these distributions are the same.
When BC Works Well
Despite compounding errors, BC works surprisingly well on short-horizon tasks, tasks where the expert trajectory distribution covers the state space densely, and tasks with powerful enough policy representations. In robotics, BC from large human demonstration datasets has produced capable manipulation policies. In language models, next-token prediction (a form of BC on human text) produces extraordinarily capable behavior. The key is data scale: with enough demonstrations, the learned distribution covers enough states that compounding errors never venture far into unexplored territory.
DAgger & Interactive Imitation
DAgger fixes BC's compounding error problem by querying the expert on states the learned policy actually visits — at the cost of requiring continued access to the expert during training.
Dataset Aggregation
DAgger (Dataset Aggregation) (Ross, Gordon & Bagnell, 2011) iteratively collects new demonstrations. In each round: (1) execute the current policy $\pi_i$ in the environment; (2) for each state visited, query the expert for the correct action; (3) add these new (state, expert-action) pairs to the aggregate dataset $\mathcal{D}$; (4) retrain $\pi_{i+1}$ on $\mathcal{D}$. Over iterations, the dataset covers not just the expert's natural trajectory distribution but also the states the learning policy wanders into, preventing compounding errors.
DAgger guarantees that the expected cost degrades only as $O(\epsilon T)$ rather than $O(\epsilon T^2)$, where $\epsilon$ is the per-state supervised learning error. This is a dramatic improvement. The cost: the expert must be available to label states throughout training, which may be expensive or infeasible (a human surgeon cannot be on-call for every RL training step).
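The aggregation loop can be sketched with a toy chain environment and a queryable scripted expert (everything here is illustrative; a real use would substitute a learned policy class and a human or oracle expert):

```python
import numpy as np

rng = np.random.default_rng(1)

def expert(s):
    """Queryable expert on an integer chain: always step toward the origin."""
    return -1 if s > 0 else 1

def rollout(policy, T=20):
    """Run a policy from state 5; return the states it actually visits."""
    s, states = 5, []
    for _ in range(T):
        states.append(s)
        s = int(np.clip(s + policy(s) + rng.integers(-1, 2), -10, 10))  # noisy step
    return states

def fit(dataset):
    """Supervised learner: per-state majority vote over the collected labels."""
    table = {}
    for s, a in dataset:
        table.setdefault(s, []).append(a)
    lookup = {s: max(set(labels), key=labels.count) for s, labels in table.items()}
    return lambda s: lookup.get(s, 1)  # arbitrary default on unseen states

D = [(s, expert(s)) for s in rollout(expert)]     # round 0: expert demonstrations
policy = fit(D)
for _ in range(5):                                 # DAgger rounds
    visited = rollout(policy)                      # (1) execute current policy
    D += [(s, expert(s)) for s in visited]         # (2)+(3) expert labels, aggregate
    policy = fit(D)                                # (4) retrain on the aggregate

agree = all(policy(s) == expert(s) for s in {s for s, _ in D})
```

The key difference from BC is in the aggregation rounds: the dataset grows with states the *learner* visits, labeled by the expert, so the training distribution tracks the deployment distribution.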
Practical Variants
HG-DAgger (human-gated DAgger) lets the human expert decide when to intervene rather than labeling every state systematically, reducing annotation burden. SafeDAgger adds a safety filter that invokes the expert whenever the learned policy's confidence falls below a threshold. EnsembleDAgger uses disagreement between an ensemble of policies as a proxy for uncertainty, triggering expert queries in high-uncertainty states rather than requiring the expert to monitor continuously.
Imitation from Observation
A stricter variant of imitation learning requires learning from state-only observations: the agent sees expert states but not the expert's actions. Imitation from observation (IfO) must additionally infer what actions produced the observed state transitions, making the problem harder but applicable to settings where action information is unavailable — for example, learning to drive by watching video of human driving without access to steering and throttle logs.
Inverse Reinforcement Learning
Rather than copying the expert's actions, IRL asks a deeper question: what reward function would make the expert's behavior optimal? Recover the reward, then optimize it — and the resulting policy may generalize far beyond the demonstrated behavior.
The IRL Problem
Inverse reinforcement learning (IRL) takes expert demonstrations $\mathcal{D}$ as input and outputs a reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ such that the expert policy is (near-)optimal under $r$. The recovered reward can then be used to train a new agent via standard RL, potentially allowing the agent to generalize to states and situations not seen in the demonstrations.
IRL is ill-posed: many reward functions make any policy optimal (the trivially zero reward makes every policy optimal). Early approaches constrained the reward function to be a linear combination of hand-specified features, $r(s, a) = w^\top \phi(s, a)$, and sought weights $w$ such that the expert's feature expectations matched those of the optimal policy under $r$. This feature matching approach rests on a classic observation: two policies with the same expected feature counts achieve the same expected return under any reward that is linear in those features.
Maximum Entropy IRL
Maximum entropy IRL (Ziebart et al., 2008) resolves the ambiguity in IRL with an elegant principle: among all distributions over trajectories that match the expert's feature expectations, choose the one with maximum entropy. This leads to a distribution over trajectories proportional to $\exp(r(\tau))$, where $r(\tau) = \sum_t r(s_t, a_t)$. The recovered reward can be found by gradient descent: the gradient of the log-likelihood of the expert demonstrations under this model equals the difference between the expert's feature expectations and those of the current model policy. This gives a tractable learning algorithm when the MDP can be solved exactly, and scales to complex settings with deep neural reward functions.
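A minimal sketch of the MaxEnt IRL gradient, assuming a toy setting where all candidate trajectories can be enumerated (so no dynamic programming is needed) and the reward is linear in state-visit features; the setup is invented for illustration:

```python
import numpy as np

# All length-2 trajectories in a 3-state world; features are state visit counts.
trajs = [(a, b) for a in range(3) for b in range(3)]

def feats(tau):
    f = np.zeros(3)
    for s in tau:
        f[s] += 1.0
    return f

Phi = np.array([feats(t) for t in trajs])        # (9, 3) feature matrix
expert_phi = feats((2, 2))                        # the "expert" always sits in state 2

w = np.zeros(3)                                   # reward weights r(s) = w[s]
for _ in range(500):
    logp = Phi @ w                                # log p(tau) proportional to r(tau)
    p = np.exp(logp - logp.max())
    p /= p.sum()
    grad = expert_phi - Phi.T @ p                 # expert minus model feature expectations
    w += 0.1 * grad                               # ascend the demo log-likelihood

p = np.exp(Phi @ w - (Phi @ w).max())
p /= p.sum()
best = trajs[int(np.argmax(p))]
```

The gradient line is the whole algorithm: the demonstration log-likelihood increases exactly when the model's feature expectations move toward the expert's.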
Reward Transferability
The main advantage of IRL over BC is reward transferability. A recovered reward function can be used to train an agent in a modified environment (different dynamics, different start states) and still produce expert-like behavior — because it captures the underlying goal, not just the specific actions used to achieve it. This makes IRL particularly valuable for robotics, where the training environment and deployment environment often differ.
GAIL & Adversarial Imitation
Maximum entropy IRL requires solving a full RL problem at each gradient step, which is expensive. GAIL bypasses explicit reward recovery by matching the agent's occupancy measure directly to the expert's, using an adversarial discriminator as a real-time reward signal.
Occupancy Measures
The occupancy measure of a policy $\pi$ is the discounted distribution over state-action pairs encountered when following $\pi$: $\rho_\pi(s, a) = \pi(a \mid s) \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi)$. Two facts make it central to imitation: the expected return of $\pi$ under any reward can be written as $\mathbb{E}_{\rho_\pi}[r(s,a)]$, and policies are in one-to-one correspondence with valid occupancy measures. Consequently, matching $\rho_\pi$ to $\rho_{\pi_E}$ (the expert's occupancy measure) is equivalent to matching the policy's behavior, and any policy that matches the expert's occupancy measure achieves the same return as the expert under the true reward function.
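A quick numerical check of the return identity, using a state-only occupancy for a fixed policy in a hand-made two-state chain (for brevity, rewards are attached to states and the action choice is folded into the transition matrix; all numbers are illustrative):

```python
import numpy as np

gamma = 0.9
P = np.array([[0.8, 0.2],     # P[s, s']: state transitions under the fixed policy
              [0.3, 0.7]])
r = np.array([1.0, 0.0])      # reward collected in each state
mu0 = np.array([1.0, 0.0])    # start-state distribution

# Accumulate the discounted state occupancy  rho(s) = sum_t gamma^t P(s_t = s).
d, occ = mu0.copy(), np.zeros(2)
for _ in range(500):
    occ += d
    d = gamma * (d @ P)

ret_from_occ = occ @ r                            # E_rho[r]
v = np.linalg.solve(np.eye(2) - gamma * P, r)     # closed-form state values
ret_exact = mu0 @ v                               # discounted return from mu0
```

The occupancy-weighted reward and the closed-form discounted return coincide, which is exactly why matching occupancy measures matches returns.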
The GAIL Objective
GAIL (Generative Adversarial Imitation Learning) (Ho & Ermon, 2016) casts imitation as a GAN problem. A discriminator $D_\psi(s, a)$ learns to distinguish expert state-action pairs from those of the learner policy; the policy learns to fool the discriminator. The GAIL objective is:
$$\min_\pi \max_D \;\mathbb{E}_{\rho_\pi}[\log D(s,a)] + \mathbb{E}_{\rho_E}[\log(1-D(s,a))] - \lambda H(\pi)$$

The discriminator provides a reward signal $r(s,a) = -\log D(s,a)$ which the policy optimizes via online RL (typically TRPO or PPO). At convergence, the learned policy's occupancy measure matches the expert's, and the discriminator cannot distinguish them. GAIL thus imitates without ever constructing an explicit reward function, making it more scalable than MaxEnt IRL.
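A sketch of the discriminator half of GAIL, reduced to logistic regression on one-dimensional synthetic features; the policy-optimization half (TRPO/PPO against $-\log D$) is omitted, and all data here is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic 1-D features: expert pairs cluster near +1, learner pairs near -1.
expert = rng.normal(1.0, 0.3, size=(500, 1))
learner = rng.normal(-1.0, 0.3, size=(500, 1))

# Logistic-regression discriminator D(x) = sigmoid(w*x + b), trained toward 1 on
# learner samples and 0 on expert samples (matching the objective above, where
# the policy's reward is -log D).
w, b = 0.0, 0.0
for _ in range(300):
    for X, y in ((learner, 1.0), (expert, 0.0)):
        d = sigmoid(w * X[:, 0] + b)
        err = d - y                               # binary cross-entropy gradient
        w -= 0.1 * (err @ X[:, 0]) / len(X)
        b -= 0.1 * err.mean()

d_learner = sigmoid(w * learner[:, 0] + b).mean()
reward_learner = -np.log(sigmoid(w * learner[:, 0] + b) + 1e-12).mean()
reward_expert = -np.log(sigmoid(w * expert[:, 0] + b) + 1e-12).mean()
```

Expert-like samples receive the higher reward, so a policy maximizing $-\log D$ is driven toward the expert's occupancy measure.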
f-Divergence Variants
The GAN objective in GAIL minimizes the Jensen-Shannon divergence between $\rho_\pi$ and $\rho_E$. More general f-divergence formulations (AIRL, f-GAIL) recover reward functions that are transferable across dynamics changes — a key advantage over vanilla GAIL which produces a reward that works only in the training environment. AIRL (Adversarial Inverse Reinforcement Learning) specifically recovers a reward that disentangles the environment dynamics from the true reward, enabling transfer to environments with different transition functions.
The Extrapolation Error Problem
Apply standard Q-learning to an offline dataset and the learned policy will fail — often catastrophically. The culprit is systematic Q-value overestimation for actions not in the dataset, causing the policy to greedily select actions it has never seen and cannot evaluate reliably.
Standard Q-learning trains $Q_\theta(s, a)$ to satisfy the Bellman equation using data from $\mathcal{D}$. The policy is then derived by $\pi(s) = \arg\max_a Q_\theta(s, a)$. In online RL, this works because the agent actually visits the states and actions selected by its current policy, correcting errors. In offline RL, the agent never executes its policy, so errors never get corrected. Worse, the greedy policy will tend to select out-of-distribution (OOD) actions — actions not well represented in $\mathcal{D}$ — precisely because the Q-function systematically overestimates their value (the Bellman backup uses the maximum over actions, and noise in the Q-function favors overestimates).
Why Naive Bellman Backups Fail
The root cause is the maximization operator in the Bellman backup: $y = r + \gamma \max_{a'} Q(s', a')$. The maximum over a set of noisy estimates is biased upward in expectation, so the max-selected action tends to be one whose value is overestimated. In online RL, this overestimation is bounded because the agent visits the selected action and gets a corrective signal. Offline, the overestimation accumulates unchecked over Bellman backups, driving OOD action values arbitrarily high. Fujimoto et al. (2019) showed empirically that standard off-policy deep RL applied offline degrades to near-random performance, even on datasets collected by a good policy — a clean demonstration that the algorithm failure is structural, not incidental.
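The upward bias of the max is easy to demonstrate numerically: even when every action's true value is zero and the estimates are unbiased, the max-selected estimate sits far above zero (a toy simulation, not a claim about any particular algorithm):

```python
import numpy as np

rng = np.random.default_rng(3)

# Every action's true value is 0; the Q-estimates are unbiased but noisy.
n_actions, n_trials = 10, 10_000
noisy_q = rng.normal(0.0, 1.0, size=(n_trials, n_actions))

mean_estimate = noisy_q.mean()            # per-action estimates: unbiased, near 0
mean_max = noisy_q.max(axis=1).mean()     # max over actions: biased well above 0
```

A greedy policy derived from these estimates would chase phantom value at every backup, which is precisely the offline failure mode described above.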
Conservative Q-Learning
CQL adds a penalty to the Bellman loss that explicitly suppresses Q-values for actions not in the dataset. The resulting value function is a lower bound on the true value — safe to optimize against, because underestimation cannot cause the policy to flee into unobserved territory.
The CQL Objective
Conservative Q-Learning (CQL) (Kumar et al., 2020) augments the standard Bellman loss with a regularizer that simultaneously minimizes Q-values for actions drawn from some prior $\mu$ (not in the dataset) and maximizes Q-values for actions from the dataset:
$$\mathcal{L}_{CQL}(Q) = \underbrace{\mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}[(Q(s,a) - \mathcal{B}^* Q(s,a))^2]}_{\text{Bellman error}} + \alpha \underbrace{\left(\mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu}[Q(s,a)] - \mathbb{E}_{(s,a) \sim \mathcal{D}}[Q(s,a)]\right)}_{\text{conservative penalty}}$$

The first term is the standard TD error. The second term with coefficient $\alpha$ pushes down Q-values under the prior $\mu$ (often a uniform distribution or the current policy) and pulls up Q-values for in-dataset actions. For sufficiently large $\alpha$, the resulting value estimate is a lower bound on the true value of the learned policy, $\hat{V}^\pi(s) \leq V^\pi(s)$ for states in $\mathcal{D}$ (with the dataset-maximization term included, the guarantee holds on the expected value rather than pointwise on $Q$).
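A sketch of the loss for discrete actions, taking $\mu$ to be uniform over actions (one common choice); the function shape and the toy numbers are illustrative, not the authors' reference implementation:

```python
import numpy as np

def cql_loss(q, td_target, data_actions, alpha=1.0):
    """CQL loss for a batch of discrete-action Q-values.

    q: (N, n_actions) Q(s, .) per batch state; td_target: (N,) Bellman targets
    for the dataset actions; data_actions: (N,) dataset action indices.
    """
    n = len(data_actions)
    q_data = q[np.arange(n), data_actions]
    bellman = ((q_data - td_target) ** 2).mean()          # standard TD error
    penalty = q.mean() - q_data.mean()                    # E_mu[Q] - E_D[Q], mu uniform
    return bellman + alpha * penalty

# A Q-table that inflates the two actions never seen in the data (indices 1, 2):
q = np.array([[0.0, 5.0, 5.0],
              [1.0, 6.0, 6.0]])
loss = cql_loss(q, td_target=np.array([0.0, 1.0]), data_actions=np.array([0, 0]))
```

The Bellman term is zero here (the dataset actions match their targets exactly), so the entire loss comes from the conservative penalty on the inflated OOD actions; gradient descent on it would push those values back down.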
Safe Policy Improvement
Because $\hat{Q}$ is a lower bound, any policy that maximizes $\hat{Q}$ is guaranteed, up to sampling error, to do at least as well as $\pi_\beta$ in expectation. This safe policy improvement guarantee is CQL's core theoretical contribution. The algorithm can extract a policy that is strictly better than $\pi_\beta$ in regions where the data gives reliable value estimates, while avoiding the collapse that naive offline Q-learning suffers. In practice, CQL works well on both discrete (Atari) and continuous (D4RL locomotion and manipulation) benchmarks, often substantially outperforming the behavior policy even on mixed-quality datasets.
CQL Variants
CQL can be combined with actor-critic methods (CQL(SAC)) for continuous actions. The prior $\mu$ can be chosen adaptively — using the current policy rather than a fixed distribution — making the regularizer tighter. A related algorithm, COMBO, adds model-based rollouts to CQL, generating additional pessimistic training data and improving sample efficiency in low-data regimes.
Implicit Q-Learning
CQL still involves computing Q-values at OOD actions (to push them down). IQL sidesteps this entirely by never evaluating the Q-function outside the dataset — using expectile regression to implicitly capture the maximum-value action without ever querying it.
The Insight
Implicit Q-Learning (IQL) (Kostrikov et al., 2021) observes that the Bellman optimality operator requires $\max_{a'} Q(s', a')$ — which inevitably queries OOD actions. IQL replaces this with an implicit approximation. Instead of explicitly maximizing over actions, IQL estimates the maximum Q-value using expectile regression: a regression objective that, as the expectile level $\tau \to 1$, converges to the maximum of the target distribution.
IQL trains three functions: a value function $V(s)$, a Q-function $Q(s, a)$, and a policy $\pi(a \mid s)$. The value function is trained with an expectile loss at level $\tau$ (typically $\tau = 0.7$–$0.9$):
$$\mathcal{L}_V(V) = \mathbb{E}_{(s,a) \sim \mathcal{D}}\left[L^\tau_2(Q(s,a) - V(s))\right]$$

where $L^\tau_2(u) = |\tau - \mathbf{1}(u < 0)| \cdot u^2$. This is an asymmetric squared loss: for $\tau > 0.5$ it penalizes residuals where $V$ underestimates $Q$ more heavily, pushing $V(s)$ toward the upper end of the Q-values of dataset actions. The Q-function is then trained with a standard Bellman backup using $V$ rather than $\max_a Q$: $y = r + \gamma V(s')$, which never queries OOD actions. The policy is extracted by advantage-weighted regression: actions with high advantage $A(s,a) = Q(s,a) - V(s)$ are upweighted.
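The expectile machinery is small enough to verify directly (an illustrative sketch, not the paper's code): at $\tau = 0.5$ the expectile of a sample is its mean, and as $\tau$ grows it moves toward the maximum:

```python
import numpy as np

def expectile_loss(u, tau):
    """Asymmetric squared loss |tau - 1(u < 0)| * u^2, averaged over the batch."""
    return np.mean(np.abs(tau - (u < 0)) * u ** 2)

def expectile(x, tau, iters=200, lr=0.5):
    """Scalar tau-expectile of a sample, by gradient descent on the loss above."""
    v = x.mean()
    for _ in range(iters):
        u = x - v
        v -= lr * (-2.0 * np.mean(np.abs(tau - (u < 0)) * u))
    return v

x = np.array([0.0, 1.0, 2.0, 10.0])   # stand-in for Q-values of dataset actions
v_mid = expectile(x, 0.5)             # tau = 0.5 recovers the mean, 3.25
v_hi = expectile(x, 0.9)              # tau = 0.9 moves toward max(x) = 10
```

Crucially, the expectile only ever uses the sample values themselves, which is why IQL's value target needs no OOD action queries.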
Why It Works
At $\tau = 1$, the expectile loss recovers exactly $V(s) = \max_{a \in \mathcal{D}(s)} Q(s, a)$ — the best action in the dataset at each state. With $\tau < 1$, IQL computes a softer maximum that interpolates between average and best performance. This gives a clean offline RL algorithm with no OOD queries, excellent empirical performance on D4RL benchmarks, and a simple implementation. IQL has become a standard offline RL baseline due to its stability and strong results across diverse task types.
TD3+BC & Offline Actor-Critic
Rather than modifying the value function, a family of methods regularizes the policy directly — constraining it to stay near the behavior policy while still optimizing for high value. Surprising result: a simple penalty on policy deviation can match or beat elaborate value-function modifications.
Policy Constraint Approaches
The fundamental offline RL challenge can be addressed from two sides: constrain the value function to be pessimistic (CQL, IQL) or constrain the policy to stay close to the behavior policy (so it never selects OOD actions whose values are unreliable). Policy constraint methods formulate:
$$\pi^* = \arg\max_\pi \mathbb{E}_{s \sim \mathcal{D}}\left[Q(s, \pi(s))\right] \quad \text{subject to } D(\pi, \pi_\beta) \leq \epsilon$$

where $D$ is some divergence measure. Different choices of $D$ give different algorithms: KL divergence (BRAC), maximum mean discrepancy (BEAR), and support constraints (the policy assigns nonzero probability only where $\pi_\beta$ does).
TD3+BC
TD3+BC (Fujimoto & Gu, 2021) is perhaps the simplest offline RL algorithm that works well. It adds a behavior cloning term directly to the TD3 policy update:
$$\pi = \arg\max_\pi \;\mathbb{E}_{(s,a) \sim \mathcal{D}}\left[\lambda\, Q(s, \pi(s)) - (\pi(s) - a)^2\right]$$

The first term maximizes the Q-value of the policy's action; the second penalizes deviation from the action actually taken in the dataset. The normalization factor $\lambda = \alpha \big/ \tfrac{1}{N}\sum_i |Q(s_i, a_i)|$ (with $\alpha = 2.5$ by default) keeps the two terms comparably scaled. This two-line modification of TD3 achieves competitive performance on D4RL benchmarks, underscoring how much offline RL complexity can be captured by a simple regularizer.
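A sketch of the resulting policy loss (written as a quantity to minimize), with the Q-scale normalization; the batch and values below are synthetic:

```python
import numpy as np

def td3_bc_policy_loss(q_values, policy_actions, data_actions, alpha=2.5):
    """TD3+BC policy objective (minimized), after Fujimoto & Gu (2021).

    q_values: (N,) Q(s, pi(s)); policy_actions / data_actions: (N, act_dim).
    lambda rescales the Q term to the magnitude of the BC penalty.
    """
    lam = alpha / (np.abs(q_values).mean() + 1e-8)
    bc = ((policy_actions - data_actions) ** 2).sum(axis=1).mean()
    return -(lam * q_values).mean() + bc

q = np.array([10.0, 12.0])
data_a = np.array([[0.5], [-0.5]])
loss_match = td3_bc_policy_loss(q, data_a, data_a)        # policy copies the data
loss_drift = td3_bc_policy_loss(q, data_a + 1.0, data_a)  # same Q, drifted actions
```

With the Q-values held fixed, drifting away from the dataset actions strictly increases the loss, which is the whole regularization effect.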
BEAR
BEAR (Bootstrapping Error Accumulation Reduction) uses maximum mean discrepancy (MMD) to constrain the policy's action distribution to be close to the behavior policy in distribution space (not just in mean action). MMD is estimated using a kernel between samples from $\pi$ and samples from $\pi_\beta$, and the constraint is enforced via a Lagrangian. BEAR's key insight is that support constraints (simply not selecting actions outside $\pi_\beta$'s support) are more important than mode constraints (matching the mode of $\pi_\beta$) — a loose support constraint allows the policy to focus on the best actions in the dataset without requiring it to copy suboptimal behavior.
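A minimal MMD estimator with an RBF kernel, applied to synthetic action samples (illustrative only; BEAR additionally enforces the constraint with a Lagrange multiplier during actor updates):

```python
import numpy as np

def mmd_sq(x, y, sigma=1.0):
    """Biased estimate of squared MMD between two sample sets, RBF kernel."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

rng = np.random.default_rng(4)
behavior = rng.normal(0.0, 1.0, size=(200, 1))   # stand-in for pi_beta's actions
near = rng.normal(0.1, 1.0, size=(200, 1))       # policy close to the data support
far = rng.normal(3.0, 1.0, size=(200, 1))        # policy far outside it

mmd_near = mmd_sq(behavior, near)
mmd_far = mmd_sq(behavior, far)
```

A policy whose action samples leave the behavior policy's support incurs a much larger MMD, so constraining MMD keeps the policy within the support without forcing it to copy $\pi_\beta$'s mode.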
Offline Model-Based RL
Model-based methods offer a natural offline RL solution: learn a dynamics model from the dataset, then plan with it. The challenge is that model errors compound — and a model trained offline may be wildly wrong in regions far from the data. The solution is pessimism: penalize rewards in high-uncertainty regions.
Model Uncertainty as a Penalty
MOPO (Model-based Offline Policy Optimization) (Yu et al., 2020) trains an ensemble of neural network dynamics models on $\mathcal{D}$, then defines a penalized reward:
$$\tilde{r}(s, a) = r(s, a) - \lambda \cdot u(s, a)$$

where $u(s, a)$ is the uncertainty of the dynamics model — measured as the disagreement between ensemble members. Planning against $\tilde{r}$ with model rollouts gives a policy that avoids state-action pairs where the model is unreliable, because the uncertainty penalty makes them look bad even if the nominal reward is high. MOPO provides a formal bound: the true return of the learned policy is at least the penalized return minus a term proportional to $\lambda$ times the maximum model error. Choosing $\lambda$ appropriately tunes the conservatism.
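A sketch of the penalized reward, using the spread of ensemble predictions as the disagreement proxy (one common choice; exact uncertainty estimators vary across implementations, and the ensemble outputs below are synthetic):

```python
import numpy as np

def penalized_reward(r, ensemble_next, lam=1.0):
    """MOPO-style reward: nominal reward minus lambda times model disagreement.

    ensemble_next: (K, N, state_dim) next-state predictions from K models; the
    per-dimension std across models, maxed over dimensions, is the uncertainty.
    """
    u = ensemble_next.std(axis=0).max(axis=-1)    # (N,) disagreement proxy
    return r - lam * u

rng = np.random.default_rng(5)
r = np.ones(100)
preds_in = rng.normal(0.0, 0.01, size=(5, 100, 3))   # in-distribution: models agree
preds_ood = rng.normal(0.0, 1.0, size=(5, 100, 3))   # OOD: models diverge

r_in = penalized_reward(r, preds_in).mean()
r_ood = penalized_reward(r, preds_ood).mean()
```

Even though the nominal reward is identical everywhere, the penalized reward is much lower where the models disagree, which is what steers planning away from unreliable regions.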
MOReL
MOReL (Model-based Offline Reinforcement Learning) takes a harder pessimistic stance. Instead of penalizing uncertain regions, it partitions state-action space into a known region (low model uncertainty) and an unknown region (high model uncertainty), and assigns a large negative reward to all transitions entering the unknown region. The agent is thus forbidden from exploiting model errors — it must solve the offline problem within the envelope of reliable model predictions. MOReL's partitioning is simpler to tune than MOPO's continuous penalty and achieves comparable performance.
Advantages of the Model-Based Approach
Offline model-based methods have a key advantage over model-free methods: they can generate additional training data via rollouts, potentially covering parts of the state space underrepresented in $\mathcal{D}$. They can also leverage planning methods (MPC, MCTS) at test time. The main risk is that a model trained on limited data may be confidently wrong in ways the uncertainty estimate misses. In practice, ensemble disagreement is a surprisingly reliable uncertainty proxy for locomotion and manipulation tasks, but breaks down on higher-dimensional observations like images.
Decision Transformer
What if offline RL is just sequence modeling? The Decision Transformer reframes the problem as conditional generation: given a desired return, generate the sequence of actions that achieves it. No Bellman equations, no value functions — just a Transformer trained on offline data.
Return-Conditioned Sequence Modeling
Decision Transformer (DT) (Chen et al., 2021) represents trajectories as sequences of (return-to-go $\hat{R}_t$, state $s_t$, action $a_t$) tuples. The return-to-go $\hat{R}_t = \sum_{t'=t}^T r_{t'}$ is the sum of future rewards from step $t$ onward. The model — a GPT-style Transformer — is trained to predict the action $a_t$ given the context window of recent $(R, s, a)$ tuples. At test time, the agent specifies a high desired return-to-go (e.g., the maximum observed in the dataset), and the Transformer autoregressively generates actions that produce trajectories with that return.
There are no explicit value functions, no Bellman backups, no distributional shift corrections — the model simply learns the conditional distribution $\pi(a_t \mid \hat{R}_t, s_t, \hat{R}_{t-1}, s_{t-1}, a_{t-1}, \ldots)$ from the offline data. The return conditioning acts as a form of goal specification: higher desired returns produce higher-quality trajectories.
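The data preparation is simple enough to sketch: compute returns-to-go and interleave the $(\hat{R}_t, s_t, a_t)$ triples into the token stream the Transformer consumes (the token representation here is illustrative; real implementations embed each modality separately):

```python
import numpy as np

def returns_to_go(rewards):
    """R_t = sum of rewards from step t to the end of the trajectory."""
    return np.cumsum(rewards[::-1])[::-1]

def interleave(rtg, states, actions):
    """Flatten per-step (R_t, s_t, a_t) triples into one token sequence."""
    return [tok for triple in zip(rtg, states, actions) for tok in triple]

rtg = returns_to_go(np.array([1.0, 2.0, 3.0]))   # [6., 5., 3.]
tokens = interleave(rtg, ["s0", "s1", "s2"], ["a0", "a1", "a2"])
```

At test time the first return-to-go token is simply overwritten with the desired return, and each executed step subtracts the observed reward from it.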
Trajectory Transformer
Trajectory Transformer (Janner et al., 2021) extends the sequence modeling approach to model the entire trajectory — states, actions, and rewards jointly — enabling planning via beam search over predicted trajectories. Unlike DT, which generates actions autoregressively, Trajectory Transformer can look ahead and select the sequence of actions with the highest predicted cumulative reward, explicitly performing model predictive control inside the Transformer.
In-Context RL
A striking extension of sequence modeling for offline RL is in-context reinforcement learning: a Transformer trained on offline data from many tasks can adapt to a new task at test time simply by being given a few demonstrations in its context window — analogous to in-context learning in large language models. Algorithm Distillation (Laskin et al., 2022) demonstrates that a Transformer can even learn to do RL in-context: given a sequence of improving trajectories from an online RL algorithm, the Transformer learns to replicate the RL algorithm's improvement process at meta-test time.
Debate: Decision Transformer performs remarkably well on D4RL locomotion benchmarks — often matching CQL and IQL — but its performance on stitching tasks (where the optimal trajectory must be assembled from sub-trajectories in the dataset) is weaker than value-function methods. This suggests that sequence modeling captures the conditional structure of demonstrations well but lacks the credit-assignment mechanism that makes offline RL powerful when data is imperfect.
Benchmarks & Applications
D4RL standardized offline RL evaluation; deployment has followed in robotics, clinical decision support, and autonomous driving. Offline-to-online fine-tuning bridges the dataset-first learning phase with subsequent environment access.
D4RL
D4RL (Datasets for Deep Data-Driven Reinforcement Learning) (Fu et al., 2020) is the standard benchmark suite for offline RL. It provides fixed datasets of varying quality — random, medium, medium-replay, medium-expert, and expert — for MuJoCo locomotion tasks (HalfCheetah, Hopper, Walker2d), navigation tasks (Maze2D, AntMaze), and manipulation tasks (FrankaKitchen, Adroit Pen). The locomotion tasks test basic offline optimization; AntMaze requires stitching short sub-trajectories into long-range navigation; Kitchen requires composing manipulation skills. D4RL has been critical for measuring algorithmic progress and revealing that different algorithms excel in different regimes.
Robotics
Offline RL is particularly well-suited to robotics, where real-world interaction is expensive and risky. Large-scale robot learning projects (RT-1, Open X-Embodiment) have assembled datasets of hundreds of thousands of robot demonstrations and trained generalist policies using BC and offline RL. The result is broad task coverage that would be impossible to achieve with online RL alone. Offline-trained policies are then fine-tuned online for specific tasks using small amounts of real-world interaction, a pattern that dramatically reduces the cost of robot skill acquisition.
Clinical Decision Support
Electronic health records contain retrospective data on millions of clinical decisions and patient outcomes — a natural offline RL dataset. Offline RL has been applied to sepsis treatment (optimizing vasopressor and fluid dosing from ICU records), mechanical ventilator management, and chemotherapy scheduling. The safety-critical nature of these applications makes offline-only methods essential: online exploration with real patients is ethically unacceptable. Conservative methods (CQL, IQL) are preferred because they provide guarantees that the learned policy is at least as good as historical care in expectation.
Offline-to-Online Fine-Tuning
Pure offline learning and pure online learning represent extremes. In practice, the most effective approach is often offline-to-online fine-tuning: pre-train a policy offline on a large dataset, then continue training online with limited environment access. The offline phase provides a good initialization; the online phase corrects residual errors and adapts to the deployment environment. Naive fine-tuning can be unstable — the policy quickly unlearns offline knowledge when online data is sparse — so algorithms like IQL fine-tuning, Cal-QL, and SPOT maintain conservative constraints during online adaptation, gradually relaxing them as online data accumulates.
Algorithm Summary
| Algorithm | Family | Key Mechanism | Strength |
|---|---|---|---|
| BC | Imitation | Supervised cloning | Simple; works well with expert data |
| DAgger | Interactive IL | Dataset aggregation with expert queries | Fixes BC compounding errors |
| MaxEnt IRL | IRL | Max-entropy trajectory distribution | Transferable reward function |
| GAIL | Adversarial IL | GAN occupancy measure matching | No explicit reward needed |
| CQL | Offline RL (value) | Pessimistic Q-value lower bound | Strong theory; discrete + continuous |
| IQL | Offline RL (value) | Expectile regression; no OOD queries | Stable; strong on D4RL |
| TD3+BC | Offline RL (policy) | BC regularizer on policy update | Simple; competitive baseline |
| MOPO | Offline MBRL | Uncertainty-penalized model rewards | Data-efficient; principled bounds |
| Decision Transformer | Sequence modeling | Return-conditioned generation | No value functions; scales with data |
Further Reading
Foundational Papers
- *A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning* (DAgger): Introduced DAgger and the formal analysis of compounding errors in imitation learning, proving that interactive data collection reduces the $O(\epsilon T^2)$ error to $O(\epsilon T)$. The foundational imitation learning theory paper.
- *Generative Adversarial Imitation Learning* (GAIL): Framed imitation as occupancy measure matching, proposing the GAN-based training procedure that avoids explicit reward recovery while achieving expert-level performance on MuJoCo tasks. Seminal adversarial imitation paper; connects IRL, GANs, and RL.
- *Conservative Q-Learning for Offline Reinforcement Learning* (CQL): Proposed the pessimistic value function approach to offline RL, deriving policy improvement guarantees and showing strong empirical results on D4RL locomotion and manipulation tasks and on offline Atari. The defining offline RL algorithm paper; required reading.
Key Results
- *Offline Reinforcement Learning with Implicit Q-Learning* (IQL): Introduced expectile regression as a way to train offline RL without OOD action queries, achieving state-of-the-art D4RL results with a simpler and more stable training procedure than CQL. The cleanest offline RL formulation; widely adopted in practice.
- *Decision Transformer: Reinforcement Learning via Sequence Modeling*: Demonstrated that return-conditioned sequence modeling with a Transformer can match value-function offline RL methods on Atari and MuJoCo, sparking a large literature on in-context and sequence-modeling approaches to RL. Opened the sequence modeling framing of offline RL.
- *D4RL: Datasets for Deep Data-Driven Reinforcement Learning*: Introduced the standard offline RL benchmark suite covering locomotion, navigation, and manipulation tasks at multiple dataset quality levels. Essential for reproducible offline RL research. The benchmark every offline RL paper reports on.
Surveys & Resources
- *Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems*: Comprehensive survey of offline RL from the Berkeley group, covering the problem formulation, algorithmic landscape, theoretical underpinnings, and connections to imitation learning and model-based methods. The definitive survey; start here for a thorough grounding.
- *A Minimalist Approach to Offline Reinforcement Learning* (TD3+BC): Showed that adding a simple behavior cloning penalty to TD3 achieves competitive offline RL performance, arguing that much of the offline RL literature's complexity may be unnecessary. A useful counterpoint — sometimes the simplest method is best.