Model-Based RL & World Models: learning to imagine before acting.

Every model-free algorithm discards a piece of knowledge it implicitly possesses: the transition dynamics of the environment. Model-based methods learn those dynamics explicitly, using the resulting world model to generate imaginary experience that supplements — or replaces — costly real interaction. Dyna framed this idea in the tabular era; MBPO and Dreamer brought it to deep RL with neural transition models; Monte Carlo Tree Search demonstrated that planning with a model can reach superhuman performance even in the most complex games humans have devised.

Chapter notes

Model-based RL spans an unusually wide range of methods — from the classical Dyna architecture to latent-space world models to AlphaZero's tree search. The unifying thread is the use of a learned or given transition model to improve sample efficiency or planning quality. The chapter covers the principal modern algorithms and the key theoretical challenge that links them all: compounding model error.

Prerequisites: Chapter 02 (Tabular RL) introduces the Dyna idea in the tabular setting; Chapter 03 and Chapter 04 cover the model-free deep RL baselines that model-based methods aim to improve upon. Familiarity with variational autoencoders (Part X, Ch 01) is helpful for Sections 07–09 but not required.

01

The model-based idea

Every real environment interaction is expensive — in robot hardware, in simulation time, in financial cost. A learned model of the environment lets the agent extract more signal from each real transition by asking "what would have happened if I had done something different?"

Model-free methods learn directly from experience: interact, observe a reward, update a value function or policy. They are agnostic about the environment's internal structure and require many samples to converge — typically millions of environment steps for Atari, tens of millions for MuJoCo locomotion. This is acceptable when a fast simulator is available; it is prohibitive when each step requires running a physical robot, executing a real-world process, or waiting for hours of biological experiment.

Model-based methods learn a world model $\hat{f}$: a function that predicts the next state and reward given the current state and action. Once the model exists, it can be queried arbitrarily — producing imaginary transitions without touching the real environment. If the model is accurate, these imagined transitions are as informative as real ones; the agent can improve its policy from imaginary rollouts and reserve real interactions for updating the model itself.

The central promise is sample efficiency: achieving a given level of performance with fewer real environment steps. Model-free algorithms like SAC typically require $10^5$–$10^6$ steps to solve a continuous control task; Dreamer achieves comparable performance in $10^4$–$10^5$ steps, a 10× improvement. In robotic manipulation, this can mean the difference between a few hours of data collection and several weeks.

02

Types of world models

A "world model" is a family of learnable functions. The most important is the transition model, but a complete model also predicts rewards and episode termination — and models differ sharply in whether they operate in observation space or a compressed latent space.

A forward transition model $\hat{f}(s, a) \to s'$ predicts the next state given the current state and action. This is the core component. An inverse model $g(s, s') \to a$ instead predicts which action was taken given a pair of consecutive states — useful for self-supervised representation learning but not sufficient for planning. A reward model $\hat{r}(s, a) \to r$ predicts the scalar reward; a termination model $\hat{d}(s, a) \to \{0, 1\}$ predicts whether the episode ends.

Models also vary in how they represent uncertainty. A deterministic model outputs a single predicted next state: $s' = \hat{f}_\theta(s, a)$. It is fast and simple but cannot express the inherent stochasticity of many environments. A stochastic model outputs a distribution over next states: $p_\theta(s' \mid s, a)$, typically parameterized as a Gaussian or a categorical mixture. Stochastic models can represent multimodal futures and provide calibrated uncertainty, at the cost of greater complexity.
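The two interfaces can be sketched in a few lines of NumPy; the linear maps below are illustrative stand-ins for neural networks, not an implementation from any paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def det_model(s, a, W):
    """Deterministic model: one predicted next state s' = f(s, a)."""
    x = np.concatenate([s, a])
    return W @ x                      # a linear map stands in for a network

def stoch_model(s, a, W_mu, log_std):
    """Stochastic model: a diagonal Gaussian over next states p(s' | s, a)."""
    x = np.concatenate([s, a])
    return W_mu @ x, np.exp(log_std)  # mean and standard deviation

s, a = np.zeros(3), np.ones(2)
W = rng.standard_normal((3, 5))
mu, std = stoch_model(s, a, W, np.zeros(3))
sample = mu + std * rng.standard_normal(3)  # one draw from p(s' | s, a)
```

The deterministic model returns a point prediction; the stochastic model returns distribution parameters that the caller samples from, which is what lets it express multimodal or noisy futures.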

Perhaps the most important architectural choice is whether the model operates in observation space or latent space. Observation-space models predict next frames or sensor readings directly — interpretable but computationally expensive for high-dimensional inputs like images. Latent-space models first encode observations into a compact representation $z = \text{enc}(o)$, then model dynamics entirely in $z$-space. This is far more efficient, but the model must also learn to decode $z$ back to observations for training supervision.

03

Dyna in the deep RL era

Sutton's 1991 Dyna architecture integrated model learning, direct RL, and model-based planning into a single unified loop. With neural networks in every component, it remains the conceptual skeleton of most modern model-based methods.

The Dyna loop has three simultaneous processes. First, the agent interacts with the real environment, collecting transitions $(s, a, r, s')$ into a replay buffer $\mathcal{D}$. Second, a transition model $\hat{f}_\theta$ is trained on $\mathcal{D}$ to predict $(\hat{r}, \hat{s}')$ from $(s, a)$. Third, the model is used to generate imaginary transitions — starting from real states sampled from $\mathcal{D}$, the model rolls out $k$ steps and produces synthetic experience. Both real and imaginary transitions are used to update the policy or value function.
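The three processes fit into a single loop; every callable below is a placeholder stub to show the control flow, not a real environment, model, or learner:

```python
import random

def dyna_iteration(buffer, env_step, train_model, model_rollout, update_policy,
                   k=5, imag_per_real=10):
    """One Dyna iteration: real step -> model fit -> imagined policy updates."""
    transition = env_step()            # (s, a, r, s') from the real environment
    buffer.append(transition)
    update_policy([transition])        # direct RL on the real transition
    train_model(buffer)                # refit the world model on all real data
    for _ in range(imag_per_real):     # planning with imagined experience
        s0 = random.choice(buffer)[0]  # imagined rollouts start at real states
        update_policy(model_rollout(s0, k))

# Minimal stubs to exercise the loop shape only.
random.seed(0)
updates, buffer = [], []
dyna_iteration(
    buffer,
    env_step=lambda: (0, 1, 0.5, 0),
    train_model=lambda data: None,
    model_rollout=lambda s, k: [(s, 0, 0.0, s)] * k,
    update_policy=updates.append,
    imag_per_real=3,
)
```

`imag_per_real` is the imaginary-to-real update ratio discussed below; `k` is the rollout horizon.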

[Figure: the model-based training loop. Blocks: ENVIRONMENT (real dynamics) → REPLAY BUFFER 𝒟 (real transitions (s, a, r, s′)) → WORLD MODEL f̂(s, a) → (ŝ′, r̂) and POLICY π(a|s; θ); the world model feeds imagined rollouts (synthetic s, a, r, s′) back to the policy alongside real data.]
The model-based RL training loop. Real transitions flow from the environment into the replay buffer (right column), which trains both the world model and the policy directly. The world model additionally generates imagined rollouts — synthetic transitions the policy trains on without touching the real environment.

The ratio of imaginary to real updates is a key hyperparameter. Dyna-Q in the tabular setting used many planning steps per real step; in deep RL the optimal ratio depends on model quality. With an accurate model, more imaginary updates accelerate learning. With a poor model, they poison the policy with systematically wrong gradients. Modern algorithms like MBPO tune this ratio carefully and use short rollout horizons to keep model error bounded.

04

Compounding model error

A model that is 99% accurate at each step is only 37% accurate after 100 steps. Model error does not merely add — it multiplies, and long rollouts stray into regions the model has never seen and cannot predict reliably.

Let $\varepsilon$ be the one-step model error: the expected difference between the model's predicted next state and the true next state. After $H$ steps of rollout, the error accumulates. For simple linear models, the error grows as $O(H \varepsilon)$; for nonlinear models with sensitive dynamics, it grows exponentially. An imaginary trajectory that begins near a real state rapidly diverges from any real trajectory, visiting states the model was never trained on. In these out-of-distribution regions, model predictions are unreliable in unpredictable ways — often confidently wrong.
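Both views can be checked numerically. The 99%-per-step figure compounds multiplicatively; the second function is an illustrative worst-case bound assuming the dynamics have Lipschitz constant $L$ (an assumption added here, not a result from the text):

```python
# Per-step accuracy p compounds multiplicatively: the chance that all H
# predictions in a rollout are "right" is p ** H.
acc = {H: 0.99 ** H for H in (1, 10, 100)}   # acc[100] is about 0.37

def worstcase_error(eps, L, H):
    """Worst-case state error after H steps with one-step error eps and
    L-Lipschitz dynamics: eps * (1 + L + L**2 + ... + L**(H-1)).
    Linear in H when L = 1; exponential when L > 1."""
    return eps * sum(L ** i for i in range(H))
```

With $L = 1$ the bound recovers the $O(H\varepsilon)$ growth quoted above; any $L > 1$ makes it exponential in $H$.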

The phenomenon is closely related to covariate shift in supervised learning. The model is trained on states visited by the current behavioral policy, but a long rollout visits states induced by the policy conditioned on model predictions — a shifted distribution. The longer the rollout, the worse the shift.

This is the central challenge of model-based RL and explains most of the architectural choices in modern algorithms. MBPO limits rollout length $H$ to a small number ($H = 1$–$5$). Dreamer operates in a compact learned latent space where dynamics are simpler. Model ensembles quantify uncertainty to detect when rollouts enter unreliable regions. MCTS uses many short simulations rather than one long one, and returns to the model frequently.

05

Model ensembles and uncertainty

A single neural network cannot tell you how uncertain it is. An ensemble of networks can — their disagreement is a signal of epistemic uncertainty, flagging states where the model should not be trusted.

Training $N$ independent models $\{\hat{f}_{\theta_i}\}_{i=1}^N$ on the same dataset, with different random seeds and minibatch orderings, produces models that agree in well-explored regions (where all have seen ample training data) and disagree in under-explored regions (where extrapolation is unreliable). The variance of predictions across ensemble members — disagreement — is a practical proxy for epistemic uncertainty about the true dynamics.
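A toy NumPy illustration: the ensemble is simulated as copies of one linear model with small perturbations standing in for seed-to-seed variation, so members agree on small inputs and disagree on large, out-of-distribution ones:

```python
import numpy as np

rng = np.random.default_rng(0)

def ensemble_predict(models, s, a):
    """Predictions of every ensemble member for one (s, a) pair."""
    x = np.concatenate([s, a])
    return np.stack([W @ x for W in models])       # shape (N, state_dim)

def disagreement(preds):
    """Mean per-dimension variance across members: epistemic uncertainty proxy."""
    return preds.var(axis=0).mean()

state_dim, act_dim, N = 3, 2, 5
base = rng.standard_normal((state_dim, state_dim + act_dim))
models = [base + 0.01 * rng.standard_normal(base.shape) for _ in range(N)]

s_seen = np.zeros(state_dim)          # stands in for a well-explored state
s_ood = 100.0 * np.ones(state_dim)    # stands in for an unfamiliar state
a = np.ones(act_dim)
u_seen = disagreement(ensemble_predict(models, s_seen, a))
u_ood = disagreement(ensemble_predict(models, s_ood, a))
```

The disagreement score `u_ood` comes out far larger than `u_seen`, which is exactly the signal used both for pessimistic planning and for exploration bonuses.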

PETS (Probabilistic Ensembles with Trajectory Sampling, Chua et al. 2018) combines an ensemble of stochastic models with model predictive control. Each ensemble member is a Gaussian neural network outputting $(\mu_i(s, a), \Sigma_i(s, a))$; to generate a rollout, PETS samples from a randomly selected ensemble member at each step — a technique called TS∞ — propagating uncertainty over the full trajectory. The ensemble's spread determines how optimistic or pessimistic the planner should be.

Ensembles introduce a practical constraint: they multiply training cost by $N$ (typically 5–7). This is tolerable in settings where environment interaction is expensive and model training is cheap. It is also what makes PETS attractive for robotics: the cost of training seven models is negligible compared to the cost of additional robot runs.

Disagreement-based uncertainty is also used as an exploration bonus: the agent is rewarded for visiting states where ensemble members disagree, encouraging it to collect data that reduces model uncertainty. This is the basis for curiosity-driven exploration in model-based settings.

06

MBPO

Model-Based Policy Optimization (MBPO) achieves the sample efficiency of model-based methods and the asymptotic performance of model-free ones by generating only short model rollouts, then training a model-free algorithm on the combined real and imagined data.

Janner et al. (2019) showed that the suboptimality of a policy trained on model-generated data is bounded by a term that grows linearly with rollout length $H$ and with model error $\varepsilon$. The bound suggests that short rollouts (small $H$) reduce compounding error at the cost of less imaginary data diversity; the optimal $H$ balances these two effects. In practice, $H \in \{1, 5\}$ works well across a range of MuJoCo tasks.

The MBPO procedure: at each real environment step, (1) add the transition to replay buffer $\mathcal{D}_\text{env}$; (2) update an ensemble of probabilistic dynamics models on $\mathcal{D}_\text{env}$; (3) sample starting states from $\mathcal{D}_\text{env}$, roll out $H$ model steps to generate a batch of imagined transitions, and add them to $\mathcal{D}_\text{model}$; (4) run $G$ gradient steps of SAC on a mixture of $\mathcal{D}_\text{env}$ and $\mathcal{D}_\text{model}$.
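The four sub-steps can be expressed as a control-flow sketch; every component below (`fit_ensemble`, `sample_member`, `policy`, `sac_update`) is a placeholder stub rather than a real implementation:

```python
import random

def mbpo_step(d_env, d_model, env_step, fit_ensemble, sample_member,
              policy, sac_update, H=5, n_rollouts=4, G=2):
    """One MBPO outer step, following sub-steps (1)-(4) above."""
    d_env.append(env_step())                 # (1) store the real transition
    fit_ensemble(d_env)                      # (2) refit the dynamics ensemble
    for _ in range(n_rollouts):              # (3) short branched rollouts
        s = random.choice(d_env)[0]          #     starting from real states
        for _ in range(H):
            a = policy(s)
            s2, r = sample_member()(s, a)    #     random ensemble member
            d_model.append((s, a, r, s2))
            s = s2
    for _ in range(G):                       # (4) SAC on mixed real/model data
        n = min(8, len(d_env) + len(d_model))
        sac_update(random.sample(d_env + d_model, k=n))

# Stubs to exercise the control flow only.
random.seed(0)
d_env, d_model, updates = [], [], []
mbpo_step(
    d_env, d_model,
    env_step=lambda: (0.0, 0.0, 1.0, 0.0),
    fit_ensemble=lambda data: None,
    sample_member=lambda: (lambda s, a: (s + a, 0.0)),
    policy=lambda s: 0.1,
    sac_update=updates.append,
)
```

Note that imagined rollouts branch from states in $\mathcal{D}_\text{env}$, so every imagined trajectory starts in-distribution, which is what the short horizon $H$ preserves.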

The key insight is that MBPO does not require the model to produce accurate long-horizon predictions — only accurate one- to five-step predictions starting from real states. Real states are by definition in-distribution for the model; the short rollout keeps the imagined trajectories close to the training distribution. Empirically, MBPO matches the final performance of SAC while using roughly 5–40× fewer real environment steps, depending on the task.

07

Latent space world models

Modeling dynamics directly in pixel space is expensive and unnecessary. Learning a compact latent representation and running dynamics there is cheaper, better regularized, and — as Dreamer showed — sufficient for impressive performance.

A latent world model consists of three learned components: an encoder $q_\phi(z_t \mid o_t, h_t)$ that maps observations into latent states, a dynamics model $p_\psi(z_{t+1} \mid z_t, a_t, h_{t+1})$ that predicts future latent states, and a decoder $p_\xi(o_t \mid z_t, h_t)$ that reconstructs observations from latents (used only for training, not at test time). The latent state $z_t$ captures the information in the observation that is relevant to predicting future observations and rewards; irrelevant details — visual texture, uninformative background — are discarded.

The Recurrent State Space Model (RSSM, Hafner et al. 2019) combines a deterministic RNN component $h_t$ with a stochastic component $z_t$: the deterministic component provides long-range memory; the stochastic component captures environmental randomness. The joint state $(h_t, z_t)$ forms the latent representation. This factored design allows the model to express both predictable dynamics (captured by $h_t$) and irreducible uncertainty (captured by $z_t$).

The model is trained by maximizing a variational lower bound that combines three objectives: reconstruction of observations from latents (pixel reconstruction loss), prediction of rewards from latents, and a KL divergence between the posterior encoder $q$ (using the actual next observation) and the prior dynamics model $p$ (predicting without seeing the next observation). The KL term forces the prior to track the posterior — making imagination possible.
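A minimal NumPy sketch of this three-term objective, using squared error in place of the Gaussian log-likelihoods (equivalent up to constants under unit-variance assumptions); `kl_scale` is an illustrative weighting, not a value from the papers:

```python
import numpy as np

def kl_gaussian(mu_q, std_q, mu_p, std_p):
    """KL( N(mu_q, std_q^2) || N(mu_p, std_p^2) ), summed over dimensions."""
    return np.sum(np.log(std_p / std_q)
                  + (std_q ** 2 + (mu_q - mu_p) ** 2) / (2 * std_p ** 2)
                  - 0.5)

def world_model_loss(obs, obs_recon, r, r_pred, mu_q, std_q, mu_p, std_p,
                     kl_scale=1.0):
    """Reconstruction + reward prediction + posterior-to-prior KL."""
    recon = np.sum((obs - obs_recon) ** 2)      # observation reconstruction
    reward = (r - r_pred) ** 2                  # reward prediction
    kl = kl_gaussian(mu_q, std_q, mu_p, std_p)  # forces prior to track posterior
    return recon + reward + kl_scale * kl
```

The KL term is the piece that makes imagination work: if the prior $p$ stays close to the posterior $q$, rolling out the prior alone (without future observations) yields latents the decoder and reward head can still interpret.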

08

Dreamer

Hafner et al. (2020) showed that an actor-critic trained entirely inside a latent world model's imagination — never touching the real environment during policy optimization — can match or exceed the performance of model-free algorithms using a fraction of the real environment steps.

Dreamer separates the learning process into two phases. In the world model phase, the RSSM is trained on sequences of real experience: given a sequence of observations and actions $(o_1, a_1, \ldots, o_T)$, the model learns to encode them, predict dynamics, reconstruct observations, and predict rewards. This phase runs standard gradient descent on the variational objective described in Section 07.

In the behavior learning phase, an actor and critic are trained entirely inside the learned model — never querying the real environment. Starting from latent states sampled from real experience, the actor generates actions; the RSSM prior rolls out imagined trajectories for $H = 15$ steps; the critic estimates returns along these imagined trajectories using $\lambda$-returns; the actor is updated to maximize predicted returns. Gradients flow through the model's deterministic dynamics via backpropagation through time (BPTT).
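The $\lambda$-return targets along an imagined trajectory follow the standard backward recursion $G_t = r_t + \gamma\,[(1-\lambda)\,V(s_{t+1}) + \lambda\,G_{t+1}]$, bootstrapping from the value at the horizon; a NumPy sketch:

```python
import numpy as np

def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """TD(lambda) targets along an imagined rollout.

    `values` has one more entry than `rewards`: the final entry is the
    bootstrap value at the rollout horizon.
    """
    H = len(rewards)
    G = np.empty(H)
    next_return = values[-1]                  # bootstrap at the horizon
    for t in reversed(range(H)):
        G[t] = rewards[t] + gamma * ((1 - lam) * values[t + 1]
                                     + lam * next_return)
        next_return = G[t]
    return G
```

Setting `lam=0` recovers one-step TD targets; `lam=1` recovers Monte Carlo returns bootstrapped at the horizon.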

The key insight enabling this is latent imagination: because the dynamics model is differentiable, gradients from the critic's value predictions can propagate all the way back through the imagined rollout to the actor's parameters. This is far more efficient than REINFORCE-style policy gradient, which would require many rollouts to estimate the gradient. The imagined rollout provides a dense, differentiable training signal at every step.

Dreamer on DeepMind Control Suite: The original Dreamer paper matched or exceeded the performance of SAC — a strong model-free baseline — using 5× fewer environment steps on most tasks. The gains are largest on visually complex tasks where the world model's compression of pixel observations provides the most benefit.

09

DreamerV2 and DreamerV3

DreamerV2 replaced Gaussian latents with categorical ones and achieved human-level performance on Atari. DreamerV3 introduced universal hyperparameters and trained a single agent to master tasks as different as Atari, continuous control, and Minecraft diamond collection.

DreamerV2 (Hafner et al. 2021) replaced the RSSM's Gaussian stochastic state with a categorical stochastic state: 32 categorical variables, each with 32 classes. Categorical variables are better suited to the discrete, mode-switching dynamics of Atari games — a ball bouncing off a paddle is a discrete event, better captured by categorical than Gaussian noise. The categorical sampling is trained using the straight-through gradient estimator, which passes gradients through the discrete sampling step as if it were continuous. On 55 Atari games, DreamerV2 matched the performance of Rainbow with a fraction of the environment interactions and without any per-game hyperparameter tuning.

DreamerV3 (Hafner et al. 2023) aimed for universality. The key architectural additions are the symlog transform — $\text{symlog}(x) = \text{sgn}(x) \cdot \log(|x| + 1)$ — applied to inputs, targets, and predictions to handle returns that vary by orders of magnitude across tasks, and free bits in the KL loss to prevent posterior collapse in easy-to-predict environments. The critic predicts returns as a discrete distribution over exponentially spaced bins (two-hot encoded symlog values), decoded with symexp. These changes stabilize training across radically different reward scales without any task-specific tuning.
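The symlog transform and its inverse $\text{symexp}(x) = \text{sgn}(x)(e^{|x|} - 1)$ are each one line:

```python
import numpy as np

def symlog(x):
    """Compress magnitudes symmetrically: sgn(x) * log(|x| + 1)."""
    return np.sign(x) * np.log(np.abs(x) + 1.0)

def symexp(x):
    """Inverse of symlog: sgn(x) * (exp(|x|) - 1)."""
    return np.sign(x) * (np.exp(np.abs(x)) - 1.0)
```

Because the transform is symmetric around zero and approximately linear near it, small rewards pass through nearly unchanged while returns of $\pm 1000$ are squashed to roughly $\pm 6.9$.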

With fixed hyperparameters across all domains, DreamerV3 became the first RL agent to collect a diamond in Minecraft from scratch (a task requiring a sequence of over 20 intermediate milestones), matched state-of-the-art on Atari and BSuite, and matched or exceeded specialized algorithms on continuous control and 3D environments. The universality result suggests that world model quality, not task-specific tuning, is the primary driver of performance.

10

Monte Carlo Tree Search

MCTS plans by building a search tree of possible futures, allocating more simulation budget to promising branches. Combined with a learned value function and policy, it produces the planning engine behind AlphaGo, AlphaZero, and MuZero.

Pure MCTS — without a learned model — requires a perfect simulator to roll out possible futures. But its integration with learned value functions in AlphaGo demonstrated that even an imperfect model combined with a value estimate can dramatically outperform both pure search and pure learning. AlphaZero then removed the human-engineered features entirely, learning value and policy networks from self-play alone, then using those networks to guide MCTS.

Each iteration of MCTS has four phases. Selection: starting from the root (current state), descend the tree by selecting the action that maximizes the PUCT score $Q(s, a) + c \cdot P(s, a) \cdot \sqrt{\textstyle\sum_b N(s, b)} \,/\, (1 + N(s, a))$, where $Q$ is the current value estimate, $P$ is the prior policy probability, and $N$ is the visit count. Expansion: when a leaf node is reached, expand it by querying the policy network for prior action probabilities. Simulation: evaluate the new leaf using the value network (or, in classic MCTS, run a random rollout). Backpropagation: update $Q$ and $N$ values from the leaf back to the root.
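A sketch of the selection and backpropagation phases over a dictionary-based tree; the node layout (`Q`, `N`, `P` per child) is an illustrative choice, not AlphaZero's actual data structure:

```python
import math

def select_action(node, c=1.25):
    """Pick the child maximizing the AlphaZero-style PUCT score
    Q + c * P * sqrt(total visits) / (1 + N)."""
    total = sum(ch["N"] for ch in node["children"].values())
    def score(ch):
        return ch["Q"] + c * ch["P"] * math.sqrt(total) / (1 + ch["N"])
    return max(node["children"], key=lambda a: score(node["children"][a]))

def backup(path, value):
    """Propagate a leaf evaluation back toward the root: increment visit
    counts and update Q as a running mean of backed-up values."""
    for ch in reversed(path):
        ch["N"] += 1
        ch["Q"] += (value - ch["Q"]) / ch["N"]
```

The prior term dominates for unvisited actions (small $N$, large $P$), while the $Q$ term dominates once estimates are reliable, which is how the search shifts smoothly from policy-guided exploration to value-guided exploitation.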

MuZero (Schrittwieser et al. 2020) extends this framework by learning the model itself rather than relying on a known simulator. MuZero learns a representation network, dynamics network, and prediction network jointly from experience, enabling MCTS planning in environments where the rules are unknown. It matched AlphaZero in chess, shogi, and Go, and simultaneously set new records on Atari — demonstrating that planning with a learned model can match planning with a perfect one.

11

Shooting methods and MPC

Before gradient-based optimization through learned models became standard, planning meant sampling many candidate action sequences, evaluating each under the model, and executing the best one. This family of methods remains competitive in settings with short planning horizons.

Random shooting is the simplest approach: sample $N$ action sequences of length $H$ uniformly at random, roll each out under the model to compute predicted cumulative reward, and execute the first action of the highest-scoring sequence. It is embarrassingly parallel and requires no gradients, but is sample-inefficient: most random sequences are poor, and the search space grows exponentially with $H$.

The Cross-Entropy Method (CEM) improves on random shooting by iterative refinement. CEM maintains a distribution over action sequences (typically a Gaussian). At each iteration, it samples $N$ sequences from the current distribution, evaluates them under the model, selects the top $K$ elite sequences, and fits a new Gaussian to the elites. Iterating this procedure concentrates probability on high-reward regions of action space. CEM is the planner used in PETS and is effective for $H \leq 30$ step horizons.
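A self-contained CEM planner; the quadratic scoring function at the bottom is a toy stand-in for rolling out action sequences under a learned model:

```python
import numpy as np

def cem_plan(score_fn, horizon, n_samples=200, n_elite=20, n_iters=10, seed=0):
    """Cross-Entropy Method over action sequences.

    Maintains a Gaussian over sequences, refits it to the elite set each
    iteration, and returns the final mean sequence.
    """
    rng = np.random.default_rng(seed)
    mu, std = np.zeros(horizon), np.ones(horizon)
    for _ in range(n_iters):
        seqs = mu + std * rng.standard_normal((n_samples, horizon))
        scores = np.array([score_fn(s) for s in seqs])
        elites = seqs[np.argsort(scores)[-n_elite:]]   # top-K sequences
        mu = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6                # keep variance positive
    return mu                                          # MPC executes mu[0]

# Toy "model": predicted return is highest when every action equals 0.5.
best = cem_plan(lambda s: -np.sum((s - 0.5) ** 2), horizon=5)
```

After a handful of iterations the distribution collapses tightly around the optimum of the scoring function; in PETS, `score_fn` would be the predicted return of a trajectory-sampled rollout through the ensemble.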

Model Predictive Control (MPC) is a control-theoretic framework that generalizes these ideas: at each timestep, solve an optimization problem over the next $H$ steps under the model, execute only the first action, then replan at the next step. Replanning at every step makes MPC robust to model errors: even if the model is imperfect, each real observation corrects the plan before it diverges too far. iLQR (iterative Linear Quadratic Regulator) is an efficient MPC solver for smooth dynamics, computing a local quadratic approximation of the cost and a linear approximation of the dynamics around the current trajectory, then iterating to convergence.
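The replan-every-step structure can be shown with a random-shooting planner on a toy 1D integrator; here the "real" environment is the model itself, so the sketch only illustrates the receding-horizon control flow:

```python
import numpy as np

def mpc_control(x0, model, cost, plan, n_steps, H):
    """Receding-horizon control: plan H steps, execute one action, replan."""
    x, executed = x0, []
    for _ in range(n_steps):
        actions = plan(x, model, cost, H)   # open-loop plan under the model
        a = actions[0]                       # execute only the first action
        x = model(x, a)                      # observe and replan from here
        executed.append(a)
    return x, executed

def random_shooting(x, model, cost, H, n=256, seed=0):
    """Sample n random action sequences, return the cheapest under the model."""
    rng = np.random.default_rng(seed)
    cands = rng.uniform(-1.0, 1.0, size=(n, H))
    def rollout_cost(seq):
        s, c = x, 0.0
        for a in seq:
            s = model(s, a)
            c += cost(s, a)
        return c
    best = min(range(n), key=lambda i: rollout_cost(cands[i]))
    return cands[best]

# Toy task: drive a 1D integrator x' = x + 0.1 a toward the origin.
model = lambda x, a: x + 0.1 * a
cost = lambda x, a: x ** 2
x_final, acts = mpc_control(2.0, model, cost, random_shooting, n_steps=40, H=5)
```

Even though each open-loop plan is only a crude random search, replanning from every observed state steers the system close to the origin, which is the robustness-to-model-error property the paragraph describes.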

12

The model-based frontier

Model-based methods win on sample efficiency; model-free methods win on asymptotic performance in complex environments. The frontier is moving, and the gap is narrowing.

The sample efficiency advantage of model-based methods is clearest when the environment has (a) relatively smooth, predictable dynamics, (b) dense and informative rewards, and (c) expensive real interactions. Robotic manipulation, molecule design, and industrial control all fit this profile. In these settings, Dreamer-family and MBPO-family methods routinely achieve in thousands of steps what model-free baselines require hundreds of thousands to learn.

Model-free methods maintain their advantage in environments with (a) highly stochastic or chaotic dynamics that resist accurate model learning, (b) rich visual complexity that makes latent compression difficult, or (c) abundant cheap simulation. Arcade games with complex physics, multi-agent environments with emergent social dynamics, and environments where the policy must handle long-tail rare events all remain challenging for model-based approaches. In these settings, model-free methods like PPO and SAC remain competitive or dominant.

Method | Model type | Planning | Key strength | Key weakness
Dyna / MBPO | Obs-space ensemble | Short rollouts | Simple, principled mixing | Obs-space models scale poorly
PETS | Obs-space ensemble | Shooting / CEM | Uncertainty-aware planning | Expensive in high dimensions
Dreamer | Latent RSSM | Latent imagination | Pixel obs, long horizons | Stochastic envs harder
AlphaZero | Perfect simulator | MCTS | Superhuman in perfect-info games | Requires known dynamics
MuZero | Learned latent model | MCTS | No known simulator needed | Complex to train and scale

The most active current research direction is scaling world models to internet-scale video data — learning general physics from video without any RL signal, then fine-tuning for specific tasks. Models like Genie (Bruce et al. 2024) and DIAMOND (Alonso et al. 2024) hint at a future where the world model itself is a large pretrained foundation model, and RL training adapts it rather than learning it from scratch.
