Reinforcement learning is the third paradigm of machine learning: rather than fitting labels or finding structure, an agent interacts with an environment, receives scalar reward signals, and learns a behavior policy to maximize cumulative return. The Markov decision process is the mathematical backbone that makes this tractable — and the Bellman equations, value functions, and exploration–exploitation tradeoff are the concepts every subsequent algorithm builds on.
This chapter establishes vocabulary and formal structure. The algorithms — dynamic programming, Q-learning, policy gradients — appear in chapters 02 through 04. Notation follows Sutton & Barto (2018) throughout: uppercase letters for random variables (S, A, R), lowercase for their realizations (s, a, r), subscript t for time step.
Prerequisites: probability theory (Part I Ch 04), basic optimization (Part I Ch 03). Knowledge of deep learning is not required here but becomes necessary starting in chapter 03.
Supervised learning requires a teacher with correct answers. Unsupervised learning requires a dataset to find structure in. Reinforcement learning requires neither — only a way to interact with a world and receive feedback on the results.
In supervised learning, the training signal is explicit: here is an input, here is the correct output, adjust your parameters to close the gap. The data is given, the target is given, and the loss function makes the problem well-posed. Generalization is the challenge; the learning signal itself is not in question.
Reinforcement learning inverts this. There is no labeled dataset. There is no teacher who can say what the right action was. Instead there is an agent that takes actions in an environment and receives a scalar reward — a signal that encodes, loosely and often sparsely, whether things went well or poorly. The agent must discover, through trial and error, which actions lead to high cumulative reward over time. The learning signal is indirect: reward says "this was good" or "this was bad," but does not say what should have been done instead.
This is closer to how biological learning often works. A child learning to walk does not receive a labeled dataset of correct joint angles; it tries things, falls, recovers, and slowly improves. A chess player does not receive move-by-move annotations; they play games, win or lose, and adjust. The temporal credit assignment problem — figuring out which of the many decisions in a long sequence was responsible for a good or bad outcome — is what makes RL both natural as a paradigm and technically difficult as a field.
The defining features of RL. Three properties jointly distinguish RL from other learning paradigms: the agent learns from interaction, the feedback is a scalar reward (not a gradient or a label), and decisions are sequential — today's action affects tomorrow's situation. Remove any one and you have a different problem.
RL has a history separate from and partially overlapping with deep learning. The Bellman equations trace to operations research in the 1950s; temporal difference learning was formalized in the 1980s; Q-learning appeared in 1989. For decades the field solved small, carefully structured problems. The integration with deep neural networks — beginning with the Atari-playing DQN in 2013 and accelerating through AlphaGo, AlphaZero, and OpenAI Five — transformed what the paradigm could do, and the integration with language models via RLHF has since reshaped how frontier AI systems are trained. The foundations, however, remain the same formalisms Bellman and Sutton wrote down.
Everything in RL is built on a single repeating loop: the agent observes a state, selects an action, and the environment returns a new state and a reward. Everything else is commentary on this cycle.
At each discrete time step t, the interaction proceeds as follows. The environment is in some state S_t. The agent observes S_t (or some function of it) and selects an action A_t from the available action space. The environment transitions to a new state S_{t+1} according to some transition dynamics, and simultaneously emits a scalar reward R_{t+1}. The agent then observes S_{t+1} and R_{t+1} and selects A_{t+1}. This loop continues until a terminal state is reached or indefinitely.
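The loop is short enough to write in full. Everything below is an invented toy: the corridor environment, its reward scheme, and the random policy are illustrative, not a standard API.

```python
import random

random.seed(0)  # reproducible toy run

class CorridorEnv:
    """Hypothetical 5-state corridor. States 0..4; actions -1/+1 shift the
    state; reaching state 4 is terminal and pays reward 1.0."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state = max(0, min(4, self.state + action))  # transition dynamics
        done = self.state == 4
        reward = 1.0 if done else 0.0                     # sparse terminal reward
        return self.state, reward, done

def policy(state):
    return random.choice([-1, +1])  # a (bad) policy: ignore the state entirely

# The agent-environment loop: observe S_t, select A_t, receive S_{t+1}, R_{t+1}.
env = CorridorEnv()
state = env.reset()
total_reward, done = 0.0, False
while not done:
    action = policy(state)
    state, reward, done = env.step(action)
    total_reward += reward
```

Every algorithm in later chapters slots into this skeleton by replacing `policy` with something that learns from `reward`.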
The boundary between agent and environment is a modeling choice, not a physical fact. In robotics, the robot's motors might be part of the agent; the external world is the environment. In a game, the game engine is the environment; the policy is the agent. In RLHF, the language model is the agent, and the human preference signal is part of the environment. The distinction is wherever it is useful to draw it, but once drawn it fixes the problem: the agent controls actions, and actions are the only lever it has.
Episodic vs continuing tasks. Some tasks naturally terminate — a game ends in win or loss, a robot episode ends when it falls. These are episodic tasks, and the interaction resets to a start state. Other tasks run indefinitely — a trading algorithm, a recommendation system, an HVAC controller. These are continuing tasks. The distinction matters for how return is defined and how algorithms handle the boundary.
The reward signal is all the agent gets. It is the only source of information about whether the chosen actions are good ones. This places enormous design pressure on reward specification: if the reward is misaligned with what we actually want, the agent will optimize for the proxy, often in ways that violate the intent. Reward design is less a technical problem than a statement of values, and getting it wrong is the root of most spectacular RL failures in deployment.
The agent–environment loop becomes mathematically tractable when formalized as a Markov decision process. The MDP is the central object of reinforcement learning theory.
A Markov decision process is a tuple $(S, \mathcal{A}, P, R, \gamma)$:

- $S$: the set of states the environment can be in;
- $\mathcal{A}$: the set of actions available to the agent;
- $P(s' \mid s, a)$: the transition function, giving the probability of moving to state $s'$ when action $a$ is taken in state $s$;
- $R(s, a)$: the reward function, giving the expected immediate reward for taking action $a$ in state $s$;
- $\gamma \in [0, 1)$: the discount factor, which weights future rewards relative to immediate ones.
The transition function $P$ is the environment's physics. The agent does not choose $P$; it is fixed by the world. What the agent controls is the mapping from states to actions — the policy (Section 05). The goal is to find the policy that maximizes expected cumulative discounted reward.
Known vs unknown models. In some settings the agent is given $P$ and $R$ explicitly — this is the planning setting, addressed in chapter 02 via dynamic programming. In most practical RL, $P$ and $R$ are unknown and must be inferred from experience. The distinction between these two regimes — model-based vs model-free — is one of the central axes of the field (Section 11).
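In tabular settings the whole tuple can be written down explicitly. The two-state example below (state names, probabilities, and rewards all invented) shows one common plain-Python representation:

```python
# A made-up two-state, two-action MDP in plain dictionaries.
states = ["low", "high"]
actions = ["wait", "work"]

# P[s][a] is a distribution over next states; R[s][a] is the expected reward.
P = {
    "low":  {"wait": {"low": 1.0},
             "work": {"high": 0.8, "low": 0.2}},
    "high": {"wait": {"high": 0.9, "low": 0.1},
             "work": {"high": 1.0}},
}
R = {
    "low":  {"wait": 0.0, "work": -1.0},
    "high": {"wait": 1.0, "work": 0.5},
}
gamma = 0.9

# Sanity check: every transition distribution must sum to one.
for s in states:
    for a in actions:
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9
```

In the planning setting, these dictionaries are the algorithm's input; in model-free RL, they exist only implicitly, inside the environment.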
Real environments often violate the MDP assumptions. The state space may be incompletely observed — a robot cannot see behind obstacles; a poker player cannot see opponents' cards. These are partially observable MDPs (POMDPs), and they require the agent to maintain a belief state over possible true states. Much practical RL sidesteps this formally by training on observations directly and relying on recurrent networks or frame stacking to recover relevant history, but the underlying POMDP structure explains many failure modes.
The Markov property says the future is independent of the past, given the present. It is what makes the MDP framework tractable — and what breaks when environments keep secrets.
A state $S_t$ is said to be Markov if:
$$P(S_{t+1} \mid S_t, A_t) = P(S_{t+1} \mid S_0, A_0, S_1, A_1, \ldots, S_t, A_t)$$

That is, knowing the current state $S_t$ gives you all the information in the full history that is relevant to predicting the future. The past is irrelevant once you have the present. This is not a claim about the world; it is a claim about the state representation. A badly chosen state can violate Markov (if it omits information the future depends on), while a well-chosen one restores it.
The classic example: a ball in free flight. If the state is only the current position, it is not Markov — the future trajectory also depends on velocity. Augment the state to include both position and velocity, and Markov is restored. The physics of the ball did not change; the representation did. Designing states that are Markov — or approximately Markov — is one of the key engineering problems in applying RL to real systems.
Partial observability in practice. Most real systems are technically POMDPs: the agent sees observations, not the full state. Standard practice is to treat observations as if they were states, relying on temporal context (stacking frames in Atari, using LSTM hidden states) to recover some of the missing history. This works surprisingly often but fails when the hidden state matters significantly — a recurring source of fragility in deployed RL systems.
A policy is the agent's behavior function — the mapping from states to actions, or to distributions over actions. It is what RL optimizes.
Formally, a policy $\pi$ is a mapping from states to actions. Two flavors:

- A deterministic policy, $a = \pi(s)$, assigns exactly one action to each state.
- A stochastic policy, $\pi(a \mid s)$, assigns each state a probability distribution over actions; the agent samples its action from this distribution.
Policies can be represented in many ways. In tabular settings (small, discrete state-action spaces), a policy is literally a table. In deep RL, the policy is a neural network that takes the state as input and outputs action probabilities or, for continuous actions, the parameters of a distribution (typically a Gaussian mean and variance). This representation — the policy network — is the primary object in policy gradient and actor-critic methods.
A key distinction is between stationary and non-stationary policies. A stationary policy chooses actions based only on the current state, not on time. For MDPs with a fixed finite horizon, optimal policies may be time-dependent; for infinite-horizon discounted problems, an optimal stationary policy always exists — an important simplification that the discount factor buys us.
What RL actually optimizes. The goal of RL is to find a policy $\pi^*$ that maximizes the expected return from the initial state distribution. Every algorithm in the field is, at its core, a different approach to this same optimization problem — differing in what it estimates (value function, policy gradient, model), how it uses data, and what approximations it makes.
The agent's goal is not to maximize immediate reward but cumulative reward over time. How future rewards are weighted relative to immediate ones — the discount factor — is one of the most consequential hyperparameters in any RL system.
The return $G_t$ is the total reward accumulated from time $t$ onward:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

The discount factor $\gamma \in [0, 1)$ controls the time horizon of the agent's concern. At $\gamma = 0$, the agent is purely myopic, caring only about the next immediate reward. As $\gamma \to 1$, the agent gives nearly equal weight to rewards in the far future. The discount factor serves three purposes: it ensures the infinite sum converges (for bounded rewards), it introduces a preference for earlier rewards over later ones, and it captures the practical reality that distant consequences are less certain.
Note the recursive structure: $G_t = R_{t+1} + \gamma G_{t+1}$. Today's return equals the immediate reward plus the discounted return from the next state. This recursive identity is the seed from which the Bellman equations grow.
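The same recursion gives the standard one-pass way to compute returns from a logged episode: sweep the rewards backward. A minimal sketch:

```python
def discounted_return(rewards, gamma):
    """G_0 for a finite episode, via the recursion G_t = R_{t+1} + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):  # work backward from the terminal step
        g = r + gamma * g
    return g

# Example: rewards [1, 1, 1] with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75.
```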
Choosing $\gamma$. In practice, $\gamma$ is treated as a hyperparameter and tuned empirically. Values of 0.99 or 0.999 are common in continuous control; 0.95–0.99 in Atari-style games. A low $\gamma$ makes optimization easier (shorter effective horizon) but produces myopic behavior. A high $\gamma$ produces farsighted agents but increases variance and slows learning. The "effective horizon" of an agent with discount $\gamma$ is approximately $1 / (1 - \gamma)$ steps.
For episodic tasks, the sum is finite — it runs to the terminal time step $T$. For continuing tasks, the discount factor is necessary to ensure convergence. Some formulations of continuing tasks use the average reward criterion rather than discounting, which avoids the $\gamma$ hyperparameter but complicates the math; average reward methods appear most often in queuing and scheduling applications.
A value function assigns a number to each state — or state-action pair — measuring how much cumulative reward the agent can expect from that point onward under a given policy. Value functions are how RL converts the long-run objective into something computable at each step.
The state-value function under policy $\pi$ is:
$$V^\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\Bigg|\; S_t = s\right]$$

This is the expected return when starting in state $s$ and following policy $\pi$ thereafter. A high $V^\pi(s)$ means starting in $s$ leads to a lot of cumulative reward under $\pi$; a low value means it does not.
The action-value function (or Q-function) under policy $\pi$ is:
$$Q^\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$$

This is the expected return when starting in state $s$, taking action $a$ first, and thereafter following $\pi$. The Q-function is more informative than the V-function: knowing $Q^\pi(s, \cdot)$ for all actions tells you exactly which action is best in state $s$, without needing to know the transition dynamics. This is why Q-functions are the centerpiece of value-based deep RL (DQN and its descendants).
The two are related by:
$$V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$$

That is, the value of a state is the expectation of the action-value over the actions selected by the policy. And conversely:
$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\pi(s')$$

These interrelations recur constantly throughout RL theory and algorithm design.
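The relation $V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$ is easy to check numerically; the policy probabilities and Q-values below are invented for a single state:

```python
# One state, two actions; all numbers are purely illustrative.
pi = {"left": 0.25, "right": 0.75}   # pi(a | s)
Q  = {"left": 2.0,  "right": 4.0}    # Q^pi(s, a)

V = sum(pi[a] * Q[a] for a in pi)    # expectation of Q under the policy
# 0.25 * 2.0 + 0.75 * 4.0 = 3.5
```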
The Bellman equations express the value of a state in terms of the value of its successors. This recursive consistency condition is the mathematical engine behind every value-based RL algorithm.
Substituting the recursive definition of return ($G_t = R_{t+1} + \gamma G_{t+1}$) into the definition of $V^\pi$, and taking expectations:
$$V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[R(s, a) + \gamma\, V^\pi(s')\right]$$

This is the Bellman expectation equation for $V^\pi$. It says: the value of state $s$ under policy $\pi$ equals the expected immediate reward plus the discounted expected value of the next state — where the expectation is over both the policy's choice of action and the environment's stochastic transition. An analogous equation holds for $Q^\pi$:
$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a')$$

These equations are simultaneously a definition and a constraint: any valid value function must satisfy them. This means they can be used to check whether a given value function is correct (compute both sides; they should match) and to improve an estimate (treat the right-hand side as a target and update toward it). The second use is bootstrapping — updating value estimates using other value estimates — and it is the defining feature of temporal difference learning.
Bootstrapping as the core idea. Supervised learning updates a prediction using a fixed external target (the label). Temporal difference methods update a prediction using another prediction — the estimated value of the next state. This sounds circular but converges because the next-state estimate is updated too, and under standard conditions the system finds the fixed point. The Bellman equation is that fixed point.
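A single bootstrapped update is two lines. The step size, states, and values below are illustrative; later chapters develop the full algorithm:

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """One TD(0) step: move V[s] toward the bootstrapped target r + gamma * V[s_next]."""
    target = r + gamma * V[s_next]     # another prediction, not a label
    V[s] += alpha * (target - V[s])    # TD error scaled by the step size
    return V

V = {0: 0.0, 1: 1.0}
V = td0_update(V, s=0, r=0.5, s_next=1, alpha=0.1, gamma=0.9)
# target = 0.5 + 0.9 * 1.0 = 1.4, so V[0] moves from 0.0 to 0.14
```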
The Bellman equations can be written compactly in matrix form for finite state spaces: $\mathbf{v}^\pi = \mathbf{r}^\pi + \gamma \mathbf{P}^\pi \mathbf{v}^\pi$, which has the closed-form solution $\mathbf{v}^\pi = (I - \gamma \mathbf{P}^\pi)^{-1} \mathbf{r}^\pi$. This is policy evaluation in exact form. For large or continuous state spaces, exact solution is intractable; approximation (via neural networks or other function approximators) is necessary — which is the entry point into deep RL.
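Here is the closed form on an invented two-state chain, using NumPy; the transition matrix and reward vector are made up:

```python
import numpy as np

gamma = 0.9
# Under the policy: state 0 moves to 0 or 1 with equal probability (reward 1);
# state 1 is absorbing with reward 0. All numbers are illustrative.
P_pi = np.array([[0.5, 0.5],
                 [0.0, 1.0]])
r_pi = np.array([1.0, 0.0])

# v = (I - gamma P)^(-1) r, computed with a linear solve rather than an inverse.
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
# v_pi[1] = 0 (absorbing, no reward); v_pi[0] = 1 / (1 - 0.45) ≈ 1.818
```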
The optimal value functions are the best any policy can achieve. Finding them — or approximating them — is the goal of every value-based RL algorithm.
The optimal state-value function $V^*$ is the value function of the best possible policy:
$$V^*(s) = \max_\pi V^\pi(s) \quad \text{for all } s \in S$$

Similarly, the optimal action-value function:
$$Q^*(s, a) = \max_\pi Q^\pi(s, a) \quad \text{for all } s, a$$

These satisfy the Bellman optimality equations:
$$V^*(s) = \max_a \left[R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s')\right]$$

$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^*(s', a')$$

The critical difference from the expectation equations is the replacement of $\sum_a \pi(a \mid s)$ with $\max_a$: rather than averaging over the policy's choices, we take the best possible action. These equations have a unique solution (by the contraction-mapping theorem, for $\gamma < 1$), meaning the optimal value function is well-defined and findable — at least in principle.
Given $Q^*$, the optimal policy is simply:
$$\pi^*(s) = \arg\max_a\, Q^*(s, a)$$

This is the greedy policy with respect to $Q^*$. It is optimal without needing to know the transition dynamics — just the Q-values. This is why Q-learning is so powerful: if you can learn $Q^*$, the optimal behavior falls out for free. The entire enterprise of value-based RL is the pursuit of $Q^*$ in environments too large to solve exactly.
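The optimality equation also suggests an algorithm: apply the max-backup repeatedly until the values stop changing. This is value iteration, treated properly in chapter 02; the three-state chain below is invented for illustration:

```python
# Value iteration on a toy three-state chain; dynamics and rewards are invented.
# States 0, 1, 2; state 2 is terminal. Two actions: "stay" (reward 0) or
# "move" right (reward 1.0 on entering the terminal state, else 0).
gamma = 0.9
states = [0, 1, 2]

def optimal_backup(V, s):
    """Bellman optimality backup: max over actions of reward + discounted next value."""
    if s == 2:
        return 0.0                                   # terminal state has zero value
    stay = 0.0 + gamma * V[s]
    move = (1.0 if s + 1 == 2 else 0.0) + gamma * V[s + 1]
    return max(stay, move)

V = {s: 0.0 for s in states}
for _ in range(100):                                 # sweep to convergence
    V = {s: optimal_backup(V, s) for s in states}
# Converged values: V[2] = 0, V[1] = 1.0, V[0] = 0.9 (one step from the reward).
```

The greedy policy read off these values chooses "move" in every nonterminal state.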
An agent that only exploits its current knowledge never discovers whether something better exists. An agent that only explores never gets to use what it learned. Managing this tension is a fundamental challenge with no universally optimal solution.
The exploration–exploitation dilemma is cleanest in the multi-armed bandit problem — RL stripped to a single-step interaction. You have $k$ slot machines (bandits), each with an unknown reward distribution. At each round, you choose one to pull. Pull the one you think is best (exploit), and you might miss a better option you haven't tried enough. Try everything randomly (explore), and you waste pulls on machines you already know are bad. The question is how to balance these over time to maximize cumulative reward.
Three canonical strategies for exploration:

- $\varepsilon$-greedy: with probability $\varepsilon$, take a uniformly random action; otherwise take the action with the highest current value estimate. Crude but simple, and a strong baseline in practice.
- Optimism in the face of uncertainty (UCB): add a confidence bonus to each action's estimate that shrinks as the action is tried more often, so under-explored actions look attractive until the data rules them out.
- Thompson sampling: maintain a posterior distribution over each action's value, draw one sample from it, and act greedily with respect to the sample.
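The simplest canonical strategy, $\varepsilon$-greedy (explore with probability $\varepsilon$, otherwise act greedily), takes a dozen lines. The three-armed Gaussian bandit below is invented; arm means, noise, horizon, and seed are arbitrary:

```python
import random

def eps_greedy_bandit(true_means, eps=0.1, steps=5000, seed=0):
    """Sample-average epsilon-greedy on a k-armed Gaussian bandit."""
    rng = random.Random(seed)
    k = len(true_means)
    Q = [0.0] * k   # value estimates, one per arm
    N = [0] * k     # pull counts
    for _ in range(steps):
        if rng.random() < eps:
            a = rng.randrange(k)                       # explore: random arm
        else:
            a = max(range(k), key=lambda i: Q[i])      # exploit: greedy arm
        r = rng.gauss(true_means[a], 1.0)              # noisy reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                      # incremental sample mean
    return Q, N

Q, N = eps_greedy_bandit([0.0, 1.0, 0.5])
# With enough pulls, the best arm (index 1) dominates the pull counts.
```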
In full MDPs (not just bandits), exploration is harder: the effect of an action may not be felt until many steps later, and unvisited states might be reached only through specific action sequences. Methods like count-based exploration, intrinsic motivation (curiosity-driven reward bonuses), and randomized value functions extend the bandit intuitions to sequential settings, with varying success. Sparse reward environments — where the reward signal is near-zero for most of training — are the hardest case and the site of much ongoing research.
Exploration in the real world. In physical systems — robots, power grids, clinical treatment — exploration has costs that are absent in simulation: wear, safety violations, irreversible outcomes. Safe exploration, constrained RL, and sim-to-real transfer are responses to this problem. The assumption that exploration is free is one of the cleanest signs that an RL formulation was designed for simulation.
Does the agent learn a model of the world, or does it learn directly from experience? This architectural choice determines sample efficiency, generalization, and failure modes.
A model-based algorithm learns (or is given) a representation of the environment's dynamics — the transition function $P(s' \mid s, a)$ and reward function $R(s, a)$ — and uses it to plan. With a good model, the agent can simulate trajectories, reason about consequences, and make decisions far more sample-efficiently than if it had to try everything in the real environment. Dynamic programming (chapter 02), Dyna (chapter 05), and Dreamer (chapter 05) are all model-based.
A model-free algorithm learns directly from environment interactions, estimating value functions or policy gradients without building an explicit model of the dynamics. Q-learning, SARSA, PPO, and SAC are all model-free. The lack of a model is a limitation — the agent cannot plan ahead — but it is also a source of robustness: a wrong model is often worse than no model, and model errors can compound catastrophically over long-horizon reasoning.
| Dimension | Model-based | Model-free |
|---|---|---|
| Sample efficiency | Higher | Lower |
| Computational cost | Higher (planning) | Lower |
| Sensitivity to model error | High | N/A |
| Typical use case | Low-data regimes, sim planning | High-data regimes, game playing |
In practice the distinction is a spectrum. Dyna-style algorithms maintain a model and use it for additional "imagined" transitions to supplement real experience, getting some of the sample efficiency benefits with less planning overhead. World models (Dreamer, MBPO, chapter 05) take this further: learning a rich latent-space dynamics model and doing most planning inside it. The trend in frontier RL is toward more model use, as collecting real experience in physical systems becomes the binding constraint.
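A Dyna-style step fits in a few lines: update from the real transition, memorize it in a table that serves as the "model", then replay remembered transitions as imagined updates. All names and hyperparameters here are illustrative:

```python
import random

def q_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """One Q-learning backup on a dict-based table (missing entries default to 0)."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))

def dyna_step(Q, model, s, a, r, s_next, actions=(0, 1),
              alpha=0.1, gamma=0.95, n_planning=5, rng=random):
    q_update(Q, s, a, r, s_next, actions, alpha, gamma)   # direct RL from real data
    model[(s, a)] = (r, s_next)                           # model learning: memorize
    for _ in range(n_planning):                           # planning: imagined replays
        (ps, pa), (pr, ps_next) = rng.choice(list(model.items()))
        q_update(Q, ps, pa, pr, ps_next, actions, alpha, gamma)

Q, model = {}, {}
dyna_step(Q, model, s=0, a=1, r=1.0, s_next=0)
# The five imagined replays push Q[(0, 1)] well past the single direct update of 0.1.
```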
On-policy algorithms learn about the policy they are currently executing. Off-policy algorithms can learn about one policy while executing another — unlocking experience replay, data reuse, and learning from demonstrations.
In on-policy learning, the data used for updating the value function or policy must come from the current policy. SARSA and the REINFORCE algorithm are on-policy: they update using the actions actually taken. If the policy changes, old data becomes stale and must be discarded. This is wasteful but safe — the estimates are always consistent with what the agent is actually doing.
In off-policy learning, updates are performed using data from a different policy — the behavior policy — while estimating the value of the target policy. Q-learning is the canonical example: regardless of what action the agent actually took, the update uses $\max_{a'} Q(s', a')$ — the value of the best action, not the one chosen. This means data collected under any policy can be used to improve any other policy.
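The difference is visible in the update targets themselves. A sketch with invented states and Q-values (SARSA bootstraps from the action actually taken next; Q-learning from the best one):

```python
def sarsa_target(Q, r, s_next, a_next, gamma):
    """On-policy target: bootstrap from the next action actually selected."""
    return r + gamma * Q.get((s_next, a_next), 0.0)

def q_learning_target(Q, r, s_next, actions, gamma):
    """Off-policy target: bootstrap from the best next action, whatever was taken."""
    return r + gamma * max(Q.get((s_next, b), 0.0) for b in actions)

Q = {("s1", "left"): 0.0, ("s1", "right"): 2.0}
# Suppose the behavior policy explored and will take "left" next:
on_policy  = sarsa_target(Q, r=1.0, s_next="s1", a_next="left", gamma=0.9)  # 1.0
off_policy = q_learning_target(Q, r=1.0, s_next="s1",
                               actions=["left", "right"], gamma=0.9)        # 2.8
```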
Off-policy learning enables two major capabilities. First, experience replay: store all past transitions in a replay buffer and sample from it during updates. The agent can learn from the same experience many times, dramatically improving data efficiency. DQN's replay buffer was central to its success on Atari. Second, learning from offline data: in offline RL (chapter 07), the agent learns entirely from a fixed dataset of previously collected transitions, without any environment interaction at all.
Importance sampling. When the behavior policy differs significantly from the target policy, raw off-policy estimates can be biased. Importance sampling corrects for this by reweighting transitions by the ratio $\pi_{\text{target}}(a \mid s) / \pi_{\text{behavior}}(a \mid s)$. The correction is exact but introduces high variance when the policies differ greatly. Managing this variance is a central challenge in off-policy actor-critic methods like SAC and TD3.
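A minimal sketch of the correction, with invented per-action probabilities; note how the per-step ratios multiply along a trajectory, which is exactly where the variance comes from:

```python
pi_target   = {"a1": 0.9, "a2": 0.1}   # probabilities are illustrative
pi_behavior = {"a1": 0.5, "a2": 0.5}

def importance_weight(action):
    """Per-step importance ratio pi_target(a|s) / pi_behavior(a|s)."""
    return pi_target[action] / pi_behavior[action]

# Trajectory weight = product of per-step ratios.
w = importance_weight("a1") * importance_weight("a2")
# (0.9 / 0.5) * (0.1 / 0.5) = 1.8 * 0.2 = 0.36
```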
The on-policy / off-policy axis cuts across the model-based / model-free axis, giving a rough taxonomy: on-policy model-free (REINFORCE, PPO, A3C), off-policy model-free (Q-learning, DQN, SAC), on-policy model-based (Dyna with fresh rollouts), off-policy model-based (Dreamer, MBPO). Most frontier algorithms are off-policy — the data efficiency benefits are too large to ignore.