When the state and action spaces are small enough to fit in a table, the full machinery of reinforcement learning becomes exact: dynamic programming solves MDPs given a known model; Monte Carlo and temporal difference methods solve them from sampled experience alone. SARSA and Q-learning, the two canonical TD control algorithms, illustrate how the on-policy and off-policy distinction plays out in practice — and Dyna shows how a learned model can multiply the value of every real interaction.
Tabular methods store value estimates as explicit lookup tables indexed by state (or state-action pair). They are exact in the sense that no approximation of the value function is introduced — only the approximation inherent in learning from finite samples. This exactness makes them ideal for building intuition and proving convergence; the limitations are purely computational (tables become infeasible in large or continuous spaces). Chapters 03 and 04 replace the table with a neural network; the algorithms themselves remain recognizable.
Prerequisites: Chapter 01 (RL Fundamentals) — particularly MDPs, value functions, and the Bellman equations. All notation follows Sutton & Barto (2018).
Tabular RL is RL with the simplest possible value function representation: a table. One entry per state for $V$; one entry per state-action pair for $Q$. Everything is exact, and everything is finite.
Concretely, a Q-table is a two-dimensional array where rows index states and columns index actions. The entry $Q[s][a]$ stores the current estimate of $Q^*(s, a)$ — the expected return from taking action $a$ in state $s$ and thereafter acting optimally. At the start of learning, the table is typically initialized to zeros or small random values. The algorithm's job is to update these entries until they converge to the true $Q^*$.
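As a data structure, a Q-table is just a 2-D array. A minimal sketch in Python (the 16-state, 4-action sizes are invented for illustration, as if for a 4×4 grid world):

```python
import numpy as np

# Illustrative sizes, as for a 4x4 grid world with 4 compass actions.
n_states, n_actions = 16, 4

# Q[s, a] holds the current estimate of Q*(s, a); initialize to zeros.
Q = np.zeros((n_states, n_actions))

# Entries are read and written by plain indexing.
Q[3, 2] = 1.5                          # update one estimate
greedy_action = int(np.argmax(Q[3]))   # greedy action in state 3
```

Every algorithm in this chapter reduces to a rule for updating entries of this array.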
The tabular setting applies whenever the state space and action space are both finite and small enough that maintaining one number per state-action pair is computationally feasible. Classic examples: grid worlds (states are grid positions, actions are the four compass directions), simple card games (states are hand configurations, actions are hit or stand), inventory management with a handful of discrete levels. Real-world examples fitting this profile are rare, but tabular methods are invaluable pedagogically: every algorithm in this chapter has a direct generalization to deep RL, and understanding the tabular version eliminates the neural-network approximation as a source of confusion.
The curse of dimensionality. If a state is described by $n$ binary features, the state space has $2^n$ entries. Ten features gives 1,024 states — manageable. Twenty features gives over a million. Thirty gives a billion. The table grows exponentially with the number of state dimensions, making tabular methods completely infeasible for any realistic continuous or high-dimensional problem. This is the fundamental limit that function approximation (chapter 03 onward) is designed to overcome.
Given a fixed policy and a known MDP, iteratively applying the Bellman equation computes the exact value function. This is policy evaluation — the prediction problem solved.
Dynamic programming (DP) refers to a family of algorithms that solve MDPs by exploiting the recursive structure of the Bellman equations. The crucial assumption: the full MDP model — both $P(s' \mid s, a)$ and $R(s, a)$ — is known. DP does not interact with the environment; it computes directly from the model. This makes it a planning algorithm, not a learning one in the usual sense.
Iterative policy evaluation computes $V^\pi$ for a given policy $\pi$ by repeatedly applying the Bellman expectation equation as an update rule. Starting from an arbitrary initial estimate $V_0$, each sweep updates every state:
$$V_{k+1}(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[R(s, a) + \gamma\, V_k(s')\right]$$

This update replaces the old estimate of $V(s)$ with the immediate reward plus the discounted estimated value of successor states — exactly the Bellman equation, used as an operator. Under the contraction mapping theorem, this sequence $V_0, V_1, V_2, \ldots$ converges to the true $V^\pi$ as $k \to \infty$, for any starting point and any $\gamma < 1$.
In practice, convergence is declared when $\max_s |V_{k+1}(s) - V_k(s)| < \theta$ for some small tolerance $\theta$. Two implementation variants exist: synchronous sweeps (compute all $V_{k+1}$ from $V_k$ before overwriting) and in-place updates (overwrite $V(s)$ immediately, so later states in the same sweep already see updated values). In-place updates typically converge faster in practice.
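A minimal sketch of iterative policy evaluation, assuming a toy three-state chain MDP invented for illustration (one action, deterministic transition $s \to s+1$, reward $-1$ per step, state 2 terminal), using in-place updates and the max-change convergence test described above:

```python
import numpy as np

# Toy chain MDP (invented): states 0, 1, 2; state 2 is terminal.
# The single action moves s -> s+1 deterministically with reward -1.
gamma, theta = 0.9, 1e-8
next_state = {0: 1, 1: 2}
reward = -1.0

V = np.zeros(3)
while True:
    delta = 0.0
    for s in (0, 1):                              # terminal state keeps V = 0
        v_new = reward + gamma * V[next_state[s]]  # Bellman expectation backup
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new                              # in-place update
    if delta < theta:                             # max-change convergence test
        break
# Converges to V(1) = -1 and V(0) = -1 + 0.9 * (-1) = -1.9.
```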
The prediction problem. Policy evaluation answers the question: "Given that the agent follows policy $\pi$, how much reward will it collect starting from each state?" This is a strictly easier problem than control (finding the best policy), and it appears as a subroutine inside every control algorithm that uses value functions.
Given an accurate value function, a better policy is always recoverable — just act greedily with respect to values. Alternating evaluation and improvement converges to the optimal policy.
The policy improvement theorem states: if $\pi'$ is the greedy policy with respect to $V^\pi$ — that is, $\pi'(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)[R(s, a) + \gamma V^\pi(s')]$ — then $V^{\pi'}(s) \geq V^\pi(s)$ for all $s$, with strict inequality somewhere unless $\pi$ was already optimal. Acting greedily with respect to the current value function always produces a policy at least as good, and usually strictly better.
Policy iteration exploits this by alternating two steps: (1) policy evaluation, which computes $V^\pi$ for the current policy to convergence; and (2) policy improvement, which replaces $\pi$ with the greedy policy with respect to $V^\pi$.
Each improvement step produces a strictly better policy until none is possible — at which point the policy is optimal. Because there are only finitely many deterministic policies (at most $|\mathcal{A}|^{|\mathcal{S}|}$), policy iteration must terminate in finite steps. In practice it converges in very few iterations — often fewer than 10, even for problems with thousands of states — because each greedy update tends to fix many states at once.
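Policy iteration can be sketched end to end on a toy deterministic MDP (invented for illustration; `model[s][a] = (reward, next_state)`, state 2 terminal, $\gamma = 1$). The initial policy takes a costly shortcut, and a single improvement step fixes it:

```python
import numpy as np

# Invented deterministic MDP: model[s][a] = (reward, next state); 2 is terminal.
model = {
    0: {0: (0.0, 1), 1: (-10.0, 2)},   # action 1 is a costly shortcut
    1: {0: (-1.0, 2), 1: (-1.0, 2)},
}
gamma = 1.0

def evaluate(policy, theta=1e-8):
    """Iterative policy evaluation for a deterministic policy."""
    V = np.zeros(3)
    while True:
        delta = 0.0
        for s in model:
            r, s2 = model[s][policy[s]]
            v_new = r + gamma * V[s2]
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

policy = {0: 1, 1: 0}                  # start with the bad shortcut
while True:
    V = evaluate(policy)               # step 1: policy evaluation
    stable = True
    for s in model:                    # step 2: greedy policy improvement
        best = max(model[s],
                   key=lambda a: model[s][a][0] + gamma * V[model[s][a][1]])
        if best != policy[s]:
            policy[s], stable = best, False
    if stable:                         # no change anywhere: policy is optimal
        break
# Converges to policy[0] = 0 (avoid the -10 shortcut), with V[0] = -1.
```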
The concept of alternating evaluation and improvement is sometimes called generalized policy iteration (GPI). It is the meta-framework underlying almost every RL algorithm: even model-free methods that never run full policy evaluation are performing approximate GPI, maintaining a value estimate and periodically making the policy more greedy with respect to it.
Policy iteration evaluates the current policy to full convergence before improving it — which is wasteful. Value iteration collapses evaluation and improvement into a single sweep, converging directly to the optimal value function.
The observation motivating value iteration: in policy iteration, most of the work in policy evaluation happens in the early sweeps; later sweeps give diminishing returns. If you truncate evaluation after a single sweep and immediately improve, you still converge — just along a different path. The extreme of this is value iteration, which performs exactly one update before taking the greedy step, merging the two into one operator:
$$V_{k+1}(s) \leftarrow \max_a \sum_{s'} P(s' \mid s, a)\left[R(s, a) + \gamma\, V_k(s')\right]$$

Note the critical difference from policy evaluation: $\sum_a \pi(a \mid s)$ is replaced by $\max_a$. This is the Bellman optimality operator rather than the Bellman expectation operator. Applying it repeatedly drives $V_k$ to $V^*$, the optimal value function. Convergence is again guaranteed by the contraction mapping theorem: the Bellman optimality operator is a $\gamma$-contraction.
Once $V^*$ is obtained, the optimal policy is read off greedily: $\pi^*(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)[R(s, a) + \gamma V^*(s')]$. Value iteration is generally preferred over full policy iteration for large state spaces because it avoids the inner evaluation loop.
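A minimal value-iteration sketch on a toy deterministic MDP (invented for illustration; `model[s][a] = (reward, next_state)`, state 2 terminal, $\gamma = 1$), with the greedy policy read off at the end:

```python
import numpy as np

# Invented deterministic MDP: model[s][a] = (reward, next state); 2 is terminal.
model = {
    0: {0: (0.0, 1), 1: (-10.0, 2)},   # action 1 is a costly shortcut
    1: {0: (-1.0, 2), 1: (-1.0, 2)},
}
gamma, theta = 1.0, 1e-8

V = np.zeros(3)
while True:
    delta = 0.0
    for s in model:
        # Bellman optimality backup: max over actions, not an expectation.
        v_new = max(r + gamma * V[s2] for r, s2 in model[s].values())
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < theta:
        break

# Read off the greedy (optimal) policy from V*.
policy = {s: max(model[s],
                 key=lambda a: model[s][a][0] + gamma * V[model[s][a][1]])
          for s in model}
# policy[0] == 0: the agent avoids the -10 shortcut; V[0] == -1.
```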
Asynchronous DP. Synchronous DP updates every state on every sweep — expensive for large spaces. Asynchronous DP updates states in any order, potentially focusing computation on states that are most relevant or most in error. Prioritized sweeping (updating states with the largest estimated Bellman error first) is an effective asynchronous variant. This foreshadows the prioritized experience replay used in deep RL.
Dynamic programming requires a model. Monte Carlo methods require only experience: the agent runs complete episodes, observes the actual returns, and averages them. No model, no bootstrapping, and full-episode returns.
In Monte Carlo (MC) prediction, the agent runs many episodes under policy $\pi$ and estimates $V^\pi(s)$ by averaging the observed returns that followed visits to state $s$. The law of large numbers guarantees convergence: with enough samples, the sample mean converges to the true expectation. Two variants handle states visited multiple times per episode: first-visit MC averages only the return following the first visit to $s$ in each episode, while every-visit MC averages the returns following every visit. Both converge to $V^\pi$.
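A first-visit MC prediction sketch, assuming an invented two-state episodic task (visit state 0 then state 1, each step's reward drawn from a Gaussian with mean $-1$, then terminate). Here each state occurs once per episode, so first-visit and every-visit coincide, but the dedup logic shows the first-visit mechanics:

```python
import random
from collections import defaultdict

random.seed(0)
gamma = 1.0

def run_episode():
    """Invented episodic task: visit state 0 then state 1; each step's
    reward ~ Normal(-1, 0.5); then terminate."""
    return [(s, -1.0 + random.gauss(0.0, 0.5)) for s in (0, 1)]

returns = defaultdict(list)
for _ in range(20000):
    traj = run_episode()
    G, tagged = 0.0, []
    for s, r in reversed(traj):          # accumulate returns backwards
        G = r + gamma * G
        tagged.append((s, G))
    seen = set()
    for s, G in reversed(tagged):        # forward pass: keep first visit only
        if s not in seen:
            seen.add(s)
            returns[s].append(G)

V = {s: sum(gs) / len(gs) for s, gs in returns.items()}
# Law of large numbers: V[0] -> -2, V[1] -> -1 as episodes accumulate.
```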
For MC control, the agent alternates between running episodes under the current policy (generating experience) and improving the policy using the observed returns. Because MC uses actual complete returns rather than estimated future values, it has zero bias but potentially high variance — a single episode's return can be far from the expected value, especially in long or stochastic tasks.
The fundamental MC limitation. Monte Carlo methods require complete episodes to compute a return. This makes them inapplicable to continuing tasks (no terminal state) and slow on long episodes, where most of the useful information about early decisions arrives only at the end. Temporal difference methods, developed next, address both limitations by learning from incomplete sequences via bootstrapping.
MC is nonetheless the right tool in several situations: when the environment's transition dynamics are complex or unknown and only episode samples are available; when episodes are short; and when unbiased value estimates are critical (MC estimates are unaffected by the approximation errors that bootstrapping can propagate). MC methods also have a natural connection to model-free rollout-based planning and to the simulation strategy used in Monte Carlo Tree Search (MCTS), the planning backbone of AlphaGo and AlphaZero.
Temporal difference methods combine the model-free sampling of Monte Carlo with the bootstrapping of dynamic programming. They update value estimates after every step, not at episode end — and they use existing value estimates as targets, not full observed returns.
The simplest TD method, TD(0), updates $V(S_t)$ after each transition using the one-step return as a target:
$$V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]$$

The bracketed quantity is the TD error:

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

It measures the difference between what was expected ($V(S_t)$) and what was received plus estimated future value ($R_{t+1} + \gamma V(S_{t+1})$). The step size $\alpha \in (0, 1]$ controls how much to move toward the new estimate. TD(0) updates immediately after each step, requires no model, and needs only the current transition — not the full episode.
The target $R_{t+1} + \gamma V(S_{t+1})$ is called the TD target. Unlike the MC target (the full return $G_t$), it is biased — it uses an estimate of $V(S_{t+1})$ rather than the true value. But it is far lower variance, because it depends on only one step of randomness rather than the accumulated randomness of an entire episode. The bias-variance tradeoff between MC and TD is one of the fundamental design tensions in RL.
Bootstrapping. TD methods learn a guess from a guess — they update a value estimate using another value estimate as the target. This is bootstrapping, and it is what separates TD from MC (which uses actual observed returns). Bootstrapping introduces bias but dramatically reduces variance and enables learning without complete episodes. The Bellman equation guarantees a fixed point exists; TD methods find it.
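TD(0) prediction in code, on an invented deterministic chain (0 → 1 → 2 terminal, reward $-1$ per step, $\gamma = 1$):

```python
import numpy as np

gamma, alpha = 1.0, 0.1
V = np.zeros(3)        # states 0, 1, and terminal state 2 (fixed at 0)

# Invented chain task: 0 -> 1 -> 2 (terminal), reward -1 per step.
for _ in range(2000):
    s = 0
    while s != 2:
        s_next, r = s + 1, -1.0
        delta = r + gamma * V[s_next] - V[s]   # TD error: guess vs. one-step target
        V[s] += alpha * delta                  # TD(0) update, after every step
        s = s_next
# V converges to V(0) = -2, V(1) = -1.
```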
SARSA extends TD prediction to control by estimating Q-values and updating them with the action the agent actually takes next. It learns the value of the policy being executed.
TD(0) predicts $V^\pi$. To find an optimal policy, we need to estimate $Q^\pi$ — the action-value function — and improve the policy based on it. SARSA (named for the five quantities involved in each update: State, Action, Reward, next State, next Action) does this with the update:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\right]$$

The next action $A_{t+1}$ is sampled from the current policy (typically $\varepsilon$-greedy). Because the update target uses the Q-value of the action actually taken — $Q(S_{t+1}, A_{t+1})$ — SARSA is on-policy: it evaluates and improves the same policy generating the data. If the exploratory policy is $\varepsilon$-greedy with respect to the current $Q$ estimates, SARSA converges to the optimal $\varepsilon$-greedy policy (not $Q^*$ itself, but the best policy that explores at rate $\varepsilon$).
A key behavioral consequence of on-policy learning: SARSA is safe. Because it accounts for the exploration actions the agent actually takes, its Q-values reflect the true value of the policy being executed — including its risky exploratory moves. An agent navigating near a cliff will learn to stay away from the edge, because it factors in the occasional random step that might send it over.
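A SARSA sketch on an invented four-state corridor (action 1 moves right, action 0 moves left clipped at 0, reward $-1$ per step, state 3 terminal). Note that the target uses the Q-value of the action actually chosen next:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 4, 2          # states 0..3; state 3 terminal
gamma, alpha, eps = 1.0, 0.1, 0.1
Q = np.zeros((n_states, n_actions))

def step(s, a):
    """Invented corridor: action 1 moves right, action 0 moves left
    (clipped at 0); every step costs -1; state 3 terminates."""
    s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s2, -1.0

def eps_greedy(s):
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

for _ in range(3000):
    s = 0
    a = eps_greedy(s)
    while s != 3:
        s2, r = step(s, a)
        a2 = eps_greedy(s2)                  # the action actually taken next
        Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])  # SARSA backup
        s, a = s2, a2
# Greedy policy: move right everywhere; Q[0, 1] is near -3 (three steps to goal),
# slightly worse because the values account for epsilon-exploration.
```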
Q-learning learns the optimal Q-function directly, regardless of what policy generates the experience. It is off-policy, uses the max operator, and converges to Q* under mild conditions.
Q-learning (Watkins, 1989) makes one change to the SARSA update: replace $Q(S_{t+1}, A_{t+1})$ — the value of the action actually taken — with $\max_{a'} Q(S_{t+1}, a')$ — the value of the best possible action:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)\right]$$

This makes Q-learning off-policy: the update target is the greedy (optimal) action, not the action the behavior policy happened to choose. The agent can explore freely — random actions, ε-greedy, even actions from a human demonstrator — and as long as every state-action pair is visited sufficiently often, the Q-table will converge to $Q^*$, the true optimal action-value function. The behavior policy affects the speed and coverage of learning but not what the method converges to.
Convergence of tabular Q-learning to $Q^*$ is guaranteed when: (1) all state-action pairs are visited infinitely often, (2) the step sizes $\alpha$ satisfy the Robbins-Monro conditions ($\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$), and (3) rewards are bounded. In practice a fixed small $\alpha$ and decaying $\varepsilon$ are common and work well.
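A Q-learning sketch that makes the off-policy property concrete: the behavior policy below is uniformly random, yet the table converges to $Q^*$ for an invented four-state corridor (right/left actions, reward $-1$ per step, state 3 terminal):

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions = 4, 2
gamma, alpha = 1.0, 0.2
Q = np.zeros((n_states, n_actions))

def step(s, a):
    """Invented corridor: right = +1, left = -1 (clipped at 0);
    reward -1 per step; state 3 terminal."""
    s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s2, -1.0

# Off-policy: behave uniformly at random, yet learn the greedy policy's Q*.
for _ in range(5000):
    s = 0
    while s != 3:
        a = int(rng.integers(n_actions))        # purely random behavior
        s2, r = step(s, a)
        target = r + gamma * np.max(Q[s2])      # max, not the taken action
        Q[s, a] += alpha * (target - Q[s, a])
        s = s2
# Q converges to Q*: Q[2,1] = -1, Q[1,1] = -2, Q[0,1] = -3.
```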
Why Q-learning matters. Q-learning is arguably the most important single algorithm in tabular RL. It learns $Q^*$ directly, it is off-policy (enabling experience replay), and it generalizes cleanly to the deep setting — Deep Q-Networks (DQN) replace the table with a neural network but preserve the Q-learning update almost exactly. Understanding Q-learning is the prerequisite for understanding everything that follows in deep RL.
SARSA and Q-learning look nearly identical — but the difference of one character in their update rules produces meaningfully different behavior.
The behavioral divergence between the two algorithms is sharpest in environments with risk. The cliff walking task is the canonical demonstration: a grid world with a cliff along one edge, where falling off the cliff gives a large negative reward. The shortest path to the goal runs along the cliff edge. SARSA, being on-policy, accounts for the occasional exploratory step that might accidentally send it off the cliff, and learns to take the longer but safer inland route. Q-learning, targeting the optimal policy directly, learns the optimal cliff-edge path — but then occasionally falls off the cliff during training due to exploration. Both are correct in their own terms: SARSA learns the best policy given that you will explore; Q-learning learns the best policy assuming you can execute it perfectly.
In practical deep RL, Q-learning and its variants (DQN, Double DQN, Dueling DQN) dominate for discrete action spaces because off-policy learning enables experience replay — storing and reusing old transitions — which SARSA cannot cleanly do. For safety-critical applications where training behavior matters as much as asymptotic performance, on-policy methods are often preferred.
TD(0) bootstraps from one step ahead. Monte Carlo uses the full return from the rest of the episode. Eligibility traces smoothly interpolate between them via a single parameter λ, often improving on both extremes.
The $n$-step return generalizes TD(0) by looking $n$ steps ahead before bootstrapping:
$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$

$n=1$ recovers the TD(0) target; $n = \infty$ (running to the end of the episode) recovers the MC return. Different problems benefit from different lookahead depths — long-horizon tasks may need $n \gg 1$ to propagate reward signals back efficiently; noisy environments benefit from shorter, lower-variance targets.
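The $n$-step return is straightforward to compute from a recorded reward sequence. The rewards and bootstrap value below are invented for illustration:

```python
gamma = 0.9
rewards = [1.0, 0.0, 2.0, -1.0]   # invented R_{t+1} .. R_{t+4}; episode ends after
v_next = 0.5                      # invented bootstrap estimate of V(S_{t+1})

def n_step_return(rewards, n, v_boot, gamma):
    """G_t^(n): n discounted rewards plus gamma^n times a bootstrap value."""
    return sum(gamma**k * rewards[k] for k in range(n)) + gamma**n * v_boot

g1 = n_step_return(rewards, 1, v_next, gamma)   # TD(0) target: 1.0 + 0.9*0.5 = 1.45
g_mc = n_step_return(rewards, 4, 0.0, gamma)    # full MC return: 1.891
```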
TD(λ) elegantly averages over all $n$-step returns, weighting the $n$-step return by $(1-\lambda)\lambda^{n-1}$:
$$G_t^\lambda = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

When $\lambda = 0$, all weight falls on the 1-step return: TD(0). When $\lambda = 1$, all weight shifts to the full episode return, so the $\lambda$-return equals the MC return $G_t$ in episodic tasks. Intermediate values of $\lambda$ — commonly 0.7 to 0.95 — often achieve the best performance by balancing bias and variance.
The most practical implementation of TD(λ) uses eligibility traces, a per-state memory that tracks how recently and how frequently a state was visited:
$$e_t(s) = \gamma\lambda\, e_{t-1}(s) + \mathbf{1}[S_t = s]$$

At each step, every state's value is updated in proportion to its eligibility trace — recently visited states get a large update, older ones get exponentially decaying updates. This allows the algorithm to assign credit backwards through the trajectory, connecting a reward to the states that caused it, without storing the full trajectory. The backward view is equivalent to the forward view (averaging $n$-step returns) but is online and computationally efficient.
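A backward-view TD(λ) sketch on an invented chain task (0 → 1 → 2 terminal, reward $-1$ per step), showing the decay-then-accumulate trace update and the trace-weighted value update:

```python
import numpy as np

gamma, lam, alpha = 1.0, 0.9, 0.1
V = np.zeros(3)                       # states 0, 1, and terminal state 2

# Invented chain task: 0 -> 1 -> 2 (terminal), reward -1 per step.
for _ in range(2000):
    e = np.zeros(3)                   # eligibility traces, reset each episode
    s = 0
    while s != 2:
        s2, r = s + 1, -1.0
        delta = r + gamma * V[s2] - V[s]   # TD error for this step
        e *= gamma * lam                   # decay all traces
        e[s] += 1.0                        # accumulate trace for current state
        V += alpha * delta * e             # update every state by its trace
        s = s2
# V approaches V(0) = -2, V(1) = -1; with lambda > 0, credit from each
# TD error flows back to earlier states in the same episode.
```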
What if the agent could practice mentally, replaying remembered situations to squeeze more learning from each real interaction? Dyna-Q does exactly this — combining model-free TD updates with simulated planning from a learned model.
The core observation behind Dyna (Sutton, 1990): after each real experience $(S_t, A_t, R_{t+1}, S_{t+1})$, two things can be done. First, update $Q$ using the real transition (a standard model-free update). Second, record the transition in an internal model of the environment — and then, before the next real interaction, perform $n$ additional Q-updates using transitions sampled from the model. These imaginary transitions are free: the agent simply asks its mental model "if I had taken action $a$ in state $s$, what would have happened?" and updates $Q$ with the answer.
The Dyna-Q algorithm:
```
Initialize Q(s, a) and Model(s, a) for all s, a
Loop for each episode:
    Observe S; choose A ε-greedy from Q
    Loop for each step:
        Take action A, observe R, S′
        Q(S,A) ← Q(S,A) + α[R + γ max_a′ Q(S′,a′) − Q(S,A)]        # real update
        Model(S,A) ← R, S′                                          # model update
        Repeat n times:                                             # planning
            S̃ ← random previously-seen state
            Ã ← random action previously taken in S̃
            R̃, S̃′ ← Model(S̃, Ã)
            Q(S̃,Ã) ← Q(S̃,Ã) + α[R̃ + γ max_a′ Q(S̃′,a′) − Q(S̃,Ã)]
        S ← S′; A ← ε-greedy action from Q(S,·)
```
The effect of the $n$ planning steps is dramatic. With $n = 0$ (no planning), Dyna-Q reduces to ordinary Q-learning. With $n = 50$, the agent performs 50 times as many Q-updates per real step, propagating value information through the table far faster. In simple maze tasks, Dyna-Q with 50 planning steps can reach near-optimal performance in the same number of episodes that pure Q-learning needs to merely begin learning.
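The pseudocode translates directly to Python. The sketch below assumes an invented four-state corridor (right/left actions, reward $-1$ per step, state 3 terminal) and 20 planning updates per real step; `model` is a plain dict mapping seen (state, action) pairs to their outcomes:

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions = 4, 2
gamma, alpha, eps, n_plan = 1.0, 0.2, 0.1, 20
Q = np.zeros((n_states, n_actions))
model = {}                            # (s, a) -> (r, s2): tabular world model

def step(s, a):
    """Invented corridor: right = +1, left = -1 (clipped at 0);
    reward -1 per step; state 3 terminal."""
    s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return -1.0, s2

for _ in range(200):
    s = 0
    while s != 3:
        if rng.random() < eps:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(Q[s]))
        r, s2 = step(s, a)
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])  # real update
        model[(s, a)] = (r, s2)                                   # model update
        for _ in range(n_plan):                                   # planning
            ps, pa = list(model)[rng.integers(len(model))]        # replay a seen pair
            pr, ps2 = model[(ps, pa)]
            Q[ps, pa] += alpha * (pr + gamma * np.max(Q[ps2]) - Q[ps, pa])
        s = s2
# With 20 planning updates per real step, Q[s, 1] converges toward
# (-3, -2, -1) in far fewer episodes than plain Q-learning would need.
```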
Model accuracy matters. The planning steps are only useful if the model is accurate. In deterministic environments with a perfect tabular model, Dyna is highly effective. When the environment changes (the model becomes stale), Dyna can reinforce outdated beliefs — planning updates propagate incorrect values confidently. Dyna-Q+ addresses this by adding a bonus to actions not tried recently, incentivizing the agent to revisit parts of the environment and check whether its model is still correct.
The Dyna architecture is the prototype for all model-based RL: learn a model from real data, use the model for additional planning updates, and interleave real and imagined experience. In the deep setting, this scales up to learned neural network world models (chapter 05), but the core logic — real experience to update both Q and model; imagined experience to update Q further — is unchanged.
Every method in this chapter is exact, well-understood, and provably convergent. And every method fails catastrophically as soon as the state or action space grows beyond a few thousand entries.
The root problem is representation. A tabular value function stores one independent number per state-action pair. It cannot generalize: updating $Q(s, a)$ tells the agent nothing about the value of a nearby state $s'$ or a similar action $a'$. In a game of Go, where the state space exceeds $10^{170}$ board positions, storing one number per state is not a meaningful aspiration. In robotics, where states are continuous joint angles and velocities, the table is infinite from the start.
The solution is function approximation: represent $Q(s, a)$ not as a table but as a parameterized function $Q_\theta(s, a)$, where $\theta$ is a much smaller set of parameters that the agent learns. Linear function approximation (representing $Q$ as a linear combination of hand-crafted features) extends tabular methods in a principled way. Neural networks, which learn their own features, make this practical for arbitrary input spaces — at the cost of convergence guarantees that the tabular setting provides for free.
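A sketch of the linear case, with an invented four-dimensional feature vector: the table lookup becomes a dot product, and the tabular update becomes a semi-gradient step on $\theta$ (the transition values below are also invented for illustration):

```python
import numpy as np

n_features = 4
theta = np.zeros(n_features)       # parameters replace the table

def phi(s, a):
    """Invented hand-crafted feature vector for a state-action pair."""
    return np.array([1.0, s, a, s * a], dtype=float)

def q(s, a):
    return theta @ phi(s, a)       # Q_theta(s, a) = theta . phi(s, a)

# Semi-gradient Q-learning update for one invented transition (s, a, r, s2).
alpha, gamma = 0.05, 0.9
s, a, r, s2 = 0, 1, -1.0, 1
target = r + gamma * max(q(s2, b) for b in (0, 1))
theta += alpha * (target - q(s, a)) * phi(s, a)   # gradient of q wrt theta is phi
```

One update to $\theta$ now shifts the estimates of every state-action pair sharing those features — exactly the generalization a table cannot provide.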
| Property | Tabular RL | Deep RL (function approx.) |
|---|---|---|
| Value function | Exact lookup table | Neural network |
| Generalization | None | Yes (learned features) |
| Convergence | Guaranteed (mild conditions) | Not generally guaranteed |
| State spaces | Small, discrete | Large, continuous |
| Interpretability | Full (inspect any Q value) | Limited |
The algorithms in this chapter are not historical curiosities — they run in embedded controllers, small optimization problems, and anywhere discrete and finite state spaces arise. More importantly, they are the conceptual foundation for everything that follows. DQN is Q-learning plus a neural network. PPO is policy iteration plus policy gradient estimation. Dreamer is Dyna plus a recurrent world model. The tabular setting strips away all approximation and exposes the algorithm's logic in its cleanest form. Understanding it thoroughly is the most efficient path to understanding modern deep RL.