Deep Q-Networks, when neural nets replaced the table.

The 2015 DQN paper demonstrated that a single convolutional network, stabilized by two engineering insights, could master dozens of Atari games from raw pixels — launching the era of deep reinforcement learning. The years that followed produced a sequence of targeted fixes for each remaining failure mode: Double DQN for overestimation bias, prioritized replay for sampling efficiency, dueling networks for architectural inductive bias, and distributional RL for richer learning signals. Rainbow synthesized these fixes, together with multi-step returns and learned exploration, into a single agent that set the state of the art on the Atari benchmark.

Chapter notes

This chapter bridges tabular RL and the modern deep RL landscape. The core algorithms — Q-learning and TD control — are unchanged from Chapter 02; what changes is the value function representation. Understanding why naïve combinations of neural networks and Q-learning diverge, and how DQN's two tricks (experience replay and target networks) address the instability, is the conceptual core of this chapter.

Prerequisites: Chapter 02 (Tabular RL) — Q-learning, TD learning, and the Bellman optimality equation. Familiarity with convolutional neural networks (Part V, Ch 04) is helpful for understanding the DQN architecture but not required for the algorithmic content.

01

The scale problem

A Q-table requires one row per state. For problems like Atari — where the input is a stack of game frames — the number of distinct states is larger than the number of atoms in the observable universe.

The appeal of tabular methods is exactness: every state-action pair gets its own estimate, updated independently. The cost is that this requires enumeration. In a grid world with 100 cells and 4 actions, a Q-table has 400 entries — manageable. In Atari, the agent observes four stacked 84×84 grayscale frames, each pixel taking one of 256 intensity values. The number of distinct observations is $256^{84 \times 84 \times 4}$, a number so large that no table could exist, let alone be filled by any finite amount of interaction.
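The arithmetic behind that claim is easy to check in a couple of lines (pure Python; the variable names are our own):

```python
import math

# Number of distinct observations: 256 intensity values per pixel,
# 84*84 pixels per frame, 4 stacked frames.
pixels = 84 * 84 * 4
digits = pixels * math.log10(256)   # number of decimal digits in 256**pixels

print(f"256^{pixels} has roughly {digits:,.0f} decimal digits")
# For comparison, the ~10^80 atoms in the observable universe
# is only an 81-digit number.
```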

Continuous state spaces make the problem even more fundamental. A robot with six joints, each described by a real-valued angle, lives in $\mathbb{R}^6$. There are uncountably many states; the tabular approach is not merely impractical but conceptually inapplicable.

The solution is function approximation: replace the lookup table with a parametric function that maps states (or state-action pairs) to value estimates. The function generalizes — if it learns that a particular configuration of pixels is dangerous, it can apply that lesson to visually similar configurations it has never seen. Chapter 02 ended by naming this as the next step; this chapter unpacks what it takes to make it work.

02

Function approximation

A neural network replaces the Q-table. Given a state, it outputs one Q-value estimate per action — a vector of length $|\mathcal{A}|$ computed in a single forward pass.

Write $Q(s, a; \theta)$ for the network output, where $\theta$ collects all weights and biases. The loss that drives training is the mean-squared Bellman error: the network's current estimate should agree with the TD target $y$, a one-step bootstrap estimate of the true Q-value. For a transition $(s, a, r, s')$:

$$y = r + \gamma \max_{a'} Q(s', a'; \theta)$$ $$\mathcal{L}(\theta) = \mathbb{E}\bigl[(y - Q(s, a; \theta))^2\bigr]$$

The gradient update is a semi-gradient step: we treat the target $y$ as a constant (stopping gradients through it) and differentiate only through $Q(s, a; \theta)$. This is crucial because the target itself depends on $\theta$; differentiating through both sides would produce double-counted gradients that do not correspond to a true gradient descent on any fixed objective.
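A minimal sketch of the semi-gradient step, using linear function approximation ($Q(s, a; \theta) = \theta_a^\top \phi(s)$) so it runs without a deep learning library; the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def semi_gradient_step(theta, phi_s, a, r, phi_s_next, gamma=0.99, lr=0.1):
    """One semi-gradient Q-learning update with linear Q(s, a) = theta[a] @ phi(s)."""
    # TD target: treated as a constant (no gradient flows through it),
    # even though it depends on theta via the bootstrap term.
    y = r + gamma * max(theta[b] @ phi_s_next for b in range(len(theta)))
    # Gradient of (y - Q(s, a))^2 w.r.t. theta[a] only.
    td_error = y - theta[a] @ phi_s
    theta[a] += lr * td_error * phi_s   # update only the taken action's weights
    return td_error

theta = np.zeros((2, 3))                       # 2 actions, 3 features
phi_s = np.array([1.0, 0.0, 0.0])
phi_s2 = np.array([0.0, 1.0, 0.0])
delta = semi_gradient_step(theta, phi_s, a=0, r=1.0, phi_s_next=phi_s2)
```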

The hope is that networks trained this way converge to something close to $Q^*$. The worry — well founded, as the 1990s discovered — is that they often do not.

03

The deadly triad

Three ingredients, each individually benign, combine to produce divergence. Sutton and Barto call this the deadly triad: bootstrapping, function approximation, and off-policy training.

Bootstrapping means using the current value estimate to construct the update target — the TD update depends on $Q(s', \cdot; \theta)$, which is itself being updated. Function approximation means that updating $Q(s, a; \theta)$ inevitably changes $Q$ at other states too, because the parameters are shared. Off-policy training means the data distribution does not match the policy being evaluated — Q-learning is off-policy by nature, learning about the greedy policy while following an ε-greedy behavioral policy.

When all three are present, the combination can be unstable. The classic illustration is Baird's counterexample (1995): a simple seven-state MDP where semi-gradient TD with linear function approximation diverges to infinity despite an optimal solution existing. The issue is that the TD update is not the gradient of any fixed loss function — it is a semi-gradient — so there is no guarantee of convergence.

Why the triad matters for DQN: DQN cannot eliminate bootstrapping (it is the foundation of TD learning) or off-policy training (experience replay requires it). Instead it mitigates the instability through engineering: experience replay decorrelates the data distribution, and target networks slow down the rate at which the bootstrap target changes.

04

Experience replay

Store every transition in a buffer. Sample randomly for updates. Two lines of code that prevent two distinct failure modes.

Online Q-learning with a neural network updates on consecutive transitions: $(s_1, a_1, r_1, s_2)$, then $(s_2, a_2, r_2, s_3)$, and so on. Consecutive transitions are highly correlated — if the agent is walking down a corridor, the next hundred observations all look like corridor. Gradient updates computed on correlated data push the network in the same direction repeatedly, destabilizing training just as in supervised learning when data is fed in non-random order.

The replay buffer $\mathcal{D}$ is a FIFO queue of fixed capacity $N$ (DeepMind used $N = 10^6$). At each step the agent stores the transition $(s_t, a_t, r_t, s_{t+1}, \text{done})$ into $\mathcal{D}$, then samples a random minibatch of size 32 (or 64) to compute the gradient update. Because the minibatch is sampled uniformly at random from up to a million past transitions, consecutive updates are effectively decorrelated.
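The mechanism really is small. A minimal sketch using only the standard library (the class name is our own):

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO buffer of fixed capacity with uniform random minibatch sampling."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions age out first

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform sampling decorrelates consecutive gradient updates.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(150):                           # overfill: the first 50 are evicted
    buf.store(t, 0, 0.0, t + 1, False)
batch = buf.sample(32)
```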

A secondary benefit is data efficiency. In online learning each transition is used once and discarded. With a replay buffer, each stored transition is used for multiple gradient updates before it ages out of the buffer. This is particularly valuable in environments where each interaction step is expensive.

Experience replay makes Q-learning off-policy in a stronger sense than usual: the behavioral policy that collected the data may have been $\varepsilon$-greedy with a very different $\varepsilon$, or even a previous version of the policy. This is fine for Q-learning, which learns $Q^*$ regardless of behavior policy, but it becomes a significant complication for on-policy methods like SARSA.

05

Target networks

If the goalposts move with every gradient step, learning chases a moving target — literally. A frozen copy of the network, updated only occasionally, gives the optimizer something stable to aim at.

The TD target in Q-learning is $y = r + \gamma \max_{a'} Q(s', a'; \theta)$. Notice that both terms of the loss depend on $\theta$: the prediction $Q(s, a; \theta)$ directly, and the target $y$ through $Q(s', \cdot; \theta)$. Every gradient step changes $\theta$, which changes $y$, which changes the loss surface the optimizer is navigating. The result is oscillation and, in the worst case, divergence — the estimates chase each other in a feedback loop with no fixed point.

The solution introduced in DQN is a target network $Q(\cdot; \theta^-)$: a separate copy of the network whose parameters $\theta^-$ are held fixed for $C$ steps (DeepMind used $C = 10{,}000$). The training loss becomes:

$$y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$$ $$\mathcal{L}(\theta) = \mathbb{E}\bigl[(y - Q(s, a; \theta))^2\bigr]$$

Now the gradient flows only through $Q(s, a; \theta)$; the target $y$ is constant for $C$ steps, giving a stable optimization target. After $C$ steps, we set $\theta^- \leftarrow \theta$ (a hard update) and repeat. An alternative is the soft update $\theta^- \leftarrow \tau\theta + (1-\tau)\theta^-$ at every step with small $\tau$, which keeps $\theta^-$ continuously tracking $\theta$ with a lag.
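Both update schemes are one-liners in practice. A sketch with parameters as NumPy arrays (function names are our own):

```python
import numpy as np

def hard_update(target_params, online_params):
    """theta_minus <- theta, performed once every C steps."""
    for t, o in zip(target_params, online_params):
        t[...] = o                             # in-place copy

def soft_update(target_params, online_params, tau=0.005):
    """theta_minus <- tau*theta + (1 - tau)*theta_minus, performed every step."""
    for t, o in zip(target_params, online_params):
        t[...] = tau * o + (1 - tau) * t

online = [np.ones((2, 2))]
target = [np.zeros((2, 2))]
soft_update(target, online, tau=0.1)           # target moves 10% toward online
after_soft = target[0].copy()
hard_update(target, online)                    # target now equals online exactly
```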

06

DQN: putting it together

Mnih et al. (2015) combined experience replay and target networks with a convolutional architecture and a single set of hyperparameters across 49 different games. The result was human-level play on 29 of them.

The DQN architecture takes four stacked 84×84 grayscale frames as input. Three convolutional layers extract spatial features; two fully connected layers produce Q-values for all possible actions. The network architecture is identical across all 49 Atari games — only the output dimension changes to match the number of actions in each game. The same hyperparameters, learning rate, and training duration are used throughout.

Training follows this loop: the agent acts $\varepsilon$-greedy with respect to $Q(\cdot; \theta)$, annealing $\varepsilon$ from 1.0 to 0.1 over the first million steps, then holding it at 0.1. Each transition is stored in the replay buffer. Every four environment steps, a minibatch of 32 transitions is sampled and used to compute a semi-gradient update. Every 10,000 steps, the target network weights are synchronized.

What made DQN surprising was not the individual components — experience replay had been proposed by Lin (1992), and target networks were a natural engineering fix — but that together they sufficed. Prior work on RL with neural networks had struggled with instability for two decades. The 2015 Nature paper showed the combination was robust enough to generalize across an entire suite of games with no per-game tuning.

The 2013 arXiv version (the "Atari paper") established the concept on a subset of games. The 2015 Nature version extended it to 49 games, compared against human expert play, and introduced the target network. Both papers are foundational; most citations refer to the 2015 version.

07

Double DQN

DQN systematically overestimates Q-values. The fix is a one-line change that decouples action selection from action evaluation.

In DQN's target, the max operator selects and evaluates the best next action using the same network: $\max_{a'} Q(s', a'; \theta^-)$. When Q-value estimates are noisy — as they always are early in training — the max over noisy estimates is positively biased. If the true Q-values of several actions are equal, random estimation noise will make one action appear best, and the network will evaluate it at a value higher than the true optimum. Over time this maximization bias propagates backward through the TD updates, inflating Q-estimates throughout the state space.

Van Hasselt et al. (2015) proposed Double DQN: use the online network $\theta$ to select the best action, and the target network $\theta^-$ to evaluate it:

$$a^* = \arg\max_{a'} Q(s', a'; \theta)$$ $$y = r + \gamma Q(s', a^*; \theta^-)$$

Because action selection and evaluation now use partially decorrelated parameter sets, the upward bias is greatly reduced, though not eliminated entirely. In practice Double DQN consistently produces lower, more accurate Q-estimates and improves performance on most Atari games — especially those where overestimation was causing the agent to pursue artificially inflated value states.
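The change is small enough to show side by side. A NumPy sketch with next-state Q-values as arrays (names illustrative):

```python
import numpy as np

def dqn_target(r, q_next_target, gamma=0.99):
    """Vanilla DQN: the target net both selects and evaluates the next action."""
    return r + gamma * q_next_target.max()

def double_dqn_target(r, q_next_online, q_next_target, gamma=0.99):
    """Double DQN: the online net selects, the target net evaluates."""
    a_star = int(q_next_online.argmax())       # selection: online parameters
    return r + gamma * q_next_target[a_star]   # evaluation: target parameters

# Noisy estimates where the two networks disagree on the best action:
q_online = np.array([1.0, 2.0, 0.5])           # online net prefers action 1
q_target = np.array([3.0, 1.0, 0.5])           # target net overestimates action 0
vanilla = dqn_target(0.0, q_target)            # bootstraps from the max, 3.0
double = double_dqn_target(0.0, q_online, q_target)   # bootstraps from 1.0
```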

08

Prioritized experience replay

Not all stored transitions are equally useful. Sampling proportional to TD error preferentially replays the experiences the agent currently finds most surprising.

Standard experience replay samples uniformly: a transition stored one million steps ago has the same probability of being replayed as one stored one step ago. This is suboptimal. A transition where the agent's prediction was wildly wrong — large TD error $|\delta|$ — is more informative than one where the prediction was nearly correct. Uniform sampling wastes compute on transitions the network has already learned from.

Prioritized experience replay (Schaul et al. 2015) assigns each transition a priority and samples with probability proportional to it. The proportional scheme uses $p_i = |\delta_i| + \epsilon$ and samples transition $i$ with probability $P(i) \propto p_i^\alpha$, where $\epsilon$ prevents zero-error transitions from never being sampled, and $\alpha \in [0, 1]$ controls the degree of prioritization ($\alpha = 0$ recovers uniform sampling). The rank-based scheme instead sets priority based on the rank of the transition when sorted by $|\delta|$.

Changing the sampling distribution introduces bias: the expected gradient no longer matches the expectation under uniform sampling, so uncorrected updates converge to the wrong fixed point. Importance-sampling weights $w_i = (N \cdot P(i))^{-\beta}$, normalized by $\max_i w_i$ for stability, scale each transition's contribution to the gradient. The exponent $\beta$ is annealed from an initial value near 0.4 to 1.0 over training, so the correction becomes exact late in training, when accurate Q-estimates matter most.
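A sketch of proportional prioritization with importance-sampling correction (NumPy; folding $\alpha$ into the priority and normalizing by the max weight follow common practice, and the names are our own):

```python
import numpy as np

def prioritized_sample(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6):
    """Sample indices with probability proportional to |TD error|^alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()          # P(i), sums to 1
    n = len(td_errors)
    idx = np.random.choice(n, size=batch_size, p=probs)
    # Importance-sampling weights correct for the non-uniform sampling;
    # dividing by the max weight keeps update magnitudes bounded.
    weights = (n * probs[idx]) ** (-beta)
    weights /= weights.max()
    return idx, weights

np.random.seed(0)                                  # reproducibility of the sketch
td = np.array([0.01, 0.01, 5.0, 0.01])             # one surprising transition
idx, w = prioritized_sample(td, batch_size=32)
```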

09

Dueling networks

Many states are valuable regardless of which action the agent takes. Dueling networks make this explicit by splitting Q into a state-value component and an action-advantage component.

Wang et al. (2015) observed that in many states the agent's choice of action barely matters — what matters is where it finds itself. A car approaching a cliff is in a bad state regardless of whether it steers left or right at the moment of observation; a car on an open highway is in a good state regardless of lane choice. Standard DQN learns $Q(s, a)$ for every action independently, and every action update teaches the network only about that one action's value — even though the state-value component is shared.

The dueling architecture splits the network's upper layers into two streams after a shared encoder. The value stream produces a scalar $V(s; \theta, \alpha)$; the advantage stream produces a vector $A(s, a; \theta, \beta)$ of length $|\mathcal{A}|$. These are combined to produce Q-values:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \alpha) + \left(A(s, a; \theta, \beta) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a'; \theta, \beta)\right)$$

The mean subtraction enforces identifiability: without it, any constant can be shifted between $V$ and $A$ without changing $Q$, making the decomposition arbitrary. With the mean subtracted, the advantage stream is forced to have mean zero, pinning the decomposition uniquely.

[Figure] Standard DQN (left) produces Q-values through a single head with no structural prior. Dueling DQN (right) splits a shared encoder into a value stream V(s) and an advantage stream A(s,a), combined as Q = V(s) + A(s,a) − mean A. Because V(s) is updated by every action's gradient signal, the architecture is especially efficient when many actions have similar Q-values.

The architectural benefit is gradient efficiency. In standard DQN, a gradient from action $a_1$ updates the Q-estimate for $a_1$ only. In a dueling network, every gradient also updates $V(s)$, so the state-value estimate improves regardless of which action was taken. In environments with large action spaces — many Atari games have 18 possible actions — this multiplied gradient signal produces substantially faster and more stable learning of which states are valuable.
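The combination step is only a few lines on top of the shared features. A NumPy sketch of the dueling head (shapes and names are illustrative):

```python
import numpy as np

def dueling_q(features, w_v, w_a):
    """Combine value and advantage streams into Q-values."""
    v = features @ w_v                     # scalar state value V(s)
    a = features @ w_a                     # advantage vector A(s, .)
    # Mean subtraction makes the decomposition identifiable: the
    # advantage stream is forced to have zero mean across actions.
    return v + (a - a.mean())

rng = np.random.default_rng(0)
feats = rng.normal(size=8)                 # shared encoder output
w_v = rng.normal(size=8)                   # value head weights
w_a = rng.normal(size=(8, 4))              # advantage head weights, 4 actions
q = dueling_q(feats, w_v, w_a)
# The mean of q equals V(s): the advantages contribute zero on average.
```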

10

Multi-step returns

One-step TD bootstrap propagates reward information slowly — one step per update. Multi-step returns propagate it faster, at the cost of increased variance.

DQN's TD target looks one step into the future: $y^{(1)} = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-)$. The $n$-step return extends this:

$$y^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n \max_{a'} Q(s_{t+n}, a'; \theta^-)$$

For small $n$, the agent uses observed rewards for the first few steps and only then bootstraps — reducing the reliance on potentially inaccurate value estimates. For large $n$, the return approaches a Monte Carlo estimate: low bias but high variance because the sum of future rewards is stochastic. The practical sweet spot, used in Rainbow, is $n = 3$.

Multi-step returns introduce a subtlety with off-policy data. The $n$-step return assumes the agent followed the behavioral policy for all $n$ steps — but transitions stored in the replay buffer were collected under older policies. For small $n$ and slowly changing policies this is usually ignored in practice; for large $n$ it requires explicit off-policy corrections such as importance sampling ratios or the V-trace operator.
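The $n$-step target is a direct transcription of the formula above (function name illustrative):

```python
def n_step_target(rewards, q_bootstrap, gamma=0.99):
    """y = sum_k gamma^k r_{t+k} + gamma^n max_a' Q(s_{t+n}, a'; theta_minus).

    rewards: the n observed rewards r_t .. r_{t+n-1}
    q_bootstrap: max_a' Q(s_{t+n}, a') from the target network
    """
    n = len(rewards)
    y = sum(gamma ** k * r for k, r in enumerate(rewards))
    return y + gamma ** n * q_bootstrap

# n = 3 (the Rainbow setting): three observed rewards, then bootstrap.
y3 = n_step_target([1.0, 0.0, 2.0], q_bootstrap=10.0, gamma=0.9)
```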

11

Distributional RL

Standard RL collapses the full distribution of future returns into a single number — its expectation. Distributional RL learns the entire distribution, preserving information that the mean discards.

For a given state-action pair, the return $G$ is a random variable (it depends on future stochasticity in the environment and the policy). Q-learning estimates $\mathbb{E}[G \mid s, a] = Q(s, a)$. But the distribution of $G$ carries much more information: two actions with the same expected return but very different variances should be treated differently by a risk-sensitive agent.

C51 (Bellemare et al. 2017) models $Z(s, a)$ — the random variable whose expectation is $Q(s, a)$ — as a categorical distribution over 51 fixed atoms $\{z_1, z_2, \ldots, z_{51}\}$ uniformly spaced in the interval $[V_{\min}, V_{\max}]$. The network outputs 51 logits per action, representing the probability that the return falls at each atom. The distributional Bellman equation is:

$$Z(s, a) \overset{D}{=} R + \gamma Z(s', a^*)$$

This says the distribution of returns from $(s, a)$ equals, in distribution, the reward plus the discounted return distribution from the next state-action pair. The training loss is a KL divergence between the projected target distribution and the network's predicted distribution — replacing the squared error of standard DQN.
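The step that makes this trainable is the projection, which maps each shifted atom $r + \gamma z_j$ back onto the fixed support by splitting its mass between the two nearest atoms. A NumPy sketch for a single transition (a simplified rendering, with our own names; a tiny 5-atom support stands in for the 51 atoms of C51):

```python
import numpy as np

def project_distribution(probs, r, gamma, atoms):
    """Project the distribution of r + gamma*Z onto the fixed support `atoms`."""
    v_min, v_max = atoms[0], atoms[-1]
    delta_z = atoms[1] - atoms[0]
    # Shift and clip each atom, then split its probability mass between
    # the two nearest fixed atoms, proportional to distance.
    tz = np.clip(r + gamma * atoms, v_min, v_max)
    b = (tz - v_min) / delta_z                 # fractional index into the support
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)
    out = np.zeros_like(probs)
    np.add.at(out, lower, probs * (upper - b))
    np.add.at(out, upper, probs * (b - lower))
    # When b lands exactly on an atom, lower == upper and both shares above
    # are zero; deposit the full mass there instead.
    np.add.at(out, lower, probs * (lower == upper))
    return out

atoms = np.linspace(-1.0, 1.0, 5)              # 5 atoms on [-1, 1]
p_next = np.array([0.0, 0.0, 1.0, 0.0, 0.0])   # Z(s', a*) is 0 with certainty
p_proj = project_distribution(p_next, r=0.25, gamma=0.5, atoms=atoms)
# p_proj sums to 1, and (absent clipping) its mean is r + gamma * E[Z] = 0.25.
```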

Why does it help even when we only care about the mean? The hypothesis, supported empirically, is that learning a richer signal produces better representations. The network's shared encoder must be good enough to predict an entire distribution, not just a single scalar — and those richer intermediate representations appear to generalize better, yielding superior Q-value estimates at the end.

12

Rainbow

Each improvement addressed a distinct failure mode. Rainbow (Hessel et al. 2017) combined all six and set a new state of the art on the Atari benchmark by a substantial margin.

The six components integrated into Rainbow are: Double DQN (reduces overestimation bias), prioritized experience replay (improves sample efficiency), dueling networks (better gradient utilization), multi-step returns ($n = 3$, reduces bootstrap bias), distributional RL with C51 (richer learning signal), and Noisy Nets (replaces $\varepsilon$-greedy with learnable exploration).

Noisy Nets (Fortunato et al. 2017), the sixth component, replace the final fully connected layers with stochastic linear layers: $y = (\mu^w + \sigma^w \odot \varepsilon^w)x + (\mu^b + \sigma^b \odot \varepsilon^b)$, where $\varepsilon^w$ and $\varepsilon^b$ are sampled noise tensors and the $\mu$ and $\sigma$ parameters are learned. This produces state-dependent exploration: because the $\sigma$ parameters are trained by the same loss as the weights, the amount of injected noise adapts over training, and its effect on the output varies from state to state. $\varepsilon$-greedy, by contrast, explores uniformly at random regardless of state.
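A sketch of one noisy linear layer's forward pass (NumPy; the paper's factorized-Gaussian noise is simplified here to independent noise, and all names are our own):

```python
import numpy as np

def noisy_linear(x, mu_w, sigma_w, mu_b, sigma_b, rng):
    """y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b).

    mu_* are ordinary learned parameters; sigma_* are learned noise scales.
    Fresh noise is drawn per forward pass, so exploration comes from the
    weights themselves rather than from an external epsilon schedule.
    """
    eps_w = rng.standard_normal(mu_w.shape)
    eps_b = rng.standard_normal(mu_b.shape)
    return (mu_w + sigma_w * eps_w) @ x + (mu_b + sigma_b * eps_b)

rng = np.random.default_rng(0)
x = np.ones(4)
mu_w, mu_b = np.ones((2, 4)), np.zeros(2)
# With sigma = 0 the layer reduces to an ordinary linear layer:
y_det = noisy_linear(x, mu_w, np.zeros((2, 4)), mu_b, np.zeros(2), rng)
y_noisy = noisy_linear(x, mu_w, 0.5 * np.ones((2, 4)), mu_b, 0.5 * np.ones(2), rng)
```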

Component | Problem addressed | Key mechanism
Double DQN | Overestimation bias | Separate networks for action selection and evaluation
Prioritized replay | Sample efficiency | Sample high-TD-error transitions more often
Dueling networks | Gradient efficiency | V(s) and A(s,a) streams share the backbone
Multi-step returns | Bootstrap bias | Propagate observed rewards over n steps before bootstrapping
Distributional RL | Signal richness | Learn full return distribution, not just the mean
Noisy Nets | Exploration | Learned, state-dependent stochasticity in weights

The Rainbow ablation study found that prioritized replay and multi-step returns contribute the largest individual gains; removing either one degrades performance substantially. Distributional RL is the third most important; Double DQN and dueling networks add smaller but consistent improvements. Noisy Nets' contribution is more variable across games.

Rainbow represents the mature state of purely value-based deep RL. The next frontier — policy gradient and actor-critic methods, covered in Chapter 04 — relaxes the assumption that we can enumerate Q-values across all actions, enabling continuous action spaces and on-policy learning.

Further reading

Foundational papers

Textbooks and courses

Going further