Policy Gradient & Actor-Critic, optimizing the policy directly.

Value-based methods work by learning Q-values and deriving a policy implicitly; policy gradient methods flip this around, parameterizing the policy directly and ascending its expected-return gradient. The policy gradient theorem gives a tractable estimator of this gradient; actor-critic methods sharply reduce its variance by coupling it with a learned value function. PPO distills a decade of stability research into a single clipped objective that dominates practical deep RL, while SAC and TD3 extend these ideas to continuous actions with maximum-entropy and deterministic-gradient frameworks respectively.

Chapter notes

Policy gradient methods and value-based methods (Chapter 03) are complementary rather than competing. Value-based methods are off-policy and sample-efficient but struggle with continuous actions and stochastic policies. Policy gradient methods are naturally on-policy, handle arbitrary action spaces, and learn explicit stochastic policies — critical for RLHF and other applications where the full distribution over actions matters.

Prerequisites: Chapter 01 (RL Fundamentals) — value functions and the Bellman equations — and Chapter 03 (Deep Q-Networks) for context on function approximation and experience replay. Familiarity with automatic differentiation and log-likelihood gradients is helpful for Section 03.

01

Why optimize the policy directly?

Q-learning finds the optimal value function and derives a policy from it. Policy gradient methods ask: why not just optimize the policy itself?

Deep Q-networks output one Q-value per discrete action. To act, the agent takes the argmax. This works well in Atari because the action space is small and discrete — 4 to 18 actions. But many important problems have continuous action spaces: the torque applied to each joint of a robot arm, the throttle and steering angle of a vehicle, the bid price in an auction. Computing $\arg\max_a Q(s, a)$ over a continuous $a$ is itself an optimization problem, typically requiring a separate inner loop at every step.

A second limitation is expressiveness. Q-learning derives a deterministic policy: always take the action with the highest Q-value. Some problems genuinely require stochastic policies — rock-paper-scissors is the canonical example; more practically, exploration in partially observable environments, multi-agent competition, and RLHF all benefit from explicit probability distributions over actions. Q-learning can produce approximately stochastic behavior via $\varepsilon$-greedy, but this is a heuristic add-on, not a principled distribution.

Policy gradient methods parameterize the policy directly as $\pi_\theta(a \mid s)$ — a probability distribution over actions conditioned on the state — and optimize the expected return $J(\theta) = \mathbb{E}_{\pi_\theta}[G_0]$ by gradient ascent on $\theta$. The policy is the primary object of optimization; a value function, when used, plays a supporting role.

02

Parameterized policies

The policy is a neural network that outputs a probability distribution. For discrete actions it outputs a categorical; for continuous actions, a Gaussian.

For discrete action spaces, the standard parameterization passes the network logits through a softmax: $\pi_\theta(a \mid s) = \text{softmax}(f_\theta(s))_a$. This ensures probabilities are non-negative and sum to one, and the network can express arbitrary preferences over actions by adjusting the logit magnitudes.

For continuous action spaces, a common parameterization is a diagonal Gaussian: the network outputs a mean vector $\mu_\theta(s)$ and (log-)standard deviation $\sigma_\theta(s)$, and actions are sampled as $a \sim \mathcal{N}(\mu_\theta(s), \sigma_\theta(s)^2 I)$. The mean controls where the policy concentrates probability; the standard deviation controls exploration. As the agent becomes more confident, $\sigma$ typically shrinks.
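Both parameterizations fit in a few lines of NumPy. This is a minimal sketch, not any particular library's API; the helper names (`sample_discrete`, `sample_gaussian`) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sample_discrete(logits):
    """Categorical policy: sample an action index from softmax(logits)."""
    probs = softmax(logits)
    return rng.choice(len(probs), p=probs), probs

def sample_gaussian(mu, log_sigma):
    """Diagonal Gaussian policy: sample a continuous action vector.

    Parameterizing log(sigma) instead of sigma keeps sigma > 0 without
    any constraint on the network output."""
    sigma = np.exp(log_sigma)
    return mu + sigma * rng.standard_normal(mu.shape)
```

In a real agent, `logits`, `mu`, and `log_sigma` would be the outputs of the policy network $f_\theta(s)$.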

The score function $\nabla_\theta \log \pi_\theta(a \mid s)$ is central to policy gradient algorithms. It measures how much a small change in $\theta$ shifts the log-probability of a particular action — i.e., in which direction the parameters should be moved to make $a$ more or less likely. For the Gaussian policy, the gradient with respect to the mean's parameters (holding $\sigma$ fixed) is $\nabla_\theta \log \pi_\theta(a \mid s) = \frac{(a - \mu_\theta)}{\sigma_\theta^2} \nabla_\theta \mu_\theta$, showing that the gradient is large when an action lands far from the mean, proportionally encouraging or discouraging extreme actions based on their outcomes.

03

The policy gradient theorem

Computing $\nabla_\theta J(\theta)$ seems hard — the distribution over trajectories depends on $\theta$, so we cannot simply move the gradient inside an expectation. The log-derivative trick resolves this.

The expected return $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$ is an expectation over trajectories $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$ sampled by following $\pi_\theta$. Differentiating through the expectation seems to require differentiating through the sampling distribution, which is intractable. The log-derivative trick sidesteps this:

$$\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau) \cdot \nabla_\theta \log p_\theta(\tau)\right]$$

Since the environment dynamics $p(s' \mid s, a)$ do not depend on $\theta$, the log-probability of a trajectory simplifies to $\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$. The full policy gradient theorem states:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot Q^{\pi_\theta}(s_t, a_t)\right]$$

This expression is remarkable: the gradient of expected return depends only on how much the policy's log-probability changes with $\theta$, weighted by the Q-value of the actions taken. We do not need to differentiate through the environment or the reward function — only through the policy itself. This makes policy gradient applicable to non-differentiable environments, including real-world physical systems.

In practice, $Q^{\pi_\theta}(s_t, a_t)$ must be estimated. REINFORCE estimates it with Monte Carlo returns; actor-critic methods estimate it with a learned value network.
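A one-step "bandit" sanity check makes the theorem concrete. Assuming a 1-D Gaussian policy and a quadratic reward (both chosen here for illustration, since the true gradient is then known in closed form), the score-function estimator should match the analytic gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 0.0, 1.0                  # policy: a ~ N(mu, sigma^2)
reward = lambda a: -(a - 3.0) ** 2    # toy reward, maximized at a = 3

# Score-function (policy gradient) estimate of dJ/dmu:
#   E[ R(a) * d/dmu log N(a; mu, sigma) ] = E[ R(a) * (a - mu) / sigma^2 ]
a = mu + sigma * rng.standard_normal(200_000)
grad_est = np.mean(reward(a) * (a - mu) / sigma**2)

# Analytic gradient: J(mu) = -((mu - 3)^2 + sigma^2), so dJ/dmu = -2 (mu - 3)
grad_true = -2.0 * (mu - 3.0)
```

Note that nothing here differentiated through `reward` — only through the policy's log-density, exactly as the theorem promises.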

04

REINFORCE

Williams (1992) proposed the simplest instantiation of the policy gradient theorem: use the episode's actual return as the Q-value estimate. It works, but its variance is crushing.

REINFORCE collects complete episodes by following $\pi_\theta$, then uses the discounted return from each timestep as the estimate of $Q^{\pi_\theta}(s_t, a_t)$. The update for a single episode is:

$$\theta \leftarrow \theta + \alpha \sum_t G_t \cdot \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

where $G_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}$ is the return from step $t$. The intuition is direct: if the full episode went well ($G_t$ large), increase the probability of every action taken; if it went poorly ($G_t$ small), decrease them. Actions that led to high returns are reinforced; those that did not are discouraged.
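The whole algorithm fits in a short loop. A minimal sketch on a two-armed bandit with one-step episodes (so $G_t = r$), using made-up arm rewards; the softmax score function is $\nabla_\theta \log \pi_\theta(a) = \mathbf{1}_a - \pi_\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.zeros(2)                   # logits for a 2-armed bandit
arm_reward = np.array([0.2, 1.0])     # arm 1 is better (illustrative values)
alpha = 0.1

def probs(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

for _ in range(2000):                 # one-step episodes: G_t = r
    p = probs(theta)
    a = rng.choice(2, p=p)
    G = arm_reward[a]
    grad_logpi = np.eye(2)[a] - p     # d/dtheta log softmax(theta)_a
    theta += alpha * G * grad_logpi   # REINFORCE update
```

After training, the policy concentrates almost all its probability on the better arm.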

The estimator is unbiased — its expectation equals the true policy gradient — but its variance is enormous. Whether an action looks good depends on everything that happened after it, including many future random events. The return $G_t$ conflates the quality of action $a_t$ with the stochasticity of the entire remaining trajectory. In problems with long horizons or high stochasticity, the signal-to-noise ratio collapses and learning becomes impractically slow.

REINFORCE is important to understand, but rarely used directly in practice. Its role in the curriculum is to establish the core idea — the score-function estimator — before seeing how actor-critic methods and PPO dramatically reduce the variance problem.

05

Baselines and the advantage function

Subtracting a baseline from the return reduces variance without introducing bias. A natural, near-optimal baseline is the state-value function V(s), which turns the Q-value weight into an advantage.

For any function $b(s)$ that does not depend on the action $a$, the following identity holds:

$$\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s) \cdot b(s)\right] = 0$$

This means we can subtract any state-dependent baseline from the Q-value in the policy gradient without changing its expectation — but we dramatically change its variance. Subtracting $b(s) = V^{\pi_\theta}(s)$ yields the advantage function:

$$A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$$

The advantage $A(s, a)$ measures how much better action $a$ is than the average action from state $s$ under the current policy. If the agent is in a good state ($V(s)$ is large) but takes an average action, the advantage is zero — no gradient signal. Only actions that are genuinely better or worse than average produce a gradient. This centering removes the state-value component of the return — the part that reflects where the agent happened to be rather than what it did — which accounts for much of REINFORCE's variance.
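The variance reduction is easy to verify empirically. Reusing the toy Gaussian-policy bandit from earlier (quadratic reward, chosen for illustration), the baseline-subtracted estimator has the same mean but markedly smaller variance:

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 0.0, 1.0
reward = lambda a: -(a - 3.0) ** 2
a = mu + sigma * rng.standard_normal(100_000)
score = (a - mu) / sigma**2           # d/dmu log N(a; mu, sigma)

baseline = reward(a).mean()           # Monte Carlo estimate of V = E[R]
g_plain = reward(a) * score           # REINFORCE gradient samples
g_base = (reward(a) - baseline) * score  # baseline-subtracted samples
```

Both sample sets estimate the same gradient, but `g_base` concentrates much more tightly around it.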

In practice, $V^{\pi_\theta}(s)$ is estimated by a separate neural network — the critic. The TD residual $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is a one-step estimate of the advantage, and generalized advantage estimation (GAE, Schulman et al. 2015) forms an exponentially weighted average of $n$-step advantage estimates, providing a tunable bias-variance tradeoff via a parameter $\lambda \in [0, 1]$.
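GAE has a simple backward-recursive implementation: $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$. A minimal sketch for a single episode (the function name and array layout are illustrative):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one episode.

    `values` has length len(rewards) + 1: the last entry is the bootstrap
    value for the final state (0 if the episode terminated there)."""
    deltas = rewards + gamma * values[1:] - values[:-1]   # TD residuals
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv
```

Setting $\lambda = 0$ recovers the one-step TD residual; $\lambda = 1$ recovers the Monte Carlo advantage $G_t - V(s_t)$.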

06

Actor-critic methods

Actor-critic algorithms maintain two networks: an actor that selects actions, and a critic that estimates how good they are. The critic's signal drives actor updates; the actor's behavior drives critic updates.

The actor is the policy $\pi_\theta(a \mid s)$, updated by policy gradient. The critic is a value function $V_w(s)$ (or $Q_w(s, a)$), updated by TD learning to minimize squared Bellman error. Their interactions form a feedback loop: the actor generates transitions from which the critic learns; the critic's advantage estimate tells the actor which actions to make more or less probable.

[Figure: the actor-critic loop. State s fans to both networks. The actor π(a|s; θ) samples action a; the environment returns reward R and next state s′ to the critic V(s; w), which estimates the advantage A(s,a) = δ. This signal (dashed) updates the actor's parameters θ, while the critic updates its own weights w by minimizing TD error.]

The two-network design reconciles the policy gradient theorem's requirement for a Q-value estimate with the desire for sample efficiency. Pure REINFORCE waits for full episodes and uses high-variance Monte Carlo returns. The critic's bootstrap estimate $\hat{A}(s_t, a_t) = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$ is available after a single step and has lower variance — at the cost of some bias from the imperfect critic. This is the same bias-variance tradeoff encountered throughout TD learning, now applied to the policy gradient estimator.
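Putting the pieces together, a single online actor-critic update can be written compactly. This is a minimal sketch with linear function approximation and a softmax actor; the function name, feature vectors, and learning rates are illustrative:

```python
import numpy as np

def actor_critic_step(theta, w, s_feat, a, r, s_next_feat, probs,
                      gamma=0.99, lr_actor=0.01, lr_critic=0.1):
    """One online actor-critic update.

    theta: actor logit weights, shape (n_actions, n_features)
    w:     critic weights, shape (n_features,)
    probs: action probabilities the actor assigned in state s
    """
    # Critic: the one-step TD error doubles as the advantage estimate.
    delta = r + gamma * w @ s_next_feat - w @ s_feat
    w = w + lr_critic * delta * s_feat                 # semi-gradient TD(0)

    # Actor: softmax-linear score function is (onehot(a) - probs) outer s.
    grad_logpi = np.outer(np.eye(len(probs))[a] - probs, s_feat)
    theta = theta + lr_actor * delta * grad_logpi
    return theta, w, delta
```

The same `delta` drives both updates: the critic shrinks it toward zero, while the actor uses its sign to make the taken action more or less probable.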

07

A3C and A2C

Mnih et al. (2016) showed that running many actor-critic agents in parallel — each with its own environment copy — produces sufficient gradient diversity to train stably without a replay buffer.

Asynchronous Advantage Actor-Critic (A3C) runs $n$ workers (typically 16–64), each interacting with its own copy of the environment and computing local gradients. Workers asynchronously push gradient updates to a shared global network and pull updated parameters back. Because different workers are at different points in their environments, their experiences are decorrelated — solving the same problem that experience replay solved in DQN, but without the off-policy complications.

Each worker computes an $n$-step advantage estimate before updating. For $n$ steps of experience $(s_t, a_t, r_t, \ldots, s_{t+n})$, the advantage target combines observed rewards with a bootstrapped value estimate:

$$\hat{A}_t = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V_w(s_{t+n}) - V_w(s_t)$$
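The target above is a direct sum. A minimal sketch (the function name and argument layout are illustrative):

```python
import numpy as np

def n_step_advantage(rewards, v_boot, v_s0, gamma=0.99):
    """A3C-style n-step advantage: discounted rewards over n steps,
    plus a bootstrapped tail value, minus the current state's value.

    rewards: r_t ... r_{t+n-1};  v_boot: V(s_{t+n});  v_s0: V(s_t)."""
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    return discounts @ rewards + gamma**n * v_boot - v_s0
```

This is the $\lambda = 1$, fixed-$n$ special case of the GAE family from Section 05.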

The actor and critic share most network parameters — a shared backbone extracts features, with separate linear heads for the policy logits and the value estimate. This weight sharing means representations learned for predicting value also inform the policy, and vice versa.

A2C is the synchronous variant: a coordinator waits for all workers to finish an $n$-step rollout before applying a single averaged gradient update. This removes the staleness of asynchronous updates and is easier to implement on GPUs; most modern frameworks prefer A2C. In practice, both A3C and A2C have been largely superseded by PPO, which uses the same parallel worker structure but with a more stable update rule.

08

Trust regions and TRPO

Vanilla policy gradient can catastrophically collapse a well-trained policy with a single large update. Trust Region Policy Optimization prevents this by constraining how much the policy can change per step.

The policy gradient estimator tells us which direction to push $\theta$, but not how far. A large step might improve performance on the current batch of experience but produce a policy that performs much worse elsewhere — and since on-policy methods collect new data from the updated policy, a bad update poisons future data collection too. Unlike supervised learning, there is no fixed dataset to fall back on; a policy collapse is self-reinforcing.

Schulman et al. (2015) proposed Trust Region Policy Optimization (TRPO), which maximizes a surrogate objective subject to a KL divergence constraint between the old and new policy:

$$\max_\theta \; \mathbb{E}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_\text{old}}(a \mid s)} A^{\pi_{\theta_\text{old}}}(s, a)\right] \quad \text{subject to} \quad \mathbb{E}\left[\text{KL}(\pi_{\theta_\text{old}} \| \pi_\theta)\right] \leq \delta$$

The ratio $r(\theta) = \pi_\theta / \pi_{\theta_\text{old}}$ is called the probability ratio; it measures how much more or less likely the new policy is to take each action than the old one. Maximizing $r(\theta) \cdot A$ pushes toward better actions; the KL constraint prevents the ratio from growing too large. TRPO solves this constrained problem with a natural gradient step computed by the conjugate gradient method, which needs only Fisher-vector products rather than the full Fisher information matrix — principled, but expensive and intricate to implement.

TRPO gives a monotonic improvement guarantee under certain conditions and set the state of the art in continuous control in 2015–2016. Its main drawback is implementation complexity and computational overhead. PPO was designed to capture the same benefit with a much simpler mechanism.

09

Proximal Policy Optimization

PPO achieves TRPO's stability through a clipped surrogate objective that requires no second-order optimization, no constraint solver, and adds roughly a dozen lines to a vanilla policy gradient implementation.

Schulman et al. (2017) introduced the clipped surrogate objective. Let $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)$ be the probability ratio. PPO maximizes:

$$L^\text{CLIP}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,A_t,\;\text{clip}(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon)\,A_t\right)\right]$$

The clip operation prevents $r_t(\theta)$ from straying outside the interval $[1-\varepsilon, 1+\varepsilon]$ — typically $\varepsilon = 0.2$. The minimum of the clipped and unclipped objective creates a conservative surrogate: when the advantage is positive (the action was good), the objective cannot benefit from increasing $r_t$ beyond $1 + \varepsilon$; when the advantage is negative (the action was bad), it cannot avoid penalty by decreasing $r_t$ below $1 - \varepsilon$. The policy is nudged toward better actions but prevented from making large steps in either direction.
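The clipping logic is only a few lines. A minimal per-sample sketch of the objective (to be maximized; the function name is illustrative):

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Per-sample clipped surrogate objective L^CLIP."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # The min makes the surrogate conservative in both directions.
    return np.minimum(unclipped, clipped)
```

For a good action (positive advantage), a ratio of 2.0 earns no more than a ratio of 1.2; for a bad action, shrinking the ratio below 0.8 avoids no penalty.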

Why PPO dominates practice: It is simple to implement with standard automatic differentiation libraries, requires no line searches or constraint solvers, runs well on GPUs with minibatch updates, and achieves competitive performance on almost every benchmark. OpenAI used PPO to train the five-player Dota 2 agent OpenAI Five; it underlies much of the RLHF pipeline used to align large language models.

A typical PPO implementation collects $T$ timesteps of experience with $N$ parallel workers, computes GAE-based advantage estimates, then runs $K$ epochs of minibatch SGD on the collected data before discarding it. The full loss combines the clipped policy objective with a value function loss and an entropy bonus: $L = L^\text{CLIP} - c_1 L^\text{VF} + c_2 H[\pi_\theta]$. The entropy term encourages exploration by penalizing deterministic policies.

10

Deterministic policy gradient and DDPG

Stochastic policy gradients require integrating over actions, which is expensive in high-dimensional continuous spaces. The deterministic policy gradient theorem shows we can avoid the integral entirely.

Silver et al. (2014) proved the deterministic policy gradient (DPG) theorem: if the policy is deterministic, $\mu_\theta(s)$, its gradient with respect to expected return is:

$$\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho^\mu}\!\left[\nabla_\theta \mu_\theta(s) \cdot \nabla_a Q^\mu(s, a)\big|_{a=\mu_\theta(s)}\right]$$

The policy gradient is the product of two Jacobians: how the action changes with $\theta$, and how the Q-value changes with the action. There is no integral over actions — the gradient evaluates at the single deterministic action $\mu_\theta(s)$. This is far more sample-efficient in high-dimensional action spaces.
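The chain rule structure can be checked numerically on a toy problem. Assuming a scalar linear policy $\mu_\theta(s) = \theta s$ and a known quadratic critic (both made up for illustration), the DPG product matches a finite-difference gradient of $J(\theta) = Q(s, \mu_\theta(s))$:

```python
import numpy as np

s, theta = 1.5, 0.5
Q = lambda s, a: -(a - 2.0 * s) ** 2      # toy critic, maximized at a* = 2s
mu = lambda th, s: th * s                 # deterministic linear policy

# DPG chain rule: dJ/dtheta = (dmu/dtheta) * (dQ/da at a = mu(s))
dmu_dtheta = s
dQ_da = -2.0 * (mu(theta, s) - 2.0 * s)
grad_dpg = dmu_dtheta * dQ_da

# Finite-difference check on J(theta) = Q(s, mu(theta, s))
h = 1e-5
grad_fd = (Q(s, mu(theta + h, s)) - Q(s, mu(theta - h, s))) / (2 * h)
```

No sampling over actions is needed — the gradient is evaluated at the single action the policy outputs.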

Deep Deterministic Policy Gradient (DDPG, Lillicrap et al. 2015) applies DPG with neural network function approximators, borrowing DQN's experience replay and target networks. The actor network outputs a single action vector; the critic takes both state and action as input and outputs a scalar Q-value. DDPG is off-policy: it explores using Ornstein-Uhlenbeck or Gaussian noise added to the deterministic action, and learns from a replay buffer of past transitions. This makes it substantially more sample-efficient than on-policy methods like PPO in terms of environment interactions, at the cost of greater hyperparameter sensitivity.

11

TD3

DDPG suffers from the same overestimation bias that plagued DQN, amplified by the actor exploiting critic errors. TD3 addresses this with three targeted fixes.

Fujimoto et al. (2018) diagnosed DDPG's instability: the actor learns to exploit imperfections in the critic, moving to regions of state-action space where the critic overestimates Q-values, then collecting bad experience there, degrading the critic further in a feedback loop. Twin Delayed Deep Deterministic policy gradient (TD3) addresses this with three modifications:

Twin critics. Maintain two critic networks $Q_{w_1}$ and $Q_{w_2}$ with separate parameters. For TD targets, use $\min(Q_{w_1}, Q_{w_2})$ — the pessimistic estimate. Since both critics independently overestimate, their minimum is a less biased estimate of the true Q-value. This clipped double Q-learning is a sibling of the Double DQN idea, adapted to the continuous actor-critic setting.

Delayed policy updates. Update the actor less frequently than the critics — TD3 uses one actor update for every two critic updates. This gives the critic time to stabilize before the actor starts exploiting it, breaking the feedback loop. Empirically, this reduces variance in actor updates substantially.

Target policy smoothing. Add clipped random noise $\varepsilon \sim \text{clip}(\mathcal{N}(0, \sigma), -c, c)$ to the target action when computing the TD target: $y = r + \gamma \min_i Q_{w_i'}(s', \mu_{\theta'}(s') + \varepsilon)$. This regularizes the critic against sharp Q-value peaks by forcing it to produce similar estimates for nearby actions — a form of implicit critic regularization that makes the landscape smoother for the actor to optimize.
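Two of the three fixes meet in the TD target computation. A minimal sketch, with the critics and target actor passed in as callables (the function signature is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def td3_target(r, s_next, q1, q2, actor_target, gamma=0.99,
               noise_sigma=0.2, noise_clip=0.5, a_low=-1.0, a_high=1.0):
    """TD3 TD target: target-policy smoothing + twin-critic minimum."""
    a_next = actor_target(s_next)
    # Target policy smoothing: clipped Gaussian noise on the target action.
    noise = np.clip(noise_sigma * rng.standard_normal(np.shape(a_next)),
                    -noise_clip, noise_clip)
    a_next = np.clip(a_next + noise, a_low, a_high)
    # Twin critics: bootstrap from the pessimistic (minimum) estimate.
    return r + gamma * np.minimum(q1(s_next, a_next), q2(s_next, a_next))
```

The third fix — delayed policy updates — lives in the training loop: the actor (and target networks) are updated only every second critic update.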

TD3 substantially outperforms DDPG on MuJoCo continuous control benchmarks and is simpler to tune. It was the dominant off-policy continuous control algorithm until Soft Actor-Critic offered similar performance with better sample efficiency.

12

Soft Actor-Critic

SAC reframes the RL objective to include an entropy bonus, producing a policy that is simultaneously reward-seeking and maximally stochastic — and, in practice, dramatically more stable to train than DDPG or TD3.

Haarnoja et al. (2018) introduced the maximum entropy RL framework. The standard objective maximizes $\mathbb{E}[G_0]$; the maximum entropy objective maximizes:

$$J(\pi) = \mathbb{E}\left[\sum_t r_t + \alpha \mathcal{H}(\pi(\cdot \mid s_t))\right]$$

where $\mathcal{H}(\pi(\cdot \mid s)) = -\mathbb{E}_a[\log \pi(a \mid s)]$ is the entropy of the policy and $\alpha > 0$ is a temperature parameter controlling the tradeoff between reward and entropy. The augmented objective encourages the policy to explore — maintaining probability mass across many actions — while still preferring high-reward ones. Deterministic policies have zero entropy; SAC naturally avoids collapsing to them.

The entropy bonus has several practical benefits. It produces robust policies that hedge against environmental uncertainty: rather than committing to one narrow strategy, the policy spreads probability over multiple good actions. It reduces the risk of premature convergence to local optima. And it implicitly provides exploration without any explicit noise injection — the policy is stochastic by design, and its stochasticity is tuned by the temperature $\alpha$.

SAC uses twin critics (borrowed from TD3) and an experience replay buffer (off-policy, like DDPG). The stochastic actor uses the reparameterization trick: actions are sampled as $a = \tanh(\mu_\theta(s) + \sigma_\theta(s) \odot \varepsilon)$ with $\varepsilon \sim \mathcal{N}(0, I)$, allowing gradients to flow through the sampling operation. The temperature $\alpha$ can be fixed or automatically tuned to maintain a target entropy level — in practice, automatic tuning reliably produces well-calibrated exploration without manual search.
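The tanh-squashed reparameterized sample and its log-probability can be sketched directly. The $\tanh$ change of variables subtracts $\log(1 - \tanh(u)^2)$ per dimension from the Gaussian log-density; the helper name and the stable form of that term are this sketch's choices, not a fixed API:

```python
import numpy as np

rng = np.random.default_rng(0)

def sac_sample(mu, log_sigma):
    """Reparameterized tanh-Gaussian sample and its log-probability.

    a = tanh(u), u = mu + sigma * eps, eps ~ N(0, I). Gradients flow
    through mu and sigma because eps is sampled independently of them."""
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal(mu.shape)
    u = mu + sigma * eps
    a = np.tanh(u)
    # Per-dimension Gaussian log-density of the pre-squash sample u.
    log_gauss = -0.5 * (((u - mu) / sigma) ** 2 + 2 * log_sigma
                        + np.log(2 * np.pi))
    # Stable log(1 - tanh(u)^2) = 2 * (log 2 - u - log(1 + exp(-2u)))
    log_det = 2.0 * (np.log(2.0) - u - np.logaddexp(0.0, -2.0 * u))
    return a, (log_gauss - log_det).sum()
```

The squash keeps actions in $(-1, 1)$; rescaling to the environment's action bounds is a separate affine step.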

| Algorithm | Policy type | On/off-policy | Action space | Key strength |
| --- | --- | --- | --- | --- |
| REINFORCE | Stochastic | On-policy | Any | Simplest policy gradient baseline |
| A2C / A3C | Stochastic | On-policy | Any | Parallel workers, no replay buffer |
| PPO | Stochastic | On-policy | Any | Stable, simple, dominant in practice |
| DDPG | Deterministic | Off-policy | Continuous | Sample-efficient continuous control |
| TD3 | Deterministic | Off-policy | Continuous | DDPG + overestimation fixes |
| SAC | Stochastic | Off-policy | Continuous | Max-entropy, stable, auto-temperature |

SAC and PPO are the two workhorses of modern deep RL. SAC dominates in continuous control with limited environment samples (robotics, simulations); PPO dominates where parallel environment rollouts are cheap (game environments, LLM alignment). Understanding both — and when each applies — covers the vast majority of practical deep RL deployments.
