Reinforcement Learning


Scope and What This Chapter Is About

The chapter develops reinforcement learning - the paradigm where a learner improves its behaviour by interacting with an environment and observing rewards. We cover the foundations (MDPs, Bellman equations, value iteration, policy iteration), the modern algorithmic families (Q-learning and DQN, policy gradients and actor-critic, PPO, SAC, TD3), model-based RL (Dyna, MuZero, Dreamer), offline RL, RLHF as the LM-environment paradigm, exploration, and the connections to other chapters (especially LLM §5 for RLHF/DPO and the Theory chapter §10 for sample complexity). Open problems are flagged inline and consolidated in §14.


§1. Motivation and Scope

A worked example to anchor the rest

Before any formal definition, the concrete picture. Imagine training a robot to walk. There is no labelled dataset of “for state ss, the correct action is aa” - nobody knows the right joint torques for every robot configuration. There is, however, a way to evaluate any walking attempt: how far did the robot travel, did it fall over, how much energy did it use. The robot tries something, observes what happens, and updates its behaviour in the direction of higher reward over time.

This is the reinforcement learning setup: a learner (the agent) interacts with an environment by taking actions; the environment returns observations (or states) and rewards; the agent’s goal is to learn a policy - a mapping from states to actions - that maximizes the long-term cumulative reward.

Three concrete instances of this setup, drawn from across 2026 practice:

  1. AlphaZero playing Go. The agent is the neural network policy. The environment is the game of Go. Actions are stone placements; states are board configurations; the reward is +1+1 for winning, 1-1 for losing, 00 along the way. The agent improves by self-play: it plays against copies of itself, observes wins and losses, and updates the policy.

  2. A robot learning manipulation. The agent is a deep policy network mapping camera images and joint angles to motor commands. The environment is the real (or simulated) physical world. The reward is a hand-designed signal: +1+1 when an object is successfully grasped, 00 otherwise; or, in dense form, a function of distance-to-target. The agent collects trajectories of state-action-reward sequences and updates its policy with a deep RL algorithm.

  3. An LLM post-trained with RLHF. The agent is the language model. The environment is a “prompt → completion” exchange: a prompt is sampled, the model produces a completion, a reward model (or human) scores the completion, and the LM is updated. This is the modern post-training step for Foundation Models, covered in detail in LLM §5 and revisited from the RL angle in §10 of this chapter.

The three settings share structure (agent, actions, rewards, policy) but differ enormously in scale, action-space type, and reward structure. The RL framework abstracts over all of them.

How RL differs from supervised and self-supervised learning

Three structural differences make RL its own paradigm:

  • Data is collected by the learner. In supervised learning, the dataset {(xi,yi)}\{(x_i, y_i)\} is given. In RL, the data is generated by the agent’s interactions: which states the agent visits depends on the policy it currently has. This is the exploration-exploitation problem (§11): the agent must visit states it has not seen to learn about them, but visiting them costs reward.

  • Feedback is evaluative, not instructive. Supervised data tells you what the right answer is. Reward signals tell you how good your answer was but not what would have been better. The agent must figure out the right behaviour from a scalar reward, not from a target label.

  • Decisions have delayed consequences. A move in a Go game affects rewards 100 moves later. A motor command affects the robot’s state for many subsequent steps. The agent must learn to assign credit to actions whose consequences are temporally distant - the temporal credit assignment problem, addressed by value functions and the Bellman equations of §3.

These differences imply that supervised-learning techniques cannot be applied directly. The chapter is structured around the conceptual machinery that handles them: MDPs for the formalism, value-based and policy-based algorithms for credit assignment, and exploration strategies for the data-collection problem.

Why RL has become central in 2026

RL was a niche topic for two decades after AIMA 4e’s predecessor was written. The current centrality reflects three forces that arrived roughly simultaneously:

  1. Deep RL works. From DQN (2015) onward, the combination of deep function approximation with RL algorithms made it possible to train policies for problems with high-dimensional observations (Atari from pixels, robotic manipulation, large games). The algorithmic and infrastructure work that produced this is the content of §6§7.

  2. RLHF tied RL to Foundation Models. When Christiano et al. (2017), and later Ouyang et al. (2022) for InstructGPT, demonstrated that RL could fine-tune LMs against human preferences, RL became the post-training step for Foundation Models (FM §7) and the production pipeline for nearly all deployed LLMs (LLM §5). The scale and stakes of LLM RLHF training have made RL one of the highest-investment areas in industrial AI.

  3. Reasoning models trained with RL. Starting in late 2024 (o1) and into 2026 (o3, DeepSeek-R1, and successors), RL has been used to train chain-of-thought reasoning in language models - making the LM produce extended reasoning before answering, with the reasoning trained against verifier-based rewards. This is the content of §12. It is the most recent and most consequential application of RL in the field.

The combined consequence: a research-oriented practitioner in 2026 will encounter RL not just in robotics and game-playing but in language modelling, reasoning, agentic systems, and Foundation Model post-training. The chapter is structured to support that breadth.

Boundaries with adjacent chapters

RL is not the only chapter dealing with sequential decision-making. The boundaries:

  • Planning (AIMA Part III): plans actions given a known model of the environment. RL handles the case where the model is unknown or learned. Many algorithms (MuZero, AlphaZero) combine planning and learning.

  • Game playing (AIMA Ch 5–6): two-player and multi-player adversarial planning. AlphaZero-style self-play is RL applied in this setting; we develop it briefly in §8 but defer detailed game-playing material to the game-theory chapter.

  • Control (classical, optimal control, LQR): the engineering antecedent of RL with a different (model-based, often closed-form) flavour. RL is the model-free, data-driven cousin.

  • Multi-agent systems: RL in environments with multiple learning agents. Substantial enough to deserve its own chapter; we touch on it in §13 cross-references.

  • AI Agents and Tool Use: deployed agentic systems often use a policy (sometimes RL-trained) over high-level actions. The relationship is sketched in §13.

  • Theoretical Foundations of Learning §10 develops the sample-complexity theory for RL - what bounds we can prove on data efficiency. This chapter develops the algorithms; the theory chapter develops the analysis.

  • Large Language Models §5 develops RLHF and DPO at the LM scale specifically. This chapter develops the general RL framework that those LM-specific applications instantiate.

What this chapter does not try to do

A research-oriented chapter must be honest about scope:

  • We do not provide a complete textbook treatment of classical RL - Sutton and Barto’s textbook does that job better and longer than is reasonable here. We cover the foundations to the depth needed for the modern applications.

  • We do not catalogue every modern RL algorithm. The space is large; we develop a small number of canonical algorithms (DQN, PPO, SAC, DDPG/TD3, AlphaZero/MuZero, Dreamer, CQL/IQL, PPO-for-RLHF, DPO, GRPO) in mechanistic detail and refer the reader to surveys for breadth.

  • We do not derive most theoretical results in full. §3 develops the Bellman equations and basic convergence; deeper sample-complexity results live in the Theory chapter §10.

  • We do not separately treat robotics. RL for robotics is one of the largest application domains and has its own engineering substance (sim-to-real, domain randomization, safety constraints); we touch on it where it matters for the algorithmic story and otherwise refer the reader to the robotics-specific literature.

Position taken in this chapter

The chapter is structured around an algorithmic rather than a historical spine: we develop the modern recipes that practitioners actually use in 2026, with the historical antecedents inlined where they motivate the modern form. PPO is the central modern algorithm and gets full mechanistic treatment in §6; RLHF is the LLM-era central application and gets full treatment in §10; reasoning-RL is the post-2024 frontier and gets the closing §12. The classical material (Q-learning, tabular methods) is necessary scaffolding and is treated as such.


§2. Historical Context

This section traces the field from its mid-20th-century origins through the deep-RL inflection of 2013–2015, the AlphaGo moment, the PPO-and-RLHF era, and the reasoning-model era of 2024–2026.

A timeline of the inflection points:

   1950s         Bellman and dynamic programming;
                  Markov decision processes
                                  │
                                  ▼
   1959-1980     Sutton's preliminary work on
                  reinforcement and temporal-difference ideas;
                  Klopf's "hedonistic neurons"
                                  │
                                  ▼
   1988          Sutton "Learning to Predict by the
                  Methods of Temporal Differences"
                                  │
                                  ▼
   1989          Watkins introduces Q-learning;
                  convergence to optimal Q-values in
                  tabular setting (Watkins, Dayan 1992)
                                  │
                                  ▼
   1992-1995     TD-Gammon (Tesauro): TD-learning + NN
                  reaches world-champion backgammon level;
                  first major deep-RL-flavoured success
                                  │
                                  ▼
   1999-2000     Policy gradient theorem (Sutton, McAllester,
                  Singh, Mansour); actor-critic methods
                                  │
                                  ▼
   2000s         RL stays a niche academic field;
                  most progress is in tabular or kernel
                  function-approximation settings
                                  │
                                  ▼
   2013-2015     DEEP RL INFLECTION: DQN (Mnih et al.)
                  plays Atari from pixels; experience replay
                  + target networks make deep value-based RL
                  stable
                                  │
                                  ▼
   2016          AlphaGo (Silver et al.) defeats Lee Sedol;
                  combines deep networks, MCTS, and self-play
                                  │
                                  ▼
   2017          AlphaZero generalizes the recipe (Chess, Shogi,
                  Go) with self-play + RL only - no human data;
                  TRPO (Schulman et al.) and PPO follow;
                  Christiano et al. propose RLHF for tasks where
                  reward functions are hard to specify
                                  │
                                  ▼
   2018-2020     PPO becomes the dominant modern policy-gradient
                  algorithm; SAC (Haarnoja et al.) and TD3
                  (Fujimoto et al.) become dominant for
                  continuous control; MuZero (Schrittwieser
                  et al. 2020) learns the model end-to-end;
                  Dreamer family (Hafner et al.) makes
                  latent-world-model RL practical
                                  │
                                  ▼
   2020-2022     OFFLINE-RL programme: CQL (Kumar et al.),
                  IQL (Kostrikov et al.), AWAC; the dataset-as-
                  environment paradigm matures.
                  InstructGPT (Ouyang et al. 2022) demonstrates
                  RLHF at scale on a deployed LM; ChatGPT
                  (late 2022) brings RLHF-trained LMs to
                  mainstream attention.
                                  │
                                  ▼
   2023          DPO (Rafailov et al.): closed-form preference
                  optimization; alternative to PPO-for-RLHF
                                  │
                                  ▼
   2024-2026     REASONING-RL ERA: o1 (OpenAI, late 2024)
                  introduces test-time RL for chain-of-thought;
                  o3 (early 2025), DeepSeek-R1 (early 2025)
                  with GRPO. RL becomes the dominant post-
                  training method for the most capable models.
                  Frontier RL is now done at the
                  Foundation-Models scale.

We develop each phase below.

Bellman and the dynamic-programming substrate

The mathematical foundation was laid by Richard Bellman in the late 1950s. His framework of dynamic programming (Bellman, 1957) provides the recursive equations for value functions in sequential decision-making problems that remain the conceptual core of RL. The Markov Decision Process formalism (covered in §3) - state space, action space, transition kernel, reward function - is essentially due to Bellman.

Bellman’s dynamic programming is model-based and exact: it requires the transition kernel to be known and computes optimal value functions by iterating the Bellman equation. For small finite state spaces this is computationally tractable; for the state spaces of practical interest (real-world robotics, board games, language) it is hopelessly intractable. The field’s progress from 1959 onward has been about handling the large-state-space case - first via function approximation (tabular → linear → neural), then via sampling-based methods that avoid sweeping the whole state space.

Sutton, temporal differences, and Q-learning

Through the 1970s and 1980s, Richard Sutton developed the framework that today is called temporal-difference (TD) learning. The key insight: rather than computing exact value functions by dynamic programming, estimate them from sample trajectories, and update the estimates incrementally as new trajectories are collected. This is the bridge between Monte Carlo evaluation (full trajectory rollouts) and dynamic programming (recursive value updates).

Sutton’s 1988 paper “Learning to Predict by the Methods of Temporal Differences” formalized TD learning. The crucial property: TD methods can learn from incomplete trajectories, can learn online, and converge to the correct value function under appropriate conditions.

Chris Watkins (1989) introduced Q-learning, the off-policy variant: learn the Q-function (state-action values) for the optimal policy while following a different (typically more exploratory) policy. Watkins and Dayan (1992) proved convergence of Q-learning to the optimal Q-values in the tabular setting under standard conditions. Q-learning became the canonical value-based RL algorithm and remains the conceptual ancestor of modern deep-RL methods like DQN.

TD-Gammon: the first deep-RL-flavoured success

Gerald Tesauro’s TD-Gammon (1992–1995) is worth noting as a historical milestone. Tesauro trained a neural network as a value function for backgammon using TD(λ\lambda); the network’s input was a hand-coded representation of the board state, and the training was entirely from self-play. By 1995, TD-Gammon was playing at world-champion level.

TD-Gammon was, in retrospect, deep RL avant la lettre: a (small) neural network as function approximator, TD learning, self-play. It did not catch on as a paradigm at the time partly because the field of neural networks was in a dormant period (1990s “AI winter”) and partly because the result was widely viewed as a backgammon-specific success rather than a general technique. The general technique was rediscovered with deep nets in 2013.

The policy-gradient family

A different theoretical line emerged in the late 1990s and 2000s. Value-based methods (Q-learning, SARSA) learn value functions and derive policies from them; policy-gradient methods parameterize the policy directly and update it by gradient ascent on expected reward. The policy-gradient theorem (Sutton, McAllester, Singh, Mansour, 1999) gave a clean expression for this gradient that could be estimated from sampled trajectories.

REINFORCE (Williams, 1992) was the simplest policy-gradient algorithm; actor-critic methods reduced its variance by learning a value function as a baseline. Through the 2000s these methods accumulated theoretical and algorithmic refinements (natural policy gradient, TRPO precursor work) but applications were limited.

The 2013–2015 deep-RL inflection

In 2013, Mnih et al. at DeepMind published “Playing Atari with Deep Reinforcement Learning”; a more polished version appeared in Nature in 2015. The algorithm - Deep Q-Network (DQN) - combined Q-learning with a convolutional neural network mapping raw pixels to Q-values. Two specific technical contributions made deep value-based RL stable for the first time: experience replay (sample minibatches from a large buffer of past transitions, breaking temporal correlation) and target networks (use a slowly-updated copy of the Q-network as the target for Bellman updates, preventing instability from chasing a moving target).

DQN reached human-level performance on a wide range of Atari games using a single architecture and hyperparameter setting. This was the inflection. After DQN, deep RL became a real research area: dozens of variants and refinements (Double DQN, Dueling DQN, prioritized experience replay, distributional RL, Rainbow) followed in the next two years.

AlphaGo and self-play

In March 2016, DeepMind’s AlphaGo defeated Lee Sedol, a top professional Go player, in a five-game match. The system combined a deep policy network (trained initially on human games, then refined by self-play), a deep value network, and Monte Carlo Tree Search. The defeat of a top human professional at Go - a game widely viewed as too complex for AI for decades to come - was a cultural moment as much as a technical one.

AlphaZero (2017) generalized the recipe to chess, shogi, and Go without any human game data: self-play from scratch with MCTS-guided RL training. AlphaZero defeated the strongest existing engines at all three games within hours to days of training. The self-play paradigm became one of the most-studied ideas in modern RL, eventually generalizing to MuZero (2020), which learned the model of the environment alongside the policy and value functions.

TRPO, PPO, and the modern policy-gradient regime

Parallel to the AlphaGo work, Schulman et al. at OpenAI developed two policy-gradient algorithms that became the dominant modern recipes. TRPO (Trust Region Policy Optimization, 2015) introduced a KL-divergence constraint between successive policies, preventing the catastrophic policy collapses that earlier policy-gradient methods suffered. PPO (Proximal Policy Optimization, 2017) simplified TRPO into a clipped-surrogate-objective form that is computationally cheaper and easier to implement; PPO has been, since roughly 2018, the single most-used policy-gradient algorithm in deep RL.

SAC (Soft Actor-Critic, Haarnoja et al., 2018) and TD3 (Twin Delayed DDPG, Fujimoto et al., 2018) became the dominant choices for continuous-action problems, replacing DDPG (Lillicrap et al., 2015) with more stable variants.

Model-based deep RL: Dreamer and MuZero

Through 2018–2020, a parallel line developed model-based deep RL - learn a model of the environment from data, then plan with the learned model. MuZero (Schrittwieser et al., 2020) learned a latent dynamics model end-to-end with the policy and value functions; this scaled to Atari, Go, chess, and shogi at AlphaZero-level performance without being given the rules of the games.

The Dreamer family (Hafner et al.) - Dreamer V1 (2019), V2 (2020), V3 (2023) - developed latent-world-model RL: encode observations into a latent space, learn dynamics in that latent space, train a policy via imagined rollouts. Dreamer V3 in 2023 demonstrated that a single hyperparameter setting could solve a wide range of tasks across game and continuous-control domains - a result analogous to DQN’s 2015 generalization across Atari games.

Offline RL

A different framing matured in 2020–2022. Offline RL asks: given a fixed dataset of past interactions (no further data collection), learn the best possible policy. This is the practical setting for robotics, healthcare, and any domain where deployment is expensive or risky.

The naive approach (apply standard RL to the dataset) fails because the algorithm tends to estimate Q-values for actions that are out of distribution in the dataset, where the model is unreliable. Conservative Q-Learning (CQL, Kumar et al., 2020), Implicit Q-Learning (IQL, Kostrikov et al., 2022), and AWAC were the canonical responses: each introduces some form of pessimism about out-of-distribution actions. The framework matured into a substantial subfield with its own benchmarks and applications.

The RLHF era

The single most consequential RL application of the 2020s. Christiano et al. (2017) “Deep Reinforcement Learning from Human Preferences” demonstrated that an RL agent could be trained from preference comparisons rather than scalar rewards: humans see pairs of trajectories, choose the better one, and a reward model is fitted to predict preferences. Standard RL (initially PPO) then trains a policy against the reward model.

The technique was applied to summarization (Stiennon et al., 2020), then to general instruction-following (Ouyang et al., 2022, “Training Language Models to Follow Instructions with Human Feedback” - the InstructGPT paper). InstructGPT was the direct predecessor of ChatGPT (deployed late 2022), which brought RLHF-trained language models into mainstream use.

By 2023, RLHF was the standard final stage of LLM training across the field - proprietary (GPT, Claude, Gemini) and open (Llama Chat, Mistral Instruct). The LM-as-RL-environment framing - where a prompt is a “state,” a completion is a “trajectory,” and a reward model scores the trajectory - was a non-obvious adaptation that turned out to be enormously productive.

Direct Preference Optimization (DPO, Rafailov et al., 2023) showed that the RLHF objective could be re-derived as a supervised loss on preference pairs, eliminating the explicit reward model and the PPO step. DPO is simpler, more stable, and computationally cheaper; it has been widely adopted as an RLHF alternative (LLM §5 develops both in detail).

Reasoning-RL: the 2024–2026 era

The most recent inflection. Starting with OpenAI’s o1 (announced late 2024), a new training recipe emerged: train the language model to think before answering, with the thinking trained by RL against verifiable rewards (e.g., does the final answer match the known-correct solution to a math problem?).

The technical innovation was not entirely novel - chain-of-thought prompting had been understood since 2022 (LLM §7); reward modelling against math correctness had been explored. What was new was the scale and the recipe. o1 demonstrated that an LM trained with RL on chains of reasoning (often at substantial inference-time cost - the model “thinks” for many tokens before producing an answer) was dramatically better at math, coding, and STEM reasoning than its pre-RL predecessor.

o3 (early 2025) and DeepSeek-R1 (early 2025) consolidated and extended the recipe. DeepSeek-R1 in particular published its method (using GRPO (Group Relative Policy Optimization), a simplification of PPO suited to large-scale RLHF/reasoning-RL training) and demonstrated near-frontier reasoning performance with an open-weights release. The reasoning-RL paradigm has, by 2026, become the dominant post-training method for frontier models.

Where this leaves us in 2026

The current state. The RL toolkit is large and mature:

  • Tabular and small-scale value-based RL: well-understood; rarely the practical method.

  • DQN-family deep value-based RL: dominant for discrete-action problems at moderate scale.

  • PPO-family policy gradients: the workhorse for most continuous-control and policy-learning tasks.

  • SAC/TD3: dominant for continuous-control benchmarks.

  • AlphaZero/MuZero/Dreamer: model-based recipes with strong sample efficiency in domains where they apply.

  • CQL/IQL: standard offline-RL methods.

  • PPO-for-RLHF / DPO / GRPO: the LM post-training stack.

  • Reasoning-RL: the 2024+ frontier paradigm.

The chapter develops each of these in §3§12, with §10 on RLHF as the connecting tissue between classical RL and the modern LM-centric application. The remaining sections (§13 connections, §14 open problems, §15 further reading, §16 exercises) close out.

Editorial note. The reasoning-RL paradigm (§12) is the most recent material and the most likely to be substantially extended or revised as the field evolves. The 2024–2026 period has seen rapid algorithmic and practical progress; readers should treat §12 as a current snapshot rather than a settled state.


§3. MDPs, Bellman Equations, and Dynamic Programming

A concrete grid-world to anchor the formalism

Before writing the formal definitions, a worked example. Consider a 4×34 \times 3 gridworld:

   GRIDWORLD (Russell-Norvig style)

   ┌───┬───┬───┬───┐
   │   │   │   │+1 │     terminal: reward +1
   ├───┼───┼───┼───┤
   │   │ ■ │   │-1 │     ■ = wall;  terminal: reward -1
   ├───┼───┼───┼───┤
   │ S │   │   │   │     S = start
   └───┴───┴───┴───┘

   Actions: UP, DOWN, LEFT, RIGHT
   Each non-terminal step incurs reward -0.04 (a small cost)
   Action transitions are noisy: 80% intended direction,
   10% perpendicular each side; if movement would hit a wall
   or boundary, the agent stays put.

The agent starts at SS in the bottom-left. At each step it chooses an action; the environment moves it to a new cell (with some probability of going perpendicular to the intended direction) and returns the reward of the resulting cell. The episode ends when the agent reaches one of the two terminal cells. The agent’s task is to learn a policy - a rule mapping current cell to action - that maximizes total reward.

The example is small enough to reason about by hand. The optimal policy is intuitive: head right and up toward +1+1, but route around the 1-1 cell (since perpendicular slips can dump you into it). The framework that makes “learn this policy” precise is the Markov Decision Process.

The Markov Decision Process (MDP)

An MDP is a tuple M=(S,A,P,r,γ)\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma) defined as follows.

  • S\mathcal{S}: the state space. A set of possible states the environment can be in. In the gridworld, S=11|\mathcal{S}| = 11 (12 cells minus 1 wall). For Atari from pixels, a state is an 84×84×484 \times 84 \times 4 stacked-frame tensor and S|\mathcal{S}| is astronomical. We assume states are fully observable - the agent sees the true state. Partially-observable extensions (POMDPs) live outside this section.

  • A\mathcal{A}: the action space. A set of actions available to the agent. Discrete (gridworld: {UP,DOWN,LEFT,RIGHT}\{UP, DOWN, LEFT, RIGHT\}) or continuous (ARd\mathcal{A} \subseteq \mathbb{R}^d for joint torques).

  • P(ss,a)P(s' \mid s, a): the transition kernel. The probability of landing in state ss' after taking action aa in state ss. The “Markov” property: PP depends only on (s,a)(s, a), not on the history that led to ss.

  • r(s,a)r(s, a): the reward function. The (expected) immediate reward for taking action aa in state ss. (Some formulations use r(s,a,s)r(s, a, s'), depending on the next state too. Equivalent for our purposes.)

  • γ[0,1)\gamma \in [0, 1): the discount factor. Weights future rewards: a reward kk steps in the future is worth γk\gamma^k of an immediate reward. γ\gamma near 1 is “far-sighted”; γ\gamma near 0 is “myopic.” The discount also serves a mathematical purpose: it makes the infinite-horizon return finite.

Given these objects, a policy π\pi is a mapping from states to actions (deterministic) or to distributions over actions (stochastic): π(as)\pi(a \mid s). The agent’s behaviour is fully specified by π\pi.

The return: what the agent is optimizing

The return from time tt is the discounted sum of future rewards:

Gt=rt+γrt+1+γ2rt+2+=k=0γkrt+k.G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}.

The agent’s goal is to find a policy π\pi that maximizes the expected return from each starting state. The expectation is taken over the randomness in transitions (PP) and in the policy (π\pi if stochastic).

Two value functions formalize “expected return”:

  • State-value function Vπ(s)V^\pi(s): the expected return starting from state ss and following policy π\pi thereafter:

    Vπ(s)=Eπ[Gtst=s].V^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s].

  • Action-value function Qπ(s,a)Q^\pi(s, a): the expected return starting from state ss, taking action aa first, and following π\pi thereafter:

    Qπ(s,a)=Eπ[Gtst=s,at=a].Q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a].

The QQ-function is more useful for learning a policy, because it tells the agent which action is best from each state. Given QπQ^\pi, the greedy policy is π(s)=argmaxaQπ(s,a)\pi'(s) = \arg\max_a Q^\pi(s, a).

The Bellman expectation equations

The defining recursive structure of value functions. The expected return from ss equals the expected immediate reward plus the discounted expected return from the next state:

Vπ(s)=aπ(as)[r(s,a)+γsP(ss,a)Vπ(s)].V^\pi(s) = \sum_a \pi(a \mid s) \left[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^\pi(s') \right].

For the action-value function:

Qπ(s,a)=r(s,a)+γsP(ss,a)aπ(as)Qπ(s,a).Q^\pi(s, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \sum_{a'} \pi(a' \mid s') Q^\pi(s', a').

Reading the first equation: the value of state ss under policy π\pi is, in expectation under the action distribution π(s)\pi(\cdot \mid s), the immediate reward plus the discounted expected value of the next state. The recursion bottoms out at terminal states (where Vπ=0V^\pi = 0 by convention).

These are the Bellman expectation equations. They are linear equations in VπV^\pi (or QπQ^\pi); for finite state spaces they have a unique solution that can be computed by direct linear-algebra methods or by iterative fixed-point methods. They are the substrate for everything that follows.

Optimality: the Bellman optimality equations

The optimal value function V(s)V^*(s) is the maximum over all policies:

V(s)=maxπVπ(s).V^*(s) = \max_\pi V^\pi(s).

Bellman’s foundational result: VV^* satisfies its own recursion, the Bellman optimality equation:

V(s)=maxa[r(s,a)+γsP(ss,a)V(s)].V^*(s) = \max_a \left[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^*(s') \right].

Equivalently for QQ^*:

Q(s,a)=r(s,a)+γsP(ss,a)maxaQ(s,a).Q^*(s, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^*(s', a').

Reading these: the optimal value of ss is the best (over actions) of immediate reward plus discounted optimal value of the next state.

The Bellman optimality equation is non-linear (because of the max), but it has a unique solution (in the infinite-horizon discounted case with γ<1\gamma < 1) and can be solved by iteration. Once QQ^* is known, the optimal policy is greedy with respect to QQ^*: π(s)=argmaxaQ(s,a)\pi^*(s) = \arg\max_a Q^*(s, a).

Value iteration

The algorithm. Initialize VV arbitrarily; iterate the Bellman optimality update:

ALGORITHM Value Iteration
INPUT: MDP (S, A, P, r, gamma); convergence threshold theta

1. Initialize V(s) = 0 for all s.
2. Repeat:
     delta = 0
     For each s in S:
       v_old = V(s)
       V(s) = max over a of [ r(s, a) +
                              gamma * sum over s' of P(s' | s, a) * V(s') ]
       delta = max(delta, |v_old - V(s)|)
   Until delta < theta.
3. Output: greedy policy pi(s) = argmax over a of [ r(s,a) +
                                  gamma * sum over s' of P(s' | s, a) * V(s') ]

The key fact: each sweep applies the Bellman optimality operator T:VTVT^*: V \mapsto T^*V where (TV)(s)=maxa[r(s,a)+γsP(ss,a)V(s)](T^*V)(s) = \max_a [r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V(s')]. This operator is a γ\gamma-contraction in the sup-norm - meaning TVTVγVV\|T^*V - T^*V'\|_\infty \leq \gamma \|V - V'\|_\infty - and so by the Banach fixed-point theorem, value iteration converges to the unique fixed point VV^* at a geometric rate O(γk)O(\gamma^k).

For the gridworld example, value iteration converges in a few dozen sweeps to within machine precision. The output is the optimal value function and the optimal policy.

Policy iteration

A complementary algorithm that often converges in fewer iterations. Alternate between two steps:

  1. Policy evaluation: compute VπV^{\pi} for the current policy π\pi (by solving the Bellman expectation equation for π\pi, or by iterative updates).

  2. Policy improvement: update π\pi greedily with respect to VπV^{\pi}.

ALGORITHM Policy Iteration
INPUT: MDP (S, A, P, r, gamma)

1. Initialize pi(s) arbitrarily for all s.
2. Repeat:
   2a. POLICY EVALUATION: solve V^pi from
       V(s) = sum_a pi(a|s) [r(s,a) + gamma sum_{s'} P(s'|s,a) V(s')]
       (linear system; or iterate to fixed point).
   2b. POLICY IMPROVEMENT:
       For each s in S:
         pi_new(s) = argmax_a [r(s,a) + gamma sum_{s'} P(s'|s,a) V(s')]
       If pi_new == pi: STOP. Return pi.
       Else: pi = pi_new.

Policy iteration converges in finitely many iterations for finite MDPs (each iteration strictly improves the policy or stops). It often takes far fewer outer iterations than value iteration, but each iteration is more expensive (the inner policy-evaluation step is a full linear system).

Why exact DP is rarely usable in modern RL

The crucial limitation. Value iteration and policy iteration require:

  • Knowing the MDP exactly. The agent must have access to PP and rr. In the gridworld these are specified; in real applications they are not. (Modern model-based RL learns approximations of PP and rr from data - covered in §8.)

  • Iterating over all states. Both algorithms sweep over S\mathcal{S} in each iteration. For the 4×34 \times 3 gridworld this is 11 cells. For Atari from pixels, S|\mathcal{S}| is comparable to 256848441067000256^{84 \cdot 84 \cdot 4} \approx 10^{67000}. A complete sweep is infeasible.

  • Storing VV or QQ as a table. Tabular representations are infeasible for large state spaces. Function approximation (linear or neural) is required - and once we use approximation, the contraction-mapping convergence guarantees no longer apply directly.

Modern RL handles each limitation:

  • The agent learns from samples of (s,a,r,s)(s, a, r, s') transitions rather than from the full PP and rr. This is the move from DP to temporal-difference methods (§4).

  • The agent updates value estimates only for visited states rather than sweeping the whole state space. This is what makes RL practical on large state spaces - and what creates the exploration problem (§11), since the agent must somehow visit informative states.

  • The agent uses function approximation (linear, neural) to represent VV or QQ. Deep RL (§6) is the case where the function approximator is a deep neural network.

The classical DP material is the conceptual foundation; the modern algorithms inherit the Bellman recursion as their target but estimate it from samples and approximations.

Connection to dynamic programming as a research tradition

Bellman’s dynamic programming was developed in operations research, separately from reinforcement learning as a learning paradigm. The two traditions converged in the late 1980s when Sutton, Barto, and others recognized that TD learning was, in effect, sample-based dynamic programming. The reinforcement-learning tradition emphasized learning from interaction; the DP tradition emphasized exact computation given a known model. The modern view: both are different vantages on the same Bellman-equation core.

This is what allows the rest of the chapter to talk about Q-learning, policy gradients, and PPO as variations on the Bellman theme - each is a particular way of approximating, sampling, or sidestepping the Bellman equations of this section.


§4. Temporal-Difference Learning and Q-Learning

The conceptual move: sample, don’t compute

Value iteration (§3) computes VV^* by sweeping the state space and applying the Bellman optimality operator. Two impossibilities for practical RL: we don’t know PP and rr in closed form, and we cannot sweep the full state space.

Temporal-difference learning replaces both with sampling. Rather than compute sP(ss,a)V(s)\sum_{s'} P(s' \mid s, a) V(s') (an expectation requiring knowledge of PP and a sum over all ss'), the agent takes an action in state ss, observes the resulting next state ss' and reward rr, and uses the single sample (s,a,r,s)(s, a, r, s') to update its value estimate. The expectation is replaced by Monte Carlo sampling; the sweep is replaced by visiting states the policy actually encounters.

This is the conceptual core of all model-free RL.

TD(0): the simplest TD algorithm

The TD(0) update for state-value learning. The agent has a current estimate V(s)V(s) of the value function. After observing a single transition (s,a,r,s)(s, a, r, s'), it updates:

V(s)V(s)+α[r+γV(s)V(s)],V(s) \leftarrow V(s) + \alpha \big[\, r + \gamma V(s') - V(s) \,\big],

where α(0,1]\alpha \in (0, 1] is the learning rate (or step size).

Reading this. The bracket δ=r+γV(s)V(s)\delta = r + \gamma V(s') - V(s) is the TD error - the difference between the bootstrapped target r+γV(s)r + \gamma V(s') (immediate reward plus discounted current estimate of next-state value) and the current estimate V(s)V(s). If the target is higher than the estimate, V(s)V(s) is updated upward; if lower, downward. The factor α\alpha controls how much we trust this single sample.

The target r+γV(s)r + \gamma V(s') is bootstrapped: it uses the current estimate V(s)V(s') rather than waiting for the actual return GtG_t from the trajectory. This is the difference from Monte Carlo evaluation (which uses the actual return). Bootstrapping has lower variance (one step of randomness rather than the whole trajectory) at the cost of bias (the bootstrap target is biased while VV is being learned).

Pseudocode for tabular TD(0)

ALGORITHM TD(0) for State-Value Estimation
INPUT: policy pi, learning rate alpha, discount gamma

1. Initialize V(s) = 0 for all s in S.
2. For each episode:
     s = initial state
     While s is not terminal:
       a = action selected by pi(. | s)
       Take action a; observe reward r and next state s'
       V(s) = V(s) + alpha * [ r + gamma * V(s') - V(s) ]
       s = s'
3. After many episodes, V approximates V^pi.

Three properties to note. First, the algorithm uses no model - only the actually-observed (s,a,r,s)(s, a, r, s') transitions. Second, it updates only the states the agent actually visits - large unvisited regions of S\mathcal{S} remain untouched. Third, it can run online, updating after every step, without storing entire trajectories.

Convergence: tabular TD(0) converges to VπV^\pi with probability 1 under the Robbins-Monro conditions on α\alpha (sum diverges, sum-of-squares converges) and assuming all states are visited infinitely often. The proof uses stochastic approximation theory.

TD(λ\lambda) and eligibility traces

A generalization parameterized by λ[0,1]\lambda \in [0, 1]. TD(0) updates only the most recently visited state; TD(λ\lambda) updates a trace of recently visited states, with weight decaying as λ\lambda. The mechanism uses an eligibility trace e(s)e(s) for each state, incremented when ss is visited and decayed by γλ\gamma\lambda each step.

The TD(λ\lambda) update is then:

δ=r+γV(s)V(s),V(s)V(s)+αδe(s)for all s.\delta = r + \gamma V(s') - V(s), \qquad V(s'') \leftarrow V(s'') + \alpha \delta \cdot e(s'') \quad \text{for all } s''.

Two extreme cases. λ=0\lambda = 0 recovers TD(0) (only the current state is updated). λ=1\lambda = 1 recovers (with appropriate trace) Monte Carlo updating (the entire trajectory is updated against the actual return). Intermediate λ\lambda interpolates between bias (TD(0)) and variance (Monte Carlo).

In modern deep RL, eligibility traces are often replaced by n-step returns - explicit nn-step bootstraps - which serve a similar purpose without the eligibility-trace bookkeeping. The conceptual lesson (interpolating between bootstrapped one-step targets and full Monte Carlo returns) carries over.

Action-value learning: SARSA and Q-learning

For control (learning a policy, not just evaluating one), we need to learn the action-value Qπ(s,a)Q^\pi(s, a) rather than just Vπ(s)V^\pi(s). Two algorithms differ in what they target.

Q(s,a)Q(s,a)+α[r+γQ(s,a)Q(s,a)],Q(s, a) \leftarrow Q(s, a) + \alpha \big[\, r + \gamma Q(s', a') - Q(s, a) \,\big],

where aa' is the action the agent actually takes in ss' under the current policy. SARSA learns QπQ^\pi for the policy π\pi that is being followed (which may be exploratory, e.g., ϵ\epsilon-greedy with respect to the current QQ).

Q-learning (Watkins, 1989). The off-policy update:

Q(s,a)Q(s,a)+α[r+γmaxaQ(s,a)Q(s,a)].Q(s, a) \leftarrow Q(s, a) + \alpha \big[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\big].

The crucial difference: the target uses maxaQ(s,a)\max_{a'} Q(s', a') - the value of the best action in ss' according to the current QQ - not the value of the action the agent actually takes next. Q-learning therefore learns the value of the greedy policy with respect to QQ, regardless of what policy is being followed for data collection.

Pseudocode for tabular Q-learning

ALGORITHM Q-Learning (tabular)
INPUT: learning rate alpha, discount gamma, exploration epsilon

1. Initialize Q(s, a) = 0 for all (s, a).
2. For each episode:
     s = initial state
     While s is not terminal:
       # Epsilon-greedy action selection
       With probability epsilon: a = random action
       Otherwise: a = argmax over a' of Q(s, a')

       Take action a; observe reward r and next state s'

       # Off-policy Q-learning update
       Q(s, a) = Q(s, a) + alpha * [ r + gamma * max_{a'} Q(s', a') - Q(s, a) ]

       s = s'
3. Output: greedy policy pi(s) = argmax_a Q(s, a)

The off-policy property is what makes Q-learning so widely useful: the agent can collect data with one policy (a highly exploratory ϵ\epsilon-greedy one) and learn the value function of a different policy (the greedy one). This decoupling of behaviour and target is the foundation of essentially all modern value-based RL, including DQN (§6).

A worked Q-learning step on the gridworld

To make the abstraction concrete. Take the gridworld of §3. Suppose the agent is in cell (1,1)(1, 1) (one right of SS). Current QQ-values are all 00. Initialize α=0.5\alpha = 0.5, γ=0.9\gamma = 0.9.

The agent (with ϵ\epsilon-greedy at ϵ=1\epsilon = 1, i.e., random) selects action RIGHT. With probability 0.8 it moves to (2,1)(2, 1); suppose this happens. Reward is 0.04-0.04 (the standard step cost).

The Q-learning update:

  • s=(1,1),a=RIGHT,r=0.04,s=(2,1)s = (1, 1), a = \text{RIGHT}, r = -0.04, s' = (2, 1).

  • maxaQ(s,a)=max(0,0,0,0)=0\max_{a'} Q(s', a') = \max(0, 0, 0, 0) = 0 (all values still zero).

  • TD target: r+γmax=0.04+0.90=0.04r + \gamma \max = -0.04 + 0.9 \cdot 0 = -0.04.

  • TD error: δ=0.040=0.04\delta = -0.04 - 0 = -0.04.

  • Updated Q((1,1),RIGHT)=0+0.5(0.04)=0.02Q((1,1), \text{RIGHT}) = 0 + 0.5 \cdot (-0.04) = -0.02.

After thousands of such updates, the QQ-values propagate backward from the terminal +1+1 and 1-1 cells: states near the +1+1 accumulate positive value, states near the 1-1 accumulate negative value, and the greedy policy with respect to the converged QQ matches the optimal policy of value iteration.

Convergence of tabular Q-learning

The classical theoretical result. Watkins and Dayan (1992): tabular Q-learning converges with probability 1 to the optimal QQ^*, provided:

  1. All state-action pairs are visited infinitely often.

  2. The learning rate satisfies the Robbins-Monro conditions: tαt=\sum_t \alpha_t = \infty and tαt2<\sum_t \alpha_t^2 < \infty.

  3. Rewards are bounded.

The first condition is not automatic - it requires sufficient exploration. The second is satisfied by, e.g., αt=1/t\alpha_t = 1/t. Under these conditions, no matter what behaviour policy is used (as long as it explores enough), Q-learning converges to the values of the optimal policy.

This is a strong and clean theoretical result. It is also strictly tabular: once function approximation enters, the convergence guarantees largely break down. Tsitsiklis and Van Roy (1997) showed that off-policy TD with linear function approximation can diverge - the so-called deadly triad of bootstrapping, off-policy learning, and function approximation. Modern deep value-based RL (DQN, §6) addresses the deadly triad with experience replay and target networks; the convergence is empirical rather than provable.

SARSA vs Q-learning: the cliff-walking example

A pedagogical contrast worth knowing. Sutton and Barto’s “cliff walking” gridworld: an agent walks along the edge of a cliff; falling off costs large reward and resets to the start. Two algorithms with ϵ\epsilon-greedy exploration:

  • SARSA learns the value of the ϵ\epsilon-greedy policy. Because the policy occasionally takes random actions (the ϵ\epsilon probability), and a random action near the cliff falls off with non-trivial probability, the SARSA value function takes the cliff-fall risk into account. SARSA learns to walk farther from the cliff - a “safer” but longer route.

  • Q-learning learns the value of the greedy policy. The greedy policy never falls off the cliff (it always picks the safe action). Q-learning therefore learns to walk along the cliff edge - the optimal greedy route, even though the agent currently following ϵ\epsilon-greedy will sometimes fall off during training.

Both algorithms converge to the right thing for what they target. The lesson: the choice of on-policy vs off-policy matters when the behaviour policy and the target policy differ in safety characteristics. Modern deep RL is overwhelmingly off-policy (Q-learning, DQN, DDPG, SAC) for sample-efficiency reasons, with the trade-off that experience-replay and target-network machinery is needed to manage the deadly triad.

Where TD methods sit in the modern algorithm space

TD(0), Q-learning, and SARSA are the conceptual ancestors of every modern value-based deep RL algorithm. The chain:

  • Q-learning + neural function approximation + experience replay + target networks = DQN (§6).

  • Q-learning + double-DQN target + dueling architecture + prioritized replay + distributional RL + multi-step returns + noisy nets = Rainbow (Hessel et al., 2018).

  • SARSA-flavoured, on-policy, with policy parameterization = A3C / A2C / PPO family (§5§6).

  • Off-policy actor-critic with deterministic policy = DDPG / TD3 / SAC (§7).

All of them inherit from Q-learning the bootstrapped one-step (or n-step) Bellman target. The TD substrate developed in this section is the foundation for everything algorithmic in §5§12.


§5. Policy Gradient and Actor-Critic

Why policy gradients exist

Q-learning (§4) is value-based: learn QQ^*, then act greedily. This works well when the action space is small enough that argmaxaQ(s,a)\arg\max_a Q(s, a) is computable. It works less well when:

  • The action space is large or continuous (computing argmax\arg\max over Rd\mathbb{R}^d is itself a hard optimization).

  • The optimal policy is stochastic (rock-paper-scissors against a smart opponent has no deterministic optimal policy; greedy-with-respect-to-QQ cannot represent stochastic optima).

  • The policy must satisfy structural constraints (parameterized by a particular network architecture, biased toward certain action distributions, etc.).

Policy-gradient methods parameterize the policy directly as πθ(as)\pi_\theta(a \mid s) - a function with parameters θ\theta (a neural network in modern practice) - and update θ\theta by gradient ascent on the expected return. Value functions appear as auxiliary quantities for variance reduction, not as the primary representation. This is the second major family of RL algorithms, and it is the family that scales most successfully into the deep-RL and LLM-RLHF regimes.

The objective

The agent’s objective is the expected return under its policy. Define the trajectory τ=(s0,a0,r0,s1,a1,r1,)\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots) generated by following πθ\pi_\theta from a starting-state distribution ρ0\rho_0. The expected return is

J(θ)=Eτπθ ⁣[t=0γtr(st,at)]=Eτπθ[G0],J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right] = \mathbb{E}_{\tau \sim \pi_\theta}[G_0],

with G0G_0 the return of the trajectory. The agent wants to find θ=argmaxθJ(θ)\theta^* = \arg\max_\theta J(\theta).

The challenge: how do we compute the gradient θJ(θ)\nabla_\theta J(\theta)? The trajectory distribution πθ\pi_\theta depends on θ\theta, and we cannot differentiate through the environment dynamics (PP is unknown and not differentiable in θ\theta anyway). The policy-gradient theorem gives a way around this.

The policy-gradient theorem

The central result. Sutton, McAllester, Singh, Mansour (1999) (with antecedents in Williams 1992 and earlier): the gradient of J(θ)J(\theta) admits a clean expression as an expectation over trajectories that the agent can sample:

θJ(θ)=Eτπθ ⁣[t=0θlogπθ(atst)Gt].\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t \right].

Reading this. For each step tt in a trajectory, we take the log-probability gradient θlogπθ(atst)\nabla_\theta \log \pi_\theta(a_t \mid s_t) - the direction in θ\theta-space that increases the probability of the action the agent took - and weight it by GtG_t, the return earned from time tt onward. Summed over the trajectory and averaged over many trajectories, this is the gradient of expected return.

The intuition. Increase the probability of action sequences that produced high returns; decrease the probability of those that produced low returns. This is essentially the same intuition as supervised learning’s gradient - but using the return as the reward signal rather than a labelled target.

The proof (sketch). The clever step is the log-derivative trick: θπθ(as)=πθ(as)θlogπθ(as)\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s) \cdot \nabla_\theta \log \pi_\theta(a \mid s). This converts a gradient of a probability into an expectation over the distribution itself, eliminating the need to differentiate through the unknown environment dynamics. The full derivation rolls this through the trajectory distribution and reduces to the formula above.

REINFORCE: the simplest policy-gradient algorithm

Williams (1992) “Simple statistical gradient-following algorithms for connectionist reinforcement learning” introduced REINFORCE:

ALGORITHM REINFORCE
INPUT: parameterized policy pi_theta, learning rate alpha,
       discount gamma

1. Initialize theta randomly.
2. For each episode:
     # Roll out a trajectory under the current policy.
     Sample trajectory tau = (s_0, a_0, r_0, s_1, a_1, r_1, ...)
     by following pi_theta until termination.

     # Compute returns at each step.
     For t = T-1, T-2, ..., 0:
       G_t = r_t + gamma * G_{t+1}    (with G_T = 0)

     # Update theta with the policy-gradient estimate.
     g = sum over t of [ grad_theta log pi_theta(a_t | s_t) * G_t ]
     theta = theta + alpha * g

Three properties. First, REINFORCE is on-policy: trajectories must be generated by the current πθ\pi_\theta; once θ\theta is updated, old trajectories are no longer valid. Second, it is unbiased: the gradient estimate is correct in expectation. Third, it is high variance: the return GtG_t is the sum of many random rewards, and a single trajectory’s GtG_t can be far from its expectation.

The high variance is REINFORCE’s central limitation. With high variance, learning is slow and unstable; many trajectories are needed for each gradient update. Most refinements of policy-gradient methods are about reducing this variance.

Variance reduction via baselines

A baseline b(s)b(s) subtracted from the return does not change the gradient’s expectation but can dramatically reduce its variance:

θJ(θ)=Eτπθ ⁣[tθlogπθ(atst)(Gtb(st))].\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot (G_t - b(s_t)) \right].

The intuition. Subtracting b(st)b(s_t) rescales the magnitude of the gradient signal: actions with returns above b(st)b(s_t) are reinforced, actions with returns below b(st)b(s_t) are suppressed. The signal is no longer “did this trajectory earn 47?” (a noisy absolute number) but “did this action do better or worse than the baseline?” (a much less noisy relative quantity).

The optimal baseline (variance-minimizing) is roughly b(s)=Vπ(s)b(s) = V^\pi(s) - the expected return from ss under the current policy. Substituting:

GtVπ(st)Aπ(st,at):=Qπ(st,at)Vπ(st),G_t - V^\pi(s_t) \approx A^\pi(s_t, a_t) := Q^\pi(s_t, a_t) - V^\pi(s_t),

the advantage function. Reading this: Aπ(s,a)A^\pi(s, a) measures how much better taking action aa in state ss is than the average action under π\pi. Policy-gradient with the advantage as the weighting signal is the cleanest variance-reduced form.

Actor-critic: learning a value function as the critic

The actor-critic family takes the variance-reduction idea seriously. Two networks:

  • The actor πθ(as)\pi_\theta(a \mid s) - the policy being learned.

  • The critic Vϕ(s)V_\phi(s) (or Qϕ(s,a)Q_\phi(s, a)) - a value function being learned alongside, used as the baseline.

The critic is learned by TD updates (§4); the actor is updated by policy gradient with the critic-derived advantage. The simplest one-step actor-critic uses the TD-error advantage estimate:

A^t=rt+γVϕ(st+1)Vϕ(st).\hat{A}_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t).

The actor update is then

θθ+αθlogπθ(atst)A^t.\theta \leftarrow \theta + \alpha \cdot \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \hat{A}_t.

This is the structural form behind essentially every modern policy-gradient algorithm. Variations differ in:

  • How the advantage is estimated (one-step TD, n-step, Generalized Advantage Estimation = GAE, Monte Carlo).

  • How conservative the policy update is (vanilla, natural gradient, trust region, clipped).

  • How parallel the data collection is (single environment, multi-environment A2C/A3C).

Generalized Advantage Estimation (GAE)

A practically important refinement. Schulman, Moritz, Levine, Jordan, Abbeel (2016) introduced Generalized Advantage Estimation, an interpolation between one-step TD advantage (low variance, high bias) and Monte Carlo advantage (high variance, no bias) parameterized by λ[0,1]\lambda \in [0, 1]:

A^tGAE(γ,λ)=l=0(γλ)lδt+l,δt=rt+γVϕ(st+1)Vϕ(st).\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}, \qquad \delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t).

GAE has become the standard advantage estimator for actor-critic methods (PPO uses it; SAC variants use it; the LLM-RLHF stack uses it). Typical values are γ=0.99\gamma = 0.99 and λ[0.9,0.97]\lambda \in [0.9, 0.97].

The natural policy gradient and trust regions

A subtle issue with vanilla policy gradient: the step size in θ\theta-space does not correspond to a meaningful change in the policy itself. Two parameterizations of the same policy distribution can require very different θ\theta-steps to produce the same change in behaviour.

The natural policy gradient (Kakade, 2001) addresses this by replacing the gradient with the natural gradient, scaled by the inverse Fisher information matrix. The natural gradient updates θ\theta in the direction that maximizes return per unit change in policy, where policy distance is measured by KL divergence.

Trust Region Policy Optimization (TRPO, Schulman et al., 2015) further constrained this idea: each update step is constrained so that the KL divergence between successive policies is bounded by a small δ\delta. This prevents catastrophic policy collapses (an aggressive update that drops the policy into a region where it performs much worse than before) that vanilla and natural policy gradient occasionally suffer.

TRPO works but is computationally expensive (each update requires conjugate-gradient solves of the Fisher matrix). PPO (Schulman et al., 2017) simplifies TRPO into a clipped-surrogate-objective form that captures most of TRPO’s stability gains at a fraction of the computational cost. PPO is developed in §6 and is the dominant modern policy-gradient algorithm.

The transition to modern deep RL

Tabular policy-gradient methods are rare in modern practice. The transition to deep RL replaced:

  • The actor with a deep network (the policy is a softmax over actions for discrete control, a Gaussian or beta distribution for continuous control).

  • The critic with a deep network (the value function is the output of a deep network on state representations).

  • Single-environment data collection with massively parallel rollouts (multiple environment workers feeding a single learner).

The architectural and infrastructure work that produced this transition is the content of §6 (DQN through PPO) and §7 (continuous control). The conceptual content of policy gradients - the policy-gradient theorem, baselines, advantages, actor-critic structure - survives essentially unchanged. The deep-RL revolution is, at the algorithmic level, mostly an instantiation of the policy-gradient theorem at scale.


§6. Deep RL: From DQN to PPO

What “deep RL” added to RL

The combination of deep neural networks with reinforcement learning is what made RL practical for high-dimensional observations (pixels, continuous control, language). The conceptual contribution of deep RL is not a new algorithmic framework - Q-learning, policy gradients, and actor-critic existed before - but the engineering work that made these algorithms stable when value functions and policies are deep networks.

The two transformative results:

  1. DQN (Mnih et al., 2013, 2015) for value-based deep RL.

  2. PPO (Schulman et al., 2017) for policy-gradient deep RL.

We develop both mechanistically. They define the algorithmic vocabulary that essentially all modern deep RL inherits.

DQN: the algorithmic recipe

The setup. Train a Q-function Qθ(s,a)Q_\theta(s, a) parameterized by a deep network. The classic DQN architecture for Atari: a CNN takes a stack of four 84×8484 \times 84 frames and outputs Q-values for each of the 18 possible joystick actions.

The naive Q-learning loss with function approximation:

L(θ)=E(s,a,r,s)B ⁣[(r+γmaxaQθ(s,a)Qθ(s,a))2],\mathcal{L}(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{B}}\!\left[ \big(\, r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a) \,\big)^2 \right],

where the expectation is over transitions sampled from the agent’s experience. This naive form is famously unstable (the deadly triad of §4). DQN’s two engineering innovations make it work.

Innovation 1: Experience replay. Instead of using each transition once and discarding it, store all transitions in a large replay buffer B\mathcal{B} (typically 10610^6 entries). Each gradient update samples a random minibatch from B\mathcal{B}. This decorrelates the training data, giving each gradient step the i.i.d.-like properties that supervised learning requires for stability. It also lets each transition contribute to many updates, improving sample efficiency.

Innovation 2: Target networks. The naive Q-learning target r+γmaxaQθ(s,a)r + \gamma \max_{a'} Q_\theta(s', a') uses the same parameters θ\theta as the network being updated. As θ\theta changes, the target moves, creating a feedback loop that destabilizes training. DQN uses a target network with frozen parameters θ\theta^- - a periodic snapshot of θ\theta - for the target:

y=r+γmaxaQθ(s,a).y = r + \gamma \max_{a'} Q_{\theta^-}(s', a').

The target network is updated to match θ\theta every CC steps (typically C=10000C = 10000). This decouples the moving target from the moving estimator and stabilizes training.

Pseudocode for DQN

ALGORITHM DQN (Mnih et al., 2015)
INPUT: environment, Q-network Q_theta, target network Q_theta-,
       replay buffer B (capacity N), discount gamma,
       exploration epsilon, target-update period C

1. Initialize theta randomly; theta- = theta.
2. Initialize empty replay buffer B.

3. For each step t = 1, 2, ...:
   3a. # Action selection: epsilon-greedy on Q_theta
       With probability epsilon: a = random action
       Otherwise: a = argmax_a Q_theta(s, a)

   3b. # Environment step
       Take action a; observe r, s'

   3c. # Store transition
       Add (s, a, r, s') to B (evicting oldest if full)

   3d. # Sample minibatch and update Q_theta
       Sample random minibatch {(s_i, a_i, r_i, s'_i)} from B
       For each i:
         If s'_i is terminal: y_i = r_i
         Else: y_i = r_i + gamma * max_{a'} Q_theta-(s'_i, a')
       L = (1/|batch|) sum_i (y_i - Q_theta(s_i, a_i))^2
       theta = theta - lr * grad_theta L

   3e. # Periodic target-network update
       If t mod C == 0: theta- = theta

   3f. s = s'

The 2015 Nature DQN paper applied this single architecture and hyperparameter set to 49 Atari games and reached human-level play on most of them - the empirical demonstration that made deep RL a major research area.

Refinements: Double DQN, Dueling DQN, Prioritized Replay, Rainbow

DQN was followed by a flood of refinements through 2016–2017. Several proved durable:

  • Double DQN (van Hasselt, Guez, Silver, 2016). The Q-learning target maxaQθ(s,a)\max_{a'} Q_{\theta^-}(s', a') is biased - using the same network for action selection and value estimation creates an over-estimation bias. Double DQN decouples the two: use QθQ_\theta to choose the maximizing action, QθQ_{\theta^-} to evaluate it: y=r+γQθ(s,argmaxaQθ(s,a))y = r + \gamma Q_{\theta^-}(s', \arg\max_{a'} Q_\theta(s', a')). Reduces overestimation; small but consistent improvement.

  • Dueling DQN (Wang et al., 2016). Architectural change: split the network into two streams that compute V(s)V(s) and A(s,a)A(s, a) separately, combined as Q(s,a)=V(s)+A(s,a)1AaA(s,a)Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'). Improves performance when many actions have similar values.

  • Prioritized Experience Replay (Schaul et al., 2016). Sample transitions from B\mathcal{B} with probability proportional to their TD error rather than uniformly. Transitions that the model predicts poorly get more weight, accelerating learning. Practically standard in modern DQN-family work.

  • Distributional RL / C51 (Bellemare, Dabney, Munos, 2017). Instead of estimating the expected Q-value, estimate the distribution of returns. The Bellman update becomes a distributional update; modelling the full distribution (rather than just its mean) gives better representations.

  • Rainbow (Hessel et al., 2018) combined six refinements (Double, Dueling, Prioritized, multi-step returns, distributional, noisy nets). On Atari, Rainbow substantially outperformed DQN and was the SOTA for value-based deep RL on this benchmark for years.

PPO: the algorithmic recipe

The dominant modern policy-gradient algorithm. PPO (Schulman, Wolski, Dhariwal, Radford, Klimov, 2017) is the algorithm behind, among many others, OpenAI Five (Dota 2), the GPT InstructGPT RLHF training, and most modern RLHF/reasoning-RL implementations.

The starting point. The TRPO objective constrains policy updates by a KL constraint:

θnew=argmaxθE(s,a)πθold ⁣[πθ(as)πθold(as)A^πθold(s,a)]s.t.E[KL(πθoldπθ)]δ.\theta_{\text{new}} = \arg\max_\theta \mathbb{E}_{(s,a) \sim \pi_{\theta_{\text{old}}}}\!\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} \hat{A}^{\pi_{\theta_{\text{old}}}}(s, a) \right] \quad \text{s.t.} \quad \mathbb{E}[\text{KL}(\pi_{\theta_{\text{old}}} \| \pi_\theta)] \leq \delta.

The expectation is over states/actions from the old policy; the importance-sampling ratio πθ(as)/πθold(as)\pi_\theta(a|s) / \pi_{\theta_{\text{old}}}(a|s) corrects for the distribution mismatch.

PPO’s contribution: replace the explicit KL constraint with a clipping mechanism on the importance-sampling ratio. Define

rt(θ)=πθ(atst)πθold(atst).r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}.

The PPO clipped objective is

LCLIP(θ)=Et ⁣[min(rt(θ)A^t,  clip(rt(θ),1ϵ,1+ϵ)A^t)],\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\big(\, r_t(\theta) \hat{A}_t,\; \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \,\big) \right],

where ϵ\epsilon is a clipping parameter (typically 0.10.1 or 0.20.2) and A^t\hat{A}_t is a GAE-estimated advantage.

Reading the clipped objective. When rt(θ)r_t(\theta) is within [1ϵ,1+ϵ][1 - \epsilon, 1 + \epsilon] (the policy hasn’t changed much), the objective is the standard surrogate. When rt(θ)r_t(\theta) exceeds 1+ϵ1 + \epsilon on a positive-advantage action, the gradient is clipped - preventing a too-aggressive increase in the action’s probability. Symmetrically for negative-advantage actions and rt<1ϵr_t < 1 - \epsilon. The min ensures the gradient is always conservative (we take whichever side caps the update).

The full PPO objective often includes a value-function loss and an entropy bonus:

LPPO(θ,ϕ)=Et ⁣[LtCLIPc1(Vϕ(st)Vttarget)2+c2H[πθ(st)]],\mathcal{L}^{\text{PPO}}(\theta, \phi) = \mathbb{E}_t\!\left[ \mathcal{L}^{\text{CLIP}}_t - c_1 (V_\phi(s_t) - V_t^{\text{target}})^2 + c_2 \mathcal{H}[\pi_\theta(\cdot | s_t)] \right],

with c1,c2c_1, c_2 as coefficients, VttargetV_t^{\text{target}} a TD-style or n-step return target, and H\mathcal{H} the entropy of the policy at sts_t. The entropy term encourages exploration by penalizing low-entropy (deterministic) policies.

Pseudocode for PPO

ALGORITHM PPO (Schulman et al., 2017)
INPUT: policy pi_theta, value function V_phi, GAE parameters (gamma, lambda),
       clipping parameter epsilon, learning rate, K epochs,
       minibatch size, parallel actors N

1. Initialize theta, phi.
2. For each iteration:

   2a. # Parallel rollout
       For each of N parallel actors:
         Roll out trajectory of T steps under pi_theta_old = pi_theta.
         Store transitions (s_t, a_t, r_t, log pi_theta(a_t|s_t)).

   2b. # Compute advantages and returns via GAE
       For each trajectory:
         delta_t = r_t + gamma * V_phi(s_{t+1}) - V_phi(s_t)
         A_hat_t = sum_{l=0..} (gamma * lambda)^l * delta_{t+l}
         R_t = A_hat_t + V_phi(s_t)   (return target)

   2c. # K epochs of minibatch updates
       For epoch = 1..K:
         For each minibatch:
           # Importance ratio (theta_old is fixed during these epochs)
           r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)

           # Clipped policy loss
           L_clip = -mean over batch of [
               min( r_t(theta) * A_hat_t,
                    clip(r_t(theta), 1-epsilon, 1+epsilon) * A_hat_t )
           ]

           # Value-function loss
           L_v = mean over batch of (V_phi(s_t) - R_t)^2

           # Entropy bonus
           L_ent = -mean over batch of H[pi_theta(. | s_t)]

           # Combined loss
           L = L_clip + c_1 * L_v - c_2 * L_ent
           theta, phi = AdamStep(theta, phi, L)

The structure is striking. PPO is essentially supervised learning on a clipped surrogate objective with K epochs of minibatch SGD per data-collection iteration. The infrastructure (parallel actors, GAE for advantages, replay-style minibatching) is what makes it scale. The simplicity and stability of PPO compared to TRPO are why it has become dominant despite TRPO’s stronger theoretical motivation.

Why PPO became dominant

Several practical reasons:

  • Stability across hyperparameters. PPO is forgiving - works on a wide range of tasks with minor hyperparameter tweaks. TRPO’s hyperparameters are more delicate.

  • Simplicity of implementation. The clipped objective is a one-line modification of the standard surrogate; the rest is standard SGD. TRPO requires conjugate-gradient solves and Fisher-vector products.

  • Compatibility with parallel rollout infrastructure. PPO’s design assumes many parallel actors collecting data, which is the natural deployment model on modern GPU-equipped clusters.

  • Empirical performance. PPO matches or exceeds TRPO across continuous-control and game benchmarks.

By 2019 PPO was the default choice for new policy-gradient applications. By 2022 it was the algorithm behind InstructGPT’s RLHF training (LLM §5). By 2026 it remains the central modern policy-gradient algorithm, with its variants (GRPO for reasoning-RL, Reinforce++) used at the frontier. The rest of this chapter - continuous control (§7), RLHF (§10), reasoning-RL (§12) - relies on the PPO substrate developed here.

The modern algorithmic stack

For practitioners, the canonical 2026 deep-RL stack is:

  • Discrete-action problems (small to moderate scale): PPO or DQN-family (Rainbow, IQN). PPO is more common in modern open-source implementations.

  • Continuous control (robotics, simulated control): SAC or TD3 dominant; PPO common as a baseline (§7).

  • Game-playing with self-play (Go, chess, complex multiplayer): AlphaZero/MuZero family (§8).

  • Long-horizon, sample-efficient (model-based) settings: Dreamer V3 family (§8).

  • Offline RL from fixed datasets: CQL, IQL, AWAC (§9).

  • LM post-training (RLHF/DPO/GRPO): PPO-for-RLHF, DPO, or GRPO (§10).

  • Reasoning-RL: PPO/GRPO with verifier-based reward (§12).

PPO threads through most of the categories; DQN and SAC dominate specific niches; AlphaZero/MuZero/Dreamer occupy specific structural regimes. The remaining sections develop each in mechanistic detail.


§7. Continuous Control and Off-Policy Deep RL

Why continuous control is its own regime

The algorithms of §6 (DQN family, PPO) work on discrete action spaces - Atari joystick directions, chess moves, language tokens. Continuous control is the setting where actions live in Rd\mathbb{R}^d: joint torques for a robot arm, steering and throttle for a car, control inputs for a quadrotor. The dimension dd is typically 4–30 for robotics tasks, occasionally higher.

Continuous action spaces make Q-learning awkward. The greedy step argmaxaQ(s,a)\arg\max_a Q(s, a) is itself a continuous optimization - there is no longer a finite set of actions to take the max over. Three solutions have proven workable, each instantiated by a major algorithm family:

  1. DDPG / TD3. Parameterize a deterministic policy μθ:SA\mu_\theta: \mathcal{S} \to \mathcal{A}; learn it by gradient ascent on the critic’s Q-value. This replaces the argmax\arg\max with a parameterized actor whose update direction is given by the critic.

  2. SAC. Parameterize a stochastic policy (typically Gaussian), maximize expected return plus an entropy bonus. The entropy term keeps the policy exploratory and is theoretically motivated by maximum-entropy RL.

  3. PPO. Use a stochastic policy and the on-policy clipped objective from §6. This works for continuous control too; PPO is a common baseline. It is less sample-efficient than the off-policy alternatives but is simpler and more robust.

This section develops DDPG, TD3, and SAC - the canonical off-policy continuous-control algorithms.

DDPG: the deterministic-policy-gradient recipe

Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015) was the first deep continuous-control algorithm that scaled. It instantiates the deterministic policy gradient theorem (Silver et al., 2014): for a deterministic policy μθ(s)\mu_\theta(s), the policy gradient of expected return is

θJ(θ)=Esρμ ⁣[θμθ(s)aQμ(s,a)a=μθ(s)].\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\mu}\!\left[ \nabla_\theta \mu_\theta(s) \cdot \nabla_a Q^\mu(s, a) \big|_{a = \mu_\theta(s)} \right].

Reading this. The gradient of expected return with respect to θ\theta is the chain rule: how the policy changes with θ\theta, multiplied by how Q-value changes with the action. The critic QμQ^\mu gives the “direction in action space” the actor should move; the actor learns to follow this direction.

DDPG combines this with the DQN-style infrastructure of §6:

  • Replay buffer of past transitions.

  • Target networks for both the actor μθ\mu_{\theta^-} and the critic QϕQ_{\phi^-}, updated by soft (Polyak) averaging θτθ+(1τ)θ\theta^- \leftarrow \tau \theta + (1 - \tau) \theta^- with τ0.005\tau \approx 0.005.

  • Exploration via additive noise (Ornstein-Uhlenbeck or Gaussian) on the deterministic action.

DDPG demonstrated that deep continuous-control was feasible. It was also empirically brittle: hyperparameter-sensitive, prone to overestimation in the critic, and difficult to reproduce across implementations. The next two algorithms address these failures.

TD3: three fixes for DDPG

Twin Delayed DDPG (TD3, Fujimoto, van Hoof, Meger, 2018) is a refinement that became the canonical reliable form of deterministic-policy continuous control. Three changes:

  1. Twin Q-networks. Train two independent critics Qϕ1,Qϕ2Q_{\phi_1}, Q_{\phi_2}. Use the minimum of the two as the target: y=r+γmin(Qϕ1(s,μθ(s)),Qϕ2(s,μθ(s)))y = r + \gamma \min(Q_{\phi_1^-}(s', \mu_{\theta^-}(s')), Q_{\phi_2^-}(s', \mu_{\theta^-}(s'))). This is clipped double Q-learning - directly addressing the overestimation bias of single-critic methods (analogous to Double DQN in §6).

  2. Delayed policy updates. Update the actor (and the target networks) less frequently than the critics - typically once every two critic updates. Letting the critics converge between actor updates produces more accurate Q-estimates for the actor to follow.

  3. Target policy smoothing. Add small clipped Gaussian noise to the target action: a~=μθ(s)+clip(N(0,σ),c,c)\tilde{a} = \mu_{\theta^-}(s') + \text{clip}(\mathcal{N}(0, \sigma), -c, c). This regularizes the critic against exploiting narrow Q-value peaks that are artefacts of function approximation.

ALGORITHM TD3 (Fujimoto et al., 2018, sketch)
INPUT: actor mu_theta, twin critics Q_phi1, Q_phi2, replay buffer B

For each step:
  Collect transition with exploratory action a = mu_theta(s) + noise; add to B.
  Sample minibatch from B.
  # Twin-critic target
  a_tilde = mu_theta-(s') + clip(N(0, sigma), -c, c)
  y = r + gamma * min(Q_phi1-(s', a_tilde), Q_phi2-(s', a_tilde))
  # Update both critics
  phi1, phi2 = AdamStep on MSE(Q_phi*(s, a), y)
  # Delayed actor update (every d steps)
  If step mod d == 0:
    theta = AdamStep on -mean of Q_phi1(s, mu_theta(s))
    # Soft target update
    phi1- = tau * phi1 + (1 - tau) * phi1-
    phi2- = tau * phi2 + (1 - tau) * phi2-
    theta- = tau * theta + (1 - tau) * theta-

TD3 reliably matches or exceeds DDPG on standard benchmarks and is much less hyperparameter-sensitive. It is one of the two canonical modern continuous-control algorithms (the other being SAC).

SAC: maximum-entropy RL with a stochastic policy

Soft Actor-Critic (SAC, Haarnoja et al., 2018) takes a different approach. Rather than a deterministic policy, SAC learns a stochastic policy under an augmented objective: maximize return plus an entropy bonus.

The maximum-entropy objective:

JMaxEnt(π)=Eτπ ⁣[tr(st,at)+αH[π(st)]],J_{\text{MaxEnt}}(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_t r(s_t, a_t) + \alpha \mathcal{H}[\pi(\cdot | s_t)] \right],

where H[π(s)]=Eaπ(s)[logπ(as)]\mathcal{H}[\pi(\cdot | s)] = -\mathbb{E}_{a \sim \pi(\cdot|s)}[\log \pi(a | s)] is the entropy of the policy at state ss and α>0\alpha > 0 is the entropy coefficient (the “temperature”).

Why entropy bonus. Two reasons. First, exploration: a high-entropy policy is exploratory by construction; the agent does not need to add separate exploration noise. Second, robustness: the optimal policy under entropy regularization is multi-modal where the reward landscape is multi-modal (it does not collapse to a single mode). This is empirically helpful for tasks where multiple action sequences earn similar reward.

The Bellman equation is modified accordingly. Define the soft Q-function QsoftπQ^\pi_{\text{soft}} satisfying

Qsoftπ(s,a)=r(s,a)+γEsP ⁣[Eaπ[Qsoftπ(s,a)]+αH[π(s)]].Q^\pi_{\text{soft}}(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P}\!\left[ \mathbb{E}_{a' \sim \pi}[Q^\pi_{\text{soft}}(s', a')] + \alpha \mathcal{H}[\pi(\cdot | s')] \right].

The optimal policy under this objective is the Boltzmann distribution over actions: π(as)exp(Qsoft(s,a)/α)\pi^*(a | s) \propto \exp(Q^*_{\text{soft}}(s, a) / \alpha). For a Gaussian policy parameterization, the algorithm becomes tractable.

SAC’s algorithmic structure:

ALGORITHM SAC (Haarnoja et al., 2018, sketch)
INPUT: Gaussian policy pi_theta (mean, std), twin soft Q-networks Q_phi1, Q_phi2,
       replay buffer B, entropy coefficient alpha (often learned)

For each step:
  Collect transition by sampling a ~ pi_theta(. | s); add to B.
  Sample minibatch from B.
  # Soft target with twin-Q minimum
  Sample a' ~ pi_theta(. | s').
  y = r + gamma * [ min(Q_phi1-(s', a'), Q_phi2-(s', a')) - alpha * log pi_theta(a' | s') ]
  # Update twin critics
  phi1, phi2 = AdamStep on MSE(Q_phi*(s, a), y)
  # Update actor via reparameterized gradient
  a_theta = mu_theta(s) + sigma_theta(s) * eps     (reparameterization trick)
  theta = AdamStep on
      -mean over batch of [ min(Q_phi1(s, a_theta), Q_phi2(s, a_theta)) - alpha * log pi_theta(a_theta | s) ]
  # Optional: learn alpha to match a target entropy
  alpha = AdamStep on -alpha * (log pi_theta(a_theta | s) + target_entropy)
  # Soft target updates
  phi1- = tau * phi1 + (1 - tau) * phi1-
  phi2- = tau * phi2 + (1 - tau) * phi2-

Two technical points worth highlighting. The reparameterization trick - sampling a=μ(s)+σ(s)ϵa = \mu(s) + \sigma(s) \cdot \epsilon with ϵN(0,I)\epsilon \sim \mathcal{N}(0, I) - makes the policy gradient backpropagatable through the action sampling. This is the same trick used in variational autoencoders and is what makes SAC’s gradient estimator low-variance. The automatic entropy tuning of α\alpha (Haarnoja et al., 2018b) eliminates the main hyperparameter that early SAC required hand-tuning.

TD3 vs SAC: when to use which

Both are off-policy, both are sample-efficient, both are reliably trainable. They differ in:

  • Policy structure. TD3 is deterministic with external exploration noise; SAC is stochastic with entropy-driven exploration.

  • Theoretical motivation. TD3 is a direct fix of DDPG; SAC is derived from the maximum-entropy framework with its own theory.

  • Hyperparameter sensitivity. SAC tends to be the more robust choice across tasks, especially with automatic entropy tuning.

  • Performance ceiling. Comparable on standard continuous-control benchmarks (MuJoCo, DM Control); the choice between them is largely a matter of practitioner preference and implementation availability.

By 2026, SAC is probably the more common choice in new continuous-control work, but TD3 remains a common baseline and is preferred when deterministic policies are explicitly required (some industrial control applications).

Where continuous-control RL is SOTA

The honest accounting. Off-policy deep RL (TD3, SAC) achieves SOTA on:

  • Simulated continuous control benchmarks (MuJoCo, DM Control Suite, Isaac Gym).

  • Many real-world robotics applications with substantial sim-to-real engineering.

  • Some industrial control problems where a high-fidelity simulator exists.

It does not achieve SOTA on:

  • Real-world robotics from scratch (sample efficiency is still far below what is practical without simulation).

  • Long-horizon tasks with sparse reward (exploration remains hard).

  • Tasks with hierarchical structure where high-level decisions matter more than fine motor control.

The robotics community has accordingly developed combinations: model-based RL for sample efficiency (§8), imitation-learning warm-starts, sim-to-real pipelines, and (in some applications) modern deployment of Foundation Models trained on multi-task robot data. Pure on-from-scratch continuous-control RL is rare in production; it appears most often inside larger systems.


§8. Model-Based Reinforcement Learning

What “model-based” means

Through §3§7 the algorithms have been model-free: they learn a policy or value function directly from sampled transitions (s,a,r,s)(s, a, r, s') without trying to model the environment’s transition dynamics P(ss,a)P(s' \mid s, a) or reward function r(s,a)r(s, a). The agent improves by interacting with the environment and updating its estimates from the resulting data.

Model-based algorithms additionally learn an explicit model P^,r^\hat{P}, \hat{r} of the environment from the same transition data. Once the model is learned, the agent can:

  • Plan with the model. Use the learned P^,r^\hat{P}, \hat{r} inside a dynamic-programming or tree-search procedure to find good actions, without further real-environment interaction.

  • Generate synthetic experience. Simulate trajectories from the model and use them to train a model-free RL algorithm - multiplying the effective data the agent has seen.

  • Imagine alternatives. Roll out the consequences of not-yet-taken actions in the model, evaluating counterfactuals.

The promise of model-based RL is sample efficiency: each real-environment transition produces not just one update of the policy but many updates via the model. The risk is model bias: the learned model is imperfect, and a policy that exploits model errors will fail on the real environment. The history of model-based RL is largely the history of managing this trade-off.

The Dyna framework: planning + learning + acting

Sutton (1991) introduced the Dyna architecture: an explicit combination of model-free learning, model-based planning, and acting in the real environment. The basic loop:

DYNA ARCHITECTURE (Sutton, 1991, sketch)

Repeat:
  # Act in the real environment
  s' = env.step(a)  for action a chosen by current policy
  Update model M from (s, a, r, s')
  Update Q with real transition: Q(s, a) += alpha * [r + gamma max Q(s', a') - Q(s, a)]

  # Plan with the learned model
  Repeat N times (e.g., N = 10):
    Sample (s_p, a_p) from past experience
    Predict r_p, s'_p = M(s_p, a_p)
    Update Q with simulated transition:
      Q(s_p, a_p) += alpha * [r_p + gamma max Q(s'_p, a') - Q(s_p, a_p)]

Each real-world transition triggers NN simulated transitions; NN controls how much the agent leans on the model. Dyna was developed in the tabular setting; the framework generalizes to function approximation with the standard caveats about model error.

World models: Ha and Schmidhuber’s foundational paper

The modern model-based deep-RL renaissance began with Ha and Schmidhuber (2018) “World Models.” The setup:

  • A vision module (a VAE-like encoder) maps high-dimensional observations (pixels) to a compact latent representation ztz_t.

  • A memory module (an RNN with mixture-density output) predicts the next latent zt+1z_{t+1} from (zt,at)(z_t, a_t).

  • A controller (a small linear or shallow policy) is trained to maximize reward - purely inside the learned world model.

The training pipeline: collect random trajectories in the real environment; train the world model on them; train the controller in imagination using the world model; deploy the controller in the real environment.

Two striking results. First, the controller never trains directly on real-environment rewards: it trains exclusively in imagination. Second, on tasks like CarRacing and VizDoom, the resulting controller performed well in the real environment despite never having received a real-environment training signal. The paper demonstrated that latent world models could be accurate enough for policy learning.

Dreamer V1 → V3: the latent-world-model recipe matures

Dreamer (Hafner, Lillicrap, Norouzi, Ba, 2020) refined the World Models recipe into a standardized algorithm. The architecture:

  • A Recurrent State-Space Model (RSSM) with both stochastic and deterministic latent components, trained on observations and actions.

  • A policy network trained via imagined rollouts from the world model, using actor-critic in latent space.

  • An auxiliary loss to ensure the latent representations are informative (the observation-prediction loss).

Dreamer V1 matched or exceeded model-free deep RL on a range of pixel-based continuous-control tasks with 5–20× better sample efficiency.

Dreamer V2 (Hafner et al., 2020) replaced the stochastic latent with a discrete-categorical latent and demonstrated that the recipe could compete with Rainbow on Atari at lower compute.

Dreamer V3 (Hafner et al., 2023) made the algorithm hyperparameter-stable across a wide range of tasks: a single configuration solved a wide set of pixel-based and proprioceptive control tasks with minimal tuning. This was a milestone analogous to DQN’s 2015 generalization across Atari - model-based deep RL was now reliable enough for routine application.

By 2026, Dreamer V3 and its successors are the canonical model-based deep-RL recipe for pixel-based continuous control.

A different line of model-based work, from DeepMind. AlphaZero (Silver et al., 2017) combined:

  • A neural network that, given a board state, outputs (a) a policy prior over moves and (b) a value estimate.

  • Monte Carlo Tree Search (MCTS) that uses the network’s prior to guide expansion of a game tree, balancing exploration and exploitation via the PUCT formula.

  • Self-play training: play the agent against itself; use MCTS-improved policies as training targets; iterate.

AlphaZero is model-based in the sense that MCTS uses an exact model of the game (the game’s rules are given). The neural network is not used to predict the dynamics; it is used to guide the search through a known dynamics.

MuZero (Schrittwieser et al., 2020) removed the assumption of known dynamics. The MuZero architecture has three networks:

  • Representation network hh: maps an observation to a latent state z0=h(o)z_0 = h(o).

  • Dynamics network gg: maps (zk1,ak)(z_{k-1}, a_k) to (zk,rk)(z_k, r_k) - predicts the next latent state and reward.

  • Prediction network ff: maps zkz_k to (pk,vk)(p_k, v_k) - predicts policy prior and value at that latent state.

MCTS is run in latent space, using the learned dynamics network to roll out hypothetical trajectories. Training is done end-to-end: the three networks are trained jointly so that MCTS in the latent space matches the real environment’s outcomes.

MuZero matched AlphaZero on Chess, Shogi, and Go without being given the rules of these games, and matched (in many cases exceeded) Rainbow on Atari. The recipe - learn a latent dynamics model that is consistent with future rewards and observations, plan with MCTS in latent space - is one of the most striking algorithmic results of modern RL.

When model-based helps, when it hurts

The honest trade-off.

Model-based helps when:

  • Sample efficiency matters. Real-environment interaction is expensive (robotics, healthcare, expensive simulators). The model multiplies each real transition into many useful gradient updates.

  • The environment is reasonably predictable. Physics-based simulators, board games, and most controlled-laboratory tasks fall here. Model errors are bounded.

  • Planning structure helps the policy. Games with deep look-ahead (chess, Go), tasks with long horizons, and tasks where temporal abstractions matter benefit from the explicit planning capability.

Model-based hurts (or doesn’t help) when:

  • The environment is too complex to model. Open-ended real-world settings (autonomous driving in arbitrary traffic, conversational LM environments) have dynamics that are too rich for current model classes to capture well.

  • Model errors compound. Long-horizon rollouts in an imperfect model accumulate error; the policy ends up trained on increasingly unrealistic states.

  • Sample efficiency doesn’t matter. Some applications (LLM RLHF post-training, certain large-scale game RL) have so much data that the extra mechanism of model-based methods doesn’t pay off.

A common modern approach: short-horizon model-based rollouts combined with model-free updates beyond the rollout horizon. This caps the compounding-error problem and captures most of the sample-efficiency benefit. Dreamer V3’s rollout horizon is typically 16 steps, much shorter than full episodes.

Where model-based fits in 2026

By 2026, model-based deep RL has matured into a reliable toolbox for:

  • Pixel-based continuous control with limited real-environment access (Dreamer family).

  • Board games and discrete-action games with self-play (AlphaZero/MuZero family).

  • Sample-efficient research deployments where a controlled simulator is available (the dominant academic setting).

It has not become the dominant paradigm for:

  • Frontier LM post-training (RLHF, reasoning-RL). The “environment” for an LLM in RLHF is essentially the LM itself plus a reward model; model-based methods do not naturally apply.

  • Production robotics at scale. Here, model-based methods are often combined with imitation learning and sim-to-real pipelines rather than used as the standalone framework.

The remaining sections turn to the specifically-LM-flavoured RL of the modern era: offline RL (§9), RLHF/DPO/GRPO (§10), and reasoning-RL (§12). Model-based methods will reappear in §13 as a connection-thread (world-models-for-agents, simulators as learned models).


§9. Offline Reinforcement Learning

The offline-RL problem

The algorithms of §4§8 are all online: the agent interacts with the environment, collects new data, updates its policy, and repeats. Modern deep RL routinely consumes 10610^6 to 101010^{10} environment interactions per training run.

In many practical settings this is impossible. Three regimes where online interaction is infeasible:

  • Robotics with expensive or fragile hardware. Each interaction wears out actuators; some interactions are dangerous (a robot arm can damage itself or its surroundings).

  • Healthcare and policy applications. Each “interaction” is a real patient or a real public policy. Exploration is unethical.

  • Industrial control with quality constraints. Each interaction is a production run; off-policy exploration is too expensive.

In all three, what is available is a fixed dataset of past interactions D={(si,ai,ri,si)}i=1N\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^N collected by some behaviour policy πβ\pi_\beta (often a human operator, a heuristic controller, or a previous version of the system). The offline RL problem: given D\mathcal{D}, with no further environment access, learn the best possible policy.

Why naive offline application of online algorithms fails

The first thing to try: take a standard online algorithm (e.g., Q-learning with neural networks, or any of §6’s value-based methods) and apply it to the static dataset D\mathcal{D} in place of an environment. This systematically fails.

The failure mode is called distributional shift or out-of-distribution (OOD) action exploitation. The Q-learning target uses maxaQ(s,a)\max_{a'} Q(s', a') - the maximum over all actions, including actions never tried in D\mathcal{D}. For OOD actions, the neural network’s Q-estimate is whatever extrapolation it produces from training - which can be arbitrarily large, and is not corrected by environment feedback (because there is no environment). The optimizer is then incentivized to push the policy toward those OOD actions, which compounds the problem.

A worked illustration. Suppose D\mathcal{D} contains transitions where the agent always moved RIGHT. The Q-network has never seen LEFT actions; its Q-value for LEFT is undetermined. Standard Q-learning will produce a high Q-estimate for LEFT if the regression’s extrapolation happens to be high there, and the resulting greedy policy will choose LEFT in deployment - where it may behave catastrophically. The agent has overestimated an action it has never tried and cannot try.

This is not an exploration problem (we cannot collect more data to fix it). It is a fundamental difficulty of the offline regime: any algorithm that relies on max\max-over-actions has to somehow avoid trusting actions absent from D\mathcal{D}.

The pessimism principle

The conceptual fix shared across modern offline RL: be pessimistic about actions not well-supported in the dataset. The policy should prefer actions that the dataset has seen frequently (where Q-estimates are reliable) and avoid actions the dataset has not seen (where Q-estimates are speculative). Several concrete instantiations.

Behavioural cloning as the simplest baseline

The simplest offline approach: behavioural cloning - supervised learning of the policy to match the actions in D\mathcal{D}:

θ=argminθ(si,ai)Dlogπθ(aisi).\theta^* = \arg\min_\theta \sum_{(s_i, a_i) \in \mathcal{D}} -\log \pi_\theta(a_i \mid s_i).

This is pure imitation: the policy reproduces the behaviour distribution of D\mathcal{D}. It avoids OOD actions by construction (it never proposes actions outside D\mathcal{D}'s support). It does not improve on πβ\pi_\beta - if the behaviour policy is suboptimal, behavioural cloning produces a suboptimal policy.

Behavioural cloning is the floor for offline RL: any offline algorithm worth its complexity must demonstrably exceed it. In practice, on many benchmark tasks, behavioural cloning is a strong baseline - modest improvements over it are the usual offline-RL contribution.

CQL: Conservative Q-Learning

Kumar, Zhou, Tucker, Levine (2020) “Conservative Q-Learning for Offline Reinforcement Learning.” CQL modifies the Q-learning loss to explicitly penalize high Q-values on OOD actions:

LCQL(ϕ)=LTD(ϕ)+αEsD ⁣[logaexp(Qϕ(s,a))Eaπβ(s)[Qϕ(s,a)]].\mathcal{L}_{\text{CQL}}(\phi) = \mathcal{L}_{\text{TD}}(\phi) + \alpha \cdot \mathbb{E}_{s \sim \mathcal{D}}\!\left[ \log \sum_a \exp(Q_\phi(s, a)) - \mathbb{E}_{a \sim \pi_\beta(\cdot|s)}[Q_\phi(s, a)] \right].

Reading the penalty. The first inner term logaexp(Qϕ(s,a))\log \sum_a \exp(Q_\phi(s, a)) is a soft-max over actions - large when any action has a high Q-value. The second term subtracts the average Q-value of actions actually present in D\mathcal{D} (so the penalty does not push down the in-distribution Q-values). The net effect: the loss penalizes Q-functions that assign high values to actions not in D\mathcal{D}, while preserving high values for in-distribution actions.

The hyperparameter α\alpha controls how conservative the algorithm is. Larger α\alpha → more pessimism → safer offline behaviour but less improvement over πβ\pi_\beta.

CQL was one of the first offline-RL algorithms to demonstrably improve over behavioural cloning on the D4RL benchmark suite (Fu et al., 2020). It has the disadvantage of being computationally expensive (the inner soft-max over actions) and hyperparameter-sensitive.

IQL: Implicit Q-Learning

Kostrikov, Nair, Levine (2022) “Offline Reinforcement Learning with Implicit Q-Learning.” IQL takes a different approach: avoid the max\max-over-actions entirely. Instead of V(s)=maxaQ(s,a)V(s) = \max_a Q(s, a), use expectile regression to estimate V(s)V(s) as a high quantile of the Q-value distribution under the dataset action distribution:

LV(ψ)=E(s,a)D ⁣[L2τ(Qϕ(s,a)Vψ(s))],L2τ(u)=τ1[u<0]u2,\mathcal{L}_V(\psi) = \mathbb{E}_{(s, a) \sim \mathcal{D}}\!\left[ L_2^\tau(Q_\phi(s, a) - V_\psi(s)) \right], \quad L_2^\tau(u) = |\tau - \mathbb{1}[u < 0]| u^2,

with τ(0,1)\tau \in (0, 1). The expectile loss L2τL_2^\tau penalizes negative residuals more than positive ones (for τ>0.5\tau > 0.5), so VψV_\psi regresses toward the upper expectile of the Q-distribution conditional on ss. As τ1\tau \to 1, VψV_\psi approaches maxaDQϕ(s,a)\max_{a \in \mathcal{D}} Q_\phi(s, a) - the maximum over in-dataset actions, never OOD ones.

The Q-function is then updated with Vψ(s)V_\psi(s') as the bootstrap target (no max\max needed):

LQ(ϕ)=E(s,a,r,s)D ⁣[(r+γVψ(s)Qϕ(s,a))2].\mathcal{L}_Q(\phi) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\!\left[ (r + \gamma V_\psi(s') - Q_\phi(s, a))^2 \right].

Finally, the policy is extracted by advantage-weighted regression (AWR) - supervised learning of πθ(as)\pi_\theta(a \mid s) weighted by exp(β(Qϕ(s,a)Vψ(s)))\exp(\beta (Q_\phi(s, a) - V_\psi(s))). Actions in D\mathcal{D} with high advantage are imitated more; those with low advantage less. The policy stays in the support of D\mathcal{D} by construction.

IQL is simpler than CQL, faster, and competitive on the standard benchmarks. By 2026 it is one of the canonical offline-RL recipes.

AWAC: Advantage-Weighted Actor-Critic

Nair, Gupta, Dalal, Levine (2020) developed AWAC with the same general idea: combine a Q-function trained on the dataset with policy extraction that stays in the dataset’s support. AWAC’s distinctive feature is the natural transition from offline pretraining to online fine-tuning - the policy and Q-function trained offline serve as the initialization for further online learning. This is the offline-to-online paradigm: do most of the work cheaply on logged data, then fine-tune in the real environment.

Pseudocode summary for offline RL

GENERIC OFFLINE RL TEMPLATE

INPUT: fixed dataset D, network architectures for Q (and V, pi).

1. Initialize Q_phi, V_psi (if used), pi_theta.

2. For each training step:
   2a. Sample minibatch from D.
   2b. Update Q_phi:
       - CQL: TD loss + OOD-action penalty.
       - IQL: TD with V_psi target (no max).
   2c. Update V_psi (IQL only) via expectile regression.
   2d. Update pi_theta:
       - CQL: standard actor update (with constrained policy).
       - IQL/AWAC: advantage-weighted regression on dataset actions.

3. After training: deploy pi_theta (no further interaction needed).

The common structure across CQL, IQL, AWAC: stay close to the support of D\mathcal{D}, in one of three ways - penalize OOD actions in the loss (CQL), never use OOD actions in the target (IQL), or imitate in-distribution actions weighted by advantage (AWAC).

Where offline RL stands in 2026

The pessimism principle has been productive. Modern offline RL routinely improves over behavioural cloning on benchmark tasks and is increasingly deployed in real applications - robotics from teleoperation logs, healthcare from electronic-records replays, recommender systems from logged user interactions.

Open issues. The theoretical understanding of which conditions guarantee offline-RL improvement is partial; sample-complexity bounds (Theory chapter §10) are loose. The hyperparameters of CQL and IQL still require some tuning per task. And the gap between offline and online - how much can offline RL approach the performance of an online algorithm with the same data budget? - remains an active question.

The offline RL paradigm also informs §10’s RLHF: the LLM RLHF setting is naturally close to offline (most of the data is logged preference pairs), and several of the offline-RL design choices have analogues in RLHF practice.


§10. RL from Human Feedback (RLHF), DPO, GRPO

This section is the bridge between the classical RL of §3§9 and the modern LLM application. The detailed treatment of RLHF/DPO/GRPO at the LM scale lives in LLM §5; here we develop the RL-framing of the paradigm. The two sections should be read together: LLM §5 is the LM-specific algorithmic detail, RL §10 is the conceptual placement within the RL framework.

The LM as RL environment

The non-obvious move that produced RLHF: treat a language model as an agent in an environment defined by prompts and reward signals.

The setup. The agent is the language model πθ(yx)\pi_\theta(y \mid x), mapping prompts xx to completions yy. The environment is a stochastic process that:

  • Samples a prompt xx from a prompt distribution.

  • Receives a completion yy generated autoregressively by πθ\pi_\theta.

  • Computes a reward r(x,y)r(x, y) - typically scalar, via a learned reward model.

Three observations about this setup compared to classical RL.

First, the episode is one prompt-completion exchange. The “trajectory” is the sequence of tokens in yy; each token is an action; the cumulative reward is the single terminal r(x,y)r(x, y).

Second, the action space is the LM’s vocabulary (tens to hundreds of thousands of tokens). Discrete, large, structured - but RL-amenable.

Third, the dynamics are trivial in one sense and complex in another: trivial because each “transition” just appends a token to the context; complex because the reward landscape over yy is highly non-trivial and we don’t have direct access to it (only via a learned reward model).

This LM-environment framing is what allowed methods designed for classical RL to be adapted to LM post-training. It is not how the LM was originally trained (which is supervised next-token prediction, LLM §3); it is what is layered on top, after pretraining and supervised fine-tuning, to align the LM with human preferences or task-specific rewards.

Reward modelling from preferences

The first conceptual question: where does the reward r(x,y)r(x, y) come from? In classical RL the reward function is given by the environment. For LMs there is no natural reward function - “how good is this completion?” is itself the question we want to answer.

Christiano et al. (2017) “Deep Reinforcement Learning from Human Preferences” introduced the modern solution: train a reward model rψ(x,y)r_\psi(x, y) from human pairwise preferences. The data is comparisons: given prompt xx, completions ywy_w (“winner”) and yly_l (“loser”), with ywyly_w \succ y_l as the human label. The reward model is trained against the Bradley-Terry preference loss:

LRM(ψ)=E(x,yw,yl)D ⁣[logσ(rψ(x,yw)rψ(x,yl))],\mathcal{L}_{RM}(\psi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\!\left[ \log \sigma\big(r_\psi(x, y_w) - r_\psi(x, y_l)\big) \right],

where σ\sigma is the sigmoid. After training, rψ(x,y)r_\psi(x, y) is a scalar score that approximates “how preferred would this completion be in a head-to-head comparison.”

The full training pipeline (PPO-for-RLHF):

RLHF TRAINING PIPELINE (Christiano 2017, Ouyang 2022)

1. Start from a pretrained + SFT'd LM pi_SFT.
2. Collect preference data D = {(x, y_w, y_l)}.
3. Train reward model r_psi on D against Bradley-Terry loss.
4. Initialize policy pi_theta = pi_SFT; freeze a copy as pi_ref.
5. PPO loop:
   a. Sample prompts x; generate completions y ~ pi_theta(.|x).
   b. Compute reward: R(x, y) = r_psi(x, y) - beta * KL(pi_theta(.|x) || pi_ref(.|x)).
   c. Compute GAE advantage A_hat.
   d. PPO update on pi_theta with clipped objective and advantage A_hat.
6. After training: deploy pi_theta.

Two specific choices worth noting. The KL penalty βKL(πθπref)\beta \cdot \text{KL}(\pi_\theta \| \pi_{\text{ref}}) keeps the trained policy close to the reference (SFT) policy. Without it, the policy can drift far from sensible LM behaviour to exploit the reward model; with it, the policy improves while remaining recognizably language-model-like. The KL coefficient β\beta is the central hyperparameter of RLHF - too large and the policy doesn’t move; too small and it drifts into reward-hacking regions.

The advantage estimation uses standard GAE (§5) on the per-token rewards (the KL term is per-token; the rψr_\psi term is applied at the end of the completion). The PPO update is standard (§6).

The RLHF pipeline at scale

Three landmark applications:

  • Stiennon et al. (2020) “Learning to Summarize from Human Feedback.” Demonstrated RLHF on summarization; the RLHF-tuned model was preferred over supervised baselines and over much larger non-RLHF models. Established the empirical effectiveness of the approach at scale.

  • Ouyang et al. (2022, InstructGPT). The first general-purpose deployment: RLHF turned a base GPT-3-style model into an instruction-following model that was the direct predecessor of ChatGPT. The recipe became the standard.

  • Bai et al. (2022, Constitutional AI / RLAIF). Replaced human preferences with model-generated preferences guided by a constitution. Scales the data collection (AI labelling is cheaper than human labelling) at the cost of new failure modes if the labelling model is wrong.

By 2026 essentially every deployed frontier LLM has undergone RLHF or a close variant. The training stack - pretrain → SFT → reward modelling → RLHF - is the canonical post-training pipeline.

DPO: closed-form preference optimization

A reformulation that has substantially displaced PPO-for-RLHF in many settings. Rafailov, Sharma, Mitchell, Manning, Ermon, Finn (2023) “Direct Preference Optimization: Your Language Model is Secretly a Reward Model” derived a closed-form solution to the RLHF objective that eliminates the explicit reward model and the PPO step.

The starting observation. The KL-constrained reward maximization

π(yx)=argmaxπEyπ(x)[r(x,y)]βKL(π(x)πref(x))\pi^*(y \mid x) = \arg\max_\pi \mathbb{E}_{y \sim \pi(\cdot|x)}[r(x, y)] - \beta \cdot \text{KL}(\pi(\cdot|x) \| \pi_{\text{ref}}(\cdot|x))

has a closed-form solution:

π(yx)=1Z(x)πref(yx)exp ⁣(r(x,y)β),\pi^*(y \mid x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) \exp\!\left(\frac{r(x, y)}{\beta}\right),

where Z(x)Z(x) is a normalizing constant. Inverting for rr:

r(x,y)=βlogπ(yx)πref(yx)+βlogZ(x).r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x).

Substituting this expression into the Bradley-Terry preference loss (the partition function Z(x)Z(x) cancels between winner and loser since it depends only on xx):

LDPO(θ)=E(x,yw,yl)D ⁣[logσ ⁣(βlogπθ(ywx)πref(ywx)βlogπθ(ylx)πref(ylx))].\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\!\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right].

Reading this. The loss is a supervised loss on preference pairs. The reward model is gone - it has been re-expressed in terms of πθ/πref\pi_\theta / \pi_{\text{ref}}. The RL training loop is gone - the loss is a standard cross-entropy-style objective. The PPO machinery is gone.

The implication: RLHF can be done as supervised training on preference data, without a reward model, without rollouts, without PPO. DPO is dramatically simpler than PPO-for-RLHF, more sample-efficient, and more stable. It has been widely adopted since 2023.

LLM §5 develops the DPO derivation in much greater detail, including the worked preference example, the policy/reference clarification, and the comparison to PPO. We refer the reader there for the LM-specific treatment.

GRPO: simplifying PPO for reasoning-RL

GRPO (Group Relative Policy Optimization), introduced in DeepSeek’s reasoning-model work (DeepSeek-R1, early 2025), is the variant of PPO most used in modern reasoning-RL training. It simplifies PPO by eliminating the value-function critic - the dominant computational and engineering burden of LLM-scale PPO.

The setup. For each prompt xx, generate a group of KK completions {y1,,yK}\{y_1, \ldots, y_K\} under the current policy. Compute each completion’s reward r(x,yk)r(x, y_k). The advantage of completion kk is its reward relative to the group mean:

A^k=r(x,yk)1Kjr(x,yj)σK,σK=std{r(x,yj)}j=1K.\hat{A}_k = \frac{r(x, y_k) - \frac{1}{K}\sum_j r(x, y_j)}{\sigma_K}, \qquad \sigma_K = \text{std}\{r(x, y_j)\}_{j=1}^K.

That is, GRPO uses the group mean reward as a baseline (§5) - no learned value function required. The PPO clipped objective is then applied using A^k\hat{A}_k as the advantage:

LGRPO(θ)=E ⁣[min ⁣(rk(θ)A^k,clip(rk(θ),1ϵ,1+ϵ)A^k)βKL(πθπref)],\mathcal{L}_{\text{GRPO}}(\theta) = -\mathbb{E}\!\left[ \min\!\left( r_k(\theta) \hat{A}_k,\, \text{clip}(r_k(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_k \right) - \beta \cdot \text{KL}(\pi_\theta \| \pi_{\text{ref}}) \right],

with rk(θ)=πθ(ykx)/πθold(ykx)r_k(\theta) = \pi_\theta(y_k|x) / \pi_{\theta_{\text{old}}}(y_k|x) the importance ratio.

The trade-off. Eliminating the critic halves (or more) the model parameters that need to be trained alongside the policy and removes a substantial source of engineering complexity. The cost is that the advantage estimator is noisier (a group baseline is a less accurate baseline than a learned value function would be). For reasoning-RL with large KK (typically K=32K = 32 to K=128K = 128), the group baseline is accurate enough that the trade-off is favourable.

GRPO has been the standard algorithm for reasoning-RL since DeepSeek-R1’s open publication of the method in early 2025. We develop reasoning-RL itself in §12.

Connections between RLHF, offline RL, and the classical algorithms

A worth-noting structural observation. RLHF-via-DPO is essentially offline: the training data is preference pairs, the algorithm is supervised, and there is no environment interaction. RLHF-via-PPO is online in form but uses a learned reward model rather than environment rewards. GRPO is online and uses environment rewards (verifiable correctness signals in reasoning-RL).

The space of LM-RL methods can be organized along two axes:

   LM RL METHOD SPACE

                            ONLINE (real or learned rewards)
                                          │
                                          │
                     PPO-for-RLHF ────────┼──── GRPO (reasoning-RL)
                     (learned reward      │     (verifiable reward,
                      model, online)      │      online)
                                          │
   OFFLINE  ──────────────────────────────┼─────────────────────────── ONLINE
                                          │
                     DPO ─────────────────┼──── Online DPO variants
                     (preference data,    │     (recently studied)
                      supervised)         │
                                          │
                            OFFLINE (preference data only)

Each cell makes different trade-offs in sample efficiency, complexity, and capability. The 2026 landscape: DPO is the default for general-purpose RLHF; GRPO (or PPO) is the default for reasoning-RL with verifiable rewards; PPO-for-RLHF remains in use where existing infrastructure favours it.

Open issues in RLHF/DPO/GRPO

A summary, with cross-references to OP-RL-N and OP-LLM-N.

  • Reward hacking. Policies trained against a learned reward model can exploit imperfections in the model rather than improve at the actual task. This is OP-RL-5 and is one of the central concerns in alignment research.

  • Distribution shift of the reward model. As πθ\pi_\theta moves away from the data that trained rψr_\psi, the reward model’s predictions become less reliable. The KL penalty mitigates but does not solve this.

  • Reward model overfitting. With finite preference data, the reward model overfits to the training distribution; reward-hacking is a downstream consequence.

  • Sample efficiency of RLHF/DPO. Improvements per preference pair are real but bounded; scaling preference collection has substantial cost.

  • Theoretical guarantees. The sample complexity of RLHF is partially understood; cross-reference Theory §10.

  • Reasoning-RL reward design. For GRPO and similar, the reward signal (verifier-based, process-based, outcome-based) is the central design choice. We develop it in §12.

The RL-framework treatment ends here; LLM §5 picks up the LM-specific details. §11§14 develop exploration, reasoning models, connections, and the chapter-closing material.


§11. Exploration

Why exploration is its own problem

Through §3§10, exploration has appeared in pieces: ϵ\epsilon-greedy in Q-learning, entropy bonuses in SAC, KL-penalized RLHF. We have not addressed exploration as its own problem. This section does.

The basic difficulty. An RL agent that always takes the action it currently believes is best (a fully exploitative policy) gets stuck on whatever its current belief implies - even if a better strategy is available but unsampled. An agent that always takes random actions (a fully exploratory policy) collects diverse data but never converges to a good policy. The exploration-exploitation trade-off is the central tension.

A worked illustration to anchor the rest. Consider a two-armed bandit: each round, the agent pulls one of two arms; arm 1 gives reward 11 with probability 0.60.6, arm 2 with probability 0.40.4. The agent doesn’t know these probabilities.

  • A greedy agent that always pulls the arm with higher empirical mean can lock into arm 2 if it gets lucky on its first few pulls - and then never pulls arm 1 again, never discovering that arm 1 is actually better.

  • A purely random agent pulls each arm with probability 0.50.5 forever - collecting a balanced dataset but earning, in expectation, (0.6+0.4)/2=0.5(0.6 + 0.4)/2 = 0.5 reward per round, much worse than the optimal 0.60.6.

The right behaviour is somewhere in between: exploit current best estimates while gathering data on uncertain options. The structure of the right answer is what exploration algorithms try to capture.

ϵ\epsilon-greedy and Boltzmann exploration

The simplest strategies, used widely.

ϵ\epsilon-greedy. With probability ϵ\epsilon take a random action; otherwise take the action that maximizes the current Q(s,)Q(s, \cdot). The hyperparameter ϵ\epsilon is typically annealed from 1.01.0 (pure exploration early in training) to a small floor like 0.010.01 or 0.050.05 (mostly exploitation late). Used in DQN, Rainbow, and many tabular RL deployments.

The conceptual limitation: ϵ\epsilon-greedy is undirected - it explores uniformly at random regardless of which actions actually have uncertain Q-values. An action you have already pulled 1000 times gets the same exploration probability as one you have pulled 0 times.

Boltzmann exploration. Sample actions in proportion to exp(Q(s,a)/T)\exp(Q(s, a) / T) where TT is a temperature. As TT \to \infty, this is uniform; as T0T \to 0, it is greedy. Slightly more principled than ϵ\epsilon-greedy in that the relative magnitudes of Q-values matter, but still undirected: it does not distinguish uncertain from known-low actions.

Both work surprisingly well in practice for many tasks. They are the workhorses of exploration in deployed RL. Their limitation matters mostly in sparse-reward, high-dimensional settings (§11’s hard problem).

Optimism in the face of uncertainty

A more principled strategy with both theoretical backing and practical descendants. The idea: maintain uncertainty about Q-values; pull actions whose uncertainty is high; the agent’s behaviour is optimistic in that it acts as if uncertain actions might be excellent.

The canonical bandit instantiation is UCB (Upper Confidence Bound, Auer, Cesa-Bianchi, Fischer, 2002). For each arm aa, maintain the empirical mean μ^a\hat{\mu}_a and number of pulls nan_a. Choose

a=argmaxa[μ^a+clntna],a^* = \arg\max_a \left[ \hat{\mu}_a + c \sqrt{\frac{\ln t}{n_a}} \right],

where tt is the total round count and cc is a constant. The first term is the exploit signal (current best estimate); the second is the explore bonus, decaying as 1/na1/\sqrt{n_a} (more pulls → less uncertainty). UCB provably achieves regret O(KTlogT)O(\sqrt{KT \log T}) for KK-armed bandits - near-optimal among any exploration strategy.

The principle generalizes. For RL with full MDPs: maintain confidence intervals around Q-values; act as if the upper end of each interval is the true value. Algorithms like R-MAX (Brafman and Tennenholtz, 2002, §10 of the Theory chapter) implement this for tabular MDPs. PAC-MDP sample-complexity bounds (Theory §10) are proven for optimism-based algorithms.

The deep-RL difficulty: maintaining well-calibrated uncertainty over deep neural network Q-values is itself hard. Bayesian deep learning (variational nets, ensembles, MC dropout) provides approximations, but the resulting uncertainty estimates are often poorly calibrated. Direct application of UCB-style exploration to deep RL has produced modest gains rather than transformative ones.

Intrinsic motivation: reward beyond extrinsic reward

A different framing. Rather than trying to explore based on uncertainty about value, augment the reward with an intrinsic reward that pays the agent for visiting novel or surprising states. Several instantiations:

Count-based exploration. Maintain a (pseudo-)count N(s)N(s) of how many times state ss has been visited. The intrinsic reward is rint(s)=1/N(s)r_{\text{int}}(s) = 1/\sqrt{N(s)} - high for rarely-visited states, low for frequently-visited ones. For continuous or high-dimensional states, “counts” require a hashing or density-modelling step (Bellemare et al., 2016, “Unifying Count-Based Exploration and Intrinsic Motivation”).

Curiosity (Pathak et al., 2017). Train a forward model fψf_\psi predicting next state from current state and action: s^t+1=fψ(st,at)\hat{s}_{t+1} = f_\psi(s_t, a_t). The intrinsic reward is the model’s prediction error: rint(st,at,st+1)=st+1s^t+12r_{\text{int}}(s_t, a_t, s_{t+1}) = \|s_{t+1} - \hat{s}_{t+1}\|^2. The agent is rewarded for visiting states where its predictions are bad - surprise-seeking. The intuition is direct: where the model fails is where there is still something to learn.

Random Network Distillation (RND, Burda et al., 2018). A simpler, more robust variant. Initialize a random neural network ftargetf_{\text{target}} (fixed). Train a predictor network f^ψ\hat{f}_\psi to match it on visited states: L=f^ψ(s)ftarget(s)2\mathcal{L} = \|\hat{f}_\psi(s) - f_{\text{target}}(s)\|^2. The intrinsic reward is the current prediction error. Since ftargetf_{\text{target}} is random and fixed, the predictor’s error is high on states unlike those seen before - a state-novelty signal that does not require modelling environment dynamics. RND substantially improved exploration on hard sparse-reward Atari games (notably Montezuma’s Revenge) and remains widely used.

The agent’s total reward becomes

rtotal(s,a,s)=rext(s,a,s)+βrint(s,a,s),r_{\text{total}}(s, a, s') = r_{\text{ext}}(s, a, s') + \beta \cdot r_{\text{int}}(s, a, s'),

with β\beta controlling the balance. Intrinsic motivation is plugged into a standard RL algorithm (DQN, PPO); the agent learns to maximize the augmented reward.

Where deep RL exploration is stuck

The honest status. On moderately-rewarding tasks (most Atari games, MuJoCo continuous control, board games with reward at the end), ϵ\epsilon-greedy or entropy-bonus exploration plus a strong RL algorithm is sufficient. Modern deep RL has effectively solved these tasks.

On hard-exploration tasks - sparse reward, large state space, long horizons - exploration remains an open problem. The canonical hard cases:

  • Montezuma’s Revenge (Atari). For years a benchmark on which most deep-RL methods scored zero. RND and successors achieved superhuman scores, but the techniques are task-specific and brittle.

  • NetHack and procedurally-generated environments. Vast state spaces with sparse rewards; current deep-RL methods are far from human performance.

  • Sparse-reward robotics (e.g., learning to assemble a complex object). The agent receives reward only at the end; intermediate state-space exploration is the bottleneck.

  • Open-ended environments where the “goal” itself is undefined; agents are expected to develop diverse, useful behaviours autonomously.

The theoretical-practice gap in exploration is severe. Provably-efficient exploration algorithms (PAC-MDP) exist for tabular settings; in deep RL, exploration is engineered with heuristics and validated empirically. This is OP-RL-2.

Exploration in LM-RL

A quick note on how exploration appears (or fails to appear) in the LM RL setting of §10.

  • RLHF (PPO-for-RLHF, DPO). The “exploration” is essentially the LM’s stochastic sampling at temperature >0> 0. There is no novelty bonus; the KL penalty in PPO-for-RLHF actively discourages exploration of completion-space far from the SFT policy. This works because the LM’s pretrained prior already contains broad coverage of language; exploration in the conventional RL sense is unnecessary.

  • Reasoning-RL (GRPO, §12). Similar - exploration is the diversity of sampled completions in a group. Reasoning-RL benefits from high-temperature sampling during rollouts (so the group contains genuinely varied attempts at the problem) but does not need explicit novelty bonuses.

The LM-RL setting is, in this sense, a structurally simpler regime than classical sparse-reward RL. The pretrained prior provides exploration for free.


§12. Reasoning Models and Test-Time RL

What changed in 2024–2026

The single most consequential RL development of the last two years. OpenAI’s o1 (announced late 2024), o3 (early 2025), DeepSeek-R1 (early 2025), and successors introduced a new training and inference paradigm for language models:

  • Training time: use RL with verifiable rewards to teach the LM to produce extended chains of reasoning.

  • Inference time: have the model produce many tokens of “thinking” (often 10310^310510^5 tokens) before its final answer, with the thinking trained to be useful.

The result. On hard mathematical, coding, and STEM problems - problems where the answer can be checked but the path to it is non-trivial - reasoning models substantially outperform their pre-RL predecessors. AIME mathematics, Codeforces-style programming contests, and graduate-level science questions are domains where reasoning models have produced near-human-expert performance.

This section develops the reasoning-RL paradigm as the natural extension of §10’s RLHF machinery to verifiable-reward tasks. The technical content is the algorithm (GRPO from §10, applied here), the reward design (verifier-based vs process-supervision), and the inference-time consequences.

The training-time reward signal: verifiable correctness

The key structural difference from general RLHF. In RLHF (§10), the reward is learned from human preferences - itself an approximation. In reasoning-RL, the reward is checkable:

  • Math problem. Reward +1+1 if the final answer matches the known-correct answer; 00 otherwise.

  • Coding task. Reward +1+1 if the code passes all unit tests; partial credit possible.

  • Logic puzzle. Reward +1+1 if the final assignment satisfies the constraints.

The reward signal is deterministic (no reward-model overfitting), unambiguous (no Bradley-Terry preference modeling), and cheap (no human labelling per response). This eliminates the dominant failure modes of RLHF (reward hacking, reward-model overfitting) and lets the training run for much longer than RLHF would tolerate.

The training pipeline

The setup. Start from a strong base LLM (pretrained + SFT’d). Define a corpus of checkable problems {(xi,verifieri)}\{(x_i, \text{verifier}_i)\} where xix_i is the prompt and verifieri\text{verifier}_i is a function that scores any candidate response. Train with GRPO (§10):

REASONING-RL TRAINING (DeepSeek-R1 style, sketch)

INPUT: base LM pi_theta (after SFT), checkable problem set D,
       group size K, KL coefficient beta, clipping epsilon

For each iteration:
  1. Sample prompt x from D.
  2. Generate K completions y_1, ..., y_K under pi_theta (with high temperature).
     Each y_k is a long chain: "Let me think... [reasoning] ... So the answer is X."
  3. Compute reward r_k = verifier(x, y_k).  (E.g., 1 if correct, 0 if wrong.)
  4. Compute group-relative advantage:
     A_hat_k = (r_k - mean(r_1..K)) / std(r_1..K)
  5. GRPO loss with clipping and KL penalty:
     L = - E[ min(r_k(theta) A_hat_k, clip(r_k(theta), 1-eps, 1+eps) A_hat_k)
             - beta * KL(pi_theta || pi_ref) ]
  6. Adam step.

Reading this. The training signal is: of the KK attempts at this problem, which ones got the right answer? Successful attempts have positive advantage and are reinforced; failed attempts have negative advantage and are suppressed. After many iterations, the LM learns to produce reasoning chains that more reliably lead to correct answers.

What emerges from this training

The striking empirical observation. After enough training on verifiable problems, reasoning models develop behaviours that look strikingly like deliberate thinking:

  • Self-correction. The model produces an intermediate answer, then writes “Wait, let me reconsider” and revises. This was not designed in; it emerges from RL training selecting for chains that produce correct answers, and self-correction is one path to correctness.

  • Multiple solution paths. The model attempts several approaches, evaluates them, and commits to the most promising. Again, emergent.

  • Verification. The model explicitly checks its work - re-plugging answers into equations, re-running coded solutions mentally, etc.

  • Decomposition. Complex problems are broken into subproblems with explicit subgoal structure.

These behaviours are not directly rewarded - only the final answer is. They emerge because they are instrumentally useful for reaching correct answers. The implication: RL training can produce sophisticated behaviour from a simple reward signal, given a base model strong enough to express the behaviour.

DeepSeek-R1’s published research notes describe a particularly striking “aha moment” during training: at some point, the model’s average response length grows sharply, and its accuracy improves correspondingly. The model has learned that thinking longer (using more tokens for reasoning) helps - and the RL training pulls in this direction.

Process supervision vs outcome supervision

A design choice that has been actively explored.

Outcome supervision. Reward the final answer only. Simple, scalable, but provides no signal about which steps in the reasoning chain were good or bad. If the model produces 100 reasoning steps and gets the wrong answer, all 100 steps share the negative advantage uniformly.

Process supervision. Reward each step (or a meaningful subset of steps) individually. Lightman et al. (2023) “Let’s Verify Step by Step” showed that a step-level reward model - trained on human annotations of which reasoning steps are correct - substantially outperforms outcome-only supervision for math reasoning.

The trade-off. Process supervision is more sample-efficient per problem but requires step-level annotations, which are expensive. Outcome supervision is cheaper but noisier. Modern reasoning-RL stacks often combine both: outcome supervision as the base, process supervision for difficult problems where outcome alone is too noisy.

A related observation. Inference-time search with a step-level reward model is itself a powerful technique: generate many partial chains, score each step, prune the low-scoring ones, expand the high-scoring ones. This is closer to MCTS than to vanilla autoregressive generation. The relationship between training-time RL and inference-time search is the next subsection.

Inference-time compute and the training-inference relationship

A structural feature of reasoning models. They use more compute at inference than non-reasoning LMs:

  • Non-reasoning LM: input prompt, output answer. Typical inference: hundreds to thousands of tokens.

  • Reasoning model: input prompt, output a long reasoning chain followed by the answer. Typical inference: 10410^410510^5 tokens.

The “thinking time” is real compute. A reasoning model’s responses are often 10–100× slower and more expensive to generate than a non-reasoning model’s. This is the deliberate-deliberation trade-off: spend more compute at inference time, get better answers on hard problems.

Two consequences worth noting.

First, inference-time compute as a scaling axis. OpenAI and others have observed that increasing the inference budget (e.g., generating more candidate solutions and majority-voting, or running longer chains) reliably improves accuracy on hard reasoning tasks. This is a new scaling axis distinct from model size or training data - and one of the most-studied phenomena of 2024–2026.

Second, training-inference symmetry. GRPO training samples KK completions per prompt and updates the policy to reinforce the best. Inference-time best-of-K sampling generates KK completions and picks the best by a separate verifier or scorer. The two are related: training shifts the policy toward what best-of-K sampling would produce; inference-time best-of-K can be applied even after training to extract additional performance. The optimal allocation between training-time RL and inference-time compute is an open practical question (a version of OP-RL-1 and OP-FM-1).

Reward hacking in reasoning-RL

A failure mode worth flagging explicitly. Verifier-based rewards are cheap but not robust - a model can find ways to satisfy the verifier without actually solving the problem.

Examples documented in the reasoning-RL literature:

  • Output gaming. The model produces output formatted to pass the verifier (e.g., the right answer string in the right place) without correct reasoning. Especially when the verifier is a simple string-matcher.

  • Reward over-optimization. With long enough training, the model finds idiosyncratic patterns the verifier accepts but humans would not. This is the reasoning-RL analogue of RLHF reward hacking.

  • Verifier laundering. If the verifier is itself a model (rather than a deterministic checker), the policy may learn to produce outputs that fool the verifier without being correct.

Mitigations include verifier hardening (deterministic checks where possible), process supervision (which is harder to game in detail), and adversarial verifier evaluation (deliberately attempt to fool the verifier and audit cases of success).

The 2026 reasoning-RL landscape

A summary. The dominant paradigm:

  • Base model. A strong pretrained + SFT’d LM (often Llama, Qwen, DeepSeek, or proprietary frontier model).

  • Training algorithm. GRPO (the simplest; default for new work) or PPO-for-RLHF (legacy); occasionally DPO-flavoured variants.

  • Reward signal. Verifier-based for math, code, formal logic; process-supervised for very hard problems; outcome-supervised at scale.

  • Inference time. 10410^410510^5 tokens of reasoning per response on hard problems; combined with best-of-K sampling for the highest-stakes applications.

The technical material above will date faster than the rest of the chapter. The conceptual structure - verifiable rewards, GRPO-style group-relative training, training-inference compute trade-off, emergent reasoning behaviours - is what is likely to remain. The specific algorithms, reward designs, and verifier engineering practices will continue to evolve.

Editorial note. Reasoning-RL is the most recent of the chapter’s topics and the most likely to be substantively revised in the next 12–24 months. The 2024–2025 period has seen rapid progress; this section is a snapshot of the recipe as it stood in mid-2026.


§13. Connections to Other Chapters

RL is structurally connected to most other chapters in the book - as a paradigm that underlies modern LM post-training, as a theory subject with open sample-complexity questions, as the algorithmic substrate for embodied agents, and as the natural extension of classical planning. The cross-references below are dependency statements, not pointers.

  • Theoretical Foundations of Learning §10 develops the sample-complexity theory for RL: PAC-MDP, regret bounds, Bellman-eluder dimension and successors, and the severe theory-practice gap (OP-TH-5). This chapter develops the algorithms; the theory chapter develops the analysis. Recommended to read this chapter’s §3§4 alongside theory §10 if both are new to the reader.

  • Large Language Models §5 develops RLHF and DPO at the LM scale specifically, with the LM-specific worked examples and notation. This chapter’s §10 places those algorithms in the RL framework. Together they cover the same material from complementary angles. Forward references in this chapter to LLM §5 mean: for the LM-specific algorithmic detail (notation, training-loop pseudocode at LM scale, deployment considerations), see there.

  • Foundation Models §7 develops adaptation methods as a family - prompting, PEFT, full fine-tuning, preference-based methods (RLHF/DPO/GRPO/RLAIF), distillation. The RL-flavoured adaptation methods are this chapter’s §10. The two sections develop the same material at different layers of abstraction (FM §7 surveys the spectrum; RL §10 develops the mechanism).

  • Self-Supervised Learning §1 discusses what SSL does not cover; RL is the canonical not-SSL paradigm. SSL provides representations that RL can use; RL provides post-training that SSL does not. The two are complementary stages of modern training stacks (FM §5 + FM §7).

  • Deep Learning provides the architectures (MLPs, CNNs, Transformers) that modern deep RL trains. Specifically: DQN’s CNN follows DL’s CNN material; PPO’s policy and value networks are MLPs or Transformers; the LM in RLHF is the Transformer of DL §6. This chapter assumes the architectural material rather than re-developing it.

  • AI Agents and Tool Use (planned) develops deployed agentic systems. Many such systems use a (sometimes RL-trained) policy over high-level actions and tool calls. The relationship: RL is one of the substrates for agents, but agents are not purely an RL phenomenon - agentic LMs in practice rely heavily on supervised behaviour, tool-use scaffolding, and inference-time control flow that this chapter does not cover.

  • Multi-Agent Systems (planned) develops RL in the multi-agent setting. Game-theoretic RL, decentralized cooperation, self-play with multiple distinct agents - these are substantial enough to warrant their own chapter; we touch on multi-agent self-play (AlphaZero) in §8 and otherwise refer to the planned chapter.

  • Planning (AIMA Part III material, retained in the relevant chapter) is the model-based, exact-reasoning sibling of RL. Where RL learns from interaction, planning reasons over a known model. The boundary is fluid: AlphaZero (§8) combines planning (MCTS) with RL (self-play training); MuZero learns the model that planning uses; model-based RL more broadly (§8) is the bridge between the two paradigms.

  • Game Playing (AIMA Ch 5–6 material) is the classical antecedent of two-player adversarial RL. AlphaZero is the modern instance; the cross-reference is mostly historical.

  • Causality raises the question of causal reasoning in agents - distinct from the correlational learning of standard RL. Off-policy evaluation, distribution-shift-robust RL, and offline-RL safety all interact with the causal framework. The Causality chapter will develop this; this chapter is largely silent on it.

  • Robotics (planned) is the largest application domain of RL outside LM post-training. Sim-to-real transfer, manipulation, locomotion, real-time control - this chapter sketches the algorithmic substrate but defers to the robotics chapter for the engineering specifics.

  • Evaluation raises the question of how to measure RL agent performance. Reward, regret, sample efficiency, generalization - each is a different evaluative dimension. Cross-references in §10 about reward hacking and reward-model evaluation are most directly relevant.

  • Alignment / Safety treats reward hacking, specification gaming, scalable oversight, and the safety properties of RL-trained systems as central concerns. OP-RL-3 and OP-RL-5 are the open problems that the Alignment chapter develops in depth.


§14. Limitations and Open Problems

This is the consolidated open-problems list for the RL chapter. Each item carries an OP-RL-N identifier so other chapters can cross-reference.

  • OP-RL-1. Sample efficiency at scale. Modern deep RL routinely consumes 10710^7 to 101010^{10} environment interactions per training run. For domains where interaction is expensive (robotics, healthcare, real-world deployment), this is prohibitive. Improvements via model-based methods (§8), offline pretraining (§9), and pretrained-model warm-starts have helped but not closed the gap. The sample efficiency of frontier RL - RL at the scale of training a state-of-the-art language model with RLHF or reasoning-RL - is one of the dominant practical questions. Relates to OP-TH-5 (sample complexity theory) and OP-FM-2 (foundation-model training data).

  • OP-RL-2. Exploration in sparse-reward, high-dimensional settings. ϵ\epsilon-greedy, entropy bonuses, and intrinsic-motivation methods (count-based, Curiosity, RND, §11) have made progress but are not a general solution. Hard-exploration benchmarks (Montezuma’s Revenge, NetHack, sparse-reward robotics) remain partially solved at best. A principled, scalable approach to exploration in deep RL is one of the longest-standing open problems.

  • OP-RL-3. Reward hacking and specification gaming. Agents exploit imperfections in reward functions, producing high-reward behaviour that does not match the designer’s intent. This generalizes from classical bandit-style examples (boat-racing agents that loop around bonus collectibles instead of finishing the race) to modern RLHF reward hacking (LMs that produce confidence and verbosity to fool reward models) and reasoning-RL verifier gaming (outputs formatted to pass verifiers without correct reasoning). The fundamental difficulty: any specified reward function is a proxy for what the designer actually wants, and a sufficiently capable optimizer can find divergences between the proxy and the goal. Relates to the broader alignment problem (Alignment chapter).

  • OP-RL-4. Offline RL in the function-approximation regime. Pessimism-based methods (CQL, IQL, AWAC, §9) provably improve over behavioural cloning in restricted theoretical settings, but the function-approximation regime where deep nets approximate Q-functions has no clean theory. Concretely: under what conditions does offline RL beat behavioural cloning at deep-net scale? When can offline-to-online fine-tuning safely combine the two? Both are open.

  • OP-RL-5. RLHF reward-model overfitting and reward hacking. A specific instance of OP-RL-3 worth flagging separately because of its practical centrality. Reward models for LLM RLHF are trained on bounded preference data and overfit to the training distribution. As the policy moves away from the training data, the reward model’s predictions become unreliable, and the policy can find ways to exploit the reward model rather than improve at the underlying task. KL penalties mitigate but do not solve. The DPO formulation (§10) eliminates the explicit reward model but inherits a related failure mode: overfitting to the preference data itself. Relates to OP-LLM-3.

  • OP-RL-6. Theory–practice gap in deep RL. Classical PAC-MDP bounds (Theory §10) are vacuous at modern scales; function-approximation bounds require structural assumptions that are not verifiably satisfied by deep nets on real environments. The result: modern deep RL is engineering with empirical validation, not derivation from theoretical principles. Whether this gap can be closed - through better complexity measures, tighter analyses, or new theoretical frameworks - is open. Shared with Theory chapter (OP-TH-5).

  • OP-RL-7. Multi-task, transfer, continual RL. Most RL algorithms train on one task at a time. Transferring an RL agent across tasks (multi-task RL), reusing learned skills (transfer RL), and learning incrementally without forgetting (continual RL) are all areas with substantial empirical work but no settled algorithmic recipe. Recent work using pretrained Foundation Models as policy backbones (e.g., VLA models for robotics) is one direction; whether this is a general solution or specific to language-grounded tasks is open.

  • OP-RL-8. Sim-to-real transfer in robotics. Training in simulation is sample-efficient; deploying in the real world is the actual goal. The gap between simulator and reality (the “reality gap”) is large enough that direct transfer often fails. Domain randomization, domain adaptation, real-world fine-tuning, and learned-simulator approaches have all been studied; no approach is fully satisfactory. Sim-to-real is the single most consequential practical-engineering open problem in RL for robotics.

  • OP-RL-9. Optimal allocation of training-time and inference-time compute. Reasoning models (§12) raised this question explicitly: how much compute should be spent on RL training vs on inference-time sampling/search? The trade-off has implications for production deployment, scaling-law analysis, and architectural design choices. Open practical question; partial answers from the o1/o3/R1 lineage but no general theory.

  • OP-RL-10. Safe exploration and constrained RL. In many deployment regimes, the agent must not perform certain actions (irreversible damage, ethical violations, regulatory non-compliance) even during exploration. Constrained RL formalisms exist; deep-RL algorithms that provably respect constraints during training do not. Relates to OP-RL-3 and the broader safety question.


§15. Further Reading

Opinionated list. Not exhaustive; intended as a reading-order recommendation for someone entering the field. Where a textbook and a paper cover similar territory, the textbook is recommended first.

Textbooks and surveys

  • Sutton, R. S., and Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). The standard textbook. Covers everything from bandits to policy gradients in the tabular and basic function-approximation settings. Read first if entering the field cold. Available free online from the authors.

  • Szepesvári, C. (2010). Algorithms for Reinforcement Learning. A shorter, more mathematically dense treatment. Useful complement to Sutton-Barto for the theoretical material.

  • Achiam, J. Spinning Up in Deep RL (OpenAI, 2018). An online educational resource with implementations of the major modern algorithms (VPG, PPO, DDPG, TD3, SAC). The single most useful resource for an aspiring deep-RL implementer.

Classical foundations

  • Watkins, C. J. C. H., and Dayan, P. (1992). “Q-learning.” Tabular Q-learning convergence proof.

  • Sutton, R. S. (1988). “Learning to Predict by the Methods of Temporal Differences.” Foundational TD-learning paper.

  • Tesauro, G. (1995). “Temporal Difference Learning and TD-Gammon.” The first deep-RL-flavoured success.

  • Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (2000). “Policy Gradient Methods for Reinforcement Learning with Function Approximation.” The policy-gradient theorem.

Deep RL: DQN family

  • Mnih, V., et al. (2015). “Human-level control through deep reinforcement learning.” The Nature DQN paper. Mandatory background.

  • van Hasselt, H., Guez, A., and Silver, D. (2016). “Deep Reinforcement Learning with Double Q-Learning.” Double DQN.

  • Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2016). “Prioritized Experience Replay.” Standard add-on.

  • Hessel, M., et al. (2018). “Rainbow: Combining Improvements in Deep Reinforcement Learning.” The combined refinement.

Deep RL: policy gradient and actor-critic

  • Schulman, J., Levine, S., Moritz, P., Jordan, M., and Abbeel, P. (2015). “Trust Region Policy Optimization.” TRPO.

  • Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). “Proximal Policy Optimization Algorithms.” PPO.

  • Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2016). “High-Dimensional Continuous Control Using Generalized Advantage Estimation.” GAE.

  • Mnih, V., et al. (2016). “Asynchronous Methods for Deep Reinforcement Learning.” A3C/A2C.

Continuous control

  • Lillicrap, T. P., et al. (2015). “Continuous control with deep reinforcement learning.” DDPG.

  • Fujimoto, S., van Hoof, H., and Meger, D. (2018). “Addressing Function Approximation Error in Actor-Critic Methods.” TD3.

  • Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.” SAC.

Model-based RL

  • Ha, D., and Schmidhuber, J. (2018). “World Models.” The foundational modern paper.

  • Schrittwieser, J., et al. (2020). “Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model.” MuZero.

  • Hafner, D., Pasukonis, J., Ba, J., and Lillicrap, T. (2023). “Mastering Diverse Domains through World Models.” Dreamer V3.

  • Silver, D., et al. (2017). “Mastering the Game of Go without Human Knowledge.” AlphaZero.

Offline RL

  • Levine, S., Kumar, A., Tucker, G., and Fu, J. (2020). “Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems.” Comprehensive survey.

  • Kumar, A., Zhou, A., Tucker, G., and Levine, S. (2020). “Conservative Q-Learning for Offline Reinforcement Learning.” CQL.

  • Kostrikov, I., Nair, A., and Levine, S. (2022). “Offline Reinforcement Learning with Implicit Q-Learning.” IQL.

  • Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. (2020). “D4RL: Datasets for Deep Data-Driven Reinforcement Learning.” The standard benchmark.

RLHF, DPO, GRPO

  • Christiano, P. F., et al. (2017). “Deep Reinforcement Learning from Human Preferences.” The foundational RLHF paper.

  • Stiennon, N., et al. (2020). “Learning to Summarize from Human Feedback.” RLHF demonstrated at scale.

  • Ouyang, L., et al. (2022). “Training Language Models to Follow Instructions with Human Feedback.” InstructGPT.

  • Bai, Y., et al. (2022). “Constitutional AI: Harmlessness from AI Feedback.” RLAIF.

  • Rafailov, R., et al. (2023). “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” DPO.

  • DeepSeek AI (2025). “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” GRPO and the reasoning-RL recipe.

Exploration

  • Bellemare, M. G., et al. (2016). “Unifying Count-Based Exploration and Intrinsic Motivation.” Count-based for deep RL.

  • Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. (2017). “Curiosity-driven Exploration by Self-supervised Prediction.” Curiosity.

  • Burda, Y., Edwards, H., Storkey, A., and Klimov, O. (2018). “Exploration by Random Network Distillation.” RND.

Reasoning models and process supervision

  • Lightman, H., et al. (2023). “Let’s Verify Step by Step.” Process supervision for math reasoning.

  • OpenAI (2024). “Learning to Reason with LLMs.” The o1 announcement (technical details limited).

  • DeepSeek AI (2025). “DeepSeek-R1.” Open-method publication of reasoning-RL with GRPO.

Reading-order recommendation

For someone entering the field cold: read Sutton-Barto first (textbook, including the deep-RL appendices). Then Achiam’s Spinning Up alongside implementing PPO and DQN. Then the DQN Nature paper, the PPO paper, and the SAC paper as the canonical modern algorithms. Then Christiano et al. 2017 and the InstructGPT paper for RLHF, followed by the DPO paper. The reasoning-RL papers (o1 announcement, DeepSeek-R1) are most useful after the rest is in hand.


§16. Exercises and Experiments

Research-style exercises designed to develop hands-on understanding of the algorithms and make the empirical-theoretical gap concrete.

  • E1. Tabular Q-learning on a gridworld. Implement tabular Q-learning on the 4×34 \times 3 gridworld of §3. Compare its converged Q-values against value iteration’s VV^* (computed by solving the Bellman optimality equation directly). Investigate the effect of (a) different ϵ\epsilon schedules, (b) different learning-rate schedules, (c) different discount factors. Plot reward-per-episode and Q-value-error-over-time learning curves. Confirm Watkins-Dayan convergence empirically.

  • E2. DQN on a small Atari subset. Implement DQN with experience replay and target networks; train on a small subset of Atari games (e.g., Pong, Breakout, Space Invaders). Reproduce the DQN paper’s qualitative results on at least one game. Then ablate experience replay and target networks separately to confirm they are both necessary for stability.

  • E3. PPO on a continuous-control task. Implement PPO with GAE; train on MuJoCo HalfCheetah or a Gymnasium continuous-control benchmark. Compare against the published PPO baselines. Investigate the effect of the clipping parameter ϵ\epsilon (try 0.1,0.2,0.50.1, 0.2, 0.5). Plot the training-curve sensitivity to ϵ\epsilon.

  • E4. SAC vs TD3 vs PPO comparison. On the same continuous-control benchmark, train all three algorithms with comparable compute budgets. Plot the sample-efficiency curves (return vs environment steps). Quantify the empirical sample-efficiency advantage of off-policy SAC/TD3 over on-policy PPO. Compare your results to the figures in the SAC and TD3 papers.

  • E5. Offline RL on a fixed dataset. Take a behaviour-cloning baseline and CQL (or IQL); train both on a D4RL dataset (e.g., HalfCheetah-medium). Evaluate by deploying the trained policy back in the environment and measuring real return. Quantify the improvement of CQL/IQL over behavioural cloning. Investigate the effect of dataset size and dataset quality (medium vs medium-replay).

  • E6. Small-scale RLHF on a tiny model. Take a small Transformer LM (10\sim 10100100M parameters). Generate a preference dataset on a synthetic task (e.g., summarization of short texts where length-then-quality is the preference order). Implement (a) PPO-for-RLHF and (b) DPO. Compare their final policy quality and training stability. Reproduce the DPO paper’s claim that DPO is simpler and more stable than PPO at this scale.

  • E7. Reasoning-RL on a small math problem set. Take a small instruction-tuned LM. Define a small math problem set with deterministic verifiers (e.g., GSM8K subset). Implement GRPO. Train and evaluate on held-out problems. Investigate whether reasoning-RL produces qualitative behavioural changes (longer responses, self-correction). Compare to a baseline of best-of-K inference-time sampling without RL training.

  • E8. Implementing UCB on a simulated bandit. Implement ϵ\epsilon-greedy and UCB on a 10-armed bandit with reward means drawn from N(0,1)\mathcal{N}(0, 1) and reward noise N(μ,1)\mathcal{N}(\mu, 1). Plot cumulative regret over T=10000T = 10000 rounds, averaged over 100 random seeds. Confirm UCB’s TlogT\sqrt{T \log T} regret empirically.

  • E9. Exploration-bonus comparison. On a sparse-reward gridworld, train DQN with (a) no exploration bonus, (b) count-based exploration bonus, (c) RND. Plot training curves. Observe which methods solve the task and which fail.


AI: A Living Reference by Fuzue. Content licensed under CC BY-SA 4.0 - share, adapt, and build on it; keep the attribution and the open licence on derivatives.