Planning & Reasoning: thinking several steps ahead before committing to action.
Single-shot generation is a one-way door — the model commits to a trajectory without any ability to look ahead, reconsider, or backtrack. Yet most hard problems require exactly that: provisional steps, hypothetical branches, and the discipline to abandon a path when it proves wrong. This chapter traces the arc from classical AI planning through chain-of-thought prompting to the modern inference-time search methods — Tree of Thoughts, MCTS-guided decoding, Reflexion — that are quietly closing the gap between LLM intuition and human deliberation.
Prerequisites
This chapter assumes you've read Agent Fundamentals (Ch 01) for the sense-plan-act loop and classical planning background, and LLM-Based Agents (Ch 02) for chain-of-thought, ReAct, and the prompt-as-program model. Some sections reference MCTS, which is introduced here from scratch but is covered in depth in Model-Based RL (Part IX Ch 05). Familiarity with process reward models is helpful but not required.
The Planning Problem
Planning is the process of deciding, before acting, which sequence of actions will move an agent from an initial state to a goal state. The problem sounds deceptively simple, but its complexity grows combinatorially: a domain with \(n\) binary state variables has \(2^n\) possible world states, and finding the optimal path through that space is NP-hard in general.
Classical AI addressed planning with formal languages — STRIPS (1971), then PDDL — that let engineers specify actions as precondition/effect pairs and let dedicated planners (FF, Fast Downward) search the resulting state space systematically. These systems could solve complex logistics problems with provable optimality guarantees. They were also brittle: the world had to match the model exactly, and every new domain required exhaustive hand-engineering.
Two Kinds of Planning
It helps to distinguish task planning — deciding what to do at the level of high-level sub-goals — from motion or execution planning — deciding how to carry out each sub-goal given real-world constraints. Classical planners excelled at the former in well-defined symbolic domains. Learning-based agents, and LLMs in particular, have disrupted task planning while largely leaving the lower levels to traditional control and robotics pipelines.
A second distinction: open-loop planning constructs a full plan offline, then executes it without consulting the environment. Closed-loop planning interleaves planning and execution, revising the plan as new observations arrive. LLM agents operate almost exclusively in closed-loop mode — the environment changes, tools return unexpected results, and the agent must replan on the fly.
Standard autoregressive generation is horizon-zero planning: each token is chosen to be most likely given what came before, with no lookahead. This works well when the right answer is locally obvious, but breaks down whenever the optimal current token depends on what must come later — a trap that affects mathematical proofs, multi-step code, and any task requiring backtracking.
Why LLMs Changed the Calculus
Pre-LLM agents needed a hand-crafted world model to plan anything. LLMs arrive pre-loaded with implicit knowledge about how the world works — enough to reason about novel domains without explicit modelling. The question the field has spent the past three years answering is: can we elicit that implicit planning knowledge effectively, and can we supplement it with explicit search when the implicit reasoning runs out?
Chain-of-Thought as Implicit Planning
Chain-of-thought (CoT) prompting, introduced by Wei et al. (2022), elicits multi-step reasoning by simply including worked examples in the prompt. The observation was striking: models that failed on a problem when asked for a direct answer succeeded when prompted to think step by step. CoT keeps planning implicit — the reasoning chain is visible in the output, but the model is not explicitly searching a state space. It is pattern-matching against demonstrations of deliberate thought.
Zero-shot CoT (Kojima et al., 2022) pushed this further: just appending "Let's think step by step" to a prompt improved performance across benchmarks, with no examples required. This suggested the model had internalized reasoning patterns from pretraining on human text and needed only a nudge to apply them.
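As a concrete sketch, zero-shot CoT is a one-line prompt change followed by an answer-extraction call. The `llm` callable below is a hypothetical text-to-text completion interface, not any particular API:

```python
def zero_shot_cot(question: str, llm) -> str:
    """Zero-shot CoT: append the trigger phrase, then distill a final answer.
    `llm` is any text -> text completion function (assumed interface)."""
    trace = llm(f"Q: {question}\nA: Let's think step by step.")
    # A second call extracts a short final answer from the reasoning trace.
    return llm(f"{trace}\nTherefore, the final answer is:")
```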
The Scratchpad as Working Memory
CoT's key function is externalizing intermediate computation into the context window, which acts as working memory. A model reasoning over 17 arithmetic steps would fail if forced to keep all intermediate values in implicit "weights-space" — but writing them out lets it attend back to earlier results accurately. The same applies to logical deduction, code debugging, and natural-language argumentation.
Limitations of Linear CoT
CoT is linear: the model produces one reasoning trace, left to right, with no ability to explore alternatives or backtrack if a chain of reasoning leads to a dead end. It also cannot self-correct mid-trace — once a wrong assumption is committed to text, subsequent tokens tend to follow it rather than contradict it. These limitations motivated the search-based approaches that follow.
A practical ceiling was identified by Valmeekam et al. (2022) and subsequent work: on classical planning benchmarks like BlocksWorld, CoT-prompted LLMs fail catastrophically even at small problem sizes — not because they lack domain knowledge, but because they cannot systematically explore the state space. Fluent generation of plausible-sounding plans is not the same as actually finding valid ones.
Least-to-Most Decomposition
Least-to-most prompting (Zhou et al., 2022) addresses a specific failure mode of CoT: tasks that require building up capabilities in order. Where CoT attacks a problem monolithically, least-to-most breaks it into a hierarchy of sub-problems, from the simplest prerequisite to the final answer. Each sub-problem is solved in turn, with its answer added to the context before tackling the next.
The prompt has two phases. First, the model is asked to decompose the problem: "What sub-problems do I need to solve first?" Then each sub-problem is answered in sequence, with prior answers visible. This mirrors how humans actually solve hard problems — building skills and partial results before assembling the full solution.
Problem: "Which mountain range do Flagstaff, AZ and Boulder, CO both border?"
1. What mountain range borders Flagstaff, AZ?
2. What mountain range borders Boulder, CO?
3. Are they the same?
Answer sub-1: Flagstaff is near the San Francisco Peaks, part of the Rocky Mountains.
Answer sub-2: Boulder borders the Front Range of the Rocky Mountains.
Answer sub-3: Both border the Rocky Mountains. → Rocky Mountains
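The two-phase loop above can be sketched in a few lines. `llm` is again a hypothetical text-to-text completion function, and the decomposition prompt wording is illustrative:

```python
def least_to_most(problem: str, llm) -> str:
    """Least-to-most prompting sketch: decompose, then solve sub-problems in
    order, keeping every earlier answer visible in the context."""
    decomposition = llm(
        f"To solve '{problem}', what sub-problems must be solved first? "
        "List one per line, simplest first."
    )
    subproblems = [s.strip() for s in decomposition.splitlines() if s.strip()]
    context = f"Problem: {problem}\n"
    for sub in subproblems:
        answer = llm(f"{context}Q: {sub}\nA:")
        context += f"Q: {sub}\nA: {answer}\n"   # prior answers stay in context
    return llm(f"{context}Q: {problem}\nA:")    # final answer sees full chain
```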
Relation to Classical Hierarchical Planning
Least-to-most is a prompt-level analogue of hierarchical task network (HTN) planning, in which tasks are recursively decomposed into sub-tasks until primitive actions are reached. The key difference is that HTN planners work with explicit, pre-specified decomposition operators, while least-to-most relies on the model to generate the decomposition itself — which works well when the model has seen analogous structures in pretraining data, and poorly when it hasn't.
Plan-and-Solve Prompting
A related variant, Plan-and-Solve (Wang et al., 2023), separates planning from execution within the prompt: first ask the model to produce a step-by-step plan, then separately execute each step. This reduces the "sycophantic overcommitment" problem where the model executes a flawed plan without questioning it — by making the plan explicit, it becomes easier to audit and correct before execution begins.
Empirically, decomposition-based methods provide the largest gains on tasks that are compositional — where the answer depends on a chain of facts or operations — and provide less benefit on tasks requiring a single insight or where the necessary sub-problems aren't predictable in advance. Multi-hop QA, symbolic manipulation, and long-horizon code generation are strong candidates; creative writing and open-ended generation less so.
Tree of Thoughts
Tree of Thoughts (ToT, Yao et al., 2023) is the most direct attempt to graft classical search onto an LLM's generation process. The core idea: instead of generating a single linear chain of reasoning, generate multiple candidate thoughts at each step, evaluate them, and use search (breadth-first, depth-first, or beam) to navigate the resulting tree. The LLM serves triple duty as thought generator, state evaluator, and — when used in a self-evaluation mode — critic.
\[\text{Children}(s) = \{\, s \oplus z \;:\; z \in \text{LLM}_{\text{gen}}(s) \,\}\]
\[\text{BFS: } \mathcal{F}_{t+1} = \underset{s' \in \bigcup_{s \in \mathcal{F}_t} \text{Children}(s)}{\operatorname{top-}k}\; V(s')\]
The original ToT paper demonstrated dramatic improvements on three benchmark tasks chosen precisely for requiring deliberate, non-linear reasoning: the Game of 24 (arrange four numbers with arithmetic to reach 24), creative writing with structured constraints, and mini-crossword solving. On the Game of 24, a task where standard GPT-4 CoT succeeds on 4% of problems, ToT reached 74%.
Thought Generation Strategies
Thoughts can be generated in two modes. Sample mode draws multiple independent continuations from the LLM's distribution; this is cheap but can produce near-duplicates. Propose mode prompts the model to enumerate distinct candidate thoughts explicitly — for example, "What are five different approaches to the next step?" — which tends to produce more diverse and complementary candidates at higher cost.
State Evaluation Strategies
Evaluating partial states is the hard part. ToT uses the LLM itself for two kinds of evaluation. Value estimation asks the model to score a partial state as "sure/likely/impossible" given the goal. Vote aggregation generates multiple independent evaluations and takes a majority verdict — useful when the model is uncertain on any single evaluation pass.
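Putting generation and evaluation together, a minimal beam-width BFS over thoughts might look like the following sketch (not the paper's implementation). `propose` and `value` stand in for the LLM-backed generation and evaluation calls described above:

```python
import heapq

def tot_bfs(root: str, propose, value, depth: int = 3, beam: int = 5) -> str:
    """Beam-limited BFS over a tree of thoughts (sketch, assumed interfaces).
    propose(state) -> list of candidate next thoughts ('propose mode')
    value(state)   -> float score for a partial state ('value mode')"""
    frontier = [root]
    for _ in range(depth):
        # Expand every frontier state with its candidate thoughts.
        children = [s + "\n" + t for s in frontier for t in propose(s)]
        if not children:
            break
        # Keep only the top-`beam` children by estimated value.
        frontier = heapq.nlargest(beam, children, key=value)
    return max(frontier, key=value)
```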
Costs and Trade-offs
ToT's power comes at a steep cost: each node expansion requires one or more LLM calls, and each evaluation requires additional calls. A full BFS of depth 3 with branching factor 5 expands 5 + 25 + 125 = 155 nodes, each needing generation plus evaluation calls. For frontier models, this translates to substantial latency and cost. Practitioners typically cap the branching factor at 3–5 and the depth at 3–4, using beam search rather than full BFS/DFS. The compute overhead has limited ToT's adoption in production systems, where simpler CoT + retry strategies often provide enough improvement at a fraction of the cost.
MCTS-Guided Decoding
Monte Carlo Tree Search (MCTS) brings a more principled approach to the tree search problem. Rather than evaluating each node with a single LLM call, MCTS uses rollouts — simulated continuations to the end of the problem — to estimate node value. It balances exploitation (expanding nodes known to be good) with exploration (trying less-visited nodes that might be better) via the UCT selection criterion.
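Concretely, UCT selects at each node the child action maximising an upper confidence bound:

\[a^{*} = \arg\max_{a} \left[ Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}} \right]\]

where \(Q(s, a)\) is the mean rollout return of taking \(a\) from \(s\), \(N(s)\) and \(N(s, a)\) are visit counts, and \(c\) tunes the exploration bonus: rarely-visited actions receive a large second term and get tried; well-explored, high-value actions dominate the first term.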
Applied to language model planning (Hao et al., 2023; "Reasoning via Planning"), MCTS treats each reasoning step as a node and LLM generations as rollouts. The model serves as both transition function (what reasoning step comes next?) and value function (how likely does this path lead to the correct answer?). After enough rollouts, the algorithm concentrates search on the most promising reasoning trajectories.
The World Model Problem
Classical MCTS requires a simulator: a way to advance the state cheaply, many times, for rollouts. For reasoning tasks, the LLM itself acts as the simulator — it generates plausible continuations — but this is expensive. Several papers have experimented with training lightweight value/policy networks alongside the main LLM to make rollouts cheaper, echoing the AlphaGo architecture. This is an active research area as of 2025.
AlphaCode 2 and Mathematical Reasoning
DeepMind's AlphaCode 2 (2023) applied MCTS-like sampling over code solutions: generate many candidate programs, filter by execution tests, use the passing tests as a reward signal, and iteratively improve. This is not strictly MCTS (there is no tree; samples are independent) but it captures the essential idea — parallel sampling plus a value signal enables much better performance than single-shot generation. Similar patterns appear in math reasoning: generating 64–1024 solutions and selecting via a verifier far outperforms greedy decoding.
The simplest "search" strategy is best-of-N: sample N solutions, score each with a verifier or reward model, return the best. For math problems, majority voting (self-consistency) — return the most common answer across N generations — closes much of the gap between greedy and optimal without requiring an external verifier. Both methods trade inference-time compute for accuracy in a roughly log-linear relationship.
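Both strategies fit in a few lines. The `sample`, `score`, and `extract_answer` callables below are assumed interfaces to a sampler, a verifier or reward model, and an answer parser, not any particular library:

```python
from collections import Counter

def best_of_n(prompt, sample, score, n=16):
    """Best-of-N: draw n candidates, return the one the verifier scores highest.
    sample(prompt) -> candidate text; score(candidate) -> float."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=score)

def self_consistency(prompt, sample, extract_answer, n=16):
    """Self-consistency: majority vote over final answers across n samples.
    No external verifier needed; agreement is the quality signal."""
    answers = [extract_answer(sample(prompt)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```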
Compute Scaling at Inference Time
MCTS and related search methods are instances of inference-time compute scaling: using more computation at evaluation time rather than embedding all capability in model weights. Snell et al. (2024) showed that for many tasks, a smaller model with more inference-time compute outperforms a larger model without it — a profound shift in how we think about the model size / compute trade-off. OpenAI's o1/o3 family explicitly commercialized this insight.
Self-Verification & Critique
An LLM's ability to evaluate outputs is often better than its ability to generate them from scratch. This asymmetry — common in human cognition too — motivates a family of generate-then-critique methods where the model acts as its own checker.
Self-Consistency
Wang et al. (2022) introduced self-consistency: sample multiple reasoning chains independently, then take a majority vote over the final answers. The insight is that correct reasoning paths tend to converge on the same answer, while incorrect ones diverge. Self-consistency requires no external critic — it uses the model's own variance as a quality signal. On GSM8K math reasoning, it improved GPT-3 from 56% to 78% accuracy with 40 samples.
Constitutional AI and Self-Critique
Anthropic's Constitutional AI (Bai et al., 2022) introduced a self-critique and revision loop: the model generates a response, then critiques it against a list of principles ("Is this response harmful? Explain why."), then revises in light of the critique. Repeated for multiple rounds, this allows the model to iteratively correct its own outputs — a planning-adjacent technique when applied to agentic task completion.
Self-Refine
Madaan et al. (2023) generalized this into Self-Refine: an iterative loop where the model generates an output, provides natural-language feedback on what's wrong, and then revises the output based on the feedback. The same model handles all three roles. Critically, the feedback step makes improvement criteria explicit — the revision doesn't just try again blindly; it addresses identified flaws. On coding, math, and sentiment transfer benchmarks, Self-Refine consistently outperformed single-shot generation.
def self_refine(task_prompt, llm, max_rounds=4):
    # llm: any text -> text completion call (placeholder interface)
    y = llm(task_prompt)                                   # initial draft
    for _ in range(max_rounds):
        fb = llm(f"Critique this draft. What is wrong and how can it be fixed?\n{y}")
        y = llm(f"Revise the draft using the feedback.\nFeedback: {fb}\nDraft: {y}")
        if llm(f"Is this draft satisfactory? Answer Yes or No.\n{y}").startswith("Yes"):
            break
    return y
Limitations: Sycophancy and the Echo Chamber
Self-critique has a structural weakness: the same model that generated the flawed output is critiquing it. Models tuned to be agreeable exhibit sycophancy — they tend to endorse their own outputs when asked to evaluate them, especially when the output is fluent and coherent. External critics (a separate model, a ground-truth verifier, or a human) are more reliable. The field has moved toward critic models trained separately on preference data, which avoids the echo-chamber problem.
Reflexion & Self-Reflection
Shinn et al. (2023) introduced Reflexion, which adds a long-term memory layer to self-critique: when an agent fails at a task, it generates a verbal self-reflection on what went wrong, stores that reflection in a persistent memory, and conditions its next attempt on the accumulated reflections. Instead of learning by updating weights — which requires gradient-based training — Reflexion learns by writing.
The architecture has three components. An Actor generates actions using the current policy. An Evaluator scores the outcome (pass/fail for code tests, rewards for RL tasks, or LLM-based scoring for open-ended tasks). A Self-Reflection module synthesizes a verbal post-mortem: "I failed because I didn't handle the edge case where the input is empty. Next time I should check for empty input first." These reflections are prepended to the next episode's context.
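The three-component loop can be sketched as follows. `act`, `evaluate`, and `reflect` are assumed interfaces standing in for the Actor, Evaluator, and Self-Reflection modules, not the paper's code:

```python
def reflexion(task, act, evaluate, reflect, max_trials=3):
    """Reflexion episode loop (sketch, assumed interfaces).
    act(task, memory)                 -> attempt, conditioned on reflections
    evaluate(solution)                -> (success: bool, feedback: str)
    reflect(task, solution, feedback) -> verbal post-mortem to remember"""
    memory = []                       # persistent verbal reflections
    solution = None
    for _ in range(max_trials):
        solution = act(task, memory)
        success, feedback = evaluate(solution)
        if success:
            return solution
        # Learn by writing: store a post-mortem for the next attempt.
        memory.append(reflect(task, solution, feedback))
    return solution
```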
Results and Scope
On the HumanEval coding benchmark, Reflexion brought GPT-4 from 80.1% (pass@1) to 91.0% after multiple reflection rounds — a larger gain than any prompting method that doesn't use external feedback. On AlfWorld text-game tasks, it achieved 97% success versus a 45% baseline. The key insight: failing fast, then reasoning carefully about the failure, is a powerful learning signal that doesn't require any weight updates.
Practical Limitations
Reflexion requires a reliable evaluator to tell the agent when it has failed. For code this is easy — run the tests. For open-ended tasks, defining failure is harder. Reflexion also accumulates context across episodes: very long memory sequences can exceed context limits, and early reflections may become stale or contradictory as the task evolves. Forgetting strategies (relevance filtering, summarization) help but introduce their own trade-offs.
Reflexion and related methods (ExpeL, LATS) instantiate a kind of in-context reinforcement learning that operates entirely in token space: the reward signal becomes language, the policy update becomes a natural-language reflection, and the value function is the agent's own assessment of its performance. This sidesteps the brittle infrastructure of gradient-based RL for slow-moving, high-level planning tasks.
Process Reward Models
Most reward models in RLHF evaluate complete outputs — they score the final answer. Process reward models (PRMs) instead score each reasoning step, providing dense supervision over the entire chain of thought. This distinction matters enormously for planning: a model trained with outcome supervision learns that a correct final answer is good but learns nothing about which intermediate steps were responsible; a model trained with process supervision learns which reasoning patterns are reliable at each stage.
OpenAI's "Let's Verify Step by Step" (Lightman et al., 2023) provided the landmark result: human labellers rated each step in a mathematical reasoning chain as correct, incorrect, or neutral, and a PRM trained on this data dramatically improved the ability to identify correct reasoning paths during search. On the MATH benchmark, best-of-1860 generation with PRM selection matched what outcome-reward models needed best-of-~100k to achieve — a roughly 100× improvement in sample efficiency.
Training a PRM
The training objective is straightforward: given a reasoning trace \(s_1, s_2, \ldots, s_T\), predict whether each step \(s_i\) is correct. The challenge is annotation cost — human step-level labelling is expensive. Several approaches mitigate this: (1) using LLMs to automatically label steps, with human spot-checks; (2) using Monte Carlo estimation — from each step, roll out many completions and label the step as correct if a high fraction of rollouts reach the correct answer; (3) contrastive learning between correct and erroneous traces.
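Approach (2), Monte Carlo step labelling, is straightforward to sketch. `rollout` and `is_correct` below are assumed stand-ins for an LLM sampler and an answer checker against the known final answer:

```python
def mc_step_labels(steps, rollout, is_correct, k=8, threshold=0.5):
    """Monte Carlo step labelling for PRM training data (sketch).
    rollout(prefix_steps) -> a completed solution continuing the prefix
    is_correct(solution)  -> bool against the known final answer.
    A step is labelled correct if >= threshold of k rollouts from it succeed."""
    labels = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        wins = sum(is_correct(rollout(prefix)) for _ in range(k))
        labels.append(wins / k >= threshold)
    return labels
```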
PRMs in Production
Process reward models are now a key ingredient in the reasoning-model training pipelines behind o1, o3, and DeepSeek-R1. They enable scalable oversight by providing dense, reliable training signal over long reasoning chains — something that outcome-only reward models cannot. PRMs also enable search at test time without human labels: simply sample many continuations from each step and use PRM scores to select the best path.
Neurosymbolic Hybrids
Pure neural planning via CoT/ToT runs into a structural problem: the model's reasoning is opaque and unverifiable. You cannot easily confirm that a multi-step argument is logically valid, only that it looks plausible. Neurosymbolic hybrids address this by offloading the formal reasoning to an external symbolic system — a theorem prover, a constraint solver, an SMT solver, or a classical planner — while using the LLM to translate between natural language and formal representations.
The Translator Pattern
The dominant architecture is a translator pipeline: (1) the LLM converts a natural-language problem into a formal specification; (2) a symbolic solver produces a provably correct solution; (3) the LLM translates the formal solution back into natural language. The LLM is doing what it does well (understanding language, generating plausible structure) while the solver does what it does well (exhaustive, correct search).
This pattern appears in Logic-LM (Pan et al., 2023), which converts questions to first-order logic and uses Prolog; in Code Interpreter agent mode, which generates Python code and runs it (the interpreter as solver); and in SatLM (Ye et al., 2023), which converts word problems to SAT constraints. All share the same failure mode: if the LLM mistranslates the problem into the formal language, the solver finds the correct answer to the wrong problem.
LLM + Classical Planner
Agents like LLM+P (Liu et al., 2023) use an LLM to translate a natural-language goal into PDDL, then call Fast Downward or FF to find a valid plan, then translate the plan back. On standard PDDL benchmarks, this achieves near-perfect planning performance that pure LLM CoT cannot — but requires a clean, structured environment that maps to PDDL. In messy real-world tasks, the translation step is the bottleneck.
Symbolic systems need precise, unambiguous representations. Natural language is ambiguous, context-dependent, and full of implicit assumptions. The LLM's job as translator is to resolve all that ambiguity before handing off to the solver — and this is precisely where translation fails. Grounding errors (hallucinated predicates, missing constraints, wrong variable bindings) are hard to detect automatically and can produce plans that are formally valid but semantically wrong.
Satisfiability and Constraint Solving
For planning problems that can be expressed as constraint satisfaction — scheduling, resource allocation, configuration — LLMs paired with SMT solvers (Z3, CVC5) or CP solvers (OR-Tools) provide a powerful combination. The LLM handles natural-language requirements and edge cases that would be tedious to encode manually; the solver handles the combinatorial search. This pairing has shown commercial viability in enterprise scheduling and logistics applications.
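To make the division of labour concrete, here is a toy stand-in for the solver half: exhaustive search over finite domains. A real deployment would hand the same constraints to Z3 or OR-Tools; this stdlib sketch only illustrates the interface the LLM must translate natural-language requirements into:

```python
from itertools import product

def solve_csp(variables, domains, constraints):
    """Exhaustive constraint search over finite domains (toy stand-in for an
    SMT/CP solver). constraints: predicates over an assignment dict."""
    for values in product(*(domains[v] for v in variables)):
        assignment = dict(zip(variables, values))
        if all(c(assignment) for c in constraints):
            return assignment
    return None  # unsatisfiable

# Toy scheduling instance: two tasks, three time slots, no overlap,
# and task "a" must run before task "b".
solution = solve_csp(
    ["a", "b"],
    {"a": [1, 2, 3], "b": [1, 2, 3]},
    [lambda s: s["a"] != s["b"], lambda s: s["a"] < s["b"]],
)
```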
When LLM Planning Fails
The gap between impressive benchmark results and reliable real-world planning performance is large. Understanding why is essential before deploying agents on consequential tasks. Planning failures fall into a small number of recurring patterns.
- Hallucinated plans. LLMs generate fluent, confident-sounding plans whether or not the plan is correct. The model does not have a reliable "I don't know how to do this" mode for planning tasks. Valmeekam et al. (2023) showed GPT-4 hallucinating valid-looking PDDL plans that violated preconditions on more than 60% of problems — while expressing no uncertainty.
- Limited horizon. Performance degrades sharply as problem length increases. A model that perfectly solves 5-step planning problems may fail at 15-step ones, even if the structure is identical. The model's implicit "planning horizon" is limited by the distribution of examples in training data, not by any principled capacity bound.
- No backtracking. Linear generation commits irrevocably to each step. When an early decision forecloses the correct solution — as in many combinatorial puzzles — autoregressive models have no mechanism to backtrack. Without explicit search (ToT, MCTS) this failure is architectural, not a matter of scale.
- Lost state tracking. Multi-step plans require maintaining a mental model of the current world state. LLMs frequently lose track of which sub-goals have been completed, which constraints have been satisfied, or which resources have been consumed — especially in long episodes. This compounds over time, leading to contradictory actions.
- Surface-pattern matching. LLMs trained on web text associate planning-adjacent surface patterns with certain outputs. A problem phrased to resemble a famous benchmark gets the benchmark answer regardless of the actual constraints. Distribution shift — novel problem formulations — exposes this brittleness immediately.
- Goal drift. In long-horizon agentic tasks, the agent's effective goal drifts from the original instruction as context grows, tool outputs accumulate, and sub-goal completions change the apparent top priority. Without explicit goal tracking, the agent can optimise for a proxy that diverges from user intent over time.
The PlanBench Findings
Valmeekam et al.'s PlanBench (2023) provided the most thorough stress-test of LLM planning. They found: (1) frontier LLMs solve fewer than 12% of Blocksworld problems correctly in greedy decoding, even though humans find most problems trivial; (2) performance does not improve meaningfully with scale alone; (3) chain-of-thought helps marginally but does not close the gap; (4) models that were given the correct plan and asked to verify it performed significantly better than models asked to generate plans — confirming that verification is easier than generation.
The lesson is not that LLMs cannot plan at all — they clearly can in many practical settings — but that their planning capability is fundamentally statistical rather than principled. They generalise well to problems similar to their training distribution and fail in predictable ways outside it. Engineers must design systems accordingly: use verifiers, use external tools for formal correctness guarantees, build in retry and reflection loops, and keep humans in the loop for high-stakes decisions.
Inference-Time Scaling
The methods in this chapter — CoT, ToT, MCTS, Reflexion, PRMs — converged on a common theme: spending more computation at inference time produces better reasoning. OpenAI's o1 (September 2024) was the first major commercial demonstration of training specifically for extended inference-time reasoning, using reinforcement learning to incentivise long internal "thinking" traces before producing a final answer.
o1's internal reasoning (not shown to users in its original deployment) resembles a hybrid of chain-of-thought and backtracking: the model considers multiple approaches, abandons dead ends, and checks its work. The thinking trace can run for hundreds or thousands of tokens before the answer is generated. This is not simply longer CoT — the model is trained via RL to produce traces that maximise correctness, not traces that look like human reasoning.
DeepSeek-R1
DeepSeek-R1 (January 2025) provided the first open-weight reasoning model trained with a similar paradigm. Critically, DeepSeek published the training methodology: start with a base model, apply GRPO (Group Relative Policy Optimization — a PPO variant that replaces the learned value baseline with group-normalised rewards) using simple rule-based rewards for answer correctness and output format, and let the model discover its own reasoning patterns. The result was a model that spontaneously developed backtracking, self-verification, and multi-approach reasoning — emergent behaviours that no training example demonstrated explicitly.
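As described in the DeepSeek reports, GRPO samples a group of \(G\) responses per prompt, scores each with a reward \(r_i\), and normalises within the group:

\[\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}\]

This group-relative advantage \(\hat{A}_i\) is then used in a PPO-style clipped policy objective, eliminating the separately-trained value network.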
The Scaling Curve
Snell et al. (2024) mapped the inference-time scaling curve precisely: for a fixed training compute budget, additional inference compute (via best-of-N, MCTS, beam search, or longer thinking traces) follows a power law improvement in accuracy up to a point, then plateaus. The optimal allocation between training and inference compute depends on the task: for tasks with reliable verifiers, heavy inference scaling is efficient; for tasks requiring generalisation across distributions, training compute matters more.
The shift from "prompt LLMs to plan" to "train LLMs to plan via RL" is profound. Prompting methods depend on the planning capability being latent in pretraining data. RL-trained reasoning models acquire planning capability through direct experience of correctness signals — they don't need to have seen examples of backtracking; they discover backtracking because it works. This suggests that reasoning-model training, not prompting, is the long-run path to reliable LLM planning.
Benchmarks & the Planning Landscape
Planning and reasoning ability is evaluated across a heterogeneous set of benchmarks, each stressing different aspects of the capability. Progress on any single benchmark should be interpreted cautiously — methods that look strong on benchmarks designed to reward their assumptions can fail unexpectedly on out-of-distribution problems.
| Benchmark | Tests | Signal Type | Key Finding |
|---|---|---|---|
| GSM8K | Grade-school math word problems | Exact answer | CoT transforms GPT-4 from ~30% to ~90%+ with self-consistency |
| MATH | Competition mathematics | Exact answer | PRM-guided search achieves ~90% (o3); best-of-N scales log-linearly |
| PlanBench | Blocksworld / logistics PDDL | Plan validity | GPT-4 CoT <12%; LLM+P (planner) >95%; gap is structural |
| HumanEval | Python function generation | Test execution | Reflexion: 67% → 91%; o3-mini: ~99.5% |
| ARC-AGI | Abstract grid reasoning | Pattern match | o3 (high compute): 87.5%; strong inference scaling signal |
| Game of 24 | Arithmetic search | Exact solution | CoT: 4%; ToT: 74%; designed to require search |
| WebArena | Multi-step web navigation | Task success | Best agents ~45% (2024); real-world planning difficulty |
| SWE-bench | Real GitHub issue resolution | Test execution | Top agents ~49% (2025); requires planning over large codebases |
What the Numbers Mean
The benchmark landscape tells a nuanced story. On tasks with unambiguous correct answers and reliable verification — math problems, code execution — inference-time scaling via search has driven dramatic improvements, with frontier reasoning models approaching or exceeding expert human performance. On tasks requiring open-ended planning in the real world — web navigation, software engineering, embodied tasks — the gap to human performance remains large and progress is slower.
The distinction is not mainly about model intelligence; it's about the availability of verifiers. Math and code have cheap, reliable verifiers (the solver, the test suite) that can guide search. Real-world tasks often don't. Building reliable automated verifiers for complex real-world tasks is itself an open research problem — and arguably the key bottleneck for the next generation of capable AI agents.
The Planning Stack in 2025
Practitioners assembling agent planning capabilities in 2025 typically combine several layers: a reasoning-capable base model (o3-mini, Claude 3.7, DeepSeek-R1) for internal chain-of-thought; external search (ToT, MCTS, or beam sampling) for tasks with known structure; PRMs or verifiers to guide search; Reflexion-style verbal memory for multi-episode improvement; and classical planners or constraint solvers for sub-problems with formal structure. No single technique dominates — the art lies in knowing which layer to apply to which part of the problem.
Key Papers
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Introduces the ToT framework — generate multiple candidate thoughts, evaluate with the LLM, search via BFS/DFS/beam. The Game of 24 demonstration is the canonical example of search dramatically outperforming linear CoT. The foundational paper for explicit LLM search.
- Let's Verify Step by Step. The PRM paper. Human labellers annotated 800k reasoning steps; a trained PRM outperforms outcome reward models for guiding best-of-N selection by ~100× in sample efficiency on MATH. Essential reading for understanding inference-time search with learned value functions.
- Reflexion: Language Agents with Verbal Reinforcement Learning. Verbal reflection as a substitute for gradient-based learning. The HumanEval result (80% → 91%) without any weight updates is striking. Lays the foundation for episodic verbal RL. Core reference for multi-episode improvement without fine-tuning.
- Self-Refine: Iterative Refinement with Self-Feedback. Iterative generate-critique-revise loop using the same LLM for all three roles. Broad evaluation across seven tasks shows consistent improvement over single-shot generation. The cleanest implementation of self-critique; widely replicated.
- Large Language Models Still Can't Plan (A Benchmark for LLMs on Planning and Reasoning About Change). The sobering counterpoint. Systematic evaluation of frontier LLMs on classical PDDL benchmarks demonstrates near-zero planning accuracy and pervasive hallucinated plans. Essential for calibrating expectations. Read alongside the positive results to get an honest picture.
- Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters. Precise quantification of the inference-time scaling curve. Shows that for many tasks, a smaller model with more inference compute outperforms a larger model without it — reshaping how we think about the training/inference compute trade-off. The empirical foundation of the inference-time scaling paradigm.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. The open-weight reasoning model that matched o1 performance. Key contribution: emergent backtracking and self-verification from pure correctness RL, with no reasoning-format supervision. Full training methodology published. The most detailed public account of how to train a reasoning model.