Planning & Reasoning: thinking several steps ahead before committing to action.
Single-shot generation is a one-way door — the model commits to a trajectory without any ability to look ahead, reconsider, or backtrack. Yet most hard problems require exactly that: provisional steps, hypothetical branches, and the discipline to abandon a path when it proves wrong. This chapter traces the arc from classical AI planning through chain-of-thought prompting to the modern inference-time search methods — Tree of Thoughts, MCTS-guided decoding, Reflexion — that are quietly closing the gap between LLM intuition and human deliberation.
Prerequisites
This chapter assumes you've read Agent Fundamentals (Ch 01) for the sense-plan-act loop and classical planning background, and LLM-Based Agents (Ch 02) for chain-of-thought, ReAct, and the prompt-as-program model. Some sections reference MCTS, which is introduced here from scratch but is covered in depth in Model-Based RL (Part IX Ch 05). Familiarity with process reward models is helpful but not required.
The Planning Problem
Planning is the process of deciding, before acting, which sequence of actions will move an agent from an initial state to a goal state. The problem sounds deceptively simple, but its complexity grows combinatorially: a domain with \(n\) binary state variables has \(2^n\) possible world states, and finding the optimal path through that space is NP-hard in general.
Classical AI addressed planning with formal languages — STRIPS (1971), then PDDL — that let engineers specify actions as precondition/effect pairs and let dedicated planners (FF, Fast Downward) search the resulting state space systematically. These systems could solve complex logistics problems with provable optimality guarantees. They were also brittle: the world had to match the model exactly, and every new domain required exhaustive hand-engineering.
Two Kinds of Planning
It helps to distinguish task planning — deciding what to do at the level of high-level sub-goals — from motion or execution planning — deciding how to carry out each sub-goal given real-world constraints. Classical planners excelled at the former in well-defined symbolic domains. Learning-based agents, and LLMs in particular, have disrupted task planning while largely leaving the lower levels to traditional control and robotics pipelines.
A second distinction: open-loop planning constructs a full plan offline, then executes it without consulting the environment. Closed-loop planning interleaves planning and execution, revising the plan as new observations arrive. LLM agents operate almost exclusively in closed-loop mode — the environment changes, tools return unexpected results, and the agent must replan on the fly.
Standard autoregressive generation is horizon-zero planning: each token is chosen to be most likely given what came before, with no lookahead. This works well when the right answer is locally obvious, but breaks down whenever the optimal current token depends on what must come later — a trap that affects mathematical proofs, multi-step code, and any task requiring backtracking.
Why LLMs Changed the Calculus
Pre-LLM agents needed a hand-crafted world model to plan anything. LLMs arrive pre-loaded with implicit knowledge about how the world works — enough to reason about novel domains without explicit modelling. The question the field has spent the past three years answering is: can we elicit that implicit planning knowledge effectively, and can we supplement it with explicit search when the implicit reasoning runs out?
Chain-of-Thought as Implicit Planning
Chain-of-thought (CoT) prompting, introduced by Wei et al. (2022), elicits multi-step reasoning by simply including worked examples in the prompt. The observation was striking: models that failed on a problem when asked for a direct answer succeeded when prompted to think step by step. CoT keeps planning implicit — the reasoning chain is visible in the output, but the model is not explicitly searching a state space. It is pattern-matching against demonstrations of deliberate thought.
Zero-shot CoT (Kojima et al., 2022) pushed this further: just appending "Let's think step by step" to a prompt improved performance across benchmarks, with no examples required. This suggested the model had internalized reasoning patterns from pretraining on human text and needed only a nudge to apply them.
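As a concrete sketch, zero-shot CoT is a one-line prompt change followed by an answer-extraction call. The `llm` callable below is a hypothetical text-to-text completion interface, not any particular API:

```python
def zero_shot_cot(question: str, llm) -> str:
    """Zero-shot CoT: append the trigger phrase, then distill a final answer.
    `llm` is any text -> text completion function (assumed interface)."""
    trace = llm(f"Q: {question}\nA: Let's think step by step.")
    # A second call extracts a short final answer from the reasoning trace.
    return llm(f"{trace}\nTherefore, the final answer is:")
```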
The Scratchpad as Working Memory
CoT's key function is externalizing intermediate computation into the context window, which acts as working memory. A model reasoning over 17 arithmetic steps would fail if forced to keep all intermediate values in implicit "weights-space" — but writing them out lets it attend back to earlier results accurately. The same applies to logical deduction, code debugging, and natural-language argumentation.
Limitations of Linear CoT
CoT is linear: the model produces one reasoning trace, left to right, with no ability to explore alternatives or backtrack if a chain of reasoning leads to a dead end. It also cannot self-correct mid-trace — once a wrong assumption is committed to text, subsequent tokens tend to follow it rather than contradict it. These limitations motivated the search-based approaches that follow.
A practical ceiling was identified by Valmeekam et al. (2022) and subsequent work: on classical planning benchmarks like BlocksWorld, CoT-prompted LLMs fail catastrophically even at small problem sizes — not because they lack domain knowledge, but because they cannot systematically explore the state space. Fluent generation of plausible-sounding plans is not the same as actually finding valid ones.
Least-to-Most Decomposition
Least-to-most prompting (Zhou et al., 2022) addresses a specific failure mode of CoT: tasks that require building up capabilities in order. Where CoT attacks a problem monolithically, least-to-most breaks it into a hierarchy of sub-problems, from the simplest prerequisite to the final answer. Each sub-problem is solved in turn, with its answer added to the context before tackling the next.
The prompt has two phases. First, the model is asked to decompose the problem: "What sub-problems do I need to solve first?" Then each sub-problem is answered in sequence, with prior answers visible. This mirrors how humans actually solve hard problems — building skills and partial results before assembling the full solution.
Problem: "Which mountain range do Flagstaff, AZ and Boulder, CO both border?"
1. What mountain range borders Flagstaff, AZ?
2. What mountain range borders Boulder, CO?
3. Are they the same?
Answer sub-1: Flagstaff is near the San Francisco Peaks, part of the Rocky Mountains.
Answer sub-2: Boulder borders the Front Range of the Rocky Mountains.
Answer sub-3: Both border the Rocky Mountains. → Rocky Mountains
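The two-phase loop above can be sketched in a few lines. `llm` is again a hypothetical text-to-text completion function, and the decomposition prompt wording is illustrative:

```python
def least_to_most(problem: str, llm) -> str:
    """Least-to-most prompting sketch: decompose, then solve sub-problems in
    order, keeping every earlier answer visible in the context."""
    decomposition = llm(
        f"To solve '{problem}', what sub-problems must be solved first? "
        "List one per line, simplest first."
    )
    subproblems = [s.strip() for s in decomposition.splitlines() if s.strip()]
    context = f"Problem: {problem}\n"
    for sub in subproblems:
        answer = llm(f"{context}Q: {sub}\nA:")
        context += f"Q: {sub}\nA: {answer}\n"   # prior answers stay in context
    return llm(f"{context}Q: {problem}\nA:")    # final answer sees full chain
```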
Relation to Classical Hierarchical Planning
Least-to-most is a prompt-level analogue of hierarchical task network (HTN) planning, in which tasks are recursively decomposed into sub-tasks until primitive actions are reached. The key difference is that HTN planners work with explicit, pre-specified decomposition operators, while least-to-most relies on the model to generate the decomposition itself — which works well when the model has seen analogous structures in pretraining data, and poorly when it hasn't.
Plan-and-Solve Prompting
A related variant, Plan-and-Solve (Wang et al., 2023), separates planning from execution within the prompt: first ask the model to produce a step-by-step plan, then separately execute each step. This reduces the "sycophantic overcommitment" problem where the model executes a flawed plan without questioning it — by making the plan explicit, it becomes easier to audit and correct before execution begins.
Empirically, decomposition-based methods provide the largest gains on tasks that are compositional — where the answer depends on a chain of facts or operations — and provide less benefit on tasks requiring a single insight or where the necessary sub-problems aren't predictable in advance. Multi-hop QA, symbolic manipulation, and long-horizon code generation are strong candidates; creative writing and open-ended generation less so.
Tree of Thoughts
Tree of Thoughts (ToT, Yao et al., 2023) is the most direct attempt to graft classical search onto an LLM's generation process. The core idea: instead of generating a single linear chain of reasoning, generate multiple candidate thoughts at each step, evaluate them, and use search (breadth-first, depth-first, or beam) to navigate the resulting tree. The LLM serves triple duty as thought generator, state evaluator, and — when used in a self-evaluation mode — critic.
\[\text{Children}(s) = \{\, s \oplus z \;:\; z \in \text{LLM}_{\text{gen}}(s) \,\}\]
\[\text{BFS: } \mathcal{F}_{t+1} = \underset{s' \in \bigcup_{s \in \mathcal{F}_t} \text{Children}(s)}{\operatorname{top-}k}\; V(s')\]
The original ToT paper demonstrated dramatic improvements on three benchmark tasks chosen precisely for requiring deliberate, non-linear reasoning: the Game of 24 (arrange four numbers with arithmetic to reach 24), creative writing with structured constraints, and mini-crossword solving. On the Game of 24, a task where standard GPT-4 CoT succeeds on 4% of problems, ToT reached 74%.
Thought Generation Strategies
Thoughts can be generated in two modes. Sample mode draws multiple independent continuations from the LLM's distribution; this is cheap but can produce near-duplicates. Propose mode prompts the model to enumerate distinct candidate thoughts explicitly — for example, "What are five different approaches to the next step?" — which tends to produce more diverse and complementary candidates at higher cost.
State Evaluation Strategies
Evaluating partial states is the hard part. ToT uses the LLM itself for two kinds of evaluation. Value estimation asks the model to score a partial state as "sure/likely/impossible" given the goal. Vote aggregation generates multiple independent evaluations and takes a majority verdict — useful when the model is uncertain on any single evaluation pass.
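Putting generation and evaluation together, a minimal beam-width BFS over thoughts might look like the following sketch (not the paper's implementation). `propose` and `value` stand in for the LLM-backed generation and evaluation calls described above:

```python
import heapq

def tot_bfs(root: str, propose, value, depth: int = 3, beam: int = 5) -> str:
    """Beam-limited BFS over a tree of thoughts (sketch, assumed interfaces).
    propose(state) -> list of candidate next thoughts ('propose mode')
    value(state)   -> float score for a partial state ('value mode')"""
    frontier = [root]
    for _ in range(depth):
        # Expand every frontier state with its candidate thoughts.
        children = [s + "\n" + t for s in frontier for t in propose(s)]
        if not children:
            break
        # Keep only the top-`beam` children by estimated value.
        frontier = heapq.nlargest(beam, children, key=value)
    return max(frontier, key=value)
```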
Costs and Trade-offs
ToT's power comes at a steep cost: each node expansion requires one or more LLM calls, and each evaluation requires additional calls. A full BFS of depth 3 with branching factor 5 expands 5 + 25 + 125 = 155 nodes, each needing generation plus evaluation calls. For frontier models, this translates to substantial latency and cost. Practitioners typically cap the branching factor at 3–5 and the depth at 3–4, using beam search rather than full BFS/DFS. The compute overhead has limited ToT's adoption in production systems, where simpler CoT + retry strategies often provide enough improvement at a fraction of the cost.
MCTS-Guided Decoding
Monte Carlo Tree Search (MCTS) brings a more principled approach to the tree search problem. Rather than evaluating each node with a single LLM call, MCTS uses rollouts — simulated continuations to the end of the problem — to estimate node value. It balances exploitation (expanding nodes known to be good) with exploration (trying less-visited nodes that might be better) via the UCT selection criterion.
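Concretely, UCT selects at each node the child action maximising an upper confidence bound:

\[a^{*} = \arg\max_{a} \left[ Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}} \right]\]

where \(Q(s, a)\) is the mean rollout return of taking \(a\) from \(s\), \(N(s)\) and \(N(s, a)\) are visit counts, and \(c\) tunes the exploration bonus: rarely-visited actions receive a large second term and get tried; well-explored, high-value actions dominate the first term.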
Applied to language model planning (Hao et al., 2023; "Reasoning via Planning"), MCTS treats each reasoning step as a node and LLM generations as rollouts. The model serves as both transition function (what reasoning step comes next?) and value function (how likely does this path lead to the correct answer?). After enough rollouts, the algorithm concentrates search on the most promising reasoning trajectories.
The World Model Problem
Classical MCTS requires a simulator: a way to advance the state cheaply, many times, for rollouts. For reasoning tasks, the LLM itself acts as the simulator — it generates plausible continuations — but this is expensive. Several papers have experimented with training lightweight value/policy networks alongside the main LLM to make rollouts cheaper, echoing the AlphaGo architecture. This is an active research area as of 2025.
AlphaCode 2 and Mathematical Reasoning
DeepMind's AlphaCode 2 (2023) applied MCTS-like sampling over code solutions: generate many candidate programs, filter by execution tests, use the passing tests as a reward signal, and iteratively improve. This is not strictly MCTS (there is no tree; samples are independent) but it captures the essential idea — parallel sampling plus a value signal enables much better performance than single-shot generation. Similar patterns appear in math reasoning: generating 64–1024 solutions and selecting via a verifier far outperforms greedy decoding.
The simplest "search" strategy is best-of-N: sample N solutions, score each with a verifier or reward model, return the best. For math problems, majority voting (self-consistency) — return the most common answer across N generations — closes much of the gap between greedy and optimal without requiring an external verifier. Both methods trade inference-time compute for accuracy in a roughly log-linear relationship.
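Both strategies fit in a few lines. The `sample`, `score`, and `extract_answer` callables below are assumed interfaces to a sampler, a verifier or reward model, and an answer parser, not any particular library:

```python
from collections import Counter

def best_of_n(prompt, sample, score, n=16):
    """Best-of-N: draw n candidates, return the one the verifier scores highest.
    sample(prompt) -> candidate text; score(candidate) -> float."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=score)

def self_consistency(prompt, sample, extract_answer, n=16):
    """Self-consistency: majority vote over final answers across n samples.
    No external verifier needed; agreement is the quality signal."""
    answers = [extract_answer(sample(prompt)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```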
Compute Scaling at Inference Time
MCTS and related search methods are instances of inference-time compute scaling: using more computation at evaluation time rather than embedding all capability in model weights. Snell et al. (2024) showed that for many tasks, a smaller model with more inference-time compute outperforms a larger model without it — a profound shift in how we think about the model size / compute trade-off. OpenAI's o1/o3 family explicitly commercialized this insight.
Self-Verification & Critique
An LLM's ability to evaluate outputs is often better than its ability to generate them from scratch. This asymmetry — common in human cognition too — motivates a family of generate-then-critique methods where the model acts as its own checker.
Self-Consistency
Wang et al. (2022) introduced self-consistency: sample multiple reasoning chains independently, then take a majority vote over the final answers. The insight is that correct reasoning paths tend to converge on the same answer, while incorrect ones diverge. Self-consistency requires no external critic — it uses the model's own variance as a quality signal. On GSM8K math reasoning, it improved GPT-3 from 56% to 78% accuracy with 40 samples.
Constitutional AI and Self-Critique
Anthropic's Constitutional AI (Bai et al., 2022) introduced a self-critique and revision loop: the model generates a response, then critiques it against a list of principles ("Is this response harmful? Explain why."), then revises in light of the critique. Repeated for multiple rounds, this allows the model to iteratively correct its own outputs — a planning-adjacent technique when applied to agentic task completion.
Self-Refine
Madaan et al. (2023) generalized this into Self-Refine: an iterative loop where the model generates an output, provides natural-language feedback on what's wrong, and then revises the output based on the feedback. The same model handles all three roles. Critically, the feedback step makes improvement criteria explicit — the revision doesn't just try again blindly; it addresses identified flaws. On coding, math, and sentiment transfer benchmarks, Self-Refine consistently outperformed single-shot generation.
def self_refine(task_prompt, llm, max_rounds=4):
    # llm: any text -> text completion call (placeholder interface)
    y = llm(task_prompt)                                   # initial draft
    for _ in range(max_rounds):
        fb = llm(f"Critique this draft. What is wrong and how can it be fixed?\n{y}")
        y = llm(f"Revise the draft using the feedback.\nFeedback: {fb}\nDraft: {y}")
        if llm(f"Is this draft satisfactory? Answer Yes or No.\n{y}").startswith("Yes"):
            break
    return y
Limitations: Sycophancy and the Echo Chamber
Self-critique has a structural weakness: the same model that generated the flawed output is critiquing it. Models tuned to be agreeable exhibit sycophancy — they tend to endorse their own outputs when asked to evaluate them, especially when the output is fluent and coherent. External critics (a separate model, a ground-truth verifier, or a human) are more reliable. The field has moved toward critic models trained separately on preference data, which avoids the echo-chamber problem.
Reflexion & Self-Reflection
Shinn et al. (2023) introduced Reflexion, which adds a long-term memory layer to self-critique: when an agent fails at a task, it generates a verbal self-reflection on what went wrong, stores that reflection in a persistent memory, and conditions its next attempt on the accumulated reflections. Instead of learning by updating weights — which requires gradient-based training — Reflexion learns by writing.
The architecture has three components. An Actor generates actions using the current policy. An Evaluator scores the outcome (pass/fail for code tests, rewards for RL tasks, or LLM-based scoring for open-ended tasks). A Self-Reflection module synthesizes a verbal post-mortem: "I failed because I didn't handle the edge case where the input is empty. Next time I should check for empty input first." These reflections are prepended to the next episode's context.
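The three-component loop can be sketched as follows. `act`, `evaluate`, and `reflect` are assumed interfaces standing in for the Actor, Evaluator, and Self-Reflection modules, not the paper's code:

```python
def reflexion(task, act, evaluate, reflect, max_trials=3):
    """Reflexion episode loop (sketch, assumed interfaces).
    act(task, memory)                 -> attempt, conditioned on reflections
    evaluate(solution)                -> (success: bool, feedback: str)
    reflect(task, solution, feedback) -> verbal post-mortem to remember"""
    memory = []                       # persistent verbal reflections
    solution = None
    for _ in range(max_trials):
        solution = act(task, memory)
        success, feedback = evaluate(solution)
        if success:
            return solution
        # Learn by writing: store a post-mortem for the next attempt.
        memory.append(reflect(task, solution, feedback))
    return solution
```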
Results and Scope
On the HumanEval coding benchmark, Reflexion brought GPT-4 from 80.1% (pass@1) to 91.0% after multiple reflection rounds — a larger gain than any prompting method that doesn't use external feedback. On AlfWorld text-game tasks, it achieved 97% success versus a 45% baseline. The key insight: failing fast, then reasoning carefully about the failure, is a powerful learning signal that doesn't require any weight updates.
Practical Limitations
Reflexion requires a reliable evaluator to tell the agent when it has failed. For code this is easy — run the tests. For open-ended tasks, defining failure is harder. Reflexion also accumulates context across episodes: very long memory sequences can exceed context limits, and early reflections may become stale or contradictory as the task evolves. Forgetting strategies (relevance filtering, summarization) help but introduce their own trade-offs.
Reflexion and related methods (ExpeL, LATS) instantiate a kind of in-context reinforcement learning that operates entirely in token space: the reward signal becomes language, the policy update becomes a natural-language reflection, and the value function is the agent's own assessment of its performance. This sidesteps the brittle infrastructure of gradient-based RL for slow-moving, high-level planning tasks.
Process Reward Models
Most reward models in RLHF evaluate complete outputs — they score the final answer. Process reward models (PRMs) instead score each reasoning step, providing dense supervision over the entire chain of thought. This distinction matters enormously for planning: a model trained with outcome supervision learns that a correct final answer is good but learns nothing about which intermediate steps were responsible; a model trained with process supervision learns which reasoning patterns are reliable at each stage.
OpenAI's "Let's Verify Step by Step" (Lightman et al., 2023) provided the landmark result: human labellers rated each step in a mathematical reasoning chain as correct, incorrect, or neutral, and a PRM trained on this data dramatically improved the ability to identify correct reasoning paths during search. On the MATH benchmark, best-of-1860 generation with PRM selection matched what outcome-reward models needed best-of-~100k to achieve — a roughly 100× improvement in sample efficiency.
Training a PRM
The training objective is straightforward: given a reasoning trace \(s_1, s_2, \ldots, s_T\), predict whether each step \(s_i\) is correct. The challenge is annotation cost — human step-level labelling is expensive. Several approaches mitigate this: (1) using LLMs to automatically label steps, with human spot-checks; (2) using Monte Carlo estimation — from each step, roll out many completions and label the step as correct if a high fraction of rollouts reach the correct answer; (3) contrastive learning between correct and erroneous traces.
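Approach (2), Monte Carlo step labelling, is straightforward to sketch. `rollout` and `is_correct` below are assumed stand-ins for an LLM sampler and an answer checker against the known final answer:

```python
def mc_step_labels(steps, rollout, is_correct, k=8, threshold=0.5):
    """Monte Carlo step labelling for PRM training data (sketch).
    rollout(prefix_steps) -> a completed solution continuing the prefix
    is_correct(solution)  -> bool against the known final answer.
    A step is labelled correct if >= threshold of k rollouts from it succeed."""
    labels = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        wins = sum(is_correct(rollout(prefix)) for _ in range(k))
        labels.append(wins / k >= threshold)
    return labels
```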
PRMs in Production
Process reward models are now a key ingredient in the reasoning-model training pipelines behind o1, o3, and DeepSeek-R1. They enable scalable oversight by providing dense, reliable training signal over long reasoning chains — something that outcome-only reward models cannot. PRMs also enable search at test time without human labels: simply sample many continuations from each step and use PRM scores to select the best path.
Neurosymbolic Hybrids
Pure neural planning via CoT/ToT runs into a structural problem: the model's reasoning is opaque and unverifiable. You cannot easily confirm that a multi-step argument is logically valid, only that it looks plausible. Neurosymbolic hybrids address this by offloading the formal reasoning to an external symbolic system — a theorem prover, a constraint solver, an SMT solver, or a classical planner — while using the LLM to translate between natural language and formal representations.
The Translator Pattern
The dominant architecture is a translator pipeline: (1) the LLM converts a natural-language problem into a formal specification; (2) a symbolic solver produces a provably correct solution; (3) the LLM translates the formal solution back into natural language. The LLM is doing what it does well (understanding language, generating plausible structure) while the solver does what it does well (exhaustive, correct search).
This pattern appears in Logic-LM (Pan et al., 2023), which converts questions to first-order logic and uses Prolog; in Code Interpreter agent mode, which generates Python code and runs it (the interpreter as solver); and in SatLM (Ye et al., 2023), which converts word problems to SAT constraints. All share the same failure mode: if the LLM mistranslates the problem into the formal language, the solver finds the correct answer to the wrong problem.
LLM + Classical Planner
Agents like LLM+P (Liu et al., 2023) use an LLM to translate a natural-language goal into PDDL, then call Fast Downward or FF to find a valid plan, then translate the plan back. On standard PDDL benchmarks, this achieves near-perfect planning performance that pure LLM CoT cannot — but requires a clean, structured environment that maps to PDDL. In messy real-world tasks, the translation step is the bottleneck.
Symbolic systems need precise, unambiguous representations. Natural language is ambiguous, context-dependent, and full of implicit assumptions. The LLM's job as translator is to resolve all that ambiguity before handing off to the solver — and this is precisely where translation fails. Grounding errors (hallucinated predicates, missing constraints, wrong variable bindings) are hard to detect automatically and can produce plans that are formally valid but semantically wrong.
Satisfiability and Constraint Solving
For planning problems that can be expressed as constraint satisfaction — scheduling, resource allocation, configuration — LLMs paired with SMT solvers (Z3, CVC5) or CP solvers (OR-Tools) provide a powerful combination. The LLM handles natural-language requirements and edge cases that would be tedious to encode manually; the solver handles the combinatorial search. This pairing has shown commercial viability in enterprise scheduling and logistics applications.
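To make the division of labour concrete, here is a toy stand-in for the solver half: exhaustive search over finite domains. A real deployment would hand the same constraints to Z3 or OR-Tools; this stdlib sketch only illustrates the interface the LLM must translate natural-language requirements into:

```python
from itertools import product

def solve_csp(variables, domains, constraints):
    """Exhaustive constraint search over finite domains (toy stand-in for an
    SMT/CP solver). constraints: predicates over an assignment dict."""
    for values in product(*(domains[v] for v in variables)):
        assignment = dict(zip(variables, values))
        if all(c(assignment) for c in constraints):
            return assignment
    return None  # unsatisfiable

# Toy scheduling instance: two tasks, three time slots, no overlap,
# and task "a" must run before task "b".
solution = solve_csp(
    ["a", "b"],
    {"a": [1, 2, 3], "b": [1, 2, 3]},
    [lambda s: s["a"] != s["b"], lambda s: s["a"] < s["b"]],
)
```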
When LLM Planning Fails
The gap between impressive benchmark results and reliable real-world planning performance is large. Understanding why is essential before deploying agents on consequential tasks. Planning failures fall into a small number of recurring patterns.
- Hallucinated plans. LLMs generate fluent, confident-sounding plans whether or not the plan is correct. The model does not have a reliable "I don't know how to do this" mode for planning tasks. Valmeekam et al. (2023) showed GPT-4 hallucinating valid-looking PDDL plans that violated preconditions on more than 60% of problems — while expressing no uncertainty.
- Limited horizon. Performance degrades sharply as problem length increases. A model that perfectly solves 5-step planning problems may fail at 15-step ones, even if the structure is identical. The model's implicit "planning horizon" is limited by the distribution of examples in training data, not by any principled capacity bound.
- No backtracking. Linear generation commits irrevocably to each step. When an early decision forecloses the correct solution — as in many combinatorial puzzles — autoregressive models have no mechanism to backtrack. Without explicit search (ToT, MCTS) this failure is architectural, not a matter of scale.
- Lost state tracking. Multi-step plans require maintaining a mental model of the current world state. LLMs frequently lose track of which sub-goals have been completed, which constraints have been satisfied, or which resources have been consumed — especially in long episodes. This compounds over time, leading to contradictory actions.
- Surface-pattern matching. LLMs trained on web text associate planning-adjacent surface patterns with certain outputs. A problem phrased to resemble a famous benchmark gets the benchmark answer regardless of the actual constraints. Distribution shift — novel problem formulations — exposes this brittleness immediately.
- Goal drift. In long-horizon agentic tasks, the agent's effective goal drifts from the original instruction as context grows, tool outputs accumulate, and sub-goal completions change the apparent top priority. Without explicit goal tracking, the agent can optimise for a proxy that diverges from user intent over time.
The PlanBench Findings
Valmeekam et al.'s PlanBench (2023) provided the most thorough stress-test of LLM planning. They found: (1) frontier LLMs solve fewer than 12% of Blocksworld problems correctly in greedy decoding, even though humans find most problems trivial; (2) performance does not improve meaningfully with scale alone; (3) chain-of-thought helps marginally but does not close the gap; (4) models that were given the correct plan and asked to verify it performed significantly better than models asked to generate plans — confirming that verification is easier than generation.
The lesson is not that LLMs cannot plan at all — they clearly can in many practical settings — but that their planning capability is fundamentally statistical rather than principled. They generalise well to problems similar to their training distribution and fail in predictable ways outside it. Engineers must design systems accordingly: use verifiers, use external tools for formal correctness guarantees, build in retry and reflection loops, and keep humans in the loop for high-stakes decisions.
Inference-Time Scaling
The methods in this chapter — CoT, ToT, MCTS, Reflexion, PRMs — converged on a common theme: spending more computation at inference time produces better reasoning. OpenAI's o1 (September 2024) was the first major commercial demonstration of training specifically for extended inference-time reasoning, using reinforcement learning to incentivise long internal "thinking" traces before producing a final answer.
o1's internal reasoning (not shown to users in its original deployment) resembles a hybrid of chain-of-thought and backtracking: the model considers multiple approaches, abandons dead ends, and checks its work. The thinking trace can run for hundreds or thousands of tokens before the answer is generated. This is not simply longer CoT — the model is trained via RL to produce traces that maximise correctness, not traces that look like human reasoning.
DeepSeek-R1
DeepSeek-R1 (January 2025) provided the first open-weight reasoning model trained with a similar paradigm. Critically, DeepSeek published the training methodology: start with a base model, apply GRPO (Group Relative Policy Optimization — a PPO variant that replaces the learned value baseline with group-normalised rewards) using simple rule-based rewards for answer correctness and output format, and let the model discover its own reasoning patterns. The result was a model that spontaneously developed backtracking, self-verification, and multi-approach reasoning — emergent behaviours that no training example demonstrated explicitly.
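As described in the DeepSeek reports, GRPO samples a group of \(G\) responses per prompt, scores each with a reward \(r_i\), and normalises within the group:

\[\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}\]

This group-relative advantage \(\hat{A}_i\) is then used in a PPO-style clipped policy objective, eliminating the separately-trained value network.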
The Scaling Curve
Snell et al. (2024) mapped the inference-time scaling curve precisely: for a fixed training compute budget, additional inference compute (via best-of-N, MCTS, beam search, or longer thinking traces) follows a power law improvement in accuracy up to a point, then plateaus. The optimal allocation between training and inference compute depends on the task: for tasks with reliable verifiers, heavy inference scaling is efficient; for tasks requiring generalisation across distributions, training compute matters more.
The shift from "prompt LLMs to plan" to "train LLMs to plan via RL" is profound. Prompting methods depend on the planning capability being latent in pretraining data. RL-trained reasoning models acquire planning capability through direct experience of correctness signals — they don't need to have seen examples of backtracking; they discover backtracking because it works. This suggests that reasoning-model training, not prompting, is the long-run path to reliable LLM planning.
Benchmarks & the Planning Landscape
Planning and reasoning ability is evaluated across a heterogeneous set of benchmarks, each stressing different aspects of the capability. Progress on any single benchmark should be interpreted cautiously — methods that look strong on benchmarks designed to reward their assumptions can fail unexpectedly on out-of-distribution problems.
| Benchmark | Tests | Signal Type | Key Finding |
|---|---|---|---|
| GSM8K | Grade-school math word problems | Exact answer | CoT transforms GPT-4 from ~30% to ~90%+ with self-consistency |
| MATH | Competition mathematics | Exact answer | PRM-guided search achieves ~90% (o3); best-of-N scales log-linearly |
| PlanBench | Blocksworld / logistics PDDL | Plan validity | GPT-4 CoT <12%; LLM+P (planner) >95%; gap is structural |
| HumanEval | Python function generation | Test execution | Reflexion: 67% → 91%; o3-mini: ~99.5% |
| ARC-AGI | Abstract grid reasoning | Pattern match | o3 (high compute): 87.5%; strong inference scaling signal |
| Game of 24 | Arithmetic search | Exact solution | CoT: 4%; ToT: 74%; designed to require search |
| WebArena | Multi-step web navigation | Task success | Best agents ~45% (2024); real-world planning difficulty |
| SWE-bench | Real GitHub issue resolution | Test execution | Top agents ~49% (2025); requires planning over large codebases |
What the Numbers Mean
The benchmark landscape tells a nuanced story. On tasks with unambiguous correct answers and reliable verification — math problems, code execution — inference-time scaling via search has driven dramatic improvements, with frontier reasoning models approaching or exceeding expert human performance. On tasks requiring open-ended planning in the real world — web navigation, software engineering, embodied tasks — the gap to human performance remains large and progress is slower.
The distinction is not mainly about model intelligence; it's about the availability of verifiers. Math and code have cheap, reliable verifiers (the solver, the test suite) that can guide search. Real-world tasks often don't. Building reliable automated verifiers for complex real-world tasks is itself an open research problem — and arguably the key bottleneck for the next generation of capable AI agents.
The Planning Stack in 2025
Practitioners assembling agent planning capabilities in 2025 typically combine several layers: a reasoning-capable base model (o3-mini, Claude 3.7, DeepSeek-R1) for internal chain-of-thought; external search (ToT, MCTS, or beam sampling) for tasks with known structure; PRMs or verifiers to guide search; Reflexion-style verbal memory for multi-episode improvement; and classical planners or constraint solvers for sub-problems with formal structure. No single technique dominates — the art lies in knowing which layer to apply to which part of the problem.
Key Papers
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Introduces the ToT framework — generate multiple candidate thoughts, evaluate with the LLM, search via BFS/DFS/beam. The Game of 24 demonstration is the canonical example of search dramatically outperforming linear CoT. The foundational paper for explicit LLM search.
- Let's Verify Step by Step. The PRM paper. Human labellers annotated 800k reasoning steps; a trained PRM outperforms outcome reward models for guiding best-of-N selection by ~100× in sample efficiency on MATH. Essential reading for understanding inference-time search with learned value functions.
- Reflexion: Language Agents with Verbal Reinforcement Learning. Verbal reflection as a substitute for gradient-based learning. The HumanEval result (80% → 91%) without any weight updates is striking. Lays the foundation for episodic verbal RL. Core reference for multi-episode improvement without fine-tuning.
- Self-Refine: Iterative Refinement with Self-Feedback. Iterative generate-critique-revise loop using the same LLM for all three roles. Broad evaluation across seven tasks shows consistent improvement over single-shot generation. The cleanest implementation of self-critique; widely replicated.
- Large Language Models Still Can't Plan (A Benchmark for LLMs on Planning and Reasoning About Change). The sobering counterpoint. Systematic evaluation of frontier LLMs on classical PDDL benchmarks demonstrates near-zero planning accuracy and pervasive hallucinated plans. Essential for calibrating expectations. Read alongside the positive results to get an honest picture.
- Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters. Precise quantification of the inference-time scaling curve. Shows that for many tasks, a smaller model with more inference compute outperforms a larger model without it — reshaping how we think about the training/inference compute trade-off. The empirical foundation of the inference-time scaling paradigm.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. The open-weight reasoning model that matched o1 performance. Key contribution: emergent backtracking and self-verification from pure correctness RL, with no reasoning-format supervision. Full training methodology published. The most detailed public account of how to train a reasoning model.