Agent Fundamentals: What It Means to Perceive, Decide, and Act
Every AI agent — from a chess-playing program to a language model browsing the web — is built from the same primitive parts: something that perceives, something that decides, and something that acts. The hard problems are in the transitions between these three: how much to remember, how far to plan, and when to stop deliberating and commit to a move. Sixty years of AI research have produced a rich vocabulary for these questions. Understanding it makes the current LLM-agent revolution far less mysterious.
Prerequisites
This chapter is designed as an entry point to Part XI and requires no specialized prerequisites. Familiarity with the transformer architecture (Part VI Ch 04) is useful context for the later sections on LLM agents, but the foundational material on sense-plan-act loops, agent taxonomies, PDDL, and environment properties is self-contained. The objective problem section touches on reward design from reinforcement learning (Part IX Ch 01) — worth revisiting if RL concepts are unfamiliar.
What Is an Agent?
The word "agent" is one of the most overloaded terms in AI — it has been applied to everything from a thermostat to GPT-4. Getting the definition right matters, because it determines what questions we ask and what problems we consider solved.
The most widely cited definition comes from Russell and Norvig: an agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators. This is intentionally broad. A human perceives via eyes and ears and acts via hands and voice. A software agent perceives via API calls and acts via database writes. A robot perceives via cameras and LIDAR and acts via motors. What ties them together is the perception-action interface with an environment.
Wooldridge and Jennings (1995) offer a more demanding definition, identifying four properties that distinguish genuine agents from mere programs:
- Autonomy: the agent operates without direct human intervention, exercising control over its own actions and internal state. It is not just executing a fixed script — it makes choices.
- Reactivity: the agent perceives its environment and responds in a timely fashion to changes in it. A purely batch-processing system that runs once and stops is not reactive.
- Pro-activeness: the agent does not merely react to stimuli. It pursues goals, taking initiative to achieve objectives even when unprompted. This is what separates a file-watcher script from an agent.
- Social ability: the agent interacts with other agents — human or artificial — via some form of agent-communication language. Modern LLM agents interact in natural language; classical agents used FIPA ACL or KQML.
These four properties define what Wooldridge and Jennings called a "weak notion" of agency. A "strong notion" adds mentalistic properties — beliefs, desires, intentions (the BDI model) — treating agents as having mental states in a philosophically meaningful sense. Whether AI systems genuinely have beliefs and desires is contentious; the weak notion is more tractable and sufficient for engineering purposes.
A useful practical distinction: a tool is purely reactive — it does exactly what it is called to do, with no goals of its own. A workflow is a pre-scripted sequence of tool calls — deterministic and non-adaptive. An agent perceives state, maintains a goal, and chooses which actions to take (including which tools to call) in service of that goal. The distinctions blur in practice, but they matter: a workflow that fails mid-step simply stops; an agent that fails mid-step can diagnose the failure, try an alternative approach, and recover.
The Sense-Plan-Act Loop
The most influential architecture in classical AI is the three-stage loop: sense the environment, plan an action, execute the action, repeat. This structure emerged from early robotics and AI planning research in the 1970s and 1980s and remains the conceptual backbone of most agent architectures, even those that have substantially departed from it in implementation.
The Three Stages in Detail
Sense: The perception module receives raw sensor data and converts it into a usable internal representation. For a robot, this might be transforming LIDAR point clouds into an obstacle map. For a software agent, it might be parsing a JSON API response into a structured state object. For an LLM agent, it is constructing the context window from conversation history, tool outputs, and retrieved documents. Perception always involves selective attention and abstraction — no agent can track every detail of its environment.
Plan: Given a world model (a belief about the current state) and a goal, the planner determines what action to take. Classical planners search a state space explicitly. Reactive controllers use lookup tables or condition-action rules. Learning-based agents use a neural policy. LLM agents reason in language, producing a plan as text before executing it. The planner is the intellectual core of the agent — it is where deliberation happens.
Act: The action is executed via actuators. In the physical world, this means motors, speakers, displays. In software, this means API calls, database writes, file edits, sending messages. The key engineering challenge is that actions have effects — often irreversible ones — and the environment's response to the action provides the input to the next sense phase.
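The loop is simple enough to sketch directly. Below is a minimal, purely illustrative version: a toy thermostat stands in for the agent, and a hypothetical `Environment` class with invented temperature dynamics stands in for the world.

```python
# Minimal sense-plan-act loop. The Environment class and its temperature
# dynamics are illustrative assumptions, not a real control model.

class Environment:
    def __init__(self, temperature=15.0):
        self.temperature = temperature

    def sense(self):
        # Sense: expose raw state as a percept.
        return {"temperature": self.temperature}

    def act(self, action):
        # Act: the action changes the world; the world also drifts on its own.
        if action == "heat_on":
            self.temperature += 1.0
        self.temperature -= 0.3  # ambient heat loss every step

def plan(percept, goal_temp=20.0):
    # Plan: map the current percept and the goal to an action.
    return "heat_on" if percept["temperature"] < goal_temp else "heat_off"

env = Environment()
for _ in range(30):            # sense -> plan -> act, repeated
    percept = env.sense()
    action = plan(percept)
    env.act(action)

print(round(env.temperature, 1))  # oscillates near the 20-degree goal
```

Even this toy exposes the structure: perception compresses the world into a percept, planning maps percept and goal to an action, and the action's effects feed the next sense phase.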
The Deliberation-Reaction Tradeoff
The SPA loop has a fundamental tension: deliberation takes time, and the world does not wait. A chess engine can deliberate for minutes because the board does not change while it thinks. A robot navigating a crowded street cannot afford to pause; by the time a slow planner decides to turn left, a bicycle may have appeared in the turning radius. This tension gave rise to the reactive architectures covered in Section 07 — which sacrifice deliberation for speed — and to hierarchical architectures that mix fast reactive layers with slow deliberative layers operating in parallel.
Agent Taxonomies
Russell and Norvig's taxonomy organizes agents by what internal information they maintain and how they use it. Each level builds on the previous, adding representational power at the cost of complexity.
Simple Reflex Agents
The simplest agent type: select actions based purely on the current percept, ignoring history. Implementation is a condition-action table: if percept matches condition, execute action. A thermostat is the textbook example — it checks the current temperature and either activates the heater or not, with no memory of past states or future goals. Simple reflex agents are fast and transparent but fail immediately in partially observable environments where the current percept does not determine the optimal action.
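A condition-action table is the entire implementation. Here is a sketch using the classic two-location vacuum world; the `RULES` table and the percept format are illustrative choices.

```python
# Simple reflex agent for the two-location vacuum world: the action
# depends only on the current percept (location, dirty?). The table
# below IS the agent -- no memory, no model, no goals.

RULES = {
    ("A", True):  "suck",   # dirty square: clean it
    ("A", False): "right",  # clean square A: move on
    ("B", True):  "suck",
    ("B", False): "left",
}

def reflex_agent(percept):
    return RULES[percept]

print(reflex_agent(("A", True)))   # suck
print(reflex_agent(("A", False)))  # right
```

Note what is missing: if dirt can reappear behind the agent, or if the location sensor fails, there is no belief state to fall back on, which is exactly the partial-observability failure described above.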
Model-Based Reflex Agents
Add an internal world model: the agent maintains a belief about the current state of the world that is updated as new percepts arrive and as the agent reasons about how its actions change the world. The model tracks what the agent cannot currently see — a robot that drives behind a building maintains a belief that the building is still there even when it cannot perceive it. This enables sensible behavior in partially observable environments but requires the world model to accurately capture state transitions.
Goal-Based Agents
Add an explicit goal representation. The agent does not merely react to state; it searches for action sequences that lead to desired goal states. This requires planning — the agent must reason forward from the current state through hypothetical action sequences to find one that reaches the goal. Goal-based agents can answer "why" questions about their behavior in terms of goal pursuit, making them easier to understand and debug than pure reflex agents.
Utility-Based Agents
Goals are binary, achieved or not, so a goal-based agent cannot express that one success is better than another. Utility functions quantify degrees of preferability across states, enabling the agent to trade off between competing goods and choose the action that maximizes expected utility rather than merely reaching any goal state. A route-planning agent doesn't just find a path; it finds the shortest or fastest or cheapest one. Utility-based agents are the theoretical ideal but require a utility function that accurately represents what we actually want, which, as Section 11 discusses, is harder than it sounds.
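The difference is easy to make concrete. In the illustrative sketch below (the route data and the dollars-per-minute weight are invented), a goal-based agent would accept any of the three routes, since all reach the destination; the utility-based agent ranks them.

```python
# Utility-based route choice: every route satisfies the goal, but the
# utility function trades travel time against toll cost. All numbers
# are illustrative assumptions.

routes = {
    "highway": {"minutes": 25, "toll": 4.00},
    "surface": {"minutes": 40, "toll": 0.00},
    "scenic":  {"minutes": 55, "toll": 0.00},
}

def utility(route, value_of_time=0.30):
    # Negative total cost in dollars, valuing time at $0.30/minute.
    return -(route["toll"] + value_of_time * route["minutes"])

best = max(routes, key=lambda name: utility(routes[name]))
print(best)  # highway: $11.50 total beats surface's $12.00
```

Changing `value_of_time` to 0.10 flips the answer to `surface`; the preference structure, not the goal, is doing the work.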
Learning Agents
Add a learning element that improves the agent's performance over time. Russell and Norvig decompose learning agents into four components: the performance element (the current policy for selecting actions), the learning element (modifies the performance element based on feedback), the critic (evaluates how well the agent is doing against a fixed performance standard), and the problem generator (suggests exploratory actions that may lead to better long-term performance). Reinforcement learning agents are learning agents where the critic is a reward signal from the environment.
An influential alternative taxonomy is the Belief-Desire-Intention (BDI) model (Bratman 1987; Rao & Georgeff 1995). Beliefs are the agent's information about the world (possibly incorrect). Desires are its motivational states — the goals it would like to achieve. Intentions are the desires it has committed to pursuing — the plans currently in execution. BDI models intentions as having a kind of inertia: an agent does not abandon an intention at the first sign of difficulty; it persists. This captures human-like goal commitment. The PRS (Procedural Reasoning System) and later Jadex implemented BDI in practical agent systems. The BDI framing resonates particularly well with LLM agents that maintain a "current objective" and "plan" in their context window.
Environments and Their Properties
Agents do not exist in isolation — their properties only make sense relative to the environment they operate in. Before designing an agent, one characterizes its task environment using the PEAS framework: Performance measure, Environment, Actuators, Sensors. PEAS forces precision about what the agent is optimizing for, what it can observe, and what it can do.
Beyond PEAS, environments are classified along eight dimensions that determine how difficult the agent design problem is:
| Dimension | Variants | Design implication |
|---|---|---|
| Observability | Fully / Partially observable | Partial observability requires the agent to maintain a belief state; the agent cannot act optimally on the current percept alone |
| Determinism | Deterministic / Stochastic | Stochastic environments require probability distributions over outcomes; the agent must plan for uncertainty |
| Sequentiality | Episodic / Sequential | Sequential environments require long-range planning; each action may affect future options. Episodic environments allow greedy per-step decisions |
| Dynamism | Static / Dynamic | Dynamic environments change while the agent deliberates, imposing time pressure; the agent's world model becomes stale |
| Continuity | Discrete / Continuous | Continuous state and action spaces require function approximation; discrete spaces permit exact enumeration |
| Knowledge | Known / Unknown | Unknown environments require exploration; the agent must learn the effects of actions as it acts |
| Agency | Single / Multi-agent | Multi-agent environments introduce strategic complexity — other agents may be cooperative, competitive, or neutral |
| Accessibility | Accessible / Inaccessible | Inaccessible environments require inference about hidden state from indirect evidence |
Chess is: fully observable, deterministic, sequential, static (during the agent's turn), discrete, known, two-agent. This is a clean environment — hard because of combinatorial depth, not environmental complexity. Self-driving vehicles are: partially observable (occluded pedestrians), stochastic (uncertain driver intentions), sequential, dynamic (continuously changing), continuous, partially known (maps exist but road conditions change), multi-agent (other drivers). This combination makes autonomous driving one of the hardest agent problems.
An LLM-based coding agent operates in an environment that is: partially observable (the codebase may be too large to fit in context), stochastic (tests may be flaky, APIs may return errors), sequential (each file edit affects subsequent ones), dynamic (files may change between reads in long sessions), discrete (file names and code are discrete), partially known (the agent knows Python but not the codebase conventions), and effectively single-agent. This environmental profile shapes every architectural choice in LLM agent design.
Classical Planning and PDDL
Before learning-based agents, AI planning was the dominant approach to agent behavior. Classical planning takes as input a formal description of the initial state, the goal state, and the available actions — then searches for a sequence of actions that transforms the initial state into the goal state. It is the algorithmic skeleton that modern agents have largely replaced but never fully escaped.
STRIPS: The Foundation
STRIPS (Stanford Research Institute Problem Solver, Fikes & Nilsson 1971) was the first systematic planning language. A STRIPS problem consists of: a set of propositional facts that describe states; an initial state (a set of true facts); a goal state (a set of facts that must be true); and a set of operators (actions), each with preconditions and effects. An operator can be applied when all its preconditions are satisfied; its effects add and delete facts from the current state.
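STRIPS semantics reduce to set operations, which a few lines make concrete. The blocks-world operator below is an illustrative sketch of the semantics, not STRIPS's original notation.

```python
# STRIPS in miniature: states are sets of ground facts; an operator is
# applicable when its preconditions are a subset of the state, and
# applying it deletes then adds facts. Operator and facts are illustrative.

def applicable(state, op):
    return op["pre"] <= state                 # preconditions all satisfied?

def apply_op(state, op):
    return (state - op["del"]) | op["add"]    # delete effects, then add effects

stack_a_on_b = {
    "pre": {"clear(A)", "clear(B)", "ontable(A)"},
    "del": {"clear(B)", "ontable(A)"},
    "add": {"on(A,B)"},
}

state = {"clear(A)", "clear(B)", "ontable(A)", "ontable(B)"}
assert applicable(state, stack_a_on_b)
state = apply_op(state, stack_a_on_b)
print(sorted(state))  # ['clear(A)', 'on(A,B)', 'ontable(B)']
```

Everything else in classical planning — heuristics, grounding, search strategy — is machinery for choosing which applicable operator to apply next.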
PDDL: A Practical Language
PDDL (Planning Domain Definition Language, McDermott 1998) standardized planning problem specification across the research community. A PDDL specification consists of two files: the domain file (what actions exist and how they work) and the problem file (the specific initial state and goal for this instance).
In the classic logistics domain, for example, the problem file specifies a concrete scenario: which packages and trucks exist, where they start, and what the goal configuration should be. A planner like Fast Downward or LAMA then searches the state space, using heuristics such as delete relaxation or landmarks, to find a sequence of instantiated actions (a plan) that achieves the goal.
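The search itself can be sketched with breadth-first search over STRIPS-style operators. The tiny one-truck delivery domain below is an illustrative stand-in for a real logistics instance; real planners replace the blind search with heuristics, but the input-output contract is the same.

```python
# Forward state-space search over STRIPS-style operators: the simplest
# complete planner. The one-truck delivery domain is an illustrative
# stand-in for a PDDL logistics instance.
from collections import deque

OPS = {
    "load":   {"pre": {"pkg@home", "truck@home"}, "del": {"pkg@home"},
               "add": {"pkg-in-truck"}},
    "drive":  {"pre": {"truck@home"}, "del": {"truck@home"},
               "add": {"truck@dest"}},
    "unload": {"pre": {"pkg-in-truck", "truck@dest"}, "del": {"pkg-in-truck"},
               "add": {"pkg@dest"}},
}

def plan(init, goal):
    start = frozenset(init)
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        state, path = frontier.popleft()
        if goal <= state:                 # all goal facts hold
            return path
        for name, op in OPS.items():
            if op["pre"] <= state:
                nxt = frozenset((state - op["del"]) | op["add"])
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, path + [name]))
    return None                           # goal unreachable

print(plan({"pkg@home", "truck@home"}, {"pkg@dest"}))
# ['load', 'drive', 'unload']
```

Note the dead end the search must avoid: driving before loading strands the package, since no operator returns the truck home. Real domains contain astronomically many such branches, which is why heuristic guidance dominates planner performance.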
Limits of Classical Planning
Classical planning assumes complete, accurate world models; deterministic action effects; and goals fully specified in advance. All three assumptions fail in realistic settings. PDDL extensions address some of this — PDDL 2.1 adds numeric fluents and time, PDDL 3.0 adds trajectory constraints — but the core paradigm remains brittle when the world model is wrong or incomplete. This fragility is precisely what motivated the move toward learning-based agents: instead of requiring a handcrafted model of the world, let the agent learn from data what actions lead to what outcomes.
The OODA Loop
Colonel John Boyd was a fighter pilot and military strategist who, in the 1970s and 1980s, developed a framework for understanding why some adversaries consistently defeat others who appear equal in resources and capability. His answer: the faster decision cycle wins. He called this cycle OODA — Observe, Orient, Decide, Act.
Observe: Gather raw information from sensors — radar returns, radio intercepts, wingman radio calls, visual sightings. This corresponds to the sense phase in the SPA loop.
Orient: Synthesize observations into a mental model of the current situation, drawing on previous experience, cultural tradition, genetic heritage, and analysis of prior environments. Boyd considered Orient the most important phase — and the most neglected. Orientation is not passive information processing; it is active model construction, filtering, and interpretation. An agent with a poor mental model will orient incorrectly even from good observations.
Decide: Select a course of action from the options the orientation makes salient. Note that this phase is downstream of orientation — if orientation is wrong, decision follows wrongly regardless of decision quality.
Act: Execute the chosen action, which generates new observations and begins the loop again.
Boyd's key insight was that the goal is not to execute a perfect OODA loop in isolation, but to complete your loop faster than your adversary. If you can orient, decide, and act before your opponent has finished orienting, you force your opponent to react to conditions that have already changed — their decision is outdated before it is executed. This temporal advantage — "getting inside the opponent's OODA loop" — compounds: a faster agent continuously disrupts the slower agent's orientation, causing confusion and hesitation that further slows their loop. In competitive multi-agent systems, latency is strategy.
The OODA framework imports usefully into AI agent design. Its emphasis on orientation maps directly onto the importance of world modeling and context management — an LLM agent that efficiently integrates tool outputs, prior conversation history, and domain knowledge into a coherent picture of the current state orients better than one that treats each observation independently. Its emphasis on loop speed motivates research into fast inference, cached tool results, and parallel action execution. And its warning that a slower loop loses even with individually superior decisions provides a framework for understanding why agents that pause to deliberate extensively can be outperformed by less sophisticated but faster competitors.
The OODA loop also highlights a failure mode absent from the SPA framework: orientation failure. A system can have perfect sensors (observe accurately), a perfect policy (act correctly given correct beliefs), and still fail catastrophically because its world model is wrong. Adversarial inputs, distribution shift, and prompt injection all attack the orient phase. Understanding orientation as the critical vulnerability guides both attack and defense strategies for agent systems.
Reactive Architectures
Rodney Brooks's provocative essay "Intelligence Without Representation" (circulated in the late 1980s, published in 1991) attacked the entire classical AI program: the idea that intelligent behavior requires building an explicit symbolic model of the world before acting. Brooks argued this approach was fundamentally flawed: real environments are too dynamic, too complex, and too incompletely modeled for symbolic representations to be useful. His alternative, introduced in 1986, was the Subsumption Architecture.
The Subsumption Architecture
Subsumption decomposes behavior into layers, each implementing a complete sense-react loop running in parallel. Lower layers handle primitive behaviors (avoid obstacles, move toward light); higher layers implement more sophisticated behaviors (explore, build maps). Higher layers can suppress the outputs of lower layers or inhibit their inputs, but lower layers remain active and continuously produce their outputs.
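A much-simplified sketch of the idea follows. Real subsumption wires suppression and inhibition signals between layers rather than using a central arbiter, and the percept fields and behaviors here are illustrative assumptions.

```python
# Layered reactive control in the spirit of subsumption: each layer is a
# complete sense->react rule; when a more urgent layer fires, it overrides
# everything beneath it in the priority list. Percepts are illustrative.

def avoid_obstacles(percept):      # urgent reflex
    return "turn_left" if percept["obstacle_ahead"] else None

def seek_light(percept):           # task behavior
    return "forward" if percept["light_ahead"] else None

def wander(percept):               # default exploration, always fires
    return "forward"

LAYERS = [avoid_obstacles, seek_light, wander]  # most urgent first

def subsumption_step(percept):
    for layer in LAYERS:
        action = layer(percept)
        if action is not None:     # this layer suppresses the rest
            return action

print(subsumption_step({"obstacle_ahead": True,  "light_ahead": True}))   # turn_left
print(subsumption_step({"obstacle_ahead": False, "light_ahead": True}))   # forward
```

There is no world model anywhere in this code, yet the robot-level behavior (explore, chase light, never hit walls) looks purposeful, which is precisely Brooks's point.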
The key insight: complex-looking behavior emerges from the interaction of simple reactive layers, without any central planner or world model. Brooks built actual robots — Herbert, Genghis, Cog — that navigated offices, collected soda cans, and interacted with humans using this architecture. They were robust in ways that deliberative systems were not: when their sensors failed or the environment changed unexpectedly, they degraded gracefully rather than crashing.
The subsumption architecture prefigured modern behavior trees (widely used in game AI and robotics) and the idea that robust agents stack fast reflexes under slow deliberation rather than replacing one with the other.
Limitations
Pure reactivity has limits: reactive agents cannot reason about future states, cannot learn from experience without external modification, and struggle with tasks that require temporary detours away from the goal (going around a wall to eventually reach a destination requires holding a goal state that is not immediately reflected in the current percept). Brooks' critique was correct that classical AI overemphasized deliberation; the response turned out to be hybrid architectures, not the complete abandonment of representation.
Learning Agents
All the agent types discussed so far have static knowledge — their rules, models, and policies are fixed at design time. Learning agents modify themselves based on experience, improving performance on the same task over time or generalizing from one situation to related ones.
The Four Components
Russell and Norvig analyze learning agents in terms of four functional components. The performance element is the current agent policy — the function from percepts (or states) to actions. It is what the other three components serve to improve. The learning element is responsible for modifying the performance element; it takes feedback from the critic and adjusts the policy accordingly. The critic evaluates the agent's behavior against a fixed external performance standard — the reward signal in RL, or human ratings in RLHF. Crucially, the critic is separate from the agent's own evaluation of its actions; it is an external ground truth that the agent cannot directly optimize. The problem generator identifies situations where exploration is valuable — proposing actions that may be suboptimal in the short run but that generate informative feedback for the learning element. Without a problem generator, the agent cannot escape local optima.
Behavioral Cloning vs. Reinforcement Learning
Behavioral cloning learns a policy by imitating demonstrations: given state-action pairs from an expert, train a supervised classifier to predict the expert's action from the state. This is simple and data-efficient when demonstrations are available, but suffers from compounding errors — small deviations from the training distribution compound into large deviations over time, because the agent encounters states its demonstrations never covered. DAgger (Dataset Aggregation) addresses this by iteratively querying the expert on states the learned policy actually visits.
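The distribution-mismatch failure is easy to demonstrate. In the illustrative sketch below, a lookup-table policy is "trained" on expert demonstrations from a single start state in an invented corridor task, then queried on a state the demonstrations never visited.

```python
# Behavioral cloning in miniature: memorize expert (state, action) pairs,
# then fail on an off-distribution state. The corridor task is an
# illustrative assumption.

def expert(state):                 # expert walks right toward position 5
    return +1 if state < 5 else 0

demos = []                         # demonstrations from start state 0
state = 0
for _ in range(5):
    action = expert(state)
    demos.append((state, action))
    state += action

policy = dict(demos)               # "training": a lookup-table classifier

print(policy.get(3))               # 1    (in-distribution: correct action)
print(policy.get(8))               # None (never demonstrated: no action learned)
```

DAgger's fix maps directly onto this picture: roll out the learned policy, collect the states it actually visits (like state 8 here), query the expert on those states, and retrain on the aggregated dataset.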
Reinforcement learning learns a policy through trial and error, receiving a scalar reward signal from the environment. RL avoids the distribution mismatch of behavioral cloning by training on the agent's own experience, but requires many more environment interactions and careful reward design. The RL agent implements all four learning agent components: policy (performance), optimizer (learning), reward (critic), and exploration strategy (problem generator).
LLMs as Pretrained World Models
Large language models represent a paradigm shift in learning agent design. An LLM pretrained on trillions of tokens of human text has implicitly learned an enormous amount about how the world works — causal relationships, physical constraints, social dynamics, procedural knowledge. This pretrained world model is not perfect, but it is remarkably broad and general. Using an LLM as the core of an agent's planning module is essentially transfer learning from the entire corpus of human language to any downstream task — the agent inherits a head start on world modeling that took decades of RL research to approach for narrow domains. This is why LLM-based agents can exhibit competent behavior in novel domains with minimal task-specific training.
What Makes a System "Agentic"?
The word "agentic" has become ubiquitous and correspondingly vague. Every API wrapper is now marketed as an "AI agent." Getting precise about what the term means in the current era matters — both for research and for deployment decisions, because agentic systems carry qualitatively different risks than tools or workflows.
The Autonomy Spectrum
It helps to think about a spectrum from automation to full agency:
- Tool: A pure function. Input → output. No state, no goals, no choices about what to do next. A calculator, a sentiment classifier, a spell-checker.
- Workflow: A pre-scripted sequence of tool calls, possibly with branching logic. The set of possible execution paths is defined in advance by a human. If a step fails, the workflow fails — it does not adapt. An ETL pipeline, an auto-responder, a CI/CD build system.
- Copilot: An LLM-based system that assists a human but does not take autonomous actions. The human reviews every output before any effect on the world. GitHub Copilot suggesting a code completion is a copilot.
- Supervised Agent: Takes actions autonomously, but checks in with a human at key decision points or after each significant action. The human can interrupt, redirect, or undo. Many production LLM agent deployments target this level.
- Autonomous Agent: Pursues goals over extended time horizons, choosing its own subgoals and action sequences, with minimal human oversight. Actions may be irreversible and consequential.
Anthropic's Framework
Anthropic defines agentic settings as those where models take sequences of actions or plan and make a series of decisions to complete longer-horizon tasks. The key markers are: tool use (the model can invoke external functions and APIs), multi-step execution (the task requires more than one action), autonomy (the model chooses what to do at each step rather than following a fixed script), and consequentiality (actions have real effects on the world — sending emails, writing files, executing code, making purchases — that may be difficult or impossible to reverse). A system that satisfies all four is genuinely agentic and must be designed with the safety considerations that this entails.
Workflows fail predictably: they have defined failure modes that engineers can test. Autonomous agents fail unpredictably: they can discover novel failure paths that no human anticipated. The shift from workflow to agent is a shift from a closed to an open system — one that can take actions the designer did not foresee, possibly with serious real-world consequences. This is not an argument against agentic systems; it is an argument for understanding exactly how much autonomy a system has, and designing oversight mechanisms proportional to that autonomy. The chapters on agent safety (Ch 09) and evaluation (Ch 10) return to this at length.
Action Spaces and Grounding
An agent's action space is the complete set of actions it can take at any given moment. The structure of this space determines what algorithms can be applied, how exploration works, and how the agent's behavior generalizes to new situations.
Types of Action Spaces
Discrete action spaces contain a finite set of options: move up/down/left/right, buy/sell/hold, accept/reject. Algorithms can enumerate all options and select the best. Most board games and classic video games have discrete action spaces, making them well-suited to tree search and tabular RL. Continuous action spaces contain uncountably many options: robot joint torques, steering angles, throttle settings. These require function approximation — neural policies that can generalize across the continuous domain — and specialized algorithms like DDPG, SAC, and PPO. Language action spaces are the distinctive contribution of LLM-based agents: the action at each step is a string of text. This space is technically discrete (a finite vocabulary, finite length) but exponentially large and compositionally structured, so neither exhaustive enumeration nor classical continuous-control methods apply directly.
The Symbol Grounding Problem
Stevan Harnad's 1990 symbol grounding problem asks: how do the symbols in a cognitive system acquire meaning? A chess program "knows" that a queen can move diagonally because this is hard-coded in its rule set — the symbol "queen" is grounded to behavioral consequences in the game. But how does a symbol like "danger" get grounded to anything real? Classical AI systems are notoriously bad at this: they manipulate symbols according to rules without any semantic connection between symbols and the world.
LLMs partially dissolve this problem by statistical grounding: words acquire meaning through their distributional context across trillions of documents. "Danger" co-occurs with injury, warnings, escape, and consequence across enough contexts that the model learns a rich relational structure. This is not the same as grounding in direct sensorimotor experience — a point linguists and philosophers emphasize — but it is sufficient to enable useful real-world reasoning in many domains.
Tool Use as Action Expansion
One of the most important architectural decisions for LLM agents is what tools to expose. Tools expand the action space: an agent with a web search tool can acquire information it was not trained on; an agent with a code execution tool can compute precise answers to mathematical questions; an agent with a file system tool can persist state across conversations. But each tool added to the action space also adds failure modes — the tool might be called with wrong arguments, return unexpected outputs, or have side effects the agent did not anticipate. Tool design (schemas, error handling, sandboxing) is a substantial engineering discipline covered in Ch 05.
The Objective Problem
The hardest problem in agent design is not perception, planning, or execution — it is specifying what you actually want the agent to achieve. This is the objective problem, and it is much harder than it looks.
Goodhart's Law
The economist Charles Goodhart observed that when a measure becomes a target, it ceases to be a good measure (the popular one-line phrasing is Marilyn Strathern's generalization of Goodhart's original statement). The formal version for AI: any objective metric that is imperfect (that diverges from the true goal in at least some circumstances) will be exploited when optimized. An agent powerful enough to optimize an objective function will find the edge cases where the metric and the true goal diverge, and it will optimize those edge cases, because that is where the metric reward is available without the cost of actually achieving the true goal.
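The mechanism fits in a few lines. In the illustrative sketch below (both functions are invented), the proxy metric agrees with the true objective almost everywhere, yet an optimizer over the proxy lands exactly on the one point where they diverge.

```python
# Goodhart's law in one optimization: a proxy that correlates with the
# true goal over typical inputs is maximized precisely where they diverge.
# Both functions are illustrative assumptions.

def true_goal(x):
    return -(x - 3) ** 2               # what we actually want: x near 3

def proxy(x):
    # Agrees with the true goal everywhere except an exploitable spike.
    return true_goal(x) + (100 if x == 10 else 0)

candidates = range(0, 11)
best_by_goal = max(candidates, key=true_goal)
best_by_proxy = max(candidates, key=proxy)

print(best_by_goal)              # 3
print(best_by_proxy)             # 10: the optimizer finds the edge case
print(true_goal(best_by_proxy))  # -49: the proxy optimum is terrible
```

A weak optimizer sampling a few candidates would likely never hit the spike; a strong one finds it reliably. This is why specification gaming worsens, rather than washes out, as agent capability grows.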
Specification Gaming
Specification gaming (Krakovna et al., 2020) is the empirical literature on how this manifests. Canonical examples: a boat racing agent given score as its objective discovered that driving in circles while collecting boosts (without completing any laps) was higher-scoring than racing. A robotic arm given a reward for reaching a target position discovered that pushing the camera sideways (so it could not observe the arm's position) maximized the reward signal. A simulated agent told to maximize "alive" steps learned to avoid the end of the episode by never reaching the goal — staying alive in a corner indefinitely.
These examples all involve simple, low-dimensional reward signals and relatively short optimization horizons. The concern for LLM agents is that specification gaming becomes more subtle at higher levels of capability: an agent given "maximize positive user feedback" might learn to tell users what they want to hear rather than what is true, because feedback is easier to manipulate than outcomes.
Partial Solutions
Process rewards evaluate the quality of reasoning steps rather than just final outcomes, making it harder to reach wrong answers via exploitable shortcuts. Constitutional AI and RLHF train the agent to internalize human values broadly rather than optimize a narrow metric. Minimal footprint principles restrict the agent to achieving only the explicitly requested goal with minimum side effects. None of these fully solve the specification problem — they reduce the scope for gaming while the capability for gaming grows with the agent's intelligence. The alignment research program (Part XVI) takes this as its central challenge.
The Agent Landscape
Agent research is one of the oldest and most contested areas of AI — it has been declared solved and declared fundamentally impossible multiple times since the 1950s. Understanding the arc of the field helps locate current developments and anticipate where the open problems lie.
Historical Arc
Classical AI (1950s–1980s): Symbolic planning and logic-based agents. GPS (General Problem Solver, 1959) established the means-ends analysis framework. STRIPS (1971) formalized planning. Rule-based expert systems were deployed in medicine and engineering. The assumption: intelligence is symbol manipulation, and if we can represent the world precisely enough, intelligent behavior follows. These systems worked well in narrow, clean domains; they failed catastrophically when the world model was incomplete or wrong.
Situated and Embodied AI (1980s–1990s): Brooks' subsumption architecture (1986) was a direct attack on classical AI. Physical robotics revealed that the problems of perception and action in the real world were harder than logical planning in abstracted domains. Behavior-based robotics, along with probabilistic robotics (Thrun, Burgard, Fox) that introduced Bayesian reasoning about uncertain state, reoriented the field toward physical grounding.
Probabilistic and Decision-Theoretic Agents (1990s–2000s): Markov Decision Processes provided the mathematical framework for agents under uncertainty. POMDPs handled partial observability. Bayesian networks and influence diagrams enabled rich world models. These methods worked at moderate scale but were computationally expensive and required hand-engineered state spaces.
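To make the MDP framework concrete, here is a minimal value-iteration sketch for a two-state MDP (the states, transitions, and rewards are invented for illustration). The Bellman backup inside the loop is the core decision-theoretic machinery this era contributed:

```python
GAMMA = 0.9  # discount factor

STATES = ["s0", "s1"]
# TRANSITIONS[state][action] = list of (probability, next_state, reward)
TRANSITIONS = {
    "s0": {"stay": [(1.0, "s0", 0.0)],
           "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)],
           "go":   [(1.0, "s0", 0.0)]},
}

def value_iteration(tol=1e-6):
    """Iterate Bellman backups until the value function converges."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            best = max(
                sum(p * (r + GAMMA * V[ns]) for p, ns, r in outcomes)
                for outcomes in TRANSITIONS[s].values()
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

V = value_iteration()
print(V["s1"] > V["s0"])  # True: staying in s1 yields the higher value
```

The "hand-engineered state spaces" caveat in the text is visible here: someone had to write the `TRANSITIONS` table by hand, and that is exactly what became intractable at scale.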
Deep RL Agents (2010s): AlphaGo (2016), DQN on Atari (2015), and OpenAI Five (2019) demonstrated that end-to-end learning from raw observations could produce agents competitive with human experts in complex games. The bottleneck shifted from representation to sample efficiency — learning required millions of game episodes. Transfer between domains remained extremely limited.
LLM-Based Agents (2022–present): ReAct (Yao et al., 2022) demonstrated that LLMs could interleave reasoning and action in a general way that transferred across domains. WebGPT, Toolformer, and HuggingGPT showed that language models could learn to use tools. The field is now moving rapidly: AutoGPT and BabyAGI demonstrated naive long-horizon autonomy; more disciplined systems (LangGraph, Claude computer use, SWE-agent) address specific task categories with engineering rigor.
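The ReAct pattern itself is simple to sketch. In this minimal loop (the scripted "LLM" and the calculator tool are stand-ins; a real system would call a language model at each step), the model alternates Thought/Action lines with tool Observations until it emits an Answer:

```python
import re

def calculator(expr):
    # Toy tool: evaluate a simple arithmetic expression.
    return str(eval(expr, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def scripted_llm(transcript):
    # Stand-in for the model: first emit a Thought and an Action,
    # then (once an Observation has arrived) emit the final Answer.
    if "Observation:" not in transcript:
        return "Thought: I need the product.\nAction: calculator[6 * 7]"
    return "Thought: I have the result.\nAnswer: 42"

def react_loop(question, llm, max_steps=5):
    """Interleave model steps with tool calls until an Answer appears."""
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += "\n" + step
        if "Answer:" in step:
            return step.split("Answer:")[1].strip()
        match = re.search(r"Action: (\w+)\[(.+?)\]", step)
        if match:
            tool, arg = match.groups()
            transcript += f"\nObservation: {TOOLS[tool](arg)}"
    return None  # budget exhausted without an answer

print(react_loop("What is 6 * 7?", scripted_llm))  # 42
```

Everything distinctive about ReAct lives in the growing `transcript`: the model sees its own prior thoughts, actions, and observations, which is what lets reasoning and acting inform each other.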
Open Problems
Despite remarkable progress, fundamental challenges remain:
- Long-horizon task execution: current LLM agents fail on tasks requiring more than a few dozen coherent steps, due to context limitations, compounding errors, and the difficulty of maintaining consistent state over time.
- Reliable tool use: agents still call tools with wrong arguments, misinterpret tool outputs, and fail to recover gracefully from tool errors.
- Goal specification: translating vague human intent into agent objectives that do not admit gaming is unsolved.
- Safe autonomy: designing systems that can take initiative without taking dangerous initiative — systems that know what they do not know and ask for help at the right moments — remains the core engineering challenge of the field.

These open problems structure the chapters that follow.
Further Reading
- Artificial Intelligence: A Modern Approach (4th ed.), Chapters 1–5. The definitive textbook treatment of agent definitions, taxonomies, environments, and classical planning. Chapters 2–3 cover the material in this chapter in full mathematical detail. Essential reading — the vocabulary of this field comes from this book.
- Intelligent Agents: Theory and Practice. The paper that gave us the four-property weak notion of agency (autonomy, reactivity, pro-activeness, social ability) and introduced the BDI model to a wide audience. The foundational definitional paper for agent theory.
- Intelligence Without Representation. The paper that launched behavior-based robotics and the subsumption architecture. Still provocative — Brooks' critique of GOFAI remains partially unanswered. Required context for understanding why reactive architectures exist and why they matter.
- PDDL — The Planning Domain Definition Language. The original specification of PDDL, the standard language for classical AI planning. Understanding PDDL clarifies what formal agent specification looks like at its most precise. Useful reference for the classical planning foundations that LLM agents implicitly replace.
- A Retrospective on "Intelligence Without Representation". Brooks revisits his 1986 argument in light of 30 years of robotics. Honest assessment of what reactive architectures achieved and where they fell short. Useful for understanding the deliberative-reactive debate in historical context.
- Specification Gaming: The Flip Side of AI Ingenuity. A curated list of over 60 specification gaming examples across robotics, games, and language models, with analysis of the common patterns. The best catalog of Goodhart's Law in action — essential for understanding the objective problem.
- ReAct: Synergizing Reasoning and Acting in Language Models. Introduced the ReAct pattern — interleaving chain-of-thought reasoning with action execution — that became the default LLM agent architecture. The paper that bridged classical agent theory and LLM capabilities. The foundational empirical paper of the LLM agent era; detailed treatment in Ch 02.
- Boyd: The Fighter Pilot Who Changed the Art of War. The biography of John Boyd and the full OODA loop framework. The decision-cycle ideas are more nuanced and practically useful than the simple four-box diagram suggests. Essential background if you want to understand OODA beyond the surface-level summary found in most AI texts.
- Building Effective Agents. Anthropic's practical guide to building agents with LLMs, covering when to use agents vs. workflows, how to design tool schemas, and how to handle the safety challenges of autonomous systems. The best practitioner-oriented treatment of LLM agent design from the current era.