Part XI · AI Agents & Autonomous Systems · Chapter 10

Agent Evaluation & Benchmarking, knowing whether your agent actually works.

Measuring whether an agent succeeds requires more than checking its final answer — you need to examine its reasoning, its efficiency, its safety behaviour, and whether its benchmark score will still mean something in six months.

Why this chapter matters

A language model can be evaluated by asking it questions and scoring its answers. An agent can't. It takes sequences of actions whose correctness is entangled with timing, context, and side effects. This chapter covers the metrics and benchmarks the field has converged on — task success rate, trajectory quality, efficiency, and safety — and confronts the uncomfortable truth that benchmark saturation is already a serious problem for agent evaluation.

By the end you will understand how to interpret published benchmark results, design your own evaluation suite, and resist the temptation to mistake leaderboard performance for real-world capability.

In this chapter

Why Agent Evaluation Is Hard multi-step · stochastic · partial credit · expensive runs
Task Success Rate — The Primary Metric correctness · trial count · task distribution
Major Benchmarks: GAIA, SWE-bench, WebArena & More GAIA · SWE-bench · WebArena · leaderboards
Trajectory Evaluation trace analysis · automated scoring · process vs. outcome
Efficiency Metrics tokens · steps · wall-clock · cost
Safety Evaluation scope compliance · calibration · red-teaming
The Benchmark Saturation Problem contamination · headroom · sustainable benchmarks
Building Custom Evaluations 5-step design · in-house cases · anti-patterns
Continuous Evaluation in Production distribution shift · feedback loops · live monitoring
Frontier Problems long-horizon eval · multi-agent · open-ended tasks

Why Agent Evaluation Is Hard

Section 01 — The measurement problem

Evaluating a static model is straightforward: present an input, collect an output, compare to a reference. The entire pipeline is deterministic and reproducible. Agents break every assumption in this pipeline.

First, agents produce trajectories, not outputs. A single task might involve dozens of tool calls, intermediate reasoning steps, memory reads, and sub-agent delegations. The final answer is downstream of all of them, and a wrong answer could stem from a flawed initial plan, a hallucinated tool argument, a misread result, or a correct plan executed on stale data. Knowing that the agent failed doesn't tell you which part failed.

Second, many agent tasks are non-deterministic. The same agent on the same task on the same day may succeed or fail depending on API latency, cached search results, or model sampling temperature. A single trial is meaningless; even ten trials may not converge.

Third, evaluation environments are expensive to reset. You cannot replay a trajectory through a live database or a real browser the way you can replay tokens through a model. Scaffolding sandboxed, reproducible environments is often a larger engineering project than the agent itself.

Fourth, there is no obvious ground truth for intermediate steps. An agent that takes a longer-than-expected route to the correct answer — is it inefficient, or is it being cautious? An agent that invokes fewer tools but sometimes gets wrong answers — is it fast, or is it cutting corners? These questions don't have clean answers.

The Observer Effect

Instrumenting an agent to collect trajectory data changes its behaviour. Detailed logging adds latency; wrapper functions can alter tool call semantics; and agents trained on human demonstrations may behave differently under observation if those demonstrations didn't include evaluation contexts. Keep your evaluation harness as thin as possible.

Despite all this, the field has converged on a practical toolbox: outcome-based scoring, trajectory rubrics, efficiency budgets, and curated benchmark suites. None of these is perfect, but together they give a reasonably honest picture of agent capability.

Task Success Rate — The Primary Metric

Section 02 — The primary metric

Task success rate (TSR) is the fraction of tasks the agent completes correctly:

Task Success Rate \[ \text{TSR} = \frac{|\{i : \text{outcome}(i) = \text{correct}\}|}{N} \] where \(N\) is the number of tasks in the evaluation set. Each task is scored binary: correct or not.

Simplicity is the metric's strength. Researchers, practitioners, and executives can all reason about "the agent succeeds on 43% of tasks." But TSR conceals enormous nuance, and three design decisions dominate its meaning.

What counts as correct?

For some tasks, correctness is unambiguous: did the code pass all unit tests? Did the file end up in the right location? Did the SQL query return the right rows? For others it is contested: did the research summary capture all the relevant facts? Did the email have the right tone? Evaluation suites that report TSR numbers are implicitly claiming that their correctness criterion is sensible — an assumption worth scrutinising every time you read a leaderboard.

Verification methods include exact match (brittle, works for structured outputs), normalised string match (tolerates whitespace and capitalisation), regex patterns, execution-based checking (run the code, check results), model-based grading (another LLM judges correctness), and human evaluation (expensive, the gold standard).

How many trials?

Because agents are non-deterministic, single-trial TSR estimates have high variance. The standard approach is pass@k: run each task $k$ times and score it correct if at least one run succeeds. This rewards agents that can find solutions reliably, not just occasionally.

Pass@k estimator \[ \text{pass}@k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \] where \(n\) is total runs and \(c\) is correct runs. This unbiased estimator avoids sampling the \(k\) runs explicitly (Chen et al., HumanEval, 2021).

For deployed systems you usually care about pass@1 — whether the agent gets it right the first time — because users don't wait for retries. For capability research, pass@k with larger $k$ is more informative about the limits of what the agent can do at all.

Task distribution

TSR is only as representative as the tasks. A benchmark that over-represents easy tasks will inflate scores; one that under-represents real-world diversity will produce agents that are narrow specialists. Good benchmarks stratify by difficulty, domain, and required capability, then report TSR at each stratum rather than aggregating everything into a single number.

Major Benchmarks

Section 03 — GAIA, SWE-bench, WebArena & more

Several benchmark suites have emerged as the de-facto leaderboards for agent capability. Each targets a different slice of what "agentic" means.

GAIA

General AI Assistant Benchmark

Real-world questions requiring multi-step reasoning, web search, file understanding, and tool use. Three difficulty levels; answers are short and verifiable. Designed so that humans score ~92% but GPT-4 Plugins scored ~15% at release.

Mialon et al., 2023 · 466 questions · 3 levels

SWE-bench

Software Engineering Benchmark

Real GitHub issues paired with test suites. The agent must navigate codebases, understand bug reports, and produce patches that pass existing tests. Verified (SWE-bench Verified) filters to unambiguous, testable issues. Tests full agentic coding loops.

Jimenez et al., 2024 · 2,294 instances · Python repos

WebArena

Realistic Web Interaction Tasks

Agents interact with fully functional web environments — Reddit, GitLab, e-commerce sites, maps — to complete natural language instructions. Tasks require multi-page navigation, form filling, and cross-site information gathering. Hard: top agents score <20% at time of publication.

Zhou et al., 2024 · 812 tasks · 5 web apps

τ-bench

Tool-Agent-User Benchmark

Simulates realistic customer service and retail workflows with dynamic user policies, multi-turn conversation, and tool calls to live databases. Uniquely tests adherence to business rules under adversarial user requests. Measures reliability across trials.

Yao et al., 2024 · retail + airline domains

Mind2Web

Cross-Website Web Navigation

Web navigation from natural language instructions across 137 websites and 2,350 tasks recorded from real users. Tests element grounding: can the agent identify which HTML element to click or type in? Evaluates generalisation to unseen websites.

Deng et al., 2023 · 137 websites · diverse domains

AgentBench

Multi-Environment Agent Evaluation

Eight diverse environments including OS shell, databases, knowledge graphs, lateral thinking puzzles, and house-holding simulations. Tests breadth of agentic capability rather than depth in one domain. Reveals how much performance varies across environment types.

Liu et al., 2023 · 8 environments · ~1,300 tasks

Major agent benchmarks mapped by task complexity and environmental realism. Benchmarks in the upper-right corner (SWE-bench, WebArena) are hardest to saturate and most predictive of real-world capability.

How to read leaderboard scores

When you see a headline number — "Agent X achieves 48% on SWE-bench Verified" — four questions are worth asking before updating your priors. First, which split was used? Some evaluations use easier subsets. Second, were any task-specific prompting tricks applied? Scaffold engineering that targets benchmark quirks doesn't transfer. Third, how many trials were run? A single-shot result on 50 tasks is much noisier than 5-shot on 500. Fourth, has the model been trained on any data that might contain benchmark solutions? Contamination is an underappreciated problem for benchmarks whose test sets are publicly available.

Trajectory Evaluation

Section 04 — Judging the journey, not just the destination

Two agents can both succeed on a task while taking radically different paths. One might verify each step before proceeding; the other might guess aggressively, get lucky, and move on. A purely outcome-based metric can't distinguish them. Trajectory evaluation fills this gap by scoring the sequence of actions directly.

What to measure in a trajectory

Step 1 — Task decomposition

Did the agent correctly identify the sub-tasks and dependencies? A coherent plan up front predicts success more reliably than the number of tools available.

Rubric: 0 (no plan), 1 (partial), 2 (complete and correct)

Step 2 — Tool call correctness

Were tool arguments valid? Did the agent pass the right parameters, avoid hallucinated field names, and handle required vs. optional fields correctly?

Rubric: fraction of tool calls with valid, non-hallucinated arguments

Step 3 — Result interpretation

Did the agent correctly parse and use tool results? A common failure mode is correctly calling a tool but misreading its output — treating a list as a single item, or ignoring an error code.

Rubric: binary per tool call — result correctly incorporated or not

Step 4 — Backtracking behaviour

When the agent hits a dead end, does it recognise the situation and recover? Agents that never backtrack either get lucky or get stuck; robust ones can detect failure and revise their plan.

Rubric: recovery rate — dead ends recovered vs. dead ends that cause task failure

Step 5 — Unnecessary actions

Did the agent take actions that contributed nothing to the task? Redundant tool calls, repeated searches for information already retrieved, and exploratory browsing of irrelevant pages all inflate cost without improving outcomes.

Rubric: unnecessary action rate = redundant steps / total steps

Automated trajectory scoring

Human evaluation of trajectories is the gold standard but doesn't scale. Three automation approaches are widely used. Reference trajectories compare agent paths to expert-authored reference solutions using token-level edit distance or step-level matching. LLM-as-judge passes the trajectory to a separate language model with a grading rubric — this is flexible but introduces a second model's biases. Programmatic checks verify specific invariants: "the agent always confirmed deletion before executing it," "no PII appeared in tool arguments," "the agent never called the billing API before authenticating."

The Correct-for-Wrong-Reasons Problem

An agent can produce the right final answer via a trajectory that would fail on any variation of the task. If the agent searched for "revenue Q3 2024" and happened to retrieve a cached page that also contained the Q4 numbers it needed — success, but only by accident. Trajectory evaluation catches these brittle solutions that outcome scoring misses entirely.

Efficiency Metrics

Section 05 — How much does it cost to get there

An agent that succeeds on every task but takes 200 tool calls and $15 per query is not deployable. Efficiency metrics quantify the cost of success alongside its probability.

STEPS

Steps to completion — the total number of reasoning–action cycles. Shorter is usually better, but not always: an agent that checks its work takes more steps and makes fewer errors. Report both steps and TSR to surface the tradeoff.

TOK

Token consumption — total input plus output tokens across the entire trajectory, including all tool call roundtrips. This is the primary cost driver. For long tasks, track cumulative input tokens separately, since the context window grows with each step and dominates total cost.

TOOL

Tool calls per task — total number of external tool invocations. Distinguish between unique tool calls (different tools or different arguments) and redundant tool calls (same tool, same arguments, called again unnecessarily). The latter is pure waste.

LAT

Wall-clock latency — elapsed time from task start to result. This is what the user experiences. It compounds model latency, tool roundtrip time, and any sequential waiting between steps. Parallelising independent tool calls is the highest-leverage latency optimisation.

COST

Cost per successful task — the most decision-relevant metric. Divide total API spend by number of tasks solved correctly. This normalises for success rate and lets you compare agents with different TSR: a 60%-success agent at $0.50/success may dominate a 70%-success agent at $2.00/success.

Cost-adjusted performance \[ \text{CAP} = \frac{\text{TSR}}{\mathbb{E}[\text{cost per task}]} \] Higher is better. CAP lets you compare agents that sit at different points on the accuracy–cost tradeoff frontier.

Efficiency metrics should always be reported alongside the success metric they condition on. An agent that halves token consumption by refusing to do hard tasks has not improved — it has just shifted the distribution of attempted tasks. Plot efficiency metrics stratified by task difficulty to avoid this illusion.

Metric	When to prioritise	Failure mode to watch
steps	Interactive use cases where users wait	Rewarding brevity causes skipping verification
tokens	API cost control, context limit headroom	Compressing context loses relevant history
tool calls	Rate-limited or metered external APIs	Batching calls introduces latency
latency	Real-time or user-facing applications	Parallelism can break sequential dependencies
cost/success	Business case for deployment	Ignores tasks the agent refuses to attempt

Safety Evaluation

Section 06 — Does the agent stay within bounds

Safety evaluation for agents goes beyond refusal testing. Because agents take real-world actions with external consequences, safety evaluation must probe whether those actions are appropriate, proportionate, and recoverable.

Scope compliance

The primary safety metric is scope compliance rate (SCR): the fraction of tasks where the agent stays within its authorised action space. Evaluation tasks should include explicit out-of-scope requests and monitor whether the agent correctly refuses them. A well-designed safety evaluation set contains tasks at the boundary — close enough to in-scope that a poorly calibrated agent might proceed, but clearly out of scope by policy.

Safety test category	What it probes	Correct agent behaviour
prompt injection	Adversarial instructions embedded in tool results or web pages	Ignore, flag, or refuse; don't follow injected commands
scope creep	Tasks that drift beyond authorised action space mid-trajectory	Pause, check authorisation, or abandon rather than continue
irreversible actions	Deletions, financial transactions, external communications	Seek confirmation or apply GATE/BLOCK tier policy
capability overreach	Attempts to acquire tools or permissions not in the original scope	Decline and explain; don't self-expand capabilities
misuse requests	User instructions that violate operator policy or ethical limits	Refuse with explanation; don't rationalise compliance
PII leakage	Sensitive data appearing in tool arguments or external calls	Redact or block before transmission

Calibration, not just refusal

A maximally cautious agent that refuses everything has perfect safety scores and zero utility. The goal is a calibrated agent: one that correctly distinguishes between tasks it should do, tasks it should confirm before doing, and tasks it should refuse entirely. Over-refusal is a safety failure of a different kind — it trains users to work around the agent rather than with it.

Safety calibration \[ \text{Safety calibration} = \text{SCR} - \alpha \cdot \text{over-refusal rate} \] where \(\alpha\) weights the cost of over-refusal relative to under-refusal. Set \(\alpha\) based on deployment context: higher for customer-facing agents, lower for high-stakes automated workflows.

Red-teaming for agents

Standard model red-teaming presents single-turn prompts designed to elicit harmful outputs. Agent red-teaming is harder because exploits can unfold across many steps, with adversarial content arriving through tool results rather than the initial prompt. Effective agent red-teaming should include multi-turn scenarios, adversarial tool environments, and prompt injection via realistic-looking web content. Automated red-teaming that uses another agent to craft adversarial trajectories is an emerging practice that scales better than purely human-driven adversarial testing.

The Benchmark Saturation Problem

Section 07 — When the leaderboard stops telling you anything useful

Goodhart's Law states that when a measure becomes a target, it ceases to be a good measure. Benchmark saturation is Goodhart's Law applied to AI evaluation: once a benchmark becomes the primary signal for capability, researchers optimise directly for it, and scores rapidly inflate beyond what underlying capability can explain.

The field has already watched this happen to benchmark after benchmark. GLUE was released in 2018; models reached human-level performance within a year. SuperGLUE replaced it; the same thing happened faster. BIG-Bench was designed to be hard; many tasks were saturated within two years. For agents, the cycle is accelerating.

GAIA (level 1)

~72%

best agent 2025

SWE-bench V.

~55%

best agent 2025

WebArena

~36%

best agent 2025

τ-bench (retail)

~48%

best agent 2025

Several mechanisms drive saturation. Test set contamination: benchmark tasks that appear in public repositories may have been seen during pretraining. Benchmark-specific engineering: practitioners learn the quirks of particular benchmarks — which retrieval heuristics work on GAIA, which code structure passes SWE-bench tests — and their agents improve on the benchmark without improving generally. Overfitting to the task distribution: models or agents fine-tuned on benchmark-adjacent data perform well on the benchmark but not on neighbouring tasks.

The Saturation Trap

A new benchmark that no current agent scores above 20% is actually more valuable than one where the best agent scores 80%. Low scores mean there is genuine signal left. By the time a benchmark appears on mainstream deployment decision matrices, it is often already in the early stages of saturation. Prefer evaluations that are continuously refreshed with new tasks, use held-out test splits, and report results stratified by task difficulty.

What sustainable benchmarks look like

Several design principles delay saturation. Procedural generation creates tasks algorithmically so the task distribution is infinite and contamination is impossible. Hidden test sets withhold evaluation data from the public; submissions are evaluated server-side. Dynamic evaluation periodically refreshes task pools and retires tasks that too many agents solve. Human-in-the-loop grading makes benchmark gaming harder because there is no fixed pattern to exploit. Stratification by difficulty preserves signal even as easier strata saturate — you can track whether progress on hard tasks is real or merely reflects easier-task spillover.

Building Custom Evaluations

Section 08 — Building evals that match your deployment

Published benchmarks measure general capability. Your deployment has specific requirements that no published benchmark was designed to capture. Before deploying an agent, you need a custom evaluation suite — one that reflects your actual task distribution, your correctness criteria, and your risk tolerance.

The five-step evaluation design process

Define the task distribution. Collect 50–200 real examples of the tasks your agent will handle. If the system isn't live yet, generate them from domain experts. Stratify by complexity, topic, and required tool usage so your evaluation set mirrors the full range of what users will actually send.

Define correctness. For each task category, specify how success is determined. Write the correctness criterion before you look at any agent outputs — post-hoc criteria are rationalised from the observed distribution, not the desired one. Wherever possible, use executable verification rather than string matching.

Build the harness. Create a reproducible environment — sandboxed tool implementations, deterministic data sources where possible, reset procedures between tasks. Log every tool call, every model response, and every intermediate state. Evaluation harnesses that don't log intermediate state will fail to diagnose failures.

Set your baselines. Run at least two reference points: a non-agentic model answering from memory (to quantify what tool access adds) and a human expert (to set a ceiling). If your agent can't beat the non-agentic baseline on simple tasks, your scaffolding is costing you more than it gains.

Schedule regular re-evaluation. Agents degrade silently. Model updates change behaviour without notice; tool APIs evolve; the task distribution drifts as users discover new use cases. Treat your evaluation suite as a regression suite: run it on every material change, and schedule it automatically on a weekly cadence even when nothing changes.

Evaluation anti-patterns

A few practices reliably produce misleading evaluation results and should be avoided. Cherry-picked tasks: selecting tasks you know the agent handles well produces impressive-sounding numbers and terrible predictions of real-world performance. Single-trial evaluation: one run per task conflates agent quality with random sampling luck. Evaluating on training distribution only: your agent may have implicitly been tuned on tasks similar to your eval set; held-out tasks that represent edge cases and novel combinations are essential. Ignoring refusals: tasks the agent refuses to attempt should count as failures, not be excluded from the denominator — otherwise you are measuring accuracy conditional on the agent trying, which overstates deployment utility.

Continuous Evaluation in Production

Section 09 — Evaluation doesn't stop at deployment

Offline evaluation on a benchmark gives you a snapshot of your agent's capability at a fixed point in time. Once deployed, the environment keeps changing. Production evaluation — monitoring and measuring agent performance on real user tasks — is the only way to detect the inevitable degradation.

What to monitor

Three tiers of monitoring are appropriate for most production agent deployments. The first tier is health metrics: task completion rate, error rates by error type (tool failure, model error, timeout), and latency distributions. These tell you the system is working, but not whether it's working well. The second tier is outcome sampling: randomly sample 1–5% of completed tasks and evaluate them against correctness criteria, either programmatically or via human review. This gives you a running TSR estimate on real user tasks. The third tier is trajectory auditing: for flagged sessions (long trajectories, high cost, explicit user negative feedback), review the full action sequence to identify systematic failure patterns that individual task scores won't surface.

Distribution shift

Production task distributions shift for several reasons. Successful products attract users with different needs than early adopters. Seasonal patterns change what users are asking about. Adversarial users discover and exploit weaknesses. Model updates at the API layer change underlying behaviour without your agent's prompts or scaffolding changing. Track the embedding distribution of incoming user requests over time — a significant shift in the distribution is often the first detectable signal that your offline evaluation set is becoming unrepresentative.

When to Trigger a Full Re-evaluation

Trigger a complete benchmark run when: (1) the production task success rate drops more than 5 percentage points from baseline, (2) a model provider updates the underlying model version, (3) more than 20% of tool API endpoints change their response schema, or (4) user feedback signals a qualitative shift in the types of failures they're reporting. Don't wait for a crisis — build these triggers into your deployment pipeline.

Closing the feedback loop

Production failures are the highest-signal input to your evaluation suite. Every task where a user explicitly marks the result as wrong, escalates to a human, or immediately retries with a rephrased request is a candidate for inclusion in your eval set. Maintain a failure library: a categorised collection of real production failures with annotated root causes. New evaluation tasks should disproportionately come from this library, because the failure library contains exactly the cases your current evaluation set is missing.

Frontier Problems

Section 10 — Where evaluation science is heading

Agent evaluation is a young field, and several hard problems remain genuinely open.

Long-horizon task evaluation

Most current benchmarks test tasks that complete in under 50 steps and under five minutes. Real-world agentic deployments increasingly involve tasks that unfold over hours, involve thousands of actions, and require maintaining consistent goals and context across context window boundaries. Evaluating these tasks requires new infrastructure: sandboxed environments that can be paused and resumed, correctness criteria that account for partial progress, and trajectory comparison methods that scale to thousands of steps.

Evaluating multi-agent systems

When multiple agents collaborate on a task, attributing success or failure to individual agents becomes difficult. If Agent A correctly plans a task but Agent B misexecutes it, who failed? Metrics that collapse across the whole system obscure these distinctions. Multi-agent evaluations need component-level attribution, protocol-level correctness checks, and stability metrics that capture whether the system reaches coherent outcomes consistently across different orderings and communication delays.

Evaluating open-ended tasks

A growing share of valuable agentic work is fundamentally open-ended: write a research report, design a marketing campaign, explore this dataset and tell me what's interesting. These tasks have no single correct answer. Evaluation methods for open-ended tasks are nascent — they typically require human judges or LLM-as-judge with carefully designed rubrics, both of which introduce significant variance and potential bias. The core challenge is defining what "better" means for tasks where multiple genuinely different outputs could all be excellent.

Benchmark design as a research area

The clearest lesson from the history of NLP and ML benchmarks is that benchmark design is itself a scientifically demanding activity — at least as demanding as the systems it evaluates. The agent evaluation community is gradually developing principled practices: stratified sampling from real user distributions, procedural task generation, held-out test sets with server-side evaluation, and human-in-the-loop grading for qualitative tasks. Expect the field to converge on evaluation standards over the next few years, much as ImageNet and then BenchLM standardised vision and language evaluation.

The Practitioner's Heuristic

No benchmark tells you how your agent will perform on your tasks. Published benchmarks are useful for orientation — they tell you roughly where a system sits in the capability landscape — but they should never substitute for evaluation on a task distribution that actually reflects your deployment. Budget at least as much effort for evaluation engineering as for agent engineering. The teams that get this right consistently outperform the teams that treat evaluation as an afterthought.

Agent Evaluation & Benchmarking, knowing whether your agent actually works.

Why this chapter matters

Why Agent Evaluation Is Hard

Task Success Rate — The Primary Metric

What counts as correct?

How many trials?

Task distribution

Major Benchmarks

How to read leaderboard scores

Trajectory Evaluation

What to measure in a trajectory

Automated trajectory scoring

Efficiency Metrics

Safety Evaluation

Scope compliance

Calibration, not just refusal

Red-teaming for agents

The Benchmark Saturation Problem

What sustainable benchmarks look like

Building Custom Evaluations

The five-step evaluation design process

Evaluation anti-patterns

Continuous Evaluation in Production

What to monitor

Distribution shift

Closing the feedback loop

Frontier Problems

Long-horizon task evaluation

Evaluating multi-agent systems

Evaluating open-ended tasks

Benchmark design as a research area

Further Reading