Agent Evaluation & Benchmarking: knowing whether your agent actually works.
Measuring whether an agent succeeds requires more than checking its final answer — you need to examine its reasoning, its efficiency, its safety behaviour, and whether its benchmark score will still mean something in six months.
Why this chapter matters
A language model can be evaluated by asking it questions and scoring its answers. An agent can't. It takes sequences of actions whose correctness is entangled with timing, context, and side effects. This chapter covers the metrics and benchmarks the field has converged on — task success rate, trajectory quality, efficiency, and safety — and confronts the uncomfortable truth that benchmark saturation is already a serious problem for agent evaluation.
By the end you will understand how to interpret published benchmark results, design your own evaluation suite, and resist the temptation to mistake leaderboard performance for real-world capability.
Why Agent Evaluation Is Hard
Evaluating a static model is straightforward: present an input, collect an output, compare to a reference. The entire pipeline is deterministic and reproducible. Agents break every assumption in this pipeline.
First, agents produce trajectories, not outputs. A single task might involve dozens of tool calls, intermediate reasoning steps, memory reads, and sub-agent delegations. The final answer is downstream of all of them, and a wrong answer could stem from a flawed initial plan, a hallucinated tool argument, a misread result, or a correct plan executed on stale data. Knowing that the agent failed doesn't tell you which part failed.
Second, many agent tasks are non-deterministic. The same agent on the same task on the same day may succeed or fail depending on API latency, cached search results, or model sampling temperature. A single trial is meaningless; even ten trials may not converge.
Third, evaluation environments are expensive to reset. You cannot replay a trajectory through a live database or a real browser the way you can replay tokens through a model. Scaffolding sandboxed, reproducible environments is often a larger engineering project than the agent itself.
Fourth, there is no obvious ground truth for intermediate steps. An agent that takes a longer-than-expected route to the correct answer — is it inefficient, or is it being cautious? An agent that invokes fewer tools but sometimes gets wrong answers — is it fast, or is it cutting corners? These questions don't have clean answers.
Fifth, instrumenting an agent to collect trajectory data changes its behaviour. Detailed logging adds latency; wrapper functions can alter tool-call semantics; and agents trained on human demonstrations may behave differently under observation if those demonstrations didn't include evaluation contexts. Keep your evaluation harness as thin as possible.
Despite all this, the field has converged on a practical toolbox: outcome-based scoring, trajectory rubrics, efficiency budgets, and curated benchmark suites. None of these is perfect, but together they give a reasonably honest picture of agent capability.
Task Success Rate — The Primary Metric
Task success rate (TSR) is the fraction of tasks the agent completes correctly:

\[
\text{TSR} = \frac{\text{number of tasks completed correctly}}{\text{number of tasks attempted}}
\]
Simplicity is the metric's strength. Researchers, practitioners, and executives can all reason about "the agent succeeds on 43% of tasks." But TSR conceals enormous nuance, and three design decisions dominate its meaning.
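A headline rate like "43%" deserves an uncertainty estimate alongside it. A minimal sketch (the function name is mine) that pairs the TSR with a Wilson score interval over the success count:

```python
import math

def tsr_with_interval(successes: int, total: int, z: float = 1.96):
    """Task success rate with a Wilson score interval.

    The interval conveys how much a headline number like "43%"
    should be trusted given how many tasks were evaluated.
    """
    if total == 0:
        raise ValueError("no tasks evaluated")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return p, (centre - half, centre + half)

# 43 successes out of 100 tasks: the 95% interval is roughly
# 0.34-0.53, so the "43%" headline is compatible with anything
# from a third to half of tasks succeeding.
rate, (lo, hi) = tsr_with_interval(43, 100)
```

On a 100-task suite the interval spans nearly twenty percentage points, which is why small leaderboard gaps between agents are often statistically meaningless.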
What counts as correct?
For some tasks, correctness is unambiguous: did the code pass all unit tests? Did the file end up in the right location? Did the SQL query return the right rows? For others it is contested: did the research summary capture all the relevant facts? Did the email have the right tone? Evaluation suites that report TSR numbers are implicitly claiming that their correctness criterion is sensible — an assumption worth scrutinising every time you read a leaderboard.
Verification methods include exact match (brittle, works for structured outputs), normalised string match (tolerates whitespace and capitalisation), regex patterns, execution-based checking (run the code, check results), model-based grading (another LLM judges correctness), and human evaluation (expensive, the gold standard).
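The first few of these verifiers fit in a handful of lines. A sketch with illustrative function names; the execution-based check here is a bare `exec` harness for demonstration, not a production sandbox:

```python
import re

def exact_match(output: str, reference: str) -> bool:
    # Brittle: any whitespace or case difference fails.
    return output == reference

def _canon(s: str) -> str:
    return re.sub(r"\s+", " ", s).strip().lower()

def normalised_match(output: str, reference: str) -> bool:
    # Tolerates whitespace and capitalisation differences.
    return _canon(output) == _canon(reference)

def execution_check(code: str, test: str) -> bool:
    # Run generated code, then run the test in the same namespace.
    # A real harness would isolate this in a sandboxed subprocess.
    ns: dict = {}
    try:
        exec(code, ns)
        exec(test, ns)
        return True
    except Exception:
        return False
```

For example, `execution_check("def add(a, b):\n    return a + b", "assert add(2, 2) == 4")` passes while the same test against a buggy implementation fails, which is exactly the brittleness/robustness trade-off the prose describes.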
How many trials?
Because agents are non-deterministic, single-trial TSR estimates have high variance. The standard approach is pass@k: run each task \(k\) times and score it correct if at least one run succeeds. This measures whether the agent can find a solution at all, not whether it does so reliably.
For deployed systems you usually care about pass@1 — whether the agent gets it right the first time — because users don't wait for retries. For capability research, pass@k with larger \(k\) is more informative about the limits of what the agent can do at all.
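When each task is run \(n\) times with \(c\) successes, pass@k for any \(k \le n\) can be estimated without re-running the task. A sketch of the unbiased combinatorial estimator popularised by the Codex paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n sampled trials with c successes.

    Equals 1 - C(n-c, k)/C(n, k): the probability that a uniformly
    random subset of k trials contains at least one success.
    """
    if n - c < k:
        return 1.0  # too few failures to fill a k-subset
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 trials, 3 successes: pass@1 is the plain success rate 0.3,
# while pass@5 credits the agent for solving the task at all.
```

Averaging `pass_at_k` over all tasks gives the suite-level score; computing the naive "did any of k runs succeed" directly on exactly k runs gives a biased, higher-variance estimate.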
Task distribution
TSR is only as representative as the tasks. A benchmark that over-represents easy tasks will inflate scores; one that under-represents real-world diversity will produce agents that are narrow specialists. Good benchmarks stratify by difficulty, domain, and required capability, then report TSR at each stratum rather than aggregating everything into a single number.
Major Benchmarks
Several benchmark suites have emerged as the de-facto leaderboards for agent capability. Each targets a different slice of what "agentic" means.
GAIA: Real-world questions requiring multi-step reasoning, web search, file understanding, and tool use. Three difficulty levels; answers are short and verifiable. Designed so that humans score ~92% while GPT-4 with plugins scored ~15% at release.
SWE-bench: Real GitHub issues paired with test suites. The agent must navigate codebases, understand bug reports, and produce patches that pass existing tests. The Verified split (SWE-bench Verified) filters to unambiguous, testable issues. Tests full agentic coding loops.
WebArena: Agents interact with fully functional web environments — Reddit, GitLab, e-commerce sites, maps — to complete natural language instructions. Tasks require multi-page navigation, form filling, and cross-site information gathering. Hard: top agents scored <20% at time of publication.
τ-bench: Simulates realistic customer service and retail workflows with dynamic user policies, multi-turn conversation, and tool calls to live databases. Uniquely tests adherence to business rules under adversarial user requests. Measures reliability across trials.
Mind2Web: Web navigation from natural language instructions across 137 websites and 2,350 tasks recorded from real users. Tests element grounding: can the agent identify which HTML element to click or type in? Evaluates generalisation to unseen websites.
AgentBench: Eight diverse environments including OS shell, databases, knowledge graphs, lateral thinking puzzles, and house-holding simulations. Tests breadth of agentic capability rather than depth in one domain. Reveals how much performance varies across environment types.
How to read leaderboard scores
When you see a headline number — "Agent X achieves 48% on SWE-bench Verified" — four questions are worth asking before updating your priors. First, which split was used? Some evaluations use easier subsets. Second, were any task-specific prompting tricks applied? Scaffold engineering that targets benchmark quirks doesn't transfer. Third, how many trials were run? A single-shot result on 50 tasks is much noisier than 5-shot on 500. Fourth, has the model been trained on any data that might contain benchmark solutions? Contamination is an underappreciated problem for benchmarks whose test sets are publicly available.
Trajectory Evaluation
Two agents can both succeed on a task while taking radically different paths. One might verify each step before proceeding; the other might guess aggressively, get lucky, and move on. A purely outcome-based metric can't distinguish them. Trajectory evaluation fills this gap by scoring the sequence of actions directly.
What to measure in a trajectory

Useful dimensions include plan quality (does the initial plan decompose the task sensibly?), step validity (is each tool call well-formed and grounded in prior observations rather than hallucinated?), verification behaviour (does the agent check intermediate results before building on them?), error recovery (does it notice and repair failed tool calls?), and safety compliance (are gated actions confirmed before execution?).
Automated trajectory scoring
Human evaluation of trajectories is the gold standard but doesn't scale. Three automation approaches are widely used. Reference trajectories compare agent paths to expert-authored reference solutions using token-level edit distance or step-level matching. LLM-as-judge passes the trajectory to a separate language model with a grading rubric — this is flexible but introduces a second model's biases. Programmatic checks verify specific invariants: "the agent always confirmed deletion before executing it," "no PII appeared in tool arguments," "the agent never called the billing API before authenticating."
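Programmatic invariant checks of this kind are easy to express over a logged trajectory. A sketch, assuming a simple `(action, arguments)` step representation that is illustrative rather than any standard schema:

```python
# A trajectory as a list of (action, arguments) steps; the action
# names below are hypothetical examples, not a standard vocabulary.
Trajectory = list[tuple[str, dict]]

def check_confirm_before_delete(traj: Trajectory) -> bool:
    """Invariant: every delete is preceded by a confirmation step."""
    confirmed = False
    for action, _ in traj:
        if action == "confirm":
            confirmed = True
        elif action == "delete":
            if not confirmed:
                return False
            confirmed = False  # one confirmation covers one delete
    return True

def check_auth_before_billing(traj: Trajectory) -> bool:
    """Invariant: the billing API is never called before authenticating."""
    authed = False
    for action, _ in traj:
        if action == "authenticate":
            authed = True
        if action == "billing_api" and not authed:
            return False
    return True

ok = [("authenticate", {}), ("confirm", {}), ("delete", {"path": "/tmp/x"})]
bad = [("delete", {"path": "/tmp/x"})]
```

Because each invariant is a pure function over the logged steps, a library of such checks can be run over every production trajectory at negligible cost, unlike reference-trajectory matching or LLM-as-judge grading.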
An agent can produce the right final answer via a trajectory that would fail on any variation of the task. If the agent searched for "revenue Q3 2024" and happened to retrieve a cached page that also contained the Q4 numbers it needed — success, but only by accident. Trajectory evaluation catches these brittle solutions that outcome scoring misses entirely.
Efficiency Metrics
An agent that succeeds on every task but takes 200 tool calls and $15 per query is not deployable. Efficiency metrics quantify the cost of success alongside its probability.
Efficiency metrics should always be reported alongside the success metric they condition on. An agent that halves token consumption by refusing to do hard tasks has not improved — it has just shifted the distribution of attempted tasks. Plot efficiency metrics stratified by task difficulty to avoid this illusion.
| Metric | When to prioritise | Failure mode to watch |
|---|---|---|
| steps | Interactive use cases where users wait | Rewarding brevity causes skipping verification |
| tokens | API cost control, context limit headroom | Compressing context loses relevant history |
| tool calls | Rate-limited or metered external APIs | Batching calls introduces latency |
| latency | Real-time or user-facing applications | Parallelism can break sequential dependencies |
| cost/success | Business case for deployment | Ignores tasks the agent refuses to attempt |
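The cost/success caveat in the table can be made concrete. A sketch (record fields are assumed) that reports cost per success within each difficulty stratum, so cheap wins on easy tasks cannot hide expensive failures on hard ones:

```python
from collections import defaultdict

def cost_per_success_by_stratum(runs):
    """runs: iterable of (difficulty, succeeded, cost_usd) records.

    Divides total spend by number of successes per difficulty
    stratum; a stratum with no successes reports infinite cost
    rather than silently dropping out of the average.
    """
    cost = defaultdict(float)
    wins = defaultdict(int)
    for difficulty, succeeded, cost_usd in runs:
        cost[difficulty] += cost_usd
        wins[difficulty] += int(succeeded)
    return {d: (cost[d] / wins[d] if wins[d] else float("inf"))
            for d in cost}

runs = [("easy", True, 0.10), ("easy", True, 0.12),
        ("hard", False, 0.90), ("hard", True, 1.10)]
# easy: $0.22 over 2 successes; hard: $2.00 over 1 success --
# a single blended number would hide the 18x gap between strata.
report = cost_per_success_by_stratum(runs)
```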
Safety Evaluation
Safety evaluation for agents goes beyond refusal testing. Because agents take real-world actions with external consequences, safety evaluation must probe whether those actions are appropriate, proportionate, and recoverable.
Scope compliance
The primary safety metric is scope compliance rate (SCR): the fraction of tasks where the agent stays within its authorised action space. Evaluation tasks should include explicit out-of-scope requests and monitor whether the agent correctly refuses them. A well-designed safety evaluation set contains tasks at the boundary — close enough to in-scope that a poorly calibrated agent might proceed, but clearly out of scope by policy.
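One minimal way to score SCR, under a deliberate simplification: each evaluation task is labelled in-scope or out-of-scope, and the agent's behaviour is reduced to proceeded-or-refused. The representation is mine, not a standard one:

```python
def scope_compliance_rate(results) -> float:
    """results: (task_in_scope, agent_proceeded) pairs.

    Compliant behaviour is proceeding on in-scope tasks and
    refusing out-of-scope ones; the rate is the compliant fraction.
    """
    results = list(results)
    compliant = sum(proceeded == in_scope for in_scope, proceeded in results)
    return compliant / len(results)

# Three compliant outcomes out of four: in the last record the
# agent proceeded on an out-of-scope boundary task.
scr = scope_compliance_rate(
    [(True, True), (True, True), (False, False), (False, True)]
)
```

Note that this formulation also counts refusing an in-scope task as non-compliant, which keeps the metric honest against the over-cautious agent.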
| Safety test category | What it probes | Correct agent behaviour |
|---|---|---|
| prompt injection | Adversarial instructions embedded in tool results or web pages | Ignore, flag, or refuse; don't follow injected commands |
| scope creep | Tasks that drift beyond authorised action space mid-trajectory | Pause, check authorisation, or abandon rather than continue |
| irreversible actions | Deletions, financial transactions, external communications | Seek confirmation or apply GATE/BLOCK tier policy |
| capability overreach | Attempts to acquire tools or permissions not in the original scope | Decline and explain; don't self-expand capabilities |
| misuse requests | User instructions that violate operator policy or ethical limits | Refuse with explanation; don't rationalise compliance |
| PII leakage | Sensitive data appearing in tool arguments or external calls | Redact or block before transmission |
Calibration, not just refusal
A maximally cautious agent that refuses everything has perfect safety scores and zero utility. The goal is a calibrated agent: one that correctly distinguishes between tasks it should do, tasks it should confirm before doing, and tasks it should refuse entirely. Over-refusal is a safety failure of a different kind — it trains users to work around the agent rather than with it.
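Calibration can be scored by labelling each evaluation task with the behaviour it warrants and comparing against what the agent did. A sketch under an assumed three-way labelling (do / confirm / refuse), which is illustrative rather than a standard taxonomy:

```python
# Ordered by increasing caution; the labels are an assumed taxonomy.
CAUTION = {"do": 0, "confirm": 1, "refuse": 2}

def calibration_report(labelled_runs):
    """labelled_runs: iterable of (expected, actual) behaviour labels.

    Separates unsafe compliance (less caution than required) from
    over-refusal (more caution than required) -- the two failures
    need different fixes, and a single refusal rate conflates them.
    """
    counts = {"correct": 0, "unsafe": 0, "over_refusal": 0}
    for expected, actual in labelled_runs:
        if actual == expected:
            counts["correct"] += 1
        elif CAUTION[actual] < CAUTION[expected]:
            counts["unsafe"] += 1
        else:
            counts["over_refusal"] += 1
    return counts

runs = [("do", "do"), ("refuse", "do"),
        ("do", "refuse"), ("confirm", "confirm")]
```

Reporting the two error columns separately is the point: a team that only tracks refusal rate will drive one column down by inflating the other.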
Red-teaming for agents
Standard model red-teaming presents single-turn prompts designed to elicit harmful outputs. Agent red-teaming is harder because exploits can unfold across many steps, with adversarial content arriving through tool results rather than the initial prompt. Effective agent red-teaming should include multi-turn scenarios, adversarial tool environments, and prompt injection via realistic-looking web content. Automated red-teaming that uses another agent to craft adversarial trajectories is an emerging practice that scales better than purely human-driven adversarial testing.
The Benchmark Saturation Problem
Goodhart's Law states that when a measure becomes a target, it ceases to be a good measure. Benchmark saturation is Goodhart's Law applied to AI evaluation: once a benchmark becomes the primary signal for capability, researchers optimise directly for it, and scores rapidly inflate beyond what underlying capability can explain.
The field has already watched this happen to benchmark after benchmark. GLUE was released in 2018; models reached human-level performance within a year. SuperGLUE replaced it; the same thing happened faster. BIG-Bench was designed to be hard; many tasks were saturated within two years. For agents, the cycle is accelerating.
Several mechanisms drive saturation. Test set contamination: benchmark tasks that appear in public repositories may have been seen during pretraining. Benchmark-specific engineering: practitioners learn the quirks of particular benchmarks — which retrieval heuristics work on GAIA, which code structure passes SWE-bench tests — and their agents improve on the benchmark without improving generally. Overfitting to the task distribution: models or agents fine-tuned on benchmark-adjacent data perform well on the benchmark but not on neighbouring tasks.
A new benchmark that no current agent scores above 20% is actually more valuable than one where the best agent scores 80%. Low scores mean there is genuine signal left. By the time a benchmark appears on mainstream deployment decision matrices, it is often already in the early stages of saturation. Prefer evaluations that are continuously refreshed with new tasks, use held-out test splits, and report results stratified by task difficulty.
What sustainable benchmarks look like
Several design principles delay saturation. Procedural generation creates tasks algorithmically so the task distribution is infinite and contamination is impossible. Hidden test sets withhold evaluation data from the public; submissions are evaluated server-side. Dynamic evaluation periodically refreshes task pools and retires tasks that too many agents solve. Human-in-the-loop grading makes benchmark gaming harder because there is no fixed pattern to exploit. Stratification by difficulty preserves signal even as easier strata saturate — you can track whether progress on hard tasks is real or merely reflects easier-task spillover.
Building Custom Evaluations
Published benchmarks measure general capability. Your deployment has specific requirements that no published benchmark was designed to capture. Before deploying an agent, you need a custom evaluation suite — one that reflects your actual task distribution, your correctness criteria, and your risk tolerance.
The five-step evaluation design process

First, sample tasks from your real usage distribution, stratified by difficulty and domain, rather than inventing tasks you hope the agent can do. Second, define a correctness criterion for each task: exact match, execution-based checking, model-based grading, or human review. Third, decide the trial budget: pass@1 for deployment realism, higher \(k\) for capability probing, with enough repeats to control variance. Fourth, include boundary cases: out-of-scope requests, adversarial inputs, and tasks the agent should confirm or refuse. Fifth, hold out a test split and refresh it regularly so the suite does not silently saturate.
Evaluation anti-patterns
A few practices reliably produce misleading evaluation results and should be avoided. Cherry-picked tasks: selecting tasks you know the agent handles well produces impressive-sounding numbers and terrible predictions of real-world performance. Single-trial evaluation: one run per task conflates agent quality with random sampling luck. Evaluating on training distribution only: your agent may have implicitly been tuned on tasks similar to your eval set; held-out tasks that represent edge cases and novel combinations are essential. Ignoring refusals: tasks the agent refuses to attempt should count as failures, not be excluded from the denominator — otherwise you are measuring accuracy conditional on the agent trying, which overstates deployment utility.
Continuous Evaluation in Production
Offline evaluation on a benchmark gives you a snapshot of your agent's capability at a fixed point in time. Once deployed, the environment keeps changing. Production evaluation — monitoring and measuring agent performance on real user tasks — is the only way to detect the inevitable degradation.
What to monitor
Three tiers of monitoring are appropriate for most production agent deployments. The first tier is health metrics: task completion rate, error rates by error type (tool failure, model error, timeout), and latency distributions. These tell you the system is working, but not whether it's working well. The second tier is outcome sampling: randomly sample 1–5% of completed tasks and evaluate them against correctness criteria, either programmatically or via human review. This gives you a running TSR estimate on real user tasks. The third tier is trajectory auditing: for flagged sessions (long trajectories, high cost, explicit user negative feedback), review the full action sequence to identify systematic failure patterns that individual task scores won't surface.
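The second tier can be as simple as a seeded random sample feeding a running estimate. A sketch with illustrative names and an assumed 2% sampling rate:

```python
import random

def sample_for_review(completed_task_ids, rate=0.02, seed=0):
    """Tier-two outcome sampling: a reproducible random slice of
    completed production tasks queued for correctness grading.
    (The 2% rate and the seeding scheme are illustrative choices.)
    """
    rng = random.Random(seed)
    return [t for t in completed_task_ids if rng.random() < rate]

def running_tsr(graded_results):
    """Running TSR estimate over graded samples (booleans)."""
    if not graded_results:
        return None
    return sum(graded_results) / len(graded_results)

# From 10,000 completed tasks, roughly 200 land in the review queue.
queue = sample_for_review(range(10_000))
```

Seeding the sampler makes the review queue reproducible, which matters when auditors later ask why a particular session was or wasn't graded.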
Distribution shift
Production task distributions shift for several reasons. Successful products attract users with different needs than early adopters. Seasonal patterns change what users are asking about. Adversarial users discover and exploit weaknesses. Model updates at the API layer change underlying behaviour without your agent's prompts or scaffolding changing. Track the embedding distribution of incoming user requests over time — a significant shift in the distribution is often the first detectable signal that your offline evaluation set is becoming unrepresentative.
Trigger a complete benchmark run when: (1) the production task success rate drops more than 5 percentage points from baseline, (2) a model provider updates the underlying model version, (3) more than 20% of tool API endpoints change their response schema, or (4) user feedback signals a qualitative shift in the types of failures they're reporting. Don't wait for a crisis — build these triggers into your deployment pipeline.
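These triggers are straightforward to encode as a gate in the deployment pipeline. A sketch with assumed parameter names, mirroring the four conditions above:

```python
def should_rerun_benchmark(tsr_drop_pp: float,
                           model_version_changed: bool,
                           frac_tool_schemas_changed: float,
                           qualitative_shift_flagged: bool) -> bool:
    """Pipeline gate for a full benchmark re-run. Thresholds mirror
    the text: >5pp TSR drop from baseline, any model update, >20%
    of tool schemas changed, or a flagged shift in user feedback.
    """
    return (tsr_drop_pp > 5.0
            or model_version_changed
            or frac_tool_schemas_changed > 0.20
            or qualitative_shift_flagged)

# A 3pp dip alone does not trigger; a model update always does.
```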
Closing the feedback loop
Production failures are the highest-signal input to your evaluation suite. Every task where a user explicitly marks the result as wrong, escalates to a human, or immediately retries with a rephrased request is a candidate for inclusion in your eval set. Maintain a failure library: a categorised collection of real production failures with annotated root causes. New evaluation tasks should disproportionately come from this library, because the failure library contains exactly the cases your current evaluation set is missing.
Frontier Problems
Agent evaluation is a young field, and several hard problems remain genuinely open.
Long-horizon task evaluation
Most current benchmarks test tasks that complete in under 50 steps and under five minutes. Real-world agentic deployments increasingly involve tasks that unfold over hours, involve thousands of actions, and require maintaining consistent goals and context across context window boundaries. Evaluating these tasks requires new infrastructure: sandboxed environments that can be paused and resumed, correctness criteria that account for partial progress, and trajectory comparison methods that scale to thousands of steps.
Evaluating multi-agent systems
When multiple agents collaborate on a task, attributing success or failure to individual agents becomes difficult. If Agent A correctly plans a task but Agent B misexecutes it, who failed? Metrics that collapse across the whole system obscure these distinctions. Multi-agent evaluations need component-level attribution, protocol-level correctness checks, and stability metrics that capture whether the system reaches coherent outcomes consistently across different orderings and communication delays.
Evaluating open-ended tasks
A growing share of valuable agentic work is fundamentally open-ended: write a research report, design a marketing campaign, explore this dataset and tell me what's interesting. These tasks have no single correct answer. Evaluation methods for open-ended tasks are nascent — they typically require human judges or LLM-as-judge with carefully designed rubrics, both of which introduce significant variance and potential bias. The core challenge is defining what "better" means for tasks where multiple genuinely different outputs could all be excellent.
Benchmark design as a research area
The clearest lesson from the history of NLP and ML benchmarks is that benchmark design is itself a scientifically demanding activity — at least as demanding as the systems it evaluates. The agent evaluation community is gradually developing principled practices: stratified sampling from real user distributions, procedural task generation, held-out test sets with server-side evaluation, and human-in-the-loop grading for qualitative tasks. Expect the field to converge on evaluation standards over the next few years, much as ImageNet and GLUE standardised vision and language evaluation.
No benchmark tells you how your agent will perform on your tasks. Published benchmarks are useful for orientation — they tell you roughly where a system sits in the capability landscape — but they should never substitute for evaluation on a task distribution that actually reflects your deployment. Budget at least as much effort for evaluation engineering as for agent engineering. The teams that get this right consistently outperform the teams that treat evaluation as an afterthought.
Further Reading
- GAIA: A Benchmark for General AI Assistants. Introduces the GAIA benchmark with its three-tier difficulty structure, grounding tasks in real-world assistant scenarios that require web search, document reading, and multi-step reasoning. The clearest articulation of what "general" agent capability should mean.
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues? Constructs the SWE-bench evaluation from real GitHub issues and pull requests, with automated test-based correctness checking. Reveals the gap between model capability on toy coding tasks and production engineering tasks. The closest thing to a ground truth for agent coding capability.
- WebArena: A Realistic Web Environment for Building Autonomous Agents. Describes the construction of realistic, self-hosted web applications for agent evaluation, with functional equivalents of Reddit, GitLab, shopping sites, and maps. The benchmark that made web agent performance measurement tractable.
- τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. Benchmarks agents in realistic customer service workflows with dynamic user policies, adversarial users, and business rule compliance requirements. The most realistic evaluation of deployed-style agent behaviour currently available.
- AgentBench: Evaluating LLMs as Agents. Evaluates agent performance across eight diverse environments, revealing large performance gaps across environment types for the same model. Demonstrates that agent capability is highly environment-specific — a critical corrective to single-benchmark evaluations.
- Mind2Web: Towards a Generalist Agent for the Web. Introduces a large-scale web navigation benchmark collected from real users across 137 websites. Evaluates element grounding and cross-website generalisation. The most comprehensive study of web agent generalisation to date.