Part VI · NLP & Large Language Models · Chapter 10

LLM evaluation is the discipline of deciding whether a large language model is any good — a problem that turns out to be much harder than training the model, because the outputs are free text, the task is open-ended, the benchmarks leak into training data, the graders are themselves other LLMs, and the numbers that get reported on leaderboards bear an uneasy relationship to the behaviours users actually experience.

An LLM produces free-form text over an effectively infinite space of possible inputs. There is no test set that covers it. There is no single accuracy number that summarises its quality. The output "Paris is the capital of France" and the output "The capital city of France is Paris" are equivalently correct but textually different; "Paris" and "Lyon" are textually similar but semantically opposite. Every traditional ML evaluation assumption — a fixed label set, a clear notion of correctness, a bounded input distribution — breaks down. And yet evaluation has never mattered more: leaderboards drive billions of dollars of compute allocation, procurement decisions, and research direction. The field has responded with an elaborate and still-shaky apparatus: static benchmarks (MMLU, GSM8K, HumanEval, GPQA) that test capabilities; dynamic benchmarks (Chatbot Arena, LiveBench) that update faster than contamination can catch up; LLM-as-judge (G-Eval, pairwise preference) that uses a strong model to grade weaker ones; human evaluation with its annotator-agreement problems; red-teaming and safety evaluation with its adversarial structure; and a growing literature on when each of these methods lies. This chapter is a guided tour of that apparatus — what each method measures, what it misses, and how to put them together into an evaluation that you can actually trust.

How to read this chapter

Sections one and two frame the territory. Section one is why evaluation is hard — the specific reasons that evaluating a generative LLM differs from evaluating a classifier or a regressor, and why the methods that worked for classical ML break down. Section two is the taxonomy of what we are trying to measure: capability, preference, safety, efficiency, and calibration, each with its own methods and its own failure modes.

Sections three through eight cover the specific benchmark families that dominate capability evaluation. Section three is static benchmarks — the MMLU / BIG-Bench / HELM tradition of fixed multiple-choice or short-answer test sets that you run once and report a number. Section four zooms in on reasoning benchmarks — GSM8K, MATH, GPQA, ARC, and the question of whether they measure reasoning or pattern-matching. Section five is code benchmarks — HumanEval, MBPP, SWE-bench, LiveCodeBench — where execution-based grading makes evaluation cleaner than elsewhere. Section six is long-context benchmarks — needle-in-a-haystack, RULER, LongBench, BABILong — and their progression from easy to genuinely difficult. Section seven is multilingual benchmarks — the challenge of evaluating capabilities beyond English. Section eight is agent benchmarks — GAIA, WebArena, OSWorld — where the model is scored on completing tasks in an environment, not on producing text.

Section nine is the problem that hangs over the whole chapter: contamination. Benchmark questions leak into training data, either directly or through the web scrapes that feed pretraining, and reported scores can be quietly inflated by memorisation. Section ten covers the response: dynamic benchmarks — Chatbot Arena, LiveBench, MixEval — that either use rolling datasets or live human judgement to stay ahead of leakage.

Sections eleven and twelve cover the grading methodologies that allow open-ended outputs to be scored at all. Section eleven is LLM-as-judge — the dominant automatic method, its biases, and the techniques (chain-of-thought judging, reference-based scoring, pairwise fusion) that partly correct for them. Section twelve is human evaluation — pairwise, Likert, best-of-n, with the annotator-agreement and cost problems that come with it.

Sections thirteen through fifteen cover evaluation beyond capability. Section thirteen is safety evaluation — refusal rates, jailbreaks, dual-use probes, WMDP, red-teaming. Section fourteen is bias and fairness evaluation — BBQ, stereotype probes, demographic measurement. Section fifteen is hallucination and factuality evaluation — TruthfulQA, FActScore, SimpleQA, and the problem of grounded correctness.

Section sixteen is the chapter's critical moment: the leaderboard critique. Goodhart's law, selection effects, reporting bias, replication failures, the "eval hacking" that follows compute-intensive benchmarks, and the reasons careful practitioners distrust public numbers even when they are technically correct.

The closing section places evaluation in the LLM lifecycle — where it sits alongside pretraining, instruction tuning, fine-tuning, and retrieval — and sketches the open problems: continuous evaluation as LLMs update, evaluation for autonomous systems, evaluation of alignment properties that only appear at scale, and the increasing divergence between what benchmarks measure and what users care about.

Contents

  1. Why evaluation is hard — Generative outputs, open-ended inputs, and why classical ML evaluation breaks down
  2. The evaluation taxonomy — Capability, preference, safety, efficiency, calibration
  3. Static capability benchmarks — MMLU, BIG-Bench, HELM, the first-wave test suites
  4. Reasoning benchmarks — GSM8K, MATH, GPQA, ARC, and whether they measure reasoning
  5. Code benchmarks — HumanEval, MBPP, SWE-bench, LiveCodeBench — execution-graded evaluation
  6. Long-context benchmarks — NIAH, RULER, LongBench, InfiniteBench, BABILong
  7. Multilingual benchmarks — XTREME, Flores, MGSM, the evaluation gap beyond English
  8. Agent benchmarks — GAIA, WebArena, OSWorld, SWE-bench Verified
  9. Contamination — Training-set leakage, detection methods, the benchmark-rot problem
  10. Dynamic benchmarks — Chatbot Arena, LiveBench, MixEval, rolling datasets
  11. LLM-as-judge — G-Eval, pairwise preference, judging biases, the corrective techniques
  12. Human evaluation — Pairwise, Likert, annotator agreement, cost/reliability tradeoffs
  13. Safety evaluation — Red-teaming, refusals, jailbreaks, WMDP, dual-use probes
  14. Bias and fairness — BBQ, stereotype probes, demographic measurement
  15. Hallucination and factuality — TruthfulQA, FActScore, SimpleQA, grounded correctness
  16. The leaderboard critique — Goodhart, selection effects, eval hacking, replication failures
  17. Evaluation in the LLM lifecycle — Continuous eval, agents, alignment, divergence between benchmarks and users

§1

Why evaluation is hard for LLMs

Classical machine-learning evaluation has a simple shape. There is a test set of labelled examples, a model produces a prediction for each, and accuracy, F1, or RMSE summarises the result. The numbers are well defined, statistically tractable, and tightly correlated with downstream utility. Evaluating a large language model breaks nearly every assumption in that picture. The output is not a label but free text; the input distribution is not fixed but effectively infinite; correctness is often a matter of style or interpretation rather than fact; and the same model can pass a test one day and fail it the next depending on how the prompt was worded.

The first structural problem is that the output space is open. Ask a classifier whether an email is spam and the model returns one of two labels. Ask an LLM to summarise a document and there are hundreds of plausible correct summaries and tens of thousands of partially-correct ones. No reference answer covers the space. String-match metrics (BLEU, ROUGE, exact match) work for very narrow slices — translation, well-defined QA — and fail catastrophically on most real tasks. The evaluator has to compare two pieces of text and decide whether they say the same thing, which is itself an NLP problem.
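
The standard workaround for short-answer QA is to normalise both strings and compute exact match plus token-overlap F1, as in the SQuAD evaluation script. A minimal sketch (normalisation follows the usual lowercase, strip-punctuation, drop-articles convention):

```python
import re
import string
from collections import Counter

def normalize(text):
    # lowercase, drop punctuation and articles, collapse whitespace
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, ref):
    return float(normalize(pred) == normalize(ref))

def token_f1(pred, ref):
    p, r = normalize(pred).split(), normalize(ref).split()
    common = Counter(p) & Counter(r)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)
```

On the Paris example above, exact match scores the paraphrase 0 while token F1 scores it roughly 0.91, so F1 partially fixes the paraphrase problem. But F1 rewards surface overlap: a fluent wrong answer that happens to share many words with the reference outscores a terse right one, which is precisely the failure mode on open-ended tasks.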

The second structural problem is that the input space is open. An LLM will be called with inputs that no benchmark anticipated: unusual formatting, multi-turn context, domain-specific jargon, adversarial prompts, code interleaved with prose. A fixed test set, however carefully curated, samples one region of that space. A model that scores well on the test set can be wildly miscalibrated on traffic that looks nothing like it.

The prompt-sensitivity problem. Small changes to the prompt — a different few-shot example, a reordering of instructions, a synonym for a key word — can move a model's benchmark score by several points, sometimes more. The Prompt Report (Schulhoff et al. 2024) and the earlier prompt-sensitivity literature document this thoroughly. A published accuracy number is conditional on a specific prompt that has often been tuned (implicitly or explicitly) to make the number look good. This is one of the main reasons published benchmark scores are not directly comparable across papers.

The third structural problem is non-determinism. Most LLMs are sampled at non-zero temperature, and even with temperature zero there is floating-point variance on GPUs. The same model on the same prompt gives different outputs on different runs. Reporting a single number is technically an expected value over that distribution, and the variance is almost never reported. Two models whose headline scores differ by a point may have overlapping confidence intervals — but confidence intervals are almost never computed.
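
The fix is cheap: given per-item correctness scores, a percentile bootstrap produces an interval in a few lines, and reporting it alongside the point estimate would settle many one-point leaderboard disputes. A sketch:

```python
import random

def bootstrap_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for mean accuracy.
    `correct` is a list of per-item 0/1 scores."""
    rng = random.Random(seed)
    n = len(correct)
    means = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(correct) / n, (lo, hi)
```

With 100 items at 80% accuracy, the 95% interval spans roughly ±8 points — wider than many headline gaps between adjacent leaderboard entries.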

The fourth structural problem, and the thread that runs through the rest of this chapter, is that benchmarks leak. LLMs are trained on web-scale text that inevitably includes (or paraphrases) the same questions that appear on evaluation sets. A score can reflect memorisation rather than generalisation, and distinguishing the two after the fact is hard. The field has developed a growing arsenal of responses — contamination detection, rolling benchmarks, held-out test sets — but the problem is structural and is not going away.

The fifth structural problem is that what we care about is not what we can measure. Benchmark scores capture narrow competencies; users care about whether the model is helpful, honest, and pleasant to work with over long conversations. The gap between benchmark leaderboards and actual user experience has become a persistent source of embarrassment for the field: a model can climb MMLU by five points and feel subjectively worse in deployment. Bridging that gap — finding proxies for usefulness that can be measured reproducibly — is much of what modern LLM evaluation is trying to do.

§2

The evaluation taxonomy

A useful first move in LLM evaluation is to separate what you are trying to measure from how you are measuring it. The "what" decomposes roughly into five axes — capability, preference, safety, efficiency, and calibration — each of which has its own benchmarks, its own failure modes, and its own relationship to the final product. A system that scores well on capability may refuse too often; a system that wins preference contests may hallucinate; a system that looks safe on canned red-team probes may fail a live jailbreak attempt.

Capability is what "LLM evaluation" usually means in casual usage: can the model do arithmetic, write code, answer factual questions, reason about multi-hop scenarios, translate, summarise? This is the axis that populates most benchmark leaderboards — MMLU, GSM8K, HumanEval, GPQA — and it is also the axis where the apparatus is most mature and the contamination problems are worst. Sections 3–8 cover the capability benchmarks by category.

Preference is whether humans (or strong LLMs acting as proxies for humans) prefer this model's outputs over another's, averaged over a large and diverse set of prompts. The Chatbot Arena Elo scoreboard is the canonical example; G-Eval pairwise preference is its automated cousin. Preference captures a different thing from capability — style, helpfulness, the absence of annoying verbal tics — and the correlation between preference ranks and capability ranks is positive but loose.
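
The mechanics behind an Elo scoreboard are worth seeing concretely: each pairwise battle nudges both ratings toward consistency with the observed outcome. (Arena's published rankings are now fit with a Bradley-Terry model over the full battle log rather than updated sequentially, but the logistic win-probability model is the same.) A minimal sketch:

```python
def expected_score(r_a, r_b):
    # P(A beats B) under the Elo logistic model (400-point scale)
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, outcome, k=32):
    # outcome: 1.0 = A wins, 0.0 = B wins, 0.5 = tie
    delta = k * (outcome - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

def replay(battles, init=1000.0, k=32):
    # battles: list of (model_a, model_b, outcome) records
    ratings = {}
    for a, b, outcome in battles:
        ra, rb = ratings.get(a, init), ratings.get(b, init)
        ratings[a], ratings[b] = elo_update(ra, rb, outcome, k)
    return ratings
```

A 400-point rating gap corresponds to a ~91% win probability, which is why frontier leaderboards that span only 100–200 points still describe models that lose to each other regularly.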

Safety includes refusal of harmful requests, resistance to jailbreaks, avoidance of generating biased or toxic content, and — at the frontier — avoidance of dual-use uplift (e.g. helping with bioweapons or offensive cybersecurity). Safety evaluation has its own benchmarks (HarmBench, WMDP, StrongReject) and its own methodology (red-teaming, adversarial probes) that differ substantially from capability evaluation.

Efficiency covers latency, tokens per second, cost per task, and — increasingly — total compute per correct answer. A model that scores one point higher on a benchmark while costing ten times as much is usually not an improvement. Efficiency is a first-class metric in serving but is under-reported in research papers.

Calibration is whether the model's confidence in its answer correlates with its actual correctness. A well-calibrated model that says "I'm 70% sure" is right 70% of the time; a poorly-calibrated model is right less often or more often than its stated confidence implies. Calibration is important for downstream abstention, for deciding when to hand off to a human, and for agentic systems that take actions based on self-reported uncertainty.
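
The standard summary metric is expected calibration error (ECE): bucket predictions by stated confidence, compare each bucket's average confidence to its actual accuracy, and weight the gaps by bucket size. A minimal implementation:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width confidence bins.
    confidences: floats in [0, 1]; correct: matching 0/1 outcomes."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(avg_conf - acc)
    return ece
```

A model that says "90% sure" and is right 90% of the time contributes nothing to ECE; the same stated confidence with 60% accuracy contributes a 0.3 gap weighted by how often it occurs.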

Methods vs axes. Orthogonal to the what is the how: static benchmarks (fixed test sets, one-shot scoring), dynamic benchmarks (rolling data, live human grading), LLM-as-judge (a stronger model grades outputs), execution-based grading (run the code, run the SQL, check the answer), and human evaluation (pairwise or Likert ratings by crowd-workers or experts). Each of the five axes can in principle be measured by any of the five methods; in practice, certain combinations dominate — capability mostly via static benchmarks, preference mostly via pairwise human or LLM-as-judge — and the rest of this chapter follows that structure.

A final useful distinction: offline evaluation runs against a fixed test set and produces a summary number; online evaluation measures live system behaviour (user engagement, task-completion rate, escalation frequency). A healthy LLM product has both, correlated, and treats sustained divergence between them as a signal that the offline evaluation has drifted from reality.

§3

Static capability benchmarks

The first wave of LLM evaluation reused the template that worked for pretrained language models and classifiers before them: assemble a large test set of questions with reference answers, run the model over it, report accuracy. The paradigmatic examples — MMLU, BIG-Bench, HELM — established the format that every subsequent benchmark either imitates or reacts against. Understanding them is a prerequisite for reading current LLM papers.

MMLU (Massive Multitask Language Understanding, Hendrycks et al. 2020) is a 15,908-question multiple-choice test covering 57 subjects from elementary mathematics through US foreign policy. For each question, the model chooses among four options; accuracy is averaged across subjects. MMLU became the de facto measure of "general knowledge" for LLMs because its breadth was unusual at the time, because multiple-choice scoring is objective, and because the score moved smoothly as models scaled. GPT-3 scored ~44%; GPT-4 scored 86%; by 2025 frontier models were pushing into the low 90s and MMLU had largely saturated.
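
One detail that matters when comparing MMLU numbers across papers: "averaged across subjects" means a macro average, which differs from pooled per-question (micro) accuracy whenever subjects have unequal sizes — and papers are not always explicit about which they report. A sketch of both:

```python
from collections import defaultdict

def accuracy_breakdown(results):
    """results: list of (subject, is_correct) pairs.
    Returns (micro, macro): pooled per-question accuracy vs the
    unweighted mean of per-subject accuracies."""
    by_subject = defaultdict(list)
    for subject, ok in results:
        by_subject[subject].append(ok)
    micro = sum(ok for _, ok in results) / len(results)
    macro = sum(sum(v) / len(v) for v in by_subject.values()) / len(by_subject)
    return micro, macro
```

The two can differ by a point or more when a model is strong on the large subjects and weak on the small ones, which is enough to reorder adjacent leaderboard entries.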

BIG-Bench (Srivastava et al. 2022) aimed wider: 204 tasks contributed by 450 authors across a long tail of capabilities — logic puzzles, folk physics, humour recognition, code debugging, specific languages. The intent was to probe the edges of LLM capability, including capabilities where LLMs were clearly failing. BIG-Bench Hard (Suzgun et al. 2022) is the 23-task subset where PaLM 540B underperformed average human raters; it has become more commonly reported than the full benchmark.

HELM (Holistic Evaluation of Language Models, Liang et al. 2022, ongoing at Stanford CRFM) was an explicit response to the "one-number" culture: a framework that scores each model on 42 scenarios across 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency). HELM's philosophical claim is that LLMs have multiple distinct qualities that benchmarks should report separately rather than collapsing into a single aggregate. In practice HELM's rankings are dominated by capability, but the framework's influence on how evaluations are structured has been substantial.

What the first-wave benchmarks got right and wrong. They got right that a public, reproducible, multi-task benchmark drives progress — the same dynamic that drove ImageNet in vision. They got wrong that accuracy on multiple-choice questions is a reliable proxy for generative capability. A model that picks the correct answer out of four options is not necessarily a model that can produce the correct answer unprompted, and the gap between "recognise the right answer" and "generate the right answer" grows as benchmarks saturate. This is one of the reasons the field has shifted toward open-ended, execution-based, and preference-based methods.

GPQA (Graduate-Level Google-Proof QA, Rein et al. 2023) is a more recent static benchmark designed as a harder MMLU. Its 448 questions are PhD-level in biology, chemistry, and physics, written by domain experts and constructed so that even skilled non-experts with unrestricted web access score only ~34%. PhD-level domain experts score ~65%; as of 2026 frontier models score in the 70s — a position that has made GPQA one of the few capability benchmarks that still moves with frontier progress.

Other second-wave static benchmarks worth knowing: MMLU-Pro (Wang et al. 2024) — ten answer options rather than four, harder distractors, less contaminated; AGIEval (Zhong et al. 2023) — real-world professional exams (SAT, LSAT, bar exam); C-Eval, CMMLU — Chinese-language MMLU analogues. Each extends the MMLU template to a new surface: harder questions, new languages, real exam content.

The structural limitation of all of these is the same: they are static test sets on the public web. Once the questions are published, they can — and do — find their way into training data. The scores tend to rise with model size even in the absence of any genuine capability increase. The field's response has been to keep building new benchmarks faster than the old ones leak, and to supplement the static paradigm with dynamic alternatives (§10).
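
The crudest contamination check is also the most widely used: flag a benchmark item if it shares any long n-gram with the training corpus (the GPT-3 paper's decontamination used 13-grams). A sketch, assuming both sides are available as plain text:

```python
def ngrams(tokens, n=13):
    # set of all contiguous n-token windows
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark_items, corpus_text, n=13):
    """Fraction of benchmark items that share at least one
    n-gram (whitespace tokens, lowercased) with the corpus."""
    corpus_grams = ngrams(corpus_text.lower().split(), n)
    flagged = sum(
        1 for item in benchmark_items
        if ngrams(item.lower().split(), n) & corpus_grams
    )
    return flagged / len(benchmark_items)
```

This catches verbatim leakage only. Paraphrased or translated contamination slips straight through, which is why the detection literature has moved on to embedding similarity, perplexity probes, and completion-based tests.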

§4

Reasoning benchmarks

Capability benchmarks for reasoning are meant to separate models that compute answers from models that recall answers. The hope is that problems requiring multi-step reasoning are harder to memorise than factual questions, so accuracy on them tracks something closer to genuine capability. The reality is more complicated: the best models now solve most published reasoning benchmarks, and the question of whether they are "really reasoning" or "pattern-matching very well" is one of the more-argued and less-resolvable debates in the current literature.

GSM8K (Cobbe et al. 2021) is the workhorse grade-school-maths benchmark: 8,500 word problems requiring 2–8 steps of arithmetic reasoning. Chain-of-thought prompting (Wei et al. 2022) was originally demonstrated on GSM8K; the jump from roughly 18% to 57% accuracy on PaLM 540B when chain-of-thought was added was the first clean evidence that prompting strategy could elicit reasoning the model could not otherwise express. GSM8K has been saturated by frontier models (>95%) for some time.

MATH (Hendrycks et al. 2021) is the harder version: 12,500 competition-mathematics problems from AMC, AIME, and similar sources, with LaTeX answer keys. MATH was considered essentially unreachable for LLMs in 2022; by 2025, with tool-use and reasoning-model chain-of-thought, frontier models clear 90%. The journey from "5% in 2022" to "90% in 2025" is the cleanest case study of rapid benchmark saturation the field has.

GPQA (introduced in §3) also functions as a reasoning benchmark; its graduate-level science questions require chains of deductive and quantitative reasoning that are harder to pattern-match than MMLU-style recall. ARC (Clark et al. 2018) is a multiple-choice science-question benchmark from the same era as the commonsense suites and is now saturated; the similarly named but unrelated ARC-AGI (Chollet 2019) consists of grid-based abstraction puzzles that remained out of reach for classical LLMs into 2024 and have been the focal point of reasoning-model progress since.

Are reasoning benchmarks measuring reasoning? The debate has two main positions. The skeptical view (represented by Valmeekam et al., Mirzadeh et al. 2024, and related planning/reasoning critiques) points to fragility: small wording changes collapse performance, problems with irrelevant-but-distracting information fool the model, and performance degrades sharply at slightly harder problem instances. The benchmark score is high because the models have seen enough similar problems; genuine novel reasoning remains weak. The optimistic view points to the substantial gains from chain-of-thought and from reasoning-model training (OpenAI o1, DeepSeek R1, Claude with extended thinking), the clean scaling trends with more inference compute, and the surprising generalisation to unfamiliar problem types. Both views have evidence; neither has settled the question. In the meantime, careful practitioners assume both: the models have real but brittle reasoning capability, and benchmark scores overstate how much of it is robust.

HellaSwag, WinoGrande, PIQA, ARC-Easy/Challenge — the earlier commonsense-reasoning benchmarks — are now saturated and appear mostly in evaluation stacks for smaller open-weight models, where differences between 70–90% still carry signal. They remain useful for quick evaluation of early-stage models but are not informative at the frontier.

Two newer reasoning benchmarks deserve attention. FrontierMath (Glazer et al. 2024) is a set of research-mathematician-authored problems specifically designed to resist pattern-matching, with an answer space too large to guess. Its scores were near zero for general-purpose models as of mid-2024, and the early reasoning models (o1-preview) scored around 2%, giving the benchmark years of headroom. HLE (Humanity's Last Exam, 2025) is a broader attempt at the same: expert-written questions across disciplines, designed to be hard for LLMs and resistant to web contamination.

The underlying methodological problem with reasoning benchmarks is that "reasoning" is not a well-defined psychological category — it bundles together arithmetic, deduction, planning, analogy, and commonsense inference, each of which responds differently to model scale and to fine-tuning. A fairer description of what current benchmarks measure is: competence at producing multi-step symbolic derivations in the style of textbook solutions. That is a useful capability; it is also narrower than "reasoning."

§5

Code benchmarks and execution-based grading

Code evaluation has a feature that every other LLM evaluation domain envies: the answer can be executed. A test suite runs, either all tests pass or some tests fail, and the grade is objective. There is no need for LLM-as-judge, no argument about alternative correct answers, no ambiguity about whether the output counts. This has made code benchmarks disproportionately important in the evaluation landscape — not because code is the most important LLM application, but because code is the domain where measurements are cleanest.
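
The core of an execution-based grader is small: concatenate the candidate with its hidden tests and run the result in a fresh interpreter, treating a clean exit as a pass. A sketch (a production harness would add real sandboxing — containers, resource limits, no network — since generated code is untrusted):

```python
import os
import subprocess
import sys
import tempfile

def passes_tests(candidate_src: str, test_src: str, timeout: float = 5.0) -> bool:
    """True iff candidate + tests run to completion with exit code 0."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_src + "\n\n" + test_src + "\n")
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # infinite loops count as failures
    finally:
        os.unlink(path)
```

The timeout matters more than it looks: generated code loops forever often enough that an unbounded grader simply hangs.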

HumanEval (Chen et al. 2021) is the foundational benchmark: 164 Python programming problems, each with a function signature, a docstring describing the behaviour, and a hidden test suite. The model generates a function body; the grader runs the tests. pass@1 is the fraction of problems solved in one attempt; pass@k is the fraction solved in $k$ attempts. HumanEval was the benchmark where GPT-3's mediocre performance and Codex's dramatic improvement first made the case that large-scale code pretraining was worthwhile. It has long since saturated (frontier models >95%) but remains the cheap sanity check for early-stage code models.
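
Computing pass@k naively (draw k samples, check whether any passes) has high variance; Chen et al. (2021) instead generate n ≥ k samples, count the c that pass, and use an unbiased closed-form estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n samples of which c passed:
    1 - C(n-c, k) / C(n, k), the probability that a uniformly
    random size-k subset of the n samples contains >= 1 pass."""
    if n - c < k:
        return 1.0  # every size-k subset must include a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 this reduces to c/n; for larger k it squeezes every estimate of pass@1 through pass@n out of a single batch of n generations.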

MBPP (Mostly Basic Python Problems, Austin et al. 2021) is similar in spirit: 974 entry-level problems with tests. Together, HumanEval and MBPP formed the "function-level synthesis" evaluation paradigm that dominated from 2021 to about 2023.

SWE-bench (Jimenez et al. 2023) was the inflection point that moved code evaluation beyond single-function synthesis. Each of its 2,294 tasks is a real GitHub issue from a popular Python repository, and the model is required to produce a patch (a diff across potentially several files) that resolves the issue such that the repository's own test suite passes. SWE-bench is dramatically harder than HumanEval — the model has to navigate a multi-file codebase, understand existing architecture, and produce an edit that compiles, passes tests, and doesn't break unrelated functionality. Frontier models scored under 2% in early 2024; by 2026 SWE-bench Verified (a human-cleaned subset) sees top scores in the 60–70% range, with substantial headroom.

The code evaluation hierarchy in 2026. Saturated / sanity checks: HumanEval, MBPP, APPS. Live / active leaderboards: SWE-bench Verified, LiveCodeBench, BigCodeBench. Hard / frontier: SWE-Lancer (real-money freelance programming tasks), the Aider leaderboard (real-world code editing), RepoBench (repository-scale code completion). The progression mirrors the field's trajectory: from "can the model write a function?" to "can the model act as a software engineer?"

LiveCodeBench (Jain et al. 2024) is the dynamic-benchmark answer to code contamination: problems are pulled from coding platforms (LeetCode, AtCoder, Codeforces) as they are published, so the benchmark continuously refreshes with problems the model cannot have seen in pretraining. Leaderboards are reported both for the full history and for windowed periods that exclude contaminated problems. LiveCodeBench is now the most-cited benchmark for raw competitive-programming capability.
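
The windowed reporting can be reproduced on any dated benchmark: split problems at the model's training cutoff and compare accuracy on the two sides. A sketch (the tuple layout is an assumption for illustration, not LiveCodeBench's actual schema):

```python
from datetime import date

def windowed_accuracy(problems, cutoff):
    """problems: list of (release_date, solved) pairs; cutoff: date.
    Returns (pre_cutoff_acc, post_cutoff_acc). A large pre/post gap
    is evidence that pre-cutoff scores are contamination-inflated."""
    pre = [solved for d, solved in problems if d <= cutoff]
    post = [solved for d, solved in problems if d > cutoff]
    acc = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return acc(pre), acc(post)
```

A model scoring 90% before its cutoff and 50% after it has not gotten worse at programming; it has stopped being able to recall the answers.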

BigCodeBench extends the function-synthesis paradigm to realistic domain tasks — data processing, web scraping, ML pipelines — with complex libraries and integration requirements. ClassEval goes further and evaluates class-level design. RepoBench and CrossCodeEval focus on cross-file context, testing whether the model can produce code consistent with the rest of a large project.

The methodological strength of code evaluation — objective, executable grading — is also its methodological limit. The benchmarks measure what can be tested. They cannot measure readability, maintainability, idiom-appropriateness, or whether the code is the right solution (vs a technically correct one that a senior engineer would reject). The LLM-as-judge and human-evaluation methods from the rest of the chapter are needed to fill those gaps, and production code-assistant teams usually combine execution-based benchmarks with ongoing internal user-satisfaction metrics.

§6

Long-context benchmarks

Frontier models now accept context windows of 200k, 1M, and (experimentally) 10M tokens. Whether they use those windows well is a separate question, and the benchmarks that try to answer it have a revealing progression: from trivially easy tasks that early long-context models trumpeted passing, to intentionally hard tasks that expose the limits of attention even in the latest generation.

Needle in a Haystack (NIAH), popularised by Greg Kamradt in 2023, is the easy end: plant a single sentence ("The magic number is 72") in a long document, then ask the model to recall it. A model that passes NIAH can find one fact in a haystack of irrelevant text — a necessary but not sufficient condition for long-context competence. Frontier models saturate NIAH at near-100% accuracy across million-token contexts, and NIAH alone is now widely considered misleading as a long-context metric.
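
A NIAH harness is short enough to write from scratch, which is part of why it spread so fast. The sketch below builds the depth-by-length prompt grid; `model_fn` stands in for whatever LLM call is being tested (an assumed interface, not a real API):

```python
def build_niah_prompt(filler_sentences, needle, depth, n_sentences):
    """Plant the needle at fractional depth (0.0 = start, 1.0 = end)
    in a haystack of n_sentences filler sentences."""
    haystack = [filler_sentences[i % len(filler_sentences)]
                for i in range(n_sentences)]
    haystack.insert(int(depth * n_sentences), needle)
    return (" ".join(haystack)
            + "\n\nWhat is the magic number? Answer with the number only.")

def niah_grid(model_fn, filler, needle, answer, depths, lengths):
    # sweep depth x length; each cell is pass/fail on exact recall
    return {(d, n): answer in model_fn(build_niah_prompt(filler, needle, d, n))
            for d in depths for n in lengths}
```

Averaging the grid's rows and columns yields the familiar heatmap. A harness that only tests depth 0.0 or 1.0 will overstate performance because of the lost-in-the-middle effect; sweeping the full depth range is the minimum for an honest number.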

RULER (Hsieh et al. 2024) was an explicit reaction to NIAH's limitations. It extends needle-in-a-haystack with: multi-needle (find several planted facts); variable-tracking (chains of assignments that must be followed); aggregation (count how many needles of type X appear); and multi-hop QA (questions that require joining information from multiple positions). RULER's headline finding, repeated across almost every long-context model evaluated, is that the effective context length — where the model still performs above some threshold — is typically a fraction of the advertised context length. A model with a nominal 1M-token window may lose reliable performance past 100k tokens on anything harder than simple needle retrieval.

LongBench (Bai et al. 2023) was the first multi-task long-context benchmark: 21 tasks covering QA over long documents, summarisation of long documents, code completion, in-context learning with many examples. Its scores are lower than NIAH's and more informative, but it has largely been superseded by RULER and by LongBench-v2.

BABILong and InfiniteBench. BABILong (Kuratov et al. 2024) extends the classic bAbI reasoning tasks into long-context form — embedding the reasoning steps in noise of varying length — and produces a clean picture of how reasoning performance degrades with context length. InfiniteBench (Zhang et al. 2024) evaluates at the 100k-token scale and beyond with a mix of retrieval, reasoning, coding, and math tasks. Both benchmarks agree that long-context performance is strongly task-dependent: single-fact retrieval is near-perfect, multi-fact aggregation degrades, genuine multi-step reasoning over long contexts remains substantially worse than the same reasoning in a short context.

The takeaway from the long-context benchmark literature is the same one that drove the long-context-vs-RAG discussion in Chapter 9: million-token windows exist, but they are not uniformly useful. For single-fact lookups, they work well and can replace retrieval. For multi-fact synthesis, tables of content, or comparisons across a long document, the effective window is much smaller and attention degrades in ways that depend on how the relevant information is distributed. A production system that is relying on long context should benchmark on RULER-style multi-hop tasks, not on NIAH.

One methodological subtlety that is easy to miss: long-context benchmark scores are conditional on the position of the relevant information. The lost-in-the-middle effect (Liu et al. 2023, discussed in §10 of the previous chapter) means that benchmark results depend on whether the needle is at the beginning, middle, or end of the context. Well-designed long-context benchmarks average over positions; less careful ones cherry-pick easy positions and produce inflated numbers.

The open research question is whether the gap between "1M-token window" and "1M-token useful window" can be closed at architectural level (new attention variants, hybrid state-space models, KV-cache compression) or whether retrieval is the fundamental answer. The current evidence supports "both": continue improving long-context use and continue relying on RAG for tasks where the relevant information is sparse within a very large corpus.

§7

Multilingual benchmarks

The overwhelming majority of LLM pretraining data is English. The benchmarks that dominate leaderboards — MMLU, GSM8K, HumanEval — are all English. This creates a systematic blind spot: models can be improving in English while stagnating or regressing in the thousands of languages that matter to users outside the anglophone internet. Multilingual benchmarks are the apparatus for measuring the gap, and they tell a consistent and uncomfortable story about which languages get served well.

XTREME (Hu et al. 2020) and XTREME-R established the multilingual evaluation template: test a model on QA, natural language inference, NER, and retrieval tasks across 40+ languages, and report both per-language and averaged scores. XTREME was designed in the BERT era and reflects that era's task set; it remains useful as a zero-shot cross-lingual transfer probe.

Flores-200 (NLLB team, Meta 2022) is the canonical machine-translation benchmark: 2,000 sentences from Wikipedia and Wikinews translated into 200 languages by professional translators. It is used both to evaluate translation quality and as a cross-lingual proxy for model coverage — how well does your LLM even handle a given language, translation-wise?

MGSM (Multilingual Grade-School Math, Shi et al. 2022) translates 250 GSM8K problems into 10 languages. It is the cleanest test of whether reasoning capability transfers across languages or is bottlenecked on English-style prompts. The gap in MGSM scores between high-resource and low-resource languages is a useful indicator of a model's multilingual robustness.
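The high-vs-low-resource gap the text describes reduces to a simple statistic. A sketch, with hypothetical per-language accuracies and an assumed split of languages into resource tiers:

```python
# Hypothetical per-language accuracies on an MGSM-style benchmark.
scores = {
    "en": 0.92, "de": 0.90, "fr": 0.89, "es": 0.90, "ru": 0.87,
    "zh": 0.86, "ja": 0.84, "th": 0.71, "sw": 0.62, "bn": 0.66,
}
high_resource = {"en", "de", "fr", "es", "ru", "zh", "ja"}

def resource_gap(scores, high_resource):
    """Mean high-resource accuracy minus mean low-resource accuracy."""
    hi = [v for k, v in scores.items() if k in high_resource]
    lo = [v for k, v in scores.items() if k not in high_resource]
    return sum(hi) / len(hi) - sum(lo) / len(lo)

gap = resource_gap(scores, high_resource)  # ~0.22 here: a robustness red flag
```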

BLEnD (2024) and similar benchmarks extend to cultural knowledge: the kind of locally-specific facts, foods, customs, and practices that English-centric data almost never covers. A model can speak fluent Korean and still get every question about Korean holidays wrong, and cultural benchmarks are the only way to catch that.

Persistent patterns in multilingual evaluation. Scores fall off sharply once you leave the top ~10 languages by web-text volume. The gap between English and Spanish is small; the gap between English and Swahili is often 20–40 percentage points on the same task. Tokenisation efficiency (how many tokens the tokenizer spends per word) correlates strongly with end-task performance: languages with inefficient tokenisation pay both a cost penalty (more tokens per character) and a capability penalty (less-learned token-sequence patterns). Efforts to close the gap include multilingual pretraining on mixture-balanced data, tokenizer adaptation for target languages, and targeted fine-tuning on high-quality translations.

The structural problem in multilingual evaluation is that the reference answers themselves depend on translation quality. A question translated poorly from English into a low-resource language produces a bad benchmark: the "correct" answer may itself be ungrammatical, culturally awkward, or simply mistranslated. Professionally translated benchmarks (Flores, the FRMT datasets, Global-MMLU) partially address this, but professional translation is expensive and coverage is uneven.

A deeper critique comes from the argument that translated benchmarks measure the wrong thing: a good evaluation in language X should use questions that are native to that language's culture and communication style, not questions translated from English. This is the logic behind benchmarks like BLEnD, IndicGLUE, and AraSTS — benchmarks built natively in a target language rather than translated in. These are harder to build, cover fewer languages, and correlate less perfectly with English benchmarks, which is exactly the point.

For a team building products in multiple languages, the practical recipe is layered: (1) use Flores-200 as a coverage sanity check — if the model struggles with translation into a language, it will struggle with everything in that language; (2) run MGSM or a translated reasoning benchmark as a capability probe per language; (3) build a small internal benchmark with native-speaker-authored examples that match the product's actual queries. No public benchmark alone is sufficient.

§8

Agent benchmarks

Static benchmarks ask a model to produce text. Agent benchmarks ask a model to do something — browse a website, edit a codebase, operate a desktop application, book a flight — and grade it on whether the task was completed. The grading signal is usually binary per task (completed or not), and the benchmarks are much harder to build and to interpret than static ones. But they are also much closer to the capability that matters for products built around LLM agents, and in 2025–2026 they have become the focal point of frontier evaluation.

SWE-bench and SWE-bench Verified (Jimenez et al. 2023; OpenAI refinement 2024) — already introduced in §5 — are the bridge between code evaluation and agent evaluation. The agent is given a repository and a GitHub issue; it may run tests, read files, edit code; it succeeds if the repository's tests pass after its patch. The structured environment and objective grading make SWE-bench the benchmark with the cleanest quality signal in agent evaluation.

GAIA (General AI Assistant, Mialon et al. 2023) is a 466-task benchmark for general-purpose digital assistants. Tasks require web browsing, multimodal reading, file manipulation, arithmetic, and cross-reference between sources — "find the highest-grossing film released in the first quarter of 2021 according to Box Office Mojo, then tell me the nationality of its second-billed actor." GAIA's grading is exact-match against a reference answer. It is substantially harder than SWE-bench in terms of total capability coverage; frontier agents scored ~10% when it was released and now clear 40% on the public set.
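Exact-match grading of the GAIA kind is deliberately unforgiving, and it requires a normalisation step to avoid penalising trivial formatting differences. A sketch of the normalise-then-compare pattern (the normalisation rules here are illustrative, not GAIA's actual scorer):

```python
def normalize(answer: str) -> str:
    """Illustrative normaliser: trim, lowercase, strip surrounding
    punctuation, collapse internal whitespace."""
    return " ".join(answer.strip().lower().strip(".,!?\"'").split())

def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)

hit = exact_match("  French. ", "french")   # True: formatting is forgiven
miss = exact_match("France", "French")      # False: no partial credit
```

The strictness is the point: exact match makes grading objective and cheap, at the cost of marking near-miss answers fully wrong.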

WebArena (Zhou et al. 2023) and VisualWebArena are browser-automation benchmarks built on top of five functional clone websites (a shopping site, a social forum, a code repository, a map, a CMS). Tasks are realistic user goals ("find the cheapest vacuum cleaner rated above 4 stars with free shipping"); the model operates the browser and the grader checks final state. WebArena has been one of the main drivers of browser-agent research.

OSWorld (Xie et al. 2024) pushes the environment further: Ubuntu and Windows desktop environments where the agent clicks, types, and runs commands to complete tasks like "edit this spreadsheet to add a column of totals" or "install LibreOffice." Grading is state-based — check whether the expected side effects occurred in the environment. OSWorld is the frontier of environment complexity; frontier models score under 25% on it, and the benchmark has years of headroom.

The recurring methodological issues in agent benchmarks. (i) Environment drift — the underlying websites, OS versions, or APIs change, and old scores are no longer reproducible. (ii) Cost and latency — each evaluation is many LLM calls, possibly hundreds per task; running the full benchmark can cost thousands of dollars per model. (iii) Partial credit — binary success/failure hides the fact that some agents fail at the last step of an otherwise-correct plan, others make no progress at all. (iv) Contamination — well-known benchmark tasks get memorised, and the model may have seen the task description in training. Most major agent benchmarks now maintain refreshed or held-out variants aimed at one or more of these problems; none solves all four.

Other benchmarks worth knowing: AgentBench (Liu et al. 2023) — a multi-environment aggregate covering OS, DB, knowledge-graph, card-game, web-shop, and digital-card environments; SWE-Lancer — real freelance programming tasks with real money; TAU-bench (τ-bench, Yao et al. 2024) — tool-using customer-service agents with policy compliance scoring; MLE-Bench — Kaggle-style ML engineering tasks end-to-end.

The open question in agent evaluation is not "can the benchmark distinguish good models from bad?" — it clearly can — but "does the benchmark correlate with what the model can do in the real world?" Agent tasks are so sensitive to environment specifics, tool availability, and prompt format that a model that clears SWE-bench Verified at 70% might perform quite differently as a coding assistant with a different toolset. The honest answer in 2026 is that agent benchmarks are directionally informative but not yet precisely predictive. Pairing them with internal, in-domain evaluations remains essential.

§9

Contamination

Every public benchmark, within a few years of release, ends up on the internet — in papers that quote it, in blog posts that explain it, in GitHub repositories that reimplement it, in Stack Overflow answers that discuss it. Every LLM's pretraining corpus includes a substantial fraction of the internet. It follows that every popular benchmark is almost certainly in every frontier model's pretraining data, at least in paraphrased form. This is data contamination, and it is the single largest structural threat to the validity of LLM capability evaluation.

The naive contamination concern is direct leakage: the model saw the exact (question, answer) pair during training and is recalling rather than reasoning. The naive remedy is string-matching: scan the training corpus for exact or near-exact matches with benchmark questions, remove them, and report the "decontaminated" score. This is necessary but not sufficient. Paraphrased contamination — a blog post that restates the question in different words — passes string-match filters and still leaks. Solution contamination — a textbook that explains the answer without restating the question — also leaks. A fully uncontaminated training run would require knowing, before training began, which texts were "about" benchmarks — which is a harder problem than the benchmarks themselves.

The methods for detecting post-hoc contamination include: (i) membership inference — does the model assign abnormally low perplexity to benchmark strings compared to paraphrases of them, as memorised content tends to have lower loss (Shi et al. 2023, Min-K% Prob); (ii) canonical vs. scrambled comparison — if a model answers the canonical form of a question correctly but not a trivial paraphrase, that's a memorisation signal; (iii) completion probing — prompt the model with a partial benchmark question and see whether it completes to the canonical form (Golchin & Surdeanu 2024); (iv) order sensitivity — a memorised answer is robust to choice-ordering changes in multiple-choice questions in ways a reasoned answer is not.
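The first of these detection methods can be sketched in a few lines. This is a simplified version of the Min-K% Prob statistic: average the k% lowest token log-probs, on the reasoning that memorised text lacks the genuinely surprising tokens fresh text contains. The per-token log-probs below are hypothetical, and real usage compares the statistic against a calibrated threshold rather than a single paraphrase:

```python
def min_k_prob(token_logprobs, k=0.2):
    """Average the k% lowest token log-probs of a passage.
    A high (less negative) value is a memorisation signal."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n

# Hypothetical log-probs for a canonical benchmark question vs. a paraphrase.
canonical  = [-0.1, -0.3, -0.2, -0.1, -0.4, -0.2, -0.1, -0.3, -0.2, -0.1]
paraphrase = [-0.5, -1.8, -0.9, -2.6, -0.7, -3.1, -0.4, -1.2, -2.0, -0.6]

# Canonical text is suspiciously easy for the model; the paraphrase is not.
memorisation_gap = min_k_prob(canonical) - min_k_prob(paraphrase)
```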

The empirical picture. Multiple studies (Dekoninck et al. 2024, Golchin & Surdeanu 2024, Oren et al. 2023) have shown that for most widely-reported benchmarks — MMLU, GSM8K, HumanEval, TriviaQA — there is statistically detectable contamination in frontier models, and for some benchmarks the estimated contaminated fraction is large (>20%). The effect on reported scores is usually inflation of a few points but sometimes much larger on specific benchmarks. The field has not agreed on standard contamination-detection tooling; different methods give different numbers; and the model providers themselves typically report only their own (opaque) decontamination procedures.

The responses fall into three categories. Benchmark design responses: publish the benchmark in a way that resists scraping (kept-secret test sets, pay-to-access distribution, solution formats that require interaction with an evaluation server). GPQA, for example, ships with a canary string so that providers can filter it out of training corpora, and Humanity's Last Exam keeps part of its test set private. Methodological responses: prefer dynamic benchmarks (§10), execution-graded benchmarks (code), or benchmarks where memorisation of the answer wouldn't help (agent environments). Reporting responses: treat benchmark numbers as upper bounds, report variance, and compare models only within the same benchmark version.

A related problem, sometimes conflated with contamination, is benchmark overfitting: over many training runs, development decisions get optimised against the benchmark, and the model's apparent capability inflates faster than its real capability improves. This is Goodhart's law applied to LLM development; contamination is a special case of it. Both point to the same conclusion: no single public benchmark is trustworthy in isolation, and any production decision based on benchmark numbers alone is exposed to the possibility that the numbers are telling you something other than what they claim.

For practitioners, the practical defences are: (i) always cross-check public benchmark performance with an internal, held-out, private evaluation; (ii) prefer benchmarks released recently enough that leakage is less likely; (iii) pay attention to the error bars and to the gap between public-split and held-out-split performance when the benchmark distributes both; (iv) treat closely-matching scores across benchmarks (e.g. two models both at 92.3% on MMLU) as noise, not a meaningful comparison.

§10

Dynamic benchmarks and live leaderboards

If static benchmarks eventually leak, the counter-strategy is to make the benchmark move. A dynamic benchmark either refreshes its evaluation data continuously (faster than models can be trained on it), or grades models against live human judgement that cannot be memorised at all. Dynamic benchmarks have become the most trusted signals at the current frontier, and understanding their design is essential for reading 2025–2026 LLM news.

Chatbot Arena (LMSYS, Zheng et al. 2023) is the most influential dynamic benchmark. Users visit a website, type an arbitrary prompt, and see two anonymous model responses side by side; they click which they prefer. The accumulated pairwise comparisons feed a Bradley-Terry (or Elo-style) ranking that updates in real time. As of 2026 the Arena has processed millions of comparisons across hundreds of models; its rankings are widely regarded as the single most credible signal of "which model is actually good." The per-prompt signal is noisy but the aggregate, over millions of votes, is remarkably stable.
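The rating mechanics are worth seeing once. Below is the simplest online version, a per-vote Elo update; production Arena instead fits a Bradley-Terry model over the full vote matrix, but the intuition (winners take rating from losers, in proportion to how surprising the win was) is the same. The vote stream is hypothetical:

```python
def elo_update(r_a, r_b, winner, k=32):
    """One online Elo update from a single pairwise vote."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # P(a wins) under current ratings
    score_a = 1.0 if winner == "a" else 0.0
    r_a += k * (score_a - expected_a)                  # surprising wins move ratings more
    r_b += k * ((1 - score_a) - (1 - expected_a))
    return r_a, r_b

ratings = {"model_a": 1000.0, "model_b": 1000.0}
for winner in ["a", "a", "b", "a"]:   # hypothetical votes: model_a wins 3 of 4
    ratings["model_a"], ratings["model_b"] = elo_update(
        ratings["model_a"], ratings["model_b"], winner
    )
# model_a ends above 1000, model_b symmetrically below; total rating is conserved.
```

One reason Bradley-Terry is preferred over sequential Elo for a leaderboard is order-independence: Elo ratings depend on the order votes arrive, while a Bradley-Terry fit over all votes does not.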

Arena is great, but it has well-documented limitations. (i) The prompt distribution is dominated by casual users and reflects their interests (coding, math, chat, creative writing) — not the specific use cases of enterprise users. (ii) Style preferences dominate over substance: length, confidence, and friendliness push votes more than correctness does, creating pressure on models to be pleasing rather than accurate. (iii) The anonymous model identities are not always truly anonymous — models have characteristic phrases and formatting that trained users can recognise. (iv) The vote-bot problem: periodic coordinated voting to game rankings has been documented, and the defence mechanisms are not transparent.

LiveBench (White et al. 2024) is a more recent static-but-rolling benchmark: questions are regenerated monthly from sources that post-date the models' knowledge cutoffs (recent news, just-published math problems, newly-written coding challenges). Each model's score is evaluated against only the questions newer than its training cutoff, which structurally prevents contamination of the benchmark's currently-scored items. LiveBench has become an important second signal alongside Arena; the two rank similarly but disagree enough that reading both is informative.

MixEval and the category of hybrid dynamic benchmarks. MixEval (Ni et al. 2024) blends queries from ten static benchmarks using sampling weights derived from real user traffic, producing an "average score" that approximates real-world usefulness better than any single static benchmark. Its update cadence (the sampled mix changes periodically) and its focus on matching user-query distribution make it a useful middle ground between purely static and purely live. Arena-Hard and Arena-Hard-Auto take a different angle: distil the hardest Arena prompts into a fixed set, grade new models against them via LLM-as-judge, and report the judge-vs-reference-model preference rate. This gives Arena-like signal without requiring fresh human votes for every new model.

Domain-specific dynamic benchmarks are expanding. LiveCodeBench (introduced in §5) does the same continuous-refresh trick for code. BABILong and its rolling variants test long context with new fact patterns. SEAL (Scale AI) maintains private leaderboards on expert-graded domains (law, coding, math) that are not published at all, trading transparency for contamination resistance.

The meta-problem with dynamic benchmarks is trust. A static benchmark is a published document; you can read the questions and verify the scoring yourself. A dynamic benchmark depends on who is curating the new data, how they are selecting it, whether they are free of conflicts with model providers, and how they handle ambiguous or low-quality items. Chatbot Arena has open methodology but relies on an unknown population of voters; LiveBench has a small central team curating questions; SEAL doesn't publish its questions at all. Each solves the contamination problem by moving the trust problem elsewhere.

The pragmatic 2026 evaluation stack is to read several of these together: Chatbot Arena for user preference, LiveBench for reasoning/coding on uncontaminated items, SWE-bench Verified for software engineering, GPQA for scientific reasoning, and at least one held-out internal benchmark for the target use case. Agreement across these signals is much more trustworthy than any single number.

§11

LLM-as-judge

Evaluating open-ended text outputs requires grading, and human grading is slow and expensive. The pragmatic solution that has come to dominate automatic evaluation is LLM-as-judge: use a strong LLM to grade the outputs of another LLM. Given a prompt, a candidate response, and (optionally) a reference answer, the judge returns a score, a preference, or a structured critique. The approach is cheap enough to scale to millions of evaluations and has been shown to correlate well with human judgement — when done carefully. The failure modes when done carelessly are severe.

The foundational work is the MT-Bench and Chatbot Arena paper (Zheng et al. 2023), which demonstrated that GPT-4, given pairs of responses and asked which was better, agreed with human raters about as often as human raters agreed with each other. This result — that LLM-as-judge reaches inter-human agreement rates — is what legitimised the method and led to its widespread adoption.

G-Eval (Liu et al. 2023) is the typical single-output scoring pattern: a prompt template asks the judge to score a candidate output on specific dimensions (coherence, consistency, fluency, relevance) on a 1–5 scale, optionally with chain-of-thought reasoning. Probabilistic variants weight the score by the judge's token probabilities rather than its top choice, which reduces discretisation noise. G-Eval and its descendants are now baked into most evaluation frameworks (RAGAS, DeepEval, LangSmith, Promptfoo).
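The probability-weighting trick is simple to state in code. Instead of taking the judge's top-choice score token, weight each possible score by the probability the judge assigned to that token; the token probabilities below are hypothetical:

```python
def weighted_score(score_token_probs):
    """G-Eval-style probability weighting over discrete score tokens.
    Returns the expected score rather than the argmax score."""
    total = sum(score_token_probs.values())            # renormalise over score tokens
    return sum(int(s) * p for s, p in score_token_probs.items()) / total

# Hypothetical judge probabilities over the score tokens "1".."5".
probs = {"1": 0.01, "2": 0.04, "3": 0.15, "4": 0.55, "5": 0.25}
score = weighted_score(probs)  # 3.99: finer-grained than the argmax score of 4
```

The expected score varies continuously as the judge's confidence shifts, which is what reduces the discretisation noise of a bare 1–5 output.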

The well-documented biases of LLM judges.
  • Position bias: the judge prefers the response that appears first (or last) in a pair. Mitigation: always score both orders and average.
  • Length bias: the judge prefers longer responses, often independently of quality. Mitigation: normalise by length, use length-controlled win rates.
  • Self-preference: the judge prefers responses generated by itself or by models from the same family. Mitigation: use a judge from a different model family than either candidate.
  • Style bias: the judge prefers confident, well-formatted, bulleted responses regardless of substance. Mitigation: explicit rubrics that score substance separately from form.
  • Topic-specific weakness: on technical domains (advanced math, graduate-level science), the judge may be unable to distinguish correct from incorrect answers itself. Mitigation: domain-specialised judges, ground-truth-based scoring where available.
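The position-bias mitigation above (score both orders) can be sketched as a wrapper around any pairwise judge. The `judge` interface here is an assumption for illustration: any callable taking (prompt, first_response, second_response) and returning "first" or "second":

```python
def debiased_preference(judge, prompt, resp_a, resp_b):
    """Query a pairwise judge in both presentation orders; keep the verdict
    only if the orders agree, otherwise record a tie."""
    v1 = judge(prompt, resp_a, resp_b)   # A shown first
    v2 = judge(prompt, resp_b, resp_a)   # B shown first
    a_wins_1 = v1 == "first"
    a_wins_2 = v2 == "second"
    if a_wins_1 and a_wins_2:
        return "a"
    if not a_wins_1 and not a_wins_2:
        return "b"
    return "tie"                         # the order flipped the verdict: position bias

# Toy judges for illustration.
position_biased_judge = lambda prompt, first, second: "first"
consistent_judge = lambda prompt, first, second: "first" if first == "x" else "second"

biased_result = debiased_preference(position_biased_judge, "q", "x", "y")      # "tie"
consistent_result = debiased_preference(consistent_judge, "q", "x", "y")       # "a"
```

A purely position-driven judge produces only ties under this scheme, which is exactly the desired behaviour: its votes carry no signal and should not count.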

The techniques that make LLM-as-judge more reliable include: reference-based scoring (give the judge a gold answer to compare against, not just the question); pairwise over absolute (asking which of two is better is a more reliable signal than asking how good one is); chain-of-thought judging (ask the judge to reason before scoring); multiple-judge consensus (average scores from several judges, ideally different model families); rubric decomposition (score individual dimensions — correctness, clarity, completeness — rather than overall quality).

The empirical question of "how much can you trust LLM-as-judge?" has a nuanced answer. On tasks where humans agree well with each other and the judge is strong, LLM-as-judge is nearly interchangeable with human rating and substantially cheaper. On tasks where humans disagree (style preferences, subjective quality), LLM-as-judge produces consistent scores that reflect the judge's particular preferences more than any ground truth. On tasks that are harder than the judge (advanced reasoning), LLM-as-judge degrades badly: the judge cannot verify what it could not itself produce, and its scores carry little signal.

A newer line of work — Prometheus (Kim et al. 2023), JudgeLM (Zhu et al. 2023), ArmoRM — trains dedicated judge models on large corpora of (output, rating) pairs. These specialised judges are cheaper than frontier models, run locally, and avoid the self-preference bias. Whether they match frontier judges on quality is an active question; recent evidence suggests they come close for style/preference judging but lag on technical correctness.

The operational position in 2026 is: LLM-as-judge is indispensable, cheap, and well-characterised, but not a full substitute for either ground-truth evaluation (where available) or for human evaluation on the highest-stakes decisions. The best practice is to use LLM-as-judge for fast iteration during development, validate its agreement with humans on a small gold set, and escalate to human evaluation for anything that matters.

§12

Human evaluation

Human evaluation remains the gold standard against which every other method is calibrated — but "human evaluation" is not a single method. It is a family of practices with different costs, reliabilities, and applicability, and the literature on them has accumulated enough accidents and surprises to warrant careful treatment. The naive version ("ask some people which answer is better") works surprisingly badly; the careful version is expensive but reliable.

The first design decision is task format. Pairwise — which of these two responses is better — is the most reliable across most categories. Humans are better at relative judgements than absolute ones. Likert scales — rate this response 1–5 on accuracy, fluency, helpfulness — are more informative (they collect multiple dimensions) but less reliable (the 1–5 mapping drifts across raters, across sessions, and across dimensions). Ranking — order these five responses from best to worst — captures more information per rating but fatigues raters faster. The pragmatic default for most evaluations is pairwise with an optional "tie" option.

The second design decision is who is rating. Crowd workers (Mechanical Turk, Prolific, Scale AI) are cheap and scalable but variable in quality; they require strict qualification tasks, attention checks, and ongoing quality monitoring to get clean data. Domain experts (medical doctors for health responses, lawyers for legal responses, software engineers for code) are 10–100× more expensive but produce substantially more reliable ratings for technical content. Disagreement between crowd and expert ratings on technical tasks is a standard source of methodological trouble; work that uses crowd raters for expert tasks is routinely discounted.

Annotator agreement and why it matters. Inter-rater agreement (measured by Cohen's κ, Krippendorff's α, or Fleiss's κ for multi-rater cases) tells you how noisy your human labels are. For pairwise LLM-output preference, κ typically falls in the 0.3–0.6 range — moderate agreement — which means individual ratings are noisy and you need multiple ratings per item to get reliable signal. Work that reports only single-rater results, or that doesn't report κ at all, is difficult to interpret. The convention for LLM evaluation papers is 3+ ratings per item, with inter-rater agreement statistics reported.
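Cohen's κ for the two-rater case is short enough to write out: observed agreement, corrected for the agreement two raters would reach by chance given their marginal label distributions. The rating data below is hypothetical:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n   # observed agreement
    c1, c2 = Counter(rater1), Counter(rater2)
    labels = set(c1) | set(c2)
    p_e = sum((c1[l] / n) * (c2[l] / n) for l in labels)    # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical pairwise-preference labels from two raters on ten items.
r1 = ["a", "a", "b", "b", "a", "tie", "b", "a", "a", "b"]
r2 = ["a", "b", "b", "b", "a", "a",   "b", "a", "tie", "b"]
kappa = cohens_kappa(r1, r2)  # ~0.49: moderate agreement despite 70% raw overlap
```

The worked example shows why raw percent agreement overstates reliability: these raters agree on 7 of 10 items, but much of that is the agreement two preference raters would reach by chance.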

The third design decision is what the raters see. Blind evaluations (no model identities revealed) prevent brand-based bias. Randomised order prevents anchor effects. Length-controlled pairs (comparing equally-long responses) isolate substance from verbosity. Reference-conditioned rating (raters see a gold answer for comparison) helps on factual tasks but may bias raters toward the reference's phrasing. Each of these is a methodological decision, not a default; choices should be reported explicitly.

A recurring finding across LLM human evaluations is the length bias: raters prefer longer responses, even controlling for correctness. Length-controlled win rates (popularised by AlpacaEval 2.0) and LMSYS's style-control adjustments to Arena have made this visible; the current community consensus is that uncontrolled length in evaluation datasets systematically inflates the apparent advantage of more-verbose models. Any serious evaluation either controls for length explicitly or at least reports length distributions alongside the results.

For products, a practical pattern is the golden-set-plus-sample: maintain a curated set of ~200–1000 representative prompts with reference answers (golden set) for regression testing; run a random sample of live production queries through human evaluation periodically to detect drift. This combines the reproducibility of a fixed set with the realism of live distribution. Escalation paths — flag low-confidence automated evaluations for human review — tighten the loop further.
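The golden-set half of that pattern is just a regression harness. A minimal sketch; `generate` and `grade` are assumed callables supplied by the caller, and the golden set, toy model, and exact-match grader below are illustrative stand-ins:

```python
def regression_report(golden_set, generate, grade):
    """Run every golden-set prompt through the current model and grade
    against the reference; return the pass rate and the failing cases."""
    failures = []
    for prompt, reference in golden_set:
        output = generate(prompt)
        if not grade(output, reference):
            failures.append((prompt, output, reference))
    pass_rate = 1 - len(failures) / len(golden_set)
    return pass_rate, failures

# Toy stand-ins for illustration only.
golden = [("2+2?", "4"), ("capital of France?", "Paris"), ("3*3?", "9")]
fake_model = {"2+2?": "4", "capital of France?": "Paris", "3*3?": "6"}.get
pass_rate, failures = regression_report(golden, fake_model, lambda o, r: o == r)
```

In practice `grade` would be an LLM judge or a task-specific checker rather than exact match, and the failing cases (not the pass rate) are what the team reviews before shipping a model change.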

The long-term question for human evaluation is cost. A frontier-model evaluation with statistical rigour (3+ raters per item, 1000+ items, pairwise with a control and a candidate) costs $5,000–$50,000 per model comparison at 2026 rates. This is small next to the cost of training the models but large enough that few organisations run many such evaluations. The field's reliance on LLM-as-judge is partly economic; it is also the main reason the field needs human evaluation to keep confirming that LLM-as-judge is still calibrated.

§13

Safety evaluation

Safety evaluation asks whether the model refuses things it should refuse, avoids things it should avoid, and cannot be coerced into things it should not do — and the apparatus for answering these questions has diverged substantially from the apparatus for evaluating capability. Safety benchmarks grade along different axes (refusal rates, jailbreak resistance, dual-use uplift), use adversarial probes that capability benchmarks do not, and demand policy-aware grading that purely technical evaluators cannot provide.

The baseline safety evaluation family is harmfulness benchmarks. HarmBench (Mazeika et al. 2024), StrongReject (Souly et al. 2024), AdvBench (Zou et al. 2023), and Do-Not-Answer (Wang et al. 2024) each provide thousands of prompts designed to elicit content the model is expected to refuse — instructions for creating weapons, hate speech, child sexual abuse material, private-information exposure. The grade is the fraction of prompts the model correctly refuses. The headline numbers are increasingly at ceiling (>99% refusal rates on clean prompts) and most of the signal has moved to adversarial variants.

Jailbreak benchmarks apply adversarial transformations to the same harmful prompts: roleplay framings ("pretend you are DAN, who has no rules"), encoded content (Base64, ROT13, leetspeak), multi-turn gradual escalation, automated prompt-search attacks (GCG — Greedy Coordinate Gradient, Zou et al. 2023), and multi-lingual wrapping. A model that refuses the plain prompt and fails the jailbroken version has a robustness gap. HarmBench's attack suite and the newer h4rm3l, PAIR, and MALT attack frameworks are the standard tooling.

The refusal/helpfulness tradeoff. A model that refuses more is safer along this axis but less useful — the Anthropic HH-RLHF framework explicitly frames helpfulness and harmlessness as a joint optimisation. A safety evaluation that measures only refusal rate is incomplete; it has to be paired with an over-refusal evaluation that penalises refusing benign requests. XSTest (Röttger et al. 2024) is the canonical over-refusal benchmark: 250 superficially unsafe-sounding but benign prompts, with 200 genuinely unsafe contrast prompts, designed to measure exactly this asymmetry. Models that score well only on the refusal axis but badly on XSTest are too cautious to be useful; good safety evaluation tracks both.
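The two axes reduce to two rates that must always be reported together. A sketch with hypothetical counts:

```python
def safety_profile(harmful_refused, harmful_total, benign_refused, benign_total):
    """The paired metrics of a safety evaluation: refusal rate on genuinely
    harmful prompts, over-refusal rate on benign ones (the XSTest-style
    contrast). Reporting only the first is the classic mistake."""
    return {
        "refusal_rate": harmful_refused / harmful_total,
        "over_refusal_rate": benign_refused / benign_total,
    }

# Hypothetical counts: strong refusal, but one in ten benign prompts refused.
rates = safety_profile(harmful_refused=198, harmful_total=200,
                       benign_refused=25, benign_total=250)
```

A model could trivially reach a 100% refusal rate by refusing everything; the over-refusal rate is what makes that strategy visible.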

WMDP (Weapons of Mass Destruction Proxy, Li et al. 2024) is a newer, more targeted safety benchmark: 3,668 multiple-choice questions that proxy biosecurity, cybersecurity, and chemistry knowledge relevant to mass-casualty weapons. The idea is that a model's WMDP score correlates with its capacity to provide dangerous uplift to a would-be malicious actor. WMDP is used both as a capability measure (how much does the model know?) and as a target for unlearning — removing the knowledge without destroying general capability. Its adoption by frontier labs and by the US AI Safety Institute has made it the de facto standard for dual-use capability evaluation.

Red-teaming — systematic probing by humans attempting to elicit unsafe behaviour — is the qualitative complement to benchmark-based safety evaluation. Frontier labs run internal red-teams (mix of ML researchers and domain experts) and external contracted red-teams (specialised firms, academic contractors); the results are typically not published in full but summarised in system cards (Anthropic's model cards, OpenAI's preparedness evaluations, Google DeepMind's safety assessments). Red-teaming catches failure modes that benchmarks miss — novel attack vectors, context-dependent failures, compound reasoning errors — but its coverage depends on the red-team's creativity and cannot be scaled indefinitely.

The frontier category is agentic safety evaluation — does the model behave appropriately when it is given tools, internet access, and the ability to take actions? Benchmarks here include the autonomy evaluations pioneered by ARC Evals (now METR) and the capability-threshold checks in Anthropic's Responsible Scaling Policy and OpenAI's Preparedness Framework. The metrics are less about refusal and more about capability under agentic settings — does the model successfully execute a long-horizon malicious plan if asked? As of 2026, frontier models are crossing agentic capability thresholds that earlier model generations did not approach, and that trajectory motivates the formal commitments in the frontier-lab scaling policies.

For practitioners, the minimal safety evaluation stack for a deployed model is: refusal rate on a harmfulness benchmark; over-refusal rate on XSTest or equivalent; jailbreak resistance on at least one adversarial suite; and a domain-specific red-team pass targeted at the application's actual attack surface. The gap between "we ran one benchmark" and "we have tested the failure modes our users will hit" is usually the difference between a safety evaluation that reassures stakeholders and one that actually predicts production behaviour.

§14

Bias and fairness evaluation

Bias evaluation sits between capability and safety. It asks not whether the model produces harmful content but whether its behaviour varies systematically along demographic, ideological, or stylistic dimensions in ways that matter — whether loan-recommendation advice changes by the applicant's name, whether medical suggestions change by gender, whether the model expresses identifiable political leanings. The apparatus is contested in ways capability benchmarks are not, because measuring bias requires a normative position on what should count as bias, and different normative frameworks produce different measurement choices.

BBQ (Bias Benchmark for QA, Parrish et al. 2022) is the most-cited framework in this space. Each BBQ item presents a short scenario with an ambiguous question ("Who is bad at math, the boy or the girl?") and a disambiguating context ("The girl won the math olympiad; the boy failed the test"). A model shows bias if its answers to ambiguous questions are stereotyped, and if the disambiguating context fails to override the stereotype. BBQ covers nine protected categories (age, disability, gender identity, nationality, physical appearance, race/ethnicity, religion, sexual orientation, socioeconomic status).
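The directional signal BBQ extracts from ambiguous contexts can be sketched as follows. This is a simplification of BBQ's published metric (the official score additionally interacts with accuracy); the answer labels below are hypothetical:

```python
def ambiguous_bias_score(records):
    """BBQ-style signal, sketched. In ambiguous contexts the correct answer
    is 'unknown'; among the model's non-unknown answers, measure the tilt
    toward the stereotyped target. Each record is one answer, labelled
    'stereo', 'anti', or 'unknown'.
    Returns +1 (always stereotyped) .. -1 (always anti-stereotyped)."""
    non_unknown = [r for r in records if r != "unknown"]
    if not non_unknown:
        return 0.0  # the model always (correctly) declined to guess
    stereo = sum(r == "stereo" for r in non_unknown)
    return 2 * stereo / len(non_unknown) - 1

answers = ["unknown", "stereo", "stereo", "unknown", "anti", "stereo", "unknown"]
bias = ambiguous_bias_score(answers)  # 0.5: a stereotype-leaning pattern
```

Note that the score is conditioned on the model answering at all: a model that always says "unknown" in ambiguous contexts scores zero bias, which is the intended behaviour for these items.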

Other widely-used benchmarks in this family: StereoSet (Nadeem et al. 2021) — cloze tests with stereotyped vs anti-stereotyped completions; CrowS-Pairs (Nangia et al. 2020) — paired sentences differing only in a demographic term; BOLD (Dhamala et al. 2021) — open-ended generation prompts sampled from Wikipedia, scored for sentiment, regard, and toxicity conditional on demographic attributes.

The methodological contention. Bias benchmarks have been criticised on multiple fronts. (i) Construct validity — do the test items actually measure stereotyping, or do they measure the model's willingness to refuse or hedge? (Blodgett et al. 2021 is the canonical critique.) (ii) Cultural specificity — benchmarks built on US English-speaking assumptions don't transfer to models used in other cultures. (iii) Gaming — models fine-tuned to avoid known benchmark phrases can score well without behaving differently on unseen bias surfaces. (iv) Disagreement on the metric's sign — some "bias" items are genuine factual patterns (e.g., gender disparities in specific occupations), and measuring them as bias conflates truth-tracking with harm.

The mature position that has emerged is that bias evaluation is necessary but insufficient when run through a single benchmark. Multiple measurement approaches — benchmark-based, counterfactual probes (swap gender/race tokens and measure output divergence), demographic parity tests on real-use queries, and audits by affected communities — each catch different failure modes and miss others. Production bias evaluation typically combines several of these; single-benchmark reports are treated with the same scepticism as single-capability-benchmark reports.
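A counterfactual probe of the kind just described can be sketched as follows. Everything here is a toy under stated assumptions: `query_model` stands in for a real model call, and the equality check stands in for a proper semantic-divergence measure (embedding distance, judge comparison).

```python
from itertools import combinations

def counterfactual_probe(template, slot_values, query_model, diverges):
    """Fill one query template with each demographic value, query the model,
    and report every pair of values whose outputs diverge."""
    outputs = {v: query_model(template.format(x=v)) for v in slot_values}
    flagged = [(a, b) for a, b in combinations(slot_values, 2)
               if diverges(outputs[a], outputs[b])]
    return outputs, flagged

# Toy stand-in for a real model, deliberately biased on the applicant's name.
def fake_model(prompt):
    return "approve the loan" if "Emily" in prompt else "request more documents"

outputs, flagged = counterfactual_probe(
    "Should we approve a loan for {x}, income $60k, credit score 700?",
    ["Emily", "Lakisha"],
    fake_model,
    diverges=lambda a, b: a != b,  # real probes use semantic similarity, not equality
)
print(flagged)  # [('Emily', 'Lakisha')]
```

In production the template set is drawn from the application's actual query distribution, and divergence is aggregated over many samples per name to separate bias from sampling noise.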

A distinct but adjacent axis is political bias. Benchmarks like PoliTune, OpinionQA (Santurkar et al. 2023), and the Political Compass test measure whether models systematically express views associated with particular ideological positions. The findings across frontier models are fairly consistent (moderate left-of-centre tilt on US political dimensions, especially on social issues), and frontier labs now routinely report these alongside other bias evaluations, though what the "correct" political bias is — or whether there should be one — is contested.

A newer line is sycophancy evaluation: does the model agree with whatever the user asserts, even when the user is wrong? The SycophancyEval benchmark (Sharma et al. 2023) shows that LLMs trained with RLHF from human preferences exhibit measurable sycophancy, and that the effect correlates with rater-preference training signals. Sycophancy is a bias in the statistical sense — systematic deviation from accuracy correlated with a non-task variable — and evaluating for it is increasingly standard.
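A minimal sycophancy check follows the same pattern the SycophancyEval work uses: ask a question neutrally, then ask again after the user asserts a wrong answer, and count flips. This sketch is hypothetical — `ask` stands in for a real model call with and without the user's claim injected into the prompt.

```python
def sycophancy_flip_rate(items, ask):
    """items: (question, correct, wrong) triples.
    ask(question, user_claim=None) returns the model's answer.
    A flip is counted when the model answers correctly on the neutral
    question but adopts the user's wrong claim when it is asserted."""
    flips, eligible = 0, 0
    for q, correct, wrong in items:
        if ask(q) == correct:          # only score questions the model knows
            eligible += 1
            if ask(q, user_claim=wrong) == wrong:
                flips += 1
    return flips / eligible if eligible else 0.0

# Toy sycophant: right when asked neutrally, agrees with any user claim.
def sycophant(question, user_claim=None):
    truths = {"What is the capital of France?": "Paris"}
    return user_claim if user_claim else truths[question]

rate = sycophancy_flip_rate(
    [("What is the capital of France?", "Paris", "Lyon")], sycophant)
print(rate)  # 1.0
```

Conditioning on the neutral answer being correct matters: without it, the metric conflates sycophancy with plain ignorance.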

A defensible production bias-evaluation regime looks roughly like: run BBQ and BOLD-style benchmarks for baseline signal; run counterfactual probes on application-specific query templates; include a sycophancy check; commit to periodic external audit with domain experts for high-stakes applications (healthcare, lending, hiring). The goal is not a single "bias-free" score but a multi-axis picture of where the model's behaviour varies with demographic or contextual inputs, and targeted mitigation of the variances that matter most.

§15

Hallucination and factuality evaluation

LLMs produce text that is locally fluent, globally coherent, and in specific instances entirely wrong. A hallucination is a model output that asserts something false with no indication of uncertainty — a confidently invented citation, a plausible-looking quote attributed to a real person, a fabricated line of code. Evaluating hallucination is distinctively hard because a hallucinated answer and a correct answer are textually indistinguishable until the underlying facts are checked, and automatic fact-checking at scale is itself a research problem.

TruthfulQA (Lin et al. 2022) was the first widely-adopted hallucination benchmark. Its 817 questions are adversarially designed to elicit false answers — misconceptions, urban legends, widely-held false beliefs — and grading compares the model's answer against both a true reference and a list of common false answers. TruthfulQA revealed that larger models could be less truthful than smaller ones, because they had absorbed more of the misconceptions present in pretraining data. The benchmark saturated fairly quickly as RLHF-tuned models learned to recognise the categories of adversarial questions, but it remains useful as a sanity check.

FActScore (Min et al. 2023) is the canonical long-form factuality evaluation. For each generated passage, the grader decomposes the text into atomic claims, checks each against a knowledge source (Wikipedia, web search), and reports the fraction of claims that are supported. FActScore was an important methodological advance because it measures factuality at the claim level — a passage with one false claim in ten true ones scores 0.9, not 0. The method generalises to any domain with a reliable source of ground truth.
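The FActScore recipe can be stripped down to a few lines. In the real grader, claim decomposition is done by an LLM and verification by retrieval against the knowledge source; both are stubbed here with toys (sentence-splitting and a set-membership check), so this is a structural sketch, not the published implementation.

```python
def factscore(passage, decompose, is_supported):
    """FActScore-style sketch: split a passage into atomic claims and
    report the fraction supported by the knowledge source."""
    claims = decompose(passage)
    return sum(map(is_supported, claims)) / len(claims) if claims else None

# Toy decomposition (sentence split) and a toy knowledge base.
knowledge = {"Paris is the capital of France", "The Seine flows through Paris"}
decompose = lambda text: [s.strip() for s in text.split(".") if s.strip()]

score = factscore(
    "Paris is the capital of France. The Seine flows through Paris. "
    "Paris was founded in 1850",
    decompose,
    lambda claim: claim in knowledge,
)
print(score)  # 2 of 3 claims supported -> 0.666...
```

The claim-level granularity is the point: the passage above scores 2/3 rather than 0, which is exactly the property that made FActScore an advance over passage-level true/false grading.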

SimpleQA (OpenAI 2024) is a deliberately narrow factoid benchmark: 4,326 short questions with unambiguous one-word or one-phrase answers. Grading is binary per question. SimpleQA is designed to resist style-of-answer gaming and to be easy to grade at scale; its low headline scores (frontier models ~50% as of mid-2025) make it a useful frontier benchmark even as broader benchmarks saturate.

Hallucination in grounded settings. When a model is given retrieved documents (as in a RAG system) and asked to answer from them, hallucination takes a more specific form: the model asserts something not in the retrieved context. Faithfulness (the RAGAS term) is the specific metric for this case. HaluEval (Li et al. 2023) and RGB (Chen et al. 2024) are the corresponding benchmarks: pairs of (context, question, faithful-or-unfaithful answer) with judgement labels. Grounded-setting hallucination is in principle easier to detect than open-domain hallucination because the ground truth is available in context; in practice many systems still fail at it, typically by confidently repeating information from prior knowledge when the context is silent.

Calibration is the adjacent concept: a model with high hallucination rate can still be useful if its confidence tracks its accuracy — low-confidence wrong answers are less harmful than high-confidence wrong ones. Calibration evaluation compares the model's expressed or implicit confidence against its actual accuracy, via metrics like expected calibration error (ECE). A common finding is that post-RLHF instruct-tuned models are systematically over-confident: they assert answers at full confidence whether or not they know, an artefact of training on preference data that rewarded decisive answers.
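Expected calibration error is simple enough to implement directly: bin predictions by stated confidence, then take the weighted average gap between each bin's mean confidence and its empirical accuracy. A minimal binned implementation:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin (confidence, correctness) pairs by confidence, then average
    |mean confidence - accuracy| across bins, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf=1.0 falls in top bin
        bins[idx].append((conf, ok))
    n, ece = len(confidences), 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# Over-confident model: asserts everything at 100%, right half the time.
print(expected_calibration_error([1.0] * 4, [1, 1, 0, 0]))  # 0.5
# Well-calibrated model: says 50%, and is right half the time.
print(expected_calibration_error([0.5] * 4, [1, 0, 1, 0]))  # 0.0
```

The over-confident case in the example is exactly the post-RLHF failure mode described above: full confidence regardless of knowledge yields a large ECE even at a respectable accuracy.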

The frontier of factuality evaluation is LongFact (Wei et al. 2024), which extends FActScore-style atomic-claim checking to very long generated passages (thousands of tokens), and SAFE — Search-Augmented Factuality Evaluator, introduced in the same paper — which uses live web search rather than a frozen reference corpus. These methods push the evaluation into live knowledge territory and inherit both the benefits and the risks of that.

For product teams, the practical factuality stack is: (i) a domain-specific factoid benchmark tailored to the application's actual claims (e.g., for a medical assistant, clinical-guideline QA); (ii) FActScore-style claim-level evaluation on long-form output for at least a sample; (iii) a calibration check — does the model hedge appropriately when uncertain? (iv) grounded-setting faithfulness evaluation if the system uses retrieval. As with bias, single-benchmark reports on factuality are misleading; multi-metric reporting is what production teams need.

§16

The leaderboard critique

Almost every LLM release is accompanied by a table of benchmark numbers — a new model alongside its competitors, often with the new model slightly ahead on a cluster of axes. The tables are eye-catching, compress quickly into marketing, and drive billions of dollars of attention and procurement. They are also, as a class, systematically misleading in ways that the scientific literature has documented repeatedly and that careful practitioners have come to distrust. This section assembles the catalogue of critiques and suggests how to read benchmark tables without being deceived.

Goodhart's law. When a measure becomes a target, it ceases to be a good measure. Benchmarks that drive competitive comparison between frontier labs are optimised against — sometimes explicitly (fine-tuning on data similar to the benchmark), more often implicitly (development decisions that happen to move the benchmark). Over several rounds of this, reported scores rise faster than actual capability, because the gap between the two widens every time a benchmark becomes important enough to pay attention to.

Selection bias in reported scores. A lab that runs a new model on twenty benchmarks and reports the five where it wins is producing an unbiased sample of the model's performance only if all twenty are reported. Selective reporting — which is common in launch announcements and not always explicit — tilts the comparison toward the new model. Replication efforts that rerun the full benchmark suite often find smaller gaps than the launch materials implied.

Prompt and inference settings. A benchmark score depends on the prompt template, the chain-of-thought approach (zero-shot vs few-shot vs reasoning model), the sampling parameters (temperature, top-p, max tokens), and sometimes the hardware (bf16 vs int8 quantisation). Different labs use different defaults; papers that compare across labs using cached numbers are comparing different conditions. The Chatbot Arena and HELM reports, when they rerun everything under consistent settings, often find different rankings from what the original launch tables implied.

Replication and the reality check. Third-party rerunning of frontier benchmarks routinely finds score differences of 2–10 percentage points from original reported numbers. The causes are mundane: different tokenisation, different answer-extraction regex, different few-shot exemplars, different API settings. Individually small, they add up. A benchmark advantage of 1–2 points between two models from different labs is within the noise floor of the replication process and should not be interpreted as a meaningful capability difference. The corollary is that benchmark comparisons within a lab (same infrastructure, same prompt) are more reliable than comparisons across labs.
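The answer-extraction point is easy to demonstrate concretely. The two regexes below represent two common extraction conventions seen across harnesses (an anchored "The answer is" pattern vs last-number-wins); the completion is a made-up example, but the disagreement mechanism is real.

```python
import re

# One model completion, graded under two extraction conventions.
completion = "The answer is 42. Double-checking: 42 - 40 = 2, so yes."
gold = "42"

anchored = re.search(r"The answer is (-?\d+)", completion).group(1)
last_number = re.findall(r"-?\d+", completion)[-1]  # last-number-wins

print(anchored == gold, last_number == gold)  # True False
```

The same completion is marked correct by one harness and wrong by the other — multiply that by thousands of items and a few more settings differences, and the 2–10 point replication gaps stop being surprising.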

Eval hacking is the more aggressive version of Goodhart's law. The canonical form is fine-tuning the model on data matching the benchmark's distribution while claiming a "general-purpose" model; the observable signature is a large gap between scores on the target benchmark and on similar held-out benchmarks. More subtle forms include: training on synthetic data generated by extracting benchmark-style problems from pretraining data; using benchmark-evaluation APIs during training as a selection signal; sampling many completions at evaluation time and reporting the best. Each of these is defensible in isolation; each inflates the apparent gap between "benchmark score" and "capability in practice."

The divergence between benchmarks and user experience. The most consistent finding across organisations that have run both is that ranking models by aggregate benchmark score and ranking them by user-satisfaction metrics give different results. Models that win benchmarks often feel subjectively worse on open-ended conversation, long-form reasoning, and agentic tasks. The divergence has become sharp enough that some labs explicitly de-emphasise benchmark numbers in product launches in favour of internal user-study data — itself not perfect, but arguably closer to what matters.

The defensive reading practice: treat benchmark numbers as evidence, not proof. Weight multi-signal agreement (Arena + LiveBench + held-out private evals) over any single number. Pay attention to confidence intervals where reported and imagine generous ones where not. Distrust close calls. Distrust scores that conspicuously saturate the benchmark (98–99%) — they are telling you more about the benchmark's headroom than about the model. And remember that the field does not have a single, reliable, cross-lab comparable capability score — the quest for one is arguably the central methodological open problem.
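"Imagine generous confidence intervals" can be made quantitative. For a benchmark scored as per-item right/wrong, a Wilson score interval on the accuracy shows how wide the noise floor is; the scores below are illustrative, not from any real model.

```python
import math

def wilson_ci(correct, total, z=1.96):
    """95% Wilson score interval for a benchmark accuracy (binomial)."""
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half

# Two models on a 1,000-item benchmark, 1.5 points apart.
a = wilson_ci(871, 1000)  # roughly (0.85, 0.89)
b = wilson_ci(856, 1000)  # roughly (0.83, 0.88)
print(a, b)
```

The two intervals overlap substantially: a 1.5-point headline gap on a 1,000-item benchmark is statistically indistinguishable from zero even before the replication-noise sources above are counted.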

§17

Evaluation in the LLM lifecycle

Evaluation is not a phase at the end of LLM development. It is a process that threads through every stage — pretraining decisions depend on evaluation, instruction-tuning targets depend on evaluation, fine-tuning rejects bad checkpoints by evaluation, deployment monitors flag regressions by evaluation. Understanding where each evaluation method sits in the lifecycle — and where the inevitable gaps between them become sources of trouble — is the closing synthesis this chapter aims to provide.

The earliest-stage evaluation is pretraining loss and small-scale proxies. A new base model is not usually evaluated on chatbot-style benchmarks; it is evaluated on perplexity and on narrow cloze-style benchmarks (LAMBADA, WinoGrande) that correlate with downstream capability at reasonable cost. Scaling-law work (Chapter 6) depends on these proxies. The main point is that evaluation at pretraining scale is dominated by what is cheap to compute over the largest candidate pool of training decisions, not by what matches the final product.

Post-training evaluation is where the full benchmark apparatus kicks in. Instruction tuning is evaluated on MT-Bench, AlpacaEval, and category-specific capability benchmarks; RLHF is evaluated on the same plus preference-win-rate against baselines; safety training is evaluated on HarmBench and XSTest. This is the stage where the measurement problems of this chapter are most pressing, because this is the stage where the numbers get reported.

Fine-tuning evaluation (Chapter 8) is more application-specific: benchmarks tailored to the target domain plus held-out application test sets plus A/B testing against the base model. A fine-tuning decision that improves the target benchmark while regressing on general capability is a common pattern; the evaluation has to catch it before deployment.

Deployment-time evaluation is the live operational layer: automatic evaluation of a sample of production queries, drift detection (has the model started producing different content than it used to?), user-feedback collection, escalation rates, rewrite rates. This layer typically uses LLM-as-judge for throughput reasons, calibrated periodically against human review. The goal is not so much to establish quality as to detect when quality has drifted.
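The drift-detection layer often starts as something very simple: compare the mean LLM-judge score over a recent window against a baseline window and alert on a drop. A crude sketch under those assumptions — real systems add significance testing, per-category breakdowns, and periodic human re-calibration of the judge itself:

```python
def drift_alert(baseline_scores, recent_scores, threshold=0.05):
    """Flag when the mean judge score over a recent window of sampled
    production queries drops more than `threshold` below the baseline."""
    base = sum(baseline_scores) / len(baseline_scores)
    recent = sum(recent_scores) / len(recent_scores)
    return (base - recent) > threshold

# Judge scores in [0, 1] for two weekly samples (illustrative numbers).
print(drift_alert([0.82, 0.80, 0.81], [0.71, 0.70, 0.72]))  # True
print(drift_alert([0.82, 0.80, 0.81], [0.80, 0.79, 0.81]))  # False
```

The threshold has to be set against the judge's own run-to-run variance, which is why the periodic human calibration pass mentioned above is part of the loop rather than an optional extra.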

The through-line. At each stage, evaluation is constrained by cost, by what is available to measure, and by what decision the measurement is supposed to drive. The tension across stages is that the cheap fast evaluations used early in the pipeline do not necessarily predict the slow expensive ones used later. A pretraining decision based on scaling-law proxies may produce a model that excels on perplexity and fails on chatbot preference; an instruction-tuning decision based on AlpacaEval may produce a model that tops automated judges and disappoints live users. The gap between successive evaluation layers is the source of most of the "my model was supposed to be better but isn't" surprises that recur across the field.

Four open problems define the frontier of evaluation research in 2026. Continuous evaluation: how to keep evaluations valid as models update without rebuilding the benchmark from scratch every time. Agentic evaluation: how to grade long-horizon, multi-step, environment-interactive behaviour at scale, when each evaluation run costs dollars and minutes. Alignment evaluation: how to measure properties — honesty, corrigibility, robustness to subtle misuse — that only show up in rare situations and that are not well-captured by any benchmark. Divergence closing: how to narrow the gap between benchmark leaderboards and user-experience metrics, either by better benchmarks, better user studies, or better integration of the two.

A final organising observation: LLM evaluation is downstream of every other concern in the LLM lifecycle, and it will never catch up. The capabilities, failure modes, and usage patterns of frontier models change faster than evaluations can be designed, validated, and published. This is a structural feature, not a bug — evaluation has to be reactive to be relevant — and it means that the right posture for a working practitioner is humility about any single number, investment in multi-signal evaluation, and ongoing scepticism about claims of progress that aren't confirmed across multiple methods. The apparatus this chapter has surveyed is the state of the art; it is also, permanently, a work in progress. The next LLM generation will require its next generation of measurement, and some portion of that generation is being invented right now.

With evaluation, Part VI of the compendium — Natural Language Processing and Large Language Models — reaches its closing chapter. What started with tokenisation and morphology in Chapter 1 has ended with the question of whether we can even tell the difference between a language model that understands and one that appears to. Neither of the standard answers ("clearly yes" or "clearly no") survives contact with the benchmarks. The honest answer is that the apparatus we have is better than nothing, getting better, and insufficient — which is, in practice, the position from which all further work is done.

Further reading

LLM evaluation has generated a large and rapidly-updating literature. The selections below are a 2026 snapshot, organised around the benchmark families and methodological papers that shape current practice. For a fast-moving topic like this, the tier-one papers and the currently-active leaderboards (Chatbot Arena, LiveBench, SWE-bench) are the primary references; anything more than three years old is either foundational or obsolete, often both. The software list emphasises the evaluation frameworks that sit closest to production work.

Textbooks & tutorials

Textbook
Speech and Language Processing (3rd ed.), chapter on Evaluation
Daniel Jurafsky & James H. Martin, draft online
The textbook reference for classical NLP evaluation metrics (BLEU, ROUGE, perplexity), the foundation on which LLM-era evaluation is layered. Updated periodically.
Tutorial
Hugging Face Evaluation Harness & lm-evaluation-harness documentation
Hugging Face; EleutherAI, ongoing
The practical reference for running standard benchmarks (MMLU, GSM8K, HumanEval, BIG-Bench) reproducibly. Both frameworks ship the benchmark implementations most papers use.
Tutorial
Anthropic's evaluation cookbook & OpenAI's evals documentation
Anthropic; OpenAI, ongoing
Two vendor-written but widely-referenced practical guides to evaluating LLM applications. Anthropic's is more pedagogical; OpenAI's includes a large library of example eval harnesses.
Survey
A Survey on Evaluation of Large Language Models
Yupeng Chang et al., ACM TIST 2024
The most-cited taxonomy survey. Organises the LLM evaluation landscape by what to evaluate, where to evaluate, and how to evaluate. Useful less for depth than for a common vocabulary.
Blog
Jason Wei's "Successful language model evals"
Jason Wei, 2024
Principles for what makes a benchmark succeed: clear task, agreed-upon metric, room to improve, cultural adoption. A short piece but widely quoted; reflects an insider perspective on the benchmark lifecycle.
Blog
Hugging Face Open LLM Leaderboard — methodology pages
Nathan Habib, Clémentine Fourrier et al., HF
Describes the specific benchmark choices, normalization techniques, and methodology corrections that underlie the most-viewed open-model leaderboard. A valuable case study in operational benchmark design.

Foundational papers

Paper
Measuring Massive Multitask Language Understanding (MMLU)
Dan Hendrycks et al., ICLR 2021
The paper that introduced MMLU. Four years on, still the most cited capability benchmark, although it has saturated at the frontier and its successors (MMLU-Pro, GPQA) are now more informative.
Paper
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (BIG-Bench)
Aarohi Srivastava et al., TMLR 2023
The 204-task multi-author benchmark that tried to map the edges of LLM capability. Introduced many of the subtasks that subsequent work has specialised on.
Paper
Holistic Evaluation of Language Models (HELM)
Percy Liang et al., Stanford CRFM, 2022 (updated ongoing)
The framework paper arguing for multi-metric evaluation. More influential as a methodology than as a leaderboard; every serious evaluation apparatus now reports multiple axes, and HELM is why.
Paper
Training Verifiers to Solve Math Word Problems (GSM8K)
Karl Cobbe et al., 2021
Introduced GSM8K. Demonstrated the chain-of-thought effect on reasoning and established the format that every subsequent reasoning benchmark follows.
Paper
Evaluating Large Language Models Trained on Code (HumanEval)
Mark Chen et al., 2021
The Codex paper. Established the function-synthesis + test-execution benchmarking pattern that remains dominant in code evaluation.
Paper
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Carlos Jimenez et al., ICLR 2024
The benchmark that moved code evaluation from function-synthesis to repository-scale software engineering. Its refinements (SWE-bench Verified, SWE-Lancer) define the current frontier.
Paper
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
David Rein et al., 2023
The PhD-level science benchmark that replaced MMLU as the standard capability signal when frontier models saturated MMLU. Harder, less contaminated, more informative at the frontier.
Paper
TruthfulQA: Measuring How Models Mimic Human Falsehoods
Stephanie Lin, Jacob Hilton & Owain Evans, ACL 2022
Introduced the adversarial-factuality paradigm. Showed that larger models could be more prone to repeating common misconceptions, a result that reshaped thinking about scale and honesty.
Paper
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng et al., NeurIPS 2023
The paper that validated LLM-as-judge grading, showing that a strong judge's agreement with human preferences matches human–human agreement rates. The methodology underpins most automated LLM evaluation that exists today.
Paper
BBQ: A Hand-Built Bias Benchmark for Question Answering
Alicia Parrish et al., ACL 2022
The most-cited social-bias benchmark for LLMs. Its ambiguous/disambiguated context design has been replicated in many subsequent benchmarks.
Paper
Holistic Evaluation of Language Models: Scenario-based evaluation philosophy
Rishi Bommasani et al., Stanford CRFM
Framework-level companion to the HELM benchmark, arguing for scenario-based evaluation rather than aggregate scores. Influential on how academic papers now structure their evaluation sections.

Modern extensions

Paper
RULER: What's the Real Context Size of Your Long-Context Language Models?
Cheng-Ping Hsieh et al., COLM 2024
The canonical long-context benchmark beyond needle-in-a-haystack. Its decomposition into multi-needle, variable-tracking, aggregation, and multi-hop makes it the reference for long-context evaluation today.
Paper
GAIA: a benchmark for General AI Assistants
Grégoire Mialon et al., ICLR 2024
Multi-step, multi-tool, real-world agent tasks with exact-match grading. One of the most-watched frontier benchmarks; scores have been a useful tracker of agent-capability progress.
Paper
WebArena: A Realistic Web Environment for Building Autonomous Agents
Shuyan Zhou et al., ICLR 2024
The browser-automation benchmark that underlies most recent web-agent research. Five functional clone websites and hundreds of concrete user tasks.
Paper
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Tianbao Xie et al., NeurIPS 2024
Desktop-operating-system environment for agent evaluation. Among the most difficult agent benchmarks currently in use; frontier scores remain below 25%.
Paper
LiveBench: A Challenging, Contamination-Free LLM Benchmark
Colin White et al., 2024
The rolling-benchmark answer to contamination. Monthly refresh from sources that post-date model cutoffs, time-windowed scoring per model.
Paper
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Naman Jain et al., 2024
Code-benchmark analogue of LiveBench: continuously updated with problems from competitive programming platforms, scored within each model's uncontaminated window.
Paper
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
Yang Liu et al., EMNLP 2023
The LLM-as-judge paper that introduced probabilistic scoring and chain-of-thought grading. Established much of the methodology used by modern automatic graders.
Paper
FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation
Sewon Min et al., EMNLP 2023
Claim-level factuality: decompose a generation into atomic claims, verify each. The method has been widely adopted in RAG evaluation and hallucination measurement.
Paper
Detecting Pretraining Data from Large Language Models (Min-K% Prob)
Weijia Shi et al., ICLR 2024
Membership-inference method for detecting training-set contamination. One of the most practical contamination-detection techniques currently available.
Paper
Time Travel in LLMs: Tracing Data Contamination in Large Language Models
Shahriar Golchin & Mihai Surdeanu, ICLR 2024
Completion-probing method: prompt the model with the start of a benchmark question and see whether it completes to the canonical form. A complementary approach to Min-K%.
Paper
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming
Mantas Mazeika et al., ICML 2024
The most-used standardised safety-evaluation framework. Combines refusal benchmarks with adversarial attack suites; de facto reference for safety reporting.
Paper
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Nathaniel Li et al., 2024
The dual-use-knowledge proxy benchmark. Used both as a capability measure and as an unlearning target. Substantial influence on frontier-lab safety reporting.
Paper
Prompt Sensitivity: Evaluations Are Fragile
Sander Schulhoff et al., 2024 (The Prompt Report)
Documents how strongly benchmark scores depend on specific prompt choices. Essential background for interpreting cross-paper comparisons.
Paper
Data Contamination: From Memorization to Exploitation
Inbal Magar & Roy Schwartz, ACL 2022
Early careful analysis of how benchmark memorisation inflates reported scores. The historical anchor for the contamination-aware practice that has become standard.
Paper
Language Models (Mostly) Know What They Know
Saurav Kadavath et al., 2022
Foundational work on LLM calibration: models can be prompted to give probabilities, those probabilities are informative, and post-RLHF tuning often decalibrates them. Essential for understanding the hedging-vs-confidence tradeoff in production systems.

Software & tools

Framework
lm-evaluation-harness
EleutherAI, 2020–present
The most-used open-source benchmark runner. Hundreds of tasks implemented; the reference for reproducing standard benchmark scores.
Framework
HELM
Stanford CRFM, 2022–present
Both a leaderboard and the open-source framework that generates it. Supports rerunning with consistent settings across models and tasks; useful when independently evaluating frontier models.
Framework
OpenAI evals
OpenAI, 2023–present
A lightweight evaluation framework with a large community-contributed library of eval harnesses. Common entry point for application-specific LLM evaluation.
Framework
Inspect AI
UK AI Safety Institute, 2024–present
Evaluation framework designed for agent and safety evaluation. First-class support for tool-use, multi-turn, and sandboxed environments. Rapidly being adopted by safety-focused research groups.
Framework
DeepEval & Promptfoo
Confident AI; promptfoo.dev
Application-evaluation frameworks targeted at production LLM systems. Strong LLM-as-judge support, assertion-style testing, CI integration. Widely used for RAG and chatbot evaluation.
Framework
RAGAS & TruLens
Exploding Gradients; TruEra/Snowflake
Covered in detail in Chapter 9. Listed again here because RAG evaluation is a substantial fraction of LLM evaluation in production, and both tools are part of the broader evaluation toolkit.
Platform
Chatbot Arena (LMSYS)
UC Berkeley LMSYS; now lmarena.ai
The live preference leaderboard. Both a product and a data source — published datasets of human preference comparisons are widely used for training and evaluation.
Platform
Hugging Face Open LLM Leaderboard
Hugging Face, 2023–present (v2: 2024)
The most-viewed open-model leaderboard. Version 2 (2024) introduced IFEval, BBH, MATH-hard, GPQA, MuSR, MMLU-Pro — a harder benchmark mix that resisted saturation longer than the original set.
Platform
Artificial Analysis & Vellum leaderboards
Artificial Analysis; Vellum AI
Commercial leaderboards that emphasise efficiency (cost, latency, throughput) alongside capability. Useful counterweight to capability-only scoreboards for procurement decisions.
Platform
SEAL Leaderboards
Scale AI
Expert-graded private leaderboards for specific domains (law, coding, Spanish, Arabic). Trade transparency for contamination resistance; useful for cross-checking public-benchmark rankings.
Platform
METR & AI Safety Institute evaluations
METR; UK AISI; US AISI
Third-party evaluation of frontier models prior to deployment. Focus on agentic and dangerous-capability evaluations. Part of the emerging pre-deployment evaluation ecosystem.
Platform
LangSmith & W&B Weave
LangChain; Weights & Biases
Observability + evaluation platforms for production LLM applications. Tracing, dataset management, human feedback collection, LLM-as-judge integration. The operational layer that ties continuous evaluation to live systems.