Evaluation

All sixteen sections are in draft status. Open problems are flagged inline and consolidated in §14.

This is a new chapter with no direct AIMA 4e antecedent. AIMA’s treatment of evaluation is implicit - each algorithmic chapter discusses what counts as correctness or performance for the algorithms it covers. The modern AI landscape requires explicit evaluation methodology as a distinct subject: as models become more general-purpose and more capable, measuring what they can do becomes substantially harder than measuring what classical AI systems could do. The chapter develops the methodology, the standard benchmarks, the failure modes (contamination, gaming, saturation), and the connections to deployment-decision frameworks (safety evaluations, dangerous-capability evaluations, governance requirements).

The chapter is heavily cross-referenced. Most other chapters in the book contain evaluation sections specific to their domain; this chapter develops the cross-cutting methodology, the canonical evaluation patterns, and the systematic concerns that recur across domains. It cross-references FM §9 (foundation-model evaluation), LLM §13 (LLM evaluation), Alignment §7 (safety evaluations), AI for Science §11 (scientific-domain evaluation), Generative Models §10 (generative-model evaluation), AI Agents §10 (agent evaluation), MI §11 (interpretability evaluation), Causality §11 (causal-benchmark evaluation), and others throughout.


Scope and What This Chapter Is About

The chapter develops AI evaluation methodology - the systematic approach to measuring what AI systems can do, how well, and under what conditions. We cover the conceptual framework (what is being measured; what evaluation can and cannot tell us), the dominant benchmark families (capability benchmarks, safety evaluations, behavioural benchmarks, agent evaluations), the methodology concerns (contamination, saturation, gaming, reproducibility, ecological validity), the production patterns (LLM-as-judge, human evaluation, hybrid pipelines), and the connections to deployment (governance frameworks, frontier-model release decisions).

Approximate length target: 18,000–25,000 words (a major chapter - evaluation is methodologically substantial and underlies most other chapters’ empirical content).


§1. Motivation and Scope

Why AI evaluation is hard

Three illustrations to anchor the chapter.

Illustration 1: ImageNet saturation, 2012-2017. ImageNet (Russakovsky et al., 2015) was the canonical computer-vision benchmark of the 2012-2017 deep-learning era. AlexNet (Krizhevsky et al., 2012) achieved 84.7% top-5 accuracy, a substantial improvement over the prior state of the art; ResNet (He et al., 2016) reached 96.4%; the 2017 ILSVRC winner (SENet) reached roughly 97.7%. By 2018, ImageNet top-5 accuracy was saturated - the remaining headroom was too small to meaningfully distinguish frontier models. The benchmark had been useful and clear in its era; it became uninformative as models exceeded the human-baseline range.

Illustration 2: MMLU contamination, 2023. MMLU (Hendrycks et al., 2021) - a benchmark of multiple-choice questions across 57 academic subjects - was widely used to measure frontier-LLM knowledge. In 2023, multiple investigations showed that substantial portions of MMLU appeared in the training data of major frontier models. Reported MMLU scores partially reflected memorization rather than capability. Subsequent benchmarks (GPQA Diamond, FrontierMath) were explicitly designed to resist contamination.

Illustration 3: Sycophancy and the limits of evaluation, 2024. Anthropic, OpenAI, and Google all published model cards for their frontier 2024 models claiming improvements on safety and helpfulness. Yet Sharma et al. (2023) had shown that frontier assistants exhibit substantial sycophancy - agreeing with users rather than being accurate - and many subsequent independent evaluations found the same pattern in the newer models. The official benchmark scores were not wrong; they measured what they measured. But what they measured did not capture the sycophancy concern.

These three cases illustrate three persistent problems in AI evaluation.

  • Saturation. Benchmarks become uninformative as capabilities exceed them.

  • Contamination. Models memorize benchmark content from training data; reported scores conflate memorization with capability.

  • Misalignment between metrics and what we care about. Benchmark scores measure what they measure; the connection to deployment value is uncertain.

These problems are not new and not unique to AI. They affect all empirical measurement. But they are substantially worse for AI evaluation than for classical scientific measurement because:

  • AI systems are general-purpose - there is no single dimension of “performance” to measure.

  • AI systems are trained on data that may include the evaluations themselves.

  • AI capabilities are growing rapidly - benchmarks lose informativeness quickly.

  • AI evaluation interacts with strategic actors (model providers; users; regulators) with conflicting incentives.

The chapter develops methodology, benchmarks, and best practices for navigating these difficulties.

What AI evaluation is

A working definition. AI evaluation is the systematic process of measuring AI system properties - capability, behaviour, safety, robustness, alignment - using benchmarks, human judgment, and other empirical methods, in service of informed claims about what the system can do and decisions about deployment.

The breadth. AI evaluation spans:

  • Capability evaluation. What can the model do? Standard benchmarks measure specific capabilities (mathematical reasoning, coding, factual knowledge).

  • Behavioural evaluation. How does the model behave? Refusal of harmful requests, honesty, helpfulness, format compliance.

  • Safety evaluation. What harms could the model cause? Dangerous-capability assessments; red-teaming.

  • Robustness evaluation. How does the model handle adversarial or out-of-distribution inputs?

  • Alignment evaluation. How well does the model’s behaviour match intended values?

Each category has its own benchmark families, methodology concerns, and current best practices.

What AI evaluation is not

Several boundaries worth flagging.

Evaluation is not the same as benchmarking. Benchmarks are one tool for evaluation. Comprehensive evaluation typically combines benchmarks with human evaluation, adversarial probing, and post-deployment monitoring. Reducing “evaluation” to “benchmark scores” misses much of the discipline.

Evaluation is not the same as research. AI research includes evaluating proposed methods against baselines. Evaluation as a discipline concerns the methodology for doing this well, the standard benchmarks, and the systematic concerns. The two are related but distinct.

Evaluation is not just engineering. Beyond running benchmarks, evaluation involves judgment about what to measure, how to interpret results, and how to communicate findings. Methodological depth matters.

Evaluation cannot replace deployment monitoring. Pre-deployment evaluation provides evidence; deployment provides ground truth. The two are complementary; neither alone suffices.

Three goals of AI evaluation

A useful framing for what evaluation serves.

1. Scientific progress. Evaluations let researchers compare methods, identify advances, and characterize the state of the field. The benchmark culture has substantially advanced AI by providing common comparison targets.

2. Deployment decisions. Lab decisions about whether to release a model, regulatory decisions about whether to permit deployment, user decisions about which model to use - all depend on evaluation evidence. The stakes vary; the underlying need is the same.

3. Capability tracking and governance. Tracking what AI systems can do over time informs policy, safety planning, and societal adaptation. Evaluations of frontier capabilities (especially dangerous capabilities) are increasingly part of governance frameworks.

Different evaluations serve different goals. Capability benchmarks primarily serve scientific progress; safety evaluations primarily serve deployment decisions and governance. The same evaluation may serve multiple goals; the methodology choices depend partly on which goals are central.

Boundaries with adjacent chapters

This chapter develops the cross-cutting evaluation methodology; chapter-specific evaluation lives in those chapters.

  • Foundation Models §9 develops FM-specific evaluation. This chapter develops the broader methodology.

  • Large Language Models §13 covers LLM-specific limitations and evaluation concerns.

  • Generative Models §10 develops generative-model evaluation (FID, perplexity, human preference) - substantively distinctive content.

  • AI for Science §11 covers evaluation in scientific domains (CASP for proteins, MATBench for materials, etc.) - domain-specific benchmarks with their own conventions.

  • AI Agents §10 covers agent-specific evaluation (SWE-bench, GAIA, OSWorld, long-horizon benchmarks).

  • Alignment §7 covers safety-specific evaluations (TruthfulQA, dangerous-capability evals, red-teaming).

  • Mechanistic Interpretability §11 covers MI-specific evaluation (SAE benchmarks, circuit-validation methodology).

  • Causality §11 covers causal-benchmark evaluation.

  • Reinforcement Learning evaluation lives in the RL chapter; RL benchmarks have their own conventions distinct from supervised-learning benchmarks.

  • Theoretical Foundations of Learning §11 covers theoretical-evaluation framings.

This chapter cross-references these as relevant; the chapter-specific content is not reproduced here.

What this chapter does not try to do

Several explicit exclusions.

  • We do not provide a complete catalogue of every AI benchmark. The space is too large; we focus on the dominant benchmarks and methodology.

  • We do not extensively cover non-AI statistical methodology (sample size, statistical significance, confidence intervals). These matter for AI evaluation but are not AI-specific.

  • We do not cover deployment engineering (A/B testing infrastructure, observability, telemetry pipelines). These matter for production evaluation but are largely engineering rather than evaluation methodology.

  • We do not develop the human-evaluation labour economics in depth. The labour conditions of annotation work are real and substantial; the broader sociology lives elsewhere.

Position taken in this chapter

The chapter takes AI evaluation seriously as a methodological discipline. Done well, evaluation provides essential evidence for scientific progress and deployment decisions. Done poorly, evaluation produces misleading confidence - high benchmark scores that don’t translate to deployment value, safety evaluations that miss real risks, reports that are technically correct but substantively wrong.

The chapter is critical of current evaluation practice where criticism is warranted. The benchmark culture has serious problems (contamination, saturation, gaming, misalignment with deployment value). Current evaluation methodology is substantially imperfect. Honest acknowledgement of this is essential for the discipline to improve.

The chapter is also constructive. Despite the imperfections, evaluation methodology has substantially advanced over the 2020-2026 period. The infrastructure (benchmarks, frameworks, governance institutions) is more mature; the methodology is more rigorous; the field’s self-awareness about evaluation problems has improved.


§2. Historical Context

This section traces AI evaluation from the pre-deep-learning era through the modern frontier-AI evaluation infrastructure.

A timeline of the inflection points:

   Pre-2012     Classical-ML evaluation: per-task benchmarks
                  (UCI Machine Learning Repository), simple
                  metrics (accuracy, precision/recall, F1).
                  BLEU for machine translation (2002).
                  ROUGE for summarization (2004).
                                  │
                                  ▼
   2012         ImageNet (Russakovsky et al.) becomes the
                  canonical computer-vision benchmark.
                  AlexNet's 84.7% top-5 accuracy launches
                  the deep-learning era.
                                  │
                                  ▼
   2014-2017    ImageNet saturation cycle: ResNet (96.4%);
                  SENet reaches ~97.7% in 2017; by then the
                  benchmark cannot distinguish frontier
                  models.
                                  │
                                  ▼
   2018         GLUE benchmark (Wang et al.) for natural
                  language understanding. Multiple tasks;
                  aggregate score; explicitly designed to
                  measure capability across diverse settings.
                                  │
                                  ▼
   2019         GLUE saturation: models building on BERT
                  surpass the human baseline within about a
                  year of GLUE's release. SuperGLUE (Wang
                  et al.) introduced as harder successor.
                  SuperGLUE saturates within ~18 months.
                                  │
                                  ▼
   2021-2022    BIG-bench (Srivastava et al.) introduced.
                  Hundreds of diverse tasks; designed to be
                  resistant to single-architecture
                  overfitting. Crowdsourced from community
                  researchers.
                                  │
                                  ▼
   2021         MMLU (Hendrycks et al.) - 57 subjects of
                  multiple-choice questions. Becomes the
                  dominant LLM-knowledge benchmark for
                  several years.
                                  │
                                  ▼
   2022         HELM (Liang et al., Stanford CRFM) - Holistic
                  Evaluation of Language Models. Multi-task,
                  multi-metric framework. Methodological
                  emphasis on systematic comparison.
                                  │
                                  ▼
   2022-2023    ChatGPT era: deployed-LLM evaluation becomes
                  consequential. Existing benchmarks
                  (MMLU, BIG-bench) used to report frontier-
                  model capabilities. The benchmarks become
                  marketing instruments as well as research
                  tools.
                                  │
                                  ▼
   2023         LMSYS Chatbot Arena (Zheng, Chiang et al.):
                  pairwise human preference at massive
                  scale via Elo-style ranking. Crowdsourced
                  comparison of frontier models. The
                  dominant human-preference benchmark from
                  2023 onward.
                                  │
                                  ▼
   2023         Contamination concerns rise. Multiple
                  investigations show that major benchmarks
                  (MMLU, GSM8K, HumanEval) overlap with
                  training data of frontier models. Benchmark
                  scores partially reflect memorization.
                                  │
                                  ▼
   2023-2025    Harder benchmarks emerge in response.
                  GPQA Diamond (Rein et al.) - expert-level
                  questions. FrontierMath (Glazer et al.) -
                  research-mathematics problems. Humanity's
                  Last Exam - designed to remain challenging
                  for years. ARC-AGI-2 - reasoning tasks.
                                  │
                                  ▼
   2023         OpenAI Evals; Anthropic eval infrastructure;
                  EleutherAI lm-evaluation-harness - common
                  tooling for systematic benchmark evaluation.
                                  │
                                  ▼
   2024         Dangerous-capability evaluations mature.
                  RAND CBRN evaluations; METR autonomous-
                  replication evaluations; cybersecurity
                  evaluations (CyberBench, Cybench).
                  Persuasion evaluations.
                                  │
                                  ▼
   2024         Agentic evaluation matures. SWE-bench;
                  GAIA; WebArena; OSWorld; METR Task
                  Evaluations. Long-horizon agent
                  capability tracking.
                                  │
                                  ▼
   2024         AI Safety Institutes (UK AISI, US AISI,
                  others) establish independent frontier-
                  AI evaluation capability. Pre-deployment
                  evaluation under partnership with major
                  labs.
                                  │
                                  ▼
   2024         EU AI Act, US Executive Order, and frontier
                  safety frameworks (Anthropic RSP, OpenAI
                  Preparedness) tie deployment decisions to
                  evaluation results. Evaluation becomes
                  governance-relevant.
                                  │
                                  ▼
   2025-2026    Living benchmarks; continuous evaluation
                  infrastructure; Chatbot-Arena-style
                  real-time leaderboards. Cost-effectiveness
                  metrics standard. Multi-modal evaluation
                  matures.

We develop each phase below.

Pre-deep-learning ML evaluation

The classical era. Through the 1990s and 2000s, ML evaluation was per-task. Each task (image classification, speech recognition, machine translation) had its own benchmark and metric.

The standard pattern. Researchers reported results on canonical datasets (MNIST for digit recognition, TIMIT for speech, WMT for translation). The metrics were domain-specific (accuracy, word-error-rate, BLEU). The community evolved norms for fair comparison; baseline methods were established; new methods reported improvements.

The limitations. Per-task evaluation made it hard to compare methods across tasks. A method strong on one task might be weak on another; the field had no unified picture of generality. The deep-learning era would substantially address this through multi-task benchmarks.

ImageNet and the deep-learning benchmark era

ImageNet (Russakovsky et al., 2015) was the inflection. A million labelled images across 1,000 classes; standardized train/validation/test splits; an annual competition (ILSVRC) with public leaderboards.

The impact. AlexNet’s 2012 ILSVRC victory (Krizhevsky et al.) catalyzed the deep-learning era. ImageNet performance became the standard measure of computer-vision capability. Many architectural advances (VGG, ResNet, Inception, EfficientNet) were validated by ImageNet improvements.

The saturation. ImageNet top-5 accuracy grew rapidly: AlexNet 84.7% (2012), ResNet 96.4% (2016), SENet roughly 97.7% (2017). By 2018, the benchmark could not meaningfully distinguish frontier models on top-5 accuracy. ImageNet remained useful for baseline comparisons but ceased to be the frontier benchmark.

The lesson. Benchmarks have lifespans. The most useful benchmarks are matched to the current frontier - challenging for the best current methods but not impossibly hard. As methods improve, benchmarks need refreshing to remain useful.

GLUE, SuperGLUE, and the natural-language era

NLP’s analogue of ImageNet. GLUE (Wang et al., 2018) - a benchmark of nine natural-language understanding tasks (textual entailment, sentiment, paraphrasing, etc.) with an aggregate score. Designed to measure capability across diverse NLU tasks rather than per-task.

The trajectory.

  • 2018: BERT (Devlin et al.) substantially advances GLUE state of the art.

  • Within months, models surpass human-baseline scores on individual GLUE tasks.

  • 2019: SuperGLUE (Wang et al.) introduced as harder successor; eight tasks designed to be more challenging.

  • Within ~18 months, SuperGLUE also approaches saturation.

The pattern matches ImageNet. Benchmarks designed to be hard become easy as models improve; harder successors are needed.

BIG-bench, MMLU, and the LLM era

The benchmark crisis. By 2020-2021, no single benchmark could meaningfully evaluate frontier LLMs. The community responded with very large benchmark suites.

BIG-bench (Srivastava et al., 2022) - Beyond the Imitation Game Benchmark. Crowdsourced from researchers; 204 diverse tasks; explicitly designed to resist single-architecture overfitting. The suite is too diverse to be summarized by any single capability category.

MMLU (Hendrycks et al., 2021) - Massive Multitask Language Understanding. 57 subjects of multiple-choice questions covering humanities, STEM, social sciences, professional knowledge. Became the dominant LLM-knowledge benchmark for several years.

Why MMLU dominated. The multiple-choice format gave clean automated scoring. The 57-subject breadth provided a single number characterizing general knowledge. The subjects spanned what people considered “general intelligence.” Frontier-model releases routinely reported MMLU scores; comparisons across models were straightforward.

The MMLU trajectory.

  • GPT-3 (June 2020): 43.9% (random baseline ~25%).

  • PaLM (April 2022): 69.3%.

  • GPT-4 (March 2023): 86.4%.

  • Claude 3 Opus (March 2024): 86.8%.

  • GPT-4o (May 2024): 88.7%.

  • By 2025-2026: frontier models routinely 90%+; benchmark approaching saturation.

By 2024, MMLU was near-saturated. Reported improvements (88.7% → 89.1% → 89.4%) were within noise margins. Harder benchmarks were needed.
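A rough check of the noise claim. The sketch below treats each benchmark question as an independent Bernoulli trial and computes the binomial standard error of an accuracy score. It assumes a test set of 14,042 questions (the commonly cited size of the MMLU test split; treat the figure as an assumption here), and the function names are illustrative rather than taken from any evaluation library. It ignores other noise sources such as prompt formatting and decoding randomness, so it probably understates the real uncertainty.

```python
import math


def accuracy_standard_error(accuracy: float, n_items: int) -> float:
    """Binomial standard error of an accuracy estimated from n_items questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)


def difference_z_score(acc_a: float, acc_b: float, n_items: int) -> float:
    """z-score for the gap between two accuracies.

    Simplification: treats the two runs as independent samples, even though
    both models answer the same questions.
    """
    se_a = accuracy_standard_error(acc_a, n_items)
    se_b = accuracy_standard_error(acc_b, n_items)
    return (acc_b - acc_a) / math.sqrt(se_a ** 2 + se_b ** 2)


N_ITEMS = 14_042  # assumed size of the MMLU test split

se = accuracy_standard_error(0.887, N_ITEMS)
half_width_95 = 1.96 * se  # half-width of a normal-approximation 95% interval
z = difference_z_score(0.887, 0.891, N_ITEMS)

print(f"standard error:      {100 * se:.2f} points")
print(f"95% half-width:      {100 * half_width_95:.2f} points")
print(f"z for 88.7 vs 89.1:  {z:.2f}")
```

Under these assumptions the 95% interval is roughly half a point wide on each side, so a 0.4-point gap between two models is not distinguishable from sampling noise. A paired test on per-item results would be the more careful comparison, since both models see the same questions.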

HELM and the methodology emphasis

A specific methodological development. HELM (Holistic Evaluation of Language Models; Liang et al., Stanford CRFM, 2022) was less a new benchmark and more a framework for systematic LLM evaluation.

The contributions.

  • Multi-scenario coverage. Evaluate models across many scenarios, not just a few.

  • Multi-metric measurement. For each scenario, measure multiple properties (accuracy, robustness, fairness, bias, efficiency).

  • Reproducibility. Standardized evaluation infrastructure; published code; published exact protocols.

  • Living benchmark. Continuously updated as new models and scenarios emerge.

The reception. HELM was widely cited as best methodological practice. Its specific scenarios and metrics were less universally adopted than its methodological emphasis.

LMSYS Chatbot Arena and human preference at scale

A specific 2023 development that changed the evaluation landscape. Chatbot Arena (Zheng et al., 2023) - crowdsourced pairwise human preference comparisons of LLMs. Users converse with two anonymous models; vote which is better; aggregate via Elo-style rating.
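The aggregation step can be sketched as follows. This is a minimal online Elo update over a list of pairwise votes, using the standard logistic expected-score formula; it is not a reproduction of the Arena's actual rating pipeline, and the model names, the K-factor of 32, and the starting rating of 1000 are all illustrative assumptions.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_ratings(votes, k: float = 32.0, initial: float = 1000.0) -> dict:
    """Fold pairwise votes into ratings.

    votes: iterable of (model_a, model_b, outcome), where outcome is
    1.0 if A was preferred, 0.0 if B was preferred, and 0.5 for a tie.
    """
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, outcome in votes:
        exp_a = expected_score(ratings[model_a], ratings[model_b])
        delta = k * (outcome - exp_a)
        ratings[model_a] += delta  # the winner gains what the loser drops
        ratings[model_b] -= delta
    return dict(ratings)


# Hypothetical votes between three anonymous models.
votes = [
    ("model-x", "model-y", 1.0),
    ("model-x", "model-z", 1.0),
    ("model-y", "model-z", 1.0),
    ("model-x", "model-y", 1.0),
    ("model-z", "model-y", 0.0),
]

ratings = elo_ratings(votes)
ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

Online Elo depends on the order in which votes arrive; fitting a single model to all votes at once (for example a Bradley-Terry fit) removes that dependence at the cost of a batch computation.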

The impact. By 2024, Chatbot Arena was the single most-watched LLM evaluation. Model providers tracked their Arena rankings closely; new model releases were judged substantially by Arena performance.

The advantages. Direct measurement of human preference. Massive scale (millions of comparisons). Continuous (real-time leaderboard updates). Comparative (Elo-style ratings are a natural fit for pairwise preference data).

The limitations. User-base bias (Arena users are not representative of all LLM users). Conversation-style focus (Arena measures conversational preference; not necessarily capability on specific tasks). Gaming concerns (model providers can A/B test against Arena before releasing; potential overfitting to Arena-style preferences).

The 2024-2026 trajectory. Chatbot Arena remains influential. Successor and parallel arenas have emerged (Arena Hard for harder tasks; specialized arenas for coding, image generation, etc.). The pattern of crowdsourced pairwise human preference at scale is firmly established as one of the standard evaluation modes.

The contamination problem

A specific 2023 development. Multiple investigations (Sainz et al. 2023; Magar and Schwartz 2022; others) showed that major benchmarks (MMLU, GSM8K, HumanEval, others) appeared substantially in training data of frontier models. Reported benchmark scores partially reflected memorization rather than capability.

The investigations. Researchers ran “data extraction” attacks - asking models to recite benchmark content - and found that many models could reproduce benchmark items verbatim. Other investigations compared performance on canonical benchmark questions vs paraphrased equivalents; performance dropped on the paraphrases, suggesting memorization-driven inflation of the canonical scores.
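The paraphrase comparison reduces to simple bookkeeping once each benchmark item has been scored twice, once in canonical form and once paraphrased. The sketch below assumes that scoring has already happened; the data are made up for illustration and the function name is hypothetical.

```python
def paraphrase_gap(results):
    """Summarize paired results.

    results: list of (canonical_correct, paraphrase_correct) booleans,
    one pair per benchmark item.
    """
    n = len(results)
    canonical_acc = sum(1 for c, _ in results if c) / n
    paraphrase_acc = sum(1 for _, p in results if p) / n
    return {
        "canonical_acc": canonical_acc,
        "paraphrase_acc": paraphrase_acc,
        "gap": canonical_acc - paraphrase_acc,
        # Items solved only in canonical form are the memorization suspects.
        "canonical_only": sum(1 for c, p in results if c and not p),
        "paraphrase_only": sum(1 for c, p in results if p and not c),
    }


# Made-up results for ten items.
results = (
    [(True, True)] * 6
    + [(True, False)] * 3
    + [(False, True)] * 1
)

summary = paraphrase_gap(results)
print(summary)
```

A positive gap is suggestive rather than conclusive: paraphrases can also be harder or more ambiguous than the originals, so the paraphrase set needs its own quality check before the gap is read as memorization.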

The community response.

  • Acknowledgement. Benchmark reports increasingly include contamination caveats.

  • New benchmarks designed against contamination. GPQA Diamond, FrontierMath, Humanity’s Last Exam - explicit contamination resistance.

  • Held-out evaluation. Some labs maintain private held-out benchmark suites used for internal evaluation; public benchmarks for reporting only.

  • Date-based held-out sets. Use only benchmark items released after model training cutoffs.

  • Contamination detection tooling. Open-source tools for checking benchmark-training overlap.
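One common building block of overlap checking is an n-gram match between a benchmark item and a training corpus. The sketch below is a minimal version of that idea, not the method of any particular tool: the tokenizer, the n-gram length, and the example texts are all assumptions, and a real audit would need an index over the corpus rather than an in-memory set.

```python
import re


def ngrams(text: str, n: int) -> set:
    """Set of lowercase word n-grams in text (punctuation ignored)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_docs, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur somewhere in the corpus."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Hypothetical corpus: one document quotes a benchmark item verbatim.
corpus = [
    "Quiz dump: What is the capital of France? The capital of France is Paris. Next question follows.",
    "An unrelated page about gardening and soil acidity.",
]
leaked_item = "What is the capital of France? The capital of France is Paris."
clean_item = "Which planet has the largest number of known moons in the solar system?"

leaked_score = overlap_fraction(leaked_item, corpus, n=5)
clean_score = overlap_fraction(clean_item, corpus, n=5)
print(leaked_score, clean_score)
```

Exact n-gram matching catches verbatim leakage only. Paraphrased or translated copies of benchmark items pass straight through it, which is one reason overlap checks are usually paired with behavioural tests.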

The state in 2026. Contamination is recognized as a serious concern. Best-practice evaluation includes contamination audits. But many widely-reported benchmark scores remain partially contaminated; the gap between reported and “clean” scores is sometimes substantial.

Frontier capability and dangerous-capability evaluations

A 2024 development. As frontier models became more capable, evaluating dangerous capabilities became a priority - not just for research but for governance.

RAND CBRN evaluations (Mouton et al., 2024) - biological, chemical, radiological, nuclear capability assessments.

METR Task Evaluations (Model Evaluation and Threat Research; 2023+) - autonomous-replication evaluations; long-horizon agentic capability.

Cybersecurity evaluations. Cybench, CyberBench, others.

Persuasion evaluations (Salvi et al., 2024 and others).

AI R&D acceleration evaluations. Can the model substantively accelerate AI research itself?

These evaluations feed into frontier safety frameworks (Anthropic RSP, OpenAI Preparedness, DeepMind Frontier Safety; cross-reference Alignment §10) that condition deployment on capability assessments.

Evaluation institutes and governance integration

A 2023-2024 institutional development. AI Safety Institutes (UK, US, Japan, Singapore, others) established independent frontier-AI evaluation capability.

The structure. The institutes partner with frontier-AI labs for pre-deployment evaluation of major models. They publish independent evaluation reports; their assessments inform regulatory decisions.

The trajectory. The UK AISI was first (November 2023); the US AISI followed, and institutes in other countries came after. By 2025-2026, AISIs are substantial institutional evaluators of frontier AI. The reports they produce are increasingly part of the governance evidence base.

Where this leaves us in 2026

The current state. AI evaluation is substantially more mature than it was in 2018. The methodology is more rigorous; the benchmark ecosystem is more diverse; the governance integration is more substantial. The methodological problems (contamination, saturation, Goodhart effects) are recognized and partially addressed.

The remaining issues are substantial. Many widely-cited benchmark scores are partially contaminated. Many benchmarks are approaching saturation. Many real deployment concerns (sycophancy, agentic safety, long-horizon reliability) are inadequately measured by existing benchmarks. The pace of capability growth outstrips benchmark development.

The remaining sections develop the technical content. §3 covers the conceptual framework. §4 covers capability benchmarks. §5 covers behavioural and safety evaluations. §6 covers human evaluation. §7 covers LLM-as-judge. §8 covers contamination. §9 covers saturation and Goodhart. §10 covers agentic evaluation. §11 covers deployment decisions. §12-§16 close out.

Editorial note. Evaluation methodology is rapidly evolving. Specific benchmarks date faster than the underlying methodological concerns. The chapter develops the methodology as the more-durable content; specific benchmark mentions should be treated as time-bounded.


§3. The Evaluation Framework

A systematic look at the dimensions along which AI evaluation operates. This section develops the conceptual scaffolding that organizes the specific benchmarks and methodologies of §4 onward.

What we measure

The first dimension. AI evaluation can target many different properties; choosing well requires clarity about what is being measured.

Capability. What can the model do? Standard benchmarks measure specific capabilities - mathematical reasoning, code generation, factual recall, reading comprehension. Capability evaluation answers: “if asked, can the model produce a correct response?”

Behaviour. How does the model behave? Behavioural evaluation measures patterns of model output beyond raw capability - refusal of harmful requests, format compliance, tone, helpfulness, honesty. Two models with identical capability can have very different behaviour.

Safety. What harms could the model cause? Safety evaluation tests for dangerous capabilities (CBRN, cyber-offensive), harmful behaviours (manipulation, deception), and failure modes (jailbreak vulnerability). Safety overlaps with behaviour but emphasizes worst-case concerns.

Robustness. How does the model handle non-standard inputs? Robustness evaluation tests adversarial inputs, distribution shift, edge cases, paraphrased queries, multilingual inputs. A model that performs well on canonical inputs but degrades on perturbations is brittle.

Alignment. How well does behaviour match intended values? Alignment evaluation overlaps with safety and behaviour but emphasizes the gap between what designers wanted and what the model actually does.

Calibration. Does the model know what it knows? Calibration evaluation tests whether expressed confidence matches actual accuracy. A model that’s confidently wrong is more dangerous than one that’s appropriately uncertain.
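One standard way to quantify this is expected calibration error (ECE): bin predictions by stated confidence, then take the size-weighted average gap between mean confidence and accuracy in each bin. The sketch below is a minimal equal-width-bin version; the confidence values are made up, and how a confidence is obtained from a model in the first place is left open.

```python
def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE with equal-width confidence bins over [0, 1]."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[idx].append((conf, is_correct))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece


# Made-up example 1: says 95% confident, right 6 times out of 10.
overconfident = expected_calibration_error([0.95] * 10, [True] * 6 + [False] * 4)

# Made-up example 2: says 75% confident, right 6 times out of 8.
calibrated = expected_calibration_error([0.75] * 8, [True] * 6 + [False] * 2)

print(f"overconfident ECE: {overconfident:.2f}")
print(f"calibrated ECE:    {calibrated:.2f}")
```

ECE depends on the binning scheme and can hide offsetting errors within a bin, so it is better read alongside a reliability diagram than as a single definitive number.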

Efficiency. What does it cost to use the model? Cost-per-task, latency-per-task, throughput. Increasingly first-class for deployment decisions.

These categories overlap. A “safety evaluation” may measure capability (can the model produce dangerous content?) and behaviour (will it produce it under certain prompts?). The categories are useful organizing principles; specific evaluations often span several of them.

Static vs dynamic evaluation

A methodological axis.

Static evaluation. Fixed dataset; run the model on each input; compute aggregate scores. The dominant pattern; supports reproducibility and comparison.

Examples. MMLU (fixed question set); ImageNet (fixed image set); HumanEval (fixed coding problems).

Strengths. Reproducible (same dataset → same evaluation). Standardized (other researchers can compare to the same numbers). Cheap (the evaluation set is built once; reused indefinitely).

Weaknesses. Risk of contamination (the dataset can leak into training data). Risk of saturation (once models exceed the dataset, the benchmark stops being informative). Risk of overfitting (research community optimizes for the specific benchmark).

Dynamic evaluation. New inputs generated for each evaluation; the model encounters genuinely novel queries.

Examples. LMSYS Chatbot Arena (user-generated queries; new ones constantly). Adversarial red-teaming (humans probe the model with novel attacks). Real-world deployment monitoring (production traffic).

Strengths. Avoids contamination (model has not seen this specific input). Avoids saturation (new content can be made harder if needed). Closer to deployment reality.

Weaknesses. Harder to reproduce. Harder to compare across models (each model may see different inputs). More expensive (each evaluation requires fresh input generation).

The 2026 picture. Both modes are essential. Static evaluation provides standardized comparison; dynamic evaluation provides contamination-resistant capability measurement. Production deployments combine both.

Reference-based vs reference-free metrics

A specific methodological choice for scoring.

Reference-based. Compare model output to a gold-standard reference. Score by similarity (exact match, BLEU, ROUGE, embedding similarity).

Examples. Translation evaluated against reference translations (BLEU). Summarization evaluated against reference summaries (ROUGE). QA evaluated against reference answers (exact match, F1).

Strengths. Objective (the reference defines correctness). Reproducible. Automated.

Weaknesses. Reference may be wrong or incomplete (many valid answers may exist; reference captures only some). Reference-based metrics often miss quality differences (a translation can be substantively better than the reference; BLEU won’t see it).
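The reference-based QA metrics named above (exact match, F1) can be sketched in a few lines. This is a minimal illustration, assuming a SQuAD-style normalization (lowercasing, stripping punctuation and English articles); the helper names `normalize`, `exact_match`, and `token_f1` are ours, and real benchmarks each define their own normalization rules.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> bool:
    """True when prediction and reference are identical after normalization."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall against the reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower.", "Eiffel Tower"))
    print(token_f1("Paris, France", "Paris"))
    # A correct answer phrased differently shares no tokens with the reference.
    print(token_f1("It is in the French capital", "Paris"))
```

The last call illustrates the weakness described above: an answer that a human would accept scores zero because it does not overlap with the single reference string.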

Reference-free. No gold standard; the metric judges quality directly. Often LLM-as-judge (§7) or human evaluation (§6).

Examples. Human preference for one of two model outputs. LLM-as-judge scoring outputs on rubrics. Toxicity classifiers scoring outputs without reference comparison.

Strengths. Captures quality dimensions reference-based metrics miss. Works when reference answers don’t exist (open-ended generation, creative writing).

Weaknesses. Subjective (different judges may disagree). Expensive (human judgment) or biased (LLM-judge biases).

The 2026 trend. Reference-based metrics dominate for closed-ended tasks (multiple choice, classification, factual QA). Reference-free metrics dominate for open-ended tasks (chat, generation, complex reasoning). Both are used; the choice depends on task structure.

Comparative vs absolute evaluation

A different methodological axis.

Absolute evaluation. How well does this model perform? A score (accuracy 87%; BLEU 32; FID 5.2) characterizes the model’s level of capability.

Strengths. Intuitive. Comparable across models that use the same metric. Tracks progress over time.

Weaknesses. The number is meaningful only if the metric is meaningful. As benchmarks saturate, absolute scores stop discriminating.

Comparative evaluation. Is model A better than model B? Pairwise comparisons; preference rankings; Elo-style ratings.

Examples. LMSYS Chatbot Arena (pairwise human preference → Elo ratings). Win-rate comparisons between models on shared tasks. Pairwise LLM-as-judge.

Strengths. Robust to score saturation (you can always rank two models even if both score 95% on absolute benchmarks). Closer to deployment decisions (users choose between models). Avoids the need to define “good” in absolute terms.

Weaknesses. Less intuitive than absolute scores. Comparisons may not be transitive (A > B > C does not always imply A > C). Computational cost (pairwise scales as N² for N models).
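Both weaknesses can be made concrete. The sketch below counts the model pairs a full round-robin requires and searches a table of pairwise win rates for intransitive cycles. The win rates are invented for illustration, and `intransitive_triples` is a hypothetical helper, not part of any arena's tooling.

```python
from itertools import permutations
from math import comb


def intransitive_triples(
    win_rate: dict[tuple[str, str], float],
) -> list[tuple[str, str, str]]:
    """Return cycles (a, b, c) where a beats b, b beats c, and c beats a.

    win_rate[(x, y)] is the fraction of comparisons in which x was preferred to y.
    """
    models = sorted({m for pair in win_rate for m in pair})

    def beats(x: str, y: str) -> bool:
        if (x, y) in win_rate:
            return win_rate[(x, y)] > 0.5
        if (y, x) in win_rate:
            return win_rate[(y, x)] < 0.5
        return False  # pair never compared

    cycles = []
    for a, b, c in permutations(models, 3):
        # Report each cycle once, starting from its alphabetically first model.
        if a < b and a < c and beats(a, b) and beats(b, c) and beats(c, a):
            cycles.append((a, b, c))
    return cycles


if __name__ == "__main__":
    # Illustrative numbers only: A beats B, B beats C, yet C beats A.
    rates = {("A", "B"): 0.60, ("B", "C"): 0.55, ("A", "C"): 0.45}
    print(intransitive_triples(rates))
    # Number of distinct model pairs in a full round-robin over 20 models.
    print(comb(20, 2))
```

A rating system such as Elo forces a total order onto data like this, so a cycle in the raw win rates is worth reporting alongside the ranking.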

The 2026 picture. Comparative evaluation has increased in importance as absolute benchmarks saturate. Chatbot Arena-style ratings are now the most-watched single evaluation for frontier LLMs.

The role of held-out data

A specific methodological principle. The model should not have seen the evaluation data during training. Holding out evaluation data from training is the most basic form of evaluation hygiene.

The principle. Training data includes everything the model has been exposed to during pretraining and fine-tuning. Evaluation data should be disjoint from training data - otherwise reported performance reflects memorization rather than generalization.

The complications.

Web-scraped training data. Frontier-LLM training data includes essentially all publicly-available text. Many evaluation benchmarks are on the web; they may inadvertently be in training data.

Translation across formats. Benchmark questions may appear in training data in different formats (reformatted, paraphrased, translated). Simple substring matching misses these.

Public release. Benchmark designers typically release their datasets publicly; later models train on the released data, and reported performance is inflated.

Temporal flow. New benchmarks released today may be in tomorrow’s training data; held-out today may be contaminated next year.

The responses.

Strict held-out splits. For each benchmark, designate training, validation, and test splits. Use only the train split for development; use the test split only for final evaluation.

Date-based cutoffs. Use only benchmark items from after the model’s training cutoff. If the model couldn’t have seen the items, contamination is impossible.
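A date-based filter is simple to implement when each item carries a publication date. The sketch below is an illustration under stated assumptions: `BenchmarkItem` and `items_after_cutoff` are hypothetical names, and the safety margin is our addition, on the assumption that a model's stated training cutoff may not be exact.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class BenchmarkItem:
    item_id: str
    published: date  # date the item first became publicly available


def items_after_cutoff(
    items: list[BenchmarkItem],
    training_cutoff: date,
    margin_days: int = 90,
) -> list[BenchmarkItem]:
    """Keep items published strictly after the cutoff plus a safety margin."""
    earliest_allowed = training_cutoff + timedelta(days=margin_days)
    return [item for item in items if item.published > earliest_allowed]


if __name__ == "__main__":
    pool = [
        BenchmarkItem("q1", date(2024, 5, 1)),
        BenchmarkItem("q2", date(2024, 7, 15)),
        BenchmarkItem("q3", date(2024, 10, 1)),
    ]
    kept = items_after_cutoff(pool, training_cutoff=date(2024, 6, 1))
    print([item.item_id for item in kept])
```

The filter only helps if publication dates are trustworthy; an item that restates older material can postdate the cutoff and still be contaminated in substance.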

Private benchmarks. Some evaluation sets are never released publicly. Used for internal evaluation by labs or for governance evaluations by AISIs.

Contamination audits. Explicitly check whether benchmark items appear in training data. Exclude contaminated items from the reported score, or flag them in reporting.
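One common form of audit is an n-gram overlap check between benchmark items and training documents. The following is a minimal sketch, assuming whitespace tokenization and an in-memory corpus; the function names and the default window of 8 tokens are illustrative choices, not a standard. As noted above, surface matching of this kind misses paraphrased or translated copies.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All contiguous n-token windows of the lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


def flag_contaminated(
    benchmark_items: list[str],
    training_docs: list[str],
    n: int = 8,
) -> list[str]:
    """Return benchmark items that share at least one n-gram with the corpus."""
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        corpus_grams |= ngrams(doc, n)
    return [item for item in benchmark_items if ngrams(item, n) & corpus_grams]


if __name__ == "__main__":
    docs = ["blog post quoting the exam: what is the capital of france answer paris"]
    items = [
        "What is the capital of France",
        "Name the largest planet in the solar system",
    ]
    flagged = flag_contaminated(items, docs, n=5)
    print(flagged)
    print(f"contamination rate: {len(flagged) / len(items):.0%}")
```

At real corpus scale the n-gram set would not fit in memory; hashing or a Bloom filter is the usual substitute, at the cost of some false positives.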

The 2026 best practice. Held-out evaluation with explicit contamination audits is standard for serious work. Lighter-weight evaluation (quick model comparisons) may skip these steps with appropriate caveats.

The role of pre-registration and reporting standards

Borrowed from other empirical sciences. Pre-registration - committing to evaluation protocols before running the evaluation - reduces post-hoc cherry-picking.

The practice in AI. Less common than in some empirical sciences but growing.

  • Lab pre-registration. Frontier labs increasingly pre-commit to evaluation protocols for major model releases.

  • Public pre-registration. Some academic benchmark designers pre-register their evaluation plans.

  • Methodology documentation. HELM-style frameworks document evaluation procedures in detail; reduces ambiguity.

The honest assessment. AI evaluation reporting standards lag those of established empirical sciences. Specific reported numbers can be cherry-picked (best of multiple runs; best of multiple prompts; best of multiple configurations) without disclosure. The field is improving but inconsistently.
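The size of the best-of-multiple-runs effect can be estimated by simulation. The sketch below assumes a model with a fixed true accuracy whose runs differ only through independent per-item randomness (for example, sampling temperature); all numbers and function names are hypothetical.

```python
import random
import statistics


def simulate_run(true_acc: float, n_items: int, rng: random.Random) -> float:
    """Accuracy of one evaluation run: each item is solved with prob. true_acc."""
    return sum(rng.random() < true_acc for _ in range(n_items)) / n_items


def mean_reported(
    true_acc: float, n_items: int, k: int, trials: int = 2000, seed: int = 0
) -> float:
    """Average score reported when only the best of k runs is disclosed."""
    rng = random.Random(seed)
    return statistics.mean(
        max(simulate_run(true_acc, n_items, rng) for _ in range(k))
        for _ in range(trials)
    )


if __name__ == "__main__":
    for k in (1, 5, 20):
        score = mean_reported(true_acc=0.70, n_items=200, k=k)
        print(f"best of {k:>2} runs: {score:.3f}")
```

Under these assumptions the reported score rises with k even though the underlying model never changes, which is why disclosing the number of runs (and reporting the mean with its spread) matters.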

Putting it together: a worked example

A concrete instance combining the framework dimensions.

Suppose we want to evaluate a new frontier LLM on coding capability.

  • What we measure. Capability (code-generation correctness). Possibly secondary: efficiency (time to generate; tokens consumed).

  • Static or dynamic? Static (canonical coding benchmarks like HumanEval, SWE-bench).

  • Reference-based or reference-free? Mix. HumanEval uses reference test execution (reference-based). LLM-as-judge for code quality (reference-free).

  • Comparative or absolute? Both. Report absolute pass-rate (HumanEval pass@1 = 87%); also report comparison to peer models (vs GPT-4o, vs Claude 3.7).

  • Held-out hygiene. Use HumanEval and SWE-bench Verified (curated to reduce contamination). Flag if these benchmarks may be in training data. Possibly also evaluate on a held-out internal benchmark.

The result. A multi-faceted evaluation that triangulates the model’s coding capability across different methodological choices. No single number; a structured picture.

Where the evaluation framework sits in 2026

The summary. The conceptual framework - measurement categories, static vs dynamic, reference-based vs reference-free, comparative vs absolute, held-out hygiene, pre-registration - is recognized best practice though unevenly applied. Top-tier evaluations follow it carefully; lighter-weight evaluations cut corners.

The remaining issues. Standardization across labs is incomplete. Reporting practices vary substantially. Pre-registration is rare. The framework is useful guidance; consistent application would substantially improve the discipline.


§4. Capability Benchmarks

The most-developed area of AI evaluation. Capability benchmarks measure what models can do - specific tasks with measurable correctness criteria. This section surveys the dominant benchmarks of 2026, organizes them into categories, and develops the dynamics of benchmark development and saturation.

The capability-benchmark landscape

The current landscape includes hundreds of benchmarks across many categories. A representative selection:

General knowledge. MMLU, MMLU-Pro, BIG-bench, AGIEval, OpenBookQA.

Reasoning. GPQA, GPQA Diamond, ARC-AGI, ARC-AGI-2, FrontierMath, Humanity’s Last Exam, BBH (BIG-Bench Hard), MuSR.

Mathematics. GSM8K, MATH, AIME, FrontierMath, Putnam-style problems.

Coding. HumanEval, MBPP, SWE-bench, SWE-bench Verified, LiveCodeBench, BigCodeBench, ClassEval.

Reading comprehension. SQuAD, RACE, DROP, NarrativeQA, QuALITY, MuSR.

Multimodal. MMMU, MMVet, ChartQA, DocVQA, MathVista, MMMU-Pro, MathVerse.

Long-context. Needle-in-Haystack, RULER, LongBench, InfiniteBench, ZeroSCROLLS.

Agentic. SWE-bench, GAIA, WebArena, OSWorld, TheAgentCompany, METR Task Evaluations, RE-Bench.

Tool use. ToolBench, API-Bank, ToolE, BFCL (Berkeley Function Calling Leaderboard).

Multilingual. Global MMLU, XCOPA, M3Exam, multilingual extensions of major benchmarks.

Safety-relevant. TruthfulQA, ToxiGen, RealToxicityPrompts (cross-reference Alignment §7).

This is not exhaustive; the benchmark ecosystem is enormous. We develop the most-consequential benchmarks below.

Knowledge benchmarks

The first category. Knowledge benchmarks measure what the model knows - factual information across domains.

MMLU (Hendrycks et al., 2021). 57 subjects of multiple-choice questions. The dominant LLM-knowledge benchmark from 2021-2024. Trajectory: GPT-3 43.9% → GPT-4 86.4% → frontier models 90%+ by 2025-2026 (near-saturation).

MMLU-Pro (Wang et al., 2024). Successor to MMLU; 10-choice instead of 4-choice; harder questions; reduced label noise. Frontier model performance ~80%+ by 2026; less saturated than MMLU.

AGIEval (Zhong et al., 2023). Aggregates standardized human-administered exams (college entrance exams, law school admission tests, Gaokao). Tests human-grade knowledge across disciplines.

OpenBookQA, AI2 Reasoning Challenge (ARC). Earlier knowledge benchmarks; substantially saturated by 2024.

The honest accounting. Knowledge benchmarks measure what is encoded in the model’s weights from training data. Strong knowledge-benchmark performance doesn’t necessarily reflect deep understanding - memorization plus interpolation often suffices.

Reasoning benchmarks

A different category. Reasoning benchmarks test multi-step inference, novel problem-solving, and abstract reasoning rather than knowledge recall.

GPQA Diamond (Rein et al., 2023). Graduate-level Google-Proof Q&A. Questions written by PhD experts; designed so that internet search alone does not yield correct answers; requires deep domain understanding plus reasoning.

GPQA Diamond trajectory. Released late 2023 with ~35% accuracy for GPT-4. Frontier reasoning models (o1, o3, Claude 3.7+) reach 70%+ by 2025-2026. Still below human-expert level (~80%) but approaching.

FrontierMath (Glazer et al., 2024). Research-level mathematics problems; designed to be hard for current models for years.

FrontierMath trajectory. Released November 2024 with <2% accuracy for then-frontier models. By mid-2025, frontier reasoning models reached 25%+; by mid-2026 some reach 50%+. Still substantially below mathematician performance but progressing faster than predicted.

ARC-AGI and ARC-AGI-2 (Chollet, 2019 and 2025). Built on the Abstraction and Reasoning Corpus; tasks designed to test fluid intelligence rather than knowledge. ARC-AGI was resistant to scale through 2024; large frontier models scored only modestly higher than small ones. The 2024-2025 breakthrough: reasoning models (o3 in December 2024) achieved substantial ARC-AGI scores via deep inference-time compute. ARC-AGI-2 (released 2025) is harder; current frontier ~20-30%.

Humanity’s Last Exam (CAIS, Scale AI, 2024-2025). 3,000+ expert-written questions across hundreds of subjects; explicitly designed to remain challenging for years. Frontier model performance ~10-25% by mid-2025; substantially harder than other knowledge benchmarks.

BBH (BIG-Bench Hard) (Suzgun et al., 2022). Subset of BIG-bench tasks where the original LLMs of 2022 underperformed humans. Substantially saturated for frontier models by 2024.

The category’s trajectory. Reasoning benchmarks have replaced knowledge benchmarks as the frontier-LLM discriminators of 2024-2026. Knowledge benchmarks saturated; reasoning benchmarks remain (more) discriminating.

Mathematics benchmarks specifically

Mathematics deserves separate treatment.

GSM8K (Cobbe et al., 2021). 8,500 grade-school math word problems. The dominant reasoning benchmark of 2022-2023; saturated for frontier models by 2024 (95%+).

MATH (Hendrycks et al., 2021). Competition-mathematics problems (AMC, AIME-level). State-of-the-art ~30% in 2021; 90%+ by mid-2025; approaching saturation.

AIME 2024, AIME 2025. American Invitational Mathematics Examination problems. Each new year’s AIME is unsolved at release; provides natural contamination-resistant test. Frontier reasoning models solve substantial fractions (AIME 2024 ~80% for top models; AIME 2025 ~70% on release).

FrontierMath. Discussed above. The hardest standard math benchmark.

USAMO, Putnam. Even harder competition mathematics. Largely unsolved by current models.

The trajectory. Math capability has advanced substantially in 2024-2026 driven by reasoning models. GSM8K → MATH → AIME → FrontierMath → USAMO/Putnam represents a clear difficulty progression; frontier models have moved up the ladder.

Coding benchmarks

A specific high-stakes category.

HumanEval (Chen et al., 2021). 164 Python coding problems. The dominant coding benchmark of 2021-2024. Frontier models 90%+ pass@1 by 2024; substantially saturated.
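The pass@k metric is usually computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c samples that pass the unit tests, and estimate the probability that at least one of k drawn samples passes as 1 - C(n-c, k) / C(n, k). A minimal sketch follows; the per-problem counts in the usage example are invented.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: sample budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Mean pass@k over problems; results holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical counts: 10 samples per problem, 3 problems.
    per_problem = [(10, 10), (10, 3), (10, 0)]
    print(f"pass@1 = {benchmark_pass_at_k(per_problem, k=1):.3f}")
    print(f"pass@5 = {benchmark_pass_at_k(per_problem, k=5):.3f}")
```

For k = 1 the estimator reduces to c / n, the plain fraction of passing samples; reporting pass@1 and pass@k for larger k together separates reliability from coverage.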

MBPP (Mostly Basic Python Programming; Austin et al., 2021). 974 entry-level Python problems. Similar saturation trajectory.

SWE-bench (Jimenez et al., 2024). 2,294 real GitHub issues from popular Python projects. Agents must produce patches that resolve issues and pass tests. Cross-reference AI Agents §10.

SWE-bench trajectory. State-of-the-art was ~2% in late 2023; ~12% in March 2024 (Devin announcement); ~50% by late 2024 with Claude 3.5 Sonnet; ~70%+ by mid-2025 with frontier reasoning-model agents; ~80% by mid-2026 (SWE-bench Verified subset).

SWE-bench Verified. Curated 500-problem subset with quality-assured ground-truth tests. The standard for serious SWE-bench reporting; less noisy than full SWE-bench.

LiveCodeBench (Jain et al., 2024). Coding problems collected continuously from competitive programming sites (LeetCode, Codeforces). Continuous addition of new problems provides contamination resistance.

BigCodeBench (BigCode collaboration, 2024). Practical programming tasks combining multiple libraries; closer to real software-engineering practice.

The coding-benchmark category is active and rapidly improving. SWE-bench is the flagship; LiveCodeBench provides contamination-resistant tracking; HumanEval is largely historical.

Multi-task and holistic benchmarks

A specific category. Multi-task benchmarks aggregate many tasks for a single composite score.

BIG-bench (Srivastava et al., 2022). 204 diverse tasks crowdsourced from researchers. Designed for breadth; resistant to single-architecture overfitting. The aggregate score combines diverse measurements.

HELM (Liang et al., 2022). Multi-scenario, multi-metric framework. Not just an aggregate score but a systematic comparison across many dimensions.

Open LLM Leaderboard (Hugging Face). Standardized benchmark suite for open-source LLMs; updated continuously.

AGIEval, MMLU-Pro, BIG-bench Hard. Aggregate benchmarks designed for harder/saturation-resistant evaluation.

The honest accounting. Multi-task benchmarks provide aggregate capability pictures. They also obscure task-specific behaviour; a model strong on most BIG-bench tasks might be weak on a specific important one. Aggregate scores are useful as summaries but should not be confused with comprehensive evaluation.

Frontier capability benchmarks

A specific category that has emerged in 2024-2026: benchmarks explicitly designed to remain challenging as capabilities grow.

GPQA Diamond (above). Graduate-level questions; Google-proof; expert-validated.

FrontierMath (above). Research-mathematics problems.

Humanity’s Last Exam (above). Designed for multi-year challenge.

ARC-AGI-2 (above). Abstract reasoning at increased difficulty.

RE-Bench (METR, 2024+). Research-engineering tasks measuring AI ability to do ML research.

Beyond-Human Benchmarks. Several proposals for benchmarks at superhuman difficulty; aspirational rather than currently-meaningful.

The design philosophy. Don’t design for current capabilities; design for capabilities 1-3 years out. The benchmark should remain partially unsolved at release and provide years of meaningful discrimination as capabilities advance.

The success rate is mixed. Some “designed for the future” benchmarks have been substantially solved faster than designers expected (FrontierMath was expected to remain <10% for years; reached 25%+ within months). Others have held up (Humanity’s Last Exam remains mostly unsolved).

The saturation cycle and continuous benchmark refresh

The recurring pattern. New benchmarks are released → models improve rapidly → benchmarks saturate → harder successors are released. This cycle has been operating throughout AI’s modern history.

The cycle dynamics.

   THE BENCHMARK SATURATION CYCLE

   1. BENCHMARK RELEASED
      Initial SOTA: ~20-40% (a hard but tractable level).
      Researchers/labs target the benchmark.

   2. RAPID IMPROVEMENT (1-2 years)
      Methods improve; SOTA rises to 60-80%.
      Benchmark provides meaningful discrimination.

   3. APPROACHING SATURATION (1-2 years later)
      SOTA reaches 90%+; gap between top models narrows.
      Benchmark provides less discrimination among frontier models.

   4. SATURATION
      SOTA ~95%+ clustered. Improvements within noise.
      Benchmark uninformative for frontier comparisons.

   5. NEW BENCHMARK
      Harder successor released. Cycle repeats.

Examples of completed cycles. ImageNet (2012-2018). GLUE (2018-2019). SuperGLUE (2019-2020). GSM8K (2022-2024). HumanEval (2021-2024). MMLU (2021-2026, near-end).

In-progress cycles. MMLU-Pro, GPQA Diamond, FrontierMath, ARC-AGI-2 are all currently in the rapid improvement phase.
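"Improvements within noise" in the saturation stage has a simple statistical reading: on a benchmark with n items, the sampling error of an accuracy estimate can exceed the gap between two top models. The sketch below uses a normal approximation and treats the two scores as independent samples, which is a simplifying assumption (a paired test on shared items is tighter); the function names are ours.

```python
from math import sqrt


def accuracy_se(acc: float, n_items: int) -> float:
    """Binomial standard error of an accuracy measured on n_items."""
    return sqrt(acc * (1.0 - acc) / n_items)


def gap_within_noise(acc_a: float, acc_b: float, n_items: int, z: float = 1.96) -> bool:
    """True if the gap is smaller than z standard errors of the difference."""
    se_diff = sqrt(accuracy_se(acc_a, n_items) ** 2 + accuracy_se(acc_b, n_items) ** 2)
    return abs(acc_a - acc_b) <= z * se_diff


if __name__ == "__main__":
    # 164 items matches HumanEval's size; the scores are illustrative.
    print(gap_within_noise(0.95, 0.96, n_items=164))  # saturated regime
    print(gap_within_noise(0.40, 0.60, n_items=164))  # discriminating regime
```

With a few hundred items, a one-point difference near the ceiling cannot be distinguished from sampling variation, while a twenty-point difference mid-range can.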

The implications.

Benchmark lifespan is finite. Plan accordingly; don’t invest infinitely in any single benchmark.

Continuous refresh is essential. New benchmarks must keep emerging; the community needs sustained infrastructure for benchmark development.

Living benchmarks help. Benchmarks that continuously add new content (LiveCodeBench, LMSYS Chatbot Arena) avoid saturation by construction.

Multiple benchmarks per capability. No single benchmark suffices for any important capability over time; the standard is multi-benchmark coverage.

Cost-effectiveness as a first-class evaluation concern

A specific addition to the capability-benchmark landscape. As benchmark scores converge, cost-effectiveness (capability per dollar; capability per second) becomes increasingly informative.

The pattern. Two models may both score 95% on a benchmark, but one costs $20 per million tokens and the other $1.50 per million tokens. The cheaper model is substantially better for most deployment contexts.

The metrics.

  • Cost per task. Total LLM cost (input + output tokens × price) per task completion.

  • Latency per task. Wall-clock time per task.

  • Throughput. Tasks per unit time (relevant for production).

  • Pareto frontier. Plot cost vs capability across models; identify the Pareto-optimal set.
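The cost-per-task and Pareto-frontier metrics above can be sketched as follows. The model names, accuracies, and prices are invented for illustration, and charging separate input and output prices per million tokens is an assumption about the pricing scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    name: str
    accuracy: float       # benchmark score in [0, 1]
    cost_per_task: float  # dollars


def cost_per_task(
    input_tokens: int, output_tokens: int,
    input_price_per_m: float, output_price_per_m: float,
) -> float:
    """Dollar cost of one task given per-million-token prices."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000


def pareto_frontier(results: list[ModelResult]) -> list[ModelResult]:
    """Models not dominated by any other (at least as accurate AND as cheap,
    and strictly better on one of the two). Sorted by increasing cost."""
    frontier = []
    for r in results:
        dominated = any(
            o.accuracy >= r.accuracy
            and o.cost_per_task <= r.cost_per_task
            and (o.accuracy > r.accuracy or o.cost_per_task < r.cost_per_task)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.cost_per_task)


if __name__ == "__main__":
    results = [
        ModelResult("A", 0.95, cost_per_task(2000, 500, 20.0, 20.0)),
        ModelResult("B", 0.95, cost_per_task(2000, 500, 1.5, 1.5)),
        ModelResult("C", 0.80, 0.001),
        ModelResult("D", 0.78, 0.002),
    ]
    print([r.name for r in pareto_frontier(results)])
```

Here model A is dominated by B (same score, higher cost) and D by C, so only C and B remain as defensible choices; which of the two to deploy depends on how much the extra accuracy is worth.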

The 2026 benchmark reporting. Major benchmark leaderboards now routinely include cost-effectiveness metrics. The SWE-bench leaderboard (for example) reports cost-per-resolved-issue alongside capability.

The implication. Frontier capability is increasingly not the only goal. Production decisions increasingly favour cost-effective models over absolute-capability leaders; benchmark methodology has evolved to reflect this.

Where capability benchmarks sit in 2026

The summary. Capability benchmarks are substantially more diverse and methodologically sophisticated than 2018 baselines. The frontier-benchmark category (GPQA, FrontierMath, ARC-AGI-2, Humanity’s Last Exam) explicitly addresses saturation. Cost-effectiveness is increasingly first-class. Multi-benchmark evaluation is standard.

The remaining issues. The saturation cycle continues. New benchmarks lag behind capability development. Contamination concerns affect many widely-cited benchmarks. The connection between benchmark scores and deployment value is uncertain.

The next sections develop other evaluation categories: §5 covers behavioural and safety evaluations; §6 covers human evaluation; §7 covers LLM-as-judge.


§5. Behavioural and Safety Evaluations

A distinct evaluation category from capability benchmarks. Behavioural evaluations measure how models behave across normal and adversarial conditions; safety evaluations measure potential harms. The two overlap; both are essential for deployment decisions. This section develops them, cross-referencing the more-detailed Alignment §7 treatment.

Behavioural evaluations

The category. Behavioural evaluations measure properties of model output that matter for deployment but are not strictly “capabilities” - refusal of harmful requests, honesty, format compliance, tone, helpfulness.

Notable benchmarks.

TruthfulQA (Lin, Hilton, Evans, 2022). 817 questions designed to elicit human-misconception answers. The model “passes” by giving truthful answers even when the misconception is the more common human response.

Why it matters. LLMs trained on web data are exposed to many misconceptions. A capable LLM that believes the misconception (or reflects it in outputs) is unhelpful - possibly harmful - even if knowledge-benchmark performance is high. TruthfulQA distinguishes truthfulness from knowledge.

The trajectory. Initial frontier models (GPT-3) ~25% truthful (worse than random for tricky misconceptions). GPT-4 ~60%. Frontier models 2025-2026 80%+. Substantial but incomplete improvement.

RealToxicityPrompts (Gehman et al., 2020). 100K prompts collected from web text; tests whether models complete prompts in toxic ways. Used to measure baseline toxicity propensity.

ToxiGen (Hartvigsen et al., 2022). Targeted evaluation for hate speech generation across demographic groups; balances measurement of hate-speech production with risk of underestimating it via sanitization.

HarmfulQA, HarmBench (multiple sources). Test refusal of requests across categories of harmful content (illegal activities, dangerous information, hate speech, self-harm).

HHH (Helpful, Honest, Harmless) evaluations. Anthropic-pioneered framework testing across the three values; widely adopted.

InstructEval, MT-Bench (Zheng et al., 2023). Multi-turn conversational evaluations for instruction-following models.

The honest accounting. Behavioural evaluations are useful but imperfect. They cover specific failure categories; they cannot exhaustively cover all behavioural concerns. Saturation affects them: as models pass these benchmarks, the benchmarks become less informative.

Safety evaluations

The broader category. Safety evaluations test for potential harms the model could cause in deployment. This overlaps with behavioural evaluations but emphasizes worst-case concerns.

Cross-reference Alignment §7 for detailed treatment; this section develops the evaluation-methodology aspects.

The categories.

  • Dangerous-capability evaluations (developed below).

  • Behavioural-safety evaluations (TruthfulQA, HarmfulQA, etc., above).

  • Red-teaming (§5.4 below).

  • Robustness evaluations (adversarial inputs, distribution shift).

  • Agentic-safety evaluations (cross-reference AI Agents §11).

The integration. Frontier safety frameworks (Anthropic RSP, OpenAI Preparedness, DeepMind Frontier Safety) require systematic safety evaluation across these categories before deployment. The evaluation results feed governance decisions; cross-reference Alignment §10.

Dangerous-capability evaluations

A specific safety-evaluation category that has become central. Dangerous-capability evaluations (DCEs) test whether the model can perform tasks with substantial harm potential, regardless of whether it currently would.

The motivation (cross-reference Alignment §7). A safety-trained model that refuses to help with bioweapon design behaviourally is one thing; whether the underlying capability exists is another. A capable model whose safety training is bypassed could do substantial harm.

The dominant categories in 2026.

CBRN (Chemical, Biological, Radiological, Nuclear). Mouton et al. (RAND, 2024) and frontier-lab internal evaluations. Test biological-weapon synthesis routes, chemical-weapon precursor identification, nuclear-physics weaponization knowledge.

Cyber capabilities. Cybench, CyberBench (2024+). Vulnerability discovery, exploit development, malware creation, social engineering. Some benchmarks use capture-the-flag (CTF) style problems; others test specific attack-chain steps.

Persuasion and manipulation. Salvi et al. (2024) “On the Conversational Persuasiveness of Large Language Models” and Goldstein et al. (2023). Test persuasive-content generation; measure effects on human beliefs and actions.

Autonomous replication and adaptation. Kinniment et al. (METR, 2023) “Evaluating Language-Model Agents on Realistic Autonomous Tasks.” Test agentic-AI capability to spin up cloud instances, evade detection, gather money.

AI R&D acceleration. Frontier labs evaluate model contributions to AI research itself; concerning because of recursive-acceleration feedback loops.

Long-horizon agentic capability. METR Task Evaluations and RE-Bench. Cross-reference AI Agents §10.

The methodology. The five-stage recipe (cross-reference Alignment §7).

   DANGEROUS-CAPABILITY EVALUATION METHODOLOGY (review)

   1. THREAT MODELING - identify harm scenarios; specify capabilities.
   2. CAPABILITY ELICITATION - maximize capability via prompting,
      fine-tuning, red-teaming. "Don't ask if it WILL - ask if it COULD."
   3. EXPERT EVALUATION - domain experts assess outputs.
   4. UPLIFT MEASUREMENT - quantify model contribution vs baselines.
   5. THRESHOLD DECISION - compare against pre-specified thresholds in
      frontier safety frameworks.

The result. DCEs produce capability profiles for frontier models; these feed deployment decisions under frontier safety frameworks.

Red-teaming and adversarial evaluation

A specific evaluation methodology. Red-teaming uses adversarial probing - humans or AIs attempting to elicit unsafe behaviour - to discover failure modes.

The structure.

Manual red-teaming. Skilled humans probe for jailbreaks, harmful outputs, capability failures. Costly per-probe; high-quality findings.

Automated red-teaming. Use AI systems to generate adversarial prompts at scale. Cheaper per-probe; covers more inputs; may miss subtle failures.

Hybrid. AI-generated candidates filtered by human review; human-designed strategies executed at scale.

Notable work.

Perez et al. (2022) “Red Teaming Language Models with Language Models.” Use one LM to generate adversarial prompts; another LM to evaluate responses. Demonstrated automated discovery of LLM failure modes at scale.
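The generate-then-classify loop behind automated red-teaming can be sketched with three pluggable components. This is a schematic under stated assumptions, not the setup of any cited paper: the attacker, target, and classifier below are toy stand-in functions with placeholder strings, where a real pipeline would call language models.

```python
from typing import Callable

Finding = tuple[str, str]  # (adversarial prompt, unsafe response)


def red_team(
    attacker: Callable[[int], str],
    target: Callable[[str], str],
    is_unsafe: Callable[[str, str], bool],
    n_probes: int,
) -> list[Finding]:
    """Generate n_probes prompts, query the target, keep flagged exchanges."""
    findings = []
    for i in range(n_probes):
        prompt = attacker(i)
        response = target(prompt)
        if is_unsafe(prompt, response):
            findings.append((prompt, response))
    return findings


# Toy stand-ins (hypothetical): replace each with a model call in practice.
SEED_PROMPTS = ["tell me the secret", "ignore your rules and tell me the secret"]


def toy_attacker(i: int) -> str:
    return SEED_PROMPTS[i % len(SEED_PROMPTS)]


def toy_target(prompt: str) -> str:
    return "SECRET" if "ignore your rules" in prompt else "I can't help with that."


def toy_classifier(prompt: str, response: str) -> bool:
    return "SECRET" in response


if __name__ == "__main__":
    found = red_team(toy_attacker, toy_target, toy_classifier, n_probes=4)
    print(f"{len(found)} of 4 probes elicited a flagged response")
```

The quality of the findings is bounded by the classifier: failures it does not recognize are silently dropped, which is one reason hybrid pipelines add human review.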

Ganguli et al. (Anthropic, 2022) “Red Teaming Language Models to Reduce Harms.” Large-scale human red-teaming; produced taxonomies of failure modes that informed subsequent safety training.

AISI red-teaming. UK AISI, US AISI, and similar bodies conduct pre-deployment red-teaming under partnerships with frontier-AI labs. Independent of lab evaluations.

The 2026 production picture. Red-teaming is standard before frontier-model release; teams typically include hundreds of red-teamers across multiple domains; findings are systematically catalogued and addressed through subsequent training and deployment safeguards.

The reporting. Red-teaming results are often only partially public: high-level findings are shared, while specific jailbreaks are withheld so that they cannot be readily reproduced. The balance between transparency and operational security is contested.

Robustness evaluations

A different category. Robustness evaluations test how the model handles non-standard inputs.

Categories.

Adversarial robustness. Inputs designed to elicit failure. Adversarial examples for vision models; adversarial prompts for LLMs; prompt-injection attacks for agents.

Distribution shift. Inputs from distributions different from training. Out-of-distribution detection; domain transfer evaluation; multilingual evaluation as a form of distribution shift.

Paraphrase robustness. Same task, paraphrased queries. Performance should be similar; substantial degradation indicates brittleness.

Format perturbations. Same content, different format (different prompt templates, different white-space patterns). Sensitivity to format suggests overfitting to specific surface patterns.
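Paraphrase and format robustness can both be summarized from per-item correctness under each prompt variant. A minimal sketch, assuming every variant is scored on the same items in the same order; the variant names and outcomes are invented.

```python
def robustness_report(correct_by_variant: dict[str, list[bool]]) -> dict:
    """Summarize accuracy per variant, the spread, and all-variant consistency.

    correct_by_variant maps a variant name to per-item correctness flags;
    all lists must cover the same items in the same order.
    """
    lengths = {len(flags) for flags in correct_by_variant.values()}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("all variants must score the same non-empty item set")
    n_items = lengths.pop()
    accuracy = {name: sum(flags) / n_items for name, flags in correct_by_variant.items()}
    # An item counts as robustly solved only if every variant got it right.
    robust = sum(
        all(flags[i] for flags in correct_by_variant.values()) for i in range(n_items)
    )
    return {
        "accuracy": accuracy,
        "spread": max(accuracy.values()) - min(accuracy.values()),
        "all_variants_correct": robust / n_items,
    }


if __name__ == "__main__":
    report = robustness_report({
        "original": [True, True, True, False],
        "paraphrase": [True, False, True, False],
        "extra_whitespace": [True, True, False, False],
    })
    print(report)
```

The all-variants figure is the more demanding number: two variants can have equal accuracy while failing on different items, which the spread alone would hide.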

Adversarial fine-tuning. Can attackers fine-tune the deployed model to remove safety training? Tests how robust safety is against adversarial modifications.

Notable benchmarks.

  • AdvBench (Zou et al., 2023). Adversarial prompts targeting LLM safety.

  • AdvGLUE (Wang et al., 2021). Adversarial NLU evaluation.

  • Anthropic’s Constitutional AI evaluations. Test robustness to adversarial fine-tuning.

The 2026 state. Robustness is partially evaluated; comprehensive robustness assessment is incomplete. Specific attack categories (jailbreaks, prompt injection) have benchmark coverage; many others do not.

Frontier safety framework integration

The institutional integration. Frontier safety frameworks (Anthropic RSP, OpenAI Preparedness, DeepMind Frontier Safety, xAI Safety Framework; cross-reference Alignment §10) specify which evaluations are required before deployment at each capability level.

The structure. Each framework defines capability tiers (Anthropic ASL-1/2/3/4; OpenAI low/medium/high/critical). Each tier specifies:

  • Required evaluations.

  • Score thresholds.

  • Required mitigations if thresholds are met.

  • Deployment constraints conditional on results.

The result. Behavioural and safety evaluations are not just research outputs - they are operational requirements. Frontier-model release decisions reference specific evaluation scores against specific thresholds.

The governance integration. EU AI Act, US Executive Order (2023; substantially modified 2025; subsequent iterations), UK AISI evaluation partnerships, and other governance frameworks reference these evaluations as part of compliance.

Where behavioural and safety evaluation sits in 2026

The summary. Behavioural and safety evaluations are substantial and growing. The infrastructure (benchmarks, red-teaming, AISIs, frontier safety frameworks) is more mature than 2022 baselines. The integration with governance is operationally consequential.

The remaining issues. Coverage gaps exist for novel capability categories. Adversarial robustness of safety training is partial (cross-reference Alignment §8 sleeper-agents discussion). Standardization across labs is incomplete. The pace of capability growth strains evaluation methodology.

The next section develops human evaluation - the methodology that complements automated benchmarks for the qualitative dimensions of AI behaviour.


§6. Human Evaluation

The original evaluation methodology. Despite massive growth in automated benchmarks and LLM-as-judge methods, human evaluation remains the gold standard for many AI evaluation tasks. This section develops the methodology - pairwise preference, expert review, annotation pipelines, reliability concerns.

Why human evaluation

The motivating cases.

Subjective quality. “Which of these two outputs is more helpful?” “Which translation is more fluent?” “Which generated image is more aesthetically pleasing?” These judgments are subjective; humans are the authoritative source.

Open-ended generation. Reference-based metrics fail when many valid outputs exist. For chat responses, creative writing, complex analysis - there is no single right answer; quality judgment requires human assessment.

Novel failure modes. Automated benchmarks measure what they were designed to measure. Novel failure modes (sycophancy, jailbreaks, subtle hallucinations) often appear first in human evaluation before being captured by benchmarks.

Alignment with deployed-user values. Ultimately, deployed AI serves users. User preference - measured directly via human evaluation - is the most relevant evaluation signal for many deployment contexts.

The trade-offs. Human evaluation is expensive, slow, and variable (different humans disagree; the same human may disagree across sessions). The infrastructure to produce reliable human evaluation at scale is substantial.

Pairwise human preference

The dominant paradigm. Show humans two model outputs for the same input; ask which is better. Aggregate over many comparisons.

The advantages.

Calibrated. Humans are reasonably consistent at relative judgments (A vs B) even when absolute judgments are noisy. Pairwise data is easier to aggregate cleanly.

Comparable. Pairwise comparisons across many model pairs produce a ranking via Elo-style methods.

Scalable infrastructure. Pairwise comparison needs only a simple UI; users can complete many comparisons quickly, so very large datasets are achievable.

The standard pattern.

   PAIRWISE HUMAN PREFERENCE EVALUATION

   For each evaluation instance:
     1. Sample an input (prompt).
     2. Generate output from Model A.
     3. Generate output from Model B.
     4. Show human evaluator: input, output A, output B (randomized order).
     5. Evaluator picks: A better / B better / tie.

   Aggregate:
     - Many comparisons per model pair.
     - Convert to Elo-style rating (each model has a rating;
       updates after each comparison).
     - Rank models by rating.
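
The aggregation step can be sketched in a few lines of Python. This is a minimal illustration of online Elo updates, assuming a K-factor of 32 and a starting rating of 1000 (both arbitrary choices, not values taken from any particular arena); the function names are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, outcome_a, k=32.0):
    """Return updated (rating_a, rating_b).

    outcome_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    The K-factor of 32 is an assumed default.
    """
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def rate_models(comparisons, initial=1000.0, k=32.0):
    """comparisons: iterable of (model_a, model_b, outcome_a) tuples."""
    ratings = {}
    for a, b, outcome_a in comparisons:
        ra = ratings.setdefault(a, initial)
        rb = ratings.setdefault(b, initial)
        ratings[a], ratings[b] = elo_update(ra, rb, outcome_a, k)
    return ratings


def ranking(ratings):
    """Model names sorted from highest to lowest rating."""
    return sorted(ratings, key=ratings.get, reverse=True)
```

Online updates of this kind depend on the order in which comparisons arrive, which is one reason large platforms tend to fit all comparisons jointly instead.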

The 2026 landscape. Pairwise human preference has become the dominant evaluation paradigm for deployed-LLM quality. LMSYS Chatbot Arena is the canonical example; multiple specialized arenas exist.

LMSYS Chatbot Arena

The flagship deployment of pairwise human preference at scale. Chatbot Arena (Zheng, Chiang et al., LMSYS, 2023+) provides a public website where users converse with two anonymous models and vote which is better. The aggregated votes produce Elo rankings.

The scale. By 2026, Chatbot Arena has millions of comparisons across hundreds of models. The dominant single LLM evaluation by attention and adoption.

The methodology details.

Anonymization. Users don’t know which model produced which output. Eliminates branding bias.

Diverse user base. Users worldwide; diverse use cases. More representative than internal lab evaluations.

Multiple categories. Arena maintains separate rankings for general chat, coding, hard prompts, longer responses, etc.

Continuous updates. New models added regularly; old models reevaluated. Living evaluation rather than static.

Statistical methodology. Ratings are fit with a Bradley-Terry model and reported on an Elo-style scale (early versions used online Elo updates); confidence intervals are estimated by bootstrapping the comparisons.
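
A Bradley-Terry fit with bootstrap intervals can be sketched as follows. This is an illustrative implementation, not Arena's own code: it uses the classic minorization-maximization update, drops ties, and uses an assumed 200 bootstrap resamples. All function names are hypothetical.

```python
import random
from collections import Counter


def fit_bradley_terry(wins, iters=200):
    """wins: list of (winner, loser) pairs. Returns model -> strength (sums to 1)."""
    models = sorted({m for pair in wins for m in pair})
    win_count = Counter(w for w, _ in wins)
    pair_count = Counter(frozenset(pair) for pair in wins)
    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            # MM update: wins_i / sum_j n_ij / (p_i + p_j)
            denom = sum(
                pair_count[frozenset((i, j))] / (strength[i] + strength[j])
                for j in models
                if j != i and pair_count[frozenset((i, j))] > 0
            )
            updated[i] = win_count[i] / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        strength = {m: v / total for m, v in updated.items()}
    return strength


def bootstrap_interval(wins, model, n_boot=200, alpha=0.05, seed=0):
    """Percentile interval for one model's strength, resampling comparisons."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_boot):
        resample = [rng.choice(wins) for _ in wins]
        # A model absent from a resample is recorded as strength 0.0.
        samples.append(fit_bradley_terry(resample, iters=50).get(model, 0.0))
    samples.sort()
    lo = samples[int((alpha / 2) * n_boot)]
    hi = samples[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Taking 400 * log10(strength) maps the fitted strengths onto an Elo-like scale, up to an additive constant.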

The reception. Frontier-lab model releases prominently report Arena rankings. Investment decisions reference Arena standing. The Arena has become industry-standard infrastructure.

The criticisms.

User-base bias. Arena users are not representative of all LLM users. Probably skews toward developers, technically-sophisticated users, English speakers.

Conversation-style focus. Arena measures conversational preference; not necessarily capability on specific tasks.

Gaming concerns. Model providers may A/B test against Arena before releasing, which risks overfitting to Arena-style preferences. Several incidents in 2024-2025 raised concerns that specific labs were optimizing for Arena ranking, including through private pre-release testing and leaked test prompts.

Length bias. Longer responses sometimes preferred even when shorter would be better. Arena’s response-length-normalization is partial.

The 2026 status. Chatbot Arena remains highly influential; criticisms are widely acknowledged but the platform persists as the dominant human-preference benchmark. Multiple successor and parallel platforms (Arena Hard, specialized arenas) have emerged.

Specialized arenas and pairwise evaluations

Beyond LMSYS, specialized pairwise-evaluation platforms have emerged.

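Coding arenas. Coding-specific pairwise evaluation platforms. Arena Hard, an LLM-judged benchmark built from difficult Arena prompts (§7), is a related derivative rather than a human-voted arena.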

Image-generation arenas. Compare text-to-image outputs from different models. Multiple platforms (LMSYS Image, others).

Specialized-domain arenas. Medical, legal, scientific - domain-expert evaluation platforms.

Lab-internal arenas. Frontier labs maintain internal arena-style infrastructure for development.

The pattern. The arena format (pairwise comparison with Elo aggregation) has been successfully adapted to many domains. The methodology is now established.

Expert review for specialized domains

A different mode of human evaluation. For specialized domains (medical, legal, scientific), expert evaluators are needed.

The differences from generalist human evaluation.

Expertise required. Evaluators must have domain knowledge. A medical-information evaluator must understand medicine.

Smaller pools. Few qualified experts; evaluation is slower and more expensive.

Higher per-evaluation value. Expert evaluation produces higher-signal data; each evaluation is more informative.

More structured protocols. Expert evaluation often uses rubrics - explicit criteria with scoring guidelines - to ensure consistency.

Notable instances.

Medical evaluation of medical-information LLMs (e.g., Google’s Med-PaLM evaluations). Physician evaluators assess outputs against clinical-knowledge standards.

Legal evaluation of legal AI (Harvey AI, others). Lawyer evaluators assess outputs.

Scientific evaluation of scientific-research AI (cross-reference AI for Science §11). Domain scientists evaluate outputs.

Coding evaluation by experienced developers for sophisticated code-generation tasks.

The 2026 picture. Expert review is standard practice for high-stakes specialized AI. Substantial cost; substantial value; widely deployed at frontier labs and at specialized AI companies.

Annotation pipeline engineering

A specific operational concern. Production human evaluation requires substantial engineering - annotation interfaces, quality controls, payment systems, training, etc.

The components.

Annotation interface. UI for evaluators. Must be efficient (evaluators do many per hour), clear (criteria unambiguous), and bias-resistant (e.g., randomize option order).

Evaluator selection. Annotation work platforms (Surge AI, Scale AI, Anthropic’s annotation infrastructure, others). Recruiting evaluators with appropriate skills; filtering for quality.

Training. Evaluators need training on the specific evaluation task; criteria; rubrics. Pre-evaluation testing (qualification quizzes).

Quality control. Honeypot questions (known-answer items mixed in to test attention). Inter-annotator agreement checks. Periodic review of evaluator work.

Payment. Fair compensation; ethical labour practices. Real concerns about exploitation in annotation work (Time and other reporting); industry response is uneven.

Pipeline orchestration. Distribution of evaluation tasks to workers; aggregation of results; quality dashboards.
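
The honeypot check described under quality control can be sketched as a small filter. This is a minimal illustration, assuming labels arrive as (annotator, item, label) tuples and using an assumed 80% accuracy threshold; both function names are hypothetical.

```python
from collections import defaultdict


def honeypot_accuracy(labels, gold):
    """labels: list of (annotator_id, item_id, label).
    gold: dict item_id -> known correct label (the honeypot items).
    Returns annotator_id -> accuracy on honeypots; annotators who saw none are omitted.
    """
    correct = defaultdict(int)
    seen = defaultdict(int)
    for annotator, item, label in labels:
        if item in gold:
            seen[annotator] += 1
            correct[annotator] += int(label == gold[item])
    return {a: correct[a] / seen[a] for a in seen}


def filter_labels(labels, gold, min_accuracy=0.8):
    """Keep non-honeypot labels from annotators who pass the (assumed) threshold."""
    accuracy = honeypot_accuracy(labels, gold)
    trusted = {a for a, acc in accuracy.items() if acc >= min_accuracy}
    return [(a, i, l) for a, i, l in labels if a in trusted and i not in gold]
```

Annotators who never saw a honeypot are excluded here by design; a production pipeline would more likely hold their work for review than discard it.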

The 2026 industry. Several specialized companies (Scale AI, Surge AI, Centaur Labs, others) provide annotation-as-a-service. Frontier labs typically combine in-house annotation teams with external partners.

Inter-annotator agreement and reliability

A specific methodological concern. Inter-annotator agreement (IAA) measures how consistently different evaluators reach the same judgments.

The metrics.

Cohen’s kappa. Agreement between two annotators on categorical judgments, corrected for chance agreement.

Fleiss’ kappa. Extension to multiple annotators.

Krippendorff’s alpha. General-purpose agreement measure; handles different data types.

Pearson/Spearman correlation. For continuous-scale ratings (Pearson) or ordinal ratings (Spearman).
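
As a concrete illustration of chance correction, Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected if the two annotators labelled independently at their own base rates. A minimal sketch (the function name is illustrative):

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' categorical labels on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: product of the annotators' marginal label rates.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

For example, two annotators who agree on 3 of 4 items, with chance agreement of 0.5, get kappa = 0.5 rather than the raw 0.75.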

The interpretation. Low IAA usually indicates that the evaluation task is subjective, that the rubric is ambiguous, or that evaluators are insufficiently trained. Different annotators reaching different conclusions suggests the judgment is hard to make consistently.

The 2026 practice. Production human evaluation should report IAA; many published evaluations don’t. Best-practice evaluation includes IAA as a basic methodological check.

The implications. Low-IAA tasks call for one or more of the following:

  • Improving the rubric (clearer criteria, examples).

  • Improving evaluator training.

  • Aggregating more evaluations per item (averaging out individual disagreement).

  • Accepting that the task is genuinely subjective and reporting accordingly.

When human evaluation is and isn’t appropriate

A practical decision question.

Use human evaluation when:

  • The judgment is subjective (quality, preference, style).

  • The task is open-ended (no clear right answer).

  • The evaluation must align with end-user values.

  • Novel failure modes need discovery.

  • High-stakes decisions depend on the evaluation.

Don’t use human evaluation when:

  • Reference-based metrics work well (closed-ended tasks with clear answers).

  • Speed and scale are critical (humans are slow).

  • Cost is prohibitive.

  • The task requires specialized expertise not feasibly available.

  • Reproducibility across time is essential (human evaluators change).

The 2026 pattern. Hybrid evaluation is most common: automated metrics for breadth and speed; LLM-as-judge (§7) for scalable qualitative judgment; human evaluation for the highest-stakes or most-subjective decisions.

Where human evaluation sits in 2026

The summary. Human evaluation remains essential despite massive growth in automated alternatives. Pairwise preference (Chatbot Arena) dominates LLM quality measurement. Expert review handles specialized domains. Annotation pipeline engineering has matured into a substantial industry.

The remaining issues. Cost and scalability constraints. Annotator labour practices. Inter-annotator agreement on subjective tasks. The gap between annotator preferences and broader-population preferences. The integration with LLM-as-judge for hybrid pipelines.

The next section develops LLM-as-judge - the scalable alternative that has emerged as a partial substitute for human evaluation.


§7. LLM-as-Judge

A scaling response to human-evaluation cost. LLM-as-judge uses an LLM to evaluate another LLM’s (or agent’s) outputs, providing cheap and fast scoring at the cost of inheriting the judge LLM’s biases. This section develops the mechanism, the known biases, the validation methodology, and the production patterns.

The mechanism

The basic recipe. Given a task, an output to evaluate, and evaluation criteria, prompt an LLM to score the output.

   LLM-AS-JUDGE BASIC PATTERN

   Inputs to the judge LLM:
     - Task description
     - Output to evaluate (or two outputs for pairwise)
     - Evaluation rubric or criteria
     - Optional: reference answer

   Judge LLM prompt (illustrative):
     "Evaluate the following response on the given task.
      Score from 1-5 on the criteria below.
      [task description]
      [response]
      [criteria]
      Provide a score and brief explanation."

   Judge LLM output:
     - Score (numeric or category)
     - Explanation / reasoning

   Aggregation:
     - Average scores across many evaluations
     - Identify systematic patterns from explanations
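
The basic pattern can be sketched in Python. This is a minimal illustration in which `call_judge` is a hypothetical caller-supplied function that sends a prompt to whichever judge model is in use and returns its text; the prompt wording and the "Score:" output convention are assumptions, not a standard.

```python
import re
from statistics import mean

JUDGE_TEMPLATE = """Evaluate the following response on the given task.
Score from 1-5 on the criteria below.

Task: {task}
Response: {response}
Criteria: {criteria}

Give a brief explanation, then end with a line of the form "Score: <1-5>"."""


def parse_score(judge_output):
    """Return the integer after the last 'Score:' marker, or None if absent or out of range."""
    matches = re.findall(r"Score:\s*([1-5])\b", judge_output)
    return int(matches[-1]) if matches else None


def judge_outputs(call_judge, task, responses, criteria):
    """Score each response with the judge and aggregate.

    call_judge: hypothetical callable, prompt string -> judge's text output.
    """
    scores = []
    for response in responses:
        prompt = JUDGE_TEMPLATE.format(task=task, response=response, criteria=criteria)
        score = parse_score(call_judge(prompt))
        if score is not None:
            scores.append(score)
    return {
        "mean_score": mean(scores) if scores else None,
        "n_scored": len(scores),
        "n_unparsed": len(responses) - len(scores),
    }
```

Tracking unparsed outputs separately matters in practice: silently dropping them can bias the average if the judge fails to produce a score more often on certain kinds of response.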

The variants.

Absolute scoring. Judge produces a numeric score (e.g., 1-5 on each criterion).

Pairwise judgment. Judge compares two outputs; picks the better. The LLM-as-judge analogue of human pairwise preference (§6).

Chain-of-thought judgment. Judge produces reasoning before the final score. Often more reliable.

Rubric-based. Judge scores against an explicit rubric (multiple criteria; specific scoring guidelines).

Reference-based vs reference-free. Judge may or may not have access to a reference answer.

The appeal

Why LLM-as-judge has become widespread.

Scale. Massive evaluation throughput. A million evaluations per day is feasible at moderate cost; impossible for human evaluation at comparable budget.

Speed. Evaluations complete in seconds, not days.

Cost. Roughly 100-1000x cheaper than human evaluation per item (depending on the judge LLM used).

Consistency. A given judge LLM with a given prompt produces near-deterministic outputs (strictly so only with a pinned model version and greedy decoding); different evaluation rounds are comparable.

Coverage. Can evaluate dimensions that human evaluation may miss (e.g., factual correctness in domains where human annotators lack expertise; structural code quality across many criteria).

The combined effect. LLM-as-judge enables evaluation at scales that human evaluation cannot match. This has changed evaluation practice substantially in 2023-2026.

Known biases of LLM judges

LLM judges have characteristic biases. Understanding them is essential for using LLM-as-judge well.

Position bias. When evaluating two outputs in pairwise comparison, the judge often shows preference for the position (first or second) over the content. Wang et al. (2023) “Large Language Models are not Fair Evaluators” documented this systematically.

Mitigation: randomize order; evaluate both orders and average.

Verbosity bias. Judges often prefer longer outputs even when shorter would be better. Length serves as a heuristic for thoroughness that the judge over-weights.

Mitigation: length-normalize; explicitly instruct the judge not to weight length; use rubrics that don’t reward length per se.

Self-preference. Judges often prefer outputs from their own family of models over equivalent outputs from other model families. Panickssery, Bowman, Feng (2024) “LLM Evaluators Recognize and Favor Their Own Generations” documented this - a judge such as GPT-4 systematically scores its own generations higher than comparable outputs from other models.

Mitigation: use judges from a different model family than the models being evaluated. Use multiple judges from different families; aggregate.

Authority bias. Judges may favour outputs that sound authoritative (formal tone, technical vocabulary, citations) even when the content is wrong.

Mitigation: rubrics emphasizing correctness over presentation. Use of reference answers where possible.

Sycophancy. Judges may produce evaluations they think the evaluator wants - e.g., if the prompt suggests one model is better, the judge agrees.

Mitigation: blind the judge to which model produced which output. Avoid leading prompts.

Format bias. Judges may favour outputs in formats they’ve seen most often during training (bullet points, headers, hedging language).

Mitigation: rubrics explicit about format-vs-content separation.

Anchoring. Judges may anchor on the first item evaluated; subsequent items are then scored relative to it.

Mitigation: randomize order; evaluate items independently.

The combined effect. LLM-as-judge evaluations systematically deviate from human evaluations in characteristic ways. Without correction, LLM-as-judge can produce misleading results - particularly when comparing models that differ in style, length, or family.
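
The position-bias mitigation (evaluate both orders) can be sketched as follows. This is a minimal illustration in which `judge_pair` is a hypothetical callable that returns "first", "second", or "tie"; treating order-dependent verdicts as a tie is one simple way of averaging the two orders, not the only one.

```python
def debiased_pairwise(judge_pair, prompt, out_a, out_b):
    """Judge A vs B in both presentation orders; return 'A', 'B', or 'tie'.

    judge_pair(prompt, first, second) is a hypothetical judge wrapper
    returning 'first', 'second', or 'tie'.
    """
    verdict_ab = judge_pair(prompt, out_a, out_b)  # A shown first
    verdict_ba = judge_pair(prompt, out_b, out_a)  # B shown first
    winner_ab = {"first": "A", "second": "B", "tie": "tie"}[verdict_ab]
    winner_ba = {"first": "B", "second": "A", "tie": "tie"}[verdict_ba]
    # A verdict that flips with the order is treated as a tie.
    return winner_ab if winner_ab == winner_ba else "tie"


def position_driven_rate(judge_pair, items):
    """Fraction of (prompt, out_a, out_b) items where the judge picks the
    same position in both orders, i.e. the verdict follows position."""
    driven = 0
    for prompt, out_a, out_b in items:
        v1 = judge_pair(prompt, out_a, out_b)
        v2 = judge_pair(prompt, out_b, out_a)
        if v1 == v2 and v1 != "tie":
            driven += 1
    return driven / len(items)
```

Reporting the position-driven rate alongside results gives readers a direct measure of how much a given judge's verdicts depend on ordering.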

When LLM-as-judge agrees with humans (and when it doesn’t)

A specific empirical question. How well does LLM-as-judge approximate human evaluation?

The general empirical pattern.

High agreement on clear-cut tasks. Factual correctness, basic reasoning, format compliance - LLM judges agree with humans at ~85-95% rates.

Moderate agreement on subjective tasks. Quality, helpfulness, creativity - LLM judges agree with humans at ~70-80% rates. Substantial disagreement; relative rankings sometimes match, sometimes don’t.

Low agreement on hard-to-judge tasks. Nuanced ethical judgment, subtle quality differences, expert-domain quality - LLM judges may agree with humans at 50-65% (sometimes barely above chance for binary comparisons).

The validation methodology. Best practice: empirically validate LLM-as-judge agreement with human evaluation on a sample of items before relying on LLM-as-judge for the full evaluation.

   LLM-AS-JUDGE VALIDATION RECIPE

   1. Sample 100-500 evaluation items.
   2. Collect human judgments (gold standard).
   3. Run LLM-as-judge on the same items.
   4. Compute agreement (Cohen's kappa, correlation, etc.).
   5. Analyze disagreements:
        - Are they random noise?
        - Are they systematic (e.g., LLM always favours length)?
        - Are the LLM judgments occasionally "better" than human?
   6. Decision:
        - If agreement is high: use LLM-as-judge for full evaluation.
        - If agreement is low: stick with human evaluation OR
          refine the LLM-as-judge prompt and re-validate.
   7. Periodic re-validation:
        - Agreement may drift as judge LLMs update.
        - Re-validate at major LLM version changes.
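
Steps 4-6 of the recipe can be sketched for the pairwise case. This is a minimal illustration, assuming each record holds the human verdict, the judge verdict, and the two output lengths; the 0.8 agreement threshold is an arbitrary assumption, and raw agreement is used for brevity where a chance-corrected statistic such as Cohen's kappa would normally be reported as well.

```python
def validate_judge(records, min_agreement=0.8):
    """records: list of dicts with keys 'human' and 'judge' (each 'A' or 'B')
    and 'len_a', 'len_b' (output lengths). Returns a small summary dict."""
    n = len(records)
    agreement = sum(r["human"] == r["judge"] for r in records) / n
    disagreements = [r for r in records if r["human"] != r["judge"]]
    # Systematic-bias probe: in disagreements, how often did the judge
    # side with the longer output?
    usable = [r for r in disagreements if r["len_a"] != r["len_b"]]
    picked_longer = sum(
        (r["len_a"] > r["len_b"]) == (r["judge"] == "A") for r in usable
    )
    return {
        "agreement": agreement,
        "n_disagreements": len(disagreements),
        "judge_picked_longer_share": picked_longer / len(usable) if usable else None,
        "use_judge": agreement >= min_agreement,
    }
```

A longer-share near 0.5 in the disagreements is consistent with random noise; a share near 1.0 points to verbosity bias that prompt refinement might reduce.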

The 2026 best practice. Serious LLM-as-judge use includes validation; lighter-weight use often skips it, so the share of reported LLM-as-judge results backed by validation against human judgments remains uneven.

Notable LLM-as-judge work

Some specific systems and methodologies.

MT-Bench (Zheng et al., 2023). Multi-turn conversation evaluation using LLM-as-judge (specifically GPT-4 as judge). Demonstrated the methodology at scale; substantial uptake.

G-Eval (Liu et al., 2023). LLM-as-judge with chain-of-thought reasoning for quality evaluation. Shows that chain-of-thought judges are more reliable than direct-scoring judges.

Prometheus (Kim et al., 2024). Open-source LLM specifically trained as an evaluator. Designed to reduce reliance on closed-source models like GPT-4 as judges.

Chatbot Arena’s LLM-judge variant. Arena Hard uses LLM-as-judge for evaluation at scale; results correlate with human-judge Arena rankings.

Per-criterion rubrics. Production LLM-as-judge increasingly uses structured rubrics with explicit per-criterion scoring rather than overall judgments.

Hybrid pipelines

The dominant production pattern. Hybrid pipelines combine automated benchmarks, LLM-as-judge, and human evaluation.

A typical pipeline:

   HYBRID EVALUATION PIPELINE

   Stage 1: AUTOMATED BENCHMARKS
     Standard benchmark suites (MMLU, GPQA, HumanEval, etc.).
     Reference-based scoring.
     Cheap; broad coverage; first-pass measurement.

   Stage 2: LLM-AS-JUDGE
     For each output, evaluate against rubrics using a strong
     judge LLM.
     Provides qualitative dimensions automated benchmarks miss.

   Stage 3: HUMAN EVALUATION
     Pairwise comparisons on a subset of items.
     Validates LLM-as-judge.
     Provides ground truth for highest-stakes decisions.

   Aggregation: Each stage provides different signal; results
   combined for the final evaluation.

The principle. Each stage has different strengths and weaknesses; combining them produces more comprehensive evaluation than any stage alone.

The 2026 production picture. Frontier-AI labs typically run all three stages for major model releases. Less serious evaluations may use only one or two stages.

LLM-as-judge for safety evaluations

A specific application worth flagging. Safety evaluations (jailbreak success rates, harmful-content detection, etc.) often use LLM-as-judge for scalability.

The methodology. Train or prompt an LLM to detect specific safety-relevant categories. Apply at scale to model outputs.

Examples.

  • Detoxify and similar classifiers for toxicity detection.

  • GPT-4-based jailbreak detectors for analyzing red-team results.

  • Hallucination detectors for factual-accuracy evaluation.

The trade-offs. LLM-as-judge for safety has the standard biases (potentially favouring outputs from same-family models) plus the additional concern that safety judgment is itself fraught. Different judges may disagree on what counts as a jailbreak; what counts as harmful content; what counts as a hallucination.

Best practice. Combine LLM-as-judge safety evaluation with human review of high-stakes findings; calibrate carefully against human gold standards.

Where LLM-as-judge sits in 2026

The summary. LLM-as-judge is widely used and partially trusted. It enables evaluation scales human evaluation cannot match; it has known biases that require methodology to mitigate; it works best when validated against human judgment and used in hybrid pipelines.

The remaining issues. Biases (especially self-preference) require active mitigation. Validation is uneven across production deployments. Trust in LLM-as-judge results varies; high-stakes decisions still warrant human evaluation backstops.

The trajectory. As judge LLMs improve, LLM-as-judge reliability improves. The frontier of judge-LLM capability is close to (but not at) the frontier of evaluated-LLM capability. Whether this stays true as both advance is open.


§8. Contamination and Data Hygiene

The most-consequential methodological concern in modern AI evaluation. Benchmark contamination is when evaluation data appears in training data; reported scores conflate memorization with capability. This section develops the contamination problem, detection methods, and mitigation strategies.

The contamination problem

The basic concern. Modern frontier-LLM training data includes essentially all publicly-available text. Many evaluation benchmarks are publicly available. Benchmark items may end up in training data - sometimes verbatim, sometimes paraphrased, sometimes translated. When a model encounters a “test” item during evaluation, it may have seen this item before during training. Performance reflects memorization rather than generalization.

The 2023 inflection. Multiple investigations through 2022-2023 systematically documented contamination across major benchmarks. The community recognition forced methodological changes.

The mechanisms.

Direct overlap. The exact benchmark items appear in training data (web pages quoting the benchmark; benchmark papers themselves; benchmark answer keys).

Translation overlap. Benchmark items in one language appear translated in another; multilingual training picks up the translations.

Paraphrase overlap. Substantively similar items appear in training data; the model has seen the concept if not the exact text.

Derived overlap. Solutions or explanations to benchmark items appear in training data; the model knows the answer without seeing the question.

Released-dataset overlap. Benchmark designers release datasets publicly (often including test splits); later models train on the released data.

Implicit overlap from solutions. Tutorial sites, blog posts, and educational content discuss benchmark problems; the model learns the answers without seeing the formal benchmark.

The result. Reported benchmark scores can be substantially inflated. The inflation is unequal across models - models trained more recently (after benchmark release) are typically more contaminated than older models.

A worked example: MMLU contamination

A specific concrete case. MMLU (Hendrycks et al., 2021) was widely used as the dominant LLM knowledge benchmark through 2024. Multiple investigations (Sainz et al. 2023; Schaeffer 2023; Kiela 2023; others) showed substantial contamination.

The findings.

  • Direct extraction. Major frontier models could often recite MMLU questions verbatim when prompted appropriately. The benchmark items were in their training data.

  • Performance drop on paraphrases. When MMLU questions were paraphrased, model accuracy dropped substantially. The drop suggests the original-question performance was partially memorization-driven.

  • Released-answer-key. MMLU answer keys are publicly available; models trained on web data have seen them.

The implication. MMLU scores for major frontier models partially reflect memorization. The headline scores (e.g., GPT-4’s 86.4%) include some unknown memorization premium.

The community response. Subsequent benchmarks (GPQA Diamond, FrontierMath, Humanity’s Last Exam) were explicitly designed to be contamination-resistant, using some combination of questions not available publicly, regularly added new questions, and private held-out test sets.

Memorization vs generalization in benchmarks

A specific distinction. Memorization means the model has stored the specific item or answer in its weights; performance comes from retrieval. Generalization means the model has acquired underlying capability; performance comes from applying capability to novel items.

The two are not always cleanly separable.

Pure memorization. The model has stored “the answer to question Q is X” in some form. Asking Q produces X; asking a paraphrase of Q produces X if the paraphrase is close enough; asking unrelated questions doesn’t help.

Pure generalization. The model has learned the capability needed to answer questions of this type. Performance on novel-but-related questions is similar to performance on training questions.

Mixed. The model has both memorized specific items and acquired some capability. Performance partially reflects memorization, partially capability.

In practice, most observed performance is mixed. The relevant question is how much memorization, how much capability - and answering this requires careful methodology.

Detection: per-example, per-benchmark, per-model

The methodological problem. How do we detect contamination? Three levels of analysis.

Per-example detection. For a specific benchmark item, does it appear in training data?

  • String search. Search the training corpus for verbatim matches to the benchmark question.

  • Fuzzy matching. Search for near-matches (paraphrases, partial overlaps).

  • Hash-based detection. Compute hashes of benchmark items and check against indexed training data hashes.
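
The string-search and hash-based checks can be combined in a small n-gram overlap sketch. This is a minimal illustration that assumes access to a corpus (or a public proxy for one); the 8-token window and the normalization rule are arbitrary choices, and all function names are hypothetical.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and split on non-alphanumerics so trivial formatting differences don't matter."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def ngram_hashes(text, n=8):
    """Set of SHA-1 hashes of every n-token window in the text."""
    tokens = normalize(text)
    return {
        hashlib.sha1(" ".join(tokens[i:i + n]).encode("utf-8")).hexdigest()
        for i in range(len(tokens) - n + 1)
    }


def build_index(corpus_docs, n=8):
    """Hash index over a training corpus (or a public proxy corpus)."""
    index = set()
    for doc in corpus_docs:
        index |= ngram_hashes(doc, n)
    return index


def overlap_fraction(item, index, n=8):
    """Share of the item's n-grams found in the index (0.0 if the item is shorter than n tokens)."""
    grams = ngram_hashes(item, n)
    if not grams:
        return 0.0
    return len(grams & index) / len(grams)
```

A high overlap fraction flags likely verbatim presence; this approach does not catch paraphrased or translated copies, which need fuzzy or semantic matching.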

The challenge. Training corpora for frontier models are typically not public. Detection requires lab cooperation or access to similar publicly-available corpora as a proxy.

Per-benchmark detection. Does the benchmark as a whole appear in training data?

  • Aggregate analysis. If many items appear in training data, the benchmark is broadly contaminated; specific items can be identified.

  • Source tracking. Was the benchmark dataset itself released? Where? When?

  • Discussion overlap. Are benchmark items discussed in tutorial sites, blog posts, papers that appear in training data?

Per-model detection. For a specific model, does its training data include benchmark content?

  • Behavioural test. Can the model recite benchmark items verbatim when prompted? Substantial verbatim recitation suggests training-data presence.

  • Paraphrase comparison. Does the model perform substantially worse on paraphrases vs canonical questions?

  • Likelihood comparison. Does the model assign much higher likelihood to canonical-format benchmark items than to neutral text on the same topic?
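
The paraphrase comparison can be quantified as an accuracy gap with a standard error. This is a minimal sketch, assuming each item has been scored 0/1 in both its canonical and paraphrased form; the paired standard error is one reasonable choice among several, and the function name is illustrative.

```python
from math import sqrt


def paraphrase_gap(canonical_correct, paraphrase_correct):
    """Both arguments are equal-length lists of 0/1 outcomes for the same items.

    Returns (gap, standard_error), where gap is canonical accuracy minus
    paraphrase accuracy and the standard error uses per-item paired differences.
    """
    n = len(canonical_correct)
    if n != len(paraphrase_correct) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    diffs = [c - p for c, p in zip(canonical_correct, paraphrase_correct)]
    gap = sum(diffs) / n
    variance = sum((d - gap) ** 2 for d in diffs) / (n - 1)
    return gap, sqrt(variance / n)
```

A gap several standard errors above zero is consistent with a memorization premium, though it can also reflect paraphrases that are simply harder or less clear than the originals.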

Carlini et al. methods. Carlini et al. (multiple papers, 2021-2024) developed systematic data extraction attacks on language models, recovering training-data content from trained models. The methodology applies directly to contamination detection.

Mitigation strategies

How to evaluate when contamination is a concern.

Held-out sets that have never been released. Best-case: benchmark designers keep test items private; release only training and validation splits. Frontier-lab internal benchmarks often use this pattern; public benchmarks rarely can.

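Date-based held-out evaluation. Use only benchmark items released after the model’s training cutoff. If the model couldn’t have seen the items, direct contamination is ruled out - provided the reported cutoff is accurate and the items are not reworkings of older public material.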

The practical implementation. Many benchmarks now include date-based subsets; LiveCodeBench (Jain et al., 2024) is built around this principle - continuously adding new problems collected after fixed cutoffs.
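
A date-based split can be sketched as follows. This is a minimal illustration, assuming each item records a creation date; the 30-day buffer after the cutoff (to allow for imprecise cutoff reporting) is an arbitrary assumption, and the function name is hypothetical.

```python
from datetime import date, timedelta


def split_accuracy(items, correct_ids, training_cutoff, buffer_days=30):
    """items: list of dicts with 'id' and 'created' (a datetime.date).
    correct_ids: set of item ids the model answered correctly.
    Returns accuracy on pre-cutoff and post-cutoff items; items inside the
    buffer window after the cutoff are excluded from both groups.
    """
    threshold = training_cutoff + timedelta(days=buffer_days)
    pre = [it for it in items if it["created"] <= training_cutoff]
    post = [it for it in items if it["created"] > threshold]

    def accuracy(group):
        if not group:
            return None
        return sum(it["id"] in correct_ids for it in group) / len(group)

    return {"pre_cutoff": accuracy(pre), "post_cutoff": accuracy(post)}
```

A large drop from pre-cutoff to post-cutoff accuracy is suggestive of contamination, with the caveat that newer items may also differ in difficulty.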

Contamination-resistant benchmark design. Design benchmarks with contamination in mind from the start.

  • Use newly-created content not available before benchmark release.

  • Don’t release answer keys publicly until the benchmark has been used.

  • Use procedurally-generated items where each item is unique.

  • Use expert-written content unlikely to appear elsewhere.

GPQA Diamond, FrontierMath, and Humanity’s Last Exam each use elements of this approach.

Contamination audits. Explicitly check whether benchmark items appear in training data. Report contamination findings alongside benchmark scores.

Paraphrase robustness checks. Compare canonical-question performance vs paraphrased performance. Large gaps suggest memorization premium.

Multiple-version benchmarks. Maintain multiple “versions” of a benchmark; release one version, evaluate, then release the next. Evaluating on a not-yet-released version protects against contamination.

Adversarial perturbations. Make small, meaning-preserving changes to benchmark items (renamed entities, changed numbers, reordered options); if the model fails on perturbed versions, the original-version performance was likely memorization-driven.

The benchmark-rotation problem

A specific complication. Even with best-practice mitigation, time erodes benchmark cleanliness. A benchmark released today as contamination-resistant may be contaminated in next year’s training data.

The pattern.

  1. Benchmark released; initially contamination-resistant.

  2. Researchers use the benchmark; results are reported; tutorial sites discuss it.

  3. Web content discussing the benchmark proliferates.

  4. Training data for next-generation models includes the proliferated content.

  5. The benchmark is now contaminated.

The result. Benchmarks have limited useful lifespans even with contamination-resistant design. Continuous benchmark rotation is essential.

The implications.

Benchmark designers must plan for rotation. Design with the expectation that the benchmark will be contaminated within 1-3 years.

Living benchmarks have advantages. Continuously-updated benchmarks (LiveCodeBench, Chatbot Arena) avoid the rotation problem by continuously adding fresh content.

Multiple parallel benchmarks per capability. No single benchmark suffices over time; cycle through multiple.

Date-cutoff evaluation by default. Best practice: only evaluate on items from after the model’s training cutoff. Requires benchmark designers to track item creation dates and labs to track training cutoffs.

A specific recent example: GSM8K and the “GSM-Symbolic” investigation

A concrete recent investigation. GSM8K (Cobbe et al., 2021) was the dominant grade-school math benchmark of 2022-2024. By 2024, frontier models scored 95%+; the benchmark appeared substantially solved.

Mirzadeh et al. (Apple, 2024) “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in LLMs.” Created variants of GSM8K problems with small surface-level changes (changing numbers; rephrasing wording; adding irrelevant clauses). Frontier-model performance dropped substantially on the perturbed versions.

The interpretation. GSM8K performance was partially driven by memorization or surface-pattern-matching rather than robust mathematical reasoning. The benchmark substantially overstated frontier-model mathematical capability.

The implication. Even widely-used, apparently-saturated benchmarks may not reflect what they appear to measure. Robust evaluation requires perturbation testing.
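
Perturbation testing of this kind can be sketched with a symbolic template. The template below is a made-up example for illustration, not one taken from GSM-Symbolic; names and numbers are resampled while the ground-truth answer is recomputed from the same formula.

```python
import random

# Hypothetical template; placeholders are resampled for each variant.
TEMPLATE = (
    "{name} has {a} apples and buys {b} bags with {c} apples in each bag. "
    "How many apples does {name} have now?"
)
NAMES = ["Ava", "Liam", "Noor", "Mateo"]


def make_variant(rng):
    """Return one (question, answer) pair with freshly sampled surface details."""
    name = rng.choice(NAMES)
    a, b, c = rng.randint(2, 20), rng.randint(2, 9), rng.randint(2, 12)
    question = TEMPLATE.format(name=name, a=a, b=b, c=c)
    return question, a + b * c  # answer follows the template's fixed structure


def make_variants(n, seed=0):
    """Generate n variants reproducibly from a seed."""
    rng = random.Random(seed)
    return [make_variant(rng) for _ in range(n)]
```

Because every variant needs the same reasoning steps, a model that reasons robustly should score about the same across variants; high variance or a drop relative to the original item points toward surface-pattern matching.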

Contamination and competitive pressure

A specific dynamic. Competitive pressure to report high benchmark scores creates an incentive not to audit contamination aggressively. A lab that discovers its model's high score is partly due to contamination bears a cost for reporting this honestly.

The result. Self-reported contamination audits are uneven. Some labs are diligent (Anthropic, DeepMind have published serious contamination analyses); others less so. Independent contamination audits (academics, AISIs) provide important counter-pressure but cover only some benchmarks and models.

The chapter’s position. Contamination is a real and substantial concern affecting many widely-reported benchmark scores. Healthy practice involves transparency about contamination rates; this is improving but unevenly.

Where contamination and data hygiene sit in 2026

The summary. Contamination is recognized as a serious concern across the AI-evaluation community. Best-practice methodology (held-out splits, date-based evaluation, contamination audits, paraphrase robustness checks) is established but unevenly applied. New benchmark designs increasingly account for contamination from the start.

The remaining issues. Many widely-reported scores are partially contaminated. Lab competitive pressure works against transparency. The benchmark-rotation problem persists. Verifying contamination across closed-data labs is hard.

The next section develops a related structural concern that interacts with contamination: benchmark saturation and Goodhart’s law.


§9. Benchmark Saturation and Goodhart

A specific structural concern complementing contamination (§8). Benchmark saturation is when capability exceeds what the benchmark can discriminate; Goodhart’s law is when the benchmark becomes the target and ceases to measure what it was meant to measure. The two problems are related; both affect the long-term informativeness of AI evaluation.

The saturation pattern

Recap from §4. Benchmarks have lifecycles: release at a hard-but-tractable level (~20-40% SOTA); rapid improvement (1-2 years); approaching saturation (1-2 more years); saturation (SOTA clustered at 95%+, benchmark uninformative). The pattern is empirically reliable across decades of AI benchmarks.

The mechanism. Each benchmark has a capability ceiling - the maximum achievable performance given perfect understanding of the benchmark’s content. As models approach the ceiling, performance differences shrink; the benchmark stops discriminating frontier models. Eventually, it provides no meaningful signal about who’s ahead.

The specific 2023-2026 saturation cases.

MMLU. Released 2021 at ~44% (GPT-3); reached ~90% by 2024-2025; approaching saturation by 2026.

HumanEval. Released 2021 at ~30% (Codex); reached 95%+ by 2024; saturated.

GSM8K. Released 2021 at ~50% (GPT-3 with reasoning); reached 95%+ by 2024; saturated.

MATH. Released 2021 at ~10% (GPT-3); reached 90%+ by mid-2025; approaching saturation.

GPQA Diamond. Released late 2023 at ~35%; reached 70%+ by mid-2025; mid-saturation.

SWE-bench Verified. Released August 2024 with top scores of ~33%; reached 80%+ by mid-2026; approaching saturation.

The pattern. Time-from-release to near-saturation is decreasing over time. Benchmarks released in 2021 took 3-4 years to saturate; benchmarks released in 2023-2024 are saturating in 1-2 years. The pace is accelerating.

Why saturation matters

The implications.

Benchmark scores stop discriminating. When all frontier models score 95%+, differences between models can’t be measured reliably with the benchmark.

Decisions can’t be based on the benchmark. Deployment decisions, research priorities, capability tracking - all need current informative measurements. Saturated benchmarks don’t provide these.

Reported scores become noise. Two models scoring 96.7% and 96.4% may not be meaningfully different; the score difference is within methodological noise.

Capability progress becomes harder to track. If we can’t measure improvements at the frontier, we don’t know how fast capabilities are advancing.

The continuous-refresh requirement. The community must continuously develop new, harder benchmarks. The infrastructure for this is mature but the pace strains it.
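A worked check of the noise claim above (96.7% vs 96.4%). The sketch assumes the two scores come from the 1,319-problem GSM8K test set and treats the two accuracies as independent binomial samples. This is a simplification, since both models answer the same items, and it ignores prompt and decoding variance, which typically add further noise.

```python
import math


def binomial_se(p, n):
    """Standard error of an accuracy p measured on n independent items."""
    return math.sqrt(p * (1 - p) / n)


def difference_z(p1, p2, n):
    """z-score of the gap between two accuracies, each measured on n items."""
    se = math.sqrt(binomial_se(p1, n) ** 2 + binomial_se(p2, n) ** 2)
    return (p1 - p2) / se


N_ITEMS = 1319  # size of the GSM8K test split
z = difference_z(0.967, 0.964, N_ITEMS)
half_width = 1.96 * binomial_se(0.967, N_ITEMS)  # 95% interval half-width
```

Here z is about 0.42, far below the conventional 1.96 threshold, and the 95% interval on a single score is roughly plus or minus one percentage point. A 0.3-point gap on a benchmark of this size is not evidence that one model is better.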

Goodhart’s law in benchmarks

A specific mechanism that makes benchmarks worse over time. Goodhart’s law: “When a measure becomes a target, it ceases to be a good measure.”

The application to AI benchmarks. As soon as a benchmark becomes important (research community uses it; labs target it; results are reported), it becomes a target for optimization. The optimization may produce high scores via mechanisms that don’t track the underlying capability.

The mechanisms of Goodhart in benchmarks.

Training-set inflation. Train on data resembling the benchmark; performance increases without genuine capability increase.

Surface-pattern matching. Learn the form of benchmark items (multiple-choice patterns, question styles) rather than the content the benchmark targets.

Cherry-picking and inference-time engineering. Use specific prompt formulations that maximize benchmark scores; production performance differs.

Direct contamination. §8 - benchmark items in training data.

Specification gaming. The benchmark measures one thing; models optimize for it in ways that don’t translate to general capability.

The result. Benchmark performance can substantially exceed genuine capability. The reported gap between benchmark-optimal models and “honest evaluation” models can be 5-15% on heavily-targeted benchmarks.

The Goodhart-contamination interaction

A specific dynamic. Contamination (§8) is one mechanism of Goodhart’s law in benchmarks. The two interact and compound.

The pattern.

  1. Benchmark becomes important; community uses it.

  2. Researchers fine-tune or develop methods targeting the benchmark.

  3. Training data includes benchmark items, related content, tutorial discussion.

  4. Models trained on this data score artificially high on the benchmark.

  5. Reported scores reflect an unknown mixture of targeting, contamination, and genuine capability.

  6. The benchmark’s signal-to-noise degrades.

The mitigations. Contamination-resistant design (§8) and Goodhart-resistant design together:

  • Don’t reveal answers publicly.

  • Add new items continuously.

  • Use date-cutoffs.

  • Avoid specific surface formats that can be targeted.

  • Use multiple benchmarks (so optimizing one doesn’t dominate).

  • Report not just headline scores but also paraphrase robustness and contamination audits.
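The date-cutoff mitigation in the list above can be sketched as a simple filter. The Item type, its field names, and the buffer_days parameter are hypothetical; the buffer reflects the assumption that a model's stated training cutoff may be approximate.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Item:
    item_id: str
    published: date  # when the item first became publicly available


def post_cutoff_items(items, training_cutoff, buffer_days=0):
    """Keep only items published strictly after the cutoff plus a buffer."""
    earliest_allowed = training_cutoff + timedelta(days=buffer_days)
    return [item for item in items if item.published > earliest_allowed]


# Hypothetical benchmark items and a hypothetical model cutoff.
ITEMS = [
    Item("old-1", date(2023, 11, 2)),
    Item("edge-1", date(2024, 6, 10)),
    Item("new-1", date(2024, 9, 15)),
]
CLEAN = post_cutoff_items(ITEMS, training_cutoff=date(2024, 6, 1), buffer_days=30)
```

With a 30-day buffer only new-1 survives. The filter guards against verbatim leakage of the items themselves; it does not address items that are reworkings of older public material.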

The role of harder benchmarks

The structural response. As benchmarks saturate, harder benchmarks are released. The progression:

   THE HARDER-BENCHMARK PROGRESSION (illustrative for math)

   2021: GSM8K (grade-school math).
       │ saturated 2024
       ▼
   2021: MATH (competition math: AMC, AIME-level).
       │ approaching saturation 2025
       ▼
   2024: AIME 2024, 2025 (live competition problems).
       │ partially saturated within a year
       ▼
   2024: FrontierMath (research-level problems).
       │ <2% on release; ~50% by mid-2026; not saturated
       ▼
   ???:  USAMO, Putnam, IMO at full difficulty.
         Research-mathematics-level problems beyond.

The pattern. Each new benchmark is harder than its predecessor; the field needs a ladder of benchmarks spanning capability levels.

The challenge. Designing genuinely-hard benchmarks is itself hard. Items must be:

  • Hard enough to provide multi-year discrimination.

  • Not too hard (must be solvable by some experts).

  • Verifiable (must have clear correct answers).

  • Diverse (cover varied capability dimensions).

  • Contamination-resistant (no public availability before benchmark release).

Producing such benchmarks requires substantial expert effort. FrontierMath, Humanity’s Last Exam, ARC-AGI-2, GPQA Diamond all involved expert curation at substantial cost.

Living benchmarks

A structural response that avoids both saturation and contamination. Living benchmarks continuously add new content; the benchmark itself evolves rather than being a fixed snapshot.

Examples.

LMSYS Chatbot Arena. Continuously collects new pairwise comparisons from users; new model comparisons appear regularly.

LiveCodeBench. Continuously collects new coding problems from competitive programming sites after fixed cutoffs.

METR Task Evaluations. Periodically refreshed with new tasks at increasing difficulty.

SciCode and similar continuous-collection benchmarks.

The advantages.

  • No fixed saturation point. As models improve, harder content can be added.

  • Contamination-resistant by construction. New content post-dates training cutoffs.

  • Capability tracking over time. The benchmark provides continuous progress measurement.

The challenges.

  • Reproducibility. Comparing “the same benchmark” across time is harder when content changes.

  • Infrastructure cost. Continuous collection requires ongoing investment.

  • Quality control. New content must maintain quality without drifting.

The 2026 trend. Living benchmarks are increasing in importance. The infrastructure exists; the methodological appeal is real. They are complementary to (not replacements for) fixed benchmarks; both have roles.

Multi-benchmark evaluation as a Goodhart-mitigation

A specific best practice. Multi-benchmark evaluation - reporting on many benchmarks rather than one - reduces Goodhart pressure.

The reasoning. If labs are evaluated on N benchmarks, optimizing for any single benchmark provides only 1/N of the evaluation signal. Optimizing for all benchmarks is harder than optimizing for one; the resulting model is more likely to have genuine general capability.
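The 1/N argument as arithmetic. A minimal sketch, assuming the aggregate is an unweighted mean over benchmarks scored on a 0-100 scale (real reports rarely publish a single aggregate, so this is an idealization). The benchmark names and scores are hypothetical.

```python
def macro_average(scores):
    """Unweighted mean over benchmarks (scores on a 0-100 scale)."""
    return sum(scores.values()) / len(scores)


def inflation_effect(scores, target, delta):
    """Change in the aggregate when one benchmark is inflated by delta points."""
    inflated = dict(scores)
    inflated[target] = min(100.0, scores[target] + delta)
    return macro_average(inflated) - macro_average(scores)


# Hypothetical suite: 20 benchmarks, all at 60 points.
SUITE = {f"bench_{i}": 60.0 for i in range(20)}
SHIFT = inflation_effect(SUITE, "bench_0", 10.0)
```

Inflating one of twenty benchmarks by 10 points moves the aggregate by 0.5 points. Gaming the aggregate therefore requires gaming many benchmarks at once, which is the sense in which multi-benchmark reporting reduces Goodhart pressure.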

The implementation. Modern model releases report scores across many benchmarks (often 20-40). The reporting includes:

  • Knowledge benchmarks (MMLU, MMLU-Pro, GPQA, ...).

  • Reasoning benchmarks (MATH, FrontierMath, ARC-AGI, ...).

  • Coding benchmarks (HumanEval, SWE-bench, LiveCodeBench, ...).

  • Multimodal benchmarks (MMMU, MathVista, ...).

  • Long-context benchmarks (RULER, LongBench, ...).

  • Agentic benchmarks (SWE-bench, GAIA, OSWorld, ...).

  • Behavioural benchmarks (TruthfulQA, MT-Bench, ...).

  • Multilingual benchmarks (Global MMLU, ...).

The result. Aggregate benchmark performance is harder to game than any single benchmark.

The remaining concern. Lab self-reporting may cherry-pick benchmarks where their model does well; weaker results may be omitted. Independent comparison tracking (Chatbot Arena, HELM, third-party evaluation platforms) provides counter-pressure.

The Pareto-frontier framing

A specific evaluation pattern that has matured in 2024-2026. Rather than ranking models by single benchmark scores, present the capability-cost Pareto frontier.

The setup. For each model:

  • X-axis: cost (per million tokens, per task, per hour).

  • Y-axis: capability (benchmark score, Arena rating).

Plot all evaluated models. The Pareto frontier is the set of non-dominated models: a model is dominated if some other model is at least as cheap and at least as capable, and strictly better on at least one of the two.
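A minimal sketch of the frontier computation. Model names, costs, and scores are hypothetical; the quadratic scan is fine for the few dozen models a typical report covers.

```python
def dominates(a, b):
    """True if a is no more costly and no less capable than b, and strictly
    better on at least one axis. Each model is (name, cost, capability)."""
    return (a[1] <= b[1] and a[2] >= b[2]) and (a[1] < b[1] or a[2] > b[2])


def pareto_frontier(models):
    """Return the non-dominated models, sorted by increasing cost."""
    frontier = [m for m in models if not any(dominates(o, m) for o in models)]
    return sorted(frontier, key=lambda m: m[1])


# Hypothetical (name, cost per million tokens, benchmark score) triples.
MODELS = [
    ("small", 0.5, 62.0),
    ("medium", 3.0, 74.0),
    ("medium-old", 4.0, 70.0),  # dominated by "medium"
    ("large", 15.0, 83.0),
    ("large-pricey", 30.0, 83.0),  # same score as "large" at higher cost
]
FRONTIER = [name for name, _, _ in pareto_frontier(MODELS)]
```

The frontier here is small, medium, large. Which frontier point is best depends on the deployment context; the computation only removes options that no context should prefer.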

The benefits.

  • Captures cost-quality trade-offs naturally.

  • Different deployment contexts have different optimal points on the frontier.

  • Pareto-dominated models are clearly less interesting.

The reception. Pareto-frontier reports are increasingly standard. Artificial Analysis (artificialanalysis.ai), LMSYS public reports, and others routinely present results this way.

Where saturation and Goodhart sit in 2026

The summary. Saturation and Goodhart are structural ongoing concerns for AI evaluation. The community responses (contamination-resistant design, multi-benchmark evaluation, living benchmarks, Pareto-frontier framing, continuous refresh) substantially mitigate but don’t eliminate the problems.

The remaining issues. The pace of capability growth strains benchmark development. The competitive pressure works against transparency. Standardization across labs is incomplete. The connection between benchmark scores and deployment value remains uncertain.

The trajectory. Evaluation methodology has substantially improved from 2020 baselines. Continued improvement is expected; the destination (reliable, contamination-resistant, Goodhart-resistant evaluation of frontier capability) is approached but not reached.


§10. Agentic and Long-Horizon Evaluation

A specific evaluation category with distinctive methodology. Agent evaluation raises challenges beyond non-agentic evaluation: multi-step execution, long-horizon tasks, environment dependency, cost considerations. This section develops the specific methodology, cross-referencing the more-detailed AI Agents §10.

The unique challenges of agent evaluation

Beyond what §3-§9 covered for non-agentic evaluation.

Multi-step execution. A single agentic task involves many LLM calls, tool calls, environment observations. Failure can happen at any step; unless the agent detects and recovers from errors, success requires every step to succeed. Evaluation must consider trajectories, not just final outputs.

Trajectory variability. Two runs of the same agent on the same task may produce very different trajectories. The agent may take different actions; the environment may respond differently; the final outcome may match or differ. Reproducibility across runs is hard.

Environment dependency. Agents operate in environments (web sites, file systems, APIs, simulated computers). The environment affects evaluation results; changes to the environment (websites updated, APIs deprecated) change evaluation outcomes over time.

Long-horizon variability. Tasks may take seconds (simple tool calls), minutes (typical), hours (complex), or days/weeks (frontier long-horizon). Evaluation methodology must scale across timescales.

Cost. Each agent execution costs LLM tokens, tool API calls, compute time. Comprehensive evaluation requires substantial budget.

Multi-dimensional success. Even when an agent completes a task, the manner matters - efficiency, cost, safety, reliability. Multi-dimensional reporting is essential.

Adversarial evaluation cost. Red-team evaluation of agents requires running adversarial scenarios; each one is a full agent execution.

These challenges combine to make agent evaluation substantially harder than non-agentic evaluation.
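The multi-step point can be made quantitative under a deliberately crude assumption: steps succeed independently with the same probability, and there is no error recovery. Real agents violate both assumptions, so the numbers are an intuition pump rather than a prediction.

```python
def task_success(step_success, n_steps):
    """Probability that all n_steps succeed, assuming independent steps and
    no error recovery."""
    return step_success ** n_steps


def required_step_reliability(target_task_success, n_steps):
    """Per-step reliability needed to hit a target task success rate under
    the same independence assumption."""
    return target_task_success ** (1.0 / n_steps)


P_50_STEPS = task_success(0.99, 50)
P_200_STEPS = task_success(0.99, 200)
NEEDED = required_step_reliability(0.90, 50)
```

A 99%-reliable step gives about 60% task success over 50 steps and about 13% over 200 steps; reaching 90% task success over 50 steps needs roughly 99.8% per-step reliability. This is why trajectory-level evaluation, and credit for recovery behaviour, matter more as horizons lengthen.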

Task-based agent benchmarks

The dominant evaluation pattern. Task-based benchmarks specify concrete tasks and measure agent success rates. Cross-reference AI Agents §10 for the comprehensive treatment; this section develops evaluation-methodology aspects.

Key benchmarks (briefly).

SWE-bench (Jimenez et al., 2024). Real GitHub issues; agents produce patches; success measured by test passage. The flagship coding-agent benchmark. SWE-bench Verified (500-problem quality-controlled subset) is the standard for reliable comparisons.

GAIA (Mialon et al., 2023). General-purpose AI assistant benchmark; three difficulty levels; tests multi-step research and tool use.

WebArena, Mind2Web, WebVoyager. Web-using agent benchmarks. WebArena uses simulated sites; WebVoyager uses real sites.

OSWorld (Xie et al., 2024). Computer-use agent benchmark on real operating systems. Tests cross-application workflows.

TheAgentCompany (Xu et al., 2024). Simulates a small software company; tests multi-task long-horizon agent capabilities.

Cybench, CyberBench, CTFBench. Cybersecurity-focused agent benchmarks.

The pattern. Each benchmark provides:

  • A standardized environment.

  • Concrete tasks with clear success criteria.

  • An evaluation protocol.

  • A leaderboard tracking top systems.

Long-horizon benchmarks

A specific category. Long-horizon benchmarks test agents on tasks requiring substantial sequential work - hours to days to weeks of equivalent human time.

METR Task Evaluations (Model Evaluation and Threat Research, 2023+). Frontier-AI safety-relevant evaluations. Tests autonomous task completion at increasing time horizons; explicitly designed to track when agents can perform tasks of given durations.

A specific landmark finding. METR (2025) “Measuring AI Ability to Complete Long Tasks.” Frontier-AI agents can reliably complete tasks taking minutes to hours of equivalent human time; tasks taking days remain mostly out of reach as of mid-2025. The “horizon length” (the human-time length of tasks a model completes at a 50% success rate) has doubled approximately every 7 months since 2019, with signs of faster doubling in 2024-2025.

The implication. If the trend continues, agents capable of multi-day autonomous tasks will appear within 2-3 years; multi-week tasks within 4-5 years. Whether the trend continues is uncertain; planning for this trajectory is one of the field’s central considerations.
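The extrapolation behind this implication is a one-line doubling calculation. The sketch below assumes a starting horizon of about 1 hour, and reads "multi-day" as 16 working hours and "multi-week" as 160 working hours; all three figures are illustrative assumptions, not values from the METR report.

```python
import math


def months_until(target_hours, current_hours, doubling_months=7.0):
    """Months until the horizon reaches target_hours, assuming it keeps
    doubling every doubling_months."""
    return doubling_months * math.log2(target_hours / current_hours)


CURRENT_HORIZON_HOURS = 1.0  # assumed starting point
YEARS_TO_MULTI_DAY = months_until(16, CURRENT_HORIZON_HOURS) / 12
YEARS_TO_MULTI_WEEK = months_until(160, CURRENT_HORIZON_HOURS) / 12
```

Under these assumptions the multi-day threshold is about 2.3 years away and the multi-week threshold about 4.3 years away. The result is sensitive to the doubling time: at 4 months per doubling the same thresholds arrive in about 1.3 and 2.4 years.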

RE-Bench (METR). Research-engineering benchmark. Agents complete realistic ML-research tasks; measures both speed and quality compared to human researchers.

OSWorld long-horizon variants. Extended OSWorld tasks requiring multi-step coordination across applications.

SWE-bench Long and similar extensions. Long-horizon variants of standard coding benchmarks.

The challenges of long-horizon evaluation.

  • Cost. Each evaluation takes hours of agent execution; total evaluation budget is substantial.

  • Variability. Long executions accumulate variability; an agent may succeed on one run and fail on a rerun or on a closely similar task.

  • Quality control. Detecting partial successes, partial failures, near-misses requires sophisticated evaluation.

  • Standardization. Different evaluators may set up the environment slightly differently; results may not be perfectly comparable.
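Cost and variability interact: long-horizon tasks are usually run only a handful of times, so the measured success rate is itself uncertain. A minimal sketch using the Wilson score interval, which behaves better than the normal approximation at small sample sizes. The run counts are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a success rate (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


LOW_5, HIGH_5 = wilson_interval(3, 5)      # 3 successes in 5 runs
LOW_50, HIGH_50 = wilson_interval(30, 50)  # same rate, ten times the budget
```

Three successes in five runs is compatible with a true success rate anywhere from about 23% to about 88%; thirty in fifty narrows this to about 46% to 72%. Reporting the interval alongside the point estimate makes the evaluation budget visible to the reader.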

The 2026 state. Long-horizon evaluation is increasingly important and increasingly developed. METR-style evaluations are now part of frontier safety frameworks (cross-reference Alignment §10).

Cost and ecological validity

A specific tension in agent evaluation. Cost-effective evaluation (cheap, fast, comparable across models) and ecologically valid evaluation (matching deployment conditions) often pull in opposite directions.

Cost-effective evaluation.

  • Use cached environments (fixed website snapshots) for reproducibility.

  • Limit task complexity to keep execution times short.

  • Use simpler tools to avoid API costs.

  • Limit retries and recovery attempts.

Ecologically valid evaluation.

  • Use live environments (real websites, real APIs).

  • Test on realistic complex tasks.

  • Allow full tool access and reasonable retries.

  • Test under realistic latency and reliability conditions.

The trade-off. Cost-effective evaluation is cheap and standardized but may not reflect deployment performance. Ecologically valid evaluation is closer to reality but expensive and harder to compare across runs.

The 2026 practice. Most benchmark evaluation uses cost-effective methodology; production deployment evaluation uses more ecologically valid setups. Reporting both is best practice but is done inconsistently.

Evaluation under capability growth

A specific challenge for agent evaluation. Agent capabilities are growing rapidly. Benchmarks released today may be near-saturated within 1-2 years.

The METR finding (above) - horizon length doubling every 7 months - suggests that agent benchmarks need continuous refresh even more than non-agentic benchmarks. The infrastructure for this is partially developed.

The response.

Benchmark difficulty progressions. SWE-bench → SWE-bench Verified → harder successors. METR Task Evaluations include tasks of increasing duration.

Living benchmarks. Continuous addition of new tasks (LiveCodeBench-style; some METR variants).

Date-cutoff evaluations. Use only tasks created after model training cutoffs to avoid contamination.

Benchmark-design competitions. Communities (METR, AISIs, academic groups) periodically release new challenging benchmarks.

The trajectory. Agent-benchmark development is one of the fastest-moving areas in evaluation methodology in 2024-2026. Continued investment is essential as agent capabilities advance.

Safety-relevant agentic evaluation

A specific category at the intersection of agent evaluation and safety evaluation. Cross-references AI Agents §11 and Alignment §7.

The categories.

Autonomous-replication evaluations. Can the agent acquire compute, accounts, or money beyond its initial allocation? METR (2023+).

Dangerous-capability agentic evaluations. Can the agent operationalize dangerous knowledge (synthesize bioweapon-relevant procedures via search-and-execute)?

Agentic deception evaluations. Does the agent behave differently when it perceives that oversight has been reduced? Sleeper-agent-style probes adapted to agentic contexts (cross-reference Alignment §8).

Resource-usage monitoring. Track agent resource consumption (compute, API calls, money) to detect runaway behaviour.

Multi-agent dynamics. When agents interact with each other (or with adversarial AI inputs), the threat model becomes more complex.

These evaluations are part of frontier safety frameworks (Anthropic RSP, OpenAI Preparedness; cross-reference Alignment §10). The integration with deployment-decision processes is operationally consequential.

Where agentic evaluation sits in 2026

The summary. Agent evaluation is substantially more developed than 2023 baselines. Standard benchmarks exist; long-horizon evaluation is recognized; safety-relevant agentic evaluation is integrated with governance frameworks.

The remaining issues. Cost-effectiveness vs ecological validity trade-offs. Standardization across evaluators is incomplete. The pace of capability growth strains methodology. Long-horizon evaluations are expensive and hard to standardize.

The next section develops evaluation for deployment decisions - the operational use of all the methodology developed in §3-§10.


§11. Evaluation for Deployment Decisions

The operational use of evaluation methodology. This section develops how evaluations feed deployment decisions - pre-release, continuous post-deployment monitoring, A/B testing, and integration with governance frameworks.

Frontier safety frameworks

A specific institutional pattern. Frontier safety frameworks (Anthropic RSP, OpenAI Preparedness, DeepMind Frontier Safety; cross-reference Alignment §10) tie evaluations to deployment decisions in operationally-consequential ways.

The structure (review from Alignment).

Capability tiers. Each framework defines capability thresholds (Anthropic ASL-1/2/3/4; OpenAI low/medium/high/critical Preparedness levels).

Required evaluations. Each tier specifies evaluations that must be conducted before models at that tier are deployed.

Threshold conditions. Specific evaluation scores trigger specific deployment requirements (additional safety training; deployment restrictions; halting further capability development pending mitigations).

Audit and transparency. Frameworks specify what evaluation results are published; what is retained internally; what is shared with regulators.
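The threshold-condition structure can be sketched as a lookup from evaluation results to triggered requirements. Everything in the sketch is hypothetical: the evaluation names, the trigger scores, and the requirement wording are invented for illustration and do not come from any published framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    evaluation: str       # name of a required evaluation
    trigger_score: float  # score at or above which the requirement applies
    requirement: str      # deployment requirement that is triggered


# Hypothetical policy; real frameworks define their own categories and tests.
HYPOTHETICAL_POLICY = [
    Threshold("bio_uplift", 0.30, "restrict deployment pending mitigations"),
    Threshold("cyber_ctf", 0.50, "additional safety training"),
    Threshold("autonomy_tasks", 0.50, "pause further scaling pending review"),
]


def required_actions(results, policy):
    """Return the requirements triggered by a set of evaluation results.
    Missing required evaluations raise instead of silently passing."""
    missing = [t.evaluation for t in policy if t.evaluation not in results]
    if missing:
        raise ValueError(f"required evaluations not run: {missing}")
    return [t.requirement for t in policy
            if results[t.evaluation] >= t.trigger_score]


RESULTS = {"bio_uplift": 0.35, "cyber_ctf": 0.10, "autonomy_tasks": 0.60}
ACTIONS = required_actions(RESULTS, HYPOTHETICAL_POLICY)
```

The design choice worth noting is the missing-evaluation check: a gate that passes when an evaluation was never run is weaker than one that fails closed.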

The deployment integration. A frontier-model release in 2026 typically requires:

  1. Pre-release internal evaluation across the framework’s required categories.

  2. External red-teaming (often by AISIs).

  3. Documentation of evaluation results in a model card.

  4. Mitigations addressing identified issues.

  5. Deployment approval against framework criteria.

The aggregate effect. Evaluation is not just research; it is operational infrastructure that determines deployment behaviour. The methodological choices have substantial real-world consequences.

Pre-deployment evaluation

The standard pattern. Before deploying a new model:

   PRE-DEPLOYMENT EVALUATION (typical frontier-lab pattern)

   1. CAPABILITY BENCHMARKING (weeks)
      Run the model on the standard capability benchmark suite
      (MMLU, GPQA, MATH, HumanEval, SWE-bench, etc.).
      Compare to prior model versions; characterize improvements.

   2. SAFETY EVALUATION (weeks-months)
      Run the dangerous-capability evaluations.
      Apply behavioural safety evaluations.
      Conduct internal red-teaming.

   3. EXTERNAL REVIEW (weeks-months)
      AISI partners (UK, US) conduct pre-deployment evaluation.
      External red-teamers probe for failure modes.

   4. ITERATIVE REFINEMENT
      Address identified issues through additional training or
      deployment controls.
      Re-evaluate after changes.

   5. RELEASE DECISION
      Final evaluation review against frontier safety framework
      criteria.
      Sign-off by lab leadership and (where applicable) external
      partners.

   6. PUBLIC DOCUMENTATION
      Publish model card with key evaluation results.
      Include caveats and known limitations.

The investment. Modern frontier-model pre-deployment evaluation involves months of work by hundreds of evaluators and researchers. The cost is substantial; the methodology is increasingly sophisticated.

The model card. The output of pre-deployment evaluation is typically a published model card - structured documentation of the model’s capabilities, evaluation results, intended use cases, limitations, and safety properties. Model cards have become standard practice; their structure has matured into recognizable conventions.

Continuous post-deployment monitoring

A different pattern that has matured in 2024-2026. After deployment, evaluation does not stop.

Real-world performance monitoring. Track model behaviour in deployment - user queries, model responses, satisfaction signals, error reports. Identify emerging failure modes; compare to pre-deployment evaluation predictions.

Adversarial probing. Continuous red-teaming of deployed models. Identify novel attack patterns; respond with mitigations.

Safety event tracking. Log specific safety-relevant events (refusal rates, jailbreak attempts, dangerous-content production). Aggregate over time; identify trends.

A/B comparison. Test deployment-time changes (new prompts, safety filters, model versions) against production traffic; verify changes are improvements.

Drift detection. Monitor for unexpected changes in behaviour over time. Models may behave differently after fine-tuning, prompt updates, or environmental changes; drift detection catches such shifts.
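A minimal sketch of drift detection on a single safety-relevant rate (for example, a daily refusal rate), in the style of a control chart. The baseline rate, the daily counts, and the 3-sigma threshold are hypothetical; production systems typically monitor many metrics and account for seasonality.

```python
import math


def drift_alerts(baseline_rate, daily_counts, z_threshold=3.0):
    """Return indices of days whose event rate deviates from the baseline by
    more than z_threshold binomial standard errors.
    daily_counts is a list of (events, total_queries) pairs."""
    alerts = []
    for day, (events, total) in enumerate(daily_counts):
        se = math.sqrt(baseline_rate * (1 - baseline_rate) / total)
        z = (events / total - baseline_rate) / se
        if abs(z) > z_threshold:
            alerts.append(day)
    return alerts


# Hypothetical data: 5% baseline refusal rate, 1,000 sampled queries per day.
DAILY = [(50, 1000), (52, 1000), (90, 1000)]
ALERT_DAYS = drift_alerts(0.05, DAILY)
```

Only the third day (index 2) is flagged: a move from 5% to 9% is about 5.8 standard errors at this sample size, while 5.2% is well within noise.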

The 2026 production picture. Major deployed AI systems have substantial post-deployment evaluation infrastructure. The tooling (observability platforms, evaluation dashboards, real-time monitoring) is mature; the integration with development cycles is increasingly standard.

A/B testing in production

A specific operational pattern. A/B testing - comparing two model versions on production traffic - provides ground-truth deployment evaluation.

The mechanism. Route some fraction of production queries to the candidate model; the rest to the current model. Compare metrics:

  • User-quality signals (ratings, conversation length, retention).

  • Cost (token usage; latency).

  • Safety events (refusals; jailbreak rates).

  • Domain-specific metrics (task completion; user reports).

After collecting enough data, decide whether to deploy the candidate broadly.
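The comparison step can be sketched as a two-proportion z-test on one binary metric (for example, the share of responses receiving a positive rating). The counts are hypothetical, and the test assumes independent observations; production analyses usually need to account for repeated queries from the same user.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two proportions.
    Returns (z, p_value); positive z means arm B has the higher rate."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; test is undefined")
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Hypothetical counts: current model (A) vs candidate (B), 10,000 queries each.
Z, P_VALUE = two_proportion_z_test(4000, 10_000, 4200, 10_000)
```

A 40% vs 42% positive-rating rate on 10,000 queries per arm gives z of about 2.9 and a p-value of about 0.004. Statistical significance is only one input; the decision also weighs cost, safety events, and whether a two-point gain matters in practice.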

The advantages.

  • Production-grade signal. Real users, real queries, real conditions.

  • Statistical rigor. With sufficient traffic, statistically significant comparisons are achievable.

  • Direct measurement of user impact. Not a proxy via benchmarks; the actual deployment metrics.

The challenges.

  • User-exposure cost. Sub-optimal candidates expose users to worse experience during testing.

  • Cherry-picking risk. Tests can be designed to favour pre-determined outcomes.

  • Multiple-testing concerns. Running many A/B tests without a multiple-comparison correction inflates the false-positive rate.

  • Long-term-vs-short-term metrics. Some changes look good in short-term tests but degrade over time (e.g., engagement-optimized changes that erode trust).

  • Privacy concerns. Testing on real user data raises privacy questions.
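The multiple-testing concern in the list above, as arithmetic plus one standard remedy. The sketch shows the family-wise error rate for independent tests and the Holm-Bonferroni step-down correction; the p-values are hypothetical, and Holm is one reasonable choice among several, not a method attributed to any particular lab.

```python
def familywise_error(alpha, n_tests):
    """Chance of at least one false positive across n independent tests,
    each run at significance level alpha, when no effect is real."""
    return 1 - (1 - alpha) ** n_tests


def holm_bonferroni(p_values, alpha=0.05):
    """Holm step-down procedure. Returns one reject/keep flag per input
    p-value, controlling the family-wise error rate at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


FWER_20 = familywise_error(0.05, 20)
# Hypothetical p-values from four concurrent A/B tests.
DECISIONS = holm_bonferroni([0.004, 0.03, 0.04, 0.20])
```

Twenty uncorrected tests at the 0.05 level carry about a 64% chance of at least one false positive. In the four-test example, three results pass an uncorrected 0.05 threshold but only the first survives the correction.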

The 2026 deployment. Major frontier-AI services (ChatGPT, Claude, Gemini) run extensive A/B testing as part of their deployment pipelines. The infrastructure is mature; the methodology continues to evolve.

Governance integration

The operational interface between evaluation methodology and policy. Multiple governance frameworks reference evaluation in operational ways.

EU AI Act. Risk-tiered regulation; specific categories require systematic evaluation as part of conformity assessment. Frontier (general-purpose) AI models above specified compute thresholds have additional evaluation requirements.

US Executive Order on AI (EO 14110, Oct 2023; rescinded Jan 2025; subsequent policy updates). The original EO required pre-deployment safety reporting for frontier models. The January 2025 rescission removed many specific requirements; subsequent policy continues to evolve.

UK AI Safety Institute (renamed the AI Security Institute in 2025) and other AISIs. Conduct evaluations under partnership with frontier-AI labs; produce independent evaluation reports that inform regulatory decisions.

Voluntary frontier-lab commitments. RSPs (Anthropic), Preparedness Frameworks (OpenAI), Frontier Safety Framework (DeepMind) commit to evaluation-conditional deployment.

International coordination. Bletchley Declaration (Nov 2023), AI Seoul Summit (May 2024), Paris AI Action Summit (Feb 2025), G7 Hiroshima AI Process - frameworks for international coordination on frontier-AI evaluation.

The technical implications.

Evaluation as governance evidence. Evaluation results increasingly serve as evidence in regulatory processes. The methodology choices have legal and policy weight.

Standardization pressure. Different jurisdictions may have different requirements; meeting all simultaneously requires interoperable evaluation methodology.

Independent verification. Lab self-reporting is supplemented by independent evaluation (AISIs, academic researchers); the combination provides cross-checks.

Transparency requirements. Governance frameworks increasingly require disclosure of evaluation methodologies and results.

Certification: aspiration and reality

A specific governance question. Should AI systems be formally certified?

The case for certification. Other safety-critical industries (medical devices, aviation, financial services) have formal certification regimes. AI certification would:

  • Provide consistent quality standards.

  • Reduce information asymmetries (users know what was verified).

  • Enable insurance and liability frameworks.

  • Support regulatory enforcement.

The case against (or for caution).

  • Technical immaturity. AI evaluation methodology is not yet ready to support certification at the rigor of other industries.

  • Regulatory capture. Certification regimes can be captured by incumbents.

  • Competitive dynamics. Certification requirements may favour large incumbents over smaller competitors.

  • Pace mismatch. Certification processes are slow; AI capability evolves rapidly.

The 2026 state. No formal AI certification regime exists at scale. Various certification-like patterns are emerging:

  • Frontier-lab RSP/Preparedness compliance as informal certification.

  • AISI evaluation reports as quasi-certification.

  • Industry-standard model cards as documentation.

The trajectory. Formal certification may eventually emerge but is years away. Certification-like infrastructure is gradually maturing.

The deployment-decision feedback loop

A meta-observation. Evaluation methodology shapes deployment decisions; deployment outcomes inform evaluation methodology.

The pattern.

  1. Pre-deployment evaluation predicts model behaviour.

  2. Deployment provides ground-truth observations.

  3. Gaps between prediction and reality motivate evaluation-methodology improvements.

  4. Improved methodology informs the next round of pre-deployment evaluation.

The implication. Evaluation methodology is learned over time through deployment experience. The 2026 methodology reflects lessons from years of deployment; future methodology will reflect future lessons.

The honest accounting. The feedback loop is partial. Some deployment failures inform methodology improvement; others are missed or ignored. Lab-internal feedback is faster than community-wide feedback. The cycle works but unevenly.

Where deployment-decision evaluation sits in 2026

The summary. Evaluation has become operational infrastructure for deployment decisions. Frontier safety frameworks integrate evaluation requirements with deployment; AISIs conduct independent evaluations; A/B testing provides production-grade signals; governance frameworks reference evaluation in compliance.

The remaining issues. Standardization across labs and jurisdictions is incomplete. Independent verification covers only part of the evaluation landscape. Formal certification is years away. The pace of capability growth strains the methodology.

The trajectory. The 2024-2026 period saw substantial operationalization of evaluation infrastructure. Continued maturation is expected; the destination (reliable evaluation-based deployment decisions for arbitrarily capable AI) is approached but not reached.


§12. Connections to Other Chapters

This chapter is the most cross-cutting in the book. The references below are dependency statements; chapter-specific evaluation content lives in those chapters.

  • Foundation Models §9 develops FM-specific evaluation in detail. This chapter develops the cross-cutting methodology those evaluations apply.

  • Large Language Models §13 develops LLM-specific limitations and evaluation concerns. Several frontier-benchmark patterns described in this chapter (MMLU, GPQA, MMLU-Pro, etc.) are LLM-specific.

  • Alignment §7 develops safety evaluations and dangerous-capability evaluations in detail. §5 of this chapter cross-references that material; the operational integration with governance (§11 here) parallels Alignment §10.

  • AI for Science §11 develops evaluation in scientific domains (CASP, MatBench, WeatherBench, etc.). The domain-specific evaluation methodologies have their own conventions; this chapter develops the cross-cutting patterns.

  • Generative Models §10 develops generative-model evaluation in detail (FID, perplexity, human preference, memorization detection). Generative-model evaluation is methodologically distinctive.

  • AI Agents §10 develops agent-specific evaluation. §10 of this chapter cross-references it; the distinctive methodology of agent evaluation is developed there.

  • Mechanistic Interpretability §11 develops MI-specific evaluation methodology (SAE benchmarks, circuit-validation methodology, automated-interpretability validation).

  • Causality §11 develops causal-benchmark evaluation including LLM causal reasoning (CRASS, CLadder, CausalBench).

  • Reinforcement Learning has evaluation embedded throughout - sample-complexity benchmarks, regret bounds, deep-RL benchmarks (Atari, MuJoCo, etc.). RL evaluation has its own conventions distinct from supervised-learning evaluation.

  • Deep Learning §11 covers the empirical-evaluation context for deep-learning architectures.

  • Self-Supervised Learning §9 covers SSL-specific evaluation (representation quality, transfer evaluation).

  • Theoretical Foundations of Learning §11 develops the theoretical-evaluation framings (sample complexity, generalization bounds) that complement empirical evaluation.


§13. Critiques and Alternative Perspectives

This section presents critiques of AI evaluation as substantive intellectual positions.

“Benchmarks aren’t measuring what matters”

A persistent critique. The argument: standard benchmarks measure narrow tasks that don’t reflect deployment value. MMLU performance is not the same as “general knowledge”; HumanEval score is not the same as “software-engineering capability”; SWE-bench performance is not the same as “production-grade autonomous coding.”

The substantive content. Benchmarks measure specific operationalizations of capabilities. The gap between the operationalization and the underlying capability is non-zero. Reported scores can be misleading.

The pushback. Imperfect measurement is better than no measurement. Benchmarks provide standardized comparison; their limitations are well-understood; users should interpret with appropriate caveats. The alternative (no measurement) leaves capability claims unfalsifiable.

The chapter’s position. Both the critique and the pushback have force. Benchmarks are useful tools with known limitations. Best practice combines benchmarks with other evaluation modes (human evaluation, red-teaming, deployment monitoring). Reducing evaluation to benchmark scores alone is a real failure mode; the field’s better practice acknowledges this.

“The benchmark culture distorts research priorities”

A specific cultural critique. The argument: AI research is organized around benchmarks. Researchers target benchmark improvements; benchmark performance determines publication and career success; the field’s research priorities reflect what benchmarks measure rather than what’s important.

The consequences claimed. Underinvestment in problems that don’t have clean benchmarks. Overinvestment in problems with strong benchmark incentives. Goodhart’s law operating at the research-community level.

The pushback. The alternative (research not organized around measurements) has its own failure modes - research becomes harder to evaluate, progress harder to measure, science becomes opinion. Benchmarks provide epistemic discipline; the discipline has costs but the alternative is worse.

The chapter’s position. The cultural concern is real. Benchmark-dominated research culture has shaped what AI research happens. The right response is not abandoning benchmarks but diversifying the evaluation methodology - multi-method evaluation, human evaluation, deployment-grounded evaluation, novel-capability evaluation.

“Human evaluation is the gold standard / cannot scale / has its own biases”

A complex critique with multiple positions. The positions.

“Human evaluation is the gold standard.” Human judgment is the ultimate source of truth for what AI systems should do. Automated evaluation is a proxy for human judgment.

“Human evaluation cannot scale.” Production-grade AI evaluation requires massive throughput; human evaluation is too slow and expensive.

“Human evaluation has its own biases.” Annotator demographics, training, fatigue, motivation, and other factors all introduce biases. Human evaluation is not actually a gold standard; it’s a flawed standard.

The chapter’s position. All three positions have substantive truth. Human evaluation is the closest thing to ground truth for many subjective tasks. It cannot scale to the volumes modern AI evaluation requires. It has its own biases that need methodology to manage. The right response is hybrid pipelines combining human, LLM-as-judge, and benchmark evaluation, not reliance on any single mode.
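
A worked sketch. One standard way to quantify rater disagreement, or to validate an LLM judge against human labels, is a chance-corrected agreement statistic. The sketch below computes Cohen's kappa for two raters labelling the same items. The label lists are made-up illustration data, and the function name is an assumption rather than any library's API.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labelled independently at their own rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical pairwise-preference labels ("A" or "B" preferred) on ten items.
human = ["A", "A", "B", "B", "A", "B", "A", "A", "B", "A"]
judge = ["A", "A", "B", "A", "A", "B", "A", "B", "B", "A"]
kappa = cohens_kappa(human, judge)
print(round(kappa, 3))  # 0.583
```

Raw agreement here is 0.8, but kappa is lower because two raters who each prefer "A" 60% of the time would agree on about half the items by chance alone. Reporting kappa alongside raw agreement makes that distinction visible.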

The reproducibility crisis in AI evaluation

A specific methodological concern. Failures of reproducibility - independent investigators being unable to replicate reported results - are a recurring crisis across empirical sciences. AI evaluation has its own version.

The concerns.

  • Method variability. Different evaluation implementations (different prompts, different decoding parameters, different metric implementations) produce different scores.

  • Model-snapshot variability. The same model name (e.g., “GPT-4”) may refer to different model versions over time; reported scores are model-snapshot-specific.

  • Random-seed variability. Repeated evaluations of the same model on the same benchmark can produce noticeably different scores because of LLM sampling stochasticity (for example, decoding at non-zero temperature).

  • Closed-data evaluation. Frontier-model evaluation often happens on private benchmarks; external reproduction is impossible.

The response. Reproducibility-oriented infrastructure has improved - HELM, lm-evaluation-harness, OpenAI Evals provide more standardized evaluation. Reporting practices have improved. But the underlying problem persists for closed-data frontier-model evaluation.

The chapter’s position. Reproducibility is substantially worse in AI evaluation than in established empirical sciences. Methodological improvements are gradual; the destination (reliably reproducible AI evaluation) is approached but not reached.
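
A worked sketch. Random-seed variability and finite benchmark size can both be reported rather than hidden. The sketch below summarizes scores from repeated runs and computes a percentile bootstrap interval over benchmark items. All numbers are made-up illustration data; the function names, the 2000 resamples, and the 95% level are assumptions, not a prescribed protocol.

```python
import random
import statistics


def summarize_runs(scores):
    """Mean, sample standard deviation, and range of repeated-run scores."""
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


def bootstrap_ci(per_item_correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy, resampling benchmark items."""
    rng = random.Random(seed)  # fixed seed so the interval itself is reproducible
    n = len(per_item_correct)
    means = sorted(
        sum(rng.choices(per_item_correct, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical accuracies from five runs of one model with different seeds.
run_scores = [0.712, 0.698, 0.705, 0.721, 0.694]
summary = summarize_runs(run_scores)
print(summary)

# Hypothetical per-item results for one run: 140 correct out of 200 items.
items = [1] * 140 + [0] * 60
lower, upper = bootstrap_ci(items)
print(f"accuracy 0.700, 95% bootstrap interval [{lower:.3f}, {upper:.3f}]")
```

A score difference between two models that is smaller than the run-to-run standard deviation, or that falls inside the item-level interval, is weak evidence of a real capability difference.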

Ecological validity vs benchmark cleanliness

A specific tension. Benchmark cleanliness (standardized, reproducible, comparable) and ecological validity (matching deployment conditions) often conflict.

The trade-off.

  • Cleaner benchmarks. Cached environments, fixed prompts, deterministic evaluation. Reproducible but possibly unrepresentative.

  • More-valid benchmarks. Live environments, variable conditions, realistic deployment patterns. Less reproducible but closer to deployment.

The chapter’s position. Both are needed. Clean benchmarks for standardized comparison; ecologically valid evaluation for deployment grounding. The 2026 practice combines both; the integration is imperfect.

Whose evaluation values count

A normative critique paralleling the alignment-chapter discussion. Evaluation reflects evaluator values. Whose values?

The concern. Dominant evaluation methodologies reflect:

  • US/European cultural perspectives.

  • English-language norms.

  • Professional/academic preferences.

  • Tech-industry priorities.

The implication. AI systems evaluated against these priorities will be optimized for them. Deployment in non-Western contexts, non-English languages, or non-tech-industry use cases may be poorly served.

The response. Increasing recognition; some efforts at cross-cultural evaluation (multilingual benchmarks; non-Western evaluator panels). Substantial gaps remain.

The chapter’s position. The critique is substantive and partially addressed. Multilingual and cross-cultural evaluation is a real research direction; its centrality in the field is uneven.


§14. Limitations and Open Problems

Consolidated open-problems list. Each carries an OP-E-N identifier.

  • OP-E-1. Contamination-resistant evaluation at scale. The contamination problem persists despite mitigation strategies. Producing reliable evaluation evidence at the pace of capability development is open. Best practices exist but are unevenly applied; verification across closed-data labs is hard.

  • OP-E-2. Robust adversarial evaluation. Adversarial evaluation (red-teaming, prompt-injection probing, adversarial perturbations) catches known attack patterns. Novel attacks emerge regularly. Producing evaluation methodology that anticipates rather than reacts to adversarial patterns is open.

  • OP-E-3. Long-horizon and agentic evaluation methodology. Current methods handle hour-to-day timescales; week-to-month timescales are research-stage. Cost-effective evaluation of long-horizon capabilities is a substantial open problem.

  • OP-E-4. Capability extrapolation from benchmarks to deployment. Reported benchmark scores predict deployment value imperfectly. Bridging the gap - predicting deployment outcomes from pre-deployment evaluation - is open.

  • OP-E-5. Evaluation that aligns with human values. Existing benchmarks measure operationalizations of value; the gap to actual user value is non-zero. Designing evaluations that align tightly with deployment-relevant value is an ongoing challenge.

  • OP-E-6. Cross-cultural evaluation. Most evaluation infrastructure reflects specific cultural perspectives. Producing genuinely cross-cultural evaluation infrastructure is open.

  • OP-E-7. Evaluation under capability growth. Benchmarks saturate; new ones must be developed continuously. Whether benchmark-design infrastructure can keep pace with capability growth is uncertain.

  • OP-E-8. Evaluation for emerging modalities and capabilities. New capabilities (multimodal generation, computer-use agency, scientific reasoning) sometimes lack mature evaluation methodology. Each new capability category needs its own evaluation development.

  • OP-E-9. Reproducibility in AI evaluation. Independent reproduction of evaluation results is hard for closed-data frontier models. Methodological infrastructure to support reproducibility is partial.

  • OP-E-10. Certification of AI systems. Formal certification regimes for AI do not yet exist at scale. Whether and how they should emerge is open.
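
A worked sketch for OP-E-1. A common first-pass contamination check is word n-gram overlap between benchmark items and candidate training text. The sketch below is a minimal version of that heuristic: the function names, the choice of n = 8, and the 0.5 threshold are assumptions made for illustration, and the strings are toy data. Exact n-gram matching misses paraphrased or translated copies, so a clean result here is not evidence that an item is uncontaminated.

```python
import re


def ngrams(text, n):
    """Set of lower-cased word n-grams in text (punctuation ignored)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(item, corpus_ngrams, n=8):
    """Fraction of the item's n-grams that also occur in the corpus."""
    item_ngrams = ngrams(item, n)
    if not item_ngrams:
        return 0.0  # item shorter than n tokens; this check cannot score it
    return len(item_ngrams & corpus_ngrams) / len(item_ngrams)


def flag_contaminated(items, corpus_docs, n=8, threshold=0.5):
    """Return the items whose n-gram overlap with the corpus meets the threshold."""
    corpus_ngrams = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc, n)
    return [
        item for item in items
        if overlap_fraction(item, corpus_ngrams, n) >= threshold
    ]


# Toy data: the first item appears verbatim in the "training" document.
benchmark_items = [
    "What is the boiling point of water at sea level in degrees Celsius?",
    "Which planet in the solar system has the largest number of known moons?",
]
corpus = [
    "Quiz dump: what is the boiling point of water at sea level "
    "in degrees celsius? Answer: 100.",
]
flagged = flag_contaminated(benchmark_items, corpus)
print(flagged)
```

At real corpus scale the in-memory set would need to be replaced by an index or hashed sketch; the logic of the check stays the same.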


§15. Further Reading

Opinionated annotated list. Not exhaustive; intended as a reading-order recommendation.

Foundational

  • Russakovsky, O., et al. (2015). “ImageNet Large Scale Visual Recognition Challenge.” The canonical benchmark of the deep-learning era.

  • Wang, A., et al. (2018). “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.”

  • Wang, A., et al. (2019). “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems.”

Frontier LLM benchmarks

  • Hendrycks, D., et al. (2021). “Measuring Massive Multitask Language Understanding.” MMLU.

  • Srivastava, A., et al. (2022). “Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models.” BIG-bench.

  • Liang, P., et al. (Stanford CRFM 2022). “Holistic Evaluation of Language Models.” HELM.

  • Rein, D., et al. (2023). “GPQA: A Graduate-Level Google-Proof Q&A Benchmark.”

  • Glazer, E., et al. (2024). “FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI.”

  • Chollet, F. “The Abstraction and Reasoning Corpus” and ARC-AGI prize work.

Human evaluation

  • Zheng, L., Chiang, W., et al. (2023). “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” The foundational Chatbot Arena paper.

LLM-as-judge

  • Zheng, L., et al. (2023). “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” As above.

  • Liu, Y., et al. (2023). “G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment.”

  • Kim, S., et al. (2024). “Prometheus: Inducing Fine-grained Evaluation Capability in Language Models.”

  • Panickssery, A., Bowman, S. R., Feng, S. (2024). “LLM Evaluators Recognize and Favor Their Own Generations.” Self-preference documentation.

Contamination

  • Sainz, O., et al. (2023). “NLP Evaluation in Trouble: On the Need to Measure LLM Data Contamination for each Benchmark.”

  • Magar, I., and Schwartz, R. (2022). “Data Contamination: From Memorization to Exploitation.”

  • Carlini, N., et al. (multiple papers, 2021-2024). Data extraction from language models.

  • Mirzadeh, I., et al. (Apple 2024). “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in LLMs.”

Safety evaluation

  • Lin, S., Hilton, J., Evans, O. (2022). “TruthfulQA: Measuring How Models Mimic Human Falsehoods.”

  • Mouton, C. A., et al. (RAND 2024). CBRN dangerous-capability evaluations.

  • Perez, E., et al. (2022). “Red Teaming Language Models with Language Models.”

  • Ganguli, D., et al. (Anthropic 2022). “Red Teaming Language Models to Reduce Harms.”

Agent evaluation

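  • Jimenez, C. E., et al. (2024). “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?”

  • Mialon, G., et al. (2023). “GAIA: A Benchmark for General AI Assistants.”

  • Xie, T., et al. (2024). “OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments.”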

  • METR. Task Evaluations and related reports - long-horizon agent benchmarks.

Methodology and critique

  • Birhane, A., Kalluri, P., Card, D., Agnew, W., Dotan, R., Bao, M. (2022). “The Values Encoded in Machine Learning Research.” Critique of values in ML research priorities.

  • Bender, E. M., et al. (2021). “On the Dangers of Stochastic Parrots.” Includes critique of LLM evaluation.

Reading-order recommendation

For someone new to AI evaluation: start with HELM (Liang et al. 2022) for systematic-evaluation methodology; then the MMLU and GPQA papers for canonical capability benchmarks; then LMSYS Chatbot Arena (Zheng et al. 2023) for human-evaluation-at-scale; then the contamination papers (Sainz et al. 2023; Mirzadeh et al. 2024) for the contamination concern; then the safety-evaluation references (Mouton et al. 2024; Perez et al. 2022; Ganguli et al. 2022) for safety-specific methodology. Add the agent-evaluation references (SWE-bench, METR) for the agentic frontier.


§16. Exercises and Experiments

Research-style exercises that develop AI-evaluation skills.

  • E1. Multi-model benchmark comparison. Pick three frontier LLMs. Run them on a small subset of MMLU, GPQA Diamond, and TruthfulQA. Report scores; analyze patterns of where models agree vs disagree; investigate failure cases.

  • E2. LLM-as-judge implementation. Implement an LLM-as-judge system for a specific task (e.g., evaluating chat helpfulness, summary quality). Validate against human judgments on ~100 items. Compute agreement rate; identify systematic biases.

  • E3. Contamination audit. Pick a public benchmark. Try to find benchmark items in publicly-available training data (Common Crawl, GitHub, Wikipedia). Document overlap rates; estimate contamination impact on benchmark scores.

  • E4. Pairwise human preference setup. Set up a small pairwise-preference evaluation infrastructure (annotation UI; protocol; aggregation). Run with 5-10 evaluators on 50 items, comparing the outputs of two models on each item. Compute win rates and Elo ratings; investigate inter-annotator agreement.

  • E5. Benchmark design. Design a small new benchmark for an emerging capability not well-covered by existing benchmarks (e.g., reasoning about novel scientific scenarios; specific multimodal tasks; cross-cultural-knowledge tasks). Document methodology; conduct pilot evaluation; reflect on what’s hard about benchmark design.

  • E6. Methodological critique. Pick a published AI evaluation. Critique its methodology: are the benchmarks well-chosen? Are the comparison conditions fair? Are the results reported responsibly? What’s missing? What’s overclaimed?

  • E7. Saturation tracking. Pick a benchmark with multi-year history (MMLU, HumanEval, GSM8K). Trace SOTA scores over time. Plot the saturation trajectory. Estimate when (if ever) the benchmark will saturate.

  • E8. Robustness evaluation. Take a benchmark (e.g., MMLU subset). Generate paraphrased versions of items. Compare model performance on canonical vs paraphrased. Quantify the paraphrase gap; reflect on what it indicates.

  • E9. Agent evaluation cost analysis. Pick a small agentic benchmark (e.g., a GAIA subset). Run with different LLM backbones. Compare task-success rates and per-task costs. Plot cost-quality Pareto frontier; reflect on cost-effectiveness analysis.

  • E10. Replication experiment. Pick a published evaluation result. Attempt to reproduce it from scratch using public information. Document obstacles; estimate effort required; reflect on AI-evaluation reproducibility in practice.
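
A starting point for E4. The aggregation step can begin from a plain online Elo update over pairwise outcomes. The sketch below is a minimal version under stated assumptions: the function names, the K-factor of 32, the initial rating of 1000, and the model names are illustrative choices, not a description of how any particular leaderboard computes its ratings.

```python
def expected_score(rating_a, rating_b):
    """Elo-predicted probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_ratings(comparisons, k=32.0, initial=1000.0):
    """Online Elo over (model_a, model_b, outcome) tuples.

    outcome is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    """
    ratings = {}
    for a, b, outcome in comparisons:
        ra = ratings.setdefault(a, initial)
        rb = ratings.setdefault(b, initial)
        ea = expected_score(ra, rb)
        # Zero-sum update: A gains exactly what B loses.
        ratings[a] = ra + k * (outcome - ea)
        ratings[b] = rb - k * (outcome - ea)
    return ratings


# One hypothetical annotator judgment: m1 preferred over m2.
single = elo_ratings([("m1", "m2", 1.0)])
print(single)  # {'m1': 1016.0, 'm2': 984.0}

# A slightly longer hypothetical session, including a tie.
session = [
    ("m1", "m2", 1.0),
    ("m1", "m2", 0.5),
    ("m2", "m1", 1.0),
    ("m1", "m2", 1.0),
]
ratings = elo_ratings(session)
print({name: round(value, 1) for name, value in ratings.items()})
```

Online Elo depends on the order in which comparisons are processed. With only a few hundred judgments it is worth shuffling the comparison order several times and reporting the spread, or fitting a Bradley-Terry model to all comparisons at once as an order-independent alternative.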


AI: A Living Reference by Fuzue. Content licensed under CC BY-SA 4.0 - share, adapt, and build on it; keep the attribution and the open licence on derivatives.