27 Benchmarks and Their Discontents

Evaluation is how you know any stage of the stack worked, the cross-cutting gate that data, architecture, pre-training, and post-training all answer to. This chapter is about its oldest and most contested instrument, the static benchmark: how the capability surface decomposes into benchmark families, why those families saturate, how train/test contamination silently inflates every number, and what it takes to keep a held-out set trustworthy. By the end a reader can explain why a published benchmark score is a claim about a model and its measurement harness together, and what discipline is needed before that score counts as evidence.

27.1 Problem

A training pipeline produces a new set of weights every time a stage runs, and the only honest way to know a stage worked is an independent number that moved for a reason that survives scrutiny. A benchmark is meant to be that falsifier. The recurring failure is not the absence of a number but belief in a wrong one: a score that is contaminated by test data leaking into training, gamed by tuning toward a known format, or simply noise dressed up as signal. Each of these produces a number that looks excellent and does not transfer, and the damage is silent, because nothing in the score itself announces that it is wrong.

No single layer of evaluation catches all three failures. A practical pipeline treats automated benchmarks, production monitoring, A/B tests, user feedback, and human studies as a Swiss-cheese stack, where each layer catches what the others let through. This chapter is about the first layer, the static benchmark, which is the cheapest, the most comparable across labs, and the most prone to quietly lying.

27.2 Design

The capability surface is too large to measure with one test, so benchmarks decompose it into families, each probing a different axis and each failing in a characteristic way.

Knowledge. Broad factual and academic recall, in the style of MMLU (Hendrycks et al., 2020). It saturates and contaminates easily, and frontier high-water marks now sit near the ceiling.
Reasoning. Multi-step inference beyond recall. It is harder to contaminate but harder to score objectively, which pushes the field toward graduate-level, Google-proof variants like GPQA (Rein et al., 2023).
Math. Grade-school through competition, from GSM8K (Cobbe et al., 2021) to MATH (Hendrycks et al., 2021). The answer is checkable, which is exactly the property that bridges evaluation to verifiable-reward reinforcement learning in Chapter 14.
Code. Function completion and repository-level repair, from HumanEval (Chen et al., 2021) to SWE-bench-style tasks (Jimenez et al., 2023). It is executable, so the judge is a test runner rather than a model.
Multilingual. The same task across languages. It exposes tokenizer and data-mixture decisions made in Chapter 4 that were invisible while the model was measured only in English.
Long-context. Retrieval and reasoning over long inputs, from needle-in-a-haystack to harder probes like RULER (Hsieh et al., 2024). It gates the long-context extension discussed in Chapter 20.
Agentic. Tool use, multi-step task completion, and web or operating-system interaction, named by benchmarks like GAIA (Mialon et al., 2023), WebArena (Zhou et al., 2023), and τ-bench (Yao et al., 2024). These do not reduce to a single static answer, and their evaluation method is the subject of

Underneath every family sits the same precondition: a held-out set whose answers the model has never seen during training. Held-out is not a property of a file; it is a property of the whole pipeline. A set is held-out only if it appears in no training mixture, which is a contract the data layer must honor, not a checkbox the evaluation layer can tick on its own.

Contamination is the violation of that contract, and because it is silent it has to be hunted rather than waited for. Detection works on overlap and membership. Exact and near-duplicate matching, typically n-gram overlap against the training corpus, catches verbatim and lightly edited leakage. Canary strings, unique markers embedded in a held-out set, reveal that the set was ingested if the model can reproduce them. Membership-inference signals go further and ask whether a specific example was likely trained on at all, for instance Min-K% Prob, which averages the log-probabilities of an example’s least-likely tokens and flags the example when even those tokens are suspiciously probable under the model (Shi et al., 2023). None of these is conclusive alone, which is why contamination is reported as a contract and an audit rather than a single test.
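The three detection signals can be sketched in a few lines. This is a minimal illustration rather than a production detector: the function names, the whitespace tokenization, the default window of 8 tokens, and the default fraction k = 0.2 are all assumptions made for the example, and the Min-K% score takes per-token log-probabilities as an input instead of calling any particular model API.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Lowercased whitespace-token n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(example: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the example's n-grams that also occur in the corpus."""
    target = ngrams(example, n)
    if not target:
        return 0.0
    seen: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        seen |= target & ngrams(doc, n)
    return len(seen) / len(target)


def canary_leaked(canary: str, model_output: str) -> bool:
    """True if the model reproduced the held-out set's canary string."""
    return canary in model_output


def min_k_percent_score(token_logprobs: List[float], k: float = 0.2) -> float:
    """Mean log-probability of the k fraction of least-likely tokens.

    A higher (less negative) score means even the hardest tokens were
    well predicted, which is the signal Min-K% Prob treats as suspicious.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    count = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:count]
    return sum(lowest) / count
```

The thresholds at which an overlap fraction or a Min-K% score counts as evidence are not given here on purpose: they would need to be calibrated against text known to be clean and text known to have been trained on.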

27.3 Evolution

The history of benchmarks is a history of saturation. A benchmark is useful only while models fail some of it; once a family clusters near its ceiling, the remaining headroom is noise and the benchmark stops discriminating between systems. Knowledge benchmarks reached this first. MMLU separated models meaningfully when it appeared and is now near ceiling for frontier models, which is why the field moved to deliberately harder, contamination-resistant variants. GPQA was built to be Google-proof, with questions that experts outside the field rarely answer correctly even with web access, precisely so that recall and retrieval cannot substitute for reasoning (Rein et al., 2023).
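The claim that headroom near the ceiling is noise can be made concrete with the binomial standard error of an accuracy estimate. The sketch below assumes each question is an independent trial and that the two models are scored independently, which ignores the pairing that comes from both models answering the same questions. The function names and the example numbers are illustrative, not measured results.

```python
import math


def accuracy_stderr(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n_questions items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def gap_z_score(acc_a: float, acc_b: float, n_questions: int) -> float:
    """Gap between two accuracies in units of its standard error.

    Treats the two scores as independent, which is a simplifying assumption.
    """
    se = math.sqrt(accuracy_stderr(acc_a, n_questions) ** 2
                   + accuracy_stderr(acc_b, n_questions) ** 2)
    return (acc_a - acc_b) / se


# Hypothetical numbers: a 1-point gap near the ceiling on a 500-question set.
z = gap_z_score(0.96, 0.95, 500)
print(f"z = {z:.2f}")  # well under 2, so the gap is within sampling noise
```

Under these assumptions a one-point lead on a 500-question set is less than one standard error wide, so a leaderboard ordering built on it would not be expected to survive a resample.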

Saturation pushes the design in two directions at once. One is toward harder static sets, which buys time but inherits the same fate: any fixed set is a target, and a target eventually gets hit, whether by genuine capability or by contamination. The other is toward living evaluation that resists a fixed target altogether. Chatbot Arena (Chiang et al., 2024) replaces a frozen test set with a stream of fresh human-preference comparisons, so there is no static answer key to leak and the measurement renews itself as the model population changes. The arc runs from a fixed academic exam, to a harder Google-proof exam, to a moving evaluation with no answer key to memorize.
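One standard way to turn a stream of pairwise preferences into a ranking is an Elo-style rating update. The sketch below is a generic illustration of that idea and should not be read as Chatbot Arena's actual pipeline: the starting rating of 1000, the step size k = 32, and the model names are all assumptions.

```python
from typing import Dict, Iterable, Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A is preferred over B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_ratings(ratings: Dict[str, float],
                   comparisons: Iterable[Tuple[str, str]],
                   k: float = 32.0) -> Dict[str, float]:
    """Apply one Elo update per (winner, loser) comparison, in order."""
    ratings = dict(ratings)  # work on a copy so the input is not mutated
    for winner, loser in comparisons:
        p_win = expected_score(ratings[winner], ratings[loser])
        delta = k * (1.0 - p_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return ratings


start = {"model_a": 1000.0, "model_b": 1000.0}
print(update_ratings(start, [("model_a", "model_b")]))
# {'model_a': 1016.0, 'model_b': 984.0}
```

Because every new comparison moves the ratings, there is no fixed answer key in this scheme, which is the property the paragraph above attributes to living evaluation.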

27.4 Trade-offs

The saturation arc is really a single trade-off between two kinds of benchmark, and a published number is only as meaningful as the reader’s awareness of which kind it came from.

Static versus dynamic. Static benchmarks are cheap to run, reproducible once the harness is fixed, and comparable across labs and over time, which is why leaderboards are built on them. They also saturate, contaminate, and miss anything that is not a single static answer. Dynamic evaluation, meaning rotating sets and live environments, resists gaming because there is no fixed target to overfit, but costs more to run and is harder to compare across time because the test itself is moving.
Public versus private held-out. A public benchmark should be assumed contaminated over time, since its answer key is on the internet and flows into the next training run by default. A private held-out set is trustworthy, but only while its integrity holds, and integrity is a wasting asset: every leak, every accidental inclusion, every published example erodes it. The honest posture is to trust a private set until it leaks and to assume a public set already has.

27.5 Implementation

Operationalizing held-out integrity is where the principle becomes a procedure. A set earns the word held-out only if it is never placed in any training mixture, version-pinned so a score names the exact set it was measured against, access-controlled so it does not leak through carelessness, rotated when a leak is detected, and large enough to give a usable signal rather than a single fragile number. This is the asset the evaluation layer owns and defends.
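One way to make version-pinning concrete is to derive the version from the content itself, so that any change to the set changes the identifier a score is reported against. The sketch below is an assumption about how this could be done, not a described practice of any particular lab: the `fingerprint` function, the JSON canonicalization, and the example records are all hypothetical.

```python
import hashlib
import json
from typing import Dict, List


def fingerprint(examples: List[Dict[str, str]]) -> str:
    """Order-independent SHA-256 fingerprint of a held-out set."""
    # Serialize each example with sorted keys so key order does not matter,
    # then sort the rows so example order does not matter either.
    rows = sorted(json.dumps(ex, sort_keys=True, ensure_ascii=False)
                  for ex in examples)
    digest = hashlib.sha256()
    for row in rows:
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


heldout_v1 = [
    {"id": "q1", "question": "2 + 2 = ?", "answer": "4"},
    {"id": "q2", "question": "Capital of France?", "answer": "Paris"},
]
heldout_v2 = heldout_v1 + [{"id": "q3", "question": "3 * 3 = ?", "answer": "9"}]

# Reordering keeps the fingerprint; adding or editing an example changes it.
assert fingerprint(heldout_v1) == fingerprint(list(reversed(heldout_v1)))
assert fingerprint(heldout_v1) != fingerprint(heldout_v2)
```

Rotating a set after a leak then produces a new fingerprint automatically, and a score that quotes the old one cannot be mistaken for a score on the new set.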

The evaluation layer does not, however, scrub the training corpus. The build-time removal of test data, the deduplication and overlap filtering, belongs to the data pipeline. The evaluation layer supplies the held-out sets and the canaries, consumes the data layer’s decontamination contract, and audits that the contract held. Auditing is the load-bearing word: the evaluation layer’s job is to verify, after the fact, that nothing it defined as held-out made it into a mixture, and to surface it loudly when something did.

A subtler contamination path bypasses the corpus entirely. Synthetic data generated by a teacher model that was itself trained on a benchmark carries that benchmark’s content into the student, even though the student never saw the benchmark directly. This is one reason the synthetic-data practice in Chapter 12 and the held-out contract here are coupled: a clean corpus is not enough if the teacher was dirty.

All of this is why a single benchmark number, reported without context, is folklore rather than measurement. Contamination-aware reporting is the minimum discipline: state the decontamination contract version the score was produced under, and disclose any known leakage alongside the number, so a reader can tell a real result from a contaminated one before building on it.
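A contamination-aware report can be as simple as a record that refuses to exist without its context. The dataclass below is a hypothetical schema, not a standard: every field name is an assumption chosen to mirror the requirements in the paragraph above, and the values in the usage example are placeholders rather than real results.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class BenchmarkReport:
    """A score bundled with the context needed to interpret it (hypothetical schema)."""
    benchmark: str
    score: float
    heldout_version: str            # identifier of the exact set that was scored
    decontamination_contract: str   # version of the data layer's contract
    known_leakage: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject a bare number: both version fields are mandatory.
        if not self.heldout_version or not self.decontamination_contract:
            raise ValueError("a score without its set and contract versions is not reportable")

    def summary(self) -> str:
        leak = "; ".join(self.known_leakage) if self.known_leakage else "none known"
        return (f"{self.benchmark}: {self.score:.1%} "
                f"(set {self.heldout_version}, "
                f"contract {self.decontamination_contract}, leakage: {leak})")


report = BenchmarkReport(
    benchmark="example-bench",      # placeholder name, not a real benchmark
    score=0.712,
    heldout_version="v3",
    decontamination_contract="2024-06",
    known_leakage=["12 items found in a web crawl shard"],
)
print(report.summary())
```

Making the version fields mandatory at construction time is a design choice: it moves the discipline from a reporting guideline that can be forgotten to a check that fails before the number is ever written down.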

What’s contested

Do benchmark scores measure the model, or the harness around it? The same weights produce different scores depending on the prompt template, the number of few-shot examples, how the answer is parsed out of the generation, and the decoding settings, and these choices are not standardized across labs. A reported number is therefore a property of the model and its measurement harness together, and two labs running the same benchmark on the same model can disagree, and rankings can flip, purely because their harnesses differ. This is the motivation behind standardized evaluation frameworks like HELM (Liang et al., 2022), which pin the harness so that a score reflects the model rather than the measurement apparatus. The contested question is how much of any cross-model gap is capability and how much is harness, and the only honest answer is to fix and publish the harness, not to trust a bare number.
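The harness effect is easy to reproduce in miniature. In the sketch below the generations are fixed, standing in for one model's outputs, and only the answer parser changes. The generations, the two parsers, and their names are invented for illustration and do not correspond to any particular evaluation framework.

```python
import re
from typing import Callable, List, Optional

# Fixed generations standing in for one model's outputs on multiple-choice items.
generations: List[str] = [
    "The answer is B.",
    "B",
    "A is tempting, but I think (C) is correct.",
    "Answer: D",
]
gold: List[str] = ["B", "B", "C", "D"]


def strict_parser(text: str) -> Optional[str]:
    """Accept only a bare option letter as the whole generation."""
    stripped = text.strip()
    return stripped if stripped in {"A", "B", "C", "D"} else None


def lenient_parser(text: str) -> Optional[str]:
    """Take the first standalone option letter found anywhere in the text."""
    match = re.search(r"\b([ABCD])\b", text)
    return match.group(1) if match else None


def accuracy(parser: Callable[[str], Optional[str]]) -> float:
    """Score the fixed generations against gold using the given parser."""
    hits = sum(parser(g) == y for g, y in zip(generations, gold))
    return hits / len(gold)


print(accuracy(strict_parser))   # 0.25
print(accuracy(lenient_parser))  # 0.75
```

The same four outputs score 25% or 75% depending on the parser alone, and neither parser is obviously correct: the strict one penalizes verbosity, while the lenient one is fooled by an option mentioned in passing.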

Constraint arrow

Whether an evaluation number means anything is decided one layer down. The build-time decontamination in Chapter 4, the removal of test data from the training corpus, is what makes a held-out score a measurement at all. The evaluation layer can define held-out sets, supply canaries, and audit the result, but it cannot clean the corpus itself. If the data layer leaks the sets the evaluation layer defined, every downstream number is silently inflated and no amount of statistical care applied after the leak can recover it. The integrity of the upper layer’s measurement is dictated by the discipline of the lower layer’s pipeline.

27.6 Further reading

Hendrycks et al., “Measuring Massive Multitask Language Understanding” (MMLU), 2020. arXiv:2009.03300
Liang et al., “Holistic Evaluation of Language Models” (HELM), 2022. arXiv:2211.09110
Chen et al., “Evaluating Large Language Models Trained on Code” (HumanEval), 2021. arXiv:2107.03374
Cobbe et al., “Training Verifiers to Solve Math Word Problems” (GSM8K), 2021. arXiv:2110.14168
Hendrycks et al., “Measuring Mathematical Problem Solving with the MATH Dataset,” 2021. arXiv:2103.03874
Rein et al., “GPQA: A Graduate-Level Google-Proof Q&A Benchmark,” 2023. arXiv:2311.12022
Jimenez et al., “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?,” 2023 (ICLR 2024). arXiv:2310.06770
Hsieh et al., “RULER: What’s the Real Context Size of Your Long-Context Language Models?,” 2024 (COLM 2024). arXiv:2404.06654
Mialon et al., “GAIA: a benchmark for General AI Assistants,” 2023 (ICLR 2024). arXiv:2311.12983
Yao et al., “τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains,” 2024. arXiv:2406.12045
Zhou et al., “WebArena: A Realistic Web Environment for Building Autonomous Agents,” 2023 (ICLR 2024). arXiv:2307.13854
Chiang et al., “Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference,” 2024 (ICML 2024). arXiv:2403.04132
Shi et al., “Detecting Pretraining Data from Large Language Models,” 2023 (ICLR 2024). arXiv:2310.16789
Xie et al., “OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments,” 2024 (NeurIPS 2024). arXiv:2404.07972
OpenAI, “BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents,” 2025. arXiv:2504.12516