Wiki: Benchmarks

A benchmark is only as useful as the gap it actually measures. Across LLM evaluation, multi-agent systems, and vision-language research, the same structural problem recurs: tests get designed for one purpose, then get applied to broader claims they cannot support.

The sharpest articulation of this comes from Part 7 of Meiklejohn’s survey of multi-agent systems. HumanEval and SWE-bench were designed for single-agent coding tasks. When applied to multi-agent pipelines, they cannot measure coordination quality, communication overhead, or failure recovery, which are precisely the things that distinguish multi-agent architectures from single-agent ones. Numbers from those tests look legible, but they answer the wrong question. Imbue’s pipeline experiment on SWE-bench Pro illustrates a downstream cost: running an implementer-reviewer-fixer loop against the benchmark revealed that weaker fixer agents broke correct code, a failure mode the benchmark wasn’t built to surface.

SysMoBench runs into an analogous mismatch from the other direction (Can LLMs model real-world systems in TLA+?). Leading LLMs score near-perfect on TLA+ syntax, but only around 46% on conformance and 41% on invariant checks. The models are generating textbook protocol descriptions rather than faithfully modeling the actual systems in the source code. Syntax scores, the easy-to-measure proxy, look impressive; the meaningful scores do not.

The RTK token-compression controversy is a miniature version of the same issue (The Token Compression Illusion). Claimed 60-90% token savings are measured only on Bash output stripping, with no task-accuracy benchmarks to show the compression doesn’t degrade downstream results. The metric exists; the benchmark that would justify trusting it does not.
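One way to picture the missing benchmark is a report that refuses to quote token savings without the task-accuracy change beside it. The sketch below is illustrative only: the `CompressionRun` fields, the function names, and the 2-point `max_drop` threshold are assumptions made for this example, not anything RTK or the cited article defines.

```python
from dataclasses import dataclass


@dataclass
class CompressionRun:
    """Hypothetical record of one before/after comparison on a task suite."""
    tokens_before: int   # tokens consumed without compression
    tokens_after: int    # tokens consumed with compression
    passed_before: int   # tasks passed without compression
    passed_after: int    # tasks passed with compression
    total_tasks: int     # size of the task suite


def token_savings(run: CompressionRun) -> float:
    """Fraction of tokens removed by compression (0.75 means 75% saved)."""
    return 1.0 - run.tokens_after / run.tokens_before


def accuracy_drop(run: CompressionRun) -> float:
    """Pass-rate loss caused by compression, as a fraction of all tasks."""
    return (run.passed_before - run.passed_after) / run.total_tasks


def savings_are_supported(run: CompressionRun, max_drop: float = 0.02) -> bool:
    """Treat the savings figure as usable only if accuracy held up.

    max_drop is an assumed tolerance; pick one that fits the workload.
    """
    return accuracy_drop(run) <= max_drop
```

Under this framing, a run that saves 75% of tokens but loses 10 points of pass rate fails the check, which is exactly the case a savings-only metric cannot see.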

Effort-level benchmarking adds a different wrinkle. A hands-on test of Claude Opus 4.7 across five reasoning-effort levels on 29 real tasks found a non-monotonic curve: medium effort outperformed high, xhigh, and max on pass rate, equivalence, and cost-efficiency. Spending more reasoning effort past medium did not buy better results. This matches Colin Breck’s broader argument that impressive performance gains often don’t change outcomes when attention thresholds, discrete capacity increments, or pipeline backpressure absorb the improvement before it reaches the user.
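Checking for this kind of curve takes only a few lines once per-level scores exist. The following is a minimal sketch, not the method used in the test described above. The pass rates in `PLACEHOLDER_RESULTS` are invented placeholders, and the label of the lowest level ("low") is assumed, since only four of the five level names are given here.

```python
from typing import List, Sequence, Tuple


def is_non_decreasing(values: Sequence[float]) -> bool:
    """True if each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


def best_level(results: List[Tuple[str, float]]) -> str:
    """Return the label with the highest score.

    results must be ordered from lowest to highest effort; max() keeps the
    first maximum it sees, so ties resolve to the cheaper level.
    """
    return max(results, key=lambda item: item[1])[0]


# Placeholder pass rates for illustration only (NOT measured results).
PLACEHOLDER_RESULTS = [
    ("low", 0.60),     # level name assumed
    ("medium", 0.72),
    ("high", 0.70),
    ("xhigh", 0.69),
    ("max", 0.69),
]

pass_rates = [score for _, score in PLACEHOLDER_RESULTS]
curve_is_monotonic = is_non_decreasing(pass_rates)
recommended = best_level(PLACEHOLDER_RESULTS)
```

With these placeholder values `curve_is_monotonic` is `False` and `recommended` is `"medium"`. The design point is that the effort setting gets chosen from the measured curve rather than from the assumption that the highest level is best.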

On capability trajectory, a LessWrong analysis estimating no-CoT task-completion time horizons finds GPT-5.5 handling roughly three-minute human tasks at 50% reliability, with a doubling time of about one year since 2019. The benchmark here is explicitly designed to track a capability trend over time rather than claim absolute performance, which is one of the cleaner uses of benchmark methodology in the surveyed sources.
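The arithmetic behind a doubling-time claim is simple exponential growth: the horizon after t years is the reference horizon times 2^(t / doubling time). The sketch below plugs in the two figures quoted above (roughly three minutes, roughly one year) as defaults. The function names are made up for this example, and any value computed for future years assumes the trend simply continues, which is an assumption of the sketch rather than a claim from the analysis.

```python
import math


def horizon_minutes(years_from_reference: float,
                    reference_minutes: float = 3.0,
                    doubling_years: float = 1.0) -> float:
    """Task-length horizon under constant exponential growth.

    years_from_reference may be negative to look backwards along the trend.
    """
    return reference_minutes * 2.0 ** (years_from_reference / doubling_years)


def years_to_reach(target_minutes: float,
                   reference_minutes: float = 3.0,
                   doubling_years: float = 1.0) -> float:
    """Years needed to reach target_minutes, IF the doubling time holds."""
    return doubling_years * math.log2(target_minutes / reference_minutes)
```

For example, `horizon_minutes(2)` gives 12.0 minutes and `horizon_minutes(-1)` gives 1.5 minutes. The inverse function makes the sensitivity visible: a small change in `doubling_years` shifts any extrapolated date substantially, which is why the trend is more informative than any single point on it.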

The AI memory systems comparison table surveyed 74 systems across architecture, data model, search modes, and benchmark coverage. Listing whether a system has benchmark data at all is itself a meaningful signal; many do not.

The consistent thread: a benchmark measures what it was designed to measure, and the field repeatedly applies tests outside their design envelope. The remedy isn’t more benchmarks but better-scoped ones tied to the failure modes that actually matter in production.