Opus 4.7 Low vs Medium vs High vs Xhigh vs Max: The Reasoning Curve on 29 Real Tasks

Benchmarking Claude Opus 4.7 at five reasoning-effort levels on 29 real graphql-go-tools tasks shows a non-monotonic curve: medium effort wins on pass rate, equivalence, and code-review quality, while high, xhigh, and max cost more without improving outcomes.

May 14, 2026 · tech · Stet

Topics

  • benchmarks
  • llm-engineering
  • ai-assisted-coding
  • llm-inference
  • developer-productivity

Cited by

  • AI-assisted coding

    AI coding assistants accelerate development but introduce tradeoffs around skill atrophy, codebase design, verification, and security that shape how much value they actually deliver.

  • Benchmarks

    Benchmarks in multi-agent AI research measure coordination overhead, error propagation, and task performance, exposing how architectural choices translate into real costs across single- and multi-agent systems.

  • Developer productivity

    Developer productivity spans tooling choices, organizational alignment, and the human skills those tools depend on, with a growing body of sources questioning whether AI-assisted workflows deliver on their promise without eroding the judgment they require.

  • LLM Engineering

    The practical discipline of building, evaluating, and operating systems that use large language models, spanning knowledge architecture, agent control flow, inference optimization, and the human and organizational costs of getting it wrong.

  • LLM inference

    LLM inference spans the full stack from VRAM constraints and quantization choices on consumer hardware to latency optimization in production agent services, with tooling debates about transparency, local runtimes, and cost-efficient alternatives to large models.
