
Citation faithfulness

Does the wiki say what its sources say? Two AI judges grade every citation; the headline is how often they agree.

The question

When the wiki cites a source, the source backs the claim. And cheap Haiku agrees with pricier Sonnet often enough that routine checks don't need Sonnet at all.

Judge agreement: 89% · 167 claims

Faithfulness · does the source back the claim?

85% supported · 14% partial · 1% unsupported (334 claims, both judges)

Haiku: 88% · 167 claims

Sonnet: 83% · 167 claims

Legend: supported · partial · unsupported. Paraphrase counts; inference past the source doesn't.

Agreement · do the two judges return the same verdict?

89% · 148 agree · 19 disagree (of 167 claims both judges scored)

Each judge pulls its own citing sentences, so only the 167 both happened to extract can be compared head-to-head.
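A minimal sketch of that head-to-head comparison, assuming each judge's output is a mapping from a claim key to a verdict string. The function name `agreement` and the sample keys are hypothetical and are not taken from the pipeline itself.

```python
def agreement(haiku: dict[str, str], sonnet: dict[str, str]) -> tuple[int, int, float]:
    """Compare verdicts only on claims that both judges extracted."""
    shared = haiku.keys() & sonnet.keys()  # claims scored by both judges
    agree = sum(1 for key in shared if haiku[key] == sonnet[key])
    rate = agree / len(shared) if shared else 0.0
    return agree, len(shared), rate


# Hypothetical sample: "c4" and "c5" were each extracted by only one judge,
# so they drop out of the comparison.
haiku = {"c1": "supported", "c2": "supported", "c3": "partial", "c4": "supported"}
sonnet = {"c1": "supported", "c2": "partial", "c3": "partial", "c5": "unsupported"}

agree, total, rate = agreement(haiku, sonnet)
print(f"{agree} of {total} shared claims agree ({rate:.0%})")
```

With the counts reported above, 148 / 167 rounds to the 89% headline.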

16 of 19 splits land on the partial / supported line — both judges read the source as on-topic but split on how much it backs. Only 3 are a hard supported↔unsupported flip. Open a row to read each judge's reasoning side by side.
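One way to bucket splits like this, as a sketch: the three verdict strings come from the rubric, while the function name `split_kind`, the bucket labels, and the sample pairs are assumptions made for illustration.

```python
from collections import Counter


def split_kind(haiku: str, sonnet: str) -> str:
    """Label a pair of verdicts as agreement, a boundary split, or a hard flip."""
    if haiku == sonnet:
        return "agree"
    pair = {haiku, sonnet}
    if pair == {"supported", "unsupported"}:
        return "hard flip"
    if pair == {"supported", "partial"}:
        return "partial/supported"
    return "partial/unsupported"


# Hypothetical (Haiku, Sonnet) verdict pairs, not the real 167.
pairs = [
    ("supported", "supported"),
    ("supported", "partial"),
    ("partial", "supported"),
    ("unsupported", "supported"),
    ("supported", "unsupported"),
]
counts = Counter(split_kind(h, s) for h, s in pairs)
print(dict(counts))
```

The direction of the split is deliberately ignored here, so supported→partial and partial→supported land in the same bucket.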

One critique goes further: agent memory fails when it stores assertions rather than beliefs, missing provenance, confidence, and revision history.

Haiku unsupported → Sonnet supported

Haiku: unsupported

The source summary mentions that agent memory systems fail by storing assertions rather than beliefs and missing provenance, confidence, and revision history. However, the claim adds 'scope' to the list of missing elements, while the source lists 'scope' instead of 'revision history.' The source does mention revision history implicitly through 'supersession' and 'outcome-scored pruning,' but the exact phrasing of what is missing does not match the claim's specification.

Sonnet: supported

The source summary explicitly states that agent memory systems fail because they 'store assertions rather than beliefs -- missing provenance, confidence, scope, and revision history,' which directly entails the claim.


On the backend, Shell Tricks That Actually Make Life Easier is actually about the divide between interview performance and real work.

Haiku supported → Sonnet unsupported

Haiku: supported

The source directly states that algorithm interviews test skills that 'weakly correlate with production performance' and contrasts this with what 'real engineering requires,' explicitly addressing the divide between interview performance and actual work. The summary's framing matches the claim's assertion about this divide.

Sonnet: unsupported

The source is titled 'Learn Algorithms for Interviews, Forget Them for Work' and is about the divide between interview performance and real work. The claim misidentifies this source as being about 'Shell Tricks That Actually Make Life Easier,' which is a different article entirely.


A paper cited by Falk and Tsoukalas argues that competitive pressure pushes firms to shed workers before automation's productivity gains are confirmed, producing collectively suboptimal outcomes.

Haiku supported → Sonnet unsupported

Haiku: supported

The source summary directly states that 'competitive pressure pushes them to lay off workers prematurely, leading to collectively suboptimal outcomes even when automation's productivity gains are uncertain.' This matches the claim's core assertion about competitive pressure driving premature workforce reduction before productivity gains are confirmed.

Sonnet: unsupported

The claim attributes this argument to a paper cited by Falk and Tsoukalas within a source about cognitive offloading and LLMs. The source text supplied is about 'The AI Layoff Trap' and does not mention Falk and Tsoukalas or present itself as a paper cited by them. The source text is an unrelated reading entry, so it does not support the attribution made in the claim.


Papers like CAMEL, MetaGPT, and AutoGen demonstrated that agents could be made to collaborate, but shipped without concurrency control or escalation paths (Getting Up to Speed, Part 3).

Haiku supported → Sonnet partial

Haiku: supported

The source summary explicitly states that the five papers (including CAMEL, MetaGPT, and AutoGen) have 'shared failure modes like missing concurrency control and no escalation paths,' which directly supports the claim that these papers 'shipped without concurrency control or escalation paths.'

Sonnet: partial

The source summary confirms CAMEL, MetaGPT, and AutoGen are among the five papers covered and that missing concurrency control and no escalation paths are identified failure modes. However, the source does not explicitly state these papers 'demonstrated agents could collaborate,' and the failure modes are listed as shared across the wave rather than specifically attributed to those three papers.


Aiyan's data engineering case study traced a system through three architectures and found that environmental constraints, tool design, and context visibility outperformed prompt refinements at every stage.

Haiku supported → Sonnet partial

Haiku: supported

The source summary directly states that a data engineering agent evolved through three architectures and that environmental constraints (tool design, ID keys, context visibility) outperform prompt engineering for LLM reliability. This matches the claim that environmental constraints, tool design, and context visibility outperformed prompt refinements at every stage across three architectures.

Sonnet: partial

The source confirms a system evolved through three architectures and that environmental constraints, tool design, and context visibility outperform prompt engineering. However, the source does not attribute the case study to 'Aiyan,' and the claim's phrase 'at every stage' goes slightly beyond what the source explicitly states, which only says these factors outperform prompt engineering generally.


The AI Memory Systems feature comparison tracks 74 systems across architecture, data model, and benchmarks, indicating how fragmented this space remains.

Haiku supported → Sonnet partial

Haiku: supported

The source explicitly states it is 'A live comparison table of 74 AI agent memory systems across architecture, data model' and additional dimensions like 'benchmarks.' The claim that the comparison 'tracks 74 systems across architecture, data model, and benchmarks' directly matches the source's description.

Sonnet: partial

The source confirms 74 systems tracked and that the comparison spans architecture, data model, and benchmarks, matching those specific claims. However, the claim that this 'indicates how fragmented this space remains' is an interpretive conclusion not stated in the source, which only describes the table's contents without offering that editorial judgment.


Harness Design for Long-Running Application Development describes Anthropic's own GAN-inspired planner-generator-evaluator architecture for multi-hour coding sessions, where the evaluator role exists specifically to counter self-evaluation bias.

Haiku supported → Sonnet partial

Haiku: supported

The source summary explicitly describes 'a GAN-inspired multi-agent architecture—planner, generator, and evaluator' used during 'multi-hour autonomous coding sessions' and mentions that it 'overcomes...self-evaluation bias.' This directly supports the claim about the planner-generator-evaluator architecture for multi-hour coding sessions and the evaluator's role in countering self-evaluation bias.

Sonnet: partial

The source summary mentions a GAN-inspired planner-generator-evaluator architecture for multi-hour sessions and self-evaluation bias, supporting most of the claim. However, it does not explicitly state the evaluator role exists 'specifically to counter self-evaluation bias'; it only says the architecture 'overcomes' both context anxiety and self-evaluation bias without singling out the evaluator's purpose.


Zod also appears in Seven Cool JavaScript Libraries as a general-purpose runtime type validation tool.

Haiku partial → Sonnet supported

Haiku: partial

The source mentions Zod as one of seven libraries covered but does not explicitly describe it as a 'general-purpose runtime type validation tool.' The summary only notes that the article covers 'what each does' without specifying Zod's particular function.

Sonnet: supported

The source summary explicitly lists Zod among the seven libraries covered, and the article's framing as a tour of what each library does supports characterizing Zod as a runtime type validation tool within that context.


Helply demonstrates this with live meeting transcription and AI-generated answers, offloading cognitive work that previously required manual note-taking and retrieval.

Haiku supported → Sonnet partial

Haiku: supported

The source states Helply provides 'real-time transcription and AI-generated answers during calls,' which directly supports the claim about 'live meeting transcription and AI-generated answers.' The claim's characterization of these features as 'offloading cognitive work that previously required manual note-taking and retrieval' is a reasonable paraphrase of automating what would otherwise be manual tasks.

Sonnet: partial

The source confirms real-time transcription and AI-generated answers during calls, supporting the meeting transcription and AI-answers part of the claim. However, the source says nothing about offloading cognitive work or replacing manual note-taking and retrieval; that framing goes beyond what the source states.


Finite Attention makes that cost explicit: systems that maximize data output without filtering for relevance shift cognitive burden onto on-call workers, producing burnout even when the underlying automation is technically sound.

Haiku supported → Sonnet partial

Haiku: supported

The source summary states that on-call burnout stems from 'systems designed to maximize data output without accounting for human attention limits.' This directly supports the claim's assertion that systems maximizing data output without filtering for relevance shift cognitive burden onto on-call workers and produce burnout. The claim's framing that automation can be 'technically sound' while still producing burnout aligns with the source's framing of the problem as a design issue rather than a technical failure.

Sonnet: partial

The source summary supports the part about systems maximizing data output without accounting for human attention limits causing on-call burnout. However, the claim adds that this occurs 'even when the underlying automation is technically sound,' which is not stated in the source. The source also proposes a solution (push-based, multi-bot architecture) but does not frame the cost as being 'made explicit' in the way the claim suggests.


The AI Memory Systems comparison table lists benchmark support as one of its filterable columns across 74 systems, treating the presence or absence of benchmark results as a first-class property of a tool's maturity.

Haiku supported → Sonnet partial

Haiku: supported

The source summary explicitly states the comparison table includes 'benchmarks' as one of the categories across which 74 AI agent memory systems are compared, and mentions 'filterable columns', directly supporting the claim about benchmark support being a filterable column among the systems listed.

Sonnet: partial

The source confirms a comparison table of 74 AI memory systems with filterable columns including benchmarks. However, the claim that the table treats benchmark presence as a 'first-class property of a tool's maturity' is an interpretive assertion not stated in the source, which only lists benchmarks as one of several columns.


Where tests run also matters: staging and production serve different verification needs, and treating them interchangeably creates false confidence (Playwright Testing in Staging vs Production).

Haiku partial → Sonnet supported

Haiku: partial

The source discusses a decision framework for splitting tests between staging and production environments, which touches on the topic that different environments serve different purposes. However, the source summary does not explicitly state that treating them interchangeably creates false confidence, only that tests should be split between the environments. The claim about false confidence goes beyond what is explicitly stated in the provided source text.

Sonnet: supported

The source explicitly addresses splitting tests between staging and production and covers 'which flows belong where,' directly supporting the claim that staging and production serve different verification needs and should not be treated interchangeably.


Compromised npm packages in the SAP ecosystem carried credential-stealing payloads that harvested cloud secrets and used CI-adjacent tooling as persistence vectors (SAP-Related npm Packages Compromised in Credential-Stealing Supply Chain Attack).

Haiku supported → Sonnet partial

Haiku: supported

The source summary directly supports the claim. It states that the compromised npm packages contained 'a credential-stealing, self-propagating payload that harvests cloud secrets' and 'abuses Claude Code and VS Code configs as persistence vectors.' This matches the claim's assertion about credential-stealing payloads that harvested cloud secrets and used CI-adjacent tooling as persistence vectors.

Sonnet: partial

The source confirms harvesting cloud secrets and using Claude Code and VS Code configs as persistence vectors, supporting most of the claim. However, the source describes these as 'persistence vectors' in VS Code/Claude Code configs, not specifically 'CI-adjacent tooling.' The claim's framing of 'CI-adjacent tooling' is a characterization not explicitly stated in the source.


MarkdownLM addresses the policy-enforcement side by blocking non-compliant code at the Git layer before it merges (MarkdownLM), though that targets standards conformance more than active compromise.

Haiku supported → Sonnet partial

Haiku: supported

The source states that MarkdownLM's Lun tool blocks 'non-compliant code at the Git layer before it merges,' which directly supports the claim's description of MarkdownLM blocking non-compliant code at the Git layer before it merges.

Sonnet: partial

The source confirms that MarkdownLM's Lun tool blocks non-compliant code at the Git layer before it merges, supporting that part of the claim. However, the source does not say this targets 'standards conformance more than active compromise'; that qualifier goes beyond what the source states.


Platform reliability is also a variable: declining GitHub stability (GitHub is Sinking) means CI pipelines built entirely on hosted platforms inherit that platform's risk profile.

Haiku partial → Sonnet supported

Haiku: partial

The source asserts that GitHub's reliability has declined, which supports the claim's notion of declining stability. However, the source does not specifically discuss how CI pipelines inherit platform risk or connect this decline to CI infrastructure, making it support only the narrower premise rather than the full causal claim about CI pipeline risk.

Sonnet: supported

The source explicitly states 'GitHub's reliability and quality have declined sharply under Microsoft,' which directly supports the claim about declining GitHub stability. The source's recommendation to migrate before further deterioration aligns with the inherited risk profile framing.


Jujutsu's first-class conflict handling and auto-committed working copy reframe version control as less interruptive, and a concrete workflow for reviewing large pull requests using jj shows how that translates to reduced cognitive overhead in practice.

Haiku supported → Sonnet partial

Haiku: supported

The source directly supports the claim's reference to Jujutsu's features. It states that Jujutsu 'auto-commits the working copy, records conflicts as first-class objects,' which matches the claim's description of 'first-class conflict handling and auto-committed working copy.' The source confirms these are core features of the system.

Sonnet: partial

The source confirms jj's first-class conflict handling and auto-committed working copy, which matches that part of the claim. However, the source does not mention anything about version control being 'less interruptive' or reduced cognitive overhead; those characterizations go beyond what the source states.


Building a UI Without Breakpoints argues that intrinsic layouts, container queries, and clamp() values are usually a better fit for component-first UIs than viewport breakpoints, because they encode intent more precisely.

Haiku supported → Sonnet partial

Haiku: supported

The source summary states that modern UIs should 'replace viewport breakpoints with intrinsic layouts, fluid clamp() values, container units, and container queries,' which directly supports the claim that intrinsic layouts, container queries, and clamp() values are better alternatives to viewport breakpoints for component-first UIs. The claim's characterization of these as 'encoding intent more precisely' is a reasonable paraphrase of the replacement rationale implicit in the source's argument structure.

Sonnet: partial

The source supports most of the claim: it argues for replacing viewport breakpoints with intrinsic layouts, fluid clamp() values, and container queries. However, the source does not mention encoding intent more precisely; that specific framing is absent from the source text.


Font pairing references and CSS text effects deal with typographic aesthetics rather than responsive scaling, and the Micrographics Figma library is about layout composition, not type sizing.

Haiku partial → Sonnet supported

Haiku: partial

The source describes the Micrographics library as containing layouts and modular building blocks, confirming it relates to 'layout composition.' However, the source does not explicitly state that it is 'not' about type sizing or discuss what it excludes, so the claim's contrastive framing (distinguishing it from responsive scaling) goes beyond what the source directly supports.

Sonnet: supported

The source describes a Figma library of micrographic layouts and vector symbols for 'adding technical, data-heavy texture to compositions,' which confirms it is about layout composition. Nothing in the source addresses type sizing or responsive scaling.


Utopia's type scale graph view makes it easier to see how two sizes within a fluid modular scale relate across viewport breakpoints, which affects how paired faces read relative to each other at different sizes — a pairing that feels balanced at desktop proportions may shift at minimum viewport if the scale compresses unevenly.

Haiku supported → Sonnet partial

Haiku: supported

The source confirms that Utopia's type scale graph view plots font sizes across viewports (min and max) to help designers understand relationships within a fluid modular scale. This directly supports the claim that the graph view makes it easier to see how two sizes relate across viewport breakpoints and how this affects pairing perception at different sizes.

Sonnet: partial

The source confirms Utopia's graph view plots font sizes across min and max viewports to clarify relationships within a fluid modular scale, supporting the first part of the claim. However, the source says nothing about font pairing, how paired faces read relative to each other, or how uneven scale compression at minimum viewport affects pairing balance -- those assertions go beyond what the source states.


The cost bet · what the agreement buys

Haiku: $0.24

Sonnet: $0.67

Haiku grades the same claims for $0.24 against Sonnet's $0.67 — about 2.8× cheaper. At 89% agreement, routine passes can run Haiku alone and route only the 19 splits to Sonnet and a human.
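The arithmetic behind those figures, as a rough sketch. The two totals and the claim counts come from this run; the per-claim breakdown is an assumption (cost scaling linearly with the number of claims graded), and the escalation estimate further assumes the 19 claims to escalate are already known.

```python
HAIKU_COST = 0.24   # dollars for the run, as reported above
SONNET_COST = 0.67  # dollars for the run, as reported above
CLAIMS = 167        # claims both judges scored
SPLITS = 19         # claims where the judges disagreed

# How many times more Sonnet costs than Haiku for the same claims.
ratio = SONNET_COST / HAIKU_COST

# Assumption: cost is linear in the number of claims graded.
sonnet_per_claim = SONNET_COST / CLAIMS
escalation_cost = SPLITS * sonnet_per_claim

# Haiku on everything, plus Sonnet on the splits only.
haiku_plus_escalation = HAIKU_COST + escalation_cost

print(f"Sonnet / Haiku cost ratio: {ratio:.1f}x")
print(f"Haiku pass + Sonnet on {SPLITS} splits: ${haiku_plus_escalation:.2f}")
print(f"Sonnet on everything: ${SONNET_COST:.2f}")
```

Under those assumptions the escalated pass comes to roughly $0.32, a little under half of a full Sonnet pass.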

Per-article · 22 wiki concepts · disagreements first

continuous-integration · 4 split

3 concepts · 2 split each

9 concepts · 1 split each (developer-productivity among them)

9 concepts · agree

An AI builds the wiki. It reads each saved article, clusters the articles by topic, and writes one synthesis paragraph per cluster, citing the sources it drew from. The question that matters here is whether those citations hold: when the paragraph cites an article, does the article actually say what the paragraph claims?

So for every sentence in the wiki that links to a saved article, two AI judges read the sentence and the source and return one of three verdicts:

supported — the article backs the claim. Paraphrase counts; inference past what the source says doesn’t.
partial — the article is on the topic but backs only a weaker version, or doesn’t quite get there.
unsupported — the article doesn’t back the claim, contradicts it, or is about something else.

The two judges aren’t the same size. Haiku is small and cheap; Sonnet is bigger and costs about three times as much. Grading the same claims with both turns the cost-quality tradeoff into a number: if they mostly agree, later passes can run Haiku alone and escalate only the disagreements to Sonnet and a human.

When they disagree, I read the source myself. Sometimes the citation is fine and a judge got it wrong. Sometimes the wiki overreached and the citation needs retagging or removal. And sometimes the claim sits in a gray zone the rubric hasn’t pinned down yet — those cases are worth the most, because they’re where the rubric grows.

All of this assumes the wiki cited the right article to begin with. That assumption gets its own check in Topic stability, which asks whether the topic tags the clustering relies on hold still over time. What each run costs lands in Ingest pipeline cost.

The judge prompt is versioned. Bump the version and every prior score is thrown out on the next pass, so verdicts written against two different rubrics never get averaged together.
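A sketch of how that invalidation might look, assuming scores live in a store tagged with the prompt version they were written against. `ScoreStore`, `record`, and the integer version numbers are hypothetical names chosen for illustration; the actual storage layer isn't described here.

```python
from dataclasses import dataclass, field


@dataclass
class ScoreStore:
    """Holds verdicts written against exactly one prompt (rubric) version."""

    prompt_version: int
    scores: dict[str, str] = field(default_factory=dict)

    def record(self, claim_id: str, verdict: str, prompt_version: int) -> None:
        if prompt_version != self.prompt_version:
            # Different rubric: drop every earlier score so verdicts from
            # two rubrics are never mixed in the same average.
            self.scores.clear()
            self.prompt_version = prompt_version
        self.scores[claim_id] = verdict


store = ScoreStore(prompt_version=1)
store.record("c1", "supported", prompt_version=1)
store.record("c2", "partial", prompt_version=1)

# The prompt is bumped to version 2; the next pass starts from a clean slate.
store.record("c1", "partial", prompt_version=2)
print(store.prompt_version, store.scores)
```

Clearing on the first write of the new version mirrors the "thrown out on the next pass" behavior: nothing is discarded until a pass actually runs against the bumped prompt.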
