Wiki synthesis pipeline

This site runs an LLM pipeline that turns saved reading into a per-concept wiki, with cost tracking and evals on every layer. So far it has read 118 sources, synthesized 43 wiki articles, and spent $2.34 across every priced call. The pipeline ships open-source on Cloudflare Workers and measures its own output through a tiered eval layer documented below. Numbers on this page are pulled from sidecar JSON at build time, so a stale count never sneaks in.

Architecture

[Diagram: wiki synthesis pipeline data flow. Reading sources (URLs saved via iOS Shortcut) → POST /link (fetch + summarize) → reading entry src/content/reading/<slug>.md with stable topics[] slugs → /synthesize (cluster by topic, ≥ MIN_WIKI_SOURCES) → wiki article src/content/wiki/<slug>.md, with crosslinking as a waitUntil follow-up → /eval (tier 0/1 today, 2-5 forthcoming) → labs sidecars src/content/labs/data/*.json → read at build by labs cells at /labs/<slug>. Legend: surface, artifact, worker endpoint.]
Pipeline data flow. Endpoints run in workers/site-ingest; artifacts land in src/content.

The diagram covers the three content layers and the worker endpoints that write them. Reading entries (src/content/reading/) hold one summary per source URL with a stable topics[] array. Wiki articles (src/content/wiki/) compile from clusters of reading entries that share a topic, gated by a 2-source minimum so under-cited concepts never ship. Labs sidecars (src/content/labs/data/) hold raw measurements that the labs cells render at build time. Keeping the three layers separate means a citation, a synthesis, and a measurement each have their own file on disk, their own Zod schema, and their own provenance trail.
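The topic gate described above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the `ReadingEntry` shape and the `clusterByTopic` name are assumptions, and only `topics[]` and the 2-source minimum come from the text.

```typescript
// Hypothetical shape: only topics[] is described in the text.
interface ReadingEntry {
  slug: string;
  topics: string[];
}

const MIN_WIKI_SOURCES = 2; // the 2-source minimum described above

// Group reading-entry slugs by topic slug, then drop topics that have
// fewer than minSources contributing entries.
function clusterByTopic(
  entries: ReadingEntry[],
  minSources: number = MIN_WIKI_SOURCES,
): Map<string, string[]> {
  const clusters = new Map<string, string[]>();
  for (const entry of entries) {
    // De-duplicate so a topic listed twice on one entry counts once.
    const uniqueTopics = Array.from(new Set(entry.topics));
    for (const topic of uniqueTopics) {
      const slugs = clusters.get(topic) ?? [];
      slugs.push(entry.slug);
      clusters.set(topic, slugs);
    }
  }
  const eligible = new Map<string, string[]>();
  clusters.forEach((slugs, topic) => {
    if (slugs.length >= minSources) eligible.set(topic, slugs);
  });
  return eligible;
}
```

Because clustering keys on exact slug equality, the result is deterministic for a given set of reading entries, which is what makes the stable `topics[]` array matter.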

The five mutating endpoints (/link, /synthesize, /recompile, /contribute, /eval) all run through one substrate in workers/site-ingest/src/pipeline.ts. Each endpoint implements a Strategy<S> interface that produces the file mutation; the substrate handles branch creation, commit, PR open, and secondary cross-link insertion. No endpoint re-implements GitHub plumbing. Cross-linking runs as a follow-up phase via ctx.waitUntil after the four prose-writing endpoints; /eval writes a sidecar JSON rather than a wiki or reading entry and opts out of the crosslink phase. The whole pipeline fits inside Cloudflare’s 30s wall-clock budget for a single request.
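One way to picture the split between strategy and substrate is the sketch below. Only `Strategy<S>` is named in the text; `FileMutation`, `GitPort`, `RunResult`, `runPipeline`, and `insertCrosslinks` are hypothetical names, and the calls are synchronous for brevity where real GitHub calls would be async.

```typescript
// Only Strategy<S> appears in the text; every other name is hypothetical.
interface FileMutation {
  path: string;
  content: string;
  message: string;
}

interface Strategy<S> {
  endpoint: string;    // e.g. "link" or "eval"
  crosslink: boolean;  // false for a sidecar-only endpoint such as /eval
  produce(input: S): FileMutation;
}

// Stand-in for the GitHub plumbing that the substrate owns.
interface GitPort {
  createBranch(branch: string): void;
  commit(branch: string, mutation: FileMutation): void;
  openPullRequest(branch: string, title: string): number;
}

interface RunResult {
  branch: string;
  pullRequest: number;
  // Deferred work; a Worker would hand this to ctx.waitUntil.
  followUps: Array<() => void>;
}

function runPipeline<S>(
  strategy: Strategy<S>,
  input: S,
  git: GitPort,
  insertCrosslinks: (mutation: FileMutation) => FileMutation,
): RunResult {
  // The strategy decides what changes; the substrate decides how it ships.
  const mutation = strategy.produce(input);
  const branch = `${strategy.endpoint}/${mutation.path.replace(/[^a-zA-Z0-9]+/g, "-")}`;
  git.createBranch(branch);
  git.commit(branch, mutation);
  const pullRequest = git.openPullRequest(branch, mutation.message);

  const followUps: Array<() => void> = [];
  if (strategy.crosslink) {
    followUps.push(() => git.commit(branch, insertCrosslinks(mutation)));
  }
  return { branch, pullRequest, followUps };
}
```

Under this shape, adding an endpoint means writing one `produce` function and choosing a `crosslink` flag; the branch, commit, and PR steps stay in one place.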

Schemas and provenance

Every artifact carries provenance in its frontmatter: compiled_at, compiled_with (model identifier), compile_cost (with the pricing snapshot captured at the moment of the call), and title_source (which branch produced the title). Schemas live in src/lib/schemas/content.ts and are imported by both the Astro site (for content-collection validation at build) and the worker (for pre-commit validation of drafted content). Zod once, validated in two places.
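To make the provenance fields concrete, here is a dependency-free sketch of a check over those four keys. The site itself uses Zod; this hand-rolled `validateProvenance` is a stand-in so the example stays self-contained, and the value types and the nested `usd` / `pricing_snapshot` shape are assumptions (only the four top-level field names come from the text).

```typescript
// Field names come from the text; value types and nesting are assumptions.
interface Provenance {
  compiled_at: string;    // ISO 8601 timestamp
  compiled_with: string;  // model identifier
  compile_cost: { usd: number; pricing_snapshot: Record<string, number> };
  title_source: string;   // which branch produced the title
}

// Returns a list of problems; an empty list means the frontmatter passes.
function validateProvenance(frontmatter: Record<string, unknown>): string[] {
  const errors: string[] = [];

  const at = frontmatter["compiled_at"];
  if (typeof at !== "string" || isNaN(Date.parse(at))) {
    errors.push("compiled_at must be a parseable timestamp");
  }

  const model = frontmatter["compiled_with"];
  if (typeof model !== "string" || model.length === 0) {
    errors.push("compiled_with must be a non-empty model identifier");
  }

  const cost = frontmatter["compile_cost"] as
    | { usd?: unknown; pricing_snapshot?: unknown }
    | null
    | undefined;
  const usd = cost ? cost.usd : undefined;
  const snapshot = cost ? cost.pricing_snapshot : undefined;
  if (typeof usd !== "number" || usd < 0) {
    errors.push("compile_cost.usd must be a non-negative number");
  }
  if (typeof snapshot !== "object" || snapshot === null) {
    errors.push("compile_cost.pricing_snapshot is required");
  }

  const titleSource = frontmatter["title_source"];
  if (typeof titleSource !== "string" || titleSource.length === 0) {
    errors.push("title_source must be a non-empty string");
  }
  return errors;
}
```

The same function could be imported by a build step and by a worker before commit, which is the "defined once, validated in two places" idea in miniature.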

Eval layer

Tier 0: structural lint

Catches orphan citations, hallucinated related_concepts, and sub-threshold concepts. Cheap, deterministic, no LLM call. Lives in workers/site-ingest/src/lint.ts and runs on every PR. Tier 0 is the only tier that’s fully shipped today.
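A sketch of what a deterministic lint pass over those three rules might look like follows. It is not the contents of lint.ts: the `WikiArticle` shape, the `citedSources` field, and the reading of "orphan citation" as "a body citation whose slug is missing from the article's sources list" are all assumptions.

```typescript
// Hypothetical shape; citedSources stands for slugs cited in the article body.
interface WikiArticle {
  slug: string;
  sources: string[];
  related_concepts: string[];
  citedSources: string[];
}

type LintRule = "orphan-citation" | "hallucinated-related-concept" | "sub-threshold";

interface LintIssue {
  slug: string;
  rule: LintRule;
  detail: string;
}

function lintWiki(articles: WikiArticle[], minSources: number = 2): LintIssue[] {
  const wikiSlugs = new Set(articles.map((a) => a.slug));
  const issues: LintIssue[] = [];
  for (const article of articles) {
    // Sub-threshold: too few contributing sources.
    if (article.sources.length < minSources) {
      issues.push({ slug: article.slug, rule: "sub-threshold", detail: `${article.sources.length} source(s)` });
    }
    // Orphan citation (assumed meaning): cited in the body but not listed as a source.
    for (const cited of article.citedSources) {
      if (article.sources.indexOf(cited) === -1) {
        issues.push({ slug: article.slug, rule: "orphan-citation", detail: cited });
      }
    }
    // Hallucinated related concept: points at a wiki slug that does not exist.
    for (const related of article.related_concepts) {
      if (!wikiSlugs.has(related)) {
        issues.push({ slug: article.slug, rule: "hallucinated-related-concept", detail: related });
      }
    }
  }
  return issues;
}
```

Everything here is set membership and length checks, which is why a pass like this can run on every PR without a model call.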

Tier 1: citation faithfulness (Haiku vs Sonnet judge)

A two-judge LLM-as-judge eval. For each [claim → source] pair in a wiki article, both Haiku and Sonnet rate whether the source actually supports the claim. Headline numbers from the latest run: Sonnet accuracy —; Haiku and Sonnet agree on — of pairs. Disagreement between judges is the interesting signal: those rows get pulled into the watchlist for human review. See the labs cell.
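The agreement rate and the disagreement watchlist reduce to a small aggregation once both judges have answered. The sketch below assumes a binary verdict per judge and uses hypothetical names (`JudgedPair`, `summarizeJudges`); the judge calls themselves are out of scope.

```typescript
type Verdict = "supported" | "unsupported";

// One [claim → source] pair with both judges' verdicts (hypothetical shape).
interface JudgedPair {
  claim: string;
  source: string;
  haiku: Verdict;
  sonnet: Verdict;
}

interface JudgeSummary {
  agreement: number | null; // share of pairs where the judges agree; null if no pairs
  watchlist: JudgedPair[];  // pairs where the judges disagree
}

function summarizeJudges(pairs: JudgedPair[]): JudgeSummary {
  const watchlist = pairs.filter((p) => p.haiku !== p.sonnet);
  const agreement =
    pairs.length === 0 ? null : (pairs.length - watchlist.length) / pairs.length;
  return { agreement, watchlist };
}
```

Returning `null` for an empty run keeps "no data yet" distinct from "0% agreement", which matters for a page that renders before the first eval has run.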

Tier 2: synthesis quality rubric

Scores articles on novelty (does it say something the individual sources don’t?), coherence (does it read like one essay or three stapled together?), depth, and internal consistency. Mean novelty: —. Mean coherence: —. See the labs cell (forthcoming).

Tier 3: voice adherence

Catches drift from voice-reference.md. The eval feeds a wiki paragraph and the voice reference to a judge model and asks for a binary “in voice / out of voice” verdict, with a flagged-paragraph list as the actionable output. See the labs cell (forthcoming).

Tier 4: recompile stability watchlist

Deterministic, no LLM call. Tracks which wiki articles were rewritten when new reading entries tagged with the article’s topic landed, and surfaces articles that churn often relative to the size of their new-source delta. Currently — articles flagged. — recompiles in the last 30 days. See the labs cell (forthcoming).
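One possible way to score churn relative to the new-source delta is sketched below. The metric (fraction of lines changed, divided by the number of new sources) and the 0.5 threshold are illustrative assumptions, as are the `Recompile` shape and the function names; the text does not specify the actual formula.

```typescript
// Hypothetical record of one recompile event.
interface Recompile {
  slug: string;
  changedLines: number;
  totalLines: number;
  newSources: number;
}

// Assumed metric: share of the article rewritten per new source.
function churnScore(r: Recompile): number {
  if (r.totalLines <= 0) return 0;
  const rewrittenShare = r.changedLines / r.totalLines;
  return rewrittenShare / Math.max(r.newSources, 1);
}

// Returns the slugs of articles with at least one recompile at or above the threshold.
function flagChurn(recompiles: Recompile[], threshold: number = 0.5): string[] {
  const flagged = new Set<string>();
  for (const r of recompiles) {
    if (churnScore(r) >= threshold) flagged.add(r.slug);
  }
  return Array.from(flagged).sort();
}
```

With this scoring, a full rewrite triggered by a single new source scores 1.0 and is flagged, while a 30% edit spread across three new sources scores about 0.1 and is not.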

Tier 5: cross-model synthesis comparison

Compiles the same topic-cluster with Haiku, Sonnet, and Opus, then scores the results across the Tier 2 rubric to plot a cost-quality frontier.

In flight; the cell will land at /labs/model-comparison after the first comparison run.
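A cost-quality frontier is the set of candidates that no other candidate beats on both axes at once. The sketch below shows that selection under assumed names (`Candidate`, `paretoFrontier`) and treats quality as a single number, for example a mean over the Tier 2 rubric; it does not use any measured results.

```typescript
// Hypothetical record: one compile of a topic cluster by one model.
interface Candidate {
  model: string;
  costUsd: number;
  quality: number; // assumed: a single aggregate rubric score, higher is better
}

// Keep candidates that are not dominated: no other candidate is at most as
// costly and at least as good while being strictly better on one axis.
function paretoFrontier(candidates: Candidate[]): Candidate[] {
  return candidates
    .filter(
      (c) =>
        !candidates.some(
          (o) =>
            o !== c &&
            o.costUsd <= c.costUsd &&
            o.quality >= c.quality &&
            (o.costUsd < c.costUsd || o.quality > c.quality),
        ),
    )
    .sort((a, b) => a.costUsd - b.costUsd);
}
```

A model that costs more and scores lower than a cheaper one drops off the frontier; what remains is the list of defensible choices at each price point.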

Failure modes catalog

The eval layer surfaces five categories of failure. The list below describes the categories, how the eval catches them, and where to see real examples once the eval has run.

Citation drift. A wiki claim is broader than what the cited source actually supports. Tier 1 catches this when the judge model rates a [claim → source] pair as unsupported. First production examples will land at /labs/citation-faithfulness after the first eval run.

Voice drift. A paragraph doesn’t match the site voice (corporate adjectives, hedging qualifiers, em-dashes, the not-X-but-Y construction). Tier 3 catches this paragraph-by-paragraph against voice-reference.md. First production examples will land at /labs/voice-adherence after the first eval run.

Recompile churn. A single new reading source triggers a full rewrite of a wiki article, even when the source adds nothing material. Tier 4 catches this deterministically by tracking the diff size of recompiles relative to their new-source delta. Real examples surface in the recompile stability cell (forthcoming) as they accumulate.

Hallucinated related concepts. A wiki article cites related_concepts: [...] slugs that don’t exist in the wiki collection. Tier 0 catches this on every PR via the structural lint and rejects the article before merge.

Sub-threshold synthesis. A wiki article gets compiled with fewer than 2 contributing sources. Tier 0 catches this in the schema: the sources field in the wiki frontmatter Zod schema is declared with min(MIN_WIKI_SOURCES), so an under-cited article fails build-time validation before it ever renders.

Cost economics

Total spend across every priced call so far: $2.34. Cost per wiki article (total spend divided by 43 compiled articles): $0.054. Cost per reading source (total spend divided by 118 ingested sources): $0.020. The breakdown by model and the daily series live in the ingest-pipeline-cost lab cell, regenerated on every build by scripts/labs-aggregate.mjs.
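The two ratios above are plain division, reproduced here as a check on the arithmetic. The helper names `perUnitCost` and `formatUsd` are hypothetical and are not taken from scripts/labs-aggregate.mjs.

```typescript
// Divide total spend by a unit count, guarding against an empty collection.
function perUnitCost(totalUsd: number, count: number): number {
  if (count <= 0) throw new Error("count must be positive");
  return totalUsd / count;
}

// Three decimals matches the precision quoted in the text.
function formatUsd(value: number): string {
  return `$${value.toFixed(3)}`;
}

const totalSpendUsd = 2.34;
console.log(formatUsd(perUnitCost(totalSpendUsd, 43)));  // $0.054 per wiki article
console.log(formatUsd(perUnitCost(totalSpendUsd, 118))); // $0.020 per reading source
```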

The cadence shapes the numbers. /now runs once a week. /link runs when I save a URL from my phone, which clusters in evening reading sessions. /synthesize runs on demand when a topic crosses the threshold. This is a personal-cadence pipeline, not a high-volume one, so the absolute spend is small. The point of tracking it is the per-article and per-source ratios, which set the budget envelope for any future change to prompts or models.

What I’d build differently

Prompt caching is the obvious miss. I skipped it deliberately because the cadence doesn’t fit: /now runs weekly (cache TTL is 5 minutes or 1 hour), /link runs ad-hoc with a different URL each time, and /synthesize runs once per topic compile. None of those flows reuse a system prompt within the cache window. If a future surface starts hitting the same prompt several times in a session (a chat-with-the-wiki interface, say), caching becomes free money and I’d add it then.
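The cadence argument can be checked with a toy model: count how many calls would land inside the cache window of a previous call with the same prompt. This is a simplification under stated assumptions (every call writes or refreshes the cache entry, and a hit needs the same prompt key within the TTL); `Call` and `countCacheHits` are hypothetical names, not part of any provider API.

```typescript
// Hypothetical record of one model call.
interface Call {
  promptKey: string; // identifies the cacheable prompt prefix
  atMs: number;      // call time in milliseconds
}

// Count calls that arrive within ttlMs of the previous call with the same key.
function countCacheHits(calls: Call[], ttlMs: number): number {
  const lastSeen = new Map<string, number>();
  const ordered = calls.slice().sort((a, b) => a.atMs - b.atMs);
  let hits = 0;
  for (const call of ordered) {
    const previous = lastSeen.get(call.promptKey);
    if (previous !== undefined && call.atMs - previous <= ttlMs) hits++;
    // Assumption: each use refreshes the window for that key.
    lastSeen.set(call.promptKey, call.atMs);
  }
  return hits;
}
```

Under this model a weekly job never scores a hit at either a 5-minute or a 1-hour TTL, while a burst of same-prompt calls a minute apart hits on every call after the first.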

A vector index for retrieval is the second skip. The current architecture is Karpathy’s “LLM Wiki” pattern: hand-curated topics, deterministic clustering by topics[] slug, no embeddings. That holds up at the current 43 concepts because clustering is small and stable. If the wiki passes ~200 articles I’d revisit, both for fuzzy topic matching at ingest and for an on-site search that knows about synonyms.

Repo and code