Wiki: Reliability

Across the sources here, a common argument recurs: reliability is a property you engineer into the structure of a system, not one you achieve by asking it nicely or testing after the fact.

The clearest statement of this comes from an agentic context. Aiyan’s account of evolving a data engineering agent through three architectures concludes that environmental constraints, specifically tool design, stable ID keys, and context visibility, outperform prompt engineering as reliability mechanisms. Christopher Meiklejohn’s empirical survey of multi-agent systems reinforces this: failure rates of 41–87% in production trace to inter-agent reasoning failures that are structurally harder to fix than prompt-level issues. His firsthand experience building a social app with Claude confirms the consequence: an agent that consistently declares work done after minimal checks forces manual verification of every feature, even after 52 added guardrails.

For distributed systems, the same principle applies at the infrastructure layer. Temporal’s durable execution model persists workflow state at every step so applications recover automatically from failures without manual reconciliation. Jack Vanlightly’s taxonomy of three durable function forms shows how Temporal, Restate, DBOS, and Resonate each encode this guarantee differently across stateless functions, sessions, and actors.
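The idea of persisting state at every step can be illustrated with a toy journal. This is a minimal sketch of the general technique, not the API of Temporal, Restate, DBOS, or Resonate; the names DurableRun, order_workflow, charge, and ship are hypothetical, and a JSON file stands in for whatever durable store a real engine would use.

```python
import json
import os
import tempfile


class DurableRun:
    """Toy journal (hypothetical): records each step's result so a rerun can skip it."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.results = {}
        if os.path.exists(journal_path):
            with open(journal_path) as f:
                self.results = json.load(f)

    def step(self, name, fn):
        # A step that already completed is replayed from the journal, not re-executed.
        if name in self.results:
            return self.results[name]
        value = fn()
        self.results[name] = value
        # Persist before moving on. Writing to a temp file and renaming it means
        # a crash mid-write cannot leave a half-written journal behind.
        tmp = self.journal_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.results, f)
        os.replace(tmp, self.journal_path)
        return value


calls = []  # records which side effects actually ran


def charge():
    calls.append("charge")
    return {"charge_id": "ch_1"}


def ship(fail):
    calls.append("ship")
    if fail:
        raise RuntimeError("simulated crash")
    return {"tracking": "trk_1"}


def order_workflow(run, fail_on_ship):
    payment = run.step("charge", charge)
    shipment = run.step("ship", lambda: ship(fail_on_ship))
    return {**payment, **shipment}


journal = os.path.join(tempfile.mkdtemp(), "order-42.json")

try:
    order_workflow(DurableRun(journal), fail_on_ship=True)
except RuntimeError:
    pass  # the process "crashed" after charging but before shipping

# A fresh DurableRun simulates a restarted process that reads the journal.
result = order_workflow(DurableRun(journal), fail_on_ship=False)
```

After the simulated crash, the second run picks up the recorded charge from the journal and only re-attempts the shipping step, so the customer is not charged twice. Real engines add retries, timers, and deterministic replay on top of this basic record-then-skip pattern.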

At the boundary between systems, Zod schema validation in Angular catches unexpected backend response shapes at development time rather than letting them surface as runtime errors. The same instinct appears in Emphere’s security tooling: fixture invariants and red runs that prove the system fails loudly when it overclaims certainty, rather than silently misbehaving.
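The boundary check described above can be sketched without any particular library. The example below is a hand-rolled stand-in written in Python, not Zod's API and not Emphere's tooling; User, parse_user, and ResponseShapeError are hypothetical names, and the two-field payload is an assumed shape chosen only for illustration.

```python
from dataclasses import dataclass


class ResponseShapeError(ValueError):
    """Raised when a backend payload does not match the expected shape."""


@dataclass(frozen=True)
class User:
    id: int
    email: str


def parse_user(payload):
    # Check the shape once, at the boundary, so the rest of the code can
    # rely on a well-formed User instead of probing a raw dict.
    if not isinstance(payload, dict):
        raise ResponseShapeError(f"expected an object, got {type(payload).__name__}")
    problems = []
    user_id = payload.get("id")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(user_id, int) or isinstance(user_id, bool):
        problems.append("id must be an integer")
    if not isinstance(payload.get("email"), str):
        problems.append("email must be a string")
    if problems:
        # Fail loudly with every problem listed, rather than passing bad data on.
        raise ResponseShapeError("; ".join(problems))
    return User(id=user_id, email=payload["email"])


ok = parse_user({"id": 7, "email": "a@example.com"})

try:
    parse_user({"id": "7", "email": None})
    error_message = None
except ResponseShapeError as exc:
    error_message = str(exc)
```

The design choice is that a malformed response raises at the point of entry with a specific message, instead of propagating as a half-valid object that fails somewhere far from its cause.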

Testing contributes to reliability, but only when tests are structured to survive change. Playwright suites that couple to CSS classes and DOM structure break during refactors; tests written against semantic roles and accessible names do not. TestDino’s auto-categorization of failures as bugs, flaky tests, or UI changes makes the distinction legible at scale.

Reliability can also be undermined by architectural decisions that look harmless. A GitHub merge queue bug silently deleted thousands of lines by building temp branches off the wrong base commit; Trunk avoided it entirely by never pushing temp branches to main. Anton Zaides’s unwritten engineering rules distill a similar lesson: roll back before debugging, and treat every external dependency as a future outage.

Daniel Stenberg’s analysis of curl’s bug data is a useful corrective to optimism: despite powerful AI-assisted static analysis, there is no measurable sign yet that open-source projects are approaching zero latent bugs. Yaron Minsky at Jane Street argues the inverse case: agentic coding has made formal verification newly cost-effective precisely because tests alone cannot provide the guarantees that high-stakes systems now require.