Wiki: Flaky tests

A flaky test is one that produces inconsistent results across identical runs, making CI signal unreliable and eroding trust in the test suite. At scale the problem compounds quickly. Mendral’s CI agent handles 33 million test executions per week at PostHog and traces flaky tests to their root causes before automatically opening fix PRs, because at that volume manual triage is untenable.
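The definition above can be made concrete with a rerun check. The following is a minimal sketch, not a description of how Mendral or any CI product works: `classify_by_rerun` and `make_intermittent_test` are hypothetical names, and the intermittent test fails on every third call as a deterministic stand-in for a real race condition.

```python
from collections import Counter
from typing import Callable, Literal

Verdict = Literal["stable-pass", "stable-fail", "flaky"]


def classify_by_rerun(test: Callable[[], None], runs: int = 5) -> Verdict:
    """Run the same test several times with no code changes and compare outcomes."""
    outcomes: Counter = Counter()
    for _ in range(runs):
        try:
            test()
            outcomes["pass"] += 1
        except AssertionError:
            outcomes["fail"] += 1
    if outcomes["fail"] == 0:
        return "stable-pass"
    if outcomes["pass"] == 0:
        return "stable-fail"
    # Mixed outcomes across identical runs is the defining trait of a flaky test.
    return "flaky"


def make_intermittent_test() -> Callable[[], None]:
    """Hypothetical test that fails on every third call (stand-in for a race)."""
    calls = {"n": 0}

    def test() -> None:
        calls["n"] += 1
        assert calls["n"] % 3 != 0

    return test


def always_fails() -> None:
    assert False


print(classify_by_rerun(make_intermittent_test()))
print(classify_by_rerun(lambda: None))
print(classify_by_rerun(always_fails))
```

A "stable-pass" verdict from a handful of reruns is only weak evidence: a test that fails once in a thousand runs will usually pass five in a row, which is part of why detection at high volume relies on history across many executions.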

Much of the flakiness in frontend suites comes from how tests are written. Currents on UI refactors argues that tests break mainly because they couple to implementation details (CSS classes, DOM structure, positional relationships), not because selectors are inherently fragile. Selectors anchored to semantic roles, accessible names, and ARIA labels tend to survive UI changes because those attributes travel with intent rather than structure.
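The difference between structural and semantic selectors can be illustrated without a browser. This sketch uses a tiny in-memory tree as a stand-in for the DOM; `Node`, `find_by_class`, and `find_by_role` are hypothetical helpers written for this example, not part of any testing library.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    tag: str
    role: Optional[str] = None   # semantic role, e.g. "button"
    name: Optional[str] = None   # accessible name, e.g. "Save"
    css_class: str = ""
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def find_by_class(root: Node, css_class: str) -> Optional[Node]:
    """Structural lookup: depends on a styling detail."""
    return next((n for n in walk(root) if css_class in n.css_class.split()), None)


def find_by_role(root: Node, role: str, name: str) -> Optional[Node]:
    """Semantic lookup: depends on what the element is for."""
    return next((n for n in walk(root) if n.role == role and n.name == name), None)


before = Node("form", children=[
    Node("div", css_class="actions", children=[
        Node("button", role="button", name="Save", css_class="btn btn-primary"),
    ]),
])

# After a refactor: class names and nesting change, the button's purpose does not.
after = Node("form", children=[
    Node("footer", css_class="toolbar", children=[
        Node("span", children=[
            Node("button", role="button", name="Save", css_class="cta"),
        ]),
    ]),
])

print(find_by_class(before, "btn-primary") is not None)
print(find_by_class(after, "btn-primary") is not None)
print(find_by_role(after, "button", "Save") is not None)
```

The class-based lookup stops matching after the refactor while the role-based lookup still finds the same control. Browser testing libraries expose comparable role-based queries, for example `getByRole` in Playwright and Testing Library.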

AI-generated tests introduce their own sources of unreliability. How To Test Frontend documents patterns such as over-mocking, testing only happy paths, and writing assertions that match a buggy implementation rather than the intended behavior. Such tests may pass consistently while catching nothing. That is a different failure mode from intermittent failure, but it contributes to the same loss of confidence in the suite.
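The "assertion that matches a buggy implementation" pattern is easiest to see side by side with a check written from the requirement. This is an illustrative sketch: `apply_discount` is a hypothetical function with a deliberate bug, and the assumed requirement is that `percent` means a percentage of the price.

```python
from typing import Callable


def apply_discount(price_cents: int, percent: int) -> int:
    # Deliberate bug for illustration: subtracts `percent` as a flat number of cents.
    return price_cents - percent


def mirrored_check() -> None:
    # The expected value is derived the same way the code computes it,
    # so the bug is baked into the assertion and the check always passes.
    assert apply_discount(1000, 10) == 1000 - 10


def intent_check() -> None:
    # The expected value comes from the requirement: 10% off 1000 cents is 900 cents.
    assert apply_discount(1000, 10) == 900


def passes(check: Callable[[], None]) -> bool:
    """Return True if the check runs without an assertion failure."""
    try:
        check()
        return True
    except AssertionError:
        return False


print(passes(mirrored_check))
print(passes(intent_check))
```

The mirrored check is perfectly stable and reports green on every run, which is exactly why it is harmful: only the check written from the requirement exposes the defect.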

Tooling has moved to address detection and categorization. TestDino auto-categorizes failures into bugs, flaky tests, and UI changes, reducing the time engineers spend classifying failures before they can act on them. Environment mismatch also contributes. Currents on staging vs production notes that certain failure modes only appear in production, so tests that pass in staging may flake or fail there for reasons attributable to the environment rather than the code, which complicates root-cause analysis.
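To give a sense of what a three-way triage involves, here is a deliberately naive heuristic. It is not TestDino's method, which the source does not describe. The `Failure` record, the `LOCATOR_HINTS` fragments, and the rule order are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Literal

Category = Literal["flaky", "ui-change", "bug"]

# Hypothetical error fragments suggesting a selector no longer matches the page.
LOCATOR_HINTS = ("locator", "element not found", "no element matches")


@dataclass
class Failure:
    test_id: str
    error_message: str
    passed_on_retry: bool


def categorize(failure: Failure) -> Category:
    """Assign a failure to one of three buckets using two cheap signals."""
    # Same code, different outcome on retry: treat as flaky.
    if failure.passed_on_retry:
        return "flaky"
    # Consistent failure that mentions a missing element: likely a UI change.
    message = failure.error_message.lower()
    if any(hint in message for hint in LOCATOR_HINTS):
        return "ui-change"
    # Consistent failure with any other error: treat as a product bug by default.
    return "bug"


failures = [
    Failure("checkout-total", "expected 900 but received 990", passed_on_retry=False),
    Failure("login-submit", "Locator 'button.btn-primary' timed out", passed_on_retry=False),
    Failure("search-results", "request timed out", passed_on_retry=True),
]

for f in failures:
    print(f.test_id, categorize(f))
```

Two signals are not enough in practice. An environment-only failure, for instance, would land in the "bug" bucket here, so a real classifier would need more context such as the target environment and the test's failure history.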