Flaky tests
Tests that pass and fail non-deterministically, caused by timing issues, environmental coupling, or brittle selectors; tooling and architecture choices at every layer of the CI stack affect how teams detect, categorize, and fix them.
5 sources · May 6, 2026
Compiled by Claude
A flaky test is one whose result cannot be trusted because it passes or fails for reasons unrelated to the code under test. The sources here approach the problem from three angles: automated diagnosis at scale, tooling that classifies failures, and selector discipline that prevents flakiness in the first place.
At PostHog’s scale, Mendral’s AI agent processed 33 million weekly test executions, auto-diagnosed flaky tests, opened fix PRs, and routed alerts (What CI Actually Looks Like at a 100-Person Team). The finding there is that log ingestion speed and routing matter more than the AI diagnosis itself; flaky tests are a volume problem before they are an intelligence problem.
TestDino targets Playwright users with an analytics layer that auto-categorizes failures as bugs, flaky tests, or UI changes (TestDino). The categorization matters because flaky failures and genuine bugs demand different responses, and conflating them wastes triage time.
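The distinction between a flaky failure and a consistent one can be sketched with a simple heuristic: a test that both passes and fails on the same commit is flaky, while one that only ever fails on a commit is more likely a genuine bug. This is an illustrative sketch, not TestDino's method; the `Run` record, the `classify` function, and the three labels are hypothetical names, and detecting UI changes is out of scope here.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Run(NamedTuple):
    """One recorded execution of a test (hypothetical record shape)."""
    test_id: str
    commit: str
    passed: bool


def classify(runs: Iterable[Run]) -> Dict[str, str]:
    """Label each test 'stable', 'flaky', or 'failing' from its run history.

    'flaky'   = mixed pass/fail results on at least one commit.
    'failing' = failed consistently on at least one commit, never mixed.
    """
    # Group outcomes by (test, commit): identical code should give one result.
    outcomes: Dict[Tuple[str, str], Set[bool]] = defaultdict(set)
    for run in runs:
        outcomes[(run.test_id, run.commit)].add(run.passed)

    labels: Dict[str, str] = {}
    for (test_id, _commit), seen in outcomes.items():
        if seen == {True, False}:
            labels[test_id] = "flaky"  # mixed results on identical code
        elif seen == {False} and labels.get(test_id) != "flaky":
            labels[test_id] = "failing"  # consistent failure on this commit
        else:
            labels.setdefault(test_id, "stable")
    return labels


history = [
    Run("login", "abc123", True),
    Run("login", "abc123", False),  # same commit, different result
    Run("cart", "abc123", False),
    Run("cart", "abc123", False),
    Run("home", "abc123", True),
]
labels = classify(history)
```

Here `login` would be routed to a flaky-test queue and `cart` to bug triage. A real system would also need retry counts and time windows, which this sketch omits.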
The Currents.dev piece on Playwright selector strategy addresses the root cause most within a team’s direct control (Designing Playwright Tests That Survive UI Refactors). Tests coupled to CSS classes, DOM structure, or unstable text content fail during refactors not because of flakiness in the probabilistic sense but because the coupling makes them structurally fragile. A tiered selector hierarchy favoring semantic roles, ARIA labels, and explicit test attributes reduces that brittleness.
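One way to enforce such a hierarchy is a small lint pass over test source that flags raw `locator(...)` calls. The sketch below is an assumption-laden illustration, not something taken from the Currents.dev piece: the tier names, their ordering, and the helper functions are hypothetical. It relies only on Playwright's real `getByRole`, `getByLabel`, `getByTestId`, and `locator` method names, and it crudely treats every raw `locator(...)` call as brittle, even one that targets a test attribute.

```python
import re
from typing import List, Optional

# Tiers ordered from most to least refactor-resistant (assumed ordering).
TIERS = [
    ("role", re.compile(r"\.getByRole\(")),
    ("label", re.compile(r"\.getByLabel\(")),
    ("test-id", re.compile(r"\.getByTestId\(")),
    ("brittle", re.compile(r"\.locator\(")),  # raw CSS/XPath selector
]


def selector_tier(line: str) -> Optional[str]:
    """Return the least robust tier used on a line, or None if no locator."""
    for name, pattern in reversed(TIERS):
        if pattern.search(line):
            return name
    return None


def brittle_lines(source: str) -> List[int]:
    """Return 1-based line numbers that use a 'brittle' tier selector."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if selector_tier(line) == "brittle"
    ]


sample = (
    "await page.getByRole('button', { name: 'Save' }).click();\n"
    "await page.locator('.btn-primary > span').click();\n"
    "await page.getByTestId('save-button').click();\n"
)
flagged = brittle_lines(sample)
```

Only the second line of `sample` is flagged, since it depends on a CSS class and DOM nesting that a refactor could change without altering behavior.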
The merge-queue piece is adjacent (What Happens If a Merge Queue Builds on the Wrong Commit). It describes a GitHub bug that silently constructed temp branches from stale divergence points rather than HEAD. That is not flakiness in the test sense, but it produces the same symptom: a CI result that does not reflect the actual code state. Infrastructure correctness and test determinism are both prerequisites for a trustworthy signal.
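A guard against this class of problem is to confirm that the target branch's current HEAD is an ancestor of the commit CI actually built. The sketch below runs that check on an in-memory parent map; the commit names and the `is_ancestor` helper are hypothetical, and it is not how GitHub's merge queue works internally. In a real repository, `git merge-base --is-ancestor <head> <commit>` answers the same question.

```python
from typing import Dict, List


def is_ancestor(parents: Dict[str, List[str]], ancestor: str, commit: str) -> bool:
    """Return True if `ancestor` is reachable from `commit` via parent links."""
    stack, seen = [commit], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return False


# Hypothetical history: main is m1 <- m2 <- m3, and the PR branched at m1.
parents = {
    "m1": [],
    "m2": ["m1"],
    "m3": ["m2"],
    "pr": ["m1"],
    "queue_good": ["m3", "pr"],   # temp branch merged onto current HEAD
    "queue_stale": ["m1", "pr"],  # temp branch built from the old divergence point
}

good = is_ancestor(parents, "m3", "queue_good")
stale = is_ancestor(parents, "m3", "queue_stale")
```

With this history, `good` is true and `stale` is false, so a CI job could refuse to report a result for `queue_stale` instead of passing on code that never included `m2` and `m3`.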