A flaky test suite is one of the most expensive forms of technical debt. Engineers stop trusting CI. They re-run red builds reflexively. They start ignoring real failures. Velocity drops 20 to 40 percent before anyone notices. This post is the playbook we use to take a flaky suite from "we don't trust this" to "this is a real safety net."
The diagnosis pipeline
Four steps. Two weeks for a typical suite. Most of the work is in step 3.
Step 1: Measure
Before fixing, measure. Get a CSV from your CI of every test run in the last 30 days, with pass/fail and timestamp. Group by test name. Compute each test's flake rate as the share of its runs that failed on the first attempt and then passed on retry.
The output looks like this:
test name                        runs   flakes   flake rate
user.auth.signin.happy_path      412    47       11.4%
payments.checkout.upgrade_plan   387    32       8.3%
dashboard.empty_state.loading    298    21       7.0%
...
The top 10 to 20 flaky tests usually account for 70 to 90 percent of all flakes. Fix those and you've fixed the suite.
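The grouping step is a few lines of code. This sketch assumes your CI export gives you, per run, the test name, whether the first attempt passed, and the final outcome; the record shape and function names are illustrative, not any CI provider's real format.

```typescript
// Sketch: aggregate CI run records into per-test flake rates.
// RunRecord is an assumed shape, not a specific CI provider's export.
interface RunRecord {
  test: string;
  firstAttemptPassed: boolean;
  finalPassed: boolean; // outcome after any retries
}

interface TestStats {
  runs: number;
  flakes: number;
  rate: number;
}

function flakeRates(records: RunRecord[]): Map<string, TestStats> {
  const stats = new Map<string, TestStats>();
  for (const r of records) {
    const s = stats.get(r.test) ?? { runs: 0, flakes: 0, rate: 0 };
    s.runs += 1;
    // A flake: failed on the first attempt, passed after retry.
    if (!r.firstAttemptPassed && r.finalPassed) s.flakes += 1;
    stats.set(r.test, s);
  }
  for (const s of stats.values()) s.rate = s.flakes / s.runs;
  return stats;
}

// Sort by absolute flake count to get the top-N worst offenders.
function worstOffenders(records: RunRecord[], n: number): [string, TestStats][] {
  return [...flakeRates(records).entries()]
    .sort((a, b) => b[1].flakes - a[1].flakes)
    .slice(0, n);
}
```

Sorting by absolute flake count rather than rate keeps a rarely-run test with one unlucky failure from topping the list.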
Step 2: Bucket by root cause
For each flaky test, identify the root cause. Common buckets:
Timing assumptions
The test waits a fixed 5 seconds for the API, but the API sometimes takes 6. It passes when the system is fast, fails when it's slow.
Fix: Replace timeouts with explicit waits for state. Playwright's waitFor, Cypress's should. Wait for the actual condition, not for clock time.
Test data contamination
Tests share data. One test's leftover state breaks another.
Fix: Per-test isolation. Database transactions that roll back. Unique test data per run (UUIDs, not "test@example.com").
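One concrete version of "unique test data per run", sketched with Node's built-in randomUUID; the helper names are ours, not from any framework.

```typescript
import { randomUUID } from 'node:crypto';

// Sketch: every test mints its own identifiers instead of sharing
// "test@example.com". Parallel runs can never collide on these.
function uniqueEmail(prefix: string = 'signup-test'): string {
  return `${prefix}+${randomUUID()}@example.com`;
}

function uniqueUsername(prefix: string = 'user'): string {
  return `${prefix}-${randomUUID().slice(0, 8)}`;
}
```

Plus-addressing keeps the unique emails deliverable to one real inbox if your staging environment actually sends mail.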
Order dependence
Tests pass when run in order, fail when run in parallel.
Fix: Make every test self-contained. No "test 2 assumes test 1 created the user." Each test creates what it needs.
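A minimal, framework-free sketch of the self-contained shape; `createUser` is a hypothetical helper standing in for whatever test API or fixture your suite uses.

```typescript
// Sketch: each test creates the state it needs. `createUser` is a
// hypothetical helper; real suites would call a test API or fixture.
interface User {
  id: string;
  name: string;
}

let nextId = 0;
function createUser(name: string): User {
  nextId += 1;
  return { id: `u-${nextId}`, name };
}

// Bad: assumes an earlier test already created "alice".
// Good: this test creates its own user, asserts on it, and never
// depends on run order or parallelism.
function testRenameUser(): boolean {
  const user = createUser('alice');
  const renamed: User = { ...user, name: 'alice-2' };
  return renamed.id === user.id && renamed.name === 'alice-2';
}
```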
Selector fragility
Test uses a deeply-nested CSS selector that breaks when the DOM changes.
Fix: Page Object pattern with stable selectors. data-testid attributes for anything you'll target in tests. Avoid nth-child, deep CSS paths, and text content as a primary selector.
Environment differences
Test passes locally, fails in CI. Or passes in one CI environment, fails in another.
Fix: Identify the environment difference. Locale, timezone, font rendering, browser version, network speed. Either standardise the environment or make the test robust to the variation.
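In Playwright, the locale and timezone knobs can be pinned in config so local and CI runs agree; `locale` and `timezoneId` are real Playwright options, and the values here are examples, not recommendations.

```typescript
// playwright.config.ts — pin the environment knobs that most often
// differ between a laptop and CI. Values are examples, not advice.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    locale: 'en-US',   // date and number formatting
    timezoneId: 'UTC', // anything asserting on rendered times
  },
});
```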
Real product bugs
The "flaky" test is actually catching a real intermittent bug: genuine race conditions in the product code.
Fix: Fix the product, not the test. These are gold.
Step 3: Fix the top three buckets
In a typical engagement we find:
- 40 percent of flakes are timing assumptions.
- 25 percent are selector fragility.
- 15 percent are test data contamination.
- 10 percent are real product race conditions.
- 10 percent are everything else.
Fixing timing and selectors usually drops flake rate from 8-10 percent to 2-3 percent in the first week.
The timing fix in practice
Replace patterns like:
await page.waitForTimeout(2000);
await page.click('.submit-button');
With:
await page.locator('[data-testid="submit-button"]').waitFor({ state: 'visible' });
await page.locator('[data-testid="submit-button"]').click();
The first version is racing the clock. The second waits for the actual state we care about. (Playwright's locator.click() also auto-waits for actionability, so the explicit waitFor is belt and braces, but it makes the intent readable.)
The selector fix in practice
Replace patterns like:
await page.click('div.user-menu > ul > li:nth-child(3) > a');
With:
await page.locator('[data-testid="user-menu-settings"]').click();
The first version breaks when the DOM changes. The second is stable across refactors.
Step 4: Prevent regression
A stabilised suite gets re-flaked unless you defend it.
CI rules
- No auto-retry on the first run. Auto-retry hides flakes: the build goes green and nobody sees the first failure. Use manual retry with logging.
- Fail the build if flake rate climbs above 1 percent. Have a dashboard. Page on regression.
- Quarantine new flaky tests. A flaky test moves to a "quarantine" suite that doesn't block PRs, but is tracked. Owner has 7 days to fix or delete.
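One lightweight way to implement quarantine in Playwright: tag quarantined specs in the test title and exclude them from the blocking run. `grepInvert` is a real Playwright config option; the `@quarantine` tag convention is ours.

```typescript
// playwright.config.ts — exclude quarantined specs from the PR-blocking
// run. Convention (ours): quarantined tests carry "@quarantine" in the
// title, e.g. test('checkout upgrade @quarantine', async ({ page }) => …).
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // The blocking run skips quarantined specs; a separate, non-blocking
  // CI job runs them with `--grep "@quarantine"` and reports results
  // to the flake dashboard so the 7-day clock is visible.
  grepInvert: /@quarantine/,
});
```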
Code review rules
- Every new test goes through review. Specifically check for the patterns above.
- Selector strategy in the PR description. Reviewer confirms the selector is stable.
- Tests reviewed by QA, not just by engineering. Different eyes catch different issues.
Ownership
- Every test has an owner. A team or a name. When the test flakes, the owner is paged.
- Flake rate as a team metric. Track per team. Compare. Reward improvements.
Cultural changes
- Treat flake rate as a quality metric. Stop accepting "tests are flaky" as a baseline.
- Block on flake fixes. When flake rate climbs, dedicate time to fix before adding new tests.
- Reward the engineer who fixes a flaky test. Public recognition. It's high-leverage work.
Tooling that helps
- Playwright trace viewer. Records the full state of a failed test. Replays it. Invaluable for diagnosing flakes.
- Cypress dashboard. Same idea for Cypress.
- CI insights. Your CI provider's flake-detection features.
- Flake-aware retries. Retry mechanisms that flag a test as flaky when it passes on retry, reporting the flake instead of hiding it.
Modern AI tools are also surprisingly useful here. Asking Claude Code or Cursor "why might this Playwright test be flaky" against the test source plus the failure trace often surfaces the timing issue or the selector fragility in seconds.
The retry-loop trap
Auto-retry on test failures is the most common way teams hide a flaky suite from themselves. The CI green-rate looks healthy while the underlying flake rate compounds. Track the first-attempt pass rate, not the after-retries rate. The gap between them is the real flake number, and shrinking that gap is the only fix that matters.
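Once you log both numbers, the gap is a one-liner to compute; the record shape is an assumption.

```typescript
// Sketch: first-attempt pass rate vs after-retries pass rate.
// The difference between them is the flake rate your retries hide.
interface Run {
  firstAttemptPassed: boolean;
  finalPassed: boolean; // outcome after any retries
}

function hiddenFlakeRate(runs: Run[]): number {
  if (runs.length === 0) return 0;
  const first = runs.filter(r => r.firstAttemptPassed).length / runs.length;
  const final = runs.filter(r => r.finalPassed).length / runs.length;
  return final - first;
}
```

A suite can show a 100 percent after-retries green rate while this number says half its runs needed a retry to pass.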
Common mistakes
- Auto-retrying everywhere. Hides flakes. They come back as production bugs later.
- Disabling flaky tests indefinitely. They become noise. Delete or fix.
- Deprioritising flake rate. It grows quietly until you have a much bigger problem.
- Letting engineers ship features alongside flaky tests. Set the bar: green CI means green CI.
How Hashorn helps teams stabilise their suites
Hashorn provides QA automation and AI QA testing services that often start with a flaky-suite engagement. We follow the four-step process above, embed alongside your team, and leave with a suite under 1 percent flake. For teams that want a dedicated partner to own QA long-term, our dedicated QA team engagement covers stabilisation plus ongoing automation work.
Conclusion
A flaky test suite is fixable. The fix is mechanical: measure, bucket, fix, prevent. Two to four weeks of focused work takes most suites from "we don't trust this" to "this catches real bugs." The hardest part isn't the engineering; it's the cultural discipline to keep flake rate low after it's been fixed.
Frequently asked questions
Need help building AI-powered software, QA automation, or secure cloud systems?
Talk to Hashorn's engineering team. Dedicated senior engineers, QA, and security with same-week ramp.