
Stabilising a Flaky Test Suite: A Step-by-Step Guide

The exact process we use to take a test suite from 10 percent flake rate to under 1 percent. Diagnosis, fixes, and the cultural changes that keep it stable.

By Hashorn Team · May 30, 2026 · 6 min read

A flaky test suite is one of the most expensive forms of technical debt. Engineers stop trusting CI. They retry red runs reflexively. They start ignoring real failures. Velocity drops 20 to 40 percent before anyone notices. This post is the playbook we use to take a flaky suite from "we don't trust this" to "this is a real safety net."

The diagnosis pipeline


Four steps. Two weeks for a typical suite. Most of the work is in step 3.

Step 1: Measure

Before fixing, measure. Export a CSV from your CI of every test run in the last 30 days, with pass/fail status and timestamp. Group by test name. Compute flake rate as the share of runs that failed, were retried, and then passed.

The output looks like this:

   test name                                  runs   flaky   rate
   user.auth.signin.happy_path                412    47      11.4%
   payments.checkout.upgrade_plan             387    32      8.3%
   dashboard.empty_state.loading              298    21      7.0%
   ...

The top 10 to 20 flaky tests usually account for 70 to 90 percent of all flakes. Fix those and you've fixed the suite.
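The grouping above can be sketched in a few lines of Node, assuming the CSV has already been parsed into run records (the record shape and field names here are illustrative, not from any particular CI provider):

```javascript
// Compute per-test flake rates from CI run records.
// A run counts as "flaky" when it failed on the first attempt
// but passed on retry — the definition used in Step 1.
function flakeRates(runs) {
  const byTest = new Map();
  for (const r of runs) {
    const stat = byTest.get(r.test) ?? { runs: 0, flaky: 0 };
    stat.runs += 1;
    if (!r.firstAttemptPassed && r.passedOnRetry) stat.flaky += 1;
    byTest.set(r.test, stat);
  }
  // Worst offenders first, matching the table above.
  return [...byTest.entries()]
    .map(([test, stat]) => ({ test, ...stat, rate: stat.flaky / stat.runs }))
    .sort((a, b) => b.rate - a.rate);
}
```

Sort descending and take the top 10 to 20 rows; that is your fix list.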

Step 2: Bucket by root cause

For each flaky test, identify the root cause. Common buckets:

Timing assumptions

Test waits "5 seconds for the API" but the API sometimes takes 6. Test passes when fast, fails when slow.

Fix: Replace timeouts with explicit waits for state. Playwright's waitFor, Cypress's should. Wait for the actual condition, not for clock time.

Test data contamination

Tests share data. One test's leftover state breaks another.

Fix: Per-test isolation. Database transactions that roll back. Unique test data per run (UUIDs, not "test@example.com").

Order dependence

Tests pass when run in order, fail when run in parallel.

Fix: Make every test self-contained. No "test 2 assumes test 1 created the user." Each test creates what it needs.

Selector fragility

Test uses a deeply-nested CSS selector that breaks when the DOM changes.

Fix: Page Object pattern with stable selectors. data-testid attributes for anything you'll target in tests. Avoid nth-child, deep CSS paths, and text content as a primary selector.

Environment differences

Test passes locally, fails in CI. Or passes in one CI environment, fails in another.

Fix: Identify the environment difference. Locale, timezone, font rendering, browser version, network speed. Either standardise the environment or make the test robust to the variation.
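In Playwright, locale and timezone can be pinned in the config so local and CI runs see the same environment. A sketch (adjust values to your project):

```javascript
// playwright.config.js — pin environment-sensitive settings so a test
// cannot pass locally and fail in CI on locale or timezone alone.
module.exports = {
  use: {
    locale: 'en-US',
    timezoneId: 'UTC',
    viewport: { width: 1280, height: 720 },
  },
};
```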

Real product bugs

The "flaky" test is actually catching a real intermittent bug. Genuine race conditions in the product code.

Fix: Fix the product, not the test. These are gold.

Step 3: Fix the top three buckets

In a typical engagement we find:

  • 40 percent of flakes are timing assumptions.
  • 25 percent are selector fragility.
  • 15 percent are test data contamination.
  • 10 percent are real product race conditions.
  • 10 percent are everything else.

Fixing timing and selectors usually drops flake rate from 8-10 percent to 2-3 percent in the first week.

The timing fix in practice

Replace patterns like:

await page.waitForTimeout(2000);
await page.click('.submit-button');

With:

await page.locator('[data-testid="submit-button"]').waitFor({ state: 'visible' });
await page.locator('[data-testid="submit-button"]').click();

The first version is racing the clock. The second waits for the actual state we care about.

The selector fix in practice

Replace patterns like:

await page.click('div.user-menu > ul > li:nth-child(3) > a');

With:

await page.locator('[data-testid="user-menu-settings"]').click();

The first version breaks when the DOM changes. The second is stable across refactors.

Step 4: Prevent regression

A stabilised suite gets re-flaked unless you defend it.

CI rules

  • No auto-retry on the first run. Auto-retry masks flakes instead of surfacing them. Use manual retry with logging.
  • Fail the build if flake rate climbs above 1 percent. Have a dashboard. Page on regression.
  • Quarantine new flaky tests. A flaky test moves to a "quarantine" suite that doesn't block PRs, but is tracked. Owner has 7 days to fix or delete.
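One way to implement quarantine with Playwright's grep flags. The @quarantine tag is a naming convention you add to test titles, not a built-in feature:

```shell
# PR-blocking CI job: run everything except quarantined tests.
npx playwright test --grep-invert "@quarantine"

# Tracking job (non-blocking): run only quarantined tests and report results,
# so the 7-day fix-or-delete clock has data behind it.
npx playwright test --grep "@quarantine"
```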

Code review rules

  • Every new test goes through review. Specifically check for the patterns above.
  • Selector strategy in the PR description. Reviewer confirms the selector is stable.
  • Tests reviewed by QA, not just by engineering. Different eyes catch different issues.

Ownership

  • Every test has an owner. A team or a name. When the test flakes, the owner is paged.
  • Flake rate as a team metric. Track per team. Compare. Reward improvements.

Cultural changes

  • Treat flake rate as a quality metric. Stop accepting "tests are flaky" as a baseline.
  • Block on flake fixes. When flake rate climbs, dedicate time to fix before adding new tests.
  • Reward the engineer who fixes a flaky test. Public recognition. It's high-leverage work.

Tooling that helps

  • Playwright trace viewer. Records the full state of a failed test. Replays it. Invaluable for diagnosing flakes.
  • Cypress dashboard. Same idea for Cypress.
  • CI insights. Your CI provider's flake-detection features.
  • Spec-aware retries. Built-in retry mechanisms that mark a test as flaky when it passes on retry.

Modern AI tools are also surprisingly useful here. Asking Claude Code or Cursor "why might this Playwright test be flaky" against the test source plus the failure trace often surfaces the timing issue or the selector fragility in seconds.

The retry-loop trap

Auto-retry on test failures is the most common way teams hide a flaky suite from themselves. The CI green-rate looks healthy while the underlying flake rate compounds. Track the first-attempt pass rate, not the after-retries rate. The gap between them is the real flake number, and shrinking that gap is the only fix that matters.
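The two rates, and the gap between them, as a small sketch (run records are illustrative, as in the measurement step):

```javascript
// First-attempt pass rate vs. after-retries pass rate.
// The gap between them is the suite's real flake number.
function passRates(runs) {
  const firstAttempt = runs.filter((r) => r.firstAttemptPassed).length;
  const afterRetries = runs.filter(
    (r) => r.firstAttemptPassed || r.passedOnRetry
  ).length;
  return {
    firstAttempt: firstAttempt / runs.length,
    afterRetries: afterRetries / runs.length,
    flakeGap: (afterRetries - firstAttempt) / runs.length,
  };
}
```

Dashboard the firstAttempt number, not the afterRetries number; the latter is the one auto-retry inflates.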

Common mistakes

  • Auto-retrying everywhere. Hides flakes. They come back as production bugs later.
  • Disabling flaky tests indefinitely. They become noise. Delete or fix.
  • Treating flake rate as a non-priority. Then it grows and now you have a much bigger problem.
  • Letting engineers ship features alongside flaky tests. Set the bar: green CI means green CI.

How Hashorn helps teams stabilise their suites

Hashorn provides QA automation and AI QA testing services that often start with a flaky-suite engagement. We follow the four-step process above, embed alongside your team, and leave you with a suite under a 1 percent flake rate. For teams that want a dedicated partner to own QA long-term, our dedicated QA team engagement covers stabilisation plus ongoing automation work.

Conclusion

A flaky test suite is fixable. The fix is mechanical: measure, bucket, fix, prevent. Two to four weeks of focused work takes most suites from "we don't trust this" to "this catches real bugs." The hardest part isn't the engineering; it's the cultural discipline to keep flake rate low after it's been fixed.

