The challenge
Helix shipped weekly but their 90-minute regression suite failed twice a day with "fix-it-by-rerunning" flakiness. Engineers stopped trusting CI, started merging to main without waiting for green, and three production incidents in two months were all "should have been caught in regression." The board had asked the question every founder dreads: do we actually have testing, or do we just have tests?
The deeper symptoms underneath the slow, flaky suite:
- 240 specs accumulated over four years; no clear ownership for most of them.
- Selenium 3 with hand-rolled wait helpers; every spec carried its own retry logic.
- A single CI tier: every PR ran the full suite, even for typo fixes.
- Three of the five flakiest specs covered the company's most critical revenue path.
How we approached it
Two days of audit before any code changed. We mapped the flake rate per spec, identified the worst offenders, and proposed a tiered strategy: a fast PR-time tier (under 15 minutes) covering the highest-leverage paths, a deeper nightly tier for everything else, and a pre-release tier for performance and cross-browser coverage (sketched below).
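One way to express that tiering is with Playwright projects selected by tag. This is an illustrative sketch, not Helix's actual configuration: the @critical and @smoke tags, the browser mix, and the retry policy are assumptions.

```ts
// playwright.config.ts: tier selection via tag-based greps.
// Tag names and the tier split are assumptions, not Helix's real config.
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  // One retry in CI only; specs rely on auto-waiting, not per-spec retry loops.
  retries: process.env.CI ? 1 : 0,
  projects: [
    {
      // PR tier: only specs tagged @critical, the highest-leverage paths.
      name: 'pr',
      grep: /@critical/,
      use: { ...devices['Desktop Chrome'] },
    },
    {
      // Nightly tier: the full suite on a single browser.
      name: 'nightly',
      use: { ...devices['Desktop Chrome'] },
    },
    {
      // Pre-release tier: cross-browser smoke on the tagged subset.
      name: 'prerelease-firefox',
      grep: /@smoke/,
      use: { ...devices['Desktop Firefox'] },
    },
    {
      name: 'prerelease-webkit',
      grep: /@smoke/,
      use: { ...devices['Desktop Safari'] },
    },
  ],
});
```

CI then picks a tier per trigger, e.g. `npx playwright test --project=pr` on pull requests and `--project=nightly` on the cron schedule.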
The rewrite plan was incremental: we couldn't take the existing suite offline while we built the new one. So we ran the old and new suites in parallel for four weeks, gradually shifting the source of truth.
What we shipped
Audit phase (week 1)
- Per-spec flake report ranked by failures-per-day (see the aggregation sketch after this list)
- Top 12 specs identified as 80% of all failures
- Cost-of-flake calculation: ~12 engineer-hours/day lost to reruns
- Tiering proposal accepted by Helix leadership
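The flake report itself is simple aggregation once per-run results are exported. A minimal sketch, assuming CI results have been dumped to a JSON file of { spec, passed, timestamp } records; that record shape and the file name are hypothetical, not Helix's CI schema.

```ts
// flake-report.ts: rank specs by failures per day from exported CI results.
// The input shape ({ spec, passed, timestamp }) is a hypothetical export
// format, not Helix's actual CI schema.
import { readFileSync } from 'node:fs';

interface RunRecord {
  spec: string;       // spec file path
  passed: boolean;    // final result of this run
  timestamp: string;  // ISO-8601 start time
}

const records: RunRecord[] = JSON.parse(readFileSync('ci-runs.json', 'utf8'));

// Count failures per spec and the number of distinct days observed.
const failures = new Map<string, number>();
const days = new Set<string>();
for (const r of records) {
  days.add(r.timestamp.slice(0, 10)); // bucket by YYYY-MM-DD
  if (!r.passed) failures.set(r.spec, (failures.get(r.spec) ?? 0) + 1);
}

// Rank by failures-per-day and print the worst offenders.
const report = [...failures.entries()]
  .map(([spec, fails]) => ({ spec, failsPerDay: fails / days.size }))
  .sort((a, b) => b.failsPerDay - a.failsPerDay);

for (const { spec, failsPerDay } of report.slice(0, 12)) {
  console.log(`${failsPerDay.toFixed(2)}/day  ${spec}`);
}
```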
Migration phase (weeks 2–5)
- Playwright migration of 240 specs across 14 product surfaces
- AI-assisted test scaffolding: we wrote prompts; AI generated first drafts; QA reviewed every assertion before it shipped
- Hand-rolled retry logic replaced with Playwright's built-in auto-waiting (before/after sketch follows this list)
- New CI workflow: PR tier (15 min) → nightly tier (full suite, 47 min) → pre-release tier (cross-browser, perf)
- BrowserStack integration for the cross-browser smoke
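To make the auto-waiting swap concrete, here is a before/after sketch. The Selenium helper is a reconstruction of the hand-rolled pattern described above, and the checkout selectors and values are invented for illustration.

```ts
// Before (a reconstruction of the pattern, not Helix's code):
// hand-rolled polling around Selenium, duplicated per spec.
//
//   async function clickWhenReady(driver: WebDriver, locator: By) {
//     for (let i = 0; i < 20; i++) {
//       try { return await driver.findElement(locator).click(); }
//       catch { await new Promise(r => setTimeout(r, 500)); }
//     }
//     throw new Error('element never became clickable');
//   }

// After: Playwright actions auto-wait for visibility and actionability,
// so the spec states intent and nothing else. Selectors are illustrative.
import { test, expect } from '@playwright/test';

test('checkout applies a promo code @critical', async ({ page }) => {
  await page.goto('/checkout');
  await page.getByLabel('Promo code').fill('SAVE10');
  await page.getByRole('button', { name: 'Apply' }).click();
  // Web-first assertions retry until timeout; no manual polling needed.
  await expect(page.getByTestId('order-total')).toHaveText('$90.00');
});
```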
Hardening phase (week 6)
- Test ownership map: every spec now has a named owner from Helix engineering
- Slack alerts for nightly failures routed to the owning team (reporter sketch follows this list)
- Documentation: Playwright conventions, retry policy, "when to use which tier"
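The Slack routing can live in a small custom Playwright reporter. A sketch under assumptions: the ownership map, channel names, and SLACK_WEBHOOK_URL variable are illustrative, and a real setup would load owners from the ownership map rather than hard-coding them.

```ts
// slack-reporter.ts: route nightly failures to the owning team's channel.
// The OWNERS map, channel names, and SLACK_WEBHOOK_URL are illustrative.
import type { Reporter, TestCase, TestResult } from '@playwright/test/reporter';

// Hypothetical ownership map: spec path prefix -> Slack channel.
const OWNERS: Record<string, string> = {
  'tests/checkout/': '#team-payments',
  'tests/onboarding/': '#team-growth',
};

class SlackReporter implements Reporter {
  private failures: { title: string; channel: string }[] = [];

  onTestEnd(test: TestCase, result: TestResult) {
    if (result.status !== 'failed') return;
    const prefix = Object.keys(OWNERS).find(p => test.location.file.includes(p));
    // Unowned specs fall back to a triage channel instead of being dropped.
    this.failures.push({
      title: test.title,
      channel: OWNERS[prefix ?? ''] ?? '#qa-triage',
    });
  }

  async onEnd() {
    for (const f of this.failures) {
      await fetch(process.env.SLACK_WEBHOOK_URL!, {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify({ channel: f.channel, text: `Nightly failure: ${f.title}` }),
      });
    }
  }
}

export default SlackReporter;
```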
Outcomes
- PR-blocking regression time: 90 min → 15 min (the full suite, at parity, now runs nightly in 47 min)
- Flake rate: ~6% → < 1% (measured over 4 weeks post-launch)
- Manual reruns of failed CI: ~12/day → 0
- Three would-be incidents in the next quarter were caught in regression before release. Zero "should have been caught" incidents in the six months since.
- Trust in CI restored. Engineers wait for green again.
What we'd repeat
AI is excellent at scaffolding tests and converting between frameworks. AI is bad at writing assertions that actually fail when behavior breaks. We had QA engineers review every assertion AI proposed, and rejected or rewrote about a third of them. The combination of AI scaffolding + human assertion review was 3–4× faster than humans alone, with no quality loss.
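A composite example of the assertion gap (both snippets invented, not taken from Helix's suite): the kind of AI draft we rejected asserts something that is true whether or not the feature works, while the rewrite pins behavior that actually breaks.

```ts
import { test, expect } from '@playwright/test';

test('invoice download', async ({ page }) => {
  await page.goto('/invoices/42'); // route is illustrative

  // Typical rejected AI draft: asserts the page loaded, which is
  // almost always true, so it never fails when the feature breaks.
  expect(await page.title()).toBeTruthy();

  // Human rewrite: pin the observable behavior that matters.
  const download = page.waitForEvent('download');
  await page.getByRole('link', { name: 'Download PDF' }).click();
  expect((await download).suggestedFilename()).toBe('invoice-42.pdf');
});
```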
The other lesson: ownership matters more than framework choice. Half the flakiness came from specs that nobody owned, written by engineers who had since left. The ownership map alone would have helped, even without the Playwright migration.