Hashorn

Case study · B2B SaaS · Series A

Regression suite cut from 90 minutes to 47, and made trustworthy again

Helix shipped weekly, but their flaky 90-minute Selenium suite had stopped catching bugs. We rewrote it on Playwright with AI-assisted scaffolding in six weeks, dropped the flake rate below 1%, and restored engineering trust in CI.


Client

Helix

Engagement

QA pod · project

Duration

6 weeks (Q1 2026)

Team

2 senior QA engineers, 1 platform engineer (part-time)

Services

Quality Assurance · AI Software Development

Outcomes at a glance

Full regression suite

90 min → 47 min

Flake rate

~6% → < 1%

Engineer-hours/day rerunning CI

~12 → 0

Sprint timeline

How the engagement unfolded

  1. Wk 1

    Flake audit + tiering proposal

    Two-day audit catalogued every spec by failure rate. Top 12 specs accounted for 80% of all failures. We proposed a tiered PR / nightly / pre-release strategy.

    Flake report · tiering signed off

  2. Wk 2

    Playwright pilot on the worst offenders

    Migrated the five most-flaky spec families to Playwright. Auto-waiting replaced hand-rolled retry logic. Pilot kept running in parallel with the old suite.

    5 spec families on Playwright

  3. Wk 3

    AI-assisted scaffolding rolled out

    QA engineers wrote prompts for the rest of the 240-spec migration. AI scaffolded; humans reviewed every assertion line by line. Roughly 3–4× faster than humans alone, with no quality loss.

    Scaffolding pipeline · 120 specs done

  4. Wk 4

    Full migration + new CI workflow

    Remaining specs migrated. New CI tiers: PR (15-minute hot path), nightly (full suite, 47 min), pre-release (cross-browser + perf). Old suite retired.

    240 specs · 3 tiers live

  5. Wk 5

    BrowserStack + ownership map

    Cross-browser smoke through BrowserStack. Every spec assigned a named owner from Helix engineering. Slack alerts on nightly failures routed to owning team.

    Ownership map · cross-browser tier

  6. Wk 6

    Handover + monitoring

    Flake-rate dashboard, weekly health report template, and engineering documentation on conventions, retry policy, and tier selection.

    Health dashboard · docs handed over

Architecture

The stack we shipped on

Test runner

Replaced Selenium 3 + custom wait helpers

  • Playwright
  • TypeScript
  • Hand-rolled fixtures

CI tiers

Tiered so PR stays under 15 minutes

  • GitHub Actions
  • PR · nightly · pre-release

Cross-browser

Smoke layer on the pre-release tier

  • BrowserStack
  • Chrome
  • Firefox
  • Safari

AI scaffolding

Generates first-draft specs; humans review every assertion

  • OpenAI
  • Custom prompt library

Observability

  • Slack alerts
  • Flake-rate dashboard

Risks we actively managed

  • AI-generated assertions that pass but don't actually catch behavior breaks — every assertion human-reviewed; ~33% rewritten.
  • Migration regressions — old and new suites ran in parallel for four weeks before retiring the old.
  • Team adoption — every spec given a named owner so failures route to humans, not a black hole.
  • Test-data drift between PR and nightly tiers — shared fixture library with deterministic seeds (see the sketch below).
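
A minimal sketch of the seeded-fixture idea, assuming Playwright fixtures; the PRNG, the seed choice, and the testData shape are illustrative placeholders, not Helix's actual fixture library.

```ts
// fixtures.ts — illustrative seeded-data fixture; data shape and seed are placeholders
import { test as base } from '@playwright/test';

// tiny deterministic PRNG (mulberry32), so PR and nightly tiers generate identical data
function seededRandom(seed: number) {
  return () => {
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

export const test = base.extend<{ testData: { customerName: string } }>({
  testData: async ({}, use) => {
    const rand = seededRandom(42); // fixed seed per suite, not per run
    await use({ customerName: `Acme ${Math.floor(rand() * 1_000_000)}` });
  },
});
```

Specs import `test` from this file instead of `@playwright/test`, so both tiers see the same data.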

Workflow

Tracked end-to-end in BuildOS.

Every meeting summary, requirement, sprint, task, and metric in this case study was rendered in BuildOS during the engagement. The customer's team had read-only access to the same workspace from week one; they saw Friday demos, weekly velocity, and AI-generated checklists without us sending status emails.

The challenge

Helix shipped weekly, but their 90-minute regression suite failed twice a day with "fix-it-by-rerunning" flakiness. Engineers stopped trusting CI, started merging to main without waiting for green, and three production incidents in two months were all "should have been caught in regression." The board had asked the question every founder dreads: do we actually have testing, or do we just have tests?

The deeper symptoms underneath the slow, flaky suite:

  • 240 specs accumulated over four years; no clear ownership for most of them.
  • Selenium 3 with hand-rolled wait helpers; every spec had its own retry logic.
  • A single CI tier: every PR ran the full suite, even for typo fixes.
  • Three of the five flakiest specs covered the company's most critical revenue path.

How we approached it

Two days of audit before any code changed. We mapped the flake rate per spec, identified the worst offenders, and proposed a tiered strategy: fast PR-time tier (under 15 minutes) for highest-leverage paths, deeper nightly tier for everything else, pre-release tier for performance and cross-browser.
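
One way to implement that kind of tiering in a single Playwright repo is tag-based project filtering. The sketch below is illustrative rather than Helix's actual config; the project names and the @hot-path / @smoke tags are placeholders.

```ts
// playwright.config.ts — illustrative tiering sketch; project names and tags are placeholders
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  retries: 1, // one retry at most: flaky specs get fixed, not hidden behind retries
  projects: [
    {
      name: 'pr', // PR tier: hot-path specs only, budgeted to stay under 15 minutes
      grep: /@hot-path/,
      use: { ...devices['Desktop Chrome'] },
    },
    {
      name: 'nightly', // nightly tier: the full suite
      use: { ...devices['Desktop Chrome'] },
    },
    {
      name: 'pre-release-firefox', // pre-release tier adds cross-browser smoke coverage
      grep: /@smoke/,
      use: { ...devices['Desktop Firefox'] },
    },
  ],
});
```

CI then picks a tier per trigger, e.g. `npx playwright test --project=pr` on pull requests and `--project=nightly` on the schedule.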

The rewrite plan was incremental: we couldn't take the existing suite offline while we built the new one, so we ran them in parallel for four weeks, gradually shifting the source of truth.

What we shipped

Audit phase (week 1)

  • Per-spec flake report ranked by failures-per-day (see the sketch after this list)
  • Top 12 specs identified as 80% of all failures
  • Cost-of-flake calculation: ~12 engineer-hours/day lost to reruns
  • Tiering proposal accepted by Helix leadership
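
For a sense of how a failures-per-day ranking like the one above can be computed, here is a minimal sketch; the `ci-history.json` file and its record shape are assumptions for illustration, not the format Helix's CI actually emits.

```ts
// rank-flakes.ts — assumes each historical CI run was exported as { spec, passed, date } records
import { readFileSync } from 'node:fs';

type RunRecord = { spec: string; passed: boolean; date: string };

const records: RunRecord[] = JSON.parse(readFileSync('ci-history.json', 'utf8'));

// count failures per spec across the audited window
const failures = new Map<string, number>();
for (const r of records) {
  if (!r.passed) failures.set(r.spec, (failures.get(r.spec) ?? 0) + 1);
}

// normalise to failures-per-day and rank worst-first
const days = new Set(records.map((r) => r.date)).size || 1;
const ranked = [...failures.entries()]
  .map(([spec, count]) => ({ spec, failuresPerDay: count / days }))
  .sort((a, b) => b.failuresPerDay - a.failuresPerDay);

console.table(ranked.slice(0, 12)); // the top 12 turned out to account for 80% of all failures
```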

Migration phase (weeks 2–5)

  • Playwright migration of 240 specs across 14 product surfaces
  • AI-assisted test scaffolding: we wrote prompts; AI generated first drafts; QA reviewed every assertion before it shipped
  • Hand-rolled retry logic replaced with Playwright's built-in auto-waiting (see the before/after sketch after this list)
  • New CI workflow: PR tier (15 min) → nightly tier (full suite, 47 min) → pre-release tier (cross-browser, perf)
  • BrowserStack integration for the cross-browser smoke
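
To make "auto-waiting replaced hand-rolled retry logic" concrete, here is a before/after sketch; the page, selectors, and helper names are hypothetical, not lifted from Helix's codebase.

```ts
// Before (Selenium 3 era): every interaction wrapped in hand-rolled polling, e.g.
//   await retryUntil(() => driver.findElement(By.css('.invoice-row')).isDisplayed(), 30_000);

// After: Playwright waits for actionability, and web-first assertions retry on their own
import { test, expect } from '@playwright/test';

test('new invoice appears after submission @hot-path', async ({ page }) => {
  await page.goto('/billing');
  await page.getByRole('button', { name: 'Create invoice' }).click(); // waits until visible and enabled
  await expect(page.getByRole('row', { name: /Invoice #/ })).toBeVisible(); // polls until timeout
});
```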

Hardening phase (week 6)

  • Test ownership map: every spec now has a named owner from Helix engineering
  • Slack alerts for nightly failures routed to the owning team (see the routing sketch after this list)
  • Documentation: Playwright conventions, retry policy, "when to use which tier"
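
The routing itself can be as small as a map from spec path to channel. This sketch is illustrative only: the team channels, environment variables, and flat failure list are assumptions, and channel selection through a single incoming webhook is simplified.

```ts
// route-nightly-failures.ts — illustrative; channels, env vars, and input shape are assumptions
const owners: Record<string, string> = {
  'tests/billing/': '#team-billing', // ownership map: spec directory -> owning team
  'tests/onboarding/': '#team-growth',
};

// assume an upstream step produced a flat list of failed spec files for the nightly run
const failedSpecs: string[] = JSON.parse(process.env.FAILED_SPECS ?? '[]');

// group failures by owning team, with a catch-all so nothing disappears into a black hole
const byChannel = new Map<string, string[]>();
for (const spec of failedSpecs) {
  const channel =
    Object.entries(owners).find(([prefix]) => spec.startsWith(prefix))?.[1] ?? '#qa-unrouted';
  byChannel.set(channel, [...(byChannel.get(channel) ?? []), spec]);
}

for (const [channel, specs] of byChannel) {
  // Slack incoming webhooks accept a plain JSON POST with a text payload
  await fetch(process.env.SLACK_WEBHOOK_URL!, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: `Nightly failures for ${channel}:\n${specs.join('\n')}` }),
  });
}
```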

Outcomes

  • Full regression suite: 90 min → 47 min (now the nightly tier; PRs run a 15-minute hot path)
  • Flake rate: ~6% → < 1% (measured over 4 weeks post-launch)
  • Engineer-hours/day lost to re-running failed CI: ~12 → 0
  • Three would-be production incidents in the next quarter were caught in regression before release. Zero "should have been caught" incidents in the six months since.
  • Trust in CI restored. Engineers wait for green again.

What we'd repeat

AI is excellent at scaffolding tests and converting between frameworks. AI is bad at writing assertions that actually fail when behavior breaks. We had QA engineers review every assertion AI proposed, and rejected or rewrote about a third of them. The combination of AI scaffolding + human assertion review was 3–4× faster than humans alone, with no quality loss.
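
Mechanically, the scaffolding step was a prompt plus a generation call. The sketch below is a simplified stand-in for our prompt library: the model name, system prompt, and file paths are placeholders, and nothing it emits ships without a human pass over every assertion.

```ts
// scaffold-spec.ts — simplified stand-in; model name, prompt, and paths are placeholders
import OpenAI from 'openai';
import { readFileSync, writeFileSync } from 'node:fs';

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function scaffoldSpec(seleniumSpecPath: string, outPath: string) {
  const completion = await client.chat.completions.create({
    model: 'gpt-4o', // placeholder
    messages: [
      {
        role: 'system',
        content:
          'Convert this Selenium spec to Playwright + TypeScript. Prefer getByRole locators and ' +
          'web-first assertions. Flag any assertion you are unsure about with // TODO(review).',
      },
      { role: 'user', content: readFileSync(seleniumSpecPath, 'utf8') },
    ],
  });
  // first draft only — a QA engineer reviews or rewrites every assertion before it ships
  writeFileSync(outPath, completion.choices[0].message.content ?? '');
}

await scaffoldSpec('legacy/billing.spec.js', 'drafts/billing.spec.ts');
```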

The other lesson: ownership matters more than framework choice. Half the flakiness came from specs that nobody owned, written by engineers who had since left. The ownership map alone would have helped, even without the Playwright migration.

Our engineers had stopped trusting CI, they'd merge to main without waiting for green. Six weeks later, every PR runs in 47 minutes, the flakes are gone, and trust is back. We've shipped without a regression incident since.

Ravi Krishnan

VP Engineering, Helix

Want a result like this?

Tell us what you're building, we'll tell you how we'd ship it.

Book an intro call →