Hashorn

Case study · B2B SaaS · Series A

Regression suite cut from 90 minutes to 47, and made trustworthy again

Helix shipped weekly, but their flaky 90-minute Selenium suite had stopped catching bugs. We rewrote it on Playwright with AI-assisted scaffolding in six weeks, dropped the flake rate below 1%, and restored engineering trust in CI.


Client

Helix

Engagement

QA pod · project

Duration

6 weeks (Q1 2026)

Team

2 senior QA engineers, 1 platform engineer (part-time)

Services

Quality Assurance · AI Software Development

Outcomes at a glance

Full regression suite

90 min → 47 min

Flake rate

~6% → < 1%

Engineer-hours/day rerunning CI

~12 → 0

Sprint timeline

How the engagement unfolded

  1. Wk 1

    Flake audit + tiering proposal

    Two-day audit catalogued every spec by failure rate. Top 12 specs accounted for 80% of all failures. We proposed a tiered PR / nightly / pre-release strategy.

    Flake report · tiering signed off

  2. Wk 2

    Playwright pilot on the worst offenders

    Migrated the five most-flaky spec families to Playwright. Auto-waiting replaced hand-rolled retry logic. Pilot kept running in parallel with the old suite.

    5 spec families on Playwright

  3. Wk 3

    AI-assisted scaffolding rolled out

    QA engineers wrote prompts for the rest of the 240-spec migration. AI scaffolded; humans reviewed every assertion line by line. Roughly 3–4× faster than humans alone, with no quality loss.

    Scaffolding pipeline · 120 specs done

  4. Wk 4

    Full migration + new CI workflow

    Remaining specs migrated. New CI tiers: PR (15-minute hot path), nightly (full suite, 47 min), pre-release (cross-browser + perf). Old suite retired.

    240 specs · 3 tiers live

  5. Wk 5

    BrowserStack + ownership map

    Cross-browser smoke through BrowserStack. Every spec assigned a named owner from Helix engineering. Slack alerts on nightly failures routed to owning team.

    Ownership map · cross-browser tier

  6. Wk 6

    Handover + monitoring

    Flake-rate dashboard, weekly health report template, and engineering documentation on conventions, retry policy, and tier selection.

    Health dashboard · docs handed over

Architecture

The stack we shipped on

Test runner

Replaced Selenium 3 + custom wait helpers

  • Playwright
  • TypeScript
  • Hand-rolled fixtures

CI tiers

Tiered so PR stays under 15 minutes

  • GitHub Actions
  • PR · nightly · pre-release

Cross-browser

Smoke layer on the pre-release tier

  • BrowserStack
  • Chrome
  • Firefox
  • Safari

AI scaffolding

Generates first-draft specs; humans review every assertion

  • OpenAI
  • Custom prompt library

Observability

  • Slack alerts
  • Flake-rate dashboard

Risks we actively managed

  • AI-generated assertions that pass but don't actually catch behavior breaks — every assertion human-reviewed; ~33% rewritten.
  • Migration regressions — old and new suites ran in parallel for four weeks before retiring the old.
  • Team adoption — every spec given a named owner so failures route to humans, not a black hole.
  • Test-data drift between PR and nightly tiers — shared fixture library with deterministic seeds (see the sketch below).
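
A minimal sketch of the seeded-fixture idea, assuming Playwright fixtures; the PRNG, the seed choice, and the testData shape are illustrative placeholders, not Helix's actual fixture library.

```ts
// fixtures.ts — illustrative seeded-data fixture; data shape and seed are placeholders
import { test as base } from '@playwright/test';

// tiny deterministic PRNG (mulberry32), so PR and nightly tiers generate identical data
function seededRandom(seed: number) {
  return () => {
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

export const test = base.extend<{ testData: { customerName: string } }>({
  testData: async ({}, use) => {
    const rand = seededRandom(42); // fixed seed per suite, not per run
    await use({ customerName: `Acme ${Math.floor(rand() * 1_000_000)}` });
  },
});
```

Specs import `test` from this file instead of `@playwright/test`, so both tiers see the same data.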

Workflow

Tracked end-to-end in BuildOS.

Every meeting summary, requirement, sprint, task, and metric in this case study was rendered in BuildOS during the engagement. The customer's team had read-only access to the same workspace from week one; they saw Friday demos, weekly velocity, and AI-generated checklists without us sending status emails.

The challenge

Helix shipped weekly, but their 90-minute regression suite failed twice a day with "fix-it-by-rerunning" flakiness. Engineers stopped trusting CI, started merging to main without waiting for green, and three production incidents in two months were all "should have been caught in regression." The board had asked the question every founder dreads: do we actually have testing, or do we just have tests?

The deeper symptoms underneath the slow, flaky suite:

  • 240 specs accumulated over four years; no clear ownership for most of them.
  • Selenium 3 with hand-rolled wait helpers; every spec had its own retry logic.
  • A single CI tier: every PR ran the full suite, even for typo fixes.
  • Three of the five flakiest specs covered the company's most critical revenue path.

How we approached it

Two days of audit before any code changed. We mapped the flake rate per spec, identified the worst offenders, and proposed a tiered strategy: fast PR-time tier (under 15 minutes) for highest-leverage paths, deeper nightly tier for everything else, pre-release tier for performance and cross-browser.
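
One way to implement that kind of tiering in a single Playwright repo is tag-based project filtering. The sketch below is illustrative rather than Helix's actual config; the project names and the @hot-path / @smoke tags are placeholders.

```ts
// playwright.config.ts — illustrative tiering sketch; project names and tags are placeholders
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  retries: 1, // one retry at most: flaky specs get fixed, not hidden behind retries
  projects: [
    {
      name: 'pr', // PR tier: hot-path specs only, budgeted to stay under 15 minutes
      grep: /@hot-path/,
      use: { ...devices['Desktop Chrome'] },
    },
    {
      name: 'nightly', // nightly tier: the full suite
      use: { ...devices['Desktop Chrome'] },
    },
    {
      name: 'pre-release-firefox', // pre-release tier adds cross-browser smoke coverage
      grep: /@smoke/,
      use: { ...devices['Desktop Firefox'] },
    },
  ],
});
```

CI then picks a tier per trigger, e.g. `npx playwright test --project=pr` on pull requests and `--project=nightly` on the schedule.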

The rewrite plan was incremental: we couldn't take the existing suite offline while we built the new one, so we ran them in parallel for four weeks, gradually shifting the source of truth.

What we shipped

Audit phase (week 1)

  • Per-spec flake report ranked by failures-per-day (see the sketch after this list)
  • Top 12 specs identified as 80% of all failures
  • Cost-of-flake calculation: ~12 engineer-hours/day lost to reruns
  • Tiering proposal accepted by Helix leadership
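
For a sense of how a failures-per-day ranking like the one above can be computed, here is a minimal sketch; the `ci-history.json` file and its record shape are assumptions for illustration, not the format Helix's CI actually emits.

```ts
// rank-flakes.ts — assumes each historical CI run was exported as { spec, passed, date } records
import { readFileSync } from 'node:fs';

type RunRecord = { spec: string; passed: boolean; date: string };

const records: RunRecord[] = JSON.parse(readFileSync('ci-history.json', 'utf8'));

// count failures per spec across the audited window
const failures = new Map<string, number>();
for (const r of records) {
  if (!r.passed) failures.set(r.spec, (failures.get(r.spec) ?? 0) + 1);
}

// normalise to failures-per-day and rank worst-first
const days = new Set(records.map((r) => r.date)).size || 1;
const ranked = [...failures.entries()]
  .map(([spec, count]) => ({ spec, failuresPerDay: count / days }))
  .sort((a, b) => b.failuresPerDay - a.failuresPerDay);

console.table(ranked.slice(0, 12)); // the top 12 turned out to account for 80% of all failures
```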

Migration phase (weeks 2–5)

  • Playwright migration of 240 specs across 14 product surfaces
  • AI-assisted test scaffolding: we wrote prompts; AI generated first drafts; QA reviewed every assertion before it shipped
  • Hand-rolled retry logic replaced with Playwright's built-in auto-waiting (see the before/after sketch after this list)
  • New CI workflow: PR tier (15 min) → nightly tier (full suite, 47 min) → pre-release tier (cross-browser, perf)
  • BrowserStack integration for the cross-browser smoke
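
To make "auto-waiting replaced hand-rolled retry logic" concrete, here is a before/after sketch; the page, selectors, and helper names are hypothetical, not lifted from Helix's codebase.

```ts
// Before (Selenium 3 era): every interaction wrapped in hand-rolled polling, e.g.
//   await retryUntil(() => driver.findElement(By.css('.invoice-row')).isDisplayed(), 30_000);

// After: Playwright waits for actionability, and web-first assertions retry on their own
import { test, expect } from '@playwright/test';

test('new invoice appears after submission @hot-path', async ({ page }) => {
  await page.goto('/billing');
  await page.getByRole('button', { name: 'Create invoice' }).click(); // waits until visible and enabled
  await expect(page.getByRole('row', { name: /Invoice #/ })).toBeVisible(); // polls until timeout
});
```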

Hardening phase (week 6)

  • Test ownership map: every spec now has a named owner from Helix engineering
  • Slack alerts for nightly failures routed to the owning team (see the routing sketch after this list)
  • Documentation: Playwright conventions, retry policy, "when to use which tier"
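
The routing itself can be as small as a map from spec path to channel. This sketch is illustrative only: the team channels, environment variables, and flat failure list are assumptions, and channel selection through a single incoming webhook is simplified.

```ts
// route-nightly-failures.ts — illustrative; channels, env vars, and input shape are assumptions
const owners: Record<string, string> = {
  'tests/billing/': '#team-billing', // ownership map: spec directory -> owning team
  'tests/onboarding/': '#team-growth',
};

// assume an upstream step produced a flat list of failed spec files for the nightly run
const failedSpecs: string[] = JSON.parse(process.env.FAILED_SPECS ?? '[]');

// group failures by owning team, with a catch-all so nothing disappears into a black hole
const byChannel = new Map<string, string[]>();
for (const spec of failedSpecs) {
  const channel =
    Object.entries(owners).find(([prefix]) => spec.startsWith(prefix))?.[1] ?? '#qa-unrouted';
  byChannel.set(channel, [...(byChannel.get(channel) ?? []), spec]);
}

for (const [channel, specs] of byChannel) {
  // Slack incoming webhooks accept a plain JSON POST with a text payload
  await fetch(process.env.SLACK_WEBHOOK_URL!, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: `Nightly failures for ${channel}:\n${specs.join('\n')}` }),
  });
}
```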

Outcomes

  • Full regression suite: 90 min → 47 min (now the nightly tier; PRs run a 15-minute hot path)
  • Flake rate: ~6% → < 1% (measured over 4 weeks post-launch)
  • Engineer-hours/day lost to re-running failed CI: ~12 → 0
  • Three would-be production incidents in the next quarter were caught in regression before release. Zero "should have been caught" incidents in the six months since.
  • Trust in CI restored. Engineers wait for green again.

What we'd repeat

AI is excellent at scaffolding tests and converting between frameworks. AI is bad at writing assertions that actually fail when behavior breaks. We had QA engineers review every assertion AI proposed, and rejected or rewrote about a third of them. The combination of AI scaffolding + human assertion review was 3–4× faster than humans alone, with no quality loss.
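
Mechanically, the scaffolding step was a prompt plus a generation call. The sketch below is a simplified stand-in for our prompt library: the model name, system prompt, and file paths are placeholders, and nothing it emits ships without a human pass over every assertion.

```ts
// scaffold-spec.ts — simplified stand-in; model name, prompt, and paths are placeholders
import OpenAI from 'openai';
import { readFileSync, writeFileSync } from 'node:fs';

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function scaffoldSpec(seleniumSpecPath: string, outPath: string) {
  const completion = await client.chat.completions.create({
    model: 'gpt-4o', // placeholder
    messages: [
      {
        role: 'system',
        content:
          'Convert this Selenium spec to Playwright + TypeScript. Prefer getByRole locators and ' +
          'web-first assertions. Flag any assertion you are unsure about with // TODO(review).',
      },
      { role: 'user', content: readFileSync(seleniumSpecPath, 'utf8') },
    ],
  });
  // first draft only — a QA engineer reviews or rewrites every assertion before it ships
  writeFileSync(outPath, completion.choices[0].message.content ?? '');
}

await scaffoldSpec('legacy/billing.spec.js', 'drafts/billing.spec.ts');
```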

The other lesson: ownership matters more than framework choice. Half the flakiness came from specs that nobody owned, written by engineers who had since left. The ownership map alone would have helped, even without the Playwright migration.

Our engineers had stopped trusting CI, they'd merge to main without waiting for green. Six weeks later, every PR runs in 47 minutes, the flakes are gone, and trust is back. We've shipped without a regression incident since.

Ravi Krishnan

VP Engineering, Helix

Want a result like this?

Tell us what you're building, we'll tell you how we'd ship it.

Book an intro call →