Case study · B2B SaaS · Growth-stage

AI feature shipped to production in 10 days for a SaaS product

A SaaS product team needed a new AI feature in production fast, and trustworthy enough to put in front of real users. We designed it, built it with an evaluation harness from day one, and shipped it behind a feature flag in 10 working days.

Client

B2B SaaS product

Engagement

Fast MVP · fixed scope, fixed deadline

Duration

10 working days

Team

2 senior engineers (1 AI-focused) + 1 QA · 1 Hashorn PM

Services

AI Software Development · Quality Assurance

Outcomes at a glance

Time to production

10 working days

Quality gate

Eval set, scored on every change

Rollout control

Feature-flagged

Sprint timeline

How the engagement unfolded

  1. Day 1

    Brief, eval criteria, kickoff

    Locked the one AI use case, the success criteria, and the guardrails. The most important output: a written definition of what 'good output' means, which became the evaluation set.

    Use case + eval criteria defined

  2. Days 2-3

    Evaluation harness first

    Built a small evaluation harness and a labelled set of real example inputs before building the feature, so quality was measurable from the start rather than judged by vibes.

    Eval harness running on a labelled set

  3. Days 4-6

    The feature, against the evals

    The prompt and retrieval pipeline was built and iterated against the eval set. The model was grounded on the product's own data to cut hallucination, and every change was scored before it merged.

    Feature passing the eval bar

  4. Days 7-8

    Product integration + guardrails

    Wired into the product UI behind a feature flag, with input validation, rate limits, cost controls, and a fallback path when the model is uncertain or unavailable.

    Feature live behind a flag in staging

  5. Day 9

    QA + safety pass

    Ran Playwright tests across the feature's critical paths, adversarially tested the prompt, and reviewed failure and abuse cases before any user saw it.

    Critical paths green · abuse cases reviewed

  6. Day 10

    Production rollout

    Shipped to production behind the flag, enabled for an initial cohort, with traces, cost, and quality monitored in one place.

    Live in production · staged rollout

Architecture

The stack we shipped on

Product

Feature shipped dark, enabled per cohort

  • Next.js 15
  • TypeScript
  • Feature flags
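
One way to enable a feature per cohort is deterministic percentage bucketing. The block below is a minimal sketch under that assumption, not the client's actual flag system: the names `fnv1a`, `bucketFor`, and `isEnabled` are hypothetical, and the hash is an FNV-1a style hash over UTF-16 code units.

```typescript
// Hypothetical sketch: deterministic percentage rollout for a feature flag.
// All names here are illustrative and do not come from a real flag service.

// FNV-1a style 32-bit hash over the string's UTF-16 code units.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Maps a (flag, user) pair to a stable bucket in the range 0..99.
function bucketFor(flag: string, userId: string): number {
  return fnv1a(`${flag}:${userId}`) % 100;
}

// A user is in the cohort when their bucket falls below the rollout percentage.
function isEnabled(flag: string, userId: string, rolloutPercent: number): boolean {
  return bucketFor(flag, userId) < rolloutPercent;
}
```

Because the bucket depends only on the flag name and user id, a user who is in the cohort at 5% stays in it as the percentage is raised, and setting the percentage to 0 turns the feature off for everyone.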

AI pipeline

Grounded on first-party data to reduce hallucination

  • OpenAI
  • Retrieval over product data
  • Prompt versioning
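
To illustrate grounding with a versioned prompt, here is a rough sketch. It assumes a naive keyword-overlap retriever and a plain-text prompt layout; `Snippet`, `PromptTemplate`, `topSnippets`, and `buildPrompt` are hypothetical names, and a production pipeline would more likely use embeddings or full-text search.

```typescript
// Hypothetical sketch: naive retrieval plus a versioned, grounded prompt.
// The ranking here is simple keyword overlap, chosen only to keep the example small.

interface Snippet {
  id: string;
  text: string;
}

interface PromptTemplate {
  version: string; // recorded with every call so eval scores map to a prompt version
  system: string;
}

// Returns up to k snippets that share at least one word with the query,
// ordered by how many of their words appear in the query.
function topSnippets(query: string, corpus: Snippet[], k: number): Snippet[] {
  const terms = new Set(query.toLowerCase().split(/\W+/).filter((t) => t.length > 0));
  return corpus
    .map((s) => ({
      s,
      score: s.text.toLowerCase().split(/\W+/).filter((w) => terms.has(w)).length,
    }))
    .filter((x) => x.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((x) => x.s);
}

// Assembles the prompt text from the template, the retrieved context, and the question.
function buildPrompt(t: PromptTemplate, query: string, snippets: Snippet[]): string {
  const context = snippets.map((s) => `[${s.id}] ${s.text}`).join("\n");
  return `${t.system}\n(prompt ${t.version})\n\nContext:\n${context}\n\nQuestion: ${query}`;
}
```

Carrying the version string inside the template is one simple way to tie a score in the eval report back to the exact prompt that produced it.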

Evaluation

Built before the feature, not after

  • Labelled eval set
  • Automated scoring
  • Regression gate on every change
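
A labelled eval set with automated scoring can be as small as the sketch below. It assumes a very simple scoring rule (every expected phrase must appear in the output) and a synchronous `generate` stub standing in for the real model call, which would normally be asynchronous. All type and function names are hypothetical.

```typescript
// Hypothetical sketch: a tiny evaluation harness over a labelled set.
// Scoring rule (an assumption): a case passes when every expected phrase is present.

interface EvalCase {
  id: string;
  input: string;
  mustInclude: string[]; // the label: phrases a good output has to contain
}

interface EvalResult {
  id: string;
  passed: boolean;
}

// Scores one case with a case-insensitive phrase check.
function scoreCase(c: EvalCase, output: string): EvalResult {
  const normalized = output.toLowerCase();
  const passed = c.mustInclude.every((p) => normalized.includes(p.toLowerCase()));
  return { id: c.id, passed };
}

// Runs every case through the generator and reports the overall pass rate.
function runEval(
  cases: EvalCase[],
  generate: (input: string) => string,
): { passRate: number; results: EvalResult[] } {
  const results = cases.map((c) => scoreCase(c, generate(c.input)));
  const passedCount = results.filter((r) => r.passed).length;
  const passRate = cases.length === 0 ? 0 : passedCount / cases.length;
  return { passRate, results };
}
```

Real scoring is usually richer than phrase matching, but even a rule this crude turns "does it look good?" into a number that can be tracked per change.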

Guardrails

  • Input validation
  • Rate + cost limits
  • Uncertain/Unavailable fallback
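
The guardrails listed above might be combined in a single wrapper along these lines. This is a sketch under stated assumptions: the limits `MAX_INPUT_CHARS` and `MIN_CONFIDENCE` are made-up values, the `confidence` field is a hypothetical signal (the source does not say how uncertainty was detected), and `callModel` is a synchronous stand-in for the real model call.

```typescript
// Hypothetical sketch: input validation plus a fallback path.
// Thresholds and the confidence signal are illustrative assumptions.

type Answer =
  | { kind: "answer"; text: string }
  | { kind: "fallback"; reason: "invalid_input" | "unavailable" | "low_confidence" };

interface ModelReply {
  text: string;
  confidence: number; // assumed to be in the range 0..1
}

const MAX_INPUT_CHARS = 2000; // assumption
const MIN_CONFIDENCE = 0.7; // assumption

function answerWithGuardrails(
  input: string,
  callModel: (prompt: string) => ModelReply,
): Answer {
  // Reject empty or oversized input before spending any model budget.
  const trimmed = input.trim();
  if (trimmed.length === 0 || trimmed.length > MAX_INPUT_CHARS) {
    return { kind: "fallback", reason: "invalid_input" };
  }
  try {
    const reply = callModel(trimmed);
    // Fall back rather than show an answer the model is unsure about.
    if (reply.confidence < MIN_CONFIDENCE) {
      return { kind: "fallback", reason: "low_confidence" };
    }
    return { kind: "answer", text: reply.text };
  } catch (_err) {
    // Any failure of the model call is treated as "unavailable".
    return { kind: "fallback", reason: "unavailable" };
  }
}
```

Returning a tagged union instead of throwing lets the UI render a specific fallback state for each reason.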

Observability

  • Tracing
  • Per-call cost
  • Quality dashboard

Cloud

  • AWS
  • PostgreSQL
  • GitHub Actions

Risks we actively managed

  • Hallucination and wrong answers: the model is grounded on the product's own data and every change is scored against a labelled eval set before it merges.
  • Silent quality regressions: an evaluation gate runs on every change, so a prompt tweak that helps one case and breaks five is caught in CI.
  • Runaway cost: per-call cost limits, rate limiting, and monitoring so an AI feature cannot quietly burn the budget.
  • Unsafe rollout: shipped behind a feature flag and enabled for a small cohort first, so any issue is contained and reversible.
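
As an illustration of the cost controls mentioned in the list above, the following sketch computes a per-call cost from token counts and refuses calls that would exceed a per-call or total budget. The prices, limits, and the names `callCost` and `BudgetGuard` are hypothetical; real pricing depends on the model and provider.

```typescript
// Hypothetical sketch: per-call cost calculation and a simple spend guard.
// Prices and limits are placeholders, not real provider pricing.

interface Usage {
  promptTokens: number;
  completionTokens: number;
}

interface Pricing {
  promptPer1k: number; // USD per 1,000 prompt tokens (assumed)
  completionPer1k: number; // USD per 1,000 completion tokens (assumed)
}

function callCost(u: Usage, p: Pricing): number {
  return (u.promptTokens / 1000) * p.promptPer1k + (u.completionTokens / 1000) * p.completionPer1k;
}

class BudgetGuard {
  private spentUsd = 0;
  private readonly perCallLimitUsd: number;
  private readonly totalLimitUsd: number;

  constructor(perCallLimitUsd: number, totalLimitUsd: number) {
    this.perCallLimitUsd = perCallLimitUsd;
    this.totalLimitUsd = totalLimitUsd;
  }

  // Records the spend and returns true when the call fits both limits;
  // returns false and records nothing otherwise.
  tryCharge(costUsd: number): boolean {
    if (costUsd > this.perCallLimitUsd) return false;
    if (this.spentUsd + costUsd > this.totalLimitUsd) return false;
    this.spentUsd += costUsd;
    return true;
  }

  get spent(): number {
    return this.spentUsd;
  }
}
```

In practice the cost would be estimated before the call and reconciled with actual usage afterwards; the guard above only shows the accounting.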

Workflow

Tracked end-to-end in BuildOS.

Every meeting summary, requirement, sprint, task, and metric in this case study was rendered in BuildOS during the engagement. The customer's team had read-only access to the same workspace from week one, so they saw Friday demos, weekly velocity, and AI-generated checklists without us sending status emails.

The challenge

A SaaS product company had a live product and a clear, time-sensitive need for a new AI feature. They wanted it in production, not in a demo, and trustworthy enough to put in front of real users.

The hard part of shipping AI quickly is not writing the prompt. It is knowing whether the output is actually good, and keeping it good as the feature changes. That is where most fast AI builds go wrong: they ship something that demos well and degrades quietly. The constraint we set ourselves was that quality had to be measurable from day one.

How we approached it

The team was a two-engineer pod, with one engineer focused on the AI pipeline, plus a QA engineer and a Hashorn PM. We inverted the usual order and built the evaluation harness before the feature. On day one we wrote down what good output means for this specific use case, and by day three we had a labelled set of real example inputs and an automated way to score against it.

That single decision shaped everything. From then on, every change to the prompt or the retrieval pipeline was scored before it merged, so we were iterating against a number, not a hunch. Grounding the model on the product's own data kept it honest, and a feature flag meant we could ship to production without a risky big-bang launch.
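
Scoring every change before it merges amounts to a regression gate. The sketch below shows one possible shape for that check, assuming per-case pass/fail scores for the baseline and the candidate change; the threshold and the names `Scores`, `GateReport`, and `regressionGate` are hypothetical, and the rule "no previously passing case may break" is an assumption rather than the documented policy.

```typescript
// Hypothetical sketch: a merge gate comparing a candidate change with the baseline.
// Assumed rule: block the merge if any passing case breaks or the pass rate is too low.

type Scores = Record<string, boolean>; // eval case id -> passed

interface GateReport {
  fixed: string[]; // cases that failed before and pass now
  broken: string[]; // cases that passed before and fail (or are missing) now
  passRate: number; // candidate pass rate over the baseline's case ids
  pass: boolean;
}

function regressionGate(baseline: Scores, candidate: Scores, minPassRate: number): GateReport {
  const ids = Object.keys(baseline);
  const fixed = ids.filter((id) => !baseline[id] && candidate[id] === true);
  const broken = ids.filter((id) => baseline[id] && candidate[id] !== true);
  const passing = ids.filter((id) => candidate[id] === true).length;
  const passRate = ids.length === 0 ? 0 : passing / ids.length;
  return { fixed, broken, passRate, pass: broken.length === 0 && passRate >= minPassRate };
}
```

A CI job could run the eval set, call a function like this, and fail the build when `pass` is false, listing the `broken` case ids in the log.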

What we shipped

Days 1 to 3: measurement first. The use case, the success criteria, the guardrails, and a working evaluation harness on a labelled set of real inputs.

Days 4 to 6: the feature, against the evals. The prompt and retrieval pipeline, grounded on first-party data and iterated until it cleared the quality bar, with every change scored in CI.

Days 7 to 8: product integration. Wired into the UI behind a feature flag, with input validation, rate and cost limits, and a fallback for when the model is uncertain or unavailable.

Days 9 to 10: safety and rollout. An adversarial QA pass, Playwright tests on the critical paths, and a staged production rollout to an initial cohort with traces, cost, and quality all monitored in one place. This is the same discipline we describe in our writing on observability for AI products and building an LLM evaluation harness.

The outcome

  • A production AI feature in 10 working days, shipped behind a flag for a controlled rollout.
  • Quality measured from day three against a labelled eval set, so the team shipped against a number rather than a hunch.
  • A durable evaluation harness that keeps gating quality on every future change, owned by the client in their own repo.

What we'd repeat

Building the evaluation harness before the feature is the thing we would do again on every AI build, fast or slow. It turned a subjective question into a measurable one and made it safe to move quickly, because we could always tell whether a change helped or hurt. The other call that paid off was shipping dark behind a feature flag: it kept launch-day risk contained and reversible, and let the feature prove itself on a small cohort first.
