The challenge
A SaaS product company had a live product and a clear, time-sensitive need for a new AI feature. They wanted it in production, not in a demo, and trustworthy enough to put in front of real users.
The hard part of shipping AI quickly is not writing the prompt. It is knowing whether the output is actually good, and keeping it good as the feature changes. That is where most fast AI builds go wrong: they ship something that demos well and degrades quietly. The constraint we set ourselves was that quality had to be measurable from day one.
How we approached it
The team was a two-engineer pod, with one engineer focused on the AI pipeline, supported by a QA engineer and a Hashorn PM. We inverted the usual order and built the evaluation harness before the feature. On day one we wrote down what good output means for this specific use case, and by day three we had a labelled set of real example inputs and an automated way to score against it.
That single decision shaped everything. From then on, every change to the prompt or the retrieval pipeline was scored before it merged, so we were iterating against a number, not a hunch. Grounding the model on the product's own data kept it honest, and a feature flag meant we could ship to production without a risky big-bang launch.
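As a rough illustration of what "scoring every change against a labelled set" can look like, here is a minimal Python sketch. It is not the harness built on this project. The keyword-overlap metric, the names EvalCase, score_case, run_eval and passes_gate, the 0.7 threshold, and the canned stub model are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class EvalCase:
    """One labelled example (hypothetical label format: required keywords)."""
    input_text: str
    expected_keywords: Tuple[str, ...]


def score_case(output: str, case: EvalCase) -> float:
    """Return the fraction of expected keywords found in the output."""
    if not case.expected_keywords:
        return 1.0
    text = output.lower()
    hits = sum(1 for kw in case.expected_keywords if kw.lower() in text)
    return hits / len(case.expected_keywords)


def run_eval(generate: Callable[[str], str], cases: Sequence[EvalCase]) -> float:
    """Run the pipeline on every labelled case and return the mean score."""
    if not cases:
        raise ValueError("the labelled set is empty")
    total = sum(score_case(generate(c.input_text), c) for c in cases)
    return total / len(cases)


def passes_gate(score: float, threshold: float) -> bool:
    """CI gate: a change is allowed to merge only at or above the threshold."""
    return score >= threshold


# Canned stand-in for the real prompt and retrieval pipeline (assumption).
CANNED = {
    "How do I reset my password?": "Open Settings and choose Reset password.",
    "How do I export data?": "Use the Export button on the dashboard.",
}


def stub_generate(prompt: str) -> str:
    return CANNED.get(prompt, "")


CASES = [
    EvalCase("How do I reset my password?", ("settings", "password")),
    EvalCase("How do I export data?", ("export", "csv")),
]

score = run_eval(stub_generate, CASES)
print(f"eval score: {score:.2f}, gate passed: {passes_gate(score, 0.7)}")
```

In a CI job the same gate result would decide whether the change is allowed to merge. A real harness would likely replace the keyword check with a metric suited to the use case, but the shape stays the same: labelled cases in, one number out, and a threshold.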
What we shipped
Days 1 to 3, the measurement first. The use case, the success criteria, the guardrails, and a working evaluation harness on a labelled set of real inputs.
Days 4 to 6, the feature against the evals. The prompt and retrieval pipeline, grounded on first-party data and iterated until it cleared the quality bar, with every change scored in CI.
Days 7 to 8, product integration. Wired into the UI behind a feature flag, with input validation, rate and cost limits, and a fallback for when the model is uncertain or unavailable.
Days 9 to 10, safety and rollout. An adversarial QA pass, Playwright on the critical paths, and a staged production rollout to an initial cohort with traces, cost, and quality all monitored in one place. This is the same discipline we describe in our writing on observability for AI products and building an LLM evaluation harness.
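The integration and rollout steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the client's code. The flag name, the hash-based bucketing, the input length limit, the idea that the model call returns a confidence value, and the 0.6 cut-off are all assumptions. Rate and cost limits are left out for brevity.

```python
import hashlib
from typing import Callable, Optional, Tuple

FLAG_NAME = "ai-feature"      # hypothetical flag name
MAX_INPUT_CHARS = 2000        # hypothetical validation limit
FALLBACK_MESSAGE = "Sorry, we can't answer that right now."


def in_rollout(user_id: str, flag_name: str, percent: int) -> bool:
    """Deterministically place a user in a bucket 0..99 and compare it to the
    rollout percentage, so the same user always gets the same experience."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def answer(
    user_id: str,
    text: str,
    call_model: Callable[[str], Tuple[str, float]],  # returns (reply, confidence)
    percent: int,
    min_confidence: float = 0.6,
) -> Optional[str]:
    """Return the model reply, a fallback message, or None when the feature
    is switched off for this user."""
    if not in_rollout(user_id, FLAG_NAME, percent):
        return None  # feature stays dark for users outside the cohort

    cleaned = text.strip()
    if not cleaned or len(cleaned) > MAX_INPUT_CHARS:
        return FALLBACK_MESSAGE  # input validation failed

    try:
        reply, confidence = call_model(cleaned)
    except Exception:
        return FALLBACK_MESSAGE  # model unavailable

    if confidence < min_confidence:
        return FALLBACK_MESSAGE  # model uncertain
    return reply
```

Raising the percent value step by step widens the cohort without a redeploy, and setting it to 0 turns the feature off for everyone.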
The outcome
- A production AI feature in 10 working days, shipped behind a flag for a controlled rollout.
- Quality measured by day three against a labelled eval set, so the team shipped against a number rather than a hunch.
- A durable evaluation harness that keeps gating quality on every future change, owned by the client in their own repo.
What we'd repeat
Building the evaluation harness before the feature is the thing we would do again on every AI build, fast or slow. It turned a subjective question into a measurable one and made it safe to move quickly, because we could always tell whether a change helped or hurt. The other call that paid off was shipping dark behind a feature flag: it avoided a risky big-bang launch and let the feature prove itself on a small cohort first.