Hashorn

Case study · SaaS Recommendations · Series B

12 ML models from notebooks to production rails in eight weeks

Trove had 7 ML models in production on hand-managed EC2 and 5 more in development, with manual deploys, no monitoring, and a customer SLA breach two weeks earlier. We stood up a full MLOps platform on AWS in 8 weeks and migrated all 12 models with zero downtime.


Client

Trove

Engagement

Platform stand-up + migration

Duration

8 weeks (Q2 2026)

Team

1 senior ML engineer + 2 senior platform engineers

Services

MLOps · Cloud & DevOps

Outcomes at a glance

Models in production on new rails

12 / 12

Manual deploys per week

3–4 → 0

p95 inference latency

240ms → 88ms (-63%)

First drift alert before customer noticed

Week 6

Sprint timeline

How the engagement unfolded

  1. Wk 1

    Audit + migration plan

    One-week audit catalogued 7 production models, 5 deployment gaps, and the lineage void. Migration plan agreed: incremental, reversible, one model at a time.

    Stack inventory · prioritised gap list

  2. Wk 2

    Reference platform on EKS

    Argo Workflows for training, MLflow as the source of truth for artifacts, BentoML for packaging, Triton for serving. End-to-end on one reference model.

    Reference pipeline · sample model live

  3. Wk 3

    Feature parity + first migration

    Feature service with online/offline parity tests on every release. First production model migrated with a 5% → 25% → 100% canary rollout.

    Model 1 of 12 · latency improved 40%

  4. Wk 4

    Canary tooling + 4 more migrations

    Automated rollback on error-rate regression. Models 2-5 migrated in parallel sprints; each gated on parity-test pass.

    5 of 12 · zero rollbacks

  5. Wk 5

    Drift detection wired in

    Evidently AI dashboards live for data drift, prediction drift, and target drift on labelled subset. PagerDuty integration tested with a synthetic drift injection.

    Drift alerts firing on synthetic data

  6. Wk 6

    Drift catch + migrations 6-9

    Real drift alert fired on the user_country feature — customer's upstream system had silently changed the encoding. Migrations 6-9 completed alongside the incident response.

    Real drift caught + averted

  7. Wk 7

    Final migrations + monitoring

    Models 10-12 migrated. Prometheus + Grafana dashboards across latency, throughput, error rate, and per-model business KPIs (recommendation CTR, conversion rate).

    12 of 12 · 14 dashboards live

  8. Wk 8

    Runbook + handover

    47-page runbook covering incident playbooks for drift, latency regression, and feature-store outage. 30-day Slack hyper-care window opens.

    Handover complete · retainer optional

Architecture

The stack we shipped on

Orchestration

  • Argo Workflows
  • Kubernetes
  • AWS EKS

Model registry

Single source of truth; explicit promotion to production

  • MLflow
  • S3 artifacts

Packaging + serving

  • BentoML
  • Triton Inference Server

Features

Parity tests on every release

  • Custom feature service
  • Redis online
  • S3/Parquet offline

Observability

  • Prometheus
  • Grafana
  • Evidently AI
  • PagerDuty

CI/CD

  • GitHub Actions
  • Canary rollouts
  • Auto-rollback on regression

Risks we actively managed

  • Migration downtime — every model migrated with a canary; existing EC2 deployment stayed live until parity was proven.
  • Online/offline feature drift — the bug class behind the original SLA breach. Parity tests now run on every release.
  • Customer SLAs during migration — p95 latency budget reduced as we went; rollback triggered on any regression.
  • Drift detection false positives — calibrated against historical data; alert fatigue managed with severity tiers.

Workflow

Tracked end-to-end in BuildOS.

Every meeting summary, requirement, sprint, task, and metric in this case study was tracked in BuildOS during the engagement. The customer's team had read-only access to the same workspace from week one: they saw Friday demos, weekly velocity, and AI-generated checklists without us sending status emails.

The challenge

Trove's recommendation engine had 7 ML models in production and 5 more in development, all deployed by SSH-ing into EC2 boxes and running a manual rollout script. No autoscaling, no model registry, no drift monitoring. A customer SLA breach two weeks before our engagement started had triggered a board-level conversation about reliability.

The deeper symptoms behind the obvious gaps:

  • No lineage. Three of the seven production models had no documented training data version. If they degraded, nobody could reproduce the training run.
  • No drift detection. When a customer's data changed, recommendations would silently degrade until support tickets accumulated.
  • No deployment isolation. A bad model push would replace the live model with no canary, no rollback path.
  • Online/offline feature drift. The features used at training time and the features used at inference time were computed in two different services that had silently drifted apart.

How we approached it

One-week audit, then a stand-up + migration plan agreed with Trove's ML team. The migration was incremental and reversible: build the pipeline, prove it on one model, then move the rest one at a time. Every migrated model had to outperform its EC2 predecessor on latency before it became canonical.

Trove's data scientists kept ownership of modeling. We owned the rails: how models ship, run, scale, and stay accountable.

What we shipped

Reference platform (weeks 1–3)

  • Pipeline orchestration on AWS EKS. Argo Workflows for training + eval; Triton for serving; everything running on a shared cluster with isolated namespaces.
  • MLflow as the single source of truth. Every training run logs hyperparameters, metrics, and model artifacts. Promotion to production requires explicit signoff (the promotion gate is sketched after this list).
  • BentoML for container packaging. Reproducible build → push → deploy chain via GitHub Actions.
  • Online + offline feature parity. Custom feature service backed by Redis (online) + S3/Parquet (offline); enforced parity tests on every release.
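
For the registry flow above, a minimal sketch of the promotion gate, assuming MLflow's model-registry client; the model name, signoff tag, and the archiving of the previous Production version are illustrative choices, not Trove's exact configuration.

```python
# Illustrative promotion gate: a candidate model version reaches the
# Production stage in the MLflow registry only via an explicit, attributed
# call. Model name and signoff mechanism are assumptions for the sketch.
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()  # uses MLFLOW_TRACKING_URI from the environment


def promote_if_signed_off(model_name: str, version: str, signed_off_by: str | None) -> None:
    """Move a registered model version to Production, but only with signoff."""
    if not signed_off_by:
        raise RuntimeError(f"{model_name} v{version}: no signoff recorded, refusing to promote")

    # Record who approved the promotion on the model version itself.
    client.set_model_version_tag(model_name, version, "signed_off_by", signed_off_by)

    # Explicit stage transition; nothing reaches Production implicitly.
    client.transition_model_version_stage(
        name=model_name,
        version=version,
        stage="Production",
        archive_existing_versions=True,  # retire the previous Production version
    )


promote_if_signed_off("recsys-ranker", "14", signed_off_by="ml-lead@trove.example")
```

The design point is that nothing reaches the Production stage by side effect; a version moves only through this explicit, attributed call.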

Migration (weeks 4–7)

  • Reference model first. Trove's lowest-risk model migrated first to prove the pipeline. p95 latency dropped from 280ms to 92ms; that became the proof point for the rest.
  • Canary rollouts at 5% → 25% → 100% with automatic rollback on error-rate regression (the gate logic is sketched after this list).
  • Eleven more models migrated over four weeks, one per workstream, parallel to ongoing feature work.
  • Feature schema validation at training and inference; mismatch fails the deploy.
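
As a sketch of the rollback gate behind those canaries: before traffic advances 5% → 25% → 100%, the canary's error rate is compared against the stable deployment. The PromQL expressions, deployment labels, and regression threshold below are assumptions for illustration, not the values we ran.

```python
# Illustrative canary gate: advance traffic only if the canary's 5-minute
# error rate stays within tolerance of the stable deployment. Metric names
# and thresholds are assumptions, not Trove's production values.
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"


def error_rate(deployment: str) -> float:
    """5-minute error rate for one deployment, queried from Prometheus."""
    query = (
        f'sum(rate(http_requests_total{{deployment="{deployment}",status=~"5.."}}[5m]))'
        f' / sum(rate(http_requests_total{{deployment="{deployment}"}}[5m]))'
    )
    result = requests.get(PROM_URL, params={"query": query}, timeout=10).json()
    samples = result["data"]["result"]
    return float(samples[0]["value"][1]) if samples else 0.0


def canary_should_advance(model: str, max_regression: float = 0.005) -> bool:
    """Advance the canary only if its error rate is within tolerance of stable."""
    stable = error_rate(f"{model}-stable")
    canary = error_rate(f"{model}-canary")
    if canary > stable + max_regression:
        # Regression detected: the rollout controller shifts traffic back to stable.
        return False
    return True
```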

Operations (week 8)

  • Drift detection (Evidently AI) wired to PagerDuty. Three drift dashboards: data drift, prediction drift, and target drift on the labelled subset (a sketch of the drift check follows this list).
  • Monitoring stack (Prometheus + Grafana). 14 dashboards covering latency, throughput, error rates, and per-model business KPIs (recommendation CTR, conversion rate).
  • 47-page runbook + handover. Including incident playbooks for drift, latency regression, and feature-store outage.
  • 30-day Slack hyper-care window for the ML team's questions during handover.
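
A minimal sketch of the scheduled drift check, assuming Evidently's Report API with the data-drift preset; the feature snapshots and the print stand-in for the PagerDuty hook are illustrative, not the exact integration we shipped.

```python
# Illustrative drift check: compare a reference feature window (training
# snapshot) against the most recent serving window and flag dataset-level
# drift. File paths and the alerting hook are assumptions for the sketch.
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset


def data_drift_detected(reference: pd.DataFrame, current: pd.DataFrame) -> bool:
    """Return True if Evidently flags dataset-level drift between the two windows."""
    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference, current_data=current)
    for metric in report.as_dict()["metrics"]:
        result = metric.get("result", {})
        if "dataset_drift" in result:
            return bool(result["dataset_drift"])
    return False


# Example wiring: reference = training snapshot, current = last 24h of requests.
reference = pd.read_parquet("s3://features/offline/user_features_train.parquet")
current = pd.read_parquet("s3://features/offline/user_features_last24h.parquet")

if data_drift_detected(reference, current):
    # In production this raised a PagerDuty event; here we just surface the signal.
    print("Data drift detected: page the on-call ML engineer")
```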

Outcomes

  • All 12 models in production on the new rails. Zero downtime during migration.
  • 0 manual deploys in the 30 days post-migration (down from 3–4 per week).
  • p95 inference latency: 240ms → 88ms (-63%). Combination of autoscaling, Triton optimizations, and removing the per-request feature lookup.
  • Drift detection caught a feature schema change in week 6. Customer's upstream system had silently changed how user_country was encoded; alert fired before customer noticed degraded recommendations. Estimated impact averted: roughly the same magnitude as the original SLA breach that started this engagement.
  • Internal ML team owns the platform. They've shipped two new models on the rails since handover.

What we'd repeat

Migrate incrementally. The first model on the new rails caught three platform bugs that would have blocked migration of the other eleven. If we'd tried to migrate all twelve at once, we'd have spent the first month firefighting.

The other lesson: feature parity is the silent killer. The bug that caused the original SLA breach was an online/offline feature drift bug: the model worked fine in training, fine in evaluation, and badly in production because the feature service computed user_country differently at inference time. Building feature parity tests into the pipeline (sketched below) meant that class of bug couldn't happen again.
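
A stripped-down version of that parity test might look like the sketch below; the Redis key scheme, Parquet layout, and sample size are assumptions for illustration rather than Trove's actual feature service.

```python
# Illustrative parity test: the same entity's features, read from the online
# store (Redis) and recomputed from the offline store (Parquet), must match
# before a release ships. Key scheme and paths are assumptions.
import json

import pandas as pd
import redis

ONLINE = redis.Redis(host="feature-store", port=6379, decode_responses=True)


def online_features(user_id: str) -> dict:
    """Features as the serving path sees them."""
    return json.loads(ONLINE.get(f"user:{user_id}"))


def offline_features(user_id: str, offline_df: pd.DataFrame) -> dict:
    """Features as the training path computed them."""
    row = offline_df.loc[offline_df["user_id"] == user_id].iloc[0]
    return row.drop("user_id").to_dict()


def test_online_offline_parity():
    offline_df = pd.read_parquet("s3://features/offline/user_features.parquet")
    for user_id in offline_df["user_id"].sample(100, random_state=0):
        online = online_features(user_id)
        offline = offline_features(user_id, offline_df)
        for key, offline_value in offline.items():
            # A divergence in how user_country (or any feature) is encoded
            # between the two paths fails the release here.
            assert online.get(key) == offline_value, (
                f"feature mismatch for user {user_id}: {key}"
            )
```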

We had a customer SLA breach that nobody on our team saw coming until support tickets started landing. Eight weeks later, drift detection caught a feature schema change before any customer saw degraded recommendations. That's the difference.

Vikram Shah

Head of ML, Trove

Want a result like this?

Tell us what you're building; we'll tell you how we'd ship it.

Book an intro call →