The challenge
Trove's recommendation engine had 7 ML models in production and 5 more in development, all deployed by SSH-ing into EC2 boxes and running a manual rollout script. No autoscaling, no model registry, no drift monitoring. A customer SLA breach two weeks before our engagement started had triggered a board-level conversation about reliability.
The deeper symptoms behind the obvious gaps:
- No lineage. Three of the seven production models had no documented training data version. If they degraded, nobody could reproduce the training run.
- No drift detection. When a customer's data changed, recommendations would silently degrade until support tickets accumulated.
- No deployment isolation. A bad model push would replace the live model with no canary, no rollback path.
- Online/offline feature drift. The features used at training time and the features used at inference time were computed in two different services that had silently drifted apart.
How we approached it
One-week audit, then a stand-up + migration plan agreed with Trove's ML team. The migration was incremental and reversible: build the pipeline, prove it on one model, then move the rest one at a time. Every migrated model had to outperform its EC2 predecessor on latency before it became canonical.
Trove's data scientists kept ownership of modeling. We owned the rails: how models ship, run, scale, and stay accountable.
What we shipped
Reference platform (weeks 1–3)
- Pipeline orchestration on AWS EKS. Argo Workflows for training + eval; Triton for serving; everything running on a shared cluster with isolated namespaces.
- MLflow as the single source of truth. Every training run logs hyperparameters, metrics, and model artifacts. Promotion to production requires explicit signoff (see the MLflow sketch after this list).
- BentoML for container packaging. Reproducible build → push → deploy chain via GitHub Actions.
- Online + offline feature parity. Custom feature service backed by Redis (online) + S3/Parquet (offline); enforced parity tests on every release.
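To make the parity guarantee concrete, here is a minimal sketch of the kind of release-gate check involved, assuming pandas plus a caller-supplied lookup against the online service; the feature names, the Parquet snapshot, and the fetch_online helper are illustrative stand-ins, not Trove's actual feature-service API.

```python
import pandas as pd

FEATURES = ["user_country", "avg_session_length", "items_viewed_7d"]  # illustrative

def offline_features(user_ids, snapshot_path):
    """Features exactly as the training pipeline sees them (S3/Parquet snapshot)."""
    df = pd.read_parquet(snapshot_path, columns=["user_id"] + FEATURES)
    return df[df["user_id"].isin(user_ids)].set_index("user_id").sort_index()

def parity_mismatches(user_ids, snapshot_path, fetch_online):
    """fetch_online(user_id) -> feature dict from the Redis-backed online service."""
    offline = offline_features(user_ids, snapshot_path)
    online = pd.DataFrame(
        [fetch_online(uid) for uid in offline.index], index=offline.index
    )[FEATURES]
    mismatches = {}
    for col in FEATURES:
        if pd.api.types.is_numeric_dtype(offline[col]):
            # flag anything outside tolerance, including non-numeric online values
            diff = (offline[col] - pd.to_numeric(online[col], errors="coerce")).abs()
            bad = ~(diff <= 1e-6)
        else:
            bad = offline[col].astype(str) != online[col].astype(str)
        if bad.any():
            mismatches[col] = int(bad.sum())
    return mismatches  # release gate: a non-empty dict fails the deploy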
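For the MLflow piece above, a minimal sketch of the log-then-promote pattern, assuming a registry-backed MLflow tracking server; the tracking URI, experiment and model names, and the toy sklearn model are illustrative, not Trove's actual setup.

```python
import mlflow
import mlflow.sklearn
import numpy as np
from mlflow.tracking import MlflowClient
from sklearn.ensemble import GradientBoostingClassifier

mlflow.set_tracking_uri("http://mlflow.internal:5000")  # hypothetical registry-backed server
mlflow.set_experiment("recommendations")

# Stand-in training data and model; the real runs are Argo workflow steps.
X, y = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)
model = GradientBoostingClassifier(learning_rate=0.05, n_estimators=400).fit(X, y)

with mlflow.start_run(run_name="candidate") as run:
    mlflow.log_params({"learning_rate": 0.05, "n_estimators": 400})
    mlflow.log_metrics({"train_accuracy": float(model.score(X, y))})
    mlflow.sklearn.log_model(model, artifact_path="model")

# Promotion is a separate, explicit step (the signoff), not a side effect of
# training: register the run's model, then move that version to Production.
client = MlflowClient()
mv = mlflow.register_model(f"runs:/{run.info.run_id}/model", "recsys-ranker")
client.transition_model_version_stage("recsys-ranker", mv.version, stage="Production")
```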
Migration (weeks 4–7)
- Reference model first. Trove's lowest-risk model migrated first to prove the pipeline. p95 latency dropped from 280ms to 92ms; that became the proof point for the rest.
- Canary rollouts at 5% → 25% → 100% with automatic rollback on error-rate regression (see the sketch after this list).
- Eleven more models migrated over four weeks, one per workstream, in parallel with ongoing feature work.
- Feature schema validation at training and inference; mismatch fails the deploy.
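A minimal sketch of the canary progression referenced above; set_traffic_split, error_rate, and rollback are hypothetical stand-ins for the platform's real traffic-shifting and metrics calls, and the stage durations and threshold are illustrative.

```python
import time

STAGES = [5, 25, 100]       # percent of traffic routed to the canary
SOAK_SECONDS = 600          # observation window per stage
MAX_ERROR_DELTA = 0.005     # allowed error-rate regression vs. the stable model

def run_canary(model, candidate, set_traffic_split, error_rate, rollback):
    baseline = error_rate(model, version="stable")
    for pct in STAGES:
        set_traffic_split(model, canary=candidate, percent=pct)
        time.sleep(SOAK_SECONDS)
        if error_rate(model, version=candidate) - baseline > MAX_ERROR_DELTA:
            # regression detected: pull the canary out of the traffic split
            set_traffic_split(model, canary=candidate, percent=0)
            rollback(model, candidate)
            return False
    return True  # candidate now serves 100% of traffic and becomes the stable version
```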
Operations (week 8)
- Drift detection (Evidently AI) wired to PagerDuty (see the drift-check sketch after this list). Three drift dashboards: data drift, prediction drift, and target drift on the labeled subset.
- Monitoring stack (Prometheus + Grafana). 14 dashboards covering latency, throughput, error rates, and per-model business KPIs (recommendation CTR, conversion rate); the per-model serving metrics are sketched after this list.
- 47-page runbook + handover. Including incident playbooks for drift, latency regression, and feature-store outage.
- 30-day Slack hyper-care window for the ML team's questions during handover.
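A minimal sketch of the drift check and alert hook, assuming Evidently's Report API (0.2+) and PagerDuty's Events API v2; the dataframes, routing key, and exact result keys are illustrative and may differ by Evidently version.

```python
import requests
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

def check_and_alert(reference_df, current_df, pagerduty_routing_key):
    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference_df, current_data=current_df)
    result = report.as_dict()
    # DataDriftPreset's first metric summarises dataset-level drift.
    drifted = result["metrics"][0]["result"].get("dataset_drift", False)
    if drifted:
        requests.post(
            "https://events.pagerduty.com/v2/enqueue",
            json={
                "routing_key": pagerduty_routing_key,
                "event_action": "trigger",
                "payload": {
                    "summary": "Data drift detected on recommendation features",
                    "source": "drift-monitor",
                    "severity": "warning",
                },
            },
            timeout=10,
        )
    return drifted
```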
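And a minimal sketch of the per-model serving metrics behind those dashboards, using prometheus_client; the metric names, labels, and port are illustrative.

```python
from prometheus_client import Counter, Histogram, start_http_server

INFERENCE_LATENCY = Histogram(
    "inference_latency_seconds", "Inference latency per model", ["model"]
)
INFERENCE_ERRORS = Counter(
    "inference_errors_total", "Inference errors per model", ["model"]
)

def predict_with_metrics(model_name, model, features):
    """Wrap a model call so latency and errors are recorded per model."""
    with INFERENCE_LATENCY.labels(model=model_name).time():
        try:
            return model.predict(features)
        except Exception:
            INFERENCE_ERRORS.labels(model=model_name).inc()
            raise

start_http_server(9100)  # expose /metrics for Prometheus to scrape
```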
Outcomes
- All 12 models in production on the new rails. Zero downtime during migration.
- 0 manual deploys in the 30 days post-migration (down from 3–4 per week).
- p95 inference latency: 240ms → 88ms (-63%). Combination of autoscaling, Triton optimizations, and removing the per-request feature lookup.
- Drift detection caught a feature schema change in week 6. The customer's upstream system had silently changed how user_country was encoded; the alert fired before the customer noticed degraded recommendations. Estimated impact averted: roughly the same magnitude as the original SLA breach that started this engagement.
- Internal ML team owns the platform. They've shipped two new models on the rails since handover.
What we'd repeat
Migrate incrementally. The first model on the new rails caught three platform bugs that would have blocked migration of the other eleven. If we'd tried to migrate all twelve at once, we'd have spent the first month firefighting.
The other lesson: feature parity is the silent killer. The bug that caused the original SLA breach was an online/offline feature drift bug: the model worked fine in training, fine in evaluation, and failed in production because the feature service computed user_country differently at inference time. Building feature parity tests into the pipeline meant that class of bug couldn't happen again.