Event-driven auto-retraining via drift hook (DEPRECATED)

Status: Deprecated · PredictFlow Feature Store · 04: Monitoring & Auto-Retrain (original framing → reversal)
By AI-DE Engineering Team · Stakeholders: ML platform engineer, on-call, model-quality reviewer

Context (when this was Accepted)

The original landing copy and early Module 04 design promised a fully automated retraining trigger: monitoring/drift_detection.py detects feature drift via PSI / KS test and emits an event, a downstream consumer kicks off a retraining job, and the new model is auto-promoted to the Production stage in the MLflow Registry once it passes an accuracy gate.

The promised pattern looked like:

[Prediction served] → [drift_detection.py] ──drift>threshold──→ [event bus] →
  [retraining job] ──accuracy>gate──→ [auto-promote to MLflow Production]

The pitch was clean: the system retrains itself.

What changed (and why we reversed)

In implementation, three problems surfaced:

  1. False-positive drift triggers were a real failure mode. A weekend marketing push shifts the input distribution by 18 % (PSI 0.25 on monthly_charges), the auto-trigger fires, and by Sunday morning a new model trained on contaminated weekend data is being auto-promoted to Production.

  2. No human-in-the-loop = no accountability. When an auto-promoted model dropped recall by 6 % the following Tuesday, the post-mortem was "the system did it" — no engineer could vouch for the new model.

  3. The event-bus dependency added a new failure mode. Adding Kafka topics for drift events meant Module 04 now had to teach Kafka producer / consumer patterns for the monitoring layer, which is out of scope for an MLOps tutorial.
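
To make the first problem concrete, here is a minimal sketch of how a PSI score can be computed. It is not the code in monitoring/drift_detection.py; the function name psi, the equal-width binning, the 10-bin default, and the synthetic data (a reference sample scaled by 1.18 to mimic an 18 % shift) are all assumptions for illustration, so the printed score will not match the 0.25 quoted above.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference sample and a live sample.

    Hypothetical helper: bins are equal-width and derived from the reference
    sample only, which is one common convention among several.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp so values outside the reference range land in the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or a division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic illustration (assumed data, not project data).
reference = [20 + i * 0.1 for i in range(1000)]
weekend = [v * 1.18 for v in reference]

print(f"PSI, reference vs itself:  {psi(reference, reference):.3f}")
print(f"PSI, reference vs weekend: {psi(reference, weekend):.3f}")
```

The score says only that the two samples differ. It cannot tell a temporary marketing push from a lasting change in the population, which is why a threshold on it makes a poor promotion trigger by itself.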

The fix landed in Module 04: keep the drift detection (it's accurate), keep the retraining job (it's necessary), but decouple the trigger:

  • monitoring/drift_detection.py — emits drift scores + alerts to Prometheus + AlertManager → Slack channel (#ml-drift).
  • k8s/retraining-cronjob.yml — runs nightly (or on demand via kubectl create job --from=cronjob).
  • retraining/trigger_retraining.py — pulls latest features, retrains, registers a new MLflow Staging version, posts a Slack approval request to a human reviewer.
  • Promotion to the Production stage is a manual MLflow Registry stage transition (MlflowClient.transition_model_version_stage, or a UI click), gated by the reviewer.

[Prediction served] → [drift_detection.py] ──drift>threshold──→ [Slack alert] →
  [human reviews drift report] →
  [kubectl create job --from=cronjob/retrain] →
  [trigger_retraining.py writes new Staging model] →
  [reviewer promotes Staging → Production after eval]
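
The decoupling in this flow can be sketched in a few lines. This is a toy model, not the project's code: Registry, ModelVersion, on_drift, and the 0.25 threshold are hypothetical names and values, and the in-memory registry only stands in for the MLflow Registry. The point it illustrates is that detection returns an alert and nothing else, the retraining step can only write Staging, and Production is reachable only through a call that names a reviewer.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ModelVersion:
    version: int
    stage: str = "Staging"             # the retraining job only ever writes Staging
    promoted_by: Optional[str] = None  # reviewer who approved the move to Production


@dataclass
class Registry:
    """Hypothetical in-memory stand-in for a model registry."""
    versions: List[ModelVersion] = field(default_factory=list)

    def register_staging(self) -> ModelVersion:
        mv = ModelVersion(version=len(self.versions) + 1)
        self.versions.append(mv)
        return mv

    def promote(self, mv: ModelVersion, reviewer: str) -> ModelVersion:
        # The gate: without a named reviewer there is no Production transition.
        if not reviewer.strip():
            raise PermissionError("promotion requires a named human reviewer")
        mv.stage = "Production"
        mv.promoted_by = reviewer
        return mv


def on_drift(score: float, threshold: float) -> Optional[str]:
    """Detection only produces an alert message; it never starts a retrain."""
    if score > threshold:
        return f"drift score {score:.2f} exceeds {threshold:.2f}; review before retraining"
    return None


registry = Registry()
alert = on_drift(0.31, threshold=0.25)           # would be routed to Slack
candidate = registry.register_staging()          # what the retraining job does
registry.promote(candidate, reviewer="alice")    # what the human reviewer does
print(alert)
print(candidate)
```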

The reversal trades "the system retrains itself" for "the system signals when retraining is justified, then a human decides". In production this is what teams actually run.

Why we left this ADR Deprecated rather than deleting it

Future readers looking at monitoring/drift_detection.py will see it emits Prometheus metrics + Slack alerts but does not publish to a Kafka topic or invoke a retraining job directly. The interesting question — why didn't they ship the auto-trigger — is answered by this ADR.

The MADR convention treats Deprecated ADRs as part of the permanent record. We follow that convention.

What we got wrong (and what we'd do again)

Got wrong:

  • We marketed "automated retraining" before we'd shipped the human-gate story. The original landing-page bullet ("drift detection with automated retraining triggers") was a forward-looking claim that required a maturity stage we hadn't reached.
  • We treated the drift detector and the retraining trigger as a single pipeline. They're separate concerns: detection answers "is the model wrong?", retraining answers "is the new candidate better?", and promotion answers "should we ship it?". Conflating the three was the underlying architectural mistake.

Got right:

  • The drift detector itself (PSI + KS + Evidently Reports) is a keeper. The math hasn't changed; what changed is who decides what to do with the score.
  • The CronJob shape (k8s/retraining-cronjob.yml) is the right operational primitive. It's predictable, observable, and resumable.
  • The manual promotion step preserves the audit trail. A registry transition has a userId field; an automated promotion doesn't.
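
For readers less familiar with the KS half of the detector, the two-sample statistic is small enough to sketch directly. The function name ks_statistic is hypothetical and this is not the project's implementation; a production detector would more likely call scipy.stats.ks_2samp, which also returns a p-value.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(reference: Sequence[float], live: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the two empirical CDFs.
    The gap can only change at observed values, so checking those suffices.
    """
    ref, cur = sorted(reference), sorted(live)
    gap = 0.0
    for x in set(ref) | set(cur):
        cdf_ref = bisect_right(ref, x) / len(ref)  # share of reference values <= x
        cdf_cur = bisect_right(cur, x) / len(cur)  # share of live values <= x
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


print(ks_statistic([1, 2, 3, 4], [1, 2, 3, 4]))  # identical samples
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # partially overlapping samples
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap). As with PSI, it measures how far the inputs moved, not whether a retrained model would be better.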

When (if ever) to revisit

A future ADR could re-enable event-driven auto-promotion when all of these are true:

  1. Feature drift has been trending stable (no false-positive promotions for a 90-day window).
  2. The shadow-traffic infrastructure exists — the candidate model serves 5 % of real traffic for a week before promotion is even considered.
  3. A rollback signal (real-time accuracy on a golden test set) is wired and tested.
  4. There's an Eng Manager + ML Lead sign-off that the audit trail doesn't require human-in-the-loop for compliance.
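
As an illustration of the second condition, one common way to give a candidate a fixed share of traffic is a deterministic hash split. The sketch below is an assumption about how such infrastructure might route requests, not something that exists in the project; routes_to_candidate and the request_id key are hypothetical.

```python
import hashlib


def routes_to_candidate(request_id: str, percent: float = 5.0) -> bool:
    """Deterministically route roughly `percent` of requests to the candidate.

    Hashing the request ID (rather than drawing a random number) means the
    same ID always gets the same answer, which keeps the split reproducible.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # bucket in 0..9999
    return bucket < percent * 100                        # 5.0 % -> buckets 0..499


sample = [f"req-{i}" for i in range(10_000)]
share = sum(routes_to_candidate(r) for r in sample) / len(sample)
print(f"share routed to candidate: {share:.2%}")
```

Keying on a customer ID instead of a request ID would pin each customer to one model for the whole evaluation week, which may or may not be desirable depending on what is being measured.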

Until then, this ADR stays Deprecated and the cron-plus-gate pattern stays the production path.

References

  • monitoring/drift_detection.py (the metric computer; not an event emitter)
  • k8s/retraining-cronjob.yml (the cron orchestrator)
  • retraining/trigger_retraining.py (the retraining job, writes to Staging)
  • prometheus/alerts/model_alerts.yml (drift alert rules → AlertManager → Slack)
  • alertmanager/config.yml (Slack routing for #ml-drift)
  • ADR-003 (MLflow Registry — the stage transition is the human gate)
  • ADR-004 (BentoML — picks up new models from MLflow Production stage)