Model Drift Explained: What It Is and How It Works
Model drift occurs when a production model's accuracy degrades because real-world data has shifted away from the data it was trained on. Data drift means the input feature distributions change; concept drift means the relationship between inputs and outputs changes. Both require monitoring and retraining; neither resolves itself.
Data Drift vs Concept Drift
Drift types and what changes:

```
DATA DRIFT (covariate shift):
  Training:   P(X) = age: 35±5,  income: 60k±15k
  Production: P(X) = age: 28±4,  income: 45k±12k   ← shifted
  Effect: model sees ages/incomes it never trained on

CONCEPT DRIFT:
  Training:   P(Y|X) = high_income → low churn risk
  Production: P(Y|X) = high_income → high churn risk   ← shifted
  Effect: the world changed; features same, labels differ

PREDICTION DRIFT (early-warning proxy):
  Training avg score:   0.23
  Production avg score: 0.61   ← something changed
```

Types of Drift
Data Drift
Feature distributions shift
The statistical distribution of input features changes. User demographics shift, market conditions change, or data collection processes change. The model still works — but on inputs it was not trained for.
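As a quick illustration, a two-sample Kolmogorov-Smirnov test (via scipy) flags this kind of shift directly from raw feature samples. The distributions below are synthetic, chosen to match the age numbers in the diagram above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical feature: age at training time vs. in production
train_age = rng.normal(35, 5, size=5_000)   # training: 35 ± 5
prod_age = rng.normal(28, 4, size=5_000)    # production: 28 ± 4 (shifted)

# Two-sample KS test: a small p-value means the distributions differ
stat, p_value = stats.ks_2samp(train_age, prod_age)
drifted = p_value < 0.05
print(f"KS statistic={stat:.3f}, p={p_value:.2e}, drift={drifted}")
```

With a shift this large, the KS statistic is well above 0.3 and the p-value is effectively zero; on two samples drawn from the same distribution it would typically stay above 0.05.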
Concept Drift
Input-output relationship changes
The relationship between features and labels changes. A fraud model trained pre-pandemic fails post-pandemic even if transaction features look the same — behavior patterns changed.
Label Drift
Target distribution changes
The proportion of target classes shifts — e.g. churn rate drops from 15% to 5% after a product improvement. A model calibrated for 15% churn becomes poorly calibrated.
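When only the base rate moves, predicted probabilities can be recalibrated by rescaling the model's odds by the ratio of new to old prior odds (the standard prior-shift correction). A minimal sketch using the 15% → 5% churn numbers above; `adjust_for_prior_shift` is an illustrative helper, not a library function:

```python
def adjust_for_prior_shift(p: float, old_prior: float, new_prior: float) -> float:
    """Recalibrate a predicted probability when only the class base rate changed.

    Multiplies the predicted odds by the ratio of new-to-old prior odds,
    which is the Bayes-correct adjustment under a pure label (prior) shift.
    """
    odds = p / (1 - p)
    prior_ratio = (new_prior / (1 - new_prior)) / (old_prior / (1 - old_prior))
    adjusted_odds = odds * prior_ratio
    return adjusted_odds / (1 + adjusted_odds)

# A model trained at 15% churn predicts 0.30 for a user;
# at a 5% base rate the calibrated estimate is roughly 0.11.
p_adj = adjust_for_prior_shift(0.30, old_prior=0.15, new_prior=0.05)
```

This only fixes calibration, not ranking; if the features-to-label relationship also changed (concept drift), retraining is still required.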
Detection Methods
| Test | Best for | Threshold |
|---|---|---|
| Population Stability Index (PSI) | Categorical + binary features | PSI > 0.2 = drift |
| Kolmogorov-Smirnov (KS) test | Continuous feature distributions | p-value < 0.05 |
| Chi-square test | Categorical feature frequency | p-value < 0.05 |
| Jensen-Shannon divergence | Prediction distribution shift | JS > 0.1 = alert |
| CUSUM / Page-Hinkley | Gradual concept drift detection | Custom threshold |
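PSI, the first test in the table, is easy to compute directly. A minimal sketch, assuming the common convention of taking bin edges from the reference sample's quantiles; `psi` is an illustrative helper, and the income distributions are synthetic:

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference and a current sample.

    Bin edges come from the reference sample's quantiles; outer edges are
    widened to cover the current sample, and tiny fractions are clipped so
    empty bins never produce log(0).
    """
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0] = min(edges[0], current.min())
    edges[-1] = max(edges[-1], current.max())
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    eps = 1e-6
    ref_frac = np.clip(ref_frac, eps, None)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
# Same income distribution twice -> PSI near zero
stable = psi(rng.normal(60_000, 15_000, 10_000), rng.normal(60_000, 15_000, 10_000))
# Shifted income distribution -> PSI far above the 0.2 alert threshold
shifted = psi(rng.normal(60_000, 15_000, 10_000), rng.normal(45_000, 12_000, 10_000))
```

A common rule of thumb: PSI < 0.1 stable, 0.1 to 0.2 moderate shift worth watching, above 0.2 significant drift.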
Detecting Drift with Evidently AI
Evidently AI runs statistical tests across all features and produces a drift report. Integrate into an Airflow DAG to run nightly.
```python
# Evidently legacy Report API (v0.4.x); module paths changed across releases,
# so check the version you have installed.
from evidently.report import Report
from evidently.metrics import (
    DatasetDriftMetric,   # overall drift yes/no
    DataDriftTable,       # per-feature drift scores
)

report = Report(metrics=[
    DatasetDriftMetric(),
    DataDriftTable(),
])
report.run(
    reference_data=training_df,    # sample from training time
    current_data=production_df,    # e.g. last 7 days of production data
)
report.save_html('drift_report.html')

result = report.as_dict()
print(f'Drift detected: {result["metrics"][0]["result"]["dataset_drift"]}')
```

Common Mistakes
Only monitoring infrastructure metrics
A model can serve 99.9% uptime while its predictions are completely wrong. Monitoring latency and error rates tells you nothing about model accuracy. Add prediction distribution and feature drift metrics.
Alerting on every individual feature drift
With 50 features, a 5% significance threshold means ~2.5 false-positive alerts per monitoring run by chance alone. Use dataset-level drift metrics (PSI across all features) and require a minimum share of drifted features before alerting.
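That policy can be implemented as a dataset-level gate: run the per-feature test, but alert only when the drifted share crosses a floor. A sketch with a hypothetical `should_alert` helper; the per-feature KS test and the 30% share threshold are illustrative choices:

```python
import numpy as np
from scipy import stats

def should_alert(reference, current, alpha=0.05, min_drift_share=0.3):
    """Alert only when a meaningful share of features drift, not on any one.

    reference/current: dicts mapping feature name -> 1-D sample array.
    Returns (alert, list_of_drifted_feature_names).
    """
    drifted = [
        name for name in reference
        if stats.ks_2samp(reference[name], current[name]).pvalue < alpha
    ]
    share = len(drifted) / len(reference)
    return share >= min_drift_share, drifted

rng = np.random.default_rng(1)
ref = {f"f{i}": rng.normal(0, 1, 2_000) for i in range(50)}
# Only 3 of 50 features genuinely shift; a few more will trip the 5% test
# by chance, but the share stays well under 30% -> no alert
cur = {name: rng.normal(0.8 if name in ("f0", "f1", "f2") else 0.0, 1, 2_000)
       for name in ref}
alert, drifted = should_alert(ref, cur)
```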
Waiting for labels before detecting drift
Production labels are often delayed by days or weeks. Detect data drift (input distribution shifts) and prediction drift (output distribution shifts) as early-warning proxies — don't wait for labeled ground truth.
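A prediction-drift check needs no labels at all: histogram the scores and compare the two distributions. A sketch using scipy's `jensenshannon` (note it returns the JS *distance*, so it is squared here to get the divergence); the score distributions below are synthetic, matching the 0.23 vs 0.61 averages from the diagram earlier:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def prediction_drift(ref_scores, cur_scores, bins=20):
    """Jensen-Shannon divergence between two prediction-score distributions.

    Scores are histogrammed on [0, 1]; a tiny epsilon keeps empty bins from
    breaking the log terms. scipy normalizes the histograms internally.
    """
    edges = np.linspace(0, 1, bins + 1)
    p = np.histogram(ref_scores, bins=edges)[0] + 1e-9
    q = np.histogram(cur_scores, bins=edges)[0] + 1e-9
    return jensenshannon(p, q, base=2) ** 2

rng = np.random.default_rng(7)
train_scores = np.clip(rng.normal(0.23, 0.10, 10_000), 0, 1)  # avg ~0.23
prod_scores = np.clip(rng.normal(0.61, 0.10, 10_000), 0, 1)   # avg ~0.61
js = prediction_drift(train_scores, prod_scores)               # well above 0.1
```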
Fixed retraining schedules instead of drift-triggered
Retraining every Monday wastes compute when data is stable and misses rapid drift between runs. Trigger retraining on drift signals. If drift is rare, scheduled retraining wastes money; if it's frequent, fixed schedules miss it.
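The trigger logic itself can be a small decision function, with a staleness backstop so a quiet model still gets refreshed eventually; the thresholds here are illustrative:

```python
def should_retrain(psi_score: float, days_since_last: int,
                   psi_threshold: float = 0.2,
                   max_staleness_days: int = 90) -> str:
    """Retrain on drift signals, with a staleness backstop instead of a fixed schedule."""
    if psi_score > psi_threshold:
        return "retrain: drift detected"
    if days_since_last > max_staleness_days:
        return "retrain: staleness backstop"
    return "skip"

print(should_retrain(0.35, days_since_last=10))   # drift -> retrain now
print(should_retrain(0.05, days_since_last=120))  # stale -> retrain anyway
print(should_retrain(0.05, days_since_last=10))   # stable and fresh -> skip
```

In an orchestrator, this function would run after the nightly drift report and gate the retraining job, rather than the job running on a calendar.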
FAQ
- What is model drift?
- Model drift is when a production model's accuracy degrades because real-world data has shifted from training data. It's the primary reason models need ongoing monitoring and retraining.
- What is the difference between data drift and concept drift?
- Data drift: input feature distributions change (users skew younger, income changes). Concept drift: the relationship between inputs and outputs changes (same features, different correct labels). Both degrade accuracy but require different fixes.
- How do you detect model drift?
- Use PSI for categorical features, KS test for continuous features, Chi-square for categorical frequency. Tools: Evidently AI, Whylogs, NannyML. Monitor prediction distribution as an early proxy when labels are delayed.
- How do you fix model drift?
- Retrain on recent data. For data drift, include new distribution in training window. For concept drift, re-examine feature engineering and labeling. Use drift-triggered retraining pipelines for automation.