MLOps at 10,000 Models

Uber

Michelangelo · platform engineering

Problem

Manual model deploys took weeks; no shared feature definitions

Data scientists struggled to deploy ML models to production. Manual deployment took weeks, teams shared no standards, and no one had visibility into model performance. The company needed a platform to democratize ML across 10,000+ engineers and enable rapid iteration on pricing, ETA, fraud detection, and rider-driver matching.

Scale

Models in prod: 10K+
Predictions/day: 100B+
Training jobs/mo: 50K+
Features: 1M+
Teams using: 500+
Use cases: Pricing · ETA · fraud · matching

Solution

Centralized platform with feature store as the anchor

Uber developed an end-to-end ML platform that standardized model development, training, deployment, and monitoring across all 500+ teams. The platform eliminated manual deployment, reduced training time from 12h to 2h via distributed training (Horovod), and enabled real-time serving at <10ms P99. The centerpiece is a Cassandra-backed feature store that pre-computes features and serves them to both training and production — eliminating train/serve skew.

TensorFlow · XGBoost · PyTorch · MLflow · Horovod · Cassandra (feature store) · HDFS · Spark · Kubernetes

Feature store: pre-computed features in Cassandra for low-latency serving
Distributed training on Spark/Horovod for large datasets
Experiment tracking via MLflow (parameters, metrics, lineage)
Model registry: centralized catalog with lineage to data + code + params
Containerized model serving on Kubernetes with auto-scaling
Real-time performance + drift monitoring per model
A/B testing framework with automatic rollback on negative impact
Streaming features alongside batch — same DSL for both
Self-service: teams ship without platform team involvement
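
The skew-elimination claim rests on one idea: a feature is defined once, and both the training path and the serving path read values produced by that single definition. A minimal sketch of that idea in plain Python follows. The `Feature` class, `REGISTRY`, and the in-memory `store` dict are hypothetical stand-ins for illustration (a dict plays the role of Cassandra); none of them are Michelangelo's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Trip = Dict[str, float]          # one trip record, e.g. {"distance_km": 4.2}
FeatureRow = Dict[str, float]    # feature name -> value for one rider


@dataclass(frozen=True)
class Feature:
    """Hypothetical feature definition: a name plus one compute function."""
    name: str
    compute: Callable[[List[Trip]], float]


def _avg_distance(trips: List[Trip]) -> float:
    return sum(t["distance_km"] for t in trips) / len(trips) if trips else 0.0


# Single source of truth: every consumer goes through this registry.
REGISTRY: List[Feature] = [
    Feature("trip_count", lambda trips: float(len(trips))),
    Feature("avg_distance_km", _avg_distance),
]


def materialize(trips_by_rider: Dict[str, List[Trip]]) -> Dict[str, FeatureRow]:
    """Batch job: pre-compute all registered features for every rider."""
    return {
        rider: {f.name: f.compute(trips) for f in REGISTRY}
        for rider, trips in trips_by_rider.items()
    }


def training_rows(store: Dict[str, FeatureRow], rider_ids: List[str]) -> List[FeatureRow]:
    """Training path: read pre-computed rows in bulk."""
    return [store[r] for r in rider_ids]


def serve(store: Dict[str, FeatureRow], rider_id: str) -> FeatureRow:
    """Serving path: point lookup of the same pre-computed row."""
    return store[rider_id]


store = materialize({
    "rider_1": [{"distance_km": 2.0}, {"distance_km": 4.0}],
    "rider_2": [],
})
# Both paths see identical values because neither recomputes the feature.
assert training_rows(store, ["rider_1"])[0] == serve(store, "rider_1")
```

Skew appears when a second implementation of `_avg_distance` is written for the online service; keeping one registry removes the opportunity.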

System architecture

[Diagrams not reproduced: service topology, data flow, and request sequence.]

Code

Python · feature_store.py · Feature DSL: define features once, reuse across pricing/fraud/matching

from michelangelo.feature_store import FeatureDefinition, FeatureGroup

@FeatureGroup(name='rider_features', update_frequency='hourly')
class RiderFeatures:
    """Rider behavior features shared across pricing, fraud, and matching."""

    @FeatureDefinition(
        name='rides_last_7_days',
        dependencies=['trips_table'],
        compute_mode='batch',
    )
    def rides_last_7_days(self, context):
        return """
        SELECT rider_id, COUNT(*) AS rides_last_7_days
        FROM trips_table
        WHERE completed_at >= DATE_SUB(CURRENT_DATE, 7)
        GROUP BY rider_id
        """

    @FeatureDefinition(
        name='avg_trip_distance_km',
        dependencies=['trips_table'],
        compute_mode='batch',
    )
    def avg_trip_distance(self, context):
        return """
        SELECT rider_id, AVG(distance_km) AS avg_trip_distance_km
        FROM trips_table
        WHERE completed_at >= DATE_SUB(CURRENT_DATE, 30)
        GROUP BY rider_id
        """

    @FeatureDefinition(
        name='fraud_score_realtime',
        dependencies=['rider_signals'],
        compute_mode='streaming',
    )
    def fraud_score(self, context):
        """Real-time fraud signal from streaming pipeline."""
        return context.fetch_from_kafka('rider-fraud-scores')

# Features auto-computed and stored in Cassandra
# Serving layer fetches with <5ms latency

Python · distributed_training.py · Horovod-based data-parallel training across 32 GPUs

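import tensorflow as tf
import horovod.tensorflow.keras as hvd  # Keras API: provides hvd.callbacks
from michelangelo.training import ModelTrainer

hvd.init()
# Pin each process to a single GPU, indexed by its local rank on the host.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')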

class UberPricingModel(ModelTrainer):
    """Surge pricing trained on 2 TB historical data across 32 GPUs."""

    def build_model(self):
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation='relu'),
            tf.keras.layers.Dropout(0.3),
            tf.keras.layers.Dense(128, activation='relu'),
            tf.keras.layers.Dense(1),  # predicted surge multiplier
        ])
        lr = 0.001 * hvd.size()           # scale LR with cluster size
        optimizer = tf.keras.optimizers.Adam(lr)
        optimizer = hvd.DistributedOptimizer(optimizer)
        model.compile(optimizer=optimizer, loss='mse', metrics=['mae'])
        return model

    def train(self, dataset):
        callbacks = [
            hvd.callbacks.BroadcastGlobalVariablesCallback(0),
            hvd.callbacks.MetricAverageCallback(),
        ]
        if hvd.rank() == 0:
            callbacks.append(
                tf.keras.callbacks.ModelCheckpoint('./model_{epoch}.h5')
            )
        self.model.fit(
            dataset,
            epochs=10,
            callbacks=callbacks,
            verbose=1 if hvd.rank() == 0 else 0,
        )

# Training time: 12h → 2h with 32 GPUs
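
The `0.001 * hvd.size()` line above applies the linear scaling rule: with N workers the effective batch is N times larger, so the learning rate is multiplied by N. A large rate from the first step can destabilize early training, so the rule is usually paired with a warmup ramp. The sketch below shows that ramp in plain Python; `warmup_lr` and its `warmup_epochs` default are illustrative assumptions, not part of the original training code.

```python
def warmup_lr(epoch: int, base_lr: float, world_size: int,
              warmup_epochs: int = 5) -> float:
    """Ramp linearly from base_lr to base_lr * world_size, then hold."""
    target_lr = base_lr * world_size          # linear scaling rule
    if warmup_epochs <= 0 or epoch >= warmup_epochs:
        return target_lr
    fraction = epoch / warmup_epochs          # 0.0 at epoch 0, approaches 1.0
    return base_lr + fraction * (target_lr - base_lr)


# Schedule for a 32-worker job with base rate 0.001, as in the listing above.
schedule = [warmup_lr(e, base_lr=0.001, world_size=32) for e in range(8)]
```

Horovod's Keras callbacks include a `LearningRateWarmupCallback` that serves the same purpose inside `model.fit`.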

Python · monitoring.py · Drift detection via K-S test, with automatic retraining on significant shift

from michelangelo.monitoring import ModelMonitor, DriftDetector

class SurgePricingMonitor(ModelMonitor):
    """Watch surge pricing model for data + concept drift, alert + retrain."""

    P_VALUE_THRESHOLD = 0.05  # below this, treat the distributions as different

    def __init__(self, model_id):
        super().__init__(model_id)
        self.drift_detector = DriftDetector(
            method='ks_test', threshold=self.P_VALUE_THRESHOLD
        )

    def check_data_drift(self, production_data, training_data):
        drift_detected = {}
        for feature in ['rider_count', 'driver_count', 'time_of_day']:
            ks_statistic, p_value = self.drift_detector.detect(
                training_data[feature], production_data[feature]
            )
            if p_value < self.P_VALUE_THRESHOLD:
                drift_detected[feature] = {
                    'ks_statistic': ks_statistic,
                    'p_value': p_value,
                    'severity': 'HIGH' if p_value < 0.01 else 'MEDIUM',
                }
        if drift_detected:
            self.alert(
                f"Data drift detected in {len(drift_detected)} features",
                drift_detected,
            )
            self.trigger_retraining()

    def check_concept_drift(self):
        recent_mae = self.get_metric('mae', window='24h')
        baseline_mae = self.get_metric('mae', window='training')
        if not baseline_mae:
            return  # no usable baseline yet; avoids dividing by zero
        degradation = (recent_mae - baseline_mae) / baseline_mae
        if degradation > 0.15:
            self.alert(
                f"Model performance degraded by {degradation*100:.1f}%",
                {'recent_mae': recent_mae, 'baseline_mae': baseline_mae,
                 'action': 'RETRAIN_IMMEDIATELY'},
            )
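
The `DriftDetector` above is a platform class whose internals are not shown. As a rough illustration of what a two-sample K-S check does, the sketch below computes the statistic (the largest gap between two empirical CDFs) with the standard library and compares it with the usual large-sample critical value. The function names `ks_statistic` and `ks_drift` and the synthetic data are assumptions for this example; in practice `scipy.stats.ks_2samp` returns the statistic and an exact p-value.

```python
import bisect
import math
import random
from typing import Dict, Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # The maximum gap is always reached at one of the observed values.
    for x in a_sorted + b_sorted:
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def ks_drift(train: Sequence[float], prod: Sequence[float],
             alpha: float = 0.05) -> Dict[str, object]:
    """Flag drift when the statistic exceeds the large-sample critical value."""
    n, m = len(train), len(prod)
    d = ks_statistic(train, prod)
    critical = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))
    return {"ks_statistic": d, "critical_value": critical, "drifted": d > critical}


random.seed(0)
training = [random.gauss(0.0, 1.0) for _ in range(2000)]
shifted = [random.gauss(1.0, 1.0) for _ in range(2000)]   # mean moved by one sigma

report = ks_drift(training, shifted)
```

With thousands of samples per feature the test becomes very sensitive, so a production gate would likely also require a minimum statistic before triggering a retrain.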

Outcomes

Business outcomes

Model deployment: 4 weeks → <1 day
500+ teams deploy models independently
UberEats delivery ETA accuracy +30%
Fraud detection rate +40% via faster model updates

Technical outcomes

Training time: 12h → 2h via distributed training
Prediction P99: <10ms for real-time models
Feature store: 1M+ QPS at <5ms latency
Automated retraining: models updated daily without manual intervention

Impact

Democratized ML at enterprise scale

$20M saved annually through efficient resource usage, reduced manual work, and faster iteration cycles enabled by self-service model deployment.

Takeaways

Feature store is foundational. Centralized feature computation eliminates train/serve skew and enables reuse across teams.
Standardization accelerates innovation. A common API across frameworks reduces cognitive load and increases velocity.
Monitoring is non-negotiable. Track data drift, concept drift, and model performance — silent failures are worse than crashes.
A/B testing for models, not just UI. Gradual rollouts with automatic metrics tracking prevent bad models from impacting users.
Model registry with lineage. Know which data, code, and parameters produced each model for reproducibility and debugging.
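
The gradual-rollout takeaway depends on stable assignment: a user must land in the same arm on every request, and raising the rollout percentage should only add users to treatment, never reshuffle them. One common way to get both properties is hash-based bucketing, sketched below. The names `bucket` and `variant` and the experiment key `pricing_v2` are hypothetical; the source does not describe how Michelangelo assigns traffic.

```python
import hashlib


def bucket(user_id: str, experiment: str, buckets: int = 10_000) -> int:
    """Map a user to a stable bucket in [0, buckets) for one experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def variant(user_id: str, experiment: str, treatment_pct: float) -> str:
    """Return 'treatment' for roughly treatment_pct percent of users."""
    threshold = treatment_pct / 100 * 10_000
    return "treatment" if bucket(user_id, experiment) < threshold else "control"


users = [f"user_{i}" for i in range(10_000)]
at_5 = {u for u in users if variant(u, "pricing_v2", 5) == "treatment"}
at_50 = {u for u in users if variant(u, "pricing_v2", 50) == "treatment"}
# Ramping from 5% to 50% keeps every early treatment user in treatment.
assert at_5 <= at_50
```

Including the experiment name in the hash keeps assignments independent across experiments, so the same users are not always the first to receive new models.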

Netflix

Metaflow · DX-first

Problem

Notebooks took months to reach production via engineering rewrites

Recommendation models drove 80% of viewing but took months to develop and deploy. Manual ML workflows did not scale to hundreds of data scientists. Most experiments never reached production. Data scientists spent 60% of their time rewriting notebook code into production pipelines.

Scale

Active users: 200M+
Models in prod: Thousands
Predictions/sec: Millions
Experiments/yr: 10K+
Training data: 2+ PB
Model updates: Daily for top models

Solution

Metaflow — same Python code, notebook to production

Netflix developed an end-to-end ML platform centered on Metaflow, a Python library that bridges notebook and production. Data scientists write code once with @step decorators; the same code runs locally or on AWS SageMaker without rewriting. The platform includes automated validation (performance vs baseline, fairness, latency), hybrid serving (real-time for homepage, batch for emails), and a rigorous A/B testing framework with causal inference for experiment analysis.

TensorFlow · PyTorch · XGBoost · Metaflow · Jupyter · Spark · Cassandra · EVCache · AWS SageMaker

Metaflow: Python library for ML workflows — data → train → deploy
Notebooks + Metaflow for interactive dev + production deploy
Feature engineering on Spark; serving from Cassandra + EVCache
Hybrid serving: online (real-time) + offline (batch) predictions
A/B testing platform: randomized experiments + causal analysis
Model versioning: Git for code, S3 for artifacts
Auto-scaling prediction service on AWS, traffic-based
Automated validator gate: performance, fairness, latency
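
To make the "data → train → deploy" workflow idea concrete, here is a toy step runner in plain Python: decorated functions register themselves as steps, and a runner walks them in order while passing shared state. This is a sketch for intuition only, not Metaflow's implementation (Metaflow infers the graph from the `self.next(...)` calls inside each step and persists state between steps). `MiniFlow` and everything in it are hypothetical names.

```python
from typing import Callable, Dict, List, Optional

State = Dict[str, object]


class MiniFlow:
    """Toy linear workflow runner; hypothetical, for illustration only."""

    def __init__(self) -> None:
        self.steps: Dict[str, Callable[[State], None]] = {}
        self.edges: Dict[str, Optional[str]] = {}

    def step(self, next_step: Optional[str] = None):
        """Decorator that registers a function as a step with one successor."""
        def register(fn: Callable[[State], None]):
            self.steps[fn.__name__] = fn
            self.edges[fn.__name__] = next_step
            return fn
        return register

    def run(self, start: str = "start") -> State:
        state: State = {}
        order: List[str] = []
        current: Optional[str] = start
        while current is not None:
            if current in order:
                raise ValueError(f"cycle detected at step {current!r}")
            self.steps[current](state)      # each step reads and writes state
            order.append(current)
            current = self.edges[current]
        state["_order"] = order
        return state


flow = MiniFlow()


@flow.step(next_step="train")
def start(state: State) -> None:
    state["rows"] = [1.0, 2.0, 3.0]


@flow.step(next_step="end")
def train(state: State) -> None:
    rows = state["rows"]
    state["mean"] = sum(rows) / len(rows)   # stand-in for model fitting


@flow.step()
def end(state: State) -> None:
    state["done"] = True


result = flow.run()
```

Because the steps are ordinary functions, the same definitions can be called from a notebook cell or from a scheduler, which is the property the platform relies on.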

System architecture

[Diagrams not reproduced: service topology, data flow, and request sequence.]

Code

Python · recommendation_flow.py · @step decorators define the DAG; @batch requests cloud resources (AWS Batch)

from metaflow import FlowSpec, step, Parameter, batch, S3

class RecommendationModelFlow(FlowSpec):
    """Netflix recommendation pipeline — same code runs in notebook + prod."""

    model_type = Parameter('model_type', default='neural_network')

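    @step
    def start(self):
        from pyspark.sql import SparkSession
        spark = SparkSession.builder.appName("NetflixML").getOrCreate()
        views = spark.read.parquet('s3://netflix-data/views/')
        # Metaflow pickles every self.* attribute and Spark DataFrames are
        # not picklable, so keep the DataFrame local and persist pandas data.
        # At real scale, sample or write to storage instead of collecting.
        self.features = views.select(
            'user_id', 'title_id', 'viewing_time',
            'completion_rate', 'device_type',
        ).toPandas()
        self.next(self.train)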

    @batch(cpu=32, memory=128000, gpu=4)   # run this step on AWS Batch
    @step
    def train(self):
        import tensorflow as tf
        from metaflow import current       # run metadata such as run_id
        model = tf.keras.Sequential([
            tf.keras.layers.Embedding(input_dim=500000, output_dim=256),
            tf.keras.layers.Dense(128, activation='relu'),
            tf.keras.layers.Dense(1, activation='sigmoid'),
        ])
        # Illustrative only: a real fit needs a compiled model, encoded
        # inputs, and a label column.
        model.fit(self.features, epochs=10, batch_size=10000)
        self.model_path = f's3://netflix-models/{self.model_type}/v{current.run_id}'
        model.save(self.model_path)
        # Placeholder metrics; compute these on a held-out set in practice.
        self.train_accuracy = 0.89
        self.train_auc = 0.94
        self.next(self.validate)

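    @step
    def validate(self):
        baseline_auc = 0.90
        if self.train_auc < baseline_auc:
            raise ValueError(f"AUC {self.train_auc} < baseline {baseline_auc}")
        bias_score = self.check_demographic_bias()
        if bias_score > 0.1:
            raise ValueError(f"Model shows bias: {bias_score}")
        self.next(self.end)

    def check_demographic_bias(self):
        """Hypothetical hook: return the largest metric gap across groups."""
        raise NotImplementedError("supply a fairness metric for your data")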

    @step
    def end(self):
        print(f"Model ready: {self.model_path}")
        print(f"AUC: {self.train_auc}, Accuracy: {self.train_accuracy}")

if __name__ == '__main__':
    RecommendationModelFlow()   # Metaflow entry point; required for the CLI

# Run locally:   python recommendation_flow.py run
# Deploy cloud:  python recommendation_flow.py run --with batch

Python · model_validator.py · Pre-deploy gate: performance, fairness, latency. Catches 90% of issues.

import time
import numpy as np

class NetflixModelValidator:
    """Pre-deployment validation — catches 90% of issues before prod."""

    def __init__(self, model, test_data):
        self.model = model
        self.test_data = test_data
        self.validation_results = {}

    def validate_all(self) -> dict:
        checks = [
            self.check_model_performance(),
            self.check_demographic_bias(),
            self.check_latency_requirements(),
        ]
        return {'passed': all(checks), 'results': self.validation_results}

    def check_model_performance(self) -> bool:
        preds = self.model.predict(self.test_data)
        auc = self.calculate_auc(preds, self.test_data.labels)
        baseline_auc = 0.90
        improvement = auc - baseline_auc
        self.validation_results['performance'] = {
            'auc': auc, 'baseline': baseline_auc,
            'improvement': improvement, 'passed': improvement >= 0.02,
        }
        return improvement >= 0.02

    def check_demographic_bias(self) -> bool:
        bias_detected = False
        for demo in ['age_group', 'region', 'subscription_type']:
            group_aucs = {}
            for group in self.test_data[demo].unique():
                gd = self.test_data[self.test_data[demo] == group]
                group_aucs[group] = self.calculate_auc(
                    self.model.predict(gd), gd.labels,
                )
            max_auc, min_auc = max(group_aucs.values()), min(group_aucs.values())
            if (max_auc - min_auc) / max_auc > 0.05:
                bias_detected = True
                self.validation_results[f'bias_{demo}'] = {
                    'max_auc': max_auc, 'min_auc': min_auc,
                    'disparity': (max_auc - min_auc) / max_auc, 'passed': False,
                }
        return not bias_detected

    def check_latency_requirements(self) -> bool:
        latencies = []
        for _ in range(1000):
            t0 = time.perf_counter()   # monotonic, high-resolution timer
            self.model.predict(self.test_data.sample(1))
            latencies.append(time.perf_counter() - t0)
        p99 = np.percentile(latencies, 99)
        passed = p99 < 0.05  # 50ms SLO
        self.validation_results['latency'] = {
            'p99_ms': p99 * 1000, 'requirement_ms': 50, 'passed': passed,
        }
        return passed

Python · experiment_framework.py · A/B with t-test + effect size, with an automated ship/iterate/kill decision

from scipy import stats
import numpy as np

class NetflixExperimentFramework:
    """A/B testing with statistical rigor — 10K+ experiments per year."""

    def __init__(self, experiment_id):
        self.experiment_id = experiment_id

    def analyze_experiment(self, duration_days=14):
        treatment = self.get_engagement_metrics('treatment')
        control   = self.get_engagement_metrics('control')

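        # Welch's t-test: does not assume equal variances across arms.
        t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

        # Cohen's d with sample (ddof=1) variances; assumes similar arm sizes.
        pooled_std = np.sqrt(
            (np.var(treatment, ddof=1) + np.var(control, ddof=1)) / 2
        )
        effect_size = (np.mean(treatment) - np.mean(control)) / pooled_std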

        is_significant            = p_value < 0.05
        is_practically_significant = abs(effect_size) > 0.1

        return {
            'experiment_id': self.experiment_id,
            'duration_days': duration_days,
            'treatment_mean': np.mean(treatment),
            'control_mean':   np.mean(control),
            'lift': (np.mean(treatment) - np.mean(control)) / np.mean(control),
            'p_value': p_value,
            'effect_size': effect_size,
            'statistically_significant': is_significant,
            'practically_significant':   is_practically_significant,
            'decision': self.make_decision(is_significant, effect_size),
        }

    def make_decision(self, is_significant, effect_size):
        if is_significant and effect_size > 0.1:
            return "SHIP — strong positive impact"
        if is_significant and effect_size > 0.05:
            return "SHIP_WITH_MONITORING — modest positive impact"
        if is_significant and effect_size < -0.05:
            return "ROLLBACK — negative impact detected"
        return "INCONCLUSIVE — continue experiment or iterate"
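
The decision thresholds above (0.05 and 0.1) are expressed in units of Cohen's d, the difference in means divided by a pooled standard deviation. The sketch below works through both quantities on synthetic data using only the standard library. Two things are assumptions made for this example: the data is simulated, and the p-value uses a large-sample normal approximation in place of the t distribution (`scipy.stats.ttest_ind(..., equal_var=False)` gives the exact Welch result).

```python
import math
import random
import statistics
from typing import Sequence, Tuple


def cohens_d(treatment: Sequence[float], control: Sequence[float]) -> float:
    """Standardized mean difference using the pooled sample variance."""
    n1, n2 = len(treatment), len(control)
    v1, v2 = statistics.variance(treatment), statistics.variance(control)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.fmean(treatment) - statistics.fmean(control)) / pooled_sd


def two_sided_p(treatment: Sequence[float],
                control: Sequence[float]) -> Tuple[float, float]:
    """Welch-style statistic with a normal approximation for the p-value."""
    standard_error = math.sqrt(
        statistics.variance(treatment) / len(treatment)
        + statistics.variance(control) / len(control)
    )
    z = (statistics.fmean(treatment) - statistics.fmean(control)) / standard_error
    p = 2 * (1 - statistics.NormalDist().cdf(abs(z)))
    return z, p


random.seed(1)
control = [random.gauss(10.0, 2.0) for _ in range(5000)]
treatment = [random.gauss(10.4, 2.0) for _ in range(5000)]   # true d = 0.2

d = cohens_d(treatment, control)
z, p = two_sided_p(treatment, control)
```

Reporting d alongside the p-value matters at this scale: with millions of users almost any difference is statistically significant, so the effect size is what separates a shippable change from noise.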

Outcomes

Business outcomes

Personalization → 10%+ boost in user engagement
Model development cycle: months → weeks
Data scientists ship 3× more experiments per quarter
Content discovery 25% faster for users

Technical outcomes

Experimentation velocity: 1K → 10K+ experiments/yr
Training time -60% via better orchestration
Serving auto-scales to millions of QPS
Reproducibility: any model rebuildable from metadata

Impact

Transformed ML from bottleneck to competitive advantage

Millions saved annually through consolidated ML infrastructure. The recommendation engine drives 80% of viewing, with 200M+ users enjoying personalized content updated daily.

Takeaways

Notebooks in production work if done right. Metaflow bridges experimentation and production without code rewrites.
Not every model needs real-time serving. Batch predictions for non-critical use cases save massive compute.
Causal inference > correlation. Randomized experiments and causal analysis reveal true impact on business metrics.
Automate model validation. Pre-deploy checks (data quality, performance, bias) catch 90% of issues before they reach users.
Invest in experiment infrastructure early. Fast experimentation is a competitive advantage: Netflix ships 100× more experiments than peers.

Common pitfalls

The mistakes both teams hit on the road from "it works on my laptop" to "it runs the business."

Train/serve skew from duplicate feature pipelines

Problem

Different feature computation logic in training and serving caused a 20% drop in Uber's fraud detection accuracy in production. Models passed offline validation but failed in production.

Solution

Centralized feature store with a single source of truth. Same features for training and serving, pre-computed and served at <5ms.

Impact

Eliminated train/serve skew. Production accuracy now within 2% of offline validation.
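
A claim like "within 2% of offline validation" is only checkable if skew is measured. One simple approach, sketched here under assumptions, is to log the feature values the serving path actually used and compare them with the values the offline pipeline produces for the same entities. `skew_report` and the sample rows are hypothetical; the source does not describe Uber's parity tooling.

```python
import math
from typing import Dict, List, Tuple

FeatureRow = Dict[str, float]


def skew_report(offline: Dict[str, FeatureRow], online: Dict[str, FeatureRow],
                rel_tol: float = 1e-6) -> Dict[str, object]:
    """Compare offline feature values with those logged at serving time."""
    mismatches: List[Tuple[str, str]] = []
    compared = 0
    for entity, offline_row in offline.items():
        online_row = online.get(entity, {})       # missing row: all features mismatch
        for name, offline_value in offline_row.items():
            compared += 1
            online_value = online_row.get(name)
            if online_value is None or not math.isclose(
                offline_value, online_value, rel_tol=rel_tol, abs_tol=1e-9
            ):
                mismatches.append((entity, name))
    rate = len(mismatches) / compared if compared else 0.0
    return {"compared": compared, "mismatches": mismatches, "mismatch_rate": rate}


offline_rows = {
    "rider_1": {"rides_last_7_days": 3.0, "avg_trip_distance_km": 5.0},
    "rider_2": {"rides_last_7_days": 1.0, "avg_trip_distance_km": 2.0},
}
online_rows = {
    "rider_1": {"rides_last_7_days": 3.0, "avg_trip_distance_km": 5.5},
}
parity = skew_report(offline_rows, online_rows)
```

Running a check like this on a sample of production traffic turns skew from a post-incident diagnosis into a metric that can be alerted on.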

No notebook-to-production bridge

Problem

Netflix data scientists spent 60% of their time rewriting notebook code into production pipelines. Models took 3–4 months to deploy. Most experiments never made it to production.

Solution

Adopt Metaflow (or equivalent) so the same Python with @step decorators runs in both notebook and production. No rewriting.

Impact

Deployment time: months → weeks. 3× more experiments shipped per quarter. Data scientists focus on modeling, not engineering.

No monitoring after deployment

Problem

Uber's pricing model silently degraded for two weeks before the business noticed a revenue drop. The gap cost $5M in lost revenue.

Solution

Comprehensive monitoring: data drift detection, prediction distribution tracking, performance metrics, automated alerts on degradation thresholds.
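
Prediction distribution tracking can be as simple as comparing a histogram of recent predictions with a histogram from a reference window. The Population Stability Index (PSI) is one widely used score for that comparison; the sketch below is a minimal standard-library version. The function name, bin count, and synthetic data are assumptions for illustration, and the common rule of thumb that a PSI above 0.25 signals a significant shift is a convention, not something stated in the source.

```python
import math
import random
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference and a recent sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0          # guard against a constant baseline

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) // width)
            counts[min(max(idx, 0), bins - 1)] += 1   # clamp out-of-range values
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


random.seed(0)
baseline = [random.gauss(1.5, 0.3) for _ in range(5000)]   # e.g. last month's predictions
shifted = [random.gauss(1.8, 0.3) for _ in range(5000)]    # mean moved by one sigma

stable_score = psi(baseline, baseline)
drift_score = psi(baseline, shifted)
```

Unlike accuracy metrics, this score needs no ground-truth labels, so it can fire within hours of a shift even when true outcomes arrive days later.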

Impact

Degradation is now detected within hours instead of weeks, which prevented an estimated $20M in losses.

Build it, don't just read about it

Build your own MLOps platform

The Uber + Netflix patterns combine cleanly: feature store as the data anchor + Metaflow-style DAG runner as the DX anchor + automated validation gate as the safety anchor. You do not need 500 engineers to get there — you need the right primitives.

Our MLOps module walks through the full stack: feature store design, distributed training, the validator gate pattern, hybrid serving, and the A/B framework that lets you ship with confidence.
