
Apache Spark at Scale

How Netflix and Uber process petabytes of data with Spark

Why These Case Studies Matter

Apache Spark has revolutionized big data processing, enabling companies to analyze petabytes of data in hours instead of days. These case studies reveal how Netflix and Uber built production Spark platforms that power critical business decisions.

You'll learn the architectural patterns for both batch processing (Netflix recommendations) and stream processing (Uber real-time pricing). More importantly, you'll discover the performance optimizations, cost strategies, and lessons learned that only come from running Spark at massive scale.

Learning Path: After reading these case studies, build your own Spark pipeline with the ShopStream Spark Project, then follow the step-by-step walkthrough.

Note on Metrics: These case studies are based on publicly available information from engineering blogs, conference talks, and open-source documentation. While we've verified core architectural patterns and technologies, some specific numbers (especially cost figures and exact scale metrics) are estimates for educational purposes. Where possible, we've updated unverified claims to reflect documented information or general ranges.

Featured Case Studies

Deep dives into batch and streaming Spark architectures at Netflix and Uber

Netflix

Case Study #1


The Problem

Netflix processes 500+ billion events per day to power personalized recommendations, A/B testing, and content analytics. Hadoop MapReduce, which writes intermediate results to disk between stages, was too slow (hours per run) for the iterative algorithms behind its recommendation models.
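To see why iterative algorithms were the breaking point, consider the access pattern below, sketched in plain Python with toy data. Every training pass re-reads the same ratings: MapReduce reloads them from disk on each iteration, while Spark keeps them cached in cluster memory (e.g., `ratings_df.cache()` in PySpark). The bias-update model and all data here are hypothetical illustrations, not Netflix's actual recommender.

```python
# Toy (user, item, rating) data -- hypothetical, for illustration only.
ratings = [
    ("u1", "m1", 5.0), ("u1", "m2", 3.0),
    ("u2", "m1", 4.0), ("u2", "m3", 2.0),
]

global_mean = sum(r for _, _, r in ratings) / len(ratings)
user_bias = {u: 0.0 for u, _, _ in ratings}
item_bias = {i: 0.0 for _, i, _ in ratings}
lr = 0.1  # learning rate for the bias updates

# 20 full passes over the same dataset: this repeated re-scan is the loop
# that benefits from keeping the data in memory between iterations, which
# is exactly what Spark's caching provides and MapReduce does not.
for _ in range(20):
    for u, i, r in ratings:
        err = r - (global_mean + user_bias[u] + item_bias[i])
        user_bias[u] += lr * err
        item_bias[i] += lr * err

# Predicted rating for user u1 on item m1 after training.
pred = global_mean + user_bias["u1"] + item_bias["m1"]
```

In a real Spark job, the loop body would be a distributed transformation over a cached DataFrame or RDD, but the shape of the computation, many passes over one dataset, is the same.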

Scale

Events/Day: 500+ billion
Data Warehouse: 60+ PB
Spark Jobs: 100,000+/day
Spark Clusters: Thousands active
EC2 Instances: 10,000+ at peak
ML Models: Thousands trained daily

Uber

Case Study #2


The Problem

Uber needs real-time analytics for driver surge pricing, trip matching, and fraud detection across 10,000+ cities worldwide. That means processing streaming data with complex joins at sub-second latency while handling 15 million trips per day.
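The core streaming pattern behind surge pricing, grouping recent demand events into a sliding time window and comparing them to available supply, can be sketched in plain Python. In production this would be a Spark Structured Streaming job windowing ride-request events by city; the window length, multiplier formula, and caps below are hypothetical placeholders, not Uber's actual pricing model.

```python
from collections import deque

WINDOW_SECONDS = 120  # hypothetical look-back window for measuring demand

def surge_multiplier(requests: deque, drivers_available: int, now: float) -> float:
    """Evict request timestamps older than the window, then price off the
    demand/supply ratio, clamped to a hypothetical 1.0-3.0 range."""
    while requests and now - requests[0] > WINDOW_SECONDS:
        requests.popleft()  # expire events that fell out of the window
    if drivers_available == 0:
        return 3.0  # cap the multiplier when there is no supply at all
    ratio = len(requests) / drivers_available
    return min(3.0, max(1.0, round(ratio, 2)))

# Four requests in the last two minutes against two available drivers.
events = deque([100.0, 110.0, 115.0, 118.0])  # request timestamps (seconds)
m = surge_multiplier(events, drivers_available=2, now=120.0)  # -> 2.0
```

A Structured Streaming version would express the same idea declaratively, e.g. grouping by `window(event_time, "2 minutes")` and city, with a watermark handling late events instead of the manual eviction loop.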

Scale

Trips/Day: 15 million
Events/Second: 1 million+
Data Ingested: 100 TB/day
Spark Clusters: 500+
Streaming Jobs: 2,000+
Infrastructure: HDFS + GCP