Spark Cost Optimization
Reduce your Apache Spark costs by 50-80% with these proven strategies
Why Cost Optimization Matters
Apache Spark clusters can easily cost $10,000-100,000+ per month at scale. Most organizations overspend by 50-80% due to poor cluster sizing, inefficient storage, and suboptimal job configurations.
Real Example: Netflix reduced Spark costs from $50M to $35M annually (30% savings) by implementing auto-scaling, spot instances, and storage tiering. The strategies below are battle-tested at companies processing petabytes daily.
1. AWS Spot Instances for Spark
Spot instances cost 70-90% less than on-demand instances. Perfect for Spark workloads since tasks are fault-tolerant and can handle interruptions.
Configuration Strategy
Master Node: On-Demand
Keep master node on-demand to ensure cluster stays alive. Cost: ~$50-200/month
Core Nodes: Mix of On-Demand + Spot (20/80 split)
20% on-demand for stability, 80% spot for cost savings. Ensures minimum capacity even during spot interruptions.
Task Nodes: 100% Spot
Task nodes don't store data, perfect for spot. Interruptions just pause tasks, no data loss.
EMR Spot Configuration
# EMR cluster: on-demand master/core nodes, spot task nodes.
# Setting BidPrice on a group provisions it as spot; the special value
# OnDemandPrice caps the spot price at the on-demand rate.
aws emr create-cluster \
  --name "Spark-Spot-Cluster" \
  --release-label emr-6.10.0 \
  --applications Name=Spark \
  --ec2-attributes KeyName=my-key \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m5.xlarge,InstanceCount=1 \
    InstanceGroupType=CORE,InstanceType=r5.2xlarge,InstanceCount=2 \
    InstanceGroupType=TASK,InstanceType=r5.2xlarge,InstanceCount=10,BidPrice=OnDemandPrice \
  --use-default-roles
💰 Cost Savings
- On-demand r5.2xlarge: $0.504/hour
- Spot r5.2xlarge: ~$0.15/hour (70% savings)
- 10-node cluster: ~$121/day → $36/day ≈ $2,500/month in savings
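A quick sanity check on that arithmetic, using the example rates above (actual spot prices fluctuate by region and availability zone):

```python
# Sanity-check the spot savings math using the article's example rates.
on_demand_rate = 0.504   # $/hr, r5.2xlarge on-demand
spot_rate = 0.15         # $/hr, r5.2xlarge spot (fluctuates)
nodes, hours = 10, 24

on_demand_daily = on_demand_rate * nodes * hours   # ~$121/day
spot_daily = spot_rate * nodes * hours             # $36/day
monthly_savings = (on_demand_daily - spot_daily) * 30

print(f"${on_demand_daily:.0f}/day -> ${spot_daily:.0f}/day, "
      f"saving ~${monthly_savings:,.0f}/month")
```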
2. Right-Sizing Clusters
Most clusters are over-provisioned by 2-3x. Right-sizing based on actual workload reduces waste.
Sizing Guidelines
| Workload Type | Instance Type | CPU:Memory Ratio | Executor Config |
|---|---|---|---|
| ETL (CPU-bound) | c5.4xlarge | 1:2 (16 CPU, 32GB) | 4-5 executors/node |
| ML Training | r5.2xlarge | 1:8 (8 CPU, 64GB) | 2-3 executors/node |
| Streaming | m5.xlarge | 1:4 (4 CPU, 16GB) | 1-2 executors/node |
| Data Science | r5.4xlarge | 1:8 (16 CPU, 128GB) | 3-4 executors/node |
⚠️ Common Over-Provisioning Mistakes
- Using r5.8xlarge (32 CPU, 256GB) for simple ETL → use c5.4xlarge instead
- Running 24/7 clusters for batch jobs → use transient clusters
- Too many small executors → consolidate to 4-5 executors per node
- Keeping dev clusters the same size as prod → dev should be 20-30% of prod size
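For teams that launch clusters programmatically, the sizing table can be encoded as a lookup. The values below are copied from the table; the helper function itself is an illustrative convenience, not an AWS API:

```python
# Sizing table from above, encoded as a lookup (values copied verbatim).
SIZING = {
    "etl":          {"instance": "c5.4xlarge", "vcpus": 16, "mem_gb": 32},
    "ml_training":  {"instance": "r5.2xlarge", "vcpus": 8,  "mem_gb": 64},
    "streaming":    {"instance": "m5.xlarge",  "vcpus": 4,  "mem_gb": 16},
    "data_science": {"instance": "r5.4xlarge", "vcpus": 16, "mem_gb": 128},
}

def instance_for(workload: str) -> str:
    """Return the suggested instance type for a workload category."""
    spec = SIZING[workload]
    ratio = spec["mem_gb"] // spec["vcpus"]        # CPU:memory ratio, 1:ratio
    print(f"{workload}: {spec['instance']} (1:{ratio} CPU:memory)")
    return spec["instance"]
```

For example, `instance_for("etl")` returns `"c5.4xlarge"`.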
Optimal Executor Configuration
# Optimal config for r5.2xlarge (8 vCPU, 64GB RAM)
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-cores 4 \
  --executor-memory 24g \
  --driver-memory 4g \
  --conf spark.executor.memoryOverhead=4g \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  your-job.py

# Why these numbers?
# - 2 executors per node (8 vCPU / 4 cores = 2)
# - 24GB executor memory + 4GB overhead = 28GB per executor
# - 2 x 28GB = 56GB, leaving ~8GB for the OS and YARN daemons
# - Dynamic allocation scales from 1 to 20 executors based on load
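The memory split follows a common heuristic: reserve some memory for the OS and YARN daemons, divide the rest across executors, and carve out a slice of each executor's share as off-heap overhead. A sketch of that calculation (the function name, OS reserve, and overhead fraction are illustrative choices, not Spark defaults):

```python
def executor_config(node_vcpus, node_mem_gb, cores_per_executor=4,
                    os_reserve_gb=8, overhead_fraction=0.15):
    """Split a node's resources into executors (illustrative heuristic)."""
    executors = node_vcpus // cores_per_executor   # executors per node
    usable = node_mem_gb - os_reserve_gb           # leave room for OS/YARN daemons
    per_executor = usable // executors             # heap + overhead, in GB
    overhead = max(1, int(per_executor * overhead_fraction))
    heap = per_executor - overhead
    return executors, heap, overhead

# r5.2xlarge (8 vCPU, 64 GB): 2 executors, 24g heap + 4g overhead each
print(executor_config(8, 64))
```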
3. S3 vs HDFS Cost Tradeoffs
Storage strategy dramatically impacts costs. S3 is cheaper for storage but slower for access. HDFS is faster but more expensive.
S3 (Recommended)
- ✓ Cost: $0.023/GB/month (Standard)
- ✓ Durability: 99.999999999% (11 nines)
- ✓ No cluster required for storage
- ✗ Slower than HDFS (network latency)
- ✗ API costs for PUT/GET requests
HDFS
- ✓ Performance: 2-3x faster than S3
- ✓ No API request costs
- ✗ Cost: $0.10-0.15/GB/month (EBS)
- ✗ Requires an always-on cluster
- ✗ Manual replication management
Hybrid Strategy (Best ROI)
1. Long-term storage: S3 Standard for data older than 30 days
2. Active data: S3 with caching on local SSD (instance store)
3. Intermediate results: instance store (ephemeral, free, fast)
4. Archival: S3 Glacier ($0.004/GB/month) for cold data
💰 Cost Comparison (10TB dataset)
S3:
- Storage: $230/month (10,000 GB × $0.023)
- API requests: ~$50/month
- Total: ~$280/month
HDFS on EBS:
- EBS volumes: $1,000/month
- EC2 (3 always-on nodes): $2,000/month
- Total: $3,000/month
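Recomputing the comparison from the per-GB rates quoted above (note: the API-request figure is a workload-dependent estimate, and the EBS figure does not include HDFS's usual 3x replication, which would widen the gap further):

```python
# Recompute the 10 TB storage comparison from the per-GB rates above.
data_gb = 10 * 1000               # 10 TB in GB (decimal, as cloud pricing uses)

s3_storage = data_gb * 0.023      # S3 Standard: $0.023/GB/month
s3_api = 50                       # PUT/GET request costs (estimate)
s3_total = s3_storage + s3_api

ebs = data_gb * 0.10              # EBS at $0.10/GB/month (no 3x replication)
ec2 = 2000                        # three always-on nodes (figure from above)
hdfs_total = ebs + ec2

print(f"S3: ${s3_total:,.0f}/mo vs HDFS: ${hdfs_total:,.0f}/mo "
      f"({hdfs_total / s3_total:.1f}x)")
```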
4. Auto-Scaling Policies
Auto-scaling adjusts cluster size based on workload, avoiding idle capacity during off-peak hours.
EMR Auto-Scaling Configuration
{
"Rules": [
{
"Name": "ScaleOut",
"Description": "Scale out when YARN memory utilization > 75%",
"Action": {
"SimpleScalingPolicyConfiguration": {
"AdjustmentType": "CHANGE_IN_CAPACITY",
"ScalingAdjustment": 2,
"CoolDown": 300
}
},
"Trigger": {
"CloudWatchAlarmDefinition": {
"MetricName": "YARNMemoryAvailablePercentage",
"ComparisonOperator": "LESS_THAN",
"Threshold": 25,
"Period": 300
}
}
},
{
"Name": "ScaleIn",
"Description": "Scale in when YARN memory utilization < 25%",
"Action": {
"SimpleScalingPolicyConfiguration": {
"AdjustmentType": "CHANGE_IN_CAPACITY",
"ScalingAdjustment": -1,
"CoolDown": 300
}
},
"Trigger": {
"CloudWatchAlarmDefinition": {
"MetricName": "YARNMemoryAvailablePercentage",
"ComparisonOperator": "GREATER_THAN",
"Threshold": 75,
"Period": 300
}
}
}
]
}
📊 Auto-Scaling Best Practices
- Set min instances to handle baseline load (e.g., 2-3 core nodes)
- Set max instances at your budget cap (e.g., 20 nodes)
- Use a 5-minute cooldown to prevent thrashing
- Scale out aggressively (add 2-3 nodes), scale in conservatively (remove 1 node)
- Alarm on YARN memory metrics, not CPU (CPU utilization can be misleading for Spark)
💰 Cost Savings
Scenario: Cluster runs 24/7 but only busy 8 hours/day (business hours)
- Without auto-scaling: 20 nodes × 24 hours = 480 node-hours/day
- With auto-scaling: (3 nodes × 16h) + (20 nodes × 8h) = 208 node-hours/day
- Savings: ~57% reduction in compute costs
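The node-hour math behind that figure:

```python
# Node-hours per day: static 20-node cluster vs auto-scaled cluster that
# holds 3 baseline nodes off-peak and 20 nodes during the 8 busy hours.
baseline_nodes, peak_nodes = 3, 20
peak_hours = 8
off_hours = 24 - peak_hours

static_hours = peak_nodes * 24                                           # 480
autoscaled_hours = baseline_nodes * off_hours + peak_nodes * peak_hours  # 208
savings = 1 - autoscaled_hours / static_hours

print(f"{static_hours} -> {autoscaled_hours} node-hours/day "
      f"({savings:.0%} saved)")
```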
5. Real Examples with $$ Savings
Example 1: E-Commerce Company
Before: 10 r5.4xlarge on-demand nodes 24/7 for nightly ETL
After: Transient cluster with 2 r5.2xlarge on-demand + 8 spot, runs 4 hours/night
Example 2: Media Streaming Company
Before: 50TB HDFS storage on persistent cluster
After: Migrated to S3 with tiered storage (Standard + Glacier)
Example 3: Financial Services
Before: Over-provisioned r5.8xlarge instances for simple ETL
After: Right-sized to c5.4xlarge + optimized executor config
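As a rough dollar estimate for Example 1, using the r5.2xlarge rates quoted earlier in this article and assuming the r5.4xlarge on-demand rate is twice the r5.2xlarge rate (verify against current pricing for your region):

```python
# Rough estimate for Example 1: always-on cluster vs nightly transient cluster.
r5_4xl_od = 1.008    # $/hr, assumed 2x the r5.2xlarge rate quoted above
r5_2xl_od = 0.504    # $/hr, quoted above
r5_2xl_spot = 0.15   # $/hr, quoted above

before = 10 * r5_4xl_od * 24 * 30                   # 10 nodes, 24/7
after = (2 * r5_2xl_od + 8 * r5_2xl_spot) * 4 * 30  # 2 OD + 8 spot, 4 h/night
print(f"~${before:,.0f}/mo -> ~${after:,.0f}/mo "
      f"({1 - after / before:.0%} saved)")
```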
Quick Wins Checklist
Implement these optimizations to see immediate cost reduction (typically 40-60% savings):
This Week
- ✓ Switch to spot instances for task nodes
- ✓ Enable dynamic allocation
- ✓ Move data from HDFS to S3
This Month
- ✓ Right-size instance types per workload
- ✓ Implement auto-scaling policies
- ✓ Set up S3 lifecycle policies for archival
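The lifecycle-policy item can be sketched with boto3. The bucket name, key prefix, and day thresholds below are placeholders to adapt; the rule shape follows the S3 `PutBucketLifecycleConfiguration` API:

```python
# S3 lifecycle rule that tiers data to cheaper storage classes over time.
# Bucket name, prefix, and day thresholds are placeholders.
lifecycle = {
    "Rules": [{
        "ID": "tier-cold-data",
        "Status": "Enabled",
        "Filter": {"Prefix": "warehouse/"},               # only tier this prefix
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},  # infrequent access
            {"Days": 90, "StorageClass": "GLACIER"},      # archival
        ],
    }]
}

# Apply with boto3 (requires AWS credentials):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake", LifecycleConfiguration=lifecycle)
```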