Spark Cost Optimization
Reduce your Apache Spark costs by 50-80% with these proven strategies
Why Cost Optimization Matters
Apache Spark clusters can easily cost $10,000-100,000+ per month at scale. Most organizations overspend by 50-80% due to poor cluster sizing, inefficient storage, and suboptimal job configurations.
Real Example: Netflix reduced Spark costs from $50M to $35M annually (30% savings) by implementing auto-scaling, spot instances, and storage tiering. The strategies below are battle-tested at companies processing petabytes daily.
1. AWS Spot Instances for Spark
Spot instances cost 70-90% less than on-demand instances. Perfect for Spark workloads since tasks are fault-tolerant and can handle interruptions.
Configuration Strategy
Master Node: On-Demand
Keep master node on-demand to ensure cluster stays alive. Cost: ~$50-200/month
Core Nodes: Mix of On-Demand + Spot (20/80 split)
20% on-demand for stability, 80% spot for cost savings. Ensures minimum capacity even during spot interruptions.
Task Nodes: 100% Spot
Task nodes don't store data, perfect for spot. Interruptions just pause tasks, no data loss.
EMR Spot Configuration
# EMR cluster: on-demand master/core nodes, spot task nodes.
# Setting BidPrice on a group provisions it as spot; the special value
# OnDemandPrice caps the spot price at the on-demand rate.
aws emr create-cluster \
  --name "Spark-Spot-Cluster" \
  --release-label emr-6.10.0 \
  --applications Name=Spark \
  --ec2-attributes KeyName=my-key \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m5.xlarge,InstanceCount=1 \
    InstanceGroupType=CORE,InstanceType=r5.2xlarge,InstanceCount=2 \
    InstanceGroupType=TASK,InstanceType=r5.2xlarge,InstanceCount=10,BidPrice=OnDemandPrice \
  --use-default-roles
💰 Cost Savings
- On-demand r5.2xlarge: $0.504/hour
- Spot r5.2xlarge: ~$0.15/hour (70% savings)
- 10-node cluster: ~$121/day → $36/day ≈ $2,500/month in savings
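A quick sanity check on that arithmetic, using the example rates above (actual spot prices fluctuate by region and availability zone):

```python
# Sanity-check the spot savings math using the article's example rates.
on_demand_rate = 0.504   # $/hr, r5.2xlarge on-demand
spot_rate = 0.15         # $/hr, r5.2xlarge spot (fluctuates)
nodes, hours = 10, 24

on_demand_daily = on_demand_rate * nodes * hours   # ~$121/day
spot_daily = spot_rate * nodes * hours             # $36/day
monthly_savings = (on_demand_daily - spot_daily) * 30

print(f"${on_demand_daily:.0f}/day -> ${spot_daily:.0f}/day, "
      f"saving ~${monthly_savings:,.0f}/month")
```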
2. Right-Sizing Clusters
Most clusters are over-provisioned by 2-3x. Right-sizing based on actual workload reduces waste.
Sizing Guidelines
| Workload Type | Instance Type | CPU:Memory Ratio | Executor Config |
|---|---|---|---|
| ETL (CPU-bound) | c5.4xlarge | 1:2 (16 CPU, 32GB) | 4-5 executors/node |
| ML Training | r5.2xlarge | 1:8 (8 CPU, 64GB) | 2-3 executors/node |
| Streaming | m5.xlarge | 1:4 (4 CPU, 16GB) | 1-2 executors/node |
| Data Science | r5.4xlarge | 1:8 (16 CPU, 128GB) | 3-4 executors/node |
⚠️ Common Over-Provisioning Mistakes
- Using r5.8xlarge (32 CPU, 256GB) for simple ETL → use c5.4xlarge instead
- Running 24/7 clusters for batch jobs → use transient clusters
- Too many small executors → consolidate to 4-5 executors per node
- Keeping dev clusters the same size as prod → dev should be 20-30% of prod size
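For teams that launch clusters programmatically, the sizing table can be encoded as a lookup. The values below are copied from the table; the helper function itself is an illustrative convenience, not an AWS API:

```python
# Sizing table from above, encoded as a lookup (values copied verbatim).
SIZING = {
    "etl":          {"instance": "c5.4xlarge", "vcpus": 16, "mem_gb": 32},
    "ml_training":  {"instance": "r5.2xlarge", "vcpus": 8,  "mem_gb": 64},
    "streaming":    {"instance": "m5.xlarge",  "vcpus": 4,  "mem_gb": 16},
    "data_science": {"instance": "r5.4xlarge", "vcpus": 16, "mem_gb": 128},
}

def instance_for(workload: str) -> str:
    """Return the suggested instance type for a workload category."""
    spec = SIZING[workload]
    ratio = spec["mem_gb"] // spec["vcpus"]        # CPU:memory ratio, 1:ratio
    print(f"{workload}: {spec['instance']} (1:{ratio} CPU:memory)")
    return spec["instance"]
```

For example, `instance_for("etl")` returns `"c5.4xlarge"`.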
Optimal Executor Configuration
# Optimal config for r5.2xlarge (8 vCPU, 64GB RAM)
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-cores 4 \
  --executor-memory 24g \
  --driver-memory 4g \
  --conf spark.executor.memoryOverhead=4g \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  your-job.py

# Why these numbers?
# - 2 executors per node (8 vCPU / 4 cores = 2)
# - 24GB executor memory + 4GB overhead = 28GB per executor
# - 2 x 28GB = 56GB, leaving ~8GB for the OS and YARN daemons
# - Dynamic allocation scales from 1 to 20 executors based on load
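The memory split follows a common heuristic: reserve some memory for the OS and YARN daemons, divide the rest across executors, and carve out a slice of each executor's share as off-heap overhead. A sketch of that calculation (the function name, OS reserve, and overhead fraction are illustrative choices, not Spark defaults):

```python
def executor_config(node_vcpus, node_mem_gb, cores_per_executor=4,
                    os_reserve_gb=8, overhead_fraction=0.15):
    """Split a node's resources into executors (illustrative heuristic)."""
    executors = node_vcpus // cores_per_executor   # executors per node
    usable = node_mem_gb - os_reserve_gb           # leave room for OS/YARN daemons
    per_executor = usable // executors             # heap + overhead, in GB
    overhead = max(1, int(per_executor * overhead_fraction))
    heap = per_executor - overhead
    return executors, heap, overhead

# r5.2xlarge (8 vCPU, 64 GB): 2 executors, 24g heap + 4g overhead each
print(executor_config(8, 64))
```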
3. S3 vs HDFS Cost Tradeoffs
Storage strategy dramatically impacts costs. S3 is cheaper for storage but slower for access. HDFS is faster but more expensive.
S3 (Recommended)
- ✓ Cost: $0.023/GB/month (Standard)
- ✓ Durability: 99.999999999% (11 nines)
- ✓ No cluster required for storage
- ✗ Slower than HDFS (network latency)
- ✗ API costs for PUT/GET requests
HDFS
- ✓ Performance: 2-3x faster than S3
- ✓ No API request costs
- ✗ Cost: $0.10-0.15/GB/month (EBS)
- ✗ Requires an always-on cluster
- ✗ Manual replication management
Hybrid Strategy (Best ROI)
1. Long-term storage: S3 Standard for data older than 30 days
2. Active data: S3 with caching on local SSD (instance store)
3. Intermediate results: instance store (ephemeral, free, fast)
4. Archival: S3 Glacier ($0.004/GB/month) for cold data
💰 Cost Comparison (10TB dataset)
S3:
- Storage: $230/month (10,000 GB × $0.023)
- API requests: ~$50/month
- Total: ~$280/month
HDFS on EBS:
- EBS volumes: $1,000/month
- EC2 (3 always-on nodes): $2,000/month
- Total: $3,000/month
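Recomputing the comparison from the per-GB rates quoted above (note: the API-request figure is a workload-dependent estimate, and the EBS figure does not include HDFS's usual 3x replication, which would widen the gap further):

```python
# Recompute the 10 TB storage comparison from the per-GB rates above.
data_gb = 10 * 1000               # 10 TB in GB (decimal, as cloud pricing uses)

s3_storage = data_gb * 0.023      # S3 Standard: $0.023/GB/month
s3_api = 50                       # PUT/GET request costs (estimate)
s3_total = s3_storage + s3_api

ebs = data_gb * 0.10              # EBS at $0.10/GB/month (no 3x replication)
ec2 = 2000                        # three always-on nodes (figure from above)
hdfs_total = ebs + ec2

print(f"S3: ${s3_total:,.0f}/mo vs HDFS: ${hdfs_total:,.0f}/mo "
      f"({hdfs_total / s3_total:.1f}x)")
```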
4. Auto-Scaling Policies
Auto-scaling adjusts cluster size based on workload, avoiding idle capacity during off-peak hours.
EMR Auto-Scaling Configuration
{
"Rules": [
{
"Name": "ScaleOut",
"Description": "Scale out when YARN memory utilization > 75%",
"Action": {
"SimpleScalingPolicyConfiguration": {
"AdjustmentType": "CHANGE_IN_CAPACITY",
"ScalingAdjustment": 2,
"CoolDown": 300
}
},
"Trigger": {
"CloudWatchAlarmDefinition": {
"MetricName": "YARNMemoryAvailablePercentage",
"ComparisonOperator": "LESS_THAN",
"Threshold": 25,
"Period": 300
}
}
},
{
"Name": "ScaleIn",
"Description": "Scale in when YARN memory utilization < 25%",
"Action": {
"SimpleScalingPolicyConfiguration": {
"AdjustmentType": "CHANGE_IN_CAPACITY",
"ScalingAdjustment": -1,
"CoolDown": 300
}
},
"Trigger": {
"CloudWatchAlarmDefinition": {
"MetricName": "YARNMemoryAvailablePercentage",
"ComparisonOperator": "GREATER_THAN",
"Threshold": 75,
"Period": 300
}
}
}
]
}
📊 Auto-Scaling Best Practices
- Set min instances to handle baseline load (e.g., 2-3 core nodes)
- Set max instances at your budget cap (e.g., 20 nodes)
- Use a 5-minute cooldown to prevent thrashing
- Scale out aggressively (add 2-3 nodes), scale in conservatively (remove 1 node)
- Alarm on YARN memory metrics, not CPU (CPU utilization can be misleading for Spark)
💰 Cost Savings
Scenario: Cluster runs 24/7 but only busy 8 hours/day (business hours)
- Without auto-scaling: 20 nodes × 24 hours = 480 node-hours/day
- With auto-scaling: (3 nodes × 16h) + (20 nodes × 8h) = 208 node-hours/day
- Savings: ~57% reduction in compute costs
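The node-hour math behind that figure:

```python
# Node-hours per day: static 20-node cluster vs auto-scaled cluster that
# holds 3 baseline nodes off-peak and 20 nodes during the 8 busy hours.
baseline_nodes, peak_nodes = 3, 20
peak_hours = 8
off_hours = 24 - peak_hours

static_hours = peak_nodes * 24                                           # 480
autoscaled_hours = baseline_nodes * off_hours + peak_nodes * peak_hours  # 208
savings = 1 - autoscaled_hours / static_hours

print(f"{static_hours} -> {autoscaled_hours} node-hours/day "
      f"({savings:.0%} saved)")
```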
5. Real Examples with $$ Savings
Example 1: E-Commerce Company
Before: 10 r5.4xlarge on-demand nodes 24/7 for nightly ETL
After: Transient cluster with 2 r5.2xlarge on-demand + 8 spot, runs 4 hours/night
Example 2: Media Streaming Company
Before: 50TB HDFS storage on persistent cluster
After: Migrated to S3 with tiered storage (Standard + Glacier)
Example 3: Financial Services
Before: Over-provisioned r5.8xlarge instances for simple ETL
After: Right-sized to c5.4xlarge + optimized executor config
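As a rough dollar estimate for Example 1, using the r5.2xlarge rates quoted earlier in this article and assuming the r5.4xlarge on-demand rate is twice the r5.2xlarge rate (verify against current pricing for your region):

```python
# Rough estimate for Example 1: always-on cluster vs nightly transient cluster.
r5_4xl_od = 1.008    # $/hr, assumed 2x the r5.2xlarge rate quoted above
r5_2xl_od = 0.504    # $/hr, quoted above
r5_2xl_spot = 0.15   # $/hr, quoted above

before = 10 * r5_4xl_od * 24 * 30                   # 10 nodes, 24/7
after = (2 * r5_2xl_od + 8 * r5_2xl_spot) * 4 * 30  # 2 OD + 8 spot, 4 h/night
print(f"~${before:,.0f}/mo -> ~${after:,.0f}/mo "
      f"({1 - after / before:.0%} saved)")
```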
Quick Wins Checklist
Implement these optimizations to see immediate cost reduction (typically 40-60% savings):
This Week
- ✓ Switch to spot instances for task nodes
- ✓ Enable dynamic allocation
- ✓ Move data from HDFS to S3
This Month
- ✓ Right-size instance types per workload
- ✓ Implement auto-scaling policies
- ✓ Set up S3 lifecycle policies for archival
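The lifecycle-policy item can be sketched with boto3. The bucket name, key prefix, and day thresholds below are placeholders to adapt; the rule shape follows the S3 `PutBucketLifecycleConfiguration` API:

```python
# S3 lifecycle rule that tiers data to cheaper storage classes over time.
# Bucket name, prefix, and day thresholds are placeholders.
lifecycle = {
    "Rules": [{
        "ID": "tier-cold-data",
        "Status": "Enabled",
        "Filter": {"Prefix": "warehouse/"},               # only tier this prefix
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},  # infrequent access
            {"Days": 90, "StorageClass": "GLACIER"},      # archival
        ],
    }]
}

# Apply with boto3 (requires AWS credentials):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake", LifecycleConfiguration=lifecycle)
```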