# MLOps Cost Optimization

Reduce infrastructure costs by 60-80% with smart resource management.

## Why MLOps Cost Optimization Matters
MLOps infrastructure costs add up quickly: experiment tracking servers, artifact storage, training compute, and serving infrastructure. Teams often waste 60-80% of their MLOps budget on idle resources, unnecessary artifact storage, and over-provisioned infrastructure.
This guide shows practical techniques to cut costs while maintaining productivity. We'll cover artifact lifecycle management, compute auto-shutdown, experiment tracking optimization, and resource right-sizing.
**Real impact:** Teams using these techniques reduced MLOps costs from $50K/month to $10-15K/month with no loss in productivity.
## Quick Wins (Implement Today)

1. **S3 lifecycle policies:** auto-archive old experiment artifacts after 90 days and delete them after a year
2. **Auto-shutdown notebooks:** stop idle SageMaker/Databricks notebooks after 1 hour
3. **Compress model artifacts:** use gzip for model checkpoints and metadata
4. **Spot instances for training:** use spot for long-running, fault-tolerant training
## 1. Artifact Storage Optimization
Model artifacts, experiment data, and training logs accumulate quickly. Without lifecycle management, storage costs grow 10-20% monthly. Most artifacts older than 90 days are never accessed again.
### S3 Lifecycle Policies

**Lifecycle configuration**
`lifecycle-policy.json`:

```json
{
  "Rules": [
    {
      "Id": "TransitionExperimentArtifacts",
      "Status": "Enabled",
      "Filter": { "Prefix": "experiments/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    },
    {
      "Id": "DeleteFailedRuns",
      "Status": "Enabled",
      "Filter": { "Prefix": "experiments/failed/" },
      "Expiration": { "Days": 30 }
    },
    {
      "Id": "KeepProductionModels",
      "Status": "Enabled",
      "Filter": { "Prefix": "models/production/" },
      "Transitions": [
        { "Days": 180, "StorageClass": "STANDARD_IA" }
      ]
    }
  ]
}
```
Apply the policy:

```shell
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-mlops-artifacts \
  --lifecycle-configuration file://lifecycle-policy.json
```

### Storage Class Pricing (per GB/month)
| Storage Class | Cost/GB | Retrieval | Best For |
|---|---|---|---|
| S3 Standard | $0.023 | Free | Active experiments (<30 days) |
| S3 Standard-IA | $0.0125 (46% cheaper) | $0.01/GB | Occasional access (30-90 days) |
| S3 Glacier | $0.004 (83% cheaper) | $0.03/GB | Archival (90+ days) |
| Delete | $0 (100% savings) | - | Failed runs, temp data (365+ days) |
### Real Example: 100TB Artifact Storage
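As a rough sketch of how the pricing table plays out at 100 TB: the 20/30/50 TB split across tiers below is an illustrative assumption, and the per-GB rates are the ones listed above.

```python
# $/GB-month rates from the pricing table above
RATES = {"STANDARD": 0.023, "STANDARD_IA": 0.0125, "GLACIER": 0.004}

def monthly_cost(allocation_tb: dict) -> float:
    """Monthly storage cost in dollars for a {storage_class: TB} allocation (1 TB = 1,000 GB)."""
    return sum(RATES[cls] * tb * 1_000 for cls, tb in allocation_tb.items())

flat = monthly_cost({"STANDARD": 100})  # everything left in S3 Standard
tiered = monthly_cost({"STANDARD": 20, "STANDARD_IA": 30, "GLACIER": 50})
print(f"flat: ${flat:,.0f}/mo  tiered: ${tiered:,.0f}/mo  savings: {1 - tiered/flat:.0%}")
# → flat: $2,300/mo  tiered: $1,035/mo  savings: 55%
```

Deleting expired data (the fourth row of the table) pushes the savings higher still, since expired bytes cost nothing.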
### Artifact Compression

**Compress model checkpoints**
```python
import gzip
import pickle

import mlflow

def save_compressed_model(model, path):
    """Save a model with gzip compression (typically 40-60% size reduction)."""
    with gzip.open(f"{path}.pkl.gz", "wb") as f:
        pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_compressed_model(path):
    """Load a model saved by save_compressed_model."""
    with gzip.open(f"{path}.pkl.gz", "rb") as f:
        return pickle.load(f)

# MLflow artifact logging (signature is assumed to be inferred beforehand)
mlflow.sklearn.log_model(
    model,
    "model",
    signature=signature,
    # Cloudpickle handles custom model classes that plain pickle cannot
    serialization_format=mlflow.sklearn.SERIALIZATION_FORMAT_CLOUDPICKLE,
)

# Manual compression for custom artifacts (df is your pandas DataFrame)
with gzip.open("training_data.csv.gz", "wt") as f:
    df.to_csv(f, index=False)
mlflow.log_artifact("training_data.csv.gz")

# Example: 100GB uncompressed model -> 40GB compressed
# Savings: 60GB x $0.023 = $1.38/month per model
# With 1000 models: $1,380/month
```

## 2. Compute Resource Management
### Auto-Shutdown Policies
Data scientists often leave notebooks running overnight or forget to stop instances after experimentation. Auto-shutdown saves 60-80% on idle compute costs.
**SageMaker auto-shutdown (Lambda)**

```python
from datetime import datetime, timedelta

import boto3

sagemaker = boto3.client("sagemaker")

def lambda_handler(event, context):
    """
    Stop SageMaker notebook instances idle for more than 1 hour.
    Run this Lambda every 15 minutes via EventBridge.
    """
    # NOTE: returns one page; paginate for fleets with >100 notebooks
    response = sagemaker.list_notebook_instances(StatusEquals="InService")

    for notebook in response["NotebookInstances"]:
        name = notebook["NotebookInstanceName"]

        # LastModifiedTime is a proxy for activity (it reflects the last state change)
        details = sagemaker.describe_notebook_instance(NotebookInstanceName=name)
        last_modified = details["LastModifiedTime"]
        idle_time = datetime.now(last_modified.tzinfo) - last_modified

        # Stop if idle for more than 1 hour
        if idle_time > timedelta(hours=1):
            print(f"Stopping idle notebook: {name} (idle {idle_time})")
            sagemaker.stop_notebook_instance(NotebookInstanceName=name)

    return {"statusCode": 200}

# Example savings:
# 10 ml.p3.2xlarge instances ($3.825/hour each)
# Without auto-shutdown: 24 hours x 10 x $3.825 = $918/day
# With auto-shutdown:     8 hours x 10 x $3.825 = $306/day
# Savings: $612/day = $18,360/month (67%)
```

**Tip:** Set the idle threshold to 1 hour for notebooks and 30 minutes for training jobs without checkpointing. For long training jobs with checkpointing, disable auto-shutdown.
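The Lambda needs the 15-minute schedule mentioned in its docstring. One way to wire that up with EventBridge, sketched below; the rule name, function name, region, and account ID are all placeholders:

```shell
# Create a schedule that fires every 15 minutes
aws events put-rule \
  --name notebook-idle-check \
  --schedule-expression "rate(15 minutes)"

# Allow EventBridge to invoke the Lambda
aws lambda add-permission \
  --function-name notebook-auto-shutdown \
  --statement-id eventbridge-invoke \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn "$(aws events describe-rule --name notebook-idle-check --query Arn --output text)"

# Point the rule at the function
aws events put-targets \
  --rule notebook-idle-check \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:notebook-auto-shutdown"
```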
### Spot Instances for Training

**SageMaker managed spot training**
```python
from sagemaker.estimator import Estimator

# role is assumed to be an IAM role ARN defined elsewhere
estimator = Estimator(
    image_uri="my-training-image",
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    # Enable managed spot training
    use_spot_instances=True,
    max_wait=7200,  # Max total time including spot waits, in seconds (must be >= max_run)
    max_run=3600,   # Max training time in seconds
    # Checkpointing for spot interruption recovery
    checkpoint_s3_uri="s3://bucket/checkpoints/",
    checkpoint_local_path="/opt/ml/checkpoints",
)

# Spot savings: 70-90% vs on-demand
# Example: 10-hour training on ml.p3.2xlarge
# On-demand: 10 x $3.825 = $38.25
# Spot:      10 x $0.50  = $5.00 (87% savings)
# Even with 2-3 interruptions, total cost stays far below on-demand
```

### When to Use Spot Instances

Spot suits long-running, fault-tolerant jobs that checkpoint regularly, so an interruption only costs the progress since the last checkpoint. Avoid spot for jobs that cannot checkpoint or must finish by a hard deadline, since capacity can be reclaimed at any time.
## 3. Experiment Tracking Optimization

MLflow, Weights & Biases, and other tracking tools can generate massive amounts of data if not configured carefully. Optimize what you log to reduce storage and query costs.

**Optimize MLflow logging**
```python
import mlflow

# train_step is your per-batch training function (assumed defined elsewhere)

# ❌ DON'T: log every training step
for epoch in range(100):
    for batch in range(1000):
        loss = train_step(batch)
        mlflow.log_metric("batch_loss", loss, step=epoch * 1000 + batch)
        # Generates 100K metric entries!

# ✅ DO: log only epoch-level metrics
for epoch in range(100):
    epoch_loss = 0
    for batch in range(1000):
        loss = train_step(batch)
        epoch_loss += loss
    mlflow.log_metric("epoch_loss", epoch_loss / 1000, step=epoch)
    # Generates only 100 metric entries

# ✅ DO: sample high-frequency metrics
step = 0
for epoch in range(100):
    for batch in range(1000):
        loss = train_step(batch)
        # Log every 100 steps instead of every step
        if step % 100 == 0:
            mlflow.log_metric("batch_loss", loss, step=step)
        step += 1
# Reduces logging overhead by 99% while maintaining visibility
```

### Logging Best Practices
**Do log:**

- Final model metrics (accuracy, F1, etc.)
- Hyperparameters
- Epoch-level training metrics
- Model artifacts (compressed)
- Confusion matrix (not per-batch)

**Don't log:**

- Batch-level metrics (unless sampled)
- Full training datasets
- Intermediate checkpoints (keep last 3)
- Debug print statements
- Every hyperparameter combo in tuning
## Real-World Example: Complete MLOps Stack

### Mid-Size ML Team (10 data scientists)

50 experiments/week, 200 models trained/month.
**Before optimization:** roughly $50K/month across artifact storage, experiment tracking, notebooks, and training compute.

**After optimization:** $10-15K/month after lifecycle policies, auto-shutdown, spot training, and leaner logging, with no loss in productivity.
## Your MLOps Cost Optimization Checklist

Implement these 10 changes this week to save 60-80%.

**Storage optimization**

- [ ] Add S3 lifecycle transitions (Standard-IA at 30 days, Glacier at 90, expire at 365)
- [ ] Expire failed experiment runs after 30 days
- [ ] Gzip-compress model checkpoints and custom artifacts

**Compute optimization**

- [ ] Auto-stop notebooks idle for more than 1 hour
- [ ] Use managed spot training with checkpointing for long jobs

**Tracking optimization**

- [ ] Log epoch-level metrics instead of per-batch values
- [ ] Sample any high-frequency metrics you must keep (e.g., every 100 steps)
- [ ] Keep only the last 3 intermediate checkpoints

**Monitoring**

- [ ] Review storage growth monthly (it compounds 10-20% per month if unmanaged)
- [ ] Track monthly MLOps spend to confirm the savings hold
## Related Resources

- **Spark Cost Optimization:** reduce Spark cluster costs by 70-90% with spot instances and auto-scaling
- **LLM Cost Optimization:** cut LLM API costs by 70-95% with batching, caching, and smart model selection
- **MLOps Case Studies:** how Uber and Netflix built production MLOps platforms at scale
- **Build an MLOps Pipeline:** hands-on project covering experiment tracking, model registry, and automated deployment