
MLOps Cost Optimization

Reduce infrastructure costs by 60-80% with smart resource management

Why MLOps Cost Optimization Matters

MLOps infrastructure costs add up quickly: experiment tracking servers, artifact storage, training compute, and serving infrastructure. Teams often waste 60-80% of their MLOps budget on idle resources, unnecessary artifact storage, and over-provisioned infrastructure.

This guide shows practical techniques to cut costs while maintaining productivity. We'll cover artifact lifecycle management, compute auto-shutdown, experiment tracking optimization, and resource right-sizing.

Real Impact: Teams using these techniques reduced MLOps costs from $50K/month to $10-15K/month with no loss in productivity.

Quick Wins (Implement Today)

1. S3 Lifecycle Policies

Auto-delete old experiment artifacts after 90 days

💰 Save 50-70% on storage

2. Auto-Shutdown Notebooks

Stop idle SageMaker/Databricks notebooks after 1 hour

💰 Save 60-80% on compute

3. Compress Model Artifacts

Use gzip for model checkpoints and metadata

💰 Save 40-60% on storage

4. Spot Instances for Training

Use spot for long-running, fault-tolerant training

💰 Save 70-90% on training

1. Artifact Storage Optimization

Model artifacts, experiment data, and training logs accumulate quickly. Without lifecycle management, storage costs grow 10-20% monthly. Most artifacts older than 90 days are never accessed again.

S3 Lifecycle Policies

Lifecycle Configuration

# lifecycle-policy.json
{
  "Rules": [
    {
      "Id": "TransitionExperimentArtifacts",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "experiments/"
      },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        }
      ],
      "Expiration": {
        "Days": 365
      }
    },
    {
      "Id": "DeleteFailedRuns",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "experiments/failed/"
      },
      "Expiration": {
        "Days": 30
      }
    },
    {
      "Id": "KeepProductionModels",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "models/production/"
      },
      "Transitions": [
        {
          "Days": 180,
          "StorageClass": "STANDARD_IA"
        }
      ]
    }
  ]
}

# Apply lifecycle policy
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-mlops-artifacts \
  --lifecycle-configuration file://lifecycle-policy.json

Storage Class Pricing (per GB/month)

Storage Class | Cost/GB | Retrieval | Best For
S3 Standard | $0.023 | Free | Active experiments (<30 days)
S3 Standard-IA | $0.0125 (46% cheaper) | $0.01/GB | Occasional access (30-90 days)
S3 Glacier | $0.004 (83% cheaper) | $0.03/GB | Archival (90+ days)
Delete (after 365 days) | $0 (100% savings) | - | Failed runs, temp data

Real Example: 100TB Artifact Storage

Before: 100TB × $0.023 = $2,300/month
After lifecycle policies:
• 20TB active (30 days) × $0.023 = $460
• 30TB Standard-IA (30-90 days) × $0.0125 = $375
• 40TB Glacier (90-365 days) × $0.004 = $160
• 10TB deleted after 365 days = $0
New cost: $995/month, a saving of $1,305/month (57%)
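The tier arithmetic above is easy to script. A minimal sketch using the per-GB prices from the table, with 1 TB counted as 1,000 GB as in the example:

```python
# Per-GB/month prices from the storage class table above
PRICES_PER_GB = {"STANDARD": 0.023, "STANDARD_IA": 0.0125, "GLACIER": 0.004}

def monthly_storage_cost(tiers_tb):
    """Monthly cost for {storage_class: TB stored}; 1 TB = 1,000 GB."""
    return round(sum(tb * 1000 * PRICES_PER_GB[cls]
                     for cls, tb in tiers_tb.items()), 2)

before = monthly_storage_cost({"STANDARD": 100})  # 2300.0
after = monthly_storage_cost({"STANDARD": 20, "STANDARD_IA": 30, "GLACIER": 40})  # 995.0
```

Plugging in your own tier sizes gives a quick estimate of what a lifecycle policy is worth before you roll it out.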

Artifact Compression

Compress Model Checkpoints

import gzip
import pickle
import mlflow

# Save compressed model
def save_compressed_model(model, path):
    """Save model with gzip compression (40-60% size reduction)"""
    with gzip.open(f"{path}.pkl.gz", 'wb') as f:
        pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

# MLflow model logging (cloudpickle serialization supports custom
# classes; it does not compress, so gzip large artifacts yourself)
mlflow.sklearn.log_model(
    model,
    "model",
    signature=signature,
    serialization_format=mlflow.sklearn.SERIALIZATION_FORMAT_CLOUDPICKLE
)

# Manual compression for custom artifacts (df is an existing DataFrame)
with gzip.open('training_data.csv.gz', 'wt') as f:
    df.to_csv(f, index=False)

mlflow.log_artifact('training_data.csv.gz')

# Example: 100GB uncompressed model → 40GB compressed
# Savings: 60GB × $0.023 = $1.38/month per model
# With 1000 models: $1,380/month savings
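Restoring the artifact is the mirror image. A self-contained sketch of the save/load pair (remember that unpickling should only be done on artifacts you trust):

```python
import gzip
import pickle

def save_compressed_model(model, path):
    """Write model as a gzip-compressed pickle at <path>.pkl.gz."""
    with gzip.open(f"{path}.pkl.gz", "wb") as f:
        pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_compressed_model(path):
    """Load a model saved by save_compressed_model.
    Only unpickle artifacts from trusted sources."""
    with gzip.open(f"{path}.pkl.gz", "rb") as f:
        return pickle.load(f)
```

gzip decompression is transparent here, so downstream code that expects the original object needs no changes.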

2. Compute Resource Management

Auto-Shutdown Policies

Data scientists often leave notebooks running overnight or forget to stop instances after experimentation. Auto-shutdown saves 60-80% on idle compute costs.

SageMaker Auto-Shutdown (Lambda)

import boto3
from datetime import datetime, timedelta

sagemaker = boto3.client('sagemaker')

def lambda_handler(event, context):
    """
    Stop SageMaker notebook instances idle for >1 hour
    Run this Lambda every 15 minutes via EventBridge
    """
    response = sagemaker.list_notebook_instances(
        StatusEquals='InService'
    )

    for notebook in response['NotebookInstances']:
        name = notebook['NotebookInstanceName']

        # Get last modified time
        # NOTE: LastModifiedTime reflects instance state/config changes,
        # not user activity; a production setup would pair this with a
        # lifecycle-config idle check. This is a simple approximation.
        details = sagemaker.describe_notebook_instance(
            NotebookInstanceName=name
        )

        last_modified = details['LastModifiedTime']
        idle_time = datetime.now(last_modified.tzinfo) - last_modified

        # Stop if idle > 1 hour
        if idle_time > timedelta(hours=1):
            print(f"Stopping idle notebook: {name} (idle {idle_time})")
            sagemaker.stop_notebook_instance(
                NotebookInstanceName=name
            )

    return {'statusCode': 200}

# Example savings:
# 10 ml.p3.2xlarge instances ($3.825/hour each)
# Without auto-shutdown: 24 hours × 10 × $3.825 = $918/day
# With auto-shutdown: 8 hours × 10 × $3.825 = $306/day
# Savings: $612/day = $18,360/month (67%)

Tip: Set idle threshold to 1 hour for notebooks, 30 minutes for training jobs without checkpointing. For long training jobs with checkpointing, disable auto-shutdown.
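The Lambda above still needs its 15-minute schedule. A sketch of the EventBridge wiring via the AWS CLI; the function name, region, and account ID are placeholders for your own:

```shell
# Create a rule that fires every 15 minutes
aws events put-rule \
  --name notebook-auto-shutdown-schedule \
  --schedule-expression "rate(15 minutes)"

# Point the rule at the auto-shutdown Lambda (placeholder ARN)
aws events put-targets \
  --rule notebook-auto-shutdown-schedule \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:notebook-auto-shutdown"

# Allow EventBridge to invoke the function
aws lambda add-permission \
  --function-name notebook-auto-shutdown \
  --statement-id eventbridge-invoke \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:123456789012:rule/notebook-auto-shutdown-schedule
```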

Spot Instances for Training

SageMaker Managed Spot Training

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri='my-training-image',
    role=role,
    instance_count=1,
    instance_type='ml.p3.2xlarge',

    # Enable managed spot training
    use_spot_instances=True,
    max_wait=7200,  # Max total wait in seconds (must be >= max_run)
    max_run=3600,   # Max training time in seconds

    # Checkpointing for spot recovery
    checkpoint_s3_uri='s3://bucket/checkpoints/',
    checkpoint_local_path='/opt/ml/checkpoints'
)

# Spot savings: 70-90% vs on-demand
# Example: 10-hour training on ml.p3.2xlarge
# On-demand: 10 × $3.825 = $38.25
# Spot: 10 × $0.50 = $5.00 (87% savings)
# Even with 2-3 interruptions, total cost << on-demand
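Spot only pays off when the job can survive interruption. A framework-agnostic sketch of the resume-from-checkpoint loop; the JSON state and `step_fn` callback are illustrative, and on SageMaker you would write under `/opt/ml/checkpoints` so files sync to the `checkpoint_s3_uri` above:

```python
import json
import os

def train(total_steps, ckpt_dir, step_fn):
    """Resume from the last checkpoint if one exists, else start fresh.
    step_fn(state, step) -> new state; state must be JSON-serializable."""
    ckpt_path = os.path.join(ckpt_dir, "checkpoint.json")
    state, start = {"loss": None}, 0
    if os.path.exists(ckpt_path):  # interrupted run: pick up where we left off
        with open(ckpt_path) as f:
            saved = json.load(f)
        state, start = saved["state"], saved["step"] + 1
    for step in range(start, total_steps):
        state = step_fn(state, step)
        with open(ckpt_path, "w") as f:  # checkpoint every step (tune frequency)
            json.dump({"step": step, "state": state}, f)
    return state
```

A spot interruption then costs only the work since the last checkpoint, which is what makes the 2-3 restarts in the example above cheap.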

When to Use Spot Instances

✅ Long training jobs (4+ hours) - enough time to recover from interruptions
✅ Checkpointed training - can resume from the last checkpoint
✅ Hyperparameter tuning - many parallel trials, individual failures are OK
❌ Real-time inference - needs guaranteed availability
❌ Short jobs (<1 hour) - interruption overhead too high

3. Experiment Tracking Optimization

MLflow, Weights & Biases, and other tracking tools can generate massive amounts of data if not configured carefully. Optimize what you log to reduce storage and query costs.

Optimize MLflow Logging

import mlflow

# ❌ DON'T: Log every training step (generates 100K+ log entries)
for epoch in range(100):
    for batch in range(1000):
        loss = train_step(batch)
        mlflow.log_metric("batch_loss", loss, step=epoch*1000+batch)
        # Generates 100K metric entries!

# ✅ DO: Log only epoch-level metrics
for epoch in range(100):
    epoch_loss = 0
    for batch in range(1000):
        loss = train_step(batch)
        epoch_loss += loss

    mlflow.log_metric("epoch_loss", epoch_loss/1000, step=epoch)
    # Generates only 100 metric entries

# ✅ DO: Sample high-frequency metrics
step = 0
for epoch in range(100):
    for batch in range(1000):
        loss = train_step(batch)

        # Log every 100 steps instead of every step
        if step % 100 == 0:
            mlflow.log_metric("batch_loss", loss, step=step)
        step += 1

# Reduces logging overhead by 99% while maintaining visibility

Logging Best Practices

What TO log:
  • Final model metrics (accuracy, F1, etc.)
  • Hyperparameters
  • Epoch-level training metrics
  • Model artifacts (compressed)
  • Confusion matrix (not per-batch)

What NOT to log:
  • Batch-level metrics (unless sampled)
  • Full training datasets
  • Intermediate checkpoints (keep only the last 3)
  • Debug print statements
  • Every hyperparameter combination during tuning
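The "keep only the last 3 checkpoints" rule is easy to automate. A sketch assuming checkpoint filenames sort chronologically (e.g. `epoch-001.ckpt`, `epoch-002.ckpt`, ...):

```python
import os

def prune_checkpoints(ckpt_dir, keep=3):
    """Delete all but the `keep` newest .ckpt files in ckpt_dir.
    Assumes filenames sort chronologically (epoch-001.ckpt, ...)."""
    ckpts = sorted(f for f in os.listdir(ckpt_dir) if f.endswith(".ckpt"))
    stale = ckpts if keep <= 0 else ckpts[:-keep]
    for name in stale:
        os.remove(os.path.join(ckpt_dir, name))
    return ckpts[len(stale):]  # the checkpoints that survive
```

Calling this at the end of each epoch (or from a cron job over the checkpoint bucket) keeps checkpoint storage bounded per experiment.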

Real-World Example: Complete MLOps Stack

Mid-Size ML Team (10 data scientists)

50 experiments/week, 200 models trained/month

Before Optimization

Notebook compute: $15,000 (24/7 running)
Training compute: $20,000 (on-demand)
Artifact storage: $8,000 (no lifecycle)
Tracking overhead: $3,000 (batch logging)
MLflow server: $2,000
Monthly cost: $48,000

After Optimization

Notebook compute: $4,500 (auto-shutdown, 70% savings)
Training compute: $3,000 (spot instances, 85% savings)
Artifact storage: $3,200 (lifecycle + compression, 60% savings)
Tracking overhead: $600 (epoch logging, 80% savings)
MLflow server: $2,000 (unchanged)
Monthly cost: $13,300

Monthly savings: $34,700 (72%)
Annual savings: $416,400 with zero productivity loss
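As a quick sanity check, the totals above reproduce directly from the line items:

```python
# Monthly costs (USD) copied from the before/after breakdown above
before = {"notebooks": 15000, "training": 20000, "storage": 8000,
          "tracking": 3000, "mlflow": 2000}
after = {"notebooks": 4500, "training": 3000, "storage": 3200,
         "tracking": 600, "mlflow": 2000}

monthly_savings = sum(before.values()) - sum(after.values())
savings_pct = round(100 * monthly_savings / sum(before.values()))
annual_savings = monthly_savings * 12
print(monthly_savings, savings_pct, annual_savings)  # 34700 72 416400
```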

Your MLOps Cost Optimization Checklist

Implement these 12 changes this week to save 60-80%

Storage Optimization

Configure S3 lifecycle policies (30d → IA, 90d → Glacier, 365d → delete)
Enable gzip compression for all model artifacts
Delete failed experiment runs after 30 days

Compute Optimization

Set up auto-shutdown Lambda for idle notebooks (1h threshold)
Enable managed spot training for all long-running jobs
Right-size notebook instances (most need ml.t3.medium, not ml.p3)

Tracking Optimization

Log epoch metrics only, not batch metrics
Sample high-frequency metrics (every 100 steps)
Keep only last 3 checkpoints per experiment

Monitoring

Set up cost alerts when spending exceeds budget by 20%
Tag all resources by team/project for cost attribution
Review AWS Cost Explorer monthly for optimization opportunities