
PredictFlow MLOps Project

Step-by-Step Walkthrough: Build a Production MLOps Pipeline

Total Time: ~2 hours
Difficulty: Intermediate
Tools: MLflow, DVC, scikit-learn

What You'll Build

In this walkthrough, you'll build a production-ready MLOps pipeline for PredictFlow, a customer churn prediction system. You'll learn industry-standard tools and practices:

  • Set up MLflow for experiment tracking and model registry
  • Version your data with DVC (Data Version Control)
  • Train a baseline ML model with automatic logging
  • Track experiments and compare model performance
  • Register models and manage lifecycle stages
  • Ensure reproducibility with data + code versioning

Prerequisites

Python 3.8+ installed
Git installed (for DVC)
Basic understanding of machine learning concepts
Familiarity with scikit-learn
Step 1: Set Up MLOps Environment (~30 min)

1.1 Create Project Structure

# Create project directory
mkdir predictflow-mlops
cd predictflow-mlops
# Create subdirectories
mkdir -p data/raw data/processed models notebooks src
# Initialize Git
git init

1.2 Install Dependencies

Create a virtual environment and install MLOps tools:

# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install packages
pip install mlflow==2.9.0 dvc==3.30.0 \
    scikit-learn==1.3.2 pandas==2.1.3 \
    matplotlib==3.8.2 seaborn==0.13.0
# Create requirements.txt
pip freeze > requirements.txt

1.3 Initialize DVC

Set up DVC for data versioning:

# Initialize DVC
dvc init
# Configure a local "remote" (use a directory outside the repo;
# pointing the remote at .dvc/cache itself would be circular)
dvc remote add -d local_storage /tmp/dvc-storage
# Commit DVC configuration
git add .dvc .dvcignore
git commit -m "Initialize DVC"
What is DVC?
DVC (Data Version Control) works like Git for your data and models. It tracks large files without storing them in Git, enabling dataset versioning, reproducibility, and collaboration.

1.4 Start MLflow Tracking Server

Launch MLflow UI for experiment tracking:

# Start MLflow server (in a new terminal)
mlflow server \
    --host 127.0.0.1 \
    --port 5000 \
    --backend-store-uri sqlite:///mlflow.db \
    --default-artifact-root ./mlruns

Open your browser to: http://localhost:5000

What You Should See
MLflow UI should open showing "No experiments yet." You'll create experiments in the next steps.

1.5 Create Sample Dataset

Generate a synthetic customer churn dataset:

# Create data/generate_data.py
import pandas as pd
import numpy as np

np.random.seed(42)
n_customers = 5000

data = {
    'customer_id': range(n_customers),
    'tenure_months': np.random.randint(1, 72, n_customers),
    'monthly_charges': np.random.uniform(20, 120, n_customers),
    'total_charges': np.random.uniform(100, 8000, n_customers),
    'contract_type': np.random.choice(['Month-to-month', 'One year', 'Two year'], n_customers),
    'churn': np.random.binomial(1, 0.27, n_customers)
}

df = pd.DataFrame(data)
df.to_csv('data/raw/customers.csv', index=False)
print(f"Created dataset: {len(df)} rows")

Run the script:

python data/generate_data.py
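Before training, it's worth confirming that the class balance matches the intended 27% churn rate. A minimal standalone check (the seed and sample size mirror generate_data.py, though the exact draw differs because the script consumes other random values first):

```python
import numpy as np

# Draw the churn labels the same way generate_data.py does and check
# that the observed positive rate lands near the intended p=0.27.
np.random.seed(42)
churn = np.random.binomial(1, 0.27, 5000)

rate = churn.mean()
print(f"Observed churn rate: {rate:.3f}")
assert 0.25 < rate < 0.29  # sampling noise at n=5000 is roughly ±0.01
```

If the observed rate drifted far from 0.27, it would signal a bug in the generation script before any modeling effort is spent.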

1.6 Track Data with DVC

# Add data to DVC tracking
dvc add data/raw/customers.csv
# Commit the .dvc file to Git
git add data/raw/customers.csv.dvc data/raw/.gitignore
git commit -m "Track customer data with DVC"
What This Does
DVC creates a .dvc file containing the data's hash and metadata. The actual CSV is added to .gitignore, so Git only tracks the small .dvc file, not the large dataset.
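DVC keys each data version by a content hash (MD5 by default), so any byte-level change produces a new version while identical content maps to the same one. A conceptual sketch with hypothetical CSV contents, not DVC's actual code:

```python
import hashlib

# Two hypothetical versions of a tiny CSV: one value changed
v1 = b"customer_id,tenure_months\n0,12\n"
v2 = b"customer_id,tenure_months\n0,13\n"

hash_v1 = hashlib.md5(v1).hexdigest()
hash_v2 = hashlib.md5(v2).hexdigest()

# Changed content -> new data version
assert hash_v1 != hash_v2
# Identical content -> same version (hashing is deterministic)
assert hashlib.md5(v1).hexdigest() == hash_v1
print("v1:", hash_v1)
print("v2:", hash_v2)
```

This is why the .dvc file alone is enough for Git: the hash pins an exact data version without storing the bytes.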
Common Issues
  • If MLflow won't start on port 5000, try port 5001 instead
  • On Windows, activate venv with venv\Scripts\activate
  • If DVC init fails, ensure you're in a Git repository first
Step 2: Build and Track Baseline Model (~45 min)

2.1 Create Training Script

Build a churn prediction model with MLflow auto-logging:

# src/train_baseline.py
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.preprocessing import LabelEncoder

# Set MLflow tracking URI
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("churn-prediction")

# Load data
df = pd.read_csv('data/raw/customers.csv')

# Encode categorical features
le = LabelEncoder()
df['contract_encoded'] = le.fit_transform(df['contract_type'])

# Features and target
features = ['tenure_months', 'monthly_charges', 'total_charges', 'contract_encoded']
X = df[features]
y = df['churn']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Enable autologging
mlflow.sklearn.autolog()

# Train model
with mlflow.start_run(run_name="baseline-rf"):
    model = RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        random_state=42
    )
    model.fit(X_train, y_train)

    # Predictions
    y_pred = model.predict(X_test)

    # Log custom metrics
    mlflow.log_metric("test_accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("test_precision", precision_score(y_test, y_pred))
    mlflow.log_metric("test_recall", recall_score(y_test, y_pred))

print("Model trained and logged to MLflow!")
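Since the churn classes are imbalanced (~27% positive), a plain random split can leave train and test with slightly different class ratios. Passing stratify=y to train_test_split (a standard scikit-learn option not used in the script above) preserves the ratio exactly; a small self-contained sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with the same ~27% positive rate as the churn dataset
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 270 + [0] * 730)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Stratification keeps the positive rate essentially identical in both splits
print(f"train positive rate: {y_tr.mean():.3f}")
print(f"test positive rate:  {y_te.mean():.3f}")
assert abs(y_tr.mean() - 0.27) < 0.01
assert abs(y_te.mean() - 0.27) < 0.01
```

For mildly imbalanced data at n=5000 the difference is small, but stratifying costs nothing and makes precision/recall comparisons across runs more stable.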

2.2 Run Training

# Train the model
python src/train_baseline.py
Expected Output
Model trained and logged to MLflow!

✓ Check MLflow UI at http://localhost:5000
✓ You'll see experiment "churn-prediction" with 1 run

2.3 Explore MLflow UI

Open http://localhost:5000 and explore:

  1. Click on the "churn-prediction" experiment
  2. View your run named "baseline-rf"
  3. Check the Parameters: n_estimators, max_depth, random_state
  4. Check the Metrics: test_accuracy, test_precision, test_recall
  5. Download the Model artifact under "Artifacts"
Auto-logging Magic
mlflow.sklearn.autolog() automatically captures: model parameters, training metrics, model artifacts, requirements.txt, and even the training code! No manual logging needed for standard scikit-learn workflows.
Step 3: Experiment Tracking & Model Registry (~30 min)

3.1 Run Multiple Experiments

Try different hyperparameters to compare models:

# Experiment with different configs
configs = [
    {'n_estimators': 50, 'max_depth': 5},
    {'n_estimators': 100, 'max_depth': 10},
    {'n_estimators': 200, 'max_depth': 15},
]

for i, config in enumerate(configs):
    with mlflow.start_run(run_name=f"experiment-{i+1}"):
        model = RandomForestClassifier(**config, random_state=42)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        mlflow.log_metric("test_accuracy", accuracy_score(y_test, y_pred))
        print(f"Experiment {i+1} complete")
Compare in MLflow UI
Select multiple runs, click "Compare", and view a side-by-side comparison of parameters and metrics. Sort by test_accuracy to find the best model!
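The same sort-by-metric comparison the UI performs can be sketched locally with pandas (in practice, mlflow.search_runs() returns runs as a DataFrame you can sort the same way). The accuracy numbers below are made up purely for illustration:

```python
import pandas as pd

# Hypothetical results for the three configs above (illustrative numbers only)
runs = pd.DataFrame([
    {'run': 'experiment-1', 'n_estimators': 50,  'max_depth': 5,  'test_accuracy': 0.712},
    {'run': 'experiment-2', 'n_estimators': 100, 'max_depth': 10, 'test_accuracy': 0.724},
    {'run': 'experiment-3', 'n_estimators': 200, 'max_depth': 15, 'test_accuracy': 0.719},
])

# Sort descending by the metric of interest; the top row is the best run
best = runs.sort_values('test_accuracy', ascending=False).iloc[0]
print(f"Best run: {best['run']} (accuracy={best['test_accuracy']})")
assert best['run'] == 'experiment-2'
```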

3.2 Register Best Model

Promote your best model to the Model Registry:

# In MLflow UI:
# 1. Click on your best-performing run
# 2. Scroll to "Artifacts" section
# 3. Click "Register Model"
# 4. Name it "churn-predictor"
# 5. Click "Register"
# Or register programmatically:
run_id = "<your-run-id>"  # Get from MLflow UI
mlflow.register_model(
    f"runs:/{run_id}/model",
    "churn-predictor"
)

3.3 Transition Model to Production

# Transition to Staging
client = mlflow.MlflowClient()
client.transition_model_version_stage(
    name="churn-predictor",
    version=1,
    stage="Staging"
)

# After validation, promote to Production
client.transition_model_version_stage(
    name="churn-predictor",
    version=1,
    stage="Production"
)
Model Lifecycle
None → Staging → Production → Archived
This workflow enables safe deployments with approval gates and rollback capabilities.
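The approval-gate idea can be made concrete as a tiny transition table. This is a simplified illustration, not MLflow's implementation: it enforces that models pass through Staging before Production, which MLflow itself doesn't force but the workflow above recommends.

```python
# Allowed stage transitions in this simplified policy
ALLOWED = {
    'None': {'Staging', 'Archived'},
    'Staging': {'Production', 'Archived', 'None'},
    'Production': {'Archived', 'Staging'},  # Staging here acts as rollback
    'Archived': {'None'},
}

def transition(current, target):
    """Return the new stage, or raise if the policy forbids the move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"Illegal transition: {current} -> {target}")
    return target

stage = 'None'
stage = transition(stage, 'Staging')     # validate first
stage = transition(stage, 'Production')  # then promote
print(stage)
```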
Step 4: Test Reproducibility (~15 min)

4.1 Load Model from Registry

# Load production model
model_uri = "models:/churn-predictor/Production"
loaded_model = mlflow.sklearn.load_model(model_uri)
# Make predictions
sample_data = X_test.iloc[:5]
predictions = loaded_model.predict(sample_data)
print(f"Predictions: {predictions}")

4.2 Simulate Data Changes

Update your dataset and track changes with DVC:

# Modify the dataset first (e.g., edit generate_data.py to change
# n_customers or the seed; with the fixed seed 42 unchanged, rerunning
# the script reproduces identical data and DVC will see no change)
python data/generate_data.py
# Track updated data
dvc add data/raw/customers.csv
git add data/raw/customers.csv.dvc
git commit -m "Update customer dataset v2"
# Retrain with new data
python src/train_baseline.py
Reproducibility Achieved!
With DVC tracking your data and MLflow tracking your experiments, you can reproduce any model by checking out the corresponding Git commit and DVC data version.

4.3 View Lineage

Check data and model lineage:

# View DVC data versions
git log --oneline data/raw/customers.csv.dvc
# View all MLflow runs
mlflow experiments search
Full Traceability
You now have complete lineage: which code version + which data version = which model version. This is essential for debugging, compliance, and production ML systems.
Troubleshooting
  • Model not found: Ensure you registered the model first
  • DVC errors: Run dvc pull to fetch data
  • MLflow connection refused: Check MLflow server is running on port 5000
See the MLOps Troubleshooting Guide for more solutions.

Walkthrough Complete!

You've built a production MLOps pipeline with experiment tracking, data versioning, and model registry. You're ready for Part 2!

What You've Learned:

MLflow setup and experiment tracking
DVC for data version control
Auto-logging with scikit-learn
Model registry and lifecycle management
Comparing multiple experiments
Model deployment stages (Staging/Production)
Reproducibility with data + code versioning
End-to-end ML pipeline lineage tracking