
PredictFlow MLOps Project

Step-by-Step Walkthrough: Build a Production MLOps Pipeline

Total Time: ~2 hours
Difficulty: Intermediate
Tools: MLflow, DVC, scikit-learn

What You'll Build

In this walkthrough, you'll build a production-ready MLOps pipeline for PredictFlow, a customer churn prediction system. You'll learn industry-standard tools and practices:

  • Set up MLflow for experiment tracking and model registry
  • Version your data with DVC (Data Version Control)
  • Train a baseline ML model with automatic logging
  • Track experiments and compare model performance
  • Register models and manage lifecycle stages
  • Ensure reproducibility with data + code versioning

Prerequisites

Python 3.8+ installed
Git installed (for DVC)
Basic understanding of machine learning concepts
Familiarity with scikit-learn
Step 1: Set Up MLOps Environment (~30 min)

1.1 Create Project Structure

# Create project directory
mkdir predictflow-mlops
cd predictflow-mlops
# Create subdirectories
mkdir -p data/raw data/processed models notebooks src
# Initialize Git
git init

1.2 Install Dependencies

Create a virtual environment and install MLOps tools:

# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install packages
pip install mlflow==2.9.0 dvc==3.30.0 \
    scikit-learn==1.3.2 pandas==2.1.3 \
    matplotlib==3.8.2 seaborn==0.13.0
# Create requirements.txt
pip freeze > requirements.txt

1.3 Initialize DVC

Set up DVC for data versioning:

# Initialize DVC
dvc init
# Configure a local "remote" (use a directory outside the repo;
# pointing the remote at .dvc/cache itself would be circular)
dvc remote add -d local_storage /tmp/dvc-storage
# Commit DVC configuration
git add .dvc .dvcignore
git commit -m "Initialize DVC"
What is DVC?
DVC (Data Version Control) works like Git for your data and models. It tracks large files without storing them in Git, enabling dataset versioning, reproducibility, and collaboration.

1.4 Start MLflow Tracking Server

Launch MLflow UI for experiment tracking:

# Start MLflow server (in a new terminal)
mlflow server \
    --host 127.0.0.1 \
    --port 5000 \
    --backend-store-uri sqlite:///mlflow.db \
    --default-artifact-root ./mlruns

Open your browser to: http://localhost:5000

What You Should See
MLflow UI should open showing "No experiments yet." You'll create experiments in the next steps.

1.5 Create Sample Dataset

Generate a synthetic customer churn dataset:

# Create data/generate_data.py
import pandas as pd
import numpy as np

np.random.seed(42)
n_customers = 5000

data = {
    'customer_id': range(n_customers),
    'tenure_months': np.random.randint(1, 72, n_customers),
    'monthly_charges': np.random.uniform(20, 120, n_customers),
    'total_charges': np.random.uniform(100, 8000, n_customers),
    'contract_type': np.random.choice(['Month-to-month', 'One year', 'Two year'], n_customers),
    'churn': np.random.binomial(1, 0.27, n_customers)
}

df = pd.DataFrame(data)
df.to_csv('data/raw/customers.csv', index=False)
print(f"Created dataset: {len(df)} rows")

Run the script:

python data/generate_data.py
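Before training, it's worth confirming that the class balance matches the intended 27% churn rate. A minimal standalone check (the seed and sample size mirror generate_data.py, though the exact draw differs because the script consumes other random values first):

```python
import numpy as np

# Draw the churn labels the same way generate_data.py does and check
# that the observed positive rate lands near the intended p=0.27.
np.random.seed(42)
churn = np.random.binomial(1, 0.27, 5000)

rate = churn.mean()
print(f"Observed churn rate: {rate:.3f}")
assert 0.25 < rate < 0.29  # sampling noise at n=5000 is roughly ±0.01
```

If the observed rate drifted far from 0.27, it would signal a bug in the generation script before any modeling effort is spent.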

1.6 Track Data with DVC

# Add data to DVC tracking
dvc add data/raw/customers.csv
# Commit the .dvc file to Git
git add data/raw/customers.csv.dvc data/raw/.gitignore
git commit -m "Track customer data with DVC"
What This Does
DVC creates a .dvc file containing the data's hash and metadata. The actual CSV is added to .gitignore, so Git only tracks the small .dvc file, not the large dataset.
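DVC keys each data version by a content hash (MD5 by default), so any byte-level change produces a new version while identical content maps to the same one. A conceptual sketch with hypothetical CSV contents, not DVC's actual code:

```python
import hashlib

# Two hypothetical versions of a tiny CSV: one value changed
v1 = b"customer_id,tenure_months\n0,12\n"
v2 = b"customer_id,tenure_months\n0,13\n"

hash_v1 = hashlib.md5(v1).hexdigest()
hash_v2 = hashlib.md5(v2).hexdigest()

# Changed content -> new data version
assert hash_v1 != hash_v2
# Identical content -> same version (hashing is deterministic)
assert hashlib.md5(v1).hexdigest() == hash_v1
print("v1:", hash_v1)
print("v2:", hash_v2)
```

This is why the .dvc file alone is enough for Git: the hash pins an exact data version without storing the bytes.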
Common Issues
  • If MLflow won't start on port 5000, try port 5001 instead
  • On Windows, activate venv with venv\Scripts\activate
  • If DVC init fails, ensure you're in a Git repository first
Step 2: Build and Track Baseline Model (~45 min)

2.1 Create Training Script

Build a churn prediction model with MLflow auto-logging:

# src/train_baseline.py
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.preprocessing import LabelEncoder

# Set MLflow tracking URI
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("churn-prediction")

# Load data
df = pd.read_csv('data/raw/customers.csv')

# Encode categorical features
le = LabelEncoder()
df['contract_encoded'] = le.fit_transform(df['contract_type'])

# Features and target
features = ['tenure_months', 'monthly_charges', 'total_charges', 'contract_encoded']
X = df[features]
y = df['churn']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Enable autologging
mlflow.sklearn.autolog()

# Train model
with mlflow.start_run(run_name="baseline-rf"):
    model = RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        random_state=42
    )
    model.fit(X_train, y_train)

    # Predictions
    y_pred = model.predict(X_test)

    # Log custom metrics
    mlflow.log_metric("test_accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("test_precision", precision_score(y_test, y_pred))
    mlflow.log_metric("test_recall", recall_score(y_test, y_pred))

print("Model trained and logged to MLflow!")
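Since the churn classes are imbalanced (~27% positive), a plain random split can leave train and test with slightly different class ratios. Passing stratify=y to train_test_split (a standard scikit-learn option not used in the script above) preserves the ratio exactly; a small self-contained sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with the same ~27% positive rate as the churn dataset
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 270 + [0] * 730)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Stratification keeps the positive rate essentially identical in both splits
print(f"train positive rate: {y_tr.mean():.3f}")
print(f"test positive rate:  {y_te.mean():.3f}")
assert abs(y_tr.mean() - 0.27) < 0.01
assert abs(y_te.mean() - 0.27) < 0.01
```

For mildly imbalanced data at n=5000 the difference is small, but stratifying costs nothing and makes precision/recall comparisons across runs more stable.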

2.2 Run Training

# Train the model
python src/train_baseline.py
Expected Output
Model trained and logged to MLflow!

✓ Check MLflow UI at http://localhost:5000
✓ You'll see experiment "churn-prediction" with 1 run

2.3 Explore MLflow UI

Open http://localhost:5000 and explore:

  1. Click on the "churn-prediction" experiment
  2. View your run named "baseline-rf"
  3. Check the Parameters: n_estimators, max_depth, random_state
  4. Check the Metrics: test_accuracy, test_precision, test_recall
  5. Download the Model artifact under "Artifacts"
Auto-logging Magic
mlflow.sklearn.autolog() automatically captures: model parameters, training metrics, model artifacts, requirements.txt, and even the training code! No manual logging needed for standard scikit-learn workflows.
Step 3: Experiment Tracking & Model Registry (~30 min)

3.1 Run Multiple Experiments

Try different hyperparameters to compare models:

# Experiment with different configs
configs = [
    {'n_estimators': 50, 'max_depth': 5},
    {'n_estimators': 100, 'max_depth': 10},
    {'n_estimators': 200, 'max_depth': 15},
]

for i, config in enumerate(configs):
    with mlflow.start_run(run_name=f"experiment-{i+1}"):
        model = RandomForestClassifier(**config, random_state=42)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        mlflow.log_metric("test_accuracy", accuracy_score(y_test, y_pred))
        print(f"Experiment {i+1} complete")
Compare in MLflow UI
Select multiple runs, click "Compare", and view a side-by-side comparison of parameters and metrics. Sort by test_accuracy to find the best model!
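The same sort-by-metric comparison the UI performs can be sketched locally with pandas (in practice, mlflow.search_runs() returns runs as a DataFrame you can sort the same way). The accuracy numbers below are made up purely for illustration:

```python
import pandas as pd

# Hypothetical results for the three configs above (illustrative numbers only)
runs = pd.DataFrame([
    {'run': 'experiment-1', 'n_estimators': 50,  'max_depth': 5,  'test_accuracy': 0.712},
    {'run': 'experiment-2', 'n_estimators': 100, 'max_depth': 10, 'test_accuracy': 0.724},
    {'run': 'experiment-3', 'n_estimators': 200, 'max_depth': 15, 'test_accuracy': 0.719},
])

# Sort descending by the metric of interest; the top row is the best run
best = runs.sort_values('test_accuracy', ascending=False).iloc[0]
print(f"Best run: {best['run']} (accuracy={best['test_accuracy']})")
assert best['run'] == 'experiment-2'
```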

3.2 Register Best Model

Promote your best model to the Model Registry:

# In MLflow UI:
# 1. Click on your best-performing run
# 2. Scroll to "Artifacts" section
# 3. Click "Register Model"
# 4. Name it "churn-predictor"
# 5. Click "Register"
# Or register programmatically:
run_id = "<your-run-id>"  # Get from MLflow UI
mlflow.register_model(
    f"runs:/{run_id}/model",
    "churn-predictor"
)

3.3 Transition Model to Production

# Transition to Staging
client = mlflow.MlflowClient()
client.transition_model_version_stage(
    name="churn-predictor",
    version=1,
    stage="Staging"
)

# After validation, promote to Production
client.transition_model_version_stage(
    name="churn-predictor",
    version=1,
    stage="Production"
)
Model Lifecycle
None → Staging → Production → Archived
This workflow enables safe deployments with approval gates and rollback capabilities.
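The approval-gate idea can be made concrete as a tiny transition table. This is a simplified illustration, not MLflow's implementation: it enforces that models pass through Staging before Production, which MLflow itself doesn't force but the workflow above recommends.

```python
# Allowed stage transitions in this simplified policy
ALLOWED = {
    'None': {'Staging', 'Archived'},
    'Staging': {'Production', 'Archived', 'None'},
    'Production': {'Archived', 'Staging'},  # Staging here acts as rollback
    'Archived': {'None'},
}

def transition(current, target):
    """Return the new stage, or raise if the policy forbids the move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"Illegal transition: {current} -> {target}")
    return target

stage = 'None'
stage = transition(stage, 'Staging')     # validate first
stage = transition(stage, 'Production')  # then promote
print(stage)
```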
Step 4: Test Reproducibility (~15 min)

4.1 Load Model from Registry

# Load production model
model_uri = "models:/churn-predictor/Production"
loaded_model = mlflow.sklearn.load_model(model_uri)
# Make predictions
sample_data = X_test.iloc[:5]
predictions = loaded_model.predict(sample_data)
print(f"Predictions: {predictions}")

4.2 Simulate Data Changes

Update your dataset and track changes with DVC:

# Modify the dataset first (e.g., edit generate_data.py to change
# n_customers or the seed; with the fixed seed 42 unchanged, rerunning
# the script reproduces identical data and DVC will see no change)
python data/generate_data.py
# Track updated data
dvc add data/raw/customers.csv
git add data/raw/customers.csv.dvc
git commit -m "Update customer dataset v2"
# Retrain with new data
python src/train_baseline.py
Reproducibility Achieved!
With DVC tracking your data and MLflow tracking your experiments, you can reproduce any model by checking out the corresponding Git commit and DVC data version.

4.3 View Lineage

Check data and model lineage:

# View DVC data versions
git log --oneline data/raw/customers.csv.dvc
# View all MLflow runs
mlflow experiments search
Full Traceability
You now have complete lineage: which code version + which data version = which model version. This is essential for debugging, compliance, and production ML systems.
Troubleshooting
  • Model not found: Ensure you registered the model first
  • DVC errors: Run dvc pull to fetch data
  • MLflow connection refused: Check MLflow server is running on port 5000
See the MLOps Troubleshooting Guide for more solutions.

Walkthrough Complete!

You've built a production MLOps pipeline with experiment tracking, data versioning, and model registry. You're ready for Part 2!

What You've Learned:

MLflow setup and experiment tracking
DVC for data version control
Auto-logging with scikit-learn
Model registry and lifecycle management
Comparing multiple experiments
Model deployment stages (Staging/Production)
Reproducibility with data + code versioning
End-to-end ML pipeline lineage tracking