ai-de.net/Projects/P07 · PredictFlow — production MLOps platform with Feast + BentoML

Last updated 2026-05-22 · By AI-DE Engineering Team

EXPERT-tier · PRO unlocks Modules 01-02 · AI & vectors track · P07

Build a production MLOps platform with Feast + BentoML

Ship a real MLOps platform with MLflow + DVC reproducibility, a Feast feature store (offline + Redis online with PIT correctness), BentoML on Kubernetes with HPA + canary rollouts, Evidently drift detection with cron-gated retraining, and Prometheus + Grafana + AlertManager observability. Modules 01-02 unlock with PRO; the full platform with EXPERT.

Timeline: 30-40 hours
Difficulty: Senior+
Stack: MLflow · Feast · BentoML · K8s · Evidently · Claude


The MLOps system-design portfolio piece for staff ML-platform roles — 5 committed ADRs, a runnable cost-model CSV, a real K8s deploy with canary + HPA, and a drift-detection runbook you can defend in an architecture review.

By the end you will have wired

MLflow tracking + DVC versioning + a registered churn model with Staging stage transitions
Feast feature store with Redis online (P99 < 5ms) + Parquet offline + PIT-correct training joins
BentoML service on K8s with HPA, canary 10→30→50→100% promotion, and an 8-stage GitHub Actions pipeline
Evidently drift detection (PSI + KS) + Prometheus alerts + AlertManager → Slack + drift-aware retraining cron
Part-5 RAG layer: Anthropic Claude calls reading live features from Redis, plus a Kafka-driven freshness signal
5 ADRs (one Deprecated documenting the event-driven retraining reversal) committed alongside the code

PREREQ · SENIOR+ · Built for engineers shipping ML in production. Comfortable with Python (pandas, scikit-learn), Docker, and Git, plus at least one of: Kubernetes basics, model deployment, or real production observability. Not a “what is MLOps” course.

predictflow.platform · 5 modules · 4 tenants · churn-predictor:v3 · drift watch armed

RLS + audit ✓

Foundation · Tracking + versioning (see ADR-003)

MLflow Registry: churn-predictor · Staging → Prod
DVC + S3: content-addressable hash
scikit-learn · XGBoost: grid search · 5 model types
Reproducibility: seed + Git SHA + DVC hash

Feature store · Online/offline parity (see ADR-001 + ADR-002)

Feast registry: Postgres · features + lineage
Redis online: P99 < 5ms · materialized
Parquet offline: PIT-correct training joins
Kafka sync + backfill: streaming + idempotent

Serving (K8s) · BentoML + canary (see ADR-004)

BentoML · /predict: Pydantic · adaptive batching
K8s Deployment + HPA: min 3 · max 10 · 70% CPU
canary 10→30→50→100%: metric-gated promote
.github/workflows: 8-stage · accuracy gate ≥ 0.80

Observe + retrain · Cron + gate (see ADR-005, Deprecated)

Evidently: PSI · KS · ClassificationPreset
Prometheus + Grafana: 3 dashboards · cost panel
AlertMgr → Slack: #ml-drift · severity-gated
retrain CronJob: drift-aware · human-gate promote

# Cost story — $285 → $199 (−30%)

1-yr Reserved Instances on RDS + ElastiCache + EC2 fleet

HPA min 3 → 2 + scale-in window 300s → 60s

Part-5 RAG cascade routes 70% to Claude Haiku first

→ ~$0.20 per 1k predictions at optimized load
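The per-1k figure follows directly from the monthly totals. A minimal sketch of the arithmetic, assuming the ~1M predictions/mo volume quoted in the cost-model section of this page (the function names are illustrative, not taken from the starter kit):

```python
# Monthly totals and volume are the figures quoted on this page.
BASELINE_USD = 285.0
OPTIMIZED_USD = 199.0
PREDICTIONS_PER_MONTH = 1_000_000  # "~1M predictions/mo" at 4-tenant beta load


def cost_per_1k(monthly_usd: float, predictions: int) -> float:
    """Return the platform cost in USD per 1,000 predictions."""
    return monthly_usd / predictions * 1_000


def pct_change(before: float, after: float) -> float:
    """Return the relative change from before to after, in percent."""
    return (after - before) / before * 100


if __name__ == "__main__":
    print(f"baseline:  ${cost_per_1k(BASELINE_USD, PREDICTIONS_PER_MONTH):.3f} per 1k")
    print(f"optimized: ${cost_per_1k(OPTIMIZED_USD, PREDICTIONS_PER_MONTH):.3f} per 1k")
    print(f"change:    {pct_change(BASELINE_USD, OPTIMIZED_USD):.1f}%")
```

At this volume the optimized total works out to $0.199 per 1k, which the page rounds to ~$0.20.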

# Drift + retrain — cron + manual gate

drift_detection.py emits PSI/KS to Prometheus

AlertMgr posts to Slack #ml-drift on threshold breach

Nightly cron retrains; reviewer transitions Staging → Production

→ ADR-005 documents the event-driven reversal
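For readers who have not computed PSI by hand, here is a minimal stdlib-only sketch of the metric. It is not the course's drift_detection.py (which uses Evidently); the bin count, the epsilon for empty bins, and the 0.2 alert threshold mentioned afterwards are assumptions for illustration.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a reference and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference sample

    def proportions(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        eps = 1e-6  # floor for empty bins so log() stays defined
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A value of 0 means the binned distributions match. A common rule of thumb treats PSI above roughly 0.2 as meaningful drift, but the right threshold for an alert is workload-specific.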

P99 < 5ms · feature lookup latency
5 ADRs · committed in starter kit
−30% · cost vs baseline · $285 → $199

Curriculum · 5 modules · 30-40 hours

Modules 01-02 unlock with PRO. Modules 03-05 with EXPERT.

Modules 01-02 (~13h) ship a working tracked-and-versioned churn model behind a Feast feature store — included with PRO. Modules 03-05 (~20h additional) layer on the production CI/CD, monitoring, and AI-integration story and unlock with EXPERT.

P07 · 5 modules · 30-40 hours · ~105 lessons


M01 · Foundation: ML Experimentation & Tracking

MLflow tracking server + Model Registry, DVC data versioning with S3 remote, baseline churn model, hyperparameter tuning, and the reproducibility story (random seeds + Git + DVC content hash).

6h · 21 lessons · PRO TIER
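The reproducibility story in M01 comes down to pinning three inputs per run: the seed, the code version, and the data version. A hedged sketch of collecting them follows; the helper names are hypothetical, and MD5 is used here on the assumption that it matches the content hash your DVC version records.

```python
import hashlib
import random
import subprocess
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Content hash of a file, read in chunks so large datasets stay out of memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def git_sha() -> str:
    """Current commit SHA, or 'unknown' outside a Git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def run_fingerprint(seed: int, data_path: Path) -> dict:
    """Bundle the three inputs that pin a training run."""
    random.seed(seed)  # seed NumPy / XGBoost the same way in a real training script
    return {"seed": seed, "git_sha": git_sha(), "data_md5": file_md5(data_path)}
```

In a tracked run, a dictionary like this could be attached to the MLflow run as tags so any registered model version can be traced back to its exact inputs.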

M02 · Feature Store: Feast Offline/Online + Registry

Feast with Postgres registry, Redis online store (P99 < 5ms), Parquet offline, PIT-correct training joins, feature lineage, Kafka-driven streaming sync, and the backfill engine.

7h · 22 lessons · PRO TIER
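An idempotent backfill means replaying the same events leaves the online store unchanged. The sketch below illustrates one way to get that property, using an in-memory dict as a stand-in for Redis; the row shape and the txn_count_30d feature are hypothetical, not the starter kit's schema.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FeatureRow:
    customer_id: int
    event_ts: int        # event time, epoch seconds
    txn_count_30d: int   # hypothetical feature value


def upsert(store: dict[int, FeatureRow], row: FeatureRow) -> bool:
    """Write row unless the store already holds an equally new or newer one."""
    current = store.get(row.customer_id)
    if current is not None and current.event_ts >= row.event_ts:
        return False  # replayed or out-of-order event: skip
    store[row.customer_id] = row
    return True


def backfill(store: dict[int, FeatureRow], rows: Iterable[FeatureRow]) -> int:
    """Apply rows and return how many writes actually changed the store."""
    return sum(upsert(store, r) for r in rows)
```

Running backfill twice over the same rows reports zero writes on the second pass, which is the behaviour a Kafka replay or a re-run backfill job relies on.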

M03 · CI/CD & Model Serving: BentoML on Kubernetes

BentoML service with Pydantic validation, Docker multi-stage build, K8s Deployment + Service + Ingress + HPA, canary deployment with 10→30→50→100% promotion gates, and an 8-stage GitHub Actions pipeline with an accuracy gate.

8h · 24 lessons · EXPERT TIER
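The 10→30→50→100% promotion can be read as a small state machine: advance one step when the canary's metrics pass the gate, roll back otherwise. This is a sketch under assumed thresholds; the error-rate and latency limits are placeholders, not values from the course.

```python
from dataclasses import dataclass

CANARY_STEPS = (10, 30, 50, 100)  # percent of traffic, matching the rollout above


@dataclass
class CanaryMetrics:
    error_rate: float      # fraction of failed requests on the canary
    p99_latency_ms: float  # canary P99 latency in milliseconds


def gate_passes(m: CanaryMetrics, max_error_rate: float = 0.01,
                max_p99_ms: float = 100.0) -> bool:
    """Hypothetical gate; both thresholds are placeholders."""
    return m.error_rate <= max_error_rate and m.p99_latency_ms <= max_p99_ms


def next_weight(current: int, m: CanaryMetrics) -> int:
    """Return the next traffic weight, or 0 to signal a rollback."""
    if not gate_passes(m):
        return 0
    idx = CANARY_STEPS.index(current)
    return CANARY_STEPS[min(idx + 1, len(CANARY_STEPS) - 1)]
```

Returning 0 on a failed gate keeps the rollback decision explicit, so the pipeline stage that calls this can route all traffic back to the stable version.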

M04 · Monitoring & Auto-Retrain: Evidently + Prometheus

Evidently drift detection (PSI + KS + ClassificationPreset), Prometheus + Grafana panels, AlertManager → Slack routing, S3 prediction logging in Parquet, and a cron-gated retraining job (the human-in-the-loop pattern documented in ADR-005).

7h · 22 lessons · EXPERT TIER

M05 · AI Integration: RAG + Agents on the Feature Store

Anthropic Claude API calls reading live features from Redis, a Kafka-driven freshness signal for stale feature rejection, pgvector semantic-search features, and a feature-aware agent with Claude tool-use bound to Feast lookups.

5h · 16 lessons · EXPERT TIER

Modules 01-02 with PRO ($29/mo) · Modules 03-05 with EXPERT ($79/mo)


Backed by curriculum

MLOps for Data Engineers

7 modules · 16 hours · MLflow · Feast · BentoML · Drift detection

The MLOps curriculum is the foundation for this project, not a sales add-on. EXPERT subscribers get full access to all modules.

The build, in 3 phases

Foundation. Production. Scale.

Each phase ends with a tagged release, a passing red-team / load suite, and a runbook drill. No ambiguity about where you are.

01 · ~13h · Foundation (Modules 01-02)

Tracked + versioned churn model behind a Feast feature store. MLflow Registry has the Staging model; Redis is serving online lookups in <5ms.

✓ MLflow Registry · churn-predictor · Staging
✓ Feast online store (Redis) + offline (Parquet) parity
✓ Kafka-driven feature sync + backfill engine

02 · ~15h · Production (Modules 03-04)

BentoML on K8s with canary + HPA + GH Actions CI/CD, plus Evidently drift + Prometheus + Grafana + Slack + cron retrain.

✓ BentoML service · canary 10→30→50→100% promotion
✓ Evidently drift detector + 3 Grafana dashboards
✓ Cron retraining job + ADR-005 cron + gate runbook

03 · ~5h · Scale (Module 05)

RAG + agents reading live features. Anthropic Claude calls with Feast tool-use; freshness signal blocks stale predictions.

✓ Claude RAG with feature retrieval
✓ Kafka freshness signal + stale-rejection
✓ feature-aware agent · pgvector + access-control layer
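Stale-rejection can be as simple as comparing the feature timestamp against a freshness budget before the prediction or the Claude call runs. A minimal sketch follows, assuming a 15-minute budget and a hypothetical StaleFeatureError; neither is taken from the course code.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_AGE = timedelta(minutes=15)  # assumed freshness budget, not a course value


class StaleFeatureError(Exception):
    """Raised when features are too old to serve a prediction."""


def check_freshness(feature_ts: datetime, now: Optional[datetime] = None,
                    max_age: timedelta = MAX_AGE) -> timedelta:
    """Return the feature age, or raise if it exceeds max_age."""
    now = now or datetime.now(timezone.utc)
    age = now - feature_ts
    if age > max_age:
        raise StaleFeatureError(f"features are {age} old (limit {max_age})")
    return age
```

Passing now explicitly keeps the check deterministic in tests; in a service the caller would typically catch StaleFeatureError and return a fallback or an explicit error instead of a prediction built on old features.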

Project setup · 15 minutes

One command. Local MLflow + Feast + Redis + BentoML + Postgres + Prometheus.

What lives in the repo

You get the real platform on day one — MLflow + DVC for tracking, Feast for the feature store, Redis for online + Postgres for registry, BentoML for serving, Evidently for drift detection, and Prometheus + Grafana + AlertManager for the dashboards.

src/ + features/ — MLflow training scripts, Feast entity + feature view definitions
feature_store/ — registry.py + sync.py + backfill.py (Kafka → Redis materialization)
service.py + bentofile.yaml — BentoML service with Pydantic validation, packaged for Docker
k8s/ + .github/workflows/ — Deployment + Service + Ingress + HPA + canary YAML; 8-stage CI/CD
monitoring/ + prometheus/ + grafana/ — Evidently drift, Prometheus alerts, Grafana dashboards, Slack
docs/adr/ + docs/cost-model/ — 5 ADRs (one Deprecated) + the runnable cost-model CSV

Download · Starter Kit · 88 files · 483 KB

PredictFlow Feature Store Starter Kit

Pre-built MLOps platform: 5 tutorial modules of source, Docker compose, Feast configs, K8s manifests, GH Actions workflows, Prometheus + Grafana configs. Now bundled: 5 committed ADR markdown files (docs/adr/) and the runnable cost-model CSV (docs/cost-model/) — unzip and read them straight from the repo.

EXPERT project · 88 files · ADRs + cost model bundled · last updated 2026-05-09

~/projects/predictflow-feature-store — zsh

1. Unzip and start the stack

$ unzip predictflow-feature-store-starter.zip

$ cd predictflow-feature-store && cp .env.example .env

$ docker compose up -d # Postgres + Redis + MLflow + Prometheus + Grafana

2. Train + register a baseline model

$ python src/run_experiments.py # MLflow autolog

$ python src/register_model.py # promote to Staging

3. Materialize features + query Feast online store

$ feast apply

$ feast materialize $(date -v-30d +%Y-%m-%d) $(date +%Y-%m-%d) # BSD/macOS date; on GNU/Linux use $(date -d '30 days ago' +%Y-%m-%d)

$ python test_online_features.py # P99 < 5ms

4. Build + serve the model with BentoML

$ bentoml build

$ bentoml serve service:ChurnPredictor --port 3000

$ curl -H 'Content-Type: application/json' localhost:3000/predict -d '{"customer_id": 12345, ...}'
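Step 3 checks the online store against the P99 < 5ms target. The contents of test_online_features.py are not shown on this page, so the following is only a sketch of how such a check could be measured: a nearest-rank percentile over timed calls, with lookup standing in for whatever function performs the Feast online read.

```python
import math
import time
from typing import Callable, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_ms(lookup: Callable[[], object], n: int = 1000) -> list[float]:
    """Time n calls of lookup and return per-call latency in milliseconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        lookup()  # hypothetical stand-in for an online feature read
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies
```

A check in this style would then compare percentile(measure_ms(lookup), 99) against the 5 ms target. Results from a local Docker Redis will not match a networked ElastiCache deployment, so the number is best treated as a regression signal rather than a production guarantee.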

10k customer rows · 10k transactions · behavioral events · predictions / mo

Production hardening

The same churn model — but built for the production case.

Most ML tutorials show you a notebook hitting a CSV. This shows what changes when 4 tenants share a feature store, on-call owns the drift dashboard, and finance asks for the cost-per-prediction.

Area | Notebook ML (what most teams ship) | Your production MLOps (Modules 01–05)
Reproducibility | Cannot reproduce training runs | MLflow tracking + DVC versioned data + PIT-correct features
Features | Ad-hoc engineering per notebook | Feast offline (Parquet) + online (Redis P99 < 5ms) with shared registry
Deployment | Manual model export + handoff | BentoML CI/CD with canary 10→30→50→100% on K8s + HPA
Monitoring | No post-deployment tracking | Evidently drift (PSI + KS) + Prometheus alerts + AlertMgr → Slack
Retraining | Quarterly manual retrain | Drift-aware cron + human-gate promote (ADR-005 documents the reversal)
Cost | Whatever the bill says next month | Per-tenant token + serving attribution + Grafana cost panel + CSV cost model

EXPERT-only · architecture decision records

Write the ADRs staff engineers actually get judged on.

Five ADRs ship inside the starter-kit zip at docs/adr/, one per major decision in the build, including a real Deprecated ADR documenting the “event-driven auto-retraining” reversal. The kind of doc that travels with you to your next role.

ADR-001 · Accepted
Use Feast over Tecton / Hopsworks / DIY for the feature-store layer

Context: Open-source orchestration with online + offline + registry pluggability
Decision: feature_store.yaml + Python FeatureView decorators
Tradeoff: BYO streaming + lineage UI vs full vendor independence
Reversal: Tecton / Hopsworks swap is ~2-3 engineer-weeks behind the registry

ADR-002 · Accepted
Online store is Redis, not DynamoDB or Postgres direct

Context: <10ms P99 budget for feature lookup on the request path
Decision: Feast online_store.type: redis · ElastiCache prod · local Docker dev
Tradeoff: Memory cap + cache-miss handling vs sub-5ms in-memory gets
Reversal: DynamoDB or Postgres direct is one YAML field + materialize re-run

ADR-003 · Accepted
Tracking + data versioning is MLflow + DVC, not W&B / Neptune

Context: Zero-cost local-dev parity + no vendor lock-in for tutorial reach
Decision: mlflow autolog + dvc add · Git-native versioning · Model Registry stages
Tradeoff: Two tools + no managed UI vs $0 self-host + open exit ramp
Reversal: W&B swap is ~1-2 engineer-weeks (autolog + Registry + DVC migration)

ADR-004 · Accepted
Model serving is BentoML, not TorchServe / Seldon / FastAPI-by-hand

Context: scikit-learn / XGBoost first-class + zero-handroll Docker build
Decision: bentoml.service + bentoml containerize · K8s plain Deployment
Tradeoff: Bento store mental model vs single-command image build + Pydantic + adaptive batching
Reversal: FastAPI swap is ~3-5 engineer-days behind the service.py contract

ADR-005 · Deprecated
Event-driven auto-retraining via drift hook

Context: Original framing: drift_detection.py emits events, retraining auto-promotes
Decision: Reverted in M04: moved to cron + manual promotion gate (Slack alert → reviewer → Staging → human promote)
Why reversed: False-positive drift triggers + no human-in-the-loop = no accountability
Replaced by: Drift-aware retraining cron + manual MLflow stage transition

EXPERT-only · cost model

Read the FinOps story for the platform you actually ship.

Module 04 ships a runnable cost-model CSV inside the starter-kit zip at docs/cost-model/. It models a 4-tenant beta load (~1M predictions/mo) using real AWS RDS + EC2 + ElastiCache + Anthropic list prices, with the model-cascade and reserved-instance levers wired up.

Component | Detail | Baseline / mo | Optimized / mo | Delta
MLflow tracking server | EC2 t4g.medium · 1 instance · sqlite + EFS | $30 | $22 | −$8
BentoML serving · 3 replicas | EC2 t4g.medium × 3 · HPA min · /predict | $90 | $66 | −$24
Postgres (RDS) · registry + offline metadata | db.t4g.medium · 100GB gp3 | $98 | $68 | −$30
ElastiCache Redis · online store | cache.t4g.small · primary + replica | $54 | $36 | −$18
Anthropic Claude · Part 5 RAG | ~5M tok/mo · Haiku 70% / Sonnet 30% | $10 | | −$6
S3 + egress + Grafana free tier | snapshots + DVC + prediction logs · 50GB free observability | | | —
Total · 4 tenants | ~$0.30 per 1k predictions at baseline | $285 | $199 | −$86 (−30%)

Optimization levers

1-yr Reserved Instances

Commit to 12-month reserved capacity on RDS + ElastiCache + the 4 EC2 instances once load is stable for 30 days. Standard ~30% off list.

−$80 / mo

HPA tightening + faster scale-in

Drop HPA min from 3 to 2 replicas and shorten the scale-in window from 300s to 60s. The cost-of-capacity vs P99-latency tradeoff is modeled in the Module 04 cost dashboard.

−$22 / mo
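To see why the minimum replica count drives the idle bill, it helps to recall how the Kubernetes HPA sizes a Deployment: the desired count is the current count scaled by the ratio of observed to target utilization, rounded up, then clamped to the configured bounds. A sketch using the 70% CPU target and min/max values quoted on this page; it ignores the HPA's tolerance band and stabilization window.

```python
import math


def desired_replicas(current: int, cpu_utilization: float, target: float = 70.0,
                     min_replicas: int = 3, max_replicas: int = 10) -> int:
    """Core HPA scaling rule, clamped to [min_replicas, max_replicas]."""
    raw = math.ceil(current * cpu_utilization / target)
    return max(min_replicas, min(max_replicas, raw))
```

At 20% CPU the formula asks for a single replica, so the floor decides the outcome: desired_replicas(3, 20.0) stays at 3, while desired_replicas(3, 20.0, min_replicas=2) drops to 2.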

Model cascade (Haiku → Sonnet)

In Part 5 RAG flow, route 70% of requests to Claude Haiku first; escalate to Sonnet only when retrieval confidence < 0.7.

−$5 / mo
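The cascade lever reduces to one routing decision per request. The sketch below assumes the retrieval confidence is known before the model call and uses plain callables as stand-ins for the two Claude clients; the Answer type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

CONFIDENCE_FLOOR = 0.7  # escalation threshold named in the lever above


@dataclass
class Answer:
    text: str
    model: str


def cascade(prompt: str, retrieval_confidence: float,
            call_haiku: Callable[[str], str],
            call_sonnet: Callable[[str], str]) -> Answer:
    """Use the cheaper model unless retrieval confidence falls below the floor."""
    if retrieval_confidence >= CONFIDENCE_FLOOR:
        return Answer(call_haiku(prompt), "haiku")
    return Answer(call_sonnet(prompt), "sonnet")
```

Recording which model answered, as the model field does here, is what makes a per-tenant cost panel possible later.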

EXPERT benefit · cohort beta

Async architecture review with a staff-level reviewer (cohort beta).

Submit your repo, your ADR draft, or your drift-detection rollout plan. A staff or principal-level reviewer who has shipped this exact stack at scale responds within 7 days with line-by-line comments + a Loom walkthrough. Cohort capped at 12 members.

Bring a diff, an ADR draft, or a runbook.

The cohort beta runs as async architecture review — pick a reviewer by topic, send the artifact, get inline comments + a Loom walkthrough back. No back-and-forth scheduling. No 30-minute slot pressure.

Mira K.

Ex-staff · ML platform · ride-hailing scale

Feature stores at 100M+ daily lookups, online/offline parity, multi-tenant feature registries

“Send the diff. I'll go line-by-line through your Feast registry and your retriever and pick out the joins that leak.”

Daniel T.

Principal · MLOps · Series-D fintech

Drift detection in production, cron + gate retraining, model rollback patterns, on-call

“Send your worst drift report. We'll walk it backwards from the Slack alert to the feature pipeline that drifted.”

Anya S.

Eng manager · ML platform · public Series-D

Org design for ML platforms, hiring rubrics, staff-MLE interview prep, scope-of-work for ML teams

“If you're prepping for staff-MLE promo, send your ADR draft. We'll work backwards from the rubric.”

Format: Async
Turnaround: 7 days
Cohort: 12 members
Scope: ADR + arch review

What your tier unlocks

PRO unlocks Modules 01-02. EXPERT unlocks the full platform.

PRO is the entry point — Modules 01-02 plus the rest of the PRO catalog. EXPERT unlocks Modules 03-05 of this build, the 5 ADRs, the cost-model CSV, and the cohort-beta async review.

What you get · FREE · PRO · EXPERT

Modules 01-02 of P07 (MLflow tracking + Feast feature store, ~13h): — · Included
Modules 03-05 of P07 (BentoML on K8s + Drift + RAG, ~20h): — · Included
5 committed ADRs + cost-model CSV (Starter kit docs/adr/ + docs/cost-model/): — · Included
PRO project catalog (Production-grade builds): All current · All current + this one
Curriculum (All 7 tracks): Phase 1 only · All · All + bonus modules
Code review (Senior+ reviewers): — · 4 / month · Unlimited
Cohort-beta architecture review (Async · 7-day turnaround · 12-member cap): — · Included
Certificate (Verifiable on LinkedIn): — · Yes · Yes + LinkedIn rec

$79/mo · billed monthly · open enrollment · cancel anytime

or annual: $699/yr (save 26%)

Who this is for

Pick this if you own the drift dashboard, not just a model.

Senior ML engineers

You've shipped models. Now you own the feature store, the drift dashboard, the canary promote gate, and the architecture review with platform.

MLOps engineers

You absorb new models without absorbing new vendors. Feast, Redis, BentoML, Prometheus — tools your platform team already operates.

Engineering managers · ML platform

You need a reference architecture for the feature-store + monitoring questions your CTO will ask before the ML team gets headcount approval.

Founding engineers · ML startups

Your investors will ask about training-serving parity and unit economics before they ask about scale. The 5 ADRs + cost model is the answer.

Related curriculum

Going deeper? Four tracks back this project.

The MLOps curriculum is the foundation. These four tracks let you go deeper on the parts that matter most for your role.

FAQ · EXPERT tier

Quick answers.

How is this different from PRO?

Modules 01-02 (MLflow + DVC + Feast offline/online) are included with PRO at $29/mo. The rest of the platform — Modules 03-05 (BentoML on K8s + drift + RAG layer), the 5 committed ADRs, the runnable cost-model CSV, and the cohort-beta async architecture review — unlocks with EXPERT at $79/mo. PRO gets you the foundation; EXPERT gets you the system you'd defend in an architecture review.

Why Feast over Tecton or building it ourselves?

ADR-001 lays out the full tradeoff. Short version: Tecton is best-in-class managed but blocks tutorial reproducibility (cloud-only). DIY costs months and reinvents the PIT-correctness bug class. Feast is the open-source middle that wins on local-dev parity + zero vendor lock-in. The reversal plan is ~2-3 engineer-weeks if you outgrow it.
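The PIT-correctness bug class is easy to show in miniature: a training row must only see feature values that existed at its label timestamp, never later ones. A stdlib-only sketch of that lookup follows. Feast performs the equivalent join when building training sets, so this is an illustration of the rule, not Feast's implementation.

```python
from bisect import bisect_right
from typing import Optional


def point_in_time_value(feature_history: list[tuple[int, float]],
                        label_ts: int) -> Optional[float]:
    """Latest feature value with timestamp <= label_ts, or None if none exists.

    feature_history holds (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_right(timestamps, label_ts)
    return feature_history[idx - 1][1] if idx else None
```

A naive join that takes the most recent value regardless of the label time would return the last entry in the history for every row, which leaks future information into training.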

Does the auto-retraining actually fire on drift, or do I have to trigger it myself?

ADR-005 (Deprecated) is the honest answer. The original framing was 'auto-trigger on drift'; we reverted in Module 04 to a scheduled CronJob + manual promotion gate after false-positive drift signals caused weekend retrain incidents. The current pattern: drift_detection.py emits a Prometheus metric, and AlertManager posts a Slack alert; the nightly retraining cron writes a new MLflow Staging version; a human reviewer transitions Staging → Production. This is what production teams actually run.

Do I need a GPU?

No. The churn-prediction workload is scikit-learn + XGBoost (CPU-only). Module 05's RAG flow uses Anthropic Claude API calls — also no local GPU needed. The cost-model CSV is built around CPU EC2 instances; total monthly cost at 4-tenant beta load is ~$285 baseline / $199 optimized.

How long until I can finish this project?

30-40 hours of focused work across 5 modules. Most learners spread it across 5-7 weeks alongside a day job. Modules 01-02 alone (~13 hours) get you a tracked + versioned model behind a working feature store — included with PRO at $29/mo.

Is this enough to interview for staff ML platform roles?

It's a strong forcing function. Staff ML platform interviews lean heavily on system design (feature stores, monitoring, deployment, cost) and on having opinions backed by real tradeoffs. The 5 ADRs you'll commit (one Deprecated with receipts) are exactly the artifacts a panel asks about. Pair with the cohort-beta review on your final repo and you have a portfolio.

Related projects

Paired with this project

P24 · PAID · streaming

Real-time fraud feature store

Feast + Kafka + Spark Streaming spine for a fraud model: 22 features, p99 < 10ms, Schema Registry + Avro, Helm/K8s.

Ready to ship a real MLOps platform?

Start with PRO ($29/mo) for Modules 01-02 — MLflow + DVC tracking and the Feast feature store. Or unlock the full 5-module platform plus 5 ADRs, the cost-model CSV, and cohort-beta architecture review with EXPERT ($79/mo).



Related curriculum tracks: Feature Stores for ML · MLOps for Data Engineers · Data Observability & Quality · Python for Data Engineers
