ai-de.net/Projects/P07 · PredictFlow — production MLOps platform with Feast + BentoML
EXPERT-tier · PRO unlocks Modules 01-02 · AI & vectors track · P07

Build a production MLOps platform with Feast + BentoML

Ship a real MLOps platform with MLflow + DVC reproducibility, a Feast feature store (offline + Redis online with PIT correctness), BentoML on Kubernetes with HPA + canary rollouts, Evidently drift detection with cron-gated retraining, and Prometheus + Grafana + AlertManager observability. Modules 01-02 unlock with PRO; the full platform with EXPERT.

Timeline: 30-40 hours
Difficulty: Senior+
Stack: MLflow · Feast · BentoML · K8s · Evidently · Claude

The MLOps system-design portfolio piece for staff ML-platform roles — 5 committed ADRs, a runnable cost-model CSV, a real K8s deploy with canary + HPA, and a drift-detection runbook you can defend in an architecture review.

By the end you will have wired
  • MLflow tracking + DVC versioning + a registered churn model with Staging stage transitions
  • Feast feature store with Redis online (P99 < 5ms) + Parquet offline + PIT-correct training joins
  • BentoML service on K8s with HPA, canary 10→30→50→100% promotion, and an 8-stage GitHub Actions pipeline
  • Evidently drift detection (PSI + KS) + Prometheus alerts + AlertManager → Slack + drift-aware retraining cron
  • Module 05 RAG layer: Anthropic Claude calls reading live features from Redis, plus a Kafka-driven freshness signal
  • 5 ADRs (one Deprecated documenting the event-driven retraining reversal) committed alongside the code
PREREQ · SENIOR+
Built for engineers shipping ML in production. Comfortable with Python (pandas, scikit-learn), Docker, and Git, plus at least one of: Kubernetes basics, model deployment, or real production observability. Not a “what is MLOps” course.
predictflow.platform · 5 modules · 4 tenants · churn-predictor:v3 · drift watch armed · RLS + audit ✓

Foundation (tracking + versioning, see ADR-003)
  • MLflow Registry: churn-predictor · Staging → Prod
  • DVC + S3: content-addressable hash
  • scikit-learn · XGBoost: grid search · 5 model types
  • Reproducibility: seed + Git SHA + DVC hash
Feature store (online/offline parity, see ADR-001 + ADR-002)
  • Feast registry: Postgres · features + lineage
  • Redis online: P99 < 5ms · materialized
  • Parquet offline: PIT-correct training joins
  • Kafka sync + backfill: streaming + idempotent
Serving (K8s) (BentoML + canary, see ADR-004)
  • BentoML · /predict: Pydantic · adaptive batching
  • K8s Deployment + HPA: min 3 · max 10 · 70% CPU
  • canary 10→30→50→100%: metric-gated promote
  • .github/workflows: 8-stage · accuracy gate ≥ 0.80
Observe + retrain (cron + gate, see ADR-005, Deprecated)
  • Evidently: PSI · KS · ClassificationPreset
  • Prometheus + Grafana: 3 dashboards · cost panel
  • AlertMgr → Slack: #ml-drift · severity-gated
  • retrain CronJob: drift-aware · human-gate promote
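The serving column lists an HPA with min 3, max 10, and a 70% CPU target. As a rough sketch of what those numbers imply, the function below applies the standard Kubernetes scaling rule, desired = ceil(current × currentMetric / targetMetric), and clamps the result. The function name and signature are hypothetical, and the sketch deliberately ignores the HPA tolerance band and stabilization windows.

```python
import math


def desired_replicas(current: int, cpu_utilization: float,
                     target: float = 70.0, min_r: int = 3, max_r: int = 10) -> int:
    """Kubernetes HPA scaling rule, clamped to the min/max shown above.

    Simplified on purpose: no tolerance band, no stabilization window.
    """
    raw = math.ceil(current * cpu_utilization / target)
    return max(min_r, min(max_r, raw))


# 3 replicas running at 140% of requested CPU would scale to 6.
scaled_up = desired_replicas(3, 140.0)
# A quiet fleet never drops below the configured minimum.
scaled_down = desired_replicas(5, 20.0)
```

Lowering `min_r` to 2 is the same lever the cost story describes: less idle capacity at the price of less headroom when traffic spikes.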
# Cost story — $285 → $199 (−30%)
1-yr Reserved Instances on RDS + ElastiCache + EC2 fleet
HPA min 3 → 2 + scale-in window 300s → 60s
Module 05 RAG cascade routes 70% to Claude Haiku first
→ ~$0.20 per 1k predictions at optimized load
# Drift + retrain — cron + manual gate
drift_detection.py emits PSI/KS to Prometheus
AlertMgr posts to Slack #ml-drift on threshold breach
Nightly cron retrains; reviewer transitions Staging → Production
→ ADR-005 documents the event-driven reversal
P99 < 5ms: feature lookup latency
5 ADRs: committed in starter kit
−30%: cost vs baseline · $285 → $199
Curriculum · 5 modules · 30-40 hours

Modules 01-02 unlock with PRO. Modules 03-05 with EXPERT.

Modules 01-02 (~13h) ship a working tracked-and-versioned churn model behind a Feast feature store — included with PRO. Modules 03-05 (~20h additional) layer on the production CI/CD, monitoring, and AI-integration story and unlock with EXPERT.

P07 · 5 modules · 30-40 hours · ~105 lessons
M01
Foundation — ML Experimentation & Tracking
MLflow tracking server + Model Registry, DVC data versioning with S3 remote, baseline churn model, hyperparameter tuning, and the reproducibility story (random seeds + Git + DVC content hash).
6h · 21 lessons · PRO TIER
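The M01 reproducibility story ties each run to three identifiers: the random seed, the Git commit, and the DVC content hash. A minimal sketch of folding them into a single run fingerprint is shown here. `run_fingerprint` and its argument names are hypothetical and not taken from the starter kit, and the identifier values are placeholders.

```python
import hashlib
import json


def run_fingerprint(seed: int, git_sha: str, data_hash: str) -> str:
    """Return a stable SHA-256 digest over the three reproducibility inputs.

    Hypothetical helper: the starter kit may record these values differently,
    for example as separate MLflow tags.
    """
    payload = json.dumps(
        {"seed": seed, "git_sha": git_sha, "data_hash": data_hash},
        sort_keys=True,  # fixed key order keeps the digest deterministic
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Placeholder identifiers, for illustration only.
fp_a = run_fingerprint(42, "abc1234", "d41d8cd9")
fp_b = run_fingerprint(42, "abc1234", "d41d8cd9")
fp_c = run_fingerprint(43, "abc1234", "d41d8cd9")
```

Two runs with the same fingerprint used the same seed, code, and data, so a metric difference between them points at nondeterminism rather than at a changed input.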
M02
Feature Store — Feast Offline/Online + Registry
Feast with Postgres registry, Redis online store (P99 < 5ms), Parquet offline, PIT-correct training joins, feature lineage, Kafka-driven streaming sync, and the backfill engine.
7h · 22 lessons · PRO TIER
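A point-in-time (PIT) correct join means a training row may only see feature values that existed at or before the label's timestamp. The sketch below shows the idea in plain Python with a hypothetical `pit_lookup` helper and made-up data. Feast performs this join itself and also applies a TTL, which this sketch omits.

```python
from bisect import bisect_right
from datetime import datetime


def pit_lookup(feature_rows, event_time):
    """Return the latest feature value observed at or before event_time.

    feature_rows: list of (timestamp, value) tuples sorted by timestamp.
    Returns None when no feature row exists yet at event_time.
    """
    timestamps = [ts for ts, _ in feature_rows]
    idx = bisect_right(timestamps, event_time)  # number of rows with ts <= event_time
    return feature_rows[idx - 1][1] if idx else None


# Hypothetical feature history for one customer: (timestamp, avg_spend).
history = [
    (datetime(2026, 1, 1), 10.0),
    (datetime(2026, 2, 1), 25.0),
    (datetime(2026, 3, 1), 40.0),
]

# A label observed on 15 Feb must see the 1 Feb value, never the 1 Mar one.
value_at_label = pit_lookup(history, datetime(2026, 2, 15))
value_before_history = pit_lookup(history, datetime(2025, 12, 1))
```

A naive "latest value" join would return 40.0 for the February label and leak future information into training.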
M03
CI/CD & Model Serving — BentoML on Kubernetes
BentoML service with Pydantic validation, Docker multi-stage build, K8s Deployment + Service + Ingress + HPA, canary deployment with 10→30→50→100% promotion gates, and an 8-stage GitHub Actions pipeline with an accuracy gate.
8h · 24 lessons · EXPERT TIER
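The canary promotion in M03 moves traffic through 10→30→50→100% and only advances when a gate passes. One way to sketch that state machine is below. The error-rate gate and its 1% default are assumptions for illustration; the actual pipeline may gate on latency, accuracy, or other metrics.

```python
CANARY_STEPS = [10, 30, 50, 100]  # traffic percentages from the module description


def next_canary_weight(current: int, error_rate: float,
                       max_error_rate: float = 0.01) -> int:
    """Return the next canary traffic weight, or 0 to roll back.

    max_error_rate is a hypothetical gate threshold.
    """
    if error_rate > max_error_rate:
        return 0  # gate failed: send all traffic back to the stable version
    later_steps = [step for step in CANARY_STEPS if step > current]
    return later_steps[0] if later_steps else 100


# Healthy canary at 10% advances to 30%; an unhealthy one at 30% rolls back.
advance = next_canary_weight(10, error_rate=0.001)
rollback = next_canary_weight(30, error_rate=0.05)
```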
M04
Monitoring & Auto-Retrain — Evidently + Prometheus
Evidently drift detection (PSI + KS + ClassificationPreset), Prometheus + Grafana panels, AlertManager → Slack routing, S3 prediction logging in Parquet, and a cron-gated retraining job (the human-in-the-loop pattern documented in ADR-005).
7h · 22 lessons · EXPERT TIER
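PSI (Population Stability Index) compares how a feature's values are distributed across bins in a reference sample and in a current sample. The function below is a self-contained sketch of the usual formula, the sum over bins of (actual − expected) × ln(actual / expected). The equal-width binning and the `eps` floor are choices made for this example, and Evidently's own implementation may bin differently.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference and a current sample.

    Bin edges are equal-width over the reference range; eps avoids log(0)
    for empty bins.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def shares(values):
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the reference range land in an edge bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [i / 100 for i in range(100)]       # spread evenly over [0, 1)
shifted = [0.5 + i / 200 for i in range(100)]   # all mass moved to the upper half

no_drift = psi(reference, reference)
strong_drift = psi(reference, shifted)
```

Identical samples score 0, and a common rule of thumb treats values above roughly 0.25 as a significant shift, which is the kind of threshold an alert rule would encode.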
M05
AI Integration — RAG + Agents on the Feature Store
Anthropic Claude API calls reading live features from Redis, a Kafka-driven freshness signal for stale feature rejection, pgvector semantic-search features, and a feature-aware agent with Claude tool-use bound to Feast lookups.
5h · 16 lessons · EXPERT TIER
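Stale feature rejection comes down to comparing a feature's write time against a freshness budget before the model or the LLM is called. The sketch below assumes each feature payload carries an `event_timestamp` field and uses a 5-minute budget. Both are hypothetical, as are the function names; in the module the freshness signal arrives via Kafka rather than as a field on the payload.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(feature_ts: datetime, now: datetime, max_age: timedelta) -> bool:
    """True when the feature was written within max_age of now."""
    return now - feature_ts <= max_age


def reject_if_stale(features: dict, now: datetime,
                    max_age: timedelta = timedelta(minutes=5)) -> dict:
    """Return features unchanged if fresh, otherwise raise ValueError.

    The `event_timestamp` key and the 5-minute default are assumptions.
    """
    if not is_fresh(features["event_timestamp"], now, max_age):
        raise ValueError("stale features: refusing to call the model")
    return features


now = datetime(2026, 5, 9, 12, 0, tzinfo=timezone.utc)
fresh = {"event_timestamp": now - timedelta(minutes=1), "tenure_months": 14}
stale = {"event_timestamp": now - timedelta(hours=2), "tenure_months": 14}

accepted = reject_if_stale(fresh, now)
```

Failing loudly on stale input is a design choice: a rejected request is visible on a dashboard, while a prediction made on old features is not.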
Modules 01-02 with PRO ($29/mo) · Modules 03-05 with EXPERT ($79/mo)
Backed by curriculum
MLOps for Data Engineers
7 modules · 16 hours · MLflow · Feast · BentoML · Drift detection
The MLOps curriculum is the foundation for this project; it’s not a sales add-on. EXPERT subscribers get full access to all modules.
The build, in 3 phases

Foundation. Production. Scale.

Each phase ends with a tagged release, a passing red-team / load suite, and a runbook drill. No ambiguity about where you are.

01 · ~13h
Foundation (Modules 01-02)

Tracked + versioned churn model behind a Feast feature store. MLflow Registry has the Staging model; Redis is serving online lookups in <5ms.

  • MLflow Registry · churn-predictor · Staging
  • Feast online store (Redis) + offline (Parquet) parity
  • Kafka-driven feature sync + backfill engine
02 · ~15h
Production (Modules 03-04)

BentoML on K8s with canary + HPA + GH Actions CI/CD, plus Evidently drift + Prometheus + Grafana + Slack + cron retrain.

  • BentoML service · canary 10→30→50→100% promotion
  • Evidently drift detector + 3 Grafana dashboards
  • Cron retraining job + ADR-005 cron + gate runbook
03 · ~5h
Scale (Module 05)

RAG + agents reading live features. Anthropic Claude calls with Feast tool-use; freshness signal blocks stale predictions.

  • Claude RAG with feature retrieval
  • Kafka freshness signal + stale-rejection
  • feature-aware agent · pgvector + access-control layer
Project setup · 15 minutes

One command. Local MLflow + Feast + Redis + BentoML + Postgres + Prometheus.

What lives in the repo

You get the real platform on day one — MLflow + DVC for tracking, Feast for the feature store, Redis for online + Postgres for registry, BentoML for serving, Evidently for drift detection, and Prometheus + Grafana + AlertManager for the dashboards.

  • src/ + features/ — MLflow training scripts, Feast entity + feature view definitions
  • feature_store/ — registry.py + sync.py + backfill.py (Kafka → Redis materialization)
  • service.py + bentofile.yaml — BentoML service with Pydantic validation, packaged for Docker
  • k8s/ + .github/workflows/ — Deployment + Service + Ingress + HPA + canary YAML; 8-stage CI/CD
  • monitoring/ + prometheus/ + grafana/ — Evidently drift, Prometheus alerts, Grafana dashboards, Slack
  • docs/adr/ + docs/cost-model/ — 5 ADRs (one Deprecated) + the runnable cost-model CSV
Download · Starter Kit · 88 files · 483 KB

PredictFlow Feature Store Starter Kit

Pre-built MLOps platform: 5 tutorial modules of source, Docker compose, Feast configs, K8s manifests, GH Actions workflows, Prometheus + Grafana configs. Now bundled: 5 committed ADR markdown files (docs/adr/) and the runnable cost-model CSV (docs/cost-model/) — unzip and read them straight from the repo.

EXPERT project · 88 files · ADRs + cost model bundled · last updated 2026-05-09
~/projects/predictflow-feature-store — zsh
1. Unzip and start the stack
$ unzip predictflow-feature-store-starter.zip
$ cd predictflow-feature-store && cp .env.example .env
$ docker compose up -d # Postgres + Redis + MLflow + Prometheus + Grafana
2. Train + register a baseline model
$ python src/run_experiments.py # MLflow autolog
$ python src/register_model.py # promote to Staging
3. Materialize features + query Feast online store
$ feast apply
$ feast materialize $(date -v-30d +%Y-%m-%d) $(date +%Y-%m-%d) # BSD/macOS date syntax; on GNU/Linux use $(date -d '30 days ago' +%Y-%m-%d) for the start date
$ python test_online_features.py # P99 < 5ms
4. Build + serve the model with BentoML
$ bentoml build
$ bentoml serve service:ChurnPredictor --port 3000
$ curl -X POST localhost:3000/predict -H 'Content-Type: application/json' -d '{"customer_id": 12345, ...}'
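Step 3 checks the online store against a P99 < 5ms target. The contents of test_online_features.py are not shown here, so the following is only a sketch of how such a check could be computed: time repeated lookups and take the nearest-rank 99th percentile. `fake_lookup` is a stand-in for a real Redis-backed read, and all function names are hypothetical.

```python
import time


def p99(samples_ms):
    """Nearest-rank 99th percentile of a list of latencies in milliseconds."""
    ordered = sorted(samples_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n), 1-based rank
    return ordered[rank - 1]


def measure(lookup, n=1000):
    """Call lookup() n times and return per-call latency in milliseconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        lookup()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def fake_lookup():
    """Stand-in for an online-store read; replace with a real lookup."""
    return {"customer_id": 12345}


latencies = measure(fake_lookup, n=200)
observed_p99 = p99(latencies)
```

Comparing `observed_p99` against 5.0 gives the pass/fail signal; P99 rather than the mean is used because tail latency is what the request path actually feels.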
10k customer rows · 10k transactions · 5k behavioral events · 1M predictions / mo
Production hardening

The same churn model — but built for the production case.

Most ML tutorials show you a notebook hitting a CSV. This shows what changes when 4 tenants share a feature store, on-call owns the drift dashboard, and finance asks for the cost-per-prediction.

Notebook ML (what most teams ship)
  • Reproducibility: Cannot reproduce training runs
  • Features: Ad-hoc engineering per notebook
  • Deployment: Manual model export + handoff
  • Monitoring: No post-deployment tracking
  • Retraining: Quarterly manual retrain
  • Cost: Whatever the bill says next month
Your production MLOps (Modules 01–05)
  • Reproducibility: MLflow tracking + DVC versioned data + PIT-correct features
  • Features: Feast offline (Parquet) + online (Redis P99 < 5ms) with shared registry
  • Deployment: BentoML CI/CD with canary 10→30→50→100% on K8s + HPA
  • Monitoring: Evidently drift (PSI + KS) + Prometheus alerts + AlertMgr → Slack
  • Retraining: Drift-aware cron + human-gate promote (ADR-005 documents the reversal)
  • Cost: Per-tenant token + serving attribution + Grafana cost panel + CSV cost model
EXPERT-only · architecture decision records

Write the ADRs staff engineers actually get judged on.

Five ADRs ship inside the starter-kit zip at docs/adr/, one per major decision in the build, including a real Deprecated ADR documenting the “event-driven auto-retraining” reversal. The kind of doc that travels with you to your next role. Preview ADR-001 →

ADR-001 · Accepted

Use Feast over Tecton / Hopsworks / DIY for the feature-store layer

  • Context: Open-source orchestration with online + offline + registry pluggability
  • Decision: feature_store.yaml + Python FeatureView definitions
  • Tradeoff: BYO streaming + lineage UI vs full vendor independence
  • Reversal: Tecton / Hopsworks swap is ~2-3 engineer-weeks behind the registry
ADR-002 · Accepted

Online store is Redis, not DynamoDB or Postgres direct

  • Context: <10ms P99 budget for feature lookup on the request path
  • Decision: Feast online_store.type: redis · ElastiCache prod · local Docker dev
  • Tradeoff: Memory cap + cache-miss handling vs sub-5ms in-memory gets
  • Reversal: DynamoDB or Postgres direct is one YAML field + materialize re-run
ADR-003 · Accepted

Tracking + data versioning is MLflow + DVC, not W&B / Neptune

  • Context: Zero-cost local-dev parity + no vendor lock-in for tutorial reach
  • Decision: mlflow autolog + dvc add · Git-native versioning · Model Registry stages
  • Tradeoff: Two tools + no managed UI vs $0 self-host + open exit ramp
  • Reversal: W&B swap is ~1-2 engineer-weeks (autolog + Registry + DVC migration)
ADR-004 · Accepted

Model serving is BentoML, not TorchServe / Seldon / FastAPI-by-hand

  • Context: scikit-learn / XGBoost first-class + zero-handroll Docker build
  • Decision: bentoml.service + bentoml containerize · K8s plain Deployment
  • Tradeoff: Bento store mental model vs single-command image build + Pydantic + adaptive batching
  • Reversal: FastAPI swap is ~3-5 engineer-days behind the service.py contract
ADR-005 · Deprecated

Event-driven auto-retraining via drift hook

  • Context: Original framing: drift_detection.py emits events, retraining auto-promotes
  • Decision: Reverted in M04: moved to cron + manual promotion gate (Slack alert → reviewer → Staging → human promote)
  • Why reversed: False-positive drift triggers + no human-in-the-loop = no accountability
  • Replaced by: Drift-aware retraining cron + manual MLflow stage transition
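The replacement pattern in ADR-005 separates two steps: automation may create a Staging candidate, but only a named human may move it to Production. The sketch below models that gate with an in-memory class. `Candidate` and `promote` are hypothetical names, and in the real platform the transition would go through the MLflow Model Registry rather than this stand-in.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    version: int
    stage: str = "Staging"            # the nightly cron only ever writes Staging
    approved_by: Optional[str] = None


def promote(candidate: Candidate, reviewer: Optional[str]) -> Candidate:
    """Move a Staging candidate to Production only with a named reviewer."""
    if candidate.stage != "Staging":
        raise ValueError("only Staging candidates can be promoted")
    if not reviewer:
        raise PermissionError("promotion requires a human reviewer")
    return Candidate(candidate.version, "Production", reviewer)


# Placeholder reviewer address, for illustration only.
promoted = promote(Candidate(version=4), reviewer="reviewer@example.com")
```

Recording `approved_by` on the promoted version is what restores the accountability the event-driven design lacked.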
EXPERT-only · cost model

Read the FinOps story for the platform you actually ship.

Module 04 ships a runnable cost-model CSV inside the starter-kit zip at docs/cost-model/. 4-tenant beta load (~1M predictions/mo), real AWS RDS + EC2 + ElastiCache + Anthropic list prices, with the model-cascade and reserved-instance levers wired up. Preview the CSV →

| Component | Baseline / mo | Optimized / mo | Delta |
|---|---|---|---|
| MLflow tracking server (EC2 t4g.medium · 1 instance · sqlite + EFS) | $30 | $22 | −$8 |
| BentoML serving · 3 replicas (EC2 t4g.medium × 3 · HPA min · /predict) | $90 | $66 | −$24 |
| Postgres (RDS) · registry + offline metadata (db.t4g.medium · 100GB gp3) | $98 | $68 | −$30 |
| ElastiCache Redis · online store (cache.t4g.small · primary + replica) | $54 | $36 | −$18 |
| Anthropic Claude · Module 05 RAG (~5M tok/mo · Haiku 70% / Sonnet 30%) | $10 | $4 | −$6 |
| S3 + egress + Grafana free tier (snapshots + DVC + prediction logs · 50GB free observability) | $3 | $3 | |
| Total · 4 tenants (~$0.30 per 1k predictions at baseline) | $285 | $199 | −$86 (−30%) |
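The totals and the per-1k figures follow directly from the component rows. A short sketch that recomputes them, using the monthly dollar amounts from the table above and the ~1M predictions/mo load stated for the 4-tenant beta; the variable and function names are illustrative only.

```python
# (baseline, optimized) monthly USD per component, copied from the table above.
COSTS = {
    "MLflow tracking server": (30, 22),
    "BentoML serving": (90, 66),
    "Postgres (RDS)": (98, 68),
    "ElastiCache Redis": (54, 36),
    "Anthropic Claude": (10, 4),
    "S3 + egress + Grafana": (3, 3),
}
PREDICTIONS_PER_MONTH = 1_000_000

baseline = sum(b for b, _ in COSTS.values())
optimized = sum(o for _, o in COSTS.values())
saving_pct = 100 * (baseline - optimized) / baseline


def cost_per_1k(monthly_usd: float) -> float:
    """Monthly spend divided by thousands of predictions served."""
    return monthly_usd / (PREDICTIONS_PER_MONTH / 1000)
```

At this load the baseline works out to about $0.285 per 1k predictions and the optimized stack to about $0.199, which the page rounds to $0.30 and $0.20.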

Optimization levers

  • 1-yr Reserved Instances (−$80 / mo): Commit to 12-month reserved capacity on RDS + ElastiCache + the 4 EC2 instances once load is stable for 30 days. Standard ~30% off list.
  • HPA tightening + faster scale-in (−$22 / mo): Drop HPA min from 3 to 2 replicas. Shorten the scale-in window from 300s to 60s. Cost-of-capacity vs P99-latency tradeoff modeled in the Module 04 cost dashboard.
  • Model cascade, Haiku → Sonnet (−$5 / mo): In the Module 05 RAG flow, route 70% of requests to Claude Haiku first; escalate to Sonnet only when retrieval confidence < 0.7.
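The model cascade reduces to a single routing decision on retrieval confidence. A minimal sketch with a hypothetical `route` function follows. It returns tier labels rather than real model identifiers, and the confidence scores are made up to show how a 70% Haiku share could arise.

```python
def route(retrieval_confidence: float, threshold: float = 0.7) -> str:
    """Pick a model tier from retrieval confidence.

    Returns tier labels, not real model IDs; map them to whichever Claude
    model names your account uses.
    """
    return "haiku" if retrieval_confidence >= threshold else "sonnet"


# Made-up confidence scores for ten requests.
confidences = [0.95, 0.82, 0.40, 0.71, 0.66, 0.90, 0.88, 0.73, 0.79, 0.55]
haiku_share = sum(route(c) == "haiku" for c in confidences) / len(confidences)
```

The actual share depends on the retriever's confidence distribution, so the 70% figure is something to measure on real traffic rather than configure directly.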
EXPERT benefit · cohort beta

Async architecture review with a staff-level reviewer (cohort beta).

Submit your repo, your ADR draft, or your drift-detection rollout plan. A staff or principal-level reviewer who has shipped this exact stack at scale responds within 7 days with line-by-line comments + a Loom walkthrough. Cohort capped at 12 members.

Bring a diff, an ADR draft, or a runbook.

The cohort beta runs as async architecture review — pick a reviewer by topic, send the artifact, get inline comments + a Loom walkthrough back. No back-and-forth scheduling. No 30-minute slot pressure.

Mira K.
Ex-staff · ML platform · ride-hailing scale
Feature stores at 100M+ daily lookups, online/offline parity, multi-tenant feature registries
Send the diff. I'll go line-by-line through your Feast registry and your retriever and pick out the joins that leak.
Daniel T.
Principal · MLOps · Series-D fintech
Drift detection in production, cron + gate retraining, model rollback patterns, on-call
Send your worst drift report. We'll walk it backwards from the Slack alert to the feature pipeline that drifted.
Anya S.
Eng manager · ML platform · public Series-D
Org design for ML platforms, hiring rubrics, staff-MLE interview prep, scope-of-work for ML teams
If you're prepping for staff-MLE promo, send your ADR draft. We'll work backwards from the rubric.
Format: Async · Turnaround: 7 days · Cohort: 12 members · Scope: ADR + arch review
What your tier unlocks

PRO unlocks Modules 01-02. EXPERT unlocks the full platform.

PRO is the entry point — Modules 01-02 plus the rest of the PRO catalog. EXPERT unlocks Modules 03-05 of this build, the 5 ADRs, the cost-model CSV, and the cohort-beta async review.

| What you get | FREE | PRO | EXPERT |
|---|---|---|---|
| Modules 01-02 of P07 (MLflow tracking + Feast feature store, ~13h) | | Included | Included |
| Modules 03-05 of P07 (BentoML on K8s + Drift + RAG, ~20h) | | | Included |
| 5 committed ADRs + cost-model CSV (starter kit docs/adr/ + docs/cost-model/) | | | Included |
| PRO project catalog (production-grade builds) | 2 | All current | All current + this one |
| Curriculum (all 7 tracks) | Phase 1 only | All | All + bonus modules |
| Code review (Senior+ reviewers) | | 4 / month | Unlimited |
| Cohort-beta architecture review (async · 7-day turnaround · 12-member cap) | | | Included |
| Certificate (verifiable on LinkedIn) | | Yes | Yes + LinkedIn rec |
EXPERT: $79/mo · billed monthly · open enrollment · cancel anytime
or annual: $699/yr (save 26%)
Who this is for

Pick this if you own the drift dashboard, not just a model.

Senior ML engineers

You've shipped models. Now you own the feature store, the drift dashboard, the canary promote gate, and the architecture review with platform.

MLOps engineers

You absorb new models without absorbing new vendors. Feast, Redis, BentoML, Prometheus — tools your platform team already operates.

Engineering managers · ML platform

You need a reference architecture for the feature-store + monitoring questions your CTO will ask before the ML team gets headcount approval.

Founding engineers · ML startups

Your investors will ask about training-serving parity and unit economics before they ask about scale. The 5 ADRs + cost model is the answer.

FAQ · EXPERT tier

Quick answers.

Modules 01-02 (MLflow + DVC + Feast offline/online) are included with PRO at $29/mo. The rest of the platform — Modules 03-05 (BentoML on K8s + drift + RAG layer), the 5 committed ADRs, the runnable cost-model CSV, and the cohort-beta async architecture review — unlocks with EXPERT at $79/mo. PRO gets you the foundation; EXPERT gets you the system you'd defend in an architecture review.
ADR-001 lays out the full tradeoff. Short version: Tecton is best-in-class managed but blocks tutorial reproducibility (cloud-only). DIY costs months and reinvents the PIT-correctness bug class. Feast is the open-source middle that wins on local-dev parity + zero vendor lock-in. The reversal plan is ~2-3 engineer-weeks if you outgrow it.
ADR-005 (Deprecated) is the honest answer. The original framing was 'auto-trigger on drift'; we reverted in Module 04 to a scheduled CronJob + manual promotion gate after false-positive drift signals caused weekend retrain incidents. The current pattern: drift_detection.py emits a Prometheus metric and AlertManager posts a Slack alert; the nightly retraining cron writes a new MLflow Staging version; a human reviewer transitions Staging → Production. This is what production teams actually run.
No. The churn-prediction workload is scikit-learn + XGBoost (CPU-only). Module 05's RAG flow uses Anthropic Claude API calls — also no local GPU needed. The cost-model CSV is built around CPU EC2 instances; total monthly cost at 4-tenant beta load is ~$285 baseline / $199 optimized.
30-40 hours of focused work across 5 modules. Most learners spread it across 5-7 weeks alongside a day job. Modules 01-02 alone (~13 hours) get you a tracked + versioned model behind a working feature store — included with PRO at $29/mo.
It's a strong forcing function. Staff ML platform interviews lean heavily on system design (feature stores, monitoring, deployment, cost) and on having opinions backed by real tradeoffs. The 5 ADRs you'll commit (one Deprecated with receipts) are exactly the artifacts a panel asks about. Pair with the cohort-beta review on your final repo and you have a portfolio.

Ready to ship a real MLOps platform?

Start with PRO ($29/mo) for Modules 01-02 — MLflow + DVC tracking and the Feast feature store. Or unlock the full 5-module platform plus 5 ADRs, the cost-model CSV, and cohort-beta architecture review with EXPERT ($79/mo).
