Build a production MLOps platform with Feast + BentoML
Ship a real MLOps platform with MLflow + DVC reproducibility, a Feast feature store (offline + Redis online with PIT correctness), BentoML on Kubernetes with HPA + canary rollouts, Evidently drift detection with cron-gated retraining, and Prometheus + Grafana + AlertManager observability. Modules 01-02 unlock with PRO; the full platform with EXPERT.
The MLOps system-design portfolio piece for staff ML-platform roles — 5 committed ADRs, a runnable cost-model CSV, a real K8s deploy with canary + HPA, and a drift-detection runbook you can defend in an architecture review.
- MLflow tracking + DVC versioning + a registered churn model with Staging stage transitions
- Feast feature store with Redis online (P99 < 5ms) + Parquet offline + PIT-correct training joins
- BentoML service on K8s with HPA, canary 10→30→50→100% promotion, and an 8-stage GitHub Actions pipeline
- Evidently drift detection (PSI + KS) + Prometheus alerts + AlertManager → Slack + drift-aware retraining cron
- Part-5 RAG layer: Anthropic Claude calls reading live features from Redis, plus a Kafka-driven freshness signal
- 5 ADRs (one Deprecated documenting the event-driven retraining reversal) committed alongside the code
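The PSI half of the drift check listed above can be sketched in plain Python. This is a hand-rolled illustration, not Evidently's implementation; the function name `psi`, the 10 equal-width bins, and the `eps` floor are assumptions made for the example.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference and a current sample.

    Bin edges come from the reference sample (equal-width bins, an assumption).
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference column

    def fractions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into the edge bins
            counts[idx] += 1
        # Floor empty bins at eps so the log ratio below stays finite.
        return [max(c / len(values), eps) for c in counts]

    e = fractions(expected)
    a = fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]  # current sample squeezed into the upper half
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.25 as significant drift; the thresholds this platform's alerts actually use are not stated here.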
Modules 01-02 unlock with PRO. Modules 03-05 with EXPERT.
Modules 01-02 (~13h) ship a working tracked-and-versioned churn model behind a Feast feature store — included with PRO. Modules 03-05 (~20h additional) layer on the production CI/CD, monitoring, and AI-integration story and unlock with EXPERT.
Foundation. Production. Scale.
Each phase ends with a tagged release, a passing red-team / load suite, and a runbook drill. No ambiguity about where you are.
Tracked + versioned churn model behind a Feast feature store. MLflow Registry has the Staging model; Redis is serving online lookups in <5ms.
- ✓ MLflow Registry · churn-predictor · Staging
- ✓ Feast online store (Redis) + offline (Parquet) parity
- ✓ Kafka-driven feature sync + backfill engine
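Point-in-time (PIT) correctness means each training row only sees feature values that existed at or before its label timestamp. The sketch below shows that rule with the standard library only; it is not Feast's join, and it ignores the TTL that Feast feature views also apply. The function name `pit_join` and the tuple layouts are assumptions for illustration.

```python
from bisect import bisect_right
from datetime import datetime


def pit_join(label_rows, feature_rows):
    """Attach to each label the latest feature value observed at or before its event time.

    label_rows:   iterable of (entity_id, event_ts, label)
    feature_rows: iterable of (entity_id, feature_ts, value)
    """
    # Group feature rows per entity, ordered by timestamp.
    by_entity = {}
    for entity_id, ts, value in sorted(feature_rows, key=lambda r: (r[0], r[1])):
        timestamps, values = by_entity.setdefault(entity_id, ([], []))
        timestamps.append(ts)
        values.append(value)

    joined = []
    for entity_id, event_ts, label in label_rows:
        timestamps, values = by_entity.get(entity_id, ([], []))
        pos = bisect_right(timestamps, event_ts)  # count of feature rows with ts <= event_ts
        feature = values[pos - 1] if pos else None  # None: no feature existed yet
        joined.append((entity_id, event_ts, feature, label))
    return joined


features = [
    ("u1", datetime(2024, 1, 1), 3),
    ("u1", datetime(2024, 1, 10), 7),
]
labels = [
    ("u1", datetime(2024, 1, 5), 1),   # must see 3, not the later value 7
    ("u1", datetime(2024, 1, 10), 0),  # a feature written at the same instant is visible
    ("u2", datetime(2024, 1, 5), 0),   # unknown entity gets no feature
]
joined = pit_join(labels, features)
```

The first label row is the case that matters: a naive "latest value" join would leak the January 10 feature into a January 5 training example.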
BentoML on K8s with canary + HPA + GH Actions CI/CD, plus Evidently drift + Prometheus + Grafana + Slack + cron retrain.
- ✓ BentoML service · canary 10→30→50→100% promotion
- ✓ Evidently drift detector + 3 Grafana dashboards
- ✓ Cron retraining job + ADR-005 cron + gate runbook
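The 10→30→50→100% promotion ladder can be read as a small state machine: advance one step while the canary stays healthy, otherwise drop traffic to zero. The following is a minimal sketch of that decision only; the function name, the single error-rate health signal, and the 1% threshold are hypothetical and not taken from the course pipeline.

```python
CANARY_STEPS = (10, 30, 50, 100)  # traffic percentages from the promotion ladder


def next_canary_weight(current: int, error_rate: float,
                       max_error_rate: float = 0.01) -> int:
    """Return the next canary traffic percentage, or 0 to signal a rollback."""
    if current not in CANARY_STEPS:
        raise ValueError(f"unexpected canary weight: {current}")
    if error_rate > max_error_rate:
        return 0  # unhealthy canary: send all traffic back to the stable version
    idx = CANARY_STEPS.index(current)
    # Stay at 100 once fully promoted.
    return CANARY_STEPS[min(idx + 1, len(CANARY_STEPS) - 1)]
```

In a real rollout the health check would usually combine several signals (latency, error rate, drift), and the returned weight would be applied by whatever traffic-splitting mechanism the cluster uses.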
RAG + agents reading live features. Anthropic Claude calls with Feast tool-use; freshness signal blocks stale predictions.
- ✓ Claude RAG with feature retrieval
- ✓ Kafka freshness signal + stale-rejection
- ✓ Feature-aware agent · pgvector + access-control layer
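Stale-rejection boils down to comparing a feature's last-update time against a maximum allowed age before the prediction or LLM call proceeds. A minimal sketch, assuming the freshness signal yields a timezone-aware timestamp per entity; `StaleFeatureError`, `require_fresh`, and the 5-minute budget are hypothetical names and values.

```python
from datetime import datetime, timedelta, timezone


class StaleFeatureError(RuntimeError):
    """Raised when features are older than the allowed freshness budget."""


def require_fresh(last_updated: datetime, now: datetime,
                  max_age: timedelta = timedelta(minutes=5)) -> timedelta:
    """Return the feature age, or raise if it exceeds max_age."""
    age = now - last_updated
    if age > max_age:
        raise StaleFeatureError(f"features are {age} old, limit is {max_age}")
    return age


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh_age = require_fresh(now - timedelta(minutes=2), now)
```

Passing `now` explicitly keeps the check deterministic in tests; a caller in production would pass `datetime.now(timezone.utc)`.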
One command. Local MLflow + Feast + Redis + BentoML + Postgres + Prometheus.
What lives in the repo
You get the real platform on day one — MLflow + DVC for tracking, Feast for the feature store, Redis for online + Postgres for registry, BentoML for serving, Evidently for drift detection, and Prometheus + Grafana + AlertManager for the dashboards.
- src/ + features/ — MLflow training scripts, Feast entity + feature view definitions
- feature_store/ — registry.py + sync.py + backfill.py (Kafka → Redis materialization)
- service.py + bentofile.yaml — BentoML service with Pydantic validation, packaged for Docker
- k8s/ + .github/workflows/ — Deployment + Service + Ingress + HPA + canary YAML; 8-stage CI/CD
- monitoring/ + prometheus/ + grafana/ — Evidently drift, Prometheus alerts, Grafana dashboards, Slack
- docs/adr/ + docs/cost-model/ — 5 ADRs (one Deprecated) + the runnable cost-model CSV
PredictFlow Feature Store Starter Kit
Pre-built MLOps platform: 5 tutorial modules of source, Docker compose, Feast configs, K8s manifests, GH Actions workflows, Prometheus + Grafana configs. Now bundled: 5 committed ADR markdown files (docs/adr/) and the runnable cost-model CSV (docs/cost-model/) — unzip and read them straight from the repo.
The same churn model — but built for the production case.
Most ML tutorials show you a notebook hitting a CSV. This shows what changes when 4 tenants share a feature store, on-call owns the drift dashboard, and finance asks for the cost-per-prediction.
- MLflow tracking + DVC versioned data + PIT-correct features
- Feast offline (Parquet) + online (Redis P99 < 5ms) with shared registry
- BentoML CI/CD with canary 10→30→50→100% on K8s + HPA
Write the ADRs staff engineers actually get judged on.
Five ADRs ship inside the starter-kit zip at docs/adr/, one per major decision in the build, including a real Deprecated ADR documenting the “event-driven auto-retraining” reversal. The kind of doc that travels with you to your next role.
- Use Feast over Tecton / Hopsworks / DIY for the feature-store layer
  feature_store.yaml + Python FeatureView decorators
- Online store is Redis, not DynamoDB or Postgres direct
  online_store.type: redis · ElastiCache prod · local Docker dev
- Tracking + data versioning is MLflow + DVC, not W&B / Neptune
  mlflow autolog + dvc add · Git-native versioning · Model Registry stages
- Model serving is BentoML, not TorchServe / Seldon / FastAPI-by-hand
  bentoml.service + bentoml containerize · K8s plain Deployment
- Event-driven auto-retraining via drift hook (Deprecated)
  Replaced by: cron + manual promotion gate (Slack alert → reviewer → Staging → human promote)
Read the FinOps story for the platform you actually ship.
Module 04 ships a runnable cost-model CSV inside the starter-kit zip at docs/cost-model/. It models a 4-tenant beta load (~1M predictions/mo) using real AWS RDS + EC2 + ElastiCache + Anthropic list prices, with the model-cascade and reserved-instance levers wired up.
Optimization levers
Async architecture review with a staff-level reviewer (cohort beta).
Submit your repo, your ADR draft, or your drift-detection rollout plan. A staff or principal-level reviewer who has shipped this exact stack at scale responds within 7 days with line-by-line comments + a Loom walkthrough. Cohort capped at 12 members.
Bring a diff, an ADR draft, or a runbook.
The cohort beta runs as async architecture review — pick a reviewer by topic, send the artifact, get inline comments + a Loom walkthrough back. No back-and-forth scheduling. No 30-minute slot pressure.
PRO unlocks Modules 01-02. EXPERT unlocks the full platform.
PRO is the entry point — Modules 01-02 plus the rest of the PRO catalog. EXPERT unlocks Modules 03-05 of this build, the 5 ADRs, the cost-model CSV, and the cohort-beta async review.
Pick this if you own the drift dashboard, not just a model.
Senior ML engineers
You've shipped models. Now you own the feature store, the drift dashboard, the canary promote gate, and the architecture review with the platform team.
MLOps engineers
You absorb new models without absorbing new vendors. Feast, Redis, BentoML, Prometheus — tools your platform team already operates.
Engineering managers · ML platform
You need a reference architecture for the feature-store + monitoring questions your CTO will ask before the ML team gets headcount approval.
Founding engineers · ML startups
Your investors will ask about training-serving parity and unit economics before they ask about scale. The 5 ADRs + cost model are the answer.
Going deeper? Four tracks back this project.
The MLOps curriculum is the foundation. These four tracks let you go deeper on the parts that matter most for your role.
Quick answers.
Ready to ship a real MLOps platform?
Start with PRO ($29/mo) for Modules 01-02 — MLflow + DVC tracking and the Feast feature store. Or unlock the full 5-module platform plus 5 ADRs, the cost-model CSV, and cohort-beta architecture review with EXPERT ($79/mo).