Build a production-grade experimentation platform on dbt + scipy
A real experimentation data platform — not a tutorial. Land 100K assignment events and 50 experiments through a star-schema dimensional model with idempotent merges. Compute Welch's t-tests, confidence intervals, MDE, and power in scipy. Detect SRM and segment HTE with chi-squared + Bonferroni. Drive a GREEN/YELLOW/RED ship-decision scorecard. Govern lifecycle with a Python state machine. Serve flags from FastAPI + Redis with deterministic SHA256 bucketing.
The system-design question Netflix, Airbnb, LinkedIn, and Booking.com ask in analytics + product-DE rounds — “walk me through how you’d build experimentation infra at scale.” This project gives you the artifacts: schema, stats, scorecard, governance.
- 26+ dbt models — dim_experiments, dim_variants, fact_assignments, dim_metric_registry, mart_variant_comparison, mart_experiment_scorecard
- scipy statistical engine — Welch's t-test + 95% CI, MDE + power analysis, guardrail one-sided tests, rolling-z-score anomaly detection
- Decision framework — GREEN/YELLOW/RED scorecard combining primary lift + guardrail outcomes + segment HTE (chi-squared + Bonferroni)
- SRM detector — chi-squared on observed vs expected variant allocation, severity-flagged before you trust the result
- Lifecycle state machine — DRAFT → REVIEW → APPROVED → RUNNING → CONCLUDED → ARCHIVED with audit log + Airflow daily enforcement DAG
- Real-time flag service — FastAPI + Redis with deterministic SHA256(flag_key:user_id)%10000 bucketing and sub-10ms eval
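The bucketing rule named in the last bullet can be sketched in a few lines of standard-library Python. This is a minimal sketch, not the project's flag_service.py: the function names, the choice to read the full hex digest as one integer, and the treatment_share parameter are all assumptions made for illustration.

```python
import hashlib

NUM_BUCKETS = 10_000  # buckets 0-9999, matching the % 10000 rule


def bucket_for(flag_key: str, user_id: str) -> int:
    """Map (flag_key, user_id) to a stable bucket in [0, 9999]."""
    # Assumption: the key is the literal string "flag_key:user_id", UTF-8 encoded,
    # and the whole SHA256 digest is interpreted as one big integer.
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS


def assign_variant(flag_key: str, user_id: str, treatment_share: float = 0.5) -> str:
    """Hypothetical helper: buckets below the threshold get the treatment."""
    threshold = int(treatment_share * NUM_BUCKETS)
    return "treatment" if bucket_for(flag_key, user_id) < threshold else "control"


if __name__ == "__main__":
    # The same inputs always produce the same bucket, so assignment is sticky
    # without storing any per-user state.
    print(bucket_for("new_checkout", "user_42"))
    print(assign_variant("new_checkout", "user_42"))
```

Because the flag key is part of the hashed string, the same user lands in unrelated buckets for different flags, which keeps experiments from sharing one split.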
Every product team runs experiments. Few have rigor.
Most A/B testing in the wild is eyeballed conversion-rate diffs. The pattern that wins promo and senior+ interviews is: deterministic bucketing, Welch + CI + MDE, guardrails, SRM detection, and a scorecard that says GREEN only when significance + direction + guardrails all check.
p-value > eyeball
Welch's t-test with 95% CI and MDE check is the difference between “B looks better” and “we ship B.” This project ships the scipy code, not just the chart.
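As a rough illustration of that check, the sketch below runs Welch's t-test with scipy and builds the matching 95% confidence interval using the Welch-Satterthwaite degrees of freedom. The function name and the returned dictionary keys are assumptions, not the contents of the project's significance.py.

```python
import numpy as np
from scipy import stats


def welch_test(control, treatment, alpha=0.05):
    """Welch's t-test plus a (1 - alpha) CI on the difference in means."""
    control = np.asarray(control, dtype=float)
    treatment = np.asarray(treatment, dtype=float)

    # equal_var=False selects Welch's test (no pooled-variance assumption).
    t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

    diff = treatment.mean() - control.mean()
    var_c = control.var(ddof=1) / control.size
    var_t = treatment.var(ddof=1) / treatment.size
    se = np.sqrt(var_c + var_t)

    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (var_c + var_t) ** 2 / (
        var_c ** 2 / (control.size - 1) + var_t ** 2 / (treatment.size - 1)
    )
    margin = stats.t.ppf(1 - alpha / 2, df) * se

    return {
        "diff": float(diff),
        "ci_low": float(diff - margin),
        "ci_high": float(diff + margin),
        "t_stat": float(t_stat),
        "p_value": float(p_value),
        "significant": bool(p_value < alpha),
    }


if __name__ == "__main__":
    print(welch_test([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]))
```

A confidence interval that excludes zero and a p-value below alpha are two views of the same test, so reporting both gives reviewers the effect size along with the decision.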
SRM > trust the split
Sample-ratio mismatch silently invalidates 5-10% of experiments at most companies. A chi-squared SRM detector flags allocation drift before anyone trusts the lift.
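A chi-squared SRM check of this kind might look like the following sketch. The function name, the dictionary inputs, and the alpha of 0.001 are assumptions; a strict threshold is a common convention for SRM checks, but the project's own cutoff and severity levels are not shown in this page.

```python
from scipy import stats


def srm_check(observed_counts, expected_shares, alpha=0.001):
    """Compare observed variant counts with the planned allocation."""
    total = sum(observed_counts.values())
    variants = sorted(observed_counts)
    observed = [observed_counts[v] for v in variants]
    # Expected counts are the planned shares scaled to the observed total.
    expected = [expected_shares[v] * total for v in variants]

    chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
    return {
        "chi2": float(chi2),
        "p_value": float(p_value),
        "srm_detected": bool(p_value < alpha),
    }


if __name__ == "__main__":
    planned = {"control": 0.5, "treatment": 0.5}
    # A 48/52 split on 10,000 users is far outside what chance allows.
    print(srm_check({"control": 4800, "treatment": 5200}, planned))
    # A 50.1/49.9 split is ordinary noise.
    print(srm_check({"control": 5010, "treatment": 4990}, planned))
```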
Scorecard > debate
GREEN/YELLOW/RED logic — primary significance + correct direction + no guardrail violation — turns “should we ship” into a deterministic answer.
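One way to encode that rule is sketched below. Only the GREEN condition comes from the description above; how YELLOW and RED are split (any guardrail violation or a significant move in the wrong direction is RED, everything inconclusive is YELLOW) is an assumption, as are the class and field names.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    # Hypothetical shape of one scorecard input row.
    p_value: float
    lift: float                # treatment minus control on the primary metric
    higher_is_better: bool     # direction the primary metric should move
    guardrail_violations: int  # count of guardrail metrics that regressed


def ship_decision(result: ExperimentResult, alpha: float = 0.05) -> str:
    """Return GREEN, YELLOW, or RED for one experiment."""
    if result.guardrail_violations > 0:
        return "RED"  # assumption: any guardrail violation blocks the launch

    significant = result.p_value < alpha
    right_direction = result.lift > 0 if result.higher_is_better else result.lift < 0

    if significant and right_direction:
        return "GREEN"
    if significant:
        return "RED"  # assumption: a significant move the wrong way is a loss
    return "YELLOW"   # assumption: inconclusive results need more data or review


if __name__ == "__main__":
    print(ship_decision(ExperimentResult(0.01, 0.03, True, 0)))
    print(ship_decision(ExperimentResult(0.20, 0.03, True, 0)))
```

Keeping the rule as a pure function makes it easy to unit test and to mirror later in a SQL mart.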
State machine > drift
DRAFT → REVIEW → APPROVED → RUNNING → CONCLUDED with audit logs and Airflow enforcement is what stops experiments from running 90 days after they should have ended.
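A minimal sketch of such a state machine follows. It assumes a strictly linear path with no backward transitions and includes the ARCHIVED state from the feature list; the class name, the audit-log fields, and the actor argument are hypothetical, and the Airflow enforcement side is not shown.

```python
from datetime import datetime, timezone
from enum import Enum


class State(Enum):
    DRAFT = "DRAFT"
    REVIEW = "REVIEW"
    APPROVED = "APPROVED"
    RUNNING = "RUNNING"
    CONCLUDED = "CONCLUDED"
    ARCHIVED = "ARCHIVED"


# Assumption: each state has exactly one legal successor.
NEXT_STATE = {
    State.DRAFT: State.REVIEW,
    State.REVIEW: State.APPROVED,
    State.APPROVED: State.RUNNING,
    State.RUNNING: State.CONCLUDED,
    State.CONCLUDED: State.ARCHIVED,
}


class ExperimentLifecycle:
    def __init__(self, experiment_id: str) -> None:
        self.experiment_id = experiment_id
        self.state = State.DRAFT
        self.audit_log = []  # one dict per accepted transition

    def transition(self, target: State, actor: str) -> None:
        """Move to the target state or raise if the move is not allowed."""
        if NEXT_STATE.get(self.state) is not target:
            raise ValueError(f"{self.state.value} -> {target.value} is not allowed")
        self.audit_log.append({
            "experiment_id": self.experiment_id,
            "from": self.state.value,
            "to": target.value,
            "actor": actor,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        self.state = target


if __name__ == "__main__":
    exp = ExperimentLifecycle("exp_1")
    exp.transition(State.REVIEW, actor="alice")
    print(exp.state, exp.audit_log)
```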
Module 01 is free. The rest unlocks with PRO.
Try the first 3-4 hours — model the assignment events, build the dim/fact tables, write dbt tests for bucket integrity, and lock in the dimensional model. If the rhythm clicks, upgrade to unlock the statistical engine, decision scorecard, and lifecycle modules.
Product Thinking for Data Engineers
Product Thinking teaches the why behind every model in this project — KPI hierarchy, A/B testing, stakeholder framing. PRO subscribers get full access to every module.
Three sprints. Three checkpoints. One production-shaped platform.
Each phase ends with a tagged commit and a runnable artifact. No theory decks.
Star schema for experimentation: dim_experiments / dim_variants / fact_assignments. Idempotent dbt incremental merges. Schema-registry contracts on the raw event tables.
- ✓ raw.assignment_events + dim/fact dbt models
- ✓ Bucket-range invariants (0-9999) + uniqueness tests
- ✓ Feature-flag snapshot integration
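The idea behind an idempotent merge and a bucket-range invariant can be shown without a warehouse. The sketch below uses Python's built-in sqlite3 as a stand-in for Postgres and dbt; the column names and the (experiment_id, user_id) key are assumptions, and in the project itself this logic would live in dbt incremental models and dbt tests rather than in Python.

```python
import sqlite3


def merge_assignments(conn, rows):
    """Upsert assignment rows; re-running the same batch changes nothing."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS fact_assignments (
            experiment_id TEXT NOT NULL,
            user_id       TEXT NOT NULL,
            variant       TEXT NOT NULL,
            -- bucket-range invariant: only 0-9999 is accepted
            bucket        INTEGER NOT NULL CHECK (bucket BETWEEN 0 AND 9999),
            -- uniqueness invariant: one assignment per user per experiment
            PRIMARY KEY (experiment_id, user_id)
        )
        """
    )
    conn.executemany(
        """
        INSERT INTO fact_assignments (experiment_id, user_id, variant, bucket)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (experiment_id, user_id)
        DO UPDATE SET variant = excluded.variant, bucket = excluded.bucket
        """,
        rows,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM fact_assignments").fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    batch = [("exp_1", "u1", "control", 1234), ("exp_1", "u2", "treatment", 8765)]
    print(merge_assignments(conn, batch))  # first load
    print(merge_assignments(conn, batch))  # replay of the same batch, same row count
```

Keying the merge on the natural key is what makes a replayed batch harmless: the second run updates the same two rows instead of appending duplicates.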
scipy Welch + CI + MDE + power. Guardrail one-sided tests with severity. Decision scorecard combining primary lift + guardrails + segment HTE. SRM detector.
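For the MDE and power part, a common normal-approximation formula is sketched below. It assumes a two-sided test, equal arm sizes, and a shared standard deviation across arms; the function names are hypothetical and this is not necessarily how the project's power_analysis.py is written.

```python
import math

from scipy.stats import norm


def minimum_detectable_effect(sd, n_per_arm, alpha=0.05, power=0.8):
    """Smallest absolute difference in means detectable at the given power."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided critical value
    z_beta = norm.ppf(power)
    standard_error = sd * math.sqrt(2 / n_per_arm)
    return float((z_alpha + z_beta) * standard_error)


def achieved_power(effect, sd, n_per_arm, alpha=0.05):
    """Approximate power to detect `effect` (ignores the negligible far tail)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    standard_error = sd * math.sqrt(2 / n_per_arm)
    return float(norm.cdf(abs(effect) / standard_error - z_alpha))


if __name__ == "__main__":
    mde = minimum_detectable_effect(sd=1.0, n_per_arm=1000)
    print(round(mde, 4))
    print(round(achieved_power(mde, sd=1.0, n_per_arm=1000), 3))
```

Comparing the observed lift with the MDE is a quick sanity check: a non-significant result on an experiment whose MDE is larger than any plausible effect says little about whether the change works.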
- ✓ src/stats/significance.py + power_analysis.py + guardrails.py
- ✓ mart_variant_comparison + mart_experiment_scorecard
- ✓ SRM chi-squared + segment HTE Bonferroni
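The segment HTE item can be illustrated with a per-segment chi-squared test on conversion counts and a Bonferroni-adjusted threshold. This is a sketch under assumptions: the input layout, the key names, and the use of scipy's chi2_contingency (with its default continuity correction for 2x2 tables) are illustrative choices, and the counts are made-up example data.

```python
from scipy.stats import chi2_contingency


def segment_hte(segments, alpha=0.05):
    """Test each segment separately and apply a Bonferroni correction."""
    threshold = alpha / len(segments)  # Bonferroni: split alpha across tests
    results = {}
    for name, c in segments.items():
        # 2x2 table: rows are variants, columns are converted / not converted.
        table = [
            [c["control_conv"], c["control_n"] - c["control_conv"]],
            [c["treatment_conv"], c["treatment_n"] - c["treatment_conv"]],
        ]
        p_value = float(chi2_contingency(table)[1])
        lift = c["treatment_conv"] / c["treatment_n"] - c["control_conv"] / c["control_n"]
        results[name] = {
            "lift": lift,
            "p_value": p_value,
            "significant": p_value < threshold,
        }
    return results


if __name__ == "__main__":
    # Illustrative counts only, not project data.
    example = {
        "mobile": {"control_n": 1000, "control_conv": 100,
                   "treatment_n": 1000, "treatment_conv": 160},
        "desktop": {"control_n": 1000, "control_conv": 100,
                    "treatment_n": 1000, "treatment_conv": 105},
        "tablet": {"control_n": 1000, "control_conv": 120,
                   "treatment_n": 1000, "treatment_conv": 90},
    }
    for segment, row in segment_hte(example).items():
        print(segment, row)
```

In this made-up data the tablet segment moves in the losing direction and clears an uncorrected 0.05 threshold, but not the corrected one, which is the kind of case the correction exists to keep honest.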
Python lifecycle state machine with audit log + Airflow enforcement. FastAPI + Redis flag-eval with deterministic SHA256 bucketing.
- ✓ experimentation/lifecycle/state_machine.py
- ✓ dags/experiment_lifecycle_dag.py (Airflow)
- ✓ serving/flag_service.py + flag_sync.py (FastAPI + Redis)
One stack. dbt + Postgres + Python + Airflow + Redis + FastAPI.
Pre-configured docker-compose with Postgres, Redis, Airflow, and the dbt project scaffolded. 5 sample CSVs (50K assignments, 100K events, 50 experiments, 500 feature flags, 10 metric definitions) with intentional quality issues seeded for the dbt tests.
What lives in the repo
Everything you need to run the experimentation platform locally — dbt project with 26+ models, scipy statistical engine, FastAPI flag service, Airflow lifecycle DAGs, and the seeded sample data with quality issues that the dbt tests catch.
- docker-compose.yml — Postgres, Redis, Airflow scheduler + webserver
- dbt_project.yml + models/ — 26+ dbt models across staging / intermediate / marts
- src/stats/ — scipy: significance.py, power_analysis.py, guardrails.py, anomaly_detector.py
- experimentation/lifecycle/ — state_machine.py + Airflow enforcement DAG
- serving/ — FastAPI flag_service.py + Redis flag_sync.py
- seeds/ — 5 sample CSVs with quality issues (50K assignments, 100K events)
Experimentation Platform Starter Kit
Pre-configured docker-compose stack, dbt project scaffolded with 26+ models, scipy statistical engine, FastAPI flag service, Airflow lifecycle DAGs, and 5 seeded CSVs with intentional quality issues for the dbt tests to catch.
The same A/B test — but built for the statistical-rigor case.
Most experimentation tutorials show you WHERE variant = ‘treatment’ and a conversion diff. This one shows what changes when you actually need to defend the result to a calibration committee or a launch review.
- SHA256(flag_key:user_id) % 10000 with sticky bucketing across sessions
- p<0.05 + correct direction + no guardrail violation
- chi-squared SRM detector flags 48/52 splits before you trust the result
- Bonferroni: overall winner can hide a losing segment
Real review from senior engineers who’ve shipped this stack.
Submit your repo, get line-by-line feedback within 48 hours. The kind of review that's quietly worth thousands of dollars in time-to-staff.
4 reviews / month
Submit a repo, a PR, or a refactor proposal. Reviewer is matched to your domain — experimentation + dbt + scipy for this project. Async, comments inline, average turnaround 31 hours.
2 office hours / month
Live 30-min sessions with a senior data engineer or product DE. Architecture questions, whiteboard a tricky stat method, mock a system-design interview. Group sessions also available.
One subscription. 15+ projects, all curriculum, code review.
PRO is built for senior+ engineers who want production-grade builds and feedback loops — not more tutorials.
Pick this if you’re shipping experiments, not learning to.
Analytics engineers
You ship dbt models for experimentation but the stats sit in a notebook. This puts the stats in the warehouse next to the metrics.
Product DEs
You support PMs running experiments. This is the platform you wish someone had built — schema + scorecards + governance + a flag service.
Senior DE → Tech Lead
You're scoping the experimentation platform at your company. This project is the prior art — read it, don't re-derive it.
Platform DEs
You run the data platform. This shows the experimentation primitives — assignment, exposure, decision, governance — without coupling to one warehouse.
Going deeper? Three tracks back this platform.
Experimentation is a meeting point of modeling, the dbt engine, and observability. These three curriculums let you go deeper on the layers underneath every model in this project.
Quick answers.
Ready to ship a real experimentation platform?
Start with module 01 — free, no card. About 3-4 hours. By the end you'll have the dimensional model, dbt tests, and bucket-range invariants under your fingers — the schema spine the next 3 modules sit on.