ai-de.net/Projects/P29 · Experimentation Platform on dbt + scipy
PRO · module 01 free preview · Staff Engineering track · P29

Build a production-grade experimentation platform on dbt + scipy

A real experimentation data platform, not a tutorial. Land 50K assignment events, 100K user-action events, and 50 experiments through a star-schema dimensional model with idempotent merges. Compute Welch's t-tests, confidence intervals, MDE, and power in scipy. Detect SRM and segment HTE with chi-squared + Bonferroni. Drive a GREEN/YELLOW/RED ship-decision scorecard. Govern lifecycle with a Python state machine. Serve flags from FastAPI + Redis with deterministic SHA256 bucketing.

Timeline: 14-16 hours
Difficulty: Senior+
Stack: dbt · Postgres · scipy · FastAPI · Airflow

The system-design question Netflix, Airbnb, LinkedIn, and Booking.com ask in analytics + product-DE rounds — “walk me through how you’d build experimentation infra at scale.” This project gives you the artifacts: schema, stats, scorecard, governance.

By the end you will have wired
  • 26+ dbt models — dim_experiments, dim_variants, fact_assignments, dim_metric_registry, mart_variant_comparison, mart_experiment_scorecard
  • scipy statistical engine — Welch's t-test + 95% CI, MDE + power analysis, guardrail one-sided tests, rolling-z-score anomaly detection
  • Decision framework — GREEN/YELLOW/RED scorecard combining primary lift + guardrail outcomes + segment HTE (chi-squared + Bonferroni)
  • SRM detector — chi-squared on observed vs expected variant allocation, severity-flagged before you trust the result
  • Lifecycle state machine — DRAFT → REVIEW → APPROVED → RUNNING → CONCLUDED → ARCHIVED with audit log + Airflow daily enforcement DAG
  • Real-time flag service — FastAPI + Redis with deterministic SHA256(flag_key:user_id)%10000 bucketing and sub-10ms eval
PREREQ: Comfortable with dbt + Postgres + Python (scipy fluency helpful but not required). Pairs well with the Product Thinking curriculum (KPIs, A/B testing, experimentation infra) and the dbt deep dive.
experiments.warehouse.* · scorecard live · p < 0.05 · SRM ok
Events: raw.assignment (50K · bucket 0–9999) · raw.user_action (100K · JSONB ctx) · raw.experiments (50 · feature flags) · + dbt tests · contracts
Dbt models: dim_experiments (incremental · merge) · fact_assignments (grain: user × exp) · mart_scorecard (GREEN / YELLOW / RED) · 26+ models · staging→marts
Scipy engine: significance.py (Welch + 95% CI) · power_analysis.py (MDE + sample size) · guardrails.py (SRM · HTE · Bonferroni) · scipy.stats · numpy
Outputs: scorecard (SHIP · HOLD · EXTEND) · state_machine (DRAFT → ARCHIVED) · flag_service (FastAPI + Redis) · sub-10ms p99
# Statistical rigor (Module 02)
t, p = scipy.stats.ttest_ind_from_stats(
    m_a, sd_a, n_a, m_b, sd_b, n_b,
    equal_var=False)  # Welch; the default equal_var=True is the pooled Student's t-test
CI = diff ± z_0.975 * SE_diff
→ Welch + 95% CI + MDE + power
● Deterministic bucketing (Module 04)
key = f"{flag_key}:{user_id}"
bucket = int(hashlib.sha256(key.encode()).hexdigest()[:8], 16) % 10000
return "treatment" if bucket < 5000 else …
→ sticky · cached in Redis · sub-10ms
26+ dbt models · 10 metric registry · p<0.05 with CI + MDE
Why this matters in 2026

Every product team runs experiments. Few have rigor.

Most A/B testing in the wild is eyeballed conversion-rate diffs. The pattern that wins promo and senior+ interviews is: deterministic bucketing, Welch + CI + MDE, guardrails, SRM detection, and a scorecard that says GREEN only when significance + direction + guardrails all check.

p-value > eyeball

Welch's t-test with 95% CI and MDE check is the difference between “B looks better” and “we ship B.” This project ships the scipy code, not just the chart.
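As a rough illustration of the difference between eyeballing and testing, here is a minimal sketch of Welch's t-test plus a 95% CI computed from summary statistics. The function name `welch_summary` and the input numbers are hypothetical, not taken from the project repo; the CI uses the normal approximation `diff ± z * SE_diff`, which is reasonable only for large samples.

```python
import math
from scipy import stats

def welch_summary(m_a, sd_a, n_a, m_b, sd_b, n_b, confidence=0.95):
    """Welch's t-test and a normal-approximation CI for (B - A) from summary stats."""
    # equal_var=False requests Welch's test; scipy's default is the pooled Student's t.
    t_stat, p_value = stats.ttest_ind_from_stats(
        m_b, sd_b, n_b, m_a, sd_a, n_a, equal_var=False
    )
    diff = m_b - m_a
    # Unpooled standard error of the difference in means.
    se_diff = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    z = stats.norm.ppf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return {
        "t": float(t_stat),
        "p": float(p_value),
        "diff": diff,
        "ci": (diff - z * se_diff, diff + z * se_diff),
    }

# Hypothetical summary stats: control (A) vs treatment (B), 1,000 users each.
result = welch_summary(10.0, 2.0, 1000, 10.3, 2.5, 1000)
print(result)
```

With these made-up inputs the interval excludes zero, which is the kind of statement a launch review can check, unlike a chart.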

SRM > trust the split

Sample-ratio mismatch silently invalidates 5-10% of experiments at most companies. A chi-squared SRM detector flags allocation drift before anyone trusts the lift.
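A chi-squared SRM check can be sketched in a few lines with `scipy.stats.chisquare`. The function name `srm_check` and the strict `alpha=0.001` threshold are assumptions for illustration; the project may use different names and severity cut-offs.

```python
from scipy import stats

def srm_check(observed, expected_ratios, alpha=0.001):
    """Chi-squared goodness-of-fit test of observed variant counts vs the planned split."""
    total = sum(observed)
    expected = [total * ratio for ratio in expected_ratios]
    chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
    # A very small p-value means the split is unlikely under the planned allocation.
    return float(chi2), float(p_value), bool(p_value < alpha)

# A 48/52 split on 50,000 assignments against a planned 50/50 allocation.
chi2, p_value, flagged = srm_check([24000, 26000], [0.5, 0.5])
print(f"chi2={chi2:.1f} p={p_value:.2e} flagged={flagged}")
```

A 48/52 split looks harmless on a dashboard, but at this sample size it is far outside what random assignment would produce, so the lift should not be trusted until the cause is found.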

Scorecard > debate

GREEN/YELLOW/RED logic (primary significance + correct direction + no guardrail violation) turns “should we ship” into a deterministic answer.
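One way to make that rule deterministic is a small pure function. The GREEN condition below follows the text; the RED and YELLOW rules, the `ExperimentResult` fields, and the `scorecard` name are assumptions made for this sketch, not the project's actual mart logic.

```python
from dataclasses import dataclass

@dataclass
class ExperimentResult:
    p_value: float              # primary-metric p-value
    lift: float                 # treatment minus control on the primary metric
    higher_is_better: bool      # direction in which the primary metric should move
    guardrail_violations: int   # number of guardrail metrics that regressed

def scorecard(result: ExperimentResult, alpha: float = 0.05) -> str:
    significant = result.p_value < alpha
    if result.higher_is_better:
        right_direction = result.lift > 0
    else:
        right_direction = result.lift < 0
    # Assumed rule: any guardrail violation or a significant move the wrong way is RED.
    if result.guardrail_violations > 0 or (significant and not right_direction):
        return "RED"
    # GREEN only when significance, direction, and guardrails all check.
    if significant and right_direction:
        return "GREEN"
    # Assumed rule: everything else is inconclusive.
    return "YELLOW"

print(scorecard(ExperimentResult(0.01, 0.02, True, 0)))  # GREEN
```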

State machine > drift

DRAFT → REVIEW → APPROVED → RUNNING → CONCLUDED with audit logs and Airflow enforcement is what stops experiments from running 90 days after they should have ended.
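The lifecycle can be sketched as an Enum plus a transitions dict. This version allows only the forward transitions named above and keeps the audit log in memory; the class and field names are hypothetical, and the project's own state machine may permit other moves (for example, sending a review back to draft).

```python
from datetime import datetime, timezone
from enum import Enum

class State(Enum):
    DRAFT = "DRAFT"
    REVIEW = "REVIEW"
    APPROVED = "APPROVED"
    RUNNING = "RUNNING"
    CONCLUDED = "CONCLUDED"
    ARCHIVED = "ARCHIVED"

# Allowed transitions: strictly forward, one step at a time (an assumption).
TRANSITIONS = {
    State.DRAFT: {State.REVIEW},
    State.REVIEW: {State.APPROVED},
    State.APPROVED: {State.RUNNING},
    State.RUNNING: {State.CONCLUDED},
    State.CONCLUDED: {State.ARCHIVED},
    State.ARCHIVED: set(),
}

class Experiment:
    def __init__(self, experiment_id: str):
        self.experiment_id = experiment_id
        self.state = State.DRAFT
        self.audit_log = []  # (timestamp, actor, from_state, to_state)

    def transition(self, target: State, actor: str) -> None:
        if target not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {target.name}")
        self.audit_log.append((datetime.now(timezone.utc), actor, self.state, target))
        self.state = target

exp = Experiment("checkout_flow_v2")
exp.transition(State.REVIEW, actor="alice")
exp.transition(State.APPROVED, actor="bob")
print(exp.state.name, len(exp.audit_log))
```

A scheduled job (the Airflow DAG in this project) can then call `transition` for experiments that exceed their maximum duration, so the same guard applies to humans and automation.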

Curriculum · 4 modules · 14-16 hours

Module 01 is free. The rest unlocks with PRO.

Try the first 3-4 hours — model the assignment events, build the dim/fact tables, write dbt tests for bucket integrity, and lock in the dimensional model. If the rhythm clicks, upgrade to unlock the statistical engine, decision scorecard, and lifecycle modules.

P29 · 14-16 hours · 4 modules
Free preview · EXPERT required
Module 01 is free — no card required. Get the dbt schema and dimensional model under your fingers before paying.
M01
Foundation: event modeling + dimensional schema + dbt tests
Model raw.assignment_events / raw.user_action_events / raw.metric_events. Build dim_experiments, dim_variants, fact_assignments with incremental merge strategies. Add dbt tests for uniqueness, referential integrity, and bucket-range invariants (0-9999). Stage the feature-flag snapshot integration. The dimensional spine the next 3 modules sit on.
3-4h · 6 lessons · FREE PREVIEW
Start →
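The invariants that Module 01 enforces with dbt tests can be stated compactly in Python, which is sometimes handy for a pre-flight check on the seed CSVs. This is only a sketch of the rules (one row per user × experiment, bucket in 0-9999, every assignment pointing at a known experiment); the column names and the `check_assignments` helper are assumptions, and in the project itself these checks live in dbt.

```python
from collections import Counter

def check_assignments(rows, known_experiments):
    """Return (kind, detail) tuples for every invariant violation found."""
    failures = []
    # Uniqueness at the fact grain: one row per (user_id, experiment_id).
    grain = Counter((r["user_id"], r["experiment_id"]) for r in rows)
    for key, count in grain.items():
        if count > 1:
            failures.append(("duplicate_grain", key))
    for r in rows:
        # Bucket-range invariant: buckets are integers in 0..9999.
        if not 0 <= r["bucket"] <= 9999:
            failures.append(("bucket_out_of_range", r["user_id"]))
        # Referential integrity: the experiment must exist in the dimension.
        if r["experiment_id"] not in known_experiments:
            failures.append(("orphan_experiment", r["experiment_id"]))
    return failures

rows = [
    {"user_id": "u_1", "experiment_id": "checkout_flow_v2", "bucket": 42},
    {"user_id": "u_1", "experiment_id": "checkout_flow_v2", "bucket": 42},
    {"user_id": "u_2", "experiment_id": "checkout_flow_v2", "bucket": 10000},
    {"user_id": "u_3", "experiment_id": "unknown_exp", "bucket": 7},
]
failures = check_assignments(rows, known_experiments={"checkout_flow_v2"})
print(failures)
```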
M02
Statistical engine: Welch + MDE + power + guardrails
Build the metric registry (10 metrics with SLA freshness). Compute pre/in/post experiment windows. Welch's t-test (scipy.stats.ttest_ind_from_stats) with 95% CI. MDE and required-sample-size with z_alpha + z_beta. One-sided guardrail tests with severity flagging. Rolling z-score anomaly detection. Multi-comparison correction (Bonferroni, Holm-Bonferroni).
4-5h · 6 lessons · EXPERT TIER
Unlock with EXPERT →
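The "z_alpha + z_beta" calculation mentioned in Module 02 can be sketched as follows for a two-sided test on a difference in means with equal-sized arms and a common standard deviation: MDE = (z_{1-α/2} + z_{1-β}) · σ · sqrt(2/n), and solving for n gives n = 2 · ((z_{1-α/2} + z_{1-β}) · σ / δ)². The function names are hypothetical, and the standard library's `NormalDist` stands in for `scipy.stats.norm.ppf`, which returns the same quantiles.

```python
import math
from statistics import NormalDist

def z_sum(alpha: float = 0.05, power: float = 0.80) -> float:
    """z_{1-alpha/2} + z_{1-beta} for a two-sided test, where power = 1 - beta."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)

def mde(sd: float, n_per_arm: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest absolute difference in means detectable with the given power."""
    return z_sum(alpha, power) * sd * math.sqrt(2 / n_per_arm)

def required_n(sd: float, delta: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed per arm to detect an absolute difference of delta."""
    return math.ceil(2 * (z_sum(alpha, power) * sd / delta) ** 2)

n = required_n(sd=1.0, delta=0.1)
print(n, round(mde(sd=1.0, n_per_arm=n), 4))
```

Running the MDE check before launch answers "can this experiment detect the effect we care about?", which is cheaper than discovering an underpowered test after it ends.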
M03
Decision framework: scorecard + segment HTE + SRM
Build mart_variant_comparison + mart_experiment_scorecard with traffic-light (GREEN/YELLOW/RED/GRAY) logic combining primary significance + direction + guardrail outcomes. Per-segment HTE chi-squared with Bonferroni correction. SRM detector (chi-squared on observed vs expected allocation). Decision rules for ship/hold/extend.
3-4h · 6 lessons · EXPERT TIER
Unlock with EXPERT →
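One plausible reading of "per-segment HTE chi-squared with Bonferroni correction" is a 2×2 chi-squared test of conversion by variant inside each segment, with the significance threshold divided by the number of segments. The sketch below follows that reading; the segment names, counts, and the `segment_hte` helper are made up for illustration.

```python
from scipy import stats

def segment_hte(segments, alpha=0.05):
    """segments maps name -> ((control_conversions, control_n), (treatment_conversions, treatment_n))."""
    threshold = alpha / len(segments)  # Bonferroni: split alpha across the segments tested
    report = {}
    for name, ((c_conv, c_n), (t_conv, t_n)) in segments.items():
        table = [[c_conv, c_n - c_conv], [t_conv, t_n - t_conv]]
        chi2, p_value, dof, _expected = stats.chi2_contingency(table, correction=False)
        report[name] = {
            "lift": t_conv / t_n - c_conv / c_n,
            "p": float(p_value),
            "significant": bool(p_value < threshold),
        }
    return report

report = segment_hte({
    "mobile": ((500, 5000), (600, 5000)),
    "desktop": ((500, 5000), (505, 5000)),
    "tablet": ((100, 1000), (80, 1000)),
})
for name, row in report.items():
    print(name, row)
```

In this made-up data the overall winner is carried by mobile, while tablet trends negative without reaching the corrected threshold, which is the kind of pattern an overall lift number hides.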
M04
Production: lifecycle state machine + flag service + governance
Python state machine (Enum + transitions dict + max-duration guardrails) with audit log. Airflow daily enforcement DAG (auto-conclude, auto-archive). FastAPI + Redis flag-eval service with deterministic SHA256(flag:user)%10000 bucketing and sub-10ms p99. dim_feature_flags sync (60s). Governance approvals + traffic budgets outlined.
3-4h · 6 lessons · EXPERT TIER
Unlock with EXPERT →
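The SHA256(flag:user)%10000 bucketing from Module 04 can be sketched with the standard library alone. The helper names, the `treatment_pct` parameter, and the "control" fallback label are assumptions for this example; the Redis cache and FastAPI layer are left out.

```python
import hashlib

BUCKETS = 10_000

def bucket_for(flag_key: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in 0..9999."""
    key = f"{flag_key}:{user_id}"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    # First 8 hex characters = 32 bits, reduced modulo the bucket count.
    return int(digest[:8], 16) % BUCKETS

def evaluate(flag_key: str, user_id: str, treatment_pct: float = 50.0) -> str:
    """Return the variant for a user; the 'control' label is an assumed default."""
    threshold = int(BUCKETS * treatment_pct / 100)
    return "treatment" if bucket_for(flag_key, user_id) < threshold else "control"

print(evaluate("checkout_flow_v2", "u_42"))
```

Because the bucket depends only on the flag key and user id, the same user gets the same variant on every request and every server, with no assignment table lookup. Including the flag key in the hash keeps assignments independent across experiments.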
3 modules locked · Unlock all PRO content for $29/mo
Upgrade to PRO →
Backed by curriculum

Product Thinking for Data Engineers

8 modules · 14 hours · KPIs & metrics · A/B testing · Experimentation infra · Stakeholder communication · Data strategy
Open curriculum

Product Thinking teaches the why behind every model in this project — KPI hierarchy, A/B testing, stakeholder framing. PRO subscribers get full access to every module.

The build, in 3 phases

Three sprints. Three checkpoints. One production-shaped platform.

Each phase ends with a tagged commit and a runnable artifact. No theory decks.

01 · ~4h
Schema + dimensional model + bucketing

Star schema for experimentation: dim_experiments / dim_variants / fact_assignments. Idempotent dbt incremental merges. Schema-registry contracts on the raw event tables.

  • raw.assignment_events + dim/fact dbt models
  • Bucket-range invariants (0-9999) + uniqueness tests
  • Feature-flag snapshot integration
02 · ~6h
Statistical engine + decision scorecard

scipy Welch + CI + MDE + power. Guardrail one-sided tests with severity. Decision scorecard combining primary lift + guardrails + segment HTE. SRM detector.

  • src/stats/significance.py + power_analysis.py + guardrails.py
  • mart_variant_comparison + mart_experiment_scorecard
  • SRM chi-squared + segment HTE Bonferroni
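A one-sided guardrail test asks a narrower question than the primary test: did treatment make this metric worse? A minimal sketch, assuming a metric where higher is worse (such as latency), a Welch-style statistic with Welch-Satterthwaite degrees of freedom, and illustrative severity cut-offs that are not taken from the project:

```python
import math
from scipy import stats

def guardrail_test(m_c, sd_c, n_c, m_t, sd_t, n_t, alpha=0.05):
    """One-sided Welch test of H1: treatment mean > control mean (a regression)."""
    v_c, v_t = sd_c**2 / n_c, sd_t**2 / n_t
    t_stat = (m_t - m_c) / math.sqrt(v_c + v_t)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v_c + v_t) ** 2 / (v_c**2 / (n_c - 1) + v_t**2 / (n_t - 1))
    p_value = float(stats.t.sf(t_stat, df))  # upper-tail probability only
    # Assumed severity bands: critical below alpha/10, warning below alpha.
    if p_value < alpha / 10:
        severity = "critical"
    elif p_value < alpha:
        severity = "warning"
    else:
        severity = "ok"
    return {"t": t_stat, "p": p_value, "severity": severity}

# Hypothetical latency guardrail in ms: control 200, treatment 206, sd 50, 2,000 users each.
print(guardrail_test(200.0, 50.0, 2000, 206.0, 50.0, 2000))
```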
03 · ~5h
Lifecycle state machine + flag service

Python lifecycle state machine with audit log + Airflow enforcement. FastAPI + Redis flag-eval with deterministic SHA256 bucketing.

  • experimentation/lifecycle/state_machine.py
  • dags/experiment_lifecycle_dag.py (Airflow)
  • serving/flag_service.py + flag_sync.py (FastAPI + Redis)
Project setup · 10 minutes

One stack. dbt + Postgres + Python + Airflow + Redis + FastAPI.

Pre-configured docker-compose with Postgres, Redis, Airflow, and the dbt project scaffolded. 5 sample CSVs (50K assignments, 100K events, 50 experiments, 500 feature flags, 10 metric definitions) with intentional quality issues seeded for the dbt tests.

What lives in the repo

Everything you need to run the experimentation platform locally — dbt project with 26+ models, scipy statistical engine, FastAPI flag service, Airflow lifecycle DAGs, and the seeded sample data with quality issues that the dbt tests catch.

  • docker-compose.yml — Postgres, Redis, Airflow scheduler + webserver
  • dbt_project.yml + models/ — 26+ dbt models across staging / intermediate / marts
  • src/stats/ — scipy: significance.py, power_analysis.py, guardrails.py, anomaly_detector.py
  • experimentation/lifecycle/ — state_machine.py + Airflow enforcement DAG
  • serving/ — FastAPI flag_service.py + Redis flag_sync.py
  • seeds/ — 5 sample CSVs with quality issues (50K assignments, 100K events)
Download · Starter Kit

Experimentation Platform Starter Kit

Pre-configured docker-compose stack, dbt project scaffolded with 26+ models, scipy statistical engine, FastAPI flag service, Airflow lifecycle DAGs, and 5 seeded CSVs with intentional quality issues for the dbt tests to catch.

6.3 MB · 101 files · 5 sample CSVs · PRO required
~/projects/ab-testing-platform — zsh
1. Clone and start the stack
$ git clone https://github.com/ai-de/p29-ab-testing-platform
$ cd p29-ab-testing-platform && docker-compose up -d
2. Seed Postgres + run the dbt models
$ make seed # load 5 sample CSVs with quality issues
$ make dbt-build # 26+ models across staging / intermediate / marts
3. Run the dbt tests (catch the seeded quality issues)
$ dbt test --select tag:experimentation
$ # Expected: SRM warning on raw_assignments (48/52 split)
$ # Expected: timestamp test failure on raw_user_events (2% pre-assignment)
4. Run the statistical engine on the seeded experiment
$ python -m src.stats.significance --experiment_id checkout_flow_v2
$ # → Welch's t-test, 95% CI, MDE check, GREEN/YELLOW/RED scorecard
5. Boot the flag service + hit it
$ uvicorn serving.flag_service:app --reload
$ curl -X POST localhost:8000/flags/checkout_flow_v2/evaluate -H 'Content-Type: application/json' -d '{"user_id": "u_42"}'
50K assignment events · 100K user-action events · 50 experiments · 10 metric registry
Production hardening

The same A/B test — but built for the statistical-rigor case.

Most experimentation tutorials show you WHERE variant = ‘treatment’ and a conversion diff. This one shows what changes when you actually need to defend the result to a calibration committee or a launch review.

Tutorial-grade A/B test · What you have today
  × Bucketing: App-side random() with no sticky behavior
  × Significance: Eyeball the conversion-rate diff
  × Decision: “B looks like it’s winning”
  × Allocation drift: Trust the 50/50 split and ship
  × Stopping: End the test the day p < 0.05
  × Segments: “Conversion went up overall”
Production-grade platform · Module 01–04
  • Bucketing: Deterministic SHA256(flag_key:user_id) % 10000 with sticky bucketing across sessions
  • Significance: Welch’s t-test + 95% CI + MDE check + power validation (scipy)
  • Decision: GREEN scorecard requires p<0.05 + correct direction + no guardrail violation
  • Allocation drift: chi-squared SRM detector flags 48/52 splits before you trust the result
  • Stopping: Pre-registered duration with alpha-spending; no peeking without sequential correction (out of scope, called out)
  • Segments: Per-segment HTE chi-squared + Bonferroni; an overall winner can hide a losing segment
PRO benefit · code review

Real review from senior engineers who’ve shipped this stack.

Submit your repo, get line-by-line feedback within 48 hours. The kind of review that's quietly worth thousands of dollars in time-to-staff.


4 reviews / month

Submit a repo, a PR, or a refactor proposal. Reviewer is matched to your domain — experimentation + dbt + scipy for this project. Async, comments inline, average turnaround 31 hours.

31h avg turnaround · 9.2/10 helpfulness · 94% return next month

2 office hours / month

Live 30-min sessions with a senior data engineer or product DE. Architecture questions, whiteboard a tricky stat method, mock a system-design interview. Group sessions also available.

30 min per session · 2 / mo included · + group unlimited
What PRO unlocks

One subscription. 15+ projects, all curriculum, code review.

PRO is built for senior+ engineers who want production-grade builds and feedback loops — not more tutorials.

What you get · FREE · PRO · EXPERT
  • Projects (Production-grade builds) · FREE: 2 · PRO: 15+ · EXPERT: 8
  • Curriculum modules (All 7 tracks) · FREE: Phase 1 only · PRO: All · EXPERT: All + bonus
  • Code review credits (Senior engineer review) · FREE: 0 · PRO: 4 / month · EXPERT: Unlimited
  • Career path access (5 paths × full plans) · FREE: 1 path · PRO: All 5 · EXPERT: All 5 + 1:1
  • Certificate (Verifiable on LinkedIn) · PRO: Yes · EXPERT: Yes + portfolio review
  • Community (Discord + office hours) · FREE: Read-only · PRO: Full + 2/mo · EXPERT: Full + 4/mo
$29/mo · billed monthly · cancel anytime
or annual: $249/yr · save 28%
Unlock EXPERT
Who this is for

Pick this if you’re shipping experiments, not learning to.


Analytics engineers

You ship dbt models for experimentation but the stats sit in a notebook. This puts the stats in the warehouse next to the metrics.


Product DEs

You support PMs running experiments. This is the platform you wish someone had built — schema + scorecards + governance + a flag service.


Senior DE → Tech Lead

You're scoping the experimentation platform at your company. This project is the prior art — read it, don't re-derive it.


Platform DEs

You run the data platform. This shows the experimentation primitives — assignment, exposure, decision, governance — without coupling to one warehouse.

FAQ

Quick answers.

No — and the catalog used to claim Kafka + ClickHouse, which was misleading. The real stack is dbt + Postgres + Python (scipy) + Airflow + FastAPI + Redis. The patterns transfer cleanly to BigQuery / Snowflake / ClickHouse with engine-specific SQL changes only.
No. This project is frequentist — Welch's t-test + CI + MDE + power + Bonferroni for multi-comparison correction. Sequential testing and Bayesian A/B are deliberately out of scope; the SEO used to over-promise this and we corrected it.
Mentioned as an interview-prep concept in module 02 only. Not implemented. If demand is there, we may add a P5 module covering CUPED + sequential testing later.
No. Postgres + dbt-core + Airflow + Redis all run locally via docker-compose. The patterns transfer to BigQuery / Snowflake / Redshift with config and SQL-dialect changes only.
P11 is contracts + lineage + RBAC for any data domain. P29 (this) is experimentation-specific: statistical rigor + decision logic + lifecycle governance for A/B tests. They share dbt + governance posture but the deliverables are different.
Yes — especially the system-design rounds at Netflix, Airbnb, LinkedIn, Booking.com, Stripe. After this you can whiteboard the schema, walk through Welch + MDE + SRM, defend the GREEN/YELLOW/RED scorecard, and reason about lifecycle governance without hand-waving.

Ready to ship a real experimentation platform?

Start with module 01 — free, no card. About 3-4 hours. By the end you'll have the dimensional model, dbt tests, and bucket-range invariants under your fingers — the schema spine the next 3 modules sit on.
