ai-de.net/Projects/P29 · Experimentation Platform on dbt + scipy

PRO · module 01 free preview · Staff Engineering track · P29

Build a production-grade experimentation platform on dbt + scipy

A real experimentation data platform, not a tutorial. Land 50K assignment events, 100K user-action events, and 50 experiments through a star-schema dimensional model with idempotent merges. Compute Welch's t-tests, confidence intervals, MDE, and power in scipy. Detect SRM and segment HTE with chi-squared + Bonferroni. Drive a GREEN/YELLOW/RED ship-decision scorecard. Govern lifecycle with a Python state machine. Serve flags from FastAPI + Redis with deterministic SHA256 bucketing.

Timeline: 14-16 hours

Difficulty: Senior+

Stack: dbt · Postgres · scipy · FastAPI · Airflow

See PRO benefits

The system-design question Netflix, Airbnb, LinkedIn, and Booking.com ask in analytics + product-DE rounds — “walk me through how you’d build experimentation infra at scale.” This project gives you the artifacts: schema, stats, scorecard, governance.

By the end you will have wired

26+ dbt models — dim_experiments, dim_variants, fact_assignments, dim_metric_registry, mart_variant_comparison, mart_experiment_scorecard
scipy statistical engine — Welch's t-test + 95% CI, MDE + power analysis, guardrail one-sided tests, rolling-z-score anomaly detection
Decision framework — GREEN/YELLOW/RED scorecard combining primary lift + guardrail outcomes + segment HTE (chi-squared + Bonferroni)
SRM detector — chi-squared on observed vs expected variant allocation, severity-flagged before you trust the result
Lifecycle state machine — DRAFT → REVIEW → APPROVED → RUNNING → CONCLUDED → ARCHIVED with audit log + Airflow daily enforcement DAG
Real-time flag service — FastAPI + Redis with deterministic SHA256(flag_key:user_id)%10000 bucketing and sub-10ms eval

PREREQ · Comfortable with dbt + Postgres + Python (scipy fluency helpful but not required). Pairs well with the Product Thinking curriculum (KPIs, A/B testing, experimentation infra) and the dbt deep dive.

[Architecture diagram: experiments.warehouse.* · scorecard live · p < 0.05 · SRM ok]

Events: raw.assignment (50K · bucket 0–9999); raw.user_action (100K · JSONB ctx); raw.experiments (50 · feature flags); + dbt tests · contracts

dbt models: dim_experiments (incremental · merge); fact_assignments (grain: user × exp); mart_scorecard (GREEN / YELLOW / RED); 26+ models · staging→marts

scipy engine: significance.py (Welch + 95% CI); power_analysis.py (MDE + sample size); guardrails.py (SRM · HTE · Bonferroni); scipy.stats · numpy

Outputs: scorecard (SHIP · HOLD · EXTEND); state_machine (DRAFT → ARCHIVED); flag_service (FastAPI + Redis); sub-10ms p99

# Statistical rigor (Module 02)

t, p = scipy.stats.ttest_ind_from_stats(m_a, sd_a, n_a, m_b, sd_b, n_b, equal_var=False)  # equal_var=False selects Welch; the default pools variances

CI = diff ± z_0.975 * SE_diff

→ Welch + 95% CI + MDE + power

● Deterministic bucketing (Module 04)

key = f"{flag_key}:{user_id}"

bucket = int(hashlib.sha256(key.encode()).hexdigest()[:8], 16) % 10000

return "treatment" if bucket < 5000 else …

→ sticky · cached in Redis · sub-10ms

26+ dbt models · metric registry · p<0.05 with CI + MDE

Why this matters in 2026

Every product team runs experiments. Few have rigor.

Most A/B testing in the wild is eyeballed conversion-rate diffs. The pattern that wins promo and senior+ interviews is: deterministic bucketing, Welch + CI + MDE, guardrails, SRM detection, and a scorecard that says GREEN only when significance + direction + guardrails all check.

p-value > eyeball

Welch's t-test with 95% CI and MDE check is the difference between “B looks better” and “we ship B.” This project ships the scipy code, not just the chart.
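To make the gap between an eyeballed diff and a defensible result concrete, here is a minimal sketch of a Welch comparison from summary statistics. It assumes scipy is installed; the function name `welch_from_stats` and the sample numbers are hypothetical and are not taken from the project's significance.py. The confidence interval here uses the t critical value at the Welch-Satterthwaite degrees of freedom, which is very close to the z_0.975 approximation when samples are large.

```python
import math

from scipy import stats


def welch_from_stats(m_a, sd_a, n_a, m_b, sd_b, n_b, alpha=0.05):
    """Compare variant B against variant A from summary statistics (hypothetical helper)."""
    # equal_var=False is what makes this Welch's test; the default pools variances.
    t_stat, p_value = stats.ttest_ind_from_stats(
        m_b, sd_b, n_b, m_a, sd_a, n_a, equal_var=False
    )
    diff = m_b - m_a
    var_a, var_b = sd_a**2 / n_a, sd_b**2 / n_b
    se = math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (var_a + var_b) ** 2 / (var_a**2 / (n_a - 1) + var_b**2 / (n_b - 1))
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return {
        "diff": diff,
        "se": se,
        "t_stat": float(t_stat),
        "p_value": float(p_value),
        "ci_low": diff - t_crit * se,
        "ci_high": diff + t_crit * se,
    }


if __name__ == "__main__":
    # Illustrative numbers only.
    print(welch_from_stats(10.0, 2.0, 1000, 10.3, 2.2, 1000))
```

A result is only worth acting on when the interval excludes zero and the observed difference is at least the pre-registered MDE.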

SRM > trust the split

Sample-ratio mismatch silently invalidates 5-10% of experiments at most companies. A chi-squared SRM detector flags allocation drift before anyone trusts the lift.
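A minimal sketch of a two-variant SRM check, assuming a planned 50/50 split. It uses only the standard library: with two variants the chi-squared statistic has one degree of freedom, so the p-value reduces to erfc(sqrt(chi2 / 2)). The function name, the 0.001 alert threshold, and the counts are illustrative assumptions; with more than two variants, `scipy.stats.chisquare` would be the more natural tool.

```python
import math


def srm_check(n_control, n_treatment, expected_ratio=0.5, threshold=0.001):
    """Chi-squared goodness-of-fit on observed vs expected allocation (two variants)."""
    total = n_control + n_treatment
    exp_control = total * expected_ratio
    exp_treatment = total * (1 - expected_ratio)
    chi2 = (
        (n_control - exp_control) ** 2 / exp_control
        + (n_treatment - exp_treatment) ** 2 / exp_treatment
    )
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < threshold


if __name__ == "__main__":
    # A 48/52 split on 50K assignments, as in the seeded data described on this page.
    chi2, p_value, flagged = srm_check(24_000, 26_000)
    print(f"chi2={chi2:.1f} p={p_value:.3g} srm_flagged={flagged}")
```

The strict threshold is a common design choice for SRM alerts: the check runs on every experiment, so a looser cutoff would generate frequent false alarms.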

Scorecard > debate

GREEN/YELLOW/RED logic — primary significance + correct direction + no guardrail violation — turns “should we ship” into a deterministic answer.
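One way the traffic-light rule could be expressed is sketched below. Only the GREEN condition (significance + correct direction + no guardrail violation) comes from this page; the RED and YELLOW branches, the function name, and the parameter names are assumptions made for illustration and may differ from the project's mart logic.

```python
def scorecard(p_value, lift, guardrail_violations, higher_is_better=True, alpha=0.05):
    """Return GREEN, YELLOW, or RED for one experiment (illustrative rules)."""
    significant = p_value < alpha
    right_direction = lift > 0 if higher_is_better else lift < 0

    # Assumption: any guardrail violation blocks shipping outright.
    if guardrail_violations > 0:
        return "RED"
    # From the page: GREEN needs significance + correct direction + clean guardrails.
    if significant and right_direction:
        return "GREEN"
    # Assumption: a significant move in the wrong direction is also RED.
    if significant:
        return "RED"
    # Assumption: everything else is inconclusive.
    return "YELLOW"


if __name__ == "__main__":
    print(scorecard(p_value=0.01, lift=0.02, guardrail_violations=0))
    print(scorecard(p_value=0.20, lift=0.02, guardrail_violations=0))
```

Keeping the rule in a single pure function makes every decision reproducible from the stored p-value, lift, and guardrail count.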

State machine > drift

DRAFT → REVIEW → APPROVED → RUNNING → CONCLUDED with audit logs and Airflow enforcement is what stops experiments from running 90 days after they should have ended.
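The lifecycle can be sketched as an Enum plus a transitions dict, the shape Module 04 describes. This is a simplified illustration: the class and exception names are hypothetical, only the strictly linear path listed on this page is allowed, and the max-duration guardrails and Airflow enforcement are left out.

```python
from datetime import datetime, timezone
from enum import Enum


class State(Enum):
    DRAFT = "DRAFT"
    REVIEW = "REVIEW"
    APPROVED = "APPROVED"
    RUNNING = "RUNNING"
    CONCLUDED = "CONCLUDED"
    ARCHIVED = "ARCHIVED"


# Allowed next states for each state (linear path only, as listed on the page).
TRANSITIONS = {
    State.DRAFT: {State.REVIEW},
    State.REVIEW: {State.APPROVED},
    State.APPROVED: {State.RUNNING},
    State.RUNNING: {State.CONCLUDED},
    State.CONCLUDED: {State.ARCHIVED},
    State.ARCHIVED: set(),
}


class InvalidTransition(Exception):
    """Raised when a requested state change is not in TRANSITIONS."""


class Experiment:
    def __init__(self, experiment_id):
        self.experiment_id = experiment_id
        self.state = State.DRAFT
        self.audit_log = []

    def transition(self, new_state, actor):
        if new_state not in TRANSITIONS[self.state]:
            raise InvalidTransition(f"{self.state.name} -> {new_state.name} is not allowed")
        # Record who moved the experiment, from where, to where, and when.
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "from": self.state.name,
            "to": new_state.name,
        })
        self.state = new_state
```

Because every change goes through `transition`, an experiment cannot jump from REVIEW straight to RUNNING, and the audit log doubles as the history a daily enforcement job could inspect.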

Curriculum · 4 modules · 14-16 hours

Module 01 is free. The rest unlocks with PRO.

Try the first 3-4 hours — model the assignment events, build the dim/fact tables, write dbt tests for bucket integrity, and lock in the dimensional model. If the rhythm clicks, upgrade to unlock the statistical engine, decision scorecard, and lifecycle modules.

P29 · 14-16 hours · 4 modules

Free preview · EXPERT required

Module 01 is free — no card required. Get the dbt schema and dimensional model under your fingers before paying.

M01

✓ Foundation: event modeling + dimensional schema + dbt tests

Model raw.assignment_events / raw.user_action_events / raw.metric_events. Build dim_experiments, dim_variants, fact_assignments with incremental merge strategies. Add dbt tests for uniqueness, referential integrity, and bucket-range invariants (0-9999). Stage the feature-flag snapshot integration. This is the dimensional spine the next 3 modules sit on.

3-4h · 6 lessons · FREE PREVIEW

Start →

M02

⊘ Statistical engine: Welch + MDE + power + guardrails

Build the metric registry (10 metrics with SLA freshness). Compute pre/in/post experiment windows. Run Welch's t-test (scipy.stats.ttest_ind_from_stats with equal_var=False) with a 95% CI. Derive MDE and required sample size from z_alpha + z_beta. Add one-sided guardrail tests with severity flagging, rolling z-score anomaly detection, and multiple-comparison correction (Bonferroni, Holm-Bonferroni).

4-5h · 6 lessons · EXPERT TIER

Unlock with EXPERT →

M03

⊘ Decision framework: scorecard + segment HTE + SRM

Build mart_variant_comparison + mart_experiment_scorecard with traffic-light (GREEN/YELLOW/RED/GRAY) logic combining primary significance + direction + guardrail outcomes. Per-segment HTE chi-squared with Bonferroni correction. SRM detector (chi-squared on observed vs expected allocation). Decision rules for ship/hold/extend.

3-4h · 6 lessons · EXPERT TIER

Unlock with EXPERT →

M04

⊘ Production: lifecycle state machine + flag service + governance

Python state machine (Enum + transitions dict + max-duration guardrails) with audit log. Airflow daily enforcement DAG (auto-conclude, auto-archive). FastAPI + Redis flag-eval service with deterministic SHA256(flag_key:user_id) % 10000 bucketing and sub-10ms p99. dim_feature_flags sync (60s). Governance approvals + traffic budgets outlined.

3-4h · 6 lessons · EXPERT TIER

Unlock with EXPERT →

3 modules locked · Unlock all PRO content for $29/mo

Upgrade to PRO →

Backed by curriculum

Product Thinking for Data Engineers

8 modules · 14 hours · KPIs & metrics · A/B testing · Experimentation infra · Stakeholder communication · Data strategy

Open curriculum→

Product Thinking teaches the why behind every model in this project — KPI hierarchy, A/B testing, stakeholder framing. PRO subscribers get full access to every module.

The build, in 3 phases

Three sprints. Three checkpoints. One production-shaped platform.

Each phase ends with a tagged commit and a runnable artifact. No theory decks.

01 · ~4h

Schema + dimensional model + bucketing

Star schema for experimentation: dim_experiments / dim_variants / fact_assignments. Idempotent dbt incremental merges. Schema-registry contracts on the raw event tables.

✓ raw.assignment_events + dim/fact dbt models
✓ Bucket-range invariants (0-9999) + uniqueness tests
✓ Feature-flag snapshot integration
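The idea behind an idempotent merge plus a bucket-range invariant can be illustrated without the full stack. The sketch below uses Python's built-in sqlite3 as a stand-in for Postgres and dbt, so it is an analogy and not the project's dbt code: the table layout, helper names, and sample rows are assumptions. Re-running the same batch leaves the row count unchanged, and a bucket outside 0-9999 is rejected.

```python
import sqlite3

DDL = """
CREATE TABLE fact_assignments (
    user_id       TEXT NOT NULL,
    experiment_id TEXT NOT NULL,
    variant       TEXT NOT NULL,
    bucket        INTEGER NOT NULL CHECK (bucket BETWEEN 0 AND 9999),
    PRIMARY KEY (user_id, experiment_id)  -- grain: user x experiment
)
"""

# Upsert keyed on the grain, so replaying a batch updates rows instead of duplicating them.
UPSERT = """
INSERT INTO fact_assignments (user_id, experiment_id, variant, bucket)
VALUES (?, ?, ?, ?)
ON CONFLICT (user_id, experiment_id) DO UPDATE SET
    variant = excluded.variant,
    bucket  = excluded.bucket
"""


def make_warehouse():
    conn = sqlite3.connect(":memory:")
    conn.execute(DDL)
    return conn


def load_batch(conn, rows):
    # The connection context manager commits on success and rolls back on error.
    with conn:
        conn.executemany(UPSERT, rows)


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM fact_assignments").fetchone()[0]


if __name__ == "__main__":
    conn = make_warehouse()
    batch = [("u_1", "checkout_flow_v2", "control", 1234),
             ("u_2", "checkout_flow_v2", "treatment", 7321)]
    load_batch(conn, batch)
    load_batch(conn, batch)  # replaying the batch is a no-op for the row count
    print(row_count(conn))
```

In a dbt project the same two guarantees would typically come from an incremental model with a unique key and from tests on the bucket column.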

02 · ~6h

Statistical engine + decision scorecard

scipy Welch + CI + MDE + power. Guardrail one-sided tests with severity. Decision scorecard combining primary lift + guardrails + segment HTE. SRM detector.

✓ src/stats/significance.py + power_analysis.py + guardrails.py
✓ mart_variant_comparison + mart_experiment_scorecard
✓ SRM chi-squared + segment HTE Bonferroni
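The MDE and sample-size step in this phase rests on the z_alpha + z_beta formula. A minimal standard-library sketch follows, assuming a two-sided test, equal allocation across two arms, and a known common standard deviation; the function names and example values are hypothetical and not taken from power_analysis.py.

```python
import math
from statistics import NormalDist


def _z_sum(alpha, power):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    return z_alpha + z_beta


def required_sample_size(sigma, mde, alpha=0.05, power=0.80):
    """Users needed per arm to detect an absolute difference of `mde`."""
    return math.ceil(2 * _z_sum(alpha, power) ** 2 * sigma**2 / mde**2)


def minimum_detectable_effect(sigma, n_per_arm, alpha=0.05, power=0.80):
    """Smallest absolute difference detectable with `n_per_arm` users per arm."""
    return _z_sum(alpha, power) * sigma * math.sqrt(2 / n_per_arm)


if __name__ == "__main__":
    n = required_sample_size(sigma=1.0, mde=0.1)
    print(n, round(minimum_detectable_effect(1.0, n), 4))
```

The two functions are inverses of each other: fix the MDE to get the sample size before launch, or fix the available traffic to see what effect the experiment can realistically detect.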

03 · ~5h

Lifecycle state machine + flag service

Python lifecycle state machine with audit log + Airflow enforcement. FastAPI + Redis flag-eval with deterministic SHA256 bucketing.

✓ experimentation/lifecycle/state_machine.py
✓ dags/experiment_lifecycle_dag.py (Airflow)
✓ serving/flag_service.py + flag_sync.py (FastAPI + Redis)
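Deterministic bucketing is the part of the flag service that is easiest to show in isolation. The sketch below follows the SHA256(flag_key:user_id) % 10000 scheme named on this page; the function names, the `treatment_pct` parameter, and the "control" fallback label are assumptions, and the Redis caching and FastAPI layers are left out.

```python
import hashlib


def bucket_for(flag_key, user_id, buckets=10_000):
    """Map a (flag, user) pair to a stable bucket in [0, buckets)."""
    key = f"{flag_key}:{user_id}"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    # The first 8 hex characters are 32 bits, plenty for 10,000 buckets.
    return int(digest[:8], 16) % buckets


def assign_variant(flag_key, user_id, treatment_pct=50):
    """Users below the cutoff bucket get treatment; the rest get control (assumed label)."""
    cutoff = treatment_pct * 100  # 50% of 10,000 buckets -> 5,000
    return "treatment" if bucket_for(flag_key, user_id) < cutoff else "control"


if __name__ == "__main__":
    print(bucket_for("checkout_flow_v2", "u_42"))
    print(assign_variant("checkout_flow_v2", "u_42"))
```

Hashing the flag key together with the user id keeps a user sticky within one experiment while decorrelating assignments across experiments, which random() on the app side cannot guarantee.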

Project setup · 10 minutes

One stack. dbt + Postgres + Python + Airflow + Redis + FastAPI.

Pre-configured docker-compose with Postgres, Redis, Airflow, and the dbt project scaffolded. 5 sample CSVs (50K assignments, 100K events, 50 experiments, 500 feature flags, 10 metric definitions) with intentional quality issues seeded for the dbt tests.

What lives in the repo

Everything you need to run the experimentation platform locally — dbt project with 26+ models, scipy statistical engine, FastAPI flag service, Airflow lifecycle DAGs, and the seeded sample data with quality issues that the dbt tests catch.

docker-compose.yml — Postgres, Redis, Airflow scheduler + webserver
dbt_project.yml + models/ — 26+ dbt models across staging / intermediate / marts
src/stats/ — scipy: significance.py, power_analysis.py, guardrails.py, anomaly_detector.py
experimentation/lifecycle/ — state_machine.py + Airflow enforcement DAG
serving/ — FastAPI flag_service.py + Redis flag_sync.py
seeds/ — 5 sample CSVs with quality issues (50K assignments, 100K events)

Download · Starter Kit

Experimentation Platform Starter Kit

Pre-configured docker-compose stack, dbt project scaffolded with 26+ models, scipy statistical engine, FastAPI flag service, Airflow lifecycle DAGs, and 5 seeded CSVs with intentional quality issues for the dbt tests to catch.

6.3 MB · 101 files · 5 sample CSVs · PRO required

~/projects/ab-testing-platform — zsh

1. Clone and start the stack

$ git clone https://github.com/ai-de/p29-ab-testing-platform
$ cd p29-ab-testing-platform && docker-compose up -d

2. Seed Postgres + run the dbt models

$ make seed       # load 5 sample CSVs with quality issues
$ make dbt-build  # 26+ models across staging / intermediate / marts

3. Run the dbt tests (catch the seeded quality issues)

$ dbt test --select tag:experimentation
# Expected: SRM warning on raw_assignments (48/52 split)
# Expected: timestamp test failure on raw_user_events (2% pre-assignment)

4. Run the statistical engine on the seeded experiment

$ python -m src.stats.significance --experiment_id checkout_flow_v2
# → Welch's t-test, 95% CI, MDE check, GREEN/YELLOW/RED scorecard

5. Boot the flag service + hit it

$ uvicorn serving.flag_service:app --reload
$ curl -X POST localhost:8000/flags/checkout_flow_v2/evaluate -H 'Content-Type: application/json' -d '{"user_id": "u_42"}'

50K assignment events · 100K user-action events · experiments · metric registry

Production hardening

The same A/B test — but built for the statistical-rigor case.

Most experimentation tutorials show you WHERE variant = 'treatment' and a conversion diff. This one shows what changes when you actually need to defend the result to a calibration committee or a launch review.

Tutorial-grade A/B test (what you have today) vs. production-grade platform (Modules 01–04)

Bucketing
Tutorial-grade: App-side random() with no sticky behavior
✓ Production-grade: Deterministic SHA256(flag_key:user_id) % 10000 with sticky bucketing across sessions

Significance
Tutorial-grade: Eyeball the conversion-rate diff
✓ Production-grade: Welch’s t-test + 95% CI + MDE check + power validation (scipy)

Decision
Tutorial-grade: “B looks like it’s winning”
✓ Production-grade: GREEN scorecard requires p<0.05 + correct direction + no guardrail violation

Allocation drift
Tutorial-grade: Trust the 50/50 split and ship
✓ Production-grade: chi-squared SRM detector flags 48/52 splits before you trust the result

Stopping
Tutorial-grade: End the test the day p < 0.05
✓ Production-grade: Pre-registered duration with alpha-spending; no peeking without sequential correction (out-of-scope, called out)

Segments
Tutorial-grade: “Conversion went up overall”
✓ Production-grade: Per-segment HTE chi-squared + Bonferroni. An overall winner can hide a losing segment
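The segment row depends on correcting for multiple comparisons: testing many segments at alpha = 0.05 each inflates the chance of a false positive. Below is a minimal pure-Python sketch of Bonferroni alongside the Holm-Bonferroni step-down variant that Module 02 also lists. The function names and the example p-values are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject each hypothesis only if p <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_bonferroni(p_values, alpha=0.05):
    """Step-down variant: compare sorted p-values against alpha / (m - rank)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


if __name__ == "__main__":
    # Hypothetical per-segment p-values, e.g. four device or region segments.
    segment_p = [0.001, 0.015, 0.04, 0.20]
    print(bonferroni(segment_p))
    print(holm_bonferroni(segment_p))
```

Holm-Bonferroni controls the same family-wise error rate as Bonferroni but never rejects fewer hypotheses, which is why it is often preferred once the segment count grows.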

PRO benefit · code review

Real review from senior engineers who’ve shipped this stack.

Submit your repo, get line-by-line feedback within 48 hours. The kind of review that's quietly worth thousands of dollars in time-to-staff.

4 reviews / month

Submit a repo, a PR, or a refactor proposal. Reviewer is matched to your domain — experimentation + dbt + scipy for this project. Async, comments inline, average turnaround 31 hours.

31h avg turnaround · 9.2/10 helpfulness · 94% return next month

2 office hours / month

Live 30-min sessions with a senior data engineer or product DE. Architecture questions, whiteboard a tricky stat method, mock a system-design interview. Group sessions also available.

30 min per session · 2 / mo included · + group sessions, unlimited

What PRO unlocks

One subscription. 15+ projects, all curriculum, code review.

PRO is built for senior+ engineers who want production-grade builds and feedback loops — not more tutorials.

What you get (FREE / PRO / EXPERT)

Projects (production-grade builds): 15+

Curriculum modules (all 7 tracks): FREE Phase 1 only · PRO All · EXPERT All + bonus

Code review credits (senior engineer review): 4 / month · Unlimited

Career path access (5 paths × full plans): FREE 1 path · PRO All 5 · EXPERT All 5 + 1:1

Certificate (verifiable on LinkedIn): FREE not included · PRO Yes · EXPERT Yes + portfolio review

Community (Discord + office hours): FREE Read-only · PRO Full + 2/mo · EXPERT Full + 4/mo

$29/mo

billed monthly · cancel anytime

or annual

$249/yr save 28%

Unlock EXPERT →

Who this is for

Pick this if you’re shipping experiments, not learning to.

Analytics engineers

You ship dbt models for experimentation but the stats sit in a notebook. This puts the stats in the warehouse next to the metrics.

Product DEs

You support PMs running experiments. This is the platform you wish someone had built — schema + scorecards + governance + a flag service.

Senior DE → Tech Lead

You're scoping the experimentation platform at your company. This project is the prior art — read it, don't re-derive it.

Platform DEs

You run the data platform. This shows the experimentation primitives — assignment, exposure, decision, governance — without coupling to one warehouse.

Related curriculum

Going deeper? Three tracks back this platform.

Experimentation is a meeting point of modeling, the dbt engine, and observability. These three curriculums let you go deeper on the layers underneath every model in this project.

FAQ

Quick answers.

Does this use Kafka or ClickHouse?

No — and the catalog used to claim Kafka + ClickHouse, which was misleading. The real stack is dbt + Postgres + Python (scipy) + Airflow + FastAPI + Redis. The patterns transfer cleanly to BigQuery / Snowflake / ClickHouse with engine-specific SQL changes only.

Are sequential testing or Bayesian methods covered?

No. This project is frequentist — Welch's t-test + CI + MDE + power + Bonferroni for multi-comparison correction. Sequential testing and Bayesian A/B are deliberately out of scope; the SEO used to over-promise this and we corrected it.

What about CUPED variance reduction?

Mentioned as an interview-prep concept in module 02 only. Not implemented. If demand is there, we may add a P5 module covering CUPED + sequential testing later.

Do I need a real cloud warehouse or AWS credentials?

No. Postgres + dbt-core + Airflow + Redis all run locally via docker-compose. The patterns transfer to BigQuery / Snowflake / Redshift with config and SQL-dialect changes only.

How is this different from data-governance-contracts (P11)?

P11 is contracts + lineage + RBAC for any data domain. P29 (this) is experimentation-specific: statistical rigor + decision logic + lifecycle governance for A/B tests. They share dbt + governance posture but the deliverables are different.

Will this help with senior+ data engineering interviews?

Yes — especially the system-design rounds at Netflix, Airbnb, LinkedIn, Booking.com, Stripe. After this you can whiteboard the schema, walk through Welch + MDE + SRM, defend the GREEN/YELLOW/RED scorecard, and reason about lifecycle governance without hand-waving.

Ready to ship a real experimentation platform?

Start with module 01 — free, no card. About 3-4 hours. By the end you'll have the dimensional model, dbt tests, and bucket-range invariants under your fingers — the schema spine the next 3 modules sit on.

See PRO benefits

Going deeper? Three tracks back this platform.

Advanced Data Modeling & Architecture

dbt & Analytics Engineering

Data Observability & Quality
