ai-de.net/Projects/P25 · DataGuard Reliability
PRO · Part 0 free preview · Platform track · P25

Operate the data platform like Google SRE operates production, across 5 stages

Build the dependency graph that maps the platform, the dep-aware SLO ladder, and the alerter that turns 47 symptom alerts into one root-cause page. Inject 5 chaos types, recover with retry · backfill · circuit-breaker · validation, then own the incident through SEV1–SEV4 with blameless 5-Whys postmortems and a cost-vs-reliability calculator that knows the cost of each additional 9.

Timeline: 13-15 hours
Difficulty: Senior+
Stack: Python · Postgres · Prometheus · graph algos

This is the system-design answer for “how would you operate a data platform across 4 teams without paging on every symptom?” — asked at staff DE rounds at Uber, Airbnb, Meta, Stripe.

By the end you will have shipped
  • A 6-pipeline dependency graph with BFS, DFS, and topological sort plus a failure-surface analysis (the spine for every other layer)
  • A three-tier SLO ladder with error-budget burn-rate tracking and dependency-aware violation detection — upstream failures don't fire downstream pages
  • A DependencyBasedAlerter that suppresses inherited alerts and routes by severity (SEV1 → on-call page, SEV2 → Slack, SEV3 → email)
  • A FailureSimulator with 5 chaos types (data_loss · schema_drift · upstream_down · latency_spike · duplicate_data) plus retry/backoff, IdempotentLoader, CheckpointManager, BackfillSystem, and a CircuitBreaker state machine
  • An incident lifecycle (Detect → Respond → Resolve → Learn) with 30+ Postgres tables, SEV1–SEV4 state machine, 5-Whys postmortems with why_1..why_5 columns, RootCauseTracer (DFS), and 10 root-cause categories
  • A multi-team platform model — teams + pipeline_ownership + access-pattern matrix — with a cost-vs-reliability calculator (99% = 1x · 99.9% = 5–10x · 99.99% = 50–100x)
PREREQ: Built for senior+ data engineers who are comfortable with Python (classes, dataclasses, enums), SQL DDL + joins, and Docker. Prior exposure to data observability or system design helps but isn't required. SRE familiarity is a bonus.
dataguard.reliability.* · 6 pipelines · 0 cascading pages · 47 symptoms → 1 page

Map · Measure (failure surfaces + tier ladder)
  • dep_graph: 6 pipelines · BFS / DFS / topo
  • tier_config: 99.9 / 99.5 / 95
  • error_budget: burn-rate alerts · dep-aware
  • 800 violations: seeded · burning
Monitor (root-cause-only paging)
  • DependencyBasedAlerter: 47 symptoms → 1 page · uses MAP graph
  • severity routing: page · Slack · email
  • alert_history: 5-min dedup window
Recover (chaos · recover · validate)
  • FailureSim: 5 chaos types · drop · drift · down
  • CircuitBreaker: CLOSED → OPEN → HALF_OPEN
  • backfill + retry: dry-run + validate
Own (incidents → playbook → action items)
  • SEV1: incident lifecycle
  • 5-Whys: postmortem template
  • RCA × 10: root-cause categories
  • teams: 30+ tables owned
# Seeded reliability platform
10K pipeline runs · 50 incidents (15 open)
800 SLA violations · 20 postmortems
real burning budgets to fix in Part 0
→ DependencyBasedAlerter cuts 47 → 1
● Staff-level depth
5 chaos types · 4 recovery patterns
10 root-cause categories · 5-Whys
cost-vs-reliability calculator built in
→ platform-grade reliability spine
10K runs · 50 incidents seeded
5 chaos types · 4 recovery patterns
10 root-cause categories
Why this matters in 2026

Data platforms are first-class production systems — not back-office plumbing.

When the revenue dashboard shows $0, the data team is on the hook the same way the API team is when /checkout returns 500. The patterns you wire here — dependency-aware SLOs, root-cause-only paging, chaos engineering, blameless postmortems — are the staff-level rubric every data-platform interview now grades against.

47 symptom alerts is the on-call exit interview

When one upstream pipeline drops 15% of rows, the naive setup pages on every downstream symptom — schema check, row count, freshness, 12 dependent dashboards. The DependencyBasedAlerter turns that into one root-cause page. The difference between staying on-call and quitting.
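The suppression idea can be sketched in a few lines. This is a minimal illustration, not the project's DependencyBasedAlerter: the pipeline names, the `DEPENDS_ON` mapping, and both function names are assumptions made up for the example. An alerting pipeline is treated as a root cause only if none of its transitive upstreams is also alerting.

```python
# Hypothetical 6-pipeline graph: pipeline -> the upstreams it depends on.
DEPENDS_ON = {
    "raw_orders": [],
    "raw_users": [],
    "stg_orders": ["raw_orders"],
    "stg_users": ["raw_users"],
    "fct_revenue": ["stg_orders", "stg_users"],
    "revenue_dashboard": ["fct_revenue"],
}


def upstream_closure(pipeline, depends_on):
    """Return every transitive upstream of `pipeline` (iterative DFS)."""
    seen, stack = set(), list(depends_on.get(pipeline, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(depends_on.get(node, []))
    return seen


def root_cause_alerts(alerting, depends_on):
    """Split alerting pipelines into root causes (page) and inherited (suppress)."""
    alerting = set(alerting)
    roots, suppressed = [], []
    for pipeline in sorted(alerting):
        if upstream_closure(pipeline, depends_on) & alerting:
            suppressed.append(pipeline)  # an upstream is already alerting
        else:
            roots.append(pipeline)       # nothing upstream explains this alert
    return roots, suppressed


roots, suppressed = root_cause_alerts(
    ["raw_orders", "stg_orders", "fct_revenue", "revenue_dashboard"], DEPENDS_ON
)
print("page:", roots)
print("suppress:", suppressed)
```

In this toy graph, four alerting pipelines collapse to a single page for raw_orders; the other three are recorded as inherited.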

Chaos engineering moved into data

Netflix's Chaos Monkey approach is now table stakes for data teams. The FailureSimulator's 5 chaos types let you prove the platform recovers before a production fire proves it doesn't. Recovery patterns (retry/backoff, backfill, circuit-breaker, validation) ship as the recovery_playbook.
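As a rough sketch of what chaos injection plus validation can look like, the snippet below implements two of the five chaos types (data_loss and duplicate_data) against an in-memory batch. The function names, the row shape, and the validation rules are assumptions for illustration; the project's FailureSimulator may be structured differently.

```python
import random


def inject_data_loss(rows, fraction, rng):
    """Drop roughly `fraction` of rows (the data_loss chaos type)."""
    keep = len(rows) - round(len(rows) * fraction)
    return rng.sample(rows, keep)


def inject_duplicates(rows, fraction, rng):
    """Append copies of a random subset of rows (the duplicate_data chaos type)."""
    extra = rng.sample(rows, round(len(rows) * fraction))
    return rows + extra


def validate(rows, expected_count, key="id"):
    """Return a list of validation failures; an empty list means the batch looks healthy."""
    failures = []
    if len(rows) < expected_count:
        failures.append("row_count_low")
    if len({row[key] for row in rows}) < len(rows):
        failures.append("duplicate_keys")
    return failures


rng = random.Random(42)  # seeded so the experiment is repeatable
batch = [{"id": i, "amount": 10 * i} for i in range(100)]

lossy = inject_data_loss(batch, 0.15, rng)
duped = inject_duplicates(batch, 0.10, rng)

print(validate(batch, expected_count=100))
print(validate(lossy, expected_count=100))
print(validate(duped, expected_count=100))
```

The point of the exercise is the last three lines: each injected fault should be caught by a check before it reaches a downstream consumer.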

Multi-team ownership is the staff-level question

When 4 domain teams share a warehouse, 'whose dashboard is wrong?' becomes the first 30 minutes of every incident. The teams + pipeline_ownership + access-pattern matrix + RootCauseTracer answer it in three seconds, not three Slack threads.
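One way to picture the ownership lookup: walk upstream from the broken asset through failed pipelines only, stop where no failed upstream remains, and read the owner off a mapping. The graph, the `OWNERS` mapping, the team names, and `trace_root_causes` are all hypothetical stand-ins for the project's pipeline_ownership table and RootCauseTracer.

```python
# Hypothetical graph: pipeline -> the upstreams it depends on.
DEPENDS_ON = {
    "raw_orders": [],
    "raw_users": [],
    "stg_orders": ["raw_orders"],
    "stg_users": ["raw_users"],
    "fct_revenue": ["stg_orders", "stg_users"],
    "revenue_dashboard": ["fct_revenue"],
}

# Hypothetical stand-in for a pipeline_ownership table.
OWNERS = {
    "raw_orders": "ingestion",
    "raw_users": "ingestion",
    "stg_orders": "commerce",
    "stg_users": "growth",
    "fct_revenue": "finance",
    "revenue_dashboard": "analytics",
}


def trace_root_causes(pipeline, depends_on, failed):
    """DFS upstream from a failed pipeline; return failed nodes with no failed upstream."""
    roots, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        failed_upstreams = [u for u in depends_on.get(node, []) if u in failed]
        if not failed_upstreams:
            roots.append(node)  # the failure starts here
        for upstream in failed_upstreams:
            visit(upstream)

    visit(pipeline)
    return roots


failed = {"raw_orders", "stg_orders", "fct_revenue", "revenue_dashboard"}
causes = trace_root_causes("revenue_dashboard", DEPENDS_ON, failed)
for cause in causes:
    print(f"root cause: {cause} (owner: {OWNERS[cause]})")
```

The sketch assumes the starting pipeline is itself in the failed set; a real tracer would also need to handle partial failures and stale status data.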

Reliability has a price tag

Going from 99% to 99.9% is roughly 5–10x the compute. From 99.9% to 99.99% is another 5–10x. The cost-vs-reliability calculator surfaces that conversation as a number, not a vibe — the artifact that gets a budget approved.
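The arithmetic behind that conversation is small enough to sketch. The multiplier ranges below are the ones quoted on this page (99% = 1x, 99.9% = 5–10x, 99.99% = 50–100x); the function names, the 30-day window, and the baseline cost figure are assumptions for illustration, not the project's calculator.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes

# Cost multipliers relative to a 99% baseline, as quoted on this page.
COST_MULTIPLIER = {
    "99": (1, 1),
    "99.9": (5, 10),
    "99.99": (50, 100),
}


def allowed_downtime_minutes(target_pct, window_minutes=MINUTES_PER_30_DAYS):
    """Minutes of unavailability the target permits over the window."""
    return window_minutes * (100 - target_pct) / 100


def cost_range(baseline_cost, target):
    """Low/high cost estimate for a target, given the cost of running at 99%."""
    low, high = COST_MULTIPLIER[target]
    return baseline_cost * low, baseline_cost * high


baseline = 1_000  # hypothetical monthly compute cost at 99%
for target in ("99", "99.9", "99.99"):
    downtime = allowed_downtime_minutes(float(target))
    low, high = cost_range(baseline, target)
    print(f"{target}%: {downtime:.2f} min/30d allowed, cost {low}-{high}")
```

Each additional 9 cuts the allowed downtime by a factor of ten (roughly 432, 43.2, and 4.32 minutes per 30 days) while the quoted cost multiplier climbs, which is the trade-off the calculator is meant to put in front of a budget owner.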

Curriculum · 5 modules · 13-15 hours

Part 0 is free. The other four unlock with PRO.

Try the first 2 hours — map the platform, build the dependency graph, identify the failure surfaces. If the loop clicks, upgrade to unlock SLOs, dep-aware alerting, chaos engineering, and the multi-team incident management layer.

Part 0 is free — no card required. Map the platform before paying.
M01
Map the platform — dependency graph + failure surfaces
Build the system topology over 6 pipelines with BFS, DFS, and topological sort. Trace critical paths. Identify failure surfaces before measuring anything. The artifact every other layer depends on.
2h · 8 lessons · FREE PREVIEW
Start →
M02
SLA, SLO & error budgets — dep-aware violation detection
Define a three-tier SLO ladder. Implement error-budget burn-rate tracking. Build dependency-aware violation detection that knows when an upstream failure cascades — and when it doesn't.
3h · 10 lessons · PRO TIER
Unlock with PRO →
M03
Monitoring & dep-based alerting — 47 symptoms → 1 page
Ship MetricsCollector, AlertingEngine, and the DependencyBasedAlerter that uses Part 0's graph to suppress inherited alerts. Route by severity (SEV1 → page, SEV2 → Slack, SEV3 → email) with deduplication.
3h · 11 lessons · PRO TIER
Unlock with PRO →
M04
Failure simulation & recovery — chaos · retry · circuit-break
Build a FailureSimulator with 5 chaos types. Implement retry/backoff with jitter, IdempotentLoader, CheckpointManager, BackfillSystem with dry-run, and a CircuitBreaker state machine (CLOSED/OPEN/HALF_OPEN). Codify everything as a RecoveryPlaybook.
3h · 12 lessons · PRO TIER
Unlock with PRO →
M05
Incidents & multi-team — RootCauseTracer + 5-Whys + cost
Stand up the incident lifecycle (Detect → Respond → Resolve → Learn) with SEV1–SEV4, 5-Whys postmortems, RootCauseTracer (DFS), 10 root-cause categories, multi-team ownership, and the cost-vs-reliability calculator that prices each additional 9.
3h · 12 lessons · PRO TIER
Unlock with PRO →
4 modules locked · Unlock all PRO content for $29/mo
Upgrade to PRO →
Backed by curriculum

Data Observability & Quality

10 modules · 8.3 hours · SLOs + error budgets · alert routing · incident response · dependency-aware alerting · blameless postmortems
Open curriculum

SLO + error-budget fundamentals from this curriculum are the foundation for the project — PRO subscribers get full access to every module.
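For readers new to the vocabulary, here is a minimal sketch of error-budget math using the common SRE definition: burn rate is the observed failure rate divided by the failure rate the SLO allows. The tier targets mirror the 99.9 / 99.5 / 95 ladder shown on this page; the tier names and function names are assumptions.

```python
# Tier targets from the page's ladder; the tier names are hypothetical.
SLO_TIERS = {"tier_1": 0.999, "tier_2": 0.995, "tier_3": 0.95}


def burn_rate(failed_runs, total_runs, slo_target):
    """Observed failure rate divided by the failure rate the SLO allows.

    1.0 means the budget is being spent exactly on schedule;
    above 1.0 it will run out before the window ends.
    """
    if total_runs == 0:
        return 0.0
    error_rate = failed_runs / total_runs
    return error_rate / (1.0 - slo_target)


def budget_remaining(failed_runs, total_runs, slo_target):
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed_failures = total_runs * (1.0 - slo_target)
    if allowed_failures == 0:
        return 0.0
    return 1.0 - failed_runs / allowed_failures


# 3 failed runs out of 1,000 in the window, judged against two tiers.
for tier in ("tier_1", "tier_2"):
    target = SLO_TIERS[tier]
    print(tier, round(burn_rate(3, 1000, target), 2), round(budget_remaining(3, 1000, target), 2))
```

The same three failures are comfortable for a 99.5% pipeline (burn rate about 0.6) and a serious overspend for a 99.9% one (burn rate about 3), which is why the tier a pipeline sits in matters as much as its raw failure count.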

The build, in 3 phases

Three sprints. Five stages. One reliability platform.

Each phase ends with a tagged commit and a working artifact. No ambiguity about where you are.

01 · ~5h
Map the platform & set the targets

Dependency graph (BFS · DFS · topo-sort) over 6 pipelines. Failure-surface map. Three-tier SLO ladder with error-budget burn-rate. Dep-aware violation detection wired to the graph.

  • 6-pipeline dependency graph + failure surfaces
  • tier-aware SLA/SLO ladder + error budgets
  • violation detection that respects the graph
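A compact sketch of the graph layer, using only the standard library (Python 3.9+ for `graphlib`). It shows a topological run order and a BFS "blast radius" for one pipeline. The pipeline names and helper functions are hypothetical; the project's dep_graph module may expose a different interface.

```python
from collections import deque
from graphlib import TopologicalSorter

# Hypothetical 6-pipeline graph: pipeline -> the upstreams it depends on.
DEPENDS_ON = {
    "raw_orders": [],
    "raw_users": [],
    "stg_orders": ["raw_orders"],
    "stg_users": ["raw_users"],
    "fct_revenue": ["stg_orders", "stg_users"],
    "revenue_dashboard": ["fct_revenue"],
}


def downstream_of(depends_on):
    """Invert depends_on into pipeline -> direct downstream consumers."""
    consumers = {pipeline: [] for pipeline in depends_on}
    for pipeline, upstreams in depends_on.items():
        for upstream in upstreams:
            consumers[upstream].append(pipeline)
    return consumers


def blast_radius(start, depends_on):
    """BFS over downstream edges: everything affected if `start` fails."""
    consumers = downstream_of(depends_on)
    seen, queue, affected = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for child in consumers[node]:
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


# TopologicalSorter expects node -> predecessors, which is what DEPENDS_ON is.
run_order = list(TopologicalSorter(DEPENDS_ON).static_order())
print("run order:", run_order)
print("if raw_orders fails:", blast_radius("raw_orders", DEPENDS_ON))
```

Ranking pipelines by the size of their blast radius is one simple way to start a failure-surface analysis.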
02 · ~3h
Detect without paging

RED metrics + AlertingEngine. The DependencyBasedAlerter that suppresses 47 symptom alerts to 1 root-cause page using Part 0's graph. Severity-routed dispatch with dedup.

  • MetricsCollector + alert_rules + alert_history tables
  • DependencyBasedAlerter (root-cause-only paging)
  • SEV1/2/3 routing — page · Slack · email
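Severity routing with a dedup window can be sketched as below. The SEV1/SEV2/SEV3 channels and the 5-minute window come from this page; the class name `SketchRouter`, the `(pipeline, rule)` dedup key, and the `"log"` fallback for other severities are assumptions, since the page does not say where SEV4 is routed.

```python
import time

# Channel mapping as described on the page.
ROUTES = {"SEV1": "page", "SEV2": "slack", "SEV3": "email"}
DEDUP_WINDOW_SECONDS = 5 * 60


class SketchRouter:
    """In-process router: one notification per (pipeline, rule) per window."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_sent = {}  # (pipeline, rule) -> time of last notification

    def route(self, pipeline, rule, severity):
        """Return the channel to notify, or None if the alert is a duplicate."""
        key = (pipeline, rule)
        now = self._clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < DEDUP_WINDOW_SECONDS:
            return None  # same alert fired recently; stay quiet
        self._last_sent[key] = now
        return ROUTES.get(severity, "log")  # "log" fallback is an assumption


# Drive the router with a fake clock so the window is easy to see.
fake_now = [0.0]
router = SketchRouter(clock=lambda: fake_now[0])
first = router.route("fct_revenue", "freshness", "SEV1")
fake_now[0] = 120.0
second = router.route("fct_revenue", "freshness", "SEV1")
fake_now[0] = 301.0
third = router.route("fct_revenue", "freshness", "SEV1")
print(first, second, third)
```

Because the state lives in a Python dict, it is lost on restart, which matches the tutorial-level trade-off the hardening section calls out.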
03 · ~6h
Break it, recover it, own it

FailureSimulator (5 chaos types). Retry/backoff + IdempotentLoader + CheckpointManager + BackfillSystem + CircuitBreaker. Incident lifecycle with SEV1–SEV4, 5-Whys, RootCauseTracer, 10 categories, multi-team ownership.

  • FailureSimulator + RecoveryPlaybook + CircuitBreaker
  • incidents · postmortems · root_causes (30+ tables)
  • teams + pipeline_ownership + cost-vs-reliability calculator
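The CLOSED → OPEN → HALF_OPEN cycle named above can be sketched as a small single-process state machine. The thresholds, the `CircuitOpenError` exception, and the injectable clock are illustrative assumptions rather than the project's actual CircuitBreaker.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # calls flow normally
    OPEN = "open"            # calls are rejected without being attempted
    HALF_OPEN = "half_open"  # one trial call is allowed through


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = State.CLOSED
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self._clock() - self._opened_at >= self.reset_timeout:
                self.state = State.HALF_OPEN  # timeout elapsed: allow a trial call
            else:
                raise CircuitOpenError("circuit open; call skipped")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self.state = State.CLOSED
        return result

    def _record_failure(self):
        self._failures += 1
        # A failed trial call, or too many consecutive failures, opens the circuit.
        if self.state is State.HALF_OPEN or self._failures >= self.failure_threshold:
            self.state = State.OPEN
            self._opened_at = self._clock()
```

A caller wraps each upstream call in `breaker.call(load_batch)` and treats `CircuitOpenError` as a signal to skip or queue the work instead of hammering a dependency that is already down.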
Project setup · 5 minutes

One command. Pre-incident platform + the full reliability stack.

You get a real platform on day one — Postgres seeded with 10K pipeline runs, 50 incidents (15 still open), 800 SLA violations, plus 33 Python modules covering every layer (graph, SLO, alerting, chaos, incidents).

What lives in the repo

Everything you need to map, measure, monitor, recover, and own a data platform — and the seed data that's pre-broken with open incidents and burning error budgets so you have something real to fix.

  • docker-compose.yml — Postgres + Prometheus + Grafana
  • migrations/ — 11 schema migrations · 30+ reliability tables
  • seed/02_sample_data.sql — 10K runs · 50 incidents · 800 SLA violations · 20 postmortems
  • dataguard/ — 33 Python modules: dep_graph, slo, alerting, chaos, incidents
  • alerting/ + failure_simulator.py — DependencyBasedAlerter + FailureSimulator (5 chaos types)
Download · Starter Kit

DataGuard Reliability Starter Kit

Pre-configured Docker stack, the 11-migration seed schema, 10K pipeline runs and 50 incidents pre-loaded, and the 33 Python modules for every part. Skip the boilerplate, start on Part 0.

286 KB · Docker · 11 migrations · 33 Python modules · Prometheus + Grafana · PRO required
~/projects/dataguard-reliability — zsh
1. Clone and start the stack
$ git clone https://github.com/ai-de/p25-dataguard-reliability
$ cd p25-dataguard-reliability && docker-compose up -d
2. Apply 11 migrations + seed 10K runs / 50 incidents
$ psql -h localhost -U dataguard -f migrations/run_all.sql
$ psql -h localhost -U dataguard -f seed/02_sample_data.sql
3. Build the dependency graph + verify
$ python3 -m dataguard.dep_graph.build && python3 -m dataguard.dep_graph.verify
4. Open Prometheus · Grafana
$ open http://localhost:3000 # Grafana
$ open http://localhost:9090 # Prometheus
30+ reliability tables · 10K pipeline runs · 50 incidents (15 open) · 800 SLA violations
Production hardening

The same stack — but built for the 10x case.

Most reliability tutorials show you retry_with_backoff. This one shows what changes when the dependency graph has 200+ pipelines, three on-call rotations are watching the same alerts, and the alerter that worked at 6 pipelines starts paging at 47.
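For reference, the tutorial-level pattern looks roughly like this: exponential backoff with full jitter, capped at a maximum delay. The signature and defaults are assumptions for illustration; `sleep` and `rng` are injectable only so the example runs instantly.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep, rng=random.random):
    """Call fn(); on failure wait a jittered, exponentially growing delay and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * cap)  # full jitter: uniform in [0, cap)


# Hypothetical flaky loader: fails twice, then succeeds.
attempts = {"n": 0}


def flaky_load():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("upstream_down")
    return "loaded"


delays = []  # record the delays instead of actually sleeping
result = retry_with_backoff(flaky_load, sleep=delays.append)
print(result, len(delays))
```

Jitter matters once many workers retry against the same upstream: without it, they all wake up at the same instants and re-create the spike that caused the failure.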

Tutorial pattern (what you ship in modules 01–05) vs. production pattern (what changes at scale)
  • Anomaly detection
      Tutorial: Static threshold row_count < 80000
      Production: Rolling EWMA / z-score on row_count + freshness series
  • Alert dedup
      Tutorial: In-process 5-min window; lost on restart
      Production: Alertmanager group_by + repeat-interval, durable across restarts
  • Postmortems
      Tutorial: 5-Whys YAML stored as why_1..why_5 columns
      Production: Causal-tree analysis surfaced as a networkx DiGraph visualization
  • Circuit breaker
      Tutorial: Single-process state machine; per-service
      Production: Distributed state in Redis with PubSub; cluster-wide trip
  • Coverage
      Tutorial: 6 pipelines · 30+ tables · per-run scoring
      Production: 200+ pipelines with sampling tiers + incremental graph rebuild
  • Incident store
      Tutorial: Postgres incidents table; single region
      Production: Durable write-ahead log + read replicas for global on-call
  • Root cause
      Tutorial: 10 hand-curated categories in a root_causes enum
      Production: ML-suggested classification on incident-text embeddings + 5-Whys
  • Ownership model
      Tutorial: Single-team pipeline_ownership; manual joins
      Production: Multi-tenant policy-as-code with row-level security per team
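To make the first row concrete, here is a minimal sketch of an EWMA-based z-score check on a row_count series. It uses the standard incremental update for an exponentially weighted mean and variance; the class name, the parameters (alpha, z_threshold, warmup), and the sample series are assumptions, not values from the project.

```python
import math


class EwmaDetector:
    """Flag points that sit far from an exponentially weighted moving average."""

    def __init__(self, alpha=0.3, z_threshold=3.0, warmup=5):
        self.alpha = alpha
        self.z_threshold = z_threshold
        self.warmup = warmup  # observations to see before flagging anything
        self.mean = None
        self.var = 0.0
        self.n = 0

    def observe(self, x):
        """Return True if x is anomalous against the history so far, then update."""
        self.n += 1
        if self.mean is None:
            self.mean = float(x)
            return False
        diff = x - self.mean
        std = math.sqrt(self.var)
        is_anomaly = (
            self.n > self.warmup and std > 0 and abs(diff) / std > self.z_threshold
        )
        # Incremental EWMA update of mean and variance.
        incr = self.alpha * diff
        self.mean += incr
        self.var = (1 - self.alpha) * (self.var + diff * incr)
        return is_anomaly


# Hypothetical daily row counts: stable around 100K, then a 15% drop.
series = [100_000, 100_500, 99_800, 100_200, 99_900, 100_100, 100_300, 99_700, 85_000]
detector = EwmaDetector()
flags = [detector.observe(x) for x in series]
print(flags)
```

The final value, 85,000, would pass a static row_count < 80000 check but sits far outside the band the detector has learned. This sketch keeps updating its statistics on anomalous points too; a production version would likely need a policy for that.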
PRO benefit · code review

Real review from senior engineers who shipped this stack.

Submit your repo, get line-by-line feedback within 48 hours. The kind of review that's quietly worth thousands of dollars in time-to-staff.


4 reviews / month

Submit a repo, a PR, or an architecture proposal. Reviewer is matched to your domain — SRE / platform / incident response for this project. Async, comments inline, average turnaround 31 hours.

31h avg turnaround · 9.2/10 helpfulness · 94% return next month

2 office hours / month

Live 30-min sessions with a senior data engineer. Dependency-graph design, mock a system-design interview on dep-aware alerting, whiteboard the cost-vs-reliability calculator. Group sessions also available.

30 min per session · 2 / mo included · group sessions unlimited
What PRO unlocks

One subscription. 15+ projects, all curriculum, code review.

PRO is built for senior+ engineers who want production-grade builds and feedback loops — not more tutorials.

What you get (FREE / PRO / EXPERT)
  • Projects (production-grade builds): FREE 2 · PRO 15+ · EXPERT 8
  • Curriculum modules (all 7 tracks): FREE Phase 1 only · PRO All · EXPERT All + bonus
  • Code review credits (senior engineer review): FREE 0 · PRO 4 / month · EXPERT Unlimited
  • Career path access (5 paths × full plans): FREE 1 path · PRO All 5 · EXPERT All 5 + 1:1
  • Certificate (verifiable on LinkedIn): PRO Yes · EXPERT Yes + portfolio review
  • Community (Discord + office hours): FREE Read-only · PRO Full + 2/mo · EXPERT Full + 4/mo
$29/mo
billed monthly · cancel anytime
or annual
$249/yr save 28%
Upgrade to PRO
Who this is for

Pick this if you've already been paged 47 times for one root cause, not if you're still learning the basics.


Platform engineers running data infra

You operate Postgres, Airflow, dbt, and the warehouse for 4+ teams. The dependency graph + dep-aware alerter + cost-vs-reliability calculator are the patterns you needed last quarter.


On-call engineers tired of cascading pages

When every alert is P1, nothing is. The DependencyBasedAlerter that suppresses 47 symptoms to 1 root-cause page is the artifact that gets you off the rotation that's burning the team out.


Senior data engineers leveling up to staff

You ship pipelines today; the staff-level rubric checks for system topology, chaos engineering, blameless 5-Whys, and multi-team ownership. This is that rubric, built end-to-end.


Tech leads + EMs setting the reliability bar

You're approving the budget for reliability work. The cost-vs-reliability calculator turns 'can we get to 99.99%?' from a vibe into a number that has a 50–100x compute coefficient — the conversation that gets approved.

FAQ

Quick answers.

What do I get in the free Part 0?
Part 0 (free) gives you a 6-pipeline platform pre-loaded in Postgres with real upstream/downstream relationships. You ship BFS, DFS, and topological sort over the actual graph and identify failure surfaces in real depends_on data. Most free tutorials show the algorithm in isolation; this one builds the artifact every other layer in the project depends on.

How is this different from P10?
P10 is the detect / trace / prevent layer: dbt + Great Expectations validation + OpenLineage column lineage + Prometheus + Grafana on a 5-table pre-broken warehouse. P25 (this project) is the SRE-for-data layer: dependency graphs + dep-aware SLOs + chaos engineering + 5-Whys postmortems + multi-team ownership. They are different layers, not different depths. Build observability first if you don't yet have quality metrics on your data; jump here if you already operate at scale and need the platform-engineering patterns.

What is not covered?
Three things. ML-based anomaly detection: all thresholds are static or tier-based, and rolling EWMA/z-score is covered in the hardening section as an upgrade path. Real Datadog or PagerDuty integration: the AlertRouter abstracts the sink, so wiring an actual integration is a 10-line change. Cloud deployment: everything runs locally in Docker, and the patterns transfer cleanly to managed services.

Do I need a cloud account?
No. Everything runs locally (Postgres + Prometheus + Grafana + the 33 Python modules), all in docker-compose. The patterns transfer cleanly to managed services (RDS + Datadog + PagerDuty + on-call.com) with config changes only.

What does PRO include?
All 15+ PRO projects, 4 code-review credits per month, 2 office-hours sessions, full curriculum across all 7 tracks, all 5 career paths, a verifiable certificate, and full community access. Cancel anytime.

Will this help with staff-level interviews?
Yes. Staff DE rounds increasingly assume you can reason about dependency graphs, dep-aware SLOs, root-cause-only paging, chaos engineering, and multi-team ownership, the same way SRE rounds at API teams do. After this you can whiteboard the 5-stage spine (MAP / MEASURE / MONITOR / RECOVER / OWN) without hand-waving any layer.

Ready to ship a real reliability platform?

Start with Part 0 — free, no card. About 2 hours. By the end you'll have the platform mapped, the 6-pipeline dependency graph built, and the failure-surface analysis that drives every other layer in the project.
