Operate the data platform like Google SRE operates production, across 5 stages
Build the dependency graph that maps the platform, the dep-aware SLO ladder, and the alerter that turns 47 symptom alerts into one root-cause page. Inject 5 chaos types, recover with retry · backfill · circuit-breaker · validation, then own the incident through SEV1–SEV4 with blameless 5-Whys postmortems and a cost-vs-reliability calculator that knows the cost of each additional 9.
This is the system-design answer for “how would you operate a data platform across 4 teams without paging on every symptom?” — asked at staff DE rounds at Uber, Airbnb, Meta, Stripe.
- A 6-pipeline dependency graph with BFS, DFS, and topological sort plus a failure-surface analysis (the spine for every other layer)
- A three-tier SLO ladder with error-budget burn-rate tracking and dependency-aware violation detection — upstream failures don't fire downstream pages
- A DependencyBasedAlerter that suppresses inherited alerts and routes by severity (SEV1 → on-call page, SEV2 → Slack, SEV3 → email)
- A FailureSimulator with 5 chaos types (data_loss · schema_drift · upstream_down · latency_spike · duplicate_data) plus retry/backoff, IdempotentLoader, CheckpointManager, BackfillSystem, and a CircuitBreaker state machine
- An incident lifecycle (Detect → Respond → Resolve → Learn) with 30+ Postgres tables, SEV1–SEV4 state machine, 5-Whys postmortems with why_1..why_5 columns, RootCauseTracer (DFS), and 10 root-cause categories
- A multi-team platform model — teams + pipeline_ownership + access-pattern matrix — with a cost-vs-reliability calculator (99% = 1x · 99.9% = 5–10x · 99.99% = 50–100x)
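A minimal sketch of what the dependency-graph layer could look like, assuming six hypothetical pipeline names (the project's actual pipelines are not listed on this page). It uses the standard-library graphlib for the topological sort and a plain BFS for the failure surface.

```python
from collections import deque
from graphlib import TopologicalSorter

# Hypothetical 6-pipeline platform: each key lists the upstreams it reads from.
UPSTREAMS = {
    "raw_orders": [],
    "raw_payments": [],
    "clean_orders": ["raw_orders"],
    "clean_payments": ["raw_payments"],
    "revenue_daily": ["clean_orders", "clean_payments"],
    "revenue_dashboard": ["revenue_daily"],
}


def downstream_of(upstreams):
    """Invert the upstream map into pipeline -> direct downstream consumers."""
    children = {name: [] for name in upstreams}
    for name, parents in upstreams.items():
        for parent in parents:
            children[parent].append(name)
    return children


def failure_surface(upstreams, failed):
    """BFS over downstream edges: every pipeline that inherits a failure of `failed`."""
    children = downstream_of(upstreams)
    seen, queue = set(), deque([failed])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def run_order(upstreams):
    """Topological order: every pipeline appears after all of its upstreams."""
    return list(TopologicalSorter(upstreams).static_order())
```

With this map, failure_surface(UPSTREAMS, "raw_orders") returns clean_orders, revenue_daily, and revenue_dashboard: the pipelines whose alerts would be inherited symptoms rather than root causes.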
Data platforms are first-class production systems — not back-office plumbing.
When the revenue dashboard shows $0, the data team is on the hook the same way the API team is when /checkout returns 500. The patterns you wire here — dependency-aware SLOs, root-cause-only paging, chaos engineering, blameless postmortems — are the staff-level rubric every data-platform interview now grades against.
47 symptom alerts is the on-call exit interview
When one upstream pipeline drops 15% of rows, the naive setup pages on every downstream symptom — schema check, row count, freshness, 12 dependent dashboards. The DependencyBasedAlerter turns that into one root-cause page. The difference between staying on-call and quitting.
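The suppression idea can be sketched in a few lines. This is an illustration under stated assumptions, not the project's DependencyBasedAlerter: the pipeline names and the root_cause_alerts function are hypothetical, and the rule is simply "drop any alert that has an alerting ancestor".

```python
# Hypothetical graph: each pipeline maps to the upstreams it reads from.
UPSTREAMS = {
    "ingest": [],
    "clean": ["ingest"],
    "agg": ["clean"],
    "dash_a": ["agg"],
    "dash_b": ["agg"],
    "unrelated": [],
}


def root_cause_alerts(upstreams, alerting):
    """Keep only alerting pipelines that have no alerting ancestor."""
    alerting = set(alerting)

    def has_alerting_ancestor(node):
        # Iterative DFS upstream from `node`.
        stack, seen = list(upstreams.get(node, [])), set()
        while stack:
            parent = stack.pop()
            if parent in alerting:
                return True
            if parent not in seen:
                seen.add(parent)
                stack.extend(upstreams.get(parent, []))
        return False

    return sorted(n for n in alerting if not has_alerting_ancestor(n))
```

Five alerts on ingest, clean, agg, dash_a, and dash_b collapse to a single page for ingest, while an independent failure on unrelated would still page on its own.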
Chaos engineering moved into data
Netflix's Chaos Monkey is now table stakes for data teams. The FailureSimulator's 5 chaos types let you prove the platform recovers before a production fire proves it doesn't. The recovery patterns (retry/backoff, backfill, circuit-breaker, validation) ship as the recovery_playbook.
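As one example of those recovery patterns, here is a rough sketch of a circuit-breaker state machine. The class shape, parameter names (max_failures, reset_after, clock), and thresholds are assumptions for illustration; the project's CircuitBreaker may differ.

```python
import time


class CircuitBreaker:
    """CLOSED -> OPEN after `max_failures` consecutive failures.
    OPEN -> HALF_OPEN once `reset_after` seconds have passed.
    HALF_OPEN -> CLOSED on success, back to OPEN on failure."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock  # injectable so tests can control time
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: call skipped")
            self.state = "HALF_OPEN"  # cool-down elapsed: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.max_failures:
                self.state = "OPEN"
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.state = "CLOSED"
        self.failures = 0
        return result
```

Wrapping calls to a flaky upstream in breaker.call(...) means that after repeated failures the pipeline fails fast instead of hammering a source that is already down.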
Multi-team ownership is the staff-level question
When 4 domain teams share a warehouse, 'whose dashboard is wrong?' becomes the first 30 minutes of every incident. The teams + pipeline_ownership + access-pattern matrix + RootCauseTracer answer it in three seconds, not three Slack threads.
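A hedged sketch of how a DFS-based tracer might answer "whose dashboard is wrong?". The graph, the OWNERS mapping, the team names, and the trace_root_causes function are all hypothetical stand-ins for the project's teams, pipeline_ownership, and RootCauseTracer.

```python
# Hypothetical upstream map and ownership table.
UPSTREAMS = {
    "raw_orders": [],
    "clean_orders": ["raw_orders"],
    "revenue_daily": ["clean_orders"],
    "revenue_dashboard": ["revenue_daily"],
}
OWNERS = {
    "raw_orders": "ingestion",
    "clean_orders": "core-data",
    "revenue_daily": "finance-analytics",
    "revenue_dashboard": "finance-analytics",
}


def trace_root_causes(upstreams, failed, start):
    """DFS upstream from `start` along failed pipelines only.
    A root cause is a failed pipeline with no failed direct upstream."""
    roots, seen = [], set()

    def visit(node):
        if node in seen or node not in failed:
            return
        seen.add(node)
        failed_parents = [p for p in upstreams.get(node, []) if p in failed]
        if not failed_parents:
            roots.append((node, OWNERS.get(node, "unowned")))
        for parent in failed_parents:
            visit(parent)

    visit(start)
    return roots
```

If all four pipelines are failing, tracing from revenue_dashboard returns raw_orders and its owning team, so the page goes to the team that can actually fix it.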
Reliability has a price tag
Going from 99% to 99.9% is roughly 5–10x the compute. From 99.9% to 99.99% is another 5–10x. The cost-vs-reliability calculator surfaces that conversation as a number, not a vibe — the artifact that gets a budget approved.
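The arithmetic behind that conversation is small enough to sketch. The multipliers below are the ones quoted on this page (99% = 1x, 99.9% = 5–10x, 99.99% = 50–100x); the reliability_quote function and the baseline cost are hypothetical.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes in a 30-day window

# Compute multipliers quoted in the text, as (low, high) relative to a 99% baseline.
COST_MULTIPLIER = {
    "99": (1, 1),
    "99.9": (5, 10),
    "99.99": (50, 100),
}


def reliability_quote(target, baseline_monthly_cost):
    """Allowed downtime and rough cost range for an availability target such as "99.9"."""
    availability = float(target) / 100
    low, high = COST_MULTIPLIER[target]
    return {
        "allowed_downtime_min": round(MINUTES_PER_30_DAYS * (1 - availability), 2),
        "cost_low": baseline_monthly_cost * low,
        "cost_high": baseline_monthly_cost * high,
    }
```

For an assumed baseline of 2,000 per month at 99%, reliability_quote("99.99", 2000) allows 4.32 minutes of downtime per 30 days at a cost of 100,000 to 200,000, compared with 432 minutes at the baseline.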
Part 0 is free. The other four unlock with PRO.
Try the first 2 hours — map the platform, build the dependency graph, identify the failure surfaces. If the loop clicks, upgrade to unlock SLOs, dep-aware alerting, chaos engineering, and the multi-team incident management layer.
Data Observability & Quality
SLO + error-budget fundamentals from this curriculum are the foundation for the project — PRO subscribers get full access to every module.
Three sprints. Five stages. One reliability platform.
Each phase ends with a tagged commit and a working artifact. No ambiguity about where you are.
Dependency graph (BFS · DFS · topo-sort) over 6 pipelines. Failure-surface map. Three-tier SLO ladder with error-budget burn-rate. Dep-aware violation detection wired to the graph.
- ✓ 6-pipeline dependency graph + failure surfaces
- ✓ tier-aware SLA/SLO ladder + error budgets
- ✓ violation detection that respects the graph
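Error-budget burn rate is commonly defined as the observed failure rate divided by the failure rate the SLO allows. A minimal sketch under that definition, with hypothetical function names and run counts:

```python
def burn_rate(failed_runs, total_runs, slo_target):
    """Observed failure rate divided by the failure rate the SLO allows.
    1.0 means the budget is being spent exactly as fast as the window permits."""
    if total_runs == 0:
        return 0.0
    allowed = 1.0 - slo_target
    return (failed_runs / total_runs) / allowed


def budget_remaining(failed_runs, total_runs, slo_target):
    """Fraction of the window's error budget still unspent (negative means overspent)."""
    if total_runs == 0:
        return 1.0
    allowed_failures = total_runs * (1.0 - slo_target)
    return 1.0 - failed_runs / allowed_failures
```

With a 99% SLO over 1,000 runs, 20 failures give a burn rate of about 2.0 (the budget is being spent twice as fast as allowed), while 5 failures leave about half the budget unspent.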
RED metrics + AlertingEngine. The DependencyBasedAlerter that suppresses 47 symptom alerts to 1 root-cause page using Part 0's graph. Severity-routed dispatch with dedup.
- ✓ MetricsCollector + alert_rules + alert_history tables
- ✓ DependencyBasedAlerter (root-cause-only paging)
- ✓ SEV1/2/3 routing — page · Slack · email
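Severity-routed dispatch with dedup could look roughly like the sketch below. The routing table follows the page (SEV1 to page, SEV2 to Slack, SEV3 to email); the Dispatcher class, the five-minute window, and the (pipeline, rule) dedup key are assumptions.

```python
import time

# Routing taken from the text: SEV1 pages on-call, SEV2 goes to Slack, SEV3 to email.
ROUTES = {"SEV1": "page", "SEV2": "slack", "SEV3": "email"}


class Dispatcher:
    def __init__(self, dedup_window=300.0, clock=time.monotonic):
        self.dedup_window = dedup_window
        self.clock = clock  # injectable so tests can control time
        self.last_sent = {}  # (pipeline, rule) -> time of last dispatch
        self.sent = []  # stand-in for real page/Slack/email delivery

    def dispatch(self, pipeline, rule, severity):
        """Return the channel used, or None if the alert was deduplicated."""
        key = (pipeline, rule)
        now = self.clock()
        last = self.last_sent.get(key)
        if last is not None and now - last < self.dedup_window:
            return None  # same alert inside the window: drop it
        self.last_sent[key] = now
        channel = ROUTES[severity]
        self.sent.append((channel, pipeline, rule))
        return channel
```

A repeat of the same pipeline and rule inside the window returns None, so a flapping freshness check produces one page rather than one per evaluation.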
FailureSimulator (5 chaos types). Retry/backoff + IdempotentLoader + CheckpointManager + BackfillSystem + CircuitBreaker. Incident lifecycle with SEV1–SEV4, 5-Whys, RootCauseTracer, 10 categories, multi-team ownership.
- ✓ FailureSimulator + RecoveryPlaybook + CircuitBreaker
- ✓ incidents · postmortems · root_causes (30+ tables)
- ✓ teams + pipeline_ownership + cost-vs-reliability calculator
One command. Pre-incident platform + the full reliability stack.
You get a real platform on day one — Postgres seeded with 10K pipeline runs, 50 incidents (15 still open), 800 SLA violations, plus 33 Python modules covering every layer (graph, SLO, alerting, chaos, incidents).
What lives in the repo
Everything you need to map, measure, monitor, recover, and own a data platform — and the seed data that's pre-broken with open incidents and burning error budgets so you have something real to fix.
- docker-compose.yml — Postgres + Prometheus + Grafana
- migrations/ — 11 schema migrations · 30+ reliability tables
- seed/02_sample_data.sql — 10K runs · 50 incidents · 800 SLA violations · 20 postmortems
- dataguard/ — 33 Python modules: dep_graph, slo, alerting, chaos, incidents
- alerting/ + failure_simulator.py — DependencyBasedAlerter + FailureSimulator (5 chaos types)
DataGuard Reliability Starter Kit
Pre-configured Docker stack, the 11-migration seed schema, 10K pipeline runs and 50 incidents pre-loaded, and the 33 Python modules for every part. Skip the boilerplate, start on Part 0.
The same stack — but built for the 10x case.
Most reliability tutorials show you retry_with_backoff and stop there. This one shows what changes when the dependency graph has 200+ pipelines, three on-call rotations are watching the same alerts, and the alerter that worked at 6 pipelines starts sending 47 pages for a single failure.
- row_count < 80000
- why_1..why_5 columns
- root_causes enum
- row_count + freshness series
- group_by + repeat-interval, durable across restarts
- networkx DiGraph visualization
Real review from senior engineers who shipped this stack.
Submit your repo, get line-by-line feedback within 48 hours. The kind of review that's quietly worth thousands of dollars in time-to-staff.
4 reviews / month
Submit a repo, a PR, or an architecture proposal. Reviewer is matched to your domain — SRE / platform / incident response for this project. Async, comments inline, average turnaround 31 hours.
2 office hours / month
Live 30-min sessions with a senior data engineer. Dependency-graph design, mock a system-design interview on dep-aware alerting, whiteboard the cost-vs-reliability calculator. Group sessions also available.
One subscription. 15+ projects, all curriculum, code review.
PRO is built for senior+ engineers who want production-grade builds and feedback loops — not more tutorials.
Pick this if you've already been paged 47 times for one root cause, not if you're still learning the basics.
Platform engineers running data infra
You operate Postgres, Airflow, dbt, and the warehouse for 4+ teams. The dependency graph + dep-aware alerter + cost-vs-reliability calculator are the patterns you needed last quarter.
On-call engineers tired of cascading pages
When every alert is P1, none of them is. The DependencyBasedAlerter that suppresses 47 symptoms to 1 root-cause page is the artifact that gets you off the rotation that's burning the team out.
Senior data engineers leveling up to staff
You ship pipelines today; the staff-level rubric checks for system topology, chaos engineering, blameless 5-Whys, and multi-team ownership. This is that rubric, built end-to-end.
Tech leads + EMs setting the reliability bar
You're approving the budget for reliability work. The cost-vs-reliability calculator turns 'can we get to 99.99%?' from a vibe into a number that has a 50–100x compute coefficient — the conversation that gets approved.
Going deeper? Three tracks back this project.
SRE-for-data is the spine. These three curriculums let you go deeper on the layers that matter most — system design, automation, and the cost story behind each additional 9.
Ready to ship a real reliability platform?
Start with Part 0 — free, no card. About 2 hours. By the end you'll have the platform mapped, the 6-pipeline dependency graph built, and the failure-surface analysis that drives every other layer in the project.