ai-de.net/Projects/P25 · DataGuard Reliability

Last updated 2026-05-22 · By AI-DE Engineering Team

PRO · Part 0 free preview · Platform track · P25

Operate the data platform like Google SRE operates production, across 5 stages

Build the dependency graph that maps the platform, the dep-aware SLO ladder, and the alerter that turns 47 symptom alerts into one root-cause page. Inject 5 chaos types, recover with retry · backfill · circuit-breaker · validation, then own the incident through SEV1–SEV4 with blameless 5-Whys postmortems and a cost-vs-reliability calculator that knows the cost of each additional 9.

Timeline

13-15 hours

Difficulty

Senior+

Stack

Python · Postgres · Prometheus · graph algos

See PRO benefits

This is the system-design answer for “how would you operate a data platform across 4 teams without paging on every symptom?” — asked at staff DE rounds at Uber, Airbnb, Meta, Stripe.

By the end you will have shipped

A 6-pipeline dependency graph with BFS, DFS, and topological sort plus a failure-surface analysis (the spine for every other layer)
A three-tier SLO ladder with error-budget burn-rate tracking and dependency-aware violation detection — upstream failures don't fire downstream pages
A DependencyBasedAlerter that suppresses inherited alerts and routes by severity (SEV1 → on-call page, SEV2 → Slack, SEV3 → email)
A FailureSimulator with 5 chaos types (data_loss · schema_drift · upstream_down · latency_spike · duplicate_data) plus retry/backoff, IdempotentLoader, CheckpointManager, BackfillSystem, and a CircuitBreaker state machine
An incident lifecycle (Detect → Respond → Resolve → Learn) with 30+ Postgres tables, SEV1–SEV4 state machine, 5-Whys postmortems with why_1..why_5 columns, RootCauseTracer (DFS), and 10 root-cause categories
A multi-team platform model — teams + pipeline_ownership + access-pattern matrix — with a cost-vs-reliability calculator (99% = 1x · 99.9% = 5–10x · 99.99% = 50–100x)

PREREQ · Built for senior+ data engineers. Comfortable with Python (classes, dataclasses, enums), SQL DDL + joins, Docker. Prior exposure to data observability or system design helps but isn’t required. SRE familiarity is a bonus.

dataguard.reliability.* · 6 pipelines · 0 cascading pages · 47 symptoms → 1 page

Map · Measure (failure surfaces + tier ladder)
dep_graph: 6 pipelines · BFS / DFS / topo
tier_config: 99.9 / 99.5 / 95
error_budget: burn-rate alerts · dep-aware
800 violations: seeded · burning

Monitor (root-cause-only paging)
DependencyBasedAlerter: 47 sym → 1 page · uses MAP graph
severity routing: page · Slack · email
alert_history: 5-min dedup window

Recover (chaos · recover · validate)
FailureSim: 5 chaos types · drop · drift · down
CircuitBreaker: CLOSED → OPEN → HALF_OPEN
backfill + retry: dry-run + validate

Own (incidents → playbook → action items)
SEV1: incident lifecycle
5-Whys: postmortem template
RCA × 10: root-cause categories
teams: 30+ tables owned

Seeded reliability platform
10K pipeline runs · 50 incidents (15 open)
800 SLA violations · 20 postmortems
real burning budgets to fix on Part 0
→ DependencyBasedAlerter cuts 47 → 1

Staff-level depth
5 chaos types · 4 recovery patterns
10 root-cause categories · 5-Whys
cost-vs-reliability calculator built in
→ platform-grade reliability spine

Why this matters in 2026

Data platforms are first-class production systems — not back-office plumbing.

When the revenue dashboard shows $0, the data team is on the hook the same way the API team is when /checkout returns 500. The patterns you wire here — dependency-aware SLOs, root-cause-only paging, chaos engineering, blameless postmortems — are the staff-level rubric every data-platform interview now grades against.

47 symptom alerts is the on-call exit interview

When one upstream pipeline drops 15% of rows, the naive setup pages on every downstream symptom — schema check, row count, freshness, 12 dependent dashboards. The DependencyBasedAlerter turns that into one root-cause page. The difference between staying on-call and quitting.
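The suppression idea can be sketched in a few lines. This is a minimal illustration, not the project's DependencyBasedAlerter: the pipeline names, the `DOWNSTREAM` edge map, and the `root_causes` helper are all hypothetical. It assumes an alert is "inherited" whenever any ancestor of the alerting pipeline is also alerting.

```python
from collections import deque

# Hypothetical upstream -> downstream edges; pipeline names are illustrative only.
DOWNSTREAM = {
    "raw_orders": ["clean_orders"],
    "clean_orders": ["revenue_daily", "customer_ltv"],
    "revenue_daily": ["revenue_dashboard"],
    "customer_ltv": [],
    "revenue_dashboard": [],
}


def upstream_of(graph):
    """Invert downstream edges into a pipeline -> direct-upstreams map."""
    up = {node: [] for node in graph}
    for node, children in graph.items():
        for child in children:
            up[child].append(node)
    return up


def root_causes(graph, alerting):
    """Return the alerting pipelines that have no alerting ancestor."""
    up = upstream_of(graph)
    roots = []
    for node in sorted(alerting):
        seen, queue, inherited = set(), deque(up[node]), False
        while queue:  # BFS over ancestors
            parent = queue.popleft()
            if parent in seen:
                continue
            seen.add(parent)
            if parent in alerting:
                inherited = True  # an ancestor is already alerting
                break
            queue.extend(up[parent])
        if not inherited:
            roots.append(node)
    return roots


alerting = {"clean_orders", "revenue_daily", "revenue_dashboard", "customer_ltv"}
print(root_causes(DOWNSTREAM, alerting))  # ['clean_orders']
```

Four symptoms collapse to one page here because three of the alerting pipelines sit downstream of `clean_orders`. A real alerter would also have to decide what to do when a downstream pipeline fails for its own reasons while an ancestor is failing too.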

Chaos engineering moved into data

Netflix's Chaos Monkey is now table stakes for data teams. The FailureSimulator's 5 chaos types let you prove the platform recovers before a production fire proves it doesn't. Recovery patterns (retry/backoff, backfill, circuit-breaker, validation) ship as the recovery_playbook.
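As one example of the recovery patterns named above, here is a hedged sketch of retry with exponential backoff and full jitter. The signature and defaults are assumptions made for illustration, not the project's API; `sleep` and `rng` are injectable only so the behaviour can be checked without waiting.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       retry_on=(Exception,), sleep=time.sleep, rng=random.random):
    """Call func(); on a retryable error wait a jittered, exponentially
    growing delay and try again, re-raising after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise
            # The cap doubles each attempt: base, 2*base, 4*base, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random fraction of the cap.
            sleep(rng() * cap)


calls = {"n": 0}


def flaky_load():
    """Hypothetical loader that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient upstream error")
    return "loaded"


delays = []
result = retry_with_backoff(flaky_load, sleep=delays.append, rng=lambda: 1.0)
print(result, delays)  # loaded [0.5, 1.0]
```

Retrying is only safe when the wrapped operation is idempotent, which is presumably why the project pairs it with an IdempotentLoader.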

Multi-team ownership is the staff-level question

When 4 domain teams share a warehouse, 'whose dashboard is wrong?' becomes the first 30 minutes of every incident. The teams + pipeline_ownership + access-pattern matrix + RootCauseTracer answer it in three seconds, not three Slack threads.

Reliability has a price tag

Going from 99% to 99.9% is roughly 5–10x the compute. Going from 99.9% to 99.99% is roughly another 10x, which puts 99.99% at 50–100x the 99% baseline. The cost-vs-reliability calculator surfaces that conversation as a number, not a vibe: the artifact that gets a budget approved.
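A back-of-the-envelope version of that calculation, as a sketch only. The multipliers are the ladder quoted on this page (99% = 1x, 99.9% = 5–10x, 99.99% = 50–100x); the function names, the 30-day window, and the baseline cost figure are assumptions for illustration.

```python
# Cost multipliers relative to a 99% baseline, as (low, high) ranges.
COST_MULTIPLIER = {0.99: (1, 1), 0.999: (5, 10), 0.9999: (50, 100)}

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(target):
    """Downtime budget over a 30-day window for an availability target."""
    return (1 - target) * MINUTES_PER_30_DAYS


def cost_range(target, baseline_monthly_cost):
    """Estimated (low, high) monthly cost for a target, given the 99% cost."""
    low, high = COST_MULTIPLIER[target]
    return baseline_monthly_cost * low, baseline_monthly_cost * high


for target in COST_MULTIPLIER:
    low, high = cost_range(target, baseline_monthly_cost=1_000)  # hypothetical baseline
    print(f"{target:.2%}: {allowed_downtime_minutes(target):.1f} min/30d, "
          f"cost {low}-{high}")
```

Each extra 9 cuts the downtime budget by a factor of ten (432 minutes, then about 43, then about 4 per 30 days), which is the number to set against the cost range.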

Curriculum · 5 modules · 13-15 hours

Part 0 is free. The other four unlock with PRO.

Try the first 2 hours — map the platform, build the dependency graph, identify the failure surfaces. If the loop clicks, upgrade to unlock SLOs, dep-aware alerting, chaos engineering, and the multi-team incident management layer.

P25 · 13-15 hours · 5 modules

✓ Free preview · ⊘ PRO required

Part 0 is free — no card required. Map the platform before paying.

M01

✓Map the platform — dependency graph + failure surfaces

Build the system topology over 6 pipelines with BFS, DFS, and topological sort. Trace critical paths. Identify failure surfaces before measuring anything. The artifact every other layer depends on.

2h · 8 lessons · FREE PREVIEW

Start →
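To make the M01 graph work concrete, here is a minimal sketch of a topological sort (Kahn's algorithm) and a BFS "blast radius" over a six-pipeline graph. The pipeline names and the `DEPENDS_ON` map are hypothetical stand-ins for the seeded data, and the helper names are not taken from the project.

```python
from collections import deque

# Hypothetical pipeline -> direct upstreams map (6 pipelines).
DEPENDS_ON = {
    "ingest_orders": [],
    "ingest_users": [],
    "clean_orders": ["ingest_orders"],
    "clean_users": ["ingest_users"],
    "orders_enriched": ["clean_orders", "clean_users"],
    "revenue_dashboard": ["orders_enriched"],
}


def downstream_edges(depends_on):
    """Invert the depends_on map into pipeline -> direct downstreams."""
    out = {node: [] for node in depends_on}
    for node, parents in depends_on.items():
        for parent in parents:
            out[parent].append(node)
    return out


def topological_order(depends_on):
    """Kahn's algorithm: emit a pipeline once all its upstreams are emitted."""
    indegree = {node: len(parents) for node, parents in depends_on.items()}
    children = downstream_edges(depends_on)
    ready = deque(sorted(node for node, deg in indegree.items() if deg == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(depends_on):
        raise ValueError("cycle detected in dependency graph")
    return order


def blast_radius(depends_on, failed):
    """BFS downstream from a failed pipeline: everything it can break."""
    children = downstream_edges(depends_on)
    seen, queue = set(), deque(children[failed])
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(children[node])
    return seen


print(topological_order(DEPENDS_ON))
print(sorted(blast_radius(DEPENDS_ON, "ingest_users")))
```

Ranking pipelines by the size of their blast radius is one simple way to read off failure surfaces: the larger the downstream set, the more a failure there costs.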

M02

⊘SLA, SLO & error budgets — dep-aware violation detection

Define a three-tier SLO ladder. Implement error-budget burn-rate tracking. Build dependency-aware violation detection that knows when an upstream failure cascades — and when it doesn't.

3h · 10 lessons · PRO TIER

Unlock with PRO →
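A hedged sketch of the burn-rate arithmetic behind M02. The tier targets are the 99.9 / 99.5 / 95 ladder shown on this page; the tier names, the `Window` dataclass, and the way inherited (upstream-caused) failures are excluded from a pipeline's own budget are assumptions for illustration.

```python
from dataclasses import dataclass

# SLO targets per tier; the tier names are hypothetical.
TIER_TARGETS = {"tier1": 0.999, "tier2": 0.995, "tier3": 0.95}


@dataclass
class Window:
    total_runs: int
    failed_runs: int
    inherited_failures: int = 0  # failures attributed to an upstream pipeline

    @property
    def own_failures(self):
        """Failures that count against this pipeline's own error budget."""
        return self.failed_runs - self.inherited_failures


def burn_rate(window, target):
    """Observed own-failure rate divided by the failure rate the SLO allows.
    1.0 means the budget is being spent exactly as fast as permitted."""
    if window.total_runs == 0:
        return 0.0
    observed = window.own_failures / window.total_runs
    return observed / (1 - target)


def budget_remaining(window, target):
    """Fraction of the window's error budget left; negative means overspent."""
    allowed_failures = window.total_runs * (1 - target)
    if allowed_failures == 0:
        return 0.0
    return 1 - window.own_failures / allowed_failures


w = Window(total_runs=1000, failed_runs=12, inherited_failures=2)
print(round(burn_rate(w, TIER_TARGETS["tier2"]), 2))         # 2.0
print(round(budget_remaining(w, TIER_TARGETS["tier2"]), 2))  # -1.0
```

In this example 10 own failures against an allowance of 5 gives a burn rate of 2: the budget for the window is gone twice over, even after the two inherited failures are set aside.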

M03

⊘Monitoring & dep-based alerting — 47 symptoms → 1 page

Ship MetricsCollector, AlertingEngine, and the DependencyBasedAlerter that uses Part 0's graph to suppress inherited alerts. Route by severity (SEV1 → page, SEV2 → Slack, SEV3 → email) with deduplication.

3h · 11 lessons · PRO TIER

Unlock with PRO →
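Severity routing with a dedup window might look like the sketch below. The SEV1/SEV2/SEV3 channels and the 5-minute window come from this page; the class shape, the in-memory `last_sent` dict, and passing `now` in explicitly are assumptions made to keep the example small and checkable.

```python
from dataclasses import dataclass, field

ROUTES = {"SEV1": "page", "SEV2": "slack", "SEV3": "email"}
DEDUP_WINDOW_SECONDS = 5 * 60


@dataclass
class AlertRouter:
    # (pipeline, severity) -> timestamp of the last alert actually sent
    last_sent: dict = field(default_factory=dict)

    def route(self, pipeline, severity, now):
        """Return the channel to notify, or None if the alert is a duplicate
        of one sent for the same pipeline and severity within the window."""
        key = (pipeline, severity)
        previous = self.last_sent.get(key)
        if previous is not None and now - previous < DEDUP_WINDOW_SECONDS:
            return None
        self.last_sent[key] = now
        return ROUTES[severity]


router = AlertRouter()
print(router.route("clean_orders", "SEV1", now=0))    # page
print(router.route("clean_orders", "SEV1", now=120))  # None (deduplicated)
print(router.route("clean_orders", "SEV1", now=301))  # page
print(router.route("clean_orders", "SEV2", now=301))  # slack
```

Because this state lives in process memory it is lost on restart, which is the limitation the hardening section calls out.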

M04

⊘Failure simulation & recovery — chaos · retry · circuit-break

Build a FailureSimulator with 5 chaos types. Implement retry/backoff with jitter, IdempotentLoader, CheckpointManager, BackfillSystem with dry-run, and a CircuitBreaker state machine (CLOSED/OPEN/HALF_OPEN). Codify everything as a RecoveryPlaybook.

3h · 12 lessons · PRO TIER

Unlock with PRO →
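The CLOSED / OPEN / HALF_OPEN state machine can be sketched as follows. This is a single-process illustration under assumed defaults (`failure_threshold`, `reset_timeout`, an injectable `clock`), not the project's CircuitBreaker class.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            self.state = State.HALF_OPEN  # timeout elapsed: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial call, or too many failures in a row, opens the circuit.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()

    def _on_success(self):
        self.state = State.CLOSED
        self.failures = 0
```

Usage is `breaker.call(load_batch, batch_id)`, where `load_batch` is whatever hypothetical function talks to the flaky dependency. While the circuit is open, callers fail fast instead of piling onto a dependency that is already down.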

M05

⊘Incidents & multi-team — RootCauseTracer + 5-Whys + cost

Stand up the incident lifecycle (Detect → Respond → Resolve → Learn) with SEV1–SEV4, 5-Whys postmortems, RootCauseTracer (DFS), 10 root-cause categories, multi-team ownership, and the cost-vs-reliability calculator that prices each additional 9.

3h · 12 lessons · PRO TIER

Unlock with PRO →
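A root-cause trace by depth-first search could look like this sketch. It assumes failures cascade contiguously, so every pipeline between the symptom and the root cause is itself failing. The graph, the pipeline names, and `trace_root_causes` are hypothetical; only the DFS idea is taken from the module description.

```python
def trace_root_causes(depends_on, failing, symptom):
    """DFS upstream from `symptom` through failing pipelines only.
    Returns one path per root cause: symptom first, root cause last."""
    paths = []

    def dfs(node, path, visited):
        failing_parents = [
            parent for parent in depends_on.get(node, [])
            if parent in failing and parent not in visited
        ]
        if not failing_parents:
            paths.append(path)  # no failing upstream: this node is a root cause
            return
        for parent in failing_parents:
            dfs(parent, path + [parent], visited | {parent})

    if symptom in failing:
        dfs(symptom, [symptom], {symptom})
    return paths


# Hypothetical pipeline -> direct upstreams map.
depends_on = {
    "revenue_dashboard": ["revenue_daily"],
    "revenue_daily": ["clean_orders", "fx_rates"],
    "clean_orders": ["raw_orders"],
    "fx_rates": [],
    "raw_orders": [],
}
failing = {"revenue_dashboard", "revenue_daily", "clean_orders"}

print(trace_root_causes(depends_on, failing, "revenue_dashboard"))
# [['revenue_dashboard', 'revenue_daily', 'clean_orders']]
```

The last element of each path is the pipeline to page on; the path itself is a natural starting point for the postmortem timeline.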

4 modules locked · Unlock all PRO content for $29/mo

Upgrade to PRO →

Backed by curriculum

Data Observability & Quality

10 modules · 8.3 hours · SLOs + error budgets · alert routing · incident response · dependency-aware alerting · blameless postmortems

Open curriculum→

SLO + error-budget fundamentals from this curriculum are the foundation for the project — PRO subscribers get full access to every module.

The build, in 3 phases

Three sprints. Five stages. One reliability platform.

Each phase ends with a tagged commit and a working artifact. No ambiguity about where you are.

01 · ~5h

Map the platform & set the targets

Dependency graph (BFS · DFS · topo-sort) over 6 pipelines. Failure-surface map. Three-tier SLO ladder with error-budget burn-rate. Dep-aware violation detection wired to the graph.

✓ 6-pipeline dependency graph + failure surfaces
✓ tier-aware SLA/SLO ladder + error budgets
✓ violation detection that respects the graph

02 · ~3h

Detect without paging

RED metrics + AlertingEngine. The DependencyBasedAlerter that suppresses 47 symptom alerts to 1 root-cause page using Part 0's graph. Severity-routed dispatch with dedup.

✓ MetricsCollector + alert_rules + alert_history tables
✓ DependencyBasedAlerter (root-cause-only paging)
✓ SEV1/2/3 routing: page · Slack · email

03 · ~6h

Break it, recover it, own it

FailureSimulator (5 chaos types). Retry/backoff + IdempotentLoader + CheckpointManager + BackfillSystem + CircuitBreaker. Incident lifecycle with SEV1–SEV4, 5-Whys, RootCauseTracer, 10 categories, multi-team ownership.

✓ FailureSimulator + RecoveryPlaybook + CircuitBreaker
✓ incidents · postmortems · root_causes (30+ tables)
✓ teams + pipeline_ownership + cost-vs-reliability calculator

Project setup · 5 minutes

One command. Pre-incident platform + the full reliability stack.

You get a real platform on day one — Postgres seeded with 10K pipeline runs, 50 incidents (15 still open), 800 SLA violations, plus 33 Python modules covering every layer (graph, SLO, alerting, chaos, incidents).

What lives in the repo

Everything you need to map, measure, monitor, recover, and own a data platform — and the seed data that's pre-broken with open incidents and burning error budgets so you have something real to fix.

docker-compose.yml — Postgres + Prometheus + Grafana
migrations/ — 11 schema migrations · 30+ reliability tables
seed/02_sample_data.sql — 10K runs · 50 incidents · 800 SLA violations · 20 postmortems
dataguard/ — 33 Python modules: dep_graph, slo, alerting, chaos, incidents
alerting/ + failure_simulator.py — DependencyBasedAlerter + FailureSimulator (5 chaos types)

Download · Starter Kit

DataGuard Reliability Starter Kit

Pre-configured Docker stack, the 11-migration seed schema, 10K pipeline runs and 50 incidents pre-loaded, and the 33 Python modules for every part. Skip the boilerplate, start on Part 0.

286 KB · Docker · 11 migrations · 33 Python modules · Prometheus + Grafana · PRO required

~/projects/dataguard-reliability — zsh

1. Clone and start the stack

$ git clone https://github.com/ai-de/p25-dataguard-reliability

$ cd p25-dataguard-reliability && docker-compose up -d

2. Apply 11 migrations + seed 10K runs / 50 incidents

$ psql -h localhost -U dataguard -f migrations/run_all.sql

$ psql -h localhost -U dataguard -f seed/02_sample_data.sql

3. Build the dependency graph + verify

$ python3 -m dataguard.dep_graph.build && python3 -m dataguard.dep_graph.verify

4. Open Prometheus · Grafana

$ open http://localhost:3000 # Grafana

$ open http://localhost:9090 # Prometheus

30+ reliability tables · 10K pipeline runs · 50 incidents (15 open) · 800 SLA violations

Production hardening

The same stack — but built for the 10x case.

Most reliability tutorials show you the retry_with_backoff. This one shows what changes when the dependency graph has 200+ pipelines, three on-call rotations are watching the same alerts, and the alerter that worked at 6 pipelines starts paging at 47.

Tutorial pattern (what you ship in modules 01–05) vs. production pattern (what changes at scale)

Anomaly detection
Tutorial: Static threshold row_count < 80000
Production: Rolling EWMA / z-score on row_count + freshness series

Alert dedup
Tutorial: In-process 5-min window; lost on restart
Production: Alertmanager group_by + repeat-interval, durable across restarts

Postmortems
Tutorial: 5-Whys YAML stored as why_1..why_5 columns
Production: Causal-tree analysis surfaced as a networkx DiGraph visualization

Circuit breaker
Tutorial: Single-process state machine; per-service
Production: Distributed state in Redis with PubSub; cluster-wide trip

Coverage
Tutorial: 6 pipelines · 30+ tables · per-run scoring
Production: 200+ pipelines with sampling tiers + incremental graph rebuild

Incident store
Tutorial: Postgres incidents table; single region
Production: Durable write-ahead log + read replicas for global on-call

Root cause
Tutorial: 10 hand-curated categories in a root_causes enum
Production: ML-suggested classification on incident-text embeddings + 5-Whys

Ownership model
Tutorial: Single-team pipeline_ownership; manual joins
Production: Multi-tenant policy-as-code with row-level security per team
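To show what moving from a static threshold to a rolling EWMA / z-score means in practice, here is a minimal sketch. The smoothing factor, the z threshold of 3, and the sample row counts are illustrative assumptions; each value is scored against the exponentially weighted mean and variance of the values seen before it.

```python
import math


def ewma_zscores(values, alpha=0.3):
    """Return (value, z) pairs, where z compares each value with the EWMA
    mean and standard deviation of the values that preceded it."""
    if not values:
        return []
    mean, var = values[0], 0.0
    scored = [(values[0], 0.0)]
    for x in values[1:]:
        std = math.sqrt(var)
        z = (x - mean) / std if std > 0 else 0.0
        scored.append((x, z))
        # Incremental EWMA update of the mean and variance.
        diff = x - mean
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return scored


def anomalies(values, threshold=3.0, alpha=0.3):
    """Values whose |z| exceeds the threshold."""
    return [x for x, z in ewma_zscores(values, alpha) if abs(z) > threshold]


# Hypothetical daily row counts; the last run drops about 16% of rows.
row_counts = [100_000, 101_000, 99_500, 100_500, 100_200, 84_000]
print(anomalies(row_counts))  # [84000]
```

Unlike `row_count < 80000`, this flags the 84,000-row run even though it sits above the static threshold, because it is far outside what the recent series makes plausible.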

PRO benefit · code review

Real review from senior engineers who shipped this stack.

Submit your repo, get line-by-line feedback within 48 hours. The kind of review that's quietly worth thousands of dollars in time-to-staff.

4 reviews / month

Submit a repo, a PR, or an architecture proposal. Reviewer is matched to your domain — SRE / platform / incident response for this project. Async, comments inline, average turnaround 31 hours.

31h avg turnaround · 9.2/10 helpfulness · 94% return next month

2 office hours / month

Live 30-min sessions with a senior data engineer. Dependency-graph design, mock a system-design interview on dep-aware alerting, whiteboard the cost-vs-reliability calculator. Group sessions also available.

30 min per session · 2 / mo included · group sessions unlimited

What PRO unlocks

One subscription. 15+ projects, all curriculum, code review.

PRO is built for senior+ engineers who want production-grade builds and feedback loops — not more tutorials.

What you get · FREE / PRO / EXPERT

Projects (production-grade builds): PRO 15+
Curriculum modules (all 7 tracks): FREE Phase 1 only · PRO All · EXPERT All + bonus
Code review credits (senior engineer review): PRO 4 / month · EXPERT Unlimited
Career path access (5 paths × full plans): FREE 1 path · PRO All 5 · EXPERT All 5 + 1:1
Certificate (verifiable on LinkedIn): FREE none · PRO Yes · EXPERT Yes + portfolio review
Community (Discord + office hours): FREE Read-only · PRO Full + 2/mo · EXPERT Full + 4/mo

PRO: $29/mo · billed monthly · cancel anytime
or annual: $249/yr (save 28%)

Who this is for

Pick this if you’ve been paged 47 times for one root cause, not if you’re still learning the basics.

Platform engineers running data infra

You operate Postgres, Airflow, dbt, and the warehouse for 4+ teams. The dependency graph + dep-aware alerter + cost-vs-reliability calculator are the patterns you needed last quarter.

On-call engineers tired of cascading pages

When every alert is P1, nothing is. The DependencyBasedAlerter that suppresses 47 symptoms to 1 root-cause page is the artifact that gets you off the rotation that's burning the team out.

Senior data engineers leveling up to staff

You ship pipelines today; the staff-level rubric checks for system topology, chaos engineering, blameless 5-Whys, and multi-team ownership. This is that rubric, built end-to-end.

Tech leads + EMs setting the reliability bar

You're approving the budget for reliability work. The cost-vs-reliability calculator turns 'can we get to 99.99%?' from a vibe into a number that has a 50–100x compute coefficient — the conversation that gets approved.

Related curriculum

Going deeper? Three tracks back this project.

SRE-for-data is the spine. These three curricula let you go deeper on the layers that matter most: system design, automation, and the cost story behind each additional 9.

FAQ

Quick answers.

How is Part 0 different from a free dependency-graph tutorial?+

Part 0 (free) gives you a 6-pipeline platform pre-loaded in Postgres with real upstream/downstream relationships. You ship BFS, DFS, and topological sort over the actual graph and identify failure surfaces in real depends_on data. Most free tutorials show the algorithm in isolation; this one builds the artifact every other layer in the project depends on.

How is this different from the DataGuard Observability project (P10)?+

P10 is the detect / trace / prevent layer — dbt + Great Expectations validation + OpenLineage column lineage + Prometheus + Grafana on a 5-table pre-broken warehouse. P25 (this) is the SRE-for-data layer — dependency graphs + dep-aware SLOs + chaos engineering + 5-Whys postmortems + multi-team ownership. They are different layers, not different depths. Build observability first if you don't yet have quality metrics on your data; jump here if you already operate at scale and need the platform-engineering patterns.

What's NOT in this project?+

ML-based anomaly detection (all thresholds are static or tier-based; rolling EWMA/z-score is covered in the hardening section as an upgrade path). Real Datadog or PagerDuty integration (the AlertRouter abstracts the sink — wiring an actual integration is a 10-line change). Cloud deployment (everything runs locally in Docker; the patterns transfer cleanly to managed services).

Do I need cloud credentials?+

No. Everything runs locally — Postgres + Prometheus + Grafana + the 33 Python modules, all in docker-compose. The patterns transfer cleanly to managed services (RDS + Datadog + PagerDuty + on-call.com) with config changes only.

What does PRO actually unlock for $29/mo?+

All 15+ PRO projects, 4 code-review credits per month, 2 office-hours sessions, full curriculum across all 7 tracks, all 5 career paths, a verifiable certificate, and full community access. Cancel anytime.

Will this help with senior+ / staff data engineering interviews?+

Yes. Staff DE rounds increasingly assume you can reason about dependency graphs, dep-aware SLOs, root-cause-only paging, chaos engineering, and multi-team ownership — the same way SRE rounds at API teams do. After this you can whiteboard the 5-stage spine — MAP / MEASURE / MONITOR / RECOVER / OWN — without hand-waving any layer.

Related projects

Paired with this project

P10 · PAID · quality

Data observability stack

Detect, trace, prevent: dbt + OpenLineage + Grafana on a pre-broken warehouse.

Explore project →

Ready to ship a real reliability platform?

Start with Part 0 — free, no card. About 2 hours. By the end you'll have the platform mapped, the 6-pipeline dependency graph built, and the failure-surface analysis that drives every other layer in the project.



Related curriculum tracks: System Design for Data Engineers · DataOps: CI/CD & Infrastructure as Code · Cost Optimization for Data Engineers
