ai-de.net/Projects/P10 · DataGuard Observability
PRO · module 01 free preview · Quality track · P10

Detect, trace, and prevent
broken data
before the CEO sees a wrong dashboard

Build the four-layer data-observability stack on a 5-table e-commerce warehouse pre-broken with 4 quality bugs. Ship dbt + Great Expectations validation, OpenLineage column lineage via Marquez and sqlglot, three-tier SLOs with error budgets and P1–P4 alerting, Prometheus + Grafana — all running locally in a 7-container Docker stack.

Timeline
12-14 hours
Difficulty
Senior+
Stack
dbt · GE · OpenLineage · Prometheus · Grafana

This is the system-design answer for “how would you stop the dashboard from being wrong?” — asked at Stripe, Airbnb, Wayfair, DoorDash and any team where a broken number reaches a VP.

By the end you will have shipped
  • A composite quality_score dbt model across 6 dimensions (completeness · uniqueness · timeliness · validity · consistency · accuracy) with 12 tests and freshness SLAs
  • An OpenLineage emitter wired into Marquez plus a sqlglot AST parser that extracts column-to-column dependencies and exposes upstream/downstream BFS traversal
  • A Great Expectations checkpoint bundling 3 expectation suites, run as the gate between Bronze and Silver
  • A tier_config.yaml with 99.9 / 99.5 / 95% SLOs, error-budget burn-rate alerts, and a P1–P4 AlertRouter with 30-min dedup
  • ContractValidator + RunbookTemplate YAMLs that turn an alert into a five-minute resolution
  • A 7-container Docker stack (Postgres + Prometheus + pushgateway + Grafana + Marquez + Marquez-DB + quality-exporter) with a JSON dashboard you can demo in interviews
PREREQ: Comfortable with SQL (CTEs, window functions, joins), Python (classes, functions), Docker, and a working dbt project. Prior exposure to data observability or dbt fundamentals helps but isn’t required.
Architecture at a glance (dataguard.quality.* · 12 tests green · 0 P1 incidents): Postgres seed → Validate → Trace → Operate
  • Postgres seed (20k rows · 4 bugs): orders (3% dup ids) · customers (4% null id) · products · payments (10% stale) · web_events (5% bad type)
  • Validate (dbt + Great Expectations): quality_score (6 dims · weighted) · dbt test (12 · 4 schema · 8 GE) · GE checkpoint (3 suites · bronze→silver)
  • Trace (OpenLineage · Marquez · sqlglot): OpenLineage RunEvent → Marquez · sqlglot AST (column-level lineage) · LineageGraph (BFS upstream + downstream)
  • Operate (7-container stack): prometheus · grafana · marquez · pushgateway · exporter
  • Pre-planted bugs: 20k seed rows · 4 quality bugs (3% dup · 4% null · 10% stale · 5% bad) · weighted quality_score by dimension → caught by quality_score in module 01
  • SLO ladder · P1–P4 routing: Tier 1: 99.9% · Tier 2: 99.5% · Tier 3: 95% · AlertRouter dedup window: 30 min · error budgets · RunbookTemplate fire-drill → alerts route by tier, never page on stale
Why this matters in 2026

A wrong dashboard erodes trust faster than any other data failure.

The patterns you wire here — quality scoring, column lineage, SLO + error budgets, severity-routed alerting — are the four layers every senior data-engineering rubric now checks for. Monte Carlo, Bigeye, and Soda built companies around them; you build the playbook.

Data SLOs are application SLOs now

At Uber, Airbnb, and Stripe, data freshness and completeness are first-class SLOs with error budgets — same tier as API uptime. Without them, the data team can't say 'no, that dashboard is wrong' before someone acts on it.
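To make the error-budget idea concrete, here is a minimal Python sketch of the arithmetic. The `ErrorBudget` name comes from the module 03 description, but the fields and methods below are assumptions for illustration, not the course's actual implementation. It treats the SLO as a fraction of successful freshness or quality checks over a fixed window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Error budget for one SLO tier (hypothetical field names)."""

    slo_target: float   # e.g. 0.995 for the 99.5% tier
    total_checks: int   # checks expected over the whole SLO window

    @property
    def allowed_failures(self) -> float:
        # The budget is whatever the SLO does not promise.
        return (1.0 - self.slo_target) * self.total_checks

    def remaining(self, failed_checks: int) -> float:
        """Fraction of the budget still unspent (negative once the SLO is missed)."""
        return 1.0 - failed_checks / self.allowed_failures

    def burn_rate(self, failed: int, observed: int) -> float:
        """Observed failure rate divided by the failure rate the SLO allows.

        1.0 means the budget lasts exactly the window; above 1.0 it runs out early.
        """
        return (failed / observed) / (1.0 - self.slo_target)


tier2 = ErrorBudget(slo_target=0.995, total_checks=2000)
print(f"allowed failures: {tier2.allowed_failures:.1f}")
print(f"budget remaining after 4 failures: {tier2.remaining(4):.0%}")
print(f"burn rate (3 failures in 200 checks): {tier2.burn_rate(3, 200):.1f}x")
```

A burn rate well above 1.0 over a short window is the usual trigger for a page, while a slow burn can go to a ticket queue instead.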

Column lineage is the new stack-trace

When the revenue number is off, you don't grep for it — you trace it. OpenLineage + sqlglot AST parsing gives you column-to-column dependencies in minutes, not the hours of detective work that broke the on-call rotation last quarter.
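The traversal half of that idea can be sketched in a few lines of Python. This is an illustrative `LineageGraph`, assuming the column-to-column edges have already been extracted (in the project that is the sqlglot parser's job); the column names and method names here are hypothetical.

```python
from collections import defaultdict, deque


class LineageGraph:
    """In-memory column lineage with BFS in both directions (illustrative only)."""

    def __init__(self):
        self._down = defaultdict(set)  # source column -> columns derived from it
        self._up = defaultdict(set)    # derived column -> columns it reads

    def add_edge(self, source: str, target: str) -> None:
        self._down[source].add(target)
        self._up[target].add(source)

    @staticmethod
    def _bfs(start, adjacency):
        seen, queue, order = {start}, deque([start]), []
        while queue:
            node = queue.popleft()
            # Sort neighbours so the traversal order is deterministic.
            for nxt in sorted(adjacency.get(node, ())):
                if nxt not in seen:
                    seen.add(nxt)
                    order.append(nxt)
                    queue.append(nxt)
        return order

    def downstream(self, column: str):
        """Blast radius: everything that depends on `column`."""
        return self._bfs(column, self._down)

    def upstream(self, column: str):
        """Suspects: everything `column` is computed from."""
        return self._bfs(column, self._up)


graph = LineageGraph()
# Hypothetical edges, as a SQL parser might emit them.
graph.add_edge("orders.amount", "stg_orders.amount")
graph.add_edge("stg_orders.amount", "fct_revenue.gross_revenue")
graph.add_edge("payments.status", "fct_revenue.gross_revenue")
graph.add_edge("fct_revenue.gross_revenue", "dash_exec.revenue")

print(graph.downstream("orders.amount"))
print(graph.upstream("fct_revenue.gross_revenue"))
```

Downstream answers "what breaks if this column is wrong"; upstream answers "where could this wrong number have come from".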

Tier-based alerting beats more alerting

Three SLO tiers (99.9 / 99.5 / 95) and a P1–P4 routing dict prevent the alert fatigue that makes engineers mute the channel. Every page is intentional; every dedup is by design.
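A rough sketch of how severity routing and the 30-minute dedup window could fit together is shown below. The class name `AlertRouter` matches the project description, but the `ROUTES` mapping, the sink names, and the method signature are assumptions made for this example.

```python
from datetime import datetime, timedelta

# Hypothetical severity -> sink mapping; the real routing dict may differ.
ROUTES = {"P1": "page", "P2": "page", "P3": "ticket", "P4": "log"}


class AlertRouter:
    """Route alerts by severity and suppress repeats inside a dedup window."""

    def __init__(self, dedup_window: timedelta = timedelta(minutes=30)):
        self.dedup_window = dedup_window
        self._last_sent = {}  # (table, check, severity) -> time of last delivery

    def route(self, table: str, check: str, severity: str, now: datetime):
        """Return the sink name, or None if the alert is a duplicate."""
        key = (table, check, severity)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.dedup_window:
            return None  # same alert fired recently: stay quiet
        self._last_sent[key] = now
        return ROUTES[severity]


router = AlertRouter()
t0 = datetime(2026, 1, 1, 3, 0)
print(router.route("orders", "uniqueness", "P1", t0))
print(router.route("orders", "uniqueness", "P1", t0 + timedelta(minutes=10)))
print(router.route("orders", "uniqueness", "P1", t0 + timedelta(minutes=31)))
```

Passing `now` explicitly keeps the router easy to test. The in-process dict loses its state on restart, which is the limitation the hardening section calls out.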

The vendors built this — but you should know what's behind them

Monte Carlo and Bigeye are great products. Senior interviews still ask you to describe the system underneath: what gets measured, how it's tested, where lineage is stored, what triggers a page. This project is that system.

Curriculum · 4 modules · 12-14 hours

Module 01 is free. The rest unlocks with PRO.

Try the first 3 hours — stand up the local warehouse, find the 4 pre-planted bugs, score quality across 6 dimensions. If the loop clicks, upgrade to unlock lineage, SLOs, and the full Prometheus + Grafana monitoring stack.

P10 · 12-14 hours · 4 modules
Free preview · PRO required
Module 01 is free — no card required. Find the broken data before paying.
M01
Find the bug — quality scoring + dbt tests
Score the warehouse across 6 dimensions (completeness, uniqueness, timeliness, validity, consistency, accuracy). Build the composite quality_score dbt model. Ship 12 tests (4 schema + 8 dbt_expectations) and tiered freshness SLAs that catch the 4 pre-planted bugs.
3h · 9 lessons · FREE PREVIEW
Start →
M02
Trace the failure — OpenLineage + Marquez + sqlglot
Wire an OpenLineage emitter into Marquez. Parse SQL with sqlglot to extract column-to-column dependencies. Build a LineageGraph with BFS for impact analysis. Layer in a Great Expectations checkpoint that bundles 3 expectation suites.
3h · 11 lessons · PRO TIER
Unlock with PRO →
M03
Prevent incidents — SLOs, contracts, P1–P4 routing
Define a three-tier SLO ladder (99.9 / 99.5 / 95) in tier_config.yaml. Implement an ErrorBudget dataclass with burn-rate calculation. Wire a DataContract + ContractValidator. Ship the AlertRouter with P1–P4 severity, 30-min dedup, and a RunbookTemplate that turns alerts into 5-minute resolutions.
4h · 13 lessons · PRO TIER
Unlock with PRO →
M04
Ship the stack — Prometheus + Grafana + 7 containers
Export quality_score, dimension scores, freshness delay, test pass/fail, and SLO budget as Prometheus Gauge + Counter metrics. Build a Grafana JSON dashboard with thresholded panels. Stand up the full 7-container docker-compose stack and verify with the end-to-end checkpoint script.
3h · 12 lessons · PRO TIER
Unlock with PRO →
3 modules locked · Unlock all PRO content for $29/mo
Upgrade to PRO →
Backed by curriculum

Data Observability & Quality

10 modules · 8.3 hours · quality dimensions · dbt testing · Great Expectations · lineage platforms · SLOs + budgets
Open curriculum

This curriculum is the foundation for the project — not a sales add-on. PRO subscribers get full access to every module.

The build, in 3 phases

Three sprints. Three checkpoints. One production observability stack.

Each phase ends with a tagged commit and a working artifact. No ambiguity about where you are.

01 · ~3h
Find the bug, score the warehouse

Six dimensions scored 0–100 across the 5 seed tables. Composite quality_score dbt model materialized. 12 tests green; the 4 pre-planted bugs caught and explained.

  • stg_quality_dimensions + quality_score dbt models
  • 12 tests (4 schema + 8 dbt_expectations)
  • Source freshness SLAs per tier
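The composite score in this phase is a weighted average of the six dimension scores. The project builds it as a dbt model; the Python below is only a sketch of the arithmetic, and the weights are hypothetical values chosen for the example.

```python
# Hypothetical weights for the six dimensions; they must sum to 1.0.
WEIGHTS = {
    "completeness": 0.25,
    "uniqueness": 0.20,
    "timeliness": 0.20,
    "validity": 0.15,
    "consistency": 0.10,
    "accuracy": 0.10,
}


def quality_score(dimension_scores: dict[str, float]) -> float:
    """Weighted composite of per-dimension scores, each on a 0-100 scale."""
    if set(dimension_scores) != set(WEIGHTS):
        raise ValueError("expected exactly the six quality dimensions")
    return round(sum(WEIGHTS[d] * s for d, s in dimension_scores.items()), 2)


# Assumed scores for the orders table: 3% duplicate ids lowers uniqueness to 97.
orders = {
    "completeness": 100.0,
    "uniqueness": 97.0,
    "timeliness": 100.0,
    "validity": 100.0,
    "consistency": 100.0,
    "accuracy": 100.0,
}
print(quality_score(orders))
```

Because a single bad dimension moves the composite only by its weight, it helps to alert on the per-dimension scores as well as the composite.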
02 · ~3h
Trace the failure, prove the blast radius

OpenLineage events flowing to Marquez. sqlglot extracting column lineage from SQL ASTs. LineageGraph BFS exposing upstream + downstream impact. GE checkpoint gating Bronze → Silver.

  • OpenLineageClient.emit() wired into runs
  • sqlglot column-lineage parser + LineageGraph
  • GE checkpoint bundling 3 expectation suites
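For orientation, this is roughly what an OpenLineage run event looks like as plain JSON, built with the standard library only. The top-level field names follow the OpenLineage spec as commonly shown in public examples; the namespace, job and dataset names, producer URL, and the helper `build_run_event` are hypothetical. Check the `schemaURL` against the spec version you target. Nothing is sent over the network here.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_name, inputs, outputs,
                    event_type="COMPLETE", namespace="dataguard"):
    """Build an OpenLineage-style run event as a plain dict (illustrative helper)."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        "outputs": [{"namespace": namespace, "name": name} for name in outputs],
        # Placeholder producer URI; use your own emitter's identifier.
        "producer": "https://example.com/dataguard-emitter",
        # Assumed spec version; verify before relying on it.
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",
    }


event = build_run_event(
    job_name="build_quality_score",
    inputs=["public.orders", "public.payments"],
    outputs=["public.quality_score"],
)
# A Marquez instance accepts events like this via POST /api/v1/lineage.
print(json.dumps(event, indent=2))
```

In the project the OpenLineage client builds and sends these events for you; seeing the raw payload mainly helps when debugging why a job or dataset is missing in Marquez.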
03 · ~7h
Operate it — SLOs, alerts, dashboards, runbooks

Three-tier SLOs with error budgets. P1–P4 routing with 30-min dedup. ContractValidator + RunbookTemplate. Prometheus metrics exported, Grafana dashboard live, 7-container stack verified.

  • tier_config.yaml + ErrorBudget burn-rate
  • AlertRouter (P1–P4) + RunbookTemplate YAML
  • Prometheus exporter + Grafana JSON + 7-container stack
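What the exporter ultimately hands to Prometheus is plain text in the exposition format. The sketch below renders gauge samples by hand to show that shape; the metric name `dataguard_quality_score`, the label, and the helper `render_gauges` are assumptions for this example. The project itself uses the prometheus_client library rather than hand-built strings.

```python
def render_gauges(metric: str, help_text: str, samples) -> str:
    """Render gauge samples in the Prometheus text exposition format.

    `samples` is an iterable of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {metric} {help_text}", f"# TYPE {metric} gauge"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{metric}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


payload = render_gauges(
    "dataguard_quality_score",
    "Composite quality score per table (0-100).",
    [({"table": "orders"}, 99.4), ({"table": "payments"}, 90.0)],
)
print(payload)
```

A short-lived job could push a body like this to a pushgateway under `/metrics/job/<job_name>`, while a long-running exporter would serve it on a `/metrics` endpoint for Prometheus to scrape.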
Project setup · 5 minutes

One command. Pre-broken warehouse + the full observability stack.

You get a real stack on day one — Postgres seeded with 20k rows and 4 intentional bugs, Marquez + Prometheus + Grafana running locally, and the dbt + Python scaffolds for every module.

What lives in the repo

Everything you need to score, test, trace, and monitor a small warehouse — and the seed data that's pre-broken so you have something real to find on module 01.

  • docker-compose.yml — Postgres + Marquez + Marquez-DB + Prometheus + pushgateway + Grafana + quality-exporter (the 7 containers)
  • seed_data.sql — 5 tables · 20k rows · 4 pre-planted quality bugs
  • dbt/ — 6-dimension scoring + composite quality_score + 12 tests
  • dataguard/ — OpenLineage emitter, sqlglot lineage, SLOs, AlertRouter, runbooks
  • prometheus.yml + dashboards/ — scrape config + Grafana JSON dashboard with thresholds
Download · Starter Kit

DataGuard Observability Starter Kit

Pre-configured Docker stack, the pre-broken seed warehouse, and the dbt + Python scaffolds for every module. Skip the boilerplate, start on module 01.

483 KB · Docker · seed SQL · dbt · Python · Prometheus + Grafana · PRO required
~/projects/dataguard-observability — zsh
1. Clone and start the stack
$ git clone https://github.com/ai-de/p10-dataguard-observability
$ cd p10-dataguard-observability && docker-compose up -d
2. Seed the pre-broken warehouse
$ psql -h localhost -U dataguard -f seed_data.sql
3. Run the 12 tests + composite quality score
$ dbt run --select quality && dbt test --select tag:quality
4. Open Marquez · Grafana · Prometheus
$ open http://localhost:3000 # Grafana
$ open http://localhost:5000 # Marquez
$ open http://localhost:9090 # Prometheus
5 seed tables · 20k rows seeded · 4 pre-planted bugs · 12 tests + 3 GE suites
Production hardening

The same stack — but built for the 10x case.

Most observability tutorials show you the dbt test. This one shows what changes when there are 200+ tables, the on-call rotation is 4 engineers, and the threshold that worked at 5k rows pages at 3am when the table is 50M.

Tutorial pattern (what you ship in modules 01–04) → Production pattern (what changes at scale)
  • Anomaly thresholds: static expect_column_values_to_be_between → rolling z-score / EWMA on the quality_score series
  • Alert dedup: 30-min in-process dict, lost on restart → Alertmanager group_by + PagerDuty grouping window
  • Data contract: YAML file + ContractValidator on demand → CI gate (buf breaking / dbt-checkpoint blocks the PR)
  • Root cause: manual runbook lookup, engineer pulls the lineage graph → alert auto-triggers RootCauseAnalyzer.investigate() with ranked upstream suspects
  • Coverage: 5 tables · 12 tests · per-run scoring → tier-based sampling (100% T1, 25% T2, 5% T3) with retention ladder
  • Lineage store: LineageGraph in-memory, rebuilt per process → Marquez on Postgres, durable across restarts and queryable as a graph
  • Metrics export: single prometheus_client HTTP server → pushgateway with retry + scrape jitter for short-lived jobs
  • Dashboards: Grafana JSON, manually imported → dashboards-as-code (Grafana provisioning + folder permissions in git)
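As an illustration of the first upgrade, a rolling z-score over the quality_score series can be sketched with the standard library alone. The window size, threshold, and `min_sigma` floor are hypothetical tuning values, and the series is made-up example data.

```python
from statistics import mean, pstdev


def rolling_zscore_alerts(series, window=7, threshold=3.0, min_sigma=0.1):
    """Return indexes whose value deviates strongly from the preceding window.

    `min_sigma` floors the standard deviation so a very flat history does not
    turn tiny wobbles into huge z-scores (an assumed safeguard, tune as needed).
    """
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = max(pstdev(history), min_sigma)
        z = (series[i] - mu) / sigma
        if abs(z) > threshold:
            flagged.append(i)
    return flagged


# Made-up daily quality_score values with one sharp drop at index 8.
scores = [99.1, 99.3, 99.2, 99.4, 99.2, 99.3, 99.1, 99.2, 94.0, 99.3]
print(rolling_zscore_alerts(scores))
```

Unlike a static between-check, the threshold here follows the table's own recent behaviour, so the same rule can serve a 5k-row table and a 50M-row table. One trade-off to keep in mind: an outlier inflates the variance of the following windows, which temporarily reduces sensitivity.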
PRO benefit · code review

Real review from senior engineers who shipped this stack.

Submit your repo, get line-by-line feedback within 48 hours. The kind of review that's quietly worth thousands of dollars in time-to-staff.

4 reviews / month

Submit a repo, a PR, or an architecture proposal. Reviewer is matched to your domain — observability + SLOs for this project. Async, comments inline, average turnaround 31 hours.

31h avg turnaround · 9.2/10 helpfulness · 94% return next month
2 office hours / month

Live 30-min sessions with a senior data engineer. SLO design questions, whiteboard a tricky lineage trace, mock a system-design interview. Group sessions also available.

30 min per session · 2 / mo included · + group sessions unlimited
What PRO unlocks

One subscription. 15+ projects, all curriculum, code review.

PRO is built for senior+ engineers who want production-grade builds and feedback loops — not more tutorials.

What you get · FREE / PRO / EXPERT
  • Projects (production-grade builds): FREE 2 · PRO 15+ · EXPERT 8
  • Curriculum modules (all 7 tracks): FREE Phase 1 only · PRO All · EXPERT All + bonus
  • Code review credits (senior engineer review): FREE 0 · PRO 4 / month · EXPERT Unlimited
  • Career path access (5 paths × full plans): FREE 1 path · PRO All 5 · EXPERT All 5 + 1:1
  • Certificate (verifiable on LinkedIn): PRO Yes · EXPERT Yes + portfolio review
  • Community (Discord + office hours): FREE Read-only · PRO Full + 2/mo · EXPERT Full + 4/mo
$29/mo billed monthly · cancel anytime, or $249/yr billed annually (save 28%)
Upgrade to PRO
Who this is for

Pick this if you’ve already been paged for a wrong number, not if you’re still learning the basics.

Senior data engineers

You've shipped warehouses and dashboards, lived through a bad number reaching a VP, and want the four-layer system that prevents it next time — built end-to-end, not a vendor demo.

Analytics engineers leveling up

You write the dbt tests today; you want the lineage + SLO + alerting layer on top so 'is the data right' has a measurable answer instead of a Slack thread.

Platform engineers running data infra

You operate Postgres, dbt, and Airflow for 10+ teams. You need the patterns for tiered SLOs, severity routing, and runbooks before signing off on a Monte Carlo / Bigeye contract.

On-call engineers tired of noise

When every alert is P1, nothing is. This project is the playbook for tier-based dedup, error-budget-driven escalation, and the runbook YAML that turns a page into a 5-minute fix.

FAQ

Quick answers.

Module 01 (free) gives you the warehouse pre-broken with 4 specific quality bugs and walks you through scoring them across 6 dimensions with a composite quality_score model. Most free tutorials hand you a single dbt test on a clean dataset; this one builds the mental model for what a quality dimension actually is.
Observability (this project, P10) is the *detect / trace / prevent* layer — dbt + Great Expectations + OpenLineage + SLOs + Grafana on a pre-broken warehouse. Reliability (P25) is the *operate-it* SRE layer — failure simulation, on-call rotation, incident reviews, and platform thinking. Build observability first; reliability assumes the metrics it pages on already exist.
Not covered: ML-based anomaly detection (all thresholds are static or tier-based; rolling z-score / EWMA appears in the hardening section as an upgrade path), cloud deployment (everything runs locally in Docker; the patterns transfer cleanly), and PagerDuty / Slack webhook integration (the AlertRouter abstracts the sink; wiring an actual integration is a 10-line change, but not part of the tutorial).
No cloud account is needed. Everything runs locally: dbt against a 7-container docker-compose stack (Postgres + Prometheus + pushgateway + Grafana + Marquez + Marquez-DB + quality-exporter). The patterns transfer cleanly to managed services (RDS + Snowflake + Datadog + Grafana Cloud) with config changes only.
PRO includes all 15+ projects, 4 code-review credits per month, 2 office-hours sessions per month, the full curriculum across all 7 tracks, all 5 career paths, a verifiable certificate, and full community access. Cancel anytime.
Yes. System-design rounds for senior+ DE roles increasingly assume you can reason about data SLOs, lineage stores, and alert routing the same way you reason about API uptime. After this you can whiteboard the four-layer stack — measure / test / trace / operate — without hand-waving any layer.

Ready to ship a real observability stack?

Start with module 01 — free, no card. About 3 hours. By the end you'll have the warehouse running locally, the 4 pre-planted bugs found, and the composite quality_score model materialized across all 5 tables.
