ai-de.net/Projects/P10 · DataGuard Observability

PRO · module 01 free preview · Quality track · P10

Detect, trace, and prevent broken data before the CEO sees a wrong dashboard

Build the four-layer data-observability stack on a 5-table e-commerce warehouse pre-broken with 4 quality bugs. Ship dbt + Great Expectations validation, OpenLineage column lineage via Marquez and sqlglot, three-tier SLOs with error budgets and P1–P4 alerting, Prometheus + Grafana — all running locally in a 7-container Docker stack.

Timeline: 12-14 hours

Difficulty: Senior+

Stack: dbt · GE · OpenLineage · Prometheus · Grafana

See PRO benefits

This is the system-design answer for “how would you stop the dashboard from being wrong?” — asked at Stripe, Airbnb, Wayfair, DoorDash and any team where a broken number reaches a VP.

By the end you will have shipped

A composite quality_score dbt model across 6 dimensions (completeness · uniqueness · timeliness · validity · consistency · accuracy) with 12 tests and freshness SLAs
An OpenLineage emitter wired into Marquez plus a sqlglot AST parser that extracts column-to-column dependencies and exposes upstream/downstream BFS traversal
A Great Expectations checkpoint bundling 3 expectation suites, run as the gate between Bronze and Silver
A tier_config.yaml with 99.9 / 99.5 / 95% SLOs, error-budget burn-rate alerts, and a P1–P4 AlertRouter with 30-min dedup
ContractValidator + RunbookTemplate YAMLs that turn an alert into a five-minute resolution
A 7-container Docker stack (Postgres + Prometheus + pushgateway + Grafana + Marquez + Marquez-DB + quality-exporter) with a JSON dashboard you can demo in interviews

PREREQ: Comfortable with SQL (CTEs, window functions, joins), Python (classes, functions), Docker, and a working dbt project. Prior exposure to data observability or dbt fundamentals helps but isn’t required.

Architecture diagram: dataguard.quality.* · 12 tests green · 0 P1 incidents · scrape live

Postgres seed (20k rows · 4 bugs)
orders: 3% duplicate ids
customers: 4% null id
products
payments: 10% stale
web_events: 5% bad type

Validate (dbt + Great Expectations)
quality_score: 6 dimensions · weighted
dbt test: 12 · 4 schema · 8 dbt_expectations
GE checkpoint: 3 suites · bronze→silver

Trace (OpenLineage · Marquez · sqlglot)
OpenLineage: RunEvent → Marquez
sqlglot AST: column-level lineage
LineageGraph: BFS upstream + downstream

Operate (7-container stack)
prometheus · grafana · marquez · pushgateway · exporter

Pre-planted bugs
20k seed rows · 4 quality bugs
3% dup · 4% null · 10% stale · 5% bad
weighted quality_score by dimension
→ caught by quality_score in module 01

SLO ladder · P1–P4 routing
Tier 1: 99.9% · Tier 2: 99.5% · Tier 3: 95%
AlertRouter dedup window: 30 min
error budgets · RunbookTemplate fire-drill
→ alerts route by tier, never page on stale

Why this matters in 2026

A wrong dashboard erodes trust faster than any other data failure.

The patterns you wire here — quality scoring, column lineage, SLO + error budgets, severity-routed alerting — are the four layers every senior data-engineering rubric now checks for. Monte Carlo, Bigeye, and Soda built companies around them; you build the playbook.

Data SLOs are application SLOs now

At Uber, Airbnb, and Stripe, data freshness and completeness are first-class SLOs with error budgets — same tier as API uptime. Without them, the data team can't say 'no, that dashboard is wrong' before someone acts on it.
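To make the error-budget idea concrete, here is a minimal sketch that treats each scheduled pipeline run as one event and works out how many failed runs a tier can absorb, plus a simple burn rate. The `ErrorBudget` name echoes the dataclass mentioned in module 03, but the fields, formulas, and run counts below are assumptions for illustration, not the project's actual code.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical error budget for one SLO tier (illustrative only)."""
    slo_target: float   # e.g. 0.995 for a 99.5% SLO
    total_runs: int     # runs expected in the SLO window
    failed_runs: int    # runs that violated the SLO so far

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of runs that may fail without breaking the SLO.
        return (1.0 - self.slo_target) * self.total_runs

    @property
    def remaining(self) -> float:
        return self.allowed_failures - self.failed_runs

    def burn_rate(self, elapsed_fraction: float) -> float:
        """Budget consumed relative to the share of the window elapsed.

        1.0 means the budget runs out exactly at the end of the window;
        above 1.0 means it will be exhausted early.
        """
        if elapsed_fraction <= 0 or self.allowed_failures == 0:
            raise ValueError("elapsed_fraction and budget must be positive")
        consumed = self.failed_runs / self.allowed_failures
        return consumed / elapsed_fraction


# Tier 2 (99.5%) with 2,000 runs in the window tolerates about 10 failed runs.
tier2 = ErrorBudget(slo_target=0.995, total_runs=2000, failed_runs=4)
# 4 failures after 20% of the window is roughly twice the sustainable pace.
rate = tier2.burn_rate(elapsed_fraction=0.2)
```

A burn rate near 2 at this point suggests the budget would be gone about halfway through the window, which is the kind of signal a burn-rate alert is meant to surface.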

Column lineage is the new stack-trace

When the revenue number is off, you don't grep for it — you trace it. OpenLineage + sqlglot AST parsing gives you column-to-column dependencies in minutes, not the hours of detective work that broke the on-call rotation last quarter.
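The "trace it" step boils down to a graph traversal. The sketch below is a toy, standard-library version of a `LineageGraph` with breadth-first upstream and downstream walks. The class shape and the column names are assumptions made for illustration; in the project the edges would come from sqlglot and OpenLineage rather than being added by hand.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy column-level lineage graph; edges point from source to target."""

    def __init__(self):
        self._downstream = defaultdict(set)
        self._upstream = defaultdict(set)

    def add_edge(self, source: str, target: str) -> None:
        self._downstream[source].add(target)
        self._upstream[target].add(source)

    def _bfs(self, start, edges):
        # Standard breadth-first search; sorted() keeps the order deterministic.
        seen, order, queue = {start}, [], deque([start])
        while queue:
            node = queue.popleft()
            for nxt in sorted(edges.get(node, ())):
                if nxt not in seen:
                    seen.add(nxt)
                    order.append(nxt)
                    queue.append(nxt)
        return order

    def upstream(self, column):
        return self._bfs(column, self._upstream)

    def downstream(self, column):
        return self._bfs(column, self._downstream)


# Hypothetical column names, used only to exercise the traversal.
graph = LineageGraph()
graph.add_edge("orders.amount", "stg_orders.amount")
graph.add_edge("stg_orders.amount", "fct_revenue.total_revenue")
graph.add_edge("payments.status", "fct_revenue.total_revenue")
graph.add_edge("fct_revenue.total_revenue", "dashboard.revenue")

suspects = graph.upstream("fct_revenue.total_revenue")
blast_radius = graph.downstream("orders.amount")
```

Here `suspects` lists every column that feeds the revenue number, nearest first, and `blast_radius` lists everything that a bad `orders.amount` could contaminate.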

Tier-based alerting beats more alerting

Three SLO tiers (99.9 / 99.5 / 95) and a P1–P4 routing dict prevent the alert fatigue that makes engineers mute the channel. Every page is intentional; every dedup is by design.
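A minimal sketch of tier-based routing with a 30-minute dedup window might look like the following. The tier-to-severity and severity-to-sink mappings are invented for illustration, and the `route` signature is an assumption; only the P1–P4 levels and the 30-minute window come from the project description.

```python
from datetime import datetime, timedelta

# Hypothetical mappings; the project's tier_config.yaml may define these differently.
SEVERITY_BY_TIER = {1: "P1", 2: "P2", 3: "P3"}
SINK_BY_SEVERITY = {"P1": "pager", "P2": "pager", "P3": "ticket", "P4": "log"}


class AlertRouter:
    """Routes alerts by tier and suppresses repeats inside the dedup window."""

    def __init__(self, dedup_window=timedelta(minutes=30)):
        self.dedup_window = dedup_window
        self._last_sent = {}  # in-process state, so it is lost on restart

    def route(self, table, check, tier, now, breached=True):
        # Warnings that have not breached the SLO are downgraded to P4.
        severity = SEVERITY_BY_TIER[tier] if breached else "P4"
        key = (table, check, severity)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.dedup_window:
            return None  # duplicate inside the window: drop it
        self._last_sent[key] = now
        return {
            "severity": severity,
            "sink": SINK_BY_SEVERITY[severity],
            "table": table,
            "check": check,
        }


router = AlertRouter()
t0 = datetime(2026, 1, 1, 3, 0)
first = router.route("orders", "uniqueness", tier=1, now=t0)
repeat = router.route("orders", "uniqueness", tier=1, now=t0 + timedelta(minutes=10))
later = router.route("orders", "uniqueness", tier=1, now=t0 + timedelta(minutes=31))
```

The second call returns `None` because it arrives 10 minutes after the first; the third goes through because the window has passed.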

The vendors built this — but you should know what's behind them

Monte Carlo and Bigeye are great products. Senior interviews still ask you to describe the system underneath: what gets measured, how it's tested, where lineage is stored, what triggers a page. This project is that system.

Curriculum · 4 modules · 12-14 hours

Module 01 is free. The rest unlocks with PRO.

Try the first 3 hours — stand up the local warehouse, find the 4 pre-planted bugs, score quality across 6 dimensions. If the loop clicks, upgrade to unlock lineage, SLOs, and the full Prometheus + Grafana monitoring stack.

P10 · 12-14 hours · 4 modules

Free preview · PRO required

Module 01 is free — no card required. Find the broken data before paying.

M01

✓ Find the bug: quality scoring + dbt tests

Score the warehouse across 6 dimensions (completeness, uniqueness, timeliness, validity, consistency, accuracy). Build the composite quality_score dbt model. Ship 12 tests (4 schema + 8 dbt_expectations) and tiered freshness SLAs that catch the 4 pre-planted bugs.

3h · 9 lessons · FREE PREVIEW

Start →
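The composite score in module 01 lives in a dbt model, but the arithmetic behind a weighted score is easy to sketch in Python. Everything below is an assumption made for illustration: the weights, the `pass_rate` helper, and the sample dimension scores are invented, and the project's own weighting may differ.

```python
# Hypothetical weights for the six dimensions; they must sum to 1.0.
DIMENSION_WEIGHTS = {
    "completeness": 0.25,
    "uniqueness": 0.20,
    "timeliness": 0.20,
    "validity": 0.15,
    "consistency": 0.10,
    "accuracy": 0.10,
}


def pass_rate(total_rows: int, failing_rows: int) -> float:
    """Score a dimension 0-100 as the share of rows that pass its check."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return 100.0 * (total_rows - failing_rows) / total_rows


def quality_score(scores: dict) -> float:
    """Weighted average of the six dimension scores, each on a 0-100 scale."""
    missing = set(DIMENSION_WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name, value in scores.items():
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} score {value} is outside 0-100")
    return sum(weight * scores[name] for name, weight in DIMENSION_WEIGHTS.items())


# Made-up inputs: 3% duplicate ids gives a uniqueness score of 97, and so on.
scores = {
    "completeness": pass_rate(1000, 40),
    "uniqueness": pass_rate(1000, 30),
    "timeliness": pass_rate(1000, 100),
    "validity": pass_rate(1000, 50),
    "consistency": 100.0,
    "accuracy": 100.0,
}
composite = quality_score(scores)
```

With these invented inputs the composite lands a little under 96, which shows how a 10% staleness problem can hide inside a healthy-looking aggregate unless the per-dimension scores are tracked as well.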

M02

⊘ Trace the failure: OpenLineage + Marquez + sqlglot

Wire an OpenLineage emitter into Marquez. Parse SQL with sqlglot to extract column-to-column dependencies. Build a LineageGraph with BFS for impact analysis. Layer in a Great Expectations checkpoint that bundles 3 expectation suites.

3h · 11 lessons · PRO TIER

Unlock with PRO →
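For orientation, an OpenLineage run event is a JSON document with a small set of core fields. The sketch below builds one with the standard library only and does not send it anywhere. The `build_run_event` helper, the `dataguard` namespace, the job and dataset names, and the producer URL are all hypothetical, and the `schemaURL` should be pinned to whichever spec version your client targets.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_name, inputs, outputs, event_type="COMPLETE",
                    namespace="dataguard", run_id=None):
    """Build an OpenLineage-style RunEvent as a plain dict (illustrative helper)."""
    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        "outputs": [{"namespace": namespace, "name": name} for name in outputs],
        # Placeholder producer URI; replace with the emitter's real identifier.
        "producer": "https://example.com/dataguard/emitter",
        # Assumed spec version; check it against the client you actually use.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
    }


event = build_run_event(
    job_name="dbt.quality_score",
    inputs=["public.orders", "public.payments"],
    outputs=["analytics.quality_score"],
)
payload = json.dumps(event)
# Marquez accepts events like this via HTTP POST on its lineage endpoint
# (/api/v1/lineage); the request itself is left out of this sketch.
```

In practice the OpenLineage Python client builds and sends these events for you; the point of writing the dict by hand is to see which fields Marquez uses to link jobs to datasets.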

M03

⊘ Prevent incidents: SLOs, contracts, P1–P4 routing

Define a three-tier SLO ladder (99.9 / 99.5 / 95) in tier_config.yaml. Implement an ErrorBudget dataclass with burn-rate calculation. Wire a DataContract + ContractValidator. Ship the AlertRouter with P1–P4 severity, 30-min dedup, and a RunbookTemplate that turns alerts into 5-minute resolutions.

4h · 13 lessons · PRO TIER

Unlock with PRO →
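As a rough picture of what a contract check does, the sketch below validates rows against a declared set of columns, types, and nullability. The `DataContract` and `ContractValidator` names match the module description, but the `ColumnSpec` helper, the fields, and the sample `customers` columns are assumptions; the project defines its contracts in YAML rather than inline.

```python
from dataclasses import dataclass


@dataclass
class ColumnSpec:
    name: str
    dtype: type
    nullable: bool = False


@dataclass
class DataContract:
    table: str
    columns: list


class ContractValidator:
    """Checks rows against a contract and returns human-readable violations."""

    def __init__(self, contract: DataContract):
        self.contract = contract

    def validate(self, rows):
        violations = []
        for i, row in enumerate(rows):
            for col in self.contract.columns:
                if col.name not in row:
                    violations.append(f"row {i}: missing column {col.name}")
                    continue
                value = row[col.name]
                if value is None:
                    if not col.nullable:
                        violations.append(f"row {i}: {col.name} is null")
                elif not isinstance(value, col.dtype):
                    violations.append(
                        f"row {i}: {col.name} expected {col.dtype.__name__}, "
                        f"got {type(value).__name__}"
                    )
        return violations


# Hypothetical contract for the customers table.
contract = DataContract(
    table="customers",
    columns=[ColumnSpec("id", int), ColumnSpec("email", str, nullable=True)],
)
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": None, "email": None},
    {"id": "3", "email": "c@example.com"},
]
violations = ContractValidator(contract).validate(rows)
```

Returning a list of violations rather than raising on the first one makes the output usable both as a CI gate and as the body of an alert.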

M04

⊘ Ship the stack: Prometheus + Grafana + 7 containers

Export quality_score, dimension scores, freshness delay, test pass/fail, and SLO budget as Prometheus Gauge + Counter metrics. Build a Grafana JSON dashboard with thresholded panels. Stand up the full 7-container docker-compose stack and verify with the end-to-end checkpoint script.

3h · 12 lessons · PRO TIER

Unlock with PRO →
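To show what Prometheus actually scrapes, the sketch below hand-renders a gauge in the text exposition format using only the standard library. It is an illustration of the wire format, not a replacement for `prometheus_client`, which handles this in a real exporter. The `render_gauge` helper and the `dataguard_quality_score` metric name are hypothetical, and label values are not escaped here.

```python
def render_gauge(name, help_text, samples):
    """Render one gauge metric in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        # Sort label names so the output is stable between runs.
        label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Made-up scores for two of the seed tables.
output = render_gauge(
    "dataguard_quality_score",
    "Composite quality score (0-100)",
    [
        ({"table": "orders", "tier": "1"}, 97.0),
        ({"table": "payments", "tier": "2"}, 90.0),
    ],
)
```

Each label combination becomes its own time series, so keeping labels to low-cardinality values such as table and tier helps keep the Grafana panels and the Prometheus storage manageable.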

3 modules locked · Unlock all PRO content for $29/mo

Upgrade to PRO →

Backed by curriculum

Data Observability & Quality

10 modules · 8.3 hours · quality dimensions · dbt testing · Great Expectations · lineage platforms · SLOs + budgets

Open curriculum →

This curriculum is the foundation for the project — not a sales add-on. PRO subscribers get full access to every module.

The build, in 3 phases

Three sprints. Three checkpoints. One production observability stack.

Each phase ends with a tagged commit and a working artifact. No ambiguity about where you are.

01 · ~3h

Find the bug, score the warehouse

Six dimensions scored 0–100 across the 5 seed tables. Composite quality_score dbt model materialized. 12 tests green; the 4 pre-planted bugs caught and explained.

✓ stg_quality_dimensions + quality_score dbt models
✓ 12 tests (4 schema + 8 dbt_expectations)
✓ Source freshness SLAs per tier

02 · ~3h

Trace the failure, prove the blast radius

OpenLineage events flowing to Marquez. sqlglot extracting column lineage from SQL ASTs. LineageGraph BFS exposing upstream + downstream impact. GE checkpoint gating Bronze → Silver.

✓ OpenLineageClient.emit() wired into runs
✓ sqlglot column-lineage parser + LineageGraph
✓ GE checkpoint bundling 3 expectation suites

03 · ~7h

Operate it — SLOs, alerts, dashboards, runbooks

Three-tier SLOs with error budgets. P1–P4 routing with 30-min dedup. ContractValidator + RunbookTemplate. Prometheus metrics exported, Grafana dashboard live, 7-container stack verified.

✓ tier_config.yaml + ErrorBudget burn-rate
✓ AlertRouter (P1–P4) + RunbookTemplate YAML
✓ Prometheus exporter + Grafana JSON + 7-container stack

Project setup · 5 minutes

One command. Pre-broken warehouse + the full observability stack.

You get a real stack on day one — Postgres seeded with 20k rows and 4 intentional bugs, Marquez + Prometheus + Grafana running locally, and the dbt + Python scaffolds for every module.

What lives in the repo

Everything you need to score, test, trace, and monitor a small warehouse — and the seed data that's pre-broken so you have something real to find on module 01.

docker-compose.yml: Postgres + Marquez + Marquez-DB + Prometheus + pushgateway + Grafana + quality-exporter (7 containers)
seed_data.sql: 5 tables · 20k rows · 4 pre-planted quality bugs
dbt/: 6-dimension scoring + composite quality_score + 12 tests
dataguard/: OpenLineage emitter, sqlglot lineage, SLOs, AlertRouter, runbooks
prometheus.yml + dashboards/: scrape config + Grafana JSON dashboard with thresholds

Download · Starter Kit

DataGuard Observability Starter Kit

Pre-configured Docker stack, the pre-broken seed warehouse, and the dbt + Python scaffolds for every module. Skip the boilerplate, start on module 01.

483 KB · Docker · seed SQL · dbt · Python · Prometheus + Grafana · PRO required

~/projects/dataguard-observability — zsh

1. Clone and start the stack

$ git clone https://github.com/ai-de/p10-dataguard-observability

$ cd p10-dataguard-observability && docker-compose up -d

2. Seed the pre-broken warehouse

$ psql -h localhost -U dataguard -f seed_data.sql

3. Run the 12 tests + composite quality score

$ dbt run --select quality && dbt test --select tag:quality

4. Open Marquez · Grafana · Prometheus

$ open http://localhost:3000 # Grafana

$ open http://localhost:5000 # Marquez

$ open http://localhost:9090 # Prometheus

5 seed tables · 20k rows seeded · 4 pre-planted bugs · 12 tests + 3 GE suites

Production hardening

The same stack — but built for the 10x case.

Most observability tutorials show you the dbt test. This one shows what changes when there are 200+ tables, the on-call rotation is 4 engineers, and the threshold that worked at 5k rows pages at 3am when the table is 50M.

What you ship in modules 01–04 (tutorial pattern)

Anomaly thresholds: static expect_column_values_to_be_between
Alert dedup: 30-min in-process dict; lost on restart
Data contract: YAML file + ContractValidator on demand
Root cause: manual runbook lookup; engineer pulls the lineage graph
Coverage: 5 tables · 12 tests · per-run scoring
Lineage store: LineageGraph in-memory, rebuilt per process
Metrics export: single prometheus_client HTTP server
Dashboards: Grafana JSON, manually imported

What changes at scale (production pattern)

✓ Anomaly thresholds: rolling z-score / EWMA on the quality_score series
✓ Alert dedup: Alertmanager group_by + PagerDuty grouping window
✓ Data contract: CI gate; buf breaking / dbt-checkpoint blocks the PR
✓ Root cause: alert auto-triggers RootCauseAnalyzer.investigate() with ranked upstream suspects
✓ Coverage: tier-based sampling (100% T1, 25% T2, 5% T3) with retention ladder
✓ Lineage store: Marquez on Postgres, durable across restarts, queryable as a graph
✓ Metrics export: pushgateway with retry + scrape jitter for short-lived jobs
✓ Dashboards: dashboards-as-code; Grafana provisioning + folder permissions in git
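The move from static thresholds to a rolling z-score can be sketched in a few lines. This is one simple way to flag outliers on a `quality_score` series, assuming a fixed look-back window and a threshold of three standard deviations; the function name, window size, and sample series are all invented for illustration, and an EWMA variant would weight recent points more heavily instead.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(series, window=7, threshold=3.0):
    """Return indexes whose value deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # A perfectly flat window gives no scale to compare against; skip it.
            continue
        z = (series[i] - mu) / sigma
        if abs(z) > threshold:
            anomalies.append(i)
    return anomalies


# Made-up daily quality scores with one sharp drop at index 8.
daily_scores = [97.0, 96.5, 97.2, 96.8, 97.1, 96.9, 97.0, 96.7, 88.0, 97.0]
flagged = rolling_zscore_anomalies(daily_scores)
```

Only the drop to 88.0 is flagged. The recovery on the following day is not, because the outlier itself widens the window's standard deviation, which is one reason production setups often exclude flagged points from the baseline.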

PRO benefit · code review

Real review from senior engineers who shipped this stack.

Submit your repo, get line-by-line feedback within 48 hours. The kind of review that's quietly worth thousands of dollars in time-to-staff.

4 reviews / month

Submit a repo, a PR, or an architecture proposal. Reviewer is matched to your domain — observability + SLOs for this project. Async, comments inline, average turnaround 31 hours.

31h avg turnaround · 9.2/10 helpfulness · 94% return next month

2 office hours / month

Live 30-min sessions with a senior data engineer. SLO design questions, whiteboard a tricky lineage trace, mock a system-design interview. Group sessions also available.

30 min per session · 2 / mo included · + group sessions unlimited

What PRO unlocks

One subscription. 15+ projects, all curriculum, code review.

PRO is built for senior+ engineers who want production-grade builds and feedback loops — not more tutorials.

What you get (FREE · PRO · EXPERT)

Projects (production-grade builds): PRO 15+
Curriculum modules (all 7 tracks): FREE Phase 1 only · PRO All · EXPERT All + bonus
Code review credits (senior engineer review): PRO 4 / month · EXPERT Unlimited
Career path access (5 paths × full plans): FREE 1 path · PRO All 5 · EXPERT All 5 + 1:1
Certificate (verifiable on LinkedIn): FREE not included · PRO Yes · EXPERT Yes + portfolio review
Community (Discord + office hours): FREE Read-only · PRO Full + 2/mo · EXPERT Full + 4/mo

$29/mo · billed monthly · cancel anytime

or $249/yr · billed annually · save 28%

Upgrade to PRO →

Who this is for

Pick this if you’ve already been paged for a wrong number, not if you’re still learning the basics.

Senior data engineers

You've shipped warehouses and dashboards, lived through a bad number reaching a VP, and want the four-layer system that prevents it next time — built end-to-end, not a vendor demo.

Analytics engineers leveling up

You write the dbt tests today; you want the lineage + SLO + alerting layer on top so 'is the data right' has a measurable answer instead of a Slack thread.

Platform engineers running data infra

You operate Postgres, dbt, and Airflow for 10+ teams. You need the patterns for tiered SLOs, severity routing, and runbooks before signing off on a Monte Carlo / Bigeye contract.

On-call engineers tired of noise

When every alert is P1, nothing is. This project is the playbook for tier-based dedup, error-budget-driven escalation, and the runbook YAML that turns a page into a 5-minute fix.

Related curriculum

Going deeper? Three tracks back this project.

Data observability is the spine. These three curriculums let you go deeper on the layers that matter most — testing, contracts, and the cost of running it at scale.

FAQ

Quick answers.

How is module 01 different from a free dbt-test tutorial?

Module 01 (free) gives you the warehouse pre-broken with 4 specific quality bugs and walks you through scoring them across 6 dimensions with a composite quality_score model. Most free tutorials hand you a single dbt test on a clean dataset; this one builds the mental model for what a quality dimension actually is.

How is this different from the DataGuard Reliability project (P25)?

Observability (this project, P10) is the *detect / trace / prevent* layer — dbt + Great Expectations + OpenLineage + SLOs + Grafana on a pre-broken warehouse. Reliability (P25) is the *operate-it* SRE layer — failure simulation, on-call rotation, incident reviews, and platform thinking. Build observability first; reliability assumes the metrics it pages on already exist.

What's NOT in this project?

ML-based anomaly detection (all thresholds are static or tier-based; rolling z-score / EWMA is covered in the hardening section as an upgrade path). Cloud deployment (everything runs locally in Docker; the patterns transfer cleanly). PagerDuty / Slack webhook integration (the AlertRouter abstracts the sink — wiring an actual integration is a 10-line change, but not part of the tutorial).

Do I need cloud credentials?

No. Everything runs locally — Postgres + dbt + Marquez + Prometheus + pushgateway + Grafana + the quality-exporter, all in a 7-container docker-compose stack. The patterns transfer cleanly to managed services (RDS + Snowflake + Datadog + Grafana Cloud) with config changes only.

What does PRO actually unlock for $29/mo?

All 15+ PRO projects, 4 code-review credits per month, 2 office-hours sessions, full curriculum across all 7 tracks, all 5 career paths, a verifiable certificate, and full community access. Cancel anytime.

Will this help with senior+ data engineering interviews?

Yes. System-design rounds for senior+ DE roles increasingly assume you can reason about data SLOs, lineage stores, and alert routing the same way you reason about API uptime. After this you can whiteboard the four-layer stack — measure / test / trace / operate — without hand-waving any layer.

Ready to ship a real observability stack?

Start with module 01 — free, no card. About 3 hours. By the end you'll have the warehouse running locally, the 4 pre-planted bugs found, and the composite quality_score model materialized across all 5 tables.

See PRO benefits

Related curriculum tracks: dbt & Analytics Engineering · Governance & Data Contracts · Cost Optimization for Data Engineers