Detect, trace, and prevent broken data before the CEO sees a wrong dashboard
Build the four-layer data-observability stack on a 5-table e-commerce warehouse pre-broken with 4 quality bugs. Ship dbt + Great Expectations validation, OpenLineage column lineage via Marquez and sqlglot, three-tier SLOs with error budgets and P1–P4 alerting, Prometheus + Grafana — all running locally in a 7-container Docker stack.
This is the system-design answer for “how would you stop the dashboard from being wrong?” — asked at Stripe, Airbnb, Wayfair, DoorDash and any team where a broken number reaches a VP.
- A composite quality_score dbt model across 6 dimensions (completeness · uniqueness · timeliness · validity · consistency · accuracy) with 12 tests and freshness SLAs
- An OpenLineage emitter wired into Marquez plus a sqlglot AST parser that extracts column-to-column dependencies and exposes upstream/downstream BFS traversal
- A Great Expectations checkpoint bundling 3 expectation suites, run as the gate between Bronze and Silver
- A tier_config.yaml with 99.9 / 99.5 / 95% SLOs, error-budget burn-rate alerts, and a P1–P4 AlertRouter with 30-min dedup
- ContractValidator + RunbookTemplate YAMLs that turn an alert into a five-minute resolution
- A 7-container Docker stack (Postgres + Prometheus + pushgateway + Grafana + Marquez + Marquez-DB + quality-exporter) with a JSON dashboard you can demo in interviews
A wrong dashboard erodes trust faster than any other data failure.
The patterns you wire here — quality scoring, column lineage, SLO + error budgets, severity-routed alerting — are the four layers every senior data-engineering rubric now checks for. Monte Carlo, Bigeye, and Soda built companies around them; you build the playbook.
Data SLOs are application SLOs now
At Uber, Airbnb, and Stripe, data freshness and completeness are first-class SLOs with error budgets — same tier as API uptime. Without them, the data team can't say 'no, that dashboard is wrong' before someone acts on it.
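To make "error budget" concrete, here is a minimal Python sketch of the arithmetic behind a burn-rate alert. The class layout, field names, and the example numbers (5-minute checks over a 30-day window) are assumptions for illustration, not the project's actual ErrorBudget implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Error budget for one SLO tier over a fixed window of quality checks.

    Hypothetical layout: the project's real ErrorBudget may differ.
    """
    slo_target: float    # e.g. 0.995 for the 99.5% tier
    total_checks: int    # checks scheduled across the whole SLO window
    failed_checks: int   # checks that have failed so far
    elapsed_checks: int  # checks that have run so far

    @property
    def allowed_failures(self) -> float:
        # The budget: how many checks may fail before the SLO is breached.
        return (1.0 - self.slo_target) * self.total_checks

    @property
    def budget_remaining(self) -> float:
        # Fraction of the budget still unspent; negative once the SLO is breached.
        return 1.0 - self.failed_checks / self.allowed_failures

    @property
    def burn_rate(self) -> float:
        # Observed failure rate divided by the failure rate the SLO allows.
        # 1.0 means the budget runs out exactly at the end of the window.
        if self.elapsed_checks == 0:
            return 0.0
        observed = self.failed_checks / self.elapsed_checks
        return observed / (1.0 - self.slo_target)


# Assumed example: 5-minute checks over 30 days (8640 checks), 5 days elapsed.
budget = ErrorBudget(slo_target=0.995, total_checks=8640,
                     failed_checks=12, elapsed_checks=1440)
print(f"allowed failures: {budget.allowed_failures:.1f}")
print(f"budget remaining: {budget.budget_remaining:.1%}")
print(f"burn rate:        {budget.burn_rate:.2f}x")
```

A burn rate above 1.0 means the table is spending its budget faster than the SLO allows, which is the usual trigger for escalating before the budget is actually gone.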
Column lineage is the new stack-trace
When the revenue number is off, you don't grep for it — you trace it. OpenLineage + sqlglot AST parsing gives you column-to-column dependencies in minutes, not the hours of detective work that broke the on-call rotation last quarter.
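The tracing step can be sketched as a breadth-first search over column-to-column edges. The snippet below is a simplified stand-in using only the standard library: the edges are added by hand rather than extracted with sqlglot, and the method names and column names are hypothetical.

```python
from collections import defaultdict, deque


class LineageGraph:
    """In-memory column lineage graph (illustrative sketch, not the project's class)."""

    def __init__(self):
        self._down = defaultdict(set)  # source column -> columns derived from it
        self._up = defaultdict(set)    # derived column -> columns it reads from

    def add_edge(self, source, target):
        """Record that `target` is computed from `source`."""
        self._down[source].add(target)
        self._up[target].add(source)

    def _bfs(self, start, edges):
        # Standard BFS; `seen` guards against cycles and duplicate visits.
        seen, order = {start}, []
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in sorted(edges.get(node, ())):
                if nxt not in seen:
                    seen.add(nxt)
                    order.append(nxt)
                    queue.append(nxt)
        return order

    def upstream(self, column):
        """Every column that feeds `column`, nearest first."""
        return self._bfs(column, self._up)

    def downstream(self, column):
        """Every column affected if `column` is wrong, nearest first."""
        return self._bfs(column, self._down)


# Hypothetical edges for a revenue metric.
graph = LineageGraph()
graph.add_edge("orders.amount", "stg_orders.amount_usd")
graph.add_edge("fx_rates.rate", "stg_orders.amount_usd")
graph.add_edge("stg_orders.amount_usd", "fct_revenue.revenue")
graph.add_edge("fct_revenue.revenue", "dashboard.daily_revenue")

print(graph.upstream("fct_revenue.revenue"))
print(graph.downstream("orders.amount"))
```

Upstream traversal answers "where could this bad number have come from?", while downstream traversal answers "what else is now wrong?".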
Tier-based alerting beats more alerting
Three SLO tiers (99.9 / 99.5 / 95%) and a P1–P4 routing dict prevent the alert fatigue that makes engineers mute the channel. Every page is intentional; every dedup is by design.
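A routing dict plus a dedup window fits in a few lines. The sketch below assumes a key of (table, check, severity) and made-up destination names; the project's AlertRouter may key and route differently.

```python
from datetime import datetime, timedelta

# Hypothetical destinations; the real routing targets are not given in the text.
ROUTES = {
    "P1": "pagerduty",
    "P2": "slack-oncall",
    "P3": "slack-data-quality",
    "P4": "ticket-queue",
}


class AlertRouter:
    """Routes alerts by severity and suppresses repeats inside a dedup window."""

    def __init__(self, routes=ROUTES, dedup_window=timedelta(minutes=30)):
        self.routes = routes
        self.dedup_window = dedup_window
        self._last_sent = {}  # (table, check, severity) -> time last delivered

    def route(self, table, check, severity, now):
        """Return the destination, or None if the alert is a duplicate."""
        key = (table, check, severity)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.dedup_window:
            return None  # same alert already delivered within the window
        self._last_sent[key] = now
        return self.routes[severity]


router = AlertRouter()
t0 = datetime(2024, 1, 1, 3, 0)
first = router.route("orders", "freshness", "P1", t0)
repeat = router.route("orders", "freshness", "P1", t0 + timedelta(minutes=10))
later = router.route("orders", "freshness", "P1", t0 + timedelta(minutes=31))
print(first, repeat, later)
```

The window here is measured from the last delivered alert, so a check that keeps failing is re-sent once per window rather than silenced indefinitely.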
The vendors built this — but you should know what's behind them
Monte Carlo and Bigeye are great products. Senior interviews still ask you to describe the system underneath: what gets measured, how it's tested, where lineage is stored, what triggers a page. This project is that system.
Module 01 is free. The rest unlocks with PRO.
Try the first 3 hours — stand up the local warehouse, find the 4 pre-planted bugs, score quality across 6 dimensions. If the loop clicks, upgrade to unlock lineage, SLOs, and the full Prometheus + Grafana monitoring stack.
Data Observability & Quality
This curriculum is the foundation for the project — not a sales add-on. PRO subscribers get full access to every module.
Three sprints. Three checkpoints. One production observability stack.
Each phase ends with a tagged commit and a working artifact. No ambiguity about where you are.
Six dimensions scored 0–100 across the 5 seed tables. Composite quality_score dbt model materialized. 12 tests green; the 4 pre-planted bugs caught and explained.
- ✓ stg_quality_dimensions + quality_score dbt models
- ✓ 12 tests (4 schema + 8 dbt_expectations)
- ✓ Source freshness SLAs per tier
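The scoring logic behind a composite quality_score can be sketched in plain Python. In the project this lives in dbt models; the version below is only an illustration of the arithmetic, with equal weights and sample rows that are assumptions.

```python
DIMENSIONS = ("completeness", "uniqueness", "timeliness",
              "validity", "consistency", "accuracy")


def completeness(rows, column):
    """Percent of rows where `column` is present and not None."""
    if not rows:
        return 0.0
    return 100.0 * sum(r.get(column) is not None for r in rows) / len(rows)


def uniqueness(rows, column):
    """Percent of non-null values in `column` that are distinct."""
    values = [r[column] for r in rows if r.get(column) is not None]
    if not values:
        return 0.0
    return 100.0 * len(set(values)) / len(values)


def composite_score(scores, weights=None):
    """Weighted mean of the six dimension scores, each on a 0-100 scale.

    Equal weights are an assumption; the project may weight dimensions differently.
    """
    weights = weights or {d: 1.0 for d in DIMENSIONS}
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight


# Hypothetical sample: one duplicate order_id and one null.
rows = [{"order_id": 1}, {"order_id": 2}, {"order_id": 2}, {"order_id": None}]
scores = {
    "completeness": completeness(rows, "order_id"),
    "uniqueness": uniqueness(rows, "order_id"),
    "timeliness": 100.0,   # placeholder values for the remaining dimensions
    "validity": 90.0,
    "consistency": 95.0,
    "accuracy": 100.0,
}
print(round(composite_score(scores), 2))
```

Keeping each dimension as its own function makes it easy to explain which dimension dragged the composite down when a planted bug is caught.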
OpenLineage events flowing to Marquez. sqlglot extracting column lineage from SQL ASTs. LineageGraph BFS exposing upstream + downstream impact. GE checkpoint gating Bronze → Silver.
- ✓ OpenLineageClient.emit() wired into runs
- ✓ sqlglot column-lineage parser + LineageGraph
- ✓ GE checkpoint bundling 3 expectation suites
Three-tier SLOs with error budgets. P1–P4 routing with 30-min dedup. ContractValidator + RunbookTemplate. Prometheus metrics exported, Grafana dashboard live, 7-container stack verified.
- ✓ tier_config.yaml + ErrorBudget burn-rate
- ✓ AlertRouter (P1–P4) + RunbookTemplate YAML
- ✓ Prometheus exporter + Grafana JSON + 7-container stack
One command. Pre-broken warehouse + the full observability stack.
You get a real stack on day one — Postgres seeded with 20k rows and 4 intentional bugs, Marquez + Prometheus + Grafana running locally, and the dbt + Python scaffolds for every module.
What lives in the repo
Everything you need to score, test, trace, and monitor a small warehouse — and the seed data that's pre-broken so you have something real to find on module 01.
- docker-compose.yml — Postgres + Marquez + Prometheus + pushgateway + Grafana
- seed_data.sql — 5 tables · 20k rows · 4 pre-planted quality bugs
- dbt/ — 6-dimension scoring + composite quality_score + 12 tests
- dataguard/ — OpenLineage emitter, sqlglot lineage, SLOs, AlertRouter, runbooks
- prometheus.yml + dashboards/ — scrape config + Grafana JSON dashboard with thresholds
DataGuard Observability Starter Kit
Pre-configured Docker stack, the pre-broken seed warehouse, and the dbt + Python scaffolds for every module. Skip the boilerplate, start on module 01.
The same stack — but built for the 10x case.
Most observability tutorials show you the dbt test. This one shows what changes when there are 200+ tables, the on-call rotation is 4 engineers, and the threshold that worked at 5k rows pages at 3am when the table is 50M.
- expect_column_values_to_be_between
- LineageGraph in-memory, rebuilt per process
- prometheus_client HTTP server
- quality_score series
- group_by + PagerDuty grouping window
- buf breaking / dbt-checkpoint blocks the PR
- RootCauseAnalyzer.investigate() with ranked upstream suspects
- pushgateway with retry + scrape jitter for short-lived jobs
Real review from senior engineers who shipped this stack.
Submit your repo, get line-by-line feedback within 48 hours. The kind of review that's quietly worth thousands of dollars in time-to-staff.
4 reviews / month
Submit a repo, a PR, or an architecture proposal. Reviewer is matched to your domain — observability + SLOs for this project. Async, comments inline, average turnaround 31 hours.
2 office hours / month
Live 30-min sessions with a senior data engineer. SLO design questions, whiteboard a tricky lineage trace, mock a system-design interview. Group sessions also available.
One subscription. 15+ projects, all curriculum, code review.
PRO is built for senior+ engineers who want production-grade builds and feedback loops — not more tutorials.
Pick this if you've already been paged for a wrong number, not if you're just starting out.
Senior data engineers
You've shipped warehouses and dashboards, lived through a bad number reaching a VP, and want the four-layer system that prevents it next time — built end-to-end, not a vendor demo.
Analytics engineers leveling up
You write the dbt tests today; you want the lineage + SLO + alerting layer on top so 'is the data right' has a measurable answer instead of a Slack thread.
Platform engineers running data infra
You operate Postgres, dbt, and Airflow for 10+ teams. You need the patterns for tiered SLOs, severity routing, and runbooks before signing off on a Monte Carlo / Bigeye contract.
On-call engineers tired of noise
When every alert is P1, nothing is. This project is the playbook for tier-based dedup, error-budget-driven escalation, and the runbook YAML that turns a page into a 5-minute fix.
Going deeper? Three tracks back this project.
Data observability is the spine. These three curriculums let you go deeper on the layers that matter most — testing, contracts, and the cost of running it at scale.
Quick answers.
Ready to ship a real observability stack?
Start with module 01 — free, no card. About 3 hours. By the end you'll have the warehouse running locally, the 4 pre-planted bugs found, and the composite quality_score model materialized across all 5 tables.