Data Observability Explained: What It Is and How It Works
Data observability is the ability to continuously monitor, detect, and resolve data quality issues before they reach consumers. It applies the operational rigor of software observability — metrics, alerting, incident workflows — to data pipelines, organized around 5 pillars: freshness, volume, schema, distribution, and lineage.
Data SLO contract (YAML)
```yaml
# data_contract.yml — defines what "healthy" means
dataset: orders
owner: data-platform@company.com
slo:
  freshness_hours: 1   # alert if not updated within 1h
  min_rows: 1000       # alert if below 1000 rows
  target: 99.5         # 99.5% uptime target; the 0.5% gap is the error budget
schema:
  - column: order_id
    type: integer      # alert on type change
```
The 5 Pillars of Data Observability
Freshness
— Is data arriving on schedule? Freshness monitoring tracks when each table was last updated and alerts when the last-update age exceeds the defined SLO window. A stale orders table that feeds a real-time dashboard is the most common and painful data incident.
dbt source freshness · Prometheus scrapers
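Stripped of tooling, a freshness check reduces to comparing a table's last-update timestamp against the contract's SLO window. A minimal sketch (the `check_freshness` helper and the 1-hour threshold are illustrative, not from any particular tool):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

FRESHNESS_SLO_HOURS = 1  # matches freshness_hours in the contract above

def check_freshness(last_updated: datetime, now: Optional[datetime] = None) -> bool:
    """True if the table was updated within the SLO window."""
    now = now or datetime.now(timezone.utc)
    return now - last_updated <= timedelta(hours=FRESHNESS_SLO_HOURS)

# A table updated 30 minutes ago passes; one updated 2 hours ago breaches the SLO.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh = check_freshness(now - timedelta(minutes=30), now)  # True
stale = check_freshness(now - timedelta(hours=2), now)     # False
```

Running this on a schedule independent of the pipeline is what closes the gap described under "Common Mistakes" below.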
Volume
— Are row counts within expected bounds? Volume anomaly detection compares current row counts to historical baselines using statistical models. A sudden 80% drop in order rows almost always means an upstream pipeline broke or a source connection failed.
Great Expectations · custom SQL checks
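The simplest statistical model for this is a z-score test against recent history; anything beyond a few standard deviations gets flagged. A sketch (the 3-sigma threshold is an assumed default, not a standard):

```python
import statistics

def volume_anomaly(current_rows: int, history: list, z_threshold: float = 3.0) -> bool:
    """Flag the current row count if it sits more than z_threshold
    standard deviations from the historical mean.
    Assumes non-constant history (stdev > 0)."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(current_rows - mean) / stdev > z_threshold

history = [1000, 1020, 980, 1010, 990]  # recent daily row counts
```

With this history, a day with 200 rows (the 80% drop from the text) is flagged, while 1,005 rows passes.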
Schema
— Did column definitions change unexpectedly? Schema monitoring detects when columns are added, dropped, renamed, or change type between pipeline runs. Schema drift from upstream source changes is one of the top causes of pipeline breakage.
dbt schema tests · OpenLineage schema events
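At its core, a schema monitor is a diff between the expected `{column: type}` mapping (from the contract) and what the warehouse currently reports. A hypothetical sketch:

```python
def schema_drift(expected: dict, actual: dict) -> list:
    """Human-readable drift events between two {column: type} schemas."""
    events = []
    for col in expected.keys() - actual.keys():
        events.append(f"dropped: {col}")
    for col in actual.keys() - expected.keys():
        events.append(f"added: {col}")
    for col in expected.keys() & actual.keys():
        if expected[col] != actual[col]:
            events.append(f"type change: {col} {expected[col]} -> {actual[col]}")
    return sorted(events)
```

An upstream source silently changing `order_id` from integer to text would surface here as a type-change event before it breaks downstream joins.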
Distribution
— Are column values statistically normal? Distribution monitoring tracks the statistical properties of column values — mean, stddev, null rate, unique count — and alerts when they drift beyond expected ranges. It catches data corruption that passes every schema test.
Great Expectations · Monte Carlo · Bigeye
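A minimal distribution monitor profiles a column, then compares today's profile against a baseline. The `profile`/`drifted` helpers and thresholds below are an illustrative sketch, not any tool's API:

```python
import statistics

def profile(values):
    """Summary stats used as a distribution fingerprint for a column."""
    non_null = [v for v in values if v is not None]
    return {
        "null_rate": 1 - len(non_null) / len(values),
        "mean": statistics.mean(non_null),
        "stdev": statistics.stdev(non_null),
    }

def drifted(baseline, current, max_null_rate=0.05, z=3.0):
    """Alert if the null rate spikes or the mean moves beyond z baseline stdevs."""
    if current["null_rate"] > max_null_rate:
        return True
    return abs(current["mean"] - baseline["mean"]) > z * baseline["stdev"]

baseline = profile([10, 12, 11, 13, 9, 10, 11])  # e.g. yesterday's order amounts
```

Note that a batch full of NULLs has a perfectly valid schema; only the distribution check catches it.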
Lineage
— Which upstream tables feed this one? Column-level lineage tracks exactly which sources, transformations, and models feed each downstream table. When something breaks, lineage reduces root cause analysis from hours to minutes by showing exactly which upstream node failed.
OpenLineage · Marquez · dbt lineage
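Under the hood, root cause analysis over lineage is a graph traversal. The `LINEAGE` mapping below is a toy stand-in for the table-to-upstream edges a tool like OpenLineage would collect from run events:

```python
from collections import deque

# Toy lineage edges: table -> upstream tables it reads from.
# A real deployment would build this from OpenLineage/Marquez events.
LINEAGE = {
    "revenue_dashboard": ["orders_enriched"],
    "orders_enriched": ["orders", "customers"],
    "orders": ["raw_orders"],
    "customers": [],
    "raw_orders": [],
}

def upstream(table: str) -> set:
    """All transitive upstream dependencies of a table (BFS)."""
    seen, queue = set(), deque(LINEAGE.get(table, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(LINEAGE.get(node, []))
    return seen
```

When `revenue_dashboard` goes stale, intersecting its upstream set with the tables currently failing checks points straight at the broken node.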
How Data SLOs Work
A data SLO defines a reliability target for a dataset. The gap between the target and 100% is your error budget. Rather than paging on every single freshness miss, you track error budget burn rate — alerting proportionally as you approach the budget limit.
SLI
Service Level Indicator — the metric you measure (e.g. freshness age in hours)
SLO
Service Level Objective — the target (e.g. freshness < 1 hour, 99.5% of days)
Error Budget
The allowed failure rate (0.5% of days ≈ 1.8 days/year you can miss the SLO)
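The burn-rate idea above can be sketched in a few lines; the `burn_rate` function and its 365-day window are illustrative, using the 99.5% target from the contract:

```python
def burn_rate(failed_days: int, days_elapsed: int,
              target: float = 0.995, window_days: int = 365) -> float:
    """Error-budget burn rate: 1.0 means spending the budget exactly on
    schedule over the window; above 1.0 means burning faster than planned."""
    budget_days = (1 - target) * window_days            # ~1.8 days/year at 99.5%
    expected_spend = budget_days * days_elapsed / window_days
    return failed_days / expected_spend
```

One missed day in the first 30 days burns roughly 6.7x faster than budgeted, which justifies a page; the same single miss spread over a full year would not.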
Common Mistakes
Monitoring only at pipeline time
dbt tests run when the pipeline runs. A table can go stale for 6+ hours after a successful run. Add continuous freshness monitoring between runs.
No lineage alongside alerts
An alert tells you something broke. Without lineage, finding the root cause is manual. Deploy OpenLineage at the same time as your first quality checks.
Alerting on every metric deviation
Set anomaly detection thresholds based on historical baselines (p5/p95), not arbitrary numbers. Alert fatigue kills on-call morale faster than bad data.
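A p5/p95 baseline can be computed directly from history with the standard library; this is an illustrative sketch, while production tools layer seasonality models on top:

```python
import statistics

def baseline_bounds(history):
    """p5/p95 bounds from historical observations; alert only outside them."""
    cuts = statistics.quantiles(history, n=20)  # 19 cut points: cuts[0] ~ p5, cuts[-1] ~ p95
    return cuts[0], cuts[-1]

lo, hi = baseline_bounds(list(range(100)))  # e.g. 100 days of row counts
```

Alerting only outside these bounds tolerates the normal 90% of variation instead of paging on every wiggle.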
FAQ
- What is data observability?
- Data observability is the ability to continuously monitor, detect, diagnose, and resolve data quality issues across your pipeline before they reach consumers. It applies software observability practices to data: metrics, alerting, lineage, SLOs, and incident workflows.
- What are the 5 pillars of data observability?
- Freshness (is data arriving on schedule?), Volume (are row counts within bounds?), Schema (did column definitions change?), Distribution (are values statistically normal?), and Lineage (what upstream tables feed this one?). Monitoring all five gives complete pipeline visibility.
- How does a data SLO work?
- A data SLO defines a reliability target (e.g. 99.5% uptime). The gap to 100% is your error budget. When failures burn into the budget, you alert proportionally to the burn rate — not on every individual miss.
- What is the difference between data observability and data monitoring?
- Monitoring checks metrics on a schedule. Observability is the full operational system: automated anomaly detection, lineage for root cause, SLO-based error budgets, and incident workflows. Monitoring is a component of observability.