
Data Observability Explained: What It Is and How It Works

Data observability is the ability to continuously monitor, detect, and resolve data quality issues before they reach consumers. It applies the operational rigor of software observability — metrics, alerting, incident workflows — to data pipelines, organized around 5 pillars: freshness, volume, schema, distribution, and lineage.

Data SLO contract (YAML)

# data_contract.yml — defines what "healthy" means
dataset: orders
owner: data-platform@company.com
slo:
  freshness_hours: 1      # alert if not updated within 1h
  min_rows: 1000          # alert if below 1,000 rows
  uptime_target: 99.5     # 99.5% of days healthy; 0.5% error budget
schema:
  - column: order_id
    type: integer         # alert on type change

The 5 Pillars of Data Observability

Freshness

Is data arriving on schedule?

Freshness monitoring tracks when each table was last updated and alerts when the age exceeds the defined SLO window. A stale orders table feeding a real-time dashboard is one of the most common and painful data incidents.

dbt source freshness · Prometheus scrapers


Volume

Are row counts within expected bounds?

Volume anomaly detection compares current row counts to historical baselines using statistical models. A sudden 80% drop in order rows almost always means an upstream pipeline broke or a source connection failed.

Great Expectations · custom SQL checks
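A simple statistical baseline is a z-score against recent history. A hedged sketch (the threshold and history window are assumptions you would tune per table):

```python
import statistics

def volume_anomaly(current_rows: int, history: list[int],
                   z_threshold: float = 3.0) -> bool:
    """Flag the current row count if it sits more than z_threshold
    standard deviations from the historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current_rows != mean
    return abs(current_rows - mean) / stdev > z_threshold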


Schema

Did column definitions change unexpectedly?

Schema monitoring detects when columns are added, dropped, renamed, or change type between pipeline runs. Schema drift from upstream source changes is one of the top causes of pipeline breakage.

dbt schema tests · OpenLineage schema events
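At its core, schema monitoring is a diff of column-to-type mappings between runs. A minimal sketch (all names illustrative):

```python
def schema_drift(expected: dict[str, str], observed: dict[str, str]) -> list[str]:
    """Diff expected vs observed column -> type mappings and describe drift."""
    alerts = []
    for col, typ in expected.items():
        if col not in observed:
            alerts.append(f"dropped column: {col}")
        elif observed[col] != typ:
            alerts.append(f"type change: {col} {typ} -> {observed[col]}")
    for col in observed:
        if col not in expected:
            alerts.append(f"added column: {col}")
    return alerts
```

Note that a rename surfaces here as a drop plus an add; production tools layer heuristics on top to detect renames as a single event.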


Distribution

Are column values statistically normal?

Distribution monitoring tracks the statistical properties of column values — mean, stddev, null rate, unique count — and alerts when they drift beyond expected ranges. This catches data corruption that passes every schema test, such as a bug that starts writing zeros into a numeric column without changing its type.

Great Expectations · Monte Carlo · Bigeye
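The per-column metrics above can be captured as a profile that is recomputed each run and compared against its own history. A sketch of such a profile (metric set is an assumption; real tools track more):

```python
import statistics

def column_profile(values: list) -> dict:
    """The basic per-column metrics a distribution monitor tracks over time."""
    non_null = [v for v in values if v is not None]
    return {
        "null_rate": 1 - len(non_null) / len(values),
        "unique_count": len(set(non_null)),
        "mean": statistics.mean(non_null) if non_null else None,
        "stdev": statistics.stdev(non_null) if len(non_null) > 1 else None,
    }
```

An alert fires when today's profile falls outside the range of recent profiles, not when any single value looks odd.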


Lineage

Which upstream tables feed this one?

Column-level lineage tracks exactly which sources, transformations, and models feed each downstream table. When something breaks, lineage reduces root cause analysis from hours to minutes by showing exactly which upstream node failed.

OpenLineage · Marquez · dbt lineage
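Lineage is a directed graph, and root-cause triage is a walk over it. A toy sketch (table names and graph shape are hypothetical):

```python
def upstream_of(lineage: dict[str, list[str]], table: str) -> set[str]:
    """Collect every table that transitively feeds `table`."""
    seen: set[str] = set()
    stack = [table]
    while stack:
        for parent in lineage.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

When an alert fires on a downstream table, intersecting this upstream set with the tables that have their own active alerts points directly at the likely root cause.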

How Data SLOs Work

A data SLO defines a reliability target for a dataset. The gap between the target and 100% is your error budget. Rather than paging on every single freshness miss, you track error budget burn rate — alerting proportionally as you approach the budget limit.

SLI

Service Level Indicator — the metric you measure (e.g. freshness age in hours)

SLO

Service Level Objective — the target (e.g. freshness < 1 hour, 99.5% of days)

Error Budget

The allowed failure rate (0.5% of days = roughly 1.8 days/year you can miss the SLO)
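The burn-rate arithmetic behind proportional alerting is small. A sketch (the escalation thresholds you hang off this number are your own policy, not a standard):

```python
def burn_rate(failed_days: float, elapsed_days: float,
              slo_target: float = 0.995) -> float:
    """Ratio of observed failure rate to allowed failure rate.
    1.0 means on pace to exactly exhaust the error budget by period end;
    above 1.0 means burning too fast, so alerting should escalate."""
    allowed_failure_rate = 1 - slo_target        # 0.5% for a 99.5% target
    return (failed_days / elapsed_days) / allowed_failure_rate
```

A burn rate of 2.0, for example, means the budget will be gone halfway through the period if nothing changes.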

Common Mistakes

Monitoring only at pipeline time

dbt tests run when the pipeline runs. A table can go stale for 6+ hours after a successful run. Add continuous freshness monitoring between runs.

No lineage alongside alerts

An alert tells you something broke. Without lineage, finding the root cause is manual. Deploy OpenLineage at the same time as your first quality checks.

Alerting on every metric deviation

Set anomaly detection thresholds based on historical baselines (p5/p95), not arbitrary numbers. Alert fatigue kills on-call morale faster than bad data.
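As a sketch of baseline-derived thresholds, the p5/p95 of recent history can bound what counts as normal (the 20-quantile cut and the length of history to keep are assumptions):

```python
import statistics

def anomaly_bounds(history: list[float]) -> tuple[float, float]:
    """Alert thresholds derived from history rather than picked by hand."""
    q = statistics.quantiles(history, n=20)   # 19 cut points: p5, p10, ..., p95
    return q[0], q[-1]                        # (p5, p95)
```

Values outside these bounds are anomalous relative to the table's own behavior, which keeps alerts tied to what this dataset actually does rather than to a guessed constant.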

FAQ

What is data observability?
Data observability is the ability to continuously monitor, detect, diagnose, and resolve data quality issues across your pipeline before they reach consumers. It applies software observability practices to data: metrics, alerting, lineage, SLOs, and incident workflows.
What are the 5 pillars of data observability?
Freshness (is data arriving on schedule?), Volume (are row counts within bounds?), Schema (did column definitions change?), Distribution (are values statistically normal?), and Lineage (what upstream tables feed this one?). Monitoring all five gives complete pipeline visibility.
How does a data SLO work?
A data SLO defines a reliability target (e.g. 99.5% uptime). The gap to 100% is your error budget. When failures burn into the budget, you alert proportionally to the burn rate — not on every individual miss.
What is the difference between data observability and data monitoring?
Monitoring checks metrics on a schedule. Observability is the full operational system: automated anomaly detection, lineage for root cause, SLO-based error budgets, and incident workflows. Monitoring is a component of observability.
