How to Implement Data Observability
Implement data observability in 5 layers: dbt freshness + schema tests → Great Expectations suites → OpenLineage lineage tracking → SLO definitions per dataset → Prometheus + Grafana dashboards. The open-source stack covers all 5 observability pillars at zero license cost.
Add dbt Freshness & Schema Tests
Configure freshness thresholds on your sources and not_null/unique tests on your most critical models in schema.yml. Note that dbt evaluates freshness on sources (via `dbt source freshness` and a `loaded_at_field`), not on models. This is the lowest-effort, highest-value first step.

models/schema.yml

```yaml
# dbt schema.yml — source freshness + schema tests
version: 2

sources:
  - name: raw            # placeholder source name — use your own
    tables:
      - name: orders
        loaded_at_field: updated_at   # timestamp column used to compute staleness
        freshness:
          warn_after: {count: 1, period: hour}
          error_after: {count: 3, period: hour}

models:
  - name: orders
    columns:
      - name: order_id
        tests: [not_null, unique]
```
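With the config in place, two standard dbt CLI invocations exercise both checks (shown as a command fragment, not a tested script):

```shell
dbt source freshness          # evaluates warn_after / error_after thresholds
dbt test --select orders      # runs the not_null and unique tests
```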
Add Great Expectations Suites
Create expectation suites for volume bounds, value distributions, and business-rule checks. Run them as part of your Airflow DAG or as a post-dbt step.
expectations/orders_suite.py

```python
import great_expectations as gx

context = gx.get_context()
validator = context.get_validator(...)  # elided: batch request for the orders table

# Volume bounds: fail validation if the row count leaves the expected range
validator.expect_table_row_count_to_be_between(
    min_value=1000, max_value=500000
)
validator.expect_column_values_to_not_be_null("order_id")
validator.save_expectation_suite()
```
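To run the suite as a post-dbt step, a checkpoint binds the suite to a batch of data. A sketch of a checkpoint config in the Great Expectations YAML format — the datasource and asset names are placeholders for your own setup:

```yaml
# checkpoints/orders_checkpoint.yml — names are placeholders
name: orders_checkpoint
validations:
  - batch_request:
      datasource_name: warehouse    # your configured datasource
      data_asset_name: orders
    expectation_suite_name: orders_suite
```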
Deploy OpenLineage for Lineage Tracking
Run Marquez (the OpenLineage reference backend) in Docker, configure your dbt and Airflow integrations to emit lineage events, and verify that the lineage graph appears after each run. Note that dbt emits OpenLineage events via the `dbt-ol` wrapper from the openlineage-dbt package, which reads its transport config from an `openlineage.yml` file (or the `OPENLINEAGE_URL` environment variable), not from dbt's profiles.yml.

docker-compose.yml (Marquez)

```shell
# Start the Marquez lineage backend
docker compose up marquez
```

openlineage.yml

```yaml
# openlineage.yml — OpenLineage transport (read by dbt-ol and the Airflow integration)
transport:
  type: http
  url: http://localhost:5000
```
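To confirm events are actually arriving, you can query the Marquez API directly. The command below assumes Marquez is serving its API on the default port 5000:

```shell
# Namespaces appear in Marquez once the first lineage event lands
curl http://localhost:5000/api/v1/namespaces
```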
Define SLOs Per Dataset
Write YAML data contracts specifying freshness windows, row count bounds, owner, and SLA for each critical table. Enforce contracts in CI to block merges that violate them.
contracts/orders.yml

```yaml
dataset: orders
owner: data-platform@company.com
slo:
  freshness_hours: 1
  min_rows: 1000
  target_percent: 99.5   # availability target; the error budget is the remaining 0.5%
alert_channel: "#data-on-call"
```
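Enforcing the contract in CI can be as simple as comparing observed metrics against the parsed YAML. A minimal sketch — the `check_contract` helper and metric names are illustrative, not part of any library:

```python
# Hypothetical CI gate: compare observed table metrics against a contract's SLO.
def check_contract(contract: dict, observed: dict) -> list[str]:
    """Return human-readable violations (empty list = contract holds)."""
    slo = contract["slo"]
    violations = []
    if observed["freshness_hours"] > slo["freshness_hours"]:
        violations.append(
            f"{contract['dataset']}: stale by {observed['freshness_hours']}h "
            f"(SLO: {slo['freshness_hours']}h)"
        )
    if observed["row_count"] < slo["min_rows"]:
        violations.append(
            f"{contract['dataset']}: {observed['row_count']} rows "
            f"(SLO minimum: {slo['min_rows']})"
        )
    return violations

contract = {"dataset": "orders",
            "owner": "data-platform@company.com",
            "slo": {"freshness_hours": 1, "min_rows": 1000}}

print(check_contract(contract, {"freshness_hours": 0.5, "row_count": 42000}))  # []
print(check_contract(contract, {"freshness_hours": 3, "row_count": 500}))
```

In CI, exit nonzero when the list is non-empty to block the merge.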
Build a Prometheus + Grafana Dashboard
Expose quality metrics (freshness age, row counts, SLO burn rate) via a Prometheus exporter, and visualize them in Grafana. This gives your team a single pane of glass for data health.
monitoring/exporter.py

```python
from prometheus_client import Gauge

freshness_gauge = Gauge(
    "data_table_freshness_hours",
    "Hours since table last updated",
    ["table_name"],
)

def update_metrics():
    # get_freshness_ages() is your own query against warehouse metadata,
    # returning {table_name: hours_since_last_update}
    for table, age in get_freshness_ages().items():
        freshness_gauge.labels(table).set(age)
```
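Grafana's SLO panels typically plot burn rate: how fast the error budget is being consumed relative to how fast it accrues. The arithmetic is worth seeing on its own — a standalone sketch, with an illustrative function name:

```python
def burn_rate(bad_minutes: float, window_minutes: float,
              slo_target: float) -> float:
    """Error-budget burn rate over a window.

    1.0 means the budget is consumed exactly as fast as it accrues;
    >1.0 means the SLO will be violated if the trend continues.
    """
    budget_fraction = 1.0 - slo_target          # e.g. 0.005 for a 99.5% target
    bad_fraction = bad_minutes / window_minutes
    return bad_fraction / budget_fraction

# 30 stale minutes in a day against a 99.5% target burns budget ~4.2x too fast
print(round(burn_rate(30, 24 * 60, 0.995), 1))  # 4.2
```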
When to Implement Observability
- You manage 20+ tables and manual checks aren't scaling
- Stakeholders discover broken data before your team does
- Root cause analysis after a pipeline failure takes more than 30 minutes
- You're deploying ML models that depend on fresh, reliable feature data
Common Implementation Issues
Alert fatigue from day one
If you add monitoring before establishing baselines, every anomaly fires an alert. Run your pipeline for two weeks first, collect volume and freshness distributions, then set alert thresholds at the observed p5/p95.
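Computing those thresholds from a baseline window is a few lines of standard-library Python; the sample row counts below are made up for illustration:

```python
import statistics

def baseline_thresholds(observations: list[float]) -> tuple[float, float]:
    """Return (p5, p95) alert thresholds from a baseline window of observations."""
    # quantiles(n=20) yields 19 cut points: index 0 is p5, index 18 is p95
    cuts = statistics.quantiles(observations, n=20)
    return cuts[0], cuts[18]

# Two weeks of daily row counts for a hypothetical table
daily_rows = [9800, 10200, 9900, 10500, 10100, 9700, 10300,
              10000, 9600, 10400, 9950, 10050, 10250, 9850]
low, high = baseline_thresholds(daily_rows)
print(low < 10000 < high)  # True
```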
Lineage not emitting events
The most common cause: OpenLineage transport URL is wrong in profiles.yml, or dbt-openlineage package version doesn't match the Marquez API version. Check the Marquez logs first.
SLOs too strict to start
Start with SLO targets around 95%, not 99.9%. Tighten them as you learn real pipeline failure rates. A too-strict SLO burns the entire error budget in the first week.
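The gap between those two targets is stark when translated into allowed downtime — plain arithmetic, no library assumptions:

```python
def allowed_bad_hours(slo_target: float, window_days: int = 30) -> float:
    """Hours per window a dataset may be stale/broken before the SLO is violated."""
    return (1.0 - slo_target) * window_days * 24

print(round(allowed_bad_hours(0.95), 2))   # 36.0 hours per 30 days
print(round(allowed_bad_hours(0.999), 2))  # 0.72 hours (~43 minutes)
```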
Skipping the lineage step
Freshness and volume monitoring tells you something is wrong. Without lineage, finding the root cause takes hours. Deploy OpenLineage at the same time as your first quality checks.
FAQ
- How do I implement data observability without expensive tooling?
- Use the open-source stack: dbt for pipeline-time quality checks, Great Expectations for expectation suites, OpenLineage + Marquez for lineage, and Prometheus + Grafana for dashboards. This covers all 5 pillars at zero license cost.
- What should I monitor first?
- Start with freshness on your highest-impact downstream tables — the ones feeding dashboards or ML models. Stale data is the most common and painful incident. Add volume monitoring next, then schema change alerts.
- How long does implementation take?
- A basic layer (dbt tests + Great Expectations + Prometheus metrics) can run in 1–2 days. A full production platform with OpenLineage, tiered SLOs, Grafana dashboards, and incident routing takes 2–4 weeks.