Your dashboards went dark at 9 AM on a Monday. Not because a pipeline failed — Airflow shows all green. Because a source system changed its API response format over the weekend, and your pipeline dutifully ingested the new format, transformed it according to the old schema, and wrote garbage to the warehouse. Every test passed because the data wasn't null, wasn't duplicate, and fell within expected ranges. It was just wrong. This is the problem data observability solves.
Data Quality Testing vs Data Observability: The Distinction That Matters
Before comparing tools, clarify the distinction. These terms are often conflated, but they solve different problems.
Data quality testing runs at pipeline execution time. You define expectations ("this column is never null," "row count is between 10K and 100K"), and the test passes or fails when the pipeline runs. This is proactive — you predict what can go wrong and test for it.
Data observability monitors data continuously, independent of pipeline execution. It watches patterns over time — freshness, volume, distribution, schema changes — and alerts when something deviates from the norm. This catches anomalies you didn't predict.
| Dimension | Data Quality Testing | Data Observability |
|---|---|---|
| When it runs | During pipeline execution | Continuously (scheduled or event-driven) |
| What it checks | Predefined expectations | Pattern deviations from historical baselines |
| What it catches | Known failure modes | Unknown failure modes (anomalies) |
| Who defines rules | Engineers write expectations | System learns baselines automatically |
| Example | "revenue_usd is never null" | "Volume dropped 40% vs same day last week" |
You need both. Testing catches known issues at pipeline time. Observability catches unknown issues across the entire data platform. The question is how to implement each — and which tools to use for each layer.
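To make the split concrete, here is a minimal sketch that uses the two example checks from the table above. The function names, the `CheckResult` type, and the 30% drop threshold are assumptions for illustration and do not come from any specific tool.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_row_count_in_range(row_count: int, low: int = 10_000,
                             high: int = 100_000) -> CheckResult:
    """Quality test: a fixed expectation an engineer wrote ahead of time."""
    passed = low <= row_count <= high
    return CheckResult("row_count_in_range", passed,
                       f"{row_count:,} rows, expected {low:,} to {high:,}")


def observe_volume_vs_last_week(today: int, same_day_last_week: int,
                                max_drop_pct: float = 30.0) -> CheckResult:
    """Observability check: compare against history, not a fixed rule."""
    if same_day_last_week <= 0:
        # No usable baseline yet, so there is nothing to compare against
        return CheckResult("volume_vs_last_week", True, "no baseline yet")
    change_pct = (today - same_day_last_week) / same_day_last_week * 100
    passed = change_pct > -max_drop_pct
    return CheckResult("volume_vs_last_week", passed,
                       f"{change_pct:+.1f}% vs same day last week")


# 30,000 rows is inside the fixed range, yet 40% below last week's 50,000
for result in (check_row_count_in_range(30_000),
               observe_volume_vs_last_week(30_000, 50_000)):
    print(result.name, "passed" if result.passed else "FAILED", "-", result.detail)
```

The same load passes the predefined test and fails the baseline comparison, which is why the two layers complement each other rather than overlap.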
The Two-Layer Observability Stack
Most teams end up with this hybrid: open-source validation embedded in pipelines (GX or Soda in CI/CD) plus a managed platform for continuous monitoring. The validation layer catches known issues at deploy time; the monitoring layer catches unknown issues in production.
Monte Carlo — Managed, Comprehensive, Expensive
A fully managed data observability platform that connects to your data warehouse, monitors every table automatically, and alerts on anomalies — freshness, volume, schema changes, distribution shifts — without writing any rules. Monte Carlo connects directly to your warehouse (Snowflake, BigQuery, Databricks, Redshift) via read-only credentials and scans metadata and data patterns on a schedule.
```python
# Illustrative only: the module, client class, and method names below may not
# match Monte Carlo's current SDK. Treat them as placeholders and check the
# official SDK documentation before using this pattern.
from monte_carlo.client import MonteCarloClient

mc = MonteCarloClient(api_key="your-api-key")

# Check table health before downstream processing
health = mc.get_table_health("analytics.fct_revenue")

if health.has_active_incidents:
    print(f"Skipping pipeline: {health.incident_summary}")
    # Route to dead-letter queue or skip downstream processing
else:
    run_downstream_models()  # your own function that triggers downstream models
```

- What it catches: automatic ML-based anomaly detection; no rules to write, the system learns your patterns.
- Schema monitoring: detects changes across your entire warehouse, including tables you haven't explicitly instrumented.
- Full lineage: traces issues from root cause to every affected downstream asset, dashboard, and report.
- Where it falls short: cost starts at $50K/year and scales to $100K–$200K+ for enterprise; black-box anomaly detection can over-alert until tuned.
- Best for: teams with 20+ data engineers, multiple warehouses, or compliance requirements where automated monitoring is non-negotiable.
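To show what schema monitoring means mechanically, here is a generic sketch of snapshot diffing. It does not describe how Monte Carlo implements the feature. The `diff_schema` helper and the column snapshots are hypothetical; in practice the snapshots would typically come from the warehouse's `information_schema.columns` view.

```python
def diff_schema(previous: dict[str, str],
                current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {column_name: data_type} snapshots of the same table."""
    shared = previous.keys() & current.keys()
    return {
        "added": sorted(current.keys() - previous.keys()),
        "removed": sorted(previous.keys() - current.keys()),
        "retyped": sorted(col for col in shared if previous[col] != current[col]),
    }


# Hypothetical snapshots taken on Friday and on Monday
friday = {"order_id": "varchar", "revenue_usd": "number", "currency": "varchar"}
monday = {"order_id": "number", "revenue": "number", "currency": "varchar"}

drift = diff_schema(friday, monday)
if any(drift.values()):
    print(f"Schema drift detected: {drift}")
```

A renamed column shows up as one removal plus one addition, which is usually enough to flag the kind of upstream format change described in the opening scenario.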
Great Expectations — Open-Source, Pipeline-Embedded, Engineer-First
An open-source Python framework for defining, running, and documenting data quality expectations. Expectations are code — they run inside your pipeline, at the point where data flows through. GX is strongest when embedded in CI/CD and treated as a quality gate before code ships to production.
```python
import great_expectations as gx
from airflow.decorators import task


@task
def validate_revenue_data():
    # Written against the GX 0.16-0.18 fluent API. GX 1.x restructured
    # checkpoints, so adapt these calls if you are on a newer release.
    context = gx.get_context()

    batch_request = (
        context.get_datasource("snowflake_prod")
        .get_asset("fct_revenue")
        .build_batch_request()
    )

    results = context.run_checkpoint(
        checkpoint_name="revenue_quality",
        batch_request=batch_request,
    )

    # A checkpoint result wraps one validation result per validated batch
    validations = results.list_validation_results()

    if not results.success:
        failed = [
            r.expectation_config.expectation_type
            for validation in validations
            for r in validation.results
            if not r.success
        ]
        # Raising fails the Airflow task and blocks downstream tasks
        raise ValueError(f"Data quality check failed: {failed}")

    return [validation.statistics for validation in validations]
```

```python
import great_expectations as gx

# Written against the GX 1.x API, where expectations are classes under
# gx.expectations and suites are registered through context.suites.
context = gx.get_context()
suite = context.suites.add(gx.ExpectationSuite(name="fct_revenue"))

suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="order_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="order_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="revenue_usd",
        min_value=0,
        max_value=1_000_000,
        mostly=0.999,  # 99.9% of values
    )
)
suite.add_expectation(
    gx.expectations.ExpectTableRowCountToBeBetween(
        min_value=10_000,
        max_value=500_000,
    )
)

# Distribution expectation: catches shifts dbt tests miss
suite.add_expectation(
    gx.expectations.ExpectColumnMeanToBeBetween(
        column="revenue_usd",
        min_value=50,
        max_value=200,
    )
)
```

- Expectations-as-code: version controlled, reviewable, testable alongside the pipeline logic.
- Rich documentation: auto-generates data docs from your expectations that double as data contracts.
- CI/CD integration: expectations run as quality gates blocking bad code from merging.
- Where it falls short: no automatic anomaly detection; you define every expectation manually. No native freshness or cross-table monitoring.
- Best for: teams that want pipeline-embedded validation, CI/CD quality gates, and full control over their data quality logic.
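The `mostly=0.999` argument in the suite above is easy to misread, so here is a plain-Python sketch of the idea: the expectation passes when at least that fraction of values satisfies the condition. The `values_between` helper is hypothetical, and excluding nulls from the denominator is an assumption based on how GX documents `mostly`; this is not GX's own code.

```python
from typing import Optional, Sequence


def values_between(values: Sequence[Optional[float]], min_value: float,
                   max_value: float, mostly: float = 1.0) -> bool:
    """Pass when at least `mostly` of the non-null values fall in range."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return True  # nothing to evaluate
    in_range = sum(min_value <= v <= max_value for v in non_null)
    return in_range / len(non_null) >= mostly


# One outlier in 1,000 rows is tolerated at mostly=0.999; two in 1,001 are not
revenue = [120.0] * 999 + [5_000_000.0]
print(values_between(revenue, 0, 1_000_000, mostly=0.999))
print(values_between(revenue + [5_000_000.0], 0, 1_000_000, mostly=0.999))
```

Setting `mostly` below 1.0 trades strictness for resilience: a handful of bad rows no longer blocks the pipeline, but a systematic shift still does.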
Soda — SQL-Native, Developer-Friendly, Lightweight
A data quality tool with a unique SQL-like configuration language (SodaCL) that makes it easy for both engineers and analysts to write data checks. The key differentiator: non-engineers can read and write Soda checks, which matters when data quality ownership is shared across the team.
```yaml
checks for fct_revenue:
  # Freshness
  - freshness(event_timestamp) < 2h

  # Volume
  - row_count between 10000 and 500000
  # Percentage change relative to the previous measurement
  - change percent for row_count < 25%

  # Validity
  - missing_count(order_id) = 0
  - duplicate_count(order_id) = 0
  - invalid_count(currency) = 0:
      valid values: ['USD', 'EUR', 'GBP', 'JPY', 'CAD']

  # Distribution
  - avg(revenue_usd) between 50 and 200
  - max(revenue_usd) < 1000000

  # Anomaly detection (Soda Cloud only)
  - anomaly detection for row_count
  - anomaly detection for avg(revenue_usd)

  # Schema
  - schema:
      fail:
        when required column missing: [order_id, revenue_usd, currency]
        when wrong column type:
          order_id: varchar
          revenue_usd: number
```

- SodaCL readability: analysts can write and understand checks; lowers the barrier to shared data quality ownership.
- Built-in primitives: freshness, volume change, and schema monitoring are first-class in the config language, not add-ons.
- Anomaly detection: available in Soda Cloud (paid tier); not in the open-source core.
- Where it falls short: less flexible than GX for complex, custom expectations; anomaly detection requires the paid cloud tier.
- Best for: teams where data quality ownership is shared between engineers and analysts, or where the SQL-like syntax matters for adoption.
Elementary — dbt-Native, Zero Config, Lightweight
An open-source data observability tool built specifically for dbt. It runs as a dbt package — no separate infrastructure needed. If your team already uses dbt, setup takes 15 minutes and you get volume, freshness, and distribution anomaly detection immediately.
```yaml
# packages.yml
packages:
  - package: elementary-data/elementary
    version: "0.15.0"

# models/marts/fct_revenue.yml
models:
  - name: fct_revenue
    config:
      elementary:
        timestamp_column: event_timestamp
    tests:
      - not_null:
          column_name: order_id
      - unique:
          column_name: order_id

      # Elementary anomaly detection
      - elementary.volume_anomalies:
          timestamp_column: event_timestamp
          sensitivity: 3
      - elementary.freshness_anomalies:
          timestamp_column: event_timestamp
      - elementary.column_anomalies:
          column_name: revenue_usd
          timestamp_column: event_timestamp
```

- Zero infrastructure: it's a dbt package; runs inside your existing dbt project with no new services to deploy.
- Anomaly detection: volume, freshness, and column distribution anomalies out of the box, completely free.
- Auto-generated dashboard: run `edr report --open` for an instant monitoring UI from your dbt runs.
- Where it falls short: only covers dbt models; raw source tables, non-dbt pipelines, and downstream BI tools are invisible to it.
- Best for: dbt-first teams that want basic observability without deploying new infrastructure. Best starting point before graduating to GX or Monte Carlo.
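The `sensitivity: 3` setting in the config above is easier to reason about with a small example. Elementary describes its anomaly tests as z-score based, with the sensitivity acting as a threshold in standard deviations. The sketch below illustrates that general technique only: the `is_anomalous` helper and the sample row counts are made up, and Elementary's actual implementation runs as SQL inside dbt.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], latest: float,
                 sensitivity: float = 3.0) -> bool:
    """Flag `latest` when it sits more than `sensitivity` standard
    deviations away from the mean of the training window."""
    if len(history) < 2:
        return False  # not enough data to estimate a spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # perfectly flat history: any change is unusual
    return abs(latest - mu) / sigma > sensitivity


# Hypothetical daily row counts for the past week
daily_rows = [100_000, 102_000, 98_000, 101_000, 99_000, 100_500, 99_500]

print(is_anomalous(daily_rows, 100_800))  # an ordinary day
print(is_anomalous(daily_rows, 60_000))   # a sharp drop
```

Lowering the sensitivity catches smaller deviations at the cost of more false positives, which is the main tuning trade-off with this style of detection.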
Side-by-Side Comparison
| Criteria | Monte Carlo | Great Expectations | Soda | Elementary |
|---|---|---|---|---|
| Type | Managed platform | Open-source framework | OSS + paid cloud | OSS dbt package |
| Setup time | Hours | Days | Hours | 15 minutes |
| Anomaly detection | Automatic ML | Manual (write it) | Paid tier | Basic statistical |
| Freshness monitoring | Automatic | Build it yourself | Built-in (SodaCL) | Built-in (dbt) |
| Schema monitoring | Automatic | Build it yourself | Built-in (SodaCL) | Basic |
| Cross-table monitoring | Automatic | Manual config | Manual config | dbt models only |
| Who writes checks | System + engineer tunes | Engineers (Python) | Engineers + analysts | Engineers (dbt YAML) |
| dbt integration | Reads manifest | Checkpoint integration | Native support | IS a dbt package |
| Cost (team of 10) | $50K–$100K/yr | Free (eng time) | Free / $20K+ cloud | Free (eng time) |
| Cost (team of 30) | $100K–$200K/yr | Free (significant eng time) | Free / $40K+ cloud | Free (significant eng time) |
The Build vs Buy Decision
This is the real question. Not "which tool is best" but "should I build with open-source or buy a managed platform?" The answer depends on team size, not philosophy.
- Build with open-source when — fewer than 15 data engineers; primarily one warehouse + dbt; engineering capacity to maintain monitoring infrastructure; budget is constrained. Stack: Great Expectations (CI/CD quality gates) + Elementary (dbt monitoring). Total cost: ~2 weeks initial setup, ~4 hours/month maintenance.
- Buy a managed platform when — 20+ data engineers; multiple warehouses; data quality incidents have caused business-impacting outages; compliance or audit requirements demand comprehensive monitoring. Stack: Monte Carlo or Anomalo. Total cost: $50K–$200K/year — but saves 1–2 FTE of engineering time on monitoring infrastructure.
- The hybrid approach (where most teams land) — GX or Soda in CI/CD pipelines for validation, plus Monte Carlo or Elementary for continuous monitoring. Validation catches known issues at deploy time; monitoring catches unknown issues in production.
Building Observability as a Platform Service
Platform engineers don't just pick a tool — they build observability as a shared service the entire data team uses. Here's the core of what that looks like in practice: freshness, volume, and schema checks wrapped into a unified service with config-driven table registration.
```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class AlertSeverity(Enum):
    INFO = "info"
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass
class ObservabilityCheck:
    table: str
    check_type: str   # freshness | volume | schema | distribution
    status: str       # passed | failed | warning
    severity: AlertSeverity
    message: str
    # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    metadata: dict = field(default_factory=dict)


class DataObservabilityService:
    def __init__(self, warehouse_conn, alert_client, history_store):
        self.conn = warehouse_conn    # cursor must return dict-like rows
        self.alerter = alert_client
        self.history = history_store

    def check_freshness(self, table: str, timestamp_col: str,
                        max_delay_hours: int = 2) -> ObservabilityCheck:
        # Table and column names are interpolated into the SQL, so they must
        # come from trusted config, never from user input.
        # DATEDIFF / CURRENT_TIMESTAMP() follow Snowflake-style syntax.
        result = self.conn.execute(f"""
            SELECT MAX({timestamp_col}) as latest,
                   DATEDIFF(hour, MAX({timestamp_col}), CURRENT_TIMESTAMP())
                       as hours_delay
            FROM {table}
        """).fetchone()

        # Only NULL (an empty table) maps to the 999 sentinel. Using `or 999`
        # would also turn a legitimate 0-hour delay into a critical alert.
        hours_delay = result["hours_delay"]
        if hours_delay is None:
            hours_delay = 999

        if hours_delay > max_delay_hours * 2:
            severity, status = AlertSeverity.CRITICAL, "failed"
        elif hours_delay > max_delay_hours:
            severity, status = AlertSeverity.WARNING, "warning"
        else:
            severity, status = AlertSeverity.INFO, "passed"

        return ObservabilityCheck(
            table=table, check_type="freshness",
            status=status, severity=severity,
            message=f"Latest data: {result['latest']} ({hours_delay}h ago)",
            metadata={"hours_delay": hours_delay, "threshold": max_delay_hours}
        )

    def check_volume(self, table: str,
                     max_change_pct: float = 25.0) -> ObservabilityCheck:
        current = self.conn.execute(
            f"SELECT COUNT(*) as cnt FROM {table}"
        ).fetchone()["cnt"]

        historical = self.history.get_average_count(table, lookback_days=7)
        # With no history yet (None or 0), report no change rather than divide by zero
        pct_change = (current - historical) / historical * 100 if historical else 0

        if abs(pct_change) > max_change_pct * 2:
            severity, status = AlertSeverity.CRITICAL, "failed"
        elif abs(pct_change) > max_change_pct:
            severity, status = AlertSeverity.WARNING, "warning"
        else:
            severity, status = AlertSeverity.INFO, "passed"

        # Record after comparing so today's count does not skew its own baseline
        self.history.record_count(table, current)

        return ObservabilityCheck(
            table=table, check_type="volume",
            status=status, severity=severity,
            message=f"Row count: {current:,} ({pct_change:+.1f}% vs 7-day avg)",
            metadata={"current": current, "historical_avg": historical,
                      "pct_change": pct_change}
        )

    def run_all_checks(self, table: str, config: dict) -> list[ObservabilityCheck]:
        results = []

        if "freshness" in config:
            results.append(self.check_freshness(
                table,
                timestamp_col=config["freshness"]["timestamp_column"],
                max_delay_hours=config["freshness"].get("max_delay_hours", 2)
            ))

        if "volume" in config:
            results.append(self.check_volume(
                table, max_change_pct=config["volume"].get("max_change_pct", 25)
            ))

        # The "schema" key in the table config is not handled in this excerpt

        failures = [r for r in results if r.status in ("failed", "warning")]
        if failures:
            self.alerter.send(failures)

        return results
```

```yaml
tables:
  analytics.fct_revenue:
    owner: revenue-squad
    freshness:
      timestamp_column: event_timestamp
      max_delay_hours: 2
    volume:
      max_change_pct: 25
    schema:
      enabled: true

  analytics.fct_orders:
    owner: orders-squad
    freshness:
      timestamp_column: created_at
      max_delay_hours: 4
    volume:
      max_change_pct: 30
    schema:
      enabled: true
```

Common Mistakes
- Treating observability as a replacement for testing: observability monitors patterns; testing validates expectations. A table that passes all observability checks (normal volume, fresh data, stable schema) can still have incorrect data if a transformation bug produces wrong values within normal ranges.
- Over-alerting until teams ignore alerts: the biggest operational risk. Start with critical checks only (freshness, schema breaks, extreme volume changes) and add granularity once the baseline is stable. A feed of 50 daily alerts, most of them false positives, trains teams to ignore alerts entirely.
- Monitoring only tables you know about: open-source tools only cover tables you've explicitly instrumented. The staging table someone created last month, the ad-hoc pipeline that writes to a shared schema: none of these are covered. This is Monte Carlo's strongest advantage.
- Not measuring the cost of data incidents: "we don't need observability tools" usually means "we don't know how much data incidents cost us." Track time-to-detect, time-to-resolve, and blast radius. This data justifies the investment.
- Deploying an enterprise tool without a rollout plan: Monte Carlo surfaces 200 anomalies on day one and nobody knows which are real. Start with 3–5 critical tables, tune thresholds for two weeks, expand only when false positive rates are below 10%.
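The incident metrics and the false-positive threshold mentioned in the list above are simple to compute once incidents are logged. The following is a minimal sketch; the `Incident` fields, the helper names, and the sample records are all hypothetical, and time-to-resolve is measured from detection here, which is only one of several common conventions.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    occurred_at: datetime   # when bad data first landed
    detected_at: datetime   # when someone (or something) noticed
    resolved_at: datetime   # when the fix was confirmed
    affected_assets: int    # blast radius: downstream tables, dashboards, reports


def summarize_incidents(incidents: list[Incident]) -> dict[str, float]:
    """Average time-to-detect, time-to-resolve (both in hours), and blast radius."""
    if not incidents:
        return {}
    return {
        "mean_time_to_detect_h": mean(
            (i.detected_at - i.occurred_at).total_seconds() for i in incidents) / 3600,
        "mean_time_to_resolve_h": mean(
            (i.resolved_at - i.detected_at).total_seconds() for i in incidents) / 3600,
        "mean_blast_radius": mean(i.affected_assets for i in incidents),
    }


def false_positive_rate(total_alerts: int, false_positives: int) -> float:
    """Share of alerts that turned out not to be real issues."""
    return false_positives / total_alerts if total_alerts else 0.0


# Made-up incident log for illustration
log = [
    Incident(datetime(2024, 1, 1, 0), datetime(2024, 1, 1, 9),
             datetime(2024, 1, 1, 13), affected_assets=12),
    Incident(datetime(2024, 1, 8, 2), datetime(2024, 1, 8, 5),
             datetime(2024, 1, 8, 7), affected_assets=4),
]
print(summarize_incidents(log))

# Expand monitoring to more tables only once the rate is below 10%
print(false_positive_rate(total_alerts=40, false_positives=3) < 0.10)
```

Even a spreadsheet-level log like this gives you before-and-after numbers to compare once a monitoring tool is in place.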
Build the DataGuard Observability Platform
Reading about tools is one thing. Hiring managers want to see you build observability as a system — not just a tool choice. The AI-DE Data Observability module walks you through building the full stack: pipeline-embedded validation with Great Expectations, dbt monitoring with Elementary, centralized alerting, and a config-driven table registry.
By the end you'll have a portfolio project that demonstrates the monitoring architecture senior data engineers are expected to design — not just "I ran Great Expectations once."