Data Observability in 2026: Monte Carlo vs Great Expectations vs Soda — A Data Engineer's Honest Comparison

Your dashboards went dark at 9 AM on a Monday. Not because a pipeline failed — Airflow shows all green. Because a source system changed its API response format over the weekend, and your pipeline dutifully ingested the new format, transformed it according to the old schema, and wrote garbage to the warehouse. Every test passed because the data wasn't null, wasn't duplicate, and fell within expected ranges. It was just wrong. This is the problem data observability solves.

Data Quality Testing vs Data Observability: The Distinction That Matters

Before comparing tools, clarify the distinction. These terms are often conflated, but they solve different problems.

Data quality testing runs at pipeline execution time. You define expectations ("this column is never null," "row count is between 10K and 100K"), and the test passes or fails when the pipeline runs. This is proactive — you predict what can go wrong and test for it.

Data observability monitors data continuously, independent of pipeline execution. It watches patterns over time — freshness, volume, distribution, schema changes — and alerts when something deviates from the norm. This catches anomalies you didn't predict.

Dimension	Data Quality Testing	Data Observability
When it runs	During pipeline execution	Continuously (scheduled or event-driven)
What it checks	Predefined expectations	Pattern deviations from historical baselines
What it catches	Known failure modes	Unknown failure modes (anomalies)
Who defines rules	Engineers write expectations	System learns baselines automatically
Example	"revenue_usd is never null"	"Volume dropped 40% vs same day last week"

You need both. Testing catches known issues at pipeline time. Observability catches unknown issues across the entire data platform. The question is how to implement each — and which tools to use for each layer.
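To make the distinction concrete, here is a minimal sketch in plain Python. It assumes daily row counts are already available as a list; the function names, the sample counts, and the 30% threshold are illustrative and do not come from any of the tools discussed below. The first function is a fixed expectation of the kind an engineer writes. The second compares today with the same weekday last week, which is the sort of baseline comparison an observability tool automates.

```python
def passes_row_count_rule(row_count: int, low: int = 10_000, high: int = 100_000) -> bool:
    """Data quality test: a fixed expectation written by an engineer."""
    return low <= row_count <= high


def week_over_week_change(daily_counts: list[int]) -> float:
    """Observability-style signal: percent change vs the same weekday last week.

    daily_counts is ordered oldest to newest and must hold at least 8 days.
    """
    if len(daily_counts) < 8:
        raise ValueError("need at least 8 days of history")
    today, last_week = daily_counts[-1], daily_counts[-8]
    if last_week == 0:
        raise ValueError("no baseline: last week's count was zero")
    return (today - last_week) * 100 / last_week


# Hypothetical counts for eight consecutive days, newest last.
counts = [50_000, 51_000, 49_500, 50_200, 50_800, 20_000, 21_000, 30_000]

rule_ok = passes_row_count_rule(counts[-1])  # 30,000 is inside the fixed range
change = week_over_week_change(counts)       # 30,000 vs 50,000 a week earlier
is_anomaly = abs(change) > 30                # illustrative 30% threshold
```

In this made-up series the fixed rule passes while the week-over-week comparison flags a 40% drop, which is the gap between the two approaches in miniature.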

The Two-Layer Observability Stack

Most teams end up with a hybrid: open-source validation embedded in pipelines (GX or Soda in CI/CD) plus a managed platform for continuous monitoring. The validation layer catches known issues at deploy time; the monitoring layer catches unknown issues in production.

Monte Carlo — Managed, Comprehensive, Expensive

A fully managed data observability platform that connects to your data warehouse, monitors every table automatically, and alerts on anomalies — freshness, volume, schema changes, distribution shifts — without writing any rules. Monte Carlo connects directly to your warehouse (Snowflake, BigQuery, Databricks, Redshift) via read-only credentials and scans metadata and data patterns on a schedule.

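Python: monte_carlo_circuit_breaker.py // Pipeline integration via circuit breaker pattern

# NOTE: illustrative pseudocode. `monte_carlo.client`, `MonteCarloClient`, and
# `get_table_health` are placeholder names for the pattern, not a verified SDK
# interface; consult Monte Carlo's SDK documentation for the actual calls.
import os

from monte_carlo.client import MonteCarloClient

# Read the key from the environment instead of hardcoding it
# (the variable name MONTE_CARLO_API_KEY is an assumption)
mc = MonteCarloClient(api_key=os.environ["MONTE_CARLO_API_KEY"])

# Check table health before downstream processing
health = mc.get_table_health("analytics.fct_revenue")

if health.has_active_incidents:
    print(f"Skipping pipeline: {health.incident_summary}")
    # Route to dead-letter queue or skip downstream processing
else:
    run_downstream_models()  # your own function that triggers downstream models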

What it catches — automatic ML-based anomaly detection; no rules to write, the system learns your patterns.
Schema monitoring — detects changes across your entire warehouse, including tables you haven't explicitly instrumented.
Full lineage — traces issues from root cause to every affected downstream asset, dashboard, and report.
Where it falls short — cost starts at $50K/year and scales to $100K–$200K+ for enterprise; black-box anomaly detection can over-alert until tuned.
Best for — teams with 20+ data engineers, multiple warehouses, or compliance requirements where automated monitoring is non-negotiable.
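The circuit breaker pattern itself is vendor-neutral, so it can be sketched without any platform SDK. The version below is a minimal illustration in which the incident lookup is injected as a callable; `Incident`, `run_with_circuit_breaker`, and the stand-in lambdas are hypothetical names, and in practice the lookup would wrap whatever API your monitoring platform exposes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Incident:
    table: str
    summary: str


def run_with_circuit_breaker(
    table: str,
    fetch_open_incidents: Callable[[str], list[Incident]],
    downstream: Callable[[], None],
) -> bool:
    """Run `downstream` only if `table` has no open incidents.

    Returns True if downstream ran, False if the breaker tripped.
    """
    incidents = fetch_open_incidents(table)
    if incidents:
        for incident in incidents:
            print(f"Skipping downstream for {table}: {incident.summary}")
        return False
    downstream()
    return True


# Usage with stand-in callables; swap in your platform's incident lookup.
ran = []
healthy = run_with_circuit_breaker(
    "analytics.fct_revenue",
    fetch_open_incidents=lambda t: [],
    downstream=lambda: ran.append("models built"),
)
blocked = run_with_circuit_breaker(
    "analytics.fct_revenue",
    fetch_open_incidents=lambda t: [Incident(t, "volume dropped")],
    downstream=lambda: ran.append("should not run"),
)
```

Injecting the lookup keeps the pipeline code testable and makes it cheap to change monitoring vendors later.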

Great Expectations — Open-Source, Pipeline-Embedded, Engineer-First

An open-source Python framework for defining, running, and documenting data quality expectations. Expectations are code — they run inside your pipeline, at the point where data flows through. GX is strongest when embedded in CI/CD and treated as a quality gate before code ships to production.

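Python: validate_revenue_airflow.py // GX quality gate inside an Airflow task

import great_expectations as gx
from airflow.decorators import task

@task
def validate_revenue_data():
    # Written against the GX 1.x API. Assumes the "revenue_quality" checkpoint
    # already exists in this context and that its validation definitions point
    # at the fct_revenue asset of the snowflake_prod data source.
    context = gx.get_context()

    checkpoint = context.checkpoints.get("revenue_quality")
    results = checkpoint.run()

    # A checkpoint can run several validations; collect each result set
    validation_results = list(results.run_results.values())

    if not results.success:
        failed = [
            r.expectation_config.type
            for vr in validation_results
            for r in vr.results if not r.success
        ]
        raise ValueError(f"Data quality check failed: {failed}")

    # One statistics dict per validation, small enough to pass through XCom
    return [vr.statistics for vr in validation_results]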

Python: fct_revenue_suite.py // Expectation suite that reads like documentation

import great_expectations as gx

context = gx.get_context()
# GX 1.x: suites are registered through context.suites
suite = context.suites.add(gx.ExpectationSuite(name="fct_revenue"))

suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="order_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="order_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="revenue_usd",
        min_value=0,
        max_value=1_000_000,
        mostly=0.999  # 99.9% of values
    )
)
suite.add_expectation(
    gx.expectations.ExpectTableRowCountToBeBetween(
        min_value=10_000,
        max_value=500_000
    )
)
# Distribution expectation — catches shifts dbt tests miss
suite.add_expectation(
    gx.expectations.ExpectColumnMeanToBeBetween(
        column="revenue_usd",
        min_value=50,
        max_value=200
    )
)

Expectations-as-code — version controlled, reviewable, testable alongside the pipeline logic.
Rich documentation — auto-generates data docs from your expectations that double as data contracts.
CI/CD integration — expectations run as quality gates blocking bad code from merging.
Where it falls short — no automatic anomaly detection; you define every expectation manually. No native freshness or cross-table monitoring.
Best for — teams that want pipeline-embedded validation, CI/CD quality gates, and full control over their data quality logic.
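The `mostly` argument in the suite above is easy to misread, so here is a plain-Python sketch of the idea: the expectation passes when at least that fraction of non-null values satisfies the condition. This is a simplified illustration and not GX's implementation; `values_between_mostly` and the sample data are hypothetical, and the handling of an empty column is a choice made for this sketch.

```python
def values_between_mostly(values, min_value, max_value, mostly=1.0):
    """Return True if at least `mostly` (a fraction) of non-null values fall in range."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return True  # nothing to evaluate; this sketch treats that as passing
    in_range = sum(min_value <= v <= max_value for v in non_null)
    return in_range / len(non_null) >= mostly


# 1 outlier in 2,000 rows: 99.95% in range, above the 99.9% bar
mostly_clean = [100.0] * 1999 + [2_000_000.0]
passes = values_between_mostly(mostly_clean, 0, 1_000_000, mostly=0.999)

# 5 negatives in 1,000 rows: 99.5% in range, below the bar
too_dirty = [100.0] * 995 + [-5.0] * 5
fails = not values_between_mostly(too_dirty, 0, 1_000_000, mostly=0.999)
```

A tolerance like this keeps a single bad record from failing a build while still catching systematic problems.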

Soda — SQL-Native, Developer-Friendly, Lightweight

A data quality tool with its own check language, SodaCL: checks live in YAML and use SQL-like metrics such as row_count and avg(revenue_usd), which makes them easy for both engineers and analysts to write. The key differentiator: non-engineers can read and write Soda checks, which matters when data quality ownership is shared across the team.

YAML: checks/fct_revenue.yml // SodaCL, readable by engineers and analysts alike

checks for fct_revenue:
  # Freshness
  - freshness(event_timestamp) < 2h

  # Volume (change-over-time checks need measurement history stored in Soda Cloud)
  - row_count between 10000 and 500000
  - change percent for row_count < 25%

  # Validity
  - missing_count(order_id) = 0
  - duplicate_count(order_id) = 0
  - invalid_count(currency) = 0:
      valid values: ['USD', 'EUR', 'GBP', 'JPY', 'CAD']

  # Distribution
  - avg(revenue_usd) between 50 and 200
  - max(revenue_usd) < 1000000

  # Anomaly detection (Soda Cloud only)
  - anomaly detection for row_count
  - anomaly detection for avg(revenue_usd)

  # Schema
  - schema:
      fail:
        when required column missing: [order_id, revenue_usd, currency]
        when wrong column type:
          order_id: varchar
          revenue_usd: number

SodaCL readability — analysts can write and understand checks; lowers the barrier to shared data quality ownership.
Built-in primitives — freshness, volume change, and schema monitoring are first-class in the config language, not add-ons.
Anomaly detection — available in Soda Cloud (paid tier); not in the open-source core.
Where it falls short — less flexible than GX for complex, custom expectations; anomaly detection requires the paid cloud tier.
Best for — teams where data quality ownership is shared between engineers and analysts, or where the SQL-like syntax matters for adoption.

Elementary — dbt-Native, Zero Config, Lightweight

An open-source data observability tool built specifically for dbt. It runs as a dbt package — no separate infrastructure needed. If your team already uses dbt, setup takes 15 minutes and you get volume, freshness, and distribution anomaly detection immediately.

YAML: models/marts/fct_revenue.yml // Elementary anomaly detection in dbt model config

# packages.yml
packages:
  - package: elementary-data/elementary
    version: "0.15.0"

# models/marts/fct_revenue.yml
models:
  - name: fct_revenue
    config:
      elementary:
        timestamp_column: event_timestamp
    tests:
      - not_null:
          column_name: order_id
      - unique:
          column_name: order_id

      # Elementary anomaly detection
      - elementary.volume_anomalies:
          timestamp_column: event_timestamp
          sensitivity: 3
      - elementary.freshness_anomalies:
          timestamp_column: event_timestamp
      - elementary.column_anomalies:
          column_name: revenue_usd
          timestamp_column: event_timestamp

Zero infrastructure — it's a dbt package; runs inside your existing dbt project with no new services to deploy.
Anomaly detection — volume, freshness, and column distribution anomalies out of the box, completely free.
Auto-generated dashboard — run edr report --open for an instant monitoring UI from your dbt runs.
Where it falls short — only covers dbt models; raw source tables, non-dbt pipelines, and downstream BI tools are invisible to it.
Best for — dbt-first teams that want basic observability without deploying new infrastructure. Best starting point before graduating to GX or Monte Carlo.
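The `sensitivity: 3` setting in the config above is easiest to understand as a z-score threshold: a new value is flagged when it sits more than that many standard deviations from the recent mean. The sketch below is a stand-alone illustration of that idea under that assumption, not Elementary's actual implementation; `is_volume_anomaly` and the sample counts are hypothetical.

```python
from statistics import mean, stdev


def is_volume_anomaly(history: list[float], current: float,
                      sensitivity: float = 3.0) -> bool:
    """Flag `current` if it lies more than `sensitivity` standard deviations
    from the mean of the training window `history`."""
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        # Perfectly flat history: any different value is a deviation
        return current != mu
    return abs(current - mu) / sigma > sensitivity


# Hypothetical daily row counts for the past week
daily_rows = [50_000, 51_200, 49_800, 50_500, 50_100, 49_900, 50_700]

normal_day = is_volume_anomaly(daily_rows, 50_900)  # within normal variation
bad_day = is_volume_anomaly(daily_rows, 30_000)     # far outside it
```

Lowering the sensitivity makes the check stricter and noisier; raising it trades missed anomalies for fewer false alarms.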

Side-by-Side Comparison

Criteria	Monte Carlo	Great Expectations	Soda	Elementary
Type	Managed platform	Open-source framework	OSS + paid cloud	OSS dbt package
Setup time	Hours	Days	Hours	15 minutes
Anomaly detection	Automatic ML	Manual (write it)	Paid tier	Basic statistical
Freshness monitoring	Automatic	Build it yourself	Built-in (SodaCL)	Built-in (dbt)
Schema monitoring	Automatic	Build it yourself	Built-in (SodaCL)	Basic
Cross-table monitoring	Automatic	Manual config	Manual config	dbt models only
Who writes checks	System + engineer tunes	Engineers (Python)	Engineers + analysts	Engineers (dbt YAML)
dbt integration	Reads manifest	Checkpoint integration	Native support	IS a dbt package
Cost (team of 10)	$50K–$100K/yr	Free (eng time)	Free / $20K+ cloud	Free (eng time)
Cost (team of 30)	$100K–$200K/yr	Free (significant eng time)	Free / $40K+ cloud	Free (significant eng time)

The Build vs Buy Decision

This is the real question. Not "which tool is best" but "should I build with open-source or buy a managed platform?" The answer depends on team size, not philosophy.

Build with open-source when — fewer than 15 data engineers; primarily one warehouse + dbt; engineering capacity to maintain monitoring infrastructure; budget is constrained. Stack: Great Expectations (CI/CD quality gates) + Elementary (dbt monitoring). Total cost: ~2 weeks initial setup, ~4 hours/month maintenance.
Buy a managed platform when — 20+ data engineers; multiple warehouses; data quality incidents have caused business-impacting outages; compliance or audit requirements demand comprehensive monitoring. Stack: Monte Carlo or Anomalo. Total cost: $50K–$200K/year — but saves 1–2 FTE of engineering time on monitoring infrastructure.
The hybrid approach (where most teams land) — GX or Soda in CI/CD pipelines for validation, plus Monte Carlo or Elementary for continuous monitoring. Validation catches known issues at deploy time; monitoring catches unknown issues in production.

Building Observability as a Platform Service

Platform engineers don't just pick a tool — they build observability as a shared service the entire data team uses. Here's the core of what that looks like in practice: freshness, volume, and schema checks wrapped into a unified service with config-driven table registration.

Python: platform_tools/observability/service.py // Observability as a platform service with config-driven table registration

from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class AlertSeverity(Enum):
    INFO = "info"
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass
class ObservabilityCheck:
    table: str
    check_type: str   # freshness | volume | schema | distribution
    status: str       # passed | failed | warning
    severity: AlertSeverity
    message: str
    # Timezone-aware UTC timestamp (datetime.utcnow is deprecated since Python 3.12)
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    metadata: dict = field(default_factory=dict)


class DataObservabilityService:
    def __init__(self, warehouse_conn, alert_client, history_store):
        self.conn = warehouse_conn
        self.alerter = alert_client
        self.history = history_store

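    def check_freshness(self, table: str, timestamp_col: str,
                        max_delay_hours: int = 2) -> ObservabilityCheck:
        # Snowflake-style SQL. Identifiers are interpolated directly, so `table`
        # and `timestamp_col` must come from trusted config, never user input.
        # Assumes the cursor returns dict-like rows.
        result = self.conn.execute(f"""
            SELECT MAX({timestamp_col}) as latest,
                   DATEDIFF(hour, MAX({timestamp_col}), CURRENT_TIMESTAMP())
                     as hours_delay
            FROM {table}
        """).fetchone()

        # NULL means the table is empty; treat that as maximally stale.
        # (A plain `or 999` would also misread a legitimate 0-hour delay.)
        hours_delay = result["hours_delay"]
        if hours_delay is None:
            hours_delay = 999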
        if hours_delay > max_delay_hours * 2:
            severity, status = AlertSeverity.CRITICAL, "failed"
        elif hours_delay > max_delay_hours:
            severity, status = AlertSeverity.WARNING, "warning"
        else:
            severity, status = AlertSeverity.INFO, "passed"

        return ObservabilityCheck(
            table=table, check_type="freshness",
            status=status, severity=severity,
            message=f"Latest data: {result['latest']} ({hours_delay}h ago)",
            metadata={"hours_delay": hours_delay, "threshold": max_delay_hours}
        )

    def check_volume(self, table: str,
                     max_change_pct: float = 25.0) -> ObservabilityCheck:
        current = self.conn.execute(
            f"SELECT COUNT(*) as cnt FROM {table}"
        ).fetchone()["cnt"]

        historical = self.history.get_average_count(table, lookback_days=7)
        pct_change = (current - historical) / historical * 100 if historical else 0

        if abs(pct_change) > max_change_pct * 2:
            severity, status = AlertSeverity.CRITICAL, "failed"
        elif abs(pct_change) > max_change_pct:
            severity, status = AlertSeverity.WARNING, "warning"
        else:
            severity, status = AlertSeverity.INFO, "passed"

        self.history.record_count(table, current)
        return ObservabilityCheck(
            table=table, check_type="volume",
            status=status, severity=severity,
            message=f"Row count: {current:,} ({pct_change:+.1f}% vs 7-day avg)",
            metadata={"current": current, "historical_avg": historical,
                      "pct_change": pct_change}
        )

    def run_all_checks(self, table: str, config: dict) -> list[ObservabilityCheck]:
        results = []
        if "freshness" in config:
            results.append(self.check_freshness(
                table,
                timestamp_col=config["freshness"]["timestamp_column"],
                max_delay_hours=config["freshness"].get("max_delay_hours", 2)
            ))
        if "volume" in config:
            results.append(self.check_volume(
                table, max_change_pct=config["volume"].get("max_change_pct", 25)
            ))
        # Note: "schema" entries in the config are not evaluated in this excerpt.
        failures = [r for r in results if r.status in ("failed", "warning")]
        if failures:
            self.alerter.send(failures)
        return results
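The service above covers freshness and volume; a schema check can follow the same shape. The sketch below shows only the comparison step, assuming the expected and actual schemas are available as column-to-type mappings (in practice the actual side would typically be read from the warehouse's information schema). `diff_schema` and the sample columns are hypothetical names for illustration.

```python
def diff_schema(expected: dict[str, str], actual: dict[str, str]) -> dict[str, list[str]]:
    """Compare column name -> type mappings and report drift."""
    missing = sorted(set(expected) - set(actual))
    added = sorted(set(actual) - set(expected))
    type_changed = sorted(
        col for col in set(expected) & set(actual)
        if expected[col].lower() != actual[col].lower()
    )
    return {"missing": missing, "added": added, "type_changed": type_changed}


# Hypothetical snapshot: currency was dropped, region appeared,
# and revenue_usd silently became a string column.
expected = {"order_id": "varchar", "revenue_usd": "number", "currency": "varchar"}
actual = {"order_id": "varchar", "revenue_usd": "varchar", "region": "varchar"}

drift = diff_schema(expected, actual)
has_breaking_change = bool(drift["missing"] or drift["type_changed"])
```

Treating missing columns and type changes as breaking, and new columns as informational, is one reasonable default; the right severity mapping depends on how downstream models consume the table.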

YAML: observability_config.yml // Platform config: teams register their tables here

tables:
  analytics.fct_revenue:
    owner: revenue-squad
    freshness:
      timestamp_column: event_timestamp
      max_delay_hours: 2
    volume:
      max_change_pct: 25
    schema:
      enabled: true

  analytics.fct_orders:
    owner: orders-squad
    freshness:
      timestamp_column: created_at
      max_delay_hours: 4
    volume:
      max_change_pct: 30
    schema:
      enabled: true
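To show how a registry like this drives the service, here is a minimal runner sketch. It expresses the config as a Python dict so the example needs no YAML parser, and it uses a stand-in `RecordingService` instead of a real warehouse connection; both, along with `run_registry`, are hypothetical names. The only interface it relies on is a `run_all_checks(table, config)` method like the one shown earlier.

```python
# The registry above, expressed as a dict so the sketch has no dependencies.
REGISTRY = {
    "tables": {
        "analytics.fct_revenue": {
            "owner": "revenue-squad",
            "freshness": {"timestamp_column": "event_timestamp", "max_delay_hours": 2},
            "volume": {"max_change_pct": 25},
            "schema": {"enabled": True},
        },
        "analytics.fct_orders": {
            "owner": "orders-squad",
            "freshness": {"timestamp_column": "created_at", "max_delay_hours": 4},
            "volume": {"max_change_pct": 30},
            "schema": {"enabled": True},
        },
    }
}


class RecordingService:
    """Hypothetical stand-in for the service: reports which checks it would run."""

    def run_all_checks(self, table: str, config: dict) -> list[str]:
        return [name for name in ("freshness", "volume") if name in config]


def run_registry(service, registry: dict) -> dict[str, dict]:
    """Run every registered table's checks and keep the owner for alert routing."""
    report = {}
    for table, entry in registry["tables"].items():
        # Everything except the owner is check configuration
        check_config = {k: v for k, v in entry.items() if k != "owner"}
        report[table] = {
            "owner": entry.get("owner", "unowned"),
            "checks": service.run_all_checks(table, check_config),
        }
    return report


report = run_registry(RecordingService(), REGISTRY)
```

Keeping the owner next to the results lets the alerting layer route failures to the right squad without a second lookup.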

Common Mistakes

Treating observability as a replacement for testing — observability monitors patterns; testing validates expectations. A table that passes all observability checks (normal volume, fresh data, stable schema) can still have incorrect data if a transformation bug produces wrong values within normal ranges.
Over-alerting until teams ignore alerts — the biggest operational risk. Start with critical checks only (freshness, schema breaks, extreme volume changes) and add granularity once the baseline is stable. 50 daily alerts, most of them false positives, trains teams to ignore alerts entirely.
Monitoring only tables you know about — open-source tools only cover tables you've explicitly instrumented. The staging table someone created last month, the ad-hoc pipeline that writes to a shared schema — none of these are covered. This is Monte Carlo's strongest advantage.
Not measuring the cost of data incidents — "we don't need observability tools" usually means "we don't know how much data incidents cost us." Track time-to-detect, time-to-resolve, and blast radius. This data justifies the investment.
Deploying an enterprise tool without a rollout plan — Monte Carlo surfaces 200 anomalies on day one and nobody knows which are real. Start with 3–5 critical tables, tune thresholds for two weeks, expand only when false positive rates are below 10%.
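The metrics named in the last two points are simple to compute once incidents are logged with timestamps. The sketch below assumes a hand-maintained incident log; `IncidentRecord`, its field names, and the sample numbers are all hypothetical. It derives mean time-to-detect, mean time-to-resolve, average blast radius, and a false positive rate to compare against the 10% bar.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentRecord:
    occurred_at: datetime   # when bad data landed
    detected_at: datetime   # when an alert fired or someone noticed
    resolved_at: datetime   # when the data was corrected
    tables_affected: int    # blast radius


def average(deltas: list[timedelta]) -> timedelta:
    return sum(deltas, timedelta()) / len(deltas)


def summarize(incidents: list[IncidentRecord]) -> dict:
    return {
        "mean_time_to_detect": average([i.detected_at - i.occurred_at for i in incidents]),
        "mean_time_to_resolve": average([i.resolved_at - i.detected_at for i in incidents]),
        "mean_blast_radius": sum(i.tables_affected for i in incidents) / len(incidents),
    }


def false_positive_rate(total_alerts: int, false_alerts: int) -> float:
    return false_alerts / total_alerts if total_alerts else 0.0


# Hypothetical incident log
log = [
    IncidentRecord(datetime(2026, 1, 5, 6), datetime(2026, 1, 5, 9),
                   datetime(2026, 1, 5, 13), tables_affected=4),
    IncidentRecord(datetime(2026, 1, 12, 2), datetime(2026, 1, 12, 3),
                   datetime(2026, 1, 12, 5), tables_affected=2),
]

summary = summarize(log)
fp_rate = false_positive_rate(total_alerts=40, false_alerts=3)
ready_to_expand = fp_rate < 0.10  # the 10% bar from the rollout advice above
```

Even a spreadsheet-grade log like this is enough to put a number on incident cost before a tooling decision.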

Hands-on project

Build the DataGuard Observability Platform

Reading about tools is one thing. Hiring managers want to see you build observability as a system — not just a tool choice. The AI-DE Data Observability module walks you through building the full stack: pipeline-embedded validation with Great Expectations, dbt monitoring with Elementary, centralized alerting, and a config-driven table registry.

By the end you'll have a portfolio project that demonstrates the monitoring architecture senior data engineers are expected to design — not just "I ran Great Expectations once."
