
Data Contracts vs Data Quality: What's the Difference?

Data contracts define what a dataset promises — before it is written. Data quality measures whether the data delivered on those promises — after it is written. Contracts are enforced in CI/CD to prevent breaking changes; quality checks run at pipeline time to catch runtime violations. You need both.

Data Contract

  • Defines schema, SLA, owner, and change policy
  • Enforced in CI/CD — blocks breaking PRs before merge
  • Versioned YAML stored in source control
  • Covers PII classification and access control
  • Requires CI/CD integration to be effective
  • Most valuable at team boundaries — overkill for internal tables

Stack: ODCS YAML · GitHub Actions · Schema Registry
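
As a rough illustration of what "blocks breaking PRs before merge" can mean in practice, the sketch below compares two versions of a contract schema and reports changes that would break consumers. The schemas, the column names, and the `find_breaking_changes` helper are hypothetical; a real ODCS contract carries far more metadata, and a real CI job would read both versions from source control.

```python
# Hypothetical, simplified contract schemas: column name -> type.
# Only the schema part of a contract is modeled here.
OLD_SCHEMA = {"order_id": "string", "amount": "decimal", "created_at": "timestamp"}
NEW_SCHEMA = {
    "order_id": "string",
    "total": "decimal",  # `amount` was renamed to `total`
    "created_at": "timestamp",
    "channel": "string",  # newly added column
}


def find_breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Return a description of each change that would break existing consumers."""
    problems = []
    for column, old_type in old.items():
        if column not in new:
            problems.append(f"column removed or renamed: {column}")
        elif new[column] != old_type:
            problems.append(f"type changed: {column} {old_type} -> {new[column]}")
    # Columns that exist only in `new` are additive and treated as non-breaking.
    return problems


breaking = find_breaking_changes(OLD_SCHEMA, NEW_SCHEMA)
for problem in breaking:
    print(problem)

# A CI job would exit non-zero at this point (for example via sys.exit)
# so that the pull request cannot be merged.
ci_should_fail = bool(breaking)
```

Treating added columns as non-breaking is a design choice of this sketch; a stricter change policy could flag them as well.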

Data Quality

  • Measures null rates, row counts, value distributions
  • Runs at pipeline time — catches runtime data violations
  • Works without team buy-in — add to any existing pipeline
  • Tools: dbt tests, Great Expectations, Soda
  • Can't block breaking schema changes before they ship
  • No ownership or SLA metadata — just assertions

Stack: dbt · Great Expectations · Soda · Prometheus
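
To make the measurements above concrete, here is a minimal plain-Python sketch of a null-rate, row-count, and value-distribution check. The sample rows, the helper names, and the thresholds are all illustrative assumptions; tools such as dbt tests, Great Expectations, or Soda express the same idea declaratively.

```python
from collections import Counter

# Hypothetical sample of rows as a pipeline step might see them.
rows = [
    {"order_id": "a1", "status": "paid", "amount": 20.0},
    {"order_id": "a2", "status": "paid", "amount": None},
    {"order_id": "a3", "status": "refunded", "amount": 5.5},
    {"order_id": "a4", "status": "paid", "amount": 12.0},
]


def null_rate(rows: list[dict], column: str) -> float:
    """Fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    nulls = sum(1 for row in rows if row.get(column) is None)
    return nulls / len(rows)


def value_distribution(rows: list[dict], column: str) -> dict:
    """Share of rows per distinct value of `column`."""
    counts = Counter(row.get(column) for row in rows)
    return {value: count / len(rows) for value, count in counts.items()}


row_count = len(rows)
amount_null_rate = null_rate(rows, "amount")
status_shares = value_distribution(rows, "status")

# Illustrative thresholds; real values depend on the dataset.
failures = []
if row_count < 3:
    failures.append("row count below minimum")
if amount_null_rate > 0.30:
    failures.append("too many null amounts")
```

Note that nothing in this check says who owns the table or what was promised to consumers; it only asserts properties of the data that arrived.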

Mental Model

Think of a data contract as a building code — it defines what must be true before construction begins, enforced by an inspector (CI/CD) before anyone moves in. Data quality is a home inspection — it checks the delivered building against specs after it's built. Building codes prevent the wrong thing from being built; inspections catch what slipped through. You need both to ship reliable data.

Use Data Contracts When

  • Breaking schema changes are a recurring problem
  • Producer and consumer are different teams
  • You need PII tagging and access control enforced
  • SOC2, GDPR, or HIPAA compliance is required
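
One way PII tagging might be enforced is a CI check that rejects any contract column without a classification. The sketch below assumes a hypothetical `pii` key and a hypothetical set of allowed levels; neither is taken from a specific contract standard.

```python
# Hypothetical contract fragment; the `pii` key and its allowed values are
# illustrative assumptions.
ALLOWED_PII_LEVELS = {"none", "indirect", "direct"}

columns = [
    {"name": "order_id", "type": "string", "pii": "none"},
    {"name": "email", "type": "string", "pii": "direct"},
    {"name": "shipping_zip", "type": "string"},  # classification missing
]


def unclassified_columns(columns: list[dict]) -> list[str]:
    """Names of columns with a missing or unrecognized PII classification."""
    return [
        column["name"]
        for column in columns
        if column.get("pii") not in ALLOWED_PII_LEVELS
    ]


# A non-empty result would fail the CI job.
missing = unclassified_columns(columns)
```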

Use Data Quality When

  • Catching runtime data violations (nulls, bad values)
  • Adding checks to existing pipelines quickly
  • You need a foundation before rolling out contracts
  • Internal tables consumed only by your own team

How They Work Together

The production pattern is layered. Contracts define what the dataset promises and gate CI/CD. Quality tools (Great Expectations, dbt tests) validate the runtime output against those promises. The contract drives test generation: quality checks should be derived from contract SLAs, not defined independently.

# Contract SLA drives quality check thresholds
# contracts/orders.yml
sla:
  freshness_hours: 1
  min_rows: 1000

# Great Expectations check generated from the contract.
# `validator` is assumed to be a Great Expectations Validator already bound to the orders data.
validator.expect_table_row_count_to_be_between(
    min_value=1000  # from contract.sla.min_rows
)
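
The same idea can be sketched without any quality tool: read the thresholds from the contract and build the runtime checks from them, so the two never drift apart. In this sketch the contract is shown as an already-parsed dict, and `checks_from_contract` and `run_checks` are hypothetical helpers, not part of any library.

```python
from datetime import datetime, timedelta, timezone

# The contract's SLA section as an already-parsed dict
# (loading the YAML file is left out to keep this self-contained).
contract = {"sla": {"freshness_hours": 1, "min_rows": 1000}}


def checks_from_contract(contract: dict) -> dict:
    """Build runtime checks whose thresholds come from the contract SLA."""
    sla = contract["sla"]
    max_age = timedelta(hours=sla["freshness_hours"])
    min_rows = sla["min_rows"]
    return {
        "min_rows": lambda row_count, last_loaded, now: row_count >= min_rows,
        "freshness": lambda row_count, last_loaded, now: now - last_loaded <= max_age,
    }


def run_checks(checks: dict, row_count: int, last_loaded: datetime, now: datetime) -> dict:
    """Evaluate every check and return a mapping of check name to pass/fail."""
    return {name: check(row_count, last_loaded, now) for name, check in checks.items()}


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
results = run_checks(
    checks_from_contract(contract),
    row_count=1500,
    last_loaded=now - timedelta(hours=2),  # data is two hours old
    now=now,
)
print(results)  # {'min_rows': True, 'freshness': False}
```

Here the row count satisfies the contract, but the two-hour-old load violates the one-hour freshness SLA. Changing the SLA in the contract changes the check without touching pipeline code.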

Common Mistakes

Treating contracts and quality as alternatives

They operate at different layers. Contracts prevent the wrong thing from shipping; quality catches runtime violations. Replace neither with the other.

Quality checks without contracts at team boundaries

If a producer renames a column, the producer's quality checks, updated along with the rename, still pass (the new column exists and looks healthy), but every downstream consumer reading the old name breaks. A contract would have blocked the rename in CI.
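
The failure mode can be reproduced in a few lines. The rows, the column names, and both functions below are hypothetical; the point is only that a passing producer-side check says nothing about consumers that still depend on the old name.

```python
# Hypothetical producer output after renaming `amount` to `total`.
produced_rows = [
    {"order_id": "a1", "total": 20.0},
    {"order_id": "a2", "total": 5.5},
]


def producer_quality_check(rows: list[dict]) -> bool:
    """Producer-side check, updated together with the rename: no null totals."""
    return all(row.get("total") is not None for row in rows)


def consumer_revenue(rows: list[dict]) -> float:
    """Downstream consumer that still reads the old column name."""
    return sum(row["amount"] for row in rows)


quality_passed = producer_quality_check(produced_rows)

try:
    consumer_revenue(produced_rows)
    consumer_broke = False
except KeyError:
    # The old column no longer exists, so the consumer fails at runtime.
    consumer_broke = True
```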

Contracts without quality enforcement

A YAML file with SLA definitions that nobody validates is just documentation. Wire contracts into both CI/CD (breaking changes) and pipeline time (quality thresholds) from day one.

FAQ

What is the difference between data contracts and data quality?
Contracts define what a dataset promises and are enforced in CI/CD before code ships. Quality measures whether the data delivered on those promises at runtime. Contracts prevent problems; quality catches them.
Can data quality replace data contracts?
No. Quality tools run after data is written and catch known violations. Contracts block breaking schema changes before they are ever merged. You need both.
Should I implement contracts or quality first?
Start with quality — dbt tests and Great Expectations on critical tables. Once baselines exist, layer contracts on top for your most important team boundaries.
