Data Contracts vs Data Quality: What's the Difference?
Data contracts define what a dataset promises — before it is written. Data quality measures whether the data delivered on those promises — after it is written. Contracts are enforced in CI/CD to prevent breaking changes; quality checks run at pipeline time to catch runtime violations. You need both.
Data Contract
- ✓ Defines schema, SLA, owner, and change policy
- ✓ Enforced in CI/CD — blocks breaking PRs before merge
- ✓ Versioned YAML stored in source control
- ✓ Covers PII classification and access control
- – Requires CI/CD integration to be effective
- – Most valuable at team boundaries — overkill for internal tables
Stack: ODCS YAML · GitHub Actions · Schema Registry
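To make the list above concrete, here is a minimal sketch of a contract modeled as a plain Python structure, with a completeness check that a CI job could run. The field names (`owner`, `change_policy`, `pii`) and the `validate_contract` helper are hypothetical, chosen for illustration; they are not the ODCS schema.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    dtype: str
    pii: bool = False  # hypothetical PII classification flag


@dataclass
class Contract:
    dataset: str
    version: str
    owner: str
    change_policy: str
    sla: dict = field(default_factory=dict)
    columns: list = field(default_factory=list)


def validate_contract(contract: Contract) -> list:
    """Return human-readable problems; an empty list means the contract is complete."""
    problems = []
    if not contract.owner:
        problems.append("missing owner")
    if not contract.change_policy:
        problems.append("missing change_policy")
    if not contract.columns:
        problems.append("no columns declared")
    if "freshness_hours" not in contract.sla:
        problems.append("sla.freshness_hours not set")
    return problems


# Example contract; all values here are made up for the sketch.
orders = Contract(
    dataset="orders",
    version="1.0.0",
    owner="checkout-team",
    change_policy="breaking changes require a major version bump",
    sla={"freshness_hours": 1, "min_rows": 1000},
    columns=[Column("order_id", "string"), Column("email", "string", pii=True)],
)
```

Calling `validate_contract(orders)` returns an empty list here; a contract with no owner or no declared columns would return one message per gap, which a CI step could treat as a failure.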
Data Quality
- ✓ Measures null rates, row counts, value distributions
- ✓ Runs at pipeline time — catches runtime data violations
- ✓ Works without team buy-in — add to any existing pipeline
- ✓ Tools: dbt tests, Great Expectations, Soda
- – Can't block breaking schema changes before they ship
- – No ownership or SLA metadata — just assertions
Stack: dbt · Great Expectations · Soda · Prometheus
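The kind of assertion these tools run can be sketched in plain Python. This is an illustration only, not how dbt, Great Expectations, or Soda implement their checks; `null_rate`, `run_checks`, and the sample `batch` are hypothetical names and data.

```python
def null_rate(rows, column):
    """Fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def run_checks(rows, min_rows, max_null_rate, column):
    """Return failure messages; an empty list means every check passed."""
    failures = []
    if len(rows) < min_rows:
        failures.append(f"row count {len(rows)} below minimum {min_rows}")
    rate = null_rate(rows, column)
    if rate > max_null_rate:
        failures.append(
            f"null rate for {column} is {rate:.2f}, above {max_null_rate:.2f}"
        )
    return failures


# Made-up batch standing in for one pipeline run's output.
batch = [
    {"order_id": "a1", "email": "x@example.com"},
    {"order_id": "a2", "email": None},
    {"order_id": "a3", "email": "y@example.com"},
    {"order_id": "a4", "email": "z@example.com"},
]
```

Note that these checks only see the data after it has been produced, which is why they cannot stop a breaking schema change from shipping.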
Mental Model
Think of a data contract as a building code — it defines what must be true before construction begins, enforced by an inspector (CI/CD) before anyone moves in. Data quality is a home inspection — it checks the delivered building against specs after it's built. Building codes prevent the wrong thing from being built; inspections catch what slipped through. You need both to ship reliable data.
Use Data Contracts When
- → Breaking schema changes are a recurring problem
- → Producer and consumer are different teams
- → You need PII tagging and access control enforced
- → SOC 2, GDPR, or HIPAA compliance is required
Use Data Quality When
- → Catching runtime data violations (nulls, bad values)
- → Adding checks to existing pipelines quickly
- → You need a foundation before rolling out contracts
- → Internal tables consumed only by your own team
How They Work Together
The production pattern is layered: contracts define what the dataset promises and gate CI/CD, while quality tools (Great Expectations, dbt tests) validate the runtime output against those promises. The contract drives test generation: quality checks should be derived from contract SLAs, not defined independently.
# Contract SLA drives quality check thresholds
# contracts/orders.yml
sla:
  freshness_hours: 1
  min_rows: 1000

# Great Expectations suite auto-generated from contract
validator.expect_table_row_count_to_be_between(
    min_value=1000  # from contract.sla.min_rows
)
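One way to derive runtime checks from the contract rather than hard-coding thresholds is sketched below. It assumes the SLA block has already been loaded into a dict; `checks_from_sla` and `evaluate` are hypothetical helpers, and the sketch uses plain Python instead of a real Great Expectations validator.

```python
from datetime import datetime, timedelta, timezone


def checks_from_sla(sla):
    """Translate contract SLA fields into predicates over one pipeline run."""
    checks = {}
    if "min_rows" in sla:
        minimum = sla["min_rows"]
        checks["min_rows"] = lambda row_count, loaded_at, now: row_count >= minimum
    if "freshness_hours" in sla:
        max_age = timedelta(hours=sla["freshness_hours"])
        checks["freshness_hours"] = (
            lambda row_count, loaded_at, now: now - loaded_at <= max_age
        )
    return checks


def evaluate(sla, row_count, loaded_at, now):
    """Return the names of the SLA fields that this run violated."""
    return [
        name
        for name, check in checks_from_sla(sla).items()
        if not check(row_count, loaded_at, now)
    ]


sla = {"freshness_hours": 1, "min_rows": 1000}  # mirrors contracts/orders.yml
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # fixed time for the example
```

Because the thresholds are read from the contract, changing `min_rows` in the YAML changes the check without touching pipeline code.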
Common Mistakes
Treating contracts and quality as alternatives
They operate at different layers. Contracts prevent the wrong thing from shipping; quality catches runtime violations. Neither replaces the other.
Quality checks without contracts at team boundaries
If a producer renames a column and updates its own quality checks to match, those checks pass (the new column exists), but every downstream consumer still reading the old name breaks. A contract would have blocked the rename in CI.
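The rename scenario can be sketched as a schema diff that a CI job runs against the published contract. `breaking_changes` and the column names are hypothetical, and treating added columns as non-breaking is an assumption of this sketch rather than a rule from any particular standard.

```python
def breaking_changes(old_columns, new_columns):
    """Compare two {column_name: type} mappings and list changes that break consumers."""
    problems = []
    for name, dtype in old_columns.items():
        if name not in new_columns:
            problems.append(f"column removed or renamed: {name}")
        elif new_columns[name] != dtype:
            problems.append(
                f"type changed for {name}: {dtype} -> {new_columns[name]}"
            )
    # Columns that exist only in new_columns are treated as additive (assumption).
    return problems


published = {"order_id": "string", "customer_email": "string"}
proposed = {"order_id": "string", "email": "string"}  # customer_email renamed to email
```

Here `breaking_changes(published, proposed)` reports the missing `customer_email` column, and a CI step could fail the pull request whenever the returned list is non-empty.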
Contracts without quality enforcement
A YAML file with SLA definitions that nobody validates is just documentation. Wire contracts into both CI/CD (breaking changes) and pipeline time (quality thresholds) from day one.
FAQ
- What is the difference between data contracts and data quality?
- Contracts define what a dataset promises and are enforced in CI/CD before code ships. Quality measures whether the delivered data meets those promises at runtime. Contracts prevent problems; quality catches them.
- Can data quality replace data contracts?
- No. Quality tools run after data is written and catch known violations. Contracts block breaking schema changes before they are ever merged. You need both.
- Should I implement contracts or quality first?
- Start with quality — dbt tests and Great Expectations on critical tables. Once baselines exist, layer contracts on top for your most important team boundaries.