What is a Data Contract?
The complete guide for data engineers — what data contracts are, how they work, and how to enforce them in production.
Quick Answer
A data contract is a versioned YAML file that defines what a dataset promises — its schema, quality rules, freshness SLA, and owner. Stored in source control and enforced in CI/CD, contracts prevent breaking changes from reaching downstream consumers and create accountability at team boundaries.
What is a Data Contract?
A data contract is a formal, versioned agreement between a data producer (the team that owns and publishes a dataset) and its consumers (the teams or systems that read it). Contracts define not just the structure of the data, but what it promises — quality levels, SLAs, ownership, and breaking-change policy. They transform implicit assumptions into explicit, enforceable agreements.
Schema (without a contract)
Defines column names and types. No SLA, no quality rules, no owner. A column rename in the source breaks five downstream dashboards with no warning.
Data Contract (ODCS format)
Schema + quality rules + freshness SLA + owner + compatibility policy. A column rename is flagged as a breaking change in CI/CD, blocking the PR until consumers are notified and updated.
Before vs. After Data Contracts
Before
- ✗ Column renamed upstream — 4 dashboards break silently
- ✗ No owner listed — on-call spends 2 hours finding who to page
- ✗ PII column added without sensitivity tag — compliance audit fails
- ✗ No SLA — nobody knows if missing data is a bug or expected
After
- ✓ Breaking change blocked at PR — consumer team notified automatically
- ✓ Owner field in contract — PagerDuty fires to the right rotation
- ✓ PII scanner enforces sensitivity tags — compliance check passes in CI
- ✓ Freshness SLA: 1 hour — alert fires if table goes stale
What Data Contracts Cover
🔒
Schema Change Protection
Block breaking changes at CI/CD before they reach production and break downstream consumers.
🤝
Cross-Team SLAs
Codify freshness windows, row count bounds, and quality thresholds between producer and consumer teams.
🏷
PII Classification
Tag columns with sensitivity tiers and enforce role-based access control via policy-as-code.
📋
Compliance Documentation
Auto-generate data lineage reports for SOC2, GDPR, and HIPAA audits from contract metadata.
🔄
Backward Compatibility
Enforce Avro schema evolution rules and Confluent Schema Registry compatibility checks in CI.
📚
Contract Registry
Version and publish contracts to a central registry so consumers can discover and subscribe to datasets.
How Data Contracts Work
A production data contract system has four layers — define, validate, enforce, and monitor — applied across the full data lifecycle.
DEFINE
VALIDATE
ENFORCE
MONITOR
ODCS data contract (YAML)
# contracts/orders.yml — ODCS format
apiVersion: v2.2.2
kind: DataContract
id: orders-v1
dataset: orders
version: 1.2.0
owner:
  team: data-platform
  contact: data-platform@company.com
sla:
  freshness_hours: 1
  uptime_percent: 99.5
schema:
  - name: order_id
    type: integer
    nullable: false
  - name: customer_email
    type: string
    pii: true
    sensitivity: HIGH
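As a sketch of how a contract like this might be consumed once parsed (e.g. with PyYAML) into a dict, the snippet below checks a few required fields and enforces the rule that any `pii: true` column must carry a sensitivity tier. The required-field set and validation logic are illustrative assumptions, not official ODCS tooling.

```python
# Minimal contract checks — an illustrative sketch, not an official ODCS validator.
# Assumes the YAML has already been parsed (e.g. with PyYAML) into a dict.

REQUIRED_TOP_LEVEL = {"apiVersion", "kind", "id", "dataset", "version", "owner", "schema"}

def validate_contract(contract: dict) -> None:
    """Fail fast if required fields are missing or a PII column is untagged."""
    missing = REQUIRED_TOP_LEVEL - contract.keys()
    if missing:
        raise ValueError(f"contract missing required fields: {sorted(missing)}")
    for col in contract["schema"]:
        if col.get("pii") and "sensitivity" not in col:
            raise ValueError(f"PII column {col['name']} has no sensitivity tag")

def pii_columns(contract: dict) -> list[str]:
    """Names of all columns flagged pii: true."""
    return [c["name"] for c in contract["schema"] if c.get("pii")]
```

A check like this typically runs both in CI (on every contract PR) and as a startup guard in the pipeline that publishes the dataset.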
CI/CD contract validation (GitHub Actions)
# .github/workflows/contract-check.yml
on: [pull_request]
jobs:
  validate-contracts:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history so origin/main is available to diff against
      - name: Check breaking changes
        run: |
          python scripts/contract_diff.py \
            --base origin/main \
            --head HEAD \
            --fail-on-breaking
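The `contract_diff.py` script above is referenced but not shown; a minimal version of its core check might look like the sketch below. The classification rules — removed or retyped columns are breaking, newly added nullable columns are not — are assumptions you would adapt to your own compatibility policy.

```python
# Sketch of a breaking-change check between two contract schema versions.
# "base" and "head" are the schema lists from the old and new contracts.

def breaking_changes(base: list[dict], head: list[dict]) -> list[str]:
    base_cols = {c["name"]: c for c in base}
    head_cols = {c["name"]: c for c in head}
    problems = []
    for name, col in base_cols.items():
        if name not in head_cols:
            problems.append(f"column removed: {name}")      # consumers reading it break
        elif head_cols[name]["type"] != col["type"]:
            problems.append(f"type changed: {name}")        # downstream casts may fail
    for name, col in head_cols.items():
        if name not in base_cols and col.get("nullable") is False:
            problems.append(f"new NOT NULL column: {name}") # historical rows can't satisfy it
    return problems
```

In CI, a non-empty result would cause the script to exit non-zero, which is what blocks the PR.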
Data Contracts vs Schemas vs dbt Tests
vs Schemas
A schema is a structural description. A data contract is a promise — it includes the schema plus SLAs, quality rules, owner, and change compatibility policy. Schemas tell you what data looks like; contracts tell you what it guarantees.
vs dbt Tests
dbt tests run assertions at pipeline time inside your dbt project. Data contracts are external documents that define what the pipeline must guarantee — they can drive dbt test generation, but also enforce pre-merge breaking change detection and PII classification that dbt tests don't cover.
| Dimension | Data Contract | Schema Only | dbt Tests |
|---|---|---|---|
| Scope | Schema + quality + SLA + ownership | Column names and types only | Pipeline-time test assertions |
| When enforced | CI/CD + pipeline time + registry | At read/write time | When pipeline runs |
| Ownership | Explicit — named team and on-call | Implicit | In dbt project |
| Breaking change protection | Blocking CI check on PR | None | Post-hoc test failure |
| PII / compliance | Column-level sensitivity tags | None | Via custom meta tags |
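As the comparison suggests, contracts can drive dbt tests rather than replace them. The sketch below derives dbt-style column test entries from contract columns; the mapping (`nullable: false` → `not_null`, a hypothetical `unique` flag → `unique`) is an assumed convention, not part of any standard.

```python
# Sketch: derive dbt schema.yml column entries from a contract's schema list.
# The nullable/unique -> test mapping is an assumed convention.

def dbt_columns(contract_schema: list[dict]) -> list[dict]:
    out = []
    for col in contract_schema:
        tests = []
        if col.get("nullable") is False:
            tests.append("not_null")
        if col.get("unique"):
            tests.append("unique")
        entry = {"name": col["name"]}
        if tests:
            entry["tests"] = tests
        out.append(entry)
    return out
```

The resulting list can be dumped to YAML and committed alongside the dbt model, keeping the contract as the single source of truth.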
Common Mistakes
- ✗ Starting with too many contracts — begin with 3–5 critical datasets at team boundaries, not every table in the warehouse.
- ✗ Contracts without enforcement — a YAML file nobody checks is just documentation; wire it into CI/CD from day one.
- ✗ No versioning policy — define what counts as breaking vs. non-breaking before you write your first contract, not after the first incident.
- ✗ Treating contracts as a governance team's job — contracts work when the producing team owns them like production code.
Who Should Learn Data Contracts?
Junior Data Engineers
Learn to write ODCS contracts, run validation scripts, and add quality checks. Understanding contracts is increasingly expected in mid-level interviews.
Senior Data Engineers
Design contract frameworks, build CI/CD enforcement pipelines, implement Schema Registry compatibility rules, and own producer-consumer SLAs end-to-end.
Staff / Platform Engineers
Define org-wide contract standards, build contract registries, set PII classification policy, and ensure contracts satisfy SOC2, GDPR, and HIPAA audit requirements.
Frequently Asked Questions
What is a data contract?
A data contract is a formal, versioned agreement between a data producer and its consumers that defines: the schema (columns, types, nullability), quality rules (freshness SLAs, row count bounds, value constraints), ownership (team, PagerDuty rotation), and compatibility policy (breaking vs non-breaking change rules). Contracts are stored as YAML files in source control and enforced in CI/CD pipelines.
What is the Open Data Contract Standard (ODCS)?
ODCS (Open Data Contract Standard) is a vendor-neutral YAML schema for defining data contracts. It standardizes fields for dataset identity, schema definitions, quality expectations, SLAs, and ownership. ODCS contracts can be validated by tools like Soda, Great Expectations, and custom CI scripts, making them portable across platforms and teams.
What is the difference between a data contract and a data schema?
A schema defines the structure of data (column names and types). A data contract includes the schema plus quality rules, SLAs, ownership, and versioning policy. A schema tells you what the data looks like; a contract tells you what the data promises — and creates accountability when those promises are broken.
How are data contracts enforced?
Data contracts are enforced at three layers: (1) CI/CD validation — a GitHub Actions workflow runs schema compatibility checks on every PR, blocking merges that introduce breaking changes; (2) pipeline-time validation — Soda or Great Expectations run contract checks after each pipeline run; (3) Schema Registry — for Kafka-based producers, Avro schema evolution rules are enforced by the Confluent Schema Registry.
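For layer (2), a pipeline-time freshness check can be as simple as comparing the table's newest load timestamp against the contract's SLA. The function below is a minimal sketch with the clock injected for testability; in practice the timestamp would come from a warehouse query and the SLA from the contract's `freshness_hours` field.

```python
from datetime import datetime, timedelta, timezone

def freshness_ok(latest_loaded_at: datetime, sla_hours: float, now=None) -> bool:
    """True if the newest row is within the contract's freshness window."""
    now = now or datetime.now(timezone.utc)
    return now - latest_loaded_at <= timedelta(hours=sla_hours)
```

A failing check would page the owner named in the contract rather than whoever happens to notice the stale dashboard.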
When should I use data contracts?
Use data contracts when: multiple teams consume the same dataset, schema changes break downstream pipelines regularly, you need PII classification and access control enforced at the column level, or you are subject to compliance requirements (SOC2, GDPR, HIPAA) that require data lineage and ownership documentation. Contracts are most valuable at team boundaries — where the producer and consumer are different engineering teams.
What You Will Build
In the Data Governance learning path, you will build a production data governance platform — YAML contracts, CI/CD enforcement, PII classification, and a full audit trail.
- → YAML data contracts in ODCS format with schema versioning
- → CI/CD breaking change detection (GitHub Actions)
- → Great Expectations + Soda validation suites
- → Avro schema evolution + Confluent Schema Registry
- → PII detection pipeline with sensitivity classification
- → Role-based access control via policy-as-code
- → SOC2 / GDPR audit trail generation