
What is a Data Contract?

The complete guide for data engineers — what data contracts are, how they work, and how to enforce them in production.

13 min read · Updated March 2026

Quick Answer

A data contract is a versioned YAML file that defines what a dataset promises — its schema, quality rules, freshness SLA, and owner. Stored in source control and enforced in CI/CD, contracts prevent breaking changes from reaching downstream consumers and create accountability at team boundaries.

What is a Data Contract?

A data contract is a formal, versioned agreement between a data producer (the team that owns and publishes a dataset) and its consumers (the teams or systems that read it). Contracts define not just the structure of the data, but what it promises — quality levels, SLAs, ownership, and breaking-change policy. They transform implicit assumptions into explicit, enforceable agreements.

Schema (without a contract)

Defines column names and types. No SLA, no quality rules, no owner. A column rename in the source breaks five downstream dashboards with no warning.

Data Contract (ODCS format)

Schema + quality rules + freshness SLA + owner + compatibility policy. A column rename is flagged as a breaking change in CI/CD, blocking the PR until consumers are notified and updated.

Before vs. After Data Contracts

Before

  • Column renamed upstream — 4 dashboards break silently
  • No owner listed — on-call spends 2 hours finding who to page
  • PII column added without sensitivity tag — compliance audit fails
  • No SLA — nobody knows if missing data is a bug or expected

After

  • Breaking change blocked at PR — consumer team notified automatically
  • Owner field in contract — PagerDuty fires to the right rotation
  • PII scanner enforces sensitivity tags — compliance check passes in CI
  • Freshness SLA: 1 hour — alert fires if table goes stale

What Data Contracts Cover

🔒 Schema Change Protection

Block breaking changes at CI/CD before they reach production and break downstream consumers.

🤝 Cross-Team SLAs

Codify freshness windows, row count bounds, and quality thresholds between producer and consumer teams.

🏷️ PII Classification

Tag columns with sensitivity tiers and enforce role-based access control via policy-as-code.
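A simple enforcement pass for this is a scanner that flags columns which look like PII but carry no sensitivity tag. The name patterns below are an illustrative heuristic, not a complete PII taxonomy; a real scanner would combine name patterns with data sampling.

```python
import re

# Heuristic patterns for likely-PII column names (illustrative, not exhaustive)
PII_PATTERNS = [r"email", r"phone", r"ssn", r"address", r"name$"]

def untagged_pii_columns(schema: list[dict]) -> list[str]:
    """Return columns that look like PII but carry no sensitivity tag."""
    flagged = []
    for col in schema:
        looks_pii = any(re.search(p, col["name"], re.IGNORECASE) for p in PII_PATTERNS)
        if looks_pii and not col.get("sensitivity"):
            flagged.append(col["name"])
    return flagged

schema = [
    {"name": "order_id", "type": "integer"},
    {"name": "customer_email", "type": "string", "sensitivity": "HIGH"},
    {"name": "phone_number", "type": "string"},  # PII, but untagged
]
print(untagged_pii_columns(schema))  # ['phone_number']
```

Run as a CI step, a non-empty result fails the PR, which is exactly the "PII scanner enforces sensitivity tags" behavior described earlier.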

📋 Compliance Documentation

Auto-generate data lineage reports for SOC2, GDPR, and HIPAA audits from contract metadata.

🔄 Backward Compatibility

Enforce Avro schema evolution rules and Confluent Schema Registry compatibility checks in CI.
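Two of the core Avro backward-compatibility rules can be sketched in a few lines: a reader on the new schema can still read old data only if added fields have defaults and existing field types are unchanged. This simplification ignores Avro's legal type promotions (e.g. int to long), which a real check via the Schema Registry would allow.

```python
def is_backward_compatible(old_fields: list[dict], new_fields: list[dict]) -> bool:
    """Backward compatible = a reader using the NEW schema can still read OLD data.
    Sketch of two core rules: new fields need defaults, existing types must not change."""
    old_by_name = {f["name"]: f for f in old_fields}
    for f in new_fields:
        old = old_by_name.get(f["name"])
        if old is None:
            if "default" not in f:          # added a field without a default
                return False
        elif old["type"] != f["type"]:      # changed an existing field's type
            return False
    return True

old = [{"name": "order_id", "type": "long"}]
ok  = old + [{"name": "coupon", "type": "string", "default": ""}]
bad = old + [{"name": "coupon", "type": "string"}]  # no default -> breaking
print(is_backward_compatible(old, ok), is_backward_compatible(old, bad))  # True False
```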

📚 Contract Registry

Version and publish contracts to a central registry so consumers can discover and subscribe to datasets.

How Data Contracts Work

A production data contract system has four layers — define, validate, enforce, and monitor — applied across the full data lifecycle.

DEFINE → VALIDATE → ENFORCE → MONITOR

ODCS data contract (YAML)

# contracts/orders.yml — ODCS format
apiVersion: v2.2.2
kind: DataContract
id: orders-v1
dataset: orders
version: 1.2.0
owner:
  team: data-platform
  contact: data-platform@company.com
sla:
  freshness_hours: 1
  uptime_percent: 99.5
schema:
  - name: order_id
    type: integer
    nullable: false
  - name: customer_email
    type: string
    pii: true
    sensitivity: HIGH
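Before a contract like the one above enters the registry, it should pass a structural check. This is a minimal sketch using the field names from the example; a production setup would validate against the full ODCS JSON Schema rather than a hand-written list.

```python
# Required fields follow the example contract above, not the full ODCS spec
REQUIRED_TOP_LEVEL = ["id", "dataset", "version", "owner", "schema"]

def validate_contract(contract: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = [f"missing field: {k}" for k in REQUIRED_TOP_LEVEL if k not in contract]
    for col in contract.get("schema", []):
        if "name" not in col or "type" not in col:
            problems.append(f"schema entry missing name/type: {col}")
        if col.get("pii") and not col.get("sensitivity"):
            problems.append(f"PII column without sensitivity tier: {col.get('name')}")
    return problems

contract = {
    "id": "orders-v1",
    "dataset": "orders",
    "version": "1.2.0",
    "owner": {"team": "data-platform"},
    "schema": [{"name": "customer_email", "type": "string", "pii": True}],
}
print(validate_contract(contract))  # ['PII column without sensitivity tier: customer_email']
```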

CI/CD contract validation (GitHub Actions)

# .github/workflows/contract-check.yml
name: Contract check
on: [pull_request]
jobs:
  validate-contracts:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history so origin/main is available to diff against
      - name: Check breaking changes
        run: |
          python scripts/contract_diff.py \
            --base origin/main \
            --head HEAD \
            --fail-on-breaking
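The `contract_diff.py` script referenced above is not shown in full; a minimal sketch of the classification it might perform, assuming each contract's schema is a list of column dicts like the ODCS example:

```python
def breaking_changes(base: list[dict], head: list[dict]) -> list[str]:
    """Classify schema edits between two contract versions.
    Breaking: removed column, changed type, nullable tightened to non-nullable."""
    base_by = {c["name"]: c for c in base}
    head_by = {c["name"]: c for c in head}
    issues = []
    for name, old in base_by.items():
        new = head_by.get(name)
        if new is None:
            issues.append(f"removed column: {name}")
        elif old["type"] != new["type"]:
            issues.append(f"type change on {name}: {old['type']} -> {new['type']}")
        elif old.get("nullable", True) and not new.get("nullable", True):
            issues.append(f"{name} tightened from nullable to non-nullable")
    return issues

base = [{"name": "order_id", "type": "integer"}, {"name": "note", "type": "string"}]
head = [{"name": "order_id", "type": "string"}]  # dropped `note`, retyped order_id
print(breaking_changes(base, head))
```

With `--fail-on-breaking`, a non-empty result would exit non-zero and block the merge. Note that a rename appears here as a removal plus an addition, which is why renames are flagged as breaking.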

Data Contracts vs Schemas vs dbt Tests

vs Schemas

A schema is a structural description. A data contract is a promise — it includes the schema plus SLAs, quality rules, owner, and change compatibility policy. Schemas tell you what data looks like; contracts tell you what it guarantees.

vs dbt Tests

dbt tests run assertions at pipeline time inside your dbt project. Data contracts are external documents that define what the pipeline must guarantee — they can drive dbt test generation, but also enforce pre-merge breaking change detection and PII classification that dbt tests don't cover.
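The "drive dbt test generation" point can be made concrete with a small generator. This is a sketch, assuming the contract's column dicts from the earlier example; real generators (and dbt's own model contracts) cover far more test types.

```python
def contract_to_dbt_tests(dataset: str, schema: list[dict]) -> dict:
    """Emit a dbt schema.yml-style dict from contract columns:
    non-nullable columns get a not_null test (sketch only)."""
    columns = []
    for col in schema:
        tests = []
        if not col.get("nullable", True):
            tests.append("not_null")
        columns.append({"name": col["name"], "tests": tests})
    return {"models": [{"name": dataset, "columns": columns}]}

out = contract_to_dbt_tests("orders", [
    {"name": "order_id", "type": "integer", "nullable": False},
    {"name": "note", "type": "string"},
])
print(out["models"][0]["columns"][0]["tests"])  # ['not_null']
```

Serialized to YAML, the output drops straight into a dbt project, so the contract stays the single source of truth and the dbt tests become derived artifacts.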

| Dimension | Data Contract | Schema Only | dbt Tests |
|---|---|---|---|
| Scope | Schema + quality + SLA + ownership | Column names and types only | Pipeline-time test assertions |
| When enforced | CI/CD + pipeline time + registry | At read/write time | When pipeline runs |
| Ownership | Explicit — named team and on-call | Implicit | In dbt project |
| Breaking change protection | Blocking CI check on PR | None | Post-hoc test failure |
| PII / compliance | Column-level sensitivity tags | None | Via custom meta tags |

Common Mistakes

  • Starting with too many contracts — begin with 3–5 critical datasets at team boundaries, not every table in the warehouse
  • Contracts without enforcement — a YAML file nobody checks is just documentation. Wire it into CI/CD from day one.
  • No versioning policy — define what counts as breaking vs non-breaking before you write your first contract, not after the first incident
  • Treating contracts as a governance team's job — contracts work when the producing team owns them like production code
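One way to pin down "breaking vs non-breaking" up front is to map change kinds to semantic-version bumps. The categories below are assumptions for illustration; each team should agree on its own list before the first contract ships.

```python
# Illustrative policy: map change kinds to semantic-version bumps.
BUMP_POLICY = {
    "column_removed":        "major",  # breaking: consumers must migrate
    "type_changed":          "major",
    "nullable_tightened":    "major",
    "column_added_nullable": "minor",  # additive: safe for existing readers
    "sla_loosened":          "minor",
    "description_updated":   "patch",
}

def required_bump(changes: list[str]) -> str:
    """Return the largest bump any change in the set requires."""
    order = {"patch": 0, "minor": 1, "major": 2}
    bumps = [BUMP_POLICY.get(c, "major") for c in changes]  # unknown = assume breaking
    return max(bumps, key=order.__getitem__)

print(required_bump(["description_updated", "column_added_nullable"]))  # minor
print(required_bump(["column_added_nullable", "type_changed"]))         # major
```

Defaulting unknown change kinds to "major" keeps the policy fail-safe: anything the team has not explicitly classified as safe is treated as breaking.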

Who Should Learn Data Contracts?

Junior Data Engineers

Learn to write ODCS contracts, run validation scripts, and add quality checks. Understanding contracts is increasingly required at mid-level interviews.

Senior Data Engineers

Design contract frameworks, build CI/CD enforcement pipelines, implement Schema Registry compatibility rules, and own producer-consumer SLAs end-to-end.

Staff / Platform Engineers

Define org-wide contract standards, build contract registries, set PII classification policy, and ensure contracts satisfy SOC2, GDPR, and HIPAA audit requirements.


Frequently Asked Questions

What is a data contract?

A data contract is a formal, versioned agreement between a data producer and its consumers that defines: the schema (columns, types, nullability), quality rules (freshness SLAs, row count bounds, value constraints), ownership (team, PagerDuty rotation), and compatibility policy (breaking vs non-breaking change rules). Contracts are stored as YAML files in source control and enforced in CI/CD pipelines.

What is the Open Data Contract Standard (ODCS)?

ODCS (Open Data Contract Standard) is a vendor-neutral YAML schema for defining data contracts. It standardizes fields for dataset identity, schema definitions, quality expectations, SLAs, and ownership. ODCS contracts can be validated by tools like Soda, Great Expectations, and custom CI scripts, making them portable across platforms and teams.

What is the difference between a data contract and a data schema?

A schema defines the structure of data (column names and types). A data contract includes the schema plus quality rules, SLAs, ownership, and versioning policy. A schema tells you what the data looks like; a contract tells you what the data promises — and creates accountability when those promises are broken.

How are data contracts enforced?

Data contracts are enforced at three layers: (1) CI/CD validation — a GitHub Actions workflow runs schema compatibility checks on every PR, blocking merges that introduce breaking changes; (2) pipeline-time validation — Soda or Great Expectations run contract checks after each pipeline run; (3) Schema Registry — for Kafka-based producers, Avro schema evolution rules are enforced by the Confluent Schema Registry.

When should I use data contracts?

Use data contracts when: multiple teams consume the same dataset, schema changes break downstream pipelines regularly, you need PII classification and access control enforced at the column level, or you are subject to compliance requirements (SOC2, GDPR, HIPAA) that require data lineage and ownership documentation. Contracts are most valuable at team boundaries — where the producer and consumer are different engineering teams.

What You Will Build

In the Data Governance learning path, you will build a production data governance platform — YAML contracts, CI/CD enforcement, PII classification, and a full audit trail.

  • YAML data contracts in ODCS format with schema versioning
  • CI/CD breaking change detection (GitHub Actions)
  • Great Expectations + Soda validation suites
  • Avro schema evolution + Confluent Schema Registry
  • PII detection pipeline with sensitivity classification
  • Role-based access control via policy-as-code
  • SOC2 / GDPR audit trail generation