PRO · module 01 free preview · Quality track · P11

Build a production-grade data-contract platform on ODCS + Schema Registry

ODCS v2.2 contracts with semantic versioning, dual validation (Great Expectations + Soda), Avro + Confluent Schema Registry with BACKWARD/FORWARD/FULL compatibility enforced in GitHub Actions, PII detection feeding a 4-tier classification model, RBAC policy-as-code with row-level security, an append-only audit log with integrity hashing, and SOC2 + GDPR check engines wired to a governance bot — all on a payments + risk-assessments domain across 3 teams.

Timeline: 20-26 hours
Difficulty: Senior+
Stack: ODCS · Avro · GE · Soda · Schema Registry

This is the platform-design question asked at Stripe, Airbnb, Spotify, GoCardless and any company running shared data across producer/consumer teams under SOC2 or GDPR.

By the end you will have
  • Two ODCS v2.2 YAML contracts (payments_events, risk_assessments) with semantic versioning and a contract registry
  • Dual validation pipeline: Great Expectations expectations + Soda checks producing a single PASS/WARN/FAIL gate decision
  • Avro + Confluent Schema Registry with BACKWARD/FORWARD/FULL compatibility, plus a CDC schema-evolution handler with a dead-letter queue
  • GitHub Actions PR gate that blocks breaking changes before merge — same pattern Confluent and Spotify run
  • RBAC policy-as-code (4 roles × 4 sensitivity tiers) with row-level filtering, plus an append-only audit log with cryptographic integrity hashing
  • SOC2 (CC6.1/CC7.2/CC7.3/CC6.5) + GDPR (Art 5/7/15/17) check engines, a governance bot that reviews PRs, and a written governance charter with a domain-ownership map
PREREQ · Built for senior+ data engineers. Comfortable with Python (classes, type hints), SQL DDL, Docker, and Git. Prior exposure to governance or data observability helps but isn’t required. SOC2/GDPR familiarity is a bonus.
governance.contracts.* · contract enforced · audit hashed · PR gate
Contracts (3 teams · 1 spec)
  • payments_events: ODCS v2.2 · v1.3.0
  • risk_assessments: depends_on: payments
  • ContractRegistry: semver · deprecation
Validate (PASS · WARN · FAIL)
  • Great Expectations: 8+ expectations
  • Soda · SodaCL: freshness · row counts
  • DriftDetector: ADD · TRANSFORM · BREAK
Enforce (CDC handler · DLQ)
  • Avro v1.avsc: logical types · defaults
  • Schema Registry: BACKWARD · FORWARD
  • GH Actions PR gate: contract-check.yml
Govern (PR review · auto-remediate)
  • PII · 4-tier classify: PUB · INT · CONF · RST
  • RBAC · row sec: policy-as-code YAML
  • AuditLogger · #hash: tamper-evident
  • SOC2 + GDPR: 8 controls · GovBot
# PR-gated contract enforcement
$ gh pr create
→ registry.check_compat(FULL) → FAIL
drop user_id (no default): risk + analytics readers break
→ blocked before merge, not at 3am
● Audit + compliance evidence
audit.append(event)
sha256(prev_hash + event) → tamper-evident
CC6.1 · CC7.2 · CC7.3 · Art 5 · 17 · 15
→ emitted as JSON, every run
8 SOC2 + GDPR controls · 4-tier PII classification · 20-26h end to end
Why this matters in 2026

Schema breaks and PII leaks are the top two incidents platform teams have to guard against in 2026.

The patterns you wire here — contracts as code, dual validation, compatibility-gated CI, classification-driven RBAC, hashed audit logs, compliance bots — are what every senior data-engineering rubric now checks for.

ODCS is the emerging contract standard

Bitol's Open Data Contract Standard (v2.2 in this project) is the format Spotify, Airbnb, and GoCardless converged on. Building one teaches you what the spec actually solves.

Compatibility-gated CI is table stakes

Confluent Schema Registry with BACKWARD compatibility blocks breaking changes at the registry, not at 3am. The CI gate is the difference between a 5-min PR review and a 5-hour incident.
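
To see which changes each compatibility mode actually catches, here is a minimal Python sketch of the field-level rule behind Avro schema resolution: a reader field that the writer never wrote needs a default. It is a simplification, not the registry's algorithm; it ignores type promotion, aliases, unions, and nested records. The function names and the v2/v3 schemas are hypothetical, chosen only for illustration.

```python
from typing import Any

Schema = dict[str, Any]


def _fields(schema: Schema) -> dict[str, dict[str, Any]]:
    """Index a record schema's fields by name."""
    return {f["name"]: f for f in schema["fields"]}


def read_problems(reader: Schema, writer: Schema) -> list[str]:
    """List reasons `reader` cannot decode records written with `writer`."""
    problems = []
    written = _fields(writer)
    for name, field in _fields(reader).items():
        if name not in written:
            # A field the writer never wrote is only readable via a default.
            if "default" not in field:
                problems.append(f"{name}: absent from writer schema and has no default")
        elif written[name]["type"] != field["type"]:
            # Simplification: any type difference is treated as incompatible.
            problems.append(f"{name}: type differs between schemas")
    return problems


def check_compat(old: Schema, new: Schema, mode: str) -> list[str]:
    if mode == "BACKWARD":  # consumers on the new schema read old data
        return read_problems(reader=new, writer=old)
    if mode == "FORWARD":  # consumers on the old schema read new data
        return read_problems(reader=old, writer=new)
    if mode == "FULL":  # both directions must hold
        return read_problems(new, old) + read_problems(old, new)
    raise ValueError(f"unknown mode: {mode}")


def record(*fields: tuple[str, str]) -> Schema:
    return {"type": "record", "name": "PaymentEvent",
            "fields": [{"name": n, "type": t} for n, t in fields]}


v1 = record(("payment_id", "string"), ("user_id", "string"), ("amount_cents", "long"))
v2 = record(("payment_id", "string"), ("user_id", "string"), ("amount_usd", "long"))  # rename
v3 = record(("payment_id", "string"), ("amount_cents", "long"))  # drops user_id

print("rename, BACKWARD:", check_compat(v1, v2, "BACKWARD"))
print("drop,   BACKWARD:", check_compat(v1, v3, "BACKWARD"))
print("drop,   FORWARD: ", check_compat(v1, v3, "FORWARD"))
```

Under this rule a rename fails BACKWARD, while a plain field drop passes BACKWARD and only fails FORWARD (and therefore FULL). That asymmetry is worth keeping in mind when choosing which mode the PR gate enforces.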

Classification drives access, not docs

A column tagged RESTRICTED is automatically denied to the analyst role — without a Jira ticket, without a wiki page. Policy-as-code is what shifts governance left.
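
A rough Python sketch of what "classification drives access" can look like: each role carries a maximum tier, and a column is visible only if its tag is at or below that tier. The four tier names come from this page; every role name except analyst, the clearance levels, and the column tags are hypothetical placeholders, not the starter kit's policy.

```python
from enum import IntEnum


class Tier(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Hypothetical role -> highest readable tier (only "analyst" appears on this page).
ROLE_CLEARANCE = {
    "viewer": Tier.PUBLIC,
    "analyst": Tier.INTERNAL,
    "engineer": Tier.CONFIDENTIAL,
    "compliance_officer": Tier.RESTRICTED,
}

# Hypothetical column tags for a payments table.
COLUMN_TIERS = {
    "payment_id": Tier.INTERNAL,
    "currency": Tier.PUBLIC,
    "amount_cents": Tier.CONFIDENTIAL,
    "card_number": Tier.RESTRICTED,
}


def allowed_columns(role: str, column_tiers: dict[str, Tier]) -> list[str]:
    """Return the columns a role may read; unknown roles get nothing."""
    clearance = ROLE_CLEARANCE.get(role)
    if clearance is None:
        return []  # deny by default
    return [col for col, tier in column_tiers.items() if tier <= clearance]


print(allowed_columns("analyst", COLUMN_TIERS))
```

Because the decision is a pure function of role and tag, re-tagging a column changes access on the next evaluation with no ticket or wiki edit involved.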

SOC2 + GDPR are no longer optional

Auditors now ask for evidence, not promises. A hashed audit log + automated CC6.1 / Art 17 checks is the evidence they want to see in your Type II report.
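
The sha256(prev_hash + event) idea behind a hashed audit log can be sketched in a few lines of Python. This is an illustrative hash chain, not the project's AuditLogger; the class name, the genesis value, and the event fields are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first entry's prev_hash


def _digest(prev_hash: str, event: dict) -> str:
    # Canonical JSON so the same event always hashes to the same value.
    payload = prev_hash + json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class HashChainedLog:
    """Append-only log where each entry commits to the one before it."""

    def __init__(self) -> None:
        self._entries: list[dict] = []

    def append(self, event: dict) -> str:
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS
        entry = {"event": dict(event), "prev_hash": prev_hash,
                 "hash": _digest(prev_hash, event)}
        self._entries.append(entry)
        return entry["hash"]

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev_hash = GENESIS
        for entry in self._entries:
            if entry["prev_hash"] != prev_hash:
                return False
            if entry["hash"] != _digest(prev_hash, entry["event"]):
                return False
            prev_hash = entry["hash"]
        return True


log = HashChainedLog()
log.append({"actor": "alice", "action": "read", "table": "payments_events"})
log.append({"actor": "bob", "action": "grant", "role": "analyst"})
print(log.verify())

log._entries[0]["event"]["actor"] = "mallory"  # simulate tampering
print(log.verify())
```

A chain like this is tamper-evident rather than tamper-proof: someone who can rewrite the whole log can also recompute every hash, so the latest hash usually needs to be stored somewhere the log writer cannot modify.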

Curriculum · 4 modules · 20-26 hours

Module 01 is free. The rest unlocks with PRO.

Try the first 3-4 hours — author your first ODCS contract, wire the Great Expectations + Soda dual gate, and watch the drift detector catch a renamed column. If it clicks, upgrade to unlock enforcement, classification, and compliance modules.

Module 01 is free — no card required. Get a feel for the stack before paying.
M01
Schema contracts + dual validation framework
Author ODCS v2.2 contracts for payments_events and risk_assessments. Build a ContractRegistry with semantic-version bump recommendations and deprecation windows. Wire dual validation (Great Expectations expectations + Soda checks) producing a single PASS/WARN/FAIL gate. Build the SchemaDriftDetector with severity classification (ADDITIVE / TRANSFORMATIVE / BREAKING) and Slack alerts.
3-4h · 6 lessons · FREE PREVIEW
Start →
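
The single PASS/WARN/FAIL gate in Module 01 can be pictured as a worst-of reduction over both validators' results. A minimal Python sketch, assuming each tool's output has already been normalized into a hypothetical CheckResult; it does not call the Great Expectations or Soda APIs, and the blocking flag is an illustrative assumption rather than the starter kit's design.

```python
from dataclasses import dataclass
from enum import IntEnum


class Gate(IntEnum):
    PASS = 0
    WARN = 1
    FAIL = 2


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    blocking: bool = True  # assumption: non-blocking failures only warn


def decide(results: list[CheckResult]) -> Gate:
    """Reduce one tool's check results to the worst outcome seen."""
    worst = Gate.PASS
    for result in results:
        if not result.passed:
            worst = max(worst, Gate.FAIL if result.blocking else Gate.WARN)
    return worst


# Hypothetical normalized results from each validator.
ge_results = [CheckResult("amount_cents is not null", passed=True)]
soda_results = [CheckResult("freshness under 1h", passed=False, blocking=False)]

# The combined gate is the worse of the two tool-level decisions.
gate = max(decide(ge_results), decide(soda_results))
print(gate.name)
```

Ordering the outcomes as integers keeps the combination rule to a single max call, so adding a third validator later does not change the gate logic.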
M02
Breaking changes + Avro + Schema Registry + CI/CD
Stand up the Confluent Platform docker stack. Build the BreakingChangeSimulator and ConsumerImpactAnalyzer that calculate blast radius across the Risk + Analytics consumers. Wire AvroSchemaEvolution with BACKWARD/FORWARD/FULL compatibility checks against the registry. Ship a GitHub Actions PR gate. Build a CDC schema-evolution handler with a dead-letter queue.
4-5h · 8 lessons · PRO TIER
Unlock with PRO →
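
Blast radius across consumers reduces to a set intersection: which consumers read any of the fields a change removes. The sketch below is a hedged illustration in Python; the consumer names follow the Risk + Analytics teams mentioned above, but the field lists and the function name are made up for the example.

```python
# Hypothetical map of consumer -> fields it reads from payments_events.
CONSUMER_FIELDS = {
    "risk": {"payment_id", "user_id", "amount_cents"},
    "analytics": {"user_id", "amount_cents", "currency"},
}


def impacted_consumers(removed: set[str],
                       consumers: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each affected consumer to the removed fields it depends on."""
    impact = {}
    for name, fields in consumers.items():
        broken = fields & removed
        if broken:
            impact[name] = sorted(broken)
    return impact


print(impacted_consumers({"user_id"}, CONSUMER_FIELDS))
print(impacted_consumers({"currency"}, CONSUMER_FIELDS))
```

Dropping user_id touches both consumers, while dropping currency touches only analytics; that difference is what a blast-radius report makes visible in the PR.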
M03
PII classification + RBAC + audit log
Build the PIIDetector (regex patterns + column-name heuristics, no NLP). Define the 4-tier sensitivity model (PUBLIC / INTERNAL / CONFIDENTIAL / RESTRICTED) with TierPolicy mappings. Build the column-level lineage graph with a BlastRadiusCalculator. Ship the RBAC PolicyEngine + RowSecurityEngine + YAMLPolicyLoader. Build the append-only AuditLogger with cryptographic integrity hashing for tamper detection.
3-4h · 7 lessons · PRO TIER
Unlock with PRO →
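
A regex-plus-heuristics detector of the kind Module 03 describes can be sketched as follows. The patterns, the name hints, and the 50% match threshold are all assumptions chosen for illustration; they are not the starter kit's PIIDetector rules and would need tuning on real data.

```python
import re

# Hypothetical value patterns; real detectors need many more and stricter ones.
VALUE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "card_number": re.compile(r"\d{13,19}"),
}

# Hypothetical substrings that make a column name suspicious on their own.
NAME_HINTS = ("email", "phone", "ssn", "iban", "card")


def detect(column: str, samples: list[str], threshold: float = 0.5) -> list[str]:
    """Return the reasons a column looks like PII (empty list means none)."""
    reasons = []
    if any(hint in column.lower() for hint in NAME_HINTS):
        reasons.append("column-name")
    values = [s.strip() for s in samples if s and s.strip()]
    for label, pattern in VALUE_PATTERNS.items():
        hits = sum(1 for v in values if pattern.fullmatch(v))
        if values and hits / len(values) >= threshold:
            reasons.append(f"value:{label}")
    return reasons


print(detect("contact", ["a@example.com", "b@example.org", ""]))
print(detect("customer_email", []))
print(detect("amount_cents", ["1200", "450"]))
```

Combining both signals matters: a column named contact is caught by its values, while an empty customer_email column is still flagged by name before any data arrives.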
M04
SOC2 + GDPR compliance + governance bot
Build the SOC2ComplianceChecker (CC6.1 encryption, CC7.2 access logging, CC7.3 audit trail, CC6.5 retention) and GDPRComplianceChecker (Art 5 minimization, Art 7 consent, Art 15 DSAR, Art 17 right-to-deletion). Wire the GovernanceMetricsCollector + Grafana KPI dashboard config. Ship the GovernanceBot (PR contract-syntax + compatibility + PII checks), the AutoRemediationEngine, the Slack Block Kit notifier, and the written governance charter + domain-ownership map.
3-4h · 8 lessons · PRO TIER
Unlock with PRO →
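
A check engine that emits evidence as JSON can be as small as a table of predicates over dataset metadata. The Python sketch below is illustrative only: the control labels follow this page's mapping, while the metadata keys (encrypted_at_rest, access_logging, retention_days, deletion_handler) and the report shape are hypothetical.

```python
import json
from datetime import datetime, timezone

# Control labels follow this page's mapping; metadata keys are hypothetical.
CHECKS = {
    "SOC2 CC6.1 encryption": lambda ds: ds.get("encrypted_at_rest") is True,
    "SOC2 CC7.2 access logging": lambda ds: ds.get("access_logging") is True,
    "SOC2 CC6.5 retention": lambda ds: isinstance(ds.get("retention_days"), int),
    "GDPR Art 17 deletion": lambda ds: ds.get("deletion_handler") is not None,
}


def run_checks(dataset: dict) -> dict:
    """Evaluate every control and return a timestamped, JSON-ready report."""
    results = {control: "PASS" if check(dataset) else "FAIL"
               for control, check in CHECKS.items()}
    return {
        "dataset": dataset["name"],
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "results": results,
        "compliant": all(status == "PASS" for status in results.values()),
    }


report = run_checks({
    "name": "payments_events",
    "encrypted_at_rest": True,
    "access_logging": True,
    "deletion_handler": "handlers.delete_user",  # hypothetical handler path
    # retention_days deliberately missing, so that control fails
})
print(json.dumps(report, indent=2))
```

Keeping each control as a named predicate means the report lists exactly which control failed, which is the part an auditor can act on.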
3 modules locked · Unlock all PRO content for $29/mo
Upgrade to PRO →
Backed by curriculum

Governance & Data Contracts

10 modules · 14 hours · ODCS contracts · schema evolution · compatibility modes · lineage · compliance
Open curriculum

This curriculum is the foundation for the project — not a sales add-on. PRO subscribers get full access to every module.

The build, in 3 phases

Three sprints. Three checkpoints. One governance platform.

Each phase ships runnable artifacts, not slides. Tagged commits at every checkpoint.

Phase 01 · ~4h
Contracts + dual validation

Two ODCS contracts with semantic versioning. ContractRegistry, VersionManager, MigrationManager. GE + Soda dual gate emitting a single PASS/WARN/FAIL decision. SchemaDriftDetector with severity classification.

  • contracts/payments_events.yaml + risk_assessments.yaml (ODCS v2.2)
  • ContractRegistry + SemanticVersion + ChangeType bump recommender
  • PipelineValidator (GE + Soda) → GateDecision + DriftAlerter to Slack
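
As a rough picture of how drift severity can feed a version-bump recommendation, here is a small Python sketch. The three severity names come from this page, but their definitions here (removed column is BREAKING, changed type is TRANSFORMATIVE, new column is ADDITIVE) and the bump mapping are assumptions, not the starter kit's rules.

```python
from enum import Enum


class Severity(Enum):
    ADDITIVE = "ADDITIVE"
    TRANSFORMATIVE = "TRANSFORMATIVE"
    BREAKING = "BREAKING"


def detect_drift(expected: dict[str, str], observed: dict[str, str]) -> dict[str, Severity]:
    """Compare column -> type maps and classify each difference."""
    findings = {}
    for col, col_type in expected.items():
        if col not in observed:
            findings[col] = Severity.BREAKING  # removed, or renamed away
        elif observed[col] != col_type:
            findings[col] = Severity.TRANSFORMATIVE  # same name, new type
    for col in observed:
        if col not in expected:
            findings[col] = Severity.ADDITIVE
    return findings


def recommend_bump(version: str, findings: dict[str, Severity]) -> str:
    """Assumed mapping: BREAKING/TRANSFORMATIVE -> major, ADDITIVE -> minor."""
    major, minor, _patch = (int(part) for part in version.split("."))
    severities = set(findings.values())
    if severities & {Severity.BREAKING, Severity.TRANSFORMATIVE}:
        return f"{major + 1}.0.0"
    if Severity.ADDITIVE in severities:
        return f"{major}.{minor + 1}.0"
    return version


contract = {"payment_id": "string", "amount_cents": "long"}
incoming = {"payment_id": "string", "amount_usd": "long"}  # renamed column

findings = detect_drift(contract, incoming)
print({col: sev.value for col, sev in findings.items()})
print(recommend_bump("1.3.0", findings))
```

A rename shows up as one removed column plus one added column, so the removed half is what drives the major bump.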
Phase 02 · ~5h
Enforcement: Avro + Schema Registry + CI

Confluent Schema Registry stack via docker-compose. Avro evolution with BACKWARD/FORWARD/FULL compatibility. GitHub Actions PR gate. Blast-radius analysis across Risk + Analytics consumers. CDC handler with DLQ for unsafe transformations.

  • docker-compose.yml + RegistryClient + AvroSchemaEvolution
  • .github/workflows/contract-check.yml (PR gate)
  • BreakingChangeSimulator + ConsumerImpactAnalyzer + CDCSchemaEvolutionHandler
Phase 03 · ~6h
Classification + compliance + bot

4-tier PII classification with RBAC + row-level security. Hashed audit log. SOC2 + GDPR check engines (8 controls). Governance bot reviewing PRs. Auto-remediation router. Slack Block Kit notifier. Grafana KPI dashboard. Written governance charter + domain-ownership map.

  • PIIDetector + DataClassifier + ColumnLineageGraph + BlastRadiusCalculator
  • PolicyEngine + RowSecurityEngine + AuditLogger (with integrity hash)
  • UnifiedComplianceEngine + GovernanceBot + AutoRemediationEngine + governance_charter.md
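
Column-level blast radius is a graph reachability question: starting from one column, which downstream columns can it reach? A minimal breadth-first sketch in Python, with a made-up lineage map; the column names and edges are hypothetical and this is not the starter kit's ColumnLineageGraph.

```python
from collections import deque

# Hypothetical column-level edges: upstream column -> downstream columns.
LINEAGE = {
    "payments_events.user_id": [
        "risk_assessments.user_id",
        "analytics.daily_users.user_id",
    ],
    "risk_assessments.user_id": ["risk_assessments.risk_score"],
    "payments_events.amount_cents": ["analytics.revenue.total_cents"],
}


def blast_radius(start: str, edges: dict[str, list[str]]) -> set[str]:
    """Return every column reachable downstream of `start` (excluding it)."""
    reached: set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child != start and child not in reached:
                reached.add(child)
                queue.append(child)
    return reached


print(sorted(blast_radius("payments_events.user_id", LINEAGE)))
```

The visited set doubles as the result and as cycle protection, so a lineage graph with loops still terminates.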
Project setup · 10 minutes

One starter kit. 67 pre-built files. Sample data with intentional quality issues.

The starter kit ships every module wired and importable — ODCS contracts, Python modules for registry / validation / drift / classification / compliance, an Avro schema, GitHub Actions workflow, docker-compose stack, and synthetic CSVs with planted quality bugs so the validators have something to fail.

What lives in the repo

Everything you need to run the four modules locally — including the Confluent Platform docker stack, the GitHub Actions workflow, and the sample datasets that exercise drift, PII, and compatibility paths.

  • contracts/ — 2 ODCS v2.2 YAML contracts (payments_events, risk_assessments)
  • src/ — ContractRegistry, VersionManager, drift_detector, break_simulator, impact_analyzer, compliance_engine, governance_bot
  • governance/ — PIIDetector, DataClassifier, LineageGraph, RBAC PolicyEngine, AuditLogger, policies as YAML
  • validation/ — Great Expectations suite + SodaCL checks + PipelineValidator
  • schemas/ + docker/ — Avro schema + docker-compose for Confluent Platform 7.5
  • .github/workflows/ — contract-check.yml: schema discovery, compatibility check, GE validation on every PR
Download · Starter Kit

Data Governance & Contracts Starter Kit

Pre-built repo with all 4 modules wired — 2 ODCS contracts, the Python registry + validators + drift detector + PII detector + RBAC engine + compliance checkers, the Avro schema, the GitHub Actions workflow, the docker-compose stack, and 4 synthetic CSVs with intentional quality bugs.

Pro project · 67 files · ~3 MB · sample data with planted bugs
~/projects/data-governance-contracts — zsh
1. Unzip and set up the Python environment
$ unzip data-governance-contracts-starter.zip
$ cd data-governance-contracts
$ python3 -m venv .venv && source .venv/bin/activate
$ pip install -r requirements.txt
2. Verify Module 01 — run the drift detector
$ python3 scripts/simulate_breaking_change.py
$ # ✓ Drift detected: amount_cents renamed → amount_usd (BREAKING)
3. Start the Confluent Platform stack (Module 02)
$ docker-compose -f docker/docker-compose.yml up -d
$ python3 scripts/test_registry_enforcement.py
4. Run the PII scanner across all sample CSVs (Module 03)
$ python3 -c 'from governance.pii_detector import PIIScanner; PIIScanner().scan_directory("data/")'
5. Run SOC2 + GDPR check engines (Module 04)
$ python3 -m src.unified_compliance_engine
$ # Generates compliance_report_<ts>.json with CC6.1, CC7.2, CC7.3, CC6.5 + Art 5/7/15/17
67 starter files · 40+ Python modules · 8 SOC2+GDPR controls · 3 team boundaries
Production hardening

The same governance — but built for the cross-team case.

Most governance tutorials show you a great_expectations.yml in isolation. This one shows what changes when three teams share the same contract and an auditor wants evidence — not promises.

Single-team / wiki-driven version (what you have today)
  × Schema source of truth: a README in the producer's repo
  × Breaking change detection: found at 3am when a dashboard breaks
  × PII handling: a spreadsheet listing sensitive columns
  × Access control: database GRANTs reviewed quarterly
  × Audit log: application logs (mutable, no integrity)
  × Compliance evidence: manual screenshots before the audit
Your governance version (Modules 02-04)
  • Schema source of truth: contracts/*.yaml in ODCS v2.2 + ContractRegistry; the registry is the spec, not the README
  • Breaking change detection: BACKWARD compat enforced at Schema Registry + GH Actions PR gate; caught at git push, not at 3am
  • PII handling: PIIScanner + 4-tier DataClassifier + tier-aware policy; evaluated, not documented
  • Access control: PolicyEngine + RowSecurityEngine from YAML; policy-as-code, evaluated per query
  • Audit log: AuditLogger, append-only with cryptographic integrity_hash; tamper-evident
  • Compliance evidence: UnifiedComplianceEngine emits timestamped JSON for CC6.1/CC7.2/CC7.3/CC6.5 + Art 5/7/15/17 on every run
PRO benefit · code review

Real review from senior engineers who shipped this stack.

Submit your repo, get line-by-line feedback within 48 hours. The kind of review that's quietly worth thousands of dollars in time-to-staff.

4 reviews / month

Submit a repo, a PR, or a contract design proposal. Reviewer is matched to your domain — governance / contracts / compliance for this project. Async, comments inline, average turnaround 31 hours.

31h avg turnaround · 9.2/10 helpfulness · 94% return next month
2 office hours / month

Live 30-min sessions with a senior platform engineer. Walk a tricky contract design, mock a SOC2 readiness review, whiteboard an incident-runbook with policy-as-code. Group sessions also available.

30 min per session · 2 / mo included · + group, unlimited
What PRO unlocks

One subscription. 15+ projects, all curriculum, code review.

PRO is built for engineers who want production-grade builds and feedback loops — not more tutorials.

What you get (FREE / PRO / EXPERT)
  • Projects (production-grade builds): FREE 2 · PRO 15+ · EXPERT 8
  • Curriculum modules (all 7 tracks): FREE Phase 1 only · PRO All · EXPERT All + bonus
  • Code review credits (senior engineer review): FREE 0 · PRO 4 / month · EXPERT Unlimited
  • Career path access (5 paths × full plans): FREE 1 path · PRO All 5 · EXPERT All 5 + 1:1
  • Certificate (verifiable on LinkedIn): PRO Yes · EXPERT Yes + portfolio review
  • Community (Discord + office hours): FREE Read-only · PRO Full + 2/mo · EXPERT Full + 4/mo
$29/mo, billed monthly · cancel anytime
or annual: $249/yr (save 28%)
Upgrade to PRO
Who this is for

Pick this if you’re the engineer that both the auditor and the producer team end up asking.

Platform / data-platform engineers

You own the contract layer between domain teams. This gives you the registry, PR gate, lineage, and compliance bot — the four levers a platform team actually pulls.

Senior data engineers prepping interviews

Cross-team contracts and compliance show up in every senior+ system-design round. After this you can defend a contract architecture with concrete evidence instead of hand-waving.

Compliance / governance engineers

You're trying to get out of spreadsheets. Module 04 wires SOC2 (CC6.1/CC7.2/CC7.3/CC6.5) and GDPR (Art 5/7/15/17) into automated checks — the evidence Type II reviewers actually want.

Tech leads scaling data teams

Three teams, one shared dataset, no contract → outage. The governance charter + domain-ownership YAML in Module 04 is the operating model you can adopt verbatim.

FAQ

Quick answers.

P23 is the schema-evolution deep-dive — FastAPI registry from scratch, NetworkX lineage, chaos-injecting incident simulator, 5-stage runbook. P11 (this) is the enterprise-governance project — cross-team RBAC + 4-tier PII classification + hashed audit log + SOC2 + GDPR check engines + governance bot. Pick P23 if you want depth on the schema layer; pick P11 if you want breadth across the governance + compliance stack. Most learners do both.
Real OpenLineage SDK integration (we ship a custom column-lineage graph instead), encryption-at-rest implementation (compliance check only), runtime DSAR workflow automation (handler signature only), LDAP/Okta/Auth0 integration, multi-tenancy, and HIPAA/CCPA controls. These are deliberate cuts — they're full projects in their own right. The earlier landing claimed some of them; we cut the misleading copy as part of this redesign.
Automated. Module 04 ships SOC2ComplianceChecker covering CC6.1 (encryption), CC7.2 (access logging), CC7.3 (audit trail), CC6.5 (retention) and GDPRComplianceChecker covering Art 5 (data minimization), Art 7 (consent), Art 15 (DSAR handler), Art 17 (right to deletion). The UnifiedComplianceEngine emits timestamped JSON reports — the artifact you'd attach to a Type II review. We did not implement encryption-at-rest itself or a live DSAR workflow; those would be follow-on projects.
No cloud credentials — everything runs locally via docker-compose (Confluent Platform 7.5: Zookeeper + Kafka + Schema Registry). No real Slack workspace — the GovernanceNotifier formats Block Kit messages and prints them; swap to a webhook to go live. No GitHub Apps token — the GovernanceBot logic is wired but the actual `gh` API call is shown as the integration point. The patterns transfer to production with config changes only.
All 15+ PRO projects, 4 code-review credits per month, 2 office-hours sessions, full curriculum access across all 7 tracks, all 5 career paths, certificate of completion, and full community access. Cancel anytime. The free Module 01 here is the same first 3-4 hours full PRO subscribers get — same code, same dataset, same drift detector.
Especially the cross-team and compliance rounds. After this you can whiteboard a contract architecture end-to-end, defend a 4-tier classification model, explain when BACKWARD vs FORWARD compatibility is the right call, and answer the inevitable 'how does this hold up under SOC2 audit' follow-up with a real artifact rather than a hand-wave.

Ready to build a real governance platform?

Start with module 01 — free, no card. About 3-4 hours. By the end you'll have two ODCS contracts authored, the GE + Soda dual gate emitting decisions, and the drift detector catching a renamed column with a Slack alert.
