ai-de.net/Projects/P21 · Airflow + dbt: Production Pipeline Foundations

PRO · module 01 free preview · Foundations track · P21

Production pipelines on Airflow + dbt: ingest, orchestrate, transform, monitor

End-to-end batch pipeline foundations on the open-source Airflow + dbt + Postgres + Docker stack. Build 3 production DAGs: an idempotent REST-API ingestion with pagination and retry, a multi-source orchestrator with TaskGroups + dynamic mapping + branching + callbacks, and a dbt medallion transformer (bronze → silver → gold) with dbt tests as quality gates. Wrap it with SLA monitoring, freshness checks, and a production-readiness runbook. Local Docker only — no cloud credentials.

Timeline: 11-13 hours
Difficulty: Intermediate
Stack: Airflow · dbt · Postgres · Docker · Python

See PRO benefits

Pagination, idempotency, orchestration, transformation, monitoring: the end-to-end batch pipeline questions that junior-and-up data-engineering interviews keep asking. This project gives you working code on a widely deployed open-source stack.

By the end you will have shipped

ingest_orders_api.py — REST API DAG with pagination, retry/backoff, HttpSensor, idempotent ON CONFLICT UPSERT into staging.orders
customer_orders_pipeline.py + multi_source_etl.py — TaskGroups, BranchPythonOperator, dynamic mapping with expand(), watermark/CDC, on_failure_callback
streamcart_transforms/ — dbt project with bronze/silver/gold medallion, is_incremental() + unique_key, dbt tests + custom assertions, dbt docs
dbt_medallion_pipeline.py — Airflow orchestrating dbt bronze→test→silver→test→gold→test→docs as quality-gated stages
Monitoring layer — SLA monitoring, freshness checks, on_failure_callback, production-readiness checklist + runbook
Seeded warehouse — 10K orders, 1K customers, 600 products, 100 audit records on local Postgres + Docker (Airflow + Redis + Celery executor)

PREREQ: Comfortable with Python (functions, decorators, basic OOP), SQL basics (SELECT / JOIN / WHERE), Docker, and Git. Pairs well with the Apache Airflow deep dive and dbt curriculum: those teach the primitives, this project stitches them into one production pipeline.

streamcart.warehouse.* · 3 DAGs · medallion · all gates green · production-grade

Ingest: REST API (orders · paginated) → ingest_orders_api (TaskFlow + HttpSensor) → staging.orders (ON CONFLICT UPSERT). Retries · backoff · idempotent.
Orchestrate: TaskGroup (extract · transform · load) → expand() (dynamic mapping) → multi_source_etl (3 sources · branching). Watermark · CDC · pytest.
Transform: bronze_orders (raw + dbt_loaded_at) → silver_orders (incremental + dedup) → gold + dbt tests (unique · not_null · custom). Medallion · gated by tests.
Monitor: SLA monitor (sla_miss_callback) → freshness (max-age check) → runbook (callbacks · diagrams). On-call ready.

⚡ 3 production DAGs (you ship)

ingest_orders_api.py

customer_orders_pipeline.py

dbt_medallion_pipeline.py

→ all docker-compose · all idempotent

▣ Seeded warehouse (boot in 5 min)

staging.orders 10,000 rows

staging.customers 1,000 rows

staging.products 600 rows

→ audit.pipeline_runs 100 rows

3 production DAGs · 10K orders seeded · 11-13h end to end

Why this matters

Airflow + dbt is the open-source stack every paid platform copies.

Whether your company runs on Astronomer / MWAA / Composer or self-hosts Airflow, the primitives are the same. dbt + a warehouse is the same. This project builds the substrate every 'modern data stack' platform abstracts over — so you can reason about what's underneath the SaaS bill.

Airflow primitives > vendor magic

TaskFlow API, TaskGroups, dynamic mapping, branching, sensors, callbacks. The patterns transfer to MWAA / Composer / Astronomer with config-only changes.

Idempotency > rerun panic

ON CONFLICT UPSERT + watermark/CDC + execution-date keys mean you can re-run any DAG without fear of duplicates. The pattern that turns a tutorial into production.

dbt tests > runtime asserts

Quality gates as Airflow tasks — bronze→test→silver→test→gold→test→docs. Failures block downstream, not silently corrupt the warehouse.

Runbook > tribal knowledge

SLA + freshness + on_failure_callback + production-readiness checklist. Hand the project to a teammate and they can operate it.
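To make the watermark idea from the idempotency card concrete, here is a minimal sketch, not the project's actual DAG code. It assumes a source table with an `updated_at` column; the table name `source_orders` and the helper `extract_since_watermark` are hypothetical, and SQLite stands in for Postgres so the block runs anywhere.

```python
import sqlite3


def extract_since_watermark(conn, watermark):
    """Return rows changed after the watermark, plus the advanced watermark."""
    rows = conn.execute(
        "SELECT order_id, amount, updated_at FROM source_orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # If nothing changed, the watermark stays where it was.
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark


# Hypothetical seed data; ISO-8601 strings compare correctly as text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE source_orders "
    "(order_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO source_orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "2024-01-01T00:00:00"),
        (2, 20.0, "2024-01-02T00:00:00"),
        (3, 30.0, "2024-01-03T00:00:00"),
    ],
)

first_batch, wm = extract_since_watermark(conn, "2024-01-01T00:00:00")
second_batch, wm2 = extract_since_watermark(conn, wm)  # immediate re-run
```

The second call returns no rows because the watermark already covers everything loaded, which is the property that makes a re-run harmless.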

Curriculum · 4 modules · 11-13 hours

Module 01 is free. The rest unlocks with PRO.

Try the first 2-3 hours — write the REST-API ingestion DAG with pagination, retry, and idempotent UPSERT into staging.orders. If the rhythm clicks, upgrade to unlock the orchestration patterns, dbt medallion, and monitoring modules.

P21 · 11-13 hours · 4 modules

Free preview PRO required

Module 01 is free — no card required. Get the ingestion DAG and idempotent UPSERT pattern under your fingers before paying.

M01

✓Ingest API data reliably: pagination + idempotent UPSERT

Write `ingest_orders_api.py`: an Airflow DAG using the TaskFlow API and HttpSensor, with a page-based pagination loop on a REST API, retry/backoff (`retries=3`, `retry_delay=timedelta(minutes=5)`), and an idempotent `ON CONFLICT (order_id) DO UPDATE` UPSERT into staging.orders. Covers connection management for HTTP + Postgres. This is the canonical ingestion pattern.

2-3h · 9 lessons · FREE PREVIEW

Start →
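A minimal sketch of the Module 01 pattern, assuming a page-numbered API that returns a `results` list and a `next` page pointer. `FAKE_API`, `fetch_all_pages`, and `load` are hypothetical names, and SQLite (table `staging_orders`) stands in for Postgres `staging.orders` so the block runs without a database server. The `ON CONFLICT ... DO UPDATE` clause has the same shape in both engines.

```python
import sqlite3

# Hypothetical stand-in for the paginated REST API: page number -> payload.
FAKE_API = {
    1: {"results": [{"order_id": 1, "status": "new"},
                    {"order_id": 2, "status": "new"}], "next": 2},
    2: {"results": [{"order_id": 3, "status": "new"}], "next": None},
}


def fetch_all_pages(fetch_page, start_page=1):
    """Follow the `next` pointer until the API reports no further page."""
    page, rows = start_page, []
    while page is not None:
        payload = fetch_page(page)
        rows.extend(payload["results"])
        page = payload["next"]
    return rows


UPSERT_SQL = """
INSERT INTO staging_orders (order_id, status)
VALUES (:order_id, :status)
ON CONFLICT (order_id) DO UPDATE SET status = excluded.status
"""


def load(conn, rows):
    """Insert new orders and overwrite existing ones, keyed on order_id."""
    conn.executemany(UPSERT_SQL, rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staging_orders (order_id INTEGER PRIMARY KEY, status TEXT)"
)

orders = fetch_all_pages(FAKE_API.__getitem__)
load(conn, orders)
load(conn, orders)  # simulate a re-run of the same DAG run
row_count = conn.execute("SELECT count(*) FROM staging_orders").fetchone()[0]
```

Loading the same pages twice leaves the row count unchanged, which is what "idempotent" means for this DAG.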

M02

⊘Orchestrate multi-step workflows: TaskGroups + dynamic mapping

Build `customer_orders_pipeline.py` and `multi_source_etl.py`. TaskGroups for DAG visualization. BranchPythonOperator for data-quality gates. Dynamic mapping with `expand()` for runtime task creation. Watermark/CDC pattern via `MAX(updated_at)`. ExternalTaskSensor (poke vs reschedule). pytest DagBag tests. on_failure_callback for alerting.

3-4h · 9 lessons · PRO TIER

Unlock with PRO →
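As a rough illustration of the data-quality gate in Module 02: a BranchPythonOperator calls a Python function and follows whichever task_id it returns. The function below is a sketch of such a callable in plain Python. The task ids (`skip_load`, `quarantine_batch`, `load_to_staging`) and the 1% null threshold are assumptions made for the example, not names from the project.

```python
def choose_branch(row_count, null_order_ids, max_null_ratio=0.01):
    """Return the task_id the DAG should follow for this batch."""
    if row_count == 0:
        # Nothing extracted: skip the load instead of failing the run.
        return "skip_load"
    if null_order_ids / row_count > max_null_ratio:
        # Too many rows lack a key: route the batch aside for inspection.
        return "quarantine_batch"
    return "load_to_staging"


empty = choose_branch(row_count=0, null_order_ids=0)
dirty = choose_branch(row_count=1000, null_order_ids=50)
clean = choose_branch(row_count=1000, null_order_ids=5)
```

Keeping the decision in a pure function like this makes it easy to unit-test with pytest, separately from the DAG wiring.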

M03

⊘Transform with dbt: bronze → silver → gold + tests

Stand up `streamcart_transforms/` dbt project with profiles.yml (dev/prod schemas). Build bronze (raw copy + dbt_loaded_at), silver (cleaned, deduped, incremental with unique_key), gold (business aggregates). dbt tests: built-in (unique, not_null, relationships) + custom assertions (assert_revenue_positive). Orchestrate from `dbt_medallion_pipeline.py`: bronze→test→silver→test→gold→test→docs.

4-5h · 9 lessons · PRO TIER

Unlock with PRO →
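The bronze→test→silver→test→gold→test→docs ordering can be sketched as a list of dbt CLI invocations that stops at the first failure. This is an illustration under assumptions, not the project's `dbt_medallion_pipeline.py`: it assumes models live under `models/bronze`, `models/silver`, and `models/gold`, and the helper names (`build_commands`, `run_gated`, `shell_runner`, `fake_runner`) are hypothetical.

```python
import subprocess

LAYERS = ["bronze", "silver", "gold"]


def build_commands(layers=LAYERS):
    """Build run/test pairs per layer, then docs generation at the end."""
    commands = []
    for layer in layers:
        selector = f"path:models/{layer}"  # assumed project layout
        commands.append(["dbt", "run", "--select", selector])
        commands.append(["dbt", "test", "--select", selector])
    commands.append(["dbt", "docs", "generate"])
    return commands


def run_gated(commands, runner):
    """Run commands in order; stop at the first non-zero exit code."""
    completed = []
    for cmd in commands:
        if runner(cmd) != 0:
            return completed, cmd  # everything after this is blocked
        completed.append(cmd)
    return completed, None


def shell_runner(cmd):
    """Real runner: executes the dbt CLI (requires dbt to be installed)."""
    return subprocess.run(cmd).returncode


def fake_runner(cmd):
    """Demo runner: pretend the silver tests fail."""
    return 1 if cmd[1] == "test" and cmd[-1] == "path:models/silver" else 0


completed, failed = run_gated(build_commands(), fake_runner)
```

In the demo, the failing silver test leaves gold and docs unrun. In an Airflow DAG each command would typically be its own task, so the same gate falls out of ordinary task dependencies.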

M04

⊘Monitor, alert & ship: SLA + freshness + runbook

SLA monitoring with `sla_miss_callback`. Data freshness checks (max-age validation against business cutoffs). on_failure_callback wiring (Slack / email). Production-readiness checklist (RBAC, secrets, retries, deployment paths to MWAA / Composer). Portfolio runbook + architecture diagram + DAG screenshots.

2-3h · 9 lessons · PRO TIER

Unlock with PRO →
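A max-age freshness check reduces to comparing the newest load timestamp against an allowed age. The sketch below assumes the latest timestamp has already been read from the warehouse (for example, the maximum of a loaded-at column). `check_freshness` and the two-hour cutoff are illustrative choices, not the project's code.

```python
from datetime import datetime, timedelta, timezone


def check_freshness(latest_loaded_at, max_age, now=None):
    """Report whether the newest data is within the allowed age."""
    now = now or datetime.now(timezone.utc)
    age = now - latest_loaded_at
    return {"fresh": age <= max_age, "age_minutes": age.total_seconds() / 60}


# Hypothetical scenario: last load at 06:00 UTC, checked at 09:00 UTC.
latest = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)
checked_at = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)

stale = check_freshness(latest, timedelta(hours=2), now=checked_at)
ok = check_freshness(latest, timedelta(hours=4), now=checked_at)
```

Passing `now` explicitly keeps the check deterministic in tests. A monitoring task could raise an exception when `fresh` is false so that the failure callback fires.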

3 modules locked · Unlock all PRO content for $29/mo

Upgrade to PRO →

Backed by curriculum

Apache Airflow Production Patterns

13 modules · 205 hours · TaskFlow API · TaskGroups · Dynamic mapping · Sensors & SLAs · Datasets & CDC

Open curriculum→

The deep-dive curriculum on every Airflow primitive used in this project’s 3 DAGs. PRO subscribers get full access to every module.

The build, in 3 phases

Three sprints. Three checkpoints. One production-ready pipeline.

Each phase ends with a runnable DAG and a tagged commit. No theory decks.

01 · ~2.5h

Ingestion + idempotent UPSERT

ingest_orders_api.py running locally. Idempotent re-runs land cleanly. audit.pipeline_runs tracks every execution.

✓ ingest_orders_api.py (TaskFlow + HttpSensor)
✓ staging.orders + audit.pipeline_runs
✓ ON CONFLICT UPSERT + retries + backoff

02 · ~3.5h

Orchestration patterns

Multi-source ETL across API + Postgres + S3 mock. TaskGroups + dynamic mapping + branching + callbacks all wired and tested with pytest.

✓ customer_orders_pipeline.py (watermark/CDC)
✓ multi_source_etl.py (3-source parallel)
✓ TaskGroups + expand() + on_failure_callback

03 · ~5.5h

dbt medallion + monitoring

dbt project with bronze/silver/gold + tests + docs, orchestrated from dbt_medallion_pipeline.py. SLA + freshness + runbook complete.

✓ streamcart_transforms/ (bronze→silver→gold + tests)
✓ dbt_medallion_pipeline.py (quality-gated)
✓ SLA + freshness + runbook + architecture diagram

Project setup · 5 minutes

One stack. Airflow + Postgres + dbt — running in Docker.

Pre-configured docker-compose with Airflow (scheduler + webserver + Celery executor + Redis), Postgres, and the dbt project scaffolded. Seed data with 10K orders / 1K customers / 600 products / 100 audit records loads on first boot.

What lives in the repo

Everything you need to run all 3 production DAGs locally. The tutorial walks you through writing each DAG; the starter kit has working scaffolds, seed data, and the dbt project structure so you can boot fast and focus on the patterns.

docker-compose.yml — Airflow (scheduler + webserver + Celery + Redis) + Postgres
dags/streamcart/ — ingest_orders_api · customer_orders_pipeline · dbt_medallion_pipeline
streamcart_transforms/ — dbt project with bronze/silver/gold models + tests + docs
seed/ — 10K orders · 1K customers · 600 products · 100 audit records
tests/ — pytest DagBag tests (cycles, defaults, load errors)
requirements.txt + README.md — pinned versions + production-readiness checklist

Download · Starter Kit

Airflow + dbt Pipeline Starter Kit

Pre-configured docker-compose stack, the dbt project scaffolded, all 3 production DAG skeletons, seed data with 11.7K rows across 3 entities + audit log, and the pytest DagBag tests. Boot Airflow + Postgres + dbt in under 5 minutes.

363 KB · docker-compose · 3 DAG scaffolds · 11.7K seeded rows · PRO required

~/projects/modern-data-stack — zsh

1. Clone and start the stack

$ git clone https://github.com/ai-de/p21-modern-data-stack

$ cd p21-modern-data-stack && docker-compose up -d

2. Open Airflow UI + verify seed data

$ open http://localhost:8080 # admin / admin

$ docker exec -it postgres psql -d streamcart -c 'SELECT count(*) FROM staging.orders'

3. Trigger the ingestion DAG (Module 01)

$ docker exec airflow-scheduler airflow dags trigger ingest_orders_api

$ # → pagination loop runs, ON CONFLICT UPSERTs into staging.orders

4. Run the dbt medallion pipeline (Module 03)

$ docker exec airflow-scheduler airflow dags trigger dbt_medallion_pipeline

$ # → bronze → test → silver → test → gold → test → docs

5. Run the pytest DagBag suite

$ docker exec airflow-scheduler pytest tests/test_dags.py -v

$ # → DAG load errors, cycles, defaults all pass

10K orders · 1K customers · 600 products · 100 audit records

Production hardening

The same pipeline — but built for the production case.

Tutorials show you the happy path. Production breaks at the edges. Each row pairs the tutorial pattern with the upgrade you make when the DAG is actually on-call for revenue dashboards at 9am.

Tutorial pattern (what you ship in this project) → Production upgrade (Module 02–04 + hardening)

Pagination: page-based loop with manual `next` param → cursor-based with idempotency keys + dead-letter queue for poisoned pages
Idempotency: `ON CONFLICT` UPSERT in Postgres → schema evolution + soft-delete patterns + audit chain (every change traceable)
Retries: `retries=3` with exponential backoff → circuit breaker + token-bucket rate limiting + budget guard against quota burn
Quality gate: BranchPythonOperator on test result → quality gate as Airflow Dataset + downstream blocked-on-fail policy
Incremental: `WHERE updated_at > MAX(updated_at)` → dbt source freshness + late-arriving-data window + reconciliation job
Failure alert: `on_failure_callback` prints to logs → Alertmanager-style routing with dedup, severity, ack tracking, and escalation
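For the Retries row, it can help to see what an exponential backoff schedule looks like in numbers. The sketch below is a simplified doubling schedule with a cap. It is an illustration only: Airflow exposes this behaviour through the operator arguments `retry_exponential_backoff` and `max_retry_delay`, and its own calculation differs in detail. `backoff_delays` is a hypothetical helper.

```python
from datetime import timedelta


def backoff_delays(retries, base_delay, max_delay):
    """Double the wait on each attempt, never exceeding max_delay."""
    return [min(base_delay * (2 ** attempt), max_delay)
            for attempt in range(retries)]


# retries=3 with a 5-minute base delay, capped at 15 minutes (cap is assumed).
delays = backoff_delays(
    retries=3,
    base_delay=timedelta(minutes=5),
    max_delay=timedelta(minutes=15),
)
```

Here the third wait would be 20 minutes uncapped, so the cap trims it to 15. A cap like this keeps a long retry chain from pushing a run past its SLA window.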

PRO benefit · code review

Real review from senior engineers who’ve shipped this stack.

Submit your repo, get line-by-line feedback within 48 hours. The kind of review that's quietly worth thousands of dollars in time-to-DE.

4 reviews / month

Submit a repo, a PR, or a refactor proposal. Reviewer is matched to your domain — Airflow + dbt for this project. Async, comments inline, average turnaround 31 hours.

31h avg turnaround · 9.2/10 helpfulness · 94% return next month

2 office hours / month

Live 30-min sessions with a senior data engineer. Walk through your DAGs, debug an idempotency edge case, mock a system-design interview. Group sessions also available.

30 min per session · 2 / mo included · + group sessions, unlimited

What PRO unlocks

One subscription. 15+ projects, all curriculum, code review.

PRO is built for engineers who want production-grade builds and feedback loops — not more tutorials.

What you get, by tier (FREE · PRO · EXPERT)

Projects (production-grade builds): 15+
Curriculum modules (all 7 tracks): FREE Phase 1 only · PRO All · EXPERT All + bonus
Code review credits (senior engineer review): PRO 4 / month · EXPERT Unlimited
Career path access (5 paths × full plans): FREE 1 path · PRO All 5 · EXPERT All 5 + 1:1
Certificate (verifiable on LinkedIn): FREE not included · PRO Yes · EXPERT Yes + portfolio review
Community (Discord + office hours): FREE Read-only · PRO Full + 2/mo · EXPERT Full + 4/mo

$29/mo, billed monthly · cancel anytime
or $249/yr, billed annually (save 28%)

Upgrade to PRO →

Who this is for

Pick this if you’re shipping pipelines, not learning to.

Junior data engineers

You know SQL and Python; you want to ship the canonical Airflow + dbt pipeline so you can talk about it in interviews and on a portfolio.

Analysts moving into DE

You live in dbt + a warehouse but you've never owned the Airflow side. This bridges that gap end-to-end with real DAG patterns.

Backend devs adding DE

You can ship services; you want to see how data folks structure batch — DAGs, idempotency, quality gates, runbooks. The on-call rhythm.

Career-switchers / bootcampers

You've done a few notebooks. This is the smallest realistic production pipeline that fits in a portfolio and survives a system-design conversation.

Related curriculum

Going deeper? Three tracks back this pipeline.

This project stitches Airflow + dbt + Python into one pipeline. These three curriculums let you go deeper on each layer — the orchestration primitives, the transformation engine, and the Python ETL patterns the DAGs are built on.

FAQ

Quick answers.

Is this actually a 'modern data stack' project?

No, and that name was misleading. There's no Fivetran/Airbyte, no Snowflake/BigQuery, no Looker/Hightouch, no metrics layer. This is the open-source Airflow + dbt + Postgres + Docker stack — the substrate every paid 'modern data stack' platform is built on. We renamed the page to be honest about that.

Why no Spark or Kubernetes?

Earlier layout metadata claimed them; that was wrong, so we cut them. The stack is intentionally local-Docker so you can ship the project without cloud credentials. Patterns transfer to MWAA / Composer / Astronomer / cloud-Postgres with config-only changes.

Why no BI tool / Metabase?

The catalog used to mention Metabase but it was never wired in tutorials. We cut it. The gold layer is described as the BI handoff but no specific tool is wired — once gold lands, plug Metabase / Superset / Looker / Tableau on top with a JDBC connection.

Do I need cloud credentials?

No. Everything runs in local Docker — Airflow (scheduler + webserver + Celery executor + Redis) + Postgres + dbt. The pytest DagBag tests run locally too.

How is this different from the api-data-ingestion project (P16)?

P16 is API-side specifics — pagination, rate limits, retries, schema drift — on a single source. P21 (this) stitches Airflow + dbt + monitoring into an end-to-end pipeline across 4 parts. P16 is a slice; P21 is the whole.

What does PRO actually unlock for $29/mo?

All 4 modules of this project, all 15+ PRO projects, 4 code-review credits per month, 2 office-hours sessions, full curriculum across all 7 tracks, all 5 career paths, certificate of completion, and full community access. Cancel anytime.

Ready to ship a real production pipeline?

Start with module 01 — free, no card. About 2-3 hours. By the end you'll have a production-shaped REST-API ingestion DAG running locally in Docker, with pagination, retry, HttpSensor, and idempotent ON CONFLICT UPSERT into staging.orders.

See PRO benefits

Related curriculum tracks: dbt & Analytics Engineering · Python for Data Engineers · Advanced Data Modeling & Architecture
