ai-de.net/Projects/P21 · Airflow + dbt: Production Pipeline Foundations
PRO · module 01 free preview · Foundations track · P21

Production pipelines on Airflow + dbt
ingest, orchestrate, transform, monitor

End-to-end batch pipeline foundations on the open-source Airflow + dbt + Postgres + Docker stack. Build 3 production DAGs: an idempotent REST-API ingestion with pagination and retry, a multi-source orchestrator with TaskGroups + dynamic mapping + branching + callbacks, and a dbt medallion transformer (bronze → silver → gold) with dbt tests as quality gates. Wrap it with SLA monitoring, freshness checks, and a production-readiness runbook. Local Docker only — no cloud credentials.

Timeline: 11-13 hours
Difficulty: Intermediate
Stack: Airflow · dbt · Postgres · Docker · Python

Junior-and-above data-engineering interviews keep coming back to the end-to-end batch pipeline: pagination, idempotency, orchestration, transformation, monitoring. This project gives you working code for all five on a widely deployed open-source stack.

By the end you will have shipped
  • ingest_orders_api.py — REST API DAG with pagination, retry/backoff, HttpSensor, idempotent ON CONFLICT UPSERT into staging.orders
  • customer_orders_pipeline.py + multi_source_etl.py — TaskGroups, BranchPythonOperator, dynamic mapping with expand(), watermark/CDC, on_failure_callback
  • streamcart_transforms/ — dbt project with bronze/silver/gold medallion, is_incremental() + unique_key, dbt tests + custom assertions, dbt docs
  • dbt_medallion_pipeline.py — Airflow orchestrating dbt bronze→test→silver→test→gold→test→docs as quality-gated stages
  • Monitoring layer — SLA monitoring, freshness checks, on_failure_callback, production-readiness checklist + runbook
  • Seeded warehouse — 10K orders, 1K customers, 600 products, 100 audit records on local Postgres + Docker (Airflow + Redis + Celery executor)
PREREQ: Comfortable with Python (functions, decorators, basic OOP), SQL basics (SELECT / JOIN / WHERE), Docker, and Git. Pairs well with the Apache Airflow deep dive and dbt curriculum: those teach the primitives, and this project stitches them into one production pipeline.
streamcart.warehouse.* · 3 DAGs · medallion · all gates green · production-grade
  • Ingest: REST API (orders, paginated) → ingest_orders_api (TaskFlow + HttpSensor) → staging.orders (ON CONFLICT UPSERT). Retries · backoff · idempotent.
  • Orchestrate: TaskGroup (extract · transform · load) → expand() (dynamic mapping) → multi_source_etl (3 sources, branching). Watermark · CDC · pytest.
  • Transform: bronze_orders (raw + dbt_loaded_at) → silver_orders (incremental + dedup) → gold + dbt tests (unique · not_null · custom). Medallion, gated by tests.
  • Monitor: SLA monitor (sla_miss_callback) → freshness (max-age check) → runbook (callbacks · diagrams). On-call ready.
⚡ 3 production DAGs (you ship)
  • ingest_orders_api.py
  • customer_orders_pipeline.py
  • dbt_medallion_pipeline.py
→ all docker-compose · all idempotent
▣ Seeded warehouse (boot in 5 min)
  • staging.orders: 10,000 rows
  • staging.customers: 1,000 rows
  • staging.products: 600 rows
  • audit.pipeline_runs: 100 rows
3 production DAGs · 10K orders seeded · 11-13h end to end
Why this matters

Airflow + dbt is the open-source stack every paid platform copies.

Whether your company runs on Astronomer / MWAA / Composer or self-hosts Airflow, the primitives are the same. dbt + a warehouse is the same. This project builds the substrate every 'modern data stack' platform abstracts over — so you can reason about what's underneath the SaaS bill.

Airflow primitives > vendor magic

TaskFlow API, TaskGroups, dynamic mapping, branching, sensors, callbacks. The patterns transfer to MWAA / Composer / Astronomer with config-only changes.

Idempotency > rerun panic

ON CONFLICT UPSERT + watermark/CDC + execution-date keys mean you can re-run any DAG without fear of duplicates. The pattern that turns a tutorial into production.

dbt tests > runtime asserts

Quality gates as Airflow tasks: bronze→test→silver→test→gold→test→docs. A failed test blocks downstream tasks instead of silently corrupting the warehouse.

Runbook > tribal knowledge

SLA + freshness + on_failure_callback + production-readiness checklist. Hand the project to a teammate and they can operate it.

Curriculum · 4 modules · 11-13 hours

Module 01 is free. The rest unlocks with PRO.

Try the first 2-3 hours — write the REST-API ingestion DAG with pagination, retry, and idempotent UPSERT into staging.orders. If the rhythm clicks, upgrade to unlock the orchestration patterns, dbt medallion, and monitoring modules.

P21 · 11-13 hours · 4 modules
Free preview · PRO required
Module 01 is free — no card required. Get the ingestion DAG and idempotent UPSERT pattern under your fingers before paying.
M01
Ingest API data reliably: pagination + idempotent UPSERT
Write `ingest_orders_api.py`, an Airflow DAG using the TaskFlow API and HttpSensor. Page-based pagination loop on a REST API. Retry/backoff with `retries=3` and a 5-minute `retry_delay`. Idempotent `ON CONFLICT (order_id) DO UPDATE` UPSERT into staging.orders. Connection management for HTTP + Postgres. The canonical ingestion pattern.
2-3h · 9 lessons · FREE PREVIEW
Start →
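A minimal sketch of the page-based loop and retry/backoff idea from Module 01, written as plain Python so it runs without Airflow or a live API. The response shape (`results`, `has_next`), the injected `fetch_page` callable, and every function name here are assumptions for illustration, not the project's actual code. In the DAG itself, task-level retries would come from Airflow's own `retries` / `retry_delay` settings.

```python
import time
from typing import Callable, Iterator


def with_retries(call: Callable[[], dict], retries: int = 3,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call `call`, retrying on ConnectionError with doubling delays."""
    for attempt in range(retries + 1):
        try:
            return call()
        except ConnectionError:
            if attempt == retries:
                raise  # retry budget exhausted: let the caller fail
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable: the loop returns or raises")


def fetch_all_orders(fetch_page: Callable[[int], dict],
                     sleep: Callable[[float], None] = time.sleep) -> Iterator[dict]:
    """Yield every order, walking pages until the API reports no more."""
    page = 1
    while True:
        payload = with_retries(lambda: fetch_page(page), sleep=sleep)
        yield from payload["results"]
        if not payload.get("has_next"):
            break
        page += 1


# Fake two-page API (hypothetical data) whose first call fails once.
PAGES = {
    1: {"results": [{"order_id": 1}, {"order_id": 2}], "has_next": True},
    2: {"results": [{"order_id": 3}], "has_next": False},
}
calls = {"n": 0}


def fake_fetch_page(page: int) -> dict:
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("transient network error")
    return PAGES[page]


delays = []  # record requested sleeps instead of actually sleeping
orders = list(fetch_all_orders(fake_fetch_page, sleep=delays.append))
print([o["order_id"] for o in orders], delays)
```

Injecting `fetch_page` and `sleep` keeps the loop testable: the same function can be pointed at a real HTTP client later without changing the pagination logic.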
M02
Orchestrate multi-step workflows: TaskGroups + dynamic mapping
Build `customer_orders_pipeline.py` and `multi_source_etl.py`. TaskGroups for DAG visualization. BranchPythonOperator for data-quality gates. Dynamic mapping with `expand()` for runtime task creation. Watermark/CDC pattern via `MAX(updated_at)`. ExternalTaskSensor (poke vs reschedule). pytest DagBag tests. on_failure_callback for alerting.
3-4h · 9 lessons · PRO TIER
Unlock with PRO →
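The watermark pattern named in Module 02 can be sketched as follows. This is an illustration under stated assumptions: SQLite (from the standard library) stands in for Postgres, and the table names `source_orders` / `staging_orders` and the function name are hypothetical. The idea is to read `MAX(updated_at)` from the target and pull only source rows newer than that.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_orders (order_id INTEGER PRIMARY KEY, updated_at TEXT);
    CREATE TABLE staging_orders (order_id INTEGER PRIMARY KEY, updated_at TEXT);
    INSERT INTO source_orders VALUES
        (1, '2024-01-01T00:00:00'),
        (2, '2024-01-02T00:00:00'),
        (3, '2024-01-03T00:00:00');
    INSERT INTO staging_orders VALUES (1, '2024-01-01T00:00:00');
""")


def extract_since_watermark(conn: sqlite3.Connection) -> list:
    """Return source rows newer than the newest row already staged."""
    # An empty target yields '' so that every source row qualifies.
    (watermark,) = conn.execute(
        "SELECT COALESCE(MAX(updated_at), '') FROM staging_orders"
    ).fetchone()
    return conn.execute(
        "SELECT order_id, updated_at FROM source_orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()


first_pull = extract_since_watermark(conn)
conn.executemany("INSERT INTO staging_orders VALUES (?, ?)", first_pull)
second_pull = extract_since_watermark(conn)  # nothing new after loading
print(first_pull, second_pull)
```

ISO-8601 timestamps stored as text compare correctly as strings, which is why the `>` comparison works here. With a real timestamp column in Postgres the same query shape applies.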
M03
Transform with dbt: bronze → silver → gold + tests
Stand up `streamcart_transforms/` dbt project with profiles.yml (dev/prod schemas). Build bronze (raw copy + dbt_loaded_at), silver (cleaned, deduped, incremental with unique_key), gold (business aggregates). dbt tests: built-in (unique, not_null, relationships) + custom assertions (assert_revenue_positive). Orchestrate from `dbt_medallion_pipeline.py`: bronze→test→silver→test→gold→test→docs.
4-5h · 9 lessons · PRO TIER
Unlock with PRO →
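One way to picture the bronze→test→silver→test→gold→test→docs ordering from Module 03 is as a list of dbt CLI invocations that stops at the first failure. This is a sketch, not the project's DAG: it assumes models are tagged by layer (`tag:bronze` and so on), and the helper names are hypothetical. In Airflow each command would more likely be its own task so that a failed test blocks only what is downstream.

```python
import subprocess
from typing import Callable, List

LAYERS = ["bronze", "silver", "gold"]


def build_stage_commands(layers: List[str]) -> List[List[str]]:
    """Interleave `dbt run` and `dbt test` per layer, then build docs."""
    commands = []
    for layer in layers:
        # Assumption: each model carries a tag named after its layer.
        commands.append(["dbt", "run", "--select", f"tag:{layer}"])
        commands.append(["dbt", "test", "--select", f"tag:{layer}"])
    commands.append(["dbt", "docs", "generate"])
    return commands


def run_gated(commands: List[List[str]],
              run: Callable[[List[str]], int]) -> None:
    """Run commands in order; raise on the first non-zero exit code."""
    for cmd in commands:
        if run(cmd) != 0:
            raise RuntimeError(f"quality gate failed: {' '.join(cmd)}")


def run_with_subprocess(cmd: List[str]) -> int:
    """Real runner: execute the command and return its exit code."""
    return subprocess.run(cmd, check=False).returncode


# Demo with a fake runner so the sketch runs without dbt installed.
attempted = []


def fake_run(cmd: List[str]) -> int:
    attempted.append(" ".join(cmd))
    silver_test = cmd[:2] == ["dbt", "test"] and cmd[-1] == "tag:silver"
    return 1 if silver_test else 0


failure = None
try:
    run_gated(build_stage_commands(LAYERS), fake_run)
except RuntimeError as exc:
    failure = str(exc)

print(attempted)
print(failure)
```

In the demo the silver test fails, so the gold and docs commands are never attempted. Swapping `fake_run` for `run_with_subprocess` would run the real CLI.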
M04
Monitor, alert & ship: SLA + freshness + runbook
SLA monitoring with `sla_miss_callback`. Data freshness checks (max-age validation against business cutoffs). on_failure_callback wiring (Slack / email). Production-readiness checklist (RBAC, secrets, retries, deployment paths to MWAA / Composer). Portfolio runbook + architecture diagram + DAG screenshots.
2-3h · 9 lessons · PRO TIER
Unlock with PRO →
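A max-age freshness check, as described for Module 04, reduces to comparing the newest load timestamp against an allowed age. The sketch below is a plain-Python illustration. The function name, the two-hour threshold, and the timestamps are assumptions, and in the pipeline the latest timestamp would come from a query such as a `MAX(...)` over a load-time column.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def check_freshness(latest_loaded_at: datetime, max_age: timedelta,
                    now: Optional[datetime] = None) -> timedelta:
    """Return the data's age, or raise if it exceeds `max_age`."""
    now = now or datetime.now(timezone.utc)
    age = now - latest_loaded_at
    if age > max_age:
        raise ValueError(f"stale data: age {age} exceeds max age {max_age}")
    return age


# Hypothetical 09:00 UTC check with a two-hour freshness budget.
now = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
max_age = timedelta(hours=2)

fresh_age = check_freshness(
    datetime(2024, 1, 1, 8, 30, tzinfo=timezone.utc), max_age, now=now
)

stale_error = None
try:
    check_freshness(
        datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc), max_age, now=now
    )
except ValueError as exc:
    stale_error = str(exc)

print(fresh_age, stale_error)
```

Raising on stale data lets the check fail its Airflow task, so the usual failure callback and retry settings apply without extra wiring.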
3 modules locked · Unlock all PRO content for $29/mo
Upgrade to PRO →
Backed by curriculum

Apache Airflow Production Patterns

13 modules · 205 hours · TaskFlow API · TaskGroups · Dynamic mapping · Sensors & SLAs · Datasets & CDC
Open curriculum

The deep-dive curriculum on every Airflow primitive used in this project’s 3 DAGs. PRO subscribers get full access to every module.

The build, in 3 phases

Three sprints. Three checkpoints. One production-ready pipeline.

Each phase ends with a runnable DAG and a tagged commit. No theory decks.

01 · ~2.5h
Ingestion + idempotent UPSERT

ingest_orders_api.py running locally. Idempotent re-runs land cleanly. audit.pipeline_runs tracks every execution.

  • ingest_orders_api.py (TaskFlow + HttpSensor)
  • staging.orders + audit.pipeline_runs
  • ON CONFLICT UPSERT + retries + backoff
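To see why an `ON CONFLICT` UPSERT makes re-runs land cleanly, here is a small runnable sketch. It uses SQLite from the standard library as a stand-in for Postgres (the `ON CONFLICT ... DO UPDATE` clause has the same shape in both). The table is named `staging_orders` because SQLite has no `staging` schema, and the columns and the `load` helper are assumptions for illustration.

```python
import sqlite3

UPSERT_SQL = """
INSERT INTO staging_orders (order_id, status, amount)
VALUES (:order_id, :status, :amount)
ON CONFLICT (order_id) DO UPDATE SET
    status = excluded.status,
    amount = excluded.amount
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staging_orders ("
    "order_id INTEGER PRIMARY KEY, status TEXT, amount REAL)"
)


def load(conn: sqlite3.Connection, rows: list) -> None:
    """Upsert a batch inside one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT_SQL, rows)


batch = [
    {"order_id": 1, "status": "paid", "amount": 20.0},
    {"order_id": 2, "status": "pending", "amount": 5.5},
]

load(conn, batch)
load(conn, batch)  # simulated re-run of the same DAG run: no duplicates
load(conn, [{"order_id": 2, "status": "paid", "amount": 5.5}])  # later update

rows = conn.execute(
    "SELECT order_id, status, amount FROM staging_orders ORDER BY order_id"
).fetchall()
print(rows)
```

The second `load` call changes nothing, and the third updates order 2 in place, so the table holds exactly one row per `order_id` no matter how many times a batch is replayed.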
02 · ~3.5h
Orchestration patterns

Multi-source ETL across API + Postgres + S3 mock. TaskGroups + dynamic mapping + branching + callbacks all wired and tested with pytest.

  • customer_orders_pipeline.py (watermark/CDC)
  • multi_source_etl.py (3-source parallel)
  • TaskGroups + expand() + on_failure_callback
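Branching in Airflow comes down to a callable that returns the `task_id` to follow, which is what `BranchPythonOperator` expects from its `python_callable`. The sketch below shows only that decision function in plain Python. The task ids (`skip_load`, `quarantine_batch`, `load_to_staging`) and the 1% null-rate threshold are hypothetical, not taken from the project.

```python
def choose_branch(row_count: int, null_order_ids: int,
                  max_null_rate: float = 0.01) -> str:
    """Pick the downstream task_id based on simple data-quality signals."""
    if row_count == 0:
        return "skip_load"  # nothing extracted: skip the load entirely
    if null_order_ids / row_count > max_null_rate:
        return "quarantine_batch"  # too many rows are missing their key
    return "load_to_staging"


decisions = [
    choose_branch(row_count=1000, null_order_ids=5),
    choose_branch(row_count=1000, null_order_ids=50),
    choose_branch(row_count=0, null_order_ids=0),
]
print(decisions)
```

Keeping the decision in a pure function means it can be unit-tested with pytest without loading a DAG, and the operator only has to pass in the counts.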
03 · ~5.5h
dbt medallion + monitoring

dbt project with bronze/silver/gold + tests + docs, orchestrated from dbt_medallion_pipeline.py. SLA + freshness + runbook complete.

  • streamcart_transforms/ (bronze→silver→gold + tests)
  • dbt_medallion_pipeline.py (quality-gated)
  • SLA + freshness + runbook + architecture diagram
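An `on_failure_callback` receives Airflow's task context as a single dict. A minimal sketch of turning that context into an alert message might look like this. The message format and the `send` parameter are assumptions, and the demo fakes the context with `SimpleNamespace` so it runs without Airflow.

```python
from types import SimpleNamespace
from typing import Callable


def build_failure_message(context: dict) -> str:
    """Format a one-line alert from the failed task's context."""
    ti = context["task_instance"]
    error = context.get("exception")
    return f"[FAILED] {ti.dag_id}.{ti.task_id} try {ti.try_number}: {error}"


def notify_on_failure(context: dict,
                      send: Callable[[str], None] = print) -> None:
    """Callback body: build the message and hand it to a sender."""
    # `send` defaults to print; a Slack or email sender could replace it.
    send(build_failure_message(context))


# Fake context with hypothetical ids, shaped like what a callback receives.
fake_context = {
    "task_instance": SimpleNamespace(
        dag_id="ingest_orders_api", task_id="load_orders", try_number=2
    ),
    "exception": RuntimeError("timeout"),
}

sent = []
notify_on_failure(fake_context, send=sent.append)
print(sent[0])
```

Separating message building from sending keeps the callback cheap to test and makes it easy to route the same message to logs first and a chat channel later.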
Project setup · 5 minutes

One stack. Airflow + Postgres + dbt — running in Docker.

Pre-configured docker-compose with Airflow (scheduler + webserver + Celery executor + Redis), Postgres, and the dbt project scaffolded. Seed data with 10K orders / 1K customers / 600 products / 100 audit records loads on first boot.

What lives in the repo

Everything you need to run all 3 production DAGs locally. The tutorial walks you through writing each DAG; the starter has working scaffolds, seed data, and the dbt project structure so you can boot fast and focus on the patterns.

  • docker-compose.yml — Airflow (scheduler + webserver + Celery + Redis) + Postgres
  • dags/streamcart/ — ingest_orders_api · customer_orders_pipeline · dbt_medallion_pipeline
  • streamcart_transforms/ — dbt project with bronze/silver/gold models + tests + docs
  • seed/ — 10K orders · 1K customers · 600 products · 100 audit records
  • tests/ — pytest DagBag tests (cycles, defaults, load errors)
  • requirements.txt + README.md — pinned versions + production-readiness checklist
Download · Starter Kit

Airflow + dbt Pipeline Starter Kit

Pre-configured docker-compose stack, the dbt project scaffolded, all 3 production DAG skeletons, seed data with 11.7K rows across 3 entities + audit log, and the pytest DagBag tests. Boot Airflow + Postgres + dbt in under 5 minutes.

363 KB · docker-compose · 3 DAG scaffolds · 11.7K seeded rows · PRO required
~/projects/modern-data-stack — zsh
1. Clone and start the stack
$ git clone https://github.com/ai-de/p21-modern-data-stack
$ cd p21-modern-data-stack && docker-compose up -d
2. Open Airflow UI + verify seed data
$ open http://localhost:8080 # admin / admin
$ docker exec -it postgres psql -d streamcart -c 'SELECT count(*) FROM staging.orders'
3. Trigger the ingestion DAG (Module 01)
$ docker exec airflow-scheduler airflow dags trigger ingest_orders_api
$ # → pagination loop runs, ON CONFLICT UPSERTs into staging.orders
4. Run the dbt medallion pipeline (Module 03)
$ docker exec airflow-scheduler airflow dags trigger dbt_medallion_pipeline
$ # → bronze → test → silver → test → gold → test → docs
5. Run the pytest DagBag suite
$ docker exec airflow-scheduler pytest tests/test_dags.py -v
$ # → DAG load errors, cycles, defaults all pass
10K orders · 1K customers · 600 products · 100 audit records
Production hardening

The same pipeline — but built for the production case.

Tutorials show you the happy path. Production breaks at the edges. Each row pairs the tutorial pattern with the upgrade you make when the DAG is actually on-call for revenue dashboards at 9am.

Area | Tutorial pattern (what you ship in this project) | Production upgrade (Module 02–04 + hardening)
Pagination | Page-based loop with manual `next` param | Cursor-based with idempotency keys + dead-letter queue for poisoned pages
Idempotency | `ON CONFLICT` UPSERT in Postgres | Schema evolution + soft-delete patterns + audit chain (every change traceable)
Retries | `retries=3` with exponential backoff | Circuit breaker + token-bucket rate limiting + budget guard against quota burn
Quality gate | BranchPythonOperator on test result | Quality gate as Airflow Dataset + downstream blocked-on-fail policy
Incremental | `WHERE updated_at > MAX(updated_at)` | dbt source freshness + late-arriving-data window + reconciliation job
Failure alert | `on_failure_callback` prints to logs | Alertmanager-style routing with dedup, severity, ack tracking, and escalation
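The "token-bucket rate limiting" upgrade in the Retries row can be sketched in a few lines. This is a generic illustration of the technique, not the project's implementation. The class name, the rate and capacity values, and the injectable clock are all assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start with a full bucket
        self.updated_at = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; never blocks."""
        now = self.clock()
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Demo with a fake clock so the outcome does not depend on real time.
fake_time = [0.0]
bucket = TokenBucket(rate=2.0, capacity=2.0, clock=lambda: fake_time[0])

burst = [bucket.try_acquire() for _ in range(3)]  # third call finds it empty
fake_time[0] = 0.5                                # half a second: one token back
after_wait = [bucket.try_acquire(), bucket.try_acquire()]
print(burst, after_wait)
```

A caller that gets `False` can sleep and retry, or push the page to a later run, which is how a limiter like this keeps a retrying DAG from burning through an API quota.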
PRO benefit · code review

Real review from senior engineers who’ve shipped this stack.

Submit your repo, get line-by-line feedback within 48 hours. The kind of review that's quietly worth thousands of dollars in time-to-DE.

4 reviews / month

Submit a repo, a PR, or a refactor proposal. Reviewer is matched to your domain — Airflow + dbt for this project. Async, comments inline, average turnaround 31 hours.

31h avg turnaround · 9.2/10 helpfulness · 94% return next month
2 office hours / month

Live 30-min sessions with a senior data engineer. Walk through your DAGs, debug an idempotency edge case, mock a system-design interview. Group sessions also available.

30 min per session · 2 / mo included · + group unlimited
What PRO unlocks

One subscription. 15+ projects, all curriculum, code review.

PRO is built for engineers who want production-grade builds and feedback loops — not more tutorials.

What you get | FREE | PRO | EXPERT
Projects (production-grade builds) | 2 | 15+ | 8
Curriculum modules (all 7 tracks) | Phase 1 only | All | All + bonus
Code review credits (senior engineer review) | 0 | 4 / month | Unlimited
Career path access (5 paths × full plans) | 1 path | All 5 | All 5 + 1:1
Certificate (verifiable on LinkedIn) | | Yes | Yes + portfolio review
Community (Discord + office hours) | Read-only | Full + 2/mo | Full + 4/mo
$29/mo
billed monthly · cancel anytime
or annual
$249/yr save 28%
Upgrade to PRO
Who this is for

Pick this if you’re shipping pipelines, not learning to.

Junior data engineers

You know SQL and Python; you want to ship the canonical Airflow + dbt pipeline so you can talk about it in interviews and on a portfolio.

Analysts moving into DE

You live in dbt + a warehouse but you've never owned the Airflow side. This bridges that gap end-to-end with real DAG patterns.

Backend devs adding DE

You can ship services; you want to see how data folks structure batch — DAGs, idempotency, quality gates, runbooks. The on-call rhythm.

Career-switchers / bootcampers

You've done a few notebooks. This is the smallest realistic production pipeline that fits in a portfolio and survives a system-design conversation.

FAQ

Quick answers.

No, and that name was misleading. There's no Fivetran/Airbyte, no Snowflake/BigQuery, no Looker/Hightouch, no metrics layer. This is the open-source Airflow + dbt + Postgres + Docker stack — the substrate every paid 'modern data stack' platform is built on. We renamed the page to be honest about that.
They were claimed in the layout metadata before; that was wrong, we cut it. The stack is intentionally local-Docker so you can ship the project without cloud credentials. Patterns transfer to MWAA / Composer / Astronomer / cloud-Postgres with config-only changes.
The catalog used to mention Metabase but it was never wired in tutorials. We cut it. The gold layer is described as the BI handoff but no specific tool is wired — once gold lands, plug Metabase / Superset / Looker / Tableau on top with a JDBC connection.
No. Everything runs in local Docker — Airflow (scheduler + webserver + Celery executor + Redis) + Postgres + dbt. The pytest DagBag tests run locally too.
P16 is API-side specifics — pagination, rate limits, retries, schema drift — on a single source. P21 (this) stitches Airflow + dbt + monitoring into an end-to-end pipeline across 4 parts. P16 is a slice; P21 is the whole.
All 4 modules of this project, all 15+ PRO projects, 4 code-review credits per month, 2 office-hours sessions, full curriculum across all 7 tracks, all 5 career paths, certificate of completion, and full community access. Cancel anytime.

Ready to ship a real production pipeline?

Start with module 01 — free, no card. About 2-3 hours. By the end you'll have a production-shaped REST-API ingestion DAG running locally in Docker, with pagination, retry, HttpSensor, and idempotent ON CONFLICT UPSERT into staging.orders.
