The whole lakehouse stack in 13 hours, on Iceberg
A breadth-first tour through a complete Iceberg lakehouse: foundations, schema + partition evolution, multi-engine reads (Trino + DuckDB + PyIceberg), a real Debezium → Kafka → Flink CDC pipeline, compaction, and an intro to Feast + MLflow on Iceberg.
The panoramic system-design tour — when an interviewer asks “walk me through a lakehouse,” this is the project that lets you do it without hand-waving any layer.
- Iceberg + Spark + REST catalog + MinIO running locally via Docker
- Schema + partition evolution patterns on the same e-commerce orders table
- Trino, DuckDB, and PyIceberg all reading the same Iceberg tables
- PostgreSQL → Debezium → Kafka → Flink CDC pipeline into Bronze + Silver
- Compaction (binpack + sort + Z-order) and Prometheus monitoring
- Feast feature view + MLflow snapshot versioning on top of the lakehouse
System-design rounds ask about the whole lakehouse, not one layer.
Most projects go deep in one tool. Senior+ interviews ask you to reason across the stack. This is the project that gives you firsthand context on each layer so you can compare trade-offs rather than guess at them.
Iceberg vs Hive
ACID, time travel, and hidden partitioning on object storage — without rewriting your warehouse.
Streaming vs batch
Flink for low latency, Spark Structured Streaming for micro-batch. Same Iceberg sink, different latency budgets.
CDC + medallion
Debezium WAL → Kafka → Flink → Bronze (audit log) → Silver (dedup MERGE). The pattern Shopify and Instacart actually run.
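To make the Bronze-to-Silver step concrete, here is a minimal pure-Python sketch of the dedup-then-merge idea: keep only the newest change per key (what a ROW_NUMBER window ordered by log position would select), then upsert or delete in Silver. The `ChangeEvent` shape, the `lsn` field, and both helper names are assumptions made for illustration; the `c`/`u`/`d` op codes follow Debezium's convention. In the real pipeline this logic would live in a MERGE statement against Iceberg rather than in a dict.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical, simplified CDC record as it might land in Bronze."""
    order_id: int
    op: str                 # "c" create, "u" update, "d" delete
    lsn: int                # log position; higher means newer
    status: Optional[str]   # row payload, None for deletes


def latest_per_key(events: Iterable[ChangeEvent]) -> Dict[int, ChangeEvent]:
    """Keep the newest event per order_id (ROW_NUMBER ... = 1 equivalent)."""
    latest: Dict[int, ChangeEvent] = {}
    for ev in events:
        current = latest.get(ev.order_id)
        if current is None or ev.lsn > current.lsn:
            latest[ev.order_id] = ev
    return latest


def merge_into_silver(silver: Dict[int, str],
                      bronze_batch: Iterable[ChangeEvent]) -> Dict[int, str]:
    """Apply the deduplicated batch: delete on "d", upsert otherwise."""
    for key, ev in latest_per_key(bronze_batch).items():
        if ev.op == "d":
            silver.pop(key, None)
        else:
            silver[key] = ev.status
    return silver


bronze = [
    ChangeEvent(1, "c", 10, "placed"),
    ChangeEvent(1, "u", 12, "shipped"),
    ChangeEvent(2, "c", 11, "placed"),
    ChangeEvent(2, "d", 13, None),
    ChangeEvent(1, "u", 11, "paid"),   # arrives late, but is older than lsn 12
]
print(merge_into_silver({}, bronze))   # {1: 'shipped'}
```

Bronze keeps all five events as an audit log; Silver ends up with one row per live key, and the out-of-order update does not overwrite the newer state.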
Lakehouse + ML
Feast + MLflow on Iceberg snapshots means your training data is reproducible by snapshot ID, not a frozen CSV.
Module 01 is free. The rest unlocks with PRO.
Try the first 3 hours — set up the local stack, write your first ACID Iceberg table, explore the four-layer metadata model, and try a time-travel query. If it clicks, upgrade to unlock the streaming, multi-engine, and ML modules.
Three sprints. Three checkpoints. One end-to-end stack.
Each phase is a runnable artifact, not a theory deck. Tagged commits at every checkpoint.
ACID Iceberg with WAP branching (Module 01). Schema + partition evolution and multi-engine reads from Trino, DuckDB, PyIceberg (Module 02).
- ✓ Iceberg + Nessie/REST + MinIO stack
- ✓ WAP branch + time-travel query
- ✓ Trino + DuckDB + PyIceberg reading the same orders table
Debezium captures PostgreSQL WAL, Kafka buffers, Flink writes to Iceberg. Bronze audit log + Silver MERGE dedup. Compaction running on a schedule.
- ✓ PG → Debezium → Kafka → Flink → Iceberg
- ✓ Bronze + Silver medallion via MERGE
- ✓ Compaction (binpack / sort / Z-order)
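As a rough intuition for what binpack compaction plans, the sketch below groups small files into rewrite groups that stay at or under a target size, using a first-fit-decreasing heuristic. This is an illustration only: Iceberg's own planner has its own rules and options, and the function name `plan_binpack` and the 512 MB target are assumptions chosen for the example.

```python
from typing import List


def plan_binpack(file_sizes_mb: List[int], target_mb: int = 512) -> List[List[int]]:
    """Group file sizes into bins whose totals do not exceed target_mb.

    First-fit decreasing: place the largest files first, each into the
    first group that still has room, opening a new group when none fits.
    """
    groups: List[List[int]] = []
    for size in sorted(file_sizes_mb, reverse=True):
        for group in groups:
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            # No existing group had room for this file.
            groups.append([size])
    return groups


# Five small files become two rewrite groups.
print(plan_binpack([300, 200, 200, 100, 12]))  # [[300, 200, 12], [200, 100]]
```

Each inner list stands for a set of small files that one rewrite task would combine into a single larger file, which is why streaming writers that emit many tiny files benefit from scheduled compaction.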
Feast feature view + MLflow snapshot versioning + monitoring. System-design write-up for 500K/sec scale. Deploy the full stack via docker-compose.
- ✓ Feast feature view on Iceberg
- ✓ MLflow runs with snapshot_id
- ✓ docker-compose.lakehouse.yml deploy
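The idea behind logging a snapshot_id with each training run can be sketched with a toy in-memory table. `ToyTable` is a hypothetical stand-in, not an Iceberg or MLflow API: it only shows that every append creates a new immutable snapshot and that a run which records its snapshot ID can re-read exactly the rows it trained on. Real Iceberg snapshot IDs are large integers rather than a simple counter.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ToyTable:
    """Hypothetical append-only table that keeps every snapshot."""
    _snapshots: Dict[int, Tuple[dict, ...]] = field(
        default_factory=lambda: {0: ()}   # snapshot 0 is the empty table
    )
    _current: int = 0

    def append(self, rows: List[dict]) -> int:
        """Commit rows as a new snapshot and return its ID."""
        previous = self._snapshots[self._current]
        self._current += 1
        self._snapshots[self._current] = previous + tuple(rows)
        return self._current

    def scan(self, snapshot_id: Optional[int] = None) -> List[dict]:
        """Read the current state, or the state as of snapshot_id."""
        sid = self._current if snapshot_id is None else snapshot_id
        return list(self._snapshots[sid])


orders = ToyTable()
sid = orders.append([{"order_id": 1, "amount": 10.0}])

# The training run stores the snapshot ID alongside its other parameters.
run_params = {"model": "baseline", "snapshot_id": sid}

# New data lands after training.
orders.append([{"order_id": 2, "amount": 99.0}])

print(len(orders.scan()))                           # 2
print(orders.scan(run_params["snapshot_id"]))       # [{'order_id': 1, 'amount': 10.0}]
```

Re-running the experiment later means scanning with the recorded ID instead of the current table state, so the training set does not drift as new orders arrive.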
One stack. Iceberg + five engines + CDC + Feast.
The starter kit ships a complete docker-compose stack — MinIO, REST catalog, Spark, Trino, Kafka, Debezium, PostgreSQL with logical WAL, Prometheus, and Grafana — wired and seeded.
What lives in the repo
Everything you need to run the breadth tour locally — including the seed PostgreSQL database with WAL pre-configured for Debezium and a Java Flink CDC reference job.
- docker-compose.lakehouse.yml — Spark, Trino, Kafka, Debezium, PostgreSQL, Prometheus, Grafana
- icelake/ — Python module with table creators, MERGE helpers, snapshot utils
- postgres-cdc/ — seed schema + Debezium connector config (pgoutput)
- flink-cdc-job/ — Java reference job for streaming Kafka → Iceberg
- trino/ + grafana/ — catalog configs + dashboards for table health
- notebooks/ — Jupyter notebooks for PyIceberg + DuckDB + Feast walkthroughs
IceLake Commerce Starter Kit
Pre-built lakehouse stack with seeded PostgreSQL, Trino + Grafana configs, the Java Flink CDC job, and Jupyter notebooks for the multi-engine + Feast walkthroughs.
The same data — readable by five engines, hardened for production patterns.
Most warehouses lock you to one engine. Hive locks you to one writer. The patterns shown here (WAP, MERGE, hidden partitioning, snapshot isolation) are what unlock the multi-engine + streaming story without breaking readers.
- ALTER TABLE with column-ID preservation: readers don’t notice
- ADD PARTITION FIELD: evolve days→hours without rewrite
- MERGE INTO ... WHEN MATCHED with ROW_NUMBER window
- rewrite_data_files with zorder sort
Real review from senior engineers who shipped this stack.
Submit your repo, get line-by-line feedback within 48 hours. The kind of review that's quietly worth thousands of dollars in time-to-staff.
4 reviews / month
Submit a repo, a PR, or a refactor proposal. Reviewer is matched to your domain — Iceberg + streaming for this project. Async, comments inline, average turnaround 31 hours.
2 office hours / month
Live 30-min sessions with a senior data engineer. Architecture questions, whiteboard a tricky migration, mock a system-design interview. Group sessions also available.
One subscription. 15+ projects, all curriculum, code review.
PRO is built for engineers who want production-grade builds and feedback loops — not more tutorials.
Pick this if you want the whole map, not one corner of it.
Data engineers prepping interviews
You can talk Spark or you can talk Kafka, but the whole-stack system-design question still trips you up. This gives you firsthand reps across every layer.
Analytics engineers going deeper
You live in dbt + Snowflake. You want to understand what the lakehouse actually is and why your platform team keeps mentioning Iceberg in roadmap reviews.
Platform engineers comparing options
You're evaluating Iceberg vs Delta vs warehouse-only. This is the fastest way to wire the alternatives and form an opinion based on real usage.
ML engineers needing data context
You write models but feature pipelines are someone else's headache. Module 04 introduces Feast + MLflow on top of a lakehouse you actually built.
Ready to wire the whole lakehouse?
Start with Module 01 (free, no card). It takes about 3 hours. By the end you'll have Iceberg + Spark + REST catalog running locally, your first ACID table written, and the four-layer metadata model walked through.