ai-de.net/Projects/P22 · IceLake Commerce — end-to-end Iceberg tour
PRO · module 01 free preview · Batch track · P22

The whole lakehouse stack in 13 hours, on Iceberg

A breadth-first tour through a complete Iceberg lakehouse: foundations, schema + partition evolution, multi-engine reads (Trino + DuckDB + PyIceberg), a real Debezium → Kafka → Flink CDC pipeline, compaction, and an intro to Feast + MLflow on Iceberg.

Timeline: 13-16 hours
Difficulty: Intermediate
Stack: Iceberg · Spark · Flink · Trino · CDC

The panoramic system-design tour — when an interviewer asks “walk me through a lakehouse,” this is the project that lets you do it without hand-waving any layer.

By the end you will have wired
  • Iceberg + Spark + REST catalog + MinIO running locally via Docker
  • Schema + partition evolution patterns on the same e-commerce orders table
  • Trino, DuckDB, and PyIceberg all reading the same Iceberg tables
  • PostgreSQL → Debezium → Kafka → Flink CDC pipeline into Bronze + Silver
  • Compaction (binpack + sort + Z-order) and Prometheus monitoring
  • Feast feature view + MLflow snapshot versioning on top of the lakehouse
PREREQ · Comfortable with SQL (CTEs, window functions), Python, and Docker. We recommend P04 Iceberg Lakehouse Foundations first if you want depth on internals before this breadth tour.
Architecture · icelake.commerce.* · 5 engines wired · CDC live
  • Source (transactional system): PostgreSQL (orders + customers · products, logical WAL) · Debezium (pgoutput)
  • Stream (change data capture): Kafka (dbserver1.public.*) · Flink CDC (exactly-once)
  • Iceberg (bronze · silver · features): bronze_orders (append · audit log) · silver_orders (MERGE · ROW_NUMBER) · features (Feast, point-in-time · MLflow) · WAP · time travel · compaction
  • Engines (5 engines · same tables): Spark (batch + stream) · Flink (writes + reads) · Trino (ad-hoc SQL) · DuckDB (local notebooks) · PyIceberg (Python scripts)
# Streaming MERGE (Module 03)
MERGE INTO silver_orders t USING (
SELECT * FROM cdc_orders QUALIFY
ROW_NUMBER() OVER ... = 1) s …
→ exactly-once dedup from CDC stream
# Multi-engine reads (Module 02)
trino > SELECT count(*) FROM silver_orders
duckdb > iceberg_scan('s3://lake/silver_orders')
pyiceberg > tbl.scan().to_pandas()
→ 5 engines, same tables, predicate pushdown
5 engines wired · 4 layers (CDC → ML) · 13h end to end
Why a breadth tour matters

System-design rounds ask about the whole lakehouse, not one layer.

Most projects go deep in one tool. Senior+ interviews ask you to reason across the stack. This is the project that gives you firsthand context on each layer so you can compare trade-offs rather than guess at them.

Iceberg vs Hive

ACID, time travel, and hidden partitioning on object storage — without rewriting your warehouse.

Streaming vs batch

Flink for low-latency, Spark Structured Streaming for micro-batch. Same Iceberg sink, different latency budgets.

CDC + medallion

Debezium WAL → Kafka → Flink → Bronze (audit log) → Silver (dedup MERGE). The pattern Shopify and Instacart actually run.

Lakehouse + ML

Feast + MLflow on Iceberg snapshots means your training data is reproducible by snapshot ID, not a frozen CSV.

Curriculum · 4 modules · 13-16 hours

Module 01 is free. The rest unlocks with PRO.

Try the first 3 hours — set up the local stack, write your first ACID Iceberg table, explore the four-layer metadata model, and try a time-travel query. If it clicks, upgrade to unlock the streaming, multi-engine, and ML modules.

Module 01 is free — no card required. Get a feel for the stack before paying.
M01 · Iceberg foundations: 4-layer metadata + WAP branching
Stand up Spark + REST catalog + MinIO via Docker. Create your first ACID Iceberg table. Explore catalog → metadata.json → manifest list → manifest files. CRUD with MERGE INTO. Configure Copy-on-Write vs Merge-on-Read. Time travel via WAP branching.
3h · 8 lessons · FREE PREVIEW
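The snapshot, time-travel, and WAP (write-audit-publish) ideas in Module 01 can be sketched with a toy in-memory model. This is not PyIceberg or Spark code: `ToyTable` and its methods are hypothetical names invented for illustration. It only mimics the behaviour described above, where every commit creates an immutable snapshot, branches are named pointers to snapshots, and publishing is a fast-forward of `main`.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical stand-in for an Iceberg table: immutable snapshots plus named refs."""
    snapshots: dict = field(default_factory=dict)  # snapshot_id -> tuple of rows
    parents: dict = field(default_factory=dict)    # snapshot_id -> parent snapshot_id
    refs: dict = field(default_factory=lambda: {"main": None})
    _next_id: int = 1

    def append(self, rows, branch="main"):
        """Commit a new snapshot on `branch`: the parent's rows plus `rows`."""
        parent = self.refs.get(branch)
        snapshot_id = self._next_id
        self._next_id += 1
        self.snapshots[snapshot_id] = self.snapshots.get(parent, ()) + tuple(rows)
        self.parents[snapshot_id] = parent
        self.refs[branch] = snapshot_id
        return snapshot_id

    def create_branch(self, name, from_branch="main"):
        """A branch is just another pointer to an existing snapshot."""
        self.refs[name] = self.refs[from_branch]

    def scan(self, branch="main", snapshot_id=None):
        """Read a branch head, or time travel to an explicit snapshot id."""
        target = snapshot_id if snapshot_id is not None else self.refs[branch]
        return list(self.snapshots.get(target, ()))

    def publish(self, branch, into="main"):
        """Fast-forward `into` to `branch`; refuse if `into` moved in the meantime."""
        cursor = self.refs[branch]
        while cursor != self.refs[into]:
            if cursor is None:
                raise ValueError(f"{into} is not an ancestor of {branch}")
            cursor = self.parents[cursor]
        self.refs[into] = self.refs[branch]


table = ToyTable()
s1 = table.append([{"order_id": 1, "status": "NEW"}])
table.create_branch("audit")                                   # write
s2 = table.append([{"order_id": 2, "status": "NEW"}], branch="audit")
staged = table.scan(branch="audit")                            # audit
visible_before = table.scan()                                  # main readers see only s1
table.publish("audit")                                         # publish
visible_after = table.scan()
as_of_s1 = table.scan(snapshot_id=s1)                          # time travel
print(len(visible_before), len(visible_after), len(as_of_s1))
```

Readers of `main` never observe the staged rows until `publish` runs, and the old snapshot stays queryable afterwards, which is the property the time-travel lesson relies on.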
M02 · Evolution + multi-engine reads (Trino + DuckDB + PyIceberg)
Schema evolution with column-ID preservation. Partition evolution (days → hours) without rewrite. Hidden partitioning transforms (hours, bucket, truncate). Connect Trino, DuckDB, and PyIceberg to the same tables and compare predicate pushdown.
3h · 9 lessons · PRO TIER
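To make the hidden-partitioning transforms in Module 02 concrete, here is a minimal pure-Python sketch of the `days`, `hours`, and `truncate` transforms as the Iceberg spec describes them. The function names are hypothetical and none of this is an engine API. The `bucket` transform is left out because it depends on a specific hash function.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def days_transform(ts: datetime) -> int:
    """Whole days since 1970-01-01 UTC (ts must be timezone-aware)."""
    return (ts - EPOCH).days


def hours_transform(ts: datetime) -> int:
    """Whole hours since 1970-01-01T00:00 UTC."""
    delta = ts - EPOCH
    return delta.days * 24 + delta.seconds // 3600


def truncate_int(value: int, width: int) -> int:
    """Round down to a multiple of width; Python's % is already a floor-mod."""
    return value - (value % width)


def truncate_str(value: str, width: int) -> str:
    """Keep the first `width` characters."""
    return value[:width]


def partition_value(ts: datetime, transform: str) -> int:
    """Partition value for an order timestamp under a given spec ('days' or 'hours')."""
    return {"days": days_transform, "hours": hours_transform}[transform](ts)


order_ts = datetime(1970, 1, 2, 3, 30, tzinfo=timezone.utc)
old_spec_value = partition_value(order_ts, "days")    # files written before evolution
new_spec_value = partition_value(order_ts, "hours")   # files written after evolution
print(old_spec_value, new_spec_value)
```

Because the partition value is derived from the column rather than stored as a user-managed column, a days → hours change only affects how new files are laid out. Existing files keep the spec they were written with, which is why no rewrite is needed.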
M03 · Streaming + CDC: Debezium → Kafka → Flink → Bronze/Silver
Capture PostgreSQL changes via Debezium (pgoutput), buffer in Kafka, write to Iceberg with Flink (exactly-once). Spark Structured Streaming as the alternative. MERGE de-dup with ROW_NUMBER. Compaction (binpack / sort / Z-order). Prometheus + Iceberg exporter.
4h · 12 lessons · PRO TIER
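The MERGE de-dup with ROW_NUMBER mentioned in Module 03 can be sketched in plain Python to show the semantics. This is not the course's SQL or Flink job. The `lsn` ordering column and the row shape are assumptions made for illustration, while the `op` codes (`c`, `u`, `d`) follow Debezium's change-event convention.

```python
def latest_per_key(changes, key="order_id", order_by="lsn"):
    """Keep one row per key: the equivalent of
    ROW_NUMBER() OVER (PARTITION BY key ORDER BY order_by DESC) = 1."""
    latest = {}
    for change in changes:
        current = latest.get(change[key])
        if current is None or change[order_by] > current[order_by]:
            latest[change[key]] = change
    return list(latest.values())


def merge_into(target, source, key="order_id"):
    """Apply deduplicated changes to a copy of target: delete on 'd', upsert otherwise."""
    merged = dict(target)
    for row in source:
        if row["op"] == "d":
            merged.pop(row[key], None)
        else:
            merged[row[key]] = {key: row[key], "status": row["status"]}
    return merged


silver = {1: {"order_id": 1, "status": "NEW"}}
cdc_batch = [
    {"order_id": 1, "status": "PAID", "op": "u", "lsn": 10},
    {"order_id": 1, "status": "SHIPPED", "op": "u", "lsn": 12},
    {"order_id": 2, "status": "NEW", "op": "c", "lsn": 11},
    {"order_id": 2, "status": "NEW", "op": "c", "lsn": 11},  # redelivered duplicate
    {"order_id": 3, "status": "NEW", "op": "c", "lsn": 13},
    {"order_id": 3, "status": "NEW", "op": "d", "lsn": 14},  # created then deleted
]

once = merge_into(silver, latest_per_key(cdc_batch))
twice = merge_into(once, latest_per_key(cdc_batch))  # replaying the batch changes nothing
print(once)
```

Replaying the same batch leaves the Silver state unchanged. That idempotence is what lets the Bronze audit log keep every event while Silver stays correct even if the stream redelivers.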
M04 · ML edges: Feast feature views + MLflow snapshot versioning
Define a Feast feature view backed by Iceberg with point-in-time correct retrieval. Log MLflow training runs with snapshot_id so the dataset is reproducible. Sketch a 500K events/sec system design and a 4-tier storage cost model. Deploy the full stack with docker-compose.
3h · 10 lessons · PRO TIER
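The 500K events/sec design exercise usually starts with back-of-envelope arithmetic, which might look like the sketch below. Only the 500K/sec rate and the idea of four storage tiers come from the module description. The event size, tier names, retention windows, and prices are placeholder assumptions, not figures from the course or from any cloud provider.

```python
EVENTS_PER_SEC = 500_000   # from the module description
AVG_EVENT_BYTES = 200      # assumption: average stored size per event
SECONDS_PER_DAY = 86_400

# Hypothetical 4-tier model: (tier name, days of data held in tier, USD per GB-month).
# All numbers are placeholders chosen for easy arithmetic.
TIERS = [
    ("hot", 7, 0.02),
    ("warm", 23, 0.01),
    ("cold", 60, 0.005),
    ("archive", 275, 0.001),
]

ingest_mb_per_sec = EVENTS_PER_SEC * AVG_EVENT_BYTES / 1e6
daily_bytes = EVENTS_PER_SEC * AVG_EVENT_BYTES * SECONDS_PER_DAY
daily_gb = daily_bytes / 1e9


def monthly_storage_cost(daily_gb, tiers):
    """Steady-state monthly cost: each tier holds (days in tier) x (GB per day)."""
    return sum(days * daily_gb * price for _, days, price in tiers)


print(f"ingest: {ingest_mb_per_sec:.0f} MB/s")
print(f"daily volume: {daily_gb:.0f} GB")
print(f"monthly storage: ${monthly_storage_cost(daily_gb, TIERS):,.2f}")
```

Swapping in measured event sizes and real price sheets changes the totals, but the structure stays the same: rate × size × time gives volume, and volume × retention × price gives cost.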
3 modules locked · Unlock all PRO content for $29/mo
Upgrade to PRO →
The build, in 3 phases

Three sprints. Three checkpoints. One end-to-end stack.

Each phase is a runnable artifact, not a theory deck. Tagged commits at every checkpoint.

01 · ~6h
Iceberg + multi-engine on the same tables

ACID Iceberg with WAP branching (Module 01). Schema + partition evolution and multi-engine reads from Trino, DuckDB, PyIceberg (Module 02).

  • Iceberg + Nessie/REST + MinIO stack
  • WAP branch + time-travel query
  • Trino + DuckDB + PyIceberg reading the same orders table
02 · ~4h
Streaming CDC into Bronze/Silver

Debezium captures PostgreSQL WAL, Kafka buffers, Flink writes to Iceberg. Bronze audit log + Silver MERGE dedup. Compaction running on a schedule.

  • PG → Debezium → Kafka → Flink → Iceberg
  • Bronze + Silver medallion via MERGE
  • Compaction (binpack / sort / Z-order)
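As a rough illustration of what binpack compaction does, the sketch below groups small files into rewrite groups with a first-fit-decreasing heuristic. It is a toy planner, not Iceberg's `rewrite_data_files` implementation. The function name, the 512 MB target, and the file sizes are assumptions for the example.

```python
def plan_binpack(file_sizes_mb, target_mb=512):
    """Group files smaller than target_mb into bins whose total stays within target_mb.

    Uses first-fit decreasing. Bins with a single file are dropped because
    rewriting one file on its own would not reduce the file count.
    """
    candidates = sorted((s for s in file_sizes_mb if s < target_mb), reverse=True)
    bins = []
    for size in candidates:
        for group in bins:
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            bins.append([size])  # no existing bin had room
    return [group for group in bins if len(group) > 1]


sizes = [600, 300, 200, 120, 64, 32, 16, 8]   # hypothetical data file sizes in MB
groups = plan_binpack(sizes)
files_before = len(sizes)
files_after = files_before - sum(len(g) for g in groups) + len(groups)
print(groups, files_before, files_after)
```

Binpack only merges files. The sort and Z-order strategies additionally reorder rows inside the rewritten files so that related values end up in the same file, which costs more to run but improves data skipping.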
03 · ~3h
ML edges + full deployment

Feast feature view + MLflow snapshot versioning + monitoring. System-design write-up for 500K/sec scale. Deploy the full stack via docker-compose.

  • Feast feature view on Iceberg
  • MLflow runs with snapshot_id
  • docker-compose.lakehouse.yml deploy
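Point-in-time correct retrieval, the property the Feast feature view is meant to provide, can be sketched without Feast at all. The helper below is hypothetical and ignores details such as feature TTLs. It only shows the rule that a training row may see feature values with a timestamp at or before the label's timestamp.

```python
from bisect import bisect_right


def point_in_time_lookup(feature_history, entity_id, as_of):
    """Return the latest feature value with timestamp <= as_of, or None if none exists.

    feature_history maps entity_id -> list of (timestamp, value) sorted by timestamp.
    """
    history = feature_history.get(entity_id, [])
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx else None


# Hypothetical feature: running order count per customer, keyed by integer timestamps.
order_count_history = {"c1": [(1, 0), (5, 2), (9, 3)]}

# Label events: (customer, timestamp at which the label was observed).
label_events = [("c1", 4), ("c1", 5), ("c1", 0), ("c2", 7)]

training_rows = [
    (customer, ts, point_in_time_lookup(order_count_history, customer, ts))
    for customer, ts in label_events
]
print(training_rows)
```

A naive join on the latest feature value would give every row the count 3 and leak future information into training. Pairing this lookup with a logged snapshot ID is what makes the resulting dataset reproducible.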
Project setup · 10 minutes

One stack. Iceberg + five engines + CDC + Feast.

The starter kit ships a complete docker-compose stack — MinIO, REST catalog, Spark, Trino, Kafka, Debezium, PostgreSQL with logical WAL, Prometheus, and Grafana — wired and seeded.

What lives in the repo

Everything you need to run the breadth tour locally — including the seed PostgreSQL database with WAL pre-configured for Debezium and a Java Flink CDC reference job.

  • docker-compose.lakehouse.yml — Spark, Trino, Kafka, Debezium, PostgreSQL, Prometheus, Grafana
  • icelake/ — Python module with table creators, MERGE helpers, snapshot utils
  • postgres-cdc/ — seed schema + Debezium connector config (pgoutput)
  • flink-cdc-job/ — Java reference job for streaming Kafka → Iceberg
  • trino/ + grafana/ — catalog configs + dashboards for table health
  • notebooks/ — Jupyter notebooks for PyIceberg + DuckDB + Feast walkthroughs
Download · Starter Kit

IceLake Commerce Starter Kit

Pre-built lakehouse stack with seeded PostgreSQL, Trino + Grafana configs, the Java Flink CDC job, and Jupyter notebooks for the multi-engine + Feast walkthroughs.

Pro project · pre-built lakehouse stack · sample data included
~/projects/icelake-commerce — zsh
1. Clone and start the lakehouse
$ git clone https://github.com/ai-de/p22-icelake-commerce
$ cd p22-icelake-commerce && docker-compose -f docker-compose.lakehouse.yml up -d
2. Seed Iceberg tables (orders, customers, products, events)
$ docker exec icelake-spark python /home/iceberg/scripts/seed_iceberg.py
3. Start the Debezium CDC connector (PostgreSQL → Kafka)
$ curl -X POST http://localhost:8083/connectors -d @postgres-cdc/connector.json -H 'Content-Type: application/json'
4. Submit the Flink CDC job (Kafka → Iceberg silver)
$ docker exec flink-jobmanager flink run /opt/flink/jobs/flink-cdc-silver.jar
5. Query the same table from 3 engines
$ trino --catalog iceberg --schema icelake -e 'SELECT count(*) FROM silver_orders'
$ duckdb -c "SELECT * FROM iceberg_scan('s3://lake/silver_orders')"
$ python -c 'from pyiceberg.catalog import load_catalog; print(load_catalog("rest").load_table("icelake.silver_orders").scan().to_pandas())'
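Step 3 above posts `postgres-cdc/connector.json` to Kafka Connect. The starter kit's actual file is not shown here, so the following is only a sketch of what such a file might contain, generated from Python. The property keys are standard Debezium PostgreSQL connector settings (`topic.prefix` is the Debezium 2.x name; 1.x used `database.server.name`). The connector name, host, credentials, database name, and slot name are hypothetical, while the `dbserver1` prefix and the three tables come from the architecture summary on this page.

```python
import json

connector = {
    "name": "icelake-postgres-cdc",            # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",             # logical decoding plugin named on this page
        "database.hostname": "postgres",       # hypothetical service name
        "database.port": "5432",
        "database.user": "postgres",           # hypothetical credentials
        "database.password": "postgres",
        "database.dbname": "commerce",         # hypothetical database name
        "topic.prefix": "dbserver1",
        "table.include.list": "public.orders,public.customers,public.products",
        "slot.name": "icelake_slot",           # hypothetical replication slot
    },
}


def expected_topics(config):
    """Debezium names topics <topic.prefix>.<schema>.<table>."""
    prefix = config["topic.prefix"]
    return [f"{prefix}.{table}" for table in config["table.include.list"].split(",")]


payload = json.dumps(connector, indent=2)
topics = expected_topics(connector["config"])
print(payload)
print(topics)
```

The derived topic names match the `dbserver1.public.*` pattern that the Flink job reads from.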
Seed data: 1k customers · 500 products · 10k orders · 15k events
What changes vs a single-engine warehouse

The same data — readable by five engines, hardened for production patterns.

Most warehouses lock you to one engine. Hive locks you to one writer. The patterns shown here (WAP, MERGE, hidden partitioning, snapshot isolation) are what unlock the multi-engine and streaming story without breaking readers.

Hive / single-engine version · What you have today
  • Engines: Spark only; readers blocked during writes
  • Schema changes: Drop & recreate, downstream breaks
  • Partition strategy: Hardcoded; never changes
  • CDC ingestion: Hourly batch from a snapshot copy
  • De-duplication: DISTINCT after the fact, brittle
  • Small files: Accumulate; query gets slower over time
Your IceLake version · Module 01–03
  • Engines: Trino, DuckDB, PyIceberg, Spark, Flink, all on the same Iceberg tables
  • Schema changes: ALTER TABLE with column-ID preservation; readers don’t notice
  • Partition strategy: ADD PARTITION FIELD; evolve days→hours without rewrite
  • CDC ingestion: Debezium WAL → Kafka → Flink, exactly-once into bronze + silver
  • De-duplication: MERGE INTO ... WHEN MATCHED with ROW_NUMBER window
  • Small files: Scheduled rewrite_data_files with zorder sort
PRO benefit · code review

Real review from senior engineers who shipped this stack.

Submit your repo, get line-by-line feedback within 48 hours. The kind of review that's quietly worth thousands of dollars in time-to-staff.

CR

4 reviews / month

Submit a repo, a PR, or a refactor proposal. Reviewer is matched to your domain — Iceberg + streaming for this project. Async, comments inline, average turnaround 31 hours.

31h avg turnaround · 9.2/10 helpfulness · 94% return next month
OH

2 office hours / month

Live 30-min sessions with a senior data engineer. Architecture questions, whiteboard a tricky migration, mock a system-design interview. Group sessions also available.

30 min per session · 2 / mo included · + group unlimited
What PRO unlocks

One subscription. 15+ projects, all curriculum, code review.

PRO is built for engineers who want production-grade builds and feedback loops — not more tutorials.

What you get · FREE · PRO · EXPERT
  • Projects (production-grade builds): FREE 2 · PRO 15+ · EXPERT 8
  • Curriculum modules (all 7 tracks): FREE Phase 1 only · PRO All · EXPERT All + bonus
  • Code review credits (senior engineer review): FREE 0 · PRO 4 / month · EXPERT Unlimited
  • Career path access (5 paths × full plans): FREE 1 path · PRO All 5 · EXPERT All 5 + 1:1
  • Certificate (verifiable on LinkedIn): PRO Yes · EXPERT Yes + portfolio review
  • Community (Discord + office hours): FREE Read-only · PRO Full + 2/mo · EXPERT Full + 4/mo
$29/mo · billed monthly · cancel anytime
or annual: $249/yr · save 28%
Upgrade to PRO
Who this is for

Pick this if you want the whole map, not one corner of it.

DE

Data engineers prepping interviews

You can talk Spark or you can talk Kafka, but the whole-stack system-design question still trips you up. This gives you firsthand reps across every layer.

AE

Analytics engineers going deeper

You live in dbt + Snowflake. You want to understand what the lakehouse actually is and why your platform team keeps mentioning Iceberg in roadmap reviews.

PE

Platform engineers comparing options

You're evaluating Iceberg vs Delta vs warehouse-only. This is the fastest way to wire the alternatives and form an opinion based on real usage.

ML

ML engineers needing data context

You write models but feature pipelines are someone else's headache. Module 04 introduces Feast + MLflow on top of a lakehouse you actually built.

FAQ

Quick answers.

P04 goes deep on the foundations + ops (20-26h on internals, MERGE, compaction, snapshot expiry). P22 (this) goes wide across the stack (13h covering streaming + multi-engine + ML edges). Pick P04 if you want depth on operating Iceberg; pick P22 if you want the panoramic system-design tour. Most learners do both.
Built. Module 03 wires PostgreSQL → Debezium → Kafka → Flink → Iceberg with real Java + SQL. Module 04 has actual Feast feature views and MLflow runs logging snapshot IDs. Depth is intentionally tour-level, not specialist — but it's executable code, not slides.
No — Module 04 has it as a system-design write-up (the kind of thing you'd whiteboard in an interview). The deployed scale is the seeded dataset (1k customers / 500 products / 10k orders / 15k events). The 500K/sec piece is design + costing, not load tested.
No. MinIO stands in for S3 and the whole stack runs locally via docker-compose. The patterns transfer to S3 + Glue (or S3 Tables) with config changes only.
All 15+ PRO projects, 4 code-review credits per month, 2 office-hours sessions, full curriculum across all 7 tracks, all 5 career paths, certificate of completion, and full community access. Cancel anytime.
Especially the system-design rounds. After this you can whiteboard a CDC pipeline end-to-end, defend a multi-engine architecture, explain when to reach for Flink vs Spark Streaming, and answer the inevitable 'what about ML on top?' follow-up without hand-waving.

Ready to wire the whole lakehouse?

Start with module 01 — free, no card. About 3 hours. By the end you'll have Iceberg + Spark + REST catalog running locally with your first ACID table written and the four-layer metadata model walked through.

P22 · IceLake Commerce · PRO · module 01 free · Upgrade to PRO →