ai-de.net/Projects/P22 · IceLake Commerce — end-to-end Iceberg tour

PRO · module 01 free preview · Batch track · P22

The whole lakehouse stack in 13 hours, on Iceberg

A breadth-first tour through a complete Iceberg lakehouse: foundations, schema + partition evolution, multi-engine reads (Trino + DuckDB + PyIceberg), a real Debezium → Kafka → Flink CDC pipeline, compaction, and an intro to Feast + MLflow on Iceberg.

Timeline: 13-16 hours
Difficulty: Intermediate
Stack: Iceberg · Spark · Flink · Trino · CDC

See PRO benefits

The panoramic system-design tour — when an interviewer asks “walk me through a lakehouse,” this is the project that lets you do it without hand-waving any layer.

By the end you will have wired

Iceberg + Spark + REST catalog + MinIO running locally via Docker
Schema + partition evolution patterns on the same e-commerce orders table
Trino, DuckDB, and PyIceberg all reading the same Iceberg tables
PostgreSQL → Debezium → Kafka → Flink CDC pipeline into Bronze + Silver
Compaction (binpack + sort + Z-order) and Prometheus monitoring
Feast feature view + MLflow snapshot versioning on top of the lakehouse

PREREQ: Comfortable with SQL (CTEs, window functions), Python, and Docker. We recommend P04 Iceberg Lakehouse Foundations first if you want depth on internals before this breadth tour.

Architecture diagram: icelake.commerce.* · 5 engines wired · CDC live

Source (transactional system): PostgreSQL (orders + customers · products, logical WAL), captured by Debezium (pgoutput)
Stream (change data capture): Kafka topics dbserver1.public.*, consumed by Flink CDC (exactly-once)
Iceberg (bronze · silver · features; WAP · time travel · compaction): bronze_orders (append · audit log), silver_orders (MERGE · ROW_NUMBER), features (Feast · point-in-time · MLflow)
Engines (5 engines · same tables): Spark (batch + stream), Flink (writes + reads), Trino (ad-hoc SQL), DuckDB (local notebooks), PyIceberg (Python scripts)

# Streaming MERGE (Module 03)

MERGE INTO silver_orders t USING (
  SELECT * FROM cdc_orders QUALIFY
  ROW_NUMBER() OVER ... = 1) s …
→ exactly-once dedup from CDC stream

# Multi-engine reads (Module 02)

trino > SELECT count(*) FROM silver_orders
duckdb > iceberg_scan('s3://lake/silver_orders')
pyiceberg > tbl.scan().to_pandas()
→ 5 engines, same tables, predicate pushdown

5 engines wired · layers from CDC to ML · 13h end to end

Why a breadth tour matters

System-design rounds ask about the whole lakehouse, not one layer.

Most projects go deep in one tool. Senior+ interviews ask you to reason across the stack. This is the project that gives you firsthand context on each layer so you can compare trade-offs rather than guess at them.

Iceberg vs Hive

ACID, time travel, and hidden partitioning on object storage — without rewriting your warehouse.

Streaming vs batch

Flink for low-latency, Spark Structured Streaming for micro-batch. Same Iceberg sink, different latency budgets.

CDC + medallion

Debezium WAL → Kafka → Flink → Bronze (audit log) → Silver (dedup MERGE). The pattern Shopify and Instacart actually run.

Lakehouse + ML

Feast + MLflow on Iceberg snapshots means your training data is reproducible by snapshot ID, not a frozen CSV.

Curriculum · 4 modules · 13-16 hours

Module 01 is free. The rest unlocks with PRO.

Try the first 3 hours — set up the local stack, write your first ACID Iceberg table, explore the four-layer metadata model, and try a time-travel query. If it clicks, upgrade to unlock the streaming, multi-engine, and ML modules.

P22 · 13-16 hours · 4 modules

Legend: ✓ free preview · ⊘ PRO required

Module 01 is free — no card required. Get a feel for the stack before paying.

M01 ✓ Iceberg foundations: 4-layer metadata + WAP branching

Stand up Spark + REST catalog + MinIO via Docker. Create your first ACID Iceberg table. Explore catalog → metadata.json → manifest list → manifest files. CRUD with MERGE INTO. Configure Copy-on-Write vs Merge-on-Read. Time travel via WAP branching.
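The catalog → metadata.json → manifest list → manifest files chain can be hard to picture before you open the real files. Below is a toy in-memory model of those four layers, assuming made-up class names (`TableMetadata`, `Snapshot`, `Manifest`) and file names. It is not Iceberg's actual format, only a sketch of why time travel is cheap: an old snapshot still points at its own manifest list.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Manifest:
    """Manifest file: lists data files (real manifests also carry column stats)."""
    data_files: tuple


@dataclass(frozen=True)
class Snapshot:
    """A snapshot points at a manifest list, modelled here as a tuple of manifests."""
    snapshot_id: int
    manifest_list: tuple


@dataclass
class TableMetadata:
    """Stand-in for metadata.json: the snapshot log plus the current pointer."""
    snapshots: list = field(default_factory=list)
    current_snapshot_id: Optional[int] = None

    def append(self, new_files):
        # Reuse the previous manifests and add one new manifest for the new files.
        previous = self.snapshots[-1].manifest_list if self.snapshots else ()
        snap = Snapshot(len(self.snapshots) + 1, previous + (Manifest(tuple(new_files)),))
        self.snapshots.append(snap)
        self.current_snapshot_id = snap.snapshot_id
        return snap.snapshot_id

    def scan(self, snapshot_id=None):
        # Time travel: resolve an older snapshot instead of the current one.
        wanted = snapshot_id or self.current_snapshot_id
        snap = next(s for s in self.snapshots if s.snapshot_id == wanted)
        return [f for m in snap.manifest_list for f in m.data_files]


# The catalog layer maps a table name to its current metadata (hypothetical names).
catalog = {"icelake.orders": TableMetadata()}
orders = catalog["icelake.orders"]
first = orders.append(["orders-0001.parquet"])
second = orders.append(["orders-0002.parquet"])
print(orders.scan())                   # files visible at the current snapshot
print(orders.scan(snapshot_id=first))  # files visible at the first snapshot
```

Nothing is copied or rewritten when the second append lands, which is the property the module's time-travel lesson relies on.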

3h · 8 lessons · FREE PREVIEW

Start →

M02 ⊘ Evolution + multi-engine reads (Trino + DuckDB + PyIceberg)

Schema evolution with column-ID preservation. Partition evolution (days → hours) without rewrite. Hidden partitioning transforms (hours, bucket, truncate). Connect Trino, DuckDB, and PyIceberg to the same tables and compare predicate pushdown.
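Hidden partitioning means the partition value is derived from a column by a transform, so queries filter on the raw column and never name a partition. The sketch below reimplements three transforms in plain Python to show what they compute. The function names and sample values are illustrative assumptions. The bucket transform is left out because it depends on a specific hash function.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def days(ts: datetime) -> int:
    """Whole days since 1970-01-01 UTC."""
    return (ts - EPOCH).days


def hours(ts: datetime) -> int:
    """Whole hours since 1970-01-01 UTC."""
    return int((ts - EPOCH).total_seconds() // 3600)


def truncate(value, width: int):
    """Integers round down to a multiple of width; strings keep a prefix."""
    if isinstance(value, int):
        return value - (value % width)
    return value[:width]


# Hypothetical order timestamp and values, for illustration only.
order_ts = datetime(2024, 3, 15, 13, 45, tzinfo=timezone.utc)
print(days(order_ts), hours(order_ts))
print(truncate(1234, 100), truncate("SKU-99812", 3))
```

Evolving from days to hours changes which function is applied to new writes. Files written under the old spec keep their day-level values, which is why no rewrite is needed.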

3h · 9 lessons · PRO TIER

Unlock with PRO →

M03 ⊘ Streaming + CDC: Debezium → Kafka → Flink → Bronze/Silver

Capture PostgreSQL changes via Debezium (pgoutput), buffer in Kafka, write to Iceberg with Flink (exactly-once). Spark Structured Streaming as the alternative. MERGE de-dup with ROW_NUMBER. Compaction (binpack / sort / Z-order). Prometheus + Iceberg exporter.
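The MERGE de-dup step boils down to two moves: keep only the newest change per key, then upsert or delete. Here is a minimal pure-Python sketch of that logic, assuming flattened events with a hypothetical `order_id` field alongside Debezium-style `op`, `ts_ms`, and `after` fields. It illustrates the idea only and is not the course's Flink or Spark job.

```python
def latest_per_key(events):
    """Keep the newest change per order_id (the rows where ROW_NUMBER would be 1)."""
    latest = {}
    for ev in events:
        current = latest.get(ev["order_id"])
        if current is None or ev["ts_ms"] >= current["ts_ms"]:
            latest[ev["order_id"]] = ev
    return list(latest.values())


def merge_into(silver, changes):
    """Apply deduplicated changes: delete on 'd', otherwise upsert."""
    for ev in changes:
        if ev["op"] == "d":
            silver.pop(ev["order_id"], None)
        else:  # 'c' create, 'u' update, 'r' snapshot read
            silver[ev["order_id"]] = ev["after"]
    return silver


# Hypothetical CDC batch, including one replayed duplicate.
cdc_batch = [
    {"order_id": 1, "op": "c", "ts_ms": 100, "after": {"status": "placed"}},
    {"order_id": 1, "op": "u", "ts_ms": 200, "after": {"status": "shipped"}},
    {"order_id": 2, "op": "c", "ts_ms": 150, "after": {"status": "placed"}},
    {"order_id": 2, "op": "d", "ts_ms": 300, "after": None},
    {"order_id": 1, "op": "u", "ts_ms": 200, "after": {"status": "shipped"}},
]
silver = merge_into({}, latest_per_key(cdc_batch))
print(silver)
```

Applying the same batch a second time leaves `silver` unchanged. That idempotence is what makes replays from Kafka safe for the Silver table.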

4h · 12 lessons · PRO TIER

Unlock with PRO →

M04 ⊘ ML edges: Feast feature views + MLflow snapshot versioning

Define a Feast feature view backed by Iceberg with point-in-time correct retrieval. Log MLflow training runs with snapshot_id so the dataset is reproducible. Sketch a 500K events/sec system design and a 4-tier storage cost model. Deploy the full stack with docker-compose.
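"Point-in-time correct" means each training label only sees feature values that existed at or before the label's timestamp, so nothing leaks in from the future. A small sketch of that lookup follows, assuming hypothetical feature rows of `(customer_id, event_ts, value)`. A feature store performs this join for you; the code only shows the rule it applies.

```python
from bisect import bisect_right


def point_in_time_lookup(feature_rows, entity_id, as_of):
    """Return the latest feature value with event_ts <= as_of, or None."""
    history = sorted(
        (ts, value) for (eid, ts, value) in feature_rows if eid == entity_id
    )
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx else None


# (customer_id, event_ts, orders_last_30d): hypothetical feature rows
feature_rows = [
    ("c1", 10, 1),
    ("c1", 20, 3),
    ("c1", 30, 7),
]
# Training labels observed at different times (hypothetical)
labels = [("c1", 5), ("c1", 20), ("c1", 25), ("c1", 99)]
training_set = [
    (cid, ts, point_in_time_lookup(feature_rows, cid, ts)) for cid, ts in labels
]
print(training_set)
```

A label at time 25 gets the value written at 20, not the one written at 30. Pairing this with a logged snapshot ID pins both the join rule and the exact data it ran against.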

3h · 10 lessons · PRO TIER

Unlock with PRO →

3 modules locked · Unlock all PRO content for $29/mo

Upgrade to PRO →

The build, in 3 phases

Three sprints. Three checkpoints. One end-to-end stack.

Each phase is a runnable artifact, not a theory deck. Tagged commits at every checkpoint.

01 · ~6h

Iceberg + multi-engine on the same tables

ACID Iceberg with WAP branching (Module 01). Schema + partition evolution and multi-engine reads from Trino, DuckDB, PyIceberg (Module 02).

✓ Iceberg + Nessie/REST + MinIO stack
✓ WAP branch + time-travel query
✓ Trino + DuckDB + PyIceberg reading the same orders table
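The WAP (write-audit-publish) checkpoint above can be pictured as pointer moves between branches. The following is a toy model, assuming a made-up `Table` class, branch name, and audit rule. Real Iceberg branches are references to snapshots in table metadata; this only mimics that shape.

```python
class Table:
    """Toy table whose branches are named pointers to immutable snapshots."""

    def __init__(self):
        self.snapshots = {0: []}   # snapshot_id -> rows
        self.refs = {"main": 0}    # branch name -> snapshot_id

    def create_branch(self, name, from_ref="main"):
        self.refs[name] = self.refs[from_ref]

    def write(self, branch, rows):
        new_id = max(self.snapshots) + 1
        self.snapshots[new_id] = self.snapshots[self.refs[branch]] + rows
        self.refs[branch] = new_id

    def read(self, ref="main"):
        return self.snapshots[self.refs[ref]]

    def fast_forward(self, target, source):
        self.refs[target] = self.refs[source]


def audit(rows):
    """Hypothetical data-quality check: no negative order totals."""
    return all(r["total"] >= 0 for r in rows)


orders = Table()
orders.create_branch("audit_2024_03_15")
orders.write("audit_2024_03_15", [{"order_id": 1, "total": 42.0}])  # write
staged = orders.read("audit_2024_03_15")
visible_before = orders.read("main")  # readers of main still see the old state
if audit(staged):                     # audit
    orders.fast_forward("main", "audit_2024_03_15")  # publish
print(visible_before, orders.read("main"))
```

If the audit fails, `main` never moves and the staged rows stay invisible to readers.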

02 · ~4h

Streaming CDC into Bronze/Silver

Debezium captures PostgreSQL WAL, Kafka buffers, Flink writes to Iceberg. Bronze audit log + Silver MERGE dedup. Compaction running on a schedule.

✓ PG → Debezium → Kafka → Flink → Iceberg
✓ Bronze + Silver medallion via MERGE
✓ Compaction (binpack / sort / Z-order)
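Binpack compaction groups many small files into fewer files near a target size, without reordering rows. The sketch below uses a greedy first-fit-decreasing heuristic as a stand-in. The 128 MB target and the file sizes are assumptions, and Iceberg's own planner is more involved than this.

```python
def plan_binpack(file_sizes_mb, target_mb=128):
    """Greedy first-fit-decreasing: group small files into bins of about target size."""
    bins = []
    for size in sorted(file_sizes_mb, reverse=True):
        for group in bins:
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            # No existing group has room, so start a new output file.
            bins.append([size])
    return bins


# Hypothetical small-file sizes in MB after a day of streaming commits.
small_files = [4, 8, 16, 60, 3, 90, 30, 12, 5]
groups = plan_binpack(small_files)
print(f"{len(small_files)} files -> {len(groups)} rewritten files: {groups}")
```

Sort and Z-order compaction also rewrite files, but they reorder rows inside them so that filters can skip more data.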

03 · ~3h

ML edges + full deployment

Feast feature view + MLflow snapshot versioning + monitoring. System-design write-up for 500K/sec scale. Deploy the full stack via docker-compose.

✓ Feast feature view on Iceberg
✓ MLflow runs with snapshot_id
✓ docker-compose.lakehouse.yml deploy
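The 500K/sec write-up is mostly arithmetic, so it helps to have the numbers in a form you can tweak. The sketch below is a back-of-envelope calculator. Only the 500K events/sec target comes from this page. The event size, compression ratio, tier names, retention days, and per-GB prices are placeholder assumptions, not real quotes and not the course's model.

```python
EVENTS_PER_SEC = 500_000   # design target from the write-up
AVG_EVENT_BYTES = 1_000    # assumption: about 1 KB per event
COMPRESSION_RATIO = 5      # assumption: columnar files shrink raw bytes 5x

# Placeholder tiers: (name, days retained in tier, $ per GB-month). Not real prices.
TIERS = [
    ("hot", 7, 0.025),
    ("warm", 23, 0.0125),
    ("cold", 60, 0.004),
    ("archive", 275, 0.001),
]


def daily_stored_gb():
    """Compressed gigabytes landed per day."""
    raw_bytes_per_day = EVENTS_PER_SEC * AVG_EVENT_BYTES * 86_400
    return raw_bytes_per_day / COMPRESSION_RATIO / 1e9


def monthly_cost_by_tier():
    """Steady-state monthly cost: data resident in each tier times its unit price."""
    per_day = daily_stored_gb()
    return {name: per_day * days * price for name, days, price in TIERS}


ingest_mb_per_sec = EVENTS_PER_SEC * AVG_EVENT_BYTES / 1e6
print(f"ingest: {ingest_mb_per_sec:.0f} MB/s raw, {daily_stored_gb():.0f} GB/day stored")
print(monthly_cost_by_tier())
```

Swap in your own event size and prices. The structure matters more than these particular numbers.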

Project setup · 10 minutes

One stack. Iceberg + five engines + CDC + Feast.

The starter kit ships a complete docker-compose stack — MinIO, REST catalog, Spark, Trino, Kafka, Debezium, PostgreSQL with logical WAL, Prometheus, and Grafana — wired and seeded.

What lives in the repo

Everything you need to run the breadth tour locally — including the seed PostgreSQL database with WAL pre-configured for Debezium and a Java Flink CDC reference job.

docker-compose.lakehouse.yml — Spark, Trino, Kafka, Debezium, PostgreSQL, Prometheus, Grafana
icelake/ — Python module with table creators, MERGE helpers, snapshot utils
postgres-cdc/ — seed schema + Debezium connector config (pgoutput)
flink-cdc-job/ — Java reference job for streaming Kafka → Iceberg
trino/ + grafana/ — catalog configs + dashboards for table health
notebooks/ — Jupyter notebooks for PyIceberg + DuckDB + Feast walkthroughs

Download · Starter Kit

IceLake Commerce Starter Kit

Pre-built lakehouse stack with seeded PostgreSQL, Trino + Grafana configs, the Java Flink CDC job, and Jupyter notebooks for the multi-engine + Feast walkthroughs.

Pro project · pre-built lakehouse stack · sample data included

~/projects/icelake-commerce — zsh

1. Clone and start the lakehouse

$ git clone https://github.com/ai-de/p22-icelake-commerce

$ cd p22-icelake-commerce && docker-compose -f docker-compose.lakehouse.yml up -d

2. Seed Iceberg tables (orders, customers, products, events)

$ docker exec icelake-spark python /home/iceberg/scripts/seed_iceberg.py

3. Start the Debezium CDC connector (PostgreSQL → Kafka)

$ curl -X POST http://localhost:8083/connectors -d @postgres-cdc/connector.json -H 'Content-Type: application/json'

4. Submit the Flink CDC job (Kafka → Iceberg silver)

$ docker exec flink-jobmanager flink run /opt/flink/jobs/flink-cdc-silver.jar

5. Query the same table from 3 engines

$ trino --catalog iceberg --schema icelake -e 'SELECT count(*) FROM silver_orders'

$ duckdb -c "SELECT * FROM iceberg_scan('s3://lake/silver_orders')"

$ python -c 'from pyiceberg.catalog import load_catalog; print(load_catalog("rest").load_table("icelake.silver_orders").scan().to_pandas())'
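Step 3 posts postgres-cdc/connector.json to Kafka Connect. The repo's actual file is not shown here, so the following is only a sketch of what a Debezium PostgreSQL connector payload of that kind might contain. The connector name, hostname, credentials, and database name are hypothetical. The `dbserver1` prefix is chosen to match the `dbserver1.public.*` topics in the diagram.

```python
import json


def build_connector_payload(topic_prefix="dbserver1"):
    """Build a Debezium PostgreSQL connector registration body."""
    return {
        "name": "icelake-postgres-cdc",  # hypothetical connector name
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",
            "database.hostname": "postgres",  # hypothetical service name
            "database.port": "5432",
            "database.user": "postgres",      # hypothetical credentials
            "database.password": "postgres",
            "database.dbname": "commerce",    # hypothetical database name
            # Debezium 2.x key; 1.x releases used database.server.name instead.
            "topic.prefix": topic_prefix,
            "table.include.list": "public.orders,public.customers,public.products",
        },
    }


payload = build_connector_payload()
print(json.dumps(payload, indent=2))

# Topic names follow <prefix>.<schema>.<table>.
tables = payload["config"]["table.include.list"].split(",")
topics = [f"{payload['config']['topic.prefix']}.{t}" for t in tables]
print(topics)
```

Comparing this against the real connector.json in the starter kit is a quick way to see which settings the course changes from the defaults.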

Seed data: 1k customers · 500 products · 10k orders · 15k events

What changes vs a single-engine warehouse

The same data — readable by five engines, hardened for production patterns.

Most warehouses lock you to one engine. Hive locks you to one writer. The patterns shown here (WAP, MERGE, hidden partitioning, snapshot isolation) are what unlock the multi-engine + streaming story without breaking readers.

Hive / single-engine version · What you have today

Engines: Spark only; readers blocked during writes
Schema changes: Drop & recreate, downstream breaks
Partition strategy: Hardcoded; never changes
CDC ingestion: Hourly batch from a snapshot copy
De-duplication: DISTINCT after the fact, brittle
Small files: Accumulate; query gets slower over time

Your IceLake version · Module 01–03

✓ Engines: Trino, DuckDB, PyIceberg, Spark, Flink, all on the same Iceberg tables
✓ Schema changes: ALTER TABLE with column-ID preservation; readers don’t notice
✓ Partition strategy: ADD PARTITION FIELD; evolve days→hours without rewrite
✓ CDC ingestion: Debezium WAL → Kafka → Flink, exactly-once into bronze + silver
✓ De-duplication: MERGE INTO ... WHEN MATCHED with ROW_NUMBER window
✓ Small files: Scheduled rewrite_data_files with zorder sort
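The "Schema changes" row hinges on column IDs: data files reference fields by ID, so a rename only changes the name attached to an ID. A toy illustration follows, assuming made-up IDs, column names, and rows. It mirrors the idea, not Iceberg's file format.

```python
# Data files store values keyed by field ID, not by column name (toy model).
data_file_rows = [{1: 1001, 2: 59.90}, {1: 1002, 2: 12.50}]

schema_v1 = {1: "order_id", 2: "amount"}
# After a hypothetical RENAME COLUMN amount TO total and ADD COLUMN currency:
schema_v2 = {1: "order_id", 2: "total", 3: "currency"}


def read_rows(rows, schema):
    """Project stored rows through a schema; IDs absent from a file read as None."""
    return [
        {name: row.get(field_id) for field_id, name in schema.items()}
        for row in rows
    ]


print(read_rows(data_file_rows, schema_v1))
print(read_rows(data_file_rows, schema_v2))
```

The old files are never touched. The renamed column resolves through ID 2, and the new column reads as null for rows written before it existed.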

PRO benefit · code review

Real review from senior engineers who shipped this stack.

Submit your repo, get line-by-line feedback within 48 hours. The kind of review that's quietly worth thousands of dollars in time-to-staff.

4 reviews / month

Submit a repo, a PR, or a refactor proposal. Reviewer is matched to your domain — Iceberg + streaming for this project. Async, comments inline, average turnaround 31 hours.

31h avg turnaround · 9.2/10 helpfulness · 94% return next month

2 office hours / month

Live 30-min sessions with a senior data engineer. Architecture questions, whiteboard a tricky migration, mock a system-design interview. Group sessions also available.

30 min per session · 2 / mo included · + group sessions unlimited

What PRO unlocks

One subscription. 15+ projects, all curriculum, code review.

PRO is built for engineers who want production-grade builds and feedback loops — not more tutorials.

What you get · FREE · PRO · EXPERT

Projects (production-grade builds): PRO 15+
Curriculum modules (all 7 tracks): FREE Phase 1 only · PRO All · EXPERT All + bonus
Code review credits (senior engineer review): PRO 4 / month · EXPERT Unlimited
Career path access (5 paths × full plans): FREE 1 path · PRO All 5 · EXPERT All 5 + 1:1
Certificate (verifiable on LinkedIn): FREE none · PRO Yes · EXPERT Yes + portfolio review
Community (Discord + office hours): FREE Read-only · PRO Full + 2/mo · EXPERT Full + 4/mo

$29/mo · billed monthly · cancel anytime

or annual: $249/yr · save 28%

Upgrade to PRO →

Who this is for

Pick this if you want the whole map, not one corner of it.

Data engineers prepping interviews

You can talk Spark or you can talk Kafka, but the whole-stack system-design question still trips you up. This gives you firsthand reps across every layer.

Analytics engineers going deeper

You live in dbt + Snowflake. You want to understand what the lakehouse actually is and why your platform team keeps mentioning Iceberg in roadmap reviews.

Platform engineers comparing options

You're evaluating Iceberg vs Delta vs warehouse-only. This is the fastest way to wire the alternatives and form an opinion based on real usage.

ML engineers needing data context

You write models but feature pipelines are someone else's headache. Module 04 introduces Feast + MLflow on top of a lakehouse you actually built.

FAQ

Quick answers.

How is this different from P04 Iceberg Lakehouse Foundations?

P04 goes deep on the foundations + ops (20-26h on internals, MERGE, compaction, snapshot expiry). P22 (this) goes wide across the stack (13h covering streaming + multi-engine + ML edges). Pick P04 if you want depth on operating Iceberg; pick P22 if you want the panoramic system-design tour. Most learners do both.

Are Flink, Kafka, Debezium, and Feast actually built, or just discussed?

Built. Module 03 wires PostgreSQL → Debezium → Kafka → Flink → Iceberg with real Java + SQL. Module 04 has actual Feast feature views and MLflow runs logging snapshot IDs. Depth is intentionally tour-level, not specialist — but it's executable code, not slides.

Is the '500K events/sec' system actually deployed?

No — Module 04 has it as a system-design write-up (the kind of thing you'd whiteboard in an interview). The deployed scale is the seeded dataset (1k customers / 500 products / 10k orders / 15k events). The 500K/sec piece is design + costing, not load tested.

Do I need AWS credentials?

No. MinIO stands in for S3 and the whole stack runs locally via docker-compose. The patterns transfer to S3 + Glue (or S3 Tables) with config changes only.

What does PRO actually unlock for $29/mo?

All 15+ PRO projects, 4 code-review credits per month, 2 office-hours sessions, full curriculum across all 7 tracks, all 5 career paths, certificate of completion, and full community access. Cancel anytime.

Will this help with senior+ data engineering interviews?

Especially the system-design rounds. After this you can whiteboard a CDC pipeline end-to-end, defend a multi-engine architecture, explain when to reach for Flink vs Spark Streaming, and answer the inevitable 'what about ML on top?' follow-up without hand-waving.

Ready to wire the whole lakehouse?

Start with module 01 — free, no card. About 3 hours. By the end you'll have Iceberg + Spark + REST catalog running locally with your first ACID table written and the four-layer metadata model walked through.

See PRO benefits
