
Petabyte-Scale Iceberg Lakehouse

Modernize a traditional data lake by implementing an ACID-compliant table format, allowing time-travel queries and schema evolution at massive scale.

20-25 hrs · Advanced · 4 Parts
Apache Iceberg · Apache Spark · Apache Flink · Trino · Debezium · Apache Kafka · Docker · MinIO
  PostgreSQL     App Events    External APIs
       |               |              |
       v               v              v
   Debezium       Kafka Producer  REST/Batch
       |               |              |
       +-------+-------+-------+------+
               |
               v
        Apache Kafka
       (Message Broker)
               |
       +-------+-------+
       |       |       |
       v       v       v
    Spark    Flink   Kafka
   Stream    SQL    Connect
       |       |       |
       +-------+-------+
               |
               v
        Apache Iceberg
        (Table Format)
        [REST Catalog]
               |
               v
         MinIO / S3
       (Object Storage)
               ^
       +-------+-------+
       |       |       |
    Spark    Trino   Flink
   (Batch)  (OLAP)  (Stream)

Fig 1.1: End-to-end Iceberg lakehouse architecture
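The time-travel queries mentioned above are exposed in Spark SQL through Iceberg's `TIMESTAMP AS OF` and `VERSION AS OF` clauses. A minimal sketch of how such queries are built (the table name `lake.sales` and the snapshot id are hypothetical examples, not values from this project):

```python
from typing import Optional

# Sketch: constructing Iceberg time-travel queries in Spark SQL syntax.
# The table name and snapshot id used below are made up for illustration;
# in a real job each string would be passed to spark.sql(...).

def time_travel_query(table: str, *, timestamp: Optional[str] = None,
                      snapshot_id: Optional[int] = None) -> str:
    """Return a Spark SQL SELECT against an Iceberg table, optionally
    pinned to a wall-clock timestamp or a specific snapshot id."""
    if timestamp is not None:
        return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"
    if snapshot_id is not None:
        return f"SELECT * FROM {table} VERSION AS OF {snapshot_id}"
    return f"SELECT * FROM {table}"

print(time_travel_query("lake.sales", timestamp="2024-01-01 00:00:00"))
print(time_travel_query("lake.sales", snapshot_id=8744736658442914487))
```

Because every write produces a new immutable snapshot, the timestamp variant reads the table exactly as it looked at that moment, with no extra copies of the data.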

What You'll Build

Lakehouse Architecture

Complete multi-service setup with Iceberg, Spark, Flink, Trino, and Kafka

Real-time Sync

CDC pipeline from PostgreSQL to Iceberg via Debezium with exactly-once semantics
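The CDC leg of that pipeline starts by registering a Debezium PostgreSQL connector with Kafka Connect. A sketch of the registration payload follows; the connector name, hostnames, credentials, and table list are placeholders assumed for this project, not values taken from the source:

```python
import json

# Sketch of a Debezium PostgreSQL source-connector config. All names and
# credentials below are hypothetical; in practice this JSON is POSTed to
# the Kafka Connect REST endpoint (POST /connectors).
connector = {
    "name": "icelake-postgres-cdc",          # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",     # assumed Docker service name
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "dbz",
        "database.dbname": "icelake",
        "topic.prefix": "cdc",               # Kafka topics become cdc.<schema>.<table>
        "table.include.list": "public.orders,public.inventory",
        "plugin.name": "pgoutput",           # PostgreSQL logical decoding plugin
    },
}

payload = json.dumps(connector, indent=2)
print(payload)
```

Exactly-once delivery into Iceberg then comes from the sink side (e.g. Flink checkpoints committing Iceberg snapshots atomically), not from the connector itself.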

Multi-Engine Queries

Same tables queryable from Spark, Flink, and Trino simultaneously

Auto Maintenance

Automated compaction, snapshot expiration, and orphan file cleanup
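Each of those three maintenance tasks maps to a stored procedure Iceberg ships for Spark. A sketch of a maintenance plan as Spark SQL `CALL` statements; the catalog name `rest`, the table `db.orders`, and the retention timestamp are hypothetical:

```python
# Sketch: the three maintenance tasks named above, expressed as Iceberg's
# Spark SQL procedures. Catalog/table names and the cutoff timestamp are
# placeholders; a real job would run each string via spark.sql(...).

def maintenance_plan(catalog: str, table: str, older_than: str) -> list:
    """Return CALL statements for compaction, snapshot expiration, and
    orphan-file cleanup, in that order."""
    return [
        f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{older_than}')",
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}')",
    ]

for stmt in maintenance_plan("rest", "db.orders", "2024-01-01 00:00:00"):
    print(stmt)
```

Ordering matters: compacting first rewrites small files into larger ones, expiring snapshots then releases the superseded files, and orphan cleanup removes anything no snapshot references.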

Business Scenario

IceLake Commerce

IceLake Commerce is a fast-growing e-commerce platform processing millions of transactions daily. Their data infrastructure team needs to modernize the analytics platform from a legacy Hive warehouse to a modern Iceberg lakehouse.

Current Challenges

  • Legacy Hive warehouse with performance issues
  • Batch-only analytics with 24-hour data latency
  • No support for updates/deletes (a GDPR compliance risk)
  • Separate, duplicated systems for different query engines
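The updates/deletes gap is one concrete payoff of the migration: Iceberg tables accept row-level deletes, so a GDPR erasure request becomes a single SQL statement. A sketch (the table and column names are made up for illustration):

```python
# Sketch: a GDPR "right to erasure" delete against an Iceberg table.
# Table/column names are hypothetical; in practice the string would be
# executed with spark.sql(gdpr_delete("lake.customers", 42)).

def gdpr_delete(table: str, customer_id: int) -> str:
    """Return a row-level DELETE statement for one customer's records."""
    return f"DELETE FROM {table} WHERE customer_id = {customer_id}"

print(gdpr_delete("lake.customers", 42))
```

On a legacy Hive table the same request would mean rewriting whole partitions by hand; Iceberg handles the rewrite (or delete files, under merge-on-read) transactionally.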

Your Mission

  • Real-time inventory and sales analytics
  • Multi-engine support for different teams
  • ML features for personalization engine
  • Cost-effective storage with S3-compatible backend

Progressive Learning Path

Each part builds on the previous. Master Iceberg from foundation to production.

Total: 20-25 hours across 4 parts

Prerequisites

  • Docker Desktop (8GB+ RAM allocated)
  • Basic SQL knowledge (SELECT, JOIN, GROUP BY)
  • Python familiarity for PySpark scripts
  • Conceptual understanding of data lakes (helpful)

Related Learning Path

This capstone project is the culmination of the Iceberg Deep Dive skill toolkit. Complete the prerequisite modules first, or dive straight in if you have prior experience.

View Iceberg Skill Toolkit

Ready to build?
