ai-de.net/Projects/P04 · Iceberg Lakehouse Foundations
PRO · module 01 free preview · Batch track · P04

Build a production-grade lakehouse on Apache Iceberg

Run Spark + object storage + a Nessie catalog locally. Ship ACID tables, Bronze → Silver → Gold pipelines on a 1M-row dataset, time travel, and compaction — no warehouse required.

Timeline: 20-26 hours
Difficulty: Senior+
Stack: Iceberg · Spark · Nessie · MinIO

Designing a lakehouse like this one is a system-design question asked at Snowflake, Databricks, Netflix, and other companies running on Iceberg.

By the end you will have
  • A working Iceberg lakehouse running locally (Spark + Nessie + MinIO via Docker)
  • Bronze → Silver → Gold medallion tables on a 1M-row synthetic e-commerce dataset
  • MERGE-based upserts with row-level de-duplication
  • Time-travel queries (FOR SYSTEM_VERSION AS OF, FOR SYSTEM_TIME AS OF)
  • A maintenance runbook: compaction, snapshot expiry, orphan file cleanup
PREREQ · Built for senior+ engineers. Comfortable with dbt fundamentals, Spark, and one columnar format. Not a tutorial: it assumes you’ve shipped pipelines before.
iceberg.icelake.* · snapshot 8c4f2e · time travel

Docker stack · docker compose up
  • spark-master
  • spark-worker
  • nessie-rest
  • minio
Nessie catalog · branching · time travel
  • icelake (HEAD) · main @ snap-8c4f2e
  • snap-3a91d2 · INSERT bronze
  • snap-1f7c0a · MERGE silver
  • snap-0e2b88 · OVERWRITE gold
Iceberg tables · bronze · silver · gold
  • bronze_transactions · 1M rows · 12 monthly partitions
  • silver_orders · MERGE · ROW_NUMBER dedup
  • gold_sales_summary · INSERT OVERWRITE · idempotent
MinIO storage · hidden partitioning
  • 2024-01.parquet · 2024-02.parquet · 2024-03.parquet · … 04 / 05 / 06
  • .parquet × 1,200

# Time travel
SELECT * FROM silver_orders
FOR SYSTEM_VERSION AS OF 8c4f2e
instant rollback · zero copy
→ snapshot isolation as a first-class API

# Compaction
CALL system.rewrite_data_files(
strategy => 'sort')
1,200 small files → 10 optimized
→ keeps query planning fast as data grows

1M+ row dataset · 3 medallion layers · ACID snapshot isolation
Why this matters in 2026

Iceberg is the table format everyone is moving to.

The patterns you ship in this project — ACID + time travel + hidden partitioning + compaction — are the ones in production at the companies setting the bar for data infra.

Iceberg vs Snowflake cost

Object storage + open table format breaks the warehouse pricing curve. The same patterns that work locally scale to S3 without a vendor lock-in tax.

Lakehouse vs warehouse

ACID, time travel, and schema evolution on object storage mean you stop choosing between cheap-and-dumb (Hive) and expensive-and-managed (Snowflake).

GDPR-grade deletes

Iceberg's MERGE + snapshot expiry is how you actually delete a user's row across 14k Parquet files — and prove you did it.

Multi-team data reuse

Hidden partitioning + snapshot isolation mean analytics, ML, and ad-hoc queries can read the same tables without breaking each other.

Curriculum · 4 modules · 20-26 hours

Module 01 is free. The rest unlocks with PRO.

Try the first 3-4 hours — stand up the local lakehouse, write your first ACID table, explore the snapshot tree. If it clicks, upgrade to unlock the medallion build and production maintenance modules.

P04 · 20-26 hours · 4 modules
Free preview · PRO required
Module 01 is free — no card required. Get a feel for the stack before paying.
M01
Lakehouse foundation: Iceberg + Nessie + MinIO + Spark
Stand up a local ACID lakehouse with Docker. Configure the Nessie REST catalog, MinIO object store, and Spark with Iceberg extensions. Create your first Iceberg tables with hidden partitioning.
3-4h · 9 lessons · FREE PREVIEW
Start →
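The configuration step in module 01 can be pictured with a small sketch. It assumes the Nessie-native Iceberg catalog implementation and Iceberg's S3FileIO pointed at MinIO; the course's actual wiring may differ. The service hostnames, port, API path, and bucket name are assumptions for illustration, not values taken from the repo.

```python
# Sketch only: service names, port, bucket, and API path are assumptions,
# not values taken from the course repo.
CATALOG = "nessie"  # catalog name, as in nessie.icelake.bronze_transactions


def spark_conf(nessie_uri="http://nessie-rest:19120/api/v1",
               warehouse="s3://warehouse/",
               s3_endpoint="http://minio:9000",
               ref="main"):
    """Return Spark settings for an Iceberg catalog backed by Nessie and MinIO."""
    prefix = f"spark.sql.catalog.{CATALOG}"
    return {
        # Enables Iceberg SQL extensions (such as CALL procedures)
        # and Nessie branch commands.
        "spark.sql.extensions": ",".join([
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
        ]),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.uri": nessie_uri,
        f"{prefix}.ref": ref,  # Nessie branch to read from and commit to
        f"{prefix}.warehouse": warehouse,
        # S3FileIO talks to MinIO through its S3-compatible endpoint.
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.s3.endpoint": s3_endpoint,
        f"{prefix}.s3.path-style-access": "true",
    }


def as_cli_flags(conf):
    """Render the settings as --conf flags for spark-sql or spark-submit."""
    return [f"--conf {key}={value}" for key, value in conf.items()]


if __name__ == "__main__":
    print(" \\\n  ".join(["spark-sql"] + as_cli_flags(spark_conf())))
```

Keeping the settings in one dictionary makes it easy to swap the endpoint and warehouse when moving from MinIO to S3, which is the kind of config-only change the page describes.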
M02
Bronze layer: 1M-row synthetic ingestion
Generate 1M synthetic retail transactions with Faker (~2% intentional nulls). Land them as a partitioned Iceberg bronze table with explicit schema and zstd compression. Verify with snapshot history queries.
5-6h · 11 lessons · PRO TIER
Unlock with PRO →
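As a rough idea of what the module 02 generator does, the sketch below produces seeded synthetic rows with about 2% nulls in selected columns. It uses only the standard library's random module in place of Faker, and every column name and value list is hypothetical.

```python
import random
from datetime import datetime, timedelta

NULL_RATE = 0.02  # ~2% intentional nulls, as described above
REGIONS = ["NA", "EMEA", "APAC", "LATAM"]                # hypothetical values
CATEGORIES = ["apparel", "electronics", "home", "toys"]  # hypothetical values


def maybe_null(rng, value, rate=NULL_RATE):
    """Return None with probability `rate`, otherwise the value unchanged."""
    return None if rng.random() < rate else value


def generate_transactions(n, seed=42, year=2024):
    """Yield n synthetic rows spread uniformly across one calendar year."""
    rng = random.Random(seed)  # seeded so reruns produce identical data
    start = datetime(year, 1, 1)
    span = int((datetime(year + 1, 1, 1) - start).total_seconds())
    for i in range(n):
        yield {
            "transaction_id": i,
            "customer_id": rng.randint(1, 50_000),
            "event_ts": start + timedelta(seconds=rng.randrange(span)),
            "region": maybe_null(rng, rng.choice(REGIONS)),
            "category": rng.choice(CATEGORIES),
            "amount": maybe_null(rng, round(rng.uniform(1.0, 500.0), 2)),
        }


if __name__ == "__main__":
    rows = list(generate_transactions(100_000))
    nulls = sum(r["amount"] is None for r in rows)
    print(f"{nulls / len(rows):.2%} null amounts")
```

Seeding the generator matters more than it looks: a reproducible bronze load lets you compare snapshot history between runs without wondering whether the data changed.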
M03
Silver + Gold: MERGE upserts and aggregations
Build silver_orders with MERGE INTO + ROW_NUMBER de-duplication. Build gold_sales_summary as an idempotent INSERT OVERWRITE aggregation. Evolve partitioning in place. Sketch the Airflow DAG that drives it.
6-8h · 13 lessons · PRO TIER
Unlock with PRO →
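The MERGE plus ROW_NUMBER pattern from module 03 can be modelled in a few lines. The sketch below keeps a reference Spark SQL statement as a string and mirrors its logic in plain Python: keep the newest source row per key, then update matches and insert the rest. Column names such as order_id, status, and updated_at are hypothetical. De-duplicating first matters because a MERGE generally rejects a source in which several rows match the same target row.

```python
# Reference SQL; column names are hypothetical, table names follow this page.
MERGE_SQL = """
MERGE INTO nessie.icelake.silver_orders AS t
USING (
  SELECT order_id, status, updated_at FROM (
    SELECT *, ROW_NUMBER() OVER (
      PARTITION BY order_id ORDER BY updated_at DESC) AS rn
    FROM nessie.icelake.bronze_transactions
  ) AS ranked WHERE rn = 1
) AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET t.status = s.status, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, status, updated_at)
  VALUES (s.order_id, s.status, s.updated_at)
"""


def latest_per_key(rows, key="order_id", order_by="updated_at"):
    """Python analogue of ROW_NUMBER() ... ORDER BY updated_at DESC, rn = 1.

    On ties the first row seen wins here; in SQL the tie-break is arbitrary.
    """
    best = {}
    for row in rows:
        k = row[key]
        if k not in best or row[order_by] > best[k][order_by]:
            best[k] = row
    return list(best.values())


def merge_upsert(target, source_rows, key="order_id"):
    """Return a new target: matched keys updated, unmatched keys inserted."""
    merged = dict(target)
    for row in latest_per_key(source_rows, key):
        merged[row[key]] = row
    return merged


if __name__ == "__main__":
    silver = {1: {"order_id": 1, "status": "new", "updated_at": 1}}
    bronze = [
        {"order_id": 1, "status": "shipped", "updated_at": 2},
        {"order_id": 1, "status": "delivered", "updated_at": 3},
        {"order_id": 2, "status": "new", "updated_at": 1},
    ]
    print(merge_upsert(silver, bronze))
```

Running the upsert twice with the same source leaves the target unchanged, which is the property that makes the silver load safe to retry.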
M04
Operate at scale: compaction, snapshots, time travel
Run compaction with sort and bin-pack strategies. Expire snapshots and remove orphan files on a schedule. Query historic table state with FOR SYSTEM_TIME AS OF. Build a maintenance runbook with health checks.
6-8h · 14 lessons · PRO TIER
Unlock with PRO →
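A scheduled maintenance pass like the one in module 04 boils down to three Iceberg stored-procedure calls in a fixed order. The sketch below only builds the statements; the helper name, retention defaults, and table name are assumptions for illustration. Nessie also ships its own garbage-collection tooling, so it is worth checking its guidance before running file-deleting procedures against a Nessie-managed table.

```python
from datetime import datetime, timedelta, timezone


def maintenance_statements(table, catalog="nessie", retain_days=7,
                           retain_last=5, sort_order=None, now=None):
    """Build the CALL statements for one maintenance pass, in run order.

    `retain_days`, `retain_last`, and the default catalog name are
    illustrative defaults, not recommendations from the course.
    """
    now = now or datetime.now(timezone.utc)
    # Formatted in UTC; Spark interprets the literal in its session time zone.
    cutoff = (now - timedelta(days=retain_days)).strftime("%Y-%m-%d %H:%M:%S")
    if sort_order:
        rewrite_args = (f"table => '{table}', strategy => 'sort', "
                        f"sort_order => '{sort_order}'")
    else:
        rewrite_args = f"table => '{table}', strategy => 'binpack'"
    return [
        # 1. Compact small files into fewer, larger ones.
        f"CALL {catalog}.system.rewrite_data_files({rewrite_args})",
        # 2. Drop snapshots older than the cutoff, keeping the most recent few.
        f"CALL {catalog}.system.expire_snapshots(table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff}', retain_last => {retain_last})",
        # 3. Delete files no longer referenced by any remaining snapshot.
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff}')",
    ]


if __name__ == "__main__":
    for stmt in maintenance_statements("icelake.silver_orders"):
        print(stmt + ";")
```

Compaction runs first because the files it replaces only become removable once the snapshots that reference them have expired.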
3 modules locked · Unlock all PRO content for $29/mo
Upgrade to PRO →
Backed by curriculum

Apache Iceberg & Modern Lakehouse Architecture

9 modules · ~11.5 hours · ACID tables · MERGE · schema evolution · compaction · time travel
Open curriculum

This curriculum is the foundation for the project — not a sales add-on. PRO subscribers get full access to every module.

The build, in 3 phases

Three sprints. Three checkpoints. One production lakehouse.

Each phase ends with a tagged commit and a working artifact. No ambiguity about where you are.

01 · ~4h
Stand up the ACID lakehouse

Iceberg + Nessie + MinIO + Spark running locally via Docker. First ACID table created with hidden partitioning. Snapshot history visible.

  • Docker Compose stack (Spark, Nessie, MinIO)
  • 3 Iceberg tables (orders, customers, products)
  • Spark + Nessie catalog wired
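Hidden partitioning, mentioned in this phase, means the table derives partition values from a column through a transform, so queries filter on the raw timestamp and never name a partition column. The sketch below shows a hypothetical DDL string and a plain-Python model of Iceberg's month transform, which is defined as months since 1970-01. Only the table name orders comes from this page; the columns are made up.

```python
from datetime import datetime

# Hypothetical DDL: only the table name `orders` comes from this page.
CREATE_ORDERS = """
CREATE TABLE nessie.icelake.orders (
  order_id BIGINT,
  order_ts TIMESTAMP,
  amount   DECIMAL(10, 2)
) USING iceberg
PARTITIONED BY (months(order_ts))
"""


def month_partition(ts):
    """Model of the month transform: whole months elapsed since 1970-01."""
    return (ts.year - 1970) * 12 + (ts.month - 1)


def partition_label(ordinal):
    """Render a month ordinal as YYYY-MM for display."""
    year, month_index = divmod(ordinal, 12)
    return f"{1970 + year}-{month_index + 1:02d}"


def partitions_for_range(start, end):
    """Month ordinals a planner needs to scan for start <= order_ts <= end."""
    return list(range(month_partition(start), month_partition(end) + 1))


if __name__ == "__main__":
    wanted = partitions_for_range(datetime(2024, 2, 10), datetime(2024, 4, 1))
    print([partition_label(p) for p in wanted])
```

Because the transform lives in table metadata, a filter such as order_ts BETWEEN two dates can be pruned to the matching months without the query author knowing how the table is laid out.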
02 · ~12h
Build the medallion (Bronze → Gold)

1M-row synthetic dataset landed as bronze. MERGE-based upsert into silver. Idempotent gold aggregation by date / region / category.

  • 1M-row Faker bronze table
  • Silver table with MERGE de-dup
  • Gold sales-summary aggregations
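The idempotent gold aggregation in this phase rests on one idea: recompute the summary from silver and replace the previous result instead of appending to it. The sketch below models that in plain Python with hypothetical column names, treating a dictionary as the gold table and a whole-table replace as the stand-in for INSERT OVERWRITE.

```python
from collections import defaultdict


def aggregate(silver_rows):
    """Group by (date, region, category) and total order count and revenue."""
    totals = defaultdict(lambda: {"orders": 0, "revenue": 0.0})
    for row in silver_rows:
        key = (row["order_date"], row["region"], row["category"])
        totals[key]["orders"] += 1
        totals[key]["revenue"] += row["amount"]
    return {key: dict(value) for key, value in totals.items()}


def insert_overwrite(gold, silver_rows):
    """Replace the gold contents wholesale; reruns give the same result."""
    gold.clear()
    gold.update(aggregate(silver_rows))
    return gold


def insert_append(gold, silver_rows):
    """Counter-example: appending double counts when the job is retried."""
    for key, value in aggregate(silver_rows).items():
        slot = gold.setdefault(key, {"orders": 0, "revenue": 0.0})
        slot["orders"] += value["orders"]
        slot["revenue"] += value["revenue"]
    return gold


if __name__ == "__main__":
    silver = [
        {"order_date": "2024-01-05", "region": "EMEA", "category": "toys", "amount": 10.0},
        {"order_date": "2024-01-05", "region": "EMEA", "category": "toys", "amount": 5.0},
    ]
    gold = {}
    insert_overwrite(gold, silver)
    insert_overwrite(gold, silver)  # retry: same result
    print(gold)
```

The append variant is included only to show what a retried job would do without the overwrite semantics.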
03 · ~7h
Operate at scale

Compaction, snapshot expiry, and orphan-file cleanup running as a maintenance runbook. Time-travel queries against historic snapshots.

  • Compaction job (sort + bin-pack)
  • Snapshot expiry + orphan cleanup
  • Maintenance runbook with health checks
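Time travel against historic snapshots amounts to picking a snapshot and reading the table as of it. The sketch below models how a timestamp resolves to the snapshot that was current at that moment, then builds the query text. The history entries are made up; real Iceberg snapshot IDs are long integers that you can list from the table's snapshots metadata table, and the short hex labels on this page are best read as illustrations.

```python
from bisect import bisect_right
from datetime import datetime


def snapshot_as_of(history, ts):
    """Return the ID of the snapshot that was current at time `ts`.

    `history` is a list of (committed_at, snapshot_id) sorted by commit time.
    """
    commit_times = [committed_at for committed_at, _ in history]
    index = bisect_right(commit_times, ts)
    if index == 0:
        raise ValueError("no snapshot at or before the requested time")
    return history[index - 1][1]


def time_travel_sql(table, snapshot_id):
    """Build a read of `table` pinned to one snapshot."""
    return f"SELECT * FROM {table} FOR SYSTEM_VERSION AS OF {snapshot_id}"


if __name__ == "__main__":
    # Made-up history for illustration.
    history = [
        (datetime(2024, 1, 1), 101),
        (datetime(2024, 2, 1), 102),
        (datetime(2024, 3, 1), 103),
    ]
    snap = snapshot_as_of(history, datetime(2024, 2, 15))
    print(time_travel_sql("nessie.icelake.silver_orders", snap))
```

A snapshot can only be read this way while it still exists, so the retention window chosen for snapshot expiry also sets how far back time travel can reach.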
Project setup · 10 minutes

One command. Local Iceberg + Nessie + Spark + MinIO.

You get a real stack on day one — local S3 (MinIO), the Nessie REST catalog, Spark 3.5, and 1M+ synthetic transactions ready to load.

What lives in the repo

Everything you need to stand up a production-shaped lakehouse on your laptop, plus the seed scripts and verification queries used in modules 02–04.

  • docker-compose.yml — Spark, Nessie REST, MinIO
  • seeds/transactions/ — 1M synthetic retail rows (Faker)
  • spark/jobs/ — bronze, silver MERGE, gold aggregation scripts
  • nessie/ — catalog config + branching examples
  • maintenance/ — compaction, snapshot expiry, orphan cleanup runbook
Download · Starter Kit

Iceberg Lakehouse Starter Kit

Pre-configured Docker stack, sample CSVs, and the bronze/silver/gold scaffolds. Skip the boilerplate, start on module 01.

376 KB · 50 files · 4 sample CSVs · PRO required
~/projects/iceberg-lakehouse — zsh
1. Clone and start the stack
$ git clone https://github.com/ai-de/p04-iceberg-lakehouse
$ cd p04-iceberg-lakehouse && make up
2. Seed the bronze table (1M rows)
$ make seed && make write-bronze
3. Verify rowcounts via spark-sql
$ docker exec icelake-spark spark-sql -e 'SELECT COUNT(*) FROM nessie.icelake.bronze_transactions'
4. Open MinIO console / Spark UI
$ open http://localhost:9001 # MinIO
$ open http://localhost:8080 # Spark Master
1M+ bronze rows · 3 medallion tables · 12 monthly partitions · ~2% intentional nulls
Production hardening

The same lakehouse — but built for the 10x case.

Most Iceberg tutorials show you the CREATE TABLE. This one shows what changes when the table is 100GB+, the schema is mid-evolution, and one team is upserting while another is reading.

Hive-table version (what you have today) vs. your Iceberg version (Modules 03–04)
  • Schema changes · Hive: drop & recreate the table · Iceberg: add/rename/drop with ALTER TABLE, column IDs preserved
  • Bad write recovery · Hive: restore from S3 versioning, hope for the best · Iceberg: ROLLBACK TO SNAPSHOT, instant and atomic
  • Small files · Hive: accumulate forever; query gets slow · Iceberg: scheduled rewrite_data_files (sort + bin-pack)
  • De-duplication · Hive: DISTINCT after the fact, brittle · Iceberg: MERGE INTO ... WHEN MATCHED with ROW_NUMBER window
  • Storage cost · Hive: old snapshots and orphans pile up · Iceberg: expire_snapshots + remove_orphan_files on a schedule
  • Historical queries · Hive: snapshot the table to a copy · Iceberg: FOR SYSTEM_VERSION AS OF 8c4f2e, zero copy
PRO benefit · code review

Real review from senior engineers who shipped this stack.

Submit your repo, get line-by-line feedback within 48 hours. The kind of review that's quietly worth thousands of dollars in time-to-staff.

4 reviews / month

Submit a repo, a PR, or a refactor proposal. Reviewer is matched to your domain — Iceberg/Spark for this project. Async, comments inline, average turnaround 31 hours.

31h avg turnaround · 9.2/10 helpfulness · 94% return next month
2 office hours / month

Live 30-min sessions with a senior data engineer. Architecture questions, whiteboard a tricky migration, mock a system-design interview. Group sessions also available.

30 min per session · 2 / mo included · + group (unlimited)
What PRO unlocks

One subscription. 15+ projects, all curriculum, code review.

PRO is built for senior+ engineers who want production-grade builds and feedback loops — not more tutorials.

What you get · FREE · PRO · EXPERT
  • Projects (production-grade builds) · FREE: 2 · PRO: 15+ · EXPERT: 8
  • Curriculum modules (all 7 tracks) · FREE: Phase 1 only · PRO: All · EXPERT: All + bonus
  • Code review credits (senior engineer review) · FREE: 0 · PRO: 4 / month · EXPERT: Unlimited
  • Career path access (5 paths × full plans) · FREE: 1 path · PRO: All 5 · EXPERT: All 5 + 1:1
  • Certificate (verifiable on LinkedIn) · PRO: Yes · EXPERT: Yes + portfolio review
  • Community (Discord + office hours) · FREE: Read-only · PRO: Full + 2/mo · EXPERT: Full + 4/mo
$29/mo · billed monthly · cancel anytime
or annual · $249/yr · save 28%
Upgrade to PRO
Who this is for

Pick this if you’re shipping at scale, not learning to.

Senior data engineers

You've shipped Hive/Parquet pipelines, dealt with the small-files problem, and want to know why everyone is moving to Iceberg.

Staff / tech leads

You're driving the lakehouse migration. You need to understand the failure modes, the migration path, and the cost story before signing off.

Platform engineers

You run the warehouse for 10+ teams. You need to know how Iceberg behaves under MERGE upserts, what to put behind a service, and what to leave open.

Software engineers crossing over

You know systems but the warehouse is opaque. This makes Iceberg feel like the database internals you already understand.

FAQ

Quick answers.

Module 01 (free) gives you a working local stack — Iceberg + Nessie + MinIO + Spark via Docker — and walks you through hidden partitioning and your first ACID table. Most free tutorials hand you a one-liner; this one builds the mental model.
This project does not cover streaming CDC or multi-engine reads. We used to advertise that it did, which was misleading. It is the foundations track: ACID lakehouse, medallion pipelines, and operations. Streaming CDC into Iceberg is its own project (P03). Multi-engine reads (Trino) are a planned appendix.
This project does not build a feature store; feature stores get a dedicated project (P07 PredictFlow). What you build here is the foundation a feature store would sit on top of: partitioned, time-traveled, MERGE-able tables with snapshot isolation.
You do not need a cloud account. Everything runs locally with MinIO standing in for S3 and Nessie as the catalog. The patterns transfer cleanly to S3 + Glue (or S3 Tables) with config changes only.
PRO includes all 15+ PRO projects, 4 code-review credits per month, 2 office-hours sessions, full curriculum across all 7 tracks, all 5 career paths, certificate of completion, and full community access. Cancel anytime.
This project also works as interview preparation. System-design rounds for senior+ DE roles increasingly assume Iceberg or Delta. After this you can whiteboard schema evolution, talk through small-files compaction, explain MERGE upserts, and reason about copy-on-write vs merge-on-read trade-offs.

Ready to ship a real lakehouse?

Start with module 01 — free, no card. About 3-4 hours. By the end you'll have Iceberg + Nessie + MinIO running locally with your first ACID table written and the snapshot tree explored.
