Build a production-grade lakehouse on Apache Iceberg
Run Spark + object storage + a Nessie catalog locally. Ship ACID tables, Bronze → Silver → Gold pipelines on a 1M-row dataset, time travel, and compaction — no warehouse required.
This is the system-design question asked at Snowflake, Databricks, Netflix and any company running on Iceberg.
- A working Iceberg lakehouse running locally (Spark + Nessie + MinIO via Docker)
- Bronze → Silver → Gold medallion tables on a 1M-row synthetic e-commerce dataset
- MERGE-based upserts with row-level de-duplication
- Time-travel queries (FOR SYSTEM_VERSION AS OF, FOR SYSTEM_TIME AS OF)
- A maintenance runbook: compaction, snapshot expiry, orphan file cleanup
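As a rough illustration of the time-travel item above, the sketch below builds the two query shapes as strings you could hand to `spark.sql(...)`. The table name `nessie.silver.orders` and the helper names are hypothetical, and no Spark session is created here; the snapshot ID would normally come from the table's `snapshots` metadata table.

```python
from datetime import datetime


def list_snapshots_sql(table: str) -> str:
    """Query the Iceberg `snapshots` metadata table to find snapshot IDs."""
    return (
        f"SELECT snapshot_id, committed_at, operation "
        f"FROM {table}.snapshots ORDER BY committed_at"
    )


def time_travel_by_version(table: str, snapshot_id: int) -> str:
    """Read the table as of a specific snapshot ID."""
    return f"SELECT * FROM {table} FOR SYSTEM_VERSION AS OF {snapshot_id}"


def time_travel_by_time(table: str, as_of: datetime) -> str:
    """Read the table as of a wall-clock timestamp."""
    ts = as_of.strftime("%Y-%m-%d %H:%M:%S")
    return f"SELECT * FROM {table} FOR SYSTEM_TIME AS OF '{ts}'"


if __name__ == "__main__":
    table = "nessie.silver.orders"  # hypothetical table name
    print(list_snapshots_sql(table))
    print(time_travel_by_version(table, 42))  # 42 is a placeholder snapshot ID
    print(time_travel_by_time(table, datetime(2024, 1, 15, 9, 30)))
```

Both queries read existing data files through older metadata, so no data is copied. A snapshot that has already been expired can no longer be queried this way.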
Iceberg is the table format everyone is moving to.
The patterns you ship in this project — ACID + time travel + hidden partitioning + compaction — are the ones in production at the companies setting the bar for data infra.
Iceberg vs Snowflake cost
Object storage + open table format breaks the warehouse pricing curve. The same patterns that work locally scale to S3 without a vendor lock-in tax.
Lakehouse vs warehouse
ACID, time travel, and schema evolution on object storage mean you stop choosing between cheap-and-dumb (Hive) and expensive-and-managed (Snowflake).
GDPR-grade deletes
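Iceberg's row-level DELETE or MERGE, followed by snapshot expiry, is how you actually delete a user's row across 14k Parquet files and prove you did it. Until the older snapshots expire, the row is still reachable through time travel.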
Multi-team data reuse
Hidden partitioning + snapshot isolation means analytics, ML, and ad-hoc queries can read the same tables without breaking each other.
Module 01 is free. The rest unlocks with PRO.
Try the first 3-4 hours — stand up the local lakehouse, write your first ACID table, explore the snapshot tree. If it clicks, upgrade to unlock the medallion build and production maintenance modules.
Apache Iceberg & Modern Lakehouse Architecture
This curriculum is the foundation for the project — not a sales add-on. PRO subscribers get full access to every module.
Three sprints. Three checkpoints. One production lakehouse.
Each phase ends with a tagged commit and a working artifact. No ambiguity about where you are.
Iceberg + Nessie + MinIO + Spark running locally via Docker. First ACID table created with hidden partitioning. Snapshot history visible.
- ✓ Docker Compose stack (Spark, Nessie, MinIO)
- ✓ 3 Iceberg tables (orders, customers, products)
- ✓ Spark + Nessie catalog wired
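To make "hidden partitioning" concrete, here is a minimal sketch that builds the DDL and a day-range query as strings for `spark.sql(...)`. The catalog name `nessie`, the table name, and the column list are assumptions for illustration, not the project's actual schema.

```python
from datetime import date, timedelta


def create_orders_table_sql(table: str = "nessie.bronze.orders") -> str:
    """DDL for an Iceberg table partitioned by the day of order_ts."""
    return "\n".join([
        f"CREATE TABLE IF NOT EXISTS {table} (",
        "  order_id BIGINT,",
        "  customer_id BIGINT,",
        "  amount DECIMAL(12, 2),",
        "  order_ts TIMESTAMP",
        ") USING iceberg",
        # days(...) is an Iceberg partition transform; no extra date column is stored.
        "PARTITIONED BY (days(order_ts))",
    ])


def orders_for_day_sql(table: str, day: date) -> str:
    """Filter on the source column only; Iceberg maps the range to partitions."""
    next_day = day + timedelta(days=1)
    return (
        f"SELECT order_id, amount FROM {table} "
        f"WHERE order_ts >= TIMESTAMP '{day.isoformat()} 00:00:00' "
        f"AND order_ts < TIMESTAMP '{next_day.isoformat()} 00:00:00'"
    )


if __name__ == "__main__":
    print(create_orders_table_sql())
    print(orders_for_day_sql("nessie.bronze.orders", date(2024, 1, 31)))
```

The query never mentions a partition column. The "hidden" part is that the filter on `order_ts` is enough for Iceberg to skip files belonging to other days.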
1M-row synthetic dataset landed as bronze. MERGE-based upsert into silver. Idempotent gold aggregation by date / region / category.
- ✓ 1M-row Faker bronze table
- ✓ Silver table with MERGE de-dup
- ✓ Gold sales-summary aggregations
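One way the MERGE de-dup step could look is sketched below: keep only the latest bronze row per key with a `ROW_NUMBER` window, then upsert into silver. The table names, key, and column list are hypothetical; the de-dup matters because MERGE raises an error when several source rows match the same target row.

```python
from typing import Sequence


def merge_dedup_sql(
    target: str,
    source: str,
    key: str,
    order_by: str,
    columns: Sequence[str],
) -> str:
    """Build a MERGE that upserts the newest source row per key into target."""
    col_list = ", ".join(columns)
    # The key is the join column, so it is not reassigned on update.
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in columns if c != key)
    insert_vals = ", ".join(f"s.{c}" for c in columns)
    return "\n".join([
        f"MERGE INTO {target} AS t",
        "USING (",
        f"  SELECT {col_list} FROM (",
        f"    SELECT {col_list},",
        f"           ROW_NUMBER() OVER (PARTITION BY {key} ORDER BY {order_by} DESC) AS rn",
        f"    FROM {source}",
        "  ) ranked",
        "  WHERE rn = 1",
        ") AS s",
        f"ON t.{key} = s.{key}",
        f"WHEN MATCHED THEN UPDATE SET {set_clause}",
        f"WHEN NOT MATCHED THEN INSERT ({col_list}) VALUES ({insert_vals})",
    ])


if __name__ == "__main__":
    print(merge_dedup_sql(
        target="nessie.silver.orders",   # hypothetical names throughout
        source="nessie.bronze.orders",
        key="order_id",
        order_by="updated_at",
        columns=["order_id", "status", "updated_at"],
    ))
```

Because the statement selects the newest row per key on every run, re-running it against the same bronze data leaves silver unchanged, which is what makes the step idempotent.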
Compaction, snapshot expiry, and orphan-file cleanup running as a maintenance runbook. Time-travel queries against historic snapshots.
- ✓ Compaction job (sort + bin-pack)
- ✓ Snapshot expiry + orphan cleanup
- ✓ Maintenance runbook with health checks
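The three maintenance steps map onto Iceberg's Spark procedures. The sketch below only builds the `CALL` statements in runbook order. The catalog name, table name, sort order, and retention values are assumptions, and the orphan cleanup defaults to a dry run. Nessie also ships its own garbage-collection tooling, so it is worth checking its guidance before running expiry or orphan cleanup against a catalog with multiple branches.

```python
from datetime import datetime, timedelta
from typing import List


def _ts(dt: datetime) -> str:
    """Format a datetime as a Spark TIMESTAMP literal body."""
    return dt.strftime("%Y-%m-%d %H:%M:%S.000")


def maintenance_statements(
    catalog: str,
    table: str,
    now: datetime,
    sort_order: str,
    retain_days: int = 7,
    retain_last: int = 5,
    dry_run: bool = True,
) -> List[str]:
    """Return compaction, snapshot-expiry, and orphan-cleanup calls in order."""
    cutoff = _ts(now - timedelta(days=retain_days))
    dry = "true" if dry_run else "false"
    return [
        # 1. Compact small files, sorting rows while rewriting.
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', strategy => 'sort', sort_order => '{sort_order}')",
        # 2. Drop snapshots older than the cutoff, keeping the most recent few.
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{cutoff}', "
        f"retain_last => {retain_last})",
        # 3. Find (or delete) files no snapshot references any more.
        f"CALL {catalog}.system.remove_orphan_files("
        f"table => '{table}', older_than => TIMESTAMP '{cutoff}', dry_run => {dry})",
    ]


if __name__ == "__main__":
    for stmt in maintenance_statements(
        "nessie", "silver.orders", datetime(2024, 1, 15, 12, 0), "order_ts ASC"
    ):
        print(stmt)
```

Expiry runs after compaction so that the small files replaced by the rewrite become unreferenced once their snapshots age out. Shortening `retain_days` also shortens how far back time travel can reach.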
One command. Local Iceberg + Nessie + Spark + MinIO.
You get a real stack on day one — local S3 (MinIO), the Nessie REST catalog, Spark 3.5, and 1M+ synthetic transactions ready to load.
What lives in the repo
Everything you need to stand up a production-shaped lakehouse on your laptop, plus the seed scripts and verification queries used in modules 02–04.
- docker-compose.yml — Spark, Nessie REST, MinIO
- seeds/transactions/ — 1M synthetic retail rows (Faker)
- spark/jobs/ — bronze, silver MERGE, gold aggregation scripts
- nessie/ — catalog config + branching examples
- maintenance/ — compaction, snapshot expiry, orphan cleanup runbook
Iceberg Lakehouse Starter Kit
Pre-configured Docker stack, sample CSVs, and the bronze/silver/gold scaffolds. Skip the boilerplate, start on module 01.
The same lakehouse — but built for the 10x case.
Most Iceberg tutorials show you the CREATE TABLE. This one shows what changes when the table is 100GB+, the schema is mid-evolution, and one team is upserting while another is reading.
- ALTER TABLE: column IDs preserved
- ROLLBACK TO SNAPSHOT: instant, atomic
- rewrite_data_files sort + bin-pack
- MERGE INTO ... WHEN MATCHED with ROW_NUMBER window
- expire_snapshots + remove_orphan_files on a schedule
- FOR SYSTEM_VERSION AS OF 8c4f2e: zero copy
Real review from senior engineers who shipped this stack.
Submit your repo, get line-by-line feedback within 48 hours. The kind of review that's quietly worth thousands of dollars in time-to-staff.
4 reviews / month
Submit a repo, a PR, or a refactor proposal. Reviewer is matched to your domain — Iceberg/Spark for this project. Async, comments inline, average turnaround 31 hours.
2 office hours / month
Live 30-min sessions with a senior data engineer. Architecture questions, whiteboard a tricky migration, mock a system-design interview. Group sessions also available.
One subscription. 15+ projects, all curriculum, code review.
PRO is built for senior+ engineers who want production-grade builds and feedback loops — not more tutorials.
Pick this if you’re shipping at scale, not learning to.
Senior data engineers
You've shipped Hive/Parquet pipelines, dealt with the small-files problem, and want to know why everyone is moving to Iceberg.
Staff / tech leads
You're driving the lakehouse migration. You need to understand the failure modes, the migration path, and the cost story before signing off.
Platform engineers
You run the warehouse for 10+ teams. You need to know how Iceberg behaves under MERGE upserts, what to put behind a service, and what to leave open.
Software engineers crossing over
You know systems but the warehouse is opaque. This makes Iceberg feel like the database internals you already understand.
Going deeper? Three tracks back this project.
Iceberg fundamentals are the spine. These three curriculums let you go deeper on the layers that matter most — modeling, object storage, and Spark internals.
Ready to ship a real lakehouse?
Start with module 01 — free, no card. About 3-4 hours. By the end you'll have Iceberg + Nessie + MinIO running locally with your first ACID table written and the snapshot tree explored.