IceLakeCommerce
Migrate a Hive-based data lake to a production-grade Apache Iceberg lakehouse. Build ACID tables, CDC pipelines, streaming ingestion, and multi-engine queries for a fast-growing e-commerce marketplace processing 2B+ events monthly.
Fig. 1 — Lakehouse pipeline
TABLES
500+
Migrated from Hive
ENGINES
5
Spark, Flink, Trino, DuckDB, PyIceberg
CDC
2B+/mo
Events Processed
LATENCY
Sub-min
Query Response
What You'll Build
A production-grade Iceberg lakehouse with ACID transactions, time travel, CDC streaming, automated maintenance, and multi-engine analytics.
ACID Lakehouse Tables
Iceberg tables with snapshot isolation, time travel queries, WAP branching for safe writes, and row-level deletes across 500+ migrated tables
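Time travel falls out of Iceberg's design: every commit produces an immutable snapshot, and a reader simply pins itself to the snapshot that was current at a given time. A minimal pure-Python model of that resolution logic (class and method names here are illustrative, not the PyIceberg API):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp_ms: int      # commit time
    manifest_list: str     # pointer to the data files in this table state

@dataclass
class TableMetadata:
    snapshots: list = field(default_factory=list)  # ordered by commit time

    def commit(self, snapshot: Snapshot) -> None:
        self.snapshots.append(snapshot)

    def snapshot_as_of(self, timestamp_ms: int):
        """Return the latest snapshot committed at or before timestamp_ms."""
        eligible = [s for s in self.snapshots if s.timestamp_ms <= timestamp_ms]
        return max(eligible, key=lambda s: s.timestamp_ms) if eligible else None

# Three commits; a reader pinned to t=1500 sees the second table state.
table = TableMetadata()
table.commit(Snapshot(1, 1000, "s3://lakehouse/orders/manifests/m1.avro"))
table.commit(Snapshot(2, 1400, "s3://lakehouse/orders/manifests/m2.avro"))
table.commit(Snapshot(3, 2000, "s3://lakehouse/orders/manifests/m3.avro"))
assert table.snapshot_as_of(1500).snapshot_id == 2
```

Because old snapshots stay readable until expired, `AS OF` queries and rollbacks are just pointer lookups, not data copies.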
CDC Streaming Pipeline
Debezium captures PostgreSQL changes, Kafka buffers events, Flink writes to Iceberg with exactly-once semantics in a Bronze/Silver pattern
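The Bronze/Silver flow can be pictured in miniature: Bronze retains the raw change events, and Silver is the keyed upsert view produced by replaying them. A simplified sketch, assuming Debezium's `op`/`before`/`after` envelope (the function name and `order_id` key are illustrative):

```python
def apply_cdc(events):
    """Replay Debezium-style change events into a keyed 'Silver' view.

    op codes follow Debezium: 'c' = create, 'u' = update,
    'd' = delete, 'r' = snapshot read.
    """
    silver = {}
    for ev in events:
        op = ev["op"]
        if op in ("c", "u", "r"):
            row = ev["after"]
            silver[row["order_id"]] = row      # upsert latest row image
        elif op == "d":
            silver.pop(ev["before"]["order_id"], None)
    return silver

bronze = [
    {"op": "c", "after": {"order_id": 1, "status": "placed"}},
    {"op": "c", "after": {"order_id": 2, "status": "placed"}},
    {"op": "u", "after": {"order_id": 1, "status": "shipped"}},
    {"op": "d", "before": {"order_id": 2}},
]
assert apply_cdc(bronze) == {1: {"order_id": 1, "status": "shipped"}}
```

In the real pipeline Flink performs this upsert/delete merge into Iceberg with checkpoint-backed exactly-once guarantees; the sketch only shows the replay semantics.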
Multi-Engine Analytics
Trino for ad-hoc queries with predicate pushdown, DuckDB for local development, PyIceberg for notebooks — all reading the same tables
ML Feature Platform
Feast feature store with point-in-time correctness, MLflow training data versioning via snapshot IDs, and Grafana monitoring dashboards
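Point-in-time correctness means each training row may only see feature values that existed at its event time, never later ones, which prevents label leakage. A stdlib-only sketch of that lookup (the project uses Feast for this; the function and entity names here are illustrative):

```python
from bisect import bisect_right

def point_in_time_lookup(feature_log, entity, event_ts):
    """Latest feature value for `entity` with feature_ts <= event_ts, else None.

    feature_log: {entity: [(feature_ts, value), ...]} sorted by feature_ts.
    """
    rows = feature_log.get(entity, [])
    idx = bisect_right(rows, (event_ts, float("inf")))
    return rows[idx - 1][1] if idx else None

# Customer spend feature, updated at t=10 and t=20.
log = {"cust_1": [(10, 50.0), (20, 75.0)]}
assert point_in_time_lookup(log, "cust_1", 15) == 50.0  # t=20 value is hidden
assert point_in_time_lookup(log, "cust_1", 25) == 75.0
assert point_in_time_lookup(log, "cust_1", 5) is None   # feature didn't exist yet
```

With Iceberg underneath, the same guarantee extends to reproducibility: pinning training reads to a snapshot ID (recorded in MLflow) replays exactly the data a model was trained on.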
Curriculum
4 parts, each with a clear checkpoint. Build incrementally, test as you go.
Technical Standards
Production patterns you'll implement across the Iceberg lakehouse.
Every table has snapshot isolation, optimistic concurrency control, and full ACID transactions — no more silent data corruption from concurrent writers
CDC streaming pipeline processes billions of change events with exactly-once semantics through Debezium, Kafka, and Flink
Metadata pruning, hidden partitioning, and automated compaction reduce query latency from 30 minutes to under 3 minutes
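Hidden partitioning works by deriving partition values from column data through declared transforms, so queries filter on the raw column and Iceberg prunes files automatically. Simplified versions of two transforms from the Iceberg table spec, `truncate` for integers and `day` for timestamps (the spec's `bucket` transform uses a murmur3 hash and is omitted here):

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def truncate(width, value):
    """Iceberg integer truncate: floor value to a multiple of width."""
    return value - (value % width)

def day(ts):
    """Iceberg day transform: whole days since the Unix epoch."""
    return (ts - EPOCH).days

# order_total partitioned by truncate(100): 1234 lands in the [1200, 1300) file group.
assert truncate(100, 1234) == 1200
# order_ts partitioned by day(): any time on 2024-03-01 maps to the same value,
# so a filter on the raw timestamp column still prunes whole files.
assert day(datetime(2024, 3, 1, 12, 30, tzinfo=timezone.utc)) == 19783
```

Because the transform is stored in table metadata, writers can never partition inconsistently and readers never need to know the partition scheme, unlike Hive-style path partitioning.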
Environment Setup
Spin up the lakehouse stack and create your first Iceberg table.
# Clone the project & launch the lakehouse stack
$ git clone https://github.com/aide-hub/icelake-commerce.git
$ cd icelake-commerce

# Start Spark + REST Catalog + MinIO + Kafka
$ docker-compose -f docker-compose.lakehouse.yml up -d

# Create your first Iceberg table
$ python -m icelake.pipeline create_table \
    --catalog rest --warehouse s3://lakehouse \
    --table orders --format iceberg
Tech Stack
Prerequisites
- SQL proficiency (CTEs, window functions, aggregations)
- Basic Python (PySpark, pandas, pip/virtualenv)
- Understanding of data lake concepts (Parquet, partitioning, catalogs)
- Docker basics (containers, compose files)
Related Learning Path
Master Iceberg architecture, metadata layers, table operations, and production patterns before diving into this project.
Apache Iceberg Deep Dive
Ready to build your Iceberg lakehouse?
Start with Part 1: Iceberg Foundations