Featured Project · ~13 hrs

IceLakeCommerce

Migrate a Hive-based data lake to a production-grade Apache Iceberg lakehouse. Build ACID tables, CDC pipelines, streaming ingestion, and multi-engine queries for a fast-growing e-commerce marketplace processing 2B+ events monthly.

4 Parts · 500+ Tables · 5 Engines
icelake / lakehouse-pipeline

  • STORE: Table Creation, ACID Writes, Metadata Inspection, Time Travel
  • EVOLVE: Schema Evolution, Partition Evolution, Catalog Management, Multi-Engine
  • STREAM: Flink Streaming, CDC Pipeline, Exactly-Once, Bronze/Silver
  • OPERATE: Compaction, Monitoring, ML Features, Deployment

fig 1 — lakehouse pipeline

  • TABLES: 500+ migrated from Hive
  • ENGINES: 5 (Spark, Flink, Trino, DuckDB, PyIceberg)
  • CDC: 2B+ events/mo processed
  • LATENCY: sub-minute query response

What You'll Build

A production-grade Iceberg lakehouse with ACID transactions, time travel, CDC streaming, automated maintenance, and multi-engine analytics.

ACID Lakehouse Tables

Iceberg tables with snapshot isolation, time travel queries, WAP branching for safe writes, and row-level deletes across 500+ migrated tables
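Time travel works because every commit appends an entry to the table's snapshot log; a query "as of" a timestamp simply reads the latest snapshot committed at or before that time. A minimal pure-Python sketch of that resolution step, using an illustrative snapshot log rather than real table metadata:

```python
# Sketch of Iceberg's time-travel resolution: given the table's snapshot log
# (commit timestamp + snapshot id per entry), an "AS OF <timestamp>" query
# reads the latest snapshot committed at or before that timestamp.
# The log below is illustrative, not real table metadata.

def snapshot_as_of(snapshot_log, ts_ms):
    """Return the snapshot id current at ts_ms, or None if the table
    did not exist yet."""
    current = None
    for entry in sorted(snapshot_log, key=lambda e: e["timestamp-ms"]):
        if entry["timestamp-ms"] <= ts_ms:
            current = entry["snapshot-id"]
        else:
            break
    return current

log = [
    {"timestamp-ms": 1_000, "snapshot-id": 101},  # initial load
    {"timestamp-ms": 2_000, "snapshot-id": 102},  # after an append
    {"timestamp-ms": 3_000, "snapshot-id": 103},  # after a row-level delete
]

assert snapshot_as_of(log, 2_500) == 102   # travel to just before the delete
assert snapshot_as_of(log, 500) is None    # before the table existed
```

Because old snapshots stay readable until they are expired, the same mechanism backs rollback and reproducible reads.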

CDC Streaming Pipeline

Debezium captures PostgreSQL changes, Kafka buffers events, Flink writes to Iceberg with exactly-once semantics in a Bronze/Silver pattern
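The Bronze layer keeps every raw change event append-only, while Silver applies those events in commit order to hold the current row per key. A small sketch of that apply step, using Debezium's op codes (c=create, u=update, d=delete, r=snapshot read) on made-up order events:

```python
# Sketch of the Bronze/Silver pattern: Bronze retains every raw change event;
# Silver replays them in commit order to materialize the current row per key.
# Op codes follow Debezium's convention: c=create, u=update, d=delete, r=read.

def apply_cdc(events):
    silver = {}
    for ev in events:  # events must be applied in commit order
        key = ev["key"]
        if ev["op"] in ("c", "u", "r"):
            silver[key] = ev["after"]   # upsert the new row image
        elif ev["op"] == "d":
            silver.pop(key, None)       # row-level delete
    return silver

bronze = [
    {"op": "c", "key": 1, "after": {"id": 1, "status": "placed"}},
    {"op": "u", "key": 1, "after": {"id": 1, "status": "shipped"}},
    {"op": "c", "key": 2, "after": {"id": 2, "status": "placed"}},
    {"op": "d", "key": 2, "after": None},
]

assert apply_cdc(bronze) == {1: {"id": 1, "status": "shipped"}}
```

In the actual pipeline Flink performs this merge into Iceberg tables; the sketch only shows the ordering and upsert/delete semantics the pipeline has to preserve.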

Multi-Engine Analytics

Trino for ad-hoc queries with predicate pushdown, DuckDB for local development, PyIceberg for notebooks — all reading the same tables

ML Feature Platform

Feast feature store with point-in-time correctness, MLflow training data versioning via snapshot IDs, and Grafana monitoring dashboards
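Point-in-time correctness means each training label sees only the latest feature value observed at or before the label's timestamp, so nothing from the future leaks into training data. A minimal pure-Python sketch of that as-of join, on illustrative (entity, timestamp, value) tuples:

```python
# Sketch of a point-in-time (as-of) join: for each training label, pick the
# latest feature value observed at or before the label's timestamp, so no
# future information leaks into the training set.

def point_in_time_join(labels, features):
    """labels: [(entity, label_ts)]; features: [(entity, feature_ts, value)]."""
    out = []
    for entity, label_ts in labels:
        eligible = [
            (ts, v) for e, ts, v in features
            if e == entity and ts <= label_ts     # no future values
        ]
        value = max(eligible)[1] if eligible else None
        out.append((entity, label_ts, value))
    return out

features = [("u1", 10, 0.2), ("u1", 20, 0.9)]
labels = [("u1", 15), ("u1", 25)]

# at t=15 only the t=10 value is visible; at t=25 the t=20 value is
assert point_in_time_join(labels, features) == [("u1", 15, 0.2), ("u1", 25, 0.9)]
```

Feast implements this join for you; pinning the underlying Iceberg snapshot id in MLflow then makes the whole training set reproducible.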

Curriculum

4 parts, each with a clear checkpoint. Build incrementally, test as you go.

Technical Standards

Production patterns you'll implement across the Iceberg lakehouse.

ACID
500+ tables

Every table has snapshot isolation, optimistic concurrency control, and full ACID transactions — no more silent data corruption from concurrent writers
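Optimistic concurrency control boils down to one rule: a writer's commit swaps the table's current snapshot pointer, and the swap only succeeds if the base snapshot the writer read is still current; losers refresh and retry. A toy sketch of that rule (the class and retry helper are illustrative, not Iceberg's API):

```python
# Sketch of optimistic concurrency control as an Iceberg catalog enforces it:
# a commit replaces the current snapshot pointer only if the writer's base
# snapshot is still current; a racing writer is rejected and must retry
# against refreshed metadata instead of silently clobbering the other write.

class Table:
    def __init__(self):
        self.current_snapshot = 0

    def commit(self, base_snapshot, new_snapshot):
        if self.current_snapshot != base_snapshot:
            return False            # someone committed first: caller retries
        self.current_snapshot = new_snapshot
        return True

def write_with_retry(table, make_snapshot, retries=3):
    for _ in range(retries):
        base = table.current_snapshot
        if table.commit(base, make_snapshot(base)):
            return True
    return False

t = Table()
assert t.commit(0, 1) is True            # writer A wins the race
assert t.commit(0, 2) is False           # writer B's base is stale: rejected
assert write_with_retry(t, lambda b: b + 1) is True
assert t.current_snapshot == 2
```

The rejected writer loses a retry, never data, which is exactly the "no silent corruption from concurrent writers" guarantee above.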

THROUGHPUT
2B+ events/mo

CDC streaming pipeline processes billions of change events with exactly-once semantics through Debezium, Kafka, and Flink
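Exactly-once hinges on committing data and the source offset atomically per checkpoint (the Flink Iceberg sink does this within a single table commit), so a replay after failure is recognized and skipped. A toy sketch of that idea; the class is illustrative, not Flink's API:

```python
# Sketch of exactly-once delivery: the sink commits rows and the source offset
# in one atomic step per checkpoint, so replaying the same batch after a
# failure never double-writes.

class ExactlyOnceSink:
    def __init__(self):
        self.rows = []
        self.committed_offset = -1

    def commit(self, batch, up_to_offset):
        # rows and offset land in one atomic commit
        if up_to_offset <= self.committed_offset:
            return                 # replayed checkpoint: already applied
        self.rows.extend(batch)
        self.committed_offset = up_to_offset

sink = ExactlyOnceSink()
sink.commit(["e0", "e1"], up_to_offset=1)
sink.commit(["e0", "e1"], up_to_offset=1)   # crash + replay of the same batch
sink.commit(["e2"], up_to_offset=2)

assert sink.rows == ["e0", "e1", "e2"]      # each event written exactly once
```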

QUERY
10x faster

Metadata pruning, hidden partitioning, and automated compaction reduce query latency from 30 minutes to under 3 minutes
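Most of that speedup comes from never opening irrelevant files: Iceberg manifests record per-file column min/max statistics, so the planner drops whole data files whose value range cannot match the predicate before any Parquet is read. A minimal sketch with made-up file stats:

```python
# Sketch of metadata pruning: Iceberg manifests carry per-file column
# min/max stats, letting the planner skip data files whose value range
# cannot satisfy the predicate, before reading any Parquet.

files = [
    {"path": "f1.parquet", "min_order_ts": 100, "max_order_ts": 199},
    {"path": "f2.parquet", "min_order_ts": 200, "max_order_ts": 299},
    {"path": "f3.parquet", "min_order_ts": 300, "max_order_ts": 399},
]

def prune(files, lo, hi):
    """Keep only files whose [min, max] range overlaps [lo, hi]."""
    return [f["path"] for f in files
            if f["max_order_ts"] >= lo and f["min_order_ts"] <= hi]

# a query for order_ts BETWEEN 250 AND 320 touches only 2 of 3 files
assert prune(files, 250, 320) == ["f2.parquet", "f3.parquet"]
```

Compaction keeps this effective: many small files mean many overlapping ranges, so bin-packing them into larger files tightens the stats and prunes more.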

Environment Setup

Spin up the lakehouse stack and create your first Iceberg table.

icelake-commerce
# Clone the project & launch the lakehouse stack
$ git clone https://github.com/aide-hub/icelake-commerce.git
$ cd icelake-commerce

# Start Spark + REST Catalog + MinIO + Kafka
$ docker-compose -f docker-compose.lakehouse.yml up -d

# Create your first Iceberg table
$ python -m icelake.pipeline create_table \
    --catalog rest --warehouse s3://lakehouse \
    --table orders --format iceberg
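Under the hood, the compose stack wires Spark to the REST catalog and MinIO. A typical spark-defaults.conf fragment for that wiring looks like the following; the service names (`rest`, `minio`), ports, and the `s3://lakehouse` warehouse path are assumptions to check against your compose file:

```properties
# Register an Iceberg catalog named "rest" backed by the REST catalog service
spark.sql.extensions                org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.rest              org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.rest.type         rest
spark.sql.catalog.rest.uri          http://rest:8181
spark.sql.catalog.rest.warehouse    s3://lakehouse
# Point Iceberg's S3 file IO at the local MinIO endpoint
spark.sql.catalog.rest.io-impl      org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.rest.s3.endpoint  http://minio:9000
```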

Tech Stack

Apache Iceberg · PySpark · Flink · Trino · DuckDB · PyIceberg · Kafka · Debezium · Docker · Parquet

Prerequisites

  • SQL proficiency (CTEs, window functions, aggregations)
  • Basic Python (PySpark, pandas, pip/virtualenv)
  • Understanding of data lake concepts (Parquet, partitioning, catalogs)
  • Docker basics (containers, compose files)

Related Learning Path

Master Iceberg architecture, metadata layers, table operations, and production patterns before diving into this project.

Apache Iceberg Deep Dive

Ready to build your Iceberg lakehouse?

Start with Part 1: Iceberg Foundations
