
Petabyte-Scale Iceberg Lakehouse

Modernize a traditional data lake by implementing an ACID-compliant table format, allowing time-travel queries and schema evolution at massive scale.

20-25 hrs · Advanced · 4 Parts
Apache Iceberg · Apache Spark · Apache Flink · Trino · Debezium · Apache Kafka · Docker · MinIO
  PostgreSQL     App Events    External APIs
       |               |              |
       v               v              v
   Debezium       Kafka Producer  REST/Batch
       |               |              |
       +-------+-------+-------+------+
               |
               v
        Apache Kafka
       (Message Broker)
               |
       +-------+-------+
       |       |       |
       v       v       v
    Spark    Flink   Kafka
   Stream    SQL    Connect
       |       |       |
       +-------+-------+
               |
               v
        Apache Iceberg
        (Table Format)
        [REST Catalog]
               |
               v
         MinIO / S3
       (Object Storage)
               ^
       +-------+-------+
       |       |       |
    Spark    Trino   Flink
   (Batch)  (OLAP)  (Stream)

Fig 1.1: End-to-end Iceberg lakehouse architecture
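The time-travel queries mentioned above are exposed in Spark SQL through Iceberg's `TIMESTAMP AS OF` and `VERSION AS OF` clauses. A minimal sketch of how such queries are built (the table name `lake.sales` and the snapshot id are hypothetical examples, not values from this project):

```python
from typing import Optional

# Sketch: constructing Iceberg time-travel queries in Spark SQL syntax.
# The table name and snapshot id used below are made up for illustration;
# in a real job each string would be passed to spark.sql(...).

def time_travel_query(table: str, *, timestamp: Optional[str] = None,
                      snapshot_id: Optional[int] = None) -> str:
    """Return a Spark SQL SELECT against an Iceberg table, optionally
    pinned to a wall-clock timestamp or a specific snapshot id."""
    if timestamp is not None:
        return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"
    if snapshot_id is not None:
        return f"SELECT * FROM {table} VERSION AS OF {snapshot_id}"
    return f"SELECT * FROM {table}"

print(time_travel_query("lake.sales", timestamp="2024-01-01 00:00:00"))
print(time_travel_query("lake.sales", snapshot_id=8744736658442914487))
```

Because every write produces a new immutable snapshot, the timestamp variant reads the table exactly as it looked at that moment, with no extra copies of the data.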

What You'll Build

Lakehouse Architecture

Complete multi-service setup with Iceberg, Spark, Flink, Trino, and Kafka

Real-time Sync

CDC pipeline from PostgreSQL to Iceberg via Debezium with exactly-once semantics
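The CDC leg of that pipeline starts by registering a Debezium PostgreSQL connector with Kafka Connect. A sketch of the registration payload follows; the connector name, hostnames, credentials, and table list are placeholders assumed for this project, not values taken from the source:

```python
import json

# Sketch of a Debezium PostgreSQL source-connector config. All names and
# credentials below are hypothetical; in practice this JSON is POSTed to
# the Kafka Connect REST endpoint (POST /connectors).
connector = {
    "name": "icelake-postgres-cdc",          # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",     # assumed Docker service name
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "dbz",
        "database.dbname": "icelake",
        "topic.prefix": "cdc",               # Kafka topics become cdc.<schema>.<table>
        "table.include.list": "public.orders,public.inventory",
        "plugin.name": "pgoutput",           # PostgreSQL logical decoding plugin
    },
}

payload = json.dumps(connector, indent=2)
print(payload)
```

Exactly-once delivery into Iceberg then comes from the sink side (e.g. Flink checkpoints committing Iceberg snapshots atomically), not from the connector itself.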

Multi-Engine Queries

Same tables queryable from Spark, Flink, and Trino simultaneously

Auto Maintenance

Automated compaction, snapshot expiration, and orphan file cleanup
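Each of those three maintenance tasks maps to a stored procedure Iceberg ships for Spark. A sketch of a maintenance plan as Spark SQL `CALL` statements; the catalog name `rest`, the table `db.orders`, and the retention timestamp are hypothetical:

```python
# Sketch: the three maintenance tasks named above, expressed as Iceberg's
# Spark SQL procedures. Catalog/table names and the cutoff timestamp are
# placeholders; a real job would run each string via spark.sql(...).

def maintenance_plan(catalog: str, table: str, older_than: str) -> list:
    """Return CALL statements for compaction, snapshot expiration, and
    orphan-file cleanup, in that order."""
    return [
        f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{older_than}')",
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}')",
    ]

for stmt in maintenance_plan("rest", "db.orders", "2024-01-01 00:00:00"):
    print(stmt)
```

Ordering matters: compacting first rewrites small files into larger ones, expiring snapshots then releases the superseded files, and orphan cleanup removes anything no snapshot references.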

Business Scenario

IceLake Commerce

IceLake Commerce is a fast-growing e-commerce platform processing millions of transactions daily. Their data infrastructure team needs to modernize the analytics platform from a legacy Hive warehouse to a modern Iceberg lakehouse.

Current Challenges

  • Legacy Hive warehouse with performance issues
  • Batch-only analytics with 24-hour data latency
  • No support for updates/deletes (a GDPR compliance risk)
  • Separate, duplicated systems for different query engines
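The updates/deletes gap is one concrete payoff of the migration: Iceberg tables accept row-level deletes, so a GDPR erasure request becomes a single SQL statement. A sketch (the table and column names are made up for illustration):

```python
# Sketch: a GDPR "right to erasure" delete against an Iceberg table.
# Table/column names are hypothetical; in practice the string would be
# executed with spark.sql(gdpr_delete("lake.customers", 42)).

def gdpr_delete(table: str, customer_id: int) -> str:
    """Return a row-level DELETE statement for one customer's records."""
    return f"DELETE FROM {table} WHERE customer_id = {customer_id}"

print(gdpr_delete("lake.customers", 42))
```

On a legacy Hive table the same request would mean rewriting whole partitions by hand; Iceberg handles the rewrite (or delete files, under merge-on-read) transactionally.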

Your Mission

  • Real-time inventory and sales analytics
  • Multi-engine support for different teams
  • ML features for personalization engine
  • Cost-effective storage with S3-compatible backend

Progressive Learning Path

Each part builds on the previous. Master Iceberg from foundation to production.

Total: 20-25 hours across 4 parts

Prerequisites

  • Docker Desktop (8GB+ RAM allocated)
  • Basic SQL knowledge (SELECT, JOIN, GROUP BY)
  • Python familiarity for PySpark scripts
  • Conceptual understanding of data lakes (helpful)

Related Learning Path

This capstone project is the culmination of the Iceberg Deep Dive skill toolkit. Complete the prerequisite modules first, or dive straight in if you have prior experience.

View Iceberg Skill Toolkit

Ready to build?
