Apache Iceberg Table Format

Name: Apache Iceberg Table Format
Price: 29 USD
Availability: InStock
Author: AI-DE Engineering Team

Open table format mastery — time travel, schema evolution, and multi-engine analytics.

Hive is dying. Iceberg is the standard the entire open-data ecosystem (Spark, Trino, Flink, Snowflake, BigQuery, Dremio) is converging on. Engineers who own table format decisions own the lakehouse — and the lakehouse is where the data lives.

What you’ll be able to do

Create and manage Iceberg tables with time travel, snapshots, and rollback
Implement schema and partition evolution safely without breaking downstream consumers
Configure REST / Hive / Glue / Nessie catalogs for multi-engine access from Spark, Trino, and Flink
Run production Iceberg with compaction, streaming CDC, and zero-downtime Hive migrations

Curriculum

Phase 1: Your First Iceberg Table

Hands-on table creation and exploration

Your First Iceberg Table

Spin up Iceberg locally with Spark + a REST catalog, create your first table, INSERT a few rows, query with time travel, and see the metadata + manifest files on disk — before any production decisions.
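To make the time-travel step concrete, here is a small Python model of how a timestamp query might pick a snapshot: the table keeps an ordered snapshot log, and "as of" resolves to the latest snapshot committed at or before the requested time. This is a sketch of the idea, not Iceberg's own code; the `Snapshot` class, the IDs, and the timestamps are made up for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int      # hypothetical small IDs, chosen for readability
    committed_at_ms: int  # commit time in epoch milliseconds


def snapshot_as_of(log: list[Snapshot], ts_ms: int) -> Snapshot:
    """Return the latest snapshot committed at or before ts_ms.

    Assumes `log` is sorted by commit time, oldest first.
    """
    times = [s.committed_at_ms for s in log]
    idx = bisect_right(times, ts_ms)
    if idx == 0:
        raise ValueError("no snapshot exists at or before the requested time")
    return log[idx - 1]


log = [Snapshot(1, 1_000), Snapshot(2, 2_000), Snapshot(3, 3_000)]
resolved = snapshot_as_of(log, 2_500)
```

In Spark SQL the corresponding queries use `TIMESTAMP AS OF` and `VERSION AS OF` on the table name; the lookup above is roughly what a timestamp query has to resolve before it reads any data.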

Phase 2: Iceberg Foundations

Core concepts and essential operations

Iceberg Foundations

The three-layer architecture (catalog → metadata files → manifest + data files), how snapshots and isolation actually work, and the concurrent-write failure mode that kills naive Parquet-on-S3.
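The concurrent-write problem comes down to one atomic operation at the catalog layer: swap the table's metadata pointer only if nobody else has moved it. A minimal sketch of that compare-and-swap, assuming a toy in-memory catalog (the `ToyCatalog` class and the metadata file names are hypothetical):

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyCatalog:
    """Holds one pointer per table: the current metadata file location."""

    def __init__(self, initial: str) -> None:
        self._current = initial
        self._lock = threading.Lock()

    def current(self) -> str:
        return self._current

    def swap(self, expected: str, new: str) -> None:
        # Atomic compare-and-swap: succeed only if nobody committed since we read.
        with self._lock:
            if self._current != expected:
                raise CommitConflict(f"expected {expected}, found {self._current}")
            self._current = new


catalog = ToyCatalog("v1.metadata.json")

# Two writers read the same base version.
base_a = catalog.current()
base_b = catalog.current()

catalog.swap(base_a, "v2-a.metadata.json")      # writer A wins
try:
    catalog.swap(base_b, "v2-b.metadata.json")  # writer B is rejected
    b_committed = True
except CommitConflict:
    b_committed = False  # B must refresh, re-apply its changes, and retry
```

Plain Parquet files on S3 have no such pointer, so two writers can both "succeed" and leave readers with a mix of files from each. With the swap, the losing writer finds out and retries against the new base.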

Core Operations

INSERT / MERGE / DELETE / UPDATE semantics, file sizing for object storage (target 128–512 MB), compaction (rewrite_data_files / rewrite_manifests), and the 4 M file-handle incident pattern to design around.
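Compaction is at heart a bin-packing problem: combine small files into groups whose total size approaches the target. The sketch below uses first-fit decreasing, which is one simple way to plan such groups; it is an illustration under that assumption, not the planner that rewrite_data_files actually runs, and the file sizes are invented.

```python
def plan_compaction(file_sizes: list[int], target_bytes: int) -> list[list[int]]:
    """Group files into bins whose total size stays at or under target_bytes.

    First-fit decreasing: sort largest first, place each file in the first
    bin with room. Files already at or above the target get their own bin.
    """
    bins: list[list[int]] = []
    totals: list[int] = []
    for size in sorted(file_sizes, reverse=True):
        for i, total in enumerate(totals):
            if total + size <= target_bytes:
                bins[i].append(size)
                totals[i] += size
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([size])
            totals.append(size)
    return bins


MB = 1024 * 1024
sizes = [200 * MB, 100 * MB, 60 * MB, 50 * MB, 40 * MB, 30 * MB]
plan = plan_compaction(sizes, target_bytes=256 * MB)
```

Here six input files collapse into two output groups, each under 256 MB. A production job would also decide which partitions to touch and how much to rewrite per run, which this sketch leaves out.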

Phase 3: Schema & Multi-Engine

Schema evolution, catalogs, and cross-engine queries

Schema Evolution

Add / rename / drop columns by ID, type promotion rules, partition evolution without rewrites, and the consumer-side patterns that survive an upstream fraud_score column addition without breaking 12 pipelines.
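Why tracking columns by ID makes renames and additions safe can be shown with plain dictionaries. In this sketch a stored row is keyed by field ID and each schema version maps IDs to names; `fraud_score` comes from the scenario above, while `event_id`, `amount`, and `amount_usd` are hypothetical column names.

```python
# A data file stores values keyed by field ID, not by column name.
old_data_file_row = {1: "evt-001", 2: 42.5}  # written under schema v1

schema_v1 = {1: "event_id", 2: "amount"}
# v2: rename `amount` to `amount_usd` (ID 2 kept) and add `fraud_score` (new ID 3).
schema_v2 = {1: "event_id", 2: "amount_usd", 3: "fraud_score"}


def read_row(row_by_id: dict[int, object], schema: dict[int, str]) -> dict[str, object]:
    """Project a stored row through a schema; columns missing from the file read as None."""
    return {name: row_by_id.get(field_id) for field_id, name in schema.items()}


row_v1 = read_row(old_data_file_row, schema_v1)
row_v2 = read_row(old_data_file_row, schema_v2)
```

The old file is never rewritten: the renamed column still resolves through ID 2, and the new column reads as null. Consumers that select columns by name and tolerate extra columns are generally unaffected by the addition; consumers that compare the full schema strictly are the ones that break.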

Catalogs & Multi-Engine Access

REST catalog vs Hive Metastore vs Glue vs Nessie vs Polaris, single-source-of-truth setup, and how to get Spark, Trino, and Flink reading the same table without diverging row counts.
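A single source of truth starts with every engine pointing at the same catalog endpoint. As a sketch, the helper below builds the Spark properties that register an Iceberg REST catalog; the property keys follow the Iceberg Spark configuration, while the catalog name `lake`, the URI, and the warehouse path are placeholder assumptions.

```python
def rest_catalog_conf(name: str, uri: str, warehouse: str) -> dict[str, str]:
    """Build Spark properties that register an Iceberg REST catalog under `name`."""
    prefix = f"spark.sql.catalog.{name}"
    return {
        # Enables Iceberg SQL extensions (procedures, partition-evolution DDL).
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        f"{prefix}.warehouse": warehouse,
    }


# Placeholder values for a local setup; substitute your own endpoint and bucket.
conf = rest_catalog_conf("lake", "http://localhost:8181", "s3://example-bucket/warehouse")
```

Each key/value pair would be passed to the Spark session builder, with the Iceberg Spark runtime on the classpath. Trino and Flink have their own equivalent settings; row counts tend to diverge when those settings point at a different endpoint or a stale copy of the metadata.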

Phase 4: Production Operations

Streaming CDC, production ops, and migration

Streaming CDC

Streaming writes from Flink / Spark Structured Streaming / Kafka Connect into Iceberg, exactly-once via two-phase commit, equality + position deletes for CDC, and how to keep streaming + batch consumers consistent.
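The difference between the two delete types is easiest to see side by side. In this simplified model a position delete names a specific row in a specific file, while an equality delete names a column value; the file path and rows are invented, and the sketch ignores sequence numbers, which the real format uses to decide which data files a delete applies to.

```python
# Data file rows, addressed by (file path, row position).
data_file = "data/part-0001.parquet"  # hypothetical path
rows = [
    {"id": 1, "status": "new"},
    {"id": 2, "status": "new"},
    {"id": 3, "status": "new"},
]

# Position delete: "the row at position 1 in this file is deleted".
position_deletes = {(data_file, 1)}

# Equality delete: "any row where id == 3 is deleted".
# The writer does not need to locate the row first, which suits streaming CDC.
equality_deletes = [{"id": 3}]


def is_live(path: str, pos: int, row: dict) -> bool:
    """A row is live if no position delete and no equality delete matches it."""
    if (path, pos) in position_deletes:
        return False
    for predicate in equality_deletes:
        if all(row.get(col) == val for col, val in predicate.items()):
            return False
    return True


live_rows = [row for pos, row in enumerate(rows) if is_live(data_file, pos, row)]
```

Equality deletes are cheap to write but push work onto every reader until compaction folds them into the data files, which is one reason streaming tables need a compaction schedule.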

Production Operations

Compaction policy, snapshot expiration, orphan-file cleanup, table maintenance jobs, monitoring (file count, snapshot age, write throughput), and the runbook every production Iceberg table needs from day one.
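A snapshot-expiration policy usually combines an age cutoff with a floor on how many recent snapshots to keep. The sketch below applies both rules to a list of commit times; the 7-day cutoff, the retain-last value of 2, and the commit history are assumptions for illustration, and commit times stand in for snapshot IDs.

```python
def snapshots_to_expire(
    commit_times_ms: list[int], older_than_ms: int, retain_last: int
) -> list[int]:
    """Return commit times of snapshots eligible for expiration.

    A snapshot is expired only if it is older than the cutoff AND it is not
    among the `retain_last` most recent snapshots.
    """
    ordered = sorted(commit_times_ms)
    protected = set(ordered[-retain_last:]) if retain_last > 0 else set()
    return [t for t in ordered if t < older_than_ms and t not in protected]


DAY_MS = 24 * 60 * 60 * 1000
now_ms = 100 * DAY_MS  # arbitrary "now" so the example is deterministic
commits = [now_ms - d * DAY_MS for d in (30, 14, 8, 3, 1)]
expired = snapshots_to_expire(commits, older_than_ms=now_ms - 7 * DAY_MS, retain_last=2)
```

The three snapshots older than seven days are expired and the two most recent are kept. The same two knobs appear as the `older_than` and `retain_last` arguments of the expire_snapshots Spark procedure; time travel to an expired snapshot is no longer possible, so the cutoff doubles as the rollback window.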

Migration & Interop

Zero-downtime Hive → Iceberg migration with add_files / migrate / shadow tables, dual-write strategies, consumer cutover ordering, and the 8 TB / 3-day rollout pattern that doesn't require a maintenance window.

What you’ll build

A working Iceberg lakehouse with a REST catalog, partitioned + clustered tables, time travel + rollback, and Spark + Trino reading the same source of truth
A schema-evolution playbook covering the 5 safe changes (add nullable, rename, type-promote, add partition, drop column) and how to gate breaking changes in CI
A streaming-CDC pipeline that lands Kafka events into Iceberg with exactly-once semantics, equality deletes, and < 1-minute end-to-end latency
A production-ops runbook with compaction + snapshot-expiration cron, orphan-file cleanup, file-count + snapshot-age dashboards, and a Hive → Iceberg migration plan

Without Iceberg internals, your data lake silently turns into a data swamp.

WHAT GOES WRONG

The 4 M file-handle incident — ingestion writes 10 K small files/hour (50 KB each); six months later a single SUM() query opens 4 million file handles and times out
The 12 broken consumers — a column added to the events table; downstream pipelines with strict schemas fail immediately; on-call eats 4 hours hunting consumers
The two-engine divergence — Spark says 10 M rows, Trino says 9.8 M; both pointing at 'the same' table via different catalog endpoints; the gap had been growing for 3 weeks
The 2.7 M small-file table — built for a demo, no compaction policy, no retention, no monitoring; same root cause as every previous incident; production from day one is the only fix
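A back-of-the-envelope calculation shows how fast the small-file problem compounds. It uses the ingestion rate quoted in the first incident; the 180-day period and the 256 MB target file size are assumptions.

```python
import math

KB = 1024
MB = 1024 * KB

files_per_hour = 10_000        # from the incident description
file_size_bytes = 50 * KB      # from the incident description
days = 180                     # assumption: six months of 30 days
target_file_bytes = 256 * MB   # assumption: a target inside the 128–512 MB range

small_files = files_per_hour * 24 * days
total_bytes = small_files * file_size_bytes
compacted_files = math.ceil(total_bytes / target_file_bytes)
```

At this rate the table accumulates about 43 million files, so a query that prunes down to even a tenth of them still opens more than 4 million. The same bytes written at 256 MB per file fit in roughly 8,240 files.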

What is Apache Iceberg Table Format?

Apache Iceberg is an open table format for large-scale analytics datasets. It provides time travel, schema evolution, and partition evolution on top of data lake storage like S3, enabling multi-engine access from Spark, Trino, Flink, and more. Adopted by Apple, Netflix, and LinkedIn, Iceberg is becoming the standard table format for modern data lakehouses.

Why this matters in production

Production data teams migrate from Hive to Iceberg for reliable schema evolution and ACID transactions on data lakes. Apple manages exabytes of data with Iceberg. Without proper table format management, data lakes become unreliable swamps with broken schemas and inconsistent reads.

Common use cases

Building data lakehouses with ACID transactions and time travel
Implementing schema evolution without breaking downstream consumers
Enabling multi-engine analytics with Spark, Trino, and Flink on the same tables
Running streaming CDC pipelines that merge into Iceberg tables
Managing partition evolution for changing query patterns without data rewrites
Migrating from Hive tables to Iceberg for improved reliability

Iceberg vs alternatives

Iceberg vs Delta Lake

Iceberg offers better multi-engine support and partition evolution. Delta Lake has deeper Databricks integration and simpler time travel. Iceberg is the more open standard; Delta Lake is strongest in the Databricks ecosystem.

Iceberg vs Hudi

Iceberg provides cleaner schema evolution and broader engine support. Hudi excels at incremental upserts and CDC use cases. Iceberg has gained more community momentum and enterprise adoption.

Iceberg vs Hive

Iceberg replaces Hive as the table format for data lakes. It adds ACID transactions, schema evolution, and time travel that Hive lacks. Many teams running Hive tables are migrating them to Iceberg.

Related skills

Iceberg tables are most commonly read and written using Apache Spark.
Streaming CDC into Iceberg tables builds on concepts from Streaming Fundamentals.
Understanding storage optimization connects to Data Warehouse Internals.

Why this skill matters

Iceberg is the table format the next decade of data engineering is built on. Mid-to-senior data engineers at Apple, Netflix, LinkedIn, and Stripe are paid for exactly this skill — turning unreliable data lakes into ACID lakehouses that streaming, batch, and ML all read from the same source of truth.

Common questions about Iceberg

What is Apache Iceberg?

Iceberg is an open table format that brings database-like reliability to data lakes. It provides ACID transactions, time travel, schema evolution, and multi-engine access on cloud object storage.

Is Iceberg replacing Delta Lake?

Iceberg and Delta Lake compete but coexist. Iceberg leads in multi-engine and open ecosystem adoption. Delta Lake dominates in Databricks environments. The industry is converging on open formats.

How long does it take to learn Iceberg?

Basic table operations take 1–2 weeks. Production patterns like partition evolution, catalog management, and streaming CDC typically take 4–6 weeks of hands-on practice.

Do data engineers need to know Iceberg?

Yes. Iceberg is the leading open table format and appears in most modern data lakehouse architectures. Understanding table formats is essential for mid-to-senior data engineers.

Iceberg vs Delta Lake vs Hudi?

Iceberg leads in open multi-engine support. Delta Lake is strongest on Databricks. Hudi is best for incremental upserts. Most new projects choose Iceberg or Delta Lake based on their platform.

What engines work with Iceberg?

Spark, Trino, Flink, Presto, Dremio, Snowflake, and BigQuery all support Iceberg. This multi-engine access is one of Iceberg's primary advantages over proprietary formats.

Batch | Phase 1 free | Full access in Professional

Last updated 2026-05-22 by AI-DE Engineering Team

Time: ~24h video + labs

Phase roadmap

Projects that use each module:

Your First Iceberg Table (Phase 1): P04 (Iceberg lakehouse foundations)
Iceberg Foundations (Phase 2): P04 (Iceberg lakehouse foundations), P05 (ShopStream Spark batch pipeline)
Catalogs & Multi-Engine Access (Phase 3): P04 (Iceberg lakehouse foundations), P03 (Commerce data warehouse)
Production Operations (Phase 4): P04 (Iceberg lakehouse foundations), P02 (Uber event platform), P05 (ShopStream Spark batch pipeline)
