Your First Iceberg Table
Open table format mastery — time travel, schema evolution, and multi-engine analytics.
Hive is dying. Iceberg is the standard the entire open-data ecosystem (Spark, Trino, Flink, Snowflake, BigQuery, Dremio) is converging on. Engineers who own table format decisions own the lakehouse — and the lakehouse is where the data lives.
Hands-on table creation and exploration
Spin up Iceberg locally with Spark + a REST catalog, create your first table, INSERT a few rows, query with time travel, and see the metadata + manifest files on disk — before any production decisions.
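The steps above can be sketched in PySpark as follows. This is a minimal sketch under stated assumptions, not a definitive setup: the catalog name `demo`, the namespace `db`, the table `events`, and the REST endpoint `http://localhost:8181` are all placeholders, and it assumes the Iceberg Spark runtime jar is already on Spark's classpath.

```python
# Placeholder configuration: "demo" and the localhost URI are assumptions.
ICEBERG_CONF = {
    "spark.sql.extensions":
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.demo": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.demo.type": "rest",
    "spark.sql.catalog.demo.uri": "http://localhost:8181",
}

# Create a table, write a few rows, then inspect Iceberg's metadata tables.
STATEMENTS = [
    "CREATE NAMESPACE IF NOT EXISTS demo.db",
    "CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, name STRING) USING iceberg",
    "INSERT INTO demo.db.events VALUES (1, 'a'), (2, 'b')",
    "INSERT INTO demo.db.events VALUES (3, 'c')",
    # One row per commit; each INSERT above produces a snapshot.
    "SELECT snapshot_id, committed_at, operation FROM demo.db.events.snapshots",
    # Manifest files and data files that back the current snapshot.
    "SELECT path FROM demo.db.events.manifests",
    "SELECT file_path, record_count FROM demo.db.events.files",
]


def time_travel_query(snapshot_id: int) -> str:
    """Build a query that reads the table as of one snapshot ID."""
    return f"SELECT * FROM demo.db.events VERSION AS OF {snapshot_id}"


def build_session():
    """Create a Spark session wired to the assumed REST catalog."""
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    builder = SparkSession.builder.appName("first-iceberg-table")
    for key, value in ICEBERG_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def run(spark) -> None:
    """Run the walkthrough, then time travel to the oldest snapshot."""
    for statement in STATEMENTS:
        spark.sql(statement).show(truncate=False)
    first = spark.sql(
        "SELECT snapshot_id FROM demo.db.events.snapshots ORDER BY committed_at"
    ).first()
    spark.sql(time_travel_query(first["snapshot_id"])).show()
```

Calling `run(build_session())` against a running catalog should show the table as it looked after the first INSERT, alongside the paths of the manifest and data files that Iceberg wrote.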
Core concepts and essential operations
The three-layer architecture (catalog → metadata files → manifest + data files), how snapshots and isolation actually work, and the concurrent-write failure mode that kills naive Parquet-on-S3.
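To make the concurrent-write point concrete, here is a toy model of the atomic pointer swap that the catalog layer provides. It is an illustration only: `ToyCatalog` and `commit_with_retry` are hypothetical names, and real catalogs implement the swap in their own ways. The idea it shows is that a writer whose base metadata is stale fails its commit and must retry, instead of silently overwriting another writer's work.

```python
import threading


class ToyCatalog:
    """Hypothetical stand-in for a catalog: it stores one metadata pointer."""

    def __init__(self, initial_metadata: str) -> None:
        self._pointer = initial_metadata
        self._lock = threading.Lock()

    def current(self) -> str:
        with self._lock:
            return self._pointer

    def compare_and_swap(self, expected: str, new: str) -> bool:
        """Move the pointer only if nobody else moved it first."""
        with self._lock:
            if self._pointer != expected:
                return False
            self._pointer = new
            return True


def commit_with_retry(catalog: ToyCatalog, writer: str, max_attempts: int = 3) -> str:
    """Re-read the base and try again when another writer committed first."""
    for _ in range(max_attempts):
        base = catalog.current()
        new = f"{base}+{writer}"  # stands in for writing a new metadata file
        if catalog.compare_and_swap(base, new):
            return new
    raise RuntimeError(f"{writer} gave up after {max_attempts} attempts")


catalog = ToyCatalog("v0")

# Two writers start from the same base metadata.
base_a = catalog.current()
base_b = catalog.current()

a_ok = catalog.compare_and_swap(base_a, "v1-from-A")  # first commit wins
b_ok = catalog.compare_and_swap(base_b, "v1-from-B")  # stale base is rejected

# Writer B retries on top of A's commit instead of clobbering it.
retried = commit_with_retry(catalog, "B")
print(a_ok, b_ok, retried)
```

Plain Parquet files on object storage have no such pointer, so two writers can both "succeed" and leave readers with a mix of files from each.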
INSERT / MERGE / DELETE / UPDATE semantics, file sizing for object storage (target 128–512 MB), compaction (rewrite_data_files / rewrite_manifests), and the 4M-file-handle incident pattern to design around.
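The file-sizing arithmetic behind compaction can be sketched as follows. This is a simplified greedy grouping for illustration, not the planner that rewrite_data_files actually uses; `plan_groups` and the 256 MB target are assumptions chosen from within the 128–512 MB range above.

```python
MB = 1024 * 1024


def plan_groups(file_sizes: list[int], target_bytes: int) -> list[list[int]]:
    """Greedily pack files into groups of at most target_bytes each."""
    groups: list[list[int]] = []
    current: list[int] = []
    current_size = 0
    for size in sorted(file_sizes, reverse=True):
        # Close the current group when the next file would overflow it.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# 1,000 small files of 4 MB each, as a streaming job might produce.
small_files = [4 * MB] * 1000
groups = plan_groups(small_files, target_bytes=256 * MB)

print(f"{len(small_files)} input files -> {len(groups)} output files")
print(f"largest group: {max(sum(g) for g in groups) // MB} MB")
```

Each group would become one rewritten data file, so the reader opens far fewer objects for the same bytes.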
Schema evolution, catalogs, and cross-engine queries
Add / rename / drop columns by ID, type promotion rules, partition evolution without rewrites, and the consumer-side patterns that survive an upstream fraud_score column addition without breaking 12 pipelines.
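Why tracking columns by ID matters can be shown with a small model. This is a sketch, not Iceberg's reader: `read_row` is a hypothetical helper, and the column names other than fraud_score are made up for the example. Data files are modeled as values keyed by field ID, and a schema as a mapping from field ID to the current column name.

```python
from typing import Any


def read_row(schema: dict[int, str], stored: dict[int, Any]) -> dict[str, Any]:
    """Project a stored row through a schema, matching columns by field ID."""
    return {name: stored.get(field_id) for field_id, name in schema.items()}


# A row written under the original schema, keyed by field ID.
old_file_row = {1: "txn-001", 2: 19.99}

schema_v1 = {1: "txn_id", 2: "amount"}
# Rename field 2 and add fraud_score with a new ID; no data files are rewritten.
schema_v2 = {1: "txn_id", 2: "amount_usd", 3: "fraud_score"}
# Drop field 2, then add a new column that reuses the name "amount" with a new ID.
schema_v3 = {1: "txn_id", 3: "fraud_score", 4: "amount"}

row_v1 = read_row(schema_v1, old_file_row)
row_v2 = read_row(schema_v2, old_file_row)
row_v3 = read_row(schema_v3, old_file_row)

print(row_v1)
print(row_v2)
print(row_v3)
```

Under `schema_v2` the old value shows up under the renamed column and the added column reads as null; under `schema_v3` the re-added "amount" does not pick up the dropped column's data, because it has a different ID.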
REST catalog vs Hive Metastore vs Glue vs Nessie vs Polaris, single-source-of-truth setup, and how to get Spark, Trino, and Flink reading the same table without diverging row counts.
Streaming CDC, production ops, and migration
Streaming writes from Flink / Spark Structured Streaming / Kafka Connect into Iceberg, exactly-once via two-phase commit, equality + position deletes for CDC, and how to keep streaming + batch consumers consistent.
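The difference between the two delete types can be sketched with a toy read path. This is a simplification under stated assumptions: `apply_deletes` is a hypothetical helper, the file path and rows are invented, and the sketch ignores the sequence numbers that decide which data files a delete applies to.

```python
from typing import Any

Row = dict[str, Any]


def apply_deletes(
    file_path: str,
    rows: list[Row],
    position_deletes: set[tuple[str, int]],
    equality_deletes: list[Row],
) -> list[Row]:
    """Return the rows of one data file that survive both kinds of delete."""
    live = []
    for position, row in enumerate(rows):
        # Position delete: names an exact (file, row position) pair.
        if (file_path, position) in position_deletes:
            continue
        # Equality delete: removes any row whose columns match the given values.
        if any(all(row.get(k) == v for k, v in d.items()) for d in equality_deletes):
            continue
        live.append(row)
    return live


data_file = "data/part-0001.parquet"  # invented path
rows = [
    {"id": 1, "status": "new"},
    {"id": 2, "status": "new"},
    {"id": 3, "status": "new"},
]

# A CDC writer that only knows the key emits an equality delete for id=2;
# a writer that already knows the row's location emits a position delete.
live_rows = apply_deletes(
    data_file,
    rows,
    position_deletes={(data_file, 0)},
    equality_deletes=[{"id": 2}],
)
print(live_rows)
```

Equality deletes are cheap to write from a change stream but cost more at read time, which is one reason streaming tables need regular compaction.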
Compaction policy, snapshot expiration, orphan-file cleanup, table maintenance jobs, monitoring (file count, snapshot age, write throughput), and the runbook every production Iceberg table needs from day one.
Zero-downtime Hive → Iceberg migration with add_files / migrate / shadow tables, dual-write strategies, consumer cutover ordering, and the 8 TB / 3-day rollout pattern that doesn't require a maintenance window.
Apache Iceberg is an open table format for large-scale analytics datasets. It provides time travel, schema evolution, and partition evolution on top of data lake storage like S3, enabling multi-engine access from Spark, Trino, Flink, and more. Adopted by Apple, Netflix, and LinkedIn, Iceberg is becoming the standard table format for modern data lakehouses.
Production data teams migrate from Hive to Iceberg for reliable schema evolution and ACID transactions on data lakes. Apple manages exabytes of data with Iceberg. Without proper table format management, data lakes become unreliable swamps with broken schemas and inconsistent reads.
Iceberg offers better multi-engine support and partition evolution. Delta Lake has deeper Databricks integration and simpler time travel. Iceberg is the more open standard; Delta Lake is strongest in the Databricks ecosystem.
Iceberg provides cleaner schema evolution and broader engine support. Hudi excels at incremental upserts and CDC use cases. Iceberg has gained more community momentum and enterprise adoption.
Iceberg replaces Hive as the table format for data lakes. It adds snapshot-based ACID transactions, schema evolution by column ID, and time travel, which Hive tables either lack or support only in limited form. Many teams are actively migrating from Hive to Iceberg.
Iceberg is the table format the next decade of data engineering is built on. Mid-to-senior data engineers at Apple, Netflix, LinkedIn, and Stripe are paid for exactly this skill — turning unreliable data lakes into ACID lakehouses that streaming, batch, and ML all read from the same source of truth.
Iceberg is an open table format that brings database-like reliability to data lakes. It provides ACID transactions, time travel, schema evolution, and multi-engine access on cloud object storage.
Iceberg and Delta Lake compete but coexist. Iceberg leads in multi-engine and open ecosystem adoption. Delta Lake dominates in Databricks environments. The industry is converging on open formats.
Basic table operations take 1–2 weeks. Production patterns like partition evolution, catalog management, and streaming CDC typically take 4–6 weeks of hands-on practice.
Yes. Iceberg is the leading open table format and appears in most modern data lakehouse architectures. Understanding table formats is essential for mid-to-senior data engineers.
Iceberg leads in open multi-engine support. Delta Lake is strongest on Databricks. Hudi is best for incremental upserts. Most new projects choose Iceberg or Delta Lake based on their platform.
Spark, Trino, Flink, Presto, Dremio, Snowflake, and BigQuery all support Iceberg. This multi-engine access is one of Iceberg's primary advantages over proprietary formats.