How to Create an Iceberg Table

Creating an Iceberg table takes five steps: configure a catalog → CREATE TABLE with hidden partitioning → insert data → query with time travel → evolve the schema. The entire flow works in Spark with no additional tooling beyond the Iceberg Spark runtime package.

Steps

1. Configure a Catalog

The catalog stores table metadata pointers. Use the Hadoop catalog for local dev, AWS Glue for production on AWS, or a REST catalog (Nessie) for multi-engine environments.

# PySpark: Hadoop catalog (local dev)
# Requires the iceberg-spark-runtime package on the Spark classpath
# (for example via spark.jars.packages); pick the build matching your Spark version
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config('spark.sql.extensions',
             'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')\
    .config('spark.sql.catalog.local',
             'org.apache.iceberg.spark.SparkCatalog')\
    .config('spark.sql.catalog.local.type', 'hadoop')\
    .config('spark.sql.catalog.local.warehouse', '/tmp/iceberg-warehouse')\
    .getOrCreate()

2. Create the Table

Use USING iceberg and define a hidden partition with a transform. days(event_ts) partitions by day without adding a visible partition column to the schema; queries filter on event_ts directly.

CREATE TABLE local.db.events (
    event_id    BIGINT      NOT NULL,
    user_id     BIGINT,
    event_type  STRING,
    event_ts    TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(event_ts))    -- hidden partition
TBLPROPERTIES (
    'write.format.default' = 'parquet',
    'write.target-file-size-bytes' = '536870912'  -- 512 MB
)
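
Under the hood, the days() transform maps each timestamp to a day ordinal (whole days since 1970-01-01), and that ordinal is the partition value Iceberg stores. A minimal Python sketch of the idea; the function name day_partition is ours for illustration, not an Iceberg API:

```python
from datetime import datetime, date

def day_partition(ts: datetime) -> int:
    """Map a timestamp to Iceberg's day partition value:
    whole days elapsed since 1970-01-01."""
    return (ts.date() - date(1970, 1, 1)).days

# Two events on the same calendar day land in the same partition,
# even though the raw timestamps differ.
a = day_partition(datetime(2026, 3, 23, 0, 5))
b = day_partition(datetime(2026, 3, 23, 23, 59))
assert a == b
```

Because the transform, not a user-maintained column, defines the partition, writers can never insert data into the wrong partition.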

3. Insert and Query Data

Use standard SQL or the DataFrame API. Iceberg applies partition pruning automatically — no need to add WHERE event_date = X to your queries.

-- Insert with SQL (use a TIMESTAMP literal for the event_ts column)
INSERT INTO local.db.events
VALUES (1, 101, 'page_view', TIMESTAMP '2026-03-23 12:00:00');

# Insert with the PySpark DataFrame API
df.writeTo('local.db.events').append()

-- Query (partition pruning is automatic)
SELECT event_type, COUNT(*)
FROM local.db.events
WHERE event_ts >= '2026-03-01'
GROUP BY event_type
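
Pruning works because table metadata records per-file min/max values for each column, so the planner can skip any file whose range cannot satisfy the filter. A simplified sketch of that check; the file dictionaries and field names here are invented for illustration:

```python
def prune_files(files, lower_bound):
    """Keep only files whose [min, max] range for event_ts can
    contain rows matching: event_ts >= lower_bound."""
    return [f for f in files if f["max_event_ts"] >= lower_bound]

files = [
    {"path": "f1.parquet", "min_event_ts": "2026-02-01", "max_event_ts": "2026-02-28"},
    {"path": "f2.parquet", "min_event_ts": "2026-03-01", "max_event_ts": "2026-03-15"},
]

# For WHERE event_ts >= '2026-03-01', only f2 survives planning.
survivors = prune_files(files, "2026-03-01")
assert [f["path"] for f in survivors] == ["f2.parquet"]
```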

4. Query with Time Travel

Every write creates a snapshot. Query any past point in time using a timestamp or snapshot ID.

-- Time travel by timestamp (Spark SQL)
SELECT * FROM local.db.events
TIMESTAMP AS OF '2026-03-22 00:00:00';

-- By snapshot ID
SELECT * FROM local.db.events
VERSION AS OF 5432109876543210987;

-- List all snapshots
SELECT * FROM local.db.events.snapshots
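
TIMESTAMP AS OF resolves against the table's snapshot log: Iceberg picks the latest snapshot committed at or before the requested timestamp. A sketch of that lookup; the snapshot list is illustrative, not real metadata:

```python
def snapshot_as_of(snapshots, ts):
    """Return the id of the latest snapshot committed at or
    before ts, or None if the table did not exist yet."""
    eligible = [s for s in snapshots if s["committed_at"] <= ts]
    return max(eligible, key=lambda s: s["committed_at"])["id"] if eligible else None

log = [
    {"id": 101, "committed_at": "2026-03-20 09:00:00"},
    {"id": 102, "committed_at": "2026-03-21 18:30:00"},
    {"id": 103, "committed_at": "2026-03-23 08:00:00"},
]

# A query as of 2026-03-22 reads snapshot 102, the last commit before it.
assert snapshot_as_of(log, "2026-03-22 00:00:00") == 102
```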

5. Evolve the Schema

Add, rename, or drop columns without rewriting data files. Iceberg uses column IDs internally, so renames never break existing files.

-- Add a nullable column (metadata-only, instant)
ALTER TABLE local.db.events
ADD COLUMN session_id STRING;

-- Rename a column safely
ALTER TABLE local.db.events
RENAME COLUMN event_type TO event_name;

-- Run compaction to merge small files
CALL local.system.rewrite_data_files(
    table => 'db.events'
)
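
Renames are safe because data files reference columns by numeric ID, never by name; a rename only rewrites the name-to-ID mapping in table metadata. A toy illustration of the principle, not Iceberg's actual metadata structures:

```python
# Schema metadata: column names map to stable numeric IDs.
schema = {"event_id": 1, "user_id": 2, "event_type": 3, "event_ts": 4}

# Data files store values keyed by column ID, so they never
# contain the name "event_type" at all.
data_file_row = {1: 1, 2: 101, 3: "page_view", 4: "2026-03-23 12:00:00"}

def rename_column(schema, old, new):
    """Metadata-only rename: the column keeps its ID."""
    schema = dict(schema)
    schema[new] = schema.pop(old)
    return schema

schema = rename_column(schema, "event_type", "event_name")

# Existing files still resolve correctly through the unchanged ID.
assert data_file_row[schema["event_name"]] == "page_view"
```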

Common Issues

Table not found after creation

Check that your Spark session catalog name matches the table prefix. If you configured catalog name `local`, the table must be referenced as `local.db.table`, not just `db.table`.

Slow queries despite partitioning

Hidden partitioning only prunes when your filter column matches the partition transform. If you partition by days(event_ts) but filter on user_id, no pruning happens. Run EXPLAIN to verify partition pruning is applied.

Storage growing unbounded

Each write creates new data files and a new snapshot. Old snapshot files are not deleted automatically. Run expire_snapshots() weekly and rewrite_data_files() to compact, then run remove_orphan_files() for cleanup.
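
A typical retention call looks like the following, assuming the catalog name local from step 1; adjust older_than and retain_last to your own retention policy:

```sql
-- Expire snapshots older than a cutoff, but always keep the last 5
CALL local.system.expire_snapshots(
    table => 'db.events',
    older_than => TIMESTAMP '2026-03-16 00:00:00',
    retain_last => 5
);
```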

FAQ

What catalog should I use with Iceberg?
Hadoop catalog for local dev. AWS Glue for production on AWS (integrates with Athena, EMR). REST catalog (Nessie, Polaris) for multi-engine production environments.
How do I convert a Parquet table to Iceberg?
CALL catalog.system.snapshot('db.old_table', 'db.new_iceberg_table') creates an Iceberg table over existing Parquet files without copying data. Then migrate consumers and optionally run catalog.system.migrate() to convert in-place.
How do I run compaction on an Iceberg table?
CALL catalog.system.rewrite_data_files(table => 'db.table'). Target 128–512 MB files. Schedule daily for streaming tables. Follow with rewrite_manifests() for metadata optimization.
