How to Create an Iceberg Table
Creating an Iceberg table takes five steps: configure a catalog → CREATE TABLE with hidden partitioning → insert data → query with time travel → evolve the schema. The entire flow works in Spark SQL with no tooling beyond the Iceberg Spark runtime package.
Steps
Configure a Catalog
The catalog stores table metadata pointers. Use the Hadoop catalog for local dev, AWS Glue for production on AWS, or a REST catalog (Nessie) for multi-engine environments.
# PySpark: Hadoop catalog (local dev)
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .config('spark.sql.extensions',
            'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .config('spark.sql.catalog.local',
            'org.apache.iceberg.spark.SparkCatalog') \
    .config('spark.sql.catalog.local.type', 'hadoop') \
    .config('spark.sql.catalog.local.warehouse', '/tmp/iceberg-warehouse') \
    .getOrCreate()

Create the Table
Use USING iceberg and define a hidden partition with a transform. days(event_ts) partitions by day without exposing the partition column.
CREATE TABLE local.db.events (
event_id BIGINT NOT NULL,
user_id BIGINT,
event_type STRING,
event_ts TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(event_ts)) -- hidden partition
TBLPROPERTIES (
'write.format.default' = 'parquet',
'write.target-file-size-bytes' = '536870912' -- 512 MB
)

Insert and Query Data
Use standard SQL or the DataFrame API. Iceberg applies partition pruning automatically — no need to add WHERE event_date = X to your queries.
-- Insert with SQL
INSERT INTO local.db.events
VALUES (1, 101, 'page_view', TIMESTAMP '2026-03-23 12:00:00');
# Insert with the PySpark DataFrame API
df.writeTo('local.db.events').append()
-- Query (partition pruning is automatic)
SELECT event_type, COUNT(*)
FROM local.db.events
WHERE event_ts >= '2026-03-01'
GROUP BY event_type;

Query with Time Travel
Every write creates a snapshot. Query any past point in time using a timestamp or snapshot ID.
-- Time travel by timestamp (Spark SQL)
SELECT * FROM local.db.events
TIMESTAMP AS OF '2026-03-22 00:00:00';
-- By snapshot ID
SELECT * FROM local.db.events
VERSION AS OF 5432109876543210987;
-- List all snapshots
SELECT * FROM local.db.events.snapshots;

Evolve the Schema
Add, rename, or drop columns without rewriting data files. Iceberg uses column IDs internally, so renames never break existing files.
-- Add a nullable column (metadata-only, instant)
ALTER TABLE local.db.events
ADD COLUMN session_id STRING;
-- Rename a column safely
ALTER TABLE local.db.events
RENAME COLUMN event_type TO event_name;
-- Run compaction to merge small files
CALL local.system.rewrite_data_files(
    table => 'db.events'
);

Common Issues
Table not found after creation
Check that your Spark session catalog name matches the table prefix. If you configured catalog name `local`, the table must be referenced as `local.db.table`, not just `db.table`.
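One way to verify the catalog wiring, assuming the `local` catalog and `db.events` table from the steps above:

```sql
-- Confirm the catalog is registered and the table is visible
SHOW TABLES IN local.db;

-- A fully qualified reference always works once the catalog is configured
SELECT COUNT(*) FROM local.db.events;
```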
Slow queries despite partitioning
Hidden partitioning only prunes when your filter column matches the partition transform. If you partition by days(event_ts) but filter on user_id, no pruning happens. Run EXPLAIN to verify partition pruning is applied.
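To confirm pruning, inspect the physical plan. With the days(event_ts) partitioning from the example table, a timestamp filter should prune while a user_id filter should not:

```sql
-- Prunes: the filter column matches the partition transform
EXPLAIN SELECT COUNT(*) FROM local.db.events
WHERE event_ts >= TIMESTAMP '2026-03-01 00:00:00';

-- Does not prune: user_id is not part of the partition spec
EXPLAIN SELECT COUNT(*) FROM local.db.events
WHERE user_id = 101;
```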
Storage growing unbounded
Each write creates new data files and a new snapshot. Old snapshot files are not deleted automatically. Run expire_snapshots() weekly and rewrite_data_files() to compact, then run remove_orphan_files() for cleanup.
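The maintenance routine above can be sketched with Iceberg's built-in Spark procedures; the table name and retention values here are examples to adapt:

```sql
-- Drop old snapshots, keeping at least the last 5
CALL local.system.expire_snapshots(
    table => 'db.events',
    older_than => TIMESTAMP '2026-03-16 00:00:00',
    retain_last => 5
);

-- Compact small data files
CALL local.system.rewrite_data_files(table => 'db.events');

-- Delete files no longer referenced by any snapshot
CALL local.system.remove_orphan_files(table => 'db.events');
```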
FAQ
- What catalog should I use with Iceberg?
- Hadoop catalog for local dev. AWS Glue for production on AWS (integrates with Athena, EMR). REST catalog (Nessie, Polaris) for multi-engine production environments.
- How do I convert a Parquet table to Iceberg?
- CALL catalog.system.snapshot('db.old_table', 'db.new_iceberg_table') creates an Iceberg table over existing Parquet files without copying data. Then migrate consumers and optionally run catalog.system.migrate() to convert in-place.
- How do I run compaction on an Iceberg table?
- CALL catalog.system.rewrite_data_files(table => 'db.table'). Target 128–512 MB files. Schedule daily for streaming tables. Follow with rewrite_manifests() for metadata optimization.
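For the AWS Glue option in the first answer, a minimal session-config sketch. This is a config fragment, not a tested setup: the catalog name `glue` and the S3 bucket are placeholders, and the iceberg-aws bundle must be on the classpath.

```python
# Sketch: Iceberg with an AWS Glue catalog (placeholder catalog name and bucket)
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config('spark.sql.extensions',
            'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .config('spark.sql.catalog.glue', 'org.apache.iceberg.spark.SparkCatalog') \
    .config('spark.sql.catalog.glue.catalog-impl',
            'org.apache.iceberg.aws.glue.GlueCatalog') \
    .config('spark.sql.catalog.glue.warehouse', 's3://my-bucket/warehouse') \
    .config('spark.sql.catalog.glue.io-impl',
            'org.apache.iceberg.aws.s3.S3FileIO') \
    .getOrCreate()
```

Tables are then referenced as glue.db.table, mirroring the local.db.table pattern used throughout this guide.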