How to Create an Iceberg Table

Creating an Iceberg table takes five steps: configure a catalog → CREATE TABLE with hidden partitioning → insert data → query with time travel → evolve the schema. The entire flow works in Spark with no additional tooling beyond the Iceberg Spark runtime package.

Steps

1. Configure a Catalog

The catalog stores table metadata pointers. Use the Hadoop catalog for local dev, AWS Glue for production on AWS, or a REST catalog (Nessie) for multi-engine environments.

# PySpark: Hadoop catalog (local dev)
# Requires the iceberg-spark-runtime package on the Spark classpath
# (for example via spark.jars.packages); pick the build matching your Spark version
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config('spark.sql.extensions',
             'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')\
    .config('spark.sql.catalog.local',
             'org.apache.iceberg.spark.SparkCatalog')\
    .config('spark.sql.catalog.local.type', 'hadoop')\
    .config('spark.sql.catalog.local.warehouse', '/tmp/iceberg-warehouse')\
    .getOrCreate()

2. Create the Table

Use USING iceberg and define a hidden partition with a transform. days(event_ts) partitions by day without adding a visible partition column to the schema; queries filter on event_ts directly.

CREATE TABLE local.db.events (
    event_id    BIGINT      NOT NULL,
    user_id     BIGINT,
    event_type  STRING,
    event_ts    TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(event_ts))    -- hidden partition
TBLPROPERTIES (
    'write.format.default' = 'parquet',
    'write.target-file-size-bytes' = '536870912'  -- 512 MB
)
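
Under the hood, the days() transform maps each timestamp to a day ordinal (whole days since 1970-01-01), and that ordinal is the partition value Iceberg stores. A minimal Python sketch of the idea; the function name day_partition is ours for illustration, not an Iceberg API:

```python
from datetime import datetime, date

def day_partition(ts: datetime) -> int:
    """Map a timestamp to Iceberg's day partition value:
    whole days elapsed since 1970-01-01."""
    return (ts.date() - date(1970, 1, 1)).days

# Two events on the same calendar day land in the same partition,
# even though the raw timestamps differ.
a = day_partition(datetime(2026, 3, 23, 0, 5))
b = day_partition(datetime(2026, 3, 23, 23, 59))
assert a == b
```

Because the transform, not a user-maintained column, defines the partition, writers can never insert data into the wrong partition.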

3. Insert and Query Data

Use standard SQL or the DataFrame API. Iceberg applies partition pruning automatically — no need to add WHERE event_date = X to your queries.

-- Insert with SQL (use a TIMESTAMP literal for the event_ts column)
INSERT INTO local.db.events
VALUES (1, 101, 'page_view', TIMESTAMP '2026-03-23 12:00:00');

# Insert with the PySpark DataFrame API
df.writeTo('local.db.events').append()

-- Query (partition pruning is automatic)
SELECT event_type, COUNT(*)
FROM local.db.events
WHERE event_ts >= '2026-03-01'
GROUP BY event_type
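
Pruning works because table metadata records per-file min/max values for each column, so the planner can skip any file whose range cannot satisfy the filter. A simplified sketch of that check; the file dictionaries and field names here are invented for illustration:

```python
def prune_files(files, lower_bound):
    """Keep only files whose [min, max] range for event_ts can
    contain rows matching: event_ts >= lower_bound."""
    return [f for f in files if f["max_event_ts"] >= lower_bound]

files = [
    {"path": "f1.parquet", "min_event_ts": "2026-02-01", "max_event_ts": "2026-02-28"},
    {"path": "f2.parquet", "min_event_ts": "2026-03-01", "max_event_ts": "2026-03-15"},
]

# For WHERE event_ts >= '2026-03-01', only f2 survives planning.
survivors = prune_files(files, "2026-03-01")
assert [f["path"] for f in survivors] == ["f2.parquet"]
```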

4. Query with Time Travel

Every write creates a snapshot. Query any past point in time using a timestamp or snapshot ID.

-- Time travel by timestamp (Spark SQL)
SELECT * FROM local.db.events
TIMESTAMP AS OF '2026-03-22 00:00:00';

-- By snapshot ID
SELECT * FROM local.db.events
VERSION AS OF 5432109876543210987;

-- List all snapshots
SELECT * FROM local.db.events.snapshots
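
TIMESTAMP AS OF resolves against the table's snapshot log: Iceberg picks the latest snapshot committed at or before the requested timestamp. A sketch of that lookup; the snapshot list is illustrative, not real metadata:

```python
def snapshot_as_of(snapshots, ts):
    """Return the id of the latest snapshot committed at or
    before ts, or None if the table did not exist yet."""
    eligible = [s for s in snapshots if s["committed_at"] <= ts]
    return max(eligible, key=lambda s: s["committed_at"])["id"] if eligible else None

log = [
    {"id": 101, "committed_at": "2026-03-20 09:00:00"},
    {"id": 102, "committed_at": "2026-03-21 18:30:00"},
    {"id": 103, "committed_at": "2026-03-23 08:00:00"},
]

# A query as of 2026-03-22 reads snapshot 102, the last commit before it.
assert snapshot_as_of(log, "2026-03-22 00:00:00") == 102
```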

5. Evolve the Schema

Add, rename, or drop columns without rewriting data files. Iceberg uses column IDs internally, so renames never break existing files.

-- Add a nullable column (metadata-only, instant)
ALTER TABLE local.db.events
ADD COLUMN session_id STRING;

-- Rename a column safely
ALTER TABLE local.db.events
RENAME COLUMN event_type TO event_name;

-- Run compaction to merge small files
CALL local.system.rewrite_data_files(
    table => 'db.events'
)
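
Renames are safe because data files reference columns by numeric ID, never by name; a rename only rewrites the name-to-ID mapping in table metadata. A toy illustration of the principle, not Iceberg's actual metadata structures:

```python
# Schema metadata: column names map to stable numeric IDs.
schema = {"event_id": 1, "user_id": 2, "event_type": 3, "event_ts": 4}

# Data files store values keyed by column ID, so they never
# contain the name "event_type" at all.
data_file_row = {1: 1, 2: 101, 3: "page_view", 4: "2026-03-23 12:00:00"}

def rename_column(schema, old, new):
    """Metadata-only rename: the column keeps its ID."""
    schema = dict(schema)
    schema[new] = schema.pop(old)
    return schema

schema = rename_column(schema, "event_type", "event_name")

# Existing files still resolve correctly through the unchanged ID.
assert data_file_row[schema["event_name"]] == "page_view"
```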

Common Issues

Table not found after creation

Check that your Spark session catalog name matches the table prefix. If you configured catalog name `local`, the table must be referenced as `local.db.table`, not just `db.table`.

Slow queries despite partitioning

Hidden partitioning only prunes when your filter column matches the partition transform. If you partition by days(event_ts) but filter on user_id, no pruning happens. Run EXPLAIN to verify partition pruning is applied.

Storage growing unbounded

Each write creates new data files and a new snapshot. Old snapshot files are not deleted automatically. Run expire_snapshots() weekly and rewrite_data_files() to compact, then run remove_orphan_files() for cleanup.
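
A typical retention call looks like the following, assuming the catalog name local from step 1; adjust older_than and retain_last to your own retention policy:

```sql
-- Expire snapshots older than a cutoff, but always keep the last 5
CALL local.system.expire_snapshots(
    table => 'db.events',
    older_than => TIMESTAMP '2026-03-16 00:00:00',
    retain_last => 5
);
```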

FAQ

What catalog should I use with Iceberg?
Hadoop catalog for local dev. AWS Glue for production on AWS (integrates with Athena, EMR). REST catalog (Nessie, Polaris) for multi-engine production environments.
How do I convert a Parquet table to Iceberg?
CALL catalog.system.snapshot('db.old_table', 'db.new_iceberg_table') creates an Iceberg table over existing Parquet files without copying data. Then migrate consumers and optionally run catalog.system.migrate() to convert in-place.
How do I run compaction on an Iceberg table?
CALL catalog.system.rewrite_data_files(table => 'db.table'). Target 128–512 MB files. Schedule daily for streaming tables. Follow with rewrite_manifests() for metadata optimization.
