
Overview

Iceberg provides stored procedures for table maintenance and management. In Spark 3, procedures are available only when the Iceberg SQL extensions are enabled.
In Spark 4.0+, stored procedures are supported natively, but procedure names are case-sensitive.

Using Procedures

Call procedures from any configured catalog using the CALL statement:
CALL catalog_name.system.procedure_name(arguments);

Argument Passing

Named arguments can be provided in any order:
CALL catalog_name.system.rollback_to_snapshot(
    table => 'db.sample',
    snapshot_id => 1234
);
Mixing positional and named arguments is not supported.
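Positional calls are also supported; arguments are matched in the order the procedure declares them:

```sql
-- Equivalent positional form of the call above
CALL catalog_name.system.rollback_to_snapshot('db.sample', 1234);
```

Named arguments are the safer choice for procedures with many optional parameters, since they do not depend on declaration order.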

Snapshot Management

rollback_to_snapshot

Roll back a table to a specific snapshot:
CALL catalog.system.rollback_to_snapshot('db.sample', 1);
Arguments:
  • table (required) - Table name
  • snapshot_id (required) - Target snapshot ID

rollback_to_timestamp

Roll back to a snapshot at a specific time:
CALL catalog.system.rollback_to_timestamp(
    'db.sample',
    TIMESTAMP '2021-06-30 00:00:00.000'
);
Arguments:
  • table (required) - Table name
  • timestamp (required) - Target timestamp

set_current_snapshot

Set the current table state to a specific snapshot. Unlike rollback, the target is not required to be an ancestor of the current snapshot:
CALL catalog.system.set_current_snapshot('db.sample', 1);
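Recent Iceberg versions also accept a branch or tag name in place of a snapshot ID. A sketch, assuming the ref argument is available in your Iceberg version:

```sql
-- Set the current snapshot to the head of a branch or a tag
CALL catalog.system.set_current_snapshot(
    table => 'db.sample',
    ref => 's1'
);
```

Exactly one of snapshot_id or ref should be provided in a single call.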

cherrypick_snapshot

Apply changes from a snapshot without removing the original:
CALL catalog.system.cherrypick_snapshot('my_table', 1);
Only append and dynamic overwrite snapshots can be cherry-picked.

fast_forward

Fast-forward a branch to another branch’s head:
CALL catalog.system.fast_forward(
    'my_table',
    'main',
    'audit-branch'
);

Metadata Management

expire_snapshots

Remove old snapshots and unreferenced data files:
CALL catalog.system.expire_snapshots('db.sample');
Arguments:
  • table (required) - Table name
  • older_than - Expiration timestamp (default: 5 days ago)
  • retain_last - Minimum snapshots to keep (default: 1)
  • max_concurrent_deletes - Thread pool size for deletions
  • stream_results - Stream results to prevent driver OOM
  • snapshot_ids - Specific snapshot IDs to expire
Output:
  • deleted_data_files_count
  • deleted_position_delete_files_count
  • deleted_equality_delete_files_count
  • deleted_manifest_files_count
  • deleted_manifest_lists_count
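The optional arguments compose in one call. For example, to expire everything older than a cutoff while guaranteeing a minimum history (table name and timestamp are illustrative):

```sql
-- Expire snapshots older than the cutoff, but always keep the last 10
CALL catalog.system.expire_snapshots(
    table => 'db.sample',
    older_than => TIMESTAMP '2023-06-30 00:00:00.000',
    retain_last => 10
);
```

Note that retain_last takes precedence: snapshots within the last 10 are kept even if they are older than the cutoff.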

remove_orphan_files

Remove files not referenced in table metadata:
CALL catalog.system.remove_orphan_files(
    table => 'db.sample',
    dry_run => true
);
Arguments:
  • table (required) - Table name
  • older_than - Remove files older than this (default: 3 days ago)
  • location - Specific directory to scan
  • dry_run - Preview without deleting (default: false)
  • max_concurrent_deletes - Thread pool size
  • stream_results - Stream results to prevent OOM
Orphan file removal is irreversible. Always run with dry_run => true first.

rewrite_data_files

Compact small files and optimize data layout:
CALL catalog.system.rewrite_data_files('db.sample');
Common Options:
  • target-file-size-bytes - Target output file size (default: 512 MB)
  • min-file-size-bytes - Files below this are rewritten (default: 75% of target)
  • max-file-size-bytes - Files above this are rewritten (default: 180% of target)
  • min-input-files - Minimum files to trigger rewrite (default: 5)
  • rewrite-all - Force rewrite all files (default: false)
  • remove-dangling-deletes - Remove orphaned delete files (default: false)
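These options are passed through an options map; a rewrite strategy and a row filter can also be supplied. A sketch using the strategy, sort_order, options, and where arguments (column names are illustrative):

```sql
-- Sort-compact a subset of the table, lowering the rewrite trigger threshold
CALL catalog.system.rewrite_data_files(
    table => 'db.sample',
    strategy => 'sort',
    sort_order => 'id DESC NULLS LAST',
    options => map('min-input-files', '2'),
    where => 'id > 10'
);
```

The where predicate limits the rewrite to matching file groups, which keeps compaction jobs small on large partitioned tables.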

rewrite_manifests

Optimize manifest files for better scan planning:
CALL catalog.system.rewrite_manifests('db.sample');
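The procedure takes an optional second argument controlling whether Spark caches the manifest data during the rewrite (assuming the use_caching flag in your Iceberg version):

```sql
-- Disable Spark caching during the rewrite to reduce executor memory pressure
CALL catalog.system.rewrite_manifests('db.sample', false);
```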

rewrite_position_delete_files

Compact position delete files and remove dangling deletes:
CALL catalog.system.rewrite_position_delete_files('db.sample');

Table Migration

snapshot

Create a lightweight copy for testing:
CALL catalog.system.snapshot(
    'db.sample',
    'db.snap',
    '/tmp/temptable/'
);
Snapshot tables share data files with the source table. Use DROP TABLE to clean up when done testing.
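The same call can be written with named arguments, including table properties for the new snapshot table. A sketch, assuming the source_table, table, location, and properties argument names; the property shown is illustrative:

```sql
CALL catalog.system.snapshot(
    source_table => 'db.sample',
    table => 'db.snap',
    location => '/tmp/temptable/',
    properties => map('write.format.default', 'parquet')
);
```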

migrate

Replace a Hive/Spark table with an Iceberg table:
CALL catalog.system.migrate(
    'spark_catalog.db.sample',
    map('foo', 'bar')
);
Arguments:
  • table (required) - Table to migrate
  • properties - Properties for the new Iceberg table
  • drop_backup - Don’t retain original table (default: false)
  • backup_table_name - Custom backup name (default: table_BACKUP_)

add_files

Add files from external sources:
CALL spark_catalog.system.add_files(
    table => 'db.tbl',
    source_table => 'db.src_tbl',
    partition_filter => map('part_col_1', 'A')
);
Schema is not validated. Adding incompatible files will cause query failures.

register_table

Register an existing metadata file in a catalog:
CALL spark_catalog.system.register_table(
    table => 'db.tbl',
    metadata_file => 'path/to/metadata/file.json'
);
Registering the same metadata in multiple catalogs can cause data loss and corruption.

Change Data Capture

create_changelog_view

Create a view showing table changes:
CALL catalog.system.create_changelog_view(
    table => 'db.tbl',
    options => map(
        'start-snapshot-id', '1',
        'end-snapshot-id', '2'
    )
);
Query the changelog:
SELECT * FROM tbl_changes WHERE _change_type = 'INSERT';
CDC Metadata Columns:
  • _change_type - INSERT, DELETE, UPDATE_BEFORE, UPDATE_AFTER
  • _change_ordinal - Order of changes
  • _commit_snapshot_id - Snapshot where change occurred
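By default, an update appears as a delete/insert pair. Pre/post update images can be computed instead by telling the procedure which columns identify a row; a sketch, assuming the identifier_columns argument in your Iceberg version:

```sql
-- Collapse paired delete/insert rows into UPDATE_BEFORE / UPDATE_AFTER
CALL catalog.system.create_changelog_view(
    table => 'db.tbl',
    options => map(
        'start-snapshot-id', '1',
        'end-snapshot-id', '2'
    ),
    identifier_columns => array('id')
);
```

Computing update images is more expensive, since rows must be matched across snapshots by the identifier columns.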

Table Statistics

compute_table_stats

Calculate NDV statistics for columns:
CALL catalog.system.compute_table_stats('my_table');
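Statistics collection can be restricted to specific columns, which is cheaper on wide tables. A sketch, assuming the columns argument; the column names are illustrative:

```sql
-- Compute NDV statistics only for the columns used in join/filter predicates
CALL catalog.system.compute_table_stats(
    table => 'my_table',
    columns => array('id', 'category')
);
```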

compute_partition_stats

Compute partition statistics incrementally:
CALL catalog.system.compute_partition_stats('my_table');

Metadata Information

ancestors_of

Report snapshot ancestry:
CALL catalog.system.ancestors_of('db.tbl');
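Ancestry can also be reported from a specific starting point rather than the current snapshot. A sketch, assuming the snapshot_id argument; the ID is illustrative:

```sql
-- Report the ancestry of a specific snapshot
CALL catalog.system.ancestors_of(
    table => 'db.tbl',
    snapshot_id => 1
);
```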

Best Practices

Run maintenance procedures on a schedule:
  • Daily: expire_snapshots for active tables
  • Weekly: rewrite_data_files for frequently updated tables
  • Monthly: remove_orphan_files for all tables
For tables with streaming writes:
  • Use longer trigger intervals (1+ minutes)
  • Regularly run rewrite_data_files to compact small files
  • Run rewrite_manifests to optimize metadata
Always use dry run first:
CALL catalog.system.remove_orphan_files(
    table => 'db.sample',
    dry_run => true
);

Next Steps

Writes

Learn about write operations and distribution

Configuration

Configure Spark for optimal performance

Queries

Query tables and inspect metadata

Structured Streaming

Maintain streaming tables
