
Overview

Iceberg provides stored procedures for table maintenance and management. In Spark 3, procedures are available only when the Iceberg SQL extensions are enabled.
In Spark 4.0+, stored procedures are supported natively, but procedure names are case-sensitive.

Using Procedures

Call procedures from any configured catalog using the CALL statement:
CALL catalog_name.system.procedure_name(arguments);

Argument Passing

Named arguments can be provided in any order:
CALL catalog_name.system.rollback_to_snapshot(
    table => 'db.sample',
    snapshot_id => 1234
);
Mixing positional and named arguments is not supported.
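Positional calls are also supported; arguments are matched in the order the procedure declares them:

```sql
-- Equivalent positional form of the call above
CALL catalog_name.system.rollback_to_snapshot('db.sample', 1234);
```

Named arguments are the safer choice for procedures with many optional parameters, since they do not depend on declaration order.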

Snapshot Management

rollback_to_snapshot

Roll back a table to a specific snapshot:
CALL catalog.system.rollback_to_snapshot('db.sample', 1);
Arguments:
  • table (required) - Table name
  • snapshot_id (required) - Target snapshot ID

rollback_to_timestamp

Roll back to a snapshot at a specific time:
CALL catalog.system.rollback_to_timestamp(
    'db.sample',
    TIMESTAMP '2021-06-30 00:00:00.000'
);
Arguments:
  • table (required) - Table name
  • timestamp (required) - Target timestamp

set_current_snapshot

Set the current table state to a specific snapshot. Unlike rollback, the target is not required to be an ancestor of the current snapshot:
CALL catalog.system.set_current_snapshot('db.sample', 1);
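Recent Iceberg versions also accept a branch or tag name in place of a snapshot ID. A sketch, assuming the ref argument is available in your Iceberg version:

```sql
-- Set the current snapshot to the head of a branch or a tag
CALL catalog.system.set_current_snapshot(
    table => 'db.sample',
    ref => 's1'
);
```

Exactly one of snapshot_id or ref should be provided in a single call.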

cherrypick_snapshot

Apply changes from a snapshot without removing the original:
CALL catalog.system.cherrypick_snapshot('my_table', 1);
Only append and dynamic overwrite snapshots can be cherry-picked.

fast_forward

Fast-forward a branch to another branch’s head:
CALL catalog.system.fast_forward(
    'my_table',
    'main',
    'audit-branch'
);

Metadata Management

expire_snapshots

Remove old snapshots and unreferenced data files:
CALL catalog.system.expire_snapshots('db.sample');
Arguments:
  • table (required) - Table name
  • older_than - Expiration timestamp (default: 5 days ago)
  • retain_last - Minimum snapshots to keep (default: 1)
  • max_concurrent_deletes - Thread pool size for deletions
  • stream_results - Stream results to prevent driver OOM
  • snapshot_ids - Specific snapshot IDs to expire
Output:
  • deleted_data_files_count
  • deleted_position_delete_files_count
  • deleted_equality_delete_files_count
  • deleted_manifest_files_count
  • deleted_manifest_lists_count
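The optional arguments compose in one call. For example, to expire everything older than a cutoff while guaranteeing a minimum history (table name and timestamp are illustrative):

```sql
-- Expire snapshots older than the cutoff, but always keep the last 10
CALL catalog.system.expire_snapshots(
    table => 'db.sample',
    older_than => TIMESTAMP '2023-06-30 00:00:00.000',
    retain_last => 10
);
```

Note that retain_last takes precedence: snapshots within the last 10 are kept even if they are older than the cutoff.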

remove_orphan_files

Remove files not referenced in table metadata:
CALL catalog.system.remove_orphan_files(
    table => 'db.sample',
    dry_run => true
);
Arguments:
  • table (required) - Table name
  • older_than - Remove files older than this (default: 3 days ago)
  • location - Specific directory to scan
  • dry_run - Preview without deleting (default: false)
  • max_concurrent_deletes - Thread pool size
  • stream_results - Stream results to prevent OOM
Orphan file removal is irreversible. Always run with dry_run => true first.

rewrite_data_files

Compact small files and optimize data layout:
CALL catalog.system.rewrite_data_files('db.sample');
Common Options:
  • target-file-size-bytes - Target output file size (default: 512 MB)
  • min-file-size-bytes - Files below this are rewritten (default: 75% of target)
  • max-file-size-bytes - Files above this are rewritten (default: 180% of target)
  • min-input-files - Minimum files to trigger rewrite (default: 5)
  • rewrite-all - Force rewrite all files (default: false)
  • remove-dangling-deletes - Remove orphaned delete files (default: false)
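These options are passed through an options map; a rewrite strategy and a row filter can also be supplied. A sketch using the strategy, sort_order, options, and where arguments (column names are illustrative):

```sql
-- Sort-compact a subset of the table, lowering the rewrite trigger threshold
CALL catalog.system.rewrite_data_files(
    table => 'db.sample',
    strategy => 'sort',
    sort_order => 'id DESC NULLS LAST',
    options => map('min-input-files', '2'),
    where => 'id > 10'
);
```

The where predicate limits the rewrite to matching file groups, which keeps compaction jobs small on large partitioned tables.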

rewrite_manifests

Optimize manifest files for better scan planning:
CALL catalog.system.rewrite_manifests('db.sample');
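The procedure takes an optional second argument controlling whether Spark caches the manifest data during the rewrite (assuming the use_caching flag in your Iceberg version):

```sql
-- Disable Spark caching during the rewrite to reduce executor memory pressure
CALL catalog.system.rewrite_manifests('db.sample', false);
```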

rewrite_position_delete_files

Compact position delete files and remove dangling deletes:
CALL catalog.system.rewrite_position_delete_files('db.sample');

Table Migration

snapshot

Create a lightweight copy for testing:
CALL catalog.system.snapshot(
    'db.sample',
    'db.snap',
    '/tmp/temptable/'
);
Snapshot tables share data files with the source table. Use DROP TABLE to clean up when done testing.
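The same call can be written with named arguments, including table properties for the new snapshot table. A sketch, assuming the source_table, table, location, and properties argument names; the property shown is illustrative:

```sql
CALL catalog.system.snapshot(
    source_table => 'db.sample',
    table => 'db.snap',
    location => '/tmp/temptable/',
    properties => map('write.format.default', 'parquet')
);
```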

migrate

Replace a Hive/Spark table with an Iceberg table:
CALL catalog.system.migrate(
    'spark_catalog.db.sample',
    map('foo', 'bar')
);
Arguments:
  • table (required) - Table to migrate
  • properties - Properties for the new Iceberg table
  • drop_backup - Don’t retain original table (default: false)
  • backup_table_name - Custom backup name (default: table_BACKUP_)

add_files

Add files from external sources:
CALL spark_catalog.system.add_files(
    table => 'db.tbl',
    source_table => 'db.src_tbl',
    partition_filter => map('part_col_1', 'A')
);
Schema is not validated. Adding incompatible files will cause query failures.

register_table

Register an existing metadata file in a catalog:
CALL spark_catalog.system.register_table(
    table => 'db.tbl',
    metadata_file => 'path/to/metadata/file.json'
);
Registering the same metadata in multiple catalogs can cause data loss and corruption.

Change Data Capture

create_changelog_view

Create a view showing table changes:
CALL catalog.system.create_changelog_view(
    table => 'db.tbl',
    options => map(
        'start-snapshot-id', '1',
        'end-snapshot-id', '2'
    )
);
Query the changelog:
SELECT * FROM tbl_changes WHERE _change_type = 'INSERT';
CDC Metadata Columns:
  • _change_type - INSERT, DELETE, UPDATE_BEFORE, UPDATE_AFTER
  • _change_ordinal - Order of changes
  • _commit_snapshot_id - Snapshot where change occurred
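By default, an update appears as a delete/insert pair. Pre/post update images can be computed instead by telling the procedure which columns identify a row; a sketch, assuming the identifier_columns argument in your Iceberg version:

```sql
-- Collapse paired delete/insert rows into UPDATE_BEFORE / UPDATE_AFTER
CALL catalog.system.create_changelog_view(
    table => 'db.tbl',
    options => map(
        'start-snapshot-id', '1',
        'end-snapshot-id', '2'
    ),
    identifier_columns => array('id')
);
```

Computing update images is more expensive, since rows must be matched across snapshots by the identifier columns.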

Table Statistics

compute_table_stats

Calculate NDV statistics for columns:
CALL catalog.system.compute_table_stats('my_table');
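Statistics collection can be restricted to specific columns, which is cheaper on wide tables. A sketch, assuming the columns argument; the column names are illustrative:

```sql
-- Compute NDV statistics only for the columns used in join/filter predicates
CALL catalog.system.compute_table_stats(
    table => 'my_table',
    columns => array('id', 'category')
);
```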

compute_partition_stats

Compute partition statistics incrementally:
CALL catalog.system.compute_partition_stats('my_table');

Metadata Information

ancestors_of

Report snapshot ancestry:
CALL catalog.system.ancestors_of('db.tbl');
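Ancestry can also be reported from a specific starting point rather than the current snapshot. A sketch, assuming the snapshot_id argument; the ID is illustrative:

```sql
-- Report the ancestry of a specific snapshot
CALL catalog.system.ancestors_of(
    table => 'db.tbl',
    snapshot_id => 1
);
```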

Best Practices

Run maintenance procedures on a schedule:
  • Daily: expire_snapshots for active tables
  • Weekly: rewrite_data_files for frequently updated tables
  • Monthly: remove_orphan_files for all tables
For tables with streaming writes:
  • Use longer trigger intervals (1+ minutes)
  • Regularly run rewrite_data_files to compact small files
  • Run rewrite_manifests to optimize metadata
Always use dry run first:
CALL catalog.system.remove_orphan_files(
    table => 'db.sample',
    dry_run => true
);

Next Steps

Writes

Learn about write operations and distribution

Configuration

Configure Spark for optimal performance

Queries

Query tables and inspect metadata

Structured Streaming

Maintain streaming tables
