Learn how to maintain Apache Iceberg tables through snapshot expiration, compaction, and orphan file cleanup
Regular maintenance is essential for optimal Iceberg table performance. This guide covers recommended and optional maintenance operations to keep your tables healthy.
Maintenance operations require a loaded Table instance. See the Java API quickstart for details on loading tables.
Each write to an Iceberg table creates a new snapshot (version) of the table. Snapshots enable time-travel queries and rollback, but they accumulate over time and must be expired to delete unused data files and keep metadata compact.
import org.apache.iceberg.ExpireSnapshots.CleanupLevel;

// Only remove snapshot metadata, keep data and manifest files
table.expireSnapshots()
    .expireOlderThan(tsToExpire)
    .cleanupLevel(CleanupLevel.METADATA_ONLY)
    .commit();

// Remove both metadata and data files (default)
table.expireSnapshots()
    .expireOlderThan(tsToExpire)
    .cleanupLevel(CleanupLevel.ALL)
    .commit();

// Skip all cleanup, only remove snapshot references
table.expireSnapshots()
    .expireOlderThan(tsToExpire)
    .cleanupLevel(CleanupLevel.NONE)
    .commit();
Use METADATA_ONLY when data files are shared across tables or when using procedures like add-files that may reference the same data files.
Data files are not deleted until they are no longer referenced by any snapshot that may be used for time travel or rollback. Regularly expiring snapshots is essential to reclaim storage.
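The reference rule above can be illustrated with a toy model. This is a minimal sketch and not the Iceberg API: the class `SnapshotExpirySketch`, the method `deletableFiles`, and the representation of a snapshot as a timestamp mapped to a set of file names are all assumptions made for illustration. It also ignores details a real table handles, such as always retaining the current snapshot.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model (hypothetical, not the Iceberg API): each snapshot is a
// commit timestamp mapped to the set of data files it references.
class SnapshotExpirySketch {

    // Returns the files referenced only by snapshots older than the cutoff.
    // A file still referenced by any retained snapshot is never returned.
    static Set<String> deletableFiles(Map<Long, Set<String>> snapshots,
                                      long expireOlderThanMillis) {
        Set<String> candidates = new HashSet<>();
        Set<String> stillReferenced = new HashSet<>();
        for (Map.Entry<Long, Set<String>> snapshot : snapshots.entrySet()) {
            if (snapshot.getKey() < expireOlderThanMillis) {
                candidates.addAll(snapshot.getValue());
            } else {
                stillReferenced.addAll(snapshot.getValue());
            }
        }
        candidates.removeAll(stillReferenced);
        return new TreeSet<>(candidates);
    }

    public static void main(String[] args) {
        Map<Long, Set<String>> snapshots = new HashMap<>();
        snapshots.put(100L, Set.of("a.parquet", "b.parquet")); // older snapshot
        snapshots.put(200L, Set.of("b.parquet", "c.parquet")); // newer snapshot
        // b.parquet is shared with the retained snapshot, so only a.parquet qualifies.
        System.out.println(deletableFiles(snapshots, 150L));
    }
}
```

In this model, a file shared between an expired and a retained snapshot stays on disk, which is why expiring one snapshot often reclaims less storage than its size suggests.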
Iceberg tracks table metadata using JSON files. Each table change produces a new metadata file for atomicity. Over time, these accumulate and need cleanup.
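Iceberg provides the table properties `write.metadata.delete-after-commit.enabled` and `write.metadata.previous-versions-max` to bound how many old metadata files are kept. The sketch below is a toy model of that bounded-history idea only, assuming a simple first-in, first-out window. The class `MetadataRetentionSketch` and its `commit` method are hypothetical names, and the code does not reflect how Iceberg implements the cleanup internally.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model (hypothetical): keep the current metadata file plus at most
// previousVersionsMax earlier ones, dropping the oldest after each commit.
class MetadataRetentionSketch {
    private final int previousVersionsMax;
    private final Deque<String> previous = new ArrayDeque<>();
    private String current;

    MetadataRetentionSketch(int previousVersionsMax) {
        this.previousVersionsMax = previousVersionsMax;
    }

    // Records a new metadata file and returns the files that fell out of
    // the retention window and could now be deleted.
    List<String> commit(String newMetadataFile) {
        List<String> removed = new ArrayList<>();
        if (current != null) {
            previous.addLast(current);
        }
        current = newMetadataFile;
        while (previous.size() > previousVersionsMax) {
            removed.add(previous.removeFirst());
        }
        return removed;
    }

    public static void main(String[] args) {
        MetadataRetentionSketch history = new MetadataRetentionSketch(2);
        for (int version = 1; version <= 4; version++) {
            String file = "v" + version + ".metadata.json";
            System.out.println(file + " committed, removable: " + history.commit(file));
        }
    }
}
```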
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import org.apache.iceberg.spark.actions.SparkActions;

// Only delete files older than 5 days
Instant olderThan = Instant.now().minus(5, ChronoUnit.DAYS);
SparkActions
    .get()
    .deleteOrphanFiles(table)
    .olderThan(olderThan.toEpochMilli())
    .execute();
It is dangerous to remove orphan files with a retention interval shorter than the time expected for any write to complete. The default interval is 3 days. Setting it too short might corrupt the table by deleting in-progress files.
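To make the retention rule concrete, here is a minimal sketch using only the JDK. It is not the Spark action itself: `OrphanFileSketch`, `safeCutoffMillis`, and `orphans` are hypothetical names, and the file listing is assumed to be a simple map from path to last-modified time. The guard rejects a retention interval that does not exceed the longest expected write, and the selection treats a file as an orphan only if it is both unreferenced and older than the cutoff.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model (hypothetical): orphan selection with a retention guard.
class OrphanFileSketch {

    // Computes the olderThan cutoff, refusing intervals that could catch
    // files belonging to a write that is still in progress.
    static long safeCutoffMillis(Instant now, Duration retention,
                                 Duration longestExpectedWrite) {
        if (retention.compareTo(longestExpectedWrite) <= 0) {
            throw new IllegalArgumentException(
                "Retention must exceed the longest expected write duration");
        }
        return now.minus(retention).toEpochMilli();
    }

    // A file is an orphan only if no metadata references it AND it was
    // last modified before the cutoff.
    static Set<String> orphans(Map<String, Long> listedFiles,
                               Set<String> referenced, long olderThanMillis) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Long> file : listedFiles.entrySet()) {
            boolean unreferenced = !referenced.contains(file.getKey());
            boolean oldEnough = file.getValue() < olderThanMillis;
            if (unreferenced && oldEnough) {
                result.add(file.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long cutoff = safeCutoffMillis(
            Instant.now(), Duration.ofDays(3), Duration.ofHours(6));
        System.out.println("Delete unreferenced files modified before " + cutoff);
    }
}
```

A recently written, unreferenced file is left alone in this model because it may belong to a commit that has not finished yet.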
// Clean specific locations
SparkActions
    .get()
    .deleteOrphanFiles(table)
    .location("s3://bucket/warehouse/db/table/data/")
    .execute();
Iceberg uses string representations of paths when determining which files to remove. On some file systems, paths can change over time while representing the same file (e.g., HDFS authority changes). This will lead to data loss when orphan file deletion is run. Ensure entries in metadata tables match current file listings.
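The following sketch shows why plain string comparison is fragile here. It uses only `java.net.URI` from the JDK. The class name `PathComparisonSketch` and the host names `namenode-old` and `namenode-new` are hypothetical, chosen to mimic an HDFS authority change, and the code is not how Iceberg itself compares paths.

```java
import java.net.URI;

// Toy model (hypothetical): the same file seen under two HDFS authorities.
class PathComparisonSketch {

    // Plain string equality, as used when paths are compared as strings.
    static boolean sameAsString(String a, String b) {
        return a.equals(b);
    }

    // Compares only the path component, ignoring scheme and authority.
    static boolean samePathIgnoringAuthority(String a, String b) {
        return URI.create(a).getPath().equals(URI.create(b).getPath());
    }

    public static void main(String[] args) {
        String inMetadata =
            "hdfs://namenode-old:8020/warehouse/db/table/data/file-a.parquet";
        String inListing =
            "hdfs://namenode-new:8020/warehouse/db/table/data/file-a.parquet";
        System.out.println(sameAsString(inMetadata, inListing));              // false
        System.out.println(samePathIgnoringAuthority(inMetadata, inListing)); // true
    }
}
```

Under string comparison, the listed file looks unreferenced even though the table still depends on it, so an orphan cleanup would treat live data as removable.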