Consistency, isolation, and reliability guarantees in Apache Iceberg
Iceberg was designed to solve correctness and reliability problems that affect traditional table formats, especially when running on cloud object stores like Amazon S3. It provides strong consistency guarantees without requiring distributed locks or consistent listing.
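The commit side of this guarantee can be illustrated with a small sketch: a commit is an optimistic compare-and-swap of a single pointer to the table's current metadata file, so it needs no lock at all. The class, method, and file names below are hypothetical scaffolding for illustration, not Iceberg's actual catalog API:

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of optimistic commits (hypothetical names, not Iceberg's API):
// the catalog holds one pointer to the current metadata file, and a
// commit succeeds only if the pointer has not moved since the writer read it.
public class OptimisticCommit {
    // The catalog's atomic pointer to the table's current metadata location.
    static final AtomicReference<String> currentMetadata =
        new AtomicReference<>("metadata-v1.json");

    // Try to swap the pointer; returns true only if no one committed in between.
    static boolean commit(String expectedMetadata, String newMetadata) {
        return currentMetadata.compareAndSet(expectedMetadata, newMetadata);
    }

    public static void main(String[] args) {
        String base = currentMetadata.get();
        // A concurrent writer commits first...
        commit(base, "metadata-v2.json");
        // ...so our commit against the stale base fails.
        boolean firstTry = commit(base, "metadata-v2-mine.json");
        // Retry: re-read the pointer, rebase, and commit again.
        boolean retry = commit(currentMetadata.get(), "metadata-v3.json");
        System.out.println(firstTry + " " + retry); // prints "false true"
    }
}
```

In a real deployment the catalog (e.g. Hive Metastore, a JDBC database, or Nessie) provides this single atomic swap; everything else a writer produces is new immutable files, which is why neither readers nor writers need distributed locks.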
```java
// Reader loads a snapshot
Table table = catalog.loadTable(tableId);
long snapshotId = table.currentSnapshot().snapshotId();

// Read remains consistent even if writers commit changes
TableScan scan = table.newScan()
    .useSnapshot(snapshotId);  // Always sees the same snapshot

// Writers can commit while reader is running
table.newAppend().appendFile(newFile).commit();

// Reader is unaffected - still sees original snapshot
Iterable<FileScanTask> files = scan.planFiles();
```
Readers never need locks:
- Snapshots are immutable
- Readers pin a snapshot at table load time
- Concurrent writes create new snapshots
- Readers remain on their snapshot until explicitly refreshed
Writers structure operations to minimize retry cost.

Good: appends create new manifests that can be reused:

```java
// Manifest file created before commit attempt
// On retry, same manifest is added to new snapshot
table.newAppend()
    .appendFile(file1)  // Writes new manifest: manifest-1.avro
    .commit();          // Adds manifest-1.avro to snapshot
// Retry reuses manifest-1.avro (no rewrite needed)
```
Expensive: rewrites require recreating manifests:

```java
// Must verify rewritten files still exist
table.newRewrite()
    .rewriteFiles(
        ImmutableSet.of(file1, file2),  // Files to remove
        ImmutableSet.of(merged)         // Files to add
    )
    .commit();
// On retry, must check file1 and file2 weren't deleted
```
Operations specify assumptions that must hold for commit:
```java
// Example: Compaction assumes source files still exist
table.newRewrite()
    .rewriteFiles(
        oldFiles,  // Assumption: These files are still in the table
        newFiles   // Action: Replace with these files
    )
    .commit();
// On conflict:
// 1. Refresh table to see new state
// 2. Check if oldFiles are still present
// 3. If yes, retry commit
// 4. If no, operation fails (files were already deleted)
```
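The four-step conflict procedure above can be sketched as a validate-and-retry loop over an atomically swapped set of live files. All names here (`liveFiles`, `rewriteWithRetry`, the file names) are hypothetical, not Iceberg's internals:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of an optimistic rewrite with validation (hypothetical types, not Iceberg's API).
public class RewriteRetry {
    // Current table state: the set of live data files, swapped atomically on commit.
    static final AtomicReference<Set<String>> liveFiles =
        new AtomicReference<>(Set.of("a.parquet", "b.parquet", "c.parquet"));

    // Replace toRemove with toAdd, but only if the assumption
    // "all files in toRemove are still live" holds at commit time.
    static boolean rewriteWithRetry(Set<String> toRemove, Set<String> toAdd, int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            Set<String> base = liveFiles.get();          // 1. Refresh: read current state
            if (!base.containsAll(toRemove)) {
                return false;                            // 4. Assumption broken: fail
            }
            Set<String> next = new HashSet<>(base);      // 2. Assumption holds: build new state
            next.removeAll(toRemove);
            next.addAll(toAdd);
            if (liveFiles.compareAndSet(base, Set.copyOf(next))) {
                return true;                             // 3. Commit succeeded
            }
            // CAS failed: another writer committed first; loop refreshes and re-validates.
        }
        return false;
    }

    public static void main(String[] args) {
        boolean ok = rewriteWithRetry(
            Set.of("a.parquet", "b.parquet"), Set.of("merged.parquet"), 3);
        System.out.println(ok + " " + liveFiles.get().contains("merged.parquet"));
    }
}
```

The key property is that validation runs against freshly refreshed state on every attempt, so a retry either re-proves the assumption or fails cleanly instead of committing over someone else's changes.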
Atomic file replacement is what makes maintenance operations like compaction safe:

```java
// Find small files to compact
List<DataFile> smallFiles = findSmallFiles(table);

// Compact into larger files
DataFile compacted = mergeFiles(smallFiles);

// Atomically replace small files with compacted file
table.newRewrite()
    .rewriteFiles(
        ImmutableSet.copyOf(smallFiles),  // Remove these
        ImmutableSet.of(compacted)        // Add this
    )
    .commit();
// If commit fails (files deleted by another writer), retry logic handles it
```
Without atomicity, compaction could:
- Leave duplicate data (compacted + originals)
- Lose data (delete originals before adding compacted)
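A toy model makes the hazard concrete: if compaction published its changes in two steps, a reader could observe the intermediate state, while an atomic swap only ever exposes complete states. This is a simplified sketch, not Iceberg code:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch contrasting non-atomic and atomic compaction (hypothetical, simplified).
public class CompactionAtomicity {
    // Non-atomic: add the compacted file first, delete originals second.
    // A reader between the two steps sees both copies of the data.
    static Set<String> afterStepOne(Set<String> files, String compacted) {
        Set<String> state = new HashSet<>(files);
        state.add(compacted);      // originals AND compacted visible: duplicates
        return state;
    }

    // Atomic: build the complete new file set, publish it in one step.
    static Set<String> atomicSwap(Set<String> files, Set<String> originals, String compacted) {
        Set<String> next = new HashSet<>(files);
        next.removeAll(originals);
        next.add(compacted);
        return Set.copyOf(next);   // readers see either the old set or this one
    }

    public static void main(String[] args) {
        Set<String> files = Set.of("small-1.parquet", "small-2.parquet");
        System.out.println(afterStepOne(files, "merged.parquet").size());  // 3: duplicates visible
        System.out.println(atomicSwap(files, files, "merged.parquet").size());  // 1: compacted only
    }
}
```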
Atomic commits also make it safe to backfill late-arriving data into historical partitions:

```java
// Write late data for yesterday
DataFile lateFile = writeLateData(yesterday);

// Safely add to historical partition
table.newAppend()
    .appendFile(lateFile)
    .commit();
// Readers see a consistent snapshot (before or after, never partial)
```
Deletes that match a filter are applied atomically as well:

```java
// Delete files matching a filter
table.newDelete()
    .deleteFromRowFilter(Expressions.equal("status", "deleted"))
    .commit();
// All matching files deleted atomically
// Readers never see partial deletes
```
Because partitions are tracked in table metadata rather than discovered by listing directories, Iceberg also avoids Hive's partition-count limits:

```sql
-- Hive: Limited to ~10K partitions (listing overhead)
PARTITIONED BY (year, month, day)

-- Iceberg: Can handle millions of partitions
PARTITIONED BY (hours(event_time), region, user_bucket)

-- No listing = no partition count limit
```
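To see why no listing is needed, consider how a transform like `hours(event_time)` works: each data file's partition values are computed from row values and stored in manifests, so query planning compares predicates against stored values instead of enumerating directories. A minimal sketch of the hour transform (my own helper, not Iceberg's `Transform` implementation):

```java
import java.time.Instant;

// Sketch of a partition transform like Iceberg's hours(event_time)
// (hypothetical helper, not Iceberg's actual implementation).
public class HoursTransform {
    // Map a timestamp to its partition value: whole hours since the Unix epoch.
    static long hoursFromEpoch(Instant ts) {
        return Math.floorDiv(ts.getEpochSecond(), 3600);
    }

    public static void main(String[] args) {
        Instant a = Instant.parse("2024-01-01T10:15:00Z");
        Instant b = Instant.parse("2024-01-01T10:59:59Z");
        Instant c = Instant.parse("2024-01-01T11:00:00Z");
        // Rows in the same hour share a partition value; the next hour starts a new one.
        System.out.println(hoursFromEpoch(a) == hoursFromEpoch(b)); // true
        System.out.println(hoursFromEpoch(a) == hoursFromEpoch(c)); // false
    }
}
```

A filter like `event_time >= X` maps directly to a range of these integer partition values, so planning prunes files by metadata comparison alone; the partition count never forces a directory listing.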