Architecture Overview
Iceberg uses a tree-based metadata structure to track all files in a table:- Atomic commits - Table changes are atomic swaps of metadata files
- Snapshot isolation - Readers see a consistent view without locks
- Fast planning - O(1) remote calls instead of O(n) file listings
- Time travel - Access any historical snapshot
Key Components
Table Metadata File
The table metadata file is the entry point to a table. It contains:- Current schema and all historical schemas
- Current partition spec and all historical partition specs
- Sort orders
- Table properties and configuration
- Snapshot log
- Reference to the current snapshot
Snapshots
A snapshot represents the state of a table at a specific point in time. Each snapshot includes:- Unique snapshot ID
- Timestamp when the snapshot was created
- Operation that created it (append, replace, delete, etc.)
- Manifest list containing all data files
- Summary metrics (files added, deleted, records, etc.)
- Reader isolation - queries always use a consistent snapshot
- Time travel - query data as it existed at any snapshot
- Rollback - revert to a previous table state
- Incremental processing - detect changes between snapshots
Manifest Lists
A manifest list stores metadata about manifest files for a snapshot:- Path to each manifest file
- Partition value ranges (min/max for each partition field)
- Counts of added, deleted, and existing files
- Sequence numbers for tracking relative age
Manifest Files
Manifest files are Avro files that list data files along with their metadata:- File path and format (Parquet, ORC, Avro)
- Partition values for the file
- Column-level statistics (value counts, null counts, lower/upper bounds)
- File size and record count
- File-level metrics for query planning
Data Files
Data files contain the actual table data in columnar formats:- Parquet - recommended for most use cases
- ORC - optimized for Hive compatibility
- Avro - row-oriented, good for streaming writes
Optimistic Concurrency
Iceberg supports multiple concurrent writers using optimistic concurrency control:- Each writer reads the current table metadata
- Writer creates new data files and manifest files
- Writer prepares a new metadata file based on the current version
- Writer attempts to commit by atomically swapping the metadata file pointer
- If another writer committed first, retry with the new current state
- Serializable isolation - all changes occur in a linear history
- No distributed locks - writers don’t block readers or each other
- Efficient retries - work is structured to minimize wasted effort
Sequence Numbers
Every successful commit is assigned a sequence number that establishes the relative age of files:- Sequence numbers are assigned when a snapshot is committed
- All files added in a snapshot inherit the snapshot’s sequence number
- Sequence numbers are used to determine which delete files apply to which data files
- This enables correct handling of row-level deletes and updates
Format Versions
Iceberg has evolved through multiple format versions:Version 1
Analytic data tables with immutable files (Parquet, Avro, ORC)
Version 2
Adds row-level deletes (position and equality deletes)
Version 3
Extended types (nanosecond timestamps, variant, geometry), default values, row lineage
Design Goals
Iceberg was designed with specific goals:Serializable Isolation
Serializable Isolation
Reads use committed snapshots and are never affected by concurrent writes. All writes appear in a linear history.
Speed
Speed
Operations use O(1) remote calls for planning, not O(n) with table size. Planning scales to petabyte tables on a single node.
Scale
Scale
Job planning happens on clients, not a central metadata store. Metadata includes statistics for cost-based optimization.
Evolution
Evolution
Full schema and partition spec evolution without rewriting data. Safe column operations with correctness guarantees.
Storage Separation
Storage Separation
Queries use predicates on data values, not partition values. Partitioning can evolve as data volume changes.
File System Requirements
Iceberg only requires basic file system operations:- In-place write - files are never moved after being written
- Seekable reads - for columnar format support
- Deletes - remove files that are no longer used
Learn More
Schemas
Learn about Iceberg’s schema structure and type system
Partitioning
Understand hidden partitioning and partition transforms
Evolution
Explore schema and partition evolution capabilities
Reliability
Discover Iceberg’s reliability guarantees