Skip to main content
Apache Iceberg is a high-performance table format designed to manage large, slow-changing collections of files in distributed file systems or object stores. Unlike traditional table formats, Iceberg treats data as a table rather than as directories of files.

Architecture Overview

Iceberg uses a tree-based metadata structure to track all files in a table:
Table Metadata File (atomic pointer)
  |
  v
Snapshot (point-in-time state)
  |
  v
Manifest List (index of manifests)
  |
  v
Manifest Files (lists of data files)
  |
  v
Data Files (Parquet, ORC, Avro)
This hierarchical structure enables:
  • Atomic commits - Table changes are atomic swaps of metadata files
  • Snapshot isolation - Readers see a consistent view without locks
  • Fast planning - O(1) remote calls instead of O(n) file listings
  • Time travel - Access any historical snapshot

Key Components

Table Metadata File

The table metadata file is the entry point to a table. It contains:
  • Current schema and all historical schemas
  • Current partition spec and all historical partition specs
  • Sort orders
  • Table properties and configuration
  • Snapshot log
  • Reference to the current snapshot
All table updates create a new metadata file and atomically swap the pointer.

Snapshots

A snapshot represents the state of a table at a specific point in time. Each snapshot includes:
  • Unique snapshot ID
  • Timestamp when the snapshot was created
  • Operation that created it (append, replace, delete, etc.)
  • Manifest list containing all data files
  • Summary metrics (files added, deleted, records, etc.)
Snapshots form the foundation for:
  • Reader isolation - queries always use a consistent snapshot
  • Time travel - query data as it existed at any snapshot
  • Rollback - revert to a previous table state
  • Incremental processing - detect changes between snapshots

Manifest Lists

A manifest list stores metadata about manifest files for a snapshot:
  • Path to each manifest file
  • Partition value ranges (min/max for each partition field)
  • Counts of added, deleted, and existing files
  • Sequence numbers for tracking relative age
Manifest lists act as an index - they enable filtering out entire manifests during query planning without reading them.

Manifest Files

Manifest files are Avro files that list data files along with their metadata:
  • File path and format (Parquet, ORC, Avro)
  • Partition values for the file
  • Column-level statistics (value counts, null counts, lower/upper bounds)
  • File size and record count
  • File-level metrics for query planning
Manifests are organized by partition spec - when a partition spec changes, new manifests are created while old manifests remain unchanged.

Data Files

Data files contain the actual table data in columnar formats:
  • Parquet - recommended for most use cases
  • ORC - optimized for Hive compatibility
  • Avro - row-oriented, good for streaming writes
Data files are immutable once written. Updates create new files rather than modifying existing ones.

Optimistic Concurrency

Iceberg supports multiple concurrent writers using optimistic concurrency control:
  1. Each writer reads the current table metadata
  2. Writer creates new data files and manifest files
  3. Writer prepares a new metadata file based on the current version
  4. Writer attempts to commit by atomically swapping the metadata file pointer
  5. If another writer committed first, retry with the new current state
// Example: Concurrent append operation
Table table = catalog.loadTable(tableId);

// Create and commit new data
AppendFiles append = table.newAppend();
append.appendFile(dataFile1);
append.appendFile(dataFile2);
append.commit(); // Atomic swap - will retry if there's a conflict
This approach enables:
  • Serializable isolation - all changes occur in a linear history
  • No distributed locks - writers don’t block readers or each other
  • Efficient retries - work is structured to minimize wasted effort

Sequence Numbers

Every successful commit is assigned a sequence number that establishes the relative age of files:
  • Sequence numbers are assigned when a snapshot is committed
  • All files added in a snapshot inherit the snapshot’s sequence number
  • Sequence numbers are used to determine which delete files apply to which data files
  • This enables correct handling of row-level deletes and updates

Format Versions

Iceberg has evolved through multiple format versions:

Version 1

Analytic data tables with immutable files (Parquet, Avro, ORC)

Version 2

Adds row-level deletes (position and equality deletes)

Version 3

Extended types (nanosecond timestamps, variant, geometry), default values, row lineage
Tables can be upgraded between versions, and v1 metadata files remain valid in v2/v3 tables.

Design Goals

Iceberg was designed with specific goals:
Reads use committed snapshots and are never affected by concurrent writes. All writes appear in a linear history.
Operations use O(1) remote calls for planning, not O(n) with table size. Planning scales to petabyte tables on a single node.
Job planning happens on clients, not a central metadata store. Metadata includes statistics for cost-based optimization.
Full schema and partition spec evolution without rewriting data. Safe column operations with correctness guarantees.
Queries use predicates on data values, not partition values. Partitioning can evolve as data volume changes.

File System Requirements

Iceberg only requires basic file system operations:
  • In-place write - files are never moved after being written
  • Seekable reads - for columnar format support
  • Deletes - remove files that are no longer used
No rename or listing operations required - making Iceberg compatible with object stores like S3 without consistency requirements.

Learn More

Schemas

Learn about Iceberg’s schema structure and type system

Partitioning

Understand hidden partitioning and partition transforms

Evolution

Explore schema and partition evolution capabilities

Reliability

Discover Iceberg’s reliability guarantees

Build docs developers (and LLMs) love