Table Format

Apache Iceberg is a high-performance table format designed to manage large, slow-changing collections of files in distributed file systems or object stores. Unlike directory-based table layouts, Iceberg tracks the individual files that make up a table rather than treating the table as a set of directories.

Architecture Overview

Iceberg uses a tree-based metadata structure to track all files in a table:

Table Metadata File (atomic pointer)
  |
  v
Snapshot (point-in-time state)
  |
  v
Manifest List (index of manifests)
  |
  v
Manifest Files (lists of data files)
  |
  v
Data Files (Parquet, ORC, Avro)

This hierarchical structure enables:

Atomic commits - Table changes are atomic swaps of metadata files
Snapshot isolation - Readers see a consistent view without locks
Fast planning - O(1) remote calls instead of O(n) file listings
Time travel - Access any historical snapshot
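
To make the tree concrete, here is a minimal sketch of the hierarchy as plain Java classes, with a walk from the table metadata down to the data file paths. All class and field names are hypothetical simplifications for illustration; they are not Iceberg's real classes, which carry far more information at each level.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of the metadata tree; not Iceberg's real classes.
final class MetadataTreeSketch {

    // A manifest lists data files (here, only their paths).
    static final class ManifestFile {
        final List<String> dataFilePaths;

        ManifestFile(String... dataFilePaths) {
            this.dataFilePaths = Arrays.asList(dataFilePaths);
        }
    }

    // A manifest list indexes the manifests that belong to one snapshot.
    static final class ManifestList {
        final List<ManifestFile> manifests;

        ManifestList(ManifestFile... manifests) {
            this.manifests = Arrays.asList(manifests);
        }
    }

    // A snapshot is a point-in-time state that refers to one manifest list.
    static final class Snapshot {
        final long snapshotId;
        final ManifestList manifestList;

        Snapshot(long snapshotId, ManifestList manifestList) {
            this.snapshotId = snapshotId;
            this.manifestList = manifestList;
        }
    }

    // The table metadata points at the current snapshot.
    static final class TableMetadata {
        final Snapshot currentSnapshot;

        TableMetadata(Snapshot currentSnapshot) {
            this.currentSnapshot = currentSnapshot;
        }
    }

    // Walk metadata -> snapshot -> manifest list -> manifests -> data files.
    static List<String> dataFiles(TableMetadata metadata) {
        List<String> paths = new ArrayList<>();
        for (ManifestFile manifest : metadata.currentSnapshot.manifestList.manifests) {
            paths.addAll(manifest.dataFilePaths);
        }
        return paths;
    }

    public static void main(String[] args) {
        ManifestFile m1 = new ManifestFile("data/a.parquet", "data/b.parquet");
        ManifestFile m2 = new ManifestFile("data/c.parquet");
        TableMetadata metadata = new TableMetadata(new Snapshot(1L, new ManifestList(m1, m2)));
        System.out.println(dataFiles(metadata));
    }
}
```

The walk reads a fixed number of metadata levels and never lists a directory, which is the property behind the fast planning claim.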

Key Components

Table Metadata File

The table metadata file is the entry point to a table. It contains:

Current schema and all historical schemas
Current partition spec and all historical partition specs
Sort orders
Table properties and configuration
Snapshot log
Reference to the current snapshot

All table updates create a new metadata file and atomically swap the pointer.

Snapshots

A snapshot represents the state of a table at a specific point in time. Each snapshot includes:

Unique snapshot ID
Timestamp when the snapshot was created
Operation that created it (append, replace, delete, etc.)
Manifest list location, pointing to the manifests that track the snapshot's data files
Summary metrics (files added, deleted, records, etc.)

Snapshots form the foundation for:

Reader isolation - queries always use a consistent snapshot
Time travel - query data as it existed at any snapshot
Rollback - revert to a previous table state
Incremental processing - detect changes between snapshots
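
Time travel and rollback both reduce to choosing which snapshot to read. The sketch below models that choice with a sorted log of commit timestamps. The class and method names are hypothetical and the model is deliberately simplified: for example, it does not record the rollback itself as a new log entry.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of snapshot selection; names are illustrative, not Iceberg's API.
final class SnapshotLogSketch {
    // Commit timestamp (ms) -> snapshot ID, ordered by time.
    private final TreeMap<Long, Long> log = new TreeMap<>();
    private long currentSnapshotId = -1L;

    // Record a committed snapshot and make it the current one.
    void commit(long timestampMillis, long snapshotId) {
        log.put(timestampMillis, snapshotId);
        currentSnapshotId = snapshotId;
    }

    // Time travel: the latest snapshot committed at or before the given time.
    long snapshotAsOf(long timestampMillis) {
        Map.Entry<Long, Long> entry = log.floorEntry(timestampMillis);
        if (entry == null) {
            throw new IllegalArgumentException("No snapshot at or before " + timestampMillis);
        }
        return entry.getValue();
    }

    // Rollback: repoint the table at an older snapshot without touching data files.
    void rollbackTo(long snapshotId) {
        if (!log.containsValue(snapshotId)) {
            throw new IllegalArgumentException("Unknown snapshot " + snapshotId);
        }
        currentSnapshotId = snapshotId;
    }

    long currentSnapshotId() {
        return currentSnapshotId;
    }
}
```

Because data files are never modified in place, both operations only change which metadata a reader starts from.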

Manifest Lists

A manifest list stores metadata about manifest files for a snapshot:

Path to each manifest file
Partition value ranges (min/max for each partition field)
Counts of added, deleted, and existing files
Sequence numbers for tracking relative age

Manifest lists act as an index - they enable filtering out entire manifests during query planning without reading them.
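
As an illustration of that filtering, the sketch below prunes manifests with a range-overlap test on the min/max summary of a single integer partition field. The `ManifestEntry` type and its fields are assumptions made for this example; real manifest list entries hold one summary per partition field plus additional flags.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of manifest pruning; not Iceberg's real planning code.
final class ManifestPruningSketch {

    // One manifest-list entry with a min/max summary for a single integer partition field.
    static final class ManifestEntry {
        final String path;
        final int partitionMin;
        final int partitionMax;

        ManifestEntry(String path, int partitionMin, int partitionMax) {
            this.path = path;
            this.partitionMin = partitionMin;
            this.partitionMax = partitionMax;
        }
    }

    // Keep only manifests whose [min, max] range overlaps the query range [lo, hi].
    static List<String> manifestsToRead(List<ManifestEntry> entries, int lo, int hi) {
        List<String> selected = new ArrayList<>();
        for (ManifestEntry entry : entries) {
            boolean overlaps = entry.partitionMin <= hi && entry.partitionMax >= lo;
            if (overlaps) {
                selected.add(entry.path);
            }
        }
        return selected;
    }
}
```

Manifests that fail the overlap test are never opened, so their cost is limited to the one entry already read from the manifest list.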

Manifest Files

Manifest files are Avro files that list data files along with their metadata:

File path and format (Parquet, ORC, Avro)
Partition values for the file
Column-level statistics (value counts, null counts, lower/upper bounds)
File size and record count
File-level metrics for query planning

Manifests are organized by partition spec - when a partition spec changes, new manifests are created while old manifests remain unchanged.
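
The column-level statistics listed above let a planner skip individual data files. The following sketch shows two such checks for a single `long` column. The `ColumnStats` type is hypothetical, and the sketch assumes bounds are always present and exact, which is a simplification.

```java
// Hypothetical sketch of file-level pruning from column statistics.
final class FileStatsPruningSketch {

    // Per-file statistics for one long-typed column.
    static final class ColumnStats {
        final long lowerBound;  // smallest non-null value in the file
        final long upperBound;  // largest non-null value in the file
        final long nullCount;   // number of null values in the file

        ColumnStats(long lowerBound, long upperBound, long nullCount) {
            this.lowerBound = lowerBound;
            this.upperBound = upperBound;
            this.nullCount = nullCount;
        }
    }

    // "col = value" can only match if value lies within [lowerBound, upperBound].
    static boolean mightContainValue(ColumnStats stats, long value) {
        return stats.lowerBound <= value && value <= stats.upperBound;
    }

    // "col IS NULL" can only match if the file recorded at least one null.
    static boolean mightContainNull(ColumnStats stats) {
        return stats.nullCount > 0;
    }
}
```

Both checks are conservative: a `true` result means the file must be read, not that it is guaranteed to contain a match.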

Data Files

Data files contain the actual table data in one of the supported file formats:

Parquet - recommended for most use cases
ORC - optimized for Hive compatibility
Avro - row-oriented, good for streaming writes

Data files are immutable once written. Updates create new files rather than modifying existing ones.

Optimistic Concurrency

Iceberg supports multiple concurrent writers using optimistic concurrency control:

Each writer reads the current table metadata
Writer creates new data files and manifest files
Writer prepares a new metadata file based on the current version
Writer attempts to commit by atomically swapping the metadata file pointer
If another writer committed first, retry with the new current state

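// Example: concurrent append operation
// Assumes catalog, tableId, dataFile1, and dataFile2 are already defined.
import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.Table;

Table table = catalog.loadTable(tableId);

// Stage the new data files, then commit them together as one snapshot
AppendFiles append = table.newAppend();
append.appendFile(dataFile1);
append.appendFile(dataFile2);
append.commit(); // Atomic swap - will retry if there's a conflict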

This approach enables:

Serializable isolation - all changes occur in a linear history
No distributed locks - writers don’t block readers or each other
Efficient retries - work is structured to minimize wasted effort
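
The five commit steps can be sketched without any Iceberg dependency by letting an `AtomicReference` stand in for the metadata file pointer. This is an illustrative model under that assumption, not how a real catalog is implemented; the class, the `Metadata` type, and the `maxRetries` parameter are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model of an optimistic commit loop; not Iceberg's real implementation.
final class OptimisticCommitSketch {

    // Immutable table state; each commit produces a new instance.
    static final class Metadata {
        final int version;
        final List<String> dataFiles;

        Metadata(int version, List<String> dataFiles) {
            this.version = version;
            this.dataFiles = Collections.unmodifiableList(new ArrayList<>(dataFiles));
        }
    }

    // Stands in for the catalog's atomic pointer to the current metadata file.
    private final AtomicReference<Metadata> pointer =
        new AtomicReference<>(new Metadata(0, Collections.<String>emptyList()));

    Metadata current() {
        return pointer.get();
    }

    // Append a file, retrying if another writer swapped the pointer first.
    Metadata append(String dataFile, int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            Metadata base = pointer.get();                 // read the current metadata
            List<String> files = new ArrayList<>(base.dataFiles);
            files.add(dataFile);                           // build new state on top of it
            Metadata updated = new Metadata(base.version + 1, files);
            if (pointer.compareAndSet(base, updated)) {    // attempt the atomic swap
                return updated;
            }
            // Another writer committed first: loop and rebase on the new current state.
        }
        throw new IllegalStateException("Commit failed after " + maxRetries + " retries");
    }
}
```

Only the cheap metadata step is redone on a conflict. In a real writer the expensive part, writing the data files, happens before the loop and is reused across retries.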

Sequence Numbers

Every successful commit is assigned a sequence number that establishes the relative age of files:

Sequence numbers are assigned when a snapshot is committed
All files added in a snapshot inherit the snapshot’s sequence number
Sequence numbers are used to determine which delete files apply to which data files
This enables correct handling of row-level deletes and updates
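
The sketch below shows how a sequence number comparison can decide whether a delete file applies to a data file. It follows the rule in the v2 specification that position deletes also cover files from the same commit while equality deletes cover only strictly older files. The partition matching that the real rules also require is left out, and the names are hypothetical.

```java
// Hypothetical sketch of delete applicability by sequence number.
final class DeleteApplicabilitySketch {

    enum DeleteType { POSITION, EQUALITY }

    // Returns true if a delete file with sequence number deleteSeq applies to
    // a data file with sequence number dataSeq. Partition matching is omitted.
    static boolean applies(DeleteType type, long deleteSeq, long dataSeq) {
        switch (type) {
            case POSITION:
                // May target rows in files written in the same commit or earlier.
                return dataSeq <= deleteSeq;
            case EQUALITY:
                // Only affects files committed strictly before the delete.
                return dataSeq < deleteSeq;
            default:
                throw new IllegalArgumentException("Unknown delete type: " + type);
        }
    }
}
```

The strict comparison for equality deletes means rows written in the same commit as the delete are not removed by it, which is what lets a single commit express an update as a delete plus an insert.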

Format Versions

Iceberg has evolved through multiple format versions:

Version 1

Analytic data tables with immutable files (Parquet, Avro, ORC)

Version 2

Adds row-level deletes (position and equality deletes)

Version 3

Extended types (nanosecond timestamps, variant, geometry), default values, row lineage

Tables can be upgraded to a newer format version, and v1 metadata files remain valid in v2/v3 tables. Upgrades go one way: a table cannot be moved back to an older version.

Design Goals

Iceberg was designed with specific goals:

Serializable Isolation

Reads use committed snapshots and are never affected by concurrent writes. All writes appear in a linear history.

Speed

Planning a scan takes O(1) remote calls, not O(n) calls where n grows with the size of the table, such as the number of partitions or files. Planning scales to petabyte tables on a single node.

Scale

Job planning happens on clients, not a central metadata store. Metadata includes statistics for cost-based optimization.

Evolution

Full schema and partition spec evolution without rewriting data. Safe column operations with correctness guarantees.

Storage Separation

Queries use predicates on data values, not partition values. Partitioning can evolve as data volume changes.
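
To illustrate planning from predicates on data values, the sketch below converts a timestamp range into a range of partition values for a day-granularity partition. The transform and the method names are assumptions for this example, written in plain Java rather than against Iceberg's transform classes.

```java
import java.time.Instant;

// Hypothetical sketch of deriving a partition filter from a data predicate.
final class HiddenPartitioningSketch {

    // Whole days since 1970-01-01 UTC, rounding toward negative infinity.
    static long dayOrdinal(Instant ts) {
        return Math.floorDiv(ts.getEpochSecond(), 86_400L);
    }

    // The data predicate lo <= ts <= hi becomes a range check on the
    // partition value, so the query never has to mention the partition.
    static boolean partitionMightMatch(long partitionDay, Instant lo, Instant hi) {
        return dayOrdinal(lo) <= partitionDay && partitionDay <= dayOrdinal(hi);
    }
}
```

Because the filter is derived from the predicate, the same query keeps working if the table later switches to a different partition granularity.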

File System Requirements

Iceberg only requires basic file system operations:

In-place write - files are never moved after being written
Seekable reads - for columnar format support
Deletes - remove files that are no longer used

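Listing operations are not required, so Iceberg works on object stores like S3 without depending on consistent directory listings. Rename is not required either, except for tables that implement the commit itself with an atomic file-system rename.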

Learn More

Schemas

Learn about Iceberg’s schema structure and type system

Partitioning

Understand hidden partitioning and partition transforms

Evolution

Explore schema and partition evolution capabilities

Reliability

Discover Iceberg’s reliability guarantees
