## Scale Without Compromise
Iceberg tables scale to production workloads with:
- Tens of petabytes in a single table
- Billions of files tracked efficiently
- Millions of partitions without performance degradation
- Single-node planning even at massive scale
### Real-World Scale
Iceberg is used in production at:

- Netflix: Petabyte-scale data lakehouse with thousands of tables
- Apple: Exabyte-scale analytics platform
- Adobe: Multi-petabyte event data processing
- LinkedIn: Large-scale data warehouse migration from Hive
It also supports a wide range of production workload patterns:

- High-volume streaming ingestion (millions of events/second)
- Large batch processing (petabytes per job)
- Interactive queries (sub-second to minutes)
- Concurrent readers and writers (hundreds of simultaneous operations)
## Scan Planning Performance
Scan planning is the process of determining which data files to read for a query. This is where Iceberg excels.

### The Traditional Problem
Hive tables require O(n) remote calls for planning, where n is the number of partitions:

- The metastore becomes overwhelmed
- Planning time grows with table size
- Can’t scale to millions of partitions
### Iceberg’s O(1) Planning
Iceberg uses O(1) remote calls regardless of table size: the catalog returns a single metadata file location, and all further planning reads flow through the manifest hierarchy.

### Two-Level Metadata Filtering
Iceberg uses hierarchical metadata for aggressive pruning.

#### Level 1: Manifest List Filtering
The manifest list contains min/max values for each partition field. For a query on a day-partitioned table:

- Converts the predicate `event_date = 2021-03-21` to `days(event_time) = 18707`
- Checks manifest bounds: 18707 < 18718 (the manifest's lower bound)
- Skips the entire manifest without reading it
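The day transform is just the count of whole days since the Unix epoch, so the predicate rewrite can be checked directly (a minimal sketch; `days` here mirrors Iceberg's day partition transform, and the bound 18718 is the illustrative value from the example):

```python
from datetime import date

def days(d: date) -> int:
    """Iceberg's day transform: whole days since the Unix epoch (1970-01-01)."""
    return (d - date(1970, 1, 1)).days

# The predicate event_date = 2021-03-21 becomes days(event_time) = 18707.
ordinal = days(date(2021, 3, 21))
print(ordinal)  # 18707

# A manifest whose lower bound for the day partition is 18718 cannot
# contain matching files, so it is skipped without ever being read.
print(ordinal < 18718)  # True -> skip the manifest
```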
#### Level 2: Manifest File Filtering
Manifest files contain per-file statistics. For a query filtering on `event_date = 2024-03-01` and `user_id = 500000`:

- Partition filter: reads only manifests for 2024-03-01
- Column bounds: checks `500000 >= lower_bound AND 500000 <= upper_bound`
- Scans only files whose user_id range includes 500000
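Per-file pruning is a simple bounds test against the statistics recorded in the manifest (a sketch; the file entries and their ranges are invented for illustration):

```python
# Hypothetical per-file statistics, as recorded in a manifest file.
files = [
    {"path": "a.parquet", "lower": 1,       "upper": 250_000},
    {"path": "b.parquet", "lower": 250_001, "upper": 750_000},
    {"path": "c.parquet", "lower": 750_001, "upper": 1_000_000},
]

target = 500_000  # the user_id being searched for
# Keep only files whose [lower, upper] user_id range can contain the target.
candidates = [f["path"] for f in files if f["lower"] <= target <= f["upper"]]
print(candidates)  # ['b.parquet']
```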
### Data Filtering with Column Statistics
Manifest statistics enable column-level pruning.

#### Performance Impact
Consider a real-world query on a 100TB table.

Without column statistics (full scan):

- Read: 100TB
- Files scanned: 100,000
- Time: hours

With partition filtering only:

- Read: 10TB (90% reduction)
- Files scanned: 10,000
- Time: 30-60 minutes

With partition filtering and column statistics:

- Read: 100GB (99.9% reduction)
- Files scanned: 100
- Time: seconds to minutes
Iceberg’s column statistics provide a 10x or greater performance improvement over partition-only filtering.
## Distributed Planning
Iceberg pushes file pruning to query engines, removing metastore bottlenecks. In traditional centralized planning, a single metastore service lists and filters files for every query; in Iceberg’s distributed planning, each worker reads and filters its own share of the metadata. This yields several advantages:

- **No metastore bottleneck.** The catalog only provides the metadata file location; it performs no file listing or filtering.
- **Horizontal scaling.** Planning parallelizes across workers: more workers means faster planning.
- **Metadata locality.** Workers read manifests from object storage in parallel, not from a central server.
- **Better statistics.** Column statistics are available for cost-based optimization and better execution plans.
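Because manifests live in object storage, workers can fetch and filter them concurrently; a toy sketch of that fan-out using a thread pool (the manifest entries and bounds are made up):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical manifests: each records day-partition bounds for its files.
manifests = [{"path": f"m{i}.avro", "day_lower": i * 10, "day_upper": i * 10 + 9}
             for i in range(8)]

def manifest_matches(manifest, day):
    # Each worker independently tests one manifest's partition bounds.
    return manifest["day_lower"] <= day <= manifest["day_upper"]

# Workers fetch and filter manifests concurrently; the engine keeps the hits.
with ThreadPoolExecutor(max_workers=4) as pool:
    hits = list(pool.map(lambda m: manifest_matches(m, 42), manifests))
keep = [m["path"] for m, hit in zip(manifests, hits) if hit]
print(keep)  # ['m4.avro']
```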
## Optimizations for Common Patterns
### Time-Series Queries
Time-based filtering is extremely efficient:

- Partition by time (days/hours) for coarse pruning
- Manifest list filtering eliminates old manifests
- Column bounds on event_time eliminate old files
- Sorted data (optional) enables early termination
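The early-termination point is worth seeing concretely: when values within a file are sorted, a reader can binary-search both ends of a time range instead of scanning every row. A miniature of the idea with Python's `bisect` (the timestamps are illustrative epoch-day values):

```python
import bisect

# Hypothetical sorted event_time values (epoch days) within one data file.
times = [100, 101, 102, 105, 110, 120, 130]

# Range scan for 101 <= event_time <= 110: binary-search both ends,
# then read only the rows in between -- no need to scan past 110.
lo = bisect.bisect_left(times, 101)
hi = bisect.bisect_right(times, 110)
print(times[lo:hi])  # [101, 102, 105, 110]
```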
### High-Cardinality Filters
Filters on high-cardinality columns (user_id, device_id) benefit from column statistics:

- A bucket partition on user_id reduces the search space
- Lower/upper bounds filter out files that cannot contain the user_id
- Bloom filters (future enhancement) for exact membership testing
### Join Optimization
Iceberg provides metadata for cost-based optimization:

- Record counts from manifest metadata
- Partition statistics for partition pruning
- Column bounds for bloom filter joins
- File size statistics for task scheduling
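Record counts alone already enable one classic optimization: picking the smaller join input as the hash-table build side without touching any data files. A minimal cost-based sketch (the table names and counts are invented):

```python
# Hypothetical record counts read from manifest metadata, not data files.
stats = {"events": 2_000_000_000, "users": 5_000_000}

def choose_build_side(left, right, record_counts):
    """Build the hash table on the smaller input; probe with the larger."""
    return left if record_counts[left] <= record_counts[right] else right

print(choose_build_side("events", "users", stats))  # users
```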
## Partitioning Best Practices
Partitioning strategy significantly impacts performance.

### Right-Sizing Partitions
Target 500MB - 1GB per data file.

Too small (< 100MB):

- Too many files means metadata overhead
- Poor read performance (many small I/Os)
- Slow query planning (many manifests)

Too large (multiple GB):

- Can’t skip data within a file
- Poor parallelism (fewer tasks)
- High memory usage per task

Just right (500MB - 1GB):

- Good balance of pruning and parallelism
- Efficient I/O patterns
- Manageable metadata size
### Choose High-Selectivity Columns

Partition on columns frequently used in WHERE clauses, so most queries can prune partitions.
### Use Bucket for High Cardinality

Don’t create millions of partitions for a high-cardinality column; bucket it into a fixed number of partitions instead.
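The bucket transform hashes each value into a fixed number of partitions. Iceberg's spec requires 32-bit Murmur3 for this hash; the sketch below substitutes `zlib.crc32` purely to stay dependency-free, so the bucket assignments shown are not Iceberg's actual ones, but the shape of the transform is the same:

```python
import zlib

def bucket(value: int, n: int) -> int:
    """Sketch of Iceberg's bucket transform. Iceberg actually hashes with
    32-bit Murmur3 per its spec; zlib.crc32 stands in here only to avoid
    a dependency. The shape is identical: hash the value, clear the sign
    bit, take it modulo the bucket count."""
    h = zlib.crc32(value.to_bytes(8, "little", signed=True))
    return (h & 0x7FFFFFFF) % n

# Millions of distinct user_ids collapse into a fixed number of partitions.
buckets = {bucket(uid, 16) for uid in range(100_000)}
print(len(buckets) <= 16)  # True: never more than n partitions
```

Deterministic hashing also means equality predicates on user_id can still prune: the engine recomputes the bucket for the literal and reads only that partition.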
### Evolve with Data Volume

Start with a coarse partition spec and refine it as data grows; partition evolution makes this a metadata-only change.
### Multi-Column Partitioning

Combine dimensions for better pruning:

- Put the highest-selectivity column first (for example, region)
- Put the time-based column last (for time-series queries)
## File Layout Optimization
### Sorting Within Files
Declare a sort order for better min/max bounds:

- Tighter min/max bounds (better pruning)
- Clustered data (better compression)
- Efficient range scans
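The effect of a sort order on min/max bounds is easy to see with toy data: the same values split into files before and after sorting produce very different pruning behavior.

```python
values = [9, 1, 7, 3, 8, 2, 6, 4]

def file_bounds(vals, files=2):
    """Split vals into equal files and report each file's (min, max) bounds."""
    size = len(vals) // files
    chunks = [vals[i:i + size] for i in range(0, len(vals), size)]
    return [(min(c), max(c)) for c in chunks]

print(file_bounds(values))          # [(1, 9), (2, 8)] -- ranges overlap
print(file_bounds(sorted(values)))  # [(1, 4), (6, 9)] -- disjoint, prunable
```

A predicate like `value = 5` must read both unsorted files but can skip both sorted ones, because 5 falls outside both tight ranges.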
### Compaction
Regularly compact small files:

- Fewer files means less metadata overhead
- Larger files = more efficient I/O
- Better compression ratios
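At its core, compaction is bin-packing small files up to a target size. A simplified greedy sketch (not Iceberg's actual rewrite planner; the file sizes and 512MB target are illustrative):

```python
def plan_compaction(file_sizes_mb, target_mb=512):
    """Greedily group small files into rewrite tasks near the target size."""
    groups, current, total = [], [], 0
    for size in sorted(file_sizes_mb):
        if total + size > target_mb and current:
            groups.append(current)   # close out a full rewrite group
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        groups.append(current)
    return groups

# Twenty 50MB files become two ~500MB rewrite groups.
print(plan_compaction([50] * 20))
```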
### Z-Ordering (Clustering)
Co-locate related data using space-filling curves. Z-ordering works best for:

- Multi-dimensional queries
- Columns with similar cardinality
- When partition evolution isn’t practical
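The space-filling curve most commonly used here is the Z-order (Morton) curve, which interleaves the bits of several columns so rows close in all dimensions land close in sort order. A two-column sketch:

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y (x in even positions, y in odd)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

# Sorting rows by this key clusters points that are near in BOTH x and y,
# so each file's min/max bounds stay tight for both columns at once.
rows = [(10, 50), (3, 5), (11, 50), (3, 6)]
print(sorted(rows, key=lambda r: z_order_key(*r)))
```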
## Metadata Performance
### Manifest File Management
Keep manifest files optimized. Warning signs that manifests need rewriting:

- More than 100 manifests per snapshot
- Many manifests containing only “existing” files
- Slow query planning despite good partitioning
### Snapshot Expiration
Regularly expire old snapshots:

- Reduces metadata file count
- Enables cleanup of orphaned data files
- Faster metadata reads
## Read Performance Tuning
### Split Size Configuration
Control task parallelism with the read split size:

- Smaller splits = more parallelism, more overhead
- Larger splits = less parallelism, less overhead
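The trade-off is easy to quantify: the number of read tasks is roughly the scan size divided by the split size. A back-of-the-envelope sketch (the 128MB and 512MB split sizes are illustrative values, not universal defaults):

```python
import math

def planned_tasks(scan_bytes: int, split_size_bytes: int) -> int:
    """Approximate read-task count for a scan at a given split size."""
    return math.ceil(scan_bytes / split_size_bytes)

scan = 100 * 1024**3  # a 100GB scan
print(planned_tasks(scan, 128 * 1024**2))  # 800 tasks: high parallelism
print(planned_tasks(scan, 512 * 1024**2))  # 200 tasks: less overhead
```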
### Vectorized Reads
Enable vectorization for Parquet reads:

- Process data in batches (thousands of rows at a time)
- Better CPU cache utilization
- 5-10x faster for analytical queries
### Column Projection
Read only the necessary columns. Projection saves:

- I/O bandwidth (read less data)
- Memory (store less data)
- CPU (deserialize less data)
## Write Performance Tuning
### Batch Size

Control file size when writing, so data files land in the 500MB - 1GB target range.

### Commit Frequency
Balance freshness vs. overhead:

- Frequent commits = fresher data, more snapshots, more manifests
- Infrequent commits = less overhead, larger files, stale data
### Fanout Writers

Fanout writers keep an open file per partition, handling writes that hit many high-cardinality partitions without pre-sorting, at the cost of more memory.

## Monitoring and Metrics
Key metrics to monitor:

### File Count

- Target: 500MB - 1GB per file
- Monitor: files per partition, total files
- Action: compact if there are too many small files

### Manifest Count

- Target: < 100 manifests per snapshot
- Monitor: manifests per snapshot
- Action: rewrite manifests if there are too many

### Snapshot Count

- Target: < 100 active snapshots
- Monitor: total snapshots, oldest snapshot age
- Action: expire old snapshots regularly

### Planning Time

- Target: < 10 seconds for large tables
- Monitor: query planning duration
- Action: check manifest count and partition strategy
## Performance Comparison
### Iceberg vs. Hive
| Operation | Hive | Iceberg | Improvement |
|---|---|---|---|
| Query planning (10K partitions) | 5-10 minutes | 5-10 seconds | 60-120x |
| Schema evolution | Hours (rewrite data) | Seconds (metadata) | 1000x+ |
| Partition evolution | Impossible | Seconds (metadata) | ∞ |
| Concurrent writes | Locks/failures | Optimistic concurrency | 10x+ throughput |
| Time travel | Not supported | Instant (metadata) | New capability |
| Column-level pruning | No | Yes | 10x fewer files read |
### Real-World Example

A query on a 100TB table with 1 million files, filtering on timestamp and user_id:

Hive:

- Planning: list 10,000 partitions × 100 files each = 5 minutes
- Scanning: read all files in matching partitions = 1TB read
- Total time: 30 minutes

Iceberg:

- Planning: read the manifest list + filter manifests = 10 seconds
- Scanning: column stats prune to 1,000 matching files = 10GB read
- Total time: 60 seconds
## Learn More

- **Partitioning**: learn how to partition for optimal performance
- **Reliability**: understand how reliability features enable performance
- **Table Format**: explore the metadata structure that enables fast planning