What is Partitioning?
Partitioning groups data into separate files based on column values:logs table is partitioned by date, Iceberg can skip reading files for other dates, making the query much faster.
Problems with Traditional Partitioning
Traditional table formats like Hive have significant partitioning limitations:User-Maintained Partition Values
In Hive, partitions are explicit columns that users must populate:- No validation - Wrong formats produce silent errors (“2024-03-01” vs “20240301”)
- Wrong source column - Using a different timestamp field produces incorrect results
- User burden - Every write must calculate partition values correctly
Partition-Dependent Queries
Hive queries must filter on partition columns for performance:event_time relates to event_date.
Inflexible Partitioning
Changing partition layout requires creating a new table and rewriting queries:- Daily partitions → Hourly partitions requires a new table
- All queries must be rewritten for the new partition structure
- Historical data migration is expensive
Iceberg’s Hidden Partitioning
Iceberg handles partition value derivation automatically and keeps partitioning separate from query logic.
- Automatic derivation - Partition values are calculated correctly every time
- Automatic filtering - Predicates on source columns automatically filter partitions
- Evolution - Partition layout can change without rewriting queries
Partition Transforms
Iceberg provides built-in transforms for creating partition values:Identity Transform
Uses the source value unchanged:Temporal Transforms
Extract date/time components from timestamps:date, timestamp, timestamptz, timestamp_ns, timestamptz_ns
Bucket Transform
Hash values into a fixed number of buckets:- High-cardinality columns (user IDs, device IDs)
- Evenly distributing data across partitions
- Controlling partition count
int, long, decimal, date, time, timestamp, timestamptz, string, uuid, fixed, binary
Truncate Transform
Truncate values to a specified width:int, long, decimal, string, binary
Multi-Column Partitioning
Tables can partition by multiple columns:Partition Pruning
Iceberg automatically converts column predicates to partition predicates:- Iceberg knows the transform relationship (
days(event_time)) - Predicates on
event_timeare converted to predicates on the partition value - Manifest metadata contains partition value ranges
- Non-matching manifests and files are skipped
Partition Metadata
Iceberg tracks rich partition metadata for pruning: Manifest List Level:- Min/max values for each partition field across all files in a manifest
- Enables skipping entire manifests
- Exact partition values for each data file
- Column-level statistics (value counts, null counts, min/max)
- Enables skipping individual data files
Creating Partition Specs
Using Spark SQL
Using Java API
Unpartitioned Tables
Tables can be created without partitioning:- All data in a single logical partition
- Simpler for small tables or streaming workloads
- Can still use column statistics for file pruning
- Can be partitioned later via partition evolution
Partitioning Best Practices
Choose the Right Granularity
Choose the Right Granularity
- Too few partitions (yearly) → files too large, poor pruning
- Too many partitions (per second) → too many small files, metadata overhead
- Start coarse (daily/monthly) and refine with partition evolution
- Target partition sizes of 500MB to a few GB
Partition High-Selectivity Columns
Partition High-Selectivity Columns
- Choose columns commonly used in WHERE clauses
- Time-based partitioning is most common (date, timestamp)
- Consider query patterns when selecting partition columns
Use Bucket for High Cardinality
Use Bucket for High Cardinality
- Don’t use identity partitioning on user_id (millions of partitions)
- Use bucket(N, user_id) instead to control partition count
- Typical bucket counts: 16, 32, 64, 128
Avoid Over-Partitioning
Avoid Over-Partitioning
- More partitions ≠ better performance
- Each partition adds metadata overhead
- Small files (< 100MB) hurt read performance
- Use compaction to manage file sizes
Hidden Partitioning Benefits
Correctness
Partition values are always derived correctly from source data
Simplicity
Users query on actual columns, not partition values
Evolution
Partition layout can change without breaking queries
Performance
Automatic partition pruning optimizes all compatible queries
Learn More
Partition Evolution
Learn how to change partition specs over time
Performance
Understand how partitioning affects query performance