Overview
Iceberg supports Spark Structured Streaming for both reading and writing data. Streaming is built on Spark’s DataSourceV2 API.

Streaming Reads
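A minimal streaming read, as a PySpark sketch (assumes a `SparkSession` named `spark` with an Iceberg catalog configured; the catalog, table name, and timestamp value are placeholders):

```python
# Assumes `spark` is a SparkSession with an Iceberg catalog configured.
df = (spark.readStream
      .format("iceberg")
      # Start streaming from a historical point, in milliseconds since the
      # Unix epoch (placeholder value).
      .option("stream-from-timestamp", "1700000000000")
      .load("catalog.db.events"))
```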
Read incremental data starting from a historical timestamp by setting the `stream-from-timestamp` read option (a timestamp in milliseconds since the Unix epoch).

Skip Non-Append Snapshots
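A sketch of a read that skips non-append snapshots (`spark`, the catalog, and the table name are placeholders):

```python
# Sketch: read only append snapshots, ignoring overwrites and deletes.
df = (spark.readStream
      .format("iceberg")
      .option("streaming-skip-overwrite-snapshots", "true")
      .option("streaming-skip-delete-snapshots", "true")
      .load("catalog.db.events"))
```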
Set `streaming-skip-overwrite-snapshots` and `streaming-skip-delete-snapshots` to ignore overwrite and delete snapshots so that only append snapshots are read.

Limit Input Rate
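A sketch with both rate limits set (the cap values are illustrative, not recommendations):

```python
# Sketch: cap the amount of data pulled into each micro-batch.
df = (spark.readStream
      .format("iceberg")
      .option("streaming-max-files-per-micro-batch", "100")    # illustrative cap
      .option("streaming-max-rows-per-micro-batch", "500000")  # illustrative cap
      .load("catalog.db.events"))
```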
Control micro-batch sizes with `streaming-max-files-per-micro-batch` and `streaming-max-rows-per-micro-batch`. When both limits are set, the micro-batch stops at whichever limit is reached first.
Rate limiting also works with `Trigger.AvailableNow` to split one-time processing into multiple batches. Limits are ignored with the deprecated `Trigger.Once`.

Streaming Writes
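A sketch of a streaming append (the checkpoint path and table name are placeholders; `df` is a streaming DataFrame):

```python
# Sketch: append each micro-batch to an existing Iceberg table.
query = (df.writeStream
         .format("iceberg")
         .outputMode("append")
         .trigger(processingTime="1 minute")
         # Placeholder path; use durable shared storage for checkpoints.
         .option("checkpointLocation", "s3://bucket/checkpoints/events")
         .toTable("catalog.db.events"))
```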
Write streaming data to Iceberg tables with the standard `writeStream` API.

Output Modes
Iceberg supports two output modes:

| Mode | Behavior | Use Case |
|---|---|---|
| append | Appends each micro-batch | Continuous data ingestion |
| complete | Replaces table contents each micro-batch | Aggregations, stateful operations |
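A sketch of `complete` mode for a streaming aggregation (`df`, the grouping column, checkpoint path, and table names are placeholders):

```python
# Sketch: `complete` mode replaces table contents with the latest aggregate.
counts = df.groupBy("level").count()  # `level` is a placeholder column
query = (counts.writeStream
         .format("iceberg")
         .outputMode("complete")
         .option("checkpointLocation", "s3://bucket/checkpoints/counts")  # placeholder
         .toTable("catalog.db.event_counts"))
```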
Create Table Before Streaming
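The table can be created up front with Spark SQL; a sketch with placeholder names and schema:

```python
# Sketch: create the target table before starting the stream.
spark.sql("""
    CREATE TABLE IF NOT EXISTS catalog.db.events (
        id   bigint,
        data string,
        ts   timestamp)
    USING iceberg
""")
```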
Ensure the table exists before starting the streaming query.

Partitioned Tables
Iceberg requires data to be sorted by partition value per task.

Fanout Writer
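A sketch of a write with the fanout writer enabled (`df`, the checkpoint path, and the table name are placeholders):

```python
# Sketch: the fanout writer removes the per-task sort requirement.
query = (df.writeStream
         .format("iceberg")
         .outputMode("append")
         .option("fanout-enabled", "true")
         .option("checkpointLocation", "s3://bucket/checkpoints/events")  # placeholder
         .toTable("catalog.db.events"))
```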
The fanout writer avoids this repartitioning overhead by keeping one file open per partition value until the write task finishes.
Maintenance for Streaming Tables
Streaming writes create metadata quickly. Regular maintenance is essential.

Tune Commit Rate
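As a sketch, a longer trigger interval produces fewer, larger commits (the interval, checkpoint path, and table name are illustrative placeholders):

```python
# Sketch: every micro-batch commit creates a snapshot, so commit less often.
query = (df.writeStream
         .format("iceberg")
         .outputMode("append")
         .trigger(processingTime="5 minutes")  # illustrative interval
         .option("checkpointLocation", "s3://bucket/checkpoints/events")  # placeholder
         .toTable("catalog.db.events"))
```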
Use appropriate trigger intervals; every micro-batch commit creates a new table snapshot.

Expire Old Snapshots
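Snapshots can be expired with the `expire_snapshots` Spark procedure; a sketch (the catalog, table name, and cutoff timestamp are placeholders):

```python
# Sketch: drop snapshots older than a cutoff (cutoff value is illustrative).
spark.sql("""
    CALL catalog.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '2024-01-01 00:00:00')
""")
```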
Remove old snapshots regularly. By default, snapshots older than 5 days are expired; adjust this based on your retention requirements.
Compact Data Files
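Compaction runs through the `rewrite_data_files` procedure; a sketch with placeholder names:

```python
# Sketch: compact small data files into larger ones.
spark.sql("CALL catalog.system.rewrite_data_files(table => 'db.events')")
```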
Streaming writes create many small files. Compact them regularly.

Rewrite Manifests
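Manifests can be consolidated with the `rewrite_manifests` procedure; a sketch with placeholder names:

```python
# Sketch: consolidate manifests to speed up scan planning.
spark.sql("CALL catalog.system.rewrite_manifests('db.events')")
```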
Optimize manifest files for better query performance by rewriting them periodically.

Complete Example
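A sketch of an end-to-end pipeline that streams from one Iceberg table into another (all catalog, table, column, and path names are placeholders):

```python
# Sketch: stream appends from a raw table into a cleaned table.
source = (spark.readStream
          .format("iceberg")
          .option("streaming-skip-delete-snapshots", "true")
          .load("catalog.db.raw_events"))

cleaned = source.filter("data IS NOT NULL")  # placeholder transformation

query = (cleaned.writeStream
         .format("iceberg")
         .outputMode("append")
         .trigger(processingTime="2 minutes")
         .option("fanout-enabled", "true")
         .option("checkpointLocation", "s3://bucket/checkpoints/clean_events")  # placeholder
         .toTable("catalog.db.clean_events"))

query.awaitTermination()
```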
The pieces above combine into an end-to-end streaming pipeline.

Best Practices
Trigger Intervals
- Minimum: 1 minute for most workloads
- Recommended: 2-5 minutes for high-throughput streams
- High Latency OK: 10+ minutes for maximum efficiency
Checkpoint Management
Store checkpoints in reliable storage. Never delete checkpoints while a query is running.
Partitioning Strategy
For streaming tables:
- Use time-based partitioning (hours, days)
- Enable fanout writer for partitioned tables
- Avoid high-cardinality partitions
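Time-based hidden partitioning can be declared in the table DDL; a sketch with placeholder names and schema:

```python
# Sketch: hour-granularity hidden partitioning on the event timestamp.
spark.sql("""
    CREATE TABLE IF NOT EXISTS catalog.db.events (
        id   bigint,
        data string,
        ts   timestamp)
    USING iceberg
    PARTITIONED BY (hours(ts))
""")
```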
Monitoring
Track these metrics:
- Micro-batch processing time
- Number of files per micro-batch
- Table snapshot count
- File sizes in recent partitions
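Snapshot and file counts can be read from Iceberg metadata tables, and micro-batch timing from the running query; a sketch (the table name is a placeholder, and `query` is assumed to be the running `StreamingQuery`):

```python
# Sketch: snapshot and file statistics from Iceberg metadata tables.
spark.sql("SELECT count(*) AS snapshot_count FROM catalog.db.events.snapshots").show()
spark.sql("""
    SELECT count(*) AS file_count, avg(file_size_in_bytes) AS avg_file_size
    FROM catalog.db.events.files
""").show()

# Micro-batch processing time from the running query's progress.
print(query.lastProgress["durationMs"])
```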
Maintenance Schedule
Recommended schedule:
| Task | Frequency | Purpose |
|---|---|---|
| Expire snapshots | Daily | Remove old metadata |
| Compact files | Weekly | Reduce small files |
| Rewrite manifests | Weekly | Optimize scan planning |
| Remove orphans | Monthly | Clean up unused files |
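The schedule above can be scripted as procedure calls; a sketch with placeholder names:

```python
# Sketch: maintenance tasks as Spark procedure calls.
spark.sql("CALL catalog.system.expire_snapshots(table => 'db.events')")     # daily
spark.sql("CALL catalog.system.rewrite_data_files(table => 'db.events')")   # weekly
spark.sql("CALL catalog.system.rewrite_manifests('db.events')")             # weekly
spark.sql("CALL catalog.system.remove_orphan_files(table => 'db.events')")  # monthly
```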
Troubleshooting
Too Many Small Files
Symptoms: Slow queries, high metadata overhead

Solutions:

- Increase trigger interval
- Run `rewrite_data_files` more frequently
- Increase `target-file-size-bytes`
High Metadata Overhead
Symptoms: Slow query planning, large metadata files

Solutions:

- Expire snapshots more aggressively
- Run `rewrite_manifests`
- Reduce commit frequency
Overwrite Snapshot Errors
Symptoms: Streaming query fails on overwrite snapshots

Solutions:

- Enable `streaming-skip-overwrite-snapshots` on the streaming read (see Skip Non-Append Snapshots)
Out of Memory
Symptoms: Driver or executor OOM

Solutions:

- Reduce micro-batch size with `streaming-max-files-per-micro-batch`
- Increase executor memory
- Enable `streaming-skip-delete-snapshots`
Next Steps
- Procedures: learn about maintenance procedures
- Writes: understand write distribution modes
- Configuration: configure streaming options
- Queries: query streaming tables