The BatchScan interface provides an API for configuring batch-oriented scans over Iceberg tables, optimized for processing large amounts of data.
Overview
BatchScan is designed for batch processing workflows and extends the base Scan interface. It provides snapshot selection capabilities similar to those of TableScan but is optimized for batch processing engines.
Interface
Core Methods
table()
Returns the table from which this scan loads data.
useSnapshot()
Creates a new batch scan that will use the snapshot with the given ID.
Parameter: snapshotId - a snapshot ID
Throws: IllegalArgumentException if the snapshot cannot be found
Example:
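The following is a minimal sketch rather than the original example. It assumes the scan is obtained through `Table.newBatchScan()` (Iceberg 1.x), which is not described in this section; the class and method names are hypothetical.

```java
import org.apache.iceberg.BatchScan;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

public class UseSnapshotExample {
  // Pins a batch scan to the table's current snapshot ID so that later
  // commits to the table do not change what the scan reads.
  // Assumption: Table.newBatchScan() is the entry point for batch scans.
  static BatchScan pinToCurrentSnapshot(Table table) {
    Snapshot current = table.currentSnapshot();
    if (current == null) {
      // An empty table has no snapshot to pin to.
      throw new IllegalStateException("Table has no snapshots yet");
    }
    return table.newBatchScan().useSnapshot(current.snapshotId());
  }
}
```

Pinning the ID once and reusing it is useful when several stages of a job must read exactly the same table state.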
useRef()
Creates a new batch scan that will use the given reference.
Parameter: ref - a reference name (branch or tag)
Throws: IllegalArgumentException if a reference with the given name cannot be found
Example:
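A hedged sketch of how the exception described above might be handled. `Table.newBatchScan()` is assumed as the entry point, and `tryUseRef` is a hypothetical helper, not part of Iceberg.

```java
import java.util.Optional;
import org.apache.iceberg.BatchScan;
import org.apache.iceberg.Table;

public class UseRefExample {
  // Returns a scan of the given branch or tag, or an empty Optional when
  // useRef rejects the name with IllegalArgumentException.
  // Assumption: Table.newBatchScan() is the entry point for batch scans.
  static Optional<BatchScan> tryUseRef(Table table, String refName) {
    try {
      return Optional.of(table.newBatchScan().useRef(refName));
    } catch (IllegalArgumentException e) {
      return Optional.empty();
    }
  }
}
```

A caller could then write `tryUseRef(table, "audit-branch")`, where the reference name is only a placeholder.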
asOfTime()
Creates a new batch scan that will use the most recent snapshot as of the given time.
Parameter: timestampMillis - a timestamp in milliseconds since the epoch
Throws: IllegalArgumentException if the snapshot cannot be found or time travel is attempted on a tag
Example:
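A small sketch of converting a lookback window to the epoch-millisecond value the method expects. `Table.newBatchScan()` is assumed as the entry point; `millisBefore` and `scanAsOf` are hypothetical helper names.

```java
import java.time.Duration;
import java.time.Instant;
import org.apache.iceberg.BatchScan;
import org.apache.iceberg.Table;

public class AsOfTimeExample {
  // Converts "now minus lookback" to milliseconds since the epoch.
  static long millisBefore(Instant now, Duration lookback) {
    return now.minus(lookback).toEpochMilli();
  }

  // Scans the table as it was `lookback` ago.
  // Assumption: Table.newBatchScan() is the entry point for batch scans.
  static BatchScan scanAsOf(Table table, Duration lookback) {
    return table.newBatchScan().asOfTime(millisBefore(Instant.now(), lookback));
  }
}
```

For example, `scanAsOf(table, Duration.ofHours(24))` would read the table state from one day ago, provided a snapshot exists at or before that time.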
snapshot()
Returns the snapshot that will be used by this scan.
Examples
Basic Batch Scan
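A minimal sketch of a filtered, projected batch scan. It assumes `Table.newBatchScan()` as the entry point and uses the `filter` and `select` methods inherited from the base Scan interface; the column names `status` and `id` are hypothetical.

```java
import java.io.IOException;
import org.apache.iceberg.BatchScan;
import org.apache.iceberg.ScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.io.CloseableIterable;

public class BasicBatchScanExample {
  // Plans the scan and sums the per-task row estimates.
  static long estimateRows(Table table) throws IOException {
    BatchScan scan = table.newBatchScan()
        .filter(Expressions.equal("status", "active")) // hypothetical column
        .select("id", "status");                        // hypothetical columns

    long rows = 0;
    // planFiles() returns a CloseableIterable, so close it when done.
    try (CloseableIterable<ScanTask> tasks = scan.planFiles()) {
      for (ScanTask task : tasks) {
        rows += task.estimatedRowsCount();
      }
    }
    return rows;
  }
}
```

The try-with-resources block matters here because planning may hold open manifest readers until the iterable is closed.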
Batch Scan with Snapshot Selection
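One possible use of snapshot selection is reading the table state just before the latest commit. The sketch below assumes `Table.newBatchScan()` as the entry point and that the parent snapshot has not been expired; `scanPreviousState` is a hypothetical helper.

```java
import org.apache.iceberg.BatchScan;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

public class SnapshotSelectionExample {
  // Scans the parent of the current snapshot, i.e. the table state before
  // the most recent commit. Falls back to an unpinned scan when there is
  // no earlier state to select.
  static BatchScan scanPreviousState(Table table) {
    Snapshot current = table.currentSnapshot();
    if (current == null || current.parentId() == null) {
      return table.newBatchScan();
    }
    // useSnapshot throws IllegalArgumentException if the parent snapshot
    // is no longer present in table metadata.
    return table.newBatchScan().useSnapshot(current.parentId());
  }
}
```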
Time-Based Batch Processing
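A sketch of a daily job that reads a consistent cut of the table as of midnight UTC on a given date. `Table.newBatchScan()` is assumed as the entry point; the helper names and the choice of UTC are illustrative assumptions.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import org.apache.iceberg.BatchScan;
import org.apache.iceberg.Table;

public class TimeBasedBatchExample {
  // Midnight UTC at the start of the given date, in epoch milliseconds.
  static long startOfDayUtcMillis(LocalDate date) {
    return date.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
  }

  // Scans the most recent snapshot committed at or before the start of `date`.
  // asOfTime throws IllegalArgumentException if no such snapshot exists.
  static BatchScan scanAsOfStartOfDay(Table table, LocalDate date) {
    return table.newBatchScan().asOfTime(startOfDayUtcMillis(date));
  }
}
```

Deriving the timestamp from the logical date rather than from the wall clock keeps reruns of the same job reading the same snapshot.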
Branch-Based Batch Scan
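A sketch that compares what a branch would read against the main table state, for instance before publishing an audited branch. `Table.newBatchScan()` is assumed as the entry point; the helper names are hypothetical.

```java
import java.io.IOException;
import org.apache.iceberg.BatchScan;
import org.apache.iceberg.ScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

public class BranchBatchScanExample {
  // Counts the tasks returned by planFiles() for a configured scan.
  static int countFileTasks(BatchScan scan) throws IOException {
    int count = 0;
    try (CloseableIterable<ScanTask> tasks = scan.planFiles()) {
      for (ScanTask task : tasks) {
        count++;
      }
    }
    return count;
  }

  // Difference between the number of tasks planned on a branch and on the
  // table's current state.
  static int extraFileTasksOnBranch(Table table, String branch) throws IOException {
    int onBranch = countFileTasks(table.newBatchScan().useRef(branch));
    int onCurrent = countFileTasks(table.newBatchScan());
    return onBranch - onCurrent;
  }
}
```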
Task Group Processing
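A sketch of iterating over grouped tasks, where each group is intended as one unit of parallel work. It assumes `Table.newBatchScan()` as the entry point and that `planTasks()` on a BatchScan yields `ScanTaskGroup<ScanTask>` values; `tasksPerGroup` is a hypothetical helper.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.iceberg.ScanTask;
import org.apache.iceberg.ScanTaskGroup;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

public class TaskGroupExample {
  // Returns the number of tasks in each planned group, in planning order.
  static List<Integer> tasksPerGroup(Table table) throws IOException {
    List<Integer> sizes = new ArrayList<>();
    try (CloseableIterable<ScanTaskGroup<ScanTask>> groups =
        table.newBatchScan().planTasks()) {
      for (ScanTaskGroup<ScanTask> group : groups) {
        // A real engine would hand each group to a separate worker here.
        sizes.add(group.tasks().size());
      }
    }
    return sizes;
  }
}
```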
Historical Batch Processing
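A sketch that walks every snapshot still retained in table metadata and estimates the row count at each one. `Table.newBatchScan()` is assumed as the entry point; `rowsBySnapshot` is a hypothetical helper, and the counts are planning estimates rather than exact values.

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.iceberg.BatchScan;
import org.apache.iceberg.ScanTask;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

public class HistoricalBatchExample {
  // Maps each snapshot ID to the estimated number of rows visible at it.
  static Map<Long, Long> rowsBySnapshot(Table table) throws IOException {
    Map<Long, Long> result = new LinkedHashMap<>();
    for (Snapshot snapshot : table.snapshots()) {
      BatchScan scan = table.newBatchScan().useSnapshot(snapshot.snapshotId());
      long rows = 0;
      try (CloseableIterable<ScanTask> tasks = scan.planFiles()) {
        for (ScanTask task : tasks) {
          rows += task.estimatedRowsCount();
        }
      }
      result.put(snapshot.snapshotId(), rows);
    }
    return result;
  }
}
```

Each iteration builds a fresh scan, because the configuration methods return a new scan instead of modifying the existing one.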
Comparing TableScan and BatchScan
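The main visible difference is in the planned task types. As far as the Iceberg 1.x API goes, `TableScan.planFiles()` yields `FileScanTask` values, while `BatchScan.planFiles()` yields the more general `ScanTask`, which can be checked and narrowed when needed. The sketch below assumes `Table.newScan()` and `Table.newBatchScan()` as entry points; `countFileTasks` is a hypothetical helper.

```java
import java.io.IOException;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.ScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

public class ScanComparisonExample {
  // Returns {file tasks planned via TableScan, file tasks planned via BatchScan}.
  static long[] countFileTasks(Table table) throws IOException {
    long viaTableScan = 0;
    // TableScan plans FileScanTask directly.
    try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
      for (FileScanTask task : tasks) {
        viaTableScan++;
      }
    }

    long viaBatchScan = 0;
    // BatchScan plans generic ScanTask values; check the kind before narrowing.
    try (CloseableIterable<ScanTask> tasks = table.newBatchScan().planFiles()) {
      for (ScanTask task : tasks) {
        if (task.isFileScanTask()) {
          viaBatchScan++;
        }
      }
    }
    return new long[] {viaTableScan, viaBatchScan};
  }
}
```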
Task Planning
BatchScan provides two planning methods:
planFiles()
Returns individual scan tasks.
planTasks()
Returns grouped scan tasks optimized for parallel processing.
See Also
- TableScan - Table scan API
- IncrementalAppendScan - Incremental scan for appends
- Table - Table API reference
- Expressions - Filter expressions