Overview
Arrow’s Substrait support provides:- Plan serialization: Convert Acero execution plans to Substrait format
- Plan deserialization: Execute Substrait plans using Acero engine
- Expression conversion: Translate between Arrow compute expressions and Substrait
- Type mapping: Bidirectional conversion between Arrow and Substrait types
- Cross-system interoperability: Exchange query plans between systems
Key Concepts
Substrait Plans
A Substrait Plan represents a complete query execution plan with:- Relations: Operations like scans, filters, projections, aggregations
- Expressions: Computations on data (literals, field references, functions)
- Types: Schema information for data flowing through the plan
- Extensions: Custom functions and types via extension URIs
Arrow Acero Integration
Arrow executes Substrait plans using the Acero query engine:- Substrait relations map to Acero execution nodes
- Substrait expressions map to Arrow compute expressions
- Plans execute using Arrow’s streaming execution model
Deserializing Substrait Plans
Basic Plan Execution
Execute a Substrait plan using Arrow:Multiple Consumer Factory
Handle plans with multiple output relations:Writing to Filesystem
Write plan output directly to files:Serializing to Substrait
Serialize Acero Plan
Convert an Acero execution plan to Substrait:Serialize Expressions
Serialize standalone expressions:Deserialize Expressions
Type Conversion
Serialize Arrow Types
Deserialize Substrait Types
Schema Conversion
Serialize Schema
Deserialize Schema
Extension Sets
Extension sets manage custom functions and types:Conversion Options
Control conversion behavior:JSON Format
Convert between binary and JSON formats:Complete Example
Full workflow from serialization to execution:Supported Operations
Arrow supports these Substrait relation types:- Read: Scan data from files or tables
- Filter: Select rows based on predicates
- Project: Compute derived columns
- Aggregate: Group and aggregate data
- Join: Combine data from multiple sources
- Sort: Order data
- Fetch: Limit and offset
- Set operations: Union, intersect, except
Best Practices
Extension Set Management
Error Handling
Version Compatibility
When to Use Substrait
Substrait is beneficial for:- Multi-system workflows: Pass queries between different engines
- Query federation: Distribute query execution across systems
- Plan optimization: Serialize, optimize externally, then execute
- Remote execution: Send query plans to remote workers
- Testing: Validate query plans across implementations
- Single-system execution: No need for serialization overhead
- Performance-critical: Avoid serialization/deserialization cost
- Dynamic plans: Plans built and executed immediately