Overview
Gandiva uses LLVM to compile expressions into native machine code at runtime, enabling:
- High-performance projections: Compute derived columns from existing data
- Efficient filtering: Select rows based on complex conditions
- Vectorized execution: Process entire batches of data at once
- Expression caching: Reuse compiled code across multiple evaluations
Core Concepts
Projections
Projectors evaluate expressions to compute new columns from input data. A projector is built once for a schema and set of expressions, then reused for multiple record batches.
Filters
Filters evaluate boolean conditions to select rows from record batches. The output is a selection vector containing the indices of matching rows.
Expression Building
Gandiva provides the TreeExprBuilder class for constructing expression trees:
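As a rough illustration of what an expression tree is, the sketch below models field, literal, and function nodes and evaluates them over a batch of columns. It is a library-free stand-in, not Gandiva's API: the names Node, MakeField, MakeLiteral, MakeFunction, and Evaluate are hypothetical, only the functions "add" and "multiply" are supported, and the tree is interpreted node by node rather than compiled with LLVM.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// A batch maps column names to equally sized columns of doubles.
using Column = std::vector<double>;
using Batch = std::unordered_map<std::string, Column>;

// One tree node: a field reference, a literal, or a binary function call.
// Hypothetical type for illustration; not part of Gandiva.
struct Node {
  enum class Kind { Field, Literal, Function };
  Kind kind;
  std::string name;                             // field or function name
  double value;                                 // used by literals only
  std::vector<std::shared_ptr<Node>> children;  // function arguments
};
using NodePtr = std::shared_ptr<Node>;

NodePtr MakeField(std::string name) {
  return std::make_shared<Node>(
      Node{Node::Kind::Field, std::move(name), 0.0, {}});
}

NodePtr MakeLiteral(double value) {
  return std::make_shared<Node>(Node{Node::Kind::Literal, "", value, {}});
}

NodePtr MakeFunction(std::string name, std::vector<NodePtr> args) {
  return std::make_shared<Node>(
      Node{Node::Kind::Function, std::move(name), 0.0, std::move(args)});
}

// Evaluates the tree over every row of the batch, one node at a time.
Column Evaluate(const NodePtr& node, const Batch& batch, std::size_t num_rows) {
  if (node->kind == Node::Kind::Field) {
    const Column& column = batch.at(node->name);  // throws if field is missing
    assert(column.size() == num_rows);
    return column;
  }
  if (node->kind == Node::Kind::Literal) {
    return Column(num_rows, node->value);  // broadcast the constant
  }
  if (node->children.size() != 2) {
    throw std::invalid_argument("this sketch supports binary functions only");
  }
  const bool is_add = node->name == "add";
  if (!is_add && node->name != "multiply") {
    throw std::invalid_argument("unknown function: " + node->name);
  }
  const Column lhs = Evaluate(node->children[0], batch, num_rows);
  const Column rhs = Evaluate(node->children[1], batch, num_rows);
  Column out(num_rows);
  for (std::size_t i = 0; i < num_rows; ++i) {
    out[i] = is_add ? lhs[i] + rhs[i] : lhs[i] * rhs[i];
  }
  return out;
}
```

Nesting the builders describes a compound expression: an "add" node whose children are a "multiply" node over the fields price and qty and a literal 1.0 represents price * qty + 1. A compiling engine would walk the same kind of tree once to generate code instead of re-walking it for every batch.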
Literals
Functions
Gandiva supports a rich set of built-in functions:
Conditional Expressions
IN Expressions
Configuration Options
Customize Gandiva behavior with Configuration:
Configuration Builder
Performance Considerations
Expression Caching
Gandiva automatically caches compiled expressions. Reusing the same projector or filter across multiple batches avoids recompilation:
Pre-allocated Output
For maximum performance, pre-allocate output arrays:
Selection Vectors
When filtering, use selection vectors to avoid copying data:
When to Use Gandiva
Gandiva excels at:
- Complex expressions: Multi-step calculations with many operations
- High throughput: Processing large batches where compilation overhead is amortized
- CPU-bound workloads: Compute-intensive operations that benefit from SIMD
- Reusable expressions: Same expressions evaluated on many batches
Gandiva is a weaker fit for:
- Simple operations: Single-operation expressions may not justify compilation overhead
- One-time evaluations: Compilation cost dominates for single-batch processing
- I/O-bound workloads: Where computation is not the bottleneck
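The trade-off in the lists above comes down to whether a one-time compilation cost is repaid by faster per-batch evaluation. The sketch below works through that arithmetic. BreakEvenBatches is a hypothetical helper, and every timing passed to it is an assumed input that would have to be measured for a real workload; none of these figures come from Gandiva.

```cpp
#include <cassert>
#include <cmath>

// Returns the smallest number of batches n for which
//   compile_ms + n * compiled_ms_per_batch <= n * interpreted_ms_per_batch,
// or -1 if compiled evaluation is not faster per batch and never pays off.
// Hypothetical helper for illustration only.
long BreakEvenBatches(double compile_ms, double interpreted_ms_per_batch,
                      double compiled_ms_per_batch) {
  assert(compile_ms >= 0.0);
  const double saving_per_batch =
      interpreted_ms_per_batch - compiled_ms_per_batch;
  if (saving_per_batch <= 0.0) {
    return -1;  // no per-batch saving, so the compile cost is never recovered
  }
  return static_cast<long>(std::ceil(compile_ms / saving_per_batch));
}
```

With an assumed 50 ms compile step, 2 ms per batch interpreted, and 0.5 ms per batch compiled, the saving is 1.5 ms per batch, so compilation is repaid from the 34th batch onward. A single-batch job with the same numbers would be slower with compilation than without.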
Function Registry
Query available functions at runtime:
Error Handling
All Gandiva operations return arrow::Status:
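A minimal sketch of the check-and-propagate pattern that status-returning calls require. To stay self-contained it defines a small stand-in Status class instead of including Arrow; only OK(), Invalid(), ok(), and ToString() are modeled, and the exact message format is an assumption. BuildProjector, EvaluateBatch, and RunPipeline are hypothetical helpers, not Gandiva functions.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Stand-in for a status type; models only the small subset used below.
class Status {
 public:
  static Status OK() { return Status(true, ""); }
  static Status Invalid(std::string msg) {
    return Status(false, std::move(msg));
  }
  bool ok() const { return ok_; }
  std::string ToString() const {
    return ok_ ? std::string("OK") : "Invalid: " + message_;
  }

 private:
  Status(bool ok, std::string msg) : ok_(ok), message_(std::move(msg)) {}
  bool ok_;
  std::string message_;
};

// Hypothetical setup step: fails when there is nothing to compile.
Status BuildProjector(int num_expressions) {
  if (num_expressions <= 0) {
    return Status::Invalid("at least one expression is required");
  }
  return Status::OK();
}

// Hypothetical evaluation step: fails on a negative row count.
Status EvaluateBatch(int num_rows) {
  if (num_rows < 0) {
    return Status::Invalid("row count must not be negative");
  }
  return Status::OK();
}

// Checks every status and returns the first failure to the caller.
Status RunPipeline(int num_expressions, int num_rows) {
  Status st = BuildProjector(num_expressions);
  if (!st.ok()) return st;
  st = EvaluateBatch(num_rows);
  if (!st.ok()) return st;
  return Status::OK();
}
```

The caller inspects the result once, for example by logging ToString() when ok() is false. With Arrow itself, the ARROW_RETURN_NOT_OK macro expresses the same early return in a single line.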