Key Concepts and Terminology
This page covers the fundamental concepts and terminology you need to understand Apache Arrow’s architecture and design.Core Terminology
Arrow uses specific terminology that differs from other data systems. Understanding these terms is essential:Array / Vector
An array (also called a vector) is a sequence of values with known length, all having the same type. These terms are used interchangeably across Arrow implementations.
Slot
A slot is a single logical value position in an array. Each slot may contain a value or be null.Buffer
A buffer (or contiguous memory region) is a sequential virtual address space with a given length. Any byte can be reached via a single pointer offset.Buffers are the raw memory backing Arrow data structures. Most arrays consist of multiple buffers (validity, offsets, values).
Physical Layout vs Data Type
- Physical Layout: The underlying memory organization without value semantics
- Data Type: An application-facing semantic type implemented using a physical layout
- A 32-bit signed integer array and a 32-bit floating point array share the same physical layout
- A Decimal128 value uses a 16-byte fixed-size binary layout
- A Timestamp is stored using a 64-bit fixed-size layout
Type Categories
Primitive Type A data type with no child types:- Fixed bit-width types (Int32, Float64, Boolean)
- Variable-size binary and string types
- Null type
- List, Struct, Map, Union
- Example:
List<Int32>is distinct fromList<Float64>
- All nested types (require child types)
- Timestamp (requires unit and timezone)
- Decimal (requires precision and scale)
Memory Layout Fundamentals
Buffer Alignment and Padding
Key alignment requirements:- Buffers should be allocated on 8 or 64-byte aligned addresses
- Buffer lengths should be padded to multiples of 8 or 64 bytes
- Alignment ensures optimal SIMD operations and cache utilization
- Padded bytes don’t need specific values
Array Structure
Every Arrow array consists of:Validity Bitmaps
Most array types use a validity bitmap (also called a null bitmap) to track which values are null:- 1 bit = not null (valid)
- 0 bit = null (invalid)
- Uses least-significant bit (LSB) numbering
- Arrays with zero null count may omit the validity bitmap
- Bitmap size: at least
⌈length / 8⌉bytes, padded to 64 bytes
Access validity using:
is_valid[j] = bitmap[j / 8] & (1 << (j % 8))Physical Memory Layouts
Arrow defines several memory layouts optimized for different use cases:Fixed-Size Primitive Layout
Used for types with constant byte width: Example: Int32 Array[1, null, 2, 4, 8]
Null slots don’t require specific values in the value buffer. The validity bitmap determines nullness.
Variable-Size Binary Layout
Used for strings and binary data with varying lengths: Components:- Validity bitmap (1 bit per element)
- Offsets buffer (
length + 1Int32 or Int64 values) - Data buffer (actual bytes)
['joe', null, null, 'mark']
Binary View Layout
New in Arrow Columnar Format 1.4
List Layout
Nested type for variable-length sequences: Example:List<Int8> with values [[12, -7, 25], null, [0, -127, 127, 50], []]
Struct Layout
Contains multiple child arrays with named fields: Example:Struct<name: VarBinary, age: Int32>
Structs have:
- Top-level validity bitmap (independent of children)
- One child array per field
- All child arrays have the same length
To determine if a child value is valid, take the logical AND of the struct’s validity bit and the child array’s validity bit.
Union Layout
Represents values that can be one of several types: Dense Union: Minimal memory overhead (5 bytes per value)- Types buffer (8-bit type IDs)
- Offsets buffer (Int32 offsets into child arrays)
- Child arrays (one per type variant)
- Types buffer only (no offsets)
- All child arrays have same length as union
- Faster for vectorized operations
Union types do NOT have a validity bitmap. Nullness is determined by the child arrays.
Dictionary-Encoded Layout
Efficient representation of data with repeated values: Example:- Reduced memory usage for categorical data
- Faster equality comparisons (compare integers instead of strings)
- Lower bandwidth for data transfer
Run-End Encoded Layout
New in Arrow Columnar Format 1.3
[1.0, 1.0, 1.0, 1.0, null, null, 2.0]
- First run: 1.0 repeated 4 times (indices 0-3)
- Second run: null repeated 2 times (indices 4-5)
- Third run: 2.0 once (index 6)
Random access is O(log n) using binary search on run_ends, vs O(1) for other layouts.
Data Type Examples
Here’s a comprehensive table of Arrow data types:| Type Category | Examples | Physical Layout |
|---|---|---|
| Integer | Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64 | Fixed-size Primitive |
| Float | Float32 (single), Float64 (double) | Fixed-size Primitive |
| Decimal | Decimal128, Decimal256 | Fixed-size Primitive |
| Boolean | Boolean | Fixed-size Primitive (bit-packed) |
| Temporal | Date32, Date64, Time32, Time64, Timestamp, Duration, Interval | Fixed-size Primitive |
| Binary | Binary, LargeBinary, BinaryView, FixedSizeBinary | Variable or Fixed-size |
| String | Utf8, LargeUtf8, Utf8View | Variable-size Binary |
| List | List, LargeList, FixedSizeList, ListView, LargeListView | Variable or Fixed-size List |
| Nested | Struct, Map | Struct Layout |
| Union | DenseUnion, SparseUnion | Union Layout |
| Special | Dictionary, RunEndEncoded, Null | Special Layouts |
Schema and Metadata
Schema Definition
A schema defines the structure of a table:Field Properties
Each field in a schema has:- Name: UTF-8 encoded field name
- Type: The Arrow data type
- Nullable: Whether the field can contain nulls
- Metadata: Optional key-value metadata
Extension Types
Extension types allow custom semantics on standard Arrow types:Serialization and IPC
Arrow IPC Format
The IPC (Interprocess Communication) format enables efficient data exchange:- Based on FlatBuffers for metadata serialization
- Supports streaming and random-access file formats
- Preserves zero-copy properties across process boundaries
- Language-agnostic wire protocol
Stream Format
Continuous stream of record batches:File Format
Random-access file with footer:Best Practices
Prefer signed integers
Use signed integers for dictionary indices and array lengths for better JVM compatibility.
Choose appropriate types
Use BinaryView for long strings with many comparisons, Dictionary encoding for categorical data, and Run-End Encoding for data with long runs.
Next Steps
Now that you understand Arrow’s core concepts:- Read the complete format specification
- Explore the Apache Arrow Cookbook
- Choose your language implementation from the documentation
- Join the community on the Arrow mailing list