Overview
The Arrow columnar format is a language-agnostic in-memory data structure specification that enables high-performance analytical operations on structured data. The format provides:- Data adjacency for sequential access (scans)
- O(1) constant-time random access (except Run-End Encoded layout)
- SIMD and vectorization-friendly operations
- Zero-copy access through relocatable memory without pointer swizzling
The Arrow columnar format is at version 1.5, with forward and backward compatibility guarantees since version 1.0.0.
Terminology
Core Concepts
Core Concepts
- Array/Vector: A sequence of values with known length, all having the same type
- Slot: A single logical value in an array
- Buffer: A sequential virtual address space with a given length
- Physical Layout: The underlying memory layout for an array without value semantics
Type Categories
Type Categories
- Data type: Application-facing semantic value type implemented using a physical layout
- Primitive type: Data type with no child types (fixed bit-width, variable-size binary, null)
- Nested type: Data type whose structure depends on one or more child types
- Parametric type: Type requiring additional parameters (e.g., timestamp with unit and timezone)
Data Types
The Arrow columnar format supports a comprehensive type system defined inSchema.fbs. All types map to well-defined physical memory layouts.
Primitive Types
| Type | Parameters | Physical Layout |
|---|---|---|
| Null | - | Null |
| Boolean | - | Fixed-size Primitive (bit-packed) |
| Int | bit width, signedness | Fixed-size Primitive |
| Floating Point | precision | Fixed-size Primitive |
| Decimal | bit width, scale, precision | Fixed-size Primitive |
| Date | unit | Fixed-size Primitive |
| Time | bit width, unit | Fixed-size Primitive |
| Timestamp | unit, timezone | Fixed-size Primitive |
| Duration | unit | Fixed-size Primitive |
| Interval | unit | Fixed-size Primitive |
Binary and String Types
| Type | Offsets | Description |
|---|---|---|
| Binary | 32-bit | Variable-size binary data |
| Utf8 | 32-bit | UTF-8 encoded strings |
| Large Binary | 64-bit | Variable-size binary for large data |
| Large Utf8 | 64-bit | UTF-8 strings for large data |
| Binary View | - | Variable-size with inline small strings |
| Utf8 View | - | UTF-8 with inline small strings |
| Fixed-Size Binary | byte width | Fixed-width binary data |
Nested Types
| Type | Description |
|---|---|
| List | Variable-size list with 32-bit offsets |
| Large List | Variable-size list with 64-bit offsets |
| List View | Variable-size with out-of-order offsets |
| Fixed-Size List | Fixed number of elements per slot |
| Struct | Collection of named fields |
| Map | Key-value pairs (List of Structs) |
| Union (Sparse/Dense) | Multiple possible types per slot |
Encoded Types
| Type | Description |
|---|---|
| Dictionary | Integer indices into a dictionary array |
| Run-End Encoded | Run-length encoding variant for repeated values |
Physical Memory Layout
Arrays are defined by:- A data type
- A sequence of buffers
- A length (64-bit signed integer)
- A null count (64-bit signed integer)
- An optional dictionary (for dictionary-encoded arrays)
Buffer Alignment and Padding
Implementations should allocate memory on aligned addresses (multiples of 8 or 64 bytes) and pad to a multiple of 8 or 64 bytes. 64-byte alignment is recommended for optimal SIMD performance on modern CPUs (Intel AVX-512).Validity Bitmaps
All array types (except unions) use a validity bitmap to encode null values. A bit value of:- 1 (set): Value is valid (non-null)
- 0 (unset): Value is null
Layout Examples
Fixed-Size Primitive Layout
Int32 array:[1, null, 2, 4, 8]
Variable-Size Binary Layout
String array:['joe', null, null, 'mark']
Null values may occupy non-empty space in the data buffer; their content is undefined.
Variable-Size Binary View Layout
New in Arrow 1.4 - Optimized for string operations with inline small strings.List Layout
List<Int8> array: [[12, -7, 25], null, [0, -127, 127, 50], []]
Struct Layout
Struct array with schemaStruct<name: VarBinary, age: Int32>:
[{'joe', 1}, {null, 2}, null, {'mark', 4}]
Union Layout
Unions support two modes: Dense Union: 5 bytes overhead per value, space-efficient- Type ids buffer (8-bit integers)
- Offsets buffer (32-bit integers)
- Child arrays contain only values of their type
- Type ids buffer (8-bit integers)
- No offsets buffer
- Child arrays all have length equal to the union
Union types do NOT have a validity bitmap. Nullness is determined by the child arrays.
Dictionary-Encoded Layout
Dictionary encoding represents repeated values efficiently using integer indices.Run-End Encoded Layout
New in Arrow 1.3 - Efficient for data with long runs of repeated values.Buffer Layouts by Type
| Layout Type | Buffer 0 | Buffer 1 | Buffer 2 | Variadic Buffers |
|---|---|---|---|---|
| Primitive | validity | data | - | - |
| Variable Binary | validity | offsets | data | - |
| Variable Binary View | validity | views | - | data |
| List | validity | offsets | - | - |
| List View | validity | offsets | sizes | - |
| Fixed-size List | validity | - | - | - |
| Struct | validity | - | - | - |
| Sparse Union | type ids | - | - | - |
| Dense Union | type ids | offsets | - | - |
| Null | - | - | - | - |
| Dictionary-encoded | validity | data (indices) | - | - |
| Run-end encoded | - | - | - | - |
Schema and Metadata
Schemas are serialized using Flatbuffers and contain:- Ordered sequence of fields
- Endianness (default: Little Endian)
- Custom metadata (key-value pairs)
- Features used in the stream/file
- Name and type
- Nullable flag
- Child fields (for nested types)
- Dictionary encoding information
- Custom metadata
Extension Types
Extension types allow custom semantics on standard Arrow types using field metadata:ARROW:extension:name- String identifier (use namespaced names)ARROW:extension:metadata- Serialized reconstruction data
uuidasFixedSizeBinary(16)latitude-longitudeasStruct<latitude: double, longitude: double>tensorasBinarywith shape/dtype metadata
Implementation Guidelines
Array Lengths
Array lengths are 64-bit signed integers. Implementations may limit to 32-bit (2³¹ - 1 elements) for cross-language compatibility. Larger datasets should use multiple array chunks.Null Count
64-bit signed integer representing the number of null slots. May be as large as the array length.Subset Implementation
- Producers: May implement any subset of the spec
- Consumers: Should convert unsupported types to supported ones (e.g., timestamp.millis → timestamp.micros)
Extensibility
Engines may use custom internal vectors but must convert to standard Arrow types before serialization.References
- Schema.fbs - Type definitions
- Message.fbs - Message format
- Flatbuffers - Serialization framework