
Overview

The Arrow columnar format is a language-agnostic in-memory data structure specification that enables high-performance analytical operations on structured data. The format provides:
  • Data adjacency for sequential access (scans)
  • Constant-time (O(1)) random access, except in the Run-End Encoded layout
  • SIMD and vectorization-friendly operations
  • Zero-copy access through relocatable memory without pointer swizzling
The columnar format prioritizes analytical performance and data locality over mutation operations. It uses Google’s Flatbuffers for metadata serialization.
The Arrow columnar format is at version 1.5, with forward and backward compatibility guarantees since version 1.0.0.

Terminology

  • Array/Vector: A sequence of values with known length, all having the same type
  • Slot: A single logical value in an array
  • Buffer: A sequential virtual address space with a given length
  • Physical Layout: The underlying memory layout for an array without value semantics
  • Data type: Application-facing semantic value type implemented using a physical layout
  • Primitive type: Data type with no child types (fixed bit-width, variable-size binary, null)
  • Nested type: Data type whose structure depends on one or more child types
  • Parametric type: Type requiring additional parameters (e.g., timestamp with unit and timezone)

Data Types

The Arrow columnar format supports a comprehensive type system defined in Schema.fbs. All types map to well-defined physical memory layouts.

Primitive Types

| Type           | Parameters                  | Physical Layout                   |
|----------------|-----------------------------|-----------------------------------|
| Null           | -                           | Null                              |
| Boolean        | -                           | Fixed-size Primitive (bit-packed) |
| Int            | bit width, signedness       | Fixed-size Primitive              |
| Floating Point | precision                   | Fixed-size Primitive              |
| Decimal        | bit width, scale, precision | Fixed-size Primitive              |
| Date           | unit                        | Fixed-size Primitive              |
| Time           | bit width, unit             | Fixed-size Primitive              |
| Timestamp      | unit, timezone              | Fixed-size Primitive              |
| Duration       | unit                        | Fixed-size Primitive              |
| Interval       | unit                        | Fixed-size Primitive              |

Binary and String Types

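| Type              | Offsets                      | Description                             |
|-------------------|------------------------------|-----------------------------------------|
| Binary            | 32-bit                       | Variable-size binary data               |
| Utf8              | 32-bit                       | UTF-8 encoded strings                   |
| Large Binary      | 64-bit                       | Variable-size binary for large data     |
| Large Utf8        | 64-bit                       | UTF-8 strings for large data            |
| Binary View       | -                            | Variable-size with inline small strings |
| Utf8 View         | -                            | UTF-8 with inline small strings         |
| Fixed-Size Binary | none (parameter: byte width) | Fixed-width binary data                 |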

Nested Types

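| Type                 | Description                                                                       |
|----------------------|-----------------------------------------------------------------------------------|
| List                 | Variable-size list with 32-bit offsets                                            |
| Large List           | Variable-size list with 64-bit offsets                                            |
| List View            | Variable-size list with offsets and sizes buffers; offsets may be out of order    |
| Fixed-Size List      | Fixed number of elements per slot                                                 |
| Struct               | Collection of named fields                                                        |
| Map                  | Key-value pairs (List of Structs)                                                 |
| Union (Sparse/Dense) | Multiple possible types per slot                                                  |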

Encoded Types

| Type            | Description                                      |
|-----------------|--------------------------------------------------|
| Dictionary      | Integer indices into a dictionary array          |
| Run-End Encoded | Run-length encoding variant for repeated values  |

Extension types allow custom semantics on standard Arrow types but should use namespaced names (e.g., myorg.custom_type) to avoid conflicts.

Physical Memory Layout

Arrays are defined by:
  • A data type
  • A sequence of buffers
  • A length (64-bit signed integer)
  • A null count (64-bit signed integer)
  • An optional dictionary (for dictionary-encoded arrays)

Buffer Alignment and Padding

Implementations should allocate memory at aligned addresses (multiples of 8 or 64 bytes) and pad buffer lengths to a multiple of 8 or 64 bytes. 64-byte alignment is recommended for SIMD performance on modern CPUs, for example with the 512-bit registers of Intel AVX-512.
Alignment benefits:
  • Guaranteed aligned access for elements of numeric arrays
  • Fewer partially used cache lines on some architectures
  • Better use of SIMD instructions
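
As a rough illustration of the padding rule, the sketch below rounds a buffer length up to the next multiple of the alignment. The function name `padded_length` is hypothetical and not part of any Arrow library; it assumes the alignment is a power of two.

```python
def padded_length(nbytes, alignment=64):
    """Round nbytes up to the next multiple of alignment.

    Assumes alignment is a power of two (8 or 64 in this document).
    """
    return (nbytes + alignment - 1) & ~(alignment - 1)


# A 5-element Int32 value buffer holds 20 bytes of data.
print(padded_length(20))      # 64
print(padded_length(20, 8))   # 24
```

With 64-byte padding, the 20-byte value buffer in the layout examples occupies bytes 0-63, which is why those tables show bytes 20-63 as unspecified.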

Validity Bitmaps

Most array types use a validity bitmap to encode null values. Union arrays have none, and the Null and Run-End Encoded layouts likewise carry no validity buffer of their own. A bit value of:
  • 1 (set): Value is valid (non-null)
  • 0 (unset): Value is null
Bitmaps use least-significant bit (LSB) numbering:
values = [0, 1, null, 2, null, 3]

bitmap (reading right-to-left within each byte):
j mod 8   7  6  5  4  3  2  1  0
          0  0  1  0  1  0  1  1
Arrays with zero null count may omit the validity bitmap.
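
The LSB numbering can be sketched in a few lines of Python. The helper names `pack_validity` and `is_valid` are illustrative only, and `None` stands in for a null slot.

```python
def pack_validity(values):
    """Pack a list of values (None = null) into an LSB-numbered bitmap."""
    bitmap = bytearray((len(values) + 7) // 8)
    for j, value in enumerate(values):
        if value is not None:
            # Slot j lives in byte j // 8, at bit position j % 8.
            bitmap[j // 8] |= 1 << (j % 8)
    return bytes(bitmap)


def is_valid(bitmap, j):
    """Return True if slot j is non-null."""
    return bool(bitmap[j // 8] & (1 << (j % 8)))


bitmap = pack_validity([0, 1, None, 2, None, 3])
print(format(bitmap[0], "08b"))  # 00101011
```

This sketch does not add the padding bytes that a real buffer would carry.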

Layout Examples

Fixed-Size Primitive Layout

Int32 array: [1, null, 2, 4, 8]
Length: 5, Null count: 1

Validity bitmap buffer:
| Byte 0            | Bytes 1-63 |
|-------------------|------------|
| 00011101          | 0 (padding)|

Value Buffer:
| Bytes 0-3 | Bytes 4-7   | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 |
|-----------|-------------|------------|-------------|-------------|-------------|
| 1         | unspecified | 2          | 4           | 8           | unspecified |
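
A minimal sketch of reading this layout, assuming little-endian Int32 values and using only Python's `struct` module. The function name `get_int32` is hypothetical, and the null slot is filled with 0 here only because its content is unspecified.

```python
import struct

validity = bytes([0b00011101])            # slots 0, 2, 3, 4 are valid
data = struct.pack("<5i", 1, 0, 2, 4, 8)  # slot 1 is null; 0 is an arbitrary filler


def get_int32(validity, data, i):
    """Return slot i, or None when its validity bit is unset."""
    if not (validity[i // 8] >> (i % 8)) & 1:
        return None
    # Fixed width makes random access a single multiply: byte offset = 4 * i.
    return struct.unpack_from("<i", data, 4 * i)[0]


print([get_int32(validity, data, i) for i in range(5)])  # [1, None, 2, 4, 8]
```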

Variable-Size Binary Layout

String array: ['joe', null, null, 'mark']
Length: 4, Null count: 2

Validity bitmap:
| Byte 0   | Bytes 1-63 |
|----------|------------|
| 00001001 | 0 (padding)|

Offsets buffer (32-bit):
| Bytes 0-19        | Bytes 20-63 |
|-------------------|-------------|
| 0, 3, 3, 3, 7     | unspecified |

Value buffer:
| Bytes 0-6 | Bytes 7-63  |
|-----------|-------------|
| joemark   | unspecified |
Null values may occupy non-empty space in the data buffer; their content is undefined.
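
Slot access in this layout can be sketched as follows, assuming little-endian 32-bit offsets. The function name `get_string` is hypothetical; slot i spans bytes `offsets[i]` to `offsets[i + 1]` of the value buffer.

```python
import struct

validity = bytes([0b00001001])               # slots 0 and 3 are valid
offsets = struct.pack("<5i", 0, 3, 3, 3, 7)  # length + 1 offsets
data = b"joemark"


def get_string(validity, offsets, data, i):
    """Return the string in slot i, or None when the slot is null."""
    if not (validity[i // 8] >> (i % 8)) & 1:
        return None
    # Read offsets[i] and offsets[i + 1] in one call.
    start, end = struct.unpack_from("<2i", offsets, 4 * i)
    return data[start:end].decode("utf-8")


print([get_string(validity, offsets, data, i) for i in range(4)])
# ['joe', None, None, 'mark']
```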

Variable-Size Binary View Layout

New in Arrow columnar format 1.4. Optimized for string operations, with short strings stored inline.
View structure (16 bytes):

Short strings (length ≤ 12 bytes):
| Bytes 0-3 | Bytes 4-15              |
|-----------|-------------------------|
| length    | data (padded with 0)    |

Long strings (length > 12 bytes):
| Bytes 0-3 | Bytes 4-7 | Bytes 8-11  | Bytes 12-15 |
|-----------|-----------|-------------|-------------|
| length    | prefix    | buffer index| offset      |
The prefix stores the first 4 bytes of the string, enabling fast string comparisons.
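
The two view shapes can be sketched with `struct.pack`, assuming little-endian 32-bit integer fields. The function name `encode_view` and the sample buffer index and offset are hypothetical.

```python
import struct


def encode_view(value, buffer_index=0, offset=0):
    """Encode a bytes value as a 16-byte view struct."""
    n = len(value)
    if n <= 12:
        # Short value: length, then the data inline ("12s" pads with zero bytes).
        return struct.pack("<i12s", n, value)
    # Long value: length, 4-byte prefix, data buffer index, offset in that buffer.
    return struct.pack("<i4sii", n, value[:4], buffer_index, offset)


short_view = encode_view(b"hello")
long_view = encode_view(b"a long string value", buffer_index=1, offset=32)
print(len(short_view), len(long_view))      # 16 16
print(struct.unpack("<i4sii", long_view))   # (19, b'a lo', 1, 32)
```

Because both shapes start with the length and the first four bytes of the value, two views can often be compared without touching the data buffers.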

List Layout

List<Int8> array: [[12, -7, 25], null, [0, -127, 127, 50], []]
Length: 4, Null count: 1

Validity bitmap:
| Byte 0   | Bytes 1-63 |
|----------|------------|
| 00001101 | 0 (padding)|

Offsets buffer (int32):
| Bytes 0-19        | Bytes 20-63 |
|-------------------|-------------|
| 0, 3, 3, 7, 7     | unspecified |

Values array (Int8Array):
  Length: 7, Null count: 0
  Values: [12, -7, 25, 0, -127, 127, 50]
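
Decoding this example can be sketched with plain Python lists standing in for the buffers. The function name `get_list` is hypothetical, and the bitmap is held as an integer for brevity.

```python
validity = 0b00001101                   # slot 1 is null
offsets = [0, 3, 3, 7, 7]               # length + 1 entries
child = [12, -7, 25, 0, -127, 127, 50]  # the Int8 values array


def get_list(validity, offsets, child, i):
    """Return slot i as a Python list, or None when the slot is null."""
    if not (validity >> i) & 1:
        return None
    return child[offsets[i]:offsets[i + 1]]


print([get_list(validity, offsets, child, i) for i in range(4)])
# [[12, -7, 25], None, [0, -127, 127, 50], []]
```

Note that the null slot and the empty list both have equal start and end offsets; only the validity bit distinguishes them.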

Struct Layout

Struct array with schema Struct<name: VarBinary, age: Int32>: [{'joe', 1}, {null, 2}, null, {'mark', 4}]
Length: 4, Null count: 1

Struct validity bitmap:
| Byte 0   | Bytes 1-63 |
|----------|------------|
| 00001011 | 0 (padding)|

Child arrays:
  - name (VarBinary): ['joe', null, 'alice', 'mark']
  - age (Int32): [1, 2, null, 4]
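The struct's own validity bitmap determines whether a struct slot is null, independently of its children. A child entry is valid only if BOTH the struct validity bitmap AND the child array validity bitmap have the bit set for that position. In the example above, the name child still stores 'alice' at slot 2 even though the struct itself is null there.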
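
The logical AND of the two bitmaps can be sketched as below. The function name `effective_child_validity` is hypothetical, and the child bitmaps are assumptions derived from the child arrays listed in the example.

```python
def effective_child_validity(struct_bitmap, child_bitmap):
    """AND the struct's validity bitmap with a child's, byte by byte."""
    return bytes(s & c for s, c in zip(struct_bitmap, child_bitmap))


struct_validity = bytes([0b00001011])  # slot 2 is a null struct
name_validity = bytes([0b00001101])    # assumed from ['joe', null, 'alice', 'mark']
age_validity = bytes([0b00001011])     # assumed from [1, 2, null, 4]

name_effective = effective_child_validity(struct_validity, name_validity)
age_effective = effective_child_validity(struct_validity, age_validity)
print(format(name_effective[0], "08b"))  # 00001001
print(format(age_effective[0], "08b"))   # 00001011
```

After the AND, 'alice' at slot 2 of the name child is masked out because the enclosing struct is null there.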

Union Layout

Unions support two modes.
Dense Union: 5 bytes overhead per value, space-efficient
  • Type ids buffer (8-bit integers)
  • Offsets buffer (32-bit integers)
  • Child arrays contain only values of their type
Sparse Union: Equal-length child arrays, vectorization-friendly
  • Type ids buffer (8-bit integers)
  • No offsets buffer
  • Child arrays all have length equal to the union
Union types do NOT have a validity bitmap. Nullness is determined by the child arrays.
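
The difference between the two modes shows up in how a slot is looked up. The sketch below uses hypothetical data (a float child with type id 0 and an int child with type id 1) and assumes each type id equals the child's position; in general the union type carries its own mapping from type ids to children.

```python
def dense_union_get(type_ids, offsets, children, i):
    """Dense: the offsets buffer says where slot i sits inside its child."""
    return children[type_ids[i]][offsets[i]]


def sparse_union_get(type_ids, children, i):
    """Sparse: every child has the union's length, so slot i is at index i."""
    return children[type_ids[i]][i]


# Hypothetical example data, not taken from the text above.
type_ids = [0, 0, 0, 1]
dense_offsets = [0, 1, 2, 0]
dense_children = [[1.2, None, 3.4], [5]]
sparse_children = [[1.2, None, 3.4, None], [None, None, None, 5]]

print([dense_union_get(type_ids, dense_offsets, dense_children, i) for i in range(4)])
print([sparse_union_get(type_ids, sparse_children, i) for i in range(4)])
# both print [1.2, None, 3.4, 5]
```

The null in slot 1 comes from the float child itself, since the union has no validity bitmap.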

Dictionary-Encoded Layout

Dictionary encoding represents repeated values efficiently using integer indices.
Original data: ['foo', 'bar', 'foo', 'bar', null, 'baz']

Dictionary-encoded:
  Index type: Int32
  Index values: [0, 1, 0, 1, null, 2]
  
Dictionary:
  Type: VarBinary  
  Values: ['foo', 'bar', 'baz']
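
One way to produce and reverse this encoding is sketched below with plain Python lists. The function names are hypothetical, and the dictionary is built in order of first appearance, which is an assumption rather than a requirement of the format.

```python
def dictionary_encode(values):
    """Return (dictionary, indices); None values stay None in the indices."""
    dictionary, positions, indices = [], {}, []
    for value in values:
        if value is None:
            indices.append(None)
            continue
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Look each index up in the dictionary, passing nulls through."""
    return [None if i is None else dictionary[i] for i in indices]


original = ['foo', 'bar', 'foo', 'bar', None, 'baz']
dictionary, indices = dictionary_encode(original)
print(dictionary)  # ['foo', 'bar', 'baz']
print(indices)     # [0, 1, 0, 1, None, 2]
```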

Run-End Encoded Layout

New in Arrow columnar format 1.3. Efficient for data with long runs of repeated values.
Original: [1.0, 1.0, 1.0, 1.0, null, null, 2.0]

Run-End Encoded:
  Length: 7, Null count: 0
  
  run_ends (Int32): [4, 6, 7]
  values (Float32): [1.0, null, 2.0]
    Validity: [valid, null, valid]
Run ends represent cumulative lengths - the logical index where each run ends.
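
Because run ends are cumulative and sorted, a logical index can be resolved with a binary search, which is why random access here is logarithmic rather than constant time. A minimal sketch using Python's `bisect` module follows; the function name `ree_get` is hypothetical.

```python
from bisect import bisect_right

run_ends = [4, 6, 7]
values = [1.0, None, 2.0]


def ree_get(run_ends, values, i):
    """Return the logical value at index i of a run-end encoded array."""
    if not 0 <= i < run_ends[-1]:
        raise IndexError(i)
    # The first run whose end is strictly greater than i contains index i.
    return values[bisect_right(run_ends, i)]


print([ree_get(run_ends, values, i) for i in range(7)])
# [1.0, 1.0, 1.0, 1.0, None, None, 2.0]
```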

Buffer Layouts by Type

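| Layout Type          | Buffer 0 | Buffer 1       | Buffer 2 | Variadic Buffers |
|----------------------|----------|----------------|----------|------------------|
| Primitive            | validity | data           | -        | -                |
| Variable Binary      | validity | offsets        | data     | -                |
| Variable Binary View | validity | views          | -        | data             |
| List                 | validity | offsets        | -        | -                |
| List View            | validity | offsets        | sizes    | -                |
| Fixed-size List      | validity | -              | -        | -                |
| Struct               | validity | -              | -        | -                |
| Sparse Union         | type ids | -              | -        | -                |
| Dense Union          | type ids | offsets        | -        | -                |
| Null                 | -        | -              | -        | -                |
| Dictionary-encoded   | validity | data (indices) | -        | -                |
| Run-end encoded      | -        | -              | -        | -                |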

Schema and Metadata

Schemas are serialized using Flatbuffers and contain:
  • Ordered sequence of fields
  • Endianness (default: Little Endian)
  • Custom metadata (key-value pairs)
  • Features used in the stream/file
table Schema {
  endianness: Endianness = Little;
  fields: [Field];
  custom_metadata: [KeyValue];
  features: [Feature];
}
Each Field includes:
  • Name and type
  • Nullable flag
  • Child fields (for nested types)
  • Dictionary encoding information
  • Custom metadata

Extension Types

Extension types allow custom semantics on standard Arrow types using field metadata:
  • ARROW:extension:name - String identifier (use namespaced names)
  • ARROW:extension:metadata - Serialized reconstruction data
Examples:
  • uuid as FixedSizeBinary(16)
  • latitude-longitude as Struct<latitude: double, longitude: double>
  • tensor as Binary with shape/dtype metadata
Extension names beginning with arrow. are reserved for canonical extensions. Use namespaced names like myorg.type_name for custom types.
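
As an illustration only, the two metadata keys can be modelled as a plain dictionary attached to a field. The helper `extension_metadata` and the name `myorg.uuid` are hypothetical; real implementations expose their own APIs for this.

```python
def extension_metadata(name, serialized=""):
    """Build the field metadata entries for a custom extension type."""
    if name.startswith("arrow."):
        # The "arrow." prefix is reserved for canonical extensions.
        raise ValueError(f"reserved extension name: {name}")
    return {
        "ARROW:extension:name": name,
        "ARROW:extension:metadata": serialized,
    }


# A hypothetical UUID type stored as FixedSizeBinary(16).
print(extension_metadata("myorg.uuid"))
```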

Implementation Guidelines

Array Lengths

Array lengths are 64-bit signed integers. Implementations may limit to 32-bit (2³¹ - 1 elements) for cross-language compatibility. Larger datasets should use multiple array chunks.
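
Splitting a large dataset into chunks that respect the 32-bit limit might look like the sketch below. The function name `chunk_ranges` is hypothetical, and it only computes slice boundaries rather than copying data.

```python
MAX_CHUNK_LENGTH = 2**31 - 1


def chunk_ranges(total_length, max_chunk=MAX_CHUNK_LENGTH):
    """Return (start, end) pairs covering total_length, each at most max_chunk long."""
    return [
        (start, min(start + max_chunk, total_length))
        for start in range(0, total_length, max_chunk)
    ]


print(chunk_ranges(10, max_chunk=4))  # [(0, 4), (4, 8), (8, 10)]
print(len(chunk_ranges(5 * 10**9)))   # 3
```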

Null Count

The null count is a 64-bit signed integer giving the number of null slots in the array. It may be as large as the array length.

Subset Implementation

  • Producers: May implement any subset of the spec
  • Consumers: Should convert unsupported types to supported ones (e.g., timestamp.millis → timestamp.micros)
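
The timestamp conversion mentioned above amounts to rescaling 64-bit integers. A minimal sketch follows; the function name `millis_to_micros` is hypothetical, and the overflow check is an added assumption about how a consumer might guard the wider unit.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def millis_to_micros(values):
    """Convert millisecond timestamps to microseconds, keeping nulls as None."""
    converted = []
    for value in values:
        if value is None:
            converted.append(None)
            continue
        micros = value * 1000
        if not INT64_MIN <= micros <= INT64_MAX:
            raise OverflowError(f"{value} ms does not fit in int64 microseconds")
        converted.append(micros)
    return converted


print(millis_to_micros([0, 1500, None]))  # [0, 1500000, None]
```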

Extensibility

Engines may use custom internal vectors but must convert to standard Arrow types before serialization.
