Arrow Columnar Memory Format

Overview

The Arrow columnar format is a language-agnostic in-memory data structure specification that enables high-performance analytical operations on structured data. The format provides:

Data adjacency for sequential access (scans)
O(1) constant-time random access (except Run-End Encoded layout)
SIMD and vectorization-friendly operations
Zero-copy access through relocatable memory without pointer swizzling

The columnar format prioritizes analytical performance and data locality over mutation operations. It uses Google’s Flatbuffers for metadata serialization.

The Arrow columnar format is at version 1.5, with forward and backward compatibility guarantees since version 1.0.0.

Terminology

Core Concepts

Array/Vector: A sequence of values with known length, all having the same type
Slot: A single logical value in an array
Buffer: A sequential virtual address space with a given length
Physical Layout: The underlying memory layout for an array without value semantics

Type Categories

Data type: Application-facing semantic value type implemented using a physical layout
Primitive type: Data type with no child types (fixed bit-width, variable-size binary, null)
Nested type: Data type whose structure depends on one or more child types
Parametric type: Type requiring additional parameters (e.g., timestamp with unit and timezone)

Data Types

The Arrow columnar format supports a comprehensive type system defined in Schema.fbs. All types map to well-defined physical memory layouts.

Primitive Types

Type	Parameters	Physical Layout
Null	-	Null
Boolean	-	Fixed-size Primitive (bit-packed)
Int	bit width, signedness	Fixed-size Primitive
Floating Point	precision	Fixed-size Primitive
Decimal	bit width, scale, precision	Fixed-size Primitive
Date	unit	Fixed-size Primitive
Time	bit width, unit	Fixed-size Primitive
Timestamp	unit, timezone	Fixed-size Primitive
Duration	unit	Fixed-size Primitive
Interval	unit	Fixed-size Primitive

Binary and String Types

Type	Offsets	Description
Binary	32-bit	Variable-size binary data
Utf8	32-bit	UTF-8 encoded strings
Large Binary	64-bit	Variable-size binary for large data
Large Utf8	64-bit	UTF-8 strings for large data
Binary View	-	Variable-size with inline small strings
Utf8 View	-	UTF-8 with inline small strings
Fixed-Size Binary	byte width	Fixed-width binary data

Nested Types

Type	Description
List	Variable-size list with 32-bit offsets
Large List	Variable-size list with 64-bit offsets
List View	Variable-size with out-of-order offsets
Fixed-Size List	Fixed number of elements per slot
Struct	Collection of named fields
Map	Key-value pairs (List of Structs)
Union (Sparse/Dense)	Multiple possible types per slot

Encoded Types

Type	Description
Dictionary	Integer indices into a dictionary array
Run-End Encoded	Run-length encoding variant for repeated values

Extension types allow custom semantics on standard Arrow types but should use namespaced names (e.g., myorg.custom_type) to avoid conflicts.

Physical Memory Layout

Arrays are defined by:

A data type
A sequence of buffers
A length (64-bit signed integer)
A null count (64-bit signed integer)
An optional dictionary (for dictionary-encoded arrays)

Buffer Alignment and Padding

Implementations should allocate memory on aligned addresses (multiples of 8 or 64 bytes) and pad to a multiple of 8 or 64 bytes. 64-byte alignment is recommended for optimal SIMD performance on modern CPUs (Intel AVX-512).

Alignment benefits:
- Guaranteed aligned access for numeric arrays
- Reduced cache line usage
- Optimized SIMD instruction usage

Validity Bitmaps

All array types (except unions) use a validity bitmap to encode null values. A bit value of:

1 (set): Value is valid (non-null)
0 (unset): Value is null

Bitmaps use least-significant bit (LSB) numbering:

values = [0, 1, null, 2, null, 3]

bitmap (reading right-to-left within each byte):
j mod 8   7  6  5  4  3  2  1  0
          0  0  1  0  1  0  1  1

Arrays with zero null count may omit the validity bitmap.

Layout Examples

Fixed-Size Primitive Layout

Int32 array: [1, null, 2, 4, 8]

Length: 5, Null count: 1

Validity bitmap buffer:
| Byte 0            | Bytes 1-63 |
|-------------------|------------|
| 00011101          | 0 (padding)|

Value Buffer:
| Bytes 0-3 | Bytes 4-7   | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 |
|-----------|-------------|------------|-------------|-------------|-------------|
| 1         | unspecified | 2          | 4           | 8           | unspecified |

Variable-Size Binary Layout

String array: ['joe', null, null, 'mark']

Length: 4, Null count: 2

Validity bitmap:
| Byte 0   | Bytes 1-63 |
|----------|------------|
| 00001001 | 0 (padding)|

Offsets buffer (32-bit):
| Bytes 0-19        | Bytes 20-63 |
|-------------------|-------------|
| 0, 3, 3, 3, 7     | unspecified |

Value buffer:
| Bytes 0-6 | Bytes 7-63  |
|-----------|-------------|
| joemark   | unspecified |

Null values may occupy non-empty space in the data buffer; their content is undefined.

Variable-Size Binary View Layout

New in Arrow 1.4 - Optimized for string operations with inline small strings.

View structure (16 bytes):

Short strings (length ≤ 12 bytes):
| Bytes 0-3 | Bytes 4-15              |
|-----------|-------------------------|
| length    | data (padded with 0)    |

Long strings (length > 12 bytes):
| Bytes 0-3 | Bytes 4-7 | Bytes 8-11  | Bytes 12-15 |
|-----------|-----------|-------------|-------------|
| length    | prefix    | buffer index| offset      |

The prefix stores the first 4 bytes of the string, enabling fast string comparisons.

List Layout

List<Int8> array: [[12, -7, 25], null, [0, -127, 127, 50], []]

Length: 4, Null count: 1

Validity bitmap:
| Byte 0   | Bytes 1-63 |
|----------|------------|
| 00001101 | 0 (padding)|

Offsets buffer (int32):
| Bytes 0-19        | Bytes 20-63 |
|-------------------|-------------|
| 0, 3, 3, 7, 7     | unspecified |

Values array (Int8Array):
  Length: 7, Null count: 0
  Values: [12, -7, 25, 0, -127, 127, 50]

Struct Layout

Struct array with schema Struct<name: VarBinary, age: Int32>: [{'joe', 1}, {null, 2}, null, {'mark', 4}]

Length: 4, Null count: 1

Struct validity bitmap:
| Byte 0   | Bytes 1-63 |
|----------|------------|
| 00001011 | 0 (padding)|

Child arrays:
  - name (VarBinary): ['joe', null, 'alice', 'mark']
  - age (Int32): [1, 2, null, 4]

A struct slot is valid only if BOTH the struct validity bitmap AND the child array validity bitmap have the bit set for that position.

Union Layout

Unions support two modes: Dense Union: 5 bytes overhead per value, space-efficient

Type ids buffer (8-bit integers)
Offsets buffer (32-bit integers)
Child arrays contain only values of their type

Sparse Union: Equal-length child arrays, vectorization-friendly

Type ids buffer (8-bit integers)
No offsets buffer
Child arrays all have length equal to the union

Union types do NOT have a validity bitmap. Nullness is determined by the child arrays.

Dictionary-Encoded Layout

Dictionary encoding represents repeated values efficiently using integer indices.

Original data: ['foo', 'bar', 'foo', 'bar', null, 'baz']

Dictionary-encoded:
  Index type: Int32
  Index values: [0, 1, 0, 1, null, 2]
  
Dictionary:
  Type: VarBinary  
  Values: ['foo', 'bar', 'baz']

Run-End Encoded Layout

New in Arrow 1.3 - Efficient for data with long runs of repeated values.

Original: [1.0, 1.0, 1.0, 1.0, null, null, 2.0]

Run-End Encoded:
  Length: 7, Null count: 0
  
  run_ends (Int32): [4, 6, 7]
  values (Float32): [1.0, null, 2.0]
    Validity: [valid, null, valid]

Run ends represent cumulative lengths - the logical index where each run ends.

Buffer Layouts by Type

Layout Type	Buffer 0	Buffer 1	Buffer 2	Variadic Buffers
Primitive	validity	data	-	-
Variable Binary	validity	offsets	data	-
Variable Binary View	validity	views	-	data
List	validity	offsets	-	-
List View	validity	offsets	sizes	-
Fixed-size List	validity	-	-	-
Struct	validity	-	-	-
Sparse Union	type ids	-	-	-
Dense Union	type ids	offsets	-	-
Null	-	-	-	-
Dictionary-encoded	validity	data (indices)	-	-
Run-end encoded	-	-	-	-

Schema and Metadata

Schemas are serialized using Flatbuffers and contain:

Ordered sequence of fields
Endianness (default: Little Endian)
Custom metadata (key-value pairs)
Features used in the stream/file

table Schema {
  endianness: Endianness = Little;
  fields: [Field];
  custom_metadata: [KeyValue];
  features: [Feature];
}

Each Field includes:

Name and type
Nullable flag
Child fields (for nested types)
Dictionary encoding information
Custom metadata

Extension Types

Extension types allow custom semantics on standard Arrow types using field metadata:

ARROW:extension:name - String identifier (use namespaced names)
ARROW:extension:metadata - Serialized reconstruction data

Examples:

uuid as FixedSizeBinary(16)
latitude-longitude as Struct<latitude: double, longitude: double>
tensor as Binary with shape/dtype metadata

Extension names beginning with arrow. are reserved for canonical extensions. Use namespaced names like myorg.type_name for custom types.

Implementation Guidelines

Array Lengths

Array lengths are 64-bit signed integers. Implementations may limit to 32-bit (2³¹ - 1 elements) for cross-language compatibility. Larger datasets should use multiple array chunks.

Null Count

64-bit signed integer representing the number of null slots. May be as large as the array length.

Subset Implementation

Producers: May implement any subset of the spec
Consumers: Should convert unsupported types to supported ones (e.g., timestamp.millis → timestamp.micros)

Extensibility

Engines may use custom internal vectors but must convert to standard Arrow types before serialization.

References

Schema.fbs - Type definitions
Message.fbs - Message format
Flatbuffers - Serialization framework

Introduction

Format Specification

Architecture

Arrow Columnar Memory Format

Overview

Terminology

Data Types

Primitive Types

Binary and String Types

Nested Types

Encoded Types

Physical Memory Layout

Buffer Alignment and Padding

Validity Bitmaps

Layout Examples

Fixed-Size Primitive Layout

Variable-Size Binary Layout

Variable-Size Binary View Layout

List Layout

Struct Layout

Union Layout

Dictionary-Encoded Layout

Run-End Encoded Layout

Buffer Layouts by Type

Schema and Metadata

Extension Types

Implementation Guidelines

Array Lengths

Null Count

Subset Implementation

Extensibility

References

Build docs developers (and LLMs) love

Introduction

Format Specification

Architecture

​Overview

​Terminology

​Data Types

​Primitive Types

​Binary and String Types

​Nested Types

​Encoded Types

​Physical Memory Layout

​Buffer Alignment and Padding

​Validity Bitmaps

​Layout Examples

​Fixed-Size Primitive Layout

​Variable-Size Binary Layout

​Variable-Size Binary View Layout

​List Layout

​Struct Layout

​Union Layout

​Dictionary-Encoded Layout

​Run-End Encoded Layout

​Buffer Layouts by Type

​Schema and Metadata

​Extension Types

​Implementation Guidelines

​Array Lengths

​Null Count

​Subset Implementation

​Extensibility

​References

Build docs developers (and LLMs) love

Overview

Terminology

Data Types

Primitive Types

Binary and String Types

Nested Types

Encoded Types

Physical Memory Layout

Buffer Alignment and Padding

Validity Bitmaps

Layout Examples

Fixed-Size Primitive Layout

Variable-Size Binary Layout

Variable-Size Binary View Layout

List Layout

Struct Layout

Union Layout

Dictionary-Encoded Layout

Run-End Encoded Layout

Buffer Layouts by Type

Schema and Metadata

Extension Types

Implementation Guidelines

Array Lengths

Null Count

Subset Implementation

Extensibility

References