Overview
The Arrow IPC (Inter-Process Communication) format enables zero-copy data exchange between processes and persistence to storage. The protocol serializes Arrow record batches into binary payloads that can be reconstructed without memory copying.
Key Features:
- Zero-copy reconstruction of arrays
- Streaming and random-access file formats
- Dictionary encoding support
- Optional compression (LZ4, ZSTD)
- Language-agnostic using Flatbuffers
The IPC format uses Flatbuffers for metadata serialization and defines binary message structures for efficient data exchange.
Core Concepts
Record Batch
The primitive unit of serialized data. A record batch is an ordered collection of arrays (fields) with:
- Same length across all fields
- Potentially different data types
- A shared schema defining field names and types
Message Types
The IPC protocol uses three message types:
- Schema - Defines the structure of record batches
- RecordBatch - Contains actual data buffers
- DictionaryBatch - Contains dictionary data for encoded fields
Encapsulated Message Format
All IPC messages use a standardized encapsulation format.
Message Components
Continuation Indicator
- 4-byte value `0xFFFFFFFF` indicates a valid message follows
- Introduced in v0.15.0 for Flatbuffers alignment
Metadata Size
- 4-byte little-endian integer
- Includes size of flatbuffer plus padding
- Allows readers to skip to message body
Metadata Flatbuffer
Contains:
- Version number
- Message type (Schema, RecordBatch, DictionaryBatch)
- Body size
- Custom metadata (key-value pairs)
Message Body
- Flat sequence of memory buffers
- Buffers laid out end-to-end, each aligned to 8 bytes
- Optional compression per buffer
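The envelope described above can be sketched with nothing but the standard library. This is an illustrative toy, not a real writer (actual implementations emit the metadata with generated Flatbuffers code); it only shows the prefix arithmetic and the 8-byte padding rule:

```python
import struct

CONTINUATION = 0xFFFFFFFF

def encapsulate(metadata: bytes, body: bytes) -> bytes:
    """Wrap a Flatbuffer metadata blob and a body in the IPC envelope."""
    # Pad the metadata so the body starts on an 8-byte boundary; the
    # 8-byte prefix (continuation + size) is already aligned.
    padded_len = (len(metadata) + 7) & ~7
    padding = b"\x00" * (padded_len - len(metadata))
    prefix = struct.pack("<II", CONTINUATION, padded_len)
    return prefix + metadata + padding + body

def read_prefix(buf: bytes) -> int:
    """Parse the continuation marker and padded metadata size."""
    marker, meta_size = struct.unpack_from("<II", buf, 0)
    assert marker == CONTINUATION, "not a valid IPC message"
    return meta_size

msg = encapsulate(b"\x01\x02\x03", b"BODY")
print(read_prefix(msg))  # 3 bytes of metadata padded to 8 -> prints 8
```

Note that the size field counts the Flatbuffer plus its padding, which is what lets a reader seek directly to the message body.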
Schema Message
The Schema message contains type metadata without data buffers.
Field Structure
Each Field contains:
- name: Field identifier (UTF-8 string)
- nullable: Whether field can contain nulls
- type: Data type definition
- dictionary: Dictionary encoding metadata (if applicable)
- children: Child fields for nested types
- custom_metadata: Application-specific metadata
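The field list above maps naturally onto a small record type. The following dataclass is a sketch mirroring the Field table from Schema.fbs (the `type` member is a plain string stand-in for the real type union):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Field:
    """Sketch of the Field table from Schema.fbs."""
    name: str                           # UTF-8 field identifier
    nullable: bool                      # whether the field may contain nulls
    type: str                           # stand-in for the real type union
    dictionary: Optional[dict] = None   # dictionary-encoding metadata, if any
    children: list = field(default_factory=list)  # child fields for nested types
    custom_metadata: dict = field(default_factory=dict)

# A List<Utf8> field carries one child field describing its values:
tags = Field("tags", True, "list", children=[Field("item", True, "utf8")])
print(tags.children[0].name)  # -> item
```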
RecordBatch Message
RecordBatch messages contain the actual data buffers.
Field and Buffer Flattening
Fields and buffers are flattened using a pre-order depth-first traversal: each parent field is emitted before its children.
FieldNode Structure
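The traversal can be sketched in a few lines of Python; the `(name, children)` tuples here are a stand-in for real Field objects, and the resulting order is exactly the order in which one FieldNode (length + null count) is written per field:

```python
def flatten(field):
    """Pre-order depth-first flattening: emit the parent, then recurse
    into each child in declaration order."""
    order = [field[0]]
    for child in field[1]:
        order.extend(flatten(child))
    return order

# Hypothetical schema for illustration: Struct<a: Int32, b: List<Utf8>>
schema = ("col", [("a", []), ("b", [("item", [])])])
print(flatten(schema))  # -> ['col', 'a', 'b', 'item']
```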
Fields with null_count == 0 may omit their physical validity bitmap by setting the buffer length to 0.
Buffer Structure
Variadic Buffers
New in Arrow format version 1.4 - some types (BinaryView, Utf8View) use a variable number of buffers.
Compression
RecordBatch buffers support optional compression.
Supported Codecs
- LZ4_FRAME: LZ4 frame format (not raw/block format)
- ZSTD: Zstandard compression
Compression Method
The only method currently defined is BUFFER: each buffer in the body is compressed individually.
Compressed Buffer Format
Each compressed buffer in the message body is prefixed with its uncompressed length, stored as a signed 64-bit little-endian integer:
- uncompressed_length == -1: buffer is NOT compressed
- Empty buffers: may be written as 0 bytes (omitting the length header)
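A minimal sketch of parsing that per-buffer prefix, stdlib only. It stops short of invoking an actual codec (LZ4 frame and ZSTD are not in the standard library), but shows the `-1` sentinel and the empty-buffer case:

```python
import struct

def read_compressed_buffer(raw: bytes) -> bytes:
    """Parse one buffer from a compressed RecordBatch body.

    The 8-byte prefix is the uncompressed length as a signed 64-bit
    little-endian integer; -1 means the bytes are stored uncompressed
    (e.g. the writer found compression unprofitable for this buffer).
    """
    if len(raw) == 0:
        return b""  # empty buffers may omit the length header entirely
    (uncompressed_length,) = struct.unpack_from("<q", raw, 0)
    payload = raw[8:]
    if uncompressed_length == -1:
        return payload  # stored as-is, no codec applied
    # A real reader would decompress `payload` with the batch's codec here.
    raise NotImplementedError("decompress with the batch's LZ4/ZSTD codec")

raw = struct.pack("<q", -1) + b"plain bytes"
print(read_compressed_buffer(raw))  # -> b'plain bytes'
```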
IPC Streaming Format
The streaming format serializes record batches as a sequence of messages.
Message Sequence Rules
- Schema message comes first
- Dictionary and RecordBatch messages may be interleaved, but a dictionary must be defined before any RecordBatch uses it
- End-of-stream (EOS) is optional
Edge case: RecordBatches with completely null dictionary-encoded columns may have their dictionary appear after the first batch.
End-of-Stream Signal
Two ways to signal EOS:
- Write 8 bytes: 0xFFFFFFFF 0x00000000 (continuation marker + zero metadata length)
- Close the stream interface
Streaming payloads written to files conventionally use the .arrows extension.
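The explicit EOS marker is just an encapsulated-message prefix with a zero metadata length, which a sketch can build directly:

```python
import struct

# Continuation marker followed by a zero metadata length.
EOS = struct.pack("<II", 0xFFFFFFFF, 0)

def is_eos(prefix: bytes) -> bool:
    """Check whether an 8-byte message prefix signals end-of-stream."""
    return prefix == EOS

print(EOS.hex())    # ffffffff00000000
print(is_eos(EOS))  # True
```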
Reading Streaming Format
IPC File Format
The file format enables random access to record batches.
Footer Structure
Files in this format conventionally use the .arrow or .feather extension. They are sometimes called “Feather V2” - the name derives from an early Arrow proof-of-concept for Python/R data frame storage.
File Format Features
Random Access:
- Footer contains offsets to all record batches
- Any batch can be read independently
- Schema is redundantly stored in the footer
Dictionary Handling:
- Dictionary keys don’t need to be defined before use
- Only one non-delta dictionary per ID (no replacement)
- Delta dictionaries are applied in footer order
Reading File Format
Dictionary Messages
Dictionaries are serialized as record batches with a single field.
Dictionary Encoding Example
Original data: ["A", "B", "C", "B", "D", "C", "E", "A"]
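A pure-Python sketch of the encoding step, splitting that column into the dictionary values and the index array that a DictionaryBatch plus a RecordBatch would carry on the wire:

```python
def dictionary_encode(values):
    """Split a column into (dictionary, indices): each distinct value is
    assigned the next index in order of first appearance."""
    dictionary, indices = [], []
    seen = {}
    for v in values:
        if v not in seen:
            seen[v] = len(dictionary)
            dictionary.append(v)
        indices.append(seen[v])
    return dictionary, indices

data = ["A", "B", "C", "B", "D", "C", "E", "A"]
print(dictionary_encode(data))
# -> (['A', 'B', 'C', 'D', 'E'], [0, 1, 2, 1, 3, 2, 4, 0])
```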
Stream with Delta Dictionary:
Dictionary Replacement
If isDelta = false, the dictionary replaces the existing one:
Endianness
The Arrow format is little-endian by default.
Cross-Platform Exchange
Schema metadata includes an endianness field:
- Primary use case: systems with same endianness
- Initial implementations may error on mismatched endianness
- Future support may include automatic byte swapping
- Reference implementation focuses on little-endian
Custom Metadata
Metadata can be attached at three levels:
- Field level: Field.custom_metadata
- Schema level: Schema.custom_metadata
- Message level: Message.custom_metadata
Metadata Format
Metadata is stored as a list of KeyValue pairs; both keys and values are UTF-8 strings.
Naming Conventions
- Use colon : as namespace separator
- Multiple colons allowed: org:project:feature:key
- ARROW: prefix reserved for internal Arrow use
Examples:
- ARROW:extension:name - Extension type name
- ARROW:extension:metadata - Extension type metadata
- myorg:custom:property - Application metadata
Extension Types
Extension types enable custom semantics on standard Arrow types.
Required Metadata
- ARROW:extension:name - Unique identifier (use namespaced names)
- ARROW:extension:metadata - Optional reconstruction data
Extension Type Examples
UUID
- Base type: FixedSizeBinary(16)
- Metadata: empty
- Name: arrow.uuid
GeoPoint
- Base type: Struct<lat: Float64, lon: Float64>
- Metadata: empty
- Name: myorg.geopoint
Tensor
- Base type: Binary
- Metadata: {"dtype": "float32", "shape": [3, 4]}
- Name: myorg.tensor
TradingTime
- Base type: Timestamp[microsecond]
- Metadata: {"calendar": "NYSE"}
- Name: myorg.trading_time
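Since extension types are carried entirely in Field.custom_metadata, a sketch only needs to populate the two reserved keys (the helper name and `myorg.tensor` are illustrative, not part of any API):

```python
import json

def extension_metadata(name, params=None):
    """Build the Field.custom_metadata entries for an extension type:
    the reserved name key, plus optional serialized reconstruction data."""
    md = {"ARROW:extension:name": name}
    if params is not None:
        md["ARROW:extension:metadata"] = json.dumps(params)
    return md

md = extension_metadata("myorg.tensor", {"dtype": "float32", "shape": [3, 4]})
print(md["ARROW:extension:name"])  # -> myorg.tensor
```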
Extension Type Compatibility
Implementations that don’t support an extension type:
- Interpret as the base Arrow type
- Pass through custom metadata
- Enable graceful degradation
Features and Compatibility
Schemas can declare required features:
- Forward compatibility detection
- Client-server capability negotiation
- Graceful handling of unknown features
Implementation Considerations
Zero-Copy Reconstruction
The IPC format enables zero-copy array reconstruction.
Memory Mapping
File format supports memory mapping:
- Map entire file to memory
- Arrays reference mapped regions
- No explicit reading needed
- Optimal for large datasets
Subset Implementation
Producers:
- May implement any subset of the spec
- Should document supported types
Consumers:
- Should handle missing validity bitmaps
- Should convert unsupported types
- Must validate message structure
Best Practices
Buffer Alignment
- Always align buffers to 8-byte boundaries minimum
- Prefer 64-byte alignment for SIMD optimization
- Pad buffers to alignment size
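The padding rule is a one-line bit trick, shown here for both the 8-byte minimum and the 64-byte SIMD-friendly alignment:

```python
def padded_length(nbytes, alignment=64):
    """Round a buffer length up to the next alignment boundary
    (8 bytes minimum per the spec, 64 preferred for SIMD)."""
    return (nbytes + alignment - 1) & ~(alignment - 1)

print(padded_length(5, 8))     # -> 8
print(padded_length(100, 64))  # -> 128
```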
Compression
- Profile before enabling compression
- Avoid double compression with transport protocols
- Consider leaving some buffers uncompressed
- Use LZ4 for speed, ZSTD for better ratios
Dictionary Encoding
- Use for columns with repeated values
- Consider cardinality vs. overhead
- Define dictionaries early in streams
- Use delta dictionaries for incremental data
File Format
- Use for random access scenarios
- Include schema in footer for convenience
- Consider chunking very large batches
- Validate footer magic bytes
References
- Message.fbs - Message definitions
- File.fbs - File format
- Schema.fbs - Type system