Overview
Arrow Flight is an RPC framework for high-performance data services based on Arrow data. Flight is built on top of:- gRPC - Modern RPC framework
- Arrow IPC format - Zero-copy data serialization
- Protocol Buffers - Service definitions
Flight implementations include optimizations to avoid Protocol Buffer overhead, primarily by minimizing memory copies during data transfer.
Key Concepts
FlightDescriptor
Identifies a data stream using either: Path-based descriptor:FlightInfo
Metadata about a dataset:FlightEndpoint
Location of a data partition:Ticket
Opaque binary token identifying specific data
Location
URI specifying server address and transport
RPC Methods
Flight defines a core set of RPC methods:Metadata Methods
ListFlights
ListFlights
- Optional criteria for filtering
- Returns stream of FlightInfo messages
- Used for dataset discovery
GetFlightInfo
GetFlightInfo
- Blocks until query completes
- Returns endpoints and schema
- May generate new data streams
PollFlightInfo
PollFlightInfo
- Non-blocking first response
- Long-polling for updates
- Incremental result availability
- Progress reporting
GetSchema
GetSchema
- Returns IPC-encoded schema
- Lighter than GetFlightInfo
- May generate new streams
Data Transfer Methods
DoGet
DoGet
- Ticket from FlightEndpoint
- Returns Arrow record batches
- Server-to-client streaming
DoPut
DoPut
- Client-to-server streaming
- Bidirectional messaging
- Server can send acknowledgments
DoExchange
DoExchange
- Simultaneous upload and download
- Stateful computations
- Transform operations
Action Methods
DoAction
DoAction
- Application-specific operations
- Opaque request/response
- Example:
CancelFlightInfo,RenewFlightEndpoint
ListActions
ListActions
- Returns action names and descriptions
- Enables capability discovery
Authentication
Request Patterns
Downloading Data
Standard workflow for retrieving data: Steps:- Acquire FlightDescriptor (from discovery or known descriptor)
- Call GetFlightInfo to get data locations
- Connect to endpoints and call DoGet with ticket
- Consume record batches from each endpoint
Clients may consume endpoints in parallel or distribute them across multiple machines for horizontal scaling.
Ordered vs. Unordered Data
Ordered Data (FlightInfo.ordered = true):
FlightInfo.ordered = false):
Long-Running Queries
UsePollFlightInfo for queries with long execution times:
PollInfo Structure:
- First response should be immediate
- Subsequent calls block until new results (long polling)
- Only append to endpoints (no removal/reordering)
- Set
progressif available (not required to be monotonic)
- Can fetch partial results before query completes
- Should use descriptor from PollInfo for next poll
- Can set short timeout to avoid blocking
- Query complete when
flight_descriptoris unset
Uploading Data
Use Cases:- Data ingestion
- Resumable writes (via PutResult messages)
- Bulk uploads
Data Exchange
DoExchange enables stateful, bidirectional operations:
Location URIs
Flight supports multiple transports via URI schemes:| Transport | URI Scheme | Example |
|---|---|---|
| gRPC (plaintext) | grpc: or grpc+tcp: | grpc://hostname:8080 |
| gRPC (TLS) | grpc+tls: | grpc+tls://hostname:8443 |
| gRPC (Unix socket) | grpc+unix: | grpc+unix:///tmp/flight.sock |
| Connection reuse | arrow-flight-reuse-connection: | arrow-flight-reuse-connection://? |
| HTTP/HTTPS | http: or https: | https://hostname/data.arrow |
Connection Reuse
Special URI for servers to indicate reuse of existing connection:- Server doesn’t need to know its public address
- Useful for port forwarding scenarios
- Simplifies deployment configurations
The URI must be exactly
arrow-flight-reuse-connection://? with trailing /? for compatibility across Java, C++, Go, and Python URI parsers.Extended Location URIs
Servers can return HTTP(S) URLs for direct data access:- Presigned S3/GCS URLs
- CDN-hosted data
- HTTP file servers
- Cached Parquet files
- Perform HTTP GET request
- Assume Arrow IPC format unless
Content-Typeindicates otherwise - Support for
Acceptheader negotiation (optional)
application/octet-stream- Assume Arrow IPCapplication/vnd.apache.arrow.stream- Arrow IPC streamapplication/vnd.apache.arrow.file- Arrow IPC file- Other types (e.g., Parquet) per server capabilities
Authentication
Flight supports multiple authentication patterns:Handshake Authentication
Implementation Options:-
Stateful “login” pattern:
- Establish trust during handshake
- Don’t validate token on each call
- ⚠️ Not secure with Layer 7 load balancers
-
Stateless token validation:
- Handshake may be skipped
- Include externally acquired token (e.g., OAuth bearer)
- Validate on every call
Header-Based Authentication
Custom middleware validates headers:Mutual TLS (mTLS)
Client certificate authentication:- TLS must be enabled
- Certificate provisioning and distribution
- May not be available in all implementations
gRPC Authentication
Flight implementations may expose underlying gRPC authentication:- OAuth2
- JWT tokens
- Custom authenticators
- Channel credentials
Error Handling
Flight defines standard error codes:| Error Code | Description |
|---|---|
UNKNOWN | Unknown error (default) |
INTERNAL | Internal service error |
INVALID_ARGUMENT | Client passed invalid argument |
TIMED_OUT | Operation exceeded timeout |
NOT_FOUND | Requested resource not found |
ALREADY_EXISTS | Resource already exists |
CANCELLED | Operation cancelled (client or server) |
UNAUTHENTICATED | Client not authenticated |
UNAUTHORIZED | Client lacks permissions |
UNIMPLEMENTED | RPC method not implemented |
UNAVAILABLE | Server unavailable (connectivity issues) |
Standard Actions
Flight defines standard actions for common operations:CancelFlightInfo
RenewFlightEndpoint
Performance Considerations
Zero-Copy Transfers
Flight optimizes Protocol Buffer usage:- Metadata in Protobuf
- Data in Arrow IPC format
- Minimal copies during serialization/deserialization
Parallel Endpoint Consumption
Compression
Use Arrow IPC compression for data transfer:- LZ4 for low latency
- ZSTD for better compression ratios
- Consider network bandwidth vs. CPU trade-off
Batching
Optimize batch size:- Too small: overhead dominates
- Too large: memory pressure
- Sweet spot: typically 10K-100K rows
Protocol Buffer Definitions
Full service definition fromFlight.proto:
Implementation Examples
Server Example
Client Example
Best Practices
Service Design
Service Design
- Define clear descriptor semantics
- Document available actions
- Implement meaningful error messages
- Use appropriate authentication method
- Version your service interface
Performance
Performance
- Optimize batch sizes for your use case
- Use compression judiciously
- Exploit parallel endpoint consumption
- Consider data locality in endpoints
- Profile and measure bottlenecks
Reliability
Reliability
- Implement timeout handling
- Use PollFlightInfo for long queries
- Set appropriate expiration times
- Support cancellation
- Handle network failures gracefully
Security
Security
- Always use TLS in production
- Validate tokens on every call
- Implement proper authorization
- Audit access to sensitive data
- Use presigned URLs for object storage