Overview
The FileIO interface is Iceberg’s abstraction for reading and writing data and metadata files. It provides a flexible way to integrate with any storage system while maintaining Iceberg’s guarantees around metadata operations and table commits.
Why FileIO?
Iceberg’s design separates concerns:
- Metadata Operations: Planning and committing changes (uses FileIO)
- Data Operations: Reading and writing table data (uses processing engine + FileIO)
- Physical Layout: Absolute paths allow flexibility in file organization
Key Benefits
- No File Renaming: Iceberg never renames files, simplifying storage requirements
- Absolute Paths: Metadata tracks full file paths, enabling flexible layouts
- Minimal Requirements: Only need read, write, delete, and seek operations
- Custom Storage: Support any storage backend with a FileIO implementation
FileIO Interface
The core interface is simple:
Built-in Implementations
Iceberg provides FileIO implementations for common storage systems:
| Implementation | Storage Type | Module |
|---|---|---|
| S3FileIO | Amazon S3 | iceberg-aws |
| GCSFileIO | Google Cloud Storage | iceberg-gcp |
| ADLSFileIO | Azure Data Lake Storage | iceberg-azure |
| OSSFileIO | Alibaba Cloud OSS | iceberg-aliyun |
| HadoopFileIO | Any Hadoop FileSystem | iceberg-core |
| ResolvingFileIO | Multiple storage types | iceberg-core |
Implementing Custom FileIO
Basic Implementation
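A basic implementation could look roughly like the following in-memory sketch. To stay self-contained it declares simplified stand-ins (`SketchFileIO`, `SketchInputFile`, `SketchOutputFile`) instead of the real `org.apache.iceberg.io` interfaces, which have more methods and use seekable, position-tracking stream types. `InMemoryFileIO` and its byte-array backend are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-ins for Iceberg's InputFile, OutputFile and FileIO (assumption:
// only the methods needed for this sketch are declared).
interface SketchInputFile {
  String location();
  long getLength();
  boolean exists();
  InputStream newStream();
}

interface SketchOutputFile {
  String location();
  OutputStream createOrOverwrite();
}

interface SketchFileIO extends Closeable {
  SketchInputFile newInputFile(String path);
  SketchOutputFile newOutputFile(String path);
  void deleteFile(String path);
  default void initialize(Map<String, String> properties) {}
  @Override default void close() {}
}

// Hypothetical backend: every absolute path maps to an immutable byte array.
class InMemoryFileIO implements SketchFileIO {
  private final Map<String, byte[]> files = new ConcurrentHashMap<>();

  @Override
  public SketchInputFile newInputFile(String path) {
    return new SketchInputFile() {
      public String location() { return path; }
      public long getLength() { return bytes().length; }
      public boolean exists() { return files.containsKey(path); }
      public InputStream newStream() { return new ByteArrayInputStream(bytes()); }

      private byte[] bytes() {
        byte[] data = files.get(path);
        if (data == null) throw new IllegalStateException("No such file: " + path);
        return data;
      }
    };
  }

  @Override
  public SketchOutputFile newOutputFile(String path) {
    return new SketchOutputFile() {
      public String location() { return path; }
      public OutputStream createOrOverwrite() {
        // The file becomes visible only once the stream is closed.
        return new ByteArrayOutputStream() {
          @Override public void close() { files.put(path, toByteArray()); }
        };
      }
    };
  }

  @Override
  public void deleteFile(String path) { files.remove(path); }
}
```

Note that nothing here renames a file: a write publishes its contents at the final path when the stream closes, which matches the storage requirements listed earlier.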
Implementing InputFile
Implementing OutputFile
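An output file ultimately hands back a stream that can report its write position. The sketch below shows one way to track that position by wrapping an arbitrary `OutputStream`. `SketchPositionOutputStream` is a stand-in for Iceberg's `PositionOutputStream`, and `CountingOutputStream` is a hypothetical name.

```java
import java.io.IOException;
import java.io.OutputStream;

// Stand-in for Iceberg's PositionOutputStream: an OutputStream that can report
// how many bytes have been written so far.
abstract class SketchPositionOutputStream extends OutputStream {
  abstract long getPos() throws IOException;
}

// Wraps any OutputStream and counts bytes as they pass through.
class CountingOutputStream extends SketchPositionOutputStream {
  private final OutputStream delegate;
  private long pos = 0;

  CountingOutputStream(OutputStream delegate) {
    this.delegate = delegate;
  }

  @Override
  long getPos() {
    return pos;
  }

  @Override
  public void write(int b) throws IOException {
    delegate.write(b);
    pos += 1;
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    delegate.write(b, off, len);
    pos += len;
  }

  @Override
  public void flush() throws IOException {
    delegate.flush();
  }

  @Override
  public void close() throws IOException {
    delegate.close();
  }
}
```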
Implementing Seekable Input
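Seek support is the one requirement that plain `InputStream` does not cover. As a minimal sketch, the class below adds `getPos` and `seek` over a byte array. `SketchSeekableInputStream` is a stand-in for Iceberg's `SeekableInputStream`, and a real backend would translate `seek` into a ranged read or a reopened connection rather than an index update.

```java
import java.io.IOException;
import java.io.InputStream;

// Stand-in for Iceberg's SeekableInputStream: InputStream plus getPos and seek.
abstract class SketchSeekableInputStream extends InputStream {
  abstract long getPos() throws IOException;
  abstract void seek(long newPos) throws IOException;
}

// Hypothetical implementation over an in-memory byte array.
class ByteArraySeekableStream extends SketchSeekableInputStream {
  private final byte[] data;
  private int pos = 0;

  ByteArraySeekableStream(byte[] data) {
    this.data = data;
  }

  @Override
  long getPos() {
    return pos;
  }

  @Override
  void seek(long newPos) throws IOException {
    if (newPos < 0 || newPos > data.length) {
      throw new IOException("Seek out of range: " + newPos);
    }
    pos = (int) newPos;
  }

  @Override
  public int read() {
    // Mask to return an unsigned byte value, or -1 at end of data.
    return pos < data.length ? (data[pos++] & 0xFF) : -1;
  }

  @Override
  public int read(byte[] b, int off, int len) {
    if (len == 0) return 0;
    if (pos >= data.length) return -1;
    int n = Math.min(len, data.length - pos);
    System.arraycopy(data, pos, b, off, n);
    pos += n;
    return n;
  }
}
```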
Configuration
Loading via Catalog Property
Loading via Java API
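Iceberg's own entry point for this is `CatalogUtil.loadFileIO`, which instantiates the class named by the `io-impl` catalog property through a no-arg constructor and then calls `initialize` with the catalog properties. The sketch below only mimics that behavior with plain reflection so it can run without Iceberg on the classpath. `LoadableFileIO`, `MyFileIO`, and `FileIOLoader` are hypothetical names, not Iceberg classes.

```java
import java.util.Map;

// Stand-in for the part of the FileIO contract that matters for loading:
// a no-arg constructor followed by initialize(properties).
interface LoadableFileIO {
  void initialize(Map<String, String> properties);
}

// Hypothetical custom implementation that just remembers its properties.
class MyFileIO implements LoadableFileIO {
  Map<String, String> properties = Map.of();

  public MyFileIO() {}

  @Override
  public void initialize(Map<String, String> properties) {
    this.properties = Map.copyOf(properties);
  }
}

final class FileIOLoader {
  // Instantiates the class named by the "io-impl" property and initializes it.
  static LoadableFileIO load(Map<String, String> catalogProperties) {
    String impl = catalogProperties.get("io-impl");
    if (impl == null) {
      throw new IllegalArgumentException("Missing catalog property: io-impl");
    }
    try {
      Object instance = Class.forName(impl).getDeclaredConstructor().newInstance();
      LoadableFileIO io = (LoadableFileIO) instance;
      io.initialize(catalogProperties);
      return io;
    } catch (ReflectiveOperationException e) {
      throw new IllegalArgumentException("Cannot load FileIO implementation: " + impl, e);
    }
  }
}
```

The practical consequence for a custom FileIO is that it needs a public no-arg constructor and should do its real setup in `initialize`, not in the constructor.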
Advanced Features
Hadoop Configuration Access
If your FileIO needs Hadoop configuration:
Bulk Delete Operations
Optimize deletes with bulk operations:
Prefix Operations
Implement efficient prefix listing:
Testing Your FileIO
Unit Tests
Performance Considerations
Minimize Metadata Calls
Cache file metadata (size, existence) to reduce storage API calls:
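One possible shape for such a cache, as a sketch: lengths are looked up once per path and served from memory afterwards. `LengthCache` and the `backendLookup` function (standing in for a single storage API call such as a HEAD request) are hypothetical. Caching length is reasonable because Iceberg files are not modified after they are written; existence can change on delete, hence the `invalidate` method.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToLongFunction;

// Caches file lengths so repeated lookups reach the backend once per path.
class LengthCache {
  private final Map<String, Long> lengths = new ConcurrentHashMap<>();
  // Hypothetical: performs one storage API call and returns the file length.
  private final ToLongFunction<String> backendLookup;

  LengthCache(ToLongFunction<String> backendLookup) {
    this.backendLookup = backendLookup;
  }

  long length(String path) {
    return lengths.computeIfAbsent(path, p -> backendLookup.applyAsLong(p));
  }

  // Call after deleting a file so a stale entry is not served.
  void invalidate(String path) {
    lengths.remove(path);
  }
}
```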
Use Connection Pooling
Reuse HTTP connections for better performance:
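A minimal sketch of the idea, assuming the backend is reached over HTTP with the JDK's `java.net.http.HttpClient`, which keeps connections alive for reuse as long as the same client instance is used. Because a FileIO is serializable, the client is held in a `transient` field and built lazily on first use. `PooledClientHolder` and the 10-second timeout are illustrative choices, not part of Iceberg.

```java
import java.io.Serializable;
import java.net.http.HttpClient;
import java.time.Duration;

// Holds one lazily created HttpClient so all requests share its connections.
class PooledClientHolder implements Serializable {
  // Not serialized; rebuilt on first use after deserialization.
  private transient volatile HttpClient client;

  HttpClient client() {
    HttpClient result = client;
    if (result == null) {
      synchronized (this) {
        result = client;
        if (result == null) {
          result = HttpClient.newBuilder()
              .connectTimeout(Duration.ofSeconds(10))
              .build();
          client = result;
        }
      }
    }
    return result;
  }
}
```

Creating a new client per request would defeat the purpose, since each client maintains its own set of connections.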
Implement Bulk Operations
Batch deletes and list operations when possible to reduce API calls.
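Batching deletes can be sketched as follows. The `deleteBatch` callback stands in for one multi-object delete request against the backend, and both `BatchDeleter` and the batch size are hypothetical; real services impose their own per-request limits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class BatchDeleter {
  // Groups paths into batches and issues one backend call per batch.
  // Returns the number of backend calls made.
  static int deleteInBatches(
      Iterable<String> paths, int batchSize, Consumer<List<String>> deleteBatch) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    List<String> batch = new ArrayList<>(batchSize);
    int calls = 0;
    for (String path : paths) {
      batch.add(path);
      if (batch.size() == batchSize) {
        deleteBatch.accept(List.copyOf(batch));
        batch.clear();
        calls++;
      }
    }
    if (!batch.isEmpty()) {
      // Flush the final, partially filled batch.
      deleteBatch.accept(List.copyOf(batch));
      calls++;
    }
    return calls;
  }
}
```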
Buffer Streams
Use buffered streams for better throughput:
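As a sketch of why buffering helps, the example below writes one byte at a time through a `BufferedOutputStream` and counts how many write calls actually reach the underlying stream. `CountingSink` is a hypothetical stand-in for a raw storage stream where every call is expensive.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical raw stream that counts how many write calls reach the backend.
class CountingSink extends OutputStream {
  int writeCalls = 0;
  long bytes = 0;

  @Override
  public void write(int b) {
    writeCalls++;
    bytes++;
  }

  @Override
  public void write(byte[] b, int off, int len) {
    writeCalls++;
    bytes += len;
  }
}

final class BufferingDemo {
  // Writes totalBytes single bytes through a buffer and reports backend write calls.
  static int writeCallsFor(int totalBytes, int bufferSize) throws IOException {
    CountingSink sink = new CountingSink();
    try (OutputStream out = new BufferedOutputStream(sink, bufferSize)) {
      for (int i = 0; i < totalBytes; i++) {
        out.write(i);
      }
    }
    return sink.writeCalls;
  }
}
```

Without the buffer, every single-byte write would be a separate call to the sink; with it, the sink sees roughly one call per filled buffer.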
Best Practices
- Thread Safety: Ensure FileIO instances are thread-safe or document thread safety requirements
- Resource Cleanup: Always close streams and clients in the close() method
- Error Handling: Wrap storage exceptions in Iceberg exceptions (RuntimeIOException)
- Retry Logic: Implement retries for transient failures
- Metrics: Add instrumentation for monitoring (optional)
- Documentation: Document custom properties and configuration requirements
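The error-handling and retry items above can be combined in a small helper, sketched here under a few assumptions: every `IOException` is treated as transient, the delay doubles after each failed attempt, and `java.io.UncheckedIOException` stands in for Iceberg's `RuntimeIOException` so the example has no external dependencies. `Retry` and `IoCall` are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

final class Retry {
  // A storage call that may fail with a checked IOException.
  interface IoCall<T> {
    T call() throws IOException;
  }

  // Retries the call with exponential backoff, then wraps the last failure
  // in an unchecked exception once all attempts are used up.
  static <T> T withRetries(IoCall<T> call, int maxAttempts, long initialDelayMs) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    long delay = initialDelayMs;
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return call.call();
      } catch (IOException e) {
        last = e;
        if (attempt == maxAttempts) {
          break;
        }
        try {
          Thread.sleep(delay);
        } catch (InterruptedException ie) {
          // Restore the interrupt flag and stop retrying.
          Thread.currentThread().interrupt();
          throw new UncheckedIOException(new IOException("Interrupted during retry", ie));
        }
        delay *= 2;
      }
    }
    throw new UncheckedIOException("Failed after " + maxAttempts + " attempts", last);
  }
}
```

A production version would usually distinguish retryable errors (timeouts, throttling) from permanent ones (missing file, access denied) instead of retrying every `IOException`.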
Common Use Cases
Cloud Storage Integration
Implement FileIO for cloud storage not natively supported:
- Oracle Cloud Infrastructure (OCI) Object Storage
- Cloudflare R2
- Wasabi
- MinIO
- Ceph RADOS Gateway
On-Premises Storage
Integrate with enterprise storage systems:
- NetApp StorageGRID
- Pure Storage FlashBlade
- IBM Cloud Object Storage
- Scality RING
Custom Protocols
Support custom URI schemes:
Debugging
Enable debug logging:
Next Steps
- AWS S3 FileIO: See the production FileIO implementation for S3
- Custom Catalog: Build a custom catalog with your FileIO