IOTA Data Ingestion
The IOTA Data Ingestion service processes and stores blockchain checkpoint data in various formats and destinations.
Overview
The data ingestion service provides:
- Archival: Long-term checkpoint storage with manifests
- Blob Storage: Checkpoint data uploaded to object stores
- KV Store: Key-value store for checkpoint data (BigTable, DynamoDB, etc.)
- Historical Processing: Backfill and historical data processing
- Progress Tracking: Persistent progress store for reliability
Architecture
The service consists of:
- IndexerExecutor: Main coordinator managing workers and progress
- WorkerPool: Parallel workers for data processing
- Reducers: Transform and reduce checkpoint data
- Progress Store: Tracks processing watermarks
- Remote Store: Source for checkpoint data
Task Types
The ingestion service supports multiple task types configured via YAML.
Archival Task
Stores checkpoint data with manifests for long-term archival:
- remote-url: Object store URL for archived data
- remote-store-options: Key-value pairs for store configuration
- commit-file-size: Number of checkpoints per archive file
- commit-duration-seconds: Maximum duration before committing
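The original example was not preserved here, so the following archival task entry is a sketch only; the field nesting and values are assumptions based on the parameters above:

```yaml
tasks:
  - name: archival            # unique task identifier (illustrative)
    task: archival
    concurrency: 1
    remote-url: s3://my-archive-bucket/archive
    remote-store-options:
      aws-region: us-east-1
    commit-file-size: 1000    # checkpoints per archive file
    commit-duration-seconds: 600
```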
Blob Task
Uploads checkpoint data to blob storage:
- object-store-config: Object store type and credentials
- checkpoint-chunk-size-mb: Size of checkpoint chunks (minimum 5 MB)
- node-rest-api-url: Fullnode REST API for checkpoint data

Supported object stores:
- Amazon S3
- Google Cloud Storage
- Azure Blob Storage
- Local filesystem
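A sketch of a blob task entry using the parameters above; the object-store-config layout and all values are illustrative, not confirmed by this document:

```yaml
tasks:
  - name: blob                # unique task identifier (illustrative)
    task: blob
    concurrency: 4
    object-store-config:
      object-store: S3
      bucket: my-checkpoint-blobs
    checkpoint-chunk-size-mb: 64   # must be at least 5
    node-rest-api-url: https://fullnode.example.com
```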
BigTable KV Task
Stores checkpoint data in Google Cloud BigTable:
- instance-id: BigTable instance identifier
- column-family: Column family name
- timeout-secs: Operation timeout
- emulator-host: Optional emulator for local development
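An illustrative big-table-kv task entry; values and nesting are assumptions, not confirmed defaults:

```yaml
tasks:
  - name: bigtable-kv         # unique task identifier (illustrative)
    task: big-table-kv
    concurrency: 2
    instance-id: my-bigtable-instance
    column-family: checkpoints
    timeout-secs: 30
    emulator-host: 127.0.0.1:8086   # optional, local development only
```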
KV Store Task
Generic key-value store for checkpoint data.
Historical Task
Processes historical checkpoint data.
Configuration File
Create a YAML configuration file to define ingestion tasks.
Running the Service
Command-line Usage
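The exact binary name and flags are not given in this document; a typical invocation might look like:

```shell
# Hypothetical binary and flag names; substitute the actual ones.
iota-data-ingestion --config ingestion.yaml
```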
Configuration Parameters
Global Settings
- path: Local directory containing checkpoint files
- remote-store-url: URL of remote checkpoint source (fullnode REST API)
- remote-store-options: Connection options for remote store
- remote-read-batch-size: Checkpoints to read per batch (default: 100)
- progress-store-path: Directory for progress tracking
- metrics-host: Prometheus metrics host (default: 127.0.0.1)
- metrics-port: Prometheus metrics port (default: 8081)
Per-Task Settings
- name: Unique task identifier
- concurrency: Number of parallel workers
- task: Task type (archival, blob, big-table-kv, kv, historical)
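Combining the global and per-task settings, a configuration skeleton might look like this; all values are illustrative:

```yaml
# Illustrative structure only; keys follow the parameter names above.
remote-store-url: https://fullnode.example.com
remote-read-batch-size: 100
progress-store-path: /var/lib/iota-ingestion/progress
metrics-host: 127.0.0.1
metrics-port: 8081
tasks:
  - name: archival
    task: archival
    concurrency: 1
```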
Progress Tracking
The service maintains a persistent progress store.
Purpose
- Tracks processing watermark for each task
- Enables resumption after restart
- Prevents duplicate processing
Location
Specified by progress-store-path in the configuration:
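For example (the path is illustrative):

```yaml
# Directory where per-task watermarks are persisted.
progress-store-path: /var/lib/iota-ingestion/progress
```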
Behavior
- Each task maintains its own watermark
- Watermark updated after successful processing
- On restart, processing resumes from last watermark
- Epoch transitions handled automatically
Checkpoint Sources
The service can read checkpoints from multiple sources.
Local Files
Read from local .chk files:
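For example, pointing the service at a local checkpoint directory (the path is illustrative):

```yaml
path: /var/lib/iota/checkpoints
```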
Remote Store (Fullnode REST API)
Fetch checkpoints from the fullnode (remote-store-url).
Object Store
Read from S3, GCS, or Azure.
Monitoring
Prometheus metrics are exposed on the configured port (default: 8081).
Key Metrics
- ingestion_checkpoint_processed: Checkpoints processed by task
- ingestion_checkpoint_errors: Processing errors by task
- ingestion_watermark: Current watermark for each task
- ingestion_worker_active: Active workers per task
- ingestion_batch_duration: Processing time per batch
Access Metrics
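Assuming the default host and port, metrics can be scraped over HTTP; the /metrics path is the usual Prometheus convention and is assumed here:

```shell
curl http://127.0.0.1:8081/metrics
```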
Blob Worker Details
Chunking Strategy
Checkpoints are grouped into chunks:
- Minimum chunk size: 5 MB
- Configurable size: checkpoint-chunk-size-mb
- Multipart upload: For large chunks
- Concurrent parts: Up to 50 parts uploaded in parallel
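The grouping rule above can be sketched as follows; `chunk_checkpoints` and its inputs are illustrative, not the service's actual (Rust) API:

```python
# Sketch of the chunking strategy described above: checkpoints are
# accumulated until the chunk reaches the minimum size, then flushed.

MIN_CHUNK_BYTES = 5 * 1024 * 1024  # minimum chunk size: 5 MB

def chunk_checkpoints(sizes, min_bytes=MIN_CHUNK_BYTES):
    """Group checkpoint byte sizes into chunks of at least `min_bytes`;
    the trailing chunk may be smaller."""
    chunks, current, total = [], [], 0
    for index, size in enumerate(sizes):
        current.append(index)
        total += size
        if total >= min_bytes:
            chunks.append(current)   # flush: chunk reached the threshold
            current, total = [], 0
    if current:
        chunks.append(current)       # trailing partial chunk
    return chunks
```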
File Naming
Checkpoints are stored as:
Epoch Transitions
On epoch change:
- Detects transition via REST API
- Updates watermark to new epoch’s first checkpoint
- Resets remote store for previous epoch range
- Continues processing from new epoch
Archival Worker Details
Manifest Management
The archival worker maintains a MANIFEST file:
- Lists all archived checkpoint files
- Tracks checkpoint ranges per file
- Updated after each commit
- Enables efficient checkpoint discovery
File Format
Two files per range:
- Checkpoint file (.chk): Checkpoint contents
  - Magic bytes: CHECKPOINT_FILE_MAGIC
  - Compressed checkpoint data
- Summary file (.sum): Checkpoint summaries
  - Magic bytes: SUMMARY_FILE_MAGIC
  - Compressed summary data
Directory Structure
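The original directory listing is not preserved here; one plausible layout, given the MANIFEST and per-range .chk/.sum files described above (file names and numbering are illustrative):

```text
<archive-root>/
├── MANIFEST      # index of archived files and their checkpoint ranges
├── 0.chk         # checkpoint contents for a range
├── 0.sum         # summaries for the same range
├── 1.chk
└── 1.sum
```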
Historical Worker Details
Processes historical data with a reducer.
Use Cases
- Backfilling missing data
- Reprocessing with new logic
- Data migration
- Analytics pipeline
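Conceptually, the worker/reducer split works as sketched below; the function names and record shapes are illustrative, not the service's actual Rust API:

```python
# Sketch of the reducer pattern: a worker transforms each checkpoint
# into an intermediate record, and a reducer folds the records into
# one aggregate per batch.

def process_checkpoint(checkpoint):
    # Worker step: transform one checkpoint (here, count transactions).
    return {"sequence": checkpoint["sequence"],
            "tx_count": len(checkpoint["transactions"])}

def reduce_batch(records):
    # Reducer step: fold the intermediate records into one batch result.
    return {"checkpoints": len(records),
            "total_txs": sum(r["tx_count"] for r in records)}

batch = [process_checkpoint(c) for c in [
    {"sequence": 1, "transactions": ["a", "b"]},
    {"sequence": 2, "transactions": ["c"]},
]]
```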
Signal Handling
The service handles graceful shutdown.
Supported Signals
- SIGTERM: Graceful shutdown
- SIGINT (Ctrl+C): Graceful shutdown
Shutdown Behavior
- Stop accepting new checkpoints
- Complete in-flight processing
- Update progress store
- Exit cleanly
Environment Variables
BigTable Configuration
- BIGTABLE_EMULATOR_HOST: BigTable emulator address (for testing)
AWS Configuration
- AWS_REGION: AWS region
- AWS_ACCESS_KEY_ID: AWS access key
- AWS_SECRET_ACCESS_KEY: AWS secret key
- AWS_SESSION_TOKEN: AWS session token (optional)
GCP Configuration
- GOOGLE_APPLICATION_CREDENTIALS: Path to service account JSON
Error Handling
Retry Logic
Built-in retry for transient failures:
- Network timeouts
- Rate limiting
- Temporary unavailability
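As an illustration of the retry behavior, a simple exponential-backoff loop might look like this; it is a sketch only, not the service's actual retry code:

```python
import time

# Retry an operation on the transient failures listed above, backing
# off exponentially between attempts.

def retry(op, attempts=3, base_delay=0.01, sleep=time.sleep):
    for attempt in range(attempts):
        try:
            return op()
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise                         # out of retries: surface it
            sleep(base_delay * 2 ** attempt)  # back off exponentially

# A failing operation that succeeds on the third call.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient")
    return "ok"
```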
Fatal Errors
Service exits on:
- Invalid configuration
- Missing credentials
- Unrecoverable storage errors
Recovery
On restart:
- Reads progress store
- Resumes from last watermark
- Reprocesses failed checkpoints (if any)
Best Practices
- Concurrency: Tune based on available resources
- Progress Store: Use reliable storage (not tmpfs)
- Monitoring: Set up alerts on metrics
- Credentials: Use secure credential management
- Testing: Test with BigTable emulator for development
- Backups: Backup progress store periodically
- Chunk Size: Balance between efficiency and memory usage