Overview
Hive provides robust crash recovery mechanisms to ensure agents can recover from failures automatically:
- Automatic Resurrection: Retry failed executions up to N times
- Checkpoints: Incremental state snapshots for rollback
- Session Persistence: Atomic state writes survive crashes
- Fatal Error Detection: Skip resurrection for non-recoverable errors
Automatic Resurrection
How It Works
When an execution fails with a non-fatal error, Hive automatically restarts it from the failed node:
1. Execution fails at node process-data
2. Check whether the error is fatal (credentials, imports, etc.)
3. If non-fatal and resurrections < max, restart from process-data
4. Preserve memory and conversation state
5. Continue execution after a 2-second cooldown
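In outline, the loop resembles the following sketch. The function and pattern names here are illustrative stand-ins, not Hive's internal API:

```python
import time

# Illustrative fatal-error substrings; Hive's real detection is not shown here.
FATAL_PATTERNS = ("credentials", "authentication", "permission denied")

def is_fatal(error: Exception) -> bool:
    """Classify an error as non-recoverable based on its message."""
    msg = str(error).lower()
    return any(pattern in msg for pattern in FATAL_PATTERNS)

def run_with_resurrection(run_node, max_resurrections=3, cooldown=2.0):
    """Retry a failing node up to max_resurrections times, skipping fatal errors."""
    attempts = 0
    while True:
        try:
            return run_node()
        except Exception as err:
            if is_fatal(err) or attempts >= max_resurrections:
                raise  # fatal, or resurrection budget exhausted
            attempts += 1
            time.sleep(cooldown)  # cooldown between attempts
```

The cooldown prevents a tight retry loop from hammering a briefly unavailable dependency.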
Fatal vs Non-Fatal Errors
Hive distinguishes between recoverable and non-recoverable failures.

Fatal Errors (no resurrection):
- Missing API credentials
- Invalid authentication tokens
- Import errors (missing dependencies)
- Permission denied (filesystem, API)
- Configuration errors

Non-Fatal Errors (resurrection attempted):
- Transient network failures
- Rate limit errors
- Temporary API unavailability
- Malformed data (can be retried with a different approach)
- Tool execution errors
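A minimal sketch of how such a classifier could look, using Python exception types as stand-ins for the categories above (Hive's actual detection logic is not shown here):

```python
# Hypothetical classifier mirroring the lists above.
FATAL_TYPES = (ImportError, PermissionError)       # no resurrection
RETRYABLE_TYPES = (ConnectionError, TimeoutError)  # resurrection eligible

def classify(error: Exception) -> str:
    """Return 'fatal' for non-recoverable errors, else 'retryable'."""
    if isinstance(error, FATAL_TYPES):
        return "fatal"
    if isinstance(error, RETRYABLE_TYPES):
        return "retryable"
    # Unknown errors default to retryable so transient tool failures recover.
    return "retryable"
```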
Resurrection Example
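The following is a self-contained sketch of resurrection behavior using a stand-in loop rather than Hive's real API: a node that fails twice with a transient error succeeds on the third attempt.

```python
import time

def resurrect(node, max_resurrections=3, cooldown=2.0):
    # Generic stand-in for Hive's resurrection loop (illustrative only).
    for attempt in range(max_resurrections + 1):
        try:
            return node()
        except ConnectionError:
            if attempt == max_resurrections:
                raise  # budget exhausted
            time.sleep(cooldown)

failures = {"left": 2}

def process_data():
    """Simulated node: fails twice with a transient error, then succeeds."""
    if failures["left"] > 0:
        failures["left"] -= 1
        raise ConnectionError("transient network failure")
    return {"rows": 42}

result = resurrect(process_data, cooldown=0.0)  # cooldown=0 for the demo
```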
Disabling Resurrection
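Setting max_resurrections to 0 disables retries. A sketch, assuming a hypothetical configuration dict rather than Hive's documented interface:

```python
# Hypothetical configuration; only max_resurrections appears in this document.
config = {"max_resurrections": 0}  # 0 disables resurrection entirely

def should_resurrect(attempts: int, config: dict) -> bool:
    """True while the resurrection budget is not exhausted."""
    return attempts < config["max_resurrections"]
```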
Checkpoint System
Checkpoint Configuration
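A hypothetical sketch of checkpoint settings; the key names are assumptions for illustration, not Hive's documented options:

```python
# Illustrative checkpoint settings (names are assumptions).
checkpoint_config = {
    "enabled": True,
    "max_checkpoints": 10,     # older snapshots pruned beyond this count
    "on_node_complete": True,  # snapshot after each node finishes
}
```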
Checkpoint behavior is configurable per agent.

Checkpoint Storage
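One plausible per-session layout, sketched with hypothetical paths (Hive's actual storage location may differ):

```python
from pathlib import Path

# Illustrative layout only; not Hive's documented storage paths.
def checkpoint_dir(root: Path, session_id: str) -> Path:
    """Each session keeps its own checkpoint directory."""
    return root / "sessions" / session_id / "checkpoints"

def checkpoint_path(root: Path, session_id: str, seq: int) -> Path:
    """Checkpoints are numbered so the newest is easy to find."""
    return checkpoint_dir(root, session_id) / f"{seq:06d}.json"
```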
Checkpoints are stored per-session.

Manual Checkpointing
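A minimal in-memory sketch of manual checkpointing and rollback (illustrative, not Hive's checkpoint API):

```python
import copy

# Hypothetical helper showing the snapshot/rollback idea.
class CheckpointStore:
    def __init__(self):
        self._snapshots = []

    def checkpoint(self, state: dict) -> int:
        """Snapshot a deep copy of state; return the checkpoint index."""
        self._snapshots.append(copy.deepcopy(state))
        return len(self._snapshots) - 1

    def rollback(self, index: int) -> dict:
        """Return a copy of the state as it was at checkpoint `index`."""
        return copy.deepcopy(self._snapshots[index])
```

Deep copies matter here: without them, later mutations to the live state would silently corrupt earlier snapshots.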
Checkpoints can also be created programmatically.

Checkpoint Pruning
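A sketch of a simple keep-the-newest-N pruning policy (the limit of 10 is an assumption, not Hive's documented default):

```python
def prune(checkpoints: list, keep: int = 10) -> list:
    """Drop all but the newest `keep` checkpoints (oldest-first input).

    The keep > 0 guard matters: list[-0:] returns the whole list.
    """
    return checkpoints[-keep:] if keep > 0 else []
```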
Old checkpoints are pruned automatically.

Session State Persistence
Atomic State Writes
State is written atomically to survive crashes:
- No partial writes from crashes
- State is always consistent
- Can resume from last good state
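The standard way to get these guarantees is to write to a temporary file, flush it to disk, and atomically rename it into place; a sketch (helper name and JSON format are assumptions):

```python
import json
import os
import tempfile

def write_state_atomically(path: str, state: dict) -> None:
    """Write state so a crash mid-write leaves the previous file untouched."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())   # make sure bytes reach the disk
        os.replace(tmp, path)      # atomic rename on POSIX and Windows
    except BaseException:
        os.unlink(tmp)             # clean up the partial temp file
        raise
```

Readers only ever see the old complete file or the new complete file, never a half-written one.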
State Write Points
State is written at critical points:
- Execution start: Initial state with input data
- Execution pause: When waiting for user input
- Execution complete: Final state with output
- Execution failure: State with error information
- Resurrection: State before each retry attempt
Resume After Crash
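A sketch of the resume pattern, assuming state was persisted as JSON by an atomic write (hypothetical helper, not Hive's API):

```python
import json
import os

def load_or_init(path: str, initial: dict) -> dict:
    """Resume from the last persisted state, or start fresh if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)  # last good state from an atomic write
    return dict(initial)         # no crash artifact: fresh run
```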
If an agent crashes mid-execution, it can resume from the last saved state.

JSONL Logging for Crash Resilience
L2 and L3 logs use JSONL format for incremental persistence:
- Data on disk immediately after each node/tool completes
- No data loss even if agent crashes before end_run()
- Corrupt line handling: Skips partial writes from crashes
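The corrupt-line handling can be sketched as a reader that skips unparseable lines (for example, a partial final line left by a crash mid-write):

```python
import json

def read_jsonl(text: str) -> list:
    """Parse JSONL, skipping lines that fail to parse."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # partial/corrupt line: skip it rather than fail the run
    return records
```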
Multi-Graph Recovery
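A sketch of independent recovery, where each graph is resumed in isolation so one failure cannot block the others (illustrative, not Hive's API):

```python
def recover_all(graphs: dict) -> dict:
    """Resume each graph independently; record failures instead of aborting."""
    results = {}
    for name, resume in graphs.items():
        try:
            results[name] = resume()
        except Exception as err:
            results[name] = err  # note the failure; keep recovering the rest
    return results
```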
Secondary graphs recover independently.

Error Handling Patterns
Graceful Degradation
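A minimal graceful-degradation sketch: try the primary operation and, instead of failing the whole run, fall back to a cheaper alternative:

```python
def with_fallback(primary, fallback):
    """Return primary()'s result, or fallback()'s result if primary fails."""
    try:
        return primary()
    except Exception:
        return fallback()  # e.g. cached data instead of a live API call
```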
Retry with Backoff
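A sketch of exponential backoff with jitter; the parameter values are illustrative:

```python
import random
import time

def retry_with_backoff(fn, retries=3, base=0.5, cap=8.0):
    """Retry fn, doubling the delay each attempt (capped), with full jitter."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the error
            delay = min(cap, base * (2 ** attempt))
            time.sleep(delay * random.random())  # jitter avoids thundering herds
```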
Circuit Breaker
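A minimal circuit-breaker sketch: after a threshold of consecutive failures, reject calls until a reset window elapses, then allow one trial call (thresholds are illustrative):

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; reject calls until
    `reset_after` seconds pass, then allow a trial (half-open) call."""

    def __init__(self, threshold=3, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit
        return result
```

This keeps a repeatedly failing dependency from consuming the resurrection budget on every attempt.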
Best Practices
Resurrection Budget
Set max_resurrections based on your use case; 3 is reasonable for transient failures.

Fatal Detection
Add custom fatal error patterns if needed. Don’t waste resurrections on config errors.
Checkpoint Frequency
Balance checkpoint overhead with recovery granularity. Node completion is usually sufficient.
State Validation
Validate session state before resuming. Detect corrupted state early.
Related
- Sessions - Session lifecycle and state management
- Monitoring - Track execution health
- Headless Execution - Production deployment patterns