Built-In Resilience
CockroachDB provides automatic resilience through its architecture.
Automatic Replication
Every piece of data is replicated across multiple nodes by default:
- 3x replication ensures data survives node failures
- Raft consensus maintains consistency across replicas
- Automatic rebalancing redistributes data when nodes join or leave
With 3x replication, you can lose any 1 node without data loss or downtime. The cluster automatically promotes new Raft leaders and leaseholders within seconds.
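The replication factor can be raised for workloads that need to survive more simultaneous failures. A minimal sketch using zone configurations (the factor of 5 is an example, not a recommendation):

```sql
-- Raise the default replication factor from 3 to 5 cluster-wide.
-- With 5 replicas, the cluster tolerates 2 simultaneous node failures.
ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5;
```

Zone configurations can also be applied per database or per table when only some data needs the higher factor.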
Self-Healing Capabilities
When failures occur, CockroachDB automatically:
Elects New Leaders
Follower replicas hold elections and promote new Raft leaders for affected ranges.
The entire recovery process happens automatically without human intervention. Most failures are recovered in seconds, not minutes.
Range Circuit Breakers
When individual ranges become unavailable, per-replica circuit breakers prevent requests from hanging indefinitely:
- Timeout after 60 seconds by default (configurable via kv.replica_circuit_breaker.slow_replication_threshold)
- Return ReplicaUnavailableError instead of hanging
- Automatically reset when the range becomes available again
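The threshold is a cluster setting, so it can be tuned without restarting nodes. A sketch (the 30-second value is an example):

```sql
-- Trip the circuit breaker after 30s of stalled replication
-- instead of the 60s default.
SET CLUSTER SETTING kv.replica_circuit_breaker.slow_replication_threshold = '30s';
```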
Backup and Restore
While automatic replication protects against hardware failures, backups protect against:
- User errors: Accidental deletions or data corruption
- Software bugs: Application logic errors that corrupt data
- Security incidents: Ransomware or malicious data destruction
- Compliance requirements: Retention policies and audit trails
Backup Types
CockroachDB supports multiple backup strategies:
Full Backups
What it is: A complete copy of your cluster, database, or table at a point in time.
Use cases:
- Initial backup for a new backup schedule
- Base for incremental backups
- Regulatory compliance requiring complete snapshots
Pros: Complete, standalone backup
Cons: Larger storage requirements, longer backup times
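A minimal full-backup sketch (the bank database and bucket path are placeholders):

```sql
-- Full backup of one database to cloud storage; AS OF SYSTEM TIME
-- reads slightly stale data to reduce contention with live traffic.
BACKUP DATABASE bank
    INTO 's3://backup-bucket/bank?AUTH=implicit'
    AS OF SYSTEM TIME '-10s';
```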
Incremental Backups
What it is: Only the data that changed since the last backup (full or incremental).
Use cases:
- Frequent backups with minimal storage overhead
- Continuous protection with hourly/daily incrementals
- Optimizing backup windows
Pros: Fast, space-efficient
Cons: Requires a full backup as a base, longer restore chains
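An incremental backup appends to the most recent full backup in the same collection. A sketch (database and bucket names are placeholders):

```sql
-- Append an incremental backup to the latest full backup
-- in this collection; only changed data is written.
BACKUP DATABASE bank
    INTO LATEST IN 's3://backup-bucket/bank?AUTH=implicit'
    AS OF SYSTEM TIME '-10s';
```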
Revision History Backups
What it is: Captures all changes within the garbage collection window, enabling point-in-time restore.
Use cases:
- Recovering from user errors at specific times
- Forensic analysis of data changes
- Compliance with retention policies
Pros: Point-in-time recovery, historical analysis
Cons: Larger backups, longer restore times
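Revision history is enabled with a backup option. A sketch (names are placeholders):

```sql
-- Capture every MVCC revision within the GC window,
-- enabling AS OF SYSTEM TIME restores from this backup.
BACKUP DATABASE bank
    INTO 's3://backup-bucket/bank?AUTH=implicit'
    WITH revision_history;
```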
Scheduled Backups
Create automatic backup schedules:
- Run automatically at specified intervals
- Protect data from garbage collection until backed up
- Can be paused, resumed, or modified
- Track completion and failures
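A sketch of a schedule combining daily fulls with hourly incrementals (the label, database, and bucket are placeholders):

```sql
-- Hourly incrementals on top of a daily full backup.
CREATE SCHEDULE bank_backup
  FOR BACKUP DATABASE bank INTO 's3://backup-bucket/bank?AUTH=implicit'
    WITH revision_history
  RECURRING '@hourly'
  FULL BACKUP '@daily'
  WITH SCHEDULE OPTIONS first_run = 'now';
```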
Monitoring Backup Schedules
Track the health of each schedule through:
- Last successful backup timestamp
- Backup duration and size
- Error messages and failures
- Protected timestamp advance
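Schedule status can be inspected directly in SQL. A sketch (the label matches whatever you named the schedule):

```sql
-- List all schedules, then inspect one by label.
SHOW SCHEDULES;

SELECT id, label, schedule_status, next_run
  FROM [SHOW SCHEDULES]
 WHERE label = 'bank_backup';
```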
Backup Storage
Supported cloud storage:

| Provider | URL Format | Use Case |
|---|---|---|
| AWS S3 | s3://bucket/path | Standard cloud backups |
| Google Cloud Storage | gs://bucket/path | Google Cloud deployments |
| Azure Blob Storage | azure://container/path | Azure deployments |
| S3-compatible | s3://bucket/path?AWS_ENDPOINT=... | MinIO, DigitalOcean Spaces |
Immutable Storage
Store backups in immutable (object-lock/WORM-enabled) buckets. This prevents:
- Accidental backup deletion
- Malicious backup destruction
- Ransomware from encrypting backups
Locality-Aware Backups
For multi-region deployments, optimize backup performance and costs:
- Each region writes to a nearby storage bucket
- Reduces cross-region bandwidth costs
- Speeds up backup completion
- Enables faster locality-aware restores
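A locality-aware backup takes one URI per locality plus a required default; nodes write to the URI matching their own locality. A sketch (regions and buckets are placeholders):

```sql
-- Nodes in us-east1 write to the matching bucket;
-- everything else goes to the default URI.
BACKUP DATABASE bank INTO
  ('s3://default-bucket/bank?COCKROACH_LOCALITY=default',
   's3://us-east-bucket/bank?COCKROACH_LOCALITY=region%3Dus-east1');
```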
Restoring Data
Restore entire databases, specific tables, or point-in-time snapshots:
Full Database Restore
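A sketch of restoring a whole database from the latest backup in a collection (the target database must not already exist; names are placeholders):

```sql
-- Restore the entire database from the most recent backup chain.
RESTORE DATABASE bank
    FROM LATEST IN 's3://backup-bucket/bank?AUTH=implicit';
```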
Specific Table Restore
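A sketch of restoring a single table (names are placeholders):

```sql
-- Restore just one table from the latest backup.
RESTORE TABLE bank.users
    FROM LATEST IN 's3://backup-bucket/bank?AUTH=implicit';
```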
This restores only the users table, useful for recovering from targeted data corruption.
Point-in-Time Restore
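Point-in-time restore requires a backup taken WITH revision_history. A sketch (the timestamp and names are placeholders):

```sql
-- Restore the database to its state at a specific moment;
-- the backup must include revision history covering that time.
RESTORE DATABASE bank
    FROM LATEST IN 's3://backup-bucket/bank?AUTH=implicit'
    AS OF SYSTEM TIME '2025-01-15 10:30:00';
```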
New Database Name
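Restoring under a new name avoids clobbering the live database, which is useful for side-by-side comparison. A sketch (names are placeholders):

```sql
-- Restore into a differently named database alongside the original.
RESTORE DATABASE bank
    FROM LATEST IN 's3://backup-bucket/bank?AUTH=implicit'
    WITH new_db_name = 'bank_restored';
```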
Disaster Recovery Strategies
RTO and RPO Planning
Define your recovery objectives:
- Recovery Time Objective (RTO): How quickly must you restore service?
- Recovery Point Objective (RPO): How much data loss is acceptable?
| RTO | RPO | Strategy |
|---|---|---|
| Minutes | Seconds | Multi-region with region survival + follower reads |
| Hours | Minutes | Multi-region with zone survival + hourly backups |
| Days | Hours | Single-region with daily backups |
Geographic Redundancy
For critical workloads, implement geographic redundancy:
Secondary Cluster (Optional)
For maximum resilience, maintain a standby cluster in a different geography using physical cluster replication or logical data replication.
Backup Validation
Never trust unvalidated backups. CockroachDB provides three levels of backup validation:
Level 1: Metadata Validation (Fast)
- Backup metadata is readable
- File structure is intact
- Manifest is consistent
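Metadata validation amounts to reading the backup's manifest. A sketch (bucket path is a placeholder):

```sql
-- Reads and lists the backup manifest; fails fast if
-- the metadata is unreadable or inconsistent.
SHOW BACKUP FROM LATEST IN 's3://backup-bucket/bank?AUTH=implicit';
```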
Level 2: Checksum Validation (Medium)
- All files are present and readable
- Checksums match expected values
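Checksum validation is the same statement with an extra option. A sketch (bucket path is a placeholder):

```sql
-- Additionally verifies that every backup file is present
-- and matches its recorded checksum.
SHOW BACKUP FROM LATEST IN 's3://backup-bucket/bank?AUTH=implicit'
    WITH check_files;
```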
Level 3: Full Restore Validation (Thorough)
- Complete restore succeeds
- Data is queryable
- Indexes and constraints are intact
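The most thorough check is an actual restore into a scratch database. A sketch (the bank_validation name and the users table are placeholders):

```sql
-- Restore into a throwaway database, spot-check, then discard.
RESTORE DATABASE bank
    FROM LATEST IN 's3://backup-bucket/bank?AUTH=implicit'
    WITH new_db_name = 'bank_validation';

SELECT count(*) FROM bank_validation.users;

DROP DATABASE bank_validation CASCADE;
```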
Observability and Monitoring
Backup Monitoring
Monitor backup health through:
- Backup job metrics: Success rate, duration, data size
- Protected timestamps: Ensure data isn’t garbage collected before backup
- Storage space: Alert when backup storage is running low
- Schedule failures: Immediate alerts when scheduled backups fail
Key Metrics to Track
| Metric | Alert Threshold | Action |
|---|---|---|
| Backup failure rate | > 0% | Investigate immediately |
| Backup duration | > 2x baseline | Check cluster performance |
| Backup age | > 25 hours (for daily) | Investigate schedule delays |
| Storage utilization | > 80% | Expand storage or adjust retention |
Prometheus Integration
Export backup metrics to Prometheus by scraping each node's /_status/vars endpoint.
Best Practices
Recovery Scenarios
Accidental Table Drop
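If a table was dropped, restore just that table from the latest backup. A sketch (names are placeholders):

```sql
-- Bring back a dropped table without touching the rest
-- of the database.
RESTORE TABLE bank.users
    FROM LATEST IN 's3://backup-bucket/bank?AUTH=implicit';
```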
Data Corruption in Specific Rows
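For corruption limited to specific rows, a point-in-time restore into a scratch database lets you copy the clean values back. A sketch, assuming a revision-history backup and placeholder names and IDs:

```sql
-- Restore a pre-corruption snapshot under a new name.
RESTORE DATABASE bank
    FROM LATEST IN 's3://backup-bucket/bank?AUTH=implicit'
    AS OF SYSTEM TIME '2025-01-15 09:00:00'
    WITH new_db_name = 'bank_clean';

-- Overwrite only the corrupted rows with their clean versions.
UPSERT INTO bank.users
  SELECT * FROM bank_clean.users WHERE id IN (101, 102);

DROP DATABASE bank_clean CASCADE;
```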
Complete Cluster Loss
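A full-cluster restore runs against a freshly initialized, empty cluster and rebuilds everything, including users and settings. A sketch (assumes the backup was a cluster-level backup; bucket path is a placeholder):

```sql
-- On a brand-new, empty cluster:
RESTORE FROM LATEST IN 's3://backup-bucket/cluster?AUTH=implicit';
```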
See Also
- Data Replication - How replication provides automatic resilience
- Multi-Region Deployments - Geographic redundancy strategies
- Distributed Transactions - How transactions maintain consistency