Overview
Cadence provides comprehensive disaster recovery capabilities through database backups, cross-datacenter replication, and workflow archival. This guide covers backup strategies, failover procedures, and recovery operations.
Backup Strategies
What to Back Up
Cadence requires backups of:
- Persistence data (databases)
  - Workflow execution state
  - Workflow history
  - Domain metadata
  - Task lists
- Visibility data (search indices)
  - Workflow search index
  - List query results
- Configuration
  - Service configuration files
  - Dynamic configuration
- Archived data (optional)
  - Long-term workflow history storage
Workflow code (workflows and activities) should be backed up in your source control system, not as part of Cadence backup.
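Configuration rarely gets the same attention as the databases. A minimal sketch of a configuration backup script, assuming config lives under /etc/cadence (all paths and the bucket name are examples, not part of any Cadence tooling):

```shell
#!/usr/bin/env bash
# config-backup.sh - archive Cadence configuration for DR (example paths)
set -euo pipefail

backup_config() {
  local config_dir="$1"     # e.g. /etc/cadence
  local backup_dir="$2"     # e.g. /backup/cadence-config
  local stamp archive
  stamp="$(date +%Y%m%d_%H%M%S)"
  archive="$backup_dir/config_$stamp.tar.gz"

  mkdir -p "$backup_dir"
  # Archive the whole config directory, preserving its top-level name
  tar czf "$archive" -C "$(dirname "$config_dir")" "$(basename "$config_dir")"

  # Push a copy off-site (bucket is an example):
  # aws s3 cp "$archive" s3://my-backups/cadence/config/
  echo "$archive"
}
```

Hook it into cron or your deploy pipeline so every configuration change also lands in backup storage.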
Database Backups
Cassandra Backup
Snapshot-Based Backup
#!/bin/bash
# Full cluster snapshot
BACKUP_DIR="/backup/cassandra/$(date +%Y%m%d_%H%M%S)"
mkdir -p $BACKUP_DIR
# Take snapshot on all nodes
for host in $CASSANDRA_NODES; do
ssh $host "nodetool snapshot -t backup_$(date +%Y%m%d) -- cadence cadence_visibility"
done
# Copy snapshots to backup storage
for host in $CASSANDRA_NODES; do
ssh $host "tar czf /tmp/cassandra-snapshot.tar.gz \
/var/lib/cassandra/data/cadence/*/snapshots/ \
/var/lib/cassandra/data/cadence_visibility/*/snapshots/"
scp $host:/tmp/cassandra-snapshot.tar.gz \
$BACKUP_DIR/$host-snapshot.tar.gz
# Clean up remote snapshot
ssh $host "rm /tmp/cassandra-snapshot.tar.gz"
done
# Upload to S3
aws s3 sync $BACKUP_DIR s3://my-backups/cadence/cassandra/$(date +%Y%m%d)/
Incremental Backup
# Enable incremental backups (requires restart)
for host in $CASSANDRA_NODES; do
ssh $host "sed -i 's/^#\? *incremental_backups:.*/incremental_backups: true/' /etc/cassandra/cassandra.yaml"
ssh $host "systemctl restart cassandra"
done
# Backups are stored in data/keyspace/table/backups/
# Copy incrementally to backup storage
for host in $CASSANDRA_NODES; do
rsync -avz --progress \
$host:/var/lib/cassandra/data/cadence/*/backups/ \
$BACKUP_DIR/$host/incremental/
done
Automated Backup Script
#!/bin/bash
# cassandra-backup.sh
set -e
BACKUP_NAME="cadence_$(date +%Y%m%d_%H%M%S)"
BACKUP_DIR="/backup/cassandra"
S3_BUCKET="s3://my-backups/cadence/cassandra"
RETENTION_DAYS=30
# Create snapshot
nodetool snapshot -t $BACKUP_NAME -- cadence cadence_visibility
# Archive snapshot
mkdir -p $BACKUP_DIR/$BACKUP_NAME
find /var/lib/cassandra/data -type d -path "*/snapshots/$BACKUP_NAME" \
-exec tar czf $BACKUP_DIR/$BACKUP_NAME/snapshot.tar.gz {} +
# Upload to S3
aws s3 cp $BACKUP_DIR/$BACKUP_NAME/ $S3_BUCKET/$BACKUP_NAME/ --recursive
# Clean old snapshots
nodetool clearsnapshot -t $BACKUP_NAME cadence cadence_visibility
# Remove local backup after upload
rm -rf $BACKUP_DIR/$BACKUP_NAME
# Clean old S3 backups
aws s3 ls $S3_BUCKET/ | while read -r line; do
backup_date=$(echo $line | awk '{print $2}' | cut -d_ -f2)
if [[ $(date -d "$backup_date" +%s) -lt $(date -d "$RETENTION_DAYS days ago" +%s) ]]; then
aws s3 rm $S3_BUCKET/$(echo $line | awk '{print $2}') --recursive
fi
done
MySQL Backup
Logical Backup (mysqldump)
#!/bin/bash
# mysql-backup.sh
BACKUP_DIR="/backup/mysql"
BACKUP_FILE="cadence_$(date +%Y%m%d_%H%M%S).sql.gz"
# Full backup with all databases
mysqldump \
--host=mysql.example.com \
--user=backup_user \
--password=${MYSQL_PASSWORD} \
--single-transaction \
--routines \
--triggers \
--events \
--hex-blob \
--databases cadence cadence_visibility | gzip > $BACKUP_DIR/$BACKUP_FILE
# Upload to S3
aws s3 cp $BACKUP_DIR/$BACKUP_FILE s3://my-backups/cadence/mysql/
# Verify backup
gunzip -t $BACKUP_DIR/$BACKUP_FILE || { echo "Backup verification failed!"; exit 1; }
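For the restore side of this dump, a sketch (host and user are examples; because the dump was taken with --databases it contains CREATE DATABASE statements and recreates both schemas):

```shell
#!/usr/bin/env bash
# mysql-restore.sh - restore the gzipped logical dump (sketch; host/user are examples)
set -euo pipefail

# Fail fast on a truncated or corrupt archive
verify_dump() {
  gunzip -t "$1"
}

restore_mysql_dump() {
  local dump_file="$1"
  verify_dump "$dump_file"
  # The dump includes CREATE DATABASE statements, so no target
  # database needs to be named here
  gunzip -c "$dump_file" | mysql \
    --host=mysql.example.com \
    --user=restore_user \
    --password="${MYSQL_PASSWORD}"
}
```

verify_dump is split out so backups can be checked routinely without touching a live server.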
Physical Backup (Percona XtraBackup)
#!/bin/bash
# xtrabackup-full.sh
BACKUP_DIR="/backup/mysql/$(date +%Y%m%d_%H%M%S)"
# Full backup
xtrabackup --backup \
--target-dir=$BACKUP_DIR \
--user=backup_user \
--password=${MYSQL_PASSWORD}
# Prepare backup
xtrabackup --prepare --target-dir=$BACKUP_DIR
# Compress and upload
tar czf $BACKUP_DIR.tar.gz $BACKUP_DIR
aws s3 cp $BACKUP_DIR.tar.gz s3://my-backups/cadence/mysql/
# Clean local backup
rm -rf $BACKUP_DIR $BACKUP_DIR.tar.gz
Incremental Backup
# Take base backup
xtrabackup --backup --target-dir=/backup/base
# Incremental backup (daily)
xtrabackup --backup \
--target-dir=/backup/inc1 \
--incremental-basedir=/backup/base
# Restore process
xtrabackup --prepare --apply-log-only --target-dir=/backup/base
xtrabackup --prepare --apply-log-only \
--target-dir=/backup/base \
--incremental-dir=/backup/inc1
xtrabackup --prepare --target-dir=/backup/base
PostgreSQL Backup
Logical Backup (pg_dump)
#!/bin/bash
# postgres-backup.sh
BACKUP_DIR="/backup/postgres"
BACKUP_FILE="cadence_$(date +%Y%m%d_%H%M%S).dump"
# Custom format backup (supports parallel restore)
pg_dump \
--host=postgres.example.com \
--username=backup_user \
--format=custom \
--compress=9 \
--file=$BACKUP_DIR/$BACKUP_FILE \
cadence
# Visibility database
pg_dump \
--host=postgres.example.com \
--username=backup_user \
--format=custom \
--compress=9 \
--file=$BACKUP_DIR/visibility_$(date +%Y%m%d_%H%M%S).dump \
cadence_visibility
# Upload to S3
aws s3 cp $BACKUP_DIR/ s3://my-backups/cadence/postgres/ --recursive
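The custom-format dumps above restore with pg_restore, which is what enables the parallel restore mentioned earlier. A dry-run sketch that builds the invocation (host, user, and dump path are examples; pipe the output to sh, or drop the printf wrapper, to execute):

```shell
#!/usr/bin/env bash
# pg-restore.sh - parallel restore of a custom-format dump (sketch)
set -euo pipefail

build_restore_cmd() {
  local dump_file="$1" dbname="$2" jobs="${3:-4}"
  # --clean --if-exists drops existing objects before recreating them;
  # --jobs parallelizes the restore (custom format supports this)
  printf '%s ' pg_restore \
    --host=postgres.example.com \
    --username=restore_user \
    --dbname="$dbname" \
    --clean --if-exists \
    --jobs="$jobs" \
    "$dump_file"
  printf '\n'
}

build_restore_cmd /backup/postgres/cadence_latest.dump cadence
```

Run the same command against cadence_visibility with its own dump file.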
Physical Backup (pg_basebackup)
# Base backup
pg_basebackup \
--host=postgres.example.com \
--username=replication_user \
--format=tar \
--gzip \
--progress \
--checkpoint=fast \
--wal-method=stream \
--pgdata=/backup/postgres/$(date +%Y%m%d_%H%M%S)
# Continuous archiving (WAL shipping)
# In postgresql.conf:
archive_mode = on
archive_command = 'aws s3 cp %p s3://my-backups/cadence/postgres/wal/%f'
Backup Schedule Recommendations
| Backup Type | Frequency | Retention | RTO | RPO |
|---|---|---|---|---|
| Full Snapshot | Daily | 30 days | 2-4 hours | 24 hours |
| Incremental | Hourly | 7 days | 1-2 hours | 1 hour |
| WAL/Binlog | Continuous | 7 days | 30 min | 5 minutes |
| Configuration | On change | Indefinite | 5 min | 0 |
Combine full snapshots with continuous archiving (WAL/binlog) for best RPO (Recovery Point Objective).
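As a concrete starting point, the schedule above could be driven by cron entries like these (user, paths, and script names are illustrative; the backup scripts are sketched earlier in this guide):

```shell
# /etc/cron.d/cadence-backup (illustrative paths)
# Daily full snapshot at 02:00
0 2 * * *    backup  /opt/cadence/cassandra-backup.sh
# Hourly incremental copy
15 * * * *   backup  /opt/cadence/backup-incremental.sh
# Configuration backup on a short interval (cheap; approximates "on change")
*/15 * * * * backup  /opt/cadence/config-backup.sh
```

WAL/binlog shipping is configured in the database itself (archive_command, binlog settings) rather than via cron.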
Workflow Archival
Archival moves completed workflows to long-term storage (S3, GCS, etc.).
Archival Configuration
archival:
  history:
    status: "enabled"
    enableRead: true
    provider:
      s3store:
        region: "us-east-1"
        endpoint: "s3.amazonaws.com"
  visibility:
    status: "enabled"
    enableRead: true
    provider:
      s3store:
        region: "us-east-1"

domainDefaults:
  archival:
    history:
      status: "enabled"
      URI: "s3://my-archival-bucket/cadence-history/"
    visibility:
      status: "enabled"
      URI: "s3://my-archival-bucket/cadence-visibility/"
Supported archival URIs:
- S3: s3://bucket-name/path/
- GCS: gs://bucket-name/path/
- Local: file:///mnt/archival/
Archival Lifecycle
1. Workflow completes and passes its retention period (default 7 days)
2. The worker service archives the workflow history to S3/GCS
3. Primary database records are deleted
4. Archived data remains accessible via Cadence APIs
# Query archived workflow
cadence workflow show \
--workflow_id my-workflow-id \
--run_id abc-def-123 \
--domain my-domain
# List archived workflows
cadence workflow list \
--domain my-domain \
--archived
Archival is one-way. Once a workflow is archived, it cannot be moved back to primary storage.
Cross-Datacenter Replication
Cadence supports active-active global domains across multiple datacenters.
Cluster Configuration
clusterGroupMetadata:
  failoverVersionIncrement: 100
  primaryClusterName: "dc1"
  currentClusterName: "dc1"
  clusterGroup:
    dc1:
      enabled: true
      initialFailoverVersion: 1
      rpcName: "cadence-frontend"
      rpcAddress: "dc1.cadence.example.com:7933"
      rpcTransport: "grpc"
    dc2:
      enabled: true
      initialFailoverVersion: 2
      rpcName: "cadence-frontend"
      rpcAddress: "dc2.cadence.example.com:7933"
      rpcTransport: "grpc"
    dc3:
      enabled: true
      initialFailoverVersion: 3
      rpcName: "cadence-frontend"
      rpcAddress: "dc3.cadence.example.com:7933"
      rpcTransport: "grpc"
Global Domain Setup
# Register global domain
cadence --domain my-global-domain domain register \
--global_domain true \
--active_cluster dc1 \
--clusters dc1 dc2 dc3 \
--retention 7 \
--description "Global domain spanning 3 DCs"
# Verify replication
cadence --domain my-global-domain domain describe
Replication Architecture
┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│ DC1 (Active) │ │ DC2 (Standby) │ │ DC3 (Standby) │
│ │ │ │ │ │
│ ┌──────────────┐ │ │ ┌──────────────┐ │ │ ┌──────────────┐ │
│ │ Frontend │ │ │ │ Frontend │ │ │ │ Frontend │ │
│ └──────────────┘ │ │ └──────────────┘ │ │ └──────────────┘ │
│ │ │ │ │ │ │ │ │
│ ┌──────────────┐ │ │ ┌──────────────┐ │ │ ┌──────────────┐ │
│ │ History │◄──┼─────────┼──┤ History │◄──┼─────────┼──┤ History │ │
│ └──────────────┘ │ │ └──────────────┘ │ │ └──────────────┘ │
│ │ │ │ │ │ │ │ │
│ ┌──────────────┐ │ │ ┌──────────────┐ │ │ ┌──────────────┐ │
│ │ Cassandra │ │ │ │ Cassandra │ │ │ │ Cassandra │ │
│ └──────────────┘ │ │ └──────────────┘ │ │ └──────────────┘ │
└─────────────────────┘ └─────────────────────┘ └─────────────────────┘
│ ▲ ▲
└─────────Replication Tasks───────┴─────────────────────────────────┘
Disaster Recovery Scenarios
Scenario 1: Database Corruption
Detection
Symptoms:
- Workflows failing to start
- Persistence errors in logs
- Inconsistent workflow state

# Verify in service logs
grep -i "persistence error" /var/log/cadence/*.log
Recovery
# 1. Stop affected services
kubectl scale deployment/cadence-history --replicas=0
# 2. Restore from backup (Cassandra example)
# Note: snapshot files must end up back in the live table directories
# (or be streamed in with sstableloader); a tarball made from the
# snapshot layout extracts into snapshots/ subdirectories, so move the
# SSTables up one level before starting Cassandra
for host in $CASSANDRA_NODES; do
ssh $host "systemctl stop cassandra"
ssh $host "rm -rf /var/lib/cassandra/data/cadence"
ssh $host "tar xzf /backup/cassandra-latest.tar.gz -C /"
ssh $host "systemctl start cassandra"
done
# 3. Verify schema version
cadence-cassandra-tool --ep 127.0.0.1 --keyspace cadence version
# 4. Restart services
kubectl scale deployment/cadence-history --replicas=10
# 5. Verify recovery
cadence workflow list --domain my-domain
Scenario 2: Datacenter Failure
Failover Procedure
# 1. Verify active cluster is unavailable
curl http://dc1.cadence.example.com:7933/health || echo "DC1 down"
# 2. Initiate failover to DC2
cadence --domain my-global-domain domain update \
--active_cluster dc2
# 3. Verify failover
cadence --domain my-global-domain domain describe
# 4. Update DNS/load balancer to point to DC2
aws route53 change-resource-record-sets \
--hosted-zone-id Z123456 \
--change-batch '{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "cadence.example.com",
"Type": "CNAME",
"TTL": 60,
"ResourceRecords": [{"Value": "dc2.cadence.example.com"}]
}
}]
}'
# 5. Monitor for replication lag
watch -n 5 'cadence admin domain describe --domain my-global-domain'
Failback Procedure
# 1. Verify DC1 is recovered
curl http://dc1.cadence.example.com:7933/health
# 2. Wait for replication to catch up
cadence admin domain describe --domain my-global-domain
# Check replicationConfig.activeClusterName and failoverHistory
# 3. Failback to DC1
cadence --domain my-global-domain domain update \
--active_cluster dc1
# 4. Update DNS back to DC1
aws route53 change-resource-record-sets ...
Failover is domain-level, not cluster-level. Different domains can be active in different clusters simultaneously.
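One consequence: evacuating a region means repeating the update for every global domain. A dry-run sketch (domain names are examples; remove the echo prefix to execute):

```shell
#!/usr/bin/env bash
# failover-all.sh - fail over a list of global domains to one cluster (sketch)
set -euo pipefail

failover_domains() {
  local target="$1"; shift
  local domain
  for domain in "$@"; do
    # Remove the echo to actually run the failover
    echo cadence --domain "$domain" domain update --active_cluster "$target"
  done
}

failover_domains dc2 orders-domain payments-domain inventory-domain
```

Keeping the authoritative list of global domains in version control makes this loop auditable during an incident.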
Scenario 3: Complete Cluster Loss
Prerequisites
- Recent database backups
- Configuration backups
- Documented cluster setup
Recovery Steps
# 1. Provision new infrastructure
# - Database cluster
# - Kubernetes cluster or VMs
# - Load balancers
# 2. Restore database
# (See database backup section above)
# 3. Deploy Cadence services
kubectl apply -f cadence-deployment.yaml
# 4. Restore configuration
kubectl create configmap cadence-config \
--from-file=config.yaml=config-backup.yaml
# 5. Verify cluster health
for svc in frontend history matching worker; do
curl http://$svc:909x/health
done
# 6. Verify workflows
cadence workflow list --domain my-domain
cadence workflow describe --workflow_id <wf-id> --domain my-domain
# 7. Resume workers
# Workers automatically reconnect and resume polling
Data Consistency
Consistency Guarantees
Cadence provides:
- Single cluster: Strong consistency (via database transactions)
- Global domains: Eventual consistency across datacenters
- Archival: Eventually consistent (async archival process)
Consistency Checks
# Verify workflow state consistency
cadence admin workflow show \
--workflow_id <wf-id> \
--run_id <run-id>
# Check for orphaned workflows
cadence admin db scan --db-type cassandra
# Verify history branch consistency
cadence admin workflow list-history-branches
Handling Inconsistencies
# Repair workflow (if stuck)
cadence admin workflow refresh --workflow_id <wf-id>
# Delete corrupt workflow
cadence admin workflow delete \
--workflow_id <wf-id> \
--run_id <run-id>
# Re-sync domain metadata
cadence admin domain describe --domain <domain> --sync
Testing Disaster Recovery
DR Testing Checklist
- Backup verification
  # Test restore to isolated environment
  ./restore-backup.sh --test-mode --backup=latest
- Failover testing
  # Practice failover in non-production
  cadence --domain test-domain domain update --active_cluster dc2
- Recovery time measurement
  # Measure RTO (Recovery Time Objective)
  time ./disaster-recovery.sh
- Data loss assessment
  # Verify RPO (Recovery Point Objective)
  # Compare pre-failure and post-recovery workflow counts
Chaos Engineering
Use chaos testing to validate DR procedures:
# Kubernetes chaos-mesh example
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-history-pod
spec:
  action: pod-kill
  selector:
    namespaces:
      - cadence
    labelSelectors:
      app: cadence-history
  mode: one
  scheduler:
    cron: "@hourly"
Monitoring and Alerting
Backup Monitoring
# Alert if backup is older than 25 hours
time() - cadence_backup_last_success_timestamp > 90000
# Alert on backup failures
rate(cadence_backup_failures[1h]) > 0
Replication Lag
# Alert if replication lag > 5 minutes
cadence_replication_lag_seconds > 300
# Alert on replication task failures
rate(cadence_replication_task_failures[5m]) > 10
Cross-Cluster Health
# Monitor all clusters
for cluster in dc1 dc2 dc3; do
curl http://$cluster.cadence.example.com:7933/health
done
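The ad-hoc loop above can be hardened into a check that a cron job or monitoring agent can run, assuming the health endpoint returns a non-2xx status when unhealthy (hostname pattern as above):

```shell
#!/usr/bin/env bash
# cluster-health.sh - exit non-zero if any cluster fails its health check (sketch)
set -uo pipefail

check_clusters() {
  local failed=0 cluster
  for cluster in "$@"; do
    # -f treats HTTP errors as failures; --max-time bounds hung endpoints
    if ! curl -fsS --max-time 5 "http://$cluster.cadence.example.com:7933/health" >/dev/null; then
      echo "UNHEALTHY: $cluster" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Usage: check_clusters dc1 dc2 dc3
```

The non-zero exit status lets cron mailers or wrapper agents raise an alert without parsing output.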
Best Practices
1. Automate Everything
- Automated backup scripts
- Automated restore testing
- Automated failover procedures
2. Test Regularly
- Monthly backup restoration tests
- Quarterly full DR drills
- Annual cross-region failover tests
3. Document Procedures
- Runbooks for each DR scenario
- Contact information for on-call
- Decision trees for failover
4. Monitor Continuously
- Backup success/failure
- Replication lag
- Cross-cluster connectivity
5. Maintain Redundancy
- 3+ clusters for global domains
- Multiple backup copies (on-site + off-site)
- Geo-redundant archival storage
See Also