
Overview

Cadence provides comprehensive disaster recovery capabilities through database backups, cross-datacenter replication, and workflow archival. This guide covers backup strategies, failover procedures, and recovery operations.

Backup Strategies

What to Back Up

Cadence requires backups of:
  1. Persistence data (databases)
    • Workflow execution state
    • Workflow history
    • Domain metadata
    • Task lists
  2. Visibility data (search indices)
    • Workflow search index
    • List query results
  3. Configuration
    • Service configuration files
    • Dynamic configuration
  4. Archived data (optional)
    • Long-term workflow history storage
Workflow code (workflows and activities) should be backed up in your source control system, not as part of Cadence backup.
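As a sketch, the inventory above can be turned into a pre-flight check that runs before each upload. The artifact file names here are placeholders for illustration, not Cadence conventions:

```shell
#!/bin/bash
# Sketch: verify each backup category produced an artifact before uploading.
# File names are placeholders (assumptions) -- adjust to your pipeline.
check_backup_artifacts() {
  local base="$1" missing=0
  for f in persistence.dump visibility.dump config.yaml; do
    if [ ! -f "$base/$f" ]; then
      echo "MISSING: $base/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Run it against the staging directory before the upload step, e.g. `check_backup_artifacts /backup/staging || exit 1`.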

Database Backups

Cassandra Backup

Snapshot-Based Backup

#!/bin/bash
# Full cluster snapshot
BACKUP_DIR="/backup/cassandra/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR"

# Take snapshot on all nodes (options go before the keyspace arguments)
for host in $CASSANDRA_NODES; do
  ssh $host "nodetool snapshot -t backup_$(date +%Y%m%d) cadence cadence_visibility"
done

# Copy snapshots to backup storage
for host in $CASSANDRA_NODES; do
  ssh $host "tar czf /tmp/cassandra-snapshot.tar.gz \
    /var/lib/cassandra/data/cadence/*/snapshots/ \
    /var/lib/cassandra/data/cadence_visibility/*/snapshots/"
  
  scp $host:/tmp/cassandra-snapshot.tar.gz \
    $BACKUP_DIR/$host-snapshot.tar.gz
  
  # Clean up remote snapshot
  ssh $host "rm /tmp/cassandra-snapshot.tar.gz"
done

# Upload to S3
aws s3 sync $BACKUP_DIR s3://my-backups/cadence/cassandra/$(date +%Y%m%d)/

Incremental Backup

# Enable incremental backups (requires restart)
for host in $CASSANDRA_NODES; do
  ssh $host "echo 'incremental_backups: true' >> /etc/cassandra/cassandra.yaml"
  ssh $host "systemctl restart cassandra"
done

# Backups are stored in data/keyspace/table/backups/
# Copy incrementally to backup storage
for host in $CASSANDRA_NODES; do
  rsync -avz --progress \
    $host:/var/lib/cassandra/data/cadence/*/backups/ \
    $BACKUP_DIR/$host/incremental/
done

Automated Backup Script

#!/bin/bash
# cassandra-backup.sh

set -e

BACKUP_NAME="cadence_$(date +%Y%m%d_%H%M%S)"
BACKUP_DIR="/backup/cassandra"
S3_BUCKET="s3://my-backups/cadence/cassandra"
RETENTION_DAYS=30

# Create snapshot
nodetool snapshot -t $BACKUP_NAME cadence cadence_visibility

# Archive snapshot
mkdir -p $BACKUP_DIR/$BACKUP_NAME
find /var/lib/cassandra/data -type d -path "*/snapshots/$BACKUP_NAME" \
  -exec tar czf $BACKUP_DIR/$BACKUP_NAME/snapshot.tar.gz {} +

# Upload to S3
aws s3 cp $BACKUP_DIR/$BACKUP_NAME/ $S3_BUCKET/$BACKUP_NAME/ --recursive

# Clean old snapshots
nodetool clearsnapshot -t $BACKUP_NAME -- cadence cadence_visibility

# Remove local backup after upload
rm -rf $BACKUP_DIR/$BACKUP_NAME

# Clean old S3 backups
aws s3 ls $S3_BUCKET/ | awk '$1 == "PRE" {print $2}' | while read -r backup; do
  backup_date=$(echo $backup | cut -d_ -f2)
  if [[ $(date -d "$backup_date" +%s) -lt $(date -d "$RETENTION_DAYS days ago" +%s) ]]; then
    aws s3 rm $S3_BUCKET/$backup --recursive
  fi
done
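The date comparison in the cleanup loop is easy to get subtly wrong, so the cutoff logic is worth isolating where it can be tested on its own. A sketch, assuming GNU date and the cadence_YYYYmmdd_HHMMSS naming used above:

```shell
#!/bin/bash
# Sketch: decide whether a backup named cadence_YYYYmmdd_HHMMSS has passed
# its retention window. Assumes GNU date (-d). "now" can be injected for tests.
is_expired() {
  local name="$1" retention_days="$2" now="${3:-$(date +%s)}"
  local stamp=${name#cadence_}          # -> YYYYmmdd_HHMMSS
  local day=${stamp%%_*}                # -> YYYYmmdd
  local cutoff=$(( now - retention_days * 86400 ))
  [ "$(date -d "$day" +%s)" -lt "$cutoff" ]
}
```

Usage: `is_expired cadence_20200101_000000 30 && echo expired`.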

MySQL Backup

Logical Backup (mysqldump)

#!/bin/bash
# mysql-backup.sh

BACKUP_DIR="/backup/mysql"
BACKUP_FILE="cadence_$(date +%Y%m%d_%H%M%S).sql.gz"

# Full backup with all databases
mysqldump \
  --host=mysql.example.com \
  --user=backup_user \
  --password=${MYSQL_PASSWORD} \
  --single-transaction \
  --routines \
  --triggers \
  --events \
  --hex-blob \
  --databases cadence cadence_visibility | gzip > $BACKUP_DIR/$BACKUP_FILE

# Upload to S3
aws s3 cp $BACKUP_DIR/$BACKUP_FILE s3://my-backups/cadence/mysql/

# Verify backup
gunzip -t $BACKUP_DIR/$BACKUP_FILE || echo "Backup verification failed!"
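gunzip -t only validates the gzip stream; pairing each dump with a checksum also lets the restore side detect truncation or transfer corruption. A minimal sketch using sha256sum:

```shell
#!/bin/bash
# Sketch: store a checksum next to each dump, verify it before restore.
record_checksum() { sha256sum "$1" > "$1.sha256"; }
verify_checksum() { sha256sum --quiet -c "$1.sha256"; }
```

After creating the dump, run `record_checksum $BACKUP_DIR/$BACKUP_FILE` and upload both files; run `verify_checksum` on the restore host before importing.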

Physical Backup (Percona XtraBackup)

#!/bin/bash
# xtrabackup-full.sh

BACKUP_DIR="/backup/mysql/$(date +%Y%m%d_%H%M%S)"

# Full backup
xtrabackup --backup \
  --target-dir=$BACKUP_DIR \
  --user=backup_user \
  --password=${MYSQL_PASSWORD}

# Prepare backup
xtrabackup --prepare --target-dir=$BACKUP_DIR

# Compress and upload
tar czf $BACKUP_DIR.tar.gz $BACKUP_DIR
aws s3 cp $BACKUP_DIR.tar.gz s3://my-backups/cadence/mysql/

# Clean local backup
rm -rf $BACKUP_DIR $BACKUP_DIR.tar.gz

Incremental Backup

# Take base backup
xtrabackup --backup --target-dir=/backup/base

# Incremental backup (daily)
xtrabackup --backup \
  --target-dir=/backup/inc1 \
  --incremental-basedir=/backup/base

# Restore process
xtrabackup --prepare --apply-log-only --target-dir=/backup/base
xtrabackup --prepare --apply-log-only \
  --target-dir=/backup/base \
  --incremental-dir=/backup/inc1
xtrabackup --prepare --target-dir=/backup/base

PostgreSQL Backup

Logical Backup (pg_dump)

#!/bin/bash
# postgres-backup.sh

BACKUP_DIR="/backup/postgres"
BACKUP_FILE="cadence_$(date +%Y%m%d_%H%M%S).dump"

# Custom format backup (supports parallel restore)
pg_dump \
  --host=postgres.example.com \
  --username=backup_user \
  --format=custom \
  --compress=9 \
  --file=$BACKUP_DIR/$BACKUP_FILE \
  cadence

# Visibility database
pg_dump \
  --host=postgres.example.com \
  --username=backup_user \
  --format=custom \
  --compress=9 \
  --file=$BACKUP_DIR/visibility_$(date +%Y%m%d_%H%M%S).dump \
  cadence_visibility

# Upload to S3
aws s3 cp $BACKUP_DIR/ s3://my-backups/cadence/postgres/ --recursive

Physical Backup (pg_basebackup)

# Base backup
pg_basebackup \
  --host=postgres.example.com \
  --username=replication_user \
  --format=tar \
  --gzip \
  --progress \
  --checkpoint=fast \
  --wal-method=stream \
  --pgdata=/backup/postgres/$(date +%Y%m%d_%H%M%S)

# Continuous archiving (WAL shipping)
# In postgresql.conf:
archive_mode = on
archive_command = 'aws s3 cp %p s3://my-backups/cadence/postgres/wal/%f'

Backup Schedule Recommendations

Backup Type      Frequency     Retention    RTO         RPO
Full Snapshot    Daily         30 days      2-4 hours   24 hours
Incremental      Hourly        7 days       1-2 hours   1 hour
WAL/Binlog       Continuous    7 days       30 min      5 minutes
Configuration    On change     Indefinite   5 min       0
Combine full snapshots with continuous archiving (WAL/binlog) for best RPO (Recovery Point Objective).

Workflow Archival

Archival moves completed workflows to long-term storage (S3, GCS, etc.).

Archival Configuration

archival:
  history:
    status: "enabled"
    enableRead: true
    provider:
      s3store:
        region: "us-east-1"
        endpoint: "s3.amazonaws.com"
  visibility:
    status: "enabled"
    enableRead: true
    provider:
      s3store:
        region: "us-east-1"

domainDefaults:
  archival:
    history:
      status: "enabled"
      URI: "s3://my-archival-bucket/cadence-history/"
    visibility:
      status: "enabled"
      URI: "s3://my-archival-bucket/cadence-visibility/"

Archival URI Formats

  • S3: s3://bucket-name/path/
  • GCS: gs://bucket-name/path/
  • Local: file:///mnt/archival/
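A small guard can reject unsupported schemes before they reach domain configuration. This sketch only checks the URI shape, not whether the bucket exists:

```shell
#!/bin/bash
# Sketch: validate an archival URI against the supported scheme formats
# (s3://bucket/path/, gs://bucket/path/, file:///path/) before applying it.
valid_archival_uri() {
  case "$1" in
    s3://*/*|gs://*/*|file:///*) return 0 ;;
    *) return 1 ;;
  esac
}
```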

Archival Lifecycle

  1. Workflow completes and passes retention period (default 7 days)
  2. Worker service archives workflow history to S3/GCS
  3. Primary database records are deleted
  4. Archived data remains accessible via Cadence APIs
# Query archived workflow
cadence workflow show \
  --workflow_id my-workflow-id \
  --run_id abc-def-123 \
  --domain my-domain

# List archived workflows
cadence workflow list \
  --domain my-domain \
  --archived
Archival is one-way. Once a workflow is archived, it cannot be moved back to primary storage.

Cross-Datacenter Replication

Cadence supports active-active global domains across multiple datacenters.

Cluster Configuration

clusterGroupMetadata:
  failoverVersionIncrement: 100
  primaryClusterName: "dc1"
  currentClusterName: "dc1"
  clusterGroup:
    dc1:
      enabled: true
      initialFailoverVersion: 1
      rpcName: "cadence-frontend"
      rpcAddress: "dc1.cadence.example.com:7933"
      rpcTransport: "grpc"
    dc2:
      enabled: true
      initialFailoverVersion: 2
      rpcName: "cadence-frontend"
      rpcAddress: "dc2.cadence.example.com:7933"
      rpcTransport: "grpc"
    dc3:
      enabled: true
      initialFailoverVersion: 3
      rpcName: "cadence-frontend"
      rpcAddress: "dc3.cadence.example.com:7933"
      rpcTransport: "grpc"
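With this layout, a domain's failover version encodes its active cluster: each failover raises the version by failoverVersionIncrement, and the remainder modulo that increment matches the owning cluster's initialFailoverVersion. A sketch of the mapping, assuming that modulo convention and the three clusters above:

```shell
#!/bin/bash
# Sketch: map a domain failover version back to its owning cluster,
# assuming version % failoverVersionIncrement == initialFailoverVersion.
INCREMENT=100  # failoverVersionIncrement from the config above
owning_cluster() {
  case $(( $1 % INCREMENT )) in
    1) echo dc1 ;;
    2) echo dc2 ;;
    3) echo dc3 ;;
    *) echo unknown ;;
  esac
}
```

For example, a domain at failover version 102 (one failover past dc2's initial version of 2) is active in dc2.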

Global Domain Setup

# Register global domain
cadence --domain my-global-domain domain register \
  --global_domain true \
  --active_cluster dc1 \
  --clusters dc1 dc2 dc3 \
  --retention 7 \
  --description "Global domain spanning 3 DCs"

# Verify replication
cadence --domain my-global-domain domain describe

Replication Architecture

┌─────────────────────┐         ┌─────────────────────┐         ┌─────────────────────┐
│      DC1 (Active)   │         │    DC2 (Standby)    │         │    DC3 (Standby)    │
│                     │         │                     │         │                     │
│  ┌──────────────┐   │         │  ┌──────────────┐   │         │  ┌──────────────┐   │
│  │   Frontend   │   │         │  │   Frontend   │   │         │  │   Frontend   │   │
│  └──────────────┘   │         │  └──────────────┘   │         │  └──────────────┘   │
│         │           │         │         │           │         │         │           │
│  ┌──────────────┐   │         │  ┌──────────────┐   │         │  ┌──────────────┐   │
│  │   History    │◄──┼─────────┼──┤   History    │◄──┼─────────┼──┤   History    │   │
│  └──────────────┘   │         │  └──────────────┘   │         │  └──────────────┘   │
│         │           │         │         │           │         │         │           │
│  ┌──────────────┐   │         │  ┌──────────────┐   │         │  ┌──────────────┐   │
│  │  Cassandra   │   │         │  │  Cassandra   │   │         │  │  Cassandra   │   │
│  └──────────────┘   │         │  └──────────────┘   │         │  └──────────────┘   │
└─────────────────────┘         └─────────────────────┘         └─────────────────────┘
         │                                 ▲                                 ▲
         └─────────Replication Tasks───────┴─────────────────────────────────┘

Disaster Recovery Scenarios

Scenario 1: Database Corruption

Detection

# Symptoms
- Workflows failing to start
- Persistence errors in logs
- Inconsistent workflow state

# Verify
grep -i "persistence error" /var/log/cadence/*.log

Recovery

# 1. Stop affected services
kubectl scale deployment/cadence-history --replicas=0

# 2. Restore from backup (Cassandra example)
for host in $CASSANDRA_NODES; do
  ssh $host "systemctl stop cassandra"
  ssh $host "rm -rf /var/lib/cassandra/data/cadence"
  ssh $host "tar xzf /backup/cassandra-latest.tar.gz -C /"
  ssh $host "systemctl start cassandra"
done

# 3. Verify schema version
cadence-cassandra-tool --ep 127.0.0.1 --keyspace cadence version

# 4. Restart services
kubectl scale deployment/cadence-history --replicas=10

# 5. Verify recovery
cadence workflow list --domain my-domain

Scenario 2: Datacenter Failure

Failover Procedure

# 1. Verify active cluster is unavailable
curl http://dc1.cadence.example.com:7933/health || echo "DC1 down"

# 2. Initiate failover to DC2
cadence --domain my-global-domain domain update \
  --active_cluster dc2

# 3. Verify failover
cadence --domain my-global-domain domain describe

# 4. Update DNS/load balancer to point to DC2
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123456 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "cadence.example.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "dc2.cadence.example.com"}]
      }
    }]
  }'

# 5. Monitor for replication lag
watch -n 5 'cadence admin domain describe --domain my-global-domain'

Failback Procedure

# 1. Verify DC1 is recovered
curl http://dc1.cadence.example.com:7933/health

# 2. Wait for replication to catch up
cadence admin domain describe --domain my-global-domain
# Check replicationConfig.activeClusterName and failoverHistory

# 3. Failback to DC1
cadence --domain my-global-domain domain update \
  --active_cluster dc1

# 4. Update DNS back to DC1
aws route53 change-resource-record-sets ...
Failover is domain-level, not cluster-level. Different domains can be active in different clusters simultaneously.

Scenario 3: Complete Cluster Loss

Prerequisites

  • Recent database backups
  • Configuration backups
  • Documented cluster setup

Recovery Steps

# 1. Provision new infrastructure
# - Database cluster
# - Kubernetes cluster or VMs
# - Load balancers

# 2. Restore database
# (See database backup section above)

# 3. Deploy Cadence services
kubectl apply -f cadence-deployment.yaml

# 4. Restore configuration
kubectl create configmap cadence-config \
  --from-file=config.yaml=config-backup.yaml

# 5. Verify cluster health
# (909x is a placeholder -- substitute each service's health port)
for svc in frontend history matching worker; do
  curl http://$svc:909x/health
done

# 6. Verify workflows
cadence workflow list --domain my-domain
cadence workflow describe --workflow_id <wf-id> --domain my-domain

# 7. Resume workers
# Workers automatically reconnect and resume polling

Data Consistency

Consistency Guarantees

Cadence provides:
  • Single cluster: Strong consistency (via database transactions)
  • Global domains: Eventual consistency across datacenters
  • Archival: Eventually consistent (async archival process)

Consistency Checks

# Verify workflow state consistency
cadence admin workflow show \
  --workflow_id <wf-id> \
  --run_id <run-id>

# Check for orphaned workflows
cadence admin db scan --db-type cassandra

# Verify history branch consistency
cadence admin workflow list-history-branches

Handling Inconsistencies

# Repair workflow (if stuck)
cadence admin workflow refresh --workflow_id <wf-id>

# Delete corrupt workflow
cadence admin workflow delete \
  --workflow_id <wf-id> \
  --run_id <run-id>

# Re-sync domain metadata
cadence admin domain describe --domain <domain> --sync

Testing Disaster Recovery

DR Testing Checklist

  1. Backup verification
    # Test restore to isolated environment
    ./restore-backup.sh --test-mode --backup=latest
    
  2. Failover testing
    # Practice failover in non-production
    cadence --domain test-domain domain update --active_cluster dc2
    
  3. Recovery time measurement
    # Measure RTO (Recovery Time Objective)
    time ./disaster-recovery.sh
    
  4. Data loss assessment
    # Verify RPO (Recovery Point Objective)
    # Compare pre-failure and post-recovery workflow counts
    
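For the RTO measurement step, a small timing harness makes the numbers comparable across drills by reporting per-step durations rather than one opaque total. A sketch; the step commands are placeholders:

```shell
#!/bin/bash
# Sketch: time each recovery step individually so drills produce
# comparable per-step numbers, not just a single wall-clock total.
measure_step() {
  local label="$1"; shift
  local start=$(date +%s)
  "$@"
  echo "$label: $(( $(date +%s) - start ))s"
}
```

Usage during a drill: `measure_step restore-db ./restore-backup.sh --test-mode --backup=latest`.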

Chaos Engineering

Use chaos testing to validate DR procedures:
# Kubernetes chaos-mesh example
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-history-pod
spec:
  action: pod-kill
  selector:
    namespaces:
      - cadence
    labelSelectors:
      app: cadence-history
  mode: one
  scheduler:
    cron: "@hourly"

Monitoring and Alerting

Backup Monitoring

# Alert if backup is older than 25 hours
time() - cadence_backup_last_success_timestamp > 90000

# Alert on backup failures
rate(cadence_backup_failures[1h]) > 0
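The expressions above can be wrapped as Prometheus alerting rules. The metric names follow the examples in this guide and are assumptions — adjust them to whatever your backup job actually exports:

```yaml
# Sketch: Prometheus alerting rules for the backup expressions above.
# Metric names are illustrative; match them to your exporter.
groups:
  - name: cadence-backup
    rules:
      - alert: CadenceBackupStale
        expr: time() - cadence_backup_last_success_timestamp > 90000
        for: 15m
        labels:
          severity: page
        annotations:
          summary: "No successful Cadence backup in over 25 hours"
      - alert: CadenceBackupFailures
        expr: rate(cadence_backup_failures[1h]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cadence backup job is reporting failures"
```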

Replication Lag

# Alert if replication lag > 5 minutes
cadence_replication_lag_seconds > 300

# Alert on replication task failures
rate(cadence_replication_task_failures[5m]) > 10

Cross-Cluster Health

# Monitor all clusters
for cluster in dc1 dc2 dc3; do
  curl http://$cluster.cadence.example.com:7933/health
done

Best Practices

1. Automate Everything

  • Automated backup scripts
  • Automated restore testing
  • Automated failover procedures

2. Test Regularly

  • Monthly backup restoration tests
  • Quarterly full DR drills
  • Annual cross-region failover tests

3. Document Procedures

  • Runbooks for each DR scenario
  • Contact information for on-call
  • Decision trees for failover

4. Monitor Continuously

  • Backup success/failure
  • Replication lag
  • Cross-cluster connectivity

5. Maintain Redundancy

  • 3+ clusters for global domains
  • Multiple backup copies (on-site + off-site)
  • Geo-redundant archival storage
