Overview
Cadence Cross-DC Replication (also known as Cadence NDC - N Data Center) enables multi-region deployments with active-active replication for high availability and disaster recovery. The system asynchronously replicates workflows from active data centers to passive data centers.
In CAP-theorem terms, Cadence Cross-DC is an AP system: it favors availability and partition tolerance over strong consistency.
Architecture
Core Concepts
1. Version System
Version is a fundamental concept that describes the chronological order of events per domain.
Version Configuration
Each data center is configured with:
Initial Version: unique per data center (e.g., 0, 1, 2)
Shared Version Increment: common across all data centers (e.g., 10)
Constraint: initialVersion < sharedVersionIncrement
```yaml
clusterGroupMetadata:
  failoverVersionIncrement: 10   # Shared increment
  primaryClusterName: "cluster0"
  currentClusterName: "cluster0"
  clusterGroup:
    cluster0:
      enabled: true
      initialFailoverVersion: 0  # DC A: initial version = 0
      rpcAddress: "cluster0.example.com:7833"
      rpcTransport: "grpc"
    cluster1:
      enabled: true
      initialFailoverVersion: 1  # DC B: initial version = 1
      rpcAddress: "cluster1.example.com:7833"
      rpcTransport: "grpc"
    cluster2:
      enabled: true
      initialFailoverVersion: 2  # DC C: initial version = 2
      rpcAddress: "cluster2.example.com:7833"
      rpcTransport: "grpc"
```
Version Calculation During Failover
When failing over a domain from one DC to another:
```
newVersion = findSmallestVersion(
    version % sharedVersionIncrement == targetDC.initialVersion
    AND version >= currentDomainVersion
)
```
Example:
DC A: initialVersion = 1
DC B: initialVersion = 2
Shared increment: 10
Domain lifecycle:
| Time | Event | Active DC | Domain Version | Event Version |
|------|-------|-----------|----------------|---------------|
| T=0 | Domain registered | A | 1 | 1 |
| T=1 | Failover to B | B | 2 | 2 |
| T=2 | Failover to A | A | 11 | 11 |
| T=3 | Failover to B | B | 12 | 12 |
| T=4 | Failover to A | A | 21 | 21 |
2. Version History
Version history provides a high-level summary of workflow history events by version.
Version History Structure
Each entry in version history records:
Last event ID for that version
The version number
Example without conflict:
| Event ID | Event Version | Version History |
|----------|---------------|-----------------|
| 1 | 1 | {1: v1} |
| 2 | 1 | {2: v1} |
| 3 | 1 | {3: v1} |
| 4 | 2 | {3: v1, 4: v2} |
| 5 | 2 | {3: v1, 5: v2} |
3. Branch-Based History with Conflicts
When multiple DCs modify the same workflow during failover, history branches:
Version History (DC B):

```
{
  "3": {"version": 1},
  "4": {"version": 2}  // Branch A
}
```

Version History (DC C):

```
{
  "3": {"version": 1},
  "4": {"version": 3}  // Branch B - wins (higher version)
}
```
When replication tasks arrive:
Both branches exist as tree nodes
Branch with highest version becomes current branch
Mutable state rebuilt from current branch
4. Conflict Resolution
Cadence uses version-based conflict resolution:
Rule: The history branch with the highest version wins and becomes the current branch.
```go
// Pseudo-code for conflict resolution
func resolveConflict(branches []HistoryBranch) HistoryBranch {
	var currentBranch HistoryBranch
	maxVersion := int64(0)
	for _, branch := range branches {
		if branch.version > maxVersion {
			maxVersion = branch.version
			currentBranch = branch
		}
	}
	// Rebuild mutable state from the current branch
	rebuildMutableState(currentBranch)
	return currentBranch
}
```
When branch switching occurs:
Detect version conflict
Identify branch with highest version
Complete rebuild of workflow mutable state
Invalidate tasks from non-current branches
5. Zombie Workflows
Zombie workflows handle out-of-order replication of different runs.
Problem: Run 2 arrives before Run 1 due to network delays
Zombie State Rules:
Only one run can be active per (domain, workflowID)
Later runs take precedence
Earlier runs become zombies
Zombie workflows:
Cannot be mutated by the data center
Can only be updated via replication
Return to active when they complete
6. Task Invalidation
Tasks may become invalid due to history branch switching:
Task Validation:
```go
// Pseudo-code for task validation
func validateTask(task Task, mutableState MutableState) bool {
	currentVersion := mutableState.getEventVersion(task.eventID)
	if task.version != currentVersion {
		// Task belongs to a non-current branch
		return false
	}
	return true
}
```
Replication Flow
Workflow History Replication
History events produced in the active cluster are packaged into replication tasks and applied asynchronously by the passive clusters, which rebuild workflow mutable state from the replicated events.
Domain Metadata Replication
Domain registrations, configuration updates, and failovers are propagated to all clusters through the domain replication queue, so every cluster agrees on which cluster is active for each domain.
Configuration
Multi-Cluster Configuration Example
```yaml
clusterGroupMetadata:
  failoverVersionIncrement: 10
  primaryClusterName: "us-east-1"
  currentClusterName: "us-east-1"
  clusterGroup:
    us-east-1:
      enabled: true
      initialFailoverVersion: 0
      rpcAddress: "cadence-us-east-1.example.com:7833"
      rpcTransport: "grpc"
    us-west-2:
      enabled: true
      initialFailoverVersion: 1
      rpcAddress: "cadence-us-west-2.example.com:7833"
      rpcTransport: "grpc"
    eu-central-1:
      enabled: true
      initialFailoverVersion: 2
      rpcAddress: "cadence-eu-central-1.example.com:7833"
      rpcTransport: "grpc"
```
```yaml
# Kafka configuration for replication
kafka:
  clusters:
    us-east-1:
      brokers:
        - kafka1.us-east-1.example.com:9092
        - kafka2.us-east-1.example.com:9092
    us-west-2:
      brokers:
        - kafka1.us-west-2.example.com:9092
        - kafka2.us-west-2.example.com:9092
  topics:
    cadence-replication-us-east-1:
      cluster: us-east-1
    cadence-replication-us-west-2:
      cluster: us-west-2
```
Domain Configuration for Replication
```shell
# Register a global domain with replication
cadence --domain my-global-domain domain register \
  --global_domain true \
  --active_cluster us-east-1 \
  --clusters us-east-1 us-west-2 eu-central-1 \
  --retention 7
```
Domain Configuration:

```json
{
  "name": "my-global-domain",
  "isGlobalDomain": true,
  "activeClusterName": "us-east-1",
  "clusters": [
    {"clusterName": "us-east-1"},
    {"clusterName": "us-west-2"},
    {"clusterName": "eu-central-1"}
  ],
  "failoverVersion": 0,
  "workflowExecutionRetentionPeriodInDays": 7
}
```
Failover Operations
Manual Failover
```shell
# Failover domain from us-east-1 to us-west-2
cadence --domain my-global-domain domain update \
  --active_cluster us-west-2
```
What happens during failover:
Domain version incremented
New version = smallest version where version % 10 == 1 (us-west-2’s initial version)
Domain metadata replicated to all clusters
New workflow events use new version
Workers in us-west-2 become active
Graceful Failover
For zero workflow disruption:
```shell
# Step 1: Drain workflows in the source cluster
cadence --domain my-global-domain domain update \
  --active_cluster us-west-2 \
  --graceful_failover_timeout 60s

# Cadence will:
# 1. Stop accepting new workflows in us-east-1
# 2. Wait for existing workflows to complete (up to the timeout)
# 3. Fail over the remaining workflows
```
Automated Failover
Use the failover manager worker:
```yaml
dynamicconfig:
  EnableFailoverManager: true
```
Failover manager monitors:
Cluster health
Replication lag
Error rates
Auto-failover triggers:
Cluster outage detected
Replication lag exceeds threshold
Error rate above limit
Monitoring Replication
Key Metrics
Replication Lag:

```
cadence_replication_lag_seconds{source_cluster="us-east-1", target_cluster="us-west-2"}
```

Replication Task Processing:

```
cadence_replication_tasks_processed{cluster="us-west-2"}
cadence_replication_tasks_failed{cluster="us-west-2"}
```

Domain Replication:

```
cadence_domain_replication_queue_size
cadence_domain_replication_latency_ms
```
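These metrics can drive alerting. A hypothetical Prometheus alerting rule built on the lag metric above (the threshold and durations are examples, not recommended values):

```yaml
groups:
  - name: cadence-replication
    rules:
      - alert: CadenceReplicationLagHigh
        expr: cadence_replication_lag_seconds{target_cluster="us-west-2"} > 60
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Replication to us-west-2 is more than 60s behind"
```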
Replication Health Check
```shell
# Check replication status
cadence --domain my-global-domain workflow list

# Check domain failover history
cadence --domain my-global-domain domain describe
```
Best Practices
1. Choosing Version Increment
```yaml
failoverVersionIncrement: 10  # Supports up to 10 data centers
```
Small increment (10-100): Fewer data centers, simpler version numbers
Large increment (1000+): Many data centers, room for growth
2. Replication Topology
Hub-and-Spoke (fewer replication links; scales better as the number of clusters grows):

```
        DC-B
         |
DC-A (hub) --- DC-C
         |
        DC-D
```

Full Mesh (every cluster replicates directly with every other; practical for small cluster counts):

```
DC-A <---> DC-B
  ^          ^
  |          |
  v          v
DC-C <---> DC-D
```

(The DC-A <-> DC-D and DC-B <-> DC-C links are omitted from the diagram.)
3. Network Requirements
Latency: less than 100ms between regions recommended
Bandwidth: sufficient for replication throughput
Kafka: cross-region Kafka replication or MirrorMaker
4. Testing Failover
```shell
# Create a test workflow
cadence --domain test-domain workflow start \
  --workflow_type TestWorkflow \
  --tasklist test-tl \
  --workflow_id test-wf-1

# Trigger failover
cadence --domain test-domain domain update \
  --active_cluster us-west-2

# Verify the workflow continues
cadence --domain test-domain workflow describe \
  --workflow_id test-wf-1
```
5. Handling Conflicts
Minimize concurrent writes: avoid situations where two clusters accept writes for the same workflow (e.g., overlapping failovers)
Monitor version history: track branch creation
Failover discipline: avoid unnecessary failovers
Troubleshooting
High Replication Lag
Symptoms: Replication lag metric increasing

Diagnosis:

```shell
# Check Kafka consumer lag
kafka-consumer-groups --bootstrap-server kafka:9092 \
  --describe --group cadence-replication-worker
```

Solutions:
Scale up worker service instances
Increase Kafka partition count
Optimize network bandwidth
Stuck Replication Tasks
Symptoms: Workflows not replicating

Diagnosis:

```shell
# Check replication task errors
cadence admin workflow describe --workflow_id stuck-wf
```

Solutions:
Check for schema mismatches
Verify cluster connectivity
Review worker service logs
Version Conflicts
Symptoms: Unexpected workflow behavior after failover

Diagnosis:

```shell
# Examine version history
cadence workflow show \
  --workflow_id problematic-wf \
  --output json | jq '.versionHistories'
```

Solutions:
Review failover sequence
Check for split-brain scenarios
Verify domain configuration
Next Steps
Architecture Overview: return to the architecture overview
Services: learn about the service architecture