
Overview

Cadence Cross-DC Replication (also known as Cadence NDC - N Data Center) enables multi-region deployments for high availability and disaster recovery. A domain is active in one data center at a time; the system asynchronously replicates its workflows to the passive data centers, and the active side can be failed over between data centers.
Cadence Cross-DC is an AP system (Available and Partition-tolerant) in terms of the CAP theorem: it prioritizes availability and partition tolerance over strong consistency.

Architecture

Core Concepts

1. Version System

Version is a fundamental concept that establishes the chronological order of events within a domain: every workflow event is stamped with the domain's current failover version, so replicas can tell which data center wrote it and in what order.

Version Configuration

Each data center is configured with:
  • Initial Version: Unique per data center (e.g., 0, 1, 2)
  • Shared Version Increment: Common across all data centers (e.g., 10)
Constraint: initialVersion < sharedVersionIncrement
clusterGroupMetadata:
  failoverVersionIncrement: 10          # Shared increment
  primaryClusterName: "cluster0"
  currentClusterName: "cluster0"
  clusterGroup:
    cluster0:
      enabled: true
      initialFailoverVersion: 0         # DC A: initial version = 0
      rpcAddress: "cluster0.example.com:7833"
      rpcTransport: "grpc"
    cluster1:
      enabled: true
      initialFailoverVersion: 1         # DC B: initial version = 1  
      rpcAddress: "cluster1.example.com:7833"
      rpcTransport: "grpc"
    cluster2:
      enabled: true
      initialFailoverVersion: 2         # DC C: initial version = 2
      rpcAddress: "cluster2.example.com:7833"
      rpcTransport: "grpc"

Version Calculation During Failover

When failing over a domain from one DC to another:
newVersion = findSmallestVersion(
    version % sharedVersionIncrement == targetDC.initialVersion
    AND version >= currentDomainVersion
)
Example:
  • DC A: initialVersion = 1
  • DC B: initialVersion = 2
  • Shared increment: 10
Domain lifecycle:
Time | Event             | Active DC | Domain Version | Event Version
-----|-------------------|-----------|----------------|--------------
T=0  | Domain registered | A         | 1              | 1
T=1  | Failover to B     | B         | 2              | 2
T=2  | Failover to A     | A         | 11             | 11
T=3  | Failover to B     | B         | 12             | 12
T=4  | Failover to A     | A         | 21             | 21
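The failover version arithmetic in the lifecycle above can be sketched in Go. This is illustrative only (nextFailoverVersion is not Cadence's actual implementation); it computes the smallest version that maps to the target DC and is not lower than the current domain version:

```go
package main

import "fmt"

// nextFailoverVersion returns the smallest version v such that
// v % increment == targetInitialVersion and v >= currentVersion.
func nextFailoverVersion(targetInitialVersion, currentVersion, increment int64) int64 {
	v := (currentVersion/increment)*increment + targetInitialVersion
	if v < currentVersion {
		v += increment
	}
	return v
}

func main() {
	const increment = 10
	// Reproduce the domain lifecycle table: DC A initial = 1, DC B initial = 2.
	fmt.Println(nextFailoverVersion(2, 1, increment))  // T=1: failover to B -> 2
	fmt.Println(nextFailoverVersion(1, 2, increment))  // T=2: failover to A -> 11
	fmt.Println(nextFailoverVersion(2, 11, increment)) // T=3: failover to B -> 12
	fmt.Println(nextFailoverVersion(1, 12, increment)) // T=4: failover to A -> 21
}
```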

2. Version History

Version history provides a high-level summary of workflow history events by version.

Version History Structure

Each entry in version history records:
  • Last event ID for that version
  • The version number
Example without conflict:
Event ID | Event Version | Version History
---------|---------------|----------------
1        | 1             | {1: v1}
2        | 1             | {2: v1}
3        | 1             | {3: v1}
4        | 2             | {3: v1, 4: v2}
5        | 2             | {3: v1, 5: v2}
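A minimal sketch of how such a summary could be maintained as events append (historyItem and appendEvent are illustrative names, not Cadence's internal types): the last entry is extended while the version is unchanged, and a new entry starts when the version changes.

```go
package main

import "fmt"

// historyItem summarizes a run of events: the last event ID written at a version.
type historyItem struct {
	lastEventID int64
	version     int64
}

// appendEvent updates the version history for a new event: extend the last
// entry if the version is unchanged, otherwise start a new entry.
func appendEvent(history []historyItem, eventID, version int64) []historyItem {
	n := len(history)
	if n > 0 && history[n-1].version == version {
		history[n-1].lastEventID = eventID
		return history
	}
	return append(history, historyItem{lastEventID: eventID, version: version})
}

func main() {
	var h []historyItem
	// Replays the table above: events 1-3 at version 1, events 4-5 at version 2.
	for _, e := range []struct{ id, v int64 }{{1, 1}, {2, 1}, {3, 1}, {4, 2}, {5, 2}} {
		h = appendEvent(h, e.id, e.v)
	}
	fmt.Println(h) // prints "[{3 1} {5 2}]"
}
```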

3. Branch-Based History with Conflicts

When multiple DCs modify the same workflow during failover, history branches.
Version History (DC B):
{
  "3": {"version": 1},
  "4": {"version": 2}   // Branch A
}
Version History (DC C):
{
  "3": {"version": 1},
  "4": {"version": 3}   // Branch B - wins (higher version)
}
When replication tasks arrive:
  • Both branches exist as tree nodes
  • Branch with highest version becomes current branch
  • Mutable state rebuilt from current branch

4. Conflict Resolution

Cadence uses version-based conflict resolution.
Rule: The history branch with the highest version wins and becomes the current branch.
// Pseudo-code for conflict resolution: the branch whose last event has the
// highest version wins.
func resolveConflict(branches []HistoryBranch) HistoryBranch {
    currentBranch := branches[0]
    for _, branch := range branches[1:] {
        // branch.version: the version of the branch's last event
        if branch.version > currentBranch.version {
            currentBranch = branch
        }
    }

    // Rebuild mutable state from the winning branch
    rebuildMutableState(currentBranch)
    return currentBranch
}
When branch switching occurs:
  1. Detect version conflict
  2. Identify branch with highest version
  3. Complete rebuild of workflow mutable state
  4. Invalidate tasks from non-current branches

5. Zombie Workflows

Zombie workflows handle out-of-order replication of different runs of the same workflow.
Problem: Run 2 arrives before Run 1 due to network delays.
Zombie State Rules:
  • Only one run can be active per (domain, workflowID)
  • Later runs take precedence
  • Earlier runs become zombies
  • Zombie workflows:
    • Cannot be mutated by the data center
    • Can only be updated via replication
    • Are closed out via replication once their remaining events arrive
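The precedence rules above can be sketched as follows; resolveRunState and the ordering value are illustrative assumptions (not Cadence internals), standing in for whatever ordering the runs carry (e.g. start version):

```go
package main

import "fmt"

// runState is the fate of a replicated run for a (domain, workflowID) pair.
type runState int

const (
	stateActive runState = iota
	stateZombie
)

// resolveRunState: the later run holds the single active slot for the
// (domain, workflowID) pair; an earlier run arriving late parks as a zombie.
func resolveRunState(incomingRunOrder, activeRunOrder int64) runState {
	if incomingRunOrder > activeRunOrder {
		return stateActive // newer run takes over the slot
	}
	return stateZombie // older run arrives late: replicate-only zombie
}

func main() {
	// Run 2 is already active; Run 1 arrives late and becomes a zombie.
	fmt.Println(resolveRunState(1, 2) == stateZombie) // prints "true"
}
```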

6. Task Invalidation

Tasks may become invalid due to history branch switching.
Task Validation:
// Pseudo-code for task validation
func validateTask(task Task, mutableState MutableState) bool {
    // Version of the event on the current branch
    currentVersion := mutableState.getEventVersion(task.eventID)

    // A mismatch means the task was generated on a non-current branch
    return task.version == currentVersion
}

Replication Flow

Workflow History Replication

Domain Metadata Replication

Configuration

Multi-Cluster Configuration Example

clusterGroupMetadata:
  failoverVersionIncrement: 10
  primaryClusterName: "us-east-1"
  currentClusterName: "us-east-1"
  clusterGroup:
    us-east-1:
      enabled: true
      initialFailoverVersion: 0
      rpcAddress: "cadence-us-east-1.example.com:7833"
      rpcTransport: "grpc"
    us-west-2:
      enabled: true
      initialFailoverVersion: 1
      rpcAddress: "cadence-us-west-2.example.com:7833"
      rpcTransport: "grpc"
    eu-central-1:
      enabled: true
      initialFailoverVersion: 2
      rpcAddress: "cadence-eu-central-1.example.com:7833"
      rpcTransport: "grpc"

# Kafka configuration for replication
kafka:
  clusters:
    us-east-1:
      brokers:
        - kafka1.us-east-1.example.com:9092
        - kafka2.us-east-1.example.com:9092
    us-west-2:
      brokers:
        - kafka1.us-west-2.example.com:9092
        - kafka2.us-west-2.example.com:9092
  topics:
    cadence-replication-us-east-1:
      cluster: us-east-1
    cadence-replication-us-west-2:
      cluster: us-west-2

Domain Configuration for Replication

# Register a global domain with replication
cadence --domain my-global-domain domain register \
    --global_domain true \
    --active_cluster us-east-1 \
    --clusters us-east-1 us-west-2 eu-central-1 \
    --retention 7
Domain Configuration:
{
  "name": "my-global-domain",
  "isGlobalDomain": true,
  "activeClusterName": "us-east-1",
  "clusters": [
    {"clusterName": "us-east-1"},
    {"clusterName": "us-west-2"},
    {"clusterName": "eu-central-1"}
  ],
  "failoverVersion": 0,
  "workflowExecutionRetentionPeriodInDays": 7
}

Failover Operations

Manual Failover

# Failover domain from us-east-1 to us-west-2
cadence --domain my-global-domain domain update \
    --active_cluster us-west-2
What happens during failover:
  1. Domain version incremented
  2. New version = smallest version where version % 10 == 1 (us-west-2’s initial version)
  3. Domain metadata replicated to all clusters
  4. New workflow events use new version
  5. Workers in us-west-2 become active

Graceful Failover

For zero workflow disruption:
# Step 1: Drain workflows in source cluster
cadence --domain my-global-domain domain update \
    --active_cluster us-west-2 \
    --graceful_failover_timeout 60s

# Cadence will:
# 1. Stop accepting new workflows in us-east-1
# 2. Wait for existing workflows to complete (up to timeout)
# 3. Failover remaining workflows

Automated Failover

Use the failover manager worker:
dynamicconfig:
  EnableFailoverManager: true
Failover manager monitors:
  • Cluster health
  • Replication lag
  • Error rates
Auto-failover triggers:
  • Cluster outage detected
  • Replication lag exceeds threshold
  • Error rate above limit

Monitoring Replication

Key Metrics

Replication Lag:
cadence_replication_lag_seconds{source_cluster="us-east-1", target_cluster="us-west-2"}
Replication Task Processing:
cadence_replication_tasks_processed{cluster="us-west-2"}
cadence_replication_tasks_failed{cluster="us-west-2"}
Domain Replication:
cadence_domain_replication_queue_size
cadence_domain_replication_latency_ms

Replication Health Check

# Check replication status
cadence --domain my-global-domain workflow list

# Check domain failover history
cadence --domain my-global-domain domain describe

Best Practices

1. Choosing Version Increment

failoverVersionIncrement: 10  # Supports up to 10 data centers
  • Small increment (10-100): Fewer data centers, simpler version numbers
  • Large increment (1000+): Many data centers, room for growth
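The increment also determines how a version is traced back to its writer: version % failoverVersionIncrement recovers the writing cluster's initial version. A hedged sketch (the map and function name are illustrative, not Cadence's API):

```go
package main

import "fmt"

// clusterForVersion maps a failover version back to the owning cluster,
// assuming each cluster is identified by version % failoverVersionIncrement.
func clusterForVersion(version, increment int64, byInitialVersion map[int64]string) string {
	return byInitialVersion[version%increment]
}

func main() {
	clusters := map[int64]string{0: "us-east-1", 1: "us-west-2", 2: "eu-central-1"}
	fmt.Println(clusterForVersion(21, 10, clusters)) // prints "us-west-2"
}
```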

2. Replication Topology

Hub-and-Spoke (recommended for fewer than 5 clusters):
     DC-B
       |
 DC-A (hub) --- DC-C
       |
     DC-D
Full Mesh (every cluster pair replicates directly; practical for 2-3 clusters):
DC-A <---> DC-B
    \      /
     \    /
      DC-C
All links are bidirectional.

3. Network Requirements

  • Latency: Less than 100ms between regions recommended
  • Bandwidth: Sufficient for replication throughput
  • Kafka: Cross-region Kafka replication or MirrorMaker

4. Testing Failover

# Create test workflow
cadence --domain test-domain workflow start \
    --workflow_type TestWorkflow \
    --tasklist test-tl \
    --workflow_id test-wf-1

# Trigger failover
cadence --domain test-domain domain update \
    --active_cluster us-west-2

# Verify workflow continues
cadence --domain test-domain workflow describe \
    --workflow_id test-wf-1

5. Handling Conflicts

  • Minimize concurrent writes: Use sticky domains when possible
  • Monitor version history: Track branch creation
  • Failover discipline: Avoid unnecessary failovers

Troubleshooting

High Replication Lag

Symptoms: Replication lag metric increasing
Diagnosis:
# Check Kafka consumer lag
kafka-consumer-groups --bootstrap-server kafka:9092 \
    --describe --group cadence-replication-worker
Solutions:
  • Scale up worker service instances
  • Increase Kafka partition count
  • Optimize network bandwidth

Stuck Replication Tasks

Symptoms: Workflows not replicating
Diagnosis:
# Check replication task errors
cadence admin workflow describe --workflow_id stuck-wf
Solutions:
  • Check for schema mismatches
  • Verify cluster connectivity
  • Review worker service logs

Version Conflicts

Symptoms: Unexpected workflow behavior after failover
Diagnosis:
# Examine version history
cadence workflow show \
    --workflow_id problematic-wf \
    --output json | jq '.versionHistories'
Solutions:
  • Review failover sequence
  • Check for split-brain scenarios
  • Verify domain configuration

Next Steps

Architecture Overview

Return to architecture overview

Services

Learn about service architecture
