Overview
Cadence Cross-DC Replication (also known as Cadence NDC - N Data Center) enables multi-region deployments with active-active replication for high availability and disaster recovery. The system asynchronously replicates workflows from active data centers to passive data centers.
In CAP-theorem terms, Cadence Cross-DC is an AP system: it favors availability and partition tolerance over strong consistency.
Architecture
Core Concepts
1. Version System
Version is a fundamental concept that describes the chronological order of events per domain.
Version Configuration
Each data center is configured with:
Initial Version: unique per data center (e.g., 0, 1, 2)
Shared Version Increment: common across all data centers (e.g., 10)
Constraint: initialVersion < sharedVersionIncrement
```yaml
clusterGroupMetadata:
  failoverVersionIncrement: 10   # Shared increment
  primaryClusterName: "cluster0"
  currentClusterName: "cluster0"
  clusterGroup:
    cluster0:
      enabled: true
      initialFailoverVersion: 0  # DC A: initial version = 0
      rpcAddress: "cluster0.example.com:7833"
      rpcTransport: "grpc"
    cluster1:
      enabled: true
      initialFailoverVersion: 1  # DC B: initial version = 1
      rpcAddress: "cluster1.example.com:7833"
      rpcTransport: "grpc"
    cluster2:
      enabled: true
      initialFailoverVersion: 2  # DC C: initial version = 2
      rpcAddress: "cluster2.example.com:7833"
      rpcTransport: "grpc"
```
Version Calculation During Failover
When failing over a domain from one DC to another:
```
newVersion = findSmallestVersion(
    version % sharedVersionIncrement == targetDC.initialVersion
    AND version >= currentDomainVersion
)
```
Example:
DC A: initialVersion = 1
DC B: initialVersion = 2
Shared increment: 10
Domain lifecycle:
| Time | Event | Active DC | Domain Version | Event Version |
|------|-------|-----------|----------------|---------------|
| T=0 | Domain registered | A | 1 | 1 |
| T=1 | Failover to B | B | 2 | 2 |
| T=2 | Failover to A | A | 11 | 11 |
| T=3 | Failover to B | B | 12 | 12 |
| T=4 | Failover to A | A | 21 | 21 |
2. Version History
Version history provides a high-level summary of workflow history events by version.
Version History Structure
Each entry in version history records:
Last event ID for that version
The version number
Example without conflict:
| Event ID | Event Version | Version History |
|----------|---------------|-----------------|
| 1 | 1 | {1: v1} |
| 2 | 1 | {2: v1} |
| 3 | 1 | {3: v1} |
| 4 | 2 | {3: v1, 4: v2} |
| 5 | 2 | {3: v1, 5: v2} |
3. Branch-Based History with Conflicts
When multiple DCs modify the same workflow during failover, history branches:
Version History (DC B):

```
{
  "3": {"version": 1},
  "4": {"version": 2}  // Branch A
}
```

Version History (DC C):

```
{
  "3": {"version": 1},
  "4": {"version": 3}  // Branch B - wins (higher version)
}
```
When replication tasks arrive:
Both branches exist as tree nodes
Branch with highest version becomes current branch
Mutable state rebuilt from current branch
4. Conflict Resolution
Cadence uses version-based conflict resolution:
Rule: The history branch with the highest version wins and becomes the current branch.
```go
// Pseudo-code for conflict resolution
func resolveConflict(branches []HistoryBranch) HistoryBranch {
	var currentBranch HistoryBranch
	maxVersion := int64(0)
	for _, branch := range branches {
		if branch.version > maxVersion {
			maxVersion = branch.version
			currentBranch = branch
		}
	}
	// Rebuild mutable state from the current branch
	rebuildMutableState(currentBranch)
	return currentBranch
}
```
When branch switching occurs:
Detect version conflict
Identify branch with highest version
Complete rebuild of workflow mutable state
Invalidate tasks from non-current branches
5. Zombie Workflows
Zombie workflows handle out-of-order replication of different runs.
Problem: Run 2 arrives before Run 1 due to network delays
Zombie State Rules:
Only one run can be active per (domain, workflowID)
Later runs take precedence
Earlier runs become zombies
Zombie workflows:
Cannot be mutated by the data center
Can only be updated via replication
Return to active when they complete
6. Task Invalidation
Tasks may become invalid due to history branch switching:
Task Validation:
```go
// Pseudo-code for task validation
func validateTask(task Task, mutableState MutableState) bool {
	currentVersion := mutableState.getEventVersion(task.eventID)
	if task.version != currentVersion {
		// Task belongs to a non-current branch
		return false
	}
	return true
}
```
Replication Flow
Workflow History Replication
History events produced in the active cluster are packaged into replication tasks and applied asynchronously by the passive clusters, which rebuild workflow mutable state from the replicated events.
Domain Metadata Replication
Domain registrations, configuration updates, and failovers are propagated to all clusters through the domain replication queue, so every cluster agrees on which cluster is active for each domain.
Configuration
Multi-Cluster Configuration Example
```yaml
clusterGroupMetadata:
  failoverVersionIncrement: 10
  primaryClusterName: "us-east-1"
  currentClusterName: "us-east-1"
  clusterGroup:
    us-east-1:
      enabled: true
      initialFailoverVersion: 0
      rpcAddress: "cadence-us-east-1.example.com:7833"
      rpcTransport: "grpc"
    us-west-2:
      enabled: true
      initialFailoverVersion: 1
      rpcAddress: "cadence-us-west-2.example.com:7833"
      rpcTransport: "grpc"
    eu-central-1:
      enabled: true
      initialFailoverVersion: 2
      rpcAddress: "cadence-eu-central-1.example.com:7833"
      rpcTransport: "grpc"
```
```yaml
# Kafka configuration for replication
kafka:
  clusters:
    us-east-1:
      brokers:
        - kafka1.us-east-1.example.com:9092
        - kafka2.us-east-1.example.com:9092
    us-west-2:
      brokers:
        - kafka1.us-west-2.example.com:9092
        - kafka2.us-west-2.example.com:9092
  topics:
    cadence-replication-us-east-1:
      cluster: us-east-1
    cadence-replication-us-west-2:
      cluster: us-west-2
```
Domain Configuration for Replication
```shell
# Register a global domain with replication
cadence --domain my-global-domain domain register \
  --global_domain true \
  --active_cluster us-east-1 \
  --clusters us-east-1 us-west-2 eu-central-1 \
  --retention 7
```
Domain Configuration:

```json
{
  "name": "my-global-domain",
  "isGlobalDomain": true,
  "activeClusterName": "us-east-1",
  "clusters": [
    {"clusterName": "us-east-1"},
    {"clusterName": "us-west-2"},
    {"clusterName": "eu-central-1"}
  ],
  "failoverVersion": 0,
  "workflowExecutionRetentionPeriodInDays": 7
}
```
Failover Operations
Manual Failover
```shell
# Failover domain from us-east-1 to us-west-2
cadence --domain my-global-domain domain update \
  --active_cluster us-west-2
```
What happens during failover:
Domain version incremented
New version = smallest version where version % 10 == 1 (us-west-2’s initial version)
Domain metadata replicated to all clusters
New workflow events use new version
Workers in us-west-2 become active
Graceful Failover
For zero workflow disruption:
```shell
# Step 1: Drain workflows in the source cluster
cadence --domain my-global-domain domain update \
  --active_cluster us-west-2 \
  --graceful_failover_timeout 60s

# Cadence will:
# 1. Stop accepting new workflows in us-east-1
# 2. Wait for existing workflows to complete (up to the timeout)
# 3. Fail over the remaining workflows
```
Automated Failover
Use the failover manager worker:
```yaml
dynamicconfig:
  EnableFailoverManager: true
```
Failover manager monitors:
Cluster health
Replication lag
Error rates
Auto-failover triggers:
Cluster outage detected
Replication lag exceeds threshold
Error rate above limit
Monitoring Replication
Key Metrics
Replication Lag:

```
cadence_replication_lag_seconds{source_cluster="us-east-1", target_cluster="us-west-2"}
```

Replication Task Processing:

```
cadence_replication_tasks_processed{cluster="us-west-2"}
cadence_replication_tasks_failed{cluster="us-west-2"}
```

Domain Replication:

```
cadence_domain_replication_queue_size
cadence_domain_replication_latency_ms
```
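These metrics can drive alerting. A hypothetical Prometheus alerting rule built on the lag metric above (the threshold and durations are examples, not recommended values):

```yaml
groups:
  - name: cadence-replication
    rules:
      - alert: CadenceReplicationLagHigh
        expr: cadence_replication_lag_seconds{target_cluster="us-west-2"} > 60
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Replication to us-west-2 is more than 60s behind"
```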
Replication Health Check
```shell
# Check replication status
cadence --domain my-global-domain workflow list

# Check domain failover history
cadence --domain my-global-domain domain describe
```
Best Practices
1. Choosing Version Increment
```yaml
failoverVersionIncrement: 10  # Supports up to 10 data centers
```
Small increment (10-100): Fewer data centers, simpler version numbers
Large increment (1000+): Many data centers, room for growth
2. Replication Topology
Hub-and-Spoke (fewer replication links; scales better as the number of clusters grows):

```
        DC-B
         |
DC-A (hub) --- DC-C
         |
        DC-D
```

Full Mesh (every cluster replicates directly with every other; practical for small cluster counts):

```
DC-A <---> DC-B
  ^          ^
  |          |
  v          v
DC-C <---> DC-D
```

(The DC-A <-> DC-D and DC-B <-> DC-C links are omitted from the diagram.)
3. Network Requirements
Latency: less than 100ms between regions recommended
Bandwidth: sufficient for replication throughput
Kafka: cross-region Kafka replication or MirrorMaker
4. Testing Failover
```shell
# Create a test workflow
cadence --domain test-domain workflow start \
  --workflow_type TestWorkflow \
  --tasklist test-tl \
  --workflow_id test-wf-1

# Trigger failover
cadence --domain test-domain domain update \
  --active_cluster us-west-2

# Verify the workflow continues
cadence --domain test-domain workflow describe \
  --workflow_id test-wf-1
```
5. Handling Conflicts
Minimize concurrent writes: avoid situations where two clusters accept writes for the same workflow (e.g., overlapping failovers)
Monitor version history: track branch creation
Failover discipline: avoid unnecessary failovers
Troubleshooting
High Replication Lag
Symptoms: Replication lag metric increasing

Diagnosis:

```shell
# Check Kafka consumer lag
kafka-consumer-groups --bootstrap-server kafka:9092 \
  --describe --group cadence-replication-worker
```

Solutions:
Scale up worker service instances
Increase Kafka partition count
Optimize network bandwidth
Stuck Replication Tasks
Symptoms: Workflows not replicating

Diagnosis:

```shell
# Check replication task errors
cadence admin workflow describe --workflow_id stuck-wf
```

Solutions:
Check for schema mismatches
Verify cluster connectivity
Review worker service logs
Version Conflicts
Symptoms: Unexpected workflow behavior after failover

Diagnosis:

```shell
# Examine version history
cadence workflow show \
  --workflow_id problematic-wf \
  --output json | jq '.versionHistories'
```

Solutions:
Review failover sequence
Check for split-brain scenarios
Verify domain configuration
Next Steps
Architecture Overview: return to the architecture overview
Services: learn about the service architecture