Isolation Groups provide zone-aware routing of workflow tasks to workers, enabling fault isolation, disaster recovery, and efficient resource utilization across availability zones or data centers.

Overview

Isolation Groups enable:
  • Zone Awareness: Route tasks to workers in specific zones
  • Fault Isolation: Contain failures to specific zones
  • Graceful Draining: Drain zones for maintenance without workflow impact
  • Load Balancing: Distribute load across healthy zones
  • Multi-Region Support: Route tasks across geographic regions

Concepts

Isolation Group

A logical grouping of workers, typically corresponding to:
  • Availability zone (e.g., us-east-1a, us-east-1b)
  • Data center
  • Kubernetes cluster
  • Worker deployment

Drain State

An isolation group (zone) is in one of three states:
  • Healthy: Actively receives new tasks
  • Draining: No new tasks, existing workflows continue
  • Drained: No tasks routed to this zone
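
The three states can be modeled as a small enum with two predicates. This is only an illustrative sketch: the `DRAINED` string and the helper names are assumptions, since the configuration examples in this page only show `HEALTHY` and `DRAINING`.

```python
from enum import Enum


class DrainState(Enum):
    HEALTHY = "HEALTHY"
    DRAINING = "DRAINING"
    DRAINED = "DRAINED"  # assumed spelling; not shown in the config examples


def accepts_new_tasks(state: DrainState) -> bool:
    """Only healthy groups receive new tasks."""
    return state is DrainState.HEALTHY


def runs_existing_workflows(state: DrainState) -> bool:
    """Healthy and draining groups keep serving workflows already placed there."""
    return state in (DrainState.HEALTHY, DrainState.DRAINING)
```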

Task List Partitioning

Task lists are partitioned across isolation groups for parallel processing and isolation.

Configuration

Global Isolation Groups

Define isolation groups at the cluster level:
# Get current isolation groups
cadence admin cluster get-isolation-groups

# Set isolation groups (JSON format)
cadence admin cluster update-global-isolation-groups \
  --json '{
    "isolationGroups": [
      {"name": "us-east-1a", "state": "HEALTHY"},
      {"name": "us-east-1b", "state": "HEALTHY"},
      {"name": "us-east-1c", "state": "HEALTHY"}
    ]
  }'
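
Hand-written JSON is easy to get wrong, so it may help to generate the payload. The sketch below builds the same `isolationGroups` structure from a mapping of group name to state. The helper name `build_isolation_groups` and the set of accepted state strings are assumptions made for illustration.

```python
import json

# Assumed set of state strings; adjust to what your cluster accepts.
VALID_STATES = {"HEALTHY", "DRAINING", "DRAINED"}


def build_isolation_groups(states: dict) -> str:
    """Return the JSON payload for a mapping of group name -> state."""
    for name, state in states.items():
        if state not in VALID_STATES:
            raise ValueError(f"unknown state {state!r} for group {name!r}")
    payload = {
        "isolationGroups": [
            {"name": name, "state": state} for name, state in states.items()
        ]
    }
    return json.dumps(payload)


if __name__ == "__main__":
    print(build_isolation_groups({
        "us-east-1a": "HEALTHY",
        "us-east-1b": "HEALTHY",
        "us-east-1c": "HEALTHY",
    }))
```

The printed string can be passed as the value of the `--json` flag shown above.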

Domain-Specific Isolation

Apply isolation to specific domains:
# Update domain with isolation groups
cadence admin domain update-isolation-groups \
  --domain my-domain \
  --json '{
    "isolationGroups": [
      {"name": "zone-1", "state": "HEALTHY"},
      {"name": "zone-2", "state": "DRAINING"},
      {"name": "zone-3", "state": "HEALTHY"}
    ]
  }'

# Get domain isolation groups
cadence admin domain get-isolation-groups --domain my-domain

Worker Configuration

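Workers must identify their isolation group.
Go SDK:
package main

import (
    "go.uber.org/cadence/.gen/go/cadence/workflowserviceclient"
    "go.uber.org/cadence/worker"
    "go.uber.org/yarpc"
    "go.uber.org/yarpc/transport/tchannel"
)

func main() {
    // Build a YARPC dispatcher that talks to the Cadence frontend over TChannel
    ch, err := tchannel.NewChannelTransport(tchannel.ServiceName("my-worker"))
    if err != nil {
        panic(err)
    }
    dispatcher := yarpc.NewDispatcher(yarpc.Config{
        Name: "my-worker",
        Outbounds: yarpc.Outbounds{
            "cadence-frontend": {Unary: ch.NewSingleOutbound("cadence-frontend:7933")},
        },
    })
    if err := dispatcher.Start(); err != nil {
        panic(err)
    }
    service := workflowserviceclient.New(dispatcher.ClientConfig("cadence-frontend"))

    // Create worker with isolation group
    w := worker.New(service, "my-domain", "my-task-list", worker.Options{
        Identity:       "worker-1",
        IsolationGroup: "us-east-1a", // Set zone
    })

    // Register workflows and activities here, then run until interrupted
    if err := w.Run(); err != nil {
        panic(err)
    }
}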
Java SDK:
WorkerOptions options = WorkerOptions.newBuilder()
    .setIdentity("worker-1")
    .setIsolationGroup("us-east-1a")
    .build();

Worker worker = workerFactory.newWorker("my-task-list", options);

Use Cases

Availability Zone Isolation

Isolate workers by AWS availability zone:
# 3 availability zones
isolationGroups:
  - name: "us-east-1a"
    state: "HEALTHY"
  - name: "us-east-1b"
    state: "HEALTHY"
  - name: "us-east-1c"
    state: "HEALTHY"
Benefits:
  • AZ failure contained to that zone’s workflows
  • New workflows distributed to healthy AZs
  • Ongoing workflows in failed AZ can be recovered

Graceful Zone Draining

Drain a zone for maintenance:
# Step 1: Mark zone as draining (it stops receiving new tasks)
cadence admin cluster update-global-isolation-groups \
  --json '{
    "isolationGroups": [
      {"name": "us-east-1a", "state": "DRAINING"},
      {"name": "us-east-1b", "state": "HEALTHY"},
      {"name": "us-east-1c", "state": "HEALTHY"}
    ]
  }'

# Step 2: Wait for in-flight workflows to complete
# Monitor: cadence --do my-domain workflow list --open

# Step 3: Shut down workers in us-east-1a
kubectl scale deployment worker-us-east-1a --replicas=0

# Step 4: Perform maintenance

# Step 5: Restore zone
cadence admin cluster update-global-isolation-groups \
  --json '{
    "isolationGroups": [
      {"name": "us-east-1a", "state": "HEALTHY"},
      {"name": "us-east-1b", "state": "HEALTHY"},
      {"name": "us-east-1c", "state": "HEALTHY"}
    ]
  }'
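
Step 2 of the procedure above amounts to polling until no open workflows remain in the draining zone. A minimal sketch of that wait loop follows. How open workflows are counted is left to the caller: `count_open` is a hypothetical callback that might wrap the CLI or a metrics query, and the default intervals are arbitrary.

```python
import time


def wait_for_drain(count_open, poll_seconds=30.0, timeout_seconds=3600.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll count_open() until it returns 0 or the timeout elapses.

    Returns True if the zone drained, False on timeout.
    """
    deadline = clock() + timeout_seconds
    while True:
        if count_open() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
```

Only shut down the workers (step 3) once this returns True. On a timeout, investigate the remaining workflows before scaling to zero.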

Multi-Region Deployment

Route tasks to specific regions:
isolationGroups:
  - name: "us-west-2"
    state: "HEALTHY"
  - name: "eu-west-1"
    state: "HEALTHY"
  - name: "ap-southeast-1"
    state: "HEALTHY"
Worker deployment:
# US region workers
ISOLATION_GROUP=us-west-2 ./worker start

# EU region workers
ISOLATION_GROUP=eu-west-1 ./worker start

# Asia region workers
ISOLATION_GROUP=ap-southeast-1 ./worker start
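
A worker started this way needs to read `ISOLATION_GROUP` at startup and fail fast if it is missing or misspelled. The following sketch shows one way to do that. The function name and the `KNOWN_GROUPS` set are illustrative assumptions, not part of any Cadence SDK.

```python
import os

# Assumed to mirror the groups configured on the cluster.
KNOWN_GROUPS = {"us-west-2", "eu-west-1", "ap-southeast-1"}


def resolve_isolation_group(env=None) -> str:
    """Return the worker's isolation group from the environment, or raise."""
    env = os.environ if env is None else env
    group = env.get("ISOLATION_GROUP", "").strip()
    if not group:
        raise RuntimeError("ISOLATION_GROUP is not set")
    if group not in KNOWN_GROUPS:
        raise RuntimeError(f"unknown isolation group {group!r}")
    return group
```

Failing at startup avoids a worker silently polling with a group name the cluster does not know about.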

Canary Deployments

Gradually roll out new worker versions:
# Initial state: all traffic to stable
isolationGroups:
  - name: "stable"
    state: "HEALTHY"
  - name: "canary"
    state: "DRAINING"    # No traffic initially

# Step 1: Mark canary healthy so it starts receiving a share of traffic
isolationGroups:
  - name: "stable"
    state: "HEALTHY"
  - name: "canary"
    state: "HEALTHY"     # Start receiving traffic

# Step 2: Monitor canary metrics
# If healthy, drain stable

# Step 3: Switch to canary
isolationGroups:
  - name: "stable"
    state: "DRAINING"
  - name: "canary"
    state: "HEALTHY"

# Step 4: Complete migration
isolationGroups:
  - name: "canary"
    state: "HEALTHY"

Task Routing

Load Balancing Algorithm

Tasks are distributed across healthy isolation groups:
  1. Filter: Exclude draining/drained groups for new workflows
  2. Balance: Distribute tasks evenly across healthy groups
  3. Sticky: Existing workflows stay in their group if healthy
  4. Fallback: Redirect if group becomes unhealthy
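
The four steps above can be sketched as a toy router. This is not Cadence's matching implementation. It assumes round-robin for the balance step, and it assumes that a workflow stays in its group while that group is `HEALTHY` or `DRAINING` and is redirected only once the group is `DRAINED`.

```python
import itertools


class Router:
    """Toy model of filter / balance / sticky / fallback routing."""

    def __init__(self, states):
        self.states = dict(states)   # group name -> state string
        self.assigned = {}           # workflow id -> group name
        self._counter = itertools.count()

    def healthy_groups(self):
        # Filter: only healthy groups are candidates for new placements.
        return sorted(g for g, s in self.states.items() if s == "HEALTHY")

    def route(self, workflow_id):
        # Sticky: keep an existing workflow in its group while the group
        # still serves existing work.
        current = self.assigned.get(workflow_id)
        if current is not None and self.states.get(current) in ("HEALTHY", "DRAINING"):
            return current
        # Balance (and fallback): round-robin across the healthy groups.
        healthy = self.healthy_groups()
        if not healthy:
            raise RuntimeError("no healthy isolation groups")
        group = healthy[next(self._counter) % len(healthy)]
        self.assigned[workflow_id] = group
        return group
```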

Partition Configuration

Control partitioning with dynamic config:
matching.isolationGroupPartitions:
  - value: 3
    constraints:
      domainName: "my-domain"
      taskListName: "my-task-list"
More partitions improve parallelism but add coordination overhead.

Sticky Task Lists

Isolation groups work with sticky task lists:
  • Sticky task lists remain in their isolation group
  • Group health affects sticky routing
  • Draining groups stop receiving new sticky tasks

Monitoring

Key Metrics

Task Distribution:
# Tasks per isolation group
sum by (isolation_group) (
  rate(cadence_matching_tasks_dispatched[5m])
)

# Isolation group health
cadence_isolation_group_health{isolation_group="us-east-1a"}
Drain Progress:
# Open workflows in draining group
sum(
  cadence_workflows_open{
    isolation_group="us-east-1a",
    state="draining"
  }
)

CLI Monitoring

# List isolation groups
cadence admin cluster get-isolation-groups

# Check domain-specific groups
cadence admin domain get-isolation-groups --domain my-domain

# View task list by isolation group
cadence --do my-domain tasklist describe \
  --tl my-task-list \
  --tlt decision

Best Practices

Deployment Strategy

  • Start with 3 Zones: Balance availability and complexity
  • Use Existing Infrastructure: Align with AZ/region boundaries
  • Test Draining: Practice zone draining before incidents
  • Monitor Skew: Watch for uneven task distribution
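
To make "skew" concrete, one simple measure is the ratio of the busiest group's task count to the mean across groups. The sketch below computes it from per-group counts, which could come from the dispatch metrics. The function name and any alert threshold you attach to it are assumptions.

```python
def distribution_skew(tasks_per_group: dict) -> float:
    """Ratio of the busiest group's count to the mean; 1.0 means perfectly even."""
    counts = list(tasks_per_group.values())
    if not counts or sum(counts) == 0:
        return 1.0
    mean = sum(counts) / len(counts)
    return max(counts) / mean
```

With three groups, a value near 1.0 indicates even distribution, and a value of 3.0 means a single group is handling everything.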

Configuration Management

  • Version Control: Track isolation group configs in Git
  • Gradual Rollout: Test changes on non-critical domains first
  • Document Zones: Maintain mapping of zones to infrastructure
  • Automate Updates: Use CI/CD for isolation group changes

Operational Procedures

  • Drain Before Maintenance: Always drain before zone maintenance
  • Monitor Completion: Ensure workflows complete before shutdown
  • Staged Rollback: Re-enable zones gradually
  • Alerting: Alert on zone failures or skewed distribution

Troubleshooting

Tasks Not Routing to Zone

Problem: Workers in a zone not receiving tasks
Solution:
# Check isolation group state
cadence admin cluster get-isolation-groups

# Verify worker configuration
# Check worker logs for isolation group setting

# Check task list partitioning
cadence --do my-domain tasklist describe --tl my-task-list

# Verify worker identity includes isolation group
cadence --do my-domain tasklist list-partition-workers --tl my-task-list

Zone Not Draining

Problem: Zone marked as draining but still receiving tasks
Solution:
  • Existing workflows continue until completion (expected)
  • Check for new workflow starts (should be zero)
  • Verify drain state persisted:
    cadence admin cluster get-isolation-groups
    
  • Check for worker restarts (may revert identity)

Uneven Load Distribution

Problem: Tasks concentrated in one zone
Solution:
  • Increase partition count for better distribution
  • Verify all zones marked as HEALTHY
  • Check worker polling rates in each zone
  • Review sticky task list distribution
  • Consider rebalancing by draining/re-enabling zones

Advanced Topics

Dynamic Isolation Group Management

Automate isolation group updates based on metrics:
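import json
import subprocess

HEALTH_THRESHOLD = 0.9  # Assumed cutoff on a 0-1 health score
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]

def get_zone_health(zone):
    """Placeholder: return a 0-1 health score from your metrics system."""
    raise NotImplementedError

def send_alert(message):
    """Placeholder: forward to your alerting system."""
    print(message)

def auto_drain_unhealthy_zone(zone="us-east-1a"):
    # Monitor zone health
    if get_zone_health(zone) >= HEALTH_THRESHOLD:
        return

    # Drain the unhealthy zone and keep the others healthy
    config = {
        "isolationGroups": [
            {"name": z, "state": "DRAINING" if z == zone else "HEALTHY"}
            for z in ZONES
        ]
    }
    subprocess.run(
        ["cadence", "admin", "cluster", "update-global-isolation-groups",
         "--json", json.dumps(config)],
        check=True,
    )

    # Alert operations team
    send_alert(f"Zone {zone} set to DRAINING due to health issues")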
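
An automated drain like the function above can make an outage worse if it removes the last healthy zone. A small guard, sketched here with a hypothetical helper name and a default minimum of one healthy zone, can be checked before any update is sent.

```python
def safe_to_drain(states: dict, zone: str, min_healthy: int = 1) -> bool:
    """Return True if draining `zone` still leaves at least `min_healthy` healthy groups."""
    remaining = [g for g, s in states.items() if s == "HEALTHY" and g != zone]
    return len(remaining) >= min_healthy
```

If the guard returns False, alerting a human is probably safer than draining automatically.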

Cross-Cluster Isolation

Use isolation groups for cross-cluster routing in active-active setups:
# Cluster 1 (US)
isolationGroups:
  - name: "cluster-us"
    state: "HEALTHY"

# Cluster 2 (EU)
isolationGroups:
  - name: "cluster-eu"
    state: "HEALTHY"
Workers connect to their local cluster but can handle cross-cluster tasks if needed.
