Overview
Isolation Groups enable:
- Zone Awareness: Route tasks to workers in specific zones
- Fault Isolation: Contain failures to specific zones
- Graceful Draining: Drain zones for maintenance without workflow impact
- Load Balancing: Distribute load across healthy zones
- Multi-Region Support: Route tasks across geographic regions
Concepts
Isolation Group
A logical grouping of workers, typically corresponding to:
- Availability zone (e.g., us-east-1a, us-east-1b)
- Data center
- Kubernetes cluster
- Worker deployment
Drain State
Zones can be in three states:
- Healthy: Actively receives new tasks
- Draining: No new tasks, existing workflows continue
- Drained: No tasks routed to this zone
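The three states can be modeled as a small enumeration. This is an illustrative sketch only: the `DrainState` type and the two helper functions are hypothetical names used to restate the rules above, not part of any server or client API.

```python
from enum import Enum


class DrainState(Enum):
    """The three zone states listed above (names are illustrative)."""
    HEALTHY = "healthy"
    DRAINING = "draining"
    DRAINED = "drained"


def accepts_new_tasks(state: DrainState) -> bool:
    # Only a healthy zone actively receives new tasks.
    return state is DrainState.HEALTHY


def continues_existing_workflows(state: DrainState) -> bool:
    # A draining zone lets existing workflows continue; a drained zone gets nothing.
    return state in (DrainState.HEALTHY, DrainState.DRAINING)
```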
Task List Partitioning
Task lists are partitioned across isolation groups for parallel processing and isolation.
Configuration
Global Isolation Groups
Define isolation groups at the cluster level.
Domain-Specific Isolation
Apply isolation to specific domains.
Worker Configuration
Workers must identify their isolation group.
Use Cases
Availability Zone Isolation
Isolate workers by AWS availability zone:
- AZ failure contained to that zone’s workflows
- New workflows distributed to healthy AZs
- Ongoing workflows in failed AZ can be recovered
Graceful Zone Draining
Drain a zone for maintenance.
Multi-Region Deployment
Route tasks to specific regions.
Canary Deployments
Gradually roll out new worker versions.
Task Routing
Load Balancing Algorithm
Tasks are distributed across healthy isolation groups:
- Filter: Exclude draining/drained groups for new workflows
- Balance: Distribute tasks evenly across healthy groups
- Sticky: Existing workflows stay in their group if healthy
- Fallback: Redirect if group becomes unhealthy
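The four steps above can be sketched as a single selection function. This is a minimal illustration under stated assumptions, not the server's actual routing code: `pick_group`, the `states` map, and the per-group `load` counter are all hypothetical, "evenly" is approximated by choosing the least-loaded healthy group, and any non-healthy state is treated as a reason to fall back.

```python
from collections import Counter
from typing import Dict, Optional

HEALTHY, DRAINING, DRAINED = "healthy", "draining", "drained"


def pick_group(states: Dict[str, str], load: Counter,
               current: Optional[str] = None) -> Optional[str]:
    """Choose an isolation group for a task.

    states  -- hypothetical map of group name to drain state
    load    -- tasks already assigned per group (used for balancing)
    current -- group the workflow already runs in, or None for a new workflow
    """
    # Sticky: an existing workflow stays in its group while that group is healthy.
    if current is not None and states.get(current) == HEALTHY:
        return current
    # Filter: draining and drained groups are excluded from the candidates.
    candidates = sorted(g for g, s in states.items() if s == HEALTHY)
    if not candidates:
        return None  # no healthy group is available
    # Balance / Fallback: new or redirected work goes to the least-loaded group.
    return min(candidates, key=lambda g: load[g])
```

For example, with `zone-b` draining, a workflow that was running in `zone-b` is redirected to whichever healthy group currently has the fewest assigned tasks, while a workflow in a healthy group stays where it is.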
Partition Configuration
Control partitioning with dynamic config.
Sticky Task Lists
Isolation groups work with sticky task lists:
- Sticky task lists remain in their isolation group
- Group health affects sticky routing
- Draining groups stop receiving new sticky tasks
Monitoring
Key Metrics
Task Distribution:
CLI Monitoring
Best Practices
Deployment Strategy
- Start with 3 Zones: Balance availability and complexity
- Use Existing Infrastructure: Align with AZ/region boundaries
- Test Draining: Practice zone draining before incidents
- Monitor Skew: Watch for uneven task distribution
Configuration Management
- Version Control: Track isolation group configs in Git
- Gradual Rollout: Test changes on non-critical domains first
- Document Zones: Maintain mapping of zones to infrastructure
- Automate Updates: Use CI/CD for isolation group changes
Operational Procedures
- Drain Before Maintenance: Always drain before zone maintenance
- Monitor Completion: Ensure workflows complete before shutdown
- Staged Rollback: Re-enable zones gradually
- Alerting: Alert on zone failures or skewed distribution
Troubleshooting
Tasks Not Routing to Zone
Problem: Workers in a zone not receiving tasks
Solution:
Zone Not Draining
Problem: Zone marked as draining but still receiving tasks
Solution:
- Existing workflows continue until completion (expected)
- Check for new workflow starts (should be zero)
- Verify drain state persisted:
- Check for worker restarts (may revert identity)
Uneven Load Distribution
Problem: Tasks concentrated in one zone
Solution:
- Increase partition count for better distribution
- Verify all zones marked as HEALTHY
- Check worker polling rates in each zone
- Review sticky task list distribution
- Consider rebalancing by draining/re-enabling zones
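Before applying any of these steps, it can help to quantify how uneven the distribution is. The sketch below is one possible measure, not a built-in metric: `distribution_skew` is a hypothetical helper, and the per-zone task counts are assumed to come from whatever metrics system is already in place.

```python
from typing import Dict


def distribution_skew(task_counts: Dict[str, int]) -> float:
    """Return the busiest zone's load divided by the mean zone load.

    A value of 1.0 means tasks are spread perfectly evenly; with N zones,
    a value of N means every task landed in a single zone.
    """
    if not task_counts:
        raise ValueError("no zones to compare")
    total = sum(task_counts.values())
    if total == 0:
        return 1.0  # no tasks anywhere, so nothing is skewed
    mean = total / len(task_counts)
    return max(task_counts.values()) / mean
```

An alert threshold on this ratio is a judgment call that depends on workload; the value to alert on is not specified here.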
Advanced Topics
Dynamic Isolation Group Management
Automate isolation group updates based on metrics.
Cross-Cluster Isolation
Use isolation groups for cross-cluster routing in active-active setups.
Next Steps
- Configure Async Workflow Queues for offloading
- Set up Dynamic Config for isolation tuning
- Monitor with Web UI
- Test with Canary for validation