The @parallel decorator enables multi-node distributed computing for workloads that require coordination across multiple machines. This is essential for distributed training, large-scale simulations, and other parallel workloads.
Overview
The @parallel decorator creates a cluster of worker nodes that can communicate with each other.
How It Works
Cluster Topology
When you use @parallel, Metaflow creates a cluster:
- Control node (index 0) - The main coordinator
- Worker nodes (index 1, 2, …) - Additional compute nodes
- Shared network access via `current.parallel.main_ip`
- Node identification via `current.parallel.node_index`
- Cluster size via `current.parallel.num_nodes`
Basic Usage
Simple Parallel Computation
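A minimal flow might look like the sketch below (the flow and step names are illustrative). `num_parallel=4` on `self.next` requests a four-node cluster: one control node plus three workers, all running the same step:

```python
from metaflow import FlowSpec, step, parallel, current


class HelloParallelFlow(FlowSpec):

    @step
    def start(self):
        # num_parallel=4: one control node plus three workers
        self.next(self.hello, num_parallel=4)

    @parallel
    @step
    def hello(self):
        # Every node runs this step simultaneously
        self.node = current.parallel.node_index
        print(f"Node {self.node} of {current.parallel.num_nodes}, "
              f"control node at {current.parallel.main_ip}")
        self.next(self.join)

    @step
    def join(self, inputs):
        # One artifact per node arrives here
        self.nodes = sorted(i.node for i in inputs)
        self.next(self.end)

    @step
    def end(self):
        print("Saw nodes:", self.nodes)


if __name__ == "__main__":
    HelloParallelFlow()
```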
With Remote Compute
Combine @parallel with @batch or @kubernetes to run the cluster on remote compute:
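A sketch of the combination (resource values are illustrative; swap @kubernetes for @batch to target AWS Batch instead):

```python
from metaflow import FlowSpec, step, parallel, kubernetes


class RemoteParallelFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train, num_parallel=4)

    # Each of the four nodes gets its own pod with these resources
    @kubernetes(cpu=4, memory=16000)
    @parallel
    @step
    def train(self):
        # distributed training work goes here
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    RemoteParallelFlow()
```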
The current.parallel Object
Metaflow provides information about the parallel cluster through the `current.parallel` object: `main_ip` (the control node's address), `node_index` (this node's rank), and `num_nodes` (the cluster size).

Distributed Training Patterns
PyTorch Distributed
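Inside a @parallel step, the cluster coordinates map directly onto torch.distributed's rendezvous variables. A sketch (the port number is an arbitrary choice, and the code is assumed to run inside a step method):

```python
import os

import torch.distributed as dist
from metaflow import current

# Point every node at the control node for rendezvous
os.environ["MASTER_ADDR"] = current.parallel.main_ip
os.environ["MASTER_PORT"] = "29500"          # arbitrary free port

dist.init_process_group(
    backend="gloo",                          # use "nccl" on GPU nodes
    rank=current.parallel.node_index,
    world_size=current.parallel.num_nodes,
)

# ... wrap the model in DistributedDataParallel and train, then:
dist.destroy_process_group()
```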
TensorFlow MultiWorkerMirroredStrategy
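MultiWorkerMirroredStrategy reads the cluster layout from the TF_CONFIG environment variable, which must be set on every node before the strategy is created (for example, derived from `current.parallel`). A minimal sketch:

```python
import tensorflow as tf

# Assumes TF_CONFIG has already been exported on each node
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created here are mirrored across all workers
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="sgd", loss="mse")
```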
Horovod
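Horovod needs only an init call per process; it discovers peers and selects a backend on its own. A sketch using the PyTorch binding:

```python
import horovod.torch as hvd

hvd.init()                                 # discovers peers automatically

# Pin each process to one GPU when GPUs are available:
# torch.cuda.set_device(hvd.local_rank())

print(f"rank {hvd.rank()} of {hvd.size()}")
```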
Communication Between Nodes
Using Sockets
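A framework-free sketch of control/worker messaging over TCP. In a real flow the control node would bind on `current.parallel.main_ip` and the workers would connect to it; here everything runs on localhost threads so the pattern is visible end to end:

```python
import socket
import threading


def run_control(server, num_workers):
    """Collect one status message from each worker node."""
    messages = []
    for _ in range(num_workers):
        conn, _ = server.accept()
        with conn:
            messages.append(conn.recv(1024).decode())
    return messages


def run_worker(host, port, node_index):
    """Report readiness to the control node."""
    with socket.create_connection((host, port)) as s:
        s.sendall(f"ready:{node_index}".encode())


server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
server.listen()
port = server.getsockname()[1]

workers = [
    threading.Thread(target=run_worker, args=("127.0.0.1", port, i))
    for i in range(1, 4)
]
for t in workers:
    t.start()
received = run_control(server, num_workers=3)
for t in workers:
    t.join()
server.close()
print(sorted(received))            # ['ready:1', 'ready:2', 'ready:3']
```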
Using Redis for Coordination
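A Redis server reachable by all nodes (for example, one you start on the control node) makes simple coordination primitives easy. A sketch of a barrier using the redis-py client; the key name is illustrative:

```python
import time

import redis  # third-party client; assumes a reachable Redis server


def barrier(r, name, num_nodes, poll=0.5):
    """Block until all nodes have arrived at this barrier."""
    r.incr(name)                       # atomic: safe across nodes
    while int(r.get(name)) < num_nodes:
        time.sleep(poll)


# Inside a @parallel step:
# r = redis.Redis(host=current.parallel.main_ip, port=6379)
# barrier(r, "after_data_load", current.parallel.num_nodes)
```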
Platform-Specific Configuration
AWS Batch Multi-Node
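When a @parallel step runs on AWS Batch, Metaflow maps the node group onto a Batch multi-node parallel job. An invocation might look like this (the flow file name is illustrative):

```shell
# Each @parallel node becomes one node of a multi-node parallel job
python parallel_flow.py run --with batch
```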
Kubernetes JobSets
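On Kubernetes, Metaflow runs @parallel node groups via the JobSet API, so the cluster must have the JobSet controller installed. An invocation might look like this (the flow file name is illustrative):

```shell
python parallel_flow.py run --with kubernetes
```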
Local Testing
Test multi-node code locally before scaling out to remote compute; a plain local run executes each node as a separate process on your machine.

Best Practices
Save from control node only
Only the control node (index 0) should save final outputs to avoid race conditions:
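A framework-free sketch of the rule; in a real flow, `node_index` would come from `current.parallel.node_index` and the store would be a Metaflow artifact or external storage:

```python
def maybe_save(node_index, artifact, store):
    # Only the control node (index 0) writes, so concurrent nodes
    # never race on the same output key.
    if node_index != 0:
        return False
    store["final"] = artifact
    return True


store = {}
saved = [maybe_save(i, {"model": "weights"}, store) for i in range(4)]
print(saved)   # [True, False, False, False]
```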
Use appropriate communication backends
Choose the right backend for your framework:
- PyTorch: `nccl` for GPU, `gloo` for CPU
- TensorFlow: `MultiWorkerMirroredStrategy`
- Horovod: automatic backend selection
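For PyTorch, the choice can be made at runtime; a one-line sketch (assumes torch is installed; `nccl` requires CUDA-capable GPUs):

```python
import torch

# nccl only works with CUDA GPUs; gloo works everywhere
backend = "nccl" if torch.cuda.is_available() else "gloo"
# dist.init_process_group(backend=backend, rank=..., world_size=...)
```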
Handle node failures gracefully
Implement checkpointing and recovery:
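A minimal checkpointing sketch (JSON on local disk for illustration; in a multi-node flow you would write to shared storage such as S3, and typically from the control node only):

```python
import json
import os
import tempfile


def save_checkpoint(path, step, state):
    # Write to a temp file and rename: a crash mid-write never
    # leaves a corrupt checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    if not os.path.exists(path):
        return {"step": 0, "state": {}}   # fresh run
    with open(path) as f:
        return json.load(f)


path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
start = load_checkpoint(path)["step"]      # 0 on a fresh run
for step in range(start, 5):
    save_checkpoint(path, step + 1, {"loss": 1.0 / (step + 1)})
resumed = load_checkpoint(path)
print(resumed["step"])                     # a restart would resume at 5
```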
Right-size your cluster
More nodes aren't always better. Consider:
- Communication overhead increases with cluster size
- Batch size per worker decreases with more workers
- Diminishing returns beyond certain cluster sizes
- Start small and scale based on metrics
Common Issues
Nodes cannot communicate
Symptoms: Timeout errors, connection refused

Solutions:
- Verify security groups allow inter-node communication
- Check firewall rules in Kubernetes
- Ensure correct port numbers
- Verify `current.parallel.main_ip` is accessible
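A small connectivity probe can confirm the last two points from inside a worker; a sketch (demoed here against a local listener, but in a flow you would probe `current.parallel.main_ip` and your coordination port):

```python
import socket


def can_reach(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Demo: probe a live local listener, then the same port after it closes
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))            # OS picks a free port
listener.listen()
port = listener.getsockname()[1]

reachable = can_reach("127.0.0.1", port)   # True: listener is up
listener.close()
unreachable = can_reach("127.0.0.1", port) # False: connection refused
print(reachable, unreachable)
```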
Unbalanced workload
Symptoms: Some nodes finish much earlier than others

Solutions:
- Use distributed samplers to balance data
- Implement dynamic work distribution
- Profile per-node utilization
- Adjust batch sizes
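The simplest balanced split is a round-robin shard keyed on the node index; a framework-free sketch (distributed samplers such as PyTorch's `DistributedSampler` do essentially this, plus shuffling):

```python
def shard_for_node(data, node_index, num_nodes):
    """Round-robin shard: node i takes items i, i+num_nodes, ..."""
    return data[node_index::num_nodes]


items = list(range(10))
shards = [shard_for_node(items, i, 3) for i in range(3)]
print(shards)   # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

Shard sizes differ by at most one item, so no node sits idle much longer than the others.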
Memory issues in distributed training
Symptoms: OOM errors during multi-node training

Solutions:
- Reduce per-node batch size
- Enable gradient checkpointing
- Use mixed precision training
- Increase memory allocation
Inconsistent results across runs
Symptoms: Non-deterministic training results

Solutions:
- Set random seeds on all nodes
- Synchronize initial model weights
- Use deterministic algorithms
- Verify data shuffling is consistent
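A seeding sketch using only the standard library (in a real flow you would also seed numpy and your training framework, and pass `current.parallel.node_index` when per-node streams are wanted):

```python
import random


def seed_all(seed, node_index=0):
    # Same base seed on every node keeps shuffles consistent;
    # pass node_index to derive distinct per-node streams instead.
    random.seed(seed + node_index)


seed_all(42)
first = [random.random() for _ in range(3)]
seed_all(42)
second = [random.random() for _ in range(3)]
print(first == second)   # True: identical streams across "runs"
```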
Performance Optimization
Network Bandwidth
Optimize communication so that network transfer does not dominate training time.

Overlapping Computation and Communication
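The idea is to kick off communication asynchronously and keep computing while it is in flight, joining only when the result is needed. A framework-free sketch using a background thread as a stand-in for a non-blocking all-reduce:

```python
import queue
import threading
import time


def async_allreduce(value, out):
    """Stand-in for a non-blocking gradient all-reduce."""
    time.sleep(0.01)               # pretend network latency
    out.put(value)


results = queue.Queue()
pending = []
compute_sum = 0
for step in range(3):
    grad = step * 2                # pretend output of backward()
    t = threading.Thread(target=async_allreduce, args=(grad, results))
    t.start()                      # communication begins in background...
    pending.append(t)
    compute_sum += step            # ...while the next compute proceeds
for t in pending:                  # synchronize only at the end
    t.join()
reduced = sorted(results.get_nowait() for _ in range(3))
print(reduced, compute_sum)       # [0, 2, 4] 3
```

Frameworks do this for you: PyTorch's DistributedDataParallel overlaps gradient all-reduce with the backward pass by default.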
Next Steps
Resources Management
Configure CPU, memory, and GPU for distributed jobs
AWS Batch
Run distributed jobs on AWS Batch
Kubernetes
Deploy distributed workloads on Kubernetes
Remote Execution
Understand remote execution fundamentals
