Overview
Multi-node training allows you to scale CLIP training across multiple machines, enabling training on massive datasets with large models. OpenCLIP supports multi-node training through both native PyTorch distributed (torchrun) and SLURM cluster management.
OpenCLIP has been battle-tested on clusters with up to 1024 A100 GPUs, demonstrating robust scalability for large-scale training.
Prerequisites
Multiple machines with GPUs connected via high-bandwidth network
Network configuration allowing inter-node communication
Shared filesystem accessible from all nodes (recommended)
SLURM cluster (for SLURM-based training) or manual node coordination
Multi-Node with torchrun
Basic Setup
The torchrun launcher supports multi-node training with minimal configuration. The key is specifying the master node’s address and the number of nodes.
cd open_clip/src
torchrun --nproc_per_node=4 \
--nnodes=2 \
--node_rank=$NODE_RANK \
--rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
-m open_clip_train.main \
--train-data '/data/cc12m/cc12m-train-{0000..2175}.tar' \
--train-num-samples 10968539 \
--dataset-type webdataset \
--batch-size 320 \
--precision amp \
--workers 4 \
--imagenet-val /data/imagenet/validation/ \
--model ViT-B-32 \
--epochs 32
Key Parameters:
--nproc_per_node=4: Number of GPUs per node (4 in this example)
--nnodes=2: Total number of nodes
--node_rank=$NODE_RANK: Rank of current node (0 for master, 1, 2, … for workers)
--rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT: Address of the master node
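The launcher derives each process's identity from these flags. A quick sketch of the arithmetic, using a hypothetical 2-node, 4-GPU-per-node launch:

```shell
# Rank arithmetic behind the torchrun flags (hypothetical 2-node, 4-GPU setup).
NNODES=2
NPROC_PER_NODE=4
NODE_RANK=1      # this node's --node_rank
LOCAL_RANK=2     # one process's GPU index on this node
WORLD_SIZE=$((NNODES * NPROC_PER_NODE))
GLOBAL_RANK=$((NODE_RANK * NPROC_PER_NODE + LOCAL_RANK))
echo "world_size=$WORLD_SIZE global_rank=$GLOBAL_RANK"   # world_size=8 global_rank=6
```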
Environment Variables
Set these environment variables on each node:
Master Node (Node 0):
export MASTER_ADDR=$(hostname -i)  # IP address of master node
export MASTER_PORT=29500           # Communication port
export NODE_RANK=0                 # Master node rank
Worker Nodes (Node 1, 2, …):
export MASTER_ADDR=<master-node-ip>  # IP of master node
export MASTER_PORT=29500             # Same port as master
export NODE_RANK=1                   # 1, 2, 3, ... for each worker
Complete Multi-Node Example
Here’s a complete example with 2 nodes, 4 GPUs each:
On Master Node (192.168.1.10):
#!/bin/bash
export MASTER_ADDR=192.168.1.10
export MASTER_PORT=29500
export NODE_RANK=0
cd open_clip/src
torchrun --nproc_per_node=4 \
--nnodes=2 \
--node_rank=$NODE_RANK \
--rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
-m open_clip_train.main \
--train-data '/shared/data/laion400m/{00000..41455}.tar' \
--train-num-samples 400000000 \
--dataset-type webdataset \
--dataset-resampled \
--batch-size 256 \
--precision amp \
--workers 6 \
--warmup 2000 \
--lr 5e-4 \
--wd 0.2 \
--epochs 32 \
--model ViT-B-32 \
--name "vit-b32-laion400m" \
--local-loss \
--gather-with-grad \
--report-to wandb
On Worker Node (192.168.1.11):
#!/bin/bash
export MASTER_ADDR=192.168.1.10  # Master node IP
export MASTER_PORT=29500
export NODE_RANK=1               # Worker rank
cd open_clip/src
torchrun --nproc_per_node=4 \
--nnodes=2 \
--node_rank=$NODE_RANK \
--rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
-m open_clip_train.main \
# ... same arguments as master node ...
SLURM-Based Training
SLURM is the recommended approach for large-scale cluster training. It automatically handles node allocation, environment setup, and process launching.
Basic SLURM Script
#!/bin/bash -x
#SBATCH --nodes=32
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=6
#SBATCH --wait-all-nodes=1
#SBATCH --job-name=open_clip
#SBATCH --account=ACCOUNT_NAME
#SBATCH --partition=PARTITION_NAME
eval "$(/path/to/conda/bin/conda shell.bash hook)"  # init conda
conda activate open_clip
export CUDA_VISIBLE_DEVICES=0,1,2,3
export MASTER_PORT=12802
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
cd /shared/open_clip
export PYTHONPATH="$PYTHONPATH:$PWD/src"
srun --cpu_bind=v --accel-bind=gn python -u src/open_clip_train/main.py \
--save-frequency 1 \
--report-to tensorboard \
--train-data "/data/LAION-400M/{00000..41455}.tar" \
--warmup 2000 \
--batch-size 256 \
--epochs 32 \
--workers 8 \
--model ViT-B-32 \
--name "ViT-B-32-Vanilla" \
--seed 0 \
--local-loss \
--gather-with-grad
SLURM Parameters:
--nodes=32: Number of nodes to allocate
--gres=gpu:4: Request 4 GPUs per node
--ntasks-per-node=4: Launch 4 tasks (1 per GPU) per node
--cpus-per-task=6: Allocate 6 CPU cores per task (for data loading)
--wait-all-nodes=1: Wait for all nodes to be ready before starting
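With one task per GPU, srun exports per-task rank variables that the training script can use to discover its place in the job. A minimal sketch (outside a SLURM job, the fallback values after `:-` apply):

```shell
# Rank discovery under srun: SLURM exports one set of these per task.
# Outside a SLURM job the fallback values after ':-' are used.
GLOBAL_RANK=${SLURM_PROCID:-0}   # rank across the whole job
LOCAL_RANK=${SLURM_LOCALID:-0}   # rank within this node (maps to a GPU)
WORLD_SIZE=${SLURM_NTASKS:-1}    # total number of tasks (= total GPUs here)
echo "rank $GLOBAL_RANK/$WORLD_SIZE (local $LOCAL_RANK)"
```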
Production SLURM Example
Here’s a production-ready SLURM script for training ViT-L/14 on LAION-400M:
#!/bin/bash -x
#SBATCH --nodes=64
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --wait-all-nodes=1
#SBATCH --job-name=clip_vit_l14
#SBATCH --time=7-00:00:00
#SBATCH --output=logs/slurm-%j.out
#SBATCH --error=logs/slurm-%j.err
# Initialize conda
eval "$(/path/to/conda/bin/conda shell.bash hook)"
conda activate open_clip
# Set environment variables
export CUDA_VISIBLE_DEVICES=0,1,2,3
export MASTER_PORT=12802
export OMP_NUM_THREADS=1
# Get master node address
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
echo "Master address: $MASTER_ADDR"
echo "Number of nodes: $SLURM_JOB_NUM_NODES"
echo "Tasks per node: $SLURM_NTASKS_PER_NODE"
# Change to project directory
cd /shared/open_clip
export PYTHONPATH="$PYTHONPATH:$PWD/src"
# Launch training
srun --cpu_bind=v --accel-bind=gn python -u src/open_clip_train/main.py \
--save-frequency 1 \
--zeroshot-frequency 2 \
--report-to wandb \
--wandb-project-name "clip-laion400m" \
--train-data "/data/LAION-400M/{00000..41455}.tar" \
--train-num-samples 400000000 \
--dataset-type webdataset \
--dataset-resampled \
--warmup 10000 \
--batch-size 128 \
--epochs 32 \
--workers 8 \
--model ViT-L-14 \
--precision amp \
--grad-checkpointing \
--lr 5e-4 \
--wd 0.2 \
--name "vit-l14-laion400m" \
--seed 42 \
--local-loss \
--gather-with-grad \
--imagenet-val /data/imagenet/validation/
Submitting SLURM Jobs
# Submit job
sbatch train_clip.sh
# Check job status
squeue -u $USER
# Monitor output
tail -f logs/slurm-<job-id>.out
# Cancel job
scancel <job-id>
Network Configuration
Firewall Settings
Ensure communication ports are open between nodes:
# Allow PyTorch distributed communication (example for firewalld)
sudo firewall-cmd --zone=public --add-port=29500/tcp --permanent
sudo firewall-cmd --zone=public --add-port=29400-29600/tcp --permanent
sudo firewall-cmd --reload
Network Backend
Configure the distributed backend for your hardware:
NVIDIA GPUs with NCCL (recommended):
# Default - no additional flags needed
# Uses NCCL automatically for GPU training
Ascend NPU: select the HCCL backend (e.g. --dist-backend hccl, where your OpenCLIP build supports NPU devices)
CPU-only: select the Gloo backend (e.g. --dist-backend gloo)
InfiniBand Optimization
For clusters with InfiniBand, optimize NCCL settings:
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5
export NCCL_SOCKET_IFNAME=ib0
export NCCL_DEBUG=INFO  # For debugging
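Interface names differ between clusters (ib0 above is only an example), so before pinning NCCL_SOCKET_IFNAME it is worth listing what the node actually has:

```shell
# List this node's network interfaces; pick the InfiniBand one
# (often ib0 or ibp*) for NCCL_SOCKET_IFNAME.
IFACES=$(ls /sys/class/net)
echo "$IFACES"
```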
Distributed Training Optimizations
Memory-Efficient Distributed Loss
For multi-node training, use these flags to reduce memory usage from O(n²) to O(n):
--local-loss \
--gather-with-grad
Without these flags:
Memory usage: O((batch_size × num_gpus)²)
Example: 256 per-GPU batch × 128 GPUs = 32,768 global batch, i.e. 32,768 × 32,768 logit matrices (8GB+)
With these flags:
Memory usage: O(batch_size × num_gpus)
Same numerical results
Essential for large-scale training (64+ GPUs)
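To see where the quadratic term bites, here is a back-of-envelope estimate of the full logit matrices at the scale in the example above (assumes fp32 and counts both the image→text and text→image matrices):

```shell
# Size of the global logit matrices without --local-loss:
# 256 per-GPU batch on 128 GPUs, fp32, two n x n matrices.
BATCH=256
GPUS=128
GIB=$(awk -v b="$BATCH" -v g="$GPUS" 'BEGIN {
  n = b * g                                # global batch size: 32768
  printf "%.0f", n * n * 4 * 2 / 1024^3    # two n x n fp32 matrices, in GiB
}')
echo "${GIB} GiB"   # 8 GiB
```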
See Distributed Training for detailed explanation.
Gradient Accumulation
Simulate larger batch sizes across nodes:
--accum-freq 4 # Accumulate gradients over 4 steps
Effective batch size:
effective_batch = batch_size × gpus_per_node × num_nodes × accum_freq
                = 128 × 4 × 32 × 4
                = 65,536
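The arithmetic above can be sanity-checked in the shell before launching:

```shell
# Effective batch size for the example configuration.
BATCH_SIZE=128
GPUS_PER_NODE=4
NUM_NODES=32
ACCUM_FREQ=4
EFFECTIVE=$((BATCH_SIZE * GPUS_PER_NODE * NUM_NODES * ACCUM_FREQ))
echo "$EFFECTIVE"   # 65536
```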
Remote Checkpoint Syncing
For multi-node training, sync checkpoints to remote storage (S3, shared filesystem):
python -u src/open_clip_train/main.py \
--logs /local/scratch/logs \
--remote-sync s3://my-bucket/checkpoints \
--remote-sync-frequency 300 \
--delete-previous-checkpoint \
# ... other arguments
Parameters:
--logs: Local checkpoint directory
--remote-sync: Remote path (s3:// or shared filesystem path)
--remote-sync-frequency 300: Sync every 300 seconds (5 minutes)
--delete-previous-checkpoint: Save disk space on local nodes
Resume from Remote Checkpoint
--resume s3://my-bucket/checkpoints/experiment/checkpoints/epoch_10.pt
SLURM Job Management
Interactive SLURM Session
For debugging, request interactive session:
srun --nodes=2 --gres=gpu:4 --ntasks-per-node=4 --pty bash
# Then run training commands manually
Monitor Job Progress
# Watch job queue
watch -n 5 squeue -u $USER
# Check node allocation
scontrol show job <job-id>
# View GPU usage on allocated nodes
srun --jobid=<job-id> --overlap nvidia-smi
Job Arrays for Hyperparameter Search
#!/bin/bash
#SBATCH --nodes=8
#SBATCH --array=0-4
# Different learning rates for each job
LRS=(1e-3 5e-4 1e-4 5e-5 1e-5)
LR=${LRS[$SLURM_ARRAY_TASK_ID]}
srun python -u src/open_clip_train/main.py \
--lr $LR \
--name "experiment-lr-$LR" \
# ... other arguments
Troubleshooting Multi-Node Training
Nodes Can’t Communicate
Symptom: Training hangs at initialization
Solutions:
Check firewall settings
Verify MASTER_ADDR is reachable from all nodes:
ping $MASTER_ADDR
telnet $MASTER_ADDR $MASTER_PORT
Check SLURM node allocation:
scontrol show job $SLURM_JOB_ID
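When telnet is not installed on compute nodes, bash's built-in /dev/tcp can serve the same reachability check. A sketch; substitute your own address and port:

```shell
# Bash-only port check (example address/port; replace with your master node's).
MASTER_ADDR=192.168.1.10
MASTER_PORT=29500
if timeout 2 bash -c "exec 3<>/dev/tcp/$MASTER_ADDR/$MASTER_PORT" 2>/dev/null; then
  STATUS="open"
else
  STATUS="closed or unreachable"
fi
echo "port $MASTER_PORT on $MASTER_ADDR: $STATUS"
```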
NCCL Initialization Errors
Symptom:
NCCL error: unhandled system error
Solutions:
Enable NCCL debugging:
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
Check GPU visibility on each node (e.g. run nvidia-smi and verify CUDA_VISIBLE_DEVICES)
Verify InfiniBand configuration (if applicable)
Inconsistent Results Across Nodes
Symptom: Different nodes show different loss values
Solutions:
Ensure same code version on all nodes
Check data is accessible from all nodes
Verify --seed is set for reproducibility
Use --wait-all-nodes=1 in SLURM
Out of Memory on Some Nodes
Symptom: OOM error on specific nodes
Solutions:
Check that GPU memory is equal across nodes (e.g. nvidia-smi --query-gpu=memory.total --format=csv)
Use --grad-checkpointing for memory efficiency
Reduce --batch-size per GPU
Enable --local-loss --gather-with-grad
Network Bandwidth
Monitor network usage during training:
# On each node
iftop -i ib0 # For InfiniBand
iftop -i eth0 # For Ethernet
Target: High utilization during gradient synchronization
Scaling Efficiency
Measure scaling efficiency:
# Ideal scaling: 2x nodes = 2x throughput
scaling_efficiency = (throughput_N_nodes / throughput_1_node) / N_nodes
# Target: > 0.9 for good scaling
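A worked example of the formula with hypothetical throughput measurements:

```shell
# Scaling efficiency from measured throughputs (samples/sec, hypothetical values).
T1=2500     # throughput on 1 node
T8=18000    # throughput on 8 nodes
N=8
EFF=$(awk -v t1="$T1" -v tn="$T8" -v n="$N" 'BEGIN { printf "%.2f", (tn / t1) / n }')
echo "scaling efficiency at $N nodes: $EFF"   # 0.90
```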
Tips for better scaling:
Use --local-loss --gather-with-grad
Ensure sufficient batch size per GPU (128-512)
Use WebDataset format
Optimize --workers for data loading
# Run with different node counts and compare throughput
for NODES in 1 2 4 8 16; do
  sbatch --nodes=$NODES benchmark.sh
done
Example: Large-Scale Training Configuration
Training ViT-H/14 on LAION-2B with 256 GPUs (64 nodes × 4 GPUs):
#!/bin/bash -x
#SBATCH --nodes=64
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12
#SBATCH --wait-all-nodes=1
#SBATCH --time=14-00:00:00
eval "$(/path/to/conda/bin/conda shell.bash hook)"
conda activate open_clip
export CUDA_VISIBLE_DEVICES=0,1,2,3
export MASTER_PORT=12802
export NCCL_IB_DISABLE=0
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
cd /shared/open_clip
export PYTHONPATH="$PYTHONPATH:$PWD/src"
srun --cpu_bind=v --accel-bind=gn python -u src/open_clip_train/main.py \
--train-data "/data/LAION-2B/{00000..100000}.tar" \
--train-num-samples 2000000000 \
--dataset-type webdataset \
--dataset-resampled \
--batch-size 256 \
--precision amp \
--grad-checkpointing \
--workers 12 \
--warmup 10000 \
--lr 5e-4 \
--wd 0.2 \
--epochs 32 \
--model ViT-H-14 \
--save-frequency 1 \
--zeroshot-frequency 2 \
--local-loss \
--gather-with-grad \
--report-to wandb \
--remote-sync s3://my-bucket/clip-checkpoints \
--remote-sync-frequency 600 \
--delete-previous-checkpoint
Next Steps
Distributed Training Learn about advanced distributed training techniques
Configuration Explore all training configuration options
Single-Node Training Start with single-node training before scaling
Data Preparation Prepare large-scale datasets for multi-node training