Overview
Multi-node training allows you to scale CLIP training across multiple machines, enabling training on massive datasets with large models. OpenCLIP supports multi-node training through both native PyTorch distributed (torchrun) and SLURM cluster management.
OpenCLIP has been battle-tested on clusters with up to 1024 A100 GPUs, demonstrating robust scalability for large-scale training.
Prerequisites
Multiple machines with GPUs connected via high-bandwidth network
Network configuration allowing inter-node communication
Shared filesystem accessible from all nodes (recommended)
SLURM cluster (for SLURM-based training) or manual node coordination
Multi-Node with torchrun
Basic Setup
The torchrun launcher supports multi-node training with minimal configuration. The key is specifying the master node’s address and the number of nodes.
cd open_clip/src
torchrun --nproc_per_node=4 \
--nnodes=2 \
--node_rank=$NODE_RANK \
--rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
-m open_clip_train.main \
--train-data '/data/cc12m/cc12m-train-{0000..2175}.tar' \
--train-num-samples 10968539 \
--dataset-type webdataset \
--batch-size 320 \
--precision amp \
--workers 4 \
--imagenet-val /data/imagenet/validation/ \
--model ViT-B-32 \
--epochs 32
Key Parameters:
--nproc_per_node=4: Number of GPUs per node (4 in this example)
--nnodes=2: Total number of nodes
--node_rank=$NODE_RANK: Rank of current node (0 for master, 1, 2, … for workers)
--rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT: Address of the master node
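The launcher derives each process's identity from these flags. A quick sketch of the arithmetic, using a hypothetical 2-node, 4-GPU-per-node launch:

```shell
# Rank arithmetic behind the torchrun flags (hypothetical 2-node, 4-GPU setup).
NNODES=2
NPROC_PER_NODE=4
NODE_RANK=1      # this node's --node_rank
LOCAL_RANK=2     # one process's GPU index on this node
WORLD_SIZE=$((NNODES * NPROC_PER_NODE))
GLOBAL_RANK=$((NODE_RANK * NPROC_PER_NODE + LOCAL_RANK))
echo "world_size=$WORLD_SIZE global_rank=$GLOBAL_RANK"   # world_size=8 global_rank=6
```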
Environment Variables
Set these environment variables on each node:
Master Node (Node 0):
export MASTER_ADDR=$(hostname -i)  # IP address of master node
export MASTER_PORT=29500           # Communication port
export NODE_RANK=0                 # Master node rank
Worker Nodes (Node 1, 2, …):
export MASTER_ADDR=<master-node-ip>  # IP of master node
export MASTER_PORT=29500             # Same port as master
export NODE_RANK=1                   # 1, 2, 3, ... for each worker
Complete Multi-Node Example
Here’s a complete example with 2 nodes, 4 GPUs each:
On Master Node (192.168.1.10):
#!/bin/bash
export MASTER_ADDR=192.168.1.10
export MASTER_PORT=29500
export NODE_RANK=0
cd open_clip/src
torchrun --nproc_per_node=4 \
--nnodes=2 \
--node_rank=$NODE_RANK \
--rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
-m open_clip_train.main \
--train-data '/shared/data/laion400m/{00000..41455}.tar' \
--train-num-samples 400000000 \
--dataset-type webdataset \
--dataset-resampled \
--batch-size 256 \
--precision amp \
--workers 6 \
--warmup 2000 \
--lr 5e-4 \
--wd 0.2 \
--epochs 32 \
--model ViT-B-32 \
--name "vit-b32-laion400m" \
--local-loss \
--gather-with-grad \
--report-to wandb
On Worker Node (192.168.1.11):
#!/bin/bash
export MASTER_ADDR=192.168.1.10  # Master node IP
export MASTER_PORT=29500
export NODE_RANK=1               # Worker rank
cd open_clip/src
torchrun --nproc_per_node=4 \
--nnodes=2 \
--node_rank=$NODE_RANK \
--rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
-m open_clip_train.main \
# ... same arguments as master node ...
SLURM-Based Training
SLURM is the recommended approach for large-scale cluster training. It automatically handles node allocation, environment setup, and process launching.
Basic SLURM Script
#!/bin/bash -x
#SBATCH --nodes=32
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=6
#SBATCH --wait-all-nodes=1
#SBATCH --job-name=open_clip
#SBATCH --account=ACCOUNT_NAME
#SBATCH --partition=PARTITION_NAME
eval "$(/path/to/conda/bin/conda shell.bash hook)"  # init conda
conda activate open_clip
export CUDA_VISIBLE_DEVICES=0,1,2,3
export MASTER_PORT=12802
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
cd /shared/open_clip
export PYTHONPATH="$PYTHONPATH:$PWD/src"
srun --cpu_bind=v --accel-bind=gn python -u src/open_clip_train/main.py \
--save-frequency 1 \
--report-to tensorboard \
--train-data "/data/LAION-400M/{00000..41455}.tar" \
--warmup 2000 \
--batch-size 256 \
--epochs 32 \
--workers 8 \
--model ViT-B-32 \
--name "ViT-B-32-Vanilla" \
--seed 0 \
--local-loss \
--gather-with-grad
SLURM Parameters:
--nodes=32: Number of nodes to allocate
--gres=gpu:4: Request 4 GPUs per node
--ntasks-per-node=4: Launch 4 tasks (1 per GPU) per node
--cpus-per-task=6: Allocate 6 CPU cores per task (for data loading)
--wait-all-nodes=1: Wait for all nodes to be ready before starting
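With one task per GPU, srun exports per-task rank variables that the training script can use to discover its place in the job. A minimal sketch (outside a SLURM job, the fallback values after `:-` apply):

```shell
# Rank discovery under srun: SLURM exports one set of these per task.
# Outside a SLURM job the fallback values after ':-' are used.
GLOBAL_RANK=${SLURM_PROCID:-0}   # rank across the whole job
LOCAL_RANK=${SLURM_LOCALID:-0}   # rank within this node (maps to a GPU)
WORLD_SIZE=${SLURM_NTASKS:-1}    # total number of tasks (= total GPUs here)
echo "rank $GLOBAL_RANK/$WORLD_SIZE (local $LOCAL_RANK)"
```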
Production SLURM Example
Here’s a production-ready SLURM script for training ViT-L/14 on LAION-400M:
#!/bin/bash -x
#SBATCH --nodes=64
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --wait-all-nodes=1
#SBATCH --job-name=clip_vit_l14
#SBATCH --time=7-00:00:00
#SBATCH --output=logs/slurm-%j.out
#SBATCH --error=logs/slurm-%j.err
# Initialize conda
eval "$(/path/to/conda/bin/conda shell.bash hook)"
conda activate open_clip
# Set environment variables
export CUDA_VISIBLE_DEVICES=0,1,2,3
export MASTER_PORT=12802
export OMP_NUM_THREADS=1
# Get master node address
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
echo "Master address: $MASTER_ADDR"
echo "Number of nodes: $SLURM_JOB_NUM_NODES"
echo "Tasks per node: $SLURM_NTASKS_PER_NODE"
# Change to project directory
cd /shared/open_clip
export PYTHONPATH="$PYTHONPATH:$PWD/src"
# Launch training
srun --cpu_bind=v --accel-bind=gn python -u src/open_clip_train/main.py \
--save-frequency 1 \
--zeroshot-frequency 2 \
--report-to wandb \
--wandb-project-name "clip-laion400m" \
--train-data "/data/LAION-400M/{00000..41455}.tar" \
--train-num-samples 400000000 \
--dataset-type webdataset \
--dataset-resampled \
--warmup 10000 \
--batch-size 128 \
--epochs 32 \
--workers 8 \
--model ViT-L-14 \
--precision amp \
--grad-checkpointing \
--lr 5e-4 \
--wd 0.2 \
--name "vit-l14-laion400m" \
--seed 42 \
--local-loss \
--gather-with-grad \
--imagenet-val /data/imagenet/validation/
Submitting SLURM Jobs
# Submit job
sbatch train_clip.sh
# Check job status
squeue -u $USER
# Monitor output
tail -f logs/slurm-<job-id>.out
# Cancel job
scancel <job-id>
Network Configuration
Firewall Settings
Ensure communication ports are open between nodes:
# Allow PyTorch distributed communication (example for firewalld)
sudo firewall-cmd --zone=public --add-port=29500/tcp --permanent
sudo firewall-cmd --zone=public --add-port=29400-29600/tcp --permanent
sudo firewall-cmd --reload
Network Backend
Configure the distributed backend for your hardware:
NVIDIA GPUs with NCCL (recommended):
# Default - no additional flags needed
# Uses NCCL automatically for GPU training
Ascend NPU: select the HCCL backend (e.g. --dist-backend hccl, where your OpenCLIP build supports NPU devices)
CPU-only: select the Gloo backend (e.g. --dist-backend gloo)
InfiniBand Optimization
For clusters with InfiniBand, optimize NCCL settings:
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5
export NCCL_SOCKET_IFNAME=ib0
export NCCL_DEBUG=INFO  # For debugging
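Interface names differ between clusters (ib0 above is only an example), so before pinning NCCL_SOCKET_IFNAME it is worth listing what the node actually has:

```shell
# List this node's network interfaces; pick the InfiniBand one
# (often ib0 or ibp*) for NCCL_SOCKET_IFNAME.
IFACES=$(ls /sys/class/net)
echo "$IFACES"
```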
Distributed Training Optimizations
Memory-Efficient Distributed Loss
For multi-node training, use these flags to reduce memory usage from O(n²) to O(n):
--local-loss \
--gather-with-grad
Without these flags:
Memory usage: O((batch_size × num_gpus)²)
Example: 256 per-GPU batch × 128 GPUs = 32,768 global batch, i.e. 32,768 × 32,768 logit matrices (8GB+)
With these flags:
Memory usage: O(batch_size × num_gpus)
Same numerical results
Essential for large-scale training (64+ GPUs)
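To see where the quadratic term bites, here is a back-of-envelope estimate of the full logit matrices at the scale in the example above (assumes fp32 and counts both the image→text and text→image matrices):

```shell
# Size of the global logit matrices without --local-loss:
# 256 per-GPU batch on 128 GPUs, fp32, two n x n matrices.
BATCH=256
GPUS=128
GIB=$(awk -v b="$BATCH" -v g="$GPUS" 'BEGIN {
  n = b * g                                # global batch size: 32768
  printf "%.0f", n * n * 4 * 2 / 1024^3    # two n x n fp32 matrices, in GiB
}')
echo "${GIB} GiB"   # 8 GiB
```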
See Distributed Training for detailed explanation.
Gradient Accumulation
Simulate larger batch sizes across nodes:
--accum-freq 4 # Accumulate gradients over 4 steps
Effective batch size:
effective_batch = batch_size × gpus_per_node × num_nodes × accum_freq
                = 128 × 4 × 32 × 4
                = 65,536
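The arithmetic above can be sanity-checked in the shell before launching:

```shell
# Effective batch size for the example configuration.
BATCH_SIZE=128
GPUS_PER_NODE=4
NUM_NODES=32
ACCUM_FREQ=4
EFFECTIVE=$((BATCH_SIZE * GPUS_PER_NODE * NUM_NODES * ACCUM_FREQ))
echo "$EFFECTIVE"   # 65536
```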
Remote Checkpoint Syncing
For multi-node training, sync checkpoints to remote storage (S3, shared filesystem):
python -u src/open_clip_train/main.py \
--logs /local/scratch/logs \
--remote-sync s3://my-bucket/checkpoints \
--remote-sync-frequency 300 \
--delete-previous-checkpoint \
# ... other arguments
Parameters:
--logs: Local checkpoint directory
--remote-sync: Remote path (s3:// or shared filesystem path)
--remote-sync-frequency 300: Sync every 300 seconds (5 minutes)
--delete-previous-checkpoint: Save disk space on local nodes
Resume from Remote Checkpoint
--resume s3://my-bucket/checkpoints/experiment/checkpoints/epoch_10.pt
SLURM Job Management
Interactive SLURM Session
For debugging, request interactive session:
srun --nodes=2 --gres=gpu:4 --ntasks-per-node=4 --pty bash
# Then run training commands manually
Monitor Job Progress
# Watch job queue
watch -n 5 squeue -u $USER
# Check node allocation
scontrol show job <job-id>
# View GPU usage on allocated nodes
srun --jobid=<job-id> --overlap nvidia-smi
Job Arrays for Hyperparameter Search
#!/bin/bash
#SBATCH --nodes=8
#SBATCH --array=0-4
# Different learning rates for each job
LRS=(1e-3 5e-4 1e-4 5e-5 1e-5)
LR=${LRS[$SLURM_ARRAY_TASK_ID]}
srun python -u src/open_clip_train/main.py \
--lr $LR \
--name "experiment-lr-$LR" \
# ... other arguments
Troubleshooting Multi-Node Training
Nodes Can’t Communicate
Symptom: Training hangs at initialization
Solutions:
Check firewall settings
Verify MASTER_ADDR is reachable from all nodes:
ping $MASTER_ADDR
telnet $MASTER_ADDR $MASTER_PORT
Check SLURM node allocation:
scontrol show job $SLURM_JOB_ID
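When telnet is not installed on compute nodes, bash's built-in /dev/tcp can serve the same reachability check. A sketch; substitute your own address and port:

```shell
# Bash-only port check (example address/port; replace with your master node's).
MASTER_ADDR=192.168.1.10
MASTER_PORT=29500
if timeout 2 bash -c "exec 3<>/dev/tcp/$MASTER_ADDR/$MASTER_PORT" 2>/dev/null; then
  STATUS="open"
else
  STATUS="closed or unreachable"
fi
echo "port $MASTER_PORT on $MASTER_ADDR: $STATUS"
```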
NCCL Initialization Errors
Symptom:
NCCL error: unhandled system error
Solutions:
Enable NCCL debugging:
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
Check GPU visibility on each node (e.g. run nvidia-smi and verify CUDA_VISIBLE_DEVICES)
Verify InfiniBand configuration (if applicable)
Inconsistent Results Across Nodes
Symptom: Different nodes show different loss values
Solutions:
Ensure same code version on all nodes
Check data is accessible from all nodes
Verify --seed is set for reproducibility
Use --wait-all-nodes=1 in SLURM
Out of Memory on Some Nodes
Symptom: OOM error on specific nodes
Solutions:
Check that GPU memory is equal across nodes (e.g. nvidia-smi --query-gpu=memory.total --format=csv)
Use --grad-checkpointing for memory efficiency
Reduce --batch-size per GPU
Enable --local-loss --gather-with-grad
Network Bandwidth
Monitor network usage during training:
# On each node
iftop -i ib0 # For InfiniBand
iftop -i eth0 # For Ethernet
Target: High utilization during gradient synchronization
Scaling Efficiency
Measure scaling efficiency:
# Ideal scaling: 2x nodes = 2x throughput
scaling_efficiency = (throughput_N_nodes / throughput_1_node) / N_nodes
# Target: > 0.9 for good scaling
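A worked example of the formula with hypothetical throughput measurements:

```shell
# Scaling efficiency from measured throughputs (samples/sec, hypothetical values).
T1=2500     # throughput on 1 node
T8=18000    # throughput on 8 nodes
N=8
EFF=$(awk -v t1="$T1" -v tn="$T8" -v n="$N" 'BEGIN { printf "%.2f", (tn / t1) / n }')
echo "scaling efficiency at $N nodes: $EFF"   # 0.90
```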
Tips for better scaling:
Use --local-loss --gather-with-grad
Ensure sufficient batch size per GPU (128-512)
Use WebDataset format
Optimize --workers for data loading
# Run with different node counts and compare throughput
for NODES in 1 2 4 8 16; do
  sbatch --nodes=$NODES benchmark.sh
done
Example: Large-Scale Training Configuration
Training ViT-H/14 on LAION-2B with 256 GPUs (64 nodes × 4 GPUs):
#!/bin/bash -x
#SBATCH --nodes=64
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12
#SBATCH --wait-all-nodes=1
#SBATCH --time=14-00:00:00
eval "$(/path/to/conda/bin/conda shell.bash hook)"
conda activate open_clip
export CUDA_VISIBLE_DEVICES=0,1,2,3
export MASTER_PORT=12802
export NCCL_IB_DISABLE=0
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
cd /shared/open_clip
export PYTHONPATH="$PYTHONPATH:$PWD/src"
srun --cpu_bind=v --accel-bind=gn python -u src/open_clip_train/main.py \
--train-data "/data/LAION-2B/{00000..100000}.tar" \
--train-num-samples 2000000000 \
--dataset-type webdataset \
--dataset-resampled \
--batch-size 256 \
--precision amp \
--grad-checkpointing \
--workers 12 \
--warmup 10000 \
--lr 5e-4 \
--wd 0.2 \
--epochs 32 \
--model ViT-H-14 \
--save-frequency 1 \
--zeroshot-frequency 2 \
--local-loss \
--gather-with-grad \
--report-to wandb \
--remote-sync s3://my-bucket/clip-checkpoints \
--remote-sync-frequency 600 \
--delete-previous-checkpoint
Next Steps
Distributed Training Learn about advanced distributed training techniques
Configuration Explore all training configuration options
Single-Node Training Start with single-node training before scaling
Data Preparation Prepare large-scale datasets for multi-node training