How DDP works in nanoGPT
The training script automatically detects and configures DDP based on environment variables set by `torchrun`.
DDP detection and initialization
From `train.py:82-100`:
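A sketch of that logic, paraphrased from `train.py` (the `else` branch covers ordinary single-GPU runs; exact line numbers may drift between versions):

```python
import os
import torch
from torch.distributed import init_process_group

# torchrun sets RANK for every process; its absence means a plain non-DDP run
ddp = int(os.environ.get('RANK', -1)) != -1
if ddp:
    init_process_group(backend='nccl')  # the backend is configurable in train.py
    ddp_rank = int(os.environ['RANK'])
    ddp_local_rank = int(os.environ['LOCAL_RANK'])
    ddp_world_size = int(os.environ['WORLD_SIZE'])
    device = f'cuda:{ddp_local_rank}'   # one GPU per process
    torch.cuda.set_device(device)
    master_process = ddp_rank == 0      # rank 0 handles logging/checkpointing
    seed_offset = ddp_rank              # each process seeds differently
else:
    ddp_world_size = 1
    master_process = True
    seed_offset = 0
torch.manual_seed(1337 + seed_offset)
```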
Backend configuration
The default backend is NCCL, which is optimized for NVIDIA GPUs. NCCL is recommended for NVIDIA GPU clusters with high-speed interconnects like Infiniband; for CPU training or mixed CPU/GPU setups, use the Gloo backend.
Single-node, multi-GPU training
Launch with torchrun
Train on all available GPUs on a single node by launching the script with `torchrun`.

Parameters explained
| Parameter | Description |
|---|---|
| `--standalone` | Single-node training (auto-configures the master address) |
| `--nproc_per_node=8` | Number of processes (GPUs) to use |
| `train.py` | Training script |
| `config/train_gpt2.py` | Configuration file |
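Assembled from the parameters above, the full single-node launch command (as in the nanoGPT README) is:

```shell
torchrun --standalone --nproc_per_node=8 train.py config/train_gpt2.py
```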
Example: 4 GPUs
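Only `--nproc_per_node` changes for a 4-GPU run:

```shell
torchrun --standalone --nproc_per_node=4 train.py config/train_gpt2.py
```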
torchrun sets environment variables
For each process:
- `RANK`: global rank (0-3)
- `LOCAL_RANK`: local rank on this node (0-3)
- `WORLD_SIZE`: total number of processes (4)
Each process initializes
- Loads the same model
- Sets a different CUDA device based on `LOCAL_RANK`
- Uses a different random seed (`1337 + RANK`)
Multi-node training
Two-node example
For training across 2 nodes, each with 8 GPUs:
- Master node
- Worker node
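The commands, following the nanoGPT README (the IP address is a placeholder for your master node's address):

```shell
# on the master node (node_rank 0):
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 \
  --master_addr=123.456.123.456 --master_port=1234 \
  train.py config/train_gpt2.py

# on the worker node (node_rank 1):
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 \
  --master_addr=123.456.123.456 --master_port=1234 \
  train.py config/train_gpt2.py
```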
Multi-node parameters
| Parameter | Description |
|---|---|
| `--nproc_per_node=8` | GPUs per node |
| `--nnodes=2` | Total number of nodes |
| `--node_rank=0` | Rank of this node (0 for master, 1+ for workers) |
| `--master_addr` | IP address of the master node |
| `--master_port` | Port for communication (default: 29500) |
Infiniband configuration
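Per the nanoGPT README, if the nodes do not have Infiniband, prepend `NCCL_IB_DISABLE=1` so NCCL falls back to TCP; with working Infiniband, omit it (the IP is a placeholder):

```shell
NCCL_IB_DISABLE=1 torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 \
  --master_addr=123.456.123.456 --master_port=1234 train.py
```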
With Infiniband available, NCCL uses the high-bandwidth interconnect automatically; without it, set `NCCL_IB_DISABLE=1` so NCCL falls back to regular TCP.

Benchmark your interconnect
Use `iperf3` to test network bandwidth between nodes:
- Infiniband: 100+ Gbps
- 10GbE: ~10 Gbps
- 1GbE: ~1 Gbps (will be very slow for multi-node training)
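A quick check looks like this (the IP is a placeholder for the server node):

```shell
# on one node, start an iperf3 server:
iperf3 -s
# on another node, measure bandwidth to it:
iperf3 -c 10.0.0.1
```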
Gradient accumulation with DDP
Automatic scaling
Gradient accumulation steps are automatically divided by the world size to maintain the same effective batch size.

Example calculation
With `config/train_gpt2.py`:
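A worked version of the arithmetic, using the values in `config/train_gpt2.py` (where `gradient_accumulation_steps = 5 * 8`):

```python
batch_size = 12                      # micro-batch size per GPU
block_size = 1024                    # context length in tokens
gradient_accumulation_steps = 5 * 8  # = 40, as set in the config

world_size = 8  # launched with --nproc_per_node=8
# train.py divides the accumulation steps across processes:
per_gpu_steps = gradient_accumulation_steps // world_size

# the effective batch, in tokens, is the same with 1 GPU or 8:
tokens_per_iter = world_size * per_gpu_steps * batch_size * block_size
print(per_gpu_steps, tokens_per_iter)  # 5 491520
```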
The effective batch size remains constant regardless of the number of GPUs. Each GPU processes fewer gradient accumulation steps.
Gradient synchronization
Efficient sync strategy
From `train.py:292-298`, gradients are only synchronized on the last micro-step:
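A self-contained sketch of the pattern, with a toy model and data standing in for the DDP-wrapped GPT and the real batch loader:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)          # toy stand-in for the DDP-wrapped model
gradient_accumulation_steps = 5
ddp = False                      # True when launched with torchrun

for micro_step in range(gradient_accumulation_steps):
    if ddp:
        # DDP only all-reduces gradients when this flag is True, so syncing
        # on the last micro-step alone skips communication for the rest
        model.require_backward_grad_sync = (
            micro_step == gradient_accumulation_steps - 1)
    x = torch.randn(8, 4)
    loss = model(x).pow(2).mean() / gradient_accumulation_steps
    loss.backward()  # gradients accumulate locally across micro-steps
```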
Checkpointing and logging
Master process only
Only the master process (rank 0) performs I/O operations.

Unwrap DDP for checkpointing
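A minimal sketch, with a toy model standing in for the (possibly DDP-wrapped) GPT:

```python
import os
import tempfile
import torch
import torch.nn as nn

ddp = False              # True when launched with torchrun
master_process = True    # only rank 0 in a real DDP run
model = nn.Linear(4, 4)  # toy stand-in for the model

# DDP wraps the model in an extra .module layer; save the raw module so the
# checkpoint loads the same way with or without DDP
raw_model = model.module if ddp else model
if master_process:
    out_dir = tempfile.mkdtemp()
    ckpt_path = os.path.join(out_dir, 'ckpt.pt')
    torch.save({'model': raw_model.state_dict()}, ckpt_path)
```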
Cleanup
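In `train.py` this is a short guard at the very end of the script; roughly:

```python
from torch.distributed import destroy_process_group

ddp = False  # True when launched with torchrun
if ddp:
    destroy_process_group()  # tear down NCCL/Gloo communicators before exit
```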
Always destroy the process group when training completes.

Advanced DDP configurations
Custom backend
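The backend is an ordinary config value in `train.py`, so it can be overridden from the command line; a hedged example for a 2-process CPU run (flags use nanoGPT's config-override mechanism):

```shell
torchrun --standalone --nproc_per_node=2 train.py \
  --backend=gloo --device=cpu --compile=False
```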
A non-NCCL backend such as Gloo is useful for CPU training or debugging.

NCCL environment variables
Optimize NCCL performance with environment variables.

Find available network interfaces
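For example (`eth0` is a placeholder; pick one of the interfaces the first command lists):

```shell
ip addr show                     # list available network interfaces
export NCCL_SOCKET_IFNAME=eth0   # pin NCCL to a specific interface
export NCCL_DEBUG=INFO           # verbose NCCL logging while tuning
```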
Performance considerations
Scaling efficiency
Due to gradient synchronization overhead, scaling efficiency decreases as you add more GPUs. Expect 80-90% efficiency on 8 GPUs, 60-70% on 64 GPUs.
Batch size tuning
Increase `batch_size` or `gradient_accumulation_steps` to:
- Reduce gradient sync overhead
- Improve GPU utilization
- Maintain stable training
Memory optimization
If you run out of memory:
- Decrease `batch_size`
- Decrease `block_size` (context length)
- Enable gradient checkpointing (requires code modification)
- Use a smaller model (`n_layer`, `n_head`, `n_embd`)
Troubleshooting
Common issues
NCCL timeout or hang
- Check network connectivity between nodes
- Verify the firewall allows traffic on the master port
- Try `NCCL_DEBUG=INFO` to see detailed logs
- Increase the timeout: `NCCL_TIMEOUT=7200` (seconds)
Out of memory on some GPUs
- Ensure all GPUs have the same amount of memory
- Check for memory leaks in data loading
- Reduce `batch_size` or `block_size`
Slow multi-node training
- Benchmark the interconnect with `iperf3`
- Disable Infiniband if it is not available: `NCCL_IB_DISABLE=1`
- Check for network congestion
Different results on different GPUs
- Ensure deterministic operations are disabled (the default)
- Check that `torch.manual_seed` is set correctly
- Verify all processes load the same initial checkpoint
Next steps
Reproducing GPT-2
Train a 124M parameter model with DDP
Finetuning
Finetune pretrained models on custom data