Distributed Training
ONNX Runtime Training integrates with popular distributed training frameworks to scale training across multiple GPUs and nodes. This guide covers setup and best practices for distributed training with ORTModule.
Supported Frameworks
ORTModule works with:
- PyTorch DDP (DistributedDataParallel): Native PyTorch multi-GPU training
- DeepSpeed: Memory-efficient training with ZeRO optimizer
- DeepSpeed Pipeline Parallelism: Model parallelism for very large models
- PyTorch FSDP: Fully Sharded Data Parallel
- Horovod: Multi-framework distributed training
PyTorch DistributedDataParallel (DDP)
Basic Setup
Wrap your model with ORTModule before wrapping it with DDP.
Launch Script
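A typical launch uses `torchrun`; the script name `train.py`, GPU counts, and addresses below are placeholders:

```shell
# Single node, 4 GPUs: torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE
# for each spawned process.
torchrun --nproc_per_node=4 train.py

# Two nodes with 4 GPUs each; the rank-0 node is reachable at 192.168.1.1.
torchrun --nnodes=2 --nproc_per_node=4 \
    --rdzv_backend=c10d --rdzv_endpoint=192.168.1.1:29500 \
    train.py
```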
Complete DDP Example
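A minimal end-to-end sketch, assuming `onnxruntime-training` is installed and the script is launched with `torchrun`; the model and hyperparameters are illustrative:

```python
# Sketch of a DDP + ORTModule training script (launch with torchrun).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from onnxruntime.training.ortmodule import ORTModule  # requires onnxruntime-training

def main():
    # torchrun provides LOCAL_RANK, RANK, and WORLD_SIZE
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(
        torch.nn.Linear(784, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
    ).to(local_rank)

    # Order matters: wrap with ORTModule first, then DDP
    model = ORTModule(model)
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()

    for step in range(10):
        x = torch.randn(32, 784, device=local_rank)
        y = torch.randint(0, 10, (32,), device=local_rank)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()   # DDP all-reduces gradients during backward
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```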
DeepSpeed Integration
DeepSpeed provides memory-efficient training through the ZeRO optimizer stages.
Basic DeepSpeed Setup
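A sketch of initializing DeepSpeed around an ORTModule-wrapped model, assuming `deepspeed` and `onnxruntime-training` are installed; `ds_config.json` is a hypothetical path to a DeepSpeed JSON configuration file:

```python
# Sketch: wrap with ORTModule first, then hand the model to DeepSpeed.
import deepspeed
import torch
from onnxruntime.training.ortmodule import ORTModule

model = torch.nn.Linear(784, 10)
model = ORTModule(model)  # wrap before deepspeed.initialize

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",  # hypothetical DeepSpeed config path
)

# Training step: DeepSpeed handles scaling, accumulation, and optimizer state
# loss = loss_fn(model_engine(x), y)
# model_engine.backward(loss)
# model_engine.step()
```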
DeepSpeed Configuration File
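A representative configuration enabling FP16 and ZeRO stage 2; the batch sizes are illustrative (micro-batch 4 × 4 accumulation steps × 4 GPUs = 64):

```json
{
  "train_batch_size": 64,
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 4,
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```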
Launch with DeepSpeed
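Assuming a config file named `ds_config.json` and a script `train.py` that forwards DeepSpeed's command-line arguments (both names are placeholders):

```shell
# The deepspeed launcher discovers local GPUs and sets the distributed
# environment for each spawned process.
deepspeed train.py --deepspeed --deepspeed_config ds_config.json
```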
DeepSpeed Pipeline Parallelism
For models too large to fit on a single GPU, pipeline parallelism splits the model into stages that run on different devices.
Data Loading Best Practices
Use DistributedSampler
Ensure each process receives a different shard of the data by using DistributedSampler.
Load Balancing for Variable Length Sequences
For NLP and speech tasks with variable-length inputs, group samples of similar length into the same batch to balance work across ranks and reduce padding.
Environment Variables for Distributed Training
Essential Variables
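These are the core variables a launcher (e.g. `torchrun`) sets for each process; the values below are examples for manual setup:

```shell
export MASTER_ADDR=192.168.1.1   # IP or hostname of the rank-0 node
export MASTER_PORT=29500         # free TCP port on the rank-0 node
export WORLD_SIZE=8              # total number of processes
export RANK=0                    # global rank of this process
export LOCAL_RANK=0              # GPU index on this node
```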
ORTModule Distributed Settings
Checkpoint Saving and Loading
Save Checkpoints (Rank 0 only)
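A sketch of the standard rank-0 checkpoint pattern; `save_checkpoint` is a hypothetical helper:

```python
# Sketch: save from rank 0 only so ranks don't write over each other.
import torch
import torch.distributed as dist

def save_checkpoint(model, optimizer, epoch, path="checkpoint.pt"):
    rank = dist.get_rank() if dist.is_initialized() else 0
    if rank == 0:
        # unwrap DDP (.module) if present to save the plain state_dict
        to_save = getattr(model, "module", model)
        torch.save(
            {"model": to_save.state_dict(),
             "optimizer": optimizer.state_dict(),
             "epoch": epoch},
            path,
        )
    if dist.is_initialized():
        dist.barrier()  # other ranks wait until the file is written
```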
Testing and Debugging
Test Distributed Setup
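A quick sanity check that the process group and collectives work; `check_distributed` is a hypothetical helper, and it uses the gloo backend so it also runs on CPU:

```python
# Sketch: initialize a process group and verify collective communication.
import os
import torch
import torch.distributed as dist

def check_distributed(backend: str = "gloo") -> torch.Tensor:
    # A launcher normally sets these; defaults allow a single-process test.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    dist.init_process_group(backend=backend)
    # Each rank contributes (rank + 1); the all_reduce sums over all ranks.
    t = torch.ones(1) * (dist.get_rank() + 1)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce -> {t.item()}")
    dist.destroy_process_group()
    return t

if __name__ == "__main__":
    check_distributed()
```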
Enable Detailed Logging
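Commonly used logging switches for the PyTorch distributed stack and NCCL:

```shell
export NCCL_DEBUG=INFO                  # NCCL collective and transport logging
export TORCH_DISTRIBUTED_DEBUG=DETAIL   # per-collective shape/consistency checks
export TORCH_CPP_LOG_LEVEL=INFO         # c10d log output
```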
Performance Tips
- Wrap Order: Always wrap with ORTModule before DDP/DeepSpeed
- Batch Size: Use the largest batch size that fits in memory
- Gradient Accumulation: Simulate larger batches with accumulation
- Mixed Precision: Enable FP16 training for faster computation
- Communication Backend: Use NCCL for GPU training, Gloo for CPU
- Pin Memory: Enable pin_memory=True in DataLoader
- Persistent Workers: Set persistent_workers=True to avoid respawning workers every epoch
- NCCL Tuning: Optimize NCCL settings for your network topology
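Several of the tips above can be combined in the data pipeline. In this sketch, `make_loader` is a hypothetical helper; in real distributed use, `num_replicas` and `rank` are taken from the process group, but they are passed explicitly here so the sketch runs stand-alone:

```python
# Sketch: DataLoader with per-rank sharding, pinned memory, and
# persistent workers.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def make_loader(dataset, rank, world_size, batch_size=32):
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(
        dataset,
        batch_size=batch_size,
        sampler=sampler,          # do not also pass shuffle=True
        num_workers=2,
        pin_memory=True,          # faster host-to-device copies
        persistent_workers=True,  # avoid respawning workers every epoch
    )
    return loader, sampler

# Each epoch, call sampler.set_epoch(epoch) so the shards reshuffle.
```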
Common Issues
Hanging on Initialization
Hangs at startup usually indicate mismatched RANK/WORLD_SIZE values, an unreachable MASTER_ADDR/MASTER_PORT, or a rank that never calls init_process_group.
Out of Memory
Reduce the per-GPU batch size, use gradient accumulation, or move to a higher ZeRO stage to shard optimizer state and gradients across ranks.
Gradient Synchronization Issues
Verify that all ranks execute the same number of forward/backward passes and that the model is wrapped with ORTModule before DDP.
Next Steps
- ORTModule: Learn more about ORTModule features
- Training Overview: Explore other training options