## Quick Start

Train on multiple GPUs with a single `accelerate launch` command (see Basic Multi-GPU Training below).

## Setup
### Install Accelerate

Accelerate is included with LeRobot, so no separate installation is required.

### Configure Accelerate

Generate a configuration file by running `accelerate config`; your answers are saved to `~/.cache/huggingface/accelerate/default_config.yaml`.
## Training with Multiple GPUs

### Basic Multi-GPU Training

Use the `accelerate launch` command:
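A minimal sketch, using the same entry point and flags shown in the Performance Tips section below (adjust `--num_processes` to your hardware):

```shell
accelerate launch --num_processes=4 \
  -m lerobot.scripts.lerobot_train \
  --batch_size=32
```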
### Inline Configuration

Configuration can also be passed inline, without a config file, via flags such as `--num_processes` and `--mixed_precision` on the `accelerate launch` command.

### YAML Configuration File

Create a custom config file `accelerate_config.yaml`:
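A minimal example of what such a file might contain (illustrative values; keys follow Accelerate's config format):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_processes: 4        # number of GPUs
gpu_ids: all
mixed_precision: bf16
```

Launch with it via `accelerate launch --config_file=accelerate_config.yaml -m lerobot.scripts.lerobot_train`.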
## Scaling Batch Size

When using multiple GPUs, scale your batch size accordingly: each process receives `batch_size` samples, so the effective batch size is `batch_size × num_gpus`.
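For example, with a hypothetical per-GPU `--batch_size=32` on 4 GPUs, the effective batch size works out to:

```shell
# per-GPU batch size × number of GPUs
echo $((32 * 4))   # prints 128
```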
## Adjusting Learning Rate

Scale the learning rate with the effective batch size; a common heuristic is linear scaling, i.e. multiplying the base learning rate by the number of GPUs.

## Mixed Precision Training
Use FP16 or BF16 for faster training:

### FP16 (Float16)
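For example (flag name as in Accelerate's CLI; other flags as used elsewhere in this guide):

```shell
accelerate launch --num_processes=4 --mixed_precision=fp16 \
  -m lerobot.scripts.lerobot_train \
  --batch_size=32
```

FP16 reduces memory use but has a narrow exponent range; Accelerate applies gradient scaling automatically in this mode.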
### BF16 (BFloat16)
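For example (BF16 requires hardware support, e.g. NVIDIA Ampere or newer):

```shell
accelerate launch --num_processes=4 --mixed_precision=bf16 \
  -m lerobot.scripts.lerobot_train \
  --batch_size=32
```

BF16 keeps FP32's exponent range, so it is usually more numerically stable than FP16 and needs no loss scaling.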
## Advanced Features

### Gradient Accumulation

Simulate larger batch sizes with gradient accumulation: for example, a per-GPU batch size of 16 × 4 GPUs × 4 accumulation steps = an effective batch size of 256.
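The arithmetic above as a quick sanity check (hypothetical values: per-GPU batch 16, 4 GPUs, 4 accumulation steps):

```shell
# effective batch = per-GPU batch × GPUs × accumulation steps
echo $((16 * 4 * 4))   # prints 256
```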
Selecting Specific GPUs
Use specific GPUs:DeepSpeed Integration
For very large models, use DeepSpeed:deepspeed_config.json):
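A sketch of launching with a DeepSpeed config (flags as in Accelerate's CLI; the contents of the JSON file, e.g. ZeRO stage settings, are up to you):

```shell
accelerate launch --use_deepspeed \
  --deepspeed_config_file=deepspeed_config.json \
  -m lerobot.scripts.lerobot_train \
  --batch_size=32
```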
### Fully Sharded Data Parallel (FSDP)

For extremely large models, FSDP (enabled with Accelerate's `--use_fsdp` flag) shards parameters, gradients, and optimizer states across GPUs.

## Implementation Details
LeRobot's training script uses Accelerate's `Accelerator` class:

- `accelerator.prepare()` wraps objects for distributed training
- `accelerator.backward()` handles gradient synchronization
- Only the main process (rank 0) saves checkpoints
## Testing Multi-GPU Setup

Before launching a long job, test your setup with a short training run (for example, by limiting the number of training steps).

## Troubleshooting
### Out of Memory (OOM)

Reduce the batch size or enable gradient checkpointing.

### Slow Data Loading

Increase the number of dataloader workers (`--num_workers`).

### GPUs Not All Used

Check that `num_processes` matches the number of available GPUs:
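To compare, list the visible GPUs (assuming the NVIDIA driver tools are installed):

```shell
nvidia-smi -L   # prints one line per GPU
```

Then make sure `--num_processes` (or `num_processes` in your config file) equals that count.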
### Different GPU Memory

If your GPUs have different amounts of memory, use the largest batch size that fits on the smallest GPU.

## Performance Tips
```shell
# Start with a small batch size and increase until OOM
accelerate launch --num_processes=4 \
  -m lerobot.scripts.lerobot_train \
  --batch_size=32
```

```shell
# Speed up data loading with more workers and pinned memory
accelerate launch --num_processes=4 \
  -m lerobot.scripts.lerobot_train \
  --num_workers=8 \
  --pin_memory=true
```
## Benchmark Results

Typical speedup from multi-GPU training:

| GPUs | Steps/sec | Speedup | Training Time (50k steps) |
|---|---|---|---|
| 1x A100 | 2.5 | 1.0x | 5.5 hours |
| 2x A100 | 4.8 | 1.9x | 2.9 hours |
| 4x A100 | 9.2 | 3.7x | 1.5 hours |
| 8x A100 | 17.1 | 6.8x | 0.8 hours |
Actual speedup depends on:

- Model size (larger models scale better)
- Batch size (larger batches scale better)
- Data loading speed
- Communication overhead
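As a rough check, parallel efficiency is speedup divided by GPU count; for the 8-GPU row in the table above:

```shell
# 6.8x speedup on 8 GPUs
awk 'BEGIN { printf "%.0f%%\n", 100 * 6.8 / 8 }'   # prints 85%
```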
## Multi-Node Training

For training across multiple machines, run `accelerate launch` on every node with `--num_machines`, `--machine_rank`, `--main_process_ip`, and `--main_process_port` set appropriately.

## Next Steps
- Train Your First Policy - Training basics
- PEFT Training - Efficient fine-tuning
- Accelerate Documentation - Advanced features
- DeepSpeed Integration - For very large models