LeRobot supports distributed training across multiple GPUs using Hugging Face Accelerate. This can significantly speed up training for large models and datasets.

Quick Start

Train on multiple GPUs with a single command:
accelerate launch --multi_gpu --num_processes=4 \
  -m lerobot.scripts.lerobot_train \
  --policy.type=act \
  --dataset.repo_id=lerobot/aloha_mobile_cabinet \
  --steps=100000 \
  --batch_size=32
This distributes training across 4 GPUs automatically.

Setup

Install Accelerate

Accelerate is included with LeRobot:
pip install lerobot
Or install separately:
pip install accelerate

Configure Accelerate

Generate a configuration file:
accelerate config
Answer the prompts:
In which compute environment are you running?
> This machine

Which type of machine are you using?
> Multi-GPU

How many different machines will you use?
> 1

Do you want to use DeepSpeed?
> No

Do you want to use FullyShardedDataParallel?
> No

How many GPU(s) should be used for distributed training?
> 4

Do you wish to use mixed precision?
> fp16
This creates ~/.cache/huggingface/accelerate/default_config.yaml.

Training with Multiple GPUs

Basic Multi-GPU Training

Use the accelerate launch command:
accelerate launch --config_file ~/.cache/huggingface/accelerate/default_config.yaml \
  -m lerobot.scripts.lerobot_train \
  --policy.type=diffusion \
  --dataset.repo_id=lerobot/pusht \
  --steps=10000 \
  --batch_size=64

Inline Configuration

Specify configuration inline without a config file:
accelerate launch \
  --multi_gpu \
  --num_processes=4 \
  --mixed_precision=fp16 \
  -m lerobot.scripts.lerobot_train \
  --policy.type=act \
  --dataset.repo_id=lerobot/aloha_mobile_cabinet \
  --steps=100000 \
  --batch_size=8

YAML Configuration File

Create a custom config file accelerate_config.yaml:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: fp16
num_processes: 4
use_cpu: false
gpu_ids: all
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
rdzv_backend: static
same_network: true
Use it:
accelerate launch --config_file accelerate_config.yaml \
  -m lerobot.scripts.lerobot_train \
  --policy.type=diffusion \
  --dataset.repo_id=lerobot/pusht

Scaling Batch Size

When using multiple GPUs, scale your batch size accordingly:
# Single GPU: batch_size=64
lerobot-train --policy.type=diffusion --batch_size=64

# 4 GPUs: batch_size=64 per GPU = effective batch_size=256
accelerate launch --num_processes=4 \
  -m lerobot.scripts.lerobot_train \
  --policy.type=diffusion \
  --batch_size=64
Each GPU processes batch_size samples, so effective batch size = batch_size × num_gpus.

Adjusting Learning Rate

Scale learning rate with batch size:
# Single GPU: lr=1e-4, batch=64
lerobot-train --policy.optimizer_lr=1e-4 --batch_size=64

# 4 GPUs: scale lr proportionally
accelerate launch --num_processes=4 \
  -m lerobot.scripts.lerobot_train \
  --policy.optimizer_lr=4e-4 \
  --batch_size=64
Rule of thumb: a common starting point is to scale the learning rate linearly with the effective batch size, as in the example above; a more conservative option is to multiply the learning rate by √N when you multiply the batch size by N.
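As a quick sanity check, here is a minimal sketch (plain Python; the numbers match the 4-GPU example above) comparing the two scaling rules:

import math

# Compare linear and square-root learning-rate scaling.
# Values match the example above; adjust them for your own run.
base_lr = 1e-4
base_batch = 64            # single-GPU batch size
effective_batch = 64 * 4   # per-GPU batch size x number of GPUs

ratio = effective_batch / base_batch
linear_lr = base_lr * ratio            # 4e-4, as in the example above
sqrt_lr = base_lr * math.sqrt(ratio)   # 2e-4, the more conservative choice

print(f"linear scaling: {linear_lr:.1e}")
print(f"sqrt scaling:   {sqrt_lr:.1e}")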

Mixed Precision Training

Use FP16 or BF16 for faster training:

FP16 (Float16)

accelerate launch \
  --multi_gpu \
  --num_processes=4 \
  --mixed_precision=fp16 \
  -m lerobot.scripts.lerobot_train \
  --policy.type=act
FP16 is supported on most modern GPUs (NVIDIA Pascal and newer).

BF16 (BFloat16)

accelerate launch \
  --multi_gpu \
  --num_processes=4 \
  --mixed_precision=bf16 \
  -m lerobot.scripts.lerobot_train \
  --policy.type=act
BF16 requires NVIDIA Ampere or newer GPUs (A100, RTX 3090, etc.) but is more numerically stable than FP16.
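If you are unsure whether your GPU supports BF16, a quick check with PyTorch (already installed with LeRobot) looks like this:

import torch

# Report whether the current CUDA device can run BF16.
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"BF16 supported: {torch.cuda.is_bf16_supported()}")
else:
    print("No CUDA device found")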

Advanced Features

Gradient Accumulation

Simulate larger batch sizes with gradient accumulation:
accelerate launch --num_processes=4 \
  -m lerobot.scripts.lerobot_train \
  --policy.type=diffusion \
  --batch_size=16 \
  --gradient_accumulation_steps=4
Effective batch size = 16 × 4 GPUs × 4 accumulation steps = 256.
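The same arithmetic as a small sketch (plain Python; values taken from the command above):

# Effective batch size when combining data parallelism and gradient accumulation.
per_gpu_batch = 16
num_gpus = 4
accumulation_steps = 4

effective_batch = per_gpu_batch * num_gpus * accumulation_steps
print(effective_batch)  # 256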

Selecting Specific GPUs

Use specific GPUs:
# Use GPUs 0 and 1
CUDA_VISIBLE_DEVICES=0,1 accelerate launch --num_processes=2 \
  -m lerobot.scripts.lerobot_train \
  --policy.type=act

# Use GPUs 2, 3, 4, 5
CUDA_VISIBLE_DEVICES=2,3,4,5 accelerate launch --num_processes=4 \
  -m lerobot.scripts.lerobot_train \
  --policy.type=act

DeepSpeed Integration

For very large models, use DeepSpeed:
accelerate launch \
  --use_deepspeed \
  --deepspeed_config_file=deepspeed_config.json \
  --num_processes=8 \
  -m lerobot.scripts.lerobot_train \
  --policy.type=smolvla \
  --dataset.repo_id=HuggingFaceVLA/libero
DeepSpeed config (deepspeed_config.json):
{
  "train_batch_size": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu"
    }
  },
  "fp16": {
    "enabled": true
  }
}

Fully Sharded Data Parallel (FSDP)

For extremely large models:
accelerate launch \
  --use_fsdp \
  --fsdp_auto_wrap_policy=TRANSFORMER_BASED_WRAP \
  --num_processes=8 \
  -m lerobot.scripts.lerobot_train \
  --policy.type=pi0 \
  --dataset.repo_id=HuggingFaceVLA/libero

Implementation Details

LeRobot’s training script uses Accelerate’s Accelerator class:
from accelerate import Accelerator

# Initialize accelerator
accelerator = Accelerator(
    mixed_precision='fp16',
    gradient_accumulation_steps=4
)

# Wrap model, optimizer, dataloader
model, optimizer, dataloader = accelerator.prepare(
    model, optimizer, dataloader
)

# Training loop
for batch in dataloader:
    with accelerator.accumulate(model):
        loss, _ = model(batch)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

# Only main process saves checkpoints
if accelerator.is_main_process:
    model.save_pretrained(checkpoint_dir)
Key points:
  • accelerator.prepare() wraps objects for distributed training
  • accelerator.backward() handles gradient synchronization
  • Only the main process (rank 0) saves checkpoints
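One detail worth noting: after accelerator.prepare(), the model is wrapped (e.g. in DistributedDataParallel), so checkpointing code typically unwraps it first. A minimal sketch of the usual Accelerate pattern (not LeRobot's exact code; checkpoint_dir is a placeholder):

# Wait for all processes to finish, then save only from the main process.
accelerator.wait_for_everyone()
if accelerator.is_main_process:
    unwrapped_model = accelerator.unwrap_model(model)  # strip the DDP wrapper
    unwrapped_model.save_pretrained(checkpoint_dir)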

Testing Multi-GPU Setup

Test your setup with a short training run:
accelerate launch --num_processes=4 \
  -m lerobot.scripts.lerobot_train \
  --policy.type=act \
  --dataset.repo_id=lerobot/pusht \
  --dataset.episodes=[0] \
  --steps=100 \
  --batch_size=4 \
  --log_freq=10
Monitor GPU usage:
watch -n 1 nvidia-smi
You should see all GPUs being utilized.

Troubleshooting

Out of Memory (OOM)

Reduce the per-GPU batch size and use gradient accumulation to keep the effective batch size:
accelerate launch --num_processes=4 \
  -m lerobot.scripts.lerobot_train \
  --policy.type=act \
  --batch_size=8 \
  --gradient_accumulation_steps=4

Slow Data Loading

Increase the number of dataloader workers:
accelerate launch --num_processes=4 \
  -m lerobot.scripts.lerobot_train \
  --num_workers=8

GPUs Not All Used

Check that num_processes matches the number of available GPUs:
import torch
print(f"Available GPUs: {torch.cuda.device_count()}")

Different GPU Memory

If your GPUs have different amounts of memory, size the batch for the smallest GPU:
# For mixed GPU setup (e.g., 1x A100 40GB + 3x RTX 3090 24GB)
accelerate launch --num_processes=4 \
  -m lerobot.scripts.lerobot_train \
  --batch_size=8  # Sized for 24GB GPUs

Performance Tips

1. Use appropriate batch size

Maximize GPU utilization without OOM:
# Start with small batch and increase until OOM
accelerate launch --num_processes=4 \
  -m lerobot.scripts.lerobot_train \
  --batch_size=32

2. Enable mixed precision

FP16/BF16 reduces memory and increases speed:
accelerate launch --mixed_precision=fp16 --num_processes=4 \
  -m lerobot.scripts.lerobot_train

3. Optimize data loading

Use more workers and pin memory:
accelerate launch --num_processes=4 \
  -m lerobot.scripts.lerobot_train \
  --num_workers=8 \
  --pin_memory=true

4. Use gradient accumulation

Simulate larger batches:
accelerate launch --num_processes=4 \
  -m lerobot.scripts.lerobot_train \
  --batch_size=16 \
  --gradient_accumulation_steps=4

Benchmark Results

Typical speedup from multi-GPU training:
GPUs      Steps/sec   Speedup   Training Time (50k steps)
1x A100   2.5         1.0x      5.5 hours
2x A100   4.8         1.9x      2.9 hours
4x A100   9.2         3.7x      1.5 hours
8x A100   17.1        6.8x      0.8 hours
Scaling efficiency depends on:
  • Model size (larger models scale better)
  • Batch size (larger batches scale better)
  • Data loading speed
  • Communication overhead
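To put a number on scaling efficiency, compare the measured speedup against the ideal linear speedup. A minimal sketch using the figures from the table above:

# Scaling efficiency = measured speedup / ideal linear speedup.
benchmarks = {1: 2.5, 2: 4.8, 4: 9.2, 8: 17.1}  # GPUs -> steps/sec

baseline = benchmarks[1]
for gpus, steps_per_sec in benchmarks.items():
    speedup = steps_per_sec / baseline
    efficiency = speedup / gpus
    print(f"{gpus}x A100: speedup {speedup:.1f}x, efficiency {efficiency:.0%}")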

Multi-Node Training

For training across multiple machines (note that --num_processes is the total number of processes across all machines):
# On machine 0 (main)
accelerate launch \
  --num_processes=8 \
  --num_machines=2 \
  --machine_rank=0 \
  --main_process_ip=192.168.1.100 \
  --main_process_port=29500 \
  -m lerobot.scripts.lerobot_train

# On machine 1
accelerate launch \
  --num_processes=8 \
  --num_machines=2 \
  --machine_rank=1 \
  --main_process_ip=192.168.1.100 \
  --main_process_port=29500 \
  -m lerobot.scripts.lerobot_train
