TRL supports training with DeepSpeed, a library that implements advanced distributed training optimizations including optimizer state partitioning, gradient partitioning, parameter offloading, and more. DeepSpeed integrates the Zero Redundancy Optimizer (ZeRO), which allows scaling model size proportionally to the number of devices while maintaining high efficiency.

[Figure: ZeRO stages diagram showing optimizer state, gradient, and parameter partitioning across GPUs]

Installation

pip install deepspeed
No modifications to your training script are required to use DeepSpeed. Simply launch with an Accelerate config file.

ZeRO stages

DeepSpeed ZeRO has three stages, each partitioning progressively more state across devices:
ZeRO Stage 1: optimizer state partitioning

Each GPU holds the full model parameters and gradients, but optimizer states (e.g., Adam momentum and variance) are sharded across GPUs.
  • Memory reduction: ~4x for mixed precision training (optimizer states typically consume the most memory)
  • Communication overhead: minimal
  • Best for: reducing optimizer memory with low communication cost
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero1.yaml train.py
ZeRO Stage 2: optimizer state and gradient partitioning

Optimizer states and gradients are both sharded across GPUs. Model parameters remain replicated.
  • Memory reduction: ~8x for mixed precision training
  • Communication overhead: similar to DDP
  • Best for: most multi-GPU training scenarios
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml train.py
ZeRO Stage 3: full parameter partitioning

Optimizer states, gradients, and model parameters are all sharded across GPUs. Parameters are gathered on demand during the forward and backward passes.
  • Memory reduction: proportional to the number of GPUs
  • Communication overhead: higher than Stage 2
  • Best for: very large models that do not fit on a single GPU even with Stage 2
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml train.py
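The memory reductions quoted above follow from simple per-parameter byte accounting. As a sketch (using the mixed-precision Adam accounting from the ZeRO paper: 2 bytes for fp16 parameters, 2 bytes for fp16 gradients, and 12 bytes for fp32 optimizer states per parameter; `per_gpu_memory_gb` is a hypothetical helper, not part of TRL or DeepSpeed):

```python
def per_gpu_memory_gb(num_params: float, num_gpus: int, stage: int = 0) -> float:
    """Rough per-GPU model-state memory for mixed-precision Adam under ZeRO.

    Ignores activations, fragmentation, and communication buffers.
    """
    # Bytes per parameter: fp16 params, fp16 grads, fp32 optimizer states
    # (fp32 param copy + momentum + variance = 4 + 4 + 4 = 12).
    P, G, O = 2, 2, 12
    if stage == 0:                      # plain data parallelism: all replicated
        bytes_per_param = P + G + O
    elif stage == 1:                    # shard optimizer states only
        bytes_per_param = P + G + O / num_gpus
    elif stage == 2:                    # shard optimizer states + gradients
        bytes_per_param = P + (G + O) / num_gpus
    elif stage == 3:                    # shard everything
        bytes_per_param = (P + G + O) / num_gpus
    else:
        raise ValueError(f"unknown ZeRO stage: {stage}")
    return bytes_per_param * num_params / 1e9

# Example: a 7B-parameter model on 8 GPUs.
for stage in range(4):
    print(f"Stage {stage}: {per_gpu_memory_gb(7e9, 8, stage):.2f} GB per GPU")
```

With many GPUs the sharded terms vanish, which recovers the quoted asymptotic reductions: Stage 1 approaches 16/4 = 4x, Stage 2 approaches 16/2 = 8x, and Stage 3 scales with the number of GPUs.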

Running training with DeepSpeed

Use accelerate launch with a DeepSpeed config file. TRL provides ready-to-use config files in examples/accelerate_configs/.
accelerate launch --config_file <ACCELERATE_WITH_DEEPSPEED_CONFIG_FILE.yaml> train.py
For example, to train with ZeRO Stage 2:
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml train.py

Example accelerate configs

The following configs are provided in the TRL repository and can be used directly or adapted for your setup. Shown below is the ZeRO Stage 1 config (deepspeed_zero1.yaml):
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  zero3_init_flag: false
  zero_stage: 1
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
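The Stage 2 and Stage 3 configs differ from the above only in the deepspeed_config block. As a rough sketch of the Stage 3 variant (field values reflect our reading of the repository's deepspeed_zero3.yaml and may drift as the repo evolves; check the file itself before relying on them):

```yaml
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  zero3_init_flag: true          # construct large models directly in sharded form
  zero3_save_16bit_model: true   # gather a consolidated half-precision checkpoint on save
  zero_stage: 3
```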

Multi-node setup

For multi-node training, update num_machines, machine_rank, and rdzv_backend in your config file. The deepspeed_multinode_launcher: standard field is already set in the provided configs. Consult the Accelerate DeepSpeed documentation for detailed guidance on multi-node configuration.
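As an illustration, a two-node run with 8 GPUs per node might change these fields (the address and port values are placeholders; main_process_ip and main_process_port are standard Accelerate fields identifying the rank-0 node):

```yaml
num_machines: 2
num_processes: 16           # total processes across all nodes
machine_rank: 0             # set to 1 in the config used on the second node
main_process_ip: 10.0.0.1   # illustrative address of the rank-0 node
main_process_port: 29500
rdzv_backend: c10d          # dynamic rendezvous; 'static' also works with explicit ranks
```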

Additional resources

Accelerate DeepSpeed guide

Full documentation for the DeepSpeed plugin in Accelerate.

TRL accelerate configs

Ready-to-use Accelerate config files for all ZeRO stages.

ZeRO paper

Zero Redundancy Optimizer — the foundational research paper.
