TRL supports training with DeepSpeed, a library that implements advanced distributed training optimizations including optimizer state partitioning, gradient partitioning, parameter offloading, and more. DeepSpeed integrates the Zero Redundancy Optimizer (ZeRO), which allows scaling model size proportionally to the number of devices while maintaining high efficiency.

[Figure: ZeRO stages diagram showing optimizer state, gradient, and parameter partitioning across GPUs]

Installation

pip install deepspeed
No modifications to your training script are required to use DeepSpeed. Simply launch with an Accelerate config file.

ZeRO stages

DeepSpeed ZeRO has three stages, each partitioning progressively more state across devices:
ZeRO Stage 1: optimizer state partitioning

Each GPU holds the full model parameters and gradients, but optimizer states (e.g., Adam momentum and variance) are sharded across GPUs.
  • Memory reduction: ~4x for mixed precision training (optimizer states typically consume the most memory)
  • Communication overhead: minimal
  • Best for: reducing optimizer memory with low communication cost
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero1.yaml train.py
ZeRO Stage 2: optimizer state and gradient partitioning

Optimizer states and gradients are both sharded across GPUs. Model parameters remain replicated.
  • Memory reduction: ~8x for mixed precision training
  • Communication overhead: similar to DDP
  • Best for: most multi-GPU training scenarios
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml train.py
ZeRO Stage 3: full parameter partitioning

Optimizer states, gradients, and model parameters are all sharded across GPUs. Parameters are gathered on demand during the forward and backward passes.
  • Memory reduction: proportional to the number of GPUs
  • Communication overhead: higher than Stage 2
  • Best for: very large models that do not fit on a single GPU even with Stage 2
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml train.py
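The memory reductions quoted above follow from simple per-parameter byte accounting. As a sketch (using the mixed-precision Adam accounting from the ZeRO paper: 2 bytes for fp16 parameters, 2 bytes for fp16 gradients, and 12 bytes for fp32 optimizer states per parameter; `per_gpu_memory_gb` is a hypothetical helper, not part of TRL or DeepSpeed):

```python
def per_gpu_memory_gb(num_params: float, num_gpus: int, stage: int = 0) -> float:
    """Rough per-GPU model-state memory for mixed-precision Adam under ZeRO.

    Ignores activations, fragmentation, and communication buffers.
    """
    # Bytes per parameter: fp16 params, fp16 grads, fp32 optimizer states
    # (fp32 param copy + momentum + variance = 4 + 4 + 4 = 12).
    P, G, O = 2, 2, 12
    if stage == 0:                      # plain data parallelism: all replicated
        bytes_per_param = P + G + O
    elif stage == 1:                    # shard optimizer states only
        bytes_per_param = P + G + O / num_gpus
    elif stage == 2:                    # shard optimizer states + gradients
        bytes_per_param = P + (G + O) / num_gpus
    elif stage == 3:                    # shard everything
        bytes_per_param = (P + G + O) / num_gpus
    else:
        raise ValueError(f"unknown ZeRO stage: {stage}")
    return bytes_per_param * num_params / 1e9

# Example: a 7B-parameter model on 8 GPUs.
for stage in range(4):
    print(f"Stage {stage}: {per_gpu_memory_gb(7e9, 8, stage):.2f} GB per GPU")
```

With many GPUs the sharded terms vanish, which recovers the quoted asymptotic reductions: Stage 1 approaches 16/4 = 4x, Stage 2 approaches 16/2 = 8x, and Stage 3 scales with the number of GPUs.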

Running training with DeepSpeed

Use accelerate launch with a DeepSpeed config file. TRL provides ready-to-use config files in examples/accelerate_configs/.
accelerate launch --config_file <ACCELERATE_WITH_DEEPSPEED_CONFIG_FILE.yaml> train.py
For example, to train with ZeRO Stage 2:
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml train.py

Example accelerate configs

The following configs are provided in the TRL repository and can be used directly or adapted for your setup. Shown below is the ZeRO Stage 1 config (deepspeed_zero1.yaml):
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  zero3_init_flag: false
  zero_stage: 1
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
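The Stage 2 and Stage 3 configs differ from the above only in the deepspeed_config block. As a rough sketch of the Stage 3 variant (field values reflect our reading of the repository's deepspeed_zero3.yaml and may drift as the repo evolves; check the file itself before relying on them):

```yaml
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  zero3_init_flag: true          # construct large models directly in sharded form
  zero3_save_16bit_model: true   # gather a consolidated half-precision checkpoint on save
  zero_stage: 3
```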

Multi-node setup

For multi-node training, update num_machines, machine_rank, and rdzv_backend in your config file. The deepspeed_multinode_launcher: standard field is already set in the provided configs. Consult the Accelerate DeepSpeed documentation for detailed guidance on multi-node configuration.
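As an illustration, a two-node run with 8 GPUs per node might change these fields (the address and port values are placeholders; main_process_ip and main_process_port are standard Accelerate fields identifying the rank-0 node):

```yaml
num_machines: 2
num_processes: 16           # total processes across all nodes
machine_rank: 0             # set to 1 in the config used on the second node
main_process_ip: 10.0.0.1   # illustrative address of the rank-0 node
main_process_port: 29500
rdzv_backend: c10d          # dynamic rendezvous; 'static' also works with explicit ranks
```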

Additional resources

Accelerate DeepSpeed guide

Full documentation for the DeepSpeed plugin in Accelerate.

TRL accelerate configs

Ready-to-use Accelerate config files for all ZeRO stages.

ZeRO paper

Zero Redundancy Optimizer — the foundational research paper.
