Installation
No modifications to your training script are required to use DeepSpeed. Simply launch with an Accelerate config file.
ZeRO stages
DeepSpeed ZeRO has three stages, each partitioning progressively more state across devices:
ZeRO Stage 1 — Optimizer state partitioning
Each GPU holds the full model parameters and gradients, but optimizer states (e.g., Adam momentum and variance) are sharded across GPUs.
- Memory reduction: ~4x for mixed precision training (optimizer states typically consume the most memory)
- Communication overhead: minimal
- Best for: reducing optimizer memory with low communication cost
ZeRO Stage 2 — Gradient and optimizer state partitioning
Optimizer states and gradients are both sharded across GPUs. Model parameters remain replicated.
- Memory reduction: ~8x for mixed precision training
- Communication overhead: similar to DDP
- Best for: most multi-GPU training scenarios
ZeRO Stage 3 — Full parameter partitioning
Optimizer states, gradients, and model parameters are all sharded across GPUs. Parameters are gathered on demand during the forward and backward passes.
- Memory reduction: proportional to the number of GPUs
- Communication overhead: higher than Stage 2
- Best for: very large models that do not fit on a single GPU even with Stage 2
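The memory figures above follow the accounting in the ZeRO paper: mixed-precision Adam training keeps roughly 2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state (master weights, momentum, variance) per model parameter, and each stage shards one more of these across the GPUs. A rough sketch of the per-GPU model-state footprint (an illustrative helper, not part of TRL or DeepSpeed):

```python
def zero_memory_per_gpu(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU model-state memory in bytes for mixed-precision Adam.

    Per-parameter cost: 2 bytes fp16 weights + 2 bytes fp16 gradients
    + 12 bytes fp32 optimizer state. Each ZeRO stage shards one more of
    these three components across the GPUs.
    """
    weights, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:  # Stage 1: shard optimizer states
        optim /= num_gpus
    if stage >= 2:  # Stage 2: also shard gradients
        grads /= num_gpus
    if stage >= 3:  # Stage 3: also shard parameters
        weights /= num_gpus
    return num_params * (weights + grads + optim)


# Example: a 7.5B-parameter model on 64 GPUs (the configuration analyzed
# in the ZeRO paper).
for stage in range(4):
    gb = zero_memory_per_gpu(7.5e9, 64, stage) / 1e9
    print(f"ZeRO stage {stage}: ~{gb:.1f} GB per GPU")
```

For this configuration the sketch reproduces the paper's numbers: ~120 GB per GPU with no sharding, ~31 GB at Stage 1, ~17 GB at Stage 2, and ~1.9 GB at Stage 3. Note this covers model states only; activations and temporary buffers add to the real footprint.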
Running training with DeepSpeed
Use `accelerate launch` with a DeepSpeed config file. TRL provides ready-to-use config files in `examples/accelerate_configs/`.
Example accelerate configs
The following configs are provided in the TRL repository and can be used directly or adapted for your setup.
- ZeRO Stage 1
- ZeRO Stage 2
- ZeRO Stage 3
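For reference, a ZeRO Stage 2 Accelerate config looks along these lines (field values here are illustrative; the files in `examples/accelerate_configs/` are the source of truth):

```yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
```

You then pass the file at launch time, e.g. `accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml your_script.py` (script name hypothetical).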
Multi-node setup
For multi-node training, update `num_machines`, `machine_rank`, and `rdzv_backend` in your config file. The `deepspeed_multinode_launcher: standard` field is already set in the provided configs.
Consult the Accelerate DeepSpeed documentation for detailed guidance on multi-node configuration.
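As a sketch, a two-node run might change only these fields relative to the single-node config (values illustrative; `main_process_ip` and `main_process_port` are the Accelerate fields for the rendezvous address):

```yaml
num_machines: 2
machine_rank: 0          # set to 1 on the second node
num_processes: 16        # total processes across all nodes
main_process_ip: 10.0.0.1
main_process_port: 29500
rdzv_backend: c10d
```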
Additional resources
Accelerate DeepSpeed guide
Full documentation for the DeepSpeed plugin in Accelerate.
TRL accelerate configs
Ready-to-use Accelerate config files for all ZeRO stages.
ZeRO paper
Zero Redundancy Optimizer — the foundational research paper.