To maintain a consistent effective batch size when scaling to more GPUs, adjust per_device_train_batch_size and gradient_accumulation_steps accordingly:
| GPUs | Per-device batch size | Gradient accumulation steps | Notes |
|---|---|---|---|
| 1 | 32 | 1 | Higher memory, faster training |
| 1 | 4 | 8 | Lower memory, slower training |
| 8 | 4 | 1 | Best of both worlds |
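Each row in the table keeps the product of the three settings constant. This invariant is easy to check with plain arithmetic; the helper below is an illustrative sketch, not a TRL function:

```python
def effective_batch_size(num_gpus: int, per_device_batch_size: int,
                         gradient_accumulation_steps: int) -> int:
    """Effective batch size = GPUs x per-device batch x accumulation steps."""
    return num_gpus * per_device_batch_size * gradient_accumulation_steps

# All three rows of the table keep the effective batch size at 32:
assert effective_batch_size(1, 32, 1) == 32
assert effective_batch_size(1, 4, 8) == 32
assert effective_batch_size(8, 4, 1) == 32
```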
Keeping a full model replica on each GPU can cause high memory usage for large models. Use DeepSpeed for model sharding via the Zero Redundancy Optimizer (ZeRO) and for CPU/NVMe offloading.
DeepSpeed provides memory optimizations through the ZeRO (Zero Redundancy Optimizer) family of stages. TRL provides predefined accelerate configs you can use directly:
| Profile name | Description |
|---|---|
| zero1 | DeepSpeed ZeRO Stage 1 |
| zero2 | DeepSpeed ZeRO Stage 2 |
| zero3 | DeepSpeed ZeRO Stage 3 |
Pass the profile name via --accelerate_config in the TRL CLI:
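For example, to launch SFT training with the ZeRO Stage 2 profile (the model and dataset names here are illustrative placeholders):

```shell
trl sft \
    --model_name_or_path Qwen/Qwen2.5-0.5B \
    --dataset_name trl-lib/Capybara \
    --accelerate_config zero2
```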
Sequence Parallelism (also called Context Parallelism) splits the sequence dimension across multiple GPUs, enabling training with sequences longer than what fits on a single GPU. TRL supports two implementations:
- **Ring Attention (FSDP2)**: Uses ring-based P2P communication. Best for extremely long sequences (1M+ tokens) and models with few attention heads. Requires Accelerate 1.11.0+ and FSDP2.
- **ALST/Ulysses (DeepSpeed)**: Uses attention head parallelism. Best for high-bandwidth interconnects (NVLink, InfiniBand) and moderate sequence lengths (up to ~500k tokens). Requires DeepSpeed 0.18.1+ and Accelerate 1.12.0+.
```python
from trl import SFTConfig

training_args = SFTConfig(
    pad_to_multiple_of=4,  # must be divisible by cp_size * 2
    max_length=16384,
    packing=True,
    use_liger_kernel=True,
    gradient_checkpointing=False,  # use fsdp_activation_checkpointing instead
    per_device_train_batch_size=1,
    ...
)
```
max_length refers to the global sequence length. The framework automatically splits it into micro-sequences per GPU based on cp_size. With max_length=8192 and cp_size=4, each GPU processes 2048 tokens.
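The per-GPU micro-sequence length is just the global length divided by the context-parallel degree; the snippet below sketches the arithmetic (variable names are illustrative, not TRL API):

```python
max_length = 8192  # global sequence length
cp_size = 4        # number of GPUs the sequence is split across

# The global length must divide evenly across cp_size * 2 shards,
# which is why pad_to_multiple_of must be divisible by cp_size * 2.
assert max_length % (cp_size * 2) == 0

tokens_per_gpu = max_length // cp_size
print(tokens_per_gpu)  # 2048
```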
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 2
machine_rank: 0  # 0 for main node, 1 for second node
main_process_ip: 10.0.0.1  # IP of rank 0 node
main_process_port: 29500
num_processes: 16  # total processes across all nodes
mixed_precision: bf16
use_cpu: false
same_network: true
```
Replace 10.0.0.1 with the actual IP address of the rank 0 (main) node.
SLURM automatically distributes training across all requested nodes, and srun configures the necessary environment variables.
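A minimal SLURM job script following this pattern might look like the sketch below. The config file name, training script, and resource counts are placeholders; only the `accelerate launch` flags and SLURM variables are standard:

```shell
#!/bin/bash
#SBATCH --job-name=trl-multinode
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# srun starts one launcher process per node; accelerate then spawns one
# worker per GPU according to the config file. SLURM_NODEID gives each
# node its rank, and the first hostname in the allocation is the main node.
srun accelerate launch \
    --config_file multi_node.yaml \
    --machine_rank "$SLURM_NODEID" \
    --main_process_ip "$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)" \
    train.py
```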
You can combine multi-node training with DeepSpeed by setting distributed_type: DEEPSPEED and adding a deepspeed_config block. See the DeepSpeed integration guide.