TRL trainers use Accelerate to enable distributed training across multiple GPUs or nodes.

Multi-GPU training

1. Create an Accelerate config

Run the interactive configuration wizard:
accelerate config
Answer the questions for your multi-GPU or multi-node setup.
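For reference, a single-node multi-GPU config produced by the wizard typically looks like the following (the values here are illustrative, assuming 8 GPUs on one machine):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16
num_machines: 1
num_processes: 8   # one process per GPU
machine_rank: 0
use_cpu: false
```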
2. Launch distributed training

accelerate launch train.py
This automatically distributes the workload across all available GPUs.
You can also use the example config files provided in the TRL examples folder:
accelerate launch --config_file examples/accelerate_configs/multi_gpu.yaml train.py <SCRIPT_ARGS>
Under the hood, Accelerate creates one model replica per GPU. Each process:
  • Processes its own batch of data
  • Computes loss and gradients for that batch
  • Synchronizes its gradients with the other GPUs before the optimizer step
The effective batch size is:
Batch Size = per_device_train_batch_size × num_devices × gradient_accumulation_steps
To maintain a consistent effective batch size when scaling to more GPUs, adjust per_device_train_batch_size and gradient_accumulation_steps accordingly:
GPUs  Per-device batch size  Gradient accumulation steps  Notes
1     32                     1                            Higher memory usage, faster training
1     4                      8                            Lower memory usage, slower training
8     4                      1                            Best of both worlds
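The rows above can be checked directly against the formula; a quick sketch:

```python
def effective_batch_size(per_device_batch_size, num_devices, grad_accum_steps):
    """Effective batch size = per-device batch x number of GPUs x accumulation steps."""
    return per_device_batch_size * num_devices * grad_accum_steps

# All three rows in the table preserve an effective batch size of 32:
print(effective_batch_size(32, 1, 1))  # 1 GPU, large per-device batch -> 32
print(effective_batch_size(4, 1, 8))   # 1 GPU, accumulate instead     -> 32
print(effective_batch_size(4, 8, 1))   # 8 GPUs, small per-device batch -> 32
```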
Keeping a full model replica on every GPU can cause high memory usage for large models. To reduce it, use DeepSpeed, which shards model states via the ZeRO (Zero Redundancy Optimizer) stages and supports CPU/NVMe offloading.

DeepSpeed ZeRO

DeepSpeed provides memory optimizations through the ZeRO (Zero Redundancy Optimizer) family of stages. TRL provides predefined accelerate configs you can use directly:
Profile name  Description
zero1         DeepSpeed ZeRO Stage 1
zero2         DeepSpeed ZeRO Stage 2
zero3         DeepSpeed ZeRO Stage 3
Pass the profile name via --accelerate_config in the TRL CLI:
trl sft \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name stanfordnlp/imdb \
  --accelerate_config zero3
Or pass a path to a custom Accelerate YAML config:
trl sft \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name stanfordnlp/imdb \
  --accelerate_config path/to/my/deepspeed_config.yaml
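As a reference point for such a custom file, a DeepSpeed-backed Accelerate config usually combines the standard Accelerate fields with a deepspeed_config block. A sketch (the stage and offload choices here are illustrative, not a recommended setup):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
mixed_precision: bf16
num_machines: 1
num_processes: 8
deepspeed_config:
  zero_stage: 3
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
```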
For a full DeepSpeed integration guide, see DeepSpeed Integration.

FSDP (Fully Sharded Data Parallel)

TRL also supports FSDP via predefined Accelerate config profiles:
Profile name  Description
fsdp1         FSDP Stage 1
fsdp2         FSDP Stage 2 (FSDP2, the PyTorch-native FSDP v2)
trl sft \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name stanfordnlp/imdb \
  --accelerate_config fsdp2

Sequence parallelism for long-context training

Sequence Parallelism (also called Context Parallelism) splits the sequence dimension across multiple GPUs, enabling training with sequences longer than what fits on a single GPU. TRL supports two implementations:

Ring Attention (FSDP2)

Uses ring-based P2P communication. Best for extremely long sequences (1M+ tokens) and models with few attention heads. Requires Accelerate 1.11.0+ and FSDP2.

ALST/Ulysses (DeepSpeed)

Uses attention head parallelism. Best for high-bandwidth interconnects (NVLink, InfiniBand) and moderate sequence lengths (up to ~500k tokens). Requires DeepSpeed 0.18.1+ and Accelerate 1.12.0+.

Ring Attention (FSDP2)

Use the provided accelerate config (e.g. context_parallel_2gpu.yaml):
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: bf16
num_machines: 1
num_processes: 2
fsdp_config:
  fsdp_activation_checkpointing: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_reshard_after_forward: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_version: 2
parallelism_config:
  parallelism_config_dp_replicate_size: 1
  parallelism_config_dp_shard_size: 1
  parallelism_config_tp_size: 1
  parallelism_config_cp_size: 2  # context parallel size
With the corresponding training configuration:
from trl import SFTConfig

training_args = SFTConfig(
    pad_to_multiple_of=4,           # must be divisible by cp_size * 2
    max_length=16384,
    packing=True,
    use_liger_kernel=True,
    gradient_checkpointing=False,   # use fsdp_activation_checkpointing instead
    per_device_train_batch_size=1,
    ...
)
Launch with:
accelerate launch --config_file context_parallel_2gpu.yaml train.py
max_length refers to the global sequence length. The framework automatically splits it into micro-sequences per GPU based on cp_size. With max_length=8192 and cp_size=4, each GPU processes 2048 tokens.
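The split is plain arithmetic; a small sketch of the per-GPU micro-sequence length:

```python
def micro_seq_len(max_length, cp_size):
    """Tokens each GPU processes when the global sequence is split across cp_size ranks."""
    # The (padded) global length must divide evenly across context-parallel ranks.
    assert max_length % cp_size == 0, "global length must divide evenly across ranks"
    return max_length // cp_size

# The example from the text: max_length=8192 split across cp_size=4 GPUs
print(micro_seq_len(8192, 4))   # prints 2048
# The 2-GPU config above: max_length=16384, cp_size=2
print(micro_seq_len(16384, 2))  # prints 8192
```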

ALST/Ulysses (DeepSpeed)

Use the provided accelerate config (e.g. alst_ulysses_4gpu.yaml):
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
mixed_precision: bf16
num_machines: 1
num_processes: 4
deepspeed_config:
  zero_stage: 3
  seq_parallel_communication_data_type: bf16
parallelism_config:
  parallelism_config_dp_replicate_size: 1
  parallelism_config_dp_shard_size: 2
  parallelism_config_tp_size: 1
  parallelism_config_sp_size: 2
  parallelism_config_sp_backend: deepspeed
  parallelism_config_sp_seq_length_is_variable: true
  parallelism_config_sp_attn_implementation: flash_attention_2
With the corresponding training configuration:
from trl import SFTConfig

training_args = SFTConfig(
    pad_to_multiple_of=2,           # must equal sp_size
    max_seq_length=4096,
    packing=True,
    attn_implementation="flash_attention_2",
    per_device_train_batch_size=1,
    ...
)
Launch a complete example with 4 GPUs:
accelerate launch --config_file examples/accelerate_configs/alst_ulysses_4gpu.yaml \
    trl/scripts/sft.py \
    --model_name_or_path Qwen/Qwen2-0.5B \
    --dataset_name trl-lib/Capybara \
    --learning_rate 2e-4 \
    --max_steps 100 \
    --max_seq_length 4096 \
    --packing \
    --torch_dtype bfloat16 \
    --attn_implementation flash_attention_2 \
    --output_dir output-alst-4gpu

2D parallelism scaling reference

GPUs  sp_size  dp_shard_size  Use case
4     2        2              Balanced: longer sequences plus more data parallelism
4     4        1              Pure SP for maximum sequence length
8     2        4              Large-scale training
Ensure dp_replicate_size × dp_shard_size × sp_size = num_processes.
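That constraint can be checked mechanically against the table's rows; a quick sketch:

```python
def valid_2d_layout(num_processes, sp_size, dp_shard_size, dp_replicate_size=1):
    """True when the parallelism factors multiply to the total process count."""
    return dp_replicate_size * dp_shard_size * sp_size == num_processes

# Rows from the scaling reference above (dp_replicate_size left at 1):
print(valid_2d_layout(4, sp_size=2, dp_shard_size=2))  # True
print(valid_2d_layout(4, sp_size=4, dp_shard_size=1))  # True
print(valid_2d_layout(8, sp_size=2, dp_shard_size=4))  # True
```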

Multi-node training

When a single machine does not have enough GPUs, scale training across multiple machines (nodes).

Accelerate config for multi-node

Create a multi_node.yaml config:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 2
machine_rank: 0          # 0 for main node, 1 for second node
main_process_ip: 10.0.0.1  # IP of rank 0 node
main_process_port: 29500
num_processes: 16        # total processes across all nodes
mixed_precision: bf16
use_cpu: false
same_network: true
Replace 10.0.0.1 with the actual IP address of the rank 0 (main) node.

Launching

Run on each node:
# Node 0 (main node)
accelerate launch --config_file multi_node.yaml --machine_rank 0 train.py

# Node 1
accelerate launch --config_file multi_node.yaml --machine_rank 1 train.py
You can combine multi-node training with DeepSpeed by setting distributed_type: DEEPSPEED and adding a deepspeed_config block. See the DeepSpeed integration guide.
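For instance, the multi_node.yaml above could be adapted along these lines (an illustrative sketch, not a verified config; the ZeRO stage is an assumption):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
num_machines: 2
machine_rank: 0          # 0 for main node, 1 for second node
main_process_ip: 10.0.0.1
main_process_port: 29500
num_processes: 16
mixed_precision: bf16
deepspeed_config:
  zero_stage: 2
```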
