To maintain a consistent effective batch size when scaling to more GPUs, adjust per_device_train_batch_size and gradient_accumulation_steps accordingly:
| GPUs | Per-device batch size | Gradient accumulation steps | Notes |
|---|---|---|---|
| 1 | 32 | 1 | Higher memory, faster training |
| 1 | 4 | 8 | Lower memory, slower training |
| 8 | 4 | 1 | Best of both worlds |
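Each row in the table keeps the product of the three settings constant. This invariant is easy to check with plain arithmetic; the helper below is an illustrative sketch, not a TRL function:

```python
def effective_batch_size(num_gpus: int, per_device_batch_size: int,
                         gradient_accumulation_steps: int) -> int:
    """Effective batch size = GPUs x per-device batch x accumulation steps."""
    return num_gpus * per_device_batch_size * gradient_accumulation_steps

# All three rows of the table keep the effective batch size at 32:
assert effective_batch_size(1, 32, 1) == 32
assert effective_batch_size(1, 4, 8) == 32
assert effective_batch_size(8, 4, 1) == 32
```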
Keeping a full model replica on each GPU can cause high memory usage for large models. Use DeepSpeed for model sharding via the Zero Redundancy Optimizer (ZeRO) and for CPU/NVMe offloading.
DeepSpeed provides memory optimizations through the ZeRO (Zero Redundancy Optimizer) family of stages. TRL provides predefined accelerate configs you can use directly:
| Profile name | Description |
|---|---|
| zero1 | DeepSpeed ZeRO Stage 1 |
| zero2 | DeepSpeed ZeRO Stage 2 |
| zero3 | DeepSpeed ZeRO Stage 3 |
Pass the profile name via --accelerate_config in the TRL CLI:
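For example, to launch SFT training with the ZeRO Stage 2 profile (the model and dataset names here are illustrative placeholders):

```shell
trl sft \
    --model_name_or_path Qwen/Qwen2.5-0.5B \
    --dataset_name trl-lib/Capybara \
    --accelerate_config zero2
```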
Sequence Parallelism (also called Context Parallelism) splits the sequence dimension across multiple GPUs, enabling training with sequences longer than what fits on a single GPU. TRL supports two implementations:
- **Ring Attention (FSDP2)**: Uses ring-based P2P communication. Best for extremely long sequences (1M+ tokens) and models with few attention heads. Requires Accelerate 1.11.0+ and FSDP2.
- **ALST/Ulysses (DeepSpeed)**: Uses attention head parallelism. Best for high-bandwidth interconnects (NVLink, InfiniBand) and moderate sequence lengths (up to ~500k tokens). Requires DeepSpeed 0.18.1+ and Accelerate 1.12.0+.
```python
from trl import SFTConfig

training_args = SFTConfig(
    pad_to_multiple_of=4,  # must be divisible by cp_size * 2
    max_length=16384,
    packing=True,
    use_liger_kernel=True,
    gradient_checkpointing=False,  # use fsdp_activation_checkpointing instead
    per_device_train_batch_size=1,
    ...
)
```
max_length refers to the global sequence length. The framework automatically splits it into micro-sequences per GPU based on cp_size. With max_length=8192 and cp_size=4, each GPU processes 2048 tokens.
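The per-GPU micro-sequence length is just the global length divided by the context-parallel degree; the snippet below sketches the arithmetic (variable names are illustrative, not TRL API):

```python
max_length = 8192  # global sequence length
cp_size = 4        # number of GPUs the sequence is split across

# The global length must divide evenly across cp_size * 2 shards,
# which is why pad_to_multiple_of must be divisible by cp_size * 2.
assert max_length % (cp_size * 2) == 0

tokens_per_gpu = max_length // cp_size
print(tokens_per_gpu)  # 2048
```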
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 2
machine_rank: 0  # 0 for main node, 1 for second node
main_process_ip: 10.0.0.1  # IP of rank 0 node
main_process_port: 29500
num_processes: 16  # total processes across all nodes
mixed_precision: bf16
use_cpu: false
same_network: true
```
Replace 10.0.0.1 with the actual IP address of the rank 0 (main) node.
SLURM automatically distributes training across all requested nodes, and srun configures the necessary environment variables.
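A minimal SLURM job script following this pattern might look like the sketch below. The config file name, training script, and resource counts are placeholders; only the `accelerate launch` flags and SLURM variables are standard:

```shell
#!/bin/bash
#SBATCH --job-name=trl-multinode
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# srun starts one launcher process per node; accelerate then spawns one
# worker per GPU according to the config file. SLURM_NODEID gives each
# node its rank, and the first hostname in the allocation is the main node.
srun accelerate launch \
    --config_file multi_node.yaml \
    --machine_rank "$SLURM_NODEID" \
    --main_process_ip "$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)" \
    train.py
```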
You can combine multi-node training with DeepSpeed by setting distributed_type: DEEPSPEED and adding a deepspeed_config block. See the DeepSpeed integration guide.