TRL integrates vLLM for faster generation in online reinforcement learning methods. Online methods like GRPO and RLOO require the model to generate completions during training, which quickly becomes a bottleneck. vLLM’s PagedAttention technique stores key-value tensors in non-contiguous memory, greatly improving throughput and reducing the memory footprint for generation.
TRL currently only supports vLLM versions from 0.10.2 to 0.17.1. Ensure you have a compatible version installed.
The following trainers support vLLM generation:
- GRPOTrainer
- RLOOTrainer
- experimental.nash_md.NashMDTrainer
- experimental.online_dpo.OnlineDPOTrainer
- experimental.xpo.XPOTrainer
Installation
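vLLM is not installed with TRL by default. A minimal install, assuming TRL's vllm extra (installing a compatible vllm version directly alongside trl also works):

pip install "trl[vllm]"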
Modes of operation
TRL supports two modes for integrating vLLM during training: colocate (the default) and server.
Colocate mode (default)
In colocate mode, vLLM runs inside the trainer process and shares GPU memory with the training model. No separate server process is required, but memory contention on the training GPUs is possible.
GRPO
from trl import GRPOConfig

training_args = GRPOConfig(
    use_vllm=True,  # vllm_mode="colocate" by default
)

RLOO
from trl import RLOOConfig

training_args = RLOOConfig(
    use_vllm=True,  # vllm_mode="colocate" by default
)

OnlineDPO
from trl.experimental.online_dpo import OnlineDPOConfig

training_args = OnlineDPOConfig(
    use_vllm=True,  # vllm_mode="colocate" by default
)

NashMD
from trl.experimental.nash_md import NashMDConfig

training_args = NashMDConfig(
    use_vllm=True,  # vllm_mode="colocate" by default
)

XPO
from trl.experimental.xpo import XPOConfig

training_args = XPOConfig(
    use_vllm=True,  # vllm_mode="colocate" by default
)
Server mode
In server mode, vLLM runs as a separate process on dedicated GPUs and communicates with the trainer over HTTP. This is ideal when you have GPUs dedicated to inference.
The vLLM server and the trainer must run on separate CUDA devices to prevent NCCL communication conflicts. If you do not set CUDA_VISIBLE_DEVICES, TRL will detect same-device usage and raise an error.
Start the vLLM server on dedicated GPUs
In this example, GPUs 0–3 serve the model with tensor parallelism across all 4 GPUs:

CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model Qwen/Qwen2.5-7B --tensor-parallel-size 4
Write a training script with server mode enabled
GRPO
from datasets import load_dataset
from trl import GRPOTrainer, GRPOConfig
from trl.rewards import accuracy_reward

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B",
    args=GRPOConfig(use_vllm=True, vllm_mode="server"),
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
trainer.train()

RLOO
from datasets import load_dataset
from trl import RLOOTrainer, RLOOConfig
from trl.rewards import accuracy_reward

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = RLOOTrainer(
    model="Qwen/Qwen2.5-7B",
    args=RLOOConfig(use_vllm=True, vllm_mode="server"),
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
trainer.train()

OnlineDPO
from datasets import load_dataset
from trl.experimental.online_dpo import OnlineDPOConfig, OnlineDPOTrainer
from trl.rewards import accuracy_reward

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = OnlineDPOTrainer(
    model="Qwen/Qwen2.5-7B",
    args=OnlineDPOConfig(use_vllm=True, vllm_mode="server"),
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
trainer.train()

NashMD
from datasets import load_dataset
from trl.experimental.nash_md import NashMDConfig, NashMDTrainer
from trl.rewards import accuracy_reward

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = NashMDTrainer(
    model="Qwen/Qwen2.5-7B",
    args=NashMDConfig(use_vllm=True, vllm_mode="server"),
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
trainer.train()

XPO
from datasets import load_dataset
from trl.experimental.xpo import XPOTrainer, XPOConfig
from trl.rewards import accuracy_reward

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = XPOTrainer(
    model="Qwen/Qwen2.5-7B",
    args=XPOConfig(use_vllm=True, vllm_mode="server"),
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
trainer.train()
Launch training on the remaining GPUs
CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch train.py
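If you prefer to make the process count explicit, accelerate can spawn one process per training GPU (the --num_processes flag is standard accelerate; the value here matches the four GPUs above):

CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch --num_processes 4 train.py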
How it works under the hood
When you run trl vllm-serve --model <model_name>:
- vLLM spawns workers determined by --tensor-parallel-size × --data-parallel-size. With --tensor-parallel-size 4 and the default --data-parallel-size 1, it spawns 4 workers; with --tensor-parallel-size 4 and --data-parallel-size 2, it would spawn 8.
- Incoming prompts are distributed across workers, and the model weights are split across GPUs according to --tensor-parallel-size.
- GPUs communicate via NVIDIA's NCCL library to ensure each GPU processes its correct slice of the requests.
During the training loop:
- The trainer sends prompts to the server, which generates completions via vllm_client.generate.
- Completions are used to compute the reward signal and the training loss.
- After the backward pass, the trainer pushes updated weights to the server via vllm_client.update_named_param.
The vLLM server handles only generation; it does not train the model. Training happens entirely in the trainer process, which pushes updated weights to the server after each backward pass to keep generation in sync.
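For reference, the same client the trainers use can be driven manually. A minimal sketch, assuming the VLLMClient class in trl.extras.vllm_client with the generate and update_named_param methods named above; the constructor arguments and call signatures are assumptions and may differ across TRL versions:

from trl.extras.vllm_client import VLLMClient

# Connect to a running `trl vllm-serve` instance. Match host and
# server_port to the server's --host and --port (assumed names).
client = VLLMClient(host="0.0.0.0", server_port=8000)

# Generation: send prompts, receive completions back.
completions = client.generate(["The capital of France is"])

# Weight sync: push an updated tensor to the server by name, as the
# trainer does after each backward pass. Setting up the NCCL group
# first via init_communicator() is assumed to be required:
# client.init_communicator()
# client.update_named_param("model.embed_tokens.weight", tensor)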
Server configuration reference
All trl vllm-serve arguments:
| Argument | Default | Description |
|---|---|---|
| --model | required | Model name or path. |
| --revision | — | Model revision (branch, tag, or commit). |
| --tensor-parallel-size | 1 | Number of tensor parallel workers. |
| --data-parallel-size | 1 | Number of data parallel workers. For dense models, keep at 1 (required for vLLM ≥ 0.14.0). |
| --host | 0.0.0.0 | Server host address. |
| --port | 8000 | Server port. |
| --gpu-memory-utilization | 0.9 | Fraction of GPU memory reserved for weights, activations, and KV cache. |
| --dtype | auto | Data type for generation. |
| --max-model-len | — | Override the model's maximum context length. |
| --enable-prefix-caching | — | Enable prefix caching if the model and hardware support it. |
| --enforce-eager | False | Disable CUDA graphs and use eager mode only. |
| --kv-cache-dtype | auto | Data type for the KV cache. |
| --trust-remote-code | False | Allow executing code from model repositories. |
| --vllm-model-impl | vllm | Model implementation backend: vllm or transformers. |
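For example, to serve a model on two GPUs with a smaller memory reservation and a capped context length (all flags are from the table above; the values are illustrative):

CUDA_VISIBLE_DEVICES=0,1 trl vllm-serve \
    --model Qwen/Qwen2.5-7B \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.85 \
    --max-model-len 8192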
vLLM can use the Transformers backend for model implementations, including vision-language models (VLMs):
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 trl vllm-serve \
    --model Qwen/Qwen2.5-VL-3B-Instruct \
    --tensor-parallel-size 1 \
    --port 8000 \
    --enforce-eager \
    --vllm-model-impl transformers
Set vllm_model_impl="transformers" in your trainer config or pass it as a CLI argument. See the vLLM Transformers Backend blog post for details.
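On the training side this maps to the same option. A minimal sketch, assuming vllm_model_impl is accepted by the trainer configs as described above:

from trl import GRPOConfig

# Use vLLM's Transformers backend for the generation model
# (e.g. for vision-language models).
training_args = GRPOConfig(
    use_vllm=True,
    vllm_mode="server",
    vllm_model_impl="transformers",
)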