TRL integrates vLLM for faster generation in online reinforcement learning methods. Online methods like GRPO and RLOO require the model to generate completions during training, which quickly becomes a bottleneck. vLLM’s PagedAttention technique stores key-value tensors in non-contiguous memory, greatly improving throughput and reducing the memory footprint for generation.
TRL currently only supports vLLM versions from 0.10.2 to 0.17.1. Ensure you have a compatible version installed.
The following trainers support vLLM generation:
  • GRPOTrainer
  • RLOOTrainer
  • experimental.nash_md.NashMDTrainer
  • experimental.online_dpo.OnlineDPOTrainer
  • experimental.xpo.XPOTrainer

Installation

pip install "trl[vllm]"

Modes of operation

TRL supports two modes for integrating vLLM during training.

Colocate mode (default)

In colocate mode, vLLM runs inside the trainer process and shares GPU memory with the training model. No separate server process is required, but memory contention on the training GPUs is possible.
from trl import GRPOConfig

training_args = GRPOConfig(
    use_vllm=True,  # vllm_mode="colocate" by default
)
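Because colocate mode shares each training GPU with vLLM, you will often want to cap vLLM's share of memory. A minimal sketch, assuming your TRL version's GRPOConfig exposes a vllm_gpu_memory_utilization field (verify the field name and default against your installed version):

```python
from trl import GRPOConfig

# Colocate mode: vLLM runs inside the trainer process, so reserve only a
# modest fraction of each GPU for vLLM's weights and KV cache.
training_args = GRPOConfig(
    use_vllm=True,                    # vllm_mode="colocate" by default
    vllm_gpu_memory_utilization=0.3,  # assumed field; leaves the rest for training
)
```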

Server mode

In server mode, vLLM runs as a separate process on dedicated GPUs and communicates with the trainer over HTTP. This is ideal when you have GPUs dedicated to inference.
The vLLM server and the trainer must run on separate CUDA devices to prevent NCCL communication conflicts. If you do not set CUDA_VISIBLE_DEVICES, TRL will detect same-device usage and raise an error.
1. Start the vLLM server on dedicated GPUs

In this example, GPUs 0–3 serve the model with tensor parallelism across all 4 GPUs:
CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model Qwen/Qwen2.5-7B --tensor-parallel-size 4
2. Write a training script with server mode enabled

from datasets import load_dataset
from trl import GRPOTrainer, GRPOConfig
from trl.rewards import accuracy_reward

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B",
    args=GRPOConfig(use_vllm=True, vllm_mode="server"),
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
trainer.train()
3. Launch training on the remaining GPUs

CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch train.py

How it works under the hood

When you run trl vllm-serve --model <model_name>:
  1. vLLM spawns workers determined by --tensor-parallel-size × --data-parallel-size. With --tensor-parallel-size 4, it spawns 4 workers.
  2. Incoming prompts are distributed across workers. The model weights are split across GPUs according to --tensor-parallel-size.
  3. GPUs communicate via NVIDIA’s NCCL library to ensure each GPU processes its correct slice of the requests.
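The worker math above can be sketched as follows. The round-robin distribution is an illustrative simplification, not vLLM's actual scheduler:

```python
# Total worker count, as described above: one worker per
# (tensor-parallel rank x data-parallel replica) pair.
def worker_count(tensor_parallel_size: int, data_parallel_size: int) -> int:
    return tensor_parallel_size * data_parallel_size

def assign_prompts(prompts, data_parallel_size):
    """Distribute prompts across data-parallel replicas round-robin
    (illustrative; real load balancing is more sophisticated)."""
    replicas = [[] for _ in range(data_parallel_size)]
    for i, prompt in enumerate(prompts):
        replicas[i % data_parallel_size].append(prompt)
    return replicas

print(worker_count(4, 1))  # --tensor-parallel-size 4 -> 4 workers
print(assign_prompts(["a", "b", "c"], 2))  # [['a', 'c'], ['b']]
```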
During the training loop:
  • The trainer sends prompts to the server; the server generates completions via vllm_client.generate.
  • Completions are used to compute the reward signal and the training loss.
  • After the backward pass, the trainer pushes updated weights to the server via vllm_client.update_named_param.
The vLLM server handles only generation; it never updates the model itself. All weight updates flow one way, from the trainer to the server, after each backward pass.
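The loop above can be sketched with a mock client. VLLMClientStub and the reward/loss code here are placeholders standing in for TRL's real vllm_client and trainer internals, to show the shape of the exchange rather than the actual API surface:

```python
class VLLMClientStub:
    """Placeholder for TRL's vLLM client (illustrative, not the real API)."""

    def __init__(self):
        self.weights = {"lm_head.weight": 0}

    def generate(self, prompts):
        # The real server runs vLLM generation; here we just echo the prompt.
        return [p + " ... completion" for p in prompts]

    def update_named_param(self, name, value):
        # The real client streams updated tensors from trainer to server.
        self.weights[name] = value


vllm_client = VLLMClientStub()
step = 0
for prompts in [["What is 2+2?"], ["Factor x^2-1."]]:
    completions = vllm_client.generate(prompts)        # 1. generate on the server
    rewards = [1.0 if "completion" in c else 0.0 for c in completions]  # 2. score
    loss = -sum(rewards)                               # 3. stand-in for the RL loss
    step += 1                                          # (backward pass goes here)
    vllm_client.update_named_param("lm_head.weight", step)  # 4. push new weights

print(vllm_client.weights["lm_head.weight"])  # 2 after two steps
```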

Server configuration reference

All trl vllm-serve arguments:
trl vllm-serve --help
Argument                 | Default  | Description
--model                  | required | Model name or path.
--revision               |          | Model revision (branch, tag, or commit).
--tensor-parallel-size   | 1        | Number of tensor parallel workers.
--data-parallel-size     | 1        | Number of data parallel workers. For dense models, keep at 1 (required for vLLM ≥ 0.14.0).
--host                   | 0.0.0.0  | Server host address.
--port                   | 8000     | Server port.
--gpu-memory-utilization | 0.9      | Fraction of GPU memory reserved for weights, activations, and KV cache.
--dtype                  | auto     | Data type for generation.
--max-model-len          |          | Override the model's maximum context length.
--enable-prefix-caching  |          | Enable prefix caching if the model and hardware support it.
--enforce-eager          | False    | Disable CUDA graphs and use eager mode only.
--kv-cache-dtype         | auto     | Data type for the KV cache.
--trust-remote-code      | False    | Allow executing code from model repositories.
--vllm-model-impl        | vllm     | Model implementation backend: vllm or transformers.

Transformers backend

vLLM can use the Transformers backend for model implementations, including vision-language models (VLMs):
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 trl vllm-serve \
    --model Qwen/Qwen2.5-VL-3B-Instruct \
    --tensor-parallel-size 1 \
    --port 8000 \
    --enforce-eager \
    --vllm-model-impl transformers
Set vllm_model_impl="transformers" in your trainer config or pass it as a CLI argument. See the vLLM Transformers Backend blog post for details.
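On the trainer side, the same backend can be selected in the config. A minimal sketch for colocate mode, using the vllm_model_impl field named above (check that your TRL version accepts it):

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    use_vllm=True,                   # colocate mode by default
    vllm_model_impl="transformers",  # use vLLM's Transformers backend
)
```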
