TRL integrates vLLM for faster generation in online reinforcement learning methods. Online methods like GRPO and RLOO require the model to generate completions during training, which quickly becomes a bottleneck. vLLM’s PagedAttention technique stores key-value tensors in non-contiguous memory, greatly improving throughput and reducing the memory footprint for generation.
TRL currently only supports vLLM versions from 0.10.2 to 0.17.1. Ensure you have a compatible version installed.
The following trainers support vLLM generation:
  • GRPOTrainer
  • RLOOTrainer
  • experimental.nash_md.NashMDTrainer
  • experimental.online_dpo.OnlineDPOTrainer
  • experimental.xpo.XPOTrainer

Installation

pip install "trl[vllm]"

Modes of operation

TRL supports two modes for integrating vLLM during training.

Colocate mode (default)

In colocate mode, vLLM runs inside the trainer process and shares GPU memory with the training model. No separate server process is required, but memory contention on the training GPUs is possible.
from trl import GRPOConfig

training_args = GRPOConfig(
    use_vllm=True,  # vllm_mode="colocate" by default
)
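Because colocate mode shares each training GPU with vLLM, you will often want to cap vLLM's share of memory. A minimal sketch, assuming your TRL version's GRPOConfig exposes a vllm_gpu_memory_utilization field (verify the field name and default against your installed version):

```python
from trl import GRPOConfig

# Colocate mode: vLLM runs inside the trainer process, so reserve only a
# modest fraction of each GPU for vLLM's weights and KV cache.
training_args = GRPOConfig(
    use_vllm=True,                    # vllm_mode="colocate" by default
    vllm_gpu_memory_utilization=0.3,  # assumed field; leaves the rest for training
)
```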

Server mode

In server mode, vLLM runs as a separate process on dedicated GPUs and communicates with the trainer over HTTP. This is ideal when you have GPUs dedicated to inference.
The vLLM server and the trainer must run on separate CUDA devices to prevent NCCL communication conflicts. If you do not set CUDA_VISIBLE_DEVICES, TRL will detect same-device usage and raise an error.
1. Start the vLLM server on dedicated GPUs

In this example, GPUs 0–3 serve the model with tensor parallelism across all 4 GPUs:
CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model Qwen/Qwen2.5-7B --tensor-parallel-size 4
2. Write a training script with server mode enabled

from datasets import load_dataset
from trl import GRPOTrainer, GRPOConfig
from trl.rewards import accuracy_reward

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B",
    args=GRPOConfig(use_vllm=True, vllm_mode="server"),
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
trainer.train()
3. Launch training on the remaining GPUs

CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch train.py

How it works under the hood

When you run trl vllm-serve --model <model_name>:
  1. vLLM spawns workers determined by --tensor-parallel-size × --data-parallel-size. With --tensor-parallel-size 4, it spawns 4 workers.
  2. Incoming prompts are distributed across workers. The model weights are split across GPUs according to --tensor-parallel-size.
  3. GPUs communicate via NVIDIA’s NCCL library to ensure each GPU processes its correct slice of the requests.
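The worker math above can be sketched as follows. The round-robin distribution is an illustrative simplification, not vLLM's actual scheduler:

```python
# Total worker count, as described above: one worker per
# (tensor-parallel rank x data-parallel replica) pair.
def worker_count(tensor_parallel_size: int, data_parallel_size: int) -> int:
    return tensor_parallel_size * data_parallel_size

def assign_prompts(prompts, data_parallel_size):
    """Distribute prompts across data-parallel replicas round-robin
    (illustrative; real load balancing is more sophisticated)."""
    replicas = [[] for _ in range(data_parallel_size)]
    for i, prompt in enumerate(prompts):
        replicas[i % data_parallel_size].append(prompt)
    return replicas

print(worker_count(4, 1))  # --tensor-parallel-size 4 -> 4 workers
print(assign_prompts(["a", "b", "c"], 2))  # [['a', 'c'], ['b']]
```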
During the training loop:
  • The trainer sends prompts to the server; the server generates completions via vllm_client.generate.
  • Completions are used to compute the reward signal and the training loss.
  • After the backward pass, the trainer pushes updated weights to the server via vllm_client.update_named_param.
The vLLM server handles only generation; it never updates the model itself. All weight updates flow one way, from the trainer to the server, after each backward pass.
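The loop above can be sketched with a mock client. VLLMClientStub and the reward/loss code here are placeholders standing in for TRL's real vllm_client and trainer internals, to show the shape of the exchange rather than the actual API surface:

```python
class VLLMClientStub:
    """Placeholder for TRL's vLLM client (illustrative, not the real API)."""

    def __init__(self):
        self.weights = {"lm_head.weight": 0}

    def generate(self, prompts):
        # The real server runs vLLM generation; here we just echo the prompt.
        return [p + " ... completion" for p in prompts]

    def update_named_param(self, name, value):
        # The real client streams updated tensors from trainer to server.
        self.weights[name] = value


vllm_client = VLLMClientStub()
step = 0
for prompts in [["What is 2+2?"], ["Factor x^2-1."]]:
    completions = vllm_client.generate(prompts)        # 1. generate on the server
    rewards = [1.0 if "completion" in c else 0.0 for c in completions]  # 2. score
    loss = -sum(rewards)                               # 3. stand-in for the RL loss
    step += 1                                          # (backward pass goes here)
    vllm_client.update_named_param("lm_head.weight", step)  # 4. push new weights

print(vllm_client.weights["lm_head.weight"])  # 2 after two steps
```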

Server configuration reference

All trl vllm-serve arguments:
trl vllm-serve --help
Argument                 | Default  | Description
--model                  | required | Model name or path.
--revision               |          | Model revision (branch, tag, or commit).
--tensor-parallel-size   | 1        | Number of tensor parallel workers.
--data-parallel-size     | 1        | Number of data parallel workers. For dense models, keep at 1 (required for vLLM ≥ 0.14.0).
--host                   | 0.0.0.0  | Server host address.
--port                   | 8000     | Server port.
--gpu-memory-utilization | 0.9      | Fraction of GPU memory reserved for weights, activations, and KV cache.
--dtype                  | auto     | Data type for generation.
--max-model-len          |          | Override the model's maximum context length.
--enable-prefix-caching  |          | Enable prefix caching if the model and hardware support it.
--enforce-eager          | False    | Disable CUDA graphs and use eager mode only.
--kv-cache-dtype         | auto     | Data type for the KV cache.
--trust-remote-code      | False    | Allow executing code from model repositories.
--vllm-model-impl        | vllm     | Model implementation backend: vllm or transformers.

Transformers backend

vLLM can use the Transformers backend for model implementations, including vision-language models (VLMs):
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 trl vllm-serve \
    --model Qwen/Qwen2.5-VL-3B-Instruct \
    --tensor-parallel-size 1 \
    --port 8000 \
    --enforce-eager \
    --vllm-model-impl transformers
Set vllm_model_impl="transformers" in your trainer config or pass it as a CLI argument. See the vLLM Transformers Backend blog post for details.
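On the trainer side, the same backend can be selected in the config. A minimal sketch for colocate mode, using the vllm_model_impl field named above (check that your TRL version accepts it):

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    use_vllm=True,                   # colocate mode by default
    vllm_model_impl="transformers",  # use vLLM's Transformers backend
)
```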
