
Overview

RLOO (REINFORCE Leave-One-Out) is described in the paper Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs. It is an online RL method that uses a leave-one-out baseline to reduce gradient variance, avoiding the need for a separate value model as required by PPO. RLOO generates multiple completions per prompt and uses the average reward of all other completions as a baseline for each completion, reducing variance while remaining computationally efficient.

Quick start

# train_rloo.py
from datasets import load_dataset
from trl import RLOOTrainer
from trl.rewards import accuracy_reward

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = RLOOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
trainer.train()

Launch with:
accelerate launch train_rloo.py

How RLOO works

1. Generate completions

At each step, sample a batch of prompts and generate num_generations (G, default 2) completions per prompt.
2. Compute rewards

For each completion, compute a reward using the reward function(s). Add a KL penalty to discourage deviation from a reference policy:
r_i = R(o_i, q) − β · KL(π_θ ‖ π_ref)
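The reward-shaping step above can be sketched as follows. This is a simplified per-sample view; the single-sample log-ratio KL estimator shown here is an assumption for illustration, not necessarily the estimator TRL uses internally:

```python
# Sketch: fold the KL penalty into the reward for one completion.
# logp_policy and logp_ref are the completion's log-probabilities under
# the current policy and the reference model, respectively.
def penalized_reward(reward, logp_policy, logp_ref, beta=0.05):
    # Single-sample estimate of KL(pi_theta || pi_ref)
    kl_estimate = logp_policy - logp_ref
    return reward - beta * kl_estimate
```

With beta=0 the penalty vanishes, which is why the reference model need not be loaded in that case.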
3. Compute leave-one-out advantages

For each completion, compute a baseline as the average reward of all other completions in the same group:
b_i = (1 / (G − 1)) · Σ_{j ≠ i} r_j
A_i = r_i − b_i
This leave-one-out estimate eliminates the need for a value model while still reducing gradient variance.
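The leave-one-out computation can be sketched in a few lines (a minimal illustration of the formulas above, not TRL's vectorized implementation):

```python
# Leave-one-out advantages for one prompt's group of G rewards.
def leave_one_out_advantages(rewards):
    G = len(rewards)
    assert G >= 2, "leave-one-out needs at least 2 completions per prompt"
    total = sum(rewards)
    # Baseline for completion i: mean of the other G - 1 rewards,
    # computed as (total - r_i) / (G - 1) to avoid an inner loop.
    baselines = [(total - r) / (G - 1) for r in rewards]
    return [r - b for r, b in zip(rewards, baselines)]

print(leave_one_out_advantages([1.0, 0.0]))  # -> [1.0, -1.0]
```

Note that with G = 2 each completion's baseline is simply the other completion's reward.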
4. Update the policy

Minimize the REINFORCE loss weighted by advantages. In the single-step setting (default), this is equivalent to standard REINFORCE. With num_iterations > 1, a clipped surrogate objective is used.
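The clipped surrogate used when num_iterations > 1 is the standard PPO-style clipping; a per-sample sketch (the epsilon argument corresponds to the `epsilon` config parameter below, but this is an illustration rather than TRL's batched implementation):

```python
import math

# PPO-style clipped surrogate loss for a single sample.
# logp_new / logp_old: completion log-probabilities under the current
# policy and the policy that generated the data.
def clipped_surrogate_loss(logp_new, logp_old, advantage, epsilon=0.2):
    ratio = math.exp(logp_new - logp_old)  # importance sampling ratio
    clipped = max(min(ratio, 1 + epsilon), 1 - epsilon)
    # Maximize the clipped surrogate => minimize its negation.
    return -min(ratio * advantage, clipped * advantage)
```

When logp_new == logp_old the ratio is 1, clipping is inactive, and the loss reduces to plain REINFORCE weighted by the advantage.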

Dataset format

The dataset must include a "prompt" column. Additional columns are passed to reward functions.
# Standard format
{"prompt": "Solve: 2x + 3 = 7", "ground_truth": "2"}

# Conversational format
{"prompt": [{"role": "user", "content": "Solve: 2x + 3 = 7"}],
 "ground_truth": "2"}
For VLM training, include an image or images column alongside prompt.

Custom reward functions

Reward functions follow the same interface as in GRPOTrainer. They must accept prompts, completions, completion_ids, and any dataset columns as keyword arguments, and return a list of floats.
# Reward based on answer correctness
import re

def reward_func(completions, ground_truth, **kwargs):
    matches = [re.search(r"\\boxed\{(.*?)\}", completion) for completion in completions]
    contents = [match.group(1) if match else "" for match in matches]
    return [1.0 if c == gt else 0.0 for c, gt in zip(contents, ground_truth)]
# Reward based on response format
import re

def format_reward_func(completions, **kwargs):
    pattern = r"^<think>.*?</think><answer>.*?</answer>$"
    completion_contents = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, content) for content in completion_contents]
    return [1.0 if match else 0.0 for match in matches]
Pass reward functions to the trainer:
trainer = RLOOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=[format_reward_func, reward_func],
    train_dataset=dataset,
)
Reward functions can be async def coroutines. Multiple async functions are executed concurrently, so their latency overlaps.
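An async reward function follows the same interface with `async def`; a minimal sketch with a hypothetical scoring rule (the `asyncio.sleep` stands in for a remote judge or API call whose latency would overlap with other async reward functions):

```python
import asyncio

# Illustrative async reward function (hypothetical rule: shorter is better).
async def length_reward_func(completions, **kwargs):
    await asyncio.sleep(0)  # placeholder for an awaited remote call
    return [1.0 / (1 + len(c)) for c in completions]

# Outside the trainer, an async reward function can be driven directly:
rewards = asyncio.run(length_reward_func(["ok", "a longer completion"]))
```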

Multi-task reward functions

Return None for samples that a reward function does not apply to. The trainer ignores None values:
def math_reward_func(completions, task, **kwargs):
    rewards = []
    for completion, t in zip(completions, task):
        if t == "math":
            rewards.append(1.0 if check_correct(completion) else -1.0)
        else:
            rewards.append(None)  # not applicable
    return rewards
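How None values drop out of the aggregate can be sketched like this (an illustration of the documented behavior, assuming a weighted sum over applicable functions, not TRL's internal code):

```python
# Combine per-function reward lists into one reward per sample,
# skipping None entries (reward functions that don't apply).
def combine_rewards(per_func_rewards, weights):
    combined = []
    for sample_rewards in zip(*per_func_rewards):
        combined.append(
            sum(w * r for w, r in zip(weights, sample_rewards) if r is not None)
        )
    return combined
```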

Key configuration parameters

num_generations (int, default: 2): Number of completions to generate per prompt (the group size G). Must be at least 2 for the leave-one-out baseline. The effective batch size must be divisible by this value.
max_completion_length (int | None, default: 256): Maximum number of tokens to generate per completion.
temperature (float, default: 1.0): Sampling temperature. Higher values produce more diverse completions.
beta (float, default: 0.05): KL coefficient controlling deviation from the reference model. When 0.0, the reference model is not loaded.
epsilon (float, default: 0.2): Clipping range for the importance sampling ratio in the surrogate objective.
num_iterations (int, default: 1): Number of gradient update passes per generated batch (μ in the algorithm). When greater than 1, uses a clipped surrogate objective.
normalize_advantages (bool, default: false): Normalize advantages across the generation batch to have mean 0 and standard deviation 1.
reward_weights (list[float] | None): Per-function weights when using multiple reward functions. Defaults to equal weighting.
reward_clip_range (tuple[float, float] | None): Clip rewards to (min, max) before computing advantages. If None, no clipping is applied.
mask_truncated_completions (bool, default: false): Exclude truncated completions from the loss. Recommended for training stability.
use_vllm (bool, default: false): Use vLLM for faster generation. Requires pip install trl[vllm].
vllm_mode (str, default: "colocate"): How to run vLLM: "colocate" (shares training GPUs) or "server" (separate process on dedicated GPUs).
vllm_gpu_memory_utilization (float, default: 0.3): Fraction of GPU memory reserved for vLLM in colocate mode.
RLOOConfig overrides some TrainingArguments defaults: logging_steps=10, gradient_checkpointing=True, bf16=True, and learning_rate=1e-6.
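Putting a few of these together, a config for a short-completion run might look like the following (the specific values are illustrative choices, not recommendations from the TRL docs):

```python
from trl import RLOOConfig

training_args = RLOOConfig(
    num_generations=4,                # group size G; batch must be divisible by it
    max_completion_length=256,
    beta=0.05,                        # KL penalty toward the reference model
    mask_truncated_completions=True,  # drop truncated completions from the loss
)
```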

Accelerating generation with vLLM

vLLM runs inside the trainer process and shares GPU memory:
from trl import RLOOConfig

training_args = RLOOConfig(
    use_vllm=True,  # vllm_mode="colocate" by default
)
In server mode, ensure the vLLM server uses different GPUs than the trainer. Use CUDA_VISIBLE_DEVICES to separate them, or you may encounter NCCL errors.
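A minimal single-node server-mode setup might look like this (GPU indices and model name are placeholders; adjust to your hardware):

```shell
# Terminal 1: vLLM server on its own GPU
CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/Qwen2-0.5B-Instruct

# Terminal 2: training on the remaining GPUs, with vllm_mode="server"
CUDA_VISIBLE_DEVICES=1,2,3 accelerate launch train_rloo.py
```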

Training at scale (70B+ models)

For large models, combine DeepSpeed ZeRO-3 with vLLM server mode:
#!/bin/bash
#SBATCH --nodes=5
#SBATCH --gres=gpu:8

NODELIST=($(scontrol show hostnames $SLURM_JOB_NODELIST))

# Nodes 0-3: training
srun --nodes=4 --ntasks=4 --nodelist="${NODELIST[@]:0:4}" accelerate launch \
     --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
     --num_processes 32 \
     train_rloo.py --server_ip ${NODELIST[4]} &

# Node 4: vLLM inference
srun --nodes=1 --ntasks=1 --nodelist="${NODELIST[4]}" \
     trl vllm-serve --model Qwen/Qwen2.5-72B --tensor_parallel_size 8 &

wait

GRPO vs RLOO

                         RLOO                    GRPO
Default group size       2                       8
Advantage normalization  Leave-one-out baseline  Group-relative (mean/std)
KL default (beta)        0.05                    0.0
Value model required     No                      No
Best for                 Low-memory online RL    Reasoning model training

Logged metrics

reward: Overall average reward (sum across functions, weighted by reward_weights)
reward_std: Standard deviation of summed rewards across the batch
completions/mean_length: Average length of generated completions
completions/clipped_ratio: Fraction of completions truncated at max_completion_length
entropy: Average token prediction entropy across completions
kl: Average KL divergence from the reference model (only logged when beta > 0)
clip_ratio/region_mean: Fraction of sequences where the policy ratio was clipped
