
Overview

RLOO (REINFORCE Leave-One-Out) is described in the paper Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs. It is an online RL method that uses a leave-one-out baseline to reduce gradient variance, avoiding the need for a separate value model as required by PPO. RLOO generates multiple completions per prompt and uses the average reward of all other completions as a baseline for each completion, reducing variance while remaining computationally efficient.

Quick start

# train_rloo.py
from datasets import load_dataset
from trl import RLOOTrainer
from trl.rewards import accuracy_reward

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = RLOOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
trainer.train()

Launch with:
accelerate launch train_rloo.py

How RLOO works

1. Generate completions

At each step, sample a batch of prompts and generate num_generations (G, default 2) completions per prompt.
2. Compute rewards

For each completion, compute a reward using the reward function(s). Add a KL penalty to discourage deviation from a reference policy:
r_i = R(o_i, q) − β · KL(π_θ ‖ π_ref)
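The reward-shaping step above can be sketched as follows. This is a simplified per-sample view; the single-sample log-ratio KL estimator shown here is an assumption for illustration, not necessarily the estimator TRL uses internally:

```python
# Sketch: fold the KL penalty into the reward for one completion.
# logp_policy and logp_ref are the completion's log-probabilities under
# the current policy and the reference model, respectively.
def penalized_reward(reward, logp_policy, logp_ref, beta=0.05):
    # Single-sample estimate of KL(pi_theta || pi_ref)
    kl_estimate = logp_policy - logp_ref
    return reward - beta * kl_estimate
```

With beta=0 the penalty vanishes, which is why the reference model need not be loaded in that case.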
3. Compute leave-one-out advantages

For each completion, compute a baseline as the average reward of all other completions in the same group:
b_i = (1 / (G − 1)) · Σ_{j ≠ i} r_j
A_i = r_i − b_i
This leave-one-out estimate eliminates the need for a value model while still reducing gradient variance.
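The leave-one-out computation can be sketched in a few lines (a minimal illustration of the formulas above, not TRL's vectorized implementation):

```python
# Leave-one-out advantages for one prompt's group of G rewards.
def leave_one_out_advantages(rewards):
    G = len(rewards)
    assert G >= 2, "leave-one-out needs at least 2 completions per prompt"
    total = sum(rewards)
    # Baseline for completion i: mean of the other G - 1 rewards,
    # computed as (total - r_i) / (G - 1) to avoid an inner loop.
    baselines = [(total - r) / (G - 1) for r in rewards]
    return [r - b for r, b in zip(rewards, baselines)]

print(leave_one_out_advantages([1.0, 0.0]))  # -> [1.0, -1.0]
```

Note that with G = 2 each completion's baseline is simply the other completion's reward.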
4. Update the policy

Minimize the REINFORCE loss weighted by advantages. In the single-step setting (default), this is equivalent to standard REINFORCE. With num_iterations > 1, a clipped surrogate objective is used.
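The clipped surrogate used when num_iterations > 1 is the standard PPO-style clipping; a per-sample sketch (the epsilon argument corresponds to the `epsilon` config parameter below, but this is an illustration rather than TRL's batched implementation):

```python
import math

# PPO-style clipped surrogate loss for a single sample.
# logp_new / logp_old: completion log-probabilities under the current
# policy and the policy that generated the data.
def clipped_surrogate_loss(logp_new, logp_old, advantage, epsilon=0.2):
    ratio = math.exp(logp_new - logp_old)  # importance sampling ratio
    clipped = max(min(ratio, 1 + epsilon), 1 - epsilon)
    # Maximize the clipped surrogate => minimize its negation.
    return -min(ratio * advantage, clipped * advantage)
```

When logp_new == logp_old the ratio is 1, clipping is inactive, and the loss reduces to plain REINFORCE weighted by the advantage.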

Dataset format

The dataset must include a "prompt" column. Additional columns are passed to reward functions.
# Standard format
{"prompt": "Solve: 2x + 3 = 7", "ground_truth": "2"}

# Conversational format
{"prompt": [{"role": "user", "content": "Solve: 2x + 3 = 7"}],
 "ground_truth": "2"}
For VLM training, include an image or images column alongside prompt.

Custom reward functions

Reward functions follow the same interface as in GRPOTrainer. They must accept prompts, completions, completion_ids, and any dataset columns as keyword arguments, and return a list of floats.
# Reward based on answer correctness
import re

def reward_func(completions, ground_truth, **kwargs):
    matches = [re.search(r"\\boxed\{(.*?)\}", completion) for completion in completions]
    contents = [match.group(1) if match else "" for match in matches]
    return [1.0 if c == gt else 0.0 for c, gt in zip(contents, ground_truth)]
# Reward based on response format
import re

def format_reward_func(completions, **kwargs):
    pattern = r"^<think>.*?</think><answer>.*?</answer>$"
    completion_contents = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, content) for content in completion_contents]
    return [1.0 if match else 0.0 for match in matches]
Pass reward functions to the trainer:
trainer = RLOOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=[format_reward_func, reward_func],
    train_dataset=dataset,
)
Reward functions can be async def coroutines. Multiple async functions are executed concurrently, so their latency overlaps.
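An async reward function follows the same interface with `async def`; a minimal sketch with a hypothetical scoring rule (the `asyncio.sleep` stands in for a remote judge or API call whose latency would overlap with other async reward functions):

```python
import asyncio

# Illustrative async reward function (hypothetical rule: shorter is better).
async def length_reward_func(completions, **kwargs):
    await asyncio.sleep(0)  # placeholder for an awaited remote call
    return [1.0 / (1 + len(c)) for c in completions]

# Outside the trainer, an async reward function can be driven directly:
rewards = asyncio.run(length_reward_func(["ok", "a longer completion"]))
```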

Multi-task reward functions

Return None for samples that a reward function does not apply to. The trainer ignores None values:
def math_reward_func(completions, task, **kwargs):
    rewards = []
    for completion, t in zip(completions, task):
        if t == "math":
            rewards.append(1.0 if check_correct(completion) else -1.0)
        else:
            rewards.append(None)  # not applicable
    return rewards
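How None values drop out of the aggregate can be sketched like this (an illustration of the documented behavior, assuming a weighted sum over applicable functions, not TRL's internal code):

```python
# Combine per-function reward lists into one reward per sample,
# skipping None entries (reward functions that don't apply).
def combine_rewards(per_func_rewards, weights):
    combined = []
    for sample_rewards in zip(*per_func_rewards):
        combined.append(
            sum(w * r for w, r in zip(weights, sample_rewards) if r is not None)
        )
    return combined
```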

Key configuration parameters

num_generations (int, default: 2): Number of completions to generate per prompt (the group size G). Must be at least 2 for the leave-one-out baseline. The effective batch size must be divisible by this value.
max_completion_length (int | None, default: 256): Maximum number of tokens to generate per completion.
temperature (float, default: 1.0): Sampling temperature. Higher values produce more diverse completions.
beta (float, default: 0.05): KL coefficient controlling deviation from the reference model. When 0.0, the reference model is not loaded.
epsilon (float, default: 0.2): Clipping range for the importance sampling ratio in the surrogate objective.
num_iterations (int, default: 1): Number of gradient update passes per generated batch (μ in the algorithm). When greater than 1, uses a clipped surrogate objective.
normalize_advantages (bool, default: false): Normalize advantages across the generation batch to have mean 0 and standard deviation 1.
reward_weights (list[float] | None): Per-function weights when using multiple reward functions. Defaults to equal weighting.
reward_clip_range (tuple[float, float] | None): Clip rewards to (min, max) before computing advantages. If None, no clipping is applied.
mask_truncated_completions (bool, default: false): Exclude truncated completions from the loss. Recommended for training stability.
use_vllm (bool, default: false): Use vLLM for faster generation. Requires pip install trl[vllm].
vllm_mode (str, default: "colocate"): How to run vLLM: "colocate" (shares training GPUs) or "server" (separate process on dedicated GPUs).
vllm_gpu_memory_utilization (float, default: 0.3): Fraction of GPU memory reserved for vLLM in colocate mode.
RLOOConfig overrides some TrainingArguments defaults: logging_steps=10, gradient_checkpointing=True, bf16=True, and learning_rate=1e-6.
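Putting a few of these together, a config for a short-completion run might look like the following (the specific values are illustrative choices, not recommendations from the TRL docs):

```python
from trl import RLOOConfig

training_args = RLOOConfig(
    num_generations=4,                # group size G; batch must be divisible by it
    max_completion_length=256,
    beta=0.05,                        # KL penalty toward the reference model
    mask_truncated_completions=True,  # drop truncated completions from the loss
)
```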

Accelerating generation with vLLM

vLLM runs inside the trainer process and shares GPU memory:
from trl import RLOOConfig

training_args = RLOOConfig(
    use_vllm=True,  # vllm_mode="colocate" by default
)
In server mode, ensure the vLLM server uses different GPUs than the trainer. Use CUDA_VISIBLE_DEVICES to separate them, or you may encounter NCCL errors.
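A minimal single-node server-mode setup might look like this (GPU indices and model name are placeholders; adjust to your hardware):

```shell
# Terminal 1: vLLM server on its own GPU
CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/Qwen2-0.5B-Instruct

# Terminal 2: training on the remaining GPUs, with vllm_mode="server"
CUDA_VISIBLE_DEVICES=1,2,3 accelerate launch train_rloo.py
```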

Training at scale (70B+ models)

For large models, combine DeepSpeed ZeRO-3 with vLLM server mode:
#!/bin/bash
#SBATCH --nodes=5
#SBATCH --gres=gpu:8

NODELIST=($(scontrol show hostnames $SLURM_JOB_NODELIST))

# Nodes 0-3: training
srun --nodes=4 --ntasks=4 --nodelist="${NODELIST[@]:0:4}" accelerate launch \
     --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
     --num_processes 32 \
     train_rloo.py --server_ip ${NODELIST[4]} &

# Node 4: vLLM inference
srun --nodes=1 --ntasks=1 --nodelist="${NODELIST[4]}" \
     trl vllm-serve --model Qwen/Qwen2.5-72B --tensor_parallel_size 8 &

wait

GRPO vs RLOO

                         RLOO                    GRPO
Default group size       2                       8
Advantage normalization  Leave-one-out baseline  Group-relative (mean/std)
KL default (beta)        0.05                    0.0
Value model required     No                      No
Best for                 Low-memory online RL    Reasoning model training

Logged metrics

reward: Overall average reward (sum across functions, weighted by reward_weights)
reward_std: Standard deviation of summed rewards across the batch
completions/mean_length: Average length of generated completions
completions/clipped_ratio: Fraction of completions truncated at max_completion_length
entropy: Average token prediction entropy across completions
kl: Average KL divergence from the reference model (only logged when beta > 0)
clip_ratio/region_mean: Fraction of sequences where the policy ratio was clipped
