## Overview
RLOO (REINFORCE Leave-One-Out) is described in the paper *Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs*. It is an online RL method that uses a leave-one-out baseline to reduce gradient variance, avoiding the separate value model that PPO requires. RLOO generates multiple completions per prompt and, for each completion, uses the average reward of the other completions as a baseline, reducing variance while remaining computationally efficient.

## Quick start
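A minimal training script might look like the following sketch. The model, dataset, and reward function are placeholders, and the call signature assumes `RLOOTrainer` accepts GRPO-style arguments; it requires `pip install trl` and a GPU to actually run:

```python
from datasets import load_dataset
from trl import RLOOConfig, RLOOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

# Toy reward: prefer completions close to 20 characters long
def reward_len(completions, **kwargs):
    return [-abs(20 - len(c)) for c in completions]

training_args = RLOOConfig(output_dir="Qwen2-0.5B-RLOO")
trainer = RLOOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```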
## How RLOO works
1. **Generate completions.** At each step, sample a batch of prompts and generate `num_generations` (G, default 2) completions per prompt.

2. **Compute rewards.** For each completion, compute a reward using the reward function(s), then add a KL penalty to discourage deviation from a reference policy:

   $$\tilde{r}_i = r_i - \beta \, \mathrm{KL}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]$$

3. **Compute leave-one-out advantages.** For each completion, compute a baseline as the average reward of all other completions in the same group:

   $$b_i = \frac{1}{G-1} \sum_{j \neq i} \tilde{r}_j, \qquad A_i = \tilde{r}_i - b_i$$

   This leave-one-out estimate eliminates the need for a value model while still reducing gradient variance.
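The reward shaping and leave-one-out steps can be sketched in plain Python (an illustration of the arithmetic, not TRL's exact implementation; the per-completion KL estimates are assumed to be precomputed):

```python
def shaped_rewards(rewards, kls, beta=0.05):
    """Step 2: subtract the KL penalty from each raw reward.
    `kls` holds per-completion KL estimates vs. the reference policy."""
    return [r - beta * kl for r, kl in zip(rewards, kls)]

def rloo_advantages(rewards):
    """Step 3: each completion's baseline is the mean reward of the
    other G - 1 completions in its group."""
    G = len(rewards)
    assert G >= 2, "leave-one-out needs at least 2 completions per prompt"
    total = sum(rewards)
    return [r - (total - r) / (G - 1) for r in rewards]
```

For example, a group with rewards `[1.0, 2.0, 3.0]` (and zero KL) yields advantages `[-1.5, 0.0, 1.5]`: each completion is compared against the average of the other two.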
## Dataset format

The dataset must include a `"prompt"` column. Additional columns are passed to reward functions. For multimodal models, include an `image` or `images` column alongside `prompt`.
## Custom reward functions
Reward functions follow the same interface as in `GRPOTrainer`. They must accept `prompts`, `completions`, `completion_ids`, and any dataset columns as keyword arguments, and return a list of floats.
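For illustration, a reward function with this interface might look like the sketch below; `ground_truth` stands in for a hypothetical extra dataset column:

```python
def exact_match_reward(prompts, completions, completion_ids, ground_truth, **kwargs):
    """Return 1.0 when a completion matches the ground truth, else 0.0."""
    return [float(c.strip() == gt.strip()) for c, gt in zip(completions, ground_truth)]
```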
### Multi-task reward functions
Return `None` for samples that a reward function does not apply to; the trainer ignores `None` values.
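As a toy example, the function below scores only samples whose (hypothetical) `task` dataset column is `"math"`, returning `None` for everything else; the "correct answer is 4" check is a stand-in for a real verifier:

```python
def math_reward(completions, task, **kwargs):
    """Score math samples only; None tells the trainer to skip this
    function for non-math samples."""
    return [
        (1.0 if c.strip() == "4" else 0.0) if t == "math" else None
        for c, t in zip(completions, task)
    ]
```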
## Key configuration parameters
### Generation
- `num_generations` — Number of completions to generate per prompt (the group size G). Must be at least 2 for the leave-one-out baseline; the effective batch size must be divisible by this value.
- `max_completion_length` — Maximum number of tokens to generate per completion.
- `temperature` — Sampling temperature; higher values produce more diverse completions.
### Training
- `beta` — KL coefficient controlling deviation from the reference model. When 0.0, the reference model is not loaded.
- Clipping range for the importance sampling ratio in the surrogate objective.
- Number of gradient update passes per generated batch (μ in the algorithm). When greater than 1, a clipped surrogate objective is used.
- Whether to normalize advantages across the generation batch to mean 0 and standard deviation 1.
- `reward_weights` — Per-function weights when using multiple reward functions. Defaults to equal weighting.
- Reward clipping range (min, max), applied before computing advantages. If None, no clipping is applied.
- Whether to exclude truncated completions from the loss. Recommended for training stability.
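Putting a few of these together, a hypothetical configuration sketch (only parameters named elsewhere on this page are shown; check the `RLOOConfig` API reference for the full list):

```python
from trl import RLOOConfig

training_args = RLOOConfig(
    output_dir="rloo-example",
    num_generations=4,          # group size G; must divide the effective batch size
    max_completion_length=256,  # cap on generated tokens per completion
    temperature=0.9,            # higher -> more diverse completions
    beta=0.05,                  # KL coefficient; 0.0 skips loading the reference model
)
```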
### vLLM acceleration
Use vLLM for faster generation (requires `pip install trl[vllm]`). Options:

- Mode: `"colocate"` (shares the training GPUs) or `"server"` (a separate process on dedicated GPUs).
- Fraction of GPU memory reserved for vLLM in colocate mode.
`RLOOConfig` overrides some `TrainingArguments` defaults: `logging_steps=10`, `gradient_checkpointing=True`, `bf16=True`, and `learning_rate=1e-6`.

## Accelerating generation with vLLM
In colocate mode, vLLM runs inside the trainer process and shares GPU memory with training. In server mode, vLLM runs as a separate process on dedicated GPUs, and the trainer sends generation requests to it.
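A colocate-mode setup might look like the fragment below; the `use_vllm`, `vllm_mode`, and `vllm_gpu_memory_utilization` parameter names are assumptions based on the options described above:

```python
from trl import RLOOConfig

training_args = RLOOConfig(
    use_vllm=True,                    # enable vLLM-based generation
    vllm_mode="colocate",             # share the training GPUs
    vllm_gpu_memory_utilization=0.3,  # fraction of GPU memory reserved for vLLM
)
```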
## Training at scale (70B+ models)
For large models, combine DeepSpeed ZeRO-3 with vLLM server mode: ZeRO-3 shards the model across the training GPUs, while a separate vLLM server on dedicated GPUs handles generation.

## GRPO vs RLOO

| | RLOO | GRPO |
|---|---|---|
| Default group size | 2 | 8 |
| Advantage normalization | Leave-one-out baseline | Group-relative (mean/std) |
| KL default (`beta`) | 0.05 | 0.0 |
| Value model required | No | No |
| Best for | Low-memory online RL | Reasoning model training |
## Logged metrics
| Metric | Description |
|---|---|
| `reward` | Overall average reward (sum across functions, weighted by `reward_weights`) |
| `reward_std` | Standard deviation of summed rewards across the batch |
| `completions/mean_length` | Average length of generated completions |
| `completions/clipped_ratio` | Fraction of completions truncated at `max_completion_length` |
| `entropy` | Average token-prediction entropy across completions |
| `kl` | Average KL divergence from the reference model (only logged when `beta` > 0) |
| `clip_ratio/region_mean` | Fraction of sequences where the policy ratio was clipped |