Overview
GRPO (Group Relative Policy Optimization) is an online RL algorithm introduced in the paper DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. It is a variant of PPO that reduces memory usage by replacing the value model with a group-relative advantage estimate. At each step, GRPO generates a group of completions per prompt, computes a reward for each completion, normalizes the rewards within the group to obtain advantages, and updates the policy to increase the probability of high-advantage completions. This approach has become a standard method for training reasoning models such as DeepSeek-R1.
How GRPO works
Generate completions
At each training step, sample a batch of prompts and generate `num_generations` (G) completions per prompt.
Compute advantages
For each completion, compute a scalar reward. Normalize within the group:
- Group normalization (default): subtract the group mean and divide by the group standard deviation.
- Batch normalization (`scale_rewards="batch"`): compute the mean at the group level but the standard deviation at the batch level.
- No scaling (`scale_rewards=False`): disable normalization entirely.
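As a concrete sketch, the default group normalization can be written in plain Python (the `eps` stabilizer and the use of the sample standard deviation are assumptions for illustration, not documented TRL constants):

```python
from statistics import mean, stdev

def group_advantages(rewards, group_size, eps=1e-4):
    """Group-relative advantages, sketching scale_rewards="group".

    Each reward is normalized against the other completions of the SAME
    prompt: subtract the group mean, divide by the group standard deviation.
    """
    advantages = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        m, s = mean(group), stdev(group)  # sample std; needs group_size >= 2
        advantages.extend((r - m) / (s + eps) for r in group)
    return advantages

# Two prompts, G = 3 completions each: rewards [1, 2, 3] and [0, 0, 1]
advantages = group_advantages([1.0, 2.0, 3.0, 0.0, 0.0, 1.0], group_size=3)
```

With `scale_rewards="batch"`, the same mean subtraction would apply, but the standard deviation would be computed over all rewards in the batch.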
Estimate KL divergence
Use the Schulman approximator to estimate KL divergence between the policy and a fixed reference model. With
`beta=0.0` (default), no reference model is loaded.
Dataset format
The dataset must include a `"prompt"` column. All other columns are passed to reward functions as keyword arguments. For vision-language models, include an `image` or `images` column alongside `prompt`.
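For illustration, a row in this format might look like the following (the `ground_truth` column name is hypothetical; any extra column is forwarded to reward functions):

```python
# Required column: "prompt". Any extra columns (here the hypothetical
# "ground_truth") are passed to reward functions as keyword arguments.
row = {
    "prompt": "What is 13 * 17?",
    "ground_truth": "221",
}
```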
Custom reward functions
A reward function must accept `prompts`, `completions`, `completion_ids`, and any dataset columns as keyword arguments, and return a list of floats (one per completion). Use `**kwargs` to accept all arguments.
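A minimal sketch (the function name and the 20-character target are illustrative, not part of the API):

```python
def reward_len(prompts, completions, completion_ids, **kwargs):
    """Hypothetical reward: prefer completions close to 20 characters.

    Returns one float per completion; extra dataset columns arrive
    via **kwargs.
    """
    return [-abs(20 - len(completion)) for completion in completions]
```

Because the trainer calls reward functions with keyword arguments, `**kwargs` safely absorbs any columns the function does not use.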
Multi-task reward functions
Return `None` for samples that a reward function does not apply to. The trainer ignores `None` values and sums only the valid rewards.
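For example, a reward that scores only rows whose (hypothetical) `task` column equals `"math"`:

```python
def math_accuracy(completions, task, **kwargs):
    """Hypothetical multi-task reward: score "math" rows, skip the rest."""
    rewards = []
    for completion, row_task in zip(completions, task):
        if row_task != "math":
            rewards.append(None)  # None -> ignored by the trainer
        else:
            rewards.append(1.0 if "42" in completion else 0.0)
    return rewards
```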
Built-in rewards
TRL provides built-in reward functions in `trl.rewards`, including `accuracy_reward` for checking mathematical correctness.
Key configuration parameters
Generation
- `num_generations`: Number of completions to generate per prompt (the group size G). The effective batch size must be divisible by this value.
- `max_completion_length`: Maximum number of tokens to generate per completion.
- `temperature`: Sampling temperature. Higher values produce more diverse completions.
- `top_p`: Nucleus sampling cutoff. Set below 1.0 to restrict sampling to a smaller token set.
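A generation setup might look like the following sketch (assuming current `GRPOConfig` parameter names; the values are illustrative):

```python
from trl import GRPOConfig  # requires trl to be installed

training_args = GRPOConfig(
    num_generations=8,          # group size G
    max_completion_length=256,  # tokens per completion
    temperature=0.7,            # higher -> more diverse completions
    top_p=0.95,                 # nucleus sampling cutoff
)
```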
Training
- `beta`: KL coefficient controlling deviation from the reference model. When `0.0` (default), the reference model is not loaded. DeepSeek-R1 uses `0.001`.
- `loss_type`: Loss normalization strategy. Options: `"dapo"` (default, normalizes by active tokens in the batch), `"dr_grpo"` (normalizes by `max_completion_length`), `"grpo"` (normalizes by sequence length, not recommended), `"bnpo"`, `"cispo"`, `"sapo"`, `"luspo"`, `"vespo"`.
- `scale_rewards`: Reward scaling strategy. `"group"` (default): normalize within each prompt group. `"batch"`: normalize across the entire batch. `False`: no scaling.
- `epsilon`: Clipping range for the policy ratio in the surrogate objective.
- `num_iterations`: Number of gradient update passes per generated batch (μ in the original paper). When greater than 1, the clipped surrogate objective is used.
- `mask_truncated_completions`: Exclude truncated completions from the loss. Recommended for training stability, especially with long chain-of-thought responses.
- `reward_weights`: Per-function weights when using multiple reward functions. If `None`, all functions are weighted equally.
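A training setup might look like the following sketch (assuming current `GRPOConfig` parameter names; the `epsilon` value is illustrative):

```python
from trl import GRPOConfig  # requires trl to be installed

training_args = GRPOConfig(
    beta=0.0,                         # no KL penalty; reference model not loaded
    loss_type="dapo",                 # default loss normalization
    scale_rewards="group",            # default reward scaling
    epsilon=0.2,                      # illustrative clipping range
    num_iterations=1,                 # gradient passes per generated batch
    mask_truncated_completions=True,  # drop truncated completions from the loss
)
```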
vLLM acceleration
- `use_vllm`: Use vLLM for faster generation. Requires `pip install trl[vllm]`.
- `vllm_mode`: How to run vLLM: `"colocate"` (shares training GPUs) or `"server"` (separate process on dedicated GPUs).
- `vllm_gpu_memory_utilization`: Fraction of GPU memory reserved for vLLM when running in colocate mode.
Accelerating generation with vLLM
Generation is typically the bottleneck in online RL training, and vLLM can provide a significant speedup in two modes:
- Colocate mode: vLLM runs inside the trainer process and shares GPU memory with the training model.
- Server mode: vLLM runs in a separate process, typically on dedicated GPUs.
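A colocate-mode sketch, assuming current `GRPOConfig` vLLM parameter names (the memory fraction is illustrative):

```python
from trl import GRPOConfig  # requires trl[vllm] to be installed

training_args = GRPOConfig(
    use_vllm=True,
    vllm_mode="colocate",             # vLLM shares the training GPUs
    vllm_gpu_memory_utilization=0.3,  # fraction of GPU memory reserved for vLLM
)
```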
Training with PEFT/LoRA
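A configuration sketch, assuming TRL's `peft` integration via the `peft_config` argument (the model name is a placeholder, and `reward_len` and `dataset` stand in for your own reward function and dataset):

```python
from peft import LoraConfig
from trl import GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",               # placeholder model
    reward_funcs=reward_len,                        # hypothetical reward function
    train_dataset=dataset,                          # hypothetical dataset
    peft_config=LoraConfig(task_type="CAUSAL_LM"),  # train LoRA adapters only
)
```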
Agent training
GRPO supports agentic workflows through tool use. Pass a list of Python functions as tools. Tools must be Python functions with type-hinted arguments, return types, and a Google-style docstring; the model uses these to determine how to call each tool.
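A hypothetical tool in the required shape (type hints plus a Google-style docstring):

```python
def multiply(a: int, b: int) -> int:
    """Multiply two integers.

    Args:
        a: The first factor.
        b: The second factor.

    Returns:
        The product of a and b.
    """
    return a * b

# Passed to the trainer as, e.g., tools=[multiply]
```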
Logged metrics
| Metric | Description |
|---|---|
| `reward` | Overall average reward (sum across functions, weighted by `reward_weights`) |
| `reward_std` | Standard deviation of summed rewards across the batch |
| `completions/mean_length` | Average length of generated completions |
| `completions/clipped_ratio` | Fraction of completions truncated at `max_completion_length` |
| `entropy` | Average token prediction entropy across completions |
| `kl` | Average KL divergence from the reference model (only logged when `beta > 0`) |
| `clip_ratio/region_mean` | Fraction of tokens where the policy ratio was clipped |