The trl.rewards module provides ready-to-use reward functions, primarily intended for GRPOTrainer and RLOOTrainer. All reward functions share the same calling convention: they receive a batch of completions and return a list of float rewards (or None for examples that should be skipped).
Install the optional dependency required by the accuracy rewards: pip install math_verify.
accuracy_reward
Checks whether each model completion matches its ground-truth solution using symbolic math verification from the math_verify library.
- If both the gold solution and the prediction are parseable LaTeX expressions, math_verify.verify is used for comparison.
- If the gold solution cannot be parsed, None is returned for that example so the trainer can skip it.
Requires the math_verify package (pip install math_verify). The function detects non-main threads and disables signal-based timeouts automatically to avoid a ValueError.
Signature
Parameters
- completions: Batch of completions. Each completion is a single-element list containing a message dict with a "content" key (the assistant's output text).
- solution: Batch of raw-text ground-truth solutions corresponding 1-to-1 with completions.
- **kwargs: Additional keyword arguments accepted for compatibility with trainer interfaces (e.g., GRPOTrainer).
Returns
list[float | None] — 1.0 if the answer matches, 0.0 if not, or None if the gold solution could not be parsed.
Example
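A minimal sketch of the calling convention (input and output shapes). For illustration it replaces math_verify's symbolic comparison with naive string equality; the name accuracy_reward_sketch and the empty-string stand-in for an unparseable gold solution are assumptions, not the library's implementation:

```python
def accuracy_reward_sketch(completions, solution, **kwargs):
    """Illustrates input/output shapes only; the real reward uses math_verify."""
    rewards = []
    for completion, gold in zip(completions, solution):
        content = completion[0]["content"]  # single message dict per completion
        if not gold.strip():  # stand-in for "gold solution could not be parsed"
            rewards.append(None)  # trainer skips this example
        else:
            rewards.append(1.0 if content.strip() == gold.strip() else 0.0)
    return rewards

completions = [[{"content": "42"}], [{"content": "7"}]]
solution = ["42", "8"]
print(accuracy_reward_sketch(completions, solution))  # [1.0, 0.0]
```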
reasoning_accuracy_reward
Variant of accuracy_reward designed for reasoning models that emit a thinking block before their final answer (e.g., models using <think>...</think> tags). The function strips the reasoning section and evaluates only the text that follows the last reasoning delimiter.
- Completions where no reasoning delimiter is found receive a reward of 0.0 (penalizing incomplete reasoning chains).
- Completions where the gold solution is unparseable receive None (skip).
Signature
Parameters
- completions: Batch of completions. Each completion is a single-element list containing a message dict with a "content" key.
- solution: Batch of ground-truth solution strings.
- List of delimiter strings marking the end of the reasoning block. Defaults to ["</think>"]. The final answer is taken as the text after the last occurrence of any delimiter.
- **kwargs: Additional keyword arguments for trainer compatibility.
Returns
list[float | None] — 1.0 on correct answer, 0.0 if wrong or reasoning is incomplete, None if gold is unparseable.
Example
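A sketch of the delimiter-stripping behavior described above. String equality again stands in for symbolic verification, and the function name and the delimiters keyword are assumptions:

```python
def reasoning_accuracy_reward_sketch(completions, solution, delimiters=("</think>",), **kwargs):
    rewards = []
    for completion, gold in zip(completions, solution):
        content = completion[0]["content"]
        # Find the end of the last occurrence of any reasoning delimiter.
        end = -1
        for delim in delimiters:
            idx = content.rfind(delim)
            if idx != -1:
                end = max(end, idx + len(delim))
        if end == -1:
            rewards.append(0.0)  # no delimiter: incomplete reasoning chain
        elif not gold.strip():   # stand-in for an unparseable gold solution
            rewards.append(None)
        else:
            # Evaluate only the text after the reasoning block.
            rewards.append(1.0 if content[end:].strip() == gold.strip() else 0.0)
    return rewards

completions = [
    [{"content": "<think>3 * 14 = 42</think>42"}],
    [{"content": "42"}],  # missing reasoning block
]
print(reasoning_accuracy_reward_sketch(completions, ["42", "42"]))  # [1.0, 0.0]
```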
think_format_reward
A lightweight format-checking reward that returns 1.0 when the completion correctly wraps its reasoning inside a single <think>...</think> block, and 0.0 otherwise.
The enforced pattern requires the completion to:
- start with <think>;
- contain exactly one <think> opening tag; and
- close with </think> before any additional content.
Signature
Parameters
- completions: Batch of completions. Each element is a single-element list with a message dict containing a "content" key.
- **kwargs: Additional keyword arguments for trainer compatibility.
Returns
list[float] — 1.0 if format is correct, 0.0 otherwise.
Example
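The three conditions above can be sketched with a single regular expression; the exact pattern used by the library may differ, so treat this as an approximation:

```python
import re

# Must start with <think>, contain no second <think>, and close with </think>.
_THINK_PATTERN = re.compile(r"^<think>(?!.*<think>)(.*?)</think>.*$", re.DOTALL)

def think_format_reward_sketch(completions, **kwargs):
    return [
        1.0 if _THINK_PATTERN.match(completion[0]["content"]) else 0.0
        for completion in completions
    ]

good = [{"content": "<think>step by step</think>The answer is 42."}]
bad = [{"content": "The answer is 42."}]
print(think_format_reward_sketch([good, bad]))  # [1.0, 0.0]
```

The negative lookahead (?!.*<think>) rejects completions with a second opening tag, which covers the "exactly one" condition.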
get_soft_overlong_punishment
A factory function that returns a reward function penalizing completions that exceed a target length, based on Equation 13 from the DAPO paper. The returned reward function applies a piecewise penalty: 0 for completions no longer than the soft cap (maximum length minus the soft window), a penalty interpolated linearly from 0 down to -1 across the soft window, and -1 for completions longer than the maximum length.
Signature
Parameters
- max_completion_len: Maximum allowed completion length in tokens.
- soft_punish_cache: Length of the soft penalty window in tokens. Completions whose length falls in the range (max_completion_len - soft_punish_cache, max_completion_len] receive a linearly interpolated penalty. Set to 0 to apply no minimum-length tolerance.
Returns
A callable with signature (completion_ids: list[list[int]], **kwargs) -> list[float], suitable for direct use as a reward function in GRPOTrainer.
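A self-contained sketch of the factory and the DAPO-style piecewise penalty, assuming the parameter names max_completion_len and soft_punish_cache (the _sketch suffix marks it as an illustration, not the library function):

```python
def get_soft_overlong_punishment_sketch(max_completion_len, soft_punish_cache):
    """Build a length-penalty reward following DAPO's soft overlong punishment."""
    def soft_overlong_punishment(completion_ids, **kwargs):
        rewards = []
        for ids in completion_ids:
            length = len(ids)
            if length <= max_completion_len - soft_punish_cache:
                rewards.append(0.0)  # within the tolerated length: no penalty
            elif length <= max_completion_len:
                # Linear ramp from 0 down to -1 inside the soft window.
                rewards.append((max_completion_len - soft_punish_cache - length) / soft_punish_cache)
            else:
                rewards.append(-1.0)  # hard cap exceeded: full penalty
        return rewards
    return soft_overlong_punishment

reward_fn = get_soft_overlong_punishment_sketch(max_completion_len=10, soft_punish_cache=4)
print(reward_fn([[0] * 5, [0] * 8, [0] * 12]))  # [0.0, -0.5, -1.0]
```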