The trl.rewards module provides ready-to-use reward functions primarily intended for GRPOTrainer and RLOOTrainer. All reward functions share the same calling convention: they receive a batch of completions and return a list of float rewards (or None for examples that should be skipped). Install the optional dependency required by the accuracy rewards:
pip install math_verify
Import from the trl.rewards sub-package:
from trl.rewards import (
    accuracy_reward,
    reasoning_accuracy_reward,
    think_format_reward,
    get_soft_overlong_punishment,
)
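Any function following the same convention can be used alongside the built-in rewards. A minimal sketch of a custom reward (the name and the brevity heuristic are purely illustrative, not part of trl):

```python
# A custom reward following the shared convention: take a batch of
# completions, return one float (or None) per example.
def brevity_reward(completions, **kwargs):
    rewards = []
    for completion in completions:
        content = completion[0]["content"]
        # Illustrative heuristic: reward answers under 50 characters.
        rewards.append(1.0 if len(content) < 50 else 0.0)
    return rewards

completions = [
    [{"role": "assistant", "content": "Short answer."}],
    [{"role": "assistant", "content": "A" * 100}],
]
print(brevity_reward(completions))
# [1.0, 0.0]
```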

accuracy_reward

Checks whether each model completion matches its ground-truth solution using symbolic math verification from the math_verify library.
  • If both the gold solution and the prediction are parseable LaTeX expressions, math_verify.verify is used for comparison.
  • If the gold solution cannot be parsed, None is returned for that example so the trainer can skip it.
Requires the math_verify package (pip install math_verify). The function detects non-main threads and disables signal-based timeouts automatically to avoid ValueError.

Signature

def accuracy_reward(
    completions: list[list[dict[str, str]]],
    solution: list[str],
    **kwargs,
) -> list[float | None]

Parameters

completions
list[list[dict[str, str]]]
Batch of completions. Each completion is a single-element list containing a message dict with a "content" key (the assistant’s output text).
solution
list[str]
Batch of raw-text ground-truth solutions corresponding 1-to-1 with completions.
**kwargs
Additional keyword arguments accepted for compatibility with trainer interfaces (e.g., GRPOTrainer).

Returns

list[float | None]
1.0 if the answer matches, 0.0 if not, or None if the gold solution could not be parsed.

Example

from trl.rewards import accuracy_reward

solutions = [r"\frac{1}{3}", r"\frac{1}{3}"]
completions = [
    [{"role": "assistant", "content": r"My answer is \boxed{\frac{1}{3}}"}],
    [{"role": "assistant", "content": r"My answer is \boxed{\frac{1}{2}}"}],
]
print(accuracy_reward(completions, solutions))
# [1.0, 0.0]
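When used inside GRPOTrainer, the solution argument does not need to be passed manually: the trainer forwards extra dataset columns to reward functions as keyword arguments, so a "solution" column is supplied automatically. A sketch under that assumption; model and train_dataset are placeholders you provide:

```python
from trl import GRPOConfig, GRPOTrainer
from trl.rewards import accuracy_reward

# Sketch: the trainer calls accuracy_reward(completions, solution=...)
# per batch, pulling `solution` from the dataset column of the same name.
trainer = GRPOTrainer(
    model=model,                  # placeholder: your model or model id
    args=GRPOConfig(output_dir="./output"),
    reward_funcs=[accuracy_reward],
    train_dataset=train_dataset,  # must contain "prompt" and "solution" columns
)
```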

reasoning_accuracy_reward

Variant of accuracy_reward designed for reasoning models that emit a thinking block before their final answer (e.g., models using <think>...</think> tags). The function strips the reasoning section and evaluates only the text that follows the last reasoning delimiter.
  • Completions where no reasoning delimiter is found receive a reward of 0.0 (penalizing incomplete reasoning chains).
  • Completions where the gold solution is unparseable receive None (skip).

Signature

def reasoning_accuracy_reward(
    completions: list[list[dict[str, str]]],
    solution: list[str],
    reasoning_delimiters: list[str] | None = None,
    **kwargs,
) -> list[float | None]

Parameters

completions
list[list[dict[str, str]]]
Batch of completions. Each completion is a single-element list containing a message dict with a "content" key.
solution
list[str]
Batch of ground-truth solution strings.
reasoning_delimiters
list[str]
List of delimiter strings marking the end of the reasoning block. Defaults to ["</think>"]. The final answer is taken as the text after the last occurrence of any delimiter.
**kwargs
Additional keyword arguments for trainer compatibility.

Returns

list[float | None]
1.0 on a correct answer, 0.0 if the answer is wrong or the reasoning is incomplete, None if the gold solution is unparseable.

Example

from trl.rewards import reasoning_accuracy_reward

solutions = [r"\frac{1}{3}", r"\frac{1}{3}", r"\frac{1}{3}"]
completions = [
    [{"role": "assistant", "content": r"<think> Reasoning </think> The answer is \boxed{\frac{1}{3}}"}],
    [{"role": "assistant", "content": r"<think> Reasoning </think> The answer is \boxed{\frac{1}{2}}"}],
    [{"role": "assistant", "content": r"<think> Incomplete reasoning with \boxed{\frac{1}{3}}"}],
]
print(reasoning_accuracy_reward(completions, solutions, reasoning_delimiters=["</think>"]))
# [1.0, 0.0, 0.0]
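Accuracy rewards are commonly combined with a format reward. A sketch of weighting think_format_reward against reasoning_accuracy_reward in GRPOTrainer, assuming GRPOConfig's reward_weights aligns positionally with reward_funcs; model and train_dataset are placeholders:

```python
from trl import GRPOConfig, GRPOTrainer
from trl.rewards import reasoning_accuracy_reward, think_format_reward

trainer = GRPOTrainer(
    model=model,  # placeholder
    args=GRPOConfig(
        output_dir="./output",
        reward_weights=[0.2, 1.0],  # down-weight the format reward
    ),
    reward_funcs=[think_format_reward, reasoning_accuracy_reward],
    train_dataset=train_dataset,  # placeholder; needs a "solution" column
)
```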

think_format_reward

A lightweight format-checking reward that returns 1.0 when the completion correctly wraps its reasoning inside a single <think>...</think> block, and 0.0 otherwise. The regex pattern enforced is:
^<think>(?!.*<think>)(.*?)</think>.*$
This means the completion must:
  • Start with <think>.
  • Contain exactly one <think> opening tag.
  • Contain a matching </think> closing tag; any text after it (the final answer) is allowed.
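The pattern can be exercised directly with Python's re module. A small sketch, assuming the pattern is applied with re.DOTALL so multi-line reasoning blocks match:

```python
import re

# The format pattern from above; re.DOTALL lets '.' match newlines,
# so reasoning spread over several lines is accepted.
pattern = re.compile(r"^<think>(?!.*<think>)(.*?)</think>.*$", re.DOTALL)

ok = "<think>\nstep 1\nstep 2\n</think>\nFinal answer."
nested = "<think>outer <think>inner</think></think>"
unclosed = "<think>no closing tag"

print([bool(pattern.match(text)) for text in (ok, nested, unclosed)])
# [True, False, False]
```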

Signature

def think_format_reward(
    completions: list[list[dict[str, str]]],
    **kwargs,
) -> list[float]

Parameters

completions
list[list[dict[str, str]]]
Batch of completions. Each element is a single-element list with a message dict containing a "content" key.
**kwargs
Additional keyword arguments for trainer compatibility.

Returns

list[float]
1.0 if the format is correct, 0.0 otherwise.

Example

from trl.rewards import think_format_reward

completions = [
    [{"content": "<think>\nReasoning here.\n</think>\nFinal answer."}],
    [{"content": "<think>\nReasoning without closing tag."}],
]
print(think_format_reward(completions))
# [1.0, 0.0]

get_soft_overlong_punishment

A factory function that returns a reward function penalizing completions that exceed a target length, based on Equation 13 of the DAPO paper. The returned reward function applies the piecewise penalty:
R(y) = \begin{cases} 0 & |y| \le L_{\max} - L_{\text{cache}} \\ \dfrac{(L_{\max} - L_{\text{cache}}) - |y|}{L_{\text{cache}}} & L_{\max} - L_{\text{cache}} < |y| \le L_{\max} \\ -1 & |y| > L_{\max} \end{cases}
where |y| is the completion length in tokens, L_max is max_completion_len, and L_cache is soft_punish_cache.

Signature

def get_soft_overlong_punishment(
    max_completion_len: int,
    soft_punish_cache: int,
) -> Callable

Parameters

max_completion_len
int
Maximum allowed completion length L_{\max} in tokens.
soft_punish_cache
int
Soft penalty window L_{\text{cache}} in tokens. Completions with length in (L_{\max} - L_{\text{cache}}, L_{\max}] receive a linearly interpolated penalty; set it to 0 to disable the soft zone, so the penalty jumps directly from 0 to -1 once the maximum length is exceeded.

Returns

A callable with signature (completion_ids: list[list[int]], **kwargs) -> list[float] suitable for direct use as a reward function in GRPOTrainer.

Example

from trl.rewards import get_soft_overlong_punishment

reward_fn = get_soft_overlong_punishment(max_completion_len=100, soft_punish_cache=20)

# Token ids for a completion of length 90, inside the soft penalty zone (80, 100]
completion_ids = [[1] * 90]
print(reward_fn(completion_ids))
# [-0.5]

# Within the safe zone (<= 80)
print(reward_fn([[1] * 70]))
# [0.0]

# Beyond max length (> 100)
print(reward_fn([[1] * 110]))
# [-1.0]

Using with GRPOTrainer

from trl import GRPOTrainer, GRPOConfig
from trl.rewards import get_soft_overlong_punishment

overlong_reward = get_soft_overlong_punishment(
    max_completion_len=512,
    soft_punish_cache=64,
)

trainer = GRPOTrainer(
    model=model,
    args=GRPOConfig(output_dir="./output"),
    reward_funcs=[overlong_reward],
    train_dataset=train_dataset,
)
