The trl.rewards module provides ready-to-use reward functions primarily intended for GRPOTrainer and RLOOTrainer. All reward functions share the same calling convention: they receive a batch of completions and return a list of float rewards (or None for examples that should be skipped). Install the optional dependency required by the accuracy rewards:
pip install math_verify
Import from the trl.rewards sub-package:
from trl.rewards import (
    accuracy_reward,
    reasoning_accuracy_reward,
    think_format_reward,
    get_soft_overlong_punishment,
)
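Any function following the same convention can be used alongside the built-in rewards. A minimal sketch of a custom reward (the name and the brevity heuristic are purely illustrative, not part of trl):

```python
# A custom reward following the shared convention: take a batch of
# completions, return one float (or None) per example.
def brevity_reward(completions, **kwargs):
    rewards = []
    for completion in completions:
        content = completion[0]["content"]
        # Illustrative heuristic: reward answers under 50 characters.
        rewards.append(1.0 if len(content) < 50 else 0.0)
    return rewards

completions = [
    [{"role": "assistant", "content": "Short answer."}],
    [{"role": "assistant", "content": "A" * 100}],
]
print(brevity_reward(completions))
# [1.0, 0.0]
```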

accuracy_reward

Checks whether each model completion matches its ground-truth solution using symbolic math verification from the math_verify library.
  • If both the gold solution and the prediction are parseable LaTeX expressions, math_verify.verify is used for comparison.
  • If the gold solution cannot be parsed, None is returned for that example so the trainer can skip it.
Requires the math_verify package (pip install math_verify). The function detects non-main threads and disables signal-based timeouts automatically to avoid ValueError.

Signature

def accuracy_reward(
    completions: list[list[dict[str, str]]],
    solution: list[str],
    **kwargs,
) -> list[float | None]

Parameters

completions
list[list[dict[str, str]]]
Batch of completions. Each completion is a single-element list containing a message dict with a "content" key (the assistant’s output text).
solution
list[str]
Batch of raw-text ground-truth solutions corresponding 1-to-1 with completions.
**kwargs
Additional keyword arguments accepted for compatibility with trainer interfaces (e.g., GRPOTrainer).

Returns

list[float | None]
1.0 if the answer matches, 0.0 if not, or None if the gold solution could not be parsed.

Example

from trl.rewards import accuracy_reward

solutions = [r"\frac{1}{3}", r"\frac{1}{3}"]
completions = [
    [{"role": "assistant", "content": r"My answer is \boxed{\frac{1}{3}}"}],
    [{"role": "assistant", "content": r"My answer is \boxed{\frac{1}{2}}"}],
]
print(accuracy_reward(completions, solutions))
# [1.0, 0.0]
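When used inside GRPOTrainer, the solution argument does not need to be passed manually: the trainer forwards extra dataset columns to reward functions as keyword arguments, so a "solution" column is supplied automatically. A sketch under that assumption; model and train_dataset are placeholders you provide:

```python
from trl import GRPOConfig, GRPOTrainer
from trl.rewards import accuracy_reward

# Sketch: the trainer calls accuracy_reward(completions, solution=...)
# per batch, pulling `solution` from the dataset column of the same name.
trainer = GRPOTrainer(
    model=model,                  # placeholder: your model or model id
    args=GRPOConfig(output_dir="./output"),
    reward_funcs=[accuracy_reward],
    train_dataset=train_dataset,  # must contain "prompt" and "solution" columns
)
```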

reasoning_accuracy_reward

Variant of accuracy_reward designed for reasoning models that emit a thinking block before their final answer (e.g., models using <think>...</think> tags). The function strips the reasoning section and evaluates only the text that follows the last reasoning delimiter.
  • Completions where no reasoning delimiter is found receive a reward of 0.0 (penalizing incomplete reasoning chains).
  • Completions where the gold solution is unparseable receive None (skip).

Signature

def reasoning_accuracy_reward(
    completions: list[list[dict[str, str]]],
    solution: list[str],
    reasoning_delimiters: list[str] | None = None,
    **kwargs,
) -> list[float | None]

Parameters

completions
list[list[dict[str, str]]]
Batch of completions. Each completion is a single-element list containing a message dict with a "content" key.
solution
list[str]
Batch of ground-truth solution strings.
reasoning_delimiters
list[str]
List of delimiter strings marking the end of the reasoning block. Defaults to ["</think>"]. The final answer is taken as the text after the last occurrence of any delimiter.
**kwargs
Additional keyword arguments for trainer compatibility.

Returns

list[float | None]
1.0 on a correct answer, 0.0 if the answer is wrong or the reasoning is incomplete, None if the gold solution is unparseable.

Example

from trl.rewards import reasoning_accuracy_reward

solutions = [r"\frac{1}{3}", r"\frac{1}{3}", r"\frac{1}{3}"]
completions = [
    [{"role": "assistant", "content": r"<think> Reasoning </think> The answer is \boxed{\frac{1}{3}}"}],
    [{"role": "assistant", "content": r"<think> Reasoning </think> The answer is \boxed{\frac{1}{2}}"}],
    [{"role": "assistant", "content": r"<think> Incomplete reasoning with \boxed{\frac{1}{3}}"}],
]
print(reasoning_accuracy_reward(completions, solutions, reasoning_delimiters=["</think>"]))
# [1.0, 0.0, 0.0]
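Accuracy rewards are commonly combined with a format reward. A sketch of weighting think_format_reward against reasoning_accuracy_reward in GRPOTrainer, assuming GRPOConfig's reward_weights aligns positionally with reward_funcs; model and train_dataset are placeholders:

```python
from trl import GRPOConfig, GRPOTrainer
from trl.rewards import reasoning_accuracy_reward, think_format_reward

trainer = GRPOTrainer(
    model=model,  # placeholder
    args=GRPOConfig(
        output_dir="./output",
        reward_weights=[0.2, 1.0],  # down-weight the format reward
    ),
    reward_funcs=[think_format_reward, reasoning_accuracy_reward],
    train_dataset=train_dataset,  # placeholder; needs a "solution" column
)
```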

think_format_reward

A lightweight format-checking reward that returns 1.0 when the completion correctly wraps its reasoning inside a single <think>...</think> block, and 0.0 otherwise. The regex pattern enforced is:
^<think>(?!.*<think>)(.*?)</think>.*$
This means the completion must:
  • Start with <think>.
  • Contain exactly one <think> opening tag.
  • Contain a matching </think> closing tag; any text after it (the final answer) is allowed.
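The pattern can be exercised directly with Python's re module. A small sketch, assuming the pattern is applied with re.DOTALL so multi-line reasoning blocks match:

```python
import re

# The format pattern from above; re.DOTALL lets '.' match newlines,
# so reasoning spread over several lines is accepted.
pattern = re.compile(r"^<think>(?!.*<think>)(.*?)</think>.*$", re.DOTALL)

ok = "<think>\nstep 1\nstep 2\n</think>\nFinal answer."
nested = "<think>outer <think>inner</think></think>"
unclosed = "<think>no closing tag"

print([bool(pattern.match(text)) for text in (ok, nested, unclosed)])
# [True, False, False]
```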

Signature

def think_format_reward(
    completions: list[list[dict[str, str]]],
    **kwargs,
) -> list[float]

Parameters

completions
list[list[dict[str, str]]]
Batch of completions. Each element is a single-element list with a message dict containing a "content" key.
**kwargs
Additional keyword arguments for trainer compatibility.

Returns

list[float]
1.0 if the format is correct, 0.0 otherwise.

Example

from trl.rewards import think_format_reward

completions = [
    [{"content": "<think>\nReasoning here.\n</think>\nFinal answer."}],
    [{"content": "<think>\nReasoning without closing tag."}],
]
print(think_format_reward(completions))
# [1.0, 0.0]

get_soft_overlong_punishment

A factory function that returns a reward function penalizing completions that exceed a target length, based on Equation 13 of the DAPO paper. The returned reward function applies the piecewise penalty:
R(y) = \begin{cases} 0 & |y| \le L_{\max} - L_{\text{cache}} \\ \dfrac{(L_{\max} - L_{\text{cache}}) - |y|}{L_{\text{cache}}} & L_{\max} - L_{\text{cache}} < |y| \le L_{\max} \\ -1 & |y| > L_{\max} \end{cases}
where |y| is the completion length in tokens, L_max is max_completion_len, and L_cache is soft_punish_cache.

Signature

def get_soft_overlong_punishment(
    max_completion_len: int,
    soft_punish_cache: int,
) -> Callable

Parameters

max_completion_len
int
Maximum allowed completion length L_{\max} in tokens.
soft_punish_cache
int
Soft penalty window L_{\text{cache}} in tokens. Completions with length in (L_{\max} - L_{\text{cache}}, L_{\max}] receive a linearly interpolated penalty; set it to 0 to disable the soft zone, so the penalty jumps directly from 0 to -1 once the maximum length is exceeded.

Returns

A callable with signature (completion_ids: list[list[int]], **kwargs) -> list[float] suitable for direct use as a reward function in GRPOTrainer.

Example

from trl.rewards import get_soft_overlong_punishment

reward_fn = get_soft_overlong_punishment(max_completion_len=100, soft_punish_cache=20)

# Token ids for a completion of length 90, inside the soft penalty zone (80, 100]
completion_ids = [[1] * 90]
print(reward_fn(completion_ids))
# [-0.5]

# Within the safe zone (<= 80)
print(reward_fn([[1] * 70]))
# [0.0]

# Beyond max length (> 100)
print(reward_fn([[1] * 110]))
# [-1.0]

Using with GRPOTrainer

from trl import GRPOTrainer, GRPOConfig
from trl.rewards import get_soft_overlong_punishment

overlong_reward = get_soft_overlong_punishment(
    max_completion_len=512,
    soft_punish_cache=64,
)

trainer = GRPOTrainer(
    model=model,
    args=GRPOConfig(output_dir="./output"),
    reward_funcs=[overlong_reward],
    train_dataset=train_dataset,
)
