
Overview

Reward models (RMs) are trained to assign scalar scores to model outputs, reflecting how well a response aligns with human preferences. A well-trained reward model can then guide online RL methods (such as PPO, GRPO, or RLOO) or evaluate model outputs at inference time. RewardTrainer trains an AutoModelForSequenceClassification model (with num_labels=1) on a preference dataset using a Bradley-Terry pairwise ranking objective: the model learns to assign higher scores to preferred responses than to rejected ones.

Quick start

from trl import RewardTrainer
from datasets import load_dataset

trainer = RewardTrainer(
    model="Qwen/Qwen3-0.6B",
    train_dataset=load_dataset("trl-lib/ultrafeedback_binarized", split="train"),
)
trainer.train()

Dataset format

RewardTrainer requires a preference dataset with chosen and rejected fields. An optional prompt field is supported. Both standard and conversational formats are accepted.
# Standard preference — implicit prompt
{"chosen": "The sky is blue.",
 "rejected": "The sky is green."}

# Conversational preference — implicit prompt
{"chosen": [{"role": "user", "content": "What color is the sky?"},
            {"role": "assistant", "content": "It is blue."}],
 "rejected": [{"role": "user", "content": "What color is the sky?"},
              {"role": "assistant", "content": "It is green."}]}

# Standard preference — explicit prompt
{"prompt": "The sky is",
 "chosen": " blue.",
 "rejected": " green."}

# Conversational preference — explicit prompt
{"prompt": [{"role": "user", "content": "What color is the sky?"}],
 "chosen": [{"role": "assistant", "content": "It is blue."}],
 "rejected": [{"role": "assistant", "content": "It is green."}]}
To convert a dataset from another format, map it into one of the layouts above. For example, with lmarena-ai/arena-human-preference-55k:
from datasets import load_dataset
import json

dataset = load_dataset("lmarena-ai/arena-human-preference-55k")

# Filter out ties
dataset = dataset.filter(lambda example: example["winner_tie"] == 0)

def make_conversation(example):
    prompt = json.loads(example["prompt"])[0]
    chosen = json.loads(example["chosen"])[0]
    rejected = json.loads(example["rejected"])[0]
    return {
        "chosen": [{"role": "user", "content": prompt}, {"role": "assistant", "content": chosen}],
        "rejected": [{"role": "user", "content": prompt}, {"role": "assistant", "content": rejected}],
    }

dataset = dataset.map(make_conversation)
dataset = dataset.select_columns(["chosen", "rejected"])

How reward modeling works

Under the Bradley-Terry model, the probability that response y⁺ is preferred over y⁻ is:
p(y⁺ ≻ y⁻ | x) = σ(r(x, y⁺) − r(x, y⁻))
The reward model is trained with the negative log-likelihood of observed preferences:
L(θ) = -E[ log σ(r_θ(x, y⁺) − r_θ(x, y⁻)) ]
The Bradley-Terry model is underdetermined: adding a constant to all rewards leaves preference probabilities unchanged. Set center_rewards_coefficient to add an auxiliary loss that encourages mean-zero rewards, which keeps reward magnitudes from drifting during training.
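To make the objective concrete, here is the per-pair loss computed on toy reward scores (a minimal sketch in plain Python; the centering term shown is a simple squared-mean penalty and is an assumption about the exact auxiliary form):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bt_loss(r_chosen, r_rejected, center_coeff=0.0):
    """Bradley-Terry negative log-likelihood for one preference pair,
    plus an optional penalty on the (non-zero) mean of the two rewards."""
    nll = -math.log(sigmoid(r_chosen - r_rejected))
    center = center_coeff * ((r_chosen + r_rejected) / 2.0) ** 2
    return nll + center

# A larger margin between chosen and rejected gives a smaller loss;
# equal rewards give exactly log(2).
bt_loss(2.0, 0.0)   # small
bt_loss(0.5, 0.0)   # larger
```

Note that shifting both rewards by the same constant leaves the nll term unchanged, which is exactly the underdetermination the centering term addresses.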

Key configuration parameters

center_rewards_coefficient (float | None)
Coefficient for an auxiliary loss term that encourages the model to output mean-zero rewards. Recommended value: 0.01. Addresses the underdetermination of the Bradley-Terry model.

activation_offloading (bool, default: False)
Offload activations to CPU to reduce GPU memory usage.

disable_dropout (bool, default: True)
Disable dropout in the model during training. Recommended to improve consistency of reward estimates.

max_length (int | None, default: 1024)
Maximum tokenized sequence length. Samples where either chosen or rejected exceeds this length are filtered out.

eos_token (str | None)
Token used to indicate end of sequence. Defaults to the tokenizer's eos_token.

pad_token (str | None)
Token used for padding. Defaults to processing_class.pad_token, falling back to eos_token.

model_init_kwargs (dict | None)
Keyword arguments forwarded to AutoModelForSequenceClassification.from_pretrained when the model argument is a string. Note: num_labels is always set to 1 automatically and cannot be overridden here.

chat_template_path (str | None)
Path to a tokenizer or Jinja template file to set as the model's chat template.
RewardConfig overrides some TrainingArguments defaults: logging_steps=10, gradient_checkpointing=True, bf16=True, and learning_rate=1e-4.
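These options are passed through RewardConfig. A sketch of a typical setup (output_dir and the model ID are illustrative choices, not requirements):

```python
from datasets import load_dataset
from trl import RewardConfig, RewardTrainer

config = RewardConfig(
    output_dir="Qwen3-0.6B-Reward",     # any local path
    max_length=1024,                    # filter out over-length pairs
    disable_dropout=True,               # more consistent reward estimates
    center_rewards_coefficient=0.01,    # encourage mean-zero rewards
)

trainer = RewardTrainer(
    model="Qwen/Qwen3-0.6B",
    args=config,
    train_dataset=load_dataset("trl-lib/ultrafeedback_binarized", split="train"),
)
trainer.train()
```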

Training with PEFT/LoRA

When fine-tuning a base causal LM as a reward model using LoRA, include the classification head (score) in modules_to_save:
from datasets import load_dataset
from trl import RewardTrainer
from peft import LoraConfig

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = RewardTrainer(
    "Qwen/Qwen3-4B",
    train_dataset=dataset,
    peft_config=LoraConfig(modules_to_save=["score"]),
)
trainer.train()
When training a reward model adapter on a base causal LM (not a sequence classification model), you must include "score" in modules_to_save. This ensures the classification head is trained and saved alongside the adapter.
To continue training an existing PEFT reward model:
from datasets import load_dataset
from trl import RewardTrainer
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained("trl-lib/Qwen3-4B-Reward-LoRA", is_trainable=True)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = RewardTrainer(
    model=model,
    train_dataset=dataset,
)
trainer.train()
When training reward model adapters, use a higher learning rate (around 1e-3) since only new parameters are being learned.
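Concretely, the learning rate can be raised through RewardConfig when training an adapter (a sketch combining the options above; output_dir is illustrative):

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import RewardConfig, RewardTrainer

trainer = RewardTrainer(
    "Qwen/Qwen3-4B",
    args=RewardConfig(output_dir="Qwen3-4B-Reward-LoRA", learning_rate=1e-3),
    train_dataset=load_dataset("trl-lib/ultrafeedback_binarized", split="train"),
    peft_config=LoraConfig(modules_to_save=["score"]),  # train the classification head too
)
trainer.train()
```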

Using the reward model in an RLHF pipeline

After training, use the reward model as the reward_funcs argument in online RL trainers:
from trl import GRPOTrainer, GRPOConfig
from datasets import load_dataset

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs="path/to/reward-model",  # HuggingFace Hub ID or local path
    train_dataset=load_dataset("trl-lib/DeepMath-103K", split="train"),
)
trainer.train()
You can also combine a trained reward model with custom reward functions to create a hybrid reward signal.
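A hybrid setup might look like the following sketch. Here format_reward is a hypothetical custom function written for standard-format (string) completions, and reward_funcs is given a list mixing a model path with a callable:

```python
from datasets import load_dataset
from trl import GRPOTrainer

def format_reward(completions, **kwargs):
    # Hypothetical rule-based signal: reward non-empty completions
    return [1.0 if completion.strip() else 0.0 for completion in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=["path/to/reward-model", format_reward],  # model + custom function
    train_dataset=load_dataset("trl-lib/DeepMath-103K", split="train"),
)
trainer.train()
```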

Logged metrics

Metric        Description
accuracy      Proportion of examples where the chosen reward exceeds the rejected reward
loss          Average Bradley-Terry loss
margin        Average reward margin (chosen minus rejected)
mean_reward   Average reward score across both chosen and rejected responses
min_reward    Minimum reward score (averaged over the logging interval)
max_reward    Maximum reward score (averaged over the logging interval)
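To make the definitions precise, the metrics can be reproduced from raw reward scores (a minimal sketch; TRL computes these internally during training):

```python
def preference_metrics(chosen_rewards, rejected_rewards):
    """Compute reward-model metrics from paired chosen/rejected scores."""
    pairs = list(zip(chosen_rewards, rejected_rewards))
    all_rewards = list(chosen_rewards) + list(rejected_rewards)
    return {
        "accuracy": sum(c > r for c, r in pairs) / len(pairs),
        "margin": sum(c - r for c, r in pairs) / len(pairs),
        "mean_reward": sum(all_rewards) / len(all_rewards),
        "min_reward": min(all_rewards),
        "max_reward": max(all_rewards),
    }

# Three pairs; the second is mis-ranked, so accuracy is 2/3
metrics = preference_metrics([1.2, 0.4, 2.0], [0.3, 0.9, -1.0])
```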
