
Overview

Reward models (RMs) are trained to assign scalar scores to model outputs, reflecting how well a response aligns with human preferences. A well-trained reward model can then guide online RL methods (such as PPO, GRPO, or RLOO) or evaluate model outputs at inference time. RewardTrainer trains an AutoModelForSequenceClassification model (with num_labels=1) on a preference dataset using a Bradley-Terry pairwise ranking objective: the model learns to assign higher scores to preferred responses than to rejected ones.

Quick start

from trl import RewardTrainer
from datasets import load_dataset

trainer = RewardTrainer(
    model="Qwen/Qwen3-0.6B",
    train_dataset=load_dataset("trl-lib/ultrafeedback_binarized", split="train"),
)
trainer.train()

Dataset format

RewardTrainer requires a preference dataset with chosen and rejected fields. An optional prompt field is supported. Both standard and conversational formats are accepted.
# Standard preference — implicit prompt
{"chosen": "The sky is blue.",
 "rejected": "The sky is green."}

# Conversational preference — implicit prompt
{"chosen": [{"role": "user", "content": "What color is the sky?"},
            {"role": "assistant", "content": "It is blue."}],
 "rejected": [{"role": "user", "content": "What color is the sky?"},
              {"role": "assistant", "content": "It is green."}]}

# Standard preference — explicit prompt
{"prompt": "The sky is",
 "chosen": " blue.",
 "rejected": " green."}

# Conversational preference — explicit prompt
{"prompt": [{"role": "user", "content": "What color is the sky?"}],
 "chosen": [{"role": "assistant", "content": "It is blue."}],
 "rejected": [{"role": "assistant", "content": "It is green."}]}
To convert a dataset from another format, map it into one of the layouts above. For example, with lmarena-ai/arena-human-preference-55k:
from datasets import load_dataset
import json

dataset = load_dataset("lmarena-ai/arena-human-preference-55k")

# Filter out ties
dataset = dataset.filter(lambda example: example["winner_tie"] == 0)

def make_conversation(example):
    prompt = json.loads(example["prompt"])[0]
    chosen = json.loads(example["chosen"])[0]
    rejected = json.loads(example["rejected"])[0]
    return {
        "chosen": [{"role": "user", "content": prompt}, {"role": "assistant", "content": chosen}],
        "rejected": [{"role": "user", "content": prompt}, {"role": "assistant", "content": rejected}],
    }

dataset = dataset.map(make_conversation)
dataset = dataset.select_columns(["chosen", "rejected"])

How reward modeling works

Under the Bradley-Terry model, the probability that response y⁺ is preferred over y⁻ is:
p(y⁺ ≻ y⁻ | x) = σ(r(x, y⁺) − r(x, y⁻))
The reward model is trained with the negative log-likelihood of observed preferences:
L(θ) = -E[ log σ(r_θ(x, y⁺) − r_θ(x, y⁻)) ]
The Bradley-Terry model is underdetermined: adding a constant to all rewards leaves preference probabilities unchanged. Set center_rewards_coefficient to add an auxiliary loss that encourages mean-zero rewards, which keeps reward magnitudes from drifting during training.
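To make the objective concrete, here is the per-pair loss computed on toy reward scores (a minimal sketch in plain Python; the centering term shown is a simple squared-mean penalty and is an assumption about the exact auxiliary form):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bt_loss(r_chosen, r_rejected, center_coeff=0.0):
    """Bradley-Terry negative log-likelihood for one preference pair,
    plus an optional penalty on the (non-zero) mean of the two rewards."""
    nll = -math.log(sigmoid(r_chosen - r_rejected))
    center = center_coeff * ((r_chosen + r_rejected) / 2.0) ** 2
    return nll + center

# A larger margin between chosen and rejected gives a smaller loss;
# equal rewards give exactly log(2).
bt_loss(2.0, 0.0)   # small
bt_loss(0.5, 0.0)   # larger
```

Note that shifting both rewards by the same constant leaves the nll term unchanged, which is exactly the underdetermination the centering term addresses.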

Key configuration parameters

center_rewards_coefficient (float | None)
Coefficient for an auxiliary loss term that encourages the model to output mean-zero rewards. Recommended value: 0.01. Addresses the underdetermination of the Bradley-Terry model.

activation_offloading (bool, default: False)
Offload activations to CPU to reduce GPU memory usage.

disable_dropout (bool, default: True)
Disable dropout in the model during training. Recommended to improve consistency of reward estimates.

max_length (int | None, default: 1024)
Maximum tokenized sequence length. Samples where either chosen or rejected exceeds this length are filtered out.

eos_token (str | None)
Token used to indicate end of sequence. Defaults to the tokenizer's eos_token.

pad_token (str | None)
Token used for padding. Defaults to processing_class.pad_token, falling back to eos_token.

model_init_kwargs (dict | None)
Keyword arguments forwarded to AutoModelForSequenceClassification.from_pretrained when the model argument is a string. Note: num_labels is always set to 1 automatically and cannot be overridden here.

chat_template_path (str | None)
Path to a tokenizer or Jinja template file to set as the model's chat template.
RewardConfig overrides some TrainingArguments defaults: logging_steps=10, gradient_checkpointing=True, bf16=True, and learning_rate=1e-4.
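These options are passed through RewardConfig. A sketch of a typical setup (output_dir and the model ID are illustrative choices, not requirements):

```python
from datasets import load_dataset
from trl import RewardConfig, RewardTrainer

config = RewardConfig(
    output_dir="Qwen3-0.6B-Reward",     # any local path
    max_length=1024,                    # filter out over-length pairs
    disable_dropout=True,               # more consistent reward estimates
    center_rewards_coefficient=0.01,    # encourage mean-zero rewards
)

trainer = RewardTrainer(
    model="Qwen/Qwen3-0.6B",
    args=config,
    train_dataset=load_dataset("trl-lib/ultrafeedback_binarized", split="train"),
)
trainer.train()
```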

Training with PEFT/LoRA

When fine-tuning a base causal LM as a reward model using LoRA, include the classification head (score) in modules_to_save:
from datasets import load_dataset
from trl import RewardTrainer
from peft import LoraConfig

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = RewardTrainer(
    "Qwen/Qwen3-4B",
    train_dataset=dataset,
    peft_config=LoraConfig(modules_to_save=["score"]),
)
trainer.train()
When training a reward model adapter on a base causal LM (not a sequence classification model), you must include "score" in modules_to_save. This ensures the classification head is trained and saved alongside the adapter.
To continue training an existing PEFT reward model:
from datasets import load_dataset
from trl import RewardTrainer
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained("trl-lib/Qwen3-4B-Reward-LoRA", is_trainable=True)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = RewardTrainer(
    model=model,
    train_dataset=dataset,
)
trainer.train()
When training reward model adapters, use a higher learning rate (around 1e-3) since only new parameters are being learned.
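Concretely, the learning rate can be raised through RewardConfig when training an adapter (a sketch combining the options above; output_dir is illustrative):

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import RewardConfig, RewardTrainer

trainer = RewardTrainer(
    "Qwen/Qwen3-4B",
    args=RewardConfig(output_dir="Qwen3-4B-Reward-LoRA", learning_rate=1e-3),
    train_dataset=load_dataset("trl-lib/ultrafeedback_binarized", split="train"),
    peft_config=LoraConfig(modules_to_save=["score"]),  # train the classification head too
)
trainer.train()
```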

Using the reward model in an RLHF pipeline

After training, use the reward model as the reward_funcs argument in online RL trainers:
from trl import GRPOTrainer, GRPOConfig
from datasets import load_dataset

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs="path/to/reward-model",  # HuggingFace Hub ID or local path
    train_dataset=load_dataset("trl-lib/DeepMath-103K", split="train"),
)
trainer.train()
You can also combine a trained reward model with custom reward functions to create a hybrid reward signal.
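A hybrid setup might look like the following sketch. Here format_reward is a hypothetical custom function written for standard-format (string) completions, and reward_funcs is given a list mixing a model path with a callable:

```python
from datasets import load_dataset
from trl import GRPOTrainer

def format_reward(completions, **kwargs):
    # Hypothetical rule-based signal: reward non-empty completions
    return [1.0 if completion.strip() else 0.0 for completion in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=["path/to/reward-model", format_reward],  # model + custom function
    train_dataset=load_dataset("trl-lib/DeepMath-103K", split="train"),
)
trainer.train()
```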

Logged metrics

Metric        Description
accuracy      Proportion of examples where the chosen reward exceeds the rejected reward
loss          Average Bradley-Terry loss
margin        Average reward margin (chosen minus rejected)
mean_reward   Average reward score across both chosen and rejected responses
min_reward    Minimum reward score (averaged over the logging interval)
max_reward    Maximum reward score (averaged over the logging interval)
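To make the definitions precise, the metrics can be reproduced from raw reward scores (a minimal sketch; TRL computes these internally during training):

```python
def preference_metrics(chosen_rewards, rejected_rewards):
    """Compute reward-model metrics from paired chosen/rejected scores."""
    pairs = list(zip(chosen_rewards, rejected_rewards))
    all_rewards = list(chosen_rewards) + list(rejected_rewards)
    return {
        "accuracy": sum(c > r for c, r in pairs) / len(pairs),
        "margin": sum(c - r for c, r in pairs) / len(pairs),
        "mean_reward": sum(all_rewards) / len(all_rewards),
        "min_reward": min(all_rewards),
        "max_reward": max(all_rewards),
    }

# Three pairs; the second is mis-ranked, so accuracy is 2/3
metrics = preference_metrics([1.2, 0.4, 2.0], [0.3, 0.9, -1.0])
```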
