Overview
Reward models (RMs) are trained to assign scalar scores to model outputs, reflecting how well a response aligns with human preferences. A well-trained reward model can then be used to guide online RL methods (such as PPO, GRPO, or RLOO) or to evaluate model outputs at inference time.

RewardTrainer trains an AutoModelForSequenceClassification model (with num_labels=1) on a preference dataset using a Bradley-Terry pairwise ranking objective. The model learns to assign higher scores to preferred responses than to rejected ones.
Quick start
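A minimal training run might look like the following sketch. The model and dataset names are illustrative placeholders, not requirements; any causal LM checkpoint and any preference dataset with `chosen`/`rejected` fields should work.

```python
# Minimal sketch of a RewardTrainer run (illustrative model/dataset names;
# requires trl and datasets, plus a GPU for practical training).
from datasets import load_dataset
from trl import RewardConfig, RewardTrainer

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = RewardTrainer(
    # A string is loaded as AutoModelForSequenceClassification with num_labels=1
    model="Qwen/Qwen2.5-0.5B-Instruct",
    args=RewardConfig(output_dir="Qwen2.5-0.5B-Reward"),
    train_dataset=dataset,
)
trainer.train()
```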
Dataset format
RewardTrainer requires a preference dataset with chosen and rejected fields. An optional prompt field is supported. Both standard and conversational formats are accepted.
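The two accepted formats can be sketched as follows; the example rows are hypothetical.

```python
# Standard format: plain strings (hypothetical example row).
example = {
    "chosen": "Paris is the capital of France.",
    "rejected": "The capital of France is Berlin.",
}

# Conversational format: lists of chat messages; "prompt" is optional.
conversational_example = {
    "prompt": [{"role": "user", "content": "What is the capital of France?"}],
    "chosen": [{"role": "assistant", "content": "Paris."}],
    "rejected": [{"role": "assistant", "content": "Berlin."}],
}
```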
How reward modeling works
Under the Bradley-Terry model, the probability that response $y^+$ is preferred over $y^-$ given a prompt $x$ is:

$$
P(y^+ \succ y^- \mid x) = \sigma\bigl(r(x, y^+) - r(x, y^-)\bigr) = \frac{e^{r(x, y^+)}}{e^{r(x, y^+)} + e^{r(x, y^-)}}
$$

where $r(x, y)$ is the scalar reward the model assigns to response $y$ and $\sigma$ is the sigmoid function. The training loss is the negative log-likelihood of the observed preference:

$$
\mathcal{L} = -\log \sigma\bigl(r(x, y^+) - r(x, y^-)\bigr)
$$
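The loss above depends only on the margin between the two rewards, which can be verified with a small numeric sketch (standalone, not part of the library API):

```python
import math

def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of preferring the chosen response:
    -log(sigmoid(margin)), written in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))

# A larger margin between chosen and rejected rewards gives a smaller loss;
# a zero margin gives log(2), i.e. a 50/50 preference.
print(bradley_terry_loss(2.0, 0.0))  # ≈ 0.1269
print(bradley_terry_loss(0.0, 0.0))  # ≈ 0.6931 (= log 2)
```

Note that adding a constant to both rewards leaves the loss unchanged, which is the underdetermination that the reward-centering option below addresses.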
Key configuration parameters
Training
- `center_rewards_coefficient` — Coefficient for an auxiliary loss term that encourages the model to output mean-zero rewards, addressing the underdetermination of the Bradley-Terry model (rewards are only identified up to an additive constant). Recommended value: `0.01`.
- `activation_offloading` — Offload activations to CPU to reduce GPU memory usage.
- `disable_dropout` — Disable dropout in the model during training. Recommended, as it improves the consistency of reward estimates.
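Put together, a configuration using these training options might look like the following sketch (values are illustrative, not recommendations beyond those stated above):

```python
# Sketch of a RewardConfig using the training options discussed above;
# values other than center_rewards_coefficient=0.01 are illustrative.
from trl import RewardConfig

args = RewardConfig(
    output_dir="my-reward-model",
    center_rewards_coefficient=0.01,  # auxiliary mean-zero reward loss
    disable_dropout=True,             # more consistent reward estimates
    max_length=1024,                  # filter out over-long pairs
)
```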
Data preprocessing
- `max_length` — Maximum tokenized sequence length. Samples where either `chosen` or `rejected` exceeds this length are filtered out.
- `eos_token` — Token used to indicate end of sequence. Defaults to the tokenizer's `eos_token`.
- `pad_token` — Token used for padding. Defaults to `processing_class.pad_token`, falling back to `eos_token`.
Model initialization
- `model_init_kwargs` — Keyword arguments forwarded to `AutoModelForSequenceClassification.from_pretrained` when the `model` argument is a string. Note: `num_labels` is always set to `1` automatically and cannot be overridden here.
- `chat_template_path` — Path to a tokenizer or Jinja template file to set as the model's chat template.
RewardConfig overrides some TrainingArguments defaults: `logging_steps=10`, `gradient_checkpointing=True`, `bf16=True`, and `learning_rate=1e-4`.

Training with PEFT/LoRA
When fine-tuning a base causal LM as a reward model using LoRA, include the classification head (score) in modules_to_save:
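A sketch of such a setup, assuming `peft` is installed; the model name and LoRA hyperparameters are illustrative:

```python
# Sketch: LoRA reward modeling. The classification head ("score") is new
# and untrained, so it must be trained fully via modules_to_save.
from peft import LoraConfig
from trl import RewardConfig, RewardTrainer

peft_config = LoraConfig(
    task_type="SEQ_CLS",
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    modules_to_save=["score"],  # train the classification head fully
)

trainer = RewardTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    args=RewardConfig(output_dir="reward-lora"),
    train_dataset=dataset,  # a preference dataset as described above
    peft_config=peft_config,
)
```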
Using the reward model in an RLHF pipeline
After training, use the reward model as the `reward_funcs` argument in online RL trainers:
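For example, with GRPOTrainer (a sketch; the saved path `"my-reward-model"` and the prompt dataset are assumptions):

```python
# Sketch: plugging the trained reward model into an online RL trainer.
from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs="my-reward-model",  # path or Hub ID of the trained reward model
    args=GRPOConfig(output_dir="grpo-model"),
    train_dataset=prompt_dataset,  # a prompt-only dataset
)
trainer.train()
```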
Logged metrics
| Metric | Description |
|---|---|
| `accuracy` | Proportion of examples where the chosen reward exceeds the rejected reward |
| `loss` | Average Bradley-Terry loss |
| `margin` | Average reward margin (chosen minus rejected) |
| `mean_reward` | Average reward score across both chosen and rejected responses |
| `min_reward` | Minimum reward score (averaged over the logging interval) |
| `max_reward` | Maximum reward score (averaged over the logging interval) |