Overview
Reward functions are at the core of reinforcement learning in rLLM: they evaluate agent actions and return rewards that guide training. This guide shows how to implement custom reward functions that follow the RewardFunction protocol.
RewardFunction Protocol
All reward functions must implement this signature:
```python
from rllm.rewards.reward_fn import RewardFunction
from rllm.rewards.reward_types import RewardOutput

def my_reward_fn(task_info: dict, action: str) -> RewardOutput:
    """Calculate reward for an agent's action.

    Args:
        task_info: Task dictionary with question, ground_truth, metadata
        action: Agent's response/solution

    Returns:
        RewardOutput: Reward value and metadata
    """
    # Implement evaluation logic
    is_correct = evaluate(task_info, action)
    reward = 1.0 if is_correct else 0.0
    return RewardOutput(
        reward=reward,
        is_correct=is_correct,
        metadata={},
    )
```
See rllm/rewards/reward_fn.py:14.
RewardOutput Structure
The return value must be a RewardOutput dataclass:
```python
from dataclasses import dataclass, field

@dataclass
class RewardOutput:
    reward: float                                  # The computed reward value
    metadata: dict = field(default_factory=dict)   # Additional info
    is_correct: bool | None = None                 # Whether the answer is correct
```
From rllm/rewards/reward_types.py:77.
Built-in Reward Functions
rLLM provides several pre-built reward functions:
Math Reward Function
Evaluates mathematical answers using symbolic comparison:
```python
from rllm.rewards.reward_fn import math_reward_fn

task = {
    "question": "What is 2 + 2?",
    "ground_truth": "4",
    "data_source": "gsm8k",
}
action = "The answer is \\boxed{4}."

result = math_reward_fn(task, action)
print(result.reward)      # 1.0
print(result.is_correct)  # True
```
From rllm/rewards/reward_fn.py:47.
Code Reward Function
Executes code and validates against test cases:
```python
from rllm.rewards.reward_fn import code_reward_fn

task = {
    "question": "Write a function to add two numbers",
    "test_cases": [...],
    "data_source": "humaneval",
}
action = """def add(a, b):
    return a + b"""

result = code_reward_fn(task, action)
```
From rllm/rewards/reward_fn.py:87.
F1 Reward Function
Computes F1 score for text-based answers:
```python
from rllm.rewards.reward_fn import f1_reward_fn

task = {"ground_truth": "The capital of France is Paris"}
action = "Paris is the capital of France"

result = f1_reward_fn(task, action)
print(result.reward)  # F1 score between 0.0 and 1.0
```
From rllm/rewards/reward_fn.py:105.
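The exact scoring lives in rLLM's source; for intuition, a typical token-overlap F1 can be sketched as follows. This is an assumption about the general method, not rLLM's exact code:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Count tokens that appear in both strings (with multiplicity)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Paris is the capital of France",
               "The capital of France is Paris"))  # 1.0
```

Because the metric is order-insensitive, the reordered sentence above still scores 1.0, while a partial answer earns proportional credit.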
Creating Custom Reward Functions
Define the function signature
Follow the RewardFunction protocol:

```python
from rllm.rewards.reward_types import RewardOutput
from rllm.agents.agent import Action

def custom_reward_fn(task_info: dict, action: str) -> RewardOutput:
    # Handle Action objects
    if isinstance(action, Action):
        action = action.action
    # Your logic here
    pass
```
Extract required fields from task_info
Access task data safely:

```python
def custom_reward_fn(task_info: dict, action: str) -> RewardOutput:
    # Extract fields with defaults
    question = task_info.get("question", "")
    ground_truth = task_info.get("ground_truth", None)
    data_source = task_info.get("data_source", "unknown")

    # Validate required fields
    if ground_truth is None:
        return RewardOutput(
            reward=0.0,
            is_correct=False,
            metadata={"error": "No ground truth provided"},
        )
```
Parse and validate the action
Extract the answer from the model’s response:

```python
import re

def extract_answer(response: str) -> str | None:
    # Extract from \boxed{...}
    match = re.search(r"\\boxed\{([^}]+)\}", response)
    if match:
        return match.group(1).strip()
    # Extract from <answer>...</answer>
    match = re.search(r"<answer>(.*?)</answer>", response, re.IGNORECASE | re.DOTALL)
    if match:
        return match.group(1).strip()
    return None
```
Implement evaluation logic
Compare the extracted answer to the ground truth:

```python
def custom_reward_fn(task_info: dict, action: str) -> RewardOutput:
    if isinstance(action, Action):
        action = action.action

    ground_truth = task_info.get("ground_truth")
    if ground_truth is None:
        return RewardOutput(reward=0.0, is_correct=False)

    # Parse the action
    predicted_answer = extract_answer(action)
    if predicted_answer is None:
        return RewardOutput(reward=0.0, is_correct=False)

    # Normalize and compare (normalize is a helper you define for your task)
    is_correct = normalize(predicted_answer) == normalize(ground_truth)
    return RewardOutput(
        reward=1.0 if is_correct else 0.0,
        is_correct=is_correct,
        metadata={"predicted": predicted_answer},
    )
```
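The snippet above calls a `normalize` helper that this guide does not define; what it should do depends on your task. A minimal sketch for short text answers might be:

```python
import re

def normalize(text: str) -> str:
    """Lowercase, trim, collapse internal whitespace, and drop a trailing period."""
    text = text.lower().strip()
    text = re.sub(r"\s+", " ", text)
    return text.rstrip(".")

print(normalize("  The Answer  is   4. "))  # "the answer is 4"
```

For math answers you would typically go further (e.g. symbolic comparison, as the built-in math reward does).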
Return RewardOutput
Always return a RewardOutput object:

```python
return RewardOutput(
    reward=1.0,        # Float reward value
    is_correct=True,   # Boolean correctness flag
    metadata={         # Additional info for debugging/analysis
        "predicted": "...",
        "explanation": "...",
    },
)
```
Complete Example: Math Reward Function
Here’s the implementation of the built-in math reward function:
```python
from rllm.rewards.reward_types import RewardConfig, RewardOutput
from rllm.rewards.math_utils.utils import extract_answer, grade_answer_mathd, grade_answer_sympy
from rllm.globals import THOUGHT_DELIMITER_END

class RewardMathFn:
    def __init__(self, config: RewardConfig):
        self.config = config

    def __call__(self, task_info: dict, action: str) -> RewardOutput:
        model_response = action

        # Handle empty response
        if model_response is None or model_response == "":
            return RewardOutput(
                reward=self.config.format_error_reward,
                is_correct=False,
            )

        # Extract solution (remove thinking portion)
        if THOUGHT_DELIMITER_END in model_response:
            model_solution = model_response.split(THOUGHT_DELIMITER_END)[1]
        else:
            if self.config.apply_format_reward:
                return RewardOutput(
                    reward=self.config.format_error_reward,
                    is_correct=False,
                )
            model_solution = model_response

        # Extract answer from \boxed{...}
        model_answer = extract_answer(model_solution)
        if model_answer is None:
            return RewardOutput(
                reward=self.config.format_error_reward,
                is_correct=False,
            )

        # Get ground truth(s)
        ground_truths = task_info.get("ground_truth", None)
        if ground_truths is None:
            return RewardOutput(
                reward=self.config.unk_error_reward,
                is_correct=False,
            )

        # Convert a single answer to a list
        if isinstance(ground_truths, str | float | int):
            ground_truths = [ground_truths]

        # Process each ground truth
        processed_ground_truths = []
        for truth in ground_truths:
            truth = str(truth)
            if "\\boxed" in truth:
                processed_truth = extract_answer(truth)
                if processed_truth:
                    processed_ground_truths.append(processed_truth)
            else:
                processed_ground_truths.append(truth)

        # Check against all possible correct answers
        for ground_truth in processed_ground_truths:
            is_correct = (
                grade_answer_mathd(model_answer, ground_truth) or
                grade_answer_sympy(model_answer, ground_truth)
            )
            if is_correct:
                reward = self.config.correct_reward
                # Add a bonus for tool usage
                if task_info.get("has_toolcall", False):
                    reward += self.config.toolcall_bonus
                return RewardOutput(reward=reward, is_correct=True)

        return RewardOutput(
            reward=self.config.incorrect_reward,
            is_correct=False,
        )

# Convenience function
def math_reward_fn(task_info: dict, action: str) -> RewardOutput:
    reward_config = RewardConfig()
    reward_fn = RewardMathFn(reward_config)
    return reward_fn(task_info, action)
```
From rllm/rewards/math_reward.py:18.
Configuring Reward Behavior
Use RewardConfig to customize reward values:
```python
from rllm.rewards.reward_types import RewardConfig

config = RewardConfig(
    correct_reward=1.0,         # Reward for correct answers
    incorrect_reward=0.0,       # Reward for incorrect answers
    format_error_reward=0.0,    # Reward for format errors
    unk_error_reward=0.0,       # Reward for unknown errors
    toolcall_bonus=0.5,         # Bonus for using tools
    apply_format_reward=False,  # Whether to penalize format errors
)

reward_fn = RewardMathFn(config)
```
From rllm/rewards/reward_types.py:11.
Using Reward Functions
With SingleTurnEnvironment
```python
from rllm.environments.base.single_turn_env import SingleTurnEnvironment
from rllm.trainer.agent_trainer import AgentTrainer

env_args = {"reward_fn": custom_reward_fn}

trainer = AgentTrainer(
    agent_class=MyAgent,
    env_class=SingleTurnEnvironment,
    env_args=env_args,
    config=config,
    train_dataset=train_dataset,
)
```
With Workflows
```python
from rllm.workflows.workflow import Workflow

class MyWorkflow(Workflow):
    def __init__(self, rollout_engine, reward_function=None, **kwargs):
        super().__init__(rollout_engine, **kwargs)
        self.reward_fn = reward_function or custom_reward_fn

    async def run(self, task: dict, uid: str, **kwargs):
        # Generate a response (build `messages` from the task as appropriate)
        output = await self.rollout_engine.get_model_response(messages)

        # Compute reward
        reward_result = self.reward_fn(task, output.content)

        # Use the reward in the trajectory
        # ...
```
Advanced Patterns
LLM-as-Judge Rewards
Use another LLM to evaluate responses:
```python
async def llm_judge_reward_fn(task_info: dict, action: str) -> RewardOutput:
    judge_prompt = f"""
    Question: {task_info['question']}
    Ground Truth: {task_info['ground_truth']}
    Student Answer: {action}

    Is the student answer correct? Reply with YES or NO.
    """

    # Call the judge model
    judge_response = await judge_llm.generate(judge_prompt)
    is_correct = "YES" in judge_response.upper()

    return RewardOutput(
        reward=1.0 if is_correct else 0.0,
        is_correct=is_correct,
        metadata={"judge_response": judge_response},
    )
```
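One caveat with the substring check above: `"YES" in judge_response.upper()` can misfire when a verbose judge mentions "yes" while rejecting the answer. A slightly more defensive sketch (a hypothetical helper, not part of rLLM) only accepts a verdict that starts with YES:

```python
import re

def parse_judge_verdict(judge_response: str) -> bool:
    """Treat the verdict as positive only when the reply begins with YES."""
    return re.match(r"\s*YES\b", judge_response, re.IGNORECASE) is not None

print(parse_judge_verdict("YES, the answer matches."))        # True
print(parse_judge_verdict("No. 'yes' would be wrong here."))  # False
```

Constraining the judge's output format in the prompt (as the example does) makes this kind of parsing far more reliable.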
Multi-Objective Rewards
Combine multiple reward signals:
```python
def multi_objective_reward_fn(task_info: dict, action: str) -> RewardOutput:
    # Correctness reward
    correctness = evaluate_correctness(task_info, action)

    # Length penalty (encourage concise answers; guard against empty responses)
    length_penalty = min(1.0, 100 / max(len(action), 1))

    # Clarity reward (based on readability)
    clarity = evaluate_clarity(action)

    # Combine the signals with fixed weights
    reward = 0.7 * correctness + 0.2 * length_penalty + 0.1 * clarity

    return RewardOutput(
        reward=reward,
        is_correct=correctness > 0.5,
        metadata={
            "correctness": correctness,
            "length_penalty": length_penalty,
            "clarity": clarity,
        },
    )
```
Shaped Rewards
Provide intermediate rewards for multi-step reasoning:
```python
def shaped_math_reward_fn(task_info: dict, action: str) -> RewardOutput:
    # Base correctness
    is_correct = check_final_answer(task_info, action)
    base_reward = 1.0 if is_correct else 0.0

    # Intermediate step bonuses
    has_clear_steps = check_step_structure(action)
    uses_correct_formula = check_formula(task_info, action)
    shows_work = check_working(action)

    # Shape the reward
    shaped_reward = base_reward
    if not is_correct:
        # Give partial credit for good reasoning
        shaped_reward += 0.2 * has_clear_steps
        shaped_reward += 0.2 * uses_correct_formula
        shaped_reward += 0.1 * shows_work

    return RewardOutput(
        reward=shaped_reward,
        is_correct=is_correct,
        metadata={"partial_credit": shaped_reward - base_reward},
    )
```
Be careful with reward shaping! Overly complex reward functions can lead to unexpected behaviors. Start simple and add complexity only when needed.
Best Practices
- Handle edge cases: Empty responses, malformed answers, missing ground truth
- Normalize answers: Case, whitespace, punctuation before comparison
- Support multiple formats: Allow flexibility in how answers are expressed
- Return meaningful metadata: Help debug and analyze model behavior
- Make rewards deterministic: Avoid randomness unless necessary
- Test thoroughly: Validate on known correct/incorrect examples
The is_correct field in RewardOutput is used for computing accuracy metrics during training. Make sure it accurately reflects task success.
Next Steps