Overview

Rubrics manage the scoring logic for rollouts, combining multiple reward functions into a final reward signal. Each rubric holds reward functions, computes weighted combinations, and tracks metrics for observability.
import verifiers as vf

async def correct_answer(completion, answer) -> float:
    response = completion[-1]["content"]
    return 1.0 if answer in response else 0.0

rubric = vf.Rubric(funcs=[correct_answer])

Basic Reward Functions

Reward functions evaluate rollouts and return floats (typically 0.0 to 1.0). They request data by naming arguments:
async def exact_match(completion, answer) -> float:
    """Check if answer appears in completion."""
    response = completion[-1]["content"]
    return 1.0 if answer in response else 0.0

async def length_penalty(completion) -> float:
    """Penalize overly long responses."""
    response = completion[-1]["content"]
    return 1.0 if len(response) < 500 else 0.5

rubric = vf.Rubric(
    funcs=[exact_match, length_penalty],
    weights=[1.0, 0.1]  # exact_match weighted 10x more
)

Available Arguments

Reward functions can request these standard arguments:
Argument     Type          Description
completion   Messages      Model's output messages
prompt       Messages      Input messages
answer       str           Ground truth from dataset
info         Info (dict)   Metadata from dataset
state        State         Full rollout state
task         str           Task identifier
Type signatures:
from verifiers.types import Messages, Info, State

Messages = list[Message]  # list of chat messages
Info = dict[str, Any]
State = dict  # with additional input forwarding

Argument Injection Pattern

The rubric uses introspection to inject only requested arguments:
# Only receives what it asks for
async def simple_reward(completion, answer) -> float:
    return 1.0 if answer in completion[-1]["content"] else 0.0

# Can request all available data
async def complex_reward(prompt, completion, answer, info, state) -> float:
    difficulty = info.get("difficulty", 1)
    tokens_used = state.get("usage", {}).get("total_tokens", 0)
    correct = answer in completion[-1]["content"]
    return float(correct) * (1.0 / difficulty) * (1.0 if tokens_used < 1000 else 0.5)

# Use **kwargs to accept everything
async def flexible_reward(**kwargs) -> float:
    completion = kwargs["completion"]
    answer = kwargs.get("answer", "")
    return 1.0 if answer in completion[-1]["content"] else 0.0
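A minimal sketch of how name-based injection can work with inspect.signature (illustrative only; the names call_with_requested_args and the sample data are assumptions, not the library's internals):

```python
import asyncio
import inspect

async def call_with_requested_args(func, available: dict) -> float:
    """Pass only the arguments a reward function names in its signature."""
    params = inspect.signature(func).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return await func(**available)  # **kwargs functions receive everything
    # otherwise, filter to exactly the names the function requests
    kwargs = {name: available[name] for name in params if name in available}
    return await func(**kwargs)

async def simple_reward(completion, answer) -> float:
    return 1.0 if answer in completion[-1]["content"] else 0.0

available = {
    "completion": [{"role": "assistant", "content": "The answer is 4."}],
    "answer": "4",
    "prompt": [],
    "state": {},
}
print(asyncio.run(call_with_requested_args(simple_reward, available)))  # 1.0
```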

Multiple Reward Functions

Combine reward functions with custom weights:
async def correctness(completion, answer) -> float:
    return 1.0 if answer in completion[-1]["content"] else 0.0

async def formatting(completion, parser) -> float:
    try:
        parser.parse(completion[-1]["content"])
        return 1.0
    except Exception:
        return 0.0

async def conciseness(completion) -> float:
    length = len(completion[-1]["content"])
    return 1.0 if length < 200 else 0.5

rubric = vf.Rubric(
    funcs=[correctness, formatting, conciseness],
    weights=[1.0, 0.5, 0.1]
)
Final reward computation:
reward = (correctness * 1.0) + (formatting * 0.5) + (conciseness * 0.1)
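As a quick sanity check of the weighted sum, with hypothetical scores (correctness 1.0, formatting 0.0, conciseness 0.5):

```python
scores = {"correctness": 1.0, "formatting": 0.0, "conciseness": 0.5}
weights = {"correctness": 1.0, "formatting": 0.5, "conciseness": 0.1}

# weighted sum across all reward functions
reward = sum(scores[name] * weights[name] for name in scores)
print(reward)  # 1.05
```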

Adding Functions Dynamically

rubric = vf.Rubric()
rubric.add_reward_func(correctness, weight=1.0)
rubric.add_reward_func(formatting, weight=0.5)

Execution Order and State Sharing

Reward functions execute sequentially in the order they’re added. Since state is mutable, earlier functions can store computed values for later functions:
async def compute_similarity(completion, answer, state) -> float:
    """Compute and cache similarity score."""
    response = completion[-1]["content"]
    score = compute_embedding_similarity(response, answer)  # expensive
    state["similarity"] = score  # cache for other functions
    return score

async def similarity_threshold(state) -> float:
    """Use cached similarity without recomputing."""
    return 1.0 if state["similarity"] > 0.8 else 0.0

rubric = vf.Rubric(
    funcs=[compute_similarity, similarity_threshold],
    weights=[0.0, 1.0]  # log similarity (weight=0), reward threshold (weight=1)
)
Execution flow:
  1. compute_similarity runs, stores state["similarity"]
  2. similarity_threshold runs, reads cached value
  3. Final reward = 0.0 * similarity + 1.0 * threshold

Group-Based Reward Functions

During evaluation and RL training, rollouts are organized into groups by example_id. Group reward functions operate on all rollouts for an example together:
async def diversity_bonus(completions) -> list[float]:
    """Reward unique responses within a group."""
    responses = [c[-1]["content"] for c in completions]
    unique = set(responses)
    return [0.2 if responses.count(r) == 1 else 0.0 for r in responses]

async def individual_correctness(completion, answer) -> float:
    return 1.0 if answer in completion[-1]["content"] else 0.0

rubric = vf.Rubric(
    funcs=[individual_correctness, diversity_bonus],
    weights=[1.0, 0.5]
)

Detection

Group functions are detected by:
  1. Plural argument names: completions, prompts, answers, states, tasks, infos
  2. Return type: list[float] instead of float
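A sketch of how plural-name detection could be implemented (illustrative; is_group_reward_func is a hypothetical helper, not the library's actual code):

```python
import inspect

# the plural argument names that mark a group-level reward function
GROUP_ARGS = {"completions", "prompts", "answers", "states", "tasks", "infos"}

def is_group_reward_func(func) -> bool:
    """Treat a function as group-level if it names any plural argument."""
    params = set(inspect.signature(func).parameters)
    return bool(params & GROUP_ARGS)

async def diversity_bonus(completions) -> list[float]: ...
async def exact_match(completion, answer) -> float: ...

print(is_group_reward_func(diversity_bonus))  # True
print(is_group_reward_func(exact_match))      # False
```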

Available Group Arguments

Argument     Type            Description
completions  list[Messages]  All completions in group
prompts      list[Messages]  All prompts in group
answers      list[str]       All answers in group
states       list[State]     All states in group
tasks        list[str]       All task IDs in group
infos        list[Info]      All info dicts in group

Example: Relative Ranking

async def rank_by_length(completions) -> list[float]:
    """Reward shorter completions more within a group."""
    lengths = [len(c[-1]["content"]) for c in completions]
    max_len = max(lengths) if lengths else 1
    return [1.0 - (length / max_len) for length in lengths]
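Running the function above on a hypothetical two-rollout group shows that shorter responses earn more:

```python
import asyncio

async def rank_by_length(completions) -> list[float]:
    """Reward shorter completions more within a group."""
    lengths = [len(c[-1]["content"]) for c in completions]
    max_len = max(lengths) if lengths else 1
    return [1.0 - (length / max_len) for length in lengths]

group = [
    [{"role": "assistant", "content": "4"}],                 # length 1
    [{"role": "assistant", "content": "The answer is 4."}],  # length 16
]
scores = asyncio.run(rank_by_length(group))
print(scores)  # [0.9375, 0.0]
```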

Shared Objects

Rubrics can provide shared objects accessible to all reward functions via class_objects:
async def my_reward_func(completion, my_helper) -> float:
    # my_helper is injected by name
    return await my_helper.score(completion)

rubric = vf.Rubric(funcs=[my_reward_func])
rubric.add_class_object("my_helper", some_helper_object)

Parsers

Parsers extract structured content from model responses:
parser = vf.XMLParser(fields=["reasoning", "answer"])

async def my_reward_func(completion, answer, parser) -> float:
    parsed = parser.parse(completion[-1]["content"])
    return 1.0 if parsed.answer == answer else 0.0

rubric = vf.Rubric(funcs=[my_reward_func], parser=parser)
Built-in parsers:
  • vf.Parser() - Pass-through (no parsing)
  • vf.XMLParser(fields=[...]) - Extract XML tags
  • vf.ThinkParser() - Extract content after </think>
  • vf.MaybeThinkParser() - Handle optional <think> tags

Judges (LLM-as-Judge)

JudgeRubric integrates LLM-based evaluation:
judge_rubric = vf.JudgeRubric(
    judge_model="gpt-4.1-mini",
    judge_prompt="""Is this answer correct?
Question: {question}
Answer: {response}
Ground Truth: {answer}

Respond with YES or NO."""
)

async def judge_correctness(prompt, completion, answer, judge) -> float:
    verdict = await judge(prompt, completion, answer)
    return 1.0 if "yes" in verdict.lower() else 0.0

judge_rubric.add_reward_func(judge_correctness)
Built-in judge callable:
# Signature
async def judge(
    prompt: Messages,
    completion: Messages,
    answer: str
) -> str:
    ...
Exposed objects:
  • judge - Callable that formats prompt and calls judge model
  • judge_client - Raw AsyncOpenAI client
  • judge_model - Model name string
  • judge_prompt - Template string
  • judge_sampling_args - Sampling parameters dict

Custom Shared Objects

Add domain-specific helpers:
class MathVerifier:
    async def verify(self, expression: str, expected: str) -> bool:
        # Symbolic math verification
        ...

verifier = MathVerifier()

async def verify_answer(completion, answer, verifier) -> float:
    response = completion[-1]["content"]
    is_correct = await verifier.verify(response, answer)
    return 1.0 if is_correct else 0.0

rubric = vf.Rubric(funcs=[verify_answer])
rubric.add_class_object("verifier", verifier)

Rubric Groups

Combine multiple rubrics for heterogeneous scoring:
math_rubric = vf.MathRubric()  # symbolic math verification
judge_rubric = vf.JudgeRubric(judge_model="gpt-4.1-mini")
judge_rubric.add_reward_func(judge_correctness, weight=0.5)

combined = vf.RubricGroup(rubrics=[math_rubric, judge_rubric])
Behavior:
  • All rubrics execute in parallel
  • Final reward = sum of all rubric rewards
  • Metrics from all rubrics are collected together
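To illustrate the summing behavior with hypothetical per-rubric rewards:

```python
# hypothetical rewards produced by each rubric for one rollout
rubric_rewards = {"math_rubric": 1.0, "judge_rubric": 0.5}

# RubricGroup sums rubric rewards into the final reward
final_reward = sum(rubric_rewards.values())
print(final_reward)  # 1.5
```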
Use cases:
  • Combining deterministic and LLM-based evaluation
  • Multi-faceted scoring (correctness + style + efficiency)
  • Environment-specific monitors + task-specific rewards

Metrics and Monitor Rubrics

Adding Metrics

Metrics are reward functions with weight=0.0 (tracked but don’t affect reward):
async def response_length(completion) -> float:
    return float(len(completion[-1]["content"]))

async def token_count(state) -> float:
    return float(state.get("usage", {}).get("total_tokens", 0))

rubric = vf.Rubric(funcs=[correctness])
rubric.add_metric(response_length)  # shorthand for weight=0
rubric.add_metric(token_count)
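Because metric weights are 0.0, metrics are tracked alongside rewards but never change the weighted sum. A quick check with hypothetical values:

```python
scores = {"correctness": 1.0, "response_length": 342.0, "token_count": 120.0}
weights = {"correctness": 1.0, "response_length": 0.0, "token_count": 0.0}

# zero-weighted entries contribute nothing to the reward
reward = sum(scores[name] * weights[name] for name in scores)
print(reward)  # 1.0
```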

Monitor Rubrics

Environments automatically include monitor rubrics for tracking metrics:
Environment   Tracked Metrics
MultiTurnEnv  num_turns
ToolEnv       total_tool_calls, per-tool counts
SandboxEnv    sandbox_ready_wait_time, sandbox_command_execution_time
PythonEnv     python_ready_wait_time
Example monitor rubric:
class MultiTurnMonitorRubric(vf.Rubric):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.add_metric(self.num_turns)
    
    async def num_turns(self, state: vf.State) -> float:
        return float(len(state["trajectory"]))

Custom Monitor Rubrics

Add environment-specific metrics:
class MyMonitorRubric(vf.Rubric):
    def __init__(self):
        super().__init__()
        self.add_metric(self.trajectory_length)
        self.add_metric(self.error_count)
    
    async def trajectory_length(self, state: vf.State) -> float:
        return float(len(state["trajectory"]))
    
    async def error_count(self, state: vf.State) -> float:
        errors = sum(1 for step in state["trajectory"] if step.get("error"))
        return float(errors)

env = vf.ToolEnv(dataset=dataset, tools=tools, rubric=rubric)
env.add_rubric(MyMonitorRubric())

Built-in Rubrics

MathRubric

Symbolic math verification using math-verify:
math_rubric = vf.MathRubric()

# Automatically includes:
# - Parser for \boxed{} answers
# - Symbolic equivalence checking
# - Normalization of mathematical expressions
Usage:
env = vf.SingleTurnEnv(
    dataset=math_dataset,
    rubric=math_rubric
)

JudgeRubric

LLM-as-judge evaluation:
judge_rubric = vf.JudgeRubric(
    judge_model="gpt-4.1-mini",
    judge_prompt="""Evaluate if the response correctly answers the question.

Question: {question}
Response: {response}
Ground Truth: {answer}

Answer YES or NO.""",
    judge_sampling_args={"temperature": 0.0}
)

async def judge_score(prompt, completion, answer, judge) -> float:
    verdict = await judge(prompt, completion, answer)
    return 1.0 if "yes" in verdict.lower() else 0.0

judge_rubric.add_reward_func(judge_score)

Scoring Lifecycle

Individual Scoring

For rollouts scored independently:
# Called automatically after each rollout
state = await env.rollout(input, client, model, sampling_args)
await rubric.score_rollout(state)

# Sets state["reward"], state["metrics"]

Group Scoring

For rollouts scored together (default for evaluate() and training):
# Generate group of rollouts
states = await asyncio.gather(*[
    env.rollout(input, client, model, sampling_args)
    for input in group_inputs
])

# Score group together
await rubric.score_group(states)

# Sets state["reward"], state["advantage"], state["metrics"] for all states
Advantage computation:
avg_reward = sum(state["reward"] for state in states) / len(states)
for state in states:
    state["advantage"] = state["reward"] - avg_reward
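With hypothetical group rewards [1.0, 0.0, 0.5], each advantage is the deviation from the group mean, so advantages always sum to zero:

```python
rewards = [1.0, 0.0, 0.5]
avg_reward = sum(rewards) / len(rewards)  # 0.5
advantages = [r - avg_reward for r in rewards]
print(advantages)       # [0.5, -0.5, 0.0]
print(sum(advantages))  # 0.0
```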

Disabling Scoring

For pure generation without scoring:
env = vf.SingleTurnEnv(dataset=dataset, rubric=rubric, score_rollouts=False)

# Or dynamically:
env.set_score_rollouts(False)

RolloutScore Type

Rubrics produce RolloutScore objects:
from verifiers.types import RolloutScore

class RolloutScore(TypedDict):
    reward: float
    metrics: dict[str, float]

# Example:
result = RolloutScore(
    reward=0.85,
    metrics={
        "correctness": 1.0,
        "formatting": 0.8,
        "conciseness": 0.5,
        "response_length": 342.0
    }
)

Complete Example

import verifiers as vf
from datasets import Dataset

# Dataset
dataset = Dataset.from_list([
    {
        "question": "What is 2+2?",
        "answer": "4",
        "info": {"difficulty": 1, "category": "arithmetic"}
    },
    {
        "question": "What is the derivative of x^2?",
        "answer": "2x",
        "info": {"difficulty": 3, "category": "calculus"}
    },
])

# Reward functions
async def correctness(completion, answer, parser) -> float:
    parsed = parser.parse_answer(completion)
    return 1.0 if parsed == answer else 0.0

async def difficulty_bonus(state, info) -> float:
    """Bonus for harder problems."""
    if state["reward"] == 1.0:  # only if correct
        return info.get("difficulty", 1) * 0.1
    return 0.0

async def response_length(completion) -> float:
    return float(len(completion[-1]["content"]))

# Parser
parser = vf.XMLParser(fields=["reasoning", "answer"])

# Rubric
rubric = vf.Rubric(
    funcs=[correctness, difficulty_bonus],
    weights=[1.0, 1.0],
    parser=parser
)
rubric.add_metric(response_length)

# Environment
env = vf.SingleTurnEnv(
    dataset=dataset,
    parser=parser,
    rubric=rubric,
    system_prompt="Answer step by step in XML format."
)
When combining multiple reward functions, ensure weights are tuned to avoid any single function dominating the reward signal. Common practice is to normalize weights or use coefficients < 1.0 for auxiliary rewards.
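One simple option, sketched here with the weights from the examples above, is to rescale the weight vector to sum to 1 before passing it to the rubric:

```python
weights = [1.0, 0.5, 0.1]
total = sum(weights)

# rescale so the final reward stays in the same range as each component
normalized = [w / total for w in weights]
print(normalized)  # [0.625, 0.3125, 0.0625]
```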