Overview
Rubrics manage the scoring logic for rollouts, combining multiple reward functions into a final reward signal. Each rubric holds reward functions, computes weighted combinations, and tracks metrics for observability.
```python
import verifiers as vf

async def correct_answer(completion, answer) -> float:
    response = completion[-1]["content"]
    return 1.0 if answer in response else 0.0

rubric = vf.Rubric(funcs=[correct_answer])
```
Basic Reward Functions
Reward functions evaluate rollouts and return floats (typically 0.0 to 1.0). They request data by naming arguments:
```python
async def exact_match(completion, answer) -> float:
    """Check if answer appears in completion."""
    response = completion[-1]["content"]
    return 1.0 if answer in response else 0.0

async def length_penalty(completion) -> float:
    """Penalize overly long responses."""
    response = completion[-1]["content"]
    return 1.0 if len(response) < 500 else 0.5

rubric = vf.Rubric(
    funcs=[exact_match, length_penalty],
    weights=[1.0, 0.1]  # exact_match weighted 10x more
)
```
Available Arguments
Reward functions can request these standard arguments:
| Argument | Type | Description |
|---|---|---|
| `completion` | `Messages` | Model's output messages |
| `prompt` | `Messages` | Input messages |
| `answer` | `str` | Ground truth from dataset |
| `info` | `Info` (dict) | Metadata from dataset |
| `state` | `State` | Full rollout state |
| `task` | `str` | Task identifier |
Type signatures:
```python
from verifiers.types import Messages, Info, State

Messages = list[Message]  # list of chat messages
Info = dict[str, Any]
State = dict  # with additional input forwarding
```
Argument Injection Pattern
The rubric uses introspection to inject only requested arguments:
```python
# Only receives what it asks for
async def simple_reward(completion, answer) -> float:
    return 1.0 if answer in completion[-1]["content"] else 0.0

# Can request all available data
async def complex_reward(prompt, completion, answer, info, state) -> float:
    difficulty = info.get("difficulty", 1)
    tokens_used = state.get("usage", {}).get("total_tokens", 0)
    correct = answer in completion[-1]["content"]
    return float(correct) * (1.0 / difficulty) * (1.0 if tokens_used < 1000 else 0.5)

# Use **kwargs to accept everything
async def flexible_reward(**kwargs) -> float:
    completion = kwargs["completion"]
    answer = kwargs.get("answer", "")
    return 1.0 if answer in completion[-1]["content"] else 0.0
```
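The injection mechanism itself is not shown above; a minimal sketch of how introspection-based injection could work, using `inspect.signature` (the dispatcher `call_with_requested_args` is a hypothetical helper, not the verifiers implementation):

```python
import asyncio
import inspect

async def simple_reward(completion, answer) -> float:
    return 1.0 if answer in completion[-1]["content"] else 0.0

async def call_with_requested_args(func, available):
    # Inspect the function's parameters and pass only the arguments it names
    params = inspect.signature(func).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return await func(**available)  # **kwargs: pass everything
    kwargs = {k: v for k, v in available.items() if k in params}
    return await func(**kwargs)

available = {
    "completion": [{"role": "assistant", "content": "The answer is 4"}],
    "answer": "4",
    "state": {},  # offered but not requested by simple_reward
}
print(asyncio.run(call_with_requested_args(simple_reward, available)))  # 1.0
```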
Multiple Reward Functions
Combine reward functions with custom weights:
```python
async def correctness(completion, answer) -> float:
    return 1.0 if answer in completion[-1]["content"] else 0.0

async def formatting(completion, parser) -> float:
    try:
        parser.parse(completion[-1]["content"])
        return 1.0
    except Exception:
        return 0.0

async def conciseness(completion) -> float:
    length = len(completion[-1]["content"])
    return 1.0 if length < 200 else 0.5

rubric = vf.Rubric(
    funcs=[correctness, formatting, conciseness],
    weights=[1.0, 0.5, 0.1]
)
```
Final reward computation:
```python
reward = (correctness * 1.0) + (formatting * 0.5) + (conciseness * 0.1)
```
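For instance, with hypothetical per-function scores from a single rollout, the weighted sum works out as:

```python
# Hypothetical scores from one rollout
scores = {"correctness": 1.0, "formatting": 0.0, "conciseness": 0.5}
weights = {"correctness": 1.0, "formatting": 0.5, "conciseness": 0.1}

reward = sum(scores[name] * weights[name] for name in scores)
print(reward)  # 1.0*1.0 + 0.0*0.5 + 0.5*0.1 = 1.05
```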
Adding Functions Dynamically
```python
rubric = vf.Rubric()
rubric.add_reward_func(correctness, weight=1.0)
rubric.add_reward_func(formatting, weight=0.5)
```
Execution Order and State Sharing
Reward functions execute sequentially in the order they’re added. Since state is mutable, earlier functions can store computed values for later functions:
```python
async def compute_similarity(completion, answer, state) -> float:
    """Compute and cache similarity score."""
    response = completion[-1]["content"]
    score = compute_embedding_similarity(response, answer)  # expensive
    state["similarity"] = score  # cache for other functions
    return score

async def similarity_threshold(state) -> float:
    """Use cached similarity without recomputing."""
    return 1.0 if state["similarity"] > 0.8 else 0.0

rubric = vf.Rubric(
    funcs=[compute_similarity, similarity_threshold],
    weights=[0.0, 1.0]  # log similarity (weight=0), reward threshold (weight=1)
)
```
Execution flow:
1. `compute_similarity` runs and stores `state["similarity"]`
2. `similarity_threshold` runs and reads the cached value
3. Final reward = `0.0 * similarity + 1.0 * threshold`
Group-Based Reward Functions
During evaluation and RL training, rollouts are organized into groups by example_id. Group reward functions operate on all rollouts for an example together:
```python
async def diversity_bonus(completions) -> list[float]:
    """Reward unique responses within a group."""
    responses = [c[-1]["content"] for c in completions]
    return [0.2 if responses.count(r) == 1 else 0.0 for r in responses]

async def individual_correctness(completion, answer) -> float:
    return 1.0 if answer in completion[-1]["content"] else 0.0

rubric = vf.Rubric(
    funcs=[individual_correctness, diversity_bonus],
    weights=[1.0, 0.5]
)
```
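Applying the same uniqueness logic to a hypothetical group of raw response strings illustrates the per-rollout bonuses:

```python
def diversity_bonus_sync(responses):
    # Same logic as diversity_bonus above, on raw response strings
    return [0.2 if responses.count(r) == 1 else 0.0 for r in responses]

print(diversity_bonus_sync(["4", "4", "four"]))  # [0.0, 0.0, 0.2]
```

Only the unique response earns the bonus; duplicated responses get 0.0.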
Detection
Group functions are detected by:
- Plural argument names: `completions`, `prompts`, `answers`, `states`, `tasks`, `infos`
- Return type: `list[float]` instead of `float`
Available Group Arguments
| Argument | Type | Description |
|---|---|---|
| `completions` | `list[Messages]` | All completions in group |
| `prompts` | `list[Messages]` | All prompts in group |
| `answers` | `list[str]` | All answers in group |
| `states` | `list[State]` | All states in group |
| `tasks` | `list[str]` | All task IDs in group |
| `infos` | `list[Info]` | All info dicts in group |
Example: Relative Ranking
```python
async def rank_by_length(completions) -> list[float]:
    """Reward shorter completions more within a group."""
    lengths = [len(c[-1]["content"]) for c in completions]
    max_len = max(lengths) if lengths else 1
    return [1.0 - (length / max_len) for length in lengths]
```
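Running the same formula on a hypothetical group of three completion lengths shows the relative scores:

```python
def rank_by_length_sync(lengths):
    # Same formula as rank_by_length above, applied to raw lengths
    max_len = max(lengths) if lengths else 1
    return [1.0 - (length / max_len) for length in lengths]

print(rank_by_length_sync([100, 200, 50]))  # [0.5, 0.0, 0.75]
```

The shortest completion scores highest; the longest always scores 0.0.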
Shared Objects
Rubrics can provide shared objects accessible to all reward functions via class_objects:
```python
async def my_reward_func(completion, my_helper) -> float:
    # my_helper is injected by name
    return await my_helper.score(completion)

rubric = vf.Rubric(funcs=[my_reward_func])
rubric.add_class_object("my_helper", some_helper_object)
```
Parsers
Parsers extract structured content from model responses:
```python
async def my_reward_func(completion, answer, parser) -> float:
    parsed = parser.parse(completion[-1]["content"])
    return 1.0 if parsed.answer == answer else 0.0

parser = vf.XMLParser(fields=["reasoning", "answer"])
rubric = vf.Rubric(funcs=[my_reward_func], parser=parser)
```
Built-in parsers:
- `vf.Parser()` - Pass-through (no parsing)
- `vf.XMLParser(fields=[...])` - Extract XML tags
- `vf.ThinkParser()` - Extract content after `</think>`
- `vf.MaybeThinkParser()` - Handle optional `<think>` tags
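As an illustration of the kind of extraction an XML-field parser performs, here is a regex-based stand-in (a sketch, not the library's implementation):

```python
import re

def extract_fields(text, fields=("reasoning", "answer")):
    # Pull <tag>...</tag> content for each configured field
    out = {}
    for field in fields:
        match = re.search(rf"<{field}>(.*?)</{field}>", text, re.DOTALL)
        out[field] = match.group(1).strip() if match else None
    return out

sample = "<reasoning>2+2 is 4</reasoning><answer>4</answer>"
print(extract_fields(sample))  # {'reasoning': '2+2 is 4', 'answer': '4'}
```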
Judges (LLM-as-Judge)
JudgeRubric integrates LLM-based evaluation:
```python
judge_rubric = vf.JudgeRubric(
    judge_model="gpt-4.1-mini",
    judge_prompt="""Is this answer correct?
Question: {question}
Answer: {response}
Ground Truth: {answer}
Respond with YES or NO."""
)

async def judge_correctness(prompt, completion, answer, judge) -> float:
    verdict = await judge(prompt, completion, answer)
    return 1.0 if "yes" in verdict.lower() else 0.0

judge_rubric.add_reward_func(judge_correctness)
```
Built-in judge callable:
```python
# Signature
async def judge(
    prompt: Messages,
    completion: Messages,
    answer: str
) -> str:
    ...
```
Exposed objects:
- `judge` - Callable that formats the prompt and calls the judge model
- `judge_client` - Raw `AsyncOpenAI` client
- `judge_model` - Model name string
- `judge_prompt` - Template string
- `judge_sampling_args` - Sampling parameters dict
Custom Shared Objects
Add domain-specific helpers:
```python
class MathVerifier:
    async def verify(self, expression: str, expected: str) -> bool:
        # Symbolic math verification
        ...

async def verify_answer(completion, answer, verifier) -> float:
    response = completion[-1]["content"]
    is_correct = await verifier.verify(response, answer)
    return 1.0 if is_correct else 0.0

verifier = MathVerifier()
rubric = vf.Rubric(funcs=[verify_answer])
rubric.add_class_object("verifier", verifier)
```
Rubric Groups
Combine multiple rubrics for heterogeneous scoring:
```python
math_rubric = vf.MathRubric()  # symbolic math verification

judge_rubric = vf.JudgeRubric(judge_model="gpt-4.1-mini")
judge_rubric.add_reward_func(judge_correctness, weight=0.5)

combined = vf.RubricGroup(rubrics=[math_rubric, judge_rubric])
```
Behavior:
- All rubrics execute in parallel
- Final reward = sum of all rubric rewards
- Metrics from all rubrics are collected together
Use cases:
- Combining deterministic and LLM-based evaluation
- Multi-faceted scoring (correctness + style + efficiency)
- Environment-specific monitors + task-specific rewards
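A quick numeric sketch of the summing behavior, with hypothetical per-rubric rewards for a single rollout:

```python
# Hypothetical rewards produced by each rubric for one rollout
math_reward = 1.0         # math rubric: symbolic check passed
judge_reward = 0.5 * 1.0  # judge function returned 1.0, weight 0.5

# The group's final reward is the sum of the rubric rewards
combined_reward = math_reward + judge_reward
print(combined_reward)  # 1.5
```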
Metrics and Monitor Rubrics
Adding Metrics
Metrics are reward functions with weight=0.0 (tracked but don’t affect reward):
```python
async def response_length(completion) -> float:
    return float(len(completion[-1]["content"]))

async def token_count(state) -> float:
    return float(state.get("usage", {}).get("total_tokens", 0))

rubric = vf.Rubric(funcs=[correctness])
rubric.add_metric(response_length)  # shorthand for weight=0
rubric.add_metric(token_count)
```
Monitor Rubrics
Environments automatically include monitor rubrics for tracking metrics:
| Environment | Tracked Metrics |
|---|---|
| `MultiTurnEnv` | `num_turns` |
| `ToolEnv` | `total_tool_calls`, per-tool counts |
| `SandboxEnv` | `sandbox_ready_wait_time`, `sandbox_command_execution_time` |
| `PythonEnv` | `python_ready_wait_time` |
Example monitor rubric:
```python
class MultiTurnMonitorRubric(vf.Rubric):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.add_metric(self.num_turns)

    async def num_turns(self, state: vf.State) -> int:
        return len(state["trajectory"])
```
Custom Monitor Rubrics
Add environment-specific metrics:
```python
class MyMonitorRubric(vf.Rubric):
    def __init__(self):
        super().__init__()
        self.add_metric(self.trajectory_length)
        self.add_metric(self.error_count)

    async def trajectory_length(self, state: vf.State) -> float:
        return float(len(state["trajectory"]))

    async def error_count(self, state: vf.State) -> float:
        errors = sum(1 for step in state["trajectory"] if step.get("error"))
        return float(errors)

env = vf.ToolEnv(dataset=dataset, tools=tools, rubric=rubric)
env.add_rubric(MyMonitorRubric())
```
Built-in Rubrics
MathRubric
Symbolic math verification using math-verify:
```python
math_rubric = vf.MathRubric()
# Automatically includes:
# - Parser for \boxed{} answers
# - Symbolic equivalence checking
# - Normalization of mathematical expressions
```
Usage:
```python
env = vf.SingleTurnEnv(
    dataset=math_dataset,
    rubric=math_rubric
)
```
JudgeRubric
LLM-as-judge evaluation:
```python
judge_rubric = vf.JudgeRubric(
    judge_model="gpt-4.1-mini",
    judge_prompt="""Evaluate if the response correctly answers the question.
Question: {question}
Response: {response}
Ground Truth: {answer}
Answer YES or NO.""",
    judge_sampling_args={"temperature": 0.0}
)

async def judge_score(prompt, completion, answer, judge) -> float:
    verdict = await judge(prompt, completion, answer)
    return 1.0 if "yes" in verdict.lower() else 0.0

judge_rubric.add_reward_func(judge_score)
```
Scoring Lifecycle
Individual Scoring
For rollouts scored independently:
```python
# Called automatically after each rollout
state = await env.rollout(input, client, model, sampling_args)
await rubric.score_rollout(state)
# Sets state["reward"], state["metrics"]
```
Group Scoring
For rollouts scored together (default for evaluate() and training):
```python
# Generate group of rollouts
states = await asyncio.gather(*[
    env.rollout(input, client, model, sampling_args)
    for input in group_inputs
])

# Score group together
await rubric.score_group(states)
# Sets state["reward"], state["advantage"], state["metrics"] for all states
```
Advantage computation:
```python
avg_reward = sum(state["reward"] for state in states) / len(states)
for state in states:
    state["advantage"] = state["reward"] - avg_reward
```
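With a hypothetical group of three rollouts, the mean-centering looks like:

```python
rewards = [1.0, 0.0, 0.5]
avg_reward = sum(rewards) / len(rewards)        # 0.5
advantages = [r - avg_reward for r in rewards]  # [0.5, -0.5, 0.0]
print(advantages)
```

Advantages within a group always sum to zero.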
Disabling Scoring
For pure generation without scoring:
```python
env = vf.SingleTurnEnv(dataset=dataset, rubric=rubric, score_rollouts=False)

# Or dynamically:
env.set_score_rollouts(False)
```
RolloutScore Type
Rubrics produce RolloutScore objects:
```python
from verifiers.types import RolloutScore

class RolloutScore(TypedDict):
    reward: float
    metrics: dict[str, float]

# Example:
result = RolloutScore(
    reward=0.85,
    metrics={
        "correctness": 1.0,
        "formatting": 0.8,
        "conciseness": 0.5,
        "response_length": 342.0
    }
)
```
Complete Example
```python
import verifiers as vf
from datasets import Dataset

# Dataset
dataset = Dataset.from_list([
    {
        "question": "What is 2+2?",
        "answer": "4",
        "info": {"difficulty": 1, "category": "arithmetic"}
    },
    {
        "question": "What is the derivative of x^2?",
        "answer": "2x",
        "info": {"difficulty": 3, "category": "calculus"}
    },
])

# Reward functions
async def correctness(completion, answer, parser) -> float:
    parsed = parser.parse_answer(completion)
    return 1.0 if parsed == answer else 0.0

async def difficulty_bonus(state, info) -> float:
    """Bonus for harder problems."""
    if state["reward"] == 1.0:  # only if correct
        return info.get("difficulty", 1) * 0.1
    return 0.0

async def response_length(completion) -> float:
    return float(len(completion[-1]["content"]))

# Parser
parser = vf.XMLParser(fields=["reasoning", "answer"])

# Rubric
rubric = vf.Rubric(
    funcs=[correctness, difficulty_bonus],
    weights=[1.0, 1.0],
    parser=parser
)
rubric.add_metric(response_length)

# Environment
env = vf.SingleTurnEnv(
    dataset=dataset,
    parser=parser,
    rubric=rubric,
    system_prompt="Answer step by step in XML format."
)
```
When combining multiple reward functions, tune the weights so that no single function dominates the reward signal. A common practice is to normalize the weights or use coefficients below 1.0 for auxiliary rewards.