Overview
Rubrics manage the scoring logic for rollouts, combining multiple reward functions into a final reward signal. Each rubric holds reward functions, computes weighted combinations, and tracks metrics for observability.
```python
import verifiers as vf

async def correct_answer(completion, answer) -> float:
    response = completion[-1]["content"]
    return 1.0 if answer in response else 0.0

rubric = vf.Rubric(funcs=[correct_answer])
```
Basic Reward Functions
Reward functions evaluate rollouts and return floats (typically 0.0 to 1.0). They request data by naming arguments:
```python
async def exact_match(completion, answer) -> float:
    """Check if answer appears in completion."""
    response = completion[-1]["content"]
    return 1.0 if answer in response else 0.0

async def length_penalty(completion) -> float:
    """Penalize overly long responses."""
    response = completion[-1]["content"]
    return 1.0 if len(response) < 500 else 0.5

rubric = vf.Rubric(
    funcs=[exact_match, length_penalty],
    weights=[1.0, 0.1]  # exact_match weighted 10x more
)
```
Available Arguments
Reward functions can request these standard arguments:
| Argument | Type | Description |
|---|---|---|
| `completion` | `Messages` | Model's output messages |
| `prompt` | `Messages` | Input messages |
| `answer` | `str` | Ground truth from dataset |
| `info` | `Info` (dict) | Metadata from dataset |
| `state` | `State` | Full rollout state |
| `task` | `str` | Task identifier |
Type signatures:
```python
from verifiers.types import Messages, Info, State

Messages = list[Message]  # list of chat messages
Info = dict[str, Any]
State = dict  # with additional input forwarding
```
Argument Injection Pattern
The rubric uses introspection to inject only requested arguments:
```python
# Only receives what it asks for
async def simple_reward(completion, answer) -> float:
    return 1.0 if answer in completion[-1]["content"] else 0.0

# Can request all available data
async def complex_reward(prompt, completion, answer, info, state) -> float:
    difficulty = info.get("difficulty", 1)
    tokens_used = state.get("usage", {}).get("total_tokens", 0)
    correct = answer in completion[-1]["content"]
    return float(correct) * (1.0 / difficulty) * (1.0 if tokens_used < 1000 else 0.5)

# Use **kwargs to accept everything
async def flexible_reward(**kwargs) -> float:
    completion = kwargs["completion"]
    answer = kwargs.get("answer", "")
    return 1.0 if answer in completion[-1]["content"] else 0.0
```
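The injection mechanism itself is not shown above; a minimal sketch of how introspection-based injection could work, using `inspect.signature` (the dispatcher `call_with_requested_args` is a hypothetical helper, not the verifiers implementation):

```python
import asyncio
import inspect

async def simple_reward(completion, answer) -> float:
    return 1.0 if answer in completion[-1]["content"] else 0.0

async def call_with_requested_args(func, available):
    # Inspect the function's parameters and pass only the arguments it names
    params = inspect.signature(func).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return await func(**available)  # **kwargs: pass everything
    kwargs = {k: v for k, v in available.items() if k in params}
    return await func(**kwargs)

available = {
    "completion": [{"role": "assistant", "content": "The answer is 4"}],
    "answer": "4",
    "state": {},  # offered but not requested by simple_reward
}
print(asyncio.run(call_with_requested_args(simple_reward, available)))  # 1.0
```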
Multiple Reward Functions
Combine reward functions with custom weights:
```python
async def correctness(completion, answer) -> float:
    return 1.0 if answer in completion[-1]["content"] else 0.0

async def formatting(completion, parser) -> float:
    try:
        parser.parse(completion[-1]["content"])
        return 1.0
    except Exception:
        return 0.0

async def conciseness(completion) -> float:
    length = len(completion[-1]["content"])
    return 1.0 if length < 200 else 0.5

rubric = vf.Rubric(
    funcs=[correctness, formatting, conciseness],
    weights=[1.0, 0.5, 0.1]
)
```
Final reward computation:
```python
reward = (correctness * 1.0) + (formatting * 0.5) + (conciseness * 0.1)
```
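For instance, with hypothetical per-function scores from a single rollout, the weighted sum works out as:

```python
# Hypothetical scores from one rollout
scores = {"correctness": 1.0, "formatting": 0.0, "conciseness": 0.5}
weights = {"correctness": 1.0, "formatting": 0.5, "conciseness": 0.1}

reward = sum(scores[name] * weights[name] for name in scores)
print(reward)  # 1.0*1.0 + 0.0*0.5 + 0.5*0.1 = 1.05
```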
Adding Functions Dynamically
```python
rubric = vf.Rubric()
rubric.add_reward_func(correctness, weight=1.0)
rubric.add_reward_func(formatting, weight=0.5)
```
Execution Order and State Sharing
Reward functions execute sequentially in the order they’re added. Since state is mutable, earlier functions can store computed values for later functions:
```python
async def compute_similarity(completion, answer, state) -> float:
    """Compute and cache similarity score."""
    response = completion[-1]["content"]
    score = compute_embedding_similarity(response, answer)  # expensive
    state["similarity"] = score  # cache for other functions
    return score

async def similarity_threshold(state) -> float:
    """Use cached similarity without recomputing."""
    return 1.0 if state["similarity"] > 0.8 else 0.0

rubric = vf.Rubric(
    funcs=[compute_similarity, similarity_threshold],
    weights=[0.0, 1.0]  # log similarity (weight=0), reward threshold (weight=1)
)
```
Execution flow:
1. `compute_similarity` runs and stores `state["similarity"]`
2. `similarity_threshold` runs and reads the cached value
3. Final reward = `0.0 * similarity + 1.0 * threshold`
Group-Based Reward Functions
During evaluation and RL training, rollouts are organized into groups by example_id. Group reward functions operate on all rollouts for an example together:
```python
async def diversity_bonus(completions) -> list[float]:
    """Reward unique responses within a group."""
    responses = [c[-1]["content"] for c in completions]
    return [0.2 if responses.count(r) == 1 else 0.0 for r in responses]

async def individual_correctness(completion, answer) -> float:
    return 1.0 if answer in completion[-1]["content"] else 0.0

rubric = vf.Rubric(
    funcs=[individual_correctness, diversity_bonus],
    weights=[1.0, 0.5]
)
```
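Applying the same uniqueness logic to a hypothetical group of raw response strings illustrates the per-rollout bonuses:

```python
def diversity_bonus_sync(responses):
    # Same logic as diversity_bonus above, on raw response strings
    return [0.2 if responses.count(r) == 1 else 0.0 for r in responses]

print(diversity_bonus_sync(["4", "4", "four"]))  # [0.0, 0.0, 0.2]
```

Only the unique response earns the bonus; duplicated responses get 0.0.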
Detection
Group functions are detected by:
- Plural argument names: `completions`, `prompts`, `answers`, `states`, `tasks`, `infos`
- Return type: `list[float]` instead of `float`
Available Group Arguments
| Argument | Type | Description |
|---|---|---|
| `completions` | `list[Messages]` | All completions in group |
| `prompts` | `list[Messages]` | All prompts in group |
| `answers` | `list[str]` | All answers in group |
| `states` | `list[State]` | All states in group |
| `tasks` | `list[str]` | All task IDs in group |
| `infos` | `list[Info]` | All info dicts in group |
Example: Relative Ranking
```python
async def rank_by_length(completions) -> list[float]:
    """Reward shorter completions more within a group."""
    lengths = [len(c[-1]["content"]) for c in completions]
    max_len = max(lengths) if lengths else 1
    return [1.0 - (length / max_len) for length in lengths]
```
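Running the same formula on a hypothetical group of three completion lengths shows the relative scores:

```python
def rank_by_length_sync(lengths):
    # Same formula as rank_by_length above, applied to raw lengths
    max_len = max(lengths) if lengths else 1
    return [1.0 - (length / max_len) for length in lengths]

print(rank_by_length_sync([100, 200, 50]))  # [0.5, 0.0, 0.75]
```

The shortest completion scores highest; the longest always scores 0.0.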
Shared Objects
Rubrics can provide shared objects accessible to all reward functions via class_objects:
```python
async def my_reward_func(completion, my_helper) -> float:
    # my_helper is injected by name
    return await my_helper.score(completion)

rubric = vf.Rubric(funcs=[my_reward_func])
rubric.add_class_object("my_helper", some_helper_object)
```
Parsers
Parsers extract structured content from model responses:
```python
async def my_reward_func(completion, answer, parser) -> float:
    parsed = parser.parse(completion[-1]["content"])
    return 1.0 if parsed.answer == answer else 0.0

parser = vf.XMLParser(fields=["reasoning", "answer"])
rubric = vf.Rubric(funcs=[my_reward_func], parser=parser)
```
Built-in parsers:
- `vf.Parser()` - Pass-through (no parsing)
- `vf.XMLParser(fields=[...])` - Extract XML tags
- `vf.ThinkParser()` - Extract content after `</think>`
- `vf.MaybeThinkParser()` - Handle optional `<think>` tags
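As an illustration of the kind of extraction an XML-field parser performs, here is a regex-based stand-in (a sketch, not the library's implementation):

```python
import re

def extract_fields(text, fields=("reasoning", "answer")):
    # Pull <tag>...</tag> content for each configured field
    out = {}
    for field in fields:
        match = re.search(rf"<{field}>(.*?)</{field}>", text, re.DOTALL)
        out[field] = match.group(1).strip() if match else None
    return out

sample = "<reasoning>2+2 is 4</reasoning><answer>4</answer>"
print(extract_fields(sample))  # {'reasoning': '2+2 is 4', 'answer': '4'}
```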
Judges (LLM-as-Judge)
JudgeRubric integrates LLM-based evaluation:
```python
judge_rubric = vf.JudgeRubric(
    judge_model="gpt-4.1-mini",
    judge_prompt="""Is this answer correct?
Question: {question}
Answer: {response}
Ground Truth: {answer}
Respond with YES or NO."""
)

async def judge_correctness(prompt, completion, answer, judge) -> float:
    verdict = await judge(prompt, completion, answer)
    return 1.0 if "yes" in verdict.lower() else 0.0

judge_rubric.add_reward_func(judge_correctness)
```
Built-in judge callable:
```python
# Signature
async def judge(
    prompt: Messages,
    completion: Messages,
    answer: str
) -> str:
    ...
```
Exposed objects:
- `judge` - Callable that formats the prompt and calls the judge model
- `judge_client` - Raw `AsyncOpenAI` client
- `judge_model` - Model name string
- `judge_prompt` - Template string
- `judge_sampling_args` - Sampling parameters dict
Custom Shared Objects
Add domain-specific helpers:
```python
class MathVerifier:
    async def verify(self, expression: str, expected: str) -> bool:
        # Symbolic math verification
        ...

async def verify_answer(completion, answer, verifier) -> float:
    response = completion[-1]["content"]
    is_correct = await verifier.verify(response, answer)
    return 1.0 if is_correct else 0.0

verifier = MathVerifier()
rubric = vf.Rubric(funcs=[verify_answer])
rubric.add_class_object("verifier", verifier)
```
Rubric Groups
Combine multiple rubrics for heterogeneous scoring:
```python
math_rubric = vf.MathRubric()  # symbolic math verification

judge_rubric = vf.JudgeRubric(judge_model="gpt-4.1-mini")
judge_rubric.add_reward_func(judge_correctness, weight=0.5)

combined = vf.RubricGroup(rubrics=[math_rubric, judge_rubric])
```
Behavior:
- All rubrics execute in parallel
- Final reward = sum of all rubric rewards
- Metrics from all rubrics are collected together
Use cases:
- Combining deterministic and LLM-based evaluation
- Multi-faceted scoring (correctness + style + efficiency)
- Environment-specific monitors + task-specific rewards
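A quick numeric sketch of the summing behavior, with hypothetical per-rubric rewards for a single rollout:

```python
# Hypothetical rewards produced by each rubric for one rollout
math_reward = 1.0         # math rubric: symbolic check passed
judge_reward = 0.5 * 1.0  # judge function returned 1.0, weight 0.5

# The group's final reward is the sum of the rubric rewards
combined_reward = math_reward + judge_reward
print(combined_reward)  # 1.5
```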
Metrics and Monitor Rubrics
Adding Metrics
Metrics are reward functions with weight=0.0 (tracked but don’t affect reward):
```python
async def response_length(completion) -> float:
    return float(len(completion[-1]["content"]))

async def token_count(state) -> float:
    return float(state.get("usage", {}).get("total_tokens", 0))

rubric = vf.Rubric(funcs=[correctness])
rubric.add_metric(response_length)  # shorthand for weight=0
rubric.add_metric(token_count)
```
Monitor Rubrics
Environments automatically include monitor rubrics for tracking metrics:
| Environment | Tracked Metrics |
|---|---|
| `MultiTurnEnv` | `num_turns` |
| `ToolEnv` | `total_tool_calls`, per-tool counts |
| `SandboxEnv` | `sandbox_ready_wait_time`, `sandbox_command_execution_time` |
| `PythonEnv` | `python_ready_wait_time` |
Example monitor rubric:
```python
class MultiTurnMonitorRubric(vf.Rubric):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.add_metric(self.num_turns)

    async def num_turns(self, state: vf.State) -> int:
        return len(state["trajectory"])
```
Custom Monitor Rubrics
Add environment-specific metrics:
```python
class MyMonitorRubric(vf.Rubric):
    def __init__(self):
        super().__init__()
        self.add_metric(self.trajectory_length)
        self.add_metric(self.error_count)

    async def trajectory_length(self, state: vf.State) -> float:
        return float(len(state["trajectory"]))

    async def error_count(self, state: vf.State) -> float:
        errors = sum(1 for step in state["trajectory"] if step.get("error"))
        return float(errors)

env = vf.ToolEnv(dataset=dataset, tools=tools, rubric=rubric)
env.add_rubric(MyMonitorRubric())
```
Built-in Rubrics
MathRubric
Symbolic math verification using math-verify:
```python
math_rubric = vf.MathRubric()
# Automatically includes:
# - Parser for \boxed{} answers
# - Symbolic equivalence checking
# - Normalization of mathematical expressions
```
Usage:
```python
env = vf.SingleTurnEnv(
    dataset=math_dataset,
    rubric=math_rubric
)
```
JudgeRubric
LLM-as-judge evaluation:
```python
judge_rubric = vf.JudgeRubric(
    judge_model="gpt-4.1-mini",
    judge_prompt="""Evaluate if the response correctly answers the question.
Question: {question}
Response: {response}
Ground Truth: {answer}
Answer YES or NO.""",
    judge_sampling_args={"temperature": 0.0}
)

async def judge_score(prompt, completion, answer, judge) -> float:
    verdict = await judge(prompt, completion, answer)
    return 1.0 if "yes" in verdict.lower() else 0.0

judge_rubric.add_reward_func(judge_score)
```
Scoring Lifecycle
Individual Scoring
For rollouts scored independently:
```python
# Called automatically after each rollout
state = await env.rollout(input, client, model, sampling_args)
await rubric.score_rollout(state)
# Sets state["reward"], state["metrics"]
```
Group Scoring
For rollouts scored together (default for evaluate() and training):
```python
# Generate group of rollouts
states = await asyncio.gather(*[
    env.rollout(input, client, model, sampling_args)
    for input in group_inputs
])

# Score group together
await rubric.score_group(states)
# Sets state["reward"], state["advantage"], state["metrics"] for all states
```
Advantage computation:
```python
avg_reward = sum(state["reward"] for state in states) / len(states)
for state in states:
    state["advantage"] = state["reward"] - avg_reward
```
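With a hypothetical group of three rollouts, the mean-centering looks like:

```python
rewards = [1.0, 0.0, 0.5]
avg_reward = sum(rewards) / len(rewards)        # 0.5
advantages = [r - avg_reward for r in rewards]  # [0.5, -0.5, 0.0]
print(advantages)
```

Advantages within a group always sum to zero.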
Disabling Scoring
For pure generation without scoring:
```python
env = vf.SingleTurnEnv(dataset=dataset, rubric=rubric, score_rollouts=False)

# Or dynamically:
env.set_score_rollouts(False)
```
RolloutScore Type
Rubrics produce RolloutScore objects:
```python
from verifiers.types import RolloutScore

class RolloutScore(TypedDict):
    reward: float
    metrics: dict[str, float]

# Example:
result = RolloutScore(
    reward=0.85,
    metrics={
        "correctness": 1.0,
        "formatting": 0.8,
        "conciseness": 0.5,
        "response_length": 342.0
    }
)
```
Complete Example
```python
import verifiers as vf
from datasets import Dataset

# Dataset
dataset = Dataset.from_list([
    {
        "question": "What is 2+2?",
        "answer": "4",
        "info": {"difficulty": 1, "category": "arithmetic"}
    },
    {
        "question": "What is the derivative of x^2?",
        "answer": "2x",
        "info": {"difficulty": 3, "category": "calculus"}
    },
])

# Reward functions
async def correctness(completion, answer, parser) -> float:
    parsed = parser.parse_answer(completion)
    return 1.0 if parsed == answer else 0.0

async def difficulty_bonus(state, info) -> float:
    """Bonus for harder problems."""
    if state["reward"] == 1.0:  # only if correct
        return info.get("difficulty", 1) * 0.1
    return 0.0

async def response_length(completion) -> float:
    return float(len(completion[-1]["content"]))

# Parser
parser = vf.XMLParser(fields=["reasoning", "answer"])

# Rubric
rubric = vf.Rubric(
    funcs=[correctness, difficulty_bonus],
    weights=[1.0, 1.0],
    parser=parser
)
rubric.add_metric(response_length)

# Environment
env = vf.SingleTurnEnv(
    dataset=dataset,
    parser=parser,
    rubric=rubric,
    system_prompt="Answer step by step in XML format."
)
```
When combining multiple reward functions, tune the weights so that no single function dominates the reward signal. A common practice is to normalize the weights or use coefficients below 1.0 for auxiliary rewards.