Overview
MathRubric extends Rubric to provide mathematical equivalence checking using the math_verify library. It evaluates whether model-generated mathematical answers are equivalent to ground truth, handling symbolic expressions, algebraic equivalence, and numerical precision.
Constructor
```python
MathRubric(
    funcs: list[RewardFunc] | None = None,
    weights: list[float] | None = None,
    parser: Parser | None = None,
    max_workers: int = 50,
    timeout_seconds: float = 5.0,
)
```
funcs (list[RewardFunc] | None, default: None)
Additional reward functions beyond the built-in correct_answer function.

weights (list[float] | None, default: None)
Weights for the additional reward functions. The built-in correct_answer gets weight 1.0.

parser (Parser | None, default: None)
Parser for extracting answers. Defaults to MaybeThinkParser(extract_fn=extract_boxed_answer).

max_workers (int, default: 50)
Maximum number of thread pool workers for parallel verification.

timeout_seconds (float, default: 5.0)
Per-verification timeout in seconds. Returns 0.0 reward if exceeded.
Built-in Reward Function
correct_answer
```python
async def correct_answer(
    self,
    parser: Parser,
    completion: Messages,
    answer: str,
    **kwargs,
) -> float
```
Verifies mathematical equivalence between the parsed completion and ground truth answer.
parser: Parser instance (automatically provided by the rubric).
completion: The model's completion to verify.
answer: Ground truth answer in LaTeX format (will be wrapped in \boxed{}).
Returns: float - 1.0 if mathematically equivalent, 0.0 otherwise.
The function automatically wraps answers in \boxed{} format before verification.
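A minimal sketch of the wrap-and-verify idea, using fractions as a toy stand-in for math_verify's much richer equivalence checking (the function names below are illustrative, not the rubric's internals):

```python
from fractions import Fraction

def wrap_boxed(answer: str) -> str:
    """Ensure the answer is in \\boxed{} form, mirroring the rubric's wrapping step."""
    return answer if answer.startswith("\\boxed{") else f"\\boxed{{{answer}}}"

def numerically_equivalent(a: str, b: str) -> bool:
    """Toy equivalence check: compare two plain fractions or decimals."""
    try:
        return Fraction(a) == Fraction(b)
    except ValueError:
        # Unparseable input counts as not equivalent, matching the 0.0-on-failure policy
        return False

print(wrap_boxed("5/6"))                     # \boxed{5/6}
print(numerically_equivalent("1/2", "0.5"))  # True
```

math_verify goes far beyond this, handling symbolic and algebraic equivalence; the sketch only shows the overall shape of the check.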
Timeouts
MathRubric implements two-level timeout protection:
- Soft timeout (timeout_seconds): if verification takes longer than this, the reward is 0.0 and the event is logged at debug level.
- Hard timeout: an absolute upper bound on verification time. A warning is logged if it is exceeded.
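The soft-timeout pattern can be sketched with asyncio's wait_for around a thread-pool call. The names and timings below are illustrative assumptions, not the rubric's actual internals:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

SOFT_TIMEOUT = 0.1  # plays the role of timeout_seconds
executor = ThreadPoolExecutor(max_workers=2)

def slow_verify() -> float:
    time.sleep(0.5)  # stands in for an expensive verification
    return 1.0

async def verify_with_timeout() -> float:
    loop = asyncio.get_running_loop()
    try:
        # Run the blocking verification off the event loop,
        # bounded by the soft timeout.
        return await asyncio.wait_for(
            loop.run_in_executor(executor, slow_verify), SOFT_TIMEOUT
        )
    except asyncio.TimeoutError:
        return 0.0  # soft timeout: fail the check instead of hanging

print(asyncio.run(verify_with_timeout()))  # 0.0
```

Running the check in an executor keeps the event loop responsive even when the underlying verification blocks.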
Attributes
Thread pool for running blocking math_verify operations asynchronously.
All attributes from Rubric are also available.
Example Usage
Basic Math Verification
```python
import verifiers as vf

# Create math rubric with default settings
rubric = vf.MathRubric()

# Score a mathematical response
state = {
    "prompt": "Solve for x: 2x + 5 = 13",
    "completion": [{"role": "assistant", "content": "x = 4"}],
    "answer": "4",
    "task": "math",
    "timing": {"scoring_ms": 0, "total_ms": 0},
}
await rubric.score_rollout(state)
print(f"Reward: {state['reward']}")  # 1.0 if correct
```
Custom Parser for Boxed Answers
```python
from verifiers.parsers import MaybeThinkParser
from verifiers.utils.data_utils import extract_boxed_answer

# Create rubric expecting \boxed{} format
rubric = vf.MathRubric(
    parser=MaybeThinkParser(extract_fn=extract_boxed_answer)
)

state = {
    "prompt": "What is 1/2 + 1/3?",
    "completion": [{"role": "assistant", "content": "The answer is \\boxed{\\frac{5}{6}}"}],
    "answer": "\\frac{5}{6}",
    "task": "math",
    "timing": {"scoring_ms": 0, "total_ms": 0},
}
await rubric.score_rollout(state)
print(f"Reward: {state['reward']}")  # 1.0 - equivalent fractions
```
Adding Additional Metrics
```python
rubric = vf.MathRubric()

# Track response length as a metric (weight=0)
def response_length(completion, **kwargs):
    text = completion[-1]["content"] if isinstance(completion, list) else completion
    return len(text)

rubric.add_metric(response_length, weight=0.0)

# Track number of steps
def step_count(completion, **kwargs):
    text = completion[-1]["content"] if isinstance(completion, list) else completion
    return text.count("\n") + 1

rubric.add_metric(step_count, weight=0.0)

await rubric.score_rollout(state)
print(f"Metrics: {state['metrics']}")  # Includes correct_answer, response_length, step_count
```
Custom Timeout Settings
```python
# Faster timeout for high-throughput evaluation
rubric = vf.MathRubric(
    timeout_seconds=2.0,  # Only allow 2 seconds per verification
    max_workers=100,      # More parallelism
)

# Longer timeout for complex symbolic math
rubric = vf.MathRubric(
    timeout_seconds=30.0,  # Allow 30 seconds for complex expressions
    max_workers=10,
)
```
Combining with Other Reward Functions
```python
rubric = vf.MathRubric()

# Reward showing work
async def shows_reasoning(completion, parser, **kwargs):
    """Reward explanations before the answer."""
    text = completion[-1]["content"] if isinstance(completion, list) else completion
    # Check whether there is explanatory text before the boxed answer
    boxed_idx = text.find("\\boxed")
    if boxed_idx > 50:  # At least 50 characters of explanation
        return 1.0
    return 0.0

rubric.add_reward_func(shows_reasoning, weight=0.2)
# Now reward = 1.0 * correct_answer + 0.2 * shows_reasoning
```
Handling Edge Cases
```python
rubric = vf.MathRubric()

# Empty completion
state = {
    "prompt": "What is 2+2?",
    "completion": [{"role": "assistant", "content": ""}],
    "answer": "4",
    "task": "math",
    "timing": {"scoring_ms": 0, "total_ms": 0},
}
await rubric.score_rollout(state)
print(state["reward"])  # 0.0 - empty responses get 0

# Equivalent expressions
state = {
    "prompt": "Simplify.",
    "completion": [{"role": "assistant", "content": "\\boxed{2x + 6}"}],
    "answer": "2(x + 3)",
    "task": "math",
    "timing": {"scoring_ms": 0, "total_ms": 0},
}
await rubric.score_rollout(state)
print(state["reward"])  # 1.0 - algebraically equivalent
```
Mathematical Equivalence
The math_verify library checks several forms of equivalence:
- Symbolic: 2x + 6 ≡ 2(x + 3)
- Numerical: 0.333... ≡ 1/3
- Algebraic: (x+1)^2 ≡ x^2 + 2x + 1
- Trigonometric: sin^2(x) + cos^2(x) ≡ 1
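As a rough intuition for these checks, two expressions can be compared at random sample points. This is only a toy heuristic for illustration, not how math_verify verifies equivalence:

```python
import math
import random

def numerically_equal(f, g, trials: int = 100, tol: float = 1e-9) -> bool:
    """Heuristic equivalence: compare two callables at random sample points."""
    for _ in range(trials):
        x = random.uniform(-10, 10)
        if not math.isclose(f(x), g(x), rel_tol=tol, abs_tol=tol):
            return False
    return True

# Algebraic: (x+1)^2 vs x^2 + 2x + 1
print(numerically_equal(lambda x: (x + 1) ** 2,
                        lambda x: x * x + 2 * x + 1))  # True

# Trigonometric: sin^2(x) + cos^2(x) vs 1
print(numerically_equal(lambda x: math.sin(x) ** 2 + math.cos(x) ** 2,
                        lambda x: 1.0))  # True
```

A symbolic checker is strictly stronger: sampling can be fooled by expressions that agree on the sampled range but differ elsewhere.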
Mathematical verification can be computationally expensive:
- Set an appropriate timeout_seconds to avoid hanging on complex expressions.
- Use max_workers to control parallelism and memory usage.
- The thread pool is cleaned up automatically when the rubric is garbage collected.

Verification runs in a thread pool executor to avoid blocking the async event loop.
Notes
- Empty or unparseable responses always receive 0.0 reward
- Timeout violations are logged at debug level for soft timeouts and at warning level for hard timeouts
- The rubric suppresses math_verify library timeout warnings (handled internally)
- All parsing exceptions return 0.0 rather than raising errors
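The fail-safe behavior in the last note can be sketched as a wrapper that converts any exception into a 0.0 reward (the function names here are illustrative):

```python
def safe_score(check, completion: str) -> float:
    """Return 0.0 on any parsing/verification failure instead of raising."""
    try:
        return 1.0 if check(completion) else 0.0
    except Exception:
        return 0.0

def strict_check(text: str) -> bool:
    # A check that raises on malformed input, like a strict parser might
    if "\\boxed" not in text:
        raise ValueError("no boxed answer")
    return True

print(safe_score(strict_check, "no answer here"))  # 0.0
print(safe_score(strict_check, "\\boxed{4}"))      # 1.0
```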