Overview

MathRubric extends Rubric to provide mathematical equivalence checking using the math_verify library. It evaluates whether model-generated mathematical answers are equivalent to ground truth, handling symbolic expressions, algebraic equivalence, and numerical precision.

Constructor

MathRubric(
    funcs: list[RewardFunc] | None = None,
    weights: list[float] | None = None,
    parser: Parser | None = None,
    max_workers: int = 50,
    timeout_seconds: float = 5,
)
  • funcs (list[RewardFunc] | None, default: None): Additional reward functions beyond the built-in correct_answer function.
  • weights (list[float] | None, default: None): Weights for the additional reward functions. The built-in correct_answer gets weight 1.0.
  • parser (Parser | None, default: None): Parser for extracting answers. Defaults to MaybeThinkParser(extract_fn=extract_boxed_answer).
  • max_workers (int, default: 50): Maximum number of thread pool workers for parallel verification.
  • timeout_seconds (float, default: 5): Per-verification timeout in seconds. Returns 0.0 reward if exceeded.

Built-in Reward Function

correct_answer

async def correct_answer(
    self,
    parser: Parser,
    completion: Messages,
    answer: str,
    **kwargs
) -> float
Verifies mathematical equivalence between the parsed completion and ground truth answer.
  • parser (Parser): Parser instance, automatically provided by the rubric.
  • completion (Messages): Model's completion to verify.
  • answer (str): Ground truth answer in LaTeX format (wrapped in \boxed{} before verification).
Returns: float - 1.0 if mathematically equivalent, 0.0 otherwise.
The function automatically wraps answers in \boxed{} format before verification.
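The wrapping step amounts to simple string formatting. A minimal sketch (the helper name and the already-boxed check are illustrative, not the library's actual code):

```python
def wrap_boxed(answer: str) -> str:
    """Wrap a ground-truth answer in \\boxed{} unless it already is (illustrative)."""
    return answer if "\\boxed" in answer else f"\\boxed{{{answer}}}"

print(wrap_boxed("4"))                      # \boxed{4}
print(wrap_boxed("\\boxed{\\frac{5}{6}}"))  # already boxed, unchanged
```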

Timeouts

MathRubric implements two-level timeout protection:
  • timeout_seconds (float, default: 5): Soft timeout. Returns 0.0 if verification takes longer than this; logged at debug level.
  • HARD_TIMEOUT_SECONDS (float, default: 120): Hard timeout. Absolute maximum time for verification; logs a warning if exceeded.
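The soft-timeout behavior can be sketched with asyncio.wait_for. This is a minimal stdlib-only illustration (function names are invented for the example; the real rubric additionally layers the 120-second hard ceiling on top):

```python
import asyncio

async def verify_with_soft_timeout(coro, timeout_seconds: float = 5.0) -> float:
    """Illustrative: return 0.0 reward when verification exceeds the soft timeout."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # Soft timeout exceeded: treat as incorrect rather than raising.
        return 0.0

async def slow_verify() -> float:
    """Stand-in for a verification that takes too long."""
    await asyncio.sleep(10)
    return 1.0

print(asyncio.run(verify_with_soft_timeout(slow_verify(), timeout_seconds=0.05)))  # 0.0
```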

Attributes

  • executor (ThreadPoolExecutor): Thread pool for running blocking math_verify operations asynchronously.
All attributes from Rubric are also available.

Example Usage

Basic Math Verification

import verifiers as vf

# Create math rubric with default settings
rubric = vf.MathRubric()

# Score a mathematical response
state = {
    "prompt": "Solve for x: 2x + 5 = 13",
    "completion": [{"role": "assistant", "content": "x = \\boxed{4}"}],
    "answer": "4",
    "task": "math",
    "timing": {"scoring_ms": 0, "total_ms": 0}
}

await rubric.score_rollout(state)
print(f"Reward: {state['reward']}")  # 1.0 if correct

Custom Parser for Boxed Answers

import verifiers as vf
from verifiers.parsers import MaybeThinkParser
from verifiers.utils.data_utils import extract_boxed_answer

# Create rubric expecting \boxed{} format
rubric = vf.MathRubric(
    parser=MaybeThinkParser(extract_fn=extract_boxed_answer)
)

state = {
    "prompt": "What is 1/2 + 1/3?",
    "completion": [{"role": "assistant", "content": "The answer is \\boxed{\\frac{5}{6}}"}],
    "answer": "\\frac{5}{6}",
    "task": "math",
    "timing": {"scoring_ms": 0, "total_ms": 0}
}

await rubric.score_rollout(state)
print(f"Reward: {state['reward']}")  # 1.0 - equivalent fractions

Adding Additional Metrics

rubric = vf.MathRubric()

# Track response length as a metric (weight=0)
def response_length(completion, **kwargs):
    text = completion[-1]["content"] if isinstance(completion, list) else completion
    return len(text)

rubric.add_metric(response_length, weight=0.0)

# Track number of steps
def step_count(completion, **kwargs):
    text = completion[-1]["content"] if isinstance(completion, list) else completion
    return text.count("\n") + 1

rubric.add_metric(step_count, weight=0.0)

await rubric.score_rollout(state)
print(f"Metrics: {state['metrics']}")  # Includes correct_answer, response_length, step_count

Custom Timeout Settings

# Faster timeout for high-throughput evaluation
rubric = vf.MathRubric(
    timeout_seconds=2.0,  # Only allow 2 seconds per verification
    max_workers=100  # More parallelism
)

# Longer timeout for complex symbolic math
rubric = vf.MathRubric(
    timeout_seconds=30.0,  # Allow 30 seconds for complex expressions
    max_workers=10
)

Combining with Other Reward Functions

import verifiers as vf

rubric = vf.MathRubric()

# Reward showing work
async def shows_reasoning(completion, **kwargs):
    """Reward explanations before the answer."""
    text = completion[-1]["content"] if isinstance(completion, list) else completion
    # Require at least 50 characters of explanation before the boxed answer
    boxed_idx = text.find("\\boxed")
    return 1.0 if boxed_idx > 50 else 0.0

rubric.add_reward_func(shows_reasoning, weight=0.2)

# Now reward = 1.0 * correct + 0.2 * shows_reasoning
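With both functions returning 1.0, the weighted sum works out as:

```python
weights = {"correct_answer": 1.0, "shows_reasoning": 0.2}
scores = {"correct_answer": 1.0, "shows_reasoning": 1.0}  # correct answer, with reasoning

# Total reward is the weight-scaled sum of the individual scores.
reward = sum(weights[name] * scores[name] for name in weights)
print(reward)  # 1.2
```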

Handling Edge Cases

rubric = vf.MathRubric()

# Empty completion
state = {
    "prompt": "What is 2+2?",
    "completion": [{"role": "assistant", "content": ""}],
    "answer": "4",
    "task": "math",
    "timing": {"scoring_ms": 0, "total_ms": 0}
}
await rubric.score_rollout(state)
print(state["reward"])  # 0.0 - empty responses get 0

# Equivalent expressions
state = {
    "prompt": "Simplify.",
    "completion": [{"role": "assistant", "content": "\\boxed{2x + 6}"}],
    "answer": "2(x + 3)",
    "task": "math",
    "timing": {"scoring_ms": 0, "total_ms": 0}
}
await rubric.score_rollout(state)
print(state["reward"])  # 1.0 - algebraically equivalent

Mathematical Equivalence

The math_verify library checks various forms of equivalence:
  • Symbolic: 2x + 6 ≡ 2(x + 3)
  • Numerical: 0.333... ≡ 1/3
  • Algebraic: (x+1)^2 ≡ x^2 + 2x + 1
  • Trigonometric: sin^2(x) + cos^2(x) ≡ 1
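For intuition, algebraic equivalence can be approximated by evaluating both expressions at random sample points. This stdlib-only sketch is illustrative; math_verify itself performs symbolic comparison, not sampling:

```python
import math
import random

def numerically_equivalent(f, g, trials: int = 100, tol: float = 1e-9) -> bool:
    """Spot-check two single-variable expressions at random points (illustrative)."""
    rng = random.Random(0)  # fixed seed for reproducibility
    for _ in range(trials):
        x = rng.uniform(-10, 10)
        if not math.isclose(f(x), g(x), rel_tol=tol, abs_tol=tol):
            return False
    return True

print(numerically_equivalent(lambda x: 2 * x + 6, lambda x: 2 * (x + 3)))           # True
print(numerically_equivalent(lambda x: (x + 1) ** 2, lambda x: x * x + 2 * x + 1))  # True
print(numerically_equivalent(lambda x: math.sin(x) ** 2 + math.cos(x) ** 2,
                             lambda x: 1.0))                                        # True
```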

Performance Considerations

Mathematical verification can be computationally expensive:
  • Set appropriate timeout_seconds to avoid hanging on complex expressions
  • Use max_workers to control parallelism and memory usage
  • The thread pool is automatically cleaned up when the rubric is garbage collected
Verification runs in a thread pool executor to avoid blocking the async event loop.
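The executor pattern can be sketched with loop.run_in_executor. This stdlib-only illustration uses a trivial stand-in for the real blocking math_verify call:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def blocking_verify(expr_a: str, expr_b: str) -> bool:
    """Stand-in for a CPU-bound math_verify call (illustrative only)."""
    return expr_a.replace(" ", "") == expr_b.replace(" ", "")

async def verify_async(executor: ThreadPoolExecutor, a: str, b: str) -> float:
    loop = asyncio.get_running_loop()
    # Offload the blocking call so the event loop stays responsive.
    ok = await loop.run_in_executor(executor, blocking_verify, a, b)
    return 1.0 if ok else 0.0

async def main() -> list[float]:
    with ThreadPoolExecutor(max_workers=4) as executor:
        return await asyncio.gather(
            verify_async(executor, "2x + 6", "2x+6"),
            verify_async(executor, "x", "y"),
        )

print(asyncio.run(main()))  # [1.0, 0.0]
```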

Notes

  • Empty or unparseable responses always receive 0.0 reward
  • Timeout violations are logged at debug level for soft timeouts, warning for hard timeouts
  • The rubric suppresses math_verify library timeout warnings (handled internally)
  • All parsing exceptions return 0.0 rather than raising errors
