Overview

MathRubric extends Rubric to provide mathematical equivalence checking using the math_verify library. It evaluates whether model-generated mathematical answers are equivalent to ground truth, handling symbolic expressions, algebraic equivalence, and numerical precision.

Constructor

MathRubric(
    funcs: list[RewardFunc] | None = None,
    weights: list[float] | None = None,
    parser: Parser | None = None,
    max_workers: int = 50,
    timeout_seconds: float = 5,
)
  • funcs (list[RewardFunc] | None, default: None): Additional reward functions beyond the built-in correct_answer function.
  • weights (list[float] | None, default: None): Weights for the additional reward functions. The built-in correct_answer gets weight 1.0.
  • parser (Parser | None, default: None): Parser for extracting answers. Defaults to MaybeThinkParser(extract_fn=extract_boxed_answer).
  • max_workers (int, default: 50): Maximum number of thread pool workers for parallel verification.
  • timeout_seconds (float, default: 5): Per-verification timeout in seconds. Returns 0.0 reward if exceeded.

Built-in Reward Function

correct_answer

async def correct_answer(
    self,
    parser: Parser,
    completion: Messages,
    answer: str,
    **kwargs
) -> float
Verifies mathematical equivalence between the parsed completion and ground truth answer.
  • parser (Parser): Parser instance, automatically provided by the rubric.
  • completion (Messages): Model's completion to verify.
  • answer (str): Ground truth answer in LaTeX format (wrapped in \boxed{} before verification).
Returns: float - 1.0 if mathematically equivalent, 0.0 otherwise.
The function automatically wraps answers in \boxed{} format before verification.
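The wrapping step amounts to simple string formatting. A minimal sketch (the helper name and the already-boxed check are illustrative, not the library's actual code):

```python
def wrap_boxed(answer: str) -> str:
    """Wrap a ground-truth answer in \\boxed{} unless it already is (illustrative)."""
    return answer if "\\boxed" in answer else f"\\boxed{{{answer}}}"

print(wrap_boxed("4"))                      # \boxed{4}
print(wrap_boxed("\\boxed{\\frac{5}{6}}"))  # already boxed, unchanged
```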

Timeouts

MathRubric implements two-level timeout protection:
  • timeout_seconds (float, default: 5): Soft timeout. Returns 0.0 if verification takes longer than this; logged at debug level.
  • HARD_TIMEOUT_SECONDS (float, default: 120): Hard timeout. Absolute maximum time for verification; logs a warning if exceeded.
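The soft-timeout behavior can be sketched with asyncio.wait_for. This is a minimal stdlib-only illustration (function names are invented for the example; the real rubric additionally layers the 120-second hard ceiling on top):

```python
import asyncio

async def verify_with_soft_timeout(coro, timeout_seconds: float = 5.0) -> float:
    """Illustrative: return 0.0 reward when verification exceeds the soft timeout."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # Soft timeout exceeded: treat as incorrect rather than raising.
        return 0.0

async def slow_verify() -> float:
    """Stand-in for a verification that takes too long."""
    await asyncio.sleep(10)
    return 1.0

print(asyncio.run(verify_with_soft_timeout(slow_verify(), timeout_seconds=0.05)))  # 0.0
```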

Attributes

  • executor (ThreadPoolExecutor): Thread pool for running blocking math_verify operations asynchronously.
All attributes from Rubric are also available.

Example Usage

Basic Math Verification

import verifiers as vf

# Create math rubric with default settings
rubric = vf.MathRubric()

# Score a mathematical response
state = {
    "prompt": "Solve for x: 2x + 5 = 13",
    "completion": [{"role": "assistant", "content": "x = \\boxed{4}"}],
    "answer": "4",
    "task": "math",
    "timing": {"scoring_ms": 0, "total_ms": 0}
}

await rubric.score_rollout(state)
print(f"Reward: {state['reward']}")  # 1.0 if correct

Custom Parser for Boxed Answers

import verifiers as vf
from verifiers.parsers import MaybeThinkParser
from verifiers.utils.data_utils import extract_boxed_answer

# Create rubric expecting \boxed{} format
rubric = vf.MathRubric(
    parser=MaybeThinkParser(extract_fn=extract_boxed_answer)
)

state = {
    "prompt": "What is 1/2 + 1/3?",
    "completion": [{"role": "assistant", "content": "The answer is \\boxed{\\frac{5}{6}}"}],
    "answer": "\\frac{5}{6}",
    "task": "math",
    "timing": {"scoring_ms": 0, "total_ms": 0}
}

await rubric.score_rollout(state)
print(f"Reward: {state['reward']}")  # 1.0 - equivalent fractions

Adding Additional Metrics

rubric = vf.MathRubric()

# Track response length as a metric (weight=0)
def response_length(completion, **kwargs):
    text = completion[-1]["content"] if isinstance(completion, list) else completion
    return len(text)

rubric.add_metric(response_length, weight=0.0)

# Track number of steps
def step_count(completion, **kwargs):
    text = completion[-1]["content"] if isinstance(completion, list) else completion
    return text.count("\n") + 1

rubric.add_metric(step_count, weight=0.0)

await rubric.score_rollout(state)
print(f"Metrics: {state['metrics']}")  # Includes correct_answer, response_length, step_count

Custom Timeout Settings

# Faster timeout for high-throughput evaluation
rubric = vf.MathRubric(
    timeout_seconds=2.0,  # Only allow 2 seconds per verification
    max_workers=100  # More parallelism
)

# Longer timeout for complex symbolic math
rubric = vf.MathRubric(
    timeout_seconds=30.0,  # Allow 30 seconds for complex expressions
    max_workers=10
)

Combining with Other Reward Functions

import verifiers as vf

rubric = vf.MathRubric()

# Reward showing work
async def shows_reasoning(completion, **kwargs):
    """Reward explanations before the answer."""
    text = completion[-1]["content"] if isinstance(completion, list) else completion
    # Require at least 50 characters of explanation before the boxed answer
    boxed_idx = text.find("\\boxed")
    return 1.0 if boxed_idx > 50 else 0.0

rubric.add_reward_func(shows_reasoning, weight=0.2)

# Now reward = 1.0 * correct + 0.2 * shows_reasoning
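With both functions returning 1.0, the weighted sum works out as:

```python
weights = {"correct_answer": 1.0, "shows_reasoning": 0.2}
scores = {"correct_answer": 1.0, "shows_reasoning": 1.0}  # correct answer, with reasoning

# Total reward is the weight-scaled sum of the individual scores.
reward = sum(weights[name] * scores[name] for name in weights)
print(reward)  # 1.2
```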

Handling Edge Cases

rubric = vf.MathRubric()

# Empty completion
state = {
    "prompt": "What is 2+2?",
    "completion": [{"role": "assistant", "content": ""}],
    "answer": "4",
    "task": "math",
    "timing": {"scoring_ms": 0, "total_ms": 0}
}
await rubric.score_rollout(state)
print(state["reward"])  # 0.0 - empty responses get 0

# Equivalent expressions
state = {
    "prompt": "Simplify.",
    "completion": [{"role": "assistant", "content": "\\boxed{2x + 6}"}],
    "answer": "2(x + 3)",
    "task": "math",
    "timing": {"scoring_ms": 0, "total_ms": 0}
}
await rubric.score_rollout(state)
print(state["reward"])  # 1.0 - algebraically equivalent

Mathematical Equivalence

The math_verify library checks various forms of equivalence:
  • Symbolic: 2x + 6 ≡ 2(x + 3)
  • Numerical: 0.333... ≡ 1/3
  • Algebraic: (x+1)^2 ≡ x^2 + 2x + 1
  • Trigonometric: sin^2(x) + cos^2(x) ≡ 1
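For intuition, algebraic equivalence can be approximated by evaluating both expressions at random sample points. This stdlib-only sketch is illustrative; math_verify itself performs symbolic comparison, not sampling:

```python
import math
import random

def numerically_equivalent(f, g, trials: int = 100, tol: float = 1e-9) -> bool:
    """Spot-check two single-variable expressions at random points (illustrative)."""
    rng = random.Random(0)  # fixed seed for reproducibility
    for _ in range(trials):
        x = rng.uniform(-10, 10)
        if not math.isclose(f(x), g(x), rel_tol=tol, abs_tol=tol):
            return False
    return True

print(numerically_equivalent(lambda x: 2 * x + 6, lambda x: 2 * (x + 3)))           # True
print(numerically_equivalent(lambda x: (x + 1) ** 2, lambda x: x * x + 2 * x + 1))  # True
print(numerically_equivalent(lambda x: math.sin(x) ** 2 + math.cos(x) ** 2,
                             lambda x: 1.0))                                        # True
```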

Performance Considerations

Mathematical verification can be computationally expensive:
  • Set appropriate timeout_seconds to avoid hanging on complex expressions
  • Use max_workers to control parallelism and memory usage
  • The thread pool is automatically cleaned up when the rubric is garbage collected
Verification runs in a thread pool executor to avoid blocking the async event loop.
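The executor pattern can be sketched with loop.run_in_executor. This stdlib-only illustration uses a trivial stand-in for the real blocking math_verify call:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def blocking_verify(expr_a: str, expr_b: str) -> bool:
    """Stand-in for a CPU-bound math_verify call (illustrative only)."""
    return expr_a.replace(" ", "") == expr_b.replace(" ", "")

async def verify_async(executor: ThreadPoolExecutor, a: str, b: str) -> float:
    loop = asyncio.get_running_loop()
    # Offload the blocking call so the event loop stays responsive.
    ok = await loop.run_in_executor(executor, blocking_verify, a, b)
    return 1.0 if ok else 0.0

async def main() -> list[float]:
    with ThreadPoolExecutor(max_workers=4) as executor:
        return await asyncio.gather(
            verify_async(executor, "2x + 6", "2x+6"),
            verify_async(executor, "x", "y"),
        )

print(asyncio.run(main()))  # [1.0, 0.0]
```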

Notes

  • Empty or unparseable responses always receive 0.0 reward
  • Timeout violations are logged at debug level for soft timeouts, warning for hard timeouts
  • The rubric suppresses math_verify library timeout warnings (handled internally)
  • All parsing exceptions return 0.0 rather than raising errors
