MathEquivalence compares mathematical answers by normalizing LaTeX notation and then attempting numeric comparison. It is implemented in pure Python with no external math libraries.
Constructor
MathEquivalence takes no constructor parameters.
score()
The unmodified example dict. Must contain an `"answer"` key with the ground-truth mathematical expression or value.
The output dict returned by the system under test. Must contain a `"response"` key with the model’s answer string. If the response contains `\boxed{...}`, the content of the box is extracted and used as the answer.
Return values
1.0 if the reference and response are mathematically equivalent, otherwise 0.0.
If `answer` is empty, returns `math_equiv: 1.0`. If `response` is empty, returns `math_equiv: 0.0`.
Examples
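The input and output shapes described above can be illustrated with a small stand-in. This is a sketch only: `score_sketch` is a hypothetical function, not the real `MathEquivalence.score()`, and it assumes the result is a plain dict with a `math_equiv` key, that the empty-answer check runs before the empty-response check, and that a simple whitespace-normalized string match is enough for the demonstration.

```python
# Hypothetical stand-in for illustration only; not the library's implementation.
def score_sketch(example: dict, output: dict) -> dict:
    """Mimic the documented contract: read "answer" and "response", return math_equiv."""
    answer = str(example.get("answer", "")).strip()
    response = str(output.get("response", "")).strip()

    # Documented edge cases: an empty answer scores 1.0, an empty response 0.0.
    # Checking the answer first is an assumption about ordering.
    if not answer:
        return {"math_equiv": 1.0}
    if not response:
        return {"math_equiv": 0.0}

    # Placeholder comparison: case-insensitive match after collapsing whitespace.
    same = " ".join(answer.lower().split()) == " ".join(response.lower().split())
    return {"math_equiv": 1.0 if same else 0.0}


example = {"answer": "42"}
print(score_sketch(example, {"response": " 42 "}))       # {'math_equiv': 1.0}
print(score_sketch(example, {"response": "41"}))         # {'math_equiv': 0.0}
print(score_sketch({"answer": ""}, {"response": "7"}))   # {'math_equiv': 1.0}
print(score_sketch(example, {"response": ""}))           # {'math_equiv': 0.0}
```

The real scorer does considerably more than this string match; the stand-in exists only to show which keys are read and what the score values look like.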
Auto-wired datasets
MathEquivalence is automatically applied when any of the following datasets are selected:
| CLI name | Dataset |
|---|---|
| math | MATH (competition mathematics) |
| gsm8k | GSM8K (grade school math) |
| mgsm | MGSM (multilingual math; configurable, e.g. mgsm:de) |
Implementation notes
Equivalence is checked in a cascade; the first matching condition returns `True`:
- Boxed extraction: if the response contains `\boxed{...}`, the content is extracted using brace-depth matching before any other processing.
- Case-insensitive string match: whitespace-normalized strings are compared directly.
- LaTeX normalization: both strings are normalized and compared again. Normalization handles:
  - `\frac{a}{b}` → `(a)/(b)`
  - `\text{}`, `\mathrm{}`, `\textbf{}` → plain text
  - `\left`, `\right` → removed
  - `\sqrt{x}` → `sqrt(x)`
  - `\cdot`, `\times` → `*`
  - `\pi` → `pi`, `\infty` → `inf`
  - `\pm` → `+-`
  - LaTeX spacing commands removed
- Numeric comparison: if both normalized strings parse as numbers (including fractions `a/b` and percentages `x%`), they are compared with a relative tolerance of `1e-6`.
- Numeric comparison on original strings: the same numeric comparison is attempted on the pre-normalization strings as a final fallback.
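
The cascade above might be sketched roughly as follows. This is an illustrative reconstruction from the description, not the library's source: the function names (`extract_boxed`, `normalize`, `to_number`, `is_equiv`) are hypothetical, and several details are assumptions, namely that the last `\boxed{...}` wins, that `x%` is read as `x/100`, that the regexes only handle non-nested arguments, and that `math.isclose` stands in for the relative-tolerance check.

```python
import math
import re


def extract_boxed(text: str) -> str:
    """Return the content of the last \\boxed{...} (assumption), else the text unchanged."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return text
    begin = i = start + len("\\boxed{")
    depth = 1  # brace-depth matching: track nesting until the box closes
    while i < len(text):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i]
        i += 1
    return text  # unbalanced braces: fall back to the raw text


def normalize(s: str) -> str:
    """Apply the listed LaTeX rewrites (non-nested arguments only in this sketch)."""
    s = re.sub(r"\\(?:left|right)(?![a-zA-Z])", "", s)
    s = re.sub(r"\\(?:text|mathrm|textbf)\{([^{}]*)\}", r"\1", s)
    s = re.sub(r"\\frac\{([^{}]*)\}\{([^{}]*)\}", r"(\1)/(\2)", s)
    s = re.sub(r"\\sqrt\{([^{}]*)\}", r"sqrt(\1)", s)
    s = re.sub(r"\\(?:cdot|times)", "*", s)
    s = s.replace("\\pi", "pi").replace("\\infty", "inf").replace("\\pm", "+-")
    s = re.sub(r"\\[,;:! ]|\\q?quad", "", s)  # common LaTeX spacing commands
    return "".join(s.split()).lower()


def to_number(s: str):
    """Parse a plain number, a fraction a/b, or a percentage x%; None on failure."""
    s = s.strip()
    scale = 1.0
    if s.endswith("%"):
        s, scale = s[:-1], 0.01  # assumption: x% means x/100
    try:
        if "/" in s:
            num, den = s.split("/", 1)
            return scale * float(num.strip("() ")) / float(den.strip("() "))
        return scale * float(s)
    except (ValueError, ZeroDivisionError):
        return None


def is_equiv(answer: str, response: str) -> bool:
    response = extract_boxed(response)
    # Case-insensitive match on whitespace-normalized strings.
    if " ".join(answer.split()).lower() == " ".join(response.split()).lower():
        return True
    # Match after LaTeX normalization.
    na, nr = normalize(answer), normalize(response)
    if na == nr:
        return True
    # Numeric comparison on normalized strings, then on the originals.
    for x, y in ((na, nr), (answer, response)):
        fx, fy = to_number(x), to_number(y)
        if fx is not None and fy is not None and math.isclose(fx, fy, rel_tol=1e-6):
            return True
    return False


print(is_equiv("\\frac{1}{2}", "The answer is \\boxed{0.5}"))  # True
print(is_equiv("2\\pi", "\\boxed{2 \\pi}"))                    # True
print(is_equiv("3", "\\boxed{4}"))                             # False
```

Ordering the checks from cheapest to most permissive means exact and near-exact answers return early, while the numeric step catches cases such as `\frac{1}{2}` versus `0.5` that no string rewrite would unify.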
