What It Tests
This benchmark assesses:
- Large Integer Arithmetic: Ability to correctly add numbers with 2-30 digits
- Numerical Accuracy: Precise calculation without rounding or approximation errors
- Output Format Compliance: Returning only the numeric result without explanations
Implementation Details
The benchmark is implemented in the add_two_ints() function (main.py:194-269).
Random Integer Generation
For each iteration, two random integers are generated:
- Random length between 2 and 30 digits for the first integer
- Random length between 2 and 30 digits for the second integer
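The generation step above could be sketched roughly as follows. This is an illustration, not the code in main.py: the helper names random_int_with_digits and generate_pair are hypothetical, and the sketch assumes each integer is drawn uniformly among values with exactly the chosen number of digits (so the leading digit is never zero), which the description above does not state.

```python
import random
from typing import Optional, Tuple


def random_int_with_digits(n_digits: int, rng: Optional[random.Random] = None) -> int:
    """Return a random integer with exactly n_digits decimal digits (hypothetical helper)."""
    rng = rng or random.Random()
    low = 10 ** (n_digits - 1)   # smallest n-digit number, e.g. 10 for n_digits=2
    high = 10 ** n_digits - 1    # largest n-digit number, e.g. 99 for n_digits=2
    return rng.randint(low, high)


def generate_pair(rng: Optional[random.Random] = None) -> Tuple[int, int]:
    """Pick an independent length in [2, 30] for each operand, then draw the operands."""
    rng = rng or random.Random()
    len1 = rng.randint(2, 30)
    len2 = rng.randint(2, 30)
    return random_int_with_digits(len1, rng), random_int_with_digits(len2, rng)


if __name__ == "__main__":
    int1, int2 = generate_pair()
    print(int1, int2, int1 + int2)
```

Because Python integers have arbitrary precision, the reference sum int1 + int2 is exact even for 30-digit operands.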
Prompt Template
The model receives a fixed prompt template. The prompt explicitly instructs the model to output only the numeric sum, with no calculations, working, explanations, or additional text.
Success Criteria
The benchmark validates responses using integer comparison with error handling:
- The output can be parsed as an integer (no non-numeric characters)
- The parsed integer exactly equals int1 + int2
- Leading and trailing whitespace is stripped before parsing
Error Handling
The benchmark handles two types of failures:
- ValueError/TypeError: Output contains non-numeric characters or cannot be parsed
- Incorrect Sum: Output is numeric but mathematically incorrect
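The success criteria and the two failure types can be sketched as a single check. This is a minimal illustration under assumptions, not the body of add_two_ints(): the function name classify_response and the three label strings are hypothetical, and the model output is assumed to arrive as a string (or None when no output was produced).

```python
from typing import Optional


def classify_response(raw_output: Optional[str], int1: int, int2: int) -> str:
    """Return 'correct', 'parse_error', or 'incorrect_sum' (hypothetical labels)."""
    try:
        if raw_output is None:
            # Treat a missing response like any other unparseable output.
            raise TypeError("no output to parse")
        # Strip surrounding whitespace, then require the rest to be an integer.
        value = int(raw_output.strip())
    except (ValueError, TypeError):
        return "parse_error"

    # Exact integer comparison: no tolerance, no rounding.
    if value == int1 + int2:
        return "correct"
    return "incorrect_sum"


if __name__ == "__main__":
    a, b = 123456789012345, 987654321098765
    for text in [" 1111111110111110\n", "The sum is 1111111110111110", "1111111110111111"]:
        print(repr(text), "->", classify_response(text, a, b))
```

One caveat with this approach: Python's int() also accepts a leading sign and underscore digit separators, so outputs such as +42 or 1_000 would parse successfully under this sketch.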
Example
Input Integers
Prompt Sent to Model
Expected Output
Result Recording
Each test result is recorded with:
Failure Cases
Common failure modes include:
- Showing Work: 123456789012345 + 987654321098765 = 1111111110111110
- Explanation: The sum is 1111111110111110
- Incorrect Calculation: 1111111110111111 (off by one)
- Scientific Notation: 1.11111111e15 (not accepted)
- Rounding Errors: 1111111110111000 (lost precision)
- Non-numeric Output: One trillion, one hundred eleven billion...
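To illustrate why the scientific-notation and rounded outputs count as failures, the sketch below uses the operands from the failure cases above. It is an illustration only and not taken from main.py; the helper parses_as_int is hypothetical. It shows that Python's int() rejects scientific notation outright, and that going through a float would discard low-order digits anyway.

```python
# Operands taken from the "Showing Work" failure case.
EXPECTED = 123456789012345 + 987654321098765


def parses_as_int(text: str) -> bool:
    """Hypothetical helper: True if the stripped text is accepted by int()."""
    try:
        int(text.strip())
        return True
    except ValueError:
        return False


# Scientific notation is not valid input for int(), so it is a parse failure.
sci_ok = parses_as_int("1.11111111e15")

# Even if the text were parsed as a float first, the low-order digits are
# already missing from the text, so the value cannot equal the exact sum.
via_float = int(float("1.11111111e15"))

# A double has a 53-bit significand, so it cannot hold a 30-digit integer
# exactly; converting to float and back changes the value.
big = 10**29 + 1
round_trip = int(float(big))

if __name__ == "__main__":
    print("int() accepts scientific notation:", sci_ok)
    print("float round trip equals exact sum:", via_float == EXPECTED)
    print("30-digit value survives float:", round_trip == big)
```

This is why an exact integer comparison, rather than a floating-point or tolerance-based one, is the appropriate check for operands of this size.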
Performance Metrics
The benchmark tracks:
- Success Rate: Percentage of correct additions across all tries
- Duration: Time taken for each calculation
- Error Types: Whether failures are due to parsing errors or incorrect arithmetic
- Reasoning: Optional reasoning traces from reasoning-capable models
This benchmark is particularly useful for identifying models that struggle with large-number arithmetic or with strict output format requirements.
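As a rough sketch of how the tracked metrics could be aggregated across tries: the TrialResult record and its field names (success, duration_s, error_type) are hypothetical and chosen for illustration, since the actual result structure is not shown here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TrialResult:
    """Hypothetical per-try record; field names are assumptions."""
    success: bool
    duration_s: float
    error_type: Optional[str] = None  # e.g. "parse_error" or "incorrect_sum"


def summarize(results: List[TrialResult]) -> Dict[str, object]:
    """Compute success rate (percent), mean duration, and failure counts by type."""
    total = len(results)
    if total == 0:
        return {"success_rate": 0.0, "mean_duration_s": 0.0, "error_counts": {}}
    successes = sum(1 for r in results if r.success)
    failures = Counter(r.error_type for r in results if not r.success)
    return {
        "success_rate": 100.0 * successes / total,
        "mean_duration_s": sum(r.duration_s for r in results) / total,
        "error_counts": dict(failures),
    }


if __name__ == "__main__":
    demo = [
        TrialResult(True, 1.0),
        TrialResult(False, 3.0, "parse_error"),
        TrialResult(False, 2.0, "incorrect_sum"),
        TrialResult(True, 2.0),
    ]
    print(summarize(demo))
```

Keeping parse errors and incorrect sums in separate counters makes it possible to tell format-compliance problems apart from arithmetic problems.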
Difficulty Factors
The challenge of this benchmark increases with:
- Number Size: Larger integers (approaching 30 digits) are more difficult
- Carry Operations: Numbers requiring many carry operations are harder
- Output Discipline: Models must resist explaining their work
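One way to quantify the carry-operations factor is to count how many digit positions produce a carry during schoolbook addition. The function below is a hypothetical analysis helper, not part of the benchmark as described, and assumes non-negative operands.

```python
def count_carries(a: int, b: int) -> int:
    """Count digit positions that generate a carry when adding a and b in base 10."""
    carries = 0
    carry = 0
    while a > 0 or b > 0:
        # Add the current lowest digits plus any carry from the previous position.
        digit_sum = a % 10 + b % 10 + carry
        carry = 1 if digit_sum >= 10 else 0
        carries += carry
        a //= 10
        b //= 10
    return carries


if __name__ == "__main__":
    print(count_carries(123, 456))  # no position sums to 10 or more
    print(count_carries(999, 1))    # the carry ripples through every position
    print(count_carries(123456789012345, 987654321098765))
```

Grouping results by carry count or operand length could help show whether failures concentrate on the harder inputs.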