All benchmark numbers are measured in clock cycles from the simulated machine running the standard test configuration (forest_height=10, rounds=16, batch_size=256).

Baseline Performance

Starting Point

147,734 cycles - The unoptimized scalar implementation included in this repository
This is the BASELINE constant defined in perf_takehome.py. Any optimization that reduces cycle count below this threshold demonstrates improvement.

Speedup Calculation

Your speedup is calculated as:
speedup = 147734 / your_cycle_count
For example:
  • 18,532 cycles = 7.97x speedup
  • 1,790 cycles = 82.5x speedup
  • 1,363 cycles = 108.4x speedup
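The calculation above can be expressed as a small helper for sanity-checking your own numbers (the `BASELINE` constant mirrors the one defined in perf_takehome.py):

```python
BASELINE = 147_734  # unoptimized scalar implementation (perf_takehome.py)

def speedup(cycle_count: int) -> float:
    """Speedup of an optimized kernel relative to the repo baseline."""
    return BASELINE / cycle_count

print(f"{speedup(18_532):.2f}x")  # 7.97x
print(f"{speedup(1_363):.1f}x")   # 108.4x
```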

Claude Performance Tiers

All Claude benchmarks below are for the 2-hour version which started at 18,532 cycles (not the 147,734 baseline in this repo).
| Model | Configuration | Cycles | Speedup | Notes |
|---|---|---|---|---|
| Claude Opus 4 | Many hours (test-time compute) | 2,164 | 68.3x | First major breakthrough |
| Claude Opus 4.5 | Casual session | 1,790 | 82.5x | Matches best human 2hr performance |
| Claude Opus 4.5 | 2 hours (harness) | 1,579 | 93.5x | Structured optimization session |
| Claude Sonnet 4.5 | Many hours | 1,548 | 95.5x | Extended compute time |
| Claude Opus 4.5 | 11.5 hours (harness) | 1,487 | 99.3x | Best at launch |
| Claude Opus 4.5 | Improved harness | 1,363 | 108.4x | State-of-the-art AI performance |
| Best human ever | Unknown | ??? | ???x | Substantially better (undisclosed) |
The best human performance is substantially better than 1,363 cycles, but Anthropic hasn’t disclosed exactly how much better.

Updated Starting Point

When Anthropic updated the challenge to 2 hours, they provided starter code at 18,532 cycles (7.97x faster than this repo’s baseline). This gave candidates a head start with some basic optimizations already applied.
If you reach this level from the 147,734 baseline, you’ve recovered the optimizations included in the updated starter code.

Submission Guidelines

Threshold for Recruiting Interest

Beat Claude Opus 4.5

If you optimize below 1,487 cycles, Anthropic wants to hear from you!
Email your submission to: [email protected]
New model releases may change the threshold that interests the recruiting team; there is no guarantee the thresholds listed here stay current with the latest releases.

What to Include

1. Your optimized code: include your modified perf_takehome.py or the specific build_kernel() implementation
2. Validation proof: show the output from python tests/submission_tests.py with your cycle count
3. Test integrity check: demonstrate that git diff origin/main tests/ is empty
4. Resume (ideally): attach your resume if you're interested in opportunities at Anthropic
5. Explanation (bonus): describe your optimization approach and key insights

Validation Requirements

All first-day submissions below 1,300 cycles were invalid: language models had modified the tests to make the problem easier.

Validation Commands

Run these commands and mention that you did so when submitting:
# Verify tests folder is unchanged
git diff origin/main tests/
This should produce no output. Any changes to the tests/ folder invalidate your submission.
# Check which thresholds you pass
python tests/submission_tests.py
This runs both correctness tests and speed benchmarks. Use the cycle count this prints, not the one from perf_takehome.py.
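The integrity check can also be scripted. The helper below is a hypothetical pre-submission convenience, not part of the repo; it just wraps the same git command shown above:

```python
import subprocess

def tests_unmodified() -> bool:
    """Return True if tests/ matches origin/main (i.e. the diff is empty)."""
    result = subprocess.run(
        ["git", "diff", "origin/main", "tests/"],
        capture_output=True, text=True,
    )
    return result.returncode == 0 and result.stdout == ""
```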

Test Thresholds

The submission_tests.py file defines these test cases:
class SpeedTests(unittest.TestCase):
    def test_kernel_speedup(self):
        assert cycles() < BASELINE  # 147734
    
    def test_kernel_updated_starting_point(self):
        assert cycles() < 18532
    
    def test_opus4_many_hours(self):
        assert cycles() < 2164
    
    def test_opus45_casual(self):
        assert cycles() < 1790
    
    def test_opus45_2hr(self):
        assert cycles() < 1579
    
    def test_sonnet45_many_hours(self):
        assert cycles() < 1548
    
    def test_opus45_11hr(self):
        assert cycles() < 1487
    
    def test_opus45_improved_harness(self):
        assert cycles() < 1363
You don’t need to pass all tests to be impressive. The difficulty is non-linear: each tier requires increasingly sophisticated optimizations.
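To see at a glance which tiers a given cycle count clears, the thresholds above can be collected into a small lookup. The tier names here are informal labels for the tests, not identifiers from the repo:

```python
# Thresholds from submission_tests.py, paired with informal labels
TIERS = [
    ("baseline", 147_734),
    ("updated starting point", 18_532),
    ("Opus 4 many hours", 2_164),
    ("Opus 4.5 casual", 1_790),
    ("Opus 4.5 2hr", 1_579),
    ("Sonnet 4.5 many hours", 1_548),
    ("Opus 4.5 11.5hr", 1_487),
    ("Opus 4.5 improved harness", 1_363),
]

def tiers_passed(cycle_count: int) -> list[str]:
    """Names of every speed test a given cycle count would pass."""
    return [name for name, threshold in TIERS if cycle_count < threshold]

# 1,790 cycles clears only the first three tiers: the comparison is
# strict, so exactly matching a threshold does not pass that test.
print(tiers_passed(1_790))
```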

Correctness Testing

Before speed testing, the suite runs:
class CorrectnessTests(unittest.TestCase):
    def test_kernel_correctness(self):
        for i in range(8):
            do_kernel_test(10, 16, 256)
This runs 8 iterations with different random seeds to ensure your kernel produces correct results consistently.
If correctness tests fail, your cycle count is artificially inflated to BASELINE * 2 (295,468 cycles).
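That penalty rule can be sketched as follows. This is a simplification for illustration; the actual harness logic may differ:

```python
BASELINE = 147_734  # repo baseline cycle count

def effective_cycles(measured: int, passed_correctness: bool) -> int:
    """Cycle count used for scoring: failing correctness inflates to 2x baseline."""
    return measured if passed_correctness else BASELINE * 2

print(effective_cycles(1_500, True))   # 1500
print(effective_cycles(1_500, False))  # 295468
```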

Comparing Your Results

Interpreting Your Performance

Here’s a rough guide to what different cycle counts mean:
  • Below 147,734: You’ve beaten the baseline! This shows basic understanding of the problem and some successful optimizations.
  • Below 18,532: You’ve matched or exceeded the optimizations in the 2-hour version’s starter code. Strong baseline optimization skills.
  • Below 2,164: You’re in rarefied air. This required deep understanding of the machine architecture and creative algorithm changes.
  • Below 1,790: You’ve matched what top human candidates achieved in 2 hours. Excellent performance.
  • Below 1,487: Submit to Anthropic! You’ve exceeded the best AI performance at Opus 4.5 launch. This demonstrates exceptional optimization skills.
  • Near 1,363: You’re approaching the best known AI performance. Anthropic will be very interested in your techniques.
  • Below 1,363: The best human performance is substantially better than 1,363 cycles. If you’re getting close to this range, you’re demonstrating world-class optimization ability.

Next Steps

  • Understand the Task: review the task requirements and rules
  • Challenge Overview: learn about the challenge background
Good luck! The journey from baseline to elite performance involves discovering increasingly subtle optimizations. Each breakthrough brings you closer to the theoretical minimum.
