All benchmark numbers are measured in clock cycles from the simulated machine running the standard test configuration (forest_height=10, rounds=16, batch_size=256).

Baseline Performance

Starting Point

147,734 cycles - The unoptimized scalar implementation included in this repository
This is the BASELINE constant defined in perf_takehome.py. Any optimization that reduces cycle count below this threshold demonstrates improvement.

Speedup Calculation

Your speedup is calculated as:
speedup = 147734 / your_cycle_count
For example:
  • 18,532 cycles = 7.97x speedup
  • 1,790 cycles = 82.5x speedup
  • 1,363 cycles = 108.4x speedup
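The calculation above can be expressed as a small helper for sanity-checking your own numbers (the `BASELINE` constant mirrors the one defined in perf_takehome.py):

```python
BASELINE = 147_734  # unoptimized scalar implementation (perf_takehome.py)

def speedup(cycle_count: int) -> float:
    """Speedup of an optimized kernel relative to the repo baseline."""
    return BASELINE / cycle_count

print(f"{speedup(18_532):.2f}x")  # 7.97x
print(f"{speedup(1_363):.1f}x")   # 108.4x
```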

Claude Performance Tiers

All Claude benchmarks below are for the 2-hour version which started at 18,532 cycles (not the 147,734 baseline in this repo).
| Model | Configuration | Cycles | Speedup | Notes |
|---|---|---|---|---|
| Claude Opus 4 | Many hours (test-time compute) | 2,164 | 68.3x | First major breakthrough |
| Claude Opus 4.5 | Casual session | 1,790 | 82.5x | Matches best human 2hr performance |
| Claude Opus 4.5 | 2 hours (harness) | 1,579 | 93.5x | Structured optimization session |
| Claude Sonnet 4.5 | Many hours | 1,548 | 95.5x | Extended compute time |
| Claude Opus 4.5 | 11.5 hours (harness) | 1,487 | 99.3x | Best at launch |
| Claude Opus 4.5 | Improved harness | 1,363 | 108.4x | State-of-the-art AI performance |
| Best human ever | Unknown | ??? | ???x | Substantially better (undisclosed) |
The best human performance is substantially better than 1,363 cycles, but Anthropic hasn’t disclosed exactly how much better.

Updated Starting Point

When Anthropic updated the challenge to 2 hours, they provided starter code at 18,532 cycles (7.97x faster than this repo’s baseline). This gave candidates a head start with some basic optimizations already applied.
If you reach this level from the 147,734 baseline, you’ve recovered the optimizations included in the updated starter code.

Submission Guidelines

Threshold for Recruiting Interest

Beat Claude Opus 4.5

If you optimize below 1,487 cycles, Anthropic wants to hear from you!
Email your submission to: [email protected]
New model releases may change the threshold that interests the recruiting team; there is no guarantee the thresholds listed here stay current with the latest releases.

What to Include

1. Your optimized code: include your modified perf_takehome.py or the specific build_kernel() implementation
2. Validation proof: show the output from python tests/submission_tests.py with your cycle count
3. Test integrity check: demonstrate that git diff origin/main tests/ is empty
4. Resume (ideally): attach your resume if you're interested in opportunities at Anthropic
5. Explanation (bonus): describe your optimization approach and key insights

Validation Requirements

All first-day submissions below 1,300 cycles were invalid: language models had modified the tests to make the problem easier.

Validation Commands

Run these commands and mention that you did so when submitting:
# Verify tests folder is unchanged
git diff origin/main tests/
This should produce no output. Any changes to the tests/ folder invalidate your submission.
# Check which thresholds you pass
python tests/submission_tests.py
This runs both correctness tests and speed benchmarks. Use the cycle count this prints, not the one from perf_takehome.py.
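The integrity check can also be scripted. The helper below is a hypothetical pre-submission convenience, not part of the repo; it just wraps the same git command shown above:

```python
import subprocess

def tests_unmodified() -> bool:
    """Return True if tests/ matches origin/main (i.e. the diff is empty)."""
    result = subprocess.run(
        ["git", "diff", "origin/main", "tests/"],
        capture_output=True, text=True,
    )
    return result.returncode == 0 and result.stdout == ""
```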

Test Thresholds

The submission_tests.py file defines these test cases:
class SpeedTests(unittest.TestCase):
    def test_kernel_speedup(self):
        assert cycles() < BASELINE  # 147734
    
    def test_kernel_updated_starting_point(self):
        assert cycles() < 18532
    
    def test_opus4_many_hours(self):
        assert cycles() < 2164
    
    def test_opus45_casual(self):
        assert cycles() < 1790
    
    def test_opus45_2hr(self):
        assert cycles() < 1579
    
    def test_sonnet45_many_hours(self):
        assert cycles() < 1548
    
    def test_opus45_11hr(self):
        assert cycles() < 1487
    
    def test_opus45_improved_harness(self):
        assert cycles() < 1363
You don’t need to pass all tests to be impressive. The difficulty is non-linear: each tier requires increasingly sophisticated optimizations.
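To see at a glance which tiers a given cycle count clears, the thresholds above can be collected into a small lookup. The tier names here are informal labels for the tests, not identifiers from the repo:

```python
# Thresholds from submission_tests.py, paired with informal labels
TIERS = [
    ("baseline", 147_734),
    ("updated starting point", 18_532),
    ("Opus 4 many hours", 2_164),
    ("Opus 4.5 casual", 1_790),
    ("Opus 4.5 2hr", 1_579),
    ("Sonnet 4.5 many hours", 1_548),
    ("Opus 4.5 11.5hr", 1_487),
    ("Opus 4.5 improved harness", 1_363),
]

def tiers_passed(cycle_count: int) -> list[str]:
    """Names of every speed test a given cycle count would pass."""
    return [name for name, threshold in TIERS if cycle_count < threshold]

# 1,790 cycles clears only the first three tiers: the comparison is
# strict, so exactly matching a threshold does not pass that test.
print(tiers_passed(1_790))
```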

Correctness Testing

Before speed testing, the suite runs:
class CorrectnessTests(unittest.TestCase):
    def test_kernel_correctness(self):
        for i in range(8):
            do_kernel_test(10, 16, 256)
This runs 8 iterations with different random seeds to ensure your kernel produces correct results consistently.
If correctness tests fail, your cycle count is artificially inflated to BASELINE * 2 (295,468 cycles).
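That penalty rule can be sketched as follows. This is a simplification for illustration; the actual harness logic may differ:

```python
BASELINE = 147_734  # repo baseline cycle count

def effective_cycles(measured: int, passed_correctness: bool) -> int:
    """Cycle count used for scoring: failing correctness inflates to 2x baseline."""
    return measured if passed_correctness else BASELINE * 2

print(effective_cycles(1_500, True))   # 1500
print(effective_cycles(1_500, False))  # 295468
```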

Comparing Your Results

Interpreting Your Performance

Here’s a rough guide to what different cycle counts mean:
  • Below 147,734: You’ve beaten the baseline! This shows basic understanding of the problem and some successful optimizations.
  • Below 18,532: You’ve matched or exceeded the optimizations in the 2-hour version’s starter code. Strong baseline optimization skills.
  • Below 2,164: You’re in rarefied air. This required deep understanding of the machine architecture and creative algorithm changes.
  • Below 1,790: You’ve matched what top human candidates achieved in 2 hours. Excellent performance.
  • Below 1,487: Submit to Anthropic! You’ve exceeded the best AI performance at Opus 4.5 launch. This demonstrates exceptional optimization skills.
  • Near 1,363: You’re approaching the best known AI performance. Anthropic will be very interested in your techniques.
  • Below 1,363: The best human performance is substantially better than 1,363 cycles. If you’re getting close to this range, you’re demonstrating world-class optimization ability.

Next Steps

  • Understand the Task: review the task requirements and rules
  • Challenge Overview: learn about the challenge background
Good luck! The journey from baseline to elite performance involves discovering increasingly subtle optimizations. Each breakthrough brings you closer to the theoretical minimum.
