# Baseline Performance

## Starting Point
147,734 cycles - the unoptimized scalar implementation included in this repository, defined as the BASELINE constant in perf_takehome.py. Any optimization that reduces the cycle count below this threshold demonstrates improvement.
## Speedup Calculation
Your speedup is calculated as the baseline cycle count divided by your measured cycle count (147,734 / your_cycles). For example:

- 18,532 cycles = 7.97x speedup
- 1,790 cycles = 82.5x speedup
- 1,363 cycles = 108.4x speedup
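The formula above can be expressed as a minimal Python sketch; `BASELINE` mirrors the constant defined in perf_takehome.py:

```python
# Minimal sketch of the speedup calculation; BASELINE mirrors the
# constant defined in perf_takehome.py.
BASELINE = 147_734

def speedup(cycles: int) -> float:
    """Speedup relative to the unoptimized scalar baseline."""
    return BASELINE / cycles

print(f"{speedup(18_532):.2f}x")  # 7.97x
print(f"{speedup(1_363):.1f}x")   # 108.4x
```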
## Claude Performance Tiers
All Claude benchmarks below are for the 2-hour version, which started at 18,532 cycles (not the 147,734-cycle baseline in this repo).
| Model | Configuration | Cycles | Speedup | Notes |
|---|---|---|---|---|
| Claude Opus 4 | Many hours (test-time compute) | 2,164 | 68.3x | First major breakthrough |
| Claude Opus 4.5 | Casual session | 1,790 | 82.5x | Matches best human 2hr performance |
| Claude Opus 4.5 | 2 hours (harness) | 1,579 | 93.5x | Structured optimization session |
| Claude Sonnet 4.5 | Many hours | 1,548 | 95.5x | Extended compute time |
| Claude Opus 4.5 | 11.5 hours (harness) | 1,487 | 99.3x | Best at launch |
| Claude Opus 4.5 | Improved harness | 1,363 | 108.4x | State-of-the-art AI performance |
| Best human ever | Unknown | ??? | ???x | Substantially better (undisclosed) |
## Updated Starting Point
When Anthropic updated the challenge to 2 hours, they provided starter code at 18,532 cycles (7.97x faster than this repo’s baseline). This gave candidates a head start with some basic optimizations already applied.
# Submission Guidelines

## Threshold for Recruiting Interest

### Beat Claude Opus 4.5
If you optimize below 1,487 cycles, Anthropic wants to hear from you!
## What to Include

### Your optimized code

Include your modified perf_takehome.py or the specific build_kernel() implementation.

## Validation Requirements
### Validation Commands
Run the validation commands and mention that you did so when submitting:

- Verify that the tests/ folder is unmodified; the check should produce no output. Any changes to the tests/ folder invalidate your submission.
- Run the submission test suite; this runs both correctness tests and speed benchmarks. Use the cycle count it prints, not the one from perf_takehome.py.

### Test Thresholds
The submission_tests.py file defines the test cases and their cycle-count thresholds.
You don’t need to pass all tests to be impressive. The difficulty is non-linear—each tier requires increasingly sophisticated optimizations.
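Though the real test cases live in tests/submission_tests.py, the tier structure can be sketched as a simple threshold check. The cycle values below are the ones quoted in this guide; the tier names are invented for illustration:

```python
# Hypothetical sketch of the tiered thresholds; the real test cases in
# tests/submission_tests.py are authoritative. Cycle values come from the
# performance tiers quoted in this guide; the tier names are invented.
THRESHOLDS = {
    "beat_updated_starting_point": 18_532,
    "beat_opus_4_many_hours": 2_164,
    "match_best_human_2hr": 1_790,
    "beat_opus_4_5_launch": 1_487,
    "beat_improved_harness": 1_363,
}

def tiers_passed(measured_cycles: int) -> list[str]:
    """Names of every tier a measured cycle count beats (strictly below)."""
    return [name for name, limit in THRESHOLDS.items() if measured_cycles < limit]

print(tiers_passed(1_500))  # beats the first three tiers
```

This mirrors the non-linear difficulty: each successive tier trims the budget by a smaller absolute margin but demands far more sophisticated optimizations.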
### Correctness Testing

Before speed testing, the suite runs the correctness tests.

# Comparing Your Results
## Interpreting Your Performance
Here’s a rough guide to what different cycle counts mean:

### < 147,734 cycles: Any improvement
You’ve beaten the baseline! This shows basic understanding of the problem and some successful optimizations.
### < 18,532 cycles: Updated starting point
You’ve matched or exceeded the optimizations in the 2-hour version’s starter code. Strong baseline optimization skills.
### < 2,164 cycles: Beat Opus 4 (many hours)
You’re in rarefied air. This required deep understanding of the machine architecture and creative algorithm changes.
### < 1,790 cycles: Match best human 2hr performance
You’ve matched what top human candidates achieved in 2 hours. Excellent performance.
### < 1,487 cycles: Beat Opus 4.5 launch performance
Submit to Anthropic! You’ve exceeded the best AI performance at Opus 4.5 launch. This demonstrates exceptional optimization skills.
### < 1,363 cycles: Beat improved AI harness
You’re approaching the best known AI performance. Anthropic will be very interested in your techniques.
### Approaching best human: Elite territory
The best human performance is substantially better than 1,363 cycles. If you’re getting close to this range, you’re demonstrating world-class optimization ability.
# Next Steps

- Understand the Task: review the task requirements and rules.
- Challenge Overview: learn about the challenge background.