What is YC-Bench?
YC-Bench is a long-horizon deterministic benchmark for evaluating LLM agents. The agent plays the role of CEO of an AI startup over a simulated 1–3 year period, operating exclusively through a CLI tool against a SQLite-backed discrete-event simulation. Unlike short-horizon benchmarks that test single-turn or simple multi-turn interactions, YC-Bench evaluates whether agents can manage compounding decisions sustained over hundreds of turns:

- Prestige specialization across 4 technical domains
- Employee allocation and productivity inference
- Cash flow management under monthly payroll pressure
- Deadline risk assessment and capacity planning
Why Long-Horizon Benchmarks Matter
Most existing agent benchmarks test isolated capabilities or short sequences of actions. Real-world agent deployments require:

- Strategic coherence over extended time horizons
- Compounding decision quality where early mistakes cascade
- Resource management under continuous constraints
- Pattern recognition from incomplete information
What Makes YC-Bench Unique
Deterministic Simulation
Every run with the same seed produces identical results. This enables:

- Reproducible comparisons between models
- Precise debugging of agent behavior
- Controlled difficulty scaling via configuration presets
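
The seed-based reproducibility described above can be illustrated with a minimal sketch. This is a toy model, not YC-Bench's engine: the `simulate` function, the starting cash, and the daily cash swing are all hypothetical. The only point is that a privately seeded random number generator yields the same trajectory for the same seed.

```python
import random


def simulate(seed: int, days: int = 5) -> list[float]:
    """Toy stand-in for a seeded simulation (hypothetical, not YC-Bench code)."""
    rng = random.Random(seed)  # private RNG, so no global state leaks between runs
    cash = 100_000.0           # assumed starting cash, for illustration only
    history = []
    for _ in range(days):
        cash += rng.uniform(-5_000, 5_000)  # hypothetical daily cash swing
        history.append(round(cash, 2))
    return history


run_a = simulate(seed=42)
run_b = simulate(seed=42)
run_c = simulate(seed=7)

print(run_a == run_b)  # same seed, same trajectory
print(run_a == run_c)  # different seed, different trajectory
```

Because each run owns its generator, two models evaluated on the same seed face the same sequence of events, which is what makes side-by-side comparison meaningful.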
Hidden Information
Employees have per-domain skill rates that are invisible to the agent. The agent sees only:

- Employee tier (junior/mid/senior)
- Monthly salary
- Current task assignments
Compounding Complexity
- Prestige decay: Every domain loses prestige daily. Neglected domains decay back toward 1.0.
- Salary bumps: Each successful task completion awards assigned employees a 1% raise, creating compounding payroll pressure.
- Throughput splitting: An employee assigned to N tasks has effective_rate = base_rate / N.
- Per-domain prestige gating: Higher-prestige tasks require ALL of their required domains to meet the threshold.
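
The four mechanics above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the benchmark's implementation: all function names are hypothetical, the linear decay shape and the `daily_decay` value are guesses (the text only says prestige decays daily toward 1.0), and the 1% raise is assumed to compound multiplicatively.

```python
def effective_rate(base_rate: float, n_tasks: int) -> float:
    """Throughput splitting: effective_rate = base_rate / N."""
    if n_tasks < 1:
        raise ValueError("an employee must be assigned to at least one task")
    return base_rate / n_tasks


def salary_after(base_salary: float, completions: int, raise_pct: float = 0.01) -> float:
    """Salary after a number of successful completions (assumes compounding raises)."""
    return base_salary * (1 + raise_pct) ** completions


def decay_prestige(prestige: float, daily_decay: float, floor: float = 1.0) -> float:
    """One day of decay; linear decay with a floor of 1.0 is an assumption."""
    return max(floor, prestige - daily_decay)


def meets_gate(required_prestige: float, required_domains: list[str],
               prestige_by_domain: dict[str, float]) -> bool:
    """Per-domain gating: ALL required domains must meet the threshold."""
    return all(prestige_by_domain.get(domain, 1.0) >= required_prestige
               for domain in required_domains)


prestige = {"research": 3.2, "training": 2.1}
print(effective_rate(6.0, 3))                             # one third of the base rate
print(round(salary_after(10_000, 20), 2))                 # payroll after 20 raises
print(meets_gate(3, ["research", "training"], prestige))  # training is below 3
```

Under these assumptions, twenty successful completions raise a salary by roughly 22%, which shows how success itself increases payroll pressure.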
Key Capabilities Tested
Strategic Planning
Balancing immediate cash needs with long-term prestige building across 4 domains
Resource Allocation
Assigning 10 employees with hidden skills across competing tasks and domains
Risk Management
Avoiding bankruptcy while managing deadline pressure and prestige penalties
Productivity Inference
Learning employee strengths from progress observations without direct skill data
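
As a rough illustration of productivity inference, the sketch below estimates a hidden base rate from progress observations. It is not part of YC-Bench: the observation tuple layout, the `estimate_rates` name, and the progress-per-hour units are assumptions. It also assumes each observation covers a single employee working alone on a task's domain, and it inverts the throughput-splitting rule (effective_rate = base_rate / N) to recover the base rate.

```python
from collections import defaultdict
from statistics import mean


def estimate_rates(observations):
    """Estimate hidden base rates per (employee, domain).

    observations: iterable of (employee, domain, progress, hours, n_tasks),
    a hypothetical record format where n_tasks is the number of tasks the
    employee was split across during the observed window.
    """
    samples = defaultdict(list)
    for employee, domain, progress, hours, n_tasks in observations:
        if hours <= 0:
            continue  # no elapsed time, so nothing can be inferred
        # Observed rate is base_rate / n_tasks; multiply back to undo the split.
        samples[(employee, domain)].append(progress / hours * n_tasks)
    return {key: mean(values) for key, values in samples.items()}


observations = [
    ("alice", "research", 10.0, 5.0, 1),  # solo on one task
    ("alice", "research", 6.0, 6.0, 2),   # split across two tasks
    ("bob", "training", 3.0, 6.0, 1),
]
rates = estimate_rates(observations)
print(rates[("alice", "research")])
print(rates[("bob", "training")])
```

An agent could keep such a table up to date as tasks progress and use it to route work toward the employees who appear strongest in each domain.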
The Four Domains
Every task in YC-Bench requires work in 1–3 of these domains:

- Research — algorithm design, paper analysis
- Inference — serving, optimization, deployment
- Data/Environment — datasets, benchmarks, tooling
- Training — model training, compute management
Configuration Presets
YC-Bench ships with five difficulty presets:

| Preset | Horizon | Focus | What It Tests |
|---|---|---|---|
| tutorial | 1 year | Very relaxed deadlines, prestige-1 tasks | Basic accept→assign→dispatch loop |
| easy | 1 year | Relaxed deadlines, prestige-1 tasks | Throughput awareness and parallelism |
| medium | 1 year | Moderate deadlines, prestige 2–4 tasks | Prestige climbing + domain specialization |
| hard | 1 year | Tight deadlines, prestige 3–5 tasks | Precise ETA reasoning + capacity planning |
| nightmare | 1 year | Razor-thin deadlines, prestige 4–7 tasks | Sustained perfection under payroll pressure |
We recommend starting with medium difficulty. It introduces the core prestige mechanics without overwhelming deadline pressure.
Next Steps
Installation
Install YC-Bench and set up API keys
Quickstart
Run your first benchmark in 5 minutes