What is YC-Bench?

YC-Bench is a deterministic, long-horizon benchmark for evaluating LLM agents. The agent plays the CEO of an AI startup over a simulated 1–3 year period, operating exclusively through a CLI tool against a SQLite-backed discrete-event simulation. Unlike short-horizon benchmarks that test single-turn or simple multi-turn interactions, YC-Bench evaluates whether an agent can sustain sound decisions whose effects compound over hundreds of turns:
  • Prestige specialization across 4 technical domains
  • Employee allocation and productivity inference
  • Cash flow management under monthly payroll pressure
  • Deadline risk assessment and capacity planning

Why Long-Horizon Benchmarks Matter

Most existing agent benchmarks test isolated capabilities or short sequences of actions. Real-world agent deployments require:
  • Strategic coherence over extended time horizons
  • Compounding decision quality where early mistakes cascade
  • Resource management under continuous constraints
  • Pattern recognition from incomplete information
YC-Bench addresses these gaps by creating a realistic business simulation where the agent must balance short-term survival (avoiding bankruptcy) with long-term growth (climbing prestige to access better opportunities).

What Makes YC-Bench Unique

Deterministic Simulation

Every run with the same seed produces identical results. This enables:
  • Reproducible comparisons between models
  • Precise debugging of agent behavior
  • Controlled difficulty scaling via configuration presets
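To illustrate what seed-determinism means in practice, here is a minimal toy sketch. It is not YC-Bench's actual simulation code: the `simulate` function, the event tuples, and the domain identifiers are hypothetical. The only point is that when all randomness flows from one seeded generator, two runs with the same seed produce the same event sequence.

```python
import random


def simulate(seed: int, n_events: int = 5) -> list[tuple[int, str]]:
    """Toy discrete-event run (hypothetical, not the YC-Bench engine)."""
    # A private generator avoids hidden global state leaking between runs.
    rng = random.Random(seed)
    domains = ["research", "inference", "data_environment", "training"]
    events = []
    day = 0
    for _ in range(n_events):
        day += rng.randint(1, 10)          # days until the next event
        events.append((day, rng.choice(domains)))
    return events


run_a = simulate(seed=42)
run_b = simulate(seed=42)
print(run_a == run_b)  # True: same seed, same event sequence
```

Because the sequence is a pure function of the seed, any difference between two agents' outcomes on the same seed can be attributed to their decisions rather than to simulation noise.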

Hidden Information

Employees have per-domain skill rates that are invisible to the agent. The agent sees only:
  • Employee tier (junior/mid/senior)
  • Monthly salary
  • Current task assignments
The agent must infer productivity from task progress checkpoints at 25%, 50%, 75%, and 100% completion.
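One way an agent might turn checkpoint observations into a productivity estimate is sketched below. This is an illustrative assumption, not part of YC-Bench: the `(elapsed_hours, fraction_complete)` input format and both function names are hypothetical, and the fit assumes a constant work rate between checkpoints. Note that the resulting rate describes everyone assigned to the task combined, so isolating one employee requires comparing tasks with different assignments.

```python
def estimate_rate(checkpoints: list[tuple[float, float]]) -> float:
    """Least-squares slope through the origin: fraction completed per hour.

    checkpoints: (elapsed_hours, fraction_complete) pairs, e.g. (40.0, 0.25).
    """
    num = sum(t * f for t, f in checkpoints)
    den = sum(t * t for t, _ in checkpoints)
    if den == 0:
        raise ValueError("need at least one checkpoint with elapsed time > 0")
    return num / den


def hours_remaining(checkpoints: list[tuple[float, float]]) -> float:
    """Project the time left from the latest checkpoint at the fitted rate."""
    rate = estimate_rate(checkpoints)
    latest_fraction = max(f for _, f in checkpoints)
    return (1.0 - latest_fraction) / rate


# Hypothetical observations: 25% done after 40 hours, 50% after 80 hours.
observed = [(40.0, 0.25), (80.0, 0.50)]
print(estimate_rate(observed))
print(hours_remaining(observed))
```

Comparing the projected remaining time against the task deadline gives a simple signal for whether to reassign employees before the next checkpoint.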

Compounding Complexity

  • Prestige decay: Every domain loses prestige daily. Neglected domains decay back toward 1.0.
  • Salary bumps: Each successful task completion awards assigned employees a 1% raise, creating compounding payroll pressure.
  • Throughput splitting: An employee assigned to N tasks has effective_rate = base_rate / N.
  • Per-domain prestige gating: Higher-prestige tasks require ALL of their required domains to meet the threshold.
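The throughput-splitting and salary-bump rules above can be sketched numerically. The function names are hypothetical, and the salary function assumes each 1% raise applies multiplicatively to the current salary, which the description implies but does not state outright.

```python
def effective_rate(base_rate: float, n_tasks: int) -> float:
    """Rate an employee contributes to each of n_tasks concurrent tasks."""
    if n_tasks < 1:
        raise ValueError("an employee must be assigned to at least one task")
    return base_rate / n_tasks


def salary_after(base_salary: float, completed_tasks: int,
                 bump: float = 0.01) -> float:
    """Salary after a multiplicative raise per completed task (assumed)."""
    return base_salary * (1.0 + bump) ** completed_tasks


# An employee with base rate 6.0 spread across 3 tasks contributes 2.0 to each.
print(effective_rate(6.0, 3))

# After 50 completed tasks, a 10,000 monthly salary is roughly 64% higher.
print(round(salary_after(10_000, 50), 2))
```

Under these assumptions, spreading one employee across more tasks does not increase total throughput, while every completed task permanently raises monthly payroll.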

Key Capabilities Tested

Strategic Planning

Balancing immediate cash needs with long-term prestige building across 4 domains

Resource Allocation

Assigning 10 employees with hidden skills across competing tasks and domains

Risk Management

Avoiding bankruptcy while managing deadline pressure and prestige penalties

Productivity Inference

Learning employee strengths from progress observations without direct skill data

The Four Domains

Every task in YC-Bench requires work in 1–3 of these domains:
  • Research — algorithm design, paper analysis
  • Inference — serving, optimization, deployment
  • Data/Environment — datasets, benchmarks, tooling
  • Training — model training, compute management
Each domain tracks prestige independently in the range [1.0, 10.0]. A task is accessible only if the agent's prestige in every domain the task requires meets the task's threshold.
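A minimal sketch of the per-domain gating rule follows. The dictionary layout, the domain identifiers, and the `can_accept` function are hypothetical illustrations rather than YC-Bench's actual data model or CLI. The sketch only encodes the stated rule that every required domain must meet the task's prestige threshold.

```python
# Hypothetical identifiers for the four domains.
DOMAINS = ("research", "inference", "data_environment", "training")


def can_accept(prestige: dict[str, float],
               required_domains: list[str],
               required_prestige: float) -> bool:
    """True only if ALL required domains meet the task's prestige threshold."""
    return all(prestige[d] >= required_prestige for d in required_domains)


# Example prestige levels, each within the documented [1.0, 10.0] range.
prestige = {
    "research": 4.2,
    "inference": 1.0,
    "data_environment": 3.1,
    "training": 3.0,
}

print(can_accept(prestige, ["research", "training"], 3.0))   # True
print(can_accept(prestige, ["research", "inference"], 3.0))  # False
```

In the second call a single neglected domain blocks the task even though the other required domain is well above the threshold, which is why neglecting any domain can close off multi-domain work.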

Configuration Presets

YC-Bench ships with five difficulty presets:
| Preset | Horizon | Focus | What It Tests |
|---|---|---|---|
| tutorial | 1 year | Very relaxed deadlines, prestige-1 tasks | Basic accept→assign→dispatch loop |
| easy | 1 year | Relaxed deadlines, prestige-1 tasks | Throughput awareness and parallelism |
| medium | 1 year | Moderate deadlines, prestige 2–4 tasks | Prestige climbing + domain specialization |
| hard | 1 year | Tight deadlines, prestige 3–5 tasks | Precise ETA reasoning + capacity planning |
| nightmare | 1 year | Razor-thin deadlines, prestige 4–7 tasks | Sustained perfection under payroll pressure |
All presets use 10 employees and 200 market tasks. Difficulty comes from deadline pressure, penalty severity, prestige requirements, and task size.
We recommend starting with medium difficulty. It introduces the core prestige mechanics without overwhelming deadline pressure.

Next Steps

Installation

Install YC-Bench and set up API keys

Quickstart

Run your first benchmark in 5 minutes
