This guide walks you through running your first YC-Bench evaluation, from installation to results.

Option 1: One-Command Quickstart

The fastest way to get started is using the interactive launcher:
curl -sSL https://raw.githubusercontent.com/collinear-ai/yc-bench/main/start.sh | bash
This script will:
  • Install uv if missing
  • Clone the repository (or update if already cloned)
  • Launch the interactive setup wizard
The script is safe to run multiple times. It will update an existing installation rather than duplicate it.

Option 2: Manual Setup

If you’ve already installed YC-Bench, launch the interactive wizard:
cd yc-bench
uv run yc-bench start

Interactive Setup

The yc-bench start command guides you through a 3-step setup:

Step 1: Choose difficulty preset

Select a configuration preset:
┌─ Step 1/3 ─ Configure the eval ───────────────────────────────────────────────────┐
│                                                                                   │
│  #   Preset                Horizon  Team    Tasks      Description                │
│  1   Tutorial              1 yr     10 emp  200 tasks  Learn the basics           │
│  2   Easy                  1 yr     10 emp  200 tasks  Gentle intro               │
│  3   Medium (recommended)  1 yr     10 emp  200 tasks  Prestige + specialization  │
│  4   Hard                  1 yr     10 emp  200 tasks  Deadline pressure          │
│  5   Nightmare             1 yr     10 emp  200 tasks  Sustained perfection       │
│                                                                                   │
│  0   Custom                (build your own config)                                │
│                                                                                   │
└───────────────────────────────────────────────────────────────────────────────────┘

Enter number [3]: 
Recommendation: Start with 3 (Medium) to experience the core prestige mechanics.

You’ll also be prompted for a seed:
Seed [1]: 1
Seeds produce deterministic worlds. Use the same seed across models for fair comparisons.
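To compare models fairly, pin the seed and vary only the model. A minimal sketch of that workflow, assuming the run command shown later in this guide (the model list is illustrative, and the commands are printed rather than executed here):

```shell
# Print one run command per model, all sharing the same seed and config.
print_runs() {
  seed=$1; shift
  for model in "$@"; do
    echo "uv run yc-bench run --model $model --seed $seed --config medium"
  done
}

print_runs 1 anthropic/claude-sonnet-4-6 openai/gpt-5.2
```

Because the world is generated deterministically from the seed, every model in the loop faces the same market, employees, and events.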

Step 2: Select a model

Choose from popular models:
┌─ Step 2/3 ─ Choose a model ────────────────────────────────┐
│                                                              │
│  #   Provider    Model                      Model ID        │
│                                                              │
│  1   Anthropic   Claude Opus 4.6            anthropic/claude-opus-4-6 │
│  2   Anthropic   Claude Sonnet 4.6          anthropic/claude-sonnet-4-6 │
│  3   Anthropic   Claude Haiku 4.5           anthropic/claude-haiku-4-5-20251001 │
│                                                              │
│  4   OpenAI      GPT-5.2                    openai/gpt-5.2  │
│  5   OpenAI      GPT-5.1 Mini               openai/gpt-5.1-mini │
│  6   OpenAI      o4-mini                    openai/o4-mini  │
│                                                              │
│  7   Google      Gemini 3.1 Pro             openrouter/google/gemini-3.1-pro-preview │
│  8   Google      Gemini 3 Flash             openrouter/google/gemini-3-flash-preview │
│  9   Google      Gemini 2.5 Flash (free)    openrouter/google/gemini-2.5-flash-preview:free │
│                                                              │
│  0   Custom model ID                                        │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Enter number [1]: 
Select any model your API key supports. Option 0 lets you enter a custom LiteLLM model string.

Step 3: Configure API key

The wizard will detect any API keys in your environment or .env file:
┌─ Step 3/3 ─ API key ───────────────────────────────────────┐
│                                                              │
│  Found ANTHROPIC_API_KEY in environment: sk-ant-...6x2f    │
│  Use this key? [Y/n]: y                                     │
│                                                              │
│  > Detected: Anthropic key                                  │
│                                                              │
└──────────────────────────────────────────────────────────────┘
If no key is found, you’ll be prompted to paste one. The wizard auto-detects the provider by key prefix.
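That prefix check can be sketched as follows. This is a minimal illustration assuming typical provider key prefixes; the exact prefixes the wizard matches may differ:

```shell
# Map an API key to a provider by its prefix (prefixes are assumptions).
detect_provider() {
  case "$1" in
    sk-ant-*) echo anthropic ;;
    sk-*)     echo openai ;;
    AIza*)    echo google ;;
    *)        echo unknown ;;
  esac
}

detect_provider "sk-ant-abc123"   # prints anthropic
```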

Running the Benchmark

Once configured, the benchmark launches automatically:
┌─ Launching ────────────────────────────────────────────────┐

  yc-bench run --model anthropic/claude-sonnet-4-6
               --seed 1
               --config medium

└──────────────────────────────────────────────────────────────┘
The agent loop begins:
[2025-01-01 09:00] Turn 1 — Company Status
  Funds: $150,000.00
  Monthly Payroll: $32,400.00
  Runway: ~4.6 months
  Prestige: research=1.0 inference=1.0 data=1.0 training=1.0

[Turn 1] Agent → run_command("yc-bench company status")
[Turn 1] ← {"funds": 15000000, "prestige": {...}, ...}

[Turn 2] Agent → run_command("yc-bench market browse --required-prestige-lte 1")
[Turn 2] ← {"tasks": [{"id": "abc123", "domains": ["research"], ...}], ...}

[Turn 3] Agent → run_command("yc-bench task accept --task-id abc123")
[Turn 3] ← {"success": true, "deadline": "2025-01-15", ...}

Live Dashboard

By default, YC-Bench displays a live terminal dashboard with:
  • Current simulation time and turn count
  • Funds, runway, and prestige across domains
  • Active and planned tasks
  • Recent events (task completions, payroll, etc.)
To disable the dashboard, run with --no-live:
uv run yc-bench run --model MODEL --seed 1 --config medium --no-live

Understanding the Output

A typical run produces:

Console Output

Real-time agent actions and events:
[2025-01-08 14:30] Turn 18 — Task Milestone
  Task abc123 (research): 50% complete
  Estimated completion: 2025-01-12 (3 days before deadline)

[Turn 18] Agent → run_command("yc-bench task inspect --task-id abc123")
[Turn 18] ← {"progress": 0.5, "assigned": [{"id": "emp456", ...}], ...}

[2025-01-12 17:00] Turn 24 — Task Complete
  Task abc123: SUCCESS
  Reward: $35,400 + prestige delta +0.12 (research)
  Employee emp456: salary bump $3,200 → $3,232 (+1%)

Database Output

Full run history stored in SQLite:
db/run_<model>_seed<N>_<timestamp>.db
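As an illustration, the placeholders expand roughly like this (the slugified model name and the timestamp format here are assumptions, not the exact scheme):

```shell
# Hypothetical expansion of the db filename pattern shown above.
model_slug="anthropic-claude-sonnet-4-6"   # assumed slug form of the model ID
seed=1
ts="20250101-090000"                       # assumed timestamp format

db_path="db/run_${model_slug}_seed${seed}_${ts}.db"
echo "$db_path"
```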
The database contains:
  • Complete event log (task acceptances, completions, payroll, etc.)
  • Company state snapshots per turn
  • Employee history (assignments, skill progression, salary changes)
  • Financial ledger (all transactions)

JSON Rollout

Structured summary of the entire run:
results/run_<model>_seed<N>_<timestamp>.json
Includes:
  • Final company state (funds, prestige, task counts)
  • Turn-by-turn agent actions and LLM responses
  • Performance metrics (success rate, prestige growth, bankruptcy status)

Example: First Few Turns

Here’s what a typical agent does in the first 10 turns:

Turns 1-2: Observe initial state

Agent: run_command("yc-bench company status")
 {"funds": 15000000, "prestige": {"research": 1.0, ...}, "runway_months": 4.6}

Agent: run_command("yc-bench employee list")
 {"employees": [{"id": "emp1", "tier": "mid", "salary": 7200}, ...]}
The agent learns:
  • Starting funds: $150,000
  • Monthly burn: $32,400
  • Runway: ~4.6 months before bankruptcy
  • 10 employees (5 junior, 3 mid, 2 senior)
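The runway figure follows directly from these numbers. A quick check, assuming the API reports monetary amounts in cents (which would make `"funds": 15000000` equal $150,000):

```shell
# Runway = funds / monthly payroll, both taken from the status output above.
funds_cents=15000000     # $150,000.00
payroll_cents=3240000    # $32,400.00 per month

runway=$(awk -v f="$funds_cents" -v p="$payroll_cents" 'BEGIN { printf "%.1f", f / p }')
echo "Runway: ~${runway} months"   # Runway: ~4.6 months
```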

Turns 3-5: Browse and accept tasks

Agent: run_command("yc-bench market browse --required-prestige-lte 1 --limit 20")
 {"tasks": [{"id": "task1", "domains": ["research"], "required_prestige": 1, 
               "reward": 3540000, "required_qty": {"research": 1200}}, ...]}

Agent: run_command("yc-bench task accept --task-id task1")
 {"success": true, "deadline": "2025-01-15"}

Agent: run_command("yc-bench task accept --task-id task2")
 {"success": true, "deadline": "2025-01-18"}
The agent accepts 2-3 prestige-1 tasks to generate initial revenue.

Turns 6-8: Assign employees and dispatch

Agent: run_command("yc-bench task assign --task-id task1 --employee-id emp3")
Agent: run_command("yc-bench task assign --task-id task1 --employee-id emp5")
 {"success": true}

Agent: run_command("yc-bench task dispatch --task-id task1")
 {"success": true, "status": "active"}
The agent assigns multiple employees to each task and starts work.

Turns 9-10: Resume simulation

Agent: run_command("yc-bench sim resume")
 {"events": [{"type": "task_half", "task_id": "task1", "progress": 0.25}],
    "sim_time": "2025-01-05 11:30"}
Time advances to the next event (first progress checkpoint at 25%).

What Happens Next

After the first few tasks, the agent must:
  1. Monitor progress — Use checkpoint events to estimate employee productivity
  2. Climb prestige — Complete tasks to unlock higher-prestige (higher-reward) tasks
  3. Specialize domains — Focus on 2-3 domains rather than spreading thin
  4. Manage capacity — Balance parallelism (throughput splitting) vs focus
  5. Avoid bankruptcy — Maintain runway while climbing the prestige ladder

Advanced: Command-Line Run

Skip the interactive wizard and run directly:
uv run yc-bench run \
  --model anthropic/claude-sonnet-4-6 \
  --seed 1 \
  --config medium

All Options

uv run yc-bench run \
  --model MODEL_ID \
  --seed SEED \
  --config PRESET_NAME \
  --horizon-years YEARS \
  --company-name "Your Startup" \
  --start-date 2025-01-01 \
  --no-live
Option            Description                                              Default
--model           LiteLLM model string                                     (required)
--seed            Random seed for world generation                         (required)
--config          Preset name (tutorial, easy, medium, hard,               default
                  nightmare) or path to a .toml file
--horizon-years   Override simulation length                               From preset
--company-name    Company name in the simulation                           BenchCo
--start-date      Simulation start date (YYYY-MM-DD)                       2025-01-01
--no-live         Disable the live dashboard                               Dashboard enabled

Running Multiple Models in Parallel

Benchmark multiple models on the same seed:
bash scripts/run_benchmark.sh --seed 1 --config hard
The script runs every model defined within it in parallel, making it easy to compare performance across models on the same seed.
API costs: Running multiple models in parallel will consume API credits faster. Monitor your usage.
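The pattern behind such a script can be sketched with shell background jobs. The model list below is illustrative, and the run commands are printed rather than launched:

```shell
# Fan out one run per model with `&`, then wait for all of them to finish.
run_all() {
  seed=$1; shift
  for model in "$@"; do
    echo "uv run yc-bench run --model $model --seed $seed --config hard" &
  done
  wait
}

run_all 1 anthropic/claude-sonnet-4-6 openai/gpt-5.2
```

In a real script, each backgrounded command would be the actual `uv run yc-bench run` invocation, so all models progress through the same seeded world concurrently.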

Next Steps

CLI Reference

Complete guide to all YC-Bench CLI commands

Configuration

Customize presets and create your own difficulty settings

Understanding Results

Interpret benchmark output and performance metrics

Simulation Mechanics

Learn how the simulation engine works
