This guide walks you through running your first YC-Bench evaluation from installation to results.
Option 1: One-Command Quickstart
The fastest way to get started is using the interactive launcher:
curl -sSL https://raw.githubusercontent.com/collinear-ai/yc-bench/main/start.sh | bash
This script will:
Install uv if missing
Clone the repository (or update if already cloned)
Launch the interactive setup wizard
The script is safe to run multiple times. It will update an existing installation rather than duplicate it.
Option 2: Manual Setup
If you’ve already installed YC-Bench, launch the interactive wizard:
cd yc-bench
uv run yc-bench start
Interactive Setup
The yc-bench start command guides you through a 3-step setup:
Choose difficulty preset
Select a configuration preset:

┌─ Step 1/3 ─ Configure the eval ────────────────────────────┐
│ │
│ # Preset Horizon Team Tasks Description │
│ 1 Tutorial 1 yr 10 emp 200 tasks Learn the basics │
│ 2 Easy 1 yr 10 emp 200 tasks Gentle intro │
│ 3 Medium (recommended) 1 yr 10 emp 200 tasks Prestige + specialization │
│ 4 Hard 1 yr 10 emp 200 tasks Deadline pressure │
│ 5 Nightmare 1 yr 10 emp 200 tasks Sustained perfection │
│ │
│ 0 Custom (build your own config) │
│ │
└──────────────────────────────────────────────────────────────┘
Enter number [3]:
Recommendation: Start with 3 (Medium) to experience the core prestige mechanics.

You’ll also be prompted for a seed. Seeds produce deterministic worlds; use the same seed across models for fair comparisons.
Select a model
Choose from popular models:

┌─ Step 2/3 ─ Choose a model ────────────────────────────────┐
│ │
│ # Provider Model Model ID │
│ │
│ 1 Anthropic Claude Opus 4.6 anthropic/claude-opus-4-6 │
│ 2 Anthropic Claude Sonnet 4.6 anthropic/claude-sonnet-4-6 │
│ 3 Anthropic Claude Haiku 4.5 anthropic/claude-haiku-4-5-20251001 │
│ │
│ 4 OpenAI GPT-5.2 openai/gpt-5.2 │
│ 5 OpenAI GPT-5.1 Mini openai/gpt-5.1-mini │
│ 6 OpenAI o4-mini openai/o4-mini │
│ │
│ 7 Google Gemini 3.1 Pro openrouter/google/gemini-3.1-pro-preview │
│ 8 Google Gemini 3 Flash openrouter/google/gemini-3-flash-preview │
│ 9 Google Gemini 2.5 Flash (free) openrouter/google/gemini-2.5-flash-preview:free │
│ │
│ 0 Custom model ID │
│ │
└──────────────────────────────────────────────────────────────┘
Enter number [1]:
Select any model your API key supports. Option 0 lets you enter a custom LiteLLM model string.
Configure API key
The wizard will detect any API keys in your environment or .env file:

┌─ Step 3/3 ─ API key ───────────────────────────────────────┐
│ │
│ Found ANTHROPIC_API_KEY in environment: sk-ant-...6x2f │
│ Use this key? [Y/n]: y │
│ │
│ > Detected: Anthropic key │
│ │
└──────────────────────────────────────────────────────────────┘
If no key is found, you’ll be prompted to paste one. The wizard auto-detects the provider by key prefix.
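Prefix-based detection can be sketched in a few lines of Python. Only the sk-ant- prefix actually appears in the wizard output above; the other prefixes here are common provider conventions, not confirmed YC-Bench behavior, and the real wizard's rules may differ:

```python
# Illustrative provider detection by key prefix. Order matters: "sk-" is a
# prefix of "sk-ant-" and "sk-or-", so the bare "sk-" rule is checked last.
PREFIXES = [
    ("sk-ant-", "Anthropic"),   # shown in the wizard output above
    ("sk-or-", "OpenRouter"),   # assumption
    ("sk-", "OpenAI"),          # assumption
]

def detect_provider(key: str) -> str:
    for prefix, provider in PREFIXES:
        if key.startswith(prefix):
            return provider
    return "unknown"

print(detect_provider("sk-ant-abc123"))  # Anthropic
```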
Running the Benchmark
Once configured, the benchmark launches automatically:
┌─ Launching ────────────────────────────────────────────────┐
│ │
│ yc-bench run --model anthropic/claude-sonnet-4-6 │
│ --seed 1 │
│ --config medium │
│ │
└──────────────────────────────────────────────────────────────┘
The agent loop begins:
[2025-01-01 09:00] Turn 1 — Company Status
Funds: $150,000.00
Monthly Payroll: $32,400.00
Runway: ~4.6 months
Prestige: research=1.0 inference=1.0 data=1.0 training=1.0
[Turn 1] Agent → run_command("yc-bench company status")
[Turn 1] ← {"funds": 15000000, "prestige": {...}, ...}
[Turn 2] Agent → run_command("yc-bench market browse --required-prestige-lte 1")
[Turn 2] ← {"tasks": [{"id": "abc123", "domains": ["research"], ...}], ...}
[Turn 3] Agent → run_command("yc-bench task accept --task-id abc123")
[Turn 3] ← {"success": true, "deadline": "2025-01-15", ...}
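The runway figure in the status header is simple arithmetic: funds divided by monthly payroll. (Note that the JSON responses appear to report money in cents — 15000000 corresponds to the $150,000.00 shown in the header.) A quick sketch; the exact rounding YC-Bench applies is an assumption:

```python
def runway_months(funds: float, monthly_payroll: float) -> float:
    """Months of cash left at the current burn rate."""
    return funds / monthly_payroll

# Values from the Turn 1 status: $150,000 funds, $32,400/month payroll.
print(round(runway_months(150_000, 32_400), 1))  # 4.6
```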
Live Dashboard
By default, YC-Bench displays a live terminal dashboard with:
Current simulation time and turn count
Funds, runway, and prestige across domains
Active and planned tasks
Recent events (task completions, payroll, etc.)
To disable the dashboard, run with --no-live:
uv run yc-bench run --model MODEL --seed 1 --config medium --no-live
Understanding the Output
A typical run produces:
Console Output
Real-time agent actions and events:
[2025-01-08 14:30] Turn 18 — Task Milestone
Task abc123 (research): 50% complete
Estimated completion: 2025-01-12 (3 days before deadline)
[Turn 18] Agent → run_command("yc-bench task inspect --task-id abc123")
[Turn 18] ← {"progress": 0.5, "assigned": [{"id": "emp456", ...}], ...}
[2025-01-12 17:00] Turn 24 — Task Complete
Task abc123: SUCCESS
Reward: $35,400 + prestige delta +0.12 (research)
Employee emp456: salary bump $3,200 → $3,232 (+1%)
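The salary bump in the log above is a flat percentage increase. A minimal sketch — whether the +1% rate is fixed or varies with config is an assumption here:

```python
def apply_salary_bump(salary: float, rate: float = 0.01) -> float:
    # The log above shows a +1% bump on task success; the rate may be
    # configuration-dependent in practice.
    return round(salary * (1 + rate), 2)

print(apply_salary_bump(3_200))  # 3232.0, matching the log line above
```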
Database Output
Full run history stored in SQLite:
db/run_<model>_seed<N>_<timestamp>.db
The database contains:
Complete event log (task acceptances, completions, payroll, etc.)
Company state snapshots per turn
Employee history (assignments, skill progression, salary changes)
Financial ledger (all transactions)
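You can explore the run database with Python's built-in sqlite3 module. The schema below is purely illustrative — YC-Bench's real table and column names are not documented here — but the query pattern (aggregating the event log by type) is the useful part:

```python
import sqlite3

# In a real run you'd open db/run_<model>_seed<N>_<timestamp>.db; here we
# build a tiny in-memory stand-in. The "events" table and its columns are
# assumptions for illustration, not YC-Bench's actual schema.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (turn INTEGER, type TEXT, payload TEXT)")
con.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(3, "task_accepted", "{}"), (24, "task_complete", "{}"),
     (30, "payroll", "{}"), (60, "payroll", "{}")],
)

# Typical analysis: how many events of each kind occurred?
counts = dict(con.execute("SELECT type, COUNT(*) FROM events GROUP BY type"))
print(sorted(counts.items()))
```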
JSON Rollout
Structured summary of the entire run:
results/run_<model>_seed<N>_<timestamp>.json
Includes:
Final company state (funds, prestige, task counts)
Turn-by-turn agent actions and LLM responses
Performance metrics (success rate, prestige growth, bankruptcy status)
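Since the rollout is plain JSON, computing your own metrics is a few lines of Python. The structure below is a minimal stand-in (load the real file with json.load); the field names are assumptions based on the contents listed above:

```python
# Hypothetical miniature rollout; in practice you'd do
# rollout = json.load(open("results/run_<model>_seed<N>_<timestamp>.json")).
rollout = {
    "final_state": {"funds": 21_450_000, "bankrupt": False},
    "turns": [
        {"action": "task accept", "task_id": "task1", "outcome": "success"},
        {"action": "task accept", "task_id": "task2", "outcome": "failure"},
        {"action": "task accept", "task_id": "task3", "outcome": "success"},
    ],
}

completed = [t for t in rollout["turns"] if t["outcome"] == "success"]
success_rate = len(completed) / len(rollout["turns"])
print(f"success rate: {success_rate:.0%}, "
      f"bankrupt: {rollout['final_state']['bankrupt']}")
# success rate: 67%, bankrupt: False
```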
Example: First Few Turns
Here’s what a typical agent does in the first 10 turns:
Turn 1-2: Observe initial state
Agent: run_command("yc-bench company status")
← {"funds": 15000000, "prestige": {"research": 1.0, ...}, "runway_months": 4.6}
Agent: run_command("yc-bench employee list")
← {"employees": [{"id": "emp1", "tier": "mid", "salary": 7200}, ...]}
The agent learns:
Starting funds: $150,000
Monthly burn: $32,400
Runway: ~4.6 months before bankruptcy
10 employees (5 junior, 3 mid, 2 senior)
Turn 3-5: Browse and accept tasks
Agent: run_command("yc-bench market browse --required-prestige-lte 1 --limit 20")
← {"tasks": [{"id": "task1", "domains": ["research"], "required_prestige": 1,
   "reward": 3540000, "required_qty": {"research": 1200}}, ...]}
Agent: run_command("yc-bench task accept --task-id task1")
← {"success": true, "deadline": "2025-01-15"}
Agent: run_command("yc-bench task accept --task-id task2")
← {"success": true, "deadline": "2025-01-18"}
The agent accepts 2-3 prestige-1 tasks to generate initial revenue.
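Monetary amounts in the JSON appear to be in cents — the 3540000 reward above matches the $35,400 payout reported in the console output. A client-side sketch of ranking browsable tasks, using the field names from the response above:

```python
# Tasks as returned by "market browse" (fields from the response above);
# rewards appear to be in cents (3540000 -> $35,400).
tasks = [
    {"id": "task1", "required_prestige": 1, "reward": 3_540_000},
    {"id": "task2", "required_prestige": 1, "reward": 2_100_000},
    {"id": "task3", "required_prestige": 2, "reward": 9_000_000},
]

# Keep only tasks we qualify for, best-paying first.
eligible = sorted(
    (t for t in tasks if t["required_prestige"] <= 1),
    key=lambda t: t["reward"],
    reverse=True,
)
for t in eligible:
    print(t["id"], f"${t['reward'] / 100:,.0f}")
# task1 $35,400
# task2 $21,000
```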
Turn 6-8: Assign employees and dispatch
Agent: run_command("yc-bench task assign --task-id task1 --employee-id emp3")
Agent: run_command("yc-bench task assign --task-id task1 --employee-id emp5")
← {"success": true}
Agent: run_command("yc-bench task dispatch --task-id task1")
← {"success": true, "status": "active"}
The agent assigns multiple employees to each task and starts work.
Turn 9-10: Resume simulation
Agent: run_command("yc-bench sim resume")
← {"events": [{"type": "task_half", "task_id": "task1", "progress": 0.25}],
   "sim_time": "2025-01-05 11:30"}
Time advances to the next event (first progress checkpoint at 25%).
What Happens Next
After the first few tasks, the agent must:
Monitor progress — Use checkpoint events to estimate employee productivity
Climb prestige — Complete tasks to unlock higher-prestige (higher-reward) tasks
Specialize domains — Focus on 2-3 domains rather than spreading thin
Manage capacity — Balance running tasks in parallel (which splits employee throughput) against focusing effort on fewer tasks
Avoid bankruptcy — Maintain runway while climbing the prestige ladder
Advanced: Command-Line Run
Skip the interactive wizard and run directly:
uv run yc-bench run \
--model anthropic/claude-sonnet-4-6 \
--seed 1 \
--config medium
All Options
uv run yc-bench run \
--model MODEL_ID \
--seed SEED \
--config PRESET_NAME \
--horizon-years YEARS \
--company-name "Your Startup" \
--start-date 2025-01-01 \
--no-live
Option            Description                                             Default
--model           LiteLLM model string (required)                         —
--seed            Random seed for world generation (required)             —
--config          Preset name (tutorial, easy, medium, hard, nightmare)   default
                  or path to .toml file
--horizon-years   Override simulation length                              From preset
--company-name    Company name in the simulation                          BenchCo
--start-date      Simulation start date (YYYY-MM-DD)                      2025-01-01
--no-live         Disable live dashboard                                  Dashboard enabled
Running Multiple Models in Parallel
Benchmark multiple models on the same seed:
bash scripts/run_benchmark.sh --seed 1 --config hard
The script runs every model in its model list in parallel on the same seed, making it easy to compare performance across models.
API costs: Running multiple models in parallel will consume API credits faster. Monitor your usage.
Next Steps
CLI Reference Complete guide to all YC-Bench CLI commands
Configuration Customize presets and create your own difficulty settings
Understanding Results Interpret benchmark output and performance metrics
Simulation Mechanics Learn how the simulation engine works