
Overview

YC-Bench measures agent performance across multiple dimensions: survival, profitability, task completion rate, prestige achieved, and efficiency. Unlike single-metric benchmarks, YC-Bench requires agents to balance competing objectives.
There is no single “score” — instead, the benchmark reports a rollout with detailed metrics. Researchers can define custom scoring functions based on their evaluation priorities.

Primary Success Criteria

1. Survival (Binary)

Did the company survive to horizon end without going bankrupt?
survival = (final_funds >= 0) and (reached_horizon_end)
Failure modes:
  • Bankruptcy: funds < 0 after payroll or any transaction
  • Agent crash: Unhandled exception or timeout
  • Max turns exceeded: Agent hit turn limit (if configured)
Survival is a hard requirement. A run that ends in bankruptcy is considered a failure regardless of other metrics.
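The failure modes above reduce to a simple predicate. A minimal sketch, assuming a rollout record with hypothetical `final_funds_cents` and `reached_horizon_end` fields (not the benchmark's actual API):

```python
from dataclasses import dataclass

@dataclass
class RolloutSummary:
    final_funds_cents: int
    reached_horizon_end: bool

def survived(rollout: RolloutSummary) -> bool:
    # Both conditions are hard requirements: solvent AND reached the horizon.
    return rollout.final_funds_cents >= 0 and rollout.reached_horizon_end
```

An agent crash or exceeded turn limit would surface here as `reached_horizon_end` being false.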

2. Final Funds (Continuous)

How much cash remains at the end?
final_funds = company.funds_cents  # in cents
final_funds_usd = final_funds / 100
Interpretation:
  • Negative: Bankrupt ❌
  • $0–$50K: Barely survived (razor-thin margins)
  • $50K–$200K: Comfortable survival
  • $200K+: Strong performance (healthy cash reserves)
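The bands above can be expressed as a small classifier; the function name and band labels are illustrative, not part of the benchmark API:

```python
def classify_final_funds(final_funds_cents: int) -> str:
    """Map final funds (stored in cents) onto the interpretation bands."""
    usd = final_funds_cents / 100
    if usd < 0:
        return "bankrupt"
    if usd < 50_000:
        return "barely survived"
    if usd < 200_000:
        return "comfortable survival"
    return "strong performance"
```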

Typical Final Funds by Preset

| Preset | Starting Funds | Typical Final Funds (Survival) | Strong Performance |
|---|---|---|---|
| Tutorial | $80K | $50K–$150K | $200K+ |
| Easy | $120K | $100K–$300K | $500K+ |
| Medium | $150K | $50K–$200K | $400K+ |
| Hard | $150K | $0–$100K | $300K+ |
| Nightmare | $250K | -$50K (bankruptcy common) | $100K+ |
Final funds reflect cumulative profit over the entire run, accounting for all task rewards, payroll expenses, and compounding salary growth.

Task Completion Metrics

Tasks Completed (Success)

Number of tasks finished on time (before deadline).
tasks_success = count(tasks.status == COMPLETED_SUCCESS)
Typical ranges (3-year run, 10 employees):
  • Tutorial: 100–150 tasks
  • Easy: 80–120 tasks
  • Medium: 60–100 tasks
  • Hard: 40–80 tasks
  • Nightmare: 20–50 tasks

Tasks Failed (Late Completion)

Number of tasks completed after deadline.
tasks_failed = count(tasks.status == COMPLETED_FAIL)
Interpretation:
  • 0 failures: Perfect execution ✅
  • 1–5 failures: Acceptable (occasional missed deadline)
  • 5–10 failures: Suboptimal (deadline estimation issues)
  • 10+ failures: Poor planning (over-commitment or under-resourcing)
Failed tasks earn no funds and incur a 1.4× prestige penalty. High failure rates indicate poor throughput estimation or over-commitment.

Tasks Cancelled

Number of tasks cancelled before completion.
tasks_cancelled = count(tasks.status == CANCELLED)
Interpretation:
  • 0 cancellations: Committed to all accepted tasks ✅
  • 1–3 cancellations: Strategic cancellation (rare)
  • 3+ cancellations: Poor task selection or over-commitment
Cancellation incurs a 2.0× prestige penalty (worse than failure). Frequent cancellations suggest the agent is accepting tasks speculatively without validating feasibility.

Task Completion Rate

total_accepted = tasks_success + tasks_failed + tasks_cancelled
completion_rate = tasks_success / total_accepted
Target: ≥85% success rate.
Breakdown:
  • 95%+ success: Excellent planning and execution
  • 85–95% success: Good (occasional deadline miss)
  • 70–85% success: Acceptable (some planning issues)
  • Below 70% success: Poor (frequent failures/cancellations)
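The rate formula and the breakdown above can be combined into one helper; the grade labels and function name are illustrative:

```python
def completion_grade(success: int, failed: int, cancelled: int) -> str:
    """Grade a run by its task success rate, per the breakdown above."""
    total = success + failed + cancelled
    if total == 0:
        return "no tasks accepted"
    rate = success / total
    if rate >= 0.95:
        return "excellent"
    if rate >= 0.85:
        return "good"
    if rate >= 0.70:
        return "acceptable"
    return "poor"
```

For example, 65 successes with 5 failures and 2 cancellations gives a rate of about 0.90, landing in the "good" band.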

Prestige Levels Achieved

Final prestige in each domain reflects how far the agent climbed the prestige ladder.
prestige_research = company.prestige.research
prestige_inference = company.prestige.inference
prestige_data_environment = company.prestige.data_environment
prestige_training = company.prestige.training

avg_prestige = (prestige_research + prestige_inference + 
                prestige_data_environment + prestige_training) / 4

Prestige Interpretation

| Avg Prestige | Market Access | Assessment |
|---|---|---|
| 1.0–2.0 | Entry-level tasks only | Agent never climbed prestige |
| 2.0–3.0 | Low-tier tasks | Minimal progression |
| 3.0–5.0 | Mid-tier tasks (profitable) | Good |
| 5.0–7.0 | High-tier tasks (high margin) | Excellent |
| 7.0–10.0 | Elite tasks (maximum difficulty) | Outstanding |

Prestige Balance

Check variance across domains:
prestige_variance = variance([prestige_research, prestige_inference, 
                              prestige_data_environment, prestige_training])
Interpretation:
  • Low variance (e.g., all domains within 1.0 of each other): Balanced strategy ✅
  • High variance (e.g., one domain at 7.0, others at 2.0): Narrow specialization ⚠️
In medium and hard presets, most tasks require 2 domains. Agents with high prestige variance (narrow specialization) will be locked out of multi-domain tasks.
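The balance check can be sketched with the standard library's population variance; the 1.0 cutoff below is an illustrative assumption, not an official threshold:

```python
from statistics import pvariance

def prestige_profile(levels: list[float]) -> str:
    """Label a prestige spread as balanced or specialized.

    Cutoff of 1.0 on the population variance is an assumed heuristic.
    """
    return "balanced" if pvariance(levels) < 1.0 else "specialized"
```

A spread like `[7.0, 2.0, 2.0, 2.0]` has variance well above 1.0 and would be flagged as specialized.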

Efficiency Metrics

Runway Utilization

How efficiently did the agent use the available time?
sim_duration_days = (horizon_end - start_date).days
revenue_per_day = total_revenue / sim_duration_days
profit_per_day = (final_funds - starting_funds) / sim_duration_days
Target: Maximize profit_per_day while maintaining survival.

Employee Utilization

total_employee_hours = num_employees * sim_duration_days * work_hours_per_day
active_task_hours = sum(task.completed_qty / employee_rate for task in tasks)
utilization_rate = active_task_hours / total_employee_hours
Interpretation:
  • Below 50%: Employees idle (under-utilized)
  • 50–70%: Reasonable (some idle time for flexibility)
  • 70–85%: High utilization (efficient)
  • 85%+: Over-commitment risk (no buffer for delays)
100% utilization is not optimal — it leaves no buffer for unexpected delays or employee throughput variance. Aim for 70–80% utilization.
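Putting the utilization formulas above together as a runnable sketch (parameter names are assumptions):

```python
def utilization_rate(num_employees: int, sim_days: int,
                     hours_per_day: float, active_task_hours: float) -> float:
    """Fraction of total available employee-hours spent on active tasks."""
    total_hours = num_employees * sim_days * hours_per_day
    return active_task_hours / total_hours
```

For example, 10 employees over 100 days at 8 hours/day gives 8,000 available hours; 6,000 active task hours yields 0.75, inside the recommended 70–80% band.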

Payroll-to-Revenue Ratio

total_payroll = sum(all payroll ledger entries)
total_revenue = sum(all task reward ledger entries)
payroll_ratio = total_payroll / total_revenue
Interpretation:
  • Below 30%: Excellent margin (high-prestige tasks)
  • 30–50%: Good margin
  • 50–70%: Thin margin (risky)
  • Above 70%: Unsustainable (bankruptcy risk)
As salaries compound over time (+1% per task), the payroll ratio increases throughout the run. A healthy run should show declining payroll ratio as prestige climbs (higher task rewards offset salary growth).
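A sketch of the ratio computation, assuming ledger entries are `(kind, amount_cents)` pairs; the entry shape and kind labels are hypothetical:

```python
def payroll_ratio(ledger: list[tuple[str, int]]) -> float:
    """Payroll outflows divided by task-reward inflows, both in cents."""
    payroll = sum(amount for kind, amount in ledger if kind == "payroll")
    revenue = sum(amount for kind, amount in ledger if kind == "task_reward")
    return payroll / revenue
```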

Interpreting Results: Good vs. Great Runs

Good Run

{
  "survival": true,
  "final_funds_usd": 120000,
  "tasks_completed_success": 65,
  "tasks_completed_fail": 5,
  "tasks_cancelled": 2,
  "completion_rate": 0.90,
  "avg_prestige": 4.2,
  "prestige_variance": 0.8,
  "payroll_ratio": 0.45
}
Analysis:
  • ✅ Survived with healthy cash reserves ($120K)
  • ✅ 90% task success rate (5 failures, 2 cancellations)
  • ✅ Climbed to prestige ~4 (mid-tier tasks)
  • ✅ Balanced prestige across domains (variance 0.8)
  • ✅ Reasonable payroll ratio (45%)

Great Run

{
  "survival": true,
  "final_funds_usd": 450000,
  "tasks_completed_success": 85,
  "tasks_completed_fail": 2,
  "tasks_cancelled": 0,
  "completion_rate": 0.98,
  "avg_prestige": 6.5,
  "prestige_variance": 0.3,
  "payroll_ratio": 0.32
}
Analysis:
  • ✅ Survived with excellent cash reserves ($450K)
  • ✅ 98% task success rate (near-perfect execution)
  • ✅ Climbed to prestige 6–7 (high-tier tasks)
  • ✅ Extremely balanced prestige (variance 0.3)
  • ✅ Excellent payroll ratio (32%)

Poor Run (But Survived)

{
  "survival": true,
  "final_funds_usd": 15000,
  "tasks_completed_success": 40,
  "tasks_completed_fail": 12,
  "tasks_cancelled": 8,
  "completion_rate": 0.67,
  "avg_prestige": 2.8,
  "prestige_variance": 2.1,
  "payroll_ratio": 0.68
}
Analysis:
  • ⚠️ Barely survived ($15K remaining)
  • ❌ 67% success rate (many failures/cancellations)
  • ❌ Low prestige (never climbed to mid-tier)
  • ❌ High prestige variance (narrow specialization)
  • ❌ High payroll ratio (68% — unsustainable)
This run survived but demonstrates poor planning: over-commitment (12 failures), poor task selection (8 cancellations), narrow specialization (high variance), and thin margins (68% payroll ratio).
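The thresholds implied by these three examples can be folded into a rough triage helper; the exact cutoffs below are illustrative, not part of the benchmark:

```python
def triage_run(r: dict) -> str:
    """Rough triage of a rollout using assumed threshold values."""
    if not r["survival"]:
        return "failed"
    if (r["completion_rate"] >= 0.95 and r["avg_prestige"] >= 6.0
            and r["payroll_ratio"] <= 0.35):
        return "great"
    if (r["completion_rate"] >= 0.85 and r["avg_prestige"] >= 4.0
            and r["payroll_ratio"] <= 0.50):
        return "good"
    return "poor"
```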

Failure Analysis

When a run ends in bankruptcy, examine:

1. Cash Flow Timeline

yc-bench finance ledger
Look for:
  • ❌ Long gaps between task completions (revenue droughts)
  • ❌ Payroll > revenue in consecutive months
  • ❌ Large failed tasks (no revenue, but payroll still paid)

2. Task Failure Rate

If tasks_failed / tasks_accepted > 20%:
  • Agent is over-committing (accepting too many tasks)
  • Agent is under-estimating task duration (poor throughput inference)
  • Agent is over-splitting employees (throughput penalty)

3. Prestige Decay

If prestige levels declined over time:
  • Agent is not completing enough tasks per domain to offset decay
  • Agent specialized too narrowly (unused domains decayed)
  • Agent got locked out of market (prestige too low to accept new tasks)

4. Payroll Growth

If payroll grew faster than revenue:
  • Agent completed many tasks (salary bumps) but failed to climb prestige
  • Agent accepted low-margin tasks (prestige 1–2) that don’t offset compounding payroll
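One way to sketch this check, assuming monthly `(payroll_cents, revenue_cents)` pairs extracted from the ledger (the input shape is an assumption):

```python
def payroll_outpacing_revenue(monthly: list[tuple[int, int]]) -> bool:
    """True if the payroll/revenue ratio rose from the first month to the last."""
    ratios = [p / r for p, r in monthly if r > 0]
    if len(ratios) < 2:
        return False
    # Compare the final ratio to the initial one to detect an upward trend.
    return ratios[-1] > ratios[0]
```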

Benchmark Comparison

To compare agents, aggregate metrics across multiple seeds (e.g., 10 runs per agent):
survival_rate = count(run for run in runs if run.survival) / count(runs)
avg_final_funds = mean(run.final_funds for run in runs if run.survival)
avg_completion_rate = mean(run.completion_rate for run in runs)
avg_prestige = mean(run.avg_prestige for run in runs)
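The aggregation above as runnable code, assuming each run is summarized as a plain dict (the key names are assumptions):

```python
from statistics import mean

def aggregate(runs: list[dict]) -> dict:
    """Aggregate per-seed rollouts into leaderboard-style metrics."""
    survivors = [r for r in runs if r["survival"]]
    return {
        "survival_rate": len(survivors) / len(runs),
        # Final funds are averaged over surviving runs only.
        "avg_final_funds": mean(r["final_funds"] for r in survivors) if survivors else 0.0,
        "avg_completion_rate": mean(r["completion_rate"] for r in runs),
        "avg_prestige": mean(r["avg_prestige"] for r in runs),
    }
```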

Example Leaderboard

| Agent | Survival Rate | Avg Final Funds | Avg Completion Rate | Avg Prestige |
|---|---|---|---|---|
| GPT-4 | 90% | $185K | 87% | 4.8 |
| Claude Opus | 85% | $210K | 91% | 5.2 |
| Gemini Pro | 70% | $95K | 78% | 3.9 |
| Baseline | 40% | $30K | 65% | 2.5 |
The benchmark is designed to have no ceiling — even the best agents will struggle to achieve 100% survival rate on nightmare preset. The goal is to measure relative performance across agents.

Custom Scoring Functions

Researchers can define custom scoring functions. Example:
def compute_score(rollout):
    if not rollout.survival:
        return 0.0  # Hard requirement
    
    # Weighted composite score
    funds_score = min(1.0, rollout.final_funds / 500_000)
    completion_score = rollout.completion_rate
    prestige_score = min(1.0, rollout.avg_prestige / 8.0)
    
    score = (
        0.4 * funds_score +
        0.3 * completion_score +
        0.3 * prestige_score
    )
    return score

Observing Results During Run

Agents can monitor progress in real-time:
yc-bench company status         # Current funds, prestige, runway
yc-bench report monthly         # P&L summary by month
yc-bench finance ledger         # Full transaction history
yc-bench task list --status completed_success  # Completed tasks

Next Steps

How It Works

Review the core game loop and terminal conditions.

Configuration

Tune difficulty presets to create custom benchmarks.

Development

Understand the codebase architecture and data model.

Task Management

Understand how task outcomes factor into final scores.
