## Overview
YC-Bench is a deterministic long-horizon benchmark in which an LLM agent plays the CEO of an AI startup. The system is built as a discrete-event simulation with a CLI interface, backed by SQLite and orchestrated by an agent runtime loop.

## Project Structure
## Key Modules
### CLI (`cli/`)
The agent interacts with the simulation exclusively through JSON-returning CLI commands built with Typer.
Key files:
- `__init__.py` - Main Typer app and command registration
- `sim_commands.py` - Time advancement (`sim resume`)
- `company_commands.py` - Company status and prestige queries
- `employee_commands.py` - Employee listing and inspection
- `task_commands.py` - Task lifecycle (accept, assign, dispatch, cancel, inspect)
- `market_commands.py` - Market task browsing
- `finance_commands.py` - Ledger and transaction history
- `report_commands.py` - Monthly P&L reports
- `scratchpad_commands.py` - Persistent agent memory
- `start_command.py` - Interactive quickstart wizard
- All commands return JSON (via the `json_output()` helper)
- All commands use transactional DB sessions (the `get_db()` context manager)
- Error messages are returned as JSON: `{"error": "message"}`
- Commands are stateless; all state lives in the database
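The helper names `json_output()` and `error_output()` come from the source; their bodies below are an assumed minimal sketch of the JSON envelope every command emits:

```python
import json
from typing import Any


def json_output(data: Any) -> str:
    """Serialize a successful result to the JSON string the agent parses."""
    return json.dumps(data, default=str)


def error_output(message: str) -> str:
    """Errors use the same envelope: a single "error" key."""
    return json.dumps({"error": message})
```

Because commands are stateless and always emit this envelope, the agent can parse any command's stdout uniformly.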
### Agent (`agent/`)
Handles the LLM agent runtime, conversation management, and tool execution.
Key files:
- `loop.py` - Main agent execution loop (`run_agent_loop()`)
- `prompt.py` - System prompt and turn context builders
- `run_state.py` - Tracks turn count, costs, terminal state
- `runtime/base.py` - Abstract runtime interface
- `runtime/litellm_runtime.py` - LiteLLM-based implementation
- `runtime/factory.py` - Runtime instantiation
- `commands/executor.py` - CLI command execution wrapper
- `commands/policy.py` - Safety policies (prevents destructive operations)
Each turn of `run_agent_loop()`:

1. Build the turn context (sim state snapshot + history)
2. Call the LLM runtime with the tool schema (`run_command`)
3. Execute tool calls sequentially via `command_executor`
4. Check whether the agent called `sim resume` (checkpoint advancement)
5. If no resume after N turns, force an auto-advance
6. Check terminal conditions (bankruptcy, horizon end, max turns)
7. Record the turn transcript and costs
8. Repeat until terminal
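The turn cycle can be sketched as a plain Python skeleton. All names except `sim resume` are stand-ins, and the LLM call, command execution, and terminal check are injected as callables for testability; the real loop also tracks costs via `run_state.py`:

```python
def run_agent_loop(call_llm, execute, check_terminal, max_turns, auto_advance_after):
    """Skeleton of the turn cycle; a sketch, not the real implementation."""
    transcript = []
    turns_since_resume = 0
    for _turn in range(max_turns):
        tool_calls = call_llm(transcript)             # call runtime with the run_command schema
        resumed = False
        for call in tool_calls:                       # execute tool calls sequentially
            result = execute(call)
            transcript.append((call, result))         # record the turn transcript
            if call.startswith("sim resume"):         # checkpoint advancement
                resumed = True
        turns_since_resume = 0 if resumed else turns_since_resume + 1
        if turns_since_resume >= auto_advance_after:  # force auto-advance after N idle turns
            transcript.append(("sim resume", execute("sim resume")))
            turns_since_resume = 0
        reason = check_terminal()                     # bankruptcy, horizon end, ...
        if reason:
            return reason, transcript
    return "max_turns", transcript
```

The forced auto-advance guarantees simulated time cannot stall forever even if the agent never calls `sim resume` itself.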
Context management:

- Only the last `history_keep_rounds` conversation rounds are kept in context
- Older turns are truncated from context but preserved in the rollout JSON
- The scratchpad provides persistent memory across truncation
### Core (`core/`)
The discrete-event simulation engine.
Key files:
- `engine.py` - Time advancement and event dispatch (`advance_time()`)
- `progress.py` - Task work accumulation (`flush_progress()`)
- `eta.py` - ETA calculation and milestone scheduling
- `events.py` - Event insertion, fetching, consumption
- `business_time.py` - Business day calculations, payroll boundaries
- `handlers/` - Event handlers (task completion, bankruptcy, horizon end)
Event types dispatched by the engine (`engine.py:advance_time()`):

- `TASK_HALF_PROGRESS` - Milestone checkpoints (25%, 50%, 75%)
- `TASK_COMPLETED` - Task reaches 100% completion
- `BANKRUPTCY` - Funds drop below zero
- `HORIZON_END` - Simulation time limit reached
Determinism guarantees:

- All randomness uses seeded RNGs during world generation
- Task progress is computed deterministically from employee skill rates
- Event processing order is deterministic (payroll → event → target at the same timestamp)
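The exact formulas live in `core/progress.py` and `core/eta.py`; the sketch below is an assumed minimal model showing why progress is deterministic (pure arithmetic on skill rates, no RNG) and how milestone crossings would map to `TASK_HALF_PROGRESS` events:

```python
def accumulate(progress: float, skill_rate: float, hours: float, work_required: float) -> float:
    """Deterministic work accumulation: identical inputs always yield identical progress.
    The rate-times-hours-over-required formula is an illustrative assumption."""
    return min(1.0, progress + skill_rate * hours / work_required)


def crossed_milestones(before: float, after: float, milestones=(0.25, 0.5, 0.75)) -> list:
    """Milestones crossed during one flush; each would schedule a checkpoint event."""
    return [m for m in milestones if before < m <= after]
```

Usage: flushing 10 hours of work at rate 2.0 against 40 units required moves progress from 0.0 to 0.5, crossing the 25% and 50% checkpoints.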
### Database (`db/`)
SQLAlchemy ORM models for all simulation state.
Key models (`db/models/`):

- `company.py` - Company funds, prestige per domain (4 domains)
- `employee.py` - Employee tier, salary, skill rates per domain
- `task.py` - Task status (market/planned/active/complete), requirements, progress
- `sim_state.py` - Current sim_time, horizon_end, seed
- `event.py` - Scheduled events with payload and dedupe_key
- `ledger.py` - Financial transaction log (payroll, task rewards)
- `scratchpad.py` - Agent's persistent notes
- `session.py` - Conversation transcript (for rollout export)
Session management:

- CLI commands: one session per command (auto-commit on success)
- Agent loop: a separate session per DB query (via `db_factory()`)
- World seeding: a single transactional session
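A minimal sketch of the per-command session pattern, using stdlib `sqlite3` in place of the real SQLAlchemy session (the `get_db()` name comes from the source; the body is assumed):

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def get_db(path: str):
    """One transactional session per CLI command: commit on success, roll back on error."""
    conn = sqlite3.connect(path)
    try:
        yield conn
        conn.commit()    # auto-commit when the command body succeeds
    except Exception:
        conn.rollback()  # any exception leaves the database unchanged
        raise
    finally:
        conn.close()
```

This is why commands can be stateless: each one opens, commits, and closes its own transaction, so all durable state lives in SQLite.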
### Config (`config/`)
Pydantic-validated configuration system.
Key files:
- `schema.py` - Pydantic models for all experiment parameters
- `loader.py` - TOML → Pydantic validation and loading
- `sampling.py` - Distribution families (Beta, Normal, Triangular, Uniform)
- `presets/` - Difficulty presets (tutorial, easy, medium, hard, nightmare)
Configuration loading:

1. Load the preset TOML from `config/presets/{name}.toml`
2. Parse and validate with Pydantic
3. CLI flags override config values (e.g., `--model`, `--horizon-years`)
4. Export to the `YC_BENCH_EXPERIMENT` env var for CLI subprocess access
### Services (`services/`)
World generation utilities.
Key files:
- `seed_world.py` - Orchestrates world seeding
- `generate_employees.py` - Creates 10 employees with random skill distributions
- `generate_tasks.py` - Generates the market task pool with random requirements
- `rng.py` - Seeded random number generator utilities
World seeding steps:

1. Seed the company with initial funds
2. Create prestige entries for the 4 domains (research, inference, data, training)
3. Generate employees with tier-stratified salaries and per-domain skill rates
4. Generate market tasks with prestige requirements, rewards, and domain requirements
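A sketch of seeded, tier-stratified employee generation. The four domains come from the source; the tier names, salary bands, and skill range are invented for illustration (real distributions come from `config/sampling.py`):

```python
import random

# Illustrative tier -> salary band mapping (not the real numbers)
TIERS = {"junior": (60_000, 90_000), "mid": (90_000, 140_000), "senior": (140_000, 220_000)}
DOMAINS = ("research", "inference", "data", "training")


def generate_employee(rng: random.Random, tier: str) -> dict:
    """All randomness flows through one seeded rng, so generation is reproducible."""
    lo, hi = TIERS[tier]
    return {
        "tier": tier,
        "salary": rng.randint(lo, hi),
        "skills": {d: round(rng.uniform(0.1, 1.0), 3) for d in DOMAINS},
    }
```

Passing the same seeded `random.Random` yields byte-identical worlds, which is what makes the benchmark deterministic end to end.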
### Runner (`runner/`)
Benchmark orchestration and live dashboard.
Key files:
- `main.py` - Entrypoint: DB setup, world init, agent loop invocation
- `dashboard.py` - Rich-based live terminal UI (funds, prestige, tasks, turn log)
- `args.py` - CLI argument parsing
- `session.py` - Session ID generation
Benchmark flow (`main.py:run_benchmark()`):

1. Load the experiment config (preset or TOML path)
2. Create or resume the SQLite database
3. Seed the world if not already initialized
4. Build the agent runtime (LiteLLM + command executor)
5. Initialize run state (tracks turns, costs, terminal reason)
6. Start the live dashboard (if in a TTY)
7. Run the agent loop until terminal
8. Save the rollout JSON to `results/`
9. Print a summary (turns, cost, final funds, outcome)
## Data Flow
### Agent Turn Cycle

See the turn-by-turn cycle described under Agent (`agent/`) above.
### Time Advancement Flow

When the agent calls `yc-bench sim resume`, the engine advances simulation time and dispatches any due events (see `core/engine.py:advance_time()`).
## Extension Points
### Adding New CLI Commands
1. Create a new command function in the appropriate `cli/*_commands.py` file
2. Use the `@app.command()` decorator (where `app` is the Typer sub-app)
3. Query/mutate the DB using the `get_db()` context manager
4. Return JSON via `json_output(data)` or `error_output(message)`
5. Add the command to the agent tool schema in `agent/tools/run_command_schema.py` if it should be agent-accessible
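The steps above might look like the following minimal command. The `ping` command is hypothetical, and the body is simplified; a real command would open `get_db()` and build its payload from the database:

```python
import json

import typer

app = typer.Typer()  # stand-in for the relevant sub-app in cli/


@app.command()
def ping(name: str = "world") -> None:
    """Hypothetical command: emits the JSON envelope like every YC-Bench command."""
    # Real commands would query/mutate the DB inside get_db() here.
    typer.echo(json.dumps({"ok": True, "name": name}))
```

After this, the command is invocable from the CLI; exposing it to the agent additionally requires the tool-schema entry from step 5.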
### Customizing Simulation Mechanics
Modify prestige mechanics:

- Edit `core/engine.py:apply_prestige_decay()` for the decay formula
- Edit `core/handlers/task_complete.py` for prestige rewards
- Edit `cli/task_commands.py:accept()` for the prestige gating logic
Modify task progress mechanics:

- Edit `core/progress.py:flush_progress()` for work accumulation
- Edit `core/eta.py:recalculate_etas()` for ETA logic
- Adjust `prestige_qty_scale` in the config to change prestige→work scaling
Add a new event type:

1. Add an enum variant to `db/models/event.py:EventType`
2. Create a handler in `core/handlers/my_event.py`
3. Register the handler in `core/engine.py:dispatch_event()`
4. Insert events via `core/events.py:insert_event()`
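A hypothetical registry sketch of this handler-registration pattern (the real dispatch lives in `core/engine.py:dispatch_event()`; all names below are stand-ins):

```python
from typing import Callable

# Maps event-type names to handler callables, mirroring step 3 above.
HANDLERS: dict[str, Callable[[dict], None]] = {}


def register(event_type: str):
    """Decorator that registers a handler for one event type."""
    def deco(fn: Callable[[dict], None]) -> Callable[[dict], None]:
        HANDLERS[event_type] = fn
        return fn
    return deco


def dispatch_event(event: dict) -> None:
    """Look up the handler by the event's type and hand it the payload."""
    HANDLERS[event["type"]](event["payload"])


@register("MY_EVENT")
def handle_my_event(payload: dict) -> None:
    payload["handled"] = True  # a real handler would mutate simulation state
```

Keeping dispatch table-driven means adding an event type never touches the engine's time-advancement loop, only the registry.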
### Creating New Agent Runtimes
To support a new LLM provider or custom agent architecture:

1. Subclass `agent/runtime/base.py:AgentRuntime`
2. Implement `run_turn(request: RuntimeTurnRequest) -> RuntimeTurnResult`
3. Implement `clear_session(session_id: str)`
4. Register it in `agent/runtime/factory.py:build_runtime()`
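A sketch of a custom runtime. The request/result types here are stand-ins with assumed fields; the real definitions live in `agent/runtime/base.py`:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class RuntimeTurnRequest:  # assumed fields, for illustration only
    session_id: str
    messages: list = field(default_factory=list)


@dataclass
class RuntimeTurnResult:  # assumed fields, for illustration only
    tool_calls: list
    cost_usd: float = 0.0


class AgentRuntime(ABC):
    @abstractmethod
    def run_turn(self, request: RuntimeTurnRequest) -> RuntimeTurnResult: ...

    @abstractmethod
    def clear_session(self, session_id: str) -> None: ...


class EchoRuntime(AgentRuntime):
    """Toy runtime that always asks for company status; useful for smoke tests."""

    def __init__(self) -> None:
        self.sessions: dict[str, list] = {}

    def run_turn(self, request: RuntimeTurnRequest) -> RuntimeTurnResult:
        self.sessions.setdefault(request.session_id, []).append(request.messages)
        return RuntimeTurnResult(tool_calls=["company status"])

    def clear_session(self, session_id: str) -> None:
        self.sessions.pop(session_id, None)
```

A deterministic runtime like this also makes the rest of the pipeline (loop, executor, dashboard) testable without any LLM calls.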
### Modifying Configuration
Create a custom preset:

1. Copy `src/yc_bench/config/presets/default.toml`
2. Edit parameters (see `config/schema.py` for all options)
3. Save as `my_preset.toml`
4. Run with `--config my_preset` or `--config /path/to/my_preset.toml`
Tunable parameters include:

- Initial funds, employees, market tasks
- Prestige decay rate, min/max bounds, reward scale
- Deadline pressure, penalty multipliers
- Salary bump rate, work hours per day
- Sampling distributions for world generation (see `WorldDists`)
- Agent model, temperature, history window
- Auto-advance threshold, max turns
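A custom preset might tune a few of these parameters. The key names below are illustrative guesses, not the real schema; consult `config/schema.py` for the actual field names and types:

```toml
# my_preset.toml (illustrative key names only; see config/schema.py)
initial_funds = 500000
horizon_years = 2
prestige_decay_rate = 0.02
model = "gpt-4o"
history_keep_rounds = 10
```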
### Custom Metrics and Logging
Inject callbacks into the agent loop; see `runner/dashboard.py` for a complete example (a Rich-based live terminal UI).
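A minimal observer sketch, under the assumption that the loop exposes a per-turn hook (the hook name and signature below are hypothetical; `runner/dashboard.py` drives its UI from the same kind of per-turn data):

```python
class TurnObserver:
    """Collects per-turn metrics; a dashboard would render instead of storing."""

    def __init__(self) -> None:
        self.log: list[dict] = []

    def on_turn_end(self, turn: int, funds: float, cost_usd: float) -> None:
        self.log.append({"turn": turn, "funds": funds, "cost_usd": cost_usd})


def run_with_observers(turns: list, observers: list) -> None:
    """Stand-in for the agent loop: after each turn, notify every observer."""
    for i, (funds, cost) in enumerate(turns):
        # ...agent turn would execute here...
        for obs in observers:
            obs.on_turn_end(i, funds, cost)
```

Keeping observers outside the loop body means custom metrics, CSV export, or a live UI can all be added without modifying `agent/loop.py` itself.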