Overview

YC-Bench is a deterministic long-horizon benchmark where an LLM agent plays the CEO of an AI startup. The system is built as a discrete-event simulation with a CLI interface, backed by SQLite, and orchestrated by an agent runtime loop.

Project Structure

src/yc_bench/
├── cli/              # Command-line interface (Typer commands)
├── agent/            # Agent runtime and execution loop
├── core/             # Simulation engine and event processing
├── db/               # Database models (SQLAlchemy)
├── config/           # Configuration system and presets
├── services/         # World generation utilities
└── runner/           # Benchmark orchestration and dashboard

Key Modules

CLI (cli/)

The agent interacts with the simulation exclusively through JSON-returning CLI commands built with Typer. Key files:
  • __init__.py - Main Typer app and command registration
  • sim_commands.py - Time advancement (sim resume)
  • company_commands.py - Company status and prestige queries
  • employee_commands.py - Employee listing and inspection
  • task_commands.py - Task lifecycle (accept, assign, dispatch, cancel, inspect)
  • market_commands.py - Market task browsing
  • finance_commands.py - Ledger and transaction history
  • report_commands.py - Monthly P&L reports
  • scratchpad_commands.py - Persistent agent memory
  • start_command.py - Interactive quickstart wizard
Design principles:
  • All commands return JSON (via json_output() helper)
  • All commands use transactional DB sessions (get_db() context manager)
  • Errors are returned as JSON: {"error": "message"}
  • Commands are stateless — all state lives in the database
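The JSON envelope these principles describe can be sketched as follows (a minimal model with hypothetical signatures; the real helpers also print to stdout so the agent can read the result):

```python
import json

def json_output(data: dict) -> str:
    """Serialize a successful command result as JSON.
    Hypothetical sketch: the real helper also writes to stdout."""
    return json.dumps(data, default=str)

def error_output(message: str) -> str:
    """Every error shares the same {"error": ...} envelope."""
    return json.dumps({"error": message})
```

Because commands are stateless and every response is machine-parseable JSON, the agent can treat the CLI as a pure request/response tool.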

Agent (agent/)

Handles the LLM agent runtime, conversation management, and tool execution. Key files:
  • loop.py - Main agent execution loop (run_agent_loop())
  • prompt.py - System prompt and turn context builders
  • run_state.py - Tracks turn count, costs, terminal state
  • runtime/base.py - Abstract runtime interface
  • runtime/litellm_runtime.py - LiteLLM-based implementation
  • runtime/factory.py - Runtime instantiation
  • commands/executor.py - CLI command execution wrapper
  • commands/policy.py - Safety policies (prevents destructive operations)
Agent loop flow:
  1. Build turn context (sim state snapshot + history)
  2. Call LLM runtime with tool schema (run_command)
  3. Execute tool calls sequentially via command_executor
  4. Check if agent called sim resume (checkpoint advancement)
  5. If no resume after N turns, force auto-advance
  6. Check terminal conditions (bankruptcy, horizon end, max turns)
  7. Record turn transcript and costs
  8. Repeat until terminal
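The flow above can be sketched as a simplified skeleton (the real loop in loop.py also builds turn context, records transcripts, and tracks costs; the callback names here are hypothetical):

```python
def run_agent_loop(run_turn, check_terminal, force_advance,
                   auto_advance_after_turns, max_turns):
    """Simplified agent-loop skeleton. `run_turn` executes one LLM turn and
    returns True if the agent called `sim resume`; `check_terminal` returns a
    terminal reason string or None; `force_advance` advances sim time anyway."""
    turns_without_resume = 0
    for turn in range(1, max_turns + 1):
        resumed = run_turn(turn)
        turns_without_resume = 0 if resumed else turns_without_resume + 1
        # Step 5: if the agent stalls for N turns, force a checkpoint advance
        if turns_without_resume >= auto_advance_after_turns:
            force_advance()
            turns_without_resume = 0
        # Step 6: bankruptcy, horizon end, etc.
        reason = check_terminal()
        if reason is not None:
            return reason
    return "max_turns"
```

The forced auto-advance guarantees simulation time always moves forward even if the agent never voluntarily calls `sim resume`.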
History management:
  • Only the last history_keep_rounds conversation rounds are kept in context
  • Older turns are truncated but preserved in the rollout JSON
  • Scratchpad provides persistent memory across truncation
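The truncation rule reduces to a simple window over conversation rounds (a sketch; the real loop also writes the dropped rounds to the rollout JSON before truncating):

```python
def truncate_history(rounds: list, history_keep_rounds: int) -> list:
    """Keep only the most recent N conversation rounds in the LLM context.
    Older rounds leave the context but survive in the rollout log."""
    if history_keep_rounds <= 0:
        return []
    return rounds[-history_keep_rounds:]
```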

Core (core/)

The discrete-event simulation engine. Key files:
  • engine.py - Time advancement and event dispatch (advance_time())
  • progress.py - Task work accumulation (flush_progress())
  • eta.py - ETA calculation and milestone scheduling
  • events.py - Event insertion, fetching, consumption
  • business_time.py - Business day calculations, payroll boundaries
  • handlers/ - Event handlers (task completion, bankruptcy, horizon end)
Simulation loop (in engine.py:advance_time()):
while current_time < target_time:
    # 1. Find next action: min(next_event, next_payroll, target_time)
    # 2. Flush work progress from current_time → action_time
    # 3. Apply prestige decay proportional to elapsed days
    # 4. Process action:
    #    - If payroll: deduct salaries, bankruptcy check, STOP (wake agent)
    #    - If event: dispatch to handler, consume event
    #      - Task milestone (25%, 50%, 75%): wake agent
    #      - Task complete: award funds/prestige, update ETAs
    #      - Bankruptcy/horizon: terminal condition
    #    - If target: stop
    # 5. Update sim_time, loop
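Step 1's "find next action", together with the deterministic tie-break (payroll before event before target at the same timestamp), can be sketched as:

```python
def next_action(next_event, next_payroll, target_time):
    """Return (time, kind) for the earliest pending action. Ties at the same
    timestamp resolve payroll -> event -> target, keeping runs deterministic.
    Times may be any comparable type (e.g. datetime); None means no such action."""
    candidates = [
        (next_payroll, 0, "payroll"),
        (next_event, 1, "event"),
        (target_time, 2, "target"),
    ]
    time, _, kind = min((c for c in candidates if c[0] is not None),
                        key=lambda c: (c[0], c[1]))
    return time, kind
```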
Event types:
  • TASK_HALF_PROGRESS - Milestone checkpoints (25%, 50%, 75%)
  • TASK_COMPLETED - Task reaches 100% completion
  • BANKRUPTCY - Funds drop below zero
  • HORIZON_END - Simulation time limit reached
Determinism:
  • All randomness uses seeded RNGs during world generation
  • Task progress is computed deterministically from employee skill rates
  • Event processing order is deterministic (payroll → event → target at same timestamp)

Database (db/)

SQLAlchemy ORM models for all simulation state. Key models (db/models/):
  • company.py - Company funds, prestige per domain (4 domains)
  • employee.py - Employee tier, salary, skill rates per domain
  • task.py - Task status (market/planned/active/complete), requirements, progress
  • sim_state.py - Current sim_time, horizon_end, seed
  • event.py - Scheduled events with payload and dedupe_key
  • ledger.py - Financial transaction log (payroll, task rewards)
  • scratchpad.py - Agent’s persistent notes
  • session.py - Conversation transcript (for rollout export)
Database session lifecycle:
  • CLI commands: one session per command (auto-commit on success)
  • Agent loop: separate session per DB query (via db_factory())
  • World seeding: single transactional session
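The per-command session pattern looks roughly like this (a sketch using sqlite3 for self-containment; the real get_db() yields a SQLAlchemy session with the same commit-on-success, rollback-on-error contract):

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def get_db(path=":memory:"):
    """One transactional session per CLI command: commit on success,
    roll back on any exception, always close."""
    conn = sqlite3.connect(path)
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()
```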

Config (config/)

Pydantic-validated configuration system. Key files:
  • schema.py - Pydantic models for all experiment parameters
  • loader.py - TOML → Pydantic validation and loading
  • sampling.py - Distribution families (Beta, Normal, Triangular, Uniform)
  • presets/ - Difficulty presets (tutorial, easy, medium, hard, nightmare)
Configuration hierarchy:
ExperimentConfig
├── agent: AgentConfig       # model, temperature, history_keep_rounds
├── loop: LoopConfig         # auto_advance_after_turns, max_turns
├── sim: SimConfig           # start_date, horizon_years, company_name
└── world: WorldConfig       # funds, employees, tasks, prestige rules
    ├── dist: WorldDists     # sampling distributions for world gen
    └── salary tiers         # junior/mid/senior configs
Loading order:
  1. Load preset TOML from config/presets/{name}.toml
  2. Parse and validate with Pydantic
  3. CLI flags override config values (e.g., --model, --horizon-years)
  4. Export to YC_BENCH_EXPERIMENT env var for CLI subprocess access

Services (services/)

World generation utilities. Key files:
  • seed_world.py - Orchestrates world seeding
  • generate_employees.py - Creates 10 employees with random skill distributions
  • generate_tasks.py - Generates market task pool with random requirements
  • rng.py - Seeded random number generator utilities
World generation flow:
  1. Seed company with initial funds
  2. Create prestige entries for 4 domains (research, inference, data, training)
  3. Generate employees with tier-stratified salaries and per-domain skill rates
  4. Generate market tasks with prestige requirements, rewards, domain requirements
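Step 3 can be sketched as below. The tier salary bounds are illustrative placeholders, not the real preset values, and all randomness flows through one seeded Random instance so identical seeds replay identical worlds:

```python
import random

DOMAINS = ("research", "inference", "data", "training")
SALARY_TIERS = {  # illustrative ranges, not the real preset values
    "junior": (60_000, 90_000),
    "mid": (90_000, 140_000),
    "senior": (140_000, 220_000),
}

def generate_employees(seed: int, count: int = 10) -> list[dict]:
    """Deterministically generate employees with tier-stratified salaries
    and a per-domain skill rate in [0.1, 1.0)."""
    rng = random.Random(seed)
    employees = []
    for i in range(count):
        tier = rng.choice(list(SALARY_TIERS))
        lo, hi = SALARY_TIERS[tier]
        employees.append({
            "id": i,
            "tier": tier,
            "salary": rng.randrange(lo, hi, 1_000),
            "skills": {d: round(rng.uniform(0.1, 1.0), 2) for d in DOMAINS},
        })
    return employees
```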

Runner (runner/)

Benchmark orchestration and live dashboard. Key files:
  • main.py - Entrypoint: DB setup, world init, agent loop invocation
  • dashboard.py - Rich-based live terminal UI (funds, prestige, tasks, turn log)
  • args.py - CLI argument parsing
  • session.py - Session ID generation
Benchmark flow (main.py:run_benchmark()):
  1. Load experiment config (preset or TOML path)
  2. Create/resume SQLite database
  3. Seed world if not already initialized
  4. Build agent runtime (LiteLLM + command executor)
  5. Initialize run state (tracks turns, costs, terminal reason)
  6. Start live dashboard (if in TTY)
  7. Run agent loop until terminal
  8. Save rollout JSON to results/
  9. Print summary (turns, cost, final funds, outcome)

Data Flow

Agent Turn Cycle

┌─────────────────────────────────────────────────┐
│ 1. Build turn context                           │
│    - Snapshot DB state (funds, tasks, prestige) │
│    - Load last N conversation rounds            │
└────────────────┬────────────────────────────────┘


┌─────────────────────────────────────────────────┐
│ 2. Call LLM runtime                             │
│    - Send context + tool schema                 │
│    - LLM responds with tool calls               │
└────────────────┬────────────────────────────────┘


┌─────────────────────────────────────────────────┐
│ 3. Execute tool calls                           │
│    - For each tool call:                        │
│      - run_command("yc-bench ...")              │
│      - CLI command reads/writes DB              │
│      - Return JSON result                       │
└────────────────┬────────────────────────────────┘


┌─────────────────────────────────────────────────┐
│ 4. Check checkpoint advancement                 │
│    - Did agent call "sim resume"?               │
│    - If yes: extract wake events                │
│    - If no for N turns: force auto-advance      │
└────────────────┬────────────────────────────────┘


┌─────────────────────────────────────────────────┐
│ 5. Check terminal conditions                    │
│    - Bankruptcy? Horizon end? Max turns?        │
│    - If terminal: mark reason and stop          │
└────────────────┬────────────────────────────────┘


┌─────────────────────────────────────────────────┐
│ 6. Record turn                                  │
│    - Save transcript (user input + agent output)│
│    - Update cost accumulator                    │
│    - Truncate old conversation rounds           │
└─────────────────────────────────────────────────┘

Time Advancement Flow

When the agent calls yc-bench sim resume:
┌─────────────────────────────────────────────────┐
│ CLI: sim_commands.py:resume()                   │
│ - Load current sim_time from DB                 │
│ - Determine target_time (next event or +7 days) │
└────────────────┬────────────────────────────────┘


┌─────────────────────────────────────────────────┐
│ Engine: engine.py:advance_time()                │
│ Loop until current_time >= target_time:         │
│   1. Find min(next_event, next_payroll, target) │
│   2. Flush progress: update task completion     │
│   3. Apply prestige decay (all domains)         │
│   4. Process action (payroll/event/target)      │
│   5. Update sim_time in DB                      │
└────────────────┬────────────────────────────────┘


┌─────────────────────────────────────────────────┐
│ Return wake events to agent                     │
│ - monthly_payroll: funds after payroll          │
│ - task_half: milestone checkpoint (25/50/75%)   │
│ - task_completed: success, funds_delta          │
│ - bankruptcy: terminal                          │
│ - horizon_end: terminal                         │
└─────────────────────────────────────────────────┘

Extension Points

Adding New CLI Commands

  1. Create a new command function in the appropriate cli/*_commands.py file
  2. Use @app.command() decorator (where app is the Typer sub-app)
  3. Query/mutate DB using get_db() context manager
  4. Return JSON via json_output(data) or error_output(message)
  5. Add command to agent tool schema in agent/tools/run_command_schema.py if agent-accessible
Example:
# cli/task_commands.py
from uuid import UUID

import typer

@task_app.command("priority")
def set_task_priority(
    task_id: UUID = typer.Option(...),
    priority: int = typer.Option(...),
):
    with get_db() as db:
        task = db.query(Task).filter(Task.id == task_id).first()
        if not task:
            error_output("Task not found")
            return  # bail out before touching a missing task
        task.priority = priority
        db.flush()
        json_output({"task_id": str(task_id), "priority": priority})

Customizing Simulation Mechanics

Modify prestige mechanics:
  • Edit core/engine.py:apply_prestige_decay() for decay formula
  • Edit core/handlers/task_complete.py for prestige rewards
  • Edit cli/task_commands.py:accept() for prestige gating logic
Change task progress calculation:
  • Edit core/progress.py:flush_progress() for work accumulation
  • Edit core/eta.py:recalculate_etas() for ETA logic
  • Adjust prestige_qty_scale in config to change prestige→work scaling
Add new event types:
  1. Add enum variant to db/models/event.py:EventType
  2. Create handler in core/handlers/my_event.py
  3. Register handler in core/engine.py:dispatch_event()
  4. Insert events via core/events.py:insert_event()
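The dispatch pattern behind steps 1-3 looks roughly like this (names and payload shapes are hypothetical; the real dispatch lives in core/engine.py):

```python
from enum import Enum

class EventType(Enum):
    TASK_COMPLETED = "task_completed"
    MY_EVENT = "my_event"  # step 1: new enum variant

def handle_my_event(payload: dict) -> dict:
    # step 2: handler logic for the new event type
    return {"handled": True, "payload": payload}

# step 3: register the handler in the dispatch table
HANDLERS = {EventType.MY_EVENT: handle_my_event}

def dispatch_event(event_type: EventType, payload: dict) -> dict:
    handler = HANDLERS.get(event_type)
    if handler is None:
        raise ValueError(f"no handler registered for {event_type}")
    return handler(payload)
```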

Creating New Agent Runtimes

To support a new LLM provider or custom agent architecture:
  1. Subclass agent/runtime/base.py:AgentRuntime
  2. Implement run_turn(request: RuntimeTurnRequest) -> RuntimeTurnResult
  3. Implement clear_session(session_id: str)
  4. Register in agent/runtime/factory.py:build_runtime()
Example:
# agent/runtime/my_runtime.py
from .base import AgentRuntime
from .schemas import RuntimeTurnRequest, RuntimeTurnResult

class MyRuntime(AgentRuntime):
    def run_turn(self, request):
        # 1. Load conversation history for session_id
        # 2. Append user_input to history
        # 3. Call your LLM with tool schema
        # 4. Execute tool calls via command_executor
        # 5. Return RuntimeTurnResult with:
        #    - final_output: agent's final response
        #    - resume_payload: if "sim resume" was called
        #    - checkpoint_advanced: bool
        #    - turn_cost_usd: float
        #    - raw_result: dict for logging
        pass
    
    def clear_session(self, session_id):
        # Remove conversation history for session_id
        pass

Modifying Configuration

Create a custom preset:
  1. Copy src/yc_bench/config/presets/default.toml
  2. Edit parameters (see config/schema.py for all options)
  3. Save as my_preset.toml
  4. Run with --config my_preset or --config /path/to/my_preset.toml
Tunable parameters:
  • Initial funds, employees, market tasks
  • Prestige decay rate, min/max bounds, reward scale
  • Deadline pressure, penalty multipliers
  • Salary bump rate, work hours per day
  • Sampling distributions for world generation (see WorldDists)
  • Agent model, temperature, history window
  • Auto-advance threshold, max turns
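A custom preset might look like the fragment below. The key names are illustrative, assembled from the parameters documented above; config/schema.py is the authoritative source:

```toml
# my_preset.toml (illustrative key names; see config/schema.py)
[agent]
model = "gpt-4o"            # overridable with --model
temperature = 0.2
history_keep_rounds = 10

[loop]
auto_advance_after_turns = 3
max_turns = 400

[sim]
horizon_years = 2           # overridable with --horizon-years
company_name = "Acme AI"

[world]
funds = 500_000
```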

Custom Metrics and Logging

Inject callbacks into the agent loop:
def on_turn_start(turn_num):
    print(f"Starting turn {turn_num}")

def on_turn(snapshot, run_state, commands_executed):
    # snapshot: current sim state (funds, tasks, prestige)
    # run_state: turn count, costs, terminal status
    # commands_executed: list of CLI commands run this turn
    log_metrics(snapshot, run_state)

run_agent_loop(
    runtime=runtime,
    db_factory=db_factory,
    company_id=company_id,
    run_state=run_state,
    on_turn_start=on_turn_start,
    on_turn=on_turn,
)
See runner/dashboard.py for a complete example (live terminal UI).