Overview

generate_loop.py is the main entry point for HyperAgents. It orchestrates the full evolutionary loop: selecting a parent generation, running the meta-agent inside a Docker container to produce a diff, evaluating the resulting agent, updating the archive, and repeating.
python generate_loop.py --domains <domain> [options]
Outputs are written to outputs/generate_<timestamp>/ by default.

Arguments

--run_id

run_id
string
default:"null (auto-generated)"
A string identifier for this run. If not provided, a timestamp-based ID is generated automatically (YYYYMMDD_HHMMSS_ffffff). The run ID becomes the suffix of the output directory name: outputs/generate_<run_id>/.
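The auto-generated ID and output path can be reproduced with standard strftime codes; the sketch below is illustrative (the exact helper names inside generate_loop.py may differ):

```python
from datetime import datetime
from pathlib import Path

def make_run_id() -> str:
    # Timestamp-based ID: YYYYMMDD_HHMMSS_ffffff (microsecond precision)
    return datetime.now().strftime("%Y%m%d_%H%M%S_%f")

def output_dir(run_id: str, parent: str = "./outputs/") -> Path:
    # The run ID becomes the suffix of the output directory name
    return Path(parent) / f"generate_{run_id}"
```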

--domains (required)

domains
string[]
required
One or more domains to run. All listed domains are evaluated jointly in each generation. Accepted values:
  • search_arena
  • paper_review
  • balrog_babyai
  • balrog_babaisai
  • balrog_minihack
  • balrog_nle
  • genesis_go2walking
  • genesis_go2walkback
  • genesis_go2hop
  • polyglot
  • imo_grading
  • imo_proof
polyglot uses a separate evaluation harness from all other domains and is handled specially inside generate_loop.py. When polyglot is included alongside other domains, the other domains run through the standard domains.harness path while polyglot runs through its own two-stage evaluation.

--max_generation

max_generation
integer
default:"10"
Total number of evolution generations to run. The loop runs generations start through max_generation inclusive, where start is determined by the current archive size (0 on a fresh run, or the resume point when using --resume_from).
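The relationship between archive size, resume point, and the generations actually executed can be sketched as follows (hypothetical helper, not the actual implementation):

```python
def generations_to_run(archive_size: int, max_generation: int) -> range:
    # Fresh run: the archive is empty, so start at generation 0.
    # Resumed run: the archive already holds `archive_size` generations,
    # so continue from that index. The endpoint is inclusive.
    return range(archive_size, max_generation + 1)
```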

--eval_samples

eval_samples
integer[]
default:"-1 (all samples)"
Number of evaluation samples per domain. Provide one integer per domain listed in --domains in the same order. Use -1 to evaluate on all available samples for that domain.
# Evaluate 50 samples on search_arena and all samples on paper_review
python generate_loop.py --domains search_arena paper_review --eval_samples 50 -1

--eval_workers

eval_workers
integer
default:"10"
Number of parallel workers used when running evaluation inside the container. Higher values speed up evaluation but require more CPU and memory.

--parent_selection

parent_selection
string
default:"score_child_prop"
Strategy for selecting which archived generation to use as the parent for the next generation.
  • random: Pick any valid parent uniformly at random
  • latest: Always use the most recent valid generation
  • best: Use the highest-scoring generation in the archive
  • score_prop: Sample proportionally to each node's own score
  • score_child_prop: Sample proportionally to the best score among a node's children (default)
When --run_baseline no_archive is set, this argument is ignored and latest is used automatically, since the no-archive baseline always continues from the most recent generation.
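As an illustration only (the real selection logic lives elsewhere in the codebase and the archive record shape shown here is an assumption), the strategies above could be implemented like this:

```python
import random

def select_parent(nodes, strategy="score_child_prop", rng=random):
    # `nodes`: list of dicts with "score" and "child_scores" fields
    # (hypothetical shape, for illustration only).
    if strategy == "random":
        return rng.choice(nodes)
    if strategy == "latest":
        return nodes[-1]
    if strategy == "best":
        return max(nodes, key=lambda n: n["score"])
    if strategy == "score_prop":
        weights = [n["score"] for n in nodes]
    elif strategy == "score_child_prop":
        # Weight each node by the best score among its children,
        # falling back to its own score for childless nodes.
        weights = [max(n["child_scores"], default=n["score"]) for n in nodes]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return rng.choices(nodes, weights=weights, k=1)[0]
```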

--resume_from

resume_from
string
Path to an existing output directory (outputs/generate_<run_id>/) to resume an interrupted run. The loop reads the existing archive and continues from where it left off.
python generate_loop.py --domains search_arena --resume_from outputs/generate_20240101_120000_000000

--output_dir_parent

output_dir_parent
string
default:"./outputs/"
Parent directory under which the new run’s output folder is created. Defaults to outputs/ relative to the working directory. The actual output directory will be <output_dir_parent>/generate_<run_id>/.

--meta_patch_files

meta_patch_files
string[]
One or more .diff patch files to apply to the repository before starting the evolution loop. These patches define the initial state of the agent before any meta-agent improvements. If not provided, the run starts from the unmodified repository.

--reset_task_agent

reset_task_agent
boolean
default:"false"
When set, changes to the task agent from --meta_patch_files are reset so that the task agent starts from its original form. Used for self-referential self-improvement transfer experiments where you want to apply a pre-evolved meta-agent but keep the task agent at its baseline.

--reset_meta_agent

reset_meta_agent
boolean
default:"false"
When set, changes to the meta-agent from --meta_patch_files are reset. Used in combination with --reset_task_agent for transfer experiments.
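For example, a transfer experiment might apply a pre-evolved patch while keeping the task agent at its baseline (the patch filename here is illustrative, and the boolean flags are assumed to be plain switches):

```shell
python generate_loop.py --domains paper_review \
  --meta_patch_files evolved_meta_agent.diff \
  --reset_task_agent
```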

--copy_root_dir

copy_root_dir
string
Path to a directory to use as the root repository state instead of the current working directory. Useful when comparing runs that need to start from the exact same initial commit.

--run_baseline

run_baseline
string
Run one of the comparison baselines instead of the default HyperAgents algorithm. See Baselines for full details.
  • no_selfimprove: No meta-agent improvement; static task agent
  • no_archive: Linear chain; only the latest generation is used as parent
  • dgm: DGM coding agent as meta-agent
  • dgm_custom: DGM with a custom problem statement
  • dgm+no_selfimprove: DGM meta-agent with a static underlying agent
  • dgm_custom+no_selfimprove: DGM custom problem statement with a static underlying agent

--optimize_option

optimize_option
string
default:"only_agent"
Controls which components of the system the meta-agent is allowed to optimize.
  • only_agent: Optimize only the task agent code (default)
  • only_ensemble: Optimize only the ensemble selection mechanism; requires --agent_archive_path
  • both_agent_ensemble: Optimize both the task agent and the ensemble mechanism

--agent_archive_path

agent_archive_path
string
Path to an existing agent archive directory. Required when --optimize_option only_ensemble is used — the ensemble optimizer needs a pre-existing set of agents to combine.
python generate_loop.py --domains search_arena \
  --optimize_option only_ensemble \
  --agent_archive_path outputs/generate_20240101_120000_000000

--eval_test

eval_test
boolean
default:"false"
When set, evaluation is also run on the held-out test split in addition to the training split. By default only the train split is evaluated during the evolution loop to avoid test-set leakage.

--skip_staged_eval

skip_staged_eval
boolean
default:"false"
Skip the staged (small-sample) evaluation gate. By default, each generation first evaluates on a small subset of problems before running full evaluation. If the agent scores 0 on the small subset, full evaluation is skipped. Setting this flag bypasses that gate and always runs full evaluation.
Use --skip_staged_eval during debugging or when you want deterministic evaluation regardless of intermediate scores.
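The gate's decision can be sketched as a small predicate (a hypothetical helper; the actual staged-evaluation code may differ):

```python
def should_run_full_eval(staged_score: float, skip_staged_eval: bool) -> bool:
    # With --skip_staged_eval, always proceed to full evaluation.
    if skip_staged_eval:
        return True
    # Otherwise, a score of 0 on the small subset short-circuits the
    # expensive full evaluation for this generation.
    return staged_score > 0
```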

--edit_select_parent

edit_select_parent
boolean
default:"false"
Allow the meta-agent to modify the parent selection mechanism itself. When enabled, parent selection runs inside a Docker container via utils.run_select_next_parent instead of being computed on the host. This is part of the fully self-referential setting described in the paper.

Usage Examples

# Run HyperAgents on search_arena for 10 generations
python generate_loop.py --domains search_arena

Output Structure

Each run produces a directory under outputs/:
outputs/
└── generate_<run_id>/
    ├── archive.jsonl              # Archive state after each generation
    ├── generate_loop.log          # Argument log for reproducibility
    ├── select_next_parent.log     # Parent selection log (if --edit_select_parent)
    ├── gen_initial/               # Baseline evaluation of the starting agent
    ├── gen_0/                     # Generation 0 (if meta_patch_files provided)
    │   ├── metadata.json
    │   ├── generate.log
    │   ├── agent_output/
    │   │   ├── model_patch.diff
    │   │   └── meta_agent_chat_history.md
    │   └── <domain>_eval/
    ├── gen_1/
    │   └── ...
    └── gen_N/
The archive.jsonl file records the full lineage graph and scores for all generations and is used by the parent selection algorithm in subsequent runs.
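Since archive.jsonl holds one JSON record per line, it can be inspected with a minimal reader; the "generation" and "score" field names below are assumptions about the record shape:

```python
import json

def load_archive(path):
    # One JSON object per line; skip blank lines.
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def best_generation(records):
    # Highest-scoring archive entry (assumes a "score" field).
    return max(records, key=lambda r: r.get("score", float("-inf")))
```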