Overview
generate_loop.py is the main entry point for HyperAgents. It orchestrates the full evolutionary loop: selecting a parent generation, running the meta-agent inside a Docker container to produce a diff, evaluating the resulting agent, updating the archive, and repeating.
Outputs are written to outputs/generate_<timestamp>/ by default.
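The default output location can be sketched as follows. This is a minimal illustration, not the code in generate_loop.py: the helper names `default_run_id` and `resolve_output_dir` are hypothetical, and only the timestamp layout (YYYYMMDD_HHMMSS_ffffff) and the generate_<run_id> naming come from this page.

```python
from datetime import datetime
from pathlib import Path


def default_run_id(now=None):
    """Return a timestamp-based run ID in YYYYMMDD_HHMMSS_ffffff form.

    Hypothetical helper: the name is an assumption, the format is from the docs.
    """
    now = now or datetime.now()
    # %f is the zero-padded six-digit microsecond field.
    return now.strftime("%Y%m%d_%H%M%S_%f")


def resolve_output_dir(run_id=None, output_dir_parent="outputs"):
    """Build <output_dir_parent>/generate_<run_id> without creating it."""
    run_id = run_id or default_run_id()
    return Path(output_dir_parent) / f"generate_{run_id}"


if __name__ == "__main__":
    print(resolve_output_dir())                   # timestamp-based directory name
    print(resolve_output_dir(run_id="my_run"))    # explicit run ID
```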
Arguments
--run_id
A string identifier for this run. If not provided, a timestamp-based ID is generated automatically (YYYYMMDD_HHMMSS_ffffff). The run ID becomes the suffix of the output directory name: outputs/generate_<run_id>/.
--domains (required)
One or more domains to run. All listed domains are evaluated jointly in each generation. Accepted values:
search_arena, paper_review, balrog_babyai, balrog_babaisai, balrog_minihack, balrog_nle, genesis_go2walking, genesis_go2walkback, genesis_go2hop, polyglot, imo_grading, imo_proof
polyglot uses a separate evaluation harness from all other domains and is handled specially inside generate_loop.py. When polyglot is included alongside other domains, the other domains run through the standard domains.harness path while polyglot runs through its own two-stage evaluation.
--max_generation
Total number of evolution generations to run. The loop runs generations start through max_generation inclusive, where start is determined by the current archive size (0 on a fresh run, or the resume point when using --resume_from).
--eval_samples
Number of evaluation samples per domain. Provide one integer per domain listed in --domains, in the same order. Use -1 to evaluate on all available samples for that domain.
--eval_workers
Number of parallel workers used when running evaluation inside the container. Higher values speed up evaluation but require more CPU and memory.
--parent_selection
Strategy for selecting which archived generation to use as the parent for the next generation.
| Value | Description |
|---|---|
| random | Pick any valid parent uniformly at random |
| latest | Always use the most recent valid generation |
| best | Use the highest-scoring generation in the archive |
| score_prop | Sample proportionally to each node’s own score |
| score_child_prop | Sample proportionally to the best score among a node’s children (default) |
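The strategies in the table can be sketched roughly as below. This is an illustration under stated assumptions, not the selection code used by generate_loop.py: the archive layout (a dict of generation ID to score and parent), the function name `select_parent`, the handling of childless nodes, and the all-zero fallback are all hypothetical. Scores are assumed to be non-negative, and every archive entry is assumed to be a valid parent.

```python
import random


def select_parent(archive, strategy="score_child_prop", rng=None):
    """Pick a parent generation ID from a toy archive.

    archive: dict mapping generation ID (int) to
             {"score": float, "parent": int or None}  (assumed layout).
    """
    rng = rng or random.Random()
    ids = sorted(archive)
    if not ids:
        raise ValueError("archive is empty")

    if strategy == "random":
        return rng.choice(ids)
    if strategy == "latest":
        return ids[-1]
    if strategy == "best":
        return max(ids, key=lambda i: archive[i]["score"])

    if strategy == "score_prop":
        weights = [archive[i]["score"] for i in ids]
    elif strategy == "score_child_prop":
        # Best score observed among each node's direct children.
        best_child = {}
        for i in ids:
            parent = archive[i]["parent"]
            if parent is not None:
                best_child[parent] = max(best_child.get(parent, 0.0),
                                         archive[i]["score"])
        # Assumption: a node with no children falls back to its own score.
        weights = [best_child.get(i, archive[i]["score"]) for i in ids]
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    if sum(weights) <= 0:
        # Assumption: with no positive weight, fall back to a uniform choice.
        return rng.choice(ids)
    return rng.choices(ids, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy = {
        0: {"score": 0.2, "parent": None},
        1: {"score": 0.5, "parent": 0},
        2: {"score": 0.4, "parent": 0},
    }
    for name in ("random", "latest", "best", "score_prop", "score_child_prop"):
        print(name, select_parent(toy, name, random.Random(0)))
```

Weighting by children's scores favors nodes whose descendants have already done well, whereas score_prop looks only at the node itself.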
When --run_baseline no_archive is set, this argument is ignored and latest is used automatically, since the no-archive baseline always continues from the most recent generation.
--resume_from
Path to an existing output directory (outputs/generate_<run_id>/) to resume an interrupted run. The loop reads the existing archive and continues from where it left off.
--output_dir_parent
Parent directory under which the new run’s output folder is created. Defaults to outputs/ relative to the working directory. The actual output directory will be <output_dir_parent>/generate_<run_id>/.
--meta_patch_files
One or more .diff patch files to apply to the repository before starting the evolution loop. These patches define the initial state of the agent before any meta-agent improvements. If not provided, the run starts from the unmodified repository.
--reset_task_agent
When set, changes to the task agent from --meta_patch_files are reset so that the task agent starts from its original form. Used for self-referential self-improvement transfer experiments where you want to apply a pre-evolved meta-agent but keep the task agent at its baseline.
--reset_meta_agent
When set, changes to the meta-agent from --meta_patch_files are reset. Used in combination with --reset_task_agent for transfer experiments.
--copy_root_dir
Path to a directory to use as the root repository state instead of the current working directory. Useful when comparing runs that need to start from the exact same initial commit.
--run_baseline
Run one of the comparison baselines instead of the default HyperAgents algorithm. See Baselines for full details.
| Value | Description |
|---|---|
| no_selfimprove | No meta-agent improvement; static task agent |
| no_archive | Linear chain; only the latest generation is used as parent |
| dgm | DGM coding agent as meta-agent |
| dgm_custom | DGM with a custom problem statement |
| dgm+no_selfimprove | DGM meta-agent with a static underlying agent |
| dgm_custom+no_selfimprove | DGM custom + static underlying agent |
--optimize_option
Controls which components of the system the meta-agent is allowed to optimize.
| Value | Description |
|---|---|
| only_agent | Optimize only the task agent code (default) |
| only_ensemble | Optimize only the ensemble selection mechanism; requires --agent_archive_path |
| both_agent_ensemble | Optimize both the task agent and the ensemble mechanism |
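The dependency between only_ensemble and --agent_archive_path could be enforced at parse time along these lines. This is a sketch, assuming an argparse-based CLI: the flag names, choices, and default come from this page, while the function names and the error message are hypothetical and not taken from generate_loop.py.

```python
import argparse


def build_parser():
    """Minimal parser covering only the two flags discussed here."""
    parser = argparse.ArgumentParser(prog="generate_loop.py")
    parser.add_argument(
        "--optimize_option",
        choices=["only_agent", "only_ensemble", "both_agent_ensemble"],
        default="only_agent",
    )
    parser.add_argument("--agent_archive_path", default=None)
    return parser


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    # only_ensemble needs a pre-existing set of agents to combine.
    if args.optimize_option == "only_ensemble" and args.agent_archive_path is None:
        parser.error("--optimize_option only_ensemble requires --agent_archive_path")
    return args


if __name__ == "__main__":
    print(parse_args(["--optimize_option", "only_ensemble",
                      "--agent_archive_path", "outputs/generate_demo"]))
```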
--agent_archive_path
Path to an existing agent archive directory. Required when --optimize_option only_ensemble is used, because the ensemble optimizer needs a pre-existing set of agents to combine.
--eval_test
When set, evaluation is also run on the held-out test split in addition to the training split. By default only the train split is evaluated during the evolution loop to avoid test-set leakage.
--skip_staged_eval
Skip the staged (small-sample) evaluation gate. By default, each generation first evaluates on a small subset of problems before running full evaluation. If the agent scores 0 on the small subset, full evaluation is skipped. Setting this flag bypasses that gate and always runs full evaluation.
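The gate described above amounts to a short-circuit before the expensive evaluation. The sketch below illustrates that control flow only: `evaluate` is a hypothetical callable that takes a sample count and returns a score, and the function name, parameters, and return shape are assumptions rather than the actual harness API.

```python
def evaluate_generation(evaluate, small_n, full_n, skip_staged_eval=False):
    """Run the staged gate, then full evaluation if the gate passes.

    evaluate: hypothetical callable, evaluate(num_samples) -> float score.
    Returns a dict with the staged and full scores (None when not run).
    """
    staged_score = None
    if not skip_staged_eval:
        staged_score = evaluate(small_n)
        if staged_score == 0:
            # Agent solved nothing on the small subset: skip full evaluation.
            return {"staged_score": staged_score, "full_score": None}
    full_score = evaluate(full_n)
    return {"staged_score": staged_score, "full_score": full_score}


if __name__ == "__main__":
    def broken_agent(num_samples):
        return 0.0  # stand-in for an agent that fails every sample

    print(evaluate_generation(broken_agent, small_n=5, full_n=100))
    print(evaluate_generation(broken_agent, small_n=5, full_n=100,
                              skip_staged_eval=True))
```

With the gate enabled, a broken agent costs only the small-subset evaluation; with --skip_staged_eval, the full evaluation always runs.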
--edit_select_parent
Allow the meta-agent to modify the parent selection mechanism itself. When enabled, parent selection runs inside a Docker container via utils.run_select_next_parent instead of being computed on the host. This is part of the fully self-referential setting described in the paper.
Usage Examples
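As an illustration of how the arguments combine, the following sketch assembles a command line for a two-domain run. It assumes the script is invoked as `python generate_loop.py` and that multi-valued flags take space-separated values; neither is confirmed on this page. The helper `build_command` and the specific sample counts and generation count are hypothetical.

```python
import shlex


def build_command(domains, eval_samples, max_generation,
                  parent_selection="score_child_prop", run_id=None):
    """Assemble an argument list for generate_loop.py (illustrative only)."""
    # --eval_samples expects one integer per domain, in the same order.
    if len(domains) != len(eval_samples):
        raise ValueError("provide exactly one eval_samples value per domain")
    cmd = ["python", "generate_loop.py",
           "--domains", *domains,
           "--eval_samples", *[str(n) for n in eval_samples],
           "--max_generation", str(max_generation),
           "--parent_selection", parent_selection]
    if run_id is not None:
        cmd += ["--run_id", run_id]
    return cmd


if __name__ == "__main__":
    # 50 samples for search_arena, all available samples (-1) for polyglot.
    cmd = build_command(["search_arena", "polyglot"], [50, -1], max_generation=10)
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` as-is, or printed with `shlex.join` and pasted into a shell.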
Output Structure
Each run produces a directory under outputs/.
The archive.jsonl file records the full lineage graph and scores for all generations and is used by the parent selection algorithm in subsequent runs.