generate() in a Docker container for each generation, updates scores, selects the next parent, and saves progress checkpoints.
generate_loop() Signature
generate_loop.py
A Single generate() Call
Each generation is executed inside the generate() function, which manages a dedicated Docker container:
Create and start container
A fresh Docker container is built from the HyperAgents image. The image contains the full repo and all domain dependencies.
Apply parent lineage patches
All .diff files from the selected parent’s lineage (accumulated across ancestors) are applied via patch -p1 inside the container. This reconstructs the parent agent’s exact codebase state.
Run the meta-agent (6h timeout)
run_meta_agent.py is executed inside the container. It invokes MetaAgent.forward(), captures git diff, and saves model_patch.diff.
Compilation check
Before running evaluation, the loop imports both MetaAgent and TaskAgent inside the container. If either fails to import, evaluation is skipped and run_eval is set to False in metadata.
Staged evaluation
Unless --skip_staged_eval is set, the agent is evaluated on a small subset first. If the score is None or <= 0, the generation is pruned and full evaluation does not run. See Staged Evaluation below.
Full evaluation (5h timeout per domain)
If staged evaluation passes (or is skipped), domains/harness.py runs TaskAgent across the full dataset. Results are saved to gen_<id>/<domain>_eval/report.json.
Cleanup
The container runs git reset --hard <root_commit> and git clean -fd to restore a pristine state, then is stopped and removed. Outputs have already been copied to the host by this point.
Staged Evaluation
Staged evaluation is a two-phase filter that avoids spending compute on agents that produce no improvement.
Phase 1 (small subset): Each domain defines a small evaluation subset via get_domain_stagedeval_samples(domain). The agent is evaluated on this subset first.
Threshold: If the staged score is None or exactly 0, the generation is dropped and full evaluation is not run. The metadata records run_full_eval: False.
Phase 2 (full dataset): If the threshold is passed, evaluation runs on the complete dataset.
Score adjustment: For nodes that only completed staged evaluation, get_saved_score() scales the score by stagedeval_frac (a domain-specific fraction) to make it comparable to scores from full evaluation when selecting parents.
utils/gl_utils.py (get_saved_score)
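The gate and the score adjustment described in this section can be illustrated with a minimal sketch. This is not the code from utils/gl_utils.py: the function names passes_staged_eval and comparable_score are hypothetical, and the sketch assumes that "scales the score by stagedeval_frac" means plain multiplication, which the text does not confirm.

```python
from typing import Optional


def passes_staged_eval(staged_score: Optional[float]) -> bool:
    """Return True when the staged score justifies a full evaluation.

    A missing score, a score of 0, or a negative score prunes the generation.
    """
    return staged_score is not None and staged_score > 0


def comparable_score(score: float, ran_full_eval: bool, stagedeval_frac: float) -> float:
    """Put staged-only and full-evaluation scores on one scale.

    ASSUMPTION: scaling is a multiplication by the domain-specific fraction.
    Scores from a full evaluation are returned unchanged.
    """
    if ran_full_eval:
        return score
    return score * stagedeval_frac


if __name__ == "__main__":
    print(passes_staged_eval(None))   # missing score: pruned
    print(passes_staged_eval(0.25))   # positive score: continue to full eval
    print(comparable_score(0.5, ran_full_eval=False, stagedeval_frac=0.1))
```

Under this assumption, a node that only completed staged evaluation ranks below a fully evaluated node with the same raw score, which keeps parent selection from over-trusting results measured on a small subset.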
Parent Selection Strategies
select_parent() in utils/gl_utils.py reads the archive and picks the next parent. The method is set via --parent_selection.
random
Selects uniformly at random from all valid candidates that have evaluation scores.
Good for: exploration experiments, ablation studies.
latest
Always selects the most recently added valid node. Equivalent to a linear chain with no branching.
This is the forced strategy when --run_baseline=no_archive is set.
best
Greedy selection: picks the single highest-scoring node.
Risk: premature convergence, since all future generations branch from the same node.
score_prop
Probabilistic selection proportional to a sigmoid of the score. Higher-scoring nodes are more likely to be selected, but lower-scoring nodes still have a chance.
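A minimal sketch of sigmoid-weighted sampling, to make the score_prop idea concrete. The helper names, the steepness and midpoint parameters, and their default values are all assumptions for illustration; the text only says that selection is proportional to a sigmoid of the score, so the actual constants in select_parent() may differ.

```python
import math
import random
from typing import Dict, Optional


def sigmoid(x: float, steepness: float = 10.0, midpoint: float = 0.5) -> float:
    """Logistic function; steepness and midpoint are hypothetical defaults."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))


def selection_probabilities(scores: Dict[str, float]) -> Dict[str, float]:
    """Map each node's score to a selection probability (sums to 1)."""
    weights = {node: sigmoid(score) for node, score in scores.items()}
    total = sum(weights.values())
    return {node: weight / total for node, weight in weights.items()}


def pick_parent(scores: Dict[str, float], rng: Optional[random.Random] = None) -> str:
    """Sample one node, with probability proportional to sigmoid(score)."""
    rng = rng or random.Random()
    probs = selection_probabilities(scores)
    nodes = list(probs)
    return rng.choices(nodes, weights=[probs[n] for n in nodes], k=1)[0]


if __name__ == "__main__":
    example_scores = {"initial": 0.2, "gen_1": 0.5, "gen_2": 0.8}
    print(selection_probabilities(example_scores))
    print(pick_parent(example_scores, random.Random(0)))
```

Because the sigmoid is strictly positive, every scored node keeps a non-zero chance of being chosen, which is the property that separates this strategy from best.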
score_child_prop (default)
Same as score_prop but penalises nodes that already have many children. The penalty is cubic-exponential: a node with 8 children has its probability multiplied by ~0.37; a node with 16 children by ~0.006. This encourages the tree to stay broad and explore diverse lineages.
This is the default because it balances exploitation (high-scoring nodes) with exploration (underexplored branches).
Archive Management
update_and_save_archive()
Called after every generate() call:
utils/gl_utils.py (update_and_save_archive)
archive.jsonl Format
outputs/generate_<run_id>/archive.jsonl
load_archive_data(filepath, last_only=True) reads the last line to get the current archive state.
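The append-then-read-last pattern behind archive.jsonl can be sketched as follows. This is an illustration, not the project's implementation: append_snapshot and load_snapshots are hypothetical names, and the "archive" key in the example records is an assumed field, since the actual record layout is not shown here.

```python
import json
import os
import tempfile
from typing import Any, Dict, List, Union


def append_snapshot(filepath: str, snapshot: Dict[str, Any]) -> None:
    """Append one JSON object per line; earlier snapshots are never rewritten."""
    with open(filepath, "a", encoding="utf-8") as f:
        f.write(json.dumps(snapshot) + "\n")


def load_snapshots(
    filepath: str, last_only: bool = True
) -> Union[Dict[str, Any], List[Dict[str, Any]]]:
    """Read a JSONL file; return only the newest record when last_only is True."""
    with open(filepath, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    if not records:
        raise ValueError(f"no snapshots found in {filepath}")
    return records[-1] if last_only else records


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "archive.jsonl")
        # "archive" is an assumed field name used only for this example.
        append_snapshot(path, {"archive": ["initial"]})
        append_snapshot(path, {"archive": ["initial", 1]})
        print(load_snapshots(path))                   # newest state only
        print(len(load_snapshots(path, last_only=False)))
```

Appending a full snapshot per generation keeps the file's history intact, so the last line always reflects the current archive while earlier lines remain available for inspection.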
get_score() — Reading a Generation’s Score
utils/gl_utils.py (get_score)
utils/domain_utils.py.
Resuming a Previous Run
Pass --resume_from <output_dir> to continue an interrupted run:
generate_loop.py (resume)
The resumed run continues from the next generation ID (start_genid = len(archive)) and picks up the parent selection from there. The snapshotted repo in gen_initial/<REPO_NAME>/ is reused as root_dir.
Command Examples
- Basic run
- Multiple domains
- Skip staged eval
- Resume
- Start from a patch
Supported Domains
Meta-Agent
What runs inside each generate() call.
Architecture
High-level system overview and Docker sandbox details.