
Overview

The --run_baseline argument in generate_loop.py lets you run controlled comparison experiments against the default HyperAgents algorithm. Each baseline ablates or replaces a specific component of the system, making it possible to measure the contribution of the meta-agent, the archive, and the self-improvement mechanism independently.
When no --run_baseline is specified, HyperAgents runs in its default configuration: a meta-agent that improves the task agent over multiple generations, selecting parents from a scored archive using score_child_prop.
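As a rough illustration of what score-proportional parent selection could look like, here is a minimal sketch. The `Entry` class, `select_parent` function, and archive layout are invented for illustration — this is not the actual score_child_prop implementation.

```python
import random
from dataclasses import dataclass

# Hypothetical archive entry; the real archive layout may differ.
@dataclass
class Entry:
    generation: int
    score: float

def select_parent(archive, rng=random):
    """Pick a parent with probability proportional to its score."""
    total = sum(e.score for e in archive)
    if total == 0:
        return rng.choice(archive)  # fall back to a uniform pick
    r = rng.uniform(0, total)
    acc = 0.0
    for e in archive:
        acc += e.score
        if r <= acc:
            return e
    return archive[-1]

archive = [Entry(0, 0.2), Entry(1, 0.5), Entry(2, 0.8)]
parent = select_parent(archive)  # higher-scoring generations are favored
```

Under this scheme, higher-scoring generations are more likely to be chosen as parents, but low scorers retain a nonzero chance, which is what allows the archive to branch rather than collapse onto a single lineage.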

Baseline Reference

no_selfimprove

The meta-agent runs, but its modifications are applied to a copy of the repository (/DONOTTOUCH_hyperagents), not to the live agent. The task agent that gets evaluated each generation is always the static baseline agent — the meta-agent cannot make permanent improvements to it.

What it tests: Whether the archive and parent-selection mechanism alone contribute to performance, independent of actual agent improvement.

When to use it: As the strongest ablation of self-improvement. If HyperAgents does not outperform no_selfimprove, the meta-agent’s code changes are not actually beneficial.

no_archive

The evolutionary archive is disabled. Instead of selecting parents from the full history of generations, the loop always uses the most recent generation as the parent (equivalent to forcing --parent_selection latest). This turns the branching archive into a simple linear chain.

What it tests: The value of maintaining and searching the archive versus simple sequential improvement.

When to use it: To isolate the benefit of the archive structure. A result where HyperAgents outperforms no_archive demonstrates that the archive enables discovery of better improvement trajectories.
When --run_baseline no_archive is active, the --parent_selection argument is ignored. Parent selection is automatically set to latest.
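Schematically, the no_archive behavior reduces parent selection to taking the newest generation. The names below are illustrative, not the real code:

```python
# Illustrative sketch: each dict stands in for an archive entry.
def select_parent_latest(archive):
    """no_archive mode: always pick the most recent generation,
    turning the branching archive into a single linear chain."""
    return max(archive, key=lambda e: e["generation"])

archive = [
    {"generation": 0, "score": 0.2},
    {"generation": 1, "score": 0.8},
    {"generation": 2, "score": 0.5},
]
print(select_parent_latest(archive)["generation"])  # 2
```

Note that generation 2 is chosen even though generation 1 scored higher — with no archive search, a promising earlier branch can never be revisited.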

dgm

Replaces the HyperAgents meta-agent (run_meta_agent.py) with the DGM (Darwin Gödel Machine) coding agent (baselines/dgm/coding_agent.py). The DGM agent receives a problem statement derived from the current domain and evaluation results, then produces a code diff in the same way as the HyperAgents meta-agent. The rest of the framework — archive, parent selection, evaluation harness, Docker containers — remains the same as the default HyperAgents run.

What it tests: Whether the HyperAgents meta-agent design outperforms the DGM coding agent within the same evolutionary framework.

When to use it: For a direct comparison with DGM as described in the HyperAgents paper.
For the polyglot domain, the HyperAgents default run already uses claude-3-5-sonnet-20241022 to match the DGM comparison setup.

dgm_custom

Identical to dgm, except the problem statement passed to the DGM coding agent is customized per domain. The get_problem_statement utility generates a domain-specific prompt rather than a generic one.

What it tests: DGM performance when given more targeted guidance about what to improve.

When to use it: When you want to give DGM the strongest possible starting point for comparison.
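The shape of this customization can be sketched as follows. The function signature, the prompt strings, and the fallback behavior are all invented for illustration — only the name get_problem_statement and the domain names come from the documentation above:

```python
# Hypothetical sketch of per-domain problem statements.
GENERIC_PROMPT = "Improve the coding agent's performance on the current task."

DOMAIN_PROMPTS = {
    "search_arena": "Improve the agent's web-search and answer-synthesis strategy.",
    "polyglot": "Improve the agent's multi-language code-editing accuracy.",
}

def get_problem_statement(domain, custom=False):
    """dgm uses the generic prompt; dgm_custom swaps in a
    domain-specific one when available (illustrative only)."""
    if custom:
        return DOMAIN_PROMPTS.get(domain, GENERIC_PROMPT)
    return GENERIC_PROMPT
```

The point of the ablation pair is that everything else is held fixed: dgm and dgm_custom differ only in which string reaches the DGM coding agent.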

dgm+no_selfimprove

Combines the DGM meta-agent with the no_selfimprove constraint: the DGM agent runs, but its changes are applied to a copy of the repository, not the live agent. The task agent evaluated each generation is static.

What it tests: DGM acting purely as a meta-agent that cannot persistently improve the underlying task agent — a maximally conservative comparison.

When to use it: To establish a lower bound on what DGM can contribute when its output never accumulates.

dgm_custom+no_selfimprove

Same as dgm+no_selfimprove, but with the custom problem statement variant.

What it tests: The same question as dgm+no_selfimprove, with better-guided DGM prompts.

Comparison Summary

| Baseline | Meta-agent | Archive | Self-improvement |
| --- | --- | --- | --- |
| (default) | HyperAgents | Full archive | Yes |
| no_selfimprove | HyperAgents | Full archive | No |
| no_archive | HyperAgents | Latest only | Yes |
| dgm | DGM | Full archive | Yes |
| dgm_custom | DGM (custom prompt) | Full archive | Yes |
| dgm+no_selfimprove | DGM | Full archive | No |
| dgm_custom+no_selfimprove | DGM (custom prompt) | Full archive | No |

Example Commands

# Ablate self-improvement: meta-agent runs but cannot modify the task agent
python generate_loop.py \
  --domains search_arena \
  --run_baseline no_selfimprove \
  --max_generation 10

Running Multiple Baselines for Comparison

To produce a clean comparison across baselines, run each with the same --max_generation and --eval_samples settings and point outputs to a shared parent directory:
OUTDIR=/scratch/experiments/comparison

python generate_loop.py --domains search_arena --output_dir_parent $OUTDIR/default     --max_generation 10
python generate_loop.py --domains search_arena --output_dir_parent $OUTDIR/no_selfimprove --run_baseline no_selfimprove --max_generation 10
python generate_loop.py --domains search_arena --output_dir_parent $OUTDIR/no_archive  --run_baseline no_archive  --max_generation 10
python generate_loop.py --domains search_arena --output_dir_parent $OUTDIR/dgm         --run_baseline dgm         --max_generation 10
The analysis/plot_progress.py and analysis/visualize_archive.py scripts can then be run across the output directories to produce comparison plots.
Each baseline run starts a fresh Docker container per generation and runs a full meta-agent + evaluation cycle. Running all baselines in parallel requires substantial compute — plan for at least 6 hours per generation per baseline (meta-agent timeout) plus up to 5 hours for evaluation.
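A back-of-the-envelope budget, using the per-generation figures above and the four runs from the example commands (the hour figures are the worst-case bounds stated in this section, not measured averages):

```python
META_AGENT_HOURS = 6   # meta-agent timeout per generation (from the note above)
EVAL_HOURS = 5         # worst-case evaluation time per generation
GENERATIONS = 10       # --max_generation used in the example commands
BASELINES = 4          # default, no_selfimprove, no_archive, dgm

hours_per_baseline = GENERATIONS * (META_AGENT_HOURS + EVAL_HOURS)
print(hours_per_baseline)              # 110 hours wall-clock per baseline
print(hours_per_baseline * BASELINES)  # 440 compute-hours if the runs share hardware
```

Running the four baselines on separate machines keeps wall-clock time near the per-baseline figure; sharing hardware multiplies it accordingly.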
