The analysis/ directory contains all plotting and statistical-testing scripts. Plots are also generated automatically at the end of every generation during a live run.

Analysis directory contents

| File | Purpose |
| --- | --- |
| plot_progress.py | Line plots of score vs. iteration for a single run. |
| visualize_archive.py | Directed-graph visualization of the full archive tree. |
| plot_comparison.py | Multi-run comparison curves with bootstrap confidence intervals. |
| plot_testevals.py | Bar charts of best-found agent test scores across methods. |
| analysis_utils.py | Shared bootstrap CI and significance-testing helpers. |
| transfer_utils.py | Utilities for selecting nodes to transfer across experiments. |

Parameters common to all plotting functions

split

Which evaluation split’s scores to use:
| Value | Source directory |
| --- | --- |
| "train" | gen_{N}/{domain}_eval/ |
| "val" | gen_{N}/{domain}_eval_val/ |
| "test" | gen_{N}/{domain}_eval_test/ |
A split is only available if evaluation was run for it. For most domains only "train" is produced by default; "val" is added when the domain definition includes a validation subset.
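The split-to-directory mapping above can be sketched as a small path helper. This is a hypothetical illustration (the function name and signature are not from the library), mirroring the suffix convention in the table:

```python
def eval_dir_for_split(gen, domain, split="train"):
    """Return the per-generation evaluation directory for a split.

    Hypothetical helper: "train" has no suffix, "val" and "test"
    append _val / _test, matching the table above.
    """
    suffix = {"train": "", "val": "_val", "test": "_test"}[split]
    return f"gen_{gen}/{domain}_eval{suffix}/"
```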

type

Which score value to use when plotting:
| Value | Behaviour |
| --- | --- |
| "agent" | Use the task-agent score from report.json. |
| "ensemble" | Use the ensemble score from report_ensemble_{domain}_{split}.json. |
| "max" | Use max(agent_score, ensemble_score); falls back to whichever is available. |
When a domain does not support ensembling, only "agent" is meaningful.
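The selection rule for the three score types, including the fallback behaviour of "max", can be sketched as follows. This is an illustrative reconstruction of the documented behaviour, not the library's actual code:

```python
def resolve_score(agent_score, ensemble_score, score_type="agent"):
    """Pick the plotted value for one node given the `type` parameter.

    Sketch of the documented rule: "max" takes the larger of the two
    scores and falls back to whichever one is available (None = missing).
    """
    if score_type == "agent":
        return agent_score
    if score_type == "ensemble":
        return ensemble_score
    if score_type == "max":
        candidates = [s for s in (agent_score, ensemble_score) if s is not None]
        return max(candidates) if candidates else None
    raise ValueError(f"unknown score type: {score_type}")
```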

Progress plots

plot_progress.py produces line plots showing score progression over the ordered sequence of archive additions.

plot_progress_single

from analysis.plot_progress import plot_progress_single

plot_progress_single(
    domain,      # e.g. "paper_review"
    exp_dir,     # path to the generate_<run_id>/ directory
    split="train",
    type="agent",
    color="blue",  # "blue", "green", or "orange"
    svg=False,     # also save an SVG copy
)
What it produces:
  • progress_plot_{domain}_{split}_{type}.png — line chart saved to exp_dir/
  • progress_info_{domain}_{split}_{type}.txt — companion text file with iteration counts, best scores per iteration, and the patch-file lineage of the best agent
What the plot shows: Three series are drawn on the same axes:
  1. Best Agent — the running maximum score across all archive members seen so far.
  2. Average of Archive — the running mean of all non-None scores seen so far.
  3. Lineage to Final Best Agent — the score at each node on the path from gen_initial to the best-scoring generation, connected as a line. This highlights the evolutionary path that produced the best result.
The x-axis counts iterations (one per archive entry); the y-axis is the raw score value.
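The first two series can be computed from the ordered score list alone. The sketch below (function name is illustrative, not from the library) shows the running-maximum and running-mean logic, with None scores skipped from the average as described:

```python
def progress_series(scores):
    """Compute the 'Best Agent' (running max) and 'Average of Archive'
    (running mean over non-None scores) series for an ordered list of
    archive scores. Illustrative sketch of the documented plot logic.
    """
    best, avg = [], []
    current_best = None
    seen = []  # all non-None scores so far
    for s in scores:
        if s is not None:
            seen.append(s)
            current_best = s if current_best is None else max(current_best, s)
        best.append(current_best)
        avg.append(sum(seen) / len(seen) if seen else None)
    return best, avg
```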

plot_progress_together

from analysis.plot_progress import plot_progress_together

plot_progress_together(
    domains,     # list of domain strings, e.g. ["paper_review", "genesis_go2walking"]
    exp_dir,
    split="train",
    type="agent",
    color="blue",
    svg=False,
)
Aggregates scores across multiple domains before plotting. For each node, the aggregated score is the mean of per-domain scores. If any domain returns None for a node (not compilable or not evaluated), that node’s aggregated score is also None and it is excluded from the running average. Output files:
  • progress_plot_together_{domA}_{domB}_..._{split}_{type}.png
  • progress_info_together_{domA}_{domB}_..._{split}_{type}.txt
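The per-node aggregation rule described above (mean across domains, None if any domain is None) can be sketched as a small hypothetical helper:

```python
def aggregate_node_scores(per_domain_scores):
    """Mean of one node's per-domain scores; None if any domain returned
    None (not compilable or not evaluated), so the node is excluded from
    the running average. Illustrative helper, not the library's function.
    """
    if any(s is None for s in per_domain_scores):
        return None
    return sum(per_domain_scores) / len(per_domain_scores)
```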

Running from the command line

# Single domain
python -m analysis.plot_progress \
  --domains paper_review \
  --path outputs/generate_20251216_192315_534288 \
  --color blue

# Multiple domains (aggregated together)
python -m analysis.plot_progress \
  --domains paper_review genesis_go2walking \
  --path outputs/generate_20251216_192315_534288 \
  --color green

# Also check ensemble types and save SVG
python -m analysis.plot_progress \
  --domains paper_review \
  --path outputs/generate_20251216_192315_534288 \
  --check_ensemble \
  --svg
When --check_ensemble is set and the domain supports ensembling, plots are generated for all three score types (agent, ensemble, max).

Archive tree visualization

visualize_archive.py renders the full archive as a directed acyclic graph (DAG) where nodes are generation IDs and edges point from parent to child.

visualize_archive_single

from analysis.visualize_archive import visualize_archive_single

visualize_archive_single(
    domain,
    exp_dir,
    trunc_its=-1,    # -1 = no truncation; positive int limits nodes shown
    split="train",
    type="agent",
    plot_borders=False,  # color node borders by valid_parent status
    save_svg=False,
)
Output: archive_graph_{domain}_{split}_{type}.png in exp_dir/.
What the graph shows:
  • Each node represents one generation. The node labelled #0 corresponds to gen_initial (the initial node is relabelled "0" in the graph for visual consistency).
  • Node color encodes score via an orange → yellow → green colormap. Scores are normalized to [min, max] across all valid nodes in the graph.
  • Nodes with score = None (non-compilable or evaluation failed) are colored at the low end of the colormap and labelled N/A.
  • The best-scoring node is drawn as a diamond (◆); all others are circles.
  • Edges run from parent to child (arrows are not drawn; the top-down layout from graphviz dot makes direction clear).
  • Node labels show #{genid} on the first line and the formatted score (e.g. 0.432) on the second line.
  • When plot_borders=True, node borders are green if valid_parent=true and red if valid_parent=false.
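The score-to-colormap normalization described above can be sketched like this. Assumptions are labeled in the comments: the function name is hypothetical, and None scores are pinned to the low end of the colormap as the bullet list states:

```python
def normalize_scores(scores):
    """Map node scores to [0, 1] for the orange -> yellow -> green colormap.

    Sketch of the documented rule: scores are normalized to [min, max]
    over all valid (non-None) nodes; None scores pin to 0.0 (low end).
    The equal-scores fallback to 0.5 is an assumption for illustration.
    """
    valid = [s for s in scores if s is not None]
    lo, hi = min(valid), max(valid)
    span = hi - lo
    return [
        0.0 if s is None else (0.5 if span == 0 else (s - lo) / span)
        for s in scores
    ]
```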

visualize_archive_together

from analysis.visualize_archive import visualize_archive_together

visualize_archive_together(
    domains,     # list of domain strings
    exp_dir,
    trunc_its=-1,
    split="train",
    type="agent",
    plot_borders=False,
    save_svg=False,
)
Produces a single graph where each node’s color reflects the mean score across all domains. If any domain has None for a node, that node is treated as invalid (shown in gray/low-color end). The colorbar label reads "Score (aggregated)". Output: archive_graph_together_{domA}_{domB}_..._{split}_{type}.png

Running from the command line

# Single domain
python -m analysis.visualize_archive \
  --domains paper_review \
  --path outputs/generate_20251216_192315_534288

# Multiple domains, aggregated
python -m analysis.visualize_archive \
  --domains paper_review genesis_go2walking \
  --path outputs/generate_20251216_192315_534288 \
  --together

# Show valid_parent borders, limit to first 50 nodes, save SVG
python -m analysis.visualize_archive \
  --domains paper_review \
  --path outputs/generate_20251216_192315_534288 \
  --plot_borders \
  --trunc_its 50 \
  --svg
visualize_archive.py requires Graphviz and the Python pygraphviz package. Install the system packages (graphviz and graphviz-devel, or your distribution's equivalents) before installing pygraphviz via pip. The dot layout program must be on PATH.

Automatic plot generation during a run

generate_loop.py calls both plot_progress_single / plot_progress_together and visualize_archive_single / visualize_archive_together automatically at the end of every generation (lines 968–1004 of generate_loop.py). The score types generated depend on the run configuration:
| Condition | Score types plotted |
| --- | --- |
| optimize_option="only_ensemble" | ["ensemble"] |
| Ensemble evaluation is enabled | ["agent", "ensemble", "max"] |
| Default (agent only) | ["agent"] |
Combined _together plots are only generated when the run targets more than one domain and all domains share at least one common split. All plots are saved directly to the run-root directory (outputs/generate_{run_id}/), so they accumulate alongside the archive.jsonl and can be inspected while the run is still in progress.
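The dispatch in the table can be summarized as a small function. This is a hypothetical reconstruction of the configuration logic (names assumed), not the actual code in generate_loop.py:

```python
def score_types_for_run(optimize_option=None, ensemble_enabled=False):
    """Return the list of score types plotted at the end of each generation.

    Sketch of the table above: only_ensemble wins, then ensemble-enabled
    runs get all three types, and the default is agent only.
    """
    if optimize_option == "only_ensemble":
        return ["ensemble"]
    if ensemble_enabled:
        return ["agent", "ensemble", "max"]
    return ["agent"]
```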

Multi-run comparison plots

plot_comparison.py is used for comparing multiple methods across multiple runs. It is configured by editing the domains_paths_data list at the top of the file’s main() function.
python -m analysis.plot_comparison
What it produces (all saved to analysis/outputs/):
  • comparison_{plotlabel}_{domain1}+{domain2}+....png — cumulative-max score curves with 95% bootstrap confidence intervals, one curve per method. For a single domain the filename is comparison_{plotlabel}_{domain}.png.
  • analysis/outputs/pdfs/comparison_{plotlabel}_{domain1}+{domain2}+....pdf / .svg — publication-quality transparent versions.
  • significance_{plotlabel}_{domain1}+{domain2}+....txt — pairwise statistical significance tests (Wilcoxon signed-rank or Mann-Whitney U, depending on sample sizes). For a single domain the filename is significance_{plotlabel}_{domain}.txt.
Each method entry in domains_paths_data maps to a list of run directories. The script interpolates each run’s cumulative-max scores onto a shared x-grid and then computes the median and bootstrap CIs across runs.
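The interpolate-then-aggregate step can be sketched with NumPy. The function name and the (x, y) run representation are assumptions for illustration; the sketch computes each run's cumulative maximum, interpolates it onto a shared grid, and takes the per-point median across runs (bootstrap CIs would be computed over the same stacked curves):

```python
import numpy as np

def runs_to_shared_grid(runs, grid):
    """Median cumulative-max curve across runs on a shared x-grid.

    `runs` is a list of (x, y) pairs with x strictly increasing; each
    run's cumulative max is linearly interpolated onto `grid` before
    taking the per-point median. Illustrative sketch, not library code.
    """
    curves = []
    for x, y in runs:
        cummax = np.maximum.accumulate(np.asarray(y, dtype=float))
        curves.append(np.interp(grid, x, cummax))
    return np.median(np.stack(curves), axis=0)
```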

Statistical testing utilities

analysis/analysis_utils.py provides two functions used by both plot_comparison.py and plot_testevals.py.

compute_bootstrap_ci

from analysis.analysis_utils import compute_bootstrap_ci

median, lower_ci, upper_ci = compute_bootstrap_ci(
    data,              # 1D array of scalar scores
    n_bootstrap=1000,
    ci_level=0.95,
    random_seed=42,
)
Returns the median and 95% bootstrap confidence interval for a set of scores. When data has only one element, the CI collapses to the single value.
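A percentile-bootstrap version of this computation looks roughly like the sketch below. It mirrors the documented signature but is an independent illustration, not the library's implementation:

```python
import numpy as np

def bootstrap_ci(data, n_bootstrap=1000, ci_level=0.95, random_seed=42):
    """Median and percentile-bootstrap CI over a 1D array of scores.

    Sketch mirroring the documented compute_bootstrap_ci signature:
    resample with replacement, take each resample's median, and read the
    CI off the quantiles. A single element collapses to that value.
    """
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(random_seed)
    resamples = rng.choice(data, size=(n_bootstrap, data.size), replace=True)
    medians = np.median(resamples, axis=1)
    alpha = (1.0 - ci_level) / 2.0
    lower, upper = np.quantile(medians, [alpha, 1.0 - alpha])
    return float(np.median(data)), float(lower), float(upper)
```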

save_significance_tests

from analysis.analysis_utils import save_significance_tests

save_significance_tests(
    methods_data,     # dict: method_name -> np.ndarray of scores
    output_file,      # path to write results
    metadata=None,    # optional dict of metadata to include in the header
    use_bootstrap=False,
)
Performs all pairwise one-sided comparisons (H₁: method A > method B). With use_bootstrap=True uses Wilcoxon signed-rank (paired) or Mann-Whitney U (unpaired). With use_bootstrap=False uses paired or independent t-tests. Significance stars (*, **, ***) are added for p < 0.05, 0.01, 0.001.
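The star convention at the end can be captured in a few lines. A simple reconstruction of the thresholds stated above (helper name is illustrative):

```python
def significance_stars(p_value):
    """Star annotation for a p-value: *, **, *** for p < 0.05, 0.01, 0.001.

    Sketch of the convention described above; thresholds are checked from
    most to least stringent so each p-value gets at most one annotation.
    """
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""
```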

Transfer analysis

analysis/transfer_utils.py helps identify which node from a completed run is the best candidate for zero-shot transfer to a new run.
python -m analysis.transfer_utils \
  --domains paper_review genesis_go2walking \
  --path outputs/generate_20251216_192315_534288 \
  --top_n 3 \
  --get_commands
Three selection strategies are supported via choose_node_for_transfer:
| Method | Selects |
| --- | --- |
| max_score | Top-N nodes by absolute score. |
| growth | Top-N nodes by discounted descendant-improvement score (descendants scored relative to the root). |
| growth_imd | Same as growth, but each descendant’s delta is computed relative to its immediate parent rather than the root. |
The --get_commands flag prints the python -m domains.run_eval commands needed to evaluate the selected nodes on a held-out test domain, useful for transfer experiments.
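The simplest strategy, max_score, reduces to a top-N sort over scored nodes. The sketch below is a hypothetical helper (the real selection lives in choose_node_for_transfer); nodes with None scores are skipped, consistent with how unevaluated nodes are treated elsewhere:

```python
def top_n_by_score(nodes, n=3):
    """Return the IDs of the top-N nodes by absolute score.

    `nodes` is a list of (node_id, score) pairs; None scores (never
    evaluated / not compilable) are excluded before ranking.
    Illustrative sketch of the max_score strategy.
    """
    scored = [(node_id, s) for node_id, s in nodes if s is not None]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [node_id for node_id, _ in scored[:n]]
```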
