HyperAgents executes model-generated code in Docker containers. Review the Safety page before proceeding.
1. Prerequisites

Before you begin, make sure you have the following installed:
  • Python 3.12 (the virtualenv and Dockerfile both target 3.12)
  • Docker with access to the Docker daemon
  • Git
  • A Linux host with a dnf-compatible package manager (Fedora / RHEL / Rocky); adjust the commands for your distro as needed
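
A quick way to confirm the prerequisites are on your PATH (a generic shell check, not part of the repository):

```shell
# Report which prerequisite commands are available; prints MISSING for absent ones
for cmd in python3.12 docker git; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "$cmd: found"
  else
    echo "$cmd: MISSING"
  fi
done
```

Note that finding `docker` on the PATH does not guarantee daemon access; `docker info` is the definitive check for that.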
2. Set up API keys

HyperAgents uses LiteLLM to call foundation models. Create a .env file in the repository root with the keys for the providers you intend to use:
# .env — place in the repository root
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=AI...
The default meta-agent model is anthropic/claude-sonnet-4-5-20250929 (see agent/llm.py and run_meta_agent.py). Supported models include:
| Provider | Example model constants |
| --- | --- |
| OpenAI | gpt-4o, o3, o3-mini, o4-mini, gpt-5, gpt-5.2, gpt-5-mini |
| Anthropic | claude-sonnet-4-5-20250929, claude-3-5-sonnet-20241022, claude-3-haiku-20240307 |
| Google | gemini-2.5-pro, gemini-2.5-flash, gemini-3-pro-preview |
You only need keys for the providers whose models you plan to use. The .env file is read automatically via python-dotenv at import time.
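
As a sanity check, you can verify which provider keys actually made it into the environment. The helper below is illustrative (not part of the repository) and only inspects the variable names listed above:

```python
import os

# Map provider names to the environment variable names listed above
PROVIDER_KEYS = {
    "OpenAI": "OPENAI_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "Google": "GEMINI_API_KEY",
}

def available_providers(env=None):
    """Return the providers whose API keys are set.

    By the time agent code imports, python-dotenv has already loaded .env
    into os.environ, so the default check reflects your .env file.
    """
    env = os.environ if env is None else env
    return sorted(p for p, k in PROVIDER_KEYS.items() if env.get(k))

if __name__ == "__main__":
    print("Providers with keys:", available_providers())
```

If a model you request belongs to a provider missing from this list, LiteLLM will raise an authentication error at call time.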
3. Install system dependencies

Install the C/C++ build tools and Python 3.12 headers required by several packages (including pygraphviz and native extensions):
sudo dnf install -y python3.12-devel
sudo dnf install -y graphviz graphviz-devel cmake ninja-build bzip2-devel zlib-devel ncurses-devel libffi-devel
4. Create a virtualenv and install Python packages

Create an isolated Python 3.12 environment and install the main and dev requirements:
python3.12 -m venv venv_nat
source venv_nat/bin/activate
pip install -r requirements.txt
pip install -r requirements_dev.txt
Key runtime packages installed by requirements.txt:
| Package | Version | Purpose |
| --- | --- | --- |
| litellm | 1.74.9 | Unified LLM API client |
| docker | 7.1.0 | Python Docker SDK for container management |
| datasets | 3.6.0 | HuggingFace datasets (domain data loading) |
| GitPython | 3.1.44 | Git operations from Python |
| backoff | 2.2.1 | Exponential retry for LLM calls |
| dotenv | 0.9.9 | .env file loading |
The requirements.txt also installs packages for every domain (BALROG games, Genesis robotics). If you only need a subset of domains, you can skip domain-specific pip extras — the Dockerfile handles them in the container anyway.
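
To confirm the pinned versions above resolved correctly inside the virtualenv, a small check (illustrative, not part of the repo; extend the pins dict with the rest of requirements.txt as needed) can compare installed versions against the pins:

```python
from importlib import metadata

# Pins taken from the table above
PINS = {"litellm": "1.74.9", "docker": "7.1.0", "datasets": "3.6.0"}

def check_pins(pins):
    """Return {package: (matches_pin, installed_version_or_None)}."""
    report = {}
    for pkg, want in pins.items():
        try:
            have = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            have = None  # package not installed in this environment
        report[pkg] = (have == want, have)
    return report

if __name__ == "__main__":
    for pkg, (ok, have) in check_pins(PINS).items():
        print(f"{pkg}: {'OK' if ok else f'MISMATCH (installed: {have})'}")
```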
5. Build the Docker container

The Docker image (hyperagents) is the execution sandbox. It is based on nvidia/cuda:13.0.0-devel-ubuntu22.04, installs Python 3.12 via the deadsnakes PPA, and bakes in all dependencies including PyTorch with CUDA support.
docker build --network=host -t hyperagents .
This step can take 10–20 minutes the first time because it downloads CUDA base layers and installs all Python packages inside the image.
The --network=host flag is needed during the build to reach external pip indices. If your Docker daemon is configured differently, you may need to adjust the flag.
6. Bootstrap initial agents

Before starting the evolution loop, you need to evaluate the baseline (generation 0) agents on each domain’s train, validation, and test splits. setup_initial.sh handles this for the paper_review domain by default:
bash ./setup_initial.sh
This script runs the following sequence for paper_review:
# Curate the dataset subsets
python -m domains.paper_review.curate_subsets

# Run the initial agent on train / val / test splits (10 samples each)
python -m domains.harness --domain paper_review --run_id initial_paper_review_filtered_100_train_0 --subset _filtered_100_train --num_samples 10
python -m domains.harness --domain paper_review --run_id initial_paper_review_filtered_100_val_0   --subset _filtered_100_val   --num_samples 10
python -m domains.harness --domain paper_review --run_id initial_paper_review_filtered_100_test_0  --subset _filtered_100_test  --num_samples 10

# Generate score reports for each split
python -m domains.report --domain paper_review --dname ./outputs/initial_paper_review_filtered_100_train_0
python -m domains.report --domain paper_review --dname ./outputs/initial_paper_review_filtered_100_val_0
python -m domains.report --domain paper_review --dname ./outputs/initial_paper_review_filtered_100_test_0
To bootstrap other domains (BALROG, Genesis, IMO, Polyglot), uncomment the relevant sections in setup_initial.sh. All outputs are written to the outputs/ directory.
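
The three harness invocations above differ only in the split name, so they can be written as a loop. The echo makes this a dry run that prints each command; delete it to actually execute:

```shell
# Dry-run: print the harness command for each split; remove 'echo' to run them
for split in train val test; do
  echo python -m domains.harness --domain paper_review \
    --run_id "initial_paper_review_filtered_100_${split}_0" \
    --subset "_filtered_100_${split}" --num_samples 10
done
```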
7. Run the evolution loop

Once the initial agents are bootstrapped, start the main generate loop:
python generate_loop.py --domains paper_review --max_generation 10
This will:
  1. Create a timestamped run directory under outputs/generate_<run_id>/.
  2. Spin up a Docker container per generation.
  3. Run the meta-agent to produce a diff of proposed improvements.
  4. Apply the diff and evaluate the new task-agent.
  5. Update the evolutionary archive and select the best parent for the next generation.
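
For intuition, the score-proportional family of parent-selection strategies (step 5) can be sketched as roulette-wheel sampling over archive scores. This is an illustrative sketch only, not the actual generate_loop.py implementation, which also offers random, latest, best, and child-count-weighted variants:

```python
import random

def score_prop_select(archive, rng=random):
    """Pick a parent with probability proportional to its score.

    `archive` is a list of (agent_id, score) pairs; higher-scoring agents
    are proportionally more likely to seed the next generation.
    """
    total = sum(score for _, score in archive)
    if total <= 0:
        return rng.choice(archive)[0]  # degenerate case: fall back to uniform
    r = rng.uniform(0.0, total)
    acc = 0.0
    for agent_id, score in archive:
        acc += score
        if r <= acc:
            return agent_id
    return archive[-1][0]  # guard against floating-point edge cases
```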

Common options

# Run on multiple domains simultaneously
python generate_loop.py --domains paper_review imo_grading --max_generation 20

# Control evaluation sample count per domain
python generate_loop.py --domains paper_review --max_generation 10 --eval_samples 50

# Change the parent selection strategy
python generate_loop.py --domains paper_review --max_generation 10 \
  --parent_selection score_child_prop

# Resume an interrupted run
python generate_loop.py --domains paper_review \
  --resume_from ./outputs/generate_20260101_120000_000000

# Always run test-set evaluation (disabled by default to save compute)
python generate_loop.py --domains paper_review --max_generation 10 --eval_test

All generate_loop.py arguments

| Argument | Default | Description |
| --- | --- | --- |
| --domains | (required) | One or more domains: paper_review, balrog_babyai, balrog_babaisai, balrog_minihack, balrog_nle, genesis_go2walking, genesis_go2walkback, genesis_go2hop, imo_grading, imo_proof, polyglot |
| --max_generation | 10 | Number of evolution generations to run |
| --eval_samples | -1 (all) | Samples per domain during evaluation |
| --eval_workers | 10 | Parallel evaluation workers |
| --parent_selection | score_child_prop | Strategy: random, latest, best, score_prop, score_child_prop |
| --resume_from | None | Path to an existing output folder to resume |
| --output_dir_parent | outputs/ | Root folder for new run output |
| --optimize_option | only_agent | What to optimize: only_agent, only_ensemble, both_agent_ensemble |
| --run_baseline | None | Ablation mode: no_selfimprove, no_archive, dgm, dgm_custom, etc. |
| --eval_test | False | Include test-split evaluation every generation |
| --skip_staged_eval | False | Skip small-sample staged evaluation gate |
| --edit_select_parent | False | Allow the agent to modify its own parent-selection logic |
Outputs are saved to outputs/ by default. Each run creates a generate_<run_id>/ subdirectory containing one gen_<n>/ folder per generation, an archive.jsonl file, and per-generation metadata, logs, diffs, and evaluation results.
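
Because run IDs are timestamped, lexicographic order matches chronological order, so locating the most recent run directory is straightforward. A small illustrative helper (not part of the repo):

```python
from pathlib import Path

def latest_run(outputs="outputs"):
    """Return the most recent generate_<run_id>/ directory, or None if absent.

    Timestamped run IDs sort chronologically, so a plain sort suffices.
    """
    runs = sorted(Path(outputs).glob("generate_*"))
    return runs[-1] if runs else None
```

This is handy for passing the newest run to --resume_from after an interruption.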
