## Prerequisites
Before you begin, make sure you have the following installed:
- Python 3.12 (the virtualenv and Dockerfile both target 3.12)
- Docker with access to the Docker daemon
- Git
- A Linux host with a dnf-compatible package manager (Fedora / RHEL / Rocky). Adjust commands for your distro as needed.
## Set up API keys
HyperAgents uses LiteLLM to call foundation models. Create a .env file in the repository root with the keys for the providers you intend to use. The default meta-agent model is anthropic/claude-sonnet-4-5-20250929 (see agent/llm.py and run_meta_agent.py). Supported models include:

| Provider | Example model constants |
|---|---|
| OpenAI | gpt-4o, o3, o3-mini, o4-mini, gpt-5, gpt-5.2, gpt-5-mini |
| Anthropic | claude-sonnet-4-5-20250929, claude-3-5-sonnet-20241022, claude-3-haiku-20240307 |
| Google | gemini-2.5-pro, gemini-2.5-flash, gemini-3-pro-preview |
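A minimal .env might look like the following; the variable names follow LiteLLM's standard environment variables, and the values shown are placeholders:

```shell
# .env — add keys only for the providers you actually use
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=...
```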
You only need keys for the providers whose models you plan to use. The .env file is read automatically via python-dotenv at import time.

## Install system dependencies
Install the C/C++ build tools and Python 3.12 headers required by several packages (including pygraphviz and other native extensions).

## Create a virtualenv and install Python packages

Create an isolated Python 3.12 environment and install the main and dev requirements. Key runtime packages installed by
requirements.txt:

| Package | Version | Purpose |
|---|---|---|
litellm | 1.74.9 | Unified LLM API client |
docker | 7.1.0 | Python Docker SDK for container management |
datasets | 3.6.0 | HuggingFace datasets (domain data loading) |
GitPython | 3.1.44 | Git operations from Python |
backoff | 2.2.1 | Exponential retry for LLM calls |
dotenv | 0.9.9 | .env file loading |
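The system-dependency and virtualenv steps above can be sketched as follows; the dnf package names and the requirements-dev.txt filename are assumptions, so check them against the repository before running:

```shell
# Build tools and headers for pygraphviz and other native extensions (assumed package names)
sudo dnf install -y gcc gcc-c++ python3.12 python3.12-devel graphviz graphviz-devel

# Isolated Python 3.12 environment
python3.12 -m venv .venv
source .venv/bin/activate

# Main and dev requirements (the dev filename is an assumption)
pip install -r requirements.txt
pip install -r requirements-dev.txt
```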
## Build the Docker container

The Docker image (hyperagents) is the execution sandbox. It is based on nvidia/cuda:13.0.0-devel-ubuntu22.04, installs Python 3.12 via the deadsnakes PPA, and bakes in all dependencies, including PyTorch with CUDA support. This step can take 10–20 minutes the first time because it downloads CUDA base layers and installs all Python packages inside the image. The --network=host flag is needed during the build to reach external pip indices; if your Docker daemon is configured differently, you may need to adjust the flag.
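Assuming the Dockerfile sits at the repository root and the image is tagged hyperagents, the build might look like:

```shell
docker build --network=host -t hyperagents .
```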
## Bootstrap initial agents

Before starting the evolution loop, you need to evaluate the baseline (generation 0) agents on each domain’s train, validation, and test splits. setup_initial.sh handles this for the paper_review domain by default, running the full evaluation sequence for that domain. To bootstrap other domains (BALROG, Genesis, IMO, Polyglot), uncomment the relevant sections in setup_initial.sh. All outputs are written to the outputs/ directory.
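Assuming the script is executed from the repository root with the virtualenv active, bootstrapping might be as simple as:

```shell
bash setup_initial.sh
```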
## Run the evolution loop

Once the initial agents are bootstrapped, start the main generate loop (generate_loop.py). This will:
- Create a timestamped run directory under outputs/generate_<run_id>/.
- Spin up a Docker container per generation.
- Run the meta-agent to produce a diff of proposed improvements.
- Apply the diff and evaluate the new task-agent.
- Update the evolutionary archive and select the best parent for the next generation.
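A minimal invocation of the loop described above might be the following; only --domains is required, and the defaults from the options table apply otherwise:

```shell
python generate_loop.py --domains paper_review
```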
## Common options
### All generate_loop.py arguments
| Argument | Default | Description |
|---|---|---|
--domains | (required) | One or more domains: paper_review, balrog_babyai, balrog_babaisai, balrog_minihack, balrog_nle, genesis_go2walking, genesis_go2walkback, genesis_go2hop, imo_grading, imo_proof, polyglot |
--max_generation | 10 | Number of evolution generations to run |
--eval_samples | -1 (all) | Samples per domain during evaluation |
--eval_workers | 10 | Parallel evaluation workers |
--parent_selection | score_child_prop | Strategy: random, latest, best, score_prop, score_child_prop |
--resume_from | None | Path to an existing output folder to resume |
--output_dir_parent | outputs/ | Root folder for new run output |
--optimize_option | only_agent | What to optimize: only_agent, only_ensemble, both_agent_ensemble |
--run_baseline | None | Ablation mode: no_selfimprove, no_archive, dgm, dgm_custom, etc. |
--eval_test | False | Include test-split evaluation every generation |
--skip_staged_eval | False | Skip small-sample staged evaluation gate |
--edit_select_parent | False | Allow the agent to modify its own parent-selection logic |
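As a sketch, a longer run combining several of the options above might look like this; passing multiple domains space-separated and using --eval_test as a bare flag are assumptions about the argument parser:

```shell
python generate_loop.py \
  --domains paper_review polyglot \
  --max_generation 20 \
  --eval_workers 8 \
  --parent_selection best \
  --eval_test
```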
Outputs are saved to outputs/ by default. Each run creates a generate_<run_id>/ subdirectory containing one gen_<n>/ folder per generation, an archive.jsonl file, and per-generation metadata, logs, diffs, and evaluation results.