HyperAgents executes model-generated code in Docker containers. Review the Safety page before proceeding.
1. Prerequisites

Before you begin, make sure you have the following installed:
  • Python 3.12 (the virtualenv and Dockerfile both target 3.12)
  • Docker with access to the Docker daemon
  • Git
  • A Linux host with a dnf-compatible package manager (Fedora / RHEL / Rocky); adjust the commands for your distro as needed
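
A quick way to confirm the prerequisites are on your PATH (a generic shell check, not part of the repository):

```shell
# Report which prerequisite commands are available; prints MISSING for absent ones
for cmd in python3.12 docker git; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "$cmd: found"
  else
    echo "$cmd: MISSING"
  fi
done
```

Note that finding `docker` on the PATH does not guarantee daemon access; `docker info` is the definitive check for that.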
2. Set up API keys

HyperAgents uses LiteLLM to call foundation models. Create a .env file in the repository root with the keys for the providers you intend to use:
# .env — place in the repository root
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=AI...
The default meta-agent model is anthropic/claude-sonnet-4-5-20250929 (see agent/llm.py and run_meta_agent.py). Supported models include:
| Provider | Example model constants |
| --- | --- |
| OpenAI | gpt-4o, o3, o3-mini, o4-mini, gpt-5, gpt-5.2, gpt-5-mini |
| Anthropic | claude-sonnet-4-5-20250929, claude-3-5-sonnet-20241022, claude-3-haiku-20240307 |
| Google | gemini-2.5-pro, gemini-2.5-flash, gemini-3-pro-preview |
You only need keys for the providers whose models you plan to use. The .env file is read automatically via python-dotenv at import time.
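
As a sanity check, you can verify which provider keys actually made it into the environment. The helper below is illustrative (not part of the repository) and only inspects the variable names listed above:

```python
import os

# Map provider names to the environment variable names listed above
PROVIDER_KEYS = {
    "OpenAI": "OPENAI_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "Google": "GEMINI_API_KEY",
}

def available_providers(env=None):
    """Return the providers whose API keys are set.

    By the time agent code imports, python-dotenv has already loaded .env
    into os.environ, so the default check reflects your .env file.
    """
    env = os.environ if env is None else env
    return sorted(p for p, k in PROVIDER_KEYS.items() if env.get(k))

if __name__ == "__main__":
    print("Providers with keys:", available_providers())
```

If a model you request belongs to a provider missing from this list, LiteLLM will raise an authentication error at call time.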
3. Install system dependencies

Install the C/C++ build tools and Python 3.12 headers required by several packages (including pygraphviz and native extensions):
sudo dnf install -y python3.12-devel
sudo dnf install -y graphviz graphviz-devel cmake ninja-build bzip2-devel zlib-devel ncurses-devel libffi-devel
4. Create a virtualenv and install Python packages

Create an isolated Python 3.12 environment and install the main and dev requirements:
python3.12 -m venv venv_nat
source venv_nat/bin/activate
pip install -r requirements.txt
pip install -r requirements_dev.txt
Key runtime packages installed by requirements.txt:
| Package | Version | Purpose |
| --- | --- | --- |
| litellm | 1.74.9 | Unified LLM API client |
| docker | 7.1.0 | Python Docker SDK for container management |
| datasets | 3.6.0 | HuggingFace datasets (domain data loading) |
| GitPython | 3.1.44 | Git operations from Python |
| backoff | 2.2.1 | Exponential retry for LLM calls |
| dotenv | 0.9.9 | .env file loading |
The requirements.txt also installs packages for every domain (BALROG games, Genesis robotics). If you only need a subset of domains, you can skip domain-specific pip extras — the Dockerfile handles them in the container anyway.
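
To confirm the pinned versions above resolved correctly inside the virtualenv, a small check (illustrative, not part of the repo; extend the pins dict with the rest of requirements.txt as needed) can compare installed versions against the pins:

```python
from importlib import metadata

# Pins taken from the table above
PINS = {"litellm": "1.74.9", "docker": "7.1.0", "datasets": "3.6.0"}

def check_pins(pins):
    """Return {package: (matches_pin, installed_version_or_None)}."""
    report = {}
    for pkg, want in pins.items():
        try:
            have = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            have = None  # package not installed in this environment
        report[pkg] = (have == want, have)
    return report

if __name__ == "__main__":
    for pkg, (ok, have) in check_pins(PINS).items():
        print(f"{pkg}: {'OK' if ok else f'MISMATCH (installed: {have})'}")
```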
5. Build the Docker container

The Docker image (hyperagents) is the execution sandbox. It is based on nvidia/cuda:13.0.0-devel-ubuntu22.04, installs Python 3.12 via the deadsnakes PPA, and bakes in all dependencies including PyTorch with CUDA support.
docker build --network=host -t hyperagents .
This step can take 10–20 minutes the first time because it downloads CUDA base layers and installs all Python packages inside the image.
The --network=host flag is needed during the build to reach external pip indices. If your Docker daemon is configured differently, you may need to adjust the flag.
6. Bootstrap initial agents

Before starting the evolution loop, you need to evaluate the baseline (generation 0) agents on each domain’s train, validation, and test splits. setup_initial.sh handles this for the paper_review domain by default:
bash ./setup_initial.sh
This script runs the following sequence for paper_review:
# Curate the dataset subsets
python -m domains.paper_review.curate_subsets

# Run the initial agent on train / val / test splits (10 samples each)
python -m domains.harness --domain paper_review --run_id initial_paper_review_filtered_100_train_0 --subset _filtered_100_train --num_samples 10
python -m domains.harness --domain paper_review --run_id initial_paper_review_filtered_100_val_0   --subset _filtered_100_val   --num_samples 10
python -m domains.harness --domain paper_review --run_id initial_paper_review_filtered_100_test_0  --subset _filtered_100_test  --num_samples 10

# Generate score reports for each split
python -m domains.report --domain paper_review --dname ./outputs/initial_paper_review_filtered_100_train_0
python -m domains.report --domain paper_review --dname ./outputs/initial_paper_review_filtered_100_val_0
python -m domains.report --domain paper_review --dname ./outputs/initial_paper_review_filtered_100_test_0
To bootstrap other domains (BALROG, Genesis, IMO, Polyglot), uncomment the relevant sections in setup_initial.sh. All outputs are written to the outputs/ directory.
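
The three harness invocations above differ only in the split name, so they can be written as a loop. The echo makes this a dry run that prints each command; delete it to actually execute:

```shell
# Dry-run: print the harness command for each split; remove 'echo' to run them
for split in train val test; do
  echo python -m domains.harness --domain paper_review \
    --run_id "initial_paper_review_filtered_100_${split}_0" \
    --subset "_filtered_100_${split}" --num_samples 10
done
```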
7. Run the evolution loop

Once the initial agents are bootstrapped, start the main generate loop:
python generate_loop.py --domains paper_review --max_generation 10
This will:
  1. Create a timestamped run directory under outputs/generate_<run_id>/.
  2. Spin up a Docker container per generation.
  3. Run the meta-agent to produce a diff of proposed improvements.
  4. Apply the diff and evaluate the new task-agent.
  5. Update the evolutionary archive and select the best parent for the next generation.
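
For intuition, the score-proportional family of parent-selection strategies (step 5) can be sketched as roulette-wheel sampling over archive scores. This is an illustrative sketch only, not the actual generate_loop.py implementation, which also offers random, latest, best, and child-count-weighted variants:

```python
import random

def score_prop_select(archive, rng=random):
    """Pick a parent with probability proportional to its score.

    `archive` is a list of (agent_id, score) pairs; higher-scoring agents
    are proportionally more likely to seed the next generation.
    """
    total = sum(score for _, score in archive)
    if total <= 0:
        return rng.choice(archive)[0]  # degenerate case: fall back to uniform
    r = rng.uniform(0.0, total)
    acc = 0.0
    for agent_id, score in archive:
        acc += score
        if r <= acc:
            return agent_id
    return archive[-1][0]  # guard against floating-point edge cases
```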

Common options

# Run on multiple domains simultaneously
python generate_loop.py --domains paper_review imo_grading --max_generation 20

# Control evaluation sample count per domain
python generate_loop.py --domains paper_review --max_generation 10 --eval_samples 50

# Change the parent selection strategy
python generate_loop.py --domains paper_review --max_generation 10 \
  --parent_selection score_child_prop

# Resume an interrupted run
python generate_loop.py --domains paper_review \
  --resume_from ./outputs/generate_20260101_120000_000000

# Always run test-set evaluation (disabled by default to save compute)
python generate_loop.py --domains paper_review --max_generation 10 --eval_test

All generate_loop.py arguments

| Argument | Default | Description |
| --- | --- | --- |
| --domains | (required) | One or more domains: paper_review, balrog_babyai, balrog_babaisai, balrog_minihack, balrog_nle, genesis_go2walking, genesis_go2walkback, genesis_go2hop, imo_grading, imo_proof, polyglot |
| --max_generation | 10 | Number of evolution generations to run |
| --eval_samples | -1 (all) | Samples per domain during evaluation |
| --eval_workers | 10 | Parallel evaluation workers |
| --parent_selection | score_child_prop | Strategy: random, latest, best, score_prop, score_child_prop |
| --resume_from | None | Path to an existing output folder to resume |
| --output_dir_parent | outputs/ | Root folder for new run output |
| --optimize_option | only_agent | What to optimize: only_agent, only_ensemble, both_agent_ensemble |
| --run_baseline | None | Ablation mode: no_selfimprove, no_archive, dgm, dgm_custom, etc. |
| --eval_test | False | Include test-split evaluation every generation |
| --skip_staged_eval | False | Skip small-sample staged evaluation gate |
| --edit_select_parent | False | Allow the agent to modify its own parent-selection logic |
Outputs are saved to outputs/ by default. Each run creates a generate_<run_id>/ subdirectory containing one gen_<n>/ folder per generation, an archive.jsonl file, and per-generation metadata, logs, diffs, and evaluation results.
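
Because run IDs are timestamped, lexicographic order matches chronological order, so locating the most recent run directory is straightforward. A small illustrative helper (not part of the repo):

```python
from pathlib import Path

def latest_run(outputs="outputs"):
    """Return the most recent generate_<run_id>/ directory, or None if absent.

    Timestamped run IDs sort chronologically, so a plain sort suffices.
    """
    runs = sorted(Path(outputs).glob("generate_*"))
    return runs[-1] if runs else None
```

This is handy for passing the newest run to --resume_from after an interruption.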
