domains/ with a utils.py that defines its dataset format and a shared harness that drives evaluation.
Supported Domains
paper_review
Judges AI-generated academic paper reviews. Predicts accept/reject outcomes against human reviewer decisions.
search_arena
Preference judgment for web-search responses. Picks the better of two model-generated search answers.
balrog
Game-playing ability across four BALROG environments:
babyai, babaisai, minihack, and nle.
genesis
Robotic locomotion control for the Unitree Go2 quadruped. Three tasks: walking, walking backward, and hopping.
imo_grading
Grades student answers to International Mathematical Olympiad problems against official rubrics.
imo_proof
Generates full mathematical proofs for IMO problems, then scores them with a proof-grading agent.
polyglot
SWE-bench-style coding tasks across Python, Rust, Go, JavaScript, C++, and Java — each in its own Docker container.
Domain Summary Table
| Domain | Score Key | Splits | Eval Subset | Ensemble? |
|---|---|---|---|---|
| paper_review | overall_accuracy | train / val / test | _filtered_100_train | Yes |
| search_arena | overall_accuracy | train / val / test | _filtered_100_train | Yes |
| balrog_babyai | average_progress | train | — | No |
| balrog_babaisai | average_progress | train | — | No |
| balrog_minihack | average_progress | train | — | No |
| balrog_nle | average_progress | train | — | No |
| genesis_go2walking | average_fitness | train | — | No |
| genesis_go2walkback | average_fitness | train | — | No |
| genesis_go2hop | average_fitness | train | — | No |
| imo_grading | overall_accuracy | train / val / test | _filtered_100_train | Yes |
| imo_proof | points_percentage | train | — | No |
| polyglot | accuracy_score | train | — | No |
domain_utils.py Reference
All cross-domain logic lives in utils/domain_utils.py. The four primary functions are:
get_domain_score_key(domain)
Returns the key to look up in report.json for the domain’s primary metric.
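As a rough illustration, the lookup could be sketched as follows. The mapping is copied from the Domain Summary Table, but the dictionary, the prefix checks, and the helper read_primary_score are assumptions made for this example, as is the idea that report.json is a flat JSON object with the metric at the top level.

```python
import json

# Score keys copied from the Domain Summary Table.
# The dict and the prefix checks below are illustrative assumptions,
# not the project's actual implementation.
_SCORE_KEYS = {
    "paper_review": "overall_accuracy",
    "search_arena": "overall_accuracy",
    "imo_grading": "overall_accuracy",
    "imo_proof": "points_percentage",
    "polyglot": "accuracy_score",
}


def get_domain_score_key(domain: str) -> str:
    """Return the report.json key that holds the domain's primary metric."""
    if domain.startswith("balrog_"):
        return "average_progress"
    if domain.startswith("genesis_"):
        return "average_fitness"
    if domain in _SCORE_KEYS:
        return _SCORE_KEYS[domain]
    raise ValueError(f"Unknown domain: {domain}")


def read_primary_score(report_text: str, domain: str) -> float:
    """Hypothetical helper: pull the primary metric out of a report.

    Assumes the report is a flat JSON object keyed by metric name.
    """
    report = json.loads(report_text)
    return report[get_domain_score_key(domain)]


# Made-up report contents, for demonstration only.
example_report = json.dumps({"overall_accuracy": 0.62})
score = read_primary_score(example_report, "paper_review")
```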
get_domain_splits(domain, eval_test=False)
Returns the list of dataset splits to evaluate on. Human-preference domains (search_arena, paper_review, imo_grading) support train, val, and optionally test. All other domains return only ["train"].
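A minimal sketch of this behavior, assuming that "optionally test" means the test split is appended only when eval_test=True. The PREFERENCE_DOMAINS constant is a name introduced for this example.

```python
from typing import List

# Domains that the description above lists as having train/val/test splits.
# The constant name is an assumption made for this sketch.
PREFERENCE_DOMAINS = {"search_arena", "paper_review", "imo_grading"}


def get_domain_splits(domain: str, eval_test: bool = False) -> List[str]:
    """Return the dataset splits to evaluate on for the given domain."""
    if domain in PREFERENCE_DOMAINS:
        splits = ["train", "val"]
        if eval_test:
            # The test split is only included on request.
            splits.append("test")
        return splits
    # All other domains are evaluated on a single split.
    return ["train"]
```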
get_domain_eval_subset(domain)
Returns the file suffix for the default evaluation subset. Human-preference domains use _filtered_100_train (100 balanced samples). All other domains use an empty string (full dataset).
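The rule can be sketched in a few lines. This is an assumption-based illustration that mirrors the Domain Summary Table, and it does not show how the suffix is combined with a dataset filename, since that is not described here.

```python
# Domains whose default evaluation subset is the 100-sample filtered file.
# The constant name is an assumption made for this sketch.
FILTERED_SUBSET_DOMAINS = {"search_arena", "paper_review", "imo_grading"}


def get_domain_eval_subset(domain: str) -> str:
    """Return the file suffix of the default evaluation subset."""
    if domain in FILTERED_SUBSET_DOMAINS:
        return "_filtered_100_train"
    # An empty suffix means the full dataset is used.
    return ""
```

Because the empty string is falsy in Python, a caller can test the return value directly to decide whether a subset applies.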
can_domain_ensembled(domain)
Returns True if the domain supports ensemble evaluation (i.e., aggregating multiple agent runs). Preference-judgment domains (search_arena, paper_review, imo_grading) support ensembling. All other domains do not.
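The sketch below pairs the predicate with one plausible way to aggregate runs. The majority_vote helper and the sample labels are hypothetical; the aggregation method actually used for ensembling is not specified here.

```python
from collections import Counter
from typing import List, Sequence

# Domains marked "Yes" in the Ensemble? column of the summary table.
ENSEMBLE_DOMAINS = {"search_arena", "paper_review", "imo_grading"}


def can_domain_ensembled(domain: str) -> bool:
    """Return True if the domain supports ensemble evaluation."""
    return domain in ENSEMBLE_DOMAINS


def majority_vote(predictions_per_run: Sequence[Sequence[str]]) -> List[str]:
    """Hypothetical aggregator: majority label per sample across runs.

    Each inner sequence holds one run's predictions, in the same
    sample order for every run.
    """
    return [
        Counter(votes).most_common(1)[0][0]
        for votes in zip(*predictions_per_run)
    ]


# Made-up predictions from three runs over two samples.
runs = [
    ["accept", "reject"],
    ["accept", "accept"],
    ["reject", "reject"],
]
ensembled = majority_vote(runs) if can_domain_ensembled("paper_review") else None
```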
Adding a New Domain
To add a new domain to HyperAgents:
Add the domain to harness.py
Add your domain name to the --domain choices list in domains/harness.py.
Add dataset loading
If your domain uses a CSV dataset, add a branch in the get_dataset() function in domains/harness.py. If it uses a custom harness (like BALROG or Genesis), implement a harness_<domain>() function and dispatch to it in the main block.
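The two steps above can be sketched as follows. Everything here is illustrative: my_csv_domain and my_custom_domain are hypothetical names, and the signatures of get_dataset() and harness_my_custom_domain() are assumptions, since the real ones in domains/harness.py are not shown here.

```python
import argparse
import csv
import io

# Step 1: register the new domain names in the --domain choices list.
# "my_csv_domain" and "my_custom_domain" are hypothetical examples.
DOMAIN_CHOICES = ["paper_review", "search_arena", "my_csv_domain", "my_custom_domain"]
CSV_DOMAINS = {"paper_review", "search_arena", "my_csv_domain"}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--domain", choices=DOMAIN_CHOICES, required=True)
    return parser


# Step 2a: CSV-backed domains get a branch in get_dataset().
# Taking an open file object is an assumption made to keep this runnable.
def get_dataset(domain, csv_file):
    if domain in CSV_DOMAINS:
        return list(csv.DictReader(csv_file))
    raise ValueError(f"No CSV loader for domain: {domain}")


# Step 2b: domains with a custom harness get their own function.
def harness_my_custom_domain(args):
    return {"domain": args.domain, "status": "custom harness invoked"}


def main(argv, csv_file=None):
    args = build_parser().parse_args(argv)
    # Dispatch to the custom harness before falling back to CSV loading.
    if args.domain == "my_custom_domain":
        return harness_my_custom_domain(args)
    return get_dataset(args.domain, csv_file)


# Made-up CSV contents, for demonstration only.
rows = main(["--domain", "my_csv_domain"], io.StringIO("question,answer\n1+1,2\n"))
custom = main(["--domain", "my_custom_domain"])
```

Using choices means argparse rejects an unregistered domain name before any dataset loading starts.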