
Overview

utils/domain_utils.py centralises all domain-specific configuration so the rest of the codebase can query it without duplicating if/elif chains. Supported domain families:
  • Human preferences — search_arena, paper_review, imo_grading
  • Balrog games — any domain containing "balrog" (e.g. "balrog_babyai", "balrog_minihack")
  • Genesis robotics — any domain containing "genesis"
  • Polyglot coding — any domain containing "polyglot"
  • IMO proof — imo_proof
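
As a rough illustration of the substring-based dispatch these helpers share (a sketch only — the shipped utils/domain_utils.py may structure this differently), a minimal family classifier could look like:

```python
def domain_family(domain: str) -> str:
    """Map a domain name to its configuration family (illustrative sketch)."""
    if domain in ("search_arena", "paper_review", "imo_grading"):
        return "human_preferences"
    if "balrog" in domain:      # e.g. "balrog_babyai", "balrog_minihack"
        return "balrog"
    if "genesis" in domain:
        return "genesis"
    if "polyglot" in domain:
        return "polyglot"
    if domain == "imo_proof":
        return "imo_proof"
    raise ValueError(f"Unknown domain: {domain}")
```

Centralising this dispatch is what lets the accessor functions below stay free of duplicated if/elif chains.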

get_domain_eval_subset

Returns the dataset subset suffix used when locating training-split evaluation results.
from utils.domain_utils import get_domain_eval_subset

subset = get_domain_eval_subset("search_arena")  # "_filtered_100_train"
subset = get_domain_eval_subset("balrog_babyai")  # ""

Signature

def get_domain_eval_subset(domain: str) -> str

Parameters

domain
str
required
Domain name string.

Return Value

subset
str
The dataset suffix appended to the domain name when building the path to the initial evaluation output directory.
| Domain | Return value |
| --- | --- |
| search_arena, paper_review | "_filtered_100_train" |
| imo_grading | "_filtered_100_train" |
| Balrog / Genesis / Polyglot / imo_proof | "" (empty string) |

Example

domain = "paper_review"
subset = get_domain_eval_subset(domain)
eval_path = f"./outputs/initial_{domain}{subset}_0/"
# ./outputs/initial_paper_review_filtered_100_train_0/

get_domain_splits

Returns the list of evaluation splits for a given domain.
from utils.domain_utils import get_domain_splits

get_domain_splits("search_arena")              # ["train", "val"]
get_domain_splits("search_arena", eval_test=True)  # ["train", "val", "test"]
get_domain_splits("balrog_babyai")             # ["train"]

Signature

def get_domain_splits(domain: str, eval_test: bool = False) -> list[str]

Parameters

domain
str
required
Domain name string.
eval_test
bool
default: False
When True, appends "test" to the split list for domains that support a test split (currently search_arena, paper_review, and imo_grading).

Return Value

splits
list[str]
Ordered list of split names.
| Domain | Default splits | With eval_test=True |
| --- | --- | --- |
| search_arena, paper_review, imo_grading | ["train", "val"] | ["train", "val", "test"] |
| Balrog / Genesis / Polyglot / imo_proof | ["train"] | ["train"] |
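
The behaviour in the table can be captured in a minimal re-implementation — a sketch consistent with the documented return values, not necessarily the shipped code:

```python
def get_domain_splits(domain: str, eval_test: bool = False) -> list[str]:
    """Return the ordered evaluation splits for a domain (illustrative sketch)."""
    if domain in ("search_arena", "paper_review", "imo_grading"):
        splits = ["train", "val"]
        if eval_test:
            # Only these domains support a held-out test split.
            splits.append("test")
        return splits
    # Balrog / Genesis / Polyglot / imo_proof: train only.
    return ["train"]
```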

get_domain_score_key

Returns the JSON key used to read the primary score from a domain’s report.json.
from utils.domain_utils import get_domain_score_key

key = get_domain_score_key("search_arena")   # "overall_accuracy"
key = get_domain_score_key("balrog_babyai")  # "average_progress"
key = get_domain_score_key("genesis_walk")   # "average_fitness"

Signature

def get_domain_score_key(domain: str) -> str

Parameters

domain
str
required
Domain name string.

Return Value

key
str
The top-level key in report.json that holds the primary scalar score.
| Domain | Key |
| --- | --- |
| search_arena, paper_review, imo_grading | "overall_accuracy" |
| Balrog domains | "average_progress" |
| Genesis domains | "average_fitness" |
| Polyglot domains | "accuracy_score" |
| imo_proof | "points_percentage" |
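The returned key is meant for a plain dictionary lookup on the parsed report. A sketch of that pattern, using hypothetical report.json contents (only the key lookup is the point here):

```python
import json

# Hypothetical report.json contents for a search_arena run; real reports
# may carry additional fields.
report_text = '{"overall_accuracy": 0.82, "num_samples": 100}'

key = "overall_accuracy"  # i.e. get_domain_score_key("search_arena")
score = json.loads(report_text)[key]
```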

can_domain_ensembled

Returns whether the domain supports ensemble evaluation.
from utils.domain_utils import can_domain_ensembled

can_domain_ensembled("search_arena")  # True
can_domain_ensembled("balrog_babyai") # False

Signature

def can_domain_ensembled(domain: str) -> bool

Parameters

domain
str
required
Domain name string.

Return Value

result
bool
True if an ensemble of agent outputs can be evaluated; False otherwise.
| Domain | Supports ensemble |
| --- | --- |
| search_arena, paper_review | True |
| imo_grading | True |
| Balrog / Genesis / Polyglot / imo_proof | False |
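
A typical use is gating an ensemble pass in an evaluation plan. The helper below is a hypothetical sketch (plan_eval_passes is not part of domain_utils); the flag it branches on is what can_domain_ensembled returns:

```python
def plan_eval_passes(domain: str, can_ensemble: bool) -> list[str]:
    """Decide which evaluation passes to run (illustrative sketch)."""
    passes = ["single"]
    if can_ensemble:  # can_domain_ensembled(domain) in the real codebase
        passes.append("ensemble")
    return passes
```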

get_domain_stagedeval_samples

Returns the number of samples used during staged (fast) evaluation.
from utils.domain_utils import get_domain_stagedeval_samples

samples = get_domain_stagedeval_samples("search_arena")  # 10
samples = get_domain_stagedeval_samples("balrog_babyai") # 1

Signature

def get_domain_stagedeval_samples(domain: str) -> int

Parameters

domain
str
required
Domain name string.

Return Value

samples
int
Number of evaluation samples in a staged (partial) evaluation run.
| Domain | Staged samples |
| --- | --- |
| search_arena, paper_review | 10 |
| Balrog domains | 1 |
| Genesis domains | 3 |
| Polyglot domains | 10 |
| imo_grading, imo_proof | 10 |

get_domain_stagedeval_frac

Returns the fraction of full-evaluation samples covered by a staged evaluation. Used to normalise staged scores so they are comparable to full-eval scores.
from utils.domain_utils import get_domain_stagedeval_frac

frac = get_domain_stagedeval_frac("search_arena")  # 0.1  (10/100)
frac = get_domain_stagedeval_frac("balrog_babyai") # 0.1  (1/10)

Signature

def get_domain_stagedeval_frac(domain: str) -> float

Parameters

domain
str
required
Domain name string.

Return Value

frac
float
staged_samples / full_eval_samples. Scores from a staged run are multiplied by this value in get_saved_score to produce a full-eval-comparable number.
| Domain | Fraction | Calculation |
| --- | --- | --- |
| search_arena, paper_review | 0.10 | 10 / 100 |
| balrog_babyai | 0.10 | 1 / 10 |
| balrog_minihack | 0.20 | 1 / 5 |
| Genesis domains | 0.50 | 3 / 6 |
| Polyglot domains | 0.167 | 10 / 60 |
| imo_grading | 0.10 | 10 / 100 |
| imo_proof | 0.167 | 10 / 60 |
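
Since the fraction is staged_samples / full_eval_samples, the full-evaluation sample count is recoverable as staged / frac. A quick arithmetic check against a few rows of the table:

```python
# Values taken from the staged-samples and fraction tables above.
staged = {"search_arena": 10, "balrog_babyai": 1, "genesis_walk": 3}
frac = {"search_arena": 0.10, "balrog_babyai": 0.10, "genesis_walk": 0.50}

# Recover full-eval sample counts; round() guards against float noise.
full = {d: round(staged[d] / frac[d]) for d in staged}
```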

Example

from utils.domain_utils import get_domain_stagedeval_frac
from utils.gl_utils import get_score

domain = "search_arena"
raw_score = get_score(domain, "/runs/exp1", genid=0, split="train")
frac = get_domain_stagedeval_frac(domain)
comparable_score = raw_score * frac  # scale to full-eval range
