Overview
utils/domain_utils.py centralises all domain-specific configuration so the rest of the codebase can query it without duplicating if/elif chains.
Supported domain families:
- Human preferences: search_arena, paper_review, imo_grading
- Balrog games: any domain containing "balrog" (e.g. "balrog_babyai", "balrog_minihack")
- Genesis robotics: any domain containing "genesis"
- Polyglot coding: any domain containing "polyglot"
- IMO proof: imo_proof
get_domain_eval_subset
Returns the dataset subset suffix used when locating training-split evaluation results.
Signature
Parameters
Domain name string.
Return Value
The dataset suffix appended to the domain name when building the path to the initial evaluation output directory.
| Domain | Return value |
|---|---|
| search_arena, paper_review | "_filtered_100_train" |
| imo_grading | "_filtered_100_train" |
| Balrog / Genesis / Polyglot / imo_proof | "" (empty string) |
Example
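A minimal sketch of how the returned suffix might be used when building the output path. The function below is a local stand-in that mirrors the table above, not the implementation in utils/domain_utils.py, and both the helper `initial_eval_dir` and the `outputs/initial` root are hypothetical names chosen for illustration.

```python
from pathlib import PurePosixPath


# Stand-in mirroring the documented table; not the code in utils/domain_utils.py.
def get_domain_eval_subset(domain: str) -> str:
    if domain in ("search_arena", "paper_review", "imo_grading"):
        return "_filtered_100_train"
    # Balrog, Genesis, Polyglot and imo_proof use no suffix.
    return ""


# Hypothetical helper: the root directory and layout are assumptions.
def initial_eval_dir(root: str, domain: str) -> PurePosixPath:
    # The suffix is appended directly to the domain name.
    return PurePosixPath(root) / (domain + get_domain_eval_subset(domain))


print(initial_eval_dir("outputs/initial", "search_arena"))
# outputs/initial/search_arena_filtered_100_train
print(initial_eval_dir("outputs/initial", "balrog_babyai"))
# outputs/initial/balrog_babyai
```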
get_domain_splits
Returns the list of evaluation splits for a given domain.
Signature
Parameters
Domain name string.
When eval_test is True, appends "test" to the split list for domains that support a test split (currently search_arena, paper_review, and imo_grading).
Return Value
Ordered list of split names.
| Domain | Default splits | With eval_test=True |
|---|---|---|
| search_arena, paper_review, imo_grading | ["train", "val"] | ["train", "val", "test"] |
| Balrog / Genesis / Polyglot / imo_proof | ["train"] | ["train"] |
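The split behaviour in the table can be sketched as follows. This is a local stand-in, not the actual implementation. The default `eval_test=False` is an assumption inferred from the "Default splits" column, and `TEST_SPLIT_DOMAINS` is a hypothetical name.

```python
from typing import List

# Domains documented as supporting a test split (hypothetical constant name).
TEST_SPLIT_DOMAINS = {"search_arena", "paper_review", "imo_grading"}


# Stand-in mirroring the documented table; not the code in utils/domain_utils.py.
def get_domain_splits(domain: str, eval_test: bool = False) -> List[str]:
    if domain in TEST_SPLIT_DOMAINS:
        splits = ["train", "val"]
        if eval_test:
            splits.append("test")
        return splits
    # Balrog, Genesis, Polyglot and imo_proof only have a train split,
    # so eval_test has no effect for them.
    return ["train"]


for split in get_domain_splits("paper_review", eval_test=True):
    print(f"evaluating paper_review on {split}")
```

Because the list is ordered, a caller can iterate over it directly to run the splits in sequence.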
get_domain_score_key
Returns the JSON key used to read the primary score from a domain’s report.json.
Signature
Parameters
Domain name string.
Return Value
The top-level key in report.json that holds the primary scalar score.
| Domain | Key |
|---|---|
| search_arena, paper_review, imo_grading | "overall_accuracy" |
| Balrog domains | "average_progress" |
| Genesis domains | "average_fitness" |
| Polyglot domains | "accuracy_score" |
| imo_proof | "points_percentage" |
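As an illustration, the key can be used to pull the primary score out of a report file. The lookup below is a stand-in that mirrors the table. Raising `ValueError` for unknown domains is an assumption, and `read_primary_score` is a hypothetical helper. The report contents are made up for the demo.

```python
import json
import os
import tempfile


# Stand-in mirroring the documented table; not the code in utils/domain_utils.py.
def get_domain_score_key(domain: str) -> str:
    if domain in ("search_arena", "paper_review", "imo_grading"):
        return "overall_accuracy"
    if "balrog" in domain:
        return "average_progress"
    if "genesis" in domain:
        return "average_fitness"
    if "polyglot" in domain:
        return "accuracy_score"
    if domain == "imo_proof":
        return "points_percentage"
    # Assumption: the real function may handle unknown domains differently.
    raise ValueError(f"unknown domain: {domain}")


# Hypothetical helper: reads the primary score from a report.json file.
def read_primary_score(report_path: str, domain: str) -> float:
    with open(report_path) as f:
        report = json.load(f)
    return report[get_domain_score_key(domain)]


# Demo with a made-up report written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "report.json")
    with open(path, "w") as f:
        json.dump({"average_progress": 0.25}, f)
    demo_score = read_primary_score(path, "balrog_babyai")

print(demo_score)  # 0.25
```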
can_domain_ensembled
Returns whether the domain supports ensemble evaluation.
Signature
Parameters
Domain name string.
Return Value
True if an ensemble of agent outputs can be evaluated; False otherwise.
| Domain | Supports ensemble |
|---|---|
| search_arena, paper_review | True |
| imo_grading | True |
| Balrog / Genesis / Polyglot / imo_proof | False |
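One plausible use is to filter a list of domains before scheduling ensemble evaluation. The function below is a stand-in that mirrors the table, not the actual implementation, and the `domains` list is illustrative.

```python
# Stand-in mirroring the documented table; not the code in utils/domain_utils.py.
def can_domain_ensembled(domain: str) -> bool:
    return domain in ("search_arena", "paper_review", "imo_grading")


domains = ["search_arena", "balrog_babyai", "imo_grading", "imo_proof"]

# Keep only the domains for which an ensemble can be evaluated.
ensemble_domains = [d for d in domains if can_domain_ensembled(d)]
print(ensemble_domains)  # ['search_arena', 'imo_grading']
```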
get_domain_stagedeval_samples
Returns the number of samples used during staged (fast) evaluation.
Signature
Parameters
Domain name string.
Return Value
Number of evaluation samples in a staged (partial) evaluation run.
| Domain | Staged samples |
|---|---|
| search_arena, paper_review | 10 |
| Balrog domains | 1 |
| Genesis domains | 3 |
| Polyglot domains | 10 |
| imo_grading, imo_proof | 10 |
get_domain_stagedeval_frac
Returns the fraction of full-evaluation samples covered by a staged evaluation. Used to normalise staged scores so they are comparable to full-eval scores.
Signature
Parameters
Domain name string.
Return Value
staged_samples / full_eval_samples. Scores from a staged run are multiplied by this value in get_saved_score to produce a full-eval-comparable number.
| Domain | Fraction | Calculation |
|---|---|---|
| search_arena, paper_review | 0.10 | 10 / 100 |
| balrog_babyai | 0.10 | 1 / 10 |
| balrog_minihack | 0.20 | 1 / 5 |
| Genesis domains | 0.50 | 3 / 6 |
| Polyglot domains | 0.167 | 10 / 60 |
| imo_grading | 0.10 | 10 / 100 |
| imo_proof | 0.167 | 10 / 60 |
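The arithmetic behind the table can be sketched as below. The sample counts are copied from the "Calculation" column. `staged_and_full_samples` and `scale_staged_score` are hypothetical helpers, and the scaling only illustrates the multiplication described above rather than the actual body of get_saved_score.

```python
from typing import Tuple


# Hypothetical helper: (staged samples, full-eval samples) per the table.
def staged_and_full_samples(domain: str) -> Tuple[int, int]:
    if domain in ("search_arena", "paper_review", "imo_grading"):
        return 10, 100
    if domain == "balrog_babyai":
        return 1, 10
    if domain == "balrog_minihack":
        return 1, 5
    if "genesis" in domain:
        return 3, 6
    if "polyglot" in domain or domain == "imo_proof":
        return 10, 60
    # Assumption: domains not listed in the table are rejected here.
    raise ValueError(f"no sample counts documented for: {domain}")


# Stand-in mirroring the documented table; not the code in utils/domain_utils.py.
def get_domain_stagedeval_frac(domain: str) -> float:
    staged, full = staged_and_full_samples(domain)
    return staged / full


# Hypothetical helper illustrating the multiplication described above.
def scale_staged_score(staged_score: float, domain: str) -> float:
    return staged_score * get_domain_stagedeval_frac(domain)


for name in ("search_arena", "balrog_minihack", "imo_proof"):
    print(f"{name}: {get_domain_stagedeval_frac(name):.3f}")
# search_arena: 0.100
# balrog_minihack: 0.200
# imo_proof: 0.167
```

For example, under these assumptions a staged score of 0.8 on search_arena would be reported as roughly 0.08 after scaling.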