The `paper_review` domain tests whether an agent can replicate the judgment of human peer reviewers. Given the full text of an academic paper, the agent must predict whether the paper was ultimately accepted or rejected.
## What It Evaluates
This is a binary classification task framed as a preference-judgment problem. The ground truth is the actual outcome from a real peer-review process. The domain measures:

- Overall accuracy — fraction of papers where the agent’s prediction matches the actual outcome
- Per-label precision and recall — computed separately for `accept` and `reject`
- Label distribution alignment — how closely the predicted accept/reject ratio matches the ground-truth distribution
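These metrics are simple to compute; a minimal sketch follows (the helper names are illustrative, not part of the harness):

```python
from collections import Counter

def overall_accuracy(preds, truths):
    # Fraction of papers where prediction matches ground truth (case-insensitive).
    matches = sum(p.lower() == t.lower() for p, t in zip(preds, truths))
    return matches / len(truths)

def label_distribution(labels):
    # Fraction of each label, e.g. {"accept": 0.5, "reject": 0.5}.
    counts = Counter(l.lower() for l in labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

preds = ["accept", "reject", "accept", "accept"]
truths = ["accept", "reject", "reject", "accept"]
acc = overall_accuracy(preds, truths)   # 3 of 4 match -> 0.75
pred_dist = label_distribution(preds)   # {"accept": 0.75, "reject": 0.25}
true_dist = label_distribution(truths)  # {"accept": 0.5, "reject": 0.5}
```

Comparing `pred_dist` against `true_dist` gives the distribution-alignment view of the same predictions.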
## Dataset Format
The dataset is stored at `domains/paper_review/dataset.csv`. Each row represents one paper:
| Column | Description |
|---|---|
| `question_id` | Unique identifier for the paper |
| `paper_text` | Full text of the paper |
| `outcome` | Ground truth label: `accept` or `reject` |
Constants in `utils.py` confirm this layout, and its `format_input_dict` function passes `paper_text` to the agent.
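The actual source is not reproduced here; a plausible sketch, assuming these constant names match the documented columns, might look like:

```python
# Hypothetical sketch of the utils.py layout described above;
# the real constant names may differ.
QUESTION_ID_COL = "question_id"
PAPER_TEXT_COL = "paper_text"
OUTCOME_COL = "outcome"

def format_input_dict(row):
    # Expose only the paper text to the agent; the label stays hidden.
    return {"paper_text": row[PAPER_TEXT_COL]}
```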
## Dataset Subsets
The raw `dataset.csv` is split into balanced train/val/test subsets by `curate_subsets.py`. Each split contains 100 samples (50 accept, 50 reject) with no overlap between splits.
The curated subset files follow a common naming convention; the file with the `_filtered_100_train` suffix is the default evaluation subset (see `get_domain_eval_subset`).
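A minimal sketch of how such balanced, non-overlapping splits could be produced (`balanced_splits` and its parameters are assumptions, not the actual `curate_subsets.py` interface):

```python
import random

def balanced_splits(rows, per_label=50, n_splits=3, seed=0):
    # Build disjoint splits, each containing `per_label` accepts
    # and `per_label` rejects drawn from shuffled pools.
    rng = random.Random(seed)
    accepts = [r for r in rows if r["outcome"] == "accept"]
    rejects = [r for r in rows if r["outcome"] == "reject"]
    rng.shuffle(accepts)
    rng.shuffle(rejects)
    splits = []
    for i in range(n_splits):
        lo, hi = i * per_label, (i + 1) * per_label
        splits.append(accepts[lo:hi] + rejects[lo:hi])
    return splits  # e.g. train, val, test: 100 samples each, no overlap
```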
## Setup
## Scoring
The primary metric is `overall_accuracy` — the fraction of papers where the agent’s prediction exactly matches the ground truth outcome (case-insensitive string comparison).
The `report.py` script also outputs:

- Per-label precision and recall for `accept` and `reject`
- Random-guess baseline accuracy computed from the ground truth label distribution
- Prediction and ground truth label distributions
- Lists of `question_ids_passed` and `question_ids_failed`
All of these are written to `report.json` in the output directory.
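A sketch of how the per-label metrics and the baseline might be computed; the random-guess baseline here uses one common definition (sum of squared label frequencies), which may differ from the harness's exact formula:

```python
def per_label_metrics(preds, truths, label):
    # Precision and recall for one label, returning 0.0 when undefined.
    preds = [p.lower() for p in preds]
    truths = [t.lower() for t in truths]
    tp = sum(p == t == label for p, t in zip(preds, truths))
    n_pred = preds.count(label)
    n_true = truths.count(label)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    return precision, recall

def random_guess_baseline(truths):
    # Expected accuracy of sampling guesses from the ground-truth
    # label distribution: sum of squared label frequencies.
    truths = [t.lower() for t in truths]
    return sum((truths.count(l) / len(truths)) ** 2 for l in set(truths))
```

On a balanced 50/50 split, this baseline is 0.5² + 0.5² = 0.5, the natural floor against which `overall_accuracy` is judged.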
## Harness Behavior
- Parallelism: controlled by `--num_workers` (default: 5). The domain supports parallel evaluation.
- Resuming: pass `--resume_from <output_dir>` to continue an interrupted run. Already-completed question IDs are skipped.
- Save interval: predictions are checkpointed to `predictions.csv` every `--save_interval` samples (default: 100).
- Ensemble: `can_domain_ensembled("paper_review")` returns `True`, so multiple agent runs can be aggregated.
The default model (`MODEL = "gpt-4o"`) is passed to `TaskAgent` at construction time. Your `TaskAgent` can ignore it or use it to configure the underlying LLM client.
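A minimal `TaskAgent` stub consistent with this contract; the constructor signature and the `predict` method name are assumptions beyond what the text states:

```python
class TaskAgent:
    def __init__(self, model="gpt-4o"):
        # The harness passes MODEL here; you may ignore it or use it
        # to configure your LLM client.
        self.model = model

    def predict(self, input_dict):
        # input_dict contains "paper_text"; return "accept" or "reject".
        # A real agent would call an LLM here; this stub always rejects.
        return "reject"
```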