The `search_arena` domain tests pairwise preference judgment for web-search results. Given two model-generated responses (`messages_a` and `messages_b`) to a search query, the agent must predict which response a human annotator preferred.
## What It Evaluates
This is a binary preference-judgment task. The domain measures:
- Overall accuracy — the fraction of comparisons where the agent's preference matches the human winner label
- Per-label precision and recall — computed for each possible winner label
- Label distribution alignment — how closely the predicted label distribution tracks the ground-truth distribution
## Dataset Format
The dataset is stored at `domains/search_arena/dataset.csv`. Each row represents one pairwise comparison:
| Column | Description |
|---|---|
| `question_id` | Unique identifier for the comparison |
| `messages_a` | First model's conversation / response |
| `messages_b` | Second model's conversation / response |
| `winner` | Ground-truth label indicating which response was preferred |
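The row layout above can be exercised with a tiny stand-in file. This is an illustrative sketch only: the winner-label vocabulary (`model_a` / `model_b`) and the message contents are assumptions, not values copied from the real dataset.

```python
import csv
import io

# Illustrative stand-in for domains/search_arena/dataset.csv.
# Label names ("model_a"/"model_b") are assumed, not taken from the source.
csv_text = """question_id,messages_a,messages_b,winner
q1,"[{'role': 'assistant', 'content': 'Answer A'}]","[{'role': 'assistant', 'content': 'Answer B'}]",model_a
q2,"[{'role': 'assistant', 'content': 'Answer C'}]","[{'role': 'assistant', 'content': 'Answer D'}]",model_b
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
columns = list(rows[0].keys())  # should match the four columns in the table
```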
The constants in `utils.py` confirm this layout.
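As a hypothetical sketch of what those constants encode (the actual names live in `domains/search_arena/utils.py` and may differ):

```python
# Hypothetical constants mirroring the dataset columns.
# The real constant names are defined in domains/search_arena/utils.py.
DATASET_PATH = "domains/search_arena/dataset.csv"
QUESTION_ID_COL = "question_id"
MESSAGES_A_COL = "messages_a"
MESSAGES_B_COL = "messages_b"
WINNER_COL = "winner"
```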
The `format_input_dict` function passes both responses to the agent.
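A minimal sketch of that behavior, assuming it takes a dataset row and withholds the label (the real `format_input_dict` lives in `domains/search_arena/utils.py`; the key names here are assumptions):

```python
def format_input_dict(row: dict) -> dict:
    """Hypothetical sketch: package both responses for the agent.

    Passes messages_a and messages_b through and deliberately omits the
    ground-truth winner column so the agent cannot see the label.
    """
    return {
        "question_id": row["question_id"],
        "messages_a": row["messages_a"],
        "messages_b": row["messages_b"],
    }


example = format_input_dict({
    "question_id": "q1",
    "messages_a": "[assistant: Answer A]",
    "messages_b": "[assistant: Answer B]",
    "winner": "model_a",  # label withheld from the agent input
})
```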
## Dataset Subsets
Filtered question IDs are tracked in `domains/search_arena/dataset_filtered_100_ids.json`. The curated subset files follow the same naming convention as `paper_review`: the `_filtered_100_train` suffix marks the default evaluation subset.
## Setup and Run
### Download and curate dataset subsets
Download the dataset from HuggingFace and generate the balanced train/val/test CSV files. This downloads `lmarena-ai/search-arena-v1-7k`, saves `dataset.csv`, filters it, and writes the three split files (`dataset_filtered_100_train.csv`, `dataset_filtered_100_val.csv`, `dataset_filtered_100_test.csv`).

## Scoring
The primary metric is `overall_accuracy` — the fraction of comparisons where the agent's predicted winner matches the human-labeled winner (case-insensitive).
The report includes:
- Per-label precision and recall for each winner value
- Random-guess baseline computed from the ground truth winner distribution
- Prediction and ground truth label distributions
- Lists of passed and failed question IDs
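The metrics above can be sketched in a few lines. This is not the harness's actual scoring code, just an illustration of the reported quantities, including the case-insensitive match and a random-guess baseline derived from the ground-truth label distribution:

```python
from collections import Counter


def score(preds: list[str], labels: list[str]):
    """Sketch of the reported metrics (not the harness's real implementation)."""
    preds = [p.strip().lower() for p in preds]    # case-insensitive comparison
    labels = [l.strip().lower() for l in labels]

    overall = sum(p == l for p, l in zip(preds, labels)) / len(labels)

    per_label = {}
    for label in set(labels):
        tp = sum(p == l == label for p, l in zip(preds, labels))
        pred_n = sum(p == label for p in preds)
        true_n = sum(l == label for l in labels)
        per_label[label] = {
            "precision": tp / pred_n if pred_n else 0.0,
            "recall": tp / true_n if true_n else 0.0,
        }

    # Random-guess baseline: expected accuracy when predictions are drawn
    # from the ground-truth label distribution.
    dist = Counter(labels)
    baseline = sum((n / len(labels)) ** 2 for n in dist.values())
    return overall, per_label, baseline


overall, per_label, baseline = score(["Model_A", "model_b"], ["model_a", "model_a"])
```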
## Harness Behavior
- Parallelism: supports up to `--num_workers` concurrent threads (default: 5).
- Resuming: pass `--resume_from <output_dir>` to skip already-evaluated questions.
- Ensemble: `can_domain_ensembled("search_arena")` returns `True`.
- Splits: the domain has `train`, `val`, and `test` splits. The test split is only used when `eval_test=True` is passed to `get_domain_splits`.
The staged evaluation sample count for `search_arena` is 10 out of 100 (10%), matching `paper_review`. This fraction is defined in `get_domain_staged_eval_frac` in `utils/domain_utils.py`.
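The arithmetic is simple but worth pinning down. A sketch, assuming the staged-eval fraction is expressed as a float (the actual value comes from `get_domain_staged_eval_frac` in `utils/domain_utils.py`):

```python
# Assumed value; the real fraction is returned by get_domain_staged_eval_frac
# in utils/domain_utils.py.
STAGED_EVAL_FRAC = 0.10
subset_size = 100  # the filtered-100 evaluation subset

staged_samples = round(subset_size * STAGED_EVAL_FRAC)
```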