The polyglot domain is a software engineering benchmark modeled after SWE-bench. Agents are given a problem statement and must produce a code patch that makes failing tests pass. Unlike other HyperAgents domains, each instance runs inside its own Docker container — a fresh, isolated environment containing the target repository at a specific commit.
## What It Evaluates
Polyglot tests the ability to fix bugs and implement features in real code repositories across six languages: Python, Rust, Go, JavaScript, C++, and Java. The primary metric is accuracy_score — the fraction of instances where the agent’s patch causes all tests to pass (resolved).
Each instance is independent. The eval result for a single instance is one of:
| Result | Meaning |
|---|---|
| `resolved` | Patch applied; all tests pass |
| `unresolved` | Patch applied; tests still fail |
| `empty_patch` | Agent produced no patch |
| `incomplete` | Container setup failed before the agent ran |
| `error` | Unexpected exception during processing |
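The five outcomes form a natural decision chain. A minimal sketch of that classification logic (the function name, argument names, and check ordering are illustrative, not the harness's actual API):

```python
from enum import Enum
from typing import Optional


class InstanceResult(str, Enum):
    RESOLVED = "resolved"
    UNRESOLVED = "unresolved"
    EMPTY_PATCH = "empty_patch"
    INCOMPLETE = "incomplete"
    ERROR = "error"


def classify_instance(setup_ok: bool, patch: Optional[str],
                      tests_passed: bool, had_exception: bool) -> InstanceResult:
    """Illustrative decision chain for one polyglot instance."""
    if had_exception:                      # unexpected exception during processing
        return InstanceResult.ERROR
    if not setup_ok:                       # container setup failed before agent ran
        return InstanceResult.INCOMPLETE
    if not patch or not patch.strip():     # agent produced no patch
        return InstanceResult.EMPTY_PATCH
    return InstanceResult.RESOLVED if tests_passed else InstanceResult.UNRESOLVED
```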
## How It Differs from Other Domains
- Separate harness: Polyglot uses `domains/polyglot/harness.py` instead of `domains/harness.py`; the main harness does not handle the polyglot domain.
- Docker containers: Each instance builds and starts a dedicated container with the repository pre-installed at the correct commit.
- No CSV dataset: The dataset is a JSON file (`polyglot_benchmark_metadata.json`) prepared via `prepare_polyglot_dataset.py`.
- Per-language test commands: Test execution is language-specific (e.g., `pytest` for Python, `cargo test` for Rust, `go test` for Go, `npm run test` for JavaScript, `cmake` + `make` for C++, `./gradlew test` for Java).
- 10-minute agent timeout: Each agent invocation is wrapped in `timeout 600` inside the container.
- No ensemble support: `can_domain_ensembled("polyglot")` returns `False`.
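The per-language test commands and the `timeout 600` wrapper combine into a single in-container shell invocation. A hedged sketch of how such a command string could be assembled (the helper names are invented, the `&&`-join is an assumption — multi-step languages like JavaScript and C++ actually run `set -e` scripts — and only a subset of the real `TEST_COMMANDS` from `constants.py` is shown):

```python
# Illustrative subset of the per-language commands in constants.py.
TEST_COMMANDS = {
    "python": ["pytest -rA --tb=long"],
    "rust": ["cargo test -- --include-ignored"],
    "go": ["go test ./..."],
}


def build_test_command(language: str) -> str:
    """Join a language's test steps so any failing step aborts the chain."""
    return " && ".join(TEST_COMMANDS[language])


def wrap_with_timeout(cmd: str, seconds: int = 600) -> str:
    """Wrap a command in coreutils `timeout`, mirroring the agent timeout."""
    return f"timeout {seconds} {cmd}"
```

For example, `wrap_with_timeout("python run_task_agent.py")` yields the `timeout 600 …` form described above.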
## Dataset Subsets
Two predefined subsets are available in domains/polyglot/subsets/:
| Subset | File | Description |
|---|---|---|
| `small` | `subsets/small.json` | Small list of instance IDs for quick testing |
| `medium` | `subsets/medium.json` | Larger, representative set |
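Selecting a subset amounts to filtering the full dataset by instance ID. A minimal sketch, assuming each subset file is a flat JSON array of instance IDs (the helper functions are illustrative, not harness API):

```python
import json
from pathlib import Path


def load_subset_ids(subset_path: Path) -> set:
    """Read a subset file, assumed to be a JSON array of instance IDs."""
    return set(json.loads(subset_path.read_text()))


def filter_dataset(instances: list, subset_ids: set) -> list:
    """Keep only dataset entries whose instance_id is in the subset."""
    return [inst for inst in instances if inst["instance_id"] in subset_ids]
```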
## Setup
### Clone SWE-bench

Polyglot depends on SWE-bench for its Docker image building utilities:

```shell
cd domains/polyglot
git clone https://github.com/princeton-nlp/SWE-bench.git
cd SWE-bench
git checkout dc4c087c2b9e4cefebf2e3d201d27e36
pip install -e .
cd ../../../
```
Polyglot is pinned to a specific SWE-bench commit (`dc4c087`); using a different version may break Docker image building.
### Prepare the dataset

```shell
python -m domains.polyglot.prepare_polyglot_dataset
```

This generates `domains/polyglot/polyglot_benchmark_metadata.json`.

### Run evaluation

```shell
python -m domains.polyglot.harness \
    --subset small \
    --output_dir ./outputs/initial_polyglot_0 \
    --model_name_or_path eval_run
```
Key arguments:

| Argument | Default | Description |
|---|---|---|
| `--subset` | `small` | Dataset subset: `small`, `medium`, or `full` |
| `--num_samples` | `-1` (all) | Limit number of instances |
| `--max_workers` | `5` | Parallel Docker containers |
| `--model_name_or_path` | timestamp | Label for this run |
| `--model_patch_paths` | `None` | Comma-separated patch files to pre-apply |
### Generate the report

```shell
python -m domains.polyglot.report \
    --output_dir ./outputs/initial_polyglot_0 \
    --model_name_or_path eval_run
```

The report is written to `./outputs/initial_polyglot_0/report.json`.
## Container Lifecycle
For each instance, the harness:
- Builds a Docker image for the repository/language combination (cached after the first build)
- Starts a container from that image
- Copies `task_agent.py`, `agent/`, `utils/`, and other required files into the container
- Applies any pre-existing model patches (from `--model_patch_paths`)
- Installs `requirements.txt` inside the container
- Runs `run_task_agent.py` with a 10-minute timeout
- Reads the resulting `model_patch.diff` from the container
- Resets the repository to `test_commit` and applies the patch
- Runs the language-specific test command with a 2-minute timeout
- Cleans up the container
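The steps above roughly correspond to a sequence of Docker CLI calls. A hedged sketch that composes (without running) such a sequence — the image and container names, paths, and exact flags are invented for illustration; the real harness drives Docker through its own utilities:

```python
def lifecycle_commands(image: str, container: str, test_cmd: str,
                       agent_timeout_s: int = 600) -> list:
    """Compose (without executing) illustrative docker CLI calls for one instance."""
    return [
        # Start a long-lived container from the prebuilt image
        ["docker", "run", "-d", "--name", container, image, "sleep", "infinity"],
        # Copy agent files in
        ["docker", "cp", "task_agent.py", f"{container}:/workspace/"],
        # Install Python dependencies inside the container
        ["docker", "exec", container, "pip", "install", "-r", "requirements.txt"],
        # Run the agent under the 10-minute timeout
        ["docker", "exec", container, "bash", "-c",
         f"timeout {agent_timeout_s} python run_task_agent.py"],
        # Read the produced patch back out
        ["docker", "cp", f"{container}:/workspace/model_patch.diff", "."],
        # Run the language-specific tests under the 2-minute timeout
        ["docker", "exec", container, "bash", "-c", f"timeout 120 {test_cmd}"],
        # Clean up
        ["docker", "rm", "-f", container],
    ]
```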
```python
# Test commands per language (domains/polyglot/constants.py)
TEST_COMMANDS = {
    "python": ["pytest -rA --tb=long"],
    "rust": ["cargo test -- --include-ignored"],
    "go": ["go test ./..."],
    "javascript": [  # sets up node_modules symlink, then runs tests
        "set -e",
        "[ ! -e node_modules ] && ln -s /npm-install/node_modules .",
        "[ ! -e package-lock.json ] && ln -s /npm-install/package-lock.json .",
        "sed -i 's/\\bxtest(/test(/g' *.spec.js",
        "npm run test",
    ],
    "cpp": [  # builds in a build/ subdirectory via cmake + make
        "set -e",
        "[ ! -d \"build\" ] && mkdir build",
        "cd build",
        "cmake -DEXERCISM_RUN_ALL_TESTS=1 -G \"Unix Makefiles\" ..",
        "make",
        "cd ../",
    ],
    "java": ["./gradlew test"],
}
```
The `report.json` produced by `domains/polyglot/report.py` contains:

```json
{
  "accuracy_score": 0.15,
  "total_resolved_instances": 9,
  "total_submitted_instances": 60,
  "total_instances": 60,
  "resolved_ids": [...],
  "unresolved_ids": [...],
  "total_emptypatch_ids": [...]
}
```
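These fields can be consumed like any other JSON document. A small sketch of reading a report back (field names are taken from the structure above; the `summarize_report` helper is illustrative):

```python
import json
from pathlib import Path


def summarize_report(report_path: Path) -> str:
    """Render a one-line summary from a polyglot report.json."""
    report = json.loads(report_path.read_text())
    return (f"{report['total_resolved_instances']}/"
            f"{report['total_submitted_instances']} resolved "
            f"(accuracy {report['accuracy_score']:.2%})")
```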
## Domain Properties
| Property | Value |
|---|---|
| Score key | accuracy_score |
| Splits | train only |
| Eval subset | full dataset |
| Ensemble supported | No |
| Staged eval samples | 10 / 60 (~17%) |
| Parallelism | Multiple Docker containers via --max_workers |
The `accuracy_score` is computed as `resolved_instances / submitted_instances`. If `expected_num_tasks` is explicitly provided to `get_all_performance()`, it is used as the denominator instead, so that runs which silently dropped instances do not report an artificially inflated score.
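The denominator logic fits in a few lines. A sketch under the assumptions above (the function name and signature are illustrative; the real computation lives in `get_all_performance()`):

```python
from typing import Optional


def compute_accuracy(resolved: int, submitted: int,
                     expected_num_tasks: Optional[int] = None) -> float:
    """resolved / submitted, unless an expected task count overrides the
    denominator (penalizing runs that silently dropped instances)."""
    denominator = expected_num_tasks if expected_num_tasks is not None else submitted
    return resolved / denominator if denominator else 0.0
```

With 9 resolved out of 60 submitted this gives 0.15; if only 30 instances were submitted but 60 were expected, the expected count keeps the score at 9/60 rather than 9/30.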
The environment variables `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, and `METAGEN_ACCESS_TOKEN` are forwarded into each Docker container automatically. Ensure these are set in the host environment before running the harness.