The polyglot domain is a software engineering benchmark modeled after SWE-bench. Agents are given a problem statement and must produce a code patch that makes failing tests pass. Unlike other HyperAgents domains, each instance runs inside its own Docker container — a fresh, isolated environment containing the target repository at a specific commit.

What It Evaluates

Polyglot tests the ability to fix bugs and implement features in real code repositories across six languages: Python, Rust, Go, JavaScript, C++, and Java. The primary metric is accuracy_score — the fraction of instances where the agent’s patch causes all tests to pass (resolved). Each instance is independent. The eval result for a single instance is one of:
| Result | Meaning |
| --- | --- |
| resolved | Patch applied; all tests pass |
| unresolved | Patch applied; tests still fail |
| empty_patch | Agent produced no patch |
| incomplete | Container setup failed before agent ran |
| error | Unexpected exception during processing |
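The taxonomy above can be sketched as a small decision function. The function name and parameters below are illustrative only, not part of the harness API:

```python
# Illustrative sketch of the per-instance result labels described above.
# classify_result and its parameters are hypothetical, not harness code.
def classify_result(setup_ok, patch, tests_pass, error=None):
    if error is not None:
        return "error"          # unexpected exception during processing
    if not setup_ok:
        return "incomplete"     # container setup failed before the agent ran
    if not patch:
        return "empty_patch"    # agent produced no patch
    return "resolved" if tests_pass else "unresolved"

print(classify_result(True, "diff --git ...", True))  # → resolved
```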

How It Differs from Other Domains

  • Separate harness: Polyglot uses domains/polyglot/harness.py instead of domains/harness.py. The main harness does not handle the polyglot domain.
  • Docker containers: Each instance builds and starts a dedicated container with the repository pre-installed at the correct commit.
  • No CSV dataset: The dataset is a JSON file (polyglot_benchmark_metadata.json) prepared via prepare_polyglot_dataset.py.
  • Per-language test commands: Test execution is language-specific (e.g., pytest for Python, cargo test for Rust, go test for Go, npm run test for JavaScript, cmake + make for C++, ./gradlew test for Java).
  • 10-minute agent timeout: Each agent invocation is wrapped in timeout 600 inside the container.
  • No ensemble support: can_domain_ensembled("polyglot") returns False.
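The 10-minute cap uses the standard GNU `timeout` utility inside the container. A quick way to see its exit behavior (with a one-second limit instead of 600 so the example finishes quickly):

```python
# Demonstrates the `timeout <seconds> <command>` wrapping pattern.
# GNU timeout exits with status 124 when it has to kill the command.
import subprocess

proc = subprocess.run(["timeout", "1", "sleep", "5"])
print(proc.returncode)  # → 124
```

A nonzero exit from `timeout` is how the harness can distinguish an agent that ran out of time from one that finished normally.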

Dataset Subsets

Two predefined subsets are available in domains/polyglot/subsets/:
| Subset | File | Description |
| --- | --- | --- |
| small | subsets/small.json | Small list of instance IDs for quick testing |
| medium | subsets/medium.json | Larger representative set |

Setup

Step 1: Clone SWE-bench

Polyglot depends on SWE-bench for Docker image building utilities:
```shell
cd domains/polyglot
git clone https://github.com/princeton-nlp/SWE-bench.git
cd SWE-bench
git checkout dc4c087c2b9e4cefebf2e3d201d27e36
pip install -e .
cd ../../../
```
Polyglot is pinned to a specific SWE-bench commit (dc4c087). Using a different version may break Docker image building.
Step 2: Prepare the dataset

```shell
python -m domains.polyglot.prepare_polyglot_dataset
```
This generates domains/polyglot/polyglot_benchmark_metadata.json.
Step 3: Run evaluation

```shell
python -m domains.polyglot.harness \
  --subset small \
  --output_dir ./outputs/initial_polyglot_0 \
  --model_name_or_path eval_run
```
Key arguments:
| Argument | Default | Description |
| --- | --- | --- |
| --subset | small | Dataset subset: small, medium, or full |
| --num_samples | -1 (all) | Limit number of instances |
| --max_workers | 5 | Parallel Docker containers |
| --model_name_or_path | timestamp | Label for this run |
| --model_patch_paths | None | Comma-separated patch files to pre-apply |
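The arguments above can be mirrored with a minimal argparse sketch. The defaults are taken from the table; the real harness's parser may declare them differently:

```python
# Sketch of an argument parser matching the documented flags; not the
# actual parser from domains/polyglot/harness.py.
import argparse

parser = argparse.ArgumentParser(description="polyglot harness (sketch)")
parser.add_argument("--subset", default="small", choices=["small", "medium", "full"])
parser.add_argument("--num_samples", type=int, default=-1)  # -1 means all instances
parser.add_argument("--max_workers", type=int, default=5)   # parallel containers
parser.add_argument("--output_dir", required=True)
parser.add_argument("--model_name_or_path", default=None)   # falls back to a timestamp
parser.add_argument("--model_patch_paths", default=None)    # comma-separated paths

args = parser.parse_args(["--output_dir", "./outputs/demo", "--subset", "medium"])
print(args.subset, args.max_workers)  # → medium 5
```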
Step 4: Generate the report

```shell
python -m domains.polyglot.report \
  --output_dir ./outputs/initial_polyglot_0 \
  --model_name_or_path eval_run
```
The report is written to ./outputs/initial_polyglot_0/report.json.

Container Lifecycle

For each instance, the harness:
  1. Builds a Docker image for the repository/language combination (cached after first build)
  2. Starts a container from that image
  3. Copies task_agent.py, agent/, utils/, and other required files into the container
  4. Applies any pre-existing model patches (from --model_patch_paths)
  5. Installs requirements.txt inside the container
  6. Runs run_task_agent.py with a 10-minute timeout
  7. Reads the resulting model_patch.diff from the container
  8. Resets the repository to test_commit and applies the patch
  9. Runs the language-specific test command with a 2-minute timeout
  10. Cleans up the container
```python
# Test commands per language (domains/polyglot/constants.py)
TEST_COMMANDS = {
    "python":     ["pytest -rA --tb=long"],
    "rust":       ["cargo test -- --include-ignored"],
    "go":         ["go test ./..."],
    "javascript": [  # sets up node_modules symlink, then runs tests
        "set -e",
        "[ ! -e node_modules ] && ln -s /npm-install/node_modules .",
        "[ ! -e package-lock.json ] && ln -s /npm-install/package-lock.json .",
        "sed -i 's/\\bxtest(/test(/g' *.spec.js",
        "npm run test",
    ],
    "cpp": [  # builds in a build/ subdirectory via cmake + make
        "set -e",
        "[ ! -d \"build\" ] && mkdir build",
        "cd build",
        "cmake -DEXERCISM_RUN_ALL_TESTS=1 -G \"Unix Makefiles\" ..",
        "make",
        "cd ../",
    ],
    "java":       ["./gradlew test"],
}
```
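Each entry is a list of shell commands executed as a single script inside the container. A minimal sketch of that pattern follows; the exact join and execution mechanics in harness.py are assumptions:

```python
# Runs a list of shell commands as one bash script, mirroring how a
# TEST_COMMANDS entry might be executed. The 120-second cap matches the
# documented 2-minute test timeout.
import subprocess

commands = ["set -e", "echo build ok", "echo tests ok"]  # stand-in commands
script = "\n".join(commands)
proc = subprocess.run(
    ["bash", "-c", script],
    capture_output=True, text=True, timeout=120,
)
print(proc.stdout.strip().splitlines()[-1])  # → tests ok
```

Because the script starts with `set -e`, the first failing command aborts the run, so a build failure is reported rather than masked by later steps.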

Report Format

The report.json produced by domains/polyglot/report.py contains:
```json
{
  "accuracy_score": 0.15,
  "total_resolved_instances": 9,
  "total_submitted_instances": 60,
  "total_instances": 60,
  "resolved_ids": [...],
  "unresolved_ids": [...],
  "total_emptypatch_ids": [...]
}
```

Domain Properties

| Property | Value |
| --- | --- |
| Score key | accuracy_score |
| Splits | train only |
| Eval subset | full dataset |
| Ensemble supported | No |
| Staged eval samples | 10 / 60 (~17%) |
| Parallelism | Multiple Docker containers via --max_workers |
The accuracy_score is computed as resolved_instances / submitted_instances. If expected_num_tasks is explicitly provided to get_all_performance(), it is used as the denominator instead, so a run that only completed part of the dataset does not report an artificially inflated score.
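The computation described above can be sketched as follows; this standalone function mirrors the documented behavior, not the actual get_all_performance() implementation:

```python
# Sketch of the documented accuracy computation, including the optional
# expected_num_tasks denominator override.
def accuracy_score(resolved, submitted, expected_num_tasks=None):
    # When expected_num_tasks is given, use it as the denominator so an
    # incomplete run cannot inflate the score.
    denominator = expected_num_tasks if expected_num_tasks is not None else submitted
    return resolved / denominator if denominator else 0.0

print(accuracy_score(9, 60))                         # → 0.15 (matches the sample report)
print(accuracy_score(9, 30, expected_num_tasks=60))  # → 0.15
```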
Environment variables ANTHROPIC_API_KEY, OPENAI_API_KEY, and METAGEN_ACCESS_TOKEN are forwarded into each Docker container automatically. Ensure these are set in the host environment before running the harness.
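One common way to forward such variables is via `docker run -e` flags. The sketch below is illustrative only; harness.py's actual forwarding mechanism may differ:

```python
# Builds `docker run -e` flags for the forwarded variables. A bare `-e NAME`
# tells Docker to copy the value from the host environment, which keeps the
# secret itself off the command line.
FORWARDED_VARS = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY", "METAGEN_ACCESS_TOKEN"]

def docker_env_flags(host_env):
    flags = []
    for name in FORWARDED_VARS:
        if name in host_env:
            flags += ["-e", name]
    return flags

print(docker_env_flags({"ANTHROPIC_API_KEY": "sk-demo"}))  # → ['-e', 'ANTHROPIC_API_KEY']
```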
