The BALROG domain group evaluates an agent’s ability to play text-based and grid-world games. It wraps four distinct environments — babyai, babaisai, minihack, and nle — each selectable as a separate domain variant.

What It Evaluates

BALROG tests sequential decision-making in partially observable environments. The agent receives a textual description of the game state each step and must produce a valid action. The primary metric is average_progress — the mean episode completion fraction across all tasks in the environment, expressed as a percentage. Each environment runs multiple episodes per task. Results are aggregated into a report.json with per-task and per-environment breakdowns.
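The aggregation described above can be sketched in a few lines. This is an illustrative sketch, not the harness's actual code; the real aggregation may weight tasks differently:

```python
from statistics import mean, stdev

def average_progress(episode_fractions: list[float]) -> tuple[float, float]:
    """Aggregate per-episode completion fractions (0.0-1.0) into a mean
    percentage and its standard error. Sketch only."""
    pcts = [100.0 * f for f in episode_fractions]
    se = stdev(pcts) / len(pcts) ** 0.5 if len(pcts) > 1 else 0.0
    return mean(pcts), se

# Three episodes at 0%, 25%, and 50% completion -> average_progress of 25.0
overall, se = average_progress([0.0, 0.25, 0.5])
```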

The Four Environments

BabyAI is a grid-world environment with language-conditioned navigation and manipulation tasks. The default task set includes 5 tasks from BabyAI-MixedTrainLocal-v0:
  • goto — navigate to a target object
  • pickup — pick up a specified object
  • open — open a door
  • putnext — place an object next to another
  • pick_up_seq_go_to — pick up an object then navigate
Default episodes per task: 10. Staged eval fraction: 1/10.
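Those defaults imply the following episode budgets. A sketch, assuming the staged fraction applies per task and is rounded up to at least one episode (the harness's exact rounding is an assumption):

```python
import math

tasks = ["goto", "pickup", "open", "putnext", "pick_up_seq_go_to"]
episodes_per_task = 10
staged_fraction = 1 / 10

full_run = len(tasks) * episodes_per_task  # 50 episodes for a full BabyAI eval
staged_per_task = max(1, math.ceil(episodes_per_task * staged_fraction))
staged_run = len(tasks) * staged_per_task  # 5 episodes in a staged eval
```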

Hydra Configuration

BALROG uses Hydra for configuration. The config file is at domains/balrog/config/config.yaml. Key configuration sections:
eval:
  output_dir: "results"
  num_workers: 16
  num_episodes:
    nle: 5
    minihack: 5
    babyai: 10
    babaisai: 3
  save_trajectories: True
  feedback_on_invalid_action: True

envs:
  names: babyai  # environment to evaluate

tasks:
  babyai_tasks:
    - "BabyAI-MixedTrainLocal-v0/goto"
    # ...
The harness composes this config and overrides it via command-line arguments:
from hydra import compose  # called inside a Hydra initialize() context

cfg = compose(
    config_name="config",
    overrides=[
        f"eval.output_dir={args.output_dir}",
        f"eval.num_workers={args.num_workers}",
        f"envs.names={env_name}",
    ],
)
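Conceptually, each override is a dotted path assigned a value. A pure-Python sketch of that semantics (not Hydra's actual implementation, which also handles typing, interpolation, and validation):

```python
def apply_overrides(cfg: dict, overrides: list[str]) -> dict:
    """Apply 'a.b.c=value' style overrides to a nested dict.
    Sketch only: values stay strings, unlike Hydra's typed merge."""
    for ov in overrides:
        path, _, value = ov.partition("=")
        keys = path.split(".")
        node = cfg
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = value
    return cfg

cfg = {"eval": {"num_workers": 16}, "envs": {"names": "babyai"}}
apply_overrides(cfg, ["eval.num_workers=4", "envs.names=nle"])
```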

Setup

1. Run the post-install script

BALROG requires additional game data (Boxoban levels and TextWorld games) that must be downloaded after installing the Python packages:
python -m domains.balrog.scripts.post_install
This downloads:
  • Boxoban levels from the DeepMind repository into the MiniHack data directory
  • TextWorld game files (tw-games.zip) into ./domains/balrog/
This script must be run before evaluating any BALROG environment. Missing game data will cause environment initialization to fail silently or crash.
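A quick preflight check can surface missing assets before a run. This sketch only checks the tw-games.zip location documented above; the Boxoban destination depends on your MiniHack install, so it is omitted:

```python
from pathlib import Path

def missing_game_data(root: str = ".") -> list[str]:
    """Return known BALROG data files that are absent.
    Sketch: checks only the TextWorld archive documented above."""
    required = [Path(root) / "domains" / "balrog" / "tw-games.zip"]
    return [str(p) for p in required if not p.exists()]

if missing_game_data():
    print("Run: python -m domains.balrog.scripts.post_install")
```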
2. Run evaluation

Select the sub-environment with --domain:
# BabyAI
python -m domains.harness \
  --domain balrog_babyai \
  --run_id initial_balrog_babyai_0 \
  --num_samples 1

# BabaIsAI
python -m domains.harness \
  --domain balrog_babaisai \
  --run_id initial_balrog_babaisai_0

# MiniHack
python -m domains.harness \
  --domain balrog_minihack \
  --run_id initial_balrog_minihack_0

# NLE
python -m domains.harness \
  --domain balrog_nle \
  --run_id initial_balrog_nle_0
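To sweep all four variants, the same invocation can be scripted. A sketch that builds the argument lists following the run-ID naming shown above (pass each list to subprocess.run(..., check=True) to actually launch):

```python
def harness_cmd(env: str, sample: int = 0) -> list[str]:
    """Build the domains.harness argv for one BALROG variant (sketch)."""
    return ["python", "-m", "domains.harness",
            "--domain", f"balrog_{env}",
            "--run_id", f"initial_balrog_{env}_{sample}"]

commands = [harness_cmd(env) for env in ["babyai", "babaisai", "minihack", "nle"]]
```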
3. Generate the report

python -m domains.report --domain balrog_babyai \
  --dname ./outputs/initial_balrog_babyai_0
The report prints a summary table of average_progress per environment and writes report.json to the output directory.

Output Structure

Each episode produces a JSON file and optionally a text trajectory. Results are organized under <output_dir>/<env_name>/<task_name>/. The report.json summary contains:
{
  "average_progress": 12.5,
  "standard_error": 2.1,
  "environments": {
    "babyai": {
      "progression_percentage": 12.5,
      "standard_error": 2.1,
      "episodes_played": 50
    }
  }
}
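Downstream tooling can read the summary with the standard library alone. A sketch parsing the structure above (field names as shown; in practice you would json.load the report.json file from the output directory):

```python
import json

raw = """
{
  "average_progress": 12.5,
  "standard_error": 2.1,
  "environments": {
    "babyai": {"progression_percentage": 12.5,
               "standard_error": 2.1,
               "episodes_played": 50}
  }
}
"""
report = json.loads(raw)
for name, env in report["environments"].items():
    print(f"{name}: {env['progression_percentage']}% "
          f"(SE {env['standard_error']}) over {env['episodes_played']} episodes")
```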

Domain Properties

Property              Value
Score key             average_progress
Splits                train only
Eval subset           full dataset
Ensemble supported    No
Staged eval samples   1 (all variants)
BALROG does not support --resume_from. Each run starts fresh. Use --num_samples to limit the number of episodes during development.
