# HarborEnv

A specialized environment for running Harbor-format benchmark tasks with automatic task loading, sandbox management, and test execution.

## Overview
HarborEnv extends CliAgentEnv to provide first-class support for Harbor-format evaluation tasks. It automatically:

- Loads task specifications from `task.toml` and `instruction.md`
- Manages Docker-based sandboxes per task
- Uploads task assets and test suites
- Executes verification tests and computes rewards
## Inheritance

HarborEnv → CliAgentEnv → SandboxEnv
## Constructor
- Command to execute the agent inside the sandbox (e.g., `"python agent.py"`).
- `dataset_path`: Path to directory containing Harbor task folders. Each task folder must contain `task.toml` and `instruction.md`.
- Specific task names to load. If `None`, loads all tasks found in `dataset_path`.
- Working directory for the agent inside the sandbox. Set via the `AGENT_WORKDIR` environment variable.
- Default Docker image for sandboxes. Can be overridden per-task via `task.toml`.
- `**kwargs`: Additional arguments passed to `CliAgentEnv` (timeout, resources, etc.). See CliAgentEnv for details.

## Key Methods
### load_harbor_dataset

Loads Harbor tasks from `dataset_path` and returns examples with:

- `example_id`: Sequential task ID
- `task`: Task name (directory name)
- `prompt`: Formatted instruction as messages
- `info`: Task metadata including `task_dir`, `docker_image`, and `config`
### get_docker_image

Resolves a task's Docker image from `task.toml`, falling back to the default. Takes the rollout state containing task info.
### build_env_vars

Builds the environment variables exposed to the agent:

- `HARBOR_TASK_NAME`: Current task name
- `HARBOR_TASK_DIR`: Path to task assets (`/task`)
- `HARBOR_INSTRUCTION_PATH`: Path to instruction file
- `AGENT_WORKDIR`: Agent working directory
### compute_reward

Runs the test script (`tests/test.sh`) and extracts the reward from:

- `/logs/verifier/reward.txt` (preferred)
- `/logs/verifier/reward.json` (fallback)
## Harbor Task Structure

Each task directory must follow this structure:

### task.toml Format
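Based on the files referenced throughout this page, a plausible layout and a minimal `task.toml` might look like the following. This is a sketch, not the official schema: the `docker_image` key name is an assumption inferred from the per-task image override described above.

```toml
# Hypothetical task layout (file names taken from this page):
#
#   my-task/
#   ├── task.toml          # task config (image override, metadata)
#   ├── instruction.md     # prompt shown to the agent
#   ├── solution/          # reference solution, uploaded post-agent
#   └── tests/
#       └── test.sh        # verification entrypoint
#
# Minimal task.toml sketch -- the key name below is an assumption:
docker_image = "python:3.11-slim"
```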
## Test Script Requirements

The `tests/test.sh` script must:

- Execute verification logic
- Write the reward to `/logs/verifier/reward.txt` (a single float) or `/logs/verifier/reward.json` (`{"reward": 0.85}`)
- Exit with status 0 (errors are logged but don't fail scoring)
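A verifier invoked from `tests/test.sh` could emit the reward in either accepted format. A minimal sketch (the helper name and score value are illustrative, not part of the Harbor format):

```python
import json
from pathlib import Path

def write_reward(score: float, logs_dir: str = "/logs/verifier") -> None:
    """Write the reward in both formats HarborEnv accepts."""
    out = Path(logs_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Preferred: a bare float in reward.txt
    (out / "reward.txt").write_text(str(score))
    # Fallback: {"reward": <float>} in reward.json
    (out / "reward.json").write_text(json.dumps({"reward": score}))
```

`tests/test.sh` could then run its checks and finish with a call into a script like this, exiting 0 regardless of the score.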
## Example Usage
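A sketch of constructing the environment. Only `dataset_path` is confirmed by the constructor description above; the import path and the other keyword names (`agent_cmd`, `task_names`, `docker_image`, `timeout`) are hypothetical stand-ins for the parameters it describes.

```python
# Hypothetical import path and keyword names (only `dataset_path` is
# confirmed by this page); treat this as a sketch, not the real API.
from harbor_env import HarborEnv

env = HarborEnv(
    agent_cmd="python agent.py",      # command run inside the sandbox
    dataset_path="./harbor_tasks",    # folder of Harbor task directories
    task_names=None,                  # None => load every task found
    docker_image="python:3.11-slim",  # default image, overridable per task
    timeout=600,                      # forwarded to CliAgentEnv via **kwargs
)
```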
## Asset Upload Strategy

HarborEnv implements a two-phase upload strategy to prevent test contamination:

- Pre-agent: Uploads only `instruction.md` and `task.toml`
- Post-agent: Uploads the `solution/` and `tests/` directories before running verification
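The two-phase idea can be sketched as follows. The `sandbox.upload`/`sandbox.exec` methods and in-sandbox paths here are hypothetical, chosen to illustrate the ordering, not HarborEnv's real internals:

```python
# Hypothetical sandbox API; only the two-phase ordering is the point.
def run_task(sandbox, task_dir, agent):
    # Phase 1: upload only the files the agent is allowed to see.
    sandbox.upload(f"{task_dir}/instruction.md", "/task/instruction.md")
    sandbox.upload(f"{task_dir}/task.toml", "/task/task.toml")
    agent.run(sandbox)

    # Phase 2: tests and reference solution arrive only after the agent
    # has run, so the agent can never read or overfit to them.
    sandbox.upload(f"{task_dir}/solution/", "/task/solution/")
    sandbox.upload(f"{task_dir}/tests/", "/task/tests/")
    sandbox.exec("bash /task/tests/test.sh")
```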
## Environment Variables Available to Agent
## Custom Agent Setup
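An agent entrypoint typically starts by reading the variables listed above. A minimal sketch (the function name is illustrative):

```python
import os

def load_harbor_context() -> dict:
    """Read the Harbor environment variables set by build_env_vars."""
    # HARBOR_INSTRUCTION_PATH points at the task's instruction file
    with open(os.environ["HARBOR_INSTRUCTION_PATH"], encoding="utf-8") as f:
        instruction = f.read()
    return {
        "task_name": os.environ["HARBOR_TASK_NAME"],
        "task_dir": os.environ["HARBOR_TASK_DIR"],      # /task
        "workdir": os.environ.get("AGENT_WORKDIR", "."),
        "instruction": instruction,
    }
```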
## Error Handling

Reward computation fails gracefully:

- Test execution errors are logged but return a 0.0 reward
- Missing reward files return 0.0
- Invalid JSON/float formats return 0.0
- Infrastructure errors set `state["error"]` and skip scoring
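The fallback chain above can be sketched as a small parser. The function name is illustrative, not HarborEnv's real method:

```python
import json
from pathlib import Path

def read_reward(logs_dir: str = "/logs/verifier") -> float:
    """reward.txt preferred, reward.json fallback, 0.0 on any problem."""
    txt = Path(logs_dir) / "reward.txt"
    js = Path(logs_dir) / "reward.json"
    try:
        if txt.exists():
            return float(txt.read_text().strip())
        if js.exists():
            return float(json.loads(js.read_text())["reward"])
    except (ValueError, KeyError, json.JSONDecodeError):
        pass  # invalid formats fall through to 0.0
    return 0.0  # missing files also score 0.0
```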
## State Keys

HarborEnv adds the following state keys:

- Parsed `task.toml` configuration
- Local path to the task directory
- Computed reward from test execution
## See Also
- CliAgentEnv - Parent class for custom agent environments
- SandboxEnv - Base sandbox management
- Harbor benchmark repository for task format details