
HarborEnv

A specialized environment for running Harbor-format benchmark tasks with automatic task loading, sandbox management, and test execution.
HarborEnv is experimental and subject to breaking changes. The API may change in future releases.

Overview

HarborEnv extends CliAgentEnv to provide first-class support for Harbor-format evaluation tasks. It automatically:
  • Loads task specifications from task.toml and instruction.md
  • Manages Docker-based sandboxes per task
  • Uploads task assets and test suites
  • Executes verification tests and computes rewards

Inheritance

Environment
└── MultiTurnEnv
    └── CliAgentEnv
        └── HarborEnv

Constructor

HarborEnv(
    run_command: str,
    dataset_path: str | Path,
    tasks: list[str] | None = None,
    agent_workdir: str = "/app",
    docker_image: str = "python:3.11-slim",
    **kwargs
)
run_command
str, required
Command to execute the agent inside the sandbox (e.g., "python agent.py").

dataset_path
str | Path, required
Path to the directory containing Harbor task folders. Each task folder must contain task.toml and instruction.md.

tasks
list[str] | None, default: None
Specific task names to load. If None, loads all tasks found in dataset_path.

agent_workdir
str, default: "/app"
Working directory for the agent inside the sandbox. Exposed to the agent via the AGENT_WORKDIR environment variable.

docker_image
str, default: "python:3.11-slim"
Default Docker image for sandboxes. Can be overridden per task via task.toml.

**kwargs
Additional arguments passed to CliAgentEnv (timeout, resources, etc.). See CliAgentEnv for details.

Key Methods

load_harbor_dataset

def load_harbor_dataset(self) -> Dataset
Loads Harbor tasks from the dataset directory into a Hugging Face Dataset.
Returns: a Dataset with columns:
  • example_id: Sequential task ID
  • task: Task name (directory name)
  • prompt: Formatted instruction as messages
  • info: Task metadata including task_dir, docker_image, and config
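
The column schema above can be illustrated with a hypothetical row (the values here are illustrative placeholders, not the output of a real load_harbor_dataset call):

```python
# Hypothetical dataset row matching the documented column schema.
row = {
    "example_id": 0,                      # sequential task ID
    "task": "task_name_1",                # task directory name
    "prompt": [                           # instruction formatted as chat messages
        {"role": "user", "content": "Contents of instruction.md"},
    ],
    "info": {                             # task metadata
        "task_dir": "./harbor_tasks/task_name_1",
        "docker_image": "python:3.11-slim",
        "config": {},                     # parsed task.toml
    },
}

print(sorted(row))
```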

get_docker_image

async def get_docker_image(self, state: vf.State) -> str
Resolves the Docker image for a task from task.toml or falls back to the default.
state
vf.State
Rollout state containing task info.
Returns: Docker image string

build_env_vars

async def build_env_vars(self, state: vf.State) -> dict[str, str]
Builds environment variables with Harbor-specific additions:
  • HARBOR_TASK_NAME: Current task name
  • HARBOR_TASK_DIR: Path to task assets (/task)
  • HARBOR_INSTRUCTION_PATH: Path to instruction file
  • AGENT_WORKDIR: Agent working directory

compute_reward

async def compute_reward(self, state: vf.State) -> float
Executes Harbor test suite (tests/test.sh) and extracts reward from:
  1. /logs/verifier/reward.txt (preferred)
  2. /logs/verifier/reward.json (fallback)
Returns: Reward value between 0.0 and 1.0

Harbor Task Structure

Each task directory must follow this structure:
dataset_path/
├── task_name_1/
│   ├── task.toml          # Task configuration
│   ├── instruction.md     # Task description for agent
│   ├── solution/          # Reference implementation (uploaded after agent runs)
│   └── tests/
│       └── test.sh        # Verification script
└── task_name_2/
    └── ...
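
A minimal sketch that scaffolds a task directory in this layout (the file contents and names like scaffold_task are illustrative placeholders):

```python
from pathlib import Path

def scaffold_task(dataset_path: str, task_name: str) -> Path:
    """Create the minimal Harbor task layout: task.toml, instruction.md, tests/test.sh."""
    task_dir = Path(dataset_path) / task_name
    (task_dir / "tests").mkdir(parents=True, exist_ok=True)
    (task_dir / "solution").mkdir(exist_ok=True)
    (task_dir / "task.toml").write_text('[environment]\ndocker_image = "python:3.11-slim"\n')
    (task_dir / "instruction.md").write_text("# Task\nDescribe the task here.\n")
    (task_dir / "tests" / "test.sh").write_text(
        "#!/bin/sh\necho 1.0 > /logs/verifier/reward.txt\n"
    )
    return task_dir

task_dir = scaffold_task("harbor_tasks_demo", "task_name_1")
print(task_dir)
```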

task.toml Format

[environment]
docker_image = "python:3.11-slim"  # Optional: override default image

# Additional task metadata...

Test Script Requirements

The tests/test.sh script must:
  1. Execute verification logic
  2. Write reward to /logs/verifier/reward.txt (single float) or /logs/verifier/reward.json ({"reward": 0.85})
  3. Exit with status 0 (errors are logged but don’t fail scoring)
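
A minimal test.sh following this contract might look like the sketch below. The verification step is a placeholder, and LOG_DIR stands in for /logs/verifier so the sketch can run outside a sandbox:

```shell
#!/usr/bin/env bash
# Hypothetical test.sh sketch. Inside the Harbor sandbox LOG_DIR is /logs/verifier.
LOG_DIR="${LOG_DIR:-logs/verifier}"
mkdir -p "$LOG_DIR"

# Placeholder for real verification logic; replace `true` with your checks.
if true; then
  echo "1.0" > "$LOG_DIR/reward.txt"
else
  echo "0.0" > "$LOG_DIR/reward.txt"
fi
# The script should finish with status 0; the reward file carries the score.
```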

Example Usage

import verifiers as vf
from pathlib import Path

def load_environment():
    return vf.HarborEnv(
        run_command="python /app/agent.py",
        dataset_path=Path("./harbor_tasks"),
        tasks=["task_1", "task_2"],  # Optional: filter specific tasks
        agent_workdir="/app",
        docker_image="python:3.11",
        max_turns=20,
        timeout_seconds=300,
    )

# Run evaluation (evaluate is async, so it must run inside an event loop)
import asyncio

async def main():
    env = load_environment()
    results = await env.evaluate(
        client=vf.ClientConfig(api_key="..."),
        model="gpt-4",
        num_examples=10,
    )
    print(f"Average reward: {results['metadata']['avg_reward']}")

asyncio.run(main())

Asset Upload Strategy

HarborEnv implements a two-phase upload strategy to prevent test contamination:
  1. Pre-agent: Uploads only instruction.md and task.toml
  2. Post-agent: Uploads solution/ and tests/ directories before running verification
This ensures agents cannot access oracle solutions or test implementations during task execution.

Environment Variables Available to Agent

OPENAI_BASE_URL=<interception_url>  # For API interception
HARBOR_TASK_NAME=<task_name>
HARBOR_TASK_DIR=/task
HARBOR_INSTRUCTION_PATH=/task/instruction.md
AGENT_WORKDIR=/app
OPENAI_MODEL=<model_name>  # If model is set in state
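
Inside the agent these can be read with os.environ; a sketch, where the fallback values mirror the documented defaults so the snippet also runs outside a sandbox:

```python
import os

# Read Harbor's environment variables inside the sandboxed agent.
task_name = os.environ.get("HARBOR_TASK_NAME", "local-task")
task_dir = os.environ.get("HARBOR_TASK_DIR", "/task")
instruction_path = os.environ.get("HARBOR_INSTRUCTION_PATH", f"{task_dir}/instruction.md")
workdir = os.environ.get("AGENT_WORKDIR", "/app")
base_url = os.environ.get("OPENAI_BASE_URL")   # interception endpoint, if set
model = os.environ.get("OPENAI_MODEL")         # set when the state has a model

print(instruction_path)
```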

Custom Agent Setup

import verifiers as vf
from pathlib import Path

class CustomHarborEnv(vf.HarborEnv):
    async def post_sandbox_setup(self, state: vf.State) -> None:
        """Upload custom agent code after sandbox creation."""
        await super().post_sandbox_setup(state)  # Upload Harbor assets
        
        sandbox_id = state["sandbox_id"]
        
        # Upload agent code
        await self.sandbox_client.upload_file(
            sandbox_id,
            "/app/agent.py",
            "./my_agent.py"
        )
        
        # Install dependencies
        await self.sandbox_client.execute_command(
            sandbox_id,
            "pip install -r /app/requirements.txt",
            working_dir="/app"
        )

def load_environment():
    return CustomHarborEnv(
        run_command="python /app/agent.py",
        dataset_path=Path("./harbor_tasks"),
    )

Error Handling

Reward computation fails gracefully:
  • Test execution errors are logged but return 0.0 reward
  • Missing reward files return 0.0
  • Invalid JSON/float formats return 0.0
  • Infrastructure errors set state["error"] and skip scoring
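
The fallback order can be sketched as follows; this is a re-implementation of the documented behavior for illustration, not HarborEnv's actual code, and read_reward is a hypothetical name:

```python
import json
from pathlib import Path

def read_reward(verifier_dir: str) -> float:
    """Graceful extraction: reward.txt first, reward.json second, 0.0 on any failure."""
    base = Path(verifier_dir)
    txt = base / "reward.txt"
    if txt.exists():
        try:
            return min(1.0, max(0.0, float(txt.read_text().strip())))
        except ValueError:
            return 0.0
    js = base / "reward.json"
    if js.exists():
        try:
            return min(1.0, max(0.0, float(json.loads(js.read_text())["reward"])))
        except (ValueError, KeyError, TypeError):
            return 0.0
    return 0.0

Path("demo_verifier").mkdir(exist_ok=True)
(Path("demo_verifier") / "reward.txt").write_text("0.85")
print(read_reward("demo_verifier"))  # 0.85
print(read_reward("missing_dir"))    # 0.0
```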

State Keys

HarborEnv adds the following state keys:
harbor_config
dict
Parsed task.toml configuration.

harbor_task_dir
str
Local path to the task directory.

reward
float
Computed reward from test execution.

See Also

  • CliAgentEnv - Parent class for custom agent environments
  • SandboxEnv - Base sandbox management
  • Harbor benchmark repository for task format details
