
HarborEnv

A specialized environment for running Harbor-format benchmark tasks with automatic task loading, sandbox management, and test execution.
HarborEnv is experimental and subject to breaking changes. The API may change in future releases.

Overview

HarborEnv extends CliAgentEnv to provide first-class support for Harbor-format evaluation tasks. It automatically:
  • Loads task specifications from task.toml and instruction.md
  • Manages Docker-based sandboxes per task
  • Uploads task assets and test suites
  • Executes verification tests and computes rewards

Inheritance

Environment
└── MultiTurnEnv
    └── CliAgentEnv
        └── HarborEnv

Constructor

HarborEnv(
    run_command: str,
    dataset_path: str | Path,
    tasks: list[str] | None = None,
    agent_workdir: str = "/app",
    docker_image: str = "python:3.11-slim",
    **kwargs
)
run_command
str, required
Command to execute the agent inside the sandbox (e.g., "python agent.py").

dataset_path
str | Path, required
Path to the directory containing Harbor task folders. Each task folder must contain task.toml and instruction.md.

tasks
list[str] | None, default: None
Specific task names to load. If None, loads all tasks found in dataset_path.

agent_workdir
str, default: "/app"
Working directory for the agent inside the sandbox. Exposed to the agent via the AGENT_WORKDIR environment variable.

docker_image
str, default: "python:3.11-slim"
Default Docker image for sandboxes. Can be overridden per task via task.toml.

**kwargs
Additional arguments passed to CliAgentEnv (timeout, resources, etc.). See CliAgentEnv for details.

Key Methods

load_harbor_dataset

def load_harbor_dataset(self) -> Dataset
Loads Harbor tasks from the dataset directory into a Hugging Face Dataset.
Returns: a Dataset with columns:
  • example_id: Sequential task ID
  • task: Task name (directory name)
  • prompt: Formatted instruction as messages
  • info: Task metadata including task_dir, docker_image, and config
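
The column schema above can be illustrated with a hypothetical row (the values here are illustrative placeholders, not the output of a real load_harbor_dataset call):

```python
# Hypothetical dataset row matching the documented column schema.
row = {
    "example_id": 0,                      # sequential task ID
    "task": "task_name_1",                # task directory name
    "prompt": [                           # instruction formatted as chat messages
        {"role": "user", "content": "Contents of instruction.md"},
    ],
    "info": {                             # task metadata
        "task_dir": "./harbor_tasks/task_name_1",
        "docker_image": "python:3.11-slim",
        "config": {},                     # parsed task.toml
    },
}

print(sorted(row))
```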

get_docker_image

async def get_docker_image(self, state: vf.State) -> str
Resolves the Docker image for a task from task.toml or falls back to the default.
state
vf.State
Rollout state containing task info.
Returns: Docker image string

build_env_vars

async def build_env_vars(self, state: vf.State) -> dict[str, str]
Builds environment variables with Harbor-specific additions:
  • HARBOR_TASK_NAME: Current task name
  • HARBOR_TASK_DIR: Path to task assets (/task)
  • HARBOR_INSTRUCTION_PATH: Path to instruction file
  • AGENT_WORKDIR: Agent working directory

compute_reward

async def compute_reward(self, state: vf.State) -> float
Executes Harbor test suite (tests/test.sh) and extracts reward from:
  1. /logs/verifier/reward.txt (preferred)
  2. /logs/verifier/reward.json (fallback)
Returns: Reward value between 0.0 and 1.0

Harbor Task Structure

Each task directory must follow this structure:
dataset_path/
├── task_name_1/
│   ├── task.toml          # Task configuration
│   ├── instruction.md     # Task description for agent
│   ├── solution/          # Reference implementation (uploaded after agent runs)
│   └── tests/
│       └── test.sh        # Verification script
└── task_name_2/
    └── ...
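
A minimal sketch that scaffolds a task directory in this layout (the file contents and names like scaffold_task are illustrative placeholders):

```python
from pathlib import Path

def scaffold_task(dataset_path: str, task_name: str) -> Path:
    """Create the minimal Harbor task layout: task.toml, instruction.md, tests/test.sh."""
    task_dir = Path(dataset_path) / task_name
    (task_dir / "tests").mkdir(parents=True, exist_ok=True)
    (task_dir / "solution").mkdir(exist_ok=True)
    (task_dir / "task.toml").write_text('[environment]\ndocker_image = "python:3.11-slim"\n')
    (task_dir / "instruction.md").write_text("# Task\nDescribe the task here.\n")
    (task_dir / "tests" / "test.sh").write_text(
        "#!/bin/sh\necho 1.0 > /logs/verifier/reward.txt\n"
    )
    return task_dir

task_dir = scaffold_task("harbor_tasks_demo", "task_name_1")
print(task_dir)
```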

task.toml Format

[environment]
docker_image = "python:3.11-slim"  # Optional: override default image

# Additional task metadata...

Test Script Requirements

The tests/test.sh script must:
  1. Execute verification logic
  2. Write reward to /logs/verifier/reward.txt (single float) or /logs/verifier/reward.json ({"reward": 0.85})
  3. Exit with status 0 (errors are logged but don’t fail scoring)
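
A minimal test.sh following this contract might look like the sketch below. The verification step is a placeholder, and LOG_DIR stands in for /logs/verifier so the sketch can run outside a sandbox:

```shell
#!/usr/bin/env bash
# Hypothetical test.sh sketch. Inside the Harbor sandbox LOG_DIR is /logs/verifier.
LOG_DIR="${LOG_DIR:-logs/verifier}"
mkdir -p "$LOG_DIR"

# Placeholder for real verification logic; replace `true` with your checks.
if true; then
  echo "1.0" > "$LOG_DIR/reward.txt"
else
  echo "0.0" > "$LOG_DIR/reward.txt"
fi
# The script should finish with status 0; the reward file carries the score.
```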

Example Usage

import verifiers as vf
from pathlib import Path

def load_environment():
    return vf.HarborEnv(
        run_command="python /app/agent.py",
        dataset_path=Path("./harbor_tasks"),
        tasks=["task_1", "task_2"],  # Optional: filter specific tasks
        agent_workdir="/app",
        docker_image="python:3.11",
        max_turns=20,
        timeout_seconds=300,
    )

# Run evaluation (evaluate is async, so it must run inside an event loop)
import asyncio

async def main():
    env = load_environment()
    results = await env.evaluate(
        client=vf.ClientConfig(api_key="..."),
        model="gpt-4",
        num_examples=10,
    )
    print(f"Average reward: {results['metadata']['avg_reward']}")

asyncio.run(main())

Asset Upload Strategy

HarborEnv implements a two-phase upload strategy to prevent test contamination:
  1. Pre-agent: Uploads only instruction.md and task.toml
  2. Post-agent: Uploads solution/ and tests/ directories before running verification
This ensures agents cannot access oracle solutions or test implementations during task execution.

Environment Variables Available to Agent

OPENAI_BASE_URL=<interception_url>  # For API interception
HARBOR_TASK_NAME=<task_name>
HARBOR_TASK_DIR=/task
HARBOR_INSTRUCTION_PATH=/task/instruction.md
AGENT_WORKDIR=/app
OPENAI_MODEL=<model_name>  # If model is set in state
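
Inside the agent these can be read with os.environ; a sketch, where the fallback values mirror the documented defaults so the snippet also runs outside a sandbox:

```python
import os

# Read Harbor's environment variables inside the sandboxed agent.
task_name = os.environ.get("HARBOR_TASK_NAME", "local-task")
task_dir = os.environ.get("HARBOR_TASK_DIR", "/task")
instruction_path = os.environ.get("HARBOR_INSTRUCTION_PATH", f"{task_dir}/instruction.md")
workdir = os.environ.get("AGENT_WORKDIR", "/app")
base_url = os.environ.get("OPENAI_BASE_URL")   # interception endpoint, if set
model = os.environ.get("OPENAI_MODEL")         # set when the state has a model

print(instruction_path)
```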

Custom Agent Setup

import verifiers as vf
from pathlib import Path

class CustomHarborEnv(vf.HarborEnv):
    async def post_sandbox_setup(self, state: vf.State) -> None:
        """Upload custom agent code after sandbox creation."""
        await super().post_sandbox_setup(state)  # Upload Harbor assets
        
        sandbox_id = state["sandbox_id"]
        
        # Upload agent code
        await self.sandbox_client.upload_file(
            sandbox_id,
            "/app/agent.py",
            "./my_agent.py"
        )
        
        # Install dependencies
        await self.sandbox_client.execute_command(
            sandbox_id,
            "pip install -r /app/requirements.txt",
            working_dir="/app"
        )

def load_environment():
    return CustomHarborEnv(
        run_command="python /app/agent.py",
        dataset_path=Path("./harbor_tasks"),
    )

Error Handling

Reward computation fails gracefully:
  • Test execution errors are logged but return 0.0 reward
  • Missing reward files return 0.0
  • Invalid JSON/float formats return 0.0
  • Infrastructure errors set state["error"] and skip scoring
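
The fallback order can be sketched as follows; this is a re-implementation of the documented behavior for illustration, not HarborEnv's actual code, and read_reward is a hypothetical name:

```python
import json
from pathlib import Path

def read_reward(verifier_dir: str) -> float:
    """Graceful extraction: reward.txt first, reward.json second, 0.0 on any failure."""
    base = Path(verifier_dir)
    txt = base / "reward.txt"
    if txt.exists():
        try:
            return min(1.0, max(0.0, float(txt.read_text().strip())))
        except ValueError:
            return 0.0
    js = base / "reward.json"
    if js.exists():
        try:
            return min(1.0, max(0.0, float(json.loads(js.read_text())["reward"])))
        except (ValueError, KeyError, TypeError):
            return 0.0
    return 0.0

Path("demo_verifier").mkdir(exist_ok=True)
(Path("demo_verifier") / "reward.txt").write_text("0.85")
print(read_reward("demo_verifier"))  # 0.85
print(read_reward("missing_dir"))    # 0.0
```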

State Keys

HarborEnv adds the following state keys:
harbor_config
dict
Parsed task.toml configuration.

harbor_task_dir
str
Local path to the task directory.

reward
float
Computed reward from test execution.

See Also

  • CliAgentEnv - Parent class for custom agent environments
  • SandboxEnv - Base sandbox management
  • Harbor benchmark repository for task format details
