This example demonstrates browser automation environments using two different modes: DOM mode (natural language actions via Stagehand) and CUA mode (vision-based control with low-level primitives). Both integrate with Browserbase for cloud browser management.

Overview

Verifiers provides two browser automation approaches:
| Mode | Control Method | Tools | Best For |
| --- | --- | --- | --- |
| DOM | Natural language → Stagehand SDK | act, observe, extract, navigate | Semantic interactions, form filling |
| CUA | Vision + coordinates | click, type_text, scroll, screenshot | Precise control, visual tasks |
Both modes support:
  • Cloud browsers via Browserbase
  • Local browser automation
  • Sandbox deployment (CUA mode)
  • Screenshot capture and vision model integration

DOM Mode Example

DOM mode uses Stagehand’s AI-driven browser control for natural language interactions.

Complete Implementation

import verifiers as vf
from verifiers.envs.integrations.browser_env import BrowserEnv
from datasets import Dataset

DOM_SYSTEM_PROMPT = """You are a browser automation agent using Stagehand's AI-driven tools.

Available tools:
- navigate(url): Navigate to a URL
- observe(instruction): Find possible actions matching the instruction
- act(instruction): Execute an action described in natural language
- extract(instruction, schema_json): Extract structured data from the page

Use natural language to describe what you want to do. Stagehand will intelligently
find elements and execute actions without needing CSS selectors or coordinates.

Complete the given task efficiently."""

def create_example_dataset() -> Dataset:
    return Dataset.from_dict({
        "question": [
            "What does the headline say on the primeintellect.ai homepage?"
        ],
        "answer": ["The Open Superintelligence Stack"],
        "start_url": ["https://primeintellect.ai"],
        "task_id": ["dom-example-0"],
    })

def load_environment(
    project_id: str,
    max_turns: int = 10,
    judge_model: str = "gpt-4o-mini",
    system_prompt: str = DOM_SYSTEM_PROMPT,
    browserbase_api_key_var: str = "BROWSERBASE_API_KEY",
    stagehand_model: str = "openai/gpt-4o-mini",
    model_api_key_var: str = "MODEL_API_KEY",
    proxy_model_to_stagehand: bool = False,
    **kwargs,
) -> vf.Environment:
    import os

    # Check required env vars
    missing = []
    if not os.getenv(browserbase_api_key_var):
        missing.append(browserbase_api_key_var)
    if not os.getenv(model_api_key_var):
        missing.append(model_api_key_var)

    if missing:
        raise ValueError(
            f"Missing required environment variables: {', '.join(missing)}"
        )

    dataset = create_example_dataset()

    # Create judge rubric
    rubric = vf.JudgeRubric(
        judge_model=judge_model,
        judge_prompt=JUDGE_PROMPT,
    )
    rubric.add_reward_func(judge_answer, weight=1.0)

    return BrowserEnv(
        mode="dom",
        dataset=dataset,
        rubric=rubric,
        max_turns=max_turns,
        system_prompt=system_prompt,
        project_id=project_id,
        browserbase_api_key_var=browserbase_api_key_var,
        stagehand_model=stagehand_model,
        model_api_key_var=model_api_key_var,
        proxy_model_to_stagehand=proxy_model_to_stagehand,
        **kwargs,
    )
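
The implementation above references JUDGE_PROMPT and judge_answer without defining them. A minimal sketch of what they might look like (the prompt wording and the reward-function signature are assumptions; adapt them to your verifiers version):

```python
JUDGE_PROMPT = """Judge whether the agent answered the question correctly.

Question: {question}
Expected answer: {answer}
Agent response: {response}

Respond with "yes" if the response matches the expected answer, otherwise "no"."""

async def judge_answer(judge, prompt, completion, answer, state) -> float:
    """Reward 1.0 when the judge verdict contains 'yes', else 0.0 (assumed convention)."""
    verdict = await judge(prompt, completion, answer, state)
    return 1.0 if "yes" in verdict.lower() else 0.0
```

The CUA example below reuses the same two helpers.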

DOM Mode Tools

navigate(url) - Navigate to a URL
navigate("https://example.com")
act(instruction) - Perform action via natural language
act("click the login button")
act("type '[email protected]' into the email field")
act("select 'United States' from the country dropdown")
observe(instruction) - Find elements matching description
observe("find all product cards on the page")
# Returns: List of matching elements with metadata
extract(instruction, schema_json) - Extract structured data
schema = '{"title": "string", "price": "number"}'
extract("extract product information", schema)
# Returns: {"title": "Product Name", "price": 29.99}

DOM Mode Interaction Example

Task: What does the headline say on the primeintellect.ai homepage?
Assistant: I’ll navigate to the homepage and extract the headline.
Tool Call: navigate("https://primeintellect.ai")
Tool Result: Navigation successful

Assistant: Now I’ll extract the main headline.
Tool Call: act("read the main headline text")
Tool Result: Found headline: “The Open Superintelligence Stack”

Assistant: The headline says “The Open Superintelligence Stack”.
Judge: “yes”
Result: ✓ Correct (reward = 1.0)

CUA Mode Example

CUA (Computer Use Agent) mode provides low-level vision-based browser control.

Complete Implementation

from typing import Literal
import verifiers as vf
from verifiers.envs.integrations.browser_env import BrowserEnv
from datasets import Dataset

CUA_SYSTEM_PROMPT = """You are a browser automation agent. You can control a web browser using the provided tools.

Available tools:
- click(x, y, button): Click at coordinates
- double_click(x, y): Double-click at coordinates
- type_text(text): Type text into focused element
- keypress(keys): Press keyboard keys
- scroll(x, y, scroll_x, scroll_y): Scroll at position
- goto(url): Navigate to URL
- back(): Go back in history
- forward(): Go forward in history
- wait(time_ms): Wait for specified milliseconds
- screenshot(): Capture current page state

After each action, you will receive a screenshot showing the current page state.
Analyze the screenshot to determine your next action.

Complete the given task efficiently using the minimum number of actions necessary."""

def load_environment(
    max_turns: int = 15,
    judge_model: str = "gpt-4o-mini",
    system_prompt: str = CUA_SYSTEM_PROMPT,
    # CUA mode configuration
    use_sandbox: bool = True,
    server_url: str = "http://localhost:3000",
    # Browserbase configuration
    browserbase_api_key: str | None = None,
    browserbase_project_id: str | None = None,
    env: Literal["LOCAL", "BROWSERBASE"] = "BROWSERBASE",
    # Pre-built image (fastest startup)
    use_prebuilt_image: bool = True,
    prebuilt_image: str = "deepdream19/cua-server:latest",
    **kwargs,
) -> vf.Environment:
    dataset = create_example_dataset()  # same example dataset as in the DOM example

    # JUDGE_PROMPT and judge_answer are the same judge helpers as in the DOM example
    rubric = vf.JudgeRubric(
        judge_model=judge_model,
        judge_prompt=JUDGE_PROMPT,
    )
    rubric.add_reward_func(judge_answer, weight=1.0)

    return BrowserEnv(
        mode="cua",
        dataset=dataset,
        rubric=rubric,
        max_turns=max_turns,
        system_prompt=system_prompt,
        use_sandbox=use_sandbox,
        server_url=server_url,
        env=env,
        browserbase_api_key=browserbase_api_key,
        browserbase_project_id=browserbase_project_id,
        use_prebuilt_image=use_prebuilt_image,
        prebuilt_image=prebuilt_image,
        **kwargs,
    )

CUA Mode Tools

click(x, y, button="left") - Click at coordinates
click(150, 300)  # Click at (150, 300)
click(200, 400, button="right")  # Right-click
type_text(text) - Type text into focused element
type_text("[email protected]")
keypress(keys) - Press keyboard keys
keypress("Enter")
keypress("Ctrl+C")
keypress("Tab")
scroll(x, y, scroll_x, scroll_y) - Scroll at position
scroll(500, 500, 0, -100)  # Scroll up 100 pixels
scroll(500, 500, 0, 100)   # Scroll down
goto(url) - Navigate to URL
goto("https://example.com")
screenshot() - Capture current state
screenshot()  # Returns base64 encoded image

CUA Mode Interaction Example

Task: Click the search button
Assistant: Let me take a screenshot to see the page.
Tool Call: screenshot()
Tool Result: [Image showing page with search button at (300, 150)]

Assistant: I can see the search button at coordinates (300, 150). I’ll click it.
Tool Call: click(300, 150)
Tool Result: Click successful

Assistant: Now I’ll type the search query.
Tool Call: type_text("machine learning")
Tool Result: Text entered

Assistant: Pressing Enter to search.
Tool Call: keypress("Enter")

Mode Comparison

| Feature | DOM Mode | CUA Mode |
| --- | --- | --- |
| Control | Natural language | Coordinates + vision |
| Setup | Requires Browserbase project | Requires sandbox or local server |
| Startup | Fast (~2-5s) | Slower (~5-30s depending on mode) |
| Best for | Semantic tasks, form filling | Precise control, visual inspection |
| Tools | 4 high-level | 10 low-level |
| Vision | Not required | Required for coordinates |
| Selectors | Not needed (AI-driven) | Not needed (coordinate-based) |
| Reliability | Good for standard elements | Excellent for any clickable item |

Running Browser Environments

Installation

# Install browser environments
prime env install browser-dom-example
prime env install browser-cua-example

DOM Mode Execution

# Requires: BROWSERBASE_API_KEY, MODEL_API_KEY, Browserbase project ID
prime eval run browser-dom-example \
  -m openai/gpt-4o-mini \
  -b https://api.openai.com/v1 \
  -k OPENAI_API_KEY \
  -a '{"project_id": "YOUR_PROJECT_ID"}' \
  -n 10 \
  -r 3

CUA Mode Execution

# Default: Pre-built image (fastest)
prime eval run browser-cua-example \
  -m openai/gpt-4.1-mini \
  -b https://api.openai.com/v1 \
  -k OPENAI_API_KEY \
  -n 10 \
  -r 3

# Binary upload mode (custom server)
prime eval run browser-cua-example \
  -m openai/gpt-4.1-mini \
  -a '{"use_prebuilt_image": false}' \
  -n 10

# Manual mode (local development)
# Terminal 1: cd cua-server && ./start.sh
# Terminal 2:
prime eval run browser-cua-example \
  -m openai/gpt-4.1-mini \
  -a '{"use_sandbox": false, "server_url": "http://localhost:3000"}' \
  -n 10

Configuration Options

DOM Mode

| Parameter | Default | Description |
| --- | --- | --- |
| project_id | Required | Browserbase project ID |
| max_turns | 10 | Maximum interactions |
| judge_model | "gpt-4o-mini" | Judge model |
| browserbase_api_key_var | "BROWSERBASE_API_KEY" | API key env var |
| stagehand_model | "openai/gpt-4o-mini" | Model for Stagehand |
| model_api_key_var | "MODEL_API_KEY" | Stagehand model API key env var |
| proxy_model_to_stagehand | False | Route Stagehand through the eval model |

CUA Mode

| Parameter | Default | Description |
| --- | --- | --- |
| max_turns | 15 | Maximum interactions |
| use_sandbox | True | Auto-deploy to sandbox |
| use_prebuilt_image | True | Use pre-built Docker image (fastest) |
| prebuilt_image | "deepdream19/cua-server:latest" | Docker image |
| server_url | "http://localhost:3000" | Server URL (manual mode) |
| env | "BROWSERBASE" | Browser environment (LOCAL or BROWSERBASE) |
| viewport_width | 1024 | Browser width |
| viewport_height | 768 | Browser height |
| save_screenshots | False | Save screenshots to disk |
| keep_recent_screenshots | 2 | Recent screenshots kept in context |

Key Features

Browserbase Integration

Both modes support cloud browsers via Browserbase:
  • No local browser installation needed
  • Scalable cloud infrastructure
  • Session recording and debugging
  • Proxy support for geo-targeting
Setup:
  1. Sign up at browserbase.com
  2. Get API key and project ID
  3. Set environment variables
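
The setup steps above translate to the following environment variables. Names match the defaults used on this page; the values are placeholders, and exporting the project ID is only a convenience (in DOM mode it is passed as the project_id argument):

```shell
export BROWSERBASE_API_KEY="your-browserbase-api-key"   # from the Browserbase dashboard
export MODEL_API_KEY="your-model-provider-key"          # key for the Stagehand model provider
export BROWSERBASE_PROJECT_ID="your-project-id"         # passed as project_id in DOM mode
```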

Vision Model Integration

CUA mode integrates screenshots with vision models:
# Screenshot automatically included in context
screenshot()  # Returns base64 image

# Vision model analyzes image to determine coordinates
# Example: "I see the login button at approximately (250, 180)"

Sandbox Deployment

CUA mode supports three deployment modes:

1. Pre-built image (default, fastest)
BrowserEnv(
    mode="cua",
    use_prebuilt_image=True,  # ~5-10s startup
    prebuilt_image="deepdream19/cua-server:latest",
)
2. Binary upload (custom server)
BrowserEnv(
    mode="cua",
    use_prebuilt_image=False,  # ~30-60s startup
    use_binary=True,
)
3. Manual (local development)
# Terminal 1
cd cua-server && ./start.sh

# Terminal 2
BrowserEnv(
    mode="cua",
    use_sandbox=False,
    server_url="http://localhost:3000",
)

Metrics Tracked

DOM Mode

  • judge_reward: Answer correctness (0.0 or 1.0)
  • num_turns: Interaction count
  • total_tool_calls: Tools used
  • Per-tool counts: navigate_calls, act_calls, etc.

CUA Mode

  • judge_reward: Answer correctness (0.0 or 1.0)
  • num_turns: Interaction count
  • total_tool_calls: Tools used
  • Per-tool counts: click_calls, screenshot_calls, etc.
  • sandbox_ready_wait_time: Sandbox startup time
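
The per-tool counters listed above can be derived from a rollout's sequence of tool-call names; a minimal sketch (the tool-call list is an illustrative, hypothetical rollout, not output from the library):

```python
from collections import Counter

# Tool-call names from one hypothetical DOM-mode rollout
tool_calls = ["navigate", "act", "act", "extract"]

metrics = {
    "total_tool_calls": len(tool_calls),
    # one "<name>_calls" counter per distinct tool
    **{f"{name}_calls": n for name, n in Counter(tool_calls).items()},
}
# e.g. {'total_tool_calls': 4, 'navigate_calls': 1, 'act_calls': 2, 'extract_calls': 1}
```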

Advanced Usage

Custom Datasets

Create task-specific datasets:
def create_custom_dataset() -> Dataset:
    return Dataset.from_dict({
        "question": [
            "Find the price of the first product",
            "What is the company's contact email?",
        ],
        "answer": [
            "$29.99",
            "[email protected]",
        ],
        "start_url": [
            "https://shop.example.com",
            "https://example.com/contact",
        ],
    })

Proxy Configuration

BrowserEnv(
    mode="dom",
    proxies=True,  # Enable Browserbase proxies
    # Proxies allow geo-targeting and IP rotation
)

Screenshot Management

BrowserEnv(
    mode="cua",
    save_screenshots=True,  # Save to disk
    keep_recent_screenshots=3,  # Keep last 3 in context
    # Balances context size with visual information
)
