This example demonstrates browser automation environments using two different modes: DOM mode (natural language actions via Stagehand) and CUA mode (vision-based control with low-level primitives). Both integrate with Browserbase for cloud browser management.

Overview

Verifiers provides two browser automation approaches:
| Mode | Control Method | Tools | Best For |
| --- | --- | --- | --- |
| DOM | Natural language → Stagehand SDK | act, observe, extract, navigate | Semantic interactions, form filling |
| CUA | Vision + coordinates | click, type_text, scroll, screenshot | Precise control, visual tasks |
Both modes support:
  • Cloud browsers via Browserbase
  • Local browser automation
  • Sandbox deployment (CUA mode)
  • Screenshot capture and vision model integration

DOM Mode Example

DOM mode uses Stagehand’s AI-driven browser control for natural language interactions.

Complete Implementation

import verifiers as vf
from verifiers.envs.integrations.browser_env import BrowserEnv
from datasets import Dataset

DOM_SYSTEM_PROMPT = """You are a browser automation agent using Stagehand's AI-driven tools.

Available tools:
- navigate(url): Navigate to a URL
- observe(instruction): Find possible actions matching the instruction
- act(instruction): Execute an action described in natural language
- extract(instruction, schema_json): Extract structured data from the page

Use natural language to describe what you want to do. Stagehand will intelligently
find elements and execute actions without needing CSS selectors or coordinates.

Complete the given task efficiently."""

def create_example_dataset() -> Dataset:
    return Dataset.from_dict({
        "question": [
            "What does the headline say on the primeintellect.ai homepage?"
        ],
        "answer": ["The Open Superintelligence Stack"],
        "start_url": ["https://primeintellect.ai"],
        "task_id": ["dom-example-0"],
    })

def load_environment(
    project_id: str,
    max_turns: int = 10,
    judge_model: str = "gpt-4o-mini",
    system_prompt: str = DOM_SYSTEM_PROMPT,
    browserbase_api_key_var: str = "BROWSERBASE_API_KEY",
    stagehand_model: str = "openai/gpt-4o-mini",
    model_api_key_var: str = "MODEL_API_KEY",
    proxy_model_to_stagehand: bool = False,
    **kwargs,
) -> vf.Environment:
    import os

    # Check required env vars
    missing = []
    if not os.getenv(browserbase_api_key_var):
        missing.append(browserbase_api_key_var)
    if not os.getenv(model_api_key_var):
        missing.append(model_api_key_var)

    if missing:
        raise ValueError(
            f"Missing required environment variables: {', '.join(missing)}"
        )

    dataset = create_example_dataset()

    # Create judge rubric
    rubric = vf.JudgeRubric(
        judge_model=judge_model,
        judge_prompt=JUDGE_PROMPT,
    )
    rubric.add_reward_func(judge_answer, weight=1.0)

    return BrowserEnv(
        mode="dom",
        dataset=dataset,
        rubric=rubric,
        max_turns=max_turns,
        system_prompt=system_prompt,
        project_id=project_id,
        browserbase_api_key_var=browserbase_api_key_var,
        stagehand_model=stagehand_model,
        model_api_key_var=model_api_key_var,
        proxy_model_to_stagehand=proxy_model_to_stagehand,
        **kwargs,
    )
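
The implementation above references JUDGE_PROMPT and judge_answer without defining them. A minimal sketch of what they might look like (the prompt wording and the reward-function signature are assumptions; adapt them to your verifiers version):

```python
JUDGE_PROMPT = """Judge whether the agent answered the question correctly.

Question: {question}
Expected answer: {answer}
Agent response: {response}

Respond with "yes" if the response matches the expected answer, otherwise "no"."""

async def judge_answer(judge, prompt, completion, answer, state) -> float:
    """Reward 1.0 when the judge verdict contains 'yes', else 0.0 (assumed convention)."""
    verdict = await judge(prompt, completion, answer, state)
    return 1.0 if "yes" in verdict.lower() else 0.0
```

The CUA example below reuses the same two helpers.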

DOM Mode Tools

navigate(url) - Navigate to a URL
navigate("https://example.com")
act(instruction) - Perform action via natural language
act("click the login button")
act("type '[email protected]' into the email field")
act("select 'United States' from the country dropdown")
observe(instruction) - Find elements matching description
observe("find all product cards on the page")
# Returns: List of matching elements with metadata
extract(instruction, schema_json) - Extract structured data
schema = '{"title": "string", "price": "number"}'
extract("extract product information", schema)
# Returns: {"title": "Product Name", "price": 29.99}

DOM Mode Interaction Example

Task: What does the headline say on the primeintellect.ai homepage?
Assistant: I’ll navigate to the homepage and extract the headline.
Tool Call: navigate("https://primeintellect.ai")
Tool Result: Navigation successful

Assistant: Now I’ll extract the main headline.
Tool Call: act("read the main headline text")
Tool Result: Found headline: “The Open Superintelligence Stack”

Assistant: The headline says “The Open Superintelligence Stack”.
Judge: “yes”
Result: ✓ Correct (reward = 1.0)

CUA Mode Example

CUA (Computer Use Agent) mode provides low-level vision-based browser control.

Complete Implementation

from typing import Literal
import verifiers as vf
from verifiers.envs.integrations.browser_env import BrowserEnv
from datasets import Dataset

CUA_SYSTEM_PROMPT = """You are a browser automation agent. You can control a web browser using the provided tools.

Available tools:
- click(x, y, button): Click at coordinates
- double_click(x, y): Double-click at coordinates
- type_text(text): Type text into focused element
- keypress(keys): Press keyboard keys
- scroll(x, y, scroll_x, scroll_y): Scroll at position
- goto(url): Navigate to URL
- back(): Go back in history
- forward(): Go forward in history
- wait(time_ms): Wait for specified milliseconds
- screenshot(): Capture current page state

After each action, you will receive a screenshot showing the current page state.
Analyze the screenshot to determine your next action.

Complete the given task efficiently using the minimum number of actions necessary."""

def load_environment(
    max_turns: int = 15,
    judge_model: str = "gpt-4o-mini",
    system_prompt: str = CUA_SYSTEM_PROMPT,
    # CUA mode configuration
    use_sandbox: bool = True,
    server_url: str = "http://localhost:3000",
    # Browserbase configuration
    browserbase_api_key: str | None = None,
    browserbase_project_id: str | None = None,
    env: Literal["LOCAL", "BROWSERBASE"] = "BROWSERBASE",
    # Pre-built image (fastest startup)
    use_prebuilt_image: bool = True,
    prebuilt_image: str = "deepdream19/cua-server:latest",
    **kwargs,
) -> vf.Environment:
    dataset = create_example_dataset()  # same example dataset as in the DOM example

    # JUDGE_PROMPT and judge_answer are the same judge helpers as in the DOM example
    rubric = vf.JudgeRubric(
        judge_model=judge_model,
        judge_prompt=JUDGE_PROMPT,
    )
    rubric.add_reward_func(judge_answer, weight=1.0)

    return BrowserEnv(
        mode="cua",
        dataset=dataset,
        rubric=rubric,
        max_turns=max_turns,
        system_prompt=system_prompt,
        use_sandbox=use_sandbox,
        server_url=server_url,
        env=env,
        browserbase_api_key=browserbase_api_key,
        browserbase_project_id=browserbase_project_id,
        use_prebuilt_image=use_prebuilt_image,
        prebuilt_image=prebuilt_image,
        **kwargs,
    )

CUA Mode Tools

click(x, y, button="left") - Click at coordinates
click(150, 300)  # Click at (150, 300)
click(200, 400, button="right")  # Right-click
type_text(text) - Type text into focused element
type_text("[email protected]")
keypress(keys) - Press keyboard keys
keypress("Enter")
keypress("Ctrl+C")
keypress("Tab")
scroll(x, y, scroll_x, scroll_y) - Scroll at position
scroll(500, 500, 0, -100)  # Scroll up 100 pixels
scroll(500, 500, 0, 100)   # Scroll down
goto(url) - Navigate to URL
goto("https://example.com")
screenshot() - Capture current state
screenshot()  # Returns base64 encoded image

CUA Mode Interaction Example

Task: Click the search button
Assistant: Let me take a screenshot to see the page.
Tool Call: screenshot()
Tool Result: [Image showing page with search button at (300, 150)]

Assistant: I can see the search button at coordinates (300, 150). I’ll click it.
Tool Call: click(300, 150)
Tool Result: Click successful

Assistant: Now I’ll type the search query.
Tool Call: type_text("machine learning")
Tool Result: Text entered

Assistant: Pressing Enter to search.
Tool Call: keypress("Enter")

Mode Comparison

| Feature | DOM Mode | CUA Mode |
| --- | --- | --- |
| Control | Natural language | Coordinates + vision |
| Setup | Requires Browserbase project | Requires sandbox or local server |
| Startup | Fast (~2-5s) | Slower (~5-30s depending on mode) |
| Best for | Semantic tasks, form filling | Precise control, visual inspection |
| Tools | 4 high-level | 10 low-level |
| Vision | Not required | Required for coordinates |
| Selectors | Not needed (AI-driven) | Not needed (coordinate-based) |
| Reliability | Good for standard elements | Excellent for any clickable item |

Running Browser Environments

Installation

# Install browser environments
prime env install browser-dom-example
prime env install browser-cua-example

DOM Mode Execution

# Requires: BROWSERBASE_API_KEY, MODEL_API_KEY, Browserbase project ID
prime eval run browser-dom-example \
  -m openai/gpt-4o-mini \
  -b https://api.openai.com/v1 \
  -k OPENAI_API_KEY \
  -a '{"project_id": "YOUR_PROJECT_ID"}' \
  -n 10 \
  -r 3

CUA Mode Execution

# Default: Pre-built image (fastest)
prime eval run browser-cua-example \
  -m openai/gpt-4.1-mini \
  -b https://api.openai.com/v1 \
  -k OPENAI_API_KEY \
  -n 10 \
  -r 3

# Binary upload mode (custom server)
prime eval run browser-cua-example \
  -m openai/gpt-4.1-mini \
  -a '{"use_prebuilt_image": false}' \
  -n 10

# Manual mode (local development)
# Terminal 1: cd cua-server && ./start.sh
# Terminal 2:
prime eval run browser-cua-example \
  -m openai/gpt-4.1-mini \
  -a '{"use_sandbox": false, "server_url": "http://localhost:3000"}' \
  -n 10

Configuration Options

DOM Mode

| Parameter | Default | Description |
| --- | --- | --- |
| project_id | Required | Browserbase project ID |
| max_turns | 10 | Maximum interactions |
| judge_model | "gpt-4o-mini" | Judge model |
| browserbase_api_key_var | "BROWSERBASE_API_KEY" | API key env var |
| stagehand_model | "openai/gpt-4o-mini" | Model for Stagehand |
| model_api_key_var | "MODEL_API_KEY" | Stagehand model API key env var |
| proxy_model_to_stagehand | False | Route Stagehand through the eval model |

CUA Mode

| Parameter | Default | Description |
| --- | --- | --- |
| max_turns | 15 | Maximum interactions |
| use_sandbox | True | Auto-deploy to sandbox |
| use_prebuilt_image | True | Use pre-built Docker image (fastest) |
| prebuilt_image | "deepdream19/cua-server:latest" | Docker image |
| server_url | "http://localhost:3000" | Server URL (manual mode) |
| env | "BROWSERBASE" | Browser environment (LOCAL or BROWSERBASE) |
| viewport_width | 1024 | Browser width |
| viewport_height | 768 | Browser height |
| save_screenshots | False | Save screenshots to disk |
| keep_recent_screenshots | 2 | Recent screenshots kept in context |

Key Features

Browserbase Integration

Both modes support cloud browsers via Browserbase:
  • No local browser installation needed
  • Scalable cloud infrastructure
  • Session recording and debugging
  • Proxy support for geo-targeting
Setup:
  1. Sign up at browserbase.com
  2. Get API key and project ID
  3. Set environment variables
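
The setup steps above translate to the following environment variables. Names match the defaults used on this page; the values are placeholders, and exporting the project ID is only a convenience (in DOM mode it is passed as the project_id argument):

```shell
export BROWSERBASE_API_KEY="your-browserbase-api-key"   # from the Browserbase dashboard
export MODEL_API_KEY="your-model-provider-key"          # key for the Stagehand model provider
export BROWSERBASE_PROJECT_ID="your-project-id"         # passed as project_id in DOM mode
```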

Vision Model Integration

CUA mode integrates screenshots with vision models:
# Screenshot automatically included in context
screenshot()  # Returns base64 image

# Vision model analyzes image to determine coordinates
# Example: "I see the login button at approximately (250, 180)"

Sandbox Deployment

CUA mode supports three deployment modes:

1. Pre-built image (default, fastest)
BrowserEnv(
    mode="cua",
    use_prebuilt_image=True,  # ~5-10s startup
    prebuilt_image="deepdream19/cua-server:latest",
)
2. Binary upload (custom server)
BrowserEnv(
    mode="cua",
    use_prebuilt_image=False,  # ~30-60s startup
    use_binary=True,
)
3. Manual (local development)
# Terminal 1
cd cua-server && ./start.sh

# Terminal 2
BrowserEnv(
    mode="cua",
    use_sandbox=False,
    server_url="http://localhost:3000",
)

Metrics Tracked

DOM Mode

  • judge_reward: Answer correctness (0.0 or 1.0)
  • num_turns: Interaction count
  • total_tool_calls: Tools used
  • Per-tool counts: navigate_calls, act_calls, etc.

CUA Mode

  • judge_reward: Answer correctness (0.0 or 1.0)
  • num_turns: Interaction count
  • total_tool_calls: Tools used
  • Per-tool counts: click_calls, screenshot_calls, etc.
  • sandbox_ready_wait_time: Sandbox startup time
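
The per-tool counters listed above can be derived from a rollout's sequence of tool-call names; a minimal sketch (the tool-call list is an illustrative, hypothetical rollout, not output from the library):

```python
from collections import Counter

# Tool-call names from one hypothetical DOM-mode rollout
tool_calls = ["navigate", "act", "act", "extract"]

metrics = {
    "total_tool_calls": len(tool_calls),
    # one "<name>_calls" counter per distinct tool
    **{f"{name}_calls": n for name, n in Counter(tool_calls).items()},
}
# e.g. {'total_tool_calls': 4, 'navigate_calls': 1, 'act_calls': 2, 'extract_calls': 1}
```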

Advanced Usage

Custom Datasets

Create task-specific datasets:
def create_custom_dataset() -> Dataset:
    return Dataset.from_dict({
        "question": [
            "Find the price of the first product",
            "What is the company's contact email?",
        ],
        "answer": [
            "$29.99",
            "[email protected]",
        ],
        "start_url": [
            "https://shop.example.com",
            "https://example.com/contact",
        ],
    })

Proxy Configuration

BrowserEnv(
    mode="dom",
    proxies=True,  # Enable Browserbase proxies
    # Proxies allow geo-targeting and IP rotation
)

Screenshot Management

BrowserEnv(
    mode="cua",
    save_screenshots=True,  # Save to disk
    keep_recent_screenshots=3,  # Keep last 3 in context
    # Balances context size with visual information
)
