Overview
Verifiers provides two browser automation approaches:

| Mode | Control Method | Tools | Best For |
|---|---|---|---|
| DOM | Natural language → Stagehand SDK | act, observe, extract, navigate | Semantic interactions, form filling |
| CUA | Vision + coordinates | click, type_text, scroll, screenshot | Precise control, visual tasks |

Both modes support:
- Cloud browsers via Browserbase
- Local browser automation
- Sandbox deployment (CUA mode)
- Screenshot capture and vision model integration
DOM Mode Example
DOM mode uses Stagehand’s AI-driven browser control for natural language interactions.

Complete Implementation
- Main Code
- Judge Evaluation
DOM Mode Tools
- navigate(url) - Navigate to a URL
- act(instruction) - Perform an action via natural language
- observe(instruction) - Find elements matching a description
- extract(instruction, schema_json) - Extract structured data
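The four tools form a small dispatch surface between the model’s tool calls and the browser. A minimal sketch with stubbed implementations (the real ones drive a browser through the Stagehand SDK; the `dispatch` helper and return strings here are illustrative):

```python
import json

# Stubbed DOM-mode tools; real implementations call the Stagehand SDK.
def navigate(url: str) -> str:
    return f"Navigated to {url}"

def act(instruction: str) -> str:
    return f"Performed: {instruction}"

def observe(instruction: str) -> str:
    # Real tool returns elements matching the description.
    return json.dumps([{"description": instruction}])

def extract(instruction: str, schema_json: str) -> str:
    # Real tool returns structured data matching the requested schema.
    schema = json.loads(schema_json)
    return json.dumps({key: "" for key in schema})

TOOLS = {"navigate": navigate, "act": act, "observe": observe, "extract": extract}

def dispatch(name: str, arguments: str) -> str:
    """Route a model tool call (name + JSON arguments) to the matching tool."""
    return TOOLS[name](**json.loads(arguments))

print(dispatch("navigate", '{"url": "https://primeintellect.ai"}'))
```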
DOM Mode Interaction Example

Task: What does the headline say on the primeintellect.ai homepage?

Assistant: I’ll navigate to the homepage and extract the headline.
Tool Call: navigate("https://primeintellect.ai")
Tool Result: Navigation successful

Assistant: Now I’ll extract the main headline.
Tool Call: act("read the main headline text")
Tool Result: Found headline: “The Open Superintelligence Stack”

Assistant: The headline says “The Open Superintelligence Stack”.
Judge: “yes”
Result: ✓ Correct (reward = 1.0)
CUA Mode Example
CUA (Computer Use Agent) mode provides low-level, vision-based browser control.

Complete Implementation
- Main Code
- Dataset
CUA Mode Tools
- click(x, y, button="left") - Click at coordinates
- type_text(text) - Type text into the focused element
- keypress(keys) - Press keyboard keys
- scroll(x, y, scroll_x, scroll_y) - Scroll at position
- goto(url) - Navigate to a URL
- screenshot() - Capture current state
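Coordinate-based tools operate against a fixed viewport (1024×768 by default, per the configuration table below), so model-proposed coordinates should stay in bounds. A minimal sketch of clamping coordinates before a click; the helper is illustrative, not part of the environment:

```python
# Default viewport size, from the CUA configuration table.
VIEWPORT_WIDTH, VIEWPORT_HEIGHT = 1024, 768

def clamp_click(x: int, y: int) -> tuple[int, int]:
    """Clamp model-proposed click coordinates into the visible viewport."""
    cx = min(max(x, 0), VIEWPORT_WIDTH - 1)
    cy = min(max(y, 0), VIEWPORT_HEIGHT - 1)
    return (cx, cy)

print(clamp_click(300, 150))   # in bounds → (300, 150)
print(clamp_click(2000, -5))   # clamped → (1023, 0)
```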
CUA Mode Interaction Example

Task: Click the search button

Assistant: Let me take a screenshot to see the page.
Tool Call: screenshot()
Tool Result: [Image showing page with search button at (300, 150)]

Assistant: I can see the search button at coordinates (300, 150). I’ll click it.
Tool Call: click(300, 150)
Tool Result: Click successful

Assistant: Now I’ll type the search query.
Tool Call: type_text("machine learning")
Tool Result: Text entered

Assistant: Pressing Enter to search.
Tool Call: keypress("Enter")

Mode Comparison
| Feature | DOM Mode | CUA Mode |
|---|---|---|
| Control | Natural language | Coordinates + vision |
| Setup | Requires Browserbase project | Requires sandbox or local server |
| Startup | Fast (~2-5s) | Slower (~5-30s depending on mode) |
| Best for | Semantic tasks, form filling | Precise control, visual inspection |
| Tools | 4 high-level | 10 low-level |
| Vision | Not required | Required for coordinates |
| Selectors | Not needed (AI-driven) | Not needed (coordinate-based) |
| Reliability | Good for standard elements | Excellent for any clickable item |
Running Browser Environments
Installation
DOM Mode Execution
CUA Mode Execution
Configuration Options
DOM Mode
| Parameter | Default | Description |
|---|---|---|
| project_id | Required | Browserbase project ID |
| max_turns | 10 | Maximum interactions |
| judge_model | "gpt-4o-mini" | Judge model |
| browserbase_api_key_var | "BROWSERBASE_API_KEY" | API key env var |
| stagehand_model | "openai/gpt-4o-mini" | Model for Stagehand |
| proxy_model_to_stagehand | False | Route Stagehand through the eval model |
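Assuming these parameters are passed as keyword arguments when loading the environment, a DOM-mode configuration might look like the sketch below (values mirror the defaults in the table; the project ID is a placeholder):

```python
# DOM-mode configuration sketch; keys mirror the parameter table above.
dom_env_args = {
    "project_id": "your-browserbase-project-id",  # required, no default
    "max_turns": 10,
    "judge_model": "gpt-4o-mini",
    "browserbase_api_key_var": "BROWSERBASE_API_KEY",
    "stagehand_model": "openai/gpt-4o-mini",
    "proxy_model_to_stagehand": False,
}
```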
CUA Mode
| Parameter | Default | Description |
|---|---|---|
| max_turns | 15 | Maximum interactions |
| use_sandbox | True | Auto-deploy to sandbox |
| use_prebuilt_image | True | Use pre-built Docker image (fastest) |
| prebuilt_image | "deepdream19/cua-server:latest" | Docker image |
| server_url | "http://localhost:3000" | Server URL (manual mode) |
| env | "BROWSERBASE" | Browser env (LOCAL/BROWSERBASE) |
| viewport_width | 1024 | Browser width |
| viewport_height | 768 | Browser height |
| save_screenshots | False | Save screenshots to disk |
| keep_recent_screenshots | 2 | Screenshots kept in context |
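As with DOM mode, a CUA-mode configuration can be sketched as a set of keyword arguments (values mirror the defaults above; whether they are passed exactly this way is an assumption):

```python
# CUA-mode configuration sketch; keys mirror the parameter table above.
cua_env_args = {
    "max_turns": 15,
    "use_sandbox": True,
    "use_prebuilt_image": True,
    "prebuilt_image": "deepdream19/cua-server:latest",
    "server_url": "http://localhost:3000",  # only used in manual mode
    "env": "BROWSERBASE",  # or "LOCAL"
    "viewport_width": 1024,
    "viewport_height": 768,
    "save_screenshots": False,
    "keep_recent_screenshots": 2,
}
```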
Key Features
Browserbase Integration
Both modes support cloud browsers via Browserbase:
- No local browser installation needed
- Scalable cloud infrastructure
- Session recording and debugging
- Proxy support for geo-targeting

Setup:
1. Sign up at browserbase.com
2. Get an API key and project ID
3. Set environment variables
Vision Model Integration
CUA mode integrates screenshots with vision models.

Sandbox Deployment
CUA mode supports three deployment modes:
1. Pre-built image (default, fastest)
2. Sandbox build (use_prebuilt_image=False)
3. Manual server (use_sandbox=False, with server_url pointing at a running server)

Metrics Tracked
DOM Mode
- judge_reward: Answer correctness (0.0 or 1.0)
- num_turns: Interaction count
- total_tool_calls: Tools used
- Per-tool counts: navigate_calls, act_calls, etc.
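The per-tool counts can be derived by aggregating the names of the tools invoked during a rollout. A sketch of that aggregation (the helper is illustrative, not the environment’s internal code):

```python
from collections import Counter

def tool_call_metrics(tool_calls: list[str]) -> dict:
    """Aggregate total and per-tool call counts from a rollout's tool names."""
    counts = Counter(tool_calls)
    return {
        "total_tool_calls": len(tool_calls),
        **{f"{name}_calls": n for name, n in counts.items()},
    }

print(tool_call_metrics(["navigate", "act", "act"]))
# → {'total_tool_calls': 3, 'navigate_calls': 1, 'act_calls': 2}
```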
CUA Mode
- judge_reward: Answer correctness (0.0 or 1.0)
- num_turns: Interaction count
- total_tool_calls: Tools used
- Per-tool counts: click_calls, screenshot_calls, etc.
- sandbox_ready_wait_time: Sandbox startup time
Advanced Usage
Custom Datasets
Create task-specific datasets with task prompts and expected answers.

Proxy Configuration
Screenshot Management
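The keep_recent_screenshots setting bounds context growth by dropping older screenshots from the message history. A minimal sketch of that pruning, with an assumed message shape (the helper is illustrative):

```python
def trim_screenshots(messages: list[dict], keep_recent: int = 2) -> list[dict]:
    """Keep only the most recent `keep_recent` screenshot messages in context."""
    image_indices = [i for i, m in enumerate(messages) if m.get("type") == "image"]
    drop = set(image_indices[:-keep_recent]) if keep_recent else set(image_indices)
    return [m for i, m in enumerate(messages) if i not in drop]

history = [
    {"type": "image", "id": 1},
    {"type": "text", "content": "clicked search"},
    {"type": "image", "id": 2},
    {"type": "image", "id": 3},
]
# With keep_recent=2, the oldest screenshot (id 1) is dropped.
print(trim_screenshots(history))
```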
Related Examples
- Wiki Search - Custom tool implementation
- Math Python - Sandbox environments
- Wordle - Multi-turn interaction
Next Steps
- Learn about BrowserEnv architecture
- See Stateful Environments for sandbox management
- Explore Browserbase for cloud browsers
- Check out Stagehand for DOM mode