Computer Use API

Stagehand provides native support for Computer Use APIs from major AI providers. These APIs enable AI agents to interact with web browsers using visual understanding and coordinate-based actions.

Overview

Computer Use APIs allow AI models to:

See screenshots of web pages
Click at specific coordinates
Type text into fields
Scroll, drag, and perform other mouse/keyboard actions
Navigate between pages

Stagehand supports three CUA implementations:

Anthropic - Claude’s Computer Use API
Google - Gemini’s Computer Use API
OpenAI - GPT’s Computer Use API (preview)

Creating a CUA Agent

import { Stagehand } from "@browserbasehq/stagehand";

const stagehand = new Stagehand({
  env: "LOCAL",
  verbose: 2,
});
await stagehand.init();

const page = stagehand.context.pages()[0];

// Navigate first so the system prompt below records the real page URL
// rather than the initial blank page.
await page.goto("https://www.example.com");

// Create a Computer Use Agent
const agent = stagehand.agent({
  mode: "cua",
  model: {
    modelName: "google/gemini-3-flash-preview",
    apiKey: process.env.GEMINI_API_KEY,
  },
  systemPrompt: `You are a helpful assistant that can use a web browser.
    You are currently on: ${page.url()}.
    Today's date is ${new Date().toLocaleDateString()}.`,
});

// Execute a task
const result = await agent.execute({
  instruction: "Fill out the contact form with test data",
  maxSteps: 20,
});

Provider-Specific Implementations

Anthropic CUA Client

Location: packages/core/lib/v3/agent/AnthropicCUAClient.ts

Key Features:

Uses Anthropic’s Messages API with computer_20251124 tool
Supports Claude 4.5+ models with extended thinking budgets
Handles image compression in conversation history
Converts between Anthropic’s coordinate system and Playwright actions

Configuration:

const agent = stagehand.agent({
  mode: "cua",
  model: {
    modelName: "anthropic/claude-sonnet-4-5-20250929",
    apiKey: process.env.ANTHROPIC_API_KEY,
    thinkingBudget: 5000, // Optional: extended thinking tokens
  },
});

Supported Actions:

screenshot - Capture current page state
click - Click at x,y coordinates
type - Type text
keypress - Press keyboard keys
scroll - Scroll in a direction
move - Move mouse cursor
drag - Drag between coordinates
doubleClick - Double-click at coordinates

Action Conversion: The client converts Anthropic’s tool calls to Playwright actions:

// Anthropic returns:
{
  "name": "computer",
  "input": {
    "action": "left_click",
    "coordinate": [500, 300]
  }
}

// Converted to:
{
  type: "click",
  x: 500,
  y: 300,
  button: "left"
}
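
The mapping above can be sketched as a small pure function. This is only an illustration, not the code in AnthropicCUAClient.ts: the `AnthropicComputerInput` and `ClickAction` types and the `toClickAction` name are hypothetical, and the output shape for `right_click` and `middle_click` is assumed to follow the same pattern as the `left_click` case shown above.

```typescript
// Hypothetical types for illustration; the real client defines its own.
interface AnthropicComputerInput {
  action: string;
  coordinate?: [number, number];
}

interface ClickAction {
  type: "click";
  x: number;
  y: number;
  button: "left" | "right" | "middle";
}

// Anthropic click action name -> mouse button (assumed mapping).
const CLICK_BUTTONS = new Map<string, ClickAction["button"]>([
  ["left_click", "left"],
  ["right_click", "right"],
  ["middle_click", "middle"],
]);

// Returns a click action, or null when the input is not a click
// or is missing its coordinate pair.
function toClickAction(input: AnthropicComputerInput): ClickAction | null {
  const button = CLICK_BUTTONS.get(input.action);
  if (!button || !input.coordinate) return null;
  const [x, y] = input.coordinate;
  return { type: "click", x, y, button };
}

console.log(toClickAction({ action: "left_click", coordinate: [500, 300] }));
// { type: 'click', x: 500, y: 300, button: 'left' }
```

Returning null for anything that is not a click keeps the sketch narrow; a fuller converter would need one branch per supported action.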

Google CUA Client

Location: packages/core/lib/v3/agent/GoogleCUAClient.ts

Key Features:

Uses Google’s computerUse tool with Gemini models
Normalizes coordinates from 0-1000 range to viewport dimensions
Supports both browser and desktop environments
Handles safety confirmations for sensitive actions

Configuration:

const agent = stagehand.agent({
  mode: "cua",
  model: {
    modelName: "google/gemini-2-5-flash-preview",
    apiKey: process.env.GEMINI_API_KEY,
    environment: "ENVIRONMENT_BROWSER", // or "ENVIRONMENT_DESKTOP"
  },
});

Supported Function Calls:

open_web_browser - Open browser
click_at - Click at coordinates
type_text_at - Click and type at location
key_combination - Press key combinations
scroll_document - Scroll page up/down
scroll_at - Scroll at specific location
navigate - Go to URL
go_back / go_forward - Browser navigation
hover_at - Hover at coordinates
drag_and_drop - Drag between points
wait_5_seconds - Wait for page updates

Coordinate Normalization:

private normalizeCoordinates(x: number, y: number) {
  // Google uses 0-1000 range, convert to actual viewport pixels
  const clampedX = Math.min(999, Math.max(0, x));
  const clampedY = Math.min(999, Math.max(0, y));
  return {
    x: Math.floor((clampedX / 1000) * this.currentViewport.width),
    y: Math.floor((clampedY / 1000) * this.currentViewport.height)
  };
}
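
To see what the normalization does with concrete numbers, here is a standalone copy of the same arithmetic. It is a sketch for experimentation: the `Viewport` type and the explicit `viewport` parameter are assumptions that stand in for `this.currentViewport`, and the 1288x711 size is the one used in the Browser Configuration section.

```typescript
interface Viewport {
  width: number;
  height: number;
}

// Same clamping and scaling as the method above, with the viewport
// passed in explicitly so the function can run on its own.
function normalizeCoordinates(x: number, y: number, viewport: Viewport) {
  const clampedX = Math.min(999, Math.max(0, x));
  const clampedY = Math.min(999, Math.max(0, y));
  return {
    x: Math.floor((clampedX / 1000) * viewport.width),
    y: Math.floor((clampedY / 1000) * viewport.height),
  };
}

const viewport: Viewport = { width: 1288, height: 711 };

// The midpoint of the model's grid maps to the middle of the viewport.
console.log(normalizeCoordinates(500, 500, viewport)); // { x: 644, y: 355 }

// Out-of-range values are clamped before scaling.
console.log(normalizeCoordinates(1200, -5, viewport)); // { x: 1286, y: 0 }
```

Because inputs are clamped to 999, the result always stays strictly inside the viewport, so a click never lands on the pixel just past the right or bottom edge.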

Safety Confirmations: Google CUA may request safety confirmations for sensitive actions:

const agent = stagehand.agent({
  mode: "cua",
  model: { /* ... */ },
  safetyConfirmationHandler: async (safetyChecks) => {
    console.log("Safety checks:", safetyChecks);
    return { acknowledged: true };
  },
});

OpenAI CUA Client

Location: packages/core/lib/v3/agent/OpenAICUAClient.ts

Key Features:

Uses OpenAI’s Responses API for computer use (preview)
Tracks reasoning items across conversation
Supports function calls alongside computer actions
Maintains response history with previous_response_id

Configuration:

const agent = stagehand.agent({
  mode: "cua",
  model: {
    modelName: "openai/gpt-4o",
    apiKey: process.env.OPENAI_API_KEY,
    environment: "browser", // "browser", "mac", "windows", or "ubuntu"
  },
});

Response Types:

computer_call - Computer action request
function_call - Custom tool invocation
reasoning - Model’s internal reasoning
message - Text response to user

Computer Call Flow:

// 1. Model returns computer_call
{
  type: "computer_call",
  call_id: "call_123",
  action: {
    type: "click",
    x: 100,
    y: 200
  }
}

// 2. Execute action and capture screenshot
// 3. Return computer_call_output
{
  type: "computer_call_output",
  call_id: "call_123",
  output: {
    type: "input_image",
    image_url: "data:image/png;base64,...",
    current_url: "https://example.com"
  }
}
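
Step 3 of the flow above amounts to echoing the `call_id` and attaching the fresh screenshot. A minimal sketch of that step follows; the `ComputerCall` and `ComputerCallOutput` interfaces and the `buildComputerCallOutput` helper are hypothetical names, and the field names are copied from the example objects above rather than from the client source.

```typescript
// Hypothetical shapes mirroring the example objects above.
interface ComputerCall {
  type: "computer_call";
  call_id: string;
  action: { type: string; [key: string]: unknown };
}

interface ComputerCallOutput {
  type: "computer_call_output";
  call_id: string;
  output: {
    type: "input_image";
    image_url: string;
    current_url: string;
  };
}

// Pairs a screenshot with the call it answers. The call_id must match
// the one on the computer_call so the two can be correlated.
function buildComputerCallOutput(
  call: ComputerCall,
  screenshotBase64: string,
  currentUrl: string,
): ComputerCallOutput {
  return {
    type: "computer_call_output",
    call_id: call.call_id,
    output: {
      type: "input_image",
      image_url: `data:image/png;base64,${screenshotBase64}`,
      current_url: currentUrl,
    },
  };
}

const call: ComputerCall = {
  type: "computer_call",
  call_id: "call_123",
  action: { type: "click", x: 100, y: 200 },
};

// "AAAA" is placeholder data, not a real PNG.
const output = buildComputerCallOutput(call, "AAAA", "https://example.com");
console.log(output.call_id); // call_123
```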

Browser Configuration

IMPORTANT: Computer Use requires specific browser dimensions, because the model's coordinate-based actions are resolved against the viewport size. Configure them in stagehand.config.ts:

export default {
  browserOptions: {
    headless: false,
    defaultViewport: {
      width: 1288,
      height: 711,
    },
  },
};

Or set at runtime:

const stagehand = new Stagehand({
  env: "LOCAL",
  browserOptions: {
    defaultViewport: { width: 1288, height: 711 },
  },
});

Action Handlers

CUA clients use action handlers to execute browser actions:

// Set in AgentContext (packages/core/lib/v3/agent/AgentContext.ts)
this.cuaClient.setActionHandler(async (action: AgentAction) => {
  switch (action.type) {
    case "click":
      await page.mouse.click(action.x, action.y);
      break;
    case "type":
      await page.keyboard.type(action.text);
      break;
    case "scroll":
      await page.mouse.wheel(action.scroll_x, action.scroll_y);
      break;
    // ... other actions
  }
});

Screenshot Providers

All CUA clients require a screenshot provider:

this.cuaClient.setScreenshotProvider(async () => {
  const page = await this.v3.context.awaitActivePage();
  const screenshot = await page.screenshot();
  return screenshot.toString("base64");
});

Image Compression

To reduce token usage, Stagehand compresses images in conversation history:

// Anthropic: compressConversationImages()
// Keeps first 2 images, compresses remaining to 25% quality

// Google: compressGoogleConversationImages()
// Similar compression strategy for Google's format
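
One way to picture the strategy described in the comments above is a pass over the history that leaves the first two images alone and hands every later image to a compressor. This is a sketch under stated assumptions, not the implementation of `compressConversationImages()`: the `HistoryItem` type, the `compressLaterImages` name, and the `compress` callback are hypothetical, and the real helpers work on provider-specific message formats.

```typescript
// Hypothetical, provider-neutral history item.
type HistoryItem =
  | { kind: "text"; text: string }
  | { kind: "image"; base64: string };

// Leaves the first `keep` images untouched and passes every later image
// through `compress`. Non-image items are returned unchanged.
function compressLaterImages(
  history: HistoryItem[],
  compress: (base64: string) => string,
  keep = 2,
): HistoryItem[] {
  let seen = 0;
  return history.map((item): HistoryItem => {
    if (item.kind !== "image") return item;
    seen += 1;
    return seen <= keep
      ? item
      : { kind: "image", base64: compress(item.base64) };
  });
}

// Stand-in compressor for the demo: truncates the payload instead of
// re-encoding it at lower quality.
const fakeCompress = (base64: string) => base64.slice(0, 4);

const history: HistoryItem[] = [
  { kind: "image", base64: "AAAAAAAA" },
  { kind: "text", text: "clicked the button" },
  { kind: "image", base64: "BBBBBBBB" },
  { kind: "image", base64: "CCCCCCCC" },
];

console.log(compressLaterImages(history, fakeCompress));
```

Passing the compressor in as a callback keeps the selection logic separate from the image re-encoding, which would need an image library in practice.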

Custom Tools with CUA

You can combine Computer Use with custom tools:

import { tool } from "ai";
import { z } from "zod";

const getWeather = tool({
  description: "Get weather for a location",
  inputSchema: z.object({
    location: z.string(),
  }),
  execute: async ({ location }) => {
    // Your API call here
    return { temp: 70, conditions: "sunny" };
  },
});

const agent = stagehand.agent({
  mode: "cua",
  model: { /* ... */ },
  tools: { getWeather },
});

See packages/core/examples/agent-custom-tools.ts for a complete example.

Best Practices

Set appropriate maxSteps: CUA tasks typically need 10-20 steps
Use specific system prompts: Include context about the current page and date
Handle errors gracefully: CUA actions can fail; implement retry logic
Monitor token usage: Screenshots consume many tokens; use compression
Test viewport dimensions: Ensure coordinates map correctly to your viewport
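
The retry advice above could be implemented with a small generic wrapper. The `withRetry` helper below is a hypothetical sketch, not part of Stagehand; its attempt count and delay defaults are arbitrary, and `task` is assumed to wrap a call such as `agent.execute(...)`.

```typescript
// Runs `task` up to `maxAttempts` times, waiting `delayMs` between
// failed attempts, and rethrows the last error if every attempt fails.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 3,
  delayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}

// Intended usage (assumes an `agent` created as shown earlier):
//   const result = await withRetry(() =>
//     agent.execute({ instruction: "...", maxSteps: 20 }),
//   );
```

A retry reruns the whole instruction from whatever state the page is in, so this pattern suits instructions that are safe to repeat.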

Example: Complete CUA Workflow

import { Stagehand } from "@browserbasehq/stagehand";
import chalk from "chalk";

const stagehand = new Stagehand({
  env: "LOCAL",
  verbose: 2,
  browserOptions: {
    defaultViewport: { width: 1288, height: 711 },
  },
});

await stagehand.init();

const page = stagehand.context.pages()[0];

// Navigate before creating the agent so the system prompt
// records the careers page URL rather than the initial blank page.
await page.goto("https://www.browserbase.com/careers");

const agent = stagehand.agent({
  mode: "cua",
  model: {
    modelName: "anthropic/claude-sonnet-4-5",
    apiKey: process.env.ANTHROPIC_API_KEY,
  },
  systemPrompt: `You are a helpful assistant.
    Current page: ${page.url()}
    Date: ${new Date().toLocaleDateString()}`,
});

const result = await agent.execute({
  instruction: "Apply for the first engineer position with test data. Don't submit.",
  maxSteps: 20,
});

console.log(chalk.green("✓"), "Complete:", result.message);
console.log("Actions performed:", result.actions.length);
console.log("Token usage:", result.usage);

await stagehand.close();

References

Anthropic CUA: packages/core/lib/v3/agent/AnthropicCUAClient.ts
Google CUA: packages/core/lib/v3/agent/GoogleCUAClient.ts
OpenAI CUA: packages/core/lib/v3/agent/OpenAICUAClient.ts
Example: packages/core/examples/cua-example.ts
Custom Tools Example: packages/core/examples/agent-custom-tools.ts
