Stagehand provides native support for Computer Use APIs from major AI providers. These APIs enable AI agents to interact with web browsers using visual understanding and coordinate-based actions.

Overview

Computer Use APIs allow AI models to:
  • See screenshots of web pages
  • Click at specific coordinates
  • Type text into fields
  • Scroll, drag, and perform other mouse/keyboard actions
  • Navigate between pages

Stagehand supports three Computer Use Agent (CUA) implementations:
  • Anthropic - Claude’s Computer Use API
  • Google - Gemini’s Computer Use API
  • OpenAI - GPT’s Computer Use API (preview)
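
The capabilities listed above combine into a loop: capture a screenshot, send it to the model, execute the action the model returns, and repeat until the model reports completion or a step limit is reached. The sketch below illustrates that loop with a scripted stand-in for the model. `CuaModel`, `BrowserDriver`, `ModelStep`, and `runCuaLoop` are hypothetical names chosen for illustration; they are not part of Stagehand's API.

```typescript
// Hypothetical types for illustration; not Stagehand's actual interfaces.
type CuaAction =
  | { type: "click"; x: number; y: number }
  | { type: "type"; text: string };

type ModelStep =
  | { done: false; action: CuaAction }
  | { done: true; message: string };

interface CuaModel {
  // Receives the latest screenshot (base64) and decides the next step.
  next(screenshotBase64: string): Promise<ModelStep>;
}

interface BrowserDriver {
  screenshot(): Promise<string>;
  perform(action: CuaAction): Promise<void>;
}

async function runCuaLoop(
  model: CuaModel,
  browser: BrowserDriver,
  maxSteps: number,
): Promise<{ completed: boolean; message: string; actions: CuaAction[] }> {
  const actions: CuaAction[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const screenshot = await browser.screenshot();
    const result = await model.next(screenshot);
    if (result.done) {
      return { completed: true, message: result.message, actions };
    }
    await browser.perform(result.action);
    actions.push(result.action);
  }
  // The step budget ran out before the model reported completion.
  return { completed: false, message: "Reached maxSteps", actions };
}

// Scripted stand-ins so the sketch runs without a browser or an API key.
const scriptedSteps: ModelStep[] = [
  { done: false, action: { type: "click", x: 500, y: 300 } },
  { done: false, action: { type: "type", text: "hello" } },
  { done: true, message: "Form filled" },
];
const fallbackStep: ModelStep = { done: true, message: "No more steps" };

const fakeModel: CuaModel = {
  next: async () => scriptedSteps.shift() ?? fallbackStep,
};

const fakeBrowser: BrowserDriver = {
  screenshot: async () => "base64-screenshot-placeholder",
  perform: async () => {},
};
```

The `maxSteps` bound plays the same role as the `maxSteps` option passed to `agent.execute` in the examples that follow: it stops a task that never converges.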

Creating a CUA Agent

import { Stagehand } from "@browserbasehq/stagehand";

const stagehand = new Stagehand({
  env: "LOCAL",
  verbose: 2,
});
await stagehand.init();

const page = stagehand.context.pages()[0];

// Navigate first so that page.url() in the system prompt below reflects
// the target page rather than whatever the tab showed before navigation.
await page.goto("https://www.example.com");

// Create a Computer Use Agent
const agent = stagehand.agent({
  mode: "cua",
  model: {
    modelName: "google/gemini-3-flash-preview",
    apiKey: process.env.GEMINI_API_KEY,
  },
  systemPrompt: `You are a helpful assistant that can use a web browser.
    You are currently on: ${page.url()}.
    Today's date is ${new Date().toLocaleDateString()}.`,
});

// Execute a task
const result = await agent.execute({
  instruction: "Fill out the contact form with test data",
  maxSteps: 20,
});

Provider-Specific Implementations

Anthropic CUA Client

Location: packages/core/lib/v3/agent/AnthropicCUAClient.ts
Key Features:
  • Uses Anthropic’s Messages API with computer_20251124 tool
  • Supports Claude 4.5+ models with extended thinking budgets
  • Handles image compression in conversation history
  • Converts between Anthropic’s coordinate system and Playwright actions
Configuration:
const agent = stagehand.agent({
  mode: "cua",
  model: {
    modelName: "anthropic/claude-sonnet-4-5-20250929",
    apiKey: process.env.ANTHROPIC_API_KEY,
    thinkingBudget: 5000, // Optional: extended thinking tokens
  },
});
Supported Actions:
  • screenshot - Capture current page state
  • click - Click at x,y coordinates
  • type - Type text
  • keypress - Press keyboard keys
  • scroll - Scroll in a direction
  • move - Move mouse cursor
  • drag - Drag between coordinates
  • doubleClick - Double-click at coordinates
Action Conversion: The client converts Anthropic’s tool calls to Playwright actions:
// Anthropic returns:
{
  "name": "computer",
  "input": {
    "action": "left_click",
    "coordinate": [500, 300]
  }
}

// Converted to:
{
  type: "click",
  x: 500,
  y: 300,
  button: "left"
}
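
The conversion shown above can be sketched as a small mapping function. This is a minimal illustration, not the code in AnthropicCUAClient.ts: `convertAnthropicInput` and both type names are hypothetical, and only the `left_click` mapping is taken from this page. The other cases use action names from Anthropic's computer tool, with output shapes that are assumptions.

```typescript
// Hypothetical shapes for illustration; the real client handles more actions.
interface AnthropicComputerInput {
  action: string;
  coordinate?: [number, number];
  text?: string;
}

type ConvertedAction =
  | { type: "click"; x: number; y: number; button: "left" | "right" }
  | { type: "doubleClick"; x: number; y: number }
  | { type: "type"; text: string }
  | { type: "screenshot" };

function convertAnthropicInput(input: AnthropicComputerInput): ConvertedAction {
  // Coordinate-based actions are invalid without a coordinate pair.
  const requireCoordinate = (): [number, number] => {
    if (!input.coordinate) {
      throw new Error(`Action "${input.action}" requires a coordinate`);
    }
    return input.coordinate;
  };

  switch (input.action) {
    case "left_click": {
      const [x, y] = requireCoordinate();
      return { type: "click", x, y, button: "left" };
    }
    case "right_click": {
      const [x, y] = requireCoordinate();
      return { type: "click", x, y, button: "right" };
    }
    case "double_click": {
      const [x, y] = requireCoordinate();
      return { type: "doubleClick", x, y };
    }
    case "type": {
      if (input.text === undefined) {
        throw new Error('Action "type" requires text');
      }
      return { type: "type", text: input.text };
    }
    case "screenshot":
      return { type: "screenshot" };
    default:
      // Fail loudly instead of silently dropping an unknown action.
      throw new Error(`Unsupported action: ${input.action}`);
  }
}

const converted = convertAnthropicInput({
  action: "left_click",
  coordinate: [500, 300],
});
```

Throwing on an unknown action is a design choice in this sketch: a dropped action would leave the model believing a step succeeded when nothing happened in the browser.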

Google CUA Client

Location: packages/core/lib/v3/agent/GoogleCUAClient.ts
Key Features:
  • Uses Google’s computerUse tool with Gemini models
  • Normalizes coordinates from 0-1000 range to viewport dimensions
  • Supports both browser and desktop environments
  • Handles safety confirmations for sensitive actions
Configuration:
const agent = stagehand.agent({
  mode: "cua",
  model: {
    modelName: "google/gemini-2-5-flash-preview",
    apiKey: process.env.GEMINI_API_KEY,
    environment: "ENVIRONMENT_BROWSER", // or "ENVIRONMENT_DESKTOP"
  },
});
Supported Function Calls:
  • open_web_browser - Open browser
  • click_at - Click at coordinates
  • type_text_at - Click and type at location
  • key_combination - Press key combinations
  • scroll_document - Scroll page up/down
  • scroll_at - Scroll at specific location
  • navigate - Go to URL
  • go_back / go_forward - Browser navigation
  • hover_at - Hover at coordinates
  • drag_and_drop - Drag between points
  • wait_5_seconds - Wait for page updates
Coordinate Normalization:
private normalizeCoordinates(x: number, y: number) {
  // Google uses 0-1000 range, convert to actual viewport pixels.
  // Clamping to 0-999 keeps the scaled result strictly inside the viewport.
  const clampedX = Math.min(999, Math.max(0, x));
  const clampedY = Math.min(999, Math.max(0, y));
  return {
    x: Math.floor((clampedX / 1000) * this.currentViewport.width),
    y: Math.floor((clampedY / 1000) * this.currentViewport.height)
  };
}
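
To make the arithmetic concrete, the following standalone version applies the same formula to the 1288×711 viewport used elsewhere in this guide. It takes the viewport as a parameter instead of reading `this.currentViewport`; that change, and the `Viewport` interface, are adaptations for the example only.

```typescript
interface Viewport {
  width: number;
  height: number;
}

// Standalone copy of the normalization formula above, with the viewport
// passed in explicitly so it can run outside the client class.
function normalizeCoordinates(x: number, y: number, viewport: Viewport) {
  const clampedX = Math.min(999, Math.max(0, x));
  const clampedY = Math.min(999, Math.max(0, y));
  return {
    x: Math.floor((clampedX / 1000) * viewport.width),
    y: Math.floor((clampedY / 1000) * viewport.height),
  };
}

const viewport: Viewport = { width: 1288, height: 711 };

// Center of the 0-1000 grid: 0.5 * 1288 = 644, floor(0.5 * 711) = 355
const center = normalizeCoordinates(500, 500, viewport); // { x: 644, y: 355 }

// Out-of-range values are clamped to 999 before scaling:
// floor(0.999 * 1288) = 1286, floor(0.999 * 711) = 710
const corner = normalizeCoordinates(1000, 1000, viewport); // { x: 1286, y: 710 }

// Negative values are clamped to 0: floor(0.25 * 711) = 177
const edge = normalizeCoordinates(-20, 250, viewport); // { x: 0, y: 177 }
```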
Safety Confirmations: Google CUA may request safety confirmations for sensitive actions:
const agent = stagehand.agent({
  mode: "cua",
  model: { /* ... */ },
  safetyConfirmationHandler: async (safetyChecks) => {
    console.log("Safety checks:", safetyChecks);
    return { acknowledged: true };
  },
});

OpenAI CUA Client

Location: packages/core/lib/v3/agent/OpenAICUAClient.ts
Key Features:
  • Uses OpenAI’s Responses API for computer use (preview)
  • Tracks reasoning items across conversation
  • Supports function calls alongside computer actions
  • Maintains response history with previous_response_id
Configuration:
const agent = stagehand.agent({
  mode: "cua",
  model: {
    modelName: "openai/gpt-4o",
    apiKey: process.env.OPENAI_API_KEY,
    environment: "browser", // "browser", "mac", "windows", or "ubuntu"
  },
});
Response Types:
  • computer_call - Computer action request
  • function_call - Custom tool invocation
  • reasoning - Model’s internal reasoning
  • message - Text response to user
Computer Call Flow:
// 1. Model returns computer_call
{
  type: "computer_call",
  call_id: "call_123",
  action: {
    type: "click",
    x: 100,
    y: 200
  }
}

// 2. Execute action and capture screenshot
// 3. Return computer_call_output
{
  type: "computer_call_output",
  call_id: "call_123",
  output: {
    type: "input_image",
    image_url: "data:image/png;base64,...",
    current_url: "https://example.com"
  }
}
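
Step 3 of the flow above can be sketched as a helper that pairs a screenshot with the `call_id` of the request it answers. The interfaces mirror the object shapes shown on this page; `buildComputerCallOutput` is a hypothetical helper name, not a function exported by Stagehand or the OpenAI SDK.

```typescript
// Shapes copied from the flow above; field names beyond those are not assumed.
interface ComputerCall {
  type: "computer_call";
  call_id: string;
  action: { type: string; [key: string]: unknown };
}

interface ComputerCallOutput {
  type: "computer_call_output";
  call_id: string;
  output: {
    type: "input_image";
    image_url: string;
    current_url: string;
  };
}

// Hypothetical helper: wraps a base64 PNG screenshot in the reply object.
function buildComputerCallOutput(
  call: ComputerCall,
  screenshotBase64: string,
  currentUrl: string,
): ComputerCallOutput {
  return {
    type: "computer_call_output",
    // The reply must carry the same call_id as the request it answers.
    call_id: call.call_id,
    output: {
      type: "input_image",
      image_url: `data:image/png;base64,${screenshotBase64}`,
      current_url: currentUrl,
    },
  };
}

const exampleCall: ComputerCall = {
  type: "computer_call",
  call_id: "call_123",
  action: { type: "click", x: 100, y: 200 },
};

const exampleOutput = buildComputerCallOutput(
  exampleCall,
  "iVBORw0KGgo", // placeholder base64 data, not a real screenshot
  "https://example.com",
);
```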

Browser Configuration

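IMPORTANT: Computer Use relies on coordinate-based actions, so the browser must run at specific, known viewport dimensions; the examples in this guide use 1288×711. Configure this in stagehand.config.ts: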
export default {
  browserOptions: {
    headless: false,
    defaultViewport: {
      width: 1288,
      height: 711,
    },
  },
};
Or set at runtime:
const stagehand = new Stagehand({
  env: "LOCAL",
  browserOptions: {
    defaultViewport: { width: 1288, height: 711 },
  },
});

Action Handlers

CUA clients use action handlers to execute browser actions:
// Set in AgentContext (packages/core/lib/v3/agent/AgentContext.ts)
this.cuaClient.setActionHandler(async (action: AgentAction) => {
  switch (action.type) {
    case "click":
      await page.mouse.click(action.x, action.y);
      break;
    case "type":
      await page.keyboard.type(action.text);
      break;
    case "scroll":
      await page.mouse.wheel(action.scroll_x, action.scroll_y);
      break;
    // ... other actions
  }
});

Screenshot Providers

All CUA clients require a screenshot provider:
this.cuaClient.setScreenshotProvider(async () => {
  const page = await this.v3.context.awaitActivePage();
  const screenshot = await page.screenshot();
  return screenshot.toString("base64");
});

Image Compression

To reduce token usage, Stagehand compresses images in conversation history:
// Anthropic: compressConversationImages()
// Keeps first 2 images, compresses remaining to 25% quality

// Google: compressGoogleConversationImages()
// Similar compression strategy for Google's format

Custom Tools with CUA

You can combine Computer Use with custom tools:
import { tool } from "ai";
import { z } from "zod";

const getWeather = tool({
  description: "Get weather for a location",
  inputSchema: z.object({
    location: z.string(),
  }),
  execute: async ({ location }) => {
    // Your API call here
    return { temp: 70, conditions: "sunny" };
  },
});

const agent = stagehand.agent({
  mode: "cua",
  model: { /* ... */ },
  tools: { getWeather },
});
See packages/core/examples/agent-custom-tools.ts for a complete example.

Best Practices

  1. Set appropriate maxSteps: CUA tasks typically need 10-20 steps
  2. Use specific system prompts: Include context about the current page and date
  3. Handle errors gracefully: CUA actions can fail; implement retry logic
  4. Monitor token usage: Screenshots consume many tokens; use compression
  5. Test viewport dimensions: Ensure coordinates map correctly to your viewport
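
For the retry logic mentioned in item 3, one possible approach is a generic wrapper with exponential backoff. This is a sketch under assumptions: `withRetry` is a hypothetical helper, not a Stagehand utility, and the attempt count and delays are arbitrary defaults.

```typescript
// Hypothetical helper: retries an async task with exponential backoff.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        // Wait baseDelayMs, then 2x, then 4x, ... before the next attempt.
        const delayMs = baseDelayMs * 2 ** (attempt - 1);
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Every attempt failed: surface the most recent error to the caller.
  throw lastError;
}

// Usage with an agent (assumes `agent` was created as shown earlier):
// const result = await withRetry(() =>
//   agent.execute({ instruction: "Fill out the contact form", maxSteps: 20 }),
// );
```

Note that retrying a whole `agent.execute` call restarts the task from whatever state the page is in after the failure, so it may be worth navigating back to a known URL inside the retried function.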

Example: Complete CUA Workflow

import { Stagehand } from "@browserbasehq/stagehand";
import chalk from "chalk";

const stagehand = new Stagehand({
  env: "LOCAL",
  verbose: 2,
  browserOptions: {
    defaultViewport: { width: 1288, height: 711 },
  },
});

await stagehand.init();

const page = stagehand.context.pages()[0];

// Navigate before creating the agent so that page.url() in the system
// prompt reflects the target page rather than the pre-navigation tab.
await page.goto("https://www.browserbase.com/careers");

const agent = stagehand.agent({
  mode: "cua",
  model: {
    modelName: "anthropic/claude-sonnet-4-5",
    apiKey: process.env.ANTHROPIC_API_KEY,
  },
  systemPrompt: `You are a helpful assistant.
    Current page: ${page.url()}
    Date: ${new Date().toLocaleDateString()}`,
});

const result = await agent.execute({
  instruction: "Apply for the first engineer position with test data. Don't submit.",
  maxSteps: 20,
});

console.log(chalk.green("✓"), "Complete:", result.message);
console.log("Actions performed:", result.actions.length);
console.log("Token usage:", result.usage);

await stagehand.close();

References

  • Anthropic CUA: packages/core/lib/v3/agent/AnthropicCUAClient.ts
  • Google CUA: packages/core/lib/v3/agent/GoogleCUAClient.ts
  • OpenAI CUA: packages/core/lib/v3/agent/OpenAICUAClient.ts
  • Example: packages/core/examples/cua-example.ts
  • Custom Tools Example: packages/core/examples/agent-custom-tools.ts
