Overview
Computer Use APIs allow AI models to:
- See screenshots of web pages
- Click at specific coordinates
- Type text into fields
- Scroll, drag, and perform other mouse/keyboard actions
- Navigate between pages
Supported providers:
- Anthropic - Claude’s Computer Use API
- Google - Gemini’s Computer Use API
- OpenAI - GPT’s Computer Use API (preview)
Creating a CUA Agent
Provider-Specific Implementations
Anthropic CUA Client
Location: packages/core/lib/v3/agent/AnthropicCUAClient.ts
Key Features:
- Uses Anthropic’s Messages API with the computer_20251124 tool
- Supports Claude 4.5+ models with extended thinking budgets
- Handles image compression in conversation history
- Converts between Anthropic’s coordinate system and Playwright actions
Supported actions:
- screenshot - Capture current page state
- click - Click at x,y coordinates
- type - Type text
- keypress - Press keyboard keys
- scroll - Scroll in a direction
- move - Move mouse cursor
- drag - Drag between coordinates
- doubleClick - Double-click at coordinates
Google CUA Client
Location: packages/core/lib/v3/agent/GoogleCUAClient.ts
Key Features:
- Uses Google’s computerUse tool with Gemini models
- Normalizes coordinates from a 0-1000 range to viewport dimensions
- Supports both browser and desktop environments
- Handles safety confirmations for sensitive actions
Supported functions:
- open_web_browser - Open browser
- click_at - Click at coordinates
- type_text_at - Click and type at location
- key_combination - Press key combinations
- scroll_document - Scroll page up/down
- scroll_at - Scroll at specific location
- navigate - Go to URL
- go_back / go_forward - Browser navigation
- hover_at - Hover at coordinates
- drag_and_drop - Drag between points
- wait_5_seconds - Wait for page updates
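The coordinate normalization mentioned in the key features can be sketched as a small pure function. This is only an illustration: it assumes both axes use the same 0-1000 grid and that linear scaling with rounding and clamping is adequate. The names denormalizePoint, Viewport, and Point are hypothetical and are not taken from GoogleCUAClient.ts.

```typescript
// Hypothetical helper (not from the Stagehand source): maps a point on a
// 0-1000 grid to pixel coordinates inside a viewport.
interface Viewport {
  width: number;
  height: number;
}

interface Point {
  x: number;
  y: number;
}

const GRID_MAX = 1000;

function denormalizePoint(point: Point, viewport: Viewport): Point {
  const scale = (value: number, size: number): number => {
    // Clamp out-of-range model output back onto the grid.
    const clamped = Math.min(Math.max(value, 0), GRID_MAX);
    // Cap at size - 1 so that 1000 maps to the last pixel, not one past it.
    return Math.min(Math.round((clamped / GRID_MAX) * size), size - 1);
  };
  return {
    x: scale(point.x, viewport.width),
    y: scale(point.y, viewport.height),
  };
}
```

With a 1280x720 viewport, the grid midpoint (500, 500) maps to pixel (640, 360) under this scheme.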
OpenAI CUA Client
Location: packages/core/lib/v3/agent/OpenAICUAClient.ts
Key Features:
- Uses OpenAI’s Responses API for computer use (preview)
- Tracks reasoning items across conversation
- Supports function calls alongside computer actions
- Maintains response history with previous_response_id
Response item types:
- computer_call - Computer action request
- function_call - Custom tool invocation
- reasoning - Model’s internal reasoning
- message - Text response to user
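One way an agent loop might route these four item types is to group them before acting on them. The sketch below uses deliberately simplified, hypothetical item shapes (the real Responses API items carry more fields), and planStep and StepPlan are illustrative names rather than anything in OpenAICUAClient.ts.

```typescript
// Simplified, hypothetical item shapes; only the `type` tags come from the
// list above. Real API payloads contain additional fields.
type OutputItem =
  | { type: "computer_call"; call_id: string; action: { type: string } }
  | { type: "function_call"; call_id: string; name: string; arguments: string }
  | { type: "reasoning"; id: string }
  | { type: "message"; text: string };

interface StepPlan {
  computerCalls: Extract<OutputItem, { type: "computer_call" }>[];
  functionCalls: Extract<OutputItem, { type: "function_call" }>[];
  reasoningIds: string[];
  messages: string[];
}

// Groups one response's output items so the caller can execute browser
// actions, run custom tools, and carry reasoning items forward separately.
function planStep(items: OutputItem[]): StepPlan {
  const plan: StepPlan = {
    computerCalls: [],
    functionCalls: [],
    reasoningIds: [],
    messages: [],
  };
  for (const item of items) {
    switch (item.type) {
      case "computer_call":
        plan.computerCalls.push(item);
        break;
      case "function_call":
        plan.functionCalls.push(item);
        break;
      case "reasoning":
        plan.reasoningIds.push(item.id);
        break;
      case "message":
        plan.messages.push(item.text);
        break;
      default: {
        // Compile-time exhaustiveness check: fails to type-check if a new
        // item type is added to OutputItem without a matching case.
        const unreachable: never = item;
        throw new Error(`Unhandled item: ${JSON.stringify(unreachable)}`);
      }
    }
  }
  return plan;
}
```

Using a discriminated union here means the compiler flags any item type that the loop forgets to handle.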
Browser Configuration
IMPORTANT: Computer Use requires specific browser dimensions. Configure them in stagehand.config.ts.
Action Handlers
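As a rough sketch of what an action handler can look like, the dispatcher below maps a few model-emitted actions onto mouse and keyboard calls. The BrowserAction shapes, the PageLike interface, and executeAction are all hypothetical names for illustration. PageLike is modelled on the subset of Playwright's mouse and keyboard methods used here, so a Playwright page should satisfy it structurally, but the actual handler signature in Stagehand may differ.

```typescript
// Illustrative action shapes; the real clients define their own types.
type BrowserAction =
  | { type: "click"; x: number; y: number }
  | { type: "doubleClick"; x: number; y: number }
  | { type: "type"; text: string }
  | { type: "keypress"; keys: string[] }
  | { type: "scroll"; x: number; y: number; deltaX: number; deltaY: number };

// The subset of a Playwright-style page that this sketch relies on.
interface PageLike {
  mouse: {
    click(x: number, y: number): Promise<void>;
    dblclick(x: number, y: number): Promise<void>;
    move(x: number, y: number): Promise<void>;
    wheel(deltaX: number, deltaY: number): Promise<void>;
  };
  keyboard: {
    type(text: string): Promise<void>;
    press(key: string): Promise<void>;
  };
}

async function executeAction(page: PageLike, action: BrowserAction): Promise<void> {
  switch (action.type) {
    case "click":
      await page.mouse.click(action.x, action.y);
      break;
    case "doubleClick":
      await page.mouse.dblclick(action.x, action.y);
      break;
    case "type":
      await page.keyboard.type(action.text);
      break;
    case "keypress":
      // Playwright accepts chords such as "Control+a" in a single press call.
      await page.keyboard.press(action.keys.join("+"));
      break;
    case "scroll":
      // Move the pointer first so the wheel event targets the right element.
      await page.mouse.move(action.x, action.y);
      await page.mouse.wheel(action.deltaX, action.deltaY);
      break;
    default: {
      // Compile-time exhaustiveness check for newly added action types.
      const unreachable: never = action;
      throw new Error(`Unsupported action: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

Keeping the page surface behind a small interface also makes the handler easy to exercise with a recording fake instead of a live browser.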
CUA clients use action handlers to execute browser actions.
Screenshot Providers
All CUA clients require a screenshot provider.
Image Compression
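One plausible way to cut image tokens in a conversation history is to keep only the most recent screenshots and swap older ones for a short text placeholder. The sketch below illustrates that idea only. The source does not say how Stagehand's compression works, so the strategy itself, the message shapes, the compressHistoryImages name, and the default of keeping two images are all assumptions.

```typescript
// Hypothetical message shapes for illustration; provider SDKs use their own.
type Part =
  | { kind: "text"; text: string }
  | { kind: "image"; base64: string };

interface Message {
  role: "user" | "assistant";
  parts: Part[];
}

const PLACEHOLDER = "[screenshot removed to save tokens]";

// Returns a copy of the history in which only the `keepLast` most recent
// images survive; older images become text placeholders. The input array
// and its messages are not mutated.
function compressHistoryImages(history: Message[], keepLast = 2): Message[] {
  let remaining = keepLast;
  // Walk newest-to-oldest so the most recent images are the ones kept.
  const newestFirst = [...history].reverse().map((message): Message => {
    const parts = [...message.parts]
      .reverse()
      .map((part): Part => {
        if (part.kind !== "image") return part;
        if (remaining > 0) {
          remaining -= 1;
          return part;
        }
        return { kind: "text", text: PLACEHOLDER };
      })
      .reverse();
    return { role: message.role, parts };
  });
  // Restore chronological order.
  return newestFirst.reverse();
}
```

Dropping old screenshots trades away visual context from earlier steps, so the number of images to keep is worth tuning per task.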
To reduce token usage, Stagehand compresses images in conversation history.
Custom Tools with CUA
You can combine Computer Use with custom tools.
Best Practices
- Set appropriate maxSteps: CUA tasks typically need 10-20 steps
- Use specific system prompts: Include context about the current page and date
- Handle errors gracefully: CUA actions can fail; implement retry logic
- Monitor token usage: Screenshots consume many tokens; use compression
- Test viewport dimensions: Ensure coordinates map correctly to your viewport
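The retry advice above can be sketched with a small generic wrapper. This is a minimal illustration assuming exponential backoff is appropriate for the failing action. The withRetry name, its options, and the injectable sleep function are hypothetical and not part of Stagehand.

```typescript
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  // Injectable for tests; defaults to a real timer-based delay.
  sleep?: (ms: number) => Promise<void>;
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `operation` up to `maxAttempts` times, doubling the delay after each
// failure (baseDelayMs, 2x, 4x, ...). Rethrows the last error if every
// attempt fails.
async function withRetry<T>(
  operation: (attempt: number) => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  const { maxAttempts, baseDelayMs, sleep = defaultSleep } = options;
  if (maxAttempts < 1) {
    throw new RangeError("maxAttempts must be at least 1");
  }
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      // Skip the delay after the final attempt; there is nothing to wait for.
      if (attempt < maxAttempts) {
        await sleep(baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  throw lastError;
}
```

A flaky browser action could then be wrapped as withRetry(() => performAction(), { maxAttempts: 3, baseDelayMs: 500 }), where performAction stands in for whatever call is being retried.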
Example: Complete CUA Workflow
References
- Anthropic CUA: packages/core/lib/v3/agent/AnthropicCUAClient.ts
- Google CUA: packages/core/lib/v3/agent/GoogleCUAClient.ts
- OpenAI CUA: packages/core/lib/v3/agent/OpenAICUAClient.ts
- Example: packages/core/examples/cua-example.ts
- Custom Tools Example: packages/core/examples/agent-custom-tools.ts