Browser Automation

Overview

Kortix agents can control a real browser using natural language commands. This lets an agent interact with a website much as a human would: clicking buttons, filling in forms, scrolling pages, and extracting structured data. Browser automation is powered by Stagehand, which runs in a sandboxed environment and returns a screenshot with every action for visual feedback.

Core Functions

The browser tool provides four essential functions that handle all web automation tasks:

Navigate to URLs

browser_navigate_to(url="https://example.com")

Navigates to any URL and loads the page.

Perform Actions

browser_act(action="click the login button")
browser_act(action="fill in email with user@example.com")
browser_act(action="scroll down")
browser_act(action="select 'Premium' from the dropdown")

Performs any browser action using natural language descriptions:

Click any element (buttons, links, images)
Fill forms (text, emails, passwords)
Select dropdown options
Scroll pages
Keyboard input (Enter, Tab, Escape)
Upload files (with filePath parameter)

Extract Content

browser_extract_content(instruction="get all product names and prices")
browser_extract_content(instruction="extract the main article text")

Extracts structured data from web pages using natural language instructions.

Take Screenshots

browser_screenshot(name="homepage")

Captures the current page state. Screenshots are automatically included with every action for visual validation.

Real-World Examples

Example 1: Login Flow

# Navigate to login page
browser_navigate_to(url="https://app.example.com/login")

# Fill in credentials
browser_act(action="click the email field")
browser_act(action="type user@example.com")
browser_act(action="click the password field")
browser_act(
    action="type %password%",
    variables={"password": "secure_pass"}
)

# Submit form
browser_act(action="click the Sign In button")

Example 2: Data Extraction

# Navigate to product page
browser_navigate_to(url="https://shop.example.com/products")

# Scroll to load all products
browser_act(action="scroll to bottom")

# Extract product data
result = browser_extract_content(
    instruction="extract all products with name, price, and rating"
)

Example 3: Multi-Step Research

# Research a company website
browser_navigate_to(url="https://example.io")

# Browse key pages
browser_act(action="click the Features link")
features = browser_extract_content(instruction="extract feature descriptions")

browser_act(action="click Pricing")
pricing = browser_extract_content(instruction="get pricing tiers and costs")

browser_act(action="click About Us")
company_info = browser_extract_content(instruction="extract company mission and team size")
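The three click-then-extract steps above follow one pattern, so they can also be driven from a table. The sketch below is only an illustration: `research_site` and `PAGES` are hypothetical names, and `act` and `extract` are plain callables standing in for `browser_act` and `browser_extract_content`, whose return types are not specified here.

```python
from typing import Callable


def research_site(
    pages: list[tuple[str, str, str]],
    act: Callable[[str], object],
    extract: Callable[[str], object],
) -> dict[str, object]:
    """Run one click-then-extract step per (key, click_action, instruction)."""
    findings: dict[str, object] = {}
    for key, click_action, instruction in pages:
        act(click_action)                     # stands in for browser_act(action=...)
        findings[key] = extract(instruction)  # stands in for browser_extract_content(...)
    return findings


# The same three pages as in the example above.
PAGES = [
    ("features", "click the Features link", "extract feature descriptions"),
    ("pricing", "click Pricing", "get pricing tiers and costs"),
    ("company_info", "click About Us", "extract company mission and team size"),
]
```

Passing the two callables in keeps the loop independent of how the browser tool is actually invoked, and makes it easy to dry-run the plan with stand-ins.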

Implementation Details

From the source code (browser_tool.py:108-542):

@tool_metadata(
    display_name="Browser",
    description="Interact with web pages using mouse and keyboard, take screenshots, and extract content",
    icon="Globe",
    color="bg-cyan-100 dark:bg-cyan-800/50"
)
class BrowserTool(SandboxToolsBase):
    """
    Browser Tool for browser automation using local Stagehand API.
    
    Only 4 core functions that can handle everything:
    - browser_navigate_to: Navigate to URLs
    - browser_act: Perform any action (click, type, scroll, dropdowns etc.)
    - browser_extract_content: Extract content from pages
    - browser_screenshot: Take screenshots
    """

Architecture

Stagehand API Server: Runs on port 8004 inside the sandbox
Health Checks: Automatic retry with exponential backoff
Screenshot Validation: Every action returns a screenshot for verification
Secure Variables: Sensitive data (such as passwords) passed through variables is not shared with LLM providers
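The retry-with-exponential-backoff idea behind the health checks can be sketched as follows. This is not the code in browser_tool.py: `wait_until_healthy`, its parameters, and the delay values are assumptions, and the probe is passed in as a callable so that no particular health endpoint is implied.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` until it returns True, doubling the wait after each failure.

    Hypothetical helper: names and defaults are illustrative only.
    """
    delay = base_delay
    for attempt in range(max_attempts):
        try:
            if probe():
                return True
        except Exception:
            # Treat a raised error (e.g. connection refused) as "not ready yet".
            pass
        if attempt < max_attempts - 1:
            sleep(delay)
            delay *= 2  # exponential backoff: 0.5s, 1s, 2s, ...
    return False
```

In practice the probe might be a small function that requests a status route on port 8004 and returns whether the request succeeded; the exact route is not stated in this document.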

Validation and Error Handling

From browser_tool.py:127-194:

def _validate_base64_image(self, base64_string: str, max_size_mb: int = 10) -> tuple[bool, str]:
    """
    Comprehensive validation of base64 image data.
    
    - Checks string length and format
    - Validates base64 characters
    - Decodes and verifies image data
    - Checks file size limits
    - Validates image format using PIL
    """

Security Features

Variables Parameter

For sensitive data like passwords, use the variables parameter:

browser_act(
    action="fill in password with %pass%",
    variables={"pass": "actual_password"}
)

For security, variable values are NOT shared with LLM providers.
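One way to picture this: the action string with its placeholder is the only text a model needs to see, and the real value is filled in locally at the moment of typing. The sketch below illustrates that substitution step under the `%name%` syntax used in the example; `substitute_variables` is a hypothetical helper and not Stagehand's implementation.

```python
import re

# Matches placeholders of the form %name%.
_PLACEHOLDER = re.compile(r"%(\w+)%")


def substitute_variables(action: str, variables: dict[str, str]) -> str:
    """Replace each %name% in `action` with variables[name].

    Raises KeyError if a placeholder has no matching variable.
    """
    def lookup(match: re.Match) -> str:
        return variables[match.group(1)]

    return _PLACEHOLDER.sub(lookup, action)


template = "fill in password with %pass%"  # contains no secret
resolved = substitute_variables(template, {"pass": "actual_password"})  # keep local
```

Only `template` would be safe to log or include in a prompt; `resolved` contains the secret and should stay inside the sandbox.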

Sandboxed Execution

All browser actions run in an isolated sandbox environment:

No access to host system
Temporary, disposable instances
Safe for any website

File Upload Support

For actions involving file uploads:

browser_act(
    action="click the upload button",
    filePath="/workspace/documents/resume.pdf"
)

Always include the filePath parameter when dealing with upload-related elements to prevent accidental file dialog triggers.
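Because a missing file would otherwise surface only after the upload element has been clicked, it can help to check the path first. A minimal sketch, assuming the check runs somewhere the sandbox path is visible; `build_upload_call` is a hypothetical helper that only assembles the arguments shown above and does not call the browser tool.

```python
from pathlib import Path


def build_upload_call(action: str, file_path: str) -> dict[str, str]:
    """Return keyword arguments for an upload action after checking the file."""
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(f"upload source not found: {file_path}")
    return {"action": action, "filePath": str(path)}
```

The returned dictionary mirrors the `action` and `filePath` parameters from the example, so it could be unpacked into the call once the check has passed.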

Best Practices

1. Direct URL Research

When researching a specific website, browse it directly:

# ✅ GOOD: Direct navigation
browser_navigate_to(url="https://example.io")
browser_extract_content(instruction="get product features")

# ❌ BAD: Generic web search
web_search(query="example.io features")

2. Screenshot Validation

Every action returns a screenshot. Review it to verify expected results:

# Action returns screenshot automatically
result = browser_act(action="click the Submit button")
# Check the screenshot to confirm button was clicked

3. Information Reuse

Once content is extracted, use it as the primary source:

# Extract content once
product_data = browser_extract_content(
    instruction="get product information"
)

# ✅ Use extracted data for deliverables
# ❌ Don't override with web search results

When to Use Browser vs Other Tools

Use Browser For

Interacting with dynamic websites
Filling forms or multi-step flows
Sites requiring clicks/JavaScript
Visual inspection needed
Login-protected content

Use Alternative Tools

Static content → scrape_webpage
API data → API tools
GitHub URLs → gh CLI
Simple page reads → web_search

Limitations

Requires GEMINI_API_KEY configuration
Browser startup takes a few seconds
Not suitable for high-frequency automation
Screenshots consume additional storage

Configuration

Browser automation requires:

GEMINI_API_KEY=your_api_key_here

The Stagehand API server starts automatically in the sandbox and listens on port 8004.
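A startup check for the required key might look like the sketch below. `require_gemini_key` is a hypothetical helper, not part of Kortix; it only confirms that the variable is present and non-empty, not that the key itself is valid.

```python
import os
from collections.abc import Mapping


def require_gemini_key(env: Mapping[str, str] = os.environ) -> str:
    """Return GEMINI_API_KEY from `env`, or raise if it is missing or blank."""
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; browser automation requires it")
    return key
```

Failing early with a clear message is usually easier to diagnose than an authentication error raised later, in the middle of a browser session.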
