Overview

The Browser Automation toolkit enables AI agents to interact with websites autonomously, performing complex web-based tasks like research, form filling, e-commerce transactions, and data extraction without manual intervention.

Features

  • Autonomous Navigation: AI-driven web browsing with intelligent decision-making
  • Form Interaction: Automatic form filling, button clicking, and data entry
  • E-commerce Automation: Order placement, cart management, and checkout
  • Web Research: Information gathering and content summarization
  • Screenshot Capture: Visual verification of browser actions
  • Multi-platform Support: Works with websites that load in Chrome, subject to the limitations listed below

Installation

The browser automation toolkit uses the browser-use library:
pip install browser-use langchain-anthropic langchain-openai
Requires Google Chrome to be installed on your system.
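
Because the toolkit depends on a local Chrome install, a quick pre-flight check can catch a missing browser before any task runs. The following is a minimal sketch using only the standard library; the executable names it searches for are assumptions and may not match your system.

```python
import shutil
from typing import Iterable, Optional

# Executable names to look for on PATH (assumed; adjust for your platform).
CANDIDATE_NAMES = ("google-chrome", "google-chrome-stable", "chrome", "chromium")


def find_chrome(names: Iterable[str] = CANDIDATE_NAMES) -> Optional[str]:
    """Return the full path of the first Chrome-like executable on PATH, or None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    location = find_chrome()
    print(location or "Chrome not found on PATH; set chrome_instance_path explicitly.")
```

On macOS and Windows, Chrome is often not on PATH, so a `None` result there does not necessarily mean Chrome is missing.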

Architecture

The toolkit consists of two main components:

1. BrowserTool

The core tool that wraps browser-use functionality for LangChain agents.

2. BrowserToolkit

A toolkit wrapper that provides the tool in a LangChain-compatible format.

Quick Start

Basic Usage

from browser_agent import BrowserTool
import asyncio

# Initialize the tool
browser_tool = BrowserTool()

# Run a simple task
async def browse():
    result = await browser_tool._arun(
        task="Go to Wikipedia and summarize the article about Python programming"
    )
    print(result)

asyncio.run(browse())

Synchronous Usage

# Use the synchronous wrapper
result = browser_tool._run(
    task="Search Google for 'machine learning tutorials' and return the top 3 results"
)
print(result)
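
The body of `_run` is not shown on this page. One common way to offer a synchronous wrapper over an async method is to drive the coroutine with `asyncio.run`. The sketch below illustrates that pattern with a hypothetical `EchoTool` stand-in; it is not the actual `BrowserTool` code.

```python
import asyncio


class EchoTool:
    """Hypothetical stand-in for a tool with async and sync entry points."""

    async def _arun(self, task: str) -> str:
        # A real tool would drive the browser here; this stub just echoes.
        await asyncio.sleep(0)
        return f"completed: {task}"

    def _run(self, task: str) -> str:
        # Start a fresh event loop, run the coroutine, and return its result.
        # asyncio.run raises RuntimeError if a loop is already running in this
        # thread (for example, inside a Jupyter notebook).
        return asyncio.run(self._arun(task))


tool = EchoTool()
print(tool._run("open example.com"))
```

If a wrapper works this way, code that is already async should await the async method directly instead of calling the synchronous one.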

BrowserTool Implementation

browser_agent/browser_tool.py
from typing import Literal

from browser_use import Agent, Browser, BrowserConfig
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import BaseTool
from pydantic import Field

class BrowserTool(BaseTool):
    """Tool for autonomous web browsing and research."""
    
    name: Literal["browser_agent"] = "browser_agent"
    description: str = """Use this tool for web-based tasks requiring browser interaction.
    Input should be a clear description of what you want to accomplish online.
    
    Examples:
    - "Order a large pepperoni pizza from Domino's"
    - "Browse Amazon and add a Nintendo Switch to cart"
    - "Research World War II and summarize key points"
    - "Compare flight prices from NYC to London"
    """
    
    llm: ChatAnthropic = Field(
        default_factory=lambda: ChatAnthropic(model="claude-3-5-sonnet-latest")
    )
    
    browser: Browser = Field(
        default_factory=lambda: Browser(
            config=BrowserConfig(
                chrome_instance_path='/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
            )
        )
    )

Configuration

Custom Chrome Path

from browser_use import Browser, BrowserConfig

# Windows
browser = Browser(
    config=BrowserConfig(
        chrome_instance_path='C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe'
    )
)

# Linux
browser = Browser(
    config=BrowserConfig(
        chrome_instance_path='/usr/bin/google-chrome'
    )
)

# macOS (default)
browser = Browser(
    config=BrowserConfig(
        chrome_instance_path='/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'
    )
)
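
The three platform-specific paths above can be selected at runtime rather than edited by hand. This is a small sketch under the assumption that Chrome lives at the locations listed in this section; `chrome_path_for` is a hypothetical helper, not part of `browser_use`.

```python
import platform
from typing import Optional

# Paths taken from the examples in this section; your install location may differ.
CHROME_PATHS = {
    "Windows": "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe",
    "Linux": "/usr/bin/google-chrome",
    "Darwin": "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
}


def chrome_path_for(system: Optional[str] = None) -> str:
    """Return the documented Chrome path for the given (or current) OS."""
    system = system or platform.system()  # "Windows", "Linux", or "Darwin" (macOS)
    try:
        return CHROME_PATHS[system]
    except KeyError:
        raise ValueError(f"No known Chrome path for platform {system!r}") from None


print(chrome_path_for("Linux"))
```

The returned string could then be passed as `chrome_instance_path` when building a `BrowserConfig`.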

Custom LLM

from langchain_anthropic import ChatAnthropic
from browser_agent import BrowserTool

# Use a different Claude model
custom_llm = ChatAnthropic(
    model="claude-3-opus-20240229",
    temperature=0.7,
    max_tokens=4096
)

browser_tool = BrowserTool(llm=custom_llm)

Browser Configuration Options

from browser_use import Browser, BrowserConfig

config = BrowserConfig(
    chrome_instance_path='/path/to/chrome',
    headless=False,  # Set to True for headless mode
    disable_security=False,
    window_width=1920,
    window_height=1080,
    save_screenshots=True,
    screenshot_dir='./screenshots'
)

browser = Browser(config=config)

Common Use Cases

1. E-commerce Automation

# Order food delivery
result = browser_tool._run(
    task="Order a large pepperoni pizza from Domino's for delivery to my address"
)

# Shopping cart management
result = browser_tool._run(
    task="Browse Amazon and add a Nintendo Switch, two controllers, and Zelda game to cart"
)

# Price comparison
result = browser_tool._run(
    task="Compare prices for iPhone 15 Pro on Amazon, Best Buy, and Apple.com"
)
The tool assumes billing and shipping information is already saved on websites. It will not ask for payment details.

2. Research and Information Gathering

# Academic research
result = browser_tool._run(
    task="Research the key events of World War II and create a timeline with dates"
)

# Technical documentation
result = browser_tool._run(
    task="Find the official Python documentation for asyncio and summarize the main concepts"
)

# Market research
result = browser_tool._run(
    task="Research the top 5 CRM software solutions and compare their pricing and features"
)

3. Form Filling and Account Management

# Newsletter signup
result = browser_tool._run(
    task="Sign up for the TechCrunch newsletter using my email"
)

# Gym membership
result = browser_tool._run(
    task="Navigate to Planet Fitness website and start the membership signup process"
)

# Service scheduling
result = browser_tool._run(
    task="Schedule a grocery delivery from Whole Foods for tomorrow between 3-5 PM"
)

4. Travel and Booking

# Flight search
result = browser_tool._run(
    task="Search for round-trip flights from NYC to London departing next month, return cheapest options"
)

# Hotel booking research
result = browser_tool._run(
    task="Find hotels in San Francisco near Moscone Center under $200/night with good reviews"
)

BrowserToolkit for LangChain Agents

browser_agent/browser_toolkit.py
from langchain_core.tools import BaseTool
from typing import List
from .browser_tool import BrowserTool

class BrowserToolkit:
    """Toolkit for browser automation capabilities."""
    
    def __init__(self, llm=None):
        self.llm = llm
    
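    def get_tools(self) -> List[BaseTool]:
        """Get the list of tools in the toolkit."""
        # Pass llm only when one was supplied, so BrowserTool's default applies otherwise
        kwargs = {"llm": self.llm} if self.llm is not None else {}
        return [BrowserTool(**kwargs)]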
    
    @classmethod
    def from_llm(cls, llm=None) -> "BrowserToolkit":
        """Create a BrowserToolkit from an LLM."""
        return cls(llm=llm)

Using with LangChain Agents

from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from browser_agent import BrowserToolkit

# Initialize LLM
llm = ChatAnthropic(model="claude-3-5-sonnet-latest")

# Create toolkit and get tools
toolkit = BrowserToolkit.from_llm(llm)
tools = toolkit.get_tools()

# Create agent prompt
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant with web browsing capabilities."),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])

# Create agent (the tool-calling agent is provider-agnostic and works with Claude models)
agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

# Run tasks
result = agent_executor.invoke({
    "input": "Research the latest AI news and summarize the top 3 stories"
})

print(result['output'])

Advanced Usage

Multi-step Workflows

import asyncio
from browser_agent import BrowserTool

async def complex_workflow():
    browser_tool = BrowserTool()
    
    # Step 1: Research
    research = await browser_tool._arun(
        task="Find the top-rated Italian restaurants in San Francisco on Yelp"
    )
    
    # Step 2: Price comparison (pass step 1's output in case context is not kept between calls)
    prices = await browser_tool._arun(
        task=f"Compare menu prices for the top 3 restaurants in this list: {research}"
    )
    
    # Step 3: Availability check (example - won't actually book)
    info = await browser_tool._arun(
        task=f"Get reservation availability for tonight at the highest-rated restaurant in this list: {research}"
    )
    
    return {
        "research": research,
        "prices": prices,
        "availability": info
    }

result = asyncio.run(complex_workflow())

Error Handling

import asyncio
from browser_agent import BrowserTool

async def safe_browse(task: str, max_retries: int = 3):
    browser_tool = BrowserTool()
    
    for attempt in range(max_retries):
        try:
            result = await browser_tool._arun(task)
            return result
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {str(e)}")
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(2 ** attempt)  # Exponential backoff
    
    return None

# Use with retry logic
result = asyncio.run(safe_browse(
    "Navigate to GitHub and find the top trending Python repositories"
))

Screenshot Capture

from browser_use import Browser, BrowserConfig
from browser_agent import BrowserTool

# Configure browser to save screenshots
config = BrowserConfig(
    chrome_instance_path='/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
    save_screenshots=True,
    screenshot_dir='./browser_screenshots'
)

browser = Browser(config=config)
browser_tool = BrowserTool(browser=browser)

# Screenshots will be automatically saved during execution
result = browser_tool._run(
    task="Navigate to the OpenAI website and capture the homepage"
)
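
Once a run finishes, the saved images can be inspected with the standard library. File naming and image format are not specified on this page, so the `.png` pattern below is an assumption, and `list_screenshots` is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def list_screenshots(directory: str, pattern: str = "*.png") -> List[Path]:
    """Return screenshot files in `directory`, oldest first by modification time."""
    folder = Path(directory)
    if not folder.is_dir():
        return []  # nothing saved yet, or the directory was never created
    return sorted(folder.glob(pattern), key=lambda p: p.stat().st_mtime)


for shot in list_screenshots("./browser_screenshots"):
    print(shot.name)
```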

Task Input Format

Be clear and specific in your task descriptions:
# Specific and actionable
"Order a large pepperoni pizza from Domino's for delivery"
"Find the documentation for React hooks and summarize useState"
"Search Amazon for wireless headphones under $100 and sort by rating"
"Navigate to GitHub, search for 'langchain', and get the star count"

Integration with Other Tools

from browser_agent import BrowserTool
from writing_agent import WritingTool
import asyncio

async def research_and_write(topic: str):
    # Step 1: Use browser to gather information
    browser = BrowserTool()
    research = await browser._arun(
        f"Research {topic} and gather key facts and statistics"
    )
    
    # Step 2: Use writing agent to create article
    writer = WritingTool()
    article = await writer._arun(
        query=f"Write a comprehensive article about {topic}. Use this research: {research}",
        target_length=1500
    )
    
    return article

# Create research-backed content
result = asyncio.run(research_and_write("Quantum Computing Applications"))
print(result)

Browser Agent Workflow

  1. Task Parsing: Claude analyzes the task description and formulates a browsing plan.
  2. Navigation: The agent navigates to relevant websites using intelligent path planning.
  3. Interaction: Performs actions like clicking, typing, scrolling, and form filling.
  4. Verification: Takes screenshots and verifies actions were successful.
  5. Result Extraction: Extracts relevant information and formats the response.
  6. Cleanup: Closes the browser and returns results to the caller.
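
The six stages above can be pictured as a small pipeline in which cleanup runs no matter what happens earlier. The sketch below uses a hypothetical `FakeBrowser` that only records actions; it illustrates the control flow and is not how the toolkit is implemented internally.

```python
import asyncio
from typing import Dict, List


class FakeBrowser:
    """Hypothetical stand-in that records actions instead of driving Chrome."""

    def __init__(self) -> None:
        self.log: List[str] = []
        self.closed = False

    async def do(self, action: str) -> None:
        self.log.append(action)

    async def close(self) -> None:
        self.closed = True


async def run_task(task: str, browser: FakeBrowser) -> Dict[str, object]:
    try:
        # 1. Task parsing: turn the description into an ordered plan.
        plan = [f"navigate for: {task}", "interact", "verify"]
        # 2-4. Navigation, interaction, verification: execute each planned step.
        for step in plan:
            await browser.do(step)
        # 5. Result extraction: package what was done for the caller.
        return {"task": task, "steps": list(browser.log)}
    finally:
        # 6. Cleanup: runs even if an earlier stage raised.
        await browser.close()


browser = FakeBrowser()
result = asyncio.run(run_task("find the docs", browser))
print(result["steps"], browser.closed)
```

Putting the close call in `finally` is what guarantees the last stage when an earlier one fails.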

Performance Considerations

Headless Mode

Use headless mode for faster execution when visual verification isn’t needed.

Browser Reuse

Keep browser instances alive between tasks when performing multiple operations.

Timeout Settings

Configure appropriate timeouts for complex tasks that may take longer.

Resource Management

Always close browser instances to free up system resources.
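
Browser reuse, timeouts, and cleanup can be combined in one pattern: open the browser once, bound each task with `asyncio.wait_for`, and close the browser in a `finally` block. This is a sketch with a hypothetical `StubBrowser`; the real `Browser` class may expose different methods, and this page does not document a timeout setting.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator, List


class StubBrowser:
    """Hypothetical browser handle; a real one would own a Chrome process."""

    def __init__(self) -> None:
        self.closed = False

    async def run(self, task: str, seconds: float = 0.0) -> str:
        await asyncio.sleep(seconds)  # simulate work
        return f"done: {task}"

    async def close(self) -> None:
        self.closed = True


@asynccontextmanager
async def open_browser() -> AsyncIterator[StubBrowser]:
    browser = StubBrowser()
    try:
        yield browser
    finally:
        await browser.close()  # released even if a task raises or times out


async def main() -> List[str]:
    results = []
    async with open_browser() as browser:  # one instance reused for every task
        for task, cost in [("fast task", 0.0), ("slow task", 0.5)]:
            try:
                results.append(
                    await asyncio.wait_for(browser.run(task, cost), timeout=0.1)
                )
            except asyncio.TimeoutError:
                results.append(f"timed out: {task}")
    return results


print(asyncio.run(main()))
```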

Limitations and Best Practices

Important Limitations:
  • Cannot bypass CAPTCHAs or advanced bot detection
  • Requires saved payment information for e-commerce
  • May struggle with heavily JavaScript-dependent sites
  • Performance depends on internet speed and website complexity

Best Practices

  1. Be Specific: Provide clear, detailed task descriptions
  2. Error Handling: Implement retry logic for unstable operations
  3. Resource Cleanup: Always close browser instances after use
  4. Rate Limiting: Add delays between requests to avoid triggering anti-bot measures
  5. Verification: Check results to ensure tasks completed successfully
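
The rate-limiting advice can be sketched as a helper that runs tasks one at a time with a pause between them. `run_with_delay` and `fake_runner` are hypothetical names; in practice the runner would be something like `browser_tool._arun`, and the right delay depends on the target site.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_with_delay(
    tasks: List[str],
    runner: Callable[[str], Awaitable[str]],
    delay_seconds: float = 2.0,
) -> List[str]:
    """Run tasks sequentially, pausing between consecutive tasks."""
    results = []
    for index, task in enumerate(tasks):
        if index > 0:
            await asyncio.sleep(delay_seconds)  # no pause before the first task
        results.append(await runner(task))
    return results


async def fake_runner(task: str) -> str:
    # Hypothetical stand-in for browser_tool._arun
    return task.upper()


print(asyncio.run(run_with_delay(["a", "b"], fake_runner, delay_seconds=0.01)))
```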

Security Considerations

# Never hardcode sensitive information
# BAD
result = browser_tool._run(
    task="Login to example.com with username: admin, password: secret123"
)

# GOOD - Use environment variables or secure credential storage
import os
username = os.getenv('APP_USERNAME')
password = os.getenv('APP_PASSWORD')

# Keep secrets out of the task text, which is sent to the LLM
result = browser_tool._run(
    task="Login to example.com with saved credentials"
)
For production use, integrate with secure credential management systems like AWS Secrets Manager or HashiCorp Vault.
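
Two small habits support the advice above: fail fast when a required variable is missing, and redact secrets before anything is logged. The helpers below are a hypothetical sketch using only the standard library; how credentials actually reach the browser session is not covered on this page.

```python
import os
from typing import List


def require_env(name: str) -> str:
    """Return an environment variable's value, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


def redact(text: str, secrets: List[str]) -> str:
    """Replace each secret with a placeholder before logging or printing."""
    for secret in secrets:
        if secret:  # skip empty strings, which would match everywhere
            text = text.replace(secret, "***")
    return text


os.environ.setdefault("APP_PASSWORD", "example-only")  # demo value, not a real secret
password = require_env("APP_PASSWORD")
print(redact(f"login attempted with {password}", [password]))
```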

Debugging

Run the browser in non-headless mode and save screenshots to see its actions:
from browser_use import Browser, BrowserConfig
from browser_agent import BrowserTool

config = BrowserConfig(
    chrome_instance_path='/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
    headless=False,  # See the browser in action
    save_screenshots=True,
    screenshot_dir='./debug_screenshots'
)

browser = Browser(config=config)
browser_tool = BrowserTool(browser=browser)

# Run task with visible browser
result = browser_tool._run("Navigate to example.com and click the login button")
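
For text-level detail alongside the visible browser, Python's standard `logging` module can be turned up before running a task. This sketch assumes the underlying library emits messages through `logging` under a logger named `browser_use`; that name is an assumption and should be checked against the library's documentation.

```python
import logging

# Show timestamps, logger names, and levels for every message at DEBUG and above.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
    force=True,  # replace handlers that may already be configured (Python 3.8+)
)

# Assumed logger name; adjust if the library uses a different one.
logging.getLogger("browser_use").setLevel(logging.DEBUG)

logging.getLogger(__name__).debug("debug logging enabled")
```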

Source Code Reference

Key files in the browser_agent module:
  • browser_tool.py:8-55 - Main BrowserTool implementation
  • browser_toolkit.py:6-20 - LangChain toolkit wrapper
  • __init__.py:1-6 - Module exports
