Vimium integration

What is Vimium?

Vimium is a Chrome extension that provides keyboard shortcuts for navigation, inspired by Vim. Its most powerful feature for automation is hint mode, which overlays every clickable element with a yellow box containing a unique 1-2 character label.

Why Vimium for AI automation?

Traditional web automation relies on DOM selectors (CSS, XPath) or accessibility trees. vimGPT takes a different approach:

Vision-first design

GPT-4V analyzes screenshots like a human would, without parsing HTML. Vimium overlays make element selection visual and explicit.

DOM-agnostic

No need to understand page structure, handle shadow DOM, or deal with dynamic class names. The model only needs to read visible labels.

Keyboard-only interaction

Everything is accomplished through keyboard commands, avoiding complex coordinate calculations or element locators.

Human-like interaction

Vimium is designed for humans. By using it, the AI mimics how power users navigate the web.

Loading Vimium in Playwright

Playwright does not load browser extensions by default, so vimGPT passes the unpacked Vimium directory to Chromium explicitly when it launches the browser:

vimbot.py

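from playwright.sync_api import sync_playwright

# Path to the unpacked Vimium extension downloaded by ./setup.sh
vimium_path = "./vimium-master"

class Vimbot:
    def __init__(self, headless=False):
        # Extensions need a persistent context; "" selects a temporary profile directory
        self.context = (
            sync_playwright()
            .start()
            .chromium.launch_persistent_context(
                "",
                headless=headless,
                args=[
                    # Disable every extension except Vimium, then load Vimium
                    f"--disable-extensions-except={vimium_path}",
                    f"--load-extension={vimium_path}",
                ],
                ignore_https_errors=True,
            )
        )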

Download Vimium source

Run ./setup.sh to download the Vimium extension locally.

Use persistent context

Regular browser contexts don’t support extensions; use launch_persistent_context instead.

Load extension via CLI args

Pass the --disable-extensions-except and --load-extension flags to Chromium.

Headless limitations: Chromium’s classic headless mode does not load extensions, so no hint overlays would appear. Keep headless=False, the default in Vimbot, so the browser runs with a visible window.
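If the extension directory is missing or mistyped, no hints will appear in the screenshots. A minimal sketch of a guard is shown below: it builds the same two flags used above after checking that the directory contains a manifest.json. The helper name extension_args is hypothetical and is not part of vimGPT.

```python
from pathlib import Path


def extension_args(extension_dir: str) -> list[str]:
    """Return Chromium flags that load one unpacked extension and disable all others."""
    path = Path(extension_dir).resolve()
    # Every unpacked Chrome extension has a manifest.json at its root.
    if not (path / "manifest.json").is_file():
        raise FileNotFoundError(f"no manifest.json found in {path}")
    return [
        f"--disable-extensions-except={path}",
        f"--load-extension={path}",
    ]
```

The returned list could be passed as args= to launch_persistent_context in place of the hard-coded list, so a missing download fails early with a clear error.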

Triggering Vimium overlays

Before capturing a screenshot, vimGPT activates Vimium’s hint mode:

vimbot.py

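# Imports this method relies on
from io import BytesIO

from PIL import Image

def capture(self):
    # Exit any focused input, then open Vimium's hint mode so the labels are on screen
    self.page.keyboard.press("Escape")
    self.page.keyboard.type("f")

    # page.screenshot() returns PNG bytes; decode them into an RGB PIL image
    screenshot = Image.open(BytesIO(self.page.screenshot())).convert("RGB")
    return screenshot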

The keyboard sequence

Press Escape: Leave any focused input field or active Vimium mode, so the next key reaches Vimium instead of being typed into the page
Type f: Activate Vimium’s “follow link” hint mode
Capture screenshot: Take the screenshot with overlays visible

Vimium displays hints for all clickable elements: links, buttons, form fields, and other interactive elements.
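The three steps above can be sketched as a standalone function that works on any Playwright sync page. The function name capture_with_hints and the settle_ms pause are assumptions added for illustration: vimGPT's capture takes the screenshot immediately, whereas this sketch waits briefly in case the labels have not rendered yet.

```python
def capture_with_hints(page, settle_ms: int = 200) -> bytes:
    """Show Vimium hints on a Playwright page and return a PNG screenshot."""
    page.keyboard.press("Escape")     # leave any focused input or open hint mode
    page.keyboard.type("f")           # ask Vimium to draw its hint labels
    page.wait_for_timeout(settle_ms)  # assumption: give the labels time to render
    return page.screenshot()          # PNG bytes, overlays included
```

Because the function only touches page.keyboard, page.wait_for_timeout, and page.screenshot, it can be exercised with a small stand-in object in tests without launching a browser.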

How GPT-4V uses the overlays

Once the screenshot includes Vimium overlays, GPT-4V can:

Identify clickable elements by the yellow boxes
Read the hint characters (e.g., “AA”, “AB”, “F”, “JK”)
Select the appropriate element based on the user’s objective
Return the hint characters in a JSON response

Example prompt excerpt from vision.py:

vision.py

"text": f"...if you want to click on an object, return the string with the yellow character sequence you want to click on..."

Example response:

{"click": "AB"}
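Model replies are not guaranteed to be clean JSON, so it can help to validate them before acting. The sketch below is one possible approach and is not taken from vimGPT: parse_action and ALLOWED_KEYS are hypothetical names, the key set is simply the keys mentioned on this page, and the letters-only check assumes Vimium's hint characters have not been reconfigured to include digits.

```python
import json
import re

# Assumption: only the action keys mentioned on this page are valid.
ALLOWED_KEYS = {"click", "type", "navigate", "done"}


def parse_action(reply: str) -> dict:
    """Extract and validate the JSON action from a model reply."""
    # The reply may wrap the JSON in prose, so take the text
    # from the first "{" to the last "}".
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    action = json.loads(match.group(0))  # JSONDecodeError is a ValueError
    unknown = set(action) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unexpected keys: {sorted(unknown)}")
    hint = action.get("click")
    if hint is not None and not (isinstance(hint, str) and hint.isalpha()):
        raise ValueError(f"hint must be letters only, got {hint!r}")
    return action
```

Every failure surfaces as a ValueError, so a caller could catch that single exception type and ask the model to try again.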

Executing Vimium clicks

Clicking in Vimium is simply typing the hint characters:

vimbot.py

def click(self, text):
    self.page.keyboard.type(text)

When a valid hint sequence is typed while hint mode is active, Vimium automatically:

Matches the input to an element
Triggers a click event on that element
Exits hint mode

No coordinates needed: Unlike traditional automation that calculates x/y positions, vimGPT just types “AB” and Vimium handles the rest.

Handling multiple actions

Some tasks require both clicking and typing (e.g., filling a search box):

vimbot.py

def perform_action(self, action):
    if "done" in action:
        return True
    if "click" in action and "type" in action:
        self.click(action["click"])  # Focus the input field
        self.type(action["type"])     # Enter text and press Enter
    elif "navigate" in action:
        self.navigate(action["navigate"])
    elif "type" in action:
        self.type(action["type"])
    elif "click" in action:
        self.click(action["click"])

Example JSON from GPT-4V:

{
  "click": "F",
  "type": "machine learning papers"
}

This clicks the search box (hint “F”) to focus it, then types the query and presses Enter to submit it.
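The order of the branches in perform_action decides which keys take effect when a response contains several. The sketch below mirrors that order in a side-effect-free function so the precedence can be checked without a browser. The name plan_steps is hypothetical and not part of vimGPT.

```python
def plan_steps(action: dict) -> list[tuple[str, str]]:
    """Mirror perform_action's branch order and list the calls it would make."""
    if "done" in action:
        return []  # the task is finished; nothing is executed
    if "click" in action and "type" in action:
        return [("click", action["click"]), ("type", action["type"])]
    for key in ("navigate", "type", "click"):  # same precedence as the elif chain
        if key in action:
            return [(key, action[key])]
    return []  # no recognised key: perform_action would do nothing
```

One consequence is that a response combining "navigate" with "click" only navigates, since the combined branch exists only for "click" plus "type".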

Advantages and limitations

Advantages

Simple and reliable: No brittle selectors or DOM parsing

Works across sites: The same loop runs on any page where Vimium can draw hints, with no site-specific logic

Visual feedback: Easy to debug by watching what the AI “sees”

Human-compatible: Uses a tool designed for humans

Limitations

Element occlusion: Yellow boxes can cover text or important UI elements

Limited resolution: Small hint characters may be hard for the model to read at low resolutions

Single-character ambiguity: The model sometimes struggles to distinguish similar characters

No non-clickable elements: Can’t interact with elements that Vimium does not label as clickable

Future enhancements

Potential improvements to the Vimium integration:

Custom Vimium fork: Overlay only relevant elements based on the user’s objective (context-aware pruning)
Larger/colored hints: Make labels more distinguishable for the vision model
Dual screenshots: Provide frames with and without overlays to prevent occlusion issues
Custom labeling: Use JavaScript to create custom colored boxes instead of Vimium

See the GitHub README for the full list of ideas.
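As an illustration of the dual-screenshot idea, the sketch below captures one frame without hints and one with them on a Playwright sync page. It is a guess at how the idea might look, not an existing vimGPT feature; capture_pair and settle_ms are hypothetical names.

```python
def capture_pair(page, settle_ms: int = 200) -> tuple[bytes, bytes]:
    """Return (clean, hinted) PNG screenshots of the same Playwright page."""
    page.keyboard.press("Escape")     # make sure no hints are showing
    page.wait_for_timeout(settle_ms)  # assumption: let the page repaint
    clean = page.screenshot()
    page.keyboard.type("f")           # draw the Vimium hints
    page.wait_for_timeout(settle_ms)
    hinted = page.screenshot()
    return clean, hinted
```

Hint mode is still active when the function returns, so a hint chosen from the second frame can be typed straight away.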
