What is Vimium?
Vimium is a Chrome extension that provides keyboard shortcuts for navigation, inspired by Vim. Its most powerful feature for automation is hint mode, which overlays every clickable element with a yellow box containing a unique 1-2 character label.
Why Vimium for AI automation?
Traditional web automation relies on DOM selectors (CSS, XPath) or accessibility trees. vimGPT takes a different approach:
Vision-first design
GPT-4V analyzes screenshots like a human would, without parsing HTML. Vimium overlays make element selection visual and explicit.
DOM-agnostic
No need to understand page structure, handle shadow DOM, or deal with dynamic class names. The model only needs to read visible labels.
Keyboard-only interaction
Everything is accomplished through keyboard commands, avoiding complex coordinate calculations or element locators.
Human-like interaction
Vimium is designed for humans. By using it, the AI mimics how power users navigate the web.
Loading Vimium in Playwright
The extension must be loaded explicitly, since Playwright does not load extensions by default. vimGPT handles this in vimbot.py.
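As a rough illustration, loading an unpacked copy of Vimium into a persistent Chromium context might look like the sketch below. The path `./vimium-master` and the helper names `extension_args` and `launch_with_vimium` are assumptions made for this example and may differ from what vimbot.py contains. `--load-extension` and `--disable-extensions-except` are standard Chromium switches.

```python
from pathlib import Path

# Assumed location of an unpacked Vimium checkout; adjust to your setup.
VIMIUM_PATH = "./vimium-master"


def extension_args(extension_dir: str) -> list[str]:
    """Build the Chromium flags that load a single unpacked extension."""
    path = str(Path(extension_dir).resolve())
    return [
        f"--disable-extensions-except={path}",
        f"--load-extension={path}",
    ]


def launch_with_vimium(extension_dir: str = VIMIUM_PATH):
    """Start Chromium with Vimium loaded and return (playwright, context)."""
    # Imported here so extension_args() stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    playwright = sync_playwright().start()
    context = playwright.chromium.launch_persistent_context(
        "",  # empty user_data_dir: Playwright uses a temporary profile
        headless=False,  # assumption: run headed so the overlays can be watched
        args=extension_args(extension_dir),
    )
    return playwright, context
```

When the session is over, `context.close()` followed by `playwright.stop()` releases the browser.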
Use persistent context
Regular browser contexts don’t support extensions; use launch_persistent_context instead.
Triggering Vimium overlays
Before capturing a screenshot, vimGPT activates Vimium’s hint mode (in vimbot.py).
The keyboard sequence
- Press Escape: Exit any existing input mode or focus
- Type f: Activate Vimium’s “follow link” hint mode
- Capture screenshot: Take the screenshot with overlays visible
Vimium displays hints for all clickable elements: links, buttons, form fields, and other interactive elements.
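The keyboard sequence above can be sketched as a small helper. This is a minimal illustration that assumes `page` is a Playwright sync `Page` (or any object with the same `keyboard` and `screenshot` methods) and that Vimium still uses its default `f` binding; the function name `capture_with_hints` is hypothetical.

```python
def capture_with_hints(page) -> bytes:
    """Show Vimium hints, then return a screenshot of the page as bytes."""
    page.keyboard.press("Escape")  # leave any focused input or earlier hint mode
    page.keyboard.type("f")        # Vimium's default binding for link hints
    return page.screenshot()       # Playwright returns the image data as bytes
```

The returned bytes can be passed to an image library or encoded for the vision model as needed.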
How GPT-4V uses the overlays
Once the screenshot includes Vimium overlays, GPT-4V can:
- Identify clickable elements by the yellow boxes
- Read the hint characters (e.g., “AA”, “AB”, “F”, “JK”)
- Select the appropriate element based on the user’s objective
- Return the hint characters in a JSON response
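A minimal sketch of handling that last step, assuming the model's reply contains a single JSON object that may be surrounded by extra text. The helper name `parse_action` and the `click` key used in the example are illustrative assumptions, not names confirmed by vision.py.

```python
import json


def parse_action(reply: str) -> dict:
    """Extract the JSON object embedded in a model reply."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    # json.loads raises a ValueError subclass if the slice is not valid JSON.
    return json.loads(reply[start : end + 1])
```

For example, `parse_action('Sure: {"click": "AB"}')` returns `{'click': 'AB'}`.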
Executing Vimium clicks
Clicking in Vimium is simply a matter of typing the hint characters, which vimGPT does in vimbot.py. Once the characters are typed, Vimium:
- Matches the input to an element
- Triggers a click event on that element
- Exits hint mode
No coordinates needed: Unlike traditional automation that calculates x/y positions, vimGPT just types “AB” and Vimium handles the rest.
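In code, a click therefore reduces to a single keyboard call. The sketch below assumes hint mode is already active and that `page` behaves like a Playwright sync `Page`; the name `click_hint` and the alphanumeric check are assumptions added for illustration.

```python
def click_hint(page, hint: str) -> None:
    """Follow the element labelled `hint` while Vimium's hint mode is active."""
    hint = hint.strip()
    # Assumption: hint labels are plain letters or digits; reject anything else
    # so stray model output is not typed into the page.
    if not hint.isalnum():
        raise ValueError(f"unexpected hint label: {hint!r}")
    page.keyboard.type(hint)
```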
Handling multiple actions
Some tasks require both clicking and typing, e.g., filling a search box. vimGPT handles this in vimbot.py.
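One way to combine the two is to click first and then type into whatever the click focused. The sketch below is an assumption-laden illustration: the `click` and `type` keys, the function name `perform_action`, and pressing Enter after typing are all hypothetical choices, and `page` is assumed to behave like a Playwright sync `Page`.

```python
def perform_action(page, action: dict) -> None:
    """Apply one action: optionally click a hint, then optionally type text."""
    if "click" in action:
        page.keyboard.type(action["click"])  # select the field via its hint
    if "type" in action:
        page.keyboard.type(action["type"])   # text goes to the focused field
        page.keyboard.press("Enter")         # assumption: submit after typing
```

On pages that are slow to move focus, a short pause between the click and the typing may be needed.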
Advantages and limitations
Advantages
- Simple and reliable: No brittle selectors or DOM parsing
- Works on any site: No need for site-specific logic
- Visual feedback: Easy to debug by watching what the AI “sees”
- Human-compatible: Uses a tool designed for humans
Limitations
Future enhancements
Potential improvements to the Vimium integration:
- Custom Vimium fork: Overlay only relevant elements based on the user’s objective (context-aware pruning)
- Larger/colored hints: Make labels more distinguishable for the vision model
- Dual screenshots: Provide frames with and without overlays to prevent occlusion issues
- Custom labeling: Use JavaScript to create custom colored boxes instead of Vimium