Overview
vimGPT is an autonomous web browsing agent that combines GPT-4V’s vision capabilities with Vimium keyboard navigation to interact with web pages. The system follows a continuous loop of capturing, analyzing, and acting on web content without relying on DOM parsing.
System flow
The core architecture follows this execution loop:
1. Browser initialization
The Vimbot class (vimbot.py:10) initializes a Playwright browser context with the Vimium extension pre-loaded. A resolution setting (vimbot.py:27) balances visual clarity against token usage.
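A minimal sketch of how the extension pre-loading might look, assuming an unpacked Vimium directory and Playwright's persistent-context API; the paths, profile directory, and default extension location are illustrative, not values taken from vimbot.py:

```python
from pathlib import Path

def build_launch_args(extension_dir: str) -> list:
    """Chromium flags that force-load an unpacked extension."""
    ext = str(Path(extension_dir).resolve())
    return [
        f"--disable-extensions-except={ext}",
        f"--load-extension={ext}",
    ]

def launch_browser(extension_dir: str = "vimium"):
    # Requires: pip install playwright && playwright install chromium
    from playwright.sync_api import sync_playwright
    pw = sync_playwright().start()
    # Extensions only load in a persistent (non-incognito) context,
    # and generally require a headed browser.
    return pw.chromium.launch_persistent_context(
        user_data_dir="/tmp/vimgpt-profile",
        headless=False,
        args=build_launch_args(extension_dir),
    )
```

A persistent context is used because Chromium ignores `--load-extension` in the default incognito-style contexts Playwright creates.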
2. Screenshot capture with Vimium overlays
The capture() method (vimbot.py:53) activates Vimium’s hint mode to display yellow character sequences over clickable elements.
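The hint-activation step could be recreated roughly as follows; "f" is Vimium's default link-hint shortcut, and `page` would be a Playwright page in real use. The delay value and the trailing Escape are assumptions, not details from vimbot.py:

```python
import time

def capture(page, path: str = "screenshot.png") -> bytes:
    # "f" is Vimium's default shortcut for link-hint mode; the short delay
    # gives the yellow overlays time to render before the screenshot.
    page.keyboard.press("f")
    time.sleep(0.5)
    shot = page.screenshot(path=path)
    # Escape leaves hint mode so later keystrokes aren't misread as hints.
    page.keyboard.press("Escape")
    return shot
```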
3. Vision processing and action extraction
The vision.py module handles GPT-4V interaction.
Image encoding
Screenshots are resized to IMG_RES (vision.py:12) while maintaining aspect ratio to preserve visual context, then encoded (vision.py:16).
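A sketch of that resize-and-encode step. The IMG_RES value of 1080 is an assumption here (the real constant lives at vision.py:12), and the Pillow usage is illustrative:

```python
import base64
import io

IMG_RES = 1080  # assumed value; see vision.py:12 for the real constant

def scaled_size(width: int, height: int, target: int = IMG_RES):
    """Scale the longer edge down to `target`, preserving aspect ratio."""
    if max(width, height) <= target:
        return width, height
    scale = target / max(width, height)
    return round(width * scale), round(height * scale)

def encode_image(png_bytes: bytes) -> str:
    # Requires Pillow: pip install Pillow
    from PIL import Image
    img = Image.open(io.BytesIO(png_bytes))
    img = img.resize(scaled_size(*img.size))
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    # GPT-4V expects the image as a base64 data URL payload.
    return base64.b64encode(buf.getvalue()).decode("utf-8")
```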
GPT-4V prompt engineering
The get_actions() function (vision.py:25) sends the screenshot with a structured prompt:
- Available actions: navigate, type, click, done
- Expected format: JSON object with action keys
- Click instructions: Return only the 1-2 letter yellow sequence from Vimium
- Type instructions: Return both click and type for text input scenarios
- Completion signal: Return {"done": true} when the objective is achieved
Requests use gpt-4o (vision.py:28) with a 100-token response limit (vision.py:46).
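The request could be assembled along these lines with the OpenAI Python SDK; the prompt wording below is paraphrased from the rules above, not copied from vision.py:

```python
def build_messages(objective: str, b64_image: str) -> list:
    """Build a chat message pairing the objective with the screenshot."""
    prompt = (
        f"Objective: {objective}\n"
        "Respond with a JSON object using the keys navigate, type, click, "
        "or done. For clicks, return only the 1-2 letter yellow Vimium "
        'hint sequence. Return {"done": true} when the objective is met.'
    )
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{b64_image}"},
                },
            ],
        }
    ]

def get_actions(client, objective: str, b64_image: str) -> str:
    # `client` is an openai.OpenAI() instance; max_tokens mirrors the
    # 100-token cap noted above.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=build_messages(objective, b64_image),
        max_tokens=100,
    )
    return resp.choices[0].message.content
```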
4. JSON response parsing and error handling
The response parsing includes a two-tier error recovery system (vision.py:49):
- First attempt: Parse the raw GPT-4V response
- Fallback: If JSON is malformed, make a second API call to GPT-4o to clean the response
- Final fallback: Return an empty dict ({}) if both attempts fail
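A sketch of the two-tier recovery, with the second GPT-4o cleanup call stood in by an injectable `cleanup` callable:

```python
import json

def parse_actions(raw: str, cleanup=None) -> dict:
    """Parse the model response; fall back to a cleanup pass, then to {}."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        if cleanup is None:
            return {}
    try:
        # `cleanup` stands in for the second GPT-4o call that strips
        # markdown fences or trailing prose from the raw response.
        return json.loads(cleanup(raw))
    except Exception:
        return {}
```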
5. Action execution
The perform_action() method (vimbot.py:29) dispatches actions:
- navigate (vimbot.py:42): Loads the URL with an automatic https:// prefix
- type (vimbot.py:45): Types the text and presses Enter after a 1-second delay
- click (vimbot.py:50): Simulates typing the Vimium hint sequence
- done: Returns True to break the execution loop
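Put together, the dispatcher might look like this sketch; the key names mirror the list above, and `page` stands in for a Playwright page:

```python
import time

def perform_action(page, action: dict) -> bool:
    """Dispatch one model action; returns True when the objective is done."""
    if action.get("done"):
        return True
    if "navigate" in action:
        url = action["navigate"]
        if not url.startswith(("http://", "https://")):
            url = "https://" + url  # automatic https:// prefix
        page.goto(url)
    if "click" in action:
        # Typing the 1-2 letter hint sequence triggers Vimium's click.
        page.keyboard.type(action["click"])
    if "type" in action:
        time.sleep(1)  # give the newly focused field a moment
        page.keyboard.type(action["type"])
        page.keyboard.press("Enter")
    return False
```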
6. Main execution loop
The orchestration happens in main.py:29.
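The loop itself reduces to a few lines, sketched here with the bot and model calls injected as callables; the `max_steps` cap is an assumption, not a value from main.py:

```python
def run(objective: str, bot, get_actions, parse_actions, max_steps: int = 25):
    """Capture -> analyze -> act until the model signals done."""
    for _ in range(max_steps):
        screenshot = bot.capture()          # screenshot with Vimium hints
        raw = get_actions(objective, screenshot)  # GPT-4V call
        action = parse_actions(raw)         # two-tier JSON recovery
        if bot.perform_action(action):      # True once {"done": true}
            break
```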
Tech stack
Core dependencies
- Playwright (1.39.0): Browser automation and screenshot capture
- OpenAI Python SDK (1.1.2): GPT-4V API integration
- Pillow (10.1.0): Image processing and encoding
- python-dotenv (1.0.0): Environment variable management for API keys
- whisper-mic: Voice input processing for voice mode
Browser extension
- Vimium: Keyboard-driven browser navigation that overlays hint markers on clickable elements
Development tools
The project uses pre-commit hooks (.pre-commit-config.yaml:1) for code quality:
- trailing-whitespace: Removes trailing whitespace
- end-of-file-fixer: Ensures files end with newline
- ssort (v0.11.6): Statement sorting for Python
- isort (5.12.0): Import sorting with the Black profile
- black (23.3.0): Code formatting with a 120-character line length
Voice mode
Voice mode (main.py:17) uses the whisper-mic library to convert speech to text.
Design philosophy
Vision-only approach
vimGPT deliberately avoids using DOM parsing or accessibility trees, relying solely on visual understanding. This approach:
- Tests the limits of multimodal model vision capabilities
- Simplifies the architecture by eliminating HTML parsing
- More closely mimics human web browsing behavior
- Remains functional even on dynamically rendered content
Vimium as the interface layer
Vimium solves the “what to click” problem by providing:
- Visual markers: Yellow boxes with character sequences
- Unambiguous targets: Each element gets a unique identifier
- Keyboard-driven interaction: No need for pixel coordinate calculations
- Extensibility: Works across all web pages without modification
Extension points
The architecture supports several enhancement opportunities:
Model switching
The vision model can be swapped by changing the model name at vision.py:28. Potential alternatives:
- Other OpenAI vision models
- Self-hosted models like LLaVa or CogVLM
- Models with native pixel coordinate output
Resolution tuning
Adjust IMG_RES in vision.py:12 to balance:
- Higher resolution: Better element detection, more tokens consumed
- Lower resolution: Faster processing, potential detection failures
Browser configuration
Modify the Playwright launch parameters in vimbot.py:15 for:
- Headless mode for server deployment
- Custom user agent strings
- Cookie/session persistence
- Different viewport sizes
Performance considerations
Token usage
Each iteration consumes:
- Screenshot encoding (varies by resolution)
- Prompt text (~200 tokens)
- Response generation (capped at 100 tokens)
Latency
Typical iteration timing:
- Screenshot capture: less than 500ms
- API call to GPT-4V: 2-5 seconds
- Action execution: less than 1 second
- Total: approximately 3-7 seconds per action
Cost
GPT-4V pricing is based on image tokens plus text tokens. At 1080px resolution, expect:
- ~$0.01-0.03 per iteration
- Costs scale with task complexity (number of steps)