Overview
vimGPT uses GPT-4V’s vision capabilities to autonomously browse the web by analyzing screenshots and executing keyboard-based actions. The system operates in a continuous loop: capture → analyze → act → repeat.
Architecture
The system consists of three core components:
Vimbot
Browser automation layer using Playwright with Vimium extension
Vision API
GPT-4V integration for screenshot analysis and decision making
Main Loop
Orchestration layer that coordinates the capture-analyze-act cycle
Core workflow
The main execution loop in main.py coordinates the entire workflow:
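The original code listing isn't reproduced here; a minimal sketch of the capture → analyze → act loop, with illustrative function and method names that are assumptions rather than the project's actual API, might look like:

```python
# Illustrative sketch only; `vimbot`, `ask_model`, and the method names are
# assumptions, not the project's actual API.
def run(vimbot, ask_model, objective: str, max_steps: int = 50) -> None:
    """Drive the capture → analyze → act loop until the model signals done."""
    for _ in range(max_steps):
        screenshot = vimbot.capture()              # capture: screen with Vimium hints
        action = ask_model(screenshot, objective)  # analyze: model picks a JSON action
        if vimbot.perform(action):                 # act: execute via keyboard commands
            break                                  # model returned "done"
```

Capping the number of steps is one simple guard against the loop running forever when the model never reports completion.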
Step-by-step process
Capture screenshot
Trigger Vimium overlays by pressing Escape then f, then capture the screen with visible element hints
Analyze with GPT-4V
Send the screenshot and user objective to GPT-4V, which returns a JSON action object
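The action schema itself isn't reproduced on this page; as an illustration only (the field names are assumptions), a model reply might parse like:

```python
import json

# Hypothetical model reply; field names are illustrative, not the
# project's actual schema.
raw = '{"action": "click", "characters": "AF"}'
action = json.loads(raw)
print(action["action"])  # → click
```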
Browser automation
The Vimbot class wraps Playwright to provide a simple interface for web automation:
- Persistent context: Maintains session state across page loads
- Vimium integration: Extension loaded via command-line arguments
- Fixed viewport: 1080×720 resolution for consistent screenshots
- HTTPS flexibility: Ignores certificate errors for broader compatibility
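Putting those four properties together, a hedged sketch of the wrapper (assuming Playwright's sync API and a local ./vimium checkout of the extension; names and defaults are assumptions) might look like:

```python
class Vimbot:
    """Sketch of a Playwright wrapper with the Vimium extension loaded.

    Illustrative only; the extension path and defaults are assumptions.
    """

    def __init__(self, vimium_path: str = "./vimium"):
        from playwright.sync_api import sync_playwright  # deferred import

        self._pw = sync_playwright().start()
        # Persistent context: session state survives page loads
        self.context = self._pw.chromium.launch_persistent_context(
            "",  # empty user-data-dir -> throwaway profile
            headless=False,  # Chromium extensions require a headed browser
            args=[
                # Vimium integration via command-line arguments
                f"--disable-extensions-except={vimium_path}",
                f"--load-extension={vimium_path}",
            ],
            ignore_https_errors=True,  # HTTPS flexibility
            viewport={"width": 1080, "height": 720},  # fixed viewport
        )
        self.page = self.context.new_page()
```

Note that extensions only load in a headed Chromium with a persistent context, which is why the sketch uses `launch_persistent_context` rather than a plain `launch`.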
Action execution
The system supports four action types, all executed through keyboard commands:
- Navigate: Go directly to a specified URL
- Type: Enter text and press Enter (for search boxes, forms, etc.)
- Click: Simulate clicking by typing Vimium hint characters
- Done: Signal completion when the objective is achieved
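The four actions above can be sketched as a small dispatcher over Playwright's keyboard API (the field names in the action dict are assumptions, not the project's actual schema):

```python
# Hypothetical dispatcher, assuming a Playwright `page` object and an
# `action` dict with illustrative field names.
def perform_action(page, action: dict) -> bool:
    """Execute one model-chosen action; return True when the task is done."""
    kind = action["action"]
    if kind == "navigate":
        page.goto(action["url"])
    elif kind == "type":
        page.keyboard.type(action["text"])
        page.keyboard.press("Enter")  # submit search boxes and forms
    elif kind == "click":
        # Typing the hint characters triggers Vimium's simulated click
        page.keyboard.type(action["characters"])
    elif kind == "done":
        return True  # objective achieved; caller stops the loop
    return False
```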
Input modes
vimGPT supports two input methods for specifying objectives:
Text mode (default)
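The original snippet isn't shown here; in text mode the objective is simply typed at the terminal, which might be sketched as (the prompt string is illustrative):

```python
def read_objective(prompt: str = "What is your objective? ") -> str:
    """Read the user's objective from the terminal; prompt text is illustrative."""
    return input(prompt).strip()
```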
Voice mode
Enable with the --voice flag to use Whisper for speech-to-text:
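The original snippet isn't shown here; a hedged sketch of the speech-to-text step, assuming the openai-whisper package and a pre-recorded audio file (model size and file handling are assumptions), might look like:

```python
def transcribe_objective(audio_path: str) -> str:
    """Turn a recorded spoken objective into text with Whisper.

    Illustrative only; assumes `pip install openai-whisper`.
    """
    import whisper  # deferred so text-mode users don't need the dependency

    model = whisper.load_model("base")   # small, CPU-friendly model
    result = model.transcribe(audio_path)
    return result["text"].strip()
```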
Design decisions
Why vision-only? vimGPT deliberately avoids parsing the DOM or accessibility tree, relying solely on what GPT-4V can “see” in screenshots. This tests the limits of visual reasoning for web automation.
Why Vimium? Without access to the DOM, the model needs a way to specify which element to click. Vimium’s overlay provides visible, alphanumeric labels that GPT-4V can easily identify and reference.
Performance characteristics
- Latency: Each action requires a GPT-4V API call (~2-5 seconds)
- Token usage: Screenshots consume significant tokens; resolution is capped at 1080px width
- Reliability: JSON parsing errors trigger a fallback LLM call to repair malformed responses
- Viewport: Fixed 1080×720 resolution balances visual clarity with token efficiency
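The JSON-repair fallback mentioned under Reliability can be sketched as follows; the `repair_with_llm` callback stands in for the extra LLM call and is an assumption, not the project's actual interface:

```python
import json

def parse_action(raw: str, repair_with_llm) -> dict:
    """Parse the model's JSON reply; on failure, ask an LLM to repair it.

    `repair_with_llm` is a hypothetical callable that takes malformed text
    and returns (hopefully) valid JSON.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fallback: hand the malformed text back to a model for repair
        return json.loads(repair_with_llm(raw))
```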
Limitations and future improvements
See the GitHub README for a comprehensive list of known limitations and potential enhancements, including:
- Higher resolution support for better element detection
- Cycle detection to avoid repeated actions
- Fine-tuning open-source vision models for faster/cheaper inference
- Integration with Chrome accessibility tree for additional context