Welcome to vimGPT
vimGPT is an innovative project that gives multimodal models an interface to autonomously browse the web. It combines OpenAI’s GPT-4 with Vision (GPT-4V) and the Vimium Chrome extension to create an AI agent that can navigate websites, click elements, type text, and complete complex browsing tasks using only visual input.
How it works
The challenge with using vision models for web browsing is determining what to click without providing the browser DOM as text. vimGPT solves this elegantly by leveraging Vimium, a Chrome extension that overlays keyboard shortcuts on clickable elements. Here’s the flow:
Screenshot capture
The agent captures a screenshot with Vimium’s yellow letter overlays visible on all interactive elements
Vision analysis
GPT-4V analyzes the screenshot and decides which action to take based on your objective
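The flow above can be sketched as a simple loop. This is a minimal illustration and not the project's actual code: `run_browsing_loop`, its `capture`, `decide`, and `execute` callables, the `max_steps` cap, and the `"done"` completion key are all hypothetical names chosen for this example. In a real setup, `capture` would take a browser screenshot, `decide` would call GPT-4V, and `execute` would send keystrokes to the page.

```python
from typing import Callable, Dict, List

Action = Dict[str, str]


def run_browsing_loop(
    objective: str,
    capture: Callable[[], bytes],
    decide: Callable[[bytes, str], Action],
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> List[Action]:
    """Repeat capture -> decide -> execute until the model signals completion."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture()                  # screenshot with Vimium hints visible
        action = decide(screenshot, objective)  # e.g. a vision-model call
        history.append(action)
        if "done" in action:                    # assumed completion signal
            break
        execute(action)                         # e.g. click a hint or type text
    return history


if __name__ == "__main__":
    # Scripted stand-ins so the sketch runs without a browser or an API key.
    scripted = iter([{"click": "AB"}, {"type": "vimium"}, {"done": ""}])
    steps = run_browsing_loop(
        "search for vimium",
        capture=lambda: b"fake-png-bytes",
        decide=lambda shot, goal: next(scripted),
        execute=lambda action: print("executing", action),
    )
    print(len(steps), "steps taken")
```

Passing the three stages in as callables keeps the loop testable without a browser, and the step cap guards against an agent that never reports completion.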
Key features
Vision-first browsing
Uses only GPT-4V’s vision capabilities to understand and interact with web pages, with no DOM parsing required
Vimium integration
Leverages Vimium’s keyboard shortcuts to provide clear, labeled targets for the AI to interact with
Voice mode
Speak your browsing objectives naturally and watch vimGPT execute them in real time
Autonomous navigation
Handles complex multi-step tasks like searching, clicking through results, and filling forms
Use cases
- Research automation: Search for information across multiple websites and aggregate results
- Form filling: Automate repetitive data entry tasks on web forms
- Content discovery: Navigate through websites to find specific information
- Accessibility: Voice-controlled browsing for users who prefer or require hands-free interaction
- Web testing: Simulate realistic user browsing patterns for QA purposes
vimGPT is an experimental project that demonstrates the potential of vision-based web automation. It requires an OpenAI API key and uses GPT-4 with Vision, which incurs API costs.
Get started
Quickstart
Run your first autonomous browsing task in minutes
Installation
Detailed setup instructions for Python, Vimium, and API configuration
Architecture overview
vimGPT consists of three main components:
- main.py: Entry point that initializes the Vimbot, accepts user objectives (text or voice), and orchestrates the browsing loop
- vimbot.py: Playwright-based browser controller that manages page navigation, screenshots, and action execution
- vision.py: OpenAI GPT-4V integration that analyzes screenshots and returns action commands in JSON format
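To illustrate the hand-off between the vision component and the browser controller, here is a hedged sketch of validating a JSON action command before acting on it. The action keys (`navigate`, `type`, `click`, `done`) and the `parse_action` helper are assumptions made for this example; the text above only states that actions are returned as JSON.

```python
import json
from typing import Dict

# Assumed action vocabulary; the real set of keys may differ.
ALLOWED_KEYS = {"navigate", "type", "click", "done"}
FENCE = "`" * 3  # Markdown code-fence marker that models sometimes add


def parse_action(reply: str) -> Dict[str, str]:
    """Extract and validate a JSON action object from a model reply."""
    text = reply.strip()
    # Drop fence lines if the reply is wrapped in a Markdown code block.
    if text.startswith(FENCE):
        lines = text.splitlines()
        text = "\n".join(line for line in lines if not line.startswith(FENCE))
    action = json.loads(text)  # raises ValueError on malformed JSON
    if not isinstance(action, dict):
        raise ValueError("expected a JSON object")
    unknown = set(action) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown action keys: {sorted(unknown)}")
    return action


if __name__ == "__main__":
    print(parse_action('{"click": "AB"}'))
    print(parse_action('{"navigate": "https://example.com"}'))
```

Rejecting unknown keys early means a malformed model reply fails loudly instead of sending unintended keystrokes to the browser.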