
Welcome to vimGPT

vimGPT is an innovative project that gives multimodal models an interface to autonomously browse the web. It combines OpenAI’s GPT-4 with Vision (GPT-4V) and the Vimium Chrome extension to create an AI agent that can navigate websites, click elements, type text, and complete complex browsing tasks using only visual input.

How it works

The challenge with using vision models for web browsing is determining what to click without providing the browser DOM as text. vimGPT solves this elegantly by leveraging Vimium, a Chrome extension that overlays keyboard shortcuts on clickable elements. Here’s the flow:
1. Screenshot capture: The agent captures a screenshot with Vimium's yellow letter overlays visible on all interactive elements
2. Vision analysis: GPT-4V analyzes the screenshot and decides which action to take based on your objective
3. Action execution: The agent performs the action (navigate, type, or click) using Playwright
4. Repeat: The cycle continues until the objective is completed
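The perceive-decide-act cycle above can be sketched in a few lines of Python. This is an illustrative outline, not vimGPT's actual code: the function names (`browse`, `capture`, `decide`, `act`) and the shape of the action dictionary are assumptions for this sketch.

```python
# Hedged sketch of the screenshot -> vision -> action loop.
# In vimGPT, capture/act are backed by Playwright and decide by GPT-4V;
# here they are injected so the loop itself can be shown standalone.

def browse(objective, capture, decide, act, max_steps=20):
    """Run the loop until the model signals completion or the step budget runs out."""
    history = []
    for _ in range(max_steps):
        screenshot = capture()                  # image with Vimium hint overlays
        action = decide(screenshot, objective)  # e.g. {"type": "click", "hint": "GD"}
        history.append(action)
        if action["type"] == "done":
            break
        act(action)                             # navigate / type / click
    return history

# Toy stand-ins so the loop runs without a browser or an API key:
script = iter([
    {"type": "navigate", "url": "https://example.com"},
    {"type": "click", "hint": "F"},
    {"type": "done"},
])
steps = browse(
    "find the docs",
    capture=lambda: b"png-bytes",
    decide=lambda shot, obj: next(script),
    act=lambda action: None,
)
print([a["type"] for a in steps])  # -> ['navigate', 'click', 'done']
```

The key design point is that the loop is stateless between iterations: each decision is made from a fresh screenshot, so the model never needs access to the DOM.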

Key features

Vision-first browsing

Uses only GPT-4V’s vision capabilities to understand and interact with web pages, no DOM parsing required

Vimium integration

Leverages Vimium’s keyboard shortcuts to provide clear, labeled targets for the AI to interact with

Voice mode

Speak your browsing objectives naturally and watch vimGPT execute them in real time

Autonomous navigation

Handles complex multi-step tasks like searching, clicking through results, and filling forms

Use cases

  • Research automation: Search for information across multiple websites and aggregate results
  • Form filling: Automate repetitive data entry tasks on web forms
  • Content discovery: Navigate through websites to find specific information
  • Accessibility: Voice-controlled browsing for users who prefer or require hands-free interaction
  • Web testing: Simulate realistic user browsing patterns for QA purposes
vimGPT is an experimental project that demonstrates the potential of vision-based web automation. It requires an OpenAI API key and uses GPT-4 with Vision, which incurs API costs.

Get started

Quickstart

Run your first autonomous browsing task in minutes

Installation

Detailed setup instructions for Python, Vimium, and API configuration

Architecture overview

vimGPT consists of three main components:
  • main.py: Entry point that initializes the Vimbot, accepts user objectives (text or voice), and orchestrates the browsing loop
  • vimbot.py: Playwright-based browser controller that manages page navigation, screenshots, and action execution
  • vision.py: OpenAI GPT-4V integration that analyzes screenshots and returns action commands in JSON format
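The JSON contract between vision.py and the browser controller can be pictured as follows. The field names used here ("action", "element", "text", "url") are illustrative assumptions for this sketch, not necessarily the exact schema vimGPT defines; see vision.py for the real format.

```python
import json

# Hypothetical example of a JSON action command returned by the vision model
# and dispatched by the controller. Field names are assumptions for illustration.

raw = '{"action": "type", "element": "GD", "text": "vimGPT github"}'
cmd = json.loads(raw)

if cmd["action"] == "click":
    print("press hint keys:", cmd["element"])
elif cmd["action"] == "type":
    print(f'click {cmd["element"]}, then type: {cmd["text"]}')
elif cmd["action"] == "navigate":
    print("go to:", cmd["url"])
```

Returning structured JSON rather than free text keeps the controller simple: it only needs to branch on the action type, while all visual reasoning stays inside the model.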


Next steps

Ready to start? Check out the quickstart guide to run your first autonomous browsing task, or dive into the installation guide for detailed setup instructions.
