Basic usage

vimGPT lets you control a web browser with natural-language objectives. The agent combines GPT-4V's vision capabilities with Vimium keyboard navigation to interact with web pages.

Running vimGPT

To start vimGPT with a text-based objective:

python main.py

When prompted, enter your objective in plain English:

Please enter your objective: Search for Python tutorials on YouTube

How it works

vimGPT opens the browser once, then repeats a capture, analyze, act cycle until the task is complete:

Navigation: The browser starts at Google (main.py:15)
Screen capture: Takes a screenshot with Vimium overlays visible (vimbot.py:53-59)
AI analysis: Sends the screenshot to GPT-4V with your objective (vision.py:25-46)
Action execution: Performs the AI-suggested action (navigate, type, or click)
Completion: Repeats until the AI determines the task is complete

The agent presses Escape and then f to activate Vimium’s link hint mode, which overlays yellow character sequences on clickable elements (vimbot.py:55-56).
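
The cycle described above can be sketched in a few lines. This is a minimal illustration, not the code in main.py: capture, get_actions, and perform are hypothetical callables standing in for the screenshot, GPT-4V, and browser steps, and max_steps is a safety cap added for this sketch.

```python
from typing import Any, Callable, Dict


def run_agent_loop(
    objective: str,
    capture: Callable[[], Any],
    get_actions: Callable[[Any, str], Dict[str, Any]],
    perform: Callable[[Dict[str, Any]], None],
    max_steps: int = 20,
) -> int:
    """Repeat capture -> analyze -> act until the model reports completion.

    Returns the number of actions performed. The three callables are
    hypothetical stand-ins, not vimGPT's real functions.
    """
    performed = 0
    for _ in range(max_steps):
        screenshot = capture()                       # screen capture
        action = get_actions(screenshot, objective)  # AI analysis
        if action.get("done"):                       # completion
            break
        perform(action)                              # action execution
        performed += 1
    return performed


# Scripted responses stand in for the model so the sketch runs offline.
scripted = iter([{"click": "F", "type": "Python tutorials"}, {"done": True}])
count = run_agent_loop(
    "Search for Python tutorials",
    capture=lambda: b"fake-screenshot",
    get_actions=lambda shot, goal: next(scripted),
    perform=lambda action: print("performing", action),
)
print("actions performed:", count)
```

Passing the steps in as callables keeps the control flow visible and lets the loop be exercised without a browser or an API key.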

Available actions

vimGPT supports three primary actions:

Navigate

Navigates to a specified URL:

{"navigate": "https://youtube.com"}

URLs without a protocol are automatically prefixed with https:// (vimbot.py:43).
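
One simple way to get this prefixing behavior is shown below. The helper name with_default_scheme is hypothetical, and treating the presence of "://" as "has a protocol" is an assumption of this sketch rather than a statement about vimbot.py.

```python
def with_default_scheme(url: str) -> str:
    """Return url unchanged if it already names a scheme, else prefix https://."""
    # Assumption: any URL containing "://" already has a protocol.
    return url if "://" in url else "https://" + url


print(with_default_scheme("youtube.com"))
print(with_default_scheme("http://example.com"))
```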

Type

Types text and presses Enter:

{"type": "Python tutorials for beginners"}

The agent waits 1 second before typing to give the page time to become ready (vimbot.py:46).

Click

Clicks on an element using its Vimium hint sequence:

{"click": "AB"}

The AI identifies clickable elements by their yellow character sequences and selects the most appropriate one.

Combined actions

For input fields, the agent can click and type in sequence:

{
  "click": "F",
  "type": "machine learning tutorials"
}
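
At the keyboard level, a combined action could reduce to the keystrokes sketched here. This assumes link hint mode is still active from the capture step, so typing the hint characters selects the element. KeyboardLike is a hypothetical interface whose press and type methods use the same names as Playwright's Keyboard class, and the injectable sleep parameter exists only so the sketch can run without a real delay.

```python
import time
from typing import Callable, Protocol


class KeyboardLike(Protocol):
    """The two keyboard methods this sketch relies on (hypothetical interface)."""

    def press(self, key: str) -> None: ...

    def type(self, text: str) -> None: ...


def click_then_type(
    keyboard: KeyboardLike,
    hint: str,
    text: str,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Select an element by its Vimium hint, then type text and press Enter."""
    keyboard.type(hint)      # typing the hint characters activates the element
    sleep(1)                 # the 1-second pause before typing
    keyboard.type(text)
    keyboard.press("Enter")  # submit the input
```

With Playwright, page.keyboard would be the natural object to pass as keyboard; any object with matching press and type methods works for testing.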

Task completion

When the objective is satisfied, the AI returns:

{"done": true}
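
The action objects shown in this section can be interpreted with a small dispatcher. The sketch below is illustrative only: plan_steps is a hypothetical helper, and its ordering rule (navigate first, click before type) is an assumption based on the combined-action example, not a description of vimbot.py.

```python
import json
from typing import List, Tuple


def plan_steps(raw: str) -> List[Tuple[str, str]]:
    """Turn one JSON action object into an ordered list of (verb, argument) steps.

    An empty list means the task is complete.
    """
    action = json.loads(raw)
    if not isinstance(action, dict):
        raise ValueError("expected a JSON object")
    if action.get("done"):
        return []
    # Assumed order for this sketch: navigate, then click, then type.
    steps = [(verb, action[verb]) for verb in ("navigate", "click", "type") if verb in action]
    if not steps:
        raise ValueError(f"no supported action in {raw!r}")
    return steps


print(plan_steps('{"click": "F", "type": "machine learning tutorials"}'))
print(plan_steps('{"done": true}'))
```

Rejecting objects with no recognized key makes a malformed model response fail loudly instead of silently doing nothing.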

Example objectives

Here are some example objectives you can try:

“Search for Python tutorials on YouTube”
“Find the latest news about artificial intelligence”
“Go to GitHub and search for machine learning projects”
“Find recipes for chocolate chip cookies”
“Look up the weather forecast for San Francisco”

Understanding the output

As vimGPT runs, you’ll see console output showing its progress:

Initializing the Vimbot driver...
Navigating to Google...
Please enter your objective: Search for Python tutorials
Capturing the screen...
Getting actions for the given objective...
JSON Response: {'click': 'F', 'type': 'Python tutorials'}
Capturing the screen...
Getting actions for the given objective...
JSON Response: {'done': True}

Each iteration shows:

The screenshot capture phase (main.py:31-32)
The AI’s action decision (main.py:35-36)
The JSON response indicating what action to perform
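
The "JSON Response" lines in the sample output use single quotes and True, which is Python's dict representation rather than strict JSON, so json.loads would reject them. If you save the console output and want to review the actions afterward, a sketch like the following may help. The helper actions_from_log is hypothetical, and the prefix string is taken from the sample output above.

```python
import ast
from typing import Any, Dict, List

PREFIX = "JSON Response: "


def actions_from_log(log_text: str) -> List[Dict[str, Any]]:
    """Extract the action dicts from saved console output."""
    actions = []
    for line in log_text.splitlines():
        if line.startswith(PREFIX):
            # literal_eval safely parses Python literals such as {'done': True}.
            actions.append(ast.literal_eval(line[len(PREFIX):]))
    return actions


sample = (
    "Capturing the screen...\n"
    "JSON Response: {'click': 'F', 'type': 'Python tutorials'}\n"
    "JSON Response: {'done': True}\n"
)
print(actions_from_log(sample))
```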

Stopping execution

To stop vimGPT at any time, press Ctrl+C:

^C
Exiting...

The script catches the KeyboardInterrupt and exits gracefully (main.py:54-56).
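
The general pattern looks like this. It is a minimal sketch of catching KeyboardInterrupt around a loop, not the code in main.py; run_until_interrupted and its step argument are hypothetical names.

```python
from typing import Callable


def run_until_interrupted(step: Callable[[], bool]) -> str:
    """Call step() repeatedly; stop when it returns True or on Ctrl+C."""
    try:
        while True:
            if step():
                return "done"
    except KeyboardInterrupt:
        # Ctrl+C raises KeyboardInterrupt in the main thread.
        print("Exiting...")
        return "interrupted"
```

Wrapping the whole loop in one try block means an interrupt at any stage (capture, API call, or action) takes the same exit path.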

Tip: Start with simple, specific objectives. The AI performs better when the goal is clear and well-defined.

Viewport configuration

The browser viewport is set to 1080x720 pixels by default (vimbot.py:27). This resolution balances token usage with visual clarity for the GPT-4V model.

Image processing

Screenshots are resized to 1080 pixels wide while maintaining aspect ratio before being sent to the API (vision.py:12, vision.py:18). This helps manage API token costs while preserving enough detail for accurate element detection.
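
The arithmetic behind an aspect-preserving resize is short enough to show directly. This sketch only computes the target dimensions. The function name scaled_size is hypothetical, and truncating the height with int() is an assumption about rounding.

```python
from typing import Tuple


def scaled_size(width: int, height: int, target_width: int = 1080) -> Tuple[int, int]:
    """Return (target_width, new_height) with the original aspect ratio preserved."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Scale the height by the same factor applied to the width.
    return target_width, int(target_width * height / width)


print(scaled_size(2160, 1440))
print(scaled_size(1920, 1080))
```

The resulting tuple is in the (width, height) form that image libraries such as Pillow accept for a resize call.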

Warning: Each action requires a GPT-4V API call, which can be expensive. Monitor your OpenAI usage to avoid unexpected costs.
