
Prerequisites

Before you begin, ensure you have:
  • Python 3.8 or higher installed
  • An OpenAI API key with access to GPT-4 with Vision
  • Chrome/Chromium browser (installed automatically by Playwright)
This quickstart assumes vimGPT is already installed. If you haven’t set it up yet, complete the installation steps first.
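To confirm the Python prerequisite before going further, a quick one-off check (nothing vimGPT-specific here) is:

```python
import sys

# vimGPT requires Python 3.8+; fail fast with a clear message otherwise.
assert sys.version_info >= (3, 8), "Python 3.8 or higher is required"
print(f"Python {sys.version_info.major}.{sys.version_info.minor} detected")
```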

Your first browsing task

Let’s run a simple task: searching Google for information.
1. Navigate to the project directory

cd ~/workspace/source
2. Run vimGPT

Execute the main script:
python main.py
You’ll see output indicating the browser is initializing:
Initializing the Vimbot driver...
Navigating to Google...
3. Enter your objective

When prompted, type your browsing objective:
Please enter your objective: Search for Python tutorials
The agent will begin executing your task autonomously.
4. Watch the automation

vimGPT will:
  • Capture a screenshot with Vimium overlays
  • Analyze the page with GPT-4V
  • Execute actions (click, type, navigate)
  • Repeat until the objective is complete
Example console output:
Capturing the screen...
Getting actions for the given objective...
JSON Response: {'click': 'A', 'type': 'Python tutorials'}
Capturing the screen...
Getting actions for the given objective...
JSON Response: {'done': True}

Understanding the code flow

Here’s what happens when you run vimGPT:

1. Initialization (main.py:11-15)

print("Initializing the Vimbot driver...")
driver = Vimbot()

print("Navigating to Google...")
driver.navigate("https://www.google.com")
The Vimbot class initializes a Playwright browser with the Vimium extension loaded.
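The extension-loading step can be sketched as follows. The helper names and extension path are illustrative, not Vimbot’s exact code, but the Chromium flags are the standard way Playwright loads an unpacked extension (which requires a headed, persistent context):

```python
def extension_args(extension_path: str) -> list:
    """Chromium flags that load an unpacked extension such as Vimium."""
    return [
        f"--disable-extensions-except={extension_path}",
        f"--load-extension={extension_path}",
    ]

def launch_with_vimium(extension_path: str):
    """Launch a headed persistent context with the extension loaded.
    Illustrative sketch; extensions do not work in headless Chromium."""
    from playwright.sync_api import sync_playwright  # deferred import
    p = sync_playwright().start()
    context = p.chromium.launch_persistent_context(
        "",  # empty user-data-dir -> temporary profile
        headless=False,
        args=extension_args(extension_path),
    )
    return context
```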

2. Objective input (main.py:26-27)

objective = input("Please enter your objective: ")
You provide a natural language description of what you want to accomplish.

3. Autonomous browsing loop (main.py:29-38)

while True:
    time.sleep(1)
    print("Capturing the screen...")
    screenshot = driver.capture()

    print("Getting actions for the given objective...")
    action = vision.get_actions(screenshot, objective)
    print(f"JSON Response: {action}")
    if driver.perform_action(action):  # returns True if done
        break
The agent continuously:
  1. Captures a screenshot with Vimium overlays (vimbot.py:53-59)
  2. Sends it to GPT-4V for analysis (vision.py:25-76)
  3. Executes the returned action (vimbot.py:29-40)
  4. Stops when the model returns {"done": true}
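The dispatch inside perform_action can be sketched like this. The driver method names are illustrative; only the action keys (done, click, type, navigate) come from the JSON protocol shown above:

```python
def perform_action(driver, action: dict) -> bool:
    """Map a GPT-4V action dict onto browser operations.
    Returns True when the model signals the objective is complete.
    Illustrative sketch; vimbot.py's real implementation may differ."""
    if action.get("done"):
        return True
    if "click" in action:
        driver.click(action["click"])        # Vimium hint, e.g. "A" or "AB"
    if "type" in action:
        driver.type(action["type"])          # text for the focused field
    if "navigate" in action:
        driver.navigate(action["navigate"])  # absolute URL
    return False
```

Note that a single response such as {'click': 'A', 'type': 'Python tutorials'} can carry multiple keys, so each key is checked independently rather than with elif.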

Action types

vimGPT supports four action types that GPT-4V can choose from: navigate, click, type, and done. For example, a navigate action looks like:
{
  "navigate": "https://example.com"
}
The click action uses Vimium’s letter sequences. When you see yellow boxes with letters like “A”, “AB”, or “ZX” on the page, those are the hint labels GPT-4V can supply to click elements.
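Putting these together, one example payload per action type looks like the following (the URL and text values are illustrative):

```python
# One example payload per action type; GPT-4V may also combine
# "click" and "type" in a single response, as in the console output above.
example_actions = [
    {"navigate": "https://example.com"},  # load a URL directly
    {"click": "A"},                       # click the element hinted "A"
    {"type": "Python tutorials"},         # type into the focused field
    {"done": True},                       # objective complete; loop exits
]
for action in example_actions:
    print(action)
```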

Try voice mode

For a hands-free experience, enable voice input:
python main.py --voice
When prompted, speak your objective naturally:
Voice mode enabled. Listening for your command...
[Speak: "Search for machine learning courses on Coursera"]
Objective received: Search for machine learning courses on Coursera
Capturing the screen...
Voice mode requires the whisper-mic package and a working microphone. Make sure these are properly configured.

Example objectives to try

Start with these simple tasks to understand how vimGPT works:

Search task

“Search Google for today’s weather”

Navigation task

“Go to news.ycombinator.com and click on the top story”

Information lookup

“Find the documentation for Python’s requests library”

Multi-step task

“Search for vegan restaurants in San Francisco and open the first result”

Stopping execution

To interrupt vimGPT at any time, press Ctrl+C:
^C
Exiting...
The browser will close and the program will terminate gracefully.
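A try/except/finally pattern like the following guarantees that cleanup runs on Ctrl+C; the run_loop helper and driver.close() call are illustrative stand-ins for vimGPT’s Playwright teardown, not its exact code:

```python
def run_loop(driver, steps):
    """Run browsing steps, closing the browser even on Ctrl+C.
    Illustrative sketch: KeyboardInterrupt is caught, and the
    finally block releases the browser on every exit path."""
    try:
        for step in steps:
            step()
    except KeyboardInterrupt:
        print("Exiting...")
    finally:
        driver.close()  # always release the browser
```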

Troubleshooting

vimGPT includes automatic retry logic (vision.py:49-74). If the initial response isn’t valid JSON, it sends the response to GPT-4 again for cleanup.
try:
    json_response = json.loads(response.choices[0].message.content)
except json.JSONDecodeError:
    print("Error: Invalid JSON response")
    # Ask GPT-4 to reformat its own output as valid JSON, then parse again
    cleaned_response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[...]
    )
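An alternative, purely local fallback (a sketch, not the project’s approach, which re-asks GPT-4) is to strip common markdown wrappers before giving up:

```python
import json

def parse_action(raw: str):
    """Parse a model response as JSON, tolerating ```json fences.
    Returns None if parsing still fails, so the caller can retry."""
    text = raw.strip()
    if text.startswith("```"):
        text = text.strip("`")          # drop fence backticks
        if text.startswith("json"):
            text = text[len("json"):]   # drop the language tag
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError:
        return None
```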
The capture method (vimbot.py:53-59) presses Escape then ‘f’ to trigger Vimium overlays. If they don’t appear:
  • Ensure Vimium is properly installed
  • Check that the extension loaded correctly on browser startup
  • Verify the page has finished loading before capture
The perform_action method (vimbot.py:29-40) handles the different action combinations. If an action isn’t executed as expected:
  • Check console output for the JSON response
  • Verify the action format matches expected structure
  • Ensure letter sequences from Vimium are visible in screenshots
For better results, start with simple, clearly defined objectives. As you understand how vimGPT interprets tasks, you can try more complex multi-step operations.

Next steps

  • Review the installation guide for advanced configuration options
  • Experiment with different objective phrasings to optimize results
  • Check the console output to understand GPT-4V’s decision-making process
  • Monitor your OpenAI API usage as vision models consume more tokens
