
Prerequisites

Before you begin, ensure you have:
  • Python 3.8 or higher installed
  • An OpenAI API key with access to GPT-4 with Vision
  • Chrome/Chromium browser (installed automatically by Playwright)
This quickstart assumes vimGPT is already installed. If you haven’t set it up yet, complete the installation steps first.
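To confirm the Python prerequisite before going further, a quick one-off check (nothing vimGPT-specific here) is:

```python
import sys

# vimGPT requires Python 3.8+; fail fast with a clear message otherwise.
assert sys.version_info >= (3, 8), "Python 3.8 or higher is required"
print(f"Python {sys.version_info.major}.{sys.version_info.minor} detected")
```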

Your first browsing task

Let’s run a simple task: searching Google for information.
1. Navigate to the project directory

cd ~/workspace/source
2. Run vimGPT

Execute the main script:
python main.py
You’ll see output indicating the browser is initializing:
Initializing the Vimbot driver...
Navigating to Google...
3. Enter your objective

When prompted, type your browsing objective:
Please enter your objective: Search for Python tutorials
The agent will begin executing your task autonomously.
4. Watch the automation

vimGPT will:
  • Capture a screenshot with Vimium overlays
  • Analyze the page with GPT-4V
  • Execute actions (click, type, navigate)
  • Repeat until the objective is complete
Example console output:
Capturing the screen...
Getting actions for the given objective...
JSON Response: {'click': 'A', 'type': 'Python tutorials'}
Capturing the screen...
Getting actions for the given objective...
JSON Response: {'done': True}

Understanding the code flow

Here’s what happens when you run vimGPT:

1. Initialization (main.py:11-15)

print("Initializing the Vimbot driver...")
driver = Vimbot()

print("Navigating to Google...")
driver.navigate("https://www.google.com")
The Vimbot class initializes a Playwright browser with the Vimium extension loaded.
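The extension-loading step can be sketched as follows. The helper names and extension path are illustrative, not Vimbot’s exact code, but the Chromium flags are the standard way Playwright loads an unpacked extension (which requires a headed, persistent context):

```python
def extension_args(extension_path: str) -> list:
    """Chromium flags that load an unpacked extension such as Vimium."""
    return [
        f"--disable-extensions-except={extension_path}",
        f"--load-extension={extension_path}",
    ]

def launch_with_vimium(extension_path: str):
    """Launch a headed persistent context with the extension loaded.
    Illustrative sketch; extensions do not work in headless Chromium."""
    from playwright.sync_api import sync_playwright  # deferred import
    p = sync_playwright().start()
    context = p.chromium.launch_persistent_context(
        "",  # empty user-data-dir -> temporary profile
        headless=False,
        args=extension_args(extension_path),
    )
    return context
```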

2. Objective input (main.py:26-27)

objective = input("Please enter your objective: ")
You provide a natural language description of what you want to accomplish.

3. Autonomous browsing loop (main.py:29-38)

while True:
    time.sleep(1)
    print("Capturing the screen...")
    screenshot = driver.capture()

    print("Getting actions for the given objective...")
    action = vision.get_actions(screenshot, objective)
    print(f"JSON Response: {action}")
    if driver.perform_action(action):  # returns True if done
        break
The agent continuously:
  1. Captures a screenshot with Vimium overlays (vimbot.py:53-59)
  2. Sends it to GPT-4V for analysis (vision.py:25-76)
  3. Executes the returned action (vimbot.py:29-40)
  4. Stops when the model returns {"done": true}
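The dispatch inside perform_action can be sketched like this. The driver method names are illustrative; only the action keys (done, click, type, navigate) come from the JSON protocol shown above:

```python
def perform_action(driver, action: dict) -> bool:
    """Map a GPT-4V action dict onto browser operations.
    Returns True when the model signals the objective is complete.
    Illustrative sketch; vimbot.py's real implementation may differ."""
    if action.get("done"):
        return True
    if "click" in action:
        driver.click(action["click"])        # Vimium hint, e.g. "A" or "AB"
    if "type" in action:
        driver.type(action["type"])          # text for the focused field
    if "navigate" in action:
        driver.navigate(action["navigate"])  # absolute URL
    return False
```

Note that a single response such as {'click': 'A', 'type': 'Python tutorials'} can carry multiple keys, so each key is checked independently rather than with elif.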

Action types

vimGPT supports four action types that GPT-4V can choose from: navigate, click, type, and done. For example, a navigate action looks like:
{
  "navigate": "https://example.com"
}
The click action uses Vimium’s letter sequences. When you see yellow boxes with letters like “A”, “AB”, or “ZX” on the page, those are the hint labels GPT-4V can supply to click elements.
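Putting these together, one example payload per action type looks like the following (the URL and text values are illustrative):

```python
# One example payload per action type; GPT-4V may also combine
# "click" and "type" in a single response, as in the console output above.
example_actions = [
    {"navigate": "https://example.com"},  # load a URL directly
    {"click": "A"},                       # click the element hinted "A"
    {"type": "Python tutorials"},         # type into the focused field
    {"done": True},                       # objective complete; loop exits
]
for action in example_actions:
    print(action)
```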

Try voice mode

For a hands-free experience, enable voice input:
python main.py --voice
When prompted, speak your objective naturally:
Voice mode enabled. Listening for your command...
[Speak: "Search for machine learning courses on Coursera"]
Objective received: Search for machine learning courses on Coursera
Capturing the screen...
Voice mode requires the whisper-mic package and a working microphone. Make sure these are properly configured.

Example objectives to try

Start with these simple tasks to understand how vimGPT works:

Search task

“Search Google for today’s weather”

Navigation task

“Go to news.ycombinator.com and click on the top story”

Information lookup

“Find the documentation for Python’s requests library”

Multi-step task

“Search for vegan restaurants in San Francisco and open the first result”

Stopping execution

To interrupt vimGPT at any time, press Ctrl+C:
^C
Exiting...
The browser will close and the program will terminate gracefully.
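A try/except/finally pattern like the following guarantees that cleanup runs on Ctrl+C; the run_loop helper and driver.close() call are illustrative stand-ins for vimGPT’s Playwright teardown, not its exact code:

```python
def run_loop(driver, steps):
    """Run browsing steps, closing the browser even on Ctrl+C.
    Illustrative sketch: KeyboardInterrupt is caught, and the
    finally block releases the browser on every exit path."""
    try:
        for step in steps:
            step()
    except KeyboardInterrupt:
        print("Exiting...")
    finally:
        driver.close()  # always release the browser
```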

Troubleshooting

vimGPT includes automatic retry logic (vision.py:49-74). If the initial response isn’t valid JSON, it sends the response to GPT-4 again for cleanup.
try:
    json_response = json.loads(response.choices[0].message.content)
except json.JSONDecodeError:
    print("Error: Invalid JSON response")
    # Ask GPT-4 to reformat its own output as valid JSON, then parse again
    cleaned_response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[...]
    )
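An alternative, purely local fallback (a sketch, not the project’s approach, which re-asks GPT-4) is to strip common markdown wrappers before giving up:

```python
import json

def parse_action(raw: str):
    """Parse a model response as JSON, tolerating ```json fences.
    Returns None if parsing still fails, so the caller can retry."""
    text = raw.strip()
    if text.startswith("```"):
        text = text.strip("`")          # drop fence backticks
        if text.startswith("json"):
            text = text[len("json"):]   # drop the language tag
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError:
        return None
```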
The capture method (vimbot.py:53-59) presses Escape then ‘f’ to trigger Vimium overlays. If they don’t appear:
  • Ensure Vimium is properly installed
  • Check that the extension loaded correctly on browser startup
  • Verify the page has finished loading before capture
The perform_action method (vimbot.py:29-40) handles the different action combinations. If an action isn’t executed as expected:
  • Check console output for the JSON response
  • Verify the action format matches expected structure
  • Ensure letter sequences from Vimium are visible in screenshots
For better results, start with simple, clearly defined objectives. As you understand how vimGPT interprets tasks, you can try more complex multi-step operations.

Next steps

  • Review the installation guide for advanced configuration options
  • Experiment with different objective phrasings to optimize results
  • Check the console output to understand GPT-4V’s decision-making process
  • Monitor your OpenAI API usage as vision models consume more tokens
