Prerequisites
Before you begin, ensure you have:
- Python 3.8 or higher installed
- An OpenAI API key with access to GPT-4 with Vision
- Chrome/Chromium browser (installed automatically by Playwright)
This quickstart assumes you’ve completed the installation steps. If you haven’t set up vimGPT yet, please complete the installation first.
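A quick way to confirm the first two prerequisites is a small check script. This is a minimal sketch and not part of vimGPT: the function name `check_prerequisites` is hypothetical, and it assumes the key is supplied through the `OPENAI_API_KEY` environment variable, which is where the official OpenAI Python client looks by default. It cannot tell whether the key actually has GPT-4 with Vision access.

```python
import os
import sys


def check_prerequisites(version_info=None, environ=None):
    """Return a list of human-readable problems (empty if none were found)."""
    version_info = sys.version_info if version_info is None else version_info
    environ = os.environ if environ is None else environ

    problems = []
    if tuple(version_info[:2]) < (3, 8):
        problems.append("Python 3.8 or higher is required")
    # Assumption: the API key is provided via the OPENAI_API_KEY variable.
    if not environ.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Missing prerequisite:", problem)
```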
Your first browsing task
Let’s run a simple task: searching Google for information.
Enter your objective
When prompted, type your browsing objective. The agent will then begin executing your task autonomously.
Understanding the code flow
Here’s what happens when you run vimGPT:
1. Initialization (main.py:11-15)
2. Objective input (main.py:26-27)
3. Autonomous browsing loop (main.py:29-38)
- Captures a screenshot with Vimium overlays (vimbot.py:53-59)
- Sends it to GPT-4V for analysis (vision.py:25-76)
- Executes the returned action (vimbot.py:29-40)
- Stops when the model returns {"done": true}
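The capture, analyze, act cycle described above can be sketched as a short loop. This is an illustration under stated assumptions, not the code in main.py: `run_browsing_loop` and its three callables are hypothetical stand-ins for the real components, and the `max_steps` cap is an added safety measure that the source does not describe.

```python
def run_browsing_loop(objective, capture, get_action, perform_action, max_steps=50):
    """Repeat capture -> analyze -> act until the model reports it is done.

    All three callables are hypothetical stand-ins for the real vimGPT pieces.
    max_steps is an assumed safety cap. Returns the number of actions executed.
    """
    executed = 0
    for _ in range(max_steps):
        screenshot = capture()                      # screenshot with Vimium overlays
        action = get_action(objective, screenshot)  # parsed JSON from the model
        if action.get("done"):                      # {"done": true} ends the loop
            break
        perform_action(action)
        executed += 1
    return executed


if __name__ == "__main__":
    # Scripted fake responses so the sketch runs without a browser or API key.
    scripted = iter([{"click": "A"}, {"click": "ZX"}, {"done": True}])
    steps = run_browsing_loop(
        "Search Google for today's weather",
        capture=lambda: b"fake-png-bytes",
        get_action=lambda objective, screenshot: next(scripted),
        perform_action=lambda action: print("executing", action),
    )
    print("actions executed:", steps)
```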
Action types
vimGPT supports four action types that GPT-4V can choose from.
The click action uses Vimium’s letter sequences. When you see yellow boxes with letters like “A”, “AB”, or “ZX” on the page, those are the values GPT-4V can use to click elements.
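Because a click value has to be one of the letter sequences Vimium draws on the page, a light sanity check can catch malformed values before they are sent to the browser. The helper below is a hypothetical sketch, not something vimGPT is known to include. It assumes hints are one to three ASCII letters, matching the “A”, “AB”, and “ZX” examples; the upper length limit is a guess.

```python
import re

# Assumption: hints are one to three ASCII letters, like "A", "AB", or "ZX".
_HINT_PATTERN = re.compile(r"[A-Za-z]{1,3}")


def is_vimium_hint(value):
    """Return True if value looks like a Vimium letter sequence."""
    return isinstance(value, str) and _HINT_PATTERN.fullmatch(value) is not None
```

A value such as "Submit button" fails this check, which usually means the model described the element instead of reading its hint.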
Try voice mode
For a hands-free experience, enable voice input.
Example objectives to try
Start with these simple tasks to understand how vimGPT works:
Search task
“Search Google for today’s weather”
Navigation task
“Go to news.ycombinator.com and click on the top story”
Information lookup
“Find the documentation for Python’s requests library”
Multi-step task
“Search for vegan restaurants in San Francisco and open the first result”
Stopping execution
To interrupt vimGPT at any time, press Ctrl+C.
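In Python, Ctrl+C reaches the main thread as a `KeyboardInterrupt` exception, so a long-running loop can catch it and clean up. The sketch below shows that general pattern only. It does not describe how vimGPT itself handles the interrupt, and `run_until_interrupted`, `step`, and `on_stop` are hypothetical names.

```python
def run_until_interrupted(step, on_stop=lambda: None):
    """Call step() repeatedly until it returns True or the user presses Ctrl+C.

    Returns "done" or "interrupted". Both callables are hypothetical.
    """
    try:
        while not step():
            pass
        return "done"
    except KeyboardInterrupt:
        # Ctrl+C surfaces in Python as KeyboardInterrupt.
        return "interrupted"
    finally:
        on_stop()  # always runs, e.g. to close the browser
```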
Troubleshooting
GPT-4V returns invalid JSON
vimGPT includes automatic retry logic (vision.py:49-74). If the initial response isn’t valid JSON, it sends the response to GPT-4 again for cleanup.
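The parse-then-clean-up fallback described above can be sketched as follows. This is a simplified illustration, not the code in vision.py: `parse_action` is a hypothetical name, and `cleanup` stands in for whatever second model call turns the raw text into valid JSON.

```python
import json


def parse_action(raw_response, cleanup):
    """Parse the model's reply as JSON, with one cleanup pass on failure.

    cleanup is a hypothetical callable that takes the raw text and returns a
    second attempt as a string. Returns the parsed value, or None if both
    attempts fail.
    """
    try:
        return json.loads(raw_response)
    except json.JSONDecodeError:
        pass  # fall through to the cleanup pass
    try:
        return json.loads(cleanup(raw_response))
    except json.JSONDecodeError:
        return None


if __name__ == "__main__":
    # Fake cleanup that keeps only the outermost {...} span of the reply.
    def fake_cleanup(raw):
        return raw[raw.index("{"): raw.rindex("}") + 1]

    print(parse_action('Sure! {"done": true}', fake_cleanup))
```

Returning `None` instead of raising lets the caller decide whether to skip the step or retry with a fresh screenshot.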
Vimium overlays not appearing
The capture method (vimbot.py:53-59) presses Escape then ‘f’ to trigger Vimium overlays. If they don’t appear:
- Ensure Vimium is properly installed
- Check that the extension loaded correctly on browser startup
- Verify the page has finished loading before capture
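To make the key sequence and the load check concrete, here is a minimal sketch of a capture helper. It assumes `page` is a Playwright sync-API `Page`; the function name `capture_with_hints`, the default file path, and the settle delay are assumptions for illustration and may differ from what vimbot.py does.

```python
import time


def capture_with_hints(page, path="screenshot.png", settle_seconds=1.0):
    """Show Vimium hints on a Playwright page, then take a screenshot.

    page is expected to be a Playwright sync-API Page. The settle delay is an
    assumption that gives the overlays time to render.
    """
    page.wait_for_load_state("load")  # make sure the page has finished loading
    page.keyboard.press("Escape")     # reset any active Vimium mode
    page.keyboard.type("f")           # Vimium's link-hint key
    time.sleep(settle_seconds)
    return page.screenshot(path=path)
```

If overlays are still missing from the saved image, increasing `settle_seconds` is a cheap first experiment before reinstalling the extension.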
Actions not executing correctly
The perform_action method (vimbot.py:29-40) handles different action combinations. If actions are not carried out as expected:
- Check console output for the JSON response
- Verify the action format matches expected structure
- Ensure letter sequences from Vimium are visible in screenshots
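When checking the action format, it can help to picture the dispatch as a chain of key checks. The sketch below is an assumption-laden illustration, not the code in vimbot.py: only the click action and `{"done": true}` are confirmed by this page, while the `navigate` and `type` keys, the combined click-then-type case, and the `bot` object with its methods are hypothetical.

```python
def perform_action(action, bot):
    """Dispatch one parsed action dict to a bot object.

    Returns True when the action signals completion. bot is a hypothetical
    object with click(), type(), and navigate() methods.
    """
    if action.get("done"):
        return True
    if "navigate" in action:            # assumed key
        bot.navigate(action["navigate"])
    elif "click" in action and "type" in action:
        bot.click(action["click"])      # focus a field by its Vimium hint...
        bot.type(action["type"])        # ...then type into it (assumed key)
    elif "click" in action:
        bot.click(action["click"])
    elif "type" in action:
        bot.type(action["type"])
    else:
        # Surfacing unknown shapes makes format mismatches easy to spot.
        raise ValueError(f"Unrecognized action: {action!r}")
    return False
```

Raising on an unrecognized shape, instead of silently ignoring it, turns a confusing "nothing happened" into an error message that shows the offending JSON.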
Next steps
- Review the installation guide for advanced configuration options
- Experiment with different objective phrasings to optimize results
- Check the console output to understand GPT-4V’s decision-making process
- Monitor your OpenAI API usage, since requests that include screenshots consume more tokens than text-only requests