vimGPT can be configured through environment variables and code-level settings to customize its behavior for your specific needs.

Environment variables

OPENAI_API_KEY

Required: Yes
Your OpenAI API key for accessing GPT-4V:
export OPENAI_API_KEY="sk-..."
vimGPT loads environment variables using python-dotenv (vision.py:10), so you can also create a .env file:
.env
OPENAI_API_KEY=sk-...
Never commit your .env file to version control. Add it to .gitignore to prevent accidental exposure of your API key.
The API key is set in the vision module:
from dotenv import load_dotenv
import os
import openai

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")
Reference: vision.py:10-11
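Because os.getenv returns None when the variable is missing, a misconfigured environment only surfaces at the first API call. A minimal sketch of a fail-fast check follows; require_api_key is a hypothetical helper and is not part of vimGPT.

```python
import os


def require_api_key(env=os.environ):
    """Return the OpenAI API key, or raise a clear error if it is missing.

    Hypothetical helper: vimGPT itself assigns os.getenv("OPENAI_API_KEY")
    directly without validating it.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Export it or add it to a .env file."
        )
    return key
```

Calling this right after load_dotenv() would cover both the exported variable and the .env file, since load_dotenv() populates os.environ.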

Browser configuration

Viewport size

The browser viewport is configured in the Vimbot class initialization:
self.page.set_viewport_size({"width": 1080, "height": 720})
Reference: vimbot.py:27
Default: 1080x720 pixels
To customize, modify the viewport dimensions in vimbot.py:
self.page.set_viewport_size({"width": 1920, "height": 1080})
Larger viewports may improve element detection but increase token usage and API costs.
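If editing the source for every viewport change is inconvenient, one option is to read the size from a string such as "1920x1080". The sketch below assumes a hypothetical VIMGPT_VIEWPORT environment variable; vimGPT does not read such a variable, and parse_viewport is an illustrative name.

```python
import re


def parse_viewport(spec, default=(1080, 720)):
    """Parse a 'WIDTHxHEIGHT' string into the dict set_viewport_size expects.

    Falls back to the documented 1080x720 default when the string is
    missing or malformed.
    """
    match = re.fullmatch(r"\s*(\d+)\s*[xX]\s*(\d+)\s*", spec or "")
    if match:
        width, height = int(match.group(1)), int(match.group(2))
    else:
        width, height = default
    return {"width": width, "height": height}


# Hypothetical use inside Vimbot.__init__:
#   self.page.set_viewport_size(parse_viewport(os.getenv("VIMGPT_VIEWPORT")))
```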

Headless mode

vimGPT runs in headed mode by default (browser window visible):
class Vimbot:
    def __init__(self, headless=False):
        self.context = (
            sync_playwright()
            .start()
            .chromium.launch_persistent_context(
                "",
                headless=headless,
                args=[
                    f"--disable-extensions-except={vimium_path}",
                    f"--load-extension={vimium_path}",
                ],
                ignore_https_errors=True,
            )
        )
Reference: vimbot.py:11-24
Default: headless=False (browser visible)
To run in headless mode, modify the initialization in main.py:
driver = Vimbot(headless=True)
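Rather than hard-coding the value, the choice could be exposed as a command-line flag. This is a sketch only: the --headless flag and build_parser are assumptions for illustration, and how main.py currently handles its arguments is not shown in this section.

```python
import argparse


def build_parser():
    """Build a parser with a hypothetical --headless switch (off by default)."""
    parser = argparse.ArgumentParser(description="Illustrative vimGPT launcher flags")
    parser.add_argument(
        "--headless",
        action="store_true",
        help="run the browser without a visible window",
    )
    return parser


# Hypothetical wiring in main.py:
#   args = build_parser().parse_args()
#   driver = Vimbot(headless=args.headless)
```

With store_true, omitting the flag keeps the documented default of headless=False.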

Browser extensions

vimGPT loads the Vimium extension from the local directory:
vimium_path = "./vimium-master"
Reference: vimbot.py:7
The extension is downloaded by the setup script:
./setup.sh
This script:
  1. Downloads Vimium from GitHub (setup.sh:1)
  2. Extracts the archive (setup.sh:2)
  3. Cleans up the zip file (setup.sh:3)

Navigation timeout

Page navigation has a 60-second timeout:
def navigate(self, url):
    self.page.goto(url=url if "://" in url else "https://" + url, timeout=60000)
Reference: vimbot.py:42-43
Default: 60000 milliseconds (60 seconds)
To adjust, modify the timeout parameter in vimbot.py.
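The navigate method does two things: it prefixes https:// when the URL has no scheme, and it passes a fixed timeout. The sketch below isolates that logic so the timeout becomes a parameter; normalize_url, goto_args, and DEFAULT_TIMEOUT_MS are illustrative names, not part of vimbot.py.

```python
DEFAULT_TIMEOUT_MS = 60_000  # same value as the hard-coded timeout in navigate()


def normalize_url(url):
    """Prefix https:// when the URL has no scheme, mirroring navigate()."""
    return url if "://" in url else "https://" + url


def goto_args(url, timeout_ms=DEFAULT_TIMEOUT_MS):
    """Build the keyword arguments that navigate() passes to page.goto."""
    return {"url": normalize_url(url), "timeout": timeout_ms}


print(normalize_url("example.com"))            # https://example.com
print(normalize_url("http://localhost:8000"))  # http://localhost:8000
print(goto_args("example.com", timeout_ms=30_000))
```

A navigate(self, url, timeout_ms=DEFAULT_TIMEOUT_MS) signature could then call self.page.goto(**goto_args(url, timeout_ms)).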

HTTPS error handling

HTTPS errors are ignored by default:
ignore_https_errors=True
Reference: vimbot.py:22
This allows vimGPT to navigate to sites with self-signed or expired certificates.

Image processing

Image resolution

Screenshots are resized to optimize token usage:
IMG_RES = 1080

def encode_and_resize(image):
    W, H = image.size
    image = image.resize((IMG_RES, int(IMG_RES * H / W)))
    # ...
Reference: vision.py:12, vision.py:16-18
Default: 1080 pixels wide (maintains aspect ratio)
To use higher resolution images, modify IMG_RES in vision.py:
IMG_RES = 1920  # Higher resolution for better detail
Higher resolution improves element detection but significantly increases API token usage and costs.
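To see what a given IMG_RES does to a screenshot, the resize arithmetic from encode_and_resize can be run on its own. The helper name resized_size is an assumption for illustration; the formula is the one shown above.

```python
IMG_RES = 1080  # default from vision.py


def resized_size(width, height, img_res=IMG_RES):
    """Return the (width, height) that the resize step would produce."""
    # Width is pinned to img_res; height is scaled by the same factor and truncated.
    return (img_res, int(img_res * height / width))


print(resized_size(1080, 720))                 # (1080, 720)
print(resized_size(1920, 1080))                # (1080, 607)
print(resized_size(1920, 1080, img_res=1920))  # (1920, 1080)
```

Note that a 1920x1080 viewport is still scaled down to 1080 pixels wide unless IMG_RES is raised as well.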

Image format

Screenshots are captured and encoded as PNG:
screenshot = Image.open(BytesIO(self.page.screenshot())).convert("RGB")
Reference: vimbot.py:58
The image is then base64-encoded for the API:
buffer = BytesIO()
image.save(buffer, format="PNG")
encoded_image = base64.b64encode(buffer.getvalue()).decode("utf-8")
Reference: vision.py:19-21
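How vision.py attaches the encoded string to the request is not shown in this section. As a hedged sketch, the Chat Completions API accepts inline images as a base64 data URL inside an image_url content part; to_image_part is a hypothetical helper that builds one.

```python
import base64


def to_image_part(png_bytes):
    """Wrap raw PNG bytes as an image_url content part with a base64 data URL."""
    encoded = base64.b64encode(png_bytes).decode("utf-8")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{encoded}"},
    }


# Placeholder bytes (the 8-byte PNG signature); a real call would pass buffer.getvalue().
part = to_image_part(b"\x89PNG\r\n\x1a\n")
print(part["image_url"]["url"][:22])  # data:image/png;base64,
```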

GPT-4V configuration

Model selection

The vision request currently uses the gpt-4o model:
response = openai.chat.completions.create(
    model="gpt-4o",
    # ...
)
Reference: vision.py:28
To use a different model, change the model parameter in vision.py.

Token limits

The API call limits the response to 100 tokens:
max_tokens=100
Reference: vision.py:46
This is sufficient for JSON action responses. Increase it if you need more detailed outputs.

Action timing

vimGPT pauses between actions to allow pages to load:
while True:
    time.sleep(1)
    print("Capturing the screen...")
    screenshot = driver.capture()
Reference: main.py:30
Default: 1 second between actions
Additionally, typing actions include a 1-second wait:
def type(self, text):
    time.sleep(1)
    self.page.keyboard.type(text)
    self.page.keyboard.press("Enter")
Reference: vimbot.py:45-48
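Both delays are hard-coded. For slow pages, one possible approach is to read the delay from the environment and fall back to the 1-second default. The VIMGPT_ACTION_DELAY variable and the action_delay helper are assumptions for illustration; vimGPT does not define them.

```python
import math
import os


def action_delay(env=os.environ, default=1.0):
    """Return the delay in seconds from a hypothetical VIMGPT_ACTION_DELAY variable.

    Non-numeric, negative, or non-finite values fall back to the default.
    """
    raw = env.get("VIMGPT_ACTION_DELAY")
    if raw is None:
        return default
    try:
        delay = float(raw)
    except ValueError:
        return default
    if not math.isfinite(delay) or delay < 0:
        return default
    return delay


# Hypothetical use in the main loop: time.sleep(action_delay())
```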

Vimium configuration

Screenshots are captured with Vimium link hints visible:
def capture(self):
    # capture a screenshot with vim bindings on the screen
    self.page.keyboard.press("Escape")
    self.page.keyboard.type("f")
    
    screenshot = Image.open(BytesIO(self.page.screenshot())).convert("RGB")
    return screenshot
Reference: vimbot.py:53-59
The f key activates Vimium’s link hint mode, overlaying yellow character sequences on all clickable elements.

Error handling

JSON parsing errors

If GPT-4V returns invalid JSON, vimGPT attempts automatic correction:
try:
    json_response = json.loads(response.choices[0].message.content)
except json.JSONDecodeError:
    print("Error: Invalid JSON response")
    cleaned_response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant to fix an invalid JSON response...",
            },
            # ...
        ],
    )
Reference: vision.py:49-66
This makes an additional API call to fix malformed JSON responses.
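Many invalid responses are valid JSON wrapped in a Markdown fence or surrounded by extra prose, which can be recovered locally before paying for a second request. The following is a sketch of that alternative, not vimGPT's method; parse_json_lenient is a hypothetical helper and the "click" key is only an illustrative action.

```python
import json
import re


def parse_json_lenient(text):
    """Try to parse JSON locally, tolerating code fences and surrounding prose.

    Returns the parsed value, or None when nothing parseable is found.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost {...} span, which drops fences and chatter.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return None  # the caller can still fall back to the model-based repair


fenced = '```json\n{"click": "AB"}\n```'
print(parse_json_lenient(fenced))  # {'click': 'AB'}
```

Only when this returns None would the model-based repair call be needed.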

Voice input errors

Voice mode exits gracefully if capture fails:
try:
    objective = mic.listen()
except Exception as e:
    print(f"Error in capturing voice input: {e}")
    return  # Exit if voice input fails
Reference: main.py:21-24

Dependencies

vimGPT requires these key packages:
requirements.txt
openai==1.1.2
playwright==1.39.0
Pillow==10.1.0
python-dotenv==1.0.0
whisper-mic
Reference: requirements.txt
Install all dependencies:
pip install -r requirements.txt
Playwright requires additional browser binaries. Install them with:
playwright install chromium

Summary

Key configuration options:
| Setting | Default | Location |
| --- | --- | --- |
| API key | Required | Environment variable |
| Viewport size | 1080x720 | vimbot.py:27 |
| Image resolution | 1080px | vision.py:12 |
| Headless mode | False | vimbot.py:11 |
| Action delay | 1 second | main.py:30 |
| Navigation timeout | 60 seconds | vimbot.py:43 |
| Max tokens | 100 | vision.py:46 |
| Model | gpt-4o | vision.py:28 |
All code-level customizations require modifying the source files and restarting vimGPT.
