Overview

vimGPT is an autonomous web-browsing agent that combines GPT-4V’s vision capabilities with Vimium’s keyboard-driven navigation to interact with web pages. The system follows a continuous loop of capturing, analyzing, and acting on web content, without relying on DOM parsing.

System flow

The core architecture follows this execution loop:
Playwright → Vimium → Screenshot → GPT-4V → Action → Repeat

1. Browser initialization

The Vimbot class (vimbot.py:10) initializes a Playwright browser context with the Vimium extension pre-loaded:
self.context = (
    sync_playwright()
    .start()
    .chromium.launch_persistent_context(
        "",
        headless=headless,
        args=[
            f"--disable-extensions-except={vimium_path}",
            f"--load-extension={vimium_path}",
        ],
        ignore_https_errors=True,
    )
)
The viewport is set to 1080x720 pixels (vimbot.py:27) to balance between visual clarity and token usage.

2. Screenshot capture with Vimium overlays

The capture() method (vimbot.py:53) activates Vimium’s hint mode to display yellow character sequences over clickable elements:
def capture(self):
    self.page.keyboard.press("Escape")
    self.page.keyboard.type("f")
    screenshot = Image.open(BytesIO(self.page.screenshot())).convert("RGB")
    return screenshot
Pressing “f” triggers Vimium’s link hint mode, which overlays a one- or two-character sequence on each interactive element on the page.

3. Vision processing and action extraction

The vision.py module handles GPT-4V interaction:

Image encoding

Screenshots are resized and encoded (vision.py:16):
def encode_and_resize(image):
    W, H = image.size
    image = image.resize((IMG_RES, int(IMG_RES * H / W)))
    buffer = BytesIO()
    image.save(buffer, format="PNG")
    encoded_image = base64.b64encode(buffer.getvalue()).decode("utf-8")
    return encoded_image
The default IMG_RES is 1080 pixels (vision.py:12); the height is scaled proportionally, maintaining aspect ratio to preserve visual context.
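A standalone sketch of the same resize-and-encode step (using Pillow and base64 as vision.py does; the synthetic white image stands in for a real screenshot):

```python
import base64
from io import BytesIO

from PIL import Image

IMG_RES = 1080  # target width, matching the default in vision.py


def encode_and_resize(image):
    # Fix the width at IMG_RES and scale the height to keep aspect ratio,
    # then serialize to PNG and base64 for the API payload.
    W, H = image.size
    image = image.resize((IMG_RES, int(IMG_RES * H / W)))
    buffer = BytesIO()
    image.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")


# Round-trip check with an image already at the viewport size (1080x720).
encoded = encode_and_resize(Image.new("RGB", (1080, 720), "white"))
decoded = Image.open(BytesIO(base64.b64decode(encoded)))
```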

GPT-4V prompt engineering

The get_actions() function (vision.py:25) sends the screenshot with a structured prompt:
  • Available actions: navigate, type, click, done
  • Expected format: JSON object with action keys
  • Click instructions: Return only the 1-2 letter yellow sequence from Vimium
  • Type instructions: Return both click and type for text input scenarios
  • Completion signal: Return {"done": true} when objective is achieved
The model is currently set to gpt-4o (vision.py:28), with responses capped at 100 tokens (vision.py:46).
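The expected responses are single JSON objects keyed by action name. The payloads below are illustrative examples of each shape, not strings taken from vision.py:

```python
import json

# Hypothetical example responses for each action type described above.
examples = [
    '{"navigate": "news.ycombinator.com"}',     # load a URL
    '{"click": "gd"}',                          # a Vimium hint sequence
    '{"click": "af", "type": "vim tutorial"}',  # focus a field, then type
    '{"done": true}',                           # objective achieved
]

actions = [json.loads(e) for e in examples]
```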

4. JSON response parsing and error handling

The response parsing includes a two-tier error recovery system (vision.py:49):
try:
    json_response = json.loads(response.choices[0].message.content)
except json.JSONDecodeError:
    print("Error: Invalid JSON response")
    cleaned_response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "You are a helpful assistant to fix an invalid JSON response..."
        }]
    )
Error handling strategy:
  1. First attempt: Parse the raw GPT-4V response
  2. Fallback: If JSON is malformed, make a second API call to GPT-4o to clean the response
  3. Final fallback: Return empty dict {} if both attempts fail
This approach handles cases where GPT-4V returns valid instructions but wraps them in markdown code blocks or adds explanatory text.
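A lighter-weight alternative to the second API call — a sketch, not what vision.py actually does — is to strip common markdown wrapping locally before giving up:

```python
import json
import re


def parse_action(raw: str) -> dict:
    """Try plain JSON first, then look for a JSON object inside a
    ``` fence; return {} if nothing parses (the same final fallback
    described above)."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    match = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            pass
    return {}
```

This avoids a second round-trip (and its cost) for the common case where the model merely wrapped valid JSON in a code fence.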

5. Action execution

The perform_action() method (vimbot.py:29) dispatches actions:
def perform_action(self, action):
    if "done" in action:
        return True
    if "click" in action and "type" in action:
        self.click(action["click"])
        self.type(action["type"])
        # Return here so the elif chain below doesn't type a second time.
        return False
    if "navigate" in action:
        self.navigate(action["navigate"])
    elif "type" in action:
        self.type(action["type"])
    elif "click" in action:
        self.click(action["click"])
Action implementations:
  • navigate (vimbot.py:42): Loads URL with automatic https:// prefix
  • type (vimbot.py:45): Types text and presses Enter after 1 second delay
  • click (vimbot.py:50): Simulates typing the Vimium hint sequence
  • done: Returns True to break the execution loop
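The URL normalization performed by navigate can be sketched as follows; the helper name is hypothetical, not the repo's code:

```python
def ensure_scheme(url: str) -> str:
    # Prepend https:// unless the URL already carries a scheme,
    # mirroring the automatic prefix described for navigate above.
    if url.startswith(("http://", "https://")):
        return url
    return f"https://{url}"
```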

6. Main execution loop

The orchestration happens in main.py:29:
while True:
    time.sleep(1)
    print("Capturing the screen...")
    screenshot = driver.capture()
    
    print("Getting actions for the given objective...")
    action = vision.get_actions(screenshot, objective)
    print(f"JSON Response: {action}")
    if driver.perform_action(action):
        break
This creates a continuous feedback loop where each action updates the page state, triggering a new screenshot and analysis cycle.
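The loop above runs until the model reports done. A defensive variant (not in main.py) caps the number of iterations so a confused model cannot loop forever; `driver` and `get_actions` stand in for the Vimbot instance and vision.get_actions:

```python
def run(driver, get_actions, objective, max_steps=30):
    """Bounded control loop sketch: capture, decide, act, repeat."""
    for step in range(max_steps):
        screenshot = driver.capture()
        action = get_actions(screenshot, objective)
        if driver.perform_action(action):  # True once {"done": true}
            return step + 1
    raise RuntimeError(f"objective not reached in {max_steps} steps")


class _FakeDriver:
    """Stand-in driver that reports done on the third action."""

    def __init__(self):
        self.calls = 0

    def capture(self):
        return None  # screenshot placeholder

    def perform_action(self, action):
        self.calls += 1
        return self.calls == 3


steps_taken = run(_FakeDriver(), lambda shot, obj: {}, "find the docs")
```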

Tech stack

Core dependencies

  • Playwright (1.39.0): Browser automation and screenshot capture
  • OpenAI Python SDK (1.1.2): GPT-4V API integration
  • Pillow (10.1.0): Image processing and encoding
  • python-dotenv (1.0.0): Environment variable management for API keys
  • whisper-mic: Voice input processing for voice mode

Browser extension

  • Vimium: Keyboard-driven browser navigation that overlays hint markers on clickable elements

Development tools

The project uses pre-commit hooks (.pre-commit-config.yaml:1) for code quality:
  • trailing-whitespace: Removes trailing whitespace
  • end-of-file-fixer: Ensures files end with newline
  • ssort (v0.11.6): Statement sorting for Python
  • isort (5.12.0): Import sorting with Black profile
  • black (23.3.0): Code formatting with 120 character line length

Voice mode

Voice mode (main.py:17) uses the whisper-mic library to convert speech to text:
if voice_mode:
    print("Voice mode enabled. Listening for your command...")
    mic = WhisperMic()
    try:
        objective = mic.listen()
    except Exception as e:
        print(f"Error in capturing voice input: {e}")
        return
This enables hands-free interaction by speaking the objective instead of typing it.

Design philosophy

Vision-only approach

vimGPT deliberately avoids using DOM parsing or accessibility trees, relying solely on visual understanding. This approach:
  • Tests the limits of multimodal model vision capabilities
  • Simplifies the architecture by eliminating HTML parsing
  • More closely mimics human web browsing behavior
  • Remains functional even on dynamically rendered content

Vimium as the interface layer

Vimium solves the “what to click” problem by providing:
  • Visual markers: Yellow boxes with character sequences
  • Unambiguous targets: Each element gets a unique identifier
  • Keyboard-driven interaction: No need for pixel coordinate calculations
  • Extensibility: Works across all web pages without modification

Extension points

The architecture supports several enhancement opportunities:

Model switching

The vision model can be swapped by changing vision.py:28. Potential alternatives:
  • Other OpenAI vision models
  • Self-hosted models like LLaVa or CogVLM
  • Models with native pixel coordinate output

Resolution tuning

Adjust IMG_RES in vision.py:12 to balance:
  • Higher resolution: Better element detection, more tokens consumed
  • Lower resolution: Faster processing, potential detection failures
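The effect of a given IMG_RES on the transmitted image can be previewed with the same arithmetic encode_and_resize uses:

```python
def resized_dims(width: int, height: int, img_res: int) -> tuple:
    # Same arithmetic as encode_and_resize: fix the width, scale the height.
    return (img_res, int(img_res * height / width))


# A 1080x720 screenshot at the default and at a smaller candidate setting.
full = resized_dims(1080, 720, 1080)
small = resized_dims(1080, 720, 512)
```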

Browser configuration

Modify Playwright launch parameters in vimbot.py:15 for:
  • Headless mode for server deployment
  • Custom user agent strings
  • Cookie/session persistence
  • Different viewport sizes

Performance considerations

Token usage

Each iteration consumes:
  • Screenshot encoding (varies by resolution)
  • Prompt text (~200 tokens)
  • Response generation (capped at 100 tokens)
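The screenshot's share can be approximated with OpenAI's published high-detail image accounting (85 base tokens plus 170 per 512-pixel tile after scaling). Treat this as an estimate: the exact accounting may change, and this sketch assumes images are never upscaled:

```python
import math


def estimate_image_tokens(width: int, height: int) -> int:
    # Approximation of OpenAI's documented high-detail accounting:
    # fit within 2048x2048, scale the shortest side toward 768 (down only),
    # then charge 170 tokens per 512-px tile plus an 85-token base.
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles
```

At the 1080x720 viewport this yields 6 tiles, i.e. roughly a thousand image tokens per iteration — typically the dominant cost next to the ~200-token prompt.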

Latency

Typical iteration timing:
  • Screenshot capture: less than 500ms
  • API call to GPT-4V: 2-5 seconds
  • Action execution: less than 1 second
  • Total: approximately 3-7 seconds per action

Cost

GPT-4V pricing is based on image tokens plus text tokens. At 1080px resolution, expect:
  • ~$0.01-0.03 per iteration
  • Costs scale with task complexity (number of steps)
