Skip to main content

Overview

The vision module uses OpenAI’s GPT-4V (GPT-4 with vision) to analyze screenshots and determine the next action to take based on a given objective. It handles image encoding, API communication, and response parsing.

Configuration

Environment variables

The vision module requires an OpenAI API key to be set:
OPENAI_API_KEY=your_api_key_here

Image resolution

IMG_RES = 1080
Screenshots are resized to 1080 pixels width while maintaining aspect ratio before being sent to the GPT-4V API.

Functions

get_actions()

get_actions(screenshot: PIL.Image.Image, objective: str) -> dict
Analyzes a screenshot and determines the next action to take based on the objective.
screenshot
PIL.Image.Image
required
A PIL Image object containing the screenshot with Vimium hints visible. Typically obtained from Vimbot.capture().
objective
str
required
A natural language description of the task to accomplish (e.g., “search for Python tutorials”, “upvote the pinterest post”).
action
dict
A dictionary containing the action to perform. See Actions for the complete format specification. Returns an empty dictionary {} if JSON parsing fails twice.
Behavior:
  • Encodes and resizes the screenshot using encode_and_resize()
  • Sends the screenshot and objective to GPT-4V (model: gpt-4o)
  • Parses the JSON response to extract the action
  • If JSON parsing fails, attempts to fix the response using a second GPT-4 call
  • Returns an empty dict if both parsing attempts fail
Example:
import vision
from vimbot import Vimbot

driver = Vimbot()
driver.navigate("https://www.google.com")

screenshot = driver.capture()
action = vision.get_actions(screenshot, "search for autonomous agents")
print(action)  # {"click": "a", "type": "autonomous agents"}

driver.perform_action(action)
GPT-4V prompt behavior: The function instructs GPT-4V to:
  • Choose between navigate, type, click, and done actions
  • For clicks: Return only the 1-2 letter sequence from yellow Vimium hint boxes
  • For typing: Return both a click (to focus the input) and type (with the message)
  • Return done when the objective is satisfied
  • Respond only in JSON format without code blocks
Error handling:
# If initial JSON parsing fails
try:
    json_response = json.loads(response.choices[0].message.content)
except json.JSONDecodeError:
    # Attempts to fix invalid JSON using a second GPT-4 call
    # Returns {} if fixing also fails

encode_and_resize()

encode_and_resize(image: PIL.Image.Image) -> str
Resizes an image and encodes it to a base64 string for API transmission.
image
PIL.Image.Image
required
A PIL Image object to encode and resize.
encoded_image
str
A base64-encoded PNG image string, resized to 1080 pixels width while maintaining aspect ratio.
Behavior:
  • Calculates aspect ratio from original image dimensions
  • Resizes to 1080 pixels width (maintains aspect ratio)
  • Converts to PNG format
  • Encodes to base64 string
Example:
from PIL import Image
import vision

image = Image.open("screenshot.png")
encoded = vision.encode_and_resize(image)
print(encoded[:50])  # iVBORw0KGgoAAAANSUhEUgAABDgAAAH0CAYAAAD...
Implementation details:
W, H = image.size
image = image.resize((IMG_RES, int(IMG_RES * H / W)))
buffer = BytesIO()
image.save(buffer, format="PNG")
encoded_image = base64.b64encode(buffer.getvalue()).decode("utf-8")

API configuration

Model

The vision module uses OpenAI’s gpt-4o model for vision analysis.

Token limits

max_tokens=100
Responses are limited to 100 tokens, which is sufficient for JSON action responses.

Usage pattern

Typical workflow for vision-guided automation:
import time
import vision
from vimbot import Vimbot

driver = Vimbot()
driver.navigate("https://www.google.com")

objective = "search for GPT-4 vision capabilities and open the first result"

while True:
    time.sleep(1)
    screenshot = driver.capture()
    action = vision.get_actions(screenshot, objective)
    print(f"Action: {action}")
    
    if driver.perform_action(action):
        print("Objective completed!")
        break

Build docs developers (and LLMs) love