Overview

Actions are JSON objects that tell Vimbot what to do next. They are typically generated by the vision module’s get_actions() function from GPT-4V analysis of screenshots, but you can also create them manually for programmatic control.

Action format

In Python, an action is a dictionary whose keys name the action type and whose values carry its argument. Each action type has its own key-value structure.
action = {
    "action_type": "value"
}
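
Because an action is just a dictionary, it can be checked before it reaches the browser. The following is a minimal sketch, assuming the four keys named in this page (navigate, click, type, done) are the only valid ones; validate_action is a hypothetical helper and is not part of vimbot.

```python
# Hypothetical helper, not part of vimbot: sanity-checks an action dict.
ALLOWED_KEYS = {"navigate", "click", "type", "done"}


def validate_action(action):
    """Return a list of problems; an empty list means the action looks usable."""
    if not isinstance(action, dict):
        return ["action must be a dict"]
    problems = []
    if not action:
        problems.append("action is empty")
    # Any key outside the documented set is reported, sorted for stable output
    unknown = sorted(str(key) for key in set(action) - ALLOWED_KEYS)
    if unknown:
        problems.append(f"unknown keys: {unknown}")
    # navigate, click and type are documented as str; done may hold any value
    for key in ("navigate", "click", "type"):
        if key in action and not isinstance(action[key], str):
            problems.append(f"{key!r} must be a string")
    return problems


print(validate_action({"click": "a", "type": "search query"}))  # []
print(validate_action({"scroll": "down"}))
```

Returning a list of problems rather than raising lets the caller decide whether to retry, log, or skip the action.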

Action types

Navigate

Navigates to a specified URL.
{
  "navigate": "https://www.example.com"
}
navigate
str
required
The URL to navigate to. The https:// protocol is automatically added if not present.
Example:
from vimbot import Vimbot

driver = Vimbot()
action = {"navigate": "github.com"}
driver.perform_action(action)
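
The protocol defaulting described above can be pictured as follows. This is only a sketch: how Vimbot detects an existing protocol is not shown on this page, so the `://` check and the helper name with_default_scheme are assumptions made for illustration.

```python
def with_default_scheme(url, scheme="https"):
    """Prefix url with scheme:// unless it already names a protocol.

    Hypothetical helper; the "://" test is an assumed detection rule.
    """
    url = url.strip()
    if "://" in url:
        return url
    return f"{scheme}://{url}"


print(with_default_scheme("github.com"))          # https://github.com
print(with_default_scheme("http://example.com"))  # http://example.com
```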

Click

Clicks on an element using Vimium keyboard shortcuts.
{
  "click": "ab"
}
click
str
required
The one- or two-letter Vimium hint shown in the yellow boxes on the page. The hints appear after pressing ‘f’ in Vimium or calling driver.capture().
Example:
from vimbot import Vimbot

driver = Vimbot()
driver.navigate("https://www.google.com")

# Capture shows Vimium hints
screenshot = driver.capture()

# Click on element with hint "a"
action = {"click": "a"}
driver.perform_action(action)
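
Since a hint is documented as one or two letters, a malformed value (for example a full word returned by the model) can be rejected before it is sent as keystrokes. The sketch below is a hypothetical helper, not part of vimbot; lowercasing the hint is an assumption, on the basis that the hint is typed as plain keys.

```python
import re

# One or two ASCII letters, as described for the click value
HINT_PATTERN = re.compile(r"[A-Za-z]{1,2}")


def normalize_hint(hint):
    """Return the hint lowercased, or raise ValueError if it is not 1-2 letters."""
    hint = hint.strip()
    if not HINT_PATTERN.fullmatch(hint):
        raise ValueError(f"not a 1-2 letter hint: {hint!r}")
    return hint.lower()  # assumption: hints are typed as lowercase keys


print(normalize_hint("AB"))  # ab
```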

Type

Types text into the currently focused input field and presses Enter.
{
  "type": "autonomous web browsing"
}
type
str
required
The text to type. An Enter key press is automatically added at the end.
Example:
from vimbot import Vimbot

driver = Vimbot()
driver.navigate("https://www.google.com")
driver.click("a")  # Focus search box

action = {"type": "GPT-4 vision"}
driver.perform_action(action)  # Types and presses Enter

Click and type

Combines clicking on an element and then typing text. This is the most common action for interacting with input fields.
{
  "click": "a",
  "type": "search query"
}
click
str
required
The Vimium hint to click on (typically an input field).
type
str
required
The text to type after clicking.
Example:
from vimbot import Vimbot

driver = Vimbot()
driver.navigate("https://www.google.com")

# Click search box and type in one action
action = {
    "click": "a",
    "type": "autonomous agents"
}
driver.perform_action(action)

Done

Signals that the objective has been completed.
{
  "done": true
}
done
any
Any value (typically true or an empty string). The presence of the key is what matters.
Return value: When perform_action() receives a “done” action, it returns True, allowing the automation loop to exit.
Example:
import vision
from vimbot import Vimbot

driver = Vimbot()
driver.navigate("https://www.google.com")

while True:
    screenshot = driver.capture()
    action = vision.get_actions(screenshot, "search for Python")
    
    if driver.perform_action(action):  # Returns True when action contains "done"
        print("Task completed!")
        break

Action execution order

When perform_action() processes an action dictionary, it follows this priority order:
  1. Check for done: If “done” key exists, return True immediately
  2. Check for click and type: If both keys exist, execute click then type
  3. Check for navigate: If “navigate” key exists, navigate to URL
  4. Check for type only: If only “type” key exists, type text
  5. Check for click only: If only “click” key exists, click element
Implementation:
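def perform_action(self, action):
    # "done" takes priority over every other key
    if "done" in action:
        return True
    # Combined action: click the element, then type into it
    if "click" in action and "type" in action:
        self.click(action["click"])
        self.type(action["type"])
    # Single-key actions; elif keeps a combined action from typing twice
    elif "navigate" in action:
        self.navigate(action["navigate"])
    elif "type" in action:
        self.type(action["type"])
    elif "click" in action:
        self.click(action["click"])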
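
To see which calls each action shape produces without opening a browser, the priority order listed above can be exercised against a stand-in. RecordingBot and trace are hypothetical names used only for this sketch; the class records calls instead of driving Vimium, and its explicit `return False` is this sketch's choice rather than documented behavior.

```python
class RecordingBot:
    """Stand-in for Vimbot that records calls instead of driving a browser."""

    def __init__(self):
        self.calls = []

    def navigate(self, url):
        self.calls.append(("navigate", url))

    def click(self, hint):
        self.calls.append(("click", hint))

    def type(self, text):
        self.calls.append(("type", text))

    def perform_action(self, action):
        # Follows the documented priority order, one branch per action
        if "done" in action:
            return True
        if "click" in action and "type" in action:
            self.click(action["click"])
            self.type(action["type"])
        elif "navigate" in action:
            self.navigate(action["navigate"])
        elif "type" in action:
            self.type(action["type"])
        elif "click" in action:
            self.click(action["click"])
        return False  # assumption: falsy result for every non-done action


def trace(action):
    """Run one action on a fresh RecordingBot and return (finished, calls)."""
    bot = RecordingBot()
    finished = bot.perform_action(action)
    return finished, bot.calls


print(trace({"click": "a", "type": "search query"}))
# (False, [('click', 'a'), ('type', 'search query')])
print(trace({"done": True, "navigate": "https://example.com"}))
# (True, [])
```

The second call shows that a done key short-circuits everything else in the same dictionary.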

GPT-4V generated actions

When using vision.get_actions(), the GPT-4V model is instructed to:
  • Return only valid JSON with keys from: navigate, type, click, done
  • For clicks: Return only the 1-2 letter yellow hint sequence
  • For typing in input fields: Return both click and type keys
  • Choose the most appropriate action based on the objective
  • Return done when the page satisfies the objective
Real examples from usage:
# Searching on Google
{"click": "a", "type": "autonomous web browsing"}

# Clicking a search result
{"click": "ba"}

# Navigating to a new site
{"navigate": "https://github.com"}

# Task completion
{"done": true}
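
The examples above are JSON as the model returns it, so booleans are written true rather than Python's True. A minimal sketch of the conversion using the standard json module follows; parse_action is a hypothetical helper, and returning an empty dictionary on invalid input is modeled on the behavior this page describes for get_actions() rather than taken from its source.

```python
import json


def parse_action(text):
    """Parse a model response into an action dict; return {} if it is not a JSON object."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return {}
    return value if isinstance(value, dict) else {}


print(parse_action('{"done": true}'))  # {'done': True}
print(parse_action("Sure! Click the first link."))  # {}
print(json.dumps({"done": True}))  # {"done": true}
```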

Manual action creation

You can create actions programmatically for deterministic automation:
from vimbot import Vimbot

driver = Vimbot()

# Sequence of manual actions
actions = [
    {"navigate": "https://www.google.com"},
    {"click": "a", "type": "GitHub vimGPT"},
    {"click": "b"},  # Click first result
    {"done": True}
]

for action in actions:
    if driver.perform_action(action):
        break

Error handling

If vision.get_actions() cannot parse the model's response as JSON, including on its second attempt, it returns an empty dictionary:
action = vision.get_actions(screenshot, objective)
# If parsing fails twice: action = {}

if not action:
    print("No valid action returned")
    # Handle error (retry, skip, etc.)
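
One way to handle an empty result is to take a fresh screenshot and ask again a bounded number of times. The sketch below assumes that approach; get_action_with_retry and its attempts parameter are hypothetical, and the fake functions stand in for vision.get_actions() and driver.capture() so the example runs without a browser or API key.

```python
def get_action_with_retry(get_actions, capture, objective, attempts=3):
    """Call get_actions on a fresh capture until it returns a non-empty dict."""
    for _ in range(attempts):
        action = get_actions(capture(), objective)
        if action:
            return action
    return {}  # still empty after all attempts; the caller decides what to do


# Fakes for illustration: two empty results, then a usable action
responses = iter([{}, {}, {"click": "a"}])


def fake_get_actions(screenshot, objective):
    return next(responses)


def fake_capture():
    return "screenshot"


print(get_action_with_retry(fake_get_actions, fake_capture, "search for Python"))
# {'click': 'a'}
```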

Best practices

  1. Always capture before acting: Call driver.capture() to get updated Vimium hints before determining actions
  2. Use click + type for inputs: When filling forms, combine click and type in a single action
  3. Add delays between actions: Call time.sleep(1) at the end of each loop iteration to give pages time to load
  4. Check for done: Always check if perform_action() returns True to handle completion
  5. Handle empty actions: Check if the vision module returns an empty dict and implement retry logic
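
The practices above can be combined into a single loop, sketched here under stated assumptions: run, max_steps, delay, and FakeDriver are all hypothetical names, the step cap is an addition not mentioned elsewhere on this page, and the fake driver replaces Vimbot so the example runs standalone.

```python
import time


def run(driver, get_actions, objective, max_steps=20, delay=1.0):
    """Loop until a done action arrives; return True if it did, False otherwise."""
    for _ in range(max_steps):
        screenshot = driver.capture()  # fresh Vimium hints before deciding
        action = get_actions(screenshot, objective)
        if not action:  # empty dict: wait, then try again with a new capture
            time.sleep(delay)
            continue
        if driver.perform_action(action):  # True signals completion
            return True
        time.sleep(delay)  # give the page time to load
    return False


class FakeDriver:
    """Stand-in for Vimbot that records actions instead of using a browser."""

    def __init__(self):
        self.performed = []

    def capture(self):
        return "screenshot"

    def perform_action(self, action):
        self.performed.append(action)
        return "done" in action


script = iter([{"click": "a", "type": "Python"}, {}, {"done": True}])
driver = FakeDriver()
finished = run(driver, lambda shot, obj: next(script), "search for Python", delay=0)
print(finished, len(driver.performed))  # True 2
```

Capping the number of steps keeps a confused model from looping indefinitely; the right limit depends on the task.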
