Actions

Overview

Actions are JSON objects that tell Vimbot what to do next. They are usually generated by the vision module’s get_actions() function from GPT-4V's analysis of a screenshot, but you can also build them by hand for programmatic control.

Action format

In Python, an action is a dictionary. Each action type has its own key, and the value carries that action's argument.

action = {
    "action_type": "value"
}
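Because a model reply arrives as JSON text while perform_action() works on a Python dictionary, it may help to see the conversion. The following is a minimal sketch using only the standard library json module; the raw_reply string is a made-up example, not captured output.

```python
import json

# Hypothetical reply text, shaped like the actions described on this page.
raw_reply = '{"click": "a", "type": "autonomous web browsing"}'

# json.loads turns the JSON object into a Python dict.
action = json.loads(raw_reply)
print(action["click"])  # a

# JSON's lowercase `true` becomes Python's `True` after parsing.
done_action = json.loads('{"done": true}')
print(done_action)  # {'done': True}
```

When writing actions directly in Python source, use True rather than true.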

Action types

Navigate

Navigates to a specified URL.

{
  "navigate": "https://www.example.com"
}

navigate (str, required): The URL to navigate to. If the URL has no protocol, https:// is added automatically.

Example:

from vimbot import Vimbot

driver = Vimbot()
action = {"navigate": "github.com"}
driver.perform_action(action)
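The protocol handling described above can be pictured roughly as follows. This is a sketch only: with_scheme is a hypothetical helper, not part of Vimbot, and the "://" check is an assumption about how a missing protocol might be detected.

```python
def with_scheme(url: str) -> str:
    """Return url unchanged if it already names a scheme, else prefix https://."""
    # Assumption: "has a scheme" is approximated by the presence of "://".
    return url if "://" in url else "https://" + url


print(with_scheme("github.com"))             # https://github.com
print(with_scheme("http://localhost:8000"))  # http://localhost:8000
```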

Click

Clicks on an element using Vimium keyboard shortcuts.

{
  "click": "ab"
}

click (str, required): The 1-2 letter Vimium hint sequence shown in the yellow boxes on the page. Hints appear when you press ‘f’ in Vimium or call driver.capture().

Example:

from vimbot import Vimbot

driver = Vimbot()
driver.navigate("https://www.google.com")

# Capture shows Vimium hints
screenshot = driver.capture()

# Click on element with hint "a"
action = {"click": "a"}
driver.perform_action(action)

Type

Types text into the currently focused input field and presses Enter.

{
  "type": "autonomous web browsing"
}

type (str, required): The text to type. An Enter key press is added automatically at the end.

Example:

from vimbot import Vimbot

driver = Vimbot()
driver.navigate("https://www.google.com")
driver.click("a")  # Focus search box

action = {"type": "GPT-4 vision"}
driver.perform_action(action)  # Types and presses Enter

Click and type

Combines clicking on an element and then typing text. This is the most common action for interacting with input fields.

{
  "click": "a",
  "type": "search query"
}

click (str, required): The Vimium hint to click on (typically an input field).

type (str, required): The text to type after clicking.

Example:

from vimbot import Vimbot

driver = Vimbot()
driver.navigate("https://www.google.com")

# Click search box and type in one action
action = {
    "click": "a",
    "type": "autonomous agents"
}
driver.perform_action(action)

Done

Signals that the objective has been completed.

{
  "done": true
}

done (any): Any value (typically true or an empty string). The presence of the key is what matters.

Return value: When perform_action() receives a “done” action, it returns True, allowing the automation loop to exit.

Example:

import vision
from vimbot import Vimbot

driver = Vimbot()
driver.navigate("https://www.google.com")

while True:
    screenshot = driver.capture()
    action = vision.get_actions(screenshot, "search for Python")
    
    if driver.perform_action(action):  # Returns True when action contains "done"
        print("Task completed!")
        break

Action execution order

When perform_action() processes an action dictionary, it follows this priority order:

1. Check for done: If “done” key exists, return True immediately
2. Check for click and type: If both keys exist, execute click then type
3. Check for navigate: If “navigate” key exists, navigate to URL
4. Check for type only: If only “type” key exists, type text
5. Check for click only: If only “click” key exists, click element

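Implementation, sketched as a single if/elif chain so that each action runs exactly one branch:

def perform_action(self, action):
    # "done" takes priority: signal completion to the caller.
    if "done" in action:
        return True
    # Click and type together: focus the element, then type into it.
    if "click" in action and "type" in action:
        self.click(action["click"])
        self.type(action["type"])
    # elif (not if) keeps a click + type action from also matching
    # the type-only branch below and typing the text a second time.
    elif "navigate" in action:
        self.navigate(action["navigate"])
    elif "type" in action:
        self.type(action["type"])
    elif "click" in action:
        self.click(action["click"])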
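To see the priority order in practice without opening a browser, the sketch below uses a hypothetical RecordingDriver that logs calls instead of performing them. It is a stand-in for illustration, not Vimbot itself, and returning False for actions other than done is a choice made for this sketch.

```python
class RecordingDriver:
    """Hypothetical stand-in for Vimbot that records calls instead of driving a browser."""

    def __init__(self):
        self.calls = []

    def navigate(self, url):
        self.calls.append(("navigate", url))

    def click(self, hint):
        self.calls.append(("click", hint))

    def type(self, text):
        self.calls.append(("type", text))

    def perform_action(self, action):
        # Same priority order as the list above, written as one if/elif chain.
        if "done" in action:
            return True
        if "click" in action and "type" in action:
            self.click(action["click"])
            self.type(action["type"])
        elif "navigate" in action:
            self.navigate(action["navigate"])
        elif "type" in action:
            self.type(action["type"])
        elif "click" in action:
            self.click(action["click"])
        return False


driver = RecordingDriver()
driver.perform_action({"click": "a", "type": "hello"})
print(driver.calls)                         # [('click', 'a'), ('type', 'hello')]
print(driver.perform_action({"done": ""}))  # True
```

An action that carries both click and type produces exactly one click followed by one type call.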

GPT-4V generated actions

When using vision.get_actions(), the GPT-4V model is instructed to:

Return only valid JSON with keys from: navigate, type, click, done
For clicks: Return only the 1-2 letter yellow hint sequence
For typing in input fields: Return both click and type keys
Choose the most appropriate action based on the objective
Return done when the page satisfies the objective
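Since the model is only instructed to follow these rules, a caller may want to check a returned action before executing it. The following is a minimal sketch of such a check; validate_action, ALLOWED_KEYS, and HINT_PATTERN are hypothetical names introduced here, not part of the vision module.

```python
import re

ALLOWED_KEYS = {"navigate", "type", "click", "done"}
HINT_PATTERN = re.compile(r"[A-Za-z]{1,2}")  # 1-2 letter Vimium hint


def validate_action(action):
    """Return a list of problems; an empty list means the action looks well-formed."""
    if not isinstance(action, dict) or not action:
        return ["action must be a non-empty dict"]
    problems = []
    unknown = set(action) - ALLOWED_KEYS
    if unknown:
        problems.append(f"unknown keys: {sorted(unknown)}")
    if "click" in action:
        hint = action["click"]
        if not isinstance(hint, str) or not HINT_PATTERN.fullmatch(hint):
            problems.append("click must be a 1-2 letter hint")
    for key in ("navigate", "type"):
        if key in action and not isinstance(action[key], str):
            problems.append(f"{key} must be a string")
    return problems


print(validate_action({"click": "ba"}))     # []
print(validate_action({"click": "abc"}))    # ['click must be a 1-2 letter hint']
print(validate_action({"scroll": "down"}))  # ["unknown keys: ['scroll']"]
```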

Real examples from usage:

# Searching on Google
{"click": "a", "type": "autonomous web browsing"}

# Clicking a search result
{"click": "ba"}

# Navigating to a new site
{"navigate": "https://github.com"}

# Task completion
{"done": true}

Manual action creation

You can create actions programmatically for deterministic automation:

from vimbot import Vimbot

driver = Vimbot()

# Sequence of manual actions
actions = [
    {"navigate": "https://www.google.com"},
    {"click": "a", "type": "GitHub vimGPT"},
    {"click": "b"},  # Click first result
    {"done": True}
]

for action in actions:
    if driver.perform_action(action):
        break

Error handling

If vision.get_actions() cannot parse the model's response as JSON, it returns an empty dictionary:

action = vision.get_actions(screenshot, objective)
# If parsing fails twice: action = {}

if not action:
    print("No valid action returned")
    # Handle error (retry, skip, etc.)
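One way to handle the empty-dictionary case is a small retry wrapper. The sketch below is an assumption-laden example: get_action_with_retry and its attempts and delay parameters are hypothetical, and the function takes get_actions as an argument so it can be tried with a stub instead of a real API call.

```python
import time


def get_action_with_retry(get_actions, screenshot, objective, attempts=3, delay=1.0):
    """Call get_actions until it returns a non-empty dict or attempts run out."""
    for attempt in range(attempts):
        action = get_actions(screenshot, objective)
        if action:
            return action
        if attempt < attempts - 1:
            time.sleep(delay)  # brief pause before asking again
    return {}  # still nothing usable; the caller decides what to do next


# Demo with a stub that fails once, then succeeds (no API call involved).
replies = iter([{}, {"click": "a"}])
result = get_action_with_retry(lambda s, o: next(replies), None, "demo", delay=0)
print(result)  # {'click': 'a'}
```

With the real module, the call would look like get_action_with_retry(vision.get_actions, screenshot, objective).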

Best practices

Always capture before acting: Call driver.capture() to get updated Vimium hints before determining actions
Use click + type for inputs: When filling forms, combine click and type in a single action
Add delays between actions: Call time.sleep(1) between loop iterations to give pages time to load
Check for done: Always check if perform_action() returns True to handle completion
Handle empty actions: Check if the vision module returns an empty dict and implement retry logic
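The practices above can be combined into one loop. This is a sketch under stated assumptions: run and its max_steps cap are hypothetical additions (the cap guards against a loop that never reaches done), and the driver and get_actions are passed in so the loop can be exercised with stubs.

```python
import time


def run(driver, get_actions, objective, max_steps=20, delay=1.0):
    """Drive the capture -> decide -> act loop; return True if the objective completed."""
    for _ in range(max_steps):
        screenshot = driver.capture()      # refresh Vimium hints before deciding
        action = get_actions(screenshot, objective)
        if not action:                     # empty dict: wait, then try again
            time.sleep(delay)
            continue
        if driver.perform_action(action):  # True signals "done"
            return True
        time.sleep(delay)                  # give the page time to load
    return False                           # step budget exhausted without "done"
```

With the real classes, a call would look like run(Vimbot(), vision.get_actions, "search for Python").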
