Overview
Actions are JSON objects that define what the Vimbot should do next. They are typically generated by the vision module’s get_actions() function based on GPT-4V analysis of screenshots, but can also be created manually for programmatic control.
Action format
Actions are represented as Python dictionaries with specific keys. Each action type has its own key-value structure.
Action types
Navigate
Navigates to a specified URL.
navigate: The URL to navigate to. The https:// protocol is automatically added if not present.
Click
Clicks on an element using Vimium keyboard shortcuts.
click: The 1-2 letter Vimium hint sequence from the yellow boxes displayed on the page. The hints are obtained by pressing ‘f’ in Vimium or by calling driver.capture().
Type
Types text into the currently focused input field and presses Enter.
type: The text to type. An Enter key press is automatically added at the end.
Click and type
Combines clicking on an element and then typing text. This is the most common action for interacting with input fields.
click: The Vimium hint to click on (typically an input field).
type: The text to type after clicking.
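As a quick reference, the action shapes described above can be written as plain Python dictionaries. This is a minimal sketch: the URL, hint letters, and typed text are made-up placeholders, not values taken from a real page.

```python
import json

# Illustrative action dictionaries; every value here is a placeholder.
navigate_action = {"navigate": "example.com"}    # https:// is added if missing
click_action = {"click": "AB"}                   # hypothetical Vimium hint letters
type_action = {"type": "vimium extension"}       # Enter is pressed after typing
click_and_type_action = {"click": "C", "type": "vimium extension"}

# Actions are JSON objects, so each one round-trips through json unchanged.
for action in (navigate_action, click_action, type_action, click_and_type_action):
    assert json.loads(json.dumps(action)) == action
    print(json.dumps(action))
```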
Done
Signals that the objective has been completed.
done: Any value (typically true or an empty string). The presence of the key is what matters.
When perform_action() receives a “done” action, it returns True, allowing the automation loop to exit.
Example:
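A minimal sketch of a loop that stops on a done action. The perform_action function below is a hypothetical stand-in that models only the return value described above; it is not the Vimbot implementation, and the planned actions are placeholders.

```python
def perform_action(action: dict) -> bool:
    """Hypothetical stand-in: report True only when the "done" key is present."""
    return "done" in action


# Placeholder action sequence; the last entry signals completion.
planned = [{"navigate": "example.com"}, {"click": "A"}, {"done": True}]

executed = 0
for action in planned:
    if perform_action(action):
        print("Objective complete, exiting loop")
        break
    executed += 1  # count the actions handled before completion
```

Because only the presence of the key matters, {"done": ""} ends the loop just as {"done": True} does.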
Action execution order
When perform_action() processes an action dictionary, it follows this priority order:
- Check for done: If “done” key exists, return True immediately
- Check for click and type: If both keys exist, execute click then type
- Check for navigate: If “navigate” key exists, navigate to URL
- Check for type only: If only “type” key exists, type text
- Check for click only: If only “click” key exists, click element
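The priority order above can be sketched as a chain of key checks. RecordingDriver is a hypothetical stand-in that records calls instead of driving a browser, and the "://" test used to decide whether to prepend https:// is an assumption. The sketch illustrates the documented order and is not a copy of the actual implementation.

```python
class RecordingDriver:
    """Hypothetical stand-in for the browser driver; records calls only."""

    def __init__(self):
        self.calls = []

    def navigate(self, url):
        # Assumption: a URL without "://" has no protocol, so https:// is added.
        self.calls.append(("navigate", url if "://" in url else "https://" + url))

    def click(self, hint):
        self.calls.append(("click", hint))

    def type(self, text):
        self.calls.append(("type", text))

    def perform_action(self, action):
        if "done" in action:                          # 1. done wins over everything
            return True
        if "click" in action and "type" in action:    # 2. click, then type
            self.click(action["click"])
            self.type(action["type"])
        elif "navigate" in action:                    # 3. navigate
            self.navigate(action["navigate"])
        elif "type" in action:                        # 4. type only
            self.type(action["type"])
        elif "click" in action:                       # 5. click only
            self.click(action["click"])
        return False


driver = RecordingDriver()
# click + type takes priority over navigate when all three keys are present.
driver.perform_action({"click": "C", "type": "hello", "navigate": "example.com"})
print(driver.calls)
```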
GPT-4V generated actions
When using vision.get_actions(), the GPT-4V model is instructed to:
- Return only valid JSON with keys from: navigate, type, click, done
- For clicks: Return only the 1-2 letter yellow hint sequence
- For typing in input fields: Return both click and type keys
- Choose the most appropriate action based on the objective
- Return done when the page satisfies the objective
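Since the model is only instructed to follow these rules, it can be useful to check a returned action before executing it. The helper below is a hypothetical sketch, not part of the vision module. It enforces the allowed keys and the 1-2 letter hint format listed above.

```python
ALLOWED_KEYS = {"navigate", "type", "click", "done"}


def is_valid_action(action) -> bool:
    """Hypothetical check that an action follows the documented constraints."""
    if not isinstance(action, dict) or not action:
        return False                      # empty or non-dict results are rejected
    if not set(action) <= ALLOWED_KEYS:
        return False                      # unknown key such as "scroll"
    if "click" in action:
        hint = action["click"]
        # Hints are documented as a 1-2 letter sequence.
        if not (isinstance(hint, str) and 1 <= len(hint) <= 2 and hint.isalpha()):
            return False
    return True


print(is_valid_action({"click": "AB", "type": "hello"}))
print(is_valid_action({"click": "ABC"}))
```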
Manual action creation
You can create actions programmatically for deterministic automation.
Error handling
If vision.get_actions() fails to parse JSON, it returns an empty dictionary.
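The fallback behavior can be illustrated with a small parser. parse_actions is a hypothetical helper that mirrors the behavior described here (invalid JSON yields an empty dictionary). It is not the vision module's own code, and treating non-object JSON as a failure is an added assumption.

```python
import json


def parse_actions(raw: str) -> dict:
    """Hypothetical helper: return the parsed action, or {} if parsing fails."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    # Assumption: anything other than a JSON object is treated as a failure too.
    return parsed if isinstance(parsed, dict) else {}


actions = parse_actions("Sorry, I cannot help with that.")
if not actions:
    # An empty dict is falsy, so a plain truthiness check detects the failure.
    print("No valid action returned; capture again and retry")
```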
Best practices
- Always capture before acting: Call driver.capture() to get updated Vimium hints before determining actions
- Use click + type for inputs: When filling forms, combine click and type in a single action
- Add delays between actions: Use time.sleep(1) between action loops to allow pages to load
- Check for done: Always check if perform_action() returns True to handle completion
- Handle empty actions: Check if the vision module returns an empty dict and implement retry logic
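These practices can be combined into a single loop, sketched below under stated assumptions. run_loop, FakeDriver, and fake_get_actions are hypothetical names, and the get_actions(screenshot, objective) call signature is assumed rather than taken from this page. The fakes let the sketch run without a browser or API key, and the sleep function is injectable so the demo does not actually wait.

```python
import time


def run_loop(driver, get_actions, objective, max_steps=10, max_retries=3,
             delay=1.0, sleep=time.sleep):
    """Capture, decide, act, and pause until done or the limits are reached."""
    retries = 0
    for _ in range(max_steps):
        screenshot = driver.capture()              # refresh hints before deciding
        action = get_actions(screenshot, objective)
        if not action:                             # empty dict: parsing failed
            retries += 1
            if retries > max_retries:
                return False
            sleep(delay)
            continue
        retries = 0
        if driver.perform_action(action):          # True means "done"
            return True
        sleep(delay)                               # give the page time to load
    return False


class FakeDriver:
    """Hypothetical stand-in that records actions instead of using a browser."""

    def __init__(self):
        self.performed = []

    def capture(self):
        return "screenshot"

    def perform_action(self, action):
        self.performed.append(action)
        return "done" in action


# Scripted responses: one parse failure, one click + type, then done.
responses = iter([{}, {"click": "A", "type": "hello"}, {"done": True}])


def fake_get_actions(screenshot, objective):
    return next(responses)


driver = FakeDriver()
finished = run_loop(driver, fake_get_actions, "search for hello",
                    delay=0, sleep=lambda seconds: None)
print(finished, driver.performed)
```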