Getting started
vimGPT is an open-source project that welcomes contributions from the community. Whether you’re fixing bugs, adding features, or improving documentation, your help is appreciated.

Repository

The project is hosted on GitHub: github.com/ishan0102/vimGPT

Prerequisites
Before contributing, ensure you have:

- Python 3.8 or higher
- Git installed
- OpenAI API key for testing
- Familiarity with Playwright and OpenAI APIs (helpful but not required)
Development setup
1. Fork and clone
2. Install dependencies

Install the Python requirements (typically pip install -r requirements.txt).
3. Download Vimium extension

Run the setup.sh script, which downloads the Vimium extension into the project directory.
4. Configure environment
Create a .env file with your API key (the OpenAI client reads it from the OPENAI_API_KEY environment variable).
5. Install pre-commit hooks
The project uses pre-commit hooks to maintain code quality:

- trailing-whitespace: Removes trailing whitespace
- end-of-file-fixer: Ensures files end with a newline
- ssort: Sorts Python statements
- isort: Sorts imports with Black profile
- black: Formats code with 120 character line length
The full hook configuration is defined in .pre-commit-config.yaml.
6. Test the installation

Run the script once to confirm the browser launches and the Vimium hints render.
Code style
vimGPT follows strict formatting guidelines enforced by pre-commit hooks.

Formatting standards
- Line length: 120 characters (Black configuration)
- Import sorting: isort with Black profile
- Statement sorting: ssort for consistent Python statement order
- Whitespace: No trailing whitespace, files end with newline
Running formatters manually

You can invoke all configured hooks on demand with pre-commit run --all-files.
Code organization
The codebase follows a simple structure:

- main.py: Entry point and orchestration loop
- vimbot.py: Browser automation with Playwright
- vision.py: GPT-4V integration and image processing
- setup.sh: Vimium extension download script
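The way these files fit together is a perceive-decide-act loop. A minimal sketch, using hypothetical stand-ins for the real Vimbot and vision helpers (the function names here are illustrative, not the project's actual API):

```python
def run(objective, perceive, decide, act, max_steps=10):
    """Drive the loop until the model signals completion or max_steps is hit.

    perceive : capture the current page (vimbot.py's role)
    decide   : ask the vision model for the next action (vision.py's role)
    act      : execute the chosen action in the browser (vimbot.py's role)
    """
    for step in range(max_steps):
        screenshot = perceive()
        action = decide(objective, screenshot)
        if action.get("done"):
            return step + 1  # number of iterations used
        act(action)
    return max_steps
```

The loop is deliberately dumb: all intelligence lives in `decide`, which keeps main.py a thin orchestrator.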
Making changes
1. Create a feature branch
Use a descriptive branch name, for example:

- feature/add-json-mode for new features
- fix/screenshot-resolution for bug fixes
- docs/update-readme for documentation
2. Make your changes
Edit the relevant files. Common areas for contribution:

Vision model improvements (vision.py)
- Enhance prompt engineering for better action extraction
- Add support for new action types
- Implement better error handling
- Optimize image resolution and encoding
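Until structured outputs are available, action extraction has to tolerate prose and markdown fences around the model's JSON reply. A minimal sketch of such a parser (illustrative, not the project's actual implementation):

```python
import json

def extract_action(model_text):
    """Return the first decodable JSON object found in free-form model output.

    GPT-4V is prompted to answer with JSON, but replies are often wrapped
    in explanation text or ``` fences, so scan for a decodable object
    instead of calling json.loads on the whole string.
    """
    decoder = json.JSONDecoder()
    for i, ch in enumerate(model_text):
        if ch == "{":
            try:
                obj, _ = decoder.raw_decode(model_text, i)
                return obj
            except json.JSONDecodeError:
                continue  # not a valid object starting here; keep scanning
    return None  # no JSON object anywhere in the reply
```

`raw_decode` is handy here because it decodes a value starting at an index and ignores trailing text, which a plain `json.loads` would reject.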
Browser automation (vimbot.py)
- Add new action types (scroll, hover, etc.)
- Improve element clicking reliability
- Add screenshot annotation features
Orchestration (main.py)
- Add cycle detection to prevent infinite loops
- Implement task completion validation
- Add logging and telemetry
3. Test your changes
Run the script with various objectives, covering scenarios such as:

- Pages with slow loading times
- Sites with complex JavaScript interactions
- Pages with overlapping Vimium hints
4. Commit your changes
Pre-commit hooks will automatically format your code on commit. If a hook fails:

- Review the errors
- Fix the issues (often auto-fixed by the hooks themselves)
- Stage the fixes with git add .
- Commit again
5. Push and create a pull request

Your pull request should include:
- Clear description of changes
- Motivation and context
- Testing steps performed
- Screenshots (if UI-related)
Contribution ideas
The project maintainer has outlined several enhancement opportunities in the README. Here are the current areas for improvement:

High priority
JSON mode support
Once OpenAI supports JSON mode for the Vision API, update vision.py to use structured outputs instead of prompt-based JSON extraction.

Cycle detection
Build a graph-based retry mechanism to prevent infinite loops when the bot repeatedly clicks the same element.
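A first step toward this, short of a full graph of page states, is a repeat counter keyed on the action signature. A sketch (class and method names are illustrative, not existing project code):

```python
from collections import Counter

class CycleGuard:
    """Flag a likely loop when the same action is issued too many times.

    A fuller graph-based approach would also track page-state transitions;
    this sketch only counts identical (type, target) signatures.
    """

    def __init__(self, limit=3):
        self.limit = limit
        self.seen = Counter()

    def record(self, action):
        """Record one action; return False once it has repeated `limit` times."""
        sig = (action.get("type"), action.get("target"))
        self.seen[sig] += 1
        return self.seen[sig] < self.limit
```

The orchestrator would call `record()` each iteration and abort (or re-prompt the model with a hint) when it returns False.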
Higher resolution images
Experiment with higher resolution screenshots to improve element detection. Balance token usage vs accuracy.
Assistant API integration
Use the Assistant API for automatic context retrieval and conversation history once it supports Vision.
Medium priority
Vimium fork for selective overlays
Create a specialized Vimium version that overlays elements based on the user query context, effectively pruning irrelevant elements.

Implementation notes:
- Fork the Vimium repository
- Add context-aware filtering logic
- Test different sized boxes and colors
- Integrate with vimGPT’s objective system
Fine-tune open-source vision models
Train models like LLaVA, CogVLM, or Fuyu-8B specifically for web navigation tasks.

Benefits:
- Faster inference (local deployment)
- Lower costs (no API fees)
- CogVLM can specify pixel coordinates directly
Requirements:

- Dataset of web navigation tasks
- GPU resources for training
- Evaluation metrics for accuracy
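For the dataset, one possible shape is a JSONL file of (objective, screenshot, hints, action) records. The field names below are a hypothetical schema for illustration, not an established format:

```python
import json

# Hypothetical record layout for a web-navigation fine-tuning dataset.
record = {
    "objective": "Find the pricing page",
    "screenshot": "screens/0001.png",          # path to the captured frame
    "vimium_hints": {"FJ": "Pricing", "GD": "Docs"},  # hint -> link text
    "action": {"type": "click", "target": "FJ"},      # ground-truth label
}
line = json.dumps(record)  # one record per line in the .jsonl file
```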
Dual-frame input (with/without Vimium)
Provide screenshots both with and without Vimium overlays to prevent the yellow boxes from obscuring page content.

Implementation:
- Capture two screenshots per iteration
- Send both to GPT-4V in a single request
- Update prompt to explain the dual-view approach
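Sketching the request: the two frames can travel as separate image parts in a single user message. This assumes the OpenAI-style content-parts format with base64 data URLs; verify against the current Vision API docs before relying on it:

```python
import base64

def dual_frame_message(objective, raw_png, hinted_png):
    """Build one user message carrying both frames (illustrative helper)."""
    def image_part(png_bytes):
        b64 = base64.b64encode(png_bytes).decode()
        return {"type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"}}

    return {
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"Objective: {objective}. The first image is the raw "
                     "page; the second has Vimium hints overlaid."},
            image_part(raw_png),     # frame without overlays
            image_part(hinted_png),  # frame with Vimium hints
        ],
    }
```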
Accessibility tree integration
Pass Chrome’s accessibility tree as additional input alongside the screenshot.

Benefits:
- Provides structured layout information
- Maps interactive elements to Vimium bindings
- Improves reliability for complex UIs
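As a sketch, the nested snapshot returned by Playwright's page.accessibility.snapshot() (dicts with role, name, and children keys) can be flattened into indented text for inclusion in the prompt:

```python
def flatten_ax_tree(node, depth=0):
    """Render an accessibility snapshot into indented "role: name" lines.

    Expects the nested-dict shape Playwright's snapshot produces; the
    rendering format itself is just one illustrative choice.
    """
    lines = [f"{'  ' * depth}{node.get('role', '?')}: {node.get('name', '')}"]
    for child in node.get("children", []):
        lines.extend(flatten_ax_tree(child, depth + 1))
    return lines
```

The resulting text is compact enough to append after the screenshot, giving the model structured names for elements the image alone may render ambiguously.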
Advanced features
Visual question answering
Enable the bot to read and extract information from pages, not just navigate them.

Use cases:
- Summarize news articles
- Reply to emails based on context
- Answer questions about page content
- Extract structured data from web pages
Implementation:

- Add new action types: extract or answer
- Return information to the user instead of performing actions
- Chain multiple GPT-4V calls for complex tasks
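A dispatcher that separates information-returning actions from browser actions could look like this (hypothetical sketch; `bot` stands in for the Vimbot instance, and the click/type method names are assumptions, not the project's confirmed API):

```python
def handle(action, bot):
    """Route a parsed action: extract/answer return text to the user,
    everything else drives the browser and returns None."""
    kind = action.get("type")
    if kind in ("extract", "answer"):
        return action.get("text", "")  # surface information, no browser call
    if kind == "click":
        bot.click(action["target"])
    elif kind == "type":
        bot.type(action["text"])
    return None
```

The caller can then print any non-None result to the user and otherwise continue the navigation loop.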
Browser session persistence
Make vimGPT work with your actual browser instead of a headless instance.

Benefits:
- Use saved cookies and sessions
- Access authenticated pages
- Interact with payment forms (“order food with my credit card”)
Challenges:

- Security concerns with automation on real accounts
- Browser extension limitations in Playwright
- Need for user confirmation on sensitive actions
Accessibility features for blind users
Enhance voice mode to create an “agent” interface for page navigation.

Features:
- Full voice control (input and output)
- Natural language conversations about page content
- Assistant API integration for multi-turn dialogues
- Screen reader integration
JavaScript-based DOM labeling
Replace Vimium with custom JavaScript that labels DOM elements with colored boxes.

Inspiration: Similar approach by DivGarg

Advantages:
- More control over visual markers
- Context-aware element highlighting
- Better integration with page structure
Testing guidelines
Manual testing
Test your changes with diverse scenarios:

- Simple tasks: Google search, Wikipedia lookup
- Multi-step workflows: Search → Click result → Find specific section
- Complex UIs: Sites with dropdowns, modals, dynamic content
- Edge cases: Slow networks, timeout scenarios, malformed pages
Adding automated tests
While the project currently lacks a test suite, contributions that add one are welcome.

Documentation
Improvements to documentation are highly valued:

- Update README.md with new features
- Add code comments for complex logic
- Create examples for common use cases
- Improve error messages
- Add type hints to function signatures
Pull request guidelines
Before submitting
- Code passes all pre-commit hooks
- Changes are tested manually
- Documentation is updated (if applicable)
- Commit messages are descriptive
- No unnecessary files are committed (screenshots, .env, etc.)
PR description template
Code review process
- Maintainers will review your PR
- Feedback may be provided for improvements
- Make requested changes and push updates
- Once approved, your PR will be merged
Community
Discussions and support
- GitHub Issues: Bug reports and feature requests
- HackerNews thread: Discussion on vimGPT
- Pull Requests: Code contributions and reviews
Recognition
The project has been featured in:

- WIRED: AI Assistant Testing Article
- VisualWebArena Research Paper: Evaluating Multimodal Agents (page 9)
- HackerNews: Front page discussion with community feedback
Related projects
- globe-engineer/globot: Similar browser automation
- nat/natbot: Natural language browser control
License
By contributing to vimGPT, you agree that your contributions will be licensed under the same license as the project.

Questions?
If you’re unsure about anything:

- Check existing issues and PRs for similar discussions
- Open a GitHub issue with the “question” label
- Review the source code and comments for implementation details