Getting started
vimGPT is an open-source project that welcomes contributions from the community. Whether you’re fixing bugs, adding features, or improving documentation, your help is appreciated.

Repository

The project is hosted on GitHub: github.com/ishan0102/vimGPT

Prerequisites
Before contributing, ensure you have:

- Python 3.8 or higher
- Git installed
- OpenAI API key for testing
- Familiarity with Playwright and OpenAI APIs (helpful but not required)
Development setup
1. Fork and clone
2. Install dependencies

Install the Python requirements (typically pip install -r requirements.txt).
3. Download Vimium extension

Run the setup.sh script, which downloads the Vimium extension into the project directory.
4. Configure environment
Create a .env file with your API key (the OpenAI client reads it from the OPENAI_API_KEY environment variable).
5. Install pre-commit hooks
The project uses pre-commit hooks to maintain code quality:

- trailing-whitespace: Removes trailing whitespace
- end-of-file-fixer: Ensures files end with a newline
- ssort: Sorts Python statements
- isort: Sorts imports with Black profile
- black: Formats code with 120 character line length
The full hook configuration is defined in .pre-commit-config.yaml.
6. Test the installation

Run the script once to confirm the browser launches and the Vimium hints render.
Code style
vimGPT follows strict formatting guidelines enforced by pre-commit hooks.

Formatting standards
- Line length: 120 characters (Black configuration)
- Import sorting: isort with Black profile
- Statement sorting: ssort for consistent Python statement order
- Whitespace: No trailing whitespace, files end with newline
Running formatters manually

You can invoke all configured hooks on demand with pre-commit run --all-files.
Code organization
The codebase follows a simple structure:

- main.py: Entry point and orchestration loop
- vimbot.py: Browser automation with Playwright
- vision.py: GPT-4V integration and image processing
- setup.sh: Vimium extension download script
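The way these files fit together is a perceive-decide-act loop. A minimal sketch, using hypothetical stand-ins for the real Vimbot and vision helpers (the function names here are illustrative, not the project's actual API):

```python
def run(objective, perceive, decide, act, max_steps=10):
    """Drive the loop until the model signals completion or max_steps is hit.

    perceive : capture the current page (vimbot.py's role)
    decide   : ask the vision model for the next action (vision.py's role)
    act      : execute the chosen action in the browser (vimbot.py's role)
    """
    for step in range(max_steps):
        screenshot = perceive()
        action = decide(objective, screenshot)
        if action.get("done"):
            return step + 1  # number of iterations used
        act(action)
    return max_steps
```

The loop is deliberately dumb: all intelligence lives in `decide`, which keeps main.py a thin orchestrator.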
Making changes
1. Create a feature branch
Use a descriptive branch name, for example:

- feature/add-json-mode for new features
- fix/screenshot-resolution for bug fixes
- docs/update-readme for documentation
2. Make your changes
Edit the relevant files. Common areas for contribution:

Vision model improvements (vision.py)
- Enhance prompt engineering for better action extraction
- Add support for new action types
- Implement better error handling
- Optimize image resolution and encoding
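Until structured outputs are available, action extraction has to tolerate prose and markdown fences around the model's JSON reply. A minimal sketch of such a parser (illustrative, not the project's actual implementation):

```python
import json

def extract_action(model_text):
    """Return the first decodable JSON object found in free-form model output.

    GPT-4V is prompted to answer with JSON, but replies are often wrapped
    in explanation text or ``` fences, so scan for a decodable object
    instead of calling json.loads on the whole string.
    """
    decoder = json.JSONDecoder()
    for i, ch in enumerate(model_text):
        if ch == "{":
            try:
                obj, _ = decoder.raw_decode(model_text, i)
                return obj
            except json.JSONDecodeError:
                continue  # not a valid object starting here; keep scanning
    return None  # no JSON object anywhere in the reply
```

`raw_decode` is handy here because it decodes a value starting at an index and ignores trailing text, which a plain `json.loads` would reject.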
Browser automation (vimbot.py)
- Add new action types (scroll, hover, etc.)
- Improve element clicking reliability
- Add screenshot annotation features
Orchestration (main.py)
- Add cycle detection to prevent infinite loops
- Implement task completion validation
- Add logging and telemetry
3. Test your changes
Run the script with various objectives, covering scenarios such as:

- Pages with slow loading times
- Sites with complex JavaScript interactions
- Pages with overlapping Vimium hints
4. Commit your changes
Pre-commit hooks will automatically format your code on commit. If a hook fails:

- Review the errors
- Fix the issues (often auto-fixed by the hooks themselves)
- Stage the fixes with git add .
- Commit again
5. Push and create a pull request

Your pull request should include:
- Clear description of changes
- Motivation and context
- Testing steps performed
- Screenshots (if UI-related)
Contribution ideas
The project maintainer has outlined several enhancement opportunities in the README. Here are the current areas for improvement:

High priority
JSON mode support
Once OpenAI supports JSON mode for the Vision API, update vision.py to use structured outputs instead of prompt-based JSON extraction.

Cycle detection
Build a graph-based retry mechanism to prevent infinite loops when the bot repeatedly clicks the same element.
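A first step toward this, short of a full graph of page states, is a repeat counter keyed on the action signature. A sketch (class and method names are illustrative, not existing project code):

```python
from collections import Counter

class CycleGuard:
    """Flag a likely loop when the same action is issued too many times.

    A fuller graph-based approach would also track page-state transitions;
    this sketch only counts identical (type, target) signatures.
    """

    def __init__(self, limit=3):
        self.limit = limit
        self.seen = Counter()

    def record(self, action):
        """Record one action; return False once it has repeated `limit` times."""
        sig = (action.get("type"), action.get("target"))
        self.seen[sig] += 1
        return self.seen[sig] < self.limit
```

The orchestrator would call `record()` each iteration and abort (or re-prompt the model with a hint) when it returns False.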
Higher resolution images
Experiment with higher resolution screenshots to improve element detection. Balance token usage vs accuracy.
Assistant API integration
Use the Assistant API for automatic context retrieval and conversation history once it supports Vision.
Medium priority
Vimium fork for selective overlays
Create a specialized Vimium version that overlays elements based on the user query context, effectively pruning irrelevant elements.

Implementation notes:
- Fork the Vimium repository
- Add context-aware filtering logic
- Test different sized boxes and colors
- Integrate with vimGPT’s objective system
Fine-tune open-source vision models
Train models like LLaVA, CogVLM, or Fuyu-8B specifically for web navigation tasks.

Benefits:
- Faster inference (local deployment)
- Lower costs (no API fees)
- CogVLM can specify pixel coordinates directly
Requirements:

- Dataset of web navigation tasks
- GPU resources for training
- Evaluation metrics for accuracy
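For the dataset, one possible shape is a JSONL file of (objective, screenshot, hints, action) records. The field names below are a hypothetical schema for illustration, not an established format:

```python
import json

# Hypothetical record layout for a web-navigation fine-tuning dataset.
record = {
    "objective": "Find the pricing page",
    "screenshot": "screens/0001.png",          # path to the captured frame
    "vimium_hints": {"FJ": "Pricing", "GD": "Docs"},  # hint -> link text
    "action": {"type": "click", "target": "FJ"},      # ground-truth label
}
line = json.dumps(record)  # one record per line in the .jsonl file
```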
Dual-frame input (with/without Vimium)
Provide screenshots both with and without Vimium overlays to prevent the yellow boxes from obscuring page content.

Implementation:
- Capture two screenshots per iteration
- Send both to GPT-4V in a single request
- Update prompt to explain the dual-view approach
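Sketching the request: the two frames can travel as separate image parts in a single user message. This assumes the OpenAI-style content-parts format with base64 data URLs; verify against the current Vision API docs before relying on it:

```python
import base64

def dual_frame_message(objective, raw_png, hinted_png):
    """Build one user message carrying both frames (illustrative helper)."""
    def image_part(png_bytes):
        b64 = base64.b64encode(png_bytes).decode()
        return {"type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"}}

    return {
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"Objective: {objective}. The first image is the raw "
                     "page; the second has Vimium hints overlaid."},
            image_part(raw_png),     # frame without overlays
            image_part(hinted_png),  # frame with Vimium hints
        ],
    }
```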
Accessibility tree integration
Pass Chrome’s accessibility tree as additional input alongside the screenshot.

Benefits:
- Provides structured layout information
- Maps interactive elements to Vimium bindings
- Improves reliability for complex UIs
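As a sketch, the nested snapshot returned by Playwright's page.accessibility.snapshot() (dicts with role, name, and children keys) can be flattened into indented text for inclusion in the prompt:

```python
def flatten_ax_tree(node, depth=0):
    """Render an accessibility snapshot into indented "role: name" lines.

    Expects the nested-dict shape Playwright's snapshot produces; the
    rendering format itself is just one illustrative choice.
    """
    lines = [f"{'  ' * depth}{node.get('role', '?')}: {node.get('name', '')}"]
    for child in node.get("children", []):
        lines.extend(flatten_ax_tree(child, depth + 1))
    return lines
```

The resulting text is compact enough to append after the screenshot, giving the model structured names for elements the image alone may render ambiguously.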
Advanced features
Visual question answering
Enable the bot to read and extract information from pages, not just navigate them.

Use cases:
- Summarize news articles
- Reply to emails based on context
- Answer questions about page content
- Extract structured data from web pages
Implementation:

- Add new action types: extract or answer
- Return information to the user instead of performing actions
- Chain multiple GPT-4V calls for complex tasks
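A dispatcher that separates information-returning actions from browser actions could look like this (hypothetical sketch; `bot` stands in for the Vimbot instance, and the click/type method names are assumptions, not the project's confirmed API):

```python
def handle(action, bot):
    """Route a parsed action: extract/answer return text to the user,
    everything else drives the browser and returns None."""
    kind = action.get("type")
    if kind in ("extract", "answer"):
        return action.get("text", "")  # surface information, no browser call
    if kind == "click":
        bot.click(action["target"])
    elif kind == "type":
        bot.type(action["text"])
    return None
```

The caller can then print any non-None result to the user and otherwise continue the navigation loop.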
Browser session persistence
Make vimGPT work with your actual browser instead of a headless instance.

Benefits:
- Use saved cookies and sessions
- Access authenticated pages
- Interact with payment forms (“order food with my credit card”)
Challenges:

- Security concerns with automation on real accounts
- Browser extension limitations in Playwright
- Need for user confirmation on sensitive actions
Accessibility features for blind users
Enhance voice mode to create an “agent” interface for page navigation.

Features:
- Full voice control (input and output)
- Natural language conversations about page content
- Assistant API integration for multi-turn dialogues
- Screen reader integration
JavaScript-based DOM labeling
Replace Vimium with custom JavaScript that labels DOM elements with colored boxes.

Inspiration: Similar approach by DivGarg

Advantages:
- More control over visual markers
- Context-aware element highlighting
- Better integration with page structure
Testing guidelines
Manual testing
Test your changes with diverse scenarios:

- Simple tasks: Google search, Wikipedia lookup
- Multi-step workflows: Search → Click result → Find specific section
- Complex UIs: Sites with dropdowns, modals, dynamic content
- Edge cases: Slow networks, timeout scenarios, malformed pages
Adding automated tests
While the project currently lacks a test suite, contributions that add one are welcome.

Documentation
Improvements to documentation are highly valued:

- Update README.md with new features
- Add code comments for complex logic
- Create examples for common use cases
- Improve error messages
- Add type hints to function signatures
Pull request guidelines
Before submitting
- Code passes all pre-commit hooks
- Changes are tested manually
- Documentation is updated (if applicable)
- Commit messages are descriptive
- No unnecessary files are committed (screenshots, .env, etc.)
PR description template
Code review process
- Maintainers will review your PR
- Feedback may be provided for improvements
- Make requested changes and push updates
- Once approved, your PR will be merged
Community
Discussions and support
- GitHub Issues: Bug reports and feature requests
- HackerNews thread: Discussion on vimGPT
- Pull Requests: Code contributions and reviews
Recognition
The project has been featured in:

- WIRED: AI Assistant Testing Article
- VisualWebArena Research Paper: Evaluating Multimodal Agents (page 9)
- HackerNews: Front page discussion with community feedback
Related projects
- globe-engineer/globot: Similar browser automation
- nat/natbot: Natural language browser control
License
By contributing to vimGPT, you agree that your contributions will be licensed under the same license as the project.

Questions?
If you’re unsure about anything:

- Check existing issues and PRs for similar discussions
- Open a GitHub issue with the “question” label
- Review the source code and comments for implementation details