Vision capabilities are currently experimental. Features and performance may vary depending on the model used.
How Vision Works
Page Assist supports two distinct vision modes:
- Native Vision (Recommended)
- OCR Mode (Fallback)
For models with built-in vision capabilities.
How it works:
- Image or screenshot is captured
- Image sent directly to vision-capable model
- AI analyzes visual content natively
- Returns detailed visual understanding
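The flow above can be sketched as a single request payload. This is a minimal illustration assuming an OpenAI-style chat format, where the image travels inline as a base64 data URL; field names vary between providers:

```javascript
// Sketch of how an image reaches a vision-capable model in native mode.
// Assumes an OpenAI-style chat payload; other providers use similar shapes.
function buildVisionPayload(question, imageBase64) {
  return {
    model: 'gpt-4o', // any vision-capable model
    messages: [
      {
        role: 'user',
        content: [
          // The text question and the image travel in the same message
          { type: 'text', text: question },
          {
            type: 'image_url',
            image_url: { url: `data:image/png;base64,${imageBase64}` },
          },
        ],
      },
    ],
  };
}
```

The model then answers from the image content directly, with no OCR step in between.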
Compatible models:
- GPT-4 Vision (OpenAI)
- GPT-4o (OpenAI)
- Claude 3 (Anthropic)
- Gemini Pro Vision (Google)
- LLaVA (Ollama)
- BakLLaVA (Ollama)
- Other vision-enabled models
Advantages:
- True visual understanding
- Accurate spatial reasoning
- Can describe layouts, colors, relationships
- Better with complex images
Trade-offs:
- Requires specific models
- May be slower
- Higher API costs (for cloud models)
Getting Started
Using Native Vision
Configure Vision Model
For Ollama (Local):
Pull a vision-capable model (such as LLaVA), then select it in Settings → Model Settings.
For OpenAI:
- Ensure API key is configured
- Select GPT-4 Vision or GPT-4o
- Configure in respective settings sections
- Verify model supports vision
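For the Ollama route above, pulling a model looks like this (LLaVA is shown as one option; available model names depend on your Ollama version and library):

```shell
# Pull a vision-capable model for local use with Ollama
ollama pull llava

# Confirm the model is available locally before selecting it in Page Assist
ollama list
```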
Enable Vision in Sidebar
- Open sidebar (Ctrl+Shift+Y)
- Navigate to any webpage
- Click the eye icon in input area
- Select a vision-capable model
- Start asking about what you see
Using OCR Mode
Enable OCR Extraction
- Expand the Submit button dropdown
- Enable “Extract Text From Image (OCR)”
- Now images will be processed with OCR
Sidebar Vision Features
Webpage Screenshots
Analyze the current webpage visually.
Use cases:
- UI/UX review: “Analyze the design of this page”
- Accessibility audit: “Are there any accessibility issues?”
- Layout analysis: “Describe the page structure”
- Visual comparison: “How does this compare to Material Design?”
Steps:
- Navigate to target webpage
- Open sidebar
- Enable vision mode (eye icon)
- Ask visual questions
- Page Assist captures and analyzes screenshot automatically
Image Upload
Analyze uploaded images.
Upload Image
- Click the attachment/image icon in input area
- Select image file from your device
- Image preview appears in chat
Vision Use Cases
UI/UX Analysis
- Design critique and feedback
- Layout and spacing review
- Color scheme analysis
- Accessibility evaluation
- Responsive design check
Code Screenshots
- Analyze code in images
- Debug from screenshots
- Extract code snippets
- Review error messages
Diagrams & Charts
- Interpret flowcharts
- Analyze data visualizations
- Explain architectural diagrams
- Read infographics
Content Extraction
- Extract text from images
- Read handwritten notes (limited)
- Transcribe screenshots
- Get data from tables
Real-World Examples
Design Feedback
Scenario: Get feedback on a UI mockup
Steps:
- Upload design mockup
- Enable vision with Claude or GPT-4V
- Ask: “Provide detailed UI/UX feedback on this design. Consider usability, accessibility, and visual hierarchy.”
Error Debugging
Scenario: Debug an error from a screenshot
Steps:
- Take screenshot of error
- Upload to chat
- Ask: “What’s causing this error and how do I fix it?”
Page Comparison
Scenario: Compare two website designs
Steps:
- Visit first website, enable vision
- Ask: “Describe this design”
- Visit second website (keep chat open)
- Ask: “How does this design differ from the previous page?”
Diagram Explanation
Scenario: Understand a complex architecture diagram
Steps:
- Upload diagram image
- Use GPT-4V or Claude 3
- Ask: “Explain this architecture diagram in detail. What are the components and how do they interact?”
Model Recommendations
Best Vision Models
- Cloud (Highest Quality)
- Local (Privacy & Speed)
GPT-4 Vision / GPT-4o (OpenAI)
- Excellent visual understanding
- Accurate spatial reasoning
- Great with text in images
- Best for: Complex scenes, detailed analysis
Claude 3 (Anthropic)
- Strong visual capabilities
- Good context understanding
- Detailed descriptions
- Best for: UI analysis, diagrams
Gemini Pro Vision (Google)
- Fast processing
- Good general vision
- Multimodal understanding
- Best for: Quick analysis, multi-image tasks
Local models are completely private but may have lower accuracy than cloud models. Choose based on your privacy vs. performance needs.
Advanced Techniques
Multi-Image Analysis
Analyze multiple images in one conversation:
- Upload first image
- Ask question about it
- Upload second image
- Ask comparative question
- AI considers both images in context
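The multi-image flow above can be sketched as one message history, again assuming an OpenAI-style chat format (the assistant reply shown is a placeholder; real payload shapes vary by provider):

```javascript
// Sketch of a multi-image conversation: both images stay in context,
// so the second question can refer back to the first image.
function buildComparisonHistory(firstImageBase64, secondImageBase64) {
  return [
    { role: 'user', content: [
      { type: 'text', text: 'Describe this design.' },
      { type: 'image_url',
        image_url: { url: `data:image/png;base64,${firstImageBase64}` } },
    ]},
    // Placeholder for the model's first answer
    { role: 'assistant', content: '…' },
    { role: 'user', content: [
      { type: 'text', text: 'How does this second design differ from the first?' },
      { type: 'image_url',
        image_url: { url: `data:image/png;base64,${secondImageBase64}` } },
    ]},
  ];
}
```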
Combining Vision with Other Features
Vision + Internet Search
Identify objects then search for information:
- Enable vision + internet search
- Upload image: “What building is this?”
- AI identifies building visually
- Searches web for current information
- Returns: Name, history, visiting hours, etc.
Vision + Knowledge Base
Compare images with documentation:
- Select knowledge base with design docs
- Enable vision
- Upload design screenshot
- Ask: “Does this match our design system guidelines?”
- AI compares image with your docs
Vision + Webpage Chat
Enhanced page analysis:
- Enable both vision and webpage chat
- AI gets both visual and textual page content
- Ask: “Are there discrepancies between visible UI and underlying HTML?”
- More comprehensive analysis
Prompt Engineering for Vision
Get better results by using specific, detailed prompts rather than generic questions. For example, instead of “What is this?”, ask “Describe the layout, color scheme, and visual hierarchy of this page.”
Configuration
OCR Settings
When using OCR mode:
- Go to sidebar settings
- Find vision options
- Enable “Extract Text From Image (OCR)”
- OCR uses Tesseract.js for text extraction
OCR accuracy depends on image quality, text size, and font clarity. Works best with clear, high-contrast text.
Vision Model Selection
Choose the right model for your task:
- For UI/Design Work: GPT-4V, Claude 3
- For Speed: Gemini Pro Vision, LLaVA
- For Privacy: LLaVA, BakLLaVA (local)
- For Accuracy: GPT-4V, Claude 3
- For Cost: Local models, Gemini
Limitations
What Vision Can Do
- Identify objects, scenes, and people
- Describe layouts and compositions
- Read text in images (with varying accuracy)
- Analyze colors and visual style
- Detect UI/UX issues
- Interpret charts and diagrams
- Compare visual elements
What Vision Cannot Do
- Process video (only static images)
- Real-time analysis of dynamic content
- Perfect OCR on all text (especially handwriting)
- Guarantee 100% accuracy on complex scenes
- Process very large images (check model limits)
- Analyze images requiring domain expertise it lacks
Troubleshooting
Vision not working
Possible causes:
- Non-vision model selected
- Image format not supported
- Network issues (cloud models)
Solutions:
- Verify you’re using a vision-capable model
- Check image format (use JPG or PNG)
- Try different model
- Enable OCR mode as fallback
- Check API key if using cloud models
Poor image analysis
Possible causes:
- Low image quality
- Model limitations
- Vague prompt
Solutions:
- Use higher quality images
- Try advanced models (GPT-4V, Claude 3)
- Be more specific in your prompts
- Provide context about what you want analyzed
OCR extraction fails
Possible causes:
- Poor image quality
- Unusual fonts
- Low contrast
- Handwriting
Solutions:
- Use clearer images
- Increase image resolution
- Improve text contrast
- Try vision models instead of OCR
Image upload fails
Possible causes:
- File too large
- Unsupported format
- Browser issues
Solutions:
- Compress image (keep under 10MB)
- Convert to JPG or PNG
- Try different browser
- Check browser console for errors
Privacy and Security
Image Processing: When using vision features, images are sent to your configured AI provider.
Privacy Comparison
| Model Type | Privacy Level | Notes |
|---|---|---|
| Local (LLaVA, BakLLaVA) | Highest | Complete privacy, no data sent externally |
| OpenAI GPT-4V | Medium | Data sent to OpenAI, subject to their policies |
| Claude 3 | Medium | Data sent to Anthropic |
| Gemini | Medium | Data sent to Google |
Best Practices
Next Steps
Chat with Webpage
Combine vision with text-based webpage analysis
Internet Search
Identify objects visually then search for info
Knowledge Base
Compare images with your documentation
Prompts
Create vision-specific prompt templates