The Vision feature enables AI models to “see” and analyze images, screenshots, and webpage content visually. This goes beyond text extraction to provide true visual understanding, making it perfect for UI analysis, design feedback, diagram interpretation, and more.
Vision capabilities are currently experimental. Features and performance may vary depending on the model used.

How Vision Works

Page Assist supports two distinct vision modes:
  • Native Vision: a vision-capable model interprets the image directly, providing true visual understanding.
  • OCR Mode: text is extracted from the image with Tesseract.js, so any model can work with the text content, though without visual understanding.

Getting Started

Using Native Vision

Step 1: Configure Vision Model

For Ollama (Local):
ollama pull llava
# or
ollama pull bakllava
Then select it in Settings → Model Settings.
For OpenAI:
  • Ensure API key is configured
  • Select GPT-4 Vision or GPT-4o
For Other Providers:
  • Configure in respective settings sections
  • Verify model supports vision
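
For context, a native-vision request to a local Ollama server looks roughly like the sketch below. It assumes Ollama's default port (11434) and the llava model pulled above; this is illustrative only, not Page Assist's actual code.

// A minimal sketch of a vision request against a local Ollama server.
const imageBase64 = "..."; // placeholder: base64-encoded image bytes

const res = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llava",
    prompt: "Describe this screenshot in detail.",
    images: [imageBase64], // raw base64, without a data: URL prefix
    stream: false,
  }),
});

const { response } = await res.json();
console.log(response); // the model's description of the image
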
Step 2: Enable Vision in Sidebar

  1. Open sidebar (Ctrl+Shift+Y)
  2. Navigate to any webpage
  3. Click the eye icon in input area
  4. Select a vision-capable model
  5. Start asking about what you see
Step 3: Start Analyzing

With vision enabled:
  • “What’s on this page?”
  • “Describe the layout”
  • “What colors are used?”
  • “Find any errors in the UI”

Using OCR Mode

Step 1: Enable Vision

  1. Open sidebar
  2. Click the eye icon (vision mode)
  3. Select any model (vision not required)
Step 2: Enable OCR Extraction

  1. Expand the Submit button dropdown
  2. Enable “Extract Text From Image (OCR)”
  3. Now images will be processed with OCR
Step 3: Analyze Text Content

Ask questions about text visible in the image:
  • “What text is shown?”
  • “Extract all URLs”
  • “Summarize the visible content”

Webpage Screenshots

Analyze the current webpage visually.
Use cases:
  • UI/UX review: “Analyze the design of this page”
  • Accessibility audit: “Are there any accessibility issues?”
  • Layout analysis: “Describe the page structure”
  • Visual comparison: “How does this compare to Material Design?”
How to use:
  1. Navigate to target webpage
  2. Open sidebar
  3. Enable vision mode (eye icon)
  4. Ask visual questions
  5. Page Assist captures and analyzes a screenshot automatically (see the sketch below)
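
As background, a browser extension can capture the visible tab through Chrome's extension API. This is a minimal sketch of that capability, not Page Assist's actual capture pipeline:

// Requires a suitable permission (e.g. "activeTab") in the extension manifest.
const screenshotDataUrl: string = await chrome.tabs.captureVisibleTab({
  format: "png",
});
// screenshotDataUrl is a "data:image/png;base64,..." string that can be
// passed to a vision-capable model as an image attachment.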

Image Upload

Analyze uploaded images:

Step 1: Upload Image

  1. Click the attachment/image icon in input area
  2. Select image file from your device
  3. Image preview appears in chat
Step 2: Ask Questions

Example queries:
  • “What’s in this image?”
  • “Describe the scene in detail”
  • “What text can you see?”
  • “What are the main colors?”
  • “Is there anything unusual?”
Supported formats: JPG, PNG, GIF, WebP, SVG
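
As background, browsers typically turn an uploaded file into a base64 data URL before it can be attached to a model request. A minimal sketch of the standard pattern (not Page Assist's internal code):

// Standard browser pattern: read an uploaded File as a base64 data URL.
function fileToDataUrl(file: File): Promise<string> {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onload = () => resolve(reader.result as string);
    reader.onerror = () => reject(reader.error);
    reader.readAsDataURL(file); // yields "data:image/png;base64,..."
  });
}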

Vision Use Cases

UI/UX Analysis

  • Design critique and feedback
  • Layout and spacing review
  • Color scheme analysis
  • Accessibility evaluation
  • Responsive design check

Code Screenshots

  • Analyze code in images
  • Debug from screenshots
  • Extract code snippets
  • Review error messages

Diagrams & Charts

  • Interpret flowcharts
  • Analyze data visualizations
  • Explain architectural diagrams
  • Read infographics

Content Extraction

  • Extract text from images
  • Read handwritten notes (limited)
  • Transcribe screenshots
  • Get data from tables

Real-World Examples

Scenario: Get feedback on a UI mockup
Steps:
  1. Upload design mockup
  2. Enable vision with Claude or GPT-4V
  3. Ask: “Provide detailed UI/UX feedback on this design. Consider usability, accessibility, and visual hierarchy.”
Result: Detailed critique with specific suggestions

Scenario: Debug an error from a screenshot
Steps:
  1. Take screenshot of error
  2. Upload to chat
  3. Ask: “What’s causing this error and how do I fix it?”
Result: Error explanation and solution

Scenario: Compare two website designs
Steps:
  1. Visit first website, enable vision
  2. Ask: “Describe this design”
  3. Visit second website (keep chat open)
  4. Ask: “How does this design differ from the previous page?”
Result: Comparative analysis

Scenario: Understand a complex architecture diagram
Steps:
  1. Upload diagram image
  2. Use GPT-4V or Claude 3
  3. Ask: “Explain this architecture diagram in detail. What are the components and how do they interact?”
Result: Step-by-step explanation

Model Recommendations

Best Vision Models

GPT-4 Vision / GPT-4o (OpenAI)
  • Excellent visual understanding
  • Accurate spatial reasoning
  • Great with text in images
  • Best for: Complex scenes, detailed analysis
Claude 3 (Anthropic)
  • Strong visual capabilities
  • Good context understanding
  • Detailed descriptions
  • Best for: UI analysis, diagrams
Gemini Pro Vision (Google)
  • Fast processing
  • Good general vision
  • Multimodal understanding
  • Best for: Quick analysis, multi-image tasks
Local models are completely private but may have lower accuracy than cloud models. Choose based on your privacy vs. performance needs.

Advanced Techniques

Multi-Image Analysis

Analyze multiple images in one conversation:
  1. Upload first image
  2. Ask question about it
  3. Upload second image
  4. Ask comparative question
  5. AI considers both images in context
Example:
User: [uploads before.png] "Describe this UI"
AI: [describes first image]
User: [uploads after.png] "How does this differ from the previous design?"
AI: [compares both images]
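
With OpenAI-compatible providers, a multi-image turn like this is represented as a content array mixing text and image parts, roughly as sketched below. The variable names and data URLs are placeholders; Page Assist builds the equivalent payload for you when you attach images.

const beforePng = "data:image/png;base64,..."; // placeholder data URL
const afterPng = "data:image/png;base64,...";  // placeholder data URL

const messages = [
  {
    role: "user",
    content: [
      { type: "text", text: "How does this differ from the previous design?" },
      { type: "image_url", image_url: { url: beforePng } },
      { type: "image_url", image_url: { url: afterPng } },
    ],
  },
];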

Combining Vision with Other Features

Compare images with documentation:
  1. Select knowledge base with design docs
  2. Enable vision
  3. Upload design screenshot
  4. Ask: “Does this match our design system guidelines?”
  5. AI compares image with your docs
Enhanced page analysis:
  1. Enable both vision and webpage chat
  2. AI gets both visual and textual page content
  3. Ask: “Are there discrepancies between visible UI and underlying HTML?”
  4. More comprehensive analysis

Prompt Engineering for Vision

Get better results with specific prompts.
General Analysis:
"Describe this image in detail, including:
- Main subjects and objects
- Colors and styling
- Layout and composition
- Any text visible
- Overall mood or purpose"
UI/UX Review:
"Analyze this UI design for:
- Visual hierarchy
- Color contrast and accessibility
- Spacing and alignment
- Call-to-action clarity
- Mobile responsiveness indicators
- Potential usability issues"
Error Diagnosis:
"Examine this error screenshot and:
1. Identify the error type
2. Explain the likely cause
3. Suggest specific fixes
4. Recommend prevention strategies"
Code Review:
"Review this code screenshot for:
- Syntax errors
- Logic issues
- Best practice violations
- Performance concerns
- Security vulnerabilities"

Configuration

OCR Settings

When using OCR mode:
  1. Go to sidebar settings
  2. Find vision options
  3. Enable “Extract Text From Image (OCR)”
  4. OCR uses Tesseract.js for text extraction
OCR accuracy depends on image quality, text size, and font clarity. Works best with clear, high-contrast text.
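
Conceptually, the OCR step boils down to a Tesseract.js recognize call. A minimal sketch, assuming the English language pack (the exact options Page Assist passes are not documented here):

import { createWorker } from "tesseract.js";

async function extractText(image: string): Promise<string> {
  const worker = await createWorker("eng");       // load English recognition data
  const { data } = await worker.recognize(image); // accepts a URL, path, or Blob
  await worker.terminate();                       // release worker resources
  return data.text;                               // the extracted plain text
}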

Vision Model Selection

Choose the right model for your task:
  • For UI/design work: GPT-4V, Claude 3
  • For speed: Gemini Pro Vision, LLaVA
  • For privacy: LLaVA, BakLLaVA (local)
  • For accuracy: GPT-4V, Claude 3
  • For cost: Local models, Gemini

Limitations

Current Limitations:
  • Vision is experimental and may not work with all models
  • Image size limits apply (typically 4-20MB)
  • OCR mode provides only text, no visual understanding
  • Some models may have rate limits for vision requests
  • Local models have lower accuracy than cloud models

What Vision Can Do

  • Identify objects, scenes, and people
  • Describe layouts and compositions
  • Read text in images (with varying accuracy)
  • Analyze colors and visual style
  • Detect UI/UX issues
  • Interpret charts and diagrams
  • Compare visual elements

What Vision Cannot Do

  • Process video (only static images)
  • Real-time analysis of dynamic content
  • Perfect OCR on all text (especially handwriting)
  • Guarantee 100% accuracy on complex scenes
  • Process very large images (check model limits)
  • Analyze images requiring domain expertise it lacks

Troubleshooting

Vision not working

Possible causes:
  • Non-vision model selected
  • Image format not supported
  • Network issues (cloud models)
Solutions:
  • Verify you’re using a vision-capable model
  • Check image format (use JPG or PNG)
  • Try different model
  • Enable OCR mode as fallback
  • Check API key if using cloud models

Poor or inaccurate analysis

Possible causes:
  • Low image quality
  • Model limitations
  • Vague prompt
Solutions:
  • Use higher quality images
  • Try advanced models (GPT-4V, Claude 3)
  • Be more specific in your prompts
  • Provide context about what you want analyzed

Inaccurate OCR results

Possible causes:
  • Poor image quality
  • Unusual fonts
  • Low contrast
  • Handwriting
Solutions:
  • Use clearer images
  • Increase image resolution
  • Improve text contrast
  • Try vision models instead of OCR

Image upload fails

Possible causes:
  • File too large
  • Unsupported format
  • Browser issues
Solutions:
  • Compress image (keep under 10MB)
  • Convert to JPG or PNG
  • Try different browser
  • Check browser console for errors

Privacy and Security

Image Processing: When using vision features, images are sent to your configured AI provider.

Privacy Comparison

Model Type | Privacy Level | Notes
--- | --- | ---
Local (LLaVA, BakLLaVA) | Highest | Complete privacy, no data sent externally
OpenAI GPT-4V | Medium | Data sent to OpenAI, subject to their policies
Claude 3 | Medium | Data sent to Anthropic
Gemini | Medium | Data sent to Google
Sensitive Images: Be cautious when analyzing screenshots containing sensitive information (passwords, personal data, etc.). Use local models for sensitive content.

Best Practices

Image Quality: Use high-resolution, clear images for best results. Blurry or low-quality images reduce accuracy.
Specific Prompts: Instead of “What’s this?”, ask “Describe the UI layout and identify any accessibility issues.”
Model Selection: Use local models (LLaVA) for privacy-sensitive tasks, cloud models (GPT-4V) for highest accuracy.
Context: Provide context in your prompts: “This is a mobile app login screen” helps the AI give more relevant analysis.

Next Steps

Chat with Webpage

Combine vision with text-based webpage analysis

Internet Search

Identify objects visually then search for info

Knowledge Base

Compare images with your documentation

Prompts

Create vision-specific prompt templates
