Vision capabilities are currently experimental. Features and performance may vary depending on the model used.
How Vision Works
Page Assist supports two distinct vision modes:
- Native Vision (Recommended)
- OCR Mode (Fallback)
For models with built-in vision capabilities.
How it works:
- Image or screenshot is captured
- Image sent directly to vision-capable model
- AI analyzes visual content natively
- Returns detailed visual understanding
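The flow above can be sketched as a single request payload. This is a minimal illustration assuming an OpenAI-style chat format, where the image travels inline as a base64 data URL; field names vary between providers:

```javascript
// Sketch of how an image reaches a vision-capable model in native mode.
// Assumes an OpenAI-style chat payload; other providers use similar shapes.
function buildVisionPayload(question, imageBase64) {
  return {
    model: 'gpt-4o', // any vision-capable model
    messages: [
      {
        role: 'user',
        content: [
          // The text question and the image travel in the same message
          { type: 'text', text: question },
          {
            type: 'image_url',
            image_url: { url: `data:image/png;base64,${imageBase64}` },
          },
        ],
      },
    ],
  };
}
```

The model then answers from the image content directly, with no OCR step in between.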
Compatible models:
- GPT-4 Vision (OpenAI)
- GPT-4o (OpenAI)
- Claude 3 (Anthropic)
- Gemini Pro Vision (Google)
- LLaVA (Ollama)
- BakLLaVA (Ollama)
- Other vision-enabled models
Advantages:
- True visual understanding
- Accurate spatial reasoning
- Can describe layouts, colors, relationships
- Better with complex images
Trade-offs:
- Requires specific models
- May be slower
- Higher API costs (for cloud models)
Getting Started
Using Native Vision
Configure Vision Model
For Ollama (Local):
Pull a vision-capable model (such as LLaVA), then select it in Settings → Model Settings.
For OpenAI:
- Ensure API key is configured
- Select GPT-4 Vision or GPT-4o
- Configure in respective settings sections
- Verify model supports vision
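For the Ollama route above, pulling a model looks like this (LLaVA is shown as one option; available model names depend on your Ollama version and library):

```shell
# Pull a vision-capable model for local use with Ollama
ollama pull llava

# Confirm the model is available locally before selecting it in Page Assist
ollama list
```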
Enable Vision in Sidebar
- Open sidebar (Ctrl+Shift+Y)
- Navigate to any webpage
- Click the eye icon in input area
- Select a vision-capable model
- Start asking about what you see
Using OCR Mode
Enable OCR Extraction
- Expand the Submit button dropdown
- Enable “Extract Text From Image (OCR)”
- Now images will be processed with OCR
Sidebar Vision Features
Webpage Screenshots
Analyze the current webpage visually.
Use cases:
- UI/UX review: “Analyze the design of this page”
- Accessibility audit: “Are there any accessibility issues?”
- Layout analysis: “Describe the page structure”
- Visual comparison: “How does this compare to Material Design?”
Steps:
- Navigate to target webpage
- Open sidebar
- Enable vision mode (eye icon)
- Ask visual questions
- Page Assist captures and analyzes screenshot automatically
Image Upload
Analyze uploaded images.
Upload Image
- Click the attachment/image icon in input area
- Select image file from your device
- Image preview appears in chat
Vision Use Cases
UI/UX Analysis
- Design critique and feedback
- Layout and spacing review
- Color scheme analysis
- Accessibility evaluation
- Responsive design check
Code Screenshots
- Analyze code in images
- Debug from screenshots
- Extract code snippets
- Review error messages
Diagrams & Charts
- Interpret flowcharts
- Analyze data visualizations
- Explain architectural diagrams
- Read infographics
Content Extraction
- Extract text from images
- Read handwritten notes (limited)
- Transcribe screenshots
- Get data from tables
Real-World Examples
Design Feedback
Scenario: Get feedback on a UI mockup
Steps:
- Upload design mockup
- Enable vision with Claude or GPT-4V
- Ask: “Provide detailed UI/UX feedback on this design. Consider usability, accessibility, and visual hierarchy.”
Error Debugging
Scenario: Debug an error from a screenshot
Steps:
- Take screenshot of error
- Upload to chat
- Ask: “What’s causing this error and how do I fix it?”
Page Comparison
Scenario: Compare two website designs
Steps:
- Visit first website, enable vision
- Ask: “Describe this design”
- Visit second website (keep chat open)
- Ask: “How does this design differ from the previous page?”
Diagram Explanation
Scenario: Understand a complex architecture diagram
Steps:
- Upload diagram image
- Use GPT-4V or Claude 3
- Ask: “Explain this architecture diagram in detail. What are the components and how do they interact?”
Model Recommendations
Best Vision Models
- Cloud (Highest Quality)
- Local (Privacy & Speed)
GPT-4 Vision / GPT-4o (OpenAI)
- Excellent visual understanding
- Accurate spatial reasoning
- Great with text in images
- Best for: Complex scenes, detailed analysis
Claude 3 (Anthropic)
- Strong visual capabilities
- Good context understanding
- Detailed descriptions
- Best for: UI analysis, diagrams
Gemini Pro Vision (Google)
- Fast processing
- Good general vision
- Multimodal understanding
- Best for: Quick analysis, multi-image tasks
Local models are completely private but may have lower accuracy than cloud models. Choose based on your privacy vs. performance needs.
Advanced Techniques
Multi-Image Analysis
Analyze multiple images in one conversation:
- Upload first image
- Ask question about it
- Upload second image
- Ask comparative question
- AI considers both images in context
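The multi-image flow above can be sketched as one message history, again assuming an OpenAI-style chat format (the assistant reply shown is a placeholder; real payload shapes vary by provider):

```javascript
// Sketch of a multi-image conversation: both images stay in context,
// so the second question can refer back to the first image.
function buildComparisonHistory(firstImageBase64, secondImageBase64) {
  return [
    { role: 'user', content: [
      { type: 'text', text: 'Describe this design.' },
      { type: 'image_url',
        image_url: { url: `data:image/png;base64,${firstImageBase64}` } },
    ]},
    // Placeholder for the model's first answer
    { role: 'assistant', content: '…' },
    { role: 'user', content: [
      { type: 'text', text: 'How does this second design differ from the first?' },
      { type: 'image_url',
        image_url: { url: `data:image/png;base64,${secondImageBase64}` } },
    ]},
  ];
}
```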
Combining Vision with Other Features
Vision + Internet Search
Identify objects then search for information:
- Enable vision + internet search
- Upload image: “What building is this?”
- AI identifies building visually
- Searches web for current information
- Returns: Name, history, visiting hours, etc.
Vision + Knowledge Base
Compare images with documentation:
- Select knowledge base with design docs
- Enable vision
- Upload design screenshot
- Ask: “Does this match our design system guidelines?”
- AI compares image with your docs
Vision + Webpage Chat
Enhanced page analysis:
- Enable both vision and webpage chat
- AI gets both visual and textual page content
- Ask: “Are there discrepancies between visible UI and underlying HTML?”
- More comprehensive analysis
Prompt Engineering for Vision
Get better results by using specific, detailed prompts rather than generic questions. For example, instead of “What is this?”, ask “Describe the layout, color scheme, and visual hierarchy of this page.”
Configuration
OCR Settings
When using OCR mode:
- Go to sidebar settings
- Find vision options
- Enable “Extract Text From Image (OCR)”
- OCR uses Tesseract.js for text extraction
OCR accuracy depends on image quality, text size, and font clarity. Works best with clear, high-contrast text.
Vision Model Selection
Choose the right model for your task:
- For UI/Design Work: GPT-4V, Claude 3
- For Speed: Gemini Pro Vision, LLaVA
- For Privacy: LLaVA, BakLLaVA (local)
- For Accuracy: GPT-4V, Claude 3
- For Cost: Local models, Gemini
Limitations
What Vision Can Do
- Identify objects, scenes, and people
- Describe layouts and compositions
- Read text in images (with varying accuracy)
- Analyze colors and visual style
- Detect UI/UX issues
- Interpret charts and diagrams
- Compare visual elements
What Vision Cannot Do
- Process video (only static images)
- Real-time analysis of dynamic content
- Perfect OCR on all text (especially handwriting)
- Guarantee 100% accuracy on complex scenes
- Process very large images (check model limits)
- Analyze images requiring domain expertise it lacks
Troubleshooting
Vision not working
Possible causes:
- Non-vision model selected
- Image format not supported
- Network issues (cloud models)
Solutions:
- Verify you’re using a vision-capable model
- Check image format (use JPG or PNG)
- Try different model
- Enable OCR mode as fallback
- Check API key if using cloud models
Poor image analysis
Possible causes:
- Low image quality
- Model limitations
- Vague prompt
Solutions:
- Use higher quality images
- Try advanced models (GPT-4V, Claude 3)
- Be more specific in your prompts
- Provide context about what you want analyzed
OCR extraction fails
Possible causes:
- Poor image quality
- Unusual fonts
- Low contrast
- Handwriting
Solutions:
- Use clearer images
- Increase image resolution
- Improve text contrast
- Try vision models instead of OCR
Image upload fails
Possible causes:
- File too large
- Unsupported format
- Browser issues
Solutions:
- Compress image (keep under 10MB)
- Convert to JPG or PNG
- Try different browser
- Check browser console for errors
Privacy and Security
Image Processing: When using vision features, images are sent to your configured AI provider.
Privacy Comparison
| Model Type | Privacy Level | Notes |
|---|---|---|
| Local (LLaVA, BakLLaVA) | Highest | Complete privacy, no data sent externally |
| OpenAI GPT-4V | Medium | Data sent to OpenAI, subject to their policies |
| Claude 3 | Medium | Data sent to Anthropic |
| Gemini | Medium | Data sent to Google |
Best Practices
Next Steps
Chat with Webpage
Combine vision with text-based webpage analysis
Internet Search
Identify objects visually then search for info
Knowledge Base
Compare images with your documentation
Prompts
Create vision-specific prompt templates