Architecture
The vision pipeline automatically selects the best processing method.
Native Vision Providers
These providers support vision directly without preprocessing:
Claude (Anthropic)
- Models: All Claude 3+ models (Haiku, Sonnet, Opus)
- Format: Base64-encoded images in message content
- Max size: 5 MB per image
- Supported formats: JPEG, PNG, GIF, WebP
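The base64 message format can be sketched as a small helper. The content-block shape below follows Anthropic's Messages API; the function name and the size guard are illustrative, not Asta's actual code:

```python
import base64

MAX_IMAGE_BYTES = 5 * 1024 * 1024  # Claude's 5 MB per-image limit

def claude_image_block(image_bytes: bytes, media_type: str = "image/jpeg") -> dict:
    """Build an inline-image content block for the Anthropic Messages API."""
    if len(image_bytes) > MAX_IMAGE_BYTES:
        raise ValueError("image exceeds Claude's 5 MB per-image limit")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,  # image/jpeg, image/png, image/gif, image/webp
            "data": base64.b64encode(image_bytes).decode("ascii"),
        },
    }
```

The block is placed alongside text blocks in a message's `content` array.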
Google Gemini
- Models: Gemini 1.5+ (Pro, Flash)
- Format: Data URLs with base64 content
- Features: Multi-image support, video frames
- Implementation:
backend/app/providers/google.py:50-58
OpenAI
- Models: GPT-4 Vision, GPT-4o, GPT-5.2+
- Format: Data URLs in message array
- Features: High/low detail modes, multiple images
- Implementation:
backend/app/providers/openai.py:51-55
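A minimal sketch of building such a content part: the `image_url`/data-URL shape and the `detail` field (high/low mode) follow OpenAI's chat API; the helper name is illustrative:

```python
import base64

def openai_image_part(image_bytes: bytes, mime: str = "image/png",
                      detail: str = "auto") -> dict:
    """Build an image content part as a data URL for OpenAI chat messages."""
    data_url = f"data:{mime};base64,{base64.b64encode(image_bytes).decode('ascii')}"
    # detail: "high" for fine-grained analysis, "low" for cheap/fast, "auto" to let
    # the model decide
    return {"type": "image_url", "image_url": {"url": data_url, "detail": detail}}
```

Multiple such parts can appear in one user message for multi-image requests.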
Native vision providers receive images as part of the chat request and analyze them directly with their multimodal models.
Vision Preprocessor
For providers without native vision (Ollama, Groq, etc.), Asta uses a preprocessing pipeline.
How It Works
One of OpenRouter's free vision models analyzes the image, producing:
- Scene description
- OCR (text extraction)
- Object identification
- Layout analysis
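The flow can be sketched as follows. The `vision_client.analyze` call and the prompt wording are illustrative assumptions, not Asta's actual API — the point is that the vision model's text output gets injected into the prompt for the text-only provider:

```python
# Tasks the preprocessor asks the vision model to perform (from the list above).
VISION_TASKS = (
    "Describe the scene.",
    "Extract all visible text (OCR).",
    "Identify notable objects.",
    "Summarize the layout.",
)

def build_vision_prompt() -> str:
    """Assemble the analysis instructions for the vision model."""
    return "Analyze the attached image:\n" + "\n".join(f"- {t}" for t in VISION_TASKS)

def preprocess_image(image_bytes: bytes, vision_client) -> str:
    """Turn an image into text the non-vision provider can consume."""
    description = vision_client.analyze(image_bytes, prompt=build_vision_prompt())
    return f"[Image analysis]\n{description}"
```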
Configuration
Configure preprocessing in Settings → Vision or via environment variables.
Fallback Chain
The preprocessor tries providers in order:
- OpenRouter - Free vision models (default)
- Ollama - Local vision-capable models (if configured)
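The chain can be sketched as a loop that returns the first successful analysis. The provider objects and their `analyze` method are hypothetical stand-ins:

```python
def analyze_with_fallback(image_bytes: bytes, providers: list) -> str:
    """Try each vision backend in order; return the first successful result."""
    errors = []
    for provider in providers:
        try:
            return provider.analyze(image_bytes)
        except Exception as exc:  # network errors, rate limits, model failures
            errors.append(f"{provider.name}: {exc}")
    raise RuntimeError("all vision providers failed: " + "; ".join(errors))
```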
Supported Models
OpenRouter free vision models:
- nvidia/nemotron-nano-12b-v2-vl:free (default)
- openrouter/free (auto-routes to best available)
Ollama vision models:
- llava (7B/13B/34B)
- bakllava
- moondream
backend/app/handler.py:892-984
Image Attachment Methods
Web Chat API
Inline Markdown Images
Embed images directly in message text using standard Markdown image syntax (`![alt](url)`).
Telegram
Send photos directly in Telegram:
- Attach photo to message
- Add caption with your question
- Asta receives image bytes and MIME type automatically
Telegram compresses photos. For high-quality analysis, send as “File” instead of “Photo”.
PDF Vision Fallback
When PDFs have poor text extraction quality, Asta renders pages as images.
Quality Assessment
PDFs are evaluated on:
- Characters per page (< 100 = poor)
- Alphabetic ratio (< 35% = poor)
- Average word length (2-25 chars)
- Sentence density (< 10% = blueprint/diagram)
backend/app/handler.py:133-167
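The heuristics above can be sketched as a single predicate. The thresholds come from the list above; the function itself is illustrative, and the exact definition of "sentence density" (here: terminators per word) is an assumption:

```python
def page_text_is_poor(text: str) -> bool:
    """Return True when extracted page text looks too poor to trust."""
    if len(text) < 100:                          # < 100 chars per page = poor
        return True
    alpha_ratio = sum(c.isalpha() for c in text) / len(text)
    if alpha_ratio < 0.35:                       # < 35% alphabetic = poor
        return True
    words = text.split()
    if words:
        avg_len = sum(len(w) for w in words) / len(words)
        if not (2 <= avg_len <= 25):             # implausible average word length
            return True
    # Sentence terminators per word; definition assumed, not from Asta's code.
    density = sum(text.count(p) for p in ".!?") / max(len(words), 1)
    if density < 0.10:                           # likely blueprint/diagram
        return True
    return False
```

A page that fails any check is routed to the image-rendering path instead.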
Vision Fallback Process
Choose analyzer
- Native vision provider (if Claude/Google/OpenAI active) → Direct analysis
- Otherwise → Vision preprocessor
Implementation:
- Native provider: backend/app/handler.py:256-323
- Preprocessor fallback: backend/app/handler.py:206-253
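The routing decision can be sketched as a small function. The provider identifiers and the `NATIVE_VISION` set are illustrative, not Asta's actual names:

```python
# Providers that accept images directly in chat requests.
NATIVE_VISION = {"anthropic", "google", "openai"}

def pick_analyzer(active_provider: str) -> str:
    """Route to direct multimodal analysis or the text preprocessor."""
    return "native" if active_provider in NATIVE_VISION else "preprocessor"
```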
Example Use Cases
- Scanned documents - OCR with vision instead of native text extraction
- Architectural blueprints - Extract labels and measurements
- Infographics - Describe visual elements and extract data
- Foreign language documents - OCR + translation
Advanced Configuration
API Settings
backend/app/routers/settings.py:257-272
Vision Context Prompt
The preprocessor uses a system prompt that walks the vision model through the analysis tasks described above (scene description, OCR, object identification, layout analysis).
Performance Considerations
Image Size Limits
- Web upload: 10 MB (configurable)
- Telegram: 20 MB (photo), 20 MB (file)
- Base64 inline: Practical limit ~5 MB (context length)
Optimization
Images are automatically:
- Resized - Max 1600px dimension (PDF pages)
- Compressed - JPEG quality 75% for PDF renders
- Format converted - PNGs converted to RGB for JPEG encoding
backend/app/handler.py:170-203
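These steps can be sketched with Pillow (assumed here as the imaging library; the function name and constants are illustrative):

```python
from io import BytesIO

from PIL import Image

MAX_DIM = 1600       # longest side cap for PDF page renders
JPEG_QUALITY = 75    # JPEG quality for PDF renders

def optimize_for_vision(image_bytes: bytes) -> bytes:
    """Resize, convert to RGB, and re-encode an image as JPEG."""
    img = Image.open(BytesIO(image_bytes))
    img.thumbnail((MAX_DIM, MAX_DIM))  # in-place resize, preserves aspect ratio
    if img.mode != "RGB":              # PNG RGBA/palette modes can't be JPEG-encoded
        img = img.convert("RGB")
    out = BytesIO()
    img.save(out, format="JPEG", quality=JPEG_QUALITY)
    return out.getvalue()
```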
Timeouts
- Vision preprocessor: 50s per attempt
- Native vision: 45s per request
- PDF page render: No timeout (fast, local)
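A timeout guard matching these figures might look like the following asyncio sketch (illustrative; Asta's actual handling may differ, e.g. in how timeouts feed the fallback chain):

```python
import asyncio

PREPROCESSOR_TIMEOUT = 50  # seconds per preprocessor attempt
NATIVE_TIMEOUT = 45        # seconds per native vision request

async def with_timeout(coro, seconds: float):
    """Run a vision call with a deadline; return None on timeout so the
    caller can fall through to the next provider."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return None
```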
Troubleshooting
Vision Not Working
- Check provider - Verify native support or preprocessor enabled
- Test preprocessor - Try OpenRouter with free model
- Validate image - Ensure JPEG/PNG, not corrupted
- Check logs - Look for “Vision preprocess complete” or error messages
Poor Quality Results
- Use native vision - Claude/Google/OpenAI have better accuracy
- Increase resolution - Send high-quality images
- Add context - Include specific questions in text prompt
- Try different model - Some preprocessor models excel at OCR vs. scene understanding
PDF Vision Failing
- Check page count - Only first 4 pages rendered
- Verify provider - Native vision providers produce better results
- Manual extraction - For critical documents, extract text manually and paste
Vision processing increases response time by 5-15 seconds depending on provider and image complexity.