Supported Models
Phi-3 Vision
128k context length vision model for image understanding
Phi-3.5 Vision
Enhanced vision capabilities with improved accuracy
Phi-4 Multi-Modal
Latest model supporting both vision and audio inputs
Model Architecture
Phi vision models are multi-modal models consisting of several internal components:- Vision Encoder: Processes images and extracts visual features
- Image Embedding: Converts visual features into embeddings compatible with the language model
- Language Model: Core transformer model for text generation
- Fusion Layers: Combine visual and text embeddings
Building Phi Vision Models
- Phi-3 Vision
- Phi-3.5 Vision
- Phi-4 Multi-Modal
Phi-3 Vision (128k Context)
Add Configuration Files
Download the required JSON configuration files:
Using Phi Vision Models
Basic Image Understanding
Multi-Image Processing
Chat Template Integration
Image Input Handling
Supported Image Formats
Phi vision models support common image formats:- JPEG/JPG
- PNG
- BMP
- TIFF
Image Preprocessing
The processor automatically handles:- Resizing: Images are resized to the model’s expected dimensions
- Normalization: Pixel values are normalized
- Patch Extraction: Images are divided into patches
- Embedding: Visual patches are converted to embeddings
Image Resolution
Advanced Usage
Batch Processing
Custom Generation Parameters
Performance Optimization
Precision Selection
Precision Selection
Choose the right precision for your hardware:
- FP32: Best accuracy, slower, works on all devices
- FP16: Good balance, requires GPU with FP16 support
- INT4: Fastest, smallest memory footprint, slight accuracy loss
Execution Provider Selection
Execution Provider Selection
- CUDA (NVIDIA)
- DirectML (AMD/Intel)
- CPU
Memory Management
Memory Management
For large images or long sequences:
Fine-Tuning Support
You can use your own fine-tuned Phi vision models:Troubleshooting
Image Not Loading
Image Not Loading
Out of Memory
Out of Memory
If you encounter OOM errors:
- Reduce image resolution before processing
- Use INT4 quantization instead of FP16
- Reduce
max_lengthparameter - Process images one at a time instead of batching
Flash Attention Errors
Flash Attention Errors
If you see flash attention errors:
Example Application
Here’s a complete example script for document analysis:Next Steps
Qwen Vision
Explore Qwen’s advanced vision models
Gemma Vision
Learn about Google’s Gemma vision models
Whisper Audio
Add audio processing capabilities
Model Quantization
Optimize models with quantization