Overview
Multimodal capabilities enable:
- Image understanding and analysis
- Visual question answering
- Multimodal RAG (text + images)
- CLIP embeddings for image search
- Combined text and image processing
Image Chat Example
Analyze images using vision models (example file: multimodal-chat.ts).
Multimodal RAG Example
Build a RAG system that retrieves both text and images (example file: multimodal-rag.ts).
Step-by-Step Explanation
1. Image Processing
The imageToDataUrl utility converts images to base64 data URLs that vision models can process.
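To make the data URL format concrete, here is a minimal sketch of the encoding step. The function name bytesToDataUrl is hypothetical and is not the library's imageToDataUrl; it only assumes that a data URL has the form data:<mime type>;base64,<payload> and that a global btoa is available (browsers and recent Node.js versions).

```typescript
// Hypothetical helper (NOT the library's imageToDataUrl): encodes raw image
// bytes as a base64 data URL of the form data:<mime>;base64,<payload>.
function bytesToDataUrl(bytes: Uint8Array, mimeType: string): string {
  // btoa expects a "binary string" with one character per byte.
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return "data:" + mimeType + ";base64," + btoa(binary);
}

// The five ASCII bytes of "Hello", standing in for real image data.
const sampleBytes = new Uint8Array([72, 101, 108, 108, 111]);
console.log(bytesToDataUrl(sampleBytes, "image/png"));
// prints "data:image/png;base64,SGVsbG8="
```

In practice the bytes would come from a file or an HTTP response, and the MIME type should match the actual image format.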
2. Vision Model Configuration
3. Multimodal Messages
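As a rough illustration, a multimodal message can be modeled as a list of content parts, one for the text and one per image. The types and the buildImageMessage helper below are assumptions modeled on OpenAI-style chat content parts, not this library's own message classes.

```typescript
// Assumed message shape, modeled on OpenAI-style content parts.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image_url"; image_url: { url: string } };
type ContentPart = TextPart | ImagePart;

interface UserMessage {
  role: "user";
  content: ContentPart[];
}

// Hypothetical builder: one text part followed by one image part per URL.
function buildImageMessage(question: string, imageUrls: string[]): UserMessage {
  const textPart: ContentPart = { type: "text", text: question };
  const imageParts = imageUrls.map(
    (url): ContentPart => ({ type: "image_url", image_url: { url: url } })
  );
  return { role: "user", content: [textPart].concat(imageParts) };
}

const exampleMessage = buildImageMessage("What is in this image?", [
  "data:image/png;base64,SGVsbG8=",
]);
console.log(exampleMessage.content.length); // prints 2
```

The image URL can be a regular https URL or a base64 data URL, depending on what the model provider accepts.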
Combine text and images in a single message.
4. Multimodal Retrieval
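When a retriever returns a mix of text and image results, the caller usually needs to separate them before building a prompt. The sketch below assumes a hypothetical RetrievedNode union type; a real retriever will expose its own node classes and fields.

```typescript
// Hypothetical result type; real retrievers define their own node classes.
type RetrievedNode =
  | { kind: "text"; text: string; score: number }
  | { kind: "image"; imageUrl: string; score: number };

interface SplitResults {
  texts: string[];
  imageUrls: string[];
}

// Orders results by score (highest first) and groups them by modality.
function splitByModality(nodes: RetrievedNode[]): SplitResults {
  const result: SplitResults = { texts: [], imageUrls: [] };
  // Copy before sorting so the caller's array is not mutated.
  const ranked = nodes.slice().sort((a, b) => b.score - a.score);
  for (let i = 0; i < ranked.length; i++) {
    const node = ranked[i];
    if (node.kind === "text") {
      result.texts.push(node.text);
    } else {
      result.imageUrls.push(node.imageUrl);
    }
  }
  return result;
}

const retrieved: RetrievedNode[] = [
  { kind: "text", text: "Product manual, page 3", score: 0.71 },
  { kind: "image", imageUrl: "https://example.com/diagram.png", score: 0.88 },
  { kind: "text", text: "Warranty terms", score: 0.93 },
];
console.log(splitByModality(retrieved).texts);
// prints [ 'Warranty terms', 'Product manual, page 3' ]
```

The text results can then go into the prompt as context, while the image URLs are attached as image parts of the message.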
Retrieve different content types.
CLIP Embeddings
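CLIP maps images and text into a shared vector space, so an image and a caption can be compared directly by the angle between their embeddings. The sketch below shows only that comparison step; the vectors are toy values, not real CLIP output, and producing real embeddings requires a CLIP model, which is out of scope here.

```typescript
// Cosine similarity between two embedding vectors of equal dimension.
// Returns a value in [-1, 1]; higher means more similar.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embeddings must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat it as dissimilar to everything.
  if (normA === 0 || normB === 0) {
    return 0;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 3-dimensional vectors standing in for real CLIP embeddings.
const textEmbedding = [0.9, 0.1, 0.0];
const matchingImage = [0.8, 0.2, 0.1];
const unrelatedImage = [0.0, 0.1, 0.9];
console.log(
  cosineSimilarity(textEmbedding, matchingImage) >
    cosineSimilarity(textEmbedding, unrelatedImage)
); // prints true
```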
Use CLIP for image and text embeddings.
Image Search Example
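At its core, an embedding-based image search ranks stored image embeddings against a query embedding and returns the best matches. The following is a brute-force sketch under stated assumptions: the IndexedImage type and searchImages function are hypothetical, the embeddings are toy values, and a production system would typically use a vector store rather than a linear scan.

```typescript
// Hypothetical index entry: an image identifier plus its embedding.
interface IndexedImage {
  id: string;
  embedding: number[];
}

interface SearchHit {
  id: string;
  score: number;
}

// Scales a vector to unit length so a dot product equals cosine similarity.
function normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

function dotProduct(a: number[], b: number[]): number {
  let total = 0;
  for (let i = 0; i < a.length; i++) {
    total += a[i] * b[i];
  }
  return total;
}

// Brute-force search: score every image, sort descending, keep the top K.
function searchImages(query: number[], index: IndexedImage[], topK: number): SearchHit[] {
  const q = normalize(query);
  return index
    .map((image) => ({ id: image.id, score: dotProduct(q, normalize(image.embedding)) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

const imageIndex: IndexedImage[] = [
  { id: "cat.jpg", embedding: [1, 0.1] },
  { id: "dog.jpg", embedding: [0.1, 1] },
  { id: "car.jpg", embedding: [-1, 0] },
];
// The query vector stands in for the embedding of a text query.
console.log(searchImages([0.9, 0.2], imageIndex, 2).map((hit) => hit.id));
// prints [ 'cat.jpg', 'dog.jpg' ]
```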
Build an image search engine.
Running the Examples
- Install dependencies:
- Set your API key:
- Run an example:
Supported Vision Models
OpenAI
- gpt-4o - Latest multimodal model
- gpt-4o-mini - Faster, more cost-effective
- gpt-4-turbo - Previous generation with vision
- gpt-4-vision-preview - Legacy vision model
Anthropic
- claude-3-5-sonnet - Best vision + reasoning
- claude-3-opus - Highest capability
- claude-3-sonnet - Balanced performance
- claude-3-haiku - Fast and cost-effective
Google Gemini
Use Cases
Visual Question Answering
Document Analysis
Extract information from documents.
Product Cataloging
Automate product descriptions.
Best Practices
Image Quality
- Use high-resolution images for better results
- Ensure images are well-lit and clear
- Crop to relevant areas when possible
Token Usage
- Images consume many tokens (varies by resolution)
- Use maxTokens to control response length
- Consider gpt-4o-mini for cost optimization
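To give a feel for how resolution drives token cost, the sketch below estimates image tokens using the tile-based accounting that OpenAI has documented for high-detail images on gpt-4o-class models. The constants (85 base tokens, 170 tokens per 512 px tile) and the resizing steps are assumptions taken from that documentation and may change; other models and providers, including gpt-4o-mini, count images differently.

```typescript
// Assumed constants for high-detail images on gpt-4o-class models.
const BASE_TOKENS = 85;
const TOKENS_PER_TILE = 170;

// Rough token estimate for one image of the given pixel dimensions.
function estimateImageTokens(width: number, height: number): number {
  let w = width;
  let h = height;
  // Step 1: scale down (never up) to fit inside a 2048 x 2048 square.
  const fit = Math.min(1, 2048 / Math.max(w, h));
  w *= fit;
  h *= fit;
  // Step 2: scale down (never up) so the shortest side is at most 768 px.
  const shrink = Math.min(1, 768 / Math.min(w, h));
  w *= shrink;
  h *= shrink;
  // Step 3: count the 512 px tiles needed to cover the resized image.
  const tiles = Math.ceil(w / 512) * Math.ceil(h / 512);
  return BASE_TOKENS + TOKENS_PER_TILE * tiles;
}

console.log(estimateImageTokens(1024, 1024)); // prints 765
console.log(estimateImageTokens(2048, 4096)); // prints 1105
```

Under these assumptions, cropping an image to the relevant region reduces the tile count and therefore the cost.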
Error Handling
Next Steps
- CLIP Embeddings - Learn more about CLIP and multimodal embeddings
- Vision Models - Explore different vision-language models
- RAG with Images - Build advanced multimodal RAG systems
- Custom Readers - Create custom image readers and processors
Related Examples
- Multimodal Chat - Simple image chat
- Multimodal RAG - Text + image retrieval
- CLIP Embeddings - Image search with CLIP
- Multimodal Context - Context-aware multimodal chat