Multimodal
Genkit supports multimodal AI capabilities, allowing you to generate and process images, videos, and audio alongside text. Build applications that understand and create visual content.
Image Generation
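As a minimal sketch of handling a generated image: image models in Genkit typically return the result as a media part whose `url` is a base64 data URL. The `ai.generate` call shown in the comment is an assumption about a typical setup (the model name is a placeholder), and `parseDataUrl` is a hypothetical helper, not part of Genkit, that splits such a URL so the image can be saved or passed on.

```typescript
// Assumed Genkit call (not executed here; the model name is a placeholder):
//   const { media } = await ai.generate({
//     model: '<image-model>',
//     prompt: 'A watercolor fox in a forest',
//     output: { format: 'media' },
//   });
//   const image = parseDataUrl(media.url);

interface ParsedDataUrl {
  contentType: string; // e.g. "image/png"
  base64: string; // payload without the "data:...;base64," prefix
}

// Hypothetical helper: split a base64 data URL into MIME type and payload.
function parseDataUrl(url: string): ParsedDataUrl {
  const match = /^data:([^;,]+);base64,(.+)$/.exec(url);
  if (!match) {
    throw new Error('Expected a base64 data URL');
  }
  return { contentType: match[1], base64: match[2] };
}
```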
Generate images from text descriptions.
Image Understanding
Analyze and describe images.
Video Understanding
Process and analyze video content.
PDF Processing
Extract and analyze content from PDFs.
Multimodal Embeddings
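Similarity search over embeddings usually comes down to ranking stored vectors by cosine similarity to a query vector. The sketch below assumes the embeddings have already been obtained as plain number arrays (the embedder call is not shown); `cosineSimilarity` and `rankBySimilarity` are hypothetical helpers, not Genkit APIs.

```typescript
interface EmbeddedItem {
  id: string; // e.g. an image or video URL
  embedding: number[]; // vector returned by the embedder
}

// Cosine similarity: dot product divided by the product of vector norms.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as dissimilar
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return a new array of items ordered from most to least similar.
function rankBySimilarity(query: number[], items: EmbeddedItem[]): EmbeddedItem[] {
  return [...items].sort(
    (x, y) => cosineSimilarity(query, y.embedding) - cosineSimilarity(query, x.embedding),
  );
}
```

For larger collections a vector database would normally replace the in-memory sort; the ranking logic stays the same.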
Create embeddings for images and videos for similarity search.
Video Segment Configuration
Control how videos are processed.
Mixed Media Inputs
Combine multiple types of media in a single request.
Provider Support
Multimodal capabilities vary by provider:
| Provider | Image Input | Image Generation | Video Input | PDF Input |
|---|---|---|---|---|
| Google AI (Gemini) | ✅ | ✅ | ✅ | ✅ |
| Vertex AI | ✅ | ✅ (Imagen) | ✅ | ✅ |
| Anthropic (Claude) | ✅ | ❌ | ❌ | ✅ |
| OpenAI | ✅ | ✅ (DALL-E) | ❌ | ❌ |
Supported File Types
Images
- JPEG (.jpg, .jpeg)
- PNG (.png)
- WebP (.webp)
- GIF (.gif)
Video
- MP4 (.mp4)
- MOV (.mov)
- AVI (.avi)
- WebM (.webm)
Documents
- PDF (.pdf)
Media Source Options
HTTP/HTTPS URLs
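A minimal sketch of referencing media by URL, assuming the prompt accepts parts of the form `{ media: { url, contentType } }`. `mediaPartFromUrl` is a hypothetical helper that infers the content type from the file extension, using the file types listed under Supported File Types.

```typescript
interface MediaPart {
  media: { url: string; contentType: string };
}

// Extension-to-MIME mapping for the supported file types listed above.
const CONTENT_TYPES = new Map<string, string>([
  ['jpg', 'image/jpeg'],
  ['jpeg', 'image/jpeg'],
  ['png', 'image/png'],
  ['webp', 'image/webp'],
  ['gif', 'image/gif'],
  ['mp4', 'video/mp4'],
  ['mov', 'video/quicktime'],
  ['avi', 'video/x-msvideo'],
  ['webm', 'video/webm'],
  ['pdf', 'application/pdf'],
]);

// Hypothetical helper: build a media part from an HTTP(S) URL.
function mediaPartFromUrl(url: string): MediaPart {
  const { protocol, pathname } = new URL(url);
  if (protocol !== 'http:' && protocol !== 'https:') {
    throw new Error(`Expected an http(s) URL, got ${protocol}`);
  }
  const ext = (pathname.split('.').pop() ?? '').toLowerCase();
  const contentType = CONTENT_TYPES.get(ext);
  if (!contentType) {
    throw new Error(`Unsupported file extension: ${ext}`);
  }
  return { media: { url, contentType } };
}
```

Inferring the type from the extension is only a convenience; when the server's `Content-Type` is known, passing it explicitly is more reliable.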
Google Cloud Storage URLs
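A short sketch of building a `gs://` reference, assuming the chosen provider can read Cloud Storage objects directly (check the provider's documentation). `gcsMediaPart` is a hypothetical helper; the bucket and object names in the usage are placeholders.

```typescript
interface MediaPart {
  media: { url: string; contentType: string };
}

// Hypothetical helper: build a media part that points at a Cloud Storage object.
function gcsMediaPart(bucket: string, objectPath: string, contentType: string): MediaPart {
  if (!bucket || bucket.includes('/')) {
    throw new Error('Invalid bucket name');
  }
  const path = objectPath.replace(/^\/+/, ''); // drop leading slashes
  if (!path) {
    throw new Error('Object path is required');
  }
  return { media: { url: `gs://${bucket}/${path}`, contentType } };
}

// Usage with placeholder names:
const videoPart = gcsMediaPart('my-bucket', 'videos/demo.mp4', 'video/mp4');
```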
Base64 Encoded Data
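Inline media is usually passed as a data URL of the form `data:<content type>;base64,<payload>`. The sketch below is one way to build such a URL from raw bytes; `toDataUrl` is a hypothetical helper, and it assumes the global `btoa` function is available (browsers and recent Node.js versions).

```typescript
// Hypothetical helper: encode raw bytes as a base64 data URL.
function toDataUrl(bytes: Uint8Array, contentType: string): string {
  // btoa expects a "binary string" with one character per byte.
  let binary = '';
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return `data:${contentType};base64,${btoa(binary)}`;
}
```

The resulting string can be used wherever a media URL is expected. Because base64 grows the payload by about a third, this suits small files best.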
Best Practices
Optimize Image Sizes
Resize images before sending to reduce latency and costs:
- Recommended: 1024x1024 or smaller
- Maximum: Check provider limits
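The recommendation above can be sketched as a small calculation that scales an image down to fit a 1024-pixel bounding box while keeping its aspect ratio. `fitWithin` is a hypothetical helper that only computes target dimensions; the actual resizing would be done with an image library of your choice.

```typescript
interface Size {
  width: number;
  height: number;
}

// Compute dimensions that fit within maxSide x maxSide without upscaling.
function fitWithin(size: Size, maxSide = 1024): Size {
  const scale = Math.min(1, maxSide / Math.max(size.width, size.height));
  return {
    width: Math.max(1, Math.round(size.width * scale)),
    height: Math.max(1, Math.round(size.height * scale)),
  };
}
```

For instance, a 4000x3000 photo is scaled by 0.256 to 1024x768, while an 800x600 image is left unchanged.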
Use Cloud Storage for Large Files
For videos and large PDFs, use Google Cloud Storage URLs instead of base64-encoded data.
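One reason to prefer Cloud Storage for large files is that base64 encodes every 3 bytes as 4 characters, so inline payloads grow by roughly a third. The sketch below picks a source based on the encoded size. The 20 MB limit is a placeholder assumption, not a documented Genkit value, and `chooseMediaSource` is a hypothetical helper.

```typescript
// Length of the base64 encoding (with padding) of byteLength raw bytes.
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

type MediaSource = 'inline-base64' | 'cloud-storage';

// ASSUMPTION: placeholder inline limit; check your provider's actual limit.
const ASSUMED_INLINE_LIMIT_BYTES = 20 * 1024 * 1024;

// Decide whether a file should be sent inline or referenced from storage.
function chooseMediaSource(
  byteLength: number,
  limitBytes: number = ASSUMED_INLINE_LIMIT_BYTES,
): MediaSource {
  return base64Length(byteLength) > limitBytes ? 'cloud-storage' : 'inline-base64';
}
```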
Process Videos in Segments
For long videos, process in smaller time windows.
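A minimal sketch of splitting a long video into consecutive time windows. `segmentWindows` is a hypothetical helper, and the field names `startOffsetSec` and `endOffsetSec` are an assumption modeled on common video segment configuration; how the windows are passed to a model depends on the provider.

```typescript
interface VideoSegment {
  startOffsetSec: number;
  endOffsetSec: number;
}

// Split [0, durationSec) into windows of at most windowSec seconds.
function segmentWindows(durationSec: number, windowSec: number): VideoSegment[] {
  if (durationSec <= 0 || windowSec <= 0) {
    throw new Error('Duration and window length must be positive');
  }
  const segments: VideoSegment[] = [];
  for (let start = 0; start < durationSec; start += windowSec) {
    segments.push({
      startOffsetSec: start,
      endOffsetSec: Math.min(start + windowSec, durationSec), // clamp last window
    });
  }
  return segments;
}
```

Each window can then be analyzed in its own request and the results merged, which keeps individual requests small.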
Add Specific Instructions
Provide clear context about what to analyze.
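As an illustration, a prompt can pair the media with an explicit task and a numbered list of things to look for, rather than a bare "describe this". The part shapes below are an assumption about the prompt format, and `buildAnalysisPrompt` is a hypothetical helper.

```typescript
type Part = { text: string } | { media: { url: string; contentType: string } };

// Build a prompt that pairs one image with a task and explicit focus points.
function buildAnalysisPrompt(
  imageUrl: string,
  contentType: string,
  task: string,
  focusPoints: string[],
): Part[] {
  const focus = focusPoints.map((point, i) => `${i + 1}. ${point}`).join('\n');
  const instruction = focus ? `${task}\nFocus on:\n${focus}` : task;
  return [{ media: { url: imageUrl, contentType } }, { text: instruction }];
}
```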
Handle Multimodal Errors
Some models may reject certain content, so wrap multimodal calls in error handling.
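One way to keep a rejected request from crashing the application is to convert thrown errors into a result value the caller must inspect. `tryGenerate` is a hypothetical wrapper, not a Genkit API; it accepts any async function, so the model call itself is left to the caller.

```typescript
type Result<T> = { ok: true; value: T } | { ok: false; error: string };

// Run an async model call and capture any thrown error as a value.
async function tryGenerate<T>(run: () => Promise<T>): Promise<Result<T>> {
  try {
    return { ok: true, value: await run() };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, error: message };
  }
}

// Assumed usage (the call inside the arrow function is a placeholder):
//   const result = await tryGenerate(() => ai.generate({ prompt }));
//   if (!result.ok) console.warn('Generation failed:', result.error);
```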
Complete Example: Image Analysis Flow
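The following is a dependency-free sketch of an image analysis flow, not the original example from this page. It validates the input, builds a prompt from the image and a question, calls the model, and returns a structured result. `GenerateFn` is a hypothetical stand-in for a model call such as Genkit's `ai.generate`, injected so the logic can run and be tested without a provider; in a real project the function body would likely be registered as a Genkit flow.

```typescript
type Part = { text: string } | { media: { url: string; contentType: string } };

// Hypothetical stand-in for a model call, e.g. (prompt) => ai.generate({ prompt }).
type GenerateFn = (prompt: Part[]) => Promise<{ text: string }>;

interface ImageAnalysisInput {
  imageUrl: string;
  contentType: string;
  question: string;
}

interface ImageAnalysisOutput {
  ok: boolean;
  answer: string; // model answer, or an error message when ok is false
}

async function analyzeImageFlow(
  input: ImageAnalysisInput,
  generate: GenerateFn,
): Promise<ImageAnalysisOutput> {
  // Reject non-image inputs before spending a model call.
  if (!input.contentType.startsWith('image/')) {
    return { ok: false, answer: `Unsupported content type: ${input.contentType}` };
  }
  const prompt: Part[] = [
    { media: { url: input.imageUrl, contentType: input.contentType } },
    { text: input.question },
  ];
  try {
    const { text } = await generate(prompt);
    return { ok: true, answer: text };
  } catch (err) {
    // Models may reject content; report the failure instead of throwing.
    return { ok: false, answer: err instanceof Error ? err.message : String(err) };
  }
}
```

Injecting the model call keeps the flow logic independent of any one provider, which also makes it straightforward to swap models from the Provider Support table.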
Next Steps
- Learn about RAG for multimodal document retrieval
- Explore Streaming for progressive image generation
- Check out Evaluation for testing multimodal outputs