Multimodal embeddings allow you to create unified vector representations from text and images together. This enables powerful cross-modal search, where you can find images using text queries or vice versa.
Overview
Voyage AI’s voyage-multimodal-3 model supports creating embeddings from:
Text only
Images only
Text and images combined
Multiple texts and/or images in one embedding
Multimodal embeddings create a shared semantic space where text and images with similar meanings have similar vector representations.
Basic usage
Create an embedding from text and image combined:
```typescript
import { createVoyage } from 'voyage-ai-provider';
import { embed } from 'ai';
import type { MultimodalEmbeddingInput } from 'voyage-ai-provider';

const voyage = createVoyage({
  apiKey: process.env.VOYAGE_API_KEY,
});

const { embedding } = await embed<MultimodalEmbeddingInput>({
  model: voyage.multimodalEmbeddingModel('voyage-multimodal-3'),
  value: {
    text: ['A beautiful sunset over the beach'],
    image: ['https://i.ibb.co/r5w8hG8/beach2.jpg'],
  },
});

console.log(`Generated ${embedding.length} dimensional embedding`);
```
The multimodal model accepts several input formats for maximum flexibility:
Text and image together
Combine textual descriptions with visual content:

```typescript
import { createVoyage } from 'voyage-ai-provider';
import { embedMany } from 'ai';
import type { MultimodalEmbeddingInput } from 'voyage-ai-provider';

const voyage = createVoyage({
  apiKey: process.env.VOYAGE_API_KEY,
});

const { embeddings } = await embedMany<MultimodalEmbeddingInput>({
  model: voyage.multimodalEmbeddingModel('voyage-multimodal-3'),
  values: [
    {
      text: ['A beautiful sunset over the beach'],
      image: ['https://i.ibb.co/r5w8hG8/beach2.jpg'],
    },
  ],
});
```
Text-only embeddings
You can use the multimodal model for text alone:

```typescript
import { createVoyage } from 'voyage-ai-provider';
import { embedMany } from 'ai';
import type { MultimodalEmbeddingInput } from 'voyage-ai-provider';

const voyage = createVoyage({
  apiKey: process.env.VOYAGE_API_KEY,
});

const { embeddings } = await embedMany<MultimodalEmbeddingInput>({
  model: voyage.multimodalEmbeddingModel('voyage-multimodal-3'),
  values: [
    'Customer service inquiry about product return',
    'Technical support request for software installation',
    'Sales question about pricing and availability',
  ],
});
```
Image-only embeddings
Generate embeddings from images without text:

```typescript
import { createVoyage } from 'voyage-ai-provider';
import { embedMany } from 'ai';
import type { MultimodalEmbeddingInput } from 'voyage-ai-provider';

const voyage = createVoyage({
  apiKey: process.env.VOYAGE_API_KEY,
});

const { embeddings } = await embedMany<MultimodalEmbeddingInput>({
  model: voyage.multimodalEmbeddingModel('voyage-multimodal-3'),
  values: [
    'https://i.ibb.co/nQNGqL0/beach1.jpg',
    'https://i.ibb.co/r5w8hG8/beach2.jpg',
  ],
});
```
Multiple items per embedding
Combine multiple text segments and images into a single embedding:
Single text with multiple images
Pair descriptive text with several related images:
```typescript
import { createVoyage } from 'voyage-ai-provider';
import { embedMany } from 'ai';
import type { MultimodalEmbeddingInput } from 'voyage-ai-provider';

const voyage = createVoyage({
  apiKey: process.env.VOYAGE_API_KEY,
});

const { embeddings } = await embedMany<MultimodalEmbeddingInput>({
  model: voyage.multimodalEmbeddingModel('voyage-multimodal-3'),
  values: [
    {
      text: ['A beautiful sunset over the beach'],
      image: [
        'https://i.ibb.co/nQNGqL0/beach1.jpg',
        'https://i.ibb.co/r5w8hG8/beach2.jpg',
      ],
    },
  ],
});

console.log('Generated embedding from 1 text + 2 images');
```
Multiple multimodal documents
Embed several text-and-image documents in a single call:
```typescript
import { createVoyage } from 'voyage-ai-provider';
import { embedMany } from 'ai';
import type { MultimodalEmbeddingInput } from 'voyage-ai-provider';

const voyage = createVoyage({
  apiKey: process.env.VOYAGE_API_KEY,
});

const { embeddings } = await embedMany<MultimodalEmbeddingInput>({
  model: voyage.multimodalEmbeddingModel('voyage-multimodal-3'),
  values: [
    {
      text: ['Golden sunset over ocean waves on sandy beach.'],
      image: ['https://i.ibb.co/nQNGqL0/beach1.jpg'],
    },
    {
      text: ['Vibrant sunset over tropical beach and ocean.'],
      image: ['https://i.ibb.co/r5w8hG8/beach2.jpg'],
    },
  ],
});

for (const [index, embedding] of embeddings.entries()) {
  console.log(`Embedding ${index + 1}: ${embedding.length} dimensions`);
}
```
Grouped text embeddings
The multimodal model also supports grouping multiple text segments:
```typescript
import { createVoyage } from 'voyage-ai-provider';
import { embedMany } from 'ai';
import type { TextEmbeddingInput } from 'voyage-ai-provider';

const voyage = createVoyage({
  apiKey: process.env.VOYAGE_API_KEY,
});

const { embeddings } = await embedMany<TextEmbeddingInput>({
  model: voyage.multimodalEmbeddingModel('voyage-multimodal-3'),
  values: [
    // E-commerce product: title + description + features
    [
      'Premium Wireless Bluetooth Headphones',
      'Experience superior sound quality with active noise cancellation',
      'Battery life: 30 hours, Quick charge: 15 min = 3 hours playback',
      'Compatible with iOS, Android, and all Bluetooth devices',
    ],
    // Blog post: title + summary + tags
    [
      'The Future of Artificial Intelligence in Healthcare',
      'Exploring how AI is revolutionizing medical diagnosis and treatment',
      'Tags: AI, healthcare, machine learning, medical technology, innovation',
    ],
  ],
});
```
Grouping related content creates richer semantic representations than embedding items separately.
Configuration options
Customize multimodal embedding behavior:
```typescript
import { createVoyage } from 'voyage-ai-provider';
import { embed } from 'ai';
import type { MultimodalEmbeddingInput, VoyageMultimodalEmbeddingOptions } from 'voyage-ai-provider';

const voyage = createVoyage({
  apiKey: process.env.VOYAGE_API_KEY,
});

const { embedding } = await embed<MultimodalEmbeddingInput>({
  model: voyage.multimodalEmbeddingModel('voyage-multimodal-3'),
  value: {
    text: ['Product description'],
    image: ['https://example.com/product.jpg'],
  },
  providerOptions: {
    voyage: {
      inputType: 'document',
      truncation: true,
    } satisfies VoyageMultimodalEmbeddingOptions,
  },
});
```
Available options
Output data type
The data type for returned embeddings. Defaults to null.
null (default) - Embeddings as a list of floating-point numbers
base64 - Base64-encoded NumPy array of single-precision floats
See the output data types FAQ for details.
truncation
Whether to truncate inputs to fit within the model's context length. Defaults to true. When true, long inputs are automatically truncated. When false, an error is raised if inputs exceed the limit.
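If you prefer to fail fast on over-length inputs instead of silently losing content, truncation can be disabled. A minimal sketch of the provider options, assuming the same VoyageMultimodalEmbeddingOptions type used in the configuration example above:

```typescript
import type { VoyageMultimodalEmbeddingOptions } from 'voyage-ai-provider';

// Disable truncation: inputs longer than the context length now
// raise an error instead of being silently cut off.
const providerOptions = {
  voyage: {
    inputType: 'document',
    truncation: false,
  } satisfies VoyageMultimodalEmbeddingOptions,
};
```

Pass this object as the providerOptions argument to embed or embedMany, exactly as in the configuration example.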
Use cases
Cross-modal search: search images using text queries, or find text using images
E-commerce: match products with descriptions and images
Content management: organize documents containing text and visuals
Visual Q&A: answer questions about image content
Cross-modal retrieval
One of the most powerful features is searching across modalities:
```typescript
import { createVoyage } from 'voyage-ai-provider';
import { embed, embedMany } from 'ai';
import type { MultimodalEmbeddingInput, VoyageMultimodalEmbeddingOptions } from 'voyage-ai-provider';

const voyage = createVoyage({
  apiKey: process.env.VOYAGE_API_KEY,
});

const model = voyage.multimodalEmbeddingModel('voyage-multimodal-3');

// Index documents with images
const { embeddings: documents } = await embedMany<MultimodalEmbeddingInput>({
  model,
  values: [
    {
      text: ['Golden sunset over ocean waves'],
      image: ['https://i.ibb.co/nQNGqL0/beach1.jpg'],
    },
    {
      text: ['Vibrant tropical beach sunset'],
      image: ['https://i.ibb.co/r5w8hG8/beach2.jpg'],
    },
  ],
  providerOptions: {
    voyage: {
      inputType: 'document',
    } satisfies VoyageMultimodalEmbeddingOptions,
  },
});

// Search using a text query
const { embedding: query } = await embed<MultimodalEmbeddingInput>({
  model,
  value: 'beach at sunset',
  providerOptions: {
    voyage: {
      inputType: 'query',
    } satisfies VoyageMultimodalEmbeddingOptions,
  },
});

// Calculate similarities
function cosineSimilarity(a: number[], b: number[]): number {
  const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  return dotProduct / (magnitudeA * magnitudeB);
}

const similarities = documents.map(doc => cosineSimilarity(query, doc));
console.log('Document similarities:', similarities);
```
Use inputType: 'query' for search queries and inputType: 'document' for indexed content to optimize retrieval performance.
Working with base64 images
Convert and embed images from various sources:
```typescript
import { createVoyage } from 'voyage-ai-provider';
import { embed } from 'ai';
import type { MultimodalEmbeddingInput } from 'voyage-ai-provider';

// Helper to convert an image URL to a base64 data URL
const getBase64Image = async (url: string) => {
  const response = await fetch(url);
  const arrayBuffer = await response.arrayBuffer();
  const base64 = Buffer.from(arrayBuffer).toString('base64');
  return `data:image/jpeg;base64,${base64}`;
};

const voyage = createVoyage({
  apiKey: process.env.VOYAGE_API_KEY,
});

const { embedding } = await embed<MultimodalEmbeddingInput>({
  model: voyage.multimodalEmbeddingModel('voyage-multimodal-3'),
  value: {
    text: ['Beach scene at sunset'],
    image: [await getBase64Image('https://i.ibb.co/r5w8hG8/beach2.jpg')],
  },
});

console.log('Embedded base64 image with text');
```
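The same conversion works for images on disk. A minimal sketch using Node's fs module; the helper name, file path, and MIME type are our own placeholders:

```typescript
import { readFile } from 'node:fs/promises';

// Read a local image file and wrap it as a data URL, which the
// multimodal model accepts in the `image` array just like an HTTP URL.
const fileToDataUrl = async (path: string, mimeType = 'image/jpeg') => {
  const base64 = (await readFile(path)).toString('base64');
  return `data:${mimeType};base64,${base64}`;
};
```

The resulting string can be passed in the `image` array exactly like the fetched URL above.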
Usage tracking
Multimodal embeddings track both text and image token usage:
```typescript
import { createVoyage } from 'voyage-ai-provider';
import { embedMany } from 'ai';
import type { MultimodalEmbeddingInput } from 'voyage-ai-provider';

const voyage = createVoyage({
  apiKey: process.env.VOYAGE_API_KEY,
});

const result = await embedMany<MultimodalEmbeddingInput>({
  model: voyage.multimodalEmbeddingModel('voyage-multimodal-3'),
  values: [
    {
      text: ['Description of the image'],
      image: ['https://i.ibb.co/nQNGqL0/beach1.jpg'],
    },
  ],
});

console.log(`Generated ${result.embeddings.length} embeddings`);
console.log(`Total tokens: ${result.usage?.tokens}`);
```
Total tokens include both text tokens and image pixels converted to token equivalents.
Error handling
Handle errors gracefully with multimodal inputs:
```typescript
import { createVoyage } from 'voyage-ai-provider';
import { embed } from 'ai';
import type { MultimodalEmbeddingInput } from 'voyage-ai-provider';

const voyage = createVoyage({
  apiKey: process.env.VOYAGE_API_KEY,
});

try {
  const { embedding } = await embed<MultimodalEmbeddingInput>({
    model: voyage.multimodalEmbeddingModel('voyage-multimodal-3'),
    value: {
      text: ['Product information'],
      image: ['https://example.com/product.jpg'],
    },
  });
  console.log('Multimodal embedding generated');
} catch (error) {
  console.error('Failed to generate embedding:', error);
}
```
The maximum batch size is 128 embeddings per call. Split larger batches into multiple requests.
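Splitting can be done with a small client-side helper. A sketch, where `chunk` is a hypothetical utility of ours; only the batch size of 128 comes from the API limit:

```typescript
// Split an array into consecutive batches of at most `size` items.
const chunk = <T>(items: T[], size: number): T[][] => {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
};

// Usage sketch: embed each batch sequentially and collect the results.
// const all: number[][] = [];
// for (const batch of chunk(values, 128)) {
//   const { embeddings } = await embedMany<MultimodalEmbeddingInput>({ model, values: batch });
//   all.push(...embeddings);
// }
```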
Best practices
Group text and images that belong together (e.g., product titles with product images) for more meaningful embeddings.
Set inputType: 'query' for search queries and inputType: 'document' for indexed content to optimize retrieval.
Process multiple multimodal inputs together using embedMany for better efficiency.
Ensure text descriptions complement images rather than duplicate information, creating richer semantic representations.
Model selection
Choose the right model method based on your input:
```typescript
import { createVoyage } from 'voyage-ai-provider';

const voyage = createVoyage({
  apiKey: process.env.VOYAGE_API_KEY,
});

// For text only - use textEmbeddingModel for better performance
const textModel = voyage.textEmbeddingModel('voyage-3-lite');

// For images only - use imageEmbeddingModel
const imageModel = voyage.imageEmbeddingModel('voyage-multimodal-3');

// For text + images or flexible inputs - use multimodalEmbeddingModel
const multimodalModel = voyage.multimodalEmbeddingModel('voyage-multimodal-3');
```
All three methods can use the same underlying model, but they’re optimized for different input patterns.
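The input shapes shown throughout this page suggest a simple dispatch rule. A hypothetical helper (the classification logic is our own, not part of the provider) that inspects a value and reports which model method fits:

```typescript
type EmbeddingValue = string | { text?: string[]; image?: string[] };

// Classify an input value by which embedding method suits it best.
const pickModelKind = (value: EmbeddingValue): 'text' | 'image' | 'multimodal' => {
  if (typeof value === 'string') {
    // Bare strings that are image URLs or data URLs go to the image model.
    return value.startsWith('http') || value.startsWith('data:') ? 'image' : 'text';
  }
  const hasText = (value.text?.length ?? 0) > 0;
  const hasImage = (value.image?.length ?? 0) > 0;
  if (hasText && hasImage) return 'multimodal';
  return hasImage ? 'image' : 'text';
};
```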
Next steps
Text embeddings: learn about text-only embedding models
Image embeddings: generate embeddings from images
Reranking: improve search results with reranking
Configuration: customize provider settings