Multimodal

Genkit supports multimodal AI capabilities, allowing you to generate and process images, videos, and audio alongside text. Build applications that understand and create visual content.

Image Generation

Generate images from text descriptions:
import (
    "context"

    "github.com/firebase/genkit/go/ai"
    "github.com/firebase/genkit/go/genkit"
    "github.com/firebase/genkit/go/plugins/googlegenai"
    "google.golang.org/genai"
)

func main() {
    ctx := context.Background()
    g := genkit.Init(ctx, genkit.WithPlugins(&googlegenai.VertexAI{}))

    genkit.DefineFlow(g, "image-generation", 
        func(ctx context.Context, input string) ([]string, error) {
            r, err := genkit.Generate(ctx, g,
                ai.WithModelName("vertexai/imagen-3.0-generate-001"),
                ai.WithPrompt("Generate an image of %s", input),
                ai.WithConfig(&genai.GenerateImagesConfig{
                    NumberOfImages:    2,
                    NegativePrompt:    "night",
                    AspectRatio:       "9:16",
                    SafetyFilterLevel: genai.SafetyFilterLevelBlockLowAndAbove,
                    PersonGeneration:  genai.PersonGenerationAllowAll,
                    Language:          genai.ImagePromptLanguageEn,
                    AddWatermark:      true,
                    OutputMIMEType:    "image/jpeg",
                }),
            )
            if err != nil {
                return nil, err
            }

            var images []string
            for _, m := range r.Message.Content {
                // Media parts carry the generated image as a data URL in Text.
                images = append(images, m.Text)
            }
            return images, nil
        })
}

Image Understanding

Analyze and describe images:

const { text } = await ai.generate({
  model: googleAI.model('gemini-2.5-flash'),
  prompt: 'Describe what you see in this image',
  media: [
    {
      url: 'https://example.com/image.jpg',
      contentType: 'image/jpeg',
    },
  ],
});

console.log(text);

Video Understanding

Process and analyze video content:
import { Document } from 'genkit';

// Index video for RAG
const videoDocs = [
  Document.fromMedia(
    'gs://cloud-samples-data/generative-ai/video/pixel8.mp4',
    'video/mp4',
    {
      videoSegmentConfig: {
        startOffsetSec: 0,
        endOffsetSec: 120,
        intervalSec: 15,
      },
    }
  ),
];

await ai.index({
  indexer: videoIndexer,
  documents: videoDocs,
});

// Query video content
const videoQA = ai.defineFlow(
  { name: 'videoQuestions', inputSchema: z.string() },
  async (query) => {
    const docs = await ai.retrieve({
      retriever: videoRetriever,
      query,
      options: { k: 1 },
    });

    const { text } = await ai.generate({
      model: googleAI.model('gemini-2.5-flash'),
      prompt: query,
      media: docs
        .filter(d => d.media[0]?.url)
        .slice(0, 1) // pass the single best-matching segment
        .map(d => ({
          url: d.media[0].url,
          contentType: d.media[0].contentType,
          startOffsetSec: d.metadata?.embedMetadata?.startOffsetSec,
          endOffsetSec: d.metadata?.embedMetadata?.endOffsetSec,
        })),
    });
    
    return text;
  }
);

PDF Processing

Extract and analyze content from PDFs:
import fs from 'fs';

// Read PDF file
const pdfBuffer = fs.readFileSync('document.pdf');
const pdfBase64 = pdfBuffer.toString('base64');

const { text } = await ai.generate({
  model: googleAI.model('gemini-2.5-flash'),
  prompt: 'Summarize the key points from this document',
  media: [
    {
      url: `data:application/pdf;base64,${pdfBase64}`,
      contentType: 'application/pdf',
    },
  ],
});

console.log(text);

Multimodal Embeddings

Create embeddings for images and videos for similarity search:
import { vertexAI } from '@genkit-ai/vertexai';
import { Document, genkit } from 'genkit';

const ai = genkit({ plugins: [vertexAI()] });

// Embed an image
const embedding = await ai.embed({
  embedder: vertexAI.embedder('multimodalembedding@001'),
  content: Document.fromMedia(
    'gs://my-bucket/image.jpg',
    'image/jpeg'
  ),
});

console.log(embedding.values);
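Once you have embedding vectors, similarity search comes down to comparing them. A minimal cosine-similarity sketch in plain TypeScript (independent of any Genkit API; in production you would usually hand the vectors to a vector database rather than scanning in memory):

```typescript
// Cosine similarity between two embedding vectors:
// dot(a, b) / (|a| * |b|), ranging from -1 (opposite) to 1 (identical direction).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vector length mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank stored embeddings against a query embedding, most similar first.
function rankBySimilarity(
  query: number[],
  corpus: { id: string; values: number[] }[],
): { id: string; score: number }[] {
  return corpus
    .map(({ id, values }) => ({ id, score: cosineSimilarity(query, values) }))
    .sort((x, y) => y.score - x.score);
}
```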

Video Segment Configuration

Control how videos are processed:
const videoDoc = Document.fromMedia(
  'gs://cloud-samples-data/video/google_sustainability.mp4',
  'video/mp4',
  {
    videoSegmentConfig: {
      startOffsetSec: 0,      // Start at beginning
      endOffsetSec: 120,       // Process first 2 minutes
      intervalSec: 15,         // Create embeddings every 15 seconds
    },
  }
);

// For longer videos, process in multiple segments
const segment1 = Document.fromMedia(
  videoUrl,
  'video/mp4',
  {
    videoSegmentConfig: {
      startOffsetSec: 0,
      endOffsetSec: 120,
      intervalSec: 15,
    },
  }
);

const segment2 = Document.fromMedia(
  videoUrl,
  'video/mp4',
  {
    videoSegmentConfig: {
      startOffsetSec: 120,
      endOffsetSec: 240,
      intervalSec: 15,
    },
  }
);
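With intervalSec: 15, each two-minute segment above yields one embedding per 15-second window. A quick helper (hypothetical, not part of the Genkit API) that checks the arithmetic, assuming a trailing partial window still produces an embedding:

```typescript
interface VideoSegmentConfig {
  startOffsetSec: number;
  endOffsetSec: number;
  intervalSec: number;
}

// Number of embedding windows a segment produces: one per interval,
// rounding up so a trailing partial window still counts.
function embeddingCount(cfg: VideoSegmentConfig): number {
  const span = cfg.endOffsetSec - cfg.startOffsetSec;
  if (span <= 0 || cfg.intervalSec <= 0) {
    throw new Error('invalid segment config');
  }
  return Math.ceil(span / cfg.intervalSec);
}
```

Each of the two-minute segments above produces 120 / 15 = 8 embeddings.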

Mixed Media Inputs

Combine multiple types of media in a single request:
const { text } = await ai.generate({
  model: googleAI.model('gemini-2.5-flash'),
  prompt: 'Compare these images and describe the differences',
  media: [
    { url: 'https://example.com/image1.jpg', contentType: 'image/jpeg' },
    { url: 'https://example.com/image2.jpg', contentType: 'image/jpeg' },
  ],
});

Provider Support

Multimodal capabilities vary by provider:
| Provider           | Image Input | Image Generation | Video Input | PDF Input |
| ------------------ | ----------- | ---------------- | ----------- | --------- |
| Google AI (Gemini) | ✅          | ✅               | ✅          | ✅        |
| Vertex AI          | ✅          | ✅ (Imagen)      | ✅          | ✅        |
| Anthropic (Claude) | ✅          | —                | —           | ✅        |
| OpenAI             | ✅          | ✅ (DALL-E)      | —           | —         |

Supported File Types

Images

  • JPEG (.jpg, .jpeg)
  • PNG (.png)
  • WebP (.webp)
  • GIF (.gif)

Video

  • MP4 (.mp4)
  • MOV (.mov)
  • AVI (.avi)
  • WebM (.webm)

Documents

  • PDF (.pdf)
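The contentType you pass must match the file you send. A small extension-to-MIME-type lookup covering the formats above (a convenience helper, not part of Genkit):

```typescript
// Extension → MIME type for the file formats listed above.
const MIME_TYPES: Record<string, string> = {
  '.jpg': 'image/jpeg',
  '.jpeg': 'image/jpeg',
  '.png': 'image/png',
  '.webp': 'image/webp',
  '.gif': 'image/gif',
  '.mp4': 'video/mp4',
  '.mov': 'video/quicktime',
  '.avi': 'video/x-msvideo',
  '.webm': 'video/webm',
  '.pdf': 'application/pdf',
};

// Look up the MIME type for a filename, case-insensitively.
function contentTypeFor(filename: string): string {
  const ext = filename.slice(filename.lastIndexOf('.')).toLowerCase();
  const mime = MIME_TYPES[ext];
  if (!mime) throw new Error(`unsupported file type: ${ext}`);
  return mime;
}
```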

Media Source Options

HTTP/HTTPS URLs

TypeScript
media: [{
  url: 'https://example.com/image.jpg',
  contentType: 'image/jpeg',
}]

Google Cloud Storage URLs

TypeScript
media: [{
  url: 'gs://my-bucket/video.mp4',
  contentType: 'video/mp4',
}]

Base64 Encoded Data

TypeScript
const base64Data = fs.readFileSync('image.jpg').toString('base64');

media: [{
  url: `data:image/jpeg;base64,${base64Data}`,
  contentType: 'image/jpeg',
}]
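The pattern above can be wrapped in a small helper that builds the data URL from any buffer (assumes Node's Buffer):

```typescript
// Build a data: URL from raw bytes, suitable for the media array above.
function toDataUrl(data: Buffer, contentType: string): string {
  return `data:${contentType};base64,${data.toString('base64')}`;
}

// Usage:
// const part = {
//   url: toDataUrl(fs.readFileSync('image.jpg'), 'image/jpeg'),
//   contentType: 'image/jpeg',
// };
```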

Best Practices

Optimize Image Sizes

Resize images before sending to reduce latency and costs:
  • Recommended: 1024x1024 or smaller
  • Maximum: Check provider limits
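The resize itself is usually done with an image library, but the target dimensions for a "fit inside 1024x1024" resize are plain arithmetic — a sketch of the scaling math:

```typescript
// Scale (width, height) down to fit inside a maxDim x maxDim box,
// preserving aspect ratio. Images already within bounds are untouched.
function fitWithin(
  width: number,
  height: number,
  maxDim = 1024,
): { width: number; height: number } {
  const scale = Math.min(1, maxDim / Math.max(width, height));
  return {
    width: Math.round(width * scale),
    height: Math.round(height * scale),
  };
}
```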

Use Cloud Storage for Large Files

For videos and large PDFs, use Google Cloud Storage URLs instead of base64:
TypeScript
// Recommended for large files
url: 'gs://my-bucket/large-video.mp4'

// Avoid for large files (increases request size)
url: `data:video/mp4;base64,${hugeBase64String}`

Process Videos in Segments

For long videos, process in smaller time windows:
TypeScript
const processLongVideo = async (videoUrl: string, totalDuration: number) => {
  const segmentDuration = 120; // 2 minutes per segment
  const segments: Document[] = [];

  for (let start = 0; start < totalDuration; start += segmentDuration) {
    segments.push(
      Document.fromMedia(videoUrl, 'video/mp4', {
        videoSegmentConfig: {
          startOffsetSec: start,
          endOffsetSec: start + segmentDuration,
          intervalSec: 15,
        },
      })
    );
  }
  
  return segments;
};

Add Specific Instructions

Provide clear context about what to analyze:
TypeScript
const { text } = await ai.generate({
  model: googleAI.model('gemini-2.5-flash'),
  prompt: 'Identify all the text visible in this image, maintaining the original layout and formatting',
  media: [{ url: imageUrl, contentType: 'image/jpeg' }],
});

Handle Multimodal Errors

Some models may reject certain content:
TypeScript
try {
  const { text } = await ai.generate({
    model: googleAI.model('gemini-2.5-flash'),
    prompt: 'Describe this image',
    media: [{ url: imageUrl, contentType: 'image/jpeg' }],
  });
} catch (error) {
  const message = error instanceof Error ? error.message : String(error);
  if (message.includes('SAFETY')) {
    console.log('Content blocked by safety filters');
  } else if (message.includes('UNSUPPORTED')) {
    console.log('File type not supported');
  } else {
    throw error;
  }
}

Complete Example: Image Analysis Flow

import { genkit, z } from 'genkit';
import { googleAI } from '@genkit-ai/google-genai';

const ai = genkit({ plugins: [googleAI()] });

const ImageAnalysisSchema = z.object({
  description: z.string(),
  objects: z.array(z.string()),
  colors: z.array(z.string()),
  text: z.string().optional(),
  sentiment: z.enum(['positive', 'neutral', 'negative']),
});

const analyzeImage = ai.defineFlow(
  {
    name: 'analyzeImage',
    inputSchema: z.object({ imageUrl: z.string() }),
    outputSchema: ImageAnalysisSchema,
  },
  async ({ imageUrl }) => {
    const { output } = await ai.generate({
      model: googleAI.model('gemini-2.5-flash'),
      prompt: 'Analyze this image and provide a detailed description, list of objects, dominant colors, any visible text, and the overall sentiment',
      media: [{ url: imageUrl, contentType: 'image/jpeg' }],
      output: { schema: ImageAnalysisSchema },
    });
    
    return output;
  }
);

// Use the flow
const analysis = await analyzeImage({ 
  imageUrl: 'https://example.com/photo.jpg' 
});

console.log(analysis.description);
console.log('Objects found:', analysis.objects);

Next Steps

  • Learn about RAG for multimodal document retrieval
  • Explore Streaming for progressive image generation
  • Check out Evaluation for testing multimodal outputs
