Multimodal Understanding

Gemini models natively support multimodal inputs, allowing you to process text, images, video, audio, and documents in a single API call.

Supported Modalities

Images

JPEG, PNG, WEBP, GIF

Video

MP4, MOV, MPEG, AVI, WebM

Audio

MP3, WAV, FLAC, AAC

Documents

PDF documents

Web Pages

HTML content via HTTP/HTTPS

YouTube

Direct YouTube video URLs

Image Understanding

From Cloud Storage

Process images stored in Google Cloud Storage:

from google import genai
from google.genai.types import Part

# PROJECT_ID and LOCATION are placeholders for your Google Cloud project ID and region.
client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents=[
        Part.from_uri(
            file_uri="gs://cloud-samples-data/generative-ai/image/a-man-and-a-dog.png",
            mime_type="image/png"
        ),
        "Describe this image in detail."
    ]
)

print(response.text)

From Local Files

Load and process images from your local file system:

with open("path/to/image.jpg", "rb") as f:
    image_data = f.read()

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents=[
        Part.from_bytes(data=image_data, mime_type="image/jpeg"),
        "What objects are in this image?"
    ]
)

print(response.text)
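
The example above hard-codes the MIME type. A minimal sketch of inferring it from the file extension instead, restricted to the image formats listed under Supported Modalities. `load_image` and `IMAGE_MIME_BY_EXTENSION` are hypothetical names introduced here for illustration; they are not part of the google-genai SDK.

```python
from pathlib import Path

# Extension-to-MIME mapping for the image formats listed under Supported Modalities.
IMAGE_MIME_BY_EXTENSION = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".webp": "image/webp",
    ".gif": "image/gif",
}


def load_image(path):
    """Return (raw_bytes, mime_type) for a local image file.

    Hypothetical helper, not part of the google-genai SDK.
    """
    file_path = Path(path)
    suffix = file_path.suffix.lower()
    # Reject unknown extensions before touching the file system.
    if suffix not in IMAGE_MIME_BY_EXTENSION:
        raise ValueError(f"Unsupported image extension: {suffix!r}")
    return file_path.read_bytes(), IMAGE_MIME_BY_EXTENSION[suffix]
```

The returned pair can be passed straight through, for example `Part.from_bytes(data=image_data, mime_type=mime_type)` after `image_data, mime_type = load_image("path/to/image.jpg")`.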

From Public URLs

Process images from web URLs:

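import requests

image_url = "https://example.com/image.jpg"
# Download the image bytes; raise on HTTP errors so an error page is never sent to the model.
http_response = requests.get(image_url, timeout=30)
http_response.raise_for_status()
image_data = http_response.content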

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents=[
        Part.from_bytes(data=image_data, mime_type="image/jpeg"),
        "Analyze this image."
    ]
)
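
The example above assumes the URL serves a JPEG. When the format is not known in advance, one option is to read it from the `Content-Type` header of the download. The sketch below shows only the header parsing; `mime_type_from_content_type` is a hypothetical helper, and the `image/jpeg` fallback is an assumption chosen to match the example.

```python
def mime_type_from_content_type(header_value, default="image/jpeg"):
    """Extract the bare MIME type from a Content-Type header value.

    Hypothetical helper: drops parameters such as "; charset=utf-8" and
    falls back to `default` when the header is missing or empty.
    """
    if not header_value:
        return default
    mime_type = header_value.split(";", 1)[0].strip().lower()
    return mime_type or default


print(mime_type_from_content_type("image/PNG; charset=binary"))  # image/png
print(mime_type_from_content_type(None))  # image/jpeg
```

With `requests`, the header value is available as `headers.get("Content-Type")` on the object returned by `requests.get`.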

Media Resolution Control

Gemini 3 models let you set the processing resolution for each media part individually:

from google.genai.types import (
    Part,
    FileData,
    PartMediaResolution,
    PartMediaResolutionLevel,
    GenerateContentConfig
)

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents=[
        Part(
            file_data=FileData(
                file_uri="gs://path/to/high-res-image.jpg",
                mime_type="image/jpeg"
            ),
            media_resolution=PartMediaResolution(
                level=PartMediaResolutionLevel.MEDIA_RESOLUTION_HIGH
            )
        ),
        "Read the small text in this image."
    ]
)

Higher resolution settings increase token usage and latency but improve accuracy for fine details and small text.

Video Understanding

From Cloud Storage

Process video files with temporal understanding:

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents=[
        Part.from_uri(
            file_uri="gs://cloud-samples-data/generative-ai/video/pixel_phone.mp4",
            mime_type="video/mp4"
        ),
        "Summarize the main events in this video."
    ]
)

print(response.text)

YouTube Videos

Analyze YouTube videos directly:

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents=[
        Part.from_uri(
            file_uri="https://www.youtube.com/watch?v=3KtWfp0UopM",
            mime_type="video/mp4"
        ),
        "At what timestamp does the product launch occur?"
    ]
)

Multiple Videos

Compare multiple videos in a single request:

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents=[
        Part.from_uri(
            file_uri="gs://path/to/video1.mp4",
            mime_type="video/mp4",
        ),
        Part.from_uri(
            file_uri="gs://path/to/video2.mp4",
            mime_type="video/mp4",
        ),
        "What are the differences between these two videos?"
    ]
)

Video Resolution Control

Optimize processing for different video lengths:

from google.genai.types import GenerateContentConfig, MediaResolution

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents=[
        Part(
            file_data=FileData(
                file_uri="gs://path/to/long-video.mp4",
                mime_type="video/mp4"
            ),
            media_resolution=PartMediaResolution(
                level=PartMediaResolutionLevel.MEDIA_RESOLUTION_LOW
            )
        ),
        "Provide a high-level summary."
    ],
    config=GenerateContentConfig(
        media_resolution=MediaResolution.MEDIA_RESOLUTION_LOW
    )
)
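
If video length is known ahead of time, the resolution level can be chosen programmatically. This is a sketch under stated assumptions: `pick_media_resolution` is a hypothetical helper, and the 10-minute cutoff and the MEDIUM default for shorter clips are illustrative choices, not documented limits.

```python
def pick_media_resolution(duration_seconds, long_video_seconds=600):
    """Return the name of a media resolution level for a video.

    Hypothetical helper. The 10-minute default cutoff is an illustrative
    assumption, not a documented limit.
    """
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    if duration_seconds >= long_video_seconds:
        # Long videos: trade detail for a smaller token footprint.
        return "MEDIA_RESOLUTION_LOW"
    return "MEDIA_RESOLUTION_MEDIUM"


print(pick_media_resolution(1800))  # MEDIA_RESOLUTION_LOW
print(pick_media_resolution(90))    # MEDIA_RESOLUTION_MEDIUM
```

The returned name is intended to be looked up on the enum used above, for example `getattr(PartMediaResolutionLevel, pick_media_resolution(1800))`; check that the member exists in your SDK version.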

Audio Processing

Audio Transcription and Analysis

Process audio files with automatic transcription:

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents=[
        Part.from_uri(
            file_uri="gs://cloud-samples-data/generative-ai/audio/podcast.mp3",
            mime_type="audio/mpeg"
        ),
        "Transcribe this audio and provide a summary."
    ]
)

print(response.text)

Audio with Timestamps

Get timestamped transcriptions:

from google.genai.types import GenerateContentConfig

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents=[
        Part.from_uri(
            file_uri="https://example.com/podcast.mp3",
            mime_type="audio/mpeg"
        ),
        "Transcribe this podcast with timestamps."
    ],
    config=GenerateContentConfig(
        audio_timestamp=True
    )
)
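
A timestamped transcript usually needs post-processing before it can be indexed or searched. The sketch below assumes each transcript line looks like `[MM:SS] text` or `[HH:MM:SS] text`. That shape is an assumption, since the model's output format is not guaranteed; adjust the pattern to what your prompt actually produces. `parse_transcript` is a hypothetical helper.

```python
import re

# Assumed line shape: "[MM:SS] text" or "[HH:MM:SS] text".
TIMESTAMP_LINE = re.compile(r"^\[(?:(\d+):)?(\d{1,2}):(\d{2})\]\s*(.*)$")


def parse_transcript(text):
    """Return a list of (offset_seconds, utterance) tuples."""
    segments = []
    for line in text.splitlines():
        match = TIMESTAMP_LINE.match(line.strip())
        if not match:
            continue  # skip lines that do not start with a timestamp
        hours, minutes, seconds, utterance = match.groups()
        offset = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
        segments.append((offset, utterance))
    return segments
```

Calling `parse_transcript(response.text)` on output in the assumed shape yields offsets in seconds that can be sorted or used to seek in the original audio.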

Multiple Audio Files

Compare or combine multiple audio sources:

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents=[
        Part.from_uri(
            file_uri="gs://path/to/audio1.wav",
            mime_type="audio/wav"
        ),
        Part.from_uri(
            file_uri="gs://path/to/audio2.wav",
            mime_type="audio/wav"
        ),
        "Compare the speaking styles in these recordings."
    ]
)

PDF Document Processing

Research Papers

Analyze academic papers and documents:

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents=[
        Part.from_uri(
            file_uri="gs://cloud-samples-data/generative-ai/pdf/2312.11805v3.pdf",
            mime_type="application/pdf"
        ),
        "Summarize the key findings of this research paper."
    ]
)

Multiple Documents

Compare or synthesize information across documents:

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents=[
        Part.from_uri(
            file_uri="gs://path/to/paper1.pdf",
            mime_type="application/pdf"
        ),
        Part.from_uri(
            file_uri="gs://path/to/paper2.pdf",
            mime_type="application/pdf"
        ),
        "What methodology differences exist between these papers?"
    ]
)

Web Page Processing

Analyze web pages directly via HTTP:

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents=[
        Part.from_uri(
            file_uri="https://cloud.google.com/vertex-ai/docs",
            mime_type="text/html"
        ),
        "Summarize the key features described on this page."
    ]
)

Mixed Multimodal Inputs

Images and Text

Combine multiple images with text instructions:

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents=[
        "Compare these two images and describe the differences:",
        Part.from_uri(
            file_uri="gs://path/to/before.jpg",
            mime_type="image/jpeg"
        ),
        "VS",
        Part.from_uri(
            file_uri="gs://path/to/after.jpg",
            mime_type="image/jpeg"
        )
    ]
)

Video, Image, and Text

Mix different modalities:

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents=[
        "This is the product image:",
        Part.from_uri(
            file_uri="gs://path/to/product.jpg",
            mime_type="image/jpeg"
        ),
        "And this is the demo video:",
        Part.from_uri(
            file_uri="gs://path/to/demo.mp4",
            mime_type="video/mp4"
        ),
        "Write a marketing description combining both."
    ]
)
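
The interleaved caption-then-media pattern used in the two examples above can be generated from data. A minimal sketch: `build_contents` is a hypothetical helper, and the part constructor is injected as `make_part` so the function itself has no SDK dependency.

```python
def build_contents(items, make_part, final_instruction):
    """Interleave captions and media parts, ending with an instruction.

    items: iterable of (caption, uri, mime_type) tuples.
    make_part: callable (uri, mime_type) -> part object; injected so this
        helper does not depend on any SDK.
    """
    contents = []
    for caption, uri, mime_type in items:
        contents.append(caption)
        contents.append(make_part(uri, mime_type))
    contents.append(final_instruction)
    return contents
```

With the SDK imported as in the earlier examples, `make_part` could be `lambda uri, mime: Part.from_uri(file_uri=uri, mime_type=mime)`, and the returned list passed as `contents`.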

Global Media Resolution

Set default resolution for all media in a request:

from google.genai.types import GenerateContentConfig, MediaResolution

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents=[
        Part.from_uri(
            file_uri="gs://path/to/image1.jpg",
            mime_type="image/jpeg"
        ),
        Part.from_uri(
            file_uri="gs://path/to/image2.jpg",
            mime_type="image/jpeg"
        ),
        "Analyze both images."
    ],
    config=GenerateContentConfig(
        media_resolution=MediaResolution.MEDIA_RESOLUTION_MEDIUM
    )
)

Image + Video Context

Find when an image appears in a video:

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents=[
        Part(
            file_data=FileData(
                file_uri="gs://path/to/reference-image.png",
                mime_type="image/png"
            ),
            media_resolution=PartMediaResolution(
                level=PartMediaResolutionLevel.MEDIA_RESOLUTION_HIGH
            )
        ),
        Part(
            file_data=FileData(
                file_uri="gs://path/to/video.mp4",
                mime_type="video/mp4"
            ),
            media_resolution=PartMediaResolution(
                level=PartMediaResolutionLevel.MEDIA_RESOLUTION_LOW
            )
        ),
        "At what timestamp does this image appear in the video? What's the context?"
    ]
)

Chat with Multimodal Input

Maintain context across multimodal conversations:

chat = client.chats.create(
    model="gemini-3.1-pro-preview"
)

# First turn with image
response = chat.send_message([
    Part.from_uri(
        file_uri="gs://path/to/chart.png",
        mime_type="image/png"
    ),
    "What does this chart show?"
])
print(response.text)

# Follow-up without image (maintains context)
response = chat.send_message("What were the peak values?")
print(response.text)

Best Practices

File Size Limits

Keep individual files under 20MB for optimal performance

Resolution Trade-offs

Use LOW resolution for long videos to fit context limits

Cloud Storage

Use GCS URIs for large files instead of base64 encoding

Batch Processing

Process multiple similar files in parallel requests
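
Two of the practices above can be sketched in a few lines: routing files by size, and fanning requests out in parallel. The names `MAX_INLINE_BYTES`, `delivery_method`, and `process_in_parallel` are hypothetical, the 20MB guideline is interpreted here as 20 MiB (an assumption), and the worker count of 4 is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

# The 20MB guideline above, interpreted as 20 MiB (assumption).
MAX_INLINE_BYTES = 20 * 1024 * 1024


def delivery_method(size_bytes):
    """Return "inline" for files under the guideline, otherwise "gcs"."""
    return "inline" if size_bytes < MAX_INLINE_BYTES else "gcs"


def process_in_parallel(items, worker, max_workers=4):
    """Apply `worker` to each item concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, items))
```

In practice `size_bytes` would come from `os.path.getsize(path)`, and `worker` would be a function that takes one file URI and returns the text of a `client.models.generate_content(...)` call for it. Threads suit this workload because each request spends most of its time waiting on the network.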

Supported MIME Types

Images

image/png
image/jpeg
image/webp
image/gif

Video

video/mp4
video/mpeg
video/mov
video/avi
video/x-flv
video/mpg
video/webm
video/wmv
video/3gpp

Audio

audio/wav
audio/mp3
audio/mpeg
audio/aiff
audio/aac
audio/ogg
audio/flac

Documents

application/pdf
text/html
text/plain
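
The lists above can be turned into a pre-flight check that rejects unsupported inputs before a request is sent. A minimal sketch; `SUPPORTED_MIME_TYPES` and `modality_for` are hypothetical names, and the sets simply copy the lists in this section.

```python
# Copied from the Supported MIME Types lists in this section.
SUPPORTED_MIME_TYPES = {
    "images": {"image/png", "image/jpeg", "image/webp", "image/gif"},
    "video": {
        "video/mp4", "video/mpeg", "video/mov", "video/avi", "video/x-flv",
        "video/mpg", "video/webm", "video/wmv", "video/3gpp",
    },
    "audio": {
        "audio/wav", "audio/mp3", "audio/mpeg", "audio/aiff",
        "audio/aac", "audio/ogg", "audio/flac",
    },
    "documents": {"application/pdf", "text/html", "text/plain"},
}


def modality_for(mime_type):
    """Return the modality name for a MIME type, or None if unsupported."""
    normalized = mime_type.strip().lower()
    for modality, mime_types in SUPPORTED_MIME_TYPES.items():
        if normalized in mime_types:
            return modality
    return None
```

A caller might raise an error when `modality_for(mime_type)` returns `None` rather than letting the API reject the request.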

Next Steps

Function Calling

Use multimodal function calling for structured outputs

Grounding

Ground multimodal queries in real-time data

Context Caching

Cache large media files for cost optimization

Batch Prediction

Process large media collections asynchronously
