Gemini models can process multiple types of media in addition to text. This guide covers how to provide images, audio, video, and PDF files as input.

Images

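There are three main ways to provide image input: by URI with Part.from_uri, as inline bytes with Part.from_bytes, or as an upload through the File API.
Use Part.from_uri for images stored in Google Cloud Storage: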
from google import genai
from google.genai import types

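# Note: gs:// URIs are read directly on Vertex AI; with the Gemini Developer API,
# upload the file through the File API instead (see Best Practices).
client = genai.Client(api_key='your-api-key')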

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'What is this image about?',
        types.Part.from_uri(
            file_uri='gs://generativeai-downloads/images/scones.jpg',
            mime_type='image/jpeg',
        ),
    ],
)
print(response.text)
Supported image formats: JPEG, PNG, WebP, GIF

Audio

Process audio files for transcription, analysis, or understanding:
from google.genai import types

with open('audio_sample.mp3', 'rb') as f:
    audio_bytes = f.read()

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        types.Part.from_bytes(
            data=audio_bytes,
            mime_type='audio/mp3',
        ),
        'Transcribe this audio.'
    ]
)
print(response.text)
Supported audio formats: MP3, WAV, FLAC, AAC

Video

Analyze video content for descriptions, summaries, or specific questions:
from google.genai import types

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'What happens in this video?',
        types.Part.from_uri(
            file_uri='gs://your-bucket/video.mp4',
            mime_type='video/mp4',
        ),
    ],
)
print(response.text)
Supported video formats: MP4, MOV, AVI, WebM, FLV, MPG

PDFs

Extract information from PDF documents:
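import time

# Upload PDF
pdf_file = client.files.upload(file='document.pdf')

# Wait for processing
while pdf_file.state == 'PROCESSING':
    time.sleep(2)
    pdf_file = client.files.get(name=pdf_file.name)

# Stop early if processing failed
if pdf_file.state == 'FAILED':
    raise RuntimeError(f'Processing failed for {pdf_file.name}')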

# Ask questions about the PDF
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Summarize this document', pdf_file]
)
print(response.text)
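
The polling loop above keeps running for as long as the file stays in PROCESSING. A minimal sketch of a bounded wait follows, assuming only that the client exposes files.get(name=...) and that file objects carry name and state, as in the snippet above. The helper names wait_until_active and state_name, the default timeout, and the injectable sleep and clock parameters are illustrative, not part of the SDK.

```python
import time


def state_name(state):
    """Return the state as a plain string, whether it is an enum member or a str."""
    return getattr(state, 'name', None) or str(state)


def wait_until_active(files, uploaded, timeout_s=120.0, poll_s=2.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll files.get() until the file leaves PROCESSING or the timeout expires.

    `files` is assumed to behave like client.files; the timeout and poll
    interval are arbitrary defaults chosen for illustration.
    """
    deadline = clock() + timeout_s
    while state_name(uploaded.state) == 'PROCESSING':
        if clock() >= deadline:
            raise TimeoutError(f'{uploaded.name} still processing after {timeout_s}s')
        sleep(poll_s)
        uploaded = files.get(name=uploaded.name)
    # Anything other than ACTIVE (for example FAILED) is treated as an error.
    if state_name(uploaded.state) != 'ACTIVE':
        raise RuntimeError(f'{uploaded.name} ended in state {state_name(uploaded.state)}')
    return uploaded
```

With the client from the earlier snippets, this would be called as pdf_file = wait_until_active(client.files, pdf_file) before passing pdf_file to generate_content.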

Combining Multiple Modalities

You can mix different media types in a single request:
from google.genai import types

# Upload files
image_file = client.files.upload(file='chart.png')
audio_file = client.files.upload(file='presentation.mp3')

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Based on this chart and audio presentation, ',
        image_file,
        audio_file,
        'what are the main conclusions?'
    ]
)
print(response.text)

MIME Types Reference

Common MIME types for different media:
| Media Type | MIME Type Examples |
|------------|--------------------|
| Images | image/jpeg, image/png, image/webp, image/gif |
| Audio | audio/mp3, audio/wav, audio/flac, audio/aac |
| Video | video/mp4, video/mov, video/avi, video/webm |
| PDF | application/pdf |
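
Because each request needs an explicit MIME type, it can help to derive it from the file extension. The following is a minimal sketch that covers only the types listed in the table above. MIME_BY_EXTENSION and mime_type_for are hypothetical helper names, and the extension-to-type pairing (for example .jpg to image/jpeg) is an assumption of this sketch rather than something the SDK provides.

```python
from pathlib import Path

# Extension -> MIME type, limited to the types listed in the reference table.
MIME_BY_EXTENSION = {
    '.jpg': 'image/jpeg',
    '.jpeg': 'image/jpeg',
    '.png': 'image/png',
    '.webp': 'image/webp',
    '.gif': 'image/gif',
    '.mp3': 'audio/mp3',
    '.wav': 'audio/wav',
    '.flac': 'audio/flac',
    '.aac': 'audio/aac',
    '.mp4': 'video/mp4',
    '.mov': 'video/mov',
    '.avi': 'video/avi',
    '.webm': 'video/webm',
    '.pdf': 'application/pdf',
}


def mime_type_for(path):
    """Look up the MIME type for a file name or URI by its extension."""
    ext = Path(path).suffix.lower()
    try:
        return MIME_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f'No MIME type listed for extension {ext!r}') from None
```

The result can be passed as the mime_type argument of Part.from_bytes or Part.from_uri. Raising on unknown extensions, rather than guessing, keeps an unsupported file from reaching the API with the wrong type.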

File API Management

Manage uploaded files:
# Upload
file = client.files.upload(file='document.pdf')
print(f"Uploaded: {file.name}")

# Get file info
file_info = client.files.get(name=file.name)
print(f"State: {file_info.state}")
print(f"Size: {file_info.size_bytes} bytes")

# List all files
for f in client.files.list():
    print(f"{f.name}: {f.state}")

# Delete when done
client.files.delete(name=file.name)
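
If the code between upload and delete raises, the delete call above never runs and the file stays in storage. One way to guard against that is a context manager, sketched here under the assumption that the client exposes files.upload(file=...) and files.delete(name=...) as shown above. temporary_upload is a hypothetical helper name, not an SDK function.

```python
from contextlib import contextmanager


@contextmanager
def temporary_upload(client, path):
    """Upload a file, yield it to the caller, and delete it on exit.

    `client` is assumed to behave like the genai.Client used in this guide.
    """
    uploaded = client.files.upload(file=path)
    try:
        yield uploaded
    finally:
        # Runs on normal exit and when the body raises.
        client.files.delete(name=uploaded.name)
```

Usage would look like: with temporary_upload(client, 'document.pdf') as pdf_file: followed by the generate_content call in the body of the with block.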

Streaming with Multimodal Input

You can stream responses for multimodal inputs:
from google.genai import types

with open('image.jpg', 'rb') as f:
    image_bytes = f.read()

for chunk in client.models.generate_content_stream(
    model='gemini-2.5-flash',
    contents=[
        'Describe this image in detail',
        types.Part.from_bytes(data=image_bytes, mime_type='image/jpeg'),
    ],
):
    # Skip chunks that carry no text so "None" is never printed
    print(chunk.text or '', end='')
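
When the full response is needed after streaming finishes, the chunks can be accumulated while they are displayed. This is a minimal sketch, assuming each chunk has a text attribute that may be empty or None for chunks without text. collect_stream_text and its on_text callback are hypothetical names introduced for illustration.

```python
def collect_stream_text(chunks, on_text=None):
    """Join the text of streamed chunks, optionally reporting each piece as it arrives."""
    parts = []
    for chunk in chunks:
        text = getattr(chunk, 'text', None)
        if not text:
            # Skip chunks that carry no text.
            continue
        if on_text is not None:
            on_text(text)
        parts.append(text)
    return ''.join(parts)
```

Passing the iterator returned by generate_content_stream as chunks, with on_text=lambda t: print(t, end=''), would print incrementally and still return the complete text.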

Use Cases

Document Analysis

Extract insights from PDFs, images of documents, and scanned files

Video Understanding

Analyze video content, generate descriptions, and answer questions

Audio Transcription

Transcribe and analyze audio content, podcasts, and meetings

Visual Q&A

Answer questions about images, charts, and diagrams

Best Practices

  • Use Part.from_uri for large files or files already in cloud storage
  • Use Part.from_bytes for small files (under 20 MB) from the local filesystem
  • Use the File API for files that need preprocessing (video, long audio, PDFs)
  • Always specify the correct MIME type for your media
  • Check file state (PROCESSING, ACTIVE) before using uploaded files
  • Delete files after use to manage storage costs
  • Combine multiple modalities when relevant to your use case
  • For the Gemini Developer API, use the File API for all large files
  • For Vertex AI, you can use GCS URIs directly with Part.from_uri
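
The selection rules in the list above can be sketched as a small decision function. This is an illustration under stated assumptions, not SDK behavior: choose_input_method is a hypothetical helper, the 20 MB threshold is interpreted here as 20 * 1024 * 1024 bytes, and the caller decides whether a file needs preprocessing.

```python
# Assumed interpretation of the "20 MB" guideline for inline bytes.
INLINE_LIMIT_BYTES = 20 * 1024 * 1024


def choose_input_method(source, size_bytes=None, needs_preprocessing=False):
    """Pick an input method name following the best-practice bullets.

    Returns one of 'Part.from_uri', 'Part.from_bytes', or 'File API'.
    """
    # Files already in Cloud Storage: pass the URI (per the list, on Vertex AI).
    if source.startswith('gs://'):
        return 'Part.from_uri'
    if size_bytes is None:
        raise ValueError('size_bytes is required for local files')
    # Video, long audio, and PDFs go through the File API, as do large files.
    if needs_preprocessing or size_bytes >= INLINE_LIMIT_BYTES:
        return 'File API'
    # Small local files can be sent inline.
    return 'Part.from_bytes'
```

For a local file, size_bytes could come from os.path.getsize(path).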
