## Overview

Vision models can analyze images alongside text. LiteLLM provides a unified interface for vision capabilities across multiple providers, including OpenAI, Anthropic, Google, and more.
## Quick Start

```python
from litellm import completion

response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]
    }]
)

print(response.choices[0].message.content)
```
Reference images via URL:

```python
from litellm import completion

response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://example.com/photo.jpg"
                }
            }
        ]
    }]
)
```
Embed images as base64 data:

```python
import base64

from litellm import completion

# Read and encode the image
with open("image.jpg", "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode("utf-8")

response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{base64_image}"
                }
            }
        ]
    }]
)
```
LiteLLM can automatically convert local files referenced with a `file://` URL:

```python
from litellm import completion

response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Analyze this image"},
            {
                "type": "image_url",
                "image_url": {"url": "file://path/to/image.jpg"}
            }
        ]
    }]
)
```
## Multiple Images

Process multiple images in a single request:

```python
from litellm import completion

response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare these two images"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image1.jpg"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/image2.jpg"}}
        ]
    }]
)

print(response.choices[0].message.content)
```
## Provider Support

### OpenAI

GPT-4o and GPT-4 Turbo support vision:

```python
from litellm import completion

# GPT-4o: highest-quality vision
response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://..."}}
        ]
    }]
)

# GPT-4o-mini: faster and cheaper
response = completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": [...]}]
)
```
### Anthropic

Claude models with vision support:

```python
response = completion(
    model="anthropic/claude-3-5-sonnet-20240620",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": "https://..."}}
        ]
    }]
)
```
### Google Gemini

Gemini models with native multimodal support:

```python
response = completion(
    model="gemini/gemini-1.5-pro",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://..."}}
        ]
    }]
)
```
### Ollama

Local vision models such as LLaVA:

```python
response = completion(
    model="ollama/llava",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": "https://..."}}
        ]
    }],
    api_base="http://localhost:11434"
)
```
## Image Detail Level

Control how much image detail the model processes (OpenAI models):

```python
from litellm import completion

response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Analyze this image in detail"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://example.com/image.jpg",
                    "detail": "high"  # "low", "high", or "auto"
                }
            }
        ]
    }]
)
```

- `low`: faster, lower cost, less detail (512x512)
- `high`: slower, higher cost, more detail (up to 2048x2048)
- `auto`: the model decides based on image size
## Common Use Cases

### Image Description

```python
from litellm import completion

response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Provide a detailed description of this image"},
            {"type": "image_url", "image_url": {"url": "https://..."}}
        ]
    }]
)
```

### Object Detection

```python
response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "List all objects you can identify in this image"},
            {"type": "image_url", "image_url": {"url": "https://..."}}
        ]
    }]
)
```

### Image Comparison

```python
response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What are the differences between these images?"},
            {"type": "image_url", "image_url": {"url": "https://.../before.jpg"}},
            {"type": "image_url", "image_url": {"url": "https://.../after.jpg"}}
        ]
    }]
)
```

### Chart Analysis

```python
response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Analyze this chart and provide key insights"},
            {"type": "image_url", "image_url": {"url": "https://.../chart.png"}}
        ]
    }]
)
```
## Streaming with Vision

```python
from litellm import completion

response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in detail"},
            {"type": "image_url", "image_url": {"url": "https://..."}}
        ]
    }],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
## Multi-turn Vision Conversations

```python
from litellm import completion

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://.../room.jpg"}}
        ]
    }
]

# First response
response = completion(model="gpt-4o", messages=messages)
messages.append(response.choices[0].message)

# Follow-up question (the image stays in context)
messages.append({
    "role": "user",
    "content": "What color is the furniture?"
})
response = completion(model="gpt-4o", messages=messages)

print(response.choices[0].message.content)
```
## JSON Mode with Vision

Get structured output from image analysis:

```python
import json

from litellm import completion

response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Extract product information from this image as JSON with fields: name, price, description"
            },
            {"type": "image_url", "image_url": {"url": "https://.../product.jpg"}}
        ]
    }],
    response_format={"type": "json_object"}
)

data = json.loads(response.choices[0].message.content)
print(data)
```
## Error Handling

```python
from litellm import completion
from litellm.exceptions import APIError, BadRequestError

try:
    response = completion(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this"},
                {"type": "image_url", "image_url": {"url": "invalid-url"}}
            ]
        }]
    )
except BadRequestError as e:
    print(f"Invalid image URL or format: {e}")
except APIError as e:
    print(f"API error: {e}")
```
## Supported Image Formats

- JPEG: .jpg, .jpeg
- PNG: .png
- WebP: .webp
- GIF: .gif (non-animated)

Format support varies by provider. OpenAI and Anthropic support all common formats.
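The base64 example earlier hardcodes `image/jpeg` in the data URL; when sending other formats, the MIME type should match the file. A minimal sketch using the standard library (`encode_image` is a hypothetical helper, not part of LiteLLM):

```python
import base64
import mimetypes


def encode_image(path: str) -> str:
    """Build a data URL with the MIME type inferred from the file extension."""
    mime_type, _ = mimetypes.guess_type(path)  # e.g. "image/png" for .png
    if mime_type is None:
        mime_type = "image/jpeg"  # fall back to JPEG if the extension is unknown
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```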
## Image Size Limits

| Provider | Max Size | Max Resolution |
|----------|----------|----------------|
| OpenAI | 20MB | Varies by detail level |
| Anthropic | 5MB per image | 8000x8000 |
| Google Gemini | Varies | Varies by model |
| Ollama | Depends on model | Depends on model |
## Best Practices

Image quality:

- Use high-quality images for better results
- Ensure images are clear and well lit
- Crop images to focus on relevant content
- Use appropriate resolution (neither too low nor unnecessarily high)

Cost:

- Use `detail="low"` for simple image tasks
- Resize large images before sending (see the sketch after this list)
- Use URLs instead of base64 when possible
- Consider cheaper models for simple vision tasks

Prompting:

- Be specific about what you want to extract
- Ask direct questions about the image
- Provide examples when requesting specific output formats
- Break complex tasks into multiple queries
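A sketch of the resize recommendation, assuming Pillow is installed (the 2048px cap mirrors OpenAI's high-detail limit and is an assumption here, not a provider requirement):

```python
import base64
import io

from PIL import Image


def downscale_to_data_url(path: str, max_side: int = 2048) -> str:
    """Downscale an image so its longest side is at most max_side, then base64-encode it."""
    img = Image.open(path)
    img.thumbnail((max_side, max_side))  # preserves aspect ratio; no-op if already small
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=85)
    encoded = base64.b64encode(buf.getvalue()).decode("utf-8")
    return f"data:image/jpeg;base64,{encoded}"
```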
## Limitations

- Vision models may hallucinate details not present in images
- Text recognition (OCR) accuracy varies by model and image quality
- Some providers restrict certain image types
- Privacy: be cautious when sending sensitive images to third-party APIs