Overview

Vision models can analyze images alongside text. LiteLLM provides a unified interface for vision capabilities across multiple providers including OpenAI, Anthropic, Google, and more.

Quick Start

from litellm import completion

response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]
    }]
)

print(response.choices[0].message.content)

Image Input Methods

Image URL

Reference images via a publicly accessible URL.
from litellm import completion

response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://example.com/photo.jpg"
                }
            }
        ]
    }]
)
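
Base64 Encoded Images

Local files can also be sent inline as base64-encoded data URLs, the OpenAI-style format LiteLLM accepts. A minimal sketch (the file path is illustrative):
import base64
from litellm import completion

# Encode a local file as a data URL
with open("photo.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encoded}"}}
        ]
    }]
)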

Multiple Images

Process multiple images in a single request.
from litellm import completion

response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare these two images"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image1.jpg"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/image2.jpg"}}
        ]
    }]
)

print(response.choices[0].message.content)

Provider Support

OpenAI

GPT-4o and GPT-4 Turbo support vision.
from litellm import completion

# GPT-4o - OpenAI's flagship multimodal model
response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://..."}}
        ]
    }]
)

# GPT-4o-mini - Faster and cheaper
response = completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": [...]}]
)
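
The same message format works with other providers through LiteLLM's unified interface; the model identifiers below are illustrative, so check each provider's current model list:
from litellm import completion

content = [
    {"type": "text", "text": "What's in this image?"},
    {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]

# Anthropic Claude (model name illustrative)
response = completion(
    model="claude-3-5-sonnet-20240620",
    messages=[{"role": "user", "content": content}]
)

# Google Gemini via the gemini/ prefix (model name illustrative)
response = completion(
    model="gemini/gemini-1.5-pro",
    messages=[{"role": "user", "content": content}]
)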

Image Detail Level

Control how the model processes images (OpenAI models).
from litellm import completion

response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Analyze this image in detail"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://example.com/image.jpg",
                    "detail": "high"  # "low", "high", or "auto"
                }
            }
        ]
    }]
)
  • low: Faster, lower cost, less detail (512x512)
  • high: Slower, higher cost, more detail (2048x2048)
  • auto: Model decides based on image size
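
One way to see the trade-off is to compare prompt token counts for the same image at each detail level (a sketch; exact counts depend on image size):
from litellm import completion

for detail in ("low", "high"):
    response = completion(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/image.jpg", "detail": detail}
                }
            ]
        }]
    )
    # Higher detail consumes more prompt tokens for the image
    print(detail, response.usage.prompt_tokens)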

Common Use Cases

Generate a detailed description of an image.
from litellm import completion

response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Provide a detailed description of this image"},
            {"type": "image_url", "image_url": {"url": "https://..."}}
        ]
    }]
)

Streaming with Vision

Stream the response token by token; useful for long descriptions.
from litellm import completion

response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in detail"},
            {"type": "image_url", "image_url": {"url": "https://..."}}
        ]
    }],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Multi-turn Vision Conversations

Keep the image in the conversation history and ask follow-up questions.
from litellm import completion

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://.../room.jpg"}}
        ]
    }
]

# First response
response = completion(model="gpt-4o", messages=messages)
messages.append(response.choices[0].message)  # keep the assistant turn in context

# Follow-up question
messages.append({
    "role": "user",
    "content": "What color is the furniture?"
})

response = completion(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)

JSON Mode with Vision

Get structured output from image analysis.
from litellm import completion
import json

response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Extract product information from this image as JSON with fields: name, price, description"
            },
            {"type": "image_url", "image_url": {"url": "https://.../product.jpg"}}
        ]
    }],
    response_format={"type": "json_object"}
)

data = json.loads(response.choices[0].message.content)
print(data)

Error Handling

from litellm import completion
from litellm.exceptions import BadRequestError, APIError

try:
    response = completion(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this"},
                {"type": "image_url", "image_url": {"url": "invalid-url"}}
            ]
        }]
    )
except BadRequestError as e:
    print(f"Invalid image URL or format: {e}")
except APIError as e:
    print(f"API error: {e}")

Supported Image Formats

  • JPEG - .jpg, .jpeg
  • PNG - .png
  • WebP - .webp
  • GIF - .gif (non-animated)
Format support varies by provider. OpenAI and Anthropic support all common formats.
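
When encoding local files, the data URL's MIME type should match the actual format. A small helper using Python's standard mimetypes module (a sketch; the function name is ours):
import base64
import mimetypes

def to_data_url(path: str) -> str:
    # Guess the MIME type from the file extension (e.g. image/png)
    mime, _ = mimetypes.guess_type(path)
    if mime not in ("image/jpeg", "image/png", "image/webp", "image/gif"):
        raise ValueError(f"Unsupported image format: {mime}")
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"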

Image Size Limits

Provider          Max Size           Max Resolution
OpenAI            20MB               Varies by detail level
Anthropic         5MB per image      8000x8000
Google Gemini     Varies             Varies by model
Ollama            Depends on model   Depends on model
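
If an image may exceed a provider's size limit, downscaling it client-side avoids a rejected request. A sketch using Pillow (an assumed dependency; limits as in the table above):
from PIL import Image

def downscale(path: str, out_path: str, max_side: int = 2048) -> None:
    # Shrink the longest side to max_side while preserving aspect ratio
    img = Image.open(path)
    img.thumbnail((max_side, max_side))
    img.save(out_path)

downscale("large_photo.jpg", "resized.jpg")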

Best Practices

Image quality:
  • Use high-quality images for better results
  • Ensure images are clear and well-lit
  • Crop images to focus on relevant content
  • Use appropriate resolution (neither too low nor unnecessarily high)

Cost:
  • Use detail="low" for simple image tasks
  • Resize large images before sending
  • Use URLs instead of base64 when possible
  • Consider cheaper models for simple vision tasks

Prompting:
  • Be specific about what you want to extract
  • Ask direct questions about the image
  • Provide examples when requesting specific formats
  • Break complex tasks into multiple queries

Performance:
  • Cache images that are used multiple times
  • Use streaming for long descriptions
  • Process independent images in parallel (see the sketch after this list)
  • Consider batch processing for many images
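
For the parallel-processing point above, litellm.acompletion can analyze independent images concurrently. A minimal sketch (URLs illustrative):
import asyncio
from litellm import acompletion

async def describe(url: str) -> str:
    response = await acompletion(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image"},
                {"type": "image_url", "image_url": {"url": url}}
            ]
        }]
    )
    return response.choices[0].message.content

async def main():
    urls = ["https://example.com/a.jpg", "https://example.com/b.jpg"]
    descriptions = await asyncio.gather(*(describe(u) for u in urls))
    for url, text in zip(urls, descriptions):
        print(url, "->", text)

asyncio.run(main())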

Limitations

  • Vision models may hallucinate details not present in images
  • Text recognition (OCR) accuracy varies with image quality and text size
  • Some models have restrictions on certain image types
  • Privacy: Be cautious with sensitive images
