## Overview

Vision models can analyze images alongside text. LiteLLM provides a unified interface for vision capabilities across multiple providers, including OpenAI, Anthropic, Google, and more.
## Quick Start

```python
from litellm import completion

response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]
    }]
)

print(response.choices[0].message.content)
```
Reference images via URL:

```python
from litellm import completion

response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://example.com/photo.jpg"
                }
            }
        ]
    }]
)
```
Embed images as base64 data:

```python
import base64

from litellm import completion

# Read and encode the image
with open("image.jpg", "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode("utf-8")

response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{base64_image}"
                }
            }
        ]
    }]
)
```
LiteLLM can automatically convert local files referenced with a `file://` URL:

```python
from litellm import completion

response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Analyze this image"},
            {
                "type": "image_url",
                "image_url": {"url": "file://path/to/image.jpg"}
            }
        ]
    }]
)
```
## Multiple Images

Process multiple images in a single request:

```python
from litellm import completion

response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare these two images"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image1.jpg"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/image2.jpg"}}
        ]
    }]
)

print(response.choices[0].message.content)
```
## Provider Support

### OpenAI

GPT-4o and GPT-4 Turbo support vision:

```python
from litellm import completion

# GPT-4o: highest-quality vision
response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://..."}}
        ]
    }]
)

# GPT-4o-mini: faster and cheaper
response = completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": [...]}]
)
```
### Anthropic

Claude models with vision support:

```python
response = completion(
    model="anthropic/claude-3-5-sonnet-20240620",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": "https://..."}}
        ]
    }]
)
```
### Google Gemini

Gemini models with native multimodal support:

```python
response = completion(
    model="gemini/gemini-1.5-pro",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://..."}}
        ]
    }]
)
```
### Ollama

Local vision models such as LLaVA:

```python
response = completion(
    model="ollama/llava",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": "https://..."}}
        ]
    }],
    api_base="http://localhost:11434"
)
```
## Image Detail Level

Control how much image detail the model processes (OpenAI models):

```python
from litellm import completion

response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Analyze this image in detail"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://example.com/image.jpg",
                    "detail": "high"  # "low", "high", or "auto"
                }
            }
        ]
    }]
)
```

- `low`: faster, lower cost, less detail (512x512)
- `high`: slower, higher cost, more detail (up to 2048x2048)
- `auto`: the model decides based on image size
## Common Use Cases

### Image Description

```python
from litellm import completion

response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Provide a detailed description of this image"},
            {"type": "image_url", "image_url": {"url": "https://..."}}
        ]
    }]
)
```

### Object Detection

```python
response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "List all objects you can identify in this image"},
            {"type": "image_url", "image_url": {"url": "https://..."}}
        ]
    }]
)
```

### Image Comparison

```python
response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What are the differences between these images?"},
            {"type": "image_url", "image_url": {"url": "https://.../before.jpg"}},
            {"type": "image_url", "image_url": {"url": "https://.../after.jpg"}}
        ]
    }]
)
```

### Chart Analysis

```python
response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Analyze this chart and provide key insights"},
            {"type": "image_url", "image_url": {"url": "https://.../chart.png"}}
        ]
    }]
)
```
## Streaming with Vision

```python
from litellm import completion

response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in detail"},
            {"type": "image_url", "image_url": {"url": "https://..."}}
        ]
    }],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
## Multi-turn Vision Conversations

```python
from litellm import completion

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://.../room.jpg"}}
        ]
    }
]

# First response
response = completion(model="gpt-4o", messages=messages)
messages.append(response.choices[0].message)

# Follow-up question (the image stays in context)
messages.append({
    "role": "user",
    "content": "What color is the furniture?"
})
response = completion(model="gpt-4o", messages=messages)

print(response.choices[0].message.content)
```
## JSON Mode with Vision

Get structured output from image analysis:

```python
import json

from litellm import completion

response = completion(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Extract product information from this image as JSON with fields: name, price, description"
            },
            {"type": "image_url", "image_url": {"url": "https://.../product.jpg"}}
        ]
    }],
    response_format={"type": "json_object"}
)

data = json.loads(response.choices[0].message.content)
print(data)
```
## Error Handling

```python
from litellm import completion
from litellm.exceptions import APIError, BadRequestError

try:
    response = completion(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this"},
                {"type": "image_url", "image_url": {"url": "invalid-url"}}
            ]
        }]
    )
except BadRequestError as e:
    print(f"Invalid image URL or format: {e}")
except APIError as e:
    print(f"API error: {e}")
```
## Supported Image Formats

- JPEG: .jpg, .jpeg
- PNG: .png
- WebP: .webp
- GIF: .gif (non-animated)

Format support varies by provider. OpenAI and Anthropic support all common formats.
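The base64 example earlier hardcodes `image/jpeg` in the data URL; when sending other formats, the MIME type should match the file. A minimal sketch using the standard library (`encode_image` is a hypothetical helper, not part of LiteLLM):

```python
import base64
import mimetypes


def encode_image(path: str) -> str:
    """Build a data URL with the MIME type inferred from the file extension."""
    mime_type, _ = mimetypes.guess_type(path)  # e.g. "image/png" for .png
    if mime_type is None:
        mime_type = "image/jpeg"  # fall back to JPEG if the extension is unknown
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```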
## Image Size Limits

| Provider | Max Size | Max Resolution |
|----------|----------|----------------|
| OpenAI | 20MB | Varies by detail level |
| Anthropic | 5MB per image | 8000x8000 |
| Google Gemini | Varies | Varies by model |
| Ollama | Depends on model | Depends on model |
## Best Practices

Image quality:

- Use high-quality images for better results
- Ensure images are clear and well lit
- Crop images to focus on relevant content
- Use appropriate resolution (neither too low nor unnecessarily high)

Cost:

- Use `detail="low"` for simple image tasks
- Resize large images before sending (see the sketch after this list)
- Use URLs instead of base64 when possible
- Consider cheaper models for simple vision tasks

Prompting:

- Be specific about what you want to extract
- Ask direct questions about the image
- Provide examples when requesting specific output formats
- Break complex tasks into multiple queries
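A sketch of the resize recommendation, assuming Pillow is installed (the 2048px cap mirrors OpenAI's high-detail limit and is an assumption here, not a provider requirement):

```python
import base64
import io

from PIL import Image


def downscale_to_data_url(path: str, max_side: int = 2048) -> str:
    """Downscale an image so its longest side is at most max_side, then base64-encode it."""
    img = Image.open(path)
    img.thumbnail((max_side, max_side))  # preserves aspect ratio; no-op if already small
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=85)
    encoded = base64.b64encode(buf.getvalue()).decode("utf-8")
    return f"data:image/jpeg;base64,{encoded}"
```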
## Limitations

- Vision models may hallucinate details not present in images
- Text recognition (OCR) accuracy varies by model and image quality
- Some providers restrict certain image types
- Privacy: be cautious when sending sensitive images to third-party APIs