## Overview

The Portkey AI Gateway provides unified access to multi-modal capabilities across providers, allowing you to work with:

- **Vision**: Image understanding and analysis
- **Audio**: Text-to-speech and speech-to-text
- **Image Generation**: Creating images from text prompts
- **Document Analysis**: PDF, CSV, and document processing

All using the familiar OpenAI-compatible API signature.
## Vision (Image Understanding)

### Supported Providers

- OpenAI (GPT-4o, GPT-4o-mini, GPT-4 Turbo with Vision)
- Anthropic (Claude 3.5 Sonnet, Claude 3 Opus/Sonnet/Haiku)
- Google (Gemini 1.5 Pro/Flash, Gemini 2.0 Flash)
- Azure OpenAI
- Vertex AI
- Bedrock (Claude models)
### Usage

```python
from portkey_ai import Portkey

client = Portkey(
    api_key="PORTKEY_API_KEY",
    provider="openai",
    Authorization="sk-***"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.jpg"
                    }
                }
            ]
        }
    ]
)
```
### Base64 Images

You can also pass images as base64-encoded data:

```python
import base64

with open("image.jpg", "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                }
            ]
        }
    ]
)
```
### Multi-Provider Vision with Fallback

```json
{
  "strategy": { "mode": "fallback" },
  "targets": [
    {
      "provider": "openai",
      "api_key": "sk-***",
      "override_params": { "model": "gpt-4o" }
    },
    {
      "provider": "anthropic",
      "api_key": "sk-ant-***",
      "override_params": { "model": "claude-3-5-sonnet-20240620" }
    },
    {
      "provider": "google",
      "api_key": "***",
      "override_params": { "model": "gemini-1.5-pro" }
    }
  ]
}
```
The Gateway automatically transforms image content formats between providers, ensuring compatibility across OpenAI, Anthropic, Google, and AWS Bedrock.
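Because the Gateway performs that translation, the request body itself is provider-agnostic. A minimal sketch of reusing one OpenAI-format vision payload across the targets in the config above (the `vision_messages` helper is illustrative, not part of the SDK):

```python
# Illustrative helper: build an OpenAI-format vision message once and
# reuse it unchanged for every target provider.
def vision_messages(prompt: str, image_url: str) -> list:
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]

payload = vision_messages("What's in this image?", "https://example.com/image.jpg")

# The same payload works for each target; the Gateway rewrites the image
# block into the provider's native schema behind the scenes, e.g.:
# client.chat.completions.create(model="gpt-4o", messages=payload)
# client.chat.completions.create(model="claude-3-5-sonnet-20240620", messages=payload)
```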
## Audio

### Text-to-Speech (TTS)

#### Supported Providers

- OpenAI (tts-1, tts-1-hd)
- Azure OpenAI
- ElevenLabs (via custom integration)

```python
from portkey_ai import Portkey

client = Portkey(
    api_key="PORTKEY_API_KEY",
    provider="openai",
    Authorization="sk-***"
)

response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello! Welcome to Portkey AI Gateway."
)

# Save to file
with open("speech.mp3", "wb") as f:
    f.write(response.content)
```
### Speech-to-Text (Transcription)

#### Supported Providers

- OpenAI (whisper-1)
- Azure OpenAI
- Groq (whisper-large-v3)
- Deepgram (via custom integration)

```python
from portkey_ai import Portkey

client = Portkey(
    api_key="PORTKEY_API_KEY",
    provider="openai",
    Authorization="sk-***"
)

with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

print(transcript.text)
```
### Translation

Translate audio to English:

```python
with open("spanish_audio.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file
    )

print(translation.text)  # English translation
```
## Image Generation

### Supported Providers

- OpenAI (DALL-E 2, DALL-E 3)
- Azure OpenAI
- Stability AI (Stable Diffusion)
- Together AI
- Segmind
- Replicate

```python
from portkey_ai import Portkey

client = Portkey(
    api_key="PORTKEY_API_KEY",
    provider="openai",
    Authorization="sk-***"
)

response = client.images.generate(
    model="dall-e-3",
    prompt="A serene landscape with mountains at sunset",
    size="1024x1024",
    quality="standard",
    n=1
)

image_url = response.data[0].url
print(f"Generated image: {image_url}")
```
### Stability AI Example

```python
client = Portkey(
    api_key="PORTKEY_API_KEY",
    provider="stability-ai",
    Authorization="sk-***"
)

response = client.images.generate(
    model="stable-diffusion-xl-1024-v1-0",
    prompt="A futuristic city with flying cars",
    size="1024x1024"
)
```
### Fallback Between Image Providers

```json
{
  "strategy": { "mode": "fallback" },
  "targets": [
    {
      "provider": "openai",
      "api_key": "sk-***",
      "override_params": { "model": "dall-e-3" }
    },
    {
      "provider": "stability-ai",
      "api_key": "sk-***",
      "override_params": { "model": "stable-diffusion-xl-1024-v1-0" }
    }
  ]
}
```
When using fallbacks between different image generation providers, be aware that:

- Prompt interpretation may vary
- Style and output quality differ
- Some parameters may not be supported across providers
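To use a config like this from the Python SDK, pass it when constructing the client. A hedged sketch (the `config` parameter also accepts the ID of a config saved in the Portkey dashboard in place of the raw object):

```python
# Fallback config from above, expressed as a Python dict. Attaching it
# to the client is sketched in the comments below.
image_fallback_config = {
    "strategy": {"mode": "fallback"},
    "targets": [
        {
            "provider": "openai",
            "api_key": "sk-***",
            "override_params": {"model": "dall-e-3"},
        },
        {
            "provider": "stability-ai",
            "api_key": "sk-***",
            "override_params": {"model": "stable-diffusion-xl-1024-v1-0"},
        },
    ],
}

# client = Portkey(api_key="PORTKEY_API_KEY", config=image_fallback_config)
# client.images.generate(...)  # retried on Stability AI if OpenAI fails
```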
## Document Processing

Many vision models support document understanding:

### PDF Analysis

```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize this document"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "data:application/pdf;base64,..."
                    }
                }
            ]
        }
    ]
)
```
### Supported Document Types

- PDF documents
- CSV files
- Images (JPEG, PNG, GIF, WebP)
- Spreadsheets (provider-dependent)
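Whatever the document type, it reaches the model as a base64 data URL. A small illustrative helper (not part of the SDK) that builds one from a local file, guessing the MIME type from the extension:

```python
import base64
import mimetypes

def to_data_url(path: str) -> str:
    """Read a local file and return it as a data URL for an image_url block."""
    mime, _ = mimetypes.guess_type(path)
    mime = mime or "application/octet-stream"
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"

# Usage in a message content block:
# {"type": "image_url", "image_url": {"url": to_data_url("report.pdf")}}
```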
## Multi-Modal Routing Strategies

### Cost Optimization

Route to cost-effective models for simple tasks:

```json
{
  "strategy": { "mode": "conditional" },
  "conditions": [
    {
      "query": { "complexity": "simple" },
      "then": "cost-effective"
    },
    {
      "query": { "complexity": "complex" },
      "then": "high-quality"
    }
  ],
  "targets": [
    {
      "name": "cost-effective",
      "provider": "openai",
      "override_params": { "model": "gpt-4o-mini" }
    },
    {
      "name": "high-quality",
      "provider": "anthropic",
      "override_params": { "model": "claude-3-5-sonnet-20240620" }
    }
  ]
}
```
### Load Balancing for High Volume

```json
{
  "strategy": { "mode": "loadbalance" },
  "targets": [
    {
      "provider": "openai",
      "api_key": "sk-***-1",
      "weight": 0.6
    },
    {
      "provider": "google",
      "api_key": "***",
      "weight": 0.4
    }
  ]
}
```
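The weights set the long-run traffic split rather than a strict round-robin. A quick simulation of the weighted pick the strategy above implies (illustrative only; the Gateway performs this selection server-side):

```python
import random

# Targets and weights from the load-balance config above.
targets = [("openai", 0.6), ("google", 0.4)]

def pick_target(rng: random.Random) -> str:
    """Choose a target with probability proportional to its weight."""
    names, weights = zip(*targets)
    return rng.choices(names, weights=weights, k=1)[0]

# Over many requests the observed split approaches 60/40.
rng = random.Random(0)
counts = {"openai": 0, "google": 0}
for _ in range(10_000):
    counts[pick_target(rng)] += 1
```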
## Provider-Specific Features

### OpenAI Vision Detail Levels

```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Analyze in detail"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.jpg",
                        "detail": "high"  # low, high, auto
                    }
                }
            ]
        }
    ]
)
```
### Anthropic PDF Support

Claude models support direct PDF processing:

```python
import base64

with open("document.pdf", "rb") as pdf_file:
    pdf_data = base64.b64encode(pdf_file.read()).decode("utf-8")

response = client.chat.completions.create(
    model="claude-3-5-sonnet-20240620",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize this PDF"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:application/pdf;base64,{pdf_data}"
                    }
                }
            ]
        }
    ]
)
```
## Best Practices

- **Optimize image sizes**: Resize images before sending to reduce latency and costs. Most models work well with images under 2 MB.
- **Use appropriate detail levels**: For OpenAI models, use `detail: "low"` for simple tasks and `detail: "high"` for complex analysis to balance cost and accuracy.
- **Cache multi-modal requests**: Enable caching for repeated multi-modal requests to significantly reduce costs.
- **Handle timeouts appropriately**: Multi-modal requests take longer; set higher timeouts (30-60 s) for vision and audio processing.
- **Test across providers**: Different providers excel at different multi-modal tasks; test to find the best fit for your use case.
- **Monitor spend**: Multi-modal requests are more expensive; monitor usage and set up cost alerts in the Gateway dashboard.
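Caching in particular lives in the same config object used for routing. A hedged sketch of a cache block; the field names (`mode`, `max_age` in seconds) follow the Gateway config schema as an assumption, so verify them against the current config reference:

```python
# Assumed cache block shape: "mode" ("simple" or "semantic") and
# "max_age" in seconds. Attach the config to the client the same way
# as a routing config.
cached_vision_config = {
    "cache": {"mode": "simple", "max_age": 3600},
    "targets": [
        {
            "provider": "openai",
            "api_key": "sk-***",
            "override_params": {"model": "gpt-4o"},
        }
    ],
}

# client = Portkey(api_key="PORTKEY_API_KEY", config=cached_vision_config)
```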
## Common Use Cases

### Image Analysis

- Product catalog analysis
- Medical image interpretation
- Document OCR and extraction
- Visual quality control

### Audio Processing

- Meeting transcriptions
- Podcast summaries
- Voice command processing
- Multi-language translation

### Image Generation

- Marketing content creation
- Product visualization
- UI/UX mockups
- Creative artwork
- **Streaming**: Stream multi-modal responses
- **Fallbacks**: Fall back between vision providers
- **Caching**: Cache expensive multi-modal requests
- **Providers**: Explore all supported providers