
Overview

The Portkey AI Gateway provides unified access to multi-modal capabilities across providers, allowing you to work with:
  • Vision: Image understanding and analysis
  • Audio: Text-to-speech and speech-to-text
  • Image Generation: Creating images from text prompts
  • Document Analysis: PDF, CSV, and document processing
All of this uses the familiar OpenAI-compatible API signature.

Vision (Image Understanding)

Supported Providers

  • OpenAI (GPT-4o, GPT-4o-mini, GPT-4 Turbo with Vision)
  • Anthropic (Claude 3.5 Sonnet, Claude 3 Opus/Sonnet/Haiku)
  • Google (Gemini 1.5 Pro/Flash, Gemini 2.0 Flash)
  • Azure OpenAI
  • Vertex AI
  • Bedrock (Claude models)

Usage

from portkey_ai import Portkey

client = Portkey(
    api_key="PORTKEY_API_KEY",
    provider="openai",
    Authorization="sk-***"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.jpg"
                    }
                }
            ]
        }
    ]
)

Base64 Images

You can also pass images as base64-encoded data:
import base64

with open("image.jpg", "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                }
            ]
        }
    ]
)

Multi-Provider Vision with Fallback

{
  "strategy": { "mode": "fallback" },
  "targets": [
    {
      "provider": "openai",
      "api_key": "sk-***",
      "override_params": { "model": "gpt-4o" }
    },
    {
      "provider": "anthropic",
      "api_key": "sk-ant-***",
      "override_params": { "model": "claude-3-5-sonnet-20240620" }
    },
    {
      "provider": "google",
      "api_key": "***",
      "override_params": { "model": "gemini-1.5-pro" }
    }
  ]
}
The Gateway automatically transforms image content formats between providers, ensuring compatibility across OpenAI, Anthropic, Google, and AWS Bedrock.
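Conceptually, that transformation is a per-provider rewrite of each content part. The sketch below illustrates the idea for OpenAI-to-Anthropic image parts; it is a simplified illustration of the behavior, not the Gateway's actual code:

```python
def openai_image_part_to_anthropic(part: dict) -> dict:
    """Convert an OpenAI-style image content part to Anthropic's native format."""
    url = part["image_url"]["url"]
    if url.startswith("data:"):
        # e.g. "data:image/jpeg;base64,<payload>"
        header, payload = url.split(",", 1)
        media_type = header[len("data:"):header.index(";")]
        return {
            "type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": payload},
        }
    # Plain HTTP(S) URLs map to Anthropic's URL source type
    return {"type": "image", "source": {"type": "url", "url": url}}

part = {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}}
print(openai_image_part_to_anthropic(part))
```

Because the Gateway does this for you, the same OpenAI-style request body works against every target in the fallback chain.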

Audio

Text-to-Speech (TTS)

Supported Providers

  • OpenAI (tts-1, tts-1-hd)
  • Azure OpenAI
  • ElevenLabs (via custom integration)
from portkey_ai import Portkey

client = Portkey(
    api_key="PORTKEY_API_KEY",
    provider="openai",
    Authorization="sk-***"
)

response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello! Welcome to Portkey AI Gateway."
)

# Save to file
with open("speech.mp3", "wb") as f:
    f.write(response.content)

Speech-to-Text (Transcription)

Supported Providers

  • OpenAI (whisper-1)
  • Azure OpenAI
  • Groq (whisper-large-v3)
  • Deepgram (via custom integration)
from portkey_ai import Portkey

client = Portkey(
    api_key="PORTKEY_API_KEY",
    provider="openai",
    Authorization="sk-***"
)

with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

print(transcript.text)

Translation

Translate audio to English:
with open("spanish_audio.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file
    )

print(translation.text)  # English translation

Image Generation

Supported Providers

  • OpenAI (DALL-E 2, DALL-E 3)
  • Azure OpenAI
  • Stability AI (Stable Diffusion)
  • Together AI
  • Segmind
  • Replicate
from portkey_ai import Portkey

client = Portkey(
    api_key="PORTKEY_API_KEY",
    provider="openai",
    Authorization="sk-***"
)

response = client.images.generate(
    model="dall-e-3",
    prompt="A serene landscape with mountains at sunset",
    size="1024x1024",
    quality="standard",
    n=1
)

image_url = response.data[0].url
print(f"Generated image: {image_url}")

Stability AI Example

client = Portkey(
    api_key="PORTKEY_API_KEY",
    provider="stability-ai",
    Authorization="sk-***"
)

response = client.images.generate(
    model="stable-diffusion-xl-1024-v1-0",
    prompt="A futuristic city with flying cars",
    size="1024x1024"
)

Fallback Between Image Providers

{
  "strategy": { "mode": "fallback" },
  "targets": [
    {
      "provider": "openai",
      "api_key": "sk-***",
      "override_params": { "model": "dall-e-3" }
    },
    {
      "provider": "stability-ai",
      "api_key": "sk-***",
      "override_params": { "model": "stable-diffusion-xl-1024-v1-0" }
    }
  ]
}
When using fallbacks between different image generation providers, be aware that:
  • Prompt interpretation may vary
  • Style and output quality differ
  • Some parameters may not be supported across providers
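One way to handle the last point in your own code is to strip unsupported parameters before dispatching to a fallback target. The provider names and supported-key sets below are illustrative assumptions, not Portkey's actual capability tables:

```python
# Hypothetical per-provider parameter allowlists (illustrative only)
SUPPORTED_PARAMS = {
    "openai": {"model", "prompt", "size", "quality", "style", "n"},
    "stability-ai": {"model", "prompt", "size", "n"},
}

def filter_params(provider: str, params: dict) -> dict:
    """Drop request parameters the target provider does not understand."""
    allowed = SUPPORTED_PARAMS[provider]
    return {k: v for k, v in params.items() if k in allowed}

request = {"model": "dall-e-3", "prompt": "A lighthouse", "quality": "hd", "n": 1}
print(filter_params("stability-ai", request))  # "quality" is dropped
```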

Document Processing

Many vision models support document understanding:

PDF Analysis

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize this document"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "data:application/pdf;base64,..."
                    }
                }
            ]
        }
    ]
)

Supported Document Types

  • PDF documents
  • CSV files
  • Images (JPEG, PNG, GIF, WebP)
  • Spreadsheets (provider-dependent)
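For any of these file types, the payload is the same shape: a base64 data: URL in an image_url content part. A small convenience helper (the function name is our own, not part of the SDK) can build it from a local file:

```python
import base64
import mimetypes
from pathlib import Path

def to_data_url(path: str) -> str:
    """Encode a local file as a data: URL for use in an image_url content part."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        mime = "application/octet-stream"  # fall back when the type is unknown
    payload = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return f"data:{mime};base64,{payload}"

# Usage: {"type": "image_url", "image_url": {"url": to_data_url("report.pdf")}}
```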

Multi-Modal Routing Strategies

Cost Optimization

Route to cost-effective models for simple tasks:
{
  "strategy": {
    "mode": "conditional",
    "conditions": [
      {
        "query": { "metadata.complexity": { "$eq": "simple" } },
        "then": "cost-effective"
      },
      {
        "query": { "metadata.complexity": { "$eq": "complex" } },
        "then": "high-quality"
      }
    ],
    "default": "cost-effective"
  },
  "targets": [
    {
      "name": "cost-effective",
      "provider": "openai",
      "override_params": { "model": "gpt-4o-mini" }
    },
    {
      "name": "high-quality",
      "provider": "anthropic",
      "override_params": { "model": "claude-3-5-sonnet-20240620" }
    }
  ]
}
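Under the hood, conditional routing is a first-match lookup over the request's metadata. The sketch below is an equality-only toy matcher to show the idea (the Gateway's real query syntax supports operators such as $eq, and this is not its actual routing engine):

```python
def select_target(config: dict, metadata: dict) -> dict:
    """Return the first target whose condition query matches the request metadata."""
    targets = {t["name"]: t for t in config["targets"]}
    for cond in config["conditions"]:
        if all(metadata.get(k) == v for k, v in cond["query"].items()):
            return targets[cond["then"]]
    raise ValueError("no condition matched and no default target is configured")

config = {
    "conditions": [
        {"query": {"complexity": "simple"}, "then": "cost-effective"},
        {"query": {"complexity": "complex"}, "then": "high-quality"},
    ],
    "targets": [
        {"name": "cost-effective", "provider": "openai"},
        {"name": "high-quality", "provider": "anthropic"},
    ],
}
print(select_target(config, {"complexity": "complex"})["provider"])  # anthropic
```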

Load Balancing for High Volume

{
  "strategy": { "mode": "loadbalance" },
  "targets": [
    {
      "provider": "openai",
      "api_key": "sk-***-1",
      "weight": 0.6
    },
    {
      "provider": "google",
      "api_key": "***",
      "weight": 0.4
    }
  ]
}
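The weights behave like a weighted random draw per request: with the config above, roughly 60% of traffic goes to OpenAI and 40% to Google. A minimal sketch of that selection logic (illustrative only, not the Gateway's implementation):

```python
import random

def pick_target(targets: list) -> dict:
    """Pick one target at random, proportionally to its configured weight."""
    weights = [t["weight"] for t in targets]
    return random.choices(targets, weights=weights, k=1)[0]

targets = [
    {"provider": "openai", "weight": 0.6},
    {"provider": "google", "weight": 0.4},
]
counts = {"openai": 0, "google": 0}
for _ in range(10_000):
    counts[pick_target(targets)["provider"]] += 1
print(counts)  # roughly a 6000 / 4000 split
```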

Provider-Specific Features

OpenAI Vision Detail Levels

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Analyze in detail"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.jpg",
                        "detail": "high"  # low, high, auto
                    }
                }
            ]
        }
    ]
)

Anthropic PDF Support

Claude models support direct PDF processing:
with open("document.pdf", "rb") as pdf_file:
    pdf_data = base64.b64encode(pdf_file.read()).decode('utf-8')

response = client.chat.completions.create(
    model="claude-3-5-sonnet-20240620",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize this PDF"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:application/pdf;base64,{pdf_data}"
                    }
                }
            ]
        }
    ]
)

Best Practices

  • Resize images before sending to reduce latency and costs; most models work well with images under 2 MB.
  • For OpenAI models, use detail: "low" for simple tasks and detail: "high" for complex analysis to balance cost and accuracy.
  • Enable caching for repeated multi-modal requests to cut costs significantly.
  • Multi-modal requests take longer than text-only ones; set higher timeouts (30-60s) for vision and audio processing.
  • Different providers excel at different multi-modal tasks; test to find the best fit for your use case.
  • Multi-modal requests are more expensive than text-only ones; monitor usage and set up cost alerts in the Gateway dashboard.
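As a rough guide to the detail-level cost trade-off, OpenAI has documented a tiling formula for vision input tokens on gpt-4o / gpt-4-turbo class models. Treat this sketch as an estimate (other models apply different multipliers) and verify against OpenAI's current pricing docs:

```python
import math

def vision_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate input tokens for one image, per OpenAI's documented tiling method."""
    if detail == "low":
        return 85  # flat cost: the image is processed as a single thumbnail
    w, h = float(width), float(height)
    # Scale to fit within 2048x2048, then scale the shortest side down to 768px
    if max(w, h) > 2048:
        s = 2048 / max(w, h)
        w, h = w * s, h * s
    if min(w, h) > 768:
        s = 768 / min(w, h)
        w, h = w * s, h * s
    # 170 tokens per 512x512 tile, plus a fixed 85-token base
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(vision_tokens(1024, 1024, "low"))   # 85
print(vision_tokens(1024, 1024, "high"))  # 765 (4 tiles)
```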

Common Use Cases

Image Analysis

  • Product catalog analysis
  • Medical image interpretation
  • Document OCR and extraction
  • Visual quality control

Audio Processing

  • Meeting transcriptions
  • Podcast summaries
  • Voice command processing
  • Multi-language translation

Image Generation

  • Marketing content creation
  • Product visualization
  • UI/UX mockups
  • Creative artwork

Related

  • Streaming: Stream multi-modal responses
  • Fallbacks: Fall back between vision providers
  • Caching: Cache expensive multi-modal requests
  • Providers: Explore all supported providers
