
Overview

The Portkey AI Gateway provides unified access to multi-modal capabilities across providers, allowing you to work with:
  • Vision: Image understanding and analysis
  • Audio: Text-to-speech and speech-to-text
  • Image Generation: Creating images from text prompts
  • Document Analysis: PDF, CSV, and document processing
All of this uses the familiar OpenAI-compatible API signature.

Vision (Image Understanding)

Supported Providers

  • OpenAI (GPT-4o, GPT-4o-mini, GPT-4 Turbo with Vision)
  • Anthropic (Claude 3.5 Sonnet, Claude 3 Opus/Sonnet/Haiku)
  • Google (Gemini 1.5 Pro/Flash, Gemini 2.0 Flash)
  • Azure OpenAI
  • Vertex AI
  • Bedrock (Claude models)

Usage

from portkey_ai import Portkey

client = Portkey(
    api_key="PORTKEY_API_KEY",
    provider="openai",
    Authorization="sk-***"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.jpg"
                    }
                }
            ]
        }
    ]
)

Base64 Images

You can also pass images as base64-encoded data:
import base64

with open("image.jpg", "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                }
            ]
        }
    ]
)

Multi-Provider Vision with Fallback

{
  "strategy": { "mode": "fallback" },
  "targets": [
    {
      "provider": "openai",
      "api_key": "sk-***",
      "override_params": { "model": "gpt-4o" }
    },
    {
      "provider": "anthropic",
      "api_key": "sk-ant-***",
      "override_params": { "model": "claude-3-5-sonnet-20240620" }
    },
    {
      "provider": "google",
      "api_key": "***",
      "override_params": { "model": "gemini-1.5-pro" }
    }
  ]
}
The Gateway automatically transforms image content formats between providers, ensuring compatibility across OpenAI, Anthropic, Google, and AWS Bedrock.
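Conceptually, that transformation is a per-provider rewrite of each content part. The sketch below illustrates the idea for OpenAI-to-Anthropic image parts; it is a simplified illustration of the behavior, not the Gateway's actual code:

```python
def openai_image_part_to_anthropic(part: dict) -> dict:
    """Convert an OpenAI-style image content part to Anthropic's native format."""
    url = part["image_url"]["url"]
    if url.startswith("data:"):
        # e.g. "data:image/jpeg;base64,<payload>"
        header, payload = url.split(",", 1)
        media_type = header[len("data:"):header.index(";")]
        return {
            "type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": payload},
        }
    # Plain HTTP(S) URLs map to Anthropic's URL source type
    return {"type": "image", "source": {"type": "url", "url": url}}

part = {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}}
print(openai_image_part_to_anthropic(part))
```

Because the Gateway does this for you, the same OpenAI-style request body works against every target in the fallback chain.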

Audio

Text-to-Speech (TTS)

Supported Providers

  • OpenAI (tts-1, tts-1-hd)
  • Azure OpenAI
  • ElevenLabs (via custom integration)
from portkey_ai import Portkey

client = Portkey(
    api_key="PORTKEY_API_KEY",
    provider="openai",
    Authorization="sk-***"
)

response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello! Welcome to Portkey AI Gateway."
)

# Save to file
with open("speech.mp3", "wb") as f:
    f.write(response.content)

Speech-to-Text (Transcription)

Supported Providers

  • OpenAI (whisper-1)
  • Azure OpenAI
  • Groq (whisper-large-v3)
  • Deepgram (via custom integration)
from portkey_ai import Portkey

client = Portkey(
    api_key="PORTKEY_API_KEY",
    provider="openai",
    Authorization="sk-***"
)

with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

print(transcript.text)

Translation

Translate audio to English:
with open("spanish_audio.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file
    )

print(translation.text)  # English translation

Image Generation

Supported Providers

  • OpenAI (DALL-E 2, DALL-E 3)
  • Azure OpenAI
  • Stability AI (Stable Diffusion)
  • Together AI
  • Segmind
  • Replicate
from portkey_ai import Portkey

client = Portkey(
    api_key="PORTKEY_API_KEY",
    provider="openai",
    Authorization="sk-***"
)

response = client.images.generate(
    model="dall-e-3",
    prompt="A serene landscape with mountains at sunset",
    size="1024x1024",
    quality="standard",
    n=1
)

image_url = response.data[0].url
print(f"Generated image: {image_url}")

Stability AI Example

client = Portkey(
    api_key="PORTKEY_API_KEY",
    provider="stability-ai",
    Authorization="sk-***"
)

response = client.images.generate(
    model="stable-diffusion-xl-1024-v1-0",
    prompt="A futuristic city with flying cars",
    size="1024x1024"
)

Fallback Between Image Providers

{
  "strategy": { "mode": "fallback" },
  "targets": [
    {
      "provider": "openai",
      "api_key": "sk-***",
      "override_params": { "model": "dall-e-3" }
    },
    {
      "provider": "stability-ai",
      "api_key": "sk-***",
      "override_params": { "model": "stable-diffusion-xl-1024-v1-0" }
    }
  ]
}
When using fallbacks between different image generation providers, be aware that:
  • Prompt interpretation may vary
  • Style and output quality differ
  • Some parameters may not be supported across providers
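One way to handle the last point in your own code is to strip unsupported parameters before dispatching to a fallback target. The provider names and supported-key sets below are illustrative assumptions, not Portkey's actual capability tables:

```python
# Hypothetical per-provider parameter allowlists (illustrative only)
SUPPORTED_PARAMS = {
    "openai": {"model", "prompt", "size", "quality", "style", "n"},
    "stability-ai": {"model", "prompt", "size", "n"},
}

def filter_params(provider: str, params: dict) -> dict:
    """Drop request parameters the target provider does not understand."""
    allowed = SUPPORTED_PARAMS[provider]
    return {k: v for k, v in params.items() if k in allowed}

request = {"model": "dall-e-3", "prompt": "A lighthouse", "quality": "hd", "n": 1}
print(filter_params("stability-ai", request))  # "quality" is dropped
```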

Document Processing

Many vision models support document understanding:

PDF Analysis

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize this document"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "data:application/pdf;base64,..."
                    }
                }
            ]
        }
    ]
)

Supported Document Types

  • PDF documents
  • CSV files
  • Images (JPEG, PNG, GIF, WebP)
  • Spreadsheets (provider-dependent)
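For any of these file types, the payload is the same shape: a base64 data: URL in an image_url content part. A small convenience helper (the function name is our own, not part of the SDK) can build it from a local file:

```python
import base64
import mimetypes
from pathlib import Path

def to_data_url(path: str) -> str:
    """Encode a local file as a data: URL for use in an image_url content part."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        mime = "application/octet-stream"  # fall back when the type is unknown
    payload = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return f"data:{mime};base64,{payload}"

# Usage: {"type": "image_url", "image_url": {"url": to_data_url("report.pdf")}}
```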

Multi-Modal Routing Strategies

Cost Optimization

Route to cost-effective models for simple tasks:
{
  "strategy": {
    "mode": "conditional",
    "conditions": [
      {
        "query": { "metadata.complexity": { "$eq": "simple" } },
        "then": "cost-effective"
      },
      {
        "query": { "metadata.complexity": { "$eq": "complex" } },
        "then": "high-quality"
      }
    ],
    "default": "cost-effective"
  },
  "targets": [
    {
      "name": "cost-effective",
      "provider": "openai",
      "override_params": { "model": "gpt-4o-mini" }
    },
    {
      "name": "high-quality",
      "provider": "anthropic",
      "override_params": { "model": "claude-3-5-sonnet-20240620" }
    }
  ]
}
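Under the hood, conditional routing is a first-match lookup over the request's metadata. The sketch below is an equality-only toy matcher to show the idea (the Gateway's real query syntax supports operators such as $eq, and this is not its actual routing engine):

```python
def select_target(config: dict, metadata: dict) -> dict:
    """Return the first target whose condition query matches the request metadata."""
    targets = {t["name"]: t for t in config["targets"]}
    for cond in config["conditions"]:
        if all(metadata.get(k) == v for k, v in cond["query"].items()):
            return targets[cond["then"]]
    raise ValueError("no condition matched and no default target is configured")

config = {
    "conditions": [
        {"query": {"complexity": "simple"}, "then": "cost-effective"},
        {"query": {"complexity": "complex"}, "then": "high-quality"},
    ],
    "targets": [
        {"name": "cost-effective", "provider": "openai"},
        {"name": "high-quality", "provider": "anthropic"},
    ],
}
print(select_target(config, {"complexity": "complex"})["provider"])  # anthropic
```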

Load Balancing for High Volume

{
  "strategy": { "mode": "loadbalance" },
  "targets": [
    {
      "provider": "openai",
      "api_key": "sk-***-1",
      "weight": 0.6
    },
    {
      "provider": "google",
      "api_key": "***",
      "weight": 0.4
    }
  ]
}
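The weights behave like a weighted random draw per request: with the config above, roughly 60% of traffic goes to OpenAI and 40% to Google. A minimal sketch of that selection logic (illustrative only, not the Gateway's implementation):

```python
import random

def pick_target(targets: list) -> dict:
    """Pick one target at random, proportionally to its configured weight."""
    weights = [t["weight"] for t in targets]
    return random.choices(targets, weights=weights, k=1)[0]

targets = [
    {"provider": "openai", "weight": 0.6},
    {"provider": "google", "weight": 0.4},
]
counts = {"openai": 0, "google": 0}
for _ in range(10_000):
    counts[pick_target(targets)["provider"]] += 1
print(counts)  # roughly a 6000 / 4000 split
```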

Provider-Specific Features

OpenAI Vision Detail Levels

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Analyze in detail"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.jpg",
                        "detail": "high"  # low, high, auto
                    }
                }
            ]
        }
    ]
)

Anthropic PDF Support

Claude models support direct PDF processing:
with open("document.pdf", "rb") as pdf_file:
    pdf_data = base64.b64encode(pdf_file.read()).decode('utf-8')

response = client.chat.completions.create(
    model="claude-3-5-sonnet-20240620",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize this PDF"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:application/pdf;base64,{pdf_data}"
                    }
                }
            ]
        }
    ]
)

Best Practices

  • Resize images before sending to reduce latency and costs; most models work well with images under 2 MB.
  • For OpenAI models, use detail: "low" for simple tasks and detail: "high" for complex analysis to balance cost and accuracy.
  • Enable caching for repeated multi-modal requests to cut costs significantly.
  • Multi-modal requests take longer than text-only ones; set higher timeouts (30-60s) for vision and audio processing.
  • Different providers excel at different multi-modal tasks; test to find the best fit for your use case.
  • Multi-modal requests are more expensive than text-only ones; monitor usage and set up cost alerts in the Gateway dashboard.
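As a rough guide to the detail-level cost trade-off, OpenAI has documented a tiling formula for vision input tokens on gpt-4o / gpt-4-turbo class models. Treat this sketch as an estimate (other models apply different multipliers) and verify against OpenAI's current pricing docs:

```python
import math

def vision_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate input tokens for one image, per OpenAI's documented tiling method."""
    if detail == "low":
        return 85  # flat cost: the image is processed as a single thumbnail
    w, h = float(width), float(height)
    # Scale to fit within 2048x2048, then scale the shortest side down to 768px
    if max(w, h) > 2048:
        s = 2048 / max(w, h)
        w, h = w * s, h * s
    if min(w, h) > 768:
        s = 768 / min(w, h)
        w, h = w * s, h * s
    # 170 tokens per 512x512 tile, plus a fixed 85-token base
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(vision_tokens(1024, 1024, "low"))   # 85
print(vision_tokens(1024, 1024, "high"))  # 765 (4 tiles)
```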

Common Use Cases

Image Analysis

  • Product catalog analysis
  • Medical image interpretation
  • Document OCR and extraction
  • Visual quality control

Audio Processing

  • Meeting transcriptions
  • Podcast summaries
  • Voice command processing
  • Multi-language translation

Image Generation

  • Marketing content creation
  • Product visualization
  • UI/UX mockups
  • Creative artwork

Related

  • Streaming: Stream multi-modal responses
  • Fallbacks: Fall back between vision providers
  • Caching: Cache expensive multi-modal requests
  • Providers: Explore all supported providers
