
Multimodal Inputs

BAML supports multimodal inputs including images, audio, video, and PDFs. This guide shows you how to use these media types in your prompts.

Supported Media Types

  • Images - PNG, JPEG, GIF, WebP
  • Audio - MP3, WAV, OGG, and other formats
  • Video - MP4, WebM, and other formats (model-dependent)
  • PDFs - Document analysis (requires compatible models)

Image Inputs

Define functions that accept image inputs:
function DescribeImage(img: image) -> string {
  client "openai/gpt-5"  // GPT-5 has excellent multimodal support
  prompt #"
    {{_.role("user")}}
    Describe this image: {{ img }}
  "#
}

test TestImage {
  functions [DescribeImage]
  args {
    img {
      url "https://upload.wikimedia.org/wikipedia/en/4/4d/Shrek_%28character%29.png"
    }
  }
}
Most LLM providers require images or audio to be sent as “user” messages. Use {{_.role("user")}} before media content.

Calling with Images

from baml_py import Image
from baml_client import b

async def test_image():
    # From URL
    res = await b.DescribeImage(
        img=Image.from_url(
            "https://upload.wikimedia.org/wikipedia/en/4/4d/Shrek_%28character%29.png"
        )
    )

    # Base64 image
    image_b64 = "iVBORw0K...."
    res = await b.DescribeImage(
        img=Image.from_base64("image/png", image_b64)
    )

Audio Inputs

Process audio files for transcription or analysis:
function TranscribeAudio(audio: audio) -> string {
  client "openai/whisper-1"
  prompt #"
    {{_.role("user")}}
    Transcribe this audio: {{ audio }}
  "#
}

Calling with Audio

from baml_py import Audio
from baml_client import b

async def test_audio():
    # From URL
    res = await b.TranscribeAudio(
        audio=Audio.from_url(
            "https://actions.google.com/sounds/v1/emergency/beeper_emergency_call.ogg"
        )
    )

    # Base64
    b64 = "T2dnUw...."
    res = await b.TranscribeAudio(
        audio=Audio.from_base64("audio/ogg", b64)
    )

PDF Inputs

Analyze PDF documents with compatible models:
function AnalyzePDF(doc: pdf) -> DocumentSummary {
  client "google-ai/gemini-2.0-flash-exp"  // Gemini supports PDFs
  prompt #"
    {{_.role("user")}}
    Analyze this document and extract key information: {{ doc }}
    {{ ctx.output_format }}
  "#
}

class DocumentSummary {
  title string
  main_points string[]
  summary string
}
PDF Limitations:
  • PDFs must be provided as Base64 data (URL-based PDFs not supported)
  • Only models that explicitly support PDF modality work (e.g., Gemini 2.x Flash/Pro, VertexAI Gemini)
  • Ensure your selected model advertises PDF support

Calling with PDFs

from baml_py import Pdf
from baml_client import b

async def test_pdf():
    # Base64 data only
    b64 = "JVBERi0K...."
    res = await b.AnalyzePDF(
        doc=Pdf.from_base64("application/pdf", b64)
    )

Video Inputs

Analyze video content with video-capable models:
function DescribeVideo(video: video) -> VideoAnalysis {
  client "google-ai/gemini-2.0-flash-exp"  // Gemini supports video
  prompt #"
    {{_.role("user")}}
    Analyze this video and describe what happens: {{ video }}
    {{ ctx.output_format }}
  "#
}

class VideoAnalysis {
  scene_description string
  key_events string[]
  detected_objects string[]
}
Video Requirements:
  • Requires models that support video understanding (e.g., Gemini 2.x Flash/Pro)
  • When using URLs, the URL is forwarded to the model unchanged
  • If the model can’t fetch remote content, use Video.from_base64 instead

Calling with Videos

from baml_py import Video
from baml_client import b

async def test_video():
    # From URL
    res = await b.DescribeVideo(
        video=Video.from_url("https://example.com/sample.mp4")
    )

    # Base64
    b64 = "AAAAGGZ0eXBpc29t...."
    res = await b.DescribeVideo(
        video=Video.from_base64("video/mp4", b64)
    )

Controlling URL Resolution

Customize how BAML handles media URLs using media_url_handler:

Example: Optimizing for Performance

client<llm> FastClaude {
  provider anthropic
  options {
    model "claude-3-5-sonnet-20241022"
    api_key env.ANTHROPIC_API_KEY
    media_url_handler {
      image "send_url"       // Anthropic can fetch URLs directly
      pdf "send_base64"      // Required by Anthropic API
    }
  }
}

Example: Working with Google Cloud Storage

client<llm> GeminiWithGCS {
  provider google-ai
  options {
    model "gemini-1.5-pro"
    api_key env.GOOGLE_API_KEY
    media_url_handler {
      image "send_base64_unless_google_url"  // Preserve gs:// URLs, convert others
    }
  }
}

Example: Maximum Compatibility

client<llm> CompatibleClient {
  provider openai
  options {
    model "gpt-4o"
    api_key env.OPENAI_API_KEY
    media_url_handler {
      image "send_base64"    // Ensure images are embedded
      audio "send_base64"    // OpenAI requires base64 for audio
      pdf "send_base64"      // Embed PDFs for reliability
    }
  }
}

Available URL Handling Strategies

| Strategy | Description | Use Case |
| --- | --- | --- |
| send_url | Forward the URL to the provider | Reduce payload size; let the provider fetch |
| send_base64 | Download and embed the content | Avoid external dependencies |
| send_url_add_mime_type | Forward the URL with its MIME type attached | Required by some providers |
| send_base64_unless_google_url | Preserve GCS URLs; convert others to base64 | Optimize for Google providers |
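To make the last strategy concrete, here is a rough sketch of the URL check it implies. This is an illustrative approximation only, not BAML's actual implementation, and the exact URL patterns treated as "Google URLs" may differ:

```python
def forwards_url(url: str) -> bool:
    # Approximation of the "send_base64_unless_google_url" decision:
    # Google Cloud Storage URLs pass through to the provider unchanged;
    # any other URL would be downloaded and embedded as base64.
    return url.startswith("gs://") or url.startswith(
        "https://storage.googleapis.com/"
    )
```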

Multi-Modal Combinations

Combine multiple media types in a single function:
function AnalyzePresentation(
  slides: image[],
  audio: audio,
  notes: string
) -> PresentationAnalysis {
  client "google-ai/gemini-2.0-flash-exp"
  prompt #"
    {{_.role("user")}}
    Analyze this presentation:
    
    Slides: {{ slides }}
    Audio commentary: {{ audio }}
    Speaker notes: {{ notes }}
    
    {{ ctx.output_format }}
  "#
}

class PresentationAnalysis {
  main_topics string[]
  visual_quality string
  audio_clarity string
  overall_score int
}

Best Practices

1. Verify model compatibility. Check that your chosen model supports the media types you're using; not all models support all modalities.

2. Optimize media size. For images:
  • Crop unnecessary parts
  • Resize to the model's input size
  • Verify clarity at that resolution; blurrier images lead to higher hallucination rates

3. Use appropriate MIME types. Always specify the correct MIME type when using base64:
  • Images: image/png, image/jpeg, image/gif
  • Audio: audio/mp3, audio/wav, audio/ogg
  • Video: video/mp4, video/webm
  • PDFs: application/pdf

4. Test in the Playground. Use the BAML Playground to test multimodal functions before deploying to production.
