llama.cpp supports multimodal models that can process images and audio alongside text through the libmtmd library. This enables vision-language models (VLMs) and speech-language models for diverse AI applications.

Overview

Multimodal support allows models to:
  • Vision: Analyze images, answer questions about visual content, generate image descriptions
  • Audio: Process speech, transcribe audio, understand audio content
  • Mixed: Handle multiple modalities simultaneously (e.g., Qwen2.5-Omni)
Currently supported modalities:
  • Images: Vision models like Gemma 3, SmolVLM, Qwen2-VL, Pixtral
  • Audio: Speech models like Ultravox, Voxtral (experimental, may have reduced quality)

Quick Start

1. Download a multimodal model

Use the -hf flag to automatically download a model with its projector:
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF
2. Send a multimodal request

Use the chat completions endpoint with image content:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4-vision",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQ..."}}
      ]
    }]
  }'

Loading Multimodal Models

Automatic Download

Using -hf automatically downloads both the model and the multimodal projector:
# Vision model
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF

# Audio model
./llama-server -hf ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF

# Mixed modality model
./llama-server -hf ggml-org/Qwen2.5-Omni-7B-GGUF

Manual Loading

Specify the model and projector separately:
./llama-server -m model.gguf --mmproj projector.gguf

Disable Multimodal

# Load model without multimodal projector
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF --no-mmproj

# Or disable automatic loading
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF --no-mmproj-auto

GPU Offloading

By default, the multimodal projector is offloaded to GPU. To disable:
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF --no-mmproj-offload

Vision Models

Vision models can analyze images and answer questions about visual content.

Available Vision Models

# 4B parameter model
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF

# 12B parameter model
./llama-server -hf ggml-org/gemma-3-12b-it-GGUF

# 27B parameter model
./llama-server -hf ggml-org/gemma-3-27b-it-GGUF
Some models may require a larger context window. Use -c 8192 or higher if you encounter issues.

Using Vision Models

With CLI

# Start interactive session
./llama-cli -hf ggml-org/gemma-3-4b-it-GGUF -cnv

# Send image with prompt
./llama-cli -hf ggml-org/gemma-3-4b-it-GGUF \
  --image photo.jpg \
  -p "Describe this image in detail"

# Multiple images
./llama-cli -hf ggml-org/gemma-3-4b-it-GGUF \
  --image image1.jpg --image image2.png \
  -p "Compare these two images"

With Server (OpenAI API)

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4-vision",
    "messages": [{
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What objects are in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
          }
        }
      ]
    }],
    "max_tokens": 300
  }'

Python Example

import openai
import base64

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

# Read and encode image
with open("image.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

# Make request
response = client.chat.completions.create(
    model="gpt-4-vision",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{image_data}"
                }
            }
        ]
    }],
    max_tokens=300
)

print(response.choices[0].message.content)

Image Input Formats

Supported image formats:
  • Base64 encoded: data:image/jpeg;base64,/9j/4AAQ...
  • Local files (CLI): --image path/to/image.jpg
  • URLs: https://example.com/image.jpg
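For base64 input, a small helper can turn a local file into a data URL. A minimal sketch (the to_data_url helper is illustrative, not part of llama.cpp):

```python
import base64
import mimetypes

def to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        mime = "application/octet-stream"
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{payload}"
```

The returned string can be passed directly as the image_url.url field in a chat completions request.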

Dynamic Resolution

Some vision models support dynamic resolution for better image understanding:
# Configure token limits for dynamic resolution
./llama-server -hf ggml-org/Qwen2.5-VL-7B-Instruct-GGUF \
  --image-min-tokens 64 \
  --image-max-tokens 4096
  • --image-min-tokens (integer, default: model default): minimum tokens each image may use (dynamic-resolution models only)
  • --image-max-tokens (integer, default: model default): maximum tokens each image may use (dynamic-resolution models only)

Audio Models

Audio models process speech and audio content.
Audio support is highly experimental and may have reduced quality compared to vision models.

Available Audio Models

# Ultravox 0.5 (1B parameters)
./llama-server -hf ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF

# Ultravox 0.5 (8B parameters)
./llama-server -hf ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF

# Mistral Voxtral Mini
./llama-server -hf ggml-org/Voxtral-Mini-3B-2507-GGUF

Using Audio Models

With CLI

# Start with audio model
./llama-cli -hf ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF -cnv

# Process audio file
./llama-cli -hf ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF \
  --audio speech.wav \
  -p "Transcribe and summarize this audio"

With Server

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "audio-model",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is said in this audio?"},
        {
          "type": "input_audio",
          "input_audio": {
            "data": "base64_encoded_audio_data",
            "format": "wav"
          }
        }
      ]
    }]
  }'

Mixed Modality Models

Some models support multiple input modalities simultaneously.

Qwen2.5-Omni

Capabilities: Audio input, vision input, text output
# 3B parameter model
./llama-server -hf ggml-org/Qwen2.5-Omni-3B-GGUF

# 7B parameter model
./llama-server -hf ggml-org/Qwen2.5-Omni-7B-GGUF

Using Mixed Modality

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "omni",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe what you see and hear"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
        {"type": "input_audio", "input_audio": {"data": "base64_audio", "format": "wav"}}
      ]
    }]
  }'

Finding More Models

Discover more GGUF multimodal models on Hugging Face by searching for models that ship a GGUF projector file alongside the main weights.

Common Use Cases

Image Analysis

# Object detection
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vision",
    "messages": [{"role": "user", "content": [
      {"type": "text", "text": "List all objects in this image"},
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
    ]}]
  }'

OCR (Text Extraction)

# Extract text from image
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vision",
    "messages": [{"role": "user", "content": [
      {"type": "text", "text": "Extract all text from this document image"},
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
    ]}],
    "temperature": 0.1
  }'

Image Comparison

import openai
import base64

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Load two images
with open("before.jpg", "rb") as f:
    img1 = base64.b64encode(f.read()).decode()
with open("after.jpg", "rb") as f:
    img2 = base64.b64encode(f.read()).decode()

# Compare images
response = client.chat.completions.create(
    model="vision",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What are the differences between these two images?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img1}"}},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img2}"}}
        ]
    }]
)

print(response.choices[0].message.content)

Speech Transcription

import openai
import base64

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Load audio file
with open("speech.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode()

# Transcribe
response = client.chat.completions.create(
    model="audio",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this audio"},
            {
                "type": "input_audio",
                "input_audio": {"data": audio_data, "format": "wav"}
            }
        ]
    }]
)

print(response.choices[0].message.content)

Implementation Details

How Multimodal Works

Multimodal models work by:
  1. Encoding: Images/audio are encoded into embeddings using a separate encoder model (the multimodal projector)
  2. Integration: These embeddings are combined with text token embeddings
  3. Processing: The main language model processes the combined embeddings
  4. Generation: The model generates text responses that incorporate understanding of all modalities
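The encode-and-splice steps above can be sketched in plain Python. This is an illustrative toy, not the libmtmd implementation: the prompt is split around a media marker and the media embeddings are inserted at each marker position.

```python
MARKER = "<__media__>"  # placeholder string used in the prompt

def splice_embeddings(prompt, embed_text, media_embeddings):
    """Toy version of the multimodal splice: replace each marker in the
    prompt with the corresponding media embedding sequence, embedding the
    text chunks in between."""
    chunks = prompt.split(MARKER)
    assert len(chunks) == len(media_embeddings) + 1, "one embedding list per marker"
    combined = []
    for i, chunk in enumerate(chunks):
        combined.extend(embed_text(chunk))  # text chunk -> token embeddings
        if i < len(media_embeddings):
            combined.extend(media_embeddings[i])  # projector output for media i
    return combined
```

The real pipeline does the same thing with tensors: the projector's output embeddings are concatenated with the text token embeddings before the language model runs.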

Media Markers

In the prompt, multimodal data is represented by marker strings (e.g., <__media__>) that act as placeholders. The actual media data is passed separately and substituted in order.
Clients must check the /models or /v1/models endpoint for the multimodal capability before sending multimodal requests.
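A client-side capability check can be sketched as follows. The exact response shape (a modalities object of boolean flags, as returned by recent servers) is an assumption; verify it against your server version before relying on it.

```python
import json
import urllib.request

def supported_modalities(props: dict) -> set:
    """Extract supported modalities from a server capability response.
    Assumes a `modalities` object of boolean flags (verify for your build)."""
    flags = props.get("modalities", {})
    return {name for name, enabled in flags.items() if enabled}

def fetch_props(base_url: str = "http://localhost:8080") -> dict:
    """Fetch the server's capability info from its /props endpoint."""
    with urllib.request.urlopen(f"{base_url}/props") as resp:
        return json.load(resp)
```

A client would then guard its requests with something like: if "vision" in supported_modalities(fetch_props()): send the image.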

Performance Optimization

GPU Acceleration

# Offload model and projector to GPU
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF -ngl 99

# Disable projector offload if needed
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF -ngl 99 --no-mmproj-offload

Context Window

# Increase context for large images or multiple images
./llama-server -hf ggml-org/Qwen2.5-VL-7B-Instruct-GGUF -c 8192

Batch Processing

For processing multiple images:
import base64
import openai

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Process multiple images sequentially
images = ["img1.jpg", "img2.jpg", "img3.jpg"]
results = []

for img_path in images:
    with open(img_path, "rb") as f:
        img_data = base64.b64encode(f.read()).decode()
    
    response = client.chat.completions.create(
        model="vision",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_data}"}}
            ]
        }]
    )
    results.append(response.choices[0].message.content)
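To actually run requests in parallel, a thread pool works well, since each call is I/O-bound. A sketch using a generic helper (describe stands in for the per-image request above; the helper itself is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def map_parallel(fn, items, max_workers=4):
    """Apply fn to each item concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, items))
```

Usage: results = map_parallel(describe, images). Keep max_workers at or below the server's configured parallel slot count to avoid queueing.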

Troubleshooting

Model fails to load multimodal projector

Issue: Projector not found or not loading

Solution:
  • Ensure you’re using -hf for automatic download
  • Or manually specify with --mmproj projector.gguf
  • Check that the projector file exists and is compatible

Images not being processed

Issue: Model ignores image input

Solution:
  • Verify the model supports vision (check model card)
  • Ensure projector is loaded (--mmproj)
  • Check image format (base64, supported file types)
  • Verify the image marker is in the prompt

Out of memory errors

Issue: Crashes or OOM errors with large images

Solution:
  • Reduce --image-max-tokens
  • Increase context size with -c
  • Use smaller images or resize before encoding
  • Enable GPU offloading with -ngl

Audio quality issues

Issue: Poor audio transcription or understanding

Solution:
  • Use higher quality audio files (16kHz+ sample rate)
  • Try different audio models
  • Ensure audio format is supported (WAV recommended)
  • Note that audio support is experimental
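A quick header check with the standard-library wave module can confirm the sample rate and channel count before sending audio (illustrative helper, not part of llama.cpp):

```python
import wave

def wav_info(path: str) -> dict:
    """Read sample rate, channel count, and duration from a WAV header."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        return {
            "sample_rate": rate,
            "channels": w.getnchannels(),
            "duration_s": w.getnframes() / rate,
        }
```

If the reported sample rate is below 16000, consider resampling the file before sending it to the model.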

See Also

Server

Full server API reference

CLI Tool

Command-line multimodal usage

Embeddings

Multimodal embeddings

Model Hub

Pre-quantized multimodal models