llama.cpp supports multimodal models that can process images and audio alongside text through the libmtmd library. This enables vision-language models (VLMs) and speech-language models for diverse AI applications.

Overview

Multimodal support allows models to:
  • Vision: Analyze images, answer questions about visual content, generate image descriptions
  • Audio: Process speech, transcribe audio, understand audio content
  • Mixed: Handle multiple modalities simultaneously (e.g., Qwen2.5-Omni)
Currently supported modalities:
  • Images: Vision models like Gemma 3, SmolVLM, Qwen2-VL, Pixtral
  • Audio: Speech models like Ultravox, Voxtral (experimental, may have reduced quality)

Quick Start

1. Download a multimodal model

Use the -hf flag to automatically download a model with its projector:
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF
2. Send a multimodal request

Use the chat completions endpoint with image content:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4-vision",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQ..."}}
      ]
    }]
  }'

Loading Multimodal Models

Automatic Download

Using -hf automatically downloads both the model and the multimodal projector:
# Vision model
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF

# Audio model
./llama-server -hf ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF

# Mixed modality model
./llama-server -hf ggml-org/Qwen2.5-Omni-7B-GGUF

Manual Loading

Specify the model and projector separately:
./llama-server -m model.gguf --mmproj projector.gguf

Disable Multimodal

# Load model without multimodal projector
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF --no-mmproj

# Or disable automatic loading
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF --no-mmproj-auto

GPU Offloading

By default, the multimodal projector is offloaded to GPU. To disable:
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF --no-mmproj-offload

Vision Models

Vision models can analyze images and answer questions about visual content.

Available Vision Models

# 4B parameter model
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF

# 12B parameter model
./llama-server -hf ggml-org/gemma-3-12b-it-GGUF

# 27B parameter model
./llama-server -hf ggml-org/gemma-3-27b-it-GGUF
Some models may require a larger context window. Use -c 8192 or higher if you encounter issues.

Using Vision Models

With CLI

# Start interactive session
./llama-cli -hf ggml-org/gemma-3-4b-it-GGUF -cnv

# Send image with prompt
./llama-cli -hf ggml-org/gemma-3-4b-it-GGUF \
  --image photo.jpg \
  -p "Describe this image in detail"

# Multiple images
./llama-cli -hf ggml-org/gemma-3-4b-it-GGUF \
  --image image1.jpg --image image2.png \
  -p "Compare these two images"

With Server (OpenAI API)

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4-vision",
    "messages": [{
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What objects are in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
          }
        }
      ]
    }],
    "max_tokens": 300
  }'

Python Example

import openai
import base64

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

# Read and encode image
with open("image.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

# Make request
response = client.chat.completions.create(
    model="gpt-4-vision",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{image_data}"
                }
            }
        ]
    }],
    max_tokens=300
)

print(response.choices[0].message.content)

Image Input Formats

Supported image formats:
  • Base64 encoded: data:image/jpeg;base64,/9j/4AAQ...
  • Local files (CLI): --image path/to/image.jpg
  • URLs: https://example.com/image.jpg
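For base64 input, a small helper can turn a local file into a data URL. A minimal sketch (the to_data_url helper is illustrative, not part of llama.cpp):

```python
import base64
import mimetypes

def to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        mime = "application/octet-stream"
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{payload}"
```

The returned string can be passed directly as the image_url.url field in a chat completions request.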

Dynamic Resolution

Some vision models support dynamic resolution for better image understanding:
# Configure token limits for dynamic resolution
./llama-server -hf ggml-org/Qwen2.5-VL-7B-Instruct-GGUF \
  --image-min-tokens 64 \
  --image-max-tokens 4096
  • --image-min-tokens (integer, default: model default): minimum tokens each image may use (dynamic-resolution models only)
  • --image-max-tokens (integer, default: model default): maximum tokens each image may use (dynamic-resolution models only)

Audio Models

Audio models process speech and audio content.
Audio support is highly experimental and may have reduced quality compared to vision models.

Available Audio Models

# Ultravox 0.5 (1B parameters)
./llama-server -hf ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF

# Ultravox 0.5 (8B parameters)
./llama-server -hf ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF

# Mistral Voxtral Mini
./llama-server -hf ggml-org/Voxtral-Mini-3B-2507-GGUF

Using Audio Models

With CLI

# Start with audio model
./llama-cli -hf ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF -cnv

# Process audio file
./llama-cli -hf ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF \
  --audio speech.wav \
  -p "Transcribe and summarize this audio"

With Server

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "audio-model",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is said in this audio?"},
        {
          "type": "input_audio",
          "input_audio": {
            "data": "base64_encoded_audio_data",
            "format": "wav"
          }
        }
      ]
    }]
  }'

Mixed Modality Models

Some models support multiple input modalities simultaneously.

Qwen2.5-Omni

Capabilities: Audio input, vision input, text output
# 3B parameter model
./llama-server -hf ggml-org/Qwen2.5-Omni-3B-GGUF

# 7B parameter model
./llama-server -hf ggml-org/Qwen2.5-Omni-7B-GGUF

Using Mixed Modality

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "omni",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe what you see and hear"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
        {"type": "input_audio", "input_audio": {"data": "base64_audio", "format": "wav"}}
      ]
    }]
  }'

Finding More Models

Discover more GGUF multimodal models on Hugging Face by searching for models that ship a GGUF projector file alongside the main weights.

Common Use Cases

Image Analysis

# Object detection
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vision",
    "messages": [{"role": "user", "content": [
      {"type": "text", "text": "List all objects in this image"},
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
    ]}]
  }'

OCR (Text Extraction)

# Extract text from image
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vision",
    "messages": [{"role": "user", "content": [
      {"type": "text", "text": "Extract all text from this document image"},
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
    ]}],
    "temperature": 0.1
  }'

Image Comparison

import openai
import base64

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Load two images
with open("before.jpg", "rb") as f:
    img1 = base64.b64encode(f.read()).decode()
with open("after.jpg", "rb") as f:
    img2 = base64.b64encode(f.read()).decode()

# Compare images
response = client.chat.completions.create(
    model="vision",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What are the differences between these two images?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img1}"}},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img2}"}}
        ]
    }]
)

print(response.choices[0].message.content)

Speech Transcription

import openai
import base64

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Load audio file
with open("speech.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode()

# Transcribe
response = client.chat.completions.create(
    model="audio",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this audio"},
            {
                "type": "input_audio",
                "input_audio": {"data": audio_data, "format": "wav"}
            }
        ]
    }]
)

print(response.choices[0].message.content)

Implementation Details

How Multimodal Works

Multimodal models work by:
  1. Encoding: Images/audio are encoded into embeddings using a separate encoder model (the multimodal projector)
  2. Integration: These embeddings are combined with text token embeddings
  3. Processing: The main language model processes the combined embeddings
  4. Generation: The model generates text responses that incorporate understanding of all modalities
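The encode-and-splice steps above can be sketched in plain Python. This is an illustrative toy, not the libmtmd implementation: the prompt is split around a media marker and the media embeddings are inserted at each marker position.

```python
MARKER = "<__media__>"  # placeholder string used in the prompt

def splice_embeddings(prompt, embed_text, media_embeddings):
    """Toy version of the multimodal splice: replace each marker in the
    prompt with the corresponding media embedding sequence, embedding the
    text chunks in between."""
    chunks = prompt.split(MARKER)
    assert len(chunks) == len(media_embeddings) + 1, "one embedding list per marker"
    combined = []
    for i, chunk in enumerate(chunks):
        combined.extend(embed_text(chunk))  # text chunk -> token embeddings
        if i < len(media_embeddings):
            combined.extend(media_embeddings[i])  # projector output for media i
    return combined
```

The real pipeline does the same thing with tensors: the projector's output embeddings are concatenated with the text token embeddings before the language model runs.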

Media Markers

In the prompt, multimodal data is represented by marker strings (e.g., <__media__>) that act as placeholders. The actual media data is passed separately and substituted in order.
Clients must check the /models or /v1/models endpoint for the multimodal capability before sending multimodal requests.
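A client-side capability check can be sketched as follows. The exact response shape (a modalities object of boolean flags, as returned by recent servers) is an assumption; verify it against your server version before relying on it.

```python
import json
import urllib.request

def supported_modalities(props: dict) -> set:
    """Extract supported modalities from a server capability response.
    Assumes a `modalities` object of boolean flags (verify for your build)."""
    flags = props.get("modalities", {})
    return {name for name, enabled in flags.items() if enabled}

def fetch_props(base_url: str = "http://localhost:8080") -> dict:
    """Fetch the server's capability info from its /props endpoint."""
    with urllib.request.urlopen(f"{base_url}/props") as resp:
        return json.load(resp)
```

A client would then guard its requests with something like: if "vision" in supported_modalities(fetch_props()): send the image.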

Performance Optimization

GPU Acceleration

# Offload model and projector to GPU
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF -ngl 99

# Disable projector offload if needed
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF -ngl 99 --no-mmproj-offload

Context Window

# Increase context for large images or multiple images
./llama-server -hf ggml-org/Qwen2.5-VL-7B-Instruct-GGUF -c 8192

Batch Processing

For processing multiple images:
import base64
import openai

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Process multiple images sequentially
images = ["img1.jpg", "img2.jpg", "img3.jpg"]
results = []

for img_path in images:
    with open(img_path, "rb") as f:
        img_data = base64.b64encode(f.read()).decode()
    
    response = client.chat.completions.create(
        model="vision",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_data}"}}
            ]
        }]
    )
    results.append(response.choices[0].message.content)
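To actually run requests in parallel, a thread pool works well, since each call is I/O-bound. A sketch using a generic helper (describe stands in for the per-image request above; the helper itself is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def map_parallel(fn, items, max_workers=4):
    """Apply fn to each item concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, items))
```

Usage: results = map_parallel(describe, images). Keep max_workers at or below the server's configured parallel slot count to avoid queueing.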

Troubleshooting

Model fails to load multimodal projector

Issue: Projector not found or not loading

Solution:
  • Ensure you’re using -hf for automatic download
  • Or manually specify with --mmproj projector.gguf
  • Check that the projector file exists and is compatible

Images not being processed

Issue: Model ignores image input

Solution:
  • Verify the model supports vision (check model card)
  • Ensure projector is loaded (--mmproj)
  • Check image format (base64, supported file types)
  • Verify the image marker is in the prompt

Out of memory errors

Issue: Crashes or OOM errors with large images

Solution:
  • Reduce --image-max-tokens
  • Increase context size with -c
  • Use smaller images or resize before encoding
  • Enable GPU offloading with -ngl

Audio quality issues

Issue: Poor audio transcription or understanding

Solution:
  • Use higher quality audio files (16kHz+ sample rate)
  • Try different audio models
  • Ensure audio format is supported (WAV recommended)
  • Note that audio support is experimental
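A quick header check with the standard-library wave module can confirm the sample rate and channel count before sending audio (illustrative helper, not part of llama.cpp):

```python
import wave

def wav_info(path: str) -> dict:
    """Read sample rate, channel count, and duration from a WAV header."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        return {
            "sample_rate": rate,
            "channels": w.getnchannels(),
            "duration_s": w.getnframes() / rate,
        }
```

If the reported sample rate is below 16000, consider resampling the file before sending it to the model.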

See Also

Server

Full server API reference

CLI Tool

Command-line multimodal usage

Embeddings

Multimodal embeddings

Model Hub

Pre-quantized multimodal models