llama.cpp supports multimodal models that can process images and audio alongside text through the libmtmd library. This enables vision-language models (VLMs) and speech-language models for diverse AI applications.
Overview
Multimodal support allows models to:
Vision: Analyze images, answer questions about visual content, generate image descriptions
Audio: Process speech, transcribe audio, understand audio content
Mixed: Handle multiple modalities simultaneously (e.g., Qwen2.5-Omni)
Currently supported modalities:
Images: Vision models like Gemma 3, SmolVLM, Qwen2-VL, Pixtral
Audio: Speech models like Ultravox, Voxtral (experimental, may have reduced quality)
Quick Start
Download a multimodal model
Use the -hf flag to automatically download a model together with its multimodal projector:
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF
Send a multimodal request
Use the chat completions endpoint with image content:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4-vision",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "What is in this image?"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQ..."}}
]
}]
}'
Loading Multimodal Models
Automatic Loading (Recommended)
Using -hf automatically downloads both the model and multimodal projector:
# Vision model
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF
# Audio model
./llama-server -hf ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF
# Mixed modality model
./llama-server -hf ggml-org/Qwen2.5-Omni-7B-GGUF
Manual Loading
Specify the model and projector separately:
./llama-server -m model.gguf --mmproj projector.gguf
Disable Multimodal
# Load model without multimodal projector
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF --no-mmproj
# Or disable automatic loading
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF --no-mmproj-auto
GPU Offloading
By default, the multimodal projector is offloaded to GPU. To disable:
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF --no-mmproj-offload
Vision Models
Vision models can analyze images and answer questions about visual content.
Available Vision Models
Gemma 3 (Recommended)
SmolVLM
Qwen2-VL / Qwen2.5-VL
Other Models
# 4B parameter model
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF
# 12B parameter model
./llama-server -hf ggml-org/gemma-3-12b-it-GGUF
# 27B parameter model
./llama-server -hf ggml-org/gemma-3-27b-it-GGUF
Some models may require a larger context window. Use -c 8192 or higher if you encounter issues.
Using Vision Models
With CLI
# Start interactive session
./llama-cli -hf ggml-org/gemma-3-4b-it-GGUF -cnv
# Send image with prompt
./llama-cli -hf ggml-org/gemma-3-4b-it-GGUF \
--image photo.jpg \
-p "Describe this image in detail"
# Multiple images
./llama-cli -hf ggml-org/gemma-3-4b-it-GGUF \
--image "image1.jpg,image2.png" \
-p "Compare these two images"
With Server (OpenAI API)
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4-vision",
"messages": [{
"role": "user",
"content": [
{
"type": "text",
"text": "What objects are in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
}
}
]
}],
"max_tokens": 300
}'
Python Example
import openai
import base64

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

# Read and encode image
with open("image.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

# Make request
response = client.chat.completions.create(
    model="gpt-4-vision",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{image_data}"
                }
            }
        ]
    }],
    max_tokens=300
)

print(response.choices[0].message.content)
Supported image formats:
Base64 encoded: data:image/jpeg;base64,/9j/4AAQ...
Local files (CLI): --image path/to/image.jpg
URLs: https://example.com/image.jpg
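For base64 input, a small standard-library helper can wrap a local file as a data URI for the chat completions request (a sketch; pixel.jpg here is a stand-in written with only JPEG magic bytes, for illustration):

```python
import base64

def image_to_data_uri(path, mime="image/jpeg"):
    """Read a local image file and return it as a data URI string."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"

# Stand-in file: JPEG magic bytes only, just to demonstrate the helper
with open("pixel.jpg", "wb") as f:
    f.write(b"\xff\xd8\xff\xe0")

print(image_to_data_uri("pixel.jpg"))  # data:image/jpeg;base64,/9j/4A==
```

The resulting string can be used directly as the "url" value of an image_url content part.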
Dynamic Resolution
Some vision models support dynamic resolution for better image understanding:
# Configure token limits for dynamic resolution
./llama-server -hf ggml-org/Qwen2.5-VL-7B-Instruct-GGUF \
--image-min-tokens 64 \
--image-max-tokens 4096
--image-min-tokens (integer, default: model default)
Minimum number of tokens each image may use (for dynamic resolution models)
--image-max-tokens (integer, default: model default)
Maximum number of tokens each image may use (for dynamic resolution models)
Audio Models
Audio models process speech and audio content.
Audio support is highly experimental and may have reduced quality compared to vision models.
Available Audio Models
# Ultravox 0.5 (1B parameters)
./llama-server -hf ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF
# Ultravox 0.5 (8B parameters)
./llama-server -hf ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF
# Mistral Voxtral Mini
./llama-server -hf ggml-org/Voxtral-Mini-3B-2507-GGUF
Using Audio Models
With CLI
# Start with audio model
./llama-cli -hf ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF -cnv
# Process audio file
./llama-cli -hf ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF \
--audio speech.wav \
-p "Transcribe and summarize this audio"
With Server
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "audio-model",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "What is said in this audio?"},
{
"type": "input_audio",
"input_audio": {
"data": "base64_encoded_audio_data",
"format": "wav"
}
}
]
}]
}'
Mixed Modality Models
Some models support multiple input modalities simultaneously.
Qwen2.5-Omni
Capabilities: Audio input, vision input, text output
# 3B parameter model
./llama-server -hf ggml-org/Qwen2.5-Omni-3B-GGUF
# 7B parameter model
./llama-server -hf ggml-org/Qwen2.5-Omni-7B-GGUF
Using Mixed Modality
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "omni",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "Describe what you see and hear"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
{"type": "input_audio", "input_audio": {"data": "base64_audio", "format": "wav"}}
]
}]
}'
Finding More Models
Discover GGUF multimodal models on Hugging Face.
Common Use Cases
Image Analysis
# Object detection
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "vision",
"messages": [{"role": "user", "content": [
{"type": "text", "text": "List all objects in this image"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
]}]
}'
# Extract text from image
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "vision",
"messages": [{"role": "user", "content": [
{"type": "text", "text": "Extract all text from this document image"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
]}],
"temperature": 0.1
}'
Image Comparison
import openai
import base64
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Load two images
with open("before.jpg", "rb") as f:
    img1 = base64.b64encode(f.read()).decode()
with open("after.jpg", "rb") as f:
    img2 = base64.b64encode(f.read()).decode()

# Compare images
response = client.chat.completions.create(
    model="vision",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What are the differences between these two images?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img1}"}},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img2}"}}
        ]
    }]
)

print(response.choices[0].message.content)
Speech Transcription
import openai
import base64
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Load audio file
with open("speech.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode()

# Transcribe
response = client.chat.completions.create(
    model="audio",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this audio"},
            {
                "type": "input_audio",
                "input_audio": {"data": audio_data, "format": "wav"}
            }
        ]
    }]
)

print(response.choices[0].message.content)
Implementation Details
How Multimodal Works
Multimodal models work by:
Encoding: Images/audio are encoded into embeddings using a separate encoder model (the multimodal projector)
Integration: These embeddings are combined with text token embeddings
Processing: The main language model processes the combined embeddings
Generation: The model generates text responses that incorporate understanding of all modalities
In the prompt, multimodal data is represented by marker strings (e.g., <__media__>) that act as placeholders. The actual media data is passed separately and substituted in order.
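The marker-and-substitution idea can be sketched as follows (a simplified illustration of the concept, not the actual libmtmd implementation):

```python
def split_prompt(prompt, media_items, marker="<__media__>"):
    """Split a prompt at media markers and pair each marker, in order,
    with the corresponding media item (conceptual sketch only)."""
    parts = prompt.split(marker)
    if len(parts) - 1 != len(media_items):
        raise ValueError("number of markers must match number of media items")
    chunks = []
    for i, text in enumerate(parts):
        if text:
            chunks.append(("text", text))
        if i < len(media_items):
            chunks.append(("media", media_items[i]))
    return chunks

print(split_prompt("Describe <__media__> please", ["photo.jpg"]))
# [('text', 'Describe '), ('media', 'photo.jpg'), ('text', ' please')]
```

Each "media" chunk is then run through the encoder and its embeddings are spliced into the token stream at that position.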
Clients must check the /models or /v1/models endpoint for the multimodal capability before sending multimodal requests.
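A client-side check might look like the sketch below. The "capabilities" field name is an assumption for illustration; consult your server version's actual /v1/models response schema:

```python
import json

def is_multimodal(models_response):
    """Return True if any listed model advertises a multimodal capability.
    The 'capabilities' field name is illustrative, not a guaranteed schema."""
    for model in models_response.get("data", []):
        caps = model.get("capabilities", [])
        if any(c in caps for c in ("multimodal", "vision", "audio")):
            return True
    return False

# Illustrative response shape
sample = json.loads('{"data": [{"id": "gemma-3-4b-it", "capabilities": ["vision"]}]}')
print(is_multimodal(sample))  # True
```

In practice the response would come from an HTTP GET against the server's /v1/models endpoint.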
GPU Acceleration
# Offload model and projector to GPU
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF -ngl 99
# Disable projector offload if needed
./llama-server -hf ggml-org/gemma-3-4b-it-GGUF -ngl 99 --no-mmproj-offload
Context Window
# Increase context for large images or multiple images
./llama-server -hf ggml-org/Qwen2.5-VL-7B-Instruct-GGUF -c 8192
Batch Processing
For processing multiple images:
import base64
import openai

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Process multiple images sequentially, one request per image
images = ["img1.jpg", "img2.jpg", "img3.jpg"]
results = []

for img_path in images:
    with open(img_path, "rb") as f:
        img_data = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="vision",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_data}"}}
            ]
        }]
    )
    results.append(response.choices[0].message.content)
Troubleshooting
Model fails to load multimodal projector
Issue: Projector not found or not loading
Solution:
Ensure you’re using -hf for automatic download
Or manually specify with --mmproj projector.gguf
Check that the projector file exists and is compatible
Images not being processed
Issue: Model ignores image input
Solution:
Verify the model supports vision (check model card)
Ensure projector is loaded (--mmproj)
Check image format (base64, supported file types)
Verify the image marker is in the prompt
Out of memory errors
Issue: Crashes or OOM errors with large images
Solution:
Reduce --image-max-tokens
Increase context size with -c
Use smaller images or resize before encoding
Enable GPU offloading with -ngl
Audio quality issues
Issue: Poor audio transcription or understanding
Solution:
Use higher quality audio files (16kHz+ sample rate)
Try different audio models
Ensure audio format is supported (WAV recommended)
Note that audio support is experimental
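To verify that a file meets the 16 kHz recommendation, the standard-library wave module can read the sample rate. In this sketch, demo.wav is generated in place purely for illustration:

```python
import wave
import struct

def wav_sample_rate(path):
    """Return the sample rate (Hz) of a WAV file."""
    with wave.open(path, "rb") as w:
        return w.getframerate()

# Generate a tiny 16 kHz mono WAV (10 ms of silence) for demonstration
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)          # 16-bit samples
    w.setframerate(16000)
    w.writeframes(struct.pack("<h", 0) * 160)

print(wav_sample_rate("demo.wav"))  # 16000
```

Files reporting a rate well below 16000 are good candidates for resampling before being sent to the model.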
See Also
Server: Full server API reference
CLI Tool: Command-line multimodal usage
Embeddings: Multimodal embeddings
Model Hub: Pre-quantized multimodal models