vLLM supports multimodal language models that can process combinations of text, images, video, and audio inputs. These models enable applications like visual question answering, image captioning, video understanding, and speech recognition.

Supported modalities

The following modalities are supported depending on the model:
  • T - Text
  • I - Image
  • V - Video
  • A - Audio
Modalities joined by + can be used together:
  • T + I means text-only, image-only, and text-with-image inputs are supported
Modalities separated by / are mutually exclusive:
  • T / I means text-only and image-only inputs are supported, but not combined
For hybrid-only models (Llama-4, Step3, Mistral-3, Qwen-3.5), a text-only mode can be enabled by setting the count of every multimodal modality to 0 via --limit-mm-per-prompt (e.g. --limit-mm-per-prompt '{"image": 0}'). This prevents the multimodal modules from being loaded, freeing GPU memory for the KV cache.
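As an illustrative sketch (the model name is an example, and which modality keys a model accepts is model-specific), text-only mode boils down to passing a zero count for every modality:

```python
# Zero out every multimodal modality so the engine skips loading the
# vision/audio towers and keeps that GPU memory for the KV cache.
text_only_limits = {"image": 0, "video": 0, "audio": 0}

# With the offline API this would be passed as (requires vLLM and weights):
#   from vllm import LLM
#   llm = LLM(model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
#             limit_mm_per_prompt=text_only_limits)
```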

Passing multimodal inputs

See the examples below for how to pass images, videos, and audio to models.

Image inputs example

import PIL.Image
from vllm import LLM

llm = LLM(model="llava-hf/llava-1.5-7b-hf")

prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
image = PIL.Image.open("image.jpg")

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": image},
})

print(outputs[0].outputs[0].text)

Multiple images example

from vllm import LLM
import PIL.Image

llm = LLM(
    model="microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,
    max_model_len=4096,
    limit_mm_per_prompt={"image": 2},
)

prompt = "<|user|>\n<|image_1|>\n<|image_2|>\nWhat is in each image?<|end|>\n<|assistant|>\n"
image1 = PIL.Image.open("image1.jpg")
image2 = PIL.Image.open("image2.jpg")

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": [image1, image2]},
})

print(outputs[0].outputs[0].text)

LLaVA family

Architecture | Models | Modalities | Example
LlavaForConditionalGeneration | LLaVA-1.5 | T + I^E+ | llava-hf/llava-1.5-7b-hf
LlavaNextForConditionalGeneration | LLaVA-NeXT | T + I^E+ | llava-hf/llava-v1.6-mistral-7b-hf
LlavaOnevisionForConditionalGeneration | LLaVA-Onevision | T + I+ + V+ | llava-hf/llava-onevision-qwen2-7b-ov-hf
  • ^E indicates pre-computed embeddings can be passed for this modality
  • + indicates multiple items of this modality can be passed per prompt

Leading multimodal models

Model | Modalities | Example
Qwen2-VL | T + I^E+ + V^E+ | Qwen/Qwen2-VL-7B-Instruct
Qwen3-VL | T + I^E+ + V^E+ | Qwen/Qwen3-VL-6B
Qwen2.5-Omni-Thinker | T + I^E+ + V^E+ + A^E+ | Qwen/Qwen2.5-Omni-Thinker-7B
DeepSeek-VL2 | T + I+ | deepseek-ai/deepseek-vl2
DeepSeek-OCR | T + I+ | deepseek-ai/DeepSeek-OCR
DeepSeek-OCR-2 | T + I+ | deepseek-ai/DeepSeek-OCR-2
InternVL 3.5 | T + I^E+ + V^E+ | OpenGVLab/InternVL3_5-14B
InternVL 3.0 | T + I^E+ + V^E+ | OpenGVLab/InternVL3-9B
Intern-S1 | T + I^E+ + V^E+ | internlm/Intern-S1
Llama 4 (Llama4ForConditionalGeneration) | T + I+ | meta-llama/Llama-4-Scout-17B-16E-Instruct
Llama 4 (Llama4ForConditionalGeneration) | T + I+ | meta-llama/Llama-4-Maverick-17B-128E-Instruct
Gemma 3 | T + I^E+ | google/gemma-3-4b-it
Gemma 3n | T + I + A | google/gemma-3n-E2B-it

Document understanding

Specialized models for OCR and document processing:
Architecture | Model | Example
DotsOCRForCausalLM | dots.ocr | rednote-hilab/dots.ocr
GlmOcrForConditionalGeneration | GLM-OCR | zai-org/GLM-OCR
HunYuanVLForConditionalGeneration | HunyuanOCR | tencent/HunyuanOCR
LightOnOCRForConditionalGeneration | LightOnOCR | lightonai/LightOnOCR-1B
PaddleOCRVLForConditionalGeneration | Paddle-OCR | PaddlePaddle/PaddleOCR-VL

Audio models

vLLM supports audio input models for speech recognition and audio understanding:
Architecture | Models | Modalities | Example
AudioFlamingo3ForConditionalGeneration | AudioFlamingo3 | T + A | nvidia/audio-flamingo-3-hf
Qwen2AudioForConditionalGeneration | Qwen2-Audio | T + A^E+ | Qwen/Qwen2-Audio-7B-Instruct
GraniteSpeechForConditionalGeneration | Granite Speech | T + A | ibm-granite/granite-speech-3.3-8b
WhisperForConditionalGeneration | Whisper | A | openai/whisper-large-v3
MiDashengLMModel | MiDashengLM | T + A+ | mispeech/midashenglm-7b
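For offline inference, vLLM takes audio under the "audio" key of multi_modal_data as a (waveform, sampling_rate) pair. A minimal sketch using a synthetic sine wave; real code would load a recording into a NumPy array with a library such as librosa, and the prompt here is a placeholder since the template is model-specific:

```python
import math

# One second of a 440 Hz tone at 16 kHz, standing in for a real recording.
# vLLM expects a NumPy float array; a plain list is used here only to keep
# the sketch dependency-free.
sampling_rate = 16_000
waveform = [math.sin(2 * math.pi * 440 * t / sampling_rate)
            for t in range(sampling_rate)]

request = {
    # Placeholder prompt: check the model's HuggingFace page for the real template.
    "prompt": "Transcribe this audio.",
    "multi_modal_data": {"audio": (waveform, sampling_rate)},
}

# An audio-capable model would then run it with llm.generate(request).
```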

Video understanding models

Models that can process video inputs:
Architecture | Models | Modalities | Example
LlavaNextVideoForConditionalGeneration | LLaVA-NeXT-Video | T + V | llava-hf/LLaVA-NeXT-Video-7B-hf
InternVLChatModel | InternVideo 2.5 | T + I^E+ + V^E+ | OpenGVLab/InternVideo2_5_Chat_8B
Molmo2ForConditionalGeneration | Molmo2 | T + I+ / V | allenai/Molmo2-8B
Ovis2_5 | Ovis2.5 | T + I+ + V | AIDC-AI/Ovis2.5-9B

Omnimodal models

Models supporting multiple modalities including text, image, video, and audio:
Architecture | Models | Modalities | Example
MiniCPMO | MiniCPM-O | T + I^E+ + V^E+ + A^E+ | openbmb/MiniCPM-o-2_6
Qwen3OmniMoeThinkerForConditionalGeneration | Qwen3-Omni-Thinker | T + I^E+ + V^E+ + A^E+ | Various Qwen3 omni models

Special capabilities

Image generation

Some models support image generation in addition to understanding:
  • Chameleon: ChameleonForConditionalGeneration - T + I (generation + understanding)

Scientific and domain-specific

  • Molmo: Open vision-language models from AI2
  • NVLM-D: NVIDIA’s vision-language model
  • Aria: Multimodal mixture-of-experts

LoRA support

vLLM supports adding LoRA adapters to the language backbone for most multimodal models. Additionally, vLLM experimentally supports adding LoRA to tower and connector modules for some models.
See the LoRA documentation for details on using adapters with multimodal models.

Online serving

Multimodal models can be served via the OpenAI-Compatible Server:
vllm serve llava-hf/llava-1.5-7b-hf
Then make requests with images:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llava-hf/llava-1.5-7b-hf",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
      ]
    }]
  }'
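The same request can be built from Python with only the standard library. This sketch inlines a local image as a base64 data URL, which the OpenAI-compatible endpoint accepts alongside plain HTTP URLs as in the curl example; actually sending it requires the server started above to be running.

```python
import base64
import json

def build_image_request(model: str, question: str, image_bytes: bytes) -> str:
    """Serialize a chat-completions body with the image inlined as a data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    body = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encoded}"}},
            ],
        }],
    }
    return json.dumps(body)

# Stand-in bytes; real code would use open("image.jpg", "rb").read().
payload = build_image_request("llava-hf/llava-1.5-7b-hf",
                              "What is in this image?", b"\xff\xd8\xff\xe0")

# POST the payload with urllib.request (or the openai client) to
# http://localhost:8000/v1/chat/completions
```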

Best practices

  • For hybrid models that only need text, disable multimodal inputs by setting every modality count to 0 with --limit-mm-per-prompt
  • Set appropriate --max-model-len based on your use case
  • Use quantization (FP8, INT8) to reduce memory usage
  • Set limit_mm_per_prompt to control maximum media inputs per request
  • Verify the model’s expected prompt format from its HuggingFace page
  • Use appropriate placeholders (e.g., <image>, <|image_1|>)
  • For multiple images, check if the model supports it via limit_mm_per_prompt
  • Pre-process images to reduce size when possible
  • Enable tensor parallelism for large models: --tensor-parallel-size
  • Use continuous batching for higher throughput
  • Consider prefix caching for common visual prompts
  • Profile with different batch sizes to find optimal throughput
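The placeholder and limit_mm_per_prompt bullets above can be combined into a small pre-flight check. build_image_prompt below is a hypothetical helper, not part of vLLM, and the <|image_N|> syntax is the Phi-3.5-vision convention; other models use different placeholders.

```python
def build_image_prompt(question: str, num_images: int, max_images: int) -> str:
    """Build a Phi-3.5-vision-style prompt, enforcing a per-request image cap.

    Hypothetical helper: the placeholder syntax is model-specific, so check
    the model's HuggingFace page before reusing this format.
    """
    if num_images > max_images:
        raise ValueError(
            f"{num_images} images exceeds the configured limit ({max_images})")
    placeholders = "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))
    return f"<|user|>\n{placeholders}{question}<|end|>\n<|assistant|>\n"

# Reproduces the two-image Phi-3.5-vision prompt shown earlier.
prompt = build_image_prompt("What is in each image?", 2, max_images=2)
```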

Next steps

  • Generative models: text-only language models
  • Quantization: reduce model size with quantization
  • OpenAI API: serve multimodal models via API