Supported modalities
The following modalities are supported depending on the model:

- T - Text
- I - Image
- V - Video
- A - Audio

A superscript E on a modality (written here as I^E+, V^E+, etc.) means pre-computed embeddings can be passed for that modality, and + means multiple items of that modality can be passed per prompt.
Modalities joined by + can be used together:

- T + I means text-only, image-only, and text-with-image inputs are supported

Modalities separated by / are mutually exclusive:

- T / I means text-only and image-only inputs are supported, but not combined
For hybrid-only models (Llama-4, Step3, Mistral-3, Qwen-3.5), a text-only mode can be enabled by setting all multimodal modalities to 0 using `--language-model-only`. This prevents loading multimodal modules, freeing up GPU memory for the KV cache.

Passing multimodal inputs
See the examples below for how to pass images, videos, and audio to models.

Image inputs example
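As a hedged sketch of offline image input: vLLM's `LLM.generate` accepts a dict with `prompt` and `multi_modal_data` keys; the helper function, model name, and prompt template below are illustrative.

```python
# Sketch of the request shape vLLM's offline LLM.generate accepts for image
# input. build_image_request is a hypothetical helper; the "prompt" /
# "multi_modal_data" keys follow vLLM's documented dict-input format.

def build_image_request(prompt: str, image) -> dict:
    # Non-text inputs go under "multi_modal_data", keyed by modality name.
    return {"prompt": prompt, "multi_modal_data": {"image": image}}

# With vLLM and Pillow installed, usage would look like:
#   from vllm import LLM
#   from PIL import Image
#   llm = LLM(model="llava-hf/llava-1.5-7b-hf")
#   out = llm.generate(build_image_request(
#       "USER: <image>\nWhat is in this picture? ASSISTANT:",
#       Image.open("example.jpg"),
#   ))
#   print(out[0].outputs[0].text)

request = build_image_request("USER: <image>\nDescribe. ASSISTANT:", object())
print(sorted(request))
```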
Multiple images example
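For several images in one request, vLLM accepts a list under the `"image"` key, provided the engine allows it via `limit_mm_per_prompt`. A hedged sketch (the helper and names are illustrative):

```python
# Sketch: multiple images ride in one request as a list under "image".
# The engine must permit this, e.g.
#   LLM(model=..., limit_mm_per_prompt={"image": 4})

def build_multi_image_request(prompt: str, images: list) -> dict:
    return {"prompt": prompt, "multi_modal_data": {"image": list(images)}}

request = build_multi_image_request(
    "USER: <image><image>\nCompare these pictures. ASSISTANT:",
    ["img1", "img2"],  # stand-ins for PIL.Image objects
)
print(len(request["multi_modal_data"]["image"]))
```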
Popular vision-language models
LLaVA family
| Architecture | Models | Modalities | Example |
|---|---|---|---|
| LlavaForConditionalGeneration | LLaVA-1.5 | T + I^E+ | llava-hf/llava-1.5-7b-hf |
| LlavaNextForConditionalGeneration | LLaVA-NeXT | T + I^E+ | llava-hf/llava-v1.6-mistral-7b-hf |
| LlavaOnevisionForConditionalGeneration | LLaVA-Onevision | T + I+ + V+ | llava-hf/llava-onevision-qwen2-7b-ov-hf |
Leading multimodal models
Qwen vision models
| Model | Modalities | Example |
|---|---|---|
| Qwen2-VL | T + I^E+ + V^E+ | Qwen/Qwen2-VL-7B-Instruct |
| Qwen3-VL | T + I^E+ + V^E+ | Qwen/Qwen3-VL-6B |
| Qwen2.5-Omni-Thinker | T + I^E+ + V^E+ + A^E+ | Qwen/Qwen2.5-Omni-Thinker-7B |
DeepSeek vision models
| Model | Modalities | Example |
|---|---|---|
| DeepSeek-VL2 | T + I+ | deepseek-ai/deepseek-vl2 |
| DeepSeek-OCR | T + I+ | deepseek-ai/DeepSeek-OCR |
| DeepSeek-OCR-2 | T + I+ | deepseek-ai/DeepSeek-OCR-2 |
InternVL family
| Model | Modalities | Example |
|---|---|---|
| InternVL 3.5 | T + I^E+ + V^E+ | OpenGVLab/InternVL3_5-14B |
| InternVL 3.0 | T + I^E+ + V^E+ | OpenGVLab/InternVL3-9B |
| Intern-S1 | T + I^E+ + V^E+ | internlm/Intern-S1 |
Llama 4 vision
| Architecture | Modalities | Example |
|---|---|---|
| Llama4ForConditionalGeneration | T + I+ | meta-llama/Llama-4-Scout-17B-16E-Instruct |
| Llama4ForConditionalGeneration | T + I+ | meta-llama/Llama-4-Maverick-17B-128E-Instruct |
Google Gemma multimodal
| Model | Modalities | Example |
|---|---|---|
| Gemma 3 | T + I^E+ | google/gemma-3-4b-it |
| Gemma 3n | T + I + A | google/gemma-3n-E2B-it |
Document understanding
Specialized models for OCR and document processing:

| Architecture | Model | Example |
|---|---|---|
| DotsOCRForCausalLM | dots.ocr | rednote-hilab/dots.ocr |
| GlmOcrForConditionalGeneration | GLM-OCR | zai-org/GLM-OCR |
| HunYuanVLForConditionalGeneration | HunyuanOCR | tencent/HunyuanOCR |
| LightOnOCRForConditionalGeneration | LightOnOCR | lightonai/LightOnOCR-1B |
| PaddleOCRVLForConditionalGeneration | Paddle-OCR | PaddlePaddle/PaddleOCR-VL |
Audio models
vLLM supports audio input models for speech recognition and audio understanding:

| Architecture | Models | Modalities | Example |
|---|---|---|---|
| AudioFlamingo3ForConditionalGeneration | AudioFlamingo3 | T + A | nvidia/audio-flamingo-3-hf |
| Qwen2AudioForConditionalGeneration | Qwen2-Audio | T + A^E+ | Qwen/Qwen2-Audio-7B-Instruct |
| GraniteSpeechForConditionalGeneration | Granite Speech | T + A | ibm-granite/granite-speech-3.3-8b |
| WhisperForConditionalGeneration | Whisper | A | openai/whisper-large-v3 |
| MiDashengLMModel | MiDashengLM | T + A+ | mispeech/midashenglm-7b |
Video understanding models
Models that can process video inputs:

| Architecture | Models | Modalities | Example |
|---|---|---|---|
| LlavaNextVideoForConditionalGeneration | LLaVA-NeXT-Video | T + V | llava-hf/LLaVA-NeXT-Video-7B-hf |
| InternVLChatModel | InternVideo 2.5 | T + I^E+ + V^E+ | OpenGVLab/InternVideo2_5_Chat_8B |
| Molmo2ForConditionalGeneration | Molmo2 | T + I+ / V | allenai/Molmo2-8B |
| Ovis2_5 | Ovis2.5 | T + I+ + V | AIDC-AI/Ovis2.5-9B |
Omnimodal models
Models supporting multiple modalities including text, image, video, and audio:

| Architecture | Models | Modalities | Example |
|---|---|---|---|
| MiniCPMO | MiniCPM-O | T + I^E+ + V^E+ + A^E+ | openbmb/MiniCPM-o-2_6 |
| Qwen3OmniMoeThinkerForConditionalGeneration | Qwen3-Omni-Thinker | T + I^E+ + V^E+ + A^E+ | Various Qwen3 omni models |
Special capabilities
Image generation
Some models support image generation in addition to understanding:

- Chameleon: ChameleonForConditionalGeneration - T + I (generation + understanding)
Scientific and domain-specific
- Molmo: Open vision-language models from AI2
- NVLM-D: NVIDIA’s vision-language model
- Aria: Multimodal mixture-of-experts
LoRA support
vLLM supports adding LoRA adapters to the language backbone for most multimodal models. Additionally, vLLM experimentally supports adding LoRA to tower and connector modules for some models.
Online serving
Multimodal models can be served via the OpenAI-Compatible Server.

Best practices
Memory optimization
- Use `--language-model-only` for hybrid models when only text is needed
- Set an appropriate `--max-model-len` based on your use case
- Use quantization (FP8, INT8) to reduce memory usage
- Set `limit_mm_per_prompt` to control maximum media inputs per request
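As a hedged sketch of the memory-related knobs above, collected as keyword arguments for `vllm.LLM` (model name and values are illustrative, not recommendations):

```python
# Illustrative memory-oriented engine arguments for vllm.LLM; the values are
# examples to adapt, not tuning advice.
engine_kwargs = {
    "model": "Qwen/Qwen2-VL-7B-Instruct",
    "max_model_len": 8192,                # cap context to bound KV-cache size
    "limit_mm_per_prompt": {"image": 2},  # at most two images per request
    "quantization": "fp8",                # optional: shrink weight memory
}
# With vLLM installed: llm = LLM(**engine_kwargs)
print(engine_kwargs["limit_mm_per_prompt"])
```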
Input handling
- Verify the model's expected prompt format from its HuggingFace page
- Use appropriate placeholders (e.g., `<image>`, `<|image_1|>`)
- For multiple images, check if the model supports it via `limit_mm_per_prompt`
- Pre-process images to reduce size when possible
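The placeholder bullet above can be mechanized. A hedged sketch, with a hypothetical helper; each model card defines its own placeholder token and ordering, so check it before relying on a pattern like this:

```python
def format_mm_prompt(question: str, num_images: int,
                     placeholder: str = "<image>") -> str:
    # Repeat the model's image placeholder once per image before the question.
    # The default "<image>" token and the placement are assumptions; consult
    # the model's HuggingFace page for the exact convention.
    return "\n".join([placeholder] * num_images + [question])

print(format_mm_prompt("What differs between these photos?", 2))
```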
Performance tuning
- Enable tensor parallelism for large models: `--tensor-parallel-size`
- Use continuous batching for higher throughput
- Consider prefix caching for common visual prompts
- Profile with different batch sizes to find optimal throughput
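The flags above combine on the CLI. A hedged sketch of a serving launch (the model and parallel size are illustrative):

```shell
# Illustrative: serve a vision-language model across two GPUs with a capped
# context length to leave more memory for the KV cache.
vllm serve Qwen/Qwen2-VL-7B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 8192
```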
Next steps
- Generative models - Text-only language models
- Quantization - Reduce model size with quantization
- OpenAI API - Serve multimodal models via API