Overview
Multimodal models extend language models with specialized encoders for:

- Vision - Image understanding and analysis
- Video - Temporal reasoning and video QA
- Audio - Speech and audio processing
- Omnimodal - Combined modalities
Quick Start
Basic Vision Model
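A minimal launch sketch, assuming the standard sglang.launch_server entry point; the model path and port are illustrative and any supported vision-language checkpoint works:

```shell
# Launch an OpenAI-compatible server with a vision-language model
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-VL-7B-Instruct \
  --host 0.0.0.0 --port 30000
```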
Image Request Example
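Once the server is up, images are sent through the OpenAI-compatible chat API as image_url content parts. A sketch (endpoint, model name, and image URL are placeholders):

```shell
# Send a text + image request to the OpenAI-compatible endpoint
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}}
      ]
    }]
  }'
```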
Vision-Language Models
Qwen-VL Family
Alibaba’s vision-language models with strong image and video understanding.

Launch Qwen3-VL
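A launch sketch using the Qwen3-VL checkpoint from the video table below (host and port are illustrative):

```shell
# Serve Qwen3-VL through the standard launch entry point
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-VL-30B-A3B-Instruct \
  --host 0.0.0.0 --port 30000
```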
Hardware Recommendations
- H100 with FP8: Use FP8 checkpoint for best memory efficiency
- A100/H100 with BF16: Use `--mm-max-concurrent-calls` to control memory
- H200 & B200: Full context + concurrent image/video processing
Qwen-VL Video Support
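For long videos, a larger per-request timeout helps avoid preprocessing timeouts; a sketch using the flag from the optimization list (the timeout value is an example):

```shell
# Give large video inputs more processing time
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-VL-30B-A3B-Instruct \
  --mm-per-request-timeout 600
```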
Qwen-VL Optimization Flags
- `--mm-attention-backend fa3` - Use FlashAttention 3 for multimodal
- `--mm-max-concurrent-calls <N>` - Control concurrent multimodal processing
- `--mm-per-request-timeout <seconds>` - Timeout for large videos
- `--keep-mm-feature-on-device` - Keep features on GPU (lower latency, higher memory)
- `SGLANG_USE_CUDA_IPC_TRANSPORT=1` - Shared memory pool for multimodal data
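An illustrative combination of these flags (all values are examples, not recommendations):

```shell
# Combine multimodal optimization flags in one launch
SGLANG_USE_CUDA_IPC_TRANSPORT=1 python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-VL-30B-A3B-Instruct \
  --mm-attention-backend fa3 \
  --mm-max-concurrent-calls 8 \
  --mm-per-request-timeout 300 \
  --keep-mm-feature-on-device
```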
DeepSeek Vision Models
DeepSeek-VL2
Vision-language variant with advanced multimodal reasoning.

DeepSeek-OCR / OCR-2
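A launch sketch for DeepSeek-OCR-2, using the model path from the OCR table below:

```shell
# Serve DeepSeek-OCR-2 for document understanding
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-OCR-2 \
  --host 0.0.0.0 --port 30000
```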
Specialized for document understanding.

DeepSeek-Janus-Pro
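A launch sketch for Janus-Pro; the checkpoint name is an assumption, check the model card for the exact path:

```shell
# Serve DeepSeek-Janus-Pro (assumed model path)
python3 -m sglang.launch_server \
  --model-path deepseek-ai/Janus-Pro-7B \
  --host 0.0.0.0 --port 30000
```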
Image understanding AND generation.

Llama Vision
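A launch sketch for Llama Vision; the model path and chat template name are assumptions based on common usage:

```shell
# Serve a vision-enabled Llama model with its vision chat template
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
  --chat-template llama_3_vision \
  --host 0.0.0.0 --port 30000
```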
Meta’s vision-enabled Llama models.

LLaVA Family
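A launch sketch using the LLaVA-OneVision checkpoint from the video table below:

```shell
# Serve a LLaVA-family model
python3 -m sglang.launch_server \
  --model-path lmms-lab/llava-onevision-qwen2-7b-ov \
  --host 0.0.0.0 --port 30000
```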
Open vision-chat models.

Other Vision Models
| Model Family | Example Model | Key Features |
|---|---|---|
| Gemma 3 MM | google/gemma-3-4b-it | 4B-27B, 256 tokens per image, 128K context |
| Kimi-VL | moonshotai/Kimi-VL-A3B-Instruct | Moonshot’s compact VLM |
| Mistral-Small-3.1 | mistralai/Mistral-Small-3.1-24B-Instruct-2503 | 24B multimodal with tool calling |
| Phi-4-multimodal | microsoft/Phi-4-multimodal-instruct | 5.6B with vision + audio |
| MiMo-VL | XiaomiMiMo/MiMo-VL-7B-RL | Native resolution ViT encoder |
| MiniCPM-V/o | openbmb/MiniCPM-V-2_6 | 8B, edge-optimized |
| GLM-4.5V | zai-org/GLM-4.5V | 106B multimodal reasoning |
| DotsVLM | rednote-hilab/dots.vlm1.inst | NaViT vision encoder + DeepSeek V3 |
| NVILA | Efficient-Large-Model/NVILA-8B | Efficient multi-modal design |
| Ernie4.5-VL | baidu/ERNIE-4.5-VL-28B-A3B-PT | Baidu’s 28B/424B VLMs |
| Step3-VL | stepfun-ai/Step3-VL-10B | Lightweight 10B VLM |
| InternVL | OpenGVLab/InternVL2-8B | Open-source VLM series |
Audio Models
Qwen3-Omni
Omni-modal model supporting audio input.

Qwen2-Audio
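A launch sketch for Qwen2-Audio; the checkpoint name is an assumption, verify against the model card:

```shell
# Serve Qwen2-Audio for speech and audio processing (assumed model path)
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen2-Audio-7B-Instruct \
  --host 0.0.0.0 --port 30000
```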
Audio-specific model.

Phi-4-multimodal (Audio)
Supports text, vision, and audio.

Gemma3n-Audio
Google’s audio-enabled Gemma variant.

Video Understanding
Many vision models support video input through frame sampling.

Supported Video Models
| Model | Example | Video Features |
|---|---|---|
| Qwen-VL | Qwen/Qwen3-VL-30B-A3B-Instruct | Frame sampler, video metadata |
| GLM-4v | zai-org/GLM-4.5V | Decord decoder, rotary position |
| NVILA | Efficient-Large-Model/NVILA-8B | 8 frames per clip, EVS pruning |
| LLaVA-NeXT-Video | lmms-lab/LLaVA-NeXT-Video-7B | LlavaVid architecture |
| LLaVA-OneVision | lmms-lab/llava-onevision-qwen2-7b-ov | Multiple images/video frames |
| Nemotron Nano 2.0 VL | nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 | 2 FPS, max 128 frames, EVS pruning |
Video Request Example
See the Image Request Example above, but replace `image_url` with `video_url`:
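For instance (endpoint and video URL are placeholders):

```shell
# Same OpenAI-compatible request shape, with a video_url content part
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this video."},
        {"type": "video_url", "video_url": {"url": "https://example.com/clip.mp4"}}
      ]
    }]
  }'
```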
NVILA EVS Pruning
NVILA uses Embedded Video Sparsity (EVS) to remove redundant tokens.

Performance Optimization
Keep Features on Device
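A sketch using the `--keep-mm-feature-on-device` flag from the optimization list above (model path is illustrative):

```shell
# Keep extracted multimodal features on the GPU: lower latency, higher memory
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-VL-30B-A3B-Instruct \
  --keep-mm-feature-on-device
```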
Trade GPU memory for lower latency.

Multimodal Input Limits
Control memory usage and speed. Note: `qwen_vl` processors support this config.
Concurrent Processing Control
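A sketch using the `--mm-max-concurrent-calls` flag from the optimization list above (the value is an example):

```shell
# Cap how many multimodal preprocessing calls run at once
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-VL-30B-A3B-Instruct \
  --mm-max-concurrent-calls 4
```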
Attention Backend Selection
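A sketch using the `--mm-attention-backend` flag from the optimization list above:

```shell
# Select FlashAttention 3 for the multimodal encoder
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-VL-30B-A3B-Instruct \
  --mm-attention-backend fa3
```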
Special Considerations
Gemma 3 Bidirectional Attention
Gemma 3 multimodal uses bidirectional attention between image tokens during prefill. Limitation: Only supported with Triton backend, incompatible with CUDA Graph and Chunked Prefill.

MiniCPM-o Audio/Video
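A launch sketch for MiniCPM-o; the checkpoint name is an assumption, and the remote-code flag is typically required for this family:

```shell
# Serve MiniCPM-o with audio/video support (assumed model path)
python3 -m sglang.launch_server \
  --model-path openbmb/MiniCPM-o-2_6 \
  --trust-remote-code
```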
MiniCPM-o adds audio/video support to MiniCPM-V.

GLM Models Chat Template
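A sketch of passing an explicit chat template at launch; both the model path and the template name here are assumptions, so check the model card for the correct values:

```shell
# Pass an explicit chat template for GLM vision models that need one
python3 -m sglang.launch_server \
  --model-path zai-org/GLM-4.5V \
  --chat-template glm-4v   # template name is an assumption
```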
Some GLM vision models require specific chat templates.

NVILA Mamba Cache Size
NVILA uses hybrid Mamba-Transformer architecture.

Specialized Multimodal Models
OCR Models
| Model | Command | Use Case |
|---|---|---|
| DeepSeek-OCR-2 | --model-path deepseek-ai/DeepSeek-OCR-2 | Document understanding |
| GLM-OCR | --model-path zai-org/GLM-OCR | Fast general OCR |
| DotsVLM-OCR | --model-path rednote-hilab/dots.ocr | Enhanced text extraction |
| LightOnOCR | Model-specific | Lightweight OCR |
| PaddleOCR-VL | Model-specific | PaddlePaddle OCR |
Image Generation
| Model | Capabilities |
|---|---|
| DeepSeek-Janus-Pro | Understanding + Generation |
Enterprise Models
| Model | Provider | Key Features |
|---|---|---|
| NVIDIA Nemotron Nano 2.0 VL | NVIDIA | Hybrid Mamba-Transformer, high throughput |
| Llama Nemotron Super | NVIDIA | Enterprise AI agents |
| JetVLM | Jet AI | High-performance multimodal (coming soon) |
