TensorRT-LLM supports a variety of multimodal models, enabling efficient inference with inputs beyond just text. These models combine specialized encoders for images, video, and audio with powerful LLM decoders.

Architecture Overview

Multimodal LLMs typically handle non-text inputs by combining a multimodal encoder with an LLM decoder:
1. Multimodal Input Processor: preprocesses raw multimodal input (images, audio) into a format suitable for the encoder, such as pixel values or spectrograms.
2. Multimodal Encoder: encodes the processed input into embeddings aligned with the LLM’s embedding space (e.g., vision transformers for images).
3. Integration with LLM Decoder: fuses multimodal embeddings with text embeddings as input to the LLM decoder for downstream inference.
Image/Audio → Preprocessor → Encoder → Embeddings ──┐
                                                     ├→ LLM Decoder → Output
Text Prompt ─────────────────────────────────────────┘
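The three stages above can be sketched as a simple composition. This is illustrative Python only: the stage functions, dummy "embeddings", and shapes are placeholders, not TensorRT-LLM APIs.

```python
from typing import List

Vector = List[float]

def preprocess(raw_image: bytes) -> List[Vector]:
    """Stage 1 stand-in: turn raw bytes into patch 'pixel values' (two dummy 4-byte patches)."""
    return [[float(b) / 255.0 for b in raw_image[i:i + 4]] for i in range(0, 8, 4)]

def encode(patches: List[Vector]) -> List[Vector]:
    """Stage 2 stand-in: map each patch to a dummy 3-dim 'embedding' in the LLM's space."""
    return [[sum(p)] * 3 for p in patches]

def embed_text(prompt: str) -> List[Vector]:
    """Text-side stand-in: one dummy 3-dim vector per whitespace token."""
    return [[float(len(tok))] * 3 for tok in prompt.split()]

def fuse(image_embeddings: List[Vector], text_embeddings: List[Vector]) -> List[Vector]:
    """Stage 3: fuse multimodal and text embeddings into one sequence for the decoder."""
    return image_embeddings + text_embeddings

sequence = fuse(encode(preprocess(bytes(range(8)))), embed_text("Describe this image"))
print(len(sequence))  # 5: 2 image embeddings + 3 text-token embeddings
```

The decoder then attends over the fused sequence exactly as it would over a text-only prompt.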

Supported Models

TensorRT-LLM supports a wide range of multimodal architectures:

Vision-Language Models

  • LLaVA (LLaMA + Vision)
  • VILA (Visual Language Assistant)
  • Qwen2-VL (Qwen with Vision)
  • NVILA (NVIDIA Vision-Language)
  • BLIP2 (Bootstrapped Language-Image Pre-training)
  • Nougat (Neural OCR for documents)

Audio Models

  • Whisper (Speech recognition)
  • Audio-language models (coming soon)
For the complete and up-to-date support matrix, see the Multimodal Feature Support Matrix.

Optimizations

TensorRT-LLM incorporates key optimizations to enhance multimodal inference performance:
  • In-flight batching: batches multimodal requests within the GPU executor to improve GPU utilization and throughput. Context-phase (image encoding) and generation-phase requests are batched together.
  • CPU/GPU concurrency: asynchronously overlaps data preprocessing on the CPU with image encoding on the GPU, reducing end-to-end latency.
  • Multimodal KV cache reuse: leverages image hashes and token chunk information to improve KV cache reuse and minimize collisions. Identical images across requests share cached encoder outputs.
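Conceptually, the reuse mechanism keys cached encoder outputs on image content. A minimal sketch of such content-addressed caching, using a SHA-256 digest as the cache key (an illustration, not TensorRT-LLM's actual hashing scheme):

```python
import hashlib

encoder_cache = {}  # cache key -> encoder output

def image_cache_key(image_bytes: bytes) -> str:
    """Deterministic cache key derived from raw image content."""
    return hashlib.sha256(image_bytes).hexdigest()

def encode_with_reuse(image_bytes: bytes) -> list:
    """Return the cached encoder output when an identical image was already encoded."""
    key = image_cache_key(image_bytes)
    if key not in encoder_cache:
        encoder_cache[key] = [float(b) for b in image_bytes[:4]]  # stand-in for real encoding
    return encoder_cache[key]

a = encode_with_reuse(b"same image bytes")
b = encode_with_reuse(b"same image bytes")  # cache hit: encoder work is skipped
print(a is b)  # True
```

Because the key depends only on content, identical images sent by different requests map to the same cache entry.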

Quick Start

Basic Usage

Run a vision-language model with a single image:
from tensorrt_llm import LLM
from tensorrt_llm.inputs import TextPrompt
from PIL import Image

# Load image
image = Image.open("/path/to/image.jpg")

# Create multimodal prompt
prompt = TextPrompt(
    prompt="Describe this image in detail.",
    multi_modal_data={"image": [image]}
)

# Initialize model
llm = LLM(model="Efficient-Large-Model/NVILA-8B")

# Generate
outputs = llm.generate([prompt])
print(outputs[0].outputs[0].text)

Multiple Images

Process multiple images in a single prompt:
from tensorrt_llm import LLM
from tensorrt_llm.inputs import TextPrompt
from PIL import Image

image1 = Image.open("/path/to/image1.jpg")
image2 = Image.open("/path/to/image2.jpg")

prompt = TextPrompt(
    prompt="What are the differences between these two images?",
    multi_modal_data={"image": [image1, image2]}
)

llm = LLM(model="Efficient-Large-Model/NVILA-8B")
outputs = llm.generate([prompt])

KV Cache Reuse with UUIDs

For better cache management across sessions, provide custom UUIDs:
from tensorrt_llm import LLM
from tensorrt_llm.inputs import TextPrompt
from PIL import Image

image1 = Image.open("/path/to/image1.jpg")
image2 = Image.open("/path/to/image2.jpg")

prompt = TextPrompt(
    prompt="Describe these images.",
    multi_modal_data={"image": [image1, image2]},
    multi_modal_uuids={"image": ["image-001", "image-002"]}
)

llm = LLM(model="Efficient-Large-Model/NVILA-8B")
outputs = llm.generate([prompt])
Why use UUIDs? Custom UUIDs enable deterministic cache management. The same UUID + content combination always produces the same cache key, allowing you to:
  • Track cache entries externally
  • Implement per-user cache isolation
  • Pre-warm cache with known images
  • Manage cache lifecycle across sessions
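A sketch of how a UUID-based key could combine with content hashing (a hypothetical helper, not the library's internal logic): the caller-supplied UUID becomes a deterministic component of the key alongside the image content.

```python
import hashlib
from typing import Optional

def multimodal_cache_key(image_bytes: bytes, uuid: Optional[str] = None) -> str:
    """Combine a caller-supplied UUID (when given) with image content into one stable key."""
    h = hashlib.sha256()
    if uuid is not None:
        h.update(uuid.encode("utf-8"))  # deterministic, caller-controlled component
    h.update(image_bytes)
    return h.hexdigest()

# Same UUID + same content -> same key, across processes and sessions
k1 = multimodal_cache_key(b"pixels", uuid="image-001")
k2 = multimodal_cache_key(b"pixels", uuid="image-001")
print(k1 == k2)  # True
```

Since the key is reproducible from inputs you control, you can compute it offline, e.g. to pre-warm a cache or track entries in an external store.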

Serving Multimodal Models

Start OpenAI-Compatible Server

Launch a server with multimodal support:
trtllm-serve Qwen/Qwen2-VL-7B-Instruct --backend pytorch

Send Requests with Images

import openai
import base64

# Encode image to base64
with open("/path/to/image.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image_data}"
                    }
                }
            ]
        }
    ]
)

print(response.choices[0].message.content)

Benchmarking

Evaluate multimodal inference performance:
trtllm-bench \
  --model Qwen/Qwen2-VL-7B-Instruct \
  throughput \
  --dataset /path/to/multimodal_dataset.json \
  --num_requests 100
For detailed benchmarking instructions, see the performance benchmarking guide.

Configuration Options

Disable KV Cache Reuse

For testing or when cache reuse is not beneficial:
python quickstart_multimodal.py \
  --model Efficient-Large-Model/NVILA-8B \
  --modality image \
  --disable_kv_cache_reuse
Or in Python:
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    enable_block_reuse=False
)

llm = LLM(
    model="Efficient-Large-Model/NVILA-8B",
    kv_cache_config=kv_cache_config
)

Multimodal-Specific Cache Settings

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,           # Enable cross-request reuse
    free_gpu_memory_fraction=0.9,      # Allocate 90% of free GPU memory
    dtype='fp8'                         # Use FP8 KV cache (2x memory savings)
)

llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    kv_cache_config=kv_cache_config
)
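The "2x memory savings" from an FP8 KV cache follows directly from element width: each cached key/value element shrinks from 2 bytes (FP16) to 1 byte. A back-of-the-envelope calculation with illustrative dimensions (the layer/head counts below are placeholders, not Qwen2-VL's actual config):

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_elem: int) -> int:
    """KV cache footprint per token: 2 tensors (K and V) per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

fp16 = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_elem=2)
fp8 = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_elem=1)
print(fp16, fp8, fp16 // fp8)  # 131072 65536 2 -- FP8 halves the footprint
```

Halving bytes per token doubles how many tokens fit in the same `free_gpu_memory_fraction`, which matters for multimodal workloads where image embeddings consume many context tokens.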

Model-Specific Examples

LLaVA

from tensorrt_llm import LLM
from tensorrt_llm.inputs import TextPrompt
from PIL import Image

image = Image.open("/path/to/image.jpg")

prompt = TextPrompt(
    prompt="USER: <image>\nWhat is shown in this image?\nASSISTANT:",
    multi_modal_data={"image": [image]}
)

llm = LLM(model="llava-hf/llava-1.5-7b-hf")
outputs = llm.generate([prompt])

NVILA

from tensorrt_llm import LLM
from tensorrt_llm.inputs import TextPrompt
from PIL import Image

image = Image.open("/path/to/image.jpg")

prompt = TextPrompt(
    prompt="Describe this image in detail.",
    multi_modal_data={"image": [image]}
)

llm = LLM(model="Efficient-Large-Model/NVILA-8B")
outputs = llm.generate([prompt])

Qwen2-VL

from tensorrt_llm import LLM
from tensorrt_llm.inputs import TextPrompt
from PIL import Image

image = Image.open("/path/to/image.jpg")

prompt = TextPrompt(
    prompt="<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n",
    multi_modal_data={"image": [image]}
)

llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct")
outputs = llm.generate([prompt])

Best Practices

Image handling:
  • Resize images to the model’s expected resolution before inference
  • Use an appropriate image format (JPEG, PNG) based on content
  • Normalize pixel values according to model requirements
  • Batch multiple images when possible for better throughput

Caching and memory:
  • Enable enable_block_reuse=True for scenarios with repeated images
  • Use custom multi_modal_uuids for deterministic cache keys
  • Allocate sufficient GPU memory for the KV cache (90%+ of free memory)
  • Consider an FP8 KV cache for 2x memory savings

Prompting:
  • Follow model-specific prompt templates (LLaVA uses USER:/ASSISTANT:, Qwen uses special tokens)
  • Place image tokens where the model expects them
  • Be explicit about what you want the model to analyze
  • For multiple images, clearly reference which image you’re asking about

Performance:
  • Use in-flight batching to mix image encoding and text generation
  • Enable CPU/GPU concurrency for image preprocessing
  • Monitor cache hit rates for repeated images
  • Benchmark with representative workloads
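The model-specific prompt templates mentioned above can be centralized in a small helper. The two templates below are taken from the examples in this guide; the helper itself is just an illustration of keeping templates out of call sites.

```python
# Prompt templates from this guide; consult each model card before adding others.
TEMPLATES = {
    "llava": "USER: <image>\n{question}\nASSISTANT:",
    "qwen2-vl": (
        "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
        "{question}<|im_end|>\n<|im_start|>assistant\n"
    ),
}

def format_prompt(model_family: str, question: str) -> str:
    """Fill a model family's template with the user question."""
    return TEMPLATES[model_family].format(question=question)

print(format_prompt("llava", "What is shown in this image?"))
```

Keeping templates in one place makes it harder to mix a LLaVA-style prompt with a Qwen checkpoint, a common source of silently degraded outputs.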

Limitations

  • Vision components use FP16 by default (cannot be quantized independently)
  • Some models have specific image resolution requirements
  • Multi-image support varies by model architecture
  • Video inputs are supported only for specific models (check support matrix)

Complete Example

Here’s a full example with all best practices:
from tensorrt_llm import LLM
from tensorrt_llm.inputs import TextPrompt
from tensorrt_llm.llmapi import KvCacheConfig
from tensorrt_llm.sampling_params import SamplingParams
from PIL import Image

# Load and prepare images
image1 = Image.open("/path/to/product1.jpg")
image2 = Image.open("/path/to/product2.jpg")

# Generate stable UUIDs based on image content or external IDs
image1_uuid = "product-image-12345"
image2_uuid = "product-image-67890"

# Configure KV cache with reuse
kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,
    free_gpu_memory_fraction=0.9,
    dtype='fp8',
    host_cache_size=2*1024**3  # 2GB host cache for overflow
)

# Initialize model
llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    kv_cache_config=kv_cache_config
)

# Create prompts with UUIDs for cache management
prompts = [
    TextPrompt(
        prompt="Describe the product in this image.",
        multi_modal_data={"image": [image1]},
        multi_modal_uuids={"image": [image1_uuid]}
    ),
    TextPrompt(
        prompt="Describe the product in this image.",
        multi_modal_data={"image": [image2]},
        multi_modal_uuids={"image": [image2_uuid]}
    ),
    TextPrompt(
        prompt="What are the differences between these products?",
        multi_modal_data={"image": [image1, image2]},
        multi_modal_uuids={"image": [image1_uuid, image2_uuid]}
    )
]

# Configure sampling
sampling_params = SamplingParams(
    max_tokens=200,
    temperature=0.7
)

# Generate (third prompt reuses cached encodings from first two)
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
    print("-" * 80)

Additional Resources

Multimodal Examples

Complete quickstart example for multimodal models

Supported Models

Full multimodal model support matrix

Serving Script

Example serving client for multimodal requests

Benchmarking Guide

Measure multimodal inference performance
