vLLM provides multiple optimization strategies to improve throughput, reduce latency, and maximize GPU utilization. This guide covers the key optimization techniques.
Running out of memory? See the conserving memory guide for memory optimization strategies.

Preemption

Due to the autoregressive nature of transformer architecture, vLLM may need to preempt requests when KV cache space is insufficient.

Understanding preemption

When you see this warning:
WARNING: Sequence group 0 is preempted by PreemptionMode.RECOMPUTE mode 
because there is not enough KV cache space. This can affect the 
end-to-end performance. total_cumulative_preemption_cnt=1
It means vLLM is evicting requests from KV cache and will recompute them later. While this ensures robustness, frequent preemptions hurt performance.
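To reason about how much KV cache a request consumes, you can estimate the per-token footprint from the model config. A rough sketch follows; the Llama-3.1-8B numbers are illustrative assumptions, so check the model's `config.json` for your deployment:

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: 2 (K and V) x layers x KV heads x head dim x dtype size."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Illustrative GQA numbers for Llama-3.1-8B: 32 layers, 8 KV heads, head_dim 128, fp16
per_token = kv_cache_bytes_per_token(32, 8, 128)
print(per_token)  # 131072 bytes = 128 KiB per token

# A 16 GiB KV cache pool would then hold this many tokens across all requests:
print(16 * 1024**3 // per_token)  # 131072 tokens
```

If concurrent requests routinely need more tokens than the pool holds, preemptions are inevitable and the knobs below are worth tuning.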

Reducing preemption

To minimize preemption:
gpu_memory_utilization
float
Increase GPU memory utilization (default: 0.9).
llm = LLM(model="MODEL", gpu_memory_utilization=0.95)
Provides more KV cache space but leaves less headroom for temporary allocations.
max_num_seqs
int
Decrease maximum concurrent sequences.
llm = LLM(model="MODEL", max_num_seqs=128)
Reduces concurrent batch size, requiring less KV cache.
max_num_batched_tokens
int
Decrease maximum batched tokens.
llm = LLM(model="MODEL", max_num_batched_tokens=8192)
tensor_parallel_size
int
Increase tensor parallelism.
llm = LLM(model="MODEL", tensor_parallel_size=4)
Distributes model across more GPUs, leaving more memory per GPU for KV cache. May add communication overhead.

Monitoring preemption

Monitor preemption through:
  • Prometheus metrics exposed by vLLM
  • Log statistics with disable_log_stats=False
In vLLM V1, the default preemption mode is RECOMPUTE rather than SWAP, as recomputation has lower overhead in the V1 architecture.
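One way to watch preemptions programmatically is to scrape the `/metrics` endpoint and pull out the preemption counter. The sketch below parses Prometheus text exposition; the metric name `vllm:num_preemptions_total` is an assumption, so confirm it against your deployment's actual `/metrics` output:

```python
def parse_counter(metrics_text: str, name: str) -> float:
    """Extract a counter value from Prometheus text exposition format."""
    for line in metrics_text.splitlines():
        # Skip "# HELP" / "# TYPE" comment lines; match the sample line itself.
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[-1])
    raise KeyError(name)

sample = """\
# HELP vllm:num_preemptions_total Cumulative number of preemptions
# TYPE vllm:num_preemptions_total counter
vllm:num_preemptions_total{model_name="MODEL"} 3.0
"""
print(parse_counter(sample, "vllm:num_preemptions_total"))  # 3.0
```

A counter that grows steadily under load is the signal to apply the tuning options above.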

Chunked prefill

Chunked prefill allows vLLM to process large prefills in smaller chunks and batch them with decode requests.

Benefits

  • Better throughput: Balances compute-bound (prefill) and memory-bound (decode) operations
  • Improved inter-token latency: Decode requests are prioritized
  • Higher GPU utilization: Co-locates prefill and decode in the same batch

How it works

In V1, chunked prefill is enabled by default. The scheduler:
  1. Prioritizes all pending decode requests
  2. Schedules prefills with remaining max_num_batched_tokens budget
  3. Automatically chunks large prefills that don’t fit
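The three steps above can be sketched as a simple token-budget loop. This is a deliberate simplification of the real V1 scheduler, with illustrative names:

```python
def schedule_step(decodes, prefills, max_num_batched_tokens):
    """Return (kind, tokens) pairs for one scheduler step.

    decodes: number of running decode requests (1 token each)
    prefills: remaining prompt lengths of waiting requests
    """
    batch = [("decode", 1)] * decodes           # 1. decodes are scheduled first
    budget = max_num_batched_tokens - decodes
    for remaining in prefills:                  # 2. spend leftover budget on prefills
        if budget <= 0:
            break
        chunk = min(remaining, budget)          # 3. chunk prefills that don't fit
        batch.append(("prefill", chunk))
        budget -= chunk
    return batch

batch = schedule_step(decodes=100, prefills=[5000, 3000], max_num_batched_tokens=2048)
# 100 decode tokens, then a 1948-token chunk of the first prefill; the rest waits
print(batch[-1])  # ('prefill', 1948)
```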

Performance tuning

max_num_batched_tokens
int
Controls chunked prefill behavior.
  • Smaller values (e.g., 2048): Better inter-token latency, because fewer prefill tokens compete with decodes in each batch
  • Larger values (e.g., 16384): Better TTFT, more prefill tokens per batch
  • For throughput: Set > 8192, especially for smaller models on large GPUs
  • Equals max_model_len: Almost equivalent to V0 default scheduling (still prioritizes decodes)
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_num_batched_tokens=16384
)
When chunked prefill is disabled, max_num_batched_tokens must be greater than max_model_len. Otherwise, vLLM may crash at startup.
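To make the trade-off concrete, the number of scheduler steps a long prompt needs is a ceiling division by the token budget (prompt lengths here are illustrative):

```python
import math

def num_prefill_chunks(prompt_len, max_num_batched_tokens):
    """Scheduler steps needed to prefill a prompt, ignoring competing decodes."""
    return math.ceil(prompt_len / max_num_batched_tokens)

print(num_prefill_chunks(20_000, 8192))  # 3 chunks (8192 + 8192 + 3616): better TTFT
print(num_prefill_chunks(20_000, 2048))  # 10 chunks: less impact on decode latency
```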

Parallelism strategies

vLLM supports multiple parallelism strategies that can be combined.

Tensor parallelism (TP)

Shards model parameters across GPUs within each layer. When to use:
  • Model too large for single GPU
  • Need more KV cache space per GPU
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=4
)

Pipeline parallelism (PP)

Distributes model layers across GPUs. When to use:
  • Maxed out efficient TP but need more distribution
  • Multi-node deployments
  • Very deep, narrow models
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=4,
    pipeline_parallel_size=2
)

Expert parallelism (EP)

Specialized parallelism for Mixture of Experts models. When to use:
  • MoE models (DeepSeekV3, Qwen3MoE, Llama-4)
  • Balance expert computation across GPUs
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",
    tensor_parallel_size=8,
    enable_expert_parallel=True
)
Uses the same degree of parallelism as TP, but applies to MoE layers instead.
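With expert parallelism, routed experts are partitioned across the parallel ranks. A toy sketch of an even partition follows; the 256-expert count is meant to match DeepSeek-V3's routed experts but should be treated as an assumption:

```python
def experts_per_rank(num_experts, ep_size):
    """Evenly partition experts across EP ranks; early ranks absorb any remainder."""
    base, extra = divmod(num_experts, ep_size)
    return [base + (1 if rank < extra else 0) for rank in range(ep_size)]

print(experts_per_rank(256, 8))  # [32, 32, 32, 32, 32, 32, 32, 32]
```

Each rank then only computes its own experts' FFNs, with tokens routed between ranks as needed.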

Data parallelism (DP)

Replicates entire model across GPU sets. When to use:
  • Enough GPUs to replicate model
  • Need to scale throughput, not model size
  • Multi-user environments with request batch isolation
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,
    data_parallel_size=4
)
MoE layers are sharded by tensor_parallel_size × data_parallel_size.
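Since the parallel degrees multiply, the world size a deployment needs is easy to check. A quick helper, matching the example above:

```python
def total_gpus(tp=1, pp=1, dp=1):
    """World size = tensor x pipeline x data parallel degrees."""
    return tp * pp * dp

print(total_gpus(tp=2, dp=4))  # 8 GPUs; MoE layers shard tp * dp = 8 ways
```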

Batch-level DP for multi-modal encoders

For multi-modal models, you can use batch-level DP to shard input data instead of weights. Benefits:
  • Reduces communication overhead (no all-reduce after every layer)
  • 10% throughput improvement for TP=8
  • 40% additional improvement for Conv3D operations
Trade-off:
  • Slightly higher memory usage (encoder weights replicated)
llm = LLM(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    tensor_parallel_size=4,
    mm_encoder_tp_mode="data"  # Batch-level DP for vision encoder
)
Batch-level DP is independent from API request-level DP (controlled by data_parallel_size).
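Conceptually, batch-level DP splits the multi-modal items across the TP ranks instead of splitting the encoder's weights. A minimal round-robin sketch (the real sharding strategy may differ):

```python
def shard_batch(items, tp_size):
    """Round-robin shard a batch of media items across tp_size ranks."""
    return [items[rank::tp_size] for rank in range(tp_size)]

images = ["img0", "img1", "img2", "img3", "img4"]
print(shard_batch(images, 4))
# [['img0', 'img4'], ['img1'], ['img2'], ['img3']]
```

Each rank runs the full (replicated) encoder on its own slice, so no all-reduce is needed between encoder layers.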
Supported models:
  • dots_ocr
  • GLM-4.1V or above
  • InternVL
  • Kimi-VL
  • Llama4
  • MiniCPM-V-2.5 or above
  • Qwen2-VL or above
  • Step3

Input processing optimization

Parallel processing

Scale input processing with API server scale-out:
# 4 API processes, 1 engine core
vllm serve Qwen/Qwen2.5-VL-3B-Instruct --api-server-count 4

# 4 API processes, 2 engine cores
vllm serve Qwen/Qwen2.5-VL-3B-Instruct --api-server-count 4 -dp 2
API server scale-out is only available for online inference.
By default, 8 CPU threads per API server load media items. With API server scale-out, adjust VLLM_MEDIA_LOADING_THREAD_COUNT to avoid CPU exhaustion:
export VLLM_MEDIA_LOADING_THREAD_COUNT=4
vllm serve MODEL --api-server-count 4
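The reason this override matters is that media-loading threads multiply by the API server count. A quick check (the default of 8 threads per server is taken from the note above):

```python
def total_media_threads(api_server_count, threads_per_server=8):
    """Total CPU threads spent on media loading across all API server processes."""
    return api_server_count * threads_per_server

print(total_media_threads(4))                        # 32 threads at the default
print(total_media_threads(4, threads_per_server=4))  # 16 after the override above
```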

Multi-modal caching

Avoid repeated processing of the same multi-modal inputs (common in multi-turn conversations).

Processor caching

Automatically enabled to cache processed multi-modal inputs.
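Conceptually, the processor cache keys on a hash of the raw multi-modal input, so an identical image in a later turn skips preprocessing. A toy LRU sketch of the idea (not vLLM's actual implementation):

```python
import hashlib
from collections import OrderedDict

class ProcessorCache:
    """Tiny LRU cache keyed by content hash, mimicking processed-input reuse."""

    def __init__(self, max_items=128):
        self._store = OrderedDict()
        self.max_items = max_items
        self.hits = 0

    def get_or_process(self, raw: bytes, process):
        key = hashlib.sha256(raw).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        value = process(raw)
        self._store[key] = value
        if len(self._store) > self.max_items:
            self._store.popitem(last=False)  # evict least recently used
        return value

cache = ProcessorCache()
cache.get_or_process(b"same image bytes", len)  # miss: processed
cache.get_or_process(b"same image bytes", len)  # hit: reused
print(cache.hits)  # 1
```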

IPC caching

Automatically enabled when there’s 1:1 correspondence between API and engine processes.

Key-replicated cache

Default mode. Cache keys exist in both processes, data only in engine process.

Shared memory cache

More efficient for multi-worker setups (TP > 1):
llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    tensor_parallel_size=2,
    mm_processor_cache_type="shm",
    mm_processor_cache_gb=8
)

Configuration

mm_processor_cache_gb
float
default:"4"
Size of the multi-modal processor cache in GiB. Set to 0 to disable caching.
llm = LLM(model="MODEL", mm_processor_cache_gb=8)
mm_processor_cache_type
str
default:"lru"
Cache type. Options: lru, shm
llm = LLM(
    model="MODEL",
    mm_processor_cache_type="shm",
    tensor_parallel_size=2
)

CPU resources for GPU deployments

vLLM V1 uses a multi-process architecture. CPU underprovisioning is a common performance bottleneck.

Minimum requirements

For deployment with N GPUs:
Minimum physical cores = 2 + N
  • 1 API server process
  • 1 engine core process
  • N GPU worker processes (1 per GPU)
Using fewer physical CPU cores than processes causes severe performance degradation. The engine core runs a busy loop and is very sensitive to CPU starvation.
With hyperthreading enabled, 1 vCPU = 1 hyperthread = 1/2 physical core, so you need at least 2 × (2 + N) vCPUs.
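The sizing rule above is easy to encode; this sketch also converts to vCPUs for hyperthreaded hosts:

```python
def min_physical_cores(num_gpus):
    """1 API server process + 1 engine core process + one worker per GPU."""
    return 2 + num_gpus

def min_vcpus(num_gpus, hyperthreading=True):
    """With hyperthreading, 1 vCPU is half a physical core, so double the count."""
    return (2 if hyperthreading else 1) * min_physical_cores(num_gpus)

print(min_physical_cores(8))  # 10 physical cores for an 8-GPU node
print(min_vcpus(8))           # 20 vCPUs when 1 vCPU = 1 hyperthread
```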

Data parallel deployments

Minimum physical cores = A + DP + N + (1 if DP > 1 else 0)
Where:
  • A = API server count (defaults to DP)
  • DP = data parallel size
  • N = total number of GPUs
Example with DP=4, TP=2 on 8 GPUs:
4 API + 4 engines + 8 workers + 1 coordinator = 17 processes

Performance impact

CPU underprovisioning affects:
  • Input processing: Tokenization, chat templates, multi-modal loading
  • Scheduling latency: Engine core scheduler dispatches tokens to GPUs
  • Output processing: Detokenization, networking, streaming responses
If GPU utilization is lower than expected, check CPU availability. More cores and higher clock speeds can significantly improve performance.

Attention backend selection

vLLM automatically selects the optimal attention backend, but you can override:
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    attention_backend="FLASHINFER"
)
Options: FLASH_ATTN, XFORMERS, FLASHINFER, TORCH_SDPA. See Attention Backend Feature Support for a detailed comparison.
