Running out of memory? See the conserving memory guide for memory optimization strategies.
## Preemption
Due to the autoregressive nature of the transformer architecture, vLLM may need to preempt requests when KV cache space is insufficient to serve all of them at once.

### Understanding preemption
vLLM logs a warning when a request is preempted. Occasional preemptions are expected, but frequent warnings mean requests are repeatedly being recomputed or swapped, which hurts end-to-end performance.

### Reducing preemption
To minimize preemption:

- Increase `gpu_memory_utilization` (default: 0.9). This provides more KV cache space, but leaves less headroom for temporary allocations.
- Decrease `max_num_seqs`. This reduces the concurrent batch size, requiring less KV cache.
- Decrease `max_num_batched_tokens`. This reduces the number of tokens processed per scheduling step.
- Increase `tensor_parallel_size`. This distributes the model across more GPUs, leaving more memory per GPU for KV cache, but may add communication overhead.
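A sketch of how these settings map onto `vllm serve` engine arguments (the model name and values are illustrative, not recommendations):

```shell
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 4096 \
  --tensor-parallel-size 2
```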
### Monitoring preemption
Monitor preemption through:

- Prometheus metrics exposed by vLLM
- Log statistics, enabled by setting `disable_log_stats=False`
In vLLM V1, the default preemption mode is `RECOMPUTE` rather than `SWAP`, as recomputation has lower overhead in the V1 architecture.

## Chunked prefill
Chunked prefill allows vLLM to process large prefills in smaller chunks and batch them together with decode requests.

### Benefits
- Better throughput: Balances compute-bound (prefill) and memory-bound (decode) operations
- Improved inter-token latency: Decode requests are prioritized
- Higher GPU utilization: Co-locates prefill and decode in the same batch
### How it works
In V1, chunked prefill is enabled by default. The scheduler:

- Prioritizes all pending decode requests
- Schedules prefills with the remaining `max_num_batched_tokens` budget
- Automatically chunks large prefills that don't fit within the budget
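The budget-based scheduling above can be sketched in plain Python. This is an illustrative toy, not vLLM's actual scheduler; the function name and request representation are invented for clarity:

```python
def schedule_step(decodes, prefills, max_num_batched_tokens):
    """Toy sketch of chunked-prefill scheduling for one step.

    decodes: number of pending decode requests (each consumes 1 token of budget).
    prefills: remaining prefill lengths, in arrival order.
    Returns (decode_tokens, prefill_chunks) scheduled this step.
    """
    budget = max_num_batched_tokens

    # Decodes are prioritized: each consumes one token of budget.
    decode_tokens = min(decodes, budget)
    budget -= decode_tokens

    # Prefills fill the remaining budget; a prefill that does not fit
    # is chunked rather than skipped.
    prefill_chunks = []
    for length in prefills:
        if budget == 0:
            break
        chunk = min(length, budget)
        prefill_chunks.append(chunk)
        budget -= chunk
    return decode_tokens, prefill_chunks

# 8 decodes and a 100-token prefill with a budget of 64:
# all decodes run, and the prefill is chunked down to 56 tokens.
print(schedule_step(8, [100, 50], 64))  # → (8, [56])
```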
### Performance tuning
`max_num_batched_tokens` controls chunked prefill behavior:
- Smaller values (e.g., 2048): Better inter-token latency, since fewer prefill tokens slow down decodes
- Larger values (e.g., 16384): Better TTFT, more prefill tokens per batch
- For throughput: Set > 8192, especially for smaller models on large GPUs
- Equal to `max_model_len`: Almost equivalent to V0 default scheduling (decodes are still prioritized)
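For a throughput-oriented deployment, this maps onto the corresponding CLI flag (model name and value are illustrative):

```shell
vllm serve Qwen/Qwen2.5-7B-Instruct --max-num-batched-tokens 16384
```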
Related papers
- Orca: A Distributed Serving System for Transformer-Based Generative Models
- Sarathi: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
## Parallelism strategies
vLLM supports multiple parallelism strategies that can be combined.

### Tensor parallelism (TP)
Shards model parameters across GPUs within each layer. When to use:

- Model is too large for a single GPU
- You need more KV cache space per GPU
### Pipeline parallelism (PP)

Distributes model layers across GPUs. When to use:

- You have maxed out efficient TP but need to distribute the model further
- Multi-node deployments
- Very deep, narrow models
### Expert parallelism (EP)

Specialized parallelism for Mixture of Experts (MoE) models. When to use:

- MoE models (DeepSeekV3, Qwen3MoE, Llama-4)
- You want to balance expert computation across GPUs
### Data parallelism (DP)

Replicates the entire model across GPU sets. When to use:

- You have enough GPUs to replicate the model
- You need to scale throughput, not model size
- Multi-user environments that benefit from request batch isolation
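These strategies compose via engine arguments. A sketch for a single 8-GPU node (model name and sizes are illustrative):

```shell
# TP=2 within each replica, PP=2 across layer groups, DP=2 replicas:
# 2 × 2 × 2 = 8 GPUs in total
vllm serve Qwen/Qwen2.5-32B-Instruct \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 2 \
  --data-parallel-size 2
```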
MoE layers are sharded across `tensor_parallel_size × data_parallel_size` GPUs.

### Batch-level DP for multi-modal encoders
For multi-modal models, you can use batch-level DP to shard input data instead of weights. Benefits:

- Reduces communication overhead (no all-reduce after every layer)
- 10% throughput improvement for TP=8
- 40% additional improvement for Conv3D operations
- Slightly higher memory usage (encoder weights are replicated)
Batch-level DP is independent from API request-level DP (controlled by `data_parallel_size`). Models known to support batch-level DP include:

- dots_ocr
- GLM-4.1V or above
- InternVL
- Kimi-VL
- Llama4
- MiniCPM-V-2.5 or above
- Qwen2-VL or above
- Step3
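In recent vLLM versions, batch-level DP for the multi-modal encoder is selected through the `mm_encoder_tp_mode` engine argument; the flag spelling below is an assumption from memory, so verify it against your version's engine argument reference (model name is illustrative):

```shell
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
  --tensor-parallel-size 8 \
  --mm-encoder-tp-mode data
```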
## Input processing optimization

### Parallel processing

Scale input processing with API server scale-out. Note that API server scale-out is only available for online inference.
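Assuming the `--api-server-count` engine argument available in vLLM V1 (check your version's reference), scale-out looks like:

```shell
# Run 4 API server processes in front of a single engine
vllm serve Qwen/Qwen2.5-7B-Instruct --api-server-count 4
```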
## Multi-modal caching

Avoid repeated processing of the same multi-modal inputs (common in multi-turn conversations).

### Processor caching

Automatically enabled to cache processed multi-modal inputs.

### IPC caching

Automatically enabled when there is a 1:1 correspondence between API and engine processes.

### Key-replicated cache

Default mode. Cache keys exist in both processes; data exists only in the engine process.

### Shared memory cache

More efficient for multi-worker setups (TP > 1).

### Configuration
`mm_processor_cache_gb`: Size of the multi-modal processor cache in GiB. Set to `0` to disable caching.

`mm_processor_cache_type`: Cache type. Options: `lru`, `shm`.

## CPU resources for GPU deployments
vLLM V1 uses a multi-process architecture, and CPU underprovisioning is a common performance bottleneck.

### Minimum requirements
For a deployment with N GPUs, vLLM runs:

- 1 API server process
- 1 engine core process
- N GPU worker processes (1 per GPU)

With hyperthreading enabled, 1 vCPU = 1 hyperthread = 1/2 physical core, so you need at least `2 × (2 + N)` vCPUs.

### Data parallel deployments
With data parallelism, the same reasoning applies with:

- A = API server count (defaults to DP)
- DP = data parallel size
- N = total number of GPUs
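A minimal sketch of the resulting vCPU requirement, assuming one physical core (two hyperthreads) per process and one engine core per DP rank; the function name is invented for illustration:

```python
def min_vcpus(num_gpus: int, api_servers: int = 1, dp_size: int = 1) -> int:
    """Minimum vCPUs: 2 hyperthreads per process, with one process per
    API server, per engine core (one per DP rank), and per GPU worker."""
    processes = api_servers + dp_size + num_gpus
    return 2 * processes

# Single-engine deployment with 4 GPUs: 2 × (2 + 4) = 12 vCPUs
print(min_vcpus(4))  # → 12
```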
### Performance impact

CPU underprovisioning affects:

- Input processing: Tokenization, chat templates, multi-modal loading
- Scheduling latency: The engine core scheduler dispatches tokens to GPUs
- Output processing: Detokenization, networking, streaming responses
## Attention backend selection

vLLM automatically selects the optimal attention backend, but you can override it with the `VLLM_ATTENTION_BACKEND` environment variable. Options include `FLASH_ATTN`, `XFORMERS`, `FLASHINFER`, and `TORCH_SDPA`.
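For example (model name is illustrative):

```shell
VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve Qwen/Qwen2.5-7B-Instruct
```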
See Attention Backend Feature Support for detailed comparison.
## See also
- Engine arguments - Complete engine configuration reference
- Environment variables - Runtime environment configuration
- Conserving memory guide - Memory optimization strategies