The EngineArgs dataclass contains all configuration parameters for initializing the vLLM engine. It provides fine-grained control over model loading, parallelism, memory management, and execution.
Overview
Model configuration
The name or path of a HuggingFace Transformers model.
The name or path of a HuggingFace Transformers tokenizer. If None, uses the model path.
The tokenizer mode: "auto" (fast if available) or "slow" (always use the slow tokenizer).
Trust remote code from HuggingFace when downloading the model and tokenizer.
Data type for model weights and activations. Supports "auto", "float32", "float16", or "bfloat16".
Quantization method: "awq", "gptq", "fp8", or None.
Maximum sequence length. If None, uses the model's config value.
Model revision (branch name, tag, or commit id).
Tokenizer revision (branch name, tag, or commit id).
Random seed for sampling.
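The model-configuration fields above map directly onto keyword arguments of vLLM's LLM class, which constructs an EngineArgs internally. A minimal sketch, assuming vLLM is installed and a CUDA GPU is available; the model ID is only illustrative:

```python
from vllm import LLM, SamplingParams

# Each keyword corresponds to an EngineArgs field described above.
llm = LLM(
    model="facebook/opt-125m",  # HuggingFace model name or path
    tokenizer=None,             # None -> reuse the model path
    tokenizer_mode="auto",      # fast tokenizer when available
    dtype="auto",               # infer weight dtype from the model config
    seed=0,                     # random seed for sampling
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```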
Parallelism configuration
Number of GPUs to use for tensor parallelism.
Number of pipeline stages for pipeline parallelism.
Number of data parallel replicas.
Backend for distributed execution: "ray", "mp" (multiprocessing), or None (auto-detect).
Disable custom all-reduce kernels and use NCCL instead.
Memory configuration
Fraction of GPU memory to use for model and KV cache (0.0 to 1.0).
Exact size of the KV cache per GPU in bytes. When set, overrides gpu_memory_utilization.
CPU swap space size in GiB per GPU.
Size of CPU memory in GiB for offloading model weights.
Token block size for paged attention.
Enable prefix caching to reuse KV cache for common prefixes.
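The block size sets the granularity at which paged attention allocates KV cache, so its memory cost follows directly from the model shape. A back-of-the-envelope calculation in plain Python (independent of vLLM; the 7B-style shape below is illustrative, not read from any config):

```python
def kv_block_bytes(block_size, num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes one paged-attention block occupies across all layers.

    The factor of 2 accounts for storing both keys and values.
    """
    return 2 * block_size * num_layers * num_kv_heads * head_dim * dtype_bytes

# Illustrative Llama-7B-like shape: 32 layers, 32 KV heads, head_dim 128, fp16.
per_block = kv_block_bytes(block_size=16, num_layers=32,
                           num_kv_heads=32, head_dim=128, dtype_bytes=2)
print(per_block)          # bytes per 16-token block
print(per_block / 2**20)  # 8.0 MiB
```

Dividing the KV-cache budget (what remains of gpu_memory_utilization after the weights) by this per-block cost gives the number of blocks, and hence the number of tokens, the cache can hold.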
Scheduling configuration
Maximum number of tokens to batch together. If None, uses model’s max sequence length.
Maximum number of sequences to process in a batch.
Scheduling policy: "fcfs" (first-come-first-served) or "priority".
Enable chunked prefill to process long prompts in chunks.
Execution configuration
Disable CUDA graphs and use eager execution only.
Maximum number of log probabilities to return per token.
Disable logging of statistics.
Multi-modal configuration
Maximum number of multi-modal inputs per prompt by modality type.
Additional kwargs for the multi-modal processor.
LoRA configuration
Enable LoRA adapter support.
Maximum number of LoRA adapters to cache.
Maximum LoRA rank.
Data type for LoRA weights.
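The LoRA fields are likewise passed through the LLM constructor, and individual requests then select an adapter via LoRARequest. A sketch assuming vLLM is installed with a GPU; the model ID is illustrative and "/path/to/adapter" is a placeholder for a locally available LoRA adapter:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_lora=True,   # turn on LoRA adapter support
    max_loras=2,        # adapters kept resident at once
    max_lora_rank=16,   # adapters with a higher rank are rejected
)

# "/path/to/adapter" is a placeholder, not a real path.
outputs = llm.generate(
    ["Summarize:"],
    SamplingParams(max_tokens=32),
    lora_request=LoRARequest("my_adapter", 1, "/path/to/adapter"),
)
```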
Advanced configuration
Configuration for model compilation and CUDA graphs.
Configuration for attention mechanisms.
Configuration for pooling models (embeddings, classification).
Example: Multi-GPU configuration
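A sketch of a tensor-parallel setup across four GPUs on one node (assumes vLLM and four CUDA devices; the model ID is illustrative):

```python
from vllm import LLM

# Tensor parallelism splits each layer across 4 GPUs. The "mp"
# (multiprocessing) backend suffices on a single node; "ray" is
# typically used for multi-node deployments.
llm = LLM(
    model="meta-llama/Llama-2-13b-hf",
    tensor_parallel_size=4,
    distributed_executor_backend="mp",
    gpu_memory_utilization=0.9,
)
```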
Example: Quantized model
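A sketch of loading an AWQ-quantized checkpoint (assumes vLLM and a GPU; the checkpoint ID is illustrative, and the quantization argument must match how the checkpoint was produced):

```python
from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # pre-quantized checkpoint
    quantization="awq",               # must match the checkpoint's method
    dtype="float16",
    max_model_len=4096,
)
```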
Related
- LLM - Use EngineArgs with the LLM class
- AsyncLLMEngine - Use EngineArgs with AsyncLLMEngine