The EngineArgs dataclass contains all configuration parameters for initializing the vLLM engine. It provides fine-grained control over model loading, parallelism, memory management, and execution.

Overview

from vllm.engine.arg_utils import EngineArgs

engine_args = EngineArgs(
    model="facebook/opt-125m",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    max_model_len=2048,
)

Model configuration

- model (str, required): The name or path of a HuggingFace Transformers model.
- tokenizer (str | None, default: None): The name or path of a HuggingFace Transformers tokenizer. If None, the model path is used.
- tokenizer_mode (str, default: "auto"): The tokenizer mode: "auto" (use the fast tokenizer if available) or "slow" (always use the slow tokenizer).
- trust_remote_code (bool, default: False): Trust remote code from HuggingFace when downloading the model and tokenizer.
- dtype (str, default: "auto"): Data type for model weights and activations: "auto", "float32", "float16", or "bfloat16".
- quantization (str | None, default: None): Quantization method: "awq", "gptq", "fp8", or None.
- max_model_len (int | None, default: None): Maximum sequence length. If None, uses the value from the model's config.
- revision (str | None, default: None): Model revision (branch name, tag, or commit id).
- tokenizer_revision (str | None, default: None): Tokenizer revision (branch name, tag, or commit id).
- seed (int, default: 0): Random seed for sampling.

Parallelism configuration

- tensor_parallel_size (int, default: 1): Number of GPUs to use for tensor parallelism.
- pipeline_parallel_size (int, default: 1): Number of pipeline stages for pipeline parallelism.
- data_parallel_size (int, default: 1): Number of data-parallel replicas.
- distributed_executor_backend (str | None, default: None): Backend for distributed execution: "ray", "mp" (multiprocessing), or None (auto-detect).
- disable_custom_all_reduce (bool, default: False): Disable custom all-reduce kernels and use NCCL instead.

Memory configuration

- gpu_memory_utilization (float, default: 0.9): Fraction of GPU memory (0.0 to 1.0) to use for the model weights and KV cache.
- kv_cache_memory_bytes (int | None, default: None): Exact size of the KV cache per GPU in bytes. When set, overrides gpu_memory_utilization.
- swap_space (float, default: 4): CPU swap space size in GiB per GPU.
- cpu_offload_gb (float, default: 0): Size of CPU memory in GiB for offloading model weights.
- block_size (int, default: 16): Token block size for paged attention.
- enable_prefix_caching (bool | None, default: None): Enable prefix caching to reuse KV cache entries for common prompt prefixes.
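If you want to pin the KV cache to an exact size rather than a fraction of GPU memory, kv_cache_memory_bytes takes a raw byte count. A minimal sketch of converting a GiB budget into that byte count (the 6 GiB figure is an arbitrary illustration, not a tuned recommendation):

```python
# Convert a human-readable KV cache budget into the byte count
# expected by kv_cache_memory_bytes.
GIB = 1024 ** 3  # one GiB in bytes

kv_cache_gib = 6  # illustrative per-GPU budget
kv_cache_memory_bytes = kv_cache_gib * GIB

print(kv_cache_memory_bytes)  # 6442450944
```

The resulting value can then be passed as EngineArgs(kv_cache_memory_bytes=6442450944), in which case gpu_memory_utilization is ignored for cache sizing.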

Scheduling configuration

- max_num_batched_tokens (int | None, default: None): Maximum number of tokens to batch together. If None, uses the model's maximum sequence length.
- max_num_seqs (int | None, default: None): Maximum number of sequences to process in a batch.
- scheduling_policy (str, default: "fcfs"): Scheduling policy: "fcfs" (first-come, first-served) or "priority".
- enable_chunked_prefill (bool | None, default: None): Enable chunked prefill to process long prompts in chunks.
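The scheduling knobs above interact: chunked prefill splits long prompts so that prefill and decode work can share the per-step token budget set by max_num_batched_tokens. A hedged configuration sketch (the model name and numeric values are illustrative examples, not tuned recommendations):

```python
from vllm.engine.arg_utils import EngineArgs

# Illustrative throughput-oriented scheduling settings.
engine_args = EngineArgs(
    model="facebook/opt-125m",
    enable_chunked_prefill=True,   # split long prompts into chunks
    max_num_batched_tokens=8192,   # token budget per scheduler step
    max_num_seqs=128,              # cap on concurrent sequences
    scheduling_policy="fcfs",      # first-come, first-served
)
```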

Execution configuration

- enforce_eager (bool, default: False): Disable CUDA graphs and use eager execution only.
- max_logprobs (int, default: 20): Maximum number of log probabilities to return per token.
- disable_log_stats (bool, default: False): Disable statistics logging.

Multi-modal configuration

- limit_mm_per_prompt (dict[str, int], default: {}): Maximum number of multi-modal inputs per prompt, keyed by modality type.
- mm_processor_kwargs (dict | None, default: None): Additional keyword arguments for the multi-modal processor.
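For example, a vision-language deployment might cap how many images a single prompt may contain. A minimal sketch (the model name and the limit of two images are illustrative assumptions):

```python
from vllm.engine.arg_utils import EngineArgs

# Illustrative vision-language setup: allow at most two images
# per prompt; other modalities are rejected by omission.
engine_args = EngineArgs(
    model="llava-hf/llava-1.5-7b-hf",
    limit_mm_per_prompt={"image": 2},
)
```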

LoRA configuration

- enable_lora (bool, default: False): Enable LoRA adapter support.
- max_loras (int, default: 1): Maximum number of LoRA adapters to cache.
- max_lora_rank (int, default: 16): Maximum LoRA rank.
- lora_dtype (str | None, default: None): Data type for LoRA weights.
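Putting these together, a LoRA-serving engine declares up front how many adapters it will cache and the largest rank it must support. A hedged sketch (the base model, adapter count, and rank are examples, not recommendations):

```python
from vllm.engine.arg_utils import EngineArgs

# Illustrative LoRA-serving configuration.
engine_args = EngineArgs(
    model="meta-llama/Llama-2-7b-hf",
    enable_lora=True,
    max_loras=4,       # adapters cached concurrently
    max_lora_rank=32,  # must be >= the rank of any adapter you load
)
```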

Advanced configuration

- compilation_config (dict | CompilationConfig | None, default: None): Configuration for model compilation and CUDA graphs.
- attention_config (dict | AttentionConfig | None, default: None): Configuration for attention mechanisms.
- pooler_config (PoolerConfig | None, default: None): Configuration for pooling models (embeddings, classification).

Example: Multi-GPU configuration

from vllm.engine.arg_utils import EngineArgs
from vllm import LLM

engine_args = EngineArgs(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,  # Use 4 GPUs
    dtype="float16",
    gpu_memory_utilization=0.95,
    max_model_len=4096,
    enable_prefix_caching=True,
)

llm = LLM(**vars(engine_args))  # unpack the dataclass fields as keyword arguments

Example: Quantized model

from vllm.engine.arg_utils import EngineArgs
from vllm import LLM

engine_args = EngineArgs(
    model="TheBloke/Llama-2-13B-AWQ",
    quantization="awq",
    dtype="float16",
    gpu_memory_utilization=0.9,
)

llm = LLM(**vars(engine_args))  # unpack the dataclass fields as keyword arguments
Related

  • LLM - Use EngineArgs with the LLM class
  • AsyncLLMEngine - Use EngineArgs with AsyncLLMEngine
