Engine arguments control the behavior of the vLLM engine. These arguments are used in:
  • Offline inference: Arguments to the LLM class
  • Online serving: Arguments to vllm serve command
The engine argument classes (EngineArgs and AsyncEngineArgs) combine multiple configuration classes defined in vllm.config. For detailed developer documentation, refer to these configuration classes as they are the source of truth for types, defaults, and docstrings.
Many engine arguments accept JSON strings for complex configuration objects. You can either:
  • Pass a valid JSON string: --compilation-config '{"level": 1}'
  • Pass JSON keys individually when the argument parser supports it
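For anything beyond a single key, building the JSON string programmatically avoids shell-quoting mistakes. A minimal sketch using only the standard library:

```python
import json

# Build the value for --compilation-config programmatically so nested
# lists and quotes are escaped correctly for the shell.
config = {"level": 1, "cudagraph_capture_sizes": [1, 2, 4]}
arg = json.dumps(config)

print(arg)  # {"level": 1, "cudagraph_capture_sizes": [1, 2, 4]}
```

The resulting string can be interpolated into a launch script, e.g. `--compilation-config "$arg"`.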

EngineArgs

The EngineArgs class contains all configuration options for the vLLM engine. These are organized into logical groups:

Model configuration

model
str
required
The model name or path from Hugging Face, local directory, or cloud storage (S3, GCS).
LLM(model="meta-llama/Llama-3.1-8B-Instruct")
tokenizer
str
default:"None"
The tokenizer name or path. If not specified, uses the model path.
tokenizer_mode
str
default:"auto"
The tokenizer mode.
  • auto: Automatically detect tokenizer mode
  • slow: Use slow tokenizer
  • mistral: Use Mistral tokenizer
trust_remote_code
bool
default:"false"
Trust remote code from Hugging Face when loading models.
Only enable this if you trust the model source, as it can execute arbitrary code.
dtype
str
default:"auto"
Data type for model weights and activations. Options: auto, float16, bfloat16, float32
max_model_len
int
default:"None"
Maximum sequence length supported by the model. If not specified, derived from the model config. Supports human-readable formats: 4k, 8K, 16384
quantization
str
default:"None"
Quantization method to use. Supported methods: awq, squeezellm, gptq, fp8, compressed-tensors, bitsandbytes, gguf
seed
int
default:"0"
Random seed for reproducibility.
revision
str
default:"None"
The specific model version to use (branch name, tag name, or commit ID).
tokenizer_revision
str
default:"None"
The specific tokenizer version to use.
enforce_eager
bool
default:"false"
Always use eager mode (disable CUDA graphs).
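The model-configuration flags above compose on one command line. A sketch (assumes the model is reachable on Hugging Face and fits on the available GPU; the values are illustrative, not recommendations):

```shell
# Pin dtype, context length, and seed for a reproducible deployment.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --dtype bfloat16 \
  --max-model-len 8k \
  --seed 42
```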

Parallel configuration

tensor_parallel_size
int
default:"1"
Number of tensor parallel replicas. Shards model parameters across GPUs.
vllm serve meta-llama/Llama-3.3-70B-Instruct --tensor-parallel-size 4
pipeline_parallel_size
int
default:"1"
Number of pipeline parallel stages. Distributes model layers across GPUs.
data_parallel_size
int
default:"1"
Number of data parallel replicas. Replicates the entire model.
distributed_executor_backend
str
default:"None"
Backend for distributed execution. Options: ray, mp (multiprocessing)
enable_expert_parallel
bool
default:"false"
Use expert parallelism instead of tensor parallelism for the MoE layers of MoE models.
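Tensor and pipeline parallelism multiply: the engine uses tensor_parallel_size × pipeline_parallel_size GPUs. A sketch for an 8-GPU node (GPU counts and backend choice are illustrative):

```shell
# 4-way tensor parallel within each stage, 2 pipeline stages = 8 GPUs total.
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2 \
  --distributed-executor-backend mp
```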

Cache configuration

block_size
int
default:"16"
Token block size for contiguous chunks in KV cache.
gpu_memory_utilization
float
default:"0.9"
Fraction of GPU memory (0.0 to 1.0) that the vLLM instance may use for model weights, activations, and KV cache.
Increase this if you have memory headroom. Decrease if you encounter OOM errors.
swap_space
float
default:"4.0"
CPU swap space size in GiB per GPU.
kv_cache_dtype
str
default:"auto"
Data type for KV cache storage. Options: auto, fp8, fp8_e5m2, fp8_e4m3
enable_prefix_caching
bool
default:"None"
Enable automatic prefix caching to reduce redundant computation.
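A back-of-envelope sizing calculation helps reason about how block_size, kv_cache_dtype, and gpu_memory_utilization interact. The model dimensions below are illustrative figures for a Llama-3.1-8B-like architecture, not values read from any config:

```python
# Rough KV-cache footprint per token: K and V each store
# num_kv_heads * head_dim values per layer.
num_layers = 32
num_kv_heads = 8     # grouped-query attention
head_dim = 128
dtype_bytes = 2      # fp16/bf16; an fp8 kv_cache_dtype would halve this

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
block_size = 16
bytes_per_block = bytes_per_token * block_size

print(bytes_per_token)              # 131072 bytes (128 KiB per token)
print(bytes_per_block // 2**20)     # 2 (MiB per 16-token block)
```

Dividing the memory left after weights by bytes_per_block gives a rough count of available KV cache blocks, which bounds concurrent context length.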

Scheduler configuration

max_num_batched_tokens
int
default:"None"
Maximum number of tokens to process in a single batch. Supports human-readable formats: 2k, 8K, 16384
max_num_seqs
int
default:"None"
Maximum number of sequences to process in a single batch.
enable_chunked_prefill
bool
default:"None"
Enable chunked prefill to process large prefills in smaller chunks. In V1, this is enabled by default when possible.
scheduling_policy
str
default:"fcfs"
Scheduling policy for request processing. Options: fcfs (first-come-first-served)
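These limits are the main throughput/latency tuning knobs. A sketch with illustrative values (the right numbers depend on your GPU and workload):

```shell
# Cap concurrent sequences and per-batch tokens; chunked prefill keeps
# long prompts from stalling decode steps.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8k \
  --enable-chunked-prefill
```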

Load configuration

load_format
str
default:"auto"
Format to load model weights. Options: auto, pt, safetensors, npcache, dummy, tensorizer, bitsandbytes
download_dir
str
default:"None"
Directory to download and cache model weights.
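A sketch combining the two load flags (the cache path is an illustrative placeholder):

```shell
# Pin the weight format and keep downloads on a large data volume.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --load-format safetensors \
  --download-dir /data/vllm-cache
```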

Compilation configuration

compilation_config
CompilationConfig
default:"CompilationConfig()"
Configuration for model compilation (torch.compile and CUDA graphs). Pass as a JSON string:
--compilation-config '{"level": 1, "cudagraph_capture_sizes": [1, 2, 4]}'
cudagraph_capture_sizes
list[int]
default:"None"
Batch sizes to capture in CUDA graphs. Overrides compilation_config setting.

Attention configuration

attention_config
AttentionConfig
default:"AttentionConfig()"
Configuration for attention backend selection. Pass as a JSON string or use the individual arguments below.
attention_backend
str
default:"None"
Attention backend to use. Options: FLASH_ATTN, XFORMERS, FLASHINFER, TORCH_SDPA

LoRA configuration

enable_lora
bool
default:"false"
Enable LoRA adapter support.
max_loras
int
default:"1"
Maximum number of LoRA adapters to load simultaneously.
max_lora_rank
int
default:"16"
Maximum LoRA rank.
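The LoRA flags are typically used together with --lora-modules to register adapters at startup. A sketch (the adapter name and path are illustrative placeholders):

```shell
# Serve the base model with two hot-swappable LoRA adapter slots.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 2 \
  --max-lora-rank 32 \
  --lora-modules my-adapter=/path/to/adapter
```

Requests can then select the adapter by name via the model field of the OpenAI-compatible API.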

Multi-modal configuration

limit_mm_per_prompt
dict
default:"{}"
Maximum number of multi-modal items per prompt. Example: {"image": 4, "video": 1}
mm_processor_kwargs
dict
default:"None"
Additional keyword arguments for multi-modal processor.
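A sketch for a vision-language deployment (the model name is illustrative; recent vLLM releases accept a JSON string for this flag, but verify the accepted syntax against your version):

```shell
# Cap multi-modal items per prompt to bound memory for image/video inputs.
vllm serve Qwen/Qwen2-VL-7B-Instruct \
  --limit-mm-per-prompt '{"image": 4, "video": 1}'
```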

Observability configuration

otlp_traces_endpoint
str
default:"None"
OpenTelemetry endpoint for sending traces.
disable_log_stats
bool
default:"false"
Disable logging of statistics.
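A sketch exporting traces to a local collector (the endpoint is an illustrative placeholder; 4317 is the conventional OTLP gRPC port):

```shell
# Send request traces to an OpenTelemetry collector running alongside.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --otlp-traces-endpoint http://localhost:4317
```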

Speculative decoding

speculative_config
dict
default:"None"
Configuration for speculative decoding. Pass as a JSON string with the draft model and other settings.
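A sketch using ngram-based speculation, which needs no separate draft model (the key names follow recent vLLM releases; verify them against your version's config docs):

```shell
# Propose tokens by prompt lookup instead of a draft model.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 4, "prompt_lookup_max": 3}'
```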

AsyncEngineArgs

The AsyncEngineArgs class extends EngineArgs with additional arguments specific to asynchronous engine operation:
enable_log_requests
bool
default:"false"
Enable logging of request information.
  • INFO level: Logs request ID, parameters, and LoRA request
  • DEBUG level: Logs prompt inputs (text, token IDs)
Set minimum log level via VLLM_LOGGING_LEVEL environment variable.
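Combining the flag with the environment variable, a sketch that logs prompt inputs as well as request metadata:

```shell
# DEBUG level adds prompt text and token IDs to the per-request logs.
VLLM_LOGGING_LEVEL=DEBUG vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-log-requests
```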

Usage examples

Offline inference

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.95,
    max_model_len=8192,
    enable_prefix_caching=True,
)

prompts = ["Tell me about AI"]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9)
outputs = llm.generate(prompts, sampling_params)

Online serving

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 8192 \
  --enable-prefix-caching
