Engine arguments control the behavior of the vLLM engine. These arguments are used in:
  • Offline inference: Arguments to the LLM class
  • Online serving: Arguments to vllm serve command
The engine argument classes (EngineArgs and AsyncEngineArgs) combine multiple configuration classes defined in vllm.config. For detailed developer documentation, refer to these configuration classes as they are the source of truth for types, defaults, and docstrings.
Many engine arguments accept JSON strings for complex configuration objects. You can either:
  • Pass a valid JSON string: --compilation-config '{"level": 1}'
  • Pass JSON keys individually when the argument parser supports it
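For anything beyond a single key, building the JSON string programmatically avoids shell-quoting mistakes. A minimal sketch using only the standard library:

```python
import json

# Build the value for --compilation-config programmatically so nested
# lists and quotes are escaped correctly for the shell.
config = {"level": 1, "cudagraph_capture_sizes": [1, 2, 4]}
arg = json.dumps(config)

print(arg)  # {"level": 1, "cudagraph_capture_sizes": [1, 2, 4]}
```

The resulting string can be interpolated into a launch script, e.g. `--compilation-config "$arg"`.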

EngineArgs

The EngineArgs class contains all configuration options for the vLLM engine. These are organized into logical groups:

Model configuration

model
str
required
The model name or path from Hugging Face, local directory, or cloud storage (S3, GCS).
LLM(model="meta-llama/Llama-3.1-8B-Instruct")
tokenizer
str
default:"None"
The tokenizer name or path. If not specified, uses the model path.
tokenizer_mode
str
default:"auto"
The tokenizer mode.
  • auto: Automatically detect tokenizer mode
  • slow: Use slow tokenizer
  • mistral: Use Mistral tokenizer
trust_remote_code
bool
default:"false"
Trust remote code from Hugging Face when loading models.
Only enable this if you trust the model source, as it can execute arbitrary code.
dtype
str
default:"auto"
Data type for model weights and activations. Options: auto, float16, bfloat16, float32
max_model_len
int
default:"None"
Maximum sequence length supported by the model. If not specified, derived from the model config. Supports human-readable formats: 4k, 8K, 16384
quantization
str
default:"None"
Quantization method to use. Supported methods: awq, squeezellm, gptq, fp8, compressed-tensors, bitsandbytes, gguf
seed
int
default:"0"
Random seed for reproducibility.
revision
str
default:"None"
The specific model version to use (branch name, tag name, or commit ID).
tokenizer_revision
str
default:"None"
The specific tokenizer version to use.
enforce_eager
bool
default:"false"
Always use eager mode (disable CUDA graphs).
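The model-configuration flags above compose on one command line. A sketch (assumes the model is reachable on Hugging Face and fits on the available GPU; the values are illustrative, not recommendations):

```shell
# Pin dtype, context length, and seed for a reproducible deployment.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --dtype bfloat16 \
  --max-model-len 8k \
  --seed 42
```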

Parallel configuration

tensor_parallel_size
int
default:"1"
Number of tensor parallel replicas. Shards model parameters across GPUs.
vllm serve meta-llama/Llama-3.3-70B-Instruct --tensor-parallel-size 4
pipeline_parallel_size
int
default:"1"
Number of pipeline parallel stages. Distributes model layers across GPUs.
data_parallel_size
int
default:"1"
Number of data parallel replicas. Replicates the entire model.
distributed_executor_backend
str
default:"None"
Backend for distributed execution. Options: ray, mp (multiprocessing)
enable_expert_parallel
bool
default:"false"
Use expert parallelism instead of tensor parallelism for the MoE layers of MoE models.
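Tensor and pipeline parallelism multiply: the engine uses tensor_parallel_size × pipeline_parallel_size GPUs. A sketch for an 8-GPU node (GPU counts and backend choice are illustrative):

```shell
# 4-way tensor parallel within each stage, 2 pipeline stages = 8 GPUs total.
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2 \
  --distributed-executor-backend mp
```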

Cache configuration

block_size
int
default:"16"
Token block size for contiguous chunks in KV cache.
gpu_memory_utilization
float
default:"0.9"
Fraction of GPU memory (0.0 to 1.0) that the vLLM instance may use for model weights, activations, and KV cache.
Increase this if you have memory headroom. Decrease if you encounter OOM errors.
swap_space
float
default:"4.0"
CPU swap space size in GiB per GPU.
kv_cache_dtype
str
default:"auto"
Data type for KV cache storage. Options: auto, fp8, fp8_e5m2, fp8_e4m3
enable_prefix_caching
bool
default:"None"
Enable automatic prefix caching to reduce redundant computation.
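A back-of-envelope sizing calculation helps reason about how block_size, kv_cache_dtype, and gpu_memory_utilization interact. The model dimensions below are illustrative figures for a Llama-3.1-8B-like architecture, not values read from any config:

```python
# Rough KV-cache footprint per token: K and V each store
# num_kv_heads * head_dim values per layer.
num_layers = 32
num_kv_heads = 8     # grouped-query attention
head_dim = 128
dtype_bytes = 2      # fp16/bf16; an fp8 kv_cache_dtype would halve this

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
block_size = 16
bytes_per_block = bytes_per_token * block_size

print(bytes_per_token)              # 131072 bytes (128 KiB per token)
print(bytes_per_block // 2**20)     # 2 (MiB per 16-token block)
```

Dividing the memory left after weights by bytes_per_block gives a rough count of available KV cache blocks, which bounds concurrent context length.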

Scheduler configuration

max_num_batched_tokens
int
default:"None"
Maximum number of tokens to process in a single batch. Supports human-readable formats: 2k, 8K, 16384
max_num_seqs
int
default:"None"
Maximum number of sequences to process in a single batch.
enable_chunked_prefill
bool
default:"None"
Enable chunked prefill to process large prefills in smaller chunks. In V1, this is enabled by default when possible.
scheduling_policy
str
default:"fcfs"
Scheduling policy for request processing. Options: fcfs (first-come-first-served)
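These limits are the main throughput/latency tuning knobs. A sketch with illustrative values (the right numbers depend on your GPU and workload):

```shell
# Cap concurrent sequences and per-batch tokens; chunked prefill keeps
# long prompts from stalling decode steps.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8k \
  --enable-chunked-prefill
```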

Load configuration

load_format
str
default:"auto"
Format to load model weights. Options: auto, pt, safetensors, npcache, dummy, tensorizer, bitsandbytes
download_dir
str
default:"None"
Directory to download and cache model weights.
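A sketch combining the two load flags (the cache path is an illustrative placeholder):

```shell
# Pin the weight format and keep downloads on a large data volume.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --load-format safetensors \
  --download-dir /data/vllm-cache
```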

Compilation configuration

compilation_config
CompilationConfig
default:"CompilationConfig()"
Configuration for model compilation (torch.compile and CUDA graphs). Pass as a JSON string:
--compilation-config '{"level": 1, "cudagraph_capture_sizes": [1, 2, 4]}'
cudagraph_capture_sizes
list[int]
default:"None"
Batch sizes to capture in CUDA graphs. Overrides compilation_config setting.

Attention configuration

attention_config
AttentionConfig
default:"AttentionConfig()"
Configuration for attention backend selection. Pass as a JSON string or use the individual arguments below.
attention_backend
str
default:"None"
Attention backend to use. Options: FLASH_ATTN, XFORMERS, FLASHINFER, TORCH_SDPA

LoRA configuration

enable_lora
bool
default:"false"
Enable LoRA adapter support.
max_loras
int
default:"1"
Maximum number of LoRA adapters to load simultaneously.
max_lora_rank
int
default:"16"
Maximum LoRA rank.
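The LoRA flags are typically used together with --lora-modules to register adapters at startup. A sketch (the adapter name and path are illustrative placeholders):

```shell
# Serve the base model with two hot-swappable LoRA adapter slots.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 2 \
  --max-lora-rank 32 \
  --lora-modules my-adapter=/path/to/adapter
```

Requests can then select the adapter by name via the model field of the OpenAI-compatible API.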

Multi-modal configuration

limit_mm_per_prompt
dict
default:"{}"
Maximum number of multi-modal items per prompt. Example: {"image": 4, "video": 1}
mm_processor_kwargs
dict
default:"None"
Additional keyword arguments for multi-modal processor.
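A sketch for a vision-language deployment (the model name is illustrative; recent vLLM releases accept a JSON string for this flag, but verify the accepted syntax against your version):

```shell
# Cap multi-modal items per prompt to bound memory for image/video inputs.
vllm serve Qwen/Qwen2-VL-7B-Instruct \
  --limit-mm-per-prompt '{"image": 4, "video": 1}'
```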

Observability configuration

otlp_traces_endpoint
str
default:"None"
OpenTelemetry endpoint for sending traces.
disable_log_stats
bool
default:"false"
Disable logging of statistics.
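A sketch exporting traces to a local collector (the endpoint is an illustrative placeholder; 4317 is the conventional OTLP gRPC port):

```shell
# Send request traces to an OpenTelemetry collector running alongside.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --otlp-traces-endpoint http://localhost:4317
```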

Speculative decoding

speculative_config
dict
default:"None"
Configuration for speculative decoding. Pass as a JSON string with the draft model and other settings.
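A sketch using ngram-based speculation, which needs no separate draft model (the key names follow recent vLLM releases; verify them against your version's config docs):

```shell
# Propose tokens by prompt lookup instead of a draft model.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 4, "prompt_lookup_max": 3}'
```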

AsyncEngineArgs

The AsyncEngineArgs class extends EngineArgs with additional arguments specific to asynchronous engine operation:
enable_log_requests
bool
default:"false"
Enable logging of request information.
  • INFO level: Logs request ID, parameters, and LoRA request
  • DEBUG level: Logs prompt inputs (text, token IDs)
Set minimum log level via VLLM_LOGGING_LEVEL environment variable.
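Combining the flag with the environment variable, a sketch that logs prompt inputs as well as request metadata:

```shell
# DEBUG level adds prompt text and token IDs to the per-request logs.
VLLM_LOGGING_LEVEL=DEBUG vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-log-requests
```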

Usage examples

Offline inference

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.95,
    max_model_len=8192,
    enable_prefix_caching=True,
)

prompts = ["Tell me about AI"]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9)
outputs = llm.generate(prompts, sampling_params)

Online serving

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 8192 \
  --enable-prefix-caching
