- Offline inference: Arguments to the `LLM` class
- Online serving: Arguments to the `vllm serve` command
The engine arguments (`EngineArgs` and `AsyncEngineArgs`) combine multiple configuration classes defined in `vllm.config`. For detailed developer documentation, refer to these configuration classes, as they are the source of truth for types, defaults, and docstrings.
Many engine arguments accept JSON strings for complex configuration objects. You can either:
- Pass a valid JSON string: `--compilation-config '{"level": 1}'`
- Pass JSON keys individually when the argument parser supports it
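The JSON form must parse as valid JSON once shell quoting is stripped. A small sketch of building such a flag string programmatically (the `flag` variable and quoting convention are illustrative, not a vLLM API):

```python
import json

# Illustrative only: build a JSON-valued CLI flag string.
# Wrapping the JSON in single quotes keeps the shell from
# consuming the inner double quotes.
config = {"level": 1}
flag = f"--compilation-config '{json.dumps(config)}'"
print(flag)  # --compilation-config '{"level": 1}'
```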
EngineArgs
The `EngineArgs` class contains all configuration options for the vLLM engine. These are organized into logical groups:
Model configuration
- `model`: The model name or path from Hugging Face, a local directory, or cloud storage (S3, GCS).
- `tokenizer`: The tokenizer name or path. If not specified, uses the model path.
- `tokenizer_mode`: The tokenizer mode. Options: `auto` (automatically detect the tokenizer mode), `slow` (use the slow tokenizer), `mistral` (use the Mistral tokenizer).
- `trust_remote_code`: Trust remote code from Hugging Face when loading models.
- `dtype`: Data type for model weights and activations. Options: `auto`, `float16`, `bfloat16`, `float32`.
- `max_model_len`: Maximum sequence length supported by the model. If not specified, derived from the model config. Supports human-readable formats: `4k`, `8K`, `16384`.
- `quantization`: Quantization method to use. Supported methods: `awq`, `squeezellm`, `gptq`, `fp8`, `compressed-tensors`, `bitsandbytes`, `gguf`.
- `seed`: Random seed for reproducibility.
- `revision`: The specific model version to use (branch name, tag name, or commit ID).
- `tokenizer_revision`: The specific tokenizer version to use.
- `enforce_eager`: Always use eager mode (disable CUDA graphs).
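The human-readable length formats above can be mimicked with a small helper (`parse_token_count` is a hypothetical name; vLLM performs an equivalent conversion internally, so check its own parser for authoritative behavior):

```python
def parse_token_count(value: str) -> int:
    """Parse a human-readable token count such as '4k', '8K', or '16384'.

    Assumption (verify against vLLM itself): a lowercase 'k' suffix means
    1,000 and an uppercase 'K' suffix means 1,024; bare integers pass
    through unchanged.
    """
    value = value.strip()
    if value.endswith("k"):
        return int(value[:-1]) * 1000
    if value.endswith("K"):
        return int(value[:-1]) * 1024
    return int(value)

print(parse_token_count("8K"))  # 8192
```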
Parallel configuration
- `tensor_parallel_size`: Number of tensor parallel replicas. Shards model parameters across GPUs.
- `pipeline_parallel_size`: Number of pipeline parallel stages. Distributes model layers across GPUs.
- `data_parallel_size`: Number of data parallel replicas. Replicates the entire model.
- `distributed_executor_backend`: Backend for distributed execution. Options: `ray`, `mp` (multiprocessing).
- `enable_expert_parallel`: Use expert parallelism instead of tensor parallelism for the MoE layers of MoE models.
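These degrees of parallelism compose multiplicatively, which is useful for a quick capacity sanity check (a simple arithmetic sketch, not a vLLM function):

```python
def required_gpus(tensor_parallel: int,
                  pipeline_parallel: int,
                  data_parallel: int = 1) -> int:
    # Each data-parallel replica is sharded tensor_parallel ways per
    # pipeline stage, so the sizes multiply.
    return tensor_parallel * pipeline_parallel * data_parallel

print(required_gpus(4, 2, 2))  # 16
```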
Cache configuration
- `block_size`: Token block size for contiguous chunks in the KV cache.
- `gpu_memory_utilization`: Fraction of GPU memory to use for the engine instance, including weights, activations, and KV cache (0.0 to 1.0). Increase this if you have memory headroom; decrease it if you encounter OOM errors.
- `swap_space`: CPU swap space size in GiB per GPU.
- `kv_cache_dtype`: Data type for KV cache storage. Options: `auto`, `fp8`, `fp8_e5m2`, `fp8_e4m3`.
- `enable_prefix_caching`: Enable automatic prefix caching to reduce redundant computation.
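Because the KV cache is managed in fixed-size token blocks, a sequence's cache footprint can be estimated as a block count (a simplified sketch; `blocks_needed` is not a vLLM API):

```python
import math

def blocks_needed(seq_len: int, block_size: int = 16) -> int:
    # A sequence of seq_len tokens occupies ceil(seq_len / block_size)
    # KV-cache blocks; the last block may be only partially filled.
    return math.ceil(seq_len / block_size)

print(blocks_needed(1000))  # 63
```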
Scheduler configuration
- `max_num_batched_tokens`: Maximum number of tokens to process in a single batch. Supports human-readable formats: `2k`, `8K`, `16384`.
- `max_num_seqs`: Maximum number of sequences to process in a single batch.
- `enable_chunked_prefill`: Enable chunked prefill to process large prefills in smaller chunks. In V1, this is enabled by default when possible.
- `scheduling_policy`: Scheduling policy for request processing. Options: `fcfs` (first-come-first-served), `priority`.
Load configuration
- `load_format`: Format to load model weights. Options: `auto`, `pt`, `safetensors`, `npcache`, `dummy`, `tensorizer`, `bitsandbytes`.
- `download_dir`: Directory to download and cache model weights.
Compilation configuration
- `compilation_config`: Configuration for model compilation (`torch.compile` and CUDA graphs). Pass as a JSON string.
- `cuda_graph_sizes`: Batch sizes to capture in CUDA graphs. Overrides the `compilation_config` setting.
Attention configuration
- Attention backend selection: pass as a JSON string or use individual arguments. Backend options: `FLASH_ATTN`, `XFORMERS`, `FLASHINFER`, `TORCH_SDPA`.
LoRA configuration
- `enable_lora`: Enable LoRA adapter support.
- `max_loras`: Maximum number of LoRA adapters to load simultaneously.
- `max_lora_rank`: Maximum LoRA rank.
Multi-modal configuration
- `limit_mm_per_prompt`: Maximum number of multi-modal items per prompt. Example: `{"image": 4, "video": 1}`.
- `mm_processor_kwargs`: Additional keyword arguments for the multi-modal processor.
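The per-prompt limits behave like per-modality caps, which can be illustrated with a small validator (`within_limits` is hypothetical; vLLM enforces these limits itself):

```python
import json

# Example limits in the same JSON shape the argument accepts.
limits = json.loads('{"image": 4, "video": 1}')

def within_limits(items: dict, limits: dict) -> bool:
    # True if every modality in the prompt stays at or under its cap;
    # modalities absent from the limits dict are treated as disallowed.
    return all(limits.get(modality, 0) >= count
               for modality, count in items.items())

print(within_limits({"image": 2, "video": 1}, limits))  # True
```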
Observability configuration
- `otlp_traces_endpoint`: OpenTelemetry endpoint for sending traces.
- `disable_log_stats`: Disable logging of statistics.
Speculative decoding
- `speculative_config`: Configuration for speculative decoding. Pass as a JSON string with the draft model and other settings.
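The core idea, a draft model proposes tokens that the target model then verifies, can be sketched with greedy verification (a simplification; real speculative sampling uses probabilistic acceptance):

```python
def accepted_prefix(draft_tokens: list[int], target_tokens: list[int]) -> int:
    # Count how many leading draft tokens the target model agrees with;
    # generation keeps that prefix and resumes from the first mismatch.
    accepted = 0
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break
        accepted += 1
    return accepted

print(accepted_prefix([5, 9, 2, 7], [5, 9, 4, 7]))  # 2
```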
AsyncEngineArgs
The `AsyncEngineArgs` class extends `EngineArgs` with arguments specific to asynchronous engine operation:
- `enable_log_requests`: Enable logging of request information.
    - At INFO level: logs the request ID, parameters, and LoRA request
    - At DEBUG level: logs prompt inputs (text, token IDs)
The logging level is controlled by the `VLLM_LOGGING_LEVEL` environment variable.

Usage examples
Offline inference
Online serving
See also
- Server arguments - Additional arguments for the OpenAI-compatible API server
- Environment variables - Runtime environment variable configuration
- Optimization guide - Performance tuning and optimization strategies