## Installation-time variables

These variables affect vLLM compilation and installation.

### Build configuration

- Target device for vLLM. Options: `cuda`, `rocm`, `cpu`.
- Main CUDA version for vLLM (follows PyTorch, but can be overridden).
- Maximum number of parallel compilation jobs. Defaults to the number of CPUs.
- Number of threads for nvcc compilation. If set, `MAX_JOBS` is reduced to avoid CPU oversubscription.
- CMake build type. Options: `Debug`, `Release`, `RelWithDebInfo`.
- Print verbose logs during installation.
### Precompiled binaries

- Use precompiled binaries (`*.so` files).
- Skip adding the `+precompiled` suffix to the version string.
## Runtime variables

These variables configure vLLM’s runtime behavior.

### Cache and storage

- Root directory for vLLM cache files. Respects `XDG_CACHE_HOME` if set.
- Root directory for vLLM configuration files. Respects `XDG_CONFIG_HOME` if set.
- Path to the cache for storing downloaded assets.
### Distributed execution

- IP address of the current node for distributed execution. Set this differently on each node when using multi-node inference.
- Port for vLLM internal communication.
- Path to the NCCL library file. Needed because `nccl>=2.19` from PyTorch may contain bugs.
### Logging

- Whether vLLM should configure logging. Set to `0` to disable vLLM’s logging configuration.
- Default logging level. Options: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`.
- Path to a logging config JSON file, applied to both vLLM and uvicorn.
- Control colored logging output. Options: `auto`, `1` (always), `0` (never).
- Standard Unix flag for disabling ANSI color codes.
- Interval in seconds at which to log statistics.
### Model loading

- Load models from ModelScope instead of the Hugging Face Hub.
- Path to a JSON file or a space-separated values table mapping model repo IDs to local folders.
- Hugging Face API token for private models.
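The repo-ID-to-local-folder mapping described above could look like the following. This is a hypothetical sketch; the model ID and path are purely illustrative:

```json
{
  "meta-llama/Llama-3.2-1B": "/models/llama-3.2-1b"
}
```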
### Engine configuration

- Timeout for each engine iteration (seconds).
- Timeout for engine cores to become ready during startup (seconds).
- Allow a maximum sequence length greater than the maximum length from the model config.
### Multi-modal settings

- Timeout for fetching images when serving multi-modal models (seconds).
- Timeout for fetching videos (seconds).
- Timeout for fetching audio (seconds).
- Maximum audio file size in MB for speech-to-text requests.
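As a minimal sketch, the fetch timeouts above are set as environment variables before launching the server. The variable names below follow vLLM's `VLLM_*_FETCH_TIMEOUT` convention and the values are illustrative; verify both against your vLLM version:

```shell
# Allow up to 30 s per image/audio download and 60 s per video (example values)
export VLLM_IMAGE_FETCH_TIMEOUT=30
export VLLM_VIDEO_FETCH_TIMEOUT=60
export VLLM_AUDIO_FETCH_TIMEOUT=30
```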
### GPU and memory

- Control visible GPU devices.
- PyTorch float32 matmul precision mode. Options: `highest`, `high`, `medium`.
- Chunk size for fused MoE operations.
### Compilation and optimization

- Enable ahead-of-time compilation. Automatically enabled on PyTorch >= 2.10.0.
- Disable the torch.compile cache.
- Enable Inductor `max_autotune` for better performance.
### Debugging

- Trace function calls for debugging. Set to `1` to enable.
- Dump FX graphs to the specified directory for debugging.
- Debug pattern matching inside custom passes. Set to an `fx.Node` name (e.g., `getitem_34`).
## Platform-specific variables

### CPU backend

- CPU key-value cache space in GB. Defaults to 4 GB if not set.
- CPU core IDs bound by OpenMP threads. Examples: `"0-31"`, `"0,1,2"`, `"0-31|32-63"`.

### ROCm backend

- Chunk size in MB for sleeping memory allocations under ROCm.
- Pad FP8 weights to 256 bytes for ROCm.
- Use the custom paged attention kernel for MI3* cards.
### XLA/TPU

- Path to the XLA persistent cache directory.
- Enable SPMD mode for the TPU backend.
### Ray settings

- Channel type for Ray Compiled Graph communication. Options: `auto`, `nccl`, `shm`.
- Multiprocessing context for workers. Options: `fork`, `spawn`.

### Usage statistics

- Disable usage statistics collection.
- Alternative flag to disable usage tracking.
- Server URL for usage statistics.
## Usage examples
### Set logging level
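A minimal sketch, assuming the `VLLM_LOGGING_LEVEL` variable documented in the logging section above; check the accepted values against your vLLM version:

```shell
# Raise vLLM's log verbosity to DEBUG for the current shell session
export VLLM_LOGGING_LEVEL=DEBUG
```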
### Configure multi-node setup
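A sketch of the per-node setup described under "Distributed execution": run the equivalent of the following on every node, substituting that node's own address. The addresses and port are illustrative, and the variable names should be verified against your vLLM version:

```shell
# Node 0 of the cluster; use 192.168.0.11 on node 1, and so on
export VLLM_HOST_IP=192.168.0.10
# Port for vLLM internal communication (any free port works)
export VLLM_PORT=54321
```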
### Enable debugging
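A minimal debugging setup combining the tracing and logging variables described above; function tracing is very slow, so this is for debugging sessions only. Variable names assume vLLM's documented environment list:

```shell
# Trace every function call (significant overhead; debugging only)
export VLLM_TRACE_FUNCTION=1
# Use the most verbose log level alongside tracing
export VLLM_LOGGING_LEVEL=DEBUG
```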
### Use ModelScope instead of Hugging Face
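As a sketch, the ModelScope switch described under "Model loading" is a single environment variable; verify the name and accepted value against your vLLM version:

```shell
# Resolve model IDs against ModelScope rather than the Hugging Face Hub
export VLLM_USE_MODELSCOPE=True
```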
### Disable usage statistics
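The "Usage statistics" section above lists two flags for opting out; setting either should suffice. A sketch, assuming the variable names from vLLM's documented environment list:

```shell
# vLLM-specific opt-out flag
export VLLM_NO_USAGE_STATS=1
# Generic cross-tool opt-out flag, also honored by vLLM
export DO_NOT_TRACK=1
```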
## See also
- Engine arguments - Engine configuration arguments
- Server arguments - Server configuration arguments
- Optimization guide - Performance tuning strategies