vLLM uses environment variables to configure system behavior, hardware settings, optimization flags, and debugging options.
Important notes:
  • VLLM_PORT and VLLM_HOST_IP set the port and IP for vLLM’s internal usage, not for the API server. Do not use --host $VLLM_HOST_IP or --port $VLLM_PORT to start the API server.
  • All vLLM environment variables are prefixed with VLLM_. Kubernetes users: Do not name your service vllm, as Kubernetes automatically creates environment variables with the capitalized service name as prefix, which will conflict with vLLM’s variables.

Installation time variables

These variables affect vLLM compilation and installation.

Build configuration

VLLM_TARGET_DEVICE
str
default:"cuda"
Target device for vLLM. Options: cuda, rocm, cpu
VLLM_MAIN_CUDA_VERSION
str
default:"12.9"
Main CUDA version for vLLM (follows PyTorch but can be overridden).
MAX_JOBS
str
default:"None"
Maximum number of parallel compilation jobs. Defaults to the number of CPUs.
NVCC_THREADS
str
default:"None"
Number of threads for nvcc compilation. If set, MAX_JOBS will be reduced to avoid CPU oversubscription.
CMAKE_BUILD_TYPE
str
default:"None"
CMake build type. Options: Debug, Release, RelWithDebInfo
VERBOSE
bool
default:"false"
Print verbose logs during installation.

Precompiled binaries

VLLM_USE_PRECOMPILED
bool
default:"false"
Use precompiled binaries (*.so files).
VLLM_SKIP_PRECOMPILED_VERSION_SUFFIX
bool
default:"false"
Skip adding +precompiled suffix to version string.

Runtime variables

These variables configure vLLM’s runtime behavior.

Cache and storage

VLLM_CACHE_ROOT
str
default:"~/.cache/vllm"
Root directory for vLLM cache files. Respects XDG_CACHE_HOME if set.
VLLM_CONFIG_ROOT
str
default:"~/.config/vllm"
Root directory for vLLM configuration files. Respects XDG_CONFIG_HOME if set.
VLLM_ASSETS_CACHE
str
default:"~/.cache/vllm/assets"
Path to cache for storing downloaded assets.
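
The interaction between these variables and the XDG conventions can be sketched as a resolution order. The precedence shown here (explicit VLLM_CACHE_ROOT first, then XDG_CACHE_HOME/vllm, then the ~/.cache/vllm default) is an assumption inferred from the descriptions above, not vLLM's verbatim implementation:

```python
import os

def resolve_cache_root() -> str:
    """Assumed resolution order for vLLM's cache directory:
    VLLM_CACHE_ROOT, then XDG_CACHE_HOME/vllm, then ~/.cache/vllm."""
    explicit = os.environ.get("VLLM_CACHE_ROOT")
    if explicit:
        return os.path.expanduser(explicit)
    xdg = os.environ.get("XDG_CACHE_HOME")
    if xdg:
        return os.path.join(xdg, "vllm")
    return os.path.expanduser("~/.cache/vllm")

os.environ.pop("VLLM_CACHE_ROOT", None)
os.environ["XDG_CACHE_HOME"] = "/var/cache"
print(resolve_cache_root())  # /var/cache/vllm
```

VLLM_CONFIG_ROOT follows the same pattern with XDG_CONFIG_HOME and ~/.config/vllm.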

Distributed execution

VLLM_HOST_IP
str
default:""
IP address of the current node for distributed execution. Set this differently on each node when using multi-node inference.
VLLM_PORT
int
default:"None"
Port for vLLM internal communication.
This is NOT the API server port. This is for internal distributed communication only.
VLLM_NCCL_SO_PATH
str
default:"None"
Path to the NCCL library file. Needed because nccl>=2.19 from PyTorch may contain bugs.

Logging

VLLM_CONFIGURE_LOGGING
bool
default:"true"
Whether vLLM should configure logging. Set to 0 to disable vLLM’s logging configuration.
VLLM_LOGGING_LEVEL
str
default:"INFO"
Default logging level. Options: DEBUG, INFO, WARNING, ERROR, CRITICAL
VLLM_LOGGING_CONFIG_PATH
str
default:"None"
Path to logging config JSON file for both vLLM and uvicorn.
VLLM_LOGGING_COLOR
str
default:"auto"
Control colored logging output. Options: auto, 1 (always), 0 (never)
NO_COLOR
bool
default:"false"
Standard Unix flag for disabling ANSI color codes.
VLLM_LOG_STATS_INTERVAL
float
default:"10.0"
Interval in seconds to log statistics.
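
For VLLM_LOGGING_CONFIG_PATH, the file is a JSON document in Python's standard logging dictConfig schema. The config below is a minimal illustrative sketch (the logger names and handlers are assumptions, and vLLM may impose additional requirements); it is validated the same way any dictConfig consumer would load it:

```python
import json
import logging.config
import os
import tempfile

# Minimal dictConfig-schema JSON: one console handler, DEBUG level
# for the "vllm" logger. Treat the exact keys as an illustrative sketch.
config = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "plain": {"format": "%(levelname)s %(name)s: %(message)s"},
    },
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "plain"},
    },
    "loggers": {
        "vllm": {"handlers": ["console"], "level": "DEBUG", "propagate": False},
    },
}

path = os.path.join(tempfile.mkdtemp(), "vllm_logging.json")
with open(path, "w") as f:
    json.dump(config, f)

# Sanity-check the file by loading it as a dictConfig.
with open(path) as f:
    logging.config.dictConfig(json.load(f))

os.environ["VLLM_LOGGING_CONFIG_PATH"] = path
```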

Model loading

VLLM_USE_MODELSCOPE
bool
default:"false"
Load models from ModelScope instead of Hugging Face Hub.
VLLM_MODEL_REDIRECT_PATH
str
default:"None"
Path to a JSON file or space-separated values table mapping model repo IDs to local folders. Example JSON:
{"meta-llama/Llama-3.2-1B": "/tmp/Llama-3.2-1B"}
HF_TOKEN
str
default:"None"
Hugging Face API token for private models.
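
A redirect table like the JSON example above can be exercised end to end: write the mapping, point VLLM_MODEL_REDIRECT_PATH at it, and resolve a repo ID through it. The lookup function here is a sketch for illustration, not vLLM's actual loader:

```python
import json
import os
import tempfile

# Write a redirect table matching the example JSON above.
redirects = {"meta-llama/Llama-3.2-1B": "/tmp/Llama-3.2-1B"}
path = os.path.join(tempfile.mkdtemp(), "redirects.json")
with open(path, "w") as f:
    json.dump(redirects, f)
os.environ["VLLM_MODEL_REDIRECT_PATH"] = path

def resolve_model(repo_id: str) -> str:
    """Hypothetical lookup: redirect a repo ID to a local folder if the
    table names it, otherwise pass the ID through unchanged."""
    with open(os.environ["VLLM_MODEL_REDIRECT_PATH"]) as f:
        table = json.load(f)
    return table.get(repo_id, repo_id)

print(resolve_model("meta-llama/Llama-3.2-1B"))  # /tmp/Llama-3.2-1B
```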

Engine configuration

VLLM_ENGINE_ITERATION_TIMEOUT_S
int
default:"60"
Timeout for each iteration in the engine (seconds).
VLLM_ENGINE_READY_TIMEOUT_S
int
default:"600"
Timeout for engine cores to become ready during startup (seconds).
VLLM_ALLOW_LONG_MAX_MODEL_LEN
bool
default:"false"
Allow max sequence length greater than the max length from model config.

Multi-modal settings

VLLM_IMAGE_FETCH_TIMEOUT
int
default:"5"
Timeout for fetching images when serving multi-modal models (seconds).
VLLM_VIDEO_FETCH_TIMEOUT
int
default:"30"
Timeout for fetching videos (seconds).
VLLM_AUDIO_FETCH_TIMEOUT
int
default:"10"
Timeout for fetching audio (seconds).
VLLM_MAX_AUDIO_CLIP_FILESIZE_MB
int
default:"25"
Maximum audio file size in MB for speech-to-text requests.

GPU and memory

CUDA_VISIBLE_DEVICES
str
default:"None"
Control visible GPU devices.
VLLM_FLOAT32_MATMUL_PRECISION
str
default:"highest"
PyTorch float32 matmul precision mode. Options: highest, high, medium
VLLM_FUSED_MOE_CHUNK_SIZE
int
default:"16384"
Chunk size for fused MoE operations.
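
CUDA_VISIBLE_DEVICES takes a comma-separated list of device indices, and by CUDA's rule, entries after the first invalid one are ignored. A small parser illustrating that behavior (integer indices only; device UUIDs, which CUDA also accepts, are not handled in this sketch):

```python
def visible_devices(raw: str) -> list[int]:
    """Parse a CUDA_VISIBLE_DEVICES-style integer list. Entries after
    the first invalid token are dropped, mirroring CUDA's behavior."""
    out = []
    for tok in raw.split(","):
        tok = tok.strip()
        if not tok.isdigit():
            break  # CUDA ignores everything from the first bad entry on
        out.append(int(tok))
    return out

print(visible_devices("0,2,3"))     # [0, 2, 3]
print(visible_devices("1,oops,2"))  # [1] -- parsing stops at the bad entry
```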

Compilation and optimization

VLLM_USE_AOT_COMPILE
bool
default:"auto"
Enable ahead-of-time compilation. Automatically enabled on PyTorch >= 2.10.0.
VLLM_DISABLE_COMPILE_CACHE
bool
default:"false"
Disable torch.compile cache.
VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE
bool
default:"true"
Enable Inductor max_autotune for better performance.

Debugging

VLLM_TRACE_FUNCTION
int
default:"0"
Trace function calls for debugging. Set to 1 to enable.
VLLM_DEBUG_DUMP_PATH
str
default:"None"
Dump FX graphs to the specified directory for debugging.
VLLM_PATTERN_MATCH_DEBUG
str
default:"None"
Debug pattern matching inside custom passes. Set to an fx.Node name (e.g., "getitem_34").

Platform-specific

CPU backend

VLLM_CPU_KVCACHE_SPACE
int
default:"0"
CPU key-value cache space in GB. If left at 0 (unset), 4 GB is used.
VLLM_CPU_OMP_THREADS_BIND
str
default:"auto"
CPU core IDs bound by OpenMP threads. Examples: "0-31", "0,1,2", "0-31|32-63"
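
The grammar of VLLM_CPU_OMP_THREADS_BIND can be inferred from the examples: "|" separates per-worker groups, "," separates entries within a group, and "a-b" is an inclusive range. A sketch of that expansion (the grammar is an inference from the examples above, not a spec):

```python
def parse_cpu_bind(spec: str) -> list[list[int]]:
    """Expand a VLLM_CPU_OMP_THREADS_BIND-style spec into per-worker
    core-ID lists: '|' splits workers, ',' splits entries, 'a-b' is an
    inclusive range."""
    workers = []
    for group in spec.split("|"):
        cores = []
        for entry in group.split(","):
            if "-" in entry:
                lo, hi = entry.split("-")
                cores.extend(range(int(lo), int(hi) + 1))
            else:
                cores.append(int(entry))
        workers.append(cores)
    return workers

print(parse_cpu_bind("0-3|4-7"))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

So "0-31|32-63" pins the first worker's OpenMP threads to cores 0-31 and the second worker's to cores 32-63.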

ROCm backend

VLLM_ROCM_SLEEP_MEM_CHUNK_SIZE
int
default:"256"
Chunk size in MB for sleeping memory allocations under ROCm.
VLLM_ROCM_FP8_PADDING
bool
default:"true"
Pad FP8 weights to 256 bytes for ROCm.
VLLM_ROCM_CUSTOM_PAGED_ATTN
bool
default:"true"
Use custom paged attention kernel for MI3* cards.

XLA/TPU

VLLM_XLA_CACHE_PATH
str
default:"~/.cache/vllm/xla_cache"
Path to XLA persistent cache directory.
VLLM_XLA_USE_SPMD
bool
default:"false"
Enable SPMD mode for TPU backend.

Ray settings

VLLM_USE_RAY_COMPILED_DAG_CHANNEL_TYPE
str
default:"auto"
Channel type for Ray Compiled Graph communication. Options: auto, nccl, shm
VLLM_WORKER_MULTIPROC_METHOD
str
default:"fork"
Multiprocess context for workers.Options: fork, spawn
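
The practical difference between fork and spawn is that a spawned worker re-imports the parent module, so any script that launches workers with spawn must keep its launch code under an import guard. A generic sketch (standard multiprocessing, not vLLM's worker machinery):

```python
import multiprocessing as mp
import os

def work(x: int) -> int:
    # Trivial stand-in for per-worker computation.
    return x * x

if __name__ == "__main__":
    # With spawn, child processes re-import this module, so everything
    # below this guard runs only in the parent. Fall back to fork for
    # unrecognized values; only fork and spawn are listed above.
    method = os.environ.get("VLLM_WORKER_MULTIPROC_METHOD", "fork")
    if method not in ("fork", "spawn"):
        method = "fork"
    ctx = mp.get_context(method)
    with ctx.Pool(2) as pool:
        print(pool.map(work, [1, 2, 3]))  # [1, 4, 9]
```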

Usage statistics

VLLM_NO_USAGE_STATS
bool
default:"false"
Disable usage statistics collection.
VLLM_DO_NOT_TRACK
bool
default:"false"
Alternative flag to disable usage tracking.
VLLM_USAGE_STATS_SERVER
str
default:"https://stats.vllm.ai"
Server URL for usage statistics.

Usage examples

Set logging level

export VLLM_LOGGING_LEVEL=DEBUG
vllm serve meta-llama/Llama-3.1-8B-Instruct

Configure multi-node setup

# On node 1
export VLLM_HOST_IP=192.168.1.10
vllm serve MODEL --tensor-parallel-size 4

# On node 2
export VLLM_HOST_IP=192.168.1.11
vllm serve MODEL --tensor-parallel-size 4

Enable debugging

export VLLM_TRACE_FUNCTION=1
export VLLM_DEBUG_DUMP_PATH=/tmp/vllm_debug
vllm serve MODEL

Use ModelScope instead of Hugging Face

export VLLM_USE_MODELSCOPE=true
vllm serve Qwen/Qwen2.5-7B-Instruct

Disable usage statistics

export VLLM_NO_USAGE_STATS=1
vllm serve MODEL
