## Installation-time variables

These variables affect vLLM compilation and installation.

### Build configuration

- Target device for vLLM. Options: `cuda`, `rocm`, `cpu`.
- Main CUDA version for vLLM (follows PyTorch, but can be overridden).
- Maximum number of parallel compilation jobs. Defaults to the number of CPUs.
- Number of threads for nvcc compilation. If set, `MAX_JOBS` is reduced to avoid CPU oversubscription.
- CMake build type. Options: `Debug`, `Release`, `RelWithDebInfo`.
- Print verbose logs during installation.
### Precompiled binaries

- Use precompiled binaries (`*.so` files).
- Skip adding the `+precompiled` suffix to the version string.
## Runtime variables

These variables configure vLLM’s runtime behavior.

### Cache and storage

- Root directory for vLLM cache files. Respects `XDG_CACHE_HOME` if set.
- Root directory for vLLM configuration files. Respects `XDG_CONFIG_HOME` if set.
- Path to the cache for storing downloaded assets.
### Distributed execution

- IP address of the current node for distributed execution. Set this differently on each node when using multi-node inference.
- Port for vLLM internal communication.
- Path to the NCCL library file. Needed because `nccl>=2.19` from PyTorch may contain bugs.
### Logging

- Whether vLLM should configure logging. Set to `0` to disable vLLM’s logging configuration.
- Default logging level. Options: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`.
- Path to a logging config JSON file, applied to both vLLM and uvicorn.
- Control colored logging output. Options: `auto`, `1` (always), `0` (never).
- Standard Unix flag for disabling ANSI color codes.
- Interval in seconds at which to log statistics.
### Model loading

- Load models from ModelScope instead of the Hugging Face Hub.
- Path to a JSON file or a space-separated values table mapping model repo IDs to local folders.
- Hugging Face API token for private models.
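The repo-ID-to-local-folder mapping described above could look like the following. This is a hypothetical sketch; the model ID and path are purely illustrative:

```json
{
  "meta-llama/Llama-3.2-1B": "/models/llama-3.2-1b"
}
```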
### Engine configuration

- Timeout for each engine iteration (seconds).
- Timeout for engine cores to become ready during startup (seconds).
- Allow a maximum sequence length greater than the maximum length from the model config.
### Multi-modal settings

- Timeout for fetching images when serving multi-modal models (seconds).
- Timeout for fetching videos (seconds).
- Timeout for fetching audio (seconds).
- Maximum audio file size in MB for speech-to-text requests.
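As a minimal sketch, the fetch timeouts above are set as environment variables before launching the server. The variable names below follow vLLM's `VLLM_*_FETCH_TIMEOUT` convention and the values are illustrative; verify both against your vLLM version:

```shell
# Allow up to 30 s per image/audio download and 60 s per video (example values)
export VLLM_IMAGE_FETCH_TIMEOUT=30
export VLLM_VIDEO_FETCH_TIMEOUT=60
export VLLM_AUDIO_FETCH_TIMEOUT=30
```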
### GPU and memory

- Control visible GPU devices.
- PyTorch float32 matmul precision mode. Options: `highest`, `high`, `medium`.
- Chunk size for fused MoE operations.
### Compilation and optimization

- Enable ahead-of-time compilation. Automatically enabled on PyTorch >= 2.10.0.
- Disable the torch.compile cache.
- Enable Inductor `max_autotune` for better performance.
### Debugging

- Trace function calls for debugging. Set to `1` to enable.
- Dump FX graphs to the specified directory for debugging.
- Debug pattern matching inside custom passes. Set to an `fx.Node` name (e.g., `getitem_34`).
## Platform-specific variables

### CPU backend

- CPU key-value cache space in GB. Defaults to 4 GB if not set.
- CPU core IDs bound by OpenMP threads. Examples: `"0-31"`, `"0,1,2"`, `"0-31|32-63"`.

### ROCm backend

- Chunk size in MB for sleeping memory allocations under ROCm.
- Pad FP8 weights to 256 bytes for ROCm.
- Use the custom paged attention kernel for MI3* cards.
### XLA/TPU

- Path to the XLA persistent cache directory.
- Enable SPMD mode for the TPU backend.
### Ray settings

- Channel type for Ray Compiled Graph communication. Options: `auto`, `nccl`, `shm`.
- Multiprocessing context for workers. Options: `fork`, `spawn`.

### Usage statistics

- Disable usage statistics collection.
- Alternative flag to disable usage tracking.
- Server URL for usage statistics.
## Usage examples
### Set logging level
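A minimal sketch, assuming the `VLLM_LOGGING_LEVEL` variable documented in the logging section above; check the accepted values against your vLLM version:

```shell
# Raise vLLM's log verbosity to DEBUG for the current shell session
export VLLM_LOGGING_LEVEL=DEBUG
```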
### Configure multi-node setup
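A sketch of the per-node setup described under "Distributed execution": run the equivalent of the following on every node, substituting that node's own address. The addresses and port are illustrative, and the variable names should be verified against your vLLM version:

```shell
# Node 0 of the cluster; use 192.168.0.11 on node 1, and so on
export VLLM_HOST_IP=192.168.0.10
# Port for vLLM internal communication (any free port works)
export VLLM_PORT=54321
```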
### Enable debugging
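A minimal debugging setup combining the tracing and logging variables described above; function tracing is very slow, so this is for debugging sessions only. Variable names assume vLLM's documented environment list:

```shell
# Trace every function call (significant overhead; debugging only)
export VLLM_TRACE_FUNCTION=1
# Use the most verbose log level alongside tracing
export VLLM_LOGGING_LEVEL=DEBUG
```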
### Use ModelScope instead of Hugging Face
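As a sketch, the ModelScope switch described under "Model loading" is a single environment variable; verify the name and accepted value against your vLLM version:

```shell
# Resolve model IDs against ModelScope rather than the Hugging Face Hub
export VLLM_USE_MODELSCOPE=True
```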
### Disable usage statistics
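The "Usage statistics" section above lists two flags for opting out; setting either should suffice. A sketch, assuming the variable names from vLLM's documented environment list:

```shell
# vLLM-specific opt-out flag
export VLLM_NO_USAGE_STATS=1
# Generic cross-tool opt-out flag, also honored by vLLM
export DO_NOT_TRACK=1
```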
## See also
- Engine arguments - Engine configuration arguments
- Server arguments - Server configuration arguments
- Optimization guide - Performance tuning strategies