The vLLM CLI provides commands for serving models, running benchmarks, and managing your deployment. All commands follow the pattern vllm [subcommand] [options].

Available commands

View all available commands:
vllm --help
Command           Description
vllm serve        Launch OpenAI-compatible API server
vllm bench        Run performance benchmarks
vllm collect-env  Collect environment information for debugging
vllm run-batch    Run offline batch inference

vllm serve

Launch an OpenAI-compatible HTTP API server to serve LLM completions.

Basic usage

vllm serve [model_tag] [options]
Example:
vllm serve meta-llama/Llama-3.2-1B-Instruct

Model configuration

vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --dtype auto \
  --max-model-len 4096 \
  --trust-remote-code

Data parallel deployment

1. Single node, multiple GPUs

Launch with data parallelism on a single node:
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --data-parallel-size 4 \
  --tensor-parallel-size 2
This uses 8 GPUs total (4 DP ranks × 2 TP size).
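The GPU requirement can be sanity-checked before launching: total GPUs needed equals data-parallel size times tensor-parallel size.

```shell
# Sanity-check GPU requirements before launching:
# total GPUs needed = data-parallel size * tensor-parallel size.
DP=4
TP=2
echo "GPUs required: $((DP * TP))"
# prints "GPUs required: 8"
```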

2. Multi-node with internal load balancing

Run on multiple nodes with a single API endpoint:
# Node 0 (head node with IP 10.99.48.128)
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --data-parallel-size 4 \
  --data-parallel-size-local 2 \
  --data-parallel-address 10.99.48.128 \
  --data-parallel-rpc-port 13345

# Node 1
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --headless \
  --data-parallel-size 4 \
  --data-parallel-size-local 2 \
  --data-parallel-start-rank 2 \
  --data-parallel-address 10.99.48.128 \
  --data-parallel-rpc-port 13345

3. Multi-node with external load balancing

Run each DP rank as a separate server:
# Rank 0 (IP: 10.99.48.128)
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --data-parallel-size 2 \
  --data-parallel-rank 0 \
  --data-parallel-address 10.99.48.128 \
  --data-parallel-rpc-port 13345

# Rank 1
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --data-parallel-size 2 \
  --data-parallel-rank 1 \
  --data-parallel-address 10.99.48.128 \
  --data-parallel-rpc-port 13345
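
The two per-rank commands differ only in the rank (and the listening port, when ranks are co-located), so they can be generated from a loop. A sketch that echoes the commands rather than executing them, assuming rank N serves on port 8000 + N — the port-per-rank scheme is an illustration, not something the flags above mandate:

```shell
# Generate the per-rank launch commands (echoed, not executed, so this
# runs without GPUs). Assumes rank N listens on port 8000+N.
for RANK in 0 1; do
  echo "vllm serve meta-llama/Llama-3.2-1B-Instruct \
    --port $((8000 + RANK)) \
    --data-parallel-size 2 \
    --data-parallel-rank ${RANK} \
    --data-parallel-address 10.99.48.128 \
    --data-parallel-rpc-port 13345"
done
```

An external load balancer would then spread requests across ports 8000 and 8001.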

Performance options

vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 256 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.95
Key parameters:
  • --max-num-batched-tokens - Maximum tokens processed in a single batch
  • --max-num-seqs - Maximum number of sequences in a batch
  • --enable-prefix-caching - Enable KV cache reuse for repeated prompts
  • --enable-chunked-prefill - Split large prompts into chunks
  • --gpu-memory-utilization - Fraction of GPU memory to use (0.0-1.0)

Chat templates

Specify a custom chat template:
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --chat-template ./templates/custom_chat.jinja
Override content format detection:
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --chat-template-content-format openai
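
For illustration, a template file can be created like this. The template body is a toy format, not any model's official template — real chat models usually ship their template with the tokenizer config:

```shell
# Write an illustrative (toy) chat template; real models typically ship
# their own template alongside the tokenizer.
mkdir -p templates
cat > templates/custom_chat.jinja <<'EOF'
{%- for message in messages -%}
{{ message['role'] }}: {{ message['content'] }}
{% endfor -%}
assistant:
EOF
echo "wrote templates/custom_chat.jinja"
```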

Server options

vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key token-abc123 \
  --enable-request-id-headers \
  --enable-offline-docs \
  --uvicorn-log-level info
Server parameters:
  • --host - Host IP address (default: None)
  • --port - Port number (default: 8000)
  • --api-key - API key for authentication
  • --enable-request-id-headers - Enable X-Request-ID header tracking
  • --enable-offline-docs - Enable offline API documentation
  • --uvicorn-log-level - Logging level: critical, error, warning, info, debug, trace
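
Once the server is up, the OpenAI-compatible endpoint can be exercised with curl. A sketch assuming the --api-key value token-abc123 from above and a server listening on localhost:8000:

```shell
# Build a chat-completion request payload.
cat > request.json <<'EOF'
{
  "model": "meta-llama/Llama-3.2-1B-Instruct",
  "messages": [{"role": "user", "content": "Hello!"}],
  "max_tokens": 32
}
EOF

# Validate the payload locally before sending (python3 assumed available):
python3 -m json.tool request.json > /dev/null && echo "payload ok"

# Against a running server, send it with the matching API key:
#   curl http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -H "Authorization: Bearer token-abc123" \
#     -d @request.json
```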

Advanced deployment

# Headless mode for multi-node deployments: run the engine without an API server
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --headless \
  --data-parallel-size 4

Environment variables

Common environment variables for vllm serve:
# CUDA devices
export CUDA_VISIBLE_DEVICES=0,1,2,3

# vLLM settings
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=1
export VLLM_MAX_AUDIO_CLIP_FILESIZE_MB=25

vllm serve meta-llama/Llama-3.2-1B-Instruct

vllm bench

Run performance benchmarks to measure throughput and latency.

Benchmark types

vllm bench provides latency, throughput, and serve subcommands. A throughput run:
vllm bench throughput \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --dataset-name sonnet \
  --num-prompts 1000

Benchmark options

vllm bench throughput \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --dataset-name sharegpt \
  --num-prompts 1000 \
  --tensor-parallel-size 2 \
  --enable-prefix-caching \
  --max-num-batched-tokens 8192
Common parameters:
  • --model - Model to benchmark
  • --dataset-name - Dataset to use (sonnet, sharegpt, random)
  • --num-prompts - Number of prompts to process
  • --input-len - Input sequence length
  • --output-len - Output sequence length
  • --request-rate - Requests per second (for serving benchmarks)
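
Serving benchmarks are often swept over several request rates. A sketch that echoes the commands (so it runs without a GPU), assuming vllm bench serve accepts --request-rate as listed above:

```shell
# Sweep request rates for a serving benchmark (commands echoed only).
for RATE in 1 5 10; do
  echo "vllm bench serve \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --request-rate ${RATE} \
    --num-prompts 200"
done
```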

vllm collect-env

Collect environment information for debugging and issue reporting:
vllm collect-env
This outputs:
  • Python version
  • PyTorch version
  • CUDA version
  • vLLM version
  • GPU information
  • System details
Example output:
vLLM Version: 0.6.0
Python Version: 3.10.12
PyTorch Version: 2.4.0+cu121
CUDA Version: 12.1
GPU: NVIDIA A100-SXM4-80GB

vllm run-batch

Run offline batch inference from command line:
vllm run-batch \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --input-file prompts.jsonl \
  --output-file results.jsonl
Input file format (JSONL, OpenAI batch request format):
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Llama-3.2-1B-Instruct", "messages": [{"role": "user", "content": "Hello, my name is"}], "max_tokens": 50}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Llama-3.2-1B-Instruct", "messages": [{"role": "user", "content": "The capital of France is"}], "max_tokens": 50}}
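
The input file can be generated with a heredoc. vllm run-batch consumes OpenAI-style batch request lines, so each line carries a custom_id, method, url, and body (a sketch):

```shell
# Generate a two-request batch input file (OpenAI batch request format).
cat > prompts.jsonl <<'EOF'
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Llama-3.2-1B-Instruct", "messages": [{"role": "user", "content": "Hello, my name is"}], "max_tokens": 50}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Llama-3.2-1B-Instruct", "messages": [{"role": "user", "content": "The capital of France is"}], "max_tokens": 50}}
EOF
echo "lines: $(wc -l < prompts.jsonl)"
```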

Contextual help

Get help for specific command groups:
# View model configuration options
vllm serve --help=ModelConfig

# View frontend server options
vllm serve --help=Frontend

# View all options at once
vllm serve --help=all

Version information

Check vLLM version:
vllm --version

Common workflows

1. Development server

Quick local server for testing:
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --port 8000 \
  --max-model-len 2048

2. Production deployment

Production-ready configuration:
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key $API_KEY \
  --tensor-parallel-size 4 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 256 \
  --disable-log-stats

3. Benchmark comparison

Compare different configurations:
# Baseline
vllm bench throughput --model meta-llama/Llama-3.2-1B-Instruct

# With prefix caching
vllm bench throughput \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --enable-prefix-caching
For complete parameter documentation, visit the configuration reference.

Troubleshooting

If you encounter CUDA out-of-memory errors, try:
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --gpu-memory-utilization 0.8 \
  --max-model-len 2048
Common issues:
  1. Port already in use: Change the port with --port 8080
  2. Model not found: Ensure HuggingFace credentials are set: huggingface-cli login
  3. GPU memory issues: Reduce --gpu-memory-utilization or --max-model-len
  4. Slow startup: Add --enforce-eager to skip CUDA graph compilation
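
For the port-in-use case, the port can be checked before launching. A sketch using ss (assumed available; lsof -i :8000 is an alternative) that falls back to 8080 when 8000 is taken:

```shell
# Fall back to port 8080 if 8000 is already bound.
PORT=8000
if ss -ltn 2>/dev/null | grep -q ":${PORT} "; then
  echo "port ${PORT} in use, switching"
  PORT=8080
fi
echo "serving on port ${PORT}"
# vllm serve meta-llama/Llama-3.2-1B-Instruct --port ${PORT}
```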
