This guide will help you quickly get started with vLLM to perform offline batched inference and deploy an OpenAI-compatible API server.

Prerequisites

Operating system

Linux (including WSL on Windows)

Python version

Python 3.10, 3.11, 3.12, or 3.13

Installation

Choose your hardware platform for installation instructions:
If you are using NVIDIA GPUs, you can install vLLM using pip directly. It’s recommended to use uv, a very fast Python environment manager. Install uv and create a new environment:
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
uv can automatically select the appropriate PyTorch index by inspecting your CUDA driver version with --torch-backend=auto. To select a specific backend (e.g., cu126), use --torch-backend=cu126.
You can also run vLLM commands without creating a permanent environment:
uv run --with vllm vllm --help
If you prefer conda, you can also use it to manage environments:
conda create -n myenv python=3.12 -y
conda activate myenv
pip install --upgrade uv
uv pip install vllm --torch-backend=auto
For more installation options and other platforms, see the Installation Guide.

Offline batched inference

With vLLM installed, you can start generating text for a list of input prompts (offline batch inference).

Basic usage

The core classes you’ll use are:
  • LLM - Main class for running offline inference with the vLLM engine
  • SamplingParams - Parameters for the sampling process
from vllm import LLM, SamplingParams

Define prompts and sampling parameters

Create a list of input prompts and configure sampling parameters:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
By default, vLLM uses the sampling parameters from the model’s generation_config.json on HuggingFace if it exists, which reflects the settings recommended by the model creator. To use vLLM’s default sampling parameters instead, set generation_config="vllm" when creating the LLM instance.
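To build intuition for what temperature and top_p control, here is a toy, pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering over a made-up next-token distribution. This is only an illustration of the idea, not vLLM’s actual sampling kernel:

```python
import math

def apply_temperature(logits, temperature):
    """Softmax with temperature: values below 1.0 sharpen the distribution."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability
    reaches top_p, then renormalize (nucleus sampling)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        total += p
        if total >= top_p:
            break
    return {tok: p / total for tok, p in kept}

# Hypothetical next-token logits after the prompt "The capital of France is".
logits = {"Paris": 3.0, "London": 1.0, "Berlin": 0.5, "Rome": 0.2}
probs = apply_temperature(logits, temperature=0.8)
filtered = top_p_filter(probs, top_p=0.95)  # low-probability tail is dropped
```

With these settings, unlikely tokens fall outside the nucleus and are never sampled, while temperature controls how peaked the remaining distribution is.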

Initialize the LLM engine

Create an LLM instance with your chosen model:
llm = LLM(model="facebook/opt-125m")
By default, vLLM downloads models from HuggingFace. To use models from ModelScope, set the environment variable:
export VLLM_USE_MODELSCOPE=True

Generate outputs

Generate text for all prompts with high throughput:
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
The llm.generate method does not automatically apply chat templates. For Instruct or Chat models, manually apply the template or use the llm.chat method:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/path/to/chat_model")
messages_list = [
    [{"role": "user", "content": prompt}]
    for prompt in prompts
]
texts = tokenizer.apply_chat_template(
    messages_list,
    tokenize=False,
    add_generation_prompt=True,
)

outputs = llm.generate(texts, sampling_params)
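Alternatively, llm.chat applies the model’s chat template for you, so you pass conversations rather than raw strings. A minimal sketch of the bookkeeping (the generation call is commented out because it needs a model loaded as above):

```python
prompts = [
    "Hello, my name is",
    "The capital of France is",
]

# One conversation per prompt; llm.chat renders the chat template
# before generating, so no manual apply_chat_template step is needed.
conversations = [
    [{"role": "user", "content": prompt}]
    for prompt in prompts
]

# outputs = llm.chat(conversations, sampling_params)  # uses the llm instance from above
```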

Complete example

See the full working example at: examples/offline_inference/basic/basic.py

OpenAI-compatible server

vLLM can be deployed as a server implementing the OpenAI API protocol, allowing it to be a drop-in replacement for OpenAI API applications.

Start the server

Launch the vLLM server with a model:
vllm serve Qwen/Qwen2.5-1.5B-Instruct
By default, the server:
  • Starts at http://localhost:8000
  • Hosts one model at a time
  • Implements OpenAI-compatible endpoints
Use --host and --port to customize the server address:
vllm serve Qwen/Qwen2.5-1.5B-Instruct --host 0.0.0.0 --port 8080
By default, the server applies generation_config.json from the HuggingFace model repository. To disable this and use vLLM defaults, pass:
vllm serve Qwen/Qwen2.5-1.5B-Instruct --generation-config vllm


API authentication

Enable API key checking:
vllm serve Qwen/Qwen2.5-1.5B-Instruct --api-key sk-key1 --api-key sk-key2
Or use an environment variable:
export VLLM_API_KEY="sk-your-key"
vllm serve Qwen/Qwen2.5-1.5B-Instruct

Query the server

List available models:
curl http://localhost:8000/v1/models

Completions API

Generate completions using the /v1/completions endpoint:
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
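The same request can be issued from Python. A sketch using only the standard library, building the identical JSON body as the curl example above (the actual network call is commented out since it requires the server to be running):

```python
import json
from urllib import request

# Same request body as the curl example.
payload = {
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0,
}

req = request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# with request.urlopen(req) as resp:  # requires the server started above
#     print(json.load(resp)["choices"][0]["text"])
```

Because the server implements the OpenAI protocol, the official openai Python client also works unchanged if you point its base_url at http://localhost:8000/v1.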

Chat completions API

The chat interface enables dynamic, back-and-forth exchanges:
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'
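Back-and-forth exchanges work by resending the full message history with each request: after the server replies, append the assistant message and the next user turn before calling the endpoint again. A sketch of that bookkeeping (no server call; the assistant reply shown is illustrative):

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"},
]

# Suppose the server returned this reply (illustrative text, not a real response).
assistant_reply = "The Los Angeles Dodgers won the 2020 World Series."

# Append the reply and the follow-up turn, then POST the whole
# `messages` list to /v1/chat/completions again.
messages.append({"role": "assistant", "content": assistant_reply})
messages.append({"role": "user", "content": "Where was it played?"})
```

The server is stateless between requests, so the client owns the conversation history.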

Attention backends

vLLM automatically selects the most performant attention backend for your system. You can manually specify a backend:
vllm serve Qwen/Qwen2.5-1.5B-Instruct --attention-backend FLASH_ATTN

Available backends

NVIDIA CUDA

FLASH_ATTN or FLASHINFER

AMD ROCm

TRITON_ATTN, ROCM_ATTN, ROCM_AITER_FA, ROCM_AITER_UNIFIED_ATTN, TRITON_MLA, ROCM_AITER_MLA, or ROCM_AITER_TRITON_MLA
FlashInfer is not included in pre-built wheels. Install it separately by following the FlashInfer documentation.

Next steps

Installation guide

Detailed installation for all platforms

Supported models

Browse all compatible models

API reference

Explore complete API documentation

Configuration

Learn about advanced configuration options
