This guide will help you quickly get started with vLLM to perform offline batched inference and deploy an OpenAI-compatible API server.

Prerequisites

Operating system

Linux (including WSL on Windows)

Python version

Python 3.10, 3.11, 3.12, or 3.13

Installation

Choose your hardware platform for installation instructions:
If you are using NVIDIA GPUs, you can install vLLM using pip directly. It’s recommended to use uv, a very fast Python environment manager. Install uv and create a new environment:
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
uv can automatically select the appropriate PyTorch index by inspecting your CUDA driver version with --torch-backend=auto. To select a specific backend (e.g., cu126), use --torch-backend=cu126.
You can also run vLLM commands without creating a permanent environment:
uv run --with vllm vllm --help
If you prefer conda, you can also use it to manage environments:
conda create -n myenv python=3.12 -y
conda activate myenv
pip install --upgrade uv
uv pip install vllm --torch-backend=auto
For more installation options and other platforms, see the Installation Guide.

Offline batched inference

With vLLM installed, you can start generating text for a list of input prompts (offline batch inference).

Basic usage

The core classes you’ll use are:
  • LLM - Main class for running offline inference with the vLLM engine
  • SamplingParams - Parameters for the sampling process
from vllm import LLM, SamplingParams

Define prompts and sampling parameters

Create a list of input prompts and configure sampling parameters:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
By default, vLLM uses the sampling parameters from the model’s generation_config.json on HuggingFace if it exists, which reflects the settings recommended by the model creator. To use vLLM’s default sampling parameters instead, set generation_config="vllm" when creating the LLM instance.
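To build intuition for what temperature and top_p control, here is a toy, pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering over a made-up next-token distribution. This is only an illustration of the idea, not vLLM’s actual sampling kernel:

```python
import math

def apply_temperature(logits, temperature):
    """Softmax with temperature: values below 1.0 sharpen the distribution."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability
    reaches top_p, then renormalize (nucleus sampling)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        total += p
        if total >= top_p:
            break
    return {tok: p / total for tok, p in kept}

# Hypothetical next-token logits after the prompt "The capital of France is".
logits = {"Paris": 3.0, "London": 1.0, "Berlin": 0.5, "Rome": 0.2}
probs = apply_temperature(logits, temperature=0.8)
filtered = top_p_filter(probs, top_p=0.95)  # low-probability tail is dropped
```

With these settings, unlikely tokens fall outside the nucleus and are never sampled, while temperature controls how peaked the remaining distribution is.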

Initialize the LLM engine

Create an LLM instance with your chosen model:
llm = LLM(model="facebook/opt-125m")
By default, vLLM downloads models from HuggingFace. To use models from ModelScope, set the environment variable:
export VLLM_USE_MODELSCOPE=True

Generate outputs

Generate text for all prompts with high throughput:
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
The llm.generate method does not automatically apply chat templates. For Instruct or Chat models, manually apply the template or use the llm.chat method:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/path/to/chat_model")
messages_list = [
    [{"role": "user", "content": prompt}]
    for prompt in prompts
]
texts = tokenizer.apply_chat_template(
    messages_list,
    tokenize=False,
    add_generation_prompt=True,
)

outputs = llm.generate(texts, sampling_params)
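Alternatively, llm.chat applies the model’s chat template for you, so you pass conversations rather than raw strings. A minimal sketch of the bookkeeping (the generation call is commented out because it needs a model loaded as above):

```python
prompts = [
    "Hello, my name is",
    "The capital of France is",
]

# One conversation per prompt; llm.chat renders the chat template
# before generating, so no manual apply_chat_template step is needed.
conversations = [
    [{"role": "user", "content": prompt}]
    for prompt in prompts
]

# outputs = llm.chat(conversations, sampling_params)  # uses the llm instance from above
```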

Complete example

See the full working example at: examples/offline_inference/basic/basic.py

OpenAI-compatible server

vLLM can be deployed as a server implementing the OpenAI API protocol, allowing it to be a drop-in replacement for OpenAI API applications.

Start the server

Launch the vLLM server with a model:
vllm serve Qwen/Qwen2.5-1.5B-Instruct
By default, the server:
  • Starts at http://localhost:8000
  • Hosts one model at a time
  • Implements OpenAI-compatible endpoints
Use --host and --port to customize the server address:
vllm serve Qwen/Qwen2.5-1.5B-Instruct --host 0.0.0.0 --port 8080
By default, the server applies generation_config.json from the HuggingFace model repository. To disable this and use vLLM defaults, pass:
vllm serve Qwen/Qwen2.5-1.5B-Instruct --generation-config vllm


API authentication

Enable API key checking:
vllm serve Qwen/Qwen2.5-1.5B-Instruct --api-key sk-key1 --api-key sk-key2
Or use an environment variable:
export VLLM_API_KEY="sk-your-key"
vllm serve Qwen/Qwen2.5-1.5B-Instruct

Query the server

List available models:
curl http://localhost:8000/v1/models

Completions API

Generate completions using the /v1/completions endpoint:
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
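The same request can be issued from Python. A sketch using only the standard library, building the identical JSON body as the curl example above (the actual network call is commented out since it requires the server to be running):

```python
import json
from urllib import request

# Same request body as the curl example.
payload = {
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0,
}

req = request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# with request.urlopen(req) as resp:  # requires the server started above
#     print(json.load(resp)["choices"][0]["text"])
```

Because the server implements the OpenAI protocol, the official openai Python client also works unchanged if you point its base_url at http://localhost:8000/v1.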

Chat completions API

The chat interface enables dynamic, back-and-forth exchanges:
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'
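Back-and-forth exchanges work by resending the full message history with each request: after the server replies, append the assistant message and the next user turn before calling the endpoint again. A sketch of that bookkeeping (no server call; the assistant reply shown is illustrative):

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"},
]

# Suppose the server returned this reply (illustrative text, not a real response).
assistant_reply = "The Los Angeles Dodgers won the 2020 World Series."

# Append the reply and the follow-up turn, then POST the whole
# `messages` list to /v1/chat/completions again.
messages.append({"role": "assistant", "content": assistant_reply})
messages.append({"role": "user", "content": "Where was it played?"})
```

The server is stateless between requests, so the client owns the conversation history.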

Attention backends

vLLM automatically selects the most performant attention backend for your system. You can manually specify a backend:
vllm serve Qwen/Qwen2.5-1.5B-Instruct --attention-backend FLASH_ATTN

Available backends

NVIDIA CUDA

FLASH_ATTN or FLASHINFER

AMD ROCm

TRITON_ATTN, ROCM_ATTN, ROCM_AITER_FA, ROCM_AITER_UNIFIED_ATTN, TRITON_MLA, ROCM_AITER_MLA, or ROCM_AITER_TRITON_MLA
FlashInfer is not included in pre-built wheels. Install it separately by following the FlashInfer documentation.

Next steps

Installation guide

Detailed installation for all platforms

Supported models

Browse all compatible models

API reference

Explore complete API documentation

Configuration

Learn about advanced configuration options
