
Easy, fast, and cheap LLM serving for everyone

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Quickstart

Get started with vLLM in minutes

Installation

Install vLLM on your platform

API Reference

Explore the vLLM API documentation

Supported Models

Browse supported LLM models

What makes vLLM fast

vLLM delivers state-of-the-art serving throughput through innovative optimizations:

PagedAttention

Efficient management of attention key and value memory with the breakthrough PagedAttention algorithm
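The core idea can be sketched in a few lines: each sequence's KV cache is stored in fixed-size blocks scattered through a shared pool, with a per-sequence block table mapping logical token positions to physical blocks. This is a toy illustration, not vLLM's implementation; the block size and pool size are made up for the example.

```python
# Toy sketch of the PagedAttention memory model: KV cache lives in
# fixed-size physical blocks, allocated on demand from a shared pool,
# and a block table maps logical token positions to physical blocks.

BLOCK_SIZE = 4  # tokens per KV-cache block (illustrative; vLLM uses e.g. 16)

class BlockTable:
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks   # shared pool of physical block ids
        self.blocks = []                 # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        """Reserve KV-cache space for one new token, allocating a block on demand."""
        if self.num_tokens % BLOCK_SIZE == 0:    # current block full (or none yet)
            self.blocks.append(self.free_blocks.pop())
        self.num_tokens += 1

    def physical_slot(self, token_idx):
        """Where token token_idx's KV vectors live: (physical block id, offset)."""
        return self.blocks[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

pool = list(range(100))    # 100 physical blocks shared by all sequences
seq = BlockTable(pool)
for _ in range(6):         # a 6-token sequence needs ceil(6/4) = 2 blocks
    seq.append_token()

print(len(seq.blocks))           # 2 physical blocks in use
print(seq.physical_slot(5))      # token 5 sits at offset 1 of the second block
```

Because blocks need not be contiguous, memory is allocated one block at a time as a sequence grows, which is what eliminates the fragmentation of reserving one large contiguous KV region per request.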

Continuous batching

Continuous batching of incoming requests maximizes GPU utilization
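"Continuous" here means the batch is re-formed at every decode iteration rather than once per batch. The toy scheduler below (a sketch under simplified assumptions, not vLLM's scheduler) shows the effect: a waiting request joins the moment another finishes, so GPU slots never sit idle waiting for the slowest sequence.

```python
# Toy sketch of continuous (iteration-level) batching: the batch is
# re-formed at every decode step, so a new request fills a slot the
# moment a running request completes.
from collections import deque

def continuous_batching(requests, max_batch=2):
    """requests: list of (name, tokens_to_generate). Returns the batch at each step."""
    waiting = deque(requests)
    running = {}                  # name -> tokens still to generate
    trace = []
    while waiting or running:
        # Admit waiting requests into any free slots. This is the key step:
        # admission happens every iteration, not once per whole batch.
        while waiting and len(running) < max_batch:
            name, n = waiting.popleft()
            running[name] = n
        trace.append(sorted(running))      # who decodes this step
        for name in list(running):         # every running request emits one token
            running[name] -= 1
            if running[name] == 0:
                del running[name]          # finished: its slot is freed
    return trace

trace = continuous_batching([("A", 3), ("B", 1), ("C", 2)])
print(trace)   # [['A', 'B'], ['A', 'C'], ['A', 'C']]
```

Note that C starts decoding on step 2, immediately after B finishes, even though A is still running; with static batching, C would have waited for the whole first batch to drain.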

Optimized kernels

Fast model execution with CUDA/HIP graph and integration with FlashAttention and FlashInfer

Quantization support

Advanced quantization methods including GPTQ, AWQ, AutoRound, INT4, INT8, and FP8
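The common core of these methods is storing weights in a low-bit integer format plus a floating-point scale, and dequantizing on the fly. The sketch below shows only that minimal core with a single symmetric INT8 scale; real methods like GPTQ and AWQ add per-group scales, calibration data, and error-compensating updates.

```python
# Minimal sketch of symmetric INT8 weight quantization: store weights as
# 8-bit integers plus one float scale, dequantize by multiplying back.
# This is the shared core idea only; production methods are far richer.

def quantize_int8(weights):
    """Map floats to int8 codes with a single symmetric scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]   # integers in [-127, 127]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.40, -1.27, 0.03, 1.27]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(q)       # [40, -127, 3, 127]
print(scale)   # ~0.01
print(w_hat)   # close to the original weights
```

The payoff is a 4x smaller weight tensor (int8 vs. float32) at the cost of bounded rounding error, which is why quantization trades a little accuracy for large memory and bandwidth savings.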

Performance features

  • Speculative decoding - Accelerate inference with draft models
  • Chunked prefill - Optimize long-context processing
  • Prefix caching - Reuse KV cache for common prompts
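Prefix caching is easiest to see with a toy model: KV-cache blocks are keyed by a hash of the token prefix they cover, so two prompts sharing a prefix (a common system prompt, say) resolve to the same cached blocks and skip recomputation. This is an illustrative sketch of the idea, not vLLM's actual data structures; the block size is made up.

```python
# Toy sketch of automatic prefix caching: each full KV-cache block is
# keyed by the entire token prefix up to its end, so shared prompt
# prefixes hit the same cache entries across requests.

BLOCK_SIZE = 4

def block_keys(tokens):
    """One cache key per full block: the whole prefix up to that block's end."""
    n_full = len(tokens) - len(tokens) % BLOCK_SIZE
    return [tuple(tokens[: i + BLOCK_SIZE]) for i in range(0, n_full, BLOCK_SIZE)]

cache = {}                        # key -> physical block id

def lookup_or_fill(tokens):
    hits = misses = 0
    for key in block_keys(tokens):
        if key in cache:
            hits += 1             # KV block reused, no recomputation
        else:
            misses += 1
            cache[key] = len(cache)   # "compute" the block and store it
    return hits, misses

system = [1, 2, 3, 4, 5, 6, 7, 8]                   # shared 8-token system prompt
print(lookup_or_fill(system + [9, 10, 11, 12]))     # (0, 3): cold cache
print(lookup_or_fill(system + [20, 21, 22, 23]))    # (2, 1): system-prompt blocks reused
```

Keying on the whole prefix rather than a block's own tokens is what makes reuse safe: a block's KV values depend on every token before it, so only identical prefixes may share a block.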

Flexibility and ease of use

vLLM is designed to be flexible and easy to integrate into your workflow:

Seamless HuggingFace integration

Works with popular models from HuggingFace out of the box

Multiple decoding algorithms

Support for parallel sampling, beam search, and more

OpenAI-compatible API

Drop-in replacement for OpenAI API with compatible server endpoints

Distributed inference

Tensor, pipeline, data, and expert parallelism for large-scale deployments
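"Drop-in replacement" means a vLLM server started with `vllm serve <model>` exposes the same HTTP endpoints and request schema as the OpenAI API. The sketch below builds the request body an OpenAI client would POST to `/v1/chat/completions`; the model name, port, and prompt are example values, not requirements.

```python
# Sketch of what OpenAI compatibility means in practice: this is the
# same request body the official OpenAI client would send, here aimed
# at a locally served model. Model name and URL are illustrative.
import json

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",   # whatever model the server loaded
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is PagedAttention?"},
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}
body = json.dumps(payload)

# With a server running, this body would be POSTed to
# http://localhost:8000/v1/chat/completions (port 8000 is vLLM's default),
# e.g. by pointing the official openai client at
# base_url="http://localhost:8000/v1".
print(sorted(payload))
```

Because the schema matches, existing OpenAI-based tooling works against a vLLM server by changing only the base URL and API key.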

Key capabilities

  • Streaming outputs - Real-time token generation
  • Multi-LoRA support - Serve multiple LoRA adapters simultaneously
  • Broad hardware support - NVIDIA GPUs, AMD CPUs/GPUs, Intel CPUs/GPUs, PowerPC CPUs, Arm CPUs, TPU, and more
  • Hardware plugins - Support for Intel Gaudi, IBM Spyre, Huawei Ascend, and other accelerators

Supported model types

vLLM seamlessly supports most popular open-source models on HuggingFace:

Transformer LLMs

Llama, Mistral, Qwen, and other transformer-based models

Mixture-of-Experts

Mixtral, DeepSeek-V2, DeepSeek-V3, and other MoE architectures

Embedding models

E5-Mistral and other embedding models

Multi-modal LLMs

LLaVA and other vision-language models

Find the complete list of supported models in the Supported Models documentation.

Installation preview

Installing vLLM is simple with pip:

    pip install vllm

We recommend using uv, a very fast Python environment manager, for the best installation experience. See the Installation Guide for detailed instructions.

Community and support

vLLM is a community-driven project with an active ecosystem:

GitHub Issues

Report bugs and request features

Discussion Forum

Connect with fellow users

Developer Slack

Coordinate contributions and development

Blog

Read latest updates and technical deep dives

Next steps

Follow the quickstart

Run your first inference in minutes

Explore the API

Learn about offline and online inference APIs
