
Easy, fast, and cheap LLM serving for everyone

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Quickstart

Get started with vLLM in minutes

Installation

Install vLLM on your platform

API Reference

Explore the vLLM API documentation

Supported Models

Browse supported LLM models

What makes vLLM fast

vLLM delivers state-of-the-art serving throughput through innovative optimizations:

PagedAttention

Efficient management of attention key and value memory with the breakthrough PagedAttention algorithm
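The core idea can be sketched in a few lines: each sequence's KV cache is stored in fixed-size blocks scattered through a shared pool, with a per-sequence block table mapping logical token positions to physical blocks. This is a toy illustration, not vLLM's implementation; the block size and pool size are made up for the example.

```python
# Toy sketch of the PagedAttention memory model: KV cache lives in
# fixed-size physical blocks, allocated on demand from a shared pool,
# and a block table maps logical token positions to physical blocks.

BLOCK_SIZE = 4  # tokens per KV-cache block (illustrative; vLLM uses e.g. 16)

class BlockTable:
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks   # shared pool of physical block ids
        self.blocks = []                 # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        """Reserve KV-cache space for one new token, allocating a block on demand."""
        if self.num_tokens % BLOCK_SIZE == 0:    # current block full (or none yet)
            self.blocks.append(self.free_blocks.pop())
        self.num_tokens += 1

    def physical_slot(self, token_idx):
        """Where token token_idx's KV vectors live: (physical block id, offset)."""
        return self.blocks[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

pool = list(range(100))    # 100 physical blocks shared by all sequences
seq = BlockTable(pool)
for _ in range(6):         # a 6-token sequence needs ceil(6/4) = 2 blocks
    seq.append_token()

print(len(seq.blocks))           # 2 physical blocks in use
print(seq.physical_slot(5))      # token 5 sits at offset 1 of the second block
```

Because blocks need not be contiguous, memory is allocated one block at a time as a sequence grows, which is what eliminates the fragmentation of reserving one large contiguous KV region per request.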

Continuous batching

Continuous batching of incoming requests maximizes GPU utilization
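"Continuous" here means the batch is re-formed at every decode iteration rather than once per batch. The toy scheduler below (a sketch under simplified assumptions, not vLLM's scheduler) shows the effect: a waiting request joins the moment another finishes, so GPU slots never sit idle waiting for the slowest sequence.

```python
# Toy sketch of continuous (iteration-level) batching: the batch is
# re-formed at every decode step, so a new request fills a slot the
# moment a running request completes.
from collections import deque

def continuous_batching(requests, max_batch=2):
    """requests: list of (name, tokens_to_generate). Returns the batch at each step."""
    waiting = deque(requests)
    running = {}                  # name -> tokens still to generate
    trace = []
    while waiting or running:
        # Admit waiting requests into any free slots. This is the key step:
        # admission happens every iteration, not once per whole batch.
        while waiting and len(running) < max_batch:
            name, n = waiting.popleft()
            running[name] = n
        trace.append(sorted(running))      # who decodes this step
        for name in list(running):         # every running request emits one token
            running[name] -= 1
            if running[name] == 0:
                del running[name]          # finished: its slot is freed
    return trace

trace = continuous_batching([("A", 3), ("B", 1), ("C", 2)])
print(trace)   # [['A', 'B'], ['A', 'C'], ['A', 'C']]
```

Note that C starts decoding on step 2, immediately after B finishes, even though A is still running; with static batching, C would have waited for the whole first batch to drain.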

Optimized kernels

Fast model execution with CUDA/HIP graph and integration with FlashAttention and FlashInfer

Quantization support

Advanced quantization methods including GPTQ, AWQ, AutoRound, INT4, INT8, and FP8
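The common core of these methods is storing weights in a low-bit integer format plus a floating-point scale, and dequantizing on the fly. The sketch below shows only that minimal core with a single symmetric INT8 scale; real methods like GPTQ and AWQ add per-group scales, calibration data, and error-compensating updates.

```python
# Minimal sketch of symmetric INT8 weight quantization: store weights as
# 8-bit integers plus one float scale, dequantize by multiplying back.
# This is the shared core idea only; production methods are far richer.

def quantize_int8(weights):
    """Map floats to int8 codes with a single symmetric scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]   # integers in [-127, 127]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.40, -1.27, 0.03, 1.27]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(q)       # [40, -127, 3, 127]
print(scale)   # ~0.01
print(w_hat)   # close to the original weights
```

The payoff is a 4x smaller weight tensor (int8 vs. float32) at the cost of bounded rounding error, which is why quantization trades a little accuracy for large memory and bandwidth savings.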

Performance features

  • Speculative decoding - Accelerate inference with draft models
  • Chunked prefill - Optimize long-context processing
  • Prefix caching - Reuse KV cache for common prompts
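Prefix caching is easiest to see with a toy model: KV-cache blocks are keyed by a hash of the token prefix they cover, so two prompts sharing a prefix (a common system prompt, say) resolve to the same cached blocks and skip recomputation. This is an illustrative sketch of the idea, not vLLM's actual data structures; the block size is made up.

```python
# Toy sketch of automatic prefix caching: each full KV-cache block is
# keyed by the entire token prefix up to its end, so shared prompt
# prefixes hit the same cache entries across requests.

BLOCK_SIZE = 4

def block_keys(tokens):
    """One cache key per full block: the whole prefix up to that block's end."""
    n_full = len(tokens) - len(tokens) % BLOCK_SIZE
    return [tuple(tokens[: i + BLOCK_SIZE]) for i in range(0, n_full, BLOCK_SIZE)]

cache = {}                        # key -> physical block id

def lookup_or_fill(tokens):
    hits = misses = 0
    for key in block_keys(tokens):
        if key in cache:
            hits += 1             # KV block reused, no recomputation
        else:
            misses += 1
            cache[key] = len(cache)   # "compute" the block and store it
    return hits, misses

system = [1, 2, 3, 4, 5, 6, 7, 8]                   # shared 8-token system prompt
print(lookup_or_fill(system + [9, 10, 11, 12]))     # (0, 3): cold cache
print(lookup_or_fill(system + [20, 21, 22, 23]))    # (2, 1): system-prompt blocks reused
```

Keying on the whole prefix rather than a block's own tokens is what makes reuse safe: a block's KV values depend on every token before it, so only identical prefixes may share a block.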

Flexibility and ease of use

vLLM is designed to be flexible and easy to integrate into your workflow:

Seamless HuggingFace integration

Works with popular models from HuggingFace out of the box

Multiple decoding algorithms

Support for parallel sampling, beam search, and more

OpenAI-compatible API

Drop-in replacement for OpenAI API with compatible server endpoints

Distributed inference

Tensor, pipeline, data, and expert parallelism for large-scale deployments
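"Drop-in replacement" means a vLLM server started with `vllm serve <model>` exposes the same HTTP endpoints and request schema as the OpenAI API. The sketch below builds the request body an OpenAI client would POST to `/v1/chat/completions`; the model name, port, and prompt are example values, not requirements.

```python
# Sketch of what OpenAI compatibility means in practice: this is the
# same request body the official OpenAI client would send, here aimed
# at a locally served model. Model name and URL are illustrative.
import json

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",   # whatever model the server loaded
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is PagedAttention?"},
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}
body = json.dumps(payload)

# With a server running, this body would be POSTed to
# http://localhost:8000/v1/chat/completions (port 8000 is vLLM's default),
# e.g. by pointing the official openai client at
# base_url="http://localhost:8000/v1".
print(sorted(payload))
```

Because the schema matches, existing OpenAI-based tooling works against a vLLM server by changing only the base URL and API key.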

Key capabilities

  • Streaming outputs - Real-time token generation
  • Multi-LoRA support - Serve multiple LoRA adapters simultaneously
  • Broad hardware support - NVIDIA GPUs, AMD CPUs/GPUs, Intel CPUs/GPUs, PowerPC CPUs, Arm CPUs, TPU, and more
  • Hardware plugins - Support for Intel Gaudi, IBM Spyre, Huawei Ascend, and other accelerators

Supported model types

vLLM seamlessly supports most popular open-source models on HuggingFace:

Transformer LLMs

Llama, Mistral, Qwen, and other transformer-based models

Mixture-of-Experts

Mixtral, DeepSeek-V2, DeepSeek-V3, and other MoE architectures

Embedding models

E5-Mistral and other embedding models

Multi-modal LLMs

LLaVA and other vision-language models

Find the complete list of supported models in the Supported Models documentation.

Installation preview

Installing vLLM is simple with pip:

    pip install vllm

We recommend using uv, a very fast Python environment manager, for the best installation experience. See the Installation Guide for detailed instructions.

Community and support

vLLM is a community-driven project with an active ecosystem:

GitHub Issues

Report bugs and request features

Discussion Forum

Connect with fellow users

Developer Slack

Coordinate contributions and development

Blog

Read latest updates and technical deep dives

Next steps

Follow the quickstart

Run your first inference in minutes

Explore the API

Learn about offline and online inference APIs
