Easy, fast, and cheap LLM serving for everyone
vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Quickstart
Get started with vLLM in minutes
Installation
Install vLLM on your platform
API Reference
Explore the vLLM API documentation
Supported Models
Browse supported LLM models
What makes vLLM fast
vLLM delivers state-of-the-art serving throughput through a set of core optimizations:

PagedAttention
Efficient management of attention key and value memory with the PagedAttention algorithm
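The core idea can be sketched in a few lines: instead of reserving one contiguous region per sequence, the KV cache is split into fixed-size blocks mapped through a per-sequence block table, so memory is allocated on demand. This is an illustrative toy model, not vLLM's actual implementation (which manages GPU tensors and uses larger blocks):

```python
class PagedKVCache:
    """Toy sketch of paged KV-cache bookkeeping (not vLLM's real code)."""

    def __init__(self, num_blocks: int, block_size: int = 4):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical blocks
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens cached so far

    def append_token(self, seq_id: int) -> None:
        """Reserve KV-cache space for one new token of a sequence."""
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:
            # Current block is full (or none allocated yet): grab a new one.
            # Real vLLM would preempt or swap sequences if the pool runs dry.
            block = self.free_blocks.pop()
            self.block_tables.setdefault(seq_id, []).append(block)
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

With a block size of 4, a 5-token sequence occupies exactly two blocks; freeing it returns both to the pool, so no memory stays reserved for tokens that were never generated.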
Continuous batching
Continuous batching of incoming requests maximizes GPU utilization
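The difference from static batching can be sketched as a scheduler that re-forms the batch on every decode step, so a finished request's slot is handed to a waiting request immediately rather than idling until the whole batch drains. This is a simplified toy scheduler, not vLLM's:

```python
from collections import deque


def serve(requests, max_batch=2):
    """Toy continuous-batching loop (not vLLM's scheduler).

    requests: list of (req_id, tokens_to_generate).
    Returns the batch composition at each decode step.
    """
    waiting = deque(requests)
    running = {}  # req_id -> tokens still to generate
    trace = []
    while waiting or running:
        # Admit waiting requests into free slots on EVERY step,
        # not only when the whole batch has finished.
        while waiting and len(running) < max_batch:
            rid, need = waiting.popleft()
            running[rid] = need
        trace.append(sorted(running))  # one decode step for the current batch
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]  # slot freed mid-flight
    return trace
```

With three requests needing 1, 3, and 2 tokens and room for two at a time, the short request finishes after step 1 and the third request joins at step 2, keeping both slots busy throughout.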
Optimized kernels
Fast model execution with CUDA/HIP graph and integration with FlashAttention and FlashInfer
Quantization support
Advanced quantization methods including GPTQ, AWQ, AutoRound, INT4, INT8, and FP8
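The common idea behind these methods can be illustrated with the simplest case, symmetric per-tensor INT8 quantization: weights are mapped to 8-bit integers through a single scale factor and dequantized on the fly. The listed schemes (GPTQ, AWQ, FP8, ...) are considerably more sophisticated; this sketch only shows the basic round trip:

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: floats -> (int8 values, scale)."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate floats; error is bounded by half a quantization step."""
    return [x * scale for x in q]
```

The payoff is memory: each weight shrinks from 4 bytes (FP32) or 2 bytes (FP16) to 1 byte, at the cost of a bounded rounding error per weight.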
Performance features
- Speculative decoding - Accelerate inference with draft models
- Chunked prefill - Optimize long-context processing
- Prefix caching - Reuse KV cache for common prompts
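Prefix caching, for example, can be sketched as keying each full KV-cache block by a hash of all tokens up to and including that block, so two prompts that share a prefix (a common system prompt, say) share its cached blocks. This toy version counts cache hits over token-level blocks; vLLM's automatic prefix caching applies the same idea to its paged GPU blocks:

```python
BLOCK = 4  # tokens per cached block (illustrative; vLLM uses larger blocks)


def cached_blocks(prompt, cache):
    """Return how many leading full blocks of `prompt` were served from cache.

    `cache` maps a tuple of all tokens up to a block boundary to the stored
    KV data for that block; only full blocks are cached.
    """
    hits = 0
    for i in range(0, len(prompt) - len(prompt) % BLOCK, BLOCK):
        key = tuple(prompt[: i + BLOCK])  # hash of the prefix-so-far
        if key in cache:
            hits += 1  # KV for this block is reused, not recomputed
        else:
            cache[key] = f"kv-block-{len(cache)}"  # "compute" and store KV
    return hits
```

A second prompt that shares the first eight tokens with an earlier one thus skips recomputing two blocks of KV cache.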
Flexibility and ease of use
vLLM is designed to be flexible and easy to integrate into your workflow:

Key capabilities
- Streaming outputs - Real-time token generation
- Multi-LoRA support - Serve multiple LoRA adapters simultaneously
- Broad hardware support - NVIDIA GPUs, AMD CPUs/GPUs, Intel CPUs/GPUs, PowerPC CPUs, Arm CPUs, TPU, and more
- Hardware plugins - Support for Intel Gaudi, IBM Spyre, Huawei Ascend, and other accelerators
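Multi-LoRA serving, from the list above, rests on a simple structure: one set of base weights W is shared across requests, while each request applies its own low-rank delta, y = W x + (alpha / r) * B (A x). The names and shapes below are generic illustrations, not vLLM's API:

```python
def matvec(M, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, x, adapter):
    """Apply a LoRA adapter on top of shared base weights W.

    adapter = (A, B, alpha), where A is r x d_in, B is d_out x r,
    and the delta is scaled by alpha / r. Toy sketch, not vLLM code.
    """
    A, B, alpha = adapter
    base = matvec(W, x)                 # shared base computation
    delta = matvec(B, matvec(A, x))     # cheap low-rank path, rank r = len(A)
    s = alpha / len(A)
    return [b + s * d for b, d in zip(base, delta)]
```

Because only the small (A, B) pairs differ per adapter, a server can keep many adapters resident and pick one per request without duplicating the base model.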
Supported model types
vLLM seamlessly supports most popular open-source models on Hugging Face:

Transformer LLMs
Llama, Mistral, Qwen, and other transformer-based models
Mixture-of-Experts
Mixtral, DeepSeek-V2, DeepSeek-V3, and other MoE architectures
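What distinguishes these architectures is the routing step: a gating function scores the experts per token and only the top-k expert FFNs run, their outputs combined with softmax weights. A minimal sketch with toy experts (real gates are learned linear layers):

```python
import math


def moe_forward(x, experts, gate_scores, k=2):
    """Toy Mixture-of-Experts layer: run only the top-k experts per input.

    experts: list of callables; gate_scores: one score per expert.
    Output is the softmax-weighted sum over the selected experts only.
    """
    topk = sorted(range(len(experts)),
                  key=lambda i: gate_scores[i], reverse=True)[:k]
    exps = [math.exp(gate_scores[i]) for i in topk]
    total = sum(exps)
    # Only k of len(experts) expert networks execute for this input.
    return sum((e / total) * experts[i](x) for i, e in zip(topk, exps))
```

This is why MoE models can have far more parameters than they use per token: compute scales with k, not with the total expert count.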
Embedding models
E5-Mistral and other embedding models
Multi-modal LLMs
LLaVA and other vision-language models
Find the complete list of supported models in the Supported Models documentation.
Installation preview
Installing vLLM is simple with pip:

Community and support
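The basic install command from the vLLM documentation (a recent Python on a supported platform is assumed; GPU inference additionally needs a compatible CUDA or ROCm setup):

```shell
# Install the latest vLLM release from PyPI
pip install vllm
```

See the Installation page for platform-specific instructions and prebuilt wheels.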
vLLM is a community-driven project with an active ecosystem:

GitHub Issues
Report bugs and request features
Discussion Forum
Connect with fellow users
Developer Slack
Coordinate contributions and development
Blog
Read the latest updates and technical deep dives
Next steps
Follow the quickstart
Run your first inference in minutes
Explore the API
Learn about offline and online inference APIs