Entrypoints
vLLM provides multiple entrypoints for interacting with the system:

LLM class

The LLM class provides the primary Python interface for offline inference (interacting with a model without using a separate inference server).
vllm/entrypoints/llm.py
OpenAI-compatible API server
The second primary interface is the OpenAI-compatible API server, which can be started using the vllm serve command:
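As a quick sketch of this workflow (the model name and port below are illustrative placeholders, not a recommendation):

```shell
# Start the OpenAI-compatible server; any supported HF model ID works here
vllm serve Qwen/Qwen2.5-1.5B-Instruct --port 8000

# Query it with the standard OpenAI chat completions endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct",
       "messages": [{"role": "user", "content": "Hello"}]}'
```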
vllm/entrypoints/cli/main.py
More details: OpenAI-Compatible Server
V1 process architecture
vLLM V1 uses a multi-process architecture to separate concerns and maximize throughput. Understanding this architecture is important for properly sizing CPU resources in your deployment.

Key processes
- API server process
- Engine core process
- GPU worker processes
- DP coordinator process
API server process:

- Purpose: Handles HTTP requests, performs input processing (tokenization, multi-modal data loading), and streams results back to clients.
- Communication: Communicates with engine core process(es) via ZMQ sockets.
- Count: 1 API server process by default, but automatically scales to match data parallel size. Can be manually configured with the --api-server-count flag.
- Threading: Each API server uses multiple CPU threads for media loading (controlled by VLLM_MEDIA_LOADING_THREAD_COUNT, default 8).
- Topology: Many-to-many topology where each API server connects to all engine cores via ZMQ, enabling any API server to route requests to any engine core.
- Code location: vllm/entrypoints/openai/api_server.py and vllm/v1/utils.py

Process count summary
For a deployment with N GPUs, tensor parallel size TP, data parallel size DP, and API server count A:
| Process Type | Count | Notes |
|---|---|---|
| API Server | A (default DP) | Handles HTTP requests and input processing |
| Engine Core | DP (default 1) | Scheduler and KV cache management |
| GPU Worker | N (= DP x TP) | One per GPU, executes model forward passes |
| DP Coordinator | 1 if DP > 1, else 0 | Load balancing across DP ranks |
| Total | A + DP + N (+ 1 if DP > 1) | |
Example configurations
- Single-node (TP=4)
- Data parallel (TP=2, DP=4)
A typical single-node deployment with 4 GPUs:

Processes:
- 1 API server
- 1 engine core
- 4 GPU workers
- Total: 6 processes
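The table above reduces to a small formula. A sketch (the helper name is ours, not part of vLLM):

```python
def process_count(n_gpus: int, tp: int, dp: int = 1, api_servers=None) -> int:
    """Total vLLM V1 process count per the table above.

    Assumes: N = DP * TP GPU workers, DP engine cores, one DP coordinator
    when DP > 1, and the API server count defaulting to DP.
    """
    assert n_gpus == dp * tp, "N must equal DP * TP"
    a = dp if api_servers is None else api_servers
    coordinator = 1 if dp > 1 else 0
    return a + dp + n_gpus + coordinator

# Single-node example: TP=4, DP=1 -> 1 API server + 1 engine core + 4 workers
print(process_count(4, tp=4))        # 6
# Data-parallel example: TP=2, DP=4 -> 4 API + 4 engines + 8 workers + 1 coordinator
print(process_count(8, tp=2, dp=4))  # 17
```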
LLM engine
The LLMEngine and AsyncLLMEngine classes are central to vLLM’s functioning, handling model inference and asynchronous request processing.
LLMEngine
The LLMEngine class is the core component that receives requests from clients and generates outputs from the model.
Responsibilities:
- Input processing: Tokenization of input text using the specified tokenizer
- Scheduling: Chooses which requests are processed in each step
- Model execution: Manages distributed execution across multiple GPUs
- Output processing: Decodes token IDs into human-readable text
vllm/engine/llm_engine.py
AsyncLLMEngine
The AsyncLLMEngine class is an asynchronous wrapper around the LLMEngine class.
Features:
- Uses asyncio to create a background loop for continuous request processing
- Designed for online serving with multiple concurrent requests
- Supports streaming outputs to clients
- Powers the OpenAI-compatible API server
- Available in the demo API server at vllm/entrypoints/api_server.py
vllm/engine/async_llm_engine.py
Core components
Worker
A worker is a process that runs model inference. vLLM follows the common practice of using one process to control one accelerator device. For example, with tensor parallelism size 2 and pipeline parallelism size 2, there are 4 workers in total.

Identification:

- rank: Used for global orchestration
- local_rank: Used for assigning the accelerator device and accessing local resources (file system, shared memory)
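One plausible way to lay out these identifiers, as an illustrative sketch only (vLLM's actual rank assignment lives in its distributed code and may differ):

```python
def worker_ranks(tp_size: int, pp_size: int, gpus_per_node: int):
    """Illustrative rank layout: one worker process per GPU, with a global
    `rank` for orchestration and a `local_rank` selecting the device on
    each node. The ordering here is an assumption for demonstration."""
    workers = []
    for pp_rank in range(pp_size):
        for tp_rank in range(tp_size):
            rank = pp_rank * tp_size + tp_rank   # global ordering
            local_rank = rank % gpus_per_node    # device index on the node
            workers.append({"rank": rank, "local_rank": local_rank,
                            "tp_rank": tp_rank, "pp_rank": pp_rank})
    return workers

# TP=2, PP=2 on a single 4-GPU node -> 4 workers, ranks 0..3
for w in worker_ranks(2, 2, 4):
    print(w)
```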
Model runner
Every worker has one model runner object responsible for loading and running the model.

Responsibilities:

- Preparing input tensors
- Capturing CUDA graphs
- Executing model forward passes
Model
Every model runner object has one model object, which is the actual torch.nn.Module instance.
See HuggingFace Integration for how various configurations affect the model class.
Class hierarchy
vLLM’s class hierarchy is designed with three key principles:

1. Extensibility

All classes accept a configuration object containing all necessary information. The VllmConfig class is the main configuration object passed throughout the hierarchy.
Benefits:
- Deep class hierarchy with easy configuration access
- New features can be added without changing constructor signatures
- Configuration options only need to be added to VllmConfig
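The pattern can be sketched with stand-in classes (the real VllmConfig lives in vllm/config.py and carries many more fields; everything below is illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class ModelConfig:          # stand-in sub-config
    model: str = "facebook/opt-125m"
    dtype: str = "auto"

@dataclass
class CacheConfig:          # stand-in sub-config
    block_size: int = 16

@dataclass
class VllmConfig:           # stand-in for the engine-wide config object
    model_config: ModelConfig = field(default_factory=ModelConfig)
    cache_config: CacheConfig = field(default_factory=CacheConfig)

class Attention:
    # Deep in the hierarchy, yet it can read any option it needs without
    # that option threading through every intermediate constructor.
    def __init__(self, vllm_config: VllmConfig):
        self.block_size = vllm_config.cache_config.block_size

attn = Attention(VllmConfig())
print(attn.block_size)  # 16
```

Adding a new option means adding one field to the config dataclass; no constructor signature in the hierarchy changes.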
2. Uniformity
The model runner needs a unified interface to create and initialize models. vLLM supports 50+ model types, each with unique initialization logic, so all vLLM models now use a uniform constructor signature. The constructor is keyword-only, ensuring an error is raised if old positional configurations are passed. Developers of out-of-tree models need to update their models to use this signature.
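A minimal sketch of what such a keyword-only constructor looks like (the class and argument bodies are placeholders, not vLLM code):

```python
class MyModel:
    # Keyword-only signature: code that still passes old positional
    # arguments fails loudly with a TypeError instead of silently
    # binding them to the wrong parameter.
    def __init__(self, *, vllm_config, prefix: str = ""):
        self.vllm_config = vllm_config
        self.prefix = prefix

MyModel(vllm_config={"dummy": True})       # OK: keyword arguments
try:
    MyModel({"dummy": True})               # old positional style
except TypeError:
    print("positional arguments rejected")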
3. Sharding and quantization at initialization
Certain features require changing model weights during initialization rather than after. Why initialize-time modifications? Consider running a 405B-parameter model (roughly 810 GB of weights) on 16 H100 80 GB GPUs:

- After initialization: every GPU would need to load all 810 GB and then shard, a huge memory overhead
- During initialization: each layer only creates the shard it needs (about 50 GB per GPU)
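The arithmetic behind those numbers:

```python
params_b = 405          # parameters, in billions
bytes_per_param = 2     # bf16/fp16 weights
gpus = 16

total_gb = params_b * bytes_per_param      # 810 GB of weights in total
after_init_peak = total_gb                 # full copy per GPU before sharding
at_init_peak = total_gb / gpus             # each GPU materializes only its shard

print(f"total weights:               {total_gb} GB")
print(f"peak per GPU, shard after:   {after_init_peak} GB (far above 80 GB of HBM)")
print(f"peak per GPU, shard at init: {at_init_peak:.1f} GB")
```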
prefix argument:

The prefix parameter enables non-uniform quantization, where different parts of the model are quantized differently:

- Usually an empty string for the top-level model
- Strings like "vision" or "language" for sub-models
- Matches the module’s state dict name in the checkpoint file
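How a prefix can select per-submodule quantization, as a sketch; the mapping, function, and module names below are illustrative, not vLLM's real quantization config:

```python
# Hypothetical per-submodule overrides: e.g. leave the vision tower in
# full precision while quantizing the language model to fp8.
QUANT_OVERRIDES = {"vision": None, "language": "fp8"}

def quant_method_for(prefix: str, default: str = "fp8"):
    """Pick a quantization method from the leading component of the
    module's state-dict prefix ('' means the top-level model)."""
    top = prefix.split(".", 1)[0] if prefix else ""
    return QUANT_OVERRIDES.get(top, default)

print(quant_method_for(""))                        # top-level model -> fp8
print(quant_method_for("vision.encoder.layers.0")) # None: left unquantized
print(quant_method_for("language.layers.3"))       # fp8
```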
Testing implications
One disadvantage of this design is that it is hard to write unit tests for individual components, since every component needs a complete config object. Solution: provide a default initialization function that creates a default config object with all fields set to None. To test a component that only cares about a few fields, create a default config and set only the fields needed.
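A sketch of that testing approach, using a stand-in config rather than the real VllmConfig:

```python
from dataclasses import dataclass, field, fields

@dataclass
class FakeVllmConfig:
    # Stand-in: every field defaults to None so a test only fills in
    # what the component under test actually reads.
    model_config: object = None
    cache_config: object = None
    scheduler_config: object = None

def default_config() -> FakeVllmConfig:
    """Default initialization function for tests."""
    return FakeVllmConfig()

cfg = default_config()
cfg.cache_config = {"block_size": 16}   # the only field this test cares about
untouched = [f.name for f in fields(cfg) if getattr(cfg, f.name) is None]
print(untouched)  # every field except cache_config stays None
```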
Many vLLM tests are end-to-end tests that test the whole system, so this is not a major problem in practice.
Summary
The complete VllmConfig object can be treated as an engine-level global state shared among all vLLM classes. This design enables:
- Easy extensibility for new features
- Uniform interfaces across all models
- Memory-efficient initialization for large models
- Support for complex features like non-uniform quantization
Next steps
Plugin system
Extend vLLM with custom features
Memory management
Learn about PagedAttention and KV cache