How plugins work in vLLM
Plugins are user-registered code that vLLM executes. Given vLLM's architecture, multiple processes may be involved, especially when using distributed inference with various parallelism techniques. Key requirement: every process created by vLLM needs to load the plugins. This is done by the `load_plugins_by_group` function in the `vllm.plugins` module.
How vLLM discovers plugins
vLLM's plugin system uses the standard Python `entry_points` mechanism. This allows developers to register functions in their Python packages for use by other packages.
Example plugin
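For instance, a hypothetical package could expose a plugin through its setup.py (the `vllm_add_dummy_model` and `register_dummy_model` names match the example used later on this page):

```python
# setup.py of the hypothetical vllm_add_dummy_model package
from setuptools import setup

setup(
    name="vllm_add_dummy_model",
    version="0.1",
    packages=["vllm_add_dummy_model"],
    entry_points={
        # "<plugin group>": ["<plugin name> = <plugin value>"]
        "vllm.general_plugins": [
            "register_dummy_model = vllm_add_dummy_model:register",
        ],
    },
)
```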
Plugin components
Every plugin has three parts:
1. Plugin group
The name of the entry point group. This is the key of `entry_points` in setup.py.
General plugins: Use `vllm.general_plugins` for vLLM's general plugins.
2. Plugin name
The name of the plugin. This is the name of the individual entry within the `entry_points` group (the part before the `=` in each entry).
Example: `register_dummy_model`
Filtering: Plugins can be filtered by name using the `VLLM_PLUGINS` environment variable. To load only a specific plugin, set `VLLM_PLUGINS` to that plugin's name.
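For example (assuming the standard `vllm` CLI is installed; the model argument is a placeholder):

```shell
# Only the plugin named register_dummy_model is loaded;
# all other installed vLLM plugins are skipped.
VLLM_PLUGINS=register_dummy_model vllm serve <model>
```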
3. Plugin value
The fully qualified name of the function or module to register in the plugin system. Example: `vllm_add_dummy_model:register` refers to a function named `register` in the `vllm_add_dummy_model` module.
Types of supported plugins
General plugins
Group name: `vllm.general_plugins`
Primary use case: Register custom, out-of-tree models into vLLM by calling `ModelRegistry.register_model` inside the plugin function.
Example: a bart-plugin that adds support for `BartForConditionalGeneration`.
Platform plugins
Group name: `vllm.platform_plugins`
Primary use case: Register custom, out-of-the-tree platforms into vLLM.
Return value: The plugin function should return:
- `None` when the platform is not supported in the current environment
- The platform class's fully qualified name when the platform is supported
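A sketch of such an entry-point function (the SDK module and class names here are hypothetical):

```python
from typing import Optional

def register() -> Optional[str]:
    """Entry-point function for the vllm.platform_plugins group."""
    try:
        # Hypothetical device SDK whose presence indicates that the
        # platform is usable in the current environment.
        import my_dummy_device_sdk  # noqa: F401
    except ImportError:
        return None  # platform not supported here
    # Supported: return the platform class's fully qualified name.
    return "my_dummy_platform.DummyPlatform"
```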
IO processor plugins
Group name: `vllm.io_processor_plugins`
Primary use case: Register custom pre-/post-processing of the model prompt and model output for pooling models.
Return value: The IOProcessor class's fully qualified name.
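The entry-point function then reduces to returning that name (module and class names here are hypothetical):

```python
def get_io_processor() -> str:
    """Entry-point function for the vllm.io_processor_plugins group."""
    # Fully qualified name of the IOProcessor subclass that handles
    # prompt pre-processing and output post-processing.
    return "my_io_processor.MyPromptIOProcessor"
```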
Stat logger plugins
Group name: `vllm.stat_logger_plugins`
Primary use case: Register custom, out-of-the-tree loggers into vLLM.
Requirements: The entry point should be a class that subclasses `StatLoggerBase`.
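A minimal, self-contained sketch; the stand-in base class below only mimics the shape of vLLM's `StatLoggerBase`, and the method names are illustrative:

```python
class StatLoggerBase:
    """Stand-in for vLLM's StatLoggerBase; a real plugin subclasses
    the class shipped with vLLM instead."""
    def record(self, *args, **kwargs) -> None: ...
    def log(self) -> None: ...

class CountingStatLogger(StatLoggerBase):
    """Hypothetical logger that counts recorded stat updates."""
    def __init__(self) -> None:
        self.num_records = 0

    def record(self, *args, **kwargs) -> None:
        self.num_records += 1

    def log(self) -> None:
        print(f"stat updates recorded: {self.num_records}")
```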
Guidelines for writing plugins
General guidelines
Platform plugin guidelines
Platform plugins allow you to add support for custom hardware platforms to vLLM.
Register entry point
In setup.py, add the entry point. The `register` function should return the platform class's fully qualified name.
Implement platform class
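For a platform plugin, the setup.py entry point might look like this (hypothetical package name; note the `vllm.platform_plugins` group):

```python
# setup.py of a hypothetical out-of-tree platform package
from setuptools import setup

setup(
    name="my_dummy_platform",
    version="0.1",
    packages=["my_dummy_platform"],
    entry_points={
        "vllm.platform_plugins": [
            "my_dummy_platform = my_dummy_platform:register",
        ],
    },
)
```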
In my_dummy_platform.py, implement the platform class inheriting from `vllm.platforms.interface.Platform`. Key properties and methods:
- `_enum`: Device enumeration from `PlatformEnum` (usually `PlatformEnum.OOT` for out-of-tree)
- `device_type`: Type of device PyTorch uses (e.g., "cpu", "cuda")
- `device_name`: Usually same as `device_type`, mainly for logging
- `check_and_update_config`: Called early in initialization to update the vLLM config. Must set `worker_cls` here
- `get_attn_backend_cls`: Return the attention backend class's fully qualified name
- `get_device_communicator_cls`: Return the device communicator class's fully qualified name
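A self-contained sketch of such a platform class. The `Platform`/`PlatformEnum` stand-ins below replace the real vLLM imports so the snippet runs on its own, and all concrete names are hypothetical:

```python
import enum

# Stand-ins for vllm.platforms.interface.Platform and PlatformEnum;
# a real plugin imports them from vLLM instead.
class PlatformEnum(enum.Enum):
    OOT = "oot"  # out-of-tree platform

class Platform:
    pass

class DummyPlatform(Platform):
    _enum = PlatformEnum.OOT
    device_type = "cpu"   # device type PyTorch should use
    device_name = "cpu"   # usually same as device_type, for logging

    @classmethod
    def check_and_update_config(cls, vllm_config) -> None:
        # Called early in initialization; the worker class must be
        # set here (attribute path is an assumption about the config layout).
        vllm_config.parallel_config.worker_cls = "my_dummy_worker.DummyWorker"

    @classmethod
    def get_attn_backend_cls(cls, *args, **kwargs) -> str:
        return "my_dummy_attention.DummyAttentionBackend"

    @classmethod
    def get_device_communicator_cls(cls) -> str:
        return "my_dummy_platform.DummyCommunicator"
```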
Implement worker class
In my_dummy_worker.py, implement the worker class inheriting from `WorkerBase`. Required methods:
- `init_device`: Set up the device for the worker
- `initialize_cache`: Set cache config for the worker
- `load_model`: Load model weights to device
- `get_kv_cache_spec`: Generate KV cache spec for the model
- `determine_available_memory`: Profile peak memory usage
- `initialize_from_config`: Allocate device KV cache
- `execute_model`: Execute model inference (called every step)
Feature-support methods:
- `sleep` and `wake_up`: Support the sleep mode feature
- `compile_or_warm_up_model`: Support the graph mode feature
- `take_draft_token_ids`: Support speculative decoding
- `add_lora`, `remove_lora`, `list_loras`, `pin_lora`: Support LoRA
- `execute_dummy_batch`: Support data parallelism
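A skeleton of such a worker (stand-in base class so the snippet runs without vLLM; only a few of the methods listed above are shown, with bodies elided):

```python
class WorkerBase:
    """Stand-in for vLLM's WorkerBase; a real worker inherits the
    class shipped with vLLM."""

class DummyWorker(WorkerBase):
    def init_device(self) -> None:
        """Set up the device for this worker."""

    def load_model(self) -> None:
        """Load model weights onto the device."""

    def determine_available_memory(self) -> int:
        """Profile peak memory usage; return bytes usable for KV cache."""
        return 0

    def execute_model(self, scheduler_output):
        """Run one inference step (called every step)."""
```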
Implement attention backend
In my_dummy_attention.py, implement the attention backend class inheriting from `AttentionBackend`.
Purpose: Calculate attention with your device.
Examples: See `vllm.v1.attention.backends` for various attention backend implementations.
Implement custom ops (optional)
Implement custom ops for high performance. vLLM supports:
- PyTorch ops:
  - Communicator ops: Device communicator operations (all-reduce, all-gather, etc.). Inherit from `DeviceCommunicatorBase`
  - Common ops: Common operations (matmul, softmax, etc.). Register using the `CustomOp` class
- C++ ops: Implemented in C++ and registered as torch custom ops. Follow the `csrc` module and `vllm._custom_ops`
Compatibility guarantee
vLLM guarantees that the interface of documented plugins (such as `ModelRegistry.register_model`) will always be available.
The interface for models/modules may change during vLLM’s development. If you see any deprecation log info, please upgrade your plugin to the latest version.
Deprecation announcements
Example: Complete general plugin
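A self-contained sketch of such a plugin. The `ModelRegistry` stand-in below replaces `from vllm import ModelRegistry` so the snippet runs on its own, and `MyDummyModel` is a hypothetical model class:

```python
# vllm_add_dummy_model/__init__.py (hypothetical package)

class ModelRegistry:
    """Stand-in for vllm.ModelRegistry; a real plugin imports it
    from vLLM instead of defining it."""
    _models: dict = {}

    @classmethod
    def get_supported_archs(cls):
        return list(cls._models)

    @classmethod
    def register_model(cls, arch: str, model_cls) -> None:
        cls._models[arch] = model_cls


def register() -> None:
    """Entry-point function for the vllm.general_plugins group."""
    # In a real plugin, import the model class lazily here, e.g.
    # `from vllm_add_dummy_model.my_model import MyDummyModel`.
    class MyDummyModel:  # placeholder for a real model class
        pass

    # Guard against double registration: every vLLM process loads
    # the plugins, and a plugin may be invoked more than once.
    if "MyDummyModel" not in ModelRegistry.get_supported_archs():
        ModelRegistry.register_model("MyDummyModel", MyDummyModel)
```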
A complete general plugin, then, combines a setup.py entry point in the `vllm.general_plugins` group with a `register` function that calls `ModelRegistry.register_model` for the custom model.
Next steps
- Model registration: Learn more about registering models
- Architecture: Understand vLLM's architecture