Overview
Custom kernels can be integrated in two ways:

- Custom Operations: PyTorch custom ops for the PyTorch backend
- TensorRT Plugins: Native TensorRT plugins for the TensorRT backend
Custom PyTorch Operations
For the PyTorch backend, use PyTorch's custom op registration:

Example: Custom Attention Kernel
Integration with TensorRT-LLM
Add your custom op to the auto_deploy custom_ops directory:

TensorRT Plugins (Legacy)
For the TensorRT backend, implement a TensorRT plugin:

Performance Considerations
Memory Coalescing
Ensure memory accesses are coalesced for maximum bandwidth:
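The contrast below illustrates the point: in the coalesced kernel, consecutive threads in a warp touch consecutive addresses and the accesses combine into few memory transactions; the strided kernel scatters each warp across many cache lines (a generic CUDA sketch, not a kernel from this codebase):

```cuda
// Coalesced: thread i reads element i, so a warp covers one contiguous span.
__global__ void copy_coalesced(const float* __restrict__ in,
                               float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: adjacent threads hit addresses `stride` elements apart,
// forcing one transaction per cache line and wasting bandwidth.
__global__ void copy_strided(const float* __restrict__ in,
                             float* __restrict__ out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```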
Shared Memory

Use shared memory to stage data that is reused within a thread block, cutting redundant global memory traffic:
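A classic example is a tiled matrix transpose: each block stages a 32×32 tile in shared memory so both the global read and the global write are coalesced (a generic CUDA sketch; the +1 column of padding avoids shared-memory bank conflicts):

```cuda
#define TILE 32

__global__ void transpose_tiled(const float* __restrict__ in,
                                float* __restrict__ out,
                                int width, int height) {
    __shared__ float tile[TILE][TILE + 1];  // +1 pad breaks bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced read
    __syncthreads();

    // Swap block indices so the write is also coalesced.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```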
Occupancy
Maximize occupancy by balancing register usage and thread blocks:
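Two standard tools for this are `__launch_bounds__`, which caps register usage so more blocks fit per SM, and the runtime occupancy API for checking the result (a generic CUDA sketch; the kernel is illustrative):

```cuda
#include <cstdio>

// 256 threads per block; ask the compiler to keep register usage low
// enough that at least 4 blocks can be resident per SM.
__global__ void __launch_bounds__(256, 4)
scale_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void report_occupancy() {
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, scale_kernel, /*blockSize=*/256, /*dynamicSMemSize=*/0);
    printf("resident blocks per SM: %d\n", blocks_per_sm);
}
```

Raising occupancy is not always a win: past the point where latency is hidden, spilling registers to fit more blocks can slow a kernel down, so measure rather than assume.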
Testing Custom Kernels
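Custom kernels are usually validated by comparing against an eager PyTorch reference on random inputs. A hedged sketch of that pattern (the attention reference and helper names are illustrative, not part of the TensorRT-LLM test suite):

```python
import torch


def reference_attention(q, k, v):
    """Eager PyTorch implementation used as ground truth."""
    scale = q.shape[-1] ** -0.5
    return torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1) @ v


def check_against_reference(custom_fn, shape=(2, 4, 16, 32),
                            rtol=1e-3, atol=1e-3):
    """Run a custom kernel and the reference on the same random inputs
    and fail loudly if they diverge beyond the given tolerances."""
    torch.manual_seed(0)
    q, k, v = (torch.randn(*shape) for _ in range(3))
    out = custom_fn(q, k, v)
    assert out.shape == q.shape
    torch.testing.assert_close(out, reference_attention(q, k, v),
                               rtol=rtol, atol=atol)
```

In a pytest suite you would call `check_against_reference` with your op (e.g. `torch.ops.auto_deploy.my_attention`), and repeat across dtypes, shapes, and devices; loose tolerances matter for fp16/bf16 kernels, where bit-exact agreement with fp32 references is not achievable.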
Examples in TensorRT-LLM
Study existing custom ops:

- tensorrt_llm/_torch/custom_ops/fused_moe/ - Fused MoE kernel
- tensorrt_llm/_torch/cuda_tile_kernels/ - CUDA tile kernels
- tensorrt_llm/kernels/ - C++ CUDA kernels
Next Steps

- Profiling: profile your custom kernels
- Optimization Guide: optimize kernel performance