Kernel Development
This guide covers developing custom CUDA kernels for SGLang, including Triton kernels and CUDA C++ kernels.

Overview
SGLang uses highly optimized kernels for:

- Attention: FlashAttention, FlashInfer
- GEMM: Matrix multiplication (via cuBLAS, cutlass)
- Elementwise ops: RMSNorm, SiLU, RoPE
- Sampling: Top-k, top-p, softmax
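For the sampling category, the core of top-p (nucleus) filtering can be sketched in pure Python; the real kernels execute this logic in parallel on the GPU. `top_p_filter` is an illustrative name, not an SGLang API.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p; zero out the rest and renormalize.
    (Pure-Python sketch of what a fused GPU sampling kernel computes.)"""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = set(), 0.0
    for i in order:
        kept.add(i)
        cum += probs[i]
        if cum >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]
```

For example, `top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.7)` keeps only the two most likely tokens and renormalizes their probabilities.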
These kernels live in two places:

- Triton kernels: python/sglang/srt/layers/
- CUDA kernels: the sgl-kernel package (separate repository)
Why Custom Kernels?
Custom kernels provide:

- Performance: 2-10x speedup over PyTorch ops
- Memory efficiency: Fused operations reduce memory bandwidth
- Flexibility: Implement custom operators not in PyTorch
Triton Kernels
Introduction to Triton
Triton is a Python DSL for writing GPU kernels. It is easier to use than CUDA C++ while still delivering high performance.

Example: Fused RMSNorm
RMSNorm (Root Mean Square Layer Normalization) is commonly used in modern LLMs.

Unfused Implementation (PyTorch)
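The unfused version computes y = x / sqrt(mean(x²) + eps) · w as separate steps, each of which would be its own kernel launch in a naive PyTorch implementation. A dependency-free sketch of the math:

```python
import math

def rmsnorm_ref(x, weight, eps=1e-6):
    """Unfused RMSNorm reference: each line corresponds to a separate
    pass over the data (reduction, scalar op, elementwise scale)."""
    mean_sq = sum(v * v for v in x) / len(x)          # reduction pass
    rstd = 1.0 / math.sqrt(mean_sq + eps)             # scalar rsqrt
    return [v * rstd * w for v, w in zip(x, weight)]  # elementwise pass
```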
Fused Triton Kernel
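A sketch of what the fused kernel could look like: one program instance handles one row, so the reduction, rsqrt, and scaling all happen in a single pass over data loaded once. This assumes the row fits in one block; SGLang's production kernel handles additional cases.

```python
import triton
import triton.language as tl

@triton.jit
def rmsnorm_kernel(x_ptr, w_ptr, out_ptr, n_cols, eps,
                   BLOCK_SIZE: tl.constexpr):
    # One program instance normalizes one row of the input.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)  # BLOCK_SIZE must be a power of 2
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
    # Fused: reduction, rsqrt, and elementwise scale in one pass over x.
    mean_sq = tl.sum(x * x, axis=0) / n_cols
    rstd = 1.0 / tl.sqrt(mean_sq + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    tl.store(out_ptr + row * n_cols + cols, x * rstd * w, mask=mask)
```

Launching with a 1D grid of `(num_rows,)` gives one fused kernel launch instead of several, cutting memory traffic since `x` is read only once.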
Triton Best Practices
1. Use Power-of-2 Block Sizes
2. Coalesce Memory Accesses
3. Minimize Synchronization
4. Optimize Occupancy
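For practice 1: Triton requires `tl.arange` extents to be powers of two, so kernels typically round the column count up to the next power of two and mask off the tail. `triton.next_power_of_2` provides this; a pure-Python equivalent:

```python
def next_power_of_2(n):
    """Smallest power of two >= n; used to pick BLOCK_SIZE for a row of
    n elements, with out-of-range lanes masked off inside the kernel."""
    p = 1
    while p < n:
        p *= 2
    return p
```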
CUDA C++ Kernels
For maximum performance, write CUDA C++ kernels in the sgl-kernel package.
Example: Fused Add + ReLU
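As a reference for what the fused CUDA kernel computes per element: out[i] = max(x[i] + y[i], 0). A pure-Python sketch (the real kernel performs this per-thread in C++, avoiding a materialized intermediate sum):

```python
def fused_add_relu_ref(x, y):
    """Reference for the fused kernel: one pass computes add + ReLU
    instead of writing the intermediate sum back to global memory."""
    return [max(a + b, 0.0) for a, b in zip(x, y)]
```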
PyTorch Binding
Build System
FlashAttention Integration
SGLang uses FlashInfer for optimized attention.

Using FlashInfer
Custom Attention Backend
To add a new attention backend:

- Create attention class:
- Register backend:
- Use it:
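The three steps can be sketched with a minimal registry pattern. All names below (`ATTENTION_BACKENDS`, `register_backend`, `MyAttentionBackend`) are illustrative, not SGLang's actual API.

```python
# Hypothetical registry for illustration; SGLang's real mechanism differs.
ATTENTION_BACKENDS = {}

def register_backend(name):
    """Step 2: register the class under a string name."""
    def wrap(cls):
        ATTENTION_BACKENDS[name] = cls
        return cls
    return wrap

@register_backend("my_attention")
class MyAttentionBackend:
    """Step 1: the attention class implements the forward hook."""
    def forward(self, q, k, v):
        raise NotImplementedError

# Step 3: select the backend by name (e.g. via --attention-backend).
backend_cls = ATTENTION_BACKENDS["my_attention"]
```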
Kernel Optimization Techniques
1. Tiling
Break computation into tiles that fit in shared memory:

2. Vectorized Loads

Load multiple elements per thread:

3. Warp Shuffle

Communicate within a warp without shared memory:

Profiling Kernels
Nsight Compute
Key Metrics
- SM Throughput: Streaming Multiprocessor utilization
- Memory Throughput: DRAM bandwidth utilization
- Occupancy: Active warps / max warps
- Register Usage: Registers per thread
- Shared Memory Usage: Bytes per block
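Occupancy can be estimated by hand from the metrics above. A simplified sketch that considers only the register limit (real occupancy also depends on shared memory, block-count limits, and allocation granularity; the default SM parameters below are assumptions, not universal):

```python
def occupancy_from_registers(regs_per_thread, threads_per_block,
                             regs_per_sm=65536, max_warps_per_sm=48):
    """Simplified occupancy estimate: active warps limited by the
    register file, divided by the hardware maximum warps per SM."""
    regs_per_block = regs_per_thread * threads_per_block
    blocks = regs_per_sm // regs_per_block    # blocks that fit per SM
    warps = blocks * (threads_per_block // 32)
    return min(warps, max_warps_per_sm) / max_warps_per_sm
```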
Testing Kernels
Correctness Test
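A correctness test compares the kernel's output against a trusted reference within a tolerance. A dependency-free sketch of the pattern (SGLang's real tests use torch.testing.assert_close on GPU tensors):

```python
import math
import random

def assert_allclose(a, b, rtol=1e-5, atol=1e-6):
    """Elementwise closeness check, mirroring torch.testing.assert_close."""
    for x, y in zip(a, b):
        assert math.isclose(x, y, rel_tol=rtol, abs_tol=atol), (x, y)

# Compare a "fused" implementation against an unfused reference
# on randomized inputs.
random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(128)]
ys = [random.uniform(-1, 1) for _ in range(128)]
fused = [max(x + y, 0.0) for x, y in zip(xs, ys)]
reference = [max(s, 0.0) for s in (x + y for x, y in zip(xs, ys))]
assert_allclose(fused, reference)
```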
Performance Test
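A performance test times many iterations after a warmup, reporting the average per call. A CPU-side sketch of the pattern; GPU benchmarks must additionally synchronize the device (e.g. torch.cuda.synchronize) before stopping the clock, since kernel launches are asynchronous:

```python
import time

def benchmark(fn, *args, warmup=10, iters=100):
    """Return average seconds per call, excluding warmup iterations."""
    for _ in range(warmup):   # warm caches / JIT before timing
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters

avg = benchmark(sum, [float(i) for i in range(1000)])
```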
Adding Kernels to sgl-kernel
See the Contribution Guide for the multi-PR workflow.

Step 1: Add Kernel Implementation
Step 2: Submit PR
Submit a PR to sgl-kernel without using the new kernel yet.
Step 3: Bump Version
Submit another PR to bump the sgl-kernel version. This triggers a PyPI release.
Step 4: Use Kernel
Update pyproject.toml in sglang and use the new kernel.
Resources
Next Steps
- Architecture Overview - System design
- Scheduler - Request scheduling
- Memory Management - KV cache system
