Adding Models
This guide covers how to add support for new model architectures to SGLang.
Prerequisites
- Understanding of the model architecture you want to add
- Access to the model’s Hugging Face implementation or source code
- Familiarity with PyTorch and transformer models
- SGLang development environment set up (Development Setup)
Overview
Adding a new model to SGLang typically involves:
- Creating a model implementation file
- Registering the model architecture
- Adding tests
- Updating documentation
Step 1: Create Model Implementation
File Location
Create a new file in python/sglang/srt/models/ named after your model (e.g., my_model.py).
Model Structure
A typical model implementation includes the key components described below.
Key Components
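The original listing is not reproduced here; as a rough structural sketch (all identifiers are illustrative, and a real implementation subclasses torch.nn.Module and builds on layers from sglang.srt.layers, such as RadixAttention):

```python
# Structural sketch only: names are illustrative, not the exact SGLang API.

class MyModelDecoderLayer:
    """One transformer block: pre-norm attention followed by an MLP."""

    def __init__(self, config):
        self.hidden_size = config["hidden_size"]
        # Real code: self.self_attn = MyModelAttention(config, layer_id=...)
        # Real code: self.mlp = MyModelMLP(config)


class MyModelForCausalLM:
    """Top-level model: token embedding, decoder stack, and LM head."""

    def __init__(self, config):
        self.config = config
        self.layers = [
            MyModelDecoderLayer(config)
            for _ in range(config["num_hidden_layers"])
        ]

    def load_weights(self, weights):
        # Map Hugging Face checkpoint names onto this module's parameters.
        # Real code iterates (name, tensor) pairs and dispatches each tensor
        # to the matching parameter's weight_loader; here we only record the
        # mapped names to show the renaming idea.
        mapped = []
        for name, _tensor in weights:
            mapped.append(name.removeprefix("model."))
        return mapped
```

In real SGLang model files, the forward pass threads a forward-batch/metadata object through every layer so the attention backend can index the KV cache.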
Model Layers
Implement the core model architecture: the token embedding, a stack of decoder blocks, and the output (LM head) projection.
Attention Layer
Implement attention using SGLang’s optimized attention primitives (for example, RadixAttention) rather than a plain PyTorch implementation.
Weight Loading
Implement weight loading from Hugging Face checkpoints, typically as a load_weights method that maps checkpoint parameter names onto the module’s parameters.
Step 2: Register Model
Add your model to the model registry in python/sglang/srt/model_loader/loader.py.
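The registry entry itself is not shown above; a hedged sketch of the idea follows (the structure and names may differ from the actual loader.py, and many SGLang versions also discover models through an EntryClass attribute exported by the model file):

```python
# Hypothetical sketch: a registry maps the `architectures` string from the
# Hugging Face config to the implementing class.

class MyModelForCausalLM:
    """Stand-in for the real model class from Step 1."""


MODEL_REGISTRY = {
    "MyModelForCausalLM": MyModelForCausalLM,
}


def resolve_model_class(architecture: str):
    # Look up the implementation for a given HF architecture name.
    try:
        return MODEL_REGISTRY[architecture]
    except KeyError:
        raise ValueError(f"Unsupported architecture: {architecture}")
```

The key in the registry must match the string in the architectures field of the model’s config.json, or the loader will not find your class.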
Step 3: Add Configuration
If your model has a custom configuration, create a config class (typically a transformers.PretrainedConfig subclass) so its hyperparameters load correctly.
Step 4: Add Tests
Create tests in test/srt/test_my_model.py.
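The test file content is not shown above; a hedged skeleton follows (the model path and the Engine API calls are placeholders, and the test is gated behind an environment variable so it skips cleanly on machines without a GPU or checkpoint):

```python
import os
import unittest


class TestMyModel(unittest.TestCase):
    """Hedged skeleton; model path and API usage are illustrative."""

    @unittest.skipUnless(
        os.environ.get("RUN_MODEL_TESTS"),
        "set RUN_MODEL_TESTS=1 and install sglang to run",
    )
    def test_generate(self):
        import sglang  # imported lazily so collection works without it

        # Placeholder path; point this at your model checkpoint.
        engine = sglang.Engine(model_path="my-org/my-model")
        out = engine.generate("Hello", {"max_new_tokens": 8})
        self.assertTrue(out)
        engine.shutdown()
```

Gating expensive GPU tests behind an environment variable keeps CI collection fast while still allowing full runs locally.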
Step 5: Test Your Model
Manual Testing
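The original commands are not shown here; a hedged example (model path, port, and request body are illustrative):

```shell
# Launch a server with your model (path is a placeholder).
python -m sglang.launch_server --model-path my-org/my-model --port 30000

# In another terminal, send a request to the native /generate endpoint.
curl -s http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "The capital of France is", "sampling_params": {"max_new_tokens": 16}}'
```

If the server starts and returns coherent text, weight loading and the forward pass are at least basically correct.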
Run Unit Tests
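The original command is not shown here; assuming the test file from Step 4, something like:

```shell
# From the repository root; run only your model's tests.
python -m pytest test/srt/test_my_model.py -v
```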
Step 6: Optimize Performance
Use Fused Kernels
Replace standard operations with fused, optimized kernels where they are available.
Enable CUDA Graphs
Ensure your model supports CUDA graphs by avoiding dynamic control flow and data-dependent shapes in the forward pass.
Step 7: Add Documentation
Update the documentation:
- Add the model to the supported models list
- Create example usage in docs
- Document any special requirements or configuration
Example Documentation
- Supports GQA (Grouped-Query Attention)
- Requires trust_remote_code=True
MoE Models
For Mixture-of-Experts models, use SGLang’s MoE layers (for example, its fused MoE implementation) in place of the standard MLP block.
Troubleshooting
Model Not Loading
- Check model registration in loader.py
- Verify model_type in the config matches the registry
- Ensure trust_remote_code=True if needed
OOM (Out of Memory)
- Reduce batch size
- Enable memory optimizations (for example, lower --mem-fraction-static)
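The specific flags are not shown above; hedged examples of memory-related server options (values are illustrative and should be tuned for your GPU):

```shell
# Reserve less GPU memory for the static KV-cache pool and chunk long prefills.
python -m sglang.launch_server --model-path my-org/my-model \
  --mem-fraction-static 0.7 \
  --chunked-prefill-size 2048
```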
Slow Performance
- Enable CUDA graphs: remove --disable-cuda-graph
- Use tensor parallelism: --tp-size 2
- Profile with Nsight: see Benchmark and Profiling
Checklist
Before submitting your model:
- Model implementation complete
- Weights load correctly from Hugging Face
- Unit tests pass
- Manual testing successful
- Documentation updated
- Pre-commit hooks pass
- Performance acceptable
Resources
Next Steps
- Testing - Add comprehensive tests
- Kernel Development - Optimize with custom kernels
- Contribution Guide - Submit your model
