## Supported Architectures
Mini-SGLang currently supports the following model families:

- **Llama-3 Series**: Meta's Llama 3 and 3.1 models in various parameter sizes
- **Qwen-3 Series**: Alibaba Cloud's Qwen 3 models, including MoE variants
- **Qwen-2.5 Series**: the Qwen 2.5 model family with improved performance
### Llama-3 Models

The Llama-3 architecture (LlamaForCausalLM) supports:
- meta-llama/Llama-3.1-8B
- meta-llama/Llama-3.1-8B-Instruct
- meta-llama/Llama-3.1-70B
- meta-llama/Llama-3.1-70B-Instruct
- meta-llama/Llama-3.1-405B
- meta-llama/Llama-3.1-405B-Instruct
### Qwen-3 Models

The Qwen-3 architecture (Qwen3ForCausalLM) supports dense Qwen 3 checkpoints, for example:
- Qwen/Qwen3-0.6B
- Qwen/Qwen3-14B
- Qwen/Qwen3-32B
### Qwen-3 MoE Models

Mixture-of-Experts variants (Qwen3MoeForCausalLM), such as:
- Qwen/Qwen3-MoE-16B
### Qwen-2.5 Models

The Qwen-2.5 architecture (Qwen2ForCausalLM) supports:
- Qwen/Qwen2.5-0.5B
- Qwen/Qwen2.5-1.5B
- Qwen/Qwen2.5-3B
- Qwen/Qwen2.5-7B
- Qwen/Qwen2.5-14B
- Qwen/Qwen2.5-32B
- Qwen/Qwen2.5-72B
## Specifying Models

### Basic Usage

Specify models using the `--model` argument. Model identifiers follow the HuggingFace convention `organization/model-name`.

### Examples
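A minimal invocation might look like the following. The `python -m mini_sglang.launch_server` entry point is an assumption for illustration (check your installation for the actual command); `--model` is the flag documented above:

```shell
# Serve a small Qwen model (entry-point name is hypothetical; --model is the documented flag)
python -m mini_sglang.launch_server --model Qwen/Qwen3-0.6B

# Serve an instruction-tuned Llama checkpoint
python -m mini_sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct
```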
## Model Sources

Mini-SGLang supports downloading models from two sources: HuggingFace (the default) and ModelScope.

### HuggingFace (Default)

By default, models are downloaded from HuggingFace Hub.

### Gated Models
For gated models (e.g., Llama-3), provide your HuggingFace token via the `HF_TOKEN` environment variable, or use `huggingface-cli` to log in:
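For example (`HF_TOKEN` is the variable referenced in the Troubleshooting section; `huggingface-cli login` is the standard Hugging Face Hub login command; the server entry point is a hypothetical placeholder):

```shell
# Option 1: pass the token through the environment for this run
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx   # placeholder token
python -m mini_sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct

# Option 2: log in once; the token is cached for future downloads
huggingface-cli login
```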
### ModelScope

For users in regions with restricted HuggingFace access, use ModelScope. Model identifiers remain the same regardless of the source; Mini-SGLang handles the translation automatically.
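A sketch of switching sources. The environment-variable name below is a hypothetical placeholder (full SGLang uses a similar `SGLANG_USE_MODELSCOPE` toggle); consult the Mini-SGLang README for the real mechanism:

```shell
# Hypothetical toggle; the variable name is an assumption, not the documented API
export MINI_SGLANG_USE_MODELSCOPE=true
# Same model identifier as on HuggingFace (entry point is also hypothetical)
python -m mini_sglang.launch_server --model Qwen/Qwen2.5-7B
```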
## Model Selection Guide

### By Use Case
#### Development & Testing
Recommended: Qwen/Qwen3-0.6B or Qwen/Qwen2.5-0.5B
- Fast iteration
- Low memory footprint
- Runs on consumer GPUs
#### Production (Single GPU)
Recommended: Qwen/Qwen3-14B or meta-llama/Llama-3.1-8B-Instruct
- Good balance of quality and speed
- Fits on 24GB GPUs (A10, RTX 3090/4090)
#### Production (Multi-GPU)
Recommended: Qwen/Qwen3-32B or meta-llama/Llama-3.1-70B-Instruct
- State-of-the-art quality
- Requires 2-4 GPUs
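Multi-GPU serving uses the `--tp` tensor-parallelism flag mentioned under Troubleshooting; a sketch (the entry-point name is an assumption):

```shell
# Shard a 70B model across 4 GPUs with tensor parallelism
# (--tp is the documented flag; the entry point is hypothetical)
python -m mini_sglang.launch_server --model meta-llama/Llama-3.1-70B-Instruct --tp 4
```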
#### High Throughput
Recommended: Qwen/Qwen3-MoE-16B
- Mixture-of-Experts for efficiency
- Better throughput than dense models
By GPU Memory
- 8-16 GB
- 16-24 GB
- 40-48 GB
- 80+ GB (Single)
- Multiple GPUs
- Qwen/Qwen3-0.6B
- Qwen/Qwen2.5-0.5B
- Qwen/Qwen2.5-1.5B
## Model Architecture Detection

Mini-SGLang automatically detects the model architecture from the checkpoint's `config.json` file:
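For example, a Qwen-2.5 checkpoint's `config.json` carries an `architectures` field that maps to `Qwen2ForCausalLM`. A minimal sketch of the lookup, using a stand-in config file and the four architectures documented above (the registry shape inside Mini-SGLang is an assumption):

```shell
# Simulate the detection step: read "architectures" from config.json and match it
# against the documented registry.
cat > /tmp/demo-config.json <<'EOF'
{"architectures": ["Qwen2ForCausalLM"], "model_type": "qwen2"}
EOF
python - <<'EOF'
import json
REGISTRY = {"LlamaForCausalLM", "Qwen3ForCausalLM", "Qwen3MoeForCausalLM", "Qwen2ForCausalLM"}
arch = json.load(open("/tmp/demo-config.json"))["architectures"][0]
print(arch, "supported" if arch in REGISTRY else "not supported")
EOF
# prints: Qwen2ForCausalLM supported
```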
## Custom and Fine-tuned Models

### Local Models

Load models from a local path by passing the directory to `--model`.

### Fine-tuned Models

Any model fine-tuned from a supported base architecture should work.

### Quantized Models
## Troubleshooting

### Model Not Found

- Verify the model identifier is correct
- Check HuggingFace/ModelScope availability
- For gated models, ensure you have access and have provided `HF_TOKEN`
### Architecture Not Supported

Error: `Model architecture X not supported`

The model's architecture is not in the registry; only Llama-3, Qwen-2.5, Qwen-3, and Qwen-3-MoE are currently supported.

### Network Issues

If HuggingFace downloads fail, switch to ModelScope as described under Model Sources above.
### Out of Memory

- Use a smaller model variant
- Enable tensor parallelism with `--tp`
- Reduce `--max-prefill-length`
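The three mitigations can be combined in one launch. `--tp` and `--max-prefill-length` are the flags listed above; the entry-point name is an assumption:

```shell
# Hypothetical entry point; --tp and --max-prefill-length are the documented flags.
# Smaller model + 2-way tensor parallelism + reduced prefill length.
python -m mini_sglang.launch_server \
  --model Qwen/Qwen3-14B \
  --tp 2 \
  --max-prefill-length 4096
```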