Mini-SGLang supports several popular open-source model architectures optimized for efficient inference. This guide covers supported models, how to specify them, and model source configuration.

Supported Architectures

Mini-SGLang currently supports the following model families:

Llama-3 Series

Meta’s Llama 3 and 3.1 models with various parameter sizes

Qwen-3 Series

Alibaba Cloud’s Qwen 3 models including MoE variants

Qwen-2.5 Series

Qwen 2.5 model family with improved performance

Llama-3 Models

The Llama-3 architecture (LlamaForCausalLM) covers Meta’s Llama 3 and 3.1 models, as well as other Llama-3-compatible models.

Qwen-3 Models

The Qwen-3 architecture (Qwen3ForCausalLM) covers Alibaba Cloud’s dense Qwen 3 models.

Qwen-3 MoE Models

Mixture-of-Experts variants of Qwen 3 use the Qwen3MoeForCausalLM architecture.

Qwen-2.5 Models

The Qwen-2.5 architecture (Qwen2ForCausalLM) covers the Qwen 2.5 model family.

Specifying Models

Basic Usage

Specify models using the --model argument:
python -m minisgl --model "Qwen/Qwen3-0.6B"
The model identifier follows the HuggingFace format: organization/model-name.
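As a sketch, an identifier of this form splits into an organization and a model name. The helper below is illustrative only, not part of Mini-SGLang’s API:

```python
def parse_model_id(model: str) -> tuple[str, str]:
    """Split a HuggingFace-style identifier into (organization, model name)."""
    org, _, name = model.partition("/")
    if not name:
        raise ValueError(f"expected 'organization/model-name', got {model!r}")
    return org, name
```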

Examples

python -m minisgl --model "Qwen/Qwen3-0.6B"

Model Sources

Mini-SGLang supports downloading models from two sources:

HuggingFace (Default)

By default, models are downloaded from HuggingFace Hub:
python -m minisgl --model "Qwen/Qwen3-0.6B"

Gated Models

For gated models (e.g., Llama-3), provide your HuggingFace token:
export HF_TOKEN=hf_your_token_here
python -m minisgl --model "meta-llama/Llama-3.1-8B-Instruct"
Or log in with the huggingface-cli:
pip install huggingface-hub
huggingface-cli login

ModelScope

For users in regions with restricted HuggingFace access, use ModelScope:
python -m minisgl --model "Qwen/Qwen3-32B" --model-source modelscope
Model identifiers remain the same regardless of the source. Mini-SGLang handles the translation automatically.
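Conceptually, the source flag just selects which hub client performs the download. A minimal sketch, assuming the `huggingface_hub` and `modelscope` packages; the function is illustrative, not Mini-SGLang’s internal API:

```python
def get_download_fn(source: str):
    """Return a snapshot-download function for the chosen model source."""
    # Import lazily so only the selected hub client needs to be installed.
    if source == "huggingface":
        from huggingface_hub import snapshot_download
    elif source == "modelscope":
        from modelscope import snapshot_download
    else:
        raise ValueError(f"unknown model source: {source!r}")
    # Both clients accept the same "organization/model-name" identifier,
    # which is why the --model argument never changes.
    return snapshot_download
```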

Model Selection Guide

By Use Case

Development and experimentation

Recommended: Qwen/Qwen3-0.6B or Qwen/Qwen2.5-0.5B
  • Fast iteration
  • Low memory footprint
  • Runs on consumer GPUs
python -m minisgl --model "Qwen/Qwen3-0.6B" --shell

Balanced serving

Recommended: Qwen/Qwen3-14B or meta-llama/Llama-3.1-8B-Instruct
  • Good balance of quality and speed
  • Fits on 24GB GPUs (A10, RTX 3090/4090)
python -m minisgl --model "Qwen/Qwen3-14B"

Maximum quality

Recommended: Qwen/Qwen3-32B or meta-llama/Llama-3.1-70B-Instruct
  • State-of-the-art quality
  • Requires 2-4 GPUs
python -m minisgl --model "meta-llama/Llama-3.1-70B-Instruct" --tp 4

High-throughput serving

Recommended: Qwen/Qwen3-MoE-16B
  • Mixture-of-Experts for efficiency
  • Better throughput than dense models
python -m minisgl --model "Qwen/Qwen3-MoE-16B"

By GPU Memory

For GPUs with limited memory, start with the smallest variants:
  • Qwen/Qwen3-0.6B
  • Qwen/Qwen2.5-0.5B
  • Qwen/Qwen2.5-1.5B
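A rough way to sanity-check these recommendations: bf16 weights take about 2 bytes per parameter, before accounting for the KV cache and activations. The helper below is illustrative, not part of Mini-SGLang:

```python
def weight_memory_gib(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate GPU memory needed for model weights alone, in GiB.

    Assumes bf16 (2 bytes/param) by default; excludes KV cache and activations.
    """
    return params_billion * 1e9 * bytes_per_param / 2**30
```

For example, an 8B model needs roughly 15 GiB for bf16 weights alone, which is why a 24GB card is a comfortable fit once the KV cache is added on top.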

Model Architecture Detection

Mini-SGLang automatically detects the model architecture from the config.json file:
# Supported architectures in the registry:
_MODEL_REGISTRY = {
    "LlamaForCausalLM": (".llama", "LlamaForCausalLM"),
    "Qwen2ForCausalLM": (".qwen2", "Qwen2ForCausalLM"),
    "Qwen3ForCausalLM": (".qwen3", "Qwen3ForCausalLM"),
    "Qwen3MoeForCausalLM": (".qwen3_moe", "Qwen3MoeForCausalLM"),
}
If your model uses one of these architectures but isn’t explicitly listed, it should still work.
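Detection itself is simple: read the architectures field from config.json and look it up in the registry. A minimal sketch of that lookup, mirroring the registry above (the function name is illustrative):

```python
import json
from pathlib import Path

_MODEL_REGISTRY = {
    "LlamaForCausalLM": (".llama", "LlamaForCausalLM"),
    "Qwen2ForCausalLM": (".qwen2", "Qwen2ForCausalLM"),
    "Qwen3ForCausalLM": (".qwen3", "Qwen3ForCausalLM"),
    "Qwen3MoeForCausalLM": (".qwen3_moe", "Qwen3MoeForCausalLM"),
}

def detect_architecture(model_dir: str) -> tuple[str, str]:
    """Map a model directory's config.json to a (module, class) registry entry."""
    config = json.loads(Path(model_dir, "config.json").read_text())
    for arch in config.get("architectures", []):
        if arch in _MODEL_REGISTRY:
            return _MODEL_REGISTRY[arch]
    raise ValueError(
        f"Model architecture {config.get('architectures')} not supported"
    )
```

This is also why fine-tuned models work out of the box: a fine-tune keeps its base model’s architectures entry, so the lookup succeeds regardless of the repository name.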

Custom and Fine-tuned Models

Local Models

Load models from a local path:
python -m minisgl --model /path/to/local/model

Fine-tuned Models

Any model fine-tuned from a supported base architecture should work:
python -m minisgl --model "your-username/your-finetuned-llama-3"

Quantized Models

Mini-SGLang currently does not support quantized models (GPTQ, AWQ, etc.). Full-precision and bfloat16 models are supported.

Troubleshooting

Error: Model not found
  • Verify the model identifier is correct
  • Check HuggingFace/ModelScope availability
  • For gated models, ensure you have access and provided HF_TOKEN

Error: Model architecture X not supported
The model’s architecture is not in the registry. Only Llama-3, Qwen-2.5, Qwen-3, and Qwen-3-MoE are currently supported.

If HuggingFace downloads fail, switch to ModelScope:
python -m minisgl --model "Qwen/Qwen3-32B" --model-source modelscope

If you run out of GPU memory:
  • Use a smaller model variant
  • Enable tensor parallelism with --tp
  • Reduce --max-prefill-length
For the latest list of tested models and compatibility information, check the Mini-SGLang GitHub repository.
