## Supported Architectures
Mini-SGLang currently supports the following model families:

- **Llama-3 Series**: Meta's Llama 3 and 3.1 models in various parameter sizes
- **Qwen-3 Series**: Alibaba Cloud's Qwen 3 models, including MoE variants
- **Qwen-2.5 Series**: the Qwen 2.5 model family with improved performance
### Llama-3 Models

The Llama-3 architecture (LlamaForCausalLM) supports:
- meta-llama/Llama-3.1-8B
- meta-llama/Llama-3.1-8B-Instruct
- meta-llama/Llama-3.1-70B
- meta-llama/Llama-3.1-70B-Instruct
- meta-llama/Llama-3.1-405B
- meta-llama/Llama-3.1-405B-Instruct
### Qwen-3 Models

The Qwen-3 architecture (Qwen3ForCausalLM) supports dense Qwen 3 checkpoints, for example:
- Qwen/Qwen3-0.6B
- Qwen/Qwen3-14B
- Qwen/Qwen3-32B
### Qwen-3 MoE Models

Mixture-of-Experts variants (Qwen3MoeForCausalLM), such as:
- Qwen/Qwen3-MoE-16B
### Qwen-2.5 Models

The Qwen-2.5 architecture (Qwen2ForCausalLM) supports:
- Qwen/Qwen2.5-0.5B
- Qwen/Qwen2.5-1.5B
- Qwen/Qwen2.5-3B
- Qwen/Qwen2.5-7B
- Qwen/Qwen2.5-14B
- Qwen/Qwen2.5-32B
- Qwen/Qwen2.5-72B
## Specifying Models

### Basic Usage

Specify models using the `--model` argument. Model identifiers follow the HuggingFace convention `organization/model-name`.

### Examples
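A minimal invocation might look like the following. The `python -m mini_sglang.launch_server` entry point is an assumption for illustration (check your installation for the actual command); `--model` is the flag documented above:

```shell
# Serve a small Qwen model (entry-point name is hypothetical; --model is the documented flag)
python -m mini_sglang.launch_server --model Qwen/Qwen3-0.6B

# Serve an instruction-tuned Llama checkpoint
python -m mini_sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct
```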
## Model Sources

Mini-SGLang supports downloading models from two sources: HuggingFace (the default) and ModelScope.

### HuggingFace (Default)

By default, models are downloaded from HuggingFace Hub.

### Gated Models
For gated models (e.g., Llama-3), provide your HuggingFace token via the `HF_TOKEN` environment variable, or use `huggingface-cli` to log in:
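For example (`HF_TOKEN` is the variable referenced in the Troubleshooting section; `huggingface-cli login` is the standard Hugging Face Hub login command; the server entry point is a hypothetical placeholder):

```shell
# Option 1: pass the token through the environment for this run
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx   # placeholder token
python -m mini_sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct

# Option 2: log in once; the token is cached for future downloads
huggingface-cli login
```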
### ModelScope

For users in regions with restricted HuggingFace access, use ModelScope. Model identifiers remain the same regardless of the source; Mini-SGLang handles the translation automatically.
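A sketch of switching sources. The environment-variable name below is a hypothetical placeholder (full SGLang uses a similar `SGLANG_USE_MODELSCOPE` toggle); consult the Mini-SGLang README for the real mechanism:

```shell
# Hypothetical toggle; the variable name is an assumption, not the documented API
export MINI_SGLANG_USE_MODELSCOPE=true
# Same model identifier as on HuggingFace (entry point is also hypothetical)
python -m mini_sglang.launch_server --model Qwen/Qwen2.5-7B
```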
## Model Selection Guide

### By Use Case
#### Development & Testing
Recommended: Qwen/Qwen3-0.6B or Qwen/Qwen2.5-0.5B
- Fast iteration
- Low memory footprint
- Runs on consumer GPUs
#### Production (Single GPU)
Recommended: Qwen/Qwen3-14B or meta-llama/Llama-3.1-8B-Instruct
- Good balance of quality and speed
- Fits on 24GB GPUs (A10, RTX 3090/4090)
#### Production (Multi-GPU)
Recommended: Qwen/Qwen3-32B or meta-llama/Llama-3.1-70B-Instruct
- State-of-the-art quality
- Requires 2-4 GPUs
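Multi-GPU serving uses the `--tp` tensor-parallelism flag mentioned under Troubleshooting; a sketch (the entry-point name is an assumption):

```shell
# Shard a 70B model across 4 GPUs with tensor parallelism
# (--tp is the documented flag; the entry point is hypothetical)
python -m mini_sglang.launch_server --model meta-llama/Llama-3.1-70B-Instruct --tp 4
```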
#### High Throughput
Recommended: Qwen/Qwen3-MoE-16B
- Mixture-of-Experts for efficiency
- Better throughput than dense models
By GPU Memory
- 8-16 GB
- 16-24 GB
- 40-48 GB
- 80+ GB (Single)
- Multiple GPUs
- Qwen/Qwen3-0.6B
- Qwen/Qwen2.5-0.5B
- Qwen/Qwen2.5-1.5B
## Model Architecture Detection

Mini-SGLang automatically detects the model architecture from the checkpoint's `config.json` file:
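For example, a Qwen-2.5 checkpoint's `config.json` carries an `architectures` field that maps to `Qwen2ForCausalLM`. A minimal sketch of the lookup, using a stand-in config file and the four architectures documented above (the registry shape inside Mini-SGLang is an assumption):

```shell
# Simulate the detection step: read "architectures" from config.json and match it
# against the documented registry.
cat > /tmp/demo-config.json <<'EOF'
{"architectures": ["Qwen2ForCausalLM"], "model_type": "qwen2"}
EOF
python - <<'EOF'
import json
REGISTRY = {"LlamaForCausalLM", "Qwen3ForCausalLM", "Qwen3MoeForCausalLM", "Qwen2ForCausalLM"}
arch = json.load(open("/tmp/demo-config.json"))["architectures"][0]
print(arch, "supported" if arch in REGISTRY else "not supported")
EOF
# prints: Qwen2ForCausalLM supported
```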
## Custom and Fine-tuned Models

### Local Models

Load models from a local path by passing the directory to `--model`.

### Fine-tuned Models

Any model fine-tuned from a supported base architecture should work.

### Quantized Models
## Troubleshooting

### Model Not Found

- Verify the model identifier is correct
- Check HuggingFace/ModelScope availability
- For gated models, ensure you have access and have provided `HF_TOKEN`
### Architecture Not Supported

Error: `Model architecture X not supported`

The model's architecture is not in the registry; only Llama-3, Qwen-2.5, Qwen-3, and Qwen-3-MoE are currently supported.

### Network Issues

If HuggingFace downloads fail, switch to ModelScope as described under Model Sources above.
### Out of Memory

- Use a smaller model variant
- Enable tensor parallelism with `--tp`
- Reduce `--max-prefill-length`
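The three mitigations can be combined in one launch. `--tp` and `--max-prefill-length` are the flags listed above; the entry-point name is an assumption:

```shell
# Hypothetical entry point; --tp and --max-prefill-length are the documented flags.
# Smaller model + 2-way tensor parallelism + reduced prefill length.
python -m mini_sglang.launch_server \
  --model Qwen/Qwen3-14B \
  --tp 2 \
  --max-prefill-length 4096
```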