SGLang supports various quantization methods to reduce memory usage and increase throughput. Quantization converts model weights from high-precision formats (BF16/FP16) to lower-precision formats (INT8/FP8/INT4/FP4).
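To make the conversion concrete, here is a minimal, framework-free sketch of symmetric per-tensor INT8 quantization (illustrative only, not SGLang's implementation):

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: w ≈ q * scale."""
    scale = max(abs(w) for w in weights) / 127  # map the largest |weight| to ±127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate weights from INT8 values and the scale."""
    return [x * scale for x in q]

weights = [0.02, -0.51, 0.33, -1.27]
q, scale = quantize_int8(weights)
# Each INT8 value takes 1 byte instead of 2 (FP16), halving weight memory.
```

Kernels then run the matrix multiplies on the low-precision values and fold the scale back in afterwards, which is where the bandwidth and throughput gains come from.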
Offline quantization is recommended over online quantization for better performance, usability, and convenience.

Quantization Types

Offline Quantization

Load pre-quantized model weights. Required for GPTQ and AWQ; the best-performing option for FP8/FP4.

Online Quantization

Dynamically quantize weights at runtime. Convenient but slower startup and higher memory usage.

Offline vs Online

| Aspect | Offline | Online |
|---|---|---|
| Startup time | Fast | Slow (quantization on startup) |
| Memory usage | Low | High (during quantization) |
| Quality control | Can be validated before deployment | Limited pre-deployment validation |
| Preparation | Requires quantization step | No preparation needed |

Offline Quantization

Load pre-quantized models directly. The quantization method is automatically detected from the model configuration.

Basic Usage

# Load pre-quantized model (quantization auto-detected)
python -m sglang.launch_server \
    --model-path hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
    --port 30000
Do NOT add --quantization when loading pre-quantized models. The quantization method is parsed from the model config.
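Auto-detection works because pre-quantized checkpoints ship a `quantization_config` block in their `config.json`. A minimal sketch of what such detection looks like (the field names follow the common Hugging Face convention; this is not SGLang's internal code):

```python
import json

# Illustrative excerpt of a pre-quantized checkpoint's config.json
config_json = """
{
  "model_type": "llama",
  "quantization_config": {"quant_method": "awq", "bits": 4, "group_size": 128}
}
"""

config = json.loads(config_json)
quant = config.get("quantization_config")
if quant is not None:
    print(f"detected quantization: {quant['quant_method']}, {quant['bits']}-bit")
```

If this block is present, passing `--quantization` on top of it is redundant at best and conflicting at worst.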

Per-Channel Quantization

For per-channel quantized models (INT8/FP8) with per-token dynamic quantization, you can optionally specify --quantization to use sgl-kernel instead of vLLM kernels:
python -m sglang.launch_server \
    --model-path neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic \
    --quantization w8a8_fp8  # Use sgl-kernel FP8 kernel
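To illustrate the scheme named above: per-channel quantization keeps one scale per output channel of the weight matrix (computed offline), while per-token dynamic quantization computes one scale per activation row at runtime. A minimal W8A8-style sketch (illustrative only, not the sgl-kernel implementation):

```python
# Per-channel weight scales + per-token activation scales (W8A8 sketch).
W = [[0.5, -1.0], [2.0, 0.25]]   # 2 output channels x 2 inputs
X = [[0.1, 0.4], [3.0, -1.5]]    # 2 tokens x 2 inputs

# One scale per weight row (output channel), computed offline.
w_scales = [max(abs(v) for v in row) / 127 for row in W]
Wq = [[round(v / s) for v in row] for row, s in zip(W, w_scales)]

# One scale per token (activation row), computed at runtime.
x_scales = [max(abs(v) for v in row) / 127 for row in X]
Xq = [[round(v / s) for v in row] for row, s in zip(X, x_scales)]

# INT8 matmul, then rescale: y[t][c] ≈ (Xq[t] · Wq[c]) * x_scale[t] * w_scale[c]
Y = [[sum(Xq[t][k] * Wq[c][k] for k in range(2)) * x_scales[t] * w_scales[c]
      for c in range(2)] for t in range(2)]
```

The rescale at the end is a cheap outer product of the two scale vectors, which is why this scheme preserves accuracy well while still running the inner matmul in INT8.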

Quantization Tools

We strongly recommend Unsloth for quantization and deployment.

NVIDIA ModelOpt

NVIDIA ModelOpt provides advanced quantization optimized for NVIDIA hardware.

Quick Start

# Install ModelOpt
pip install nvidia-modelopt

# Quantize and export
python examples/usage/modelopt_quantize_and_export.py quantize \
    --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --export-dir ./quantized_tinyllama_fp8 \
    --quantization-method modelopt_fp8

# Deploy
python -m sglang.launch_server \
    --model-path ./quantized_tinyllama_fp8 \
    --quantization modelopt \
    --port 30000

Available Methods

FP8

modelopt_fp8 - Optimal on NVIDIA Hopper and Blackwell GPUs

FP4

modelopt_fp4 - Optimal on NVIDIA Blackwell GPUs

Python API

import sglang as sgl
from sglang.srt.configs.device_config import DeviceConfig
from sglang.srt.configs.load_config import LoadConfig
from sglang.srt.configs.model_config import ModelConfig
from sglang.srt.model_loader.loader import get_model_loader

# Configure model with quantization
model_config = ModelConfig(
    model_path="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization="modelopt_fp8",
    trust_remote_code=True,
)

load_config = LoadConfig(
    modelopt_export_path="./exported_model",
    modelopt_checkpoint_save_path="./checkpoint.pth",  # Optional
)

device_config = DeviceConfig(device="cuda")

# Load and quantize
model_loader = get_model_loader(load_config, model_config)
quantized_model = model_loader.load_model(
    model_config=model_config,
    device_config=device_config,
)

Pre-Quantized Models

Load existing pre-quantized ModelOpt models:
# FP8 model
python -m sglang.launch_server \
    --model-path nvidia/Llama-3.1-8B-Instruct-FP8 \
    --quantization modelopt_fp8

# FP4 model
python -m sglang.launch_server \
    --model-path nvidia/Llama-3.3-70B-Instruct-NVFP4 \
    --quantization modelopt_fp4

auto-round

auto-round supports multiple quantization formats for both LLMs and VLMs.
pip install auto-round

LLM Quantization

from auto_round import AutoRound

model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-autoround-4bit"

# Schemes: W2A16, W3A16, W4A16, W8A16, NVFP4, MXFP4, GGUF:Q4_K_M, etc.
scheme = "W4A16"
format = "auto_round"

autoround = AutoRound(model_id, scheme=scheme)
autoround.quantize_and_save(quant_path, format=format)

VLM Quantization

from auto_round import AutoRoundMLLM

model_name = "Qwen/Qwen2-VL-2B-Instruct"
quant_path = "Qwen2-VL-2B-Instruct-autoround-4bit"

autoround = AutoRoundMLLM(model_name, scheme="W4A16")
autoround.quantize_and_save(quant_path, format="auto_round")

Command Line

auto-round \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --bits 4 \
    --group_size 128 \
    --format "auto_round" \
    --output_dir ./tmp_autoround

GPTQModel

pip install gptqmodel --no-build-isolation -v
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"

# Load calibration dataset
calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train"
).select(range(1024))["text"]

# Configure and quantize
quant_config = QuantizeConfig(bits=4, group_size=128)
model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration_dataset, batch_size=2)
model.save(quant_path)

LLM Compressor

From the vLLM project, supports FP8 and other formats.
pip install llmcompressor
from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

# Load model
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Configure FP8 quantization
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"]
)

# Apply quantization
oneshot(model=model, recipe=recipe)

# Save
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
Deploy:
python -m sglang.launch_server \
    --model-path ./Meta-Llama-3-8B-Instruct-FP8-Dynamic

Online Quantization

Quantize weights dynamically at server startup.

FP8 Online

python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --quantization fp8 \
    --port 30000
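Conceptually, online FP8 quantization computes a scale at load time that maps each tensor into FP8's representable range (±448 for the e4m3 format) before casting the weights to 8 bits. A rough sketch of the scaling step (illustrative, not the actual kernel):

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3

def fp8_scale(tensor):
    """Per-tensor scale so that the largest |value| maps to E4M3_MAX."""
    amax = max(abs(v) for v in tensor)
    return amax / E4M3_MAX

weights = [0.03, -2.24, 0.9, 1.12]
scale = fp8_scale(weights)
scaled = [v / scale for v in weights]   # now within [-448, 448]; cast to FP8
```

Because this happens at startup for every weight tensor, online FP8 trades slower launch and a temporary memory spike for zero offline preparation.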

TorchAO Quantization

SGLang supports torchao quantization methods:
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --torchao-config int4wo-128 \
    --port 30000

Supported Methods

  • int8dq - INT8 dynamic quantization (⚠️ disable CUDA graph with --disable-cuda-graph)
  • int8wo - INT8 weight-only
  • fp8wo - FP8 weight-only
  • fp8dq-per_tensor - FP8 dynamic per-tensor
  • fp8dq-per_row - FP8 dynamic per-row
  • int4wo-32, int4wo-64, int4wo-128, int4wo-256 - INT4 weight-only with different group sizes
int8dq has issues with CUDA graph capture. Always use --disable-cuda-graph with this method.
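The group-size suffix on the INT4 methods (32/64/128/256) controls how many consecutive weights share one scale: smaller groups track local value ranges better (lower error) but store more scale metadata. A minimal group-wise INT4 sketch (illustrative; torchao's packed kernels are far more involved):

```python
def quantize_int4_grouped(weights, group_size):
    """Symmetric INT4 weight-only quantization with one scale per group."""
    q, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / 7  # symmetric INT4 range: -7..7
        scales.append(scale)
        q.extend(round(w / scale) for w in group)
    return q, scales

weights = [0.7, -0.1, 0.3, 0.07, 3.5, -1.4, 0.7, 2.1]
q, scales = quantize_int4_grouped(weights, group_size=4)
# Two groups -> two scales; halving the group size doubles the scale count.
```

With `int4wo-128`, for example, a 4096-wide layer stores 32 scales per row on top of the packed 4-bit weights.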

AMD GPU Quantization

For AMD GPUs (CDNA3/CDNA4), use quark_int4fp8_moe to quantize MoE layers:
python -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V3 \
    --quantization quark_int4fp8_moe \
    --port 30000
This quantizes:
  • MoE layers: weights to INT4, upcasted to FP8 for compute
  • Other layers: weights to FP8 directly

Pre-Quantized Model Sources

Unsloth

High-quality quantized models

NVIDIA ModelOpt

NVIDIA-optimized models

NeuralMagic

Sparse and quantized models
Always validate quantized models via benchmarks post-quantization to guard against quality degradation.

Performance Impact

Memory Reduction

| Precision | Memory vs FP16 | Typical Use Case |
|---|---|---|
| FP16/BF16 | 1.0× (baseline) | Full precision |
| FP8 | 0.5× | Hopper/Blackwell GPUs |
| INT8 | 0.5× | Broad compatibility |
| FP4/INT4 | 0.25× | Maximum compression |
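The memory column above can be sanity-checked with simple arithmetic: weight memory ≈ parameters × bits / 8. A quick back-of-the-envelope helper (weights only; KV cache and activations add on top):

```python
def weight_memory_gib(params_billion, bits):
    """Approximate weight memory in GiB for a model of the given size."""
    return params_billion * 1e9 * bits / 8 / 1024**3

for label, bits in [("FP16", 16), ("FP8/INT8", 8), ("FP4/INT4", 4)]:
    print(f"8B model @ {label}: {weight_memory_gib(8, bits):.1f} GiB")
```

An 8B model drops from roughly 15 GiB of weights at FP16 to under 4 GiB at 4-bit, which is what makes single-GPU deployment of larger models feasible.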

Throughput Improvements

Quantization typically provides:
  • 1.5-2× throughput with FP8/INT8
  • 2-3× throughput with FP4/INT4
  • Lower latency due to reduced memory bandwidth
  • Higher batch sizes due to memory savings

Known Limitations

  • Mixed bit-widths within fused layers (e.g., QKV fusion inherited from vLLM) are not fully supported and can cause compatibility issues.
  • Some models may hit kernel limitations; try skipping problematic layers such as mlp.gate.
  • Support for some format combinations is limited; the AWQ format typically works best.
