8-bit quantization reduces model memory footprint by up to 4x compared to FP32 weights (2x compared to BF16) while maintaining generation quality. OminiX-MLX supports INT8 affine quantization with configurable group sizes.

How quantization works

Quantization compresses 32-bit floating point weights into 8-bit integers using affine (linear) quantization:
quantized = round((weight - zero_point) / scale)
dequantized = quantized * scale + zero_point
Weights are grouped (typically 64 or 128 elements per group) with separate scale/zero-point per group. Smaller group sizes preserve more accuracy but increase overhead.
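The two formulas above can be made concrete with a small sketch of per-group affine quantization. This is an illustration of the scheme only, not the OminiX-MLX kernel; the function names are hypothetical:

```rust
// Illustrative per-group 8-bit affine quantization. Each group of weights
// gets its own scale and zero-point, matching the formulas above.
fn quantize_group(weights: &[f32], bits: u32) -> (Vec<u8>, f32, f32) {
    let qmax = ((1u32 << bits) - 1) as f32; // 255 for 8-bit
    let min = weights.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = weights.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let scale = ((max - min) / qmax).max(f32::EPSILON);
    let zero_point = min; // stored per group alongside the scale
    let q = weights
        .iter()
        .map(|w| ((w - zero_point) / scale).round().clamp(0.0, qmax) as u8)
        .collect();
    (q, scale, zero_point)
}

fn dequantize_group(q: &[u8], scale: f32, zero_point: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale + zero_point).collect()
}

fn main() {
    // One group of 64 weights, as in group_size=64
    let weights: Vec<f32> = (0..64).map(|i| (i as f32) * 0.01 - 0.3).collect();
    let (q, scale, zp) = quantize_group(&weights, 8);
    let restored = dequantize_group(&q, scale, zp);
    // Round-trip error is bounded by half the quantization step (scale / 2)
    for (w, r) in weights.iter().zip(&restored) {
        assert!((w - r).abs() <= scale / 2.0 + 1e-6);
    }
    println!("quantization step = {}", scale);
}
```

A smaller group tightens the min/max range each scale must cover, which is why smaller group sizes preserve more accuracy at the cost of more scale/zero-point metadata.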

Quantization in models

Moxin-7B VLM

Moxin-7B quantizes only the Mistral-7B decoder to INT8. The dual vision encoders (DINOv2 + SigLIP) remain in BF16 to preserve visual feature quality.
moxin-vlm-mlx/src/lib.rs
use moxin_vlm_mlx::load_model;

// Load model in BF16
let mut vlm = load_model("~/.OminiX/models/moxin-vlm-7b")?;

// Quantize the LLM decoder to 8-bit (group_size = 64);
// vision encoders stay in BF16 for quality
let vlm = vlm.quantize(64, 8)?;
From moxin-vlm-mlx/README.md:
8-bit quantization — Mistral-7B decoder linear layers quantized to INT8; vision encoders kept in BF16
Memory usage: ~10GB (8-bit) vs ~14GB (BF16)

Qwen3-ASR

Qwen3-ASR models quantize the text decoder only. The audio encoder (Conv2d + Transformer) stays at full precision:
# Download pre-quantized 1.7B model (2.46 GB)
huggingface-cli download mlx-community/Qwen3-ASR-1.7B-8bit \
    --local-dir ~/.OminiX/models/qwen3-asr-1.7b

# Download pre-quantized 0.6B model (1.01 GB)  
huggingface-cli download mlx-community/Qwen3-ASR-0.6B-8bit \
    --local-dir ~/.OminiX/models/qwen3-asr-0.6b
From qwen3-asr-mlx/README.md:
The audio encoder is not quantized — only the text decoder uses 8-bit quantization. This preserves audio feature quality while reducing memory for the larger LLM component.
Model                 Size      Speed
Qwen3-ASR-1.7B-8bit   2.46 GB   ~30x RT
Qwen3-ASR-0.6B-8bit   1.01 GB   ~50x RT
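The speed column is a real-time factor: seconds of audio transcribed per second of wall-clock compute. A trivial sketch of the definition (the helper name is ours, not part of qwen3-asr-mlx):

```rust
// Real-time factor: how many seconds of audio are processed
// per second of compute. 1.0x means exactly real time.
fn real_time_factor(audio_seconds: f64, processing_seconds: f64) -> f64 {
    audio_seconds / processing_seconds
}

fn main() {
    // e.g. 60 s of audio transcribed in 2 s of compute -> 30x real time
    println!("{}x RT", real_time_factor(60.0, 2.0));
}
```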

GLM4-MoE

GLM4-MoE supports 3-bit quantization for ultra-low memory:
# Download 3-bit quantized model
huggingface-cli download mlx-community/glm-4.5-chat-moe-3bit \
    --local-dir ./models/GLM-4.5-MoE-3bit
Memory: ~20GB (3-bit) vs ~80GB (BF16) for the 45-expert MoE.

Performance impact

Memory reduction

Precision   Size                Relative
FP32        4 bytes/param       1.0x
BF16        2 bytes/param       0.5x
8-bit       1 byte/param        0.25x
3-bit       0.375 bytes/param   0.09x
Example for a 7B-parameter model:
  • BF16: ~14 GB
  • 8-bit: ~7 GB (base weights) + ~1 GB (scales/zeros) = ~8 GB
  • 3-bit: ~2.6 GB (weights only)
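The arithmetic behind these bullets can be sketched as a back-of-envelope estimate, assuming an FP32 scale and zero-point per group (4 bytes each); the helper is illustrative, not part of any OminiX-MLX API:

```rust
// Estimated size in GB for quantized weights plus per-group metadata,
// assuming one FP32 scale and one FP32 zero-point per group.
fn quantized_gb(params: f64, bits: f64, group_size: f64) -> f64 {
    let weight_bytes = params * bits / 8.0;
    let meta_bytes = (params / group_size) * 2.0 * 4.0; // scale + zero-point
    (weight_bytes + meta_bytes) / 1e9
}

fn main() {
    let params = 7.0e9; // 7B parameters
    println!("BF16 : {:.1} GB", params * 2.0 / 1e9);            // ~14 GB
    println!("8-bit: {:.1} GB", quantized_gb(params, 8.0, 64.0)); // ~8 GB
}
```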

Speed impact

Quantized inference is typically 5-10% slower than full precision due to dequantization overhead. However, memory savings enable:
  • Running larger models on the same hardware
  • Higher batch sizes
  • Reduced memory bandwidth pressure
On Apple Silicon, the unified memory architecture means quantization reduces memory pressure for both GPU and CPU operations.

Group size selection

Smaller group sizes preserve more accuracy but add overhead:
Group Size   Quality     Overhead   Use Case
32           Best        High       Critical quality tasks
64           Excellent   Medium     Recommended default
128          Good        Low        Maximum efficiency
256          Fair        Minimal    Experimental
From mlx-rs-core:
// Example: Quantize with group_size=64, 8-bit
let quantized_model = model.quantize(64, 8)?;
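The overhead column follows directly from the metadata cost per group. A quick calculation, assuming 8-bit weights with one FP32 scale and one FP32 zero-point per group (8 extra bytes per group):

```rust
// Metadata overhead as a percentage of the quantized weight bytes,
// assuming 8-bit weights (1 byte each) and 8 bytes of FP32 scale +
// zero-point per group.
fn overhead_percent(group_size: f64) -> f64 {
    let meta_bytes = 2.0 * 4.0; // scale + zero-point, FP32 each
    let weight_bytes = group_size; // 1 byte per 8-bit weight
    100.0 * meta_bytes / weight_bytes
}

fn main() {
    for g in [32.0, 64.0, 128.0, 256.0] {
        println!("group_size={:>3}: +{:.2}% metadata", g, overhead_percent(g));
    }
}
```

Halving the group size doubles the metadata overhead, which is the tradeoff the table summarizes.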

Pre-quantized vs runtime quantization

Pre-quantized models

Models from mlx-community on HuggingFace are pre-quantized:
  • Weights stored as INT8 + scales/zeros in safetensors
  • No conversion overhead at load time
  • Optimized scale/zero-point placement
  • Immediate inference after loading

Runtime quantization

Quantize models at load time:
let mut model = load_model("path/to/bf16-model")?;
let model = model.quantize(64, 8)?;
Advantages:
  • Use any BF16 model without conversion
  • Experiment with different group sizes
Disadvantages:
  • Slower first load (1-2 minutes for 7B model)
  • Higher peak memory during conversion (the BF16 and INT8 copies coexist)

Saving quantized models

Save quantized weights to avoid re-quantization:
use moxin_vlm_mlx::{load_model, save_quantized};

let mut vlm = load_model("~/.OminiX/models/moxin-vlm-7b")?;
let vlm = vlm.quantize(64, 8)?;

save_quantized(&vlm, "~/.OminiX/models/moxin-vlm-7b-8bit")?;
Command line example:
cargo run --release --example save_quantized -- \
    --model ~/.OminiX/models/moxin-vlm-7b \
    --output ~/.OminiX/models/moxin-vlm-7b-8bit

Quality considerations

What to quantize

Quantize:
  • LLM decoder layers (Transformer blocks)
  • Linear/Dense layers
  • Large parameter-heavy components
Keep full precision:
  • Vision encoders (DINOv2, SigLIP, ViT)
  • Audio encoders (Conv2d, Mel spectrogram features)
  • Small models (< 1B parameters)
  • Embeddings and output heads (minimal memory impact)
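The policy above can be sketched as a per-parameter predicate over safetensors key names. The function and the key patterns (e.g. "embed", "lm_head") are hypothetical illustrations, not part of the OminiX-MLX API:

```rust
// Hypothetical selection policy: decide per parameter name whether a
// weight should be quantized, keeping encoders, embeddings, and output
// heads at full precision as recommended above.
fn should_quantize(name: &str) -> bool {
    let keep_full_precision = ["vision_encoder.", "audio_tower.", "embed", "lm_head"];
    !keep_full_precision.iter().any(|p| name.contains(p))
}

fn main() {
    // Decoder linear layers are quantized
    assert!(should_quantize("model.layers.0.self_attn.q_proj.weight"));
    // Encoders stay at full precision
    assert!(!should_quantize("vision_encoder.blocks.0.attn.qkv.weight"));
    assert!(!should_quantize("audio_tower.conv1.weight"));
}
```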

Model-specific recommendations

Model Family          Recommended Quantization
LLMs (7B+)            8-bit, group_size=64
VLMs                  Decoder only, 8-bit
ASR                   Decoder only, 8-bit
MoE (large)           3-bit or 4-bit
Small models (< 1B)   No quantization

Weight format

Quantized models use safetensors with special keys:
model.safetensors:
  - layer.*.weight           # INT8 quantized weights
  - layer.*.weight_scales    # FP32 scales per group
  - layer.*.weight_biases    # FP32 zero-points per group
  - audio_tower.*            # Full precision (not quantized)
  - vision_encoder.*         # Full precision (not quantized)
Example from qwen3-asr-mlx:
Weight Format: Models use safetensors with audio_tower.* at full precision and model.* (text decoder) as 8-bit affine quantized with group_size=64.

Benchmarks

Measured on Apple M4 Max (128GB):
Model            Precision   Memory   Speed         Quality
Moxin-7B VLM     BF16        14 GB    32 tok/s      Baseline
Moxin-7B VLM     8-bit       10 GB    30 tok/s      Negligible loss
Qwen3-ASR-1.7B   BF16        3.2 GB   32x RT        Baseline
Qwen3-ASR-1.7B   8-bit       2.5 GB   30x RT        Negligible loss
GLM4-MoE         BF16        80 GB    N/A           Baseline
GLM4-MoE         3-bit       20 GB    15-20 tok/s   Acceptable loss
Speed reduction from quantization is typically 5-10%, primarily from dequantization overhead. Memory savings often enable running models that wouldn’t fit in BF16.
