8-bit quantization reduces model memory footprint by about 4x compared to FP32 weights and about 2x compared to BF16, while maintaining generation quality. OminiX-MLX supports INT8 affine quantization with configurable group sizes.
## How quantization works
Quantization compresses 16- or 32-bit floating-point weights into 8-bit integers using affine (linear) quantization:
```text
quantized   = round((weight - zero_point) / scale)
dequantized = quantized * scale + zero_point
```
Weights are grouped (typically 64 or 128 elements per group) with separate scale/zero-point per group. Smaller group sizes preserve more accuracy but increase overhead.
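As a concrete illustration, here is a minimal sketch of those two steps for a single group, in plain Rust with no OminiX-MLX dependencies. The asymmetric min/max mapping shown is one common choice; the library's exact rounding and range handling may differ.

```rust
/// Quantize one group of weights to 8-bit using the affine mapping above.
/// Returns (quantized values, scale, zero_point).
fn quantize_group(weights: &[f32]) -> (Vec<u8>, f32, f32) {
    let min = weights.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = weights.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    // Map [min, max] onto the 256 representable 8-bit levels.
    let scale = ((max - min) / 255.0).max(f32::EPSILON);
    let zero_point = min;
    let q = weights
        .iter()
        .map(|w| ((w - zero_point) / scale).round().clamp(0.0, 255.0) as u8)
        .collect();
    (q, scale, zero_point)
}

/// Reverse the mapping: dequantized = quantized * scale + zero_point.
fn dequantize_group(q: &[u8], scale: f32, zero_point: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale + zero_point).collect()
}
```

In the real models a separate (scale, zero_point) pair is stored for every group of 64 or 128 weights, which is where the storage overhead discussed below comes from.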
## Quantization in models
### Moxin-7B VLM
Moxin-7B quantizes only the Mistral-7B decoder to INT8. The dual vision encoders (DINOv2 + SigLIP) remain in BF16 to preserve visual feature quality.
```rust
use moxin_vlm_mlx::load_model;

// Load the full model in BF16
let mut vlm = load_model("~/.OminiX/models/moxin-vlm-7b")?;
// Quantize the LLM decoder to 8-bit (group_size = 64);
// the vision encoders stay in BF16 for quality
let vlm = vlm.quantize(64, 8)?;
```
From moxin-vlm-mlx/README.md:
> 8-bit quantization — Mistral-7B decoder linear layers quantized to INT8; vision encoders kept in BF16
Memory usage: ~10GB (8-bit) vs ~14GB (BF16)
### Qwen3-ASR
Qwen3-ASR models quantize the text decoder only. The audio encoder (Conv2d + Transformer) stays at full precision:
```bash
# Download pre-quantized 1.7B model (2.46 GB)
huggingface-cli download mlx-community/Qwen3-ASR-1.7B-8bit \
  --local-dir ~/.OminiX/models/qwen3-asr-1.7b

# Download pre-quantized 0.6B model (1.01 GB)
huggingface-cli download mlx-community/Qwen3-ASR-0.6B-8bit \
  --local-dir ~/.OminiX/models/qwen3-asr-0.6b
```
From qwen3-asr-mlx/README.md:
> The audio encoder is not quantized — only the text decoder uses 8-bit quantization. This preserves audio feature quality while reducing memory for the larger LLM component.
| Model | Size | Speed |
|---|---|---|
| Qwen3-ASR-1.7B-8bit | 2.46 GB | ~30x RT |
| Qwen3-ASR-0.6B-8bit | 1.01 GB | ~50x RT |
### GLM4-MoE
GLM4-MoE supports 3-bit quantization for ultra-low memory:
```bash
# Download 3-bit quantized model
huggingface-cli download mlx-community/glm-4.5-chat-moe-3bit \
  --local-dir ./models/GLM-4.5-MoE-3bit
```
Memory: ~20GB (3-bit) vs ~80GB (BF16) for the 45-expert MoE.
## Memory reduction
| Precision | Bytes per param | Relative to FP32 |
|---|---|---|
| FP32 | 4 bytes | 1.0x |
| BF16 | 2 bytes | 0.5x |
| 8-bit | 1 byte | 0.25x |
| 3-bit | 0.375 bytes | ~0.09x |
Example for a 7B-parameter model (see the estimator sketch below):
- BF16: ~14 GB
- 8-bit: ~7 GB (base weights) + ~1 GB (scales/zero-points) ≈ ~8 GB
- 3-bit: ~2.6 GB (base weights) + ~1 GB (scales/zero-points) ≈ ~3.6 GB
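The ~1 GB overhead term comes from storing one scale and one zero-point per group. A quick estimator, assuming 4-byte (FP32) scales and zero-points as in the weight format described later, and ignoring any layers kept in BF16:

```rust
/// Rough weight-memory estimate in GB: `bits` per quantized weight plus one
/// FP32 scale and one FP32 zero-point for every `group_size` weights.
fn quantized_size_gb(params: f64, bits: f64, group_size: f64) -> f64 {
    let weight_bytes = params * bits / 8.0;
    let overhead_bytes = (params / group_size) * (4.0 + 4.0);
    (weight_bytes + overhead_bytes) / 1e9
}

fn main() {
    let params = 7.0e9;
    println!("8-bit, g=64: {:.1} GB", quantized_size_gb(params, 8.0, 64.0)); // ~7.9 GB
    println!("3-bit, g=64: {:.1} GB", quantized_size_gb(params, 3.0, 64.0)); // ~3.5 GB
}
```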
## Speed impact
Quantized inference is typically 5-15% slower than full precision due to dequantization overhead. However, memory savings enable:
- Running larger models on the same hardware
- Higher batch sizes
- Reduced memory bandwidth pressure
On Apple Silicon, the unified memory architecture means quantization reduces memory pressure for both GPU and CPU operations.
## Group size selection
Smaller group sizes preserve more accuracy but add overhead:
| Group Size | Quality | Overhead | Use Case |
|---|---|---|---|
| 32 | Best | High | Critical quality tasks |
| 64 | Excellent | Medium | Recommended default |
| 128 | Good | Low | Maximum efficiency |
| 256 | Fair | Minimal | Experimental |
From mlx-rs-core:
```rust
// Example: quantize with group_size = 64, 8-bit
let quantized_model = model.quantize(64, 8)?;
```
## Pre-quantized vs runtime quantization
### Pre-quantized models (recommended)
Models from mlx-community on HuggingFace are pre-quantized:
- Weights stored as INT8 + scales/zeros in safetensors
- No conversion overhead at load time
- Optimized scale/zero-point placement
- Immediate inference after loading
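One way to tell a pre-quantized checkpoint from a BF16 one is to look for the per-group scale tensors in the safetensors header. A small sketch using the safetensors crate (not an OminiX-MLX API; the `weight_scales` suffix follows the key layout shown in the weight-format listing further down):

```rust
use safetensors::SafeTensors;
use std::fs;

/// Returns true if the checkpoint stores per-group quantization scales,
/// i.e. it was saved pre-quantized rather than as plain BF16 weights.
fn is_prequantized(path: &str) -> Result<bool, Box<dyn std::error::Error>> {
    // For simplicity this reads the whole file; only the header is inspected.
    let bytes = fs::read(path)?;
    let tensors = SafeTensors::deserialize(&bytes)?;
    Ok(tensors.names().iter().any(|n| n.ends_with("weight_scales")))
}
```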
### Runtime quantization
Quantize models at load time:
```rust
let mut model = load_model("path/to/bf16-model")?;
let model = model.quantize(64, 8)?;
```
Advantages:
- Use any BF16 model without conversion
- Experiment with different group sizes
Disadvantages:
- Slower first load (1-2 minutes for 7B model)
- Higher peak memory during conversion (BF16 and INT8 weights resident at once)
## Saving quantized models
Save quantized weights to avoid re-quantization:
```rust
use moxin_vlm_mlx::{load_model, save_quantized};

let mut vlm = load_model("~/.OminiX/models/moxin-vlm-7b")?;
let vlm = vlm.quantize(64, 8)?;
save_quantized(&vlm, "~/.OminiX/models/moxin-vlm-7b-8bit")?;
```
Command line example:
```bash
cargo run --release --example save_quantized -- \
  --model ~/.OminiX/models/moxin-vlm-7b \
  --output ~/.OminiX/models/moxin-vlm-7b-8bit
```
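Once saved, the checkpoint can be reloaded without a quantize() call; a minimal sketch, assuming load_model picks up the stored scales/zero-points the same way it does for mlx-community pre-quantized models:

```rust
use moxin_vlm_mlx::load_model;

// Reload the saved 8-bit checkpoint; no further quantize() step is needed.
let vlm = load_model("~/.OminiX/models/moxin-vlm-7b-8bit")?;
```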
## Quality considerations
### What to quantize
Quantize (see the sketch after these lists):
- LLM decoder layers (Transformer blocks)
- Linear/Dense layers
- Large parameter-heavy components
Keep full precision:
- Vision encoders (DINOv2, SigLIP, ViT)
- Audio encoders (Conv2d, Mel spectrogram features)
- Small models (< 1B parameters)
- Embeddings and output heads (minimal memory impact)
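A hedged sketch of that policy as a name filter; the function and the key prefixes are illustrative only, not an OminiX-MLX API (the shipped quantize() methods apply an equivalent decoder-only policy internally):

```rust
/// Illustrative policy: quantize decoder linear layers, keep encoders,
/// embeddings, and output heads at full precision. The prefixes are examples,
/// not the exact parameter names used by every model.
fn should_quantize(param_name: &str) -> bool {
    const KEEP_FULL_PRECISION: &[&str] = &[
        "vision_encoder", // DINOv2 / SigLIP / ViT towers
        "audio_tower",    // Conv2d front end + audio Transformer
        "embed_tokens",   // input embeddings
        "lm_head",        // output head
    ];
    !KEEP_FULL_PRECISION.iter().any(|p| param_name.contains(p))
}
```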
### Model-specific recommendations
| Model Family | Recommended Quantization |
|---|---|
| LLMs (7B+) | 8-bit, group_size=64 |
| VLMs | Decoder only, 8-bit |
| ASR | Decoder only, 8-bit |
| MoE (large) | 3-bit or 4-bit |
| Small models (under 1B) | No quantization |
### Weight format

Quantized models use safetensors with special keys:
```text
model.safetensors:
- layer.*.weight          # INT8 quantized weights
- layer.*.weight_scales   # FP32 scales per group
- layer.*.weight_biases   # FP32 zero-points per group
- audio_tower.*           # Full precision (not quantized)
- vision_encoder.*        # Full precision (not quantized)
```
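Putting the three keys together, dequantization is just the formula from the top of this page applied per group; a sketch of the arithmetic (not the OminiX-MLX loader, and the on-disk packing of INT8 values may differ):

```rust
/// Reconstruct approximate weights from the stored tensors:
/// dequantized = quantized * scale + zero_point, with one (scale, zero_point)
/// pair per `group_size` consecutive elements.
fn dequantize(q: &[u8], scales: &[f32], zero_points: &[f32], group_size: usize) -> Vec<f32> {
    q.iter()
        .enumerate()
        .map(|(i, &v)| {
            let g = i / group_size;
            v as f32 * scales[g] + zero_points[g]
        })
        .collect()
}
```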
Example from qwen3-asr-mlx:
> Weight Format: Models use safetensors with audio_tower.* at full precision and model.* (text decoder) as 8-bit affine quantized with group_size=64.
## Benchmarks
Measured on Apple M4 Max (128GB):
| Model | Precision | Memory | Speed | Quality |
|---|---|---|---|---|
| Moxin-7B VLM | BF16 | 14 GB | 32 tok/s | Baseline |
| Moxin-7B VLM | 8-bit | 10 GB | 30 tok/s | Negligible loss |
| Qwen3-ASR-1.7B | BF16 | 3.2 GB | 32x RT | Baseline |
| Qwen3-ASR-1.7B | 8-bit | 2.5 GB | 30x RT | Negligible loss |
| GLM4-MoE | BF16 | 80 GB | N/A | Baseline |
| GLM4-MoE | 3-bit | 20 GB | 15-20 tok/s | Acceptable loss |
Speed reduction from quantization is typically 5-10%, primarily from dequantization overhead. Memory savings often enable running models that wouldn’t fit in BF16.