Architecture
Moxin-7B uses a dual-encoder vision system fused with a Mistral-7B decoder:

Vision encoders
DINOv2 ViT-L/14
- Layers: 24 transformer blocks
- Embedding dim: 1024
- Patch size: 14×14
- Features: CLS token + 4 register tokens + LayerScale
- Normalization: ImageNet statistics (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
SigLIP ViT-SO400M/14
- Layers: 27 transformer blocks
- Embedding dim: 1152
- Patch size: 14×14
- Features: No CLS token, all patch tokens used
- Normalization: Unit normalization (mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
Language decoder
The Mistral-7B decoder provides the language generation capabilities:
- Parameters: 7B
- Layers: 36 transformer blocks
- Hidden size: 4096
- Attention: Grouped Query Attention (32 query heads, 8 KV heads)
- Context: Rotary Position Embeddings (RoPE) with base 10000
- Vocabulary: 32,064 tokens
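The GQA configuration also fixes the KV-cache cost per generated token: with a head dimension of 4096 / 32 = 128, each layer caches one key and one value vector per KV head. A back-of-the-envelope sketch (the helper name is hypothetical, not part of the crate's API):

```rust
/// Bytes of KV cache needed per token, assuming a BF16 (2-byte) cache.
/// Illustrative arithmetic only, not part of the crate's API.
fn kv_cache_bytes_per_token(
    layers: usize,
    kv_heads: usize,
    head_dim: usize,
    bytes_per_elem: usize,
) -> usize {
    // Each layer stores one key and one value vector per KV head.
    layers * 2 * kv_heads * head_dim * bytes_per_elem
}

// Moxin-7B decoder: 36 layers, 8 KV heads, head_dim = 4096 / 32 = 128, BF16 cache.
// kv_cache_bytes_per_token(36, 8, 128, 2) == 147_456 bytes, i.e. 144 KiB per token.
```

At that rate a 1024-token context needs roughly 144 MiB of cache, small next to the decoder weights.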
Installation
Moxin-7B is included in the OminiX source distribution.

Quick start
Usage
Basic example
Here’s a complete example showing how to use Moxin-7B as a library.

Image preprocessing
Moxin-7B requires images to be preprocessed differently for each encoder. Both encoders expect 224×224 RGB images in NHWC format (the MLX standard). Make sure to resize images to exactly 224×224 before normalization.
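Each encoder's normalization reduces to plain per-channel arithmetic on float pixels. A minimal sketch, using the statistics listed above (function and constant names here are illustrative, not the crate's actual API):

```rust
/// Per-channel normalization: (x - mean) / std, for float pixels in [0, 1].
fn normalize_pixel(rgb: [f32; 3], mean: [f32; 3], std: [f32; 3]) -> [f32; 3] {
    [
        (rgb[0] - mean[0]) / std[0],
        (rgb[1] - mean[1]) / std[1],
        (rgb[2] - mean[2]) / std[2],
    ]
}

// DINOv2 path: ImageNet statistics.
const DINOV2_MEAN: [f32; 3] = [0.485, 0.456, 0.406];
const DINOV2_STD: [f32; 3] = [0.229, 0.224, 0.225];

// SigLIP path: unit normalization, which maps [0, 1] onto [-1, 1].
const SIGLIP_MEAN: [f32; 3] = [0.5, 0.5, 0.5];
const SIGLIP_STD: [f32; 3] = [0.5, 0.5, 0.5];
```

The real pipeline applies this over whole `[B, 224, 224, 3]` MLX arrays rather than pixel by pixel.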
Prompt formatting
Moxin-7B uses the Prismatic “Pure” prompt format:
- "In: Describe this image.\nOut:"
- "In: What objects are visible in this scene?\nOut:"
- "In: Is there a dog in this image?\nOut:"
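The format is a single string template; a sketch of the wrapping step (hypothetical helper name, the crate may expose this differently):

```rust
/// Wrap a user prompt in the Prismatic "Pure" format used by Moxin-7B.
/// Hypothetical helper for illustration; not the crate's actual API.
fn format_prompt(user_text: &str) -> String {
    format!("In: {}\nOut:", user_text)
}
```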
Sampling strategies
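The sampling behavior follows the usual temperature scheme: temperature 0.0 picks the argmax (greedy), and higher temperatures flatten the distribution before sampling. A sketch of that behavior (illustrative only, not the crate's code):

```rust
/// Greedy pick when temp == 0.0; otherwise scale logits by 1/temp before softmax.
/// Illustrative sketch of temperature sampling, not the crate's implementation.
fn softmax_with_temperature(logits: &[f32], temp: f32) -> Vec<f32> {
    if temp == 0.0 {
        // Greedy: put all probability mass on the argmax.
        let argmax = logits
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
            .map(|(i, _)| i)
            .unwrap();
        return (0..logits.len()).map(|i| if i == argmax { 1.0 } else { 0.0 }).collect();
    }
    // Scale, then apply a numerically stable softmax.
    let scaled: Vec<f32> = logits.iter().map(|&l| l / temp).collect();
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}
```

A token is then drawn from the resulting distribution (or taken directly in the greedy case).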
Performance
Typical performance on Apple Silicon (M3 Max, 36 GPU cores):

Prefill (vision + prompt)
| Configuration | Time | Sequence length |
|---|---|---|
| BF16 (no quantization) | ~250ms | 1024 tokens (256 visual + text) |
| INT8 quantization | ~200ms | 1024 tokens (256 visual + text) |
Decode (token generation)
| Configuration | Tokens/second | Memory Usage |
|---|---|---|
| BF16 (no quantization) | 35-45 tok/s | ~14GB |
| INT8 quantization | 55-70 tok/s | ~7GB |
| INT4 quantization | 75-95 tok/s | ~4GB |
Performance varies based on prompt length, generated length, and hardware. These benchmarks are from the `generate` example with default settings.

Memory breakdown
- DINOv2 ViT-L/14: ~300MB (kept in BF16)
- SigLIP ViT-SO400M/14: ~450MB (kept in BF16)
- FusedMLPProjector: ~50MB (kept in BF16)
- Mistral-7B decoder: ~13GB (BF16) / ~6.5GB (INT8) / ~3.5GB (INT4)
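These figures are consistent with simple parameter-count arithmetic; a sanity-check sketch (not measured data):

```rust
/// Approximate weight memory in GiB for a model with `params` parameters
/// stored at `bits_per_weight` bits each. Back-of-the-envelope only.
fn weight_gib(params: f64, bits_per_weight: f64) -> f64 {
    params * bits_per_weight / 8.0 / (1024.0 * 1024.0 * 1024.0)
}

// 7e9 params at 16 bits (BF16) -> ~13.0 GiB, matching the decoder figure above.
// At 8 bits -> ~6.5 GiB; at 4 bits -> ~3.3 GiB (before per-group scale overhead).
```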
Quantization
Moxin-7B supports efficient quantization of the language decoder.

Why quantize?
- Reduced memory: 8-bit uses ~50% less memory, 4-bit uses ~70% less
- Faster inference: Quantized models run 1.5-2x faster on Apple Silicon
- Quality: 8-bit quantization has minimal quality loss
Quantization options
- `group_size`: Quantization group size (typically 64 or 128)
- `bits`: Bit width (4 or 8)
- 8-bit, group_size=64: Best balance of speed, memory, and quality
- 4-bit, group_size=128: Maximum speed and memory savings, slight quality loss
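Group size trades memory for quality because each group carries its own quantization parameters. Assuming an affine scheme that stores a 16-bit scale and a 16-bit bias per group (an assumption about the storage format, typical of MLX-style quantization), the effective bits per weight work out as:

```rust
/// Effective bits per weight for group-wise affine quantization,
/// assuming a 16-bit scale and a 16-bit bias per group (an assumption).
fn effective_bits_per_weight(bits: f64, group_size: f64) -> f64 {
    bits + 32.0 / group_size
}

// 8-bit, group_size = 64  -> 8.5  effective bits
// 4-bit, group_size = 128 -> 4.25 effective bits
```

Smaller groups mean more scales (more memory, better fidelity); larger groups amortize the overhead.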
What gets quantized?
Only the Mistral-7B decoder is quantized:
- ✅ Attention projections (Q, K, V, O)
- ✅ MLP layers (gate, up, down)
- ✅ LM head
- ❌ Vision encoders (DINOv2, SigLIP) - kept in BF16
- ❌ Projector - kept in BF16
Vision encoders are kept in BF16 because they only run once during prefill and have dimension sizes that aren’t cleanly divisible by common group sizes.
Command-line examples
The `moxin-vlm-mlx` crate includes several command-line examples:
generate
Basic image captioning and visual question answering:
- `--model`: Path to model directory
- `--image`: Path to input image (any format supported by the `image` crate)
- `--prompt`: Text prompt (will be formatted as “In: <prompt>\nOut:”)
- `--temp`: Sampling temperature (0.0 = greedy, higher = more random)
- `--max-tokens`: Maximum tokens to generate (default: 256)
- `--quantize`: Quantization bits (0 = none, 4 or 8)
save_quantized
Quantize and save model weights for faster loading.

server
OpenAI-compatible HTTP server for VLM inference. It serves `/v1/chat/completions` with base64-encoded images.
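A request body for that endpoint would look roughly like the OpenAI vision chat format. The sketch below builds one with plain string formatting; the field names follow the OpenAI API and are an assumption about what this server accepts, and real code should use a JSON library such as `serde_json` to handle escaping:

```rust
/// Build a minimal OpenAI-style chat request with a base64-encoded image.
/// Field names follow the OpenAI vision API; treat them as an assumption here.
/// `prompt` must not contain characters needing JSON escaping in this sketch.
fn chat_request_body(prompt: &str, image_b64: &str) -> String {
    format!(
        r#"{{"model":"moxin-7b","messages":[{{"role":"user","content":[{{"type":"text","text":"{prompt}"}},{{"type":"image_url","image_url":{{"url":"data:image/jpeg;base64,{image_b64}"}}}}]}}]}}"#
    )
}
```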
API reference
Core functions
- Load the Moxin-7B model from a directory containing `model.safetensors` and `config.json`. Supports both sharded weights (`model.safetensors.index.json`) and single-file weights.
- Load the tokenizer from `tokenizer.json` in the model directory.
- Normalize an image for the DINOv2 encoder using ImageNet statistics. Input: `[B, 224, 224, 3]` float32 in [0, 1].
- Normalize an image for the SigLIP encoder using unit normalization. Input: `[B, 224, 224, 3]` float32 in [0, 1].

MoxinVLM methods
- Full VLM forward pass: encode image + text → logits. Used during the prefill phase when processing the initial image and prompt.
- Text-only decode for single-token generation (uses the KV cache). Used during the decode phase for fast autoregressive generation.
- Quantize the LLM decoder to reduce memory and improve speed. Only quantizes the Mistral-7B decoder; vision encoders remain in BF16.
Generate iterator
Create a token generator for VLM inference. Returns an iterator that yields tokens one at a time, handling both prefill and decode phases automatically.
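The prefill/decode split behind that iterator follows the standard pattern: one full forward pass over image + prompt, then repeated single-token decode steps against the KV cache. A toy sketch of the control flow (generic closures stand in for the model; this is not the crate's real API):

```rust
/// Toy sketch of the prefill/decode loop; not the crate's actual API.
fn generate<F, G>(mut prefill: F, mut decode: G, max_tokens: usize, eos: u32) -> Vec<u32>
where
    F: FnMut() -> u32,    // full image + prompt forward pass -> first token
    G: FnMut(u32) -> u32, // single-token decode step using the KV cache
{
    // Prefill runs once and yields the first sampled token.
    let mut tokens = vec![prefill()];
    // Decode runs once per subsequent token until EOS or the length cap.
    while tokens.len() < max_tokens {
        let next = decode(*tokens.last().unwrap());
        if next == eos {
            break;
        }
        tokens.push(next);
    }
    tokens
}
```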
Troubleshooting
Out of memory errors
Try quantizing the model to reduce memory usage, or use 4-bit quantization for even lower memory.
Slow inference
Enable quantization for faster inference: 8-bit quantization typically provides a 1.5-2x speedup with minimal quality loss.
Image format issues
Ensure images are:
- Exactly 224×224 pixels
- RGB format (3 channels)
- Float32 values in [0, 1] range
- NHWC layout (batch, height, width, channels)
Model loading failures
Verify the model directory contains:
- `model.safetensors`, or `model.safetensors.index.json` plus shard files
- `config.json`
- `tokenizer.json`
Next steps
VLM overview
Learn more about vision-language models
API reference
Explore the complete API documentation