OminiX provides high-performance vision-language model (VLM) inference on Apple Silicon through MLX. These models combine visual understanding with language capabilities, enabling applications like image captioning, visual question answering, and multimodal understanding.
Available models
OminiX currently supports the following vision-language models:
Moxin-7B Dual-encoder VLM with DINOv2 and SigLIP vision backbones
Key features
Dual vision encoders
OminiX VLMs use multiple specialized vision encoders to capture different aspects of visual information:
DINOv2 - Self-supervised Vision Transformer (ViT), excellent for fine-grained semantic and spatial understanding
SigLIP - Contrastive vision-language encoder trained on image-text pairs
The outputs from these encoders are fused and projected into the language model’s embedding space, providing rich multimodal representations.
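The fusion step can be sketched as follows. This is an illustrative example, not the actual `moxin_vlm_mlx` API: the `fuse` and `project` helpers and the encoder widths (1024 for DINOv2, 1152 for SigLIP) are assumptions for demonstration. Per patch token, the two encoder features are concatenated channel-wise and then linearly projected into the language model's embedding space.

```rust
/// Concatenate two per-token feature vectors channel-wise
/// (hypothetical helper; the real fusion runs on MLX tensors).
fn fuse(dino: &[f32], siglip: &[f32]) -> Vec<f32> {
    dino.iter().chain(siglip.iter()).copied().collect()
}

/// Project a fused feature into the LLM embedding space: y = W x,
/// where W has shape [out_dim, in_dim].
fn project(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum())
        .collect()
}

fn main() {
    let dino = vec![1.0_f32; 1024];   // assumed DINOv2 feature width
    let siglip = vec![1.0_f32; 1152]; // assumed SigLIP feature width
    let fused = fuse(&dino, &siglip);
    assert_eq!(fused.len(), 2176); // concatenated channel dimension

    // Toy projection to a 4-dim space (4096 in the real model).
    let w = vec![vec![0.001_f32; fused.len()]; 4];
    let visual_token = project(&w, &fused);
    assert_eq!(visual_token.len(), 4);
}
```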
Efficient quantization
Reduce memory usage and improve inference speed with 8-bit or 4-bit quantization:
use moxin_vlm_mlx::load_model;

let vlm = load_model("~/.OminiX/models/moxin-vlm-7b")?;

// Quantize to 8-bit (group_size = 64)
let vlm = vlm.quantize(64, 8)?;
Quantization is applied only to the language model decoder, keeping vision encoders in BF16 for optimal quality.
KV-cache generation
All VLMs use efficient KV-caching for fast autoregressive generation:
Prefill - Process image and prompt in parallel, cache key-value pairs
Decode - Generate tokens one at a time using cached values
This approach provides significant speedups, especially for longer sequences.
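The two phases above can be sketched with a simplified cache. This is a conceptual model only: a real KV-cache stores per-layer key/value tensors, not token ids, and the `KvCache` type here is hypothetical.

```rust
// Conceptual sketch of prefill + cached decode (not the real API).
struct KvCache {
    keys: Vec<u32>,   // stand-in for cached key tensors
    values: Vec<u32>, // stand-in for cached value tensors
}

impl KvCache {
    fn new() -> Self {
        Self { keys: Vec::new(), values: Vec::new() }
    }

    /// Prefill: process the whole image+prompt sequence at once,
    /// caching K/V for every position in parallel.
    fn prefill(&mut self, tokens: &[u32]) {
        self.keys.extend_from_slice(tokens);
        self.values.extend_from_slice(tokens);
    }

    /// Decode: one new token attends over all cached positions;
    /// only its own K/V pair is appended.
    fn decode_step(&mut self, token: u32) -> usize {
        self.keys.push(token);
        self.values.push(token);
        self.keys.len() // attention span for the next step
    }
}

fn main() {
    let mut cache = KvCache::new();
    cache.prefill(&[1, 2, 3, 4]); // BOS + visual + prompt tokens
    let span = cache.decode_step(5);
    assert_eq!(span, 5); // cache grows by one per generated token
}
```

Because prefill fills the cache in a single parallel pass, every decode step only pays for one new token rather than re-processing the whole sequence.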
Architecture overview
The typical VLM architecture in OminiX follows this pipeline:
Image (224×224)
  ├─ Vision Encoder 1 (DINOv2) → [B, 256, D1]
  └─ Vision Encoder 2 (SigLIP) → [B, 256, D2]
            │ concat + project
  [B, 256, 4096]  (visual tokens)
            │
  BOS + [visual tokens] + text tokens
            │ LLM Decoder
  logits → autoregressive generation
Visual features are converted into “visual tokens” that the language model processes alongside text tokens, enabling seamless multimodal reasoning.
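A minimal sketch of how the multimodal input sequence is assembled, following the pipeline above. The token ids and the `build_sequence` helper are placeholders, not part of the library; only the BOS-plus-256-visual-tokens layout comes from the architecture diagram.

```rust
const NUM_VISUAL_TOKENS: usize = 256; // one token per image patch
const BOS: u32 = 1; // placeholder BOS token id

/// Assemble the decoder input: BOS, then visual tokens, then text.
fn build_sequence(visual: &[u32], text: &[u32]) -> Vec<u32> {
    let mut seq = Vec::with_capacity(1 + visual.len() + text.len());
    seq.push(BOS);
    seq.extend_from_slice(visual);
    seq.extend_from_slice(text);
    seq
}

fn main() {
    let visual = vec![0_u32; NUM_VISUAL_TOKENS];
    let text = vec![7_u32; 12]; // a 12-token prompt
    let seq = build_sequence(&visual, &text);
    assert_eq!(seq.len(), 1 + 256 + 12);
}
```

From the decoder's point of view, visual and text tokens are interchangeable entries in one sequence, which is what makes the multimodal reasoning seamless.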
Memory usage
BF16 (no quantization) : ~14GB for Moxin-7B
INT8 quantization : ~7GB for Moxin-7B
INT4 quantization : ~4GB for Moxin-7B
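The figures above follow from simple weight-size arithmetic, roughly 2 bytes per parameter at BF16. A back-of-envelope estimate (ignoring activations, the vision encoders, and quantization scale/zero-point overhead):

```rust
/// Bytes needed to store `params` weights at `bits_per_weight` bits
/// each (weights only; no activation or runtime overhead).
fn weight_bytes(params: u64, bits_per_weight: u64) -> u64 {
    params * bits_per_weight / 8
}

fn main() {
    let params = 7_000_000_000_u64;
    let gib = |b: u64| b as f64 / (1u64 << 30) as f64;
    println!("BF16: {:.1} GiB", gib(weight_bytes(params, 16))); // ≈13.0
    println!("INT8: {:.1} GiB", gib(weight_bytes(params, 8)));  // ≈6.5
    println!("INT4: {:.1} GiB", gib(weight_bytes(params, 4)));  // ≈3.3
}
```

The listed totals are slightly higher than these raw weight sizes because they include the BF16 vision encoders and runtime buffers.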
Inference speed
Typical performance on M3 Max (36 GPU cores):
Prefill : 200-300ms (vision encoding + prompt processing)
Decode : 30-50 tokens/second (BF16), 50-80 tokens/second (INT8)
Vision encoding only runs once during prefill. Subsequent token generation uses only the language decoder, making decode steps much faster.
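The numbers above give a rough end-to-end latency model: one prefill pass plus n tokens at the decode rate. The midpoint values below are illustrative; real throughput varies with sequence length and prompt size.

```rust
/// Rough total generation latency: one prefill pass plus
/// `n_tokens` decode steps at `tokens_per_sec`.
fn total_ms(prefill_ms: f64, n_tokens: f64, tokens_per_sec: f64) -> f64 {
    prefill_ms + n_tokens / tokens_per_sec * 1000.0
}

fn main() {
    // 100 tokens at an assumed INT8 midpoint of ~65 tok/s,
    // with a 250 ms prefill.
    let t = total_ms(250.0, 100.0, 65.0);
    assert!((t - 1788.5).abs() < 1.0); // ≈1.79 s total
}
```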
Getting started
Download a model
Use the Hugging Face CLI to download model weights:

huggingface-cli download moxin-org/moxin-llm-7b \
  --local-dir ~/.OminiX/models/moxin-vlm-7b
Load and run inference
Load the model and generate from an image:

use moxin_vlm_mlx::{load_model, load_tokenizer, normalize_dino, normalize_siglip};

let mut vlm = load_model("~/.OminiX/models/moxin-vlm-7b")?;
let tokenizer = load_tokenizer("~/.OminiX/models/moxin-vlm-7b")?;

// Preprocess the image for both encoders (a helper built on
// normalize_dino / normalize_siglip)
let (dino_img, siglip_img) = preprocess_image("photo.jpg")?;

// Generate
let prompt = "Describe this image.";
// ... (see model-specific docs for complete examples)
Optimize with quantization
For faster inference and lower memory usage:

let vlm = vlm.quantize(64, 8)?; // 8-bit quantization
Next steps
Moxin-7B guide Learn how to use the Moxin-7B VLM
API reference Explore the complete API documentation