## Features
| Feature | Value |
|---|---|
| Parameters | 9B |
| Max context | 1M+ tokens |
| Inference speed | 3.5× faster than Qwen3-8B at 256K context |
| Memory efficiency | Runs on M3 Max, RTX 5090, A6000D |
| License | Apache-2.0 |
## Architecture highlights
- Hybrid attention: 25% sparse + 75% lightning attention layers
- Custom Metal kernels: Optimized GLA (Gated Linear Attention) implementation
- Self-speculative decoding: Draft from first 8 layers for faster generation
- OpenAI-compatible API: Drop-in replacement for OpenAI server
- 8-bit quantization: 9.6 GB model size with 28 tok/s decode speed
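The self-speculative decoding item above can be sketched as a draft-then-verify loop. A minimal greedy version in Python; the stand-in `draft_next`/`full_next` functions and all numbers are invented for illustration (the real implementation drafts with the model's first 8 layers and gets its speedup from verifying drafts in one batched forward pass):

```python
def speculative_decode(draft_next, full_next, prompt, n_draft, max_tokens):
    """Greedy self-speculative decoding: draft cheap tokens, keep the
    prefix the full model agrees with, then advance one verified token."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_tokens:
        drafted = []
        for _ in range(n_draft):                 # cheap draft pass
            drafted.append(draft_next(tokens + drafted))
        for tok in drafted:                      # verify against the full model
            if full_next(tokens) == tok:
                tokens.append(tok)
            else:
                break
        tokens.append(full_next(tokens))         # always emit one full-model token
    return tokens[len(prompt):]

# Stand-in "models": the next token is a deterministic function of the context.
full_next = lambda toks: (sum(toks) + len(toks)) % 7
draft_next = lambda toks: full_next(toks) if len(toks) % 3 else 0  # imperfect draft

out = speculative_decode(draft_next, full_next, prompt=[1, 2], n_draft=4, max_tokens=6)
print(out)  # identical to plain greedy decoding with full_next
```

The key property is that verification makes the output exactly match greedy decoding with the full model, no matter how often the draft is wrong.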
## Installation
Add to your `Cargo.toml`:
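The original dependency snippet is missing here; the crate name below is a guess taken from the `~/.ominix/models` config path and is unverified, as is the version (check the repository for the actual coordinates):

```toml
[dependencies]
ominix = "0.1"   # hypothetical crate name and version
```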
## Quick start
## Examples
### Text generation
Basic generation with chat template:

- `--max-tokens N`: Generate up to N tokens (default: 256)
- `--temperature T`: Sampling temperature (default: 0.7, use 0 for greedy)
- `--raw`: Skip chat template, use raw completion
- `--system "..."`: Custom system prompt
- `--no-think`: Hide `<think>...</think>` reasoning blocks
### Interactive chat
Multi-turn conversation:

- `clear`: Reset conversation history
- `quit` or `exit`: Exit chat
### Batched inference
Process multiple prompts in parallel.

### Self-speculative decoding

Use the first 8 layers as a draft model.

### Long context test

Needle-in-a-haystack evaluation.

### OpenAI-compatible API server
Start the HTTP server:

- `--port N`: Listen on port N (default: 8080)
- `--temperature T`: Default temperature (default: 0.7)
- `--max-tokens N`: Default max tokens (default: 2048)
- `--no-think`: Strip `<think>...</think>` from responses
- `--models-dir PATH`: Directory for managed models (default: `~/.ominix/models`)
### API endpoints
| Endpoint | Method | Path | Description |
|---|---|---|---|
| Chat completions | POST | `/v1/chat/completions` | OpenAI-compatible chat endpoint |
| List models | GET | `/v1/models` | List available models with metadata |
| Download model | POST | `/v1/models/download` | Download model from HuggingFace |
| Delete model | DELETE | `/v1/models/{id}` | Remove downloaded model |

#### Chat completion example
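The original example is missing; a minimal request sketch using Python's standard library (the model id is a placeholder, and the endpoint assumes the server's default port 8080):

```python
import json
import urllib.request

# Model id below is hypothetical; list real ids via GET /v1/models.
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps({
        "model": "minicpm-sala-9b-8bit",   # placeholder model id
        "messages": [{"role": "user", "content": "Say hi"}],
        "max_tokens": 64,
    }).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
print(req.get_method(), req.full_url)  # → POST http://localhost:8080/v1/chat/completions
# urllib.request.urlopen(req) sends the request once the server is running.
```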
## Model management
### Supported models
#### 8-bit (recommended)

- Size: 9.6 GB
- Prefill: 443 tok/s
- Decode: 28 tok/s
- Use case: Best balance of speed and quality
#### 4-bit (fastest)

- Size: 5.4 GB
- Prefill: 260 tok/s
- Decode: 35 tok/s
- Use case: Memory-constrained systems
#### fp16 (highest quality)

- Size: 18 GB
- Prefill: 314 tok/s
- Decode: 3.6 tok/s
- Use case: Batch processing, not interactive
## Performance
### Throughput (Apple M3 Max, 128 GB)
| Variant | Size | Prefill | Decode |
|---|---|---|---|
| fp16 | 18 GB | 0.4 – 313.9 tok/s | 3.5 – 3.6 tok/s |
| 8-bit | 9.6 GB | 4.7 – 442.6 tok/s | 27.3 – 28.1 tok/s |
| 4-bit | 5.4 GB | 2.2 – 260.3 tok/s | 34.4 – 35.6 tok/s |
Decode speed is measured during steady-state autoregressive generation.
### Speed vs Qwen3-8B (both 8-bit)
MiniCPM-SALA (Rust/mlx-rs) vs Qwen3-8B (Python/mlx-lm):

| Context | SALA Prefill | Qwen3 Prefill | SALA Decode | Qwen3 Decode |
|---|---|---|---|---|
| 4K | 309 tok/s | 488 tok/s | 26 tok/s | 35 tok/s |
| 8K | 325 tok/s | 493 tok/s | 25 tok/s | 33 tok/s |
| 16K | 325 tok/s | 417 tok/s | 23 tok/s | 25 tok/s |
| 32K | 350 tok/s | 333 tok/s | 23 tok/s | 18 tok/s |
| 64K | 220 tok/s | OOM | 19 tok/s | — |
| 128K | 192 tok/s | OOM | 9 tok/s | — |
- At short contexts (< 16K), Qwen3-8B is faster due to optimized dense GQA
- At 32K, SALA overtakes Qwen3 in both prefill and decode
- Beyond 32K, Qwen3’s KV cache grows too large while SALA continues to 128K+
- SALA’s advantage grows with context length (75% lightning attention layers use O(1) state)
### Needle-in-a-haystack results
Retrieval of a specific fact in long filler text (8-bit, greedy):

| Context | Depth | Found? | Prefill Speed | Prefill Time |
|---|---|---|---|---|
| 4K | 50% | ✅ YES | 309 tok/s | 13s |
| 8K | 25% | ✅ YES | 325 tok/s | 25s |
| 16K | 25% | ✅ YES | 325 tok/s | 49s |
| 32K | 95% | ✅ YES | 350 tok/s | 92s |
| 64K | 95% | ✅ YES | 220 tok/s | 293s |
| 128K | 95% | ✅ YES | 192 tok/s | 671s (11 min) |
| 256K | 95% | ❌ NO | 276 tok/s | 934s (16 min) |
- Reliable retrieval within sliding window (last ~2K tokens) and init region (first ~8K tokens)
- Middle-region retrieval depends on InfLLM-v2 sparse selection (can miss individual facts in repetitive text)
- 128K prefill in ~11 min on M3 Max
- Decode speed degrades at very long contexts (9 tok/s at 128K vs 28 tok/s at 4K)
## Hybrid attention architecture
MiniCPM-SALA alternates two attention types.

### Sparse attention layers (25%)
InfLLM-v2: selects top-K blocks from history based on attention scores.

- Local window: Always attends to the last ~2048 tokens
- Top-K blocks: Dynamically selects 64 most relevant 64-token blocks from earlier context
- Good for precise retrieval of specific facts
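The block-selection step can be reduced to a toy sketch. The scores and block count below are invented; the real InfLLM-v2 kernel scores 64-token blocks against the current query on-device:

```python
def select_blocks(scores, top_k):
    """Pick indices of the top_k highest-scoring history blocks,
    returned in document order for contiguous gathering."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])

# Each entry stands in for an aggregate attention score of one 64-token block.
block_scores = [0.1, 0.9, 0.3, 0.7, 0.2]
print(select_blocks(block_scores, top_k=2))  # → [1, 3]
```

Only the selected blocks (plus the local window) are attended to, which is why sparse layers can miss a fact whose block never scores into the top-K.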
### Lightning attention layers (75%)
GLA (Gated Linear Attention): recurrent state updated per token.

- O(1) memory per layer (fixed-size state, not growing with context)
- Linear complexity: O(n) instead of O(n²)
- Good for global understanding and summarization
- Million-token context capability
- Faster inference than dense attention at long contexts
- Better quality than pure linear attention
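A minimal sketch of that recurrence in Python, with a scalar gate and tiny dimensions purely for illustration (the real implementation is a Metal kernel over matrix-valued state with learned per-head gates):

```python
def gla_step(state, k, v, q, gate):
    """One recurrent step: decay the state, add the outer product k⊗v,
    then read out with the query. The state never grows with context."""
    d_k, d_v = len(k), len(v)
    for i in range(d_k):
        for j in range(d_v):
            state[i][j] = gate * state[i][j] + k[i] * v[j]
    out = [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
    return state, out

state = [[0.0, 0.0], [0.0, 0.0]]   # fixed-size 2x2 state: O(1) memory per layer
state, out = gla_step(state, k=[1.0, 0.0], v=[2.0, 3.0], q=[1.0, 1.0], gate=0.5)
print(out)  # → [2.0, 3.0]
```

Because each token touches only this fixed-size state, decode cost per token is constant, which is where the long-context advantage over dense KV-cache attention comes from.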
## Converting models
Save quantized weights from an fp16 checkpoint:

- `--bits 8`: 8-bit (recommended)
- `--bits 4`: 4-bit (faster, lower quality)
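Conceptually, each group of weights is mapped to low-bit integer codes plus a scale and offset. A generic affine-quantization sketch in Python (this is the idea, not the actual mlx quantizer):

```python
def quantize_group(weights, bits):
    """Group-wise affine quantization: map floats to ints in [0, 2**bits - 1],
    keeping (lo, scale) so the group can be approximately reconstructed."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (2 ** bits - 1) or 1.0   # guard against a constant group
    codes = [round((w - lo) / scale) for w in weights]
    dequant = [lo + c * scale for c in codes]
    return codes, dequant

codes, approx = quantize_group([-1.0, -0.5, 0.0, 0.75], bits=8)
print(codes)    # integer codes, one per weight
print(approx)   # reconstruction; error shrinks as bits grows
```

This is why 4-bit halves the model size relative to 8-bit at some cost in reconstruction accuracy.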
## API reference
### Loading functions
### Generation utilities
### Think filter

Strips `<think>...</think>` blocks from output when `no_think = true`.
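The behavior amounts to a regex strip; a Python sketch of what the filter does (the shipped implementation is in Rust):

```python
import re

def strip_think(text):
    """Remove <think>...</think> reasoning blocks, as --no-think does."""
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)

print(strip_think("<think>chain of thought</think>The answer is 4."))
# → The answer is 4.
```

The non-greedy `.*?` with `re.DOTALL` ensures multi-line reasoning blocks are removed without eating text between two separate blocks.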
## Troubleshooting
### Slow decode speed
MiniCPM-SALA decode speed is limited by:

- Sparse layers: Scan a growing KV cache (slower at long contexts)
- Lightning layers: Fixed overhead per token
Mitigations:

- Use the 4-bit model for ~25% faster decode (35 vs 28 tok/s)
- Keep context under 32K for best speed
- Consider batched inference for multiple prompts
### Out of memory
MiniCPM-SALA-9B (8-bit) requires 12 GB+ of memory. Solutions:

- Use the 4-bit model (5.4 GB vs 9.6 GB)
- Close other applications
- Reduce max context length
### Missing facts in the middle of a long context
InfLLM-v2 sparse selection may miss individual facts in repetitive filler text. This is expected behavior. For critical retrieval:

- Place important info near the start or end of the context
- Use explicit markers or section headers
- Increase the `topk` parameter in the model config (requires retraining)
### Think blocks not filtered
Make sure to pass the `--no-think` flag.
## Related models
## References
- HuggingFace Model - Upstream PyTorch implementation
- GitHub Repository - Source code and docs
- Technical Report - Architecture details