Features
- Sliding window attention: Efficient processing of long contexts (4096 token window)
- Grouped Query Attention (GQA): Reduced KV cache memory with 4:1 query:key-value head ratio
- RoPE position embeddings: Rotary position encoding for better extrapolation
- Quantization support: 4-bit and 8-bit quantized models available
- SwiGLU activation: Improved MLP activation function
Installation
Add to yourCargo.toml:
Quick start
Architecture
Mistral uses an optimized transformer architecture:Key optimizations
Sliding window attention
Mistral uses a 4096 token attention window instead of full attention:Grouped Query Attention (GQA)
32 query heads share 8 key-value heads (4:1 ratio):SwiGLU activation
Replaces traditional ReLU/GELU in MLP:Code example
From examples/generate_mistral.rs:Supported models
Mistral-7B-v0.3 (bf16)
Parameters: 7B
Size: 14 GB
Precision: bfloat16
Use case: Maximum quality
Size: 14 GB
Precision: bfloat16
Use case: Maximum quality
Mistral-7B-v0.3 (4-bit)
Parameters: 7B
Size: 4 GB
Precision: 4-bit quantized
Use case: Recommended for consumer hardware
Size: 4 GB
Precision: 4-bit quantized
Use case: Recommended for consumer hardware
Model variants
| Model | Parameters | Notes |
|---|---|---|
| Mistral-7B-v0.1 | 7B | Original release |
| Mistral-7B-v0.2 | 7B | Improved instruction following |
| Mistral-7B-v0.3 | 7B | Latest version (recommended) |
Performance
Benchmark results (Apple M3 Max)
| Model | Memory | Tokens/sec (Prefill) | Tokens/sec (Decode) |
|---|---|---|---|
| Mistral-7B (bf16) | 14 GB | ~180 tok/s | 40 tok/s |
| Mistral-7B (4-bit) | 4 GB | ~220 tok/s | 55 tok/s |
- 3.5x less memory (4 GB vs 14 GB)
- 1.4x faster generation (55 vs 40 tok/s)
- Minimal quality degradation
Sliding window efficiency
Mistral’s 4096-token sliding window enables efficient long-context processing:| Context Length | Memory (bf16) | Memory (4-bit) | Speed |
|---|---|---|---|
| 2K tokens | 14 GB | 4 GB | 40 tok/s |
| 4K tokens | 15 GB | 4.5 GB | 38 tok/s |
| 8K tokens | 16 GB | 5 GB | 35 tok/s |
Mistral Instruct format
Mistral Instruct models use a specific chat format:[INST]...[/INST] tags for best results.
Converting models
Convert from HuggingFace with quantization:Model configuration
Mistral-7B configuration:- GQA ratio: 32:8 = 4:1 (4x KV cache reduction)
- Sliding window: 4096 tokens
- Large intermediate: 14336 dims (3.5× hidden size)
API reference
Loading functions
Generation
Re-exported types
Benchmarking
Run the benchmark example:- Model load time
- Prompt processing speed
- Token generation speed
- Memory usage
Troubleshooting
Unexpected output format
Make sure to use Mistral Instruct format:Out of memory
Mistral-7B (bf16) requires 16GB+ memory. Solutions:- Use 4-bit quantized model
- Close other applications
- Reduce max generation length
Slow generation speed
- Use
--releasebuild mode (essential for performance) - Verify Metal GPU is active in Activity Monitor
- Use 4-bit model for faster inference
- Update to latest macOS for Metal optimizations