Four Fit Levels
Perfect
Criteria:
- Running on GPU (not CPU-only)
- Recommended memory met
- Comfortable headroom for inference
Good
Criteria:
- Fits with ≥20% headroom
- Best achievable for MoE offload
- Best achievable for CPU+GPU offload
Marginal
Criteria:
- Minimum memory met but tight
- Best achievable for CPU-only
- Risk of OOM under load
Too Tight
Criteria:
- Insufficient memory in all pools
- Model will not run
Fit Scoring Logic
Fit level depends on both memory headroom and run mode.
Key insight: CPU-only and offload modes can never achieve Perfect. Perfect requires GPU acceleration with comfortable memory headroom.
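The scoring logic can be sketched as a small function. This is an illustrative sketch, not llmfit's actual implementation: the function name, run-mode strings, and the 50% "comfortable headroom" threshold are assumptions; the 20% threshold and mode restrictions come from the criteria above.

```python
def fit_level(run_mode: str, required_gb: float, available_gb: float) -> str:
    """Hypothetical sketch of llmfit's fit-level scoring.

    run_mode: one of "gpu", "moe_offload", "cpu_gpu_offload", "cpu".
    """
    if required_gb > available_gb:
        return "Too Tight"      # insufficient memory in all pools
    headroom = (available_gb - required_gb) / required_gb
    if run_mode == "gpu":
        if headroom >= 0.5:     # "comfortable" threshold is an assumption
            return "Perfect"
        if headroom >= 0.2:     # fits with >=20% headroom
            return "Good"
        return "Marginal"       # minimum met but tight; OOM risk under load
    if run_mode in ("moe_offload", "cpu_gpu_offload"):
        return "Good"           # best achievable for offload modes
    return "Marginal"           # best achievable for CPU-only
```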
Run Mode Determination
llmfit tries execution paths in order of preference.
Unified Memory Detection
If system.unified_memory == true (Apple Silicon, NVIDIA Grace):
- GPU and CPU share the same memory pool
- No separate CPU+GPU offload path
- Use GPU path with full memory budget
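The pool selection above can be sketched as follows. The function and key names are hypothetical, not llmfit's real API; the point is that a unified-memory system collapses everything into one GPU-path budget.

```python
def memory_budget(vram_gb: float, ram_gb: float, unified: bool) -> dict:
    """Sketch (not llmfit's actual code): choose memory pools by architecture.

    On unified-memory systems (Apple Silicon, NVIDIA Grace), GPU and CPU
    share one pool, so there is no separate CPU+GPU offload path.
    """
    if unified:
        # One shared pool: use the GPU path with the full memory budget.
        return {"gpu_pool_gb": ram_gb, "offload_pool_gb": 0.0}
    return {"gpu_pool_gb": vram_gb, "offload_pool_gb": ram_gb}
```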
Try GPU Path (Discrete VRAM)
Attempt to fit the model in VRAM with dynamic quantization. If any quantization level fits, use the GPU path.
Try MoE Offload (MoE Models Only)
For Mixture-of-Experts models, try expert offloading. Requirements:
- Active experts fit in VRAM
- Inactive experts fit in system RAM
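The two requirements above reduce to a simple feasibility check. A minimal sketch (the function name is illustrative):

```python
def moe_offload_fits(active_gb: float, inactive_gb: float,
                     vram_gb: float, ram_gb: float) -> bool:
    """Mirror the MoE offload requirements above: active experts
    must fit in VRAM, inactive experts must fit in system RAM."""
    return active_gb <= vram_gb and inactive_gb <= ram_gb
```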
Try CPU+GPU Offload
The model doesn't fit in VRAM, so layers spill to system RAM.
Penalty: 0.5× GPU speed (RAM bandwidth bottleneck).
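The penalty can be sketched as a throughput estimate. The text gives a flat 0.5× factor; weighting that factor by the fraction of the model that spills to RAM is an assumption for illustration, not llmfit's actual formula.

```python
def offload_throughput(gpu_tps: float, vram_gb: float, required_gb: float) -> float:
    """Estimate tokens/sec under CPU+GPU offload (illustrative sketch).

    Applies the 0.5x RAM-bandwidth penalty to the spilled portion only;
    the fractional weighting is an assumption, not llmfit's formula.
    """
    spilled = max(0.0, 1.0 - vram_gb / required_gb)  # fraction living in RAM
    return gpu_tps * (1.0 - spilled * 0.5)
```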
Dynamic Quantization Selection
Instead of using the model's default quantization, llmfit walks a hierarchy to find the best quality that fits.
GGUF Quantization Hierarchy (llama.cpp)
MLX Quantization Hierarchy (Apple Silicon)
Selection Algorithm
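A sketch of the selection walk, best quality first, then falling back to a capped context. The bytes-per-param figures are rough GGUF averages (the doc gives 0.58 for Q4_K_M; the rest are approximate assumptions), and the function itself is illustrative rather than llmfit's real code.

```python
# GB per billion params; Q4_K_M matches the doc, others are rough estimates.
BYTES_PER_PARAM = {
    "Q8_0": 1.07, "Q6_K": 0.83, "Q5_K_M": 0.70,
    "Q4_K_M": 0.58, "Q3_K_M": 0.48, "Q2_K": 0.36,
}

def select_quant(params_b: float, contexts: list, vram_gb: float,
                 overhead_gb: float = 0.5):
    """Walk the quant hierarchy best-quality-first at each context
    (widest context first); return the first (quant, context) pair
    whose total memory estimate fits in VRAM, else None."""
    for ctx in contexts:
        for quant, bpp in BYTES_PER_PARAM.items():  # insertion order
            total = params_b * bpp + 0.000008 * params_b * ctx + overhead_gb
            if total <= vram_gb:
                return quant, ctx
    return None
```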
Example: Llama-3.1-70B on RTX 4090 (24 GB VRAM)
- Q8_0: 75.2 GB — doesn’t fit
- Q6_K: 58.0 GB — doesn’t fit
- Q5_K_M: 49.3 GB — doesn’t fit
- Q4_K_M: 42.2 GB — doesn’t fit
- Q3_K_M: 34.7 GB — doesn’t fit
- Q2_K: 26.7 GB — doesn’t fit
- Q8_0 @ 65K ctx: 73.1 GB — doesn’t fit
- Q6_K @ 65K ctx: 55.9 GB — doesn’t fit
- Q5_K_M @ 65K ctx: 47.2 GB — doesn’t fit
- Q4_K_M @ 65K ctx: 40.1 GB — doesn’t fit
- Q3_K_M @ 65K ctx: 32.6 GB — doesn’t fit
- Q2_K @ 65K ctx: 24.6 GB — fits! ✓
Selected: Q2_K at 65K context, 24.6 GB VRAM.
Memory Estimation Formula
llmfit computes memory requirements dynamically:
Model Weights
Formula: params × bytes_per_param(quant)
Example: 7B @ Q4_K_M = 7 × 0.58 = 4.06 GB
This is the bulk of memory usage: the model parameters themselves.
KV Cache
Formula: 0.000008 × params × context_length
Example: 7B @ 8K context = 0.000008 × 7 × 8192 = 0.46 GB
Stores key/value tensors for the attention mechanism. Grows linearly with context length.
Runtime Overhead
Fixed: 0.5 GB
Covers CUDA/Metal context, buffer allocations, and framework overhead.
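The three components combine into a single estimate. A minimal sketch using the formulas above (the function name is illustrative):

```python
def estimate_memory_gb(params_b: float, gb_per_billion_params: float,
                       context_length: int) -> float:
    """Total memory = weights + KV cache + fixed runtime overhead."""
    weights = params_b * gb_per_billion_params          # model weights
    kv_cache = 0.000008 * params_b * context_length     # grows with context
    overhead = 0.5                                      # fixed runtime overhead
    return weights + kv_cache + overhead

# 7B @ Q4_K_M (0.58 GB/B params) with 8K context:
# 4.06 + 0.46 + 0.5 ≈ 5.02 GB
```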
Memory Utilization Targets
llmfit aims for specific utilization ranges for each run mode:
- GPU Inference: target 50-80% of VRAM. Sweet spot: efficient use without risking OOM.
- CPU+GPU Offload
- CPU-Only
MoE Expert Offloading
Mixture-of-Experts models can split across VRAM and RAM.
Memory Split Calculation
Example: Mixtral 8x7B @ Q4_K_M
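The split can be sketched from parameter counts. The function is illustrative; the Mixtral figures in the comment (roughly 46.7B total parameters, roughly 12.9B active per token with 2 of 8 experts) are commonly cited approximations, not values from this document.

```python
def moe_split_gb(total_params_b: float, active_params_b: float,
                 gb_per_billion_params: float):
    """Split MoE weights into a VRAM-resident (active experts plus shared
    layers) portion and a RAM-resident (inactive experts) portion."""
    active = active_params_b * gb_per_billion_params
    inactive = (total_params_b - active_params_b) * gb_per_billion_params
    return active, inactive

# Mixtral 8x7B: ~46.7B total, ~12.9B active per token, Q4_K_M at ~0.58 GB/B:
# active ≈ 7.5 GB (fits in 8 GB VRAM), inactive ≈ 19.6 GB (fits in 32 GB RAM)
```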
Fit Analysis Examples
Example 1: Perfect Fit
Hardware: RTX 4090 (24 GB VRAM), 64 GB RAM
Model: Qwen2.5-Coder-14B-Instruct
Analysis:
Result: Excellent fit: high-quality quantization with plenty of headroom.
Example 2: MoE Offload
Hardware: RTX 3070 (8 GB VRAM), 32 GB RAM
Model: Mixtral 8x7B-Instruct
Analysis:
Result: Fits via expert offloading; wouldn't run otherwise.
Example 3: CPU+GPU Offload
Hardware: GTX 1660 Ti (6 GB VRAM), 16 GB RAM
Model: Llama-3.1-8B-Instruct
Analysis:
Result: Spills to RAM; a significant performance hit, but runnable.
Example 4: Too Tight
Hardware: GTX 1650 (4 GB VRAM), 8 GB RAM
Model: Llama-3.1-70B-Instruct
Analysis:
Result: Unrunnable. The model ranks last in fit results.
Context-Length Capping
Use --max-context to reduce memory requirements.
Run Mode Selection Summary
llmfit always tries the fastest path first (GPU) and falls back gracefully to slower modes when memory is insufficient.
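The fallback chain above can be sketched as a dispatcher. Mode names and the function signature are hypothetical; the ordering is the one described in this document.

```python
def choose_run_mode(fits_gpu: bool, is_moe: bool, fits_moe: bool,
                    fits_offload: bool, fits_cpu: bool) -> str:
    """Sketch of the preference order: fastest path first, with
    graceful fallback when memory is insufficient."""
    if fits_gpu:
        return "gpu"                 # model fits entirely in VRAM
    if is_moe and fits_moe:
        return "moe_offload"         # active experts in VRAM, rest in RAM
    if fits_offload:
        return "cpu_gpu_offload"     # spill layers to system RAM
    if fits_cpu:
        return "cpu"                 # CPU-only, slowest runnable mode
    return "too_tight"               # insufficient memory in all pools
```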
