Fit Analysis API
The `fit` module analyzes how well models run on specific hardware, computing fit levels, run modes, and multi-dimensional scores.
Core Types
ModelFit
Complete analysis of a model’s fit on specific hardware:
- `model`: The analyzed model
- `fit_level`: Memory fit quality (Perfect, Good, Marginal, TooTight)
- `run_mode`: Execution path (Gpu, CpuOffload, CpuOnly, MoeOffload)
- `memory_required_gb`: Memory needed for this run mode
- `memory_available_gb`: Memory pool capacity
- `utilization_pct`: Memory utilization percentage
- `notes`: Human-readable analysis notes
- `score`: Composite score (0-100)
- `score_components`: Individual dimension scores
- `estimated_tps`: Estimated tokens per second
- `best_quant`: Optimal quantization for this hardware
- `runtime`: Inference runtime (MLX or llama.cpp)
- `installed`: Whether the model is locally installed
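The struct can be sketched in Rust as follows. Field names come from the list above; the concrete types (a `String` for the model, `[f64; 4]` for the components, and the enum definitions) are illustrative assumptions, not the actual definitions:

```rust
/// Sketch of ModelFit; field names follow the docs, types are assumed.
#[derive(Debug, Clone)]
pub struct ModelFit {
    pub model: String,              // assumed String; the real field holds the model struct
    pub fit_level: FitLevel,
    pub run_mode: RunMode,
    pub memory_required_gb: f64,
    pub memory_available_gb: f64,
    pub utilization_pct: f64,
    pub notes: Vec<String>,
    pub score: f64,                 // composite, 0-100
    pub score_components: [f64; 4], // [quality, speed, fit, context] (layout assumed)
    pub estimated_tps: f64,
    pub best_quant: String,
    pub runtime: InferenceRuntime,
    pub installed: bool,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FitLevel { Perfect, Good, Marginal, TooTight }

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RunMode { Gpu, MoeOffload, CpuOffload, CpuOnly }

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum InferenceRuntime { Mlx, LlamaCpp }
```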
FitLevel
Memory fit quality:
- Perfect: GPU path with recommended memory available
- Good: Adequate headroom (1.2x minimum or more)
- Marginal: Meets minimum but tight
- TooTight: Insufficient memory
RunMode
Execution path:
- Gpu: Full GPU inference (fastest)
- MoeOffload: Expert switching with some latency
- CpuOffload: Significant slowdown from memory transfers
- CpuOnly: Slowest, but works without GPU
InferenceRuntime
Software framework used for inference: MLX or llama.cpp.
ScoreComponents
Multi-dimensional scoring:
- Quality: Model capability (params + family + quant + task alignment)
- Speed: Estimated tok/s vs. target for use case
- Fit: Memory efficiency (50-80% utilization = optimal)
- Context: Context window vs. use case needs
SortColumn
Sort column options for ranking models, used by rank_models_by_fit_opts_col().
Functions
ModelFit::analyze()
Analyzes model fit on hardware:
- `model`: Model to analyze
- `system`: System hardware specs
ModelFit::analyze_with_context_limit()
Analyzes with custom context limit:
- `model`: Model to analyze
- `system`: System hardware specs
- `context_limit`: Maximum context length to assume
Helper Methods
Ranking Functions
rank_models_by_fit()
Ranks models by composite score.
rank_models_by_fit_opts()
Ranks with an installed-first option.
rank_models_by_fit_opts_col()
Ranks with a custom sort column.
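The ranking behavior can be sketched as a descending sort on the composite score, with installed models optionally promoted. The function name mirrors the docs, but the body and the simplified `(score, installed)` representation are illustrative assumptions:

```rust
/// Sketch of composite-score ranking with an installed-first option.
/// `(f64, bool)` pairs stand in for full ModelFit entries: (score, installed).
pub fn rank_models_by_fit_opts(models: &mut Vec<(f64, bool)>, installed_first: bool) {
    models.sort_by(|a, b| {
        if installed_first && a.1 != b.1 {
            // Installed models (true) sort ahead of uninstalled ones.
            return b.1.cmp(&a.1);
        }
        // Otherwise: highest composite score first.
        b.0.partial_cmp(&a.0).unwrap_or(std::cmp::Ordering::Equal)
    });
}
```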
Scoring Algorithm
The composite score combines four dimensions with use-case-specific weights.
Quality Score (0-100)
Based on:
- Parameter count tier (1B = 45, 7B = 75, 70B+ = 95)
- Family reputation (Qwen +2, DeepSeek +3, Llama +2)
- Quantization penalty (Q8 = 0, Q4 = -5, Q2 = -12)
- Task alignment bonus (coding models on coding tasks +6)
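Using the constants above, the quality score reduces to a sum of the four terms. The function shape and the clamp to 0-100 are assumptions; the constants match the docs:

```rust
/// Illustrative quality score: parameter tier plus family, quantization,
/// and task-alignment adjustments, clamped to the 0-100 range (assumed).
pub fn quality_score(param_tier: f64, family_bonus: f64, quant_penalty: f64, task_bonus: f64) -> f64 {
    (param_tier + family_bonus + quant_penalty + task_bonus).clamp(0.0, 100.0)
}
```

For example, a 7B Qwen coding model at Q4 on a coding task would score 75 + 2 - 5 + 6 = 78.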
Speed Score (0-100)
Normalized against a use-case target:
- General/Coding/Chat: 40 tok/s target
- Reasoning: 25 tok/s target
- Embedding: 200 tok/s target
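A minimal sketch of the normalization, assuming linear scaling capped at 100 (the docs do not specify the curve):

```rust
/// Speed score: estimated tok/s relative to the use-case target,
/// scaled linearly and capped at 100. Linear scaling is an assumption.
pub fn speed_score(estimated_tps: f64, target_tps: f64) -> f64 {
    (estimated_tps / target_tps * 100.0).min(100.0)
}
```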
Fit Score (0-100)
Memory efficiency:
- 50-80% utilization: 100 points (sweet spot)
- 30-50% utilization: 60-100 points (scaled)
- 80-90% utilization: 70 points (getting tight)
- 90-100% utilization: 50 points (very tight)
- Over 100% utilization: 0 points (doesn’t fit)
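The bands above form a piecewise function. This sketch assumes linear interpolation in the 30-50% band and a 60-point floor below 30% utilization, which the docs do not specify:

```rust
/// Piecewise fit score from the utilization bands.
/// The behavior below 30% utilization is an assumption.
pub fn fit_score(utilization_pct: f64) -> f64 {
    match utilization_pct {
        u if u > 100.0 => 0.0,                     // doesn't fit
        u if u >= 90.0 => 50.0,                    // very tight
        u if u >= 80.0 => 70.0,                    // getting tight
        u if u >= 50.0 => 100.0,                   // sweet spot
        u if u >= 30.0 => 60.0 + (u - 30.0) * 2.0, // scales 60 -> 100
        _ => 60.0,                                 // assumed floor for very low utilization
    }
}
```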
Context Score (0-100)
Context window vs. use case needs:
- Meets or exceeds target: 100 points
- At least half of target: 70 points
- Below half: 30 points
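The three thresholds translate directly into a step function:

```rust
/// Context score from the thresholds above: full marks at or above the
/// target, 70 at half the target, 30 below that.
pub fn context_score(context_window: u32, target: u32) -> f64 {
    if context_window >= target {
        100.0
    } else if context_window * 2 >= target {
        70.0
    } else {
        30.0
    }
}
```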
Weighted Composite
Use-case-specific weights, ordered [Quality, Speed, Fit, Context]:
- General: [0.45, 0.30, 0.15, 0.10]
- Coding: [0.50, 0.20, 0.15, 0.15]
- Reasoning: [0.55, 0.15, 0.15, 0.15]
- Chat: [0.40, 0.35, 0.15, 0.10]
- Multimodal: [0.50, 0.20, 0.15, 0.15]
- Embedding: [0.30, 0.40, 0.20, 0.10]
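The composite is then a dot product of the component scores with the weight vector. The weight table matches the docs; the function names and the string keys for use cases are assumptions:

```rust
/// Composite score: component scores weighted by the use-case vector,
/// ordered [Quality, Speed, Fit, Context].
pub fn composite_score(components: [f64; 4], weights: [f64; 4]) -> f64 {
    components.iter().zip(weights.iter()).map(|(c, w)| c * w).sum()
}

/// Weight table from the docs; string keys are illustrative.
pub fn weights_for(use_case: &str) -> [f64; 4] {
    match use_case {
        "coding" => [0.50, 0.20, 0.15, 0.15],
        "reasoning" => [0.55, 0.15, 0.15, 0.15],
        "chat" => [0.40, 0.35, 0.15, 0.10],
        "multimodal" => [0.50, 0.20, 0.15, 0.15],
        "embedding" => [0.30, 0.40, 0.20, 0.10],
        _ => [0.45, 0.30, 0.15, 0.10], // general
    }
}
```

For instance, a general-use model scoring [80, 60, 100, 100] yields 0.45·80 + 0.30·60 + 0.15·100 + 0.10·100 = 79.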
Speed Estimation
Token generation is memory-bandwidth-bound. The estimator uses one of two methods.
Bandwidth-Based (Preferred)
For recognized GPUs, the estimate derives from memory bandwidth, accounting for:
- Kernel launch overhead
- KV-cache reads
- Memory controller inefficiency

The result is scaled by a run-mode multiplier:
- Gpu: 1.0
- MoeOffload: 0.8
- CpuOffload: 0.5
- CpuOnly: 0.3
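The core intuition: each generated token streams the model weights through memory once, so throughput is roughly bandwidth divided by model size. A sketch under that assumption, with an illustrative efficiency factor folding in the overheads listed above:

```rust
/// Rough bandwidth-bound estimate: tok/s ~ bandwidth / model size,
/// reduced by an efficiency factor (kernel launches, KV-cache reads,
/// controller inefficiency) and the run-mode multiplier.
/// The formula shape and efficiency value are illustrative assumptions.
pub fn estimate_tps(bandwidth_gb_s: f64, model_size_gb: f64, efficiency: f64, run_mode_mult: f64) -> f64 {
    bandwidth_gb_s / model_size_gb * efficiency * run_mode_mult
}
```

For example, a 400 GB/s GPU running a 5 GB model at 70% efficiency estimates about 56 tok/s on the Gpu path, and half that under CpuOffload.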
Fallback (Unknown GPUs)
Fixed constants (tok/s) by backend:
- Metal + MLX: 250
- Metal + llama.cpp: 160
- CUDA: 220
- ROCm: 180
- Vulkan: 150
- SYCL: 100
- CPU ARM: 90
- CPU x86: 70
- Ascend: 390
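The fallback table is a straight lookup. The `Backend` enum name and variants here are assumptions; the constants match the list above:

```rust
/// Fallback tok/s baselines keyed by backend (enum name assumed).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Backend { MetalMlx, MetalLlamaCpp, Cuda, Rocm, Vulkan, Sycl, CpuArm, CpuX86, Ascend }

pub fn fallback_tps(backend: Backend) -> f64 {
    match backend {
        Backend::MetalMlx => 250.0,
        Backend::MetalLlamaCpp => 160.0,
        Backend::Cuda => 220.0,
        Backend::Rocm => 180.0,
        Backend::Vulkan => 150.0,
        Backend::Sycl => 100.0,
        Backend::CpuArm => 90.0,
        Backend::CpuX86 => 70.0,
        Backend::Ascend => 390.0,
    }
}
```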
Fit Analysis Notes
The `notes` field contains human-readable explanations:
- “GPU: model loaded into VRAM”
- “Unified memory: GPU and CPU share the same pool”
- “MoE: 2/8 experts active in VRAM (12.5 GB) at Q4_K_M”
- “Inactive experts offloaded to system RAM (18.3 GB)”
- “GPU: insufficient VRAM, spilling to system RAM”
- “Best quantization for hardware: Q4_K_M (model default: Q8_0)”
- “MLX runtime: ~35% faster than llama.cpp (85.2 vs 63.1 tok/s)”
- “Baseline estimated speed: 45.3 tok/s”
