Four Scoring Dimensions
Quality
What it measures:
- Parameter count (more params = higher quality)
- Model family reputation (Llama, Qwen, DeepSeek)
- Quantization quality penalty
- Task-specific alignment
Speed
What it measures:
- Estimated tokens per second
- Normalized against use-case target
- Based on memory bandwidth or backend constants
Fit
What it measures:
- Memory utilization efficiency
- Sweet spot: 50-80% of available memory
- Penalties for under-utilization or tightness
Context
What it measures:
- Context window capability
- Compared to use-case target
- Bonus for exceeding requirements
1. Quality Score
The quality score combines multiple factors to estimate model capability.
Base Quality by Parameter Count
Family Reputation Bumps
Certain model families receive quality bonuses based on benchmark performance:

| Family | Bump | Reasoning |
|---|---|---|
| DeepSeek | +3.0 | Strong reasoning, code generation |
| Qwen | +2.0 | Multilingual, high benchmark scores |
| Llama | +2.0 | Well-tested, open community |
| Mistral/Mixtral | +1.0 | Efficient MoE architectures |
| Gemma | +1.0 | Google research quality |
| StarCoder | +1.0 | Specialized code models |
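The family bumps in the table above can be sketched as a simple lookup. The bump values come from the table; the matching logic (case-insensitive substring search on the model name) and the function name are assumptions for illustration.

```python
# Hypothetical sketch of the family-reputation bump. Bump values are from
# the table above; substring matching on the model name is an assumption.
FAMILY_BUMPS = [
    ("deepseek", 3.0),
    ("qwen", 2.0),
    ("llama", 2.0),
    ("mixtral", 1.0),
    ("mistral", 1.0),
    ("gemma", 1.0),
    ("starcoder", 1.0),
]

def family_bump(model_name: str) -> float:
    """Return the quality bump for the first matching family, else 0."""
    name = model_name.lower()
    for family, bump in FAMILY_BUMPS:
        if family in name:
            return bump
    return 0.0
```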
Quantization Penalty
Lower quantization reduces quality.
Q4_K_M is the sweet spot: minimal quality loss with 4× memory savings compared to F16.
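As a data structure, the penalty is a per-quant lookup. Only the Q4_K_M value (-5.0) appears in this section; treating F16 as the zero-penalty baseline is an assumption, and other quant levels are omitted rather than guessed.

```python
# Quantization quality penalties. Only the Q4_K_M value (-5.0) is given in
# this section; F16 as the zero-penalty baseline is an assumption, and
# other quant levels are omitted rather than invented.
QUANT_PENALTY = {
    "F16": 0.0,      # assumed baseline: no quantization loss
    "Q4_K_M": -5.0,  # stated in this section
}

def quant_penalty(quant: str) -> float:
    """Return the quality penalty for a quant level (0.0 if unknown)."""
    return QUANT_PENALTY.get(quant, 0.0)
```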
Task Alignment Bonus
Coding (+6.0):
- Models with “code”, “starcoder”, or “wizard” in the name
- Examples: CodeLlama, Qwen2.5-Coder, StarCoder2

Reasoning:
- Models with ≥13B params
- Chain-of-thought architectures

Multimodal:
- Models with “vision” in the name or use case
- Examples: Llama 3.2 Vision, Qwen2.5-VL
Final Quality Formula
Example: Qwen2.5-Coder-14B-Instruct at Q4_K_M
- Base (10-20B): 82.0
- Family (Qwen): +2.0
- Quantization (Q4_K_M): -5.0
- Task (coding): +6.0
- Total: 85.0
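The breakdown above implies a simple additive formula. The values below (base 82.0 for 10-20B, Qwen +2.0, Q4_K_M -5.0, coding +6.0) are from the text; the function name and the clamp to the 0-100 scale are assumptions.

```python
# Sketch of the final quality formula implied by the breakdown above:
# quality = base(param_count) + family_bump + quant_penalty + task_bonus.
# Clamping to the 0-100 scale is an assumption.

def quality_score(base: float, family_bump: float,
                  quant_penalty: float, task_bonus: float) -> float:
    """Combine the four quality factors and clamp to 0-100."""
    return max(0.0, min(100.0, base + family_bump + quant_penalty + task_bonus))

# Qwen2.5-Coder-14B-Instruct at Q4_K_M, scored for coding:
score = quality_score(base=82.0, family_bump=2.0,
                      quant_penalty=-5.0, task_bonus=6.0)
# 82.0 + 2.0 - 5.0 + 6.0 = 85.0
```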
2. Speed Score
Speed estimation uses a physics-based approach when GPU bandwidth is known, falling back to per-backend constants otherwise.
Bandwidth-Based Estimation (Preferred)
Token generation is memory-bandwidth-bound: each token requires reading the full model weights once. Real-world throughput falls short of this ideal due to:
- Kernel launch overhead
- KV-cache memory reads
- Memory controller saturation
Examples:
- NVIDIA: RTX 4090 (1008 GB/s)
- Apple Silicon
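The bandwidth calculation can be sketched as follows. The ideal tokens/sec equals bandwidth divided by model size; the 0.6 efficiency factor (covering the overheads listed above) is a hypothetical value, not taken from the text.

```python
# Bandwidth-bound speed estimate: each generated token reads the full model
# weights once, so ideal t/s = bandwidth / model size. The 0.6 efficiency
# factor (kernel overhead, KV-cache reads, controller saturation) is a
# hypothetical assumption.

def estimate_tps(bandwidth_gb_s: float, model_gb: float,
                 efficiency: float = 0.6) -> float:
    """Estimate generation speed in tokens per second."""
    return bandwidth_gb_s / model_gb * efficiency

# RTX 4090 (1008 GB/s) running an 8.2 GB Q4_K_M model:
tps = estimate_tps(1008, 8.2)  # ~123 t/s ideal, ~74 t/s after efficiency
```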
Fallback: Backend Constants
For unrecognized GPUs or synthetic entries from the --memory override:
Run Mode Penalties
Quantization Speed Multiplier
Lower quantization = less data to read = faster inference.
Speed Score Normalization
Speed is normalized against a use-case-specific target. A model hitting the target TPS gets a perfect 100; exceeding the target is capped at 100 (no extra credit).
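The normalization described above can be sketched in one line. The cap at 100 is from the text; linear scaling below the target is an assumption.

```python
# Speed normalization: hitting the use-case target TPS scores 100, and
# exceeding it is capped at 100 (no extra credit). Linear scaling below
# the target is an assumption.

def speed_score(estimated_tps: float, target_tps: float) -> float:
    return min(100.0, estimated_tps / target_tps * 100.0)
```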
3. Fit Score
Measures how efficiently the model uses available memory. The sweet spot is 50-80% utilization:

| Utilization | Score | Interpretation |
|---|---|---|
| 20% | 76 | Under-utilizing hardware; room for larger/better models |
| 65% | 100 | Sweet spot; efficient use with comfortable headroom |
| 92% | 50 | Very tight; risk of OOM errors or swapping |
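One piecewise curve consistent with the three illustrative points (20% → 76, 65% → 100, 92% → 50) looks like this; the slopes are assumptions chosen only to reproduce those points, not the tool's actual curve.

```python
# Sketch of the fit score: 100 inside the 50-80% sweet spot, with linear
# penalties on either side. Slopes are assumptions fitted to the three
# illustrative points; the real curve may differ.

def fit_score(utilization: float) -> float:
    if utilization < 0.50:       # under-utilizing: room for a larger model
        return 100.0 - (0.50 - utilization) * 80.0
    if utilization <= 0.80:      # sweet spot: comfortable headroom
        return 100.0
    # tight: rising risk of OOM or swapping
    return max(0.0, 100.0 - (utilization - 0.80) * (50.0 / 0.12))
```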
4. Context Score
Compares the model’s context window to use-case expectations.
Example: A coding model with 4K context vs an 8K target gets a score of 70: usable, but not ideal for large codebases.
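The text gives one data point (4K vs an 8K target scores 70) plus a bonus for exceeding the target. A square-root ramp below the target is consistent with that point (√0.5 ≈ 0.707), but the curve, the function name, and the handling of the over-provisioning bonus are all assumptions.

```python
import math

# Sketch of the context score. Only the 4K-vs-8K -> 70 data point is from
# the text; the square-root ramp is an assumption consistent with it. The
# stated bonus for exceeding the target is omitted here rather than guessed.

def context_score(ctx: int, target: int) -> float:
    if ctx >= target:
        return 100.0  # text also grants a bonus for exceeding; omitted
    return 100.0 * math.sqrt(ctx / target)
```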
Weighted Composite Score
The final score is a weighted combination based on use case priorities:

General (Balanced)
Weights: Quality 45%, Speed 30%, Fit 15%, Context 10%
Balanced scoring for everyday use. Prioritizes model quality with decent speed.

Coding
Weights: Quality 50%, Speed 20%, Fit 15%, Context 15%
Higher context weight (15% vs 10%) for large codebases. Quality is paramount for correct suggestions.

Reasoning
Weights: Quality 55%, Speed 15%, Fit 15%, Context 15%
Quality dominates (55%). Chain-of-thought models are naturally slower, so speed weight is reduced.

Chat
Weights: Quality 40%, Speed 35%, Fit 15%, Context 10%
Speed is critical (35%) for interactive conversations. Quality matters less than responsiveness.

Multimodal
Weights: Quality 50%, Speed 20%, Fit 15%, Context 15%
Similar to coding: high quality for accurate vision understanding, reasonable context for multi-turn interactions.

Embedding
Weights: Quality 30%, Speed 40%, Fit 20%, Context 10%
Speed is king (40%) for batch processing. Fit matters more (20%) since embeddings process many documents.
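The weights above can be expressed as data plus a one-line weighted sum. The dict layout and function name are illustrative, not the tool's API; the weight values themselves are from the text (each row sums to 100%).

```python
# Per-use-case weights from this section, plus the composite combination.
# Layout and names are illustrative, not the tool's actual API.
WEIGHTS = {
    "general":    {"quality": 0.45, "speed": 0.30, "fit": 0.15, "context": 0.10},
    "coding":     {"quality": 0.50, "speed": 0.20, "fit": 0.15, "context": 0.15},
    "reasoning":  {"quality": 0.55, "speed": 0.15, "fit": 0.15, "context": 0.15},
    "chat":       {"quality": 0.40, "speed": 0.35, "fit": 0.15, "context": 0.10},
    "multimodal": {"quality": 0.50, "speed": 0.20, "fit": 0.15, "context": 0.15},
    "embedding":  {"quality": 0.30, "speed": 0.40, "fit": 0.20, "context": 0.10},
}

def composite(scores: dict, use_case: str) -> float:
    """Weighted sum of the four sub-scores for the given use case."""
    w = WEIGHTS[use_case]
    return sum(w[k] * scores[k] for k in w)
```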
Complete Scoring Example
Let’s score Qwen2.5-Coder-14B-Instruct on an RTX 3090 (24 GB VRAM, 936 GB/s bandwidth) for coding:
Step 1: Path Selection
- VRAM available: 24 GB
- Model at Q4_K_M: ~8.2 GB
- Run mode: GPU (fits comfortably)
Step 2: Individual Scores
Quality (0-100): 85.0, as computed in the quality example above.
Step 3: Weighted Composite
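The weighted composite arithmetic for the coding use case looks like this. The quality score (85.0) comes from the quality example earlier in this section; the speed, fit, and context sub-scores below are hypothetical placeholders, since Step 2's numbers are not reproduced here.

```python
# Worked composite for the coding use case (weights 50/20/15/15).
# quality = 85.0 is from the earlier quality example; speed, fit, and
# context are HYPOTHETICAL placeholder values for illustration.
quality, speed, fit, context = 85.0, 80.0, 100.0, 85.0

composite = (0.50 * quality     # 42.50
             + 0.20 * speed     # 16.00
             + 0.15 * fit       # 15.00
             + 0.15 * context)  # 12.75
# total: 86.25
```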
Score Interpretation
| Score Range | Interpretation |
|---|---|
| 90-100 | Excellent fit for use case and hardware |
| 80-89 | Very good, minor compromises |
| 70-79 | Good, some trade-offs |
| 60-69 | Acceptable, noticeable limitations |
| 50-59 | Marginal, significant compromises |
| < 50 | Poor fit, consider alternatives |
