- Hardware detection
- Model database loading
- Dynamic quantization selection
- Multi-dimensional scoring
- Fit analysis and ranking
## Architecture Overview
The codebase is organized into specialized modules, and all data flows through them in sequence: hardware detection → model loading → fit analysis → scoring → ranking.
## 1. Hardware Detection

The `SystemSpecs::detect()` function in `hardware.rs` probes your system using multiple detection methods:
### RAM and CPU

Uses the `sysinfo` crate to read:
- Total and available system RAM
- CPU core count and model name
- Available RAM, with fallback strategies for platforms where `sysinfo` reports 0
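The fallback behavior can be sketched like this (a minimal illustration; the `available_ram` helper and the 75% factor are assumptions, not the actual `hardware.rs` logic):

```rust
/// Returns a usable "available RAM" figure in bytes.
/// `reported` is what the probe returned (0 on platforms where
/// `sysinfo` cannot read available memory).
fn available_ram(total: u64, reported: u64) -> u64 {
    if reported > 0 && reported <= total {
        // Normal platform: trust the probe.
        reported
    } else {
        // Assumed fallback: treat ~75% of total RAM as available
        // rather than reporting the machine as full.
        total * 3 / 4
    }
}

fn main() {
    let total = 32 * 1024 * 1024 * 1024u64; // 32 GB
    assert_eq!(available_ram(total, 20_000_000_000), 20_000_000_000);
    assert_eq!(available_ram(total, 0), total * 3 / 4);
    println!("fallback available RAM: {} bytes", available_ram(total, 0));
}
```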
### Multi-GPU Detection

llmfit detects GPUs across all major vendors using vendor-specific tools and sysfs.

#### NVIDIA GPUs
**Primary method:** `nvidia-smi` with multi-GPU aggregation. Groups same-model GPUs and tracks per-card VRAM. For 2x RTX 3090, reports 24 GB per card (48 GB total VRAM for tensor splitting).

**Fallback:** Linux sysfs at `/sys/class/drm/card*/device/` for containerized environments where `nvidia-smi` is unavailable.

**Unified memory detection:** On NVIDIA Tegra and Grace Blackwell (GB10, GB20), detects unified CPU+GPU memory via the `addressing_mode` field or model name heuristics.
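The same-model grouping can be sketched as follows (the `GpuGroup` struct and `group_gpus` function are illustrative, not llmfit's actual types):

```rust
use std::collections::BTreeMap;

/// Aggregated view of identical cards, as used for tensor splitting.
#[derive(Debug, PartialEq)]
struct GpuGroup {
    count: usize,
    vram_per_card_gb: u64,
    total_vram_gb: u64,
}

/// Group detected cards by model name, tracking per-card and total VRAM.
fn group_gpus(cards: &[(&str, u64)]) -> BTreeMap<String, GpuGroup> {
    let mut groups: BTreeMap<String, GpuGroup> = BTreeMap::new();
    for &(model, vram_gb) in cards {
        let g = groups.entry(model.to_string()).or_insert(GpuGroup {
            count: 0,
            vram_per_card_gb: vram_gb,
            total_vram_gb: 0,
        });
        g.count += 1;
        g.total_vram_gb += vram_gb;
    }
    groups
}

fn main() {
    // 2x RTX 3090: 24 GB per card, 48 GB total for tensor splitting.
    let cards: [(&str, u64); 2] = [("RTX 3090", 24), ("RTX 3090", 24)];
    let groups = group_gpus(&cards);
    let g = &groups["RTX 3090"];
    assert_eq!((g.count, g.vram_per_card_gb, g.total_vram_gb), (2, 24, 48));
    println!("{:?}", groups);
}
```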
#### AMD GPUs

**Primary method:** `rocm-smi` for ROCm-enabled systems.

**Fallback:** Linux sysfs vendor ID check (0x1002) at `/sys/class/drm/`.

**APU support:** Detects AMD Ryzen AI unified-memory APUs (Strix Halo, Strix Point) by CPU name pattern matching and assigns full system RAM as shared VRAM.
#### Intel Arc

Uses sysfs to read discrete VRAM from `/sys/class/drm/card*/device/mem_info_vram_total` for Arc GPUs (A370M, A770). Integrated Intel Arc GPUs are detected via `lspci`, with the SYCL backend for oneAPI inference.
#### Apple Silicon

Detects via `system_profiler SPDisplaysDataType`. On unified memory systems (M1/M2/M3/M4), VRAM = total system RAM, since the GPU and CPU share the same memory pool.
#### Ascend NPUs

Detects Huawei Ascend NPUs via `npu-smi`.

### Backend Identification

llmfit automatically determines the inference acceleration backend:

| Backend | Hardware | Speed Constant |
|---|---|---|
| CUDA | NVIDIA GPUs | 220 |
| Metal | Apple Silicon | 160 (llama.cpp) / 250 (MLX) |
| ROCm | AMD GPUs with ROCm | 180 |
| Vulkan | AMD GPUs without ROCm, Windows AMD | 150 |
| SYCL | Intel Arc / oneAPI | 100 |
| CPU (ARM) | ARM processors | 90 |
| CPU (x86) | Intel/AMD CPUs | 70 |
| NPU (Ascend) | Huawei Ascend NPUs | 390 |
On Apple Silicon with unified memory, llmfit prefers the MLX runtime for native optimization. Otherwise it defaults to llama.cpp.
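The backend-to-constant mapping and the MLX preference can be sketched like this (the enum variants and function names are illustrative; the constants come from the table above):

```rust
/// Inference backends llmfit distinguishes between.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Backend { Cuda, Metal, Mlx, Rocm, Vulkan, Sycl, CpuArm, CpuX86, AscendNpu }

/// Speed constants from the backend table.
fn speed_constant(b: Backend) -> u32 {
    match b {
        Backend::Cuda => 220,
        Backend::Metal => 160,  // llama.cpp on Apple Silicon
        Backend::Mlx => 250,    // MLX on Apple Silicon
        Backend::Rocm => 180,
        Backend::Vulkan => 150,
        Backend::Sycl => 100,
        Backend::CpuArm => 90,
        Backend::CpuX86 => 70,
        Backend::AscendNpu => 390,
    }
}

/// On Apple Silicon with unified memory, prefer the MLX runtime;
/// otherwise default to llama.cpp's Metal backend.
fn apple_backend(unified_memory: bool) -> Backend {
    if unified_memory { Backend::Mlx } else { Backend::Metal }
}

fn main() {
    assert_eq!(apple_backend(true), Backend::Mlx);
    assert_eq!(speed_constant(Backend::Mlx), 250);
    assert_eq!(speed_constant(Backend::AscendNpu), 390);
    println!("M-series backend: {:?}", apple_backend(true));
}
```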
## 2. Model Database Loading

The `ModelDatabase::new()` function in `models.rs` loads the model list from `data/hf_models.json`, which is embedded at compile time via `include_str!()`.
No runtime file I/O. The database is baked into the binary, so llmfit works offline and has no dependency on external data files.
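The embedding pattern looks roughly like this (a const literal stands in for the `include_str!()` data so the snippet is self-contained, and the crude count-based "parser" is purely illustrative; the real loader would deserialize the JSON properly):

```rust
// In llmfit this would be:
//   const HF_MODELS_JSON: &str = include_str!("../data/hf_models.json");
// A literal stands in here so the example compiles on its own.
const HF_MODELS_JSON: &str = r#"[
  {"name": "llama-7b"},
  {"name": "mixtral-8x7b"}
]"#;

/// Crude entry count: number of `"name"` keys in the embedded JSON.
/// (Illustrative only; kept std-only instead of pulling in serde.)
fn model_count(json: &str) -> usize {
    json.matches("\"name\"").count()
}

fn main() {
    // The data is part of the binary: no runtime file I/O, works offline.
    assert_eq!(model_count(HF_MODELS_JSON), 2);
    println!("embedded models: {}", model_count(HF_MODELS_JSON));
}
```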
## 3. Scoring and Ranking Flow

For each model, `ModelFit::analyze()` in `fit.rs` evaluates fitness across four dimensions: quality, speed, hardware fit, and context.
### Path Selection Logic

- If GPU available:
  - Try full GPU fit with dynamic quantization
  - For MoE models: try expert offloading (active experts in VRAM, inactive in RAM)
  - Fall back to CPU+GPU offload if VRAM insufficient
  - Last resort: CPU-only
- If unified memory (Apple Silicon, NVIDIA Grace):
  - GPU and CPU share the same pool → no separate CPU+GPU path
  - MoE models are still noted but don’t need offloading
- If no GPU:
  - CPU-only path with dynamic quantization in system RAM
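The branching above can be sketched as a decision function (all names are illustrative and sizes are in GB at the selected quantization; this is a simplified sketch, not llmfit's actual fit logic):

```rust
#[derive(Debug, PartialEq)]
enum RunMode { Gpu, MoeOffload, CpuGpu, CpuOnly }

/// Pick a run mode given model size, optional MoE active-expert size,
/// and the machine's memory layout.
fn select_path(model_gb: f64, active_expert_gb: Option<f64>,
               vram_gb: f64, ram_gb: f64, unified: bool) -> RunMode {
    if unified {
        // Unified memory: one pool, so no separate CPU+GPU split path.
        return if model_gb <= vram_gb { RunMode::Gpu } else { RunMode::CpuOnly };
    }
    if vram_gb > 0.0 {
        if model_gb <= vram_gb {
            return RunMode::Gpu; // full GPU fit
        }
        // MoE: active experts in VRAM, inactive experts in RAM.
        if let Some(active) = active_expert_gb {
            if active <= vram_gb && model_gb - active <= ram_gb {
                return RunMode::MoeOffload;
            }
        }
        if model_gb <= vram_gb + ram_gb {
            return RunMode::CpuGpu; // partial offload
        }
    }
    RunMode::CpuOnly // last resort
}

fn main() {
    // Dense 13B at Q4 (~7.4 GB) fits on a 24 GB card.
    assert_eq!(select_path(7.4, None, 24.0, 32.0, false), RunMode::Gpu);
    // Mixtral-like MoE: 23.9 GB total, ~6.6 GB active, on a 12 GB card.
    assert_eq!(select_path(23.9, Some(6.6), 12.0, 32.0, false), RunMode::MoeOffload);
    // No GPU at all.
    assert_eq!(select_path(7.4, None, 0.0, 32.0, false), RunMode::CpuOnly);
}
```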
## 4. Run Modes

Each model is assigned one of four run modes based on hardware fit:

### GPU

Optimal performance. The model fits entirely in VRAM, giving fast inference with full GPU acceleration.

**Requirements:** VRAM ≥ model size at selected quantization
### MoE Offload

MoE optimization. Active experts live in VRAM, inactive experts are offloaded to RAM: ~80% of GPU speed with reduced VRAM requirements.

**Example:** Mixtral 8x7B needs ~6.6 GB VRAM (active) + ~20 GB RAM (inactive) instead of 23.9 GB VRAM
### CPU+GPU

Mixed performance. The model doesn’t fit in VRAM and spills to system RAM with partial GPU offload. Significantly slower than pure GPU.

**Speed penalty:** 0.5× GPU baseline
### CPU Only

Fallback mode. No GPU or insufficient VRAM; the model runs entirely in system RAM.

**Speed penalty:** 0.3× GPU baseline

**Fit cap:** Always Marginal (never Perfect/Good)
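The penalties above amount to a per-mode multiplier on the GPU baseline, roughly (the function name and string keys are illustrative):

```rust
/// Relative throughput for each run mode, applied to the pure-GPU estimate.
fn speed_multiplier(mode: &str) -> f64 {
    match mode {
        "gpu" => 1.0,
        "moe_offload" => 0.8, // ~80% of GPU speed
        "cpu_gpu" => 0.5,     // 0.5x GPU baseline
        "cpu_only" => 0.3,    // 0.3x GPU baseline
        _ => 0.0,
    }
}

fn main() {
    let gpu_tok_s = 60.0; // hypothetical pure-GPU estimate
    let cpu_gpu = gpu_tok_s * speed_multiplier("cpu_gpu");
    assert!((cpu_gpu - 30.0).abs() < 1e-9);
    assert!((gpu_tok_s * speed_multiplier("cpu_only") - 18.0).abs() < 1e-9);
    println!("CPU+GPU estimate: {cpu_gpu} tok/s");
}
```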
## 5. Multi-Dimensional Scoring

Each model receives a composite score (0-100) weighted by use case:

| Use Case | Quality | Speed | Fit | Context |
|---|---|---|---|---|
| General | 45% | 30% | 15% | 10% |
| Coding | 50% | 20% | 15% | 15% |
| Reasoning | 55% | 15% | 15% | 15% |
| Chat | 40% | 35% | 15% | 10% |
| Multimodal | 50% | 20% | 15% | 15% |
| Embedding | 30% | 40% | 20% | 10% |
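The composite score is a weighted sum of the four dimension scores; a sketch using the Coding row from the table (the `Weights` struct and `composite` function are illustrative names):

```rust
/// Per-use-case weights; each row of the table sums to 1.0.
struct Weights { quality: f64, speed: f64, fit: f64, context: f64 }

const CODING: Weights = Weights { quality: 0.50, speed: 0.20, fit: 0.15, context: 0.15 };

/// Weighted sum of four 0-100 dimension scores; stays in 0-100
/// because the weights sum to 1.0.
fn composite(w: &Weights, quality: f64, speed: f64, fit: f64, context: f64) -> f64 {
    w.quality * quality + w.speed * speed + w.fit * fit + w.context * context
}

fn main() {
    assert!((CODING.quality + CODING.speed + CODING.fit + CODING.context - 1.0).abs() < 1e-9);
    let score = composite(&CODING, 90.0, 60.0, 100.0, 80.0);
    // 0.50*90 + 0.20*60 + 0.15*100 + 0.15*80 = 45 + 12 + 15 + 12 = 84
    assert!((score - 84.0).abs() < 1e-9);
    println!("coding score: {score}");
}
```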
### Performance Estimation

llmfit uses physics-based speed estimation when the GPU model is recognized:

- RTX 4090 (1008 GB/s): 7B Q4 → ~61 tok/s (measured: ~60 tok/s)
- T4 (320 GB/s): 7B F16 → ~16 tok/s (llama.cpp Discussion #4225)
- M1 Max (400 GB/s): 7B Q4 → ~61 tok/s (llama.cpp Discussion #4167)
llmfit’s database contains bandwidth specs for ~80 GPUs across NVIDIA (consumer + datacenter), AMD (RDNA + CDNA), and Apple Silicon families.
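The physics behind the estimate: decoding is memory-bound, so each generated token streams the full weight set once and tok/s ≈ efficiency × bandwidth / model bytes. A sketch (the 0.7 efficiency factor is an assumed illustration, not llmfit's calibrated constant):

```rust
/// Memory-bound token rate estimate.
/// bandwidth in GB/s, model size in GB, efficiency in (0, 1].
fn estimate_tok_s(bandwidth_gb_s: f64, model_size_gb: f64, efficiency: f64) -> f64 {
    efficiency * bandwidth_gb_s / model_size_gb
}

fn main() {
    // T4 at 320 GB/s running a 7B model in F16 (~14 GB of weights):
    // 0.7 * 320 / 14 = 16 tok/s, matching the ~16 tok/s figure above.
    let t4 = estimate_tok_s(320.0, 14.0, 0.7);
    assert!((t4 - 16.0).abs() < 0.01);
    println!("T4, 7B F16: ~{t4:.0} tok/s");
}
```

Quantization enters the formula through the model size: dropping from F16 (~14 GB for 7B) to Q4 (~4 GB) raises the estimate proportionally.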
