Fit Analysis API
The `fit` module analyzes how well models run on specific hardware, computing fit levels, run modes, and multi-dimensional scores.
Core Types
ModelFit
Complete analysis of a model’s fit on specific hardware:
- `model`: The analyzed model
- `fit_level`: Memory fit quality (Perfect, Good, Marginal, TooTight)
- `run_mode`: Execution path (Gpu, CpuOffload, CpuOnly, MoeOffload)
- `memory_required_gb`: Memory needed for this run mode
- `memory_available_gb`: Memory pool capacity
- `utilization_pct`: Memory utilization percentage
- `notes`: Human-readable analysis notes
- `score`: Composite score (0-100)
- `score_components`: Individual dimension scores
- `estimated_tps`: Estimated tokens per second
- `best_quant`: Optimal quantization for this hardware
- `runtime`: Inference runtime (MLX or llama.cpp)
- `installed`: Whether the model is locally installed
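The struct can be sketched in Rust as follows. Field names come from the list above; the concrete types (a `String` for the model, `[f64; 4]` for the components, and the enum definitions) are illustrative assumptions, not the actual definitions:

```rust
/// Sketch of ModelFit; field names follow the docs, types are assumed.
#[derive(Debug, Clone)]
pub struct ModelFit {
    pub model: String,              // assumed String; the real field holds the model struct
    pub fit_level: FitLevel,
    pub run_mode: RunMode,
    pub memory_required_gb: f64,
    pub memory_available_gb: f64,
    pub utilization_pct: f64,
    pub notes: Vec<String>,
    pub score: f64,                 // composite, 0-100
    pub score_components: [f64; 4], // [quality, speed, fit, context] (layout assumed)
    pub estimated_tps: f64,
    pub best_quant: String,
    pub runtime: InferenceRuntime,
    pub installed: bool,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FitLevel { Perfect, Good, Marginal, TooTight }

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RunMode { Gpu, MoeOffload, CpuOffload, CpuOnly }

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum InferenceRuntime { Mlx, LlamaCpp }
```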
FitLevel
Memory fit quality:
- Perfect: GPU path with recommended memory available
- Good: Adequate headroom (1.2x minimum or more)
- Marginal: Meets minimum but tight
- TooTight: Insufficient memory
RunMode
Execution path:
- Gpu: Full GPU inference (fastest)
- MoeOffload: Expert switching with some latency
- CpuOffload: Significant slowdown from memory transfers
- CpuOnly: Slowest, but works without GPU
InferenceRuntime
Software framework used for inference: MLX or llama.cpp.
ScoreComponents
Multi-dimensional scoring:
- Quality: Model capability (params + family + quant + task alignment)
- Speed: Estimated tok/s vs. target for use case
- Fit: Memory efficiency (50-80% utilization = optimal)
- Context: Context window vs. use case needs
SortColumn
Sort column options for ranking models, used by rank_models_by_fit_opts_col().
Functions
ModelFit::analyze()
Analyzes model fit on hardware:
- `model`: Model to analyze
- `system`: System hardware specs
ModelFit::analyze_with_context_limit()
Analyzes with custom context limit:
- `model`: Model to analyze
- `system`: System hardware specs
- `context_limit`: Maximum context length to assume
Helper Methods
Ranking Functions
rank_models_by_fit()
Ranks models by composite score.
rank_models_by_fit_opts()
Ranks with an installed-first option.
rank_models_by_fit_opts_col()
Ranks with a custom sort column.
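The ranking behavior can be sketched as a descending sort on the composite score, with installed models optionally promoted. The function name mirrors the docs, but the body and the simplified `(score, installed)` representation are illustrative assumptions:

```rust
/// Sketch of composite-score ranking with an installed-first option.
/// `(f64, bool)` pairs stand in for full ModelFit entries: (score, installed).
pub fn rank_models_by_fit_opts(models: &mut Vec<(f64, bool)>, installed_first: bool) {
    models.sort_by(|a, b| {
        if installed_first && a.1 != b.1 {
            // Installed models (true) sort ahead of uninstalled ones.
            return b.1.cmp(&a.1);
        }
        // Otherwise: highest composite score first.
        b.0.partial_cmp(&a.0).unwrap_or(std::cmp::Ordering::Equal)
    });
}
```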
Scoring Algorithm
The composite score combines four dimensions with use-case-specific weights.
Quality Score (0-100)
Based on:
- Parameter count tier (1B = 45, 7B = 75, 70B+ = 95)
- Family reputation (Qwen +2, DeepSeek +3, Llama +2)
- Quantization penalty (Q8 = 0, Q4 = -5, Q2 = -12)
- Task alignment bonus (coding models on coding tasks +6)
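Using the constants above, the quality score reduces to a sum of the four terms. The function shape and the clamp to 0-100 are assumptions; the constants match the docs:

```rust
/// Illustrative quality score: parameter tier plus family, quantization,
/// and task-alignment adjustments, clamped to the 0-100 range (assumed).
pub fn quality_score(param_tier: f64, family_bonus: f64, quant_penalty: f64, task_bonus: f64) -> f64 {
    (param_tier + family_bonus + quant_penalty + task_bonus).clamp(0.0, 100.0)
}
```

For example, a 7B Qwen coding model at Q4 on a coding task would score 75 + 2 - 5 + 6 = 78.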
Speed Score (0-100)
Normalized against a use-case target:
- General/Coding/Chat: 40 tok/s target
- Reasoning: 25 tok/s target
- Embedding: 200 tok/s target
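A minimal sketch of the normalization, assuming linear scaling capped at 100 (the docs do not specify the curve):

```rust
/// Speed score: estimated tok/s relative to the use-case target,
/// scaled linearly and capped at 100. Linear scaling is an assumption.
pub fn speed_score(estimated_tps: f64, target_tps: f64) -> f64 {
    (estimated_tps / target_tps * 100.0).min(100.0)
}
```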
Fit Score (0-100)
Memory efficiency:
- 50-80% utilization: 100 points (sweet spot)
- 30-50% utilization: 60-100 points (scaled)
- 80-90% utilization: 70 points (getting tight)
- 90-100% utilization: 50 points (very tight)
- Over 100% utilization: 0 points (doesn’t fit)
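The bands above form a piecewise function. This sketch assumes linear interpolation in the 30-50% band and a 60-point floor below 30% utilization, which the docs do not specify:

```rust
/// Piecewise fit score from the utilization bands.
/// The behavior below 30% utilization is an assumption.
pub fn fit_score(utilization_pct: f64) -> f64 {
    match utilization_pct {
        u if u > 100.0 => 0.0,                     // doesn't fit
        u if u >= 90.0 => 50.0,                    // very tight
        u if u >= 80.0 => 70.0,                    // getting tight
        u if u >= 50.0 => 100.0,                   // sweet spot
        u if u >= 30.0 => 60.0 + (u - 30.0) * 2.0, // scales 60 -> 100
        _ => 60.0,                                 // assumed floor for very low utilization
    }
}
```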
Context Score (0-100)
Context window vs. use case needs:
- Meets or exceeds target: 100 points
- At least half of target: 70 points
- Below half: 30 points
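The three thresholds translate directly into a step function:

```rust
/// Context score from the thresholds above: full marks at or above the
/// target, 70 at half the target, 30 below that.
pub fn context_score(context_window: u32, target: u32) -> f64 {
    if context_window >= target {
        100.0
    } else if context_window * 2 >= target {
        70.0
    } else {
        30.0
    }
}
```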
Weighted Composite
Use-case-specific weights, ordered [Quality, Speed, Fit, Context]:
- General: [0.45, 0.30, 0.15, 0.10]
- Coding: [0.50, 0.20, 0.15, 0.15]
- Reasoning: [0.55, 0.15, 0.15, 0.15]
- Chat: [0.40, 0.35, 0.15, 0.10]
- Multimodal: [0.50, 0.20, 0.15, 0.15]
- Embedding: [0.30, 0.40, 0.20, 0.10]
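The composite is then a dot product of the component scores with the weight vector. The weight table matches the docs; the function names and the string keys for use cases are assumptions:

```rust
/// Composite score: component scores weighted by the use-case vector,
/// ordered [Quality, Speed, Fit, Context].
pub fn composite_score(components: [f64; 4], weights: [f64; 4]) -> f64 {
    components.iter().zip(weights.iter()).map(|(c, w)| c * w).sum()
}

/// Weight table from the docs; string keys are illustrative.
pub fn weights_for(use_case: &str) -> [f64; 4] {
    match use_case {
        "coding" => [0.50, 0.20, 0.15, 0.15],
        "reasoning" => [0.55, 0.15, 0.15, 0.15],
        "chat" => [0.40, 0.35, 0.15, 0.10],
        "multimodal" => [0.50, 0.20, 0.15, 0.15],
        "embedding" => [0.30, 0.40, 0.20, 0.10],
        _ => [0.45, 0.30, 0.15, 0.10], // general
    }
}
```

For instance, a general-use model scoring [80, 60, 100, 100] yields 0.45·80 + 0.30·60 + 0.15·100 + 0.10·100 = 79.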
Speed Estimation
Token generation is memory-bandwidth-bound. The estimator uses one of two methods.
Bandwidth-Based (Preferred)
For recognized GPUs, the estimate derives from memory bandwidth, accounting for:
- Kernel launch overhead
- KV-cache reads
- Memory controller inefficiency

The result is scaled by a run-mode multiplier:
- Gpu: 1.0
- MoeOffload: 0.8
- CpuOffload: 0.5
- CpuOnly: 0.3
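The core intuition: each generated token streams the model weights through memory once, so throughput is roughly bandwidth divided by model size. A sketch under that assumption, with an illustrative efficiency factor folding in the overheads listed above:

```rust
/// Rough bandwidth-bound estimate: tok/s ~ bandwidth / model size,
/// reduced by an efficiency factor (kernel launches, KV-cache reads,
/// controller inefficiency) and the run-mode multiplier.
/// The formula shape and efficiency value are illustrative assumptions.
pub fn estimate_tps(bandwidth_gb_s: f64, model_size_gb: f64, efficiency: f64, run_mode_mult: f64) -> f64 {
    bandwidth_gb_s / model_size_gb * efficiency * run_mode_mult
}
```

For example, a 400 GB/s GPU running a 5 GB model at 70% efficiency estimates about 56 tok/s on the Gpu path, and half that under CpuOffload.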
Fallback (Unknown GPUs)
Fixed constants (tok/s) by backend:
- Metal + MLX: 250
- Metal + llama.cpp: 160
- CUDA: 220
- ROCm: 180
- Vulkan: 150
- SYCL: 100
- CPU ARM: 90
- CPU x86: 70
- Ascend: 390
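The fallback table is a straight lookup. The `Backend` enum name and variants here are assumptions; the constants match the list above:

```rust
/// Fallback tok/s baselines keyed by backend (enum name assumed).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Backend { MetalMlx, MetalLlamaCpp, Cuda, Rocm, Vulkan, Sycl, CpuArm, CpuX86, Ascend }

pub fn fallback_tps(backend: Backend) -> f64 {
    match backend {
        Backend::MetalMlx => 250.0,
        Backend::MetalLlamaCpp => 160.0,
        Backend::Cuda => 220.0,
        Backend::Rocm => 180.0,
        Backend::Vulkan => 150.0,
        Backend::Sycl => 100.0,
        Backend::CpuArm => 90.0,
        Backend::CpuX86 => 70.0,
        Backend::Ascend => 390.0,
    }
}
```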
Fit Analysis Notes
The `notes` field contains human-readable explanations:
- “GPU: model loaded into VRAM”
- “Unified memory: GPU and CPU share the same pool”
- “MoE: 2/8 experts active in VRAM (12.5 GB) at Q4_K_M”
- “Inactive experts offloaded to system RAM (18.3 GB)”
- “GPU: insufficient VRAM, spilling to system RAM”
- “Best quantization for hardware: Q4_K_M (model default: Q8_0)”
- “MLX runtime: ~35% faster than llama.cpp (85.2 vs 63.1 tok/s)”
- “Baseline estimated speed: 45.3 tok/s”
