Models API

The models module provides access to the embedded model database and metadata operations.

Core Types

LlmModel

Represents a single LLM model with all metadata:
pub struct LlmModel {
    pub name: String,
    pub provider: String,
    pub parameter_count: String,
    pub parameters_raw: Option<u64>,
    pub min_ram_gb: f64,
    pub recommended_ram_gb: f64,
    pub min_vram_gb: Option<f64>,
    pub quantization: String,
    pub context_length: u32,
    pub use_case: String,
    pub is_moe: bool,
    pub num_experts: Option<u32>,
    pub active_experts: Option<u32>,
    pub active_parameters: Option<u64>,
    pub release_date: Option<String>,
    pub gguf_sources: Vec<GgufSource>,
}
Key Fields:
  • name - Model identifier (e.g., “llama-3.1-8b-instruct”)
  • provider - Original provider (“Meta”, “Qwen”, etc.)
  • parameter_count - Human-readable size (“7B”, “8x7B”)
  • parameters_raw - Exact parameter count
  • min_ram_gb - Minimum system RAM for CPU inference
  • recommended_ram_gb - Recommended RAM for best performance
  • min_vram_gb - Minimum VRAM for GPU inference
  • quantization - Default quantization level (“Q4_K_M”, “mlx-4bit”)
  • context_length - Maximum context window
  • use_case - Primary use case category
  • is_moe - Whether this is a Mixture-of-Experts model
  • num_experts / active_experts - MoE expert configuration
  • active_parameters - Active parameter count for MoE models
  • release_date - Release date string (ISO 8601)
  • gguf_sources - Known GGUF download sources

GgufSource

A known GGUF download source:
pub struct GgufSource {
    pub repo: String,      // e.g., "bartowski/Llama-3.1-8B-Instruct-GGUF"
    pub provider: String,  // e.g., "bartowski", "unsloth"
}
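
A `GgufSource` identifies a Hugging Face repository. As a hypothetical illustration (the URL scheme and the `hf_repo_url` helper are assumptions, not part of the crate API), the `repo` field can be turned into a repository page URL:

```rust
// Hypothetical helper: the struct is reproduced here for a
// self-contained example; hf_repo_url is not a crate function.
struct GgufSource {
    repo: String,
    provider: String,
}

// Build the Hugging Face repository page URL from the repo field.
fn hf_repo_url(src: &GgufSource) -> String {
    format!("https://huggingface.co/{}", src.repo)
}
```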

ModelDatabase

Container for the embedded model database:
pub struct ModelDatabase {
    // Private fields
}
Methods:
impl ModelDatabase {
    pub fn new() -> Self;
    pub fn get_all_models(&self) -> &Vec<LlmModel>;
    pub fn find_model(&self, query: &str) -> Vec<&LlmModel>;
    pub fn models_fitting_system(
        &self,
        available_ram_gb: f64,
        has_gpu: bool,
        vram_gb: Option<f64>,
    ) -> Vec<&LlmModel>;
}

UseCase

Model use-case categories:
pub enum UseCase {
    General,
    Coding,
    Reasoning,
    Chat,
    Multimodal,
    Embedding,
}
Methods:
impl UseCase {
    pub fn label(&self) -> &'static str;
    pub fn from_model(model: &LlmModel) -> Self;
}

Functions

ModelDatabase::new()

Loads the embedded model database:
pub fn new() -> Self
Returns: ModelDatabase with all models loaded
Example:
use llmfit_core::ModelDatabase;

let db = ModelDatabase::new();
println!("Loaded {} models", db.get_all_models().len());
The database is embedded at compile time from data/hf_models.json. No runtime file I/O occurs.

ModelDatabase::get_all_models()

Returns all models in the database:
pub fn get_all_models(&self) -> &Vec<LlmModel>
Returns: Reference to all models
Example:
let db = ModelDatabase::new();

for model in db.get_all_models() {
    println!("{}: {} params, {} ctx",
        model.name,
        model.parameter_count,
        model.context_length
    );
}

ModelDatabase::find_model()

Searches models by name, provider, or parameter count:
pub fn find_model(&self, query: &str) -> Vec<&LlmModel>
Parameters:
  • query - Search term (case-insensitive substring match)
Returns: Matching models
Example:
let db = ModelDatabase::new();

// Find all Llama models
let llama_models = db.find_model("llama");

// Find 7B models
let seven_b = db.find_model("7B");

// Find Qwen models
let qwen = db.find_model("qwen");

for model in qwen {
    println!("Found: {}", model.name);
}

ModelDatabase::models_fitting_system()

Filters models that fit on specific hardware:
pub fn models_fitting_system(
    &self,
    available_ram_gb: f64,
    has_gpu: bool,
    vram_gb: Option<f64>,
) -> Vec<&LlmModel>
Parameters:
  • available_ram_gb - Available system RAM in GB
  • has_gpu - Whether a GPU is present
  • vram_gb - GPU VRAM in GB, if available
Returns: Models that meet the hardware requirements
Example:
use llmfit_core::{SystemSpecs, ModelDatabase};

let specs = SystemSpecs::detect();
let db = ModelDatabase::new();

let fitting = db.models_fitting_system(
    specs.available_ram_gb,
    specs.has_gpu,
    specs.gpu_vram_gb,
);

println!("Models that fit: {}", fitting.len());
for model in fitting.iter().take(5) {
    println!("  - {}", model.name);
}

LlmModel Methods

is_mlx_model()

Checks if model is MLX-specific:
pub fn is_mlx_model(&self) -> bool
Returns: true if the model name contains the “-MLX-” marker
Example:
let model_name = "Qwen3-8B-MLX-4bit";
let is_mlx = model_name.contains("-MLX-");

if is_mlx {
    println!("Apple Silicon only");
}

params_b()

Parameter count in billions:
pub fn params_b(&self) -> f64
Returns: Parameter count in billions
Example:
let model = /* ... */;
let params = model.params_b();

if params < 10.0 {
    println!("Small model: {:.1}B parameters", params);
} else if params < 100.0 {
    println!("Medium model: {:.1}B parameters", params);
} else {
    println!("Large model: {:.1}B parameters", params);
}
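
Human-readable sizes like those in parameter_count can be converted to billions with a small parser. This is only an illustrative sketch (the real params_b() may instead derive its value from parameters_raw); parse_params_b is a hypothetical helper, not a crate function:

```rust
// Parse a human-readable size such as "7B" or "8x7B" into billions
// of parameters. MoE notation "8x7B" means 8 experts of 7B each.
fn parse_params_b(count: &str) -> Option<f64> {
    let s = count.trim().trim_end_matches(|c| c == 'B' || c == 'b');
    if let Some((experts, per)) = s.split_once('x') {
        // "8x7" -> 8.0 * 7.0 = 56.0
        Some(experts.parse::<f64>().ok()? * per.parse::<f64>().ok()?)
    } else {
        s.parse::<f64>().ok()
    }
}
```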

estimate_memory_gb()

Estimates memory required for specific quantization and context:
pub fn estimate_memory_gb(&self, quant: &str, ctx: u32) -> f64
Parameters:
  • quant - Quantization level (“Q4_K_M”, “Q8_0”, etc.)
  • ctx - Context length in tokens
Returns: Estimated memory in GB
Example:
let model = /* ... */;

// 4K context with Q4 quantization
let mem_q4_4k = model.estimate_memory_gb("Q4_K_M", 4096);

// 32K context with Q8 quantization
let mem_q8_32k = model.estimate_memory_gb("Q8_0", 32768);

println!("Q4_K_M @ 4K: {:.2} GB", mem_q4_4k);
println!("Q8_0 @ 32K: {:.2} GB", mem_q8_32k);

best_quant_for_budget()

Selects best quantization that fits in memory:
pub fn best_quant_for_budget(
    &self,
    budget_gb: f64,
    ctx: u32,
) -> Option<(&'static str, f64)>
Parameters:
  • budget_gb - Available memory budget in GB
  • ctx - Target context length in tokens
Returns: (quantization, estimated_memory) or None if nothing fits
Example:
let model = /* ... */;
let budget = 16.0; // 16 GB available

if let Some((quant, mem)) = model.best_quant_for_budget(budget, 4096) {
    println!("Best quantization: {} ({:.2} GB)", quant, mem);
} else {
    println!("Model too large for available memory");
}
The function tries quantization levels in quality order:
  1. Q8_0 (best quality)
  2. Q6_K
  3. Q5_K_M
  4. Q4_K_M
  5. Q3_K_M
  6. Q2_K (smallest)
If nothing fits, it tries halving the context length once.
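The selection logic above can be sketched roughly as follows. This is illustrative only: the bytes-per-parameter values for Q6_K, Q5_K_M, Q3_K_M, and Q2_K and the KV-cache term are guesses, not the crate's actual constants (only the Q8_0 and Q4_K_M values appear in the quant_bpp() docs below):

```rust
// Quantization hierarchy with assumed bytes-per-parameter values.
const QUANTS: &[(&str, f64)] = &[
    ("Q8_0", 1.05), ("Q6_K", 0.85), ("Q5_K_M", 0.72),
    ("Q4_K_M", 0.58), ("Q3_K_M", 0.45), ("Q2_K", 0.35),
];

// Rough estimate: weights plus a hypothetical KV-cache term
// (~0.5 GB per 4K tokens of context).
fn estimate_gb(params_b: f64, bpp: f64, ctx: u32) -> f64 {
    params_b * bpp + (ctx as f64 / 4096.0) * 0.5
}

// Walk the hierarchy best-to-worst; if nothing fits, retry once
// with the context halved, as the docs describe.
fn best_quant(params_b: f64, budget_gb: f64, ctx: u32) -> Option<(&'static str, f64)> {
    for try_ctx in [ctx, ctx / 2] {
        for &(q, bpp) in QUANTS {
            let mem = estimate_gb(params_b, bpp, try_ctx);
            if mem <= budget_gb {
                return Some((q, mem));
            }
        }
    }
    None
}
```

Because the loop goes from highest to lowest quality, the first quantization that fits is also the best one that fits.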

MoE-Specific Methods

For Mixture-of-Experts models:
// Active expert VRAM (GPU)
pub fn moe_active_vram_gb(&self) -> Option<f64>;

// Inactive expert RAM (offloaded to system RAM)
pub fn moe_offloaded_ram_gb(&self) -> Option<f64>;
Example:
let model = /* MoE model like Mixtral 8x7B */;

if model.is_moe {
    if let Some(active_vram) = model.moe_active_vram_gb() {
        println!("Active experts: {:.2} GB VRAM", active_vram);
    }
    if let Some(offloaded) = model.moe_offloaded_ram_gb() {
        println!("Inactive experts: {:.2} GB RAM", offloaded);
    }
}
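
The idea behind these two methods can be sketched as a simple split of model weights between GPU and system RAM. The formula below is an assumption for illustration (the crate's actual computation may include additional overhead); moe_memory_split is a hypothetical helper:

```rust
// Split a MoE model's weight memory between VRAM (active experts)
// and system RAM (inactive, offloaded experts). Sizes are in
// billions of parameters; bpp is bytes per parameter for the
// chosen quantization.
fn moe_memory_split(total_b: f64, active_b: f64, bpp: f64) -> (f64, f64) {
    let active_vram_gb = active_b * bpp;
    let offloaded_ram_gb = (total_b - active_b) * bpp;
    (active_vram_gb, offloaded_ram_gb)
}
```

For a Mixtral-8x7B-like model (~46.7B total, ~12.9B active) at Q4_K_M, only the active-expert slice needs to sit in VRAM, which is why MoE models can run on GPUs far smaller than their total parameter count suggests.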

Quantization Functions

quant_bpp()

Bytes per parameter for quantization level:
pub fn quant_bpp(quant: &str) -> f64
Example:
use llmfit_core::models::quant_bpp;

assert_eq!(quant_bpp("F16"), 2.0);
assert_eq!(quant_bpp("Q8_0"), 1.05);
assert_eq!(quant_bpp("Q4_K_M"), 0.58);
assert_eq!(quant_bpp("mlx-4bit"), 0.55);

quant_speed_multiplier()

Speed impact of quantization:
pub fn quant_speed_multiplier(quant: &str) -> f64
Higher values = faster inference (lower precision = faster math).

quant_quality_penalty()

Quality penalty for quantization:
pub fn quant_quality_penalty(quant: &str) -> f64
Negative values indicate quality loss relative to F16.

Quantization Hierarchies

Predefined quantization hierarchies (best to worst quality):
// Standard GGUF hierarchy
pub const QUANT_HIERARCHY: &[&str] = &[
    "Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q3_K_M", "Q2_K"
];

// MLX-native hierarchy
pub const MLX_QUANT_HIERARCHY: &[&str] = &[
    "mlx-8bit", "mlx-4bit"
];
Example:
use llmfit_core::models::{QUANT_HIERARCHY, MLX_QUANT_HIERARCHY};

let model = /* ... */;
let budget = 12.0;

// Try GGUF quantizations
if let Some((q, mem)) = model.best_quant_for_budget_with(
    budget, 4096, QUANT_HIERARCHY
) {
    println!("GGUF: {} ({:.2} GB)", q, mem);
}

// Try MLX quantizations
if let Some((q, mem)) = model.best_quant_for_budget_with(
    budget, 4096, MLX_QUANT_HIERARCHY
) {
    println!("MLX: {} ({:.2} GB)", q, mem);
}

Use Case Inference

use llmfit_core::{UseCase, ModelDatabase};

let db = ModelDatabase::new();

for model in db.get_all_models() {
    let use_case = UseCase::from_model(model);
    println!("{}: {}", model.name, use_case.label());
}
Use cases are inferred from model name and metadata:
  • Embedding: “embed” or “bge” in name
  • Coding: “code” in name or use_case
  • Multimodal: “vision” in use_case
  • Reasoning: “reason” or “deepseek-r1” in name
  • Chat: “chat” or “instruction” in use_case
  • General: Default fallback
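
The rules above can be sketched as a priority-ordered chain of substring checks. This standalone function mirrors the listed rules for illustration; the crate's UseCase::from_model may differ in details:

```rust
// Infer a use-case label from the model name and use_case metadata,
// checking categories in the priority order listed above.
fn infer_use_case(name: &str, use_case: &str) -> &'static str {
    let name = name.to_lowercase();
    let uc = use_case.to_lowercase();
    if name.contains("embed") || name.contains("bge") {
        "Embedding"
    } else if name.contains("code") || uc.contains("code") {
        "Coding"
    } else if uc.contains("vision") {
        "Multimodal"
    } else if name.contains("reason") || name.contains("deepseek-r1") {
        "Reasoning"
    } else if uc.contains("chat") || uc.contains("instruction") {
        "Chat"
    } else {
        "General"
    }
}
```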