llmfit ships with an embedded database of hundreds of models from the HuggingFace ecosystem, stored in data/hf_models.json and baked into the binary at compile time.

Database Structure

The model database is a JSON array where each entry describes a single model variant:
{
  "name": "meta-llama/Llama-3.1-8B-Instruct",
  "provider": "Meta",
  "parameter_count": "8B",
  "parameters_raw": 8030000000,
  "min_ram_gb": 4.8,
  "recommended_ram_gb": 9.6,
  "min_vram_gb": 5.2,
  "quantization": "Q4_K_M",
  "context_length": 131072,
  "use_case": "Chat, instruction following",
  "is_moe": false,
  "release_date": "2024-07-23",
  "gguf_sources": [
    {
      "repo": "unsloth/Llama-3.1-8B-Instruct-GGUF",
      "provider": "unsloth"
    },
    {
      "repo": "bartowski/Llama-3.1-8B-Instruct-GGUF",
      "provider": "bartowski"
    }
  ]
}
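As a sketch of how this file can be consumed outside the binary (field names follow the entry layout above; the helper names are illustrative, not part of llmfit's API):

```python
import json

def load_models(path="data/hf_models.json"):
    """Load the model database and return its entries as a list of dicts."""
    with open(path) as f:
        return json.load(f)

def models_fitting_ram(entries, available_ram_gb):
    """Return names of models whose minimum RAM fits the given budget."""
    return [e["name"] for e in entries if e["min_ram_gb"] <= available_ram_gb]
```

For example, `models_fitting_ram(load_models(), 8.0)` would list every model whose `min_ram_gb` is at most 8 GB.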

Core Fields

name
string
required
Full HuggingFace model repo ID (e.g., Qwen/Qwen2.5-Coder-32B-Instruct)
provider
string
required
Organization or author (e.g., Meta, Mistral AI, DeepSeek)
parameter_count
string
required
Human-readable size (e.g., 7B, 8x7B, 1.3B)
parameters_raw
number
Exact parameter count for precise memory calculations
min_ram_gb
number
required
Minimum system RAM for CPU-only inference at Q4_K_M quantization
recommended_ram_gb
number
Recommended RAM for comfortable inference (typically 2× min_ram_gb)
min_vram_gb
number
Minimum VRAM for GPU inference at Q4_K_M. null for CPU-only models.
quantization
string
required
Default quantization format (e.g., Q4_K_M, Q8_0, mlx-4bit)
Note: llmfit overrides this with dynamic quantization selection at runtime.
context_length
number
required
Maximum context window in tokens
use_case
string
required
Intended use case(s): General, Coding, Reasoning, Chat, Multimodal, Embedding

MoE-Specific Fields

is_moe
boolean
default:false
Whether this is a Mixture-of-Experts architecture
num_experts
number
Total number of expert layers (e.g., 8 for Mixtral 8x7B)
active_experts
number
Number of experts activated per token (typically 2)
active_parameters
number
Effective parameter count when only active experts are loaded
Example: Mixtral 8x7B has 46.7B total params but only 12.9B active params.

GGUF Sources

gguf_sources
array
Known GGUF download repositories for llama.cpp runtime
[
  {
    "repo": "unsloth/Llama-3.1-8B-Instruct-GGUF",
    "provider": "unsloth"
  }
]
Populated by the scraper when --no-gguf-sources is not specified. Cached for 7 days to reduce HuggingFace API load.

Memory Calculation Formulas

The scraper estimates memory requirements using quantization bytes-per-parameter:

RAM (CPU-only inference)

bpp = 0.58  # Q4_K_M
model_size_gb = (params * bpp) / (1024 ** 3)
min_ram_gb = model_size_gb * 1.2  # +20% overhead
recommended_ram_gb = model_size_gb * 2.0

VRAM (GPU inference)

model_size_gb = (params * bpp) / (1024 ** 3)
min_vram_gb = model_size_gb * 1.1  # +10% for KV cache
These are baseline estimates at Q4_K_M. llmfit performs dynamic quantization selection at runtime, so actual memory usage may differ.
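Collected into one function, the two formulas above look like this (only the Q4_K_M bytes-per-parameter value from the snippets is included; other quantizations would need their own entries):

```python
GIB = 1024 ** 3
BPP = {"Q4_K_M": 0.58}  # bytes per parameter at Q4_K_M quantization

def estimate_memory(params, quant="Q4_K_M"):
    """Baseline RAM/VRAM estimates from the parameter count."""
    model_size_gb = params * BPP[quant] / GIB
    return {
        "min_ram_gb": model_size_gb * 1.2,        # +20% overhead
        "recommended_ram_gb": model_size_gb * 2.0,
        "min_vram_gb": model_size_gb * 1.1,       # +10% for KV cache
    }
```

For an 8.03B-parameter model this yields roughly 5.2 GB minimum RAM and 4.8 GB minimum VRAM; published database entries may differ slightly where the scraper applied fallbacks or rounding.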

Model Categories

The database spans multiple categories sourced from HuggingFace:
General
  • Meta Llama (3.1, 3.2, 3.3, 4 Scout, 4 Maverick)
  • Mistral (7B, Nemo, Small)
  • Google Gemma (2B, 7B, 9B, 27B)
  • Qwen (0.5B to 72B)
  • Microsoft Phi (3.5, 4)
  • DeepSeek (V2, V3)
400+ models across all parameter sizes.

Coding
  • Qwen2.5-Coder / Qwen3-Coder (0.5B to 32B)
  • Meta CodeLlama (7B, 13B, 34B, 70B)
  • BigCode StarCoder2 (3B, 7B, 15B)
  • WizardCoder (Python-specialized variants)
  • DeepSeek-Coder (1.3B to 33B)
  • IBM Granite Code (3B, 8B, 20B, 34B)
50+ specialized coding models.

Reasoning
  • DeepSeek-R1 (1.5B to 671B)
  • Microsoft Orca-2 (7B, 13B)
  • Qwen QwQ (32B preview)
Chain-of-thought and step-by-step reasoning models.

Multimodal
  • Llama 3.2 Vision (11B, 90B)
  • Llama 4 Scout / Maverick (17B, 23B)
  • Qwen2.5-VL (2B, 7B, 72B)
  • Microsoft Phi-3.5 Vision (4.2B)
Models with image understanding capabilities.

Embedding
  • nomic-embed-text (v1, v1.5)
  • BAAI bge (small, base, large)
  • Cohere embed-english (v3.0)
Dense vector embedding models for RAG and semantic search.

Mixture-of-Experts
  • Mistral Mixtral 8x7B / 8x22B
  • DeepSeek-V2 / V3 (236B, 671B)
  • Qwen1.5-MoE (A2.7B)
Mixture-of-Experts models with expert offloading support.

Model Sources

All models are sourced from the HuggingFace Hub via the REST API. The database includes:
  • Meta Llama (meta-llama org)
  • Mistral AI (mistralai org)
  • Qwen (Qwen org)
  • Google Gemma (google org)
  • Microsoft Phi (microsoft org)
  • DeepSeek (deepseek-ai org)
  • IBM Granite (ibm-granite org)
  • Allen Institute OLMo (allenai org)
  • xAI Grok (xai-org)
  • Cohere (CohereForAI org)
  • BigCode (bigcode org)
  • 01.ai Yi (01-ai org)
  • Upstage Solar (upstage org)
  • TII Falcon (tiiuae org)
  • Zhipu GLM (THUDM org)
  • Moonshot Kimi (Moonshot org)
  • Baidu ERNIE (PaddlePaddle org)

MoE Architecture Detection

The scraper automatically identifies Mixture-of-Experts models via:
  1. Config file inspection:
    if "num_local_experts" in config or "num_experts_per_tok" in config:
        is_moe = True
    
  2. Architecture mapping:
    MOE_ARCHITECTURES = {
        "mixtral", "qwen2moe", "deepseek_v2", "deepseek_v3"
    }
    
  3. Active parameter calculation:
    active_params = shared_params + (expert_params * active_experts)
    
Example: Mixtral 8x7B
  • Total parameters: 46.7B
  • Active experts: 2/8
  • Active parameters: ~12.9B
  • VRAM savings: 23.9 GB → 6.6 GB with expert offloading
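The three detection steps above can be combined into a single helper. The config keys and the architecture set come from the snippets; reading the architecture from the config's `model_type` field is an assumption about how the mapping is applied:

```python
MOE_ARCHITECTURES = {"mixtral", "qwen2moe", "deepseek_v2", "deepseek_v3"}

def detect_moe(config):
    """Heuristic MoE detection from a HuggingFace config.json dict."""
    # Step 1: expert-related keys in the config imply an MoE model.
    if "num_local_experts" in config or "num_experts_per_tok" in config:
        return True
    # Step 2: fall back to the known-MoE architecture mapping.
    arch = config.get("model_type", "").lower()
    return arch in MOE_ARCHITECTURES
```

A Mixtral config, for instance, matches on both checks: it carries `num_local_experts` and its `model_type` is `mixtral`.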

Update Process

The model database is generated by scripts/scrape_hf_models.py, a standalone Python script with no pip dependencies (stdlib only).

Manual Update

# Run the scraper
python3 scripts/scrape_hf_models.py

# Rebuild llmfit with new data
cargo build --release

Automated Update

# Backs up existing data, runs scraper, validates JSON, rebuilds binary
make update-models

# Or use the update script directly
./scripts/update_models.sh

Scraper Options

--no-gguf-sources
flag
Skip GGUF source enrichment for faster scraping
Default: not set (GGUF enrichment runs and queries HuggingFace for GGUF repos)
--output
string
Output file path
Default: data/hf_models.json

Adding a New Model

  1. Add the HuggingFace repo ID to TARGET_MODELS in scripts/scrape_hf_models.py:
    TARGET_MODELS = [
        "meta-llama/Llama-3.1-8B-Instruct",
        "Qwen/Qwen2.5-7B-Instruct",
        "your-org/your-new-model",  # Add here
    ]
    
  2. If the model is gated (requires HF authentication), add a fallback:
    FALLBACKS = {
        "your-org/your-new-model": {
            "parameters_raw": 7_000_000_000,
            "context_length": 8192,
            "use_case": "Chat",
        }
    }
    
  3. Run the update process:
    make update-models
    
The scraper validates JSON output before writing. If scraping fails, the original hf_models.json remains intact.
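A minimal sketch of the kind of validation the final step relies on; the exact checks in scrape_hf_models.py may differ, and the `REQUIRED` set here simply mirrors the required fields documented above:

```python
REQUIRED = {
    "name", "provider", "parameter_count", "min_ram_gb",
    "quantization", "context_length", "use_case",
}

def validate(entries):
    """Raise if any entry is missing a required field."""
    for i, entry in enumerate(entries):
        missing = REQUIRED - entry.keys()
        if missing:
            raise ValueError(
                f"entry {i} ({entry.get('name', '?')}) missing {sorted(missing)}"
            )
```

Validating before writing is what preserves the original hf_models.json on failure: the old file is only replaced once every entry passes.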

GGUF Source Caching

GGUF source enrichment queries HuggingFace’s search API to find quantized versions. Results are cached in data/gguf_sources_cache.json with a 7-day TTL:
{
  "meta-llama/Llama-3.1-8B-Instruct": {
    "timestamp": "2025-03-01T10:30:00Z",
    "sources": [
      {"repo": "unsloth/Llama-3.1-8B-Instruct-GGUF", "provider": "unsloth"},
      {"repo": "bartowski/Llama-3.1-8B-Instruct-GGUF", "provider": "bartowski"}
    ]
  }
}
To force a fresh GGUF source lookup, delete the cache file before running the scraper.
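A sketch of how the 7-day TTL check can be applied to one entry of that cache file (the timestamp parsing assumes the ISO-8601 `Z`-suffixed format shown above; the helper is illustrative, not the scraper's actual code):

```python
from datetime import datetime, timedelta, timezone

TTL = timedelta(days=7)

def is_fresh(cache_entry, now=None):
    """True if a cached GGUF source lookup is still within the 7-day TTL."""
    now = now or datetime.now(timezone.utc)
    ts = datetime.fromisoformat(cache_entry["timestamp"].replace("Z", "+00:00"))
    return now - ts < TTL
```

Stale entries are simply re-queried on the next scraper run, so deleting the cache file is equivalent to expiring every entry at once.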

Compile-Time Embedding

The model database is embedded at compile time via Rust’s include_str!() macro:
const HF_MODELS_JSON: &str = include_str!("../data/hf_models.json");

pub fn new() -> Self {
    let entries: Vec<HfModelEntry> = serde_json::from_str(HF_MODELS_JSON)
        .expect("Failed to parse embedded hf_models.json");
    // ...
}
Benefits:
  • No runtime file I/O
  • Works offline
  • Single binary distribution
  • No external data dependencies
The database must be present at data/hf_models.json when building; if the file is missing, include_str!() causes compilation to fail. If the file is present but contains invalid JSON, the build succeeds and the expect() in new() panics at startup instead.
