data/hf_models.json and baked into the binary at compile time.
Database Structure
The model database is a JSON array where each entry describes a single model variant.

Core Fields

- Full HuggingFace model repo ID (e.g., Qwen/Qwen2.5-Coder-32B-Instruct)
- Organization or author (e.g., Meta, Mistral AI, DeepSeek)
- Human-readable size (e.g., 7B, 8x7B, 1.3B)
- Exact parameter count for precise memory calculations
- Minimum system RAM for CPU-only inference at Q4_K_M quantization
- Recommended RAM for comfortable inference (typically 2× min_ram_gb)
- Minimum VRAM for GPU inference at Q4_K_M; null for CPU-only models
- Default quantization format (e.g., Q4_K_M, Q8_0, mlx-4bit). Note: llmfit overrides this with dynamic quantization selection at runtime.
- Maximum context window in tokens
- Intended use case(s): General, Coding, Reasoning, Chat, Multimodal, Embedding

MoE-Specific Fields

- Whether this is a Mixture-of-Experts architecture
- Total number of experts (e.g., 8 for Mixtral 8x7B)
- Number of experts activated per token (typically 2)
- Effective parameter count when only the active experts are loaded

Example: Mixtral 8x7B has 46.7B total params but only 12.9B active params.
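A representative entry might look like the fragment below. The field names here are illustrative guesses (only min_ram_gb is named in the field descriptions above); consult data/hf_models.json for the real schema:

```json
{
  "model_id": "mistralai/Mixtral-8x7B-Instruct-v0.1",
  "author": "Mistral AI",
  "size": "8x7B",
  "min_ram_gb": 32,
  "is_moe": true,
  "num_experts": 8,
  "active_experts": 2
}
```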
GGUF Sources

Known GGUF download repositories for the llama.cpp runtime. Populated by the scraper when --no-gguf-sources is not specified, and cached for 7 days to reduce HuggingFace API load.

Memory Calculation Formulas

The scraper estimates memory requirements from quantization bytes-per-parameter:

RAM (CPU-only inference)

VRAM (GPU inference)
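The formulas themselves are not reproduced above. The sketch below illustrates the general bytes-per-parameter approach; the quantization ratios and overhead constants are rough community figures, not the scraper's actual values:

```python
# Approximate bytes per parameter for common quantization formats.
# These ratios are rough estimates, not the scraper's exact table.
BYTES_PER_PARAM = {"Q4_K_M": 0.57, "Q8_0": 1.07, "F16": 2.0}

def estimate_ram_gb(params_b: float, quant: str = "Q4_K_M") -> float:
    """CPU-only RAM estimate: weights plus ~20% runtime overhead (assumed)."""
    weights_gb = params_b * BYTES_PER_PARAM[quant]
    return round(weights_gb * 1.2, 1)

def estimate_vram_gb(params_b: float, quant: str = "Q4_K_M") -> float:
    """GPU VRAM estimate: weights plus ~1 GB for KV cache/buffers (assumed)."""
    return round(params_b * BYTES_PER_PARAM[quant] + 1.0, 1)

print(estimate_ram_gb(7.0))   # ~4.8 GB for a 7B model at Q4_K_M
```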
Model Categories
The database spans multiple categories sourced from HuggingFace:

General Purpose
- Meta Llama (3.1, 3.2, 3.3, 4 Scout, 4 Maverick)
- Mistral (7B, Nemo, Small)
- Google Gemma (2B, 7B, 9B, 27B)
- Qwen (0.5B to 72B)
- Microsoft Phi (3.5, 4)
- DeepSeek (V2, V3)
Coding
- Qwen2.5-Coder / Qwen3-Coder (0.5B to 32B)
- Meta CodeLlama (7B, 13B, 34B, 70B)
- BigCode StarCoder2 (3B, 7B, 15B)
- WizardCoder (Python-specialized variants)
- DeepSeek-Coder (1.3B to 33B)
- IBM Granite Code (3B, 8B, 20B, 34B)
Reasoning
- DeepSeek-R1 (1.5B to 671B)
- Microsoft Orca-2 (7B, 13B)
- Qwen QwQ (32B preview)
Multimodal / Vision
- Llama 3.2 Vision (11B, 90B)
- Llama 4 Scout / Maverick (17B, 23B)
- Qwen2.5-VL (2B, 7B, 72B)
- Microsoft Phi-3.5 Vision (4.2B)
Embedding
- nomic-embed-text (v1, v1.5)
- BAAI bge (small, base, large)
- Cohere embed-english (v3.0)
MoE Architectures
- Mistral Mixtral 8x7B / 8x22B
- DeepSeek-V2 / V3 (236B, 671B)
- Qwen1.5-MoE (A2.7B)
Model Sources
All models are sourced from the HuggingFace Hub via the REST API. The database includes:

- Meta Llama (meta-llama org)
- Mistral AI (mistralai org)
- Qwen (Qwen org)
- Google Gemma (google org)
- Microsoft Phi (microsoft org)
- DeepSeek (deepseek-ai org)
- IBM Granite (ibm-granite org)
- Allen Institute OLMo (allenai org)
- xAI Grok (xai-org)
- Cohere (CohereForAI org)
- BigCode (bigcode org)
- 01.ai Yi (01-ai org)
- Upstage Solar (upstage org)
- TII Falcon (tiiuae org)
- Zhipu GLM (THUDM org)
- Moonshot Kimi (Moonshot org)
- Baidu ERNIE (PaddlePaddle org)
MoE Architecture Detection

The scraper automatically identifies Mixture-of-Experts models via:

- Config file inspection
- Architecture mapping
- Active parameter calculation
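Config file inspection can be sketched as below. The keys num_local_experts and num_experts_per_tok are the ones Mixtral-style HuggingFace configs use; treat this as an illustration, not the scraper's exact logic:

```python
import json

# Config keys used by Mixtral-style HF configs; other MoE families may
# use different names (an assumption of this sketch).
MOE_KEYS = ("num_local_experts", "num_experts_per_tok")

def looks_like_moe(config: dict) -> bool:
    """Heuristic: a config.json that declares expert counts is MoE."""
    return any(key in config for key in MOE_KEYS)

config = json.loads(
    '{"num_local_experts": 8, "num_experts_per_tok": 2, "hidden_size": 4096}'
)
print(looks_like_moe(config))  # True for a Mixtral-style config
```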
Example: Mixtral 8x7B
- Total parameters: 46.7B
- Active experts: 2/8
- Active parameters: ~12.9B
- VRAM savings: 23.9 GB → 6.6 GB with expert offloading
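The active-parameter figure can be derived from the shared and per-expert parameter counts. A minimal sketch, assuming roughly 1.6B of Mixtral's parameters are shared (attention, embeddings) rather than expert-specific, and that total = shared + n_experts × per_expert (a simplification of real architectures):

```python
def active_params(total_b: float, n_experts: int,
                  active_experts: int, shared_b: float) -> float:
    """Estimate active parameters (in billions) for an MoE model.

    Assumes total = shared + n_experts * per_expert, which simplifies
    real architectures.
    """
    per_expert_b = (total_b - shared_b) / n_experts
    return shared_b + active_experts * per_expert_b

# Mixtral 8x7B: 46.7B total, 2 of 8 experts active, ~1.6B shared (assumed)
print(round(active_params(46.7, 8, 2, 1.6), 1))  # ~12.9
```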
Update Process
The model database is generated by scripts/scrape_hf_models.py, a standalone Python script with no pip dependencies (stdlib only).
Manual Update
Automated Update
Scraper Options

- --no-gguf-sources: skip GGUF source enrichment for faster scraping. Default: enrichment enabled (queries HuggingFace for GGUF repos).
- Output file path. Default: data/hf_models.json

Adding a New Model
- Add the HuggingFace repo ID to TARGET_MODELS in scripts/scrape_hf_models.py
- If the model is gated (requires HF authentication), add a fallback
- Run the update process
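The first two steps might look like the sketch below. TARGET_MODELS is named in the text above, but its shape and the fallback mapping are assumptions about the script's internals; check scripts/scrape_hf_models.py for the real definitions:

```python
# Hypothetical shapes for the scraper's model list and gated fallbacks.
TARGET_MODELS = [
    "Qwen/Qwen2.5-Coder-32B-Instruct",
    "meta-llama/Llama-3.1-8B-Instruct",  # gated: needs a fallback below
]

# Map gated repos to openly accessible mirrors so scraping works
# without HF authentication (mirror repo name is illustrative).
GATED_FALLBACKS = {
    "meta-llama/Llama-3.1-8B-Instruct": "unsloth/Meta-Llama-3.1-8B-Instruct",
}

repo = "meta-llama/Llama-3.1-8B-Instruct"
print(GATED_FALLBACKS.get(repo, repo))  # resolves to the open mirror
```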
The scraper validates JSON output before writing. If scraping fails, the original hf_models.json remains intact.

GGUF Source Caching

GGUF source enrichment queries HuggingFace’s search API to find quantized versions. Results are cached in data/gguf_sources_cache.json with a 7-day TTL.
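The cache file's exact layout is not shown here, but the 7-day TTL check can be sketched as follows (the fetched_at timestamp field is an assumption, not the real schema):

```python
import json
import time

SEVEN_DAYS = 7 * 24 * 60 * 60  # TTL in seconds

def cache_is_fresh(entry: dict, now: float | None = None) -> bool:
    """Return True if a cached entry is younger than the 7-day TTL.

    Assumes each entry carries a 'fetched_at' Unix timestamp; the real
    schema in data/gguf_sources_cache.json may differ.
    """
    now = time.time() if now is None else now
    return now - entry.get("fetched_at", 0) < SEVEN_DAYS

entry = json.loads('{"repo": "TheBloke/Mixtral-8x7B-GGUF", "fetched_at": 0}')
print(cache_is_fresh(entry, now=SEVEN_DAYS - 1))  # True: just under 7 days
print(cache_is_fresh(entry, now=SEVEN_DAYS + 1))  # False: TTL expired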
Compile-Time Embedding
The model database is embedded at compile time via Rust’s include_str!() macro:
- No runtime file I/O
- Works offline
- Single binary distribution
- No external data dependencies
