Models API
The models module provides access to the embedded model database and metadata operations.
Core Types
LlmModel
Represents a single LLM model with all metadata:
- name - Model identifier (e.g., “llama-3.1-8b-instruct”)
- provider - Original provider (“Meta”, “Qwen”, etc.)
- parameter_count - Human-readable size (“7B”, “8x7B”)
- parameters_raw - Exact parameter count
- min_ram_gb - Minimum system RAM for CPU inference
- recommended_ram_gb - Recommended RAM for best performance
- min_vram_gb - Minimum VRAM for GPU inference
- quantization - Default quantization level (“Q4_K_M”, “mlx-4bit”)
- context_length - Maximum context window
- use_case - Primary use case category
- is_moe - Whether this is a Mixture-of-Experts model
- num_experts / active_experts - MoE expert configuration
- active_parameters - Active parameter count for MoE models
- release_date - Release date string (ISO 8601)
- gguf_sources - Known GGUF download sources
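The fields above suggest a struct roughly along these lines. This is a sketch: all types, and the fields of GgufSource, are assumptions inferred from the descriptions, not the crate's actual definition.

```rust
// Sketch of the model metadata record described above.
// Field types are assumptions inferred from the field descriptions.
#[derive(Debug, Clone)]
pub struct GgufSource {
    pub repo: String,     // hypothetical field: repository identifier
    pub filename: String, // hypothetical field: GGUF file name
}

#[derive(Debug, Clone)]
pub struct LlmModel {
    pub name: String,            // e.g. "llama-3.1-8b-instruct"
    pub provider: String,        // "Meta", "Qwen", ...
    pub parameter_count: String, // "7B", "8x7B"
    pub parameters_raw: u64,     // exact parameter count
    pub min_ram_gb: f64,
    pub recommended_ram_gb: f64,
    pub min_vram_gb: f64,
    pub quantization: String, // "Q4_K_M", "mlx-4bit"
    pub context_length: u32,
    pub use_case: String,
    pub is_moe: bool,
    pub num_experts: Option<u32>,    // MoE only
    pub active_experts: Option<u32>, // MoE only
    pub active_parameters: Option<u64>,
    pub release_date: String, // ISO 8601
    pub gguf_sources: Vec<GgufSource>,
}
```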
GgufSource
A known GGUF download source:
ModelDatabase
Container for the embedded model database:
UseCase
Model use-case categories:
Functions
ModelDatabase::new()
Loads the embedded model database. The database is compiled into the binary from data/hf_models.json; no runtime file I/O occurs.
ModelDatabase::get_all_models()
Returns all models in the database:
ModelDatabase::find_model()
Searches models by name, provider, or parameter count:
- query - Search term (case-insensitive substring match)
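A case-insensitive substring match over name, provider, and parameter count could look like the sketch below. The `Model` struct is a pared-down stand-in for LlmModel, and the crate's actual implementation may differ in details such as ranking.

```rust
// Pared-down stand-in for the fields find_model searches.
struct Model {
    name: String,
    provider: String,
    parameter_count: String,
}

// Sketch: case-insensitive substring search across a few fields.
fn find_model<'a>(models: &'a [Model], query: &str) -> Vec<&'a Model> {
    let q = query.to_lowercase();
    models
        .iter()
        .filter(|m| {
            m.name.to_lowercase().contains(&q)
                || m.provider.to_lowercase().contains(&q)
                || m.parameter_count.to_lowercase().contains(&q)
        })
        .collect()
}
```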
ModelDatabase::models_fitting_system()
Filters models that fit on specific hardware:
- available_ram_gb - Available system RAM
- has_gpu - Whether GPU is present
- vram_gb - GPU VRAM if available
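The filter presumably keeps a model when either the GPU's VRAM meets its minimum or, failing that, system RAM meets the CPU-inference minimum. A sketch under that assumption (the crate's actual rule may differ, e.g. hybrid RAM+VRAM offload):

```rust
// Pared-down stand-in for the relevant LlmModel fields.
struct Model {
    min_ram_gb: f64,
    min_vram_gb: f64,
}

// Sketch: a model "fits" if the GPU has enough VRAM for GPU inference,
// or if system RAM meets the CPU-inference minimum.
fn models_fitting_system(
    models: &[Model],
    available_ram_gb: f64,
    has_gpu: bool,
    vram_gb: Option<f64>,
) -> Vec<&Model> {
    models
        .iter()
        .filter(|m| {
            let gpu_ok = has_gpu && vram_gb.map_or(false, |v| v >= m.min_vram_gb);
            gpu_ok || available_ram_gb >= m.min_ram_gb
        })
        .collect()
}
```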
LlmModel Methods
is_mlx_model()
Checks if model is MLX-specific:
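Given the “mlx-4bit” quantization naming shown above, the check plausibly looks at the quantization string; this is an assumed rule, not the crate's confirmed logic.

```rust
// Sketch: treat a model as MLX-specific when its default quantization
// uses the MLX naming scheme (e.g. "mlx-4bit"), or MLX appears in the
// model name. Assumed rule.
fn is_mlx_model(name: &str, quantization: &str) -> bool {
    quantization.starts_with("mlx") || name.to_lowercase().contains("mlx")
}
```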
params_b()
Parameter count in billions:
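Converting the human-readable size to billions presumably handles both plain sizes like “7B” and MoE forms like “8x7B” (total parameters). A sketch of that parsing; the crate could equally derive this from parameters_raw:

```rust
// Sketch: parse "7B" -> 7.0 and "8x7B" -> 56.0 (total parameters).
// Assumed behavior; the real method may instead use parameters_raw / 1e9.
fn params_b(parameter_count: &str) -> Option<f64> {
    let s = parameter_count.trim_end_matches(|c: char| c == 'B' || c == 'b');
    if let Some((experts, per_expert)) = s.split_once('x') {
        // MoE form: experts x per-expert size.
        let e: f64 = experts.parse().ok()?;
        let p: f64 = per_expert.parse().ok()?;
        Some(e * p)
    } else {
        s.parse().ok()
    }
}
```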
estimate_memory_gb()
Estimates memory required for specific quantization and context:
- quant - Quantization level (“Q4_K_M”, “Q8_0”, etc.)
- ctx - Context length in tokens
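A rough memory model is weights (parameters × bytes per parameter) plus a KV cache that grows with context. A sketch with illustrative constants, not the crate's actual formula:

```rust
// Sketch of a memory estimate: weights + KV cache + fixed overhead.
// All constants are illustrative assumptions.
fn estimate_memory_gb(params: u64, quant: &str, ctx: u64) -> f64 {
    // Assumed bytes-per-parameter for common GGUF quantizations.
    let bpp = match quant {
        "Q8_0" => 1.06,
        "Q6_K" => 0.82,
        "Q5_K_M" => 0.69,
        "Q4_K_M" => 0.60,
        "Q3_K_M" => 0.49,
        "Q2_K" => 0.39,
        _ => 2.0, // fall back to fp16
    };
    let weights_gb = params as f64 * bpp / 1e9;
    // Assume ~0.5 MB of KV cache per 1K tokens per billion parameters.
    let kv_gb = (ctx as f64 / 1024.0) * (params as f64 / 1e9) * 0.0005;
    weights_gb + kv_gb + 0.5 // fixed runtime overhead
}
```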
best_quant_for_budget()
Selects best quantization that fits in memory:
- budget_gb - Available memory
- ctx - Target context length

Quantization levels tried, in order:
- Q8_0 (best quality)
- Q6_K
- Q5_K_M
- Q4_K_M
- Q3_K_M
- Q2_K (smallest)
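The fallback order above can be sketched as a walk down the hierarchy, returning the first level whose memory estimate fits the budget. The bytes-per-parameter and KV-cache constants here are illustrative assumptions:

```rust
// Sketch: walk the quantization hierarchy best-to-worst and return the
// first level that fits. Memory constants are illustrative assumptions.
fn best_quant_for_budget(params: u64, budget_gb: f64, ctx: u64) -> Option<&'static str> {
    const HIERARCHY: [(&str, f64); 6] = [
        ("Q8_0", 1.06),   // best quality
        ("Q6_K", 0.82),
        ("Q5_K_M", 0.69),
        ("Q4_K_M", 0.60),
        ("Q3_K_M", 0.49),
        ("Q2_K", 0.39),   // smallest
    ];
    for (quant, bpp) in HIERARCHY {
        // Weights plus an assumed rough KV-cache term for the target context.
        let est = params as f64 * bpp / 1e9
            + (ctx as f64 / 1024.0) * (params as f64 / 1e9) * 0.0005;
        if est <= budget_gb {
            return Some(quant);
        }
    }
    None // nothing fits, even Q2_K
}
```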
MoE-Specific Methods
For Mixture-of-Experts models:

Quantization Functions
quant_bpp()
Bytes per parameter for quantization level:
quant_speed_multiplier()
Speed impact of quantization:
quant_quality_penalty()
Quality penalty for quantization:
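These three lookups presumably map a quantization string to a constant. A sketch with illustrative values; the crate's actual tables will differ:

```rust
// Sketch: per-quantization constants. All values are illustrative
// assumptions, not the crate's real tables.
fn quant_bpp(quant: &str) -> f64 {
    match quant {
        "Q8_0" => 1.06,
        "Q6_K" => 0.82,
        "Q5_K_M" => 0.69,
        "Q4_K_M" => 0.60,
        "Q3_K_M" => 0.49,
        "Q2_K" => 0.39,
        _ => 2.0, // assume fp16 for unknown levels
    }
}

// Relative throughput vs. Q4_K_M: heavier quants move more bytes per
// token, lighter quants move fewer (illustrative model).
fn quant_speed_multiplier(quant: &str) -> f64 {
    0.60 / quant_bpp(quant)
}

// Rough quality penalty, 0.0 (near-lossless) toward 1.0 (illustrative).
fn quant_quality_penalty(quant: &str) -> f64 {
    match quant {
        "Q8_0" => 0.01,
        "Q6_K" => 0.02,
        "Q5_K_M" => 0.04,
        "Q4_K_M" => 0.08,
        "Q3_K_M" => 0.15,
        "Q2_K" => 0.30,
        _ => 0.0,
    }
}
```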
Quantization Hierarchies
Predefined quantization hierarchies (best to worst quality):

Use Case Inference
- Embedding: “embed”, “bge” in name
- Coding: “code” in name or use_case
- Multimodal: “vision” in use_case
- Reasoning: “reason” or “deepseek-r1” in name
- Chat: “chat” or “instruction” in use_case
- General: Default fallback
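The precedence of these rules matters: a name like “deepseek-r1-coder” would hit the earlier Coding rule before the Reasoning one. A sketch of the checks in priority order, assuming case-insensitive substring matching:

```rust
#[derive(Debug, PartialEq)]
enum UseCase {
    Embedding,
    Coding,
    Multimodal,
    Reasoning,
    Chat,
    General,
}

// Sketch of the inference rules above, checked in priority order.
fn infer_use_case(name: &str, use_case: &str) -> UseCase {
    let name = name.to_lowercase();
    let uc = use_case.to_lowercase();
    if name.contains("embed") || name.contains("bge") {
        UseCase::Embedding
    } else if name.contains("code") || uc.contains("code") {
        UseCase::Coding
    } else if uc.contains("vision") {
        UseCase::Multimodal
    } else if name.contains("reason") || name.contains("deepseek-r1") {
        UseCase::Reasoning
    } else if uc.contains("chat") || uc.contains("instruction") {
        UseCase::Chat
    } else {
        UseCase::General
    }
}
```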
