## Supported Providers
| Provider | Platforms | Detection | Download | Runtime |
|---|---|---|---|---|
| Ollama | Linux, macOS, Windows | API (`/api/tags`) | API (`/api/pull`) | `ollama serve` |
| llama.cpp | Linux, macOS, Windows | Local cache | HuggingFace GGUF | `llama-cli`, `llama-server` |
| MLX | macOS (Apple Silicon) | Local cache | HuggingFace `mlx-community` | `mlx_lm` |
## Ollama Integration
Ollama is a daemon-based runtime for running LLMs locally. llmfit connects to the Ollama API to detect installed models and download new ones.

### Requirements
- Ollama must be installed and running: `ollama serve` or the Ollama desktop app
- llmfit connects to `http://localhost:11434` by default (Ollama's default API port)
- No configuration is needed: if Ollama is running, llmfit detects it automatically
### Install Detection
On startup, llmfit queries `GET /api/tags` to list installed Ollama models. Detected models show a green ✓ in the "Inst" column of the TUI, and the system bar displays `Ollama: ✓ (N installed)`.
API endpoint: `GET /api/tags`

llmfit maps Ollama tags (e.g., `llama3.1:8b`) to HuggingFace model names (e.g., `meta-llama/Llama-3.1-8B-Instruct`) using an internal mapping table.
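As a sketch, detection amounts to one GET plus a JSON parse; `parse_tags` and `list_installed` are illustrative names, not llmfit's actual code:

```python
import json
from urllib.request import urlopen

def parse_tags(payload: dict) -> list[str]:
    """Extract installed model tags from an /api/tags response body."""
    # Ollama responds with {"models": [{"name": "llama3.1:8b", ...}, ...]}
    return [m["name"] for m in payload.get("models", [])]

def list_installed(host: str = "http://localhost:11434") -> list[str]:
    """GET /api/tags from a running Ollama daemon and list its models."""
    with urlopen(f"{host}/api/tags", timeout=5) as resp:
        return parse_tags(json.load(resp))
```
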
### Model Name Mapping
llmfit's database uses HuggingFace model names, while Ollama uses its own naming scheme. llmfit maintains a mapping between the two. Examples:

| HuggingFace Name | Ollama Tag |
|---|---|
| `meta-llama/Llama-3.1-8B-Instruct` | `llama3.1:8b` |
| `Qwen/Qwen2.5-Coder-14B-Instruct` | `qwen2.5-coder:14b` |
| `mistralai/Mistral-7B-Instruct-v0.3` | `mistral:7b-instruct` |
| `deepseek-ai/DeepSeek-R1-Distill-Llama-8B` | `deepseek-r1:8b` |
Note that `qwen2.5-coder:14b` maps to the Coder model, not the base `qwen2.5:14b`.
### Download Functionality
Press `d` in the TUI (or use the `download` subcommand) to download a model via Ollama. llmfit sends `POST /api/pull` to Ollama with the appropriate tag.
API endpoint: `POST /api/pull`
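A minimal sketch of the pull flow, assuming Ollama's newline-delimited JSON progress stream; `pull_model` and `format_progress` are illustrative names:

```python
import json
from urllib.request import Request, urlopen

def pull_model(tag: str, host: str = "http://localhost:11434"):
    """POST /api/pull and yield Ollama's streaming JSON status events."""
    req = Request(f"{host}/api/pull",
                  data=json.dumps({"name": tag}).encode(),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        for line in resp:            # one JSON status event per line
            yield json.loads(line)

def format_progress(event: dict) -> str:
    """Render one status event as 'status NN%' when sizes are present."""
    if "total" in event and "completed" in event:
        pct = 100 * event["completed"] // event["total"]
        return f"{event.get('status', '')} {pct}%"
    return event.get("status", "")
```
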
### Remote Ollama Instances
To connect to Ollama running on a different machine or port, set the `OLLAMA_HOST` environment variable. This supports:
- Running llmfit on one machine while Ollama serves from another (e.g., GPU server + laptop client)
- Connecting to Ollama in Docker containers with custom ports
- Using Ollama behind reverse proxies or load balancers
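Endpoint resolution can be sketched as: use `OLLAMA_HOST` when set, otherwise the default port. `ollama_endpoint` is a hypothetical helper, and accepting bare `host:port` values is an assumption:

```python
import os

def ollama_endpoint() -> str:
    """Resolve the Ollama API base URL, honoring OLLAMA_HOST."""
    host = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
    # Accept bare host:port values as well as full URLs (assumption).
    if not host.startswith(("http://", "https://")):
        host = f"http://{host}"
    return host.rstrip("/")
```
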
### Ollama Binary Detection
llmfit also detects whether the `ollama` CLI binary is available in `PATH` (even if the daemon is not running). This allows download operations to start the daemon automatically if needed.
Detection method: llmfit checks `PATH` for the `ollama` executable. If the binary is found but the daemon is not running, the TUI shows `Ollama: ✗` but still allows downloads (which will prompt the daemon to start).
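A PATH check is a one-liner; this sketch uses Python's `shutil.which`, with `ollama_binary_available` as an illustrative name:

```python
import shutil

def ollama_binary_available() -> bool:
    """True when the `ollama` executable can be found on PATH."""
    return shutil.which("ollama") is not None
```
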
## llama.cpp Integration
llama.cpp is a C++ inference engine for GGUF quantized models. llmfit integrates with llama.cpp by downloading GGUF files from HuggingFace and detecting the local cache.

### Requirements
- `llama-cli` or `llama-server` available in `PATH` (for runtime detection)
- Network access to HuggingFace for GGUF downloads
### Local Cache Detection
llmfit scans the llama.cpp model cache directory for GGUF files. Cache directory: `~/.cache/llama.cpp/models/`. Models with a matching GGUF in this directory show ✓.
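Cache scanning can be sketched as a glob over that directory; `cached_ggufs` is an illustrative helper, not llmfit's code:

```python
from pathlib import Path

CACHE = Path.home() / ".cache" / "llama.cpp" / "models"

def cached_ggufs(cache: Path = CACHE) -> list[str]:
    """List GGUF filenames present in the llama.cpp cache directory."""
    if not cache.is_dir():
        return []
    return sorted(p.name for p in cache.glob("*.gguf"))
```
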
### Model Name Mapping
llmfit maps HuggingFace model names to known GGUF repos using a curated list with heuristic fallbacks. Example mappings:

| HuggingFace Name | GGUF Repo |
|---|---|
| `meta-llama/Llama-3.1-8B-Instruct` | `bartowski/Llama-3.1-8B-Instruct-GGUF` |
| `Qwen/Qwen2.5-Coder-7B-Instruct` | `unsloth/Qwen2.5-Coder-7B-Instruct-GGUF` |
| `mistralai/Mistral-7B-Instruct-v0.3` | `bartowski/Mistral-7B-Instruct-v0.3-GGUF` |
When no curated entry exists, llmfit tries heuristic candidates:

- `unsloth/<model-name>-GGUF`
- `bartowski/<model-name>-GGUF`
- Original repo with `-GGUF` suffix
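The fallback patterns can be sketched as a candidate generator; `gguf_candidates` is a hypothetical name:

```python
def gguf_candidates(hf_name: str) -> list[str]:
    """Heuristic GGUF repo candidates for a HuggingFace model name."""
    org, _, model = hf_name.partition("/")
    return [
        f"unsloth/{model}-GGUF",
        f"bartowski/{model}-GGUF",
        f"{org}/{model}-GGUF",   # original repo with -GGUF suffix
    ]
```
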
### Download Functionality
Press `d` in the TUI to download a GGUF model. llmfit:

- Resolves the GGUF repo from the model name
- Lists available GGUF files in the repo
- Selects the best quantization that fits your hardware (or uses the `--quant` override)
- Downloads the file to `~/.cache/llama.cpp/models/`
- Shows progress with percentage and transfer speed
If `--quant` is not specified, llmfit selects the highest-quality quantization that fits in available memory (GPU VRAM or system RAM):

- Parse all GGUF filenames in the repo
- Extract quantization (e.g., `Q4_K_M`, `Q8_0`) and file size
- Rank by quality: `Q8_0` > `Q6_K` > `Q5_K_M` > `Q5_K_S` > `Q4_K_M` > `Q4_K_S` > `Q3_K_M` > `Q2_K`
- Select the highest-quality quant where `file_size <= memory_budget`
### Running GGUF Models

Use the `run` subcommand to launch a downloaded model. Options:
- Run as an API server instead of interactive chat (uses `llama-server` instead of `llama-cli`)
- Port for the API server (only with `--server`)
- Number of GPU layers to offload; `-1` means all layers (full GPU)
- Context size in tokens
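Under the hood, `run` presumably assembles a llama.cpp invocation; this sketch uses llama.cpp's own flags (`-m`, `-ngl`, `-c`, `--port`), but `build_command` and its defaults are illustrative:

```python
def build_command(model_path: str, server: bool = False, port: int = 8080,
                  gpu_layers: int = -1, ctx: int = 4096) -> list[str]:
    """Assemble a llama-cli / llama-server invocation for a local GGUF."""
    binary = "llama-server" if server else "llama-cli"
    cmd = [binary, "-m", model_path, "-ngl", str(gpu_layers), "-c", str(ctx)]
    if server:
        cmd += ["--port", str(port)]   # port only applies in server mode
    return cmd
```
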
## MLX Integration
MLX is Apple's machine learning framework optimized for Apple Silicon (M1/M2/M3/M4). llmfit integrates with MLX via the `mlx_lm` package.
### Requirements
- Apple Silicon Mac (M1, M2, M3, M4, or later)
- `mlx_lm` Python package installed (optional for runtime)
### Local Cache Detection
llmfit scans the MLX model cache directory: `~/.cache/huggingface/hub/models--mlx-community--*`. MLX models are distributed under the `mlx-community` namespace on HuggingFace; llmfit detects these cached models and marks them as installed.
### Model Name Mapping
llmfit maps HuggingFace model names to `mlx-community` equivalents. Example mappings:
| HuggingFace Name | MLX Community Repo |
|---|---|
| `Qwen/Qwen3-4B-MLX-4bit` | `mlx-community/Qwen3-4B-MLX-4bit` |
| `meta-llama/Llama-3.1-8B-Instruct` | `mlx-community/Llama-3.1-8B-Instruct-4bit` |
If the model name already carries an MLX quantization suffix (e.g., `-4bit`, `-8bit`), llmfit treats it as MLX-native and maps it to `mlx-community/<model-name>`.
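The mapping rule can be sketched as below; the default `-4bit` suffix for non-MLX-native names matches the example table but is otherwise an assumption, and `mlx_repo` is a hypothetical helper:

```python
def mlx_repo(hf_name: str) -> str:
    """Map a HuggingFace model name to an mlx-community repo (sketch)."""
    model = hf_name.split("/")[-1]
    # Names already carrying an MLX quantization suffix map straight across;
    # others get a default 4-bit community conversion (assumption).
    if model.endswith(("-4bit", "-8bit")):
        return f"mlx-community/{model}"
    return f"mlx-community/{model}-4bit"
```
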
### Download Functionality
Press `d` in the TUI on an MLX model to download via `mlx_lm`. llmfit uses the `mlx_lm.utils` module to pull models.
### MLX-Only Models
Some models in the database are MLX-only (quantized specifically for MLX). llmfit hides these models on non-Apple Silicon systems to avoid confusion.

Detection: models are marked MLX-only if:

- Model name contains "MLX"
- Quantization format is `mlx-4bit`, `mlx-8bit`, etc.
- No GGUF sources are available
- On Apple Silicon: MLX-only models are visible and ranked normally
- On other systems: MLX-only models are hidden (counted in “backend hidden” in system bar)
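The detection rules might be combined as a simple predicate; the doc does not state how the three signals combine, so ORing them here is an assumption, and `is_mlx_only` is an illustrative name:

```python
def is_mlx_only(name: str, quant: str, has_gguf_source: bool) -> bool:
    """MLX-only heuristic: any of the three listed signals (assumption)."""
    return ("MLX" in name
            or quant.startswith("mlx-")
            or not has_gguf_source)
```
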
## Install Detection Indicators
The "Inst" column in the TUI shows install status:

| Indicator | Meaning |
|---|---|
| ✓ | Installed in at least one provider (Ollama, MLX, or llama.cpp) |
| O | Available via Ollama only |
| L | Available via llama.cpp only |
| OL | Available via both Ollama and llama.cpp |
| … | Checking availability (background probe) |
| — | Not available for download |
| Spinner + bar | Currently downloading |
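The O/L/OL/— availability glyphs reduce to a two-flag lookup; `availability_indicator` is an illustrative helper, not llmfit's code:

```python
def availability_indicator(ollama: bool, llamacpp: bool) -> str:
    """Map per-provider download availability to the 'Inst' column glyph."""
    if ollama and llamacpp:
        return "OL"
    if ollama:
        return "O"
    if llamacpp:
        return "L"
    return "—"   # not available for download
```
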
Press `i` in the TUI to toggle installed-first sorting. When enabled, models detected in any runtime provider appear at the top of the list (regardless of score).
## Provider Detection on Startup
On startup, llmfit probes all providers in parallel:

- Ollama: HTTP GET to `http://localhost:11434/api/tags` (or `$OLLAMA_HOST/api/tags`)
- llama.cpp: check for `llama-cli` or `llama-server` in `PATH`, scan `~/.cache/llama.cpp/models/`
- MLX: check for `mlx_lm` in the Python path, scan `~/.cache/huggingface/hub/models--mlx-community--*`
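Parallel, failure-tolerant probing can be sketched with a thread pool; `probe_all` is illustrative, not llmfit's implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def probe_all(probes: dict) -> dict:
    """Run provider probes concurrently; a failed probe reports unavailable."""
    def safe(fn):
        try:
            return fn()
        except Exception:
            return False   # unreachable provider, not a fatal error
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(safe, fn) for name, fn in probes.items()}
        return {name: f.result() for name, f in futures.items()}
```
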
The system bar summarizes the results:

- `Ollama: ✓ (N installed)` — Ollama daemon running, N models installed
- `Ollama: ✗` — Ollama not running or not reachable
- `MLX: ✓ (N installed)` — MLX runtime available, N models cached
- `MLX: (N cached)` — MLX not installed, but N models cached locally
- `MLX: ✗` — MLX not available
- `llama.cpp: ✓ (N models)` — `llama-cli` or `llama-server` in `PATH`, N GGUFs cached
- `llama.cpp: (N cached)` — no binary in `PATH`, but N GGUFs cached
- `llama.cpp: ✗` — no runtime or cache detected
## Download Provider Selection
When multiple providers are available for a model, pressing `d` opens a provider picker popup:
- Use `↑`/`↓` or `j`/`k` to navigate
- Press `Enter` or `Space` to select
- Press `Esc` or `q` to cancel
The picker lists providers in priority order:

- MLX — if the model is MLX-native and MLX is available
- Ollama — if the model has an Ollama mapping and Ollama is running
- llama.cpp — if the model has GGUF sources and llama.cpp is available
## Refresh Installed Models
Press `r` in the TUI to refresh installed models from all providers. This re-queries:

- Ollama API (`/api/tags`)
- llama.cpp cache directory
- MLX cache directory
This is useful after installing models outside of llmfit (e.g., via `ollama pull llama3.1:8b` or `mlx_lm.load(...)`).
## Environment Variables
- `OLLAMA_HOST` — Ollama API URL. Set to connect to remote Ollama instances.
  Example: `OLLAMA_HOST="http://192.168.1.100:11434" llmfit`
- `OLLAMA_CONTEXT_LENGTH` — Context length fallback for memory estimation when `--max-context` is not set.
  Example: `OLLAMA_CONTEXT_LENGTH=8192 llmfit`

## Provider-Specific Notes
### Ollama
- Requires daemon: Ollama must be running for downloads and install detection
- Model format: Native Ollama format (not GGUF)
- Storage: Models stored in Ollama’s own cache (not exposed to llmfit)
- Pull speed: Depends on Ollama’s download speed and disk I/O
### llama.cpp
- No daemon: Downloads and runs models directly via CLI tools
- Model format: GGUF (llama.cpp's single-file quantized format)
- Storage: `~/.cache/llama.cpp/models/`
- Pull speed: Direct HuggingFace download, typically faster than Ollama
- Flexibility: Full control over quantization, context size, and GPU layers
### MLX
- Apple Silicon only: Requires M1, M2, M3, M4, or later
- Model format: MLX-native (safetensors + config)
- Storage:
~/.cache/huggingface/hub/models--mlx-community--* - Pull speed: Direct HuggingFace download via
mlx_lm - Performance: Optimized for Apple Silicon unified memory
Provider detection is non-blocking. If a provider is unavailable, llmfit continues with reduced functionality (no downloads for that provider).
