GPU Memory Override
GPU VRAM autodetection can fail on some systems (broken `nvidia-smi`, VMs, passthrough setups, remote GPUs). Use `--memory` to manually specify your GPU’s VRAM.
Accepted formats:

- `GB`: Gigabytes (case-insensitive)
- `MB`: Megabytes (case-insensitive)
- `TB`: Terabytes (case-insensitive)
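A few invocation sketches showing the accepted formats (values are examples, not recommendations):

```shell
llmfit --memory 24GB     # 24 gigabytes
llmfit --memory 24576MB  # the same amount in megabytes
llmfit --memory 1tb      # suffixes are case-insensitive
```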
- If no GPU was detected, `--memory` creates a synthetic GPU entry so models are scored for GPU inference
- If a GPU was detected but VRAM is unknown or wrong, `--memory` overrides the detected value
- Works with all modes: TUI, CLI, subcommands, and `serve`
- VMs / passthrough: GPU is present but not directly visible to OS
- Broken nvidia-smi: nvidia-smi reports incorrect VRAM or fails
- Remote GPUs: Planning for a GPU you don’t have locally
- Multi-GPU: Override with aggregate VRAM (e.g., 2x 24GB = 48GB)
`--memory` overrides VRAM only. It does not affect system RAM or CPU detection.

Context Length Cap
Use `--max-context` to cap the context length used for memory estimation. This does not change each model’s advertised maximum context; it only affects how much memory llmfit assumes the model will use.
- Reduce memory usage: Longer context = more memory for KV cache
- Realistic workloads: You may not need a model’s full 128k context window
- Fit more models: Capping context can promote a model from “Marginal” to “Good” fit
- 4K context: ~0.7 GB KV cache
- 8K context: ~1.4 GB KV cache
- 128K context: ~22.4 GB KV cache
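The estimates above scale linearly with context length. As a rough sketch of where they come from: per token, the KV cache stores a key and a value vector for every layer. The parameters below describe a generic 8B-class model with grouped-query attention in fp16; they are assumptions, not the model behind the table above, so the absolute numbers differ.

```python
# Hedged sketch: KV-cache size grows linearly with context length.
# Per-token cost = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value.
# The defaults below are illustrative (generic 8B-class model, GQA, fp16);
# real models vary, so treat the output as a ballpark, not ground truth.

def kv_cache_gb(context_tokens: int,
                layers: int = 32,
                kv_heads: int = 8,
                head_dim: int = 128,
                bytes_per_value: int = 2) -> float:
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 1e9

for ctx in (4_096, 8_192, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gb(ctx):.1f} GB")
```

Doubling the context doubles the KV-cache estimate, which is why capping context matters so much for long-context models.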
If `--max-context` is not set, llmfit falls back to the `OLLAMA_CONTEXT_LENGTH` environment variable.
Examples:
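The original examples were lost in formatting; a minimal sketch, assuming the flag takes a token count:

```shell
llmfit --max-context 8192           # assume at most 8K tokens of context
OLLAMA_CONTEXT_LENGTH=4096 llmfit   # fallback via the environment variable
```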
In `serve` mode, you can override the context cap on a per-request basis with the `max_context` query parameter:
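A sketch of a per-request override; the host, port, and endpoint path here are illustrative, only the `max_context` parameter itself is documented:

```shell
curl "http://gpu-node:8080/models?max_context=8192"
```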
Remote Ollama
By default, llmfit connects to Ollama at `http://localhost:11434`. To connect to a remote Ollama instance, set the `OLLAMA_HOST` environment variable.
- GPU server + laptop client: Run llmfit on your laptop while Ollama serves from a GPU server
- Docker containers: Connect to Ollama running in a Docker container with custom ports
- Reverse proxies: Use Ollama behind a reverse proxy or load balancer
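Setting the variable is enough for all modes; for example (hostname illustrative):

```shell
export OLLAMA_HOST=http://gpu-server:11434
llmfit   # all modes now query the remote instance
```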
- `GET $OLLAMA_HOST/api/tags`: List installed models
- `POST $OLLAMA_HOST/api/pull`: Download models
Serve Mode for Cluster Scheduling
The `serve` subcommand starts an HTTP API that exposes node-local model fit analysis. This is designed for cluster schedulers, aggregators, and remote clients that need to query hardware compatibility across multiple nodes.
- Liveness probe. Returns `{"status": "ok", "node": {...}}`
- Node hardware info (CPU, RAM, GPU, backend)
- Full fit list with filters (`limit`, `min_fit`, `runtime`, `use_case`, etc.)
- Top runnable models for scheduling (conservative defaults: `limit=5`, `min_fit=good`)
- Run `llmfit serve` on each node in your cluster
- From your scheduler, poll each node
- Aggregate results and decide which node to schedule a model on
- Send deploy command to chosen node
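The polling-and-aggregation loop above can be sketched as follows. The HTTP calls are stubbed out in a comment (endpoint paths are not documented here, so they are assumptions); the part shown is the scheduling decision over per-node fit lists, with fit labels `good`/`marginal` assumed from the tool's terminology.

```python
# Hedged sketch of a scheduler aggregating llmfit serve results.
from typing import Dict, List, Optional

def pick_node(fits_by_node: Dict[str, List[dict]], model: str) -> Optional[str]:
    """Return the node reporting the best fit for `model`, or None."""
    rank = {"good": 2, "marginal": 1}  # higher is better (labels assumed)
    best_node, best_score = None, 0
    for node, fits in fits_by_node.items():
        for fit in fits:
            if fit["model"] == model:
                score = rank.get(fit["fit"], 0)
                if score > best_score:
                    best_node, best_score = node, score
    return best_node

# In practice, fits_by_node would be populated by polling each node's
# llmfit serve API, e.g. with urllib.request.urlopen on each node's URL.
fits = {
    "node-a": [{"model": "llama3:8b", "fit": "marginal"}],
    "node-b": [{"model": "llama3:8b", "fit": "good"}],
}
print(pick_node(fits, "llama3:8b"))  # node-b
```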
Environment Variables
llmfit respects the following environment variables:

- `OLLAMA_HOST`: Ollama API URL. Set to connect to remote Ollama instances.
- `OLLAMA_CONTEXT_LENGTH`: Context length fallback for memory estimation when `--max-context` is not set. This is useful if you use Ollama and have already configured your context length via `OLLAMA_CONTEXT_LENGTH`.

Context length is resolved in this order:

- `--max-context` flag (highest priority)
- `OLLAMA_CONTEXT_LENGTH` environment variable
- Model’s full advertised context (default, lowest priority)
Combining Flags
All global flags can be combined, e.g. `llmfit --memory 24GB --max-context 8192`.

Advanced Workflows
1. Multi-GPU Aggregate VRAM
If you have multiple GPUs with a shared VRAM pool (e.g., NVLink), override with the total VRAM.

2. Planning for Future Hardware
Use `--memory` to plan models for a GPU you don’t have yet:
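Sketches for both workflows above (VRAM values are illustrative):

```shell
# 1. Multi-GPU: two 24GB cards pooled over NVLink
llmfit --memory 48GB

# 2. Future hardware: a 32GB card you haven't bought yet
llmfit --memory 32GB
```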
3. Workload-Specific Context Caps
Chat workloads with short conversations rarely need a large cap, e.g. `llmfit --max-context 4096` (value illustrative).

4. Remote Hardware Inspection
SSH into a remote node and check its hardware without installing llmfit.

5. Kubernetes Cluster Scheduling
Deploy llmfit as a DaemonSet on all GPU nodes.

Performance Considerations
TUI Startup Time
The TUI probes all providers (Ollama, MLX, llama.cpp) on startup. On slow networks or with many installed models, this can take 1-2 seconds. To skip provider detection, use CLI mode.

API Response Time
The REST API computes fit analysis on each request. For large model databases (200+ models), this takes ~50-100ms. To reduce latency:

- Use the `limit` parameter to reduce the result set
- Use `min_fit=good` to exclude unrunnable models
- Cache results on the client side if hardware doesn’t change
Download Speed
- Ollama: Controlled by Ollama daemon (typically saturates bandwidth)
- llama.cpp: Direct HuggingFace download (typically faster than Ollama)
- MLX: Direct HuggingFace download via `mlx_lm` (similar to llama.cpp)
Troubleshooting
GPU Not Detected
Symptom: TUI shows “GPU: none” even though you have a GPU.

Causes:

- `nvidia-smi` not in PATH or not working
- VM/passthrough setup where GPU is not visible to OS
- AMD GPU without rocm-smi
- Intel Arc without proper drivers
Fix: use `--memory` to override:
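For example, for a 24GB card (value illustrative):

```shell
llmfit --memory 24GB
```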
Wrong VRAM Amount
Symptom: TUI shows incorrect VRAM (e.g., 16GB instead of 24GB).

Causes:

- `nvidia-smi` reporting bug
- Shared memory incorrectly reported
- Multi-GPU with incorrect aggregation
Fix: use `--memory` to override:
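For example (use the VRAM you know the card actually has):

```shell
llmfit --memory 24GB
```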
Models Don’t Fit as Expected
Symptom: Models you think should fit are marked “Too Tight”.

Causes:

- Context length too high (KV cache uses a lot of memory)
- Available RAM lower than you think (OS overhead, other processes)
- Model requires more memory than you expect (MoE inactive experts, etc.)
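If the context length is the culprit, capping it shrinks the KV-cache estimate and can change the verdict (value illustrative):

```shell
llmfit --max-context 8192
```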
Ollama Not Detected
Symptom: TUI shows “Ollama: ✗” even though Ollama is running.

Causes:

- Ollama running on non-default port
- Firewall blocking localhost:11434
- Ollama not fully started yet
Fix: point llmfit at the right address with `OLLAMA_HOST`:
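For example, verify the endpoint is reachable and then point llmfit at it (port illustrative; `/api/tags` is the models-list endpoint mentioned above):

```shell
export OLLAMA_HOST=http://localhost:11435
curl "$OLLAMA_HOST/api/tags"   # should list installed models
llmfit
```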
Download Fails
Symptom: Download starts but fails with an error.

Causes:

- Network error (HuggingFace unreachable)
- Disk full
- Ollama daemon stopped mid-download
- GGUF repo not found
Fixes:

- Check network: `curl -I https://huggingface.co`
- Check disk space: `df -h`
- Restart Ollama: `ollama serve`
- Try a different provider (Ollama vs llama.cpp)
All advanced flags (`--memory`, `--max-context`, `OLLAMA_HOST`) work in TUI, CLI, and serve modes. In serve mode, they affect all API responses.