Syntax
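A plausible invocation shape, sketched from the options described below. The subcommand name `run` and the `--port`/`--ctx-size` flag spellings are assumptions; only `--server` and `--ngl` are confirmed elsewhere on this page:

```bash
llmfit run <model> [--server] [--port <port>] [--ngl <layers>] [--ctx-size <tokens>]
```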
Description
Run a downloaded GGUF model using `llama-cli` (interactive chat) or `llama-server` (OpenAI-compatible API server). The command searches for the model in your local llama.cpp cache and launches it with appropriate hardware acceleration settings.
Requires llama-cli (for interactive mode) or llama-server (for server mode) to be installed and available in your PATH.
Arguments
Model file or name to run. Can be:
- Full path to a GGUF file
- Model filename (searches local cache)
- Partial model name (e.g., `llama-8b`, `mistral`)
Options
- `--server`: Run as an OpenAI-compatible API server instead of interactive chat. Uses `llama-server` instead of `llama-cli`.
- Port option: Port for the API server (only applies when `--server` is used).
- `--ngl`: Number of GPU layers to offload. Use `-1` to offload all layers to the GPU (full GPU inference). Use `0` for CPU-only inference. Values between `0` and the model's layer count enable partial GPU offloading.
- Context size option: Context size in tokens, i.e. the maximum number of tokens the model can process in a single prompt/conversation. Larger values require more VRAM/RAM.
Usage Examples
Run interactive chat (default)
By default, the command launches `llama-cli`.
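A minimal sketch, assuming the subcommand is `llmfit run` and using the partial model name from this page:

```bash
# Launches llama-cli with auto-detected GPU acceleration
llmfit run llama-8b
```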
Run as API server
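A sketch assuming the subcommand is `llmfit run`; `--port` is an assumed spelling for the port option described above:

```bash
# Starts llama-server with an OpenAI-compatible API on port 8080
llmfit run llama-8b --server --port 8080
```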
Run with custom GPU offload
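A sketch assuming the subcommand is `llmfit run` (the `--ngl` flag is documented below):

```bash
# Offload 24 layers to the GPU and keep the rest on the CPU
llmfit run llama-8b --ngl 24
```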
Run with larger context
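A sketch assuming the subcommand is `llmfit run`; `--ctx-size` is an assumed spelling for the context size option:

```bash
# 8192-token context window (needs more VRAM/RAM than the default)
llmfit run llama-8b --ctx-size 8192
```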
CPU-only inference
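A sketch assuming the subcommand is `llmfit run`:

```bash
# --ngl 0 disables GPU offload entirely
llmfit run llama-8b --ngl 0
```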
Combined options
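A sketch combining the options above, assuming the subcommand is `llmfit run` and the `--port`/`--ctx-size` spellings:

```bash
# API server on an explicit port, full GPU offload, larger context
llmfit run llama-8b --server --port 8080 --ngl -1 --ctx-size 8192
```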
Example Output
Interactive Mode
Server Mode
GPU Layer Offloading
The `--ngl` (number of GPU layers) flag controls how much of the model runs on the GPU versus the CPU:
| Value | Behavior | Use Case |
|---|---|---|
| `-1` | All layers on GPU | Best performance when the model fits in VRAM |
| `0` | CPU-only | No GPU available, or testing CPU performance |
| `1` to N | Partial offload | Model too large for VRAM; offload as many layers as fit |
Context Size Guidelines
| Context Size | Memory Impact | Use Case |
|---|---|---|
| 2048 | Low | Short conversations, code completion |
| 4096 (default) | Moderate | General chat, standard conversations |
| 8192 | High | Long conversations, document analysis |
| 16384+ | Very High | Large document processing, long context windows |
Requirements
- llama.cpp must be installed, with `llama-cli` (for interactive mode) or `llama-server` (for API mode) in your PATH. Install from: https://github.com/ggml-org/llama.cpp
- Models must be downloaded first using `llmfit download`, or manually placed in the llama.cpp cache
Notes
- The command automatically detects your GPU and enables Metal (Apple Silicon), CUDA (NVIDIA), or ROCm (AMD) acceleration if available
- Model search checks:
  - Exact filename in the current directory
  - Exact filename in `~/.cache/llama.cpp/`
  - Partial name match in the cache (e.g., `llama-8b` matches `Llama-3.1-8B-Instruct.Q4_K_M.gguf`)
- Server mode is fully compatible with OpenAI API clients (LangChain, LiteLLM, etc.)
- Interactive mode supports multi-turn conversations with automatic context management
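Because server mode speaks the OpenAI chat-completions protocol, any OpenAI-style client can talk to it. A minimal sketch with curl, assuming the server is listening on port 8080 (the model name field is an illustration; llama-server typically accepts any value here since it serves a single loaded model):

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```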
Related Commands
- llmfit download - Download GGUF models from HuggingFace
- llmfit fit - Find models that fit your hardware
- llmfit plan - Estimate hardware requirements for a model
