Overview
llama-cli is an interactive CLI tool for accessing and experimenting with most of llama.cpp’s functionality. It provides a straightforward way to run text generation, chat conversations, and test model parameters from the command line.
Basic Usage
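A minimal invocation looks like the following (the local model path is illustrative; the Hugging Face repo is the one used as an example later in this page):

```shell
# Run a one-off generation from a local GGUF model
llama-cli -m models/model.gguf -p "Explain quantization in one sentence." -n 128

# Or fetch a model from Hugging Face by repo name (quant defaults to Q4_K_M)
llama-cli -hf unsloth/phi-4-GGUF:q4_k_m -p "Hello"
```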
Key Features
- Conversation Mode: Automatically activates for models with built-in chat templates
- Custom Grammars: Constrain output with BNF-like grammar rules
- Speculative Decoding: Use draft models to accelerate generation
- Multimodal Support: Process images and audio with compatible models
- Context Management: Automatic context shifting for infinite text generation
Common Parameters
Model Loading
- -m, --model: Path to the GGUF model file to load. Can also be set via the LLAMA_ARG_MODEL environment variable.
- -hf, --hf-repo: Hugging Face model repository in the format <user>/<model>[:quant]. The quant is optional and defaults to Q4_K_M. Automatically downloads the mmproj file if available. Example: unsloth/phi-4-GGUF:q4_k_m
- --hf-file: Specific file from Hugging Face to use; overrides the quant in --hf-repo.

Generation Settings
- -p, --prompt: Prompt text to start generation with. For system messages, use -sys instead.
- -f, --file: Path to a file containing the prompt to use.
- -n, --n-predict: Number of tokens to predict. -1 means infinite. Can be set via the LLAMA_ARG_N_PREDICT environment variable.
- -c, --ctx-size: Size of the prompt context. 0 means it is loaded from the model. Can be set via the LLAMA_ARG_CTX_SIZE environment variable.

Conversation Mode
- -cnv, --conversation: Run in conversation mode. Auto-enabled if a chat template is available. In this mode:
- Special tokens and suffix/prefix are not printed
- Interactive mode is enabled
- --chat-template: Set a custom Jinja chat template. Built-in templates include llama3, llama2, chatml, mistral-v3, phi3, phi4, gemma, deepseek, deepseek2, deepseek3, and many more.
- -sys, --system-prompt: System prompt to use with the model (if supported by the chat template).
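For instance, a chat session combining a system prompt with an explicit built-in template might look like this (model path illustrative):

```shell
# Chat using the model's own template, with a custom system prompt
llama-cli -m models/model.gguf -cnv -sys "You are a concise assistant."

# Override template auto-detection with a built-in template
llama-cli -m models/model.gguf -cnv --chat-template chatml
```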
Sampling Parameters
- --temp: Sampling temperature. Higher values increase randomness.
- --top-k: Top-k sampling. 0 disables it. Can be set via the LLAMA_ARG_TOP_K environment variable.
- --top-p: Top-p (nucleus) sampling. 1.0 disables it.
- --min-p: Min-p sampling. 0.0 disables it.
- --repeat-penalty: Penalize repeated sequences of tokens. 1.0 means disabled.
- -s, --seed: RNG seed for reproducible generation. -1 uses a random seed.

Performance & Hardware
- -t, --threads: Number of CPU threads to use during generation. Can be set via the LLAMA_ARG_THREADS environment variable.
- -ngl, --n-gpu-layers: Number of layers to offload to the GPU. Can be a number, 'auto', or 'all'. Can be set via the LLAMA_ARG_N_GPU_LAYERS environment variable.
- -b, --batch-size: Logical maximum batch size. Can be set via the LLAMA_ARG_BATCH environment variable.
- --flash-attn: Flash Attention setting: 'on', 'off', or 'auto'. Can be set via the LLAMA_ARG_FLASH_ATTN environment variable.

Usage Examples
Interactive Conversation
Models with built-in chat templates automatically activate conversation mode:
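A sketch of both variants (model paths illustrative):

```shell
# Conversation mode is auto-enabled when the model ships a chat template
llama-cli -m models/llama-3-8b-instruct-Q4_K_M.gguf

# Force it explicitly with -cnv
llama-cli -m models/model.gguf -cnv
```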
Constrained Generation with Grammar
Constrain output to follow specific formats using GBNF grammars. The grammars/ folder contains sample grammars, and you can also use JSON schemas:
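For example (model path illustrative; grammars/json.gbnf is one of the sample grammars shipped in the llama.cpp repository):

```shell
# Constrain output with a sample GBNF grammar
llama-cli -m models/model.gguf --grammar-file grammars/json.gbnf \
    -p "Three fruits as a JSON array:"

# Or constrain output via a JSON schema
llama-cli -m models/model.gguf \
    --json-schema '{"type": "array", "items": {"type": "string"}}' \
    -p "Three fruits:"
```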
Speculative Decoding
Accelerate generation with a draft model. For best results, the draft model should be a smaller, faster variant of the target model:
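For example, pairing a large target model with a small draft model from the same family (both paths illustrative):

```shell
# -md supplies the draft model used for speculative decoding
llama-cli -m models/llama-3-70b-Q4_K_M.gguf \
    -md models/llama-3-8b-Q4_K_M.gguf \
    -p "Write a haiku about rivers."
```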
Multimodal Usage
Process images or audio with vision/audio models:

Single-Turn Generation
Generate a single response without interactive mode:

Advanced Features
Context Management
- --context-shift: Enable context shift for infinite text generation. When the context is full, old tokens are shifted out.
- --keep: Number of tokens to keep from the initial prompt when the context fills up. Use -1 to keep all tokens.
- --ctx-checkpoints: Maximum number of context checkpoints to create per slot for state-based context (SWA).
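A sketch of infinite generation with context shifting (model path illustrative; note that in older builds context shift is enabled by default and the --context-shift flag may not be needed):

```shell
# Generate indefinitely, retaining the first 256 prompt tokens on each shift
llama-cli -m models/model.gguf --context-shift --keep 256 -n -1 \
    -p "Once upon a time"
```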
LoRA Adapters
Load LoRA adapters to modify model behavior:

Control Vectors
Apply control vectors to steer model behavior:

Output and Logging
- -v, --verbose: Enable verbose logging (log all messages).
- --log-file: Path to write log output to a file. Can be set via the LLAMA_LOG_FILE environment variable.
- --no-display-prompt: Don't print the prompt at generation time.
- Show timing information after each response. Can be set via the LLAMA_ARG_SHOW_TIMINGS environment variable.

Environment Variables
Many parameters can be set via the environment variables noted in the parameter descriptions above, such as LLAMA_ARG_MODEL, LLAMA_ARG_CTX_SIZE, LLAMA_ARG_N_GPU_LAYERS, and LLAMA_ARG_FLASH_ATTN.

Performance Tips
- Use --flash-attn on for faster attention computation on supported hardware
- Increase --batch-size for better throughput with longer prompts
- Enable --mlock to prevent the model from being swapped out of RAM
- Use quantized models (Q4_K_M, Q5_K_M) for faster inference with minimal quality loss
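Putting these tips together in one invocation (model path illustrative):

```shell
# Full GPU offload, Flash Attention, larger batches, and memory locking
llama-cli -m models/model.gguf -ngl all --flash-attn on -b 2048 --mlock \
    -p "Hello"
```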
See Also
- llama-server - HTTP server for serving LLMs
- llama-bench - Performance benchmarking tool
- llama-perplexity - Model evaluation tool

