The llama-cli tool provides a powerful command-line interface for running LLM inference. It supports interactive conversations, structured outputs, and various sampling configurations.
Quick Start
Load a model
Start llama-cli with your GGUF model: ./llama-cli -m models/model.gguf
Add a prompt
Generate text from a prompt: ./llama-cli -m models/model.gguf -p "Hello, how are you?"
Enable conversation mode
Use -cnv for interactive chat: ./llama-cli -m models/model.gguf -cnv
Loading Models
From Local Files
# Load a local GGUF model
./llama-cli -m path/to/model.gguf
From Hugging Face
# Download and load from Hugging Face (defaults to Q4_K_M quantization)
./llama-cli -hf unsloth/phi-4-GGUF
# Specify quantization level
./llama-cli -hf unsloth/phi-4-GGUF:q8_0
When using -hf, llama-cli automatically downloads the model and any associated multimodal projectors (like for vision models). Use --no-mmproj to disable automatic projector loading.
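As an illustration, a sketch of pulling a vision-capable model while skipping its projector (the repository name here is illustrative, not from this page):

```shell
# Fetch from Hugging Face, but skip the multimodal projector download
./llama-cli -hf ggml-org/gemma-3-4b-it-GGUF --no-mmproj -p "Hello"
```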
From Docker Hub
# Load from Docker Hub repository
./llama-cli -dr ai/gemma3
Conversation Mode
Conversation mode provides an interactive chat interface that automatically formats messages using the model’s chat template.
Basic Conversation
# Enable conversation mode (auto-enabled for chat models)
./llama-cli -m models/chat-model.gguf -cnv
# With a system prompt
./llama-cli -m models/chat-model.gguf -cnv -sys "You are a helpful coding assistant."
Single-Turn Conversation
# Run a single conversation turn and exit
./llama-cli -m models/model.gguf -cnv -st -p "What is the capital of France?"
Conversation Options
-cnv, --conversation
Enable conversation mode. Automatically enabled if the model has a chat template.
-sys, --system-prompt PROMPT
System prompt to use with the model (if applicable based on the chat template).
-st, --single-turn
Run the conversation for a single turn only, then exit when done.
-r, --reverse-prompt PROMPT
Halt generation at this prompt and return control in interactive mode.
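Putting these options together, a sketch of a scripted chat session (model path, system prompt, and reverse prompt are illustrative):

```shell
# Single-turn chat with a system prompt; generating "User:" halts output
./llama-cli -m models/chat-model.gguf \
  -cnv -st \
  -sys "You are a terse assistant." \
  -r "User:" \
  -p "Summarize GGUF in one sentence."
```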
Prompting
Direct Prompts
# Simple prompt
./llama-cli -m model.gguf -p "Write a haiku about programming"
# Longer prompt with escape sequences
./llama-cli -m model.gguf -p "Line 1\nLine 2\nLine 3"
# Disable escape sequence processing
./llama-cli -m model.gguf --no-escape -p "Keep \n literal"
From Files
# Load prompt from text file
./llama-cli -m model.gguf -f prompt.txt
# Load prompt from binary file
./llama-cli -m model.gguf -bf prompt.bin
# Load system prompt from file
./llama-cli -m model.gguf -cnv -sysf system_prompt.txt
# Enable multiline input mode (no need for \ at line endings)
./llama-cli -m model.gguf -cnv -mli
Grammar Constraints
Constrain model outputs to follow specific formats using BNF-like grammars or JSON schemas.
Using Grammar Files
# Use a BNF grammar to constrain output
./llama-cli -m model.gguf -p "Generate a list" --grammar-file grammars/list.gbnf
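For reference, a grammars/list.gbnf file could contain something like the following (these contents are an assumption, shown only to illustrate GBNF rule syntax):

```
# Each list item is a dash, lowercase words, and a newline
root ::= item+
item ::= "- " [a-z ]+ "\n"
```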
Inline Grammar
# Inline grammar definition
./llama-cli -m model.gguf -p "Count to 5" --grammar 'root ::= [1-5] (" " [1-5])*'
JSON Schema
# Constrain to JSON object matching a schema
./llama-cli -m model.gguf -p "Generate user data" \
-j '{"type":"object","properties":{"name":{"type":"string"},"age":{"type":"number"}},"required":["name","age"]}'
# Load schema from file
./llama-cli -m model.gguf -p "Generate user data" -jf schema.json
Example: List Grammar
./llama-cli -m model.gguf \
-p "Generate a shopping list" \
--grammar 'root ::= ("- " [a-z]+ "\n")+'
For JSON schemas with external $refs, use --grammar combined with the json_schema_to_grammar.py conversion script instead.
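A sketch of that conversion workflow (the script ships with llama.cpp; the examples/ path is an assumption about your checkout layout):

```shell
# Expand $refs by converting the schema to a GBNF grammar first
python examples/json_schema_to_grammar.py schema.json > schema.gbnf
./llama-cli -m model.gguf -p "Generate user data" --grammar-file schema.gbnf
```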
Sampling Parameters
Control how tokens are generated with various sampling strategies.
Temperature and Top-K/Top-P
# Adjust temperature (higher = more random)
./llama-cli -m model.gguf -p "Write a story" --temp 0.9
# Top-k sampling (limit to k most likely tokens)
./llama-cli -m model.gguf -p "Complete: The sky is" --top-k 20
# Top-p/nucleus sampling (cumulative probability threshold)
./llama-cli -m model.gguf -p "Generate text" --top-p 0.9
# Min-p sampling (minimum probability threshold)
./llama-cli -m model.gguf -p "Generate text" --min-p 0.1
Repetition Control
# Penalize repetition
./llama-cli -m model.gguf -p "Tell me about" \
--repeat-penalty 1.2 \
--repeat-last-n 128
# DRY (Don't Repeat Yourself) sampling
./llama-cli -m model.gguf -p "Write code" \
--dry-multiplier 0.8 \
--dry-base 1.75
Advanced Sampling
# Mirostat sampling (perplexity control)
./llama-cli -m model.gguf -p "Generate" \
--mirostat 2 \
--mirostat-lr 0.1 \
--mirostat-ent 5.0
# Dynamic temperature
./llama-cli -m model.gguf -p "Write" \
--dynatemp-range 0.5 \
--dynatemp-exp 1.0
Sampler Order
# Customize sampler sequence
./llama-cli -m model.gguf -p "Text" \
--samplers "top_k;top_p;temperature"
# Or use simplified sequence notation
./llama-cli -m model.gguf -p "Text" \
--sampler-seq "kymt"
Context and Generation Control
Context Window
# Set context size (0 = use model default)
./llama-cli -m model.gguf -c 4096 -p "Long prompt..."
# Keep specific tokens from initial prompt
./llama-cli -m model.gguf --keep 10 -p "Important context here..."
# Enable context shift for infinite generation
./llama-cli -m model.gguf --context-shift -p "Start"
Generation Length
# Generate specific number of tokens (-1 = infinite)
./llama-cli -m model.gguf -n 100 -p "Generate exactly 100 tokens"
# Stop at specific words
./llama-cli -m model.gguf -p "Count: " -r "ten"
Batch Processing
# Adjust batch sizes for performance
# -b sets the logical batch size, -ub the physical batch size
./llama-cli -m model.gguf \
-b 2048 \
-ub 512 \
-p "Prompt"
GPU Acceleration
# Offload all layers to GPU
./llama-cli -m model.gguf -ngl 99 -p "Fast inference"
# Offload specific number of layers
./llama-cli -m model.gguf -ngl 32 -p "Partial GPU"
# Split across multiple GPUs
# -ts 3,1 splits tensors in a 3:1 ratio across 2 GPUs
./llama-cli -m model.gguf \
-ngl 99 \
-sm layer \
-ts 3,1 \
-p "Multi-GPU inference"
Advanced Features
LoRA Adapters
# Load single LoRA adapter
./llama-cli -m model.gguf --lora adapter.gguf -p "Specialized task"
# Load multiple adapters with scaling
./llama-cli -m model.gguf \
--lora-scaled adapter1.gguf 0.5 \
--lora-scaled adapter2.gguf 1.0 \
-p "Multi-adapter inference"
Control Vectors
# Apply control vector to influence model behavior
./llama-cli -m model.gguf \
--control-vector vector.gguf \
--control-vector-layer-range 0 20 \
-p "Generate with control"
Reasoning Models
# Enable reasoning/thinking for models like DeepSeek-R1
./llama-cli -m deepseek-r1.gguf \
--reasoning-format deepseek \
--reasoning-budget -1 \
-p "Solve this problem step by step"
Threading
# Set CPU threads for generation
./llama-cli -m model.gguf -t 8 -p "Prompt"
# Different threads for batch processing
./llama-cli -m model.gguf -t 8 -tb 16 -p "Prompt"
Memory Optimization
# Keep model in RAM (prevent swapping)
./llama-cli -m model.gguf --mlock -p "Prompt"
# Disable memory mapping
./llama-cli -m model.gguf --no-mmap -p "Prompt"
# Use Flash Attention
./llama-cli -m model.gguf -fa on -p "Prompt"
KV Cache Configuration
# Quantize KV cache for memory savings
# -ctk sets the K cache type, -ctv the V cache type
./llama-cli -m model.gguf \
-ctk q8_0 \
-ctv q8_0 \
-p "Prompt"
Common Use Cases
Code Completion
./llama-cli -m codellama.gguf \
-p "def fibonacci(n):\n    \"\"\"Calculate the nth Fibonacci number.\"\"\"\n    " \
--temp 0.2 \
--repeat-penalty 1.1 \
-n 200
Creative Writing
./llama-cli -m creative-model.gguf \
-cnv \
-sys "You are a creative storytelling assistant." \
-p "Write a short story about a robot learning to paint" \
--temp 0.9 \
--top-p 0.95
Structured Data Extraction
./llama-cli -m model.gguf \
-p "Extract product info: 'iPhone 15 Pro costs $999, available in blue'" \
-j '{"type":"object","properties":{"product":{"type":"string"},"price":{"type":"number"},"color":{"type":"string"}}}' \
--temp 0.1
Output Control
Display Options
# Disable prompt display
./llama-cli -m model.gguf --no-display-prompt -p "Hidden prompt"
# Disable colored output
./llama-cli -m model.gguf -co off -p "No colors"
# Show timing information
./llama-cli -m model.gguf --show-timings -p "Benchmark"
# Include special tokens in output
./llama-cli -m model.gguf -sp -p "Show all tokens"
Simple I/O Mode
# Use basic I/O for better subprocess compatibility
./llama-cli -m model.gguf --simple-io -p "Pipe-friendly output"
Environment Variables
Many parameters can be set via environment variables:
# Set model path
export LLAMA_ARG_MODEL="models/model.gguf"
# Set context size
export LLAMA_ARG_CTX_SIZE=4096
# Set GPU layers
export LLAMA_ARG_N_GPU_LAYERS=99
# Then run without flags
./llama-cli -p "Using env vars"
Command-line arguments take precedence over environment variables when both are set.
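For example, in the following run the -c flag takes precedence, so the context size is 8192 (values illustrative):

```shell
export LLAMA_ARG_CTX_SIZE=4096
./llama-cli -m model.gguf -c 8192 -p "Prompt"  # -c overrides LLAMA_ARG_CTX_SIZE
```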
Logging
# Set verbosity level (0-4, higher = more verbose)
./llama-cli -m model.gguf -lv 4 -p "Debug mode"
# Disable all logging
./llama-cli -m model.gguf --log-disable -p "Silent mode"
# Log to file
export LLAMA_LOG_FILE="llama.log"
./llama-cli -m model.gguf -p "Logged run"
# Enable timestamps in logs
./llama-cli -m model.gguf --log-timestamps -p "Timestamped logs"
See Also
Server: OpenAI-compatible API server
Embeddings: Generate text embeddings
Multimodal: Vision and audio models
Speculative Decoding: Accelerate generation with draft models