Overview

llama-cli is an interactive CLI tool for accessing and experimenting with most of llama.cpp’s functionality. It provides a straightforward way to run text generation, chat conversations, and test model parameters from the command line.

Basic Usage

llama-cli -m my_model.gguf

Key Features

  • Conversation Mode: Automatically activates for models with built-in chat templates
  • Custom Grammars: Constrain output with BNF-like grammar rules
  • Speculative Decoding: Use draft models to accelerate generation
  • Multimodal Support: Process images and audio with compatible models
  • Context Management: Automatic context shifting for infinite text generation

Common Parameters

Model Loading

-m, --model
string
Path to the GGUF model file to load. Can also be set via the LLAMA_ARG_MODEL environment variable.
-hf, --hf-repo
string
Hugging Face model repository in the format <user>/<model>[:quant]. Quant is optional and defaults to Q4_K_M. Automatically downloads mmproj if available. Example: unsloth/phi-4-GGUF:q4_k_m
--hf-file
string
Specific file to use from the Hugging Face repository; overrides the quant in --hf-repo.
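
For example, a model can be pulled straight from Hugging Face on first use (the file is downloaded and cached automatically):
# Download a quantized model from Hugging Face, then prompt it
llama-cli -hf unsloth/phi-4-GGUF:q4_k_m -p "Hello"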

Generation Settings

-p, --prompt
string
Prompt text to start generation with. For system messages, use -sys instead.
-f, --file
string
Path to a file containing the prompt to use.
-n, --predict
integer
default:"-1"
Number of tokens to predict; -1 means no limit. Can be set via the LLAMA_ARG_N_PREDICT environment variable.
-c, --ctx-size
integer
default:"0"
Size of the prompt context. 0 means use the value stored in the model. Can be set via the LLAMA_ARG_CTX_SIZE environment variable.
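
Putting these together, a typical one-shot run might look like this (file name and sizes are illustrative):
# Read the prompt from a file, cap output at 256 tokens, use a 4096-token context
llama-cli -m model.gguf -f prompt.txt -n 256 -c 4096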

Conversation Mode

-cnv, --conversation
boolean
default:"auto"
Run in conversation mode. Auto-enabled if a chat template is available. In this mode:
  • Special tokens and suffix/prefix are not printed
  • Interactive mode is enabled
--chat-template
string
Set a custom Jinja chat template. Built-in templates include: llama3, llama2, chatml, mistral-v3, phi3, phi4, gemma, deepseek, deepseek2, deepseek3, and many more.
-sys, --system-prompt
string
System prompt to use with model (if supported by chat template).
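
Combining these, a conversation with a system prompt and an explicit template might look like the following (the system prompt text is illustrative):
# Force conversation mode with a built-in template and a system prompt
llama-cli -m model.gguf -cnv \
  --chat-template chatml \
  -sys "You are a concise assistant. Answer in one sentence."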

Sampling Parameters

--temp, --temperature
float
default:"0.8"
Sampling temperature. Higher values increase randomness.
--top-k
integer
default:"40"
Top-k sampling. 0 disables it. Can be set via the LLAMA_ARG_TOP_K environment variable.
--top-p
float
default:"0.95"
Top-p (nucleus) sampling. 1.0 disables it.
--min-p
float
default:"0.05"
Min-p sampling. 0.0 disables it.
--repeat-penalty
float
default:"1.0"
Penalize repeat sequences of tokens. 1.0 means disabled.
-s, --seed
integer
default:"-1"
RNG seed for reproducible generation. -1 uses random seed.
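
The sampling flags compose freely, and fixing the seed makes runs repeatable when comparing settings (values below are illustrative):
# More deterministic output: low temperature, tighter top-k/top-p, fixed seed
llama-cli -m model.gguf -p "Translate to French: cheese" \
  --temp 0.2 --top-k 20 --top-p 0.9 -s 42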

Performance & Hardware

-t, --threads
integer
default:"auto"
Number of CPU threads to use during generation. Can be set via the LLAMA_ARG_THREADS environment variable.
-ngl, --n-gpu-layers
string
default:"auto"
Number of layers to offload to the GPU. Can be a number, 'auto', or 'all'. Can be set via the LLAMA_ARG_N_GPU_LAYERS environment variable.
-b, --batch-size
integer
default:"2048"
Logical maximum batch size. Can be set via the LLAMA_ARG_BATCH environment variable.
-fa, --flash-attn
string
default:"auto"
Flash Attention setting: 'on', 'off', or 'auto'. Can be set via the LLAMA_ARG_FLASH_ATTN environment variable.
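
On a machine with a supported GPU, a tuned invocation might look like this (thread count and offload depend on your hardware):
# Offload all layers to the GPU, force Flash Attention on, use 8 CPU threads
llama-cli -m model.gguf -ngl all -fa on -t 8 -p "Hello"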

Usage Examples

Interactive Conversation

Start conversation mode

Models with built-in chat templates automatically activate conversation mode:
llama-cli -m model.gguf

# > hi, who are you?
# Hi there! I'm your helpful assistant! ...
#
# > what is 1+1?
# Easy peasy! The answer to 1+1 is... 2!
Custom chat template

Use a specific template or define custom prefixes:
# Use built-in template
llama-cli -m model.gguf -cnv --chat-template chatml

# Custom prefix
llama-cli -m model.gguf -cnv --in-prefix 'User: ' --reverse-prompt 'User:'

Constrained Generation with Grammar

Constrain output to follow specific formats using GBNF grammars:
llama-cli -m model.gguf -n 256 \
  --grammar-file grammars/json.gbnf \
  -p 'Request: schedule a call at 8pm; Command:'

# Output: {"appointmentTime": "8pm", "appointmentDetails": "schedule a a call"}
The grammars/ folder contains sample grammars. You can also use JSON schemas:
llama-cli -m model.gguf \
  -j '{"type": "object", "properties": {"name": {"type": "string"}}}' \
  -p 'Generate a person:'
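
Short grammars can also be passed inline with --grammar rather than a file; the toy GBNF rule below constrains the model to a yes/no answer:
# Inline GBNF: the root rule only admits the literal strings "yes" or "no"
llama-cli -m model.gguf \
  --grammar 'root ::= "yes" | "no"' \
  -p 'Is the sky blue? Answer:'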

Speculative Decoding

Accelerate generation with a draft model:
llama-cli -m model.gguf -md draft.gguf
The draft model should be a smaller, faster variant of the target model for best results.

Multimodal Usage

Process images or audio with vision/audio models:
llama-cli -m vision-model.gguf \
  --mmproj vision-projector.gguf \
  --image path/to/image.jpg \
  -p "Describe this image:"

Single-Turn Generation

Generate a single response without interactive mode:
llama-cli -m model.gguf \
  -p "Write a haiku about coding:" \
  -n 50 \
  --single-turn

Advanced Features

Context Management

--context-shift
boolean
default:"false"
Enable context shift for infinite text generation. When the context is full, old tokens are shifted out.
--keep
integer
default:"0"
Number of tokens to keep from the initial prompt when the context fills up. Use -1 to keep all tokens.
--ctx-checkpoints
integer
default:"8"
Maximum number of context checkpoints to create per slot for state-based context (SWA).
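
These options combine naturally for long-running generation (token counts illustrative):
# Generate indefinitely; when the 4096-token context fills, shift out old
# tokens but always keep the first 128 prompt tokens
llama-cli -m model.gguf -c 4096 -n -1 \
  --context-shift --keep 128 -f long_prompt.txt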

LoRA Adapters

Load LoRA adapters to modify model behavior:
# Single adapter
llama-cli -m model.gguf --lora adapter.gguf

# Multiple adapters with per-adapter scaling (the flag is repeated)
llama-cli -m model.gguf \
  --lora-scaled adapter1.gguf 0.5 \
  --lora-scaled adapter2.gguf 1.0

Control Vectors

Apply control vectors to steer model behavior:
llama-cli -m model.gguf \
  --control-vector-scaled happiness.gguf 0.8 \
  --control-vector-layer-range 10 30

Output and Logging

-v, --verbose
boolean
Enable verbose logging (log all messages).
--log-file
string
Path of a file to write log output to. Can be set via the LLAMA_LOG_FILE environment variable.
--no-display-prompt
boolean
Don’t print the prompt at generation time.
--show-timings
boolean
default:"true"
Show timing information after each response. Can be set via the LLAMA_ARG_SHOW_TIMINGS environment variable.
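
A quiet run that captures diagnostics for later inspection might look like this (the log path is illustrative):
# Suppress prompt echo; write verbose logs to a file
llama-cli -m model.gguf -p "Hello" \
  --no-display-prompt -v --log-file run.log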

Environment Variables

Many parameters can be set via environment variables:
export LLAMA_ARG_MODEL=/path/to/model.gguf
export LLAMA_ARG_CTX_SIZE=4096
export LLAMA_ARG_N_GPU_LAYERS=35
export LLAMA_ARG_THREADS=8

llama-cli -p "Hello world"

Performance Tips

  • Use --flash-attn on for faster attention computation on supported hardware
  • Increase --batch-size for better throughput with longer prompts
  • Enable --mlock to prevent model from being swapped out of RAM
  • Use quantized models (Q4_K_M, Q5_K_M) for faster inference with minimal quality loss
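
Combining these tips into a single invocation (values are illustrative; tune for your hardware):
# Quantized model, full GPU offload, Flash Attention, larger batch, model locked in RAM
llama-cli -hf unsloth/phi-4-GGUF:q4_k_m \
  -ngl all -fa on -b 4096 --mlock \
  -p "Summarize the following text:"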

See Also