The llama-cli tool provides a powerful command-line interface for running LLM inference. It supports interactive conversations, structured outputs, and various sampling configurations.
Quick Start
Load a model
Start llama-cli with your GGUF model: ./llama-cli -m models/model.gguf
Add a prompt
Generate text from a prompt: ./llama-cli -m models/model.gguf -p "Hello, how are you?"
Enable conversation mode
Use -cnv for interactive chat: ./llama-cli -m models/model.gguf -cnv
Loading Models
From Local Files
# Load a local GGUF model
./llama-cli -m path/to/model.gguf
From Hugging Face
# Download and load from Hugging Face (defaults to Q4_K_M quantization)
./llama-cli -hf unsloth/phi-4-GGUF
# Specify quantization level
./llama-cli -hf unsloth/phi-4-GGUF:q8_0
When using -hf, llama-cli automatically downloads the model and any associated multimodal projectors (like for vision models). Use --no-mmproj to disable automatic projector loading.
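As an illustration, a sketch of pulling a vision-capable model while skipping its projector (the repository name here is illustrative, not from this page):

```shell
# Fetch from Hugging Face, but skip the multimodal projector download
./llama-cli -hf ggml-org/gemma-3-4b-it-GGUF --no-mmproj -p "Hello"
```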
From Docker Hub
# Load from Docker Hub repository
./llama-cli -dr ai/gemma3
Conversation Mode
Conversation mode provides an interactive chat interface that automatically formats messages using the model’s chat template.
Basic Conversation
# Enable conversation mode (auto-enabled for chat models)
./llama-cli -m models/chat-model.gguf -cnv
# With a system prompt
./llama-cli -m models/chat-model.gguf -cnv -sys "You are a helpful coding assistant."
Single-Turn Conversation
# Run a single conversation turn and exit
./llama-cli -m models/model.gguf -cnv -st -p "What is the capital of France?"
Conversation Options
-cnv, --conversation
Enable conversation mode. Automatically enabled if the model has a chat template.
-sys, --system-prompt PROMPT
System prompt to use with the model (if applicable based on the chat template).
-st, --single-turn
Run the conversation for a single turn only, then exit when done.
-r, --reverse-prompt PROMPT
Halt generation at this prompt and return control in interactive mode.
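Putting these options together, a sketch of a scripted chat session (model path, system prompt, and reverse prompt are illustrative):

```shell
# Single-turn chat with a system prompt; generating "User:" halts output
./llama-cli -m models/chat-model.gguf \
  -cnv -st \
  -sys "You are a terse assistant." \
  -r "User:" \
  -p "Summarize GGUF in one sentence."
```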
Prompting
Direct Prompts
# Simple prompt
./llama-cli -m model.gguf -p "Write a haiku about programming"
# Longer prompt with escape sequences
./llama-cli -m model.gguf -p "Line 1\nLine 2\nLine 3"
# Disable escape sequence processing
./llama-cli -m model.gguf --no-escape -p "Keep \n literal"
From Files
# Load prompt from text file
./llama-cli -m model.gguf -f prompt.txt
# Load prompt from binary file
./llama-cli -m model.gguf -bf prompt.bin
# Load system prompt from file
./llama-cli -m model.gguf -cnv -sysf system_prompt.txt
# Enable multiline input mode (no need for \ at line endings)
./llama-cli -m model.gguf -cnv -mli
Grammar Constraints
Constrain model outputs to follow specific formats using BNF-like grammars or JSON schemas.
Using Grammar Files
# Use a BNF grammar to constrain output
./llama-cli -m model.gguf -p "Generate a list" --grammar-file grammars/list.gbnf
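For reference, a grammars/list.gbnf file could contain something like the following (these contents are an assumption, shown only to illustrate GBNF rule syntax):

```
# Each list item is a dash, lowercase words, and a newline
root ::= item+
item ::= "- " [a-z ]+ "\n"
```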
Inline Grammar
# Inline grammar definition
./llama-cli -m model.gguf -p "Count to 5" --grammar 'root ::= [1-5] (" " [1-5])*'
JSON Schema
# Constrain to JSON object matching a schema
./llama-cli -m model.gguf -p "Generate user data" \
-j '{"type":"object","properties":{"name":{"type":"string"},"age":{"type":"number"}},"required":["name","age"]}'
# Load schema from file
./llama-cli -m model.gguf -p "Generate user data" -jf schema.json
Example: List Grammar
./llama-cli -m model.gguf \
-p "Generate a shopping list" \
--grammar 'root ::= ("- " [a-z]+ "\n")+'
For JSON schemas with external $refs, use --grammar combined with the json_schema_to_grammar.py conversion script instead.
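A sketch of that conversion workflow (the script ships with llama.cpp; the examples/ path is an assumption about your checkout layout):

```shell
# Expand $refs by converting the schema to a GBNF grammar first
python examples/json_schema_to_grammar.py schema.json > schema.gbnf
./llama-cli -m model.gguf -p "Generate user data" --grammar-file schema.gbnf
```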
Sampling Parameters
Control how tokens are generated with various sampling strategies.
Temperature and Top-K/Top-P
# Adjust temperature (higher = more random)
./llama-cli -m model.gguf -p "Write a story" --temp 0.9
# Top-k sampling (limit to k most likely tokens)
./llama-cli -m model.gguf -p "Complete: The sky is" --top-k 20
# Top-p/nucleus sampling (cumulative probability threshold)
./llama-cli -m model.gguf -p "Generate text" --top-p 0.9
# Min-p sampling (minimum probability threshold)
./llama-cli -m model.gguf -p "Generate text" --min-p 0.1
Repetition Control
# Penalize repetition
./llama-cli -m model.gguf -p "Tell me about" \
--repeat-penalty 1.2 \
--repeat-last-n 128
# DRY (Don't Repeat Yourself) sampling
./llama-cli -m model.gguf -p "Write code" \
--dry-multiplier 0.8 \
--dry-base 1.75
Advanced Sampling
# Mirostat sampling (perplexity control)
./llama-cli -m model.gguf -p "Generate" \
--mirostat 2 \
--mirostat-lr 0.1 \
--mirostat-ent 5.0
# Dynamic temperature
./llama-cli -m model.gguf -p "Write" \
--dynatemp-range 0.5 \
--dynatemp-exp 1.0
Sampler Order
# Customize sampler sequence
./llama-cli -m model.gguf -p "Text" \
--samplers "top_k;top_p;temperature"
# Or use simplified sequence notation
./llama-cli -m model.gguf -p "Text" \
--sampler-seq "kymt"
Context and Generation Control
Context Window
# Set context size (0 = use model default)
./llama-cli -m model.gguf -c 4096 -p "Long prompt..."
# Keep specific tokens from initial prompt
./llama-cli -m model.gguf --keep 10 -p "Important context here..."
# Enable context shift for infinite generation
./llama-cli -m model.gguf --context-shift -p "Start"
Generation Length
# Generate specific number of tokens (-1 = infinite)
./llama-cli -m model.gguf -n 100 -p "Generate exactly 100 tokens"
# Stop at specific words
./llama-cli -m model.gguf -p "Count: " -r "ten"
Batch Processing
# Adjust batch sizes for performance
# -b sets the logical batch size, -ub the physical batch size
./llama-cli -m model.gguf \
-b 2048 \
-ub 512 \
-p "Prompt"
GPU Acceleration
# Offload all layers to GPU
./llama-cli -m model.gguf -ngl 99 -p "Fast inference"
# Offload specific number of layers
./llama-cli -m model.gguf -ngl 32 -p "Partial GPU"
# Split across multiple GPUs
# -ts 3,1 splits tensors in a 3:1 ratio across 2 GPUs
./llama-cli -m model.gguf \
-ngl 99 \
-sm layer \
-ts 3,1 \
-p "Multi-GPU inference"
Advanced Features
LoRA Adapters
# Load single LoRA adapter
./llama-cli -m model.gguf --lora adapter.gguf -p "Specialized task"
# Load multiple adapters with scaling
./llama-cli -m model.gguf \
--lora-scaled adapter1.gguf 0.5 \
--lora-scaled adapter2.gguf 1.0 \
-p "Multi-adapter inference"
Control Vectors
# Apply control vector to influence model behavior
./llama-cli -m model.gguf \
--control-vector vector.gguf \
--control-vector-layer-range 0 20 \
-p "Generate with control"
Reasoning Models
# Enable reasoning/thinking for models like DeepSeek-R1
./llama-cli -m deepseek-r1.gguf \
--reasoning-format deepseek \
--reasoning-budget -1 \
-p "Solve this problem step by step"
Threading
# Set CPU threads for generation
./llama-cli -m model.gguf -t 8 -p "Prompt"
# Different threads for batch processing
./llama-cli -m model.gguf -t 8 -tb 16 -p "Prompt"
Memory Optimization
# Keep model in RAM (prevent swapping)
./llama-cli -m model.gguf --mlock -p "Prompt"
# Disable memory mapping
./llama-cli -m model.gguf --no-mmap -p "Prompt"
# Use Flash Attention
./llama-cli -m model.gguf -fa on -p "Prompt"
KV Cache Configuration
# Quantize KV cache for memory savings
# -ctk sets the K cache type, -ctv the V cache type
./llama-cli -m model.gguf \
-ctk q8_0 \
-ctv q8_0 \
-p "Prompt"
Common Use Cases
Code Completion
./llama-cli -m codellama.gguf \
-p "def fibonacci(n):\n    \"\"\"Calculate the nth Fibonacci number.\"\"\"\n    " \
--temp 0.2 \
--repeat-penalty 1.1 \
-n 200
Creative Writing
./llama-cli -m creative-model.gguf \
-cnv \
-sys "You are a creative storytelling assistant." \
-p "Write a short story about a robot learning to paint" \
--temp 0.9 \
--top-p 0.95
Structured Data Extraction
./llama-cli -m model.gguf \
-p "Extract product info: 'iPhone 15 Pro costs $999, available in blue'" \
-j '{"type":"object","properties":{"product":{"type":"string"},"price":{"type":"number"},"color":{"type":"string"}}}' \
--temp 0.1
Output Control
Display Options
# Disable prompt display
./llama-cli -m model.gguf --no-display-prompt -p "Hidden prompt"
# Disable colored output
./llama-cli -m model.gguf -co off -p "No colors"
# Show timing information
./llama-cli -m model.gguf --show-timings -p "Benchmark"
# Include special tokens in output
./llama-cli -m model.gguf -sp -p "Show all tokens"
Simple I/O Mode
# Use basic I/O for better subprocess compatibility
./llama-cli -m model.gguf --simple-io -p "Pipe-friendly output"
Environment Variables
Many parameters can be set via environment variables:
# Set model path
export LLAMA_ARG_MODEL="models/model.gguf"
# Set context size
export LLAMA_ARG_CTX_SIZE=4096
# Set GPU layers
export LLAMA_ARG_N_GPU_LAYERS=99
# Then run without flags
./llama-cli -p "Using env vars"
Command-line arguments take precedence over environment variables when both are set.
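For example, in the following run the -c flag takes precedence, so the context size is 8192 (values illustrative):

```shell
export LLAMA_ARG_CTX_SIZE=4096
./llama-cli -m model.gguf -c 8192 -p "Prompt"  # -c overrides LLAMA_ARG_CTX_SIZE
```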
Logging
# Set verbosity level (0-4, higher = more verbose)
./llama-cli -m model.gguf -lv 4 -p "Debug mode"
# Disable all logging
./llama-cli -m model.gguf --log-disable -p "Silent mode"
# Log to file
export LLAMA_LOG_FILE="llama.log"
./llama-cli -m model.gguf -p "Logged run"
# Enable timestamps in logs
./llama-cli -m model.gguf --log-timestamps -p "Timestamped logs"
See Also
Server: OpenAI-compatible API server
Embeddings: Generate text embeddings
Multimodal: Vision and audio models
Speculative Decoding: Accelerate generation with draft models