# Performance Tuning
Optimize llama.cpp inference performance across CPU, GPU, and hybrid configurations.

## Quick Wins
- **Use GPU**: offload layers to the GPU with `--n-gpu-layers`
- **Optimize Threads**: set `--threads` to the number of physical CPU cores
- **Choose Quantization**: use Q4_K_M or Q5_K_M for the best speed/quality balance
- **Adjust Context**: reduce `--ctx-size` to the minimum needed

## GPU Acceleration
### CUDA (NVIDIA)

Offload model layers to the GPU with `--n-gpu-layers`.

### Metal (Apple Silicon)

Metal is enabled by default on macOS.

### ROCm (AMD)

ROCm builds use the same offload flags as the CUDA build.
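Illustrative invocations for each backend (a sketch: the `llama-cli` binary name and model path are placeholders; `--n-gpu-layers` behaves the same across CUDA, Metal, and ROCm builds):

```shell
# CUDA build (NVIDIA): offload all layers that fit
./llama-cli -m model.gguf --n-gpu-layers 99 -p "Hello"

# Metal build (macOS): backend is on by default; --n-gpu-layers still controls offload
./llama-cli -m model.gguf --n-gpu-layers 99 -p "Hello"

# ROCm build (AMD): backend chosen at compile time; same runtime flags as CUDA
./llama-cli -m model.gguf --n-gpu-layers 99 -p "Hello"
```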
## Thread Configuration

### Finding Optimal Thread Count

Start conservative:

- CPU-only: physical CPU cores (not logical/hyperthreaded cores)
- With GPU: 4-8 threads regardless of core count
- Server (parallel requests): 2-4 threads per request
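These guidelines can be sketched as follows (an assumed 8-physical-core machine with GPU offload; binary name and model path are placeholders):

```shell
# GPU does most of the work, so generation threads stay modest;
# prompt (batch) processing gets the full physical core count
./llama-cli -m model.gguf \
  --n-gpu-layers 99 \
  --threads 4 \
  --threads-batch 8
```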
Check the physical core count (Linux, macOS, and Windows each have their own command):
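For example (the non-Linux commands are shown as comments; only the Linux line runs as-is):

```shell
# Linux: logical CPUs (includes hyperthreads)
nproc

# Linux: physical cores = "Core(s) per socket" x "Socket(s)" reported by:
#   lscpu

# macOS:
#   sysctl -n hw.physicalcpu

# Windows (PowerShell):
#   (Get-CimInstance Win32_Processor).NumberOfCores
```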
### Batch Thread Configuration

`--threads-batch` sets a separate thread count for prompt (batch) processing; it defaults to the same value as `--threads`.

## Context Size Optimization
Context size directly impacts:

- Memory usage (RAM/VRAM)
- Inference speed
- Maximum conversation length
Only use large context (>4096) when absolutely necessary. Most tasks work well with 2048.
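For instance (binary name and model path are placeholders):

```shell
# 2048 tokens of context is enough for most single-turn tasks
./llama-cli -m model.gguf --ctx-size 2048
```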
## Batch Size Tuning

Logical batch size (prompt-processing parallelism):

- Larger batch = faster prompt processing, but more memory
- CPU: 512-2048
- GPU: 512-2048 (depends on VRAM)
- Server: 2048+ for parallel requests
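A sketch (the physical micro-batch flag `--ubatch-size` is assumed from a standard llama.cpp build; paths are placeholders):

```shell
# Larger logical batch speeds up prompt processing but uses more memory
./llama-cli -m model.gguf --batch-size 2048 --ubatch-size 512
```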
## Flash Attention

Enables more efficient attention computation. Flash Attention defaults to `auto` and is used when beneficial; enable it explicitly with `--flash-attn on`.

## Quantization Selection
| Quantization | Speed | Quality | Use Case |
|---|---|---|---|
| Q2_K | Fastest | Lowest | Experimentation |
| Q3_K_M | Very Fast | Low | Resource-constrained |
| Q4_K_M | Fast | Good | Recommended default |
| Q5_K_M | Moderate | Very Good | Quality-focused |
| Q6_K | Slower | Excellent | Near-original quality |
| Q8_0 | Slowest | Highest | Reference/evaluation |
## Benchmark Example

Real-world benchmark on an NVIDIA A6000 (48 GB VRAM), 7-core CPU, 30B Q4_0 model:

| Configuration | Tokens/sec |
|---|---|
| GPU only, wrong threads | <0.1 |
| CPU only (-t 7) | 1.7 |
| GPU + 1 thread | 5.5 |
| GPU + 7 threads | 8.7 |
| GPU + 4 threads | 9.1 |
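Numbers like these can be collected with llama.cpp's `llama-bench` tool, which sweeps comma-separated parameter values in a single run (model path is a placeholder):

```shell
# Compare thread counts (4 vs 7) and GPU offload (0 vs 99 layers),
# with a 512-token prompt and 128 generated tokens
./llama-bench -m model.gguf -t 4,7 -ngl 0,99 -p 512 -n 128
```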
## Hybrid CPU+GPU Inference

For models larger than available VRAM, offload only part of the model, for example:

- 40 layers on GPU
- Remaining layers on CPU
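For example, a sketch for a model that does not fully fit in VRAM (layer count, binary name, and path are illustrative):

```shell
# 40 layers on the GPU, the remainder on the CPU
./llama-cli -m model-30b.gguf --n-gpu-layers 40 --threads 8
```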
## Memory Optimization

### Memory Mapping

Memory mapping (mmap) is enabled by default and recommended; it lets the OS page model weights in on demand. Disable it only with `--no-mmap` if needed.

### Memory Locking

Use `--mlock` to prevent model weights from being swapped out (requires sufficient RAM).

## Server Performance
### Parallel Request Handling

- `--n-parallel`: number of simultaneous requests (2-8)
- `--threads`: threads per request (2-4 recommended)
- `--batch-size`: must be ≥ ctx-size × n-parallel
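A sketch of a server launch using these flags (flag spellings as used in this guide; confirm against `llama-server --help` for your build):

```shell
# 4 parallel slots with a 2048-token context each;
# batch size 8192 satisfies 2048 x 4
./llama-server -m model.gguf \
  --ctx-size 2048 \
  --n-parallel 4 \
  --threads 2 \
  --batch-size 8192
```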
### Continuous Batching

Enabled by default (`--cont-batching`); improves throughput for concurrent requests.

## Platform-Specific Tips
- NVIDIA GPU: offload as many layers as VRAM allows; for multi-GPU setups, `--tensor-split` distributes the model across devices
- Apple Silicon: Metal is enabled by default; unified memory lets the GPU address system RAM
- AMD GPU (ROCm): use a ROCm/HIP build; runtime flags match the CUDA build
- CPU-only: match `--threads` to physical cores and prefer smaller quantizations
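For instance, a multi-GPU sketch (the split ratio and paths are illustrative; `--tensor-split` is assumed available in your build):

```shell
# Full offload, tensors split 3:1 across two GPUs
./llama-cli -m model.gguf --n-gpu-layers 99 --tensor-split 3,1
```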
## Profiling and Monitoring

### Built-in Performance Stats

Timing statistics are printed at the end of each run, including:

- Prompt evaluation time
- Token generation time
- Tokens per second
### Server Metrics

Query the server's metrics endpoint for:

- Request counts
- Processing times
- KV cache usage
- Queue statistics
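For example, assuming a `llama-server` build with the Prometheus-format `--metrics` endpoint and the default port:

```shell
# Start the server with the metrics endpoint enabled
./llama-server -m model.gguf --metrics &

# Scrape Prometheus-format metrics
curl http://localhost:8080/metrics
```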
### Benchmark Tool

Use `llama-bench`, bundled with llama.cpp, for systematic performance testing.

## Common Performance Issues
### Very slow generation (<1 tok/s)

Likely causes:

- Too many threads (oversaturation)
- No GPU acceleration
- Context size too large

Fixes:

- Set `--threads 1` and gradually increase
- Enable GPU layers: `--n-gpu-layers 32`
- Reduce context: `--ctx-size 2048`
### Out of memory errors

Solutions:

- Use a smaller quantization (Q4_K_M instead of Q8_0)
- Reduce context size: `--ctx-size 1024`
- Reduce batch size: `--batch-size 256`
- Offload fewer layers: `--n-gpu-layers 20`
- Keep mmap enabled (the default); avoid `--no-mmap`
### GPU underutilized

Check:

- Are layers offloaded? (check startup logs)
- Is the batch size large enough? Try 512 or 1024
- Are you using an optimal quantization? (Q4_K_M recommended)
### Server slow with multiple requests

Solutions:

- Increase parallelism: `--n-parallel 8`
- Ensure batch size ≥ ctx-size × n-parallel
- Reduce per-request threads: `--threads 2`
- Enable continuous batching: `--cont-batching`
## Advanced Optimizations

### CPU Affinity

Bind inference threads to specific cores, e.g. with `taskset -c 0-7` on Linux.

### Process Priority

Increase the process priority with `--prio`: -1 (low), 0 (normal), 1 (medium), 2 (high), 3 (realtime).

### Polling Level

Reduce latency with busy-waiting via `--poll` (0-100).

## Next Steps
- **Quantization Guide**: learn about quantization types and tradeoffs
- **Backend Configuration**: configure GPU backends for your hardware
- **Benchmarking**: measure and compare performance
- **Server Tuning**: optimize the server for production

