llama.cpp Architecture
llama.cpp is designed as a minimal, efficient C/C++ implementation for large language model inference. The architecture prioritizes simplicity, portability, and performance.
Design Philosophy
Minimal Dependencies: Pure C/C++ with no external dependencies for core functionality
Hardware Agnostic: Runs efficiently on CPUs, GPUs, and specialized accelerators
Memory Efficient: Optimized memory management with support for memory mapping and quantization
Production Ready: Battle-tested codebase used by millions through tools like Ollama, LM Studio, and GPT4All
High-Level Architecture
┌─────────────────────────────────────────────────────────────┐
│ Application Layer │
│ (llama-cli, llama-server, llama-simple, custom apps) │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ llama.cpp Library │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ llama_model │ │llama_context │ │llama_sampler │ │
│ │ │ │ │ │ │ │
│ │ • Model Load │ │ • Inference │ │ • Token │ │
│ │ • Tensors │ │ • KV Cache │ │ Selection │ │
│ │ • Metadata │ │ • Batch │ │ • Sampling │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ GGML Library │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Compute Graph│ │ Tensors │ │ Backends │ │
│ │ │ │ │ │ │ │
│ │ • Operations │ │ • Data Types │ │ • CPU │ │
│ │ • Auto-diff │ │ • Quantized │ │ • CUDA │ │
│ │ • Scheduling │ │ • Memory Mgmt│ │ • Metal │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Hardware Abstraction │
│ CPU │ CUDA │ Metal │ Vulkan │ SYCL │ OpenCL │ ... │
└─────────────────────────────────────────────────────────────┘
Core Components
1. GGML Tensor Library
Purpose: Low-level tensor operations and compute graph execution.
Key Features:
Automatic differentiation
Computation graph building and execution
Multi-dimensional tensor operations
Backend abstraction layer
Memory-efficient tensor storage
Key Files:
ggml/include/ggml.h - Core tensor library API
ggml/include/ggml-backend.h - Backend abstraction
ggml/src/ggml.c - Tensor operations implementation
GGML (Georgi Gerganov Machine Learning) is a general-purpose tensor library. llama.cpp serves as the main playground for developing GGML features.
Example: Building a Computation Graph
// Allocate context
struct ggml_init_params params = {
    .mem_size   = 16 * 1024 * 1024,
    .mem_buffer = NULL,
};
struct ggml_context * ctx = ggml_init(params);

// Create tensors
struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);

// Build computation graph: f(x) = a*x^2 + b
struct ggml_tensor * x2 = ggml_mul(ctx, x, x);
struct ggml_tensor * f  = ggml_add(ctx, ggml_mul(ctx, a, x2), b);

// Execute
struct ggml_cgraph * gf = ggml_new_graph(ctx);
ggml_build_forward_expand(gf, f);

const int n_threads = 4;
ggml_graph_compute_with_ctx(ctx, gf, n_threads);
2. GGUF File Format
Purpose: Binary format for storing models with metadata and quantized weights.
Key Features:
Self-describing format with embedded metadata
Multiple quantization formats (1.5-bit to 16-bit)
Extensible key-value metadata system
Memory-mappable for efficient loading
Single-file model distribution
Key Files:
ggml/include/gguf.h - GGUF format API
ggml/src/gguf.c - GGUF implementation
See GGUF Format Documentation for details.
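To make the "self-describing format" concrete, here is a minimal sketch of parsing the fixed-size GGUF header (magic "GGUF", uint32 version, uint64 tensor count, uint64 metadata KV count). The struct and helper names are ours, not the real gguf.c API:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative mirror of the fixed 24-byte GGUF header layout.
struct gguf_header_sketch {
    char     magic[4];   // must be "GGUF"
    uint32_t version;    // format version (currently 3)
    uint64_t n_tensors;  // number of tensors in the file
    uint64_t n_kv;       // number of metadata key-value pairs
};

// Parse the header from a raw byte buffer; returns false on a bad magic.
inline bool parse_gguf_header(const std::vector<uint8_t> & buf, gguf_header_sketch & out) {
    if (buf.size() < 24) return false;
    std::memcpy(out.magic,      buf.data(),      4);
    std::memcpy(&out.version,   buf.data() + 4,  4);
    std::memcpy(&out.n_tensors, buf.data() + 8,  8);
    std::memcpy(&out.n_kv,      buf.data() + 16, 8);
    return std::memcmp(out.magic, "GGUF", 4) == 0;
}
```

After the header come the metadata key-value pairs and tensor descriptors, which is what makes a GGUF file loadable without any side-channel configuration.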
3. llama Library
Purpose: High-level LLM inference API built on top of GGML.
Key Components:
llama_model - Model Management
Handles model loading, weight storage, and metadata (structure simplified):
struct llama_model {
    // Model metadata
    llama_vocab        vocab;   // Tokenizer vocabulary
    llama_model_params params;  // Architecture parameters

    // Tensors
    std::vector<ggml_tensor *> tensors;

    // Backend devices
    std::vector<ggml_backend_dev_t>    devices;
    std::vector<ggml_backend_buffer_t> buffers;
};
Responsibilities:
Load GGUF files from disk
Initialize model weights and architecture
Manage memory allocation across backends
Provide model introspection (layer count, dimensions, etc.)
llama_context - Inference State
Manages inference state, including the KV cache and batch processing (structure simplified):
struct llama_context {
    llama_model * model;        // Reference to model

    // KV cache
    llama_kv_cache kv_self;     // Key-value attention cache

    // Batch processing
    llama_batch batch;          // Current input batch

    // Backend
    ggml_backend_sched_t sched; // Compute scheduler
    std::vector<ggml_backend_t> backends;
};
Responsibilities:
Maintain conversation context (KV cache)
Process input tokens in batches
Execute inference through backend scheduler
Manage context window and memory
llama_sampler - Token Selection
Handles token sampling strategies for generation (structure simplified):
struct llama_sampler {
    // Sampling parameters
    float   temp;   // Temperature
    float   top_p;  // Nucleus sampling
    int32_t top_k;  // Top-K sampling
    float   min_p;  // Min-P sampling

    // State
    llama_token_data_array candidates;
};
Responsibilities:
Apply temperature scaling
Filter tokens (top-k, top-p, min-p)
Apply repetition penalties
Sample next token from distribution
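The responsibilities above can be sketched as plain arithmetic. This standalone helper applies temperature scaling, top-k filtering, and a softmax; the real llama_sampler composes these stages as a chain, so this is an illustration of the math only:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

// Turn raw logits into a probability distribution after temperature
// scaling and top-k filtering. Masked tokens end up with probability 0.
std::vector<float> sample_probs(std::vector<float> logits, float temp, int top_k) {
    // 1. Temperature: divide logits by temp (temp < 1 sharpens, > 1 flattens)
    for (float & l : logits) l /= temp;

    // 2. Top-K: keep only the k largest logits, mask the rest to -inf
    std::vector<float> sorted = logits;
    std::sort(sorted.begin(), sorted.end(), std::greater<float>());
    const float cutoff = sorted[std::min<size_t>(top_k, sorted.size()) - 1];
    for (float & l : logits) {
        if (l < cutoff) l = -INFINITY;
    }

    // 3. Softmax: normalize surviving logits to probabilities
    const float maxl = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (float & l : logits) { l = std::exp(l - maxl); sum += l; }
    for (float & l : logits) l /= sum;
    return logits;
}
```

The final step, drawing a token from this distribution (or taking the argmax for greedy decoding), is a single weighted random choice.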
llama_vocab - Tokenization
Manages tokenization and vocabulary. Supported Tokenizer Types:
SPM (SentencePiece) - LLaMA, Mistral
BPE (Byte-Pair Encoding) - GPT-2, GPT-3
WPM (WordPiece) - BERT
UGM (Unigram) - T5
RWKV - Greedy tokenization
Responsibilities:
Encode text to token IDs
Decode token IDs to text
Handle special tokens (BOS, EOS, etc.)
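The encode direction can be illustrated with a toy greedy longest-match tokenizer. Real SPM/BPE tokenizers use merge rules, scores, and byte fallback; the vocabulary and IDs below are invented for the sketch:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Greedy longest-match tokenization over a fixed vocabulary: at each
// position, emit the ID of the longest vocab piece that matches.
std::vector<int> toy_tokenize(const std::string & text,
                              const std::vector<std::string> & vocab) {
    std::vector<int> ids;
    size_t pos = 0;
    while (pos < text.size()) {
        int    best     = -1;
        size_t best_len = 0;
        for (size_t i = 0; i < vocab.size(); ++i) {
            const std::string & piece = vocab[i];
            if (piece.size() > best_len &&
                text.compare(pos, piece.size(), piece) == 0) {
                best     = (int) i;
                best_len = piece.size();
            }
        }
        if (best < 0) { ++pos; continue; } // unknown byte: skipped here (real code uses byte fallback)
        ids.push_back(best);
        pos += best_len;
    }
    return ids;
}
```

Decoding is the inverse: concatenate the vocab pieces for each ID, which is why encode followed by decode round-trips the input.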
Key Files:
include/llama.h - Public C API
src/llama.cpp - Main implementation
src/llama-vocab.cpp - Tokenization
src/llama-context.cpp - Context management
src/llama-model.cpp - Model loading
Inference Pipeline
The complete flow from input text to generated output:
┌─────────────────────────────────────────────────────────────┐
│ 1. Input Text │
│ "Hello, how are you?" │
└────────────────────────┬────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ 2. Tokenization (llama_vocab) │
│ "Hello" → 15043, "," → 11, " how" → 1268, ... │
└────────────────────────┬────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ 3. Encode Batch (llama_encode) │
│ • Load tokens into batch │
│ • Process through transformer layers │
│ • Update KV cache with prompt │
└────────────────────────┬────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ 4. Generate Loop │
│ ┌────────────────────────────────────────────┐ │
│ │ 4a. Decode (llama_decode) │ │
│ │ • Process last token │ │
│ │ • Attention with KV cache │ │
│ │ • Get output logits │ │
│ └──────────────┬─────────────────────────────┘ │
│ ↓ │
│ ┌────────────────────────────────────────────┐ │
│ │ 4b. Sample Token (llama_sampler) │ │
│ │ • Apply temperature │ │
│ │ • Filter (top-k, top-p) │ │
│ │ • Sample from distribution │ │
│ │ • Return token ID │ │
│ └──────────────┬─────────────────────────────┘ │
│ ↓ │
│ ┌────────────────────────────────────────────┐ │
│ │ 4c. Check Stop Condition │ │
│ │ • EOS token? │ │
│ │ • Max length? │ │
│ │ • User stop sequence? │ │
│ └──────────────┬─────────────────────────────┘ │
│ ↓ │
│ └───────────── Loop until stop ─────────────┘ │
└────────────────────────┬────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ 5. Detokenization (llama_vocab) │
│ 15043, 11, 1268, ... → "Hello, how are you?" │
└────────────────────────┬────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ 6. Output Text │
│ "I'm doing well, thank you for asking!" │
└─────────────────────────────────────────────────────────────┘
Model Loading Process
The loader opens and validates the GGUF file, computes memory requirements, allocates buffers across backends, then memory-maps or reads the tensor data:
// Open and validate the GGUF file
struct gguf_init_params params = {
    .no_alloc = true,
    .ctx      = NULL,
};
struct gguf_context * ctx = gguf_init_from_file("model.gguf", params);

// Verify magic number and version
verify_gguf_magic(ctx);

// Calculate memory requirements
size_t mem_required = calculate_model_size(ctx);

// Allocate buffers across backends
for (auto & backend : backends) {
    ggml_backend_buffer_t buffer = ggml_backend_alloc_buffer(backend, mem_required);
    model->buffers.push_back(buffer);
}

// Memory map or read tensor data
if (use_mmap) {
    // Memory map the file for zero-copy loading
    model->mmap = llama_mmap_file("model.gguf", prefetch);
} else {
    // Read tensors into allocated buffers
    for (auto * tensor : model->tensors) {
        read_tensor_data(tensor);
    }
}
Backend Initialization
// Initialize compute backends
if (n_gpu_layers > 0) {
    // Offload layers to GPU
    for (int i = 0; i < n_gpu_layers; i++) {
        offload_layer_to_gpu(model, i);
    }
}

// Create backend scheduler
model->sched = ggml_backend_sched_new(
    backends.data(),
    backends.size(),
    GGML_DEFAULT_GRAPH_SIZE
);
KV Cache Management
The Key-Value cache is critical for efficient autoregressive generation:
struct llama_kv_cache {
    // Cache configuration
    uint32_t size;           // Maximum number of tokens
    uint32_t used;           // Currently used tokens

    // Storage for key/value tensors
    struct ggml_tensor * k;  // [n_layers, n_ctx, n_embd]
    struct ggml_tensor * v;  // [n_layers, n_ctx, n_embd]

    // Sequence tracking
    std::vector<llama_seq_id> cells;
};
Cache Operations:
// Clear the cache
llama_kv_cache_clear(ctx);

// Remove a specific sequence
llama_kv_cache_seq_rm(ctx, seq_id, p0, p1);

// Copy a sequence
llama_kv_cache_seq_cp(ctx, seq_src, seq_dst, p0, p1);

// Shift positions (for sliding-window contexts)
llama_kv_cache_seq_shift(ctx, seq_id, p0, p1, delta);
The KV cache stores attention keys and values for previously processed tokens, avoiding recomputation during generation.
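The cache's memory cost follows directly from the tensor shapes above: keys and values for every layer and every cached position. A back-of-the-envelope helper (the function name and the LLaMA-7B-shaped parameters in the test are illustrative; models with grouped-query attention shrink the KV embedding dimension):

```cpp
#include <cassert>
#include <cstdint>

// KV cache size in bytes: 2 tensors (K and V) x layers x context length
// x KV embedding width x bytes per element (2 for f16).
uint64_t kv_cache_bytes(uint64_t n_layers, uint64_t n_ctx,
                        uint64_t n_embd_kv, uint64_t bytes_per_elem) {
    return 2 * n_layers * n_ctx * n_embd_kv * bytes_per_elem;
}
```

Note the linear growth in n_ctx: doubling the context window doubles the KV cache, which is why long-context runs are often memory-bound before they are compute-bound.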
Memory Management
llama.cpp employs several strategies for efficient memory usage:
Memory Mapping (mmap)
// Enable memory mapping (default)
llama_model_params params = llama_model_default_params();
params.use_mmap = true;
Benefits:
Zero-copy model loading
OS handles paging
Shared memory across processes
Faster startup
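The mechanism behind these benefits is the POSIX mmap syscall: the file's pages become directly addressable memory and the OS pages them in on demand. llama.cpp wraps this in its llama_mmap helper; the standalone round-trip below (our own demo function, assuming a POSIX system) shows the core idea:

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <string>
#include <sys/mman.h>
#include <unistd.h>

// Write bytes to a temp file, then read them back through a read-only
// mapping instead of an explicit read() into a heap buffer.
bool mmap_roundtrip(const std::string & msg) {
    char path[] = "/tmp/mmap_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) return false;
    if (write(fd, msg.data(), msg.size()) != (ssize_t) msg.size()) {
        close(fd); unlink(path);
        return false;
    }
    void * addr = mmap(NULL, msg.size(), PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);   // the mapping stays valid after the fd is closed
    unlink(path);
    if (addr == MAP_FAILED) return false;
    const bool ok = std::memcmp(addr, msg.data(), msg.size()) == 0;
    munmap(addr, msg.size());
    return ok;
}
```

Because the mapping is read-only and private, multiple processes mapping the same model file share the same physical pages, which is where the "shared memory across processes" benefit comes from.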
Memory Locking (mlock)
// Lock model in RAM (prevents swapping)
params.use_mlock = true;
Benefits:
Prevents model from being swapped to disk
Consistent inference latency
Requires sufficient RAM
Quantization
See Quantization Documentation for details on reducing memory footprint.
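For a rough sense of the savings, here is the arithmetic for a Q4_0-style scheme: weights are grouped into blocks of 32, each stored as a 2-byte f16 scale plus 16 bytes of packed 4-bit values, i.e. 18 bytes per block (4.5 bits per weight versus 32 for f32). The helper name is ours:

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to store n_weights under Q4_0-style block quantization
// (assumes n_weights is a multiple of the block size, as tensor
// dimensions in practice are).
uint64_t q4_0_bytes(uint64_t n_weights) {
    const uint64_t block_size  = 32;      // weights per block
    const uint64_t block_bytes = 2 + 16;  // f16 scale + 32 x 4-bit values
    return (n_weights / block_size) * block_bytes;
}
```

At 4.5 bits per weight this is roughly a 7x reduction over f32 storage, which is what makes multi-billion-parameter models fit in consumer RAM.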
Backend Abstraction
The backend scheduler dynamically routes operations to appropriate compute devices:
// Pseudocode: how the scheduler picks a backend per operation
struct ggml_backend_sched {
    // Available backends
    std::vector<ggml_backend_t> backends;

    // Operation scheduling
    ggml_backend_t schedule_operation(ggml_tensor * tensor) {
        // Decide which backend executes this operation
        if (tensor_on_gpu(tensor)) {
            return gpu_backend;
        }
        return cpu_backend;
    }
};
Split Execution:
CPU handles some operations (layer norms, embeddings)
GPU handles matrix multiplications
Automatic data transfer between devices
See Backends Documentation for supported hardware.
Thread Pool
llama.cpp uses a thread pool for CPU parallelism:
// Set thread counts
llama_context_params params = llama_context_default_params();
params.n_threads       = 8; // Threads for generation
params.n_threads_batch = 8; // Threads for prompt processing
Optimal thread count is typically the number of physical CPU cores, not logical cores.
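One wrinkle: std::thread::hardware_concurrency() reports logical cores, not physical ones. The halving heuristic below is our illustration of the recommendation above, not llama.cpp code:

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Pick a thread count for CPU inference. When SMT/hyper-threading is
// assumed, halving the logical core count approximates the physical
// core count; an unknown count falls back to a safe default.
int pick_n_threads(bool assume_smt) {
    const unsigned logical = std::thread::hardware_concurrency();
    if (logical == 0) return 4; // value is unreliable on some platforms
    const unsigned n = assume_smt ? logical / 2 : logical;
    return (int) std::max(1u, n);
}
```

Oversubscribing threads beyond physical cores tends to hurt rather than help, since the matrix kernels are already memory-bandwidth-bound.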
Optimization Techniques
Batch Processing
Process multiple tokens/prompts simultaneously:
llama_batch batch = llama_batch_init(512, 0, 1);

// Add tokens to the batch (request logits only for the last one)
for (int i = 0; i < n_tokens; i++) {
    llama_batch_add(batch, tokens[i], i, { 0 }, i == n_tokens - 1);
}

// Process the entire batch
llama_decode(ctx, batch);
Flash Attention
Memory-efficient attention computation:
params.flash_attn = true; // Enable flash attention
Speculative Decoding
Use a small draft model to speed up generation:
llama-server -m model.gguf -md draft-model.gguf
Simple Example
Minimal inference example:
#include "llama.h"

#include <cstdio>
#include <vector>

int main() {
    // Initialize backend
    llama_backend_init();

    // Load model
    llama_model_params model_params = llama_model_default_params();
    llama_model * model = llama_model_load_from_file("model.gguf", model_params);

    // Create context
    llama_context_params ctx_params = llama_context_default_params();
    ctx_params.n_ctx = 2048;
    llama_context * ctx = llama_init_from_model(model, ctx_params);

    // Tokenize prompt
    const char * prompt = "Hello, world!";
    std::vector<llama_token> tokens = llama_tokenize(model, prompt, true);

    // Encode prompt
    llama_batch batch = llama_batch_get_one(tokens.data(), (int32_t) tokens.size());
    llama_encode(ctx, batch);

    // Generate tokens
    llama_sampler * sampler = llama_sampler_init_greedy();
    for (int i = 0; i < 100; i++) {
        // Get next token
        llama_token token = llama_sampler_sample(sampler, ctx, -1);
        if (token == llama_token_eos(model)) break;

        // Decode token to text
        char buf[256];
        llama_token_to_piece(model, token, buf, sizeof(buf), 0, true);
        printf("%s", buf);

        // Feed the token back
        batch = llama_batch_get_one(&token, 1);
        llama_decode(ctx, batch);
    }

    // Cleanup
    llama_sampler_free(sampler);
    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
Further Reading