Overview
The libllama C API provides a complete interface for loading and running large language models in C/C++ applications. The API is designed around four core concepts:

- Model: Loaded from GGUF files, contains model weights and architecture
- Context: Runtime state for inference, manages KV cache and computation
- Batch: Input data structure for encoding/decoding tokens
- Sampler: Token selection strategies for text generation
Initialization
Before using the library, initialize the backend: call llama_backend_init() once at program startup. For cleanup, call llama_backend_free() at program exit.

NUMA Support (Optional)
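The startup/shutdown sequence, including the optional NUMA call, can be sketched as follows. The llama_numa_init function and the GGML_NUMA_STRATEGY_* values are taken from one revision of llama.h/ggml.h and may differ in yours.

```c
// Typical backend lifecycle with optional NUMA initialization.
#include "llama.h"

int main(void) {
    llama_backend_init();   // once, at program startup

    // Optional: distribute model pages across NUMA nodes
    // (other strategies include ISOLATE and NUMACTL).
    llama_numa_init(GGML_NUMA_STRATEGY_DISTRIBUTE);

    // ... load models and run inference here ...

    llama_backend_free();   // once, at program exit
    return 0;
}
```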
Basic Usage Pattern
The typical workflow for using libllama follows this pattern: initialize the backend, load a model from a GGUF file, create a context from the model, tokenize the prompt, decode batches of tokens while sampling the next token, and finally free all resources in reverse order of creation.

Simple Example
Here’s a complete minimal example based on examples/simple/simple.cpp:
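The following sketch is modeled on examples/simple/simple.cpp. The llama.h API drifts between releases, so treat the exact signatures used here (llama_load_model_from_file, llama_new_context_with_model, llama_batch_get_one, llama_token_is_eog, and friends) as one revision's spelling and verify them against your copy of llama.h.

```c
// Minimal "load model, tokenize prompt, generate" program.
#include "llama.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    struct llama_model * model =
        llama_load_model_from_file(argv[1], llama_model_default_params());
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    struct llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 512;
    struct llama_context * ctx = llama_new_context_with_model(model, cparams);

    // Tokenize the prompt; a negative result means the buffer was too small.
    const char * prompt = "Hello, world";
    llama_token tokens[512];
    int n_tokens = llama_tokenize(model, prompt, (int) strlen(prompt),
                                  tokens, 512,
                                  /*add_special=*/true, /*parse_special=*/false);

    // Greedy sampling via a one-element sampler chain.
    struct llama_sampler * smpl =
        llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

    llama_token cur;
    struct llama_batch batch = llama_batch_get_one(tokens, n_tokens);
    for (int i = 0; i < 32; i++) {
        if (llama_decode(ctx, batch) != 0) {
            fprintf(stderr, "decode failed\n");
            break;
        }
        cur = llama_sampler_sample(smpl, ctx, -1);  // sample from last position
        if (llama_token_is_eog(model, cur)) {
            break;                                  // end-of-generation token
        }
        char piece[128];
        int n = llama_token_to_piece(model, cur, piece, sizeof(piece), 0, true);
        if (n > 0) fwrite(piece, 1, n, stdout);

        batch = llama_batch_get_one(&cur, 1);       // feed the new token back in
    }
    printf("\n");

    llama_sampler_free(smpl);
    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```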
Core Data Types
Type Definitions
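The scalar typedefs below reproduce one revision of llama.h; confirm the exact widths in your copy.

```c
// Scalar typedefs used throughout the API:
typedef int32_t llama_token;    // vocabulary index of a token
typedef int32_t llama_pos;      // position of a token within a sequence
typedef int32_t llama_seq_id;   // identifier of a sequence in the KV cache
```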
Core Structures
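The main handles are opaque, while llama_batch is a plain struct you fill in yourself. The field list below follows one revision of llama.h and may not match yours exactly.

```c
// Opaque handles; you only ever hold pointers to them:
struct llama_model;     // weights + architecture, created from a GGUF file
struct llama_context;   // per-inference state (KV cache, compute buffers)
struct llama_sampler;   // a token-selection strategy, or a chain of them

// llama_batch describes the tokens submitted to llama_decode:
struct llama_batch {
    int32_t         n_tokens;  // number of tokens in this batch
    llama_token   * token;     // token ids (NULL when passing embeddings)
    float         * embd;      // raw embeddings (NULL when passing tokens)
    llama_pos     * pos;       // position of each token in its sequence
    int32_t       * n_seq_id;  // number of sequences each token belongs to
    llama_seq_id ** seq_id;    // sequence ids per token
    int8_t        * logits;    // whether to compute logits for each token
};
```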
Default Parameters
Get default parameter structures:

Query Functions
Retrieve model and context information:

System Information
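The default-parameter constructors, the query functions, and the system-info call can be sketched together. The function names follow one revision of llama.h; check yours before relying on them.

```c
// Defaults, model/context queries, and compile-time feature flags.
#include "llama.h"
#include <stdio.h>

void inspect(struct llama_model * model, struct llama_context * ctx) {
    // Defaults are plain structs returned by value; tweak fields, then pass on.
    struct llama_model_params   mp = llama_model_default_params();
    struct llama_context_params cp = llama_context_default_params();
    (void) mp; (void) cp;

    printf("n_ctx      : %u\n", llama_n_ctx(ctx));
    printf("n_params   : %llu\n",
           (unsigned long long) llama_model_n_params(model));
    printf("model size : %llu bytes\n",
           (unsigned long long) llama_model_size(model));

    // Compile-time feature flags (AVX, CUDA, etc.) as a single string:
    printf("%s\n", llama_print_system_info());
}
```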
Performance Monitoring
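The context keeps built-in timing counters (load time, prompt evaluation, per-token evaluation). The llama_perf_context_print and llama_perf_context_reset calls below exist in recent llama.h revisions but are an assumption for older ones.

```c
// Print and reset the context's timing counters.
#include "llama.h"

void report(struct llama_context * ctx) {
    llama_perf_context_print(ctx);  // load / prompt-eval / eval timings
    llama_perf_context_reset(ctx);  // zero the counters for the next run
}
```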
Constants
Thread Safety
The tokenization API (llama_tokenize, llama_detokenize, llama_token_to_piece) is thread-safe. Other APIs require external synchronization.

Error Handling
Most functions return NULL, -1, 0, or negative values to indicate errors. Always check return values:
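A sketch of the common failure modes: NULL from the loaders, a negative count from llama_tokenize (the magnitude is the required buffer size), and a non-zero return from llama_decode. Signatures follow one llama.h revision.

```c
// Check every return value along the load/tokenize/decode path.
#include "llama.h"
#include <stdio.h>

int run(const char * path) {
    struct llama_model * model =
        llama_load_model_from_file(path, llama_model_default_params());
    if (model == NULL) {                       // NULL on load failure
        fprintf(stderr, "could not load %s\n", path);
        return 1;
    }

    struct llama_context * ctx =
        llama_new_context_with_model(model, llama_context_default_params());
    if (ctx == NULL) {                         // NULL on context failure
        llama_free_model(model);
        return 1;
    }

    llama_token toks[64];
    int n = llama_tokenize(model, "hi", 2, toks, 64, true, false);
    if (n < 0) {                               // negative = buffer too small
        fprintf(stderr, "tokenize: need %d slots\n", -n);
        llama_free(ctx);
        llama_free_model(model);
        return 1;
    }

    struct llama_batch batch = llama_batch_get_one(toks, n);
    if (llama_decode(ctx, batch) != 0) {       // non-zero = decode error
        fprintf(stderr, "decode failed\n");
    }

    llama_free(ctx);
    llama_free_model(model);
    return 0;
}
```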
Next Steps
- Model Loading: learn how to load models and configure parameters
- Inference: understand batching, decoding, and KV cache management
- Sampling: explore token sampling strategies and configuration

