Overview
Adding a new model architecture to llama.cpp requires three main steps: converting the model to GGUF format, defining the model architecture in the C++ codebase, and building the GGML graph implementation for inference.
Before starting, ensure you’re familiar with the contribution guidelines and have tested your changes with the main examples and backends (CUDA, Metal, CPU).
Process Overview
Convert the model to GGUF
Use Python conversion scripts to transform model weights and configuration into GGUF format.
Define the model architecture
Register the model’s parameters and tensor layout in the llama.cpp source files.
Build the GGML graph implementation
Implement the inference logic for the architecture in C++.
Step 1: Convert Model to GGUF
This step is done in Python using the gguf library.
Choose Conversion Script
Depending on the model format, use either:
- convert_hf_to_gguf.py - for HuggingFace models
- examples/convert_legacy_llama.py - for Llama/Llama2 models in .pth format
Register the Model Class
Define a model class with the ModelBase.register decorator:
Define Tensor Layout in constants.py
Add entries to gguf-py/gguf/constants.py:
Example: Falcon Model Tensor Layout
- MODEL_ARCH enum entry
- MODEL_ARCH_NAMES - human-friendly name mapping
- MODEL_TENSORS - list of tensor names used by the architecture
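For Falcon, the entries look roughly like the following. This is an abridged sketch with a minimal stand-in MODEL_TENSOR enum; see gguf-py/gguf/constants.py for the full tables:

```python
from enum import IntEnum, auto

class MODEL_ARCH(IntEnum):
    FALCON = auto()

class MODEL_TENSOR(IntEnum):
    TOKEN_EMBD = auto()
    OUTPUT_NORM = auto()
    OUTPUT = auto()
    ATTN_NORM = auto()
    ATTN_NORM_2 = auto()
    ATTN_QKV = auto()
    ATTN_OUT = auto()
    FFN_DOWN = auto()
    FFN_UP = auto()

# human-friendly name used in the GGUF metadata
MODEL_ARCH_NAMES = {MODEL_ARCH.FALCON: "falcon"}

# tensors a Falcon GGUF file is expected to contain
MODEL_TENSORS = {
    MODEL_ARCH.FALCON: [
        MODEL_TENSOR.TOKEN_EMBD,
        MODEL_TENSOR.OUTPUT_NORM,
        MODEL_TENSOR.OUTPUT,
        MODEL_TENSOR.ATTN_NORM,
        MODEL_TENSOR.ATTN_NORM_2,
        MODEL_TENSOR.ATTN_QKV,
        MODEL_TENSOR.ATTN_OUT,
        MODEL_TENSOR.FFN_DOWN,
        MODEL_TENSOR.FFN_UP,
    ],
}
```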
Map Tensor Names
Map original tensor names to GGUF standardized names in gguf-py/gguf/tensor_mapping.py:
The {bid} keyword substitutes the block/layer index for repetitive layers. For example, transformer.blocks.{bid}.norm_1 maps to blk.{bid}.attn_norm in GGUF.
Verify Naming Convention
Before adding a new tensor name, verify that an equivalent standardized name doesn’t already exist in GGUF.
Override Methods as Needed
Depending on the model configuration, tokenizer, and tensor layout, you may need to override:
- TextModel#set_gguf_parameters
- MmprojModel#set_gguf_parameters
- ModelBase#set_vocab
- ModelBase#modify_tensors
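As a sketch of the override pattern: the classes below are simplified stand-ins for the real ones in convert_hf_to_gguf.py, and the hyperparameter key and writer method are illustrative, not the exact API:

```python
class GGUFWriterStub:
    """Minimal stand-in for gguf.GGUFWriter, recording key/value metadata."""
    def __init__(self):
        self.kv = {}
    def add_layer_norm_eps(self, value: float):
        self.kv["attention.layer_norm_epsilon"] = value

class TextModel:
    def __init__(self, hparams: dict):
        self.hparams = hparams
        self.gguf_writer = GGUFWriterStub()
    def set_gguf_parameters(self):
        pass  # in the real class, the common hyperparameters are written here

class MyModel(TextModel):
    def set_gguf_parameters(self):
        super().set_gguf_parameters()
        # write a model-specific hyperparameter into the GGUF metadata
        self.gguf_writer.add_layer_norm_eps(self.hparams["layer_norm_epsilon"])

model = MyModel({"layer_norm_epsilon": 1e-5})
model.set_gguf_parameters()
print(model.gguf_writer.kv)  # {'attention.layer_norm_epsilon': 1e-05}
```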
Step 2: Define Architecture in llama.cpp
The model parameters and tensor layout must be defined in the C++ source files.
Register in llama-arch.cpp
In src/llama-arch.cpp:
- Add the architecture name to the LLM_ARCH_NAMES map:
- Add tensor names to llm_get_tensor_names (you may also need to update LLM_TENSOR_NAMES):
Add metadata loading
Add any non-standard metadata loading in the llama_model_loader constructor in src/llama-model-loader.cpp.
Note that tensor dimensions in ggml are typically in the reverse order of the PyTorch dimensions.
Step 3: Build the GGML Graph Implementation
This is the core implementation where you define the inference logic.
Create Graph Builder Struct
Create a new struct that inherits from llm_graph_context in src/llama-model.cpp:
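In outline, the struct follows this shape. This is a simplified sketch modeled on llm_build_llama; the constructor signature and helper names vary between versions, so treat it as pseudocode rather than the exact API:

```cpp
// sketch only: the name llm_build_mymodel and the member details are illustrative
struct llm_build_mymodel : public llm_graph_context {
    llm_build_mymodel(const llama_model & model, const llm_graph_params & params)
        : llm_graph_context(params) {
        // 1. token embeddings
        // 2. for each layer: attention norm -> attention -> FFN norm -> FFN
        // 3. final norm and output projection
        // (build the ggml tensors step by step and record the result graph)
    }
};
```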
Reference Existing Implementations
Examine existing graph builders for guidance:
- llm_build_llama - standard transformer architecture
- llm_build_dbrx - mixture-of-experts model
- llm_build_bert - encoder-only model
Register in build_graph Method
Add a case for your architecture in the llama_model::build_graph method:
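The dispatch is a switch over the architecture enum. A sketch, where LLM_ARCH_MYMODEL and llm_build_mymodel are illustrative names and the surrounding code is elided:

```cpp
// inside llama_model::build_graph (sketch; other cases elided)
switch (arch) {
    // ...
    case LLM_ARCH_MYMODEL:
        {
            llm = std::make_unique<llm_build_mymodel>(*this, params);
        } break;
    // ...
}
```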
Backend Considerations
Backend-specific implementations can be added later for:
- CUDA
- Metal
- Vulkan
- SYCL
- Other accelerators
Debug the Inference Graph
To debug your graph implementation, use the llama-eval-callback example.
Step 4: Test the Implementation
Before opening a PR, verify that the main examples work correctly:
Essential Examples to Test
- cli - Command-line interface for text generation
- completion - Text completion example
- imatrix - Importance matrix generation for quantization
- quantize - Model quantization tool
- server - HTTP API server
Test on Main Backends
Run Test Suite
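llama.cpp uses CMake, so a typical way to build and run the test suite looks like this (assuming a build directory named build; the exact flags may vary with your setup and CMake version):

```shell
# configure and build (Release)
cmake -B build
cmake --build build --config Release

# run the test suite from the build directory
ctest --test-dir build --output-on-failure
```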
Verify Performance
Check that your implementation doesn’t negatively impact performance (for example, with the llama-bench tool).
Opening Your Pull Request
Focus on CPU support
The initial PR should focus on CPU support only, unless you have a good reason to include other backends.
Document your changes
Provide clear documentation of:
- What model architecture you’re adding
- Any special considerations or limitations
- Example usage with model download links
Follow PR guidelines
Review the contributing guidelines and ensure your PR follows all requirements.
Resources and Examples
GGUF Specification
Complete GGUF format specification: https://github.com/ggml-org/ggml/blob/master/docs/gguf.md
Example Pull Requests
Learn from these successful model additions:
- YaRN RoPE scaling
- Baichuan serial models support
- Attention bias support
- Mixtral support
- BERT embeddings
- Grok-1 support
- Command R Plus support
- DBRX architecture
Additional Documentation
Next Steps
Testing
Learn how to thoroughly test your implementation
Contributing
Review the full contribution guidelines

