
Overview

Adding a new model architecture to llama.cpp requires three main steps: converting the model to GGUF format, defining the model architecture in the C++ codebase, and building the GGML graph implementation for inference.
Before starting, ensure you’re familiar with the contribution guidelines and have tested your changes with the main examples and backends (CUDA, Metal, CPU).

Process Overview

1. Convert the model to GGUF

Use the Python conversion scripts to transform the model weights and configuration into GGUF format.

2. Define the model architecture

Register the model's parameters and tensor layout in the llama.cpp source files.

3. Build the GGML graph

Implement the inference graph logic for forward passes.

4. Test the implementation

Verify that the examples and backends work correctly with the new architecture.

Step 1: Convert Model to GGUF

This step is done in Python using the `gguf` library.

Choose Conversion Script

Depending on the model format, use either:
  • `convert_hf_to_gguf.py` - for Hugging Face models
  • `examples/convert_legacy_llama.py` - for Llama/Llama2 models in `.pth` format

Register the Model Class

Define a model class decorated with `ModelBase.register`:

```python
@ModelBase.register("MyModelForCausalLM")
class MyModel(TextModel):
    model_arch = gguf.MODEL_ARCH.MYMODEL
```

For vision models:

```python
@ModelBase.register("MyModelForConditionalGeneration")
class MyModel(MmprojModel):
    model_arch = gguf.MODEL_ARCH.MYMODEL
```

Define Tensor Layout in constants.py

Add entries to `gguf-py/gguf/constants.py`. For reference, the existing Falcon definition looks like this:

```python
MODEL_ARCH.FALCON: [
    MODEL_TENSOR.TOKEN_EMBD,
    MODEL_TENSOR.OUTPUT_NORM,
    MODEL_TENSOR.OUTPUT,
    MODEL_TENSOR.ATTN_NORM,
    MODEL_TENSOR.ATTN_NORM_2,
    MODEL_TENSOR.ATTN_QKV,
    MODEL_TENSOR.ATTN_OUT,
    MODEL_TENSOR.FFN_DOWN,
    MODEL_TENSOR.FFN_UP,
]
```
Add three entries:
  1. MODEL_ARCH enum entry
  2. MODEL_ARCH_NAMES - human-friendly name mapping
  3. MODEL_TENSORS - list of tensor names used by the architecture
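The three additions can be sketched as follows. This is a simplified, self-contained stand-in for the real enums in `gguf-py/gguf/constants.py` (which already contain every supported architecture); `MYMODEL`/"mymodel" are placeholder names.

```python
from enum import IntEnum, auto

# Simplified stand-ins for the enums defined in gguf-py/gguf/constants.py.
class MODEL_ARCH(IntEnum):
    LLAMA = auto()
    FALCON = auto()
    MYMODEL = auto()  # 1. new MODEL_ARCH enum entry

class MODEL_TENSOR(IntEnum):
    TOKEN_EMBD = auto()
    OUTPUT_NORM = auto()
    OUTPUT = auto()
    ATTN_NORM = auto()

# 2. human-friendly name mapping
MODEL_ARCH_NAMES = {
    MODEL_ARCH.MYMODEL: "mymodel",
}

# 3. list of tensors used by the architecture
MODEL_TENSORS = {
    MODEL_ARCH.MYMODEL: [
        MODEL_TENSOR.TOKEN_EMBD,
        MODEL_TENSOR.OUTPUT_NORM,
        MODEL_TENSOR.OUTPUT,
        MODEL_TENSOR.ATTN_NORM,
    ],
}
```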

Map Tensor Names

Map original tensor names to the GGUF standardized names in `gguf-py/gguf/tensor_mapping.py`:

```python
block_mappings_cfg: dict[MODEL_TENSOR, tuple[str, ...]] = {
    # Attention norm
    MODEL_TENSOR.ATTN_NORM: (
        "gpt_neox.layers.{bid}.input_layernorm",  # gptneox
        "transformer.h.{bid}.ln_1",               # gpt2 gpt-j refact qwen
        "transformer.blocks.{bid}.norm_1",        # mpt
    ),
}
```

The `{bid}` placeholder substitutes the block/layer index for repeated layers. For example, `transformer.blocks.{bid}.norm_1` maps to `blk.{bid}.attn_norm` in GGUF.
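For illustration, the `{bid}` substitution can be reproduced with plain `str.format`:

```python
# Each mapping entry is a template instantiated once per block/layer index.
src_template = "transformer.blocks.{bid}.norm_1"  # original (mpt-style) name
dst_template = "blk.{bid}.attn_norm"              # GGUF standardized name

# Build the concrete name mapping for a hypothetical 3-layer model.
mapping = {src_template.format(bid=i): dst_template.format(bid=i) for i in range(3)}
```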

Verify Naming Convention

Tensor names must end with a `.weight` or `.bias` suffix; several tools, such as `quantize`, rely on this convention.
Before adding a new tensor name, verify that an equivalent standardized name doesn't already exist in GGUF.
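A minimal validation helper could enforce the suffix rule; this is a sketch for illustration, not part of the codebase:

```python
def check_tensor_name(name: str) -> None:
    """Raise if a GGUF tensor name does not end in .weight or .bias."""
    if not name.endswith((".weight", ".bias")):
        raise ValueError(f"tensor name {name!r} must end with .weight or .bias")

check_tensor_name("blk.0.attn_norm.weight")  # passes silently
```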

Override Methods as Needed

Depending on the model configuration, tokenizer, and tensor layout, you may need to override:
  • `TextModel#set_gguf_parameters`
  • `MmprojModel#set_gguf_parameters`
  • `ModelBase#set_vocab`
  • `ModelBase#modify_tensors`
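The override pattern looks like this. These are minimal stand-ins for illustration only; the real `TextModel` lives in `convert_hf_to_gguf.py` and writes metadata through a `GGUFWriter` rather than a dict, and `rope_theta` here is just a hypothetical architecture-specific hyperparameter:

```python
# Simplified stand-in for the base class in convert_hf_to_gguf.py.
class TextModel:
    def __init__(self, hparams: dict):
        self.hparams = hparams
        self.gguf_params: dict = {}

    def set_gguf_parameters(self) -> None:
        # The base class handles hyperparameters common to all models.
        self.gguf_params["context_length"] = self.hparams["max_position_embeddings"]

class MyModel(TextModel):
    def set_gguf_parameters(self) -> None:
        super().set_gguf_parameters()
        # Hypothetical extra hyperparameter this architecture needs.
        self.gguf_params["rope_theta"] = self.hparams.get("rope_theta", 10000.0)

m = MyModel({"max_position_embeddings": 4096})
m.set_gguf_parameters()
```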

Step 2: Define Architecture in llama.cpp

The model parameters and tensor layout must be defined in the C++ source files.
1. Add the architecture enum

Define a new `llm_arch` enum value in `src/llama-arch.h`:

```cpp
enum llm_arch {
    LLM_ARCH_LLAMA,
    LLM_ARCH_FALCON,
    LLM_ARCH_MYMODEL,  // add your architecture
    // ...
};
```
2. Register in llama-arch.cpp

In `src/llama-arch.cpp`, first add the architecture name to the `LLM_ARCH_NAMES` map:

```cpp
static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
    { LLM_ARCH_LLAMA,   "llama"   },
    { LLM_ARCH_FALCON,  "falcon"  },
    { LLM_ARCH_MYMODEL, "mymodel" },
    // ...
};
```

Then add the tensor names in `llm_get_tensor_names` (you may also need to update `LLM_TENSOR_NAMES`):

```cpp
case LLM_ARCH_MYMODEL:
    return {
        { LLM_TENSOR_TOKEN_EMBD,  "token_embd"  },
        { LLM_TENSOR_OUTPUT_NORM, "output_norm" },
        { LLM_TENSOR_OUTPUT,      "output"      },
        // ...
    };
```
3. Add metadata loading

Add any non-standard metadata loading to the `llama_model_loader` constructor in `src/llama-model-loader.cpp`.
4. Configure RoPE (if applicable)

If the model uses RoPE (rotary position embeddings), add a case for the architecture to the `llama_model_rope_type` function in `src/llama-model.cpp`:

```cpp
enum llama_rope_type llama_model_rope_type(const struct llama_model * model) {
    switch (model->arch) {
        case LLM_ARCH_MYMODEL:
            return LLAMA_ROPE_TYPE_NORM;
        // ...
    }
}
```

Note that the dimensions in ggml are typically in the reverse order of the PyTorch dimensions.
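The reversed dimension order can be illustrated in a few lines of Python (the shapes are hypothetical):

```python
# A PyTorch nn.Linear stores its weight as (out_features, in_features).
# ggml reports the same tensor's dimensions (ne) with the contiguous,
# fastest-varying dimension first, so the shape appears reversed.
pytorch_shape = (4096, 11008)             # hypothetical (out, in) FFN weight
ggml_ne = tuple(reversed(pytorch_shape))  # what ggml would report
```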

Step 3: Build the GGML Graph Implementation

This is the core implementation where you define the inference logic.

Create Graph Builder Struct

Create a new struct that inherits from `llm_graph_context` in `src/llama-model.cpp`:

```cpp
struct llm_build_mymodel : public llm_graph_context {
    llm_build_mymodel(
            llama_model & model,
            struct ggml_context * ctx_compute,
            const llama_batch & batch,
            const llama_ubatch & ubatch) : llm_graph_context(model, ctx_compute, batch, ubatch) {}

    struct ggml_cgraph * build() {
        // Implement the model's forward pass here
        struct ggml_cgraph * gf = ggml_new_graph(ctx_compute);

        // Example: look up the embeddings for the input tokens
        struct ggml_tensor * inpL = ggml_get_rows(ctx_compute,
            model.tok_embd, batch.token);

        // Build the model architecture:
        // - attention layers
        // - feed-forward networks
        // - normalization layers
        // - etc.

        return gf;
    }
};
```

Reference Existing Implementations

Examine existing graph builders for guidance:
  • `llm_build_llama` - standard transformer architecture
  • `llm_build_dbrx` - mixture-of-experts model
  • `llm_build_bert` - encoder-only model

Register in build_graph Method

Add a case for your architecture in the `llama_model::build_graph` method:

```cpp
struct ggml_cgraph * llama_model::build_graph(
        const llama_batch & batch,
        const llama_ubatch & ubatch) {
    // ...

    switch (arch) {
        case LLM_ARCH_LLAMA:
            return llm_build_llama(*this, ctx_compute, batch, ubatch).build();
        case LLM_ARCH_MYMODEL:
            return llm_build_mymodel(*this, ctx_compute, batch, ubatch).build();
        // ...
    }
}
```

Backend Considerations

Some ggml backends do not support all operations. Focus on CPU support first, then add backend implementations in separate PRs.
Backend-specific implementations can be added later for:
  • CUDA
  • Metal
  • Vulkan
  • SYCL
  • Other accelerators

Debug the Inference Graph

To debug your graph implementation, use the `llama-eval-callback` example:

```bash
# Build with debug symbols
cmake -DCMAKE_BUILD_TYPE=Debug -B build
cmake --build build --target llama-eval-callback

# Run with your model
./build/bin/llama-eval-callback -m model.gguf
```

Step 4: Test the Implementation

Before opening a PR, verify that the main examples work correctly:
  • `cli` - command-line interface for text generation
  • `completion` - text completion example
  • `imatrix` - importance matrix generation for quantization
  • `quantize` - model quantization tool
  • `server` - HTTP API server

Test on Main Backends

```bash
# CPU-only test
cmake -B build
cmake --build build --target llama-cli
./build/bin/llama-cli -m model.gguf -p "Test prompt"

# CUDA test
cmake -B build -DGGML_CUDA=ON
cmake --build build --target llama-cli
./build/bin/llama-cli -m model.gguf -ngl 99 -p "Test prompt"

# Metal test (macOS)
cmake -B build -DGGML_METAL=ON
cmake --build build --target llama-cli
./build/bin/llama-cli -m model.gguf -ngl 99 -p "Test prompt"
```

Run Test Suite

```bash
# Build tests
cmake -B build -DLLAMA_BUILD_TESTS=ON
cmake --build build

# Run all tests
cd build
ctest

# Run a specific architecture test
ctest -R test-llama-archs -V
```

Verify Performance

Check that your implementation doesn’t negatively impact performance:
```bash
# Benchmark inference speed
./build/bin/llama-bench -m model.gguf -p 512 -n 128

# Check perplexity
./build/bin/llama-perplexity -m model.gguf -f wikitext-2-raw/wiki.test.raw
```

Opening Your Pull Request

1. Focus on CPU support

The initial PR should focus on CPU support only, unless you have a good reason to include other backends.

2. Document your changes

Provide clear documentation of:
  • what model architecture you're adding
  • any special considerations or limitations
  • example usage with model download links

3. Follow PR guidelines

Review the contributing guidelines and ensure your PR follows all requirements.

4. Plan follow-up PRs

Add support for other backends (CUDA, Metal, etc.) in separate follow-up PRs.

Resources and Examples

GGUF Specification

Complete GGUF format specification: https://github.com/ggml-org/ggml/blob/master/docs/gguf.md
