Overview

llama.cpp includes an extensive test suite covering unit tests, integration tests, and backend-specific tests. This guide covers how to build, run, and debug tests effectively.
Before submitting a pull request, you should execute the full CI locally to ensure your changes don’t break existing functionality.

Quick Start

Build and Run All Tests

# Build the project with tests enabled
cmake -B build -DLLAMA_BUILD_TESTS=ON
cmake --build build

# Run all tests
cd build
ctest

# Run tests with verbose output
ctest -V

# Run tests in parallel
ctest -j $(nproc)

Run Specific Tests

# Run tests matching a pattern
ctest -R tokenizer

# Run a specific test by name
ctest -R test-tokenizer-0-llama-spm -V

# Run tests with a specific label
ctest -L main

Test Categories

llama.cpp has several categories of tests:
C++ Unit Tests - Test individual components and functions. Examples:
  • test-tokenizer-0 - Tokenizer validation
  • test-sampling - Sampling algorithms
  • test-grammar-parser - Grammar parsing
  • test-arg-parser - Command-line argument parsing
  • test-rope - Rotary position embeddings
  • test-quantize-fns - Quantization functions
Location: tests/test-*.cpp
Backend Ops Tests - Verify consistency across different backends (CPU, CUDA, Metal, etc.). The test-backend-ops tool checks that different backend implementations of ggml operators produce consistent results.
# Build backend ops test
cmake --build build --target test-backend-ops

# Run on all available backends
./build/bin/test-backend-ops

# Restrict to a specific backend (e.g. the first CUDA device)
./build/bin/test-backend-ops -b CUDA0
This test requires access to at least two different ggml backends to verify consistency.
Python-based Server Tests - Test the HTTP API server using pytest. Location: tools/server/tests/. See the Server Testing section below for details.
End-to-End Tests - Test complete workflows with real models. Examples:
  • test-chat - Chat template functionality
  • test-chat-template - Chat template parsing
  • test-llama-archs - Model architecture loading
  • test-thread-safety - Multi-threaded inference

Running the Full CI Locally

Before submitting a PR, execute the full CI locally:
mkdir tmp

# CPU-only build
bash ./ci/run.sh ./tmp/results ./tmp/mnt

# With CUDA support
GG_BUILD_CUDA=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt

# With SYCL support
source /opt/intel/oneapi/setvars.sh
GG_BUILD_SYCL=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt

# With MUSA support
GG_BUILD_MUSA=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
The CI runs comprehensive tests on different hardware configurations. Running it locally helps catch issues before submitting your PR.

Testing Modified Code

Testing ggml Modifications

If you modified the ggml source, you must run test-backend-ops:
1. Build with multiple backends

# Build with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --target test-backend-ops
2. Run backend operations test

./build/bin/test-backend-ops
This verifies that different backends produce consistent results for ggml operations.
3. Add test cases for new operators

If you added a new ggml operator, add corresponding test cases to tests/test-backend-ops.cpp:
// Example test case structure
struct test_my_op : public test_case {
    // Define test parameters and implementation
};

Testing Performance Impact

Verify your changes don’t negatively impact performance:
# Benchmark inference speed
llama-bench -m model.gguf -p 512 -n 128 -t 4

# Compare before and after changes
llama-bench -m model.gguf -r 5  # Run 5 repetitions for average

Testing Perplexity

Ensure your changes don’t affect model quality:
# Download test dataset (if needed)
wget https://huggingface.co/datasets/ggml-org/ci/resolve/main/wikitext-2-raw-v1.zip
unzip wikitext-2-raw-v1.zip

# Run perplexity test
llama-perplexity -m model.gguf -f wikitext-2-raw/wiki.test.raw

Debugging Tests

Using the debug-test.sh Script

The scripts/debug-test.sh script provides an easy way to debug specific tests:
# Show available tests matching pattern
./scripts/debug-test.sh test-tokenizer

# Run a specific test (interactive selection)
./scripts/debug-test.sh test-tokenizer

# Run with GDB debugger
./scripts/debug-test.sh -g test-tokenizer

# Run a specific test by number (pattern plus number from the listing)
./scripts/debug-test.sh test-tokenizer 23

# Show help
./scripts/debug-test.sh -h

Manual Debugging Process

For more control, follow these steps:
1. Create debug build directory

rm -rf build-ci-debug
mkdir build-ci-debug
cd build-ci-debug
2. Configure with debug symbols

cmake -DCMAKE_BUILD_TYPE=Debug \
      -DGGML_CUDA=1 \
      -DLLAMA_FATAL_WARNINGS=ON \
      ..
3. Build test binaries

make -j
4. Find test commands

# List all tests matching pattern
ctest -R "test-tokenizer" -V -N
This outputs test commands like:
Test command: /path/to/build/bin/test-tokenizer-0 "/path/to/models/ggml-vocab-llama-spm.gguf"
5. Run with GDB

gdb --args ./bin/test-tokenizer-0 "../models/ggml-vocab-llama-spm.gguf"
In GDB:
# Set breakpoint
(gdb) b main
(gdb) b llama_tokenize

# Run
(gdb) run

# Step through
(gdb) next
(gdb) step

# Inspect variables
(gdb) print token_count
(gdb) print *ctx

Debugging with Valgrind

# Check for memory leaks
valgrind --leak-check=full \
         --show-leak-kinds=all \
         --track-origins=yes \
         ./build/bin/test-tokenizer-0 model.gguf

# Check for threading issues
valgrind --tool=helgrind \
         ./build/bin/test-thread-safety -m model.gguf

Server Testing

The server has its own comprehensive test suite using Python and pytest.

Setup Server Tests

1. Install dependencies

cd tools/server/tests
pip install -r requirements.txt
2. Build the server

cd ../../../
cmake -B build
cmake --build build --target llama-server
3. Run tests

cd tools/server/tests
./tests.sh

Server Test Configuration

Environment variables for customizing server tests:
Variable                Description                 Default
--------                -----------                 -------
PORT                    Server listening port       8080
LLAMA_SERVER_BIN_PATH   Path to server binary       ../../../build/bin/llama-server
DEBUG                   Enable verbose output
N_GPU_LAYERS            Layers to offload to GPU
LLAMA_CACHE             Model cache directory       tmp

Running Specific Server Tests

# Run slow tests (downloads many models)
SLOW_TESTS=1 ./tests.sh

# Run with debug output
DEBUG=1 ./tests.sh -s -v -x

# Run all tests in a file
./tests.sh unit/test_chat_completion.py -v -x

# Run a single test
./tests.sh unit/test_chat_completion.py::test_invalid_chat_completion_req

# Compile and test in single command
cmake --build build -j --target llama-server && ./tools/server/tests/tests.sh

Debugging Server Tests

Debug the server while running tests:
# Terminal 1: Start server in debugger
gdb --args ../../../build/bin/llama-server \
    --host 127.0.0.1 --port 8080 \
    --temp 0.8 --seed 42 \
    --hf-repo ggml-org/models \
    --hf-file tinyllamas/stories260K.gguf \
    --batch-size 32 --ctx-size 512 \
    --parallel 2 --n-predict 64

# Set breakpoint
(gdb) br server.cpp:4604
(gdb) r

# Terminal 2: Run test
env DEBUG_EXTERNAL=1 ./tests.sh unit/test_chat_completion.py -v -x
The DEBUG_EXTERNAL=1 environment variable tells the test suite to connect to an externally-started server instead of spawning its own.

Test Structure and CMake

Understanding Test Registration

Tests are registered in tests/CMakeLists.txt using helper functions:
# Build and test a source file
llama_build_and_test(test-sampling.cpp)

# Build once, test multiple times with different args
llama_build(test-tokenizer-0.cpp)
llama_test(test-tokenizer-0 
    NAME test-tokenizer-0-llama-spm 
    ARGS ${PROJECT_SOURCE_DIR}/models/ggml-vocab-llama-spm.gguf)
llama_test(test-tokenizer-0 
    NAME test-tokenizer-0-falcon 
    ARGS ${PROJECT_SOURCE_DIR}/models/ggml-vocab-falcon.gguf)

Adding a New Test

1. Create test source file

// tests/test-my-feature.cpp
#include "testing.h"

int main() {
    // Your test logic
    ASSERT_EQ(my_feature(), expected_result);
    return 0;
}
2. Register in CMakeLists.txt

# tests/CMakeLists.txt
llama_build_and_test(test-my-feature.cpp)
3. Build and run

cmake --build build --target test-my-feature
ctest -R test-my-feature -V

Common Test Patterns

Testing with Models

Many tests require model files:
#include "get-model.h"

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <model-path>\n", argv[0]);
        return 1;
    }
    
    const char * model_path = argv[1];
    
    // Load model
    llama_model_params model_params = llama_model_default_params();
    llama_model * model = llama_load_model_from_file(model_path, model_params);
    
    // Run tests...
    
    llama_free_model(model);
    return 0;
}

Assertion Helpers

Use the testing helpers from tests/testing.h:
#include "testing.h"

// Basic assertions
ASSERT(condition);
ASSERT_EQ(actual, expected);
ASSERT_NE(actual, unexpected);
ASSERT_LT(actual, limit);
ASSERT_LE(actual, limit);

// Floating point comparison
ASSERT_NEAR(actual, expected, epsilon);

Continuous Integration

llama.cpp uses GitHub Actions for CI/CD. The CI runs:
  • Unit tests on multiple platforms (Linux, macOS, Windows)
  • Backend-specific tests (CUDA, Metal, SYCL)
  • Integration tests with real models
  • Performance benchmarks
  • Code style checks
Some tests are disabled on certain platforms. Check .github/workflows/ for platform-specific configurations.

Best Practices

Test Before Submitting

Always run the full CI locally before opening a PR to catch issues early.

Add Tests for New Features

Every new feature should include corresponding tests to prevent regressions.

Test Multiple Backends

If modifying ggml operations, test on CPU, CUDA, and Metal backends.

Check Performance

Use llama-bench and llama-perplexity to verify no performance degradation.

Troubleshooting

Test Failures

Tests fail on CI but pass locally
  • Ensure you’re testing the same commit
  • Check if it’s a platform-specific issue
  • Verify model files are the same version
Timeout errors
  • Increase test timeout in CMakeLists.txt
  • Check for infinite loops or deadlocks
  • Run with smaller models for unit tests
Flaky tests
  • Check for race conditions in multi-threaded code
  • Ensure tests don’t depend on external state
  • Use fixed random seeds for reproducibility

Next Steps

Contributing

Learn the full contribution workflow and guidelines

Adding Models

Understand how to add new model architectures