llama.cpp supports a wide variety of LLM architectures, both text-only and multimodal models. Typically, finetunes of the base models listed below are also supported.
For instructions on adding support for new models, see the HOWTO-add-model.md guide in the llama.cpp repository.

Text-Only Models

The following text-generation models are fully supported for inference:
LLaMA, LLaMA 2, and LLaMA 3 - Meta’s foundational large language models
  • LLaMA (original 7B, 13B, 33B, 65B)
  • LLaMA 2 (7B, 13B, 70B)
  • LLaMA 3 (8B, 70B, and larger variants)
These models form the foundation of llama.cpp and provide excellent performance across various tasks.
Mistral AI Models - High-performance open models
Mistral models are known for their strong performance relative to model size.
Gemma - Google’s open-source language models
  • Gemma - Available in multiple sizes
  • Optimized for efficiency and safety

Complete Text Model List

In addition to the families above, llama.cpp supports many more text model architectures, including:
  • Koala, Aquila, Vigogne (French)
  • InternLM2, Orion, Xverse
  • Command-R models, SEA-LION
  • GritLM, OLMo, OLMo 2
  • Poro, Smaug, Grok-1
  • Flan T5, Bitnet b1.58
  • Jais, Bielik, Trillion
  • Ling, LFM2, Hunyuan
  • And many more…
Visit the llama.cpp README for the complete list.

Multimodal Models

llama.cpp supports multimodal models that can process both text and images:
Vision-Language Models
LLaVA models combine vision encoders with language models for visual understanding tasks.
Multimodal support in llama-server is documented in the multimodal documentation.

Model Compatibility

Finetunes

Most finetunes of the base models listed above are automatically supported. This includes:
  • Instruction-tuned variants (e.g., -Instruct, -Chat)
  • Domain-specific adaptations
  • LoRA-merged models
  • RLHF-trained variants

Format Requirements

All models must be in GGUF format to work with llama.cpp. Models in other formats (PyTorch, SafeTensors, etc.) need to be converted first. See Converting Models for details on the conversion process.
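A quick way to verify that a converted file is valid GGUF is to inspect its fixed header: the file starts with the 4-byte magic `GGUF`, followed (in the current spec, little-endian) by a uint32 version, a uint64 tensor count, and a uint64 metadata key-value count. A minimal sketch, assuming that header layout:

```python
import struct

def read_gguf_header(path):
    """Read the fixed-size GGUF header: magic, version, tensor and KV counts."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        # little-endian: uint32 version, uint64 tensor_count, uint64 metadata_kv_count
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}
```

This only checks the header, not the tensor data itself, but it catches the common mistake of pointing llama.cpp at an unconverted PyTorch or SafeTensors checkpoint.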

Finding Models

# Search for GGUF models on Hugging Face
https://huggingface.co/models?library=gguf&sort=trending

# Search for specific model families
https://huggingface.co/models?sort=trending&search=llama+gguf
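The search URLs above can also be built programmatically. A small hypothetical helper (plain string construction, no Hugging Face client library assumed):

```python
from urllib.parse import urlencode

def hf_gguf_search_url(query=None, sort="trending"):
    """Build a Hugging Face model-search URL filtered to the GGUF library tag."""
    params = {"library": "gguf", "sort": sort}
    if query:
        params["search"] = query  # e.g. "llama gguf" to narrow to a model family
    return "https://huggingface.co/models?" + urlencode(params)
```

For example, `hf_gguf_search_url("llama")` reproduces the second search above.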

Performance Considerations

Different model architectures have varying performance characteristics:
  • Smaller models (1B-7B): Run efficiently on consumer hardware, suitable for edge deployment
  • Medium models (13B-34B): Balance between capability and resource requirements
  • Large models (70B+): Require substantial VRAM or RAM, best quality results
  • MoE models: Larger parameter counts but efficient inference due to sparse activation
For optimal performance, consider using quantized models to reduce memory requirements while maintaining quality.
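As a rough rule of thumb, the memory needed just for the weights is parameters × bits-per-weight ÷ 8; KV cache, activations, and per-layer overhead come on top and are not modeled here. A back-of-the-envelope sketch:

```python
def weight_memory_gib(n_params, bits_per_weight):
    """Weight-only memory estimate in GiB: params * bits / 8 bytes.

    Ignores KV cache, activations, and quantization block overhead,
    so real usage will be somewhat higher.
    """
    return n_params * bits_per_weight / 8 / (1024 ** 3)

# A 7B model at 4-bit quantization needs roughly 3.3 GiB for weights,
# versus roughly 13 GiB at 16-bit -- which is why quantized models
# fit on consumer hardware.
```

The exact bits-per-weight varies by quantization type (for example, the various K-quants use mixed precisions), so treat this as an estimate, not a guarantee.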