For instructions on adding support for new models, see the HOWTO-add-model.md guide in the llama.cpp repository.
Text-Only Models
The following text-generation models are fully supported for inference:
LLaMA Family
LLaMA, LLaMA 2, and LLaMA 3 - Meta’s foundational large language models
- LLaMA (original 7B, 13B, 33B, 65B)
- LLaMA 2 (7B, 13B, 70B)
- LLaMA 3 (8B, 70B, and larger variants)
Mistral and Mixtral
Mistral AI Models - High-performance open models
- Mistral 7B - Efficient 7B parameter model
- Mixtral MoE - Mixture of Experts architecture
Google Models
Gemma - Google’s open-source language models
- Gemma - Available in multiple sizes
- Optimized for efficiency and safety
Chinese Language Models
Specialized Chinese LLMs
Code Models
Specialized for Code Generation
Other Notable Models
Additional Supported Architectures
- Falcon - TII UAE’s high-performance models
- Phi models - Microsoft’s small language models
- GPT-2 - OpenAI’s foundational model
- BERT - Bidirectional encoder
- Bloom - Multilingual model
- MPT - MosaicML Pretrained Transformer
- StableLM models
- Deepseek models
- GPT-NeoX and Pythia
Mixture of Experts (MoE)
MoE Architectures
State Space Models
Alternative Architectures
- Mamba - State space model
- FalconMamba models
- RWKV-6
- RWKV-7
Small Language Models
Efficient Small Models
Complete Text Model List
For a comprehensive and up-to-date list of all supported text models, see the llama.cpp repository. Supported architectures also include:
- Koala, Aquila, Vigogne (French)
- InternLM2, Orion, Xverse
- Command-R models, SEA-LION
- GritLM, OLMo, OLMo 2
- Poro, Smaug, Grok-1
- Flan T5, Bitnet b1.58
- Jais, Bielik, Trillion
- Ling, LFM2, Hunyuan
- And many more…
Multimodal Models
llama.cpp supports multimodal models that can process both text and images:
LLaVA Family
Vision-Language Models
LLaVA models combine vision encoders with language models for visual understanding tasks.
Other Vision Models
Other Vision Models
Additional Multimodal Architectures
Multimodal support in llama-server is documented in the multimodal documentation.
Model Compatibility
Finetunes
Most finetunes of the base models listed above are automatically supported. This includes:
- Instruction-tuned variants (e.g., -Instruct, -Chat)
- Domain-specific adaptations
- LoRA-merged models
- RLHF-trained variants
Format Requirements
All models must be in GGUF format to work with llama.cpp. Models in other formats (PyTorch, SafeTensors, etc.) need to be converted first. See Converting Models for details on the conversion process.
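Every GGUF file begins with the ASCII magic bytes "GGUF", so a quick sanity check on a downloaded file is possible before attempting to load it. A minimal sketch (the file path is a placeholder; this only inspects the header, it does not validate the rest of the file):

```python
GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file

def is_gguf(path: str) -> bool:
    """Return True if the file starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC
```

A file that fails this check (for example, a raw PyTorch checkpoint or a SafeTensors file) must be converted before llama.cpp can use it.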
Performance Considerations
Different model architectures have varying performance characteristics:
- Smaller models (1B-7B): Run efficiently on consumer hardware, suitable for edge deployment
- Medium models (13B-34B): Balance between capability and resource requirements
- Large models (70B+): Require substantial VRAM or RAM, best quality results
- MoE models: Larger parameter counts but efficient inference due to sparse activation
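As a rough back-of-the-envelope sketch of why model size dominates memory requirements: weight memory scales with parameter count times bits per weight. The figures below are illustrative assumptions (fp16 at 16 bits, a 4-bit quantization with scale overhead at roughly 4.5 bits per weight), not llama.cpp measurements, and they ignore KV cache and activation memory:

```python
def estimate_model_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone: params * bits / 8."""
    return n_params * bits_per_weight / 8

# Illustrative sizes for two common parameter counts.
for n_params, label in [(7e9, "7B"), (70e9, "70B")]:
    fp16_gb = estimate_model_bytes(n_params, 16) / 1e9
    q4_gb = estimate_model_bytes(n_params, 4.5) / 1e9
    print(f"{label}: ~{fp16_gb:.0f} GB at fp16, ~{q4_gb:.1f} GB at ~4.5 bits/weight")
```

This is why a 70B model needs substantial VRAM or RAM even when quantized, while a 7B model fits comfortably on consumer hardware. Note that for MoE models the weight memory follows the total parameter count, even though only a subset of experts is active per token.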

