Overview
Jan lets you run large language models (LLMs) entirely on your own computer using llama.cpp, an open-source inference engine. All models run locally with complete privacy - your conversations never leave your device.

Local models use your computer’s RAM and processing power. Choose models that match your hardware capabilities for the best experience.
Why Run Models Locally?
Complete Privacy
Your conversations and data never leave your computer. Perfect for sensitive work or personal projects.
Zero Costs
No monthly subscriptions or per-token API fees. Run unlimited conversations for free.
Offline Capable
Work anywhere without internet access once models are downloaded.
Full Control
Customize model behavior, parameters, and performance settings to match your needs.
Getting Started
Download Your First Model
The easiest way to get started is through Jan’s built-in Hub.

Browse Models
Browse available models or search for specific ones. Jan labels models as “Slow on your device” or “Not enough RAM” based on your system’s hardware.
Import from HuggingFace
You can import models directly from HuggingFace:

- Visit HuggingFace Models and find a GGUF model
- Copy the model ID (e.g., TheBloke/Mistral-7B-v0.1-GGUF)
- Paste it into Jan’s Hub search bar
- Select your preferred quantization and download
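Behind the scenes, a GGUF file on HuggingFace is served from a predictable `resolve` URL, which is what makes pasting a model ID into the Hub search bar work. A small sketch of that URL pattern (the helper name and the example filename are illustrative, not part of Jan):

```python
def gguf_download_url(repo_id: str, filename: str) -> str:
    """Build the direct-download URL for a file hosted on HuggingFace.

    HuggingFace serves repository files at /<repo>/resolve/<revision>/<file>.
    """
    return f"https://huggingface.co/{repo_id}/resolve/main/{filename}"

url = gguf_download_url("TheBloke/Mistral-7B-v0.1-GGUF",
                        "mistral-7b-v0.1.Q4_K_M.gguf")
print(url)
```

Each quantization in a repo is a separate file, which is why the Hub asks you to pick one before downloading.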
Import Local GGUF Files
If you already have GGUF model files:

- Go to Settings > Model Providers > Llama.cpp
- Click Import and select your GGUF file(s)
- Choose import method:
- Link Files: Creates symbolic links (saves disk space)
- Duplicate: Copies files to Jan’s directory (safer for external drives)
- Click Import to complete
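The difference between the two import methods comes down to symlink versus copy. A minimal sketch of that trade-off (the `import_model` helper is hypothetical, not Jan’s actual code):

```python
import os
import shutil

def import_model(src: str, dest_dir: str, method: str = "link") -> str:
    """'link' creates a symlink: no extra disk usage, but it breaks if the
    source file moves or its drive is unplugged. 'copy' duplicates the file,
    so it keeps working even if the original was on an external drive."""
    dest = os.path.join(dest_dir, os.path.basename(src))
    if method == "link":
        os.symlink(os.path.abspath(src), dest)
    else:
        shutil.copy2(src, dest)
    return dest
```

This is why the copy option is the safer choice for files on removable media: the imported model no longer depends on the original path existing.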
Model Formats & Quantization
GGUF Format
All local models in Jan use the GGUF format, which is optimized for efficient inference on consumer hardware. GGUF files package the model weights and configuration in a single file.

Quantization Explained
Quantization reduces model size by using lower-precision numbers. This trades some accuracy for significant memory savings:

| Quantization | Size Impact | Quality | Best For |
|---|---|---|---|
| Q4_K_M | Smallest | Good | Limited RAM, fastest inference |
| Q5_K_M | Medium | Better | Balanced performance |
| Q6_K | Larger | Great | More RAM available |
| Q8_0 | Largest | Excellent | Maximum quality, plenty of RAM |
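A rough rule of thumb for how these sizes play out: file size ≈ parameter count × bits per weight ÷ 8. The bits-per-weight figures below are approximate averages for each scheme, not exact values, and the helper is purely illustrative:

```python
# Approximate average bits per weight for common GGUF quantizations.
BPW = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}

def model_size_gb(n_params_billion: float, quant: str) -> float:
    """Estimate on-disk size: parameters x bits-per-weight / 8."""
    bytes_total = n_params_billion * 1e9 * BPW[quant] / 8
    return round(bytes_total / 1e9, 1)

for q in BPW:
    print(q, model_size_gb(7, q), "GB")  # e.g. a 7B model
```

So a 7B model drops from roughly 7.4 GB at Q8_0 to roughly 4.2 GB at Q4_K_M, which is why Q4_K_M is the usual choice on RAM-limited machines.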
Hardware Acceleration
GPU Support
Jan can offload model layers to your GPU for dramatically faster inference:

- NVIDIA: CUDA support (CUDA 11.7 or 12.0)
- AMD: Vulkan backend support
- Apple Silicon: Native Metal acceleration (M1/M2/M3/M4)
- Intel Arc: Vulkan backend support
Configuring GPU Layers
Control how many model layers run on your GPU:

- In a chat, click the gear icon next to your model
- Adjust the GPU Layers slider
- Higher values = faster inference (but uses more VRAM)
Start with maximum GPU layers and reduce only if you encounter out-of-memory errors.
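To get an intuition for where the slider should land, you can estimate how many layers fit in VRAM by assuming layers are roughly equal in size. This is back-of-the-envelope arithmetic, not Jan’s actual allocation logic; the 32-layer count is typical for 7B-class models but varies per model:

```python
def max_gpu_layers(vram_gb: float, model_gb: float, n_layers: int,
                   reserve_gb: float = 1.0) -> int:
    """Estimate how many layers fit in VRAM, assuming roughly equal layer
    sizes and reserving some headroom for the KV cache and buffers."""
    per_layer_gb = model_gb / n_layers
    fit = int((vram_gb - reserve_gb) / per_layer_gb)
    return max(0, min(fit, n_layers))

# A ~4.2 GB Q4 7B model (32 layers) on 8 GB vs 4 GB GPUs:
print(max_gpu_layers(8, 4.2, 32))
print(max_gpu_layers(4, 4.2, 32))
```

On the 8 GB card the whole model fits, so all 32 layers can be offloaded; on the 4 GB card only a partial offload fits, matching the advice to back off the slider when you hit out-of-memory errors.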
Model Management
View Downloaded Models
Access all your models at Settings > Model Providers > Llama.cpp. Each model shows:

- Name and version
- File size
- Current status (downloaded/downloading)
- Configuration options
Configure Model Settings
Click the gear icon next to any model to adjust:

- Context Length: How much conversation history the model remembers
- GPU Layers: Hardware acceleration settings
- Temperature: Response creativity (0.1 = focused, 1.0 = creative)
- Prompt Template: Chat format used by the model
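The temperature setting works by scaling the model’s output logits before sampling, a standard mechanism across LLM samplers. A minimal sketch of the effect (function name is illustrative):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature before softmax: low values sharpen the
    distribution (focused, repeatable), high values flatten it (creative)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 0.1))  # almost all mass on token 0
print(softmax_with_temperature(logits, 1.0))  # probability spread out
```

At temperature 0.1 the top token gets essentially all the probability mass, which is why low temperatures produce focused, deterministic-feeling answers.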
Enable Model Capabilities
Click the edit button next to a model to enable:

- Vision: Analyze images you share in conversations
- Tools: Enable web search, code execution, and external tools
- Embeddings: Generate vector representations of text
- Reasoning: Step-by-step thinking for complex problems
Delete Models
- Go to Settings > Model Providers > Llama.cpp
- Find the model you want to remove
- Click the three dots and select Delete Model
Advanced: Manual Model Setup
For advanced users who want to add custom models:

Navigate to Jan Data Folder
See Data Folder documentation for the location on your OS.
Example model.json
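The exact fields in model.json vary between Jan releases, so treat the following as an illustrative sketch rather than a definitive schema (field names and values here are representative assumptions, not guaranteed):

```json
{
  "id": "mistral-7b-v0.1",
  "object": "model",
  "name": "Mistral 7B v0.1 Q4_K_M",
  "version": "1.0",
  "format": "gguf",
  "sources": [
    {
      "filename": "mistral-7b-v0.1.Q4_K_M.gguf"
    }
  ],
  "settings": {
    "ctx_len": 4096
  },
  "parameters": {
    "temperature": 0.7,
    "max_tokens": 2048
  }
}
```

Place the file alongside the GGUF in the model’s folder and check Jan’s logs if the model does not appear after a restart.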
Performance Optimization
For Faster Inference
- Use GPU acceleration (maximize GPU Layers)
- Enable Continuous Batching in llama.cpp settings
- Close memory-intensive applications
- Choose smaller quantizations (Q4_K_M)
For Better Quality
- Use larger quantizations (Q8_0)
- Increase context length for longer conversations
- Adjust temperature (lower = more focused)
- Enable reasoning capabilities for complex tasks
For Limited Hardware
- Choose smaller models (1B-7B parameters)
- Use aggressive quantization (Q4_K_M)
- Reduce context length to 2048-4096 tokens
- Offload fewer layers to GPU
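The reason reducing context length helps so much is the KV cache, whose memory grows linearly with context. A rough estimate under assumed 7B-class dimensions (32 layers, 32 KV heads, head dimension 128, fp16 cache); real models vary, e.g. grouped-query attention uses fewer KV heads:

```python
def kv_cache_gb(ctx_len, n_layers=32, n_kv_heads=32, head_dim=128,
                bytes_per=2):
    """KV cache size: 2 tensors (K and V) per layer, each of shape
    ctx_len x n_kv_heads x head_dim, at bytes_per bytes per value."""
    total = 2 * n_layers * ctx_len * n_kv_heads * head_dim * bytes_per
    return total / 1e9

print(round(kv_cache_gb(4096), 2))   # 4K context
print(round(kv_cache_gb(32768), 2))  # 32K context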
Troubleshooting
Model won't load
- Verify you have enough RAM (check model size)
- Try a different llama.cpp backend in Settings
- Ensure the GGUF file isn’t corrupted
- Check Jan’s logs for specific errors
Very slow responses
- Increase GPU Layers (if you have a compatible GPU)
- Verify the correct backend is selected (CUDA for NVIDIA, Metal for Apple)
- Close other memory-intensive applications
- Try a smaller model or lower quantization
Out of memory errors
- Reduce Context Size in model settings
- Lower GPU Layers setting
- Switch to a smaller quantization (Q4_K_M instead of Q8_0)
- Try a smaller model
Model responses are repetitive
- Increase Temperature setting (try 0.8-1.0)
- Adjust Repeat Penalty (try 1.1-1.3)
- Enable Presence Penalty
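The repeat penalty works by down-weighting tokens that already appeared in recent output. A sketch of the convention used by llama.cpp’s classic repeat-penalty sampler (positive logits divided, negative logits multiplied); the helper name is illustrative:

```python
def apply_repeat_penalty(logits, recent_tokens, penalty=1.1):
    """Penalize token IDs that already appeared: positive logits are
    divided by the penalty, negative ones multiplied, so repeated
    tokens always become less likely."""
    out = list(logits)
    for t in set(recent_tokens):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

logits = [3.0, 1.0, -0.5]
print(apply_repeat_penalty(logits, recent_tokens=[0, 2], penalty=1.2))
```

A penalty of 1.0 is a no-op; values in the 1.1-1.3 range gently discourage loops without forbidding legitimate repetition (e.g. code identifiers).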
Next Steps
Model Parameters
Fine-tune how your models think and respond
Local API Server
Use your local models via OpenAI-compatible API
MCP Integration
Connect models to external tools and data sources
llama.cpp Engine
Deep dive into engine configuration