## Usage
Starts a local model server at `http://localhost:6767/v1`. Auto-detects the llama.cpp or MLX engine.
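As a minimal sketch (the `serve` subcommand name is an assumption here, inferred from the `serve.log` default mentioned under Background Mode):

```shell
# Pick a model interactively from installed models
jan serve

# Serve a specific installed model
jan serve qwen3.5-35b-a3b
```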
## Arguments
Model ID to load. Omit to pick interactively from installed models. Can be:

- A model ID from `jan models list` (e.g. `qwen3.5-35b-a3b`)
- A HuggingFace repo ID (e.g. `Qwen/Qwen2.5-35B-Instruct-GGUF`), which will be auto-downloaded
- Derived from the `--model-path` filename if a path is provided
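The three forms above might look like this (a hedged sketch; the `serve` subcommand and the local GGUF path are assumptions):

```shell
# By installed model ID (see `jan models list`)
jan serve qwen3.5-35b-a3b

# By HuggingFace repo ID (auto-downloads, prompting for a quantization)
jan serve Qwen/Qwen2.5-35B-Instruct-GGUF

# By file path; the model ID is derived from the filename
jan serve --model-path ./models/example.gguf
```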
## Options
### Model Configuration
- `--model-path`: Path to the GGUF file. Auto-resolved from `model.yml` when omitted.
- `--bin`: Path to the inference binary. Auto-discovered from the Jan data folder when omitted.
- mmproj path for vision-language models: auto-resolved from `model.yml` when omitted.

### Server Configuration
- `--port`: Port the model server listens on. Use `0` to pick a random free port.
- API key: Key required from clients; sets `LLAMA_API_KEY` / `MLX_API_KEY` on the server. Clients must include this key in their requests.
- Startup timeout: Seconds to wait for the model server to become ready.
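When an API key is configured, clients would include it as a bearer token; a sketch (the key value here is hypothetical):

```shell
# Send the same key the server exports as LLAMA_API_KEY / MLX_API_KEY
curl http://127.0.0.1:6767/v1/models \
  -H "Authorization: Bearer my-secret-key"
```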
### Performance Configuration
- `--n-gpu-layers`: GPU layers to offload.
  - `-1`: all layers (full GPU acceleration)
  - `0`: CPU only
  - `> 0`: offload that specific number of layers
- `--ctx-size`: Context window size in tokens. Use `0` for the model default. Setting `--ctx-size` explicitly disables `--fit`.
- `--fit`: Auto-fit the context to available VRAM, maximizing the context window. When enabled, Jan automatically determines the largest context size your GPU can handle.
- Threads: CPU threads for inference. Use `0` to auto-detect.

### Model Type
- Treat the model as an embedding model.
### Background Mode
- Run in the background (detach from the terminal) and print the PID.
- Log file for background mode. Defaults to `<data-folder>/logs/serve.log`.

### Output Control
- Print full server logs (llama.cpp / mlx output) instead of the loading spinner.
## Examples
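A few hedged sketches of common invocations, using only the flags documented above (the `serve` subcommand is an assumption):

```shell
# Serve on a random free port
jan serve qwen3.5-35b-a3b --port 0

# Auto-fit the context window to available VRAM
jan serve qwen3.5-35b-a3b --fit

# Full GPU offload with an explicit 8192-token context
jan serve qwen3.5-35b-a3b --n-gpu-layers -1 --ctx-size 8192
```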
## Output
### Success
On success, the server listens at `http://127.0.0.1:6767/v1` with OpenAI-compatible endpoints:
- `/v1/chat/completions`
- `/v1/completions`
- `/v1/embeddings` (for embedding models)
- `/v1/models`
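Any OpenAI-compatible client can call these endpoints; for example, a chat completion via curl (no API key configured in this sketch):

```shell
curl http://127.0.0.1:6767/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-35b-a3b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```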
### Error
## OpenAI-Compatible API
Once the model is being served, you can use it with any OpenAI-compatible client.

## HuggingFace Auto-Download
Jan can automatically download models from HuggingFace when you specify a repo ID. It will:

- Fetch available GGUF files from the repo
- Let you pick a quantization interactively
- Download the model to your Jan data folder
- Serve the model
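The flow above can be triggered with a single command (the `serve` subcommand is an assumption):

```shell
# Fetches the GGUF list, prompts for a quantization, downloads, then serves
jan serve Qwen/Qwen2.5-35B-Instruct-GGUF
```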
### Private/Gated Models
Set a HuggingFace token in your environment to download private or gated models.

## Background Mode
Run the model server in the background; it detaches from the terminal and prints the PID.

## Performance Tips
### Maximize Context Window
Use `--fit` to automatically determine the largest context size your GPU can handle:
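For instance (the `serve` subcommand is an assumption; model ID from the earlier example):

```shell
jan serve qwen3.5-35b-a3b --fit
```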
### Optimize for Speed
Offload all layers to the GPU with `--n-gpu-layers -1`.

### Optimize for Memory
Reduce the context size (`--ctx-size`) and the number of GPU layers (`--n-gpu-layers`).

### CPU-Only Mode
Run entirely on the CPU (no GPU) with `--n-gpu-layers 0`.

## Troubleshooting
### Model Not Found
### Binary Not Found

Point Jan at the inference binary explicitly with `--bin`.
### Out of Memory

Reduce `--ctx-size` or `--n-gpu-layers`, or use `--fit` to auto-size the context.
### Port Already in Use

Choose a different `--port`, or use `--port 0` to auto-select a free port.
## See Also
- Launch Command: Wire AI agents to local models
- Commands Reference: Complete reference for all CLI commands