
Usage

jan serve [MODEL_ID] [OPTIONS]
Load a local model and expose it at localhost:6767/v1. The inference engine (LlamaCPP or MLX) is auto-detected.

Arguments

model_id
string
Model ID to load. Omit to pick interactively from installed models. Can be:
  • A model ID from jan models list (e.g. qwen3.5-35b-a3b)
  • A HuggingFace repo ID (e.g. Qwen/Qwen2.5-35B-Instruct-GGUF) — will auto-download
  • Derived from --model-path filename if path is provided

Options

Model Configuration

model-path
string
Path to the GGUF file. Auto-resolved from model.yml when omitted.
jan serve --model-path /path/to/model.gguf
bin
string
Path to the inference binary. Auto-discovered from Jan data folder when omitted.
jan serve qwen3.5-35b --bin /usr/local/bin/llama-server
mmproj
string
Path to the multimodal projector (mmproj) file for vision-language models. Auto-resolved from model.yml when omitted.
jan serve llava-v1.6 --mmproj /path/to/mmproj.gguf

Server Configuration

port
number
default:"6767"
Port the model server listens on. Use 0 to pick a random free port.
jan serve qwen3.5-35b --port 8080
api-key
string
default:""
API key required by clients. Sets LLAMA_API_KEY / MLX_API_KEY on the server.
jan serve qwen3.5-35b --api-key my-secret-key
Clients must include this in their requests:
curl http://localhost:6767/v1/chat/completions \
  -H "Authorization: Bearer my-secret-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3.5-35b", "messages": [{"role": "user", "content": "Hello"}]}'
timeout
number
default:"120"
Seconds to wait for the model server to become ready.
jan serve qwen3.5-35b --timeout 180

Performance Configuration

n-gpu-layers
number
default:"-1"
Number of model layers to offload to the GPU.
  • -1: All layers (full GPU acceleration)
  • 0: CPU only
  • > 0: Specific number of layers to offload
jan serve qwen3.5-35b --n-gpu-layers 35  # Offload 35 layers
jan serve qwen3.5-35b --n-gpu-layers 0   # CPU only
ctx-size
number
default:"4096"
Context window size in tokens. Use 0 for model default.
jan serve qwen3.5-35b --ctx-size 8192
Setting --ctx-size explicitly disables --fit. Use --fit to maximize context based on available VRAM.
fit
boolean
default:"false"
Auto-fit context to available VRAM, maximizing the context window.
jan serve qwen3.5-35b --fit
When enabled, Jan automatically determines the largest context size your GPU can handle.
threads
number
default:"0"
CPU threads for inference. Use 0 to auto-detect.
jan serve qwen3.5-35b --threads 8

Model Type

embedding
boolean
default:"false"
Treat the model as an embedding model.
jan serve nomic-embed-text --embedding
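For scripting, you can call the embeddings endpoint without the OpenAI SDK. A minimal sketch using only the Python standard library (the request and response follow the standard OpenAI embeddings shape; the model name matches the example above, and the server is assumed to be running on the default port):

```python
import json
import urllib.error
import urllib.request

def embed(text, base="http://127.0.0.1:6767", model="nomic-embed-text"):
    """POST to /v1/embeddings and return the embedding vector, or None if the server is unreachable."""
    body = json.dumps({"model": model, "input": text}).encode()
    req = urllib.request.Request(
        f"{base}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=30) as r:
            data = json.load(r)
        # Standard OpenAI response shape: {"data": [{"embedding": [...]}, ...]}
        return data["data"][0]["embedding"]
    except (urllib.error.URLError, OSError):
        return None  # server not running

vec = embed("Hello, world")
print("dims:", len(vec) if vec else "server not reachable")
```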

Background Mode

detach
boolean
default:"false"
Run in the background (detach from terminal) and print the PID.
jan serve qwen3.5-35b --detach
Output:
{
  "pid": 12345,
  "model_id": "qwen3.5-35b-a3b",
  "log": "/Users/you/Library/Application Support/Jan/logs/serve.log"
}
log
string
Log file for background mode. Defaults to <data-folder>/logs/serve.log.
jan serve qwen3.5-35b --detach --log /tmp/jan-serve.log

Output Control

verbose
boolean
default:"false"
Print full server logs (llama.cpp / mlx output) instead of the loading spinner.
jan serve qwen3.5-35b --verbose
jan serve qwen3.5-35b -v

Examples

# Pick a model interactively
jan serve

# Serve a specific model
jan serve qwen3.5-35b-a3b

Output

Success

✓ qwen3.5-35b-a3b ready · http://127.0.0.1:6767

  Endpoint  http://127.0.0.1:6767/v1

  Press Ctrl+C to stop.
The model is now serving at http://127.0.0.1:6767/v1 with OpenAI-compatible endpoints:
  • /v1/chat/completions
  • /v1/completions
  • /v1/embeddings (for embedding models)
  • /v1/models
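A quick smoke test is to list the served model through /v1/models. A sketch using the standard library, assuming the standard OpenAI list response shape ({"data": [{"id": ...}, ...]}):

```python
import json
import urllib.error
import urllib.request

def list_models(base="http://127.0.0.1:6767"):
    """Return the model IDs the server reports, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base}/v1/models", timeout=5) as r:
            return [m["id"] for m in json.load(r).get("data", [])]
    except (urllib.error.URLError, OSError):
        return None

print(list_models() or "server not reachable")
```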

Error

✗ Failed to load qwen3.5-35b-a3b

Error: model not found in data folder

OpenAI-Compatible API

Once the model is serving, you can use it with any OpenAI-compatible client:
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:6767/v1",
    api_key="jan"  # or your custom --api-key
)

response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ]
)

print(response.choices[0].message.content)

HuggingFace Auto-Download

Jan can automatically download models from HuggingFace when you specify a repo ID:
jan serve Qwen/Qwen2.5-35B-Instruct-GGUF
The CLI will:
  1. Fetch available GGUF files from the repo
  2. Let you pick a quantization interactively
  3. Download the model to your Jan data folder
  4. Serve the model

Private/Gated Models

Set a HuggingFace token to download private or gated models:
export HF_TOKEN="your_token_here"
jan serve meta-llama/Llama-3.3-70B-Instruct-GGUF

Background Mode

Run the model server in the background:
jan serve qwen3.5-35b --detach
Output:
{
  "pid": 12345,
  "model_id": "qwen3.5-35b-a3b",
  "log": "/Users/you/Library/Application Support/Jan/logs/serve.log"
}
To stop the background server:
kill 12345
View logs:
tail -f "$HOME/Library/Application Support/Jan/logs/serve.log"
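Because --detach prints JSON, background servers are easy to manage from a script. A minimal sketch that parses the output shown above and signals the process (the literal below is the sample output; in practice you would capture the command's stdout):

```python
import json
import os
import signal

# Sample --detach output, copied from above; replace with the real captured stdout.
detach_output = '{"pid": 12345, "model_id": "qwen3.5-35b-a3b", "log": "/tmp/jan-serve.log"}'

info = json.loads(detach_output)
print(f"model {info['model_id']} running as pid {info['pid']}, logs at {info['log']}")

def stop(pid):
    """Send SIGTERM to the background server (equivalent to `kill <pid>`)."""
    os.kill(pid, signal.SIGTERM)

# stop(info["pid"])  # uncomment once detach_output holds real output
```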

Performance Tips

Maximize Context Window

Use --fit to automatically determine the largest context size your GPU can handle:
jan serve qwen3.5-35b --fit

Optimize for Speed

Offload all layers to GPU:
jan serve qwen3.5-35b --n-gpu-layers -1

Optimize for Memory

Reduce context size and GPU layers:
jan serve qwen3.5-35b --ctx-size 2048 --n-gpu-layers 20

CPU-Only Mode

Run entirely on CPU (no GPU):
jan serve qwen3.5-35b --n-gpu-layers 0 --threads 8

Troubleshooting

Model Not Found

Error: model not found in data folder
Solution: Download the model first using the Jan desktop app, or use a HuggingFace repo ID to auto-download.

Binary Not Found

✗ llama-server binary not found
Solution: Install a backend from Jan’s settings, or specify the binary path with --bin.

Out of Memory

Error: failed to allocate memory
Solution: Reduce --ctx-size or --n-gpu-layers, or use --fit to auto-size the context.
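When tuning --ctx-size after an allocation failure, it can help to estimate how much memory the KV cache needs. A rough back-of-envelope sketch (the layer/head numbers below are hypothetical; real values come from the model's GGUF metadata, and quantized KV caches use less):

```python
def kv_cache_bytes(n_layers, ctx, n_kv_heads, head_dim, bytes_per_elt=2):
    """Approximate KV-cache size: K and V entries per layer per token, f16 by default."""
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elt

# Hypothetical model: 48 layers, 8 KV heads, head dim 128, 8192-token context
gib = kv_cache_bytes(n_layers=48, ctx=8192, n_kv_heads=8, head_dim=128) / 2**30
print(f"{gib:.2f} GiB")  # → 1.50 GiB
```

Halving --ctx-size halves this figure, which is often enough to fit the model back into VRAM.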

Port Already in Use

Error: address already in use
Solution: Choose a different port with --port, or use --port 0 to auto-select a free port.

See Also

Launch Command

Wire AI agents to local models

Commands Reference

Complete reference for all CLI commands
