
Usage

jan serve [MODEL_ID] [OPTIONS]
Load a local model and expose it at localhost:6767/v1. The inference engine (LlamaCPP or MLX) is auto-detected.

Arguments

model_id
string
Model ID to load. Omit to pick interactively from installed models. Can be:
  • A model ID from jan models list (e.g. qwen3.5-35b-a3b)
  • A HuggingFace repo ID (e.g. Qwen/Qwen2.5-35B-Instruct-GGUF) — will auto-download
  • Derived from --model-path filename if path is provided

Options

Model Configuration

model-path
string
Path to the GGUF file. Auto-resolved from model.yml when omitted.
jan serve --model-path /path/to/model.gguf
bin
string
Path to the inference binary. Auto-discovered from Jan data folder when omitted.
jan serve qwen3.5-35b --bin /usr/local/bin/llama-server
mmproj
string
Path to the multimodal projector (mmproj) file for vision-language models. Auto-resolved from model.yml when omitted.
jan serve llava-v1.6 --mmproj /path/to/mmproj.gguf

Server Configuration

port
number
default:"6767"
Port the model server listens on. Use 0 to pick a random free port.
jan serve qwen3.5-35b --port 8080
api-key
string
default:""
API key required by clients. Sets LLAMA_API_KEY / MLX_API_KEY on the server.
jan serve qwen3.5-35b --api-key my-secret-key
Clients must include this in their requests:
curl http://localhost:6767/v1/chat/completions \
  -H "Authorization: Bearer my-secret-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3.5-35b", "messages": [{"role": "user", "content": "Hello"}]}'
timeout
number
default:"120"
Seconds to wait for the model server to become ready.
jan serve qwen3.5-35b --timeout 180

Performance Configuration

n-gpu-layers
number
default:"-1"
Number of model layers to offload to the GPU.
  • -1: All layers (full GPU acceleration)
  • 0: CPU only
  • > 0: Specific number of layers to offload
jan serve qwen3.5-35b --n-gpu-layers 35  # Offload 35 layers
jan serve qwen3.5-35b --n-gpu-layers 0   # CPU only
ctx-size
number
default:"4096"
Context window size in tokens. Use 0 for model default.
jan serve qwen3.5-35b --ctx-size 8192
Setting --ctx-size explicitly disables --fit. Use --fit to maximize context based on available VRAM.
fit
boolean
default:"false"
Auto-fit context to available VRAM, maximizing the context window.
jan serve qwen3.5-35b --fit
When enabled, Jan automatically determines the largest context size your GPU can handle.
threads
number
default:"0"
CPU threads for inference. Use 0 to auto-detect.
jan serve qwen3.5-35b --threads 8

Model Type

embedding
boolean
default:"false"
Treat the model as an embedding model.
jan serve nomic-embed-text --embedding
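For scripting, you can call the embeddings endpoint without the OpenAI SDK. A minimal sketch using only the Python standard library (the request and response follow the standard OpenAI embeddings shape; the model name matches the example above, and the server is assumed to be running on the default port):

```python
import json
import urllib.error
import urllib.request

def embed(text, base="http://127.0.0.1:6767", model="nomic-embed-text"):
    """POST to /v1/embeddings and return the embedding vector, or None if the server is unreachable."""
    body = json.dumps({"model": model, "input": text}).encode()
    req = urllib.request.Request(
        f"{base}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=30) as r:
            data = json.load(r)
        # Standard OpenAI response shape: {"data": [{"embedding": [...]}, ...]}
        return data["data"][0]["embedding"]
    except (urllib.error.URLError, OSError):
        return None  # server not running

vec = embed("Hello, world")
print("dims:", len(vec) if vec else "server not reachable")
```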

Background Mode

detach
boolean
default:"false"
Run in the background (detach from terminal) and print the PID.
jan serve qwen3.5-35b --detach
Output:
{
  "pid": 12345,
  "model_id": "qwen3.5-35b-a3b",
  "log": "/Users/you/Library/Application Support/Jan/logs/serve.log"
}
log
string
Log file for background mode. Defaults to <data-folder>/logs/serve.log.
jan serve qwen3.5-35b --detach --log /tmp/jan-serve.log

Output Control

verbose
boolean
default:"false"
Print full server logs (llama.cpp / mlx output) instead of the loading spinner.
jan serve qwen3.5-35b --verbose
jan serve qwen3.5-35b -v

Examples

# Pick a model interactively
jan serve

# Serve a specific model
jan serve qwen3.5-35b-a3b

Output

Success

✓ qwen3.5-35b-a3b ready · http://127.0.0.1:6767

  Endpoint  http://127.0.0.1:6767/v1

  Press Ctrl+C to stop.
The model is now serving at http://127.0.0.1:6767/v1 with OpenAI-compatible endpoints:
  • /v1/chat/completions
  • /v1/completions
  • /v1/embeddings (for embedding models)
  • /v1/models
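A quick smoke test is to list the served model through /v1/models. A sketch using the standard library, assuming the standard OpenAI list response shape ({"data": [{"id": ...}, ...]}):

```python
import json
import urllib.error
import urllib.request

def list_models(base="http://127.0.0.1:6767"):
    """Return the model IDs the server reports, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base}/v1/models", timeout=5) as r:
            return [m["id"] for m in json.load(r).get("data", [])]
    except (urllib.error.URLError, OSError):
        return None

print(list_models() or "server not reachable")
```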

Error

✗ Failed to load qwen3.5-35b-a3b

Error: model not found in data folder

OpenAI-Compatible API

Once the model is serving, you can use it with any OpenAI-compatible client:
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:6767/v1",
    api_key="jan"  # or your custom --api-key
)

response = client.chat.completions.create(
    model="qwen3.5-35b-a3b",
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ]
)

print(response.choices[0].message.content)

HuggingFace Auto-Download

Jan can automatically download models from HuggingFace when you specify a repo ID:
jan serve Qwen/Qwen2.5-35B-Instruct-GGUF
The CLI will:
  1. Fetch available GGUF files from the repo
  2. Let you pick a quantization interactively
  3. Download the model to your Jan data folder
  4. Serve the model

Private/Gated Models

Set a HuggingFace token to download private or gated models:
export HF_TOKEN="your_token_here"
jan serve meta-llama/Llama-3.3-70B-Instruct-GGUF

Background Mode

Run the model server in the background:
jan serve qwen3.5-35b --detach
Output:
{
  "pid": 12345,
  "model_id": "qwen3.5-35b-a3b",
  "log": "/Users/you/Library/Application Support/Jan/logs/serve.log"
}
To stop the background server:
kill 12345
View logs:
tail -f "$HOME/Library/Application Support/Jan/logs/serve.log"
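Because --detach prints JSON, background servers are easy to manage from a script. A minimal sketch that parses the output shown above and signals the process (the literal below is the sample output; in practice you would capture the command's stdout):

```python
import json
import os
import signal

# Sample --detach output, copied from above; replace with the real captured stdout.
detach_output = '{"pid": 12345, "model_id": "qwen3.5-35b-a3b", "log": "/tmp/jan-serve.log"}'

info = json.loads(detach_output)
print(f"model {info['model_id']} running as pid {info['pid']}, logs at {info['log']}")

def stop(pid):
    """Send SIGTERM to the background server (equivalent to `kill <pid>`)."""
    os.kill(pid, signal.SIGTERM)

# stop(info["pid"])  # uncomment once detach_output holds real output
```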

Performance Tips

Maximize Context Window

Use --fit to automatically determine the largest context size your GPU can handle:
jan serve qwen3.5-35b --fit

Optimize for Speed

Offload all layers to GPU:
jan serve qwen3.5-35b --n-gpu-layers -1

Optimize for Memory

Reduce context size and GPU layers:
jan serve qwen3.5-35b --ctx-size 2048 --n-gpu-layers 20

CPU-Only Mode

Run entirely on CPU (no GPU):
jan serve qwen3.5-35b --n-gpu-layers 0 --threads 8

Troubleshooting

Model Not Found

Error: model not found in data folder
Solution: Download the model first using the Jan desktop app, or use a HuggingFace repo ID to auto-download.

Binary Not Found

✗ llama-server binary not found
Solution: Install a backend from Jan’s settings, or specify the binary path with --bin.

Out of Memory

Error: failed to allocate memory
Solution: Reduce --ctx-size or --n-gpu-layers, or use --fit to auto-size the context.
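When tuning --ctx-size after an allocation failure, it can help to estimate how much memory the KV cache needs. A rough back-of-envelope sketch (the layer/head numbers below are hypothetical; real values come from the model's GGUF metadata, and quantized KV caches use less):

```python
def kv_cache_bytes(n_layers, ctx, n_kv_heads, head_dim, bytes_per_elt=2):
    """Approximate KV-cache size: K and V entries per layer per token, f16 by default."""
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elt

# Hypothetical model: 48 layers, 8 KV heads, head dim 128, 8192-token context
gib = kv_cache_bytes(n_layers=48, ctx=8192, n_kv_heads=8, head_dim=128) / 2**30
print(f"{gib:.2f} GiB")  # → 1.50 GiB
```

Halving --ctx-size halves this figure, which is often enough to fit the model back into VRAM.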

Port Already in Use

Error: address already in use
Solution: Choose a different port with --port, or use --port 0 to auto-select a free port.

See Also

Launch Command

Wire AI agents to local models

Commands Reference

Complete reference for all CLI commands
