
Overview

Model parameters control every aspect of how your AI behaves - from creativity and randomness to memory and processing speed. Understanding these settings helps you get the most out of both local and cloud models.
You can adjust parameters per-conversation or set permanent defaults for each model. Changes apply immediately.

Accessing Settings

There are multiple ways to configure model behavior: set permanent defaults in each model's settings, or adjust a specific conversation.
To adjust settings for a specific chat:
  1. Open any conversation
  2. Click the gear icon next to your selected model
  3. Modify parameters in the sidebar
  4. Changes apply to this conversation only
Use per-conversation settings to experiment without affecting your model’s default behavior.

Core Parameters

Temperature

Controls randomness and creativity in responses.
| Value | Behavior | Best For |
|-------|----------|----------|
| 0.0 - 0.3 | Deterministic, focused, factual | Code generation, math, factual Q&A |
| 0.4 - 0.7 | Balanced creativity and coherence | General chat, explanations, summaries |
| 0.8 - 1.0 | Creative, varied, exploratory | Creative writing, brainstorming, storytelling |
| 1.0+ | Highly random, experimental | Experimental or artistic purposes |
# Low temperature: same prompt, near-identical outputs
Prompt: "What is 2+2?"
Output 1: "2+2 equals 4."
Output 2: "2+2 equals 4."
Output 3: "2+2 equals 4."
Start with 0.7 for general use. Lower it for precision tasks, raise it for creative work.
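Under the hood, temperature rescales the model's logits before sampling. This illustrative pure-Python sketch (not Jan's actual implementation) shows why low values make outputs near-deterministic and high values spread probability across more tokens:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature before softmax: low values
    sharpen the distribution, high values flatten it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                       # hypothetical next-token logits
cold = softmax_with_temperature(logits, 0.2)   # top token dominates
hot = softmax_with_temperature(logits, 1.5)    # probability spreads out
```

At 0.2 the top token takes almost all of the probability mass, so repeated runs pick the same token; at 1.5 the alternatives stay live, so outputs vary.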

Top P (Nucleus Sampling)

Controls diversity by limiting the pool of considered tokens.
  • 0.9 (default): Samples from the smallest set of tokens whose cumulative probability reaches 90% - a good balance
  • 0.95: Slightly more diverse outputs
  • 0.5: Very focused, less variety
  • 1.0: Considers all tokens
Top P works together with temperature. Both control randomness, but in different ways. Most users can leave Top P at default (0.9-0.95).
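As a rough sketch of what nucleus sampling does internally (illustrative only; real inference engines operate on logits, and the token probabilities here are made up):

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalise. probs: {token: probability}."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zzz": 0.05}
focused = top_p_filter(probs, 0.5)    # only the single top token survives
balanced = top_p_filter(probs, 0.9)   # keeps a wider nucleus of tokens
```

Lowering Top P shrinks the pool the sampler can draw from, which is why outputs get more focused.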

Top K

Limits the model to choosing from the K most likely next tokens.
  • 40 (default): Moderate variety
  • 20: More focused outputs
  • 80-100: More diverse outputs
Top K and Top P serve similar purposes. Many models work well with defaults - adjust only if you notice issues with repetition or randomness.
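Top K is the simpler cutoff: keep a fixed number of candidates regardless of their probabilities. An illustrative sketch with a made-up token-probability dictionary:

```python
def top_k_filter(probs, k):
    """Keep only the k most likely tokens, renormalised.
    probs: {token: probability}. Illustrative sketch only."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zzz": 0.05}
narrowed = top_k_filter(probs, 2)   # only "the" and "a" remain
```

Unlike Top P, the cutoff does not adapt to how the probability mass is distributed, which is why many stacks apply both.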

Max Tokens

Maximum number of tokens the model can generate in a single response.
  • 512: Short responses (a few paragraphs)
  • 2048: Medium responses (most use cases)
  • 4096+: Long-form content (essays, code, detailed explanations)
  • -1: Unlimited (continues until natural stopping point)
Higher max tokens = longer potential responses but also higher processing time and cost (for cloud models). The model may stop earlier if it naturally completes the response.

Memory Settings

Context Length (Context Size)

How much conversation history the model remembers.
| Tokens | Approximate Words | Best For |
|--------|-------------------|----------|
| 2048 | ~1,500 words | Short Q&A, limited RAM |
| 4096 | ~3,000 words | Standard conversations |
| 8192 | ~6,000 words | Long discussions (default) |
| 16384+ | ~12,000+ words | Very long conversations, document analysis |
| 32768+ | ~24,000+ words | Entire documents, extensive context |
Jan defaults to 8192 tokens or your model’s maximum (whichever is smaller). This handles most conversations well.
Memory Impact:
  • Longer context = more RAM/VRAM usage
  • Each message in history consumes context
  • Tools and system prompts also use context
If you hit context limits, start a new conversation or summarize the discussion so far.
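To get a feel for context budgeting, here is a rough heuristic sketch (roughly 0.75 English words per token, matching the table above; real tokenizers vary by model, so treat this as a ballpark only):

```python
def estimate_tokens(text):
    """Very rough heuristic: ~0.75 English words per token.
    Real tokenizers vary by model; use only for ballpark budgeting."""
    return int(len(text.split()) / 0.75)

def fits_in_context(messages, context_length, reserve_for_reply=512):
    """Check whether the history plus a reply budget fits the window.
    reserve_for_reply is a hypothetical buffer for the model's answer."""
    used = sum(estimate_tokens(m) for m in messages)
    return used + reserve_for_reply <= context_length
```

Remember that the system prompt and any tool definitions also count against the same window.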

Hardware Acceleration

GPU Layers (ngl)

Controls how many model layers run on your GPU vs CPU.
| Setting | Performance | Memory | When to Use |
|---------|-------------|--------|-------------|
| Max (e.g., 40) | Fastest | High VRAM usage | You have a powerful GPU with plenty of VRAM |
| Moderate (20) | Balanced | Medium VRAM | GPU has limited memory |
| 0 | Slowest | No VRAM usage | CPU-only inference |
  1. Start High: Set GPU Layers to maximum (e.g., 40 for most models).
  2. Monitor Performance: Run the model and check if it loads successfully.
  3. Reduce if Needed: If you get out-of-memory errors, reduce by 5-10 layers at a time until it works.
Apple Silicon (M1/M2/M3/M4): GPU layers are automatically offloaded to the unified memory. Just maximize the setting.

Continuous Batching

Processes multiple requests or tokens simultaneously for better throughput.
  • Enabled (recommended): Better performance for multiple conversations or tool calls
  • Disabled: Simpler processing, slightly lower memory usage
Keep this enabled unless you’re troubleshooting issues.

Repetition Control

Repeat Penalty

Reduces the likelihood of repeating the same words or phrases.
  • 1.0: No penalty (default for some models)
  • 1.1 - 1.3: Reduces repetition (recommended for most users)
  • 1.5+: Strongly discourages repetition (may affect coherence)
Values above 1.5 can make responses feel awkward or forced. Start around 1.1-1.2.
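For intuition, here is an illustrative sketch of a llama.cpp-style repeat penalty, where recently seen tokens always become less likely: positive logits are divided by the penalty and negative logits are multiplied (exact behavior differs between inference engines):

```python
def apply_repeat_penalty(logits, recent_token_ids, penalty):
    """Penalise tokens already seen in the recent window (illustrative).
    Positive logits shrink; negative logits move further negative."""
    out = dict(logits)
    for tid in set(recent_token_ids):
        if tid in out:
            out[tid] = out[tid] / penalty if out[tid] > 0 else out[tid] * penalty
    return out

logits = {1: 2.0, 2: -1.0, 3: 0.5}   # hypothetical token id -> logit
penalized = apply_repeat_penalty(logits, recent_token_ids=[1, 2, 2], penalty=1.2)
```

This is why very high penalties hurt coherence: even tokens that genuinely should repeat (articles, code keywords) get pushed down.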

Frequency Penalty

Reduces repetition based on how often tokens appear.
  • 0: No penalty
  • 0.1 - 0.3: Mild discouragement of repeated words
  • 0.5+: Strong penalty (can reduce coherence)

Presence Penalty

Encourages the model to explore new topics.
  • 0: No penalty (sticks to current topic)
  • 0.1 - 0.5: Encourages topic variety
  • 1.0+: Strongly pushes for new topics (may lose focus)
Use Repeat Penalty for general repetition issues. Use Frequency/Presence Penalties for more nuanced control over topic exploration.
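The two penalties follow the OpenAI-style formula: each token's logit is reduced by its occurrence count times the frequency penalty, plus a flat presence penalty if it has appeared at all. An illustrative sketch:

```python
from collections import Counter

def apply_penalties(logits, generated_ids, frequency_penalty, presence_penalty):
    """OpenAI-style: logit -= count * frequency_penalty
                             + presence_penalty (once, if count > 0)."""
    counts = Counter(generated_ids)
    out = dict(logits)
    for tid, count in counts.items():
        if tid in out:
            out[tid] -= count * frequency_penalty + presence_penalty
    return out

logits = {1: 1.0, 2: 1.0}            # hypothetical token id -> logit
adjusted = apply_penalties(logits, generated_ids=[1, 1],
                           frequency_penalty=0.3, presence_penalty=0.5)
```

The frequency term grows with every repeat, while the presence term is a one-time cost, which is what makes presence penalty a topic-variety knob rather than a repetition fix.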

Advanced Settings

Min P

Minimum probability threshold for token selection.
  • 0.05 (default): Prevents very unlikely tokens
  • Lower values: Allow more unlikely tokens (more creative)
  • Higher values: Restrict to likely tokens (more focused)
Most users don’t need to adjust Min P. It works well at default settings.
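For reference, Min P keeps only tokens whose probability is at least `min_p` times that of the most likely token, so the cutoff adapts to how confident the model is. An illustrative sketch with made-up probabilities:

```python
def min_p_filter(probs, min_p):
    """Drop tokens below min_p times the top token's probability,
    then renormalise. Illustrative sketch only."""
    threshold = min_p * max(probs.values())
    kept = {t: p for t, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zzz": 0.001}
filtered = min_p_filter(probs, 0.05)   # "zzz" falls below 0.05 * 0.5 = 0.025
```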

Stop Sequences

Tokens or strings that tell the model to stop generating.
"stop": ["\n\n", "User:", "###"]
  • Model stops when it generates any of these strings
  • Useful for controlling output format
  • Often used in chat templates
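A minimal sketch of the truncation behavior: generation ends at the earliest occurrence of any stop sequence (illustrative; inference engines apply this during generation rather than after the fact):

```python
def truncate_at_stop(text, stop_sequences):
    """Cut generated text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

answer = truncate_at_stop("4\n\nUser: next question", ["\n\n", "User:"])
```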

Prompt Template

Defines how the model interprets system messages, user input, and assistant responses.
<|system|>
{system_message}
<|user|>
{prompt}
<|assistant|>
Changing the prompt template can break model behavior. Only modify if you understand your model’s expected format.
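To see how a template is filled, here is an illustrative sketch; the tag names mirror the example above, but every model family defines its own exact format:

```python
def render_prompt(template, system_message, prompt):
    """Fill a chat template's placeholders with actual messages."""
    return (template
            .replace("{system_message}", system_message)
            .replace("{prompt}", prompt))

# Hypothetical template; real models each expect their own exact tags.
template = "<|system|>\n{system_message}\n<|user|>\n{prompt}\n<|assistant|>\n"
rendered = render_prompt(template, "You are concise.", "What is 2+2?")
```

If the tags don't match what the model was trained on, it may ignore the system message or produce malformed turns, which is why editing the template is risky.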

Use Case Presets

Balanced settings for everyday conversations:
{
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 2048,
  "context_length": 8192,
  "repeat_penalty": 1.1
}
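To apply a preset programmatically, you can merge it into an OpenAI-style chat request body. This sketch assumes an OpenAI-compatible `/v1/chat/completions` endpoint (such as Jan's local API server); the model name is a placeholder:

```python
import json

def build_chat_payload(model, user_message, preset):
    """Merge a parameter preset into an OpenAI-style chat request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        **preset,
    }

preset = {"temperature": 0.7, "top_p": 0.9, "max_tokens": 2048}
body = json.dumps(build_chat_payload("my-local-model", "Hello!", preset))
```

POST the resulting body with a `Content-Type: application/json` header; note that some parameters (e.g., `repeat_penalty`) are extensions supported by local servers rather than part of the OpenAI schema.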

Troubleshooting

Symptoms: Model keeps repeating words, phrases, or ideas
Solutions:
  • Increase Temperature to 0.8-1.0
  • Increase Repeat Penalty to 1.2-1.3
  • Add Frequency Penalty around 0.2
  • Try a different model (some are more prone to repetition)
Symptoms: Output doesn’t make sense or goes off-topic
Solutions:
  • Lower Temperature to 0.3-0.5
  • Reduce Top P to 0.85
  • Lower Top K to 30-40
  • Check your prompt for clarity
Symptoms: Model fails to load or crashes mid-generation
Solutions:
  • Reduce GPU Layers by 5-10 at a time
  • Lower Context Length to 4096 or less
  • Close other memory-intensive applications
  • Try a smaller model or lower quantization (Q4 instead of Q8)
  • Reduce Max Tokens if generating very long responses
Symptoms: Model takes forever to generate text
Solutions:
  • Increase GPU Layers to maximum (if you have a GPU)
  • Verify you’re using the correct backend (CUDA for NVIDIA, Metal for Apple)
  • Enable Continuous Batching
  • Reduce Context Length if very high (32k+ can be slow)
  • Close background applications
  • Try a smaller/faster model
Symptoms: Model ignores available tools
Solutions:
  • Enable Tools capability in model settings (click edit button)
  • Lower Temperature to 0.3-0.5 for more consistent tool calling
  • Be explicit: “Use the search tool to find…”
  • Try a model known for good tool calling (Jan v1, Claude, GPT-4)

Model Capabilities

Beyond parameters, you can enable special features:

Vision

Lets the model analyze images you share.
  1. Enable Vision: Click the edit button next to your model and toggle Vision on.
  2. Share Images: In chat, click the image icon or paste images directly.
  3. Ask Questions: “What’s in this image?” or “Describe this diagram in detail.”
Supported Models: GPT-4o, Claude Opus/Sonnet 4, Gemini Pro, LLaVA (local), Jan v2 VL

Tools (MCP)

Enables external tool calling via Model Context Protocol.
  1. Enable Tools: Click the edit button next to your model and toggle Tools on.
  2. Configure MCP Servers: See the MCP Integration guide for setting up tools.
  3. Use in Conversations: The model automatically uses tools when needed or when explicitly asked.

Reasoning

Enables step-by-step thinking for complex problems.
  • Best for: Math, logic puzzles, multi-step reasoning
  • Models: o3, o1, Claude Opus, specialized reasoning models
  • May increase response time but improves accuracy

Embeddings

Generates vector representations of text for semantic search and RAG.
  • Enable for: Document search, semantic similarity, RAG applications
  • Not needed for regular chat
  • Specialized models (e.g., bge-large-en) work best

Best Practices

  1. Start with Defaults: Use Jan’s defaults first. They work well for most use cases.
  2. Change One Thing at a Time: When tuning, adjust one parameter at a time so you understand its impact.
  3. Match Task to Settings: Use appropriate presets, such as low temperature for code and high temperature for creative work.
  4. Monitor Context Usage: Long conversations eat up context. Start fresh if you hit limits.
  5. Maximize GPU Usage: If you have a GPU, max out GPU layers for best performance.
  6. Document Your Settings: If you find settings that work well, note them for future reference.

Next Steps

Local Models

Learn which models work best for different parameter configurations

MCP Integration

Enhance models with external tools and data sources

API Server

Use parameter-tuned models via API

Cloud Integration

Apply parameters to cloud models like GPT-4 and Claude
