
Overview

Model parameters control every aspect of how your AI behaves - from creativity and randomness to memory and processing speed. Understanding these settings helps you get the most out of both local and cloud models.
You can adjust parameters per-conversation or set permanent defaults for each model. Changes apply immediately.

Accessing Settings

There are multiple ways to configure model behavior: set permanent defaults in each model's settings, or adjust a specific conversation.
To adjust settings for a specific chat:
  1. Open any conversation
  2. Click the gear icon next to your selected model
  3. Modify parameters in the sidebar
  4. Changes apply to this conversation only
Use per-conversation settings to experiment without affecting your model’s default behavior.

Core Parameters

Temperature

Controls randomness and creativity in responses.
| Value | Behavior | Best For |
|-------|----------|----------|
| 0.0 - 0.3 | Deterministic, focused, factual | Code generation, math, factual Q&A |
| 0.4 - 0.7 | Balanced creativity and coherence | General chat, explanations, summaries |
| 0.8 - 1.0 | Creative, varied, exploratory | Creative writing, brainstorming, storytelling |
| 1.0+ | Highly random, experimental | Experimental or artistic purposes |
# Low temperature: same prompt, near-identical outputs
Prompt: "What is 2+2?"
Output 1: "2+2 equals 4."
Output 2: "2+2 equals 4."
Output 3: "2+2 equals 4."
Start with 0.7 for general use. Lower it for precision tasks, raise it for creative work.
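Under the hood, temperature rescales the model's logits before sampling. This illustrative pure-Python sketch (not Jan's actual implementation) shows why low values make outputs near-deterministic and high values spread probability across more tokens:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature before softmax: low values
    sharpen the distribution, high values flatten it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                       # hypothetical next-token logits
cold = softmax_with_temperature(logits, 0.2)   # top token dominates
hot = softmax_with_temperature(logits, 1.5)    # probability spreads out
```

At 0.2 the top token takes almost all of the probability mass, so repeated runs pick the same token; at 1.5 the alternatives stay live, so outputs vary.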

Top P (Nucleus Sampling)

Controls diversity by limiting the pool of considered tokens.
  • 0.9 (default): Samples from the smallest set of tokens whose cumulative probability reaches 90% - a good balance
  • 0.95: Slightly more diverse outputs
  • 0.5: Very focused, less variety
  • 1.0: Considers all tokens
Top P works together with temperature. Both control randomness, but in different ways. Most users can leave Top P at default (0.9-0.95).
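As a rough sketch of what nucleus sampling does internally (illustrative only; real inference engines operate on logits, and the token probabilities here are made up):

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalise. probs: {token: probability}."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zzz": 0.05}
focused = top_p_filter(probs, 0.5)    # only the single top token survives
balanced = top_p_filter(probs, 0.9)   # keeps a wider nucleus of tokens
```

Lowering Top P shrinks the pool the sampler can draw from, which is why outputs get more focused.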

Top K

Limits the model to choosing from the K most likely next tokens.
  • 40 (default): Moderate variety
  • 20: More focused outputs
  • 80-100: More diverse outputs
Top K and Top P serve similar purposes. Many models work well with defaults - adjust only if you notice issues with repetition or randomness.
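Top K is the simpler cutoff: keep a fixed number of candidates regardless of their probabilities. An illustrative sketch with a made-up token-probability dictionary:

```python
def top_k_filter(probs, k):
    """Keep only the k most likely tokens, renormalised.
    probs: {token: probability}. Illustrative sketch only."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zzz": 0.05}
narrowed = top_k_filter(probs, 2)   # only "the" and "a" remain
```

Unlike Top P, the cutoff does not adapt to how the probability mass is distributed, which is why many stacks apply both.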

Max Tokens

Maximum number of tokens the model can generate in a single response.
  • 512: Short responses (a few paragraphs)
  • 2048: Medium responses (most use cases)
  • 4096+: Long-form content (essays, code, detailed explanations)
  • -1: Unlimited (continues until natural stopping point)
Higher max tokens = longer potential responses but also higher processing time and cost (for cloud models). The model may stop earlier if it naturally completes the response.

Memory Settings

Context Length (Context Size)

How much conversation history the model remembers.
| Tokens | Approximate Words | Best For |
|--------|-------------------|----------|
| 2048 | ~1,500 words | Short Q&A, limited RAM |
| 4096 | ~3,000 words | Standard conversations |
| 8192 | ~6,000 words | Long discussions (default) |
| 16384+ | ~12,000+ words | Very long conversations, document analysis |
| 32768+ | ~24,000+ words | Entire documents, extensive context |
Jan defaults to 8192 tokens or your model’s maximum (whichever is smaller). This handles most conversations well.
Memory Impact:
  • Longer context = more RAM/VRAM usage
  • Each message in history consumes context
  • Tools and system prompts also use context
If you hit context limits, start a new conversation or summarize the discussion so far.
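To get a feel for context budgeting, here is a rough heuristic sketch (roughly 0.75 English words per token, matching the table above; real tokenizers vary by model, so treat this as a ballpark only):

```python
def estimate_tokens(text):
    """Very rough heuristic: ~0.75 English words per token.
    Real tokenizers vary by model; use only for ballpark budgeting."""
    return int(len(text.split()) / 0.75)

def fits_in_context(messages, context_length, reserve_for_reply=512):
    """Check whether the history plus a reply budget fits the window.
    reserve_for_reply is a hypothetical buffer for the model's answer."""
    used = sum(estimate_tokens(m) for m in messages)
    return used + reserve_for_reply <= context_length
```

Remember that the system prompt and any tool definitions also count against the same window.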

Hardware Acceleration

GPU Layers (ngl)

Controls how many model layers run on your GPU vs CPU.
| Setting | Performance | Memory | When to Use |
|---------|-------------|--------|-------------|
| Max (e.g., 40) | Fastest | High VRAM usage | You have a powerful GPU with plenty of VRAM |
| Moderate (20) | Balanced | Medium VRAM | GPU has limited memory |
| 0 | Slowest | No VRAM usage | CPU-only inference |
  1. Start High: Set GPU Layers to maximum (e.g., 40 for most models).
  2. Monitor Performance: Run the model and check if it loads successfully.
  3. Reduce if Needed: If you get out-of-memory errors, reduce by 5-10 layers at a time until it works.
Apple Silicon (M1/M2/M3/M4): GPU layers are automatically offloaded to the unified memory. Just maximize the setting.

Continuous Batching

Processes multiple requests or tokens simultaneously for better throughput.
  • Enabled (recommended): Better performance for multiple conversations or tool calls
  • Disabled: Simpler processing, slightly lower memory usage
Keep this enabled unless you’re troubleshooting issues.

Repetition Control

Repeat Penalty

Reduces the likelihood of repeating the same words or phrases.
  • 1.0: No penalty (default for some models)
  • 1.1 - 1.3: Reduces repetition (recommended for most users)
  • 1.5+: Strongly discourages repetition (may affect coherence)
Values above 1.5 can make responses feel awkward or forced. Start around 1.1-1.2.
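For intuition, here is an illustrative sketch of a llama.cpp-style repeat penalty, where recently seen tokens always become less likely: positive logits are divided by the penalty and negative logits are multiplied (exact behavior differs between inference engines):

```python
def apply_repeat_penalty(logits, recent_token_ids, penalty):
    """Penalise tokens already seen in the recent window (illustrative).
    Positive logits shrink; negative logits move further negative."""
    out = dict(logits)
    for tid in set(recent_token_ids):
        if tid in out:
            out[tid] = out[tid] / penalty if out[tid] > 0 else out[tid] * penalty
    return out

logits = {1: 2.0, 2: -1.0, 3: 0.5}   # hypothetical token id -> logit
penalized = apply_repeat_penalty(logits, recent_token_ids=[1, 2, 2], penalty=1.2)
```

This is why very high penalties hurt coherence: even tokens that genuinely should repeat (articles, code keywords) get pushed down.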

Frequency Penalty

Reduces repetition based on how often tokens appear.
  • 0: No penalty
  • 0.1 - 0.3: Mild discouragement of repeated words
  • 0.5+: Strong penalty (can reduce coherence)

Presence Penalty

Encourages the model to explore new topics.
  • 0: No penalty (sticks to current topic)
  • 0.1 - 0.5: Encourages topic variety
  • 1.0+: Strongly pushes for new topics (may lose focus)
Use Repeat Penalty for general repetition issues. Use Frequency/Presence Penalties for more nuanced control over topic exploration.
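The two penalties follow the OpenAI-style formula: each token's logit is reduced by its occurrence count times the frequency penalty, plus a flat presence penalty if it has appeared at all. An illustrative sketch:

```python
from collections import Counter

def apply_penalties(logits, generated_ids, frequency_penalty, presence_penalty):
    """OpenAI-style: logit -= count * frequency_penalty
                             + presence_penalty (once, if count > 0)."""
    counts = Counter(generated_ids)
    out = dict(logits)
    for tid, count in counts.items():
        if tid in out:
            out[tid] -= count * frequency_penalty + presence_penalty
    return out

logits = {1: 1.0, 2: 1.0}            # hypothetical token id -> logit
adjusted = apply_penalties(logits, generated_ids=[1, 1],
                           frequency_penalty=0.3, presence_penalty=0.5)
```

The frequency term grows with every repeat, while the presence term is a one-time cost, which is what makes presence penalty a topic-variety knob rather than a repetition fix.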

Advanced Settings

Min P

Minimum probability threshold for token selection.
  • 0.05 (default): Prevents very unlikely tokens
  • Lower values: Allow more unlikely tokens (more creative)
  • Higher values: Restrict to likely tokens (more focused)
Most users don’t need to adjust Min P. It works well at default settings.
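For reference, Min P keeps only tokens whose probability is at least `min_p` times that of the most likely token, so the cutoff adapts to how confident the model is. An illustrative sketch with made-up probabilities:

```python
def min_p_filter(probs, min_p):
    """Drop tokens below min_p times the top token's probability,
    then renormalise. Illustrative sketch only."""
    threshold = min_p * max(probs.values())
    kept = {t: p for t, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zzz": 0.001}
filtered = min_p_filter(probs, 0.05)   # "zzz" falls below 0.05 * 0.5 = 0.025
```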

Stop Sequences

Tokens or strings that tell the model to stop generating.
"stop": ["\n\n", "User:", "###"]
  • Model stops when it generates any of these strings
  • Useful for controlling output format
  • Often used in chat templates
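A minimal sketch of the truncation behavior: generation ends at the earliest occurrence of any stop sequence (illustrative; inference engines apply this during generation rather than after the fact):

```python
def truncate_at_stop(text, stop_sequences):
    """Cut generated text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

answer = truncate_at_stop("4\n\nUser: next question", ["\n\n", "User:"])
```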

Prompt Template

Defines how the model interprets system messages, user input, and assistant responses.
<|system|>
{system_message}
<|user|>
{prompt}
<|assistant|>
Changing the prompt template can break model behavior. Only modify if you understand your model’s expected format.
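To see how a template is filled, here is an illustrative sketch; the tag names mirror the example above, but every model family defines its own exact format:

```python
def render_prompt(template, system_message, prompt):
    """Fill a chat template's placeholders with actual messages."""
    return (template
            .replace("{system_message}", system_message)
            .replace("{prompt}", prompt))

# Hypothetical template; real models each expect their own exact tags.
template = "<|system|>\n{system_message}\n<|user|>\n{prompt}\n<|assistant|>\n"
rendered = render_prompt(template, "You are concise.", "What is 2+2?")
```

If the tags don't match what the model was trained on, it may ignore the system message or produce malformed turns, which is why editing the template is risky.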

Use Case Presets

Balanced settings for everyday conversations:
{
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 2048,
  "context_length": 8192,
  "repeat_penalty": 1.1
}
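To apply a preset programmatically, you can merge it into an OpenAI-style chat request body. This sketch assumes an OpenAI-compatible `/v1/chat/completions` endpoint (such as Jan's local API server); the model name is a placeholder:

```python
import json

def build_chat_payload(model, user_message, preset):
    """Merge a parameter preset into an OpenAI-style chat request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        **preset,
    }

preset = {"temperature": 0.7, "top_p": 0.9, "max_tokens": 2048}
body = json.dumps(build_chat_payload("my-local-model", "Hello!", preset))
```

POST the resulting body with a `Content-Type: application/json` header; note that some parameters (e.g., `repeat_penalty`) are extensions supported by local servers rather than part of the OpenAI schema.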

Troubleshooting

Symptoms: Model keeps repeating words, phrases, or ideas
Solutions:
  • Increase Temperature to 0.8-1.0
  • Increase Repeat Penalty to 1.2-1.3
  • Add Frequency Penalty around 0.2
  • Try a different model (some are more prone to repetition)
Symptoms: Output doesn’t make sense or goes off-topic
Solutions:
  • Lower Temperature to 0.3-0.5
  • Reduce Top P to 0.85
  • Lower Top K to 30-40
  • Check your prompt for clarity
Symptoms: Model fails to load or crashes mid-generation
Solutions:
  • Reduce GPU Layers by 5-10 at a time
  • Lower Context Length to 4096 or less
  • Close other memory-intensive applications
  • Try a smaller model or lower quantization (Q4 instead of Q8)
  • Reduce Max Tokens if generating very long responses
Symptoms: Model takes forever to generate text
Solutions:
  • Increase GPU Layers to maximum (if you have a GPU)
  • Verify you’re using the correct backend (CUDA for NVIDIA, Metal for Apple)
  • Enable Continuous Batching
  • Reduce Context Length if very high (32k+ can be slow)
  • Close background applications
  • Try a smaller/faster model
Symptoms: Model ignores available tools
Solutions:
  • Enable Tools capability in model settings (click edit button)
  • Lower Temperature to 0.3-0.5 for more consistent tool calling
  • Be explicit: “Use the search tool to find…”
  • Try a model known for good tool calling (Jan v1, Claude, GPT-4)

Model Capabilities

Beyond parameters, you can enable special features:

Vision

Lets the model analyze images you share.
  1. Enable Vision: Click the edit button next to your model and toggle Vision on.
  2. Share Images: In chat, click the image icon or paste images directly.
  3. Ask Questions: “What’s in this image?” or “Describe this diagram in detail.”
Supported Models: GPT-4o, Claude Opus/Sonnet 4, Gemini Pro, LLaVA (local), Jan v2 VL

Tools (MCP)

Enables external tool calling via Model Context Protocol.
  1. Enable Tools: Click the edit button next to your model and toggle Tools on.
  2. Configure MCP Servers: See the MCP Integration guide for setting up tools.
  3. Use in Conversations: The model automatically uses tools when needed or when explicitly asked.

Reasoning

Enables step-by-step thinking for complex problems.
  • Best for: Math, logic puzzles, multi-step reasoning
  • Models: o3, o1, Claude Opus, specialized reasoning models
  • May increase response time but improves accuracy

Embeddings

Generates vector representations of text for semantic search and RAG.
  • Enable for: Document search, semantic similarity, RAG applications
  • Not needed for regular chat
  • Specialized models (e.g., bge-large-en) work best

Best Practices

  1. Start with Defaults: Use Jan’s defaults first. They work well for most use cases.
  2. Change One Thing at a Time: When tuning, adjust one parameter at a time so you understand its impact.
  3. Match Task to Settings: Use appropriate presets, such as low temperature for code and high temperature for creative work.
  4. Monitor Context Usage: Long conversations eat up context. Start fresh if you hit limits.
  5. Maximize GPU Usage: If you have a GPU, max out GPU layers for best performance.
  6. Document Your Settings: If you find settings that work well, note them for future reference.

Next Steps

Local Models

Learn which models work best for different parameter configurations

MCP Integration

Enhance models with external tools and data sources

API Server

Use parameter-tuned models via API

Cloud Integration

Apply parameters to cloud models like GPT-4 and Claude
