Supported Models
Quest supports two primary models through Ollama, each optimized for different tasks.

Qwen2.5-Coder (Default Model)
- Code generation and synthesis
- Fast responses to standard queries
- Explanation of algorithms and data structures
- General-purpose programming assistance
- Parameters: 1.5 billion
- Context window: 32K tokens
- Inference speed: Fast (2-5s typical)
- Specialization: Code-focused pre-training
DeepSeek-R1 (Reasoning Model)
- Complex problem decomposition
- Step-by-step reasoning for difficult problems
- Trade-off analysis (time vs. space complexity)
- Edge case identification
- Multi-step solution strategies
- Parameters: 7 billion
- Context window: 32K tokens
- Inference speed: Moderate (5-15s typical)
- Specialization: Chain-of-thought reasoning
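DeepSeek-R1 emits its intermediate reasoning inside <think> tags before the final answer. Stripping them can be sketched with a regular expression; this is a minimal illustration, not Quest's actual implementation:

```python
import re

def strip_think(text: str) -> str:
    """Drop <think>...</think> blocks and return only the final answer."""
    # DOTALL lets the pattern span multi-line reasoning; non-greedy so
    # multiple blocks are each removed individually.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
```

For example, `strip_think("<think>compare sorts...</think>Use merge sort.")` yields just the final answer text.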
Note on <think> tags: the RAGEngine automatically filters out <think> blocks and returns only the final answer.

Model Selection
Choosing Between Models
Use General Mode (Qwen2.5-Coder) for:
- Standard coding questions
- Quick explanations
- Syntax help
- Algorithm implementation
Use Reasoning Mode (DeepSeek-R1) for:
- Complex problem analysis
- Multiple solution approaches
- Optimization strategies
- Interview preparation
Switching Models at Runtime
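A minimal sketch of mode-based model selection. The model tags below are typical Ollama names (the ones installed locally may differ), and `select_model` is an illustrative helper, not part of Quest's API:

```python
# Map Quest's two modes to Ollama model tags (tags are assumptions).
MODELS = {
    "general": "qwen2.5-coder:1.5b",  # fast, code-focused default
    "reasoning": "deepseek-r1:7b",    # slower, chain-of-thought reasoning
}

def select_model(mode: str) -> str:
    """Return the Ollama model tag for the requested mode."""
    if mode not in MODELS:
        raise ValueError(f"unknown mode: {mode!r}")
    return MODELS[mode]

# The returned tag would then be passed to the Ollama client, e.g.
# ollama.chat(model=select_model("reasoning"), messages=[...]).
```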
Generation Parameters
Temperature
Controls randomness and creativity in generation:
- Low (0.1-0.3): Highly deterministic, focused output. Use for code generation and factual answers. Risk: may be repetitive or rigid.
- Medium (0.4-0.6): Balanced output with mild variation. Use for explanations and general queries.
- High (0.7-1.0): Diverse, creative output. Use for brainstorming alternative approaches. Risk: less predictable, may drift from the prompt.
Top-P (Nucleus Sampling)
Controls diversity by limiting the token pool:

| top_p | Behavior | Use Case |
|---|---|---|
| 0.5 | Very focused, considers only top 50% of probability mass | Deterministic tasks |
| 0.7 | Moderately focused | Structured code generation |
| 0.9 | Balanced (default) | General-purpose |
| 0.95 | More diverse | Creative explanations |
Repeat Penalty
Reduces repetition by penalizing previously generated tokens:
- None (1.0): No penalty. Risk: the model may repeat phrases or code patterns.
- Light (1.1): Mild discouragement of repetition; a sensible default.
- Medium (1.2): Stronger discouragement, useful for longer outputs.
- Strong (1.5+): Aggressive penalty. Risk: degraded fluency and unnatural wording.
Number of Threads
Controls CPU parallelism during inference:

| CPU Cores | Recommended num_thread | Notes |
|---|---|---|
| 4 | 4 | Use all cores |
| 8 | 6-8 | Leave headroom for OS |
| 16 | 12-16 | High-performance systems |
| 32+ | 16-24 | Diminishing returns above 16 |
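The table above can be expressed as a small helper. The thresholds are the table's recommendations, not an Ollama requirement:

```python
import os

def recommended_threads(cores=None):
    """Pick a num_thread value from the core-count heuristics above."""
    cores = cores or os.cpu_count() or 4
    if cores <= 4:
        return cores               # use all cores
    if cores <= 8:
        return max(cores - 2, 4)   # leave headroom for the OS
    if cores <= 16:
        return min(cores - 2, 16)  # high-performance systems
    return 16                      # diminishing returns above 16
```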
Example Configurations
Code Generation (Optimized)
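A plausible option set for this profile, drawn from the parameter guidance above; the model tag and exact values are illustrative assumptions, not Quest's shipped defaults:

```python
# Optimized for deterministic code output: low temperature, focused sampling.
code_generation = {
    "model": "qwen2.5-coder:1.5b",
    "temperature": 0.2,     # highly deterministic
    "top_p": 0.7,           # structured code generation
    "repeat_penalty": 1.1,  # light repetition penalty
    "num_thread": 8,
}
```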
Reasoning and Explanation (Balanced)
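An illustrative option set for this profile (values and model tag are assumptions based on the guidance above):

```python
# Balanced settings for step-by-step reasoning and in-depth explanations.
reasoning_balanced = {
    "model": "deepseek-r1:7b",
    "temperature": 0.5,     # medium: some variation, still grounded
    "top_p": 0.9,           # balanced default
    "repeat_penalty": 1.1,  # light repetition penalty
    "num_thread": 8,
}
```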
Creative Exploration (High Variance)
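An illustrative option set for this profile (values are assumptions chosen from the high end of the ranges above):

```python
# High-variance settings for exploring multiple solution approaches.
creative_exploration = {
    "model": "deepseek-r1:7b",
    "temperature": 0.9,     # diverse, creative output
    "top_p": 0.95,          # wider token pool
    "repeat_penalty": 1.2,  # keep long outputs from looping
    "num_thread": 8,
}
```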
Low-Resource Systems
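An illustrative option set for constrained machines (values are assumptions; `num_ctx` is Ollama's context-window option):

```python
# Conservative settings for machines with limited CPU and RAM.
low_resource = {
    "model": "qwen2.5-coder:1.5b",  # ~2GB model instead of the ~6GB one
    "temperature": 0.3,
    "top_p": 0.9,
    "num_thread": 4,    # modest parallelism on 4-core machines
    "num_ctx": 8192,    # reduced context window to cut memory use
}
```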
Configuration Best Practices
Start Conservative
Begin with default values and adjust incrementally based on output quality.
Match Task to Model
Use general mode for code, reasoning mode for complex analysis.
Monitor Performance
Track inference time and adjust num_thread if responses are slow.
Test Systematically
Change one parameter at a time to understand its impact.
Model Performance Comparison
| Metric | Qwen2.5-Coder | DeepSeek-R1 |
|---|---|---|
| Response Time | 2-5s | 5-15s |
| Code Quality | Excellent | Very Good |
| Explanation Depth | Good | Excellent |
| Reasoning | Basic | Advanced |
| Memory Usage | ~2GB | ~6GB |
| Best For | Implementation | Analysis |