Supported Models

Quest supports two primary models through Ollama, each optimized for different tasks:

Qwen2.5-Coder (Default Model)

model_name="qwen2.5-coder:1.5b"
Best for:
  • Code generation and synthesis
  • Fast responses to standard queries
  • Explanation of algorithms and data structures
  • General-purpose programming assistance
Specifications:
  • Parameters: 1.5 billion
  • Context window: 32K tokens
  • Inference speed: Fast (2-5s typical)
  • Specialization: Code-focused pre-training
Qwen2.5-Coder is the default model because it provides the best balance of speed and code quality for DSA problems.

DeepSeek-R1 (Reasoning Model)

reasoning_model="deepseek-r1:7b"
Best for:
  • Complex problem decomposition
  • Step-by-step reasoning for difficult problems
  • Trade-off analysis (time vs. space complexity)
  • Edge case identification
  • Multi-step solution strategies
Specifications:
  • Parameters: 7 billion
  • Context window: 32K tokens
  • Inference speed: Moderate (5-15s typical)
  • Specialization: Chain-of-thought reasoning
Reasoning Format: DeepSeek-R1 outputs its thinking process in <think> tags:
<think>
1. The problem requires finding two indices...
2. A hash map can reduce time complexity...
3. Edge case: duplicate elements...
</think>

Here's the solution using a hash map approach...
The RAGEngine automatically filters out <think> blocks and returns only the final answer.
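That filtering step can be sketched with a regular expression. `strip_think` below is a hypothetical helper for illustration, not the actual RAGEngine internals:

```python
import re

def strip_think(text: str) -> str:
    """Remove <think>...</think> blocks, keeping only the final answer."""
    # DOTALL lets the pattern span multi-line reasoning blocks
    cleaned = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    return cleaned.strip()
```
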

Model Selection

Choosing Between Models

Use General Mode

  • Standard coding questions
  • Quick explanations
  • Syntax help
  • Algorithm implementation

Use Reasoning Mode

  • Complex problem analysis
  • Multiple solution approaches
  • Optimization strategies
  • Interview preparation
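As a rough illustration of this split, a keyword heuristic could route questions to the right mode automatically. `pick_mode` and its keyword list are hypothetical, not part of Quest:

```python
# Words that typically signal analysis-style questions (illustrative only)
REASONING_KEYWORDS = {"compare", "trade-off", "tradeoff", "optimize",
                      "analyze", "approaches", "strategies"}

def pick_mode(question: str) -> str:
    """Heuristic: route analysis-style questions to reasoning mode."""
    words = set(question.lower().split())
    return "reasoning" if words & REASONING_KEYWORDS else "general"
```

In practice the keyword set would need tuning; the point is only that the mode choice can be made per-query rather than fixed at startup.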

Switching Models at Runtime

from rag_engine import RAGEngine

rag_engine = RAGEngine(retriever)

# Start with general mode (fast)
rag_engine.set_mode("general")
response = rag_engine.answer_question("Implement binary search")

# Switch to reasoning mode for complex query
rag_engine.set_mode("reasoning")
response = rag_engine.answer_question(
    "Compare different approaches to the knapsack problem"
)

Generation Parameters

Temperature

Controls randomness and creativity in generation:
rag_engine = RAGEngine(
    retriever=retriever,
    temperature=0.4  # Default
)
temperature=0.2
  • Output: Highly deterministic, focused
  • Use for: Code generation, factual answers
  • Risk: May be repetitive or rigid

Top-P (Nucleus Sampling)

Controls diversity by limiting the token pool:
rag_engine = RAGEngine(
    retriever=retriever,
    top_p=0.9  # Default
)
top_p   Behavior                                                   Use Case
0.5     Very focused; considers only top 50% of probability mass   Deterministic tasks
0.7     Moderately focused                                         Structured code generation
0.9     Balanced (default)                                         General-purpose
0.95    More diverse                                               Creative explanations
Setting top_p below 0.5 often leads to repetitive outputs. Keep it at 0.7 or higher for most tasks.

Repeat Penalty

Reduces repetition by penalizing previously generated tokens:
rag_engine = RAGEngine(
    retriever=retriever,
    repeat_penalty=1.1  # Default
)
repeat_penalty=1.0
  • Effect: No penalty
  • Risk: Model may repeat phrases or code patterns

Number of Threads

Controls CPU parallelism during inference:
rag_engine = RAGEngine(
    retriever=retriever,
    num_thread=8  # Default
)
Hardware-based recommendations:
CPU Cores   Recommended num_thread   Notes
4           4                        Use all cores
8           6-8                      Leave headroom for the OS
16          12-16                    High-performance systems
32+         16-24                    Diminishing returns above 16
For laptops or shared systems, set num_thread to 50-75% of available cores to avoid overloading the CPU.
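The table above can be turned into a small helper that suggests a thread count from the detected core count. `recommended_threads` is an illustrative sketch, not a Quest API:

```python
import os
from typing import Optional

def recommended_threads(cores: Optional[int] = None) -> int:
    """Suggest num_thread per the guidelines above: all cores up to 4,
    roughly 75% of cores on larger machines, capped at 16."""
    cores = cores or os.cpu_count() or 4
    if cores <= 4:
        return cores
    return min(16, max(4, (cores * 3) // 4))
```
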

Example Configurations

Code Generation (Optimized)

rag_engine = RAGEngine(
    retriever=retriever,
    model_name="qwen2.5-coder:1.5b",
    mode="general",
    temperature=0.3,      # Low for consistency
    top_p=0.85,           # Focused
    repeat_penalty=1.1,
    num_thread=8
)

Reasoning and Explanation (Balanced)

rag_engine = RAGEngine(
    retriever=retriever,
    reasoning_model="deepseek-r1:7b",
    mode="reasoning",
    temperature=0.5,      # Moderate for clarity
    top_p=0.9,
    repeat_penalty=1.1,
    num_thread=12         # More resources for larger model
)

Creative Exploration (High Variance)

rag_engine = RAGEngine(
    retriever=retriever,
    model_name="qwen2.5-coder:1.5b",
    mode="general",
    temperature=0.7,      # Higher for variety
    top_p=0.95,           # More diverse
    repeat_penalty=1.2,   # Encourage variation
    num_thread=8
)

Low-Resource Systems

rag_engine = RAGEngine(
    retriever=retriever,
    model_name="qwen2.5-coder:1.5b",  # Smaller model
    mode="general",
    temperature=0.4,
    top_p=0.9,
    repeat_penalty=1.1,
    num_thread=4          # Fewer threads
)

Configuration Best Practices

Start Conservative

Begin with default values and adjust incrementally based on output quality.

Match Task to Model

Use general mode for code, reasoning mode for complex analysis.

Monitor Performance

Track inference time and adjust num_thread if responses are slow.

Test Systematically

Change one parameter at a time to understand its impact.
Avoid these common mistakes:
  • Setting temperature > 0.8 for code generation (causes syntax errors)
  • Using too many threads on low-core systems (causes thrashing)
  • Combining high temperature with high repeat penalty (produces incoherent text)
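These pitfalls lend themselves to a pre-flight check before constructing the engine. `check_params` is a hypothetical helper that flags the combinations listed above:

```python
def check_params(temperature: float, top_p: float, repeat_penalty: float,
                 num_thread: int, cores: int) -> list:
    """Flag the parameter combinations the guidelines above warn against."""
    warnings = []
    if temperature > 0.8:
        warnings.append("temperature > 0.8 risks syntax errors in generated code")
    if num_thread > cores:
        warnings.append("num_thread exceeds CPU cores; expect thrashing")
    if temperature > 0.7 and repeat_penalty > 1.2:
        warnings.append("high temperature plus high repeat penalty can produce incoherent text")
    if top_p < 0.5:
        warnings.append("top_p below 0.5 often causes repetitive output")
    return warnings
```
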

Model Performance Comparison

Metric              Qwen2.5-Coder    DeepSeek-R1
Response Time       2-5s             5-15s
Code Quality        Excellent        Very Good
Explanation Depth   Good             Excellent
Reasoning           Basic            Advanced
Memory Usage        ~2GB             ~6GB
Best For            Implementation   Analysis