Quickstart

This guide will walk you through setting up Ollama, pulling the required models, and running your first query with Quest.

Install Ollama

Quest uses Ollama to run language models locally. Install Ollama for your platform:
# Download and install from official website
curl -fsSL https://ollama.ai/install.sh | sh

# Or download the installer from:
# https://ollama.ai/download/mac
Ollama runs as a local server on http://localhost:11434. Quest communicates with this API to generate responses.
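Before pulling models, you can confirm the server is reachable. A minimal sketch using only the standard library (Ollama's /api/tags endpoint lists locally installed models; any valid JSON reply means the server is up):

```python
import json
import urllib.error
import urllib.request

def ollama_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers at base_url."""
    try:
        # /api/tags lists locally available models; a JSON reply means the server is running.
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            json.load(resp)
            return True
    except (urllib.error.URLError, ValueError):
        return False

print("Ollama reachable:", ollama_is_up())
```

If this prints `False`, start Ollama (see Troubleshooting below) before continuing.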

Pull required models

Quest uses two different models depending on the query mode:

1. Pull qwen2.5-coder:1.5b

This is the default model for general queries and explanations:
ollama pull qwen2.5-coder:1.5b
Model specs:
  • Size: ~1.5B parameters
  • Purpose: Code generation and explanation
  • Speed: Fast inference (typically < 15 seconds per query)
This model is optimized for coding problems and generates concise, accurate solutions.

2. Pull deepseek-r1:7b

This model is used for complex reasoning tasks:
ollama pull deepseek-r1:7b
Model specs:
  • Size: ~7B parameters
  • Purpose: Step-by-step reasoning for complex problems
  • Speed: Slower but more thorough (typically < 4 minutes)
The reasoning model generates <think> blocks that Quest automatically filters out to provide clean answers. You can also use the smaller deepseek-r1:1.5b variant by configuring the reasoning_model parameter.
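Quest's actual filtering lives inside its RAG engine; the idea can be sketched as a regex that drops the reasoning trace before the answer is shown:

```python
import re

# Matches a <think>...</think> block (non-greedy, across newlines) plus trailing whitespace.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_think_blocks(text: str) -> str:
    """Remove <think>...</think> reasoning traces from a model response."""
    return THINK_BLOCK.sub("", text).strip()

raw = "<think>Let me reason step by step...</think>Use a hash map for O(n) lookup."
print(strip_think_blocks(raw))  # Use a hash map for O(n) lookup.
```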

3. Verify models are installed

Check that both models are available:
ollama list
You should see both qwen2.5-coder:1.5b and deepseek-r1:7b in the output.

Start the Flask app

Now you’re ready to run Quest:
cd Quest
python app.py
You should see output like:
 * Serving Flask app 'app'
 * Debug mode: off
INFO:werkzeug:WARNING: This is a development server.
 * Running on http://127.0.0.1:5000
INFO:root:RAG Engine initialized successfully.
INFO:root:Retriever initialized successfully.
The first time you run Quest, it may take a few seconds to load the sentence transformer model (all-MiniLM-L6-v2) into memory.

Make your first query

With the Flask app running, you can interact with Quest in two ways:

Using the web interface

Open your browser and navigate to:
http://127.0.0.1:5000
You’ll see the Quest interface where you can:
  • Enter queries in the search box
  • Switch between “General” and “Reasoning” modes
  • View conversation history
  • Clear history when starting a new topic

Using the API directly

You can also query Quest programmatically:
import requests

# Make a query
response = requests.post(
    "http://127.0.0.1:5000/search",
    json={
        "query": "Explain the Two Sum problem",
        "mode": "general"
    }
)

result = response.json()
print(result["response"])

Example queries

Try these example queries to see Quest in action:
When you query an exact problem title, Quest retrieves it instantly from the hash map:
{
  "query": "Two Sum",
  "mode": "general"
}
Response includes the complete solution with metadata.
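Why exact-title queries are instant can be sketched as a dictionary lookup. The names below are hypothetical; Quest's real index is built by its retriever from the dataset:

```python
# Hypothetical in-memory index keyed by normalized problem title.
solutions_by_title = {
    "two sum": {"difficulty": "Easy", "topics": ["Array", "Hash Table"]},
    "coin change": {"difficulty": "Medium", "topics": ["Dynamic Programming"]},
}

def exact_match(query: str):
    """O(1) lookup: normalize the query and probe the title hash map."""
    return solutions_by_title.get(query.strip().lower())

print(exact_match("Two Sum"))
```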
For general questions, Quest retrieves similar problems and generates explanations:
{
  "query": "Explain dynamic programming with an example",
  "mode": "general"
}
Quest finds relevant DP problems and generates a detailed explanation.
For harder problems, switch to reasoning mode:
{
  "query": "How do I optimize a recursive solution with memoization?",
  "mode": "reasoning"
}
The deepseek-r1 model provides step-by-step reasoning.
Quest maintains conversation history (configurable, default 3 queries):
// First query
{"query": "What is the Two Sum problem?", "mode": "general"}

// Follow-up (uses context from first query)
{"query": "Can you show me the hash map approach?", "mode": "general"}
The second response incorporates context from the first query.
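The rolling window can be sketched with a bounded deque (a simplification of Quest's history handling, using its default of 3):

```python
from collections import deque

# maxlen=3 mirrors Quest's default max_history: older entries are evicted automatically.
history = deque(maxlen=3)

for query in ["What is the Two Sum problem?",
              "Can you show me the hash map approach?",
              "What is its time complexity?",
              "Now explain Three Sum"]:
    history.append(query)

print(list(history))  # only the 3 most recent queries survive
```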

Understanding the response

Quest responses include:
  • Exact Match Solution - If query exactly matches a problem title
  • Generated Solution - For general queries with retrieved context
  • Relevant code snippets and explanations
  • Problem metadata (difficulty, topics, companies)
Example response structure:
Generated Solution:
The Two Sum problem asks you to find two numbers in an array that add up to a specific target...

**Approach:**
1. Use a hash map to store numbers and their indices
2. For each number, check if target - number exists in the map
3. Return indices when found

**Implementation:**
[code snippet]

**Complexity:**
- Time: O(n)
- Space: O(n)
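The hash-map approach outlined in the example response, as runnable Python:

```python
def two_sum(nums, target):
    """Return indices of the two numbers summing to target, in one pass with a hash map."""
    seen = {}  # value -> index
    for i, n in enumerate(nums):
        complement = target - n
        if complement in seen:
            return [seen[complement], i]
        seen[n] = i
    return []

print(two_sum([2, 7, 11, 15], 9))  # [0, 1]
```

Each element is visited once and each lookup is O(1), giving the O(n) time and O(n) space listed above.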

API endpoints

The Flask app exposes several endpoints:
Endpoint       | Method | Description
-------------- | ------ | -----------
/              | GET    | Render the web interface
/search        | POST   | Submit a query (see examples above)
/set_mode      | POST   | Switch between general and reasoning modes
/get_history   | GET    | Retrieve conversation history
/clear_history | POST   | Clear conversation history
/stop          | POST   | Stop ongoing generation
For detailed API documentation, see the API endpoints and Core Components pages.

Using the Python API directly

You can also use Quest without the Flask app:
from src.DSAAssistant.components.retriever2 import LeetCodeRetriever
from rag_engine3 import RAGEngine

# Initialize components
retriever = LeetCodeRetriever()
rag_engine = RAGEngine(
    retriever=retriever,
    max_history=3  # Keep last 3 interactions
)

# Set mode (general or reasoning)
rag_engine.set_mode("general")

# Query the engine
response = rag_engine.answer_question(
    query="Explain the concept of dynamic programming",
    k=5,  # Retrieve top 5 similar problems
    min_confidence=0.6  # Minimum similarity threshold
)

print(response)

Configuration options

The RAGEngine constructor accepts these parameters:
rag_engine = RAGEngine(
    retriever=retriever,
    ollama_url="http://localhost:11434/api/generate",
    model_name="qwen2.5-coder:1.5b",      # General mode model
    reasoning_model="deepseek-r1:7b",     # Reasoning mode model
    mode="general",                        # Default mode
    temperature=0.4,                       # Lower = more focused
    top_p=0.9,                            # Nucleus sampling
    confidence_threshold=0.7,              # Retrieval threshold
    repeat_penalty=1.1,                    # Reduce repetition
    num_thread=8,                          # CPU threads for inference
    max_history=3                          # Conversation memory
)
Adjust temperature (0.1-1.0) to control response creativity. Lower values make responses more deterministic.

Advanced usage

Metadata filtering

Filter solutions by company, difficulty, or topics:
retriever = LeetCodeRetriever()

# Find all medium difficulty problems from Amazon about BFS
filtered = retriever.filter_by_metadata(
    companies=["Amazon"],
    difficulty="Medium",
    topics=["BFS"]
)

for solution in filtered:
    print(f"Title: {solution.title}")
    print(f"Topics: {solution.topics}")

Custom HNSW parameters

Tune retrieval speed vs accuracy:
retriever = LeetCodeRetriever(
    ef_search=32  # Default: 32. Higher = more accurate but slower
)
From retriever2.py:47:
  • ef_search=16 - Faster, less accurate
  • ef_search=32 - Balanced (default)
  • ef_search=64 - Slower, more accurate

Troubleshooting

Ollama connection errors

If requests to http://localhost:11434 fail, Ollama isn't running. Start it with:
ollama serve
Or on Windows/macOS, launch the Ollama application.
Model not found

If Ollama reports a missing model, pull it first:
ollama pull qwen2.5-coder:1.5b
ollama pull deepseek-r1:7b
Queries are slow

Several factors affect speed:
  • First query - slow while models load into memory
  • Large k value - reduce k in search (try k=3 instead of k=5)
  • Reasoning mode - inherently slower; switch to general mode for faster responses
  • CPU threads - increase the num_thread parameter if you have more cores
No relevant results

If Quest can't find a good answer, try:
  • Lowering the min_confidence threshold (default 0.6)
  • Rephrasing your query to be more specific
  • Checking that the problem exists in the dataset (1800+ LeetCode problems)
  • Using exact problem titles for instant matches
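To see why lowering min_confidence helps, note that the threshold gates retrieved matches on a similarity score. Assuming cosine similarity (the usual metric for all-MiniLM-L6-v2 embeddings; the real retriever's metric may differ), the check amounts to:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def passes_confidence(query_vec, doc_vec, min_confidence=0.6):
    # A retrieved problem is kept only if its similarity clears the threshold.
    return cosine_similarity(query_vec, doc_vec) >= min_confidence

print(passes_confidence([1.0, 0.0], [0.8, 0.6]))  # similarity 0.8 >= 0.6 -> True
```

A borderline match scoring 0.55 is dropped at the default 0.6 but kept at 0.5, which is why relaxing the threshold can surface more (if noisier) results.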

Next steps

Now that you’ve run your first query, explore more features:

  • API Reference - detailed documentation of all API endpoints and classes
  • Configuration - advanced configuration options
  • Core Concepts - how Quest's components work together
  • Guides - how to use Quest effectively