Ollama is the local LLM runtime that powers Quest’s AI responses. This guide will walk you through installing, configuring, and troubleshooting Ollama.

Installation

Using Homebrew

The easiest way to install Ollama on macOS is using Homebrew:
brew install ollama

Manual Installation

Alternatively, download the installer from the official website:
  1. Visit ollama.ai
  2. Download the macOS installer
  3. Open the downloaded .dmg file
  4. Drag Ollama to your Applications folder

Starting Ollama

After installation, start the Ollama service:
Step 1: Start the Ollama service

ollama serve
This starts Ollama on http://localhost:11434 (the default port).
The Ollama service must be running for Quest to work. Keep this terminal window open.
Step 2: Verify the service is running

In a new terminal, check if Ollama is accessible:
curl http://localhost:11434/api/version
You should see a JSON response with version information.
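The same check can be scripted from Python using only the standard library (a sketch; `get_ollama_version` and `parse_version` are illustrative helpers, not part of Quest):

```python
import json
import urllib.request

def parse_version(body: str) -> str:
    """Extract the 'version' field from the /api/version JSON response."""
    return json.loads(body)["version"]

def get_ollama_version(base_url: str = "http://localhost:11434") -> str:
    """Fetch and parse the Ollama version over its REST API."""
    with urllib.request.urlopen(f"{base_url}/api/version", timeout=5) as resp:
        return parse_version(resp.read().decode())

if __name__ == "__main__":
    try:
        print("Ollama version:", get_ollama_version())
    except OSError:
        print("Ollama is not reachable on localhost:11434")
```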

Pulling Required Models

Quest uses two models depending on the mode:
Step 1: Pull the general model

For general queries, Quest uses qwen2.5-coder:1.5b:
ollama pull qwen2.5-coder:1.5b
This model is configured in rag_engine3.py:23:
model_name: str = "qwen2.5-coder:1.5b",  # Default model
Step 2: Pull the reasoning model

For reasoning mode, Quest uses deepseek-r1:7b (or the 1.5b variant):
ollama pull deepseek-r1:7b
Or for faster performance on lower-spec machines:
ollama pull deepseek-r1:1.5b
This model is configured in rag_engine3.py:24:
reasoning_model: str = "deepseek-r1:7b",  # Reasoning model
The 1.5b variant requires less memory (~2 GB), while the 7b variant needs ~6 GB but provides better reasoning quality.
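If you are unsure which variant to pull, a small helper like the following (hypothetical, not part of Quest) captures the trade-off:

```python
# Hypothetical helper (not part of Quest): pick the deepseek-r1 variant
# that fits comfortably in the RAM you have free. The ~2 GB and ~6 GB
# figures are the approximate model footprints mentioned above.
def choose_reasoning_model(free_ram_gb: float) -> str:
    """Return a reasoning model tag based on available memory."""
    # Leave headroom beyond the raw weights for the KV cache and OS.
    return "deepseek-r1:7b" if free_ram_gb >= 8 else "deepseek-r1:1.5b"
```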
Step 3: Verify models are installed

List all downloaded models:
ollama list
You should see both models in the output:
NAME                     ID              SIZE      MODIFIED
qwen2.5-coder:1.5b      abc123def456    900 MB    2 minutes ago
deepseek-r1:7b          def789ghi012    6.0 GB    1 minute ago
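To verify both models from a script, you can parse the output above (a sketch; `installed_models` and `missing_models` are illustrative helper names):

```python
def installed_models(ollama_list_output: str) -> set:
    """Parse the NAME column out of `ollama list` output."""
    lines = ollama_list_output.strip().splitlines()
    # Skip the header row; the first whitespace-separated field is the tag.
    return {line.split()[0] for line in lines[1:] if line.strip()}

def missing_models(ollama_list_output: str, required: set) -> set:
    """Return which of the required model tags are not yet pulled."""
    return set(required) - installed_models(ollama_list_output)
```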

Configuring Ollama for Quest

Quest connects to Ollama via its REST API. The default configuration in rag_engine3.py is:
rag_engine3.py
class RAGEngine:
    def __init__(
        self,
        retriever: LeetCodeRetriever,
        ollama_url: str = "http://localhost:11434/api/generate",
        model_name: str = "qwen2.5-coder:1.5b",
        reasoning_model: str = "deepseek-r1:7b",
        temperature: float = 0.4,
        top_p: float = 0.9,
        repeat_penalty: float = 1.1,
        num_thread: int = 8
    ):

Customizing Model Parameters

You can adjust these parameters when initializing the RAG engine:
rag_engine = RAGEngine(
    retriever,
    temperature=0.2,  # More deterministic responses
    top_p=0.85
)
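Under the hood, these parameters are forwarded to Ollama inside the request body's "options" field. A sketch of how that mapping might look (the helper name is illustrative; rag_engine3.py's actual implementation may differ):

```python
def build_generate_payload(model: str, prompt: str, *,
                           temperature: float = 0.4,
                           top_p: float = 0.9,
                           repeat_penalty: float = 1.1,
                           num_thread: int = 8) -> dict:
    """Assemble a JSON body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Sampling parameters travel in the "options" object.
        "options": {
            "temperature": temperature,
            "top_p": top_p,
            "repeat_penalty": repeat_penalty,
            "num_thread": num_thread,
        },
    }
```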

Verifying Installation

Step 1: Test the Ollama API directly

Test the API with a simple prompt:
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:1.5b",
  "prompt": "Explain binary search in one sentence.",
  "stream": false
}'
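The same test can be driven from Python with only the standard library (a sketch; `generate` and `build_body` are illustrative helpers, not Quest's API):

```python
import json
import urllib.request

def build_body(prompt: str, model: str) -> dict:
    """Request body matching the curl example above."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, model: str = "qwen2.5-coder:1.5b",
             url: str = "http://localhost:11434/api/generate") -> str:
    data = json.dumps(build_body(prompt, model)).encode()
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        # With "stream": false, the full answer arrives as one JSON object.
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(generate("Explain binary search in one sentence."))
```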
Step 2: Test with Quest

Start the Flask application:
python app.py
Visit http://localhost:5000 and try a test query like “Two Sum problem”.

Troubleshooting

Ollama Service Not Running

Error: Connection refused or Failed to connect to Ollama API
Solution: Ensure Ollama is running:
ollama serve

Port Already in Use

Error: Error: listen tcp 127.0.0.1:11434: bind: address already in use
Solution: Kill the existing process:
# Find the process
lsof -i :11434

# Kill it (replace PID with actual process ID)
kill -9 <PID>
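You can also check whether the port is taken from Python before starting the service (a small sketch using the standard library):

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        return sock.connect_ex((host, port)) == 0

if __name__ == "__main__":
    status = "taken" if port_in_use(11434) else "free"
    print(f"Port 11434 is {status}.")
```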

Model Not Found

Error: Error: model 'qwen2.5-coder:1.5b' not found
Solution: Pull the model:
ollama pull qwen2.5-coder:1.5b

Out of Memory

Error: Model loading fails or the system becomes unresponsive
Solution: Use smaller models:
rag_engine = RAGEngine(
    retriever,
    model_name="qwen2.5-coder:1.5b",
    reasoning_model="deepseek-r1:1.5b"  # Use 1.5b instead of 7b
)

Slow Response Times

If responses are slow, try:
  1. Increase thread count:
    rag_engine = RAGEngine(retriever, num_thread=16)
    
  2. Use GPU acceleration (if available):
    # Ollama uses the GPU automatically (Metal on Apple Silicon,
    # CUDA on Linux). On NVIDIA systems, verify with:
    nvidia-smi
    
  3. Reduce context window:
    rag_engine = RAGEngine(retriever, max_history=1)  # Less history
    

Advanced Configuration

Using a Custom Ollama URL

If Ollama is running on a different host or port:
app.py
rag_engine = RAGEngine(
    retriever,
    ollama_url="http://192.168.1.100:11434/api/generate"
)
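For deployments where the endpoint varies between machines, one option is to read it from an environment variable (a sketch; OLLAMA_URL is an assumed variable name, not something Quest defines):

```python
import os

def ollama_url_from_env(
        default: str = "http://localhost:11434/api/generate") -> str:
    """Return the Ollama endpoint, allowing an environment override."""
    # OLLAMA_URL is an illustrative variable name, not a Quest convention.
    return os.environ.get("OLLAMA_URL", default)
```

You could then pass the result to the engine, e.g. `RAGEngine(retriever, ollama_url=ollama_url_from_env())`.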

Running Ollama as a Service

Create a systemd service file:
sudo nano /etc/systemd/system/ollama.service
Add:
[Unit]
Description=Ollama Service
After=network.target

[Service]
ExecStart=/usr/local/bin/ollama serve
Restart=always
User=your-username

[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl enable ollama
sudo systemctl start ollama

Next Steps

Using the Web Interface

Learn how to interact with Quest through the Flask web interface

Query Optimization

Write effective queries for better results
