The Quest web interface is built with Flask, providing a simple REST API for interacting with the RAG engine.
Application Structure
The Flask application is defined in app.py and provides a web interface and REST API endpoints.
Initialization
from flask import Flask, render_template, request, Response, jsonify
from src.DSAAssistant.components.retriever2 import LeetCodeRetriever, Solution
from rag_engine3 import RAGEngine
import logging
import time
# Initialize Flask app
app = Flask(__name__)
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Initialize RAG Engine
retriever = LeetCodeRetriever()
rag_engine = RAGEngine(retriever, max_history=3)
The application initializes:
- Flask app instance - The main application
- Logging - INFO level logging for request tracking
- LeetCodeRetriever - Loads the HNSW index and metadata
- RAGEngine - Initialized with max_history=3 to retain last 3 conversations
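The internals of RAGEngine are not shown here, but the effect of max_history=3 can be illustrated with a bounded deque: once a fourth exchange is appended, the oldest one is discarded. This is a sketch of the behavior such a cap implies, not the actual implementation:

```python
from collections import deque

# A bounded history: maxlen=3 mirrors max_history=3.
# Appending beyond the cap silently drops the oldest exchange.
history = deque(maxlen=3)
for i in range(5):
    history.append((f"question {i}", f"answer {i}"))

# Only the three most recent exchanges survive.
print([q for q, _ in history])  # ['question 2', 'question 3', 'question 4']
```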
Running the Application
Development Mode
Run the Flask application in development mode:
python app.py
The application starts on http://localhost:5000 by default.
The application is configured with debug=False. Set debug=True during development to enable automatic reloading and detailed error pages.
Production Mode
For production deployments, use a WSGI server like Gunicorn:
gunicorn app:app --bind 0.0.0.0:5000 --workers 4
The RAG engine processes queries sequentially. Using multiple workers may lead to race conditions. Consider using a single worker or implementing proper request queuing.
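If you do run multiple threads or workers against a single in-process engine, one minimal way to serialize access is a module-level lock. The sketch below uses a stand-in FakeEngine (hypothetical, in place of the real RAGEngine) to show that at most one caller reaches the engine at a time:

```python
import threading
import time

# A minimal sketch: guard a non-thread-safe engine with one lock so
# concurrent requests are processed one at a time.
_engine_lock = threading.Lock()

class FakeEngine:
    """Stand-in for the RAG engine; tracks how many callers overlap."""
    def __init__(self):
        self.active = 0
        self.max_active = 0

    def answer_question(self, query):
        self.active += 1
        self.max_active = max(self.max_active, self.active)
        time.sleep(0.01)  # simulate work
        self.active -= 1
        return f"answer:{query}"

engine = FakeEngine()

def answer_serialized(query):
    # Only one request at a time reaches the engine.
    with _engine_lock:
        return engine.answer_question(query)
```

This serializes requests rather than queuing them fairly; for heavier traffic, a proper task queue in front of the engine is the more robust option.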
Configuration
Application Settings
if __name__ == '__main__':
app.run(debug=False)
You can configure the Flask app with additional parameters:
app.run(
host='0.0.0.0', # Listen on all interfaces
port=5000, # Port number
debug=False, # Debug mode
threaded=True # Enable threading
)
Environment Variables
While the application doesn’t use environment variables by default, you can add them for configuration:
import os
OLLAMA_URL = os.getenv('OLLAMA_URL', 'http://localhost:11434/api/generate')
MAX_HISTORY = int(os.getenv('MAX_HISTORY', '3'))
rag_engine = RAGEngine(
retriever,
ollama_url=OLLAMA_URL,
max_history=MAX_HISTORY
)
Request Logging
The application logs all search requests with timing information:
# Log the start time
start_time = time.time()
# Set the mode (general or reasoning)
rag_engine.set_mode(mode)
logger.info(f"Mode set to: {mode}")
# Get response from RAG engine
response = rag_engine.answer_question(query)
# Log the time taken
logger.info(f"Response generated in {time.time() - start_time:.2f} seconds")
Example log output:
INFO:__main__:Mode set to: general
INFO:__main__:Response generated in 15.32 seconds
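The timing code above can be factored into a reusable decorator so every endpoint logs its duration the same way. A stdlib-only sketch (log_timing and slow_operation are illustrative names, not part of app.py):

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def log_timing(func):
    """Log how long each call to the wrapped function takes."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.time()
        try:
            return func(*args, **kwargs)
        finally:
            logger.info(
                f"{func.__name__} completed in {time.time() - start_time:.2f} seconds"
            )
    return wrapper

@log_timing
def slow_operation():
    time.sleep(0.05)  # simulate a slow RAG query
    return "done"
```

Applied to a Flask view (below the @app.route decorator), this keeps the timing logic out of each endpoint body.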
Error Handling
All endpoints include try-except blocks for error handling:
try:
# Set the mode (general or reasoning)
rag_engine.set_mode(mode)
logger.info(f"Mode set to: {mode}")
# Get response from RAG engine
response = rag_engine.answer_question(query)
# Return the response as JSON
return jsonify({"response": response})
except Exception as e:
logger.error(f"Error processing query: {e}")
return jsonify({"error": "An error occurred while processing your request."}), 500
Errors are:
- Logged with the error message
- Returned to the client with HTTP 500 status
- Accompanied by a generic error message (not the full stack trace)
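The per-endpoint try-except pattern can be complemented with a catch-all Flask error handler, so any endpoint that misses its own handler still returns the same generic JSON error instead of Flask's default HTML error page. A minimal sketch (not part of the original app.py; the /boom route is a hypothetical endpoint used only to trigger a failure):

```python
from flask import Flask, jsonify
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = Flask(__name__)

@app.errorhandler(Exception)
def handle_unexpected_error(e):
    # Log the real error, return only a generic message to the client.
    logger.error(f"Unhandled error: {e}")
    return jsonify({"error": "An error occurred while processing your request."}), 500

@app.route('/boom')
def boom():
    raise RuntimeError("simulated failure")
```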
Template Rendering
The main route renders an HTML template:
@app.route('/')
def index():
"""Render the main index page."""
return render_template('index.html')
The template is located at templates/index.html and provides the web interface for making queries.
CORS Configuration
If you need to enable CORS for external API access, install flask-cors:
pip install flask-cors
Then configure CORS in your application:
from flask_cors import CORS
app = Flask(__name__)
CORS(app) # Enable CORS for all routes
Or configure it for specific routes:
from flask_cors import cross_origin
@app.route('/search', methods=['POST'])
@cross_origin()
def search():
# ... endpoint implementation
Deployment Considerations
Dependencies
Ensure all dependencies are installed:
pip install -r requirements.txt
Key dependencies:
- flask>=2.1.0 - Web framework
- gunicorn - Production WSGI server (if using)
- flask-caching - Caching support (installed but not configured)
The initial request may take longer as the RAG engine loads the HNSW index and embeddings model. Subsequent requests will be faster.
Loading times:
- HNSW index: ~1-2 seconds
- Sentence transformer model: ~2-3 seconds
- First query: ~15-20 seconds (general mode)
- Subsequent queries: ~15 seconds (general mode)
Memory Usage
The application keeps several large objects in memory:
- HNSW index (~50-100 MB)
- Sentence transformer model (~100 MB)
- Metadata for 1800+ solutions (~10-20 MB)
- Conversation history (scales with usage)
Plan for at least 500 MB - 1 GB of RAM for the application.
Ollama Dependency
The application requires Ollama to be running on the same machine or a remote server:
# Default Ollama URL
http://localhost:11434/api/generate
Ensure Ollama is running before starting the Flask app:
ollama serve
And that the required models are available:
ollama pull qwen2.5-coder:1.5b
ollama pull deepseek-r1:7b
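Since the app fails at query time if Ollama is down, it can help to verify reachability at startup. A small stdlib-only sketch (check_ollama is an illustrative helper, not part of app.py; it probes the server's base URL and treats any connection error as "not running"):

```python
import logging
from urllib import request, error

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def check_ollama(base_url="http://localhost:11434", timeout=2):
    """Return True if a server answers at base_url, False otherwise."""
    try:
        with request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (error.URLError, OSError):
        logger.warning(f"Ollama not reachable at {base_url}")
        return False
```

Calling this before app.run() lets you log a clear error (or refuse to start) instead of surfacing timeouts on the first user query.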