
Overview

The LeetCodeRetriever class handles semantic search over the LeetCode solution database using FAISS HNSW (Hierarchical Navigable Small World) indexing and sentence transformers for embeddings.

Initialization Parameters

index_path
string
default:"leetcode_hnsw2.index"
Path to the FAISS HNSW index file. The default path is relative to the component directory.
from src.DSAAssistant.components.retriever2 import LeetCodeRetriever

retriever = LeetCodeRetriever(
    index_path="/custom/path/to/index.index"
)
The index must be created using faiss.IndexHNSWFlat. Other index types will raise a ValueError.
metadata_path
string
default:"leetcode_metadata2.pkl"
Path to the pickled metadata file containing Solution objects. This file stores the actual problem titles, solutions, difficulty levels, topics, and companies.
retriever = LeetCodeRetriever(
    metadata_path="/custom/path/to/metadata.pkl"
)
model_name
string
default:"all-MiniLM-L6-v2"
The sentence transformer model used for encoding queries. This must match the model used to create the index.
Popular alternatives:
  • all-MiniLM-L6-v2: Fast, lightweight (default)
  • all-mpnet-base-v2: More accurate, slower
  • multi-qa-mpnet-base-dot-v1: Optimized for Q&A tasks
retriever = LeetCodeRetriever(
    model_name="all-mpnet-base-v2"
)
Changing the model requires rebuilding the index with embeddings from the new model.
ef_search
int
default:"32"
HNSW search parameter that controls the speed/accuracy trade-off during retrieval.
  • Lower values (16-32): Faster search, slightly lower recall
  • Higher values (64-128): Slower search, higher recall
The default value of 32 provides a good balance for most use cases.
retriever = LeetCodeRetriever(
    ef_search=64  # More accurate, slower
)

Search Methods

results = retriever.search(
    query="dynamic programming coin change",
    k=5,
    return_scores=True
)

for solution, score in results:
    print(f"{solution.title}: {score:.3f}")
query
string
required
The search query (natural language or keywords)
k
int
default:"3"
Number of results to return
return_scores
bool
default:"True"
If True, returns (Solution, float) tuples with similarity scores. If False, returns only Solution objects.
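Since return_scores=True yields (Solution, float) tuples, callers often want to discard weak matches before using the results. The following is a minimal sketch of that post-processing step; the helper name keep_confident and the 0.5 threshold are hypothetical, and plain strings stand in for Solution objects so the snippet runs on its own.

```python
from typing import Any, Iterable, List, Tuple


def keep_confident(
    results: Iterable[Tuple[Any, float]], min_score: float
) -> List[Tuple[Any, float]]:
    """Keep (solution, score) pairs scoring at least min_score, best first."""
    kept = [(sol, float(score)) for sol, score in results if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Stand-in data shaped like the output of search(..., return_scores=True).
sample = [("Coin Change", 0.82), ("Two Sum", 0.31), ("Coin Change II", 0.77)]
confident = keep_confident(sample, min_score=0.5)
print(confident)
```

A sensible threshold depends on the embedding model and on how the retriever converts distances to scores, so it is worth calibrating on a few known queries rather than reusing the value shown here.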

Metadata Filtering

Filter solutions by difficulty, topics, or companies:
filtered = retriever.filter_by_metadata(
    companies=["Amazon", "Google"],
    difficulty="Medium",
    topics=["Dynamic Programming", "BFS"]
)
companies
List[str]
Filter by companies that ask this question (case-insensitive partial match)
difficulty
string
Filter by difficulty: "Easy", "Medium", or "Hard" (case-insensitive)
topics
List[str]
Filter by topics/tags (case-insensitive partial match)
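To make the matching rules concrete, here is a small standalone sketch of case-insensitive partial matching over comma-separated metadata strings. It is not the retriever's implementation: the function filter_solutions is hypothetical, the local Solution class only mirrors the documented fields, and it assumes that the three filters are combined with AND while the items inside one list are combined with OR.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Solution:
    # Stand-in that mirrors the documented Solution fields.
    title: str
    solution: str
    difficulty: str
    topics: str
    companies: str


def _matches_any(field_value: str, wanted: Optional[List[str]]) -> bool:
    """True if no filter was given, or any wanted term is a substring."""
    if not wanted:
        return True
    haystack = field_value.lower()
    return any(term.lower() in haystack for term in wanted)


def filter_solutions(
    solutions: Iterable[Solution],
    companies: Optional[List[str]] = None,
    difficulty: Optional[str] = None,
    topics: Optional[List[str]] = None,
) -> List[Solution]:
    """Assumed semantics: filters are ANDed, terms within a list are ORed."""
    kept = []
    for s in solutions:
        if difficulty and s.difficulty.lower() != difficulty.lower():
            continue
        if _matches_any(s.companies, companies) and _matches_any(s.topics, topics):
            kept.append(s)
    return kept


catalog = [
    Solution("Coin Change", "...", "Medium", "Array, Dynamic Programming, BFS", "Amazon, Google"),
    Solution("Two Sum", "...", "Easy", "Array, Hash Table", "Amazon, Apple"),
    Solution("Word Ladder", "...", "Hard", "BFS, Hash Table", "Meta"),
]
medium_dp = filter_solutions(catalog, difficulty="medium", topics=["dynamic"])
print([s.title for s in medium_dp])
```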

Solution Data Structure

Each retrieved solution is a Solution dataclass with these fields:
@dataclass
class Solution:
    title: str          # Problem title (e.g., "Two Sum")
    solution: str       # Full solution with explanation and code
    difficulty: str     # "Easy", "Medium", or "Hard"
    topics: str         # Comma-separated topics
    companies: str      # Comma-separated companies
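Because topics and companies are stored as comma-separated strings rather than lists, code that aggregates over them usually needs to split them first. A minimal sketch follows; the helper names split_field and topic_counts are hypothetical and not part of the retriever.

```python
from collections import Counter
from typing import Iterable, List


def split_field(value: str) -> List[str]:
    """Split a comma-separated metadata string into trimmed, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def topic_counts(topic_strings: Iterable[str]) -> Counter:
    """Count how often each topic appears across several topics strings."""
    return Counter(t for s in topic_strings for t in split_field(s))


tags = split_field("Array, Hash Table ,  Dynamic Programming")
counts = topic_counts(["Array, Hash Table", "Array, Dynamic Programming"])
print(tags)
print(counts.most_common(1))
```

In practice the input would be something like `s.topics for s in retriever.solutions`, assuming each entry follows the format described above.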

HNSW Tuning Guide

The ef_search parameter controls the size of the dynamic candidate list during search. A low setting (ef_search around 16) gives:
  • Speed: Very fast
  • Accuracy: ~90% recall
  • Use case: Real-time applications, large datasets

Performance Benchmarks

Assuming 1,000 indexed solutions:
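| ef_search | Latency | Recall | RAM Usage |
|-----------|---------|--------|-----------|
| 16        | ~2 ms   | 89%    | Low       |
| 32        | ~4 ms   | 95%    | Low       |
| 64        | ~8 ms   | 98%    | Medium    |
| 128       | ~15 ms  | 99%+   | High      |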
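The figures above do not say how recall was measured. A common definition, assumed here, compares the approximate HNSW results with the exact nearest neighbours from a brute-force search and reports the fraction recovered. The sketch below implements that definition over plain ID lists; the function name recall_at_k is illustrative.

```python
from typing import Hashable, Sequence


def recall_at_k(
    approx_ids: Sequence[Hashable], exact_ids: Sequence[Hashable], k: int
) -> float:
    """Fraction of the exact top-k neighbours also present in the approximate top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    found = exact_top.intersection(approx_ids[:k])
    return len(found) / len(exact_top)


# Hypothetical result IDs: the approximate search missed one true neighbour.
score = recall_at_k(approx_ids=[1, 2, 3, 9], exact_ids=[1, 2, 3, 4], k=4)
print(score)
```

Averaging this value over a set of representative queries at each ef_search setting gives numbers comparable to the Recall column, provided the same k is used throughout.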

Example Configurations

Default Configuration

from src.DSAAssistant.components.retriever2 import LeetCodeRetriever

retriever = LeetCodeRetriever()

High-Accuracy Configuration

For maximum retrieval precision:
retriever = LeetCodeRetriever(
    model_name="all-mpnet-base-v2",  # the index must have been built with this model
    ef_search=128
)

Fast Configuration

For real-time applications:
retriever = LeetCodeRetriever(
    model_name="all-MiniLM-L6-v2",
    ef_search=16
)

Custom Paths

retriever = LeetCodeRetriever(
    index_path="/data/indices/custom.index",
    metadata_path="/data/metadata/custom.pkl",
    ef_search=32
)

Advanced Usage

Combining Search and Filtering

# First filter by metadata
filtered_solutions = retriever.filter_by_metadata(
    difficulty="Medium",
    topics=["Dynamic Programming"]
)

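# Then perform semantic search within the filtered results
query = "longest increasing subsequence"
query_vector = retriever.encoder.encode([query], normalize_embeddings=True)

# Manual search on the filtered subset. Assumption: embedding each title is an
# acceptable stand-in for whatever text the index was originally built from.
doc_vectors = retriever.encoder.encode(
    [s.title for s in filtered_solutions], normalize_embeddings=True
)
scores = (doc_vectors @ query_vector.T).ravel()  # cosine similarity per candidate
ranked = sorted(zip(filtered_solutions, scores), key=lambda pair: pair[1], reverse=True)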

Inspecting Index Properties

print(f"Index type: {type(retriever.index)}")
print(f"Embedding dimension: {retriever.encoder.get_sentence_embedding_dimension()}")
print(f"Number of solutions: {len(retriever.solutions)}")
print(f"HNSW ef_search: {retriever.index.hnsw.efSearch}")

Configuration Tips

Start with defaults and adjust ef_search only if retrieval quality is insufficient.
For production systems: Use ef_search=32-64 for the best speed/accuracy balance.
Changing the embedding model requires rebuilding the entire FAISS index, because query embeddings and indexed embeddings must come from the same model.
The retriever uses cosine similarity (via L2 normalization) for semantic matching. Higher scores indicate better matches.
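The link between L2 normalization and cosine similarity can be checked numerically: for unit-length vectors, the squared L2 distance equals 2 - 2·cos, so ranking by smallest distance is the same as ranking by highest cosine similarity. The sketch below verifies that identity with plain Python lists. It illustrates the math only; how the retriever converts index distances into the scores it returns is not shown in this page.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = l2_normalize([1.0, 2.0, 2.0])
b = l2_normalize([2.0, 0.0, 1.0])

cosine = dot(a, b)          # cosine similarity of the original vectors
distance = squared_l2(a, b)  # squared L2 distance between the unit vectors

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(math.isclose(distance, 2.0 - 2.0 * cosine))
```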
