The problem with single-method retrieval
Dense vector search (pgvector cosine similarity) excels at capturing semantic meaning. Ask “how should I price my subscription tier?” and it will surface documents about pricing strategy, value signaling, and fan psychology — even if none of those documents use the words “price” or “tier” exactly. But dense search fails on exact terminology: searching for a specific platform name, a creator handle, or a payout rate formula produces noisy results because these high-entropy tokens don’t behave cleanly in embedding space. BM25 sparse search (keyword term frequency) excels at exact terminology. Ask about “Slushy payout rules” and BM25 finds every document containing those exact words, ranked by term frequency and document length normalization. But BM25 is blind to semantics: it can’t recognize that “fan subscription renewal” and “recurring membership billing” describe the same concept.

Dense vector search
Strengths: semantic similarity, concept matching, paraphrase handling
Weaknesses: exact terminology, platform names, high-entropy tokens, rare terms
BM25 sparse search
Strengths: exact term matching, platform names, creator handles, payout formulas
Weaknesses: semantically blind; fails on paraphrasing and concept synonyms
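To make the sparse side concrete, here is a minimal from-scratch sketch of Okapi BM25 scoring over a toy corpus. The whitespace tokenizer and the k1/b defaults are illustrative assumptions; the real index covers the node files described later.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each document against the query tokens."""
    N = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / N
    df = Counter()                     # document frequency per term
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        s = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation (k1) and length normalization (b)
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

corpus = [
    "slushy payout rules for creators".split(),
    "recurring membership billing guide".split(),
    "fan subscription renewal tips".split(),
]
scores = bm25_scores("slushy payout rules".split(), corpus)
```

Note the failure mode from above: the query shares no literal tokens with the second or third document, so both score exactly zero despite the third being semantically related.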
The RRF formula
RRF scores each document by summing its reciprocal rank across all result lists:

RRF(d) = Σ_i 1 / (k + rank_i(d))

- d is a document (node ID)
- rank_i(d) is the rank of document d in result list i (1-indexed)
- k = 60 is a smoothing constant that damps the advantage of top-ranked items
- The sum is taken across all result lists (dense + sparse)
A document ranked #1 in both lists scores 1/(60+1) + 1/(60+1) = 0.0328. A document ranked #1 in dense but absent from sparse scores 1/(60+1) = 0.0164. Agreement between methods pushes documents to the top.
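The arithmetic above can be reproduced in a few lines. A sketch, assuming the result lists are simply ranked lists of node IDs:

```python
def rrf_fuse(result_lists, k=60):
    """Reciprocal Rank Fusion over ranked lists of document IDs."""
    scores = {}
    for ranking in result_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-indexed
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

dense  = ["node_a", "node_c"]   # node_a ranked #1 by dense search
sparse = ["node_a", "node_d"]   # ...and #1 by BM25 as well
fused = rrf_fuse([dense, sparse])
# node_a scores 1/61 + 1/61 ≈ 0.0328 and lands on top;
# node_c and node_d each score only 1/62 ≈ 0.0161
```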
The smoothing constant
k=60 is the empirically established default from the original RRF paper (Cormack et al., 2009). It ensures that documents ranked very low in one list contribute only negligible noise, while a high rank from either method still moves a document up meaningfully.

How the two lists are built
Dense hits: semantic activation from DuckDB
The dense result list is built from node activation scores in DuckDB, boosted by label overlap with the HyDE-generated hypothetical document.

Sparse hits: BM25 over Nodes/ JSON files
BM25 indexes all .json node files across Nodes/Universe/, Nodes/User/, and Nodes/Transitional/. At query time, the raw query is tokenized and scored against the index:
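The indexing step might look like the following sketch. The directory layout matches the text, but the lowercase word tokenizer and the assumption that every field of a node file is worth indexing are mine, not confirmed by the source.

```python
import json
import re
from pathlib import Path

NODE_DIRS = ["Nodes/Universe", "Nodes/User", "Nodes/Transitional"]

def tokenize(text):
    # Naive lowercase word tokenizer; the real tokenizer may differ
    return re.findall(r"[a-z0-9]+", text.lower())

def load_corpus(root="."):
    """Collect (node_id, tokens) pairs from every .json node file."""
    corpus = []
    for node_dir in NODE_DIRS:
        for path in Path(root, node_dir).glob("*.json"):
            node = json.loads(path.read_text())
            # Assumed: concatenate the node's textual fields for indexing
            text = " ".join(str(value) for value in node.values())
            corpus.append((path.stem, tokenize(text)))
    return corpus
```

The same `tokenize` would be applied to the raw query before scoring, so query and index share one vocabulary.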
The BM25 index itself is obtained via bm25_index.py:get_index().
RRF scoring implementation
RRF fusion happens in get_top_node_ids(). Full node content (label, category, type, activation) is fetched from DuckDB and attached via get_top_context().
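Based on the description above, get_top_node_ids might be sketched as follows. The dense_hits and sparse_hits callables are hypothetical stand-ins for the DuckDB activation query and the BM25 index; the real function's signature is not shown in the source.

```python
def get_top_node_ids(query, dense_hits, sparse_hits, k=60, top_n=10):
    """Fuse dense and sparse rankings with RRF; return the top node IDs.

    dense_hits / sparse_hits: callables returning ranked lists of node IDs,
    standing in here for the DuckDB activation query and the BM25 index.
    """
    scores = {}
    for ranking in (dense_hits(query), sparse_hits(query)):
        for rank, node_id in enumerate(ranking, start=1):
            scores[node_id] = scores.get(node_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

top = get_top_node_ids(
    "payout rules",
    dense_hits=lambda q: ["n1", "n2", "n3"],
    sparse_hits=lambda q: ["n2", "n4"],
)
# n2 appears in both lists (ranks 2 and 1) and rises above n1,
# which dense search alone ranked first
```

In the full pipeline, the IDs returned here would then be hydrated from DuckDB by get_top_context().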