Overview
The algorithm works in two phases:
- Training Phase: Scan your vault to build statistical profiles for each tag
- Inference Phase: Compare a note’s content against tag profiles to suggest relevant tags
Training Phase: Building Tag Profiles
Vault Scanning
When the plugin loads (or when you run “Rescan vault”), it processes all markdown files in batches of 100 to prevent the UI from freezing during vault scans.
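The batching described above can be sketched as follows. This is an illustrative sketch, not the plugin’s actual code; the function names and the file-processing callback are hypothetical.

```typescript
// Split a list into fixed-size batches (here, batches of 100 files).
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical batched scan: process one batch, then yield to the
// event loop so the UI can repaint before the next batch starts.
async function scanVault(files: string[], processFile: (f: string) => void): Promise<void> {
  for (const batch of chunk(files, 100)) {
    batch.forEach(processFile);
    await new Promise<void>((resolve) => setTimeout(resolve, 0));
  }
}
```

Yielding with a zero-delay timer between batches is what keeps the editor responsive even on vaults with thousands of notes.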
Text Tokenization
Before analyzing content, the plugin extracts meaningful words:
- Removes frontmatter, code blocks, links, images, and tags
- Converts to lowercase
- Filters out stop words (common words like “the”, “and”, etc.)
- Keeps only words with 3+ characters
- Excludes pure numbers
- Supports English and German characters
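A minimal tokenizer following the rules above might look like this. This is a sketch: the plugin’s real stop-word list and cleanup regexes will differ, and the names here are illustrative.

```typescript
// Tiny illustrative stop-word list (English and German); the real list is larger.
const STOP_WORDS = new Set(["the", "and", "for", "with", "und", "der", "die"]);

function tokenize(text: string): string[] {
  const cleaned = text
    .replace(/^---\n[\s\S]*?\n---\n/, " ")                 // frontmatter
    .replace(/```[\s\S]*?```/g, " ")                       // code blocks
    .replace(/!\[[^\]]*\]\([^)]*\)/g, " ")                 // images
    .replace(/\[\[[^\]]*\]\]|\[[^\]]*\]\([^)]*\)/g, " ")   // wiki + markdown links
    .replace(/#[\w\/-]+/g, " ")                            // tags
    .toLowerCase();
  // Keep runs of 3+ letters (including German umlauts and ß); this also
  // drops pure numbers, since digits are not in the character class.
  const words = cleaned.match(/[a-zäöüß]{3,}/g) ?? [];
  return words.filter((w) => !STOP_WORDS.has(w));
}
```

For example, `tokenize("The #python function and die 123 Straße")` keeps only `function` and `straße`.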
Building Document Frequency
For each unique word in each document, the plugin tracks how many documents contain that word.
What is Document Frequency?
Document frequency (DF) is the number of documents containing a specific word. It is used to calculate IDF (Inverse Document Frequency), which gives lower weight to common words and higher weight to rare, distinctive words. For example:
- “the” appears in 1000 documents → high DF → low IDF → low importance
- “kubernetes” appears in 5 documents → low DF → high IDF → high importance
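The DF-to-IDF relationship can be sketched in a few lines. The function names are illustrative; the IDF formula is the one this document defines below.

```typescript
// Count, for each word, how many documents contain it at least once.
function documentFrequency(docs: string[][]): Map<string, number> {
  const df = new Map<string, number>();
  for (const doc of docs) {
    for (const word of new Set(doc)) {   // Set: count each word once per document
      df.set(word, (df.get(word) ?? 0) + 1);
    }
  }
  return df;
}

// IDF = log(1 + total documents / documents containing word)
const idf = (totalDocs: number, df: number): number => Math.log(1 + totalDocs / df);
```

With 1000 documents, `idf(1000, 1000)` (a word in every document) is ln 2 ≈ 0.69, while `idf(1000, 5)` (a rare word) is ln 201 ≈ 5.30, matching the intuition above.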
Building Tag Profiles
For each tag, the plugin builds a profile containing the TF-IDF weight of each word in that tag’s documents.
TF-IDF Calculation
Term Frequency (TF)
TF measures how often a word appears relative to total words:
TF = (word count in tag's documents) / (total words in tag's documents)
Inverse Document Frequency (IDF)
IDF measures how rare or distinctive a word is:
IDF = log(1 + total tagged documents / documents containing word)
The 1 + inside the logarithm keeps the IDF value positive and smooths the IDF values.
TF-IDF Weight
The final weight combines both metrics:
TF-IDF = TF × IDF
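The three formulas above transcribe directly into code. This is a sketch; the parameter names are illustrative.

```typescript
// TF-IDF weight for one word in one tag's profile:
//   TF  = wordCount / totalWords
//   IDF = log(1 + taggedDocs / docsWithWord)
function tfIdf(
  wordCount: number,    // occurrences of the word in the tag's documents
  totalWords: number,   // total words in the tag's documents
  taggedDocs: number,   // total tagged documents in the vault
  docsWithWord: number  // documents containing the word
): number {
  const tf = wordCount / totalWords;
  const idf = Math.log(1 + taggedDocs / docsWithWord);
  return tf * idf;
}
```

For instance, a word appearing 45 times in 300 words (TF = 0.15), found in 40 of 100 documents (IDF = ln 3.5 ≈ 1.25), gets a weight of roughly 0.19.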
Vector Normalization
To enable cosine similarity calculation, the plugin computes the magnitude (L2 norm) of each tag’s vector:
||v|| = √(w₁² + w₂² + ... + wₙ²)
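A sketch of that norm over a sparse vector, assuming weights are kept in a `Map<string, number>`:

```typescript
// L2 norm (magnitude) of a sparse weight vector.
function magnitude(vec: Map<string, number>): number {
  let sumSquares = 0;
  for (const w of vec.values()) sumSquares += w * w;
  return Math.sqrt(sumSquares);
}
```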
Inference Phase: Suggesting Tags
Document Vector Construction
When suggesting tags for a note, the plugin first builds a TF-IDF vector for that note.
Cosine Similarity
The plugin compares the note’s vector against each tag’s vector using cosine similarity:
cosine_similarity = (A · B) / (||A|| × ||B||)
Where:
- A · B = dot product of the two vectors
- ||A|| = magnitude of vector A
- ||B|| = magnitude of vector B
Why Cosine Similarity?
Cosine similarity measures the angle between two vectors, ranging from 0 (completely different) to 1 (identical direction). It’s ideal for text comparison because:
- It ignores document length (normalized by vector magnitudes)
- It focuses on the distribution of words, not their absolute counts
- It’s computationally efficient
For example, two documents about “machine learning” will have similar word distributions (high cosine similarity) even if one is 100 words and the other is 1000 words.
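A minimal sketch of cosine similarity over sparse `Map<string, number>` vectors (the function name is illustrative):

```typescript
// cosine_similarity = (A · B) / (||A|| × ||B||), over sparse Map vectors.
function cosineSimilarity(a: Map<string, number>, b: Map<string, number>): number {
  // Iterate the smaller map: words missing from either vector contribute 0.
  const [small, large] = a.size <= b.size ? [a, b] : [b, a];
  let dot = 0;
  for (const [word, w] of small) {
    const v = large.get(word);
    if (v !== undefined) dot += w * v;
  }
  const norm = (m: Map<string, number>) =>
    Math.sqrt([...m.values()].reduce((s, w) => s + w * w, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}
```

Vectors pointing in the same direction score 1, and vectors with no words in common score 0, regardless of their lengths.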
Co-occurrence Boost
After calculating base similarity, the plugin applies a boost based on tag co-occurrence patterns:
final_score = cosine_similarity × (1 + co-occurrence rate)
The co-occurrence rate is calculated as:
co-occurrence rate = (times tags appeared together) / (total documents with existing tag)
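In code, the boost is a one-line adjustment on top of the base similarity. This is a sketch with illustrative names; the counts would come from the co-occurrence data gathered during the vault scan.

```typescript
// final_score = cosine_similarity × (1 + co-occurrence rate)
function boostedScore(
  baseSimilarity: number,
  timesTogether: number,       // documents where both tags appear
  docsWithExistingTag: number  // documents containing the tag already on the note
): number {
  const rate = docsWithExistingTag > 0 ? timesTogether / docsWithExistingTag : 0;
  return baseSimilarity * (1 + rate);
}
```

With the numbers from the callout below this formula, `boostedScore(0.15, 8, 20)` gives a rate of 0.4, a 1.4× boost, and a final score of ≈ 0.21.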
Understanding Co-occurrence
If you have the tag #python in your note, and in your vault:
- #python appears in 20 documents
- #python and #web-development appear together in 8 documents
Then for #web-development:
- Co-occurrence rate = 8 / 20 = 0.4
- Score boost = 1 + 0.4 = 1.4×
If #web-development has a base similarity of 0.15, it becomes 0.15 × 1.4 = 0.21 after the boost.
Building the Co-occurrence Matrix
During vault scanning, the plugin tracks which tags appear together.
Filtering and Ranking
Before returning suggestions, the plugin applies several filters and ranks the remaining candidates by final score.
Performance Optimizations
Precomputed Vectors
Tag vectors are computed once during vault scanning and reused for all suggestions:
- O(n) time during vault scan
- O(1) lookup during suggestion
- No repeated TF-IDF calculations
Batch Processing
Vault scanning yields to the UI thread every 100 files.
Sparse Vector Representation
Vectors are stored as Map<string, number> instead of dense arrays. This is efficient because:
- Most words appear in only a few documents (sparse data)
- Only non-zero weights are stored
- Memory usage scales with unique words, not total vocabulary
Example Calculation
Let’s walk through a complete example.
Full worked example
Vault stats:
- 100 tagged documents total
- Tag #javascript appears in 15 documents
- Tag #react appears in 8 documents
- They appear together in 6 documents

For the word “function” in #javascript’s documents:
- “function” appears 45 times across 300 total words
- “function” appears in 40 documents globally
- TF = 45 / 300 = 0.15
- IDF = log(1 + 100 / 40) = log(3.5) ≈ 1.25
- TF-IDF = 0.15 × 1.25 = 0.1875

The note being analyzed:
- Contains “function” 3 times in 50 words
- Already has tag #react
- TF = 3 / 50 = 0.06
- IDF = log(1 + 100 / 40) ≈ 1.25
- TF-IDF = 0.06 × 1.25 = 0.075

Cosine similarity:
- Dot product (simplified, just “function”): 0.1875 × 0.075 ≈ 0.0141
- Assume vector norms are 1.0 (normalized)
- Base score: 0.0141

Co-occurrence boost:
- #react and #javascript appear together in 6 of 8 #react documents
- Co-occurrence rate: 6 / 8 = 0.75
- Boost: 1 + 0.75 = 1.75
- Final score: 0.0141 × 1.75 ≈ 0.0247
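The arithmetic in this example can be re-checked with a short script, using exact logarithms instead of the rounded 1.25 (variable names are illustrative):

```typescript
// IDF = ln(1 + total tagged documents / documents containing word)
const idfOf = (totalDocs: number, df: number): number => Math.log(1 + totalDocs / df);

const tagWeight = (45 / 300) * idfOf(100, 40);  // #javascript profile weight for "function"
const noteWeight = (3 / 50) * idfOf(100, 40);   // the note's weight for "function"
const baseScore = tagWeight * noteWeight;       // dot product; norms assumed to be 1.0
const finalScore = baseScore * (1 + 6 / 8);     // co-occurrence boost of 1.75×
```

Computed exactly, `finalScore` comes out around 0.0247, agreeing with the rounded figures above.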
If minScore = 0.01, then #javascript would be suggested!
Model Statistics
You can check the model’s current state in the plugin settings:
- How much training data the model has
- Whether you need to rescan after adding new notes
- The vocabulary size the model learned from