Auto Tagger uses a machine learning approach based on TF-IDF (Term Frequency-Inverse Document Frequency) and cosine similarity to suggest tags. This page explains the mathematical concepts and implementation details.

Overview

The algorithm works in two phases:
  1. Training Phase: Scan your vault to build statistical profiles for each tag
  2. Inference Phase: Compare a note’s content against tag profiles to suggest relevant tags

Training Phase: Building Tag Profiles

Vault Scanning

When the plugin loads (or when you run “Rescan vault”), it processes all markdown files in batches:
// From model.ts:56-81
async scan(app: App, onProgress?: (pct: number) => void): Promise<ModelStats> {
    this.clear();
    const files = app.vault.getMarkdownFiles();
    const BATCH = 100;

    for (let i = 0; i < files.length; i += BATCH) {
        const end = Math.min(i + BATCH, files.length);
        for (let j = i; j < end; j++) {
            const content = await app.vault.cachedRead(files[j]);
            this.processFile(content);
        }
        onProgress?.(end / files.length);
        // Yield to UI thread between batches
        if (end < files.length) {
            await new Promise<void>((r) => setTimeout(r, 0));
        }
    }

    this.finalizeVectors();
    this._ready = true;
    return this.getStats();
}
The plugin processes files in batches of 100 to prevent UI freezing during vault scans.

Text Tokenization

Before analyzing content, the plugin extracts meaningful words:
// From model.ts:197-217
tokenize(content: string): string[] {
    let body = content.replace(/^---\n[\s\S]*?\n---\n?/, ""); // frontmatter
    body = body.replace(/```[\s\S]*?```/g, ""); // code blocks
    body = body.replace(/`[^`]*`/g, ""); // inline code
    body = body.replace(/!\[.*?\]\(.*?\)/g, ""); // images
    body = body.replace(/\[([^\]]*)\]\(.*?\)/g, "$1"); // links
    body = body.replace(/\[\[([^\]|]*?)(?:\|.*?)?\]\]/g, "$1"); // wiki links
    body = body.replace(/#[a-zA-Z][a-zA-Z0-9_/\-]*/g, ""); // tags
    body = body.replace(/#{1,6}\s/g, ""); // heading markers
    body = body.replace(/[*_~`]+/g, ""); // emphasis

    return body
        .toLowerCase()
        .split(/[^a-zäöüßàáâãèéêëìíîïòóôõùúûüñç0-9]+/)
        .filter(
            (w) =>
                w.length >= 3 &&
                !STOP_WORDS.has(w) &&
                !/^\d+$/.test(w)
        );
}
The tokenizer:
  • Removes frontmatter, code blocks, links, images, and tags
  • Converts to lowercase
  • Filters out stop words (common words like “the”, “and”, etc.)
  • Keeps only words with 3+ characters
  • Excludes pure numbers
  • Supports English and German characters
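To illustrate, here is a trimmed-down sketch of the same pipeline — only a few of the stripping rules, and a tiny hypothetical STOP_WORDS set, not the plugin's full list:

```typescript
// Simplified tokenizer sketch (hypothetical stop-word set; real rules above).
const STOP_WORDS = new Set(["the", "and", "with", "this"]);

function tokenizeSketch(content: string): string[] {
    let body = content.replace(/\[\[([^\]|]*?)(?:\|.*?)?\]\]/g, "$1"); // wiki links
    body = body.replace(/#{1,6}\s/g, ""); // heading markers
    return body
        .toLowerCase()
        .split(/[^a-zäöüß0-9]+/)
        .filter((w) => w.length >= 3 && !STOP_WORDS.has(w) && !/^\d+$/.test(w));
}

const tokens = tokenizeSketch("## Notes\nDeploy the app with [[Docker]] and 3 nodes");
// tokens: ["notes", "deploy", "app", "docker", "nodes"]
```

The heading marker, wiki-link brackets, stop words, and the bare number all disappear; only content-bearing words survive.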

Building Document Frequency

For each unique word in each document, the plugin tracks how many documents contain that word:
// From model.ts:239-245
const uniqueWords = new Set(words);

// Global document frequency (for IDF)
for (const w of uniqueWords) {
    this.globalDocFreq.set(
        w,
        (this.globalDocFreq.get(w) || 0) + 1
    );
}
Document frequency (DF) is the number of documents containing a specific word. This is used to calculate IDF (Inverse Document Frequency), which gives lower weight to common words and higher weight to rare, distinctive words. For example:
  • “the” appears in 1000 documents → high DF → low IDF → low importance
  • “kubernetes” appears in 5 documents → low DF → high IDF → high importance
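Those two bullets can be checked against the plugin's IDF formula. The corpus size of 1000 tagged documents is an assumption for the sake of the example:

```typescript
// IDF for the two example words, assuming 1000 tagged documents.
const taggedDocuments = 1000;

const idf = (df: number) => Math.log(1 + taggedDocuments / df);

const idfThe = idf(1000);     // "the" in every document   → log(2)   ≈ 0.69
const idfKubernetes = idf(5); // "kubernetes" in 5 docs    → log(201) ≈ 5.30
```

The rare word ends up weighted roughly eight times higher than the ubiquitous one.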

Building Tag Profiles

For each tag, the plugin builds a profile containing:
// From model.ts:6-13
interface TagProfile {
    documentCount: number;          // How many notes have this tag
    wordCounts: Map<string, number>; // Word frequencies in tagged notes
    totalWordCount: number;          // Total words across all tagged notes
    vector: Map<string, number>;     // Precomputed TF-IDF vector
    vectorNorm: number;              // Precomputed vector magnitude
}

TF-IDF Calculation

Term Frequency (TF)

TF measures how often a word appears relative to total words:
// From model.ts:294
const tf = count / profile.totalWordCount;
Formula: TF = (word count in tag's documents) / (total words in tag's documents)

Inverse Document Frequency (IDF)

IDF measures how rare or distinctive a word is:
// From model.ts:296
const idf = Math.log(1 + this.taggedDocuments / df);
Formula: IDF = log(1 + total tagged documents / documents containing word)
The + 1 inside the logarithm keeps the IDF positive even for words that appear in every tagged document (log(1 + 1) instead of log(1) = 0) and smooths the values for very common words.

TF-IDF Weight

The final weight combines both metrics:
// From model.ts:297-298
const weight = tf * idf;
profile.vector.set(word, weight);
Formula: TF-IDF = TF × IDF

Vector Normalization

To enable cosine similarity calculation, the plugin computes the magnitude (L2 norm) of each tag’s vector:
// From model.ts:299-301
norm += weight * weight;
profile.vectorNorm = Math.sqrt(norm);
Formula: ||v|| = √(w₁² + w₂² + ... + wₙ²)
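As a toy check of the formula, with two hypothetical weights:

```typescript
// L2 norm of a toy two-word TF-IDF vector (hypothetical weights).
const vector = new Map<string, number>([["function", 0.3], ["async", 0.4]]);

let norm = 0;
for (const [, w] of vector) norm += w * w;
norm = Math.sqrt(norm); // √(0.09 + 0.16) = √0.25 = 0.5
```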

Inference Phase: Suggesting Tags

Document Vector Construction

When suggesting tags for a note, the plugin first builds a TF-IDF vector for that note:
// From model.ts:96-113
const words = this.tokenize(content);
const wordCounts = new Map<string, number>();
for (const w of words) wordCounts.set(w, (wordCounts.get(w) || 0) + 1);

const docVector = new Map<string, number>();
let docNorm = 0;

for (const [word, count] of wordCounts) {
    const df = this.globalDocFreq.get(word);
    if (!df) continue; // word not in corpus
    const tf = count / words.length;
    const idf = Math.log(1 + this.taggedDocuments / df);
    const weight = tf * idf;
    docVector.set(word, weight);
    docNorm += weight * weight;
}
docNorm = Math.sqrt(docNorm);

Cosine Similarity

The plugin compares the note’s vector against each tag’s vector using cosine similarity:
// From model.ts:122-129
let dot = 0;
for (const [word, docW] of docVector) {
    const tagW = profile.vector.get(word);
    if (tagW) dot += docW * tagW;
}

let score = dot / (docNorm * profile.vectorNorm);
Formula: cosine_similarity = (A · B) / (||A|| × ||B||)
Where:
  • A · B = dot product of the two vectors
  • ||A|| = magnitude of vector A
  • ||B|| = magnitude of vector B
Cosine similarity measures the angle between two vectors, ranging from 0 (completely different) to 1 (identical direction). It’s ideal for text comparison because:
  • It ignores document length (normalized by vector magnitudes)
  • It focuses on the distribution of words, not their absolute counts
  • It’s computationally efficient
Example:
  • Two documents about “machine learning” will have similar word distributions (high cosine similarity) even if one is 100 words and the other is 1000 words.
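That length-invariance can be sketched with toy count vectors (not the plugin's real TF-IDF weights): scaling every weight by a constant leaves the cosine score unchanged.

```typescript
// Cosine similarity between two sparse Map-based vectors, as in the plugin.
function cosine(a: Map<string, number>, b: Map<string, number>): number {
    let dot = 0, normA = 0, normB = 0;
    for (const [word, wA] of a) {
        normA += wA * wA;
        const wB = b.get(word);
        if (wB) dot += wA * wB;
    }
    for (const [, wB] of b) normB += wB * wB;
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const shortDoc = new Map([["machine", 2], ["learning", 1]]);
const longDoc = new Map([["machine", 20], ["learning", 10]]); // 10× the counts
const sim = cosine(shortDoc, longDoc); // 1.0: same direction, different length
```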

Co-occurrence Boost

After calculating base similarity, the plugin applies a boost based on tag co-occurrence patterns:
// From model.ts:131-142
for (const existingTag of existingTags) {
    const coMap = this.cooccurrence.get(existingTag);
    if (!coMap) continue;
    const coCount = coMap.get(tag);
    if (!coCount) continue;
    const existingProfile = this.tagProfiles.get(existingTag);
    if (!existingProfile) continue;
    const coRate = coCount / existingProfile.documentCount;
    score *= 1 + coRate;
}
Formula: final_score = cosine_similarity × (1 + co-occurrence rate)
The co-occurrence rate is calculated as: co-occurrence rate = (times tags appeared together) / (total documents with existing tag)
If you have the tag #python in your note, and in your vault:
  • #python appears in 20 documents
  • #python and #web-development appear together in 8 documents
Then for the candidate tag #web-development:
  • Co-occurrence rate = 8 / 20 = 0.4
  • Score boost = 1 + 0.4 = 1.4×
This means if #web-development has a base similarity of 0.15, it becomes 0.15 × 1.4 = 0.21 after the boost.
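The #python / #web-development numbers above can be recomputed directly:

```typescript
// Worked co-occurrence boost for the #python / #web-development example.
const pythonDocs = 20;        // documents tagged #python
const togetherDocs = 8;       // documents with both tags
const baseSimilarity = 0.15;  // base cosine score for #web-development

const coRate = togetherDocs / pythonDocs;         // 8 / 20 = 0.4
const finalScore = baseSimilarity * (1 + coRate); // 0.15 × 1.4 = 0.21
```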

Building the Co-occurrence Matrix

During vault scanning, the plugin tracks which tags appear together:
// From model.ts:271-285
const tagArr = Array.from(tags);
for (let i = 0; i < tagArr.length; i++) {
    for (let j = 0; j < tagArr.length; j++) {
        if (i === j) continue;
        let map = this.cooccurrence.get(tagArr[i]);
        if (!map) {
            map = new Map();
            this.cooccurrence.set(tagArr[i], map);
        }
        map.set(
            tagArr[j],
            (map.get(tagArr[j]) || 0) + 1
        );
    }
}

Filtering and Ranking

Before returning suggestions, the plugin applies several filters:
// From model.ts:118-120
if (existingTags.has(tag)) continue;          // Skip tags already in note
if (profile.documentCount < 2) continue;      // Need at least 2 examples
if (profile.vectorNorm === 0) continue;       // Skip empty vectors
// From model.ts:144-150
if (score >= minScore) {
    results.push({ tag, score });
}

results.sort((a, b) => b.score - a.score);
return results.slice(0, maxResults);
Tags that appear in only 1 document are excluded from suggestions. This prevents overfitting to single examples and ensures the model has learned genuine patterns.

Performance Optimizations

Precomputed Vectors

Tag vectors are computed once during vault scanning and reused for all suggestions:
// From model.ts:288-302
private finalizeVectors() {
    // Precompute TF-IDF vectors for each tag
    for (const [, profile] of this.tagProfiles) {
        profile.vector.clear();
        let norm = 0;
        for (const [word, count] of profile.wordCounts) {
            const tf = count / profile.totalWordCount;
            const df = this.globalDocFreq.get(word) || 1;
            const idf = Math.log(1 + this.taggedDocuments / df);
            const weight = tf * idf;
            profile.vector.set(word, weight);
            norm += weight * weight;
        }
        profile.vectorNorm = Math.sqrt(norm);
    }
}
This means:
  • O(n) time during vault scan
  • O(1) lookup during suggestion
  • No repeated TF-IDF calculations

Batch Processing

Vault scanning yields to the UI thread every 100 files:
// From model.ts:72-75
if (end < files.length) {
    await new Promise<void>((r) => setTimeout(r, 0));
}
This prevents the UI from freezing during large vault scans.

Sparse Vector Representation

Vectors are stored as Map<string, number> instead of dense arrays. This is efficient because:
  • Most words appear in only a few documents (sparse data)
  • Only non-zero weights are stored
  • Memory usage scales with unique words, not total vocabulary
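A small sketch of why this pays off: the dot product only ever touches words that are actually stored, so cost scales with the note's vocabulary rather than the full corpus vocabulary (toy weights, hypothetical words):

```typescript
// Sparse dot product: iterate only stored entries, not the whole vocabulary.
const docVector = new Map([["kubernetes", 0.8], ["deploy", 0.3]]);
const tagVector = new Map([["kubernetes", 0.5], ["cluster", 0.2]]);

let dot = 0;
for (const [word, w] of docVector) {
    const tagW = tagVector.get(word);
    if (tagW) dot += w * tagW; // only "kubernetes" overlaps → 0.8 × 0.5
}
// dot = 0.4, computed in O(entries in docVector) lookups
```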

Example Calculation

Let’s walk through a complete example:
Vault stats:
  • 100 tagged documents total
  • Tag #javascript appears in 15 documents
  • Tag #react appears in 8 documents
  • They appear together in 6 documents
Word frequencies for #javascript:
  • “function” appears 45 times across 300 total words
  • “function” appears in 40 documents globally
TF-IDF for “function” in #javascript:
  1. TF = 45 / 300 = 0.15
  2. IDF = log(1 + 100 / 40) = log(3.5) ≈ 1.25
  3. TF-IDF = 0.15 × 1.25 = 0.1875
Current note:
  • Contains “function” 3 times in 50 words
  • Already has tag #react
Note’s TF-IDF for “function”:
  1. TF = 3 / 50 = 0.06
  2. IDF = log(1 + 100 / 40) ≈ 1.25
  3. TF-IDF = 0.06 × 1.25 = 0.075
Cosine similarity:
  • Dot product (simplified, just “function”): 0.1875 × 0.075 ≈ 0.0141
  • Assume vector norms are 1.0 (normalized)
  • Base score: 0.0141
Co-occurrence boost:
  • #react and #javascript appear together in 6 of 8 #react documents
  • Co-occurrence rate: 6 / 8 = 0.75
  • Boost: 1 + 0.75 = 1.75
  • Final score: 0.0141 × 1.75 ≈ 0.0247
If minScore = 0.01, then #javascript would be suggested!
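The whole walkthrough can be recomputed in a few lines, using the same assumed counts and the unit-norm simplification noted above:

```typescript
// End-to-end recomputation of the example above.
const taggedDocuments = 100;
const dfFunction = 40; // documents containing "function"
const idf = Math.log(1 + taggedDocuments / dfFunction); // log(3.5) ≈ 1.25

const tagWeight = (45 / 300) * idf; // "function" weight in #javascript profile
const docWeight = (3 / 50) * idf;   // "function" weight in the current note

const base = (tagWeight * docWeight) / (1.0 * 1.0); // norms assumed to be 1.0

const coRate = 6 / 8;                   // #react → #javascript co-occurrence
const finalScore = base * (1 + coRate); // ≈ 0.0247, above minScore = 0.01
```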

Model Statistics

You can check the model’s current state in the plugin settings:
// From model.ts:41-47
getStats(): ModelStats {
    return {
        totalDocuments: this.totalDocuments,
        taggedDocuments: this.taggedDocuments,
        uniqueTags: this.tagProfiles.size,
        uniqueWords: this.globalDocFreq.size,
    };
}
These stats help you understand:
  • How much training data the model has
  • Whether you need to rescan after adding new notes
  • The vocabulary size the model learned from
