Auto Tagger uses statistical analysis to learn from your existing tagging patterns. No AI or external services are required—everything runs locally in your vault.

Overview

The plugin builds a mathematical model of word-tag associations by analyzing all tagged notes in your vault. When you open or edit a note, it compares the note’s content against this model to suggest relevant tags.
The plugin only learns from notes that already have tags. At least 2 documents must contain a tag before it can be suggested.

The Four-Step Process

1. Vault Scanning

When Obsidian starts (or when you manually trigger a rescan), the plugin:
  • Scans all markdown files in your vault (in batches of 100 for performance)
  • Extracts tags from both inline format (#tag) and frontmatter YAML
  • Tokenizes note content into words
  • Builds statistical profiles for each tag
The scanner processes content in batches and yields to the UI thread between batches to keep Obsidian responsive during large vault scans.
What gets indexed:
  • Words that are at least 3 characters long
  • Words that aren’t stop words (common words like “the”, “and”, “is”)
  • Words that aren’t pure numbers
  • Content from note body (excluding frontmatter, code blocks, and images)
What gets filtered out:
  • Frontmatter blocks
  • Code blocks (fenced and inline)
  • Image embeds
  • Link URLs (but link text is kept)
  • Wiki link paths
  • Existing tags
  • Stop words in English and German
You can see the full stop word list in src/stopwords.ts:1.
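The per-word indexing rules above can be sketched as a single predicate. This is a minimal sketch: `isIndexable` is an illustrative name, and the stop-word set here is a tiny subset of the real list in src/stopwords.ts.

```typescript
// Minimal sketch of the per-word indexing filter (names illustrative).
const STOP_WORDS = new Set(["the", "and", "is", "und", "der", "die"]); // tiny subset

function isIndexable(word: string): boolean {
  if (word.length < 3) return false;      // too short
  if (STOP_WORDS.has(word)) return false; // common stop word
  if (/^\d+$/.test(word)) return false;   // pure number
  return true;
}
```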

2. TF-IDF Vectors

Once scanning completes, the plugin calculates TF-IDF (Term Frequency-Inverse Document Frequency) vectors for each tag.
TF-IDF is a statistical measure that reflects how important a word is to a tag across your entire vault.
  • TF (Term Frequency): How often a word appears in notes with this tag
  • IDF (Inverse Document Frequency): How rare the word is across all tagged documents
Rare, distinctive words get higher weights than common ones. This helps the plugin focus on words that truly characterize a tag.
The math (from src/model.ts:294-301), for each tag profile:
tf = word_count / total_words_in_tag
df = number_of_documents_containing_word
idf = log(1 + tagged_documents / df)
weight = tf × idf
These weights are precomputed and stored as normalized vectors (using Euclidean norm) for fast comparison later.
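Under those definitions, the per-tag weighting can be sketched as follows. Function and parameter names (`tagVector`, `wordCounts`, `docFreq`) are illustrative, not the plugin's actual API.

```typescript
// Sketch of per-tag TF-IDF weighting with Euclidean normalization.
function tagVector(
  wordCounts: Map<string, number>, // word -> occurrences in notes with this tag
  docFreq: Map<string, number>,    // word -> tagged documents containing the word
  taggedDocuments: number          // total number of tagged documents
): Map<string, number> {
  let totalWords = 0;
  wordCounts.forEach((count) => (totalWords += count));

  const weights = new Map<string, number>();
  wordCounts.forEach((count, word) => {
    const tf = count / totalWords;                  // term frequency
    const df = docFreq.get(word) || 1;              // document frequency
    const idf = Math.log(1 + taggedDocuments / df); // inverse document frequency
    weights.set(word, tf * idf);
  });

  // Normalize to unit length (Euclidean norm) so later cosine
  // comparisons reduce to plain dot products.
  let sumSquares = 0;
  weights.forEach((w) => (sumSquares += w * w));
  const norm = Math.sqrt(sumSquares);
  if (norm > 0) weights.forEach((w, word) => weights.set(word, w / norm));
  return weights;
}
```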

3. Cosine Similarity

When you open or edit a note, the plugin:
  1. Tokenizes the current note’s content and builds a TF-IDF vector from it (using the same process as scanning)
  2. Compares this vector against each tag’s precomputed vector using cosine similarity
  3. Filters out tags already present in the note
  4. Filters out tags that appear in fewer than 2 documents
Cosine similarity formula (from src/model.ts:123-129):
score = dot_product(doc_vector, tag_vector) / (doc_norm × tag_norm)
This produces a score between 0 and 1, where higher values mean the note’s content is more similar to documents typically tagged with that tag.
Only tags that appear in at least 2 documents can be suggested. This prevents overfitting to single examples.
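A sketch of the similarity computation over sparse word-weight vectors (illustrative names; in the plugin, the tag side is the precomputed, normalized vector from the previous step, so the division is effectively a no-op there):

```typescript
// Sketch: cosine similarity between two sparse word-weight vectors.
function cosineSimilarity(
  a: Map<string, number>,
  b: Map<string, number>
): number {
  let dot = 0;
  let sqA = 0;
  let sqB = 0;
  a.forEach((wa, word) => {
    sqA += wa * wa;
    const wb = b.get(word);
    if (wb !== undefined) dot += wa * wb; // only shared words contribute
  });
  b.forEach((wb) => (sqB += wb * wb));
  const denom = Math.sqrt(sqA) * Math.sqrt(sqB);
  return denom === 0 ? 0 : dot / denom; // guard against empty vectors
}
```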

4. Co-occurrence Boosting

The final step applies a relevance boost based on tag co-occurrence patterns (from src/model.ts:133-142). If your note already has certain tags, and those tags frequently appear together with a candidate tag in your vault, the candidate’s score is boosted:
co_rate = times_tags_appeared_together / existing_tag_document_count
boosted_score = base_score × (1 + co_rate)
Example: If you have #python in your note, and #programming appears in 80% of your notes tagged with #python, then #programming gets a significant boost. This helps the plugin learn implicit tag hierarchies and related concepts from your existing patterns.
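The boost formula is simple enough to sketch directly (names are illustrative):

```typescript
// Sketch of co-occurrence boosting.
function boostScore(
  baseScore: number,          // cosine similarity of the candidate tag
  coOccurrences: number,      // docs where candidate and existing tag appear together
  existingTagDocCount: number // docs carrying the existing tag
): number {
  const coRate = coOccurrences / existingTagDocCount;
  return baseScore * (1 + coRate);
}
```

With the #python/#programming example above, a co-occurrence rate of 80% turns a base score of 0.5 into 0.5 × 1.8 = 0.9.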

Content Extraction

Tag Extraction

The plugin recognizes tags in two formats (from src/model.ts:155-195).
Inline tags:
#machine-learning #python #data/science
Supports alphanumeric characters, underscores, hyphens, and forward slashes (for nested tags).
Frontmatter tags:
---
tags: [python, machine-learning]
---
or
---
tags:
  - python
  - machine-learning
---
Both formats are normalized (quotes and # prefixes are stripped).
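A minimal sketch of inline-tag extraction under the character rules above. The plugin's actual regex in src/model.ts may differ, for example in how it avoids matching heading markers.

```typescript
// Sketch: extract inline #tags (alphanumerics, underscore, hyphen, slash).
// A heading like "# Title" is not matched because a space follows the "#".
function extractInlineTags(content: string): string[] {
  const tags: string[] = [];
  const pattern = /#([\w\/-]+)/g;
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(content)) !== null) tags.push(m[1]);
  return tags;
}
```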

Text Tokenization

The tokenization process (from src/model.ts:197-217):
  1. Removes frontmatter blocks
  2. Removes code blocks (fenced and inline)
  3. Removes image embeds
  4. Extracts text from markdown links
  5. Extracts text from wiki links (ignoring aliases)
  6. Removes existing tags
  7. Removes heading markers and emphasis
  8. Splits on non-word boundaries
  9. Converts to lowercase
  10. Filters by length (≥3 characters), stop words, and pure numbers
The tokenizer supports Unicode characters including German umlauts and accented characters (ä, ö, ü, ß, à, á, etc.).
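The ten steps above can be sketched as a single pipeline. This is a simplified stand-in for the real implementation in src/model.ts:197-217, which handles more edge cases; the stop-word set here is a tiny illustrative subset.

```typescript
// Simplified tokenizer following the steps above (illustrative only).
const STOP_WORDS = new Set(["the", "and", "is"]); // tiny subset

function tokenize(content: string): string[] {
  const text = content
    .replace(/^---\n[\s\S]*?\n---\n?/, "")          // 1. frontmatter block
    .replace(/`{3}[\s\S]*?`{3}/g, " ")              // 2. fenced code blocks
    .replace(/`[^`]*`/g, " ")                       // 2. inline code
    .replace(/!\[\[[^\]]*\]\]/g, " ")               // 3. image embeds
    .replace(/\[([^\]]*)\]\([^)]*\)/g, "$1")        // 4. markdown links: keep text
    .replace(/\[\[([^\]|]*)(\|[^\]]*)?\]\]/g, "$1") // 5. wiki links: drop aliases
    .replace(/#[\w\/-]+/g, " ")                     // 6. existing tags
    .replace(/^#{1,6}\s+/gm, "")                    // 7. heading markers
    .replace(/[*_]+/g, " ");                        // 7. emphasis markers
  return text
    .toLowerCase()                                  // 9. lowercase
    .split(/[^\p{L}\p{N}]+/u)                       // 8. split on non-word chars (Unicode-aware)
    .filter(
      (w) => w.length >= 3 && !STOP_WORDS.has(w) && !/^\d+$/.test(w) // 10. filters
    );
}
```

The `u`-flagged `\p{L}` character class is what keeps umlauts and accented characters intact during the split.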

Performance Considerations

Vault scanning processes files in batches of 100 and yields to the UI thread between batches (from src/model.ts:63-76). This keeps Obsidian responsive even with large vaults. After scanning completes, you’ll see a notice showing:
  • Number of unique tags learned
  • Number of tagged documents processed
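The batching pattern can be sketched like this. `scanInBatches` and `processFile` are illustrative stand-ins for the plugin's per-file indexing work, not its actual API.

```typescript
// Sketch: process files in batches, yielding to the UI thread in between.
async function scanInBatches<T>(
  files: T[],
  processFile: (file: T) => void,
  batchSize = 100
): Promise<void> {
  for (let i = 0; i < files.length; i += batchSize) {
    for (const file of files.slice(i, i + batchSize)) processFile(file);
    // Yield so Obsidian's UI can repaint before the next batch starts.
    await new Promise<void>((resolve) => setTimeout(resolve, 0));
  }
}
```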

When to Rescan

The model is built once at startup. You should manually rescan when:
  • You’ve added many new tags to existing notes
  • You’ve bulk-imported notes with tags
  • You’ve significantly restructured your tagging system
  • Tag suggestions seem outdated
Use the “Rescan vault” command from the command palette or the button in settings.
