
Text Vectorization

Machine learning models require numerical input, but text is categorical and variable-length. Vectorization transforms text into fixed-size numerical vectors that capture linguistic patterns while enabling efficient computation.
For language detection, TF-IDF (Term Frequency-Inverse Document Frequency) is the gold standard. It captures word importance patterns that differ significantly across languages.

TF-IDF Overview

TF-IDF measures how important a word is to a document in a collection.

The Formula

TF-IDF = TF × IDF, where:

**Term Frequency (TF)** measures how often a term appears in a document:

$$TF(t, d) = \frac{\text{count of term } t \text{ in document } d}{\text{total terms in document } d}$$

Example: In "el parlamento tiene una sesión", the word "el" has TF = 1/5 = 0.20.

**Inverse Document Frequency (IDF)** measures how rare or common a term is across all documents:

$$IDF(t) = \log\left(\frac{\text{total documents}}{\text{documents containing term } t}\right)$$

Example: If "el" appears in 6,800 of 49,000 documents, IDF("el") = log(49000/6800) ≈ 1.97. If "parlamento" appears in only 450 documents, IDF("parlamento") = log(49000/450) ≈ 4.69 (higher = rarer).

**TF-IDF Score** combines both metrics:

$$TF\text{-}IDF(t, d) = TF(t, d) \times IDF(t)$$
  • High score: Term is frequent in this document but rare overall (important!)
  • Low score: Term is either rare in this document or common everywhere (less important)
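Both factors are simple enough to compute by hand. A minimal sketch reproducing the worked examples above (natural log, matching the numbers shown):

```python
import math

def tf(term: str, doc: str) -> float:
    # Term frequency: relative count of the term within one document
    words = doc.split()
    return words.count(term) / len(words)

def idf(term: str, docs_with_term: int, total_docs: int) -> float:
    # Inverse document frequency, natural log as in the examples above
    return math.log(total_docs / docs_with_term)

doc = "el parlamento tiene una sesión"
print(f"TF('el') = {tf('el', doc):.2f}")            # 0.20
print(f"IDF('el') = {idf('el', 6800, 49000):.2f}")  # 1.97
print(f"TF-IDF('el') = {tf('el', doc) * idf('el', 6800, 49000):.2f}")  # 0.39
```

Note that "el" ends up with a modest score despite its high TF: its low IDF pulls the product down, exactly the behavior described above.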

Why TF-IDF Works for Language Detection

Language-Specific Words

Common words in one language (“der”, “el”, “le”) are rare/absent in others, creating distinctive patterns

Discriminative Features

TF-IDF automatically emphasizes words that distinguish one language from others

Robust to Length

Normalization makes the method work for both short and long texts

Computational Efficiency

Sparse matrix representation enables fast training and inference

Implementation with Scikit-learn

Basic TF-IDF Vectorization

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Load preprocessed data
df = pd.read_csv('dataset/europarl_multilang_dataset_7000.csv')

# Initialize TF-IDF vectorizer
vectorizer = TfidfVectorizer(
    max_features=5000,    # Keep top 5000 features
    ngram_range=(1, 2),   # Use unigrams and bigrams
    min_df=2,             # Ignore terms appearing in < 2 documents
    max_df=0.95,          # Ignore terms appearing in > 95% of documents
    sublinear_tf=True     # Apply sublinear TF scaling (log)
)

# Fit and transform the text data
X = vectorizer.fit_transform(df['texto'])
y = df['idioma']

print(f"Feature matrix shape: {X.shape}")
print(f"Number of features: {len(vectorizer.get_feature_names_out())}")
print(f"Matrix sparsity: {(1.0 - X.nnz / (X.shape[0] * X.shape[1])) * 100:.2f}%")
Output:
Feature matrix shape: (49000, 5000)
Number of features: 5000
Matrix sparsity: 98.73%
Sparsity: Most values in the TF-IDF matrix are zero. Sparse matrix representation saves memory and speeds up computation.
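To see the sparse representation concretely, here is a sketch on a toy three-sentence corpus (the sentences are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: one sentence each in Spanish, German, and English
corpus = [
    "el parlamento tiene una sesión",
    "das parlament hat eine sitzung",
    "the parliament has a session",
]
# Note: the default token pattern drops single-character tokens like "a"
X = TfidfVectorizer().fit_transform(corpus)

# Only non-zero entries are stored (CSR format: data, indices, indptr arrays)
print(X.shape, X.nnz)
dense_bytes = X.shape[0] * X.shape[1] * 8          # float64 dense equivalent
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(f"dense ≈ {dense_bytes} bytes, sparse ≈ {sparse_bytes} bytes")
```

Because the three languages share almost no vocabulary, each row touches only its own handful of columns, and the gap between dense and sparse storage grows quickly with corpus size.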

Parameter Tuning

Key TfidfVectorizer parameters and their impact:
**`max_features`** (int, default: `None`): maximum number of features (vocabulary size)
  • Lower values (1000-3000): Faster, less memory, may miss rare but useful terms
  • Higher values (10000+): More features, captures rare patterns, slower
  • Recommended: 5000-10000 for language detection
**`ngram_range`** (tuple, default: `(1, 1)`): range of n-grams to extract
# Unigrams only
ngram_range=(1, 1)  # ["el", "parlamento", "tiene"]

# Unigrams + bigrams  
ngram_range=(1, 2)  # ["el", "parlamento", "el parlamento", "parlamento tiene"]

# Unigrams + bigrams + trigrams
ngram_range=(1, 3)  # Even more combinations
  • (1, 1): Fast, good baseline
  • (1, 2): Better performance, captures phrases (recommended)
  • (1, 3): Marginal gains, much larger vocabulary
**`min_df`** (int or float, default: `1`): minimum document frequency threshold
  • int: Absolute count (e.g., min_df=5 means term must appear in ≥5 docs)
  • float: Proportion (e.g., min_df=0.001 means term must appear in ≥0.1% of docs)
  • Purpose: Remove very rare terms that might be typos or noise
**`max_df`** (int or float, default: `1.0`): maximum document frequency threshold
  • float: Proportion (e.g., max_df=0.95 means ignore terms in >95% of docs)
  • Purpose: Remove extremely common terms that appear everywhere
  • Note: Different from stopword removal - data-driven approach
**`sublinear_tf`** (bool, default: `False`): apply sublinear term frequency scaling
  • False: TF = raw count
  • True: TF = 1 + log(count)
  • Effect: Reduces impact of terms appearing many times in one document
  • Recommended: True for better performance
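The frequency filters are easiest to understand on a tiny corpus. A sketch (sentences invented for illustration) comparing vocabularies with and without `min_df`/`max_df`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "el parlamento abre la sesión",
    "el parlamento cierra la sesión",
    "el consejo aprueba la propuesta",
    "el consejo rechaza la propuesta",
]

# No filtering: every token becomes a feature
v_all = TfidfVectorizer()
v_all.fit(corpus)

# min_df=2 drops terms seen in fewer than 2 documents (abre, cierra, ...);
# max_df=0.9 drops "el" and "la", which appear in all 4 documents,
# acting as a data-driven stopword filter
v_filtered = TfidfVectorizer(min_df=2, max_df=0.9)
v_filtered.fit(corpus)

print(sorted(v_all.vocabulary_))
print(sorted(v_filtered.vocabulary_))  # only the mid-frequency terms survive
```

The filtered vocabulary keeps exactly the terms that are frequent enough to be reliable but not so frequent that they carry no signal.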

Advanced Configuration

vectorizer = TfidfVectorizer(
    # Vocabulary size
    max_features=10000,
    
    # N-gram configuration
    ngram_range=(1, 3),        # Unigrams, bigrams, trigrams
    analyzer='word',           # 'word' or 'char' for character n-grams
    
    # Frequency filtering
    min_df=3,                  # Minimum 3 documents
    max_df=0.90,               # Maximum 90% of documents
    
    # TF-IDF scaling
    use_idf=True,              # Enable IDF weighting
    smooth_idf=True,           # Add 1 to document frequencies to avoid zero division
    sublinear_tf=True,         # Use log scaling for TF
    norm='l2',                 # L2 normalization (Euclidean)
    
    # Tokenization (optional preprocessing)
    lowercase=True,            # Convert to lowercase
    strip_accents=None,        # Keep accents (important for languages!)
    token_pattern=r'\b\w+\b', # Word boundary pattern
)

Character N-grams

An alternative approach using character-level features:
# Character n-grams instead of word n-grams
char_vectorizer = TfidfVectorizer(
    analyzer='char',           # Character-level analysis
    ngram_range=(2, 4),        # 2, 3, and 4-character sequences
    max_features=5000,
    min_df=5,
    lowercase=True
)

X_char = char_vectorizer.fit_transform(df['texto'])

# Examples of character n-grams for Spanish "parlamento":
# 2-grams: "pa", "ar", "rl", "la", "am", "me", "en", "nt", "to"
# 3-grams: "par", "arl", "rla", "lam", "ame", "men", "ent", "nto"
# 4-grams: "parl", "arla", "rlam", "lame", "amen", "ment", "ento"
**Word n-grams vs. character n-grams:**
  • Word n-grams: more interpretable features, capture semantic patterns, better suited to languages with clear word boundaries; but they require tokenization and are sensitive to spelling variations
  • Character n-grams: no tokenization needed and robust to spelling variations, at the cost of less interpretable features
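The n-gram lists shown above can be generated with a few lines of plain Python. This sketch mirrors what `analyzer='char'` extracts from a single word (sklearn additionally lowercases and normalizes whitespace, and `'char_wb'` pads at word boundaries):

```python
def char_ngrams(text: str, n_min: int = 2, n_max: int = 4) -> list[str]:
    # Sliding-window character n-grams for every length in [n_min, n_max]
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

grams = char_ngrams("parlamento", 2, 4)
print(grams[:5])   # ['pa', 'ar', 'rl', 'la', 'am']
print(len(grams))  # 24: nine 2-grams + eight 3-grams + seven 4-grams
```

Sequences like "ento" or "ment" recur across many Spanish words, which is why character n-grams carry a strong language signal even for words never seen in training.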

Analyzing TF-IDF Features

Top Features Per Language

Identify the most discriminative words for each language:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def get_top_features_per_language(df, n_features=20):
    """
    Extract top TF-IDF features for each language.
    """
    vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
    
    results = {}
    
    for lang in df['idioma'].unique():
        # Get texts for this language
        lang_texts = df[df['idioma'] == lang]['texto']
        
        # Fit vectorizer on this language's texts
        X_lang = vectorizer.fit_transform(lang_texts)
        
        # Each fit builds a new vocabulary, so fetch feature names per language
        feature_names = vectorizer.get_feature_names_out()
        
        # Calculate mean TF-IDF score per feature
        mean_tfidf = np.asarray(X_lang.mean(axis=0)).flatten()
        
        # Get top features
        top_indices = mean_tfidf.argsort()[-n_features:][::-1]
        top_features = [(feature_names[i], mean_tfidf[i]) 
                       for i in top_indices]
        
        results[lang] = top_features
    
    return results

# Analyze top features
top_features = get_top_features_per_language(df, n_features=10)

for lang, features in top_features.items():
    print(f"\n{lang.upper()} - Top 10 features:")
    for word, score in features:
        print(f"  {word:20s} {score:.4f}")
Example output:
ES - Top 10 features:
  que                  0.1245
  de                   0.1198
  la                   0.1087
  el                   0.0976
  es                   0.0854
  comisión             0.0723
  parlamento           0.0698
  ...

DE - Top 10 features:
  die                  0.1334
  und                  0.1289
  der                  0.1156
  ist                  0.0945
  das                  0.0887
  ...

Visualizing Feature Importance

import matplotlib.pyplot as plt

def plot_top_features(top_features, n_show=15):
    """
    Create a bar plot of top features per language.
    """
    fig, axes = plt.subplots(2, 4, figsize=(20, 10))
    axes = axes.flatten()
    
    for idx, (lang, features) in enumerate(top_features.items()):
        words = [f[0] for f in features[:n_show]]
        scores = [f[1] for f in features[:n_show]]
        
        axes[idx].barh(range(len(words)), scores)
        axes[idx].set_yticks(range(len(words)))
        axes[idx].set_yticklabels(words)
        axes[idx].set_xlabel('Mean TF-IDF Score')
        axes[idx].set_title(f'{lang.upper()} - Top Features')
        axes[idx].invert_yaxis()
    
    # Hide unused subplots (e.g. 7 languages on a 2x4 grid)
    for ax in axes[len(top_features):]:
        ax.axis('off')
    
    plt.tight_layout()
    plt.savefig('tfidf_top_features.png', dpi=300, bbox_inches='tight')
    plt.show()

plot_top_features(top_features)

Training vs. Inference

Training Phase

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    df['texto'], df['idioma'], test_size=0.2, random_state=42, stratify=df['idioma']
)

# Fit vectorizer on TRAINING data only
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)  # Transform (not fit_transform!)

# Train classifier
classifier = MultinomialNB()
classifier.fit(X_train_vec, y_train)

# Evaluate
accuracy = classifier.score(X_test_vec, y_test)
print(f"Test accuracy: {accuracy:.4f}")
Critical: Use fit_transform() on training data, but only transform() on test/new data. Fitting on test data causes data leakage.
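A useful side effect of this split: tokens absent from the training vocabulary are simply ignored by `transform()`, never added. A sketch on toy sentences (invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# "Training" texts in Spanish and German only
train_texts = [
    "el parlamento tiene una sesión",
    "das parlament hat eine sitzung",
]
vec = TfidfVectorizer()
vec.fit(train_texts)

# A French sentence: none of its words exist in the fitted vocabulary
X_new = vec.transform(["le parlement est ouvert"])
print(X_new.nnz)  # 0 — every token is out-of-vocabulary, so the row is all zeros
```

This is why the fitted vectorizer must be reused at inference time: a freshly fitted one would produce a different vocabulary and incompatible feature columns.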

Inference Phase

def predict_language(text: str, vectorizer, classifier) -> tuple:
    """
    Predict the language of new text; returns (label, class probabilities).
    """
    # Apply same preprocessing as training
    text_clean = preprocess_text(text)
    
    # Transform using fitted vectorizer
    text_vec = vectorizer.transform([text_clean])
    
    # Predict
    prediction = classifier.predict(text_vec)[0]
    probabilities = classifier.predict_proba(text_vec)[0]
    
    return prediction, probabilities

# Example
text = "Bonjour, comment allez-vous aujourd'hui?"
lang, probs = predict_language(text, vectorizer, classifier)
print(f"Detected language: {lang}")
print(f"Confidence: {max(probs):.2%}")

Best Practices

Start Simple

Begin with default TF-IDF settings, then experiment with parameters based on results

Monitor Sparsity

Very high sparsity (>99.5%) may indicate vocabulary is too large

Save the Vectorizer

Pickle the fitted vectorizer for consistent inference preprocessing
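A minimal sketch of the save/load round trip with the standard-library `pickle` module (the file name and toy corpus are illustrative):

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["el parlamento tiene una sesión", "das parlament hat eine sitzung"]
vectorizer = TfidfVectorizer().fit(corpus)

# Persist the fitted vectorizer (the trained classifier is saved the same way)
with open("tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

# At inference time, load it back: same vocabulary, same IDF weights
with open("tfidf_vectorizer.pkl", "rb") as f:
    loaded = pickle.load(f)

assert loaded.vocabulary_ == vectorizer.vocabulary_
```

Load pickles only from sources you trust, and unpickle with the same scikit-learn version used for saving to avoid compatibility surprises.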

Feature Analysis

Inspect top features to verify they make linguistic sense

Alternatives to TF-IDF

Other vectorization approaches for comparison:
| Method | Pros | Cons | Use Case |
| --- | --- | --- | --- |
| Count Vectorizer | Simple, interpretable | Doesn't account for term importance | Baseline comparison |
| Word Embeddings (Word2Vec, GloVe) | Capture semantics | Require averaging, less effective for language ID | When semantics matter more than lexicon |
| BERT Embeddings | State-of-the-art semantics | Computationally expensive, overkill | When you need multilingual understanding |
| Character TF-IDF | No tokenization needed | Less interpretable | Morphologically rich languages |
For language detection specifically, TF-IDF remains the go-to choice due to its excellent performance-to-complexity ratio.
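The difference between raw counts and TF-IDF weighting is easy to verify on a toy pair of sentences (a sketch; the sentences are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "el parlamento abre la sesión",
    "el consejo aprueba la propuesta",
]

# CountVectorizer: raw counts, every term weighted equally
counts = CountVectorizer().fit_transform(corpus)

# TfidfVectorizer: terms shared by both documents ("el", "la") get lower IDF
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(corpus)

i_el = tfidf_vec.vocabulary_["el"]
i_parlamento = tfidf_vec.vocabulary_["parlamento"]
row = tfidf.toarray()[0]
print(row[i_el] < row[i_parlamento])  # True: "el" is down-weighted by IDF
```

Under the count representation both terms would score identically (each appears once in the document); IDF is what separates the language-specific word from the shared one.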

Next Steps

Model Training

Learn how to train classifiers on TF-IDF features

Back to Overview

Review the complete pipeline
