
Text Vectorization

Machine learning models require numerical input, but text is categorical and variable-length. Vectorization transforms text into fixed-size numerical vectors that capture linguistic patterns while enabling efficient computation.
For language detection, TF-IDF (Term Frequency-Inverse Document Frequency) is the gold standard. It captures word importance patterns that differ significantly across languages.

TF-IDF Overview

TF-IDF measures how important a word is to a document in a collection.

The Formula

TF-IDF = TF × IDF, where:

**Term Frequency (TF)** measures how often a term appears in a document:

$$TF(t, d) = \frac{\text{count of term } t \text{ in document } d}{\text{total terms in document } d}$$

Example: In "el parlamento tiene una sesión", the word "el" has TF = 1/5 = 0.20.

**Inverse Document Frequency (IDF)** measures how rare or common a term is across all documents:

$$IDF(t) = \log\left(\frac{\text{total documents}}{\text{documents containing term } t}\right)$$

Example: If "el" appears in 6,800 of 49,000 documents, IDF("el") = log(49000/6800) ≈ 1.97. If "parlamento" appears in only 450 documents, IDF("parlamento") = log(49000/450) ≈ 4.69 (higher = rarer).

**TF-IDF Score** combines both metrics:

$$TF\text{-}IDF(t, d) = TF(t, d) \times IDF(t)$$
  • High score: Term is frequent in this document but rare overall (important!)
  • Low score: Term is either rare in this document or common everywhere (less important)
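Both factors are simple enough to compute by hand. A minimal sketch reproducing the worked examples above (natural log, matching the numbers shown):

```python
import math

def tf(term: str, doc: str) -> float:
    # Term frequency: relative count of the term within one document
    words = doc.split()
    return words.count(term) / len(words)

def idf(term: str, docs_with_term: int, total_docs: int) -> float:
    # Inverse document frequency, natural log as in the examples above
    return math.log(total_docs / docs_with_term)

doc = "el parlamento tiene una sesión"
print(f"TF('el') = {tf('el', doc):.2f}")            # 0.20
print(f"IDF('el') = {idf('el', 6800, 49000):.2f}")  # 1.97
print(f"TF-IDF('el') = {tf('el', doc) * idf('el', 6800, 49000):.2f}")  # 0.39
```

Note that "el" ends up with a modest score despite its high TF: its low IDF pulls the product down, exactly the behavior described above.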

Why TF-IDF Works for Language Detection

Language-Specific Words

Common words in one language (“der”, “el”, “le”) are rare/absent in others, creating distinctive patterns

Discriminative Features

TF-IDF automatically emphasizes words that distinguish one language from others

Robust to Length

Normalization makes the method work for both short and long texts

Computational Efficiency

Sparse matrix representation enables fast training and inference

Implementation with Scikit-learn

Basic TF-IDF Vectorization

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Load preprocessed data
df = pd.read_csv('dataset/europarl_multilang_dataset_7000.csv')

# Initialize TF-IDF vectorizer
vectorizer = TfidfVectorizer(
    max_features=5000,    # Keep top 5000 features
    ngram_range=(1, 2),   # Use unigrams and bigrams
    min_df=2,             # Ignore terms appearing in < 2 documents
    max_df=0.95,          # Ignore terms appearing in > 95% of documents
    sublinear_tf=True     # Apply sublinear TF scaling (log)
)

# Fit and transform the text data
X = vectorizer.fit_transform(df['texto'])
y = df['idioma']

print(f"Feature matrix shape: {X.shape}")
print(f"Number of features: {len(vectorizer.get_feature_names_out())}")
print(f"Matrix sparsity: {(1.0 - X.nnz / (X.shape[0] * X.shape[1])) * 100:.2f}%")
Output:
Feature matrix shape: (49000, 5000)
Number of features: 5000
Matrix sparsity: 98.73%
Sparsity: Most values in the TF-IDF matrix are zero. Sparse matrix representation saves memory and speeds up computation.
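To see the sparse representation concretely, here is a sketch on a toy three-sentence corpus (the sentences are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: one sentence each in Spanish, German, and English
corpus = [
    "el parlamento tiene una sesión",
    "das parlament hat eine sitzung",
    "the parliament has a session",
]
# Note: the default token pattern drops single-character tokens like "a"
X = TfidfVectorizer().fit_transform(corpus)

# Only non-zero entries are stored (CSR format: data, indices, indptr arrays)
print(X.shape, X.nnz)
dense_bytes = X.shape[0] * X.shape[1] * 8          # float64 dense equivalent
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(f"dense ≈ {dense_bytes} bytes, sparse ≈ {sparse_bytes} bytes")
```

Because the three languages share almost no vocabulary, each row touches only its own handful of columns, and the gap between dense and sparse storage grows quickly with corpus size.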

Parameter Tuning

Key TfidfVectorizer parameters and their impact:
**`max_features`** (int, default: `None`): maximum number of features (vocabulary size)
  • Lower values (1000-3000): Faster, less memory, may miss rare but useful terms
  • Higher values (10000+): More features, captures rare patterns, slower
  • Recommended: 5000-10000 for language detection
**`ngram_range`** (tuple, default: `(1, 1)`): range of n-grams to extract
# Unigrams only
ngram_range=(1, 1)  # ["el", "parlamento", "tiene"]

# Unigrams + bigrams  
ngram_range=(1, 2)  # ["el", "parlamento", "el parlamento", "parlamento tiene"]

# Unigrams + bigrams + trigrams
ngram_range=(1, 3)  # Even more combinations
  • (1, 1): Fast, good baseline
  • (1, 2): Better performance, captures phrases (recommended)
  • (1, 3): Marginal gains, much larger vocabulary
**`min_df`** (int or float, default: `1`): minimum document frequency threshold
  • int: Absolute count (e.g., min_df=5 means term must appear in ≥5 docs)
  • float: Proportion (e.g., min_df=0.001 means term must appear in ≥0.1% of docs)
  • Purpose: Remove very rare terms that might be typos or noise
**`max_df`** (int or float, default: `1.0`): maximum document frequency threshold
  • float: Proportion (e.g., max_df=0.95 means ignore terms in >95% of docs)
  • Purpose: Remove extremely common terms that appear everywhere
  • Note: Different from stopword removal - data-driven approach
**`sublinear_tf`** (bool, default: `False`): apply sublinear term frequency scaling
  • False: TF = raw count
  • True: TF = 1 + log(count)
  • Effect: Reduces impact of terms appearing many times in one document
  • Recommended: True for better performance
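The frequency filters are easiest to understand on a tiny corpus. A sketch (sentences invented for illustration) comparing vocabularies with and without `min_df`/`max_df`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "el parlamento abre la sesión",
    "el parlamento cierra la sesión",
    "el consejo aprueba la propuesta",
    "el consejo rechaza la propuesta",
]

# No filtering: every token becomes a feature
v_all = TfidfVectorizer()
v_all.fit(corpus)

# min_df=2 drops terms seen in fewer than 2 documents (abre, cierra, ...);
# max_df=0.9 drops "el" and "la", which appear in all 4 documents,
# acting as a data-driven stopword filter
v_filtered = TfidfVectorizer(min_df=2, max_df=0.9)
v_filtered.fit(corpus)

print(sorted(v_all.vocabulary_))
print(sorted(v_filtered.vocabulary_))  # only the mid-frequency terms survive
```

The filtered vocabulary keeps exactly the terms that are frequent enough to be reliable but not so frequent that they carry no signal.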

Advanced Configuration

vectorizer = TfidfVectorizer(
    # Vocabulary size
    max_features=10000,
    
    # N-gram configuration
    ngram_range=(1, 3),        # Unigrams, bigrams, trigrams
    analyzer='word',           # 'word' or 'char' for character n-grams
    
    # Frequency filtering
    min_df=3,                  # Minimum 3 documents
    max_df=0.90,               # Maximum 90% of documents
    
    # TF-IDF scaling
    use_idf=True,              # Enable IDF weighting
    smooth_idf=True,           # Add 1 to document frequencies to avoid zero division
    sublinear_tf=True,         # Use log scaling for TF
    norm='l2',                 # L2 normalization (Euclidean)
    
    # Tokenization (optional preprocessing)
    lowercase=True,            # Convert to lowercase
    strip_accents=None,        # Keep accents (important for languages!)
    token_pattern=r'\b\w+\b', # Word boundary pattern
)

Character N-grams

An alternative approach using character-level features:
# Character n-grams instead of word n-grams
char_vectorizer = TfidfVectorizer(
    analyzer='char',           # Character-level analysis
    ngram_range=(2, 4),        # 2, 3, and 4-character sequences
    max_features=5000,
    min_df=5,
    lowercase=True
)

X_char = char_vectorizer.fit_transform(df['texto'])

# Examples of character n-grams for Spanish "parlamento":
# 2-grams: "pa", "ar", "rl", "la", "am", "me", "en", "nt", "to"
# 3-grams: "par", "arl", "rla", "lam", "ame", "men", "ent", "nto"
# 4-grams: "parl", "arla", "rlam", "lame", "amen", "ment", "ento"
**Word n-grams vs. character n-grams:**
  • Word n-grams: more interpretable features, capture semantic patterns, better suited to languages with clear word boundaries; but they require tokenization and are sensitive to spelling variations
  • Character n-grams: no tokenization needed and robust to spelling variations, at the cost of less interpretable features
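The n-gram lists shown above can be generated with a few lines of plain Python. This sketch mirrors what `analyzer='char'` extracts from a single word (sklearn additionally lowercases and normalizes whitespace, and `'char_wb'` pads at word boundaries):

```python
def char_ngrams(text: str, n_min: int = 2, n_max: int = 4) -> list[str]:
    # Sliding-window character n-grams for every length in [n_min, n_max]
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

grams = char_ngrams("parlamento", 2, 4)
print(grams[:5])   # ['pa', 'ar', 'rl', 'la', 'am']
print(len(grams))  # 24: nine 2-grams + eight 3-grams + seven 4-grams
```

Sequences like "ento" or "ment" recur across many Spanish words, which is why character n-grams carry a strong language signal even for words never seen in training.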

Analyzing TF-IDF Features

Top Features Per Language

Identify the most discriminative words for each language:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def get_top_features_per_language(df, n_features=20):
    """
    Extract top TF-IDF features for each language.
    """
    vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
    
    results = {}
    
    for lang in df['idioma'].unique():
        # Get texts for this language
        lang_texts = df[df['idioma'] == lang]['texto']
        
        # Fit vectorizer on this language's texts
        X_lang = vectorizer.fit_transform(lang_texts)
        
        # Each fit builds a new vocabulary, so fetch feature names per language
        feature_names = vectorizer.get_feature_names_out()
        
        # Calculate mean TF-IDF score per feature
        mean_tfidf = np.asarray(X_lang.mean(axis=0)).flatten()
        
        # Get top features
        top_indices = mean_tfidf.argsort()[-n_features:][::-1]
        top_features = [(feature_names[i], mean_tfidf[i]) 
                       for i in top_indices]
        
        results[lang] = top_features
    
    return results

# Analyze top features
top_features = get_top_features_per_language(df, n_features=10)

for lang, features in top_features.items():
    print(f"\n{lang.upper()} - Top 10 features:")
    for word, score in features:
        print(f"  {word:20s} {score:.4f}")
Example output:
ES - Top 10 features:
  que                  0.1245
  de                   0.1198
  la                   0.1087
  el                   0.0976
  es                   0.0854
  comisión             0.0723
  parlamento           0.0698
  ...

DE - Top 10 features:
  die                  0.1334
  und                  0.1289
  der                  0.1156
  ist                  0.0945
  das                  0.0887
  ...

Visualizing Feature Importance

import matplotlib.pyplot as plt

def plot_top_features(top_features, n_show=15):
    """
    Create a bar plot of top features per language.
    """
    fig, axes = plt.subplots(2, 4, figsize=(20, 10))
    axes = axes.flatten()
    
    for idx, (lang, features) in enumerate(top_features.items()):
        words = [f[0] for f in features[:n_show]]
        scores = [f[1] for f in features[:n_show]]
        
        axes[idx].barh(range(len(words)), scores)
        axes[idx].set_yticks(range(len(words)))
        axes[idx].set_yticklabels(words)
        axes[idx].set_xlabel('Mean TF-IDF Score')
        axes[idx].set_title(f'{lang.upper()} - Top Features')
        axes[idx].invert_yaxis()
    
    # Hide unused subplots (e.g. 7 languages on a 2x4 grid)
    for ax in axes[len(top_features):]:
        ax.axis('off')
    
    plt.tight_layout()
    plt.savefig('tfidf_top_features.png', dpi=300, bbox_inches='tight')
    plt.show()

plot_top_features(top_features)

Training vs. Inference

Training Phase

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    df['texto'], df['idioma'], test_size=0.2, random_state=42, stratify=df['idioma']
)

# Fit vectorizer on TRAINING data only
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)  # Transform (not fit_transform!)

# Train classifier
classifier = MultinomialNB()
classifier.fit(X_train_vec, y_train)

# Evaluate
accuracy = classifier.score(X_test_vec, y_test)
print(f"Test accuracy: {accuracy:.4f}")
Critical: Use fit_transform() on training data, but only transform() on test/new data. Fitting on test data causes data leakage.
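A useful side effect of this split: tokens absent from the training vocabulary are simply ignored by `transform()`, never added. A sketch on toy sentences (invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# "Training" texts in Spanish and German only
train_texts = [
    "el parlamento tiene una sesión",
    "das parlament hat eine sitzung",
]
vec = TfidfVectorizer()
vec.fit(train_texts)

# A French sentence: none of its words exist in the fitted vocabulary
X_new = vec.transform(["le parlement est ouvert"])
print(X_new.nnz)  # 0 — every token is out-of-vocabulary, so the row is all zeros
```

This is why the fitted vectorizer must be reused at inference time: a freshly fitted one would produce a different vocabulary and incompatible feature columns.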

Inference Phase

def predict_language(text: str, vectorizer, classifier) -> tuple:
    """
    Predict the language of new text; returns (label, class probabilities).
    """
    # Apply same preprocessing as training
    text_clean = preprocess_text(text)
    
    # Transform using fitted vectorizer
    text_vec = vectorizer.transform([text_clean])
    
    # Predict
    prediction = classifier.predict(text_vec)[0]
    probabilities = classifier.predict_proba(text_vec)[0]
    
    return prediction, probabilities

# Example
text = "Bonjour, comment allez-vous aujourd'hui?"
lang, probs = predict_language(text, vectorizer, classifier)
print(f"Detected language: {lang}")
print(f"Confidence: {max(probs):.2%}")

Best Practices

Start Simple

Begin with default TF-IDF settings, then experiment with parameters based on results

Monitor Sparsity

Very high sparsity (>99.5%) may indicate vocabulary is too large

Save the Vectorizer

Pickle the fitted vectorizer for consistent inference preprocessing
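A minimal sketch of the save/load round trip with the standard-library `pickle` module (the file name and toy corpus are illustrative):

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["el parlamento tiene una sesión", "das parlament hat eine sitzung"]
vectorizer = TfidfVectorizer().fit(corpus)

# Persist the fitted vectorizer (the trained classifier is saved the same way)
with open("tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

# At inference time, load it back: same vocabulary, same IDF weights
with open("tfidf_vectorizer.pkl", "rb") as f:
    loaded = pickle.load(f)

assert loaded.vocabulary_ == vectorizer.vocabulary_
```

Load pickles only from sources you trust, and unpickle with the same scikit-learn version used for saving to avoid compatibility surprises.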

Feature Analysis

Inspect top features to verify they make linguistic sense

Alternatives to TF-IDF

Other vectorization approaches for comparison:
| Method | Pros | Cons | Use Case |
| --- | --- | --- | --- |
| Count Vectorizer | Simple, interpretable | Doesn't account for term importance | Baseline comparison |
| Word Embeddings (Word2Vec, GloVe) | Capture semantics | Require averaging, less effective for language ID | When semantics matter more than lexicon |
| BERT Embeddings | State-of-the-art semantics | Computationally expensive, overkill | When you need multilingual understanding |
| Character TF-IDF | No tokenization needed | Less interpretable | Morphologically rich languages |
For language detection specifically, TF-IDF remains the go-to choice due to its excellent performance-to-complexity ratio.
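The difference between raw counts and TF-IDF weighting is easy to verify on a toy pair of sentences (a sketch; the sentences are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "el parlamento abre la sesión",
    "el consejo aprueba la propuesta",
]

# CountVectorizer: raw counts, every term weighted equally
counts = CountVectorizer().fit_transform(corpus)

# TfidfVectorizer: terms shared by both documents ("el", "la") get lower IDF
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(corpus)

i_el = tfidf_vec.vocabulary_["el"]
i_parlamento = tfidf_vec.vocabulary_["parlamento"]
row = tfidf.toarray()[0]
print(row[i_el] < row[i_parlamento])  # True: "el" is down-weighted by IDF
```

Under the count representation both terms would score identically (each appears once in the document); IDF is what separates the language-specific word from the shared one.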

Next Steps

Model Training

Learn how to train classifiers on TF-IDF features

Back to Overview

Review the complete pipeline
