Text Vectorization
Machine learning models require numerical input, but text is categorical and variable-length. Vectorization transforms text into fixed-size numerical vectors that capture linguistic patterns while enabling efficient computation.

For language detection, TF-IDF (Term Frequency-Inverse Document Frequency) is the gold standard. It captures word importance patterns that differ significantly across languages.
TF-IDF Overview
TF-IDF measures how important a word is to a document in a collection.

The Formula
TF-IDF = TF × IDF

where TF is the term frequency and IDF is the inverse document frequency.
Term Frequency (TF)
Measures how often a term appears in a document:

TF(t, d) = (number of times t appears in d) / (total terms in d)

Example: In “el parlamento tiene una sesión”, the word “el” appears once among 5 words, so TF = 1/5 = 0.20
Inverse Document Frequency (IDF)
Measures how rare or common a term is across all documents:

IDF(t) = log(N / df(t))

where N is the total number of documents and df(t) is the number of documents containing t.

Example: If “el” appears in 6,800 of 49,000 documents:

IDF(“el”) = log(49000/6800) ≈ 1.97

If “parlamento” appears in only 450 documents:

IDF(“parlamento”) = log(49000/450) ≈ 4.69 (higher = rarer)
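The example figures above can be checked directly; this quick sketch uses the natural logarithm, consistent with the numbers in the text:

```python
import math

# Hypothetical corpus from the text: 49,000 documents in total
N = 49_000

# "el" appears in 6,800 documents -> common word, low IDF
idf_el = math.log(N / 6_800)

# "parlamento" appears in only 450 documents -> rarer word, higher IDF
idf_parlamento = math.log(N / 450)

print(f"IDF('el')         = {idf_el:.2f}")          # ≈ 1.97
print(f"IDF('parlamento') = {idf_parlamento:.2f}")  # ≈ 4.69
```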
TF-IDF Score
Combines both metrics:
- High score: Term is frequent in this document but rare overall (important!)
- Low score: Term is either rare in this document or common everywhere (less important)
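Putting the two parts together for the five-word example sentence, assuming each word appears exactly once (so TF = 0.20 for both terms):

```python
import math

N = 49_000   # hypothetical corpus size from the IDF example above
tf = 1 / 5   # each word appears once in a 5-word sentence

score_el = tf * math.log(N / 6_800)        # common word -> low score
score_parlamento = tf * math.log(N / 450)  # rare word -> high score

print(f"TF-IDF('el')         = {score_el:.2f}")          # ≈ 0.39
print(f"TF-IDF('parlamento') = {score_parlamento:.2f}")  # ≈ 0.94
```

“parlamento” scores more than twice as high as “el”, matching the intuition that rarer terms carry more information.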
Why TF-IDF Works for Language Detection
Language-Specific Words
Common words in one language (“der”, “el”, “le”) are rare/absent in others, creating distinctive patterns
Discriminative Features
TF-IDF automatically emphasizes words that distinguish one language from others
Robust to Length
Normalization makes the method work for both short and long texts
Computational Efficiency
Sparse matrix representation enables fast training and inference
Implementation with Scikit-learn
Basic TF-IDF Vectorization
Sparsity: Most values in the TF-IDF matrix are zero. Sparse matrix representation saves memory and speeds up computation.
Parameter Tuning
Key TfidfVectorizer parameters and their impact:

`max_features` — maximum number of features (vocabulary size)
- Lower values (1000-3000): Faster, less memory, may miss rare but useful terms
- Higher values (10000+): More features, captures rare patterns, slower
- Recommended: 5000-10000 for language detection
`ngram_range` — range of n-grams to extract
- (1, 1): Fast, good baseline
- (1, 2): Better performance, captures phrases (recommended)
- (1, 3): Marginal gains, much larger vocabulary
`min_df` — minimum document frequency threshold
- int: Absolute count (e.g., min_df=5 means term must appear in ≥5 docs)
- float: Proportion (e.g., min_df=0.001 means term must appear in ≥0.1% of docs)
- Purpose: Remove very rare terms that might be typos or noise
`max_df` — maximum document frequency threshold
- float: Proportion (e.g., max_df=0.95 means ignore terms in >95% of docs)
- Purpose: Remove extremely common terms that appear everywhere
- Note: Different from stopword removal - data-driven approach
`sublinear_tf` — apply sublinear term frequency scaling
- False: TF = raw count
- True: TF = 1 + log(count)
- Effect: Reduces impact of terms appearing many times in one document
- Recommended: True for better performance
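One way the recommendations above could be combined; these values are starting points for experimentation, not definitive settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=10_000,  # cap the vocabulary size
    ngram_range=(1, 2),   # unigrams plus bigrams
    min_df=5,             # drop terms seen in fewer than 5 documents
    max_df=0.95,          # drop terms seen in more than 95% of documents
    sublinear_tf=True,    # TF = 1 + log(count)
)
```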
Advanced Configuration
Character N-grams
An alternative approach uses character-level features instead of whole words. Compared with character n-grams, word n-grams offer:

- More interpretable features
- Better results for languages with clear word boundaries
- Capture of semantic patterns
Analyzing TF-IDF Features
Top Features Per Language
Identify the most discriminative words for each language:

Visualizing Feature Importance
Training vs. Inference
Training Phase
Inference Phase
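Both phases can be sketched together; the crucial discipline is that the vectorizer is fitted once on training data, and only `transform` is called afterwards:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["el parlamento tiene una sesión", "the parliament is in session"]
new_docs = ["el gobierno", "the government"]

vectorizer = TfidfVectorizer()

# Training phase: learn vocabulary and IDF weights, transform in one step
X_train = vectorizer.fit_transform(train_docs)

# Inference phase: transform only -- refitting here would change the
# feature columns and break the trained classifier
X_new = vectorizer.transform(new_docs)

assert X_train.shape[1] == X_new.shape[1]  # same feature space
```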
Best Practices
Start Simple
Begin with default TF-IDF settings, then experiment with parameters based on results
Monitor Sparsity
Very high sparsity (>99.5%) may indicate vocabulary is too large
Save the Vectorizer
Pickle the fitted vectorizer for consistent inference preprocessing
Feature Analysis
Inspect top features to verify they make linguistic sense
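The "save the vectorizer" practice can be sketched with the standard pickle module (the filename is illustrative):

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer.fit(["el parlamento tiene una sesión", "the parliament is in session"])

# Save the fitted vectorizer alongside the trained model
with open("tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

# Later, at inference time: load and reuse the exact same preprocessing
with open("tfidf_vectorizer.pkl", "rb") as f:
    loaded = pickle.load(f)

X = loaded.transform(["el gobierno"])
print(X.shape)
```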
Alternatives to TF-IDF
Other vectorization approaches for comparison:

| Method | Pros | Cons | Use Case |
|---|---|---|---|
| Count Vectorizer | Simple, interpretable | Doesn’t account for term importance | Baseline comparison |
| Word Embeddings (Word2Vec, GloVe) | Captures semantics | Requires averaging, less effective for language ID | When semantics matter more than lexicon |
| BERT Embeddings | State-of-art semantics | Computationally expensive, overkill | When you need multilingual understanding |
| Character TF-IDF | No tokenization needed | Less interpretable | Morphologically rich languages |
For language detection specifically, TF-IDF remains the go-to choice due to its excellent performance-to-complexity ratio.
Next Steps
Model Training
Learn how to train classifiers on TF-IDF features
Back to Overview
Review the complete pipeline