The Language Detection System implements multiple machine learning and deep learning models to classify text into seven European languages: Spanish (es), German (de), French (fr), Italian (it), Dutch (nl), Portuguese (pt), and Swedish (sv).

Available Models

Our system provides two categories of models, each with different trade-offs:

Traditional ML Models

Fast, efficient, and highly accurate classical machine learning approaches

Deep Learning Models

Neural network architectures that learn sequential patterns in text

Quick Comparison

| Model | Accuracy | Training Time | Inference Speed | Model Size |
|---|---|---|---|---|
| Naive Bayes | 99.92% | ~0.03 s | ~0.01 s | 29.84 MB |
| SVM | 99.77% | ~0.59 s | <0.01 s | 14.92 MB |
| Random Forest | 99.41% | ~128 s | ~0.66 s | 230 MB |
| LSTM | 94.07% | ~15 epochs | ~15 ms/sample | 618 KB |
| BiLSTM | 94.08% | ~15 epochs | ~30 ms/sample | 618 KB |
All models were trained on the Europarl Parallel Corpus with 7,000 samples per language.

Model Selection Guide

When to Use Traditional ML

Choose traditional ML models when you need:
  • Maximum accuracy (>99.9% on the validation set)
  • Fast training (seconds instead of minutes)
  • Quick inference (milliseconds per prediction)
  • Production-ready performance with minimal resources
Recommended: Naive Bayes with alpha=0.5 achieves 99.92% accuracy and offers the best speed/accuracy trade-off.
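The recommended setup can be sketched as a scikit-learn pipeline; the toy texts and labels below are illustrative, not the project's actual Europarl training data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus; the real system trains on Europarl samples.
texts = [
    "el parlamento europeo",
    "das europäische parlament",
    "le parlement européen",
    "il parlamento europeo",
]
labels = ["es", "de", "fr", "it"]

# TF-IDF word features feeding a smoothed Naive Bayes (alpha=0.5).
model = make_pipeline(
    TfidfVectorizer(analyzer='word', ngram_range=(1, 2)),
    MultinomialNB(alpha=0.5),
)
model.fit(texts, labels)
print(model.predict(["das parlament"]))
```

The pipeline keeps vectorizer and classifier together, so the same object handles both fitting and prediction on raw strings.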

When to Use Deep Learning

Choose deep learning models when you:
  • Want to learn sequential patterns in text
  • Need interpretable embeddings for downstream tasks
  • Are working with noisy or out-of-vocabulary words
  • Have sufficient computational resources for training
Recommended: Bidirectional LSTM marginally outperforms the unidirectional LSTM (94.08% vs 94.07%) and captures context from both directions.
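A minimal Keras-style sketch of such a bidirectional model follows; the vocabulary size, embedding width, and layer sizes are illustrative assumptions, not the project's exact configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 20_000   # assumed tokenizer vocabulary size
SEQ_LEN = 50          # ~95th-percentile sequence length
NUM_CLASSES = 7       # es, de, fr, it, nl, pt, sv

model = tf.keras.Sequential([
    layers.Embedding(VOCAB_SIZE, 64),
    # The Bidirectional wrapper reads the sequence left-to-right and
    # right-to-left, concatenating both final hidden states.
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(NUM_CLASSES, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```

The softmax output yields one probability per language; integer-encoded labels pair with the sparse categorical cross-entropy loss.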

Feature Extraction

Traditional ML models use TF-IDF vectorization with word-level n-grams, while the deep learning models consume tokenized integer sequences:
  • Analyzer: Word-based
  • N-gram range: (1, 2) - unigrams and bigrams
  • Vocabulary size: ~50K features for traditional ML
  • Sequence length: 95th percentile of training data for deep learning (~50 tokens)
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    analyzer='word',       # word-level tokens
    ngram_range=(1, 2),    # unigrams and bigrams
    max_features=50_000    # cap vocabulary at ~50K features
)
X_train_vec = vectorizer.fit_transform(X_train)
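For the deep learning models, the fixed sequence length can be derived from the 95th percentile of tokenized training lengths; a sketch with NumPy, where the sample texts and whitespace tokenization are illustrative:

```python
import numpy as np

# Token counts per training sample (illustrative values).
lengths = [len(text.split()) for text in [
    "the european parliament met today",
    "das parlament tagt",
    "le parlement européen se réunit aujourd'hui à strasbourg",
]]

# All sequences are then padded or truncated to this length.
max_len = int(np.percentile(lengths, 95))
```

Cutting at the 95th percentile keeps padding short for the bulk of samples while truncating only the longest outliers.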

Performance Metrics

Best Performing Models by Language

| Language | Best Traditional ML | Best Deep Learning |
|---|---|---|
| Spanish (es) | NB: 0.9986 F1 | BiLSTM: 0.9272 F1 |
| German (de) | NB: 0.9985 F1 | BiLSTM: 0.9659 F1 |
| French (fr) | NB: 0.9995 F1 | BiLSTM: 0.9365 F1 |
| Italian (it) | NB: 0.9995 F1 | LSTM: 0.8969 F1 |
| Dutch (nl) | NB: 0.9995 F1 | LSTM: 0.9693 F1 |
| Portuguese (pt) | NB: 0.9986 F1 | LSTM: 0.9369 F1 |
| Swedish (sv) | NB: 1.0000 F1 | BiLSTM: 0.9747 F1 |
Traditional ML models consistently outperform deep learning models on this task due to the structured nature of the Europarl corpus.

Next Steps

Traditional ML Details

Explore Naive Bayes, SVM, and Random Forest implementations

Deep Learning Details

Learn about LSTM and BiLSTM architectures

Model Comparison

See detailed performance comparisons and metrics

Training Guide

Learn how to train your own models
