Traditional machine learning models deliver excellent language-detection performance with fast training and inference times. All models use TF-IDF vectorization over word-level unigrams and bigrams (ngram_range=(1, 2)).

Naive Bayes

Multinomial Naive Bayes is the best-performing model in our system, achieving 99.92% accuracy with minimal training time.

Architecture

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorizer
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2))

# Model
model = MultinomialNB(alpha=0.5)
```

Hyperparameters

The key hyperparameter is alpha (smoothing parameter):
| Alpha | Validation Accuracy | Notes |
| --- | --- | --- |
| 0.001 | 99.78% | Too little smoothing |
| 0.01 | 99.80% | Slight improvement |
| 0.1 | 99.92% | Good performance |
| 0.5 | 99.92% | Optimal value |
| 0.6-1.0 | 99.92% | Consistent performance |
| 2.0-10.0 | 99.90% | Over-smoothing |
Alpha = 0.5 was selected as the optimal value: it sits inside the 99.92% accuracy plateau, providing enough smoothing to generalize well without the over-smoothing seen at alpha ≥ 2.0.
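A sweep like the one in the table above can be reproduced with a standard grid search; the sketch below uses a tiny made-up dataset (the real `X_train`/`y_train` would be substituted) and `cv=2` only so the toy data stratifies:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data standing in for the real training split
texts = ["the quick brown fox", "der schnelle braune fuchs",
         "le rapide renard brun", "the lazy dog sleeps",
         "der faule hund schlaeft", "le chien paresseux dort"]
labels = ["en", "de", "fr", "en", "de", "fr"]

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(analyzer='word', ngram_range=(1, 2))),
    ('classifier', MultinomialNB())
])

# Same alpha values as the table above
param_grid = {'classifier__alpha': [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=2, scoring='accuracy')
search.fit(texts, labels)

print(search.best_params_)
```

On the full corpus, several alphas tie at 99.92%, so the final choice (0.5) still involves judgment about where the plateau is most stable.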

Training Code

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
import time

# Vectorize the data
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2))
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

# Train the model
start_time = time.time()
model = MultinomialNB(alpha=0.5)
model.fit(X_train_vec, y_train)
training_time = time.time() - start_time

print(f"Training time: {training_time:.2f} seconds")

# Evaluate
accuracy = model.score(X_val_vec, y_val)
print(f"Validation accuracy: {accuracy:.4f}")
```

Performance Metrics

Test Set Results (7,350 samples):
  • Overall Accuracy: 99.92%
  • Only 6 misclassifications out of 7,350 samples
Per-Language Performance:
| Language | Precision | Recall | F1-Score | Support |
| --- | --- | --- | --- | --- |
| Swedish (sv) | 1.0000 | 1.0000 | 1.0000 | 1,038 |
| Dutch (nl) | 1.0000 | 1.0000 | 1.0000 | 1,027 |
| Portuguese (pt) | 1.0000 | 0.9991 | 0.9995 | 1,086 |
| Italian (it) | 1.0000 | 0.9982 | 0.9991 | 1,089 |
| French (fr) | 0.9990 | 0.9981 | 0.9986 | 1,050 |
| German (de) | 0.9980 | 0.9990 | 0.9985 | 1,018 |
| Spanish (es) | 0.9971 | 1.0000 | 0.9986 | 1,042 |

Common Misclassifications

Text: "Monsieur Bolkestein, je veux vous dire quelque chose!"
Actual: German (de)
Predicted: French (fr)
Confidence: 98.47%
Reason: French sentence in German document
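Cases like this can be audited by inspecting the model's class probabilities. A minimal sketch, assuming a Naive Bayes pipeline like the one above (the four training sentences here are made up for illustration):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical mini training set standing in for the real corpus
texts = ["je veux vous dire quelque chose", "ich will Ihnen etwas sagen",
         "le chat dort", "der Hund schlaeft"]
labels = ["fr", "de", "fr", "de"]

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(analyzer='word', ngram_range=(1, 2))),
    ('classifier', MultinomialNB(alpha=0.5))
])
pipeline.fit(texts, labels)

# Rank classes by predicted probability for a suspect sentence
probs = pipeline.predict_proba(["je veux vous dire quelque chose"])[0]
for label, p in sorted(zip(pipeline.classes_, probs), key=lambda t: -t[1]):
    print(f"{label}: {p:.4f}")
```

A high-confidence wrong answer, as in the example above, usually signals a labeling issue in the source document rather than a model failure.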

Support Vector Machines (SVM)

Linear SVM provides strong performance with a smaller model size than Naive Bayes.

Architecture

```python
from sklearn.svm import LinearSVC

model = LinearSVC()
```

Performance

  • Validation Accuracy: 99.77%
  • Training Time: 0.59 seconds
  • Inference Time: <0.01 seconds per prediction
  • Model Size: 14.92 MB (half the size of Naive Bayes)

Training Code

```python
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Create pipeline
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(analyzer='word', ngram_range=(1, 2))),
    ('classifier', LinearSVC())
])

# Train
pipeline.fit(X_train, y_train)

# Evaluate
accuracy = pipeline.score(X_val, y_val)
print(f"Validation accuracy: {accuracy:.4f}")
```
SVM is a good alternative to Naive Bayes when model size matters: roughly half the size, for only a 0.15-percentage-point drop in accuracy.
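One practical difference: unlike Naive Bayes, LinearSVC exposes no predict_proba, so confidence must be read from signed margins via decision_function. A minimal sketch with made-up toy data:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data in place of X_train / y_train
texts = ["the dog runs", "le chien court", "the cat sleeps", "le chat dort"]
labels = ["en", "fr", "en", "fr"]

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(analyzer='word', ngram_range=(1, 2))),
    ('classifier', LinearSVC())
])
pipeline.fit(texts, labels)

# decision_function returns signed margins, not probabilities;
# a larger absolute value means higher confidence
scores = pipeline.decision_function(["le chien dort"])
print(pipeline.predict(["le chien dort"])[0], scores)
```

If calibrated probabilities are required with SVM, scikit-learn's CalibratedClassifierCV can wrap the classifier, at some extra training cost.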

Random Forest

Random Forest is an ensemble method that combines multiple decision trees.

Architecture

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)
```

Hyperparameters

  • n_estimators: 100 trees
  • Default settings for all other parameters

Performance

  • Validation Accuracy: 99.41%
  • Training Time: 128 seconds (significantly slower)
  • Inference Time: 0.66 seconds per batch
  • Model Size: ~230 MB (largest model)

Training Code

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Create pipeline
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(analyzer='word', ngram_range=(1, 2))),
    ('classifier', RandomForestClassifier(n_estimators=100))
])

# Train
pipeline.fit(X_train, y_train)

# Evaluate
accuracy = pipeline.score(X_val, y_val)
print(f"Validation accuracy: {accuracy:.4f}")
```
Random Forest is not recommended for this task due to:
  • Significantly longer training time (128s vs 0.03s for Naive Bayes)
  • Lower accuracy (99.41% vs 99.92%)
  • Much larger model size (230 MB vs 30 MB)

Model Comparison

Accuracy vs Training Time

| Model | Accuracy | Training Time | Speedup vs Random Forest |
| --- | --- | --- | --- |
| Naive Bayes | 99.92% | 0.03s | 4,267x faster |
| SVM | 99.77% | 0.59s | 217x faster |
| Logistic Regression | 99.56% | 14.73s | 8.7x faster |
| Random Forest | 99.41% | 128s | 1x (baseline) |
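A comparison like the one above can be reproduced with a simple timing loop over identical pipelines; this sketch substitutes a made-up toy dataset for the real training split, so only the relative structure (not the timings) carries over:

```python
import time
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data in place of the real training split
texts = ["the dog runs", "le chien court", "der Hund rennt",
         "the cat sleeps", "le chat dort", "die Katze schlaeft"]
labels = ["en", "fr", "de", "en", "fr", "de"]

models = {
    "Naive Bayes": MultinomialNB(alpha=0.5),
    "SVM": LinearSVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100),
}

results = {}
for name, clf in models.items():
    pipeline = Pipeline([
        ('vectorizer', TfidfVectorizer(analyzer='word', ngram_range=(1, 2))),
        ('classifier', clf)
    ])
    start = time.time()
    pipeline.fit(texts, labels)          # time training only
    results[name] = time.time() - start
    print(f"{name}: {results[name]:.3f}s")
```

Holding the vectorizer fixed across models, as above, is what makes the training-time and accuracy columns directly comparable.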

Key Findings

Best Overall

Naive Bayes (alpha=0.5)
  • Highest accuracy: 99.92%
  • Fastest training: 0.03s
  • Production-ready performance

Smallest Model

SVM
  • Model size: 14.92 MB
  • Accuracy: 99.77%
  • Fast inference: <0.01s
For production deployment, we recommend Naive Bayes with alpha=0.5:
```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
import joblib

# Create and train pipeline
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(analyzer='word', ngram_range=(1, 2))),
    ('classifier', MultinomialNB(alpha=0.5))
])

pipeline.fit(X_train, y_train)

# Save for production
joblib.dump(pipeline, 'modelos/naive_bayes_alpha_0.5.joblib')
```
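At serving time, the saved pipeline is loaded once and reused for every request; because the vectorizer is bundled inside the pipeline, raw strings go straight in. A minimal round-trip sketch (the two training sentences and the file name here are made up for illustration):

```python
import joblib
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Train and save a small stand-in pipeline (hypothetical toy data)
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(analyzer='word', ngram_range=(1, 2))),
    ('classifier', MultinomialNB(alpha=0.5))
])
pipeline.fit(["the dog runs", "le chien court"], ["en", "fr"])
joblib.dump(pipeline, 'naive_bayes_alpha_0.5.joblib')

# Later, in the serving process: load once, then predict per request
model = joblib.load('naive_bayes_alpha_0.5.joblib')
print(model.predict(["le chien court"])[0])
```

Note that joblib files should be loaded only from trusted sources, and ideally with the same scikit-learn version used to save them.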

Next Steps

Deep Learning Models

Explore LSTM and BiLSTM architectures

Model Comparison

See detailed comparisons across all models

Training Process

Learn how to train models on your own data

Using Models

Start making predictions with trained models
