Naive Bayes
Multinomial Naive Bayes is the best-performing model in our system, achieving 99.92% accuracy with minimal training time.
Architecture
Hyperparameters
The key hyperparameter is alpha (the smoothing parameter):
| Alpha | Validation Accuracy | Notes |
|---|---|---|
| 0.001 | 99.78% | Too little smoothing |
| 0.01 | 99.80% | Slight improvement |
| 0.1 | 99.92% | Good performance |
| 0.5 | 99.92% | Optimal value |
| 0.6-1.0 | 99.92% | Consistent performance |
| 2.0-10.0 | 99.90% | Over-smoothing |
Alpha = 0.5 was selected as the optimal value: accuracy plateaus at 99.92% for alpha between 0.1 and 1.0, and 0.5 sits in the middle of that plateau, safely away from both under- and over-smoothing.
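The sweep above can be reproduced with a simple loop over candidate alpha values. This is a hedged sketch using scikit-learn's `MultinomialNB`; the character n-gram vectorizer and the tiny train/validation split are illustrative stand-ins, not the project's actual data pipeline.

```python
# Illustrative alpha sweep for Multinomial Naive Bayes.
# The corpus and vectorizer settings here are toy placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["hola mundo", "hallo welt", "bonjour le monde", "ciao mondo"]
train_labels = ["es", "de", "fr", "it"]
val_texts = ["hola amigo", "hallo freund"]
val_labels = ["es", "de"]

# Character n-grams are a common choice for language identification.
vec = CountVectorizer(analyzer="char", ngram_range=(1, 2))
X_train = vec.fit_transform(train_texts)
X_val = vec.transform(val_texts)

for alpha in [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0]:
    clf = MultinomialNB(alpha=alpha).fit(X_train, train_labels)
    print(f"alpha={alpha}: val accuracy={clf.score(X_val, val_labels):.4f}")
```

On the real validation set this sweep produced the plateau shown in the table, motivating the alpha=0.5 choice.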
Training Code
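The original training snippet is not reproduced here, so the following is a minimal sketch of how the selected model could be trained with scikit-learn, assuming character n-gram TF-IDF features; the pipeline structure and toy corpus are assumptions, not the project's verbatim code.

```python
# Sketch: train Multinomial Naive Bayes with the selected alpha=0.5.
# Feature extraction and data below are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = ["het is een mooie dag", "c'est une belle journée",
         "es un día hermoso", "es ist ein schöner Tag"]
labels = ["nl", "fr", "es", "de"]

model = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(1, 3))),
    ("nb", MultinomialNB(alpha=0.5)),  # alpha chosen via the sweep above
])
model.fit(texts, labels)
print(model.predict(["une belle maison"]))
```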
Performance Metrics
Test Set Results (7,350 samples):
- Overall Accuracy: 99.92%
- Only 6 misclassifications out of 7,350 samples
| Language | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Swedish (sv) | 1.0000 | 1.0000 | 1.0000 | 1,038 |
| Dutch (nl) | 1.0000 | 1.0000 | 1.0000 | 1,027 |
| Portuguese (pt) | 1.0000 | 0.9991 | 0.9995 | 1,086 |
| Italian (it) | 1.0000 | 0.9982 | 0.9991 | 1,089 |
| French (fr) | 0.9990 | 0.9981 | 0.9986 | 1,050 |
| German (de) | 0.9980 | 0.9990 | 0.9985 | 1,018 |
| Spanish (es) | 0.9971 | 1.0000 | 0.9986 | 1,042 |
Common Misclassifications
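The handful of errors (6 out of 7,350 on the real test set) can be located by comparing predictions against the true labels. This is a hedged, self-contained sketch with toy data; the fitted model and the arrays it operates on are stand-ins for the project's actual test split.

```python
# Sketch: find and print misclassified test samples.
# Toy corpus only; the real evaluation ran on 7,350 samples.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train = ["hola mundo", "hallo welt", "bonjour le monde", "ciao amici"]
y_train = ["es", "de", "fr", "it"]
test = ["hola amigos", "guten tag welt"]
y_test = np.array(["es", "de"])

vec = CountVectorizer(analyzer="char", ngram_range=(1, 2))
clf = MultinomialNB(alpha=0.5).fit(vec.fit_transform(train), y_train)
preds = clf.predict(vec.transform(test))

# Indices where prediction disagrees with the true label.
for i in np.where(preds != y_test)[0]:
    print(f"true={y_test[i]} predicted={preds[i]}: {test[i]}")
```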
Support Vector Machines (SVM)
Linear SVM provides strong performance with a smaller model size than Naive Bayes.
Architecture
Performance
- Validation Accuracy: 99.77%
- Training Time: 0.59 seconds
- Inference Time: <0.01 seconds per prediction
- Model Size: 14.92 MB (half the size of Naive Bayes)
Training Code
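As with Naive Bayes, the original snippet is not shown, so here is a minimal sketch using scikit-learn's `LinearSVC`; the TF-IDF settings and toy data are assumptions for illustration, not the project's exact configuration.

```python
# Sketch: linear SVM baseline for language identification.
# Vectorizer settings and corpus are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

texts = ["god morgon", "goedemorgen", "bom dia", "buongiorno"]
labels = ["sv", "nl", "pt", "it"]

svm = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(1, 3))),
    ("clf", LinearSVC()),  # linear kernel keeps the model small
])
svm.fit(texts, labels)
print(svm.predict(["bom dia amigo"]))
```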
SVM is a good alternative to Naive Bayes when you need a smaller model with only a slight decrease in accuracy (0.15 percentage points).
Random Forest
Random Forest is an ensemble method that combines multiple decision trees.
Architecture
Hyperparameters
- n_estimators: 100 trees
- Default settings for all other parameters
Performance
- Validation Accuracy: 99.41%
- Training Time: 128 seconds (significantly slower)
- Inference Time: 0.66 seconds per batch
- Model Size: ~230 MB (largest model)
Training Code
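A minimal sketch of the Random Forest baseline, assuming the same illustrative TF-IDF features as above; `n_estimators=100` matches the hyperparameters listed, everything else is left at scikit-learn defaults.

```python
# Sketch: Random Forest baseline with 100 trees, defaults otherwise.
# Features and data are illustrative placeholders.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

texts = ["god morgon", "goedemorgen", "bom dia", "buongiorno"]
labels = ["sv", "nl", "pt", "it"]

rf = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(1, 2))),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
rf.fit(texts, labels)
print(rf.predict(["goedemorgen allemaal"]))
```

The large model size reported above is typical of forests: each of the 100 trees stores its own split structure, which dwarfs the single weight vector per class kept by Naive Bayes or a linear SVM.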
Random Forest is not recommended for this task due to:
- Significantly longer training time (128s vs 0.03s for Naive Bayes)
- Lower accuracy (99.41% vs 99.92%)
- Much larger model size (230 MB vs 30 MB)
Model Comparison
Accuracy vs Training Time
| Model | Accuracy | Training Time | Speedup vs Random Forest |
|---|---|---|---|
| Naive Bayes | 99.92% | 0.03s | 4,267x faster |
| SVM | 99.77% | 0.59s | 217x faster |
| Logistic Regression | 99.56% | 14.73s | 8.7x faster |
| Random Forest | 99.41% | 128s | 1x (baseline) |
Key Findings
Best Overall
Naive Bayes (alpha=0.5)
- Highest accuracy: 99.92%
- Fastest training: 0.03s
- Production-ready performance
Smallest Model
SVM
- Model size: 14.92 MB
- Accuracy: 99.77%
- Fast inference: <0.01s
Recommended Model
For production deployment, we recommend Naive Bayes with alpha=0.5.
Next Steps
Deep Learning Models
Explore LSTM and BiLSTM architectures
Model Comparison
See detailed comparisons across all models
Training Process
Learn how to train models on your own data
Using Models
Start making predictions with trained models