Deep learning models learn sequential patterns in text through recurrent neural networks. While they achieve lower accuracy than traditional ML on this task (94% vs 99.9%), they provide valuable embeddings and handle out-of-vocabulary words better.
Data Preprocessing
Deep learning models require tokenization and padding instead of TF-IDF vectorization:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
def preprocesar_secuencias(textos_train, textos_val, textos_test,
                           max_palabras=10000, max_longitud=None):
    """
    Tokenize and pad sequences for recurrent neural networks.

    Args:
        textos_train: Training texts
        textos_val: Validation texts
        textos_test: Test texts
        max_palabras: Maximum vocabulary size (default: 10,000)
        max_longitud: Maximum sequence length (auto-calculated if None)

    Returns:
        Dictionary with processed data and tokenizer
    """
    # Initialize tokenizer
    tokenizer = Tokenizer(num_words=max_palabras, oov_token="<OOV>")
    tokenizer.fit_on_texts(textos_train)

    # Convert texts to sequences
    secuencias_train = tokenizer.texts_to_sequences(textos_train)
    secuencias_val = tokenizer.texts_to_sequences(textos_val)
    secuencias_test = tokenizer.texts_to_sequences(textos_test)

    # Determine max length (95th percentile to avoid outliers)
    if max_longitud is None:
        longitudes = [len(seq) for seq in secuencias_train]
        max_longitud = int(np.percentile(longitudes, 95))

    # Apply padding
    X_train = pad_sequences(secuencias_train, maxlen=max_longitud, padding='post')
    X_val = pad_sequences(secuencias_val, maxlen=max_longitud, padding='post')
    X_test = pad_sequences(secuencias_test, maxlen=max_longitud, padding='post')

    vocab_size = min(max_palabras, len(tokenizer.word_index) + 1)

    return {
        'X_train': X_train,
        'X_val': X_val,
        'X_test': X_test,
        'tokenizer': tokenizer,
        'vocab_size': vocab_size,
        'max_length': max_longitud
    }
Preprocessing Configuration
Vocabulary size: 10,000 words
OOV token: <OOV> for out-of-vocabulary words
Max sequence length: 95th percentile of training data (~50 tokens)
Padding: Post-padding with zeros
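To illustrate these settings, the following minimal sketch uses a hypothetical two-sentence corpus (not part of the Europarl data) to show how unseen words are mapped to the <OOV> token and how post-padding fills short sequences with zeros:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical toy corpus, only to demonstrate the configuration above
corpus = ["el parlamento aprueba la propuesta", "la propuesta es nueva"]
tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")
tokenizer.fit_on_texts(corpus)

# A word not seen during fitting ("rechaza") is mapped to the <OOV> index (1);
# the other indices depend on word frequency in the toy corpus
seqs = tokenizer.texts_to_sequences(["el parlamento rechaza la propuesta"])
print(seqs)

# Post-padding with zeros up to a fixed length (8 here, standing in for the ~50 used above)
print(pad_sequences(seqs, maxlen=8, padding='post'))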
LSTM Model
Long Short-Term Memory (LSTM) networks learn long-range dependencies in sequential data.
Architecture
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Adam
def crear_modelo_lstm(vocab_size, embedding_dim, max_length,
                      num_classes, dropout_rate=0.3):
    """
    Create LSTM model for language detection.

    Args:
        vocab_size: Vocabulary size (10,000)
        embedding_dim: Embedding dimension (100)
        max_length: Maximum sequence length (~50)
        num_classes: Number of languages (7)
        dropout_rate: Dropout rate for regularization (0.3)

    Returns:
        Compiled Keras model
    """
    model = Sequential()

    # Embedding layer
    model.add(Embedding(
        input_dim=vocab_size,
        output_dim=embedding_dim,
        input_length=max_length
    ))

    # LSTM layer
    model.add(LSTM(
        units=64,
        dropout=dropout_rate,
        recurrent_dropout=dropout_rate / 2,  # Lower recurrent dropout for stability
        return_sequences=False  # Return only the final state
    ))

    # Regularization
    model.add(Dropout(dropout_rate))

    # Output layer
    model.add(Dense(num_classes, activation='softmax'))

    # Compile
    model.compile(
        optimizer=Adam(learning_rate=0.001),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

    return model
Layer Configuration
| Layer | Type | Parameters | Output Shape |
| --- | --- | --- | --- |
| Embedding | Embedding | vocab_size=10,000, output_dim=100 | (batch, 50, 100) |
| LSTM | LSTM | units=64, dropout=0.3, recurrent_dropout=0.15 | (batch, 64) |
| Dropout | Dropout | rate=0.3 | (batch, 64) |
| Dense | Dense | units=7, activation='softmax' | (batch, 7) |
Total Parameters: ~700K trainable parameters
Training Configuration
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau
callbacks = [
    # Stop training when validation loss stops improving
    EarlyStopping(
        monitor='val_loss',
        patience=5,
        restore_best_weights=True,
        verbose=1
    ),
    # Reduce learning rate when validation loss plateaus
    ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=3,
        verbose=1,
        min_lr=0.00001
    ),
    # Save best model
    ModelCheckpoint(
        'modelos/modelo_lstm.keras',
        save_best_only=True,
        monitor='val_loss',
        verbose=1
    )
]

# Train
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=15,
    batch_size=32,
    callbacks=callbacks,
    verbose=1
)
Validation Set Results:
Overall Accuracy: 94.07%
Training Time: 15 epochs (~96 seconds per epoch)
Inference Time: ~15ms per sample
Model Size: 618 KB
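The per-sample inference time quoted above can be reproduced with a simple timing sketch like the one below. The helper name medir_inferencia is hypothetical, and the exact milliseconds depend on hardware and batch size:

import time
import numpy as np

def medir_inferencia(model, X, repeticiones=5):
    """Rough per-sample latency estimate: average wall-clock time of model.predict."""
    tiempos = []
    for _ in range(repeticiones):
        inicio = time.perf_counter()
        model.predict(X, verbose=0)
        tiempos.append(time.perf_counter() - inicio)
    return np.mean(tiempos) / len(X) * 1000  # milliseconds per sample

# Example usage with a slice of the padded validation data:
# print(f"~{medir_inferencia(model, X_val[:256]):.1f} ms per sample")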
Per-Language Performance:
| Language | Precision | Recall | F1-Score | Support |
| --- | --- | --- | --- | --- |
| Swedish (sv) | 0.9929 | 0.9543 | 0.9732 | 1,028 |
| Dutch (nl) | 0.9720 | 0.9666 | 0.9693 | 1,079 |
| Portuguese (pt) | 0.9347 | 0.9391 | 0.9369 | 1,051 |
| Italian (it) | 0.8518 | 0.9470 | 0.8969 | 1,056 |
| French (fr) | 0.9563 | 0.9119 | 0.9336 | 1,056 |
| German (de) | 0.9375 | 0.9649 | 0.9510 | 1,026 |
| Spanish (es) | 0.9538 | 0.9013 | 0.9268 | 1,054 |
Training Curves
Epoch-by-Epoch Results:
Epoch 1: Train acc: 63.54%, Val acc: 92.38%, Val loss: 0.2205
Epoch 2: Train acc: 91.48%, Val acc: 92.19%, Val loss: 0.2093 ✓ (best)
Epoch 15: Train acc: 94.03%, Val acc: 94.07%, Val loss: 0.1419 ✓ (final best)
No overfitting detected: the final training-validation accuracy gap is only -0.03% (validation accuracy slightly above training accuracy)
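The curves summarized above can be plotted directly from the History object returned by model.fit. A minimal sketch, assuming matplotlib is installed and the helper name graficar_historial is hypothetical:

import matplotlib.pyplot as plt

def graficar_historial(history):
    """Plot training/validation accuracy and loss from a Keras History object."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(history.history['accuracy'], label='train')
    ax1.plot(history.history['val_accuracy'], label='validation')
    ax1.set_title('Accuracy')
    ax1.legend()
    ax2.plot(history.history['loss'], label='train')
    ax2.plot(history.history['val_loss'], label='validation')
    ax2.set_title('Loss')
    ax2.legend()
    plt.tight_layout()
    plt.show()

# graficar_historial(history)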
Bidirectional LSTM (BiLSTM)
Bidirectional LSTM processes sequences in both forward and backward directions, capturing context from both sides.
Architecture
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Adam
def crear_modelo_lstm_bidireccional(vocab_size, embedding_dim, max_length,
                                    num_classes, dropout_rate=0.3):
    """
    Create Bidirectional LSTM model for language detection.

    Args:
        vocab_size: Vocabulary size (10,000)
        embedding_dim: Embedding dimension (100)
        max_length: Maximum sequence length (~50)
        num_classes: Number of languages (7)
        dropout_rate: Dropout rate for regularization (0.3)

    Returns:
        Compiled Keras model
    """
    model = Sequential()

    # Embedding layer
    model.add(Embedding(
        input_dim=vocab_size,
        output_dim=embedding_dim,
        input_length=max_length
    ))

    # Bidirectional LSTM layer
    model.add(Bidirectional(LSTM(
        units=32,  # Reduced because bidirectional doubles the output
        dropout=dropout_rate,
        recurrent_dropout=dropout_rate / 2
    )))

    # Regularization
    model.add(Dropout(dropout_rate))

    # Intermediate dense layer
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(dropout_rate / 2))

    # Output layer
    model.add(Dense(num_classes, activation='softmax'))

    # Compile
    model.compile(
        optimizer=Adam(learning_rate=0.001),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

    return model
Layer Configuration
| Layer | Type | Parameters | Output Shape |
| --- | --- | --- | --- |
| Embedding | Embedding | vocab_size=10,000, output_dim=100 | (batch, 50, 100) |
| Bidirectional(LSTM) | Bidirectional | units=32, dropout=0.3, recurrent_dropout=0.15 | (batch, 64) |
| Dropout | Dropout | rate=0.3 | (batch, 64) |
| Dense | Dense | units=64, activation='relu' | (batch, 64) |
| Dropout | Dropout | rate=0.15 | (batch, 64) |
| Dense | Dense | units=7, activation='softmax' | (batch, 7) |
Total Parameters: ~750K trainable parameters
The Bidirectional LSTM uses 32 units (vs 64 in regular LSTM) because the bidirectional wrapper concatenates forward and backward outputs, resulting in 64 total features.
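This doubling can be verified with a quick shape check on a standalone layer. The snippet below is a minimal sketch using dummy data, not part of the project code:

import numpy as np
from tensorflow.keras.layers import Bidirectional, LSTM

# A dummy batch of 2 sequences of length 50 with embedding dimension 100
dummy = np.zeros((2, 50, 100), dtype='float32')
capa = Bidirectional(LSTM(units=32))
print(capa(dummy).shape)  # (2, 64): forward (32) and backward (32) outputs concatenated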
Validation Set Results:
Overall Accuracy: 94.08%
Training Time: 15 epochs (~138 seconds per epoch)
Inference Time: ~30ms per sample (2x slower than LSTM)
Model Size: 618 KB
Per-Language Performance:
| Language | Precision | Recall | F1-Score | Support |
| --- | --- | --- | --- | --- |
| Swedish (sv) | 0.9939 | 0.9562 | 0.9747 | 1,028 |
| Dutch (nl) | 0.9596 | 0.9694 | 0.9645 | 1,079 |
| Portuguese (pt) | 0.9289 | 0.9448 | 0.9368 | 1,051 |
| Italian (it) | 0.8261 | 0.9536 | 0.8853 | 1,056 |
| French (fr) | 0.9744 | 0.9015 | 0.9365 | 1,056 |
| German (de) | 0.9650 | 0.9669 | 0.9659 | 1,026 |
| Spanish (es) | 0.9632 | 0.8937 | 0.9272 | 1,054 |
Training Curves
Epoch-by-Epoch Results:
Epoch 1: Train acc: 68.10%, Val acc: 92.72%, Val loss: 0.1913
Epoch 2: Train acc: 92.70%, Val acc: 93.06%, Val loss: 0.1739 ✓ (best)
Epoch 13: Train acc: 94.05%, Val acc: 94.15%, Val loss: 0.1445 ✓ (final best)
Epoch 15: Train acc: 94.16%, Val acc: 94.05%, Val loss: 0.1446
No overfitting detected: Training-validation gap of only 0.11%
LSTM vs BiLSTM Comparison
| Metric | LSTM | BiLSTM | Winner |
| --- | --- | --- | --- |
| Validation Accuracy | 94.07% | 94.08% | BiLSTM |
| Training Time/Epoch | 82s | 138s | LSTM |
| Inference Time | 15ms | 30ms | LSTM |
| Model Size | 618 KB | 618 KB | Tie |
| Convergence Speed | 15 epochs | 13 epochs | BiLSTM |
Per-Language Comparison
| Language | LSTM F1 | BiLSTM F1 | Better Model |
| --- | --- | --- | --- |
| Swedish (sv) | 0.9732 | 0.9747 | BiLSTM |
| Dutch (nl) | 0.9693 | 0.9645 | LSTM |
| Portuguese (pt) | 0.9369 | 0.9368 | LSTM |
| Italian (it) | 0.8969 | 0.8853 | LSTM |
| French (fr) | 0.9336 | 0.9365 | BiLSTM |
| German (de) | 0.9510 | 0.9659 | BiLSTM |
| Spanish (es) | 0.9268 | 0.9272 | BiLSTM |
Key Findings
BiLSTM Advantages
Slightly better overall accuracy (+0.01%)
Better performance on German, French, Spanish
Faster convergence (13 vs 15 epochs)
LSTM Advantages
2x faster inference (15ms vs 30ms)
40% faster training per epoch
Better on Dutch, Portuguese, Italian
Common Misclassifications
Typical error sources include out-of-vocabulary words, very short texts, and confusions between closely related languages. The example below illustrates the OOV case:
Text: "Temos de mudar resolutamente de rumo."
Tokens: "<OOV> de <OOV> <OOV> de <OOV>"
Actual: Portuguese (pt)
Predicted: Dutch (nl)
Reason: Most words are out-of-vocabulary
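Cases like this can be diagnosed with the fitted tokenizer returned by preprocesar_secuencias. The helper below is a hypothetical sketch for computing the fraction of tokens that fall back to <OOV>:

def tasa_oov(texto, tokenizer):
    """Return the fraction of tokens mapped to <OOV> for a single text."""
    seq = tokenizer.texts_to_sequences([texto])[0]
    oov_index = tokenizer.word_index[tokenizer.oov_token]
    return sum(1 for idx in seq if idx == oov_index) / max(len(seq), 1)

# Example: the Portuguese sentence above, where 4 of 6 tokens are out-of-vocabulary
# print(tasa_oov("Temos de mudar resolutamente de rumo.", data['tokenizer']))  # ~0.67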
When to Use Deep Learning
Deep learning models are appropriate when:
✅ You need embeddings for downstream tasks
✅ Text contains OOV words that traditional ML might miss
✅ You want to learn sequential patterns in language structure
✅ You have GPU resources for training
❌ Not recommended when:
You need maximum accuracy (use Naive Bayes instead)
You have limited computational resources
You need fast training and inference
For the Europarl corpus, traditional ML models outperform deep learning due to the structured, clean nature of the parliamentary text. Deep learning models achieve roughly 6 percentage points lower accuracy (94% vs 99.9%).
Complete Training Example
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

# Prepare data
data = preprocesar_secuencias(
    X_train, X_val, X_test,
    max_palabras=10000,
    max_longitud=None  # Auto-calculate
)

# Create model
model = crear_modelo_lstm_bidireccional(
    vocab_size=data['vocab_size'],
    embedding_dim=100,
    max_length=data['max_length'],
    num_classes=7,
    dropout_rate=0.3
)

# Define callbacks
callbacks = [
    EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True),
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, min_lr=1e-5),
    ModelCheckpoint('modelo_bilstm.keras', save_best_only=True)
]

# Train
history = model.fit(
    data['X_train'], y_train,
    validation_data=(data['X_val'], y_val),
    epochs=15,
    batch_size=32,
    callbacks=callbacks
)

# Evaluate
test_loss, test_acc = model.evaluate(data['X_test'], y_test)
print(f"Test accuracy: {test_acc:.4f}")
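After training, the fitted tokenizer and model can be reused to classify new text. The sketch below assumes a hypothetical helper predecir_idioma and a label list whose ordering must match the integer encoding used for y_train:

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical label order -- must match the encoding used when building y_train
idiomas = ['de', 'es', 'fr', 'it', 'nl', 'pt', 'sv']

def predecir_idioma(texto, model, tokenizer, max_length):
    """Tokenize, pad and classify a single text; return (language code, confidence)."""
    seq = tokenizer.texts_to_sequences([texto])
    X = pad_sequences(seq, maxlen=max_length, padding='post')
    probs = model.predict(X, verbose=0)[0]
    return idiomas[int(np.argmax(probs))], float(np.max(probs))

# print(predecir_idioma("La seduta è aperta.", model, data['tokenizer'], data['max_length']))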
Next Steps
Model Comparison: see detailed comparisons with traditional ML models
Training Guide: learn how to train models on your own data
Traditional ML: explore higher-accuracy traditional ML approaches
Using Models: start making predictions with trained models