Deep learning models learn sequential patterns in text through recurrent neural networks. While they achieve lower accuracy than traditional ML on this task (94% vs 99.9%), they provide valuable embeddings and handle out-of-vocabulary words better.

Data Preprocessing

Deep learning models require tokenization and padding instead of TF-IDF vectorization:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

def preprocesar_secuencias(textos_train, textos_val, textos_test, 
                          max_palabras=10000, max_longitud=None):
    """
    Tokenize and pad sequences for recurrent neural networks.
    
    Args:
        textos_train: Training texts
        textos_val: Validation texts
        textos_test: Test texts
        max_palabras: Maximum vocabulary size (default: 10,000)
        max_longitud: Maximum sequence length (auto-calculated if None)
    
    Returns:
        Dictionary with processed data and tokenizer
    """
    # Initialize tokenizer
    tokenizer = Tokenizer(num_words=max_palabras, oov_token="<OOV>")
    tokenizer.fit_on_texts(textos_train)
    
    # Convert texts to sequences
    secuencias_train = tokenizer.texts_to_sequences(textos_train)
    secuencias_val = tokenizer.texts_to_sequences(textos_val)
    secuencias_test = tokenizer.texts_to_sequences(textos_test)
    
    # Determine max length (95th percentile to avoid outliers)
    if max_longitud is None:
        longitudes = [len(seq) for seq in secuencias_train]
        max_longitud = int(np.percentile(longitudes, 95))
    
    # Apply padding
    X_train = pad_sequences(secuencias_train, maxlen=max_longitud, padding='post')
    X_val = pad_sequences(secuencias_val, maxlen=max_longitud, padding='post')
    X_test = pad_sequences(secuencias_test, maxlen=max_longitud, padding='post')
    
    vocab_size = min(max_palabras, len(tokenizer.word_index) + 1)
    
    return {
        'X_train': X_train,
        'X_val': X_val,
        'X_test': X_test,
        'tokenizer': tokenizer,
        'vocab_size': vocab_size,
        'max_length': max_longitud
    }

Preprocessing Configuration

  • Vocabulary size: 10,000 words
  • OOV token: <OOV> for out-of-vocabulary words
  • Max sequence length: 95th percentile of training data (~50 tokens)
  • Padding: Post-padding with zeros
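
A minimal usage sketch of preprocesar_secuencias (the sample sentences below are hypothetical placeholders for the real Europarl splits):

# Hypothetical sample data; in practice these are the Europarl train/val/test splits
textos_train = ["el parlamento aprueba la resolución",
                "the parliament approves the resolution"]
textos_val = ["la comisión presenta el informe"]
textos_test = ["het parlement keurt de resolutie goed"]

data = preprocesar_secuencias(textos_train, textos_val, textos_test,
                              max_palabras=10000)

print(data['vocab_size'])     # vocabulary size actually used
print(data['max_length'])     # 95th-percentile sequence length
print(data['X_train'].shape)  # (num_train_samples, max_length)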

LSTM Model

Long Short-Term Memory (LSTM) networks learn long-range dependencies in sequential data.

Architecture

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Adam

def crear_modelo_lstm(vocab_size, embedding_dim, max_length, 
                      num_classes, dropout_rate=0.3):
    """
    Create LSTM model for language detection.
    
    Args:
        vocab_size: Vocabulary size (10,000)
        embedding_dim: Embedding dimension (100)
        max_length: Maximum sequence length (~50)
        num_classes: Number of languages (7)
        dropout_rate: Dropout rate for regularization (0.3)
    
    Returns:
        Compiled Keras model
    """
    model = Sequential()
    
    # Embedding layer
    model.add(Embedding(
        input_dim=vocab_size,
        output_dim=embedding_dim,
        input_length=max_length
    ))
    
    # LSTM layer
    model.add(LSTM(
        units=64,
        dropout=dropout_rate,
        recurrent_dropout=dropout_rate/2,  # Lower recurrent dropout for stability
        return_sequences=False  # Return only final state
    ))
    
    # Regularization
    model.add(Dropout(dropout_rate))
    
    # Output layer
    model.add(Dense(num_classes, activation='softmax'))
    
    # Compile
    model.compile(
        optimizer=Adam(learning_rate=0.001),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    return model
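
A quick instantiation sketch using the documented defaults; model.summary() prints the same layer-by-layer configuration shown in the table below:

# Build the LSTM model with the values documented above
model = crear_modelo_lstm(
    vocab_size=10000,
    embedding_dim=100,
    max_length=50,
    num_classes=7,
    dropout_rate=0.3
)
model.summary()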

Layer Configuration

| Layer | Type | Parameters | Output Shape |
| --- | --- | --- | --- |
| Embedding | Embedding | vocab_size=10,000, output_dim=100 | (batch, 50, 100) |
| LSTM | LSTM | units=64, dropout=0.3, recurrent_dropout=0.15 | (batch, 64) |
| Dropout | Dropout | rate=0.3 | (batch, 64) |
| Dense | Dense | units=7, activation='softmax' | (batch, 7) |
Total trainable parameters: ~700K

Training Configuration

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

callbacks = [
    # Stop training when validation loss stops improving
    EarlyStopping(
        monitor='val_loss',
        patience=5,
        restore_best_weights=True,
        verbose=1
    ),
    
    # Reduce learning rate when validation loss plateaus
    ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=3,
        verbose=1,
        min_lr=0.00001
    ),
    
    # Save best model
    ModelCheckpoint(
        'modelos/modelo_lstm.keras',
        save_best_only=True,
        monitor='val_loss',
        verbose=1
    )
]

# Train
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=15,
    batch_size=32,
    callbacks=callbacks,
    verbose=1
)
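
The sparse_categorical_crossentropy loss used above expects y_train and y_val as integer class indices. A minimal sketch of preparing them with scikit-learn's LabelEncoder (the raw label list here is hypothetical):

from sklearn.preprocessing import LabelEncoder

# Hypothetical raw labels: one ISO language code per sample
labels_train = ['es', 'fr', 'de', 'es', 'it']

encoder = LabelEncoder()
y_train = encoder.fit_transform(labels_train)  # e.g. array([1, 2, 0, 1, 3])
# Reuse the fitted encoder for validation/test labels:
# y_val = encoder.transform(labels_val)
print(dict(zip(encoder.classes_, encoder.transform(encoder.classes_))))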

Performance

Validation Set Results:
  • Overall Accuracy: 94.07%
  • Training Time: 15 epochs (~96 seconds per epoch)
  • Inference Time: ~15ms per sample
  • Model Size: 618 KB
Per-Language Performance:
| Language | Precision | Recall | F1-Score | Support |
| --- | --- | --- | --- | --- |
| Swedish (sv) | 0.9929 | 0.9543 | 0.9732 | 1,028 |
| Dutch (nl) | 0.9720 | 0.9666 | 0.9693 | 1,079 |
| Portuguese (pt) | 0.9347 | 0.9391 | 0.9369 | 1,051 |
| Italian (it) | 0.8518 | 0.9470 | 0.8969 | 1,056 |
| French (fr) | 0.9563 | 0.9119 | 0.9336 | 1,056 |
| German (de) | 0.9375 | 0.9649 | 0.9510 | 1,026 |
| Spanish (es) | 0.9538 | 0.9013 | 0.9268 | 1,054 |

Training Curves

Epoch-by-Epoch Results:
  • Epoch 1: Train acc: 63.54%, Val acc: 92.38%, Val loss: 0.2205
  • Epoch 2: Train acc: 91.48%, Val acc: 92.19%, Val loss: 0.2093 ✓ (best)
  • Epoch 15: Train acc: 94.03%, Val acc: 94.07%, Val loss: 0.1419 ✓ (final best)
No overfitting detected: the training-validation accuracy gap is only -0.03% (validation accuracy slightly exceeds training accuracy)
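
A minimal sketch for plotting these curves from the history object returned by model.fit (assumes matplotlib is installed):

import matplotlib.pyplot as plt

# history.history stores the per-epoch metrics recorded during training
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(history.history['accuracy'], label='train')
ax1.plot(history.history['val_accuracy'], label='validation')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Accuracy')
ax1.legend()

ax2.plot(history.history['loss'], label='train')
ax2.plot(history.history['val_loss'], label='validation')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Loss')
ax2.legend()

plt.tight_layout()
plt.show()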

Bidirectional LSTM (BiLSTM)

Bidirectional LSTM processes sequences in both forward and backward directions, capturing context from both sides.

Architecture

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Adam

def crear_modelo_lstm_bidireccional(vocab_size, embedding_dim, max_length,
                                    num_classes, dropout_rate=0.3):
    """
    Create Bidirectional LSTM model for language detection.
    
    Args:
        vocab_size: Vocabulary size (10,000)
        embedding_dim: Embedding dimension (100)
        max_length: Maximum sequence length (~50)
        num_classes: Number of languages (7)
        dropout_rate: Dropout rate for regularization (0.3)
    
    Returns:
        Compiled Keras model
    """
    model = Sequential()
    
    # Embedding layer
    model.add(Embedding(
        input_dim=vocab_size,
        output_dim=embedding_dim,
        input_length=max_length
    ))
    
    # Bidirectional LSTM layer
    model.add(Bidirectional(LSTM(
        units=32,  # Reduced because bidirectional doubles the output
        dropout=dropout_rate,
        recurrent_dropout=dropout_rate/2
    )))
    
    # Regularization
    model.add(Dropout(dropout_rate))
    
    # Intermediate dense layer
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(dropout_rate/2))
    
    # Output layer
    model.add(Dense(num_classes, activation='softmax'))
    
    # Compile
    model.compile(
        optimizer=Adam(learning_rate=0.001),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    return model

Layer Configuration

| Layer | Type | Parameters | Output Shape |
| --- | --- | --- | --- |
| Embedding | Embedding | vocab_size=10,000, output_dim=100 | (batch, 50, 100) |
| Bidirectional(LSTM) | Bidirectional | units=32, dropout=0.3, recurrent_dropout=0.15 | (batch, 64) |
| Dropout | Dropout | rate=0.3 | (batch, 64) |
| Dense | Dense | units=64, activation='relu' | (batch, 64) |
| Dropout | Dropout | rate=0.15 | (batch, 64) |
| Dense | Dense | units=7, activation='softmax' | (batch, 7) |
Total trainable parameters: ~750K
The Bidirectional LSTM uses 32 units (vs 64 in regular LSTM) because the bidirectional wrapper concatenates forward and backward outputs, resulting in 64 total features.
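
A small standalone sketch confirming that concatenation behavior (shapes only, untrained weights):

import tensorflow as tf
from tensorflow.keras.layers import Bidirectional, LSTM

# Forward and backward passes each produce 32 features;
# the default merge_mode='concat' joins them into 64
layer = Bidirectional(LSTM(units=32))
dummy = tf.random.uniform((8, 50, 100))  # (batch, timesteps, embedding_dim)
print(layer(dummy).shape)  # (8, 64)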

Performance

Validation Set Results:
  • Overall Accuracy: 94.08%
  • Training Time: 15 epochs (~138 seconds per epoch)
  • Inference Time: ~30ms per sample (2x slower than LSTM)
  • Model Size: 618 KB
Per-Language Performance:
| Language | Precision | Recall | F1-Score | Support |
| --- | --- | --- | --- | --- |
| Swedish (sv) | 0.9939 | 0.9562 | 0.9747 | 1,028 |
| Dutch (nl) | 0.9596 | 0.9694 | 0.9645 | 1,079 |
| Portuguese (pt) | 0.9289 | 0.9448 | 0.9368 | 1,051 |
| Italian (it) | 0.8261 | 0.9536 | 0.8853 | 1,056 |
| French (fr) | 0.9744 | 0.9015 | 0.9365 | 1,056 |
| German (de) | 0.9650 | 0.9669 | 0.9659 | 1,026 |
| Spanish (es) | 0.9632 | 0.8937 | 0.9272 | 1,054 |

Training Curves

Epoch-by-Epoch Results:
  • Epoch 1: Train acc: 68.10%, Val acc: 92.72%, Val loss: 0.1913
  • Epoch 2: Train acc: 92.70%, Val acc: 93.06%, Val loss: 0.1739 ✓ (best)
  • Epoch 13: Train acc: 94.05%, Val acc: 94.15%, Val loss: 0.1445 ✓ (final best)
  • Epoch 15: Train acc: 94.16%, Val acc: 94.05%, Val loss: 0.1446
No overfitting detected: Training-validation gap of only 0.11%

LSTM vs BiLSTM Comparison

Overall Performance

| Metric | LSTM | BiLSTM | Winner |
| --- | --- | --- | --- |
| Validation Accuracy | 94.07% | 94.08% | BiLSTM |
| Training Time/Epoch | 82s | 138s | LSTM |
| Inference Time | 15ms | 30ms | LSTM |
| Model Size | 618 KB | 618 KB | Tie |
| Convergence Speed | 15 epochs | 13 epochs | BiLSTM |

Per-Language Comparison

| Language | LSTM F1 | BiLSTM F1 | Better Model |
| --- | --- | --- | --- |
| Swedish (sv) | 0.9732 | 0.9747 | BiLSTM |
| Dutch (nl) | 0.9693 | 0.9645 | LSTM |
| Portuguese (pt) | 0.9369 | 0.9368 | LSTM |
| Italian (it) | 0.8969 | 0.8853 | LSTM |
| French (fr) | 0.9336 | 0.9365 | BiLSTM |
| German (de) | 0.9510 | 0.9659 | BiLSTM |
| Spanish (es) | 0.9268 | 0.9272 | BiLSTM |

Key Findings

BiLSTM Advantages

  • Slightly better overall accuracy (+0.01%)
  • Better performance on German, French, Spanish
  • Faster convergence (13 vs 15 epochs)

LSTM Advantages

  • 2x faster inference (15ms vs 30ms)
  • 40% faster training per epoch
  • Better on Dutch, Portuguese, Italian

Common Misclassifications

Text: "Temos de mudar resolutamente de rumo."
Tokens: "<OOV> de <OOV> <OOV> de <OOV>"
Actual: Portuguese (pt)
Predicted: Dutch (nl)
Reason: Most words are out-of-vocabulary
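
To diagnose cases like this, the tokenizer returned by preprocesar_secuencias can be inspected directly. A sketch using a hypothetical helper (tasa_oov is illustrative, not part of the project code):

def tasa_oov(texto, tokenizer):
    """Return the token strings and OOV fraction for one text."""
    seq = tokenizer.texts_to_sequences([texto])[0]
    oov_id = tokenizer.word_index.get('<OOV>')
    tokens = [tokenizer.index_word.get(i, '?') for i in seq]
    oov = sum(1 for i in seq if i == oov_id)
    return tokens, oov / max(len(seq), 1)

tokens, ratio = tasa_oov("Temos de mudar resolutamente de rumo.",
                         data['tokenizer'])
print(tokens, f"{ratio:.0%} OOV")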

When to Use Deep Learning

Deep learning models are appropriate when:
  • You need embeddings for downstream tasks
  • Text contains OOV words that traditional ML might miss
  • You want to learn sequential patterns in language structure
  • You have GPU resources for training
Not recommended when:
  • You need maximum accuracy (use Naive Bayes instead)
  • You have limited computational resources
  • You need fast training and inference
For the Europarl corpus, traditional ML models outperform deep learning due to the structured, clean nature of the parliamentary text: the deep learning models achieve roughly six percentage points lower accuracy (94.1% vs 99.9%).

Complete Training Example

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

# Prepare data
data = preprocesar_secuencias(
    X_train, X_val, X_test,
    max_palabras=10000,
    max_longitud=None  # Auto-calculate
)

# Create model
model = crear_modelo_lstm_bidireccional(
    vocab_size=data['vocab_size'],
    embedding_dim=100,
    max_length=data['max_length'],
    num_classes=7,
    dropout_rate=0.3
)

# Define callbacks
callbacks = [
    EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True),
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, min_lr=1e-5),
    ModelCheckpoint('modelo_bilstm.keras', save_best_only=True)
]

# Train
history = model.fit(
    data['X_train'], y_train,
    validation_data=(data['X_val'], y_val),
    epochs=15,
    batch_size=32,
    callbacks=callbacks
)

# Evaluate
test_loss, test_acc = model.evaluate(data['X_test'], y_test)
print(f"Test accuracy: {test_acc:.4f}")

Next Steps

Model Comparison

See detailed comparisons with traditional ML models

Training Guide

Learn how to train models on your own data

Traditional ML

Explore higher-accuracy traditional ML approaches

Using Models

Start making predictions with trained models
