Preprocessing transforms raw text into a clean, normalized format suitable for machine learning. For language detection, preprocessing decisions significantly impact model performance.
Unlike many NLP tasks, minimal preprocessing often works best for language identification. Language-specific characters, capitalization patterns, and even stopwords contain valuable discriminative information.
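To make this concrete: even a handful of language-specific characters already carries discriminative signal. A minimal sketch of this idea (the character sets below are illustrative, not exhaustive, and the `char_hints` helper is hypothetical):

```python
# Sketch: language-specific characters as crude discriminative features.
# These character sets are illustrative only, not complete inventories.
LANG_CHARS = {
    "es": set("ñ¿¡"),
    "de": set("ßäöü"),
    "fr": set("çàâêîôûëï"),
}

def char_hints(text: str) -> dict:
    """Count occurrences of characters from each language-specific set."""
    text = text.lower()
    return {lang: sum(1 for ch in text if ch in chars)
            for lang, chars in LANG_CHARS.items()}

print(char_hints("Straße"))        # 'ß' points toward German
print(char_hints("¿Cómo estás?"))  # '¿' points toward Spanish
```

A real model learns these associations from character n-grams automatically, which is exactly why preprocessing should not strip them away.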
Minimal preprocessing that preserves language-specific features:
```python
import re

def preprocess_text(text):
    """
    Basic preprocessing for language detection.
    Preserves accents and language-specific characters.
    """
    # Convert to lowercase
    text = text.lower()
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text)
    # Strip leading/trailing whitespace
    text = text.strip()
    return text

# Example usage
original = "Reanudación del período de sesiones"
processed = preprocess_text(original)
print(processed)  # "reanudación del período de sesiones"
```
Keeping accented characters (é, ñ, ö, etc.) is crucial for language detection as they are strong linguistic markers.
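To see what would be lost, here is an accent-stripping helper for comparison only; the hypothetical `strip_accents` is deliberately *not* part of the recommended pipeline:

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Remove diacritics via NFD decomposition (for comparison only)."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base characters
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("reanudación del período"))  # "reanudacion del periodo"
```

After stripping, "reanudación" is indistinguishable from a hypothetical unaccented spelling, so the Spanish-specific signal in "ó" is gone.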
For experimental comparison, you might test different preprocessing strategies:
```python
import re
import string

def advanced_preprocess(
    text: str,
    remove_punctuation: bool = False,
    remove_numbers: bool = False,
    min_word_length: int = 1
) -> str:
    """
    Advanced preprocessing with configurable options.

    Args:
        text: Input text
        remove_punctuation: Whether to remove punctuation marks
        remove_numbers: Whether to remove numeric characters
        min_word_length: Minimum word length to keep

    Returns:
        Preprocessed text
    """
    # Lowercase
    text = text.lower()

    # Remove punctuation (optional); string.punctuation is ASCII-only,
    # so accented characters are never stripped
    if remove_punctuation:
        text = ''.join(char for char in text
                       if char not in string.punctuation)

    # Remove numbers (optional)
    if remove_numbers:
        text = re.sub(r'\d+', '', text)

    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # Filter by word length (optional)
    if min_word_length > 1:
        words = text.split()
        words = [w for w in words if len(w) >= min_word_length]
        text = ' '.join(words)

    return text
```
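To compare configurations side by side, you might sweep the flags over a sample sentence. A standalone sketch, where the compact `pipeline` helper is a stand-in for the configurable preprocessor above:

```python
import re
import string
from itertools import product

SAMPLE = "El Parlamento, reunido el 2 de mayo, aprobó 3 enmiendas."

def pipeline(text: str, remove_punct: bool, remove_nums: bool) -> str:
    # Compact stand-in for the configurable preprocessor.
    text = text.lower()
    if remove_punct:
        text = "".join(ch for ch in text if ch not in string.punctuation)
    if remove_nums:
        text = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", text).strip()

# Print the output of every flag combination for visual comparison
for rp, rn in product([False, True], repeat=2):
    print(f"punct={rp!s:5} nums={rn!s:5} -> {pipeline(SAMPLE, rp, rn)}")
```

In a real experiment, each configuration would feed a separate training run so the effect on detection accuracy can be measured directly.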
Stopword removal requires careful consideration for language detection:
```python
import nltk
from nltk.corpus import stopwords

# Download stopwords (run once)
nltk.download('stopwords')

def remove_stopwords_multilang(text: str, language: str = None) -> str:
    """
    Remove stopwords if language is known.

    Note: For language DETECTION, we typically DON'T remove stopwords
    during inference since the language is unknown. This function is
    mainly for experimental analysis.
    """
    if language is None:
        return text
    try:
        stop_words = set(stopwords.words(language))
        words = text.split()
        filtered_words = [w for w in words if w not in stop_words]
        return ' '.join(filtered_words)
    except (LookupError, OSError):
        # Language not supported or stopword list unavailable
        return text

# Example: Analyzing stopword impact
text_es = "el parlamento de europa tiene una sesión"
text_nl = "het europees parlement heeft een sessie"

print("Spanish without stopwords:", remove_stopwords_multilang(text_es, 'spanish'))
# Output: "parlamento europa sesión"
print("Dutch without stopwords:", remove_stopwords_multilang(text_nl, 'dutch'))
# Output: "europees parlement sessie"
```
For language detection, removing stopwords is generally not recommended. Articles like Spanish "el"/"la", German "der"/"die", and French "le"/"la" are among the strongest indicators of language, so removing them reduces model accuracy.
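In fact, function words alone can serve as a crude language identifier. A toy sketch of this idea, with illustrative (not complete) stopword lists and a hypothetical `guess_language` helper:

```python
# Naive sketch: function-word overlap as a language signal.
# These tiny stopword lists are illustrative, not complete.
STOPWORDS = {
    "es": {"el", "la", "de", "que", "y", "en", "una"},
    "en": {"the", "of", "and", "a", "in", "to", "has"},
    "de": {"der", "die", "das", "und", "in", "zu", "hat"},
}

def guess_language(text: str) -> str:
    """Return the language whose stopword list overlaps the text most."""
    words = set(text.lower().split())
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))

print(guess_language("el parlamento de europa tiene una sesión"))  # es
print(guess_language("the european parliament has a session"))     # en
```

If these words were stripped during preprocessing, exactly this signal would be thrown away before the model ever saw it.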
```python
import pandas as pd

def validate_preprocessing(df: pd.DataFrame) -> None:
    """
    Perform quality checks on preprocessed data.
    """
    print("Preprocessing Quality Report")
    print("=" * 50)

    # Check for empty texts
    empty_count = (df['texto_clean'].str.strip() == '').sum()
    print(f"Empty texts: {empty_count}")

    # Check text length distribution
    df['text_length'] = df['texto_clean'].str.split().str.len()
    print("\nText length statistics (words):")
    print(df['text_length'].describe())

    # Check for potential issues
    very_short = (df['text_length'] < 3).sum()
    print(f"\nVery short texts (<3 words): {very_short}")

    # Sample preprocessed texts per language
    print("\nSample preprocessed texts:")
    for lang in df['idioma'].unique():
        sample = df[df['idioma'] == lang]['texto_clean'].iloc[0]
        print(f"{lang}: {sample[:80]}...")

validate_preprocessing(df)
```
Example output:
```
Preprocessing Quality Report
==================================================
Empty texts: 0

Text length statistics (words):
count    49000.00
mean        15.23
std          8.45
min          1.00
25%          9.00
50%         14.00
75%         20.00
max         87.00

Very short texts (<3 words): 234

Sample preprocessed texts:
es: reanudación del período de sesiones...
de: wiederaufnahme der sitzungsperiode...
fr: reprise de la session...
```