Overview
The preprocessing module provides functions to clean and normalize text data before vectorization. Standardizing input text in this way helps improve model accuracy.
Functions
preprocesar_texto
Applies preprocessing to a single text string.
Parameters: the text string to preprocess
Returns: the preprocessed text string
Processing Steps
The function applies the following transformations:
- Lowercase conversion - Converts all characters to lowercase
- Number removal - Removes all numeric digits using the r'\d+' pattern
- Whitespace normalization - Removes extra whitespace and strips leading/trailing spaces
Example
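A minimal sketch of how the three steps above could be implemented and called. The function body is an assumption reconstructed from the description, not the module's actual source. The parameter name `texto` and the `\s+` whitespace pattern are hypothetical; only the `r'\d+'` pattern comes from the documentation.

```python
import re

def preprocesar_texto(texto: str) -> str:
    """Sketch of the documented steps (assumed, not the module's source)."""
    texto = texto.lower()                  # 1. lowercase conversion
    texto = re.sub(r'\d+', '', texto)      # 2. number removal (documented pattern)
    texto = re.sub(r'\s+', ' ', texto)     # 3a. collapse whitespace runs (assumed pattern)
    return texto.strip()                   # 3b. strip leading/trailing spaces

print(preprocesar_texto("  Hola,   tengo 25 AÑOS y 3 gatos.  "))
# → hola, tengo años y gatos.
```

Punctuation and accented characters pass through unchanged; only case, digits, and spacing are affected.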
preprocesa_dataset
Applies preprocessing to an entire dataset stored in a pandas DataFrame.
Parameters: a DataFrame containing texts in a column named 'texto'
Returns: a copy of the input DataFrame with an additional column 'texto_procesado' containing the preprocessed texts
Behavior
- Creates a copy of the input DataFrame (does not modify the original)
- Applies preprocesar_texto() to each text in the 'texto' column
- Stores results in a new column called 'texto_procesado'
- Preserves all original columns
Example
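The behavior listed above can be sketched as follows. Both function bodies are assumptions written to match the description, not the module's actual source. The parameter name `df`, the `idioma` column, and the sample rows are hypothetical.

```python
import re
import pandas as pd

def preprocesar_texto(texto: str) -> str:
    # Assumed implementation of the documented steps
    texto = re.sub(r'\d+', '', texto.lower())
    return re.sub(r'\s+', ' ', texto).strip()

def preprocesa_dataset(df: pd.DataFrame) -> pd.DataFrame:
    # Work on a copy so the caller's DataFrame is left untouched
    df_procesado = df.copy()
    df_procesado['texto_procesado'] = df_procesado['texto'].apply(preprocesar_texto)
    return df_procesado

# Toy data for illustration only
df = pd.DataFrame({
    'texto': ['Hello   World 2024', '  Bonjour LE monde 42 '],
    'idioma': ['en', 'fr'],
})
resultado = preprocesa_dataset(df)

print(list(resultado.columns))                 # original columns plus 'texto_procesado'
print(resultado['texto_procesado'].tolist())   # cleaned texts
print('texto_procesado' in df.columns)         # False: the input was not modified
```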
Best Practices
When to Preprocess
Preprocessing is recommended for:
- Character-based models - Reduces vocabulary size and improves n-gram matching
- Word-based models - Standardizes word forms and reduces sparsity
- Memory-constrained environments - Reduces feature space dimensionality
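To illustrate the vocabulary-size point, the sketch below counts distinct character bigrams in a toy corpus before and after preprocessing. The `preprocesar_texto` body is an assumed reimplementation of the documented steps, and `char_ngrams` and the corpus are hypothetical helpers for this illustration.

```python
import re

def preprocesar_texto(texto: str) -> str:
    # Assumed implementation of the documented steps
    texto = re.sub(r'\d+', '', texto.lower())
    return re.sub(r'\s+', ' ', texto).strip()

def char_ngrams(texto: str, n: int = 2) -> set:
    # Set of distinct character n-grams in a string
    return {texto[i:i + n] for i in range(len(texto) - n + 1)}

# Three surface variants of the same sentence
corpus = ["The Cat sat", "THE cat   SAT 99", "the CAT sat 100"]

crudo = set().union(*(char_ngrams(t) for t in corpus))
limpio = set().union(*(char_ngrams(preprocesar_texto(t)) for t in corpus))

print(len(crudo), len(limpio))  # the preprocessed vocabulary is smaller
```

All three variants collapse to the same string after preprocessing, so they share a single set of bigrams instead of contributing case, digit, and spacing variants as separate features.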
Preprocessing Considerations
For language detection tasks:
- Keep punctuation - Some languages have distinctive punctuation patterns
- Remove numbers - Numbers are language-agnostic and add noise
- Lowercase only - Case information is generally not distinctive for language detection
- Preserve special characters - Accented characters (á, é, í, ó, ú, ü, ñ) are important language markers
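A short check of these considerations, assuming `preprocesar_texto` does only the three documented steps (the body below is a reconstruction, not the module's source). Punctuation and accented characters should survive, while digits and case are removed.

```python
import re

def preprocesar_texto(texto: str) -> str:
    # Assumed implementation of the documented steps
    texto = re.sub(r'\d+', '', texto.lower())
    return re.sub(r'\s+', ' ', texto).strip()

texto = "¿Dónde ESTÁ el niño? ¡A 5 km, quizás!"
resultado = preprocesar_texto(texto)

print(resultado)
# → ¿dónde está el niño? ¡a km, quizás!
```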
Example: Complete Preprocessing Pipeline
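An end-to-end sketch under stated assumptions: the two function bodies are reconstructions of the documented behavior, the `idioma` label column and sample rows are hypothetical, and the character-trigram count is only a stand-in for the vectorization step, not the Vectorization API itself.

```python
import re
from collections import Counter
import pandas as pd

def preprocesar_texto(texto: str) -> str:
    # Assumed implementation of the documented steps
    texto = re.sub(r'\d+', '', texto.lower())
    return re.sub(r'\s+', ' ', texto).strip()

def preprocesa_dataset(df: pd.DataFrame) -> pd.DataFrame:
    # Copy, then add the 'texto_procesado' column
    salida = df.copy()
    salida['texto_procesado'] = salida['texto'].apply(preprocesar_texto)
    return salida

# 1. Raw labelled data (toy values for illustration)
datos = pd.DataFrame({
    'texto': ['El  Niño tiene 7 años', 'The child is 7 years old', '12345'],
    'idioma': ['es', 'en', 'en'],
})

# 2. Preprocess the whole dataset
procesado = preprocesa_dataset(datos)

# 3. Drop rows whose text became empty (e.g. digits only)
procesado = procesado[procesado['texto_procesado'] != ''].reset_index(drop=True)

# 4. Stand-in for vectorization: character trigram counts per text
def trigramas(texto: str) -> Counter:
    return Counter(texto[i:i + 3] for i in range(len(texto) - 2))

vectores = [trigramas(t) for t in procesado['texto_procesado']]

print(procesado[['idioma', 'texto_procesado']])
print(vectores[0].most_common(3))
```

Filtering out empty results in step 3 is a design choice for this sketch: a text made up only of digits carries no signal after number removal.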
Related Documentation
- Vectorization API - Convert preprocessed text to numerical features
- Models API - Train classifiers on vectorized data
- Training Guide - Build end-to-end language detection systems