Overview
Both customer and sales normalization modules implement robust encoding detection to handle CSV files saved in different character encodings. This is critical for processing data from various sources, especially when dealing with Spanish-language characters and special symbols.

Supported Encodings
The system attempts to read files using the following encodings, in order:
- UTF-8 - Modern Unicode encoding (preferred)
- Latin-1 (ISO-8859-1) - Western European encoding
- CP1252 - Windows Western European encoding
- ISO-8859-1 - Alias of Latin-1, kept as an explicit final fallback
Encoding Detection Strategy
Both normalizar_datos.py and normalizar_ventas.py use the same encoding detection approach:
How It Works
- Try Each Encoding: The system iterates through the encoding list
- Catch Decode Errors: If an encoding fails, it moves to the next one
- First Success Wins: The first successful encoding is used
- Report to User: The detected encoding is printed to console
- Fail Gracefully: If all encodings fail, a clear error message is shown
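The steps above can be sketched as a small loop. This is an illustrative reconstruction, not the exact module source; the function name `leer_archivo` and the console messages are assumptions:

```python
ENCODINGS = ["utf-8", "latin-1", "cp1252", "iso-8859-1"]

def leer_archivo(ruta):
    """Try each encoding in order; return (text, encoding) on first success."""
    for encoding in ENCODINGS:
        try:
            with open(ruta, encoding=encoding) as f:
                contenido = f.read()
            # Report the detected encoding to the user
            print(f"Codificación detectada: {encoding}")
            return contenido, encoding
        except UnicodeDecodeError:
            continue  # decode failed, move on to the next encoding
    # Fail gracefully with a clear error message
    raise ValueError("No se pudo leer el archivo con ninguna codificación")
```

Note that latin-1 maps every possible byte value to a character, so in practice the loop rarely reaches cp1252 or iso-8859-1.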
Why Multiple Encodings?
Common Scenarios
- UTF-8 Files: Modern exports from databases and web applications
- Latin-1/CP1252: Excel exports on Windows systems
- ISO-8859-1: Older Unix/Linux systems
Special Characters
Spanish-language data often contains special characters that require proper encoding:
- Accented vowels: á, é, í, ó, ú
- Ñ/ñ character
- Opening question/exclamation marks: ¿, ¡
- Currency symbols: $, €
With the wrong encoding, these characters turn into mojibake:
- “Bogotá” might display as “BogotÃ¡”
- “Teléfono” might display as “TelÃ©fono”
- “Medellín” might display as “MedellÃn”
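This kind of mojibake is exactly what happens when UTF-8 bytes are decoded as Latin-1, which can be reproduced directly:

```python
# UTF-8 encodes "á" as the two bytes 0xC3 0xA1; decoding those bytes
# as Latin-1 yields the two characters "Ã" and "¡".
texto = "Bogotá"
mojibake = texto.encode("utf-8").decode("latin-1")
print(mojibake)  # BogotÃ¡

# The damage is reversible as long as no characters were dropped:
print(mojibake.encode("latin-1").decode("utf-8"))  # Bogotá
```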
Output Encoding
All output files are written using UTF-8 with BOM encoding (utf-8-sig).

Why UTF-8-SIG?
- UTF-8-SIG: Adds a Byte Order Mark (BOM) to the file
- Excel Compatibility: Excel needs the BOM to correctly identify UTF-8 files
- Special Characters: Ensures Spanish characters display correctly
- Universal Standard: UTF-8 is the modern web and database standard
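Writing with utf-8-sig can be sketched as follows; the file name and rows are illustrative:

```python
import csv

filas = [["cliente", "ciudad"], ["Juan Pérez", "Bogotá"]]

# "utf-8-sig" prepends the BOM bytes EF BB BF, which lets Excel
# identify the file as UTF-8 instead of guessing a legacy encoding.
with open("clientes_normalizados.csv", "w",
          encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(filas)
```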
Data Cleaning Functions
The ETL system includes three core data cleaning functions used by both normalization modules.

limpiar_texto()
Cleans and normalizes text fields by removing extra whitespace:
- Returns None for empty or null input
- Strips leading/trailing whitespace
- Replaces multiple spaces with a single space
- Returns None if the result is an empty string
| Input | Output |
|---|---|
| " Empresa ABC " | "Empresa ABC" |
| "Juan Pérez" | "Juan Pérez" |
| " " | None |
| "" | None |
| None | None |
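A minimal reconstruction of limpiar_texto() consistent with the rules and table above (the actual module code may differ):

```python
import re

def limpiar_texto(valor):
    """Collapse whitespace; return None for empty or null input."""
    if valor is None:
        return None
    # Replace any run of whitespace with a single space, then trim
    texto = re.sub(r"\s+", " ", str(valor)).strip()
    return texto if texto else None
```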
normalizar_fecha()
Converts various date formats to ISO 8601 format (YYYY-MM-DD):

| Format | Example | Region |
|---|---|---|
| %Y-%m-%d | 2023-03-15 | ISO 8601 (standard) |
| %Y/%m/%d | 2023/03/15 | ISO variant |
| %d/%m/%Y | 15/03/2023 | European/Latin American |
| %d-%m-%Y | 15-03-2023 | European variant |
| %m/%d/%Y | 03/15/2023 | US format |
| %m-%d-%Y | 03-15-2023 | US variant |
| Input | Output |
|---|---|
| "15/03/2023" | "2023-03-15" |
| "2023-03-15" | "2023-03-15" |
| "03/15/2023" | "2023-03-15" |
| " 20/06/2023 " | "2023-06-20" |
| "invalid" | None |
| "" | None |
- First matching format is used
- Ambiguous dates (e.g., 01/02/2023) resolve to the first format that parses them, here %d/%m/%Y (February 1, 2023)
- Invalid dates return None rather than raising errors
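The behavior above can be sketched with the format table; this is an illustrative reconstruction, not the exact module source:

```python
from datetime import datetime

# Same order as the format table: ISO first, then European, then US
FORMATOS = ["%Y-%m-%d", "%Y/%m/%d", "%d/%m/%Y",
            "%d-%m-%Y", "%m/%d/%Y", "%m-%d-%Y"]

def normalizar_fecha(valor):
    """Return an ISO 8601 date string, or None if no format matches."""
    if valor is None:
        return None
    texto = str(valor).strip()
    for formato in FORMATOS:
        try:
            return datetime.strptime(texto, formato).strftime("%Y-%m-%d")
        except ValueError:
            continue  # wrong format, try the next one
    return None
```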
normalizar_numero()
Parses and validates numeric values, handling common formatting:
- Returns None for empty, null, or "N/A" values
- Removes thousands separators (commas)
- Removes spaces (common in European formatting)
- Converts to float
- Returns None for invalid numbers
| Input | Output |
|---|---|
| "1250.50" | 1250.5 |
| "1,250.50" | 1250.5 |
| "1 250.50" | 1250.5 |
| " 500 " | 500.0 |
| "N/A" | None |
| "n/a" | None |
| "abc" | None |
| "" | None |
- Does not handle the European decimal separator (comma): "1.250,50" would fail and needs to be converted to "1250.50" first
- No currency symbol handling: "$1,250.50" would fail
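A minimal reconstruction of normalizar_numero() consistent with the table above (the actual module code may differ):

```python
def normalizar_numero(valor):
    """Return a float, or None for empty, "N/A", or invalid input."""
    if valor is None:
        return None
    texto = str(valor).strip()
    if not texto or texto.lower() == "n/a":
        return None
    # Drop thousands separators: commas and internal spaces
    texto = texto.replace(",", "").replace(" ", "")
    try:
        return float(texto)
    except ValueError:
        return None  # anything non-numeric, e.g. "abc" or "$1,250.50"
```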
Encoding Detection Examples
Example 1: UTF-8 File
Example 2: Windows Excel Export
Example 3: All Encodings Fail
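These scenarios can be reproduced with a small script; `detectar` is a hypothetical stand-in for the modules' detection loop:

```python
import os
import tempfile

def detectar(ruta, encodings=("utf-8", "latin-1", "cp1252", "iso-8859-1")):
    """Return the first encoding that decodes the file, or None."""
    for enc in encodings:
        try:
            with open(ruta, encoding=enc) as f:
                f.read()
            return enc
        except UnicodeDecodeError:
            continue
    return None

# Example 1: a UTF-8 file succeeds on the first attempt.
fd, utf8_file = tempfile.mkstemp(suffix=".csv")
os.close(fd)
with open(utf8_file, "w", encoding="utf-8") as f:
    f.write("ciudad\nMedellín\n")
print(detectar(utf8_file))  # utf-8

# Example 2: a CP1252 Excel-style export. UTF-8 decoding fails on the
# 0xED byte of "í", so detection falls through to latin-1 (identical
# to cp1252 for this text).
fd, cp1252_file = tempfile.mkstemp(suffix=".csv")
os.close(fd)
with open(cp1252_file, "w", encoding="cp1252") as f:
    f.write("ciudad\nMedellín\n")
print(detectar(cp1252_file))  # latin-1

os.remove(utf8_file)
os.remove(cp1252_file)
```

Example 3 (all encodings fail) is hard to trigger at the decode step, because latin-1 accepts every byte value; in practice it tends to surface from binary or UTF-16 input breaking later CSV parsing instead.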
Best Practices
For Input Files
- Prefer UTF-8: Save source files as UTF-8 when possible
- Test Encodings: If issues occur, try resaving the file with different encoding
- Check Console: Always check which encoding was detected
- Validate Characters: Verify special characters display correctly after processing
For Output Files
- UTF-8-SIG Standard: All output uses UTF-8 with BOM
- Excel Compatible: Files open correctly in Excel with proper character display
- Database Ready: UTF-8 is compatible with modern databases
- Future Proof: UTF-8 is the recommended standard for all new systems
Troubleshooting Encoding Issues
Symptom: Garbled Characters
Problem: "Bogotá" displays as "BogotÃ¡"
Solution:
- Input file is UTF-8 but was opened as Latin-1
- System should auto-detect, but verify console output
- If persists, manually convert file to UTF-8
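Manual conversion can be done with a small helper; the function name is illustrative and not part of the modules:

```python
def convertir_a_utf8(origen, destino, encoding_origen="cp1252"):
    """Re-encode a file to plain UTF-8 as a manual fallback."""
    with open(origen, encoding=encoding_origen) as src:
        contenido = src.read()
    with open(destino, "w", encoding="utf-8") as dst:
        dst.write(contenido)
```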
Symptom: All Encodings Fail
Problem: "No se pudo leer el archivo con ninguna codificación" ("Could not read the file with any encoding")
Solutions:
- Check if file is actually CSV format
- File might be UTF-16 (not supported) - resave as UTF-8
- File might be corrupted - verify it opens in text editor
- Try opening in Excel and exporting as new CSV
Symptom: Some Characters Missing
Problem: Accented characters are replaced with "?"
Solution:
- Output was opened with wrong encoding
- Make sure viewing application uses UTF-8
- Excel should auto-detect UTF-8-SIG
- For other tools, explicitly select UTF-8
Technical Details
Encoding Detection Algorithms
The simple sequential try/except approach is effective because:
- Fast Failure: UnicodeDecodeError is raised immediately
- No Ambiguity: Each encoding either works or fails completely
- Deterministic: Same file always produces the same result
- Deterministic: Same file always produces same result
- Efficient: Most files are UTF-8 and succeed on first try
Alternative Approaches
More sophisticated encoding detection libraries exist (e.g., chardet), but the simple approach is preferred because:
- No Dependencies: No external libraries required
- Known Encodings: We expect specific encodings for Latin American data
- Performance: Fast and deterministic
- Maintainability: Easy to understand and modify