Overview
PruebaETL includes utility functions for data normalization, validation, and database generation. All functions are designed to handle messy real-world data with graceful error handling.Data Normalization Functions
normalizar_fecha()
Description: Normalizes date strings from multiple formats to ISO format (YYYY-MM-DD). Location:normalizar_datos.py:6, normalizar_ventas.py:5
Signature:
Date string in any supported format
YYYY-MM-DD format, or None if invalid/empty.
Supported Date Formats:
YYYY-MM-DD(ISO format)YYYY/MM/DDDD/MM/YYYYDD-MM-YYYYMM/DD/YYYYM-D-Yvariations
- Tries multiple format parsers sequentially
- Strips whitespace before parsing
- Returns
Nonefor empty strings or unparseable dates - Uses Python’s
datetime.strptime()for validation
The function attempts formats in order. If a date is ambiguous (e.g., “01/02/2024”), it may be interpreted as Jan 2 or Feb 1 depending on format order.
limpiar_texto()
Description: Cleans and normalizes text by removing extra whitespace and handling empty values. Location:normalizar_datos.py:33, normalizar_ventas.py:25
Signature:
Text string to clean
None if empty/invalid.
Processing Steps:
- Strips leading/trailing whitespace
- Replaces multiple spaces with single space
- Returns
Noneif result is empty string
- Cliente names
- City names
- Segment names
- Channel names
- Any text field from CSV
normalizar_numero()
Description: Converts number strings to float, handling formatting and special values. Location:normalizar_datos.py:47, normalizar_ventas.py:37
Signature:
Number string with possible formatting (commas, spaces)
None if invalid/empty/N/A.
Processing Steps:
- Checks for empty string or “N/A”
- Removes commas and spaces
- Converts to float
- Returns
Noneif conversion fails
- Sale totals
- Monetary amounts
- Quantity fields
- Any numeric field from CSV
Data Processing Functions
procesar_clientes()
Description: Main function to process cliente CSV and generate normalized output files. Location:normalizar_datos.py:59
Signature:
clientes_prueba_mas_datos.csv in current directory)
Returns: None (generates CSV files as side effect)
Generated Files:
clientes_normalizados.csv- Normalized clientes with foreign keysciudades.csv- Unique cities with IDssegmentos.csv- Unique segments with IDsclientes_normalizados_completo.csv- Denormalized view for verification
-
Read CSV with multiple encoding fallbacks:
- UTF-8
- Latin-1
- CP1252
- ISO-8859-1
-
Extract and normalize data:
- Clean all text fields
- Normalize dates
- Handle missing cliente_id (auto-generate)
- Validate required fields
-
Build master tables:
- Extract unique cities
- Extract unique segments
- Assign auto-increment IDs
-
Create normalized records:
- Replace city names with city_id
- Replace segment names with segmento_id
- Write output files with UTF-8 BOM encoding
If a cliente has no name but has a city and segment, it generates a default name: “Cliente “
procesar_ventas()
Description: Main function to process sales CSV and generate normalized output files. Location:normalizar_ventas.py:48
Signature:
ventas_prueba_mas_datos.csv in current directory)
Returns: None (generates CSV files as side effect)
Generated Files:
ventas_normalizadas.csv- Normalized sales with foreign keyscanales.csv- Unique channels with IDsmonedas.csv- Unique currencies with IDsventas_normalizadas_completo.csv- Denormalized view for verification
- Read CSV with multiple encoding fallbacks
-
Extract and normalize data:
- Clean text fields (channel, currency)
- Normalize dates
- Normalize numeric totals
- Handle missing venta_id (starts at 1001)
- Validate required fields
-
Build master tables:
- Extract unique channels
- Extract unique currencies
- Assign auto-increment IDs
-
Create normalized records:
- Replace channel names with canal_id
- Replace currency codes with moneda_id
- Write output files with UTF-8 BOM encoding
SQL Generation Functions
generar_sql_crear_bd()
Description: Generates DDL SQL for creating database and all tables. Location:crear_base_datos.py:8
Signature:
- Database creation (
PruebaTecnicaDNI) - Raw tables (clientes_raw, ventas_raw)
- Master tables (ciudades, segmentos, canales, monedas)
- Business tables (clientes, ventas)
- Primary key constraints
- Foreign key constraints
- Indexes on foreign keys and date fields
leer_csv()
Description: Reads a CSV file from the cleanData directory. Location:crear_base_datos.py:165
Signature:
CSV filename (will be looked up in
cleanData/ directory)Returns empty list if file doesn’t exist. Prints warning but doesn’t raise exception.
generar_sql_insertar_datos()
Description: Generates INSERT statements for normalized data. Location:crear_base_datos.py:180
Signature:
cleanData/*.csv files)
Returns: String containing SQL INSERT statements.
Processing:
- Reads all normalized CSV files
- Generates INSERT statements in dependency order:
- Ciudades (referenced by clientes)
- Segmentos (referenced by clientes)
- Canales (referenced by ventas)
- Monedas (referenced by ventas)
- Clientes (references ciudades, segmentos)
- Ventas (references clientes, monedas, canales)
- Handles IDENTITY columns with SET IDENTITY_INSERT
- Escapes single quotes in text fields
- Converts empty strings to NULL
generar_sql_insertar_raw()
Description: Generates INSERT statements for raw data from original CSV files. Location:crear_base_datos.py:255
Signature:
clientes_prueba_mas_datos.csv→clientes_rawtableventas_prueba_mas_datos.csv→ventas_rawtable
generar_script_completo()
Description: Combines all SQL generation functions into one complete script. Location:crear_base_datos.py:319
Signature:
- Database and table creation
- Raw data inserts
- Normalized data inserts
guardar_script_sql()
Description: Generates and saves complete SQL script to file. Location:crear_base_datos.py:329
Signature:
crear_base_datos.sql file)
Example Usage:
Error Handling
Encoding Fallback
All CSV reading functions try multiple encodings:Missing Fields
Functions gracefully handle missing CSV fields:Validation
Required fields are validated before processing:Complete Workflow Example
Related Resources
Data Types
SQL Server data types reference
Troubleshooting
Common problems and solutions