This guide documents common issues encountered during the ETL process and their solutions. Each problem includes symptoms, root causes, and step-by-step solutions.
Root Cause: Encoding mismatch between the source CSV file and Python's file reading. The file may be encoded in Latin-1 or CP1252, but Python defaults to UTF-8.
Solution:
Multiple Encoding Attempts:
```python
import csv

encodings = ['utf-8', 'latin-1', 'cp1252', 'iso-8859-1']
contenido = None
for encoding in encodings:
    try:
        with open('clientes_prueba_mas_datos.csv', 'r', encoding=encoding) as f:
            contenido = list(csv.DictReader(f))
        print(f"File read successfully with encoding: {encoding}")
        break
    except UnicodeDecodeError:
        continue
if contenido is None:
    raise ValueError("Could not read the file with any encoding")
```
Write with UTF-8 BOM:
```python
with open('clientes_normalizados.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=[...])
    writer.writeheader()
    writer.writerows(clientes)
```
The utf-8-sig encoding adds a BOM (Byte Order Mark) for Excel compatibility.
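The BOM is simply three bytes prepended to the file, which a quick standard-library check makes visible:

```python
# The utf-8-sig codec prepends the UTF-8 BOM bytes EF BB BF
# before the ordinary UTF-8 encoding of the text.
text = "año"
with_bom = text.encode('utf-8-sig')
without_bom = text.encode('utf-8')

print(with_bom[:3])   # b'\xef\xbb\xbf'  (the BOM)
print(with_bom[3:] == without_bom)  # True
```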
Verification: Open the output CSV in Excel or a text editor. Special characters should display correctly:
Symptom: Dates appear in different formats within the same CSV:
“2024-11-23” (ISO format)
“11/23/2024” (US format)
“23-11-2024” (European format)
“23/11/2024” (European format)
Root Cause: Source systems use different date formats. Excel may also reformat dates when saving CSVs.
Solution: Implement a flexible date parser that tries multiple formats:
```python
from datetime import datetime

def normalizar_fecha(fecha_str):
    """Normalize different date formats to YYYY-MM-DD."""
    if not fecha_str or fecha_str.strip() == '':
        return None
    fecha_str = fecha_str.strip()
    formatos = [
        '%Y-%m-%d',  # 2024-11-23
        '%Y/%m/%d',  # 2024/11/23
        '%d/%m/%Y',  # 23/11/2024
        '%d-%m-%Y',  # 23-11-2024
        '%m/%d/%Y',  # 11/23/2024
        '%Y-%d-%m',  # 2024-23-11
        '%m-%d-%Y',  # 11-23-2024
    ]
    for formato in formatos:
        try:
            fecha = datetime.strptime(fecha_str, formato)
            return fecha.strftime('%Y-%m-%d')  # Always output ISO format
        except ValueError:
            continue
    return None  # If no format matches
```
Output: All dates normalized to ISO format: YYYY-MM-DD.
Verification:
Ambiguous dates like “01/02/2024” may be interpreted differently depending on the format order. Consider the data source to determine whether DD/MM or MM/DD is more likely.
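With the format order above, for instance, “01/02/2024” matches '%d/%m/%Y' before '%m/%d/%Y' and is read as February 1st, not January 2nd:

```python
from datetime import datetime

ambiguous = '01/02/2024'

# European interpretation wins because '%d/%m/%Y' is tried first
as_european = datetime.strptime(ambiguous, '%d/%m/%Y')
as_us = datetime.strptime(ambiguous, '%m/%d/%Y')

print(as_european.strftime('%Y-%m-%d'))  # 2024-02-01
print(as_us.strftime('%Y-%m-%d'))        # 2024-01-02
```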
Symptom: CSV contains various representations of missing data:
Empty strings: ""
Whitespace: " "
“N/A” or “n/a”
NULL values in numeric fields
Root Cause: Different systems represent missing data differently. Excel may export NULL as an empty string.
Solution:
Text Fields - Return None:
```python
import re

def limpiar_texto(texto):
    if not texto:
        return None
    texto = texto.strip()
    texto = re.sub(r'\s+', ' ', texto)  # Multiple spaces → single space
    if texto == '':
        return None
    return texto
```
Numeric Fields - Handle N/A:
```python
def normalizar_numero(numero_str):
    if not numero_str or numero_str.strip() == '' or numero_str.upper() == 'N/A':
        return None
    numero_str = numero_str.replace(',', '').replace(' ', '').strip()
    try:
        return float(numero_str)
    except ValueError:
        return None
```
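Taken together, the two helpers collapse every common "missing" spelling to None. A self-contained check (the function bodies are repeated from above so the snippet runs on its own):

```python
import re

def limpiar_texto(texto):
    """Repeated from above: empty or whitespace-only text becomes None."""
    if not texto:
        return None
    texto = re.sub(r'\s+', ' ', texto.strip())
    return texto if texto != '' else None

def normalizar_numero(numero_str):
    """Repeated from above: '', whitespace, and 'N/A' become None."""
    if not numero_str or numero_str.strip() == '' or numero_str.upper() == 'N/A':
        return None
    try:
        return float(numero_str.replace(',', '').replace(' ', ''))
    except ValueError:
        return None

assert limpiar_texto('') is None
assert limpiar_texto('   ') is None
assert limpiar_texto('  hola   mundo ') == 'hola mundo'
assert normalizar_numero('N/A') is None
assert normalizar_numero('n/a') is None
assert normalizar_numero('1,234.50') == 1234.5
```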
Validation - Check Required Fields:
```python
# For clientes: nombre, ciudad, segmento required
if (nombre or ciudad) and segmento:
    # Provide defaults for missing required fields
    if not nombre:
        nombre = f"Cliente {cliente_id}"
    if not ciudad:
        ciudad = "Sin especificar"
    clientes.append({...})
```
```python
# For ventas: cliente_id, total, moneda, canal required
if cliente_id and total is not None and moneda and canal:
    ventas.append({...})
else:
    continue  # Skip invalid record
```
SQL Handling:
```python
# Convert Python None to SQL NULL
fecha = cliente['fecha_registro'] if cliente['fecha_registro'] else 'NULL'
if fecha != 'NULL':
    fecha = f"'{fecha}'"
sql = f"INSERT INTO clientes (..., fecha_registro) VALUES (..., {fecha});"
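The same None-to-NULL pattern applies to every nullable column, so it can be centralized. A hypothetical helper (the name `sql_literal` is not from the original code, and this is for generating a one-off script, not a substitute for parameterized queries against a live connection):

```python
def sql_literal(value):
    """Render a Python value as a SQL literal, mapping None to NULL.

    Single quotes are doubled so names like O'Brien do not break
    the generated statement.
    """
    if value is None:
        return 'NULL'
    if isinstance(value, (int, float)):
        return str(value)
    escaped = str(value).replace("'", "''")
    return f"'{escaped}'"

print(sql_literal(None))       # NULL
print(sql_literal("O'Brien"))  # 'O''Brien'
print(sql_literal(42))         # 42
```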
Performance: Indexing on long text strings is slow
Solution: Relational normalization with master tables:
Extract Unique Values:
```python
# Build sets of unique values
ciudades_set = set()
segmentos_set = set()
for row in contenido:
    ciudad = limpiar_texto(row.get('ciudad', ''))
    segmento = limpiar_texto(row.get('segmento', ''))
    if ciudad:
        ciudades_set.add(ciudad)
    if segmento:
        segmentos_set.add(segmento)
```
Create Master Tables:
```python
# Create ciudades table
ciudades = [{'ciudad_id': i + 1, 'nombre': ciudad}
            for i, ciudad in enumerate(sorted(ciudades_set))]

# Create segmentos table
segmentos = [{'segmento_id': i + 1, 'nombre': segmento}
             for i, segmento in enumerate(sorted(segmentos_set))]
```
Create Mapping:
```python
# Create lookup dictionaries
ciudad_map = {ciudad['nombre']: ciudad['ciudad_id'] for ciudad in ciudades}
segmento_map = {segmento['nombre']: segmento['segmento_id'] for segmento in segmentos}
```
Replace with Foreign Keys:
```python
# Replace text with IDs
for cliente in clientes:
    ciudad_id = ciudad_map.get(cliente['ciudad'])
    segmento_id = segmento_map.get(cliente['segmento'])
    clientes_normalizados.append({
        'cliente_id': cliente['cliente_id'],
        'nombre': cliente['nombre'],
        'ciudad_id': ciudad_id,      # FK instead of text
        'segmento_id': segmento_id,  # FK instead of text
        'fecha_registro': cliente['fecha_registro']
    })
```
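The four steps can be run end to end on a tiny sample (the two rows below are invented for illustration):

```python
clientes = [
    {'cliente_id': 1, 'nombre': 'Ana', 'ciudad': 'Santo Domingo'},
    {'cliente_id': 2, 'nombre': 'Luis', 'ciudad': 'La Romana'},
]

# Steps 1-2: unique values -> master table with surrogate keys
ciudades_set = {c['ciudad'] for c in clientes}
ciudades = [{'ciudad_id': i + 1, 'nombre': n}
            for i, n in enumerate(sorted(ciudades_set))]

# Step 3: lookup dictionary
ciudad_map = {c['nombre']: c['ciudad_id'] for c in ciudades}

# Step 4: replace text with foreign keys
normalizados = [{'cliente_id': c['cliente_id'],
                 'nombre': c['nombre'],
                 'ciudad_id': ciudad_map[c['ciudad']]}
                for c in clientes]

print(ciudad_map)  # {'La Romana': 1, 'Santo Domingo': 2}
```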
Database Schema:
```sql
-- Master table
CREATE TABLE ciudades (
    ciudad_id INT PRIMARY KEY IDENTITY(1,1),
    nombre NVARCHAR(100) NOT NULL UNIQUE
);

-- Business table with foreign key
CREATE TABLE clientes (
    cliente_id INT PRIMARY KEY,
    nombre NVARCHAR(255) NOT NULL,
    ciudad_id INT NOT NULL,
    CONSTRAINT FK_clientes_ciudad FOREIGN KEY (ciudad_id)
        REFERENCES ciudades(ciudad_id)
);

-- Index for performance
CREATE INDEX IX_clientes_ciudad_id ON clientes(ciudad_id);
```
Benefits:
✅ Storage: “La Romana” stored once instead of 50+ times
✅ Updates: Change city name in one place
✅ Performance: Integer foreign keys are faster than text
✅ Integrity: Foreign key constraints prevent invalid data
The INSERT statement conflicted with the FOREIGN KEY constraint "FK_clientes_ciudad". The conflict occurred in database "PruebaTecnicaDNI", table "dbo.ciudades", column 'ciudad_id'.
Root Cause: Inserting into the child table (clientes) before the parent table (ciudades).
Solution: Insert in dependency order:
Master tables (no dependencies):
ciudades
segmentos
canales
monedas
Business tables (depend on master tables):
clientes (references ciudades, segmentos)
Transaction tables (depend on business tables):
ventas (references clientes, monedas, canales)
```sql
-- Correct order
INSERT INTO ciudades ...;
INSERT INTO segmentos ...;
INSERT INTO canales ...;
INSERT INTO monedas ...;
GO
INSERT INTO clientes ...;  -- Now ciudades and segmentos exist
GO
INSERT INTO ventas ...;    -- Now clientes, monedas, canales exist
GO
```
Symptom: Excel displays special characters correctly, but other tools show a BOM character (`ï»¿`) at the start of the file.
Root Cause: The UTF-8 BOM (Byte Order Mark) is added for Excel compatibility but may confuse other parsers.
Solution: This is intentional behavior:
```python
# Writing with BOM for Excel
with open('output.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=[...])
    writer.writeheader()
    writer.writerows(data)
```
When to Use:
✅ Use utf-8-sig when files will be opened in Excel
✅ Use utf-8 when files will be parsed by other systems
Reading BOM Files:
```python
# Python's csv module handles the BOM automatically with utf-8-sig
with open('output.csv', 'r', encoding='utf-8-sig') as f:
    reader = csv.DictReader(f)
    data = list(reader)
```
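The difference is easy to see if a BOM file is read back with plain utf-8: the mark survives as '\ufeff' glued to the first header name. A sketch using a temporary file:

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
with open(path, 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['cliente_id', 'nombre'])
    writer.writeheader()
    writer.writerow({'cliente_id': '1', 'nombre': 'Ana'})

# Plain utf-8 leaves the BOM attached to the first column name...
with open(path, 'r', encoding='utf-8', newline='') as f:
    dirty = csv.DictReader(f).fieldnames

# ...while utf-8-sig strips it transparently.
with open(path, 'r', encoding='utf-8-sig', newline='') as f:
    clean = csv.DictReader(f).fieldnames

print(repr(dirty[0]))  # '\ufeffcliente_id'
print(repr(clean[0]))  # 'cliente_id'
```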
Use Sets for Unique Values:
```python
# Bad: O(n) lookup
ciudades_list = []
if ciudad not in ciudades_list:  # Slow for large lists
    ciudades_list.append(ciudad)
```
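The O(1) alternative is a set, whose membership test does not scan the whole collection:

```python
# Good: O(1) average-case membership test
ciudades_set = set()
for ciudad in ['Santiago', 'La Romana', 'Santiago', 'Santo Domingo']:
    ciudades_set.add(ciudad)  # duplicates are ignored automatically

print(len(ciudades_set))  # 3
```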
Batch SQL Inserts:
```python
# Write all INSERTs to a file, then execute once in SQL Server
with open('crear_base_datos.sql', 'w', encoding='utf-8') as f:
    for cliente in clientes:
        f.write(f"INSERT INTO clientes (...) VALUES (...);\n")
```
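A minimal, runnable version of the same generator, with the column list filled in for illustration (the real script writes all tables and many more columns):

```python
import io

# Sample rows invented for illustration
clientes = [
    {'cliente_id': 1, 'nombre': 'Ana', 'ciudad_id': 2},
    {'cliente_id': 2, 'nombre': 'Luis', 'ciudad_id': 1},
]

# Accumulate every INSERT in one buffer, then write the file once
buf = io.StringIO()
for c in clientes:
    buf.write(
        f"INSERT INTO clientes (cliente_id, nombre, ciudad_id) "
        f"VALUES ({c['cliente_id']}, '{c['nombre']}', {c['ciudad_id']});\n"
    )
script = buf.getvalue()
print(script.count('INSERT'))  # 2
```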
Symptom: Queries on the ventas table take several seconds.
Root Cause: Missing indexes on foreign keys and frequently filtered columns.
Solution: Create indexes on:
```sql
-- Foreign keys
CREATE INDEX IX_clientes_ciudad_id ON clientes(ciudad_id);
CREATE INDEX IX_clientes_segmento_id ON clientes(segmento_id);
CREATE INDEX IX_ventas_cliente_id ON ventas(cliente_id);
CREATE INDEX IX_ventas_moneda_id ON ventas(moneda_id);
CREATE INDEX IX_ventas_canal_id ON ventas(canal_id);

-- Date columns (frequently filtered)
CREATE INDEX IX_ventas_fecha ON ventas(fecha);
```
Foreign keys are automatically indexed in some databases (MySQL/InnoDB, for example), but SQL Server requires explicit index creation.
Verify Row Counts:
```sql
SELECT 'clientes' AS tabla, COUNT(*) AS registros FROM clientes
UNION ALL
SELECT 'ventas', COUNT(*) FROM ventas
UNION ALL
SELECT 'ciudades', COUNT(*) FROM ciudades;
```
Check for Orphaned Records:
```sql
-- Clientes without a valid ciudad
SELECT * FROM clientes c
LEFT JOIN ciudades ci ON c.ciudad_id = ci.ciudad_id
WHERE ci.ciudad_id IS NULL;

-- Ventas without a valid cliente
SELECT * FROM ventas v
LEFT JOIN clientes c ON v.cliente_id = c.cliente_id
WHERE c.cliente_id IS NULL;
```
Validate Foreign Keys:
```sql
-- Should return no errors
DBCC CHECKCONSTRAINTS WITH ALL_CONSTRAINTS;
```
Compare Raw vs Normalized:
```sql
SELECT COUNT(*) FROM clientes_raw;
SELECT COUNT(*) FROM clientes;
-- Counts should match (or normalized may be lower if invalid records were skipped)
```