Overview

This guide documents common issues encountered during the ETL process and their solutions. Each problem includes symptoms, root causes, and step-by-step solutions.

Data Quality Issues

Problem 1: Misencoded Special Characters

Symptom: Spanish characters appear corrupted in the output:
  • “Raúl Pérez” → “RaðLPA@rez”
  • “José García” → “JosÃ© GarcÃ­a”
  • “Peña” → “PeÃ±a”
Root Cause: Encoding mismatch between the source CSV file and Python’s file reading. The file may be encoded in Latin-1 or CP1252, but Python defaults to UTF-8.
Solution:
  1. Multiple Encoding Attempts:
    import csv

    encodings = ['utf-8', 'latin-1', 'cp1252', 'iso-8859-1']
    contenido = None
    
    for encoding in encodings:
        try:
            with open('clientes_prueba_mas_datos.csv', 'r', encoding=encoding) as f:
                contenido = list(csv.DictReader(f))
                print(f"Archivo leído correctamente con codificación: {encoding}")
                break
        except UnicodeDecodeError:
            continue
    
    if contenido is None:
        raise ValueError("No se pudo leer el archivo con ninguna codificación")
    
  2. Write with UTF-8 BOM:
    with open('clientes_normalizados.csv', 'w', encoding='utf-8-sig', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=[...])
        writer.writeheader()
        writer.writerows(clientes)
    
    The utf-8-sig encoding adds a BOM (Byte Order Mark) for Excel compatibility.
Verification: Open the output CSV in Excel or a text editor. Special characters should display correctly:
  • ✅ “Raúl Pérez”
  • ✅ “José García”
  • ✅ “Peña”
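To confirm the BOM programmatically rather than by eye, a minimal check (assuming the output file name used above):

with open('clientes_normalizados.csv', 'rb') as f:
    assert f.read(3) == b'\xef\xbb\xbf', "UTF-8 BOM missing"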
Location: normalizar_datos.py:69-84, normalizar_ventas.py:55-69

Problem 2: Dates in Multiple Formats

Symptom: Dates appear in different formats within the same CSV:
  • “2024-11-23” (ISO format)
  • “11/23/2024” (US format)
  • “23-11-2024” (European format)
  • “23/11/2024” (European format)
Root Cause: Source systems use different date formats. Excel may also reformat dates when saving CSVs.
Solution: Implement a flexible date parser that tries multiple formats:
from datetime import datetime

def normalizar_fecha(fecha_str):
    """Normaliza diferentes formatos de fecha a YYYY-MM-DD"""
    if not fecha_str or fecha_str.strip() == '':
        return None
    
    fecha_str = fecha_str.strip()
    
    formatos = [
        '%Y-%m-%d',      # 2024-11-23
        '%Y/%m/%d',      # 2024/11/23
        '%d/%m/%Y',      # 23/11/2024
        '%d-%m-%Y',      # 23-11-2024
        '%m/%d/%Y',      # 11/23/2024
        '%Y-%d-%m',      # 2024-23-11
        '%m-%d-%Y',      # 11-23-2024
    ]
    
    for formato in formatos:
        try:
            fecha = datetime.strptime(fecha_str, formato)
            return fecha.strftime('%Y-%m-%d')  # Always output ISO format
        except ValueError:
            continue
    
    return None  # If no format matches
Output: All dates are normalized to ISO format (YYYY-MM-DD).
Verification:
print(normalizar_fecha("2024-11-23"))   # 2024-11-23
print(normalizar_fecha("23/11/2024"))   # 2024-11-23
print(normalizar_fecha("11-23-2024"))   # 2024-11-23
print(normalizar_fecha("invalid"))      # None
Ambiguous dates like “01/02/2024” may be interpreted differently depending on the order of formats in the list. Consider the data source to determine whether DD/MM or MM/DD is more likely.
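If the source is known to be day-first (or month-first), the parser can be biased by trying the preferred convention first. A minimal sketch; normalizar_fecha_sesgada and its dia_primero parameter are hypothetical, not part of the existing scripts:

from datetime import datetime

def normalizar_fecha_sesgada(fecha_str, dia_primero=True):
    """Like normalizar_fecha, but tries the preferred day/month order first."""
    ddmm = ['%d/%m/%Y', '%d-%m-%Y']
    mmdd = ['%m/%d/%Y', '%m-%d-%Y']
    formatos = ['%Y-%m-%d', '%Y/%m/%d']
    formatos += (ddmm + mmdd) if dia_primero else (mmdd + ddmm)
    for formato in formatos:
        try:
            return datetime.strptime(fecha_str.strip(), formato).strftime('%Y-%m-%d')
        except ValueError:
            continue
    return None  # no format matched

print(normalizar_fecha_sesgada("01/02/2024"))                     # 2024-02-01 (day-first)
print(normalizar_fecha_sesgada("01/02/2024", dia_primero=False))  # 2024-01-02 (month-first)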
Location: normalizar_datos.py:6-31, normalizar_ventas.py:5-23

Problem 3: Null and Empty Values

Symptom: CSV contains various representations of missing data:
  • Empty strings: ""
  • Whitespace: " "
  • “N/A” or “n/a”
  • NULL values in numeric fields
Root Cause: Different systems represent missing data differently. Excel may export NULL as an empty string.
Solution:
  1. Text Fields - Return None:
    import re

    def limpiar_texto(texto):
        if not texto:
            return None
        
        texto = texto.strip()
        texto = re.sub(r'\s+', ' ', texto)  # Multiple spaces → single space
        
        if texto == '':
            return None
        
        return texto
    
  2. Numeric Fields - Handle N/A:
    def normalizar_numero(numero_str):
        if not numero_str or numero_str.strip() == '' or numero_str.strip().upper() == 'N/A':
            return None
        
        numero_str = numero_str.replace(',', '').replace(' ', '').strip()
        
        try:
            return float(numero_str)
        except ValueError:
            return None
    
  3. Validation - Check Required Fields:
    # For clientes: segmento is required; a missing nombre or ciudad gets a default
    if (nombre or ciudad) and segmento:
        # Provide defaults for missing required fields
        if not nombre:
            nombre = f"Cliente {cliente_id}"
        if not ciudad:
            ciudad = "Sin especificar"
        
        clientes.append({...})
    
    # For ventas: cliente_id, total, moneda, canal required
    if cliente_id and total is not None and moneda and canal:
        ventas.append({...})
    else:
        continue  # Skip invalid record
    
SQL Handling:
# Convert Python None to SQL NULL
fecha = cliente['fecha_registro'] if cliente['fecha_registro'] else 'NULL'
if fecha != 'NULL':
    fecha = f"'{fecha}'"

sql = f"INSERT INTO clientes (..., fecha_registro) VALUES (..., {fecha});"
Location: normalizar_datos.py:33-46, 47-57, 108-123

Problem 4: Numbers with Text Formatting

Symptom: Numeric fields contain text formatting:
  • “1,403.70” (thousands separator)
  • “1 403.70” (space separator)
  • “N/A” (missing value)
  • “$1403.70” (currency symbol)
Root Cause: Excel and other tools export numbers with locale-specific formatting.
Solution:
  1. Strip Formatting:
    def normalizar_numero(numero_str):
        if not numero_str or numero_str.strip() == '' or numero_str.strip().upper() == 'N/A':
            return None
        
        # Remove commas and spaces
        numero_str = numero_str.replace(',', '').replace(' ', '').strip()
        
        try:
            return float(numero_str)
        except ValueError:
            return None
    
  2. Use DECIMAL in SQL:
    CREATE TABLE ventas (
        venta_id INT PRIMARY KEY,
        total DECIMAL(18,2) NOT NULL,  -- Exact precision for money
        ...
    );
    
Examples:
print(normalizar_numero("1,403.70"))    # 1403.7
print(normalizar_numero("1 403.70"))    # 1403.7
print(normalizar_numero("1403.70"))     # 1403.7
print(normalizar_numero("N/A"))         # None
print(normalizar_numero("$1403.70"))    # None (currency symbol not handled)
Currency symbols are NOT automatically removed; pre-process the data to strip them first, as in the sketch below.
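A sketch of such a pre-processing step, assuming the only markers that appear are $, US$, and RD$; quitar_simbolo_moneda is a hypothetical helper:

import re

def quitar_simbolo_moneda(numero_str):
    """Strip a leading currency marker such as '$', 'US$', or 'RD$'."""
    if not numero_str:
        return numero_str
    return re.sub(r'^\s*(RD\$|US\$|\$)\s*', '', numero_str)

print(normalizar_numero(quitar_simbolo_moneda("$1403.70")))  # 1403.7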
SQL Insert:
total = venta['total']  # Already converted to float
sql = f"INSERT INTO ventas (..., total) VALUES (..., {total});"
# Result: INSERT INTO ventas (..., total) VALUES (..., 1403.7);
Location: normalizar_datos.py:47-57, normalizar_ventas.py:37-46

Problem 5: Data Redundancy

Symptom:
  • “La Romana” appears 50+ times in clientes CSV
  • “Corporativo” segment repeated 100+ times
  • CSV files are very large (10+ MB for small datasets)
  • Slow query performance
Root Cause: The denormalized data structure repeats the same values many times.
Impact:
  • Storage waste: ~80% redundancy
  • Update anomalies: Changing “La Romana” requires updating 50+ rows
  • Performance: Indexing on long text strings is slow
Solution: Relational normalization with master tables:
  1. Extract Unique Values:
    # Build sets of unique values
    ciudades_set = set()
    segmentos_set = set()
    
    for row in contenido:
        ciudad = limpiar_texto(row.get('ciudad', ''))
        segmento = limpiar_texto(row.get('segmento', ''))
        
        if ciudad:
            ciudades_set.add(ciudad)
        if segmento:
            segmentos_set.add(segmento)
    
  2. Create Master Tables:
    # Create ciudades table
    ciudades = [{'ciudad_id': i+1, 'nombre': ciudad} 
                for i, ciudad in enumerate(sorted(ciudades_set))]
    
    # Create segmentos table
    segmentos = [{'segmento_id': i+1, 'nombre': segmento} 
                 for i, segmento in enumerate(sorted(segmentos_set))]
    
  3. Create Mapping:
    # Create lookup dictionaries
    ciudad_map = {ciudad['nombre']: ciudad['ciudad_id'] 
                  for ciudad in ciudades}
    segmento_map = {segmento['nombre']: segmento['segmento_id'] 
                    for segmento in segmentos}
    
  4. Replace with Foreign Keys:
    # Replace text with IDs
    clientes_normalizados = []
    for cliente in clientes:
        ciudad_id = ciudad_map.get(cliente['ciudad'])
        segmento_id = segmento_map.get(cliente['segmento'])
        
        clientes_normalizados.append({
            'cliente_id': cliente['cliente_id'],
            'nombre': cliente['nombre'],
            'ciudad_id': ciudad_id,      # FK instead of text
            'segmento_id': segmento_id,  # FK instead of text
            'fecha_registro': cliente['fecha_registro']
        })
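
Steps 2 and 3 repeat the same shape for every master table, so they can be collapsed into one generic helper. A sketch; crear_tabla_maestra is a hypothetical name, not in the existing scripts:

def crear_tabla_maestra(valores, campo_id):
    """Turn a set of unique values into master rows plus a nombre -> id lookup."""
    filas = [{campo_id: i + 1, 'nombre': v} for i, v in enumerate(sorted(valores))]
    mapa = {fila['nombre']: fila[campo_id] for fila in filas}
    return filas, mapa

ciudades, ciudad_map = crear_tabla_maestra(ciudades_set, 'ciudad_id')
segmentos, segmento_map = crear_tabla_maestra(segmentos_set, 'segmento_id')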
    
Database Schema:
-- Master table
CREATE TABLE ciudades (
    ciudad_id INT IDENTITY(1,1) PRIMARY KEY,
    nombre NVARCHAR(100) NOT NULL UNIQUE
);

-- Business table with foreign key
CREATE TABLE clientes (
    cliente_id INT PRIMARY KEY,
    nombre NVARCHAR(255) NOT NULL,
    ciudad_id INT NOT NULL,
    CONSTRAINT FK_clientes_ciudad FOREIGN KEY (ciudad_id) 
        REFERENCES ciudades(ciudad_id)
);

-- Index for performance
CREATE INDEX IX_clientes_ciudad_id ON clientes(ciudad_id);
Benefits:
  • ✅ Storage: “La Romana” stored once instead of 50+ times
  • ✅ Updates: Change city name in one place
  • ✅ Performance: Integer foreign keys are faster than text
  • ✅ Integrity: Foreign key constraints prevent invalid data
File Size Reduction:
  • Before: 10 MB (redundant text)
  • After: 2 MB (normalized with IDs)
  • Reduction: ~80%
Location: normalizar_datos.py:137-161, normalizar_ventas.py:123-145

Database Issues

Foreign Key Constraint Violations

Symptom:
The INSERT statement conflicted with the FOREIGN KEY constraint "FK_clientes_ciudad". 
The conflict occurred in database "PruebaTecnicaDNI", table "dbo.ciudades", column 'ciudad_id'.
Root Cause: Inserting into the child table (clientes) before the parent table (ciudades).
Solution: Insert in dependency order:
  1. Master tables (no dependencies):
    • ciudades
    • segmentos
    • canales
    • monedas
  2. Business tables (depend on master tables):
    • clientes (references ciudades, segmentos)
  3. Transaction tables (depend on business tables):
    • ventas (references clientes, monedas, canales)
-- Correct order
INSERT INTO ciudades ...;
INSERT INTO segmentos ...;
INSERT INTO canales ...;
INSERT INTO monedas ...;
GO

INSERT INTO clientes ...;  -- Now ciudades and segmentos exist
GO

INSERT INTO ventas ...;    -- Now clientes, monedas, canales exist
GO
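In a generator script, the same ordering can be made explicit with an ordered list of emit steps. A sketch; the escribir_* functions are hypothetical stand-ins for whatever writes each table's INSERTs:

# Emit INSERT sections in dependency order.
secciones = [
    escribir_ciudades,    # master tables first
    escribir_segmentos,
    escribir_canales,
    escribir_monedas,
    escribir_clientes,    # references ciudades, segmentos
    escribir_ventas,      # references clientes, monedas, canales
]
for escribir in secciones:
    escribir(archivo_sql)
    archivo_sql.write('GO\n')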
Location: crear_base_datos.py:180-253 (enforces correct order)

IDENTITY Insert Issues

Symptom:
An explicit value for the identity column in table 'ciudades' can only be specified 
when a column list is used and IDENTITY_INSERT is ON.
Root Cause: Trying to insert explicit ID values into an IDENTITY column without enabling IDENTITY_INSERT.
Solution:
-- Enable explicit ID inserts
SET IDENTITY_INSERT ciudades ON;

-- Insert with explicit IDs
INSERT INTO ciudades (ciudad_id, nombre) VALUES (1, 'La Romana');
INSERT INTO ciudades (ciudad_id, nombre) VALUES (2, 'Santo Domingo');

-- Disable explicit ID inserts
SET IDENTITY_INSERT ciudades OFF;
GO
Only ONE table per session can have IDENTITY_INSERT ON at a time.
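In the generator, the ON/OFF pair can be emitted around each table's inserts so the setting never stays on across tables. A sketch; emitir_inserts_identity is a hypothetical helper:

def emitir_inserts_identity(f, tabla, inserts):
    """Wrap a table's explicit-ID INSERT statements in SET IDENTITY_INSERT ON/OFF."""
    f.write(f"SET IDENTITY_INSERT {tabla} ON;\n")
    for sql in inserts:
        f.write(sql + '\n')
    f.write(f"SET IDENTITY_INSERT {tabla} OFF;\nGO\n")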
Location: crear_base_datos.py:189-194 (automatically handles this)

SQL Injection from Special Characters

Symptom:
INSERT INTO clientes (nombre) VALUES ('O'Brien');
-- Error: Unclosed quotation mark after 'Brien'.
Root Cause: Single quotes in data break SQL string delimiters.
Solution: Escape single quotes by doubling them:
nombre = cliente['nombre'].replace("'", "''")
sql = f"INSERT INTO clientes (nombre) VALUES ('{nombre}');"

# "O'Brien" becomes "O''Brien"
# Result: INSERT INTO clientes (nombre) VALUES ('O''Brien');
Applied to All Text Fields:
nombre = ciudad['nombre'].replace("'", "''").replace('\n', ' ').replace('\r', ' ')
sql = f"INSERT INTO ciudades (ciudad_id, nombre) VALUES ({ciudad['ciudad_id']}, '{nombre}');"
Also remove newlines (\n) and carriage returns (\r) to prevent SQL syntax errors.
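Since the same chain of replace() calls applies to every text field, it can live in one helper; for code that connects to the database directly, parameterized queries avoid manual escaping altogether. A sketch; escapar_sql is a hypothetical name:

def escapar_sql(texto):
    """Escape a text value for embedding in a T-SQL string literal."""
    return (texto.replace("'", "''")
                 .replace('\n', ' ')
                 .replace('\r', ' '))

nombre = escapar_sql(ciudad['nombre'])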
Location: crear_base_datos.py:191, 203, 213, 224, 234, 269-281

File and Encoding Issues

FileNotFoundError

Symptom:
Error: No se encontró el archivo 'clientes_prueba_mas_datos.csv'
Root Cause: The script was executed from the wrong directory, or the CSV files are not present.
Solution:
  1. Check Current Directory:
    pwd
    ls *.csv
    
  2. Run from Correct Directory:
    cd /path/to/pruebaetl
    python normalizar_datos.py
    
  3. Verify File Names:
    ls -l clientes_prueba_mas_datos.csv
    ls -l ventas_prueba_mas_datos.csv
    
Expected Files:
  • clientes_prueba_mas_datos.csv (source data)
  • ventas_prueba_mas_datos.csv (source data)
Scripts use relative paths. Always run from the directory containing the CSV files.
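An alternative is to resolve the CSV paths relative to the script itself, so the working directory no longer matters. A minimal sketch using pathlib:

from pathlib import Path

BASE = Path(__file__).resolve().parent

with open(BASE / 'clientes_prueba_mas_datos.csv', 'r', encoding='utf-8') as f:
    ...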

UTF-8 BOM in Excel

Symptom: Excel displays special characters correctly, but other tools show a BOM character (rendered as ï»¿) at the start of the file.
Root Cause: The UTF-8 BOM (Byte Order Mark) is added for Excel compatibility but may confuse other parsers.
Solution: This is intentional behavior:
# Writing with BOM for Excel
with open('output.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=[...])
    writer.writeheader()
    writer.writerows(data)
When to Use:
  • ✅ Use utf-8-sig when files will be opened in Excel
  • ✅ Use utf-8 when files will be parsed by other systems
Reading BOM Files:
# Python's csv module handles BOM automatically
with open('output.csv', 'r', encoding='utf-8-sig') as f:
    reader = csv.DictReader(f)
    data = list(reader)
Location: All CSV write operations use utf-8-sig

Performance Issues

Slow CSV Processing

Symptom: Processing large CSV files (100K+ rows) takes several minutes.
Root Cause:
  • Inefficient looping
  • Multiple file reads
  • No batch processing
Solution:
  1. Read Entire File Once:
    # Good: Read once
    with open('input.csv', 'r', encoding='utf-8') as f:
        contenido = list(csv.DictReader(f))
    
    for row in contenido:
        pass  # process each row here
    
    # Bad: Re-read the file on every iteration
    for i in range(count):
        with open('input.csv', 'r', encoding='utf-8') as f:
            contenido = list(csv.DictReader(f))  # reads the whole file each time
    
  2. Use Sets for Unique Values:
    # Good: O(1) lookup
    ciudades_set = set()
    ciudades_set.add(ciudad)
    
    # Bad: O(n) lookup
    ciudades_list = []
    if ciudad not in ciudades_list:  # Slow for large lists
        ciudades_list.append(ciudad)
    
  3. Batch SQL Inserts:
    # Write all INSERTs to file, then execute once in SQL Server
    with open('crear_base_datos.sql', 'w', encoding='utf-8') as f:
        for cliente in clientes:
            f.write(f"INSERT INTO clientes (...) VALUES (...);\n")
    

Slow Database Queries

Symptom: Queries on the ventas table take several seconds.
Root Cause: Missing indexes on foreign keys and frequently filtered columns.
Solution: Create indexes on foreign keys and frequently filtered date columns:
-- Foreign keys
CREATE INDEX IX_clientes_ciudad_id ON clientes(ciudad_id);
CREATE INDEX IX_clientes_segmento_id ON clientes(segmento_id);
CREATE INDEX IX_ventas_cliente_id ON ventas(cliente_id);
CREATE INDEX IX_ventas_moneda_id ON ventas(moneda_id);
CREATE INDEX IX_ventas_canal_id ON ventas(canal_id);

-- Date columns (frequently filtered)
CREATE INDEX IX_ventas_fecha ON ventas(fecha);
Foreign keys are automatically indexed in some databases, but SQL Server requires explicit index creation.
Location: crear_base_datos.py:152-158

Validation and Testing

How to Verify Data Integrity

  1. Check Record Counts:
    SELECT 'clientes' AS tabla, COUNT(*) AS registros FROM clientes
    UNION ALL
    SELECT 'ventas', COUNT(*) FROM ventas
    UNION ALL
    SELECT 'ciudades', COUNT(*) FROM ciudades;
    
  2. Check for Orphaned Records:
    -- Clientes without valid ciudad
    SELECT * FROM clientes c
    LEFT JOIN ciudades ci ON c.ciudad_id = ci.ciudad_id
    WHERE ci.ciudad_id IS NULL;
    
    -- Ventas without valid cliente
    SELECT * FROM ventas v
    LEFT JOIN clientes c ON v.cliente_id = c.cliente_id
    WHERE c.cliente_id IS NULL;
    
  3. Validate Foreign Keys:
    -- Should return no errors
    DBCC CHECKCONSTRAINTS WITH ALL_CONSTRAINTS;
    
  4. Compare Raw vs Normalized:
    SELECT COUNT(*) FROM clientes_raw;
    SELECT COUNT(*) FROM clientes;
    -- Counts should match (or normalized should be less if invalid records were skipped)
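
The same comparison can be scripted against the CSVs before loading. A minimal sketch:

import csv

def contar_filas(ruta):
    """Count data rows in a CSV (excluding the header)."""
    with open(ruta, 'r', encoding='utf-8-sig') as f:
        return sum(1 for _ in csv.DictReader(f))

origen = contar_filas('clientes_prueba_mas_datos.csv')
limpio = contar_filas('clientes_normalizados.csv')
print(f"origen={origen}, normalizado={limpio}, descartados={origen - limpio}")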
    

Getting Help

Debug Mode

Add print statements to see processing details:
for row in contenido:
    cliente_id = row.get('cliente_id', '').strip()
    print(f"Processing cliente_id: {cliente_id}")  # Debug
    
    if not cliente_id:
        print(f"  → Skipping: Empty cliente_id")  # Debug
        continue
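
Print statements are hard to silence later; Python's logging module gives the same visibility with a switchable level. A sketch of the equivalent loop:

import logging

logging.basicConfig(level=logging.DEBUG)  # change to logging.INFO to silence debug output
log = logging.getLogger('etl')

for row in contenido:
    cliente_id = row.get('cliente_id', '').strip()
    log.debug("Processing cliente_id: %s", cliente_id)
    if not cliente_id:
        log.debug("  -> Skipping: empty cliente_id")
        continue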

Common Commands

# Run normalization
python normalizar_datos.py
python normalizar_ventas.py

# Generate SQL
python crear_base_datos.py

# Check output files
ls -lh *.csv
ls -lh cleanData/*.csv

# View first few lines
head -n 5 clientes_normalizados.csv
