Overview
PruebaETL is an ETL (Extract, Transform, Load) pipeline that normalizes customer and sales data from CSV files and loads them into a SQL Server database. This guide walks you through the complete workflow.

What you'll accomplish:
- Normalize messy CSV files with encoding issues
- Extract redundant data into separate reference tables
- Generate a SQL Server database with proper relationships
- Load data with full referential integrity
Prerequisites
Before starting, ensure you have:

- Python 3.7 or higher installed
- SQL Server instance (local or remote)
- SQL Server Management Studio (SSMS) or Azure Data Studio
- Raw CSV files: clientes_prueba_mas_datos.csv and ventas_prueba_mas_datos.csv
Step-by-Step Workflow
Normalize Customer Data
Run the customer normalization script to clean and structure customer data:
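Assuming normalizar_datos.py sits in the project root next to the raw CSV files (the Troubleshooting section below implies this layout):

```
python normalizar_datos.py
```

What it does: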
- Fixes character encoding issues (UTF-8, Latin-1, CP1252)
- Normalizes date formats to YYYY-MM-DD
- Cleans text fields (removes extra spaces, special characters)
- Extracts cities and segments into separate tables
- Generates foreign key references
Output Files:

- clientes_normalizados.csv: normalized customer master data
- ciudades.csv: city lookup table
- segmentos.csv: customer segment lookup table
Normalize Sales Data
Run the sales normalization script to process transaction data:
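Run it after the customer script so customer IDs can be validated (script name per the Troubleshooting section):

```
python normalizar_ventas.py
```

What it does: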
- Normalizes numeric values (removes commas, handles N/A)
- Converts dates to standard format
- Extracts sales channels and currencies into lookup tables
- Validates foreign key relationships with customers
Output Files:

- ventas_normalizadas.csv: normalized sales transactions
- canales.csv: sales channel lookup table
- monedas.csv: currency code lookup table
Organize Normalized Files
Move the generated CSV files into a cleanData directory for SQL script generation:
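A minimal sketch of the move, assuming the six normalized files listed above and Python 3.7+:

```python
import shutil
from pathlib import Path

# Files produced by the two normalization scripts
NORMALIZED = [
    "clientes_normalizados.csv", "ciudades.csv", "segmentos.csv",
    "ventas_normalizadas.csv", "canales.csv", "monedas.csv",
]

clean_dir = Path("cleanData")
clean_dir.mkdir(exist_ok=True)  # create the target directory if missing

for name in NORMALIZED:
    shutil.move(name, str(clean_dir / name))  # raw inputs stay in the root
```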
The SQL generation script expects normalized files in the cleanData/ directory; the original input files should remain in the project root.

Generate SQL Server Script
Create the database schema and INSERT statements with the SQL generation script.

What it does:
- Reads all normalized CSV files from the cleanData/ directory
- Generates DDL for the PruebaTecnicaDNI database
- Creates both RAW tables (original data) and normalized tables
- Defines primary keys, foreign keys, and indexes
- Generates INSERT statements for all data
View generated SQL structure
The script creates the following tables:

RAW Tables (Original Data):

- clientes_raw: original customer records
- ventas_raw: original sales transactions

Normalized Tables:

- ciudades: city lookup table
- segmentos: customer segment types
- canales: sales channel types
- monedas: currency codes
- clientes: normalized customer master data
- ventas: normalized sales transactions
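A hedged excerpt of the kind of DDL the script generates (non-key column names are illustrative; the real definitions are in crear_base_datos.sql):

```sql
CREATE DATABASE PruebaTecnicaDNI;
GO
USE PruebaTecnicaDNI;
GO

-- One of the four lookup tables (ciudades, segmentos, canales, monedas)
CREATE TABLE ciudades (
    id INT IDENTITY(1,1) PRIMARY KEY,
    nombre NVARCHAR(100) NOT NULL       -- illustrative column
);

-- Normalized sales table with referential integrity back to clientes
CREATE TABLE ventas (
    id INT PRIMARY KEY,
    cliente_id INT NOT NULL,            -- assumed FK column name
    CONSTRAINT FK_ventas_cliente
        FOREIGN KEY (cliente_id) REFERENCES clientes (id)
);
```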
Execute in SQL Server
Load the data into your SQL Server instance.

Using SQL Server Management Studio:
- Open SSMS and connect to your SQL Server
- Open the crear_base_datos.sql file
- Execute the entire script (F5 or click Execute)
- Verify the database was created successfully
Using Azure Data Studio:

- Connect to your SQL Server instance
- Open crear_base_datos.sql
- Click Run or press F5
- Check the Messages pane for confirmation

Expected Output: confirmation messages for each CREATE and INSERT batch; the exact row counts depend on your source data.
Understanding the Code
Date Normalization
The ETL pipeline handles multiple date formats automatically:
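A minimal sketch of the idea, assuming a strptime-based fallback chain (the exact formats in normalizar_datos.py may differ):

```python
from datetime import datetime
from typing import Optional

# Candidate input formats; extend this list to match your source data.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y", "%m/%d/%Y"]

def normalize_date(value: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    value = value.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # unparseable dates are left for validation to flag
```

Text Cleaning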
Removes extra spaces and handles empty values:
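A sketch under the same caveat (the set of N/A-style markers is an assumption):

```python
import re

def clean_text(value: str) -> str:
    """Collapse whitespace runs and map N/A-style markers to empty."""
    value = re.sub(r"\s+", " ", value or "").strip()
    return "" if value.upper() in {"N/A", "NA", "NULL", "-"} else value
```

Number Normalization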
Handles formatted numbers and special values:
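Again a hedged sketch: commas as thousands separators are stripped and N/A becomes None, per the behavior described above:

```python
from typing import Optional

def normalize_number(value: str) -> Optional[float]:
    """Strip thousands separators; N/A and empty values become None."""
    value = (value or "").strip().replace(",", "")
    if value.upper() in {"", "N/A", "NA"}:
        return None
    try:
        return float(value)
    except ValueError:
        return None  # anything non-numeric is also treated as missing
```

Troubleshooting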
Character Encoding Issues
Problem: Characters like “Raúl Pérez” appear as “RaðLPA@rez”.

Solution: The scripts automatically try multiple encodings (UTF-8, Latin-1, CP1252, ISO-8859-1), as sketched below. If issues persist, check your CSV file encoding and save it as UTF-8.
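A hedged sketch of that fallback (note that latin-1 decodes any byte stream, so it acts as a catch-all):

```python
ENCODINGS = ["utf-8", "latin-1", "cp1252", "iso-8859-1"]

def read_text(path: str) -> str:
    """Return the file decoded with the first encoding that works."""
    for enc in ENCODINGS:
        try:
            with open(path, encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError:
            continue
    raise ValueError(f"Could not decode {path} with any known encoding")
```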
File Not Found Error
Problem: FileNotFoundError: clientes_prueba_mas_datos.csv

Solution:

- Ensure your CSV files are in the same directory as the Python scripts
- Verify file names match exactly (case-sensitive on Linux/Mac)
- Check for hidden file extensions (.csv.txt)
SQL Connection Failed
Problem: Cannot connect to SQL Server

Solution:
- Verify SQL Server is running (see the check after this list)
- Check connection string and credentials
- Ensure TCP/IP is enabled in SQL Server Configuration Manager
- Verify firewall allows port 1433
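One quick check, assuming a default local instance and the sqlcmd tool on your PATH:

```
sqlcmd -S localhost -Q "SELECT @@VERSION"
```

If this fails, the service itself may be stopped; check the SQL Server service in SQL Server Configuration Manager.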
Foreign Key Constraint Violations
Problem: FK_ventas_cliente constraint violation

Solution: This happens when sales reference customers that don't exist. The normalization scripts validate this automatically. If you still see errors:

- Run normalizar_datos.py before normalizar_ventas.py
- Verify all customer IDs in sales exist in the customer file
- Check the console output for skipped records
Empty Output Files
Problem: Normalized CSV files are empty or missing records

Solution: Check the console output for validation errors. If records = 0, the data may not pass validation. Common issues:
- Missing required fields (nombre, segmento)
- Invalid data types
- Empty rows in source CSV
Script Fails During Insert
Problem: SQL script fails partway through execution

Solution: The script uses SET IDENTITY_INSERT for reference tables. If it fails:

- Manually turn off identity insert (first sketch below)
- Drop and recreate the database (second sketch below)
- Re-run the complete script from the beginning
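Hedged SQL sketches for those two steps; ciudades stands in for whichever table the error message names:

```sql
-- 1. Turn identity insert off for the offending reference table
SET IDENTITY_INSERT dbo.ciudades OFF;
```

```sql
-- 2. Drop the database, then re-run crear_base_datos.sql from the top
USE master;
GO
ALTER DATABASE PruebaTecnicaDNI SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
GO
DROP DATABASE PruebaTecnicaDNI;
GO
```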
Next Steps
Now that your database is set up, you can:

- Query Your Data: Use sample queries to analyze sales patterns (see the starter sketch after this list)
- Build Reports: Connect Power BI or Tableau to visualize trends
- Add More Data: Run the normalization scripts with updated CSV files
- Schedule ETL: Automate the pipeline with cron jobs or Task Scheduler
- Extend the Schema: Add new tables for products, orders, or inventory
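If you need a starting point, a hedged example (join column names are assumptions; adjust to your generated schema):

```sql
-- Sales volume per city: ventas -> clientes -> ciudades
SELECT ci.nombre AS ciudad, COUNT(*) AS num_ventas
FROM ventas AS v
JOIN clientes AS c ON c.id = v.cliente_id
JOIN ciudades AS ci ON ci.id = c.ciudad_id
GROUP BY ci.nombre
ORDER BY num_ventas DESC;
```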
Pro Tip: The normalized structure reduces data redundancy by ~80% and significantly improves query performance through proper indexing and relationships.
Common Patterns
Incrementally Loading New Data
To add new records without recreating the database:
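One hedged pattern: stage the new rows, then insert only IDs that are not already present (clientes_staging and the column list are hypothetical):

```sql
-- Append new customers without touching existing rows
INSERT INTO clientes (id, nombre)          -- illustrative columns
SELECT s.id, s.nombre
FROM clientes_staging AS s                 -- hypothetical staging table
WHERE NOT EXISTS (SELECT 1 FROM clientes AS c WHERE c.id = s.id);
```

Handling Updates to Master Data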
When cities or segments change:
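Because every customer references the lookup row by key, updating the lookup once propagates everywhere. A hedged example (nombre is an assumed column name):

```sql
-- Rename a city in one place; all customers referencing it follow
UPDATE ciudades SET nombre = 'Ciudad de México' WHERE nombre = 'CDMX';
```

Performance Optimization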
The scripts create indexes automatically, but for large datasets:
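One hedged idea is an extra covering index for common query paths (fecha and cliente_id are assumed column names):

```sql
CREATE NONCLUSTERED INDEX IX_ventas_fecha
ON ventas (fecha) INCLUDE (cliente_id);
```

File Structure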
After running all scripts, your project structure should look like:
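A sketch of the layout, reconstructed from the files named in this guide (the SQL generation script itself is not named here):

```
PruebaETL/
├── clientes_prueba_mas_datos.csv   # raw input
├── ventas_prueba_mas_datos.csv     # raw input
├── normalizar_datos.py
├── normalizar_ventas.py
├── crear_base_datos.sql            # generated SQL script
└── cleanData/
    ├── clientes_normalizados.csv
    ├── ciudades.csv
    ├── segmentos.csv
    ├── ventas_normalizadas.csv
    ├── canales.csv
    └── monedas.csv
```

Summary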
You've successfully completed the PruebaETL workflow:

- Normalized messy CSV data with encoding and format issues
- Structured data into a relational schema (3NF)
- Generated a complete SQL Server database with referential integrity
- Loaded data with proper primary and foreign key relationships