Overview
Data validation is a critical component of the ETL process, ensuring that data is accurate, consistent, and fit for purpose before it enters the dimensional model. Our validation approach focuses on cleaning, standardizing, and validating data during the Bronze-to-Silver transformation.All validation rules are applied within the
silver.load_silver() stored procedure during the ETL process.Validation Categories
Format Validation
Trimming whitespace, standardizing formats
Data Standardization
Converting codes to readable values
Null Handling
Managing missing or invalid data
Deduplication
Removing duplicate records
Date Validation
Parsing and validating date formats
Business Rules
Enforcing logical constraints
Customer Information Validation
CRM Customer Data (silver.crm_cust_info)
Whitespace Trimming
Whitespace Trimming
All text fields are trimmed to remove leading and trailing whitespace.Purpose: Ensures consistency in joins and prevents matching issues due to invisible characters.
Marital Status Standardization
Marital Status Standardization
Single-letter codes are converted to full descriptive values.
- Input Values
- Output Values
'S'or's'→ Single'M'or'm'→ Married- Any other value → n/a
Gender Standardization
Gender Standardization
Gender codes are standardized to consistent values.Transformation:
'M'→'Male''F'→'Female'- Unknown/NULL →
'n/a'
Deduplication Logic
Deduplication Logic
When multiple records exist for the same customer, only the most recent is retained.
Product Information Validation
CRM Product Data (silver.crm_prd_info)
Category ID Extraction
Category ID Extraction
Extracts and standardizes category identifiers from product keys.Logic:
- Extract first 5 characters from product key
- Replace hyphens with underscores for consistency
- Extract remaining characters as the cleaned product key
Null Handling for Cost
Null Handling for Cost
Ensures product cost is never null by defaulting to zero.
Product Line Standardization
Product Line Standardization
Converts abbreviated product line codes to full descriptions.
| Input Code | Output Value |
|---|---|
| M | Mountain |
| R | Road |
| S | Other sales |
| T | Touring |
| Other/NULL | n/a |
End Date Calculation
End Date Calculation
Calculates product validity end dates using window functions.Logic:
- For each product version, the end date is one day before the next version’s start date
- This creates non-overlapping validity periods for slowly changing dimensions
- The latest version will have a NULL end date (currently active)
Sales Transaction Validation
Sales Details (silver.crm_sales_details)
This table contains the most complex validation rules due to the critical nature of financial data.
Date Format Validation
Date Format Validation
Converts integer dates (YYYYMMDD format) to proper date types with validation.
Sales Amount Validation
Sales Amount Validation
Ensures sales amounts are mathematically correct and non-negative.Validation Rules:
- If sales amount is NULL, calculate it from quantity × price
- If sales amount is ≤ 0, recalculate it
- If sales amount doesn’t match quantity × price, recalculate it
- Use absolute value of price to handle negative values
This approach ensures data consistency by deriving correct values rather than rejecting records.
Price Validation and Correction
Price Validation and Correction
Handles invalid prices by deriving them from sales and quantity.Correction Logic:
- If price is NULL or zero → Calculate from
sales ÷ quantity - If price is negative → Convert to absolute value
- Otherwise → Keep original price
ERP Data Validation
Customer Demographics (silver.erp_cust_az12)
Customer ID Cleaning
Customer ID Cleaning
Removes prefixes from customer IDs for standardization.Removes the
'NAS' prefix from customer IDs to align with other system identifiers.Birthdate Range Validation
Birthdate Range Validation
Ensures birthdates are within reasonable ranges.Validation:
- Future dates are set to NULL (impossible birthdates)
- Quality tests further validate dates are after 1924
Gender Normalization
Gender Normalization
Standardizes various gender representations.Accepted Inputs:
- Female:
'F','f','FEMALE','female','Female' - Male:
'M','m','MALE','male','Male' - Unknown: Any other value →
'n/a'
Location Data (silver.erp_loc_a101)
Customer ID Formatting
Customer ID Formatting
Removes hyphens from customer IDs for consistency.
Country Standardization
Country Standardization
Converts country codes to full country names.
| Input Code | Output Value |
|---|---|
| USA, US | United States |
| CAN | Canada |
| DE | Germany |
| Empty/NULL | n/a |
| Other codes | Kept as-is (trimmed) |
Product Categories (silver.erp_px_cat_g1v2)
This table currently passes through data without transformation, but quality tests ensure:
- No unwanted whitespace in category fields
- Standardized maintenance status values
Data Quality Best Practices
Use TRIM Consistently
Always trim text fields to prevent hidden whitespace issues in joins and filters.
Standardize Codes
Convert cryptic codes to meaningful values during transformation, not in reporting.
Handle NULLs Explicitly
Use COALESCE or CASE statements to make NULL handling intentional and visible.
Document Assumptions
Comment your validation logic to explain business rules and assumptions.
Fail Gracefully
Return NULL or default values rather than causing errors, but log these occurrences.
Test Thoroughly
Run quality checks after every transformation to catch issues early.
Validation Principles
Validate Early
Apply validation rules as early as possible in the data pipeline (Bronze → Silver transformation).
Be Explicit
Make all data quality rules explicit in code rather than relying on downstream error handling.
Document Exceptions
When data doesn’t meet quality standards, document why you’re correcting it vs. rejecting it.
Common Pitfalls to Avoid
Validation Rule Reference
Quick reference for all validation transformations applied in the Silver layer:| Source Table | Field | Validation Rule | Purpose |
|---|---|---|---|
crm_cust_info | All text | TRIM() | Remove whitespace |
crm_cust_info | cst_marital_status | Code → Description | Standardize values |
crm_cust_info | cst_gndr | Code → Description | Standardize values |
crm_cust_info | Duplicates | ROW_NUMBER() window | Keep latest record |
crm_prd_info | cat_id | Extract & replace | Normalize category ID |
crm_prd_info | prd_cost | COALESCE(cost, 0) | Handle nulls |
crm_prd_info | prd_line | Code → Description | Standardize values |
crm_prd_info | prd_end_dt | LEAD() window | Calculate SCD dates |
crm_sales_details | Date fields | Format validation | Ensure valid dates |
crm_sales_details | sls_sales | Recalculate if invalid | Ensure accuracy |
crm_sales_details | sls_price | Derive or correct | Handle negatives/nulls |
erp_cust_az12 | cid | Remove prefix | Standardize format |
erp_cust_az12 | bdate | Range check | Validate dates |
erp_cust_az12 | gen | Multiple formats → Standard | Normalize gender |
erp_loc_a101 | cid | Remove hyphens | Standardize format |
erp_loc_a101 | cntry | Code → Full name | Expand abbreviations |
Next Steps
Quality Testing
Learn how to run quality tests to verify validation effectiveness
Silver Layer
Deep dive into the Silver layer transformation procedures
Transformations
Explore detailed transformation logic in the Silver schema
ETL Procedures
Review the complete ETL pipeline procedures