Skip to main content

Overview

Data validation is a critical component of the ETL process, ensuring that data is accurate, consistent, and fit for purpose before it enters the dimensional model. Our validation approach focuses on cleaning, standardizing, and validating data during the Bronze-to-Silver transformation.
All validation rules are applied within the silver.load_silver() stored procedure during the ETL process.

Validation Categories

Format Validation

Trimming whitespace, standardizing formats

Data Standardization

Converting codes to readable values

Null Handling

Managing missing or invalid data

Deduplication

Removing duplicate records

Date Validation

Parsing and validating date formats

Business Rules

Enforcing logical constraints

Customer Information Validation

CRM Customer Data (silver.crm_cust_info)

Whitespace Trimming

All text fields are trimmed to remove leading and trailing whitespace.
SELECT
  cst_id, 
  TRIM(cst_key) AS cst_key,
  TRIM(cst_firstname) AS cst_firstname, 
  TRIM(cst_lastname) AS cst_lastname,
  -- ...
FROM bronze.crm_cust_info
Purpose: Ensures consistency in joins and prevents matching issues due to invisible characters.
Single-letter codes are converted to full descriptive values.
CASE UPPER(TRIM(cst_marital_status)) 
  WHEN 'S' THEN 'Single' 
  WHEN 'M' THEN 'Married' 
  ELSE 'n/a'
END AS cst_marital_status
  • 'S' or 's' → Single
  • 'M' or 'm' → Married
  • Any other value → n/a
Gender codes are standardized to consistent values.
CASE UPPER(TRIM(cst_gndr)) 
  WHEN 'M' THEN 'Male' 
  WHEN 'F' THEN 'Female' 
  ELSE 'n/a'
END AS cst_gndr
Transformation:
  • 'M''Male'
  • 'F''Female'
  • Unknown/NULL → 'n/a'
When multiple records exist for the same customer, only the most recent is retained.
SELECT *, 
  ROW_NUMBER() OVER (
    PARTITION BY cst_id 
    ORDER BY cst_create_date DESC
  ) AS flag_last
FROM bronze.crm_cust_info
WHERE flag_last = 1
1

Partition by Customer ID

Group all records by cst_id to identify duplicates.
2

Order by Create Date

Sort records within each group by creation date (most recent first).
3

Select Latest Record

Retain only the record with flag_last = 1 (most recent).

Product Information Validation

CRM Product Data (silver.crm_prd_info)

Extracts and standardizes category identifiers from product keys.
SELECT
  prd_id,
  REPLACE(SUBSTRING(TRIM(prd_key), 1, 5), '-', '_') AS cat_id,
  SUBSTRING(prd_key, 7, LENGTH(prd_key)) AS prd_key,
  -- ...
FROM bronze.crm_prd_info
Logic:
  1. Extract first 5 characters from product key
  2. Replace hyphens with underscores for consistency
  3. Extract remaining characters as the cleaned product key
Ensures product cost is never null by defaulting to zero.
COALESCE(prd_cost, 0) AS prd_cost
While defaulting to zero prevents null values, consider whether zero is a meaningful cost for your business logic. You may want to flag these records for review.
Converts abbreviated product line codes to full descriptions.
CASE UPPER(TRIM(prd_line))
  WHEN 'M' THEN 'Mountain'
  WHEN 'R' THEN 'Road'
  WHEN 'S' THEN 'Other sales'
  WHEN 'T' THEN 'Touring'
  ELSE 'n/a'
END AS prd_line
Input CodeOutput Value
MMountain
RRoad
SOther sales
TTouring
Other/NULLn/a
Calculates product validity end dates using window functions.
LEAD(prd_start_dt) OVER (
  PARTITION BY prd_key 
  ORDER BY prd_start_dt
) - 1 AS prd_end_dt
Logic:
  • For each product version, the end date is one day before the next version’s start date
  • This creates non-overlapping validity periods for slowly changing dimensions
  • The latest version will have a NULL end date (currently active)

Sales Transaction Validation

Sales Details (silver.crm_sales_details)

This table contains the most complex validation rules due to the critical nature of financial data.

Date Format Validation

Converts integer dates (YYYYMMDD format) to proper date types with validation.
-- Order Date Validation
CASE
  WHEN sls_ord_dt IS NULL
    OR sls_ord_dt = 0
    OR LENGTH(sls_ord_dt::TEXT) != 8
  THEN NULL
  ELSE TO_DATE(TRIM(sls_ord_dt::TEXT), 'YYYYMMDD')
END AS sls_ord_dt,

-- Ship Date Validation
CASE 
  WHEN sls_ship_dt IS NULL
    OR sls_ship_dt = 0
    OR LENGTH(sls_ship_dt::TEXT) != 8        
  THEN NULL
  ELSE TO_DATE(TRIM(sls_ship_dt::TEXT), 'YYYYMMDD')
END AS sls_ship_dt,

-- Due Date Validation
CASE
  WHEN sls_due_dt IS NULL 
    OR sls_due_dt = 0
    OR LENGTH(sls_due_dt::TEXT) != 8 
  THEN NULL
  ELSE TO_DATE(sls_due_dt::TEXT, 'YYYYMMDD')
END AS sls_due_dt
1

Check for NULL or Zero

Invalid dates are often represented as NULL or 0 in source systems.
2

Validate Length

Ensure the date string is exactly 8 characters (YYYYMMDD).
3

Parse Date

Convert the validated string to a proper DATE type.
4

Return NULL on Failure

If any validation fails, return NULL rather than causing an error.
Ensures sales amounts are mathematically correct and non-negative.
CASE 
  WHEN sls_sales IS NULL 
    OR sls_sales <= 0 
    OR sls_sales != sls_quantity * ABS(sls_price) 
  THEN sls_quantity * ABS(sls_price)
  ELSE sls_sales
END AS sls_sales
Validation Rules:
  • If sales amount is NULL, calculate it from quantity × price
  • If sales amount is ≤ 0, recalculate it
  • If sales amount doesn’t match quantity × price, recalculate it
  • Use absolute value of price to handle negative values
This approach ensures data consistency by deriving correct values rather than rejecting records.
Handles invalid prices by deriving them from sales and quantity.
CASE 
  WHEN sls_price IS NULL 
    OR sls_price = 0 
  THEN sls_sales / ABS(sls_quantity)
  WHEN sls_price < 0 
  THEN ABS(sls_price)
  ELSE sls_price
END AS sls_price
Correction Logic:
  1. If price is NULL or zero → Calculate from sales ÷ quantity
  2. If price is negative → Convert to absolute value
  3. Otherwise → Keep original price
This assumes that quantity and sales values are more reliable than price. Adjust logic based on your data quality assessment.

ERP Data Validation

Customer Demographics (silver.erp_cust_az12)

Removes prefixes from customer IDs for standardization.
CASE 
  WHEN cid LIKE 'NAS%' 
  THEN SUBSTRING(cid, 4, LENGTH(cid))
  ELSE cid
END AS cid
Removes the 'NAS' prefix from customer IDs to align with other system identifiers.
Ensures birthdates are within reasonable ranges.
CASE 
  WHEN bdate > CURRENT_DATE 
  THEN NULL
  ELSE bdate
END AS bdate
Validation:
  • Future dates are set to NULL (impossible birthdates)
  • Quality tests further validate dates are after 1924
Standardizes various gender representations.
CASE 
  WHEN UPPER(TRIM(gen)) IN ('F', 'FEMALE') 
  THEN 'Female'
  WHEN UPPER(TRIM(gen)) IN ('M', 'MALE') 
  THEN 'Male'
  ELSE 'n/a'
END AS gen
Accepted Inputs:
  • Female: 'F', 'f', 'FEMALE', 'female', 'Female'
  • Male: 'M', 'm', 'MALE', 'male', 'Male'
  • Unknown: Any other value → 'n/a'

Location Data (silver.erp_loc_a101)

Removes hyphens from customer IDs for consistency.
REPLACE(cid, '-', '') AS cid
Converts country codes to full country names.
CASE 
  WHEN TRIM(cntry) IN ('USA', 'US') 
  THEN 'United States'
  WHEN TRIM(cntry) = 'CAN' 
  THEN 'Canada'
  WHEN TRIM(cntry) = 'DE' 
  THEN 'Germany'
  WHEN TRIM(cntry) = '' OR TRIM(cntry) IS NULL 
  THEN 'n/a'
  ELSE TRIM(cntry)
END AS cntry
Input CodeOutput Value
USA, USUnited States
CANCanada
DEGermany
Empty/NULLn/a
Other codesKept as-is (trimmed)

Product Categories (silver.erp_px_cat_g1v2)

This table currently passes through data without transformation, but quality tests ensure:
  • No unwanted whitespace in category fields
  • Standardized maintenance status values
-- Simple pass-through with minimal transformation
INSERT INTO silver.erp_px_cat_g1v2 (id, cat, subcat, manteinance)
SELECT id, cat, subcat, manteinance
FROM bronze.erp_px_cat_g1v2;

Data Quality Best Practices

Use TRIM Consistently

Always trim text fields to prevent hidden whitespace issues in joins and filters.

Standardize Codes

Convert cryptic codes to meaningful values during transformation, not in reporting.

Handle NULLs Explicitly

Use COALESCE or CASE statements to make NULL handling intentional and visible.

Document Assumptions

Comment your validation logic to explain business rules and assumptions.

Fail Gracefully

Return NULL or default values rather than causing errors, but log these occurrences.

Test Thoroughly

Run quality checks after every transformation to catch issues early.

Validation Principles

1

Validate Early

Apply validation rules as early as possible in the data pipeline (Bronze → Silver transformation).
2

Be Explicit

Make all data quality rules explicit in code rather than relying on downstream error handling.
3

Document Exceptions

When data doesn’t meet quality standards, document why you’re correcting it vs. rejecting it.
4

Monitor Trends

Track validation failures over time to identify systemic data quality issues.
5

Collaborate with Source Systems

Work with upstream teams to fix data quality at the source when possible.

Common Pitfalls to Avoid

Anti-patterns to avoid:
  1. Silent Failures: Allowing invalid data to pass through without logging or alerting
  2. Over-correction: Automatically “fixing” data when the issue should be investigated
  3. Inconsistent Rules: Applying different validation logic to similar fields
  4. Missing Documentation: Not explaining why specific validation rules exist
  5. No Monitoring: Failing to track how often validation rules are triggered

Validation Rule Reference

Quick reference for all validation transformations applied in the Silver layer:
Source TableFieldValidation RulePurpose
crm_cust_infoAll textTRIM()Remove whitespace
crm_cust_infocst_marital_statusCode → DescriptionStandardize values
crm_cust_infocst_gndrCode → DescriptionStandardize values
crm_cust_infoDuplicatesROW_NUMBER() windowKeep latest record
crm_prd_infocat_idExtract & replaceNormalize category ID
crm_prd_infoprd_costCOALESCE(cost, 0)Handle nulls
crm_prd_infoprd_lineCode → DescriptionStandardize values
crm_prd_infoprd_end_dtLEAD() windowCalculate SCD dates
crm_sales_detailsDate fieldsFormat validationEnsure valid dates
crm_sales_detailssls_salesRecalculate if invalidEnsure accuracy
crm_sales_detailssls_priceDerive or correctHandle negatives/nulls
erp_cust_az12cidRemove prefixStandardize format
erp_cust_az12bdateRange checkValidate dates
erp_cust_az12genMultiple formats → StandardNormalize gender
erp_loc_a101cidRemove hyphensStandardize format
erp_loc_a101cntryCode → Full nameExpand abbreviations

Next Steps

Quality Testing

Learn how to run quality tests to verify validation effectiveness

Silver Layer

Deep dive into the Silver layer transformation procedures

Transformations

Explore detailed transformation logic in the Silver schema

ETL Procedures

Review the complete ETL pipeline procedures

Build docs developers (and LLMs) love