Data Model

Star Schema Design

The Gold layer implements a star schema - a dimensional modeling approach that organizes data into fact and dimension tables. This design pattern is optimized for analytical queries and business intelligence.

Star schemas are named for their visual appearance: a central fact table connected to multiple dimension tables, resembling a star.

Why Star Schema?

Star schemas provide several advantages for analytical workloads:

Query Performance

Denormalized structure minimizes joins and enables fast aggregation queries, improving dashboard and report performance.

Ease of Understanding

Business users can easily understand the model: facts represent business events, dimensions provide context.

Flexible Analysis

Analysts can slice and dice facts by any combination of dimensions without complex SQL.

BI Tool Compatible

Most BI tools are designed around star schemas, enabling drag-and-drop report building.

Data Model Overview

The data warehouse implements a focused star schema centered on sales analysis:

Fact Table

fact_sales

The central fact table captures sales transactions with numeric measures and foreign keys to dimensions.

Table Purpose

Records individual sales transactions with measures (amounts, quantities, prices) and links to customer and product dimensions for analysis.

Source Data

Built from silver.crm_sales_details joined with dimension tables to resolve foreign keys.

Schema Definition

The fact table is implemented as a view in ~/workspace/source/scripts/gold/ddl_gold.sql:

create view gold.fact_sales as
select
    si.sls_ord_num as order_number,
    pr.product_key,
    cu.customer_key,
    si.sls_ord_dt as order_date,
    si.sls_ship_dt as shipping_date,
    si.sls_sales as sales_amount,
    si.sls_quantity as quantity,
    si.sls_price as price
from silver.crm_sales_details si
left join gold.dim_product pr
    on si.sls_prd_key = pr.product_number
left join gold.dim_customer cu
    on si.sls_cust_id = cu.customer_id;

Column Reference

Keys & Identifiers
Measures
Dates

order_number

varchar

required

Unique identifier for the sales order. Primary key for the fact table.

product_key

integer

required

Foreign key to dim_product. Surrogate key for joining to product dimension.

customer_key

integer

required

Foreign key to dim_customer. Surrogate key for joining to customer dimension.

Use order_date for revenue recognition analysis and shipping_date for operational fulfillment metrics.

Dimension Tables

Dimension tables provide descriptive attributes for filtering, grouping, and labeling fact table measures.

dim_customer

Customer dimension for demographic and segmentation analysis.

View Definition

create view gold.dim_customer as 
select 
    ROW_NUMBER() over (order by cst_id) as customer_key,
    ci.cst_id as customer_id,
    ci.cst_key as customer_number,
    ci.cst_firstname as first_name,
    ci.cst_lastname as last_name,
    ci.cst_marital_status as marital_status,
    case 
        when ci.cst_gndr !='n/a' then ci.cst_gndr 
        else coalesce(ca.gen, 'n/a') 
    end as gender,
    ca.bdate as birth_date,
    ci.cst_create_date as create_date
from silver.crm_cust_info ci
left join silver.erp_cust_az12 ca
    on ci.cst_key = ca.cid
left join silver.erp_loc_a101 la
    on ci.cst_key = la.cid;

Customer Attributes

Identifiers
Demographics
Metadata

customer_key

integer

required

Surrogate key generated by ROW_NUMBER(). Use this for joins to fact tables.

customer_id

integer

required

Natural key from source system. Use for data lineage and debugging.

customer_number

varchar

required

Business-friendly customer identifier displayed in reports.

The gender field implements data quality logic: it uses the CRM value unless it’s ‘n/a’, then falls back to the ERP source using COALESCE. See gold/ddl_gold.sql:10-12.

dim_product

Product dimension for product performance and category analysis.

View Definition

create view gold.dim_product as
select
    row_number() over (order by pn.prd_start_dt, pn.prd_key) as product_key,
    pn.prd_id as product_id,
    pn.prd_key as product_number,
    pn.prd_name as product_name,
    pn.cat_id as category_id,
    pc.cat as category,
    pc.subcat as subcategory,
    pc.manteinance,
    pn.prd_cost as product_cost,
    pn.prd_line as product_line,
    pn.prd_start_dt as product_start_date
from silver.crm_prd_info pn
left join silver.erp_px_cat_g1v2 pc
    on pn.cat_id = pc.id
where pn.prd_end_dt is null;

Product Attributes

Identifiers
Product Details
Categories

product_key

integer

required

Surrogate key generated by ROW_NUMBER(). Use this for joins to fact tables.

product_id

integer

required

Natural key from source system. Use for data lineage and debugging.

product_number

varchar

required

Business-friendly product identifier displayed in reports and catalogs.

The dim_product view filters to only active products using WHERE prd_end_dt IS NULL (see gold/ddl_gold.sql:41). Discontinued products are excluded from the dimension.

Analytical Use Cases

The star schema enables a wide range of analytical queries:

Customer Segmentation Analysis

Analyze sales patterns by customer demographics:

SELECT 
    c.gender,
    c.marital_status,
    COUNT(DISTINCT c.customer_key) as customer_count,
    SUM(f.sales_amount) as total_sales,
    AVG(f.sales_amount) as avg_order_value
FROM gold.fact_sales f
JOIN gold.dim_customer c ON f.customer_key = c.customer_key
GROUP BY c.gender, c.marital_status
ORDER BY total_sales DESC;

Product Performance Analysis

Track product and category performance metrics:

SELECT 
    p.category,
    p.subcategory,
    COUNT(DISTINCT f.order_number) as order_count,
    SUM(f.quantity) as units_sold,
    SUM(f.sales_amount) as revenue,
    AVG(f.price) as avg_price
FROM gold.fact_sales f
JOIN gold.dim_product p ON f.product_key = p.product_key
GROUP BY p.category, p.subcategory
ORDER BY revenue DESC;

Sales Trends Over Time

Examine sales trends and seasonality:

SELECT 
    DATE_TRUNC('month', f.order_date) as month,
    SUM(f.sales_amount) as monthly_sales,
    COUNT(DISTINCT f.order_number) as order_count,
    COUNT(DISTINCT f.customer_key) as unique_customers
FROM gold.fact_sales f
GROUP BY DATE_TRUNC('month', f.order_date)
ORDER BY month;

Customer Lifetime Value

Calculate customer lifetime metrics:

SELECT 
    c.customer_key,
    c.first_name || ' ' || c.last_name as customer_name,
    COUNT(f.order_number) as total_orders,
    SUM(f.sales_amount) as lifetime_value,
    AVG(f.sales_amount) as avg_order_value,
    MIN(f.order_date) as first_order_date,
    MAX(f.order_date) as last_order_date
FROM gold.fact_sales f
JOIN gold.dim_customer c ON f.customer_key = c.customer_key
GROUP BY c.customer_key, c.first_name, c.last_name
ORDER BY lifetime_value DESC
LIMIT 100;

Design Decisions

Surrogate Keys
Views vs Tables
Denormalization
Active Products Only

All dimension tables use surrogate keys (integer sequences) rather than natural keys:

Benefits

Faster joins compared to multi-column or string natural keys
Insulates fact tables from changes to source system keys
Simplifies relationships and improves query performance
Enables consistent key type across all dimensions

Natural keys (customer_id, product_id) are retained as attributes for lineage and debugging.

Best Practices

Always Join Through Keys

Use surrogate keys (customer_key, product_key) for joins, not natural keys. This ensures optimal performance and consistency.

Filter Early

Apply filters on dimensions before joining to facts to reduce the working dataset and improve query performance.

Use Explicit Column Lists

Always specify exact columns needed rather than SELECT *. This improves performance and makes queries self-documenting.

Leverage Aggregations

Pre-aggregate fact measures (SUM, AVG, COUNT) at the grain needed for reporting to improve dashboard responsiveness.

Get Started

Architecture

Setup & Deployment

Data Pipeline

Data Quality

Star Schema Design

Why Star Schema?

Query Performance

Ease of Understanding

Flexible Analysis

BI Tool Compatible

Data Model Overview

Fact Table

fact_sales

Schema Definition

Column Reference

Dimension Tables

dim_customer

Customer Attributes

dim_product

Product Attributes

Analytical Use Cases

Design Decisions

Best Practices

Always Join Through Keys

Filter Early

Use Explicit Column Lists

Leverage Aggregations

Build docs developers (and LLMs) love

Get Started

Architecture

Setup & Deployment

Data Pipeline

Data Quality

​Star Schema Design

​Why Star Schema?

Query Performance

Ease of Understanding

Flexible Analysis

BI Tool Compatible

​Data Model Overview

​Fact Table

​fact_sales

​Schema Definition

​Column Reference

​Dimension Tables

​dim_customer

​Customer Attributes

​dim_product

​Product Attributes

​Analytical Use Cases

​Design Decisions

​Best Practices

Always Join Through Keys

Filter Early

Use Explicit Column Lists

Leverage Aggregations

Build docs developers (and LLMs) love

Star Schema Design

Why Star Schema?

Data Model Overview

Fact Table

fact_sales

Schema Definition

Column Reference

Dimension Tables

dim_customer

Customer Attributes

dim_product

Product Attributes

Analytical Use Cases

Design Decisions

Best Practices