Skip to main content

Star Schema Design

The Gold layer implements a star schema - a dimensional modeling approach that organizes data into fact and dimension tables. This design pattern is optimized for analytical queries and business intelligence.
Star schemas are named for their visual appearance: a central fact table connected to multiple dimension tables, resembling a star.

Why Star Schema?

Star schemas provide several advantages for analytical workloads:

Query Performance

Denormalized structure minimizes joins and enables fast aggregation queries, improving dashboard and report performance.

Ease of Understanding

Business users can easily understand the model: facts represent business events, dimensions provide context.

Flexible Analysis

Analysts can slice and dice facts by any combination of dimensions without complex SQL.

BI Tool Compatible

Most BI tools are designed around star schemas, enabling drag-and-drop report building.

Data Model Overview

The data warehouse implements a focused star schema centered on sales analysis:

Fact Table

fact_sales

The central fact table captures sales transactions with numeric measures and foreign keys to dimensions.

Table Purpose

Records individual sales transactions with measures (amounts, quantities, prices) and links to customer and product dimensions for analysis.
Built from silver.crm_sales_details joined with dimension tables to resolve foreign keys.

Schema Definition

The fact table is implemented as a view in ~/workspace/source/scripts/gold/ddl_gold.sql:
create view gold.fact_sales as
select
    si.sls_ord_num as order_number,
    pr.product_key,
    cu.customer_key,
    si.sls_ord_dt as order_date,
    si.sls_ship_dt as shipping_date,
    si.sls_sales as sales_amount,
    si.sls_quantity as quantity,
    si.sls_price as price
from silver.crm_sales_details si
left join gold.dim_product pr
    on si.sls_prd_key = pr.product_number
left join gold.dim_customer cu
    on si.sls_cust_id = cu.customer_id;

Column Reference

order_number
varchar
required
Unique identifier for the sales order. Primary key for the fact table.
product_key
integer
required
Foreign key to dim_product. Surrogate key for joining to product dimension.
customer_key
integer
required
Foreign key to dim_customer. Surrogate key for joining to customer dimension.
Use order_date for revenue recognition analysis and shipping_date for operational fulfillment metrics.

Dimension Tables

Dimension tables provide descriptive attributes for filtering, grouping, and labeling fact table measures.

dim_customer

Customer dimension for demographic and segmentation analysis.

View Definition

create view gold.dim_customer as 
select 
    ROW_NUMBER() over (order by cst_id) as customer_key,
    ci.cst_id as customer_id,
    ci.cst_key as customer_number,
    ci.cst_firstname as first_name,
    ci.cst_lastname as last_name,
    ci.cst_marital_status as marital_status,
    case 
        when ci.cst_gndr !='n/a' then ci.cst_gndr 
        else coalesce(ca.gen, 'n/a') 
    end as gender,
    ca.bdate as birth_date,
    ci.cst_create_date as create_date
from silver.crm_cust_info ci
left join silver.erp_cust_az12 ca
    on ci.cst_key = ca.cid
left join silver.erp_loc_a101 la
    on ci.cst_key = la.cid;

Customer Attributes

customer_key
integer
required
Surrogate key generated by ROW_NUMBER(). Use this for joins to fact tables.
customer_id
integer
required
Natural key from source system. Use for data lineage and debugging.
customer_number
varchar
required
Business-friendly customer identifier displayed in reports.
The gender field implements data quality logic: it uses the CRM value unless it’s ‘n/a’, then falls back to the ERP source using COALESCE. See gold/ddl_gold.sql:10-12.

dim_product

Product dimension for product performance and category analysis.

View Definition

create view gold.dim_product as
select
    row_number() over (order by pn.prd_start_dt, pn.prd_key) as product_key,
    pn.prd_id as product_id,
    pn.prd_key as product_number,
    pn.prd_name as product_name,
    pn.cat_id as category_id,
    pc.cat as category,
    pc.subcat as subcategory,
    pc.manteinance,
    pn.prd_cost as product_cost,
    pn.prd_line as product_line,
    pn.prd_start_dt as product_start_date
from silver.crm_prd_info pn
left join silver.erp_px_cat_g1v2 pc
    on pn.cat_id = pc.id
where pn.prd_end_dt is null;

Product Attributes

product_key
integer
required
Surrogate key generated by ROW_NUMBER(). Use this for joins to fact tables.
product_id
integer
required
Natural key from source system. Use for data lineage and debugging.
product_number
varchar
required
Business-friendly product identifier displayed in reports and catalogs.
The dim_product view filters to only active products using WHERE prd_end_dt IS NULL (see gold/ddl_gold.sql:41). Discontinued products are excluded from the dimension.

Analytical Use Cases

The star schema enables a wide range of analytical queries:
Analyze sales patterns by customer demographics:
SELECT 
    c.gender,
    c.marital_status,
    COUNT(DISTINCT c.customer_key) as customer_count,
    SUM(f.sales_amount) as total_sales,
    AVG(f.sales_amount) as avg_order_value
FROM gold.fact_sales f
JOIN gold.dim_customer c ON f.customer_key = c.customer_key
GROUP BY c.gender, c.marital_status
ORDER BY total_sales DESC;
Track product and category performance metrics:
SELECT 
    p.category,
    p.subcategory,
    COUNT(DISTINCT f.order_number) as order_count,
    SUM(f.quantity) as units_sold,
    SUM(f.sales_amount) as revenue,
    AVG(f.price) as avg_price
FROM gold.fact_sales f
JOIN gold.dim_product p ON f.product_key = p.product_key
GROUP BY p.category, p.subcategory
ORDER BY revenue DESC;
Calculate customer lifetime metrics:
SELECT 
    c.customer_key,
    c.first_name || ' ' || c.last_name as customer_name,
    COUNT(f.order_number) as total_orders,
    SUM(f.sales_amount) as lifetime_value,
    AVG(f.sales_amount) as avg_order_value,
    MIN(f.order_date) as first_order_date,
    MAX(f.order_date) as last_order_date
FROM gold.fact_sales f
JOIN gold.dim_customer c ON f.customer_key = c.customer_key
GROUP BY c.customer_key, c.first_name, c.last_name
ORDER BY lifetime_value DESC
LIMIT 100;

Design Decisions

All dimension tables use surrogate keys (integer sequences) rather than natural keys:
Benefits
  • Faster joins compared to multi-column or string natural keys
  • Insulates fact tables from changes to source system keys
  • Simplifies relationships and improves query performance
  • Enables consistent key type across all dimensions
Natural keys (customer_id, product_id) are retained as attributes for lineage and debugging.

Best Practices

Always Join Through Keys

Use surrogate keys (customer_key, product_key) for joins, not natural keys. This ensures optimal performance and consistency.

Filter Early

Apply filters on dimensions before joining to facts to reduce the working dataset and improve query performance.

Use Explicit Column Lists

Always specify exact columns needed rather than SELECT *. This improves performance and makes queries self-documenting.

Leverage Aggregations

Pre-aggregate fact measures (SUM, AVG, COUNT) at the grain needed for reporting to improve dashboard responsiveness.

Build docs developers (and LLMs) love