Skip to main content
This walkthrough demonstrates data transformation and analysis using the MotherDuck UI. The tutorial uses a wine list dataset to showcase common data cleaning, transformation, and analytical queries.

Overview

This exercise guides you through:
  • Loading CSV data into MotherDuck
  • Performing exploratory data analysis (EDA)
  • Cleaning and transforming data
  • Building analytical queries
  • Solving a real-world data problem
This walkthrough is designed to be run one query at a time from the MotherDuck UI. While you can run these queries from the DuckDB CLI, the UI provides a better interactive experience.

Prerequisites

  • Access to the MotherDuck UI
  • The winelist_sample.csv file (included in the demo repository)

The Challenge

Exercise Goal: Find the difference in price per bottle between:
  • The highest rated, least expensive bottle
  • The lowest rated, most expensive bottle
For bottles with a vintage of 1990 or later. Rating calculation: Use coalesce(coalesce("WA score","Vinous score"),-1) Price calculation: Use offer_price / qty as price_per_bottle

Step-by-Step Walkthrough

1

Load the data

Start by creating a table from the CSV file:
CREATE OR REPLACE TABLE winelist AS 
SELECT * FROM read_csv_auto(['winelist_sample.csv']);
This uses DuckDB’s read_csv_auto function which automatically detects column types and delimiters.
2

Summarize the dataset

Get a high-level overview of the data:
SUMMARIZE winelist;
This provides statistics for each column including:
  • Data types
  • Null counts
  • Min/max values
  • Distinct value counts
The SUMMARIZE command is a powerful DuckDB feature that gives you instant insights into your dataset’s structure and quality.
3

Explore the data

View the data with some columns removed for readability:
SELECT * EXCLUDE("Wine Name")
FROM winelist;
The EXCLUDE clause is a DuckDB feature that lets you select all columns except specified ones.
4

Filter bad data

Remove rows with invalid vintage years:
SELECT * EXCLUDE("Wine Name")
FROM winelist
WHERE vintage > 1000;
This filters out any vintage values that are clearly incorrect (less than or equal to 1000).
5

Parse unit size

Split the unit size into quantity and volume:
SELECT 
  * EXCLUDE("Unit size"),
  CAST(SUBSTR("Unit size", 1, INSTR("Unit size", 'x') - 1) AS INTEGER) as qty,
  CAST(
    SUBSTR(
      "Unit size",
      INSTR("Unit size", 'x') + 1,
      LENGTH("Unit size") - INSTR("Unit size", 'x') - 2
    ) AS INTEGER
  ) as volume_cl
FROM winelist
WHERE vintage > 1000;
This extracts:
  • qty: Number of bottles (e.g., “6” from “6x75cl”)
  • volume_cl: Volume in centiliters (e.g., “75” from “6x75cl”)
6

Calculate price per bottle

Transform the price field and calculate normalized prices:
SELECT 
  * EXCLUDE("Unit size"),
  CAST(SUBSTR("Unit size", 1, INSTR("Unit size", 'x') - 1) AS INTEGER) as qty,
  CAST(
    SUBSTR(
      "Unit size",
      INSTR("Unit size", 'x') + 1,
      LENGTH("Unit size") - INSTR("Unit size", 'x') - 2
    ) AS INTEGER
  ) as volume_cl,
  CAST(
    REPLACE(REPLACE("Offer price", '$', ''), ',', '') AS DECIMAL(10, 2)
  ) as offer_price,
  offer_price / (qty * volume_cl) * 75 as price_per_75cl,
  offer_price / qty as price_per_bottle
FROM winelist
WHERE vintage > 1000;
This transformation:
  • Removes $ and , from prices
  • Converts to decimal format
  • Calculates price per standard 75cl bottle
  • Calculates price per bottle (regardless of size)
7

Create a clean view

Save the cleaned data for analysis:
CREATE OR REPLACE VIEW winelist_clean AS
SELECT 
  * EXCLUDE("Unit size"),
  CAST(SUBSTR("Unit size", 1, INSTR("Unit size", 'x') - 1) AS INTEGER) as qty,
  CAST(
    SUBSTR(
      "Unit size",
      INSTR("Unit size", 'x') + 1,
      LENGTH("Unit size") - INSTR("Unit size", 'x') - 2
    ) AS INTEGER
  ) as volume_cl,
  CAST(
    REPLACE(REPLACE("Offer price", '$', ''), ',', '') AS DECIMAL(10, 2)
  ) as offer_price,
  offer_price / (qty * volume_cl) * 75 as price_per_75cl,
  offer_price / qty as price_per_bottle
FROM winelist
WHERE vintage > 1000;
8

Solve the exercise

Find the price difference between the best value and worst value bottles:
WITH cte_cheap_but_good AS (
  SELECT * 
  FROM winelist_clean
  WHERE 
    Vintage >= 1990 
    AND ("WA score" IS NOT NULL OR "Vinous score" IS NOT NULL)
  ORDER BY 
    COALESCE(COALESCE("WA score", "Vinous score"), -1) DESC, 
    price_per_bottle
  LIMIT 1
),
cte_expensive_and_bad AS (
  SELECT * 
  FROM winelist_clean
  WHERE 
    Vintage >= 1990 
    AND ("WA score" IS NOT NULL OR "Vinous score" IS NOT NULL)
  ORDER BY 
    COALESCE(COALESCE("WA score", "Vinous score"), -1) ASC, 
    price_per_bottle DESC
  LIMIT 1
)
SELECT 
  a.price_per_bottle - b.price_per_bottle as difference 
FROM cte_cheap_but_good a
JOIN cte_expensive_and_bad b ON true
This query:
  1. Finds the highest-rated, cheapest bottle (1990+)
  2. Finds the lowest-rated, most expensive bottle (1990+)
  3. Calculates the price difference

Key SQL Techniques

This walkthrough demonstrates several important DuckDB/SQL patterns:

String Manipulation

REPLACE(REPLACE("Offer price", '$', ''), ',', '')

Column Selection

SELECT * EXCLUDE("Wine Name")
FROM winelist

Handling Nulls

-- Use the first non-null value from multiple columns
COALESCE(COALESCE("WA score", "Vinous score"), -1)

-- Filter for rows with at least one score
WHERE "WA score" IS NOT NULL OR "Vinous score" IS NOT NULL

Common Table Expressions (CTEs)

WITH cte_cheap_but_good AS (
  -- First subquery
  SELECT ...
),
cte_expensive_and_bad AS (
  -- Second subquery
  SELECT ...
)
SELECT ... 
FROM cte_cheap_but_good a
JOIN cte_expensive_and_bad b ON true
CTEs make complex queries more readable by breaking them into logical steps. They’re particularly useful for comparing different subsets of data.

Exploratory Queries

Once you’ve cleaned the data, try these analytical queries:
SELECT 
  "Wine Name",
  Vintage,
  COALESCE("WA score", "Vinous score") as rating,
  price_per_bottle
FROM winelist_clean
WHERE "WA score" IS NOT NULL OR "Vinous score" IS NOT NULL
ORDER BY rating DESC
LIMIT 10

Tips for Using the MotherDuck UI

1

Run queries one at a time

Execute each query individually to see the results and understand each transformation step.
2

Use the data preview

The UI shows a preview of results automatically. Use this to verify your transformations are working correctly.
3

Save important queries

Bookmark or save queries you want to reuse. The UI provides options to save and organize your SQL.
4

Export results

Once you have the results you need, you can export them to CSV or other formats directly from the UI.
5

Create views for reuse

Use CREATE VIEW to save transformed data that you’ll query multiple times, like the winelist_clean view.

Next Steps

After completing this walkthrough, you can:
  • Load your own CSV files into MotherDuck
  • Apply similar data cleaning techniques to other datasets
  • Build more complex analytical queries
  • Create materialized views for performance
  • Share your queries with team members
Remember that this example uses CREATE OR REPLACE TABLE which will overwrite any existing winelist table. In production environments, use more careful table management strategies.

Build docs developers (and LLMs) love