This example demonstrates how to load TPC-DS data into DuckLake using dbt and create analytical queries in a MotherDuck database. It shows a modern data architecture pattern with DuckLake for storage and MotherDuck for analytics.

Overview

DuckLake is an open lakehouse format, supported by MotherDuck, that provides:
  • Efficient storage for large datasets
  • Direct querying of cloud storage (S3)
  • Integration with dbt for transformations
  • Separation of storage and compute

Architecture

The project uses a two-layer architecture:
  1. Raw Data Layer: TPC-DS tables stored in DuckLake (local tables)
  2. Analytics Layer: Query models stored in MotherDuck database (my_db)

Data Sources

The project uses TPC-DS Scale Factor 100 data:
  • Location: s3://devrel-test-data/tpcds/sf100/
  • Format: Parquet files
  • Tables: 24 TPC-DS tables including customer, store_sales, catalog_sales, web_sales, etc.
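You can sanity-check a source file directly from a DuckDB shell before involving dbt. A minimal example, assuming S3 read access is already configured (for instance via CREATE SECRET or environment credentials):

```sql
-- Inspect the schema of one TPC-DS source file
DESCRIBE SELECT *
FROM read_parquet('s3://devrel-test-data/tpcds/sf100/customer.parquet');
```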

Configuration

profiles.yml

Configure dbt to use DuckLake with MotherDuck:
profiles.yml
dbt_ducklake:
  outputs:
    motherduck:
      type: duckdb
      path: 'md:dbt_ducklake'
      is_ducklake: true
      schema: test
      threads: 4

    local:
      type: duckdb
      path: 'ducklake:local_dev.db'
      threads: 4
      extensions:
        - ducklake
      attach:
        - path: "md:jdw_dev"

  target: motherduck
The is_ducklake: true flag enables DuckLake features in the MotherDuck profile.

Source Configuration

Define TPC-DS tables as sources with S3 locations:
models/tpcds/raw/_sources.yml
sources:
  - name: tpc-ds
    database: jdw_dev
    schema: jdw_tpcds
    meta:
      external_location: |- 
        s3://devrel-test-data/tpcds/sf100/{name}.parquet
    tables:
      - name: call_center
      - name: catalog_page
      - name: catalog_returns
      - name: catalog_sales
      - name: customer_address
      - name: customer_demographics
      - name: customer
      - name: date_dim
      - name: household_demographics
      - name: income_band
      - name: inventory
      - name: item
      - name: promotion
      - name: reason
      - name: ship_mode
      - name: store_returns
      - name: store_sales
      - name: store
      - name: time_dim
      - name: warehouse
      - name: web_page
      - name: web_returns
      - name: web_sales
      - name: web_site
The {name} template in external_location automatically substitutes the table name, making it easy to define many tables with a consistent naming pattern.
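For example, the customer source resolves to s3://devrel-test-data/tpcds/sf100/customer.parquet, which you can preview directly in DuckDB (again assuming S3 access is configured):

```sql
-- Preview the resolved external location for the customer source
SELECT *
FROM read_parquet('s3://devrel-test-data/tpcds/sf100/customer.parquet')
LIMIT 5;
```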

Models

Raw Tables

Create tables from S3 parquet files:
models/tpcds/raw/customer.sql
from {{ source("tpc-ds", "customer") }}
This simple model reads from the source defined in _sources.yml and materializes it as a table in DuckLake.
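An equivalent, more explicit version of the model spells out the select and a per-model materialization. This variant is a sketch; the project may set the materialization in dbt_project.yml instead:

```sql
{{ config(materialized='table') }}

select *
from {{ source("tpc-ds", "customer") }}
```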

Model Configuration

  • Raw tables: Stored in DuckLake with raw schema
  • Query models: Stored in MotherDuck database my_db
  • Materialization: Tables for raw data, views or tables for queries
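These conventions are typically centralized in dbt_project.yml. A sketch of what that configuration could look like (the project key and schema names here are illustrative assumptions, not taken from the example project):

```yaml
models:
  dbt_ducklake:
    tpcds:
      raw:
        +materialized: table
        +schema: raw
      queries:
        +materialized: view
        +database: my_db
```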

Project Structure

models/
├── tpcds/
│   ├── raw/          # TPC-DS base tables (materialized in DuckLake)
│   │   ├── _sources.yml
│   │   ├── call_center.sql
│   │   ├── customer.sql
│   │   └── ... (24 tables total)
│   └── queries/      # TPC-DS analytical queries (in MotherDuck)
│       ├── query_1.sql
│       ├── query_2.sql
│       └── ... (99 queries total)

Running the Project

  1. Install dependencies:

     uv sync

  2. Build the project:

     uv run dbt build

     Your browser will prompt for MotherDuck authentication unless you have token authentication configured.

  3. Verify results. The build will:
     • Create 24 TPC-DS tables in DuckLake from S3 parquet files
     • Execute 99 TPC-DS analytical queries and store results in MotherDuck
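To spot-check the build, you can run a couple of queries from a DuckDB/MotherDuck shell. The relation names below follow this project's configuration and are assumptions:

```sql
-- Raw table materialized in DuckLake
SELECT count(*) FROM raw.customer;

-- Query results stored in MotherDuck
SELECT * FROM my_db.test.query_1 LIMIT 10;
```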

TPC-DS Benchmark

The project includes all 99 standard TPC-DS queries, providing a comprehensive benchmark suite for testing analytical performance across the DuckLake-MotherDuck architecture.
TPC-DS is a decision support benchmark that models complex business analytics. It includes:
  • 24 tables with realistic relationships
  • 99 analytical queries of varying complexity
  • Multiple scale factors (SF100 = ~100GB of data)
  • Queries covering reporting, ad-hoc analysis, and data mining

Use Cases

Direct S3 Analytics
  • Query parquet files directly from S3
  • No data movement required
  • Leverage DuckDB’s fast columnar engine
  • Scale storage independently from compute

dbt Transformation Pipeline
  • Load data from S3 into DuckLake
  • Transform with dbt models
  • Output to MotherDuck for analytics
  • Version and test transformations

Hybrid Architecture
  • Raw data in DuckLake (cheap storage)
  • Transformed data in MotherDuck (fast queries)
  • Mix local and cloud as needed
  • Optimize for cost and performance

Best Practices

When working with large datasets from S3, be mindful of data transfer costs and query performance. DuckLake’s predicate pushdown and column pruning help minimize data scanned.
Use partitioned parquet files in S3 for better query performance. DuckDB can skip entire files based on partition predicates.
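With Hive-style partitioning, DuckDB exposes the partition directories as columns and skips files that cannot match the predicate. A sketch (the bucket, paths, and column names are illustrative):

```sql
SELECT sum(ss_net_paid)
FROM read_parquet('s3://your-bucket/store_sales/*/*.parquet', hive_partitioning = true)
WHERE year = 2002;  -- only files under year=2002/ are read
```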
The S3 parquet files backing the sources are read-only inputs. To update the data, write new parquet files to S3 and re-run the dbt models that materialize them into DuckLake.

Advanced Configuration

Creating a DuckLake Database

Before running dbt, create your DuckLake database in MotherDuck:
CREATE DATABASE dbt_ducklake
    (TYPE ducklake, DATA_PATH 's3://your-bucket/path/');
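Once created, it behaves like any other MotherDuck database:

```sql
USE dbt_ducklake;
SHOW TABLES;
```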

Multiple Storage Backends

DuckLake supports various storage backends:
  • S3 (shown in this example)
  • Local parquet files
  • Azure Blob Storage
  • Google Cloud Storage

Performance Tips

  1. Partition your data: Use Hive-style partitioning in S3 for better pruning
  2. Optimize file sizes: Aim for 100MB-1GB parquet files
  3. Use compression: Snappy or Zstd compression reduces I/O
  4. Filter early: Push predicates down to the source
  5. Materialize frequently used tables: Cache hot data in MotherDuck
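Several of these tips come together when writing data out with DuckDB's COPY statement. A sketch (the bucket path and partition column are illustrative):

```sql
COPY (SELECT * FROM store_sales)
TO 's3://your-bucket/tpcds/store_sales'
(FORMAT parquet, PARTITION_BY (ss_sold_date_sk), COMPRESSION zstd);
```

Note that partitioning by a high-cardinality key such as a date surrogate key produces many small files; pick a partition column whose cardinality matches your target file sizes.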
