DuckLake with dbt (Cloud)

This example demonstrates how to load TPC-DS data into DuckLake using dbt and create analytical queries in a MotherDuck database. It shows a modern data architecture pattern with DuckLake for storage and MotherDuck for analytics.

Overview

DuckLake is MotherDuck’s lakehouse storage format that provides:

Efficient storage for large datasets
Direct querying of cloud storage (S3)
Integration with dbt for transformations
Separation of storage and compute

Architecture

The project uses a two-layer architecture:

Raw Data Layer: TPC-DS tables stored in DuckLake (local tables)
Analytics Layer: Query models stored in MotherDuck database (my_db)

Data Sources

The project uses TPC-DS Scale Factor 100 data:

Location: s3://devrel-test-data/tpcds/sf100/
Format: Parquet files
Tables: 25 TPC-DS tables including customer, store_sales, catalog_sales, web_sales, etc.

Configuration

profiles.yml

Configure dbt to use DuckLake with MotherDuck:

profiles.yml

dbt_ducklake:
  outputs:
    motherduck:
      type: duckdb
      path: 'md:dbt_ducklake'
      is_ducklake: true
      schema: test
      threads: 4

    local:
      type: duckdb
      path: 'ducklake:local_dev.db'
      threads: 4
      extensions:
        - ducklake
      attach:
        - path: "md:jdw_dev"

  target: motherduck

The is_ducklake: true flag enables DuckLake features in the MotherDuck profile.

Source Configuration

Define TPC-DS tables as sources with S3 locations:

models/tpcds/raw/_sources.yml

sources:
  - name: tpc-ds
    database: jdw_dev
    schema: jdw_tpcds
    meta:
      external_location: |- 
        s3://devrel-test-data/tpcds/sf100/{name}.parquet
    tables:
      - name: call_center
      - name: catalog_page
      - name: catalog_returns
      - name: catalog_sales
      - name: customer_address
      - name: customer_demographics
      - name: customer
      - name: date_dim
      - name: household_demographics
      - name: income_band
      - name: inventory
      - name: item
      - name: promotion
      - name: reason
      - name: ship_mode
      - name: store_returns
      - name: store_sales
      - name: store
      - name: time_dim
      - name: warehouse
      - name: web_page
      - name: web_returns
      - name: web_sales
      - name: web_site

The {name} template in external_location automatically substitutes the table name, making it easy to define many tables with a consistent naming pattern.

Models

Raw Tables

Create tables from S3 parquet files:

models/tpcds/raw/customer.sql

from {{ source("tpc-ds", "customer") }}

This simple model reads from the source defined in _sources.yml and materializes it as a table in DuckLake.

Model Configuration

Raw tables: Stored in DuckLake with raw schema
Query models: Stored in MotherDuck database my_db
Materialization: Tables for raw data, views or tables for queries

Project Structure

models/
├── tpcds/
│   ├── raw/          # TPC-DS base tables (materialized in DuckLake)
│   │   ├── _sources.yml
│   │   ├── call_center.sql
│   │   ├── customer.sql
│   │   └── ... (25 tables total)
│   └── queries/      # TPC-DS analytical queries (in MotherDuck)
│       ├── query_1.sql
│       ├── query_2.sql
│       └── ... (99 queries total)

Running the Project

Install dependencies

uv sync

Build the project

uv run dbt build

Your browser will prompt for MotherDuck authentication unless you have token authentication configured.

Verify results

This will:

Create 25 TPC-DS tables in DuckLake from S3 parquet files
Execute 99 TPC-DS analytical queries and store results in MotherDuck

TPC-DS Benchmark

The project includes all 99 standard TPC-DS queries, providing a comprehensive benchmark suite for testing analytical performance across the DuckLake-MotherDuck architecture.

What is TPC-DS?

TPC-DS is a decision support benchmark that models complex business analytics. It includes:

25 tables with realistic relationships
99 analytical queries of varying complexity
Multiple scale factors (SF100 = ~100GB of data)
Queries covering reporting, ad-hoc analysis, and data mining

Use Cases

Data lake analytics

Query parquet files directly from S3
No data movement required
Leverage DuckDB’s fast columnar engine
Scale storage independently from compute

ETL/ELT pipelines

Load data from S3 into DuckLake
Transform with dbt models
Output to MotherDuck for analytics
Version and test transformations

Hybrid architectures

Raw data in DuckLake (cheap storage)
Transformed data in MotherDuck (fast queries)
Mix local and cloud as needed
Optimize for cost and performance

Best Practices

When working with large datasets from S3, be mindful of data transfer costs and query performance. DuckLake’s predicate pushdown and column pruning help minimize data scanned.

Use partitioned parquet files in S3 for better query performance. DuckDB can skip entire files based on partition predicates.

DuckLake tables are read-only from S3. To update data, you need to write new parquet files to S3 and refresh the DuckLake catalog.

Advanced Configuration

Creating a DuckLake Database

Before running dbt, create your DuckLake database in MotherDuck:

CREATE DATABASE dbt_ducklake 
    (TYPE ducklake, DATA_PATH 's3://your-bucket/path/')

Multiple Storage Backends

DuckLake supports various storage backends:

S3 (shown in this example)
Local parquet files
Azure Blob Storage
Google Cloud Storage

Performance Tips

Partition your data: Use Hive-style partitioning in S3 for better pruning
Optimize file sizes: Aim for 100MB-1GB parquet files
Use compression: Snappy or Zstd compression reduces I/O
Filter early: Push predicates down to the source
Materialize frequently used tables: Cache hot data in MotherDuck

Get Started

Data Ingestion

dbt Patterns

Data Replication

Integrations

Overview

Architecture

Data Sources

Configuration

profiles.yml

Source Configuration

Models

Raw Tables

Model Configuration

Project Structure

Running the Project

TPC-DS Benchmark

Use Cases

Best Practices

Advanced Configuration

Creating a DuckLake Database

Multiple Storage Backends

Performance Tips

Build docs developers (and LLMs) love

Get Started

Data Ingestion

dbt Patterns

Data Replication

Integrations

​Overview

​Architecture

​Data Sources

​Configuration

​profiles.yml

​Source Configuration

​Models

​Raw Tables

​Model Configuration

​Project Structure

​Running the Project

​TPC-DS Benchmark

​Use Cases

​Best Practices

​Advanced Configuration

​Creating a DuckLake Database

​Multiple Storage Backends

​Performance Tips

Build docs developers (and LLMs) love

Overview

Architecture

Data Sources

Configuration

profiles.yml

Source Configuration

Models

Raw Tables

Model Configuration

Project Structure

Running the Project

TPC-DS Benchmark

Use Cases

Best Practices

Advanced Configuration

Creating a DuckLake Database

Multiple Storage Backends

Performance Tips