## Overview

DuckLake is a lakehouse storage format, supported by MotherDuck, that provides:

- Efficient storage for large datasets
- Direct querying of cloud storage (S3)
- Integration with dbt for transformations
- Separation of storage and compute
## Architecture

The project uses a two-layer architecture:

- Raw Data Layer: TPC-DS tables stored in DuckLake (local tables)
- Analytics Layer: query models stored in the MotherDuck database (`my_db`)
## Data Sources

The project uses TPC-DS Scale Factor 100 data:

- Location: `s3://devrel-test-data/tpcds/sf100/`
- Format: Parquet files
- Tables: 25 TPC-DS tables, including `customer`, `store_sales`, `catalog_sales`, `web_sales`, etc.
## Configuration

### profiles.yml

Configure dbt to use DuckLake with MotherDuck:
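The original `profiles.yml` sample was not preserved in this excerpt. The following is a hedged sketch of what it might look like, assuming the dbt-duckdb adapter; the profile name, database name, and thread count are placeholders:

```yaml
# Sketch only -- adapt names to your project. Assumes the dbt-duckdb adapter.
ducklake_project:          # hypothetical profile name
  target: prod
  outputs:
    prod:
      type: duckdb
      path: "md:my_db"     # connect to the MotherDuck database my_db
      is_ducklake: true    # enables DuckLake features in the MotherDuck profile
      threads: 4
```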
The `is_ducklake: true` flag enables DuckLake features in the MotherDuck profile.

### Source Configuration
Define the TPC-DS tables as dbt sources with S3 locations in `models/tpcds/raw/_sources.yml`:
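The source file itself is not reproduced in this excerpt. A sketch of `models/tpcds/raw/_sources.yml` might look like the following, assuming dbt-duckdb's `external_location` meta key; the exact glob pattern is an assumption:

```yaml
# Sketch only: two of the 25 TPC-DS tables, with external_location
# pointing each source at its parquet files in S3
version: 2
sources:
  - name: tpcds
    meta:
      external_location: "s3://devrel-test-data/tpcds/sf100/{name}/*.parquet"
    tables:
      - name: customer
      - name: store_sales
```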
## Models

### Raw Tables

Create tables from S3 parquet files in `models/tpcds/raw/customer.sql`:
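The model file is not shown in this excerpt; a minimal sketch of `models/tpcds/raw/customer.sql`, assuming the dbt source is named `tpcds` as in `_sources.yml`:

```sql
-- Sketch only: materialize the customer source as a table in DuckLake
{{ config(materialized='table', schema='raw') }}

select * from {{ source('tpcds', 'customer') }}
```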
The model selects from the `customer` source defined in `_sources.yml` and materializes it as a table in DuckLake.
### Model Configuration

- Raw tables: stored in DuckLake in the `raw` schema
- Query models: stored in the MotherDuck database `my_db`
- Materialization: tables for raw data; views or tables for query models
## Project Structure
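The project tree was not preserved in this excerpt. Only `models/tpcds/raw/_sources.yml` and `models/tpcds/raw/customer.sql` are named in this guide; a plausible layout (everything else is assumed) would be:

```
models/
└── tpcds/
    ├── raw/
    │   ├── _sources.yml
    │   ├── customer.sql
    │   └── ...            # one model per raw TPC-DS table
    └── queries/           # hypothetical home for the 99 query models
        ├── q01.sql
        └── ...
```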
## Running the Project

Build the project:
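The exact command was not preserved; these are the standard dbt CLI invocations (the `--select` path assumes the `models/tpcds/raw/` directory referenced earlier):

```shell
dbt run                              # build all models
dbt test                             # run any defined tests
dbt run --select models/tpcds/raw    # build only the raw layer
```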
## TPC-DS Benchmark

The project includes all 99 standard TPC-DS queries, providing a comprehensive benchmark suite for testing analytical performance across the DuckLake-MotherDuck architecture.

### What is TPC-DS?
TPC-DS is a decision support benchmark that models complex business analytics. It includes:
- 25 tables with realistic relationships
- 99 analytical queries of varying complexity
- Multiple scale factors (SF100 = ~100GB of data)
- Queries covering reporting, ad-hoc analysis, and data mining
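For illustration, here is a query in the TPC-DS style. This is not one of the 99 official queries; the table and column names come from the standard TPC-DS schema:

```sql
-- Revenue by customer birth country: joins a fact table (store_sales)
-- to a dimension (customer), as most TPC-DS queries do
SELECT c.c_birth_country,
       SUM(ss.ss_net_paid) AS total_paid
FROM store_sales ss
JOIN customer c ON ss.ss_customer_sk = c.c_customer_sk
GROUP BY c.c_birth_country
ORDER BY total_paid DESC
LIMIT 10;
```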
## Use Cases

### Data lake analytics
- Query parquet files directly from S3
- No data movement required
- Leverage DuckDB’s fast columnar engine
- Scale storage independently from compute
### ETL/ELT pipelines
- Load data from S3 into DuckLake
- Transform with dbt models
- Output to MotherDuck for analytics
- Version and test transformations
### Hybrid architectures
- Raw data in DuckLake (cheap storage)
- Transformed data in MotherDuck (fast queries)
- Mix local and cloud as needed
- Optimize for cost and performance
## Best Practices
DuckLake tables are read-only from S3. To update data, you need to write new parquet files to S3 and refresh the DuckLake catalog.
## Advanced Configuration

### Creating a DuckLake Database

Before running dbt, create your DuckLake database in MotherDuck:
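The original command is not preserved in this excerpt. In MotherDuck, a DuckLake-backed database is created with SQL along these lines; the exact syntax is an assumption, so check MotherDuck's DuckLake documentation for the current form:

```sql
-- Sketch only: create a DuckLake catalog database in MotherDuck
CREATE DATABASE my_db (TYPE DUCKLAKE);
```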
### Multiple Storage Backends

DuckLake supports various storage backends:

- S3 (shown in this example)
- Local parquet files
- Azure Blob Storage
- Google Cloud Storage
### Performance Tips
- Partition your data: Use Hive-style partitioning in S3 for better pruning
- Optimize file sizes: Aim for 100MB-1GB parquet files
- Use compression: Snappy or Zstd compression reduces I/O
- Filter early: Push predicates down to the source
- Materialize frequently used tables: Cache hot data in MotherDuck
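The partitioning and compression tips above can be combined when writing parquet files back to S3. A sketch using DuckDB's `COPY ... TO`; the bucket name and choice of partition column are illustrative:

```sql
-- Write store_sales as Hive-partitioned, zstd-compressed parquet files
COPY (SELECT * FROM store_sales)
TO 's3://my-bucket/tpcds/store_sales'
(FORMAT PARQUET, PARTITION_BY (ss_sold_date_sk), COMPRESSION ZSTD);
```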