Overview
DuckLake is a DuckDB extension that adds catalog management capabilities:

- Lakehouse-style table management
- Metadata store (SQLite or PostgreSQL)
- Data files in Parquet format
- Local development without cloud overhead
Think of DuckLake as a lightweight lakehouse right on your laptop: Iceberg-like table management without the complexity.
Architecture
This project uses a local-first architecture:

- Data Generation: TPC-H data generated locally with `tpchgen-cli` at scale factor 10
- Storage Layer: Parquet files stored in the `data/` directory
- Catalog Layer: DuckLake manages metadata using SQLite (stored in `ducklake_sqlite.db`)
- Transform Layer: dbt models that read from the local Parquet files and materialize tables in the DuckLake catalog
Data Sources
The project uses TPC-H benchmark data:

- Generated with: `tpchgen-cli` at Scale Factor 10 (~10 GB)
- Location: `data/` directory (local Parquet files)
- Format: Parquet
- Tables: 8 TPC-H tables (`customer`, `lineitem`, `nation`, `orders`, `part`, `partsupp`, `region`, `supplier`)
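The generation step can be sketched as below; the exact flags depend on your `tpchgen-cli` version, so treat them as an assumption and confirm with `tpchgen-cli --help`:

```shell
# Generate TPC-H data as Parquet into data/ (flags assumed; verify with --help)
tpchgen-cli --scale-factor 10 --format parquet --output-dir data
```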
Project Structure
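A sketch of the layout implied by the paths referenced in this guide (your tree will also contain the usual dbt files such as `dbt_project.yml`):

```text
.
├── data/                   # local TPC-H Parquet files (customer.parquet, lineitem.parquet, …)
├── ducklake_sqlite.db      # DuckLake metadata store (SQLite backend)
├── models/
│   └── tpch/
│       └── raw/
│           ├── _sources.yml
│           ├── customer.sql
│           └── …           # one model per TPC-H table
└── profiles.yml
```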
Configuration
profiles.yml - SQLite Backend (Default)
Use SQLite as the metadata store:
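A sketch of a `dbt-duckdb` profile that attaches a SQLite-backed DuckLake catalog; the profile name, `catalog` alias, and file paths are assumptions based on this project's layout:

```yaml
ducklake_local:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: ":memory:"            # transient DuckDB; tables live in the DuckLake catalog
      extensions:
        - ducklake
      attach:
        - path: "ducklake:sqlite:ducklake_sqlite.db"
          alias: catalog
      threads: 4
```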
profiles.yml - PostgreSQL Backend (Advanced)
Use PostgreSQL for multi-user scenarios:
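A sketch of the same profile using a PostgreSQL metadata store instead; the connection string follows DuckLake's `ducklake:postgres:` syntax, while the database name and host are assumptions:

```yaml
ducklake_pg:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: ":memory:"
      extensions:
        - ducklake
      attach:
        - path: "ducklake:postgres:dbname=ducklake_catalog host=localhost"
          alias: catalog
      threads: 4
```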
SQLite vs PostgreSQL backends

SQLite (recommended for single-user):

- Simple setup
- No server required
- Fast for local development
- Entire catalog in one file

PostgreSQL (recommended for teams):

- Supports concurrent access
- Better for team environments
- Requires a PostgreSQL server
- More complex setup
Source Configuration
Define TPC-H tables as sources in `models/tpch/raw/_sources.yml`:
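A sketch of the sources file; the `external_location` meta key is `dbt-duckdb`'s mechanism for mapping a source table to a file path, and the `{name}` templating is assumed to match this project's file naming:

```yaml
version: 2

sources:
  - name: tpch
    meta:
      external_location: "data/{name}.parquet"
    tables:
      - name: customer
      - name: lineitem
      - name: nation
      - name: orders
      - name: part
      - name: partsupp
      - name: region
      - name: supplier
```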
Models
Raw Tables
Create tables from local Parquet files. For example, `models/tpch/raw/customer.sql` reads `data/customer.parquet` and materializes it as a table in the DuckLake catalog.
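A minimal sketch of such a model, assuming the `tpch` source and `raw` schema described in this guide:

```sql
-- models/tpch/raw/customer.sql (sketch)
{{ config(materialized='table', schema='raw') }}

select *
from {{ source('tpch', 'customer') }}
```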
Model Configuration
- Raw tables: materialized as tables in the DuckLake catalog under the `raw` schema
- Source data: read from local Parquet files in the `data/` directory
- Catalog: attached as the `catalog` database in DuckDB
Running the Project
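With the profile and sources in place, a typical run is just the standard dbt commands:

```shell
dbt debug   # verify the connection and profile
dbt run     # build the raw models into the DuckLake catalog
dbt test    # run any schema tests
```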
Querying Your Data
Using dbt
Using DuckDB CLI
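To open a DuckDB REPL with the catalog attached, load the extension and attach by connection string; the `ducklake:sqlite:` syntax and `DATA_PATH` option follow the DuckLake documentation, while the file names and `catalog` alias are this project's (a sketch):

```sql
-- in a duckdb shell
INSTALL ducklake;
LOAD ducklake;
ATTACH 'ducklake:sqlite:ducklake_sqlite.db' AS catalog (DATA_PATH 'data/');

SELECT count(*) FROM catalog.raw.customer;
```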
Using Python
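The same attach works from Python via the `duckdb` package (a sketch; paths and schema names assumed from this project):

```python
import duckdb

con = duckdb.connect()  # in-memory DuckDB; the data lives in the DuckLake catalog
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")
con.execute("ATTACH 'ducklake:sqlite:ducklake_sqlite.db' AS catalog (DATA_PATH 'data/')")

# Query a table materialized by dbt
print(con.execute("SELECT count(*) FROM catalog.raw.customer").fetchone())
```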
Why Use DuckLake Locally?
Catalog Management
Get table versioning and metadata management without setting up a full lakehouse:
- Track table schemas over time
- Manage table metadata separately from data
- Version control your catalog
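For example, DuckLake exposes snapshot metadata you can query directly; the `ducklake_snapshots` function and `AT (VERSION => …)` time-travel syntax below are taken from the DuckLake documentation, but treat the exact names as an assumption for your installed version:

```sql
-- List snapshots recorded in the catalog
SELECT * FROM ducklake_snapshots('catalog');

-- Read a table as of an earlier snapshot (time travel)
SELECT count(*) FROM catalog.raw.customer AT (VERSION => 1);
```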
Development Parity
Match your production lakehouse patterns in local development:
- Same table management patterns as Iceberg/Delta
- Test lakehouse workflows locally
- Avoid surprises when deploying to production
Flexible Backends
Choose the right metadata store for your needs:
- SQLite for simplicity and single-user workflows
- PostgreSQL for multi-user scenarios
- Swap backends without changing your models
Data Isolation
Keep metadata separate from data files:
- Easy to version control catalog
- Separate data and metadata backups
- Clear separation of concerns
Portability
Your entire analytical database fits in a few files:
- Commit to git (for small datasets)
- Share with colleagues
- Easy disaster recovery
TPC-H Benchmark
The TPC-H benchmark is an industry-standard decision support benchmark consisting of:

- 8 tables with realistic relationships
- 22 analytical queries of varying complexity
- Scale Factor 10 generates approximately 10GB of data
- Realistic dataset for testing and development
Best Practices
The DuckLake extension is still evolving. Always use the latest version for bug fixes and new features:
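In practice that means reinstalling the extension periodically; `FORCE INSTALL` re-downloads it, and recent DuckDB releases also support a bulk `UPDATE EXTENSIONS` statement (verify availability for your DuckDB version):

```sql
FORCE INSTALL ducklake;  -- re-download the latest published version
LOAD ducklake;

-- On recent DuckDB releases:
UPDATE EXTENSIONS;
```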
Advanced Usage
Attaching Multiple Catalogs
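You can work with multiple DuckLake catalogs simultaneously by attaching each under its own alias; the aliases and the second catalog's file names below are assumptions:

```sql
ATTACH 'ducklake:sqlite:ducklake_sqlite.db' AS dev_catalog (DATA_PATH 'data/');
ATTACH 'ducklake:sqlite:ducklake_prod.db' AS prod_catalog (DATA_PATH 'prod_data/');

-- Query across catalogs from one session
SELECT count(*) FROM dev_catalog.raw.customer;
SELECT count(*) FROM prod_catalog.raw.customer;
```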
Hybrid Local and Cloud
Combine local DuckLake with MotherDuck:

- Develop locally with DuckLake
- Test against cloud data
- Mix local and cloud tables in queries
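A sketch of mixing the two in one session; `ATTACH 'md:…'` is MotherDuck's standard attach syntax (it requires an authenticated MotherDuck account), and the cloud database and table names are assumptions:

```sql
-- Local DuckLake catalog
ATTACH 'ducklake:sqlite:ducklake_sqlite.db' AS catalog (DATA_PATH 'data/');

-- MotherDuck (cloud)
ATTACH 'md:my_cloud_db' AS cloud;

-- Join local and cloud tables in one query
SELECT c.c_name, o.o_totalprice
FROM catalog.raw.customer AS c
JOIN cloud.main.orders AS o ON o.o_custkey = c.c_custkey
LIMIT 10;
```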