This example demonstrates how to use DuckLake locally with dbt to manage TPC-H benchmark data. DuckLake provides lakehouse-style catalog management on top of local storage - perfect for local development and testing.

Overview

DuckLake is a DuckDB extension that adds catalog management capabilities:
  • Lakehouse-style table management
  • Metadata store (SQLite or PostgreSQL)
  • Data files in parquet format
  • Local development without cloud overhead
Think of DuckLake as a lightweight lakehouse right on your laptop - Iceberg-like table management without the complexity.

Architecture

This project uses a local-first architecture:
  1. Data Generation: TPC-H data generated locally using tpchgen-cli at scale factor 10
  2. Storage Layer: Parquet files stored in the data/ directory
  3. Catalog Layer: DuckLake manages metadata using SQLite (stored in ducklake_sqlite.db)
  4. Transform Layer: dbt models that read from local parquet and materialize tables in the DuckLake catalog

Data Sources

The project uses TPC-H benchmark data:
  • Generated with: tpchgen-cli at Scale Factor 10 (~10GB)
  • Location: data/ directory (local parquet files)
  • Format: Parquet
  • Tables: 8 TPC-H tables (customer, lineitem, nation, orders, part, partsupp, region, supplier)

Project Structure

dbt-local-ducklake/
├── data/                    # TPC-H parquet files
│   ├── customer.parquet
│   ├── lineitem.parquet
│   ├── nation.parquet
│   ├── orders.parquet
│   ├── part.parquet
│   ├── partsupp.parquet
│   ├── region.parquet
│   └── supplier.parquet
├── models/
│   └── tpch/
│       ├── raw/             # TPC-H base tables
│       │   ├── _sources.yml
│       │   ├── customer.sql
│       │   └── ... (8 tables)
│       └── queries/         # Analytical queries
│           └── test.sql
├── ducklake_sqlite.db       # SQLite metadata catalog
├── ducklake_files/          # DuckLake managed parquet files
└── profiles.yml

Configuration

profiles.yml - SQLite Backend (Default)

Use SQLite as the metadata store:
profiles.yml
dbt_local_ducklake:
  outputs:
    local_sqlite:
      type: duckdb
      threads: 4
      extensions:
        - ducklake
        - sqlite
      attach:
        - path: "ducklake:sqlite:ducklake_sqlite.db"
          alias: catalog
          options:
            data_path: ducklake_files
  target: local_sqlite

profiles.yml - PostgreSQL Backend (Advanced)

Use PostgreSQL for multi-user scenarios:
profiles.yml
dbt_local_ducklake:
  outputs:
    local:
      type: duckdb
      threads: 4
      extensions:
        - ducklake
        - postgres
      secrets:
        - name: ducklake_secret
          type: postgres
          host: localhost
          port: 5432
          database: ducklake_catalog
      attach:
        - path: "ducklake:postgres:"
          alias: catalog
          options:
            data_path: data_files
            meta_secret: ducklake_secret
  target: local

SQLite (recommended for single-user):
  • Simple setup
  • No server required
  • Fast for local development
  • Entire catalog in one file

PostgreSQL (for multi-user):
  • Supports concurrent access
  • Better for team environments
  • Requires PostgreSQL server
  • More complex setup

Source Configuration

Define TPC-H tables as sources:
models/tpch/raw/_sources.yml
sources:
  - name: tpch
    meta:
      external_location: |- 
        data/{name}.parquet
    tables:
      - name: customer
      - name: lineitem
      - name: nation
      - name: orders
      - name: part
      - name: partsupp
      - name: region
      - name: supplier
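dbt-duckdb expands the `{name}` placeholder in `external_location` with each source table's name, so every source maps to its own parquet file. Conceptually (a minimal Python sketch of the substitution, not dbt's actual implementation):

```python
# Sketch of how the external_location pattern maps source tables to files.
# The pattern and table names come from _sources.yml above.
pattern = "data/{name}.parquet"
tables = [
    "customer", "lineitem", "nation", "orders",
    "part", "partsupp", "region", "supplier",
]

# dbt-duckdb substitutes each table name into the pattern, and DuckDB
# then reads the resulting parquet path when the source is queried.
paths = {t: pattern.format(name=t) for t in tables}
print(paths["customer"])  # data/customer.parquet
```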

Models

Raw Tables

Create tables from local parquet files:
models/tpch/raw/customer.sql
select * from {{ source('tpch', 'customer') }}
This reads from data/customer.parquet and materializes it as a table in the DuckLake catalog.

Model Configuration

  • Raw tables: Materialized as tables in DuckLake catalog with raw schema
  • Source data: Read from local parquet files in data/ directory
  • Catalog: Attached as catalog database in DuckDB
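The configuration above can be centralized in dbt_project.yml rather than repeated per model. A sketch, assuming the model paths match the project structure shown earlier:

```yaml
# dbt_project.yml (sketch): materialize everything under models/tpch/raw
# as tables in the "raw" schema of the attached DuckLake catalog.
models:
  dbt_local_ducklake:
    tpch:
      raw:
        +materialized: table
        +schema: raw
        +database: catalog   # alias of the DuckLake catalog from profiles.yml
```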

Running the Project

Step 1: Generate TPC-H data

If you don’t have the data files yet:
# Install tpchgen-cli
pip install tpchgen-cli

# Generate TPC-H data at scale factor 10
tpchgen-cli --scale-factor 10 --output-dir data --format=parquet

Step 2: Install dbt dependencies

uv sync

Step 3: Build the project

uv run dbt build
This will:
  • Read TPC-H parquet files from the data/ directory
  • Create 8 TPC-H tables in the DuckLake catalog
  • Store metadata in SQLite and data files in ducklake_files/
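Because the catalog is a plain SQLite file, you can inspect what dbt build created using only Python's standard library. A read-only sketch (the ducklake_* metadata table names it will print come from the DuckLake specification):

```python
import sqlite3

def list_catalog_tables(db_path: str) -> list[str]:
    """Return the table names in a DuckLake SQLite metadata catalog."""
    # Open read-only so inspection can't corrupt dbt's catalog.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = conn.execute(
            "select name from sqlite_master where type = 'table' order by name"
        ).fetchall()
        return [name for (name,) in rows]
    finally:
        conn.close()

# Usage against this project's catalog, e.g.:
# print(list_catalog_tables("ducklake_sqlite.db"))
```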

Querying Your Data

Using dbt

# Run all models
uv run dbt run

# Run specific models
uv run dbt run --select customer

# Test your models
uv run dbt test

Using DuckDB CLI

Open a DuckDB REPL with the catalog attached:
duckdb -c "INSTALL ducklake; LOAD ducklake; ATTACH 'ducklake:sqlite:ducklake_sqlite.db' AS catalog; USE catalog; SELECT * FROM raw.customer LIMIT 10;"

Using Python

import duckdb

conn = duckdb.connect()
conn.execute("INSTALL ducklake")
conn.execute("LOAD ducklake")
conn.execute("ATTACH 'ducklake:sqlite:ducklake_sqlite.db' AS catalog")

# Query your data
result = conn.execute("SELECT * FROM catalog.raw.customer LIMIT 10").fetchdf()
print(result)

Why Use DuckLake Locally?

Get table versioning and metadata management without setting up a full lakehouse:
  • Track table schemas over time
  • Manage table metadata separately from data
  • Version control your catalog
Match your production lakehouse patterns in local development:
  • Same table management patterns as Iceberg/Delta
  • Test lakehouse workflows locally
  • Avoid surprises when deploying to production
Choose the right metadata store for your needs:
  • SQLite for simplicity and single-user workflows
  • PostgreSQL for multi-user scenarios
  • Swap backends without changing your models
Keep metadata separate from data files:
  • Easy to version control catalog
  • Separate data and metadata backups
  • Clear separation of concerns
Your entire analytical database fits in a few files:
  • Commit to git (for small datasets)
  • Share with colleagues
  • Easy disaster recovery

TPC-H Benchmark

The TPC-H benchmark is an industry-standard decision support benchmark consisting of:
  • 8 tables with realistic relationships
  • 22 analytical queries of varying complexity
  • Scale Factor 10 generates approximately 10GB of data
  • Realistic dataset for testing and development
TPC-H is excellent for testing dbt patterns, performance tuning, and learning SQL analytics without requiring cloud resources.
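The queries/ directory is a natural home for models based on the 22 TPC-H queries. A sketch of a simplified Q1-style pricing summary over the raw lineitem model (column names follow the standard TPC-H schema; the model filename is illustrative):

```sql
-- models/tpch/queries/pricing_summary.sql (sketch, simplified TPC-H Q1)
select
    l_returnflag,
    l_linestatus,
    sum(l_quantity)                                       as sum_qty,
    sum(l_extendedprice)                                  as sum_base_price,
    sum(l_extendedprice * (1 - l_discount))               as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity)                                       as avg_qty,
    count(*)                                              as count_order
from {{ ref('lineitem') }}
where l_shipdate <= date '1998-09-02'
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus
```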

Best Practices

Don’t commit large Parquet files to git. Use .gitignore to exclude the data/ and ducklake_files/ directories.
For development, use a smaller scale factor (SF 1 or SF 0.1) to iterate quickly; reserve SF 10 for realistic performance testing.
The DuckLake extension is still evolving. Always use the latest version for bug fixes and new features:
extensions:
  - name: ducklake
    repo: core_nightly  # Use nightly builds for latest features

Advanced Usage

Attaching Multiple Catalogs

You can work with multiple DuckLake catalogs simultaneously:
attach:
  - path: "ducklake:sqlite:catalog_dev.db"
    alias: dev_catalog
  - path: "ducklake:sqlite:catalog_prod.db"
    alias: prod_catalog

Hybrid Local and Cloud

Combine local DuckLake with MotherDuck:
attach:
  - path: "ducklake:sqlite:ducklake_sqlite.db"
    alias: catalog
  - path: "md:jdw_ducklake"
    is_ducklake: true
    alias: cloud_catalog
This allows you to:
  • Develop locally with DuckLake
  • Test against cloud data
  • Mix local and cloud tables in queries
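With both catalogs attached, a single query can reference tables from each by alias. A sketch (the cloud_catalog schema and table names here are hypothetical):

```sql
-- Join a local DuckLake table with a MotherDuck table (sketch;
-- cloud_catalog.main.orders is a hypothetical cloud-side table).
select c.c_name, count(*) as order_count
from catalog.raw.customer as c
join cloud_catalog.main.orders as o
  on o.o_custkey = c.c_custkey
group by c.c_name
order by order_count desc
limit 10;
```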
