Overview
DuckLake is a DuckDB extension that adds catalog management capabilities:

- Lakehouse-style table management
- Metadata store (SQLite or PostgreSQL)
- Data files in Parquet format
- Local development without cloud overhead
Think of DuckLake as a lightweight lakehouse right on your laptop: Iceberg-like table management without the complexity.
Architecture
This project uses a local-first architecture:

- Data Generation: TPC-H data generated locally with `tpchgen-cli` at scale factor 10
- Storage Layer: Parquet files stored in the `data/` directory
- Catalog Layer: DuckLake manages metadata using SQLite (stored in `ducklake_sqlite.db`)
- Transform Layer: dbt models that read from the local Parquet files and materialize tables in the DuckLake catalog
Data Sources
The project uses TPC-H benchmark data:

- Generated with: `tpchgen-cli` at Scale Factor 10 (~10 GB)
- Location: `data/` directory (local Parquet files)
- Format: Parquet
- Tables: 8 TPC-H tables (`customer`, `lineitem`, `nation`, `orders`, `part`, `partsupp`, `region`, `supplier`)
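The generation step can be sketched as below; the exact flags depend on your `tpchgen-cli` version, so treat them as an assumption and confirm with `tpchgen-cli --help`:

```shell
# Generate TPC-H data as Parquet into data/ (flags assumed; verify with --help)
tpchgen-cli --scale-factor 10 --format parquet --output-dir data
```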
Project Structure
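A sketch of the layout implied by the paths referenced in this guide (your tree will also contain the usual dbt files such as `dbt_project.yml`):

```text
.
├── data/                   # local TPC-H Parquet files (customer.parquet, lineitem.parquet, …)
├── ducklake_sqlite.db      # DuckLake metadata store (SQLite backend)
├── models/
│   └── tpch/
│       └── raw/
│           ├── _sources.yml
│           ├── customer.sql
│           └── …           # one model per TPC-H table
└── profiles.yml
```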
Configuration
profiles.yml - SQLite Backend (Default)
Use SQLite as the metadata store:
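A sketch of a `dbt-duckdb` profile that attaches a SQLite-backed DuckLake catalog; the profile name, `catalog` alias, and file paths are assumptions based on this project's layout:

```yaml
ducklake_local:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: ":memory:"            # transient DuckDB; tables live in the DuckLake catalog
      extensions:
        - ducklake
      attach:
        - path: "ducklake:sqlite:ducklake_sqlite.db"
          alias: catalog
      threads: 4
```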
profiles.yml - PostgreSQL Backend (Advanced)
Use PostgreSQL for multi-user scenarios:
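A sketch of the same profile using a PostgreSQL metadata store instead; the connection string follows DuckLake's `ducklake:postgres:` syntax, while the database name and host are assumptions:

```yaml
ducklake_pg:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: ":memory:"
      extensions:
        - ducklake
      attach:
        - path: "ducklake:postgres:dbname=ducklake_catalog host=localhost"
          alias: catalog
      threads: 4
```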
SQLite vs PostgreSQL backends

SQLite (recommended for single-user):

- Simple setup
- No server required
- Fast for local development
- Entire catalog in one file

PostgreSQL (recommended for teams):

- Supports concurrent access
- Better for team environments
- Requires a PostgreSQL server
- More complex setup
Source Configuration
Define TPC-H tables as sources in `models/tpch/raw/_sources.yml`:
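A sketch of the sources file; the `external_location` meta key is `dbt-duckdb`'s mechanism for mapping a source table to a file path, and the `{name}` templating is assumed to match this project's file naming:

```yaml
version: 2

sources:
  - name: tpch
    meta:
      external_location: "data/{name}.parquet"
    tables:
      - name: customer
      - name: lineitem
      - name: nation
      - name: orders
      - name: part
      - name: partsupp
      - name: region
      - name: supplier
```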
Models
Raw Tables
Create tables from local Parquet files. For example, `models/tpch/raw/customer.sql` reads `data/customer.parquet` and materializes it as a table in the DuckLake catalog.
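A minimal sketch of such a model, assuming the `tpch` source and `raw` schema described in this guide:

```sql
-- models/tpch/raw/customer.sql (sketch)
{{ config(materialized='table', schema='raw') }}

select *
from {{ source('tpch', 'customer') }}
```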
Model Configuration
- Raw tables: materialized as tables in the DuckLake catalog under the `raw` schema
- Source data: read from local Parquet files in the `data/` directory
- Catalog: attached as the `catalog` database in DuckDB
Running the Project
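With the profile and sources in place, a typical run is just the standard dbt commands:

```shell
dbt debug   # verify the connection and profile
dbt run     # build the raw models into the DuckLake catalog
dbt test    # run any schema tests
```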
Querying Your Data
Using dbt
Using DuckDB CLI
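To open a DuckDB REPL with the catalog attached, load the extension and attach by connection string; the `ducklake:sqlite:` syntax and `DATA_PATH` option follow the DuckLake documentation, while the file names and `catalog` alias are this project's (a sketch):

```sql
-- in a duckdb shell
INSTALL ducklake;
LOAD ducklake;
ATTACH 'ducklake:sqlite:ducklake_sqlite.db' AS catalog (DATA_PATH 'data/');

SELECT count(*) FROM catalog.raw.customer;
```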
Using Python
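The same attach works from Python via the `duckdb` package (a sketch; paths and schema names assumed from this project):

```python
import duckdb

con = duckdb.connect()  # in-memory DuckDB; the data lives in the DuckLake catalog
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")
con.execute("ATTACH 'ducklake:sqlite:ducklake_sqlite.db' AS catalog (DATA_PATH 'data/')")

# Query a table materialized by dbt
print(con.execute("SELECT count(*) FROM catalog.raw.customer").fetchone())
```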
Why Use DuckLake Locally?
Catalog Management
Get table versioning and metadata management without setting up a full lakehouse:
- Track table schemas over time
- Manage table metadata separately from data
- Version control your catalog
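For example, DuckLake exposes snapshot metadata you can query directly; the `ducklake_snapshots` function and `AT (VERSION => …)` time-travel syntax below are taken from the DuckLake documentation, but treat the exact names as an assumption for your installed version:

```sql
-- List snapshots recorded in the catalog
SELECT * FROM ducklake_snapshots('catalog');

-- Read a table as of an earlier snapshot (time travel)
SELECT count(*) FROM catalog.raw.customer AT (VERSION => 1);
```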
Development Parity
Match your production lakehouse patterns in local development:
- Same table management patterns as Iceberg/Delta
- Test lakehouse workflows locally
- Avoid surprises when deploying to production
Flexible Backends
Choose the right metadata store for your needs:
- SQLite for simplicity and single-user workflows
- PostgreSQL for multi-user scenarios
- Swap backends without changing your models
Data Isolation
Keep metadata separate from data files:
- Easy to version control catalog
- Separate data and metadata backups
- Clear separation of concerns
Portability
Your entire analytical database fits in a few files:
- Commit to git (for small datasets)
- Share with colleagues
- Easy disaster recovery
TPC-H Benchmark
The TPC-H benchmark is an industry-standard decision support benchmark consisting of:

- 8 tables with realistic relationships
- 22 analytical queries of varying complexity
- Scale Factor 10 generates approximately 10GB of data
- Realistic dataset for testing and development
Best Practices
The DuckLake extension is still evolving. Always use the latest version for bug fixes and new features:
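In practice that means reinstalling the extension periodically; `FORCE INSTALL` re-downloads it, and recent DuckDB releases also support a bulk `UPDATE EXTENSIONS` statement (verify availability for your DuckDB version):

```sql
FORCE INSTALL ducklake;  -- re-download the latest published version
LOAD ducklake;

-- On recent DuckDB releases:
UPDATE EXTENSIONS;
```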
Advanced Usage
Attaching Multiple Catalogs
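You can work with multiple DuckLake catalogs simultaneously by attaching each under its own alias; the aliases and the second catalog's file names below are assumptions:

```sql
ATTACH 'ducklake:sqlite:ducklake_sqlite.db' AS dev_catalog (DATA_PATH 'data/');
ATTACH 'ducklake:sqlite:ducklake_prod.db' AS prod_catalog (DATA_PATH 'prod_data/');

-- Query across catalogs from one session
SELECT count(*) FROM dev_catalog.raw.customer;
SELECT count(*) FROM prod_catalog.raw.customer;
```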
Hybrid Local and Cloud
Combine local DuckLake with MotherDuck:

- Develop locally with DuckLake
- Test against cloud data
- Mix local and cloud tables in queries
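A sketch of mixing the two in one session; `ATTACH 'md:…'` is MotherDuck's standard attach syntax (it requires an authenticated MotherDuck account), and the cloud database and table names are assumptions:

```sql
-- Local DuckLake catalog
ATTACH 'ducklake:sqlite:ducklake_sqlite.db' AS catalog (DATA_PATH 'data/');

-- MotherDuck (cloud)
ATTACH 'md:my_cloud_db' AS cloud;

-- Join local and cloud tables in one query
SELECT c.c_name, o.o_totalprice
FROM catalog.raw.customer AS c
JOIN cloud.main.orders AS o ON o.o_custkey = c.c_custkey
LIMIT 10;
```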