# DuckDB

DuckDB is an in-process SQL OLAP database management system designed for analytical query workloads. It is lightweight, fast, and requires no separate server process.
## Install dlt with DuckDB

To use DuckDB as a destination, install dlt with the DuckDB extra:

```shell
pip install "dlt[duckdb]"
```
## Quick Start

Here's a simple example to get you started:

```python
import dlt

# Define your data source
@dlt.resource
def my_data():
    yield {"id": 1, "name": "Alice"}
    yield {"id": 2, "name": "Bob"}

# Create the pipeline
pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination="duckdb",
    dataset_name="my_dataset",
)

# Run the pipeline
info = pipeline.run(my_data())
print(info)
```
## Configuration

### Database Path

DuckDB can use a file-based database or an in-memory database. By default, dlt creates a file-based database; the credentials also accept a custom database path or an already open DuckDB connection.

File-based database (the default):

```python
import dlt

pipeline = dlt.pipeline(
    destination="duckdb",
    dataset_name="my_dataset",
)
# Creates a .duckdb file in the pipeline directory
```

Note: using `:memory:` as the database path is not supported by dlt, as the database would be lost after the pipeline completes.
### Basic Configuration

- `database`: path to the DuckDB database file, or the special value `:pipeline:` to use the pipeline working directory
- `read_only`: open the database in read-only mode
- `create_indexes`: whether to create unique indexes (this slows loading down significantly)
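For example, to keep the database file inside the pipeline working directory, the special `:pipeline:` value can be set in `.dlt/secrets.toml` (a minimal sketch of the option described above):

```toml
[destination.duckdb.credentials]
database = ":pipeline:"
```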
### Advanced Configuration

- `extensions`: list of DuckDB extensions to load on each connection (e.g., `["httpfs", "parquet"]`)
- `global_config`: global configuration settings applied once on each newly opened connection
- `local_config`: local configuration settings applied to each cursor connection
- `pragmas`: list of PRAGMA statements applied to each cursor connection
## Using DuckDB Credentials

Configure DuckDB in `.dlt/secrets.toml`:

```toml
[destination.duckdb.credentials]
database = "my_pipeline.duckdb"
read_only = false
```

With extensions and configuration:

```toml
[destination.duckdb.credentials]
database = "my_pipeline.duckdb"
extensions = ["httpfs", "parquet"]

[destination.duckdb.credentials.global_config]
memory_limit = "4GB"
threads = 4

[destination.duckdb.credentials.local_config]
max_memory = "2GB"
```
## Data Loading

DuckDB provides excellent performance for analytical queries and supports nested data structures. Supported loader file formats are JSONL (the default), Parquet (recommended), and CSV:

```python
import dlt

pipeline = dlt.pipeline(
    destination="duckdb",
    dataset_name="my_dataset",
)
# JSONL is the default format
```

DuckDB works exceptionally well with Parquet files due to its columnar storage engine.
## Write Dispositions

DuckDB supports all write dispositions (`append`, `replace`, and `merge`):

```python
@dlt.resource(write_disposition="append")
def append_data():
    yield {"id": 1, "value": "new"}
```
## Data Types

DuckDB data type mapping:

| dlt Type | DuckDB Type |
|---|---|
| text | VARCHAR |
| double | DOUBLE |
| bool | BOOLEAN |
| timestamp | TIMESTAMP WITH TIME ZONE |
| date | DATE |
| time | TIME |
| bigint | BIGINT |
| binary | BLOB |
| decimal | DECIMAL |
| json | JSON |
## Nested Data Support

DuckDB has excellent support for nested data structures:

```python
import dlt

@dlt.resource
def nested_data():
    yield {
        "id": 1,
        "user": {
            "name": "Alice",
            "age": 30
        },
        "tags": ["python", "data"]
    }

pipeline = dlt.pipeline(destination="duckdb", dataset_name="my_dataset")
pipeline.run(nested_data())
```

By default, dlt flattens nested objects into columns (e.g. `user__name`) and unpacks lists into child tables, so the loaded data is queried like this:

```sql
SELECT
    id,
    user__name,
    user__age
FROM nested_data;

-- list items land in a child table
SELECT value FROM nested_data__tags;
```
## Integer Precision

DuckDB supports various integer sizes. dlt selects them based on the `precision` hint of a `bigint` column; plain Python integers load as BIGINT by default:

```python
import dlt

@dlt.resource(columns={
    "tiny": {"data_type": "bigint", "precision": 8},     # TINYINT (8-bit)
    "small": {"data_type": "bigint", "precision": 16},   # SMALLINT (16-bit)
    "medium": {"data_type": "bigint", "precision": 32},  # INTEGER (32-bit)
    "big": {"data_type": "bigint"},                      # BIGINT (64-bit, default)
})
def integer_types():
    yield {
        "tiny": 100,
        "small": 10000,
        "medium": 100000,
        "big": 10**15,
    }
```
## Advanced Features

### Extensions

Load DuckDB extensions for additional functionality:

```python
import dlt
from dlt.destinations import duckdb

pipeline = dlt.pipeline(
    destination=duckdb(
        extensions=["httpfs", "parquet", "json"]
    ),
    dataset_name="my_dataset",
)
```

Common extensions:

- `httpfs`: read files over HTTP and from S3
- `parquet`: enhanced Parquet support
- `json`: JSON functions
- `postgres_scanner`: query PostgreSQL databases
### Global Configuration

Set global DuckDB settings:

```python
import dlt
from dlt.destinations import duckdb

pipeline = dlt.pipeline(
    destination=duckdb(
        global_config={
            "memory_limit": "4GB",
            "threads": 4,
            "default_order": "DESC"
        }
    ),
    dataset_name="my_dataset",
)
```
### Pragmas

Apply PRAGMA statements:

```python
import dlt
from dlt.destinations import duckdb

pipeline = dlt.pipeline(
    destination=duckdb(
        pragmas=["enable_profiling", "memory_limit='2GB'"]
    ),
    dataset_name="my_dataset",
)
```
## Querying DuckDB

Access the DuckDB connection to run queries:

```python
import dlt

pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination="duckdb",
    dataset_name="my_dataset",
)

# Load data
pipeline.run(my_data())

# Query the data
with pipeline.sql_client() as client:
    result = client.execute_sql("SELECT * FROM my_data LIMIT 10")
    for row in result:
        print(row)
```
Or use the DuckDB connection directly:

```python
import dlt
import duckdb

pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination="duckdb",
    dataset_name="my_dataset",
)
pipeline.run(my_data())

# Get the database path
db_path = pipeline.pipeline_name + ".duckdb"

# Connect and query
conn = duckdb.connect(db_path)
df = conn.execute("SELECT * FROM my_dataset.my_data").df()
print(df.head())
```
## Memory Management

Control memory usage:

```python
import dlt
from dlt.destinations import duckdb

pipeline = dlt.pipeline(
    destination=duckdb(
        global_config={"memory_limit": "2GB"}
    ),
    dataset_name="my_dataset",
)
```
## Parallel Processing

Adjust the thread count:

```python
import dlt
from dlt.destinations import duckdb

pipeline = dlt.pipeline(
    destination=duckdb(
        global_config={"threads": 8}
    ),
    dataset_name="my_dataset",
)
```

Use Parquet for the best loading performance:

```python
import dlt

pipeline = dlt.pipeline(
    destination="duckdb",
    dataset_name="my_dataset",
)
pipeline.run(my_data(), loader_file_format="parquet")
```
## Pandas

Convert query results to pandas DataFrames:

```python
import dlt
import duckdb

pipeline = dlt.pipeline(
    destination="duckdb",
    dataset_name="my_dataset",
)
pipeline.run(my_data())

conn = duckdb.connect(f"{pipeline.pipeline_name}.duckdb")
df = conn.execute("SELECT * FROM my_dataset.my_data").df()
print(df.describe())
```
## Arrow

Work with Arrow tables:

```python
import dlt
import duckdb

pipeline = dlt.pipeline(
    destination="duckdb",
    dataset_name="my_dataset",
)
pipeline.run(my_data())

conn = duckdb.connect(f"{pipeline.pipeline_name}.duckdb")
arrow_table = conn.execute("SELECT * FROM my_dataset.my_data").arrow()
print(arrow_table.schema)
```
## Local Development

DuckDB is perfect for local development and testing:

```python
import dlt

# Development: use DuckDB locally
if __name__ == "__main__":
    pipeline = dlt.pipeline(
        destination="duckdb",
        dataset_name="dev_dataset",
    )
    pipeline.run(my_data())

    # Query and verify
    with pipeline.sql_client() as client:
        result = client.execute_sql("SELECT COUNT(*) FROM my_data")
        print(f"Loaded {result[0][0]} rows")
```
## Limitations

- DuckDB is single-process (not suitable for concurrent writes)
- Database files may not be compatible across different DuckDB versions
- Limited support for concurrent readers in older versions
- Not recommended for production OLTP workloads
## Use Cases

DuckDB is ideal for:

- **Local development and testing**: fast iteration without infrastructure
- **Data analysis**: excellent analytical query performance
- **ETL pipelines**: efficient data transformations
- **Embedded analytics**: no separate server process needed
- **Prototyping**: quick setup and exploration
## Additional Resources

- DuckDB Docs: official DuckDB documentation
- DuckDB Extensions: available DuckDB extensions