Overview
The DLT pipeline allows you to:
- Extract data from specified PostgreSQL tables in parallel
- Transform data during the normalization phase
- Load data into MotherDuck with configurable batch sizes
- Track detailed metrics about each pipeline run
Requirements
- Python 3.11 or higher
- PostgreSQL database (source)
- MotherDuck account and token
Installation
Install uv (recommended)
uv is a fast Python package installer written in Rust. Install it with `curl -LsSf https://astral.sh/uv/install.sh | sh` (or `pip install uv`).
Install dependencies
Create a pyproject.toml file with the following dependencies:
pyproject.toml
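The original file contents were not preserved; a minimal sketch based on the dependency list below (project name and versions are illustrative; pin versions you have tested):

```toml
[project]
name = "postgres-to-motherduck"   # illustrative project name
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "dlt[motherduck]",
    "connectorx",
    "psycopg2-binary",
    "sqlalchemy",
    "humanize",
]
```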
Why these dependencies?
- dlt[motherduck]: Core DLT library with MotherDuck support
- connectorx: High-performance database connector for PostgreSQL
- psycopg2-binary: PostgreSQL adapter for Python
- sqlalchemy: SQL toolkit and ORM
- humanize: Human-readable output formatting
Configuration
Create a .dlt/config.toml file to control pipeline behavior:
Runtime Settings
Control log levels and telemetry:
.dlt/config.toml
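The original snippet was not preserved; a minimal sketch using dlt's standard runtime keys (values are illustrative):

```toml
[runtime]
log_level = "WARNING"      # one of DEBUG, INFO, WARNING, ERROR
dlthub_telemetry = false   # disable anonymous usage telemetry
```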
Source Configuration
Define which PostgreSQL schema and tables to extract:
.dlt/config.toml
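The original snippet was not preserved; a sketch assuming dlt's built-in sql_database source (the schema and table names are placeholders, not from the article):

```toml
[sources.sql_database]
schema = "public"                                 # placeholder schema
table_names = ["orders", "customers", "line_items"]  # placeholder tables
```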
Connection Pool Configuration
Match the pool size to your worker count:
.dlt/config.toml
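The original snippet was not preserved; the key names below are assumptions sketching a SQLAlchemy-style pool sized to match the extract workers — check the sql_database source docs for the exact names your dlt version supports:

```toml
# key names are assumptions, not verified dlt settings
[sources.sql_database]
pool_size = 8        # match the extract worker count
max_overflow = 0     # fail fast instead of opening extra connections
```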
Data Writer Configuration
Configure interim data storage format:
.dlt/config.toml
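The original snippet was not preserved; a sketch using dlt's data-writer settings (the value is illustrative):

```toml
[normalize.data_writer]
# rotate interim files after this many rows
file_max_items = 100_000
```

The interim file format itself (e.g. Parquet) is typically selected via `pipeline.run(..., loader_file_format="parquet")` rather than in this section.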
MotherDuck Destination
Configure batch size for loading:
.dlt/config.toml
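The original snippet was not preserved; the batch-size key below is an assumption — consult the MotherDuck destination docs for your dlt version:

```toml
[destination.motherduck]
# key name is an assumption; value is illustrative
batch_size = 500_000
```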
Pipeline Stage Parallelization
Configure workers for each pipeline stage:
.dlt/config.toml
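The original snippet was not preserved; dlt exposes a worker count for each stage (values are illustrative):

```toml
[extract]
workers = 8     # parallel table extraction

[normalize]
workers = 4     # parallel normalization

[load]
workers = 8     # parallel load jobs
```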
Higher worker counts improve performance but require more system resources. Adjust based on your infrastructure.
Pipeline Implementation
Create sql_database_pipeline.py with the following code:
sql_database_pipeline.py
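The original code block was not preserved; a minimal sketch of what sql_database_pipeline.py could look like, assuming dlt's built-in sql_database source with the ConnectorX backend (pipeline, dataset, and helper names are illustrative, not from the article):

```python
"""Minimal sketch of sql_database_pipeline.py -- names are illustrative."""


def summarize_row_counts(row_counts):
    """Format per-table row counts for the run report."""
    return ", ".join(
        f"{table}: {count:,} rows" for table, count in sorted(row_counts.items())
    )


def run_pipeline():
    # dlt imports are local so the module can be imported without dlt installed
    import dlt
    from dlt.sources.sql_database import sql_database

    pipeline = dlt.pipeline(
        pipeline_name="postgres_to_motherduck",  # illustrative name
        destination="motherduck",
        dataset_name="raw",                      # illustrative dataset
    )
    # backend="connectorx" selects the fast Rust-based extraction path;
    # credentials, schema, and tables come from .dlt/config.toml and secrets.toml
    source = sql_database(backend="connectorx")
    load_info = pipeline.run(source, write_disposition="replace")
    print(load_info)
    # per-table row counts recorded during normalization
    row_counts = pipeline.last_trace.last_normalize_info.row_counts
    print(summarize_row_counts(row_counts))


# Call run_pipeline() under an `if __name__ == "__main__":` guard and execute
# with `python sql_database_pipeline.py` after filling in .dlt/config.toml
# and .dlt/secrets.toml.
```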
Write Disposition Options
The pipeline uses write_disposition="replace", which drops and recreates tables on each run. Other options include:
- append: Add new rows to existing tables
- merge: Update existing rows and add new ones (requires primary key)
Running the Pipeline
When you run the script, it will:
- Connect to your PostgreSQL database
- Extract the configured tables in parallel using ConnectorX
- Transform the data during normalization
- Load the data into MotherDuck in batches
- Output detailed metrics about the run
Performance Optimization
The pipeline is configured for high performance with:
Parallel Extraction
Extract data from PostgreSQL using multiple workers. This allows multiple tables to be extracted simultaneously.
Efficient Backend
Use ConnectorX for fast data extraction. ConnectorX is significantly faster than traditional database adapters.
Parquet Format
Use Parquet for interim storage. Parquet provides efficient compression and columnar storage.
Large Batch Sizes
Load data to MotherDuck in large batches. Larger batches reduce network overhead and improve throughput.
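These tuning knobs can be sketched together in .dlt/config.toml (values are illustrative):

```toml
[extract]
workers = 8                  # parallel table extraction

[normalize]
workers = 4                  # parallel normalization

[normalize.data_writer]
file_max_items = 100_000     # larger interim files, fewer flushes

[load]
workers = 8                  # parallel load jobs
```

Parquet as the interim format is usually selected with `pipeline.run(..., loader_file_format="parquet")`.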
Troubleshooting
Connection Errors
If you encounter connection errors:
- Verify PostgreSQL credentials in .dlt/secrets.toml
- Check network connectivity to the PostgreSQL host
- Ensure PostgreSQL allows connections from your IP
- Verify MotherDuck token is valid
Memory Issues
If you experience out-of-memory errors:
- Reduce the batch size in the MotherDuck configuration
- Decrease worker counts
- Process fewer tables at once by splitting the tables configuration across multiple runs
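For instance, in .dlt/config.toml (values are illustrative, and the batch-size key name is an assumption):

```toml
[destination.motherduck]
batch_size = 50_000      # key name is an assumption

[extract]
workers = 2

[load]
workers = 2
```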
Performance Issues
If the pipeline is slower than expected:
- Increase parallelization settings
- Verify the ConnectorX backend is being used
- Check PostgreSQL server load and connection limits
- Monitor network bandwidth between PostgreSQL and your runner
Table Not Found Errors
If specific tables cannot be found:
- Verify the schema name in .dlt/config.toml
- Check that the table names are correct and exist in PostgreSQL
- Ensure the PostgreSQL user has SELECT permissions on the tables
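One way to check for missing tables, using the sqlalchemy dependency already in the project (the function name, URL, and table list are illustrative):

```python
from sqlalchemy import create_engine, inspect, text

def find_missing_tables(engine, expected_tables, schema=None):
    """Return the expected tables that the database does not actually have."""
    present = set(inspect(engine).get_table_names(schema=schema))
    return sorted(t for t in expected_tables if t not in present)

# Usage against PostgreSQL (URL and table list are placeholders):
#   engine = create_engine("postgresql://user:password@host:5432/dbname")
#   find_missing_tables(engine, ["orders", "customers"], schema="public")
```

An empty result means every configured table is visible to the connecting user; anything returned is either misspelled, in a different schema, or hidden by missing permissions.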
Next Steps
- Explore DLT incremental loading for efficient updates
- Learn about DLT transformations to modify data during pipeline execution
- Set up monitoring and alerting for production pipelines
- Consider scheduling the pipeline with cron, Airflow, or other orchestration tools