Filesystem
The filesystem destination allows you to load data into local or cloud file systems including AWS S3, Google Cloud Storage, Azure Blob Storage, and more. It supports multiple file formats and table formats like Delta Lake and Apache Iceberg.
Install dlt with Filesystem
To use filesystem as a destination, install dlt with the filesystem extra:
pip install "dlt[filesystem]"
Quick Start
Here’s a simple example to get you started:
import dlt
# Define your data source
@dlt.resource
def my_data():
    yield {"id": 1, "name": "Alice"}
    yield {"id": 2, "name": "Bob"}
# Create pipeline - saves to local filesystem
pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination="filesystem",
    dataset_name="my_dataset"
)
# Run the pipeline
info = pipeline.run(my_data())
print(info)
Configuration
Basic Configuration
The filesystem destination accepts the following options:
bucket_url: The filesystem path or bucket URL. Supports various protocols:
Local: file:///path/to/directory
AWS S3: s3://bucket-name/path
GCS: gs://bucket-name/path
Azure: az://container-name/path
Memory: memory://m
layout: Layout of files in the destination, built from placeholders like {table_name}, {load_id}, {file_id}, etc.
extra_placeholders: Additional custom placeholders available to the layout.
current_datetime: Function that provides the current datetime for time-based placeholders.
max_state_files: Maximum number of pipeline state files to keep. Set to 0 or a negative value to disable cleanup.
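Put together, a typical filesystem configuration in config.toml might look like this (the values are illustrative placeholders):

```toml
[destination.filesystem]
bucket_url = "s3://my-bucket/data"
layout = "{table_name}/{load_id}.{file_id}.{ext}"
max_state_files = 100
```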
Supported Protocols
Local Filesystem
import dlt
from dlt.destinations import filesystem

pipeline = dlt.pipeline(
    destination=filesystem("file:///tmp/my_data"),
    dataset_name="my_dataset"
)
AWS S3
import dlt
from dlt.destinations import filesystem

pipeline = dlt.pipeline(
    destination=filesystem("s3://my-bucket/data"),
    dataset_name="my_dataset"
)
Configure AWS credentials:
[destination.filesystem]
bucket_url = "s3://my-bucket/data"

[destination.filesystem.credentials]
aws_access_key_id = "your-access-key"
aws_secret_access_key = "your-secret-key"
region_name = "us-east-1"
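As with any dlt configuration, the same credentials can be supplied as environment variables instead of secrets.toml, using dlt's double-underscore naming convention (section names uppercased, nesting joined by __). The values below are placeholders:

```shell
# Equivalent to the secrets.toml entries above
export DESTINATION__FILESYSTEM__BUCKET_URL="s3://my-bucket/data"
export DESTINATION__FILESYSTEM__CREDENTIALS__AWS_ACCESS_KEY_ID="your-access-key"
export DESTINATION__FILESYSTEM__CREDENTIALS__AWS_SECRET_ACCESS_KEY="your-secret-key"
export DESTINATION__FILESYSTEM__CREDENTIALS__REGION_NAME="us-east-1"
```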
Google Cloud Storage
import dlt
from dlt.destinations import filesystem

pipeline = dlt.pipeline(
    destination=filesystem("gs://my-bucket/data"),
    dataset_name="my_dataset"
)
Configure GCS credentials:
[destination.filesystem]
bucket_url = "gs://my-bucket/data"

[destination.filesystem.credentials]
project_id = "my-project"
private_key = "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n"
client_email = "my-service-account@my-project.iam.gserviceaccount.com"
Azure Blob Storage
import dlt
from dlt.destinations import filesystem

pipeline = dlt.pipeline(
    destination=filesystem("az://my-container/data"),
    dataset_name="my_dataset"
)
Configure Azure credentials:
[destination.filesystem]
bucket_url = "az://my-container/data"

[destination.filesystem.credentials]
azure_storage_account_name = "mystorageaccount"
azure_storage_account_key = "your-account-key"
Hugging Face Datasets
import dlt
from dlt.destinations import filesystem

pipeline = dlt.pipeline(
    destination=filesystem("hf://datasets/my-username/my-dataset"),
    dataset_name="my_dataset"
)
Configure Hugging Face credentials:
[destination.filesystem]
bucket_url = "hf://datasets/my-username/my-dataset"

[destination.filesystem.credentials]
token = "hf_..."
File Formats
The filesystem destination supports multiple file formats (jsonl, parquet, csv). Select one with the loader_file_format argument:
pipeline = dlt.pipeline(
    destination="filesystem",
    dataset_name="my_dataset"
)
pipeline.run(my_data(), loader_file_format="jsonl")
File Layout
Customize the directory structure and file naming:
import dlt
from dlt.destinations import filesystem

pipeline = dlt.pipeline(
    destination=filesystem(
        bucket_url="s3://my-bucket/data",
        layout="{table_name}/{load_id}.{file_id}.{ext}"
    ),
    dataset_name="my_dataset"
)
Available Placeholders
{table_name}: Name of the table
{schema_name}: Name of the schema
{load_id}: Unique load identifier
{file_id}: File number within the load
{ext}: File extension (jsonl, parquet, csv)
{timestamp}: Load timestamp
{curr_date}: Current date
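Conceptually, the layout behaves like Python string formatting: each placeholder is substituted per file at load time. The render_path helper below is a hypothetical illustration, not a dlt API:

```python
# Hypothetical sketch of how a layout string expands into a file path.
# render_path is NOT part of dlt; it only mimics placeholder substitution.
def render_path(layout: str, **placeholders: str) -> str:
    return layout.format(**placeholders)

path = render_path(
    "{table_name}/{load_id}.{file_id}.{ext}",
    table_name="my_data",
    load_id="1700000000.123",
    file_id="0",
    ext="jsonl",
)
print(path)  # my_data/1700000000.123.0.jsonl
```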
Custom Placeholders
Define custom placeholders:
import dlt
from dlt.destinations import filesystem
from datetime import datetime

pipeline = dlt.pipeline(
    destination=filesystem(
        bucket_url="s3://my-bucket/data",
        layout="year={year}/month={month}/{table_name}.{ext}",
        extra_placeholders={
            "year": datetime.now().year,
            "month": datetime.now().month
        }
    ),
    dataset_name="my_dataset"
)
Delta Lake
Create Delta Lake tables:
import dlt

@dlt.resource(table_format="delta")
def delta_data():
    yield {"id": 1, "value": "data"}

pipeline = dlt.pipeline(
    destination="filesystem",
    dataset_name="my_dataset"
)
pipeline.run(delta_data(), loader_file_format="parquet")
Delta Lake requires the Parquet file format and the deltalake package (pip install "dlt[deltalake]").
Apache Iceberg
Create Apache Iceberg tables:
import dlt

@dlt.resource(table_format="iceberg")
def iceberg_data():
    yield {"id": 1, "value": "data"}

pipeline = dlt.pipeline(
    destination="filesystem",
    dataset_name="my_dataset"
)
pipeline.run(iceberg_data(), loader_file_format="parquet")
Iceberg requires the Parquet file format and the pyiceberg package (pip install "dlt[pyiceberg]").
Write Dispositions
Regular Tables
Regular file-based tables support append and replace (replace is implemented as truncate-and-insert); merge is not supported:
@dlt.resource(write_disposition="replace")
def replace_data():
    yield {"id": 1, "value": "new"}
Delta/Iceberg Tables
Delta and Iceberg tables support all write dispositions:
@dlt.resource(
    write_disposition="append",
    table_format="delta"
)
def append_data():
    yield {"id": 1, "value": "new"}
Querying Data
Query filesystem data using DuckDB:
import dlt
import duckdb
from dlt.destinations import filesystem

pipeline = dlt.pipeline(
    destination=filesystem("file:///tmp/my_data"),
    dataset_name="my_dataset"
)
pipeline.run(my_data(), loader_file_format="parquet")

# Query the Parquet files dlt just wrote
conn = duckdb.connect()
df = conn.execute("""
    SELECT * FROM read_parquet('file:///tmp/my_data/**/*.parquet')
""").df()
print(df.head())
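If you only need to enumerate the files a load produced, rather than query them, a plain glob over the dataset directory is enough. This sketch builds a stand-in directory with a dummy file so it is self-contained; with a real pipeline you would glob your local bucket_url path instead:

```python
import pathlib
import tempfile

# Stand-in for a local dataset directory; a real pipeline would have
# written these files under your bucket_url.
root = pathlib.Path(tempfile.mkdtemp())
(root / "my_data").mkdir()
(root / "my_data" / "1700000000.0.parquet").touch()

parquet_files = sorted(root.glob("**/*.parquet"))
print([p.name for p in parquet_files])  # ['1700000000.0.parquet']
```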
Advanced Features
Partitioning
Partition data using the layout:
import dlt
from dlt.destinations import filesystem
from datetime import datetime

now = datetime.now()
pipeline = dlt.pipeline(
    destination=filesystem(
        bucket_url="s3://my-bucket/data",
        layout="{table_name}/year={year}/month={month}/data.{file_id}.{ext}",
        # {year} and {month} are custom names, so they must be supplied
        # via extra_placeholders
        extra_placeholders={"year": now.year, "month": now.month}
    ),
    dataset_name="my_dataset"
)
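dlt's layout also understands built-in date placeholders such as {YYYY} and {MM} (check the layout documentation for your dlt version), which need no extra_placeholders. A partitioned layout can likewise be set in config.toml instead of code:

```toml
[destination.filesystem]
bucket_url = "s3://my-bucket/data"
# {YYYY}/{MM} are built-in date placeholders; custom names would
# require extra_placeholders
layout = "{table_name}/year={YYYY}/month={MM}/data.{file_id}.{ext}"
```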
Compression
Parquet files are compressed automatically by the Parquet writer, and text formats (jsonl, csv) are gzipped by default:
import dlt

pipeline = dlt.pipeline(
    destination="filesystem",
    dataset_name="my_dataset"
)
pipeline.run(my_data(), loader_file_format="parquet")
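If you prefer plain-text jsonl or csv files for ad-hoc inspection, the gzip compression applied by default can be switched off in config.toml via the data writer settings (Parquet keeps its own internal compression either way):

```toml
[normalize.data_writer]
# write plain-text jsonl/csv instead of gzipped files
disable_compression = true
```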
State Management
Control pipeline state file retention:
import dlt
from dlt.destinations import filesystem

pipeline = dlt.pipeline(
    destination=filesystem(
        bucket_url="s3://my-bucket/data",
        max_state_files=50  # keep the last 50 state files
    ),
    dataset_name="my_dataset"
)
Use Cases
Filesystem destination is ideal for:
Data Lake Storage : Store raw data in cloud storage
Staging Area : Stage data before loading to warehouses
Backup and Archive : Long-term data retention
Data Sharing : Share data files across systems
Cost Optimization : Lower storage costs compared to warehouses
Hugging Face Datasets : Publish datasets to Hugging Face
Best Practices
Use Parquet : Better compression and query performance
Partition Data : Use date-based partitioning for large datasets
Table Formats : Use Delta/Iceberg for ACID guarantees
Batch Size : Adjust batch size for optimal file sizes
Compression : Leverage Parquet’s built-in compression
Limitations
Regular file tables don’t support merge operations
Delta/Iceberg require Parquet format
No built-in query engine (use DuckDB, Spark, etc.)
File-level operations only (no row-level updates for regular files)
Additional Resources
Delta Lake: learn about the Delta Lake table format
Apache Iceberg: learn about Iceberg tables