The Pipeline class is the core component of dlt that orchestrates the data loading process. It manages the extract, normalize, and load steps, handles state synchronization, and provides methods to run complete data loading workflows.
Creating a Pipeline
Pipelines are typically created using the dlt.pipeline() function rather than instantiating the class directly.
Properties
pipeline_name
The name of the pipeline.
dataset_name
The name of the dataset where data will be loaded.
default_schema
Returns the default schema associated with the pipeline.
Returns: Schema - The default schema object
Source: ~/workspace/source/dlt/pipeline/pipeline.py:961
state
Returns the current pipeline state as a dictionary.
Returns: TPipelineState - Dictionary with the pipeline state
Source: ~/workspace/source/dlt/pipeline/pipeline.py:965
has_data
Tells if the pipeline contains any data: schemas, extracted files, load packages, or loaded packages.
Returns: bool - True if the pipeline has data
Source: ~/workspace/source/dlt/pipeline/pipeline.py:938
has_pending_data
Tells if the pipeline contains any pending packages to be normalized or loaded.
Returns: bool - True if the pipeline has pending data
Source: ~/workspace/source/dlt/pipeline/pipeline.py:948
last_trace
Returns or loads the last trace generated by the pipeline.
Returns: PipelineTrace - The last trace object
Source: ~/workspace/source/dlt/pipeline/pipeline.py:975
Methods
run()
Loads data from a source into the destination. This is the primary method for executing a complete ETL pipeline.
data - Data to be loaded into the destination. Can be a list, iterator, generator, dlt.resource, or dlt.source.
destination - A name of the destination or a destination module. If not provided, uses the value from pipeline initialization.
staging - A name of the staging destination for intermediate data loading.
dataset_name - Dataset name where data will be loaded. Defaults to pipeline_name if not specified.
credentials - Credentials for the destination. Usually inferred from secrets.toml or environment variables.
table_name - Name of the table to load data into. Required for lists/iterables without a name attribute.
write_disposition - Controls how to write data: "append", "replace", "skip", or "merge".
columns - Column schemas with names, data types, and performance hints.
primary_key - Column name(s) that comprise the primary key. Used with the merge write disposition.
schema - An explicit Schema object to group table schemas.
loader_file_format - The file format for load packages (e.g. "jsonl", "parquet").
table_format - The table format used by the destination ("delta" or "iceberg").
schema_contract - Schema contract settings override for all tables.
refresh - Refresh mode: "drop_sources", "drop_resources", or "drop_data".
Returns: LoadInfo - Information about the loaded data, including package ids and job statuses.
Source: ~/workspace/source/dlt/pipeline/pipeline.py:637
extract()
Extracts data and prepares it for normalization. Does not require destination configuration. Accepts the same arguments as run(), except for the destination-related parameters.
Returns: ExtractInfo - Information about the extraction step
Source: ~/workspace/source/dlt/pipeline/pipeline.py:424
normalize()
Normalizes extracted data, infers the schema, and creates load packages.
workers - Number of worker processes for normalization.
Returns: NormalizeInfo - Information about the normalization step
Source: ~/workspace/source/dlt/pipeline/pipeline.py:527
load()
Loads prepared packages into the destination.
destination - Destination to load data into.
dataset_name - Dataset name at the destination.
credentials - Credentials for the destination.
workers - Number of parallel loading workers.
raise_on_failed_jobs - Whether to raise an exception on failed jobs.
Returns: LoadInfo - Information about the load step
Source: ~/workspace/source/dlt/pipeline/pipeline.py:578
activate()
Activates the pipeline, making it the active context for dlt components.
deactivate()
Deactivates the currently active pipeline.
drop()
Deletes local pipeline state, schemas, and working files, then re-initializes the pipeline.
pipeline_name - Optional new pipeline name. Creates and activates a new instance if different from the current name.
Returns: Pipeline - The pipeline instance (self or a new instance)
Source: ~/workspace/source/dlt/pipeline/pipeline.py:375
sync_destination()
Synchronizes the pipeline state with the destination's state.
destination - Destination to sync with.
staging - Staging destination.
dataset_name - Dataset name.
sync_schema()
Synchronizes a schema with the destination.
schema_name - Name of the schema to sync. Uses the default schema if not provided.
Returns: TSchemaTables - Dictionary of synchronized tables
Source: ~/workspace/source/dlt/pipeline/pipeline.py:1062
list_extracted_load_packages()
Returns a list of all load package IDs that are or will be normalized.
Returns: Sequence[str] - List of load package IDs
Source: ~/workspace/source/dlt/pipeline/pipeline.py:1021
list_normalized_load_packages()
Returns a list of all load package IDs that are or will be loaded.
Returns: Sequence[str] - List of load package IDs
Source: ~/workspace/source/dlt/pipeline/pipeline.py:1025
list_completed_load_packages()
Returns a list of all load package IDs that are completely loaded.
Returns: Sequence[str] - List of load package IDs
Source: ~/workspace/source/dlt/pipeline/pipeline.py:1029
get_load_package_info()
Returns information about a specific load package.
load_id - The load package ID.
Returns: LoadPackageInfo - Information about the package, including jobs and statuses
Source: ~/workspace/source/dlt/pipeline/pipeline.py:1033
drop_pending_packages()
Deletes all extracted and normalized packages, including partially loaded ones by default.
with_partial_loads - Whether to also delete partially loaded packages.
set_local_state_val()
Sets a value in the local state (not synchronized with the destination).
key - State key.
value - Value to store.
get_local_state_val()
Gets a value from the local state.
key - State key to retrieve.
Returns: Any - The stored value
Source: ~/workspace/source/dlt/pipeline/pipeline.py:1090