Dagster Pipelines
Dagster is a data orchestration platform that focuses on assets rather than tasks. It treats data, models, and metrics as first-class citizens with built-in lineage tracking, data quality checks, and observability.
Why Dagster?
Dagster’s asset-centric approach offers unique advantages:
- Asset-first paradigm: Define what you want to produce, not just tasks
- Data quality checks: Built-in asset checks for validation and anomaly detection
- Rich metadata: Attach JSON, markdown, plots to assets for observability
- Software-defined assets: Assets are code, enabling testing and version control
- Flexible execution: Run locally, on Modal, Kubernetes, or other executors
Setup
Text-to-SQL Pipeline
This example demonstrates fine-tuning a Phi-3 model for text-to-SQL generation with comprehensive data quality checks.
Asset: Load SQL Data
dagster_pipelines/text2sql_pipeline.py
- @asset decorator defines a software-defined asset
- context.add_output_metadata() attaches rich metadata visible in the UI
- Returns a DatasetDict passed to downstream assets
Asset Check: No Empty Datasets
- passed=True/False indicates the check result
- Failures don’t stop execution but are highlighted in the UI
- Useful for data quality, schema validation, anomaly detection
Asset: Process Dataset
Asset: Trained Model
- compute_kind="modal" indicates the execution environment
- modal.Function.lookup() calls a deployed Modal function
- Training runs on Modal’s GPU infrastructure
- Returns the model name for downstream inference
Modal Training Function
dagster_pipelines/text2sql_functions.py
Asset: Model Metrics
Asset Checks: ROUGE Thresholds
- Each check evaluates a different ROUGE metric
- Failed checks indicate model performance issues
- Visible in UI as warnings without blocking execution
Pipeline Definition
Running the Pipeline
- Materialize All Assets
- Materialize Specific Assets
- View Asset Lineage
In the Dagster UI:
- Navigate to Assets view
- Select all assets (or click Materialize all)
- Click Materialize selected
- Monitor execution in Runs view
Asset Metadata & Observability
Attaching Metadata
Dagster supports rich metadata types. All metadata is visible in the Dagster UI.
Asset Checks vs Assertions
Asset Checks (recommended):
- Non-blocking: execution continues even if checks fail
- Visible in UI with status indicators
- Can be run independently of materialization
- Support rich failure messages
Assertions:
- Blocking: raise exceptions to halt execution
- Useful for critical validations
- Example:
Asset Groups
Organize assets with group_name; groups appear in the UI for better organization:
- data: Data loading and preprocessing
- model: Training and evaluation
- inference: Prediction and serving
Best Practices
Asset Design
- Define assets by what they produce, not how
- Keep assets pure functions when possible
- Use descriptive names (e.g., cleaned_data, not step_2)
- Group related assets logically
Data Quality
- Add asset checks for critical validations
- Use checks for schema validation, null checks, ranges
- Set meaningful thresholds (don’t over-check)
- Document why checks exist in docstrings
Metadata
- Attach metadata for observability
- Log samples, counts, metrics, distributions
- Use markdown for human-readable summaries
- Include links to external systems (W&B, MLflow)
Resource Management
- Use compute_kind to indicate the execution environment
- Offload heavy compute to Modal, Kubernetes, etc.
- Partition large assets for incremental processing
- Configure retries for flaky dependencies
Comparison with Other Orchestrators
- vs. Airflow
- vs. Kubeflow
Dagster Advantages:
- Asset-centric: focuses on data products, not tasks
- Built-in data quality checks
- Rich metadata and observability
- Better local development experience
Airflow Advantages:
- More mature ecosystem
- Broader community and integrations
- Battle-tested at scale
- More flexible scheduling options
Troubleshooting
Modal Functions Not Found
If modal.Function.lookup() fails, verify that the Modal app has been deployed and that the app and function names match your deployment.
Asset Materialization Fails
Debug failed materializations:
- Click failed asset in Runs view
- Expand Logs to see error traceback
- Check Compute Logs for stdout/stderr
- Verify upstream assets materialized successfully
- Test asset function locally:
Asset Checks Always Fail
If checks consistently fail:
- Review threshold values (e.g., ROUGE > 0.8 may be too strict)
- Check if asset returns expected format
- Add logging in check functions to debug
- Consider making checks warnings instead of failures
Additional Resources
- Dagster Documentation
- Introducing Asset Checks
- Orchestrating Machine Learning Pipelines
- ML Pipelines for Fine-tuning LLMs
- Anomaly Detection in Dagster
Next Steps
Practice Exercises
Complete hands-on exercises to build your own orchestration pipelines