Apache Airflow Pipelines
Apache Airflow is a platform for programmatically authoring, scheduling, and monitoring workflows. In this module, we use Airflow with the KubernetesPodOperator to run ML workloads as containerized pods on Kubernetes.
Installation & Setup
Create Kubernetes Storage
Apply PersistentVolume and PersistentVolumeClaim for pipeline storage:
volumes.yaml
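The manifest itself is not reproduced here; a minimal sketch of what such a file typically contains follows. All names, sizes, and the hostPath are assumptions:

```yaml
# Sketch of a PV/PVC pair for pipeline storage; names and sizes are assumptions.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pipeline-pv
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /mnt/data/pipeline   # local-cluster example; use a storage class in production
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pipeline-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
```

Apply it with kubectl apply -f volumes.yaml, then confirm the claim is Bound with kubectl get pvc.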
Training DAG
The training DAG orchestrates the full model training lifecycle, using the KubernetesPodOperator to run containerized tasks.
DAG Definition
airflow_pipelines/dags/training_dag.py
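The DAG file itself is not shown here; the sketch below illustrates the structure such a file typically has. The image name, scripts, claim name, and dag_id are assumptions, and the provider import path varies by version (older releases use airflow.providers.cncf.kubernetes.operators.kubernetes_pod):

```python
# Illustrative sketch of a training DAG, not the module's actual file.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

# Shared storage backed by the PVC created during setup (claim name assumed)
volume = k8s.V1Volume(
    name="pipeline-storage",
    persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(
        claim_name="pipeline-pvc"
    ),
)
volume_mount = k8s.V1VolumeMount(name="pipeline-storage", mount_path="/tmp/")


def make_task(task_id: str, script: str) -> KubernetesPodOperator:
    """Build one pipeline step as a pod running inside the training image."""
    return KubernetesPodOperator(
        task_id=task_id,
        name=task_id.replace("_", "-"),
        namespace="default",
        image="ml-training:latest",      # assumed image name
        cmds=["python", script],
        volumes=[volume],
        volume_mounts=[volume_mount],
        startup_timeout_seconds=600,
        image_pull_policy="Always",      # always fetch the latest image
        is_delete_operator_pod=False,    # keep finished pods around for debugging
        get_logs=True,
    )


with DAG(
    dag_id="training_dag",
    start_date=datetime(2024, 1, 1),
    schedule=None,   # manual triggers only (schedule_interval on older Airflow)
    catchup=False,
) as dag:
    load_data = make_task("load_data", "load_data.py")
    train_model = make_task("train_model", "train.py")
    evaluate = make_task("evaluate", "evaluate.py")

    load_data >> train_model >> evaluate
```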
Key Components
KubernetesPodOperator
Runs tasks as Kubernetes pods, providing:
- Isolation: Each task runs in its own container
- Resource management: Kubernetes handles pod scheduling and resources
- Image flexibility: Use any Docker image with required dependencies
- Volume mounting: Share data between tasks via PersistentVolumes
Volume Configuration
PersistentVolumes enable data sharing: all tasks mount /tmp/ to share training data and model artifacts.
Task Dependencies
Airflow uses the >> operator to define execution order. This ensures tasks run sequentially, with each task completing before the next starts.
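As a toy illustration of how >> wires dependencies (this is not Airflow code, just the operator-overloading pattern Airflow uses):

```python
# Minimal model of Airflow's >> dependency operator.
class Task:
    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []   # tasks that run after this one

    def __rshift__(self, other):
        # t1 >> t2 makes t2 run after t1; also supports a list on the right
        targets = other if isinstance(other, list) else [other]
        for t in targets:
            self.downstream.append(t)
        return other   # returning the right side lets chains like a >> b >> c work


load_data = Task("load_data")
train = Task("train_model")
evaluate = Task("evaluate")

load_data >> train >> evaluate  # sequential: load_data -> train_model -> evaluate

print([t.task_id for t in load_data.downstream])  # ['train_model']
```

Returning the right-hand operand is what makes chained expressions like a >> b >> c build a linear order rather than a fan-out.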
Inference DAG
The inference DAG loads a trained model from the registry and runs predictions on new data.
DAG Definition
airflow_pipelines/dags/inference_dag.py
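The DAG file is not reproduced here; the sketch below shows the shape such a file typically takes. The dag_id, image, and script names are assumptions:

```python
# Illustrative sketch of an inference DAG, not the module's actual file.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="inference_dag",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    load_data = KubernetesPodOperator(
        task_id="load_data", name="load-data", namespace="default",
        image="ml-inference:latest", cmds=["python", "load_data.py"],
    )
    load_model = KubernetesPodOperator(
        task_id="load_model", name="load-model", namespace="default",
        image="ml-inference:latest", cmds=["python", "load_model.py"],
    )
    predict = KubernetesPodOperator(
        task_id="predict", name="predict", namespace="default",
        image="ml-inference:latest", cmds=["python", "predict.py"],
    )

    # load_data and load_model have no dependency on each other, so Airflow
    # can schedule them concurrently; predict waits for both to finish.
    [load_data, load_model] >> predict
```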
Parallel Task Execution
Unlike the training DAG, the inference DAG runs load_data and load_model in parallel, since they are independent:
Running Pipelines
- Trigger Single Run
- Trigger Multiple Runs
- Monitor in UI
Manually trigger a pipeline:
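For example, with the Airflow CLI (the dag_id training_dag is an assumption; substitute your own):

```shell
# Trigger a single run
airflow dags trigger training_dag

# Trigger multiple runs with distinct run ids
for i in 1 2 3; do
  airflow dags trigger training_dag --run-id "manual_run_$i"
done

# Start the web UI (http://localhost:8080) to monitor runs
airflow webserver --port 8080
```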
Best Practices
Volume Management
- Use PersistentVolumes for data sharing
- Clean up storage between runs
- Set trigger_rule="all_done" for cleanup tasks
Resource Configuration
- Set an appropriate startup_timeout_seconds
- Use image_pull_policy="Always" to pull the latest images
- Configure resource requests/limits as needed
Error Handling
- Set is_delete_operator_pod=False for debugging
- Use trigger_rule to control failure behavior
- Monitor pod logs via kubectl or k9s
Security
- Store credentials in environment variables
- Use Airflow Connections for external services
- Avoid hardcoding API keys in DAG code
Troubleshooting
Pod Startup Timeouts
If pods fail to start within 600 seconds:
- Check cluster resources: kubectl top nodes
- Verify the image pull succeeded: kubectl get pods -n default
- Increase startup_timeout_seconds if needed
Volume Mount Errors
If volume mounting fails:
- Verify the PersistentVolumeClaim exists and is Bound: kubectl get pvc
- Check that the claim name referenced in the DAG matches the PVC
- Describe the failing pod for mount events: kubectl describe pod
DAG Not Appearing in UI
If your DAG doesn’t show up:
- Check DAG file syntax: airflow dags list
- Verify AIRFLOW_HOME points to the correct directory
- Look for import errors: airflow dags list-import-errors
Additional Resources
Next Steps
Try Kubeflow Pipelines
Learn Kubernetes-native ML orchestration with built-in artifact tracking