Installation
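The SDK installs from PyPI; optional extras pull in feature-specific dependencies:

```shell
pip install apache-beam                   # core SDK
pip install 'apache-beam[gcp]'            # Google Cloud I/O and Dataflow support
pip install 'apache-beam[interactive]'    # Jupyter notebook support
```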
Quick Start
The classic first example is counting the words in a body of text.

Core Concepts
Pipeline
A pipeline encapsulates your data processing workflow.

PCollection
A PCollection represents a distributed, immutable dataset.
Transforms
Transforms process data in your pipeline.

DoFn (Do Functions)
A DoFn, applied with the ParDo transform, allows for more complex per-element processing.
Python-Specific Features
Type Hints
Improve pipeline validation with type hints.

Lambda Functions
Use Python lambdas for simple transformations.

DataFrame API
Use familiar pandas-like operations.

Machine Learning with RunInference
Integrate ML models directly into your pipeline.

Streaming Pipelines
Process unbounded data streams.

Windowing
Group streaming data into windows.

I/O Connectors
The Python SDK supports various data sources.

Files
Google Cloud Platform

Connectors include BigQuery, Pub/Sub, and Cloud Storage (via gs:// paths in the file-based I/O transforms).
Databases

Connectors include MongoDB and, via cross-language transforms, JDBC-compatible databases.
Running Pipelines
Direct Runner (Local)
Google Cloud Dataflow
Apache Flink
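The runners above are typically selected with pipeline options on the command line; a sketch, where the script name, project, and bucket are placeholders:

```shell
# Local development (the default runner):
python wordcount.py --runner DirectRunner

# Google Cloud Dataflow (project, region, and bucket are placeholders):
python wordcount.py --runner DataflowRunner \
    --project my-project --region us-central1 \
    --temp_location gs://my-bucket/tmp

# Apache Flink (cluster address is a placeholder):
python wordcount.py --runner FlinkRunner --flink_master localhost:8081
```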
Best Practices
Use Pipeline Options
Define configurable options for your pipeline.
Handle Dependencies
Manage pipeline dependencies properly.
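Remote workers need your third-party packages shipped to them; a sketch of the two common flags (the script and file names are placeholders, though requirements.txt and setup.py are the conventional choices):

```shell
# Ship packages listed in a requirements file to the workers:
python my_pipeline.py --runner DataflowRunner \
    --requirements_file requirements.txt

# For a pipeline that imports local modules, ship it as a package:
python my_pipeline.py --runner DataflowRunner --setup_file ./setup.py
```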
Efficient DoFn Implementation
Use the DoFn lifecycle methods (setup, start_bundle, finish_bundle, teardown) for expensive operations.
Use Combine for Aggregations
Prefer Combine transforms for aggregations; runners can compute partial results in parallel for better performance.
Interactive Beam
Develop and debug pipelines in Jupyter notebooks.

Testing Pipelines
Test your pipelines effectively.

Resources
- API Reference: complete Python API documentation
- Code Examples: example pipelines and patterns
- ML Guide: machine learning with Beam
- Interactive Beam: Jupyter notebook development
Next Steps
- Explore Runners to execute your pipeline
- Learn about streaming pipelines with Python
- Check out ML inference with PyTorch and Scikit-learn
- Browse I/O transforms for data sources