## Overview
Google Cloud Dataflow provides:

- Fully Managed: No cluster management required
- Autoscaling: Automatic resource scaling based on workload
- Optimization: Automatic pipeline optimization and execution
- Monitoring: Built-in monitoring and logging with Cloud Monitoring
- Security: Integration with Google Cloud IAM and VPC
## When to Use DataflowRunner

### Best For
- Production workloads on GCP
- Large-scale data processing
- Auto-scaling requirements
- Managed infrastructure
- Integration with GCP services
### Consider Alternatives
- Small local datasets (use DirectRunner)
- Non-GCP environments
- Existing Spark/Flink clusters
- Cost-sensitive batch jobs
## Setup and Configuration

### Prerequisites
- Google Cloud Project: Create a project in Google Cloud Console
- Enable APIs: Enable Cloud Dataflow, Compute Engine, and Cloud Storage APIs
- Authentication: Set up authentication credentials
- Cloud Storage: Create a GCS bucket for staging and temp files
### Dependencies
Add the Dataflow runner dependency to your build.
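For Java with Gradle, the runner is a single runtime dependency; the version below is illustrative and should match your Beam SDK version:

```groovy
dependencies {
    // Dataflow runner for the Beam Java SDK.
    runtimeOnly "org.apache.beam:beam-runners-google-cloud-dataflow-java:2.56.0"
}
```

For Python, install the SDK with the GCP extra instead: `pip install 'apache-beam[gcp]'`.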
### Authentication
Set up Google Cloud credentials before submitting jobs: for local development, `gcloud auth application-default login`; in automated environments, point the `GOOGLE_APPLICATION_CREDENTIALS` environment variable at a service account key file.

## Running a Pipeline
### Basic Example
### Command Line Execution
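A pipeline can be submitted to Dataflow entirely from the command line; the file, project, and bucket names below are hypothetical:

```shell
python my_pipeline.py \
  --runner=DataflowRunner \
  --project=my-project \
  --region=us-central1 \
  --staging_location=gs://my-bucket/staging \
  --temp_location=gs://my-bucket/temp
```

Java pipelines take the equivalent camelCase options (`--project`, `--region`, `--stagingLocation`, `--tempLocation`).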
## DataflowPipelineOptions

Key configuration options for the DataflowRunner:

### Required Options
- `project`: Google Cloud project ID.
- `region`: Google Cloud region for job execution (e.g., `us-central1`, `europe-west1`).
- `stagingLocation` (Java) / `staging_location` (Python): Cloud Storage path for staging files (must start with `gs://`).
- `tempLocation` (Java) / `temp_location` (Python): Cloud Storage path for temporary files.
### Worker Configuration

- `numWorkers` / `num_workers`: Initial number of workers. Dataflow will autoscale from this value.
- `maxNumWorkers` / `max_num_workers`: Maximum number of workers for autoscaling.
- `workerMachineType` / `machine_type`: Compute Engine machine type for workers.
- `diskSizeGb` / `disk_size_gb`: Disk size in GB for each worker.
### Streaming Options

- `streaming`: Enable streaming mode for unbounded sources.
- `enableStreamingEngine` / `enable_streaming_engine`: Use Dataflow Streaming Engine for streaming pipelines.
### Network Configuration

- `network`: Compute Engine network for launching workers.
- `subnetwork`: Compute Engine subnetwork for launching workers.
- `usePublicIps` / `use_public_ips`: Whether workers should have public IP addresses.
## Advanced Configuration

### Autoscaling
Dataflow automatically scales the number of workers up and down based on workload; use `maxNumWorkers` to bound the upper end.

### Flex Templates
Create reusable Dataflow templates that can be launched from the console, CLI, or API without rebuilding the pipeline.
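A sketch of the `gcloud` workflow; the bucket, image, and job names are hypothetical, and the exact build flags vary by SDK language:

```shell
# Package the pipeline as a Flex Template (the container image must
# already be built and pushed to a registry).
gcloud dataflow flex-template build gs://my-bucket/templates/my-template.json \
  --image "us-central1-docker.pkg.dev/my-project/my-repo/my-pipeline:latest" \
  --sdk-language "PYTHON" \
  --metadata-file "metadata.json"

# Launch a job from the template.
gcloud dataflow flex-template run "my-job" \
  --template-file-gcs-location gs://my-bucket/templates/my-template.json \
  --region us-central1 \
  --parameters input=gs://my-bucket/input.txt
```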
### Update Existing Jobs
Update a running streaming job by relaunching the pipeline with the `--update` flag and the same job name; Dataflow performs a compatibility check and replaces the job in place.

## Monitoring and Debugging
### Cloud Console

Monitor jobs in the Dataflow Console:

- View job graph and metrics
- Monitor worker resource usage
- Inspect logs and errors
- Track data throughput
### Logging

Logs from pipeline code and workers are automatically sent to Cloud Logging.
### Metrics
Dataflow also reports built-in metrics, such as throughput, element counts, and system lag, in the console alongside any custom metrics the pipeline declares.

## Best Practices
### Cost Optimization

1. Use Appropriate Machine Types: Right-size workers for the job; smaller machine types are often enough.
2. Enable Autoscaling: Set a sensible `maxNumWorkers` so you pay only for the workers the workload needs.
3. Use Streaming Engine (for streaming jobs): Offloading shuffle and state to the service allows smaller workers and disks.
### Performance

1. Optimize Windowing
   - Use appropriate window sizes
   - Consider allowed lateness for late data
2. Batch Elements
   - Use `GroupIntoBatches` for downstream API calls
   - Reduce per-element overhead
3. Use Side Inputs Wisely
   - Keep side inputs small
   - Consider using external lookups for large datasets
### Security

1. Use VPC Networks: Launch workers on a private network via the `network`/`subnetwork` options, and disable public IPs where possible.
2. Service Accounts: Run workers with a dedicated, least-privilege service account.
3. Encryption
   - Data is encrypted at rest and in transit by default
   - Use Customer Managed Encryption Keys (CMEK) for additional control
## Streaming vs Batch
### Batch Pipeline
### Streaming Pipeline
## Troubleshooting

### Common Issues
#### Job fails with 'Quota exceeded' error
Increase quotas in the GCP Console:
- Go to IAM & Admin > Quotas
- Filter by service (Compute Engine)
- Request quota increase
#### Workers fail to start
Check:
- Service account permissions
- Network/firewall configuration
- Region availability
- Machine type availability in the region
#### High costs
Optimize:
- Reduce worker machine sizes
- Set appropriate max workers
- Use Flex templates for repeated jobs
- Enable Streaming Engine for streaming
- Set appropriate worker disk sizes
#### Slow performance
Consider:
- Increasing worker count or machine type
- Optimizing transforms (reduce shuffles)
- Using Combiner functions
- Partitioning data appropriately
## Next Steps
- Dataflow Console: Monitor and manage your Dataflow jobs
- FlinkRunner: Alternative for self-managed clusters
- Monitoring Guide: Learn about metrics and monitoring
- Pricing: Understand Dataflow pricing