## Overview

Apache Spark is a unified analytics engine that provides:

- Batch Processing: Fast in-memory batch processing
- Streaming: Structured Streaming for real-time processing
- SQL Support: Integration with Spark SQL and DataFrames
- Ecosystem: Rich ecosystem of libraries and connectors
- Scalability: Run on clusters from laptops to data centers
## When to Use SparkRunner

### Best For
- Existing Spark infrastructure
- Spark ecosystem integration
- Unified batch and streaming
- On-premise deployments
- Cost-effective large-scale batch
- Teams familiar with Spark
### Consider Alternatives
- Pure streaming apps (FlinkRunner)
- GCP workloads (DataflowRunner)
- Local development (DirectRunner)
- Low-latency streaming (FlinkRunner)
## Setup and Configuration

### Prerequisites
- Apache Spark 3.0 or later
- Java 8 or later
- Spark cluster or local Spark installation
### Dependencies
Add the runner dependency to your `pom.xml` (Maven) or the equivalent coordinates to your Gradle build:
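For Maven, the dependency looks like this (substitute your Beam version for the `beam.version` property):

```xml
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-spark-3</artifactId>
  <version>${beam.version}</version>
</dependency>
```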
Use `beam-runners-spark-3` for Spark 3.x. For Spark 2.x, use `beam-runners-spark`.

### Spark Cluster Setup
#### Local Spark
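A single-machine standalone cluster can be started with the scripts that ship with Spark (Spark 3.x script names; older releases use `start-slave.sh`):

```shell
# Start a standalone master and one worker on this machine
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh spark://localhost:7077
```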
#### Spark on YARN
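A typical submission to YARN looks like this (the class and JAR names are placeholders for your own pipeline):

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyBeamPipeline \
  my-pipeline-bundled.jar \
  --runner=SparkRunner
```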
#### Spark on Kubernetes
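For Kubernetes, point `--master` at the API server and name a container image that contains Spark (the image name and API server address below are placeholders):

```shell
spark-submit \
  --master k8s://https://k8s-apiserver:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=my-spark-image:latest \
  --class com.example.MyBeamPipeline \
  local:///opt/app/my-pipeline-bundled.jar \
  --runner=SparkRunner
```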
## Running a Pipeline

### Basic Example
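A minimal Java pipeline on the SparkRunner might look like this (the file paths and class name are placeholders):

```java
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class WordsOnSpark {
  public static void main(String[] args) {
    // Parse --runner, --sparkMaster, etc. from the command line
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    options.setRunner(SparkRunner.class);

    Pipeline p = Pipeline.create(options);
    p.apply("ReadLines", TextIO.read().from("input.txt"))
     .apply("WriteLines", TextIO.write().to("output"));
    p.run().waitUntilFinish();
  }
}
```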
## Execution Modes
### Local Mode

Run with local Spark by setting the master to `local[n]` or `local[*]`; no cluster is required.

### Cluster Mode

Submit the pipeline's bundled JAR to a Spark cluster with `spark-submit`.
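As a sketch, the two modes differ only in how the application is launched (the JAR and class names are placeholders):

```shell
# Local mode: run the bundled JAR directly with an embedded master
java -jar my-pipeline-bundled.jar --runner=SparkRunner --sparkMaster=local[4]

# Cluster mode: submit the same JAR to a standalone Spark master
spark-submit \
  --master spark://spark-master:7077 \
  --class com.example.MyBeamPipeline \
  my-pipeline-bundled.jar \
  --runner=SparkRunner
```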
## SparkPipelineOptions

### Core Options
- `sparkMaster`: The Spark master URL. Supported values:
  - `local[n]` - local mode with n threads
  - `local[*]` - local mode with all cores
  - `spark://host:port` - Spark standalone cluster
  - `yarn` - YARN cluster
  - `k8s://host:port` - Kubernetes cluster
- `appName`: Application name displayed in the Spark UI.
### Streaming Options
- `streaming`: Enable streaming mode for unbounded sources.
- `batchIntervalMillis`: Batch interval for Spark Streaming, in milliseconds.
- `checkpointDurationMillis`: Checkpoint interval in milliseconds; -1 uses Spark's default.
- `checkpointDir`: Directory for Spark checkpoints.
### Performance Options
- `bundleSize`: Bundle size for splitting bounded sources; 0 uses Spark's default parallelism.
- `cacheDisabled`: Disable caching of reused PCollections.
- `maxRecordsPerBatch`: Maximum records per micro-batch for streaming sources.
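Put together, a streaming invocation might pass options like these (host names and paths are placeholders):

```shell
--runner=SparkRunner \
--sparkMaster=spark://spark-master:7077 \
--appName=my-beam-app \
--streaming=true \
--batchIntervalMillis=2000 \
--checkpointDir=hdfs:///beam/checkpoints
```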
### Resource Configuration

Spark configuration properties are set through Spark itself, for example with `spark-submit --conf` flags or in `spark-defaults.conf`.
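For example, executor resources are typically set via `--conf` on `spark-submit` (values and names below are illustrative):

```shell
spark-submit \
  --conf spark.executor.instances=10 \
  --conf spark.executor.cores=4 \
  --conf spark.executor.memory=4g \
  --class com.example.MyBeamPipeline \
  my-pipeline-bundled.jar \
  --runner=SparkRunner
```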
## Advanced Configuration
### Using Existing SparkContext
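When embedding Beam in an application that already owns a Spark context, the runner can reuse it. A sketch, assuming the Spark runner's `SparkContextOptions` interface:

```java
import org.apache.beam.runners.spark.SparkContextOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.spark.api.java.JavaSparkContext;

// Context created (or obtained) by the host application
JavaSparkContext jsc = new JavaSparkContext("local[2]", "beam-on-existing-context");

SparkContextOptions options = PipelineOptionsFactory.as(SparkContextOptions.class);
options.setRunner(SparkRunner.class);
options.setUsesProvidedSparkContext(true);
options.setProvidedSparkContext(jsc);

Pipeline p = Pipeline.create(options);
```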
### Spark Configuration
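Cluster-wide defaults can live in `spark-defaults.conf`; a commonly useful excerpt (property values are illustrative):

```shell
# spark-defaults.conf (excerpt)
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.dynamicAllocation.enabled  true
spark.shuffle.service.enabled    true
```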
### Checkpointing
For streaming pipelines, configure a durable checkpoint directory (for example on HDFS) so state can be recovered after a driver restart.
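In practice this is a pair of pipeline options (the HDFS path is a placeholder):

```shell
--streaming=true \
--checkpointDir=hdfs:///beam/checkpoints \
--checkpointDurationMillis=10000
```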
## Batch vs Streaming

### Batch Pipeline
### Streaming Pipeline
## Monitoring and Debugging

### Spark UI
Access the Spark UI for monitoring:

- Master UI: `http://master:8080` (standalone mode)
- Application UI: `http://driver:4040`
- History Server: `http://history-server:18080`

The UI shows:
- Job stages and tasks
- Executor metrics
- Storage and memory usage
- Environment configuration
### Metrics
Beam metrics are reported through Spark's metrics system and are visible in the Spark UI.
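Metrics can also be queried programmatically from the `PipelineResult`; a sketch using the Beam metrics API:

```java
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.metrics.MetricQueryResults;
import org.apache.beam.sdk.metrics.MetricsFilter;

PipelineResult result = p.run();
result.waitUntilFinish();

// Query all metrics reported by the pipeline
MetricQueryResults metrics =
    result.metrics().queryMetrics(MetricsFilter.builder().build());
metrics.getCounters().forEach(counter -> System.out.println(counter));
```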
### Logging

Pipeline logs are available:

- in Spark executor logs
- through the Spark UI
- in the configured log aggregation system
## Performance Tuning
### Parallelism
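Parallelism is driven by Spark's own settings together with the runner's `bundleSize` option; for example (value illustrative):

```shell
--conf spark.default.parallelism=200
```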
### Memory Management
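Executor memory is tuned through standard Spark flags (values illustrative, JAR and class names are placeholders):

```shell
spark-submit \
  --executor-memory 8g \
  --conf spark.executor.memoryOverhead=2g \
  --conf spark.memory.fraction=0.6 \
  --class com.example.MyBeamPipeline \
  my-pipeline-bundled.jar
```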
### Caching Strategy
### Shuffle Optimization
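Shuffle behavior is likewise controlled by Spark properties (values illustrative):

```shell
--conf spark.sql.shuffle.partitions=200 \
--conf spark.shuffle.compress=true \
--conf spark.reducer.maxSizeInFlight=96m
```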
## Best Practices

### Resource Allocation
- Right-size executors
- Use dynamic allocation
### Streaming Best Practices
- Choose an appropriate batch interval
- Enable checkpointing
- Configure backpressure
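Backpressure is enabled through Spark Streaming properties (rate value illustrative); the runner's `maxRecordsPerBatch` option caps micro-batch size on the Beam side:

```shell
--conf spark.streaming.backpressure.enabled=true \
--conf spark.streaming.backpressure.initialRate=10000
```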
### Batch Best Practices
- Partition data appropriately
- Use broadcast variables for small datasets
- Avoid wide transformations when possible
- Coalesce output partitions
## Troubleshooting
### Job fails to submit
Check:
- Spark master is running and accessible
- JAR contains all dependencies
- Spark version compatibility
- Network connectivity
### Out of memory errors
- Increase executor memory
- Increase memory overhead
- Reduce partition size
- Enable off-heap memory
- Disable caching if not beneficial
### Slow performance
- Check data skew in Spark UI
- Increase parallelism
- Optimize shuffle operations
- Enable dynamic allocation
- Tune memory fractions
### Streaming lag
- Reduce batch interval
- Increase executors
- Enable backpressure
- Optimize transforms
- Check for bottlenecks in UI
### Checkpoint failures
- Verify checkpoint directory is accessible
- Check HDFS/storage permissions
- Ensure sufficient disk space
- Monitor checkpoint size
## Runner Capabilities

### Supported Features
- ✅ Batch processing
- ✅ Structured Streaming
- ✅ Windowing
- ✅ Triggers
- ✅ State and timers
- ✅ Side inputs
- ✅ Metrics
### Limitations
- Limited exactly-once streaming support
- Batch interval impacts latency
- Some Beam features not fully supported
- Requires Spark cluster management
## Integration Examples

### Reading from Hive
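One way to read Hive tables from a Beam pipeline is `HCatalogIO` (from `beam-sdks-java-io-hcatalog`); the metastore URI, database, and table names below are placeholders:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.io.hcatalog.HCatalogIO;

Map<String, String> configProperties = new HashMap<>();
configProperties.put("hive.metastore.uris", "thrift://metastore-host:9083");

// Produces a PCollection of HCatRecord
p.apply(HCatalogIO.read()
    .withConfigProperties(configProperties)
    .withDatabase("default")
    .withTable("my_table"));
```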
### Writing to Cassandra
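Writes to Cassandra can go through `CassandraIO` (from `beam-sdks-java-io-cassandra`); the `Event` entity class, host, and keyspace are placeholders:

```java
import java.util.Collections;
import org.apache.beam.sdk.io.cassandra.CassandraIO;

// `events` is a PCollection<Event>; Event is a mapped entity class
events.apply(CassandraIO.<Event>write()
    .withHosts(Collections.singletonList("cassandra-host"))
    .withPort(9042)
    .withKeyspace("analytics")
    .withEntity(Event.class));
```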
## Next Steps
- Spark Documentation: Learn more about Apache Spark
- FlinkRunner: Alternative for streaming workloads
- Performance: Optimize pipeline performance
- Monitoring: Monitor Spark applications