Apache Beam provides a rich set of I/O connectors for reading from and writing to various data sources and sinks. I/O transforms enable your pipelines to interact with external storage systems, databases, message queues, and more.
Beam provides two primary patterns for reading data:
Python

```python
import apache_beam as beam
from apache_beam.io import ReadFromText

# Pattern 1: Direct read transform
with beam.Pipeline() as p:
    lines = p | ReadFromText('gs://bucket/file.txt')

# Pattern 2: beam.io.Read with a source object, typically a custom
# BoundedSource subclass (see the custom source skeleton below).
with beam.Pipeline() as p:
    lines = p | beam.io.Read(MyCustomSource('file.txt'))
```

Java

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;

Pipeline p = Pipeline.create();

// Reading from text files
PCollection<String> lines = p.apply(
    TextIO.read().from("gs://bucket/file.txt"));

p.run().waitUntilFinish();
```

Go

```go
package main

import (
	"context"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/io/textio"
)

func main() {
	p := beam.NewPipeline()
	s := p.Root()
	lines := textio.Read(s, "gs://bucket/file.txt")
	_ = lines // apply further transforms to lines here
	beam.Run(context.Background(), p)
}
```
Unbounded sources continuously read data from streams.

Characteristics:

- Infinite or ongoing data streams
- Processing continues indefinitely
- Requires windowing for aggregations
- Examples: Pub/Sub, Kafka, Kinesis
```python
import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub

# Reading unbounded data from Pub/Sub
# (the pipeline must run in streaming mode)
with beam.Pipeline() as p:
    messages = p | ReadFromPubSub(
        topic='projects/my-project/topics/my-topic')
    # Apply windowing and process...
```
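The windowing requirement above comes down to assigning each element's event timestamp to a window. For Beam's fixed windows, the window start is the timestamp rounded down to a multiple of the window size. A minimal sketch of that assignment in plain Python (no Beam dependency; `fixed_window` is an illustrative helper, not a Beam API):

```python
def fixed_window(timestamp, size):
    """Return the [start, end) fixed window containing `timestamp`.

    Mirrors the arithmetic of Beam's fixed windowing: the start is
    the timestamp rounded down to a multiple of the window size.
    """
    start = timestamp - (timestamp % size)
    return (start, start + size)

# Group event timestamps (in seconds) into 60-second windows.
events = [3, 59, 61, 125]
windows = {}
for t in events:
    windows.setdefault(fixed_window(t, 60), []).append(t)

print(windows)
# → {(0, 60): [3, 59], (60, 120): [61], (120, 180): [125]}
```

Once elements carry a window like this, per-window aggregations (counts, sums) become ordinary grouped reductions, which is why unbounded streams need windowing before any aggregation can complete.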
Beam automatically splits bounded sources for parallel processing:
```python
from apache_beam.io.iobase import BoundedSource

class CustomSource(BoundedSource):
    def estimate_size(self):
        # Estimate the total size of the source in bytes.
        pass

    def split(self, desired_bundle_size, start_position=None,
              stop_position=None):
        # Split the source into SourceBundles for parallel processing.
        pass

    def get_range_tracker(self, start_position, stop_position):
        # Return a RangeTracker for the given position range.
        pass

    def read(self, range_tracker):
        # Yield records within the given range.
        pass
```
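To illustrate what `split` typically does, here is a plain-Python sketch (no Beam dependency) of carving an offset range into bundles of at most `desired_bundle_size` bytes, the same partitioning arithmetic a file-backed source would use; the helper name `split_range` is illustrative, not a Beam API:

```python
def split_range(total_size, desired_bundle_size):
    """Carve [0, total_size) into contiguous (start, stop) bundles.

    Each bundle covers at most desired_bundle_size bytes; the last
    bundle holds the remainder. A BoundedSource.split() implementation
    performs the same partitioning, wrapping each range in a
    SourceBundle so the runner can read the ranges in parallel.
    """
    bundles = []
    start = 0
    while start < total_size:
        stop = min(start + desired_bundle_size, total_size)
        bundles.append((start, stop))
        start = stop
    return bundles

# A 250-byte source split into 100-byte bundles:
print(split_range(250, 100))  # → [(0, 100), (100, 200), (200, 250)]
```

At run time the runner uses `estimate_size` to pick a bundle size, calls `split` to get the bundles, then calls `read` on each bundle's range tracker, which is how a single logical source fans out across workers.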