The Pipeline API provides the foundation for building Apache Beam data processing workflows. A Pipeline manages a directed acyclic graph of PTransforms and the PCollections they consume and produce.
Pipeline Class
The Pipeline class represents a data processing workflow.
Creating a Pipeline
create()
Creates a pipeline from default PipelineOptions.
public static Pipeline create()
Returns: A new Pipeline instance
Example:
Pipeline pipeline = Pipeline.create();
create(PipelineOptions)
Creates a pipeline from the provided PipelineOptions.
public static Pipeline create(PipelineOptions options)
Configuration options for the pipeline, including runner specification
Returns: A new Pipeline instance
Example:
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline pipeline = Pipeline.create(options);
Adds a root PTransform to the pipeline.
public <OutputT extends POutput> OutputT apply(
PTransform<? super PBegin, OutputT> root)
root
PTransform<? super PBegin, OutputT>
required
The root transform to apply (e.g., Read or Create)
The output PCollection or POutput from the transform
Example:
PCollection<String> lines = pipeline.apply(
TextIO.read().from("gs://bucket/dir/file*.txt")
);
Adds a named root PTransform to the pipeline.
public <OutputT extends POutput> OutputT apply(
String name,
PTransform<? super PBegin, OutputT> root)
The name for this transform node in the pipeline graph
root
PTransform<? super PBegin, OutputT>
required
The root transform to apply
The output PCollection or POutput from the transform
Example:
PCollection<String> lines = pipeline.apply(
"ReadInputFiles",
TextIO.read().from("gs://bucket/dir/file*.txt")
);
Running the Pipeline
run()
Runs the pipeline using the PipelineOptions specified during creation.
public PipelineResult run()
The result of the pipeline execution
Example:
Pipeline p = Pipeline.create();
// ... build pipeline ...
PipelineResult result = p.run();
run(PipelineOptions)
Runs the pipeline using the specified PipelineOptions.
public PipelineResult run(PipelineOptions options)
The pipeline options to use for execution
The result of the pipeline execution
Other Methods
begin()
Returns a PBegin for this pipeline, used as input for root transforms.
A PBegin owned by this pipeline
getOptions()
Returns the PipelineOptions for this pipeline.
public PipelineOptions getOptions()
getCoderRegistry()
Returns the CoderRegistry for this pipeline.
public CoderRegistry getCoderRegistry()
The coder registry for encoding/decoding data
getSchemaRegistry()
Returns the SchemaRegistry for this pipeline.
public SchemaRegistry getSchemaRegistry()
The schema registry for schema-aware transforms
PipelineResult Interface
Represents the result of a pipeline execution.
Methods
getState()
Retrieves the current state of the pipeline execution.
The current state (UNKNOWN, STOPPED, RUNNING, DONE, FAILED, CANCELLED, UPDATED, UNRECOGNIZED)
cancel()
Cancels the pipeline execution.
State cancel() throws IOException
The state after cancellation attempt
waitUntilFinish()
Waits until the pipeline finishes and returns the final status.
The final state of the pipeline
Example:
PipelineResult result = pipeline.run();
State finalState = result.waitUntilFinish();
if (finalState == State.DONE) {
System.out.println("Pipeline completed successfully!");
}
waitUntilFinish(Duration)
Waits until the pipeline finishes with a timeout.
State waitUntilFinish(Duration duration)
The maximum time to wait (less than 1ms means infinite wait)
The final state of the pipeline, or null on timeout
metrics()
Returns metrics from the pipeline execution.
The metrics from pipeline execution
PipelineOptions Interface
Configuration options for pipeline execution. Created using PipelineOptionsFactory.
Example:
PipelineOptions options = PipelineOptionsFactory.create();
options.setRunner(DataflowRunner.class);
options.setTempLocation("gs://bucket/temp");
Pipeline pipeline = Pipeline.create(options);
Complete Example
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;
public class WordCount {
public static void main(String[] args) {
// Create pipeline options
PipelineOptions options = PipelineOptionsFactory
.fromArgs(args)
.create();
// Create the pipeline
Pipeline p = Pipeline.create(options);
// Build the pipeline
p.apply("ReadLines", TextIO.read().from("input.txt"))
.apply("CountWords", Count.perElement())
.apply("FormatResults", MapElements
.into(TypeDescriptors.strings())
.via(kv -> kv.getKey() + ": " + kv.getValue()))
.apply("WriteResults", TextIO.write().to("output"));
// Run the pipeline
p.run().waitUntilFinish();
}
}