Executing streamable pipelines with iterator()
The Pipeline.iterator() method returns an iterator object that yields NumPy arrays of up to chunk_size points at a time.
Parameters
- chunk_size (default=10000): The maximum number of points to include in each yielded array
- prefetch (default=0): Allows prefetching up to this number of arrays in parallel and buffering them until they are yielded to the caller
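The consumption pattern can be sketched as follows. A plain generator stands in for Pipeline.iterator() here, since building a real pipeline requires input data, but the loop itself is the same:

```python
import numpy as np

# Stand-in for Pipeline.iterator(): yields structured point arrays of up
# to chunk_size points each, as the real iterator does.
def fake_iterator(total_points, chunk_size=10_000):
    dtype = np.dtype([("X", "f8"), ("Y", "f8"), ("Z", "f8")])
    for start in range(0, total_points, chunk_size):
        count = min(chunk_size, total_points - start)
        yield np.zeros(count, dtype=dtype)

# With a real pipeline this loop would read:
#     for array in pipeline.iterator(chunk_size=10_000, prefetch=2):
total = 0
for array in fake_iterator(25_000, chunk_size=10_000):
    total += len(array)  # process each chunk with bounded memory

print(total)  # 25000
```

Raising prefetch buffers more arrays ahead of the consumer, trading additional memory for throughput.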
execute_streaming() method
If you just want to execute a streamable pipeline in streaming mode and don’t need to access the data points (typically when the pipeline has Writer stage(s)), you can use the Pipeline.execute_streaming(chunk_size) method instead.
This is functionally equivalent to sum(map(len, pipeline.iterator(chunk_size))) but more efficient as it avoids allocating and filling any arrays in memory.
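The equivalence can be illustrated with the same kind of stand-in iterator (a hypothetical generator, not PDAL itself):

```python
import numpy as np

# Hypothetical stand-in for Pipeline.iterator().
def fake_iterator(total_points, chunk_size):
    dtype = np.dtype([("Intensity", "u2")])
    for start in range(0, total_points, chunk_size):
        yield np.zeros(min(chunk_size, total_points - start), dtype=dtype)

# The total point count, spelled out as explicit iteration; with a real
# pipeline this would be: n = pipeline.execute_streaming(chunk_size=10_000)
n = sum(map(len, fake_iterator(25_000, 10_000)))
print(n)  # 25000
```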
Using arrays as buffers with stream handlers
It’s possible to treat NumPy arrays passed to PDAL as buffers that are iteratively populated through custom Python functions during the execution of the pipeline. This may be useful in cases where you want the reading of the input data to be handled in a streamable fashion, such as:
- When the total NumPy array data wouldn’t fit into memory
- To initiate execution of a streamable PDAL pipeline while the input data is still being read
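The handler contract can be sketched without PDAL: a fixed-size buffer array is refilled by a callable that returns how many leading points in the buffer are valid, with 0 signalling the end of the stream. The driver loop below simulates what the pipeline would do; the commented-out pdal.Pipeline call shows the assumed wiring for the real library:

```python
import numpy as np

# A fixed-size buffer that the handler refills with each chunk. The
# handler's return value says how many leading points are valid; 0 ends
# the stream.
dtype = np.dtype([("X", "f8"), ("Y", "f8"), ("Z", "f8")])
buffer = np.zeros(4, dtype=dtype)

# Hypothetical data producer yielding two chunks of points.
chunks = iter([np.ones(4, dtype=dtype), np.ones(2, dtype=dtype)])

def load_next_chunk():
    chunk = next(chunks, None)
    if chunk is None:
        return 0  # no more data: end of stream
    buffer[: len(chunk)] = chunk
    return len(chunk)

# With real PDAL (assumed parameters, per this section's description):
#     pipeline = pdal.Pipeline(json_spec, arrays=[buffer],
#                              stream_handlers=[load_next_chunk])
#     pipeline.execute_streaming(chunk_size=...)

# Simulate the driver loop the pipeline would run:
counts = []
while (n := load_next_chunk()) > 0:
    counts.append(n)  # process the valid slice buffer[:n]
print(counts)  # [4, 2]
```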
Complete streaming example
The following example demonstrates how to stream the read and write of a very large LAZ file with a low memory footprint. It:
- Creates an input pipeline to read a LAZ file in chunks of 10 million points
- Sets up an output pipeline to write to a new LAZ file
- Uses a buffer array and handler function to transfer data between pipelines
- Executes the streaming write operation with a 50 million point chunk size
- Prints progress as chunks are loaded and the final point count