Overview
Streaming inference enables real-time processing of patient data by breaking large datasets into manageable chunks. This approach reduces latency and memory usage compared to batch processing while maintaining prediction accuracy.

Batch vs Streaming Processing
The platform supports two processing modes:

- Batch processing: Process entire dataset at once, higher memory usage
- Streaming processing: Process data in chunks, lower latency per row
Comparing Performance
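The compare_batch_vs_streaming() helper (see Best Practices below) reports timings for both modes. Its actual signature is defined in the repository; the following is only a minimal sketch of the idea, assuming a DataFrame input and a prediction callable:

```python
import time
import pandas as pd

def compare_batch_vs_streaming(df: pd.DataFrame, predict_fn, chunk_size: int = 100) -> dict:
    """Sketch only: time one full-batch pass against one chunked pass."""
    # Batch mode: a single call over the whole DataFrame
    start = time.perf_counter()
    predict_fn(df)
    batch_s = time.perf_counter() - start

    # Streaming mode: repeated calls over fixed-size chunks
    start = time.perf_counter()
    for begin in range(0, len(df), chunk_size):
        predict_fn(df.iloc[begin:begin + chunk_size])
    streaming_s = time.perf_counter() - start

    return {"batch_seconds": batch_s, "streaming_seconds": streaming_s}
```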
Chunk-Based Processing
The streaming system processes data using a generator pattern that yields chunks of configurable size:

real_time/streaming.py:14-16
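The referenced snippet is not reproduced here; a minimal sketch of such a chunk generator (the name chunk_dataframe is assumed) looks like:

```python
import pandas as pd

def chunk_dataframe(df: pd.DataFrame, chunk_size: int):
    # Yield successive slices of up to chunk_size rows
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]
```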
Processing with Metrics
Track performance metrics during streaming. The StreamMetrics dataclass tracks:

- latency_ms: Average milliseconds per row processed
- throughput_rows_per_s: Rows processed per second
real_time/streaming.py:8-11
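Based on the two fields listed above, a sketch of the dataclass (the actual definition lives at the reference above and may carry additional fields) would be:

```python
from dataclasses import dataclass

@dataclass
class StreamMetrics:
    latency_ms: float             # average milliseconds per processed row
    throughput_rows_per_s: float  # rows processed per second
```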
Real-Time Inference Pipeline
Run streaming inference on patient features:

real_time/inference.py:8-21
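A hedged sketch of such a pipeline, assuming a generic predict_fn callable rather than the repository's actual model interface:

```python
import pandas as pd

def run_streaming_inference(predict_fn, features: pd.DataFrame, chunk_size: int = 100) -> pd.Series:
    """Sketch only: score patient feature rows chunk by chunk."""
    outputs = []
    for start in range(0, len(features), chunk_size):
        chunk = features.iloc[start:start + chunk_size]
        # predict_fn may be any callable returning one score per row
        outputs.append(pd.Series(predict_fn(chunk), index=chunk.index))
    # Reassemble predictions and keep the index sorted for consistency
    return pd.concat(outputs).sort_index()
```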
Throughput Optimization
Choosing Chunk Size
Chunk size affects throughput and latency:

- Small chunks (10-50): Lower latency, more overhead
- Medium chunks (100-500): Balanced performance
- Large chunks (1000+): Higher throughput, increased latency
Example Optimization
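A simple way to tune chunk size is to sweep a few candidate values and compare the reported metrics. The loop below assumes a features DataFrame, a predict_fn callable, and a process_stream() helper like the sketch under Implementation Details; the actual helper name in real_time/streaming.py may differ:

```python
for chunk_size in (10, 50, 100, 500, 1000):
    _, metrics = process_stream(features, predict_fn, chunk_size=chunk_size)
    print(f"chunk_size={chunk_size}: "
          f"{metrics.latency_ms:.3f} ms/row, "
          f"{metrics.throughput_rows_per_s:.0f} rows/s")
```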
Implementation Details
The streaming processor:

- Splits input DataFrame into chunks using iloc slicing
- Processes each chunk through the provided function
- Measures elapsed time and calculates per-row latency
- Computes throughput as rows per second
- Concatenates results maintaining index order
real_time/streaming.py:19-29
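Putting those steps together, a sketch of the processor (reusing the StreamMetrics sketch above; the real implementation is at the reference above, and process_stream is an assumed name) might read:

```python
import time
import pandas as pd

def process_stream(df: pd.DataFrame, fn, chunk_size: int = 100):
    """Sketch only: chunked processing with per-row latency and throughput."""
    results = []
    start = time.perf_counter()
    for begin in range(0, len(df), chunk_size):
        chunk = df.iloc[begin:begin + chunk_size]  # iloc-based slicing
        results.append(fn(chunk))                  # user-supplied processing
    elapsed = time.perf_counter() - start

    metrics = StreamMetrics(                       # dataclass sketched earlier
        latency_ms=elapsed * 1000 / len(df),       # milliseconds per row
        throughput_rows_per_s=len(df) / elapsed,   # rows per second
    )
    # Concatenate chunk results, preserving index order
    return pd.concat(results).sort_index(), metrics
```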
Best Practices
- Start with chunk_size=100 and adjust based on metrics
- Monitor latency_ms for real-time requirements
- Use compare_batch_vs_streaming() to validate performance gains
- Process chunks independently to enable parallel processing
- Maintain sorted index in output for result consistency