Introduction
In Chapter 10, we discussed batch processing: running jobs over bounded datasets. Now we explore stream processing: processing unbounded data that arrives continuously. A stream is data that is incrementally made available over time, for example:
- User activity events on a website
- Sensor readings from IoT devices
- Stock price updates
- Log messages from servers
1. Transmitting Event Streams
An event is a small, self-contained, immutable object containing the details of something that happened. Events are written to a topic or stream, and consumers read from it.

Message Brokers vs Event Logs
Traditional message broker (e.g., RabbitMQ):
- Messages deleted after acknowledgment
- Supports complex routing
- Low throughput per topic

Log-based message broker (e.g., Kafka):
- Messages retained (retention period is configurable)
- Simple sequential reads
- High throughput
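The log-based model can be sketched in a few lines of Python. This is a toy in-memory log, not real Kafka: messages are retained after delivery, and each consumer tracks its own read position (offset).

```python
class Log:
    """A minimal append-only log, sketching how a log-based broker
    retains messages and lets consumers replay them."""

    def __init__(self):
        self.records = []  # messages are retained, not deleted on ack

    def append(self, message):
        self.records.append(message)
        return len(self.records) - 1  # the new record's offset

    def read(self, offset):
        """Sequential read starting from a consumer-controlled offset."""
        return self.records[offset:]

log = Log()
for msg in ["click", "purchase", "click"]:
    log.append(msg)

# Independent consumers read the same retained messages,
# each resuming from its own offset.
assert log.read(0) == ["click", "purchase", "click"]
assert log.read(2) == ["click"]
```

Because the log is retained, a new consumer can be added later and replay the whole history, which a delete-on-ack broker cannot offer.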
Apache Kafka Architecture
Key concepts:
- Partition: an ordered, immutable sequence of records; each record is identified by its offset within the partition

Log Compaction
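A toy sketch of log compaction, assuming records are (key, value) pairs:

```python
def compact(log):
    """Keep only the latest value for each key.
    Keys keep the order in which they were first written."""
    latest = {}
    for key, value in log:  # later writes overwrite earlier ones
        latest[key] = value
    return list(latest.items())

log = [("user:1", "alice"), ("user:2", "bob"), ("user:1", "alicia")]
assert compact(log) == [("user:1", "alicia"), ("user:2", "bob")]
```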
Problem: logs grow forever.
Solution: keep only the latest value for each key.
Use case: maintaining database state in a log.

2. Databases and Streams
Key insight: a database can be viewed as a stream of changes.

Change Data Capture (CDC)
CDC: observe changes written to a database and replicate them to other systems. Implementation approaches range from database triggers to parsing the database's replication log. Example tool: Debezium, which reads database transaction logs.

Event Sourcing
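Event sourcing can be sketched as a fold over immutable events. The shopping-cart event types below are a hypothetical example, not from the source:

```python
# Immutable domain events (hypothetical shopping-cart example).
events = [
    {"type": "ItemAdded", "item": "book"},
    {"type": "ItemAdded", "item": "pen"},
    {"type": "ItemRemoved", "item": "book"},
]

def current_state(events):
    """Derive the current cart contents by replaying all events."""
    cart = set()
    for e in events:
        if e["type"] == "ItemAdded":
            cart.add(e["item"])
        elif e["type"] == "ItemRemoved":
            cart.discard(e["item"])
    return cart

assert current_state(events) == {"pen"}
```

Because the events themselves are never mutated, replaying a prefix of the list reconstructs any past state.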
Event sourcing: store all changes as immutable domain events and derive the current state from them. Benefits include a complete audit trail and the ability to reconstruct past states by replaying events.

3. Processing Streams
Three main uses of stream processing: searching for event patterns (complex event processing), computing aggregations (stream analytics), and maintaining materialized views.

Complex Event Processing
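Real CEP engines express patterns declaratively; a hand-rolled sketch conveys the idea. The "three failed logins within 60 seconds" rule and event format are hypothetical:

```python
def detect_brute_force(events, window_seconds=60, threshold=3):
    """Emit a complex event when one user produces `threshold`
    failed logins within `window_seconds`."""
    recent = {}   # user -> timestamps of recent failures
    alerts = []
    for ts, user, outcome in sorted(events):
        if outcome != "failure":
            continue
        times = [t for t in recent.get(user, []) if ts - t < window_seconds]
        times.append(ts)
        recent[user] = times
        if len(times) >= threshold:
            alerts.append((user, ts))
    return alerts

events = [(0, "mallory", "failure"), (10, "alice", "success"),
          (20, "mallory", "failure"), (30, "mallory", "failure")]
assert detect_brute_force(events) == [("mallory", 30)]
```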
Goal: search for patterns of events in a stream and emit a complex event when a pattern is matched. Pattern queries are often expressed in an SQL-like or regular-expression-like language.

Stream Analytics
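A minimal windowed aggregation: counting events per tumbling (fixed-size, non-overlapping) window keyed by event timestamp. The one-minute window size is an arbitrary choice:

```python
from collections import Counter

def tumbling_counts(timestamps, window=60):
    """Count events per fixed, non-overlapping window.
    Window n covers [n*window, (n+1)*window)."""
    return Counter(ts // window for ts in timestamps)

# Events at 5s, 42s, 61s: two fall in the first minute, one in the second.
assert tumbling_counts([5, 42, 61]) == {0: 2, 1: 1}
```

A hopping window would assign each event to every window whose range contains it, since hopping windows overlap.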
Aggregations are computed over time windows:
- Tumbling window: fixed length, non-overlapping
- Hopping window: fixed length, overlapping (advances by a hop smaller than the window)
- Sliding window: continuous; contains events within some interval of each other
- Session window: groups a burst of activity, closed by inactivity

Time in Stream Processing
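A simple watermark sketch: the watermark trails the maximum event time seen by an allowed lateness (the 10-second value is an arbitrary assumption), and events whose timestamp falls behind it are treated as late:

```python
def split_late(events, allowed_lateness=10):
    """Process event timestamps in arrival order; an event older than
    the watermark (max event time seen - allowed_lateness) is late."""
    watermark = float("-inf")
    on_time, late = [], []
    for event_time in events:
        watermark = max(watermark, event_time - allowed_lateness)
        if event_time < watermark:
            late.append(event_time)   # drop, or emit a correction
        else:
            on_time.append(event_time)
    return on_time, late

# An event stamped 100 arrives first, so one stamped 85 is too far behind.
assert split_late([100, 95, 85, 101]) == ([100, 95, 101], [85])
```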
Challenge: events may arrive out of order.

Two notions of time:
- Event time: when the event actually occurred
- Processing time: when the stream processor observes the event

Watermarks indicate progress in event time: a watermark asserts that no events with earlier timestamps are expected. Late events that arrive after the watermark must be handled specially, for example by dropping them or by publishing corrections to previously emitted results.

4. Stream Joins
Joining streams is more complex than joining tables.

Stream-Stream Join (Window Join)
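A window join between two streams can be sketched as follows. The search and click events, session keys, and 30-second window are hypothetical:

```python
def window_join(searches, clicks, window=30):
    """Join events from two streams on a key when their timestamps
    fall within `window` seconds of each other."""
    joined = []
    for s_ts, s_key, query in searches:
        for c_ts, c_key, url in clicks:
            if s_key == c_key and abs(s_ts - c_ts) <= window:
                joined.append((s_key, query, url))
    return joined

searches = [(0, "sess-1", "streams"), (100, "sess-2", "batch")]
clicks = [(12, "sess-1", "/ch11"), (500, "sess-2", "/ch10")]
# Only sess-1's click falls inside the 30-second window.
assert window_join(searches, clicks) == [("sess-1", "streams", "/ch11")]
```

A real implementation would index buffered events by join key and expire them once the window passes, rather than scanning both inputs.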
Example: matching search queries with subsequent clicks on search results. Implementation: the processor maintains state, indexed by join key, for events seen within the join window on both inputs.

Stream-Table Join
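Enrichment against a local copy of a table kept up to date from a changelog can be sketched like this. The user table and change format are hypothetical; in practice the changelog would be a CDC feed from the database:

```python
# Local replica of a users table, maintained by applying a changelog.
users = {}

def apply_change(change):
    """Apply one changelog entry; a value of None means deletion."""
    key, value = change
    if value is None:
        users.pop(key, None)
    else:
        users[key] = value

def enrich(event):
    """Join a stream event with the current table state."""
    user_id, action = event
    return {"user": users.get(user_id), "action": action}

apply_change(("u1", {"name": "Ada"}))
assert enrich(("u1", "login")) == {"user": {"name": "Ada"}, "action": "login"}
apply_change(("u1", {"name": "Ada L."}))   # the table changes over time
assert enrich(("u1", "logout"))["user"] == {"name": "Ada L."}
```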
Join stream events with a database table (enrichment). Challenge: the database table changes over time. Solution: keep a local copy of the table inside the stream processor and keep it up to date by subscribing to a CDC changelog.

Table-Table Join
Both inputs are changelogs (CDC streams from databases), and the join maintains a materialized view of the join result. Example: keeping a denormalized cache (such as a timeline) up to date as both underlying tables change.

5. Fault Tolerance
Stream processors must handle failures gracefully.

Microbatching
Approach: break the stream into small batches (Spark Streaming).

Advantages:
- Can use batch processing techniques
- Easier fault tolerance (batch atomicity)

Disadvantages:
- Higher latency (must wait for a batch to fill)
- Not true streaming
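The approach can be sketched by chunking an unbounded iterator into small bounded batches, each processed atomically by a batch-style job. The batch size and processing function below are illustrative:

```python
from itertools import islice

def microbatches(stream, batch_size):
    """Chunk a (possibly unbounded) iterator into small bounded batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def process_batch(batch):
    # Any batch technique applies here; on failure, the whole
    # batch can be retried atomically.
    return sum(batch)

results = [process_batch(b) for b in microbatches(range(7), batch_size=3)]
assert results == [3, 12, 6]  # sums of [0,1,2], [3,4,5], [6]
```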
Checkpointing
Approach: periodically save the processor's complete state (Flink). Flink checkpointing injects barriers into the stream and snapshots each operator's state as the barrier passes through it. Exactly-once semantics: the effect of each event on the final result appears exactly once, even after failures and retries. Implementing exactly-once requires either committing output and state atomically, or making writes idempotent.

Idempotence
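Idempotent writes can be sketched by tagging each write with its log offset and ignoring offsets that have already been applied. The state layout is illustrative:

```python
state = {}      # key -> value
applied = {}    # key -> highest offset already applied

def idempotent_write(offset, key, value):
    """Apply a write only if this offset has not yet been applied,
    so redelivery after a failure has no additional effect."""
    if applied.get(key, -1) >= offset:
        return  # duplicate delivery: ignore
    state[key] = value
    applied[key] = offset

idempotent_write(0, "counter", 1)
idempotent_write(1, "counter", 2)
idempotent_write(1, "counter", 2)   # redelivered after a crash
assert state == {"counter": 2}
```

This is why log-based brokers that expose per-record offsets pair so well with idempotent sinks: the offset gives every write a natural deduplication key.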
Idempotent operation: one that can be performed multiple times with the same effect as performing it once. Making operations idempotent: for example, include the message offset with each write and ignore writes whose offset has already been applied.

6. Stream Processing Frameworks Comparison
The frameworks differ mainly in processing model and latency; their trade-offs are summarized in the key takeaways below.

Summary
Key Takeaways:
- Event logs are fundamental:
  - Durable, ordered, partitioned
  - Kafka pioneered log-based messaging
  - Enable replay and multiple consumers
- Time is complex in streams:
  - Event time vs processing time
  - Watermarks indicate progress
  - Late events require special handling
- Windowing enables aggregations:
  - Tumbling: fixed, non-overlapping
  - Hopping: fixed, overlapping
  - Sliding: continuous
  - Session: based on inactivity
- Joins are more complex than in batch:
  - Stream-stream: within time window
  - Stream-table: lookup enrichment
  - Table-table: maintain materialized view
- Fault tolerance is critical:
  - Exactly-once semantics ideal
  - Checkpointing saves state
  - Idempotence simplifies recovery
- Different frameworks, different trade-offs:
  - Flink: true streaming, low latency
  - Spark: unified batch/stream, microbatching
  - Kafka Streams: simple, Kafka-native
| Aspect | Batch Processing | Stream Processing |
|---|---|---|
| Input | Bounded (complete dataset) | Unbounded (continuous) |
| Latency | Minutes to hours | Milliseconds to seconds |
| Results | Complete, final | Continuous, approximate |
| State | Materialized to disk | In-memory with checkpoints |
| Time | Processing time only | Event time + processing time |
| Failures | Retry entire job | Checkpoint and replay |
| Use cases | Daily reports, ML training | Fraud detection, monitoring |