Introduction
There are three main types of systems:
- Services (online systems): wait for a request and send a response; measured by response time
- Batch processing (offline systems): take a large input and produce an output; measured by throughput
- Stream processing (near-real-time systems): somewhere between batch and services; consume inputs shortly after they are produced
1. Batch Processing with Unix Tools
Simple Log Analysis
Let’s start with a simple task: analyze web server logs to find the top 5 most popular URLs.
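One way to do this on a single machine is an in-memory aggregation with a hash table; a minimal Python sketch (the toy request list stands in for URLs parsed out of real log lines):

```python
from collections import Counter

# Toy stand-in for URLs extracted from web server log lines.
requests = [
    "/home", "/about", "/home", "/products", "/home",
    "/about", "/products", "/home", "/contact", "/home",
]

# In-memory aggregation: one hash-table entry per distinct URL.
counts = Counter(requests)
top5 = counts.most_common(5)
print(top5)
```

This works as long as the set of distinct URLs fits in memory; the sorting approach below avoids that assumption.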
Using Unix tools, the whole job is one pipeline: extract the URL field from each line, sort so duplicates become adjacent, count them with uniq -c, sort by count, and take the first five lines with head.
The Unix Philosophy
Uniform interface: everything is a file (or a stream of bytes).
Sorting vs In-Memory Aggregation
GNU sort is remarkably efficient:
- Automatically spills to disk when data exceeds memory
- Parallelizes sorting across CPU cores
- Can merge pre-sorted files
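The merge step is what makes the disk spill work: an external sort writes sorted runs and then streams them back together. Python's heapq.merge does the same streaming merge over any number of pre-sorted inputs, sketched here with toy runs:

```python
import heapq

# Three pre-sorted runs, as an external sort would produce after
# sorting chunks that each fit in memory.
run1 = ["/about", "/home", "/home"]
run2 = ["/contact", "/products"]
run3 = ["/home", "/products"]

# Stream-merge the runs without loading everything at once;
# the result is globally sorted.
merged = list(heapq.merge(run1, run2, run3))
print(merged)
```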
The Unix Pipe
The pipe (|) connects the output of one program to the input of another.
Benefits:
- No temporary files on disk
- Programs run in parallel
- Backpressure: slower program controls rate
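These properties can be mimicked in Python with generators: each stage pulls from the previous one on demand, so nothing is buffered beyond one item and the consumer naturally sets the pace. A loose analogy, not a real OS pipe:

```python
def produce():
    # Upstream stage: yields lines one at a time, on demand.
    for n in range(5):
        yield f"line {n}"

def transform(lines):
    # Middle stage: pulls one line, transforms it, passes it on.
    for line in lines:
        yield line.upper()

# The final consumer drives the whole pipeline; upstream stages only
# run when the next item is requested (pull-based "backpressure").
pipeline = transform(produce())
result = list(pipeline)
print(result)
```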
2. MapReduce and Distributed Filesystems
Unix tools work great on a single machine, but what about datasets that don’t fit on one machine? MapReduce is like Unix tools, but distributed across thousands of machines.
MapReduce Job Execution
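For intuition, the phases of a job can be simulated in plain Python (a single-machine sketch, not the Hadoop API): map emits key/value pairs, the shuffle sorts and groups them by key, and reduce aggregates each group.

```python
from itertools import groupby
from operator import itemgetter

# Toy input: one URL per request record.
log = ["/home", "/about", "/home", "/products", "/home"]

# Map phase: emit (key, value) pairs.
mapped = [(url, 1) for url in log]

# Shuffle phase: sort by key so equal keys become adjacent, then
# group them (the framework does this between map and reduce).
mapped.sort(key=itemgetter(0))

# Reduce phase: aggregate each key's values.
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}
print(counts)
```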
Example: counting URL visits (as in the earlier Unix example).
Distributed Filesystem
MapReduce relies on a distributed filesystem (HDFS, GFS) to store input and output. Blocks are replicated across machines for fault tolerance, and a locality optimization runs each task on a machine that already stores a copy of its input data.
MapReduce Workflows
Workflows often need to chain multiple MapReduce jobs, the output of one becoming the input of the next. Problem: each job writes its output to HDFS and the next job reads it back, which is slow.
Joins in MapReduce
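A reduce-side sort-merge join can be sketched on one machine with toy records. In a real job the tagging happens in the mappers and the sorting in the shuffle, and a secondary sort would let the reducer stream rather than buffer each group:

```python
from itertools import groupby
from operator import itemgetter

# Toy datasets; field names are illustrative.
activity = [(2, "click"), (1, "view"), (2, "view")]   # (user_id, event)
profiles = [(1, "Alice"), (2, "Bob")]                 # (user_id, name)

# Mappers tag each record with its source; the shuffle then sorts
# everything by join key so both sides of one key are adjacent.
tagged = [(uid, "activity", ev) for uid, ev in activity] + \
         [(uid, "profile", name) for uid, name in profiles]
tagged.sort(key=itemgetter(0))

# Reducer: for each key, combine the profile record with the events.
joined = []
for uid, group in groupby(tagged, key=itemgetter(0)):
    records = list(group)
    names = [v for _, tag, v in records if tag == "profile"]
    events = [v for _, tag, v in records if tag == "activity"]
    for name in names:
        for ev in events:
            joined.append((uid, name, ev))
print(joined)
```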
Problem: join two datasets, e.g., user activity events with user profile records. Sort-merge join: mappers emit records keyed by the join key (such as user ID), the shuffle sorts by that key, and each reducer’s input contains all records for one key, from both datasets, adjacent to each other.
Group By in MapReduce
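A group-by is the same shuffle machinery with a single input. A single-machine Python sketch, with the schema of the toy records an assumption:

```python
from collections import defaultdict

# Toy user records: (city, age).
users = [("Berlin", 30), ("Paris", 40), ("Berlin", 20), ("Paris", 50)]

# Map phase: key each record by city. The shuffle delivers all ages
# for one city to the same reducer; here a dict plays that role.
ages_by_city = defaultdict(list)
for city, age in users:
    ages_by_city[city].append(age)

# Reduce phase: average each city's ages.
avg_age = {city: sum(ages) / len(ages)
           for city, ages in ages_by_city.items()}
print(avg_age)
```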
Example: calculate the average age of users in each city.
Handling Skew
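One common mitigation can be sketched as follows: sample the keys to detect hot ones, salt a hot key so its records spread over several reducers, then combine the partial results in a second aggregation. All names and thresholds here are illustrative:

```python
import random
from collections import Counter

random.seed(0)

# Heavily skewed toy data: one hot key dominates.
records = [("hot", 1)] * 1000 + [("cold", 1)] * 10

# Step 1: sample to detect hot keys (threshold is an assumption).
sample = random.sample(records, 100)
hot_keys = {k for k, n in Counter(k for k, _ in sample).items() if n > 50}

# Step 2: salt hot keys so their records spread over several partitions.
NUM_SALTS = 4
def partition_key(key):
    if key in hot_keys:
        return (key, random.randrange(NUM_SALTS))  # spread the hot key
    return (key, 0)

partial = Counter()
for key, value in records:
    partial[partition_key(key)] += value           # first-stage reducers

# Step 3: a second aggregation combines the salted partials per key.
totals = Counter()
for (key, _salt), value in partial.items():
    totals[key] += value
print(dict(totals))
```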
Problem: some keys have many more values than others (hot keys), so one reducer gets far more work than the rest. Solution: a skewed join, which samples the input to find hot keys and then spreads each hot key’s records across multiple reducers.
3. Beyond MapReduce
MapReduce has limitations:
- Materializes intermediate state to HDFS (slow)
- Only supports map and reduce operations
- Verbose code for simple operations
Dataflow Engines
Spark, Flink, and Tez are more flexible than MapReduce: a job is an arbitrary DAG (directed acyclic graph) of operators, not just a map followed by a reduce.
Apache Spark
Key idea: Resilient Distributed Datasets (RDDs) with transformations.
Fault Tolerance in Dataflow Engines
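The lineage idea can be sketched without Spark: a dataset is described by the chain of transformations that produced it, so a lost partition can be rebuilt by replaying that chain. Plain Python, not the RDD API:

```python
# A "dataset" records its parent and the transformation applied to it
# instead of materializing results; this chain is the lineage.
class Dataset:
    def __init__(self, source=None, parent=None, fn=None):
        self.source, self.parent, self.fn = source, parent, fn

    def map(self, fn):
        return Dataset(parent=self, fn=fn)

    def compute(self):
        # Walk back to the source, then replay every transformation.
        # Recovery after losing a partition does exactly this replay.
        if self.parent is None:
            return list(self.source)
        return [self.fn(x) for x in self.parent.compute()]

base = Dataset(source=[1, 2, 3])
derived = base.map(lambda x: x * 10).map(lambda x: x + 1)

print(derived.compute())   # nothing was cached; everything is replayed
```

Checkpointing corresponds to saving an intermediate `compute()` result so replay does not have to start from the original source.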
MapReduce: recompute failed tasks from their input on HDFS. Spark: recompute lost partitions from lineage, with checkpointing to bound the cost when lineages grow long.
Materialization of Intermediate State
Dataflow engines avoid writing intermediate state to a distributed filesystem between stages, pipelining it through memory instead.
4. Graph Processing
Many algorithms need to traverse graphs:
- Social network analysis
- PageRank
- Shortest paths
- Connected components
Bulk Synchronous Parallel (BSP)
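In this model, computation runs as a sequence of supersteps separated by barriers: in each superstep every active vertex processes its incoming messages, updates its state, and sends messages to its neighbors. A minimal single-process sketch of shortest paths in this style (toy graph; no real distribution):

```python
# Vertices: adjacency map of neighbor -> edge weight.
graph = {"A": {"B": 1, "C": 4}, "B": {"C": 2}, "C": {}}

dist = {v: float("inf") for v in graph}
messages = {"A": [0]}          # the source vertex receives distance 0

while messages:                # loop = one superstep per iteration
    next_messages = {}
    for vertex, incoming in messages.items():
        best = min(incoming)
        if best < dist[vertex]:             # state improved: stay active
            dist[vertex] = best
            for neighbor, weight in graph[vertex].items():
                next_messages.setdefault(neighbor, []).append(best + weight)
    messages = next_messages                # barrier before next superstep

print(dist)
```

The algorithm halts when no vertex has messages to process, which is the BSP termination condition (all vertices vote to halt).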
The Pregel model (used by Apache Giraph and GraphX) follows this approach. Example: finding shortest paths.
Graph Partitioning
Challenge: distribute the graph across machines. Poor partitioning puts connected vertices on different machines, so messages in every superstep cross the network; good partitioning co-locates densely connected vertices and minimizes cross-machine edges.
5. Comparing Batch Processing Systems
Performance comparison: see the table at the end of the chapter.
Use Cases
Declarative Query Languages
High-level SQL on top of batch systems. Examples:
- Hive: SQL on MapReduce/Tez/Spark
- Spark SQL: SQL on Spark
- Presto: SQL for interactive queries
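The appeal of the declarative approach is that the earlier top-URLs job collapses into a single query, leaving the execution strategy to the engine. A small sketch with SQLite standing in for a batch SQL engine (table and data are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (url TEXT)")
conn.executemany(
    "INSERT INTO requests VALUES (?)",
    [("/home",), ("/about",), ("/home",), ("/products",), ("/home",)],
)

# Declarative: state *what* you want; the engine decides how to
# execute it (scan, aggregate, sort) and can optimize the plan.
top = conn.execute(
    "SELECT url, COUNT(*) AS n FROM requests "
    "GROUP BY url ORDER BY n DESC LIMIT 5"
).fetchall()
print(top)
```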
Summary
Key Takeaways:
- Unix philosophy still relevant:
  - Simple, composable tools
  - Uniform interfaces (stdin/stdout)
  - Separation of concerns
- MapReduce pioneered distributed batch processing:
  - Fault tolerance through replication
  - Data locality optimization
  - But limited by materialization overhead
- Dataflow engines improve on MapReduce:
  - Arbitrary DAGs instead of just map/reduce
  - Pipelined operations in memory
  - Better for iterative algorithms
- Different models for different problems:
  - Sort-merge joins for large datasets
  - Broadcast joins for small datasets
  - Graph processing for connected data
- Fault tolerance strategies:
  - MapReduce: re-execute from HDFS
  - Spark: recompute from lineage
  - Trade-off: recomputation vs materialization
- High-level abstractions winning:
  - SQL on batch engines
  - Declarative > imperative
  - Query optimization opportunities
| System | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Unix tools | Simple, fast on single machine | Doesn’t scale | Ad-hoc analysis, small data |
| MapReduce | Scalable, fault tolerant, mature | Slow, limited operators | Very large batch jobs |
| Spark | Fast, rich API, in-memory | More memory required | Iterative ML, interactive queries |
| Pregel/Giraph | Optimized for graphs | Limited to graph algorithms | Social networks, PageRank |