Problem Statement
Report all CircleNetPage users (NickName and JobTitle) whose FavoriteHobby matches a specified hobby (e.g., “PodcastBinging”). The SQL equivalent is a simple SELECT of NickName and JobTitle with a WHERE filter on FavoriteHobby.

Implementation
Simple Approach (Map-Reduce)
The implementation uses a basic Map-Reduce job in which the mapper filters records and the reducer passes them through unchanged. The mapper reads the target hobby from the job Configuration (TaskCSimple.java:17-36).
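The original mapper listing is not reproduced here; the following is a self-contained sketch of its filter logic with the Hadoop boilerplate stripped out. The class name and the comma-delimited field order (NickName, JobTitle, FavoriteHobby) are assumptions, not taken from the source.

```java
// Sketch of the filtering the mapper performs. In the real job this logic
// lives in map(), with the target hobby read from the Configuration in setup().
public class HobbyFilterSketch {

    /**
     * Returns "NickName\tJobTitle" when the record's FavoriteHobby matches
     * the target hobby, or null when the record should be dropped.
     * Field order NickName,JobTitle,FavoriteHobby is an assumption.
     */
    public static String filter(String line, String targetHobby) {
        String[] fields = line.split(",");
        if (fields.length < 3) {
            return null; // malformed record: skip it, as the mapper would
        }
        String nickName = fields[0].trim();
        String jobTitle = fields[1].trim();
        String hobby = fields[2].trim();
        return hobby.equals(targetHobby) ? nickName + "\t" + jobTitle : null;
    }
}
```

In the actual mapper, the equivalent of filter() runs once per input line, and each non-null result is emitted through context.write().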
Optimization Opportunity

Map-Only Job: This task can be optimized further by eliminating the reduce phase entirely and running it as a map-only job. Because the job only filters records, with no aggregation or grouping, the mapper can write its output directly. Skipping the shuffle and sort phases cuts I/O overhead roughly in half.
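The change amounts to one line in the job driver. The sketch below assumes the standard Hadoop `Job` API and is not a complete, compilable file; the class name `TaskCSimple.HobbyMapper` and the property key `circlenet.hobby` are illustrative, not taken from the source.

```java
Configuration conf = new Configuration();
conf.set("circlenet.hobby", args[2]);   // runtime parameter read by the mapper's setup()

Job job = Job.getInstance(conf, "TaskC map-only");
job.setJarByClass(TaskCSimple.class);
job.setMapperClass(TaskCSimple.HobbyMapper.class);
job.setNumReduceTasks(0);               // map-only: no shuffle, no sort, no reduce
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
```

With zero reducers, each map task writes its output directly to the output directory (one part-m-NNNNN file per task), which is exactly what eliminates the shuffle and sort stages.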
Why No Combiner?

A combiner pays off only when it can shrink the map output by aggregating values that share a key. This job performs pure filtering: every record that passes the filter must be emitted unchanged, so there is nothing to aggregate and no per-key data volume to reduce.
Performance Characteristics
| Metric | Map-Reduce Version | Map-Only Version |
|---|---|---|
| Map Output | ~15K records (for hobby with ~7.5% frequency) | Same |
| Shuffle Phase | Required | Eliminated |
| Sort Phase | Required | Eliminated |
| Reduce Phase | Pass-through | None |
| I/O Overhead | Baseline (100%) | ~50% of baseline |
| Execution Time | Baseline | 30-50% faster |
Running the Task
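A typical invocation passes the input path, output path, and target hobby as command-line arguments. The jar name, class name, paths, and argument order below are assumptions, not taken from the source:

```bash
hadoop jar taskc.jar TaskCSimple /data/circlenet/users /out/taskc PodcastBinging
```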
Sample Output
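The records below are purely hypothetical and only illustrate the default key-TAB-value layout that Hadoop's TextOutputFormat produces:

```
Alice	Engineer
Marco	DataAnalyst
Priya	Teacher
```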
Passing Parameters to MapReduce
This task demonstrates passing runtime parameters to MapReduce jobs: the driver stores the target hobby in the job Configuration before submission, and the mapper reads it back via context.getConfiguration() in setup().

Key Takeaways
- Current approach: Map-Reduce with pass-through reducer
- Best optimization: Map-only job (setNumReduceTasks(0))
- No combiner benefit: Filter operations don’t reduce data volume per key
- Parameter passing: Use Configuration to pass runtime parameters
- When to use map-only: Filter, projection, and transformation operations without grouping
- Performance: Map-only jobs save 30-50% execution time by eliminating shuffle/sort