Task A analyzes the CircleNetPage dataset to report the frequency of each favorite hobby: a classic MapReduce word-count pattern, optimized with combiner-based pre-aggregation.

Problem Statement

Report the frequency of each favorite hobby on CircleNet by counting occurrences of each hobby value in the FavoriteHobby field.

SQL Equivalent:
SELECT FavoriteHobby, COUNT(*) as frequency
FROM CircleNetPage
GROUP BY FavoriteHobby;

Approaches

Simple Implementation

The basic approach uses a standard MapReduce pattern without optimization.

Mapper (TaskA.java:17-31):
public static class TaskAMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text hobby = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] fields = line.split(",");
        if (fields.length == 5) {
            hobby.set(fields[4]);      // FavoriteHobby is the fifth column
            context.write(hobby, one);
        }
    }
}
Reducer (TaskA.java:33-47):
public static class TaskAReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
Characteristics:
  • Emits one record per input line (200K emissions)
  • All emissions shuffled to reducers
  • No pre-aggregation
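The characteristics above can be simulated in plain Java with no Hadoop dependency (the class name, sample rows, and hobby values below are invented for illustration): every valid 5-field line produces exactly one (hobby, 1) pair, and the reduce step sums the 1s per key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SimpleCountSketch {

    // Map phase: one (hobby, 1) emission per valid 5-field line.
    static List<Map.Entry<String, Integer>> mapPhase(List<String> lines) {
        List<Map.Entry<String, Integer>> emissions = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            if (fields.length == 5) {
                emissions.add(Map.entry(fields[4], 1));
            }
        }
        return emissions;
    }

    // Reduce phase: sum the 1s per key.
    static Map<String, Integer> reducePhase(List<Map.Entry<String, Integer>> emissions) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : emissions) {
            counts.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Toy stand-in rows for CircleNetPage.csv (invented values).
        List<String> lines = List.of(
                "1,Alice,25,NY,Hiking",
                "2,Bob,30,LA,Cooking",
                "3,Carol,28,SF,Hiking");
        List<Map.Entry<String, Integer>> emissions = mapPhase(lines);
        // One shuffled record per input line: 3 here, 200K in the real job.
        System.out.println("Shuffled records: " + emissions.size());
        System.out.println(reducePhase(emissions));
    }
}
```

Note how the shuffle cost scales with input lines, not with distinct hobbies; that gap is exactly what a combiner exploits.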

Performance Comparison

Metric            Simple         Optimized          Improvement
Shuffle I/O       200K records   ~100s of records   99%+ reduction
Network Usage     High           Minimal            Significant
Execution Time    Baseline       Faster             20-40% faster
Combiner optimization is most effective when there’s high key duplication (many users share the same hobby). With only ~100 unique hobbies among 200K users, this task sees excellent combiner performance.
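The effect of local pre-aggregation can be sketched in plain Java (no Hadoop; the class name, split contents, and counts are invented): each map task's output is summed per key before the shuffle, so the number of shuffled records drops from one per input record to at most one per distinct hobby per split.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {

    // What a combiner does to one map task's output: sum locally per key.
    static Map<String, Integer> combine(List<String> hobbies) {
        Map<String, Integer> local = new LinkedHashMap<>();
        for (String h : hobbies) {
            local.merge(h, 1, Integer::sum);
        }
        return local;
    }

    public static void main(String[] args) {
        // Two toy map-task outputs (one hobby value per record).
        List<String> split1 = List.of("Hiking", "Hiking", "Cooking");
        List<String> split2 = List.of("Hiking", "Cooking", "Cooking");

        int withoutCombiner = split1.size() + split2.size(); // 6 shuffled records
        Map<String, Integer> c1 = combine(split1);
        Map<String, Integer> c2 = combine(split2);
        int withCombiner = c1.size() + c2.size();            // 4 shuffled records

        // The final reduce merges the partial sums; the result is identical either way.
        Map<String, Integer> total = new LinkedHashMap<>(c1);
        c2.forEach((k, v) -> total.merge(k, v, Integer::sum));
        System.out.println(withoutCombiner + " -> " + withCombiner + " : " + total);
    }
}
```

In the actual Hadoop job, enabling this is typically a one-line change, `job.setCombinerClass(TaskAReducer.class)`, which is legal here because COUNT is associative and commutative, so the reducer class can double as the combiner.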

Running the Task

# Set environment variables
export JAR=/home/ds503/ds503_bdm-1.0-SNAPSHOT.jar
export PAGES=/circlenet/pages/CircleNetPage.csv
export OUT=/circlenet/output

# Run simple version
hadoop jar $JAR circlenet.taskA.TaskA $PAGES $OUT/taskA/simple

# View results
hdfs dfs -cat $OUT/taskA/simple/part-r-00000 | head -20

Sample Output

PodcastBinging    15234
Gardening         12876
Photography       11543
Hiking            10987
Cooking           9876
...

Key Takeaways

  • Simple approach: Standard MapReduce aggregation
  • Optimization technique: Combiner for pre-aggregation
  • When to use combiners: Aggregation operations (SUM, COUNT, MAX, MIN)
  • When NOT to use combiners: Average calculations, set operations requiring all values
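The average caveat in the last bullet can be seen with a tiny plain-Java computation (class name and values invented for illustration): if a combiner emits per-split averages and the reducer averages those, unequal split sizes skew the result; the standard fix is to have the combiner emit (sum, count) pairs instead.

```java
import java.util.List;

public class AverageCaveat {

    // Wrong: combiner emits a per-split average, reducer averages the averages.
    static double averageOfAverages(List<List<Double>> splits) {
        double sum = 0;
        for (List<Double> s : splits) {
            sum += s.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        }
        return sum / splits.size();
    }

    // Right: combiner carries (sum, count) per split, reducer merges the pairs.
    static double trueAverage(List<List<Double>> splits) {
        double sum = 0;
        long count = 0;
        for (List<Double> s : splits) {
            sum += s.stream().mapToDouble(Double::doubleValue).sum();
            count += s.size();
        }
        return sum / count;
    }

    public static void main(String[] args) {
        // Splits of unequal size: 3 values vs 1 value.
        List<List<Double>> splits = List.of(List.of(1.0, 2.0, 3.0), List.of(10.0));
        System.out.println(averageOfAverages(splits)); // 6.0 (wrong)
        System.out.println(trueAverage(splits));       // 4.0 (correct)
    }
}
```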
