Task C filters the CircleNetPage dataset to return all users whose favorite hobby matches a specified value, demonstrating a simple filter operation that could be optimized as a map-only job.

Problem Statement

Report all CircleNetPage users (NickName and JobTitle) whose FavoriteHobby matches a specified hobby (e.g., “PodcastBinging”). SQL Equivalent:
SELECT NickName, JobTitle
FROM CircleNetPage
WHERE FavoriteHobby = 'PodcastBinging';

Implementation

Simple Approach (Map-Reduce)

The implementation uses a basic Map-Reduce job where the mapper filters records and the reducer passes them through. Mapper with Configuration (TaskCSimple.java:17-36):
public static class MapperC extends Mapper<LongWritable, Text, Text, Text> {
    private String targetHobby;
    private final Text outKey = new Text();
    private final Text outVal = new Text();

    @Override
    protected void setup(Context context) {
        // Read hobby parameter from job configuration
        targetHobby = context.getConfiguration().get("task.c.hobby", "").trim();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) 
        throws IOException, InterruptedException {
        String[] f = CsvUtils.split(value.toString());
        if (f.length >= 5 && f[4].trim().equalsIgnoreCase(targetHobby)) {
            outKey.set(f[1].trim());  // NickName
            outVal.set(f[2].trim());  // JobTitle
            context.write(outKey, outVal);
        }
    }
}
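The mapper's filter logic can be exercised outside Hadoop. Below is a minimal standalone sketch; `FilterSketch` is a hypothetical class, and a naive comma split stands in for the project's `CsvUtils` helper (which presumably also handles quoting):

```java
public class FilterSketch {
    // Naive CSV split; the real CsvUtils helper presumably handles quoted fields.
    static String[] split(String line) {
        return line.split(",", -1);
    }

    // Returns "NickName\tJobTitle" if the record's FavoriteHobby (field 4)
    // matches the target hobby case-insensitively, else null.
    static String filter(String line, String targetHobby) {
        String[] f = split(line);
        if (f.length >= 5 && f[4].trim().equalsIgnoreCase(targetHobby)) {
            return f[1].trim() + "\t" + f[2].trim();  // NickName, JobTitle
        }
        return null;
    }

    public static void main(String[] args) {
        String row = "101,PodcastFan123,Software Engineer,Boston,PodcastBinging";
        System.out.println(filter(row, "podcastbinging"));  // case-insensitive match
        System.out.println(filter(row, "Gardening"));       // no match: prints null
    }
}
```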
Pass-Through Reducer (TaskCSimple.java:38-45):
public static class PassReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) 
        throws IOException, InterruptedException {
        for (Text v : values) {
            context.write(key, v);
        }
    }
}
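Worth noting: Hadoop's base `org.apache.hadoop.mapreduce.Reducer` already performs exactly this identity pass-through (its default `reduce()` writes each value with its key), so the custom `PassReducer` is not strictly needed:

```java
// The base Reducer is an identity pass-through by default,
// so the driver could simply register it directly:
job.setReducerClass(Reducer.class);
```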
Setting Configuration Parameter (TaskCSimple.java:53-54):
Configuration conf = new Configuration();
conf.set("task.c.hobby", args[2]);  // Pass hobby as 3rd argument
Job job = Job.getInstance(conf, "TaskCSimple");

Optimization Opportunity

Map-Only Job: This task can be optimized by eliminating the reduce phase entirely, making it a map-only job:
job.setNumReduceTasks(0);  // No reducers needed
Since no aggregation or grouping is required, only filtering and projection, the mapper can write results directly to the output. This skips the shuffle and sort phases, roughly halving the job's I/O overhead.
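A sketch of what the full map-only driver could look like, reusing the existing MapperC; the class name `TaskCMapOnly` and the argument order are assumptions mirroring `TaskCSimple`, not the project's actual code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TaskCMapOnly {  // hypothetical map-only variant of TaskCSimple
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("task.c.hobby", args[2]);  // hobby as 3rd argument, as before

        Job job = Job.getInstance(conf, "TaskCMapOnly");
        job.setJarByClass(TaskCMapOnly.class);
        job.setMapperClass(TaskCSimple.MapperC.class);
        job.setNumReduceTasks(0);            // map-only: mapper output goes straight to HDFS
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

One practical detail: with zero reducers, output files are named part-m-00000 rather than part-r-00000, so the `hdfs dfs -cat` path in the run instructions would need to change accordingly.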

Why No Combiner?

Combiner optimization doesn’t apply here because:
  • There’s no aggregation operation
  • We’re simply filtering and projecting fields
  • The mapper already outputs the final format
  • A combiner would just pass through data without reducing it

Performance Characteristics

Metric            Map-Reduce Version                              Map-Only Version
Map Output        ~15K records (for hobby with ~7.5% frequency)   Same
Shuffle Phase     Required                                        Eliminated
Sort Phase        Required                                        Eliminated
Reduce Phase      Pass-through                                    None
I/O Overhead      100%                                            50%
Execution Time    Baseline                                        30-50% faster
The current implementation uses a reducer even though one is not necessary, and the run instructions acknowledge this:
# optimized version is optional; latest timing shows no real gain
That note also tempers the estimates above: on small inputs, eliminating the shuffle may not produce a measurable wall-clock gain. To make the job truly map-only, set numReduceTasks to 0 in the driver.

Running the Task

export JAR=/home/ds503/ds503_bdm-1.0-SNAPSHOT.jar
export PAGES=/circlenet/pages/CircleNetPage.csv
export OUT=/circlenet/output

# Run with hobby parameter (3rd argument)
hadoop jar $JAR circlenet.taskC.TaskCSimple \
  $PAGES \
  $OUT/taskC/simple \
  PodcastBinging

# View results
hdfs dfs -cat $OUT/taskC/simple/part-r-00000 | head -20

Sample Output

PodcastFan123    Software Engineer
AudioLover88     Product Manager
CastAddict       Data Scientist
PodNinja         Marketing Director
...

Passing Parameters to MapReduce

This task demonstrates passing runtime parameters to MapReduce jobs:
// In main method
Configuration conf = new Configuration();
conf.set("task.c.hobby", args[2]);

// In mapper setup
targetHobby = context.getConfiguration().get("task.c.hobby", "");
This pattern allows the same job to filter by different hobbies without recompilation.
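An alternative pattern (shown here as a sketch, not the project's actual code) is to extend Configured and implement Tool, launching via ToolRunner; GenericOptionsParser then picks up -D properties directly from the command line:

```java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TaskCTool extends Configured implements Tool {  // hypothetical wrapper
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains any -D task.c.hobby=... passed on the CLI
        Job job = Job.getInstance(getConf(), "TaskCSimple");
        // ... same mapper/output setup as TaskCSimple ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new TaskCTool(), args));
    }
}
```

Invocation would then look like `hadoop jar $JAR circlenet.taskC.TaskCTool -D task.c.hobby=PodcastBinging $PAGES $OUT/taskC/tool` (the TaskCTool class path is illustrative).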

Key Takeaways

  • Current approach: Map-Reduce with pass-through reducer
  • Best optimization: Map-only job (setNumReduceTasks(0))
  • No combiner benefit: Filter operations don’t reduce data volume per key
  • Parameter passing: Use Configuration to pass runtime parameters
  • When to use map-only: Filter, projection, and transformation operations without grouping
  • Performance: Map-only jobs can cut execution time by 30-50% by eliminating shuffle/sort, though measured gains depend on input size
