Problem Statement
Report all CircleNetPage users (NickName and JobTitle) whose FavoriteHobby matches a specified hobby (e.g., “PodcastBinging”). The SQL equivalent is a simple SELECT of NickName and JobTitle with a WHERE filter on FavoriteHobby.

Implementation
Simple Approach (Map-Reduce)
The implementation uses a basic Map-Reduce job in which the mapper filters records and the reducer passes them through unchanged. The mapper reads the target hobby from the job Configuration (TaskCSimple.java:17-36).
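The original mapper listing is not reproduced here; the following is a self-contained sketch of its filter logic with the Hadoop boilerplate stripped out. The class name and the comma-delimited field order (NickName, JobTitle, FavoriteHobby) are assumptions, not taken from the source.

```java
// Sketch of the filtering the mapper performs. In the real job this logic
// lives in map(), with the target hobby read from the Configuration in setup().
public class HobbyFilterSketch {

    /**
     * Returns "NickName\tJobTitle" when the record's FavoriteHobby matches
     * the target hobby, or null when the record should be dropped.
     * Field order NickName,JobTitle,FavoriteHobby is an assumption.
     */
    public static String filter(String line, String targetHobby) {
        String[] fields = line.split(",");
        if (fields.length < 3) {
            return null; // malformed record: skip it, as the mapper would
        }
        String nickName = fields[0].trim();
        String jobTitle = fields[1].trim();
        String hobby = fields[2].trim();
        return hobby.equals(targetHobby) ? nickName + "\t" + jobTitle : null;
    }
}
```

In the actual mapper, the equivalent of filter() runs once per input line, and each non-null result is emitted through context.write().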
Optimization Opportunity

Map-Only Job: This task can be optimized further by eliminating the reduce phase entirely and running it as a map-only job. Because the job only filters records, with no aggregation or grouping, the mapper can write its output directly. Skipping the shuffle and sort phases cuts I/O overhead roughly in half.
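The change amounts to one line in the job driver. The sketch below assumes the standard Hadoop `Job` API and is not a complete, compilable file; the class name `TaskCSimple.HobbyMapper` and the property key `circlenet.hobby` are illustrative, not taken from the source.

```java
Configuration conf = new Configuration();
conf.set("circlenet.hobby", args[2]);   // runtime parameter read by the mapper's setup()

Job job = Job.getInstance(conf, "TaskC map-only");
job.setJarByClass(TaskCSimple.class);
job.setMapperClass(TaskCSimple.HobbyMapper.class);
job.setNumReduceTasks(0);               // map-only: no shuffle, no sort, no reduce
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
```

With zero reducers, each map task writes its output directly to the output directory (one part-m-NNNNN file per task), which is exactly what eliminates the shuffle and sort stages.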
Why No Combiner?

A combiner pays off only when it can shrink the map output by aggregating values that share a key. This job performs pure filtering: every record that passes the filter must be emitted unchanged, so there is nothing to aggregate and no per-key data volume to reduce.
Performance Characteristics
| Metric | Map-Reduce Version | Map-Only Version |
|---|---|---|
| Map Output | ~15K records (for hobby with ~7.5% frequency) | Same |
| Shuffle Phase | Required | Eliminated |
| Sort Phase | Required | Eliminated |
| Reduce Phase | Pass-through | None |
| I/O Overhead | Baseline (100%) | ~50% of baseline |
| Execution Time | Baseline | 30-50% faster |
Running the Task
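A typical invocation passes the input path, output path, and target hobby as command-line arguments. The jar name, class name, paths, and argument order below are assumptions, not taken from the source:

```bash
hadoop jar taskc.jar TaskCSimple /data/circlenet/users /out/taskc PodcastBinging
```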
Sample Output
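The records below are purely hypothetical and only illustrate the default key-TAB-value layout that Hadoop's TextOutputFormat produces:

```
Alice	Engineer
Marco	DataAnalyst
Priya	Teacher
```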
Passing Parameters to MapReduce
This task demonstrates passing runtime parameters to MapReduce jobs: the driver stores the target hobby in the job Configuration before submission, and the mapper reads it back via context.getConfiguration() in setup().

Key Takeaways
- Current approach: Map-Reduce with pass-through reducer
- Best optimization: Map-only job (setNumReduceTasks(0))
- No combiner benefit: Filter operations don’t reduce data volume per key
- Parameter passing: Use Configuration to pass runtime parameters
- When to use map-only: Filter, projection, and transformation operations without grouping
- Performance: Map-only jobs save 30-50% execution time by eliminating shuffle/sort