Task D computes a “popularity factor” for each CircleNetPage owner by counting how many people follow them, and demonstrates join-optimization techniques across the CircleNetPage and Follows datasets.

Problem Statement

For each CircleNetPage, report the owner’s nickname and the number of people that follow them. Page owners who never appear in Follows must still be reported, with a follower count of zero. SQL Equivalent:
SELECT p.ID, p.NickName, COUNT(f.ID1) as follower_count
FROM CircleNetPage p
LEFT JOIN Follows f ON p.ID = f.ID2
GROUP BY p.ID, p.NickName
ORDER BY follower_count DESC;

Approaches

Simple Implementation (Reduce-Side Join)

The simple approach uses a single-job reduce-side join with MultipleInputs.

Page Mapper (TaskDSimple.java:19-32):
public static class PageMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text outKey = new Text();
    private final Text outVal = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] f = CsvUtils.split(value.toString());
        if (f.length >= 2) {
            outKey.set(f[0].trim());         // Page ID
            outVal.set("P," + f[1].trim());  // Tag with "P" for Page
            context.write(outKey, outVal);
        }
    }
}
Follows Mapper (TaskDSimple.java:34-46):
public static class FollowsMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text outKey = new Text();
    private static final Text OUT = new Text("F,1");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] f = CsvUtils.split(value.toString());
        if (f.length >= 3) {
            outKey.set(f[2].trim());     // ID2 - who is being followed
            context.write(outKey, OUT);  // Tag with "F" for Follow
        }
    }
}
Join Reducer (TaskDSimple.java:48-68):
public static class JoinReducer extends Reducer<Text, Text, Text, IntWritable> {
    private final IntWritable out = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String nick = null;
        int count = 0;

        // Process all tagged values for this page ID
        for (Text t : values) {
            String[] p = CsvUtils.split(t.toString());
            if (p.length >= 2 && "P".equals(p[0])) {
                nick = p[1];  // Page record: extract nickname
            } else if (p.length >= 2 && "F".equals(p[0])) {
                count += CsvUtils.toInt(p[1], 0);  // Follow record: add to count
            }
        }

        // Pages with no "F" records still emit with count 0 (LEFT JOIN semantics);
        // follows with no matching page are dropped.
        if (nick != null) {
            out.set(count);
            context.write(new Text(key.toString() + "," + nick), out);
        }
    }
}
Job Configuration (TaskDSimple.java:76-80):
// Use MultipleInputs to read from two sources
MultipleInputs.addInputPath(job, new Path(args[0]),
        TextInputFormat.class, PageMapper.class);
MultipleInputs.addInputPath(job, new Path(args[1]),
        TextInputFormat.class, FollowsMapper.class);
job.setReducerClass(JoinReducer.class);
Characteristics:
  • Shuffles both datasets (200K pages + 20M follows)
  • Reduce-side join with tagged values
  • High I/O overhead
  • Single job but inefficient
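The shuffle reduction claimed for the optimized version comes from pre-aggregating follower counts before the shuffle: a combiner applies the summing logic to each mapper's local output, collapsing millions of (ID2, 1) pairs into at most one (ID2, partialCount) pair per page per mapper. A minimal plain-Java simulation of that local aggregation (not the project's code; class and method names are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simulates a combiner pre-aggregating the (ID2, 1) pairs emitted by one mapper. */
class CombinerSketch {

    /** Collapses per-follow records into one partial count per followed page. */
    public static Map<String, Integer> combine(List<String> followedIds) {
        Map<String, Integer> partialCounts = new LinkedHashMap<>();
        for (String id2 : followedIds) {
            partialCounts.merge(id2, 1, Integer::sum); // same logic a SUM combiner applies
        }
        return partialCounts;
    }

    public static void main(String[] args) {
        // Five raw follow records shrink to two shuffled pairs.
        List<String> follows = List.of("42", "42", "7", "42", "7");
        System.out.println(combine(follows)); // prints {42=3, 7=2}
    }
}
```

Because addition is associative and commutative, running this logic both as a combiner and as the reducer produces the same final counts while shuffling far fewer records.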

Performance Comparison

| Metric | Simple | Optimized | Improvement |
| --- | --- | --- | --- |
| Number of Jobs | 1 | 2 | Better I/O efficiency |
| Job 1 Shuffle | 20M + 200K records | ~200K records | 99% reduction |
| Join Type | Reduce-side | Map-side | Eliminates shuffle |
| Job 2 Reducers | Required | 0 (map-only) | 50% I/O saved |
| Execution Time | Baseline | 50-70% faster | Significant |
Trade-off: The optimized version uses 2 jobs instead of 1, but the dramatic I/O reduction more than compensates. This demonstrates that more jobs do not necessarily mean slower execution when each job is well optimized.
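The map-side join works because the aggregated output of Job 1 (one count per page, roughly 200K rows) fits comfortably in memory. Each map task of Job 2 loads that small table into a HashMap during setup, then streams page records past it, defaulting to zero for pages that never appear in Follows. A plain-Java sketch of that join logic (a simulation under these assumptions, not the project's actual mapper):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simulates a map-side join: page rows stream past an in-memory count table. */
class MapSideJoinSketch {

    /** Joins "id,nick" page rows against pre-aggregated counts; missing pages get 0. */
    public static List<String> join(List<String> pageRows, Map<String, Integer> counts) {
        List<String> out = new ArrayList<>();
        for (String row : pageRows) {
            String[] f = row.split(",", 2);                       // id, nickname
            int followers = counts.getOrDefault(f[0].trim(), 0);  // LEFT JOIN semantics
            out.add(f[0].trim() + "," + f[1].trim() + "," + followers);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>(Map.of("1", 234, "3", 567));
        List<String> pages = List.of("1,UserNick1", "2,TechGuru", "3,DataFan");
        join(pages, counts).forEach(System.out::println);
        // 1,UserNick1,234
        // 2,TechGuru,0
        // 3,DataFan,567
    }
}
```

Because every mapper holds the full count table, no shuffle or reduce phase is needed; in real Hadoop the table would typically be shipped to tasks via the distributed cache.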

Running the Task

export JAR=/home/ds503/ds503_bdm-1.0-SNAPSHOT.jar
export PAGES=/circlenet/pages/CircleNetPage.csv
export FOLLOWS=/circlenet/follows/Follows.csv
export OUT=/circlenet/output

# Run simple version (reduce-side join)
hadoop jar $JAR circlenet.taskD.TaskDSimple \
  $PAGES $FOLLOWS \
  $OUT/taskD/simple

# View results
hdfs dfs -cat $OUT/taskD/simple/part-r-00000 | head -20

Sample Output

1,UserNick1,234
2,TechGuru,0
3,DataFan,567
4,CodeNinja,89
...
Format: PageID,NickName,FollowerCount

Key Takeaways

  • Simple approach: Single-job reduce-side join
  • Optimization techniques:
    • Combiner: Pre-aggregate follower counts (99% shuffle reduction)
    • Map-side join: Load aggregated data into memory
    • Map-only job: Eliminate reduce phase in Job 2
  • Design pattern: Sometimes 2 optimized jobs outperform 1 unoptimized job
  • When to use map-side joins: When aggregated result fits in memory
  • Performance: 50-70% faster with optimized approach
