Task D computes a “popularity factor” for each CircleNetPage owner by counting how many people follow them, and demonstrates join-optimization techniques across the CircleNetPage and Follows datasets.

Problem Statement

For each CircleNetPage, report the owner’s nickname and the number of people that follow them. Page owners who never appear in Follows must still be reported, with a follower count of zero. SQL Equivalent:
SELECT p.ID, p.NickName, COUNT(f.ID1) as follower_count
FROM CircleNetPage p
LEFT JOIN Follows f ON p.ID = f.ID2
GROUP BY p.ID, p.NickName
ORDER BY follower_count DESC;

Approaches

Simple Implementation (Reduce-Side Join)

The simple approach uses a single-job reduce-side join with MultipleInputs.

Page Mapper (TaskDSimple.java:19-32):
public static class PageMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text outKey = new Text();
    private final Text outVal = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] f = CsvUtils.split(value.toString());
        if (f.length >= 2) {
            outKey.set(f[0].trim());         // Page ID
            outVal.set("P," + f[1].trim());  // Tag with "P" for Page
            context.write(outKey, outVal);
        }
    }
}
Follows Mapper (TaskDSimple.java:34-46):
public static class FollowsMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text outKey = new Text();
    private static final Text OUT = new Text("F,1");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] f = CsvUtils.split(value.toString());
        if (f.length >= 3) {
            outKey.set(f[2].trim());     // ID2 - who is being followed
            context.write(outKey, OUT);  // Tag with "F" for Follow
        }
    }
}
Join Reducer (TaskDSimple.java:48-68):
public static class JoinReducer extends Reducer<Text, Text, Text, IntWritable> {
    private final IntWritable out = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String nick = null;
        int count = 0;

        // Process all tagged values for this page ID
        for (Text t : values) {
            String[] p = CsvUtils.split(t.toString());
            if (p.length >= 2 && "P".equals(p[0])) {
                nick = p[1];  // Page record: extract nickname
            } else if (p.length >= 2 && "F".equals(p[0])) {
                count += CsvUtils.toInt(p[1], 0);  // Follow record: add to count
            }
        }

        // Pages with no "F" records still emit with count 0 (LEFT JOIN semantics);
        // follows with no matching page are dropped.
        if (nick != null) {
            out.set(count);
            context.write(new Text(key.toString() + "," + nick), out);
        }
    }
}
Job Configuration (TaskDSimple.java:76-80):
// Use MultipleInputs to read from two sources
MultipleInputs.addInputPath(job, new Path(args[0]),
        TextInputFormat.class, PageMapper.class);
MultipleInputs.addInputPath(job, new Path(args[1]),
        TextInputFormat.class, FollowsMapper.class);
job.setReducerClass(JoinReducer.class);
Characteristics:
  • Shuffles both datasets (200K pages + 20M follows)
  • Reduce-side join with tagged values
  • High I/O overhead
  • Single job but inefficient
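The shuffle reduction claimed for the optimized version comes from pre-aggregating follower counts before the shuffle: a combiner applies the summing logic to each mapper's local output, collapsing millions of (ID2, 1) pairs into at most one (ID2, partialCount) pair per page per mapper. A minimal plain-Java simulation of that local aggregation (not the project's code; class and method names are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simulates a combiner pre-aggregating the (ID2, 1) pairs emitted by one mapper. */
class CombinerSketch {

    /** Collapses per-follow records into one partial count per followed page. */
    public static Map<String, Integer> combine(List<String> followedIds) {
        Map<String, Integer> partialCounts = new LinkedHashMap<>();
        for (String id2 : followedIds) {
            partialCounts.merge(id2, 1, Integer::sum); // same logic a SUM combiner applies
        }
        return partialCounts;
    }

    public static void main(String[] args) {
        // Five raw follow records shrink to two shuffled pairs.
        List<String> follows = List.of("42", "42", "7", "42", "7");
        System.out.println(combine(follows)); // prints {42=3, 7=2}
    }
}
```

Because addition is associative and commutative, running this logic both as a combiner and as the reducer produces the same final counts while shuffling far fewer records.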

Performance Comparison

| Metric | Simple | Optimized | Improvement |
| --- | --- | --- | --- |
| Number of Jobs | 1 | 2 | Better I/O efficiency |
| Job 1 Shuffle | 20M + 200K records | ~200K records | 99% reduction |
| Join Type | Reduce-side | Map-side | Eliminates shuffle |
| Job 2 Reducers | Required | 0 (map-only) | 50% I/O saved |
| Execution Time | Baseline | 50-70% faster | Significant |
Trade-off: The optimized version uses 2 jobs instead of 1, but the dramatic I/O reduction more than compensates. This demonstrates that more jobs do not necessarily mean slower execution when each job is well optimized.
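The map-side join works because the aggregated output of Job 1 (one count per page, roughly 200K rows) fits comfortably in memory. Each map task of Job 2 loads that small table into a HashMap during setup, then streams page records past it, defaulting to zero for pages that never appear in Follows. A plain-Java sketch of that join logic (a simulation under these assumptions, not the project's actual mapper):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simulates a map-side join: page rows stream past an in-memory count table. */
class MapSideJoinSketch {

    /** Joins "id,nick" page rows against pre-aggregated counts; missing pages get 0. */
    public static List<String> join(List<String> pageRows, Map<String, Integer> counts) {
        List<String> out = new ArrayList<>();
        for (String row : pageRows) {
            String[] f = row.split(",", 2);                       // id, nickname
            int followers = counts.getOrDefault(f[0].trim(), 0);  // LEFT JOIN semantics
            out.add(f[0].trim() + "," + f[1].trim() + "," + followers);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>(Map.of("1", 234, "3", 567));
        List<String> pages = List.of("1,UserNick1", "2,TechGuru", "3,DataFan");
        join(pages, counts).forEach(System.out::println);
        // 1,UserNick1,234
        // 2,TechGuru,0
        // 3,DataFan,567
    }
}
```

Because every mapper holds the full count table, no shuffle or reduce phase is needed; in real Hadoop the table would typically be shipped to tasks via the distributed cache.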

Running the Task

export JAR=/home/ds503/ds503_bdm-1.0-SNAPSHOT.jar
export PAGES=/circlenet/pages/CircleNetPage.csv
export FOLLOWS=/circlenet/follows/Follows.csv
export OUT=/circlenet/output

# Run simple version (reduce-side join)
hadoop jar $JAR circlenet.taskD.TaskDSimple \
  $PAGES $FOLLOWS \
  $OUT/taskD/simple

# View results
hdfs dfs -cat $OUT/taskD/simple/part-r-00000 | head -20

Sample Output

1,UserNick1,234
2,TechGuru,0
3,DataFan,567
4,CodeNinja,89
...
Format: PageID,NickName,FollowerCount

Key Takeaways

  • Simple approach: Single-job reduce-side join
  • Optimization techniques:
    • Combiner: Pre-aggregate follower counts (99% shuffle reduction)
    • Map-side join: Load aggregated data into memory
    • Map-only job: Eliminate reduce phase in Job 2
  • Design pattern: Sometimes 2 optimized jobs outperform 1 unoptimized job
  • When to use map-side joins: When aggregated result fits in memory
  • Performance: 50-70% faster with optimized approach
