
Overview

Task A counts how often each hobby appears in the CircleNet Pages dataset. Two implementations are documented here: TaskA (simple) and TaskAOptimized.

TaskA (Simple)

Package: circlenet.taskA
Class: TaskA
Source: src/main/java/circlenet/taskA/TaskA.java

Main Method

public static void main(String[] args) throws Exception

Command-Line Arguments

args[0] (string, required): Input path to the CircleNet Pages CSV file (e.g., /circlenet/pages/CircleNetPage.csv)
args[1] (string, required): Output path for results (e.g., /circlenet/output/taskA/simple)

Mapper: TaskAMapper

Splits each input line on commas and, when the line has exactly five fields, emits a (hobby, 1) pair using the fifth field as the hobby.
public static class TaskAMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
    private final static IntWritable one = new IntWritable(1);
    private Text hobby = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) 
            throws IOException, InterruptedException{
        String line = value.toString();
        String[] fields = line.split(",");
        if(fields.length == 5){
            hobby.set(fields[4]);
            context.write(hobby,one);
        }
    }
}
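
To see which lines this mapper keeps, the sketch below applies the same split-and-check logic to a few rows in plain Java, without Hadoop types. The class name SimpleMapSketch and the sample rows are made up for illustration; the real column layout of the Pages file is not shown on this page.

```java
// Local sketch of the TaskAMapper field check, without Hadoop types.
class SimpleMapSketch {
    // Returns the hobby that would be emitted for one input line,
    // or null when the line is skipped (field count is not exactly 5).
    static String hobbyOf(String line) {
        String[] fields = line.split(",");
        return fields.length == 5 ? fields[4] : null;
    }

    public static void main(String[] args) {
        // Hypothetical rows; only the position of the hobby (index 4) comes from the mapper.
        String[] lines = {
            "1,Alice,Engineer,10,Chess",            // 5 fields, emitted
            "2,Bob,\"Writer, Editor\",20,Chess",    // quoted comma gives 6 fields, skipped
            "3,Carol,Pilot,30,",                    // trailing empty field is dropped, 4 fields, skipped
            "4,Dave,Chef,40, Hiking"                // emitted with the leading space kept
        };
        for (String line : lines) {
            System.out.println(line + " -> " + hobbyOf(line));
        }
    }
}
```

Two of the four rows are dropped: String.split(",") splits inside quoted values and discards trailing empty strings, so neither row has exactly five fields. The last row is counted under the key " Hiking", which is distinct from "Hiking".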

Reducer: TaskAReducer

Sums the counts for each hobby.
public static class TaskAReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
    private IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) 
            throws IOException, InterruptedException{
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
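
As a rough illustration of what the shuffle and this reducer produce together, the following plain-Java sketch groups a list of emitted hobbies and sums one per occurrence. LocalCountSketch and the sample hobbies are hypothetical; the tab-separated printout assumes the job writes Hadoop's default text output, which this page does not state.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local stand-in for "group by key, then sum", without Hadoop types.
class LocalCountSketch {
    // Adds 1 per emitted hobby, mirroring the sum over (hobby, 1) pairs.
    static Map<String, Integer> count(List<String> emittedHobbies) {
        Map<String, Integer> totals = new TreeMap<>(); // sorted by key, as reducer input is
        for (String hobby : emittedHobbies) {
            totals.merge(hobby, 1, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical mapper output keys.
        List<String> emitted = Arrays.asList("Chess", "Hiking", "Chess", "Reading", "Chess");
        count(emitted).forEach((hobby, n) -> System.out.println(hobby + "\t" + n));
    }
}
```

For the sample input this prints one line per hobby: Chess with 3, Hiking with 1, and Reading with 1.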

Example Usage

hadoop jar $JAR circlenet.taskA.TaskA $PAGES $OUT/taskA/simple

TaskAOptimized

Package: circlenet.taskA
Class: TaskAOptimized
Source: src/main/java/circlenet/taskA/TaskAOptimized.java

Main Method

public static void main(String[] args) throws Exception

Command-Line Arguments

args[0] (string, required): Input path to the CircleNet Pages CSV file
args[1] (string, required): Output path for results (e.g., /circlenet/output/taskA/optimized)

Mapper: MapperA

Parses each line with CsvUtils.split instead of String.split(","), accepts lines with five or more fields, trims the hobby field, and skips records whose hobby is empty.
public static class MapperA extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text hobby = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) 
            throws IOException, InterruptedException {
        String[] fields = CsvUtils.split(value.toString());
        if (fields.length >= 5) {
            // Trim once, then skip records whose hobby is blank.
            String trimmed = fields[4].trim();
            if (!trimmed.isEmpty()) {
                hobby.set(trimmed);
                context.write(hobby, ONE);
            }
        }
    }
}
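
The source of CsvUtils is not shown on this page, so the sketch below substitutes a hypothetical quote-aware splitter, splitCsv, to illustrate the kind of input the trim and empty checks guard against. It assumes the real helper keeps commas inside quoted values and preserves trailing empty fields; neither assumption is confirmed here, and escaped quotes are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Local sketch of the MapperA checks. splitCsv is a hypothetical stand-in for CsvUtils.split.
class OptimizedMapSketch {
    // Splits on commas that are outside double quotes; quote characters are dropped.
    static String[] splitCsv(String line) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                out.add(cur.toString());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        out.add(cur.toString()); // last field, kept even when empty
        return out.toArray(new String[0]);
    }

    // Returns the trimmed hobby, or null when the record would be skipped.
    static String hobbyOf(String line) {
        String[] fields = splitCsv(line);
        if (fields.length < 5) {
            return null;
        }
        String hobby = fields[4].trim();
        return hobby.isEmpty() ? null : hobby;
    }

    public static void main(String[] args) {
        // Hypothetical rows, as in any illustration of this mapper.
        System.out.println(hobbyOf("2,Bob,\"Writer, Editor\",20,Chess")); // quoted comma stays in one field
        System.out.println(hobbyOf("3,Carol,Pilot,30,"));                 // empty hobby, skipped
        System.out.println(hobbyOf("4,Dave,Chef,40, Hiking "));           // surrounding spaces trimmed
    }
}
```

Under these assumptions, the row with a quoted comma is counted under Chess, the row with an empty hobby is skipped, and " Hiking " is counted under the same key as "Hiking".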

Reducer: SumReducer

Sums the counts for each hobby. The same class is also used as the combiner, so counts are partially summed on the map side before the shuffle.
public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable out = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) 
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        out.set(sum);
        context.write(key, out);
    }
}

Optimizations

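  • Uses SumReducer as a combiner, so fewer (hobby, count) records cross the network during the shuffle
  • Parses lines with CsvUtils.split instead of String.split(",") and accepts lines with five or more fields
  • Trims the hobby field and skips empty values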
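
The effect of the combiner can be sketched locally. The example below treats two lists as two input splits, computes per-split partial sums as a summing combiner would, and compares the number of records that would reach the shuffle. CombinerSketch and the sample data are hypothetical, and the sketch assumes the combiner runs exactly once per split, which Hadoop does not guarantee.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local sketch of map-side combining for a sum, without Hadoop types.
class CombinerSketch {
    // Per-split partial sums: what a summing combiner would hand to the shuffle.
    static Map<String, Integer> combine(List<String> hobbiesInSplit) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String hobby : hobbiesInSplit) {
            partial.merge(hobby, 1, Integer::sum);
        }
        return partial;
    }

    // Final sums: merges the partial counts from every split.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((hobby, n) -> totals.merge(hobby, n, Integer::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        // Two hypothetical input splits, already reduced to their hobby values.
        List<List<String>> splits = Arrays.asList(
            Arrays.asList("Chess", "Chess", "Hiking"),
            Arrays.asList("Chess", "Reading", "Reading", "Reading"));

        int withoutCombiner = 0;
        int withCombiner = 0;
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (List<String> split : splits) {
            withoutCombiner += split.size();   // one (hobby, 1) record per input row
            Map<String, Integer> partial = combine(split);
            withCombiner += partial.size();    // one record per distinct hobby in the split
            partials.add(partial);
        }
        System.out.println("records shuffled without combiner: " + withoutCombiner);
        System.out.println("records shuffled with combiner:    " + withCombiner);
        System.out.println("final counts: " + reduce(partials));
    }
}
```

Here 7 records shrink to 4 while the final counts stay the same. Reusing the reducer as the combiner is safe for this job because addition is associative and commutative, so summing partial sums gives the same totals as summing the individual ones.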

Example Usage

hadoop jar $JAR circlenet.taskA.TaskAOptimized $PAGES $OUT/taskA/optimized
