Overview
Task A analyzes the CircleNet Pages dataset to count the frequency of each hobby.
TaskA (Simple)
Package: circlenet.taskA
Class: TaskA
Source: src/main/java/circlenet/taskA/TaskA.java
Main Method
public static void main(String[] args) throws Exception
Command-Line Arguments
- args[0]: Input path to the CircleNet Pages CSV file (e.g., /circlenet/pages/CircleNetPage.csv)
- args[1]: Output path for results (e.g., /circlenet/output/taskA/simple)
Mapper: TaskAMapper
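Splits each line of the Pages dataset on commas and, for lines with exactly five fields, emits a (hobby, 1) pair with the hobby taken from the fifth field. Lines with any other field count are skipped.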
public static class TaskAMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Reused across calls to avoid allocating a new object per record.
    private final static IntWritable one = new IntWritable(1);
    private Text hobby = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] fields = line.split(",");
        // Only well-formed rows with exactly five fields are counted.
        if (fields.length == 5) {
            hobby.set(fields[4]);
            context.write(hobby, one);
        }
    }
}
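To see how the five-field check behaves on imperfect input, the plain-Java sketch below applies the same split-and-check logic outside Hadoop. The class name SimpleSplitDemo and the sample rows are made up for illustration; only the "five fields, hobby last" shape comes from the mapper above.

```java
import java.util.Optional;

// Plain-Java sketch of the TaskAMapper parsing step (no Hadoop dependency).
// SimpleSplitDemo and the sample rows are hypothetical.
class SimpleSplitDemo {

    // Mirrors the mapper: split on commas, accept only exactly five fields.
    static Optional<String> extractHobby(String line) {
        String[] fields = line.split(",");
        if (fields.length == 5) {
            return Optional.of(fields[4]);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Well-formed row: the fifth field is returned as the hobby.
        System.out.println(extractHobby("1,Alice,USA,5,chess"));
        // Empty hobby: String.split drops trailing empty strings, so only
        // four fields remain and the row is skipped.
        System.out.println(extractHobby("2,Bob,USA,7,"));
        // Quoted comma: the naive split yields six fields, so the row is skipped.
        System.out.println(extractHobby("3,\"Lee, Ann\",USA,9,hiking"));
    }
}
```

In this sketch, rows with an empty hobby or an embedded comma are dropped rather than miscounted, which is the behavior the strict length check gives the simple mapper.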
Reducer: TaskAReducer
Sums the counts for each hobby.
public static class TaskAReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
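The reducer receives all values emitted for one hobby as a single iterable. A minimal plain-Java sketch of that grouping and summation follows; ReduceDemo, its method names, and the sample hobbies are hypothetical, and the shuffle is modeled with an in-memory sorted map rather than Hadoop's actual machinery.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of shuffle + reduce for hobby counting (no Hadoop dependency).
// ReduceDemo and the sample data are hypothetical.
class ReduceDemo {

    // Shuffle: group the 1s emitted by the mapper under their hobby key.
    static Map<String, List<Integer>> shuffle(List<String> emittedHobbies) {
        Map<String, List<Integer>> grouped = new TreeMap<>(); // keys kept sorted
        for (String hobby : emittedHobbies) {
            grouped.computeIfAbsent(hobby, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Reduce: the same summation loop as TaskAReducer, on plain ints.
    static int reduce(Iterable<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> emitted = Arrays.asList("chess", "hiking", "chess");
        // Prints one "hobby<TAB>count" line per key.
        for (Map.Entry<String, List<Integer>> e : shuffle(emitted).entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getValue()));
        }
    }
}
```

Assuming the job keeps Hadoop's default text output format, the real job's output files would have the same shape: one hobby per line, followed by a tab and its count.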
Example Usage
hadoop jar $JAR circlenet.taskA.TaskA $PAGES $OUT/taskA/simple
TaskAOptimized
Package: circlenet.taskA
Class: TaskAOptimized
Source: src/main/java/circlenet/taskA/TaskAOptimized.java
Main Method
public static void main(String[] args) throws Exception
Command-Line Arguments
- args[0]: Input path to the CircleNet Pages CSV file
- args[1]: Output path for results (e.g., /circlenet/output/taskA/optimized)
Mapper: MapperA
Parses each line with CsvUtils.split instead of a plain comma split, trims the hobby field (fields[4]), and skips records whose hobby is empty.
public static class MapperA extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text hobby = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = CsvUtils.split(value.toString());
        if (fields.length >= 5) {
            // Trim once and skip records whose hobby is blank.
            String trimmed = fields[4].trim();
            if (!trimmed.isEmpty()) {
                hobby.set(trimmed);
                context.write(hobby, ONE);
            }
        }
    }
}
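The source of CsvUtils is not reproduced here, so the following is only a sketch of what a quote-aware split could look like, not the project's actual implementation. The class CsvSplitSketch, the quoting rules (double quotes around a field, doubled quotes as an escape), and the sample rows are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for CsvUtils.split; the real helper may behave differently.
class CsvSplitSketch {

    static String[] split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // doubled quote inside a quoted field
                    i++;
                } else {
                    inQuotes = !inQuotes; // opening or closing quote
                }
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString()); // field boundary
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field, kept even when empty
        return fields.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // The quoted comma stays inside one field, so five fields come back.
        System.out.println(split("3,\"Lee, Ann\",USA,9,hiking").length);
        // A trailing empty hobby is kept as an empty fifth field.
        System.out.println(split("2,Bob,USA,7,").length);
    }
}
```

A splitter of this kind returns an empty fifth field for rows with a missing hobby instead of dropping it, which is why a mapper using it would need an explicit empty-string check like the one in MapperA.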
Reducer: SumReducer
Sums the counts for each hobby. SumReducer is also used as the combiner, so counts are pre-aggregated on the map side before the shuffle.
public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable out = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        out.set(sum);
        context.write(key, out);
    }
}
Optimizations
- Uses combiner to reduce network shuffle
- Better CSV parsing with CsvUtils
- Empty string validation
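The effect of the combiner can be illustrated without a cluster. The plain-Java sketch below treats two lists as the output of two map tasks, aggregates each locally, and then merges the partial sums. CombinerDemo, its method names, and the sample data are hypothetical, and the pair counts only model shuffle volume, not real network traffic.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of map-side combining (no Hadoop dependency).
// CombinerDemo and the sample splits are hypothetical.
class CombinerDemo {

    // Local aggregation over one map task's output: what a summing combiner does.
    static Map<String, Integer> combine(List<String> hobbies) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String hobby : hobbies) {
            partial.merge(hobby, 1, Integer::sum);
        }
        return partial;
    }

    // Final reduce: add up the partial sums from every map task.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((hobby, n) -> total.merge(hobby, n, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> split1 = Arrays.asList("chess", "chess", "hiking", "chess");
        List<String> split2 = Arrays.asList("hiking", "chess", "hiking");

        Map<String, Integer> p1 = combine(split1);
        Map<String, Integer> p2 = combine(split2);

        // Without a combiner every (hobby, 1) pair is shuffled;
        // with one, each map task sends at most one pair per distinct hobby.
        System.out.println("pairs without combiner: " + (split1.size() + split2.size()));
        System.out.println("pairs with combiner: " + (p1.size() + p2.size()));
        System.out.println("totals: " + reduce(Arrays.asList(p1, p2)));
    }
}
```

Reusing the reducer as the combiner is safe here because addition is associative and commutative, so summing partial sums gives the same totals as summing the raw 1s. Hadoop treats the combiner as an optional optimization that may run zero or more times, so the job's correctness should not depend on it.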
Example Usage
hadoop jar $JAR circlenet.taskA.TaskAOptimized $PAGES $OUT/taskA/optimized