This guide walks you through running Task A, which analyzes the frequency of favorite hobbies across all CircleNet users. You’ll learn how to build the JAR, copy it to your container, and execute a MapReduce job.

Prerequisites

Before you begin, ensure you have:
  • Docker container running with Hadoop installed (see setup guide)
  • CircleNet datasets loaded into HDFS at /circlenet/pages/CircleNetPage.csv
  • Maven installed for building the project
  • SSH access to your container on port 3000
If you haven’t set up your environment yet, follow the complete setup guide first.

Step 1: Build the JAR file

From your project root directory, build the project using Maven:
mvn clean package -DskipTests
This command:
  • Cleans previous builds
  • Compiles all Java source files in the circlenet package
  • Packages everything into target/ds503_bdm-1.0-SNAPSHOT.jar
  • Skips tests for faster builds
The JAR file contains all compiled MapReduce jobs including mappers, reducers, and driver classes for all eight tasks.

Step 2: Copy JAR to container

Get your container ID and copy the JAR file:
docker ps
Copy the JAR using your container ID (replace dadc9d47d16e with yours):
docker cp target/ds503_bdm-1.0-SNAPSHOT.jar dadc9d47d16e:/home/ds503/

Step 3: Set up environment variables

SSH into your container:
ssh -p 3000 ds503@localhost
Default credentials are username ds503 and password ds503.
Set up your environment variables for easier job execution:
export JAR=/home/ds503/ds503_bdm-1.0-SNAPSHOT.jar
export PAGES=/circlenet/pages/CircleNetPage.csv
export OUT=/circlenet/output
export LOCAL_OUT=/home/ds503/results
export CIRCLENET_TIMING_FILE=/home/ds503/task_times.csv
Create the local output directory:
mkdir -p $LOCAL_OUT

Step 4: Run Task A

Task A analyzes hobby frequency by reading the CircleNetPage dataset and counting occurrences of each hobby.

Run the simple version

hadoop jar $JAR circlenet.taskA.TaskA $PAGES $OUT/taskA/simple
This command:
  • Executes the TaskA class from the JAR
  • Reads input from /circlenet/pages/CircleNetPage.csv
  • Writes output to /circlenet/output/taskA/simple

Run the optimized version

The optimized version uses a combiner to reduce data shuffling:
hadoop jar $JAR circlenet.taskA.TaskAOptimized $PAGES $OUT/taskA/optimized
The optimized version typically runs 20-30% faster by using a combiner to aggregate hobby counts locally before the shuffle phase.
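Wiring a combiner is a one-line change in the driver (`job.setCombinerClass(...)`). The effect is easiest to see in plain Java, independent of Hadoop; the class and sample data below are illustrative, not part of the project:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {
    // Pre-aggregate one mapper's (hobby, 1) output locally,
    // the way a combiner does before the shuffle phase.
    static Map<String, Integer> combine(List<String> mapperOutput) {
        Map<String, Integer> partialSums = new HashMap<>();
        for (String hobby : mapperOutput) {
            partialSums.merge(hobby, 1, Integer::sum);
        }
        return partialSums;
    }

    public static void main(String[] args) {
        List<String> mapperOutput =
            Arrays.asList("Hiking", "Cooking", "Hiking", "Hiking", "Cooking");
        Map<String, Integer> combined = combine(mapperOutput);
        // 5 (hobby, 1) records shrink to 2 partial sums crossing the network.
        System.out.println(mapperOutput.size() + " records -> " + combined.size() + " partial sums");
    }
}
```

Because hobby counts are plain sums, the reducer class itself can double as the combiner; that shortcut only works for operations that are associative and commutative.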

Step 5: View the results

Inspect the output directly in HDFS:
hdfs dfs -ls $OUT/taskA/simple
View the first 20 results:
hdfs dfs -cat $OUT/taskA/simple/part-r-00000 | head -20
Expected output format (tab-separated hobby and count; reducer output is sorted by key, so hobbies appear alphabetically):
Cooking	4234
Gardening	3876
Hiking	3945
Photography	4102
PodcastBinging	4523
...
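The part files are ordered by hobby name, not by count. To rank hobbies by popularity, pipe the output through sort; the printf line below stands in for `hdfs dfs -cat $OUT/taskA/simple/part-r-*` so the pipeline can be tried anywhere:

```shell
# Sample reducer output; in the container, replace the printf with:
#   hdfs dfs -cat $OUT/taskA/simple/part-r-*
printf 'Cooking\t4234\nGardening\t3876\nHiking\t3945\nPhotography\t4102\nPodcastBinging\t4523\n' \
  | sort -t "$(printf '\t')" -k2,2nr | head -3
# → PodcastBinging	4523
#   Cooking	4234
#   Photography	4102
```

`-k2,2nr` sorts numerically and in reverse on the second tab-separated column, putting the most popular hobbies first.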

Step 6: Copy results to local

Copy the results from HDFS to your container’s local filesystem. Create only the parent directory; hdfs dfs -get creates the simple and optimized subdirectories itself, and fails if they already exist:
mkdir -p $LOCAL_OUT/taskA
hdfs dfs -get $OUT/taskA/simple $LOCAL_OUT/taskA/
hdfs dfs -get $OUT/taskA/optimized $LOCAL_OUT/taskA/
From your host machine, copy results out of the container:
docker cp dadc9d47d16e:/home/ds503/results/taskA ./results_taskA

Step 7: Check performance metrics

View the timing information:
cat $CIRCLENET_TIMING_FILE
Filter for total execution time:
grep ",total," $CIRCLENET_TIMING_FILE
The timing file shows execution time in milliseconds for each phase (map, reduce, total).
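Assuming each line follows a `task,phase,milliseconds` layout (check your own timing file; the exact columns depend on how the timing hook writes them), awk can convert totals to seconds. The echo below stands in for the grep above:

```shell
# Sample timing line; in the container use: grep ",total," $CIRCLENET_TIMING_FILE
echo 'taskA,total,45210' | awk -F, '{ printf "%s: %.1f s\n", $1, $3 / 1000 }'
# → taskA: 45.2 s
```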

Understanding the code

Here’s how Task A works under the hood:

Mapper

The mapper reads each line from CircleNetPage.csv and emits the hobby as the key:
src/main/java/circlenet/taskA/TaskA.java
public static class TaskAMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text hobby = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Naive CSV parse: assumes no field contains an embedded comma
        String[] fields = value.toString().split(",");
        if (fields.length == 5) {      // skip malformed or blank lines
            hobby.set(fields[4]);      // FavoriteHobby is the 5th field
            context.write(hobby, one); // emit (hobby, 1)
        }
    }
}

Reducer

The reducer sums up all the counts for each hobby:
src/main/java/circlenet/taskA/TaskA.java
public static class TaskAReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();        // add up the 1s (or combiner partial sums)
        }
        result.set(sum);
        context.write(key, result);  // emit (hobby, total count)
    }
}
This is the classic word-count pattern, the most fundamental MapReduce algorithm: the mapper emits (hobby, 1) pairs, and the reducer sums the counts for each key.
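The whole pipeline can be simulated in plain Java to see the pattern end to end. The sketch below collapses map and reduce into a single pass; the CSV rows are invented for illustration, with field order matching the mapper's assumption that FavoriteHobby is the 5th column:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {
    // Extract the 5th CSV field and count occurrences, mirroring
    // TaskAMapper (emit (hobby, 1)) and TaskAReducer (sum per key).
    static Map<String, Integer> countHobbies(List<String> csvLines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : csvLines) {
            String[] fields = line.split(",");
            if (fields.length == 5) {            // same guard as the mapper
                counts.merge(fields[4], 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(     // invented rows for illustration
            "1,Alice,Worcester,25,Hiking",
            "2,Bob,Boston,31,Cooking",
            "3,Carol,Worcester,28,Hiking");
        System.out.println(countHobbies(sample)); // e.g. {Cooking=1, Hiking=2}
    }
}
```

On a cluster, the same logic is split across machines: many mappers each run the extraction step in parallel, and the shuffle phase groups all counts for a hobby onto one reducer before summing.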

Next steps

  • Explore Task B: find the 10 most popular CircleNetPages
  • All tasks: view all 8 analytics tasks
  • Optimization guide: learn MapReduce optimization techniques
  • Dataset details: understand the data structure

Troubleshooting

If you see an error about the output directory already existing, remove it first (Hadoop never overwrites an existing output directory):
hdfs dfs -rm -r $OUT/taskA
Then rerun your job.
If Hadoop reports that it cannot find the TaskA class or the JAR itself, this usually means the JAR wasn’t copied correctly. Verify the JAR exists:
ls -lh /home/ds503/ds503_bdm-1.0-SNAPSHOT.jar
If missing, repeat Step 2.
If the job fails because the input path does not exist, ensure your datasets are loaded into HDFS:
hdfs dfs -ls /circlenet/pages/
If empty, follow the setup guide to load your datasets.
